* [PATCH v5 00/25] Transparent Contiguous PTEs for User Mappings
@ 2024-02-02  8:07 ` Ryan Roberts
  0 siblings, 0 replies; 240+ messages in thread
From: Ryan Roberts @ 2024-02-02  8:07 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	Mark Rutland, David Hildenbrand, Kefeng Wang, John Hubbard,
	Zi Yan, Barry Song, Alistair Popple, Yang Shi, Nicholas Piggin,
	Christophe Leroy, Aneesh Kumar K.V, Naveen N. Rao,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin
  Cc: Ryan Roberts, linux-arm-kernel, x86, linuxppc-dev, linux-mm,
	linux-kernel

Hi All,

This is a series to opportunistically and transparently use contpte mappings
(set the contiguous bit in ptes) for user memory when those mappings meet the
requirements. The change benefits arm64, but there is some minor refactoring for
x86 and powerpc to enable its integration with core-mm.
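
As a rough illustration of what "setting the contiguous bit" means (a sketch,
not code from this series): on arm64, a naturally aligned block of CONT_PTES
entries (16 x 4K = 64K with 4K pages) can all be marked with the contiguous
bit, allowing the TLB to cache the whole block in a single entry. The helper
below is hypothetical and uses the pre-series names (patches 9-18 rework these
helpers); the real implementation also handles TLB maintenance and various
corner cases:

/*
 * Illustrative sketch only: install a block of CONT_PTES entries with the
 * contiguous bit set. CONT_PTES, set_pte() and pte_mkcont() are existing
 * arm64 definitions; pte_advance_pfn() is introduced by patches 3-8; the
 * function itself is hypothetical.
 */
static void sketch_set_contpte_block(pte_t *ptep, pte_t pte)
{
	int i;

	for (i = 0; i < CONT_PTES; i++, ptep++) {
		set_pte(ptep, pte_mkcont(pte));
		pte = pte_advance_pfn(pte, 1);
	}
}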

It is part of a wider effort to improve performance by allocating and mapping
variable-sized blocks of memory (folios). One aim is for the 4K kernel to
approach the performance of the 16K kernel, but without breaking compatibility
and without the associated increase in memory. Another aim is to benefit the 16K
and 64K kernels by enabling 2M THP, since this is the contpte size for those
kernels. We have good performance data that demonstrates both aims are being met
(see below).

Of course this is only one half of the change. We require the mapped physical
memory to be the correct size and alignment for this to actually be useful (i.e.
64K for 4K pages, or 2M for 16K/64K pages). Fortunately folios are solving this
problem for us. Filesystems that support it (XFS, AFS, EROFS, tmpfs, ...) will
allocate large folios up to the PMD size today, and more filesystems are coming.
And for anonymous memory, "multi-size THP" is now upstream.
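
As a concrete example of that size/alignment requirement, the check below is
an illustrative sketch (not the series' code): a range only qualifies for the
contiguous bit when it spans a whole contpte block and both the virtual
address and the physical address are block-aligned. CONT_PTES and
CONT_PTE_SIZE are existing arm64 constants; the helper name is hypothetical.

/* Sketch: is [addr, addr + nr * PAGE_SIZE) mappable as one contpte block? */
static bool sketch_contpte_candidate(unsigned long addr, unsigned long pfn,
				     unsigned int nr)
{
	/* 16 entries / 64K with 4K pages; 2M with 16K or 64K pages */
	return nr == CONT_PTES &&
	       IS_ALIGNED(addr, CONT_PTE_SIZE) &&
	       IS_ALIGNED(pfn, CONT_PTES);
}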


Patch Layout
============

In this version, I've split the patches to better show each optimization:

  - 1-2:    mm prep: misc code and docs cleanups
  - 3-8:    mm,arm,arm64,powerpc,x86 prep: Replace pte_next_pfn() with more
            general pte_advance_pfn() (see the sketch after this list)
  - 9-18:   arm64 prep: Refactor ptep helpers into new layer
  - 19:     functional contpte implementation
  - 20-25:  various optimizations on top of the contpte implementation
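
For reference, the pte_next_pfn() -> pte_advance_pfn() conversion in patches
3-8 amounts to generalising the single-step helper so callers can advance a
pte by an arbitrary number of pfns. A plausible sketch of the generic form
(the authoritative version is in the patches themselves, not reproduced here):

#ifndef pte_advance_pfn
static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
{
	return __pte(pte_val(pte) + (nr << PFN_PTE_SHIFT));
}
#endif

#define pte_next_pfn(pte)	pte_advance_pfn(pte, 1)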


Testing
=======

I've tested this series on both Ampere Altra (bare metal) and Apple M2 (VM):
  - mm selftests (including new tests written for multi-size THP); no regressions
  - Speedometer JavaScript benchmark in Chromium web browser; no issues
  - Kernel compilation; no issues
  - Various tests under high memory pressure with swap enabled; no issues


Performance
===========

High Level Use Cases
~~~~~~~~~~~~~~~~~~~~

First some high level use cases (kernel compilation and speedometer JavaScript
benchmarks). These are running on Ampere Altra (I've seen similar improvements
on Android/Pixel 6).

baseline:                  mm-unstable (mTHP switched off)
mTHP:                      + enable 16K, 32K, 64K mTHP sizes "always"
mTHP + contpte:            + this series
mTHP + contpte + exefolio: + patch at [5], which this series supports

Kernel Compilation with -j8 (negative is faster):

| kernel                    | real-time | kern-time | user-time |
|---------------------------|-----------|-----------|-----------|
| baseline                  |      0.0% |      0.0% |      0.0% |
| mTHP                      |     -5.0% |    -39.1% |     -0.7% |
| mTHP + contpte            |     -6.0% |    -41.4% |     -1.5% |
| mTHP + contpte + exefolio |     -7.8% |    -43.1% |     -3.4% |

Kernel Compilation with -j80 (negative is faster):

| kernel                    | real-time | kern-time | user-time |
|---------------------------|-----------|-----------|-----------|
| baseline                  |      0.0% |      0.0% |      0.0% |
| mTHP                      |     -5.0% |    -36.6% |     -0.6% |
| mTHP + contpte            |     -6.1% |    -38.2% |     -1.6% |
| mTHP + contpte + exefolio |     -7.4% |    -39.2% |     -3.2% |

Speedometer (positive is faster):

| kernel                    | runs_per_min |
|:--------------------------|--------------|
| baseline                  |         0.0% |
| mTHP                      |         1.5% |
| mTHP + contpte            |         3.2% |
| mTHP + contpte + exefolio |         4.5% |


Micro Benchmarks
~~~~~~~~~~~~~~~~

The following microbenchmarks are intended to demonstrate that the performance
of fork() and munmap() does not regress. I'm showing results for order-0 (4K)
mappings, and for order-9 (2M) PTE-mapped THP. Thanks to David for sharing his
benchmarks.

baseline:                  mm-unstable + batch fork [6] and zap [7] series
contpte-basic:             + patches 1-19; functional contpte implementation
contpte-batch:             + patches 20-23; implement new batched APIs
contpte-inline:            + patch 24; __always_inline to help compiler
contpte-fold:              + patch 25; fold contpte mapping when sensible

Primary platform is Ampere Altra bare metal. I'm also showing results for M2 VM
(on top of MacOS) for reference, although experience suggests this might not be
the most reliable for performance numbers of this sort:

| FORK           |         order-0        |         order-9        |
| Ampere Altra   |------------------------|------------------------|
| (pte-map)      |       mean |     stdev |       mean |     stdev |
|----------------|------------|-----------|------------|-----------|
| baseline       |       0.0% |      2.7% |       0.0% |      0.2% |
| contpte-basic  |       6.3% |      1.4% |    1948.7% |      0.2% |
| contpte-batch  |       7.6% |      2.0% |      -1.9% |      0.4% |
| contpte-inline |       3.6% |      1.5% |      -1.0% |      0.2% |
| contpte-fold   |       4.6% |      2.1% |      -1.8% |      0.2% |

| MUNMAP         |         order-0        |         order-9        |
| Ampere Altra   |------------------------|------------------------|
| (pte-map)      |       mean |     stdev |       mean |     stdev |
|----------------|------------|-----------|------------|-----------|
| baseline       |       0.0% |      0.5% |       0.0% |      0.3% |
| contpte-basic  |       1.8% |      0.3% |    1104.8% |      0.1% |
| contpte-batch  |      -0.3% |      0.4% |       2.7% |      0.1% |
| contpte-inline |      -0.1% |      0.6% |       0.9% |      0.1% |
| contpte-fold   |       0.1% |      0.6% |       0.8% |      0.1% |

| FORK           |         order-0        |         order-9        |
| Apple M2 VM    |------------------------|------------------------|
| (pte-map)      |       mean |     stdev |       mean |     stdev |
|----------------|------------|-----------|------------|-----------|
| baseline       |       0.0% |      1.4% |       0.0% |      0.8% |
| contpte-basic  |       6.8% |      1.2% |     469.4% |      1.4% |
| contpte-batch  |      -7.7% |      2.0% |      -8.9% |      0.7% |
| contpte-inline |      -6.0% |      2.1% |      -6.0% |      2.0% |
| contpte-fold   |       5.9% |      1.4% |      -6.4% |      1.4% |

| MUNMAP         |         order-0        |         order-9        |
| Apple M2 VM    |------------------------|------------------------|
| (pte-map)      |       mean |     stdev |       mean |     stdev |
|----------------|------------|-----------|------------|-----------|
| baseline       |       0.0% |      0.6% |       0.0% |      0.4% |
| contpte-basic  |       1.6% |      0.6% |     233.6% |      0.7% |
| contpte-batch  |       1.9% |      0.3% |      -3.9% |      0.4% |
| contpte-inline |       2.2% |      0.8% |      -1.6% |      0.9% |
| contpte-fold   |       1.5% |      0.7% |      -1.7% |      0.7% |

Misc
~~~~

John Hubbard at Nvidia has indicated dramatic 10x performance improvements for
some workloads at [8], when using a 64K base page kernel.

---
I'd really like to get this into v6.9; I've spoken with Catalin and he is happy
for this to go via the mm-unstable branch, once suitably acked by arm64 folks.
That makes most sense because the series depends on some changes from David at
[6] and [7], which in turn apply on top of mm-unstable as of a few days ago
(d162e170f118).


Changes since v4 [4]
====================

  - Rebased onto David's generic fork [6] and zap [7] batching work
      - I had an implementation similar to this prior to v4, but ditched it
        because I couldn't make it reliably provide a speedup; David succeeded.
      - roughly speaking, a few functions get renamed compared to v4:
          - pte_batch_remaining() -> pte_batch_hint()
          - set_wrprotects() -> wrprotect_ptes()
          - clear_ptes() -> [get_and_]clear_full_ptes()
      - Had to convert pte_next_pfn() to pte_advance_pfn()
      - Integration into core-mm is simpler because most has been done by
        David's work
  - Reworked patches to better show the progression from basic implementation to
    the various optimizations.
  - Removed the 'full' flag that I added to set_ptes() and set_wrprotects() in
    v4: I've been able to make up most of the performance in other ways, so this
    keeps the interface simpler.
  - Simplified contpte_set_ptes(nr > 1): Observed that set_ptes(nr > 1) is only
    called for ptes that are initially not present. So updated the spec to
    require that, and no longer need to check if any ptes are initially present
    when applying a contpte mapping.


Changes since v3 [3]
====================

  - Added v3#1 to batch set_ptes() when splitting a huge pmd to ptes; avoids
    need to fold contpte blocks for perf improvement
  - Separated the clear_ptes() fast path into its own inline function (Alistair)
  - Reworked core-mm changes to copy_present_ptes() and zap_pte_range() to
    remove overhead when memory is all order-0 folios (for arm64 and !arm64)
  - Significant optimization of arm64 backend fork operations (set_ptes_full()
    and set_wrprotects()) to ensure no regression when memory is order-0 folios.
  - Fixed local variable declarations to be reverse xmas tree
  - Added documentation for the new backend APIs (pte_batch_remaining(),
    set_ptes_full(), clear_ptes(), ptep_set_wrprotects())
  - Renamed tlb_get_guaranteed_space() -> tlb_reserve_space() and pass requested
    number of slots. Avoids allocating memory when not needed; perf improvement.


Changes since v2 [2]
====================

  - Removed contpte_ptep_get_and_clear_full() optimisation for exit() (v2#14),
    and replaced with a batch-clearing approach using a new arch helper,
    clear_ptes() (v3#2 and v3#15) (Alistair and Barry)
  - (v2#1 / v3#1)
      - Fixed folio refcounting so that refcount >= mapcount always (DavidH)
      - Reworked batch demarcation to avoid pte_pgprot() (DavidH)
      - Reverted return semantic of copy_present_page() and instead fix it up in
        copy_present_ptes() (Alistair)
      - Removed page_cont_mapped_vaddr() and replaced with simpler logic
        (Alistair)
      - Made batch accounting clearer in copy_pte_range() (Alistair)
  - (v2#12 / v3#13)
      - Renamed contpte_fold() -> contpte_convert() and hoisted setting/
        clearing CONT_PTE bit to higher level (Alistair)


Changes since v1 [1]
====================

  - Export contpte_* symbols so that modules can continue to call inline
    functions (e.g. ptep_get) which may now call the contpte_* functions (thanks
    to JohnH)
  - Use pte_valid() instead of pte_present() where sensible (thanks to Catalin)
  - Factor out (pte_valid() && pte_cont()) into new pte_valid_cont() helper
    (thanks to Catalin)
  - Fixed bug in contpte_ptep_set_access_flags() where TLBIs were missed (thanks
    to Catalin)
  - Added ARM64_CONTPTE expert Kconfig (enabled by default) (thanks to Anshuman)
  - Simplified contpte_ptep_get_and_clear_full()
  - Improved various code comments


[1] https://lore.kernel.org/linux-arm-kernel/20230622144210.2623299-1-ryan.roberts@arm.com/
[2] https://lore.kernel.org/linux-arm-kernel/20231115163018.1303287-1-ryan.roberts@arm.com/
[3] https://lore.kernel.org/linux-arm-kernel/20231204105440.61448-1-ryan.roberts@arm.com/
[4] https://lore.kernel.org/lkml/20231218105100.172635-1-ryan.roberts@arm.com/
[5] https://lore.kernel.org/lkml/08c16f7d-f3b3-4f22-9acc-da943f647dc3@arm.com/
[6] https://lore.kernel.org/lkml/20240129124649.189745-1-david@redhat.com/
[7] https://lore.kernel.org/lkml/20240129143221.263763-1-david@redhat.com/
[8] https://lore.kernel.org/linux-mm/c507308d-bdd4-5f9e-d4ff-e96e4520be85@nvidia.com/


Thanks,
Ryan

Ryan Roberts (25):
  mm: Clarify the spec for set_ptes()
  mm: thp: Batch-collapse PMD with set_ptes()
  mm: Make pte_next_pfn() a wrapper around pte_advance_pfn()
  arm/mm: Convert pte_next_pfn() to pte_advance_pfn()
  arm64/mm: Convert pte_next_pfn() to pte_advance_pfn()
  powerpc/mm: Convert pte_next_pfn() to pte_advance_pfn()
  x86/mm: Convert pte_next_pfn() to pte_advance_pfn()
  mm: Remove pte_next_pfn() and replace with pte_advance_pfn()
  arm64/mm: set_pte(): New layer to manage contig bit
  arm64/mm: set_ptes()/set_pte_at(): New layer to manage contig bit
  arm64/mm: pte_clear(): New layer to manage contig bit
  arm64/mm: ptep_get_and_clear(): New layer to manage contig bit
  arm64/mm: ptep_test_and_clear_young(): New layer to manage contig bit
  arm64/mm: ptep_clear_flush_young(): New layer to manage contig bit
  arm64/mm: ptep_set_wrprotect(): New layer to manage contig bit
  arm64/mm: ptep_set_access_flags(): New layer to manage contig bit
  arm64/mm: ptep_get(): New layer to manage contig bit
  arm64/mm: Split __flush_tlb_range() to elide trailing DSB
  arm64/mm: Wire up PTE_CONT for user mappings
  arm64/mm: Implement new wrprotect_ptes() batch API
  arm64/mm: Implement new [get_and_]clear_full_ptes() batch APIs
  mm: Add pte_batch_hint() to reduce scanning in folio_pte_batch()
  arm64/mm: Implement pte_batch_hint()
  arm64/mm: __always_inline to improve fork() perf
  arm64/mm: Automatically fold contpte mappings

 arch/arm/mm/mmu.c                 |   2 +-
 arch/arm64/Kconfig                |   9 +
 arch/arm64/include/asm/pgtable.h  | 404 ++++++++++++++++++++++++++----
 arch/arm64/include/asm/tlbflush.h |  13 +-
 arch/arm64/kernel/efi.c           |   4 +-
 arch/arm64/kernel/mte.c           |   2 +-
 arch/arm64/kvm/guest.c            |   2 +-
 arch/arm64/mm/Makefile            |   1 +
 arch/arm64/mm/contpte.c           | 399 +++++++++++++++++++++++++++++
 arch/arm64/mm/fault.c             |  12 +-
 arch/arm64/mm/fixmap.c            |   4 +-
 arch/arm64/mm/hugetlbpage.c       |  40 +--
 arch/arm64/mm/kasan_init.c        |   6 +-
 arch/arm64/mm/mmu.c               |  16 +-
 arch/arm64/mm/pageattr.c          |   6 +-
 arch/arm64/mm/trans_pgd.c         |   6 +-
 arch/powerpc/mm/pgtable.c         |   2 +-
 arch/x86/include/asm/pgtable.h    |   8 +-
 include/linux/pgtable.h           |  29 ++-
 mm/huge_memory.c                  |  58 +++--
 mm/memory.c                       |  20 +-
 21 files changed, 906 insertions(+), 137 deletions(-)
 create mode 100644 arch/arm64/mm/contpte.c

--
2.25.1


^ permalink raw reply	[flat|nested] 240+ messages in thread


* [PATCH v5 01/25] mm: Clarify the spec for set_ptes()
@ 2024-02-02  8:07   ` Ryan Roberts
  0 siblings, 0 replies; 240+ messages in thread
From: Ryan Roberts @ 2024-02-02  8:07 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	Mark Rutland, David Hildenbrand, Kefeng Wang, John Hubbard,
	Zi Yan, Barry Song, Alistair Popple, Yang Shi, Nicholas Piggin,
	Christophe Leroy, Aneesh Kumar K.V, Naveen N. Rao,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin
  Cc: Ryan Roberts, linux-arm-kernel, x86, linuxppc-dev, linux-mm,
	linux-kernel

The set_ptes() spec implies that it can only be used to set a present pte
because it interprets the PFN field to increment it. However,
set_pte_at() has been implemented on top of set_ptes() since set_ptes()
was introduced, and set_pte_at() allows setting a pte to a not-present
state. So clarify the spec to state that when nr==1, the new state of
the pte may be present or not present. When nr>1, the new state of all
ptes must be present.

While we are at it, tighten the spec to set requirements around the
initial state of ptes; when nr==1 it may be either present or
not-present. But when nr>1 all ptes must initially be not-present. All
set_ptes() callsites already conform to this requirement. Stating it
explicitly is useful because it allows for a simplification to the
upcoming arm64 contpte implementation.
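
To make the contract concrete, here are two hypothetical call sites (not
taken from this patch) that are legal under the clarified spec:

static void example_batch(struct mm_struct *mm, unsigned long addr,
			  pte_t *ptep, struct page *page, pgprot_t prot,
			  unsigned int nr)
{
	/* nr > 1: every pte in the batch must currently be pte_none(), and
	 * all of the new entries must be present. */
	set_ptes(mm, addr, ptep, mk_pte(page, prot), nr);
}

static void example_single(struct mm_struct *mm, unsigned long addr,
			   pte_t *ptep, swp_entry_t entry)
{
	/* nr == 1 (what set_pte_at() expands to): the old and new entries
	 * may each be present or not present, e.g. a swap entry. */
	set_ptes(mm, addr, ptep, swp_entry_to_pte(entry), 1);
}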

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 include/linux/pgtable.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index f0feae7f89fb..5e7eaf8f2b97 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -229,6 +229,10 @@ static inline pte_t pte_next_pfn(pte_t pte)
  * @pte: Page table entry for the first page.
  * @nr: Number of pages to map.
  *
+ * When nr==1, initial state of pte may be present or not present, and new state
+ * may be present or not present. When nr>1, initial state of all ptes must be
+ * not present, and new state must be present.
+ *
  * May be overridden by the architecture, or the architecture can define
  * set_pte() and PFN_PTE_SHIFT.
  *
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 240+ messages in thread


* [PATCH v5 02/25] mm: thp: Batch-collapse PMD with set_ptes()
@ 2024-02-02  8:07   ` Ryan Roberts
  0 siblings, 0 replies; 240+ messages in thread
From: Ryan Roberts @ 2024-02-02  8:07 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	Mark Rutland, David Hildenbrand, Kefeng Wang, John Hubbard,
	Zi Yan, Barry Song, Alistair Popple, Yang Shi, Nicholas Piggin,
	Christophe Leroy, Aneesh Kumar K.V, Naveen N. Rao,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin
  Cc: Ryan Roberts, linux-arm-kernel, x86, linuxppc-dev, linux-mm,
	linux-kernel

Refactor __split_huge_pmd_locked() so that a present PMD can be
collapsed to PTEs in a single batch using set_ptes().

This should improve performance a little bit, but the real motivation is
to remove the need for the arm64 backend to fold the contpte entries.
Instead, since the ptes are set as a batch, the contpte blocks can be
initially set up pre-folded (once the arm64 contpte support is added in
the next few patches). This leads to a noticeable performance improvement
during split.

Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 mm/huge_memory.c | 58 +++++++++++++++++++++++++++---------------------
 1 file changed, 33 insertions(+), 25 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 016e20bd813e..14888b15121e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2579,15 +2579,16 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 
 	pte = pte_offset_map(&_pmd, haddr);
 	VM_BUG_ON(!pte);
-	for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
-		pte_t entry;
-		/*
-		 * Note that NUMA hinting access restrictions are not
-		 * transferred to avoid any possibility of altering
-		 * permissions across VMAs.
-		 */
-		if (freeze || pmd_migration) {
+
+	/*
+	 * Note that NUMA hinting access restrictions are not transferred to
+	 * avoid any possibility of altering permissions across VMAs.
+	 */
+	if (freeze || pmd_migration) {
+		for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
+			pte_t entry;
 			swp_entry_t swp_entry;
+
 			if (write)
 				swp_entry = make_writable_migration_entry(
 							page_to_pfn(page + i));
@@ -2606,25 +2607,32 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 				entry = pte_swp_mksoft_dirty(entry);
 			if (uffd_wp)
 				entry = pte_swp_mkuffd_wp(entry);
-		} else {
-			entry = mk_pte(page + i, READ_ONCE(vma->vm_page_prot));
-			if (write)
-				entry = pte_mkwrite(entry, vma);
-			if (!young)
-				entry = pte_mkold(entry);
-			/* NOTE: this may set soft-dirty too on some archs */
-			if (dirty)
-				entry = pte_mkdirty(entry);
-			if (soft_dirty)
-				entry = pte_mksoft_dirty(entry);
-			if (uffd_wp)
-				entry = pte_mkuffd_wp(entry);
+
+			VM_WARN_ON(!pte_none(ptep_get(pte + i)));
+			set_pte_at(mm, addr, pte + i, entry);
 		}
-		VM_BUG_ON(!pte_none(ptep_get(pte)));
-		set_pte_at(mm, addr, pte, entry);
-		pte++;
+	} else {
+		pte_t entry;
+
+		entry = mk_pte(page, READ_ONCE(vma->vm_page_prot));
+		if (write)
+			entry = pte_mkwrite(entry, vma);
+		if (!young)
+			entry = pte_mkold(entry);
+		/* NOTE: this may set soft-dirty too on some archs */
+		if (dirty)
+			entry = pte_mkdirty(entry);
+		if (soft_dirty)
+			entry = pte_mksoft_dirty(entry);
+		if (uffd_wp)
+			entry = pte_mkuffd_wp(entry);
+
+		for (i = 0; i < HPAGE_PMD_NR; i++)
+			VM_WARN_ON(!pte_none(ptep_get(pte + i)));
+
+		set_ptes(mm, haddr, pte, entry, HPAGE_PMD_NR);
 	}
-	pte_unmap(pte - 1);
+	pte_unmap(pte);
 
 	if (!pmd_migration)
 		folio_remove_rmap_pmd(folio, page, vma);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH v5 03/25] mm: Make pte_next_pfn() a wrapper around pte_advance_pfn()
  2024-02-02  8:07 ` Ryan Roberts
  (?)
@ 2024-02-02  8:07   ` Ryan Roberts
  -1 siblings, 0 replies; 240+ messages in thread
From: Ryan Roberts @ 2024-02-02  8:07 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	Mark Rutland, David Hildenbrand, Kefeng Wang, John Hubbard,
	Zi Yan, Barry Song, Alistair Popple, Yang Shi, Nicholas Piggin,
	Christophe Leroy, Aneesh Kumar K.V, Naveen N. Rao,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin
  Cc: Ryan Roberts, linux-arm-kernel, x86, linuxppc-dev, linux-mm,
	linux-kernel

The goal is to be able to advance a PTE by an arbitrary number of PFNs.
So introduce a new API that takes a nr param.

We are going to remove pte_next_pfn() and replace it with
pte_advance_pfn(). As a first step, implement pte_next_pfn() as a
wrapper around pte_advance_pfn() so that we can incrementally switch the
architectures over. Once all arches are moved over, we will change all
the core-mm callers to call pte_advance_pfn() directly and remove the
wrapper.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 include/linux/pgtable.h | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 5e7eaf8f2b97..815d92dcb96b 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -214,9 +214,15 @@ static inline int pmd_dirty(pmd_t pmd)
 
 
 #ifndef pte_next_pfn
+#ifndef pte_advance_pfn
+static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
+{
+	return __pte(pte_val(pte) + (nr << PFN_PTE_SHIFT));
+}
+#endif
 static inline pte_t pte_next_pfn(pte_t pte)
 {
-	return __pte(pte_val(pte) + (1UL << PFN_PTE_SHIFT));
+	return pte_advance_pfn(pte, 1);
 }
 #endif
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 240+ messages in thread
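
As a usage sketch: pte_next_pfn(pte) is now literally
pte_advance_pfn(pte, 1), and the nr parameter is aimed at callers that
want to jump several entries in one step. The helper name below is
hypothetical, purely for illustration:

/*
 * Hypothetical helper (illustration only): compute the pte expected
 * 'nr' entries further into a physically contiguous block with a
 * single call, instead of stepping pte_next_pfn() nr times.
 */
static inline pte_t pte_at_offset(pte_t first_pte, unsigned long nr)
{
        return pte_advance_pfn(first_pte, nr);
}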

* [PATCH v5 04/25] arm/mm: Convert pte_next_pfn() to pte_advance_pfn()
  2024-02-02  8:07 ` Ryan Roberts
  (?)
@ 2024-02-02  8:07   ` Ryan Roberts
  -1 siblings, 0 replies; 240+ messages in thread
From: Ryan Roberts @ 2024-02-02  8:07 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	Mark Rutland, David Hildenbrand, Kefeng Wang, John Hubbard,
	Zi Yan, Barry Song, Alistair Popple, Yang Shi, Nicholas Piggin,
	Christophe Leroy, Aneesh Kumar K.V, Naveen N. Rao,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin
  Cc: Ryan Roberts, linux-arm-kernel, x86, linuxppc-dev, linux-mm,
	linux-kernel

Core-mm needs to be able to advance the pfn by an arbitrary amount, so
improve the API to do so and change the name.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/arm/mm/mmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm/mm/mmu.c b/arch/arm/mm/mmu.c
index c24e29c0b9a4..137711c68f2f 100644
--- a/arch/arm/mm/mmu.c
+++ b/arch/arm/mm/mmu.c
@@ -1814,6 +1814,6 @@ void set_ptes(struct mm_struct *mm, unsigned long addr,
 		if (--nr == 0)
 			break;
 		ptep++;
-		pteval = pte_next_pfn(pteval);
+		pteval = pte_advance_pfn(pteval, 1);
 	}
 }
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH v5 05/25] arm64/mm: Convert pte_next_pfn() to pte_advance_pfn()
  2024-02-02  8:07 ` Ryan Roberts
  (?)
@ 2024-02-02  8:07   ` Ryan Roberts
  -1 siblings, 0 replies; 240+ messages in thread
From: Ryan Roberts @ 2024-02-02  8:07 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	Mark Rutland, David Hildenbrand, Kefeng Wang, John Hubbard,
	Zi Yan, Barry Song, Alistair Popple, Yang Shi, Nicholas Piggin,
	Christophe Leroy, Aneesh Kumar K.V, Naveen N. Rao,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin
  Cc: Ryan Roberts, linux-arm-kernel, x86, linuxppc-dev, linux-mm,
	linux-kernel

Core-mm needs to be able to advance the pfn by an arbitrary amount, so
improve the API to do so and change the name.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/arm64/include/asm/pgtable.h | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 9428801c1040..6a6cc78cf879 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -351,10 +351,10 @@ static inline pgprot_t pte_pgprot(pte_t pte)
 	return __pgprot(pte_val(pfn_pte(pfn, __pgprot(0))) ^ pte_val(pte));
 }
 
-#define pte_next_pfn pte_next_pfn
-static inline pte_t pte_next_pfn(pte_t pte)
+#define pte_advance_pfn pte_advance_pfn
+static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
 {
-	return pfn_pte(pte_pfn(pte) + 1, pte_pgprot(pte));
+	return pfn_pte(pte_pfn(pte) + nr, pte_pgprot(pte));
 }
 
 static inline void set_ptes(struct mm_struct *mm,
@@ -370,7 +370,7 @@ static inline void set_ptes(struct mm_struct *mm,
 		if (--nr == 0)
 			break;
 		ptep++;
-		pte = pte_next_pfn(pte);
+		pte = pte_advance_pfn(pte, 1);
 	}
 }
 #define set_ptes set_ptes
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 240+ messages in thread
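
The hunk above also shows the override pattern used throughout the
series: the generic helper in include/linux/pgtable.h is guarded by
#ifndef pte_advance_pfn, so an architecture that defines the macro to
its own name (as arm64 does here) supplies its own implementation and
the generic fallback compiles out. Condensing the two sides from the
hunks in patches 3 and 5:

/* include/linux/pgtable.h: generic fallback, only used if the arch
 * did not claim the name */
#ifndef pte_advance_pfn
static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
{
        return __pte(pte_val(pte) + (nr << PFN_PTE_SHIFT));
}
#endif

/* arch/arm64/include/asm/pgtable.h: arch version, claims the name */
#define pte_advance_pfn pte_advance_pfn
static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
{
        return pfn_pte(pte_pfn(pte) + nr, pte_pgprot(pte));
}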

* [PATCH v5 06/25] powerpc/mm: Convert pte_next_pfn() to pte_advance_pfn()
  2024-02-02  8:07 ` Ryan Roberts
  (?)
@ 2024-02-02  8:07   ` Ryan Roberts
  -1 siblings, 0 replies; 240+ messages in thread
From: Ryan Roberts @ 2024-02-02  8:07 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	Mark Rutland, David Hildenbrand, Kefeng Wang, John Hubbard,
	Zi Yan, Barry Song, Alistair Popple, Yang Shi, Nicholas Piggin,
	Christophe Leroy, Aneesh Kumar K.V, Naveen N. Rao,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin
  Cc: Ryan Roberts, linux-arm-kernel, x86, linuxppc-dev, linux-mm,
	linux-kernel

Core-mm needs to be able to advance the pfn by an arbitrary amount, so
improve the API to do so and change the name.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/powerpc/mm/pgtable.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index 549a440ed7f6..6853cdb1290d 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -220,7 +220,7 @@ void set_ptes(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
 			break;
 		ptep++;
 		addr += PAGE_SIZE;
-		pte = pte_next_pfn(pte);
+		pte = pte_advance_pfn(pte, 1);
 	}
 }
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH v5 07/25] x86/mm: Convert pte_next_pfn() to pte_advance_pfn()
  2024-02-02  8:07 ` Ryan Roberts
  (?)
@ 2024-02-02  8:07   ` Ryan Roberts
  -1 siblings, 0 replies; 240+ messages in thread
From: Ryan Roberts @ 2024-02-02  8:07 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	Mark Rutland, David Hildenbrand, Kefeng Wang, John Hubbard,
	Zi Yan, Barry Song, Alistair Popple, Yang Shi, Nicholas Piggin,
	Christophe Leroy, Aneesh Kumar K.V, Naveen N. Rao,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin
  Cc: Ryan Roberts, linux-arm-kernel, x86, linuxppc-dev, linux-mm,
	linux-kernel

Core-mm needs to be able to advance the pfn by an arbitrary amount, so
improve the API to do so and change the name.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/x86/include/asm/pgtable.h | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 9d077bca6a10..b60b0c897b4c 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -956,13 +956,13 @@ static inline int pte_same(pte_t a, pte_t b)
 	return a.pte == b.pte;
 }
 
-static inline pte_t pte_next_pfn(pte_t pte)
+static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
 {
 	if (__pte_needs_invert(pte_val(pte)))
-		return __pte(pte_val(pte) - (1UL << PFN_PTE_SHIFT));
-	return __pte(pte_val(pte) + (1UL << PFN_PTE_SHIFT));
+		return __pte(pte_val(pte) - (nr << PFN_PTE_SHIFT));
+	return __pte(pte_val(pte) + (nr << PFN_PTE_SHIFT));
 }
-#define pte_next_pfn	pte_next_pfn
+#define pte_advance_pfn	pte_advance_pfn
 
 static inline int pte_present(pte_t a)
 {
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 240+ messages in thread
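
One subtlety worth spelling out in the x86 version: for PTEs subject to
L1TF inversion (__pte_needs_invert()), the pfn bits are stored inverted,
so advancing the pfn means subtracting from the raw value. A small
standalone model of that arithmetic (userspace sketch; uint64_t stands
in for the raw pte value and the shift is an assumed illustrative
constant, not the kernel's PFN_PTE_SHIFT):

#include <stdint.h>
#include <stdio.h>

#define PFN_SHIFT 12ULL /* illustrative stand-in */

/* model: advance a (possibly inverted) raw pte value by nr pfns */
static uint64_t advance_pfn(uint64_t pteval, int inverted, uint64_t nr)
{
        return inverted ? pteval - (nr << PFN_SHIFT)
                        : pteval + (nr << PFN_SHIFT);
}

int main(void)
{
        uint64_t pte = 0x1000ULL << PFN_SHIFT;  /* pfn 0x1000 */
        uint64_t inv = ~pte;                    /* inverted encoding */

        /* both print 0x1003: subtracting on the inverted form advances
         * the decoded pfn by the same amount */
        printf("normal:   %#llx\n",
               (unsigned long long)(advance_pfn(pte, 0, 3) >> PFN_SHIFT));
        printf("inverted: %#llx\n",
               (unsigned long long)(~advance_pfn(inv, 1, 3) >> PFN_SHIFT));
        return 0;
}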

* [PATCH v5 08/25] mm: Remove pte_next_pfn() and replace with pte_advance_pfn()
  2024-02-02  8:07 ` Ryan Roberts
  (?)
@ 2024-02-02  8:07   ` Ryan Roberts
  -1 siblings, 0 replies; 240+ messages in thread
From: Ryan Roberts @ 2024-02-02  8:07 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	Mark Rutland, David Hildenbrand, Kefeng Wang, John Hubbard,
	Zi Yan, Barry Song, Alistair Popple, Yang Shi, Nicholas Piggin,
	Christophe Leroy, Aneesh Kumar K.V, Naveen N. Rao,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin
  Cc: Ryan Roberts, linux-arm-kernel, x86, linuxppc-dev, linux-mm,
	linux-kernel

Now that the architectures are converted over to pte_advance_pfn(), we
can remove the pte_next_pfn() wrapper and convert the callers to call
pte_advance_pfn().

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 include/linux/pgtable.h | 9 +--------
 mm/memory.c             | 4 ++--
 2 files changed, 3 insertions(+), 10 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 815d92dcb96b..50f32cccbd92 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -212,19 +212,12 @@ static inline int pmd_dirty(pmd_t pmd)
 #define arch_flush_lazy_mmu_mode()	do {} while (0)
 #endif
 
-
-#ifndef pte_next_pfn
 #ifndef pte_advance_pfn
 static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
 {
 	return __pte(pte_val(pte) + (nr << PFN_PTE_SHIFT));
 }
 #endif
-static inline pte_t pte_next_pfn(pte_t pte)
-{
-	return pte_advance_pfn(pte, 1);
-}
-#endif
 
 #ifndef set_ptes
 /**
@@ -256,7 +249,7 @@ static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
 		if (--nr == 0)
 			break;
 		ptep++;
-		pte = pte_next_pfn(pte);
+		pte = pte_advance_pfn(pte, 1);
 	}
 	arch_leave_lazy_mmu_mode();
 }
diff --git a/mm/memory.c b/mm/memory.c
index 38a010c4d04d..65fbe4f886c1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -988,7 +988,7 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
 {
 	unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio);
 	const pte_t *end_ptep = start_ptep + max_nr;
-	pte_t expected_pte = __pte_batch_clear_ignored(pte_next_pfn(pte), flags);
+	pte_t expected_pte = __pte_batch_clear_ignored(pte_advance_pfn(pte, 1), flags);
 	pte_t *ptep = start_ptep + 1;
 	bool writable;
 
@@ -1017,7 +1017,7 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
 		if (any_writable)
 			*any_writable |= writable;
 
-		expected_pte = pte_next_pfn(expected_pte);
+		expected_pte = pte_advance_pfn(expected_pte, 1);
 		ptep++;
 	}
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 240+ messages in thread
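
The folio_pte_batch() hunks above show the pattern these helpers serve:
walk forward, comparing each pte against an expected value whose pfn
advances by one per slot, so that a run of physically contiguous ptes
can be detected and handled as a single batch. A standalone toy model
of that scan (userspace sketch; plain pfn integers stand in for ptes):

#include <stddef.h>
#include <stdio.h>

/* count how many leading entries form a contiguous pfn run;
 * assumes max_nr >= 1 */
static size_t pfn_batch(const unsigned long *pfns, size_t max_nr)
{
        /* pte_advance_pfn(pte, 1) analogue */
        unsigned long expected = pfns[0] + 1;
        size_t nr = 1;

        while (nr < max_nr && pfns[nr] == expected) {
                expected++;     /* advance expectation one pfn per slot */
                nr++;
        }
        return nr;
}

int main(void)
{
        unsigned long pfns[] = { 100, 101, 102, 200, 201 };

        printf("batch length: %zu\n", pfn_batch(pfns, 5));      /* 3 */
        return 0;
}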

* [PATCH v5 09/25] arm64/mm: set_pte(): New layer to manage contig bit
  2024-02-02  8:07 ` Ryan Roberts
  (?)
@ 2024-02-02  8:07   ` Ryan Roberts
  -1 siblings, 0 replies; 240+ messages in thread
From: Ryan Roberts @ 2024-02-02  8:07 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	Mark Rutland, David Hildenbrand, Kefeng Wang, John Hubbard,
	Zi Yan, Barry Song, Alistair Popple, Yang Shi, Nicholas Piggin,
	Christophe Leroy, Aneesh Kumar K.V, Naveen N. Rao,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin
  Cc: Ryan Roberts, linux-arm-kernel, x86, linuxppc-dev, linux-mm,
	linux-kernel

Create a new layer for the in-table PTE manipulation APIs. For now, the
existing API is prefixed with a double underscore to become the
arch-private API, and the public API is just a simple wrapper that calls
the private API.

The public API implementation will subsequently be used to transparently
manipulate the contiguous bit where appropriate. But since there are
already some contig-aware users (e.g. hugetlb, kernel mapper), we must
first ensure those users use the private API directly so that the future
contig-bit manipulations in the public API do not interfere with those
existing uses.

Tested-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/arm64/include/asm/pgtable.h | 11 +++++++----
 arch/arm64/kernel/efi.c          |  2 +-
 arch/arm64/mm/fixmap.c           |  2 +-
 arch/arm64/mm/kasan_init.c       |  4 ++--
 arch/arm64/mm/mmu.c              |  2 +-
 arch/arm64/mm/pageattr.c         |  2 +-
 arch/arm64/mm/trans_pgd.c        |  4 ++--
 7 files changed, 15 insertions(+), 12 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 6a6cc78cf879..3cb45e8dbb52 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -93,7 +93,8 @@ static inline pteval_t __phys_to_pte_val(phys_addr_t phys)
 	__pte(__phys_to_pte_val((phys_addr_t)(pfn) << PAGE_SHIFT) | pgprot_val(prot))
 
 #define pte_none(pte)		(!pte_val(pte))
-#define pte_clear(mm,addr,ptep)	set_pte(ptep, __pte(0))
+#define pte_clear(mm, addr, ptep) \
+				__set_pte(ptep, __pte(0))
 #define pte_page(pte)		(pfn_to_page(pte_pfn(pte)))
 
 /*
@@ -261,7 +262,7 @@ static inline pte_t pte_mkdevmap(pte_t pte)
 	return set_pte_bit(pte, __pgprot(PTE_DEVMAP | PTE_SPECIAL));
 }
 
-static inline void set_pte(pte_t *ptep, pte_t pte)
+static inline void __set_pte(pte_t *ptep, pte_t pte)
 {
 	WRITE_ONCE(*ptep, pte);
 
@@ -366,7 +367,7 @@ static inline void set_ptes(struct mm_struct *mm,
 
 	for (;;) {
 		__check_safe_pte_update(mm, ptep, pte);
-		set_pte(ptep, pte);
+		__set_pte(ptep, pte);
 		if (--nr == 0)
 			break;
 		ptep++;
@@ -540,7 +541,7 @@ static inline void __set_pte_at(struct mm_struct *mm,
 {
 	__sync_cache_and_tags(pte, nr);
 	__check_safe_pte_update(mm, ptep, pte);
-	set_pte(ptep, pte);
+	__set_pte(ptep, pte);
 }
 
 static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr,
@@ -1138,6 +1139,8 @@ void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte);
 #define vmemmap_update_pte vmemmap_update_pte
 #endif
 
+#define set_pte					__set_pte
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* __ASM_PGTABLE_H */
diff --git a/arch/arm64/kernel/efi.c b/arch/arm64/kernel/efi.c
index 0228001347be..44288a12fc6c 100644
--- a/arch/arm64/kernel/efi.c
+++ b/arch/arm64/kernel/efi.c
@@ -111,7 +111,7 @@ static int __init set_permissions(pte_t *ptep, unsigned long addr, void *data)
 		pte = set_pte_bit(pte, __pgprot(PTE_PXN));
 	else if (system_supports_bti_kernel() && spd->has_bti)
 		pte = set_pte_bit(pte, __pgprot(PTE_GP));
-	set_pte(ptep, pte);
+	__set_pte(ptep, pte);
 	return 0;
 }
 
diff --git a/arch/arm64/mm/fixmap.c b/arch/arm64/mm/fixmap.c
index c0a3301203bd..51cd4501816d 100644
--- a/arch/arm64/mm/fixmap.c
+++ b/arch/arm64/mm/fixmap.c
@@ -121,7 +121,7 @@ void __set_fixmap(enum fixed_addresses idx,
 	ptep = fixmap_pte(addr);
 
 	if (pgprot_val(flags)) {
-		set_pte(ptep, pfn_pte(phys >> PAGE_SHIFT, flags));
+		__set_pte(ptep, pfn_pte(phys >> PAGE_SHIFT, flags));
 	} else {
 		pte_clear(&init_mm, addr, ptep);
 		flush_tlb_kernel_range(addr, addr+PAGE_SIZE);
diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c
index 4c7ad574b946..f659bd98c63f 100644
--- a/arch/arm64/mm/kasan_init.c
+++ b/arch/arm64/mm/kasan_init.c
@@ -112,7 +112,7 @@ static void __init kasan_pte_populate(pmd_t *pmdp, unsigned long addr,
 		if (!early)
 			memset(__va(page_phys), KASAN_SHADOW_INIT, PAGE_SIZE);
 		next = addr + PAGE_SIZE;
-		set_pte(ptep, pfn_pte(__phys_to_pfn(page_phys), PAGE_KERNEL));
+		__set_pte(ptep, pfn_pte(__phys_to_pfn(page_phys), PAGE_KERNEL));
 	} while (ptep++, addr = next, addr != end && pte_none(READ_ONCE(*ptep)));
 }
 
@@ -271,7 +271,7 @@ static void __init kasan_init_shadow(void)
 	 * so we should make sure that it maps the zero page read-only.
 	 */
 	for (i = 0; i < PTRS_PER_PTE; i++)
-		set_pte(&kasan_early_shadow_pte[i],
+		__set_pte(&kasan_early_shadow_pte[i],
 			pfn_pte(sym_to_pfn(kasan_early_shadow_page),
 				PAGE_KERNEL_RO));
 
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index d794b2f4b5a3..7cc1930f0e10 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -175,7 +175,7 @@ static void init_pte(pmd_t *pmdp, unsigned long addr, unsigned long end,
 	do {
 		pte_t old_pte = READ_ONCE(*ptep);
 
-		set_pte(ptep, pfn_pte(__phys_to_pfn(phys), prot));
+		__set_pte(ptep, pfn_pte(__phys_to_pfn(phys), prot));
 
 		/*
 		 * After the PTE entry has been populated once, we
diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
index 924843f1f661..a7996d8edf0a 100644
--- a/arch/arm64/mm/pageattr.c
+++ b/arch/arm64/mm/pageattr.c
@@ -41,7 +41,7 @@ static int change_page_range(pte_t *ptep, unsigned long addr, void *data)
 	pte = clear_pte_bit(pte, cdata->clear_mask);
 	pte = set_pte_bit(pte, cdata->set_mask);
 
-	set_pte(ptep, pte);
+	__set_pte(ptep, pte);
 	return 0;
 }
 
diff --git a/arch/arm64/mm/trans_pgd.c b/arch/arm64/mm/trans_pgd.c
index 7b14df3c6477..230b607cf881 100644
--- a/arch/arm64/mm/trans_pgd.c
+++ b/arch/arm64/mm/trans_pgd.c
@@ -41,7 +41,7 @@ static void _copy_pte(pte_t *dst_ptep, pte_t *src_ptep, unsigned long addr)
 		 * read only (code, rodata). Clear the RDONLY bit from
 		 * the temporary mappings we use during restore.
 		 */
-		set_pte(dst_ptep, pte_mkwrite_novma(pte));
+		__set_pte(dst_ptep, pte_mkwrite_novma(pte));
 	} else if ((debug_pagealloc_enabled() ||
 		   is_kfence_address((void *)addr)) && !pte_none(pte)) {
 		/*
@@ -55,7 +55,7 @@ static void _copy_pte(pte_t *dst_ptep, pte_t *src_ptep, unsigned long addr)
 		 */
 		BUG_ON(!pfn_valid(pte_pfn(pte)));
 
-		set_pte(dst_ptep, pte_mkpresent(pte_mkwrite_novma(pte)));
+		__set_pte(dst_ptep, pte_mkpresent(pte_mkwrite_novma(pte)));
 	}
 }
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 240+ messages in thread
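
The shape of the layering being introduced, condensed (only the alias
exists after this patch; the wrapper sketched in the comment is a
hypothetical placeholder for where the series is heading):

/* arch-private primitive: operates on exactly one hardware pte
 * (body abbreviated from the hunk above) */
static inline void __set_pte(pte_t *ptep, pte_t pte)
{
        WRITE_ONCE(*ptep, pte);
}

/* public API seen by core-mm: after this patch, just an alias */
#define set_pte         __set_pte

/*
 * Hypothetical later shape (placeholder name, not added here): the
 * public entry point becomes a real function that can manage the
 * contiguous bit before delegating to the private primitive:
 *
 *      static inline void set_pte(pte_t *ptep, pte_t pte)
 *      {
 *              contpte_manage(ptep, pte);
 *              __set_pte(ptep, pte);
 *      }
 */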

* [PATCH v5 09/25] arm64/mm: set_pte(): New layer to manage contig bit
@ 2024-02-02  8:07   ` Ryan Roberts
  0 siblings, 0 replies; 240+ messages in thread
From: Ryan Roberts @ 2024-02-02  8:07 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	Mark Rutland, David Hildenbrand, Kefeng Wang, John Hubbard,
	Zi Yan, Barry Song, Alistair Popple, Yang Shi, Nicholas Piggin,
	Christophe Leroy, Aneesh Kumar K.V, Naveen N. Rao,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin
  Cc: Ryan Roberts, x86, linux-kernel, linux-mm, linuxppc-dev,
	linux-arm-kernel

Create a new layer for the in-table PTE manipulation APIs. For now, the
existing API is prefixed with double underscore to become the
arch-private API and the public API is just a simple wrapper that calls
the private API.

The public API implementation will subsequently be used to transparently
manipulate the contiguous bit where appropriate. But since there are
already some contig-aware users (e.g. hugetlb, kernel mapper), we must
first ensure those users use the private API directly so that the future
contig-bit manipulations in the public API do not interfere with those
existing uses.
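
To make the layering concrete, here is a minimal sketch (not the patch
itself; kernel_map_page() is a hypothetical caller standing in for the
existing contig-aware users such as the kernel mapper):

  /* Arch-private layer: writes exactly the PTE it is given. */
  static inline void __set_pte(pte_t *ptep, pte_t pte)
  {
          WRITE_ONCE(*ptep, pte);
          /* dsb/isb ordering for valid kernel mappings elided in this sketch */
  }

  /* Public layer: currently a plain alias; later the hook point for
   * transparent contig-bit management. */
  #define set_pte         __set_pte

  /* Hypothetical kernel-mapper-style caller: it manages its mappings
   * explicitly, so it bypasses the public wrapper. */
  static void kernel_map_page(pte_t *ptep, phys_addr_t phys, pgprot_t prot)
  {
          __set_pte(ptep, pfn_pte(__phys_to_pfn(phys), prot));
  }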

Tested-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/arm64/include/asm/pgtable.h | 11 +++++++----
 arch/arm64/kernel/efi.c          |  2 +-
 arch/arm64/mm/fixmap.c           |  2 +-
 arch/arm64/mm/kasan_init.c       |  4 ++--
 arch/arm64/mm/mmu.c              |  2 +-
 arch/arm64/mm/pageattr.c         |  2 +-
 arch/arm64/mm/trans_pgd.c        |  4 ++--
 7 files changed, 15 insertions(+), 12 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 6a6cc78cf879..3cb45e8dbb52 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -93,7 +93,8 @@ static inline pteval_t __phys_to_pte_val(phys_addr_t phys)
 	__pte(__phys_to_pte_val((phys_addr_t)(pfn) << PAGE_SHIFT) | pgprot_val(prot))
 
 #define pte_none(pte)		(!pte_val(pte))
-#define pte_clear(mm,addr,ptep)	set_pte(ptep, __pte(0))
+#define pte_clear(mm, addr, ptep) \
+				__set_pte(ptep, __pte(0))
 #define pte_page(pte)		(pfn_to_page(pte_pfn(pte)))
 
 /*
@@ -261,7 +262,7 @@ static inline pte_t pte_mkdevmap(pte_t pte)
 	return set_pte_bit(pte, __pgprot(PTE_DEVMAP | PTE_SPECIAL));
 }
 
-static inline void set_pte(pte_t *ptep, pte_t pte)
+static inline void __set_pte(pte_t *ptep, pte_t pte)
 {
 	WRITE_ONCE(*ptep, pte);
 
@@ -366,7 +367,7 @@ static inline void set_ptes(struct mm_struct *mm,
 
 	for (;;) {
 		__check_safe_pte_update(mm, ptep, pte);
-		set_pte(ptep, pte);
+		__set_pte(ptep, pte);
 		if (--nr == 0)
 			break;
 		ptep++;
@@ -540,7 +541,7 @@ static inline void __set_pte_at(struct mm_struct *mm,
 {
 	__sync_cache_and_tags(pte, nr);
 	__check_safe_pte_update(mm, ptep, pte);
-	set_pte(ptep, pte);
+	__set_pte(ptep, pte);
 }
 
 static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr,
@@ -1138,6 +1139,8 @@ void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte);
 #define vmemmap_update_pte vmemmap_update_pte
 #endif
 
+#define set_pte					__set_pte
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* __ASM_PGTABLE_H */
diff --git a/arch/arm64/kernel/efi.c b/arch/arm64/kernel/efi.c
index 0228001347be..44288a12fc6c 100644
--- a/arch/arm64/kernel/efi.c
+++ b/arch/arm64/kernel/efi.c
@@ -111,7 +111,7 @@ static int __init set_permissions(pte_t *ptep, unsigned long addr, void *data)
 		pte = set_pte_bit(pte, __pgprot(PTE_PXN));
 	else if (system_supports_bti_kernel() && spd->has_bti)
 		pte = set_pte_bit(pte, __pgprot(PTE_GP));
-	set_pte(ptep, pte);
+	__set_pte(ptep, pte);
 	return 0;
 }
 
diff --git a/arch/arm64/mm/fixmap.c b/arch/arm64/mm/fixmap.c
index c0a3301203bd..51cd4501816d 100644
--- a/arch/arm64/mm/fixmap.c
+++ b/arch/arm64/mm/fixmap.c
@@ -121,7 +121,7 @@ void __set_fixmap(enum fixed_addresses idx,
 	ptep = fixmap_pte(addr);
 
 	if (pgprot_val(flags)) {
-		set_pte(ptep, pfn_pte(phys >> PAGE_SHIFT, flags));
+		__set_pte(ptep, pfn_pte(phys >> PAGE_SHIFT, flags));
 	} else {
 		pte_clear(&init_mm, addr, ptep);
 		flush_tlb_kernel_range(addr, addr+PAGE_SIZE);
diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c
index 4c7ad574b946..f659bd98c63f 100644
--- a/arch/arm64/mm/kasan_init.c
+++ b/arch/arm64/mm/kasan_init.c
@@ -112,7 +112,7 @@ static void __init kasan_pte_populate(pmd_t *pmdp, unsigned long addr,
 		if (!early)
 			memset(__va(page_phys), KASAN_SHADOW_INIT, PAGE_SIZE);
 		next = addr + PAGE_SIZE;
-		set_pte(ptep, pfn_pte(__phys_to_pfn(page_phys), PAGE_KERNEL));
+		__set_pte(ptep, pfn_pte(__phys_to_pfn(page_phys), PAGE_KERNEL));
 	} while (ptep++, addr = next, addr != end && pte_none(READ_ONCE(*ptep)));
 }
 
@@ -271,7 +271,7 @@ static void __init kasan_init_shadow(void)
 	 * so we should make sure that it maps the zero page read-only.
 	 */
 	for (i = 0; i < PTRS_PER_PTE; i++)
-		set_pte(&kasan_early_shadow_pte[i],
+		__set_pte(&kasan_early_shadow_pte[i],
 			pfn_pte(sym_to_pfn(kasan_early_shadow_page),
 				PAGE_KERNEL_RO));
 
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index d794b2f4b5a3..7cc1930f0e10 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -175,7 +175,7 @@ static void init_pte(pmd_t *pmdp, unsigned long addr, unsigned long end,
 	do {
 		pte_t old_pte = READ_ONCE(*ptep);
 
-		set_pte(ptep, pfn_pte(__phys_to_pfn(phys), prot));
+		__set_pte(ptep, pfn_pte(__phys_to_pfn(phys), prot));
 
 		/*
 		 * After the PTE entry has been populated once, we
diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
index 924843f1f661..a7996d8edf0a 100644
--- a/arch/arm64/mm/pageattr.c
+++ b/arch/arm64/mm/pageattr.c
@@ -41,7 +41,7 @@ static int change_page_range(pte_t *ptep, unsigned long addr, void *data)
 	pte = clear_pte_bit(pte, cdata->clear_mask);
 	pte = set_pte_bit(pte, cdata->set_mask);
 
-	set_pte(ptep, pte);
+	__set_pte(ptep, pte);
 	return 0;
 }
 
diff --git a/arch/arm64/mm/trans_pgd.c b/arch/arm64/mm/trans_pgd.c
index 7b14df3c6477..230b607cf881 100644
--- a/arch/arm64/mm/trans_pgd.c
+++ b/arch/arm64/mm/trans_pgd.c
@@ -41,7 +41,7 @@ static void _copy_pte(pte_t *dst_ptep, pte_t *src_ptep, unsigned long addr)
 		 * read only (code, rodata). Clear the RDONLY bit from
 		 * the temporary mappings we use during restore.
 		 */
-		set_pte(dst_ptep, pte_mkwrite_novma(pte));
+		__set_pte(dst_ptep, pte_mkwrite_novma(pte));
 	} else if ((debug_pagealloc_enabled() ||
 		   is_kfence_address((void *)addr)) && !pte_none(pte)) {
 		/*
@@ -55,7 +55,7 @@ static void _copy_pte(pte_t *dst_ptep, pte_t *src_ptep, unsigned long addr)
 		 */
 		BUG_ON(!pfn_valid(pte_pfn(pte)));
 
-		set_pte(dst_ptep, pte_mkpresent(pte_mkwrite_novma(pte)));
+		__set_pte(dst_ptep, pte_mkpresent(pte_mkwrite_novma(pte)));
 	}
 }
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH v5 10/25] arm64/mm: set_ptes()/set_pte_at(): New layer to manage contig bit
  2024-02-02  8:07 ` Ryan Roberts
  (?)
@ 2024-02-02  8:07   ` Ryan Roberts
  -1 siblings, 0 replies; 240+ messages in thread
From: Ryan Roberts @ 2024-02-02  8:07 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	Mark Rutland, David Hildenbrand, Kefeng Wang, John Hubbard,
	Zi Yan, Barry Song, Alistair Popple, Yang Shi, Nicholas Piggin,
	Christophe Leroy, Aneesh Kumar K.V, Naveen N. Rao,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin
  Cc: Ryan Roberts, linux-arm-kernel, x86, linuxppc-dev, linux-mm,
	linux-kernel

Create a new layer for the in-table PTE manipulation APIs. For now, the
existing API is prefixed with double underscore to become the
arch-private API and the public API is just a simple wrapper that calls
the private API.

The public API implementation will subsequently be used to transparently
manipulate the contiguous bit where appropriate. But since there are
already some contig-aware users (e.g. hugetlb, kernel mapper), we must
first ensure those users use the private API directly so that the future
contig-bit manipulations in the public API do not interfere with those
existing uses.

set_pte_at() is a core macro that forwards to set_ptes() (with nr=1).
Instead of creating a __set_pte_at() internal macro, convert all arch
users to use set_ptes()/__set_ptes() directly, as appropriate. Callers
in hugetlb may benefit from calling __set_ptes() once for their whole
range rather than managing their own loop. This is left for future
improvement.
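
For illustration, a sketch of the resulting call pattern (not the patch
itself; set_contig_range() below is a hypothetical hugetlb-style helper):

  /* Core mm defines set_pte_at() as set_ptes() with nr == 1, i.e.
   *   #define set_pte_at(mm, addr, ptep, pte) set_ptes(mm, addr, ptep, pte, 1)
   * so on arm64 the public name now simply aliases the private helper. */
  #define set_ptes        __set_ptes

  /* Hypothetical hugetlb-style loop after conversion: each entry is
   * written through the private API, one PTE at a time, keeping clear
   * of any future contig-bit handling in the public wrapper. */
  static void set_contig_range(struct mm_struct *mm, unsigned long addr,
                               pte_t *ptep, pte_t pte, int ncontig,
                               unsigned long pgsize)
  {
          int i;

          for (i = 0; i < ncontig; i++, ptep++, addr += pgsize) {
                  __set_ptes(mm, addr, ptep, pte, 1);
                  pte = pte_advance_pfn(pte, pgsize >> PAGE_SHIFT);
          }
  }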

Tested-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/arm64/include/asm/pgtable.h | 10 +++++-----
 arch/arm64/kernel/mte.c          |  2 +-
 arch/arm64/kvm/guest.c           |  2 +-
 arch/arm64/mm/fault.c            |  2 +-
 arch/arm64/mm/hugetlbpage.c      | 10 +++++-----
 5 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 3cb45e8dbb52..f1fd6c5e3eca 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -358,9 +358,9 @@ static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
 	return pfn_pte(pte_pfn(pte) + nr, pte_pgprot(pte));
 }
 
-static inline void set_ptes(struct mm_struct *mm,
-			    unsigned long __always_unused addr,
-			    pte_t *ptep, pte_t pte, unsigned int nr)
+static inline void __set_ptes(struct mm_struct *mm,
+			      unsigned long __always_unused addr,
+			      pte_t *ptep, pte_t pte, unsigned int nr)
 {
 	page_table_check_ptes_set(mm, ptep, pte, nr);
 	__sync_cache_and_tags(pte, nr);
@@ -374,7 +374,6 @@ static inline void set_ptes(struct mm_struct *mm,
 		pte = pte_advance_pfn(pte, 1);
 	}
 }
-#define set_ptes set_ptes
 
 /*
  * Huge pte definitions.
@@ -1079,7 +1078,7 @@ static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
 #endif /* CONFIG_ARM64_MTE */
 
 /*
- * On AArch64, the cache coherency is handled via the set_pte_at() function.
+ * On AArch64, the cache coherency is handled via the __set_ptes() function.
  */
 static inline void update_mmu_cache_range(struct vm_fault *vmf,
 		struct vm_area_struct *vma, unsigned long addr, pte_t *ptep,
@@ -1140,6 +1139,7 @@ void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte);
 #endif
 
 #define set_pte					__set_pte
+#define set_ptes				__set_ptes
 
 #endif /* !__ASSEMBLY__ */
 
diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
index a41ef3213e1e..dcdcccd40891 100644
--- a/arch/arm64/kernel/mte.c
+++ b/arch/arm64/kernel/mte.c
@@ -67,7 +67,7 @@ int memcmp_pages(struct page *page1, struct page *page2)
 	/*
 	 * If the page content is identical but at least one of the pages is
 	 * tagged, return non-zero to avoid KSM merging. If only one of the
-	 * pages is tagged, set_pte_at() may zero or change the tags of the
+	 * pages is tagged, __set_ptes() may zero or change the tags of the
 	 * other page via mte_sync_tags().
 	 */
 	if (page_mte_tagged(page1) || page_mte_tagged(page2))
diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
index aaf1d4939739..629145fd3161 100644
--- a/arch/arm64/kvm/guest.c
+++ b/arch/arm64/kvm/guest.c
@@ -1072,7 +1072,7 @@ int kvm_vm_ioctl_mte_copy_tags(struct kvm *kvm,
 		} else {
 			/*
 			 * Only locking to serialise with a concurrent
-			 * set_pte_at() in the VMM but still overriding the
+			 * __set_ptes() in the VMM but still overriding the
 			 * tags, hence ignoring the return value.
 			 */
 			try_page_mte_tagging(page);
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 13189322a38f..23d0dfc16686 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -205,7 +205,7 @@ static void show_pte(unsigned long addr)
  *
  * It needs to cope with hardware update of the accessed/dirty state by other
  * agents in the system and can safely skip the __sync_icache_dcache() call as,
- * like set_pte_at(), the PTE is never changed from no-exec to exec here.
+ * like __set_ptes(), the PTE is never changed from no-exec to exec here.
  *
  * Returns whether or not the PTE actually changed.
  */
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 8116ac599f80..9d7e7315eaa3 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -254,12 +254,12 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
 
 	if (!pte_present(pte)) {
 		for (i = 0; i < ncontig; i++, ptep++, addr += pgsize)
-			set_pte_at(mm, addr, ptep, pte);
+			__set_ptes(mm, addr, ptep, pte, 1);
 		return;
 	}
 
 	if (!pte_cont(pte)) {
-		set_pte_at(mm, addr, ptep, pte);
+		__set_ptes(mm, addr, ptep, pte, 1);
 		return;
 	}
 
@@ -270,7 +270,7 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
 	clear_flush(mm, addr, ptep, pgsize, ncontig);
 
 	for (i = 0; i < ncontig; i++, ptep++, addr += pgsize, pfn += dpfn)
-		set_pte_at(mm, addr, ptep, pfn_pte(pfn, hugeprot));
+		__set_ptes(mm, addr, ptep, pfn_pte(pfn, hugeprot), 1);
 }
 
 pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
@@ -478,7 +478,7 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
 
 	hugeprot = pte_pgprot(pte);
 	for (i = 0; i < ncontig; i++, ptep++, addr += pgsize, pfn += dpfn)
-		set_pte_at(mm, addr, ptep, pfn_pte(pfn, hugeprot));
+		__set_ptes(mm, addr, ptep, pfn_pte(pfn, hugeprot), 1);
 
 	return 1;
 }
@@ -507,7 +507,7 @@ void huge_ptep_set_wrprotect(struct mm_struct *mm,
 	pfn = pte_pfn(pte);
 
 	for (i = 0; i < ncontig; i++, ptep++, addr += pgsize, pfn += dpfn)
-		set_pte_at(mm, addr, ptep, pfn_pte(pfn, hugeprot));
+		__set_ptes(mm, addr, ptep, pfn_pte(pfn, hugeprot), 1);
 }
 
 pte_t huge_ptep_clear_flush(struct vm_area_struct *vma,
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH v5 11/25] arm64/mm: pte_clear(): New layer to manage contig bit
  2024-02-02  8:07 ` Ryan Roberts
  (?)
@ 2024-02-02  8:07   ` Ryan Roberts
  -1 siblings, 0 replies; 240+ messages in thread
From: Ryan Roberts @ 2024-02-02  8:07 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	Mark Rutland, David Hildenbrand, Kefeng Wang, John Hubbard,
	Zi Yan, Barry Song, Alistair Popple, Yang Shi, Nicholas Piggin,
	Christophe Leroy, Aneesh Kumar K.V, Naveen N. Rao,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin
  Cc: Ryan Roberts, linux-arm-kernel, x86, linuxppc-dev, linux-mm,
	linux-kernel

Create a new layer for the in-table PTE manipulation APIs. For now, the
existing API is prefixed with double underscore to become the
arch-private API and the public API is just a simple wrapper that calls
the private API.

The public API implementation will subsequently be used to transparently
manipulate the contiguous bit where appropriate. But since there are
already some contig-aware users (e.g. hugetlb, kernel mapper), we must
first ensure those users use the private API directly so that the future
contig-bit manipulations in the public API do not interfere with those
existing uses.
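
For illustration, a sketch of the resulting shape (not the patch itself;
clear_kernel_pte() below is hypothetical):

  /* Arch-private clear: unconditionally zeroes a single entry. */
  #define __pte_clear(mm, addr, ptep)     __set_pte(ptep, __pte(0))

  /* Public name remains available for generic code. */
  #define pte_clear       __pte_clear

  /* Hypothetical fixmap/hotplug-style teardown: uses the private API so
   * that future contig-bit handling in pte_clear() cannot interfere,
   * then invalidates the TLB for the kernel mapping. */
  static void clear_kernel_pte(unsigned long addr, pte_t *ptep)
  {
          __pte_clear(&init_mm, addr, ptep);
          flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
  }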

Tested-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/arm64/include/asm/pgtable.h | 3 ++-
 arch/arm64/mm/fixmap.c           | 2 +-
 arch/arm64/mm/hugetlbpage.c      | 2 +-
 arch/arm64/mm/mmu.c              | 2 +-
 4 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index f1fd6c5e3eca..3b0ff58109c5 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -93,7 +93,7 @@ static inline pteval_t __phys_to_pte_val(phys_addr_t phys)
 	__pte(__phys_to_pte_val((phys_addr_t)(pfn) << PAGE_SHIFT) | pgprot_val(prot))
 
 #define pte_none(pte)		(!pte_val(pte))
-#define pte_clear(mm, addr, ptep) \
+#define __pte_clear(mm, addr, ptep) \
 				__set_pte(ptep, __pte(0))
 #define pte_page(pte)		(pfn_to_page(pte_pfn(pte)))
 
@@ -1140,6 +1140,7 @@ void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte);
 
 #define set_pte					__set_pte
 #define set_ptes				__set_ptes
+#define pte_clear				__pte_clear
 
 #endif /* !__ASSEMBLY__ */
 
diff --git a/arch/arm64/mm/fixmap.c b/arch/arm64/mm/fixmap.c
index 51cd4501816d..bfc02568805a 100644
--- a/arch/arm64/mm/fixmap.c
+++ b/arch/arm64/mm/fixmap.c
@@ -123,7 +123,7 @@ void __set_fixmap(enum fixed_addresses idx,
 	if (pgprot_val(flags)) {
 		__set_pte(ptep, pfn_pte(phys >> PAGE_SHIFT, flags));
 	} else {
-		pte_clear(&init_mm, addr, ptep);
+		__pte_clear(&init_mm, addr, ptep);
 		flush_tlb_kernel_range(addr, addr+PAGE_SIZE);
 	}
 }
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 9d7e7315eaa3..3d73b83cf97f 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -400,7 +400,7 @@ void huge_pte_clear(struct mm_struct *mm, unsigned long addr,
 	ncontig = num_contig_ptes(sz, &pgsize);
 
 	for (i = 0; i < ncontig; i++, addr += pgsize, ptep++)
-		pte_clear(mm, addr, ptep);
+		__pte_clear(mm, addr, ptep);
 }
 
 pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 7cc1930f0e10..bcaa5a5d86f8 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -859,7 +859,7 @@ static void unmap_hotplug_pte_range(pmd_t *pmdp, unsigned long addr,
 			continue;
 
 		WARN_ON(!pte_present(pte));
-		pte_clear(&init_mm, addr, ptep);
+		__pte_clear(&init_mm, addr, ptep);
 		flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
 		if (free_mapped)
 			free_hotplug_page_range(pte_page(pte),
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH v5 12/25] arm64/mm: ptep_get_and_clear(): New layer to manage contig bit
  2024-02-02  8:07 ` Ryan Roberts
  (?)
@ 2024-02-02  8:07   ` Ryan Roberts
  -1 siblings, 0 replies; 240+ messages in thread
From: Ryan Roberts @ 2024-02-02  8:07 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	Mark Rutland, David Hildenbrand, Kefeng Wang, John Hubbard,
	Zi Yan, Barry Song, Alistair Popple, Yang Shi, Nicholas Piggin,
	Christophe Leroy, Aneesh Kumar K.V, Naveen N. Rao,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin
  Cc: Ryan Roberts, linux-arm-kernel, x86, linuxppc-dev, linux-mm,
	linux-kernel

Create a new layer for the in-table PTE manipulation APIs. For now, the
existing API is prefixed with double underscore to become the
arch-private API and the public API is just a simple wrapper that calls
the private API.

The public API implementation will subsequently be used to transparently
manipulate the contiguous bit where appropriate. But since there are
already some contig-aware users (e.g. hugetlb, kernel mapper), we must
first ensure those users use the private API directly so that the future
contig-bit manipulations in the public API do not interfere with those
existing uses.
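
For illustration, a hedged sketch of why contig-aware code wants the
private helper (clear_contig_range() below is hypothetical, modelled on
the hugetlb usage in this patch):

  /* Hypothetical hugetlb-style helper: clears each entry of a contiguous
   * range via the private API and folds any hardware-set dirty/young
   * bits into the PTE value returned to the caller. */
  static pte_t clear_contig_range(struct mm_struct *mm, unsigned long addr,
                                  pte_t *ptep, unsigned long pgsize,
                                  unsigned long ncontig)
  {
          pte_t orig_pte = ptep_get(ptep);
          unsigned long i;

          for (i = 0; i < ncontig; i++, addr += pgsize, ptep++) {
                  pte_t pte = __ptep_get_and_clear(mm, addr, ptep);

                  if (pte_dirty(pte))
                          orig_pte = pte_mkdirty(orig_pte);
                  if (pte_young(pte))
                          orig_pte = pte_mkyoung(orig_pte);
          }

          return orig_pte;
  }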

Tested-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/arm64/include/asm/pgtable.h | 5 +++--
 arch/arm64/mm/hugetlbpage.c      | 6 +++---
 2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 3b0ff58109c5..5f560326116e 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -953,8 +953,7 @@ static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
-#define __HAVE_ARCH_PTEP_GET_AND_CLEAR
-static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
+static inline pte_t __ptep_get_and_clear(struct mm_struct *mm,
 				       unsigned long address, pte_t *ptep)
 {
 	pte_t pte = __pte(xchg_relaxed(&pte_val(*ptep), 0));
@@ -1141,6 +1140,8 @@ void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte);
 #define set_pte					__set_pte
 #define set_ptes				__set_ptes
 #define pte_clear				__pte_clear
+#define __HAVE_ARCH_PTEP_GET_AND_CLEAR
+#define ptep_get_and_clear			__ptep_get_and_clear
 
 #endif /* !__ASSEMBLY__ */
 
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 3d73b83cf97f..7e74e7b67107 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -188,7 +188,7 @@ static pte_t get_clear_contig(struct mm_struct *mm,
 	unsigned long i;
 
 	for (i = 0; i < ncontig; i++, addr += pgsize, ptep++) {
-		pte_t pte = ptep_get_and_clear(mm, addr, ptep);
+		pte_t pte = __ptep_get_and_clear(mm, addr, ptep);
 
 		/*
 		 * If HW_AFDBM is enabled, then the HW could turn on
@@ -236,7 +236,7 @@ static void clear_flush(struct mm_struct *mm,
 	unsigned long i, saddr = addr;
 
 	for (i = 0; i < ncontig; i++, addr += pgsize, ptep++)
-		ptep_clear(mm, addr, ptep);
+		__ptep_get_and_clear(mm, addr, ptep);
 
 	flush_tlb_range(&vma, saddr, addr);
 }
@@ -411,7 +411,7 @@ pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
 	pte_t orig_pte = ptep_get(ptep);
 
 	if (!pte_cont(orig_pte))
-		return ptep_get_and_clear(mm, addr, ptep);
+		return __ptep_get_and_clear(mm, addr, ptep);
 
 	ncontig = find_num_contig(mm, addr, ptep, &pgsize);
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH v5 13/25] arm64/mm: ptep_test_and_clear_young(): New layer to manage contig bit
  2024-02-02  8:07 ` Ryan Roberts
  (?)
@ 2024-02-02  8:07   ` Ryan Roberts
  -1 siblings, 0 replies; 240+ messages in thread
From: Ryan Roberts @ 2024-02-02  8:07 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	Mark Rutland, David Hildenbrand, Kefeng Wang, John Hubbard,
	Zi Yan, Barry Song, Alistair Popple, Yang Shi, Nicholas Piggin,
	Christophe Leroy, Aneesh Kumar K.V, Naveen N. Rao,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin
  Cc: Ryan Roberts, linux-arm-kernel, x86, linuxppc-dev, linux-mm,
	linux-kernel

Create a new layer for the in-table PTE manipulation APIs. For now, the
existing API is prefixed with double underscore to become the
arch-private API and the public API is just a simple wrapper that calls
the private API.

The public API implementation will subsequently be used to transparently
manipulate the contiguous bit where appropriate. But since there are
already some contig-aware users (e.g. hugetlb, kernel mapper), we must
first ensure those users use the private API directly so that the future
contig-bit manipulations in the public API do not interfere with those
existing uses.
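
For illustration, a sketch (not the patch itself) of the new private
signature and the alias; the clear-old loop is a representative
cmpxchg-style body, since the diff below elides the unchanged internals:

  /* The arch-private helper now takes the same arguments as the generic
   * API (vma/address are unused at this level), so the public name can
   * become a plain alias. */
  static inline int __ptep_test_and_clear_young(struct vm_area_struct *vma,
                                                unsigned long address,
                                                pte_t *ptep)
  {
          pte_t old_pte, pte;

          pte = READ_ONCE(*ptep);
          do {
                  old_pte = pte;
                  pte = pte_mkold(pte);
                  pte_val(pte) = cmpxchg_relaxed(&pte_val(*ptep),
                                                 pte_val(old_pte),
                                                 pte_val(pte));
          } while (pte_val(pte) != pte_val(old_pte));

          return pte_young(pte);
  }

  #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
  #define ptep_test_and_clear_young      __ptep_test_and_clear_young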

Tested-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/arm64/include/asm/pgtable.h | 18 +++++++-----------
 1 file changed, 7 insertions(+), 11 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 5f560326116e..77a8b100e1cd 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -899,8 +899,9 @@ static inline bool pud_user_accessible_page(pud_t pud)
 /*
  * Atomic pte/pmd modifications.
  */
-#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
-static inline int __ptep_test_and_clear_young(pte_t *ptep)
+static inline int __ptep_test_and_clear_young(struct vm_area_struct *vma,
+					      unsigned long address,
+					      pte_t *ptep)
 {
 	pte_t old_pte, pte;
 
@@ -915,18 +916,11 @@ static inline int __ptep_test_and_clear_young(pte_t *ptep)
 	return pte_young(pte);
 }
 
-static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
-					    unsigned long address,
-					    pte_t *ptep)
-{
-	return __ptep_test_and_clear_young(ptep);
-}
-
 #define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
 static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
 					 unsigned long address, pte_t *ptep)
 {
-	int young = ptep_test_and_clear_young(vma, address, ptep);
+	int young = __ptep_test_and_clear_young(vma, address, ptep);
 
 	if (young) {
 		/*
@@ -949,7 +943,7 @@ static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
 					    unsigned long address,
 					    pmd_t *pmdp)
 {
-	return ptep_test_and_clear_young(vma, address, (pte_t *)pmdp);
+	return __ptep_test_and_clear_young(vma, address, (pte_t *)pmdp);
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
@@ -1142,6 +1136,8 @@ void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte);
 #define pte_clear				__pte_clear
 #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
 #define ptep_get_and_clear			__ptep_get_and_clear
+#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
+#define ptep_test_and_clear_young		__ptep_test_and_clear_young
 
 #endif /* !__ASSEMBLY__ */
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH v5 13/25] arm64/mm: ptep_test_and_clear_young(): New layer to manage contig bit
@ 2024-02-02  8:07   ` Ryan Roberts
  0 siblings, 0 replies; 240+ messages in thread
From: Ryan Roberts @ 2024-02-02  8:07 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	Mark Rutland, David Hildenbrand, Kefeng Wang, John Hubbard,
	Zi Yan, Barry Song, Alistair Popple, Yang Shi, Nicholas Piggin,
	Christophe Leroy, Aneesh Kumar K.V, Naveen N. Rao,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin
  Cc: Ryan Roberts, x86, linux-kernel, linux-mm, linuxppc-dev,
	linux-arm-kernel

Create a new layer for the in-table PTE manipulation APIs. For now, The
existing API is prefixed with double underscore to become the
arch-private API and the public API is just a simple wrapper that calls
the private API.

The public API implementation will subsequently be used to transparently
manipulate the contiguous bit where appropriate. But since there are
already some contig-aware users (e.g. hugetlb, kernel mapper), we must
first ensure those users use the private API directly so that the future
contig-bit manipulations in the public API do not interfere with those
existing uses.

Tested-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/arm64/include/asm/pgtable.h | 18 +++++++-----------
 1 file changed, 7 insertions(+), 11 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 5f560326116e..77a8b100e1cd 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -899,8 +899,9 @@ static inline bool pud_user_accessible_page(pud_t pud)
 /*
  * Atomic pte/pmd modifications.
  */
-#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
-static inline int __ptep_test_and_clear_young(pte_t *ptep)
+static inline int __ptep_test_and_clear_young(struct vm_area_struct *vma,
+					      unsigned long address,
+					      pte_t *ptep)
 {
 	pte_t old_pte, pte;
 
@@ -915,18 +916,11 @@ static inline int __ptep_test_and_clear_young(pte_t *ptep)
 	return pte_young(pte);
 }
 
-static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
-					    unsigned long address,
-					    pte_t *ptep)
-{
-	return __ptep_test_and_clear_young(ptep);
-}
-
 #define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
 static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
 					 unsigned long address, pte_t *ptep)
 {
-	int young = ptep_test_and_clear_young(vma, address, ptep);
+	int young = __ptep_test_and_clear_young(vma, address, ptep);
 
 	if (young) {
 		/*
@@ -949,7 +943,7 @@ static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
 					    unsigned long address,
 					    pmd_t *pmdp)
 {
-	return ptep_test_and_clear_young(vma, address, (pte_t *)pmdp);
+	return __ptep_test_and_clear_young(vma, address, (pte_t *)pmdp);
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
@@ -1142,6 +1136,8 @@ void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte);
 #define pte_clear				__pte_clear
 #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
 #define ptep_get_and_clear			__ptep_get_and_clear
+#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
+#define ptep_test_and_clear_young		__ptep_test_and_clear_young
 
 #endif /* !__ASSEMBLY__ */
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH v5 14/25] arm64/mm: ptep_clear_flush_young(): New layer to manage contig bit
@ 2024-02-02  8:07   ` Ryan Roberts
  0 siblings, 0 replies; 240+ messages in thread
From: Ryan Roberts @ 2024-02-02  8:07 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	Mark Rutland, David Hildenbrand, Kefeng Wang, John Hubbard,
	Zi Yan, Barry Song, Alistair Popple, Yang Shi, Nicholas Piggin,
	Christophe Leroy, Aneesh Kumar K.V, Naveen N. Rao,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin
  Cc: Ryan Roberts, linux-arm-kernel, x86, linuxppc-dev, linux-mm,
	linux-kernel

Create a new layer for the in-table PTE manipulation APIs. For now, the
existing API is prefixed with a double underscore to become the
arch-private API, and the public API is just a simple wrapper that calls
the private API.

The public API implementation will subsequently be used to transparently
manipulate the contiguous bit where appropriate. But since there are
already some contig-aware users (e.g. hugetlb, kernel mapper), we must
first ensure those users use the private API directly so that the future
contig-bit manipulations in the public API do not interfere with those
existing uses.
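
To make the layering concrete, here is a minimal sketch of the shape the
public wrapper is expected to take later in the series. It is
illustrative only: the pte_cont() check and the
contpte_ptep_clear_flush_young() name are assumptions based on the cover
letter, not code added by this patch, which is a functional no-op:

static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
					 unsigned long address, pte_t *ptep)
{
	pte_t pte = READ_ONCE(*ptep);

	/* Non-contig mappings keep using the arch-private helper. */
	if (!pte_cont(pte))
		return __ptep_clear_flush_young(vma, address, ptep);

	/* Assumed future helper that ages the whole contiguous block. */
	return contpte_ptep_clear_flush_young(vma, address, ptep);
}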

Tested-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/arm64/include/asm/pgtable.h | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 77a8b100e1cd..2870bc12f288 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -138,7 +138,7 @@ static inline pteval_t __phys_to_pte_val(phys_addr_t phys)
  * so that we don't erroneously return false for pages that have been
  * remapped as PROT_NONE but are yet to be flushed from the TLB.
  * Note that we can't make any assumptions based on the state of the access
- * flag, since ptep_clear_flush_young() elides a DSB when invalidating the
+ * flag, since __ptep_clear_flush_young() elides a DSB when invalidating the
  * TLB.
  */
 #define pte_accessible(mm, pte)	\
@@ -916,8 +916,7 @@ static inline int __ptep_test_and_clear_young(struct vm_area_struct *vma,
 	return pte_young(pte);
 }
 
-#define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
-static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
+static inline int __ptep_clear_flush_young(struct vm_area_struct *vma,
 					 unsigned long address, pte_t *ptep)
 {
 	int young = __ptep_test_and_clear_young(vma, address, ptep);
@@ -1138,6 +1137,8 @@ void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte);
 #define ptep_get_and_clear			__ptep_get_and_clear
 #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
 #define ptep_test_and_clear_young		__ptep_test_and_clear_young
+#define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
+#define ptep_clear_flush_young			__ptep_clear_flush_young
 
 #endif /* !__ASSEMBLY__ */
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH v5 15/25] arm64/mm: ptep_set_wrprotect(): New layer to manage contig bit
@ 2024-02-02  8:07   ` Ryan Roberts
  0 siblings, 0 replies; 240+ messages in thread
From: Ryan Roberts @ 2024-02-02  8:07 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	Mark Rutland, David Hildenbrand, Kefeng Wang, John Hubbard,
	Zi Yan, Barry Song, Alistair Popple, Yang Shi, Nicholas Piggin,
	Christophe Leroy, Aneesh Kumar K.V, Naveen N. Rao,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin
  Cc: Ryan Roberts, linux-arm-kernel, x86, linuxppc-dev, linux-mm,
	linux-kernel

Create a new layer for the in-table PTE manipulation APIs. For now, the
existing API is prefixed with a double underscore to become the
arch-private API, and the public API is just a simple wrapper that calls
the private API.

The public API implementation will subsequently be used to transparently
manipulate the contiguous bit where appropriate. But since there are
already some contig-aware users (e.g. hugetlb, kernel mapper), we must
first ensure those users use the private API directly so that the future
contig-bit manipulations in the public API do not interfere with those
existing uses.
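
At this point the change is purely mechanical for core mm: the #define
added at the bottom of pgtable.h routes the generic name straight back
to the renamed helper. Roughly, in preprocessor terms (illustrative; the
fork path in mm/memory.c is just one example caller):

	/* generic code, unchanged: */
	ptep_set_wrprotect(src_mm, addr, src_pte);

	/* ...which, with the new #define, now compiles as: */
	__ptep_set_wrprotect(src_mm, addr, src_pte);

Only the contig-aware callers (hugetlb here) are switched to the private
name explicitly, as shown in the hugetlbpage.c hunk below.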

Tested-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/arm64/include/asm/pgtable.h | 10 ++++++----
 arch/arm64/mm/hugetlbpage.c      |  2 +-
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 2870bc12f288..4c2d6c483390 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -970,11 +970,11 @@ static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm,
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 /*
- * ptep_set_wrprotect - mark read-only while trasferring potential hardware
+ * __ptep_set_wrprotect - mark read-only while trasferring potential hardware
  * dirty status (PTE_DBM && !PTE_RDONLY) to the software PTE_DIRTY bit.
  */
-#define __HAVE_ARCH_PTEP_SET_WRPROTECT
-static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long address, pte_t *ptep)
+static inline void __ptep_set_wrprotect(struct mm_struct *mm,
+					unsigned long address, pte_t *ptep)
 {
 	pte_t old_pte, pte;
 
@@ -992,7 +992,7 @@ static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addres
 static inline void pmdp_set_wrprotect(struct mm_struct *mm,
 				      unsigned long address, pmd_t *pmdp)
 {
-	ptep_set_wrprotect(mm, address, (pte_t *)pmdp);
+	__ptep_set_wrprotect(mm, address, (pte_t *)pmdp);
 }
 
 #define pmdp_establish pmdp_establish
@@ -1139,6 +1139,8 @@ void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte);
 #define ptep_test_and_clear_young		__ptep_test_and_clear_young
 #define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
 #define ptep_clear_flush_young			__ptep_clear_flush_young
+#define __HAVE_ARCH_PTEP_SET_WRPROTECT
+#define ptep_set_wrprotect			__ptep_set_wrprotect
 
 #endif /* !__ASSEMBLY__ */
 
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 7e74e7b67107..f6612f3e1c07 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -493,7 +493,7 @@ void huge_ptep_set_wrprotect(struct mm_struct *mm,
 	pte_t pte;
 
 	if (!pte_cont(READ_ONCE(*ptep))) {
-		ptep_set_wrprotect(mm, addr, ptep);
+		__ptep_set_wrprotect(mm, addr, ptep);
 		return;
 	}
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH v5 16/25] arm64/mm: ptep_set_access_flags(): New layer to manage contig bit
@ 2024-02-02  8:07   ` Ryan Roberts
  0 siblings, 0 replies; 240+ messages in thread
From: Ryan Roberts @ 2024-02-02  8:07 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	Mark Rutland, David Hildenbrand, Kefeng Wang, John Hubbard,
	Zi Yan, Barry Song, Alistair Popple, Yang Shi, Nicholas Piggin,
	Christophe Leroy, Aneesh Kumar K.V, Naveen N. Rao,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin
  Cc: Ryan Roberts, linux-arm-kernel, x86, linuxppc-dev, linux-mm,
	linux-kernel

Create a new layer for the in-table PTE manipulation APIs. For now, the
existing API is prefixed with a double underscore to become the
arch-private API, and the public API is just a simple wrapper that calls
the private API.

The public API implementation will subsequently be used to transparently
manipulate the contiguous bit where appropriate. But since there are
already some contig-aware users (e.g. hugetlb, kernel mapper), we must
first ensure those users use the private API directly so that the future
contig-bit manipulations in the public API do not interfere with those
existing uses.

Tested-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/arm64/include/asm/pgtable.h | 10 ++++++----
 arch/arm64/mm/fault.c            |  6 +++---
 arch/arm64/mm/hugetlbpage.c      |  2 +-
 3 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 4c2d6c483390..fe27a3175618 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -312,7 +312,7 @@ static inline void __check_safe_pte_update(struct mm_struct *mm, pte_t *ptep,
 
 	/*
 	 * Check for potential race with hardware updates of the pte
-	 * (ptep_set_access_flags safely changes valid ptes without going
+	 * (__ptep_set_access_flags safely changes valid ptes without going
 	 * through an invalid entry).
 	 */
 	VM_WARN_ONCE(!pte_young(pte),
@@ -854,8 +854,7 @@ static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
 	return pte_pmd(pte_modify(pmd_pte(pmd), newprot));
 }
 
-#define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
-extern int ptep_set_access_flags(struct vm_area_struct *vma,
+extern int __ptep_set_access_flags(struct vm_area_struct *vma,
 				 unsigned long address, pte_t *ptep,
 				 pte_t entry, int dirty);
 
@@ -865,7 +864,8 @@ static inline int pmdp_set_access_flags(struct vm_area_struct *vma,
 					unsigned long address, pmd_t *pmdp,
 					pmd_t entry, int dirty)
 {
-	return ptep_set_access_flags(vma, address, (pte_t *)pmdp, pmd_pte(entry), dirty);
+	return __ptep_set_access_flags(vma, address, (pte_t *)pmdp,
+							pmd_pte(entry), dirty);
 }
 
 static inline int pud_devmap(pud_t pud)
@@ -1141,6 +1141,8 @@ void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte);
 #define ptep_clear_flush_young			__ptep_clear_flush_young
 #define __HAVE_ARCH_PTEP_SET_WRPROTECT
 #define ptep_set_wrprotect			__ptep_set_wrprotect
+#define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
+#define ptep_set_access_flags			__ptep_set_access_flags
 
 #endif /* !__ASSEMBLY__ */
 
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 23d0dfc16686..dbbc06cfb848 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -209,9 +209,9 @@ static void show_pte(unsigned long addr)
  *
  * Returns whether or not the PTE actually changed.
  */
-int ptep_set_access_flags(struct vm_area_struct *vma,
-			  unsigned long address, pte_t *ptep,
-			  pte_t entry, int dirty)
+int __ptep_set_access_flags(struct vm_area_struct *vma,
+			    unsigned long address, pte_t *ptep,
+			    pte_t entry, int dirty)
 {
 	pteval_t old_pteval, pteval;
 	pte_t pte = READ_ONCE(*ptep);
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index f6612f3e1c07..9949b80baac8 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -459,7 +459,7 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
 	pte_t orig_pte;
 
 	if (!pte_cont(pte))
-		return ptep_set_access_flags(vma, addr, ptep, pte, dirty);
+		return __ptep_set_access_flags(vma, addr, ptep, pte, dirty);
 
 	ncontig = find_num_contig(mm, addr, ptep, &pgsize);
 	dpfn = pgsize >> PAGE_SHIFT;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH v5 17/25] arm64/mm: ptep_get(): New layer to manage contig bit
@ 2024-02-02  8:07   ` Ryan Roberts
  0 siblings, 0 replies; 240+ messages in thread
From: Ryan Roberts @ 2024-02-02  8:07 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	Mark Rutland, David Hildenbrand, Kefeng Wang, John Hubbard,
	Zi Yan, Barry Song, Alistair Popple, Yang Shi, Nicholas Piggin,
	Christophe Leroy, Aneesh Kumar K.V, Naveen N. Rao,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin
  Cc: Ryan Roberts, linux-arm-kernel, x86, linuxppc-dev, linux-mm,
	linux-kernel

Create a new layer for the in-table PTE manipulation APIs. For now, the
existing API is prefixed with a double underscore to become the
arch-private API, and the public API is just a simple wrapper that calls
the private API.

The public API implementation will subsequently be used to transparently
manipulate the contiguous bit where appropriate. But since there are
already some contig-aware users (e.g. hugetlb, kernel mapper), we must
first ensure those users use the private API directly so that the future
contig-bit manipulations in the public API do not interfere with those
existing uses.

arm64 did not previously define an arch-specific ptep_get(), so override
the default version in the arch code and also define the private
__ptep_get() version. Currently both do the same thing as the default
version: a READ_ONCE(). Some arch users (hugetlb) were already using
ptep_get(), so convert those to the private API. Other callsites were
doing a direct READ_ONCE(), so convert those to use the appropriate
(public/private) API too.
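
For context, a sketch of where this is heading once the public API
becomes contpte-aware (illustrative only; contpte_ptep_get() is an
assumed name and the real logic arrives later in the series). It mirrors
the way huge_ptep_get() in the hunk below already folds per-entry state
into the value it returns:

static inline pte_t ptep_get(pte_t *ptep)
{
	pte_t pte = __ptep_get(ptep);

	if (!pte_cont(pte))
		return pte;

	/*
	 * Assumed future helper: gather the access/dirty bits of every
	 * entry in the contiguous block into the returned pte.
	 */
	return contpte_ptep_get(ptep, pte);
}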

Tested-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/arm64/include/asm/pgtable.h | 12 +++++++++---
 arch/arm64/kernel/efi.c          |  2 +-
 arch/arm64/mm/fault.c            |  4 ++--
 arch/arm64/mm/hugetlbpage.c      | 18 +++++++++---------
 arch/arm64/mm/kasan_init.c       |  2 +-
 arch/arm64/mm/mmu.c              | 12 ++++++------
 arch/arm64/mm/pageattr.c         |  4 ++--
 arch/arm64/mm/trans_pgd.c        |  2 +-
 8 files changed, 31 insertions(+), 25 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index fe27a3175618..7dc6b68ee516 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -276,6 +276,11 @@ static inline void __set_pte(pte_t *ptep, pte_t pte)
 	}
 }
 
+static inline pte_t __ptep_get(pte_t *ptep)
+{
+	return READ_ONCE(*ptep);
+}
+
 extern void __sync_icache_dcache(pte_t pteval);
 bool pgattr_change_is_safe(u64 old, u64 new);
 
@@ -303,7 +308,7 @@ static inline void __check_safe_pte_update(struct mm_struct *mm, pte_t *ptep,
 	if (!IS_ENABLED(CONFIG_DEBUG_VM))
 		return;
 
-	old_pte = READ_ONCE(*ptep);
+	old_pte = __ptep_get(ptep);
 
 	if (!pte_valid(old_pte) || !pte_valid(pte))
 		return;
@@ -905,7 +910,7 @@ static inline int __ptep_test_and_clear_young(struct vm_area_struct *vma,
 {
 	pte_t old_pte, pte;
 
-	pte = READ_ONCE(*ptep);
+	pte = __ptep_get(ptep);
 	do {
 		old_pte = pte;
 		pte = pte_mkold(pte);
@@ -978,7 +983,7 @@ static inline void __ptep_set_wrprotect(struct mm_struct *mm,
 {
 	pte_t old_pte, pte;
 
-	pte = READ_ONCE(*ptep);
+	pte = __ptep_get(ptep);
 	do {
 		old_pte = pte;
 		pte = pte_wrprotect(pte);
@@ -1130,6 +1135,7 @@ void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte);
 #define vmemmap_update_pte vmemmap_update_pte
 #endif
 
+#define ptep_get				__ptep_get
 #define set_pte					__set_pte
 #define set_ptes				__set_ptes
 #define pte_clear				__pte_clear
diff --git a/arch/arm64/kernel/efi.c b/arch/arm64/kernel/efi.c
index 44288a12fc6c..9afcc690fe73 100644
--- a/arch/arm64/kernel/efi.c
+++ b/arch/arm64/kernel/efi.c
@@ -103,7 +103,7 @@ static int __init set_permissions(pte_t *ptep, unsigned long addr, void *data)
 {
 	struct set_perm_data *spd = data;
 	const efi_memory_desc_t *md = spd->md;
-	pte_t pte = READ_ONCE(*ptep);
+	pte_t pte = __ptep_get(ptep);
 
 	if (md->attribute & EFI_MEMORY_RO)
 		pte = set_pte_bit(pte, __pgprot(PTE_RDONLY));
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index dbbc06cfb848..892e8cc8983f 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -191,7 +191,7 @@ static void show_pte(unsigned long addr)
 		if (!ptep)
 			break;
 
-		pte = READ_ONCE(*ptep);
+		pte = __ptep_get(ptep);
 		pr_cont(", pte=%016llx", pte_val(pte));
 		pte_unmap(ptep);
 	} while(0);
@@ -214,7 +214,7 @@ int __ptep_set_access_flags(struct vm_area_struct *vma,
 			    pte_t entry, int dirty)
 {
 	pteval_t old_pteval, pteval;
-	pte_t pte = READ_ONCE(*ptep);
+	pte_t pte = __ptep_get(ptep);
 
 	if (pte_same(pte, entry))
 		return 0;
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 9949b80baac8..c3db949560f9 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -152,14 +152,14 @@ pte_t huge_ptep_get(pte_t *ptep)
 {
 	int ncontig, i;
 	size_t pgsize;
-	pte_t orig_pte = ptep_get(ptep);
+	pte_t orig_pte = __ptep_get(ptep);
 
 	if (!pte_present(orig_pte) || !pte_cont(orig_pte))
 		return orig_pte;
 
 	ncontig = num_contig_ptes(page_size(pte_page(orig_pte)), &pgsize);
 	for (i = 0; i < ncontig; i++, ptep++) {
-		pte_t pte = ptep_get(ptep);
+		pte_t pte = __ptep_get(ptep);
 
 		if (pte_dirty(pte))
 			orig_pte = pte_mkdirty(orig_pte);
@@ -184,7 +184,7 @@ static pte_t get_clear_contig(struct mm_struct *mm,
 			     unsigned long pgsize,
 			     unsigned long ncontig)
 {
-	pte_t orig_pte = ptep_get(ptep);
+	pte_t orig_pte = __ptep_get(ptep);
 	unsigned long i;
 
 	for (i = 0; i < ncontig; i++, addr += pgsize, ptep++) {
@@ -408,7 +408,7 @@ pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
 {
 	int ncontig;
 	size_t pgsize;
-	pte_t orig_pte = ptep_get(ptep);
+	pte_t orig_pte = __ptep_get(ptep);
 
 	if (!pte_cont(orig_pte))
 		return __ptep_get_and_clear(mm, addr, ptep);
@@ -431,11 +431,11 @@ static int __cont_access_flags_changed(pte_t *ptep, pte_t pte, int ncontig)
 {
 	int i;
 
-	if (pte_write(pte) != pte_write(ptep_get(ptep)))
+	if (pte_write(pte) != pte_write(__ptep_get(ptep)))
 		return 1;
 
 	for (i = 0; i < ncontig; i++) {
-		pte_t orig_pte = ptep_get(ptep + i);
+		pte_t orig_pte = __ptep_get(ptep + i);
 
 		if (pte_dirty(pte) != pte_dirty(orig_pte))
 			return 1;
@@ -492,7 +492,7 @@ void huge_ptep_set_wrprotect(struct mm_struct *mm,
 	size_t pgsize;
 	pte_t pte;
 
-	if (!pte_cont(READ_ONCE(*ptep))) {
+	if (!pte_cont(__ptep_get(ptep))) {
 		__ptep_set_wrprotect(mm, addr, ptep);
 		return;
 	}
@@ -517,7 +517,7 @@ pte_t huge_ptep_clear_flush(struct vm_area_struct *vma,
 	size_t pgsize;
 	int ncontig;
 
-	if (!pte_cont(READ_ONCE(*ptep)))
+	if (!pte_cont(__ptep_get(ptep)))
 		return ptep_clear_flush(vma, addr, ptep);
 
 	ncontig = find_num_contig(mm, addr, ptep, &pgsize);
@@ -550,7 +550,7 @@ pte_t huge_ptep_modify_prot_start(struct vm_area_struct *vma, unsigned long addr
 		 * when the permission changes from executable to non-executable
 		 * in cases where cpu is affected with errata #2645198.
 		 */
-		if (pte_user_exec(READ_ONCE(*ptep)))
+		if (pte_user_exec(__ptep_get(ptep)))
 			return huge_ptep_clear_flush(vma, addr, ptep);
 	}
 	return huge_ptep_get_and_clear(vma->vm_mm, addr, ptep);
diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c
index f659bd98c63f..9ee16cfce587 100644
--- a/arch/arm64/mm/kasan_init.c
+++ b/arch/arm64/mm/kasan_init.c
@@ -113,7 +113,7 @@ static void __init kasan_pte_populate(pmd_t *pmdp, unsigned long addr,
 			memset(__va(page_phys), KASAN_SHADOW_INIT, PAGE_SIZE);
 		next = addr + PAGE_SIZE;
 		__set_pte(ptep, pfn_pte(__phys_to_pfn(page_phys), PAGE_KERNEL));
-	} while (ptep++, addr = next, addr != end && pte_none(READ_ONCE(*ptep)));
+	} while (ptep++, addr = next, addr != end && pte_none(__ptep_get(ptep)));
 }
 
 static void __init kasan_pmd_populate(pud_t *pudp, unsigned long addr,
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index bcaa5a5d86f8..8c1ab90bb1e5 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -173,7 +173,7 @@ static void init_pte(pmd_t *pmdp, unsigned long addr, unsigned long end,
 
 	ptep = pte_set_fixmap_offset(pmdp, addr);
 	do {
-		pte_t old_pte = READ_ONCE(*ptep);
+		pte_t old_pte = __ptep_get(ptep);
 
 		__set_pte(ptep, pfn_pte(__phys_to_pfn(phys), prot));
 
@@ -182,7 +182,7 @@ static void init_pte(pmd_t *pmdp, unsigned long addr, unsigned long end,
 		 * only allow updates to the permission attributes.
 		 */
 		BUG_ON(!pgattr_change_is_safe(pte_val(old_pte),
-					      READ_ONCE(pte_val(*ptep))));
+					      pte_val(__ptep_get(ptep))));
 
 		phys += PAGE_SIZE;
 	} while (ptep++, addr += PAGE_SIZE, addr != end);
@@ -854,7 +854,7 @@ static void unmap_hotplug_pte_range(pmd_t *pmdp, unsigned long addr,
 
 	do {
 		ptep = pte_offset_kernel(pmdp, addr);
-		pte = READ_ONCE(*ptep);
+		pte = __ptep_get(ptep);
 		if (pte_none(pte))
 			continue;
 
@@ -987,7 +987,7 @@ static void free_empty_pte_table(pmd_t *pmdp, unsigned long addr,
 
 	do {
 		ptep = pte_offset_kernel(pmdp, addr);
-		pte = READ_ONCE(*ptep);
+		pte = __ptep_get(ptep);
 
 		/*
 		 * This is just a sanity check here which verifies that
@@ -1006,7 +1006,7 @@ static void free_empty_pte_table(pmd_t *pmdp, unsigned long addr,
 	 */
 	ptep = pte_offset_kernel(pmdp, 0UL);
 	for (i = 0; i < PTRS_PER_PTE; i++) {
-		if (!pte_none(READ_ONCE(ptep[i])))
+		if (!pte_none(__ptep_get(&ptep[i])))
 			return;
 	}
 
@@ -1503,7 +1503,7 @@ pte_t ptep_modify_prot_start(struct vm_area_struct *vma, unsigned long addr, pte
 		 * when the permission changes from executable to non-executable
 		 * in cases where cpu is affected with errata #2645198.
 		 */
-		if (pte_user_exec(READ_ONCE(*ptep)))
+		if (pte_user_exec(ptep_get(ptep)))
 			return ptep_clear_flush(vma, addr, ptep);
 	}
 	return ptep_get_and_clear(vma->vm_mm, addr, ptep);
diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
index a7996d8edf0a..0c4e3ecf989d 100644
--- a/arch/arm64/mm/pageattr.c
+++ b/arch/arm64/mm/pageattr.c
@@ -36,7 +36,7 @@ bool can_set_direct_map(void)
 static int change_page_range(pte_t *ptep, unsigned long addr, void *data)
 {
 	struct page_change_data *cdata = data;
-	pte_t pte = READ_ONCE(*ptep);
+	pte_t pte = __ptep_get(ptep);
 
 	pte = clear_pte_bit(pte, cdata->clear_mask);
 	pte = set_pte_bit(pte, cdata->set_mask);
@@ -245,5 +245,5 @@ bool kernel_page_present(struct page *page)
 		return true;
 
 	ptep = pte_offset_kernel(pmdp, addr);
-	return pte_valid(READ_ONCE(*ptep));
+	return pte_valid(__ptep_get(ptep));
 }
diff --git a/arch/arm64/mm/trans_pgd.c b/arch/arm64/mm/trans_pgd.c
index 230b607cf881..5139a28130c0 100644
--- a/arch/arm64/mm/trans_pgd.c
+++ b/arch/arm64/mm/trans_pgd.c
@@ -33,7 +33,7 @@ static void *trans_alloc(struct trans_pgd_info *info)
 
 static void _copy_pte(pte_t *dst_ptep, pte_t *src_ptep, unsigned long addr)
 {
-	pte_t pte = READ_ONCE(*src_ptep);
+	pte_t pte = __ptep_get(src_ptep);
 
 	if (pte_valid(pte)) {
 		/*
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH v5 17/25] arm64/mm: ptep_get(): New layer to manage contig bit
@ 2024-02-02  8:07   ` Ryan Roberts
  0 siblings, 0 replies; 240+ messages in thread
From: Ryan Roberts @ 2024-02-02  8:07 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	Mark Rutland, David Hildenbrand, Kefeng Wang, John Hubbard,
	Zi Yan, Barry Song, Alistair Popple, Yang Shi, Nicholas Piggin,
	Christophe Leroy, Aneesh Kumar K.V, Naveen N. Rao,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin
  Cc: Ryan Roberts, linux-arm-kernel, x86, linuxppc-dev, linux-mm,
	linux-kernel

Create a new layer for the in-table PTE manipulation APIs. For now, the
existing API is prefixed with a double underscore to become the
arch-private API, and the public API is just a simple wrapper that calls
the private API.

The public API implementation will subsequently be used to transparently
manipulate the contiguous bit where appropriate. But since there are
already some contig-aware users (e.g. hugetlb, kernel mapper), we must
first ensure those users use the private API directly so that the future
contig-bit manipulations in the public API do not interfere with those
existing uses.

arm64 did not previously define an arch-specific ptep_get(), so override
the default version in the arch code and also define the private
__ptep_get() version. Currently both do the same thing as the default
version: a READ_ONCE(). Some arch users (hugetlb) were already using
ptep_get(), so convert those to the private API. Other callsites were
doing a direct READ_ONCE(), so convert those to use the appropriate
(public/private) API too.

Tested-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/arm64/include/asm/pgtable.h | 12 +++++++++---
 arch/arm64/kernel/efi.c          |  2 +-
 arch/arm64/mm/fault.c            |  4 ++--
 arch/arm64/mm/hugetlbpage.c      | 18 +++++++++---------
 arch/arm64/mm/kasan_init.c       |  2 +-
 arch/arm64/mm/mmu.c              | 12 ++++++------
 arch/arm64/mm/pageattr.c         |  4 ++--
 arch/arm64/mm/trans_pgd.c        |  2 +-
 8 files changed, 31 insertions(+), 25 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index fe27a3175618..7dc6b68ee516 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -276,6 +276,11 @@ static inline void __set_pte(pte_t *ptep, pte_t pte)
 	}
 }
 
+static inline pte_t __ptep_get(pte_t *ptep)
+{
+	return READ_ONCE(*ptep);
+}
+
 extern void __sync_icache_dcache(pte_t pteval);
 bool pgattr_change_is_safe(u64 old, u64 new);
 
@@ -303,7 +308,7 @@ static inline void __check_safe_pte_update(struct mm_struct *mm, pte_t *ptep,
 	if (!IS_ENABLED(CONFIG_DEBUG_VM))
 		return;
 
-	old_pte = READ_ONCE(*ptep);
+	old_pte = __ptep_get(ptep);
 
 	if (!pte_valid(old_pte) || !pte_valid(pte))
 		return;
@@ -905,7 +910,7 @@ static inline int __ptep_test_and_clear_young(struct vm_area_struct *vma,
 {
 	pte_t old_pte, pte;
 
-	pte = READ_ONCE(*ptep);
+	pte = __ptep_get(ptep);
 	do {
 		old_pte = pte;
 		pte = pte_mkold(pte);
@@ -978,7 +983,7 @@ static inline void __ptep_set_wrprotect(struct mm_struct *mm,
 {
 	pte_t old_pte, pte;
 
-	pte = READ_ONCE(*ptep);
+	pte = __ptep_get(ptep);
 	do {
 		old_pte = pte;
 		pte = pte_wrprotect(pte);
@@ -1130,6 +1135,7 @@ void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte);
 #define vmemmap_update_pte vmemmap_update_pte
 #endif
 
+#define ptep_get				__ptep_get
 #define set_pte					__set_pte
 #define set_ptes				__set_ptes
 #define pte_clear				__pte_clear
diff --git a/arch/arm64/kernel/efi.c b/arch/arm64/kernel/efi.c
index 44288a12fc6c..9afcc690fe73 100644
--- a/arch/arm64/kernel/efi.c
+++ b/arch/arm64/kernel/efi.c
@@ -103,7 +103,7 @@ static int __init set_permissions(pte_t *ptep, unsigned long addr, void *data)
 {
 	struct set_perm_data *spd = data;
 	const efi_memory_desc_t *md = spd->md;
-	pte_t pte = READ_ONCE(*ptep);
+	pte_t pte = __ptep_get(ptep);
 
 	if (md->attribute & EFI_MEMORY_RO)
 		pte = set_pte_bit(pte, __pgprot(PTE_RDONLY));
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index dbbc06cfb848..892e8cc8983f 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -191,7 +191,7 @@ static void show_pte(unsigned long addr)
 		if (!ptep)
 			break;
 
-		pte = READ_ONCE(*ptep);
+		pte = __ptep_get(ptep);
 		pr_cont(", pte=%016llx", pte_val(pte));
 		pte_unmap(ptep);
 	} while(0);
@@ -214,7 +214,7 @@ int __ptep_set_access_flags(struct vm_area_struct *vma,
 			    pte_t entry, int dirty)
 {
 	pteval_t old_pteval, pteval;
-	pte_t pte = READ_ONCE(*ptep);
+	pte_t pte = __ptep_get(ptep);
 
 	if (pte_same(pte, entry))
 		return 0;
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 9949b80baac8..c3db949560f9 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -152,14 +152,14 @@ pte_t huge_ptep_get(pte_t *ptep)
 {
 	int ncontig, i;
 	size_t pgsize;
-	pte_t orig_pte = ptep_get(ptep);
+	pte_t orig_pte = __ptep_get(ptep);
 
 	if (!pte_present(orig_pte) || !pte_cont(orig_pte))
 		return orig_pte;
 
 	ncontig = num_contig_ptes(page_size(pte_page(orig_pte)), &pgsize);
 	for (i = 0; i < ncontig; i++, ptep++) {
-		pte_t pte = ptep_get(ptep);
+		pte_t pte = __ptep_get(ptep);
 
 		if (pte_dirty(pte))
 			orig_pte = pte_mkdirty(orig_pte);
@@ -184,7 +184,7 @@ static pte_t get_clear_contig(struct mm_struct *mm,
 			     unsigned long pgsize,
 			     unsigned long ncontig)
 {
-	pte_t orig_pte = ptep_get(ptep);
+	pte_t orig_pte = __ptep_get(ptep);
 	unsigned long i;
 
 	for (i = 0; i < ncontig; i++, addr += pgsize, ptep++) {
@@ -408,7 +408,7 @@ pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
 {
 	int ncontig;
 	size_t pgsize;
-	pte_t orig_pte = ptep_get(ptep);
+	pte_t orig_pte = __ptep_get(ptep);
 
 	if (!pte_cont(orig_pte))
 		return __ptep_get_and_clear(mm, addr, ptep);
@@ -431,11 +431,11 @@ static int __cont_access_flags_changed(pte_t *ptep, pte_t pte, int ncontig)
 {
 	int i;
 
-	if (pte_write(pte) != pte_write(ptep_get(ptep)))
+	if (pte_write(pte) != pte_write(__ptep_get(ptep)))
 		return 1;
 
 	for (i = 0; i < ncontig; i++) {
-		pte_t orig_pte = ptep_get(ptep + i);
+		pte_t orig_pte = __ptep_get(ptep + i);
 
 		if (pte_dirty(pte) != pte_dirty(orig_pte))
 			return 1;
@@ -492,7 +492,7 @@ void huge_ptep_set_wrprotect(struct mm_struct *mm,
 	size_t pgsize;
 	pte_t pte;
 
-	if (!pte_cont(READ_ONCE(*ptep))) {
+	if (!pte_cont(__ptep_get(ptep))) {
 		__ptep_set_wrprotect(mm, addr, ptep);
 		return;
 	}
@@ -517,7 +517,7 @@ pte_t huge_ptep_clear_flush(struct vm_area_struct *vma,
 	size_t pgsize;
 	int ncontig;
 
-	if (!pte_cont(READ_ONCE(*ptep)))
+	if (!pte_cont(__ptep_get(ptep)))
 		return ptep_clear_flush(vma, addr, ptep);
 
 	ncontig = find_num_contig(mm, addr, ptep, &pgsize);
@@ -550,7 +550,7 @@ pte_t huge_ptep_modify_prot_start(struct vm_area_struct *vma, unsigned long addr
 		 * when the permission changes from executable to non-executable
 		 * in cases where cpu is affected with errata #2645198.
 		 */
-		if (pte_user_exec(READ_ONCE(*ptep)))
+		if (pte_user_exec(__ptep_get(ptep)))
 			return huge_ptep_clear_flush(vma, addr, ptep);
 	}
 	return huge_ptep_get_and_clear(vma->vm_mm, addr, ptep);
diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c
index f659bd98c63f..9ee16cfce587 100644
--- a/arch/arm64/mm/kasan_init.c
+++ b/arch/arm64/mm/kasan_init.c
@@ -113,7 +113,7 @@ static void __init kasan_pte_populate(pmd_t *pmdp, unsigned long addr,
 			memset(__va(page_phys), KASAN_SHADOW_INIT, PAGE_SIZE);
 		next = addr + PAGE_SIZE;
 		__set_pte(ptep, pfn_pte(__phys_to_pfn(page_phys), PAGE_KERNEL));
-	} while (ptep++, addr = next, addr != end && pte_none(READ_ONCE(*ptep)));
+	} while (ptep++, addr = next, addr != end && pte_none(__ptep_get(ptep)));
 }
 
 static void __init kasan_pmd_populate(pud_t *pudp, unsigned long addr,
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index bcaa5a5d86f8..8c1ab90bb1e5 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -173,7 +173,7 @@ static void init_pte(pmd_t *pmdp, unsigned long addr, unsigned long end,
 
 	ptep = pte_set_fixmap_offset(pmdp, addr);
 	do {
-		pte_t old_pte = READ_ONCE(*ptep);
+		pte_t old_pte = __ptep_get(ptep);
 
 		__set_pte(ptep, pfn_pte(__phys_to_pfn(phys), prot));
 
@@ -182,7 +182,7 @@ static void init_pte(pmd_t *pmdp, unsigned long addr, unsigned long end,
 		 * only allow updates to the permission attributes.
 		 */
 		BUG_ON(!pgattr_change_is_safe(pte_val(old_pte),
-					      READ_ONCE(pte_val(*ptep))));
+					      pte_val(__ptep_get(ptep))));
 
 		phys += PAGE_SIZE;
 	} while (ptep++, addr += PAGE_SIZE, addr != end);
@@ -854,7 +854,7 @@ static void unmap_hotplug_pte_range(pmd_t *pmdp, unsigned long addr,
 
 	do {
 		ptep = pte_offset_kernel(pmdp, addr);
-		pte = READ_ONCE(*ptep);
+		pte = __ptep_get(ptep);
 		if (pte_none(pte))
 			continue;
 
@@ -987,7 +987,7 @@ static void free_empty_pte_table(pmd_t *pmdp, unsigned long addr,
 
 	do {
 		ptep = pte_offset_kernel(pmdp, addr);
-		pte = READ_ONCE(*ptep);
+		pte = __ptep_get(ptep);
 
 		/*
 		 * This is just a sanity check here which verifies that
@@ -1006,7 +1006,7 @@ static void free_empty_pte_table(pmd_t *pmdp, unsigned long addr,
 	 */
 	ptep = pte_offset_kernel(pmdp, 0UL);
 	for (i = 0; i < PTRS_PER_PTE; i++) {
-		if (!pte_none(READ_ONCE(ptep[i])))
+		if (!pte_none(__ptep_get(&ptep[i])))
 			return;
 	}
 
@@ -1503,7 +1503,7 @@ pte_t ptep_modify_prot_start(struct vm_area_struct *vma, unsigned long addr, pte
 		 * when the permission changes from executable to non-executable
 		 * in cases where cpu is affected with errata #2645198.
 		 */
-		if (pte_user_exec(READ_ONCE(*ptep)))
+		if (pte_user_exec(ptep_get(ptep)))
 			return ptep_clear_flush(vma, addr, ptep);
 	}
 	return ptep_get_and_clear(vma->vm_mm, addr, ptep);
diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
index a7996d8edf0a..0c4e3ecf989d 100644
--- a/arch/arm64/mm/pageattr.c
+++ b/arch/arm64/mm/pageattr.c
@@ -36,7 +36,7 @@ bool can_set_direct_map(void)
 static int change_page_range(pte_t *ptep, unsigned long addr, void *data)
 {
 	struct page_change_data *cdata = data;
-	pte_t pte = READ_ONCE(*ptep);
+	pte_t pte = __ptep_get(ptep);
 
 	pte = clear_pte_bit(pte, cdata->clear_mask);
 	pte = set_pte_bit(pte, cdata->set_mask);
@@ -245,5 +245,5 @@ bool kernel_page_present(struct page *page)
 		return true;
 
 	ptep = pte_offset_kernel(pmdp, addr);
-	return pte_valid(READ_ONCE(*ptep));
+	return pte_valid(__ptep_get(ptep));
 }
diff --git a/arch/arm64/mm/trans_pgd.c b/arch/arm64/mm/trans_pgd.c
index 230b607cf881..5139a28130c0 100644
--- a/arch/arm64/mm/trans_pgd.c
+++ b/arch/arm64/mm/trans_pgd.c
@@ -33,7 +33,7 @@ static void *trans_alloc(struct trans_pgd_info *info)
 
 static void _copy_pte(pte_t *dst_ptep, pte_t *src_ptep, unsigned long addr)
 {
-	pte_t pte = READ_ONCE(*src_ptep);
+	pte_t pte = __ptep_get(src_ptep);
 
 	if (pte_valid(pte)) {
 		/*
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH v5 18/25] arm64/mm: Split __flush_tlb_range() to elide trailing DSB
@ 2024-02-02  8:07   ` Ryan Roberts
  0 siblings, 0 replies; 240+ messages in thread
From: Ryan Roberts @ 2024-02-02  8:07 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	Mark Rutland, David Hildenbrand, Kefeng Wang, John Hubbard,
	Zi Yan, Barry Song, Alistair Popple, Yang Shi, Nicholas Piggin,
	Christophe Leroy, Aneesh Kumar K.V, Naveen N. Rao,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin
  Cc: Ryan Roberts, linux-arm-kernel, x86, linuxppc-dev, linux-mm,
	linux-kernel

Split __flush_tlb_range() into __flush_tlb_range_nosync() +
__flush_tlb_range(), in the same way as the existing flush_tlb_page()
arrangement. This allows calling __flush_tlb_range_nosync() to elide the
trailing DSB. Forthcoming "contpte" code will take advantage of this
when clearing the young bit from a contiguous range of ptes.
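
A hedged sketch of the intended use, lifted from the contpte patch later
in this series (vma and addr are assumed to be in scope in the caller;
the referenced comment lives in __ptep_clear_flush_young()):

	/*
	 * Skip the trailing DSB when only the access flag was cleared; see
	 * the comment in __ptep_clear_flush_young() for why that is safe.
	 */
	addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
	__flush_tlb_range_nosync(vma, addr, addr + CONT_PTE_SIZE,
				 PAGE_SIZE, true, 3);

Existing callers see no change in behaviour, since the new
__flush_tlb_range() is simply __flush_tlb_range_nosync() followed by
dsb(ish).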

Tested-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/arm64/include/asm/tlbflush.h | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index 79e932a1bdf8..50a765917327 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -422,7 +422,7 @@ do {									\
 #define __flush_s2_tlb_range_op(op, start, pages, stride, tlb_level) \
 	__flush_tlb_range_op(op, start, pages, stride, 0, tlb_level, false, kvm_lpa2_is_enabled());
 
-static inline void __flush_tlb_range(struct vm_area_struct *vma,
+static inline void __flush_tlb_range_nosync(struct vm_area_struct *vma,
 				     unsigned long start, unsigned long end,
 				     unsigned long stride, bool last_level,
 				     int tlb_level)
@@ -456,10 +456,19 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma,
 		__flush_tlb_range_op(vae1is, start, pages, stride, asid,
 				     tlb_level, true, lpa2_is_enabled());
 
-	dsb(ish);
 	mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, start, end);
 }
 
+static inline void __flush_tlb_range(struct vm_area_struct *vma,
+				     unsigned long start, unsigned long end,
+				     unsigned long stride, bool last_level,
+				     int tlb_level)
+{
+	__flush_tlb_range_nosync(vma, start, end, stride,
+				 last_level, tlb_level);
+	dsb(ish);
+}
+
 static inline void flush_tlb_range(struct vm_area_struct *vma,
 				   unsigned long start, unsigned long end)
 {
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH v5 19/25] arm64/mm: Wire up PTE_CONT for user mappings
@ 2024-02-02  8:07   ` Ryan Roberts
  0 siblings, 0 replies; 240+ messages in thread
From: Ryan Roberts @ 2024-02-02  8:07 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	Mark Rutland, David Hildenbrand, Kefeng Wang, John Hubbard,
	Zi Yan, Barry Song, Alistair Popple, Yang Shi, Nicholas Piggin,
	Christophe Leroy, Aneesh Kumar K.V, Naveen N. Rao,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin
  Cc: Ryan Roberts, linux-arm-kernel, x86, linuxppc-dev, linux-mm,
	linux-kernel

With the ptep API sufficiently refactored, we can now introduce a new
"contpte" API layer, which transparently manages the PTE_CONT bit for
user mappings.

In this initial implementation, only suitable batches of PTEs, set via
set_ptes(), are mapped with the PTE_CONT bit. Any subsequent
modification of individual PTEs will cause an "unfold" operation to
repaint the contpte block as individual PTEs before performing the
requested operation. While a modification of a single PTE could cause
the block of PTEs to which it belongs to become eligible for "folding"
into a contpte entry, "folding" is not performed in this initial
implementation due to the cost of checking that the requirements are met.
Due to this, contpte mappings will degrade back to normal pte mappings
over time if/when protections are changed. This will be solved in a
future patch.
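
To make the unfold-before-modify pattern concrete, this is the shape of
the write-protect wrapper from the pgtable.h hunk below (comments added
here for illustration only):

	#define __HAVE_ARCH_PTEP_SET_WRPROTECT
	static inline void ptep_set_wrprotect(struct mm_struct *mm,
					unsigned long addr, pte_t *ptep)
	{
		/* Repaint the block as individual ptes if this is a contpte entry. */
		contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
		/* Then defer to the arch-private helper, exactly as before. */
		__ptep_set_wrprotect(mm, addr, ptep);
	}

contpte_try_unfold() is effectively a no-op unless the existing entry is
a valid contpte entry, so the common non-contpte case stays cheap.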

Since a contpte block only has a single access and dirty bit, the
semantic here changes slightly; when getting a pte (e.g. ptep_get())
that is part of a contpte mapping, the access and dirty information are
pulled from the block (so all ptes in the block return the same
access/dirty info). When changing the access/dirty info on a pte (e.g.
ptep_set_access_flags()) that is part of a contpte mapping, this change
will affect the whole contpte block. This works fine in practice
since we guarantee that only a single folio is mapped by a contpte
block, and the core-mm tracks access/dirty information per folio.
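
For example, if only one pte in a contpte block has the HW dirty bit set,
ptep_get() on any pte of that block reports a dirty pte; the bits are
gathered across the whole block, as in contpte_ptep_get() from the new
contpte.c below:

	ptep = contpte_align_down(ptep);

	for (i = 0; i < CONT_PTES; i++, ptep++) {
		pte = __ptep_get(ptep);

		/* Accumulate access/dirty from every pte in the block. */
		if (pte_dirty(pte))
			orig_pte = pte_mkdirty(orig_pte);

		if (pte_young(pte))
			orig_pte = pte_mkyoung(orig_pte);
	}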

In order for the public functions, which used to be pure inline, to
continue to be callable by modules, export all the contpte_* symbols
that are now called by those public inline functions.

The feature is enabled/disabled with the ARM64_CONTPTE Kconfig parameter
at build time. It defaults to enabled as long as its dependency,
TRANSPARENT_HUGEPAGE, is also enabled. The core-mm depends upon
TRANSPARENT_HUGEPAGE to be able to allocate large folios, so if it's not
enabled, then there is no chance of meeting the physical contiguity
requirement for contpte mappings.
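
For example, a config fragment that picks the feature up looks like this
(shown only as an illustration; the ARM64_CONTPTE prompt is offered only
under EXPERT and defaults to y):

	CONFIG_TRANSPARENT_HUGEPAGE=y
	CONFIG_ARM64_CONTPTE=y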

Tested-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/arm64/Kconfig               |   9 +
 arch/arm64/include/asm/pgtable.h | 161 ++++++++++++++++++
 arch/arm64/mm/Makefile           |   1 +
 arch/arm64/mm/contpte.c          | 283 +++++++++++++++++++++++++++++++
 4 files changed, 454 insertions(+)
 create mode 100644 arch/arm64/mm/contpte.c

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index d86d7f4758b5..1442e8ed95b6 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -2230,6 +2230,15 @@ config UNWIND_PATCH_PAC_INTO_SCS
 	select UNWIND_TABLES
 	select DYNAMIC_SCS
 
+config ARM64_CONTPTE
+	bool "Contiguous PTE mappings for user memory" if EXPERT
+	depends on TRANSPARENT_HUGEPAGE
+	default y
+	help
+	  When enabled, user mappings are configured using the PTE contiguous
+	  bit, for any mappings that meet the size and alignment requirements.
+	  This reduces TLB pressure and improves performance.
+
 endmenu # "Kernel Features"
 
 menu "Boot options"
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 7dc6b68ee516..34892a95403d 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -133,6 +133,10 @@ static inline pteval_t __phys_to_pte_val(phys_addr_t phys)
  */
 #define pte_valid_not_user(pte) \
 	((pte_val(pte) & (PTE_VALID | PTE_USER | PTE_UXN)) == (PTE_VALID | PTE_UXN))
+/*
+ * Returns true if the pte is valid and has the contiguous bit set.
+ */
+#define pte_valid_cont(pte)	(pte_valid(pte) && pte_cont(pte))
 /*
  * Could the pte be present in the TLB? We must check mm_tlb_flush_pending
  * so that we don't erroneously return false for pages that have been
@@ -1135,6 +1139,161 @@ void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte);
 #define vmemmap_update_pte vmemmap_update_pte
 #endif
 
+#ifdef CONFIG_ARM64_CONTPTE
+
+/*
+ * The contpte APIs are used to transparently manage the contiguous bit in ptes
+ * where it is possible and makes sense to do so. The PTE_CONT bit is considered
+ * a private implementation detail of the public ptep API (see below).
+ */
+extern void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
+				pte_t *ptep, pte_t pte);
+extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
+extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
+extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
+				pte_t *ptep, pte_t pte, unsigned int nr);
+extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
+				unsigned long addr, pte_t *ptep);
+extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
+				unsigned long addr, pte_t *ptep);
+extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
+				unsigned long addr, pte_t *ptep,
+				pte_t entry, int dirty);
+
+static inline void contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
+					pte_t *ptep, pte_t pte)
+{
+	if (unlikely(pte_valid_cont(pte)))
+		__contpte_try_unfold(mm, addr, ptep, pte);
+}
+
+/*
+ * The below functions constitute the public API that arm64 presents to the
+ * core-mm to manipulate PTE entries within their page tables (or at least this
+ * is the subset of the API that arm64 needs to implement). These public
+ * versions will automatically and transparently apply the contiguous bit where
+ * it makes sense to do so. Therefore any users that are contig-aware (e.g.
+ * hugetlb, kernel mapper) should NOT use these APIs, but instead use the
+ * private versions, which are prefixed with double underscore. All of these
+ * APIs except for ptep_get_lockless() are expected to be called with the PTL
+ * held.
+ */
+
+#define ptep_get ptep_get
+static inline pte_t ptep_get(pte_t *ptep)
+{
+	pte_t pte = __ptep_get(ptep);
+
+	if (likely(!pte_valid_cont(pte)))
+		return pte;
+
+	return contpte_ptep_get(ptep, pte);
+}
+
+#define ptep_get_lockless ptep_get_lockless
+static inline pte_t ptep_get_lockless(pte_t *ptep)
+{
+	pte_t pte = __ptep_get(ptep);
+
+	if (likely(!pte_valid_cont(pte)))
+		return pte;
+
+	return contpte_ptep_get_lockless(ptep);
+}
+
+static inline void set_pte(pte_t *ptep, pte_t pte)
+{
+	/*
+	 * We don't have the mm or vaddr so cannot unfold contig entries (since
+	 * it requires tlb maintenance). set_pte() is not used in core code, so
+	 * this should never even be called. Regardless, do our best to service
+	 * any call and emit a warning if there is any attempt to set a pte on
+	 * top of an existing contig range.
+	 */
+	pte_t orig_pte = __ptep_get(ptep);
+
+	WARN_ON_ONCE(pte_valid_cont(orig_pte));
+	__set_pte(ptep, pte_mknoncont(pte));
+}
+
+#define set_ptes set_ptes
+static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
+				pte_t *ptep, pte_t pte, unsigned int nr)
+{
+	pte = pte_mknoncont(pte);
+
+	if (likely(nr == 1)) {
+		contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
+		__set_ptes(mm, addr, ptep, pte, 1);
+	} else {
+		contpte_set_ptes(mm, addr, ptep, pte, nr);
+	}
+}
+
+static inline void pte_clear(struct mm_struct *mm,
+				unsigned long addr, pte_t *ptep)
+{
+	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
+	__pte_clear(mm, addr, ptep);
+}
+
+#define __HAVE_ARCH_PTEP_GET_AND_CLEAR
+static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
+				unsigned long addr, pte_t *ptep)
+{
+	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
+	return __ptep_get_and_clear(mm, addr, ptep);
+}
+
+#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
+static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
+				unsigned long addr, pte_t *ptep)
+{
+	pte_t orig_pte = __ptep_get(ptep);
+
+	if (likely(!pte_valid_cont(orig_pte)))
+		return __ptep_test_and_clear_young(vma, addr, ptep);
+
+	return contpte_ptep_test_and_clear_young(vma, addr, ptep);
+}
+
+#define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
+static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
+				unsigned long addr, pte_t *ptep)
+{
+	pte_t orig_pte = __ptep_get(ptep);
+
+	if (likely(!pte_valid_cont(orig_pte)))
+		return __ptep_clear_flush_young(vma, addr, ptep);
+
+	return contpte_ptep_clear_flush_young(vma, addr, ptep);
+}
+
+#define __HAVE_ARCH_PTEP_SET_WRPROTECT
+static inline void ptep_set_wrprotect(struct mm_struct *mm,
+				unsigned long addr, pte_t *ptep)
+{
+	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
+	__ptep_set_wrprotect(mm, addr, ptep);
+}
+
+#define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
+static inline int ptep_set_access_flags(struct vm_area_struct *vma,
+				unsigned long addr, pte_t *ptep,
+				pte_t entry, int dirty)
+{
+	pte_t orig_pte = __ptep_get(ptep);
+
+	entry = pte_mknoncont(entry);
+
+	if (likely(!pte_valid_cont(orig_pte)))
+		return __ptep_set_access_flags(vma, addr, ptep, entry, dirty);
+
+	return contpte_ptep_set_access_flags(vma, addr, ptep, entry, dirty);
+}
+
+#else /* CONFIG_ARM64_CONTPTE */
+
 #define ptep_get				__ptep_get
 #define set_pte					__set_pte
 #define set_ptes				__set_ptes
@@ -1150,6 +1309,8 @@ void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte);
 #define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
 #define ptep_set_access_flags			__ptep_set_access_flags
 
+#endif /* CONFIG_ARM64_CONTPTE */
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* __ASM_PGTABLE_H */
diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
index dbd1bc95967d..60454256945b 100644
--- a/arch/arm64/mm/Makefile
+++ b/arch/arm64/mm/Makefile
@@ -3,6 +3,7 @@ obj-y				:= dma-mapping.o extable.o fault.o init.o \
 				   cache.o copypage.o flush.o \
 				   ioremap.o mmap.o pgd.o mmu.o \
 				   context.o proc.o pageattr.o fixmap.o
+obj-$(CONFIG_ARM64_CONTPTE)	+= contpte.o
 obj-$(CONFIG_HUGETLB_PAGE)	+= hugetlbpage.o
 obj-$(CONFIG_PTDUMP_CORE)	+= ptdump.o
 obj-$(CONFIG_PTDUMP_DEBUGFS)	+= ptdump_debugfs.o
diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
new file mode 100644
index 000000000000..bfb50e6b44c7
--- /dev/null
+++ b/arch/arm64/mm/contpte.c
@@ -0,0 +1,283 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2023 ARM Ltd.
+ */
+
+#include <linux/mm.h>
+#include <linux/export.h>
+#include <asm/tlbflush.h>
+
+static inline bool mm_is_user(struct mm_struct *mm)
+{
+	/*
+	 * Don't attempt to apply the contig bit to kernel mappings, because
+	 * dynamically adding/removing the contig bit can cause page faults.
+	 * These racing faults are ok for user space, since they get serialized
+	 * on the PTL. But kernel mappings can't tolerate faults.
+	 */
+	return mm != &init_mm;
+}
+
+static inline pte_t *contpte_align_down(pte_t *ptep)
+{
+	return (pte_t *)(ALIGN_DOWN((unsigned long)ptep >> 3, CONT_PTES) << 3);
+}
+
+static void contpte_convert(struct mm_struct *mm, unsigned long addr,
+			    pte_t *ptep, pte_t pte)
+{
+	struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
+	unsigned long start_addr;
+	pte_t *start_ptep;
+	int i;
+
+	start_ptep = ptep = contpte_align_down(ptep);
+	start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
+	pte = pfn_pte(ALIGN_DOWN(pte_pfn(pte), CONT_PTES), pte_pgprot(pte));
+
+	for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE) {
+		pte_t ptent = __ptep_get_and_clear(mm, addr, ptep);
+
+		if (pte_dirty(ptent))
+			pte = pte_mkdirty(pte);
+
+		if (pte_young(ptent))
+			pte = pte_mkyoung(pte);
+	}
+
+	__flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
+
+	__set_ptes(mm, start_addr, start_ptep, pte, CONT_PTES);
+}
+
+void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
+			pte_t *ptep, pte_t pte)
+{
+	/*
+	 * We have already checked that the ptes are contiguous in
+	 * contpte_try_unfold(), so just check that the mm is user space.
+	 */
+
+	if (!mm_is_user(mm))
+		return;
+
+	pte = pte_mknoncont(pte);
+	contpte_convert(mm, addr, ptep, pte);
+}
+EXPORT_SYMBOL(__contpte_try_unfold);
+
+pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
+{
+	/*
+	 * Gather access/dirty bits, which may be populated in any of the ptes
+	 * of the contig range. We are guaranteed to be holding the PTL, so any
+	 * contiguous range cannot be unfolded or otherwise modified under our
+	 * feet.
+	 */
+
+	pte_t pte;
+	int i;
+
+	ptep = contpte_align_down(ptep);
+
+	for (i = 0; i < CONT_PTES; i++, ptep++) {
+		pte = __ptep_get(ptep);
+
+		if (pte_dirty(pte))
+			orig_pte = pte_mkdirty(orig_pte);
+
+		if (pte_young(pte))
+			orig_pte = pte_mkyoung(orig_pte);
+	}
+
+	return orig_pte;
+}
+EXPORT_SYMBOL(contpte_ptep_get);
+
+pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
+{
+	/*
+	 * Gather access/dirty bits, which may be populated in any of the ptes
+	 * of the contig range. We may not be holding the PTL, so any contiguous
+	 * range may be unfolded/modified/refolded under our feet. Therefore we
+	 * ensure we read a _consistent_ contpte range by checking that all ptes
+	 * in the range are valid and have CONT_PTE set, that all pfns are
+	 * contiguous and that all pgprots are the same (ignoring access/dirty).
+	 * If we find a pte that is not consistent, then we must be racing with
+	 * an update so start again. If the target pte does not have CONT_PTE
+	 * set then that is considered consistent on its own because it is not
+	 * part of a contpte range.
+	 */
+
+	pgprot_t orig_prot;
+	unsigned long pfn;
+	pte_t orig_pte;
+	pgprot_t prot;
+	pte_t *ptep;
+	pte_t pte;
+	int i;
+
+retry:
+	orig_pte = __ptep_get(orig_ptep);
+
+	if (!pte_valid_cont(orig_pte))
+		return orig_pte;
+
+	orig_prot = pte_pgprot(pte_mkold(pte_mkclean(orig_pte)));
+	ptep = contpte_align_down(orig_ptep);
+	pfn = pte_pfn(orig_pte) - (orig_ptep - ptep);
+
+	for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
+		pte = __ptep_get(ptep);
+		prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
+
+		if (!pte_valid_cont(pte) ||
+		   pte_pfn(pte) != pfn ||
+		   pgprot_val(prot) != pgprot_val(orig_prot))
+			goto retry;
+
+		if (pte_dirty(pte))
+			orig_pte = pte_mkdirty(orig_pte);
+
+		if (pte_young(pte))
+			orig_pte = pte_mkyoung(orig_pte);
+	}
+
+	return orig_pte;
+}
+EXPORT_SYMBOL(contpte_ptep_get_lockless);
+
+void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
+					pte_t *ptep, pte_t pte, unsigned int nr)
+{
+	unsigned long next;
+	unsigned long end;
+	unsigned long pfn;
+	pgprot_t prot;
+
+	/*
+	 * The set_ptes() spec guarantees that when nr > 1, the initial state of
+	 * all ptes is not-present. Therefore we never need to unfold or
+	 * otherwise invalidate a range before we set the new ptes.
+	 * contpte_set_ptes() should never be called for nr < 2.
+	 */
+	VM_WARN_ON(nr == 1);
+
+	if (!mm_is_user(mm))
+		return __set_ptes(mm, addr, ptep, pte, nr);
+
+	end = addr + (nr << PAGE_SHIFT);
+	pfn = pte_pfn(pte);
+	prot = pte_pgprot(pte);
+
+	do {
+		next = pte_cont_addr_end(addr, end);
+		nr = (next - addr) >> PAGE_SHIFT;
+		pte = pfn_pte(pfn, prot);
+
+		if (((addr | next | (pfn << PAGE_SHIFT)) & ~CONT_PTE_MASK) == 0)
+			pte = pte_mkcont(pte);
+		else
+			pte = pte_mknoncont(pte);
+
+		__set_ptes(mm, addr, ptep, pte, nr);
+
+		addr = next;
+		ptep += nr;
+		pfn += nr;
+
+	} while (addr != end);
+}
+EXPORT_SYMBOL(contpte_set_ptes);
+
+int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
+					unsigned long addr, pte_t *ptep)
+{
+	/*
+	 * ptep_clear_flush_young() technically requires us to clear the access
+	 * flag for a _single_ pte. However, the core-mm code actually tracks
+	 * access/dirty per folio, not per page. And since we only create a
+	 * contig range when the range is covered by a single folio, we can get
+	 * away with clearing young for the whole contig range here, so we avoid
+	 * having to unfold.
+	 */
+
+	int young = 0;
+	int i;
+
+	ptep = contpte_align_down(ptep);
+	addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
+
+	for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
+		young |= __ptep_test_and_clear_young(vma, addr, ptep);
+
+	return young;
+}
+EXPORT_SYMBOL(contpte_ptep_test_and_clear_young);
+
+int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
+					unsigned long addr, pte_t *ptep)
+{
+	int young;
+
+	young = contpte_ptep_test_and_clear_young(vma, addr, ptep);
+
+	if (young) {
+		/*
+		 * See comment in __ptep_clear_flush_young(); same rationale for
+		 * eliding the trailing DSB applies here.
+		 */
+		addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
+		__flush_tlb_range_nosync(vma, addr, addr + CONT_PTE_SIZE,
+					 PAGE_SIZE, true, 3);
+	}
+
+	return young;
+}
+EXPORT_SYMBOL(contpte_ptep_clear_flush_young);
+
+int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
+					unsigned long addr, pte_t *ptep,
+					pte_t entry, int dirty)
+{
+	unsigned long start_addr;
+	pte_t orig_pte;
+	int i;
+
+	/*
+	 * Gather the access/dirty bits for the contiguous range. If nothing has
+	 * changed, it's a noop.
+	 */
+	orig_pte = pte_mknoncont(ptep_get(ptep));
+	if (pte_val(orig_pte) == pte_val(entry))
+		return 0;
+
+	/*
+	 * We can fix up access/dirty bits without having to unfold the contig
+	 * range. But if the write bit is changing, we must unfold.
+	 */
+	if (pte_write(orig_pte) == pte_write(entry)) {
+		/*
+		 * For HW access management, we technically only need to update
+		 * the flag on a single pte in the range. But for SW access
+		 * management, we need to update all the ptes to prevent extra
+		 * faults. Avoid per-page tlb flush in __ptep_set_access_flags()
+		 * and instead flush the whole range at the end.
+		 */
+		ptep = contpte_align_down(ptep);
+		start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
+
+		for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
+			__ptep_set_access_flags(vma, addr, ptep, entry, 0);
+
+		if (dirty)
+			__flush_tlb_range(vma, start_addr, addr,
+							PAGE_SIZE, true, 3);
+	} else {
+		__contpte_try_unfold(vma->vm_mm, addr, ptep, orig_pte);
+		__ptep_set_access_flags(vma, addr, ptep, entry, dirty);
+	}
+
+	return 1;
+}
+EXPORT_SYMBOL(contpte_ptep_set_access_flags);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH v5 19/25] arm64/mm: Wire up PTE_CONT for user mappings
@ 2024-02-02  8:07   ` Ryan Roberts
  0 siblings, 0 replies; 240+ messages in thread
From: Ryan Roberts @ 2024-02-02  8:07 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	Mark Rutland, David Hildenbrand, Kefeng Wang, John Hubbard,
	Zi Yan, Barry Song, Alistair Popple, Yang Shi, Nicholas Piggin,
	Christophe Leroy, Aneesh Kumar K.V, Naveen N. Rao,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin
  Cc: Ryan Roberts, linux-arm-kernel, x86, linuxppc-dev, linux-mm,
	linux-kernel

With the ptep API sufficiently refactored, we can now introduce a new
"contpte" API layer, which transparently manages the PTE_CONT bit for
user mappings.

In this initial implementation, only suitable batches of PTEs, set via
set_ptes(), are mapped with the PTE_CONT bit. Any subsequent
modification of individual PTEs will cause an "unfold" operation to
repaint the contpte block as individual PTEs before performing the
requested operation. While a modification of a single PTE could cause
the block of PTEs to which it belongs to become eligible for "folding"
into a contpte entry, "folding" is not performed in this initial
implementation due to the cost of checking that the requirements are met.
Due to this, contpte mappings will degrade back to normal pte mappings
over time if/when protections are changed. This will be solved in a
future patch.

Since a contpte block only has a single access and dirty bit, the
semantic here changes slightly; when getting a pte (e.g. ptep_get())
that is part of a contpte mapping, the access and dirty information are
pulled from the block (so all ptes in the block return the same
access/dirty info). When changing the access/dirty info on a pte (e.g.
ptep_set_access_flags()) that is part of a contpte mapping, this change
will affect the whole contpte block. This works fine in practice
since we guarantee that only a single folio is mapped by a contpte
block, and the core-mm tracks access/dirty information per folio.

In order for the public functions, which used to be pure inline, to
continue to be callable by modules, export all the contpte_* symbols
that are now called by those public inline functions.

The feature is enabled/disabled with the ARM64_CONTPTE Kconfig parameter
at build time. It defaults to enabled as long as its dependency,
TRANSPARENT_HUGEPAGE, is also enabled. The core-mm depends upon
TRANSPARENT_HUGEPAGE to be able to allocate large folios, so if it's not
enabled, then there is no chance of meeting the physical contiguity
requirement for contpte mappings.

Tested-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/arm64/Kconfig               |   9 +
 arch/arm64/include/asm/pgtable.h | 161 ++++++++++++++++++
 arch/arm64/mm/Makefile           |   1 +
 arch/arm64/mm/contpte.c          | 283 +++++++++++++++++++++++++++++++
 4 files changed, 454 insertions(+)
 create mode 100644 arch/arm64/mm/contpte.c

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index d86d7f4758b5..1442e8ed95b6 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -2230,6 +2230,15 @@ config UNWIND_PATCH_PAC_INTO_SCS
 	select UNWIND_TABLES
 	select DYNAMIC_SCS
 
+config ARM64_CONTPTE
+	bool "Contiguous PTE mappings for user memory" if EXPERT
+	depends on TRANSPARENT_HUGEPAGE
+	default y
+	help
+	  When enabled, user mappings are configured using the PTE contiguous
+	  bit, for any mappings that meet the size and alignment requirements.
+	  This reduces TLB pressure and improves performance.
+
 endmenu # "Kernel Features"
 
 menu "Boot options"
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 7dc6b68ee516..34892a95403d 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -133,6 +133,10 @@ static inline pteval_t __phys_to_pte_val(phys_addr_t phys)
  */
 #define pte_valid_not_user(pte) \
 	((pte_val(pte) & (PTE_VALID | PTE_USER | PTE_UXN)) == (PTE_VALID | PTE_UXN))
+/*
+ * Returns true if the pte is valid and has the contiguous bit set.
+ */
+#define pte_valid_cont(pte)	(pte_valid(pte) && pte_cont(pte))
 /*
  * Could the pte be present in the TLB? We must check mm_tlb_flush_pending
  * so that we don't erroneously return false for pages that have been
@@ -1135,6 +1139,161 @@ void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte);
 #define vmemmap_update_pte vmemmap_update_pte
 #endif
 
+#ifdef CONFIG_ARM64_CONTPTE
+
+/*
+ * The contpte APIs are used to transparently manage the contiguous bit in ptes
+ * where it is possible and makes sense to do so. The PTE_CONT bit is considered
+ * a private implementation detail of the public ptep API (see below).
+ */
+extern void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
+				pte_t *ptep, pte_t pte);
+extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
+extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
+extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
+				pte_t *ptep, pte_t pte, unsigned int nr);
+extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
+				unsigned long addr, pte_t *ptep);
+extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
+				unsigned long addr, pte_t *ptep);
+extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
+				unsigned long addr, pte_t *ptep,
+				pte_t entry, int dirty);
+
+static inline void contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
+					pte_t *ptep, pte_t pte)
+{
+	if (unlikely(pte_valid_cont(pte)))
+		__contpte_try_unfold(mm, addr, ptep, pte);
+}
+
+/*
+ * The below functions constitute the public API that arm64 presents to the
+ * core-mm to manipulate PTE entries within their page tables (or at least this
+ * is the subset of the API that arm64 needs to implement). These public
+ * versions will automatically and transparently apply the contiguous bit where
+ * it makes sense to do so. Therefore any users that are contig-aware (e.g.
+ * hugetlb, kernel mapper) should NOT use these APIs, but instead use the
+ * private versions, which are prefixed with double underscore. All of these
+ * APIs except for ptep_get_lockless() are expected to be called with the PTL
+ * held.
+ */
+
+#define ptep_get ptep_get
+static inline pte_t ptep_get(pte_t *ptep)
+{
+	pte_t pte = __ptep_get(ptep);
+
+	if (likely(!pte_valid_cont(pte)))
+		return pte;
+
+	return contpte_ptep_get(ptep, pte);
+}
+
+#define ptep_get_lockless ptep_get_lockless
+static inline pte_t ptep_get_lockless(pte_t *ptep)
+{
+	pte_t pte = __ptep_get(ptep);
+
+	if (likely(!pte_valid_cont(pte)))
+		return pte;
+
+	return contpte_ptep_get_lockless(ptep);
+}
+
+static inline void set_pte(pte_t *ptep, pte_t pte)
+{
+	/*
+	 * We don't have the mm or vaddr so cannot unfold contig entries (since
+	 * it requires tlb maintenance). set_pte() is not used in core code, so
+	 * this should never even be called. Regardless, do our best to service
+	 * any call and emit a warning if there is any attempt to set a pte on
+	 * top of an existing contig range.
+	 */
+	pte_t orig_pte = __ptep_get(ptep);
+
+	WARN_ON_ONCE(pte_valid_cont(orig_pte));
+	__set_pte(ptep, pte_mknoncont(pte));
+}
+
+#define set_ptes set_ptes
+static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
+				pte_t *ptep, pte_t pte, unsigned int nr)
+{
+	pte = pte_mknoncont(pte);
+
+	if (likely(nr == 1)) {
+		contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
+		__set_ptes(mm, addr, ptep, pte, 1);
+	} else {
+		contpte_set_ptes(mm, addr, ptep, pte, nr);
+	}
+}
+
+static inline void pte_clear(struct mm_struct *mm,
+				unsigned long addr, pte_t *ptep)
+{
+	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
+	__pte_clear(mm, addr, ptep);
+}
+
+#define __HAVE_ARCH_PTEP_GET_AND_CLEAR
+static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
+				unsigned long addr, pte_t *ptep)
+{
+	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
+	return __ptep_get_and_clear(mm, addr, ptep);
+}
+
+#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
+static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
+				unsigned long addr, pte_t *ptep)
+{
+	pte_t orig_pte = __ptep_get(ptep);
+
+	if (likely(!pte_valid_cont(orig_pte)))
+		return __ptep_test_and_clear_young(vma, addr, ptep);
+
+	return contpte_ptep_test_and_clear_young(vma, addr, ptep);
+}
+
+#define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
+static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
+				unsigned long addr, pte_t *ptep)
+{
+	pte_t orig_pte = __ptep_get(ptep);
+
+	if (likely(!pte_valid_cont(orig_pte)))
+		return __ptep_clear_flush_young(vma, addr, ptep);
+
+	return contpte_ptep_clear_flush_young(vma, addr, ptep);
+}
+
+#define __HAVE_ARCH_PTEP_SET_WRPROTECT
+static inline void ptep_set_wrprotect(struct mm_struct *mm,
+				unsigned long addr, pte_t *ptep)
+{
+	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
+	__ptep_set_wrprotect(mm, addr, ptep);
+}
+
+#define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
+static inline int ptep_set_access_flags(struct vm_area_struct *vma,
+				unsigned long addr, pte_t *ptep,
+				pte_t entry, int dirty)
+{
+	pte_t orig_pte = __ptep_get(ptep);
+
+	entry = pte_mknoncont(entry);
+
+	if (likely(!pte_valid_cont(orig_pte)))
+		return __ptep_set_access_flags(vma, addr, ptep, entry, dirty);
+
+	return contpte_ptep_set_access_flags(vma, addr, ptep, entry, dirty);
+}
+
+#else /* CONFIG_ARM64_CONTPTE */
+
 #define ptep_get				__ptep_get
 #define set_pte					__set_pte
 #define set_ptes				__set_ptes
@@ -1150,6 +1309,8 @@ void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte);
 #define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
 #define ptep_set_access_flags			__ptep_set_access_flags
 
+#endif /* CONFIG_ARM64_CONTPTE */
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* __ASM_PGTABLE_H */
diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
index dbd1bc95967d..60454256945b 100644
--- a/arch/arm64/mm/Makefile
+++ b/arch/arm64/mm/Makefile
@@ -3,6 +3,7 @@ obj-y				:= dma-mapping.o extable.o fault.o init.o \
 				   cache.o copypage.o flush.o \
 				   ioremap.o mmap.o pgd.o mmu.o \
 				   context.o proc.o pageattr.o fixmap.o
+obj-$(CONFIG_ARM64_CONTPTE)	+= contpte.o
 obj-$(CONFIG_HUGETLB_PAGE)	+= hugetlbpage.o
 obj-$(CONFIG_PTDUMP_CORE)	+= ptdump.o
 obj-$(CONFIG_PTDUMP_DEBUGFS)	+= ptdump_debugfs.o
diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
new file mode 100644
index 000000000000..bfb50e6b44c7
--- /dev/null
+++ b/arch/arm64/mm/contpte.c
@@ -0,0 +1,283 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2023 ARM Ltd.
+ */
+
+#include <linux/mm.h>
+#include <linux/export.h>
+#include <asm/tlbflush.h>
+
+static inline bool mm_is_user(struct mm_struct *mm)
+{
+	/*
+	 * Don't attempt to apply the contig bit to kernel mappings, because
+	 * dynamically adding/removing the contig bit can cause page faults.
+	 * These racing faults are ok for user space, since they get serialized
+	 * on the PTL. But kernel mappings can't tolerate faults.
+	 */
+	return mm != &init_mm;
+}
+
+static inline pte_t *contpte_align_down(pte_t *ptep)
+{
+	return (pte_t *)(ALIGN_DOWN((unsigned long)ptep >> 3, CONT_PTES) << 3);
+}
+
+static void contpte_convert(struct mm_struct *mm, unsigned long addr,
+			    pte_t *ptep, pte_t pte)
+{
+	struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
+	unsigned long start_addr;
+	pte_t *start_ptep;
+	int i;
+
+	start_ptep = ptep = contpte_align_down(ptep);
+	start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
+	pte = pfn_pte(ALIGN_DOWN(pte_pfn(pte), CONT_PTES), pte_pgprot(pte));
+
+	for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE) {
+		pte_t ptent = __ptep_get_and_clear(mm, addr, ptep);
+
+		if (pte_dirty(ptent))
+			pte = pte_mkdirty(pte);
+
+		if (pte_young(ptent))
+			pte = pte_mkyoung(pte);
+	}
+
+	__flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
+
+	__set_ptes(mm, start_addr, start_ptep, pte, CONT_PTES);
+}
+
+void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
+			pte_t *ptep, pte_t pte)
+{
+	/*
+	 * We have already checked that the ptes are contiguous in
+	 * contpte_try_unfold(), so just check that the mm is user space.
+	 */
+
+	if (!mm_is_user(mm))
+		return;
+
+	pte = pte_mknoncont(pte);
+	contpte_convert(mm, addr, ptep, pte);
+}
+EXPORT_SYMBOL(__contpte_try_unfold);
+
+pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
+{
+	/*
+	 * Gather access/dirty bits, which may be populated in any of the ptes
+	 * of the contig range. We are guaranteed to be holding the PTL, so any
+	 * contiguous range cannot be unfolded or otherwise modified under our
+	 * feet.
+	 */
+
+	pte_t pte;
+	int i;
+
+	ptep = contpte_align_down(ptep);
+
+	for (i = 0; i < CONT_PTES; i++, ptep++) {
+		pte = __ptep_get(ptep);
+
+		if (pte_dirty(pte))
+			orig_pte = pte_mkdirty(orig_pte);
+
+		if (pte_young(pte))
+			orig_pte = pte_mkyoung(orig_pte);
+	}
+
+	return orig_pte;
+}
+EXPORT_SYMBOL(contpte_ptep_get);
+
+pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
+{
+	/*
+	 * Gather access/dirty bits, which may be populated in any of the ptes
+	 * of the contig range. We may not be holding the PTL, so any contiguous
+	 * range may be unfolded/modified/refolded under our feet. Therefore we
+	 * ensure we read a _consistent_ contpte range by checking that all ptes
+	 * in the range are valid and have CONT_PTE set, that all pfns are
+	 * contiguous and that all pgprots are the same (ignoring access/dirty).
+	 * If we find a pte that is not consistent, then we must be racing with
+	 * an update so start again. If the target pte does not have CONT_PTE
+	 * set then that is considered consistent on its own because it is not
+	 * part of a contpte range.
+	 */
+
+	pgprot_t orig_prot;
+	unsigned long pfn;
+	pte_t orig_pte;
+	pgprot_t prot;
+	pte_t *ptep;
+	pte_t pte;
+	int i;
+
+retry:
+	orig_pte = __ptep_get(orig_ptep);
+
+	if (!pte_valid_cont(orig_pte))
+		return orig_pte;
+
+	orig_prot = pte_pgprot(pte_mkold(pte_mkclean(orig_pte)));
+	ptep = contpte_align_down(orig_ptep);
+	pfn = pte_pfn(orig_pte) - (orig_ptep - ptep);
+
+	for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
+		pte = __ptep_get(ptep);
+		prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
+
+		if (!pte_valid_cont(pte) ||
+		   pte_pfn(pte) != pfn ||
+		   pgprot_val(prot) != pgprot_val(orig_prot))
+			goto retry;
+
+		if (pte_dirty(pte))
+			orig_pte = pte_mkdirty(orig_pte);
+
+		if (pte_young(pte))
+			orig_pte = pte_mkyoung(orig_pte);
+	}
+
+	return orig_pte;
+}
+EXPORT_SYMBOL(contpte_ptep_get_lockless);
+
+void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
+					pte_t *ptep, pte_t pte, unsigned int nr)
+{
+	unsigned long next;
+	unsigned long end;
+	unsigned long pfn;
+	pgprot_t prot;
+
+	/*
+	 * The set_ptes() spec guarantees that when nr > 1, the initial state of
+	 * all ptes is not-present. Therefore we never need to unfold or
+	 * otherwise invalidate a range before we set the new ptes.
+	 * contpte_set_ptes() should never be called for nr < 2.
+	 */
+	VM_WARN_ON(nr == 1);
+
+	if (!mm_is_user(mm))
+		return __set_ptes(mm, addr, ptep, pte, nr);
+
+	end = addr + (nr << PAGE_SHIFT);
+	pfn = pte_pfn(pte);
+	prot = pte_pgprot(pte);
+
+	do {
+		next = pte_cont_addr_end(addr, end);
+		nr = (next - addr) >> PAGE_SHIFT;
+		pte = pfn_pte(pfn, prot);
+
+		if (((addr | next | (pfn << PAGE_SHIFT)) & ~CONT_PTE_MASK) == 0)
+			pte = pte_mkcont(pte);
+		else
+			pte = pte_mknoncont(pte);
+
+		__set_ptes(mm, addr, ptep, pte, nr);
+
+		addr = next;
+		ptep += nr;
+		pfn += nr;
+
+	} while (addr != end);
+}
+EXPORT_SYMBOL(contpte_set_ptes);
+
+int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
+					unsigned long addr, pte_t *ptep)
+{
+	/*
+	 * ptep_test_and_clear_young() technically requires us to clear the access
+	 * flag for a _single_ pte. However, the core-mm code actually tracks
+	 * access/dirty per folio, not per page. And since we only create a
+	 * contig range when the range is covered by a single folio, we can get
+	 * away with clearing young for the whole contig range here, so we avoid
+	 * having to unfold.
+	 */
+
+	int young = 0;
+	int i;
+
+	ptep = contpte_align_down(ptep);
+	addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
+
+	for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
+		young |= __ptep_test_and_clear_young(vma, addr, ptep);
+
+	return young;
+}
+EXPORT_SYMBOL(contpte_ptep_test_and_clear_young);
+
+int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
+					unsigned long addr, pte_t *ptep)
+{
+	int young;
+
+	young = contpte_ptep_test_and_clear_young(vma, addr, ptep);
+
+	if (young) {
+		/*
+		 * See comment in __ptep_clear_flush_young(); same rationale for
+		 * eliding the trailing DSB applies here.
+		 */
+		addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
+		__flush_tlb_range_nosync(vma, addr, addr + CONT_PTE_SIZE,
+					 PAGE_SIZE, true, 3);
+	}
+
+	return young;
+}
+EXPORT_SYMBOL(contpte_ptep_clear_flush_young);
+
+int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
+					unsigned long addr, pte_t *ptep,
+					pte_t entry, int dirty)
+{
+	unsigned long start_addr;
+	pte_t orig_pte;
+	int i;
+
+	/*
+	 * Gather the access/dirty bits for the contiguous range. If nothing has
+	 * changed, it's a no-op.
+	 */
+	orig_pte = pte_mknoncont(ptep_get(ptep));
+	if (pte_val(orig_pte) == pte_val(entry))
+		return 0;
+
+	/*
+	 * We can fix up access/dirty bits without having to unfold the contig
+	 * range. But if the write bit is changing, we must unfold.
+	 */
+	if (pte_write(orig_pte) == pte_write(entry)) {
+		/*
+		 * For HW access management, we technically only need to update
+		 * the flag on a single pte in the range. But for SW access
+		 * management, we need to update all the ptes to prevent extra
+		 * faults. Avoid per-page tlb flush in __ptep_set_access_flags()
+		 * and instead flush the whole range at the end.
+		 */
+		ptep = contpte_align_down(ptep);
+		start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
+
+		for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
+			__ptep_set_access_flags(vma, addr, ptep, entry, 0);
+
+		if (dirty)
+			__flush_tlb_range(vma, start_addr, addr,
+							PAGE_SIZE, true, 3);
+	} else {
+		__contpte_try_unfold(vma->vm_mm, addr, ptep, orig_pte);
+		__ptep_set_access_flags(vma, addr, ptep, entry, dirty);
+	}
+
+	return 1;
+}
+EXPORT_SYMBOL(contpte_ptep_set_access_flags);
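
For readers less familiar with the contpte geometry, the fold-eligibility test
used in contpte_set_ptes() above can be illustrated with a small standalone
sketch. This is illustrative code only (not part of the patch) and hard-codes
the 4K-granule values, where CONT_PTES is 16 and a contpte block therefore
covers 64K of virtually and physically contiguous memory:

#include <stdio.h>

#define PAGE_SHIFT	12
#define CONT_PTES	16
#define CONT_PTE_SIZE	(CONT_PTES << PAGE_SHIFT)		/* 64K */
#define CONT_PTE_MASK	(~((unsigned long)CONT_PTE_SIZE - 1))

/*
 * Mirrors the check in contpte_set_ptes(): the sub-range [addr, next) is
 * mapped with PTE_CONT only if both ends and the physical address are
 * aligned to the contpte block size.
 */
static int can_fold(unsigned long addr, unsigned long next, unsigned long pfn)
{
	return ((addr | next | (pfn << PAGE_SHIFT)) & ~CONT_PTE_MASK) == 0;
}

int main(void)
{
	/* 64K-aligned VA range backed by a 64K-aligned PA: folds. */
	printf("%d\n", can_fold(0x10000, 0x20000, 0x80000UL >> PAGE_SHIFT));
	/* Same VA range, but the backing pfn is off by one page: stays non-cont. */
	printf("%d\n", can_fold(0x10000, 0x20000, 0x81000UL >> PAGE_SHIFT));
	return 0;
}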
-- 
2.25.1


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH v5 20/25] arm64/mm: Implement new wrprotect_ptes() batch API
  2024-02-02  8:07 ` Ryan Roberts
  (?)
@ 2024-02-02  8:07   ` Ryan Roberts
  -1 siblings, 0 replies; 240+ messages in thread
From: Ryan Roberts @ 2024-02-02  8:07 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	Mark Rutland, David Hildenbrand, Kefeng Wang, John Hubbard,
	Zi Yan, Barry Song, Alistair Popple, Yang Shi, Nicholas Piggin,
	Christophe Leroy, Aneesh Kumar K.V, Naveen N. Rao,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin
  Cc: Ryan Roberts, linux-arm-kernel, x86, linuxppc-dev, linux-mm,
	linux-kernel

Optimize the contpte implementation to fix some of the fork performance
regression introduced by the initial contpte commit. Subsequent patches
will solve it entirely.

During fork(), any private memory in the parent must be write-protected.
Previously this was done 1 PTE at a time. But the core-mm supports
batched wrprotect via the new wrprotect_ptes() API. So let's implement
that API; for fully covered contpte mappings, we then no longer need to
unfold the contpte. This has two benefits:

  - Reduced unfolding, which reduces the number of tlbis that must be issued.
  - The memory remains contpte-mapped ("folded") in the parent, so it
    continues to benefit from the more efficient use of the TLB after
    the fork.

The optimization to wrprotect a whole contpte block without unfolding is
possible thanks to the tightening of the Arm ARM with respect to the
definition and behaviour when 'Misprogramming the Contiguous bit'. See
section D21194 at https://developer.arm.com/documentation/102105/latest/
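
For illustration (this snippet is not part of the diff below; variable names
and the locking context are elided), a fork-style copy loop that previously
write-protected a batch of nr present ptes one at a time, unfolding any
contpte block it touched:

	for (i = 0; i < nr; i++)
		ptep_set_wrprotect(mm, addr + i * PAGE_SIZE, ptep + i);

can instead issue a single batched call, which leaves fully covered contpte
blocks folded and only unfolds partially covered blocks at the edges:

	wrprotect_ptes(mm, addr, ptep, nr);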

Tested-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/arm64/include/asm/pgtable.h | 61 ++++++++++++++++++++++++++------
 arch/arm64/mm/contpte.c          | 35 ++++++++++++++++++
 2 files changed, 86 insertions(+), 10 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 34892a95403d..c07f0d563733 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -978,16 +978,12 @@ static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm,
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
-/*
- * __ptep_set_wrprotect - mark read-only while trasferring potential hardware
- * dirty status (PTE_DBM && !PTE_RDONLY) to the software PTE_DIRTY bit.
- */
-static inline void __ptep_set_wrprotect(struct mm_struct *mm,
-					unsigned long address, pte_t *ptep)
+static inline void ___ptep_set_wrprotect(struct mm_struct *mm,
+					unsigned long address, pte_t *ptep,
+					pte_t pte)
 {
-	pte_t old_pte, pte;
+	pte_t old_pte;
 
-	pte = __ptep_get(ptep);
 	do {
 		old_pte = pte;
 		pte = pte_wrprotect(pte);
@@ -996,6 +992,25 @@ static inline void __ptep_set_wrprotect(struct mm_struct *mm,
 	} while (pte_val(pte) != pte_val(old_pte));
 }
 
+/*
+ * __ptep_set_wrprotect - mark read-only while transferring potential hardware
+ * dirty status (PTE_DBM && !PTE_RDONLY) to the software PTE_DIRTY bit.
+ */
+static inline void __ptep_set_wrprotect(struct mm_struct *mm,
+					unsigned long address, pte_t *ptep)
+{
+	___ptep_set_wrprotect(mm, address, ptep, __ptep_get(ptep));
+}
+
+static inline void __wrprotect_ptes(struct mm_struct *mm, unsigned long address,
+				pte_t *ptep, unsigned int nr)
+{
+	unsigned int i;
+
+	for (i = 0; i < nr; i++, address += PAGE_SIZE, ptep++)
+		__ptep_set_wrprotect(mm, address, ptep);
+}
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 #define __HAVE_ARCH_PMDP_SET_WRPROTECT
 static inline void pmdp_set_wrprotect(struct mm_struct *mm,
@@ -1156,6 +1171,8 @@ extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
 				unsigned long addr, pte_t *ptep);
 extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
 				unsigned long addr, pte_t *ptep);
+extern void contpte_wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
+				pte_t *ptep, unsigned int nr);
 extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
 				unsigned long addr, pte_t *ptep,
 				pte_t entry, int dirty);
@@ -1269,12 +1286,35 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
 	return contpte_ptep_clear_flush_young(vma, addr, ptep);
 }
 
+#define wrprotect_ptes wrprotect_ptes
+static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
+				pte_t *ptep, unsigned int nr)
+{
+	if (likely(nr == 1)) {
+		/*
+		 * Optimization: wrprotect_ptes() can only be called for present
+		 * ptes so we only need to check contig bit as condition for
+		 * unfold, and we can remove the contig bit from the pte we read
+		 * to avoid re-reading. This speeds up fork(), which is sensitive
+		 * to this overhead for order-0 folios. Equivalent to contpte_try_unfold().
+		 */
+		pte_t orig_pte = __ptep_get(ptep);
+
+		if (unlikely(pte_cont(orig_pte))) {
+			__contpte_try_unfold(mm, addr, ptep, orig_pte);
+			orig_pte = pte_mknoncont(orig_pte);
+		}
+		___ptep_set_wrprotect(mm, addr, ptep, orig_pte);
+	} else {
+		contpte_wrprotect_ptes(mm, addr, ptep, nr);
+	}
+}
+
 #define __HAVE_ARCH_PTEP_SET_WRPROTECT
 static inline void ptep_set_wrprotect(struct mm_struct *mm,
 				unsigned long addr, pte_t *ptep)
 {
-	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
-	__ptep_set_wrprotect(mm, addr, ptep);
+	wrprotect_ptes(mm, addr, ptep, 1);
 }
 
 #define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
@@ -1306,6 +1346,7 @@ static inline int ptep_set_access_flags(struct vm_area_struct *vma,
 #define ptep_clear_flush_young			__ptep_clear_flush_young
 #define __HAVE_ARCH_PTEP_SET_WRPROTECT
 #define ptep_set_wrprotect			__ptep_set_wrprotect
+#define wrprotect_ptes				__wrprotect_ptes
 #define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
 #define ptep_set_access_flags			__ptep_set_access_flags
 
diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
index bfb50e6b44c7..c85e64baf03b 100644
--- a/arch/arm64/mm/contpte.c
+++ b/arch/arm64/mm/contpte.c
@@ -23,6 +23,23 @@ static inline pte_t *contpte_align_down(pte_t *ptep)
 	return (pte_t *)(ALIGN_DOWN((unsigned long)ptep >> 3, CONT_PTES) << 3);
 }
 
+static void contpte_try_unfold_partial(struct mm_struct *mm, unsigned long addr,
+					pte_t *ptep, unsigned int nr)
+{
+	/*
+	 * Unfold any partially covered contpte block at the beginning and end
+	 * of the range.
+	 */
+
+	if (ptep != contpte_align_down(ptep) || nr < CONT_PTES)
+		contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
+
+	if (ptep + nr != contpte_align_down(ptep + nr))
+		contpte_try_unfold(mm, addr + PAGE_SIZE * (nr - 1),
+				ptep + nr - 1,
+				__ptep_get(ptep + nr - 1));
+}
+
 static void contpte_convert(struct mm_struct *mm, unsigned long addr,
 			    pte_t *ptep, pte_t pte)
 {
@@ -236,6 +253,24 @@ int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
 }
 EXPORT_SYMBOL(contpte_ptep_clear_flush_young);
 
+void contpte_wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
+					pte_t *ptep, unsigned int nr)
+{
+	/*
+	 * If wrprotecting an entire contig range, we can avoid unfolding. Just
+	 * set wrprotect and wait for the later mmu_gather flush to invalidate
+	 * the tlb. Until the flush, the page may or may not be wrprotected.
+	 * After the flush, it is guaranteed wrprotected. If it's a partial
+	 * range though, we must unfold, because we can't have a case where
+	 * CONT_PTE is set but wrprotect applies to a subset of the PTEs; this
+	 * would cause it to continue to be unpredictable after the flush.
+	 */
+
+	contpte_try_unfold_partial(mm, addr, ptep, nr);
+	__wrprotect_ptes(mm, addr, ptep, nr);
+}
+EXPORT_SYMBOL(contpte_wrprotect_ptes);
+
 int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
 					unsigned long addr, pte_t *ptep,
 					pte_t entry, int dirty)
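
As a standalone illustration of the boundary handling in
contpte_try_unfold_partial() above (not kernel code; it models the pte index
via the virtual address, which is equivalent for naturally aligned blocks,
and assumes the 4K granule where CONT_PTES is 16):

#include <stdio.h>
#include <stdbool.h>

#define PAGE_SHIFT	12
#define CONT_PTES	16

int main(void)
{
	unsigned long addr = 0x3c000;	/* first pte index 0x3c = 60: mid-block */
	unsigned int nr = 40;		/* range covers pte indices [60, 100) */

	unsigned long first = addr >> PAGE_SHIFT;
	bool unfold_head = (first % CONT_PTES) != 0 || nr < CONT_PTES;
	bool unfold_tail = ((first + nr) % CONT_PTES) != 0;

	/*
	 * Both ends land mid-block, so the two boundary blocks are unfolded,
	 * while the whole blocks in between ([64, 80) and [80, 96)) keep
	 * PTE_CONT set.
	 */
	printf("unfold head: %d, unfold tail: %d\n", unfold_head, unfold_tail);
	return 0;
}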
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH v5 20/25] arm64/mm: Implement new wrprotect_ptes() batch API
@ 2024-02-02  8:07   ` Ryan Roberts
  0 siblings, 0 replies; 240+ messages in thread
From: Ryan Roberts @ 2024-02-02  8:07 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	Mark Rutland, David Hildenbrand, Kefeng Wang, John Hubbard,
	Zi Yan, Barry Song, Alistair Popple, Yang Shi, Nicholas Piggin,
	Christophe Leroy, Aneesh Kumar K.V, Naveen N. Rao,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin
  Cc: Ryan Roberts, linux-arm-kernel, x86, linuxppc-dev, linux-mm,
	linux-kernel

Optimize the contpte implementation to fix some of the fork performance
regression introduced by the initial contpte commit. Subsequent patches
will solve it entirely.

During fork(), any private memory in the parent must be write-protected.
Previously this was done 1 PTE at a time. But the core-mm supports
batched wrprotect via the new wrprotect_ptes() API. So let's implement
that API and for fully covered contpte mappings, we no longer need to
unfold the contpte. This has 2 benefits:

  - reduced unfolding, reduces the number of tlbis that must be issued.
  - The memory remains contpte-mapped ("folded") in the parent, so it
    continues to benefit from the more efficient use of the TLB after
    the fork.

The optimization to wrprotect a whole contpte block without unfolding is
possible thanks to the tightening of the Arm ARM in respect to the
definition and behaviour when 'Misprogramming the Contiguous bit'. See
section D21194 at https://developer.arm.com/documentation/102105/latest/

Tested-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/arm64/include/asm/pgtable.h | 61 ++++++++++++++++++++++++++------
 arch/arm64/mm/contpte.c          | 35 ++++++++++++++++++
 2 files changed, 86 insertions(+), 10 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 34892a95403d..c07f0d563733 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -978,16 +978,12 @@ static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm,
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
-/*
- * __ptep_set_wrprotect - mark read-only while trasferring potential hardware
- * dirty status (PTE_DBM && !PTE_RDONLY) to the software PTE_DIRTY bit.
- */
-static inline void __ptep_set_wrprotect(struct mm_struct *mm,
-					unsigned long address, pte_t *ptep)
+static inline void ___ptep_set_wrprotect(struct mm_struct *mm,
+					unsigned long address, pte_t *ptep,
+					pte_t pte)
 {
-	pte_t old_pte, pte;
+	pte_t old_pte;
 
-	pte = __ptep_get(ptep);
 	do {
 		old_pte = pte;
 		pte = pte_wrprotect(pte);
@@ -996,6 +992,25 @@ static inline void __ptep_set_wrprotect(struct mm_struct *mm,
 	} while (pte_val(pte) != pte_val(old_pte));
 }
 
+/*
+ * __ptep_set_wrprotect - mark read-only while trasferring potential hardware
+ * dirty status (PTE_DBM && !PTE_RDONLY) to the software PTE_DIRTY bit.
+ */
+static inline void __ptep_set_wrprotect(struct mm_struct *mm,
+					unsigned long address, pte_t *ptep)
+{
+	___ptep_set_wrprotect(mm, address, ptep, __ptep_get(ptep));
+}
+
+static inline void __wrprotect_ptes(struct mm_struct *mm, unsigned long address,
+				pte_t *ptep, unsigned int nr)
+{
+	unsigned int i;
+
+	for (i = 0; i < nr; i++, address += PAGE_SIZE, ptep++)
+		__ptep_set_wrprotect(mm, address, ptep);
+}
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 #define __HAVE_ARCH_PMDP_SET_WRPROTECT
 static inline void pmdp_set_wrprotect(struct mm_struct *mm,
@@ -1156,6 +1171,8 @@ extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
 				unsigned long addr, pte_t *ptep);
 extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
 				unsigned long addr, pte_t *ptep);
+extern void contpte_wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
+				pte_t *ptep, unsigned int nr);
 extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
 				unsigned long addr, pte_t *ptep,
 				pte_t entry, int dirty);
@@ -1269,12 +1286,35 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
 	return contpte_ptep_clear_flush_young(vma, addr, ptep);
 }
 
+#define wrprotect_ptes wrprotect_ptes
+static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
+				pte_t *ptep, unsigned int nr)
+{
+	if (likely(nr == 1)) {
+		/*
+		 * Optimization: wrprotect_ptes() can only be called for present
+		 * ptes so we only need to check contig bit as condition for
+		 * unfold, and we can remove the contig bit from the pte we read
+		 * to avoid re-reading. This speeds up fork() which is sensitive
+		 * for order-0 folios. Equivalent to contpte_try_unfold().
+		 */
+		pte_t orig_pte = __ptep_get(ptep);
+
+		if (unlikely(pte_cont(orig_pte))) {
+			__contpte_try_unfold(mm, addr, ptep, orig_pte);
+			orig_pte = pte_mknoncont(orig_pte);
+		}
+		___ptep_set_wrprotect(mm, addr, ptep, orig_pte);
+	} else {
+		contpte_wrprotect_ptes(mm, addr, ptep, nr);
+	}
+}
+
 #define __HAVE_ARCH_PTEP_SET_WRPROTECT
 static inline void ptep_set_wrprotect(struct mm_struct *mm,
 				unsigned long addr, pte_t *ptep)
 {
-	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
-	__ptep_set_wrprotect(mm, addr, ptep);
+	wrprotect_ptes(mm, addr, ptep, 1);
 }
 
 #define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
@@ -1306,6 +1346,7 @@ static inline int ptep_set_access_flags(struct vm_area_struct *vma,
 #define ptep_clear_flush_young			__ptep_clear_flush_young
 #define __HAVE_ARCH_PTEP_SET_WRPROTECT
 #define ptep_set_wrprotect			__ptep_set_wrprotect
+#define wrprotect_ptes				__wrprotect_ptes
 #define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
 #define ptep_set_access_flags			__ptep_set_access_flags
 
diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
index bfb50e6b44c7..c85e64baf03b 100644
--- a/arch/arm64/mm/contpte.c
+++ b/arch/arm64/mm/contpte.c
@@ -23,6 +23,23 @@ static inline pte_t *contpte_align_down(pte_t *ptep)
 	return (pte_t *)(ALIGN_DOWN((unsigned long)ptep >> 3, CONT_PTES) << 3);
 }
 
+static void contpte_try_unfold_partial(struct mm_struct *mm, unsigned long addr,
+					pte_t *ptep, unsigned int nr)
+{
+	/*
+	 * Unfold any partially covered contpte block at the beginning and end
+	 * of the range.
+	 */
+
+	if (ptep != contpte_align_down(ptep) || nr < CONT_PTES)
+		contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
+
+	if (ptep + nr != contpte_align_down(ptep + nr))
+		contpte_try_unfold(mm, addr + PAGE_SIZE * (nr - 1),
+				ptep + nr - 1,
+				__ptep_get(ptep + nr - 1));
+}
+
 static void contpte_convert(struct mm_struct *mm, unsigned long addr,
 			    pte_t *ptep, pte_t pte)
 {
@@ -236,6 +253,24 @@ int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
 }
 EXPORT_SYMBOL(contpte_ptep_clear_flush_young);
 
+void contpte_wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
+					pte_t *ptep, unsigned int nr)
+{
+	/*
+	 * If wrprotecting an entire contig range, we can avoid unfolding. Just
+	 * set wrprotect and wait for the later mmu_gather flush to invalidate
+	 * the tlb. Until the flush, the page may or may not be wrprotected.
+	 * After the flush, it is guarranteed wrprotected. If its a partial
+	 * range though, we must unfold, because we can't have a case where
+	 * CONT_PTE is set but wrprotect applies to a subset of the PTEs; this
+	 * would cause it to continue to be unpredictable after the flush.
+	 */
+
+	contpte_try_unfold_partial(mm, addr, ptep, nr);
+	__wrprotect_ptes(mm, addr, ptep, nr);
+}
+EXPORT_SYMBOL(contpte_wrprotect_ptes);
+
 int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
 					unsigned long addr, pte_t *ptep,
 					pte_t entry, int dirty)
-- 
2.25.1


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH v5 20/25] arm64/mm: Implement new wrprotect_ptes() batch API
@ 2024-02-02  8:07   ` Ryan Roberts
  0 siblings, 0 replies; 240+ messages in thread
From: Ryan Roberts @ 2024-02-02  8:07 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	Mark Rutland, David Hildenbrand, Kefeng Wang, John Hubbard,
	Zi Yan, Barry Song, Alistair Popple, Yang Shi, Nicholas Piggin,
	Christophe Leroy, Aneesh Kumar K.V, Naveen N. Rao,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin
  Cc: Ryan Roberts, x86, linux-kernel, linux-mm, linuxppc-dev,
	linux-arm-kernel

Optimize the contpte implementation to fix some of the fork performance
regression introduced by the initial contpte commit. Subsequent patches
will solve it entirely.

During fork(), any private memory in the parent must be write-protected.
Previously this was done 1 PTE at a time. But the core-mm supports
batched wrprotect via the new wrprotect_ptes() API. So let's implement
that API and for fully covered contpte mappings, we no longer need to
unfold the contpte. This has 2 benefits:

  - reduced unfolding, reduces the number of tlbis that must be issued.
  - The memory remains contpte-mapped ("folded") in the parent, so it
    continues to benefit from the more efficient use of the TLB after
    the fork.

The optimization to wrprotect a whole contpte block without unfolding is
possible thanks to the tightening of the Arm ARM in respect to the
definition and behaviour when 'Misprogramming the Contiguous bit'. See
section D21194 at https://developer.arm.com/documentation/102105/latest/

Tested-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/arm64/include/asm/pgtable.h | 61 ++++++++++++++++++++++++++------
 arch/arm64/mm/contpte.c          | 35 ++++++++++++++++++
 2 files changed, 86 insertions(+), 10 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 34892a95403d..c07f0d563733 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -978,16 +978,12 @@ static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm,
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
-/*
- * __ptep_set_wrprotect - mark read-only while trasferring potential hardware
- * dirty status (PTE_DBM && !PTE_RDONLY) to the software PTE_DIRTY bit.
- */
-static inline void __ptep_set_wrprotect(struct mm_struct *mm,
-					unsigned long address, pte_t *ptep)
+static inline void ___ptep_set_wrprotect(struct mm_struct *mm,
+					unsigned long address, pte_t *ptep,
+					pte_t pte)
 {
-	pte_t old_pte, pte;
+	pte_t old_pte;
 
-	pte = __ptep_get(ptep);
 	do {
 		old_pte = pte;
 		pte = pte_wrprotect(pte);
@@ -996,6 +992,25 @@ static inline void __ptep_set_wrprotect(struct mm_struct *mm,
 	} while (pte_val(pte) != pte_val(old_pte));
 }
 
+/*
+ * __ptep_set_wrprotect - mark read-only while trasferring potential hardware
+ * dirty status (PTE_DBM && !PTE_RDONLY) to the software PTE_DIRTY bit.
+ */
+static inline void __ptep_set_wrprotect(struct mm_struct *mm,
+					unsigned long address, pte_t *ptep)
+{
+	___ptep_set_wrprotect(mm, address, ptep, __ptep_get(ptep));
+}
+
+static inline void __wrprotect_ptes(struct mm_struct *mm, unsigned long address,
+				pte_t *ptep, unsigned int nr)
+{
+	unsigned int i;
+
+	for (i = 0; i < nr; i++, address += PAGE_SIZE, ptep++)
+		__ptep_set_wrprotect(mm, address, ptep);
+}
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 #define __HAVE_ARCH_PMDP_SET_WRPROTECT
 static inline void pmdp_set_wrprotect(struct mm_struct *mm,
@@ -1156,6 +1171,8 @@ extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
 				unsigned long addr, pte_t *ptep);
 extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
 				unsigned long addr, pte_t *ptep);
+extern void contpte_wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
+				pte_t *ptep, unsigned int nr);
 extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
 				unsigned long addr, pte_t *ptep,
 				pte_t entry, int dirty);
@@ -1269,12 +1286,35 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
 	return contpte_ptep_clear_flush_young(vma, addr, ptep);
 }
 
+#define wrprotect_ptes wrprotect_ptes
+static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
+				pte_t *ptep, unsigned int nr)
+{
+	if (likely(nr == 1)) {
+		/*
+		 * Optimization: wrprotect_ptes() can only be called for present
+		 * ptes so we only need to check contig bit as condition for
+		 * unfold, and we can remove the contig bit from the pte we read
+		 * to avoid re-reading. This speeds up fork() which is sensitive
+		 * for order-0 folios. Equivalent to contpte_try_unfold().
+		 */
+		pte_t orig_pte = __ptep_get(ptep);
+
+		if (unlikely(pte_cont(orig_pte))) {
+			__contpte_try_unfold(mm, addr, ptep, orig_pte);
+			orig_pte = pte_mknoncont(orig_pte);
+		}
+		___ptep_set_wrprotect(mm, addr, ptep, orig_pte);
+	} else {
+		contpte_wrprotect_ptes(mm, addr, ptep, nr);
+	}
+}
+
 #define __HAVE_ARCH_PTEP_SET_WRPROTECT
 static inline void ptep_set_wrprotect(struct mm_struct *mm,
 				unsigned long addr, pte_t *ptep)
 {
-	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
-	__ptep_set_wrprotect(mm, addr, ptep);
+	wrprotect_ptes(mm, addr, ptep, 1);
 }
 
 #define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
@@ -1306,6 +1346,7 @@ static inline int ptep_set_access_flags(struct vm_area_struct *vma,
 #define ptep_clear_flush_young			__ptep_clear_flush_young
 #define __HAVE_ARCH_PTEP_SET_WRPROTECT
 #define ptep_set_wrprotect			__ptep_set_wrprotect
+#define wrprotect_ptes				__wrprotect_ptes
 #define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
 #define ptep_set_access_flags			__ptep_set_access_flags
 
diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
index bfb50e6b44c7..c85e64baf03b 100644
--- a/arch/arm64/mm/contpte.c
+++ b/arch/arm64/mm/contpte.c
@@ -23,6 +23,23 @@ static inline pte_t *contpte_align_down(pte_t *ptep)
 	return (pte_t *)(ALIGN_DOWN((unsigned long)ptep >> 3, CONT_PTES) << 3);
 }
 
+static void contpte_try_unfold_partial(struct mm_struct *mm, unsigned long addr,
+					pte_t *ptep, unsigned int nr)
+{
+	/*
+	 * Unfold any partially covered contpte block at the beginning and end
+	 * of the range.
+	 */
+
+	if (ptep != contpte_align_down(ptep) || nr < CONT_PTES)
+		contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
+
+	if (ptep + nr != contpte_align_down(ptep + nr))
+		contpte_try_unfold(mm, addr + PAGE_SIZE * (nr - 1),
+				ptep + nr - 1,
+				__ptep_get(ptep + nr - 1));
+}
+
 static void contpte_convert(struct mm_struct *mm, unsigned long addr,
 			    pte_t *ptep, pte_t pte)
 {
@@ -236,6 +253,24 @@ int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
 }
 EXPORT_SYMBOL(contpte_ptep_clear_flush_young);
 
+void contpte_wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
+					pte_t *ptep, unsigned int nr)
+{
+	/*
+	 * If wrprotecting an entire contig range, we can avoid unfolding. Just
+	 * set wrprotect and wait for the later mmu_gather flush to invalidate
+	 * the tlb. Until the flush, the page may or may not be wrprotected.
+	 * After the flush, it is guarranteed wrprotected. If its a partial
+	 * range though, we must unfold, because we can't have a case where
+	 * CONT_PTE is set but wrprotect applies to a subset of the PTEs; this
+	 * would cause it to continue to be unpredictable after the flush.
+	 */
+
+	contpte_try_unfold_partial(mm, addr, ptep, nr);
+	__wrprotect_ptes(mm, addr, ptep, nr);
+}
+EXPORT_SYMBOL(contpte_wrprotect_ptes);
+
 int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
 					unsigned long addr, pte_t *ptep,
 					pte_t entry, int dirty)
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH v5 21/25] arm64/mm: Implement new [get_and_]clear_full_ptes() batch APIs
  2024-02-02  8:07 ` Ryan Roberts
  (?)
@ 2024-02-02  8:07   ` Ryan Roberts
  -1 siblings, 0 replies; 240+ messages in thread
From: Ryan Roberts @ 2024-02-02  8:07 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	Mark Rutland, David Hildenbrand, Kefeng Wang, John Hubbard,
	Zi Yan, Barry Song, Alistair Popple, Yang Shi, Nicholas Piggin,
	Christophe Leroy, Aneesh Kumar K.V, Naveen N. Rao,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin
  Cc: Ryan Roberts, linux-arm-kernel, x86, linuxppc-dev, linux-mm,
	linux-kernel

Optimize the contpte implementation to fix some of the
exit/munmap/dontneed performance regression introduced by the initial
contpte commit. Subsequent patches will solve it entirely.

During exit(), munmap() or madvise(MADV_DONTNEED), mappings must be
cleared. Previously this was done 1 PTE at a time. But the core-mm
supports batched clear via the new [get_and_]clear_full_ptes() APIs. So
let's implement those APIs and for fully covered contpte mappings, we no
longer need to unfold the contpte. This significantly reduces unfolding
operations, reducing the number of tlbis that must be issued.

Tested-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/arm64/include/asm/pgtable.h | 67 ++++++++++++++++++++++++++++++++
 arch/arm64/mm/contpte.c          | 17 ++++++++
 2 files changed, 84 insertions(+)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index c07f0d563733..ad04adb7b87f 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -965,6 +965,37 @@ static inline pte_t __ptep_get_and_clear(struct mm_struct *mm,
 	return pte;
 }
 
+static inline void __clear_full_ptes(struct mm_struct *mm, unsigned long addr,
+				pte_t *ptep, unsigned int nr, int full)
+{
+	for (;;) {
+		__ptep_get_and_clear(mm, addr, ptep);
+		if (--nr == 0)
+			break;
+		ptep++;
+		addr += PAGE_SIZE;
+	}
+}
+
+static inline pte_t __get_and_clear_full_ptes(struct mm_struct *mm,
+				unsigned long addr, pte_t *ptep,
+				unsigned int nr, int full)
+{
+	pte_t pte, tmp_pte;
+
+	pte = __ptep_get_and_clear(mm, addr, ptep);
+	while (--nr) {
+		ptep++;
+		addr += PAGE_SIZE;
+		tmp_pte = __ptep_get_and_clear(mm, addr, ptep);
+		if (pte_dirty(tmp_pte))
+			pte = pte_mkdirty(pte);
+		if (pte_young(tmp_pte))
+			pte = pte_mkyoung(pte);
+	}
+	return pte;
+}
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 #define __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR
 static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm,
@@ -1167,6 +1198,11 @@ extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
 extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
 extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
 				pte_t *ptep, pte_t pte, unsigned int nr);
+extern void contpte_clear_full_ptes(struct mm_struct *mm, unsigned long addr,
+				pte_t *ptep, unsigned int nr, int full);
+extern pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
+				unsigned long addr, pte_t *ptep,
+				unsigned int nr, int full);
 extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
 				unsigned long addr, pte_t *ptep);
 extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
@@ -1254,6 +1290,35 @@ static inline void pte_clear(struct mm_struct *mm,
 	__pte_clear(mm, addr, ptep);
 }
 
+#define clear_full_ptes clear_full_ptes
+static inline void clear_full_ptes(struct mm_struct *mm, unsigned long addr,
+				pte_t *ptep, unsigned int nr, int full)
+{
+	if (likely(nr == 1)) {
+		contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
+		__clear_full_ptes(mm, addr, ptep, nr, full);
+	} else {
+		contpte_clear_full_ptes(mm, addr, ptep, nr, full);
+	}
+}
+
+#define get_and_clear_full_ptes get_and_clear_full_ptes
+static inline pte_t get_and_clear_full_ptes(struct mm_struct *mm,
+				unsigned long addr, pte_t *ptep,
+				unsigned int nr, int full)
+{
+	pte_t pte;
+
+	if (likely(nr == 1)) {
+		contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
+		pte = __get_and_clear_full_ptes(mm, addr, ptep, nr, full);
+	} else {
+		pte = contpte_get_and_clear_full_ptes(mm, addr, ptep, nr, full);
+	}
+
+	return pte;
+}
+
 #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
 static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
 				unsigned long addr, pte_t *ptep)
@@ -1338,6 +1403,8 @@ static inline int ptep_set_access_flags(struct vm_area_struct *vma,
 #define set_pte					__set_pte
 #define set_ptes				__set_ptes
 #define pte_clear				__pte_clear
+#define clear_full_ptes				__clear_full_ptes
+#define get_and_clear_full_ptes			__get_and_clear_full_ptes
 #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
 #define ptep_get_and_clear			__ptep_get_and_clear
 #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
index c85e64baf03b..80346108450b 100644
--- a/arch/arm64/mm/contpte.c
+++ b/arch/arm64/mm/contpte.c
@@ -207,6 +207,23 @@ void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
 }
 EXPORT_SYMBOL(contpte_set_ptes);
 
+void contpte_clear_full_ptes(struct mm_struct *mm, unsigned long addr,
+				pte_t *ptep, unsigned int nr, int full)
+{
+	contpte_try_unfold_partial(mm, addr, ptep, nr);
+	__clear_full_ptes(mm, addr, ptep, nr, full);
+}
+EXPORT_SYMBOL(contpte_clear_full_ptes);
+
+pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
+				unsigned long addr, pte_t *ptep,
+				unsigned int nr, int full)
+{
+	contpte_try_unfold_partial(mm, addr, ptep, nr);
+	return __get_and_clear_full_ptes(mm, addr, ptep, nr, full);
+}
+EXPORT_SYMBOL(contpte_get_and_clear_full_ptes);
+
 int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
 					unsigned long addr, pte_t *ptep)
 {
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH v5 21/25] arm64/mm: Implement new [get_and_]clear_full_ptes() batch APIs
@ 2024-02-02  8:07   ` Ryan Roberts
  0 siblings, 0 replies; 240+ messages in thread
From: Ryan Roberts @ 2024-02-02  8:07 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	Mark Rutland, David Hildenbrand, Kefeng Wang, John Hubbard,
	Zi Yan, Barry Song, Alistair Popple, Yang Shi, Nicholas Piggin,
	Christophe Leroy, Aneesh Kumar K.V, Naveen N. Rao,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin
  Cc: Ryan Roberts, linux-arm-kernel, x86, linuxppc-dev, linux-mm,
	linux-kernel

Optimize the contpte implementation to fix some of the
exit/munmap/dontneed performance regression introduced by the initial
contpte commit. Subsequent patches will solve it entirely.

During exit(), munmap() or madvise(MADV_DONTNEED), mappings must be
cleared. Previously this was done 1 PTE at a time. But the core-mm
supports batched clear via the new [get_and_]clear_full_ptes() APIs. So
let's implement those APIs and for fully covered contpte mappings, we no
longer need to unfold the contpte. This significantly reduces unfolding
operations, reducing the number of tlbis that must be issued.

Tested-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/arm64/include/asm/pgtable.h | 67 ++++++++++++++++++++++++++++++++
 arch/arm64/mm/contpte.c          | 17 ++++++++
 2 files changed, 84 insertions(+)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index c07f0d563733..ad04adb7b87f 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -965,6 +965,37 @@ static inline pte_t __ptep_get_and_clear(struct mm_struct *mm,
 	return pte;
 }
 
+static inline void __clear_full_ptes(struct mm_struct *mm, unsigned long addr,
+				pte_t *ptep, unsigned int nr, int full)
+{
+	for (;;) {
+		__ptep_get_and_clear(mm, addr, ptep);
+		if (--nr == 0)
+			break;
+		ptep++;
+		addr += PAGE_SIZE;
+	}
+}
+
+static inline pte_t __get_and_clear_full_ptes(struct mm_struct *mm,
+				unsigned long addr, pte_t *ptep,
+				unsigned int nr, int full)
+{
+	pte_t pte, tmp_pte;
+
+	pte = __ptep_get_and_clear(mm, addr, ptep);
+	while (--nr) {
+		ptep++;
+		addr += PAGE_SIZE;
+		tmp_pte = __ptep_get_and_clear(mm, addr, ptep);
+		if (pte_dirty(tmp_pte))
+			pte = pte_mkdirty(pte);
+		if (pte_young(tmp_pte))
+			pte = pte_mkyoung(pte);
+	}
+	return pte;
+}
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 #define __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR
 static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm,
@@ -1167,6 +1198,11 @@ extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
 extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
 extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
 				pte_t *ptep, pte_t pte, unsigned int nr);
+extern void contpte_clear_full_ptes(struct mm_struct *mm, unsigned long addr,
+				pte_t *ptep, unsigned int nr, int full);
+extern pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
+				unsigned long addr, pte_t *ptep,
+				unsigned int nr, int full);
 extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
 				unsigned long addr, pte_t *ptep);
 extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
@@ -1254,6 +1290,35 @@ static inline void pte_clear(struct mm_struct *mm,
 	__pte_clear(mm, addr, ptep);
 }
 
+#define clear_full_ptes clear_full_ptes
+static inline void clear_full_ptes(struct mm_struct *mm, unsigned long addr,
+				pte_t *ptep, unsigned int nr, int full)
+{
+	if (likely(nr == 1)) {
+		contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
+		__clear_full_ptes(mm, addr, ptep, nr, full);
+	} else {
+		contpte_clear_full_ptes(mm, addr, ptep, nr, full);
+	}
+}
+
+#define get_and_clear_full_ptes get_and_clear_full_ptes
+static inline pte_t get_and_clear_full_ptes(struct mm_struct *mm,
+				unsigned long addr, pte_t *ptep,
+				unsigned int nr, int full)
+{
+	pte_t pte;
+
+	if (likely(nr == 1)) {
+		contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
+		pte = __get_and_clear_full_ptes(mm, addr, ptep, nr, full);
+	} else {
+		pte = contpte_get_and_clear_full_ptes(mm, addr, ptep, nr, full);
+	}
+
+	return pte;
+}
+
 #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
 static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
 				unsigned long addr, pte_t *ptep)
@@ -1338,6 +1403,8 @@ static inline int ptep_set_access_flags(struct vm_area_struct *vma,
 #define set_pte					__set_pte
 #define set_ptes				__set_ptes
 #define pte_clear				__pte_clear
+#define clear_full_ptes				__clear_full_ptes
+#define get_and_clear_full_ptes			__get_and_clear_full_ptes
 #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
 #define ptep_get_and_clear			__ptep_get_and_clear
 #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
index c85e64baf03b..80346108450b 100644
--- a/arch/arm64/mm/contpte.c
+++ b/arch/arm64/mm/contpte.c
@@ -207,6 +207,23 @@ void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
 }
 EXPORT_SYMBOL(contpte_set_ptes);
 
+void contpte_clear_full_ptes(struct mm_struct *mm, unsigned long addr,
+				pte_t *ptep, unsigned int nr, int full)
+{
+	contpte_try_unfold_partial(mm, addr, ptep, nr);
+	__clear_full_ptes(mm, addr, ptep, nr, full);
+}
+EXPORT_SYMBOL(contpte_clear_full_ptes);
+
+pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
+				unsigned long addr, pte_t *ptep,
+				unsigned int nr, int full)
+{
+	contpte_try_unfold_partial(mm, addr, ptep, nr);
+	return __get_and_clear_full_ptes(mm, addr, ptep, nr, full);
+}
+EXPORT_SYMBOL(contpte_get_and_clear_full_ptes);
+
 int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
 					unsigned long addr, pte_t *ptep)
 {
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH v5 22/25] mm: Add pte_batch_hint() to reduce scanning in folio_pte_batch()
  2024-02-02  8:07 ` Ryan Roberts
  (?)
@ 2024-02-02  8:07   ` Ryan Roberts
  -1 siblings, 0 replies; 240+ messages in thread
From: Ryan Roberts @ 2024-02-02  8:07 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	Mark Rutland, David Hildenbrand, Kefeng Wang, John Hubbard,
	Zi Yan, Barry Song, Alistair Popple, Yang Shi, Nicholas Piggin,
	Christophe Leroy, Aneesh Kumar K.V, Naveen N. Rao,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin
  Cc: Ryan Roberts, linux-arm-kernel, x86, linuxppc-dev, linux-mm,
	linux-kernel

Some architectures (e.g. arm64) can tell from looking at a pte whether
some follow-on ptes also map contiguous physical memory with the same
pgprot (for arm64, these are contpte mappings).

Take advantage of this knowledge to optimize folio_pte_batch() so that
it can skip these ptes when scanning to create a batch. By default, if
an arch does not opt in, pte_batch_hint() returns a compile-time 1, so
the changes are optimized out and the behaviour is as before.

arm64 will opt in to providing this hint in the next patch, which will
greatly reduce the cost of ptep_get() when scanning a range of contptes.
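
To illustrate the arithmetic with a standalone model (not kernel code):
because the scanner now advances by the hint rather than by 1, it can
step past max_nr, which is why the loop condition becomes a less-than
test and the return value is clamped with min(). CONT_PTES and max_nr
below are made-up example values:

#include <stdio.h>

#define CONT_PTES 16	/* model: ptes per contpte block with 4K pages */

/*
 * Model of the hint for a fully-contiguous region: the number of entries
 * remaining in the current block. (The real pte_batch_hint() returns 1
 * for non-contpte entries.)
 */
static unsigned int batch_hint(unsigned int idx)
{
	return CONT_PTES - (idx % CONT_PTES);
}

int main(void)
{
	unsigned int max_nr = 20;	/* caller wants at most 20 entries */
	unsigned int idx = 0, reads = 0;

	while (idx < max_nr) {
		idx += batch_hint(idx);	/* jumps 0 -> 16 -> 32 */
		reads++;
	}

	/* idx overshot to 32; report min(idx, max_nr) as the batch size. */
	printf("pte reads: %u, batch size: %u\n",
	       reads, idx < max_nr ? idx : max_nr);
	return 0;
}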

Tested-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 include/linux/pgtable.h | 18 ++++++++++++++++++
 mm/memory.c             | 20 +++++++++++++-------
 2 files changed, 31 insertions(+), 7 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 50f32cccbd92..cba31f177d27 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -212,6 +212,24 @@ static inline int pmd_dirty(pmd_t pmd)
 #define arch_flush_lazy_mmu_mode()	do {} while (0)
 #endif
 
+#ifndef pte_batch_hint
+/**
+ * pte_batch_hint - Number of pages that can be added to batch without scanning.
+ * @ptep: Page table pointer for the entry.
+ * @pte: Page table entry.
+ *
+ * Some architectures know that a set of contiguous ptes all map the same
+ * contiguous memory with the same permissions. In this case, it can provide a
+ * hint to aid pte batching without the core code needing to scan every pte.
+ *
+ * May be overridden by the architecture, else pte_batch_hint is always 1.
+ */
+static inline unsigned int pte_batch_hint(pte_t *ptep, pte_t pte)
+{
+	return 1;
+}
+#endif
+
 #ifndef pte_advance_pfn
 static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
 {
diff --git a/mm/memory.c b/mm/memory.c
index 65fbe4f886c1..902665b27702 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -988,16 +988,21 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
 {
 	unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio);
 	const pte_t *end_ptep = start_ptep + max_nr;
-	pte_t expected_pte = __pte_batch_clear_ignored(pte_advance_pfn(pte, 1), flags);
-	pte_t *ptep = start_ptep + 1;
+	pte_t expected_pte = __pte_batch_clear_ignored(pte, flags);
+	pte_t *ptep = start_ptep;
 	bool writable;
+	int nr;
 
 	if (any_writable)
 		*any_writable = false;
 
 	VM_WARN_ON_FOLIO(!pte_present(pte), folio);
 
-	while (ptep != end_ptep) {
+	nr = pte_batch_hint(ptep, pte);
+	expected_pte = pte_advance_pfn(expected_pte, nr);
+	ptep += nr;
+
+	while (ptep < end_ptep) {
 		pte = ptep_get(ptep);
 		if (any_writable)
 			writable = !!pte_write(pte);
@@ -1011,17 +1016,18 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
 		 * corner cases the next PFN might fall into a different
 		 * folio.
 		 */
-		if (pte_pfn(pte) == folio_end_pfn)
+		if (pte_pfn(pte) >= folio_end_pfn)
 			break;
 
 		if (any_writable)
 			*any_writable |= writable;
 
-		expected_pte = pte_advance_pfn(expected_pte, 1);
-		ptep++;
+		nr = pte_batch_hint(ptep, pte);
+		expected_pte = pte_advance_pfn(expected_pte, nr);
+		ptep += nr;
 	}
 
-	return ptep - start_ptep;
+	return min(ptep - start_ptep, max_nr);
 }
 
 /*
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH v5 23/25] arm64/mm: Implement pte_batch_hint()
  2024-02-02  8:07 ` Ryan Roberts
  (?)
@ 2024-02-02  8:07   ` Ryan Roberts
  -1 siblings, 0 replies; 240+ messages in thread
From: Ryan Roberts @ 2024-02-02  8:07 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	Mark Rutland, David Hildenbrand, Kefeng Wang, John Hubbard,
	Zi Yan, Barry Song, Alistair Popple, Yang Shi, Nicholas Piggin,
	Christophe Leroy, Aneesh Kumar K.V, Naveen N. Rao,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin
  Cc: Ryan Roberts, linux-arm-kernel, x86, linuxppc-dev, linux-mm,
	linux-kernel

When core code iterates over a range of ptes and calls ptep_get() for
each of them, if the range happens to cover contpte mappings, the number
of pte reads is amplified by a factor of the number of PTEs in a contpte
block. This is because for each call to ptep_get(), the implementation
must read all of the ptes in the contpte block to which the entry
belongs in order to gather the access and dirty bits.

This causes a hotspot for fork(), as well as for operations that unmap
memory such as munmap(), exit() and madvise(MADV_DONTNEED). Fortunately
we can fix this by implementing pte_batch_hint(), which allows these
iterators to skip the contpte tail ptes when gathering the batch of ptes
to operate on. This brings the number of PTE reads back down to 1 per
pte.
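
The hint is computed from the pointer alone: shifting ptep right by 3
converts the 8-byte-entry pointer into an entry index, masking with
CONT_PTES - 1 gives the index within the contpte block, and subtracting
from CONT_PTES yields the number of entries from this one to the end of
the block. A standalone check of that arithmetic (a model, not kernel
code, assuming the 4K-page configuration where CONT_PTES is 16):

#include <assert.h>
#include <stdint.h>

#define CONT_PTES 16

static unsigned int hint(uintptr_t ptep)
{
	return CONT_PTES - ((ptep >> 3) & (CONT_PTES - 1));
}

int main(void)
{
	uintptr_t base = 0x1000;		/* block-aligned table address */

	assert(hint(base) == 16);		/* first entry: whole block left */
	assert(hint(base + 8) == 15);		/* second entry */
	assert(hint(base + 15 * 8) == 1);	/* last entry in the block */
	return 0;
}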

Tested-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/arm64/include/asm/pgtable.h | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index ad04adb7b87f..353ea67b5d75 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1220,6 +1220,15 @@ static inline void contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
 		__contpte_try_unfold(mm, addr, ptep, pte);
 }
 
+#define pte_batch_hint pte_batch_hint
+static inline unsigned int pte_batch_hint(pte_t *ptep, pte_t pte)
+{
+	if (!pte_valid_cont(pte))
+		return 1;
+
+	return CONT_PTES - (((unsigned long)ptep >> 3) & (CONT_PTES - 1));
+}
+
 /*
  * The below functions constitute the public API that arm64 presents to the
  * core-mm to manipulate PTE entries within their page tables (or at least this
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH v5 24/25] arm64/mm: __always_inline to improve fork() perf
  2024-02-02  8:07 ` Ryan Roberts
  (?)
@ 2024-02-02  8:07   ` Ryan Roberts
  -1 siblings, 0 replies; 240+ messages in thread
From: Ryan Roberts @ 2024-02-02  8:07 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	Mark Rutland, David Hildenbrand, Kefeng Wang, John Hubbard,
	Zi Yan, Barry Song, Alistair Popple, Yang Shi, Nicholas Piggin,
	Christophe Leroy, Aneesh Kumar K.V, Naveen N. Rao,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin
  Cc: Ryan Roberts, linux-arm-kernel, x86, linuxppc-dev, linux-mm,
	linux-kernel

As set_ptes() and wrprotect_ptes() become a bit more complex, the
compiler may choose not to inline them. But inlining them is critical
for fork() performance. So mark the functions, along with
contpte_try_unfold() which they call, as __always_inline. This is worth
~1% on the fork() microbenchmark with order-0 folios (the common case).
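
For context, a minimal sketch (not from this patch) of the pattern:
unlike plain static inline, the attribute overrides the compiler's
size/benefit heuristics, so the helper is always emitted inline at each
call site:

#include <linux/compiler.h>

/* Hypothetical hot-path helper; always inlined, never out-of-line. */
static __always_inline int example_helper(int x)
{
	return x + 1;
}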

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/arm64/include/asm/pgtable.h | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 353ea67b5d75..cdc310880a3b 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1213,8 +1213,8 @@ extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
 				unsigned long addr, pte_t *ptep,
 				pte_t entry, int dirty);
 
-static inline void contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
-					pte_t *ptep, pte_t pte)
+static __always_inline void contpte_try_unfold(struct mm_struct *mm,
+				unsigned long addr, pte_t *ptep, pte_t pte)
 {
 	if (unlikely(pte_valid_cont(pte)))
 		__contpte_try_unfold(mm, addr, ptep, pte);
@@ -1279,7 +1279,7 @@ static inline void set_pte(pte_t *ptep, pte_t pte)
 }
 
 #define set_ptes set_ptes
-static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
+static __always_inline void set_ptes(struct mm_struct *mm, unsigned long addr,
 				pte_t *ptep, pte_t pte, unsigned int nr)
 {
 	pte = pte_mknoncont(pte);
@@ -1361,8 +1361,8 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
 }
 
 #define wrprotect_ptes wrprotect_ptes
-static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
-				pte_t *ptep, unsigned int nr)
+static __always_inline void wrprotect_ptes(struct mm_struct *mm,
+				unsigned long addr, pte_t *ptep, unsigned int nr)
 {
 	if (likely(nr == 1)) {
 		/*
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 240+ messages in thread

* [PATCH v5 25/25] arm64/mm: Automatically fold contpte mappings
  2024-02-02  8:07 ` Ryan Roberts
  (?)
@ 2024-02-02  8:07   ` Ryan Roberts
  -1 siblings, 0 replies; 240+ messages in thread
From: Ryan Roberts @ 2024-02-02  8:07 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	Mark Rutland, David Hildenbrand, Kefeng Wang, John Hubbard,
	Zi Yan, Barry Song, Alistair Popple, Yang Shi, Nicholas Piggin,
	Christophe Leroy, Aneesh Kumar K.V, Naveen N. Rao,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin
  Cc: Ryan Roberts, linux-arm-kernel, x86, linuxppc-dev, linux-mm,
	linux-kernel

There are situations where a change to a single PTE can cause the
contpte block in which it resides to become foldable (i.e. it could be
repainted with the contiguous bit). Such situations arise, for example,
when user space temporarily changes protections, via mprotect, for
individual pages, as can be the case for certain garbage collectors.

We would like to detect when such a PTE change occurs. However, this can
be expensive due to the amount of checking required. Therefore, only
perform the checks when an individual PTE is modified via mprotect
(ptep_modify_prot_commit() -> set_pte_at() -> set_ptes(nr=1)) and only
when we are setting the final PTE in a contpte-aligned block.
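
As a standalone model of the cheap pre-filter (not kernel code; it omits
the pte_valid/pte_cont/pte_special checks and assumes 4K pages with 16
ptes per contpte block), showing which (addr, pfn) pairs are worth the
expensive per-pte scan:

#include <stdbool.h>
#include <stdio.h>

#define PAGE_SHIFT	12
#define CONT_PTES	16UL

static bool worth_trying_to_fold(unsigned long addr, unsigned long pfn)
{
	const unsigned long contmask = CONT_PTES - 1;
	bool valign = ((addr >> PAGE_SHIFT) & contmask) == contmask;
	bool palign = (pfn & contmask) == contmask;

	return valign && palign;
}

int main(void)
{
	/* Last pte of a naturally aligned block, virtually and physically. */
	printf("%d\n", worth_trying_to_fold(0x4000f000UL, 0x10fUL));	/* 1 */
	/* Any other entry in the block: skip the expensive scan. */
	printf("%d\n", worth_trying_to_fold(0x4000e000UL, 0x10eUL));	/* 0 */
	return 0;
}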

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/arm64/include/asm/pgtable.h | 26 +++++++++++++
 arch/arm64/mm/contpte.c          | 64 ++++++++++++++++++++++++++++++++
 2 files changed, 90 insertions(+)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index cdc310880a3b..d3357fe4eb89 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1192,6 +1192,8 @@ void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte);
  * where it is possible and makes sense to do so. The PTE_CONT bit is considered
  * a private implementation detail of the public ptep API (see below).
  */
+extern void __contpte_try_fold(struct mm_struct *mm, unsigned long addr,
+				pte_t *ptep, pte_t pte);
 extern void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
 				pte_t *ptep, pte_t pte);
 extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
@@ -1213,6 +1215,29 @@ extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
 				unsigned long addr, pte_t *ptep,
 				pte_t entry, int dirty);
 
+static __always_inline void contpte_try_fold(struct mm_struct *mm,
+				unsigned long addr, pte_t *ptep, pte_t pte)
+{
+	/*
+	 * Only bother trying if both the virtual and physical addresses are
+	 * aligned and correspond to the last entry in a contig range. The core
+	 * code mostly modifies ranges from low to high, so this is likely the
+	 * last modification in the contig range, so a good time to fold.
+	 * We can't fold special mappings, because there is no associated folio.
+	 */
+
+	const unsigned long contmask = CONT_PTES - 1;
+	bool valign = ((addr >> PAGE_SHIFT) & contmask) == contmask;
+
+	if (unlikely(valign)) {
+		bool palign = (pte_pfn(pte) & contmask) == contmask;
+
+		if (unlikely(palign &&
+		    pte_valid(pte) && !pte_cont(pte) && !pte_special(pte)))
+			__contpte_try_fold(mm, addr, ptep, pte);
+	}
+}
+
 static __always_inline void contpte_try_unfold(struct mm_struct *mm,
 				unsigned long addr, pte_t *ptep, pte_t pte)
 {
@@ -1287,6 +1312,7 @@ static __always_inline void set_ptes(struct mm_struct *mm, unsigned long addr,
 	if (likely(nr == 1)) {
 		contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
 		__set_ptes(mm, addr, ptep, pte, 1);
+		contpte_try_fold(mm, addr, ptep, pte);
 	} else {
 		contpte_set_ptes(mm, addr, ptep, pte, nr);
 	}
diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
index 80346108450b..2c7dafd0552a 100644
--- a/arch/arm64/mm/contpte.c
+++ b/arch/arm64/mm/contpte.c
@@ -67,6 +67,70 @@ static void contpte_convert(struct mm_struct *mm, unsigned long addr,
 	__set_ptes(mm, start_addr, start_ptep, pte, CONT_PTES);
 }
 
+void __contpte_try_fold(struct mm_struct *mm, unsigned long addr,
+			pte_t *ptep, pte_t pte)
+{
+	/*
+	 * We have already checked that the virtual and physical addresses are
+	 * correctly aligned for a contpte mapping in contpte_try_fold() so the
+	 * remaining checks are to ensure that the contpte range is fully
+	 * covered by a single folio, and ensure that all the ptes are valid
+	 * with contiguous PFNs and matching prots. We ignore the state of the
+	 * access and dirty bits for the purpose of deciding if it's a contiguous
+	 * range; the folding process will generate a single contpte entry which
+	 * has a single access and dirty bit. Those 2 bits are the logical OR of
+	 * their respective bits in the constituent pte entries. In order to
+	 * ensure the contpte range is covered by a single folio, we must
+	 * recover the folio from the pfn, but special mappings don't have a
+	 * folio backing them. Fortunately contpte_try_fold() already checked
+	 * that the pte is not special - we never try to fold special mappings.
+	 * Note we can't use vm_normal_page() for this since we don't have the
+	 * vma.
+	 */
+
+	unsigned long folio_saddr, folio_eaddr;
+	unsigned long cont_saddr, cont_eaddr;
+	pte_t expected_pte, subpte;
+	struct folio *folio;
+	struct page *page;
+	unsigned long pfn;
+	pte_t *orig_ptep;
+	pgprot_t prot;
+
+	int i;
+
+	if (!mm_is_user(mm))
+		return;
+
+	page = pte_page(pte);
+	folio = page_folio(page);
+	folio_saddr = addr - (page - &folio->page) * PAGE_SIZE;
+	folio_eaddr = folio_saddr + folio_nr_pages(folio) * PAGE_SIZE;
+	cont_saddr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
+	cont_eaddr = cont_saddr + CONT_PTE_SIZE;
+
+	if (folio_saddr > cont_saddr || folio_eaddr < cont_eaddr)
+		return;
+
+	pfn = pte_pfn(pte) - ((addr - cont_saddr) >> PAGE_SHIFT);
+	prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
+	expected_pte = pfn_pte(pfn, prot);
+	orig_ptep = ptep;
+	ptep = contpte_align_down(ptep);
+
+	for (i = 0; i < CONT_PTES; i++) {
+		subpte = pte_mkold(pte_mkclean(__ptep_get(ptep)));
+		if (!pte_same(subpte, expected_pte))
+			return;
+		expected_pte = pte_advance_pfn(expected_pte, 1);
+		ptep++;
+	}
+
+	pte = pte_mkcont(pte);
+	contpte_convert(mm, addr, orig_ptep, pte);
+}
+EXPORT_SYMBOL(__contpte_try_fold);
+
 void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
 			pte_t *ptep, pte_t pte)
 {
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 240+ messages in thread

* Re: [PATCH v5 00/25] Transparent Contiguous PTEs for User Mappings
  2024-02-02  8:07 ` Ryan Roberts
  (?)
@ 2024-02-08 17:34   ` Mark Rutland
  -1 siblings, 0 replies; 240+ messages in thread
From: Mark Rutland @ 2024-02-08 17:34 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	David Hildenbrand, Kefeng Wang, John Hubbard, Zi Yan, Barry Song,
	Alistair Popple, Yang Shi, Nicholas Piggin, Christophe Leroy,
	Aneesh Kumar K.V, Naveen N. Rao, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, linux-arm-kernel,
	x86, linuxppc-dev, linux-mm, linux-kernel

On Fri, Feb 02, 2024 at 08:07:31AM +0000, Ryan Roberts wrote:
> Hi All,

Hi Ryan,

I assume this is the same as your 'features/granule_perf/contpte-lkml_v' branch
on https://gitlab.arm.com/linux-arm/linux-rr/

I've taken a quick look, and I have a few initial/superficial comments before
digging into the detail on the important changes.

> Patch Layout
> ============
> 
> In this version, I've split the patches to better show each optimization:
> 
>   - 1-2:    mm prep: misc code and docs cleanups

I'm not confident enough to comment on patch 2, but these look reasonable to
me.

>   - 3-8:    mm,arm,arm64,powerpc,x86 prep: Replace pte_next_pfn() with more
>             general pte_advance_pfn()

These look fine to me.

>   - 9-18:   arm64 prep: Refactor ptep helpers into new layer

The result of patches 9-17 looks good to me, but the intermediate stages where
some functions are converted is a bit odd, and it's a bit painful for review
since you need to skip ahead a few patches to see the end result to tell that
the conversions are consistent and complete.

IMO it'd be easier for review if that were three patches:

1) Convert READ_ONCE() -> ptep_get()
2) Convert set_pte_at() -> set_ptes()
3) All the "New layer" renames and addition of the trivial wrappers

Patch 18 looks fine to me.

>   - 19:     functional contpte implementation
>   - 20-25:  various optimizations on top of the contpte implementation

I'll try to dig into these over the next few days.

Mark.

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH v5 00/25] Transparent Contiguous PTEs for User Mappings
  2024-02-08 17:34   ` Mark Rutland
@ 2024-02-09  8:54     ` Ryan Roberts
  -1 siblings, 0 replies; 240+ messages in thread
From: Ryan Roberts @ 2024-02-09  8:54 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	David Hildenbrand, Kefeng Wang, John Hubbard, Zi Yan, Barry Song,
	Alistair Popple, Yang Shi, Nicholas Piggin, Christophe Leroy,
	Aneesh Kumar K.V, Naveen N. Rao, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, linux-arm-kernel,
	x86, linuxppc-dev, linux-mm, linux-kernel

On 08/02/2024 17:34, Mark Rutland wrote:
> On Fri, Feb 02, 2024 at 08:07:31AM +0000, Ryan Roberts wrote:
>> Hi All,
> 
> Hi Ryan,
> 
> I assume this is the same as your 'features/granule_perf/contpte-lkml_v' branch
> on https://gitlab.arm.com/linux-arm/linux-rr/

Yep - great detective work! features/granule_perf/contpte-lkml_v5 corresponds
exactly to what I posted with all the dependencies in place.

> 
> I've taken a quick look, and I have a few initial/superficial comments before
> digging into the detail on the important changes.

Thanks for doing this!

> 
>> Patch Layout
>> ============
>>
>> In this version, I've split the patches to better show each optimization:
>>
>>   - 1-2:    mm prep: misc code and docs cleanups
> 
> I'm not confident enough to comment on patch 2, but these look reasonable to
> me.

Thanks. David has acked patch 2 already so I think we are good there.

> 
>>   - 3-8:    mm,arm,arm64,powerpc,x86 prep: Replace pte_next_pfn() with more
>>             general pte_advance_pfn()
> 
> These look fine to me.

Thanks!

> 
>>   - 9-18:   arm64 prep: Refactor ptep helpers into new layer
> 
> The result of patches 9-17 looks good to me, but the intermediate stages where
> some functions are converted is a bit odd, and it's a bit painful for review
> since you need to skip ahead a few patches to see the end result to tell that
> the conversions are consistent and complete.
> 
> IMO it'd be easier for review if that were three patches:
> 
> 1) Convert READ_ONCE() -> ptep_get()
> 2) Convert set_pte_at() -> set_ptes()
> 3) All the "New layer" renames and addition of the trivial wrappers

Yep that makes sense. I'll start prepping that today. I'll hold off reposting
until I have your comments on 19-25. I'm also hoping that David will repost the
zap series today so that it can get into mm-unstable by mid-next week. Then I'll
repost on top of that, hopefully by end of next week, folding in all your
comments. This should give plenty of time to soak in linux-next.

Thanks,
Ryan

> 
> Patch 18 looks fine to me.
> 
>>   - 19:     functional contpte implementation
>>   - 20-25:  various optimizations on top of the contpte implementation
> 
> I'll try to dig into these over the next few days.
> 
> Mark.


^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH v5 00/25] Transparent Contiguous PTEs for User Mappings
  2024-02-09  8:54     ` Ryan Roberts
@ 2024-02-09 22:16       ` David Hildenbrand
  -1 siblings, 0 replies; 240+ messages in thread
From: David Hildenbrand @ 2024-02-09 22:16 UTC (permalink / raw)
  To: Ryan Roberts, Mark Rutland
  Cc: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	Kefeng Wang, John Hubbard, Zi Yan, Barry Song, Alistair Popple,
	Yang Shi, Nicholas Piggin, Christophe Leroy, Aneesh Kumar K.V,
	Naveen N. Rao, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, linux-arm-kernel, x86, linuxppc-dev,
	linux-mm, linux-kernel

>> 1) Convert READ_ONCE() -> ptep_get()
>> 2) Convert set_pte_at() -> set_ptes()
>> 3) All the "New layer" renames and addition of the trivial wrappers
> 
> Yep that makes sense. I'll start prepping that today. I'll hold off reposting
> until I have your comments on 19-25. I'm also hoping that David will repost the
> zap series today so that it can get into mm-unstable by mid-next week. Then I'll
> repost on top of that, hopefully by end of next week, folding in all your
> comments. This should give plenty of time to soak in linux-next.

Just sent out v2. Will review this series (early) next week.

Have a great weekend!

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH v5 00/25] Transparent Contiguous PTEs for User Mappings
  2024-02-09 22:16       ` David Hildenbrand
@ 2024-02-09 23:52         ` Ryan Roberts
  -1 siblings, 0 replies; 240+ messages in thread
From: Ryan Roberts @ 2024-02-09 23:52 UTC (permalink / raw)
  To: David Hildenbrand, Mark Rutland
  Cc: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	Kefeng Wang, John Hubbard, Zi Yan, Barry Song, Alistair Popple,
	Yang Shi, Nicholas Piggin, Christophe Leroy, Aneesh Kumar K.V,
	Naveen N. Rao, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, linux-arm-kernel, x86, linuxppc-dev,
	linux-mm, linux-kernel

On 09/02/2024 22:16, David Hildenbrand wrote:
>>> 1) Convert READ_ONCE() -> ptep_get()
>>> 2) Convert set_pte_at() -> set_ptes()
>>> 3) All the "New layer" renames and addition of the trivial wrappers
>>
>> Yep that makes sense. I'll start prepping that today. I'll hold off reposting
>> until I have your comments on 19-25. I'm also hoping that David will repost the
>> zap series today so that it can get into mm-unstable by mid-next week. Then I'll
>> repost on top of that, hopefully by end of next week, folding in all your
>> comments. This should give plenty of time to soak in linux-next.
> 
> Just sent out v2. Will review this series (early) next week.
> 
> Have a great weekend!

Cheers, David - you too!

> 


^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH v5 19/25] arm64/mm: Wire up PTE_CONT for user mappings
  2024-02-02  8:07   ` Ryan Roberts
@ 2024-02-12 12:00     ` Mark Rutland
  -1 siblings, 0 replies; 240+ messages in thread
From: Mark Rutland @ 2024-02-12 12:00 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	David Hildenbrand, Kefeng Wang, John Hubbard, Zi Yan, Barry Song,
	Alistair Popple, Yang Shi, Nicholas Piggin, Christophe Leroy,
	Aneesh Kumar K.V, Naveen N. Rao, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, linux-arm-kernel,
	x86, linuxppc-dev, linux-mm, linux-kernel

Hi Ryan,

Overall this looks pretty good; I have a bunch of minor comments below, and a
bigger question on the way ptep_get_lockless() works.

On Fri, Feb 02, 2024 at 08:07:50AM +0000, Ryan Roberts wrote:
> With the ptep API sufficiently refactored, we can now introduce a new
> "contpte" API layer, which transparently manages the PTE_CONT bit for
> user mappings.
> 
> In this initial implementation, only suitable batches of PTEs, set via
> set_ptes(), are mapped with the PTE_CONT bit. Any subsequent
> modification of individual PTEs will cause an "unfold" operation to
> repaint the contpte block as individual PTEs before performing the
> requested operation. While a modification of a single PTE could cause
> the block of PTEs to which it belongs to become eligible for "folding"
> into a contpte entry, "folding" is not performed in this initial
> implementation due to the costs of checking the requirements are met.
> Due to this, contpte mappings will degrade back to normal pte mappings
> over time if/when protections are changed. This will be solved in a
> future patch.
> 
> Since a contpte block only has a single access and dirty bit, the
> semantic here changes slightly; when getting a pte (e.g. ptep_get())
> that is part of a contpte mapping, the access and dirty information are
> pulled from the block (so all ptes in the block return the same
> access/dirty info). When changing the access/dirty info on a pte (e.g.
> ptep_set_access_flags()) that is part of a contpte mapping, this change
> will affect the whole contpte block. This works fine in practice
> since we guarantee that only a single folio is mapped by a contpte
> block, and the core-mm tracks access/dirty information per folio.
> 
> In order for the public functions, which used to be pure inline, to
> continue to be callable by modules, export all the contpte_* symbols
> that are now called by those public inline functions.
> 
> The feature is enabled/disabled with the ARM64_CONTPTE Kconfig parameter
> at build time. It defaults to enabled as long as its dependency,
> TRANSPARENT_HUGEPAGE is also enabled. The core-mm depends upon
> TRANSPARENT_HUGEPAGE to be able to allocate large folios, so if it's not
> enabled, then there is no chance of meeting the physical contiguity
> requirement for contpte mappings.
> 
> Tested-by: John Hubbard <jhubbard@nvidia.com>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  arch/arm64/Kconfig               |   9 +
>  arch/arm64/include/asm/pgtable.h | 161 ++++++++++++++++++
>  arch/arm64/mm/Makefile           |   1 +
>  arch/arm64/mm/contpte.c          | 283 +++++++++++++++++++++++++++++++
>  4 files changed, 454 insertions(+)
>  create mode 100644 arch/arm64/mm/contpte.c
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index d86d7f4758b5..1442e8ed95b6 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -2230,6 +2230,15 @@ config UNWIND_PATCH_PAC_INTO_SCS
>  	select UNWIND_TABLES
>  	select DYNAMIC_SCS
>  
> +config ARM64_CONTPTE
> +	bool "Contiguous PTE mappings for user memory" if EXPERT
> +	depends on TRANSPARENT_HUGEPAGE
> +	default y
> +	help
> +	  When enabled, user mappings are configured using the PTE contiguous
> +	  bit, for any mappings that meet the size and alignment requirements.
> +	  This reduces TLB pressure and improves performance.
> +
>  endmenu # "Kernel Features"
>  
>  menu "Boot options"
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 7dc6b68ee516..34892a95403d 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -133,6 +133,10 @@ static inline pteval_t __phys_to_pte_val(phys_addr_t phys)
>   */
>  #define pte_valid_not_user(pte) \
>  	((pte_val(pte) & (PTE_VALID | PTE_USER | PTE_UXN)) == (PTE_VALID | PTE_UXN))
> +/*
> + * Returns true if the pte is valid and has the contiguous bit set.
> + */
> +#define pte_valid_cont(pte)	(pte_valid(pte) && pte_cont(pte))
>  /*
>   * Could the pte be present in the TLB? We must check mm_tlb_flush_pending
>   * so that we don't erroneously return false for pages that have been
> @@ -1135,6 +1139,161 @@ void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte);
>  #define vmemmap_update_pte vmemmap_update_pte
>  #endif
>  
> +#ifdef CONFIG_ARM64_CONTPTE
> +
> +/*
> + * The contpte APIs are used to transparently manage the contiguous bit in ptes
> + * where it is possible and makes sense to do so. The PTE_CONT bit is considered
> + * a private implementation detail of the public ptep API (see below).
> + */
> +extern void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
> +				pte_t *ptep, pte_t pte);
> +extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
> +extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
> +extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
> +				pte_t *ptep, pte_t pte, unsigned int nr);
> +extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
> +				unsigned long addr, pte_t *ptep);
> +extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
> +				unsigned long addr, pte_t *ptep);
> +extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
> +				unsigned long addr, pte_t *ptep,
> +				pte_t entry, int dirty);
> +
> +static inline void contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
> +					pte_t *ptep, pte_t pte)
> +{
> +	if (unlikely(pte_valid_cont(pte)))
> +		__contpte_try_unfold(mm, addr, ptep, pte);
> +}
> +
> +/*
> + * The below functions constitute the public API that arm64 presents to the
> + * core-mm to manipulate PTE entries within their page tables (or at least this
> + * is the subset of the API that arm64 needs to implement). These public
> + * versions will automatically and transparently apply the contiguous bit where
> + * it makes sense to do so. Therefore any users that are contig-aware (e.g.
> + * hugetlb, kernel mapper) should NOT use these APIs, but instead use the
> + * private versions, which are prefixed with double underscore. All of these
> + * APIs except for ptep_get_lockless() are expected to be called with the PTL
> + * held.
> + */
> +
> +#define ptep_get ptep_get
> +static inline pte_t ptep_get(pte_t *ptep)
> +{
> +	pte_t pte = __ptep_get(ptep);
> +
> +	if (likely(!pte_valid_cont(pte)))
> +		return pte;
> +
> +	return contpte_ptep_get(ptep, pte);
> +}
> +
> +#define ptep_get_lockless ptep_get_lockless
> +static inline pte_t ptep_get_lockless(pte_t *ptep)
> +{
> +	pte_t pte = __ptep_get(ptep);
> +
> +	if (likely(!pte_valid_cont(pte)))
> +		return pte;
> +
> +	return contpte_ptep_get_lockless(ptep);
> +}
> +
> +static inline void set_pte(pte_t *ptep, pte_t pte)
> +{
> +	/*
> +	 * We don't have the mm or vaddr so cannot unfold contig entries (since
> +	 * it requires tlb maintenance). set_pte() is not used in core code, so
> +	 * this should never even be called. Regardless do our best to service
> +	 * any call and emit a warning if there is any attempt to set a pte on
> +	 * top of an existing contig range.
> +	 */
> +	pte_t orig_pte = __ptep_get(ptep);
> +
> +	WARN_ON_ONCE(pte_valid_cont(orig_pte));
> +	__set_pte(ptep, pte_mknoncont(pte));
> +}
> +
> +#define set_ptes set_ptes
> +static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
> +				pte_t *ptep, pte_t pte, unsigned int nr)
> +{
> +	pte = pte_mknoncont(pte);

Why do we have to clear the contiguous bit here? Is that for the same reason as
set_pte(), or do we expect callers to legitimately call this with the
contiguous bit set in 'pte'?

I think you explained this to me in-person, and IIRC we don't expect callers to
go set the bit themselves, but since it 'leaks' out to them via __ptep_get() we
have to clear it here to defer the decision of whether to set/clear it when
modifying entries. It would be nice if we could have a description of why/when
we need to clear this, e.g. in the 'public API' comment block above.
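
Something like this in that comment block would do (wording is only a
suggestion):

	 * Note: PTE_CONT can leak out to callers via ptep_get()/__ptep_get(),
	 * so a pte value passed back into these setters may still have the
	 * bit set, even though callers never set it themselves. The setters
	 * clear it again so that this layer keeps sole responsibility for
	 * deciding when to apply the contiguous bit.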

> +
> +	if (likely(nr == 1)) {
> +		contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
> +		__set_ptes(mm, addr, ptep, pte, 1);
> +	} else {
> +		contpte_set_ptes(mm, addr, ptep, pte, nr);
> +	}
> +}
> +
> +static inline void pte_clear(struct mm_struct *mm,
> +				unsigned long addr, pte_t *ptep)
> +{
> +	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
> +	__pte_clear(mm, addr, ptep);
> +}
> +
> +#define __HAVE_ARCH_PTEP_GET_AND_CLEAR
> +static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
> +				unsigned long addr, pte_t *ptep)
> +{
> +	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
> +	return __ptep_get_and_clear(mm, addr, ptep);
> +}
> +
> +#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
> +static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
> +				unsigned long addr, pte_t *ptep)
> +{
> +	pte_t orig_pte = __ptep_get(ptep);
> +
> +	if (likely(!pte_valid_cont(orig_pte)))
> +		return __ptep_test_and_clear_young(vma, addr, ptep);
> +
> +	return contpte_ptep_test_and_clear_young(vma, addr, ptep);
> +}
> +
> +#define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
> +static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
> +				unsigned long addr, pte_t *ptep)
> +{
> +	pte_t orig_pte = __ptep_get(ptep);
> +
> +	if (likely(!pte_valid_cont(orig_pte)))
> +		return __ptep_clear_flush_young(vma, addr, ptep);
> +
> +	return contpte_ptep_clear_flush_young(vma, addr, ptep);
> +}
> +
> +#define __HAVE_ARCH_PTEP_SET_WRPROTECT
> +static inline void ptep_set_wrprotect(struct mm_struct *mm,
> +				unsigned long addr, pte_t *ptep)
> +{
> +	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
> +	__ptep_set_wrprotect(mm, addr, ptep);
> +}
> +
> +#define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
> +static inline int ptep_set_access_flags(struct vm_area_struct *vma,
> +				unsigned long addr, pte_t *ptep,
> +				pte_t entry, int dirty)
> +{
> +	pte_t orig_pte = __ptep_get(ptep);
> +
> +	entry = pte_mknoncont(entry);
> +
> +	if (likely(!pte_valid_cont(orig_pte)))
> +		return __ptep_set_access_flags(vma, addr, ptep, entry, dirty);
> +
> +	return contpte_ptep_set_access_flags(vma, addr, ptep, entry, dirty);
> +}
> +
> +#else /* CONFIG_ARM64_CONTPTE */
> +
>  #define ptep_get				__ptep_get
>  #define set_pte					__set_pte
>  #define set_ptes				__set_ptes
> @@ -1150,6 +1309,8 @@ void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte);
>  #define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
>  #define ptep_set_access_flags			__ptep_set_access_flags
>  
> +#endif /* CONFIG_ARM64_CONTPTE */
> +
>  #endif /* !__ASSEMBLY__ */
>  
>  #endif /* __ASM_PGTABLE_H */
> diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
> index dbd1bc95967d..60454256945b 100644
> --- a/arch/arm64/mm/Makefile
> +++ b/arch/arm64/mm/Makefile
> @@ -3,6 +3,7 @@ obj-y				:= dma-mapping.o extable.o fault.o init.o \
>  				   cache.o copypage.o flush.o \
>  				   ioremap.o mmap.o pgd.o mmu.o \
>  				   context.o proc.o pageattr.o fixmap.o
> +obj-$(CONFIG_ARM64_CONTPTE)	+= contpte.o
>  obj-$(CONFIG_HUGETLB_PAGE)	+= hugetlbpage.o
>  obj-$(CONFIG_PTDUMP_CORE)	+= ptdump.o
>  obj-$(CONFIG_PTDUMP_DEBUGFS)	+= ptdump_debugfs.o
> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> new file mode 100644
> index 000000000000..bfb50e6b44c7
> --- /dev/null
> +++ b/arch/arm64/mm/contpte.c
> @@ -0,0 +1,283 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (C) 2023 ARM Ltd.
> + */
> +
> +#include <linux/mm.h>
> +#include <linux/export.h>
> +#include <asm/tlbflush.h>
> +
> +static inline bool mm_is_user(struct mm_struct *mm)
> +{
> +	/*
> +	 * Don't attempt to apply the contig bit to kernel mappings, because
> +	 * dynamically adding/removing the contig bit can cause page faults.
> +	 * These racing faults are ok for user space, since they get serialized
> +	 * on the PTL. But kernel mappings can't tolerate faults.
> +	 */
> +	return mm != &init_mm;
> +}

We also have the efi_mm as a non-user mm, though I don't think we manipulate
that while it is live, and I'm not sure if that needs any special handling.
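
If it does need handling, one rough shape (an untested sketch; assumes efi_mm
is visible here via <linux/efi.h>, and relies on IS_ENABLED() letting the
reference fold away when CONFIG_EFI=n) could be:

	static inline bool mm_is_user(struct mm_struct *mm)
	{
		/* Treat the EFI runtime services mm as a kernel mm too */
		if (IS_ENABLED(CONFIG_EFI) && mm == &efi_mm)
			return false;

		return mm != &init_mm;
	}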

> +static inline pte_t *contpte_align_down(pte_t *ptep)
> +{
> +	return (pte_t *)(ALIGN_DOWN((unsigned long)ptep >> 3, CONT_PTES) << 3);

I think this can be:

static inline pte_t *contpte_align_down(pte_t *ptep)
{
	return PTR_ALIGN_DOWN(ptep, sizeof(*ptep) * CONT_PTES);
}

> +
> +static void contpte_convert(struct mm_struct *mm, unsigned long addr,
> +			    pte_t *ptep, pte_t pte)
> +{
> +	struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
> +	unsigned long start_addr;
> +	pte_t *start_ptep;
> +	int i;
> +
> +	start_ptep = ptep = contpte_align_down(ptep);
> +	start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
> +	pte = pfn_pte(ALIGN_DOWN(pte_pfn(pte), CONT_PTES), pte_pgprot(pte));
> +
> +	for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE) {
> +		pte_t ptent = __ptep_get_and_clear(mm, addr, ptep);
> +
> +		if (pte_dirty(ptent))
> +			pte = pte_mkdirty(pte);
> +
> +		if (pte_young(ptent))
> +			pte = pte_mkyoung(pte);
> +	}

Not a big deal either way, but I wonder if it makes more sense to accumulate
the 'ptent' dirty/young values, then modify 'pte' once, i.e.

	bool dirty = false, young = false;

	for (...) {
		pte_t ptent = __ptep_get_and_clear(mm, addr, ptep);
		dirty |= pte_dirty(ptent);
		young |= pte_young(ptent);
	}

	if (dirty)
		pte_mkdirty(pte);
	if (young)
		pte_mkyoung(pte);

I suspect that might generate slightly better code, but I'm also happy with the
current form if people think that's more legible (I have no strong feelings
either way).

> +
> +	__flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
> +
> +	__set_ptes(mm, start_addr, start_ptep, pte, CONT_PTES);
> +}
> +
> +void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
> +			pte_t *ptep, pte_t pte)
> +{
> +	/*
> +	 * We have already checked that the ptes are contiguous in
> +	 * contpte_try_unfold(), so just check that the mm is user space.
> +	 */
> +
> +	if (!mm_is_user(mm))
> +		return;

Nit: normally we don't put a line gap between a comment block and the
associated block of code.

> +
> +	pte = pte_mknoncont(pte);
> +	contpte_convert(mm, addr, ptep, pte);
> +}
> +EXPORT_SYMBOL(__contpte_try_unfold);
> +
> +pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
> +{
> +	/*
> +	 * Gather access/dirty bits, which may be populated in any of the ptes
> +	 * of the contig range. We are guarranteed to be holding the PTL, so any
> +	 * contiguous range cannot be unfolded or otherwise modified under our
> +	 * feet.
> +	 */

Nit: s/guarranteed/guaranteed/

> +
> +	pte_t pte;
> +	int i;
> +
> +	ptep = contpte_align_down(ptep);
> +
> +	for (i = 0; i < CONT_PTES; i++, ptep++) {
> +		pte = __ptep_get(ptep);
> +
> +		if (pte_dirty(pte))
> +			orig_pte = pte_mkdirty(orig_pte);
> +
> +		if (pte_young(pte))
> +			orig_pte = pte_mkyoung(orig_pte);
> +	}
> +
> +	return orig_pte;
> +}
> +EXPORT_SYMBOL(contpte_ptep_get);
> +
> +pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
> +{
> +	/*
> +	 * Gather access/dirty bits, which may be populated in any of the ptes
> +	 * of the contig range. We may not be holding the PTL, so any contiguous
> +	 * range may be unfolded/modified/refolded under our feet. Therefore we
> +	 * ensure we read a _consistent_ contpte range by checking that all ptes
> +	 * in the range are valid and have CONT_PTE set, that all pfns are
> +	 * contiguous and that all pgprots are the same (ignoring access/dirty).
> +	 * If we find a pte that is not consistent, then we must be racing with
> +	 * an update so start again. If the target pte does not have CONT_PTE
> +	 * set then that is considered consistent on its own because it is not
> +	 * part of a contpte range.
> +	 */
> +
> +	pgprot_t orig_prot;
> +	unsigned long pfn;
> +	pte_t orig_pte;
> +	pgprot_t prot;
> +	pte_t *ptep;
> +	pte_t pte;
> +	int i;
> +
> +retry:
> +	orig_pte = __ptep_get(orig_ptep);
> +
> +	if (!pte_valid_cont(orig_pte))
> +		return orig_pte;
> +
> +	orig_prot = pte_pgprot(pte_mkold(pte_mkclean(orig_pte)));
> +	ptep = contpte_align_down(orig_ptep);
> +	pfn = pte_pfn(orig_pte) - (orig_ptep - ptep);
> +
> +	for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
> +		pte = __ptep_get(ptep);
> +		prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
> +
> +		if (!pte_valid_cont(pte) ||
> +		   pte_pfn(pte) != pfn ||
> +		   pgprot_val(prot) != pgprot_val(orig_prot))
> +			goto retry;
> +
> +		if (pte_dirty(pte))
> +			orig_pte = pte_mkdirty(orig_pte);
> +
> +		if (pte_young(pte))
> +			orig_pte = pte_mkyoung(orig_pte);
> +	}
> +
> +	return orig_pte;
> +}
> +EXPORT_SYMBOL(contpte_ptep_get_lockless);

I'm struggling to convince myself that this is safe in general, as it really
depends on how the caller will use this value. Which caller(s) actually care
about the access/dirty bits, given those could change at any time anyway?

I took a quick scan, and AFAICT:

* For perf_get_pgtable_size(), we only care about whether the entry is valid
  and has the contig bit set. We could clean that up with a new interface, e.g.
  something like a new ptep_get_size_lockless().

* For gup_pte_range(), I'm not sure we actually need the access/dirty bits when
  we look at the pte to start with, since we only care whether we can logically
  write to the page at that point.

  I see that we later follow up with:

    pte_val(pte) != pte_val(ptep_get(ptep))

  ... is that why we need ptep_get_lockless() to accumulate the access/dirty
  bits? So that shape of lockless-try...locked-compare sequence works?

* For huge_pte_alloc(), arm64 doesn't select CONFIG_ARCH_WANT_GENERAL_HUGETLB,
  so this doesn't seem to matter.

* For __collapse_huge_page_swapin(), we only care if the pte is a swap pte,
  which means the pte isn't valid, and we'll return the orig_pte as-is anyway.

* For pte_range_none() the access/dirty bits don't matter.

* For handle_pte_fault() I think we have the same shape of
  lockless-try...locked-compare sequence as for gup_pte_range(), where we don't
  care about the access/dirty bits before we reach the locked compare step.

* For ptdump_pte_entry() I think it's arguable that we should continue to
  report the access/dirty bits separately for each PTE, as we have done until
  now, to give an accurate representation of the contents of the translation
  tables.

* For swap_vma_readahead() and unuse_pte_range() we only care if the PTE is a
  swap entry, the access/dirty bits don't matter.

So AFAICT this only really matters for gup_pte_range() and handle_pte_fault(),
and IIUC that's only so that the locklessly-loaded pte value can be compared
with a subsequently locked-loaded entry (for which the access/dirty bits will
be accumulated). Have I understood that correctly?

If so, I wonder if we could instead do that comparison modulo the access/dirty
bits, and leave ptep_get_lockless() only reading a single entry?
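
For illustration, such a "compare modulo access/dirty" check might look
something like the below (helper name is made up, just to show the shape):

	static inline bool pte_same_mod_accdirty(pte_t a, pte_t b)
	{
		/* Compare pte values with the access/dirty bits masked off */
		a = pte_mkold(pte_mkclean(a));
		b = pte_mkold(pte_mkclean(b));
		return pte_same(a, b);
	}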

Thanks,
Mark.

> +void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
> +					pte_t *ptep, pte_t pte, unsigned int nr)
> +{
> +	unsigned long next;
> +	unsigned long end;
> +	unsigned long pfn;
> +	pgprot_t prot;
> +
> +	/*
> +	 * The set_ptes() spec guarantees that when nr > 1, the initial state of
> +	 * all ptes is not-present. Therefore we never need to unfold or
> +	 * otherwise invalidate a range before we set the new ptes.
> +	 * contpte_set_ptes() should never be called for nr < 2.
> +	 */
> +	VM_WARN_ON(nr == 1);
> +
> +	if (!mm_is_user(mm))
> +		return __set_ptes(mm, addr, ptep, pte, nr);
> +
> +	end = addr + (nr << PAGE_SHIFT);
> +	pfn = pte_pfn(pte);
> +	prot = pte_pgprot(pte);
> +
> +	do {
> +		next = pte_cont_addr_end(addr, end);
> +		nr = (next - addr) >> PAGE_SHIFT;
> +		pte = pfn_pte(pfn, prot);
> +
> +		if (((addr | next | (pfn << PAGE_SHIFT)) & ~CONT_PTE_MASK) == 0)
> +			pte = pte_mkcont(pte);
> +		else
> +			pte = pte_mknoncont(pte);
> +
> +		__set_ptes(mm, addr, ptep, pte, nr);
> +
> +		addr = next;
> +		ptep += nr;
> +		pfn += nr;
> +
> +	} while (addr != end);
> +}
> +EXPORT_SYMBOL(contpte_set_ptes);
> +
> +int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
> +					unsigned long addr, pte_t *ptep)
> +{
> +	/*
> +	 * ptep_clear_flush_young() technically requires us to clear the access
> +	 * flag for a _single_ pte. However, the core-mm code actually tracks
> +	 * access/dirty per folio, not per page. And since we only create a
> +	 * contig range when the range is covered by a single folio, we can get
> +	 * away with clearing young for the whole contig range here, so we avoid
> +	 * having to unfold.
> +	 */
> +
> +	int young = 0;
> +	int i;
> +
> +	ptep = contpte_align_down(ptep);
> +	addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
> +
> +	for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
> +		young |= __ptep_test_and_clear_young(vma, addr, ptep);
> +
> +	return young;
> +}
> +EXPORT_SYMBOL(contpte_ptep_test_and_clear_young);
> +
> +int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
> +					unsigned long addr, pte_t *ptep)
> +{
> +	int young;
> +
> +	young = contpte_ptep_test_and_clear_young(vma, addr, ptep);
> +
> +	if (young) {
> +		/*
> +		 * See comment in __ptep_clear_flush_young(); same rationale for
> +		 * eliding the trailing DSB applies here.
> +		 */
> +		addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
> +		__flush_tlb_range_nosync(vma, addr, addr + CONT_PTE_SIZE,
> +					 PAGE_SIZE, true, 3);
> +	}
> +
> +	return young;
> +}
> +EXPORT_SYMBOL(contpte_ptep_clear_flush_young);
> +
> +int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
> +					unsigned long addr, pte_t *ptep,
> +					pte_t entry, int dirty)
> +{
> +	unsigned long start_addr;
> +	pte_t orig_pte;
> +	int i;
> +
> +	/*
> +	 * Gather the access/dirty bits for the contiguous range. If nothing has
> +	 * changed, it's a noop.
> +	 */
> +	orig_pte = pte_mknoncont(ptep_get(ptep));
> +	if (pte_val(orig_pte) == pte_val(entry))
> +		return 0;
> +
> +	/*
> +	 * We can fix up access/dirty bits without having to unfold the contig
> +	 * range. But if the write bit is changing, we must unfold.
> +	 */
> +	if (pte_write(orig_pte) == pte_write(entry)) {
> +		/*
> +		 * For HW access management, we technically only need to update
> +		 * the flag on a single pte in the range. But for SW access
> +		 * management, we need to update all the ptes to prevent extra
> +		 * faults. Avoid per-page tlb flush in __ptep_set_access_flags()
> +		 * and instead flush the whole range at the end.
> +		 */
> +		ptep = contpte_align_down(ptep);
> +		start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
> +
> +		for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
> +			__ptep_set_access_flags(vma, addr, ptep, entry, 0);
> +
> +		if (dirty)
> +			__flush_tlb_range(vma, start_addr, addr,
> +							PAGE_SIZE, true, 3);
> +	} else {
> +		__contpte_try_unfold(vma->vm_mm, addr, ptep, orig_pte);
> +		__ptep_set_access_flags(vma, addr, ptep, entry, dirty);
> +	}
> +
> +	return 1;
> +}
> +EXPORT_SYMBOL(contpte_ptep_set_access_flags);
> -- 
> 2.25.1
> 

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH v5 19/25] arm64/mm: Wire up PTE_CONT for user mappings
@ 2024-02-12 12:00     ` Mark Rutland
  0 siblings, 0 replies; 240+ messages in thread
From: Mark Rutland @ 2024-02-12 12:00 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Kefeng Wang, x86, David Hildenbrand, Catalin Marinas, Yang Shi,
	Dave Hansen, linux-mm, Andrey Ryabinin, H. Peter Anvin,
	Will Deacon, Ard Biesheuvel, Marc Zyngier, Alistair Popple,
	Barry Song, Matthew Wilcox, Aneesh Kumar K.V, Ingo Molnar,
	Zi Yan, Naveen N. Rao, John Hubbard, Nicholas Piggin,
	Borislav Petkov, Thomas Gleixner, linux-arm-kernel, linux-kernel,
	James Morse, Andrew Morton, linuxppc-dev

Hi Ryan,

Overall this looks pretty good; I have a bunch of minor comments below, and a
bigger question on the way ptep_get_lockless() works.

On Fri, Feb 02, 2024 at 08:07:50AM +0000, Ryan Roberts wrote:
> With the ptep API sufficiently refactored, we can now introduce a new
> "contpte" API layer, which transparently manages the PTE_CONT bit for
> user mappings.
> 
> In this initial implementation, only suitable batches of PTEs, set via
> set_ptes(), are mapped with the PTE_CONT bit. Any subsequent
> modification of individual PTEs will cause an "unfold" operation to
> repaint the contpte block as individual PTEs before performing the
> requested operation. While a modification of a single PTE could cause
> the block of PTEs to which it belongs to become eligible for "folding"
> into a contpte entry, "folding" is not performed in this initial
> implementation due to the costs of checking the requirements are met.
> Due to this, contpte mappings will degrade back to normal pte mappings
> over time if/when protections are changed. This will be solved in a
> future patch.
> 
> Since a contpte block only has a single access and dirty bit, the
> semantic here changes slightly; when getting a pte (e.g. ptep_get())
> that is part of a contpte mapping, the access and dirty information are
> pulled from the block (so all ptes in the block return the same
> access/dirty info). When changing the access/dirty info on a pte (e.g.
> ptep_set_access_flags()) that is part of a contpte mapping, this change
> will affect the whole contpte block. This works fine in practice
> since we guarantee that only a single folio is mapped by a contpte
> block, and the core-mm tracks access/dirty information per folio.
> 
> In order for the public functions, which used to be pure inline, to
> continue to be callable by modules, export all the contpte_* symbols
> that are now called by those public inline functions.
> 
> The feature is enabled/disabled with the ARM64_CONTPTE Kconfig parameter
> at build time. It defaults to enabled as long as its dependency,
> TRANSPARENT_HUGEPAGE is also enabled. The core-mm depends upon
> TRANSPARENT_HUGEPAGE to be able to allocate large folios, so if it's not
> enabled, then there is no chance of meeting the physical contiguity
> requirement for contpte mappings.
> 
> Tested-by: John Hubbard <jhubbard@nvidia.com>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  arch/arm64/Kconfig               |   9 +
>  arch/arm64/include/asm/pgtable.h | 161 ++++++++++++++++++
>  arch/arm64/mm/Makefile           |   1 +
>  arch/arm64/mm/contpte.c          | 283 +++++++++++++++++++++++++++++++
>  4 files changed, 454 insertions(+)
>  create mode 100644 arch/arm64/mm/contpte.c
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index d86d7f4758b5..1442e8ed95b6 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -2230,6 +2230,15 @@ config UNWIND_PATCH_PAC_INTO_SCS
>  	select UNWIND_TABLES
>  	select DYNAMIC_SCS
>  
> +config ARM64_CONTPTE
> +	bool "Contiguous PTE mappings for user memory" if EXPERT
> +	depends on TRANSPARENT_HUGEPAGE
> +	default y
> +	help
> +	  When enabled, user mappings are configured using the PTE contiguous
> +	  bit, for any mappings that meet the size and alignment requirements.
> +	  This reduces TLB pressure and improves performance.
> +
>  endmenu # "Kernel Features"
>  
>  menu "Boot options"
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 7dc6b68ee516..34892a95403d 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -133,6 +133,10 @@ static inline pteval_t __phys_to_pte_val(phys_addr_t phys)
>   */
>  #define pte_valid_not_user(pte) \
>  	((pte_val(pte) & (PTE_VALID | PTE_USER | PTE_UXN)) == (PTE_VALID | PTE_UXN))
> +/*
> + * Returns true if the pte is valid and has the contiguous bit set.
> + */
> +#define pte_valid_cont(pte)	(pte_valid(pte) && pte_cont(pte))
>  /*
>   * Could the pte be present in the TLB? We must check mm_tlb_flush_pending
>   * so that we don't erroneously return false for pages that have been
> @@ -1135,6 +1139,161 @@ void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte);
>  #define vmemmap_update_pte vmemmap_update_pte
>  #endif
>  
> +#ifdef CONFIG_ARM64_CONTPTE
> +
> +/*
> + * The contpte APIs are used to transparently manage the contiguous bit in ptes
> + * where it is possible and makes sense to do so. The PTE_CONT bit is considered
> + * a private implementation detail of the public ptep API (see below).
> + */
> +extern void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
> +				pte_t *ptep, pte_t pte);
> +extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
> +extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
> +extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
> +				pte_t *ptep, pte_t pte, unsigned int nr);
> +extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
> +				unsigned long addr, pte_t *ptep);
> +extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
> +				unsigned long addr, pte_t *ptep);
> +extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
> +				unsigned long addr, pte_t *ptep,
> +				pte_t entry, int dirty);
> +
> +static inline void contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
> +					pte_t *ptep, pte_t pte)
> +{
> +	if (unlikely(pte_valid_cont(pte)))
> +		__contpte_try_unfold(mm, addr, ptep, pte);
> +}
> +
> +/*
> + * The below functions constitute the public API that arm64 presents to the
> + * core-mm to manipulate PTE entries within their page tables (or at least this
> + * is the subset of the API that arm64 needs to implement). These public
> + * versions will automatically and transparently apply the contiguous bit where
> + * it makes sense to do so. Therefore any users that are contig-aware (e.g.
> + * hugetlb, kernel mapper) should NOT use these APIs, but instead use the
> + * private versions, which are prefixed with double underscore. All of these
> + * APIs except for ptep_get_lockless() are expected to be called with the PTL
> + * held.
> + */
> +
> +#define ptep_get ptep_get
> +static inline pte_t ptep_get(pte_t *ptep)
> +{
> +	pte_t pte = __ptep_get(ptep);
> +
> +	if (likely(!pte_valid_cont(pte)))
> +		return pte;
> +
> +	return contpte_ptep_get(ptep, pte);
> +}
> +
> +#define ptep_get_lockless ptep_get_lockless
> +static inline pte_t ptep_get_lockless(pte_t *ptep)
> +{
> +	pte_t pte = __ptep_get(ptep);
> +
> +	if (likely(!pte_valid_cont(pte)))
> +		return pte;
> +
> +	return contpte_ptep_get_lockless(ptep);
> +}
> +
> +static inline void set_pte(pte_t *ptep, pte_t pte)
> +{
> +	/*
> +	 * We don't have the mm or vaddr so cannot unfold contig entries (since
> +	 * it requires tlb maintenance). set_pte() is not used in core code, so
> +	 * this should never even be called. Regardless do our best to service
> +	 * any call and emit a warning if there is any attempt to set a pte on
> +	 * top of an existing contig range.
> +	 */
> +	pte_t orig_pte = __ptep_get(ptep);
> +
> +	WARN_ON_ONCE(pte_valid_cont(orig_pte));
> +	__set_pte(ptep, pte_mknoncont(pte));
> +}
> +
> +#define set_ptes set_ptes
> +static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
> +				pte_t *ptep, pte_t pte, unsigned int nr)
> +{
> +	pte = pte_mknoncont(pte);

Why do we have to clear the contiguous bit here? Is that for the same reason as
set_pte(), or do we expect callers to legitimately call this with the
contiguous bit set in 'pte'?

I think you explained this to me in-person, and IIRC we don't expect callers to
go set the bit themselves, but since it 'leaks' out to them via __ptep_get() we
have to clear it here to defer the decision of whether to set/clear it when
modifying entries. It would be nice if we could have a description of why/when
we need to clear this, e.g. in the 'public API' comment block above.

> +
> +	if (likely(nr == 1)) {
> +		contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
> +		__set_ptes(mm, addr, ptep, pte, 1);
> +	} else {
> +		contpte_set_ptes(mm, addr, ptep, pte, nr);
> +	}
> +}
> +
> +static inline void pte_clear(struct mm_struct *mm,
> +				unsigned long addr, pte_t *ptep)
> +{
> +	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
> +	__pte_clear(mm, addr, ptep);
> +}
> +
> +#define __HAVE_ARCH_PTEP_GET_AND_CLEAR
> +static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
> +				unsigned long addr, pte_t *ptep)
> +{
> +	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
> +	return __ptep_get_and_clear(mm, addr, ptep);
> +}
> +
> +#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
> +static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
> +				unsigned long addr, pte_t *ptep)
> +{
> +	pte_t orig_pte = __ptep_get(ptep);
> +
> +	if (likely(!pte_valid_cont(orig_pte)))
> +		return __ptep_test_and_clear_young(vma, addr, ptep);
> +
> +	return contpte_ptep_test_and_clear_young(vma, addr, ptep);
> +}
> +
> +#define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
> +static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
> +				unsigned long addr, pte_t *ptep)
> +{
> +	pte_t orig_pte = __ptep_get(ptep);
> +
> +	if (likely(!pte_valid_cont(orig_pte)))
> +		return __ptep_clear_flush_young(vma, addr, ptep);
> +
> +	return contpte_ptep_clear_flush_young(vma, addr, ptep);
> +}
> +
> +#define __HAVE_ARCH_PTEP_SET_WRPROTECT
> +static inline void ptep_set_wrprotect(struct mm_struct *mm,
> +				unsigned long addr, pte_t *ptep)
> +{
> +	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
> +	__ptep_set_wrprotect(mm, addr, ptep);
> +}
> +
> +#define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
> +static inline int ptep_set_access_flags(struct vm_area_struct *vma,
> +				unsigned long addr, pte_t *ptep,
> +				pte_t entry, int dirty)
> +{
> +	pte_t orig_pte = __ptep_get(ptep);
> +
> +	entry = pte_mknoncont(entry);
> +
> +	if (likely(!pte_valid_cont(orig_pte)))
> +		return __ptep_set_access_flags(vma, addr, ptep, entry, dirty);
> +
> +	return contpte_ptep_set_access_flags(vma, addr, ptep, entry, dirty);
> +}
> +
> +#else /* CONFIG_ARM64_CONTPTE */
> +
>  #define ptep_get				__ptep_get
>  #define set_pte					__set_pte
>  #define set_ptes				__set_ptes
> @@ -1150,6 +1309,8 @@ void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte);
>  #define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
>  #define ptep_set_access_flags			__ptep_set_access_flags
>  
> +#endif /* CONFIG_ARM64_CONTPTE */
> +
>  #endif /* !__ASSEMBLY__ */
>  
>  #endif /* __ASM_PGTABLE_H */
> diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
> index dbd1bc95967d..60454256945b 100644
> --- a/arch/arm64/mm/Makefile
> +++ b/arch/arm64/mm/Makefile
> @@ -3,6 +3,7 @@ obj-y				:= dma-mapping.o extable.o fault.o init.o \
>  				   cache.o copypage.o flush.o \
>  				   ioremap.o mmap.o pgd.o mmu.o \
>  				   context.o proc.o pageattr.o fixmap.o
> +obj-$(CONFIG_ARM64_CONTPTE)	+= contpte.o
>  obj-$(CONFIG_HUGETLB_PAGE)	+= hugetlbpage.o
>  obj-$(CONFIG_PTDUMP_CORE)	+= ptdump.o
>  obj-$(CONFIG_PTDUMP_DEBUGFS)	+= ptdump_debugfs.o
> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> new file mode 100644
> index 000000000000..bfb50e6b44c7
> --- /dev/null
> +++ b/arch/arm64/mm/contpte.c
> @@ -0,0 +1,283 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (C) 2023 ARM Ltd.
> + */
> +
> +#include <linux/mm.h>
> +#include <linux/export.h>
> +#include <asm/tlbflush.h>
> +
> +static inline bool mm_is_user(struct mm_struct *mm)
> +{
> +	/*
> +	 * Don't attempt to apply the contig bit to kernel mappings, because
> +	 * dynamically adding/removing the contig bit can cause page faults.
> +	 * These racing faults are ok for user space, since they get serialized
> +	 * on the PTL. But kernel mappings can't tolerate faults.
> +	 */
> +	return mm != &init_mm;
> +}

We also have the efi_mm as a non-user mm, though I don't think we manipulate
that while it is live, and I'm not sure if that needs any special handling.
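
(If it ever did need special handling, I'd guess it would just be something
like the below -- purely a hypothetical sketch, since as above I don't think
we manipulate efi_mm while it is live:

	return mm != &init_mm && mm != &efi_mm;

... suitably guarded for configurations where efi_mm doesn't exist.)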

> +static inline pte_t *contpte_align_down(pte_t *ptep)
> +{
> +	return (pte_t *)(ALIGN_DOWN((unsigned long)ptep >> 3, CONT_PTES) << 3);

I think this can be:

static inline pte_t *contpte_align_down(pte_t *ptep)
{
	return PTR_ALIGN_DOWN(ptep, sizeof(*ptep) * CONT_PTES);
}

> +
> +static void contpte_convert(struct mm_struct *mm, unsigned long addr,
> +			    pte_t *ptep, pte_t pte)
> +{
> +	struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
> +	unsigned long start_addr;
> +	pte_t *start_ptep;
> +	int i;
> +
> +	start_ptep = ptep = contpte_align_down(ptep);
> +	start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
> +	pte = pfn_pte(ALIGN_DOWN(pte_pfn(pte), CONT_PTES), pte_pgprot(pte));
> +
> +	for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE) {
> +		pte_t ptent = __ptep_get_and_clear(mm, addr, ptep);
> +
> +		if (pte_dirty(ptent))
> +			pte = pte_mkdirty(pte);
> +
> +		if (pte_young(ptent))
> +			pte = pte_mkyoung(pte);
> +	}

Not a big deal either way, but I wonder if it makes more sense to accumulate
the 'ptent' dirty/young values, then modify 'pte' once, i.e.

	bool dirty = false, young = false;

	for (...) {
		pte_t ptent = __ptep_get_and_clear(mm, addr, ptep);
		dirty |= pte_dirty(ptent);
		young |= pte_young(ptent);
	}

	if (dirty)
		pte = pte_mkdirty(pte);
	if (young)
		pte = pte_mkyoung(pte);

I suspect that might generate slightly better code, but I'm also happy with the
current form if people think that's more legible (I have no strong feelings
either way).

> +
> +	__flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
> +
> +	__set_ptes(mm, start_addr, start_ptep, pte, CONT_PTES);
> +}
> +
> +void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
> +			pte_t *ptep, pte_t pte)
> +{
> +	/*
> +	 * We have already checked that the ptes are contiguous in
> +	 * contpte_try_unfold(), so just check that the mm is user space.
> +	 */
> +
> +	if (!mm_is_user(mm))
> +		return;

Nit: normally we don't put a line gap between a comment block and the
associated block of code.

> +
> +	pte = pte_mknoncont(pte);
> +	contpte_convert(mm, addr, ptep, pte);
> +}
> +EXPORT_SYMBOL(__contpte_try_unfold);
> +
> +pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
> +{
> +	/*
> +	 * Gather access/dirty bits, which may be populated in any of the ptes
> +	 * of the contig range. We are guarranteed to be holding the PTL, so any
> +	 * contiguous range cannot be unfolded or otherwise modified under our
> +	 * feet.
> +	 */

Nit: s/guarranteed/guaranteed/

> +
> +	pte_t pte;
> +	int i;
> +
> +	ptep = contpte_align_down(ptep);
> +
> +	for (i = 0; i < CONT_PTES; i++, ptep++) {
> +		pte = __ptep_get(ptep);
> +
> +		if (pte_dirty(pte))
> +			orig_pte = pte_mkdirty(orig_pte);
> +
> +		if (pte_young(pte))
> +			orig_pte = pte_mkyoung(orig_pte);
> +	}
> +
> +	return orig_pte;
> +}
> +EXPORT_SYMBOL(contpte_ptep_get);
> +
> +pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
> +{
> +	/*
> +	 * Gather access/dirty bits, which may be populated in any of the ptes
> +	 * of the contig range. We may not be holding the PTL, so any contiguous
> +	 * range may be unfolded/modified/refolded under our feet. Therefore we
> +	 * ensure we read a _consistent_ contpte range by checking that all ptes
> +	 * in the range are valid and have CONT_PTE set, that all pfns are
> +	 * contiguous and that all pgprots are the same (ignoring access/dirty).
> +	 * If we find a pte that is not consistent, then we must be racing with
> +	 * an update so start again. If the target pte does not have CONT_PTE
> +	 * set then that is considered consistent on its own because it is not
> +	 * part of a contpte range.
> +	 */
> +
> +	pgprot_t orig_prot;
> +	unsigned long pfn;
> +	pte_t orig_pte;
> +	pgprot_t prot;
> +	pte_t *ptep;
> +	pte_t pte;
> +	int i;
> +
> +retry:
> +	orig_pte = __ptep_get(orig_ptep);
> +
> +	if (!pte_valid_cont(orig_pte))
> +		return orig_pte;
> +
> +	orig_prot = pte_pgprot(pte_mkold(pte_mkclean(orig_pte)));
> +	ptep = contpte_align_down(orig_ptep);
> +	pfn = pte_pfn(orig_pte) - (orig_ptep - ptep);
> +
> +	for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
> +		pte = __ptep_get(ptep);
> +		prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
> +
> +		if (!pte_valid_cont(pte) ||
> +		   pte_pfn(pte) != pfn ||
> +		   pgprot_val(prot) != pgprot_val(orig_prot))
> +			goto retry;
> +
> +		if (pte_dirty(pte))
> +			orig_pte = pte_mkdirty(orig_pte);
> +
> +		if (pte_young(pte))
> +			orig_pte = pte_mkyoung(orig_pte);
> +	}
> +
> +	return orig_pte;
> +}
> +EXPORT_SYMBOL(contpte_ptep_get_lockless);

I'm struggling to convince myself that this is safe in general, as it really
depends on how the caller will use this value. Which caller(s) actually care
about the access/dirty bits, given those could change at any time anyway?

I took a quick scan, and AFAICT:

* For perf_get_pgtable_size(), we only care about whether the entry is valid
  and has the contig bit set. We could clean that up with a new interface, e.g.
  something like a new ptep_get_size_lockless().

* For gup_pte_range(), I'm not sure we actually need the access/dirty bits when
  we look at the pte to start with, since we only care whether we can logically
  write to the page at that point.

  I see that we later follow up with:

    pte_val(pte) != pte_val(ptep_get(ptep))

  ... is that why we need ptep_get_lockless() to accumulate the access/dirty
  bits? So that shape of lockless-try...locked-compare sequence works?

* For huge_pte_alloc(), arm64 doesn't select CONFIG_ARCH_WANT_GENERAL_HUGETLB,
  so this doesn't seem to matter.

* For __collapse_huge_page_swapin(), we only care if the pte is a swap pte,
  which means the pte isn't valid, and we'll return the orig_pte as-is anyway.

* For pte_range_none() the access/dirty bits don't matter.

* For handle_pte_fault() I think we have the same shape of
  lockless-try...locked-compare sequence as for gup_pte_range(), where we don't
  care about the access/dirty bits before we reach the locked compare step.

* For ptdump_pte_entry() I think it's arguable that we should continue to
  report the access/dirty bits separately for each PTE, as we have done until
  now, to give an accurate representation of the contents of the translation
  tables.

* For swap_vma_readahead() and unuse_pte_range() we only care if the PTE is a
  swap entry, the access/dirty bits don't matter.

So AFAICT this only really matters for gup_pte_range() and handle_pte_fault(),
and IIUC that's only so that the locklessly-loaded pte value can be compared
with a subsequently locked-loaded entry (for which the access/dirty bits will
be accumulated). Have I understood that correctly?

If so, I wonder if we could instead do that comparison modulo the access/dirty
bits, and leave ptep_get_lockless() only reading a single entry?
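
For example, something like the below (untested, and pte_same_norecency() is
just a made-up name for illustration):

	static inline bool pte_same_norecency(pte_t a, pte_t b)
	{
		a = pte_mkold(pte_mkclean(a));
		b = pte_mkold(pte_mkclean(b));
		return pte_val(a) == pte_val(b);
	}

... which those lockless-try...locked-compare callers could use instead of a
raw pte_val() comparison, so ptep_get_lockless() wouldn't need to accumulate
the access/dirty bits at all.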

Thanks,
Mark.

> +void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
> +					pte_t *ptep, pte_t pte, unsigned int nr)
> +{
> +	unsigned long next;
> +	unsigned long end;
> +	unsigned long pfn;
> +	pgprot_t prot;
> +
> +	/*
> +	 * The set_ptes() spec guarantees that when nr > 1, the initial state of
> +	 * all ptes is not-present. Therefore we never need to unfold or
> +	 * otherwise invalidate a range before we set the new ptes.
> +	 * contpte_set_ptes() should never be called for nr < 2.
> +	 */
> +	VM_WARN_ON(nr == 1);
> +
> +	if (!mm_is_user(mm))
> +		return __set_ptes(mm, addr, ptep, pte, nr);
> +
> +	end = addr + (nr << PAGE_SHIFT);
> +	pfn = pte_pfn(pte);
> +	prot = pte_pgprot(pte);
> +
> +	do {
> +		next = pte_cont_addr_end(addr, end);
> +		nr = (next - addr) >> PAGE_SHIFT;
> +		pte = pfn_pte(pfn, prot);
> +
> +		if (((addr | next | (pfn << PAGE_SHIFT)) & ~CONT_PTE_MASK) == 0)
> +			pte = pte_mkcont(pte);
> +		else
> +			pte = pte_mknoncont(pte);
> +
> +		__set_ptes(mm, addr, ptep, pte, nr);
> +
> +		addr = next;
> +		ptep += nr;
> +		pfn += nr;
> +
> +	} while (addr != end);
> +}
> +EXPORT_SYMBOL(contpte_set_ptes);
> +
> +int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
> +					unsigned long addr, pte_t *ptep)
> +{
> +	/*
> +	 * ptep_clear_flush_young() technically requires us to clear the access
> +	 * flag for a _single_ pte. However, the core-mm code actually tracks
> +	 * access/dirty per folio, not per page. And since we only create a
> +	 * contig range when the range is covered by a single folio, we can get
> +	 * away with clearing young for the whole contig range here, so we avoid
> +	 * having to unfold.
> +	 */
> +
> +	int young = 0;
> +	int i;
> +
> +	ptep = contpte_align_down(ptep);
> +	addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
> +
> +	for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
> +		young |= __ptep_test_and_clear_young(vma, addr, ptep);
> +
> +	return young;
> +}
> +EXPORT_SYMBOL(contpte_ptep_test_and_clear_young);
> +
> +int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
> +					unsigned long addr, pte_t *ptep)
> +{
> +	int young;
> +
> +	young = contpte_ptep_test_and_clear_young(vma, addr, ptep);
> +
> +	if (young) {
> +		/*
> +		 * See comment in __ptep_clear_flush_young(); same rationale for
> +		 * eliding the trailing DSB applies here.
> +		 */
> +		addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
> +		__flush_tlb_range_nosync(vma, addr, addr + CONT_PTE_SIZE,
> +					 PAGE_SIZE, true, 3);
> +	}
> +
> +	return young;
> +}
> +EXPORT_SYMBOL(contpte_ptep_clear_flush_young);
> +
> +int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
> +					unsigned long addr, pte_t *ptep,
> +					pte_t entry, int dirty)
> +{
> +	unsigned long start_addr;
> +	pte_t orig_pte;
> +	int i;
> +
> +	/*
> +	 * Gather the access/dirty bits for the contiguous range. If nothing has
> +	 * changed, its a noop.
> +	 */
> +	orig_pte = pte_mknoncont(ptep_get(ptep));
> +	if (pte_val(orig_pte) == pte_val(entry))
> +		return 0;
> +
> +	/*
> +	 * We can fix up access/dirty bits without having to unfold the contig
> +	 * range. But if the write bit is changing, we must unfold.
> +	 */
> +	if (pte_write(orig_pte) == pte_write(entry)) {
> +		/*
> +		 * For HW access management, we technically only need to update
> +		 * the flag on a single pte in the range. But for SW access
> +		 * management, we need to update all the ptes to prevent extra
> +		 * faults. Avoid per-page tlb flush in __ptep_set_access_flags()
> +		 * and instead flush the whole range at the end.
> +		 */
> +		ptep = contpte_align_down(ptep);
> +		start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
> +
> +		for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
> +			__ptep_set_access_flags(vma, addr, ptep, entry, 0);
> +
> +		if (dirty)
> +			__flush_tlb_range(vma, start_addr, addr,
> +							PAGE_SIZE, true, 3);
> +	} else {
> +		__contpte_try_unfold(vma->vm_mm, addr, ptep, orig_pte);
> +		__ptep_set_access_flags(vma, addr, ptep, entry, dirty);
> +	}
> +
> +	return 1;
> +}
> +EXPORT_SYMBOL(contpte_ptep_set_access_flags);
> -- 
> 2.25.1
> 

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH v5 01/25] mm: Clarify the spec for set_ptes()
  2024-02-02  8:07   ` Ryan Roberts
  (?)
@ 2024-02-12 12:03     ` David Hildenbrand
  -1 siblings, 0 replies; 240+ messages in thread
From: David Hildenbrand @ 2024-02-12 12:03 UTC (permalink / raw)
  To: Ryan Roberts, Catalin Marinas, Will Deacon, Ard Biesheuvel,
	Marc Zyngier, James Morse, Andrey Ryabinin, Andrew Morton,
	Matthew Wilcox, Mark Rutland, Kefeng Wang, John Hubbard, Zi Yan,
	Barry Song, Alistair Popple, Yang Shi, Nicholas Piggin,
	Christophe Leroy, Aneesh Kumar K.V, Naveen N. Rao,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin
  Cc: linux-arm-kernel, x86, linuxppc-dev, linux-mm, linux-kernel

On 02.02.24 09:07, Ryan Roberts wrote:
> set_ptes() spec implies that it can only be used to set a present pte
> because it interprets the PFN field to increment it. However,
> set_pte_at() has been implemented on top of set_ptes() since set_ptes()
> was introduced, and set_pte_at() allows setting a pte to a not-present
> state. So clarify the spec to state that when nr==1, new state of pte
> may be present or not present. When nr>1, new state of all ptes must be
> present.
> 
> While we are at it, tighten the spec to set requirements around the
> initial state of ptes; when nr==1 it may be either present or
> not-present. But when nr>1 all ptes must initially be not-present. All
> set_ptes() callsites already conform to this requirement. Stating it
> explicitly is useful because it allows for a simplification to the
> upcoming arm64 contpte implementation.
> 
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>   include/linux/pgtable.h | 4 ++++
>   1 file changed, 4 insertions(+)
> 
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index f0feae7f89fb..5e7eaf8f2b97 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -229,6 +229,10 @@ static inline pte_t pte_next_pfn(pte_t pte)
>    * @pte: Page table entry for the first page.
>    * @nr: Number of pages to map.
>    *
> + * When nr==1, initial state of pte may be present or not present, and new state
> + * may be present or not present. When nr>1, initial state of all ptes must be
> + * not present, and new state must be present.
> + *
>    * May be overridden by the architecture, or the architecture can define
>    * set_pte() and PFN_PTE_SHIFT.
>    *

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH v5 03/25] mm: Make pte_next_pfn() a wrapper around pte_advance_pfn()
  2024-02-02  8:07   ` Ryan Roberts
  (?)
@ 2024-02-12 12:14     ` David Hildenbrand
  -1 siblings, 0 replies; 240+ messages in thread
From: David Hildenbrand @ 2024-02-12 12:14 UTC (permalink / raw)
  To: Ryan Roberts, Catalin Marinas, Will Deacon, Ard Biesheuvel,
	Marc Zyngier, James Morse, Andrey Ryabinin, Andrew Morton,
	Matthew Wilcox, Mark Rutland, Kefeng Wang, John Hubbard, Zi Yan,
	Barry Song, Alistair Popple, Yang Shi, Nicholas Piggin,
	Christophe Leroy, Aneesh Kumar K.V, Naveen N. Rao,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin
  Cc: linux-arm-kernel, x86, linuxppc-dev, linux-mm, linux-kernel

On 02.02.24 09:07, Ryan Roberts wrote:
> The goal is to be able to advance a PTE by an arbitrary number of PFNs.
> So introduce a new API that takes a nr param.
> 
> We are going to remove pte_next_pfn() and replace it with
> pte_advance_pfn(). As a first step, implement pte_next_pfn() as a
> wrapper around pte_advance_pfn() so that we can incrementally switch the
> architectures over. Once all arches are moved over, we will change all
> the core-mm callers to call pte_advance_pfn() directly and remove the
> wrapper.
> 
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>   include/linux/pgtable.h | 8 +++++++-
>   1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 5e7eaf8f2b97..815d92dcb96b 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -214,9 +214,15 @@ static inline int pmd_dirty(pmd_t pmd)
>   
>   
>   #ifndef pte_next_pfn
> +#ifndef pte_advance_pfn
> +static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
> +{
> +	return __pte(pte_val(pte) + (nr << PFN_PTE_SHIFT));
> +}
> +#endif
>   static inline pte_t pte_next_pfn(pte_t pte)
>   {
> -	return __pte(pte_val(pte) + (1UL << PFN_PTE_SHIFT));
> +	return pte_advance_pfn(pte, 1);
>   }
>   #endif
>   

I do wonder if we simply want to leave pte_next_pfn() around? Especially since
patches #4 and #6 don't really benefit from the change, and neither do the
other set_ptes() implementations.

That is, only convert all pte_next_pfn()->pte_advance_pfn(), and leave a
pte_next_pfn() macro in place.

Any downsides to that? This patch here would become:

#ifndef pte_advance_pfn
static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
{
	return __pte(pte_val(pte) + (nr << PFN_PTE_SHIFT));
}
#endif

#ifndef pte_next_pfn
#define pte_next_pfn(pte) pte_advance_pfn(pte, 1)
#endif

As you convert the three arches, make them define pte_advance_pfn and
undefine pte_next_pfn. In the end, you can drop the #ifdef around
pte_next_pfn here.
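
i.e. each arch would end up with something roughly like the below (just a
sketch, not the exact code from this series):

static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
{
	return pfn_pte(pte_pfn(pte) + nr, pte_pgprot(pte));
}
#define pte_advance_pfn pte_advance_pfn

... while the generic pte_next_pfn() wrapper above stays in place for all the
existing callers.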

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH v5 18/25] arm64/mm: Split __flush_tlb_range() to elide trailing DSB
  2024-02-02  8:07   ` Ryan Roberts
  (?)
@ 2024-02-12 12:44     ` David Hildenbrand
  -1 siblings, 0 replies; 240+ messages in thread
From: David Hildenbrand @ 2024-02-12 12:44 UTC (permalink / raw)
  To: Ryan Roberts, Catalin Marinas, Will Deacon, Ard Biesheuvel,
	Marc Zyngier, James Morse, Andrey Ryabinin, Andrew Morton,
	Matthew Wilcox, Mark Rutland, Kefeng Wang, John Hubbard, Zi Yan,
	Barry Song, Alistair Popple, Yang Shi, Nicholas Piggin,
	Christophe Leroy, Aneesh Kumar K.V, Naveen N. Rao,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin
  Cc: linux-arm-kernel, x86, linuxppc-dev, linux-mm, linux-kernel

On 02.02.24 09:07, Ryan Roberts wrote:
> Split __flush_tlb_range() into __flush_tlb_range_nosync() +
> __flush_tlb_range(), in the same way as the existing flush_tlb_page()
> arrangement. This allows calling __flush_tlb_range_nosync() to elide the
> trailing DSB. Forthcoming "contpte" code will take advantage of this
> when clearing the young bit from a contiguous range of ptes.
> 
> Tested-by: John Hubbard <jhubbard@nvidia.com>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>   arch/arm64/include/asm/tlbflush.h | 13 +++++++++++--
>   1 file changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> index 79e932a1bdf8..50a765917327 100644
> --- a/arch/arm64/include/asm/tlbflush.h
> +++ b/arch/arm64/include/asm/tlbflush.h
> @@ -422,7 +422,7 @@ do {									\
>   #define __flush_s2_tlb_range_op(op, start, pages, stride, tlb_level) \
>   	__flush_tlb_range_op(op, start, pages, stride, 0, tlb_level, false, kvm_lpa2_is_enabled());
>   
> -static inline void __flush_tlb_range(struct vm_area_struct *vma,
> +static inline void __flush_tlb_range_nosync(struct vm_area_struct *vma,
>   				     unsigned long start, unsigned long end,
>   				     unsigned long stride, bool last_level,
>   				     int tlb_level)
> @@ -456,10 +456,19 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma,
>   		__flush_tlb_range_op(vae1is, start, pages, stride, asid,
>   				     tlb_level, true, lpa2_is_enabled());
>   
> -	dsb(ish);
>   	mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, start, end);
>   }
>   
> +static inline void __flush_tlb_range(struct vm_area_struct *vma,
> +				     unsigned long start, unsigned long end,
> +				     unsigned long stride, bool last_level,
> +				     int tlb_level)
> +{
> +	__flush_tlb_range_nosync(vma, start, end, stride,
> +				 last_level, tlb_level);
> +	dsb(ish);
> +}
> +
>   static inline void flush_tlb_range(struct vm_area_struct *vma,
>   				   unsigned long start, unsigned long end)
>   {

You're now calling dsb() after 
mmu_notifier_arch_invalidate_secondary_tlbs().


In flush_tlb_mm(), we have the order:

	dsb(ish);	
	mmu_notifier_arch_invalidate_secondary_tlbs()

In flush_tlb_page(), we have the effective order:

	mmu_notifier_arch_invalidate_secondary_tlbs()
	dsb(ish);

In flush_tlb_range(), we used to have the order:

	dsb(ish);
	mmu_notifier_arch_invalidate_secondary_tlbs();


So I *suspect* that no longer having that DSB before 
mmu_notifier_arch_invalidate_secondary_tlbs() is fine. Hopefully, 
nothing in there relies on the old placement.

Maybe worth spelling it out in the patch description.

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH v5 19/25] arm64/mm: Wire up PTE_CONT for user mappings
  2024-02-12 12:00     ` Mark Rutland
  (?)
@ 2024-02-12 12:59       ` Ryan Roberts
  -1 siblings, 0 replies; 240+ messages in thread
From: Ryan Roberts @ 2024-02-12 12:59 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	David Hildenbrand, Kefeng Wang, John Hubbard, Zi Yan, Barry Song,
	Alistair Popple, Yang Shi, Nicholas Piggin, Christophe Leroy,
	Aneesh Kumar K.V, Naveen N. Rao, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, linux-arm-kernel,
	x86, linuxppc-dev, linux-mm, linux-kernel

On 12/02/2024 12:00, Mark Rutland wrote:
> Hi Ryan,
> 
> Overall this looks pretty good; I have a bunch of minor comments below, and a
> bigger question on the way ptep_get_lockless() works.

OK great - thanks for the review. Let's see if I can answer them all...

> 
> On Fri, Feb 02, 2024 at 08:07:50AM +0000, Ryan Roberts wrote:
>> With the ptep API sufficiently refactored, we can now introduce a new
>> "contpte" API layer, which transparently manages the PTE_CONT bit for
>> user mappings.
>>
>> In this initial implementation, only suitable batches of PTEs, set via
>> set_ptes(), are mapped with the PTE_CONT bit. Any subsequent
>> modification of individual PTEs will cause an "unfold" operation to
>> repaint the contpte block as individual PTEs before performing the
>> requested operation. While a modification of a single PTE could cause
>> the block of PTEs to which it belongs to become eligible for "folding"
>> into a contpte entry, "folding" is not performed in this initial
>> implementation due to the costs of checking the requirements are met.
>> Due to this, contpte mappings will degrade back to normal pte mappings
>> over time if/when protections are changed. This will be solved in a
>> future patch.
>>
>> Since a contpte block only has a single access and dirty bit, the
>> semantic here changes slightly; when getting a pte (e.g. ptep_get())
>> that is part of a contpte mapping, the access and dirty information are
>> pulled from the block (so all ptes in the block return the same
>> access/dirty info). When changing the access/dirty info on a pte (e.g.
>> ptep_set_access_flags()) that is part of a contpte mapping, this change
>> will affect the whole contpte block. This works fine in practice
>> since we guarantee that only a single folio is mapped by a contpte
>> block, and the core-mm tracks access/dirty information per folio.
>>
>> In order for the public functions, which used to be pure inline, to
>> continue to be callable by modules, export all the contpte_* symbols
>> that are now called by those public inline functions.
>>
>> The feature is enabled/disabled with the ARM64_CONTPTE Kconfig parameter
>> at build time. It defaults to enabled as long as its dependency,
>> TRANSPARENT_HUGEPAGE is also enabled. The core-mm depends upon
>> TRANSPARENT_HUGEPAGE to be able to allocate large folios, so if it's not
>> enabled, then there is no chance of meeting the physical contiguity
>> requirement for contpte mappings.
>>
>> Tested-by: John Hubbard <jhubbard@nvidia.com>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>  arch/arm64/Kconfig               |   9 +
>>  arch/arm64/include/asm/pgtable.h | 161 ++++++++++++++++++
>>  arch/arm64/mm/Makefile           |   1 +
>>  arch/arm64/mm/contpte.c          | 283 +++++++++++++++++++++++++++++++
>>  4 files changed, 454 insertions(+)
>>  create mode 100644 arch/arm64/mm/contpte.c
>>
>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>> index d86d7f4758b5..1442e8ed95b6 100644
>> --- a/arch/arm64/Kconfig
>> +++ b/arch/arm64/Kconfig
>> @@ -2230,6 +2230,15 @@ config UNWIND_PATCH_PAC_INTO_SCS
>>  	select UNWIND_TABLES
>>  	select DYNAMIC_SCS
>>  
>> +config ARM64_CONTPTE
>> +	bool "Contiguous PTE mappings for user memory" if EXPERT
>> +	depends on TRANSPARENT_HUGEPAGE
>> +	default y
>> +	help
>> +	  When enabled, user mappings are configured using the PTE contiguous
>> +	  bit, for any mappings that meet the size and alignment requirements.
>> +	  This reduces TLB pressure and improves performance.
>> +
>>  endmenu # "Kernel Features"
>>  
>>  menu "Boot options"
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index 7dc6b68ee516..34892a95403d 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -133,6 +133,10 @@ static inline pteval_t __phys_to_pte_val(phys_addr_t phys)
>>   */
>>  #define pte_valid_not_user(pte) \
>>  	((pte_val(pte) & (PTE_VALID | PTE_USER | PTE_UXN)) == (PTE_VALID | PTE_UXN))
>> +/*
>> + * Returns true if the pte is valid and has the contiguous bit set.
>> + */
>> +#define pte_valid_cont(pte)	(pte_valid(pte) && pte_cont(pte))
>>  /*
>>   * Could the pte be present in the TLB? We must check mm_tlb_flush_pending
>>   * so that we don't erroneously return false for pages that have been
>> @@ -1135,6 +1139,161 @@ void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte);
>>  #define vmemmap_update_pte vmemmap_update_pte
>>  #endif
>>  
>> +#ifdef CONFIG_ARM64_CONTPTE
>> +
>> +/*
>> + * The contpte APIs are used to transparently manage the contiguous bit in ptes
>> + * where it is possible and makes sense to do so. The PTE_CONT bit is considered
>> + * a private implementation detail of the public ptep API (see below).
>> + */
>> +extern void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>> +				pte_t *ptep, pte_t pte);
>> +extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
>> +extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
>> +extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
>> +				pte_t *ptep, pte_t pte, unsigned int nr);
>> +extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>> +				unsigned long addr, pte_t *ptep);
>> +extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>> +				unsigned long addr, pte_t *ptep);
>> +extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
>> +				unsigned long addr, pte_t *ptep,
>> +				pte_t entry, int dirty);
>> +
>> +static inline void contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>> +					pte_t *ptep, pte_t pte)
>> +{
>> +	if (unlikely(pte_valid_cont(pte)))
>> +		__contpte_try_unfold(mm, addr, ptep, pte);
>> +}
>> +
>> +/*
>> + * The below functions constitute the public API that arm64 presents to the
>> + * core-mm to manipulate PTE entries within their page tables (or at least this
>> + * is the subset of the API that arm64 needs to implement). These public
>> + * versions will automatically and transparently apply the contiguous bit where
>> + * it makes sense to do so. Therefore any users that are contig-aware (e.g.
>> + * hugetlb, kernel mapper) should NOT use these APIs, but instead use the
>> + * private versions, which are prefixed with double underscore. All of these
>> + * APIs except for ptep_get_lockless() are expected to be called with the PTL
>> + * held.
>> + */
>> +
>> +#define ptep_get ptep_get
>> +static inline pte_t ptep_get(pte_t *ptep)
>> +{
>> +	pte_t pte = __ptep_get(ptep);
>> +
>> +	if (likely(!pte_valid_cont(pte)))
>> +		return pte;
>> +
>> +	return contpte_ptep_get(ptep, pte);
>> +}
>> +
>> +#define ptep_get_lockless ptep_get_lockless
>> +static inline pte_t ptep_get_lockless(pte_t *ptep)
>> +{
>> +	pte_t pte = __ptep_get(ptep);
>> +
>> +	if (likely(!pte_valid_cont(pte)))
>> +		return pte;
>> +
>> +	return contpte_ptep_get_lockless(ptep);
>> +}
>> +
>> +static inline void set_pte(pte_t *ptep, pte_t pte)
>> +{
>> +	/*
>> +	 * We don't have the mm or vaddr so cannot unfold contig entries (since
>> +	 * it requires tlb maintenance). set_pte() is not used in core code, so
>> +	 * this should never even be called. Regardless do our best to service
>> +	 * any call and emit a warning if there is any attempt to set a pte on
>> +	 * top of an existing contig range.
>> +	 */
>> +	pte_t orig_pte = __ptep_get(ptep);
>> +
>> +	WARN_ON_ONCE(pte_valid_cont(orig_pte));
>> +	__set_pte(ptep, pte_mknoncont(pte));
>> +}
>> +
>> +#define set_ptes set_ptes
>> +static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
>> +				pte_t *ptep, pte_t pte, unsigned int nr)
>> +{
>> +	pte = pte_mknoncont(pte);
> 
> Why do we have to clear the contiguous bit here? Is that for the same reason as
> set_pte(), or do we expect callers to legitimately call this with the
> contiguous bit set in 'pte'?
> 
> I think you explained this to me in-person, and IIRC we don't expect callers to
> go set the bit themselves, but since it 'leaks' out to them via __ptep_get() we
> have to clear it here to defer the decision of whether to set/clear it when
> modifying entries. It would be nice if we could have a description of why/when
> we need to clear this, e.g. in the 'public API' comment block above.

Yes, I think you've got it, but just to ram home the point: The PTE_CONT bit is
private to the architecture code and is never set directly by core code. If the
public API ever receives a pte that happens to have the PTE_CONT bit set, it
would be bad news if we then accidentally set that in the pgtable.

Ideally, we would just unconditionally clear the bit before a getter returns
the pte (e.g. ptep_get(), ptep_get_lockless(), ptep_get_and_clear(), ...). That
way, the core code is guaranteed never to see a pte with the PTE_CONT bit set
and can therefore never accidentally pass such a pte into a setter function.
However, there is existing functionality that relies on being able to get a pte,
then pass it to pte_leaf_size(), an arch function that checks the PTE_CONT bit
to determine how big the leaf is. This is used in perf_get_pgtable_size().

So to allow perf_get_pgtable_size() to continue to see the "real" page size, I
decided to allow PTE_CONT to leak through the getters and instead
unconditionally clear the bit when a pte is passed to any of the setters.
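
(For reference, IIRC arm64's pte_leaf_size() is roughly the below, which is why
the bit has to stay visible in the value the getters return:

	#define pte_leaf_size(pte)	(pte_cont(pte) ? CONT_PTE_SIZE : PAGE_SIZE)

so hiding PTE_CONT inside the getters would make perf report the wrong leaf
size.)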

I'll add a (slightly less verbose) comment as you suggest.

> 
>> +
>> +	if (likely(nr == 1)) {
>> +		contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>> +		__set_ptes(mm, addr, ptep, pte, 1);
>> +	} else {
>> +		contpte_set_ptes(mm, addr, ptep, pte, nr);
>> +	}
>> +}
>> +
>> +static inline void pte_clear(struct mm_struct *mm,
>> +				unsigned long addr, pte_t *ptep)
>> +{
>> +	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>> +	__pte_clear(mm, addr, ptep);
>> +}
>> +
>> +#define __HAVE_ARCH_PTEP_GET_AND_CLEAR
>> +static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>> +				unsigned long addr, pte_t *ptep)
>> +{
>> +	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>> +	return __ptep_get_and_clear(mm, addr, ptep);
>> +}
>> +
>> +#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
>> +static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>> +				unsigned long addr, pte_t *ptep)
>> +{
>> +	pte_t orig_pte = __ptep_get(ptep);
>> +
>> +	if (likely(!pte_valid_cont(orig_pte)))
>> +		return __ptep_test_and_clear_young(vma, addr, ptep);
>> +
>> +	return contpte_ptep_test_and_clear_young(vma, addr, ptep);
>> +}
>> +
>> +#define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
>> +static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>> +				unsigned long addr, pte_t *ptep)
>> +{
>> +	pte_t orig_pte = __ptep_get(ptep);
>> +
>> +	if (likely(!pte_valid_cont(orig_pte)))
>> +		return __ptep_clear_flush_young(vma, addr, ptep);
>> +
>> +	return contpte_ptep_clear_flush_young(vma, addr, ptep);
>> +}
>> +
>> +#define __HAVE_ARCH_PTEP_SET_WRPROTECT
>> +static inline void ptep_set_wrprotect(struct mm_struct *mm,
>> +				unsigned long addr, pte_t *ptep)
>> +{
>> +	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>> +	__ptep_set_wrprotect(mm, addr, ptep);
>> +}
>> +
>> +#define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
>> +static inline int ptep_set_access_flags(struct vm_area_struct *vma,
>> +				unsigned long addr, pte_t *ptep,
>> +				pte_t entry, int dirty)
>> +{
>> +	pte_t orig_pte = __ptep_get(ptep);
>> +
>> +	entry = pte_mknoncont(entry);
>> +
>> +	if (likely(!pte_valid_cont(orig_pte)))
>> +		return __ptep_set_access_flags(vma, addr, ptep, entry, dirty);
>> +
>> +	return contpte_ptep_set_access_flags(vma, addr, ptep, entry, dirty);
>> +}
>> +
>> +#else /* CONFIG_ARM64_CONTPTE */
>> +
>>  #define ptep_get				__ptep_get
>>  #define set_pte					__set_pte
>>  #define set_ptes				__set_ptes
>> @@ -1150,6 +1309,8 @@ void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte);
>>  #define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
>>  #define ptep_set_access_flags			__ptep_set_access_flags
>>  
>> +#endif /* CONFIG_ARM64_CONTPTE */
>> +
>>  #endif /* !__ASSEMBLY__ */
>>  
>>  #endif /* __ASM_PGTABLE_H */
>> diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
>> index dbd1bc95967d..60454256945b 100644
>> --- a/arch/arm64/mm/Makefile
>> +++ b/arch/arm64/mm/Makefile
>> @@ -3,6 +3,7 @@ obj-y				:= dma-mapping.o extable.o fault.o init.o \
>>  				   cache.o copypage.o flush.o \
>>  				   ioremap.o mmap.o pgd.o mmu.o \
>>  				   context.o proc.o pageattr.o fixmap.o
>> +obj-$(CONFIG_ARM64_CONTPTE)	+= contpte.o
>>  obj-$(CONFIG_HUGETLB_PAGE)	+= hugetlbpage.o
>>  obj-$(CONFIG_PTDUMP_CORE)	+= ptdump.o
>>  obj-$(CONFIG_PTDUMP_DEBUGFS)	+= ptdump_debugfs.o
>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>> new file mode 100644
>> index 000000000000..bfb50e6b44c7
>> --- /dev/null
>> +++ b/arch/arm64/mm/contpte.c
>> @@ -0,0 +1,283 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +/*
>> + * Copyright (C) 2023 ARM Ltd.
>> + */
>> +
>> +#include <linux/mm.h>
>> +#include <linux/export.h>
>> +#include <asm/tlbflush.h>
>> +
>> +static inline bool mm_is_user(struct mm_struct *mm)
>> +{
>> +	/*
>> +	 * Don't attempt to apply the contig bit to kernel mappings, because
>> +	 * dynamically adding/removing the contig bit can cause page faults.
>> +	 * These racing faults are ok for user space, since they get serialized
>> +	 * on the PTL. But kernel mappings can't tolerate faults.
>> +	 */
>> +	return mm != &init_mm;
>> +}
> 
> We also have the efi_mm as a non-user mm, though I don't think we manipulate
> that while it is live, and I'm not sure if that needs any special handling.

Well, we never need this function in the hot (order-0 folio) path, so I think I
could add a check for efi_mm here without any performance implication. It's
probably safest to explicitly exclude it? What do you think?
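
Something like this, perhaps (just a sketch; efi_mm is declared in
<linux/efi.h>, and the IS_ENABLED() guard is an assumption about how to keep
!CONFIG_EFI builds happy):

static inline bool mm_is_user(struct mm_struct *mm)
{
	/*
	 * Kernel mappings and the efi runtime mappings can't tolerate the
	 * transient faults that folding/unfolding the contig bit can cause,
	 * so exclude both.
	 */
	if (IS_ENABLED(CONFIG_EFI) && mm == &efi_mm)
		return false;
	return mm != &init_mm;
}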

> 
>> +static inline pte_t *contpte_align_down(pte_t *ptep)
>> +{
>> +	return (pte_t *)(ALIGN_DOWN((unsigned long)ptep >> 3, CONT_PTES) << 3);
> 
> I think this can be:
> 
> static inline pte_t *contpte_align_down(pte_t *ptep)
> {
> 	return PTR_ALIGN_DOWN(ptep, sizeof(*ptep) * CONT_PTES);
> }

Yep - that's much less ugly - thanks!

> 
>> +
>> +static void contpte_convert(struct mm_struct *mm, unsigned long addr,
>> +			    pte_t *ptep, pte_t pte)
>> +{
>> +	struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
>> +	unsigned long start_addr;
>> +	pte_t *start_ptep;
>> +	int i;
>> +
>> +	start_ptep = ptep = contpte_align_down(ptep);
>> +	start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>> +	pte = pfn_pte(ALIGN_DOWN(pte_pfn(pte), CONT_PTES), pte_pgprot(pte));
>> +
>> +	for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE) {
>> +		pte_t ptent = __ptep_get_and_clear(mm, addr, ptep);
>> +
>> +		if (pte_dirty(ptent))
>> +			pte = pte_mkdirty(pte);
>> +
>> +		if (pte_young(ptent))
>> +			pte = pte_mkyoung(pte);
>> +	}
> 
> Not a big deal either way, but I wonder if it makes more sense to accumulate
> the 'ptent' dirty/young values, then modify 'pte' once, i.e.
> 
> 	bool dirty = false, young = false;
> 
> 	for (...) {
> 		pte_t ptent = __ptep_get_and_clear(mm, addr, ptep);
> 		dirty |= pte_dirty(ptent);
> 		young |= pte_young(ptent);
> 	}
> 
> 	if (dirty)
> 		pte = pte_mkdirty(pte);
> 	if (young)
> 		pte = pte_mkyoung(pte);
> 
> I suspect that might generate slightly better code, but I'm also happy with the
> current form if people think that's more legible (I have no strong feelings
> either way).

I kept it this way, because it's the same pattern used in arm64's hugetlbpage.c.
We also had the same comment against David's batching patches recently, and he
opted to stick with the former version:

https://lore.kernel.org/linux-mm/d83309fa-4daa-430f-ae52-4e72162bca9a@redhat.com/

So I'm inclined to leave it as is, since you're not insisting :)

> 
>> +
>> +	__flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
>> +
>> +	__set_ptes(mm, start_addr, start_ptep, pte, CONT_PTES);
>> +}
>> +
>> +void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>> +			pte_t *ptep, pte_t pte)
>> +{
>> +	/*
>> +	 * We have already checked that the ptes are contiguous in
>> +	 * contpte_try_unfold(), so just check that the mm is user space.
>> +	 */
>> +
>> +	if (!mm_is_user(mm))
>> +		return;
> 
> Nit: normally we don't put a line gap between a comment block and the
> associated block of code.

ACK, I'll fix in next version.

> 
>> +
>> +	pte = pte_mknoncont(pte);
>> +	contpte_convert(mm, addr, ptep, pte);
>> +}
>> +EXPORT_SYMBOL(__contpte_try_unfold);
>> +
>> +pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>> +{
>> +	/*
>> +	 * Gather access/dirty bits, which may be populated in any of the ptes
>> +	 * of the contig range. We are guarranteed to be holding the PTL, so any
>> +	 * contiguous range cannot be unfolded or otherwise modified under our
>> +	 * feet.
>> +	 */
> 
> Nit: s/guarranteed/guaranteed/

ACK, I'll fix in next version.

> 
>> +
>> +	pte_t pte;
>> +	int i;
>> +
>> +	ptep = contpte_align_down(ptep);
>> +
>> +	for (i = 0; i < CONT_PTES; i++, ptep++) {
>> +		pte = __ptep_get(ptep);
>> +
>> +		if (pte_dirty(pte))
>> +			orig_pte = pte_mkdirty(orig_pte);
>> +
>> +		if (pte_young(pte))
>> +			orig_pte = pte_mkyoung(orig_pte);
>> +	}
>> +
>> +	return orig_pte;
>> +}
>> +EXPORT_SYMBOL(contpte_ptep_get);
>> +
>> +pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
>> +{
>> +	/*
>> +	 * Gather access/dirty bits, which may be populated in any of the ptes
>> +	 * of the contig range. We may not be holding the PTL, so any contiguous
>> +	 * range may be unfolded/modified/refolded under our feet. Therefore we
>> +	 * ensure we read a _consistent_ contpte range by checking that all ptes
>> +	 * in the range are valid and have CONT_PTE set, that all pfns are
>> +	 * contiguous and that all pgprots are the same (ignoring access/dirty).
>> +	 * If we find a pte that is not consistent, then we must be racing with
>> +	 * an update so start again. If the target pte does not have CONT_PTE
>> +	 * set then that is considered consistent on its own because it is not
>> +	 * part of a contpte range.
>> +	 */
>> +
>> +	pgprot_t orig_prot;
>> +	unsigned long pfn;
>> +	pte_t orig_pte;
>> +	pgprot_t prot;
>> +	pte_t *ptep;
>> +	pte_t pte;
>> +	int i;
>> +
>> +retry:
>> +	orig_pte = __ptep_get(orig_ptep);
>> +
>> +	if (!pte_valid_cont(orig_pte))
>> +		return orig_pte;
>> +
>> +	orig_prot = pte_pgprot(pte_mkold(pte_mkclean(orig_pte)));
>> +	ptep = contpte_align_down(orig_ptep);
>> +	pfn = pte_pfn(orig_pte) - (orig_ptep - ptep);
>> +
>> +	for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
>> +		pte = __ptep_get(ptep);
>> +		prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
>> +
>> +		if (!pte_valid_cont(pte) ||
>> +		   pte_pfn(pte) != pfn ||
>> +		   pgprot_val(prot) != pgprot_val(orig_prot))
>> +			goto retry;
>> +
>> +		if (pte_dirty(pte))
>> +			orig_pte = pte_mkdirty(orig_pte);
>> +
>> +		if (pte_young(pte))
>> +			orig_pte = pte_mkyoung(orig_pte);
>> +	}
>> +
>> +	return orig_pte;
>> +}
>> +EXPORT_SYMBOL(contpte_ptep_get_lockless);
> 
> I'm struggling to convince myself that this is safe in general, as it really
> depends on how the caller will use this value. Which caller(s) actually care
> about the access/dirty bits, given those could change at any time anyway?

I think your points below are valid, and I agree we should try to make this work
without needing access/dirty if possible. But can you elaborate on why you don't
think it's safe?

> 
> I took a quick scan, and AFAICT:

Thanks for enumerating these; it saves me from having to refresh my memory :)

> 
> * For perf_get_pgtable_size(), we only care about whether the entry is valid
>   and has the contig bit set. We could clean that up with a new interface, e.g.
>   something like a new ptep_get_size_lockless().
> 
> * For gup_pte_range(), I'm not sure we actually need the access/dirty bits when
>   we look at the pte to start with, since we only care where we can logically
>   write to the page at that point.
> 
>   I see that we later follow up with:
> 
>     pte_val(pte) != pte_val(ptep_get(ptep))
> 
>   ... is that why we need ptep_get_lockless() to accumulate the access/dirty
>   bits? So that shape of lockless-try...locked-compare sequence works?
> 
> * For huge_pte_alloc(), arm64 doesn't select CONFIG_ARCH_WANT_GENERAL_HUGETLB,
>   so this doesn't seem to matter.
> 
> * For __collapse_huge_page_swapin(), we only care if the pte is a swap pte,
>   which means the pte isn't valid, and we'll return the orig_pte as-is anyway.
> 
> * For pte_range_none() the access/dirty bits don't matter.
> 
> * For handle_pte_fault() I think we have the same shape of
>   lockless-try...locked-compare sequence as for gup_pte_range(), where we don't
>   care about the access/dirty bits before we reach the locked compare step.
> 
> * For ptdump_pte_entry() I think it's arguable that we should continue to
>   report the access/dirty bits separately for each PTE, as we have done until
>   now, to give an accurate representation of the contents of the translation
>   tables.
> 
> * For swap_vma_readahead() and unuse_pte_range() we only care if the PTE is a
>   swap entry, the access/dirty bits don't matter.
> 
> So AFAICT this only really matters for gup_pte_range() and handle_pte_fault(),
> and IIUC that's only so that the locklessly-loaded pte value can be compared
> with a subsequently locked-loaded entry (for which the access/dirty bits will
> be accumulated). Have I understood that correctly?

Yes, I agree with what you are saying. My approach was to try to implement the
existing APIs accurately though, the argument being that it reduces the chances
of getting it wrong. But if you think the implementation is unsafe, then I guess
it blows that out of the water...

> 
> If so, I wonder if we could instead do that comparison modulo the access/dirty
> bits, 

I think that would work - but will need to think a bit more on it.
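
Something along these lines, maybe (the helper name is purely illustrative):

static inline bool pte_same_ignoring_accessdirty(pte_t a, pte_t b)
{
	/* compare two ptes with the access/dirty bits masked out */
	a = pte_mkold(pte_mkclean(a));
	b = pte_mkold(pte_mkclean(b));
	return pte_val(a) == pte_val(b);
}

i.e. gup_pte_range()/handle_pte_fault() would do that comparison instead of a
raw pte_val()/pte_same() check against ptep_get().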

> and leave ptep_get_lockless() only reading a single entry?

I think we will need to do something a bit less fragile. ptep_get() does collect
the access/dirty bits so it's confusing if ptep_get_lockless() doesn't IMHO. So
we will likely want to rename the function and make its documentation explicit
that it does not return those bits.

ptep_get_lockless_noyoungdirty()? yuk... Any ideas?

Of course if I could convince you the current implementation is safe, I might be
able to sidestep this optimization until a later date?

Thanks,
Ryan


> 
> Thanks,
> Mark.
> 
>> +void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
>> +					pte_t *ptep, pte_t pte, unsigned int nr)
>> +{
>> +	unsigned long next;
>> +	unsigned long end;
>> +	unsigned long pfn;
>> +	pgprot_t prot;
>> +
>> +	/*
>> +	 * The set_ptes() spec guarantees that when nr > 1, the initial state of
>> +	 * all ptes is not-present. Therefore we never need to unfold or
>> +	 * otherwise invalidate a range before we set the new ptes.
>> +	 * contpte_set_ptes() should never be called for nr < 2.
>> +	 */
>> +	VM_WARN_ON(nr == 1);
>> +
>> +	if (!mm_is_user(mm))
>> +		return __set_ptes(mm, addr, ptep, pte, nr);
>> +
>> +	end = addr + (nr << PAGE_SHIFT);
>> +	pfn = pte_pfn(pte);
>> +	prot = pte_pgprot(pte);
>> +
>> +	do {
>> +		next = pte_cont_addr_end(addr, end);
>> +		nr = (next - addr) >> PAGE_SHIFT;
>> +		pte = pfn_pte(pfn, prot);
>> +
>> +		if (((addr | next | (pfn << PAGE_SHIFT)) & ~CONT_PTE_MASK) == 0)
>> +			pte = pte_mkcont(pte);
>> +		else
>> +			pte = pte_mknoncont(pte);
>> +
>> +		__set_ptes(mm, addr, ptep, pte, nr);
>> +
>> +		addr = next;
>> +		ptep += nr;
>> +		pfn += nr;
>> +
>> +	} while (addr != end);
>> +}
>> +EXPORT_SYMBOL(contpte_set_ptes);
>> +
>> +int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>> +					unsigned long addr, pte_t *ptep)
>> +{
>> +	/*
>> +	 * ptep_clear_flush_young() technically requires us to clear the access
>> +	 * flag for a _single_ pte. However, the core-mm code actually tracks
>> +	 * access/dirty per folio, not per page. And since we only create a
>> +	 * contig range when the range is covered by a single folio, we can get
>> +	 * away with clearing young for the whole contig range here, so we avoid
>> +	 * having to unfold.
>> +	 */
>> +
>> +	int young = 0;
>> +	int i;
>> +
>> +	ptep = contpte_align_down(ptep);
>> +	addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>> +
>> +	for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
>> +		young |= __ptep_test_and_clear_young(vma, addr, ptep);
>> +
>> +	return young;
>> +}
>> +EXPORT_SYMBOL(contpte_ptep_test_and_clear_young);
>> +
>> +int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>> +					unsigned long addr, pte_t *ptep)
>> +{
>> +	int young;
>> +
>> +	young = contpte_ptep_test_and_clear_young(vma, addr, ptep);
>> +
>> +	if (young) {
>> +		/*
>> +		 * See comment in __ptep_clear_flush_young(); same rationale for
>> +		 * eliding the trailing DSB applies here.
>> +		 */
>> +		addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>> +		__flush_tlb_range_nosync(vma, addr, addr + CONT_PTE_SIZE,
>> +					 PAGE_SIZE, true, 3);
>> +	}
>> +
>> +	return young;
>> +}
>> +EXPORT_SYMBOL(contpte_ptep_clear_flush_young);
>> +
>> +int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
>> +					unsigned long addr, pte_t *ptep,
>> +					pte_t entry, int dirty)
>> +{
>> +	unsigned long start_addr;
>> +	pte_t orig_pte;
>> +	int i;
>> +
>> +	/*
>> +	 * Gather the access/dirty bits for the contiguous range. If nothing has
>> +	 * changed, its a noop.
>> +	 */
>> +	orig_pte = pte_mknoncont(ptep_get(ptep));
>> +	if (pte_val(orig_pte) == pte_val(entry))
>> +		return 0;
>> +
>> +	/*
>> +	 * We can fix up access/dirty bits without having to unfold the contig
>> +	 * range. But if the write bit is changing, we must unfold.
>> +	 */
>> +	if (pte_write(orig_pte) == pte_write(entry)) {
>> +		/*
>> +		 * For HW access management, we technically only need to update
>> +		 * the flag on a single pte in the range. But for SW access
>> +		 * management, we need to update all the ptes to prevent extra
>> +		 * faults. Avoid per-page tlb flush in __ptep_set_access_flags()
>> +		 * and instead flush the whole range at the end.
>> +		 */
>> +		ptep = contpte_align_down(ptep);
>> +		start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>> +
>> +		for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
>> +			__ptep_set_access_flags(vma, addr, ptep, entry, 0);
>> +
>> +		if (dirty)
>> +			__flush_tlb_range(vma, start_addr, addr,
>> +							PAGE_SIZE, true, 3);
>> +	} else {
>> +		__contpte_try_unfold(vma->vm_mm, addr, ptep, orig_pte);
>> +		__ptep_set_access_flags(vma, addr, ptep, entry, dirty);
>> +	}
>> +
>> +	return 1;
>> +}
>> +EXPORT_SYMBOL(contpte_ptep_set_access_flags);
>> -- 
>> 2.25.1
>>


^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH v5 19/25] arm64/mm: Wire up PTE_CONT for user mappings
@ 2024-02-12 12:59       ` Ryan Roberts
  0 siblings, 0 replies; 240+ messages in thread
From: Ryan Roberts @ 2024-02-12 12:59 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	David Hildenbrand, Kefeng Wang, John Hubbard, Zi Yan, Barry Song,
	Alistair Popple, Yang Shi, Nicholas Piggin, Christophe Leroy,
	Aneesh Kumar K.V, Naveen N. Rao, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, linux-arm-kernel,
	x86, linuxppc-dev, linux-mm, linux-kernel

On 12/02/2024 12:00, Mark Rutland wrote:
> Hi Ryan,
> 
> Overall this looks pretty good; I have a bunch of minor comments below, and a
> bigger question on the way ptep_get_lockless() works.

OK great - thanks for the review. Let's see if I can answer them all...

> 
> On Fri, Feb 02, 2024 at 08:07:50AM +0000, Ryan Roberts wrote:
>> With the ptep API sufficiently refactored, we can now introduce a new
>> "contpte" API layer, which transparently manages the PTE_CONT bit for
>> user mappings.
>>
>> In this initial implementation, only suitable batches of PTEs, set via
>> set_ptes(), are mapped with the PTE_CONT bit. Any subsequent
>> modification of individual PTEs will cause an "unfold" operation to
>> repaint the contpte block as individual PTEs before performing the
>> requested operation. While, a modification of a single PTE could cause
>> the block of PTEs to which it belongs to become eligible for "folding"
>> into a contpte entry, "folding" is not performed in this initial
>> implementation due to the costs of checking the requirements are met.
>> Due to this, contpte mappings will degrade back to normal pte mappings
>> over time if/when protections are changed. This will be solved in a
>> future patch.
>>
>> Since a contpte block only has a single access and dirty bit, the
>> semantic here changes slightly; when getting a pte (e.g. ptep_get())
>> that is part of a contpte mapping, the access and dirty information are
>> pulled from the block (so all ptes in the block return the same
>> access/dirty info). When changing the access/dirty info on a pte (e.g.
>> ptep_set_access_flags()) that is part of a contpte mapping, this change
>> will affect the whole contpte block. This is works fine in practice
>> since we guarantee that only a single folio is mapped by a contpte
>> block, and the core-mm tracks access/dirty information per folio.
>>
>> In order for the public functions, which used to be pure inline, to
>> continue to be callable by modules, export all the contpte_* symbols
>> that are now called by those public inline functions.
>>
>> The feature is enabled/disabled with the ARM64_CONTPTE Kconfig parameter
>> at build time. It defaults to enabled as long as its dependency,
>> TRANSPARENT_HUGEPAGE is also enabled. The core-mm depends upon
>> TRANSPARENT_HUGEPAGE to be able to allocate large folios, so if its not
>> enabled, then there is no chance of meeting the physical contiguity
>> requirement for contpte mappings.
>>
>> Tested-by: John Hubbard <jhubbard@nvidia.com>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>  arch/arm64/Kconfig               |   9 +
>>  arch/arm64/include/asm/pgtable.h | 161 ++++++++++++++++++
>>  arch/arm64/mm/Makefile           |   1 +
>>  arch/arm64/mm/contpte.c          | 283 +++++++++++++++++++++++++++++++
>>  4 files changed, 454 insertions(+)
>>  create mode 100644 arch/arm64/mm/contpte.c
>>
>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>> index d86d7f4758b5..1442e8ed95b6 100644
>> --- a/arch/arm64/Kconfig
>> +++ b/arch/arm64/Kconfig
>> @@ -2230,6 +2230,15 @@ config UNWIND_PATCH_PAC_INTO_SCS
>>  	select UNWIND_TABLES
>>  	select DYNAMIC_SCS
>>  
>> +config ARM64_CONTPTE
>> +	bool "Contiguous PTE mappings for user memory" if EXPERT
>> +	depends on TRANSPARENT_HUGEPAGE
>> +	default y
>> +	help
>> +	  When enabled, user mappings are configured using the PTE contiguous
>> +	  bit, for any mappings that meet the size and alignment requirements.
>> +	  This reduces TLB pressure and improves performance.
>> +
>>  endmenu # "Kernel Features"
>>  
>>  menu "Boot options"
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index 7dc6b68ee516..34892a95403d 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -133,6 +133,10 @@ static inline pteval_t __phys_to_pte_val(phys_addr_t phys)
>>   */
>>  #define pte_valid_not_user(pte) \
>>  	((pte_val(pte) & (PTE_VALID | PTE_USER | PTE_UXN)) == (PTE_VALID | PTE_UXN))
>> +/*
>> + * Returns true if the pte is valid and has the contiguous bit set.
>> + */
>> +#define pte_valid_cont(pte)	(pte_valid(pte) && pte_cont(pte))
>>  /*
>>   * Could the pte be present in the TLB? We must check mm_tlb_flush_pending
>>   * so that we don't erroneously return false for pages that have been
>> @@ -1135,6 +1139,161 @@ void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte);
>>  #define vmemmap_update_pte vmemmap_update_pte
>>  #endif
>>  
>> +#ifdef CONFIG_ARM64_CONTPTE
>> +
>> +/*
>> + * The contpte APIs are used to transparently manage the contiguous bit in ptes
>> + * where it is possible and makes sense to do so. The PTE_CONT bit is considered
>> + * a private implementation detail of the public ptep API (see below).
>> + */
>> +extern void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>> +				pte_t *ptep, pte_t pte);
>> +extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
>> +extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
>> +extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
>> +				pte_t *ptep, pte_t pte, unsigned int nr);
>> +extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>> +				unsigned long addr, pte_t *ptep);
>> +extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>> +				unsigned long addr, pte_t *ptep);
>> +extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
>> +				unsigned long addr, pte_t *ptep,
>> +				pte_t entry, int dirty);
>> +
>> +static inline void contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>> +					pte_t *ptep, pte_t pte)
>> +{
>> +	if (unlikely(pte_valid_cont(pte)))
>> +		__contpte_try_unfold(mm, addr, ptep, pte);
>> +}
>> +
>> +/*
>> + * The below functions constitute the public API that arm64 presents to the
>> + * core-mm to manipulate PTE entries within their page tables (or at least this
>> + * is the subset of the API that arm64 needs to implement). These public
>> + * versions will automatically and transparently apply the contiguous bit where
>> + * it makes sense to do so. Therefore any users that are contig-aware (e.g.
>> + * hugetlb, kernel mapper) should NOT use these APIs, but instead use the
>> + * private versions, which are prefixed with double underscore. All of these
>> + * APIs except for ptep_get_lockless() are expected to be called with the PTL
>> + * held.
>> + */
>> +
>> +#define ptep_get ptep_get
>> +static inline pte_t ptep_get(pte_t *ptep)
>> +{
>> +	pte_t pte = __ptep_get(ptep);
>> +
>> +	if (likely(!pte_valid_cont(pte)))
>> +		return pte;
>> +
>> +	return contpte_ptep_get(ptep, pte);
>> +}
>> +
>> +#define ptep_get_lockless ptep_get_lockless
>> +static inline pte_t ptep_get_lockless(pte_t *ptep)
>> +{
>> +	pte_t pte = __ptep_get(ptep);
>> +
>> +	if (likely(!pte_valid_cont(pte)))
>> +		return pte;
>> +
>> +	return contpte_ptep_get_lockless(ptep);
>> +}
>> +
>> +static inline void set_pte(pte_t *ptep, pte_t pte)
>> +{
>> +	/*
>> +	 * We don't have the mm or vaddr so cannot unfold contig entries (since
>> +	 * it requires tlb maintenance). set_pte() is not used in core code, so
>> +	 * this should never even be called. Regardless do our best to service
>> +	 * any call and emit a warning if there is any attempt to set a pte on
>> +	 * top of an existing contig range.
>> +	 */
>> +	pte_t orig_pte = __ptep_get(ptep);
>> +
>> +	WARN_ON_ONCE(pte_valid_cont(orig_pte));
>> +	__set_pte(ptep, pte_mknoncont(pte));
>> +}
>> +
>> +#define set_ptes set_ptes
>> +static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
>> +				pte_t *ptep, pte_t pte, unsigned int nr)
>> +{
>> +	pte = pte_mknoncont(pte);
> 
> Why do we have to clear the contiguous bit here? Is that for the same reason as
> set_pte(), or do we expect callers to legitimately call this with the
> contiguous bit set in 'pte'?
> 
> I think you explained this to me in-person, and IIRC we don't expect callers to
> go set the bit themselves, but since it 'leaks' out to them via __ptep_get() we
> have to clear it here to defer the decision of whether to set/clear it when
> modifying entries. It would be nice if we could have a description of why/when
> we need to clear this, e.g. in the 'public API' comment block above.

Yes, I think you've got it, but just to ram home the point: The PTE_CONT bit is
private to the architecture code and is never set directly by core code. If the
public API ever receives a pte that happens to have the PTE_CONT bit set, it
would be bad news if we then accidentally set that in the pgtable.

Ideally, we would just uncondidtionally clear the bit before a getter returns
the pte (e.g. ptep_get(), ptep_get_lockless(), ptep_get_and_clear(), ...). That
way, the code code is guarranteed never to see a pte with the PTE_CONT bit set
and can therefore never accidentally pass such a pte into a setter function.
However, there is existing functionality that relies on being able to get a pte,
then pass it to pte_leaf_size(), and arch function that checks the PTE_CONT bit
to determine how big the leaf is. This is used in perf_get_pgtable_size().

So to allow perf_get_pgtable_size() to continue to see the "real" page size, I
decided to allow PTE_CONT to leak through the getters and instead
unconditionally clear the bit when a pte is passed to any of the setters.

I'll add a (slightly less verbose) comment as you suggest.

> 
>> +
>> +	if (likely(nr == 1)) {
>> +		contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>> +		__set_ptes(mm, addr, ptep, pte, 1);
>> +	} else {
>> +		contpte_set_ptes(mm, addr, ptep, pte, nr);
>> +	}
>> +}
>> +
>> +static inline void pte_clear(struct mm_struct *mm,
>> +				unsigned long addr, pte_t *ptep)
>> +{
>> +	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>> +	__pte_clear(mm, addr, ptep);
>> +}
>> +
>> +#define __HAVE_ARCH_PTEP_GET_AND_CLEAR
>> +static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>> +				unsigned long addr, pte_t *ptep)
>> +{
>> +	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>> +	return __ptep_get_and_clear(mm, addr, ptep);
>> +}
>> +
>> +#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
>> +static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>> +				unsigned long addr, pte_t *ptep)
>> +{
>> +	pte_t orig_pte = __ptep_get(ptep);
>> +
>> +	if (likely(!pte_valid_cont(orig_pte)))
>> +		return __ptep_test_and_clear_young(vma, addr, ptep);
>> +
>> +	return contpte_ptep_test_and_clear_young(vma, addr, ptep);
>> +}
>> +
>> +#define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
>> +static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>> +				unsigned long addr, pte_t *ptep)
>> +{
>> +	pte_t orig_pte = __ptep_get(ptep);
>> +
>> +	if (likely(!pte_valid_cont(orig_pte)))
>> +		return __ptep_clear_flush_young(vma, addr, ptep);
>> +
>> +	return contpte_ptep_clear_flush_young(vma, addr, ptep);
>> +}
>> +
>> +#define __HAVE_ARCH_PTEP_SET_WRPROTECT
>> +static inline void ptep_set_wrprotect(struct mm_struct *mm,
>> +				unsigned long addr, pte_t *ptep)
>> +{
>> +	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>> +	__ptep_set_wrprotect(mm, addr, ptep);
>> +}
>> +
>> +#define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
>> +static inline int ptep_set_access_flags(struct vm_area_struct *vma,
>> +				unsigned long addr, pte_t *ptep,
>> +				pte_t entry, int dirty)
>> +{
>> +	pte_t orig_pte = __ptep_get(ptep);
>> +
>> +	entry = pte_mknoncont(entry);
>> +
>> +	if (likely(!pte_valid_cont(orig_pte)))
>> +		return __ptep_set_access_flags(vma, addr, ptep, entry, dirty);
>> +
>> +	return contpte_ptep_set_access_flags(vma, addr, ptep, entry, dirty);
>> +}
>> +
>> +#else /* CONFIG_ARM64_CONTPTE */
>> +
>>  #define ptep_get				__ptep_get
>>  #define set_pte					__set_pte
>>  #define set_ptes				__set_ptes
>> @@ -1150,6 +1309,8 @@ void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte);
>>  #define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
>>  #define ptep_set_access_flags			__ptep_set_access_flags
>>  
>> +#endif /* CONFIG_ARM64_CONTPTE */
>> +
>>  #endif /* !__ASSEMBLY__ */
>>  
>>  #endif /* __ASM_PGTABLE_H */
>> diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
>> index dbd1bc95967d..60454256945b 100644
>> --- a/arch/arm64/mm/Makefile
>> +++ b/arch/arm64/mm/Makefile
>> @@ -3,6 +3,7 @@ obj-y				:= dma-mapping.o extable.o fault.o init.o \
>>  				   cache.o copypage.o flush.o \
>>  				   ioremap.o mmap.o pgd.o mmu.o \
>>  				   context.o proc.o pageattr.o fixmap.o
>> +obj-$(CONFIG_ARM64_CONTPTE)	+= contpte.o
>>  obj-$(CONFIG_HUGETLB_PAGE)	+= hugetlbpage.o
>>  obj-$(CONFIG_PTDUMP_CORE)	+= ptdump.o
>>  obj-$(CONFIG_PTDUMP_DEBUGFS)	+= ptdump_debugfs.o
>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>> new file mode 100644
>> index 000000000000..bfb50e6b44c7
>> --- /dev/null
>> +++ b/arch/arm64/mm/contpte.c
>> @@ -0,0 +1,283 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +/*
>> + * Copyright (C) 2023 ARM Ltd.
>> + */
>> +
>> +#include <linux/mm.h>
>> +#include <linux/export.h>
>> +#include <asm/tlbflush.h>
>> +
>> +static inline bool mm_is_user(struct mm_struct *mm)
>> +{
>> +	/*
>> +	 * Don't attempt to apply the contig bit to kernel mappings, because
>> +	 * dynamically adding/removing the contig bit can cause page faults.
>> +	 * These racing faults are ok for user space, since they get serialized
>> +	 * on the PTL. But kernel mappings can't tolerate faults.
>> +	 */
>> +	return mm != &init_mm;
>> +}
> 
> We also have the efi_mm as a non-user mm, though I don't think we manipulate
> that while it is live, and I'm not sure if that needs any special handling.

Well we never need this function in the hot (order-0 folio) path, so I think I
could add a check for efi_mm here with performance implication. It's probably
safest to explicitly exclude it? What do you think?

> 
>> +static inline pte_t *contpte_align_down(pte_t *ptep)
>> +{
>> +	return (pte_t *)(ALIGN_DOWN((unsigned long)ptep >> 3, CONT_PTES) << 3);
> 
> I think this can be:
> 
> static inline pte_t *contpte_align_down(pte_t *ptep)
> {
> 	return PTR_ALIGN_DOWN(ptep, sizeof(*ptep) * CONT_PTES);
> }

Yep - that's much less ugly - thanks!

> 
>> +
>> +static void contpte_convert(struct mm_struct *mm, unsigned long addr,
>> +			    pte_t *ptep, pte_t pte)
>> +{
>> +	struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
>> +	unsigned long start_addr;
>> +	pte_t *start_ptep;
>> +	int i;
>> +
>> +	start_ptep = ptep = contpte_align_down(ptep);
>> +	start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>> +	pte = pfn_pte(ALIGN_DOWN(pte_pfn(pte), CONT_PTES), pte_pgprot(pte));
>> +
>> +	for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE) {
>> +		pte_t ptent = __ptep_get_and_clear(mm, addr, ptep);
>> +
>> +		if (pte_dirty(ptent))
>> +			pte = pte_mkdirty(pte);
>> +
>> +		if (pte_young(ptent))
>> +			pte = pte_mkyoung(pte);
>> +	}
> 
> Not a big deal either way, but I wonder if it makes more sense to accumulate
> the 'ptent' dirty/young values, then modify 'pte' once, i.e.
> 
> 	bool dirty = false, young = false;
> 
> 	for (...) {
> 		pte_t ptent = __ptep_get_and_clear(mm, addr, ptep);
> 		dirty |= pte_dirty(ptent);
> 		young |= pte_young(ptent);
> 	}
> 
> 	if (dirty)
> 		pte_mkdirty(pte);
> 	if (young)
> 		pte_mkyoung(pte);
> 
> I suspect that might generate slightly better code, but I'm also happy with the
> current form if people thnk that's more legible (I have no strong feelings
> either way).

I kept it this way, because its the same pattern used in arm64's hugetlbpage.c.
We also had the same comment against David's batching patches recently, and he
opted to stick with the former version:

https://lore.kernel.org/linux-mm/d83309fa-4daa-430f-ae52-4e72162bca9a@redhat.com/

So I'm inclined to leave it as is, since you're not insisting :)

> 
>> +
>> +	__flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
>> +
>> +	__set_ptes(mm, start_addr, start_ptep, pte, CONT_PTES);
>> +}
>> +
>> +void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>> +			pte_t *ptep, pte_t pte)
>> +{
>> +	/*
>> +	 * We have already checked that the ptes are contiguous in
>> +	 * contpte_try_unfold(), so just check that the mm is user space.
>> +	 */
>> +
>> +	if (!mm_is_user(mm))
>> +		return;
> 
> Nit: normally we don't put a line gap between a comment block and the
> associated block of code.

ACK, I'll fix in next version.

> 
>> +
>> +	pte = pte_mknoncont(pte);
>> +	contpte_convert(mm, addr, ptep, pte);
>> +}
>> +EXPORT_SYMBOL(__contpte_try_unfold);
>> +
>> +pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>> +{
>> +	/*
>> +	 * Gather access/dirty bits, which may be populated in any of the ptes
>> +	 * of the contig range. We are guarranteed to be holding the PTL, so any
>> +	 * contiguous range cannot be unfolded or otherwise modified under our
>> +	 * feet.
>> +	 */
> 
> Nit: s/guarranteed/guaranteed/

ACK, I'll fix in next version.

> 
>> +
>> +	pte_t pte;
>> +	int i;
>> +
>> +	ptep = contpte_align_down(ptep);
>> +
>> +	for (i = 0; i < CONT_PTES; i++, ptep++) {
>> +		pte = __ptep_get(ptep);
>> +
>> +		if (pte_dirty(pte))
>> +			orig_pte = pte_mkdirty(orig_pte);
>> +
>> +		if (pte_young(pte))
>> +			orig_pte = pte_mkyoung(orig_pte);
>> +	}
>> +
>> +	return orig_pte;
>> +}
>> +EXPORT_SYMBOL(contpte_ptep_get);
>> +
>> +pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
>> +{
>> +	/*
>> +	 * Gather access/dirty bits, which may be populated in any of the ptes
>> +	 * of the contig range. We may not be holding the PTL, so any contiguous
>> +	 * range may be unfolded/modified/refolded under our feet. Therefore we
>> +	 * ensure we read a _consistent_ contpte range by checking that all ptes
>> +	 * in the range are valid and have CONT_PTE set, that all pfns are
>> +	 * contiguous and that all pgprots are the same (ignoring access/dirty).
>> +	 * If we find a pte that is not consistent, then we must be racing with
>> +	 * an update so start again. If the target pte does not have CONT_PTE
>> +	 * set then that is considered consistent on its own because it is not
>> +	 * part of a contpte range.
>> +	 */
>> +
>> +	pgprot_t orig_prot;
>> +	unsigned long pfn;
>> +	pte_t orig_pte;
>> +	pgprot_t prot;
>> +	pte_t *ptep;
>> +	pte_t pte;
>> +	int i;
>> +
>> +retry:
>> +	orig_pte = __ptep_get(orig_ptep);
>> +
>> +	if (!pte_valid_cont(orig_pte))
>> +		return orig_pte;
>> +
>> +	orig_prot = pte_pgprot(pte_mkold(pte_mkclean(orig_pte)));
>> +	ptep = contpte_align_down(orig_ptep);
>> +	pfn = pte_pfn(orig_pte) - (orig_ptep - ptep);
>> +
>> +	for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
>> +		pte = __ptep_get(ptep);
>> +		prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
>> +
>> +		if (!pte_valid_cont(pte) ||
>> +		   pte_pfn(pte) != pfn ||
>> +		   pgprot_val(prot) != pgprot_val(orig_prot))
>> +			goto retry;
>> +
>> +		if (pte_dirty(pte))
>> +			orig_pte = pte_mkdirty(orig_pte);
>> +
>> +		if (pte_young(pte))
>> +			orig_pte = pte_mkyoung(orig_pte);
>> +	}
>> +
>> +	return orig_pte;
>> +}
>> +EXPORT_SYMBOL(contpte_ptep_get_lockless);
> 
> I'm struggling to convince myself that this is safe in general, as it really
> depends on how the caller will use this value. Which caller(s) actually care
> about the access/dirty bits, given those could change at any time anyway?

I think your points below are valid, and agree we should try to make this work
without needing access/dirty if possible. But can you elaborate on why you don't
think it's safe?

> 
> I took a quick scan, and AFAICT:

Thanks for enumerating these; Saves me from having to refresh my memory :)

> 
> * For perf_get_pgtable_size(), we only care about whether the entry is valid
>   and has the contig bit set. We could clean that up with a new interface, e.g.
>   something like a new ptep_get_size_lockless().
> 
> * For gup_pte_range(), I'm not sure we actually need the access/dirty bits when
>   we look at the pte to start with, since we only care where we can logically
>   write to the page at that point.
> 
>   I see that we later follow up with:
> 
>     with pte_val(pte) != pte_val(ptep_get(ptep)))
> 
>   ... is that why we need ptep_get_lockless() to accumulate the access/dirty
>   bits? So that shape of lockless-try...locked-compare sequence works?
> 
> * For huge_pte_alloc(), arm64 doesn't select CONFIG_ARCH_WANT_GENERAL_HUGETLB,
>   so this doesn' seem to matter.
> 
> * For __collapse_huge_page_swapin(), we only care if the pte is a swap pte,
>   which means the pte isn't valid, and we'll return the orig_pte as-is anyway.
> 
> * For pte_range_none() the access/dirty bits don't matter.
> 
> * For handle_pte_fault() I think we have the same shape of
>   lockless-try...locked-compare sequence as for gup_pte_range(), where we don't
>   care about the access/dirty bits before we reach the locked compare step.
> 
> * For ptdump_pte_entry() I think it's arguable that we should continue to
>   report the access/dirty bits separately for each PTE, as we have done until
>   now, to give an accurate representation of the contents of the translation
>   tables.
> 
> * For swap_vma_readahead() and unuse_pte_range() we only care if the PTE is a
>   swap entry, the access/dirty bits don't matter.
> 
> So AFAICT this only really matters for gup_pte_range() and handle_pte_fault(),
> and IIUC that's only so that the locklessly-loaded pte value can be compared
> with a subsequently locked-loaded entry (for which the access/dirty bits will
> be accumulated). Have I understood that correctly?

Yes, I agree with what you are saying. My approach was to try to implement the
existing APIs accurately though, the argument being that it reduces the chances
of getting it wrong. But if you think the implementation is unsafe, then I guess
it blows that out of the water...

> 
> If so, I wonder if we could instead do that comparison modulo the access/dirty
> bits, 

I think that would work - but will need to think a bit more on it.

> and leave ptep_get_lockless() only reading a single entry?

I think we will need to do something a bit less fragile. ptep_get() does collect
the access/dirty bits, so it's confusing if ptep_get_lockless() doesn't, IMHO. So
we will likely want to rename the function and make its documentation explicit
that it does not return those bits.

ptep_get_lockless_noyoungdirty()? yuk... Any ideas?

Of course if I could convince you the current implementation is safe, I might be
able to sidestep this optimization until a later date?

Thanks,
Ryan


> 
> Thanks,
> Mark.
> 
>> +void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
>> +					pte_t *ptep, pte_t pte, unsigned int nr)
>> +{
>> +	unsigned long next;
>> +	unsigned long end;
>> +	unsigned long pfn;
>> +	pgprot_t prot;
>> +
>> +	/*
>> +	 * The set_ptes() spec guarantees that when nr > 1, the initial state of
>> +	 * all ptes is not-present. Therefore we never need to unfold or
>> +	 * otherwise invalidate a range before we set the new ptes.
>> +	 * contpte_set_ptes() should never be called for nr < 2.
>> +	 */
>> +	VM_WARN_ON(nr == 1);
>> +
>> +	if (!mm_is_user(mm))
>> +		return __set_ptes(mm, addr, ptep, pte, nr);
>> +
>> +	end = addr + (nr << PAGE_SHIFT);
>> +	pfn = pte_pfn(pte);
>> +	prot = pte_pgprot(pte);
>> +
>> +	do {
>> +		next = pte_cont_addr_end(addr, end);
>> +		nr = (next - addr) >> PAGE_SHIFT;
>> +		pte = pfn_pte(pfn, prot);
>> +
>> +		if (((addr | next | (pfn << PAGE_SHIFT)) & ~CONT_PTE_MASK) == 0)
>> +			pte = pte_mkcont(pte);
>> +		else
>> +			pte = pte_mknoncont(pte);
>> +
>> +		__set_ptes(mm, addr, ptep, pte, nr);
>> +
>> +		addr = next;
>> +		ptep += nr;
>> +		pfn += nr;
>> +
>> +	} while (addr != end);
>> +}
>> +EXPORT_SYMBOL(contpte_set_ptes);
>> +
>> +int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>> +					unsigned long addr, pte_t *ptep)
>> +{
>> +	/*
>> +	 * ptep_clear_flush_young() technically requires us to clear the access
>> +	 * flag for a _single_ pte. However, the core-mm code actually tracks
>> +	 * access/dirty per folio, not per page. And since we only create a
>> +	 * contig range when the range is covered by a single folio, we can get
>> +	 * away with clearing young for the whole contig range here, so we avoid
>> +	 * having to unfold.
>> +	 */
>> +
>> +	int young = 0;
>> +	int i;
>> +
>> +	ptep = contpte_align_down(ptep);
>> +	addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>> +
>> +	for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
>> +		young |= __ptep_test_and_clear_young(vma, addr, ptep);
>> +
>> +	return young;
>> +}
>> +EXPORT_SYMBOL(contpte_ptep_test_and_clear_young);
>> +
>> +int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>> +					unsigned long addr, pte_t *ptep)
>> +{
>> +	int young;
>> +
>> +	young = contpte_ptep_test_and_clear_young(vma, addr, ptep);
>> +
>> +	if (young) {
>> +		/*
>> +		 * See comment in __ptep_clear_flush_young(); same rationale for
>> +		 * eliding the trailing DSB applies here.
>> +		 */
>> +		addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>> +		__flush_tlb_range_nosync(vma, addr, addr + CONT_PTE_SIZE,
>> +					 PAGE_SIZE, true, 3);
>> +	}
>> +
>> +	return young;
>> +}
>> +EXPORT_SYMBOL(contpte_ptep_clear_flush_young);
>> +
>> +int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
>> +					unsigned long addr, pte_t *ptep,
>> +					pte_t entry, int dirty)
>> +{
>> +	unsigned long start_addr;
>> +	pte_t orig_pte;
>> +	int i;
>> +
>> +	/*
>> +	 * Gather the access/dirty bits for the contiguous range. If nothing has
>> +	 * changed, its a noop.
>> +	 */
>> +	orig_pte = pte_mknoncont(ptep_get(ptep));
>> +	if (pte_val(orig_pte) == pte_val(entry))
>> +		return 0;
>> +
>> +	/*
>> +	 * We can fix up access/dirty bits without having to unfold the contig
>> +	 * range. But if the write bit is changing, we must unfold.
>> +	 */
>> +	if (pte_write(orig_pte) == pte_write(entry)) {
>> +		/*
>> +		 * For HW access management, we technically only need to update
>> +		 * the flag on a single pte in the range. But for SW access
>> +		 * management, we need to update all the ptes to prevent extra
>> +		 * faults. Avoid per-page tlb flush in __ptep_set_access_flags()
>> +		 * and instead flush the whole range at the end.
>> +		 */
>> +		ptep = contpte_align_down(ptep);
>> +		start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>> +
>> +		for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
>> +			__ptep_set_access_flags(vma, addr, ptep, entry, 0);
>> +
>> +		if (dirty)
>> +			__flush_tlb_range(vma, start_addr, addr,
>> +							PAGE_SIZE, true, 3);
>> +	} else {
>> +		__contpte_try_unfold(vma->vm_mm, addr, ptep, orig_pte);
>> +		__ptep_set_access_flags(vma, addr, ptep, entry, dirty);
>> +	}
>> +
>> +	return 1;
>> +}
>> +EXPORT_SYMBOL(contpte_ptep_set_access_flags);
>> -- 
>> 2.25.1
>>


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH v5 19/25] arm64/mm: Wire up PTE_CONT for user mappings
@ 2024-02-12 12:59       ` Ryan Roberts
  0 siblings, 0 replies; 240+ messages in thread
From: Ryan Roberts @ 2024-02-12 12:59 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Kefeng Wang, x86, David Hildenbrand, Catalin Marinas, Yang Shi,
	Dave Hansen, linux-mm, Andrey Ryabinin, H. Peter Anvin,
	Will Deacon, Ard Biesheuvel, Marc Zyngier, Alistair Popple,
	Barry Song, Matthew Wilcox, Aneesh Kumar K.V, Ingo Molnar,
	Zi Yan, Naveen N. Rao, John Hubbard, Nicholas Piggin,
	Borislav Petkov, Thomas Gleixner, linux-arm-kernel, linux-kernel,
	James Morse, Andrew Morton, linuxppc-dev

On 12/02/2024 12:00, Mark Rutland wrote:
> Hi Ryan,
> 
> Overall this looks pretty good; I have a bunch of minor comments below, and a
> bigger question on the way ptep_get_lockless() works.

OK great - thanks for the review. Let's see if I can answer them all...

> 
> On Fri, Feb 02, 2024 at 08:07:50AM +0000, Ryan Roberts wrote:
>> With the ptep API sufficiently refactored, we can now introduce a new
>> "contpte" API layer, which transparently manages the PTE_CONT bit for
>> user mappings.
>>
>> In this initial implementation, only suitable batches of PTEs, set via
>> set_ptes(), are mapped with the PTE_CONT bit. Any subsequent
>> modification of individual PTEs will cause an "unfold" operation to
>> repaint the contpte block as individual PTEs before performing the
>> requested operation. While a modification of a single PTE could cause
>> the block of PTEs to which it belongs to become eligible for "folding"
>> into a contpte entry, "folding" is not performed in this initial
>> implementation due to the costs of checking the requirements are met.
>> Due to this, contpte mappings will degrade back to normal pte mappings
>> over time if/when protections are changed. This will be solved in a
>> future patch.
>>
>> Since a contpte block only has a single access and dirty bit, the
>> semantic here changes slightly; when getting a pte (e.g. ptep_get())
>> that is part of a contpte mapping, the access and dirty information are
>> pulled from the block (so all ptes in the block return the same
>> access/dirty info). When changing the access/dirty info on a pte (e.g.
>> ptep_set_access_flags()) that is part of a contpte mapping, this change
>> will affect the whole contpte block. This works fine in practice
>> since we guarantee that only a single folio is mapped by a contpte
>> block, and the core-mm tracks access/dirty information per folio.
>>
>> In order for the public functions, which used to be pure inline, to
>> continue to be callable by modules, export all the contpte_* symbols
>> that are now called by those public inline functions.
>>
>> The feature is enabled/disabled with the ARM64_CONTPTE Kconfig parameter
>> at build time. It defaults to enabled as long as its dependency,
>> TRANSPARENT_HUGEPAGE, is also enabled. The core-mm depends upon
>> TRANSPARENT_HUGEPAGE to be able to allocate large folios, so if it's not
>> enabled, then there is no chance of meeting the physical contiguity
>> requirement for contpte mappings.
>>
>> Tested-by: John Hubbard <jhubbard@nvidia.com>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>  arch/arm64/Kconfig               |   9 +
>>  arch/arm64/include/asm/pgtable.h | 161 ++++++++++++++++++
>>  arch/arm64/mm/Makefile           |   1 +
>>  arch/arm64/mm/contpte.c          | 283 +++++++++++++++++++++++++++++++
>>  4 files changed, 454 insertions(+)
>>  create mode 100644 arch/arm64/mm/contpte.c
>>
>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>> index d86d7f4758b5..1442e8ed95b6 100644
>> --- a/arch/arm64/Kconfig
>> +++ b/arch/arm64/Kconfig
>> @@ -2230,6 +2230,15 @@ config UNWIND_PATCH_PAC_INTO_SCS
>>  	select UNWIND_TABLES
>>  	select DYNAMIC_SCS
>>  
>> +config ARM64_CONTPTE
>> +	bool "Contiguous PTE mappings for user memory" if EXPERT
>> +	depends on TRANSPARENT_HUGEPAGE
>> +	default y
>> +	help
>> +	  When enabled, user mappings are configured using the PTE contiguous
>> +	  bit, for any mappings that meet the size and alignment requirements.
>> +	  This reduces TLB pressure and improves performance.
>> +
>>  endmenu # "Kernel Features"
>>  
>>  menu "Boot options"
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index 7dc6b68ee516..34892a95403d 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -133,6 +133,10 @@ static inline pteval_t __phys_to_pte_val(phys_addr_t phys)
>>   */
>>  #define pte_valid_not_user(pte) \
>>  	((pte_val(pte) & (PTE_VALID | PTE_USER | PTE_UXN)) == (PTE_VALID | PTE_UXN))
>> +/*
>> + * Returns true if the pte is valid and has the contiguous bit set.
>> + */
>> +#define pte_valid_cont(pte)	(pte_valid(pte) && pte_cont(pte))
>>  /*
>>   * Could the pte be present in the TLB? We must check mm_tlb_flush_pending
>>   * so that we don't erroneously return false for pages that have been
>> @@ -1135,6 +1139,161 @@ void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte);
>>  #define vmemmap_update_pte vmemmap_update_pte
>>  #endif
>>  
>> +#ifdef CONFIG_ARM64_CONTPTE
>> +
>> +/*
>> + * The contpte APIs are used to transparently manage the contiguous bit in ptes
>> + * where it is possible and makes sense to do so. The PTE_CONT bit is considered
>> + * a private implementation detail of the public ptep API (see below).
>> + */
>> +extern void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>> +				pte_t *ptep, pte_t pte);
>> +extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
>> +extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
>> +extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
>> +				pte_t *ptep, pte_t pte, unsigned int nr);
>> +extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>> +				unsigned long addr, pte_t *ptep);
>> +extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>> +				unsigned long addr, pte_t *ptep);
>> +extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
>> +				unsigned long addr, pte_t *ptep,
>> +				pte_t entry, int dirty);
>> +
>> +static inline void contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>> +					pte_t *ptep, pte_t pte)
>> +{
>> +	if (unlikely(pte_valid_cont(pte)))
>> +		__contpte_try_unfold(mm, addr, ptep, pte);
>> +}
>> +
>> +/*
>> + * The below functions constitute the public API that arm64 presents to the
>> + * core-mm to manipulate PTE entries within their page tables (or at least this
>> + * is the subset of the API that arm64 needs to implement). These public
>> + * versions will automatically and transparently apply the contiguous bit where
>> + * it makes sense to do so. Therefore any users that are contig-aware (e.g.
>> + * hugetlb, kernel mapper) should NOT use these APIs, but instead use the
>> + * private versions, which are prefixed with double underscore. All of these
>> + * APIs except for ptep_get_lockless() are expected to be called with the PTL
>> + * held.
>> + */
>> +
>> +#define ptep_get ptep_get
>> +static inline pte_t ptep_get(pte_t *ptep)
>> +{
>> +	pte_t pte = __ptep_get(ptep);
>> +
>> +	if (likely(!pte_valid_cont(pte)))
>> +		return pte;
>> +
>> +	return contpte_ptep_get(ptep, pte);
>> +}
>> +
>> +#define ptep_get_lockless ptep_get_lockless
>> +static inline pte_t ptep_get_lockless(pte_t *ptep)
>> +{
>> +	pte_t pte = __ptep_get(ptep);
>> +
>> +	if (likely(!pte_valid_cont(pte)))
>> +		return pte;
>> +
>> +	return contpte_ptep_get_lockless(ptep);
>> +}
>> +
>> +static inline void set_pte(pte_t *ptep, pte_t pte)
>> +{
>> +	/*
>> +	 * We don't have the mm or vaddr so cannot unfold contig entries (since
>> +	 * it requires tlb maintenance). set_pte() is not used in core code, so
>> +	 * this should never even be called. Regardless do our best to service
>> +	 * any call and emit a warning if there is any attempt to set a pte on
>> +	 * top of an existing contig range.
>> +	 */
>> +	pte_t orig_pte = __ptep_get(ptep);
>> +
>> +	WARN_ON_ONCE(pte_valid_cont(orig_pte));
>> +	__set_pte(ptep, pte_mknoncont(pte));
>> +}
>> +
>> +#define set_ptes set_ptes
>> +static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
>> +				pte_t *ptep, pte_t pte, unsigned int nr)
>> +{
>> +	pte = pte_mknoncont(pte);
> 
> Why do we have to clear the contiguous bit here? Is that for the same reason as
> set_pte(), or do we expect callers to legitimately call this with the
> contiguous bit set in 'pte'?
> 
> I think you explained this to me in-person, and IIRC we don't expect callers to
> go set the bit themselves, but since it 'leaks' out to them via __ptep_get() we
> have to clear it here to defer the decision of whether to set/clear it when
> modifying entries. It would be nice if we could have a description of why/when
> we need to clear this, e.g. in the 'public API' comment block above.

Yes, I think you've got it, but just to ram home the point: The PTE_CONT bit is
private to the architecture code and is never set directly by core code. If the
public API ever receives a pte that happens to have the PTE_CONT bit set, it
would be bad news if we then accidentally set that in the pgtable.

Ideally, we would just unconditionally clear the bit before a getter returns
the pte (e.g. ptep_get(), ptep_get_lockless(), ptep_get_and_clear(), ...). That
way, the core code is guaranteed never to see a pte with the PTE_CONT bit set
and can therefore never accidentally pass such a pte into a setter function.
However, there is existing functionality that relies on being able to get a pte,
then pass it to pte_leaf_size(), an arch function that checks the PTE_CONT bit
to determine how big the leaf is. This is used in perf_get_pgtable_size().

So to allow perf_get_pgtable_size() to continue to see the "real" page size, I
decided to allow PTE_CONT to leak through the getters and instead
unconditionally clear the bit when a pte is passed to any of the setters.

I'll add a (slightly less verbose) comment as you suggest.
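
Something along these lines, perhaps (just a sketch of the wording, to be
tightened up before the next version):

    /*
     * Note: the core-mm never sets PTE_CONT itself, but because the getters
     * can return a pte with PTE_CONT set (so that e.g. pte_leaf_size() sees
     * the real leaf size), the setters must unconditionally clear it via
     * pte_mknoncont() and decide for themselves whether the contiguous bit
     * should be (re)applied.
     */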

> 
>> +
>> +	if (likely(nr == 1)) {
>> +		contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>> +		__set_ptes(mm, addr, ptep, pte, 1);
>> +	} else {
>> +		contpte_set_ptes(mm, addr, ptep, pte, nr);
>> +	}
>> +}
>> +
>> +static inline void pte_clear(struct mm_struct *mm,
>> +				unsigned long addr, pte_t *ptep)
>> +{
>> +	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>> +	__pte_clear(mm, addr, ptep);
>> +}
>> +
>> +#define __HAVE_ARCH_PTEP_GET_AND_CLEAR
>> +static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>> +				unsigned long addr, pte_t *ptep)
>> +{
>> +	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>> +	return __ptep_get_and_clear(mm, addr, ptep);
>> +}
>> +
>> +#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
>> +static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>> +				unsigned long addr, pte_t *ptep)
>> +{
>> +	pte_t orig_pte = __ptep_get(ptep);
>> +
>> +	if (likely(!pte_valid_cont(orig_pte)))
>> +		return __ptep_test_and_clear_young(vma, addr, ptep);
>> +
>> +	return contpte_ptep_test_and_clear_young(vma, addr, ptep);
>> +}
>> +
>> +#define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
>> +static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>> +				unsigned long addr, pte_t *ptep)
>> +{
>> +	pte_t orig_pte = __ptep_get(ptep);
>> +
>> +	if (likely(!pte_valid_cont(orig_pte)))
>> +		return __ptep_clear_flush_young(vma, addr, ptep);
>> +
>> +	return contpte_ptep_clear_flush_young(vma, addr, ptep);
>> +}
>> +
>> +#define __HAVE_ARCH_PTEP_SET_WRPROTECT
>> +static inline void ptep_set_wrprotect(struct mm_struct *mm,
>> +				unsigned long addr, pte_t *ptep)
>> +{
>> +	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>> +	__ptep_set_wrprotect(mm, addr, ptep);
>> +}
>> +
>> +#define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
>> +static inline int ptep_set_access_flags(struct vm_area_struct *vma,
>> +				unsigned long addr, pte_t *ptep,
>> +				pte_t entry, int dirty)
>> +{
>> +	pte_t orig_pte = __ptep_get(ptep);
>> +
>> +	entry = pte_mknoncont(entry);
>> +
>> +	if (likely(!pte_valid_cont(orig_pte)))
>> +		return __ptep_set_access_flags(vma, addr, ptep, entry, dirty);
>> +
>> +	return contpte_ptep_set_access_flags(vma, addr, ptep, entry, dirty);
>> +}
>> +
>> +#else /* CONFIG_ARM64_CONTPTE */
>> +
>>  #define ptep_get				__ptep_get
>>  #define set_pte					__set_pte
>>  #define set_ptes				__set_ptes
>> @@ -1150,6 +1309,8 @@ void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte);
>>  #define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
>>  #define ptep_set_access_flags			__ptep_set_access_flags
>>  
>> +#endif /* CONFIG_ARM64_CONTPTE */
>> +
>>  #endif /* !__ASSEMBLY__ */
>>  
>>  #endif /* __ASM_PGTABLE_H */
>> diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
>> index dbd1bc95967d..60454256945b 100644
>> --- a/arch/arm64/mm/Makefile
>> +++ b/arch/arm64/mm/Makefile
>> @@ -3,6 +3,7 @@ obj-y				:= dma-mapping.o extable.o fault.o init.o \
>>  				   cache.o copypage.o flush.o \
>>  				   ioremap.o mmap.o pgd.o mmu.o \
>>  				   context.o proc.o pageattr.o fixmap.o
>> +obj-$(CONFIG_ARM64_CONTPTE)	+= contpte.o
>>  obj-$(CONFIG_HUGETLB_PAGE)	+= hugetlbpage.o
>>  obj-$(CONFIG_PTDUMP_CORE)	+= ptdump.o
>>  obj-$(CONFIG_PTDUMP_DEBUGFS)	+= ptdump_debugfs.o
>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>> new file mode 100644
>> index 000000000000..bfb50e6b44c7
>> --- /dev/null
>> +++ b/arch/arm64/mm/contpte.c
>> @@ -0,0 +1,283 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +/*
>> + * Copyright (C) 2023 ARM Ltd.
>> + */
>> +
>> +#include <linux/mm.h>
>> +#include <linux/export.h>
>> +#include <asm/tlbflush.h>
>> +
>> +static inline bool mm_is_user(struct mm_struct *mm)
>> +{
>> +	/*
>> +	 * Don't attempt to apply the contig bit to kernel mappings, because
>> +	 * dynamically adding/removing the contig bit can cause page faults.
>> +	 * These racing faults are ok for user space, since they get serialized
>> +	 * on the PTL. But kernel mappings can't tolerate faults.
>> +	 */
>> +	return mm != &init_mm;
>> +}
> 
> We also have the efi_mm as a non-user mm, though I don't think we manipulate
> that while it is live, and I'm not sure if that needs any special handling.

Well, we never need this function in the hot (order-0 folio) path, so I think I
could add a check for efi_mm here without any performance implication. It's
probably safest to explicitly exclude it? What do you think?
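
Something like the below is what I have in mind - completely untested sketch,
and it assumes efi_mm (declared in <linux/efi.h>) can be referenced directly
from contpte.c, including for CONFIG_EFI=n builds (which might need an
IS_ENABLED() guard or a small helper instead):

    static inline bool mm_is_user(struct mm_struct *mm)
    {
            /*
             * Don't attempt to apply the contig bit to kernel mappings,
             * because dynamically adding/removing the contig bit can cause
             * page faults. These racing faults are ok for user space, since
             * they get serialized on the PTL. But kernel (and efi runtime
             * services) mappings can't tolerate faults.
             */
            return mm != &init_mm && mm != &efi_mm;
    }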

> 
>> +static inline pte_t *contpte_align_down(pte_t *ptep)
>> +{
>> +	return (pte_t *)(ALIGN_DOWN((unsigned long)ptep >> 3, CONT_PTES) << 3);
> 
> I think this can be:
> 
> static inline pte_t *contpte_align_down(pte_t *ptep)
> {
> 	return PTR_ALIGN_DOWN(ptep, sizeof(*ptep) * CONT_PTES);
> }

Yep - that's much less ugly - thanks!

> 
>> +
>> +static void contpte_convert(struct mm_struct *mm, unsigned long addr,
>> +			    pte_t *ptep, pte_t pte)
>> +{
>> +	struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
>> +	unsigned long start_addr;
>> +	pte_t *start_ptep;
>> +	int i;
>> +
>> +	start_ptep = ptep = contpte_align_down(ptep);
>> +	start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>> +	pte = pfn_pte(ALIGN_DOWN(pte_pfn(pte), CONT_PTES), pte_pgprot(pte));
>> +
>> +	for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE) {
>> +		pte_t ptent = __ptep_get_and_clear(mm, addr, ptep);
>> +
>> +		if (pte_dirty(ptent))
>> +			pte = pte_mkdirty(pte);
>> +
>> +		if (pte_young(ptent))
>> +			pte = pte_mkyoung(pte);
>> +	}
> 
> Not a big deal either way, but I wonder if it makes more sense to accumulate
> the 'ptent' dirty/young values, then modify 'pte' once, i.e.
> 
> 	bool dirty = false, young = false;
> 
> 	for (...) {
> 		pte_t ptent = __ptep_get_and_clear(mm, addr, ptep);
> 		dirty |= pte_dirty(ptent);
> 		young |= pte_young(ptent);
> 	}
> 
> 	if (dirty)
> 		pte = pte_mkdirty(pte);
> 	if (young)
> 		pte = pte_mkyoung(pte);
> 
> I suspect that might generate slightly better code, but I'm also happy with the
> current form if people think that's more legible (I have no strong feelings
> either way).

I kept it this way, because it's the same pattern used in arm64's hugetlbpage.c.
We also had the same comment against David's batching patches recently, and he
opted to stick with the former version:

https://lore.kernel.org/linux-mm/d83309fa-4daa-430f-ae52-4e72162bca9a@redhat.com/

So I'm inclined to leave it as is, since you're not insisting :)

> 
>> +
>> +	__flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
>> +
>> +	__set_ptes(mm, start_addr, start_ptep, pte, CONT_PTES);
>> +}
>> +
>> +void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>> +			pte_t *ptep, pte_t pte)
>> +{
>> +	/*
>> +	 * We have already checked that the ptes are contiguous in
>> +	 * contpte_try_unfold(), so just check that the mm is user space.
>> +	 */
>> +
>> +	if (!mm_is_user(mm))
>> +		return;
> 
> Nit: normally we don't put a line gap between a comment block and the
> associated block of code.

ACK, I'll fix in next version.

> 
>> +
>> +	pte = pte_mknoncont(pte);
>> +	contpte_convert(mm, addr, ptep, pte);
>> +}
>> +EXPORT_SYMBOL(__contpte_try_unfold);
>> +
>> +pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>> +{
>> +	/*
>> +	 * Gather access/dirty bits, which may be populated in any of the ptes
>> +	 * of the contig range. We are guarranteed to be holding the PTL, so any
>> +	 * contiguous range cannot be unfolded or otherwise modified under our
>> +	 * feet.
>> +	 */
> 
> Nit: s/guarranteed/guaranteed/

ACK, I'll fix in next version.

> 
>> +
>> +	pte_t pte;
>> +	int i;
>> +
>> +	ptep = contpte_align_down(ptep);
>> +
>> +	for (i = 0; i < CONT_PTES; i++, ptep++) {
>> +		pte = __ptep_get(ptep);
>> +
>> +		if (pte_dirty(pte))
>> +			orig_pte = pte_mkdirty(orig_pte);
>> +
>> +		if (pte_young(pte))
>> +			orig_pte = pte_mkyoung(orig_pte);
>> +	}
>> +
>> +	return orig_pte;
>> +}
>> +EXPORT_SYMBOL(contpte_ptep_get);
>> +
>> +pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
>> +{
>> +	/*
>> +	 * Gather access/dirty bits, which may be populated in any of the ptes
>> +	 * of the contig range. We may not be holding the PTL, so any contiguous
>> +	 * range may be unfolded/modified/refolded under our feet. Therefore we
>> +	 * ensure we read a _consistent_ contpte range by checking that all ptes
>> +	 * in the range are valid and have CONT_PTE set, that all pfns are
>> +	 * contiguous and that all pgprots are the same (ignoring access/dirty).
>> +	 * If we find a pte that is not consistent, then we must be racing with
>> +	 * an update so start again. If the target pte does not have CONT_PTE
>> +	 * set then that is considered consistent on its own because it is not
>> +	 * part of a contpte range.
>> +	 */
>> +
>> +	pgprot_t orig_prot;
>> +	unsigned long pfn;
>> +	pte_t orig_pte;
>> +	pgprot_t prot;
>> +	pte_t *ptep;
>> +	pte_t pte;
>> +	int i;
>> +
>> +retry:
>> +	orig_pte = __ptep_get(orig_ptep);
>> +
>> +	if (!pte_valid_cont(orig_pte))
>> +		return orig_pte;
>> +
>> +	orig_prot = pte_pgprot(pte_mkold(pte_mkclean(orig_pte)));
>> +	ptep = contpte_align_down(orig_ptep);
>> +	pfn = pte_pfn(orig_pte) - (orig_ptep - ptep);
>> +
>> +	for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
>> +		pte = __ptep_get(ptep);
>> +		prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
>> +
>> +		if (!pte_valid_cont(pte) ||
>> +		   pte_pfn(pte) != pfn ||
>> +		   pgprot_val(prot) != pgprot_val(orig_prot))
>> +			goto retry;
>> +
>> +		if (pte_dirty(pte))
>> +			orig_pte = pte_mkdirty(orig_pte);
>> +
>> +		if (pte_young(pte))
>> +			orig_pte = pte_mkyoung(orig_pte);
>> +	}
>> +
>> +	return orig_pte;
>> +}
>> +EXPORT_SYMBOL(contpte_ptep_get_lockless);
> 
> I'm struggling to convince myself that this is safe in general, as it really
> depends on how the caller will use this value. Which caller(s) actually care
> about the access/dirty bits, given those could change at any time anyway?

I think your points below are valid, and agree we should try to make this work
without needing access/dirty if possible. But can you elaborate on why you don't
think it's safe?

> 
> I took a quick scan, and AFAICT:

Thanks for enumerating these; saves me from having to refresh my memory :)

> 
> * For perf_get_pgtable_size(), we only care about whether the entry is valid
>   and has the contig bit set. We could clean that up with a new interface, e.g.
>   something like a new ptep_get_size_lockless().
> 
> * For gup_pte_range(), I'm not sure we actually need the access/dirty bits when
>   we look at the pte to start with, since we only care where we can logically
>   write to the page at that point.
> 
>   I see that we later follow up with:
> 
>     with pte_val(pte) != pte_val(ptep_get(ptep)))
> 
>   ... is that why we need ptep_get_lockless() to accumulate the access/dirty
>   bits? So that shape of lockless-try...locked-compare sequence works?
> 
> * For huge_pte_alloc(), arm64 doesn't select CONFIG_ARCH_WANT_GENERAL_HUGETLB,
>   so this doesn't seem to matter.
> 
> * For __collapse_huge_page_swapin(), we only care if the pte is a swap pte,
>   which means the pte isn't valid, and we'll return the orig_pte as-is anyway.
> 
> * For pte_range_none() the access/dirty bits don't matter.
> 
> * For handle_pte_fault() I think we have the same shape of
>   lockless-try...locked-compare sequence as for gup_pte_range(), where we don't
>   care about the access/dirty bits before we reach the locked compare step.
> 
> * For ptdump_pte_entry() I think it's arguable that we should continue to
>   report the access/dirty bits separately for each PTE, as we have done until
>   now, to give an accurate representation of the contents of the translation
>   tables.
> 
> * For swap_vma_readahead() and unuse_pte_range() we only care if the PTE is a
>   swap entry, the access/dirty bits don't matter.
> 
> So AFAICT this only really matters for gup_pte_range() and handle_pte_fault(),
> and IIUC that's only so that the locklessly-loaded pte value can be compared
> with a subsequently locked-loaded entry (for which the access/dirty bits will
> be accumulated). Have I understood that correctly?

Yes, I agree with what you are saying. My approach was to try to implement the
existing APIs accurately though, the argument being that it reduces the chances
of getting it wrong. But if you think the implementation is unsafe, then I guess
it blows that out of the water...

> 
> If so, I wonder if we could instead do that comparison modulo the access/dirty
> bits, 

I think that would work - but will need to think a bit more on it.
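
For the record, I'm imagining something like the below at those compare sites
(gup_pte_range() and handle_pte_fault()); the helper is purely hypothetical,
name and home TBD:

    /* Compare two ptes, ignoring the access/dirty bits. */
    static inline bool pte_same_mod_accdirty(pte_t a, pte_t b)
    {
            /* Normalise both ptes to old + clean before comparing. */
            a = pte_mkclean(pte_mkold(a));
            b = pte_mkclean(pte_mkold(b));
            return pte_val(a) == pte_val(b);
    }

i.e. the existing "pte_val(pte) != pte_val(ptep_get(ptep))" style check would
become "!pte_same_mod_accdirty(pte, ptep_get(ptep))".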

> and leave ptep_get_lockless() only reading a single entry?

I think we will need to do something a bit less fragile. ptep_get() does collect
the access/dirty bits, so it's confusing if ptep_get_lockless() doesn't, IMHO. So
we will likely want to rename the function and make its documentation explicit
that it does not return those bits.

ptep_get_lockless_noyoungdirty()? yuk... Any ideas?
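
Whatever we end up calling it, I'd expect the declaration/documentation to look
something like this (name above is just a straw man):

    /*
     * Like ptep_get_lockless(), but makes no attempt to gather the
     * access/dirty bits from the other ptes of a contpte block; callers
     * must not rely on those bits in the returned value.
     */
    pte_t ptep_get_lockless_noyoungdirty(pte_t *ptep);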

Of course if I could convince you the current implementation is safe, I might be
able to sidestep this optimization until a later date?

Thanks,
Ryan


> 
> Thanks,
> Mark.
> 
>> +void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
>> +					pte_t *ptep, pte_t pte, unsigned int nr)
>> +{
>> +	unsigned long next;
>> +	unsigned long end;
>> +	unsigned long pfn;
>> +	pgprot_t prot;
>> +
>> +	/*
>> +	 * The set_ptes() spec guarantees that when nr > 1, the initial state of
>> +	 * all ptes is not-present. Therefore we never need to unfold or
>> +	 * otherwise invalidate a range before we set the new ptes.
>> +	 * contpte_set_ptes() should never be called for nr < 2.
>> +	 */
>> +	VM_WARN_ON(nr == 1);
>> +
>> +	if (!mm_is_user(mm))
>> +		return __set_ptes(mm, addr, ptep, pte, nr);
>> +
>> +	end = addr + (nr << PAGE_SHIFT);
>> +	pfn = pte_pfn(pte);
>> +	prot = pte_pgprot(pte);
>> +
>> +	do {
>> +		next = pte_cont_addr_end(addr, end);
>> +		nr = (next - addr) >> PAGE_SHIFT;
>> +		pte = pfn_pte(pfn, prot);
>> +
>> +		if (((addr | next | (pfn << PAGE_SHIFT)) & ~CONT_PTE_MASK) == 0)
>> +			pte = pte_mkcont(pte);
>> +		else
>> +			pte = pte_mknoncont(pte);
>> +
>> +		__set_ptes(mm, addr, ptep, pte, nr);
>> +
>> +		addr = next;
>> +		ptep += nr;
>> +		pfn += nr;
>> +
>> +	} while (addr != end);
>> +}
>> +EXPORT_SYMBOL(contpte_set_ptes);
>> +
>> +int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>> +					unsigned long addr, pte_t *ptep)
>> +{
>> +	/*
>> +	 * ptep_clear_flush_young() technically requires us to clear the access
>> +	 * flag for a _single_ pte. However, the core-mm code actually tracks
>> +	 * access/dirty per folio, not per page. And since we only create a
>> +	 * contig range when the range is covered by a single folio, we can get
>> +	 * away with clearing young for the whole contig range here, so we avoid
>> +	 * having to unfold.
>> +	 */
>> +
>> +	int young = 0;
>> +	int i;
>> +
>> +	ptep = contpte_align_down(ptep);
>> +	addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>> +
>> +	for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
>> +		young |= __ptep_test_and_clear_young(vma, addr, ptep);
>> +
>> +	return young;
>> +}
>> +EXPORT_SYMBOL(contpte_ptep_test_and_clear_young);
>> +
>> +int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>> +					unsigned long addr, pte_t *ptep)
>> +{
>> +	int young;
>> +
>> +	young = contpte_ptep_test_and_clear_young(vma, addr, ptep);
>> +
>> +	if (young) {
>> +		/*
>> +		 * See comment in __ptep_clear_flush_young(); same rationale for
>> +		 * eliding the trailing DSB applies here.
>> +		 */
>> +		addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>> +		__flush_tlb_range_nosync(vma, addr, addr + CONT_PTE_SIZE,
>> +					 PAGE_SIZE, true, 3);
>> +	}
>> +
>> +	return young;
>> +}
>> +EXPORT_SYMBOL(contpte_ptep_clear_flush_young);
>> +
>> +int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
>> +					unsigned long addr, pte_t *ptep,
>> +					pte_t entry, int dirty)
>> +{
>> +	unsigned long start_addr;
>> +	pte_t orig_pte;
>> +	int i;
>> +
>> +	/*
>> +	 * Gather the access/dirty bits for the contiguous range. If nothing has
>> +	 * changed, its a noop.
>> +	 */
>> +	orig_pte = pte_mknoncont(ptep_get(ptep));
>> +	if (pte_val(orig_pte) == pte_val(entry))
>> +		return 0;
>> +
>> +	/*
>> +	 * We can fix up access/dirty bits without having to unfold the contig
>> +	 * range. But if the write bit is changing, we must unfold.
>> +	 */
>> +	if (pte_write(orig_pte) == pte_write(entry)) {
>> +		/*
>> +		 * For HW access management, we technically only need to update
>> +		 * the flag on a single pte in the range. But for SW access
>> +		 * management, we need to update all the ptes to prevent extra
>> +		 * faults. Avoid per-page tlb flush in __ptep_set_access_flags()
>> +		 * and instead flush the whole range at the end.
>> +		 */
>> +		ptep = contpte_align_down(ptep);
>> +		start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>> +
>> +		for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
>> +			__ptep_set_access_flags(vma, addr, ptep, entry, 0);
>> +
>> +		if (dirty)
>> +			__flush_tlb_range(vma, start_addr, addr,
>> +							PAGE_SIZE, true, 3);
>> +	} else {
>> +		__contpte_try_unfold(vma->vm_mm, addr, ptep, orig_pte);
>> +		__ptep_set_access_flags(vma, addr, ptep, entry, dirty);
>> +	}
>> +
>> +	return 1;
>> +}
>> +EXPORT_SYMBOL(contpte_ptep_set_access_flags);
>> -- 
>> 2.25.1
>>


^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH v5 18/25] arm64/mm: Split __flush_tlb_range() to elide trailing DSB
  2024-02-12 12:44     ` David Hildenbrand
  (?)
@ 2024-02-12 13:05       ` Ryan Roberts
  -1 siblings, 0 replies; 240+ messages in thread
From: Ryan Roberts @ 2024-02-12 13:05 UTC (permalink / raw)
  To: David Hildenbrand, Catalin Marinas, Will Deacon, Ard Biesheuvel,
	Marc Zyngier, James Morse, Andrey Ryabinin, Andrew Morton,
	Matthew Wilcox, Mark Rutland, Kefeng Wang, John Hubbard, Zi Yan,
	Barry Song, Alistair Popple, Yang Shi, Nicholas Piggin,
	Christophe Leroy, Aneesh Kumar K.V, Naveen N. Rao,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin
  Cc: linux-arm-kernel, x86, linuxppc-dev, linux-mm, linux-kernel

On 12/02/2024 12:44, David Hildenbrand wrote:
> On 02.02.24 09:07, Ryan Roberts wrote:
>> Split __flush_tlb_range() into __flush_tlb_range_nosync() +
>> __flush_tlb_range(), in the same way as the existing flush_tlb_page()
>> arrangement. This allows calling __flush_tlb_range_nosync() to elide the
>> trailing DSB. Forthcoming "contpte" code will take advantage of this
>> when clearing the young bit from a contiguous range of ptes.
>>
>> Tested-by: John Hubbard <jhubbard@nvidia.com>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>   arch/arm64/include/asm/tlbflush.h | 13 +++++++++++--
>>   1 file changed, 11 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/tlbflush.h
>> b/arch/arm64/include/asm/tlbflush.h
>> index 79e932a1bdf8..50a765917327 100644
>> --- a/arch/arm64/include/asm/tlbflush.h
>> +++ b/arch/arm64/include/asm/tlbflush.h
>> @@ -422,7 +422,7 @@ do {                                    \
>>   #define __flush_s2_tlb_range_op(op, start, pages, stride, tlb_level) \
>>       __flush_tlb_range_op(op, start, pages, stride, 0, tlb_level, false,
>> kvm_lpa2_is_enabled());
>>   -static inline void __flush_tlb_range(struct vm_area_struct *vma,
>> +static inline void __flush_tlb_range_nosync(struct vm_area_struct *vma,
>>                        unsigned long start, unsigned long end,
>>                        unsigned long stride, bool last_level,
>>                        int tlb_level)
>> @@ -456,10 +456,19 @@ static inline void __flush_tlb_range(struct
>> vm_area_struct *vma,
>>           __flush_tlb_range_op(vae1is, start, pages, stride, asid,
>>                        tlb_level, true, lpa2_is_enabled());
>>   -    dsb(ish);
>>       mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, start, end);
>>   }
>>   +static inline void __flush_tlb_range(struct vm_area_struct *vma,
>> +                     unsigned long start, unsigned long end,
>> +                     unsigned long stride, bool last_level,
>> +                     int tlb_level)
>> +{
>> +    __flush_tlb_range_nosync(vma, start, end, stride,
>> +                 last_level, tlb_level);
>> +    dsb(ish);
>> +}
>> +
>>   static inline void flush_tlb_range(struct vm_area_struct *vma,
>>                      unsigned long start, unsigned long end)
>>   {
> 
> You're now calling dsb() after mmu_notifier_arch_invalidate_secondary_tlbs().
> 
> 
> In flush_tlb_mm(), we have the order
> 
>     dsb(ish);   
>     mmu_notifier_arch_invalidate_secondary_tlbs()
> 
> In flush_tlb_page(), we have the effective order:
> 
>     mmu_notifier_arch_invalidate_secondary_tlbs()
>     dsb(ish);
> 
> In flush_tlb_range(), we used to have the order:
> 
>     dsb(ish);
>     mmu_notifier_arch_invalidate_secondary_tlbs();
> 
> 
> So I *suspect* having that DSB before
> mmu_notifier_arch_invalidate_secondary_tlbs() is fine. Hopefully, nothing in
> there relies on that placement.

Will spotted this against v3. My argument was that I was following the existing
pattern in flush_tlb_page(). Apparently that is not correct and needs changing,
but the conclusion was to leave my change as-is for now, since it is consistent,
and to change them all together at a later date.

https://lore.kernel.org/linux-arm-kernel/123a58b0-2ea6-4da3-9719-98ca55c8095e@arm.com/



> 
> Maybe wort spelling out in the patch description
> 
> Reviewed-by: David Hildenbrand <david@redhat.com>
> 

Thanks!



^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH v5 18/25] arm64/mm: Split __flush_tlb_range() to elide trailing DSB
  2024-02-12 13:05       ` Ryan Roberts
  (?)
@ 2024-02-12 13:15         ` David Hildenbrand
  -1 siblings, 0 replies; 240+ messages in thread
From: David Hildenbrand @ 2024-02-12 13:15 UTC (permalink / raw)
  To: Ryan Roberts, Catalin Marinas, Will Deacon, Ard Biesheuvel,
	Marc Zyngier, James Morse, Andrey Ryabinin, Andrew Morton,
	Matthew Wilcox, Mark Rutland, Kefeng Wang, John Hubbard, Zi Yan,
	Barry Song, Alistair Popple, Yang Shi, Nicholas Piggin,
	Christophe Leroy, Aneesh Kumar K.V, Naveen N. Rao,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin
  Cc: linux-arm-kernel, x86, linuxppc-dev, linux-mm, linux-kernel

On 12.02.24 14:05, Ryan Roberts wrote:
> On 12/02/2024 12:44, David Hildenbrand wrote:
>> On 02.02.24 09:07, Ryan Roberts wrote:
>>> Split __flush_tlb_range() into __flush_tlb_range_nosync() +
>>> __flush_tlb_range(), in the same way as the existing flush_tlb_page()
>>> arrangement. This allows calling __flush_tlb_range_nosync() to elide the
>>> trailing DSB. Forthcoming "contpte" code will take advantage of this
>>> when clearing the young bit from a contiguous range of ptes.
>>>
>>> Tested-by: John Hubbard <jhubbard@nvidia.com>
>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>> ---
>>>    arch/arm64/include/asm/tlbflush.h | 13 +++++++++++--
>>>    1 file changed, 11 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/arch/arm64/include/asm/tlbflush.h
>>> b/arch/arm64/include/asm/tlbflush.h
>>> index 79e932a1bdf8..50a765917327 100644
>>> --- a/arch/arm64/include/asm/tlbflush.h
>>> +++ b/arch/arm64/include/asm/tlbflush.h
>>> @@ -422,7 +422,7 @@ do {                                    \
>>>    #define __flush_s2_tlb_range_op(op, start, pages, stride, tlb_level) \
>>>        __flush_tlb_range_op(op, start, pages, stride, 0, tlb_level, false,
>>> kvm_lpa2_is_enabled());
>>>    -static inline void __flush_tlb_range(struct vm_area_struct *vma,
>>> +static inline void __flush_tlb_range_nosync(struct vm_area_struct *vma,
>>>                         unsigned long start, unsigned long end,
>>>                         unsigned long stride, bool last_level,
>>>                         int tlb_level)
>>> @@ -456,10 +456,19 @@ static inline void __flush_tlb_range(struct
>>> vm_area_struct *vma,
>>>            __flush_tlb_range_op(vae1is, start, pages, stride, asid,
>>>                         tlb_level, true, lpa2_is_enabled());
>>>    -    dsb(ish);
>>>        mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, start, end);
>>>    }
>>>    +static inline void __flush_tlb_range(struct vm_area_struct *vma,
>>> +                     unsigned long start, unsigned long end,
>>> +                     unsigned long stride, bool last_level,
>>> +                     int tlb_level)
>>> +{
>>> +    __flush_tlb_range_nosync(vma, start, end, stride,
>>> +                 last_level, tlb_level);
>>> +    dsb(ish);
>>> +}
>>> +
>>>    static inline void flush_tlb_range(struct vm_area_struct *vma,
>>>                       unsigned long start, unsigned long end)
>>>    {
>>
>> You're now calling dsb() after mmu_notifier_arch_invalidate_secondary_tlbs().
>>
>>
>> In flush_tlb_mm(), we have the order
>>
>>      dsb(ish);
>>      mmu_notifier_arch_invalidate_secondary_tlbs()
>>
>> In flush_tlb_page(), we have the effective order:
>>
>>      mmu_notifier_arch_invalidate_secondary_tlbs()
>>      dsb(ish);
>>
>> In flush_tlb_range(), we used to have the order:
>>
>>      dsb(ish);
>>      mmu_notifier_arch_invalidate_secondary_tlbs();
>>
>>
>> So I *suspect* having that DSB before
>> mmu_notifier_arch_invalidate_secondary_tlbs() is fine. Hopefully, nothing in
>> there relies on that placement.
> 
> Will spotted this against v3. My argument was that I was following the existing
> pattern in flush_tlb_page(). Apparently that is not correct and needs changing,
> but the conclusion was to leave my change as is for now, since it is consistent
> and change them at a later date together.

Good, I think you should add a few words to the patch description 
("ordering might be incorrect, but is in-line with __flush_tlb_page()"; 
will be resolved separately).

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH v5 18/25] arm64/mm: Split __flush_tlb_range() to elide trailing DSB
  2024-02-12 13:15         ` David Hildenbrand
  (?)
@ 2024-02-12 13:27           ` Ryan Roberts
  -1 siblings, 0 replies; 240+ messages in thread
From: Ryan Roberts @ 2024-02-12 13:27 UTC (permalink / raw)
  To: David Hildenbrand, Catalin Marinas, Will Deacon, Ard Biesheuvel,
	Marc Zyngier, James Morse, Andrey Ryabinin, Andrew Morton,
	Matthew Wilcox, Mark Rutland, Kefeng Wang, John Hubbard, Zi Yan,
	Barry Song, Alistair Popple, Yang Shi, Nicholas Piggin,
	Christophe Leroy, Aneesh Kumar K.V, Naveen N. Rao,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin
  Cc: linux-arm-kernel, x86, linuxppc-dev, linux-mm, linux-kernel

On 12/02/2024 13:15, David Hildenbrand wrote:
> On 12.02.24 14:05, Ryan Roberts wrote:
>> On 12/02/2024 12:44, David Hildenbrand wrote:
>>> On 02.02.24 09:07, Ryan Roberts wrote:
>>>> Split __flush_tlb_range() into __flush_tlb_range_nosync() +
>>>> __flush_tlb_range(), in the same way as the existing flush_tlb_page()
>>>> arrangement. This allows calling __flush_tlb_range_nosync() to elide the
>>>> trailing DSB. Forthcoming "contpte" code will take advantage of this
>>>> when clearing the young bit from a contiguous range of ptes.
>>>>
>>>> Tested-by: John Hubbard <jhubbard@nvidia.com>
>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>> ---
>>>>    arch/arm64/include/asm/tlbflush.h | 13 +++++++++++--
>>>>    1 file changed, 11 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/arch/arm64/include/asm/tlbflush.h
>>>> b/arch/arm64/include/asm/tlbflush.h
>>>> index 79e932a1bdf8..50a765917327 100644
>>>> --- a/arch/arm64/include/asm/tlbflush.h
>>>> +++ b/arch/arm64/include/asm/tlbflush.h
>>>> @@ -422,7 +422,7 @@ do {                                    \
>>>>    #define __flush_s2_tlb_range_op(op, start, pages, stride, tlb_level) \
>>>>        __flush_tlb_range_op(op, start, pages, stride, 0, tlb_level, false,
>>>> kvm_lpa2_is_enabled());
>>>>    -static inline void __flush_tlb_range(struct vm_area_struct *vma,
>>>> +static inline void __flush_tlb_range_nosync(struct vm_area_struct *vma,
>>>>                         unsigned long start, unsigned long end,
>>>>                         unsigned long stride, bool last_level,
>>>>                         int tlb_level)
>>>> @@ -456,10 +456,19 @@ static inline void __flush_tlb_range(struct
>>>> vm_area_struct *vma,
>>>>            __flush_tlb_range_op(vae1is, start, pages, stride, asid,
>>>>                         tlb_level, true, lpa2_is_enabled());
>>>>    -    dsb(ish);
>>>>        mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, start, end);
>>>>    }
>>>>    +static inline void __flush_tlb_range(struct vm_area_struct *vma,
>>>> +                     unsigned long start, unsigned long end,
>>>> +                     unsigned long stride, bool last_level,
>>>> +                     int tlb_level)
>>>> +{
>>>> +    __flush_tlb_range_nosync(vma, start, end, stride,
>>>> +                 last_level, tlb_level);
>>>> +    dsb(ish);
>>>> +}
>>>> +
>>>>    static inline void flush_tlb_range(struct vm_area_struct *vma,
>>>>                       unsigned long start, unsigned long end)
>>>>    {
>>>
>>> You're now calling dsb() after mmu_notifier_arch_invalidate_secondary_tlbs().
>>>
>>>
>>> In flush_tlb_mm(), we have the order
>>>
>>>      dsb(ish);
>>>      mmu_notifier_arch_invalidate_secondary_tlbs()
>>>
>>> In flush_tlb_page(), we have the effective order:
>>>
>>>      mmu_notifier_arch_invalidate_secondary_tlbs()
>>>      dsb(ish);
>>>
>>> In flush_tlb_range(), we used to have the order:
>>>
>>>      dsb(ish);
>>>      mmu_notifier_arch_invalidate_secondary_tlbs();
>>>
>>>
>>> So I *suspect* having that DSB before
>>> mmu_notifier_arch_invalidate_secondary_tlbs() is fine. Hopefully, nothing in
>>> there relies on that placement.
>>
>> Will spotted this against v3. My argument was that I was following the existing
>> pattern in flush_tlb_page(). Apparently that is not correct and needs changing,
>> but the conclusion was to leave my change as is for now, since it is consistent,
>> and to change them all together at a later date.
> 
> Good, I think you should add a few words to the patch description ("ordering
> might be incorrect, but is in-line with __flush_tlb_page()"; will be resolved
> separately).
> 

ACK, will do. Thanks!


^ permalink raw reply	[flat|nested] 240+ messages in thread
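
The practical payoff of the split: a caller that operates on a whole
contpte block can do the per-PTE work and the TLB invalidation first, and
pay for a single DSB at the end rather than one per entry. A rough sketch
of that calling pattern -- the helper name __ptep_test_and_clear_young and
the exact __flush_tlb_range_nosync arguments here are assumptions for
illustration, not the series' actual contpte code:

static inline int sketch_clear_young_contpte(struct vm_area_struct *vma,
                                             unsigned long addr, pte_t *ptep)
{
        unsigned long start = addr;
        int young = 0;
        int i;

        /* Per-entry work across the block; no barriers yet. */
        for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
                young |= __ptep_test_and_clear_young(vma, addr, ptep);

        /* One invalidation for the block, with the trailing DSB elided. */
        __flush_tlb_range_nosync(vma, start, addr, PAGE_SIZE, true, 3);

        return young;
}

Anything that later needs the invalidation to have completed issues the
dsb(ish) itself, which is exactly what the new __flush_tlb_range() wrapper
above does.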

* Re: [PATCH v5 22/25] mm: Add pte_batch_hint() to reduce scanning in folio_pte_batch()
  2024-02-02  8:07   ` Ryan Roberts
  (?)
@ 2024-02-12 13:43     ` David Hildenbrand
  -1 siblings, 0 replies; 240+ messages in thread
From: David Hildenbrand @ 2024-02-12 13:43 UTC (permalink / raw)
  To: Ryan Roberts, Catalin Marinas, Will Deacon, Ard Biesheuvel,
	Marc Zyngier, James Morse, Andrey Ryabinin, Andrew Morton,
	Matthew Wilcox, Mark Rutland, Kefeng Wang, John Hubbard, Zi Yan,
	Barry Song, Alistair Popple, Yang Shi, Nicholas Piggin,
	Christophe Leroy, Aneesh Kumar K.V, Naveen N. Rao,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin
  Cc: linux-arm-kernel, x86, linuxppc-dev, linux-mm, linux-kernel

On 02.02.24 09:07, Ryan Roberts wrote:
> Some architectures (e.g. arm64) can tell from looking at a pte, if some
> follow-on ptes also map contiguous physical memory with the same pgprot.
> (for arm64, these are contpte mappings).
> 
> Take advantage of this knowledge to optimize folio_pte_batch() so that
> it can skip these ptes when scanning to create a batch. By default, if
> an arch does not opt-in, folio_pte_batch() returns a compile-time 1, so
> the changes are optimized out and the behaviour is as before.
> 
> arm64 will opt-in to providing this hint in the next patch, which will
> greatly reduce the cost of ptep_get() when scanning a range of contptes.
> 
> Tested-by: John Hubbard <jhubbard@nvidia.com>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>   include/linux/pgtable.h | 18 ++++++++++++++++++
>   mm/memory.c             | 20 +++++++++++++-------
>   2 files changed, 31 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 50f32cccbd92..cba31f177d27 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -212,6 +212,24 @@ static inline int pmd_dirty(pmd_t pmd)
>   #define arch_flush_lazy_mmu_mode()	do {} while (0)
>   #endif
>   
> +#ifndef pte_batch_hint
> +/**
> + * pte_batch_hint - Number of pages that can be added to batch without scanning.
> + * @ptep: Page table pointer for the entry.
> + * @pte: Page table entry.
> + *
> + * Some architectures know that a set of contiguous ptes all map the same
> + * contiguous memory with the same permissions. In this case, it can provide a
> + * hint to aid pte batching without the core code needing to scan every pte.

I think we might want to document here the expectation regarding
dirty/accessed bits. folio_pte_batch() will ignore dirty bits only with
FPB_IGNORE_DIRTY. But especially for arm64, it makes sense to ignore them
always when batching, because the dirty bit may target any pte part of the
cont-pte group either way.

Maybe something like:

"
An architecture implementation may only ignore the PTE accessed and dirty bits.
Further, it may only ignore the dirty bit if that bit is already not
maintained with precision per PTE inside the hinted batch, and ptep_get()
would already have to collect it from various PTEs.
"

I think there are some more details to it, but I'm hoping something along
the lines above is sufficient.


> +
>   #ifndef pte_advance_pfn
>   static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
>   {
> diff --git a/mm/memory.c b/mm/memory.c
> index 65fbe4f886c1..902665b27702 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -988,16 +988,21 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>   {
>   	unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio);
>   	const pte_t *end_ptep = start_ptep + max_nr;
> -	pte_t expected_pte = __pte_batch_clear_ignored(pte_advance_pfn(pte, 1), flags);
> -	pte_t *ptep = start_ptep + 1;
> +	pte_t expected_pte = __pte_batch_clear_ignored(pte, flags);
> +	pte_t *ptep = start_ptep;
>   	bool writable;
> +	int nr;
>   
>   	if (any_writable)
>   		*any_writable = false;
>   
>   	VM_WARN_ON_FOLIO(!pte_present(pte), folio);
>   
> -	while (ptep != end_ptep) {
> +	nr = pte_batch_hint(ptep, pte);
> +	expected_pte = pte_advance_pfn(expected_pte, nr);
> +	ptep += nr;
> +

*Maybe* it's easier to get when initializing expected_pte+ptep only once.

Like:

[...]
pte_t expected_pte, *ptep;
[...]

nr = pte_batch_hint(start_ptep, pte);
expected_pte = __pte_batch_clear_ignored(pte_advance_pfn(pte, nr), flags);
ptep = start_ptep + nr;

> +	while (ptep < end_ptep) {
>   		pte = ptep_get(ptep);
>   		if (any_writable)
>   			writable = !!pte_write(pte);
> @@ -1011,17 +1016,18 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>   		 * corner cases the next PFN might fall into a different
>   		 * folio.
>   		 */
> -		if (pte_pfn(pte) == folio_end_pfn)
> +		if (pte_pfn(pte) >= folio_end_pfn)
>   			break;
>   
>   		if (any_writable)
>   			*any_writable |= writable;
>   
> -		expected_pte = pte_advance_pfn(expected_pte, 1);
> -		ptep++;
> +		nr = pte_batch_hint(ptep, pte);
> +		expected_pte = pte_advance_pfn(expected_pte, nr);
> +		ptep += nr;
>   	}
>   
> -	return ptep - start_ptep;
> +	return min(ptep - start_ptep, max_nr);
>   }

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 240+ messages in thread
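
The quoted hunk is trimmed around the generic fallback, but the commit
message pins down its behaviour: with no architecture opt-in, the hint is
a compile-time 1. A minimal sketch consistent with that description (not
the literal patch text):

#ifndef pte_batch_hint
static inline unsigned int pte_batch_hint(pte_t *ptep, pte_t pte)
{
        return 1;
}
#endif

With the constant folded in, the folio_pte_batch() changes collapse back
to the old one-entry-at-a-time scan; with the arm64 hint added in the next
patch, the loop instead advances a whole contpte block (16 entries on a 4K
kernel) per ptep_get().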

* Re: [PATCH v5 23/25] arm64/mm: Implement pte_batch_hint()
  2024-02-02  8:07   ` Ryan Roberts
  (?)
@ 2024-02-12 13:46     ` David Hildenbrand
  -1 siblings, 0 replies; 240+ messages in thread
From: David Hildenbrand @ 2024-02-12 13:46 UTC (permalink / raw)
  To: Ryan Roberts, Catalin Marinas, Will Deacon, Ard Biesheuvel,
	Marc Zyngier, James Morse, Andrey Ryabinin, Andrew Morton,
	Matthew Wilcox, Mark Rutland, Kefeng Wang, John Hubbard, Zi Yan,
	Barry Song, Alistair Popple, Yang Shi, Nicholas Piggin,
	Christophe Leroy, Aneesh Kumar K.V, Naveen N. Rao,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin
  Cc: linux-arm-kernel, x86, linuxppc-dev, linux-mm, linux-kernel

On 02.02.24 09:07, Ryan Roberts wrote:
> When core code iterates over a range of ptes and calls ptep_get() for
> each of them, if the range happens to cover contpte mappings, the number
> of pte reads becomes amplified by a factor of the number of PTEs in a
> contpte block. This is because for each call to ptep_get(), the
> implementation must read all of the ptes in the contpte block to which
> it belongs to gather the access and dirty bits.
> 
> This causes a hotspot for fork(), as well as operations that unmap
> memory such as munmap(), exit and madvise(MADV_DONTNEED). Fortunately we
> can fix this by implementing pte_batch_hint() which allows their
> iterators to skip getting the contpte tail ptes when gathering the batch
> of ptes to operate on. This results in the number of PTE reads returning
> to 1 per pte.
> 
> Tested-by: John Hubbard <jhubbard@nvidia.com>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>   arch/arm64/include/asm/pgtable.h | 9 +++++++++
>   1 file changed, 9 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index ad04adb7b87f..353ea67b5d75 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -1220,6 +1220,15 @@ static inline void contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>   		__contpte_try_unfold(mm, addr, ptep, pte);
>   }
>   
> +#define pte_batch_hint pte_batch_hint
> +static inline unsigned int pte_batch_hint(pte_t *ptep, pte_t pte)
> +{
> +	if (!pte_valid_cont(pte))
> +		return 1;
> +
> +	return CONT_PTES - (((unsigned long)ptep >> 3) & (CONT_PTES - 1));
> +}
> +
>   /*
>    * The below functions constitute the public API that arm64 presents to the
>    * core-mm to manipulate PTE entries within their page tables (or at least this


Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 240+ messages in thread
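
The arithmetic is compact enough to be easy to misread: pte_t is 8 bytes,
so (unsigned long)ptep >> 3 is an entry index, masking with CONT_PTES - 1
gives the offset within the contpte block, and subtracting from CONT_PTES
gives the number of entries left in the block, counting the current one.
A small userspace sketch of the same expression, assuming a 4K-granule
kernel where CONT_PTES is 16:

#include <stdio.h>

#define CONT_PTES 16UL  /* 64K contpte block / 4K pages */

/* Same expression as the patch, applied to a raw ptep address. */
static unsigned long batch_hint(unsigned long ptep_addr)
{
        return CONT_PTES - ((ptep_addr >> 3) & (CONT_PTES - 1));
}

int main(void)
{
        unsigned long base = 0xffff000012340000UL;      /* block-aligned */

        printf("%lu\n", batch_hint(base));              /* entry 0  -> 16 */
        printf("%lu\n", batch_hint(base + 4 * 8));      /* entry 4  -> 12 */
        printf("%lu\n", batch_hint(base + 15 * 8));     /* entry 15 ->  1 */
        return 0;
}

Because the hint never reaches past the end of the current block, a scan
that lands mid-block only skips to the block boundary, so folio_pte_batch()
still revalidates expected_pte once per contpte block.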

* Re: [PATCH v5 19/25] arm64/mm: Wire up PTE_CONT for user mappings
  2024-02-12 12:59       ` Ryan Roberts
  (?)
@ 2024-02-12 13:54         ` David Hildenbrand
  -1 siblings, 0 replies; 240+ messages in thread
From: David Hildenbrand @ 2024-02-12 13:54 UTC (permalink / raw)
  To: Ryan Roberts, Mark Rutland
  Cc: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	Kefeng Wang, John Hubbard, Zi Yan, Barry Song, Alistair Popple,
	Yang Shi, Nicholas Piggin, Christophe Leroy, Aneesh Kumar K.V,
	Naveen N. Rao, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, linux-arm-kernel, x86, linuxppc-dev,
	linux-mm, linux-kernel

>> If so, I wonder if we could instead do that comparison modulo the access/dirty
>> bits,
> 
> I think that would work - but will need to think a bit more on it.
> 
>> and leave ptep_get_lockless() only reading a single entry?
> 
> I think we will need to do something a bit less fragile. ptep_get() does collect
> the access/dirty bits so it's confusing if ptep_get_lockless() doesn't IMHO. So
> we will likely want to rename the function and make its documentation explicit
> that it does not return those bits.
> 
> ptep_get_lockless_noyoungdirty()? yuk... Any ideas?
> 
> Of course if I could convince you the current implementation is safe, I might be
> able to sidestep this optimization until a later date?

As discussed (and pointed out above), there might be quite a few 
callsites where we don't really care about uptodate accessed/dirty bits 
-- where ptep_get() is used nowadays.

One way to approach that I had in mind was having an explicit interface:

ptep_get()
ptep_get_uptodate()
ptep_get_lockless()
ptep_get_lockless_uptodate()

Especially the last one might not be needed.

Further, "uptodate" might not be the best choice because of 
PageUptodate() and friends. But it's better than 
"youngdirty"/"noyoungdirty" IMHO.

Of course, any such changes require care and are better done one step at 
a time, separately.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 240+ messages in thread
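
For concreteness, the interface split floated above would amount to
something like the declarations below. This is only a sketch of the idea:
the "_uptodate" variants do not exist, and relaxing plain ptep_get() to
allow stale accessed/dirty bits is the proposal here, not current
behaviour.

/*
 * ptep_get() would be allowed to return stale accessed/dirty bits; the
 * "_uptodate" variants would be the ones that gather them from the whole
 * contpte block, the way arm64's ptep_get() does today.
 */
pte_t ptep_get(pte_t *ptep);
pte_t ptep_get_uptodate(pte_t *ptep);
pte_t ptep_get_lockless(pte_t *ptep);
pte_t ptep_get_lockless_uptodate(pte_t *ptep);  /* possibly not needed */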

* Re: [PATCH v5 03/25] mm: Make pte_next_pfn() a wrapper around pte_advance_pfn()
  2024-02-12 12:14     ` David Hildenbrand
  (?)
@ 2024-02-12 14:10       ` Ryan Roberts
  -1 siblings, 0 replies; 240+ messages in thread
From: Ryan Roberts @ 2024-02-12 14:10 UTC (permalink / raw)
  To: David Hildenbrand, Catalin Marinas, Will Deacon, Ard Biesheuvel,
	Marc Zyngier, James Morse, Andrey Ryabinin, Andrew Morton,
	Matthew Wilcox, Mark Rutland, Kefeng Wang, John Hubbard, Zi Yan,
	Barry Song, Alistair Popple, Yang Shi, Nicholas Piggin,
	Christophe Leroy, Aneesh Kumar K.V, Naveen N. Rao,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin
  Cc: linux-arm-kernel, x86, linuxppc-dev, linux-mm, linux-kernel

On 12/02/2024 12:14, David Hildenbrand wrote:
> On 02.02.24 09:07, Ryan Roberts wrote:
>> The goal is to be able to advance a PTE by an arbitrary number of PFNs.
>> So introduce a new API that takes a nr param.
>>
>> We are going to remove pte_next_pfn() and replace it with
>> pte_advance_pfn(). As a first step, implement pte_next_pfn() as a
>> wrapper around pte_advance_pfn() so that we can incrementally switch the
>> architectures over. Once all arches are moved over, we will change all
>> the core-mm callers to call pte_advance_pfn() directly and remove the
>> wrapper.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>   include/linux/pgtable.h | 8 +++++++-
>>   1 file changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index 5e7eaf8f2b97..815d92dcb96b 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -214,9 +214,15 @@ static inline int pmd_dirty(pmd_t pmd)
>>       #ifndef pte_next_pfn
>> +#ifndef pte_advance_pfn
>> +static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
>> +{
>> +    return __pte(pte_val(pte) + (nr << PFN_PTE_SHIFT));
>> +}
>> +#endif
>>   static inline pte_t pte_next_pfn(pte_t pte)
>>   {
>> -    return __pte(pte_val(pte) + (1UL << PFN_PTE_SHIFT));
>> +    return pte_advance_pfn(pte, 1);
>>   }
>>   #endif
>>   
> 
> I do wonder if we simply want to leave pte_next_pfn() around? Especially patches
> #4 and #6 don't really benefit from the change? Nor do the other set_ptes()
> implementations.
> 
> That is, only convert all pte_next_pfn()->pte_advance_pfn(), and leave a
> pte_next_pfn() macro in place.
> 
> Any downsides to that? 

The downside is just having multiple functions that effectively do the same
thing. Personally I think it's cleaner and easier to understand the code with
just one generic function, to which we pass 1 where we only want to advance by
1. In the end, there are only a couple of places where pte_advance_pfn(1) is
used, so it doesn't really seem valuable to me to maintain a specialization.

Unless you feel strongly that we need to keep pte_next_pfn() then I'd prefer to
leave it as I've done in this series.

> This patch here would become:
> 
> #ifndef pte_advance_pfn
> static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
> {
>     return __pte(pte_val(pte) + (nr << PFN_PTE_SHIFT));
> }
> #endif
> 
> #ifndef pte_next_pfn
> #define pte_next_pfn(pte) pte_advance_pfn(pte, 1)
> #endif
> 
> As you convert the three arches, make them define pte_advance_pfn and undefine
> pte_next_pfn. In the end, you can drop the #ifdef around pte_next_pfn here.
> 


^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH v5 03/25] mm: Make pte_next_pfn() a wrapper around pte_advance_pfn()
  2024-02-12 14:10       ` Ryan Roberts
  (?)
@ 2024-02-12 14:29         ` David Hildenbrand
  -1 siblings, 0 replies; 240+ messages in thread
From: David Hildenbrand @ 2024-02-12 14:29 UTC (permalink / raw)
  To: Ryan Roberts, Catalin Marinas, Will Deacon, Ard Biesheuvel,
	Marc Zyngier, James Morse, Andrey Ryabinin, Andrew Morton,
	Matthew Wilcox, Mark Rutland, Kefeng Wang, John Hubbard, Zi Yan,
	Barry Song, Alistair Popple, Yang Shi, Nicholas Piggin,
	Christophe Leroy, Aneesh Kumar K.V, Naveen N. Rao,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin
  Cc: linux-arm-kernel, x86, linuxppc-dev, linux-mm, linux-kernel

On 12.02.24 15:10, Ryan Roberts wrote:
> On 12/02/2024 12:14, David Hildenbrand wrote:
>> On 02.02.24 09:07, Ryan Roberts wrote:
>>> The goal is to be able to advance a PTE by an arbitrary number of PFNs.
>>> So introduce a new API that takes a nr param.
>>>
>>> We are going to remove pte_next_pfn() and replace it with
>>> pte_advance_pfn(). As a first step, implement pte_next_pfn() as a
>>> wrapper around pte_advance_pfn() so that we can incrementally switch the
>>> architectures over. Once all arches are moved over, we will change all
>>> the core-mm callers to call pte_advance_pfn() directly and remove the
>>> wrapper.
>>>
>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>> ---
>>>    include/linux/pgtable.h | 8 +++++++-
>>>    1 file changed, 7 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>> index 5e7eaf8f2b97..815d92dcb96b 100644
>>> --- a/include/linux/pgtable.h
>>> +++ b/include/linux/pgtable.h
>>> @@ -214,9 +214,15 @@ static inline int pmd_dirty(pmd_t pmd)
>>>        #ifndef pte_next_pfn
>>> +#ifndef pte_advance_pfn
>>> +static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
>>> +{
>>> +    return __pte(pte_val(pte) + (nr << PFN_PTE_SHIFT));
>>> +}
>>> +#endif
>>>    static inline pte_t pte_next_pfn(pte_t pte)
>>>    {
>>> -    return __pte(pte_val(pte) + (1UL << PFN_PTE_SHIFT));
>>> +    return pte_advance_pfn(pte, 1);
>>>    }
>>>    #endif
>>>    
>>
>> I do wonder if we simply want to leave pte_next_pfn() around? Especially patch
>> #4, #6 don't really benefit from the change? So are the other set_ptes()
>> implementations.
>>
>> That is, only convert all pte_next_pfn()->pte_advance_pfn(), and leave a
>> pte_next_pfn() macro in place.
>>
>> Any downsides to that?
> 
> The downside is just having multiple functions that effectively do the same
> thing. Personally I think it's cleaner and easier to understand the code with
> just one generic function, to which we pass 1 where we only want to advance by
> 1. In the end, there are only a couple of places where pte_advance_pfn(1) is
> used, so it doesn't really seem valuable to me to maintain a specialization.

Well, not really functions, just a macro. Like we have set_pte_at() 
translating to set_ptes().

Arguably, we have more callers of set_pte_at().

"Easier to understand", I don't know. :)

> 
> Unless you feel strongly that we need to keep pte_next_pfn() then I'd prefer to
> leave it as I've done in this series.

Well, it makes your patch set shorter and there is less code churn.

So personally, I'd just leave pte_next_pfn() in there. But whatever you 
prefer, not the end of the world.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH v5 19/25] arm64/mm: Wire up PTE_CONT for user mappings
  2024-02-12 13:54         ` David Hildenbrand
  (?)
@ 2024-02-12 14:45           ` Ryan Roberts
  -1 siblings, 0 replies; 240+ messages in thread
From: Ryan Roberts @ 2024-02-12 14:45 UTC (permalink / raw)
  To: David Hildenbrand, Mark Rutland
  Cc: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	Kefeng Wang, John Hubbard, Zi Yan, Barry Song, Alistair Popple,
	Yang Shi, Nicholas Piggin, Christophe Leroy, Aneesh Kumar K.V,
	Naveen N. Rao, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, linux-arm-kernel, x86, linuxppc-dev,
	linux-mm, linux-kernel

On 12/02/2024 13:54, David Hildenbrand wrote:
>>> If so, I wonder if we could instead do that comparison modulo the access/dirty
>>> bits,
>>
>> I think that would work - but will need to think a bit more on it.
>>
>>> and leave ptep_get_lockless() only reading a single entry?
>>
>> I think we will need to do something a bit less fragile. ptep_get() does collect
>> the access/dirty bits so it's confusing if ptep_get_lockless() doesn't IMHO. So
>> we will likely want to rename the function and make its documentation explicit
>> that it does not return those bits.
>>
>> ptep_get_lockless_noyoungdirty()? yuk... Any ideas?
>>
>> Of course if I could convince you the current implementation is safe, I might be
>> able to sidestep this optimization until a later date?
> 
> As discussed (and pointed out above), there might be quite some callsites where
> we don't really care about uptodate accessed/dirty bits -- where ptep_get() is
> used nowadays.
> 
> One way to approach that I had in mind was having an explicit interface:
> 
> ptep_get()
> ptep_get_uptodate()
> ptep_get_lockless()
> ptep_get_lockless_uptodate()

Yes, I like the direction of this. I guess we anticipate that call sites
requiring the "_uptodate" variant will be the minority so it makes sense to use
the current names for the "_not_uptodate" variants? But to do a slow migration,
it might be better/safer to have the weaker variant use the new name - that
would allow us to downgrade one at a time?

> 
> Especially the last one might not be needed.
I've done a scan through the code and agree with Mark's original conclusions.
Additionally, huge_pte_alloc() (which isn't used for arm64) doesn't rely on
access/dirty info. So I think I could migrate everything to the weaker variant
fairly easily.

> 
> Further, "uptodate" might not be the best choice because of PageUptodate() and
> friends. But it's better than "youngdirty"/"noyoungdirty" IMHO.

Certainly agree with "noyoungdirty" being a horrible name. How about "_sync" /
"_nosync"?

> 
> Of course, any such changes require care and are better done one step at a time
> separately.
> 

So I propose to introduce ptep_get_lockless_nosync() (name up for discussion)
and migrate all users to it, as part of this series. This will side-step Mark's
correctness concerns. We can add ptep_get_nosync() later and migrate slowly.

Shout if you think this is a bad plan.
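
To make that concrete, the generic fallback could be as trivial as the sketch
below (the name and placement are still up for discussion; this is just a
rough illustration, not a final patch):

	#ifndef ptep_get_lockless_nosync
	static inline pte_t ptep_get_lockless_nosync(pte_t *ptep)
	{
		/*
		 * Default to the precise variant; arches that can return a
		 * cheaper view with potentially stale access/dirty bits
		 * (e.g. arm64 contpte) would override this.
		 */
		return ptep_get_lockless(ptep);
	}
	#endif

Callers that don't care about up-to-date access/dirty information would then
be migrated to the new name one at a time.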

Thanks,
Ryan



^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH v5 22/25] mm: Add pte_batch_hint() to reduce scanning in folio_pte_batch()
  2024-02-12 13:43     ` David Hildenbrand
  (?)
@ 2024-02-12 15:00       ` Ryan Roberts
  -1 siblings, 0 replies; 240+ messages in thread
From: Ryan Roberts @ 2024-02-12 15:00 UTC (permalink / raw)
  To: David Hildenbrand, Catalin Marinas, Will Deacon, Ard Biesheuvel,
	Marc Zyngier, James Morse, Andrey Ryabinin, Andrew Morton,
	Matthew Wilcox, Mark Rutland, Kefeng Wang, John Hubbard, Zi Yan,
	Barry Song, Alistair Popple, Yang Shi, Nicholas Piggin,
	Christophe Leroy, Aneesh Kumar K.V, Naveen N. Rao,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin
  Cc: linux-arm-kernel, x86, linuxppc-dev, linux-mm, linux-kernel

On 12/02/2024 13:43, David Hildenbrand wrote:
> On 02.02.24 09:07, Ryan Roberts wrote:
>> Some architectures (e.g. arm64) can tell from looking at a pte, if some
>> follow-on ptes also map contiguous physical memory with the same pgprot.
>> (for arm64, these are contpte mappings).
>>
>> Take advantage of this knowledge to optimize folio_pte_batch() so that
>> it can skip these ptes when scanning to create a batch. By default, if
>> an arch does not opt-in, folio_pte_batch() returns a compile-time 1, so
>> the changes are optimized out and the behaviour is as before.
>>
>> arm64 will opt-in to providing this hint in the next patch, which will
>> greatly reduce the cost of ptep_get() when scanning a range of contptes.
>>
>> Tested-by: John Hubbard <jhubbard@nvidia.com>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>   include/linux/pgtable.h | 18 ++++++++++++++++++
>>   mm/memory.c             | 20 +++++++++++++-------
>>   2 files changed, 31 insertions(+), 7 deletions(-)
>>
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index 50f32cccbd92..cba31f177d27 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -212,6 +212,24 @@ static inline int pmd_dirty(pmd_t pmd)
>>   #define arch_flush_lazy_mmu_mode()    do {} while (0)
>>   #endif
>>   +#ifndef pte_batch_hint
>> +/**
>> + * pte_batch_hint - Number of pages that can be added to batch without scanning.
>> + * @ptep: Page table pointer for the entry.
>> + * @pte: Page table entry.
>> + *
>> + * Some architectures know that a set of contiguous ptes all map the same
>> + * contiguous memory with the same permissions. In this case, it can provide a
>> + * hint to aid pte batching without the core code needing to scan every pte.
> 
> I think we might want to document here the expectation regarding
> dirty/accessed bits. folio_pte_batch() will ignore dirty bits only with
> FPB_IGNORE_DIRTY. But especially for arm64, it makes sense to ignore them
> always when batching, because the dirty bit may target any pte part of the
> cont-pte group either way.
> 
> Maybe something like:
> 
> "
> An architecture implementation may only ignore the PTE accessed and dirty bits.
> Further, it may only ignore the dirty bit if that bit is already not
> maintained with precision per PTE inside the hinted batch, and ptep_get()
> would already have to collect it from various PTEs.
> "

Yep, sounds good. I'll add it in next version.
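
For reference, the arm64 opt-in in the next patch is shaped roughly like the
sketch below (illustrative only; CONT_PTES entries per contpte block, and the
pointer arithmetic assumes 8-byte ptes):

	#define pte_batch_hint pte_batch_hint
	static inline unsigned int pte_batch_hint(pte_t *ptep, pte_t pte)
	{
		if (!pte_valid_cont(pte))
			return 1;

		/* Number of entries left in this contpte block from ptep onwards. */
		return CONT_PTES - (((unsigned long)ptep >> 3) & (CONT_PTES - 1));
	}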

> 
> I think there are some more details to it, but I'm hoping something along
> the lines above is sufficient.
> 
> 
>> +
>>   #ifndef pte_advance_pfn
>>   static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
>>   {
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 65fbe4f886c1..902665b27702 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -988,16 +988,21 @@ static inline int folio_pte_batch(struct folio *folio,
>> unsigned long addr,
>>   {
>>       unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio);
>>       const pte_t *end_ptep = start_ptep + max_nr;
>> -    pte_t expected_pte = __pte_batch_clear_ignored(pte_advance_pfn(pte, 1),
>> flags);
>> -    pte_t *ptep = start_ptep + 1;
>> +    pte_t expected_pte = __pte_batch_clear_ignored(pte, flags);
>> +    pte_t *ptep = start_ptep;
>>       bool writable;
>> +    int nr;
>>         if (any_writable)
>>           *any_writable = false;
>>         VM_WARN_ON_FOLIO(!pte_present(pte), folio);
>>   -    while (ptep != end_ptep) {
>> +    nr = pte_batch_hint(ptep, pte);
>> +    expected_pte = pte_advance_pfn(expected_pte, nr);
>> +    ptep += nr;
>> +
> 
> *Maybe* it's easier to get when initializing expected_pte+ptep only once.
> 
> Like:
> 
> [...]
> pte_t expected_pte, *ptep;
> [...]
> 
> nr = pte_batch_hint(start_ptep, pte);
> expected_pte = __pte_batch_clear_ignored(pte_advance_pfn(pte, nr), flags);
> ptep = start_ptep + nr;

Yeah that works for me. Will change for next version.

> 
>> +    while (ptep < end_ptep) {
>>           pte = ptep_get(ptep);
>>           if (any_writable)
>>               writable = !!pte_write(pte);
>> @@ -1011,17 +1016,18 @@ static inline int folio_pte_batch(struct folio *folio,
>> unsigned long addr,
>>            * corner cases the next PFN might fall into a different
>>            * folio.
>>            */
>> -        if (pte_pfn(pte) == folio_end_pfn)
>> +        if (pte_pfn(pte) >= folio_end_pfn)
>>               break;
>>             if (any_writable)
>>               *any_writable |= writable;
>>   -        expected_pte = pte_advance_pfn(expected_pte, 1);
>> -        ptep++;
>> +        nr = pte_batch_hint(ptep, pte);
>> +        expected_pte = pte_advance_pfn(expected_pte, nr);
>> +        ptep += nr;
>>       }
>>   -    return ptep - start_ptep;
>> +    return min(ptep - start_ptep, max_nr);
>>   }
> 
> Acked-by: David Hildenbrand <david@redhat.com>

Thanks!

> 


^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH v5 19/25] arm64/mm: Wire up PTE_CONT for user mappings
  2024-02-12 14:45           ` Ryan Roberts
  (?)
@ 2024-02-12 15:26             ` David Hildenbrand
  -1 siblings, 0 replies; 240+ messages in thread
From: David Hildenbrand @ 2024-02-12 15:26 UTC (permalink / raw)
  To: Ryan Roberts, Mark Rutland
  Cc: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	Kefeng Wang, John Hubbard, Zi Yan, Barry Song, Alistair Popple,
	Yang Shi, Nicholas Piggin, Christophe Leroy, Aneesh Kumar K.V,
	Naveen N. Rao, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, linux-arm-kernel, x86, linuxppc-dev,
	linux-mm, linux-kernel

On 12.02.24 15:45, Ryan Roberts wrote:
> On 12/02/2024 13:54, David Hildenbrand wrote:
>>>> If so, I wonder if we could instead do that comparison modulo the access/dirty
>>>> bits,
>>>
>>> I think that would work - but will need to think a bit more on it.
>>>
>>>> and leave ptep_get_lockless() only reading a single entry?
>>>
>>> I think we will need to do something a bit less fragile. ptep_get() does collect
>>> the access/dirty bits so it's confusing if ptep_get_lockless() doesn't IMHO. So
>>> we will likely want to rename the function and make its documentation explicit
>>> that it does not return those bits.
>>>
>>> ptep_get_lockless_noyoungdirty()? yuk... Any ideas?
>>>
>>> Of course if I could convince you the current implementation is safe, I might be
>>> able to sidestep this optimization until a later date?
>>
>> As discussed (and pointed out above), there might be quite some callsites where
>> we don't really care about uptodate accessed/dirty bits -- where ptep_get() is
>> used nowadays.
>>
>> One way to approach that I had in mind was having an explicit interface:
>>
>> ptep_get()
>> ptep_get_uptodate()
>> ptep_get_lockless()
>> ptep_get_lockless_uptodate()
> 
> Yes, I like the direction of this. I guess we anticipate that call sites
> requiring the "_uptodate" variant will be the minority so it makes sense to use
> the current names for the "_not_uptodate" variants? But to do a slow migration,
> it might be better/safer to have the weaker variant use the new name - that
> would allow us to downgrade one at a time?

Yes, I was primarily struggling with names. Likely it makes sense to 
either have two completely new function names, or use the new name only 
for the "faster but less precise" variant.

> 
>>
>> Especially the last one might not be needed.
> I've done a scan through the code and agree with Mark's original conclusions.
> Additionally, huge_pte_alloc() (which isn't used for arm64) doesn't rely on
> access/dirty info. So I think I could migrate everything to the weaker variant
> fairly easily.
> 
>>
>> Further, "uptodate" might not be the best choice because of PageUptodate() and
>> friends. But it's better than "youngdirty"/"noyoungdirty" IMHO.
> 
> Certainly agree with "noyoungdirty" being a horrible name. How about "_sync" /
> "_nosync"?

I could live with

ptep_get_sync()
ptep_get_nosync()

with proper documentation :)

I don't think we use "_sync" / "_nosync" in the context of pte 
operations yet.

Well, there seems to be "__arm_v7s_pte_sync" in iommu code, but at least 
in core code nothing jumped at me.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH v5 19/25] arm64/mm: Wire up PTE_CONT for user mappings
  2024-02-12 12:59       ` Ryan Roberts
  (?)
@ 2024-02-12 15:30         ` Ryan Roberts
  -1 siblings, 0 replies; 240+ messages in thread
From: Ryan Roberts @ 2024-02-12 15:30 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	David Hildenbrand, Kefeng Wang, John Hubbard, Zi Yan, Barry Song,
	Alistair Popple, Yang Shi, Nicholas Piggin, Christophe Leroy,
	Aneesh Kumar K.V, Naveen N. Rao, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, linux-arm-kernel,
	x86, linuxppc-dev, linux-mm, linux-kernel

On 12/02/2024 12:59, Ryan Roberts wrote:
> On 12/02/2024 12:00, Mark Rutland wrote:
>> Hi Ryan,
>>
>> Overall this looks pretty good; I have a bunch of minor comments below, and a
>> bigger question on the way ptep_get_lockless() works.
> 
> OK great - thanks for the review. Let's see if I can answer them all...
> 
>>
>> On Fri, Feb 02, 2024 at 08:07:50AM +0000, Ryan Roberts wrote:
>>> With the ptep API sufficiently refactored, we can now introduce a new
>>> "contpte" API layer, which transparently manages the PTE_CONT bit for
>>> user mappings.
>>>
>>> In this initial implementation, only suitable batches of PTEs, set via
>>> set_ptes(), are mapped with the PTE_CONT bit. Any subsequent
>>> modification of individual PTEs will cause an "unfold" operation to
>>> repaint the contpte block as individual PTEs before performing the
>>> requested operation. While a modification of a single PTE could cause
>>> the block of PTEs to which it belongs to become eligible for "folding"
>>> into a contpte entry, "folding" is not performed in this initial
>>> implementation due to the costs of checking the requirements are met.
>>> Due to this, contpte mappings will degrade back to normal pte mappings
>>> over time if/when protections are changed. This will be solved in a
>>> future patch.
>>>
>>> Since a contpte block only has a single access and dirty bit, the
>>> semantic here changes slightly; when getting a pte (e.g. ptep_get())
>>> that is part of a contpte mapping, the access and dirty information are
>>> pulled from the block (so all ptes in the block return the same
>>> access/dirty info). When changing the access/dirty info on a pte (e.g.
>>> ptep_set_access_flags()) that is part of a contpte mapping, this change
>>> will affect the whole contpte block. This works fine in practice
>>> since we guarantee that only a single folio is mapped by a contpte
>>> block, and the core-mm tracks access/dirty information per folio.
>>>
>>> In order for the public functions, which used to be pure inline, to
>>> continue to be callable by modules, export all the contpte_* symbols
>>> that are now called by those public inline functions.
>>>
>>> The feature is enabled/disabled with the ARM64_CONTPTE Kconfig parameter
>>> at build time. It defaults to enabled as long as its dependency,
>>> TRANSPARENT_HUGEPAGE is also enabled. The core-mm depends upon
>>> TRANSPARENT_HUGEPAGE to be able to allocate large folios, so if it's not
>>> enabled, then there is no chance of meeting the physical contiguity
>>> requirement for contpte mappings.
>>>
>>> Tested-by: John Hubbard <jhubbard@nvidia.com>
>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>> ---
>>>  arch/arm64/Kconfig               |   9 +
>>>  arch/arm64/include/asm/pgtable.h | 161 ++++++++++++++++++
>>>  arch/arm64/mm/Makefile           |   1 +
>>>  arch/arm64/mm/contpte.c          | 283 +++++++++++++++++++++++++++++++
>>>  4 files changed, 454 insertions(+)
>>>  create mode 100644 arch/arm64/mm/contpte.c
>>>
>>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>>> index d86d7f4758b5..1442e8ed95b6 100644
>>> --- a/arch/arm64/Kconfig
>>> +++ b/arch/arm64/Kconfig
>>> @@ -2230,6 +2230,15 @@ config UNWIND_PATCH_PAC_INTO_SCS
>>>  	select UNWIND_TABLES
>>>  	select DYNAMIC_SCS
>>>  
>>> +config ARM64_CONTPTE
>>> +	bool "Contiguous PTE mappings for user memory" if EXPERT
>>> +	depends on TRANSPARENT_HUGEPAGE
>>> +	default y
>>> +	help
>>> +	  When enabled, user mappings are configured using the PTE contiguous
>>> +	  bit, for any mappings that meet the size and alignment requirements.
>>> +	  This reduces TLB pressure and improves performance.
>>> +
>>>  endmenu # "Kernel Features"
>>>  
>>>  menu "Boot options"
>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>>> index 7dc6b68ee516..34892a95403d 100644
>>> --- a/arch/arm64/include/asm/pgtable.h
>>> +++ b/arch/arm64/include/asm/pgtable.h
>>> @@ -133,6 +133,10 @@ static inline pteval_t __phys_to_pte_val(phys_addr_t phys)
>>>   */
>>>  #define pte_valid_not_user(pte) \
>>>  	((pte_val(pte) & (PTE_VALID | PTE_USER | PTE_UXN)) == (PTE_VALID | PTE_UXN))
>>> +/*
>>> + * Returns true if the pte is valid and has the contiguous bit set.
>>> + */
>>> +#define pte_valid_cont(pte)	(pte_valid(pte) && pte_cont(pte))
>>>  /*
>>>   * Could the pte be present in the TLB? We must check mm_tlb_flush_pending
>>>   * so that we don't erroneously return false for pages that have been
>>> @@ -1135,6 +1139,161 @@ void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte);
>>>  #define vmemmap_update_pte vmemmap_update_pte
>>>  #endif
>>>  
>>> +#ifdef CONFIG_ARM64_CONTPTE
>>> +
>>> +/*
>>> + * The contpte APIs are used to transparently manage the contiguous bit in ptes
>>> + * where it is possible and makes sense to do so. The PTE_CONT bit is considered
>>> + * a private implementation detail of the public ptep API (see below).
>>> + */
>>> +extern void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>>> +				pte_t *ptep, pte_t pte);
>>> +extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
>>> +extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
>>> +extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
>>> +				pte_t *ptep, pte_t pte, unsigned int nr);
>>> +extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>>> +				unsigned long addr, pte_t *ptep);
>>> +extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>>> +				unsigned long addr, pte_t *ptep);
>>> +extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
>>> +				unsigned long addr, pte_t *ptep,
>>> +				pte_t entry, int dirty);
>>> +
>>> +static inline void contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>>> +					pte_t *ptep, pte_t pte)
>>> +{
>>> +	if (unlikely(pte_valid_cont(pte)))
>>> +		__contpte_try_unfold(mm, addr, ptep, pte);
>>> +}
>>> +
>>> +/*
>>> + * The below functions constitute the public API that arm64 presents to the
>>> + * core-mm to manipulate PTE entries within their page tables (or at least this
>>> + * is the subset of the API that arm64 needs to implement). These public
>>> + * versions will automatically and transparently apply the contiguous bit where
>>> + * it makes sense to do so. Therefore any users that are contig-aware (e.g.
>>> + * hugetlb, kernel mapper) should NOT use these APIs, but instead use the
>>> + * private versions, which are prefixed with double underscore. All of these
>>> + * APIs except for ptep_get_lockless() are expected to be called with the PTL
>>> + * held.
>>> + */
>>> +
>>> +#define ptep_get ptep_get
>>> +static inline pte_t ptep_get(pte_t *ptep)
>>> +{
>>> +	pte_t pte = __ptep_get(ptep);
>>> +
>>> +	if (likely(!pte_valid_cont(pte)))
>>> +		return pte;
>>> +
>>> +	return contpte_ptep_get(ptep, pte);
>>> +}
>>> +
>>> +#define ptep_get_lockless ptep_get_lockless
>>> +static inline pte_t ptep_get_lockless(pte_t *ptep)
>>> +{
>>> +	pte_t pte = __ptep_get(ptep);
>>> +
>>> +	if (likely(!pte_valid_cont(pte)))
>>> +		return pte;
>>> +
>>> +	return contpte_ptep_get_lockless(ptep);
>>> +}
>>> +
>>> +static inline void set_pte(pte_t *ptep, pte_t pte)
>>> +{
>>> +	/*
>>> +	 * We don't have the mm or vaddr so cannot unfold contig entries (since
>>> +	 * it requires tlb maintenance). set_pte() is not used in core code, so
>>> +	 * this should never even be called. Regardless do our best to service
>>> +	 * any call and emit a warning if there is any attempt to set a pte on
>>> +	 * top of an existing contig range.
>>> +	 */
>>> +	pte_t orig_pte = __ptep_get(ptep);
>>> +
>>> +	WARN_ON_ONCE(pte_valid_cont(orig_pte));
>>> +	__set_pte(ptep, pte_mknoncont(pte));
>>> +}
>>> +
>>> +#define set_ptes set_ptes
>>> +static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
>>> +				pte_t *ptep, pte_t pte, unsigned int nr)
>>> +{
>>> +	pte = pte_mknoncont(pte);
>>
>> Why do we have to clear the contiguous bit here? Is that for the same reason as
>> set_pte(), or do we expect callers to legitimately call this with the
>> contiguous bit set in 'pte'?
>>
>> I think you explained this to me in-person, and IIRC we don't expect callers to
>> go set the bit themselves, but since it 'leaks' out to them via __ptep_get() we
>> have to clear it here to defer the decision of whether to set/clear it when
>> modifying entries. It would be nice if we could have a description of why/when
>> we need to clear this, e.g. in the 'public API' comment block above.
> 
> Yes, I think you've got it, but just to ram home the point: The PTE_CONT bit is
> private to the architecture code and is never set directly by core code. If the
> public API ever receives a pte that happens to have the PTE_CONT bit set, it
> would be bad news if we then accidentally set that in the pgtable.
> 
> Ideally, we would just unconditionally clear the bit before a getter returns
> the pte (e.g. ptep_get(), ptep_get_lockless(), ptep_get_and_clear(), ...). That
> way, the core code is guaranteed never to see a pte with the PTE_CONT bit set
> and can therefore never accidentally pass such a pte into a setter function.
> However, there is existing functionality that relies on being able to get a pte,
> then pass it to pte_leaf_size(), an arch function that checks the PTE_CONT bit
> to determine how big the leaf is. This is used in perf_get_pgtable_size().
> 
> So to allow perf_get_pgtable_size() to continue to see the "real" page size, I
> decided to allow PTE_CONT to leak through the getters and instead
> unconditionally clear the bit when a pte is passed to any of the setters.
> 
> I'll add a (slightly less verbose) comment as you suggest.
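> 
> As a rough illustration of why the bit has to be allowed to leak out of the
> getters (sketch only; the exact arm64 definition may differ in detail),
> pte_leaf_size() keys off the contig bit:
> 
> 	/* sketch: the leaf size that perf_get_pgtable_size() ends up seeing */
> 	#define pte_leaf_size(pte)	(pte_cont(pte) ? CONT_PTE_SIZE : PAGE_SIZE)
> 
> If the getters stripped PTE_CONT, perf would only ever see PAGE_SIZE here,
> which is why the clearing is deferred to the setters via pte_mknoncont().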
> 
>>
>>> +
>>> +	if (likely(nr == 1)) {
>>> +		contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>>> +		__set_ptes(mm, addr, ptep, pte, 1);
>>> +	} else {
>>> +		contpte_set_ptes(mm, addr, ptep, pte, nr);
>>> +	}
>>> +}
>>> +
>>> +static inline void pte_clear(struct mm_struct *mm,
>>> +				unsigned long addr, pte_t *ptep)
>>> +{
>>> +	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>>> +	__pte_clear(mm, addr, ptep);
>>> +}
>>> +
>>> +#define __HAVE_ARCH_PTEP_GET_AND_CLEAR
>>> +static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>>> +				unsigned long addr, pte_t *ptep)
>>> +{
>>> +	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>>> +	return __ptep_get_and_clear(mm, addr, ptep);
>>> +}
>>> +
>>> +#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
>>> +static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>>> +				unsigned long addr, pte_t *ptep)
>>> +{
>>> +	pte_t orig_pte = __ptep_get(ptep);
>>> +
>>> +	if (likely(!pte_valid_cont(orig_pte)))
>>> +		return __ptep_test_and_clear_young(vma, addr, ptep);
>>> +
>>> +	return contpte_ptep_test_and_clear_young(vma, addr, ptep);
>>> +}
>>> +
>>> +#define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
>>> +static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>>> +				unsigned long addr, pte_t *ptep)
>>> +{
>>> +	pte_t orig_pte = __ptep_get(ptep);
>>> +
>>> +	if (likely(!pte_valid_cont(orig_pte)))
>>> +		return __ptep_clear_flush_young(vma, addr, ptep);
>>> +
>>> +	return contpte_ptep_clear_flush_young(vma, addr, ptep);
>>> +}
>>> +
>>> +#define __HAVE_ARCH_PTEP_SET_WRPROTECT
>>> +static inline void ptep_set_wrprotect(struct mm_struct *mm,
>>> +				unsigned long addr, pte_t *ptep)
>>> +{
>>> +	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>>> +	__ptep_set_wrprotect(mm, addr, ptep);
>>> +}
>>> +
>>> +#define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
>>> +static inline int ptep_set_access_flags(struct vm_area_struct *vma,
>>> +				unsigned long addr, pte_t *ptep,
>>> +				pte_t entry, int dirty)
>>> +{
>>> +	pte_t orig_pte = __ptep_get(ptep);
>>> +
>>> +	entry = pte_mknoncont(entry);
>>> +
>>> +	if (likely(!pte_valid_cont(orig_pte)))
>>> +		return __ptep_set_access_flags(vma, addr, ptep, entry, dirty);
>>> +
>>> +	return contpte_ptep_set_access_flags(vma, addr, ptep, entry, dirty);
>>> +}
>>> +
>>> +#else /* CONFIG_ARM64_CONTPTE */
>>> +
>>>  #define ptep_get				__ptep_get
>>>  #define set_pte					__set_pte
>>>  #define set_ptes				__set_ptes
>>> @@ -1150,6 +1309,8 @@ void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte);
>>>  #define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
>>>  #define ptep_set_access_flags			__ptep_set_access_flags
>>>  
>>> +#endif /* CONFIG_ARM64_CONTPTE */
>>> +
>>>  #endif /* !__ASSEMBLY__ */
>>>  
>>>  #endif /* __ASM_PGTABLE_H */
>>> diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
>>> index dbd1bc95967d..60454256945b 100644
>>> --- a/arch/arm64/mm/Makefile
>>> +++ b/arch/arm64/mm/Makefile
>>> @@ -3,6 +3,7 @@ obj-y				:= dma-mapping.o extable.o fault.o init.o \
>>>  				   cache.o copypage.o flush.o \
>>>  				   ioremap.o mmap.o pgd.o mmu.o \
>>>  				   context.o proc.o pageattr.o fixmap.o
>>> +obj-$(CONFIG_ARM64_CONTPTE)	+= contpte.o
>>>  obj-$(CONFIG_HUGETLB_PAGE)	+= hugetlbpage.o
>>>  obj-$(CONFIG_PTDUMP_CORE)	+= ptdump.o
>>>  obj-$(CONFIG_PTDUMP_DEBUGFS)	+= ptdump_debugfs.o
>>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>>> new file mode 100644
>>> index 000000000000..bfb50e6b44c7
>>> --- /dev/null
>>> +++ b/arch/arm64/mm/contpte.c
>>> @@ -0,0 +1,283 @@
>>> +// SPDX-License-Identifier: GPL-2.0-only
>>> +/*
>>> + * Copyright (C) 2023 ARM Ltd.
>>> + */
>>> +
>>> +#include <linux/mm.h>
>>> +#include <linux/export.h>
>>> +#include <asm/tlbflush.h>
>>> +
>>> +static inline bool mm_is_user(struct mm_struct *mm)
>>> +{
>>> +	/*
>>> +	 * Don't attempt to apply the contig bit to kernel mappings, because
>>> +	 * dynamically adding/removing the contig bit can cause page faults.
>>> +	 * These racing faults are ok for user space, since they get serialized
>>> +	 * on the PTL. But kernel mappings can't tolerate faults.
>>> +	 */
>>> +	return mm != &init_mm;
>>> +}
>>
>> We also have the efi_mm as a non-user mm, though I don't think we manipulate
>> that while it is live, and I'm not sure if that needs any special handling.
> 
> Well we never need this function in the hot (order-0 folio) path, so I think I
> could add a check for efi_mm here with performance implication. It's probably
> safest to explicitly exclude it? What do you think?

Oops: This should have read "I think I could add a check for efi_mm here
*without* performance implication"
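
One untested way of sketching that (it assumes the extern declaration of efi_mm
in <linux/efi.h> is visible for all configs, so the IS_ENABLED() check lets the
comparison compile away when EFI is not built in):

	static inline bool mm_is_user(struct mm_struct *mm)
	{
		/*
		 * Kernel mappings (init_mm, efi_mm) can't tolerate the racing
		 * faults that dynamically adding/removing the contig bit can
		 * cause, so never apply it to them.
		 */
		if (IS_ENABLED(CONFIG_EFI) && unlikely(mm == &efi_mm))
			return false;
		return mm != &init_mm;
	}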

> 
>>
>>> +static inline pte_t *contpte_align_down(pte_t *ptep)
>>> +{
>>> +	return (pte_t *)(ALIGN_DOWN((unsigned long)ptep >> 3, CONT_PTES) << 3);
>>
>> I think this can be:
>>
>> static inline pte_t *contpte_align_down(pte_t *ptep)
>> {
>> 	return PTR_ALIGN_DOWN(ptep, sizeof(*ptep) * CONT_PTES);
>> }
> 
> Yep - that's much less ugly - thanks!
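> 
> (For reference, include/linux/align.h defines it as roughly:
> 
> 	#define PTR_ALIGN_DOWN(p, a)	((typeof(p))ALIGN_DOWN((unsigned long)(p), (a)))
> 
> so with a == sizeof(*ptep) * CONT_PTES it should end up applying the same mask
> as the open-coded shift-by-3 version, just without hard-coding the pte size.)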
> 
>>
>>> +
>>> +static void contpte_convert(struct mm_struct *mm, unsigned long addr,
>>> +			    pte_t *ptep, pte_t pte)
>>> +{
>>> +	struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
>>> +	unsigned long start_addr;
>>> +	pte_t *start_ptep;
>>> +	int i;
>>> +
>>> +	start_ptep = ptep = contpte_align_down(ptep);
>>> +	start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>>> +	pte = pfn_pte(ALIGN_DOWN(pte_pfn(pte), CONT_PTES), pte_pgprot(pte));
>>> +
>>> +	for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE) {
>>> +		pte_t ptent = __ptep_get_and_clear(mm, addr, ptep);
>>> +
>>> +		if (pte_dirty(ptent))
>>> +			pte = pte_mkdirty(pte);
>>> +
>>> +		if (pte_young(ptent))
>>> +			pte = pte_mkyoung(pte);
>>> +	}
>>
>> Not a big deal either way, but I wonder if it makes more sense to accumulate
>> the 'ptent' dirty/young values, then modify 'pte' once, i.e.
>>
>> 	bool dirty = false, young = false;
>>
>> 	for (...) {
>> 		pte_t ptent = __ptep_get_and_clear(mm, addr, ptep);
>> 		dirty |= pte_dirty(ptent);
>> 		young |= pte_young(ptent);
>> 	}
>>
>> 	if (dirty)
>> 		pte = pte_mkdirty(pte);
>> 	if (young)
>> 		pte = pte_mkyoung(pte);
>>
>> I suspect that might generate slightly better code, but I'm also happy with the
>> current form if people think that's more legible (I have no strong feelings
>> either way).
> 
> I kept it this way, because it's the same pattern used in arm64's hugetlbpage.c.
> We also had the same comment against David's batching patches recently, and he
> opted to stick with the former version:
> 
> https://lore.kernel.org/linux-mm/d83309fa-4daa-430f-ae52-4e72162bca9a@redhat.com/
> 
> So I'm inclined to leave it as is, since you're not insisting :)
> 
>>
>>> +
>>> +	__flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
>>> +
>>> +	__set_ptes(mm, start_addr, start_ptep, pte, CONT_PTES);
>>> +}
>>> +
>>> +void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>>> +			pte_t *ptep, pte_t pte)
>>> +{
>>> +	/*
>>> +	 * We have already checked that the ptes are contiguous in
>>> +	 * contpte_try_unfold(), so just check that the mm is user space.
>>> +	 */
>>> +
>>> +	if (!mm_is_user(mm))
>>> +		return;
>>
>> Nit: normally we don't put a line gap between a comment block and the
>> associated block of code.
> 
> ACK, I'll fix in next version.

Just to clarify this, I've got a few instances in this file where I have a
comment that applies to the function as a whole, and in those cases the comment
is the first thing in the body of the function, and there's a blank line between
the end of the comment and the first statement. This is intended to be one of
those comments, although since the function is pretty small, I can see how it
also could look like it applies to the immediately following statements too.

What is the normal policy for such comments? I'd rather leave this alone since
it aligns with how all the others are done in the file. Or should I just remove
the blank line for all instances?


> 
>>
>>> +
>>> +	pte = pte_mknoncont(pte);
>>> +	contpte_convert(mm, addr, ptep, pte);
>>> +}
>>> +EXPORT_SYMBOL(__contpte_try_unfold);
>>> +
>>> +pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>>> +{
>>> +	/*
>>> +	 * Gather access/dirty bits, which may be populated in any of the ptes
>>> +	 * of the contig range. We are guarranteed to be holding the PTL, so any
>>> +	 * contiguous range cannot be unfolded or otherwise modified under our
>>> +	 * feet.
>>> +	 */
>>
>> Nit: s/guarranteed/guaranteed/
> 
> ACK, I'll fix in next version.
> 
>>
>>> +
>>> +	pte_t pte;
>>> +	int i;
>>> +
>>> +	ptep = contpte_align_down(ptep);
>>> +
>>> +	for (i = 0; i < CONT_PTES; i++, ptep++) {
>>> +		pte = __ptep_get(ptep);
>>> +
>>> +		if (pte_dirty(pte))
>>> +			orig_pte = pte_mkdirty(orig_pte);
>>> +
>>> +		if (pte_young(pte))
>>> +			orig_pte = pte_mkyoung(orig_pte);
>>> +	}
>>> +
>>> +	return orig_pte;
>>> +}
>>> +EXPORT_SYMBOL(contpte_ptep_get);
>>> +
>>> +pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
>>> +{
>>> +	/*
>>> +	 * Gather access/dirty bits, which may be populated in any of the ptes
>>> +	 * of the contig range. We may not be holding the PTL, so any contiguous
>>> +	 * range may be unfolded/modified/refolded under our feet. Therefore we
>>> +	 * ensure we read a _consistent_ contpte range by checking that all ptes
>>> +	 * in the range are valid and have CONT_PTE set, that all pfns are
>>> +	 * contiguous and that all pgprots are the same (ignoring access/dirty).
>>> +	 * If we find a pte that is not consistent, then we must be racing with
>>> +	 * an update so start again. If the target pte does not have CONT_PTE
>>> +	 * set then that is considered consistent on its own because it is not
>>> +	 * part of a contpte range.
>>> +	 */
>>> +
>>> +	pgprot_t orig_prot;
>>> +	unsigned long pfn;
>>> +	pte_t orig_pte;
>>> +	pgprot_t prot;
>>> +	pte_t *ptep;
>>> +	pte_t pte;
>>> +	int i;
>>> +
>>> +retry:
>>> +	orig_pte = __ptep_get(orig_ptep);
>>> +
>>> +	if (!pte_valid_cont(orig_pte))
>>> +		return orig_pte;
>>> +
>>> +	orig_prot = pte_pgprot(pte_mkold(pte_mkclean(orig_pte)));
>>> +	ptep = contpte_align_down(orig_ptep);
>>> +	pfn = pte_pfn(orig_pte) - (orig_ptep - ptep);
>>> +
>>> +	for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
>>> +		pte = __ptep_get(ptep);
>>> +		prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
>>> +
>>> +		if (!pte_valid_cont(pte) ||
>>> +		   pte_pfn(pte) != pfn ||
>>> +		   pgprot_val(prot) != pgprot_val(orig_prot))
>>> +			goto retry;
>>> +
>>> +		if (pte_dirty(pte))
>>> +			orig_pte = pte_mkdirty(orig_pte);
>>> +
>>> +		if (pte_young(pte))
>>> +			orig_pte = pte_mkyoung(orig_pte);
>>> +	}
>>> +
>>> +	return orig_pte;
>>> +}
>>> +EXPORT_SYMBOL(contpte_ptep_get_lockless);
>>
>> I'm struggling to convince myself that this is safe in general, as it really
>> depends on how the caller will use this value. Which caller(s) actually care
>> about the access/dirty bits, given those could change at any time anyway?
> 
> I think your points below are valid, and agree we should try to make this work
> without needing access/dirty if possible. But can you elaborate on why you don't
> think it's safe?
> 
>>
>> I took a quick scan, and AFAICT:
> 
> Thanks for enumerating these; Saves me from having to refresh my memory :)
> 
>>
>> * For perf_get_pgtable_size(), we only care about whether the entry is valid
>>   and has the contig bit set. We could clean that up with a new interface, e.g.
>>   something like a new ptep_get_size_lockless().
>>
>> * For gup_pte_range(), I'm not sure we actually need the access/dirty bits when
>>   we look at the pte to start with, since we only care whether we can logically
>>   write to the page at that point.
>>
>>   I see that we later follow up with:
>>
>>     pte_val(pte) != pte_val(ptep_get(ptep))
>>
>>   ... is that why we need ptep_get_lockless() to accumulate the access/dirty
>>   bits? So that shape of lockless-try...locked-compare sequence works?
>>
>> * For huge_pte_alloc(), arm64 doesn't select CONFIG_ARCH_WANT_GENERAL_HUGETLB,
>>   so this doesn't seem to matter.
>>
>> * For __collapse_huge_page_swapin(), we only care if the pte is a swap pte,
>>   which means the pte isn't valid, and we'll return the orig_pte as-is anyway.
>>
>> * For pte_range_none() the access/dirty bits don't matter.
>>
>> * For handle_pte_fault() I think we have the same shape of
>>   lockless-try...locked-compare sequence as for gup_pte_range(), where we don't
>>   care about the access/dirty bits before we reach the locked compare step.
>>
>> * For ptdump_pte_entry() I think it's arguable that we should continue to
>>   report the access/dirty bits separately for each PTE, as we have done until
>>   now, to give an accurate representation of the contents of the translation
>>   tables.
>>
>> * For swap_vma_readahead() and unuse_pte_range() we only care if the PTE is a
>>   swap entry, the access/dirty bits don't matter.
>>
>> So AFAICT this only really matters for gup_pte_range() and handle_pte_fault(),
>> and IIUC that's only so that the locklessly-loaded pte value can be compared
>> with a subsequently locked-loaded entry (for which the access/dirty bits will
>> be accumulated). Have I understood that correctly?
> 
> Yes, I agree with what you are saying. My approach was to try to implement the
> existing APIs accurately though, the argument being that it reduces the chances
> of getting it wrong. But if you think the implementation is unsafe, then I guess
> it blows that out of the water...
> 
>>
>> If so, I wonder if we could instead do that comparison modulo the access/dirty
>> bits, 
> 
> I think that would work - but will need to think a bit more on it.
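> 
> Something along these lines, perhaps (pte_same_norecency() is a made-up name,
> purely to sketch the idea of comparing modulo the access/dirty bits):
> 
> 	static inline bool pte_same_norecency(pte_t a, pte_t b)
> 	{
> 		/* Compare the ptes with the young and dirty bits masked off. */
> 		return pte_val(pte_mkold(pte_mkclean(a))) ==
> 		       pte_val(pte_mkold(pte_mkclean(b)));
> 	}
> 
> ...which the gup/fault revalidation against the locked ptep_get() could then
> use in place of the raw pte_val() comparison.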
> 
>> and leave ptep_get_lockless() only reading a single entry?
> 
> I think we will need to do something a bit less fragile. ptep_get() does collect
> the access/dirty bits so it's confusing if ptep_get_lockless() doesn't IMHO. So
> we will likely want to rename the function and make its documentation explicit
> that it does not return those bits.
> 
> ptep_get_lockless_noyoungdirty()? yuk... Any ideas?
> 
> Of course if I could convince you the current implementation is safe, I might be
> able to sidestep this optimization until a later date?
> 
> Thanks,
> Ryan
> 
> 
>>
>> Thanks,
>> Mark.
>>
>>> +void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
>>> +					pte_t *ptep, pte_t pte, unsigned int nr)
>>> +{
>>> +	unsigned long next;
>>> +	unsigned long end;
>>> +	unsigned long pfn;
>>> +	pgprot_t prot;
>>> +
>>> +	/*
>>> +	 * The set_ptes() spec guarantees that when nr > 1, the initial state of
>>> +	 * all ptes is not-present. Therefore we never need to unfold or
>>> +	 * otherwise invalidate a range before we set the new ptes.
>>> +	 * contpte_set_ptes() should never be called for nr < 2.
>>> +	 */
>>> +	VM_WARN_ON(nr == 1);
>>> +
>>> +	if (!mm_is_user(mm))
>>> +		return __set_ptes(mm, addr, ptep, pte, nr);
>>> +
>>> +	end = addr + (nr << PAGE_SHIFT);
>>> +	pfn = pte_pfn(pte);
>>> +	prot = pte_pgprot(pte);
>>> +
>>> +	do {
>>> +		next = pte_cont_addr_end(addr, end);
>>> +		nr = (next - addr) >> PAGE_SHIFT;
>>> +		pte = pfn_pte(pfn, prot);
>>> +
>>> +		if (((addr | next | (pfn << PAGE_SHIFT)) & ~CONT_PTE_MASK) == 0)
>>> +			pte = pte_mkcont(pte);
>>> +		else
>>> +			pte = pte_mknoncont(pte);
>>> +
>>> +		__set_ptes(mm, addr, ptep, pte, nr);
>>> +
>>> +		addr = next;
>>> +		ptep += nr;
>>> +		pfn += nr;
>>> +
>>> +	} while (addr != end);
>>> +}
>>> +EXPORT_SYMBOL(contpte_set_ptes);
>>> +
>>> +int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>>> +					unsigned long addr, pte_t *ptep)
>>> +{
>>> +	/*
>>> +	 * ptep_clear_flush_young() technically requires us to clear the access
>>> +	 * flag for a _single_ pte. However, the core-mm code actually tracks
>>> +	 * access/dirty per folio, not per page. And since we only create a
>>> +	 * contig range when the range is covered by a single folio, we can get
>>> +	 * away with clearing young for the whole contig range here, so we avoid
>>> +	 * having to unfold.
>>> +	 */
>>> +
>>> +	int young = 0;
>>> +	int i;
>>> +
>>> +	ptep = contpte_align_down(ptep);
>>> +	addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>>> +
>>> +	for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
>>> +		young |= __ptep_test_and_clear_young(vma, addr, ptep);
>>> +
>>> +	return young;
>>> +}
>>> +EXPORT_SYMBOL(contpte_ptep_test_and_clear_young);
>>> +
>>> +int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>>> +					unsigned long addr, pte_t *ptep)
>>> +{
>>> +	int young;
>>> +
>>> +	young = contpte_ptep_test_and_clear_young(vma, addr, ptep);
>>> +
>>> +	if (young) {
>>> +		/*
>>> +		 * See comment in __ptep_clear_flush_young(); same rationale for
>>> +		 * eliding the trailing DSB applies here.
>>> +		 */
>>> +		addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>>> +		__flush_tlb_range_nosync(vma, addr, addr + CONT_PTE_SIZE,
>>> +					 PAGE_SIZE, true, 3);
>>> +	}
>>> +
>>> +	return young;
>>> +}
>>> +EXPORT_SYMBOL(contpte_ptep_clear_flush_young);
>>> +
>>> +int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
>>> +					unsigned long addr, pte_t *ptep,
>>> +					pte_t entry, int dirty)
>>> +{
>>> +	unsigned long start_addr;
>>> +	pte_t orig_pte;
>>> +	int i;
>>> +
>>> +	/*
>>> +	 * Gather the access/dirty bits for the contiguous range. If nothing has
>>> +	 * changed, its a noop.
>>> +	 */
>>> +	orig_pte = pte_mknoncont(ptep_get(ptep));
>>> +	if (pte_val(orig_pte) == pte_val(entry))
>>> +		return 0;
>>> +
>>> +	/*
>>> +	 * We can fix up access/dirty bits without having to unfold the contig
>>> +	 * range. But if the write bit is changing, we must unfold.
>>> +	 */
>>> +	if (pte_write(orig_pte) == pte_write(entry)) {
>>> +		/*
>>> +		 * For HW access management, we technically only need to update
>>> +		 * the flag on a single pte in the range. But for SW access
>>> +		 * management, we need to update all the ptes to prevent extra
>>> +		 * faults. Avoid per-page tlb flush in __ptep_set_access_flags()
>>> +		 * and instead flush the whole range at the end.
>>> +		 */
>>> +		ptep = contpte_align_down(ptep);
>>> +		start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>>> +
>>> +		for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
>>> +			__ptep_set_access_flags(vma, addr, ptep, entry, 0);
>>> +
>>> +		if (dirty)
>>> +			__flush_tlb_range(vma, start_addr, addr,
>>> +							PAGE_SIZE, true, 3);
>>> +	} else {
>>> +		__contpte_try_unfold(vma->vm_mm, addr, ptep, orig_pte);
>>> +		__ptep_set_access_flags(vma, addr, ptep, entry, dirty);
>>> +	}
>>> +
>>> +	return 1;
>>> +}
>>> +EXPORT_SYMBOL(contpte_ptep_set_access_flags);
>>> -- 
>>> 2.25.1
>>>
> 


^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH v5 19/25] arm64/mm: Wire up PTE_CONT for user mappings
@ 2024-02-12 15:30         ` Ryan Roberts
  0 siblings, 0 replies; 240+ messages in thread
From: Ryan Roberts @ 2024-02-12 15:30 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Kefeng Wang, x86, David Hildenbrand, Catalin Marinas, Yang Shi,
	Dave Hansen, linux-mm, Andrey Ryabinin, H. Peter Anvin,
	Will Deacon, Ard Biesheuvel, Marc Zyngier, Alistair Popple,
	Barry Song, Matthew Wilcox, Aneesh Kumar K.V, Ingo Molnar,
	Zi Yan, Naveen N. Rao, John Hubbard, Nicholas Piggin,
	Borislav Petkov, Thomas Gleixner, linux-arm-kernel, linux-kernel,
	James Morse, Andrew Morton, linuxppc-dev

On 12/02/2024 12:59, Ryan Roberts wrote:
> On 12/02/2024 12:00, Mark Rutland wrote:
>> Hi Ryan,
>>
>> Overall this looks pretty good; I have a bunch of minor comments below, and a
>> bigger question on the way ptep_get_lockless() works.
> 
> OK great - thanks for the review. Let's see if I can answer them all...
> 
>>
>> On Fri, Feb 02, 2024 at 08:07:50AM +0000, Ryan Roberts wrote:
>>> With the ptep API sufficiently refactored, we can now introduce a new
>>> "contpte" API layer, which transparently manages the PTE_CONT bit for
>>> user mappings.
>>>
>>> In this initial implementation, only suitable batches of PTEs, set via
>>> set_ptes(), are mapped with the PTE_CONT bit. Any subsequent
>>> modification of individual PTEs will cause an "unfold" operation to
>>> repaint the contpte block as individual PTEs before performing the
>>> requested operation. While, a modification of a single PTE could cause
>>> the block of PTEs to which it belongs to become eligible for "folding"
>>> into a contpte entry, "folding" is not performed in this initial
>>> implementation due to the costs of checking the requirements are met.
>>> Due to this, contpte mappings will degrade back to normal pte mappings
>>> over time if/when protections are changed. This will be solved in a
>>> future patch.
>>>
>>> Since a contpte block only has a single access and dirty bit, the
>>> semantic here changes slightly; when getting a pte (e.g. ptep_get())
>>> that is part of a contpte mapping, the access and dirty information are
>>> pulled from the block (so all ptes in the block return the same
>>> access/dirty info). When changing the access/dirty info on a pte (e.g.
>>> ptep_set_access_flags()) that is part of a contpte mapping, this change
>>> will affect the whole contpte block. This is works fine in practice
>>> since we guarantee that only a single folio is mapped by a contpte
>>> block, and the core-mm tracks access/dirty information per folio.
>>>
>>> In order for the public functions, which used to be pure inline, to
>>> continue to be callable by modules, export all the contpte_* symbols
>>> that are now called by those public inline functions.
>>>
>>> The feature is enabled/disabled with the ARM64_CONTPTE Kconfig parameter
>>> at build time. It defaults to enabled as long as its dependency,
>>> TRANSPARENT_HUGEPAGE is also enabled. The core-mm depends upon
>>> TRANSPARENT_HUGEPAGE to be able to allocate large folios, so if its not
>>> enabled, then there is no chance of meeting the physical contiguity
>>> requirement for contpte mappings.
>>>
>>> Tested-by: John Hubbard <jhubbard@nvidia.com>
>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>> ---
>>>  arch/arm64/Kconfig               |   9 +
>>>  arch/arm64/include/asm/pgtable.h | 161 ++++++++++++++++++
>>>  arch/arm64/mm/Makefile           |   1 +
>>>  arch/arm64/mm/contpte.c          | 283 +++++++++++++++++++++++++++++++
>>>  4 files changed, 454 insertions(+)
>>>  create mode 100644 arch/arm64/mm/contpte.c
>>>
>>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>>> index d86d7f4758b5..1442e8ed95b6 100644
>>> --- a/arch/arm64/Kconfig
>>> +++ b/arch/arm64/Kconfig
>>> @@ -2230,6 +2230,15 @@ config UNWIND_PATCH_PAC_INTO_SCS
>>>  	select UNWIND_TABLES
>>>  	select DYNAMIC_SCS
>>>  
>>> +config ARM64_CONTPTE
>>> +	bool "Contiguous PTE mappings for user memory" if EXPERT
>>> +	depends on TRANSPARENT_HUGEPAGE
>>> +	default y
>>> +	help
>>> +	  When enabled, user mappings are configured using the PTE contiguous
>>> +	  bit, for any mappings that meet the size and alignment requirements.
>>> +	  This reduces TLB pressure and improves performance.
>>> +
>>>  endmenu # "Kernel Features"
>>>  
>>>  menu "Boot options"
>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>>> index 7dc6b68ee516..34892a95403d 100644
>>> --- a/arch/arm64/include/asm/pgtable.h
>>> +++ b/arch/arm64/include/asm/pgtable.h
>>> @@ -133,6 +133,10 @@ static inline pteval_t __phys_to_pte_val(phys_addr_t phys)
>>>   */
>>>  #define pte_valid_not_user(pte) \
>>>  	((pte_val(pte) & (PTE_VALID | PTE_USER | PTE_UXN)) == (PTE_VALID | PTE_UXN))
>>> +/*
>>> + * Returns true if the pte is valid and has the contiguous bit set.
>>> + */
>>> +#define pte_valid_cont(pte)	(pte_valid(pte) && pte_cont(pte))
>>>  /*
>>>   * Could the pte be present in the TLB? We must check mm_tlb_flush_pending
>>>   * so that we don't erroneously return false for pages that have been
>>> @@ -1135,6 +1139,161 @@ void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte);
>>>  #define vmemmap_update_pte vmemmap_update_pte
>>>  #endif
>>>  
>>> +#ifdef CONFIG_ARM64_CONTPTE
>>> +
>>> +/*
>>> + * The contpte APIs are used to transparently manage the contiguous bit in ptes
>>> + * where it is possible and makes sense to do so. The PTE_CONT bit is considered
>>> + * a private implementation detail of the public ptep API (see below).
>>> + */
>>> +extern void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>>> +				pte_t *ptep, pte_t pte);
>>> +extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
>>> +extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
>>> +extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
>>> +				pte_t *ptep, pte_t pte, unsigned int nr);
>>> +extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>>> +				unsigned long addr, pte_t *ptep);
>>> +extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>>> +				unsigned long addr, pte_t *ptep);
>>> +extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
>>> +				unsigned long addr, pte_t *ptep,
>>> +				pte_t entry, int dirty);
>>> +
>>> +static inline void contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>>> +					pte_t *ptep, pte_t pte)
>>> +{
>>> +	if (unlikely(pte_valid_cont(pte)))
>>> +		__contpte_try_unfold(mm, addr, ptep, pte);
>>> +}
>>> +
>>> +/*
>>> + * The below functions constitute the public API that arm64 presents to the
>>> + * core-mm to manipulate PTE entries within their page tables (or at least this
>>> + * is the subset of the API that arm64 needs to implement). These public
>>> + * versions will automatically and transparently apply the contiguous bit where
>>> + * it makes sense to do so. Therefore any users that are contig-aware (e.g.
>>> + * hugetlb, kernel mapper) should NOT use these APIs, but instead use the
>>> + * private versions, which are prefixed with double underscore. All of these
>>> + * APIs except for ptep_get_lockless() are expected to be called with the PTL
>>> + * held.
>>> + */
>>> +
>>> +#define ptep_get ptep_get
>>> +static inline pte_t ptep_get(pte_t *ptep)
>>> +{
>>> +	pte_t pte = __ptep_get(ptep);
>>> +
>>> +	if (likely(!pte_valid_cont(pte)))
>>> +		return pte;
>>> +
>>> +	return contpte_ptep_get(ptep, pte);
>>> +}
>>> +
>>> +#define ptep_get_lockless ptep_get_lockless
>>> +static inline pte_t ptep_get_lockless(pte_t *ptep)
>>> +{
>>> +	pte_t pte = __ptep_get(ptep);
>>> +
>>> +	if (likely(!pte_valid_cont(pte)))
>>> +		return pte;
>>> +
>>> +	return contpte_ptep_get_lockless(ptep);
>>> +}
>>> +
>>> +static inline void set_pte(pte_t *ptep, pte_t pte)
>>> +{
>>> +	/*
>>> +	 * We don't have the mm or vaddr so cannot unfold contig entries (since
>>> +	 * it requires tlb maintenance). set_pte() is not used in core code, so
>>> +	 * this should never even be called. Regardless do our best to service
>>> +	 * any call and emit a warning if there is any attempt to set a pte on
>>> +	 * top of an existing contig range.
>>> +	 */
>>> +	pte_t orig_pte = __ptep_get(ptep);
>>> +
>>> +	WARN_ON_ONCE(pte_valid_cont(orig_pte));
>>> +	__set_pte(ptep, pte_mknoncont(pte));
>>> +}
>>> +
>>> +#define set_ptes set_ptes
>>> +static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
>>> +				pte_t *ptep, pte_t pte, unsigned int nr)
>>> +{
>>> +	pte = pte_mknoncont(pte);
>>
>> Why do we have to clear the contiguous bit here? Is that for the same reason as
>> set_pte(), or do we expect callers to legitimately call this with the
>> contiguous bit set in 'pte'?
>>
>> I think you explained this to me in-person, and IIRC we don't expect callers to
>> go set the bit themselves, but since it 'leaks' out to them via __ptep_get() we
>> have to clear it here to defer the decision of whether to set/clear it when
>> modifying entries. It would be nice if we could have a description of why/when
>> we need to clear this, e.g. in the 'public API' comment block above.
> 
> Yes, I think you've got it, but just to ram home the point: The PTE_CONT bit is
> private to the architecture code and is never set directly by core code. If the
> public API ever receives a pte that happens to have the PTE_CONT bit set, it
> would be bad news if we then accidentally set that in the pgtable.
> 
> Ideally, we would just unconditionally clear the bit before a getter returns
> the pte (e.g. ptep_get(), ptep_get_lockless(), ptep_get_and_clear(), ...). That
> way, the core code is guaranteed never to see a pte with the PTE_CONT bit set
> and can therefore never accidentally pass such a pte into a setter function.
> However, there is existing functionality that relies on being able to get a pte,
> then pass it to pte_leaf_size(), an arch function that checks the PTE_CONT bit
> to determine how big the leaf is. This is used in perf_get_pgtable_size().
> 
> So to allow perf_get_pgtable_size() to continue to see the "real" page size, I
> decided to allow PTE_CONT to leak through the getters and instead
> unconditionally clear the bit when a pte is passed to any of the setters.
> 
> I'll add a (slightly less verbose) comment as you suggest.
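
(For context, the arch hook being preserved here is arm64's pte_leaf_size(),
which is roughly the one-liner below -- quoted for illustration only, it is not
part of this patch:

  #define pte_leaf_size(pte)	(pte_cont(pte) ? CONT_PTE_SIZE : PAGE_SIZE)

so if the getters stripped PTE_CONT, perf_get_pgtable_size() would always see
PAGE_SIZE for contpte mappings.)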
> 
>>
>>> +
>>> +	if (likely(nr == 1)) {
>>> +		contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>>> +		__set_ptes(mm, addr, ptep, pte, 1);
>>> +	} else {
>>> +		contpte_set_ptes(mm, addr, ptep, pte, nr);
>>> +	}
>>> +}
>>> +
>>> +static inline void pte_clear(struct mm_struct *mm,
>>> +				unsigned long addr, pte_t *ptep)
>>> +{
>>> +	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>>> +	__pte_clear(mm, addr, ptep);
>>> +}
>>> +
>>> +#define __HAVE_ARCH_PTEP_GET_AND_CLEAR
>>> +static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>>> +				unsigned long addr, pte_t *ptep)
>>> +{
>>> +	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>>> +	return __ptep_get_and_clear(mm, addr, ptep);
>>> +}
>>> +
>>> +#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
>>> +static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>>> +				unsigned long addr, pte_t *ptep)
>>> +{
>>> +	pte_t orig_pte = __ptep_get(ptep);
>>> +
>>> +	if (likely(!pte_valid_cont(orig_pte)))
>>> +		return __ptep_test_and_clear_young(vma, addr, ptep);
>>> +
>>> +	return contpte_ptep_test_and_clear_young(vma, addr, ptep);
>>> +}
>>> +
>>> +#define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
>>> +static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>>> +				unsigned long addr, pte_t *ptep)
>>> +{
>>> +	pte_t orig_pte = __ptep_get(ptep);
>>> +
>>> +	if (likely(!pte_valid_cont(orig_pte)))
>>> +		return __ptep_clear_flush_young(vma, addr, ptep);
>>> +
>>> +	return contpte_ptep_clear_flush_young(vma, addr, ptep);
>>> +}
>>> +
>>> +#define __HAVE_ARCH_PTEP_SET_WRPROTECT
>>> +static inline void ptep_set_wrprotect(struct mm_struct *mm,
>>> +				unsigned long addr, pte_t *ptep)
>>> +{
>>> +	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>>> +	__ptep_set_wrprotect(mm, addr, ptep);
>>> +}
>>> +
>>> +#define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
>>> +static inline int ptep_set_access_flags(struct vm_area_struct *vma,
>>> +				unsigned long addr, pte_t *ptep,
>>> +				pte_t entry, int dirty)
>>> +{
>>> +	pte_t orig_pte = __ptep_get(ptep);
>>> +
>>> +	entry = pte_mknoncont(entry);
>>> +
>>> +	if (likely(!pte_valid_cont(orig_pte)))
>>> +		return __ptep_set_access_flags(vma, addr, ptep, entry, dirty);
>>> +
>>> +	return contpte_ptep_set_access_flags(vma, addr, ptep, entry, dirty);
>>> +}
>>> +
>>> +#else /* CONFIG_ARM64_CONTPTE */
>>> +
>>>  #define ptep_get				__ptep_get
>>>  #define set_pte					__set_pte
>>>  #define set_ptes				__set_ptes
>>> @@ -1150,6 +1309,8 @@ void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte);
>>>  #define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
>>>  #define ptep_set_access_flags			__ptep_set_access_flags
>>>  
>>> +#endif /* CONFIG_ARM64_CONTPTE */
>>> +
>>>  #endif /* !__ASSEMBLY__ */
>>>  
>>>  #endif /* __ASM_PGTABLE_H */
>>> diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
>>> index dbd1bc95967d..60454256945b 100644
>>> --- a/arch/arm64/mm/Makefile
>>> +++ b/arch/arm64/mm/Makefile
>>> @@ -3,6 +3,7 @@ obj-y				:= dma-mapping.o extable.o fault.o init.o \
>>>  				   cache.o copypage.o flush.o \
>>>  				   ioremap.o mmap.o pgd.o mmu.o \
>>>  				   context.o proc.o pageattr.o fixmap.o
>>> +obj-$(CONFIG_ARM64_CONTPTE)	+= contpte.o
>>>  obj-$(CONFIG_HUGETLB_PAGE)	+= hugetlbpage.o
>>>  obj-$(CONFIG_PTDUMP_CORE)	+= ptdump.o
>>>  obj-$(CONFIG_PTDUMP_DEBUGFS)	+= ptdump_debugfs.o
>>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>>> new file mode 100644
>>> index 000000000000..bfb50e6b44c7
>>> --- /dev/null
>>> +++ b/arch/arm64/mm/contpte.c
>>> @@ -0,0 +1,283 @@
>>> +// SPDX-License-Identifier: GPL-2.0-only
>>> +/*
>>> + * Copyright (C) 2023 ARM Ltd.
>>> + */
>>> +
>>> +#include <linux/mm.h>
>>> +#include <linux/export.h>
>>> +#include <asm/tlbflush.h>
>>> +
>>> +static inline bool mm_is_user(struct mm_struct *mm)
>>> +{
>>> +	/*
>>> +	 * Don't attempt to apply the contig bit to kernel mappings, because
>>> +	 * dynamically adding/removing the contig bit can cause page faults.
>>> +	 * These racing faults are ok for user space, since they get serialized
>>> +	 * on the PTL. But kernel mappings can't tolerate faults.
>>> +	 */
>>> +	return mm != &init_mm;
>>> +}
>>
>> We also have the efi_mm as a non-user mm, though I don't think we manipulate
>> that while it is live, and I'm not sure if that needs any special handling.
> 
> Well we never need this function in the hot (order-0 folio) path, so I think I
> could add a check for efi_mm here with performance implication. It's probably
> safest to explicitly exclude it? What do you think?

Oops: This should have read "I think I could add a check for efi_mm here
*without* performance implication"
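
For illustration, the check could end up looking something like the below; the
mm_is_efi() helper name is invented here, and the real change may spell it
differently or open-code the comparison:

static inline bool mm_is_user(struct mm_struct *mm)
{
	/*
	 * Sketch only: kernel and EFI runtime mappings can't tolerate the
	 * transient faults that folding/unfolding the contig bit can cause,
	 * so only treat genuine user mms as foldable.
	 */
	if (unlikely(mm_is_efi(mm)))
		return false;
	return mm != &init_mm;
}

where mm_is_efi() would be something along the lines of
"IS_ENABLED(CONFIG_EFI) && mm == &efi_mm".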

> 
>>
>>> +static inline pte_t *contpte_align_down(pte_t *ptep)
>>> +{
>>> +	return (pte_t *)(ALIGN_DOWN((unsigned long)ptep >> 3, CONT_PTES) << 3);
>>
>> I think this can be:
>>
>> static inline pte_t *contpte_align_down(pte_t *ptep)
>> {
>> 	return PTR_ALIGN_DOWN(ptep, sizeof(*ptep) * CONT_PTES);
>> }
> 
> Yep - that's much less ugly - thanks!
> 
>>
>>> +
>>> +static void contpte_convert(struct mm_struct *mm, unsigned long addr,
>>> +			    pte_t *ptep, pte_t pte)
>>> +{
>>> +	struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
>>> +	unsigned long start_addr;
>>> +	pte_t *start_ptep;
>>> +	int i;
>>> +
>>> +	start_ptep = ptep = contpte_align_down(ptep);
>>> +	start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>>> +	pte = pfn_pte(ALIGN_DOWN(pte_pfn(pte), CONT_PTES), pte_pgprot(pte));
>>> +
>>> +	for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE) {
>>> +		pte_t ptent = __ptep_get_and_clear(mm, addr, ptep);
>>> +
>>> +		if (pte_dirty(ptent))
>>> +			pte = pte_mkdirty(pte);
>>> +
>>> +		if (pte_young(ptent))
>>> +			pte = pte_mkyoung(pte);
>>> +	}
>>
>> Not a big deal either way, but I wonder if it makes more sense to accumulate
>> the 'ptent' dirty/young values, then modify 'pte' once, i.e.
>>
>> 	bool dirty = false, young = false;
>>
>> 	for (...) {
>> 		pte_t ptent = __ptep_get_and_clear(mm, addr, ptep);
>> 		dirty |= pte_dirty(ptent);
>> 		young |= pte_young(ptent);
>> 	}
>>
>> 	if (dirty)
>> 		pte_mkdirty(pte);
>> 	if (young)
>> 		pte_mkyoung(pte);
>>
>> I suspect that might generate slightly better code, but I'm also happy with the
>> current form if people thnk that's more legible (I have no strong feelings
>> either way).
> 
> I kept it this way, because it's the same pattern used in arm64's hugetlbpage.c.
> We also had the same comment against David's batching patches recently, and he
> opted to stick with the former version:
> 
> https://lore.kernel.org/linux-mm/d83309fa-4daa-430f-ae52-4e72162bca9a@redhat.com/
> 
> So I'm inclined to leave it as is, since you're not insisting :)
> 
>>
>>> +
>>> +	__flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
>>> +
>>> +	__set_ptes(mm, start_addr, start_ptep, pte, CONT_PTES);
>>> +}
>>> +
>>> +void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>>> +			pte_t *ptep, pte_t pte)
>>> +{
>>> +	/*
>>> +	 * We have already checked that the ptes are contiguous in
>>> +	 * contpte_try_unfold(), so just check that the mm is user space.
>>> +	 */
>>> +
>>> +	if (!mm_is_user(mm))
>>> +		return;
>>
>> Nit: normally we don't put a line gap between a comment block and the
>> associated block of code.
> 
> ACK, I'll fix in next version.

Just to clarify this, I've got a few instances in this file where I have a
comment that applies to the function as a whole, and in those cases the comment
is the first thing in the body of the function, and there's a blank line between
the end of the comment and the first statement. This is intended to be one of
those comments, although since the function is pretty small, I can see how it
also could look like it applies to the statements immediately following it too.

What is the normal policy for such comments? I'd rather leave this alone since
it aligns with how all the others are done in the file. Or should I just remove
the blank line for all instances?


> 
>>
>>> +
>>> +	pte = pte_mknoncont(pte);
>>> +	contpte_convert(mm, addr, ptep, pte);
>>> +}
>>> +EXPORT_SYMBOL(__contpte_try_unfold);
>>> +
>>> +pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>>> +{
>>> +	/*
>>> +	 * Gather access/dirty bits, which may be populated in any of the ptes
>>> +	 * of the contig range. We are guarranteed to be holding the PTL, so any
>>> +	 * contiguous range cannot be unfolded or otherwise modified under our
>>> +	 * feet.
>>> +	 */
>>
>> Nit: s/guarranteed/guaranteed/
> 
> ACK, I'll fix in next version.
> 
>>
>>> +
>>> +	pte_t pte;
>>> +	int i;
>>> +
>>> +	ptep = contpte_align_down(ptep);
>>> +
>>> +	for (i = 0; i < CONT_PTES; i++, ptep++) {
>>> +		pte = __ptep_get(ptep);
>>> +
>>> +		if (pte_dirty(pte))
>>> +			orig_pte = pte_mkdirty(orig_pte);
>>> +
>>> +		if (pte_young(pte))
>>> +			orig_pte = pte_mkyoung(orig_pte);
>>> +	}
>>> +
>>> +	return orig_pte;
>>> +}
>>> +EXPORT_SYMBOL(contpte_ptep_get);
>>> +
>>> +pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
>>> +{
>>> +	/*
>>> +	 * Gather access/dirty bits, which may be populated in any of the ptes
>>> +	 * of the contig range. We may not be holding the PTL, so any contiguous
>>> +	 * range may be unfolded/modified/refolded under our feet. Therefore we
>>> +	 * ensure we read a _consistent_ contpte range by checking that all ptes
>>> +	 * in the range are valid and have CONT_PTE set, that all pfns are
>>> +	 * contiguous and that all pgprots are the same (ignoring access/dirty).
>>> +	 * If we find a pte that is not consistent, then we must be racing with
>>> +	 * an update so start again. If the target pte does not have CONT_PTE
>>> +	 * set then that is considered consistent on its own because it is not
>>> +	 * part of a contpte range.
>>> +	 */
>>> +
>>> +	pgprot_t orig_prot;
>>> +	unsigned long pfn;
>>> +	pte_t orig_pte;
>>> +	pgprot_t prot;
>>> +	pte_t *ptep;
>>> +	pte_t pte;
>>> +	int i;
>>> +
>>> +retry:
>>> +	orig_pte = __ptep_get(orig_ptep);
>>> +
>>> +	if (!pte_valid_cont(orig_pte))
>>> +		return orig_pte;
>>> +
>>> +	orig_prot = pte_pgprot(pte_mkold(pte_mkclean(orig_pte)));
>>> +	ptep = contpte_align_down(orig_ptep);
>>> +	pfn = pte_pfn(orig_pte) - (orig_ptep - ptep);
>>> +
>>> +	for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
>>> +		pte = __ptep_get(ptep);
>>> +		prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
>>> +
>>> +		if (!pte_valid_cont(pte) ||
>>> +		   pte_pfn(pte) != pfn ||
>>> +		   pgprot_val(prot) != pgprot_val(orig_prot))
>>> +			goto retry;
>>> +
>>> +		if (pte_dirty(pte))
>>> +			orig_pte = pte_mkdirty(orig_pte);
>>> +
>>> +		if (pte_young(pte))
>>> +			orig_pte = pte_mkyoung(orig_pte);
>>> +	}
>>> +
>>> +	return orig_pte;
>>> +}
>>> +EXPORT_SYMBOL(contpte_ptep_get_lockless);
>>
>> I'm struggling to convince myself that this is safe in general, as it really
>> depends on how the caller will use this value. Which caller(s) actually care
>> about the access/dirty bits, given those could change at any time anyway?
> 
> I think your points below are valid, and agree we should try to make this work
> without needing access/dirty if possible. But can you elaborate on why you don't
> think it's safe?
> 
>>
>> I took a quick scan, and AFAICT:
> 
> Thanks for enumerating these; Saves me from having to refresh my memory :)
> 
>>
>> * For perf_get_pgtable_size(), we only care about whether the entry is valid
>>   and has the contig bit set. We could clean that up with a new interface, e.g.
>>   something like a new ptep_get_size_lockless().
>>
>> * For gup_pte_range(), I'm not sure we actually need the access/dirty bits when
>>   we look at the pte to start with, since we only care whether we can logically
>>   write to the page at that point.
>>
>>   I see that we later follow up with:
>>
>>     with pte_val(pte) != pte_val(ptep_get(ptep)))
>>
>>   ... is that why we need ptep_get_lockless() to accumulate the access/dirty
>>   bits? So that shape of lockless-try...locked-compare sequence works?
>>
>> * For huge_pte_alloc(), arm64 doesn't select CONFIG_ARCH_WANT_GENERAL_HUGETLB,
>>   so this doesn't seem to matter.
>>
>> * For __collapse_huge_page_swapin(), we only care if the pte is a swap pte,
>>   which means the pte isn't valid, and we'll return the orig_pte as-is anyway.
>>
>> * For pte_range_none() the access/dirty bits don't matter.
>>
>> * For handle_pte_fault() I think we have the same shape of
>>   lockless-try...locked-compare sequence as for gup_pte_range(), where we don't
>>   care about the access/dirty bits before we reach the locked compare step.
>>
>> * For ptdump_pte_entry() I think it's arguable that we should continue to
>>   report the access/dirty bits separately for each PTE, as we have done until
>>   now, to give an accurate representation of the contents of the translation
>>   tables.
>>
>> * For swap_vma_readahead() and unuse_pte_range() we only care if the PTE is a
>>   swap entry, the access/dirty bits don't matter.
>>
>> So AFAICT this only really matters for gup_pte_range() and handle_pte_fault(),
>> and IIUC that's only so that the locklessly-loaded pte value can be compared
>> with a subsequently locked-loaded entry (for which the access/dirty bits will
>> be accumulated). Have I understood that correctly?
> 
> Yes, I agree with what you are saying. My approach was to try to implement the
> existing APIs accurately though, the argument being that it reduces the chances
> of getting it wrong. But if you think the implementation is unsafe, then I guess
> it blows that out of the water...
> 
>>
>> If so, I wonder if we could instead do that comparison modulo the access/dirty
>> bits, 
> 
> I think that would work - but will need to think a bit more on it.
> 
>> and leave ptep_get_lockless() only reading a single entry?
> 
> I think we will need to do something a bit less fragile. ptep_get() does collect
> the access/dirty bits so it's confusing if ptep_get_lockless() doesn't IMHO. So
> we will likely want to rename the function and make its documentation explicit
> that it does not return those bits.
> 
> ptep_get_lockless_noyoungdirty()? yuk... Any ideas?
> 
> Of course if I could convince you the current implementation is safe, I might be
> able to sidestep this optimization until a later date?
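
(To make the "comparison modulo the access/dirty bits" idea concrete, it could
be something along these lines -- the helper name is invented for illustration
and is not part of this series:

static inline bool pte_same_ignoring_young_dirty(pte_t a, pte_t b)
{
	/* Compare two ptes with the access/dirty bits masked off. */
	return pte_val(pte_mkold(pte_mkclean(a))) ==
	       pte_val(pte_mkold(pte_mkclean(b)));
}

so the gup/fault paths that only want to detect a concurrent change could use
that instead of a raw pte_val() comparison.)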
> 
> Thanks,
> Ryan
> 
> 
>>
>> Thanks,
>> Mark.
>>
>>> +void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
>>> +					pte_t *ptep, pte_t pte, unsigned int nr)
>>> +{
>>> +	unsigned long next;
>>> +	unsigned long end;
>>> +	unsigned long pfn;
>>> +	pgprot_t prot;
>>> +
>>> +	/*
>>> +	 * The set_ptes() spec guarantees that when nr > 1, the initial state of
>>> +	 * all ptes is not-present. Therefore we never need to unfold or
>>> +	 * otherwise invalidate a range before we set the new ptes.
>>> +	 * contpte_set_ptes() should never be called for nr < 2.
>>> +	 */
>>> +	VM_WARN_ON(nr == 1);
>>> +
>>> +	if (!mm_is_user(mm))
>>> +		return __set_ptes(mm, addr, ptep, pte, nr);
>>> +
>>> +	end = addr + (nr << PAGE_SHIFT);
>>> +	pfn = pte_pfn(pte);
>>> +	prot = pte_pgprot(pte);
>>> +
>>> +	do {
>>> +		next = pte_cont_addr_end(addr, end);
>>> +		nr = (next - addr) >> PAGE_SHIFT;
>>> +		pte = pfn_pte(pfn, prot);
>>> +
>>> +		if (((addr | next | (pfn << PAGE_SHIFT)) & ~CONT_PTE_MASK) == 0)
>>> +			pte = pte_mkcont(pte);
>>> +		else
>>> +			pte = pte_mknoncont(pte);
>>> +
>>> +		__set_ptes(mm, addr, ptep, pte, nr);
>>> +
>>> +		addr = next;
>>> +		ptep += nr;
>>> +		pfn += nr;
>>> +
>>> +	} while (addr != end);
>>> +}
>>> +EXPORT_SYMBOL(contpte_set_ptes);
>>> +
>>> +int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>>> +					unsigned long addr, pte_t *ptep)
>>> +{
>>> +	/*
>>> +	 * ptep_clear_flush_young() technically requires us to clear the access
>>> +	 * flag for a _single_ pte. However, the core-mm code actually tracks
>>> +	 * access/dirty per folio, not per page. And since we only create a
>>> +	 * contig range when the range is covered by a single folio, we can get
>>> +	 * away with clearing young for the whole contig range here, so we avoid
>>> +	 * having to unfold.
>>> +	 */
>>> +
>>> +	int young = 0;
>>> +	int i;
>>> +
>>> +	ptep = contpte_align_down(ptep);
>>> +	addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>>> +
>>> +	for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
>>> +		young |= __ptep_test_and_clear_young(vma, addr, ptep);
>>> +
>>> +	return young;
>>> +}
>>> +EXPORT_SYMBOL(contpte_ptep_test_and_clear_young);
>>> +
>>> +int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>>> +					unsigned long addr, pte_t *ptep)
>>> +{
>>> +	int young;
>>> +
>>> +	young = contpte_ptep_test_and_clear_young(vma, addr, ptep);
>>> +
>>> +	if (young) {
>>> +		/*
>>> +		 * See comment in __ptep_clear_flush_young(); same rationale for
>>> +		 * eliding the trailing DSB applies here.
>>> +		 */
>>> +		addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>>> +		__flush_tlb_range_nosync(vma, addr, addr + CONT_PTE_SIZE,
>>> +					 PAGE_SIZE, true, 3);
>>> +	}
>>> +
>>> +	return young;
>>> +}
>>> +EXPORT_SYMBOL(contpte_ptep_clear_flush_young);
>>> +
>>> +int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
>>> +					unsigned long addr, pte_t *ptep,
>>> +					pte_t entry, int dirty)
>>> +{
>>> +	unsigned long start_addr;
>>> +	pte_t orig_pte;
>>> +	int i;
>>> +
>>> +	/*
>>> +	 * Gather the access/dirty bits for the contiguous range. If nothing has
>>> +	 * changed, its a noop.
>>> +	 */
>>> +	orig_pte = pte_mknoncont(ptep_get(ptep));
>>> +	if (pte_val(orig_pte) == pte_val(entry))
>>> +		return 0;
>>> +
>>> +	/*
>>> +	 * We can fix up access/dirty bits without having to unfold the contig
>>> +	 * range. But if the write bit is changing, we must unfold.
>>> +	 */
>>> +	if (pte_write(orig_pte) == pte_write(entry)) {
>>> +		/*
>>> +		 * For HW access management, we technically only need to update
>>> +		 * the flag on a single pte in the range. But for SW access
>>> +		 * management, we need to update all the ptes to prevent extra
>>> +		 * faults. Avoid per-page tlb flush in __ptep_set_access_flags()
>>> +		 * and instead flush the whole range at the end.
>>> +		 */
>>> +		ptep = contpte_align_down(ptep);
>>> +		start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>>> +
>>> +		for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
>>> +			__ptep_set_access_flags(vma, addr, ptep, entry, 0);
>>> +
>>> +		if (dirty)
>>> +			__flush_tlb_range(vma, start_addr, addr,
>>> +							PAGE_SIZE, true, 3);
>>> +	} else {
>>> +		__contpte_try_unfold(vma->vm_mm, addr, ptep, orig_pte);
>>> +		__ptep_set_access_flags(vma, addr, ptep, entry, dirty);
>>> +	}
>>> +
>>> +	return 1;
>>> +}
>>> +EXPORT_SYMBOL(contpte_ptep_set_access_flags);
>>> -- 
>>> 2.25.1
>>>
> 


^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH v5 19/25] arm64/mm: Wire up PTE_CONT for user mappings
  2024-02-12 15:26             ` David Hildenbrand
  (?)
@ 2024-02-12 15:34               ` Ryan Roberts
  -1 siblings, 0 replies; 240+ messages in thread
From: Ryan Roberts @ 2024-02-12 15:34 UTC (permalink / raw)
  To: David Hildenbrand, Mark Rutland
  Cc: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	Kefeng Wang, John Hubbard, Zi Yan, Barry Song, Alistair Popple,
	Yang Shi, Nicholas Piggin, Christophe Leroy, Aneesh Kumar K.V,
	Naveen N. Rao, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, linux-arm-kernel, x86, linuxppc-dev,
	linux-mm, linux-kernel

On 12/02/2024 15:26, David Hildenbrand wrote:
> On 12.02.24 15:45, Ryan Roberts wrote:
>> On 12/02/2024 13:54, David Hildenbrand wrote:
>>>>> If so, I wonder if we could instead do that comparison modulo the access/dirty
>>>>> bits,
>>>>
>>>> I think that would work - but will need to think a bit more on it.
>>>>
>>>>> and leave ptep_get_lockless() only reading a single entry?
>>>>
>>>> I think we will need to do something a bit less fragile. ptep_get() does
>>>> collect
>>>> the access/dirty bits so its confusing if ptep_get_lockless() doesn't IMHO. So
>>>> we will likely want to rename the function and make its documentation explicit
>>>> that it does not return those bits.
>>>>
>>>> ptep_get_lockless_noyoungdirty()? yuk... Any ideas?
>>>>
>>>> Of course if I could convince you the current implementation is safe, I
>>>> might be
>>>> able to sidestep this optimization until a later date?
>>>
>>> As discussed (and pointed out above), there might be quite some callsites where
>>> we don't really care about uptodate accessed/dirty bits -- where ptep_get() is
>>> used nowadays.
>>>
>>> One way to approach that I had in mind was having an explicit interface:
>>>
>>> ptep_get()
>>> ptep_get_uptodate()
>>> ptep_get_lockless()
>>> ptep_get_lockless_uptodate()
>>
>> Yes, I like the direction of this. I guess we anticipate that call sites
>> requiring the "_uptodate" variant will be the minority so it makes sense to use
>> the current names for the "_not_uptodate" variants? But to do a slow migration,
>> it might be better/safer to have the weaker variant use the new name - that
>> would allow us to downgrade one at a time?
> 
> Yes, I was primarily struggling with names. Likely it makes sense to either have
> two completely new function names, or use the new name only for the "faster but
> less precise" variant.
> 
>>
>>>
>>> Especially the last one might not be needed.
>> I've done a scan through the code and agree with Mark's original conclusions.
>> Additionally, huge_pte_alloc() (which isn't used for arm64) doesn't rely on
>> access/dirty info. So I think I could migrate everything to the weaker variant
>> fairly easily.
>>
>>>
>>> Futher, "uptodate" might not be the best choice because of PageUptodate() and
>>> friends. But it's better than "youngdirty"/"noyoungdirty" IMHO.
>>
>> Certainly agree with "noyoungdirty" being a horrible name. How about "_sync" /
>> "_nosync"?
> 
> I could live with
> 
> ptep_get_sync()
> ptep_get_nosync()
> 
> with proper documentation :)

but could you live with:

ptep_get()
ptep_get_nosync()
ptep_get_lockless_nosync()

?

So leave the "slower, more precise" version with the existing name.

> 
> I don't think we use "_sync" / "_nosync" in the context of pte operations yet.
> 
> Well, there seems to be "__arm_v7s_pte_sync" in iommu code, but at least in core
> code nothing jumped at me.
> 


^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH v5 22/25] mm: Add pte_batch_hint() to reduce scanning in folio_pte_batch()
  2024-02-12 13:43     ` David Hildenbrand
  (?)
@ 2024-02-12 15:47       ` Ryan Roberts
  -1 siblings, 0 replies; 240+ messages in thread
From: Ryan Roberts @ 2024-02-12 15:47 UTC (permalink / raw)
  To: David Hildenbrand, Catalin Marinas, Will Deacon, Ard Biesheuvel,
	Marc Zyngier, James Morse, Andrey Ryabinin, Andrew Morton,
	Matthew Wilcox, Mark Rutland, Kefeng Wang, John Hubbard, Zi Yan,
	Barry Song, Alistair Popple, Yang Shi, Nicholas Piggin,
	Christophe Leroy, Aneesh Kumar K.V, Naveen N. Rao,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin
  Cc: linux-arm-kernel, x86, linuxppc-dev, linux-mm, linux-kernel

On 12/02/2024 13:43, David Hildenbrand wrote:
> On 02.02.24 09:07, Ryan Roberts wrote:
>> Some architectures (e.g. arm64) can tell from looking at a pte, if some
>> follow-on ptes also map contiguous physical memory with the same pgprot.
>> (for arm64, these are contpte mappings).
>>
>> Take advantage of this knowledge to optimize folio_pte_batch() so that
>> it can skip these ptes when scanning to create a batch. By default, if
>> an arch does not opt-in, folio_pte_batch() returns a compile-time 1, so
>> the changes are optimized out and the behaviour is as before.
>>
>> arm64 will opt-in to providing this hint in the next patch, which will
>> greatly reduce the cost of ptep_get() when scanning a range of contptes.
>>
>> Tested-by: John Hubbard <jhubbard@nvidia.com>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>   include/linux/pgtable.h | 18 ++++++++++++++++++
>>   mm/memory.c             | 20 +++++++++++++-------
>>   2 files changed, 31 insertions(+), 7 deletions(-)
>>
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index 50f32cccbd92..cba31f177d27 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -212,6 +212,24 @@ static inline int pmd_dirty(pmd_t pmd)
>>   #define arch_flush_lazy_mmu_mode()    do {} while (0)
>>   #endif
>>   +#ifndef pte_batch_hint
>> +/**
>> + * pte_batch_hint - Number of pages that can be added to batch without scanning.
>> + * @ptep: Page table pointer for the entry.
>> + * @pte: Page table entry.
>> + *
>> + * Some architectures know that a set of contiguous ptes all map the same
>> + * contiguous memory with the same permissions. In this case, it can provide a
>> + * hint to aid pte batching without the core code needing to scan every pte.
> 
> I think we might want to document here the expectation regarding
> dirty/accessed bits. folio_pte_batch() will ignore dirty bits only with
> FPB_IGNORE_DIRTY. But especially for arm64, it makes sense to ignore them
> always when batching, because the dirty bit may target any pte part of the
> cont-pte group either way.
> 
> Maybe something like:
> 
> "
> An architecture implementation may only ignore the PTE accessed and dirty bits.
> Further, it may only ignore the dirty bit if that bit is already not
> maintained with precision per PTE inside the hinted batch, and ptep_get()
> would already have to collect it from various PTEs.
> "

I'm proposing to simplify this to:

"
An architecture implementation may ignore the PTE accessed state. Further, the
dirty state must apply atomically to all the PTEs described by the hint.
"

Which I think more accurately describes the requirement. Shout if you disagree.
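
For reference, the arm64 opt-in in the next patch is along these lines
(reproduced from memory here, so treat it as a sketch rather than the exact
hunk):

#define pte_batch_hint pte_batch_hint
static inline unsigned int pte_batch_hint(pte_t *ptep, pte_t pte)
{
	if (!pte_valid_cont(pte))
		return 1;

	/* Entries remaining in this contpte block, based on ptep's offset. */
	return CONT_PTES - (((unsigned long)ptep >> 3) & (CONT_PTES - 1));
}

Since folio_pte_batch() then skips the rest of the block without reading it,
any dirty state that wasn't uniform across the block would be missed -- hence
the wording above.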

> 
> I think there are some more details to it, but I'm hoping something along
> the lines above is sufficient.
> 
> 
>> +
>>   #ifndef pte_advance_pfn
>>   static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
>>   {
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 65fbe4f886c1..902665b27702 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -988,16 +988,21 @@ static inline int folio_pte_batch(struct folio *folio,
>> unsigned long addr,
>>   {
>>       unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio);
>>       const pte_t *end_ptep = start_ptep + max_nr;
>> -    pte_t expected_pte = __pte_batch_clear_ignored(pte_advance_pfn(pte, 1),
>> flags);
>> -    pte_t *ptep = start_ptep + 1;
>> +    pte_t expected_pte = __pte_batch_clear_ignored(pte, flags);
>> +    pte_t *ptep = start_ptep;
>>       bool writable;
>> +    int nr;
>>         if (any_writable)
>>           *any_writable = false;
>>         VM_WARN_ON_FOLIO(!pte_present(pte), folio);
>>   -    while (ptep != end_ptep) {
>> +    nr = pte_batch_hint(ptep, pte);
>> +    expected_pte = pte_advance_pfn(expected_pte, nr);
>> +    ptep += nr;
>> +
> 
> *Maybe* it's easier to get when initializing expected_pte+ptep only once.
> 
> Like:
> 
> [...]
> pte_t expected_pte, *ptep;
> [...]
> 
> nr = pte_batch_hint(start_ptep, pte);
> expected_pte = __pte_batch_clear_ignored(pte_advance_pfn(pte, nr), flags);
> ptep = start_ptep + nr;
> 
>> +    while (ptep < end_ptep) {
>>           pte = ptep_get(ptep);
>>           if (any_writable)
>>               writable = !!pte_write(pte);
>> @@ -1011,17 +1016,18 @@ static inline int folio_pte_batch(struct folio *folio,
>> unsigned long addr,
>>            * corner cases the next PFN might fall into a different
>>            * folio.
>>            */
>> -        if (pte_pfn(pte) == folio_end_pfn)
>> +        if (pte_pfn(pte) >= folio_end_pfn)
>>               break;
>>             if (any_writable)
>>               *any_writable |= writable;
>>   -        expected_pte = pte_advance_pfn(expected_pte, 1);
>> -        ptep++;
>> +        nr = pte_batch_hint(ptep, pte);
>> +        expected_pte = pte_advance_pfn(expected_pte, nr);
>> +        ptep += nr;
>>       }
>>   -    return ptep - start_ptep;
>> +    return min(ptep - start_ptep, max_nr);
>>   }
> 
> Acked-by: David Hildenbrand <david@redhat.com>
> 


^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH v5 19/25] arm64/mm: Wire up PTE_CONT for user mappings
  2024-02-12 15:34               ` Ryan Roberts
  (?)
@ 2024-02-12 16:24                 ` David Hildenbrand
  -1 siblings, 0 replies; 240+ messages in thread
From: David Hildenbrand @ 2024-02-12 16:24 UTC (permalink / raw)
  To: Ryan Roberts, Mark Rutland
  Cc: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	Kefeng Wang, John Hubbard, Zi Yan, Barry Song, Alistair Popple,
	Yang Shi, Nicholas Piggin, Christophe Leroy, Aneesh Kumar K.V,
	Naveen N. Rao, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, linux-arm-kernel, x86, linuxppc-dev,
	linux-mm, linux-kernel

On 12.02.24 16:34, Ryan Roberts wrote:
> On 12/02/2024 15:26, David Hildenbrand wrote:
>> On 12.02.24 15:45, Ryan Roberts wrote:
>>> On 12/02/2024 13:54, David Hildenbrand wrote:
>>>>>> If so, I wonder if we could instead do that comparison modulo the access/dirty
>>>>>> bits,
>>>>>
>>>>> I think that would work - but will need to think a bit more on it.
>>>>>
>>>>>> and leave ptep_get_lockless() only reading a single entry?
>>>>>
>>>>> I think we will need to do something a bit less fragile. ptep_get() does
>>>>> collect
>>>>> the access/dirty bits so its confusing if ptep_get_lockless() doesn't IMHO. So
>>>>> we will likely want to rename the function and make its documentation explicit
>>>>> that it does not return those bits.
>>>>>
>>>>> ptep_get_lockless_noyoungdirty()? yuk... Any ideas?
>>>>>
>>>>> Of course if I could convince you the current implementation is safe, I
>>>>> might be
>>>>> able to sidestep this optimization until a later date?
>>>>
>>>> As discussed (and pointed out above), there might be quite some callsites where
>>>> we don't really care about uptodate accessed/dirty bits -- where ptep_get() is
>>>> used nowadays.
>>>>
>>>> One way to approach that I had in mind was having an explicit interface:
>>>>
>>>> ptep_get()
>>>> ptep_get_uptodate()
>>>> ptep_get_lockless()
>>>> ptep_get_lockless_uptodate()
>>>
>>> Yes, I like the direction of this. I guess we anticipate that call sites
>>> requiring the "_uptodate" variant will be the minority so it makes sense to use
>>> the current names for the "_not_uptodate" variants? But to do a slow migration,
>>> it might be better/safer to have the weaker variant use the new name - that
>>> would allow us to downgrade one at a time?
>>
>> Yes, I was primarily struggling with names. Likely it makes sense to either have
>> two completely new function names, or use the new name only for the "faster but
>> less precise" variant.
>>
>>>
>>>>
>>>> Especially the last one might not be needed.
>>> I've done a scan through the code and agree with Mark's original conclusions.
>>> Additionally, huge_pte_alloc() (which isn't used for arm64) doesn't rely on
>>> access/dirty info. So I think I could migrate everything to the weaker variant
>>> fairly easily.
>>>
>>>>
>>>> Further, "uptodate" might not be the best choice because of PageUptodate() and
>>>> friends. But it's better than "youngdirty"/"noyoungdirty" IMHO.
>>>
>>> Certainly agree with "noyoungdirty" being a horrible name. How about "_sync" /
>>> "_nosync"?
>>
>> I could live with
>>
>> ptep_get_sync()
>> ptep_get_nosync()
>>
>> with proper documentation :)
> 
> but could you live with:
> 
> ptep_get()
> ptep_get_nosync()
> ptep_get_lockless_nosync()
> 
> ?
> 
> So leave the "slower, more precise" version with the existing name.

Sure.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH v5 19/25] arm64/mm: Wire up PTE_CONT for user mappings
@ 2024-02-12 16:24                 ` David Hildenbrand
  0 siblings, 0 replies; 240+ messages in thread
From: David Hildenbrand @ 2024-02-12 16:24 UTC (permalink / raw)
  To: Ryan Roberts, Mark Rutland
  Cc: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	Kefeng Wang, John Hubbard, Zi Yan, Barry Song, Alistair Popple,
	Yang Shi, Nicholas Piggin, Christophe Leroy, Aneesh Kumar K.V,
	Naveen N. Rao, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, linux-arm-kernel, x86, linuxppc-dev,
	linux-mm, linux-kernel

On 12.02.24 16:34, Ryan Roberts wrote:
> On 12/02/2024 15:26, David Hildenbrand wrote:
>> On 12.02.24 15:45, Ryan Roberts wrote:
>>> On 12/02/2024 13:54, David Hildenbrand wrote:
>>>>>> If so, I wonder if we could instead do that comparison modulo the access/dirty
>>>>>> bits,
>>>>>
>>>>> I think that would work - but will need to think a bit more on it.
>>>>>
>>>>>> and leave ptep_get_lockless() only reading a single entry?
>>>>>
>>>>> I think we will need to do something a bit less fragile. ptep_get() does
>>>>> collect
>>>>> the access/dirty bits so its confusing if ptep_get_lockless() doesn't IMHO. So
>>>>> we will likely want to rename the function and make its documentation explicit
>>>>> that it does not return those bits.
>>>>>
>>>>> ptep_get_lockless_noyoungdirty()? yuk... Any ideas?
>>>>>
>>>>> Of course if I could convince you the current implementation is safe, I
>>>>> might be
>>>>> able to sidestep this optimization until a later date?
>>>>
>>>> As discussed (and pointed out abive), there might be quite some callsites where
>>>> we don't really care about uptodate accessed/dirty bits -- where ptep_get() is
>>>> used nowadays.
>>>>
>>>> One way to approach that I had in mind was having an explicit interface:
>>>>
>>>> ptep_get()
>>>> ptep_get_uptodate()
>>>> ptep_get_lockless()
>>>> ptep_get_lockless_uptodate()
>>>
>>> Yes, I like the direction of this. I guess we anticipate that call sites
>>> requiring the "_uptodate" variant will be the minority so it makes sense to use
>>> the current names for the "_not_uptodate" variants? But to do a slow migration,
>>> it might be better/safer to have the weaker variant use the new name - that
>>> would allow us to downgrade one at a time?
>>
>> Yes, I was primarily struggling with names. Likely it makes sense to either have
>> two completely new function names, or use the new name only for the "faster but
>> less precise" variant.
>>
>>>
>>>>
>>>> Especially the last one might not be needed.
>>> I've done a scan through the code and agree with Mark's original conclusions.
>>> Additionally, huge_pte_alloc() (which isn't used for arm64) doesn't rely on
>>> access/dirty info. So I think I could migrate everything to the weaker variant
>>> fairly easily.
>>>
>>>>
>>>> Futher, "uptodate" might not be the best choice because of PageUptodate() and
>>>> friends. But it's better than "youngdirty"/"noyoungdirty" IMHO.
>>>
>>> Certainly agree with "noyoungdirty" being a horrible name. How about "_sync" /
>>> "_nosync"?
>>
>> I could live with
>>
>> ptep_get_sync()
>> ptep_get_nosync()
>>
>> with proper documentation :)
> 
> but could you live with:
> 
> ptep_get()
> ptep_get_nosync()
> ptep_get_lockless_nosync()
> 
> ?
> 
> So leave the "slower, more precise" version with the existing name.

Sure.

-- 
Cheers,

David / dhildenb


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH v5 22/25] mm: Add pte_batch_hint() to reduce scanning in folio_pte_batch()
  2024-02-12 15:47       ` Ryan Roberts
  (?)
@ 2024-02-12 16:27         ` David Hildenbrand
  -1 siblings, 0 replies; 240+ messages in thread
From: David Hildenbrand @ 2024-02-12 16:27 UTC (permalink / raw)
  To: Ryan Roberts, Catalin Marinas, Will Deacon, Ard Biesheuvel,
	Marc Zyngier, James Morse, Andrey Ryabinin, Andrew Morton,
	Matthew Wilcox, Mark Rutland, Kefeng Wang, John Hubbard, Zi Yan,
	Barry Song, Alistair Popple, Yang Shi, Nicholas Piggin,
	Christophe Leroy, Aneesh Kumar K.V, Naveen N. Rao,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin
  Cc: linux-arm-kernel, x86, linuxppc-dev, linux-mm, linux-kernel

On 12.02.24 16:47, Ryan Roberts wrote:
> On 12/02/2024 13:43, David Hildenbrand wrote:
>> On 02.02.24 09:07, Ryan Roberts wrote:
>>> Some architectures (e.g. arm64) can tell from looking at a pte, if some
>>> follow-on ptes also map contiguous physical memory with the same pgprot.
>>> (for arm64, these are contpte mappings).
>>>
>>> Take advantage of this knowledge to optimize folio_pte_batch() so that
>>> it can skip these ptes when scanning to create a batch. By default, if
>>> an arch does not opt-in, folio_pte_batch() returns a compile-time 1, so
>>> the changes are optimized out and the behaviour is as before.
>>>
>>> arm64 will opt-in to providing this hint in the next patch, which will
>>> greatly reduce the cost of ptep_get() when scanning a range of contptes.
>>>
>>> Tested-by: John Hubbard <jhubbard@nvidia.com>
>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>> ---
>>>    include/linux/pgtable.h | 18 ++++++++++++++++++
>>>    mm/memory.c             | 20 +++++++++++++-------
>>>    2 files changed, 31 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>> index 50f32cccbd92..cba31f177d27 100644
>>> --- a/include/linux/pgtable.h
>>> +++ b/include/linux/pgtable.h
>>> @@ -212,6 +212,24 @@ static inline int pmd_dirty(pmd_t pmd)
>>>    #define arch_flush_lazy_mmu_mode()    do {} while (0)
>>>    #endif
>>>    +#ifndef pte_batch_hint
>>> +/**
>>> + * pte_batch_hint - Number of pages that can be added to batch without scanning.
>>> + * @ptep: Page table pointer for the entry.
>>> + * @pte: Page table entry.
>>> + *
>>> + * Some architectures know that a set of contiguous ptes all map the same
>>> + * contiguous memory with the same permissions. In this case, it can provide a
>>> + * hint to aid pte batching without the core code needing to scan every pte.
>>
>> I think we might want to document here the expectation regarding
>> dirty/accessed bits. folio_pte_batch() will ignore dirty bits only with
>> FPB_IGNORE_DIRTY. But especially for arm64, it makes sense to ignore them
>> always when batching, because the dirty bit may target any pte part of the
>> cont-pte group either way.
>>
>> Maybe something like:
>>
>> "
>> An architecture implementation may only ignore the PTE accessed and dirty bits.
>> Further, it may only ignore the dirty bit if that bit is already not
>> maintained with precision per PTE inside the hinted batch, and ptep_get()
>> would already have to collect it from various PTEs.
>> "
> 
> I'm proposing to simplify this to:
> 
> "
> An architecture implementation may ignore the PTE accessed state. Further, the
> dirty state must apply atomically to all the PTEs described by the hint.
> "
> 
> Which I think more accurately describes the requirement. Shout if you disagree.

I'm not 100% sure if the "must apply atomically" is clear without all of 
the cont-pte details and ptep_get(). But I fail to describe it in a 
better way.

It's all better compared to what we had before, so LGTM :)
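
To make the interface concrete, here is a rough sketch of the generic fallback
and of a caller-side loop that consumes the hint. This is illustrative only:
batch_example() is a hypothetical function, the unsigned int return type is an
assumption, and the real folio_pte_batch() integration differs in detail -- the
point is just that a hint of N lets the scan skip N ptes the arch already knows
are contiguous with the same pgprot. pte_advance_pfn() is the helper introduced
earlier in this series.

/* Generic fallback: no arch knowledge, so no batching help (sketch). */
#ifndef pte_batch_hint
static inline unsigned int pte_batch_hint(pte_t *ptep, pte_t pte)
{
	return 1;
}
#endif

/*
 * Hypothetical caller: count how many consecutive ptes match expected_pte,
 * ignoring access/dirty state as discussed above.
 */
static inline int batch_example(pte_t *ptep, pte_t expected_pte, int max_nr)
{
	int nr = 0;

	while (nr < max_nr) {
		pte_t pte = ptep_get(ptep + nr);
		unsigned int hint;

		if (!pte_same(pte_mkold(pte_mkclean(pte)),
			      pte_mkold(pte_mkclean(expected_pte))))
			break;

		/* Skip the ptes the arch already knows are part of the batch. */
		hint = pte_batch_hint(ptep + nr, pte);
		nr += hint;
		expected_pte = pte_advance_pfn(expected_pte, hint);
	}

	return nr <= max_nr ? nr : max_nr;
}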

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH v5 19/25] arm64/mm: Wire up PTE_CONT for user mappings
  2024-02-12 15:30         ` Ryan Roberts
  (?)
@ 2024-02-12 20:38           ` Ryan Roberts
  -1 siblings, 0 replies; 240+ messages in thread
From: Ryan Roberts @ 2024-02-12 20:38 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	David Hildenbrand, Kefeng Wang, John Hubbard, Zi Yan, Barry Song,
	Alistair Popple, Yang Shi, Nicholas Piggin, Christophe Leroy,
	Aneesh Kumar K.V, Naveen N. Rao, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, linux-arm-kernel,
	x86, linuxppc-dev, linux-mm, linux-kernel

[...]

>>>> +static inline bool mm_is_user(struct mm_struct *mm)
>>>> +{
>>>> +	/*
>>>> +	 * Don't attempt to apply the contig bit to kernel mappings, because
>>>> +	 * dynamically adding/removing the contig bit can cause page faults.
>>>> +	 * These racing faults are ok for user space, since they get serialized
>>>> +	 * on the PTL. But kernel mappings can't tolerate faults.
>>>> +	 */
>>>> +	return mm != &init_mm;
>>>> +}
>>>
>>> We also have the efi_mm as a non-user mm, though I don't think we manipulate
>>> that while it is live, and I'm not sure if that needs any special handling.
>>
>> Well we never need this function in the hot (order-0 folio) path, so I think I
>> could add a check for efi_mm here with performance implication. It's probably
>> safest to explicitly exclude it? What do you think?
> 
> Oops: This should have read "I think I could add a check for efi_mm here
> *without* performance implication"

It turns out that efi_mm is only defined when CONFIG_EFI is enabled. I can do this:

return mm != &init_mm && (!IS_ENABLED(CONFIG_EFI) || mm != &efi_mm);

Is that acceptable? This is my preference, but nothing else outside of efi
references this symbol currently.

Or perhaps I can convince myself that it's safe to treat efi_mm like userspace.
There are a couple of things that need to be guaranteed for it to be safe:

  - The PFNs of present ptes either need to have an associated struct page or
    need to have the PTE_SPECIAL bit set (either pte_mkspecial() or
    pte_mkdevmap())

  - Live mappings must either be static (no changes that could cause fold/unfold
    while live) or the system must be able to tolerate a temporary fault

Mark suggests efi_mm is not manipulated while live, so that meets the latter
requirement, but I'm not sure about the former?

Thanks,
Ryan


^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH v5 03/25] mm: Make pte_next_pfn() a wrapper around pte_advance_pfn()
  2024-02-12 14:29         ` David Hildenbrand
  (?)
@ 2024-02-12 21:34           ` Ryan Roberts
  -1 siblings, 0 replies; 240+ messages in thread
From: Ryan Roberts @ 2024-02-12 21:34 UTC (permalink / raw)
  To: David Hildenbrand, Catalin Marinas, Will Deacon, Ard Biesheuvel,
	Marc Zyngier, James Morse, Andrey Ryabinin, Andrew Morton,
	Matthew Wilcox, Mark Rutland, Kefeng Wang, John Hubbard, Zi Yan,
	Barry Song, Alistair Popple, Yang Shi, Nicholas Piggin,
	Christophe Leroy, Aneesh Kumar K.V, Naveen N. Rao,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin
  Cc: linux-arm-kernel, x86, linuxppc-dev, linux-mm, linux-kernel

On 12/02/2024 14:29, David Hildenbrand wrote:
> On 12.02.24 15:10, Ryan Roberts wrote:
>> On 12/02/2024 12:14, David Hildenbrand wrote:
>>> On 02.02.24 09:07, Ryan Roberts wrote:
>>>> The goal is to be able to advance a PTE by an arbitrary number of PFNs.
>>>> So introduce a new API that takes a nr param.
>>>>
>>>> We are going to remove pte_next_pfn() and replace it with
>>>> pte_advance_pfn(). As a first step, implement pte_next_pfn() as a
>>>> wrapper around pte_advance_pfn() so that we can incrementally switch the
>>>> architectures over. Once all arches are moved over, we will change all
>>>> the core-mm callers to call pte_advance_pfn() directly and remove the
>>>> wrapper.
>>>>
>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>> ---
>>>>    include/linux/pgtable.h | 8 +++++++-
>>>>    1 file changed, 7 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>> index 5e7eaf8f2b97..815d92dcb96b 100644
>>>> --- a/include/linux/pgtable.h
>>>> +++ b/include/linux/pgtable.h
>>>> @@ -214,9 +214,15 @@ static inline int pmd_dirty(pmd_t pmd)
>>>>        #ifndef pte_next_pfn
>>>> +#ifndef pte_advance_pfn
>>>> +static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
>>>> +{
>>>> +    return __pte(pte_val(pte) + (nr << PFN_PTE_SHIFT));
>>>> +}
>>>> +#endif
>>>>    static inline pte_t pte_next_pfn(pte_t pte)
>>>>    {
>>>> -    return __pte(pte_val(pte) + (1UL << PFN_PTE_SHIFT));
>>>> +    return pte_advance_pfn(pte, 1);
>>>>    }
>>>>    #endif
>>>>    
>>>
>>> I do wonder if we simply want to leave pte_next_pfn() around? Especially patch
>>> #4, #6 don't really benefit from the change? So are the other set_ptes()
>>> implementations.
>>>
>>> That is, only convert all pte_next_pfn()->pte_advance_pfn(), and leave a
>>> pte_next_pfn() macro in place.
>>>
>>> Any downsides to that?
>>
>> The downside is just having multiple functions that effectively do the same
>> thing. Personally I think it's cleaner and easier to understand the code with
>> just one generic function to which we pass 1 where we only want to advance by
>> 1. In the end, there are only a couple of places where pte_advance_pfn(1) is
>> used, so doesn't really seem valuable to me to maintain a specialization.
> 
> Well, not really functions, just a macro. Like we have set_pte_at() translating
> to set_ptes().
> 
> Arguably, we have more callers of set_pte_at().
> 
> "Easier to understand", I don't know. :)
> 
>>
>> Unless you feel strongly that we need to keep pte_next_pfn() then I'd prefer to
>> leave it as I've done in this series.
> 
> Well, it makes your patch set shorter and there is less code churn.
> 
> So personally, I'd just leave pte_next_pfn() in there. But whatever you prefer,
> not the end of the world.

I thought about this a bit more and remembered that I'm the apprentice so I've
changed it as you suggested.
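
For anyone following along, the agreed change presumably amounts to keeping a
trivial alias rather than converting every pte_next_pfn() caller -- a sketch
assuming it mirrors the set_pte_at()/set_ptes() pattern mentioned above, not
the exact hunk from the next version of the series:

/* Keep pte_next_pfn() around as a one-line wrapper over the new helper. */
#ifndef pte_next_pfn
#define pte_next_pfn(pte)	pte_advance_pfn(pte, 1)
#endif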


^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH v5 03/25] mm: Make pte_next_pfn() a wrapper around pte_advance_pfn()
  2024-02-12 21:34           ` Ryan Roberts
  (?)
@ 2024-02-13  9:54             ` David Hildenbrand
  -1 siblings, 0 replies; 240+ messages in thread
From: David Hildenbrand @ 2024-02-13  9:54 UTC (permalink / raw)
  To: Ryan Roberts, Catalin Marinas, Will Deacon, Ard Biesheuvel,
	Marc Zyngier, James Morse, Andrey Ryabinin, Andrew Morton,
	Matthew Wilcox, Mark Rutland, Kefeng Wang, John Hubbard, Zi Yan,
	Barry Song, Alistair Popple, Yang Shi, Nicholas Piggin,
	Christophe Leroy, Aneesh Kumar K.V, Naveen N. Rao,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin
  Cc: linux-arm-kernel, x86, linuxppc-dev, linux-mm, linux-kernel

On 12.02.24 22:34, Ryan Roberts wrote:
> On 12/02/2024 14:29, David Hildenbrand wrote:
>> On 12.02.24 15:10, Ryan Roberts wrote:
>>> On 12/02/2024 12:14, David Hildenbrand wrote:
>>>> On 02.02.24 09:07, Ryan Roberts wrote:
>>>>> The goal is to be able to advance a PTE by an arbitrary number of PFNs.
>>>>> So introduce a new API that takes a nr param.
>>>>>
>>>>> We are going to remove pte_next_pfn() and replace it with
>>>>> pte_advance_pfn(). As a first step, implement pte_next_pfn() as a
>>>>> wrapper around pte_advance_pfn() so that we can incrementally switch the
>>>>> architectures over. Once all arches are moved over, we will change all
>>>>> the core-mm callers to call pte_advance_pfn() directly and remove the
>>>>> wrapper.
>>>>>
>>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>>> ---
>>>>>     include/linux/pgtable.h | 8 +++++++-
>>>>>     1 file changed, 7 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>>> index 5e7eaf8f2b97..815d92dcb96b 100644
>>>>> --- a/include/linux/pgtable.h
>>>>> +++ b/include/linux/pgtable.h
>>>>> @@ -214,9 +214,15 @@ static inline int pmd_dirty(pmd_t pmd)
>>>>>         #ifndef pte_next_pfn
>>>>> +#ifndef pte_advance_pfn
>>>>> +static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
>>>>> +{
>>>>> +    return __pte(pte_val(pte) + (nr << PFN_PTE_SHIFT));
>>>>> +}
>>>>> +#endif
>>>>>     static inline pte_t pte_next_pfn(pte_t pte)
>>>>>     {
>>>>> -    return __pte(pte_val(pte) + (1UL << PFN_PTE_SHIFT));
>>>>> +    return pte_advance_pfn(pte, 1);
>>>>>     }
>>>>>     #endif
>>>>>     
>>>>
>>>> I do wonder if we simply want to leave pte_next_pfn() around? Especially patch
>>>> #4, #6 don't really benefit from the change? So are the other set_ptes()
>>>> implementations.
>>>>
>>>> That is, only convert all pte_next_pfn()->pte_advance_pfn(), and leave a
>>>> pte_next_pfn() macro in place.
>>>>
>>>> Any downsides to that?
>>>
>>> The downside is just having multiple functions that effectively do the same
>>> thing. Personally I think it's cleaner and easier to understand the code with
>>> just one generic function to which we pass 1 where we only want to advance by
>>> 1. In the end, there are only a couple of places where pte_advance_pfn(1) is
>>> used, so doesn't really seem valuable to me to maintain a specialization.
>>
>> Well, not really functions, just a macro. Like we have set_pte_at() translating
>> to set_ptes().
>>
>> Arguably, we have more callers of set_pte_at().
>>
>> "Easier to understand", I don't know. :)
>>
>>>
>>> Unless you feel strongly that we need to keep pte_next_pfn() then I'd prefer to
>>> leave it as I've done in this series.
>>
>> Well, it makes your patch set shorter and there is less code churn.
>>
>> So personally, I'd just leave pte_next_pfn() in there. But whatever you prefer,
>> not the end of the world.
> 
> I thought about this a bit more and remembered that I'm the apprentice so I've
> changed it as you suggested.

Oh, I say stupid things all the time. Please push back if you disagree. :)

[shrinking a patch set if possible and reasonable is often a good idea]

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH v5 19/25] arm64/mm: Wire up PTE_CONT for user mappings
  2024-02-12 20:38           ` Ryan Roberts
  (?)
@ 2024-02-13 10:01             ` David Hildenbrand
  -1 siblings, 0 replies; 240+ messages in thread
From: David Hildenbrand @ 2024-02-13 10:01 UTC (permalink / raw)
  To: Ryan Roberts, Mark Rutland
  Cc: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	Kefeng Wang, John Hubbard, Zi Yan, Barry Song, Alistair Popple,
	Yang Shi, Nicholas Piggin, Christophe Leroy, Aneesh Kumar K.V,
	Naveen N. Rao, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, linux-arm-kernel, x86, linuxppc-dev,
	linux-mm, linux-kernel

On 12.02.24 21:38, Ryan Roberts wrote:
> [...]
> 
>>>>> +static inline bool mm_is_user(struct mm_struct *mm)
>>>>> +{
>>>>> +	/*
>>>>> +	 * Don't attempt to apply the contig bit to kernel mappings, because
>>>>> +	 * dynamically adding/removing the contig bit can cause page faults.
>>>>> +	 * These racing faults are ok for user space, since they get serialized
>>>>> +	 * on the PTL. But kernel mappings can't tolerate faults.
>>>>> +	 */
>>>>> +	return mm != &init_mm;
>>>>> +}
>>>>
>>>> We also have the efi_mm as a non-user mm, though I don't think we manipulate
>>>> that while it is live, and I'm not sure if that needs any special handling.
>>>
>>> Well we never need this function in the hot (order-0 folio) path, so I think I
>>> could add a check for efi_mm here with performance implication. It's probably
>>> safest to explicitly exclude it? What do you think?
>>
>> Oops: This should have read "I think I could add a check for efi_mm here
>> *without* performance implication"
> 
> It turns out that efi_mm is only defined when CONFIG_EFI is enabled. I can do this:
> 
> return mm != &init_mm && (!IS_ENABLED(CONFIG_EFI) || mm != &efi_mm);

Please use all the lines you need ;)

if (IS_ENABLED(CONFIG_EFI) && unlikely(mm == &efi_mm))
	return false;
return mm != &init_mm;

> 
> Is that acceptable? This is my preference, but nothing else outside of efi
> references this symbol currently.

We could also mark MMs in some way to be special.

return mm->is_user;

Then it's easy to extend.
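
For reference, the first suggestion written out as the complete helper -- this
is simply the three lines above dropped into the mm_is_user() implementation
quoted earlier, with its original comment retained; a sketch, not the final
patch:

static inline bool mm_is_user(struct mm_struct *mm)
{
	/*
	 * Don't attempt to apply the contig bit to kernel mappings, because
	 * dynamically adding/removing the contig bit can cause page faults.
	 * These racing faults are ok for user space, since they get serialized
	 * on the PTL. But kernel mappings can't tolerate faults.
	 */
	if (IS_ENABLED(CONFIG_EFI) && unlikely(mm == &efi_mm))
		return false;
	return mm != &init_mm;
}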

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH v5 19/25] arm64/mm: Wire up PTE_CONT for user mappings
  2024-02-12 12:59       ` Ryan Roberts
  (?)
@ 2024-02-13 12:02         ` Mark Rutland
  -1 siblings, 0 replies; 240+ messages in thread
From: Mark Rutland @ 2024-02-13 12:02 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	David Hildenbrand, Kefeng Wang, John Hubbard, Zi Yan, Barry Song,
	Alistair Popple, Yang Shi, Nicholas Piggin, Christophe Leroy,
	Aneesh Kumar K.V, Naveen N. Rao, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, linux-arm-kernel,
	x86, linuxppc-dev, linux-mm, linux-kernel

On Mon, Feb 12, 2024 at 12:59:57PM +0000, Ryan Roberts wrote:
> On 12/02/2024 12:00, Mark Rutland wrote:
> > Hi Ryan,

[...]

> >> +static inline void set_pte(pte_t *ptep, pte_t pte)
> >> +{
> >> +	/*
> >> +	 * We don't have the mm or vaddr so cannot unfold contig entries (since
> >> +	 * it requires tlb maintenance). set_pte() is not used in core code, so
> >> +	 * this should never even be called. Regardless do our best to service
> >> +	 * any call and emit a warning if there is any attempt to set a pte on
> >> +	 * top of an existing contig range.
> >> +	 */
> >> +	pte_t orig_pte = __ptep_get(ptep);
> >> +
> >> +	WARN_ON_ONCE(pte_valid_cont(orig_pte));
> >> +	__set_pte(ptep, pte_mknoncont(pte));
> >> +}
> >> +
> >> +#define set_ptes set_ptes
> >> +static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
> >> +				pte_t *ptep, pte_t pte, unsigned int nr)
> >> +{
> >> +	pte = pte_mknoncont(pte);
> > 
> > Why do we have to clear the contiguous bit here? Is that for the same reason as
> > set_pte(), or do we expect callers to legitimately call this with the
> > contiguous bit set in 'pte'?
> > 
> > I think you explained this to me in-person, and IIRC we don't expect callers to
> > go set the bit themselves, but since it 'leaks' out to them via __ptep_get() we
> > have to clear it here to defer the decision of whether to set/clear it when
> > modifying entries. It would be nice if we could have a description of why/when
> > we need to clear this, e.g. in the 'public API' comment block above.
> 
> Yes, I think you've got it, but just to ram home the point: The PTE_CONT bit is
> private to the architecture code and is never set directly by core code. If the
> public API ever receives a pte that happens to have the PTE_CONT bit set, it
> would be bad news if we then accidentally set that in the pgtable.
> 
> Ideally, we would just unconditionally clear the bit before a getter returns
> the pte (e.g. ptep_get(), ptep_get_lockless(), ptep_get_and_clear(), ...). That
> way, the core code is guaranteed never to see a pte with the PTE_CONT bit set
> and can therefore never accidentally pass such a pte into a setter function.
> However, there is existing functionality that relies on being able to get a pte,
> then pass it to pte_leaf_size(), an arch function that checks the PTE_CONT bit
> to determine how big the leaf is. This is used in perf_get_pgtable_size().
> 
> So to allow perf_get_pgtable_size() to continue to see the "real" page size, I
> decided to allow PTE_CONT to leak through the getters and instead
> unconditionally clear the bit when a pte is passed to any of the setters.
> 
> I'll add a (slightly less verbose) comment as you suggest.

Great, thanks!
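
For context, the getter-visible consumer of PTE_CONT mentioned above is
pte_leaf_size(); on arm64 it boils down to roughly the following (paraphrased
from the arch header, not code from this series):

#define pte_leaf_size(pte)	(pte_cont(pte) ? CONT_PTE_SIZE : PAGE_SIZE)

perf_get_pgtable_size() uses this to report the real mapping granule, which is
why the getters keep PTE_CONT visible and the setters strip it via
pte_mknoncont() instead.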

[...]

> >> +static inline bool mm_is_user(struct mm_struct *mm)
> >> +{
> >> +	/*
> >> +	 * Don't attempt to apply the contig bit to kernel mappings, because
> >> +	 * dynamically adding/removing the contig bit can cause page faults.
> >> +	 * These racing faults are ok for user space, since they get serialized
> >> +	 * on the PTL. But kernel mappings can't tolerate faults.
> >> +	 */
> >> +	return mm != &init_mm;
> >> +}
> > 
> > We also have the efi_mm as a non-user mm, though I don't think we manipulate
> > that while it is live, and I'm not sure if that needs any special handling.
> 
> Well we never need this function in the hot (order-0 folio) path, so I think I
> could add a check for efi_mm here with performance implication. It's probably
> safest to explicitly exclude it? What do you think?

That sounds ok to me.

Otherwise, if we (somehow) know that we avoid calling this at all with an EFI
mm (e.g. because of the way we construct that), I'd be happy with a comment.

Probably best to Cc Ard for whatever we do here.

> >> +static inline pte_t *contpte_align_down(pte_t *ptep)
> >> +{
> >> +	return (pte_t *)(ALIGN_DOWN((unsigned long)ptep >> 3, CONT_PTES) << 3);
> > 
> > I think this can be:
> > 
> > static inline pte_t *contpte_align_down(pte_t *ptep)
> > {
> > 	return PTR_ALIGN_DOWN(ptep, sizeof(*ptep) * CONT_PTES);
> > }
> 
> Yep - that's much less ugly - thanks!
> 
> > 
> >> +
> >> +static void contpte_convert(struct mm_struct *mm, unsigned long addr,
> >> +			    pte_t *ptep, pte_t pte)
> >> +{
> >> +	struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
> >> +	unsigned long start_addr;
> >> +	pte_t *start_ptep;
> >> +	int i;
> >> +
> >> +	start_ptep = ptep = contpte_align_down(ptep);
> >> +	start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
> >> +	pte = pfn_pte(ALIGN_DOWN(pte_pfn(pte), CONT_PTES), pte_pgprot(pte));
> >> +
> >> +	for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE) {
> >> +		pte_t ptent = __ptep_get_and_clear(mm, addr, ptep);
> >> +
> >> +		if (pte_dirty(ptent))
> >> +			pte = pte_mkdirty(pte);
> >> +
> >> +		if (pte_young(ptent))
> >> +			pte = pte_mkyoung(pte);
> >> +	}
> > 
> > Not a big deal either way, but I wonder if it makes more sense to accumulate
> > the 'ptent' dirty/young values, then modify 'pte' once, i.e.
> > 
> > 	bool dirty = false, young = false;
> > 
> > 	for (...) {
> > 		pte_t ptent = __ptep_get_and_clear(mm, addr, ptep);
> > 		dirty |= pte_dirty(ptent);
> > 		young |= pte_young(ptent);
> > 	}
> > 
> > 	if (dirty)
> > 		pte = pte_mkdirty(pte);
> > 	if (young)
> > 		pte = pte_mkyoung(pte);
> > 
> > I suspect that might generate slightly better code, but I'm also happy with the
> > current form if people think that's more legible (I have no strong feelings
> > either way).
> 
> I kept it this way, because it's the same pattern used in arm64's hugetlbpage.c.
> We also had the same comment against David's batching patches recently, and he
> opted to stick with the former version:
> 
> https://lore.kernel.org/linux-mm/d83309fa-4daa-430f-ae52-4e72162bca9a@redhat.com/
> 
> So I'm inclined to leave it as is, since you're not insisting :)

That rationale is reasonable, and I'm fine with this as-is.

[...]

> >> +pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
> >> +{
> >> +	/*
> >> +	 * Gather access/dirty bits, which may be populated in any of the ptes
> >> +	 * of the contig range. We may not be holding the PTL, so any contiguous
> >> +	 * range may be unfolded/modified/refolded under our feet. Therefore we
> >> +	 * ensure we read a _consistent_ contpte range by checking that all ptes
> >> +	 * in the range are valid and have CONT_PTE set, that all pfns are
> >> +	 * contiguous and that all pgprots are the same (ignoring access/dirty).
> >> +	 * If we find a pte that is not consistent, then we must be racing with
> >> +	 * an update so start again. If the target pte does not have CONT_PTE
> >> +	 * set then that is considered consistent on its own because it is not
> >> +	 * part of a contpte range.
> >> +	 */
> >> +
> >> +	pgprot_t orig_prot;
> >> +	unsigned long pfn;
> >> +	pte_t orig_pte;
> >> +	pgprot_t prot;
> >> +	pte_t *ptep;
> >> +	pte_t pte;
> >> +	int i;
> >> +
> >> +retry:
> >> +	orig_pte = __ptep_get(orig_ptep);
> >> +
> >> +	if (!pte_valid_cont(orig_pte))
> >> +		return orig_pte;
> >> +
> >> +	orig_prot = pte_pgprot(pte_mkold(pte_mkclean(orig_pte)));
> >> +	ptep = contpte_align_down(orig_ptep);
> >> +	pfn = pte_pfn(orig_pte) - (orig_ptep - ptep);
> >> +
> >> +	for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
> >> +		pte = __ptep_get(ptep);
> >> +		prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
> >> +
> >> +		if (!pte_valid_cont(pte) ||
> >> +		   pte_pfn(pte) != pfn ||
> >> +		   pgprot_val(prot) != pgprot_val(orig_prot))
> >> +			goto retry;
> >> +
> >> +		if (pte_dirty(pte))
> >> +			orig_pte = pte_mkdirty(orig_pte);
> >> +
> >> +		if (pte_young(pte))
> >> +			orig_pte = pte_mkyoung(orig_pte);
> >> +	}
> >> +
> >> +	return orig_pte;
> >> +}
> >> +EXPORT_SYMBOL(contpte_ptep_get_lockless);
> > 
> > I'm struggling to convince myself that this is safe in general, as it really
> > depends on how the caller will use this value. Which caller(s) actually care
> > about the access/dirty bits, given those could change at any time anyway?
> 
> I think your points below are valid, and agree we should try to make this work
> without needing access/dirty if possible. But can you elaborate on why you don't
> think it's safe?

Having mulled this over, I think it is safe as-is, and I was being overly
cautious.

I had a general fear of potential problems stemming from the fact that (a) the
accumulation of access/dirty bits isn't atomic and (b) the loop is unbounded.
From looking at how this is used today, I think (a) is essentially the same as
reading a stale non-contiguous entry, and I'm being overly cautious there. For
(b), I think that's largely a performance concern and it would only retry
indefinitely in the presence of mis-programmed entries or consistent racing
with a writer under heavy contention.

I think it's still desirable to avoid the accumulation in most cases (to avoid
redundant work and to minimize the potential for unbounded retries), but I'm
happy with that being a follow-up improvement.

> > I took a quick scan, and AFAICT:
> 
> Thanks for enumerating these; saves me from having to refresh my memory :)
> > 
> > * For perf_get_pgtable_size(), we only care about whether the entry is valid
> >   and has the contig bit set. We could clean that up with a new interface, e.g.
> >   something like a new ptep_get_size_lockless().
> > 
> > * For gup_pte_range(), I'm not sure we actually need the access/dirty bits when
> >   we look at the pte to start with, since we only care whether we can logically
> >   write to the page at that point.
> > 
> >   I see that we later follow up with:
> > 
> >     pte_val(pte) != pte_val(ptep_get(ptep))
> > 
> >   ... is that why we need ptep_get_lockless() to accumulate the access/dirty
> >   bits? So that shape of lockless-try...locked-compare sequence works?
> > 
> > * For huge_pte_alloc(), arm64 doesn't select CONFIG_ARCH_WANT_GENERAL_HUGETLB,
> >   so this doesn't seem to matter.
> > 
> > * For __collapse_huge_page_swapin(), we only care if the pte is a swap pte,
> >   which means the pte isn't valid, and we'll return the orig_pte as-is anyway.
> > 
> > * For pte_range_none() the access/dirty bits don't matter.
> > 
> > * For handle_pte_fault() I think we have the same shape of
> >   lockless-try...locked-compare sequence as for gup_pte_range(), where we don't
> >   care about the access/dirty bits before we reach the locked compare step.
> > 
> > * For ptdump_pte_entry() I think it's arguable that we should continue to
> >   report the access/dirty bits separately for each PTE, as we have done until
> >   now, to give an accurate representation of the contents of the translation
> >   tables.
> > 
> > * For swap_vma_readahead() and unuse_pte_range() we only care if the PTE is a
> >   swap entry, the access/dirty bits don't matter.
> > 
> > So AFAICT this only really matters for gup_pte_range() and handle_pte_fault(),
> > and IIUC that's only so that the locklessly-loaded pte value can be compared
> > with a subsequently locked-loaded entry (for which the access/dirty bits will
> > be accumulated). Have I understood that correctly?
> 
> Yes, I agree with what you are saying. My approach was to try to implement the
> existing APIs accurately though, the argument being that it reduces the chances
> of getting it wrong. But if you think the implementation is unsafe, then I guess
> it blows that out of the water...

I think your approach makes sense, and as above I'm happy to defer the API
changes/additions to avoid the accumulation of access/dirty bits.

> > If so, I wonder if we could instead do that comparison modulo the access/dirty
> > bits, 
> 
> I think that would work - but will need to think a bit more on it.
> 
> > and leave ptep_get_lockless() only reading a single entry?
> 
> I think we will need to do something a bit less fragile. ptep_get() does collect
> the access/dirty bits so it's confusing if ptep_get_lockless() doesn't IMHO. So
> we will likely want to rename the function and make its documentation explicit
> that it does not return those bits.
> 
> ptep_get_lockless_noyoungdirty()? yuk... Any ideas?
> 
> Of course if I could convince you the current implementation is safe, I might be
> able to sidestep this optimization until a later date?

Yep. :)

Mark.
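
As a rough illustration of the "comparison modulo the access/dirty bits" idea
discussed above, a hypothetical helper (the name is invented; this is not code
from the series) might look like:

/*
 * Compare two pte values while ignoring the access/dirty bits, so a
 * locklessly-loaded entry can later be checked against a locked load
 * without the lockless getter having to accumulate access/dirty across
 * the whole contpte range.
 */
static inline bool pte_same_ignoring_accessdirty(pte_t a, pte_t b)
{
	return pte_val(pte_mkold(pte_mkclean(a))) ==
	       pte_val(pte_mkold(pte_mkclean(b)));
}

The lockless-try ... locked-compare sequences in gup_pte_range() and
handle_pte_fault() would then use a check of this shape instead of a raw
pte_val() comparison.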

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH v5 19/25] arm64/mm: Wire up PTE_CONT for user mappings
  2024-02-12 20:38           ` Ryan Roberts
@ 2024-02-13 12:06             ` Ryan Roberts
  -1 siblings, 0 replies; 240+ messages in thread
From: Ryan Roberts @ 2024-02-13 12:06 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	David Hildenbrand, Kefeng Wang, John Hubbard, Zi Yan, Barry Song,
	Alistair Popple, Yang Shi, Nicholas Piggin, Christophe Leroy,
	Aneesh Kumar K.V, Naveen N. Rao, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, linux-arm-kernel,
	x86, linuxppc-dev, linux-mm, linux-kernel

On 12/02/2024 20:38, Ryan Roberts wrote:
> [...]
> 
>>>>> +static inline bool mm_is_user(struct mm_struct *mm)
>>>>> +{
>>>>> +	/*
>>>>> +	 * Don't attempt to apply the contig bit to kernel mappings, because
>>>>> +	 * dynamically adding/removing the contig bit can cause page faults.
>>>>> +	 * These racing faults are ok for user space, since they get serialized
>>>>> +	 * on the PTL. But kernel mappings can't tolerate faults.
>>>>> +	 */
>>>>> +	return mm != &init_mm;
>>>>> +}
>>>>
>>>> We also have the efi_mm as a non-user mm, though I don't think we manipulate
>>>> that while it is live, and I'm not sure if that needs any special handling.
>>>
>>> Well we never need this function in the hot (order-0 folio) path, so I think I
>>> could add a check for efi_mm here with performance implication. It's probably
>>> safest to explicitly exclude it? What do you think?
>>
>> Oops: This should have read "I think I could add a check for efi_mm here
>> *without* performance implication"
> 
> It turns out that efi_mm is only defined when CONFIG_EFI is enabled. I can do this:
> 
> return mm != &init_mm && (!IS_ENABLED(CONFIG_EFI) || mm != &efi_mm);
> 
> Is that acceptable? This is my preference, but nothing else outside of efi
> references this symbol currently.
> 
> Or perhaps I can convince myself that it's safe to treat efi_mm like userspace.
> There are a couple of things that need to be guaranteed for it to be safe:
> 
>   - The PFNs of present ptes either need to have an associated struct page or
>     need to have the PTE_SPECIAL bit set (either pte_mkspecial() or
>     pte_mkdevmap())
> 
>   - Live mappings must either be static (no changes that could cause fold/unfold
>     while live) or the system must be able to tolerate a temporary fault
> 
> Mark suggests efi_mm is not manipulated while live, so that meets the latter
> requirement, but I'm not sure about the former?

I've gone through all the efi code, and conclude that, as Mark suggests, the
mappings are indeed static. And additionally, the ptes are populated using only
the _private_ ptep API, so there is no issue here. As just discussed with Mark,
my preference is to not make any changes to code, and just add a comment
describing why efi_mm is safe.

Details:

* Registered with ptdump
    * ptep_get_lockless()
* efi_create_mapping -> create_pgd_mapping … -> init_pte:
    * __ptep_get()
    * __set_pte()
* efi_memattr_apply_permissions -> efi_set_mapping_permissions … -> set_permissions
    * __ptep_get()
    * __set_pte()
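
The comment-only change described here might look something like the following
addition to mm_is_user() (wording is illustrative, not the final patch):

	/*
	 * efi_mm is deliberately not excluded here: its mappings are created
	 * once at boot using only the private __set_pte()/__ptep_get()
	 * helpers and never change while live, so fold/unfold (and the
	 * transient faults that implies) cannot occur on them.
	 */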

Thanks,
Ryan

> 
> Thanks,
> Ryan
> 


^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH v5 19/25] arm64/mm: Wire up PTE_CONT for user mappings
  2024-02-13 12:06             ` Ryan Roberts
@ 2024-02-13 12:19               ` David Hildenbrand
  -1 siblings, 0 replies; 240+ messages in thread
From: David Hildenbrand @ 2024-02-13 12:19 UTC (permalink / raw)
  To: Ryan Roberts, Mark Rutland
  Cc: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	Kefeng Wang, John Hubbard, Zi Yan, Barry Song, Alistair Popple,
	Yang Shi, Nicholas Piggin, Christophe Leroy, Aneesh Kumar K.V,
	Naveen N. Rao, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, linux-arm-kernel, x86, linuxppc-dev,
	linux-mm, linux-kernel

On 13.02.24 13:06, Ryan Roberts wrote:
> On 12/02/2024 20:38, Ryan Roberts wrote:
>> [...]
>>
>>>>>> +static inline bool mm_is_user(struct mm_struct *mm)
>>>>>> +{
>>>>>> +	/*
>>>>>> +	 * Don't attempt to apply the contig bit to kernel mappings, because
>>>>>> +	 * dynamically adding/removing the contig bit can cause page faults.
>>>>>> +	 * These racing faults are ok for user space, since they get serialized
>>>>>> +	 * on the PTL. But kernel mappings can't tolerate faults.
>>>>>> +	 */
>>>>>> +	return mm != &init_mm;
>>>>>> +}
>>>>>
>>>>> We also have the efi_mm as a non-user mm, though I don't think we manipulate
>>>>> that while it is live, and I'm not sure if that needs any special handling.
>>>>
>>>> Well we never need this function in the hot (order-0 folio) path, so I think I
>>>> could add a check for efi_mm here with performance implication. It's probably
>>>> safest to explicitly exclude it? What do you think?
>>>
>>> Oops: This should have read "I think I could add a check for efi_mm here
>>> *without* performance implication"
>>
>> It turns out that efi_mm is only defined when CONFIG_EFI is enabled. I can do this:
>>
>> return mm != &init_mm && (!IS_ENABLED(CONFIG_EFI) || mm != &efi_mm);
>>
>> Is that acceptable? This is my preference, but nothing else outside of efi
>> references this symbol currently.
>>
>> Or perhaps I can convince myself that it's safe to treat efi_mm like userspace.
>> There are a couple of things that need to be guaranteed for it to be safe:
>>
>>    - The PFNs of present ptes either need to have an associated struct page or
>>      need to have the PTE_SPECIAL bit set (either pte_mkspecial() or
>>      pte_mkdevmap())
>>
>>    - Live mappings must either be static (no changes that could cause fold/unfold
>>      while live) or the system must be able to tolerate a temporary fault
>>
>> Mark suggests efi_mm is not manipulated while live, so that meets the latter
>> requirement, but I'm not sure about the former?
> 
> I've gone through all the efi code, and conclude that, as Mark suggests, the
> mappings are indeed static. And additionally, the ptes are populated using only
> the _private_ ptep API, so there is no issue here. As just discussed with Mark,
> my preference is to not make any changes to code, and just add a comment
> describing why efi_mm is safe.
> 
> Details:
> 
> * Registered with ptdump
>      * ptep_get_lockless()
> * efi_create_mapping -> create_pgd_mapping … -> init_pte:
>      * __ptep_get()
>      * __set_pte()
> * efi_memattr_apply_permissions -> efi_set_mapping_permissions … -> set_permissions
>      * __ptep_get()
>      * __set_pte()

Sounds good. We could add some VM_WARN_ON if we ever get the efi_mm via 
the "official" APIs.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH v5 19/25] arm64/mm: Wire up PTE_CONT for user mappings
  2024-02-13 12:02         ` Mark Rutland
  (?)
@ 2024-02-13 13:03           ` Ryan Roberts
  -1 siblings, 0 replies; 240+ messages in thread
From: Ryan Roberts @ 2024-02-13 13:03 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	David Hildenbrand, Kefeng Wang, John Hubbard, Zi Yan, Barry Song,
	Alistair Popple, Yang Shi, Nicholas Piggin, Christophe Leroy,
	Aneesh Kumar K.V, Naveen N. Rao, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, linux-arm-kernel,
	x86, linuxppc-dev, linux-mm, linux-kernel

On 13/02/2024 12:02, Mark Rutland wrote:
> On Mon, Feb 12, 2024 at 12:59:57PM +0000, Ryan Roberts wrote:
>> On 12/02/2024 12:00, Mark Rutland wrote:
>>> Hi Ryan,
> 
> [...]
> 
>>>> +static inline void set_pte(pte_t *ptep, pte_t pte)
>>>> +{
>>>> +	/*
>>>> +	 * We don't have the mm or vaddr so cannot unfold contig entries (since
>>>> +	 * it requires tlb maintenance). set_pte() is not used in core code, so
>>>> +	 * this should never even be called. Regardless do our best to service
>>>> +	 * any call and emit a warning if there is any attempt to set a pte on
>>>> +	 * top of an existing contig range.
>>>> +	 */
>>>> +	pte_t orig_pte = __ptep_get(ptep);
>>>> +
>>>> +	WARN_ON_ONCE(pte_valid_cont(orig_pte));
>>>> +	__set_pte(ptep, pte_mknoncont(pte));
>>>> +}
>>>> +
>>>> +#define set_ptes set_ptes
>>>> +static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
>>>> +				pte_t *ptep, pte_t pte, unsigned int nr)
>>>> +{
>>>> +	pte = pte_mknoncont(pte);
>>>
>>> Why do we have to clear the contiguous bit here? Is that for the same reason as
>>> set_pte(), or do we expect callers to legitimately call this with the
>>> contiguous bit set in 'pte'?
>>>
>>> I think you explained this to me in-person, and IIRC we don't expect callers to
>>> go set the bit themselves, but since it 'leaks' out to them via __ptep_get() we
>>> have to clear it here to defer the decision of whether to set/clear it when
>>> modifying entries. It would be nice if we could have a description of why/when
>>> we need to clear this, e.g. in the 'public API' comment block above.
>>
>> Yes, I think you've got it, but just to ram home the point: The PTE_CONT bit is
>> private to the architecture code and is never set directly by core code. If the
>> public API ever receives a pte that happens to have the PTE_CONT bit set, it
>> would be bad news if we then accidentally set that in the pgtable.
>>
>> Ideally, we would just unconditionally clear the bit before a getter returns
>> the pte (e.g. ptep_get(), ptep_get_lockless(), ptep_get_and_clear(), ...). That
>> way, the core code is guaranteed never to see a pte with the PTE_CONT bit set
>> and can therefore never accidentally pass such a pte into a setter function.
>> However, there is existing functionality that relies on being able to get a pte,
>> then pass it to pte_leaf_size(), an arch function that checks the PTE_CONT bit
>> to determine how big the leaf is. This is used in perf_get_pgtable_size().
>>
>> So to allow perf_get_pgtable_size() to continue to see the "real" page size, I
>> decided to allow PTE_CONT to leak through the getters and instead
>> unconditionally clear the bit when a pte is passed to any of the setters.
>>
>> I'll add a (slightly less verbose) comment as you suggest.
> 
> Great, thanks!
> 
> [...]
> 
>>>> +static inline bool mm_is_user(struct mm_struct *mm)
>>>> +{
>>>> +	/*
>>>> +	 * Don't attempt to apply the contig bit to kernel mappings, because
>>>> +	 * dynamically adding/removing the contig bit can cause page faults.
>>>> +	 * These racing faults are ok for user space, since they get serialized
>>>> +	 * on the PTL. But kernel mappings can't tolerate faults.
>>>> +	 */
>>>> +	return mm != &init_mm;
>>>> +}
>>>
>>> We also have the efi_mm as a non-user mm, though I don't think we manipulate
>>> that while it is live, and I'm not sure if that needs any special handling.
>>
>> Well we never need this function in the hot (order-0 folio) path, so I think I
>> could add a check for efi_mm here with performance implication. It's probably
>> safest to explicitly exclude it? What do you think?
> 
> That sounds ok to me.
> 
> Otherwise, if we (somehow) know that we avoid calling this at all with an EFI
> mm (e.g. because of the way we construct that), I'd be happy with a comment.

We crossed streams - as per my other email, I'm confident that this is safe so
will just add a comment.
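
Roughly along these lines (a sketch only; exact wording to be finalized when
I post the change), next to mm_is_user():

	/*
	 * efi_mm is not a user mm, but it is safe to treat it like one here:
	 * its mappings are static once created (so nothing can race with a
	 * fold/unfold) and its ptes are only ever populated via the private
	 * __set_pte() API, so they never pass through these public helpers.
	 */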

> 
> Probably best to Cc Ard for whatever we do here.

Ard is already on CC.

> 
>>>> +static inline pte_t *contpte_align_down(pte_t *ptep)
>>>> +{
>>>> +	return (pte_t *)(ALIGN_DOWN((unsigned long)ptep >> 3, CONT_PTES) << 3);
>>>
>>> I think this can be:
>>>
>>> static inline pte_t *contpte_align_down(pte_t *ptep)
>>> {
>>> 	return PTR_ALIGN_DOWN(ptep, sizeof(*ptep) * CONT_PTES);
>>> }
>>
>> Yep - that's much less ugly - thanks!
>>
>>>
>>>> +
>>>> +static void contpte_convert(struct mm_struct *mm, unsigned long addr,
>>>> +			    pte_t *ptep, pte_t pte)
>>>> +{
>>>> +	struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
>>>> +	unsigned long start_addr;
>>>> +	pte_t *start_ptep;
>>>> +	int i;
>>>> +
>>>> +	start_ptep = ptep = contpte_align_down(ptep);
>>>> +	start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>>>> +	pte = pfn_pte(ALIGN_DOWN(pte_pfn(pte), CONT_PTES), pte_pgprot(pte));
>>>> +
>>>> +	for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE) {
>>>> +		pte_t ptent = __ptep_get_and_clear(mm, addr, ptep);
>>>> +
>>>> +		if (pte_dirty(ptent))
>>>> +			pte = pte_mkdirty(pte);
>>>> +
>>>> +		if (pte_young(ptent))
>>>> +			pte = pte_mkyoung(pte);
>>>> +	}
>>>
>>> Not a big deal either way, but I wonder if it makes more sense to accumulate
>>> the 'ptent' dirty/young values, then modify 'pte' once, i.e.
>>>
>>> 	bool dirty = false, young = false;
>>>
>>> 	for (...) {
>>> 		pte_t ptent = __ptep_get_and_clear(mm, addr, ptep);
>>> 		dirty |= pte_dirty(ptent);
>>> 		young |= pte_young(ptent);
>>> 	}
>>>
>>> 	if (dirty)
>>> 		pte = pte_mkdirty(pte);
>>> 	if (young)
>>> 		pte = pte_mkyoung(pte);
>>>
>>> I suspect that might generate slightly better code, but I'm also happy with the
>>> current form if people think that's more legible (I have no strong feelings
>>> either way).
>>
>> I kept it this way, because it's the same pattern used in arm64's hugetlbpage.c.
>> We also had the same comment against David's batching patches recently, and he
>> opted to stick with the former version:
>>
>> https://lore.kernel.org/linux-mm/d83309fa-4daa-430f-ae52-4e72162bca9a@redhat.com/
>>
>> So I'm inclined to leave it as is, since you're not insisting :)
> 
> That rationale is reasonable, and I'm fine with this as-is.
> 
> [...]
> 
>>>> +pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
>>>> +{
>>>> +	/*
>>>> +	 * Gather access/dirty bits, which may be populated in any of the ptes
>>>> +	 * of the contig range. We may not be holding the PTL, so any contiguous
>>>> +	 * range may be unfolded/modified/refolded under our feet. Therefore we
>>>> +	 * ensure we read a _consistent_ contpte range by checking that all ptes
>>>> +	 * in the range are valid and have CONT_PTE set, that all pfns are
>>>> +	 * contiguous and that all pgprots are the same (ignoring access/dirty).
>>>> +	 * If we find a pte that is not consistent, then we must be racing with
>>>> +	 * an update so start again. If the target pte does not have CONT_PTE
>>>> +	 * set then that is considered consistent on its own because it is not
>>>> +	 * part of a contpte range.
>>>> +	 */
>>>> +
>>>> +	pgprot_t orig_prot;
>>>> +	unsigned long pfn;
>>>> +	pte_t orig_pte;
>>>> +	pgprot_t prot;
>>>> +	pte_t *ptep;
>>>> +	pte_t pte;
>>>> +	int i;
>>>> +
>>>> +retry:
>>>> +	orig_pte = __ptep_get(orig_ptep);
>>>> +
>>>> +	if (!pte_valid_cont(orig_pte))
>>>> +		return orig_pte;
>>>> +
>>>> +	orig_prot = pte_pgprot(pte_mkold(pte_mkclean(orig_pte)));
>>>> +	ptep = contpte_align_down(orig_ptep);
>>>> +	pfn = pte_pfn(orig_pte) - (orig_ptep - ptep);
>>>> +
>>>> +	for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
>>>> +		pte = __ptep_get(ptep);
>>>> +		prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
>>>> +
>>>> +		if (!pte_valid_cont(pte) ||
>>>> +		   pte_pfn(pte) != pfn ||
>>>> +		   pgprot_val(prot) != pgprot_val(orig_prot))
>>>> +			goto retry;
>>>> +
>>>> +		if (pte_dirty(pte))
>>>> +			orig_pte = pte_mkdirty(orig_pte);
>>>> +
>>>> +		if (pte_young(pte))
>>>> +			orig_pte = pte_mkyoung(orig_pte);
>>>> +	}
>>>> +
>>>> +	return orig_pte;
>>>> +}
>>>> +EXPORT_SYMBOL(contpte_ptep_get_lockless);
>>>
>>> I'm struggling to convince myself that this is safe in general, as it really
>>> depends on how the caller will use this value. Which caller(s) actually care
>>> about the access/dirty bits, given those could change at any time anyway?
>>
>> I think your points below are valid, and agree we should try to make this work
>> without needing access/dirty if possible. But can you elaborate on why you don't
>> think it's safe?
> 
> Having mulled this over, I think it is safe as-is, and I was being overly
> cautious.
> 
> I had a general fear of potential problems stemming from the fact that (a) the
> accumulation of access/dirty bits isn't atomic and (b) the loop is unbounded.
> From looking at how this is used today, I think (a) is essentially the same as
> reading a stale non-contiguous entry, and I'm being overly cautious there. For
> (b), I think that's largely a performance concern and it would only retry
> indefinitely in the presence of mis-programmed entries or consistent racing
> with a writer under heavy contention.
> 
> I think it's still desirable to avoid the accumulation in most cases (to avoid
> redundant work and to minimize the potential for unbounded retries), but I'm
> happy with that being a follow-up improvement.

Great! I'll do the conversion to ptep_get_lockless_nosync() as a follow-up series.
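
(For the record, I'd expect the arm64 side of that to collapse to a single
read with no retry loop - something like the sketch below; the name and exact
semantics are still to be decided in that series.)

	static inline pte_t ptep_get_lockless_nosync(pte_t *ptep)
	{
		/*
		 * Sketch only: read a single entry and make no attempt to
		 * gather access/dirty bits from the rest of the contpte
		 * block, so callers must not rely on those bits.
		 */
		return __ptep_get(ptep);
	}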

> 
>>> I took a quick scan, and AFAICT:
>>
>> Thanks for enumerating these; Saves me from having to refresh my memory :)
>>>
>>> * For perf_get_pgtable_size(), we only care about whether the entry is valid
>>>   and has the contig bit set. We could clean that up with a new interface, e.g.
>>>   something like a new ptep_get_size_lockless().
>>>
>>> * For gup_pte_range(), I'm not sure we actually need the access/dirty bits when
>>>   we look at the pte to start with, since we only care where we can logically
>>>   write to the page at that point.
>>>
>>>   I see that we later follow up with:
>>>
>>>     pte_val(pte) != pte_val(ptep_get(ptep))
>>>
>>>   ... is that why we need ptep_get_lockless() to accumulate the access/dirty
>>>   bits? So that shape of lockless-try...locked-compare sequence works?
>>>
>>> * For huge_pte_alloc(), arm64 doesn't select CONFIG_ARCH_WANT_GENERAL_HUGETLB,
>>>   so this doesn't seem to matter.
>>>
>>> * For __collapse_huge_page_swapin(), we only care if the pte is a swap pte,
>>>   which means the pte isn't valid, and we'll return the orig_pte as-is anyway.
>>>
>>> * For pte_range_none() the access/dirty bits don't matter.
>>>
>>> * For handle_pte_fault() I think we have the same shape of
>>>   lockless-try...locked-compare sequence as for gup_pte_range(), where we don't
>>>   care about the access/dirty bits before we reach the locked compare step.
>>>
>>> * For ptdump_pte_entry() I think it's arguable that we should continue to
>>>   report the access/dirty bits separately for each PTE, as we have done until
>>>   now, to give an accurate representation of the contents of the translation
>>>   tables.
>>>
>>> * For swap_vma_readahead() and unuse_pte_range() we only care if the PTE is a
>>>   swap entry, the access/dirty bits don't matter.
>>>
>>> So AFAICT this only really matters for gup_pte_range() and handle_pte_fault(),
>>> and IIUC that's only so that the locklessly-loaded pte value can be compared
>>> with a subsequently locked-loaded entry (for which the access/dirty bits will
>>> be accumulated). Have I understood that correctly?
>>
>> Yes, I agree with what you are saying. My approach was to try to implement the
>> existing APIs accurately though, the argument being that it reduces the chances
>> of getting it wrong. But if you think the implementation is unsafe, then I guess
>> it blows that out of the water...
> 
> I think your approach makes sense, and as above I'm happy to defer the API
> changes/additions to avoid the accumulation of access/dirty bits.
> 
>>> If so, I wonder if we could instead do that comparison modulo the access/dirty
>>> bits, 
>>
>> I think that would work - but will need to think a bit more on it.
>>
>>> and leave ptep_get_lockless() only reading a single entry?
>>
>> I think we will need to do something a bit less fragile. ptep_get() does collect
>> the access/dirty bits so it's confusing if ptep_get_lockless() doesn't IMHO. So
>> we will likely want to rename the function and make its documentation explicit
>> that it does not return those bits.
>>
>> ptep_get_lockless_noyoungdirty()? yuk... Any ideas?
>>
>> Of course if I could convince you the current implementation is safe, I might be
>> able to sidestep this optimization until a later date?
> 
> Yep. :)
> 
> Mark.


^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH v5 19/25] arm64/mm: Wire up PTE_CONT for user mappings
  2024-02-13 12:19               ` David Hildenbrand
  (?)
@ 2024-02-13 13:06                 ` Ryan Roberts
  -1 siblings, 0 replies; 240+ messages in thread
From: Ryan Roberts @ 2024-02-13 13:06 UTC (permalink / raw)
  To: David Hildenbrand, Mark Rutland
  Cc: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	Kefeng Wang, John Hubbard, Zi Yan, Barry Song, Alistair Popple,
	Yang Shi, Nicholas Piggin, Christophe Leroy, Aneesh Kumar K.V,
	Naveen N. Rao, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, linux-arm-kernel, x86, linuxppc-dev,
	linux-mm, linux-kernel

On 13/02/2024 12:19, David Hildenbrand wrote:
> On 13.02.24 13:06, Ryan Roberts wrote:
>> On 12/02/2024 20:38, Ryan Roberts wrote:
>>> [...]
>>>
>>>>>>> +static inline bool mm_is_user(struct mm_struct *mm)
>>>>>>> +{
>>>>>>> +    /*
>>>>>>> +     * Don't attempt to apply the contig bit to kernel mappings, because
>>>>>>> +     * dynamically adding/removing the contig bit can cause page faults.
>>>>>>> +     * These racing faults are ok for user space, since they get serialized
>>>>>>> +     * on the PTL. But kernel mappings can't tolerate faults.
>>>>>>> +     */
>>>>>>> +    return mm != &init_mm;
>>>>>>> +}
>>>>>>
>>>>>> We also have the efi_mm as a non-user mm, though I don't think we manipulate
>>>>>> that while it is live, and I'm not sure if that needs any special handling.
>>>>>
>>>>> Well we never need this function in the hot (order-0 folio) path, so I think I
>>>>> could add a check for efi_mm here with performance implication. It's probably
>>>>> safest to explicitly exclude it? What do you think?
>>>>
>>>> Oops: This should have read "I think I could add a check for efi_mm here
>>>> *without* performance implication"
>>>
>>> It turns out that efi_mm is only defined when CONFIG_EFI is enabled. I can do
>>> this:
>>>
>>> return mm != &init_mm && (!IS_ENABLED(CONFIG_EFI) || mm != &efi_mm);
>>>
>>> Is that acceptable? This is my preference, but nothing else outside of efi
>>> references this symbol currently.
>>>
>>> Or perhaps I can convince myself that it's safe to treat efi_mm like userspace.
>>> There are a couple of things that need to be guaranteed for it to be safe:
>>>
>>>    - The PFNs of present ptes either need to have an associated struct page or
>>>      need to have the PTE_SPECIAL bit set (either pte_mkspecial() or
>>>      pte_mkdevmap())
>>>
>>>    - Live mappings must either be static (no changes that could cause
>>> fold/unfold
>>>      while live) or the system must be able to tolerate a temporary fault
>>>
>>> Mark suggests efi_mm is not manipulated while live, so that meets the latter
>>> requirement, but I'm not sure about the former?
>>
>> I've gone through all the efi code, and conclude that, as Mark suggests, the
>> mappings are indeed static. And additionally, the ptes are populated using only
>> the _private_ ptep API, so there is no issue here. As just discussed with Mark,
>> my preference is to not make any changes to the code, and just add a comment
>> describing why efi_mm is safe.
>>
>> Details:
>>
>> * Registered with ptdump
>>      * ptep_get_lockless()
>> * efi_create_mapping -> create_pgd_mapping … -> init_pte:
>>      * __ptep_get()
>>      * __set_pte()
>> * efi_memattr_apply_permissions -> efi_set_mapping_permissions … ->
>> set_permissions
>>      * __ptep_get()
>>      * __set_pte()
> 
> Sounds good. We could add some VM_WARN_ON if we ever get the efi_mm via the
> "official" APIs.

We could, but that would lead to the same linkage issue, which I'm trying to
avoid in the first place:

VM_WARN_ON(IS_ENABLED(CONFIG_EFI) && mm == efi_mm);

This creates new source code dependencies, which I would rather avoid if possible.


^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH v5 19/25] arm64/mm: Wire up PTE_CONT for user mappings
  2024-02-13 13:06                 ` Ryan Roberts
  (?)
@ 2024-02-13 13:13                   ` David Hildenbrand
  -1 siblings, 0 replies; 240+ messages in thread
From: David Hildenbrand @ 2024-02-13 13:13 UTC (permalink / raw)
  To: Ryan Roberts, Mark Rutland
  Cc: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	Kefeng Wang, John Hubbard, Zi Yan, Barry Song, Alistair Popple,
	Yang Shi, Nicholas Piggin, Christophe Leroy, Aneesh Kumar K.V,
	Naveen N. Rao, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, linux-arm-kernel, x86, linuxppc-dev,
	linux-mm, linux-kernel

On 13.02.24 14:06, Ryan Roberts wrote:
> On 13/02/2024 12:19, David Hildenbrand wrote:
>> On 13.02.24 13:06, Ryan Roberts wrote:
>>> On 12/02/2024 20:38, Ryan Roberts wrote:
>>>> [...]
>>>>
>>>>>>>> +static inline bool mm_is_user(struct mm_struct *mm)
>>>>>>>> +{
>>>>>>>> +    /*
>>>>>>>> +     * Don't attempt to apply the contig bit to kernel mappings, because
>>>>>>>> +     * dynamically adding/removing the contig bit can cause page faults.
>>>>>>>> +     * These racing faults are ok for user space, since they get serialized
>>>>>>>> +     * on the PTL. But kernel mappings can't tolerate faults.
>>>>>>>> +     */
>>>>>>>> +    return mm != &init_mm;
>>>>>>>> +}
>>>>>>>
>>>>>>> We also have the efi_mm as a non-user mm, though I don't think we manipulate
>>>>>>> that while it is live, and I'm not sure if that needs any special handling.
>>>>>>
>>>>>> Well we never need this function in the hot (order-0 folio) path, so I think I
>>>>>> could add a check for efi_mm here with performance implication. It's probably
>>>>>> safest to explicitly exclude it? What do you think?
>>>>>
>>>>> Oops: This should have read "I think I could add a check for efi_mm here
>>>>> *without* performance implication"
>>>>
>>>> It turns out that efi_mm is only defined when CONFIG_EFI is enabled. I can do
>>>> this:
>>>>
>>>> return mm != &init_mm && (!IS_ENABLED(CONFIG_EFI) || mm != &efi_mm);
>>>>
>>>> Is that acceptable? This is my preference, but nothing else outside of efi
>>>> references this symbol currently.
>>>>
>>>> Or perhaps I can convince myself that it's safe to treat efi_mm like userspace.
>>>> There are a couple of things that need to be guaranteed for it to be safe:
>>>>
>>>>     - The PFNs of present ptes either need to have an associated struct page or
>>>>       need to have the PTE_SPECIAL bit set (either pte_mkspecial() or
>>>>       pte_mkdevmap())
>>>>
>>>>     - Live mappings must either be static (no changes that could cause
>>>> fold/unfold
>>>>       while live) or the system must be able to tolerate a temporary fault
>>>>
>>>> Mark suggests efi_mm is not manipulated while live, so that meets the latter
>>>> requirement, but I'm not sure about the former?
>>>
>>> I've gone through all the efi code, and conclude that, as Mark suggests, the
>>> mappings are indeed static. And additionally, the ptes are populated using only
>>> the _private_ ptep API, so there is no issue here. As just discussed with Mark,
>>> my preference is to not make any changes to the code, and just add a comment
>>> describing why efi_mm is safe.
>>>
>>> Details:
>>>
>>> * Registered with ptdump
>>>       * ptep_get_lockless()
>>> * efi_create_mapping -> create_pgd_mapping … -> init_pte:
>>>       * __ptep_get()
>>>       * __set_pte()
>>> * efi_memattr_apply_permissions -> efi_set_mapping_permissions … ->
>>> set_permissions
>>>       * __ptep_get()
>>>       * __set_pte()
>>
>> Sounds good. We could add some VM_WARN_ON if we ever get the efi_mm via the
>> "official" APIs.
> 
> We could, but that would lead to the same linkage issue, which I'm trying to
> avoid in the first place:
> 
> VM_WARN_ON(IS_ENABLED(CONFIG_EFI) && mm == efi_mm);
> 
> This creates new source code dependencies, which I would rather avoid if possible.

Just a thought, you could have an is_efi_mm() function that abstracts all that.

diff --git a/include/linux/efi.h b/include/linux/efi.h
index c74f47711f0b..152f5fa66a2a 100644
--- a/include/linux/efi.h
+++ b/include/linux/efi.h
@@ -692,6 +692,15 @@ extern struct efi {
  
  extern struct mm_struct efi_mm;
  
+static inline bool is_efi_mm(struct mm_struct *mm)
+{
+#ifdef CONFIG_EFI
+       return mm == &efi_mm;
+#else
+       return false;
+#endif
+}
+
  static inline int
  efi_guidcmp (efi_guid_t left, efi_guid_t right)
  {
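
Then the call sites wouldn't need any CONFIG_EFI knowledge at all, e.g.
(untested):

	VM_WARN_ON(is_efi_mm(mm));

or, if you did end up wanting to exclude it in mm_is_user():

	return mm != &init_mm && !is_efi_mm(mm);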


-- 
Cheers,

David / dhildenb


^ permalink raw reply related	[flat|nested] 240+ messages in thread

* Re: [PATCH v5 19/25] arm64/mm: Wire up PTE_CONT for user mappings
  2024-02-13 13:13                   ` David Hildenbrand
  (?)
@ 2024-02-13 13:20                     ` Ryan Roberts
  -1 siblings, 0 replies; 240+ messages in thread
From: Ryan Roberts @ 2024-02-13 13:20 UTC (permalink / raw)
  To: David Hildenbrand, Mark Rutland
  Cc: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	Kefeng Wang, John Hubbard, Zi Yan, Barry Song, Alistair Popple,
	Yang Shi, Nicholas Piggin, Christophe Leroy, Aneesh Kumar K.V,
	Naveen N. Rao, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, linux-arm-kernel, x86, linuxppc-dev,
	linux-mm, linux-kernel

On 13/02/2024 13:13, David Hildenbrand wrote:
> On 13.02.24 14:06, Ryan Roberts wrote:
>> On 13/02/2024 12:19, David Hildenbrand wrote:
>>> On 13.02.24 13:06, Ryan Roberts wrote:
>>>> On 12/02/2024 20:38, Ryan Roberts wrote:
>>>>> [...]
>>>>>
>>>>>>>>> +static inline bool mm_is_user(struct mm_struct *mm)
>>>>>>>>> +{
>>>>>>>>> +    /*
>>>>>>>>> +     * Don't attempt to apply the contig bit to kernel mappings, because
>>>>>>>>> +     * dynamically adding/removing the contig bit can cause page faults.
>>>>>>>>> +     * These racing faults are ok for user space, since they get
>>>>>>>>> serialized
>>>>>>>>> +     * on the PTL. But kernel mappings can't tolerate faults.
>>>>>>>>> +     */
>>>>>>>>> +    return mm != &init_mm;
>>>>>>>>> +}
>>>>>>>>
>>>>>>>> We also have the efi_mm as a non-user mm, though I don't think we
>>>>>>>> manipulate
>>>>>>>> that while it is live, and I'm not sure if that needs any special handling.
>>>>>>>
>>>>>>> Well we never need this function in the hot (order-0 folio) path, so I
>>>>>>> think I
>>>>>>> could add a check for efi_mm here with performance implication. It's
>>>>>>> probably
>>>>>>> safest to explicitly exclude it? What do you think?
>>>>>>
>>>>>> Oops: This should have read "I think I could add a check for efi_mm here
>>>>>> *without* performance implication"
>>>>>
>>>>> It turns out that efi_mm is only defined when CONFIG_EFI is enabled. I can do
>>>>> this:
>>>>>
>>>>> return mm != &init_mm && (!IS_ENABLED(CONFIG_EFI) || mm != &efi_mm);
>>>>>
>>>>> Is that acceptable? This is my preference, but nothing else outside of efi
>>>>> references this symbol currently.
>>>>>
>>>>> Or perhaps I can convince myself that it's safe to treat efi_mm like userspace.
>>>>> There are a couple of things that need to be guaranteed for it to be safe:
>>>>>
>>>>>     - The PFNs of present ptes either need to have an associated struct
>>>>> page or
>>>>>       need to have the PTE_SPECIAL bit set (either pte_mkspecial() or
>>>>>       pte_mkdevmap())
>>>>>
>>>>>     - Live mappings must either be static (no changes that could cause
>>>>> fold/unfold
>>>>>       while live) or the system must be able to tolerate a temporary fault
>>>>>
>>>>> Mark suggests efi_mm is not manipulated while live, so that meets the latter
>>>>> requirement, but I'm not sure about the former?
>>>>
>>>> I've gone through all the efi code, and conclude that, as Mark suggests, the
>>>> mappings are indeed static. And additionally, the ptes are populated using only
>>>> the _private_ ptep API, so there is no issue here. As just discussed with Mark,
>>>> my preference is to not make any changes to code, and just add a comment
>>>> describing why efi_mm is safe.
>>>>
>>>> Details:
>>>>
>>>> * Registered with ptdump
>>>>       * ptep_get_lockless()
>>>> * efi_create_mapping -> create_pgd_mapping … -> init_pte:
>>>>       * __ptep_get()
>>>>       * __set_pte()
>>>> * efi_memattr_apply_permissions -> efi_set_mapping_permissions … ->
>>>> set_permissions
>>>>       * __ptep_get()
>>>>       * __set_pte()
>>>
>>> Sounds good. We could add some VM_WARN_ON if we ever get the efi_mm via the
>>> "official" APIs.
>>
>> We could, but that would lead to the same linkage issue, which I'm trying to
>> avoid in the first place:
>>
>> VM_WARN_ON(IS_ENABLED(CONFIG_EFI) && mm == &efi_mm);
>>
>> This creates new source code dependencies, which I would rather avoid if
>> possible.
> 
> Just a thought: you could have an is_efi_mm() function that abstracts all that.
> 
> diff --git a/include/linux/efi.h b/include/linux/efi.h
> index c74f47711f0b..152f5fa66a2a 100644
> --- a/include/linux/efi.h
> +++ b/include/linux/efi.h
> @@ -692,6 +692,15 @@ extern struct efi {
>  
>  extern struct mm_struct efi_mm;
>  
> +static inline bool is_efi_mm(struct mm_struct *mm)
> +{
> +#ifdef CONFIG_EFI
> +       return mm == &efi_mm;
> +#else
> +       return false;
> +#endif
> +}
> +
>  static inline int
>  efi_guidcmp (efi_guid_t left, efi_guid_t right)
>  {
> 
> 

That would definitely work, but in that case, I might as well just check for it
in mm_is_user() (and personally I would change the name to mm_is_efi()):


static inline bool mm_is_user(struct mm_struct *mm)
{
	return mm != &init_mm && !mm_is_efi(mm);
}

Any objections?
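
For illustration, a hypothetical caller (the function name below is made up for this sketch and is not one of the series' helpers) would consume the predicate like this:

/* Illustrative sketch only: not code from the series. */
static void sketch_try_fold_contpte(struct mm_struct *mm, unsigned long addr,
				    pte_t *ptep, pte_t pte)
{
	/* Never repaint kernel or EFI mappings with the contiguous bit. */
	if (!mm_is_user(mm))
		return;

	/* ... check alignment/contiguity, then set PTE_CONT across the block ... */
}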


^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH v5 19/25] arm64/mm: Wire up PTE_CONT for user mappings
  2024-02-13 13:20                     ` Ryan Roberts
  (?)
@ 2024-02-13 13:22                       ` David Hildenbrand
  -1 siblings, 0 replies; 240+ messages in thread
From: David Hildenbrand @ 2024-02-13 13:22 UTC (permalink / raw)
  To: Ryan Roberts, Mark Rutland
  Cc: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	Kefeng Wang, John Hubbard, Zi Yan, Barry Song, Alistair Popple,
	Yang Shi, Nicholas Piggin, Christophe Leroy, Aneesh Kumar K.V,
	Naveen N. Rao, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, linux-arm-kernel, x86, linuxppc-dev,
	linux-mm, linux-kernel

On 13.02.24 14:20, Ryan Roberts wrote:
> On 13/02/2024 13:13, David Hildenbrand wrote:
>> On 13.02.24 14:06, Ryan Roberts wrote:
>>> On 13/02/2024 12:19, David Hildenbrand wrote:
>>>> On 13.02.24 13:06, Ryan Roberts wrote:
>>>>> On 12/02/2024 20:38, Ryan Roberts wrote:
>>>>>> [...]
>>>>>>
>>>>>>>>>> +static inline bool mm_is_user(struct mm_struct *mm)
>>>>>>>>>> +{
>>>>>>>>>> +    /*
>>>>>>>>>> +     * Don't attempt to apply the contig bit to kernel mappings, because
>>>>>>>>>> +     * dynamically adding/removing the contig bit can cause page faults.
>>>>>>>>>> +     * These racing faults are ok for user space, since they get
>>>>>>>>>> serialized
>>>>>>>>>> +     * on the PTL. But kernel mappings can't tolerate faults.
>>>>>>>>>> +     */
>>>>>>>>>> +    return mm != &init_mm;
>>>>>>>>>> +}
>>>>>>>>>
>>>>>>>>> We also have the efi_mm as a non-user mm, though I don't think we
>>>>>>>>> manipulate
>>>>>>>>> that while it is live, and I'm not sure if that needs any special handling.
>>>>>>>>
>>>>>>>> Well we never need this function in the hot (order-0 folio) path, so I
>>>>>>>> think I
>>>>>>>> could add a check for efi_mm here with performance implication. It's
>>>>>>>> probably
>>>>>>>> safest to explicitly exclude it? What do you think?
>>>>>>>
>>>>>>> Oops: This should have read "I think I could add a check for efi_mm here
>>>>>>> *without* performance implication"
>>>>>>
>>>>>> It turns out that efi_mm is only defined when CONFIG_EFI is enabled. I can do
>>>>>> this:
>>>>>>
>>>>>> return mm != &init_mm && (!IS_ENABLED(CONFIG_EFI) || mm != &efi_mm);
>>>>>>
>>>>>> Is that acceptable? This is my preference, but nothing else outside of efi
>>>>>> references this symbol currently.
>>>>>>
>>>>>> Or perhaps I can convince myself that it's safe to treat efi_mm like userspace.
>>>>>> There are a couple of things that need to be guaranteed for it to be safe:
>>>>>>
>>>>>>      - The PFNs of present ptes either need to have an associated struct
>>>>>> page or
>>>>>>        need to have the PTE_SPECIAL bit set (either pte_mkspecial() or
>>>>>>        pte_mkdevmap())
>>>>>>
>>>>>>      - Live mappings must either be static (no changes that could cause
>>>>>> fold/unfold
>>>>>>        while live) or the system must be able to tolerate a temporary fault
>>>>>>
>>>>>> Mark suggests efi_mm is not manipulated while live, so that meets the latter
>>>>>> requirement, but I'm not sure about the former?
>>>>>
>>>>> I've gone through all the efi code, and conclude that, as Mark suggests, the
>>>>> mappings are indeed static. And additionally, the ptes are populated using only
>>>>> the _private_ ptep API, so there is no issue here. As just discussed with Mark,
>>>>> my preference is to not make any changes to code, and just add a comment
>>>>> describing why efi_mm is safe.
>>>>>
>>>>> Details:
>>>>>
>>>>> * Registered with ptdump
>>>>>        * ptep_get_lockless()
>>>>> * efi_create_mapping -> create_pgd_mapping … -> init_pte:
>>>>>        * __ptep_get()
>>>>>        * __set_pte()
>>>>> * efi_memattr_apply_permissions -> efi_set_mapping_permissions … ->
>>>>> set_permissions
>>>>>        * __ptep_get()
>>>>>        * __set_pte()
>>>>
>>>> Sounds good. We could add some VM_WARN_ON if we ever get the efi_mm via the
>>>> "official" APIs.
>>>
>>> We could, but that would lead to the same linkage issue, which I'm trying to
>>> avoid in the first place:
>>>
>>> VM_WARN_ON(IS_ENABLED(CONFIG_EFI) && mm == &efi_mm);
>>>
>>> This creates new source code dependencies, which I would rather avoid if
>>> possible.
>>
>> Just a thought: you could have an is_efi_mm() function that abstracts all that.
>>
>> diff --git a/include/linux/efi.h b/include/linux/efi.h
>> index c74f47711f0b..152f5fa66a2a 100644
>> --- a/include/linux/efi.h
>> +++ b/include/linux/efi.h
>> @@ -692,6 +692,15 @@ extern struct efi {
>>   
>>   extern struct mm_struct efi_mm;
>>   
>> +static inline bool is_efi_mm(struct mm_struct *mm)
>> +{
>> +#ifdef CONFIG_EFI
>> +       return mm == &efi_mm;
>> +#else
>> +       return false;
>> +#endif
>> +}
>> +
>>   static inline int
>>   efi_guidcmp (efi_guid_t left, efi_guid_t right)
>>   {
>>
>>
> 
> That would definitely work, but in that case, I might as well just check for it
> in mm_is_user() (and personally I would change the name to mm_is_efi()):
> 
> 
> static inline bool mm_is_user(struct mm_struct *mm)
> {
> 	return mm != &init_mm && !mm_is_efi(mm);
> }
> 
> Any objections?
> 

Nope :) Maybe slap in an "unlikely()", because efi_mm *is* unlikely to 
show up.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH v5 19/25] arm64/mm: Wire up PTE_CONT for user mappings
  2024-02-13 13:22                       ` David Hildenbrand
  (?)
@ 2024-02-13 13:24                         ` Ryan Roberts
  -1 siblings, 0 replies; 240+ messages in thread
From: Ryan Roberts @ 2024-02-13 13:24 UTC (permalink / raw)
  To: David Hildenbrand, Mark Rutland
  Cc: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	Kefeng Wang, John Hubbard, Zi Yan, Barry Song, Alistair Popple,
	Yang Shi, Nicholas Piggin, Christophe Leroy, Aneesh Kumar K.V,
	Naveen N. Rao, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, linux-arm-kernel, x86, linuxppc-dev,
	linux-mm, linux-kernel

On 13/02/2024 13:22, David Hildenbrand wrote:
> On 13.02.24 14:20, Ryan Roberts wrote:
>> On 13/02/2024 13:13, David Hildenbrand wrote:
>>> On 13.02.24 14:06, Ryan Roberts wrote:
>>>> On 13/02/2024 12:19, David Hildenbrand wrote:
>>>>> On 13.02.24 13:06, Ryan Roberts wrote:
>>>>>> On 12/02/2024 20:38, Ryan Roberts wrote:
>>>>>>> [...]
>>>>>>>
>>>>>>>>>>> +static inline bool mm_is_user(struct mm_struct *mm)
>>>>>>>>>>> +{
>>>>>>>>>>> +    /*
>>>>>>>>>>> +     * Don't attempt to apply the contig bit to kernel mappings,
>>>>>>>>>>> because
>>>>>>>>>>> +     * dynamically adding/removing the contig bit can cause page
>>>>>>>>>>> faults.
>>>>>>>>>>> +     * These racing faults are ok for user space, since they get
>>>>>>>>>>> serialized
>>>>>>>>>>> +     * on the PTL. But kernel mappings can't tolerate faults.
>>>>>>>>>>> +     */
>>>>>>>>>>> +    return mm != &init_mm;
>>>>>>>>>>> +}
>>>>>>>>>>
>>>>>>>>>> We also have the efi_mm as a non-user mm, though I don't think we
>>>>>>>>>> manipulate
>>>>>>>>>> that while it is live, and I'm not sure if that needs any special
>>>>>>>>>> handling.
>>>>>>>>>
>>>>>>>>> Well we never need this function in the hot (order-0 folio) path, so I
>>>>>>>>> think I
>>>>>>>>> could add a check for efi_mm here with performance implication. It's
>>>>>>>>> probably
>>>>>>>>> safest to explicitly exclude it? What do you think?
>>>>>>>>
>>>>>>>> Oops: This should have read "I think I could add a check for efi_mm here
>>>>>>>> *without* performance implication"
>>>>>>>
>>>>>>> It turns out that efi_mm is only defined when CONFIG_EFI is enabled. I
>>>>>>> can do
>>>>>>> this:
>>>>>>>
>>>>>>> return mm != &init_mm && (!IS_ENABLED(CONFIG_EFI) || mm != &efi_mm);
>>>>>>>
>>>>>>> Is that acceptable? This is my preference, but nothing else outside of efi
>>>>>>> references this symbol currently.
>>>>>>>
>>>>>>> Or perhaps I can convince myself that it's safe to treat efi_mm like
>>>>>>> userspace.
>>>>>>> There are a couple of things that need to be guaranteed for it to be safe:
>>>>>>>
>>>>>>>      - The PFNs of present ptes either need to have an associated struct
>>>>>>> page or
>>>>>>>        need to have the PTE_SPECIAL bit set (either pte_mkspecial() or
>>>>>>>        pte_mkdevmap())
>>>>>>>
>>>>>>>      - Live mappings must either be static (no changes that could cause
>>>>>>> fold/unfold
>>>>>>>        while live) or the system must be able to tolerate a temporary fault
>>>>>>>
>>>>>>> Mark suggests efi_mm is not manipulated while live, so that meets the latter
>>>>>>> requirement, but I'm not sure about the former?
>>>>>>
>>>>>> I've gone through all the efi code, and conclude that, as Mark suggests, the
>>>>>> mappings are indeed static. And additionally, the ptes are populated using
>>>>>> only
>>>>>> the _private_ ptep API, so there is no issue here. As just discussed with
>>>>>> Mark,
>>>>>> my preference is to not make any changes to code, and just add a comment
>>>>>> describing why efi_mm is safe.
>>>>>>
>>>>>> Details:
>>>>>>
>>>>>> * Registered with ptdump
>>>>>>        * ptep_get_lockless()
>>>>>> * efi_create_mapping -> create_pgd_mapping … -> init_pte:
>>>>>>        * __ptep_get()
>>>>>>        * __set_pte()
>>>>>> * efi_memattr_apply_permissions -> efi_set_mapping_permissions … ->
>>>>>> set_permissions
>>>>>>        * __ptep_get()
>>>>>>        * __set_pte()
>>>>>
>>>>> Sounds good. We could add some VM_WARN_ON if we ever get the efi_mm via the
>>>>> "official" APIs.
>>>>
>>>> We could, but that would lead to the same linkage issue, which I'm trying to
>>>> avoid in the first place:
>>>>
>>>> VM_WARN_ON(IS_ENABLED(CONFIG_EFI) && mm == &efi_mm);
>>>>
>>>> This creates new source code dependencies, which I would rather avoid if
>>>> possible.
>>>
>>> Just a thought: you could have an is_efi_mm() function that abstracts all that.
>>>
>>> diff --git a/include/linux/efi.h b/include/linux/efi.h
>>> index c74f47711f0b..152f5fa66a2a 100644
>>> --- a/include/linux/efi.h
>>> +++ b/include/linux/efi.h
>>> @@ -692,6 +692,15 @@ extern struct efi {
>>>     extern struct mm_struct efi_mm;
>>>   +static inline bool is_efi_mm(struct mm_struct *mm)
>>> +{
>>> +#ifdef CONFIG_EFI
>>> +       return mm == &efi_mm;
>>> +#else
>>> +       return false;
>>> +#endif
>>> +}
>>> +
>>>   static inline int
>>>   efi_guidcmp (efi_guid_t left, efi_guid_t right)
>>>   {
>>>
>>>
>>
>> That would definitely work, but in that case, I might as well just check for it
>> in mm_is_user() (and personally I would change the name to mm_is_efi()):
>>
>>
>> static inline bool mm_is_user(struct mm_struct *mm)
>> {
>>     return mm != &init_mm && !mm_is_efi(mm);
>> }
>>
>> Any objections?
>>
> 
> Nope :) Maybe slap in an "unlikely()", because efi_mm *is* unlikely to show up.

Deal

> 
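
Pulling the agreed pieces together, the result would look roughly like the sketch below, reusing the mm_is_efi() helper shape shown earlier in the thread; treat it as a sketch of the agreed direction rather than the exact hunk from the next revision:

static inline bool mm_is_user(struct mm_struct *mm)
{
	/*
	 * Don't attempt to apply the contig bit to kernel mappings, because
	 * dynamically adding/removing the contig bit can cause page faults.
	 * These racing faults are ok for user space, since they get serialized
	 * on the PTL. But kernel mappings can't tolerate faults.
	 */
	if (unlikely(mm_is_efi(mm)))
		return false;
	return mm != &init_mm;
}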


^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH v5 19/25] arm64/mm: Wire up PTE_CONT for user mappings
@ 2024-02-13 13:24                         ` Ryan Roberts
  0 siblings, 0 replies; 240+ messages in thread
From: Ryan Roberts @ 2024-02-13 13:24 UTC (permalink / raw)
  To: David Hildenbrand, Mark Rutland
  Cc: Kefeng Wang, x86, Catalin Marinas, Yang Shi, Dave Hansen,
	linux-mm, Andrey Ryabinin, H. Peter Anvin, Will Deacon,
	Ard Biesheuvel, Marc Zyngier, Alistair Popple, Barry Song,
	Matthew Wilcox, Aneesh Kumar K.V, Ingo Molnar, Zi Yan,
	Naveen N. Rao, John Hubbard, Nicholas Piggin, Borislav Petkov,
	Thomas Gleixner, linux-arm-kernel, linux-kernel, James Morse,
	Andrew Morton, linuxppc-dev

On 13/02/2024 13:22, David Hildenbrand wrote:
> On 13.02.24 14:20, Ryan Roberts wrote:
>> On 13/02/2024 13:13, David Hildenbrand wrote:
>>> On 13.02.24 14:06, Ryan Roberts wrote:
>>>> On 13/02/2024 12:19, David Hildenbrand wrote:
>>>>> On 13.02.24 13:06, Ryan Roberts wrote:
>>>>>> On 12/02/2024 20:38, Ryan Roberts wrote:
>>>>>>> [...]
>>>>>>>
>>>>>>>>>>> +static inline bool mm_is_user(struct mm_struct *mm)
>>>>>>>>>>> +{
>>>>>>>>>>> +    /*
>>>>>>>>>>> +     * Don't attempt to apply the contig bit to kernel mappings,
>>>>>>>>>>> because
>>>>>>>>>>> +     * dynamically adding/removing the contig bit can cause page
>>>>>>>>>>> faults.
>>>>>>>>>>> +     * These racing faults are ok for user space, since they get
>>>>>>>>>>> serialized
>>>>>>>>>>> +     * on the PTL. But kernel mappings can't tolerate faults.
>>>>>>>>>>> +     */
>>>>>>>>>>> +    return mm != &init_mm;
>>>>>>>>>>> +}
>>>>>>>>>>
>>>>>>>>>> We also have the efi_mm as a non-user mm, though I don't think we
>>>>>>>>>> manipulate
>>>>>>>>>> that while it is live, and I'm not sure if that needs any special
>>>>>>>>>> handling.
>>>>>>>>>
>>>>>>>>> Well we never need this function in the hot (order-0 folio) path, so I
>>>>>>>>> think I
>>>>>>>>> could add a check for efi_mm here with performance implication. It's
>>>>>>>>> probably
>>>>>>>>> safest to explicitly exclude it? What do you think?
>>>>>>>>
>>>>>>>> Oops: This should have read "I think I could add a check for efi_mm here
>>>>>>>> *without* performance implication"
>>>>>>>
>>>>>>> It turns out that efi_mm is only defined when CONFIG_EFI is enabled. I
>>>>>>> can do
>>>>>>> this:
>>>>>>>
>>>>>>> return mm != &init_mm && (!IS_ENABLED(CONFIG_EFI) || mm != &efi_mm);
>>>>>>>
>>>>>>> Is that acceptable? This is my preference, but nothing else outside of efi
>>>>>>> references this symbol currently.
>>>>>>>
>>>>>>> Or perhaps I can convince myself that its safe to treat efi_mm like
>>>>>>> userspace.
>>>>>>> There are a couple of things that need to be garanteed for it to be safe:
>>>>>>>
>>>>>>>      - The PFNs of present ptes either need to have an associated struct
>>>>>>> page or
>>>>>>>        need to have the PTE_SPECIAL bit set (either pte_mkspecial() or
>>>>>>>        pte_mkdevmap())
>>>>>>>
>>>>>>>      - Live mappings must either be static (no changes that could cause
>>>>>>> fold/unfold
>>>>>>>        while live) or the system must be able to tolerate a temporary fault
>>>>>>>
>>>>>>> Mark suggests efi_mm is not manipulated while live, so that meets the latter
>>>>>>> requirement, but I'm not sure about the former?
>>>>>>
>>>>>> I've gone through all the efi code, and conclude that, as Mark suggests, the
>>>>>> mappings are indeed static. And additionally, the ptes are populated using
>>>>>> only
>>>>>> the _private_ ptep API, so there is no issue here. As just discussed with
>>>>>> Mark,
>>>>>> my prefereence is to not make any changes to code, and just add a comment
>>>>>> describing why efi_mm is safe.
>>>>>>
>>>>>> Details:
>>>>>>
>>>>>> * Registered with ptdump
>>>>>>        * ptep_get_lockless()
>>>>>> * efi_create_mapping -> create_pgd_mapping … -> init_pte:
>>>>>>        * __ptep_get()
>>>>>>        * __set_pte()
>>>>>> * efi_memattr_apply_permissions -> efi_set_mapping_permissions … ->
>>>>>> set_permissions
>>>>>>        * __ptep_get()
>>>>>>        * __set_pte()
>>>>>
>>>>> Sound good. We could add some VM_WARN_ON if we ever get the efi_mm via the
>>>>> "official" APIs.
>>>>
>>>> We could, but that would lead to the same linkage issue, which I'm trying to
>>>> avoid in the first place:
>>>>
>>>> VM_WARN_ON(IS_ENABLED(CONFIG_EFI) && mm == efi_mm);
>>>>
>>>> This creates new source code dependencies, which I would rather avoid if
>>>> possible.
>>>
>>> Just a thought, you could have a is_efi_mm() function that abstracts all that.
>>>
>>> diff --git a/include/linux/efi.h b/include/linux/efi.h
>>> index c74f47711f0b..152f5fa66a2a 100644
>>> --- a/include/linux/efi.h
>>> +++ b/include/linux/efi.h
>>> @@ -692,6 +692,15 @@ extern struct efi {
>>>     extern struct mm_struct efi_mm;
>>>   +static inline void is_efi_mm(struct mm_struct *mm)
>>> +{
>>> +#ifdef CONFIG_EFI
>>> +       return mm == &efi_mm;
>>> +#else
>>> +       return false;
>>> +#endif
>>> +}
>>> +
>>>   static inline int
>>>   efi_guidcmp (efi_guid_t left, efi_guid_t right)
>>>   {
>>>
>>>
>>
>> That would definitely work, but in that case, I might as well just check for it
>> in mm_is_user() (and personally I would change the name to mm_is_efi()):
>>
>>
>> static inline bool mm_is_user(struct mm_struct *mm)
>> {
>>     return mm != &init_mm && !mm_is_efi(mm);
>> }
>>
>> Any objections?
>>
> 
> Nope :) Maybe slap in an "unlikely()", because efi_mm *is* unlikely to show up.

Deal

> 


^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH v5 19/25] arm64/mm: Wire up PTE_CONT for user mappings
@ 2024-02-13 13:24                         ` Ryan Roberts
  0 siblings, 0 replies; 240+ messages in thread
From: Ryan Roberts @ 2024-02-13 13:24 UTC (permalink / raw)
  To: David Hildenbrand, Mark Rutland
  Cc: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	Kefeng Wang, John Hubbard, Zi Yan, Barry Song, Alistair Popple,
	Yang Shi, Nicholas Piggin, Christophe Leroy, Aneesh Kumar K.V,
	Naveen N. Rao, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, linux-arm-kernel, x86, linuxppc-dev,
	linux-mm, linux-kernel

On 13/02/2024 13:22, David Hildenbrand wrote:
> On 13.02.24 14:20, Ryan Roberts wrote:
>> On 13/02/2024 13:13, David Hildenbrand wrote:
>>> On 13.02.24 14:06, Ryan Roberts wrote:
>>>> On 13/02/2024 12:19, David Hildenbrand wrote:
>>>>> On 13.02.24 13:06, Ryan Roberts wrote:
>>>>>> On 12/02/2024 20:38, Ryan Roberts wrote:
>>>>>>> [...]
>>>>>>>
>>>>>>>>>>> +static inline bool mm_is_user(struct mm_struct *mm)
>>>>>>>>>>> +{
>>>>>>>>>>> +    /*
>>>>>>>>>>> +     * Don't attempt to apply the contig bit to kernel mappings,
>>>>>>>>>>> because
>>>>>>>>>>> +     * dynamically adding/removing the contig bit can cause page
>>>>>>>>>>> faults.
>>>>>>>>>>> +     * These racing faults are ok for user space, since they get
>>>>>>>>>>> serialized
>>>>>>>>>>> +     * on the PTL. But kernel mappings can't tolerate faults.
>>>>>>>>>>> +     */
>>>>>>>>>>> +    return mm != &init_mm;
>>>>>>>>>>> +}
>>>>>>>>>>
>>>>>>>>>> We also have the efi_mm as a non-user mm, though I don't think we
>>>>>>>>>> manipulate
>>>>>>>>>> that while it is live, and I'm not sure if that needs any special
>>>>>>>>>> handling.
>>>>>>>>>
>>>>>>>>> Well we never need this function in the hot (order-0 folio) path, so I
>>>>>>>>> think I
>>>>>>>>> could add a check for efi_mm here with performance implication. It's
>>>>>>>>> probably
>>>>>>>>> safest to explicitly exclude it? What do you think?
>>>>>>>>
>>>>>>>> Oops: This should have read "I think I could add a check for efi_mm here
>>>>>>>> *without* performance implication"
>>>>>>>
>>>>>>> It turns out that efi_mm is only defined when CONFIG_EFI is enabled. I
>>>>>>> can do
>>>>>>> this:
>>>>>>>
>>>>>>> return mm != &init_mm && (!IS_ENABLED(CONFIG_EFI) || mm != &efi_mm);
>>>>>>>
>>>>>>> Is that acceptable? This is my preference, but nothing else outside of efi
>>>>>>> references this symbol currently.
>>>>>>>
>>>>>>> Or perhaps I can convince myself that its safe to treat efi_mm like
>>>>>>> userspace.
>>>>>>> There are a couple of things that need to be garanteed for it to be safe:
>>>>>>>
>>>>>>>      - The PFNs of present ptes either need to have an associated struct
>>>>>>> page or
>>>>>>>        need to have the PTE_SPECIAL bit set (either pte_mkspecial() or
>>>>>>>        pte_mkdevmap())
>>>>>>>
>>>>>>>      - Live mappings must either be static (no changes that could cause
>>>>>>> fold/unfold
>>>>>>>        while live) or the system must be able to tolerate a temporary fault
>>>>>>>
>>>>>>> Mark suggests efi_mm is not manipulated while live, so that meets the latter
>>>>>>> requirement, but I'm not sure about the former?
>>>>>>
>>>>>> I've gone through all the efi code, and conclude that, as Mark suggests, the
>>>>>> mappings are indeed static. And additionally, the ptes are populated using
>>>>>> only
>>>>>> the _private_ ptep API, so there is no issue here. As just discussed with
>>>>>> Mark,
>>>>>> my prefereence is to not make any changes to code, and just add a comment
>>>>>> describing why efi_mm is safe.
>>>>>>
>>>>>> Details:
>>>>>>
>>>>>> * Registered with ptdump
>>>>>>        * ptep_get_lockless()
>>>>>> * efi_create_mapping -> create_pgd_mapping … -> init_pte:
>>>>>>        * __ptep_get()
>>>>>>        * __set_pte()
>>>>>> * efi_memattr_apply_permissions -> efi_set_mapping_permissions … ->
>>>>>> set_permissions
>>>>>>        * __ptep_get()
>>>>>>        * __set_pte()
>>>>>
>>>>> Sound good. We could add some VM_WARN_ON if we ever get the efi_mm via the
>>>>> "official" APIs.
>>>>
>>>> We could, but that would lead to the same linkage issue, which I'm trying to
>>>> avoid in the first place:
>>>>
>>>> VM_WARN_ON(IS_ENABLED(CONFIG_EFI) && mm == efi_mm);
>>>>
>>>> This creates new source code dependencies, which I would rather avoid if
>>>> possible.
>>>
>>> Just a thought, you could have a is_efi_mm() function that abstracts all that.
>>>
>>> diff --git a/include/linux/efi.h b/include/linux/efi.h
>>> index c74f47711f0b..152f5fa66a2a 100644
>>> --- a/include/linux/efi.h
>>> +++ b/include/linux/efi.h
>>> @@ -692,6 +692,15 @@ extern struct efi {
>>>     extern struct mm_struct efi_mm;
>>>   +static inline void is_efi_mm(struct mm_struct *mm)
>>> +{
>>> +#ifdef CONFIG_EFI
>>> +       return mm == &efi_mm;
>>> +#else
>>> +       return false;
>>> +#endif
>>> +}
>>> +
>>>   static inline int
>>>   efi_guidcmp (efi_guid_t left, efi_guid_t right)
>>>   {
>>>
>>>
>>
>> That would definitely work, but in that case, I might as well just check for it
>> in mm_is_user() (and personally I would change the name to mm_is_efi()):
>>
>>
>> static inline bool mm_is_user(struct mm_struct *mm)
>> {
>>     return mm != &init_mm && !mm_is_efi(mm);
>> }
>>
>> Any objections?
>>
> 
> Nope :) Maybe slap in an "unlikely()", because efi_mm *is* unlikely to show up.

Deal

> 



^ permalink raw reply	[flat|nested] 240+ messages in thread
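
A minimal sketch of the shape the exchange above converges on: an mm_is_efi()
helper next to the efi_mm declaration, plus the unlikely() hint in the arm64
mm_is_user() check. This is reconstructed from the discussion only, not the
patch that was actually posted, and it assumes the usual kernel headers
(linux/mm_types.h, linux/efi.h) are already included.

/* Sketch only: reconstructed from the thread, not the merged code. */

/* include/linux/efi.h already declares: extern struct mm_struct efi_mm; */
static inline bool mm_is_efi(struct mm_struct *mm)
{
#ifdef CONFIG_EFI
        return mm == &efi_mm;
#else
        return false;
#endif
}

/* arm64 contpte code: only user mms may be folded/unfolded. */
static inline bool mm_is_user(struct mm_struct *mm)
{
        /*
         * Racing fold/unfold faults are only tolerable for user space,
         * where they serialize on the PTL; kernel and efi mappings must
         * never see them.
         */
        if (unlikely(mm_is_efi(mm)))
                return false;
        return mm != &init_mm;
}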

* Re: [PATCH v5 19/25] arm64/mm: Wire up PTE_CONT for user mappings
  2024-02-13 13:20                     ` Ryan Roberts
@ 2024-02-13 13:33                       ` Ard Biesheuvel
  -1 siblings, 0 replies; 240+ messages in thread
From: Ard Biesheuvel @ 2024-02-13 13:33 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: David Hildenbrand, Mark Rutland, Catalin Marinas, Will Deacon,
	Marc Zyngier, James Morse, Andrey Ryabinin, Andrew Morton,
	Matthew Wilcox, Kefeng Wang, John Hubbard, Zi Yan, Barry Song,
	Alistair Popple, Yang Shi, Nicholas Piggin, Christophe Leroy,
	Aneesh Kumar K.V, Naveen N. Rao, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, linux-arm-kernel,
	x86, linuxppc-dev, linux-mm, linux-kernel

On Tue, 13 Feb 2024 at 14:21, Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 13/02/2024 13:13, David Hildenbrand wrote:
> > On 13.02.24 14:06, Ryan Roberts wrote:
> >> On 13/02/2024 12:19, David Hildenbrand wrote:
> >>> On 13.02.24 13:06, Ryan Roberts wrote:
> >>>> On 12/02/2024 20:38, Ryan Roberts wrote:
> >>>>> [...]
> >>>>>
> >>>>>>>>> +static inline bool mm_is_user(struct mm_struct *mm)
> >>>>>>>>> +{
> >>>>>>>>> +    /*
> >>>>>>>>> +     * Don't attempt to apply the contig bit to kernel mappings, because
> >>>>>>>>> +     * dynamically adding/removing the contig bit can cause page faults.
> >>>>>>>>> +     * These racing faults are ok for user space, since they get
> >>>>>>>>> serialized
> >>>>>>>>> +     * on the PTL. But kernel mappings can't tolerate faults.
> >>>>>>>>> +     */
> >>>>>>>>> +    return mm != &init_mm;
> >>>>>>>>> +}
> >>>>>>>>
> >>>>>>>> We also have the efi_mm as a non-user mm, though I don't think we
> >>>>>>>> manipulate
> >>>>>>>> that while it is live, and I'm not sure if that needs any special handling.
> >>>>>>>
> >>>>>>> Well we never need this function in the hot (order-0 folio) path, so I
> >>>>>>> think I
> >>>>>>> could add a check for efi_mm here with performance implication. It's
> >>>>>>> probably
> >>>>>>> safest to explicitly exclude it? What do you think?
> >>>>>>
> >>>>>> Oops: This should have read "I think I could add a check for efi_mm here
> >>>>>> *without* performance implication"
> >>>>>
> >>>>> It turns out that efi_mm is only defined when CONFIG_EFI is enabled. I can do
> >>>>> this:
> >>>>>
> >>>>> return mm != &init_mm && (!IS_ENABLED(CONFIG_EFI) || mm != &efi_mm);
> >>>>>
> >>>>> Is that acceptable? This is my preference, but nothing else outside of efi
> >>>>> references this symbol currently.
> >>>>>
> >>>>> Or perhaps I can convince myself that it's safe to treat efi_mm like userspace.
> >>>>> There are a couple of things that need to be guaranteed for it to be safe:
> >>>>>
> >>>>>     - The PFNs of present ptes either need to have an associated struct
> >>>>> page or
> >>>>>       need to have the PTE_SPECIAL bit set (either pte_mkspecial() or
> >>>>>       pte_mkdevmap())
> >>>>>
> >>>>>     - Live mappings must either be static (no changes that could cause
> >>>>> fold/unfold
> >>>>>       while live) or the system must be able to tolerate a temporary fault
> >>>>>
> >>>>> Mark suggests efi_mm is not manipulated while live, so that meets the latter
> >>>>> requirement, but I'm not sure about the former?
> >>>>
> >>>> I've gone through all the efi code, and conclude that, as Mark suggests, the
> >>>> mappings are indeed static. And additionally, the ptes are populated using only
> >>>> the _private_ ptep API, so there is no issue here. As just discussed with Mark,
> >>>> my preference is to not make any changes to code, and just add a comment
> >>>> describing why efi_mm is safe.
> >>>>
> >>>> Details:
> >>>>
> >>>> * Registered with ptdump
> >>>>       * ptep_get_lockless()
> >>>> * efi_create_mapping -> create_pgd_mapping … -> init_pte:
> >>>>       * __ptep_get()
> >>>>       * __set_pte()
> >>>> * efi_memattr_apply_permissions -> efi_set_mapping_permissions … ->
> >>>> set_permissions
> >>>>       * __ptep_get()
> >>>>       * __set_pte()
> >>>
> >>> Sounds good. We could add some VM_WARN_ON if we ever get the efi_mm via the
> >>> "official" APIs.
> >>
> >> We could, but that would lead to the same linkage issue, which I'm trying to
> >> avoid in the first place:
> >>
> >> VM_WARN_ON(IS_ENABLED(CONFIG_EFI) && mm == &efi_mm);
> >>
> >> This creates new source code dependencies, which I would rather avoid if
> >> possible.
> >
> > Just a thought, you could have an is_efi_mm() function that abstracts all that.
> >
> > diff --git a/include/linux/efi.h b/include/linux/efi.h
> > index c74f47711f0b..152f5fa66a2a 100644
> > --- a/include/linux/efi.h
> > +++ b/include/linux/efi.h
> > @@ -692,6 +692,15 @@ extern struct efi {
> >
> >  extern struct mm_struct efi_mm;
> >
> > +static inline bool is_efi_mm(struct mm_struct *mm)
> > +{
> > +#ifdef CONFIG_EFI
> > +       return mm == &efi_mm;
> > +#else
> > +       return false;
> > +#endif
> > +}
> > +
> >  static inline int
> >  efi_guidcmp (efi_guid_t left, efi_guid_t right)
> >  {
> >
> >
>
> That would definitely work, but in that case, I might as well just check for it
> in mm_is_user() (and personally I would change the name to mm_is_efi()):
>
>
> static inline bool mm_is_user(struct mm_struct *mm)
> {
>         return mm != &init_mm && !mm_is_efi(mm);
> }
>
> Any objections?
>

Any reason not to use IS_ENABLED(CONFIG_EFI) in the above? The extern
declaration is visible to the compiler, and any references should
disappear before the linker could notice that efi_mm does not exist.

In any case, feel free to add

Acked-by: Ard Biesheuvel <ardb@kernel.org>

when you roll a patch based on the above, with or without IS_ENABLED().

And as you concluded, efi_mm is indeed set in stone once the runtime
regions described by the firmware have been mapped, although this may
happen in two passes depending on how the runtime regions are
described. But by the time user MMs might exist, efi_mm should
effectively be immutable.

^ permalink raw reply	[flat|nested] 240+ messages in thread
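
For comparison with the #ifdef version quoted above, a sketch of the
IS_ENABLED() form Ard suggests here. It is illustrative only and relies on the
compiler reducing the expression to constant false when CONFIG_EFI=n, so the
efi_mm reference never reaches the linker:

static inline bool mm_is_efi(struct mm_struct *mm)
{
        /*
         * The extern declaration of efi_mm is always visible to the
         * compiler; with CONFIG_EFI=n the whole expression folds to
         * false and the symbol reference is discarded before link time.
         */
        return IS_ENABLED(CONFIG_EFI) && mm == &efi_mm;
}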

* Re: [PATCH v5 19/25] arm64/mm: Wire up PTE_CONT for user mappings
  2024-02-13 13:33                       ` Ard Biesheuvel
@ 2024-02-13 13:45                         ` David Hildenbrand
  -1 siblings, 0 replies; 240+ messages in thread
From: David Hildenbrand @ 2024-02-13 13:45 UTC (permalink / raw)
  To: Ard Biesheuvel, Ryan Roberts
  Cc: Mark Rutland, Catalin Marinas, Will Deacon, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	Kefeng Wang, John Hubbard, Zi Yan, Barry Song, Alistair Popple,
	Yang Shi, Nicholas Piggin, Christophe Leroy, Aneesh Kumar K.V,
	Naveen N. Rao, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, linux-arm-kernel, x86, linuxppc-dev,
	linux-mm, linux-kernel

On 13.02.24 14:33, Ard Biesheuvel wrote:
> On Tue, 13 Feb 2024 at 14:21, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 13/02/2024 13:13, David Hildenbrand wrote:
>>> On 13.02.24 14:06, Ryan Roberts wrote:
>>>> On 13/02/2024 12:19, David Hildenbrand wrote:
>>>>> On 13.02.24 13:06, Ryan Roberts wrote:
>>>>>> On 12/02/2024 20:38, Ryan Roberts wrote:
>>>>>>> [...]
>>>>>>>
>>>>>>>>>>> +static inline bool mm_is_user(struct mm_struct *mm)
>>>>>>>>>>> +{
>>>>>>>>>>> +    /*
>>>>>>>>>>> +     * Don't attempt to apply the contig bit to kernel mappings, because
>>>>>>>>>>> +     * dynamically adding/removing the contig bit can cause page faults.
>>>>>>>>>>> +     * These racing faults are ok for user space, since they get
>>>>>>>>>>> serialized
>>>>>>>>>>> +     * on the PTL. But kernel mappings can't tolerate faults.
>>>>>>>>>>> +     */
>>>>>>>>>>> +    return mm != &init_mm;
>>>>>>>>>>> +}
>>>>>>>>>>
>>>>>>>>>> We also have the efi_mm as a non-user mm, though I don't think we
>>>>>>>>>> manipulate
>>>>>>>>>> that while it is live, and I'm not sure if that needs any special handling.
>>>>>>>>>
>>>>>>>>> Well we never need this function in the hot (order-0 folio) path, so I
>>>>>>>>> think I
>>>>>>>>> could add a check for efi_mm here with performance implication. It's
>>>>>>>>> probably
>>>>>>>>> safest to explicitly exclude it? What do you think?
>>>>>>>>
>>>>>>>> Oops: This should have read "I think I could add a check for efi_mm here
>>>>>>>> *without* performance implication"
>>>>>>>
>>>>>>> It turns out that efi_mm is only defined when CONFIG_EFI is enabled. I can do
>>>>>>> this:
>>>>>>>
>>>>>>> return mm != &init_mm && (!IS_ENABLED(CONFIG_EFI) || mm != &efi_mm);
>>>>>>>
>>>>>>> Is that acceptable? This is my preference, but nothing else outside of efi
>>>>>>> references this symbol currently.
>>>>>>>
>>>>>>> Or perhaps I can convince myself that it's safe to treat efi_mm like userspace.
>>>>>>> There are a couple of things that need to be guaranteed for it to be safe:
>>>>>>>
>>>>>>>      - The PFNs of present ptes either need to have an associated struct
>>>>>>> page or
>>>>>>>        need to have the PTE_SPECIAL bit set (either pte_mkspecial() or
>>>>>>>        pte_mkdevmap())
>>>>>>>
>>>>>>>      - Live mappings must either be static (no changes that could cause
>>>>>>> fold/unfold
>>>>>>>        while live) or the system must be able to tolerate a temporary fault
>>>>>>>
>>>>>>> Mark suggests efi_mm is not manipulated while live, so that meets the latter
>>>>>>> requirement, but I'm not sure about the former?
>>>>>>
>>>>>> I've gone through all the efi code, and conclude that, as Mark suggests, the
>>>>>> mappings are indeed static. And additionally, the ptes are populated using only
>>>>>> the _private_ ptep API, so there is no issue here. As just discussed with Mark,
>>>>>> my preference is to not make any changes to code, and just add a comment
>>>>>> describing why efi_mm is safe.
>>>>>>
>>>>>> Details:
>>>>>>
>>>>>> * Registered with ptdump
>>>>>>        * ptep_get_lockless()
>>>>>> * efi_create_mapping -> create_pgd_mapping … -> init_pte:
>>>>>>        * __ptep_get()
>>>>>>        * __set_pte()
>>>>>> * efi_memattr_apply_permissions -> efi_set_mapping_permissions … ->
>>>>>> set_permissions
>>>>>>        * __ptep_get()
>>>>>>        * __set_pte()
>>>>>
>>>>> Sounds good. We could add some VM_WARN_ON if we ever get the efi_mm via the
>>>>> "official" APIs.
>>>>
>>>> We could, but that would lead to the same linkage issue, which I'm trying to
>>>> avoid in the first place:
>>>>
>>>> VM_WARN_ON(IS_ENABLED(CONFIG_EFI) && mm == &efi_mm);
>>>>
>>>> This creates new source code dependencies, which I would rather avoid if
>>>> possible.
>>>
>>> Just a thought, you could have an is_efi_mm() function that abstracts all that.
>>>
>>> diff --git a/include/linux/efi.h b/include/linux/efi.h
>>> index c74f47711f0b..152f5fa66a2a 100644
>>> --- a/include/linux/efi.h
>>> +++ b/include/linux/efi.h
>>> @@ -692,6 +692,15 @@ extern struct efi {
>>>
>>>   extern struct mm_struct efi_mm;
>>>
>>> +static inline bool is_efi_mm(struct mm_struct *mm)
>>> +{
>>> +#ifdef CONFIG_EFI
>>> +       return mm == &efi_mm;
>>> +#else
>>> +       return false;
>>> +#endif
>>> +}
>>> +
>>>   static inline int
>>>   efi_guidcmp (efi_guid_t left, efi_guid_t right)
>>>   {
>>>
>>>
>>
>> That would definitely work, but in that case, I might as well just check for it
>> in mm_is_user() (and personally I would change the name to mm_is_efi()):
>>
>>
>> static inline bool mm_is_user(struct mm_struct *mm)
>> {
>>          return mm != &init_mm && !mm_is_efi(mm);
>> }
>>
>> Any objections?
>>
> 
> Any reason not to use IS_ENABLED(CONFIG_EFI) in the above? The extern
> declaration is visible to the compiler, and any references should
> disappear before the linker could notice that efi_mm does not exist.
> 

Sure, as long as the linker is happy why not. I'll let Ryan mess with 
that :)

> In any case, feel free to add
> 
> Acked-by: Ard Biesheuvel <ardb@kernel.org>

Thanks for the review.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 240+ messages in thread
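
The debug check David floats earlier in the thread could then be written
against the helper rather than against efi_mm directly, avoiding the extra
source dependency Ryan is worried about. A hypothetical example only; the
thread does not say which function such a check would live in:

/* Hypothetical placement: an arm64 contpte entry point that should only
 * ever be reached for user mms. */
static void contpte_check_mm(struct mm_struct *mm)
{
        /* Warn if the "official" ptep API is ever handed the efi_mm. */
        VM_WARN_ON(mm_is_efi(mm));
}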

* Re: [PATCH v5 19/25] arm64/mm: Wire up PTE_CONT for user mappings
  2024-02-13 13:45                         ` David Hildenbrand
@ 2024-02-13 14:02                           ` Ryan Roberts
  -1 siblings, 0 replies; 240+ messages in thread
From: Ryan Roberts @ 2024-02-13 14:02 UTC (permalink / raw)
  To: David Hildenbrand, Ard Biesheuvel
  Cc: Mark Rutland, Catalin Marinas, Will Deacon, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	Kefeng Wang, John Hubbard, Zi Yan, Barry Song, Alistair Popple,
	Yang Shi, Nicholas Piggin, Christophe Leroy, Aneesh Kumar K.V,
	Naveen N. Rao, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, linux-arm-kernel, x86, linuxppc-dev,
	linux-mm, linux-kernel

On 13/02/2024 13:45, David Hildenbrand wrote:
> On 13.02.24 14:33, Ard Biesheuvel wrote:
>> On Tue, 13 Feb 2024 at 14:21, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>
>>> On 13/02/2024 13:13, David Hildenbrand wrote:
>>>> On 13.02.24 14:06, Ryan Roberts wrote:
>>>>> On 13/02/2024 12:19, David Hildenbrand wrote:
>>>>>> On 13.02.24 13:06, Ryan Roberts wrote:
>>>>>>> On 12/02/2024 20:38, Ryan Roberts wrote:
>>>>>>>> [...]
>>>>>>>>
>>>>>>>>>>>> +static inline bool mm_is_user(struct mm_struct *mm)
>>>>>>>>>>>> +{
>>>>>>>>>>>> +    /*
>>>>>>>>>>>> +     * Don't attempt to apply the contig bit to kernel mappings,
>>>>>>>>>>>> because
>>>>>>>>>>>> +     * dynamically adding/removing the contig bit can cause page
>>>>>>>>>>>> faults.
>>>>>>>>>>>> +     * These racing faults are ok for user space, since they get
>>>>>>>>>>>> serialized
>>>>>>>>>>>> +     * on the PTL. But kernel mappings can't tolerate faults.
>>>>>>>>>>>> +     */
>>>>>>>>>>>> +    return mm != &init_mm;
>>>>>>>>>>>> +}
>>>>>>>>>>>
>>>>>>>>>>> We also have the efi_mm as a non-user mm, though I don't think we
>>>>>>>>>>> manipulate
>>>>>>>>>>> that while it is live, and I'm not sure if that needs any special
>>>>>>>>>>> handling.
>>>>>>>>>>
>>>>>>>>>> Well we never need this function in the hot (order-0 folio) path, so I
>>>>>>>>>> think I
>>>>>>>>>> could add a check for efi_mm here with performance implication. It's
>>>>>>>>>> probably
>>>>>>>>>> safest to explicitly exclude it? What do you think?
>>>>>>>>>
>>>>>>>>> Oops: This should have read "I think I could add a check for efi_mm here
>>>>>>>>> *without* performance implication"
>>>>>>>>
>>>>>>>> It turns out that efi_mm is only defined when CONFIG_EFI is enabled. I
>>>>>>>> can do
>>>>>>>> this:
>>>>>>>>
>>>>>>>> return mm != &init_mm && (!IS_ENABLED(CONFIG_EFI) || mm != &efi_mm);
>>>>>>>>
>>>>>>>> Is that acceptable? This is my preference, but nothing else outside of efi
>>>>>>>> references this symbol currently.
>>>>>>>>
>>>>>>>> Or perhaps I can convince myself that it's safe to treat efi_mm like
>>>>>>>> userspace.
>>>>>>>> There are a couple of things that need to be guaranteed for it to be safe:
>>>>>>>>
>>>>>>>>      - The PFNs of present ptes either need to have an associated struct
>>>>>>>> page or
>>>>>>>>        need to have the PTE_SPECIAL bit set (either pte_mkspecial() or
>>>>>>>>        pte_mkdevmap())
>>>>>>>>
>>>>>>>>      - Live mappings must either be static (no changes that could cause
>>>>>>>> fold/unfold
>>>>>>>>        while live) or the system must be able to tolerate a temporary fault
>>>>>>>>
>>>>>>>> Mark suggests efi_mm is not manipulated while live, so that meets the
>>>>>>>> latter
>>>>>>>> requirement, but I'm not sure about the former?
>>>>>>>
>>>>>>> I've gone through all the efi code, and conclude that, as Mark suggests, the
>>>>>>> mappings are indeed static. And additionally, the ptes are populated
>>>>>>> using only
>>>>>>> the _private_ ptep API, so there is no issue here. As just discussed with
>>>>>>> Mark,
>>>>>>> my preference is to not make any changes to code, and just add a comment
>>>>>>> describing why efi_mm is safe.
>>>>>>>
>>>>>>> Details:
>>>>>>>
>>>>>>> * Registered with ptdump
>>>>>>>        * ptep_get_lockless()
>>>>>>> * efi_create_mapping -> create_pgd_mapping … -> init_pte:
>>>>>>>        * __ptep_get()
>>>>>>>        * __set_pte()
>>>>>>> * efi_memattr_apply_permissions -> efi_set_mapping_permissions … ->
>>>>>>> set_permissions
>>>>>>>        * __ptep_get()
>>>>>>>        * __set_pte()
>>>>>>
>>>>>> Sounds good. We could add some VM_WARN_ON if we ever get the efi_mm via the
>>>>>> "official" APIs.
>>>>>
>>>>> We could, but that would lead to the same linkage issue, which I'm trying to
>>>>> avoid in the first place:
>>>>>
>>>>> VM_WARN_ON(IS_ENABLED(CONFIG_EFI) && mm == &efi_mm);
>>>>>
>>>>> This creates new source code dependencies, which I would rather avoid if
>>>>> possible.
>>>>
>>>> Just a thought, you could have an is_efi_mm() function that abstracts all that.
>>>>
>>>> diff --git a/include/linux/efi.h b/include/linux/efi.h
>>>> index c74f47711f0b..152f5fa66a2a 100644
>>>> --- a/include/linux/efi.h
>>>> +++ b/include/linux/efi.h
>>>> @@ -692,6 +692,15 @@ extern struct efi {
>>>>
>>>>   extern struct mm_struct efi_mm;
>>>>
>>>> +static inline bool is_efi_mm(struct mm_struct *mm)
>>>> +{
>>>> +#ifdef CONFIG_EFI
>>>> +       return mm == &efi_mm;
>>>> +#else
>>>> +       return false;
>>>> +#endif
>>>> +}
>>>> +
>>>>   static inline int
>>>>   efi_guidcmp (efi_guid_t left, efi_guid_t right)
>>>>   {
>>>>
>>>>
>>>
>>> That would definitely work, but in that case, I might as well just check for it
>>> in mm_is_user() (and personally I would change the name to mm_is_efi()):
>>>
>>>
>>> static inline bool mm_is_user(struct mm_struct *mm)
>>> {
>>>          return mm != &init_mm && !mm_is_efi(mm);
>>> }
>>>
>>> Any objections?
>>>
>>
>> Any reason not to use IS_ENABLED(CONFIG_EFI) in the above? The extern
>> declaration is visible to the compiler, and any references should
>> disappear before the linker could notice that efi_mm does not exist.
>>
> 
> Sure, as long as the linker is happy why not. I'll let Ryan mess with that :)

I'm not sure if you are suggesting dropping the mm_is_efi() helper and just using
IS_ENABLED(CONFIG_EFI) in mm_is_user() to guard efi_mm, or if you are suggesting
using IS_ENABLED(CONFIG_EFI) in mm_is_efi() instead of the ifdefery?

The former was what I did initially; it works great, but I didn't like that I
was introducing a new code dependency between efi and arm64 (nothing else outside
of efi references efi_mm).

So I then concluded that it is safe to not worry about efi_mm (thanks for your
confirmation). But then David wanted a VM_WARN_ON check, which reintroduces the
code dependency. So he suggested the mm_is_efi() helper to hide that... This is
all starting to feel circular...

Since I've just updated the code to do it David's way, I propose leaving it as
is since nobody has strong feelings.

> 
>> In any case, feel free to add
>>
>> Acked-by: Ard Biesheuvel <ardb@kernel.org>

Great thanks!

> 
> Thanks for the review.
> 


^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH v5 19/25] arm64/mm: Wire up PTE_CONT for user mappings
  2024-02-13 14:02                           ` Ryan Roberts
  (?)
@ 2024-02-13 14:05                             ` David Hildenbrand
  -1 siblings, 0 replies; 240+ messages in thread
From: David Hildenbrand @ 2024-02-13 14:05 UTC (permalink / raw)
  To: Ryan Roberts, Ard Biesheuvel
  Cc: Mark Rutland, Catalin Marinas, Will Deacon, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	Kefeng Wang, John Hubbard, Zi Yan, Barry Song, Alistair Popple,
	Yang Shi, Nicholas Piggin, Christophe Leroy, Aneesh Kumar K.V,
	Naveen N. Rao, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, linux-arm-kernel, x86, linuxppc-dev,
	linux-mm, linux-kernel

On 13.02.24 15:02, Ryan Roberts wrote:
> On 13/02/2024 13:45, David Hildenbrand wrote:
>> On 13.02.24 14:33, Ard Biesheuvel wrote:
>>> On Tue, 13 Feb 2024 at 14:21, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> On 13/02/2024 13:13, David Hildenbrand wrote:
>>>>> On 13.02.24 14:06, Ryan Roberts wrote:
>>>>>> On 13/02/2024 12:19, David Hildenbrand wrote:
>>>>>>> On 13.02.24 13:06, Ryan Roberts wrote:
>>>>>>>> On 12/02/2024 20:38, Ryan Roberts wrote:
>>>>>>>>> [...]
>>>>>>>>>
>>>>>>>>>>>>> +static inline bool mm_is_user(struct mm_struct *mm)
>>>>>>>>>>>>> +{
>>>>>>>>>>>>> +    /*
>>>>>>>>>>>>> +     * Don't attempt to apply the contig bit to kernel mappings,
>>>>>>>>>>>>> because
>>>>>>>>>>>>> +     * dynamically adding/removing the contig bit can cause page
>>>>>>>>>>>>> faults.
>>>>>>>>>>>>> +     * These racing faults are ok for user space, since they get
>>>>>>>>>>>>> serialized
>>>>>>>>>>>>> +     * on the PTL. But kernel mappings can't tolerate faults.
>>>>>>>>>>>>> +     */
>>>>>>>>>>>>> +    return mm != &init_mm;
>>>>>>>>>>>>> +}
>>>>>>>>>>>>
>>>>>>>>>>>> We also have the efi_mm as a non-user mm, though I don't think we
>>>>>>>>>>>> manipulate
>>>>>>>>>>>> that while it is live, and I'm not sure if that needs any special
>>>>>>>>>>>> handling.
>>>>>>>>>>>
>>>>>>>>>>> Well we never need this function in the hot (order-0 folio) path, so I
>>>>>>>>>>> think I
>>>>>>>>>>> could add a check for efi_mm here with performance implication. It's
>>>>>>>>>>> probably
>>>>>>>>>>> safest to explicitly exclude it? What do you think?
>>>>>>>>>>
>>>>>>>>>> Oops: This should have read "I think I could add a check for efi_mm here
>>>>>>>>>> *without* performance implication"
>>>>>>>>>
>>>>>>>>> It turns out that efi_mm is only defined when CONFIG_EFI is enabled, so I
>>>>>>>>> can do
>>>>>>>>> this:
>>>>>>>>>
>>>>>>>>> return mm != &init_mm && (!IS_ENABLED(CONFIG_EFI) || mm != &efi_mm);
>>>>>>>>>
>>>>>>>>> Is that acceptable? This is my preference, but nothing else outside of efi
>>>>>>>>> references this symbol currently.
>>>>>>>>>
>>>>>>>>> Or perhaps I can convince myself that it's safe to treat efi_mm like
>>>>>>>>> userspace.
>>>>>>>>> There are a couple of things that need to be guaranteed for it to be safe:
>>>>>>>>>
>>>>>>>>>       - The PFNs of present ptes either need to have an associated struct
>>>>>>>>> page or
>>>>>>>>>         need to have the PTE_SPECIAL bit set (either pte_mkspecial() or
>>>>>>>>>         pte_mkdevmap())
>>>>>>>>>
>>>>>>>>>       - Live mappings must either be static (no changes that could cause
>>>>>>>>> fold/unfold
>>>>>>>>>         while live) or the system must be able to tolerate a temporary fault
>>>>>>>>>
>>>>>>>>> Mark suggests efi_mm is not manipulated while live, so that meets the
>>>>>>>>> latter
>>>>>>>>> requirement, but I'm not sure about the former?
>>>>>>>>
>>>>>>>> I've gone through all the efi code, and conclude that, as Mark suggests, the
>>>>>>>> mappings are indeed static. And additionally, the ptes are populated
>>>>>>>> using only
>>>>>>>> the _private_ ptep API, so there is no issue here. As just discussed with
>>>>>>>> Mark,
>>>>>>>> my preference is to not make any changes to code, and just add a comment
>>>>>>>> describing why efi_mm is safe.
>>>>>>>>
>>>>>>>> Details:
>>>>>>>>
>>>>>>>> * Registered with ptdump
>>>>>>>>         * ptep_get_lockless()
>>>>>>>> * efi_create_mapping -> create_pgd_mapping … -> init_pte:
>>>>>>>>         * __ptep_get()
>>>>>>>>         * __set_pte()
>>>>>>>> * efi_memattr_apply_permissions -> efi_set_mapping_permissions … ->
>>>>>>>> set_permissions
>>>>>>>>         * __ptep_get()
>>>>>>>>         * __set_pte()
>>>>>>>
>>>>>>> Sounds good. We could add some VM_WARN_ON if we ever get the efi_mm via the
>>>>>>> "official" APIs.
>>>>>>
>>>>>> We could, but that would lead to the same linkage issue, which I'm trying to
>>>>>> avoid in the first place:
>>>>>>
>>>>>> VM_WARN_ON(IS_ENABLED(CONFIG_EFI) && mm == &efi_mm);
>>>>>>
>>>>>> This creates new source code dependencies, which I would rather avoid if
>>>>>> possible.
>>>>>
>>>>> Just a thought, you could have a is_efi_mm() function that abstracts all that.
>>>>>
>>>>> diff --git a/include/linux/efi.h b/include/linux/efi.h
>>>>> index c74f47711f0b..152f5fa66a2a 100644
>>>>> --- a/include/linux/efi.h
>>>>> +++ b/include/linux/efi.h
>>>>> @@ -692,6 +692,15 @@ extern struct efi {
>>>>>
>>>>>    extern struct mm_struct efi_mm;
>>>>>
>>>>> +static inline bool is_efi_mm(struct mm_struct *mm)
>>>>> +{
>>>>> +#ifdef CONFIG_EFI
>>>>> +       return mm == &efi_mm;
>>>>> +#else
>>>>> +       return false;
>>>>> +#endif
>>>>> +}
>>>>> +
>>>>>    static inline int
>>>>>    efi_guidcmp (efi_guid_t left, efi_guid_t right)
>>>>>    {
>>>>>
>>>>>
>>>>
>>>> That would definitely work, but in that case, I might as well just check for it
>>>> in mm_is_user() (and personally I would change the name to mm_is_efi()):
>>>>
>>>>
>>>> static inline bool mm_is_user(struct mm_struct *mm)
>>>> {
>>>>           return mm != &init_mm && !mm_is_efi(mm);
>>>> }
>>>>
>>>> Any objections?
>>>>
>>>
>>> Any reason not to use IS_ENABLED(CONFIG_EFI) in the above? The extern
>>> declaration is visible to the compiler, and any references should
>>> disappear before the linker could notice that efi_mm does not exist.
>>>
>>
>> Sure, as long as the linker is happy why not. I'll let Ryan mess with that :)
> 
> I'm not sure if you are suggesting dropping the mm_is_efi() helper and just using
> IS_ENABLED(CONFIG_EFI) in mm_is_user() to guard efi_mm, or if you are suggesting
> using IS_ENABLED(CONFIG_EFI) in mm_is_efi() instead of the ifdefery?
> 
> The former was what I did initially; it works great, but I didn't like that I
> was introducing a new code dependency between efi and arm64 (nothing else outside
> of efi references efi_mm).
> 
> So I then concluded that it is safe to not worry about efi_mm (thanks for your
> confirmation). But then David wanted a VM_WARN check, which reintroduces the
> code dependency. So he suggested the mm_is_efi() helper to hide that... This is
> all starting to feel circular...

I think Ard meant that inside mm_is_efi(), we could avoid the #ifdef and 
simply use IS_ENABLED().

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH v5 19/25] arm64/mm: Wire up PTE_CONT for user mappings
  2024-02-13 14:05                             ` David Hildenbrand
  (?)
@ 2024-02-13 14:08                               ` Ard Biesheuvel
  -1 siblings, 0 replies; 240+ messages in thread
From: Ard Biesheuvel @ 2024-02-13 14:08 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Ryan Roberts, Mark Rutland, Catalin Marinas, Will Deacon,
	Marc Zyngier, James Morse, Andrey Ryabinin, Andrew Morton,
	Matthew Wilcox, Kefeng Wang, John Hubbard, Zi Yan, Barry Song,
	Alistair Popple, Yang Shi, Nicholas Piggin, Christophe Leroy,
	Aneesh Kumar K.V, Naveen N. Rao, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, linux-arm-kernel,
	x86, linuxppc-dev, linux-mm, linux-kernel

On Tue, 13 Feb 2024 at 15:05, David Hildenbrand <david@redhat.com> wrote:
>
> On 13.02.24 15:02, Ryan Roberts wrote:
> > On 13/02/2024 13:45, David Hildenbrand wrote:
> >> On 13.02.24 14:33, Ard Biesheuvel wrote:
> >>> On Tue, 13 Feb 2024 at 14:21, Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>
> >>>> On 13/02/2024 13:13, David Hildenbrand wrote:
...
> >>>>> Just a thought, you could have a is_efi_mm() function that abstracts all that.
> >>>>>
> >>>>> diff --git a/include/linux/efi.h b/include/linux/efi.h
> >>>>> index c74f47711f0b..152f5fa66a2a 100644
> >>>>> --- a/include/linux/efi.h
> >>>>> +++ b/include/linux/efi.h
> >>>>> @@ -692,6 +692,15 @@ extern struct efi {
> >>>>>
> >>>>>    extern struct mm_struct efi_mm;
> >>>>>
> >>>>> +static inline bool is_efi_mm(struct mm_struct *mm)
> >>>>> +{
> >>>>> +#ifdef CONFIG_EFI
> >>>>> +       return mm == &efi_mm;
> >>>>> +#else
> >>>>> +       return false;
> >>>>> +#endif
> >>>>> +}
> >>>>> +
> >>>>>    static inline int
> >>>>>    efi_guidcmp (efi_guid_t left, efi_guid_t right)
> >>>>>    {
> >>>>>
> >>>>>
> >>>>
> >>>> That would definitely work, but in that case, I might as well just check for it
> >>>> in mm_is_user() (and personally I would change the name to mm_is_efi()):
> >>>>
> >>>>
> >>>> static inline bool mm_is_user(struct mm_struct *mm)
> >>>> {
> >>>>           return mm != &init_mm && !mm_is_efi(mm);
> >>>> }
> >>>>
> >>>> Any objections?
> >>>>
> >>>
> >>> Any reason not to use IS_ENABLED(CONFIG_EFI) in the above? The extern
> >>> declaration is visible to the compiler, and any references should
> >>> disappear before the linker could notice that efi_mm does not exist.
> >>>
> >>
> >> Sure, as long as the linker is happy why not. I'll let Ryan mess with that :)
> >
> > I'm not sure if you are suggesting dropping the mm_is_efi() helper and just using
> > IS_ENABLED(CONFIG_EFI) in mm_is_user() to guard efi_mm, or if you are suggesting
> > using IS_ENABLED(CONFIG_EFI) in mm_is_efi() instead of the ifdefery?
> >
> > The former was what I did initially; it works great, but I didn't like that I
> > was introducing a new code dependency between efi and arm64 (nothing else outside
> > of efi references efi_mm).
> >
> > So I then concluded that it is safe to not worry about efi_mm (thanks for your
> > confirmation). But then David wanted a VM_WARN check, which reintroduces the
> > code dependency. So he suggested the mm_is_efi() helper to hide that... This is
> > all starting to feel circular...
>
> I think Ard meant that inside mm_is_efi(), we could avoid the #ifdef and
> simply use IS_ENABLED().
>

Yes.

static inline bool mm_is_efi(struct mm_struct *mm)
{
    return IS_ENABLED(CONFIG_EFI) && mm == &efi_mm;
}

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH v5 19/25] arm64/mm: Wire up PTE_CONT for user mappings
  2024-02-13 14:08                               ` Ard Biesheuvel
  (?)
@ 2024-02-13 14:21                                 ` Ryan Roberts
  -1 siblings, 0 replies; 240+ messages in thread
From: Ryan Roberts @ 2024-02-13 14:21 UTC (permalink / raw)
  To: Ard Biesheuvel, David Hildenbrand
  Cc: Mark Rutland, Catalin Marinas, Will Deacon, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	Kefeng Wang, John Hubbard, Zi Yan, Barry Song, Alistair Popple,
	Yang Shi, Nicholas Piggin, Christophe Leroy, Aneesh Kumar K.V,
	Naveen N. Rao, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, linux-arm-kernel, x86, linuxppc-dev,
	linux-mm, linux-kernel

On 13/02/2024 14:08, Ard Biesheuvel wrote:
> On Tue, 13 Feb 2024 at 15:05, David Hildenbrand <david@redhat.com> wrote:
>>
>> On 13.02.24 15:02, Ryan Roberts wrote:
>>> On 13/02/2024 13:45, David Hildenbrand wrote:
>>>> On 13.02.24 14:33, Ard Biesheuvel wrote:
>>>>> On Tue, 13 Feb 2024 at 14:21, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>
>>>>>> On 13/02/2024 13:13, David Hildenbrand wrote:
> ...
>>>>>>> Just a thought, you could have a is_efi_mm() function that abstracts all that.
>>>>>>>
>>>>>>> diff --git a/include/linux/efi.h b/include/linux/efi.h
>>>>>>> index c74f47711f0b..152f5fa66a2a 100644
>>>>>>> --- a/include/linux/efi.h
>>>>>>> +++ b/include/linux/efi.h
>>>>>>> @@ -692,6 +692,15 @@ extern struct efi {
>>>>>>>
>>>>>>>    extern struct mm_struct efi_mm;
>>>>>>>
>>>>>>> +static inline bool is_efi_mm(struct mm_struct *mm)
>>>>>>> +{
>>>>>>> +#ifdef CONFIG_EFI
>>>>>>> +       return mm == &efi_mm;
>>>>>>> +#else
>>>>>>> +       return false;
>>>>>>> +#endif
>>>>>>> +}
>>>>>>> +
>>>>>>>    static inline int
>>>>>>>    efi_guidcmp (efi_guid_t left, efi_guid_t right)
>>>>>>>    {
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> That would definitely work, but in that case, I might as well just check for it
>>>>>> in mm_is_user() (and personally I would change the name to mm_is_efi()):
>>>>>>
>>>>>>
>>>>>> static inline bool mm_is_user(struct mm_struct *mm)
>>>>>> {
>>>>>>           return mm != &init_mm && !mm_is_efi(mm);
>>>>>> }
>>>>>>
>>>>>> Any objections?
>>>>>>
>>>>>
>>>>> Any reason not to use IS_ENABLED(CONFIG_EFI) in the above? The extern
>>>>> declaration is visible to the compiler, and any references should
>>>>> disappear before the linker could notice that efi_mm does not exist.
>>>>>
>>>>
>>>> Sure, as long as the linker is happy why not. I'll let Ryan mess with that :)
>>>
>>> I'm not sure if you are suggesting dropping the mm_is_efi() helper and just using
>>> IS_ENABLED(CONFIG_EFI) in mm_is_user() to guard efi_mm, or if you are suggesting
>>> using IS_ENABLED(CONFIG_EFI) in mm_is_efi() instead of the ifdefery?
>>>
>>> The former was what I did initially; it works great, but I didn't like that I
>>> was introducing a new code dependency between efi and arm64 (nothing else outside
>>> of efi references efi_mm).
>>>
>>> So I then concluded that it is safe to not worry about efi_mm (thanks for your
>>> confirmation). But then David wanted a VM_WARN check, which reintroduces the
>>> code dependency. So he suggested the mm_is_efi() helper to hide that... This is
>>> all starting to feel circular...
>>
>> I think Ard meant that inside mm_is_efi(), we could avoid the #ifdef and
>> simply use IS_ENABLED().
>>
> 
> Yes.
> 
> static inline bool mm_is_efi(struct mm_struct *mm)
> {
>     return IS_ENABLED(CONFIG_EFI) && mm == &efi_mm;
> }

Ahh, got it. Yes, I'll do it like this. Thanks!


^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH v5 19/25] arm64/mm: Wire up PTE_CONT for user mappings
  2024-02-12 16:24                 ` David Hildenbrand
  (?)
@ 2024-02-13 15:29                   ` Ryan Roberts
  -1 siblings, 0 replies; 240+ messages in thread
From: Ryan Roberts @ 2024-02-13 15:29 UTC (permalink / raw)
  To: David Hildenbrand, Mark Rutland
  Cc: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	Kefeng Wang, John Hubbard, Zi Yan, Barry Song, Alistair Popple,
	Yang Shi, Nicholas Piggin, Christophe Leroy, Aneesh Kumar K.V,
	Naveen N. Rao, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, linux-arm-kernel, x86, linuxppc-dev,
	linux-mm, linux-kernel

On 12/02/2024 16:24, David Hildenbrand wrote:
> On 12.02.24 16:34, Ryan Roberts wrote:
>> On 12/02/2024 15:26, David Hildenbrand wrote:
>>> On 12.02.24 15:45, Ryan Roberts wrote:
>>>> On 12/02/2024 13:54, David Hildenbrand wrote:
>>>>>>> If so, I wonder if we could instead do that comparison modulo the
>>>>>>> access/dirty
>>>>>>> bits,
>>>>>>
>>>>>> I think that would work - but will need to think a bit more on it.
>>>>>>
>>>>>>> and leave ptep_get_lockless() only reading a single entry?
>>>>>>
>>>>>> I think we will need to do something a bit less fragile. ptep_get() does
>>>>>> collect
>>>>>> the access/dirty bits so it's confusing if ptep_get_lockless() doesn't
>>>>>> IMHO. So
>>>>>> we will likely want to rename the function and make its documentation
>>>>>> explicit
>>>>>> that it does not return those bits.
>>>>>>
>>>>>> ptep_get_lockless_noyoungdirty()? yuk... Any ideas?
>>>>>>
>>>>>> Of course if I could convince you the current implementation is safe, I
>>>>>> might be
>>>>>> able to sidestep this optimization until a later date?
>>>>>
>>>>> As discussed (and pointed out above), there might be quite some callsites
>>>>> where
>>>>> we don't really care about uptodate accessed/dirty bits -- where ptep_get() is
>>>>> used nowadays.
>>>>>
>>>>> One way to approach that I had in mind was having an explicit interface:
>>>>>
>>>>> ptep_get()
>>>>> ptep_get_uptodate()
>>>>> ptep_get_lockless()
>>>>> ptep_get_lockless_uptodate()
>>>>
>>>> Yes, I like the direction of this. I guess we anticipate that call sites
>>>> requiring the "_uptodate" variant will be the minority so it makes sense to use
>>>> the current names for the "_not_uptodate" variants? But to do a slow migration,
>>>> it might be better/safer to have the weaker variant use the new name - that
>>>> would allow us to downgrade one at a time?
>>>
>>> Yes, I was primarily struggling with names. Likely it makes sense to either have
>>> two completely new function names, or use the new name only for the "faster but
>>> less precise" variant.
>>>
>>>>
>>>>>
>>>>> Especially the last one might not be needed.
>>>> I've done a scan through the code and agree with Mark's original conclusions.
>>>> Additionally, huge_pte_alloc() (which isn't used for arm64) doesn't rely on
>>>> access/dirty info. So I think I could migrate everything to the weaker variant
>>>> fairly easily.
>>>>
>>>>>
>>>>> Further, "uptodate" might not be the best choice because of PageUptodate() and
>>>>> friends. But it's better than "youngdirty"/"noyoungdirty" IMHO.
>>>>
>>>> Certainly agree with "noyoungdirty" being a horrible name. How about "_sync" /
>>>> "_nosync"?
>>>
>>> I could live with
>>>
>>> ptep_get_sync()
>>> ptep_get_nosync()
>>>
>>> with proper documentation :)
>>
>> but could you live with:
>>
>> ptep_get()
>> ptep_get_nosync()
>> ptep_get_lockless_nosync()
>>
>> ?
>>
>> So leave the "slower, more precise" version with the existing name.
> 
> Sure.
> 

I'm just implementing this (as a separate RFC), and had an alternative idea for
naming/semantics:

ptep_get()
ptep_get_norecency()
ptep_get_lockless()
ptep_get_lockless_norecency()

The "_norecency" versions explicitly clear the access/dirty bits. This is useful
for the "compare to original pte to check we are not racing" pattern:

pte = ptep_get_lockless_norecency(ptep)
...
<lock>
if (!pte_same(pte, ptep_get_norecency(ptep)))
	// RACE!
...
<unlock>

With the "_nosync" semantic, the access/dirty bits may or may not be set, so the
user has to explicitly clear them to do the comparison. (although I considered a
pte_same_nosync() that would clear the bits for you - but that name is pretty naff).

Although the _norecency semantic requires the bits to always be explicitly
cleared, and so may be infinitesimally slower, it gives a very clear expectation
that the access/dirty bits are always clear, and I think that's conveyed well in
the name too.
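
For concreteness, the generic fallbacks could be as simple as the sketch below
(the _norecency names are only this proposal, not an existing API; the arm64
contpte versions would instead avoid gathering the bits across the block in the
first place):

static inline pte_t ptep_get_norecency(pte_t *ptep)
{
	/* Clear access/dirty so callers can pte_same() against a later read. */
	return pte_mkold(pte_mkclean(ptep_get(ptep)));
}

static inline pte_t ptep_get_lockless_norecency(pte_t *ptep)
{
	/* Same idea for the lockless variant. */
	return pte_mkold(pte_mkclean(ptep_get_lockless(ptep)));
}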

Thoughts?


^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH v5 20/25] arm64/mm: Implement new wrprotect_ptes() batch API
  2024-02-02  8:07   ` Ryan Roberts
  (?)
@ 2024-02-13 16:31     ` Mark Rutland
  -1 siblings, 0 replies; 240+ messages in thread
From: Mark Rutland @ 2024-02-13 16:31 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	David Hildenbrand, Kefeng Wang, John Hubbard, Zi Yan, Barry Song,
	Alistair Popple, Yang Shi, Nicholas Piggin, Christophe Leroy,
	Aneesh Kumar K.V, Naveen N. Rao, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, linux-arm-kernel,
	x86, linuxppc-dev, linux-mm, linux-kernel

On Fri, Feb 02, 2024 at 08:07:51AM +0000, Ryan Roberts wrote:
> Optimize the contpte implementation to fix some of the fork performance
> regression introduced by the initial contpte commit. Subsequent patches
> will solve it entirely.
> 
> During fork(), any private memory in the parent must be write-protected.
> Previously this was done 1 PTE at a time. But the core-mm supports
> batched wrprotect via the new wrprotect_ptes() API. So let's implement
> that API and for fully covered contpte mappings, we no longer need to
> unfold the contpte. This has 2 benefits:
> 
>   - reduced unfolding, reduces the number of tlbis that must be issued.
>   - The memory remains contpte-mapped ("folded") in the parent, so it
>     continues to benefit from the more efficient use of the TLB after
>     the fork.
> 
> The optimization to wrprotect a whole contpte block without unfolding is
> possible thanks to the tightening of the Arm ARM in respect to the
> definition and behaviour when 'Misprogramming the Contiguous bit'. See
> section D21194 at https://developer.arm.com/documentation/102105/latest/

Minor nit, but it'd be better to refer to a specific revision of the document,
e.g.

  https://developer.arm.com/documentation/102105/ja-07/

That way people can see the specific version of the text you were referring to
even if that changes later, and it means the link is still useful when D21194
gets merged into the ARM ARM and dropped from the known issues doc.

> 
> Tested-by: John Hubbard <jhubbard@nvidia.com>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  arch/arm64/include/asm/pgtable.h | 61 ++++++++++++++++++++++++++------
>  arch/arm64/mm/contpte.c          | 35 ++++++++++++++++++
>  2 files changed, 86 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 34892a95403d..c07f0d563733 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -978,16 +978,12 @@ static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm,
>  }
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>  
> -/*
> - * __ptep_set_wrprotect - mark read-only while trasferring potential hardware
> - * dirty status (PTE_DBM && !PTE_RDONLY) to the software PTE_DIRTY bit.
> - */
> -static inline void __ptep_set_wrprotect(struct mm_struct *mm,
> -					unsigned long address, pte_t *ptep)
> +static inline void ___ptep_set_wrprotect(struct mm_struct *mm,
> +					unsigned long address, pte_t *ptep,
> +					pte_t pte)
>  {
> -	pte_t old_pte, pte;
> +	pte_t old_pte;
>  
> -	pte = __ptep_get(ptep);
>  	do {
>  		old_pte = pte;
>  		pte = pte_wrprotect(pte);
> @@ -996,6 +992,25 @@ static inline void __ptep_set_wrprotect(struct mm_struct *mm,
>  	} while (pte_val(pte) != pte_val(old_pte));
>  }
>  
> +/*
> + * __ptep_set_wrprotect - mark read-only while trasferring potential hardware
> + * dirty status (PTE_DBM && !PTE_RDONLY) to the software PTE_DIRTY bit.
> + */
> +static inline void __ptep_set_wrprotect(struct mm_struct *mm,
> +					unsigned long address, pte_t *ptep)
> +{
> +	___ptep_set_wrprotect(mm, address, ptep, __ptep_get(ptep));
> +}
> +
> +static inline void __wrprotect_ptes(struct mm_struct *mm, unsigned long address,
> +				pte_t *ptep, unsigned int nr)
> +{
> +	unsigned int i;
> +
> +	for (i = 0; i < nr; i++, address += PAGE_SIZE, ptep++)
> +		__ptep_set_wrprotect(mm, address, ptep);
> +}
> +
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  #define __HAVE_ARCH_PMDP_SET_WRPROTECT
>  static inline void pmdp_set_wrprotect(struct mm_struct *mm,
> @@ -1156,6 +1171,8 @@ extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>  				unsigned long addr, pte_t *ptep);
>  extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>  				unsigned long addr, pte_t *ptep);
> +extern void contpte_wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
> +				pte_t *ptep, unsigned int nr);
>  extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
>  				unsigned long addr, pte_t *ptep,
>  				pte_t entry, int dirty);
> @@ -1269,12 +1286,35 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>  	return contpte_ptep_clear_flush_young(vma, addr, ptep);
>  }
>  
> +#define wrprotect_ptes wrprotect_ptes
> +static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
> +				pte_t *ptep, unsigned int nr)
> +{
> +	if (likely(nr == 1)) {
> +		/*
> +		 * Optimization: wrprotect_ptes() can only be called for present
> +		 * ptes so we only need to check contig bit as condition for
> +		 * unfold, and we can remove the contig bit from the pte we read
> +		 * to avoid re-reading. This speeds up fork() which is sensitive
> +		 * for order-0 folios. Equivalent to contpte_try_unfold().
> +		 */
> +		pte_t orig_pte = __ptep_get(ptep);
> +
> +		if (unlikely(pte_cont(orig_pte))) {
> +			__contpte_try_unfold(mm, addr, ptep, orig_pte);
> +			orig_pte = pte_mknoncont(orig_pte);
> +		}
> +		___ptep_set_wrprotect(mm, addr, ptep, orig_pte);
> +	} else {
> +		contpte_wrprotect_ptes(mm, addr, ptep, nr);
> +	}
> +}
> +
>  #define __HAVE_ARCH_PTEP_SET_WRPROTECT
>  static inline void ptep_set_wrprotect(struct mm_struct *mm,
>  				unsigned long addr, pte_t *ptep)
>  {
> -	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
> -	__ptep_set_wrprotect(mm, addr, ptep);
> +	wrprotect_ptes(mm, addr, ptep, 1);
>  }
>  
>  #define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
> @@ -1306,6 +1346,7 @@ static inline int ptep_set_access_flags(struct vm_area_struct *vma,
>  #define ptep_clear_flush_young			__ptep_clear_flush_young
>  #define __HAVE_ARCH_PTEP_SET_WRPROTECT
>  #define ptep_set_wrprotect			__ptep_set_wrprotect
> +#define wrprotect_ptes				__wrprotect_ptes
>  #define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
>  #define ptep_set_access_flags			__ptep_set_access_flags
>  
> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> index bfb50e6b44c7..c85e64baf03b 100644
> --- a/arch/arm64/mm/contpte.c
> +++ b/arch/arm64/mm/contpte.c
> @@ -23,6 +23,23 @@ static inline pte_t *contpte_align_down(pte_t *ptep)
>  	return (pte_t *)(ALIGN_DOWN((unsigned long)ptep >> 3, CONT_PTES) << 3);
>  }
>  
> +static void contpte_try_unfold_partial(struct mm_struct *mm, unsigned long addr,
> +					pte_t *ptep, unsigned int nr)
> +{
> +	/*
> +	 * Unfold any partially covered contpte block at the beginning and end
> +	 * of the range.
> +	 */
> +
> +	if (ptep != contpte_align_down(ptep) || nr < CONT_PTES)
> +		contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
> +
> +	if (ptep + nr != contpte_align_down(ptep + nr))
> +		contpte_try_unfold(mm, addr + PAGE_SIZE * (nr - 1),
> +				ptep + nr - 1,
> +				__ptep_get(ptep + nr - 1));

Nit: we should use braces for this 'if' block since it covers multiple lines
(even though the function call is a single statement).

It *might* be worth using temporaries for the last ptep and addr, e.g.

	if (ptep + nr != contpte_align_down(ptep + nr)) {
		unsigned long last_addr = addr + PAGE_SIZE * (nr - 1);
		pte_t *last_ptep = ptep + nr - 1;
		contpte_try_unfold(mm, last_addr, last_ptep,
				   __ptep_get(last_ptep));
	}

... but I'm happy without the temporaries so long as we have braces.

> +}
> +
>  static void contpte_convert(struct mm_struct *mm, unsigned long addr,
>  			    pte_t *ptep, pte_t pte)
>  {
> @@ -236,6 +253,24 @@ int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>  }
>  EXPORT_SYMBOL(contpte_ptep_clear_flush_young);
>  
> +void contpte_wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
> +					pte_t *ptep, unsigned int nr)
> +{
> +	/*
> +	 * If wrprotecting an entire contig range, we can avoid unfolding. Just
> +	 * set wrprotect and wait for the later mmu_gather flush to invalidate
> +	 * the tlb. Until the flush, the page may or may not be wrprotected.
> +	 * After the flush, it is guarranteed wrprotected. If its a partial

Typo: s/guarranteed/guaranteed/
Typo: s/its/it's/ (or s/its/it is/)

Other than the above this looks good to me.

Mark.

> +	 * range though, we must unfold, because we can't have a case where
> +	 * CONT_PTE is set but wrprotect applies to a subset of the PTEs; this
> +	 * would cause it to continue to be unpredictable after the flush.
> +	 */
> +
> +	contpte_try_unfold_partial(mm, addr, ptep, nr);
> +	__wrprotect_ptes(mm, addr, ptep, nr);
> +}
> +EXPORT_SYMBOL(contpte_wrprotect_ptes);
> +
>  int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
>  					unsigned long addr, pte_t *ptep,
>  					pte_t entry, int dirty)
> -- 
> 2.25.1
> 

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH v5 20/25] arm64/mm: Implement new wrprotect_ptes() batch API
  2024-02-13 16:31     ` Mark Rutland
  (?)
@ 2024-02-13 16:36       ` Ryan Roberts
  -1 siblings, 0 replies; 240+ messages in thread
From: Ryan Roberts @ 2024-02-13 16:36 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	David Hildenbrand, Kefeng Wang, John Hubbard, Zi Yan, Barry Song,
	Alistair Popple, Yang Shi, Nicholas Piggin, Christophe Leroy,
	Aneesh Kumar K.V, Naveen N. Rao, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, linux-arm-kernel,
	x86, linuxppc-dev, linux-mm, linux-kernel

On 13/02/2024 16:31, Mark Rutland wrote:
> On Fri, Feb 02, 2024 at 08:07:51AM +0000, Ryan Roberts wrote:
>> Optimize the contpte implementation to fix some of the fork performance
>> regression introduced by the initial contpte commit. Subsequent patches
>> will solve it entirely.
>>
>> During fork(), any private memory in the parent must be write-protected.
>> Previously this was done 1 PTE at a time. But the core-mm supports
>> batched wrprotect via the new wrprotect_ptes() API. So let's implement
>> that API and for fully covered contpte mappings, we no longer need to
>> unfold the contpte. This has 2 benefits:
>>
>>   - reduced unfolding, reduces the number of tlbis that must be issued.
>>   - The memory remains contpte-mapped ("folded") in the parent, so it
>>     continues to benefit from the more efficient use of the TLB after
>>     the fork.
>>
>> The optimization to wrprotect a whole contpte block without unfolding is
>> possible thanks to the tightening of the Arm ARM in respect to the
>> definition and behaviour when 'Misprogramming the Contiguous bit'. See
>> section D21194 at https://developer.arm.com/documentation/102105/latest/
> 
> Minor nit, but it'd be better to refer to a specific revision of the document,
> e.g.
> 
>   https://developer.arm.com/documentation/102105/ja-07/
> 
> That way people can see the specific version of the text you were referring to
> even if that changes later, and it means the link is still useful when D21194
> gets merged into the ARM ARM and dropped from the known issues doc.

ACK: will fix

> 
>>
>> Tested-by: John Hubbard <jhubbard@nvidia.com>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>  arch/arm64/include/asm/pgtable.h | 61 ++++++++++++++++++++++++++------
>>  arch/arm64/mm/contpte.c          | 35 ++++++++++++++++++
>>  2 files changed, 86 insertions(+), 10 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index 34892a95403d..c07f0d563733 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -978,16 +978,12 @@ static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm,
>>  }
>>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>>  
>> -/*
>> - * __ptep_set_wrprotect - mark read-only while trasferring potential hardware
>> - * dirty status (PTE_DBM && !PTE_RDONLY) to the software PTE_DIRTY bit.
>> - */
>> -static inline void __ptep_set_wrprotect(struct mm_struct *mm,
>> -					unsigned long address, pte_t *ptep)
>> +static inline void ___ptep_set_wrprotect(struct mm_struct *mm,
>> +					unsigned long address, pte_t *ptep,
>> +					pte_t pte)
>>  {
>> -	pte_t old_pte, pte;
>> +	pte_t old_pte;
>>  
>> -	pte = __ptep_get(ptep);
>>  	do {
>>  		old_pte = pte;
>>  		pte = pte_wrprotect(pte);
>> @@ -996,6 +992,25 @@ static inline void __ptep_set_wrprotect(struct mm_struct *mm,
>>  	} while (pte_val(pte) != pte_val(old_pte));
>>  }
>>  
>> +/*
>> + * __ptep_set_wrprotect - mark read-only while trasferring potential hardware
>> + * dirty status (PTE_DBM && !PTE_RDONLY) to the software PTE_DIRTY bit.
>> + */
>> +static inline void __ptep_set_wrprotect(struct mm_struct *mm,
>> +					unsigned long address, pte_t *ptep)
>> +{
>> +	___ptep_set_wrprotect(mm, address, ptep, __ptep_get(ptep));
>> +}
>> +
>> +static inline void __wrprotect_ptes(struct mm_struct *mm, unsigned long address,
>> +				pte_t *ptep, unsigned int nr)
>> +{
>> +	unsigned int i;
>> +
>> +	for (i = 0; i < nr; i++, address += PAGE_SIZE, ptep++)
>> +		__ptep_set_wrprotect(mm, address, ptep);
>> +}
>> +
>>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>  #define __HAVE_ARCH_PMDP_SET_WRPROTECT
>>  static inline void pmdp_set_wrprotect(struct mm_struct *mm,
>> @@ -1156,6 +1171,8 @@ extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>>  				unsigned long addr, pte_t *ptep);
>>  extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>>  				unsigned long addr, pte_t *ptep);
>> +extern void contpte_wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
>> +				pte_t *ptep, unsigned int nr);
>>  extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
>>  				unsigned long addr, pte_t *ptep,
>>  				pte_t entry, int dirty);
>> @@ -1269,12 +1286,35 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>>  	return contpte_ptep_clear_flush_young(vma, addr, ptep);
>>  }
>>  
>> +#define wrprotect_ptes wrprotect_ptes
>> +static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
>> +				pte_t *ptep, unsigned int nr)
>> +{
>> +	if (likely(nr == 1)) {
>> +		/*
>> +		 * Optimization: wrprotect_ptes() can only be called for present
>> +		 * ptes so we only need to check contig bit as condition for
>> +		 * unfold, and we can remove the contig bit from the pte we read
>> +		 * to avoid re-reading. This speeds up fork() which is sensitive
>> +		 * for order-0 folios. Equivalent to contpte_try_unfold().
>> +		 */
>> +		pte_t orig_pte = __ptep_get(ptep);
>> +
>> +		if (unlikely(pte_cont(orig_pte))) {
>> +			__contpte_try_unfold(mm, addr, ptep, orig_pte);
>> +			orig_pte = pte_mknoncont(orig_pte);
>> +		}
>> +		___ptep_set_wrprotect(mm, addr, ptep, orig_pte);
>> +	} else {
>> +		contpte_wrprotect_ptes(mm, addr, ptep, nr);
>> +	}
>> +}
>> +
>>  #define __HAVE_ARCH_PTEP_SET_WRPROTECT
>>  static inline void ptep_set_wrprotect(struct mm_struct *mm,
>>  				unsigned long addr, pte_t *ptep)
>>  {
>> -	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>> -	__ptep_set_wrprotect(mm, addr, ptep);
>> +	wrprotect_ptes(mm, addr, ptep, 1);
>>  }
>>  
>>  #define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
>> @@ -1306,6 +1346,7 @@ static inline int ptep_set_access_flags(struct vm_area_struct *vma,
>>  #define ptep_clear_flush_young			__ptep_clear_flush_young
>>  #define __HAVE_ARCH_PTEP_SET_WRPROTECT
>>  #define ptep_set_wrprotect			__ptep_set_wrprotect
>> +#define wrprotect_ptes				__wrprotect_ptes
>>  #define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
>>  #define ptep_set_access_flags			__ptep_set_access_flags
>>  
>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>> index bfb50e6b44c7..c85e64baf03b 100644
>> --- a/arch/arm64/mm/contpte.c
>> +++ b/arch/arm64/mm/contpte.c
>> @@ -23,6 +23,23 @@ static inline pte_t *contpte_align_down(pte_t *ptep)
>>  	return (pte_t *)(ALIGN_DOWN((unsigned long)ptep >> 3, CONT_PTES) << 3);
>>  }
>>  
>> +static void contpte_try_unfold_partial(struct mm_struct *mm, unsigned long addr,
>> +					pte_t *ptep, unsigned int nr)
>> +{
>> +	/*
>> +	 * Unfold any partially covered contpte block at the beginning and end
>> +	 * of the range.
>> +	 */
>> +
>> +	if (ptep != contpte_align_down(ptep) || nr < CONT_PTES)
>> +		contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>> +
>> +	if (ptep + nr != contpte_align_down(ptep + nr))
>> +		contpte_try_unfold(mm, addr + PAGE_SIZE * (nr - 1),
>> +				ptep + nr - 1,
>> +				__ptep_get(ptep + nr - 1));
> 
> Nit: we should use braces for this 'if' block since it covers multiple lines
> (even though the function call is a single statement).
> 
> It *might* be worth using temporaries for the last ptep and addr, e.g.
> 
> 	if (ptep + nr != contpte_align_down(ptep + nr)) {
> 		unsigned long last_addr = addr + PAGE_SIZE * (nr - 1);
> 		pte_t *last_ptep = ptep + nr - 1;
> 		contpte_try_unfold(mm, last_addr, last_ptep,
> 				   __ptep_get(last_ptep));
> 	}
> 
> ... but I'm happy without the temporaries so long as we have braces.

ACK: will fix and add temporaries.
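
i.e. contpte_try_unfold_partial() would end up something like this (untested
sketch):

static void contpte_try_unfold_partial(struct mm_struct *mm, unsigned long addr,
					pte_t *ptep, unsigned int nr)
{
	/*
	 * Unfold any partially covered contpte block at the beginning and end
	 * of the range.
	 */

	if (ptep != contpte_align_down(ptep) || nr < CONT_PTES)
		contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));

	if (ptep + nr != contpte_align_down(ptep + nr)) {
		unsigned long last_addr = addr + PAGE_SIZE * (nr - 1);
		pte_t *last_ptep = ptep + nr - 1;

		contpte_try_unfold(mm, last_addr, last_ptep,
				   __ptep_get(last_ptep));
	}
}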

> 
>> +}
>> +
>>  static void contpte_convert(struct mm_struct *mm, unsigned long addr,
>>  			    pte_t *ptep, pte_t pte)
>>  {
>> @@ -236,6 +253,24 @@ int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>>  }
>>  EXPORT_SYMBOL(contpte_ptep_clear_flush_young);
>>  
>> +void contpte_wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
>> +					pte_t *ptep, unsigned int nr)
>> +{
>> +	/*
>> +	 * If wrprotecting an entire contig range, we can avoid unfolding. Just
>> +	 * set wrprotect and wait for the later mmu_gather flush to invalidate
>> +	 * the tlb. Until the flush, the page may or may not be wrprotected.
>> +	 * After the flush, it is guarranteed wrprotected. If its a partial
> 
> Typo: s/guarranteed/guaranteed/
> Typo: s/its/it's/ (or s/its/it is/)

ACK: already fixed "guaranteed" after you pointed out the same typo in an
earlier patch. Will fix "it's".

> 
> Other than the above this looks good to me.

Great thanks!

> 
> Mark.
> 
>> +	 * range though, we must unfold, because we can't have a case where
>> +	 * CONT_PTE is set but wrprotect applies to a subset of the PTEs; this
>> +	 * would cause it to continue to be unpredictable after the flush.
>> +	 */
>> +
>> +	contpte_try_unfold_partial(mm, addr, ptep, nr);
>> +	__wrprotect_ptes(mm, addr, ptep, nr);
>> +}
>> +EXPORT_SYMBOL(contpte_wrprotect_ptes);
>> +
>>  int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
>>  					unsigned long addr, pte_t *ptep,
>>  					pte_t entry, int dirty)
>> -- 
>> 2.25.1
>>


^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH v5 20/25] arm64/mm: Implement new wrprotect_ptes() batch API
@ 2024-02-13 16:36       ` Ryan Roberts
  0 siblings, 0 replies; 240+ messages in thread
From: Ryan Roberts @ 2024-02-13 16:36 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Kefeng Wang, x86, David Hildenbrand, Catalin Marinas, Yang Shi,
	Dave Hansen, linux-mm, Andrey Ryabinin, H. Peter Anvin,
	Will Deacon, Ard Biesheuvel, Marc Zyngier, Alistair Popple,
	Barry Song, Matthew Wilcox, Aneesh Kumar K.V, Ingo Molnar,
	Zi Yan, Naveen N. Rao, John Hubbard, Nicholas Piggin,
	Borislav Petkov, Thomas Gleixner, linux-arm-kernel, linux-kernel,
	James Morse, Andrew Morton, linuxppc-dev

On 13/02/2024 16:31, Mark Rutland wrote:
> On Fri, Feb 02, 2024 at 08:07:51AM +0000, Ryan Roberts wrote:
>> Optimize the contpte implementation to fix some of the fork performance
>> regression introduced by the initial contpte commit. Subsequent patches
>> will solve it entirely.
>>
>> During fork(), any private memory in the parent must be write-protected.
>> Previously this was done 1 PTE at a time. But the core-mm supports
>> batched wrprotect via the new wrprotect_ptes() API. So let's implement
>> that API and for fully covered contpte mappings, we no longer need to
>> unfold the contpte. This has 2 benefits:
>>
>>   - reduced unfolding, reduces the number of tlbis that must be issued.
>>   - The memory remains contpte-mapped ("folded") in the parent, so it
>>     continues to benefit from the more efficient use of the TLB after
>>     the fork.
>>
>> The optimization to wrprotect a whole contpte block without unfolding is
>> possible thanks to the tightening of the Arm ARM in respect to the
>> definition and behaviour when 'Misprogramming the Contiguous bit'. See
>> section D21194 at https://developer.arm.com/documentation/102105/latest/
> 
> Minor nit, but it'd be better to refer to a specific revision of the document,
> e.g.
> 
>   https://developer.arm.com/documentation/102105/ja-07/
> 
> That way people can see the specific version of the text you were referring to
> even if that changes later, and it means the link is still useful when D21194
> gets merged into the ARM ARM and dropped from the known issues doc.

ACK: will fix

> 
>>
>> Tested-by: John Hubbard <jhubbard@nvidia.com>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>  arch/arm64/include/asm/pgtable.h | 61 ++++++++++++++++++++++++++------
>>  arch/arm64/mm/contpte.c          | 35 ++++++++++++++++++
>>  2 files changed, 86 insertions(+), 10 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index 34892a95403d..c07f0d563733 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -978,16 +978,12 @@ static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm,
>>  }
>>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>>  
>> -/*
>> - * __ptep_set_wrprotect - mark read-only while trasferring potential hardware
>> - * dirty status (PTE_DBM && !PTE_RDONLY) to the software PTE_DIRTY bit.
>> - */
>> -static inline void __ptep_set_wrprotect(struct mm_struct *mm,
>> -					unsigned long address, pte_t *ptep)
>> +static inline void ___ptep_set_wrprotect(struct mm_struct *mm,
>> +					unsigned long address, pte_t *ptep,
>> +					pte_t pte)
>>  {
>> -	pte_t old_pte, pte;
>> +	pte_t old_pte;
>>  
>> -	pte = __ptep_get(ptep);
>>  	do {
>>  		old_pte = pte;
>>  		pte = pte_wrprotect(pte);
>> @@ -996,6 +992,25 @@ static inline void __ptep_set_wrprotect(struct mm_struct *mm,
>>  	} while (pte_val(pte) != pte_val(old_pte));
>>  }
>>  
>> +/*
>> + * __ptep_set_wrprotect - mark read-only while trasferring potential hardware
>> + * dirty status (PTE_DBM && !PTE_RDONLY) to the software PTE_DIRTY bit.
>> + */
>> +static inline void __ptep_set_wrprotect(struct mm_struct *mm,
>> +					unsigned long address, pte_t *ptep)
>> +{
>> +	___ptep_set_wrprotect(mm, address, ptep, __ptep_get(ptep));
>> +}
>> +
>> +static inline void __wrprotect_ptes(struct mm_struct *mm, unsigned long address,
>> +				pte_t *ptep, unsigned int nr)
>> +{
>> +	unsigned int i;
>> +
>> +	for (i = 0; i < nr; i++, address += PAGE_SIZE, ptep++)
>> +		__ptep_set_wrprotect(mm, address, ptep);
>> +}
>> +
>>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>  #define __HAVE_ARCH_PMDP_SET_WRPROTECT
>>  static inline void pmdp_set_wrprotect(struct mm_struct *mm,
>> @@ -1156,6 +1171,8 @@ extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>>  				unsigned long addr, pte_t *ptep);
>>  extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>>  				unsigned long addr, pte_t *ptep);
>> +extern void contpte_wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
>> +				pte_t *ptep, unsigned int nr);
>>  extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
>>  				unsigned long addr, pte_t *ptep,
>>  				pte_t entry, int dirty);
>> @@ -1269,12 +1286,35 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>>  	return contpte_ptep_clear_flush_young(vma, addr, ptep);
>>  }
>>  
>> +#define wrprotect_ptes wrprotect_ptes
>> +static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
>> +				pte_t *ptep, unsigned int nr)
>> +{
>> +	if (likely(nr == 1)) {
>> +		/*
>> +		 * Optimization: wrprotect_ptes() can only be called for present
>> +		 * ptes so we only need to check contig bit as condition for
>> +		 * unfold, and we can remove the contig bit from the pte we read
>> +		 * to avoid re-reading. This speeds up fork() which is sensitive
>> +		 * for order-0 folios. Equivalent to contpte_try_unfold().
>> +		 */
>> +		pte_t orig_pte = __ptep_get(ptep);
>> +
>> +		if (unlikely(pte_cont(orig_pte))) {
>> +			__contpte_try_unfold(mm, addr, ptep, orig_pte);
>> +			orig_pte = pte_mknoncont(orig_pte);
>> +		}
>> +		___ptep_set_wrprotect(mm, addr, ptep, orig_pte);
>> +	} else {
>> +		contpte_wrprotect_ptes(mm, addr, ptep, nr);
>> +	}
>> +}
>> +
>>  #define __HAVE_ARCH_PTEP_SET_WRPROTECT
>>  static inline void ptep_set_wrprotect(struct mm_struct *mm,
>>  				unsigned long addr, pte_t *ptep)
>>  {
>> -	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>> -	__ptep_set_wrprotect(mm, addr, ptep);
>> +	wrprotect_ptes(mm, addr, ptep, 1);
>>  }
>>  
>>  #define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
>> @@ -1306,6 +1346,7 @@ static inline int ptep_set_access_flags(struct vm_area_struct *vma,
>>  #define ptep_clear_flush_young			__ptep_clear_flush_young
>>  #define __HAVE_ARCH_PTEP_SET_WRPROTECT
>>  #define ptep_set_wrprotect			__ptep_set_wrprotect
>> +#define wrprotect_ptes				__wrprotect_ptes
>>  #define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
>>  #define ptep_set_access_flags			__ptep_set_access_flags
>>  
>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>> index bfb50e6b44c7..c85e64baf03b 100644
>> --- a/arch/arm64/mm/contpte.c
>> +++ b/arch/arm64/mm/contpte.c
>> @@ -23,6 +23,23 @@ static inline pte_t *contpte_align_down(pte_t *ptep)
>>  	return (pte_t *)(ALIGN_DOWN((unsigned long)ptep >> 3, CONT_PTES) << 3);
>>  }
>>  
>> +static void contpte_try_unfold_partial(struct mm_struct *mm, unsigned long addr,
>> +					pte_t *ptep, unsigned int nr)
>> +{
>> +	/*
>> +	 * Unfold any partially covered contpte block at the beginning and end
>> +	 * of the range.
>> +	 */
>> +
>> +	if (ptep != contpte_align_down(ptep) || nr < CONT_PTES)
>> +		contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>> +
>> +	if (ptep + nr != contpte_align_down(ptep + nr))
>> +		contpte_try_unfold(mm, addr + PAGE_SIZE * (nr - 1),
>> +				ptep + nr - 1,
>> +				__ptep_get(ptep + nr - 1));
> 
> Nit: we should use braces for this 'if' block since it covers multiple lines
> (even though the function call is a single statement).
> 
> It *might* be worth using temporaries for the last ptep and addr, e.g.
> 
> 	if (ptep + nr != contpte_align_down(ptep + nr)) {
> 		unsigned long last_addr = addr + PAGE_SIZE * (nr - 1);
> 		pte_t *last_ptep = ptep + nr - 1;
> 		contpte_try_unfold(mm, last_addr, last_ptep,
> 				   __ptep_get(last_ptep));
> 	}
> 
> ... but I'm happy without the temporaries so long as we have braces.

ACK will fix and add temporaries.

> 
>> +}
>> +
>>  static void contpte_convert(struct mm_struct *mm, unsigned long addr,
>>  			    pte_t *ptep, pte_t pte)
>>  {
>> @@ -236,6 +253,24 @@ int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>>  }
>>  EXPORT_SYMBOL(contpte_ptep_clear_flush_young);
>>  
>> +void contpte_wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
>> +					pte_t *ptep, unsigned int nr)
>> +{
>> +	/*
>> +	 * If wrprotecting an entire contig range, we can avoid unfolding. Just
>> +	 * set wrprotect and wait for the later mmu_gather flush to invalidate
>> +	 * the tlb. Until the flush, the page may or may not be wrprotected.
>> +	 * After the flush, it is guarranteed wrprotected. If its a partial
> 
> Typo: s/guarranteed/guaranteed/
> Typo: s/its/it's/ (or s/its/it is/)

ACK: already fixed guaranteed after you pointed out the same typo in earlier
patch. Will fix it's.

> 
> Other than the above this looks good to me.

Great thanks!

> 
> Mark.
> 
>> +	 * range though, we must unfold, because we can't have a case where
>> +	 * CONT_PTE is set but wrprotect applies to a subset of the PTEs; this
>> +	 * would cause it to continue to be unpredictable after the flush.
>> +	 */
>> +
>> +	contpte_try_unfold_partial(mm, addr, ptep, nr);
>> +	__wrprotect_ptes(mm, addr, ptep, nr);
>> +}
>> +EXPORT_SYMBOL(contpte_wrprotect_ptes);
>> +
>>  int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
>>  					unsigned long addr, pte_t *ptep,
>>  					pte_t entry, int dirty)
>> -- 
>> 2.25.1
>>


^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH v5 20/25] arm64/mm: Implement new wrprotect_ptes() batch API
@ 2024-02-13 16:36       ` Ryan Roberts
  0 siblings, 0 replies; 240+ messages in thread
From: Ryan Roberts @ 2024-02-13 16:36 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	David Hildenbrand, Kefeng Wang, John Hubbard, Zi Yan, Barry Song,
	Alistair Popple, Yang Shi, Nicholas Piggin, Christophe Leroy,
	Aneesh Kumar K.V, Naveen N. Rao, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, linux-arm-kernel,
	x86, linuxppc-dev, linux-mm, linux-kernel

On 13/02/2024 16:31, Mark Rutland wrote:
> On Fri, Feb 02, 2024 at 08:07:51AM +0000, Ryan Roberts wrote:
>> Optimize the contpte implementation to fix some of the fork performance
>> regression introduced by the initial contpte commit. Subsequent patches
>> will solve it entirely.
>>
>> During fork(), any private memory in the parent must be write-protected.
>> Previously this was done 1 PTE at a time. But the core-mm supports
>> batched wrprotect via the new wrprotect_ptes() API. So let's implement
>> that API and for fully covered contpte mappings, we no longer need to
>> unfold the contpte. This has 2 benefits:
>>
>>   - reduced unfolding, reduces the number of tlbis that must be issued.
>>   - The memory remains contpte-mapped ("folded") in the parent, so it
>>     continues to benefit from the more efficient use of the TLB after
>>     the fork.
>>
>> The optimization to wrprotect a whole contpte block without unfolding is
>> possible thanks to the tightening of the Arm ARM in respect to the
>> definition and behaviour when 'Misprogramming the Contiguous bit'. See
>> section D21194 at https://developer.arm.com/documentation/102105/latest/
> 
> Minor nit, but it'd be better to refer to a specific revision of the document,
> e.g.
> 
>   https://developer.arm.com/documentation/102105/ja-07/
> 
> That way people can see the specific version of the text you were referring to
> even if that changes later, and it means the link is still useful when D21194
> gets merged into the ARM ARM and dropped from the known issues doc.

ACK: will fix

> 
>>
>> Tested-by: John Hubbard <jhubbard@nvidia.com>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>  arch/arm64/include/asm/pgtable.h | 61 ++++++++++++++++++++++++++------
>>  arch/arm64/mm/contpte.c          | 35 ++++++++++++++++++
>>  2 files changed, 86 insertions(+), 10 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index 34892a95403d..c07f0d563733 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -978,16 +978,12 @@ static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm,
>>  }
>>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>>  
>> -/*
>> - * __ptep_set_wrprotect - mark read-only while trasferring potential hardware
>> - * dirty status (PTE_DBM && !PTE_RDONLY) to the software PTE_DIRTY bit.
>> - */
>> -static inline void __ptep_set_wrprotect(struct mm_struct *mm,
>> -					unsigned long address, pte_t *ptep)
>> +static inline void ___ptep_set_wrprotect(struct mm_struct *mm,
>> +					unsigned long address, pte_t *ptep,
>> +					pte_t pte)
>>  {
>> -	pte_t old_pte, pte;
>> +	pte_t old_pte;
>>  
>> -	pte = __ptep_get(ptep);
>>  	do {
>>  		old_pte = pte;
>>  		pte = pte_wrprotect(pte);
>> @@ -996,6 +992,25 @@ static inline void __ptep_set_wrprotect(struct mm_struct *mm,
>>  	} while (pte_val(pte) != pte_val(old_pte));
>>  }
>>  
>> +/*
>> + * __ptep_set_wrprotect - mark read-only while trasferring potential hardware
>> + * dirty status (PTE_DBM && !PTE_RDONLY) to the software PTE_DIRTY bit.
>> + */
>> +static inline void __ptep_set_wrprotect(struct mm_struct *mm,
>> +					unsigned long address, pte_t *ptep)
>> +{
>> +	___ptep_set_wrprotect(mm, address, ptep, __ptep_get(ptep));
>> +}
>> +
>> +static inline void __wrprotect_ptes(struct mm_struct *mm, unsigned long address,
>> +				pte_t *ptep, unsigned int nr)
>> +{
>> +	unsigned int i;
>> +
>> +	for (i = 0; i < nr; i++, address += PAGE_SIZE, ptep++)
>> +		__ptep_set_wrprotect(mm, address, ptep);
>> +}
>> +
>>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>  #define __HAVE_ARCH_PMDP_SET_WRPROTECT
>>  static inline void pmdp_set_wrprotect(struct mm_struct *mm,
>> @@ -1156,6 +1171,8 @@ extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>>  				unsigned long addr, pte_t *ptep);
>>  extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>>  				unsigned long addr, pte_t *ptep);
>> +extern void contpte_wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
>> +				pte_t *ptep, unsigned int nr);
>>  extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
>>  				unsigned long addr, pte_t *ptep,
>>  				pte_t entry, int dirty);
>> @@ -1269,12 +1286,35 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>>  	return contpte_ptep_clear_flush_young(vma, addr, ptep);
>>  }
>>  
>> +#define wrprotect_ptes wrprotect_ptes
>> +static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
>> +				pte_t *ptep, unsigned int nr)
>> +{
>> +	if (likely(nr == 1)) {
>> +		/*
>> +		 * Optimization: wrprotect_ptes() can only be called for present
>> +		 * ptes so we only need to check contig bit as condition for
>> +		 * unfold, and we can remove the contig bit from the pte we read
>> +		 * to avoid re-reading. This speeds up fork() which is sensitive
>> +		 * for order-0 folios. Equivalent to contpte_try_unfold().
>> +		 */
>> +		pte_t orig_pte = __ptep_get(ptep);
>> +
>> +		if (unlikely(pte_cont(orig_pte))) {
>> +			__contpte_try_unfold(mm, addr, ptep, orig_pte);
>> +			orig_pte = pte_mknoncont(orig_pte);
>> +		}
>> +		___ptep_set_wrprotect(mm, addr, ptep, orig_pte);
>> +	} else {
>> +		contpte_wrprotect_ptes(mm, addr, ptep, nr);
>> +	}
>> +}
>> +
>>  #define __HAVE_ARCH_PTEP_SET_WRPROTECT
>>  static inline void ptep_set_wrprotect(struct mm_struct *mm,
>>  				unsigned long addr, pte_t *ptep)
>>  {
>> -	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>> -	__ptep_set_wrprotect(mm, addr, ptep);
>> +	wrprotect_ptes(mm, addr, ptep, 1);
>>  }
>>  
>>  #define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
>> @@ -1306,6 +1346,7 @@ static inline int ptep_set_access_flags(struct vm_area_struct *vma,
>>  #define ptep_clear_flush_young			__ptep_clear_flush_young
>>  #define __HAVE_ARCH_PTEP_SET_WRPROTECT
>>  #define ptep_set_wrprotect			__ptep_set_wrprotect
>> +#define wrprotect_ptes				__wrprotect_ptes
>>  #define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
>>  #define ptep_set_access_flags			__ptep_set_access_flags
>>  
>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>> index bfb50e6b44c7..c85e64baf03b 100644
>> --- a/arch/arm64/mm/contpte.c
>> +++ b/arch/arm64/mm/contpte.c
>> @@ -23,6 +23,23 @@ static inline pte_t *contpte_align_down(pte_t *ptep)
>>  	return (pte_t *)(ALIGN_DOWN((unsigned long)ptep >> 3, CONT_PTES) << 3);
>>  }
>>  
>> +static void contpte_try_unfold_partial(struct mm_struct *mm, unsigned long addr,
>> +					pte_t *ptep, unsigned int nr)
>> +{
>> +	/*
>> +	 * Unfold any partially covered contpte block at the beginning and end
>> +	 * of the range.
>> +	 */
>> +
>> +	if (ptep != contpte_align_down(ptep) || nr < CONT_PTES)
>> +		contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>> +
>> +	if (ptep + nr != contpte_align_down(ptep + nr))
>> +		contpte_try_unfold(mm, addr + PAGE_SIZE * (nr - 1),
>> +				ptep + nr - 1,
>> +				__ptep_get(ptep + nr - 1));
> 
> Nit: we should use braces for this 'if' block since it covers multiple lines
> (even though the function call is a single statement).
> 
> It *might* be worth using temporaries for the last ptep and addr, e.g.
> 
> 	if (ptep + nr != contpte_align_down(ptep + nr)) {
> 		unsigned long last_addr = addr + PAGE_SIZE * (nr - 1);
> 		pte_t *last_ptep = ptep + nr - 1;
> 		contpte_try_unfold(mm, last_addr, last_ptep,
> 				   __ptep_get(last_ptep));
> 	}
> 
> ... but I'm happy without the temporaries so long as we have braces.

ACK will fix and add temporaries.

> 
>> +}
>> +
>>  static void contpte_convert(struct mm_struct *mm, unsigned long addr,
>>  			    pte_t *ptep, pte_t pte)
>>  {
>> @@ -236,6 +253,24 @@ int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>>  }
>>  EXPORT_SYMBOL(contpte_ptep_clear_flush_young);
>>  
>> +void contpte_wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
>> +					pte_t *ptep, unsigned int nr)
>> +{
>> +	/*
>> +	 * If wrprotecting an entire contig range, we can avoid unfolding. Just
>> +	 * set wrprotect and wait for the later mmu_gather flush to invalidate
>> +	 * the tlb. Until the flush, the page may or may not be wrprotected.
>> +	 * After the flush, it is guarranteed wrprotected. If its a partial
> 
> Typo: s/guarranteed/guaranteed/
> Typo: s/its/it's/ (or s/its/it is/)

ACK: already fixed guaranteed after you pointed out the same typo in earlier
patch. Will fix it's.

> 
> Other than the above this looks good to me.

Great thanks!

> 
> Mark.
> 
>> +	 * range though, we must unfold, because we can't have a case where
>> +	 * CONT_PTE is set but wrprotect applies to a subset of the PTEs; this
>> +	 * would cause it to continue to be unpredictable after the flush.
>> +	 */
>> +
>> +	contpte_try_unfold_partial(mm, addr, ptep, nr);
>> +	__wrprotect_ptes(mm, addr, ptep, nr);
>> +}
>> +EXPORT_SYMBOL(contpte_wrprotect_ptes);
>> +
>>  int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
>>  					unsigned long addr, pte_t *ptep,
>>  					pte_t entry, int dirty)
>> -- 
>> 2.25.1
>>



^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH v5 21/25] arm64/mm: Implement new [get_and_]clear_full_ptes() batch APIs
  2024-02-02  8:07   ` Ryan Roberts
  (?)
@ 2024-02-13 16:43     ` Mark Rutland
  -1 siblings, 0 replies; 240+ messages in thread
From: Mark Rutland @ 2024-02-13 16:43 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	David Hildenbrand, Kefeng Wang, John Hubbard, Zi Yan, Barry Song,
	Alistair Popple, Yang Shi, Nicholas Piggin, Christophe Leroy,
	Aneesh Kumar K.V, Naveen N. Rao, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, linux-arm-kernel,
	x86, linuxppc-dev, linux-mm, linux-kernel

On Fri, Feb 02, 2024 at 08:07:52AM +0000, Ryan Roberts wrote:
> Optimize the contpte implementation to fix some of the
> exit/munmap/dontneed performance regression introduced by the initial
> contpte commit. Subsequent patches will solve it entirely.
> 
> During exit(), munmap() or madvise(MADV_DONTNEED), mappings must be
> cleared. Previously this was done 1 PTE at a time. But the core-mm
> supports batched clear via the new [get_and_]clear_full_ptes() APIs. So
> let's implement those APIs and for fully covered contpte mappings, we no
> longer need to unfold the contpte. This significantly reduces unfolding
> operations, reducing the number of tlbis that must be issued.
> 
> Tested-by: John Hubbard <jhubbard@nvidia.com>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  arch/arm64/include/asm/pgtable.h | 67 ++++++++++++++++++++++++++++++++
>  arch/arm64/mm/contpte.c          | 17 ++++++++
>  2 files changed, 84 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index c07f0d563733..ad04adb7b87f 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -965,6 +965,37 @@ static inline pte_t __ptep_get_and_clear(struct mm_struct *mm,
>  	return pte;
>  }
>  
> +static inline void __clear_full_ptes(struct mm_struct *mm, unsigned long addr,
> +				pte_t *ptep, unsigned int nr, int full)
> +{
> +	for (;;) {
> +		__ptep_get_and_clear(mm, addr, ptep);
> +		if (--nr == 0)
> +			break;
> +		ptep++;
> +		addr += PAGE_SIZE;
> +	}
> +}

The loop construct is a bit odd; can't this be:

	while (nr--) {
		__ptep_get_and_clear(mm, addr, ptep);
		ptep++;
		addr += PAGE_SIZE;
	}

... or:

	do {
		__ptep_get_and_clear(mm, addr, ptep);
		ptep++;
		addr += PAGE_SIZE;
	} while (--nr);

... ?

Otherwise, this looks good to me.

Mark.
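
To make the win described in the commit message concrete, here is a hedged
caller-side sketch; the wrapper and its name are illustrative only and are not
taken from mm/memory.c:

	/* Illustrative sketch, not a real core-mm caller. */
	static void zap_batch_example(struct mm_struct *mm, unsigned long addr,
				      pte_t *ptep, unsigned int nr, int full)
	{
		/* One call covers the whole batch instead of nr separate clears. */
		pte_t pte = get_and_clear_full_ptes(mm, addr, ptep, nr, full);

		/* The returned pte carries the accumulated dirty/young state. */
		if (pte_dirty(pte) || pte_young(pte)) {
			/* a real caller would update the folio here */
		}
	}

For a fully covered contpte block this issues no unfold, and hence no extra
tlbis, which is the point of the optimization.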

> +
> +static inline pte_t __get_and_clear_full_ptes(struct mm_struct *mm,
> +				unsigned long addr, pte_t *ptep,
> +				unsigned int nr, int full)
> +{
> +	pte_t pte, tmp_pte;
> +
> +	pte = __ptep_get_and_clear(mm, addr, ptep);
> +	while (--nr) {
> +		ptep++;
> +		addr += PAGE_SIZE;
> +		tmp_pte = __ptep_get_and_clear(mm, addr, ptep);
> +		if (pte_dirty(tmp_pte))
> +			pte = pte_mkdirty(pte);
> +		if (pte_young(tmp_pte))
> +			pte = pte_mkyoung(pte);
> +	}
> +	return pte;
> +}
> +
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  #define __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR
>  static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm,
> @@ -1167,6 +1198,11 @@ extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
>  extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
>  extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
>  				pte_t *ptep, pte_t pte, unsigned int nr);
> +extern void contpte_clear_full_ptes(struct mm_struct *mm, unsigned long addr,
> +				pte_t *ptep, unsigned int nr, int full);
> +extern pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
> +				unsigned long addr, pte_t *ptep,
> +				unsigned int nr, int full);
>  extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>  				unsigned long addr, pte_t *ptep);
>  extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
> @@ -1254,6 +1290,35 @@ static inline void pte_clear(struct mm_struct *mm,
>  	__pte_clear(mm, addr, ptep);
>  }
>  
> +#define clear_full_ptes clear_full_ptes
> +static inline void clear_full_ptes(struct mm_struct *mm, unsigned long addr,
> +				pte_t *ptep, unsigned int nr, int full)
> +{
> +	if (likely(nr == 1)) {
> +		contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
> +		__clear_full_ptes(mm, addr, ptep, nr, full);
> +	} else {
> +		contpte_clear_full_ptes(mm, addr, ptep, nr, full);
> +	}
> +}
> +
> +#define get_and_clear_full_ptes get_and_clear_full_ptes
> +static inline pte_t get_and_clear_full_ptes(struct mm_struct *mm,
> +				unsigned long addr, pte_t *ptep,
> +				unsigned int nr, int full)
> +{
> +	pte_t pte;
> +
> +	if (likely(nr == 1)) {
> +		contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
> +		pte = __get_and_clear_full_ptes(mm, addr, ptep, nr, full);
> +	} else {
> +		pte = contpte_get_and_clear_full_ptes(mm, addr, ptep, nr, full);
> +	}
> +
> +	return pte;
> +}
> +
>  #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
>  static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>  				unsigned long addr, pte_t *ptep)
> @@ -1338,6 +1403,8 @@ static inline int ptep_set_access_flags(struct vm_area_struct *vma,
>  #define set_pte					__set_pte
>  #define set_ptes				__set_ptes
>  #define pte_clear				__pte_clear
> +#define clear_full_ptes				__clear_full_ptes
> +#define get_and_clear_full_ptes			__get_and_clear_full_ptes
>  #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
>  #define ptep_get_and_clear			__ptep_get_and_clear
>  #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> index c85e64baf03b..80346108450b 100644
> --- a/arch/arm64/mm/contpte.c
> +++ b/arch/arm64/mm/contpte.c
> @@ -207,6 +207,23 @@ void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
>  }
>  EXPORT_SYMBOL(contpte_set_ptes);
>  
> +void contpte_clear_full_ptes(struct mm_struct *mm, unsigned long addr,
> +				pte_t *ptep, unsigned int nr, int full)
> +{
> +	contpte_try_unfold_partial(mm, addr, ptep, nr);
> +	__clear_full_ptes(mm, addr, ptep, nr, full);
> +}
> +EXPORT_SYMBOL(contpte_clear_full_ptes);
> +
> +pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
> +				unsigned long addr, pte_t *ptep,
> +				unsigned int nr, int full)
> +{
> +	contpte_try_unfold_partial(mm, addr, ptep, nr);
> +	return __get_and_clear_full_ptes(mm, addr, ptep, nr, full);
> +}
> +EXPORT_SYMBOL(contpte_get_and_clear_full_ptes);
> +
>  int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>  					unsigned long addr, pte_t *ptep)
>  {
> -- 
> 2.25.1
> 

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH v5 21/25] arm64/mm: Implement new [get_and_]clear_full_ptes() batch APIs
  2024-02-13 16:43     ` Mark Rutland
  (?)
@ 2024-02-13 16:48       ` Ryan Roberts
  -1 siblings, 0 replies; 240+ messages in thread
From: Ryan Roberts @ 2024-02-13 16:48 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	David Hildenbrand, Kefeng Wang, John Hubbard, Zi Yan, Barry Song,
	Alistair Popple, Yang Shi, Nicholas Piggin, Christophe Leroy,
	Aneesh Kumar K.V, Naveen N. Rao, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, linux-arm-kernel,
	x86, linuxppc-dev, linux-mm, linux-kernel

On 13/02/2024 16:43, Mark Rutland wrote:
> On Fri, Feb 02, 2024 at 08:07:52AM +0000, Ryan Roberts wrote:
>> Optimize the contpte implementation to fix some of the
>> exit/munmap/dontneed performance regression introduced by the initial
>> contpte commit. Subsequent patches will solve it entirely.
>>
>> During exit(), munmap() or madvise(MADV_DONTNEED), mappings must be
>> cleared. Previously this was done 1 PTE at a time. But the core-mm
>> supports batched clear via the new [get_and_]clear_full_ptes() APIs. So
>> let's implement those APIs and for fully covered contpte mappings, we no
>> longer need to unfold the contpte. This significantly reduces unfolding
>> operations, reducing the number of tlbis that must be issued.
>>
>> Tested-by: John Hubbard <jhubbard@nvidia.com>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>  arch/arm64/include/asm/pgtable.h | 67 ++++++++++++++++++++++++++++++++
>>  arch/arm64/mm/contpte.c          | 17 ++++++++
>>  2 files changed, 84 insertions(+)
>>
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index c07f0d563733..ad04adb7b87f 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -965,6 +965,37 @@ static inline pte_t __ptep_get_and_clear(struct mm_struct *mm,
>>  	return pte;
>>  }
>>  
>> +static inline void __clear_full_ptes(struct mm_struct *mm, unsigned long addr,
>> +				pte_t *ptep, unsigned int nr, int full)
>> +{
>> +	for (;;) {
>> +		__ptep_get_and_clear(mm, addr, ptep);
>> +		if (--nr == 0)
>> +			break;
>> +		ptep++;
>> +		addr += PAGE_SIZE;
>> +	}
>> +}
> 
> The loop construct is a bit odd; can't this be:

I found it a little odd at first, but it's avoiding the ptep and addr increments
the last time through the loop. It's the preferred pattern for these functions in
core-mm. See default set_ptes(), wrprotect_ptes(), clear_full_ptes() in
include/linux/pgtable.h.

So I'd prefer to leave it as is so that we match them. What do you think?
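
For readers without the tree open, the generic helper being matched has roughly
this shape (paraphrased from include/linux/pgtable.h, not a verbatim quote):

	static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
					  pte_t *ptep, unsigned int nr)
	{
		for (;;) {
			ptep_set_wrprotect(mm, addr, ptep);
			if (--nr == 0)
				break;
			/* only advance when another iteration follows */
			ptep++;
			addr += PAGE_SIZE;
		}
	}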

> 
> 	while (nr--) {
> 		__ptep_get_and_clear(mm, addr, ptep);
> 		ptep++;
> 		addr += PAGE_SIZE;
> 	}
> 
> ... or:
> 
> 	do {
> 		__ptep_get_and_clear(mm, addr, ptep);
> 		ptep++;
> 		addr += PAGE_SIZE;
> 	} while (--nr);
> 
> ... ?
> 
> Otherwise, this looks good to me.
> 
> Mark.
> 
>> +
>> +static inline pte_t __get_and_clear_full_ptes(struct mm_struct *mm,
>> +				unsigned long addr, pte_t *ptep,
>> +				unsigned int nr, int full)
>> +{
>> +	pte_t pte, tmp_pte;
>> +
>> +	pte = __ptep_get_and_clear(mm, addr, ptep);
>> +	while (--nr) {
>> +		ptep++;
>> +		addr += PAGE_SIZE;
>> +		tmp_pte = __ptep_get_and_clear(mm, addr, ptep);
>> +		if (pte_dirty(tmp_pte))
>> +			pte = pte_mkdirty(pte);
>> +		if (pte_young(tmp_pte))
>> +			pte = pte_mkyoung(pte);
>> +	}
>> +	return pte;
>> +}
>> +
>>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>  #define __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR
>>  static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm,
>> @@ -1167,6 +1198,11 @@ extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
>>  extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
>>  extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
>>  				pte_t *ptep, pte_t pte, unsigned int nr);
>> +extern void contpte_clear_full_ptes(struct mm_struct *mm, unsigned long addr,
>> +				pte_t *ptep, unsigned int nr, int full);
>> +extern pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
>> +				unsigned long addr, pte_t *ptep,
>> +				unsigned int nr, int full);
>>  extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>>  				unsigned long addr, pte_t *ptep);
>>  extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>> @@ -1254,6 +1290,35 @@ static inline void pte_clear(struct mm_struct *mm,
>>  	__pte_clear(mm, addr, ptep);
>>  }
>>  
>> +#define clear_full_ptes clear_full_ptes
>> +static inline void clear_full_ptes(struct mm_struct *mm, unsigned long addr,
>> +				pte_t *ptep, unsigned int nr, int full)
>> +{
>> +	if (likely(nr == 1)) {
>> +		contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>> +		__clear_full_ptes(mm, addr, ptep, nr, full);
>> +	} else {
>> +		contpte_clear_full_ptes(mm, addr, ptep, nr, full);
>> +	}
>> +}
>> +
>> +#define get_and_clear_full_ptes get_and_clear_full_ptes
>> +static inline pte_t get_and_clear_full_ptes(struct mm_struct *mm,
>> +				unsigned long addr, pte_t *ptep,
>> +				unsigned int nr, int full)
>> +{
>> +	pte_t pte;
>> +
>> +	if (likely(nr == 1)) {
>> +		contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>> +		pte = __get_and_clear_full_ptes(mm, addr, ptep, nr, full);
>> +	} else {
>> +		pte = contpte_get_and_clear_full_ptes(mm, addr, ptep, nr, full);
>> +	}
>> +
>> +	return pte;
>> +}
>> +
>>  #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
>>  static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>>  				unsigned long addr, pte_t *ptep)
>> @@ -1338,6 +1403,8 @@ static inline int ptep_set_access_flags(struct vm_area_struct *vma,
>>  #define set_pte					__set_pte
>>  #define set_ptes				__set_ptes
>>  #define pte_clear				__pte_clear
>> +#define clear_full_ptes				__clear_full_ptes
>> +#define get_and_clear_full_ptes			__get_and_clear_full_ptes
>>  #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
>>  #define ptep_get_and_clear			__ptep_get_and_clear
>>  #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>> index c85e64baf03b..80346108450b 100644
>> --- a/arch/arm64/mm/contpte.c
>> +++ b/arch/arm64/mm/contpte.c
>> @@ -207,6 +207,23 @@ void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
>>  }
>>  EXPORT_SYMBOL(contpte_set_ptes);
>>  
>> +void contpte_clear_full_ptes(struct mm_struct *mm, unsigned long addr,
>> +				pte_t *ptep, unsigned int nr, int full)
>> +{
>> +	contpte_try_unfold_partial(mm, addr, ptep, nr);
>> +	__clear_full_ptes(mm, addr, ptep, nr, full);
>> +}
>> +EXPORT_SYMBOL(contpte_clear_full_ptes);
>> +
>> +pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
>> +				unsigned long addr, pte_t *ptep,
>> +				unsigned int nr, int full)
>> +{
>> +	contpte_try_unfold_partial(mm, addr, ptep, nr);
>> +	return __get_and_clear_full_ptes(mm, addr, ptep, nr, full);
>> +}
>> +EXPORT_SYMBOL(contpte_get_and_clear_full_ptes);
>> +
>>  int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>>  					unsigned long addr, pte_t *ptep)
>>  {
>> -- 
>> 2.25.1
>>


^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH v5 21/25] arm64/mm: Implement new [get_and_]clear_full_ptes() batch APIs
  2024-02-13 16:48       ` Ryan Roberts
  (?)
@ 2024-02-13 16:53         ` Mark Rutland
  -1 siblings, 0 replies; 240+ messages in thread
From: Mark Rutland @ 2024-02-13 16:53 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	David Hildenbrand, Kefeng Wang, John Hubbard, Zi Yan, Barry Song,
	Alistair Popple, Yang Shi, Nicholas Piggin, Christophe Leroy,
	Aneesh Kumar K.V, Naveen N. Rao, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, linux-arm-kernel,
	x86, linuxppc-dev, linux-mm, linux-kernel

On Tue, Feb 13, 2024 at 04:48:50PM +0000, Ryan Roberts wrote:
> On 13/02/2024 16:43, Mark Rutland wrote:
> > On Fri, Feb 02, 2024 at 08:07:52AM +0000, Ryan Roberts wrote:

> >> +static inline void __clear_full_ptes(struct mm_struct *mm, unsigned long addr,
> >> +				pte_t *ptep, unsigned int nr, int full)
> >> +{
> >> +	for (;;) {
> >> +		__ptep_get_and_clear(mm, addr, ptep);
> >> +		if (--nr == 0)
> >> +			break;
> >> +		ptep++;
> >> +		addr += PAGE_SIZE;
> >> +	}
> >> +}
> > 
> > The loop construct is a bit odd; can't this be:
> 
> I found it a little odd at first, but it's avoiding the ptep and addr increments
> the last time through the loop. It's the preferred pattern for these functions in
> core-mm. See default set_ptes(), wrprotect_ptes(), clear_full_ptes() in
> include/linux/pgtable.h.
> 
> So I'd prefer to leave it as is so that we match them. What do you think?

That's fair enough; I'm happy with it as-is.

Mark.

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH v5 23/25] arm64/mm: Implement pte_batch_hint()
  2024-02-02  8:07   ` Ryan Roberts
  (?)
@ 2024-02-13 16:54     ` Mark Rutland
  -1 siblings, 0 replies; 240+ messages in thread
From: Mark Rutland @ 2024-02-13 16:54 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	David Hildenbrand, Kefeng Wang, John Hubbard, Zi Yan, Barry Song,
	Alistair Popple, Yang Shi, Nicholas Piggin, Christophe Leroy,
	Aneesh Kumar K.V, Naveen N. Rao, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, linux-arm-kernel,
	x86, linuxppc-dev, linux-mm, linux-kernel

On Fri, Feb 02, 2024 at 08:07:54AM +0000, Ryan Roberts wrote:
> When core code iterates over a range of ptes and calls ptep_get() for
> each of them, if the range happens to cover contpte mappings, the number
> of pte reads becomes amplified by a factor of the number of PTEs in a
> contpte block. This is because for each call to ptep_get(), the
> implementation must read all of the ptes in the contpte block to which
> it belongs to gather the access and dirty bits.
> 
> This causes a hotspot for fork(), as well as operations that unmap
> memory such as munmap(), exit and madvise(MADV_DONTNEED). Fortunately we
> can fix this by implementing pte_batch_hint() which allows their
> iterators to skip getting the contpte tail ptes when gathering the batch
> of ptes to operate on. This results in the number of PTE reads returning
> to 1 per pte.
> 
> Tested-by: John Hubbard <jhubbard@nvidia.com>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>

Acked-by: Mark Rutland <mark.rutland@arm.com>

Mark.
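
As a hedged aside, this is roughly how a batching iterator can consume the
hint (illustrative only; the function and variable names below are assumptions
for the sketch, not taken from core-mm):

	/* Illustrative sketch: one ptep_get() per contpte block, not per pte. */
	static unsigned int walk_ptes_example(pte_t *ptep, unsigned int nr)
	{
		unsigned int i = 0, reads = 0;

		while (i < nr) {
			pte_t pte = ptep_get(ptep + i);
			unsigned int step = pte_batch_hint(ptep + i, pte);

			step = min(step, nr - i);
			reads++;
			/* ... operate on ptes [i, i + step) using 'pte' ... */
			i += step;
		}

		return reads;	/* ~nr/CONT_PTES for contpte-mapped ranges */
	}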

> ---
>  arch/arm64/include/asm/pgtable.h | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index ad04adb7b87f..353ea67b5d75 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -1220,6 +1220,15 @@ static inline void contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>  		__contpte_try_unfold(mm, addr, ptep, pte);
>  }
>  
> +#define pte_batch_hint pte_batch_hint
> +static inline unsigned int pte_batch_hint(pte_t *ptep, pte_t pte)
> +{
> +	if (!pte_valid_cont(pte))
> +		return 1;
> +
> +	return CONT_PTES - (((unsigned long)ptep >> 3) & (CONT_PTES - 1));
> +}
> +
>  /*
>   * The below functions constitute the public API that arm64 presents to the
>   * core-mm to manipulate PTE entries within their page tables (or at least this
> -- 
> 2.25.1
> 

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH v5 24/25] arm64/mm: __always_inline to improve fork() perf
  2024-02-02  8:07   ` Ryan Roberts
  (?)
@ 2024-02-13 16:55     ` Mark Rutland
  -1 siblings, 0 replies; 240+ messages in thread
From: Mark Rutland @ 2024-02-13 16:55 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	David Hildenbrand, Kefeng Wang, John Hubbard, Zi Yan, Barry Song,
	Alistair Popple, Yang Shi, Nicholas Piggin, Christophe Leroy,
	Aneesh Kumar K.V, Naveen N. Rao, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, linux-arm-kernel,
	x86, linuxppc-dev, linux-mm, linux-kernel

On Fri, Feb 02, 2024 at 08:07:55AM +0000, Ryan Roberts wrote:
> As set_ptes() and wrprotect_ptes() become a bit more complex, the
> compiler may choose not to inline them. But this is critical for fork()
> performance. So mark the functions, along with contpte_try_unfold()
> which is called by them, as __always_inline. This is worth ~1% on the
> fork() microbenchmark with order-0 folios (the common case).
> 
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>

I have no strong feelings either way on this, so:

Acked-by: Mark Rutland <mark.rutland@arm.com>

Mark.

> ---
>  arch/arm64/include/asm/pgtable.h | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 353ea67b5d75..cdc310880a3b 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -1213,8 +1213,8 @@ extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
>  				unsigned long addr, pte_t *ptep,
>  				pte_t entry, int dirty);
>  
> -static inline void contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
> -					pte_t *ptep, pte_t pte)
> +static __always_inline void contpte_try_unfold(struct mm_struct *mm,
> +				unsigned long addr, pte_t *ptep, pte_t pte)
>  {
>  	if (unlikely(pte_valid_cont(pte)))
>  		__contpte_try_unfold(mm, addr, ptep, pte);
> @@ -1279,7 +1279,7 @@ static inline void set_pte(pte_t *ptep, pte_t pte)
>  }
>  
>  #define set_ptes set_ptes
> -static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
> +static __always_inline void set_ptes(struct mm_struct *mm, unsigned long addr,
>  				pte_t *ptep, pte_t pte, unsigned int nr)
>  {
>  	pte = pte_mknoncont(pte);
> @@ -1361,8 +1361,8 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>  }
>  
>  #define wrprotect_ptes wrprotect_ptes
> -static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
> -				pte_t *ptep, unsigned int nr)
> +static __always_inline void wrprotect_ptes(struct mm_struct *mm,
> +				unsigned long addr, pte_t *ptep, unsigned int nr)
>  {
>  	if (likely(nr == 1)) {
>  		/*
> -- 
> 2.25.1
> 

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH v5 25/25] arm64/mm: Automatically fold contpte mappings
  2024-02-02  8:07   ` Ryan Roberts
@ 2024-02-13 17:44     ` Mark Rutland
  -1 siblings, 0 replies; 240+ messages in thread
From: Mark Rutland @ 2024-02-13 17:44 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	David Hildenbrand, Kefeng Wang, John Hubbard, Zi Yan, Barry Song,
	Alistair Popple, Yang Shi, Nicholas Piggin, Christophe Leroy,
	Aneesh Kumar K.V, Naveen N. Rao, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, linux-arm-kernel,
	x86, linuxppc-dev, linux-mm, linux-kernel

On Fri, Feb 02, 2024 at 08:07:56AM +0000, Ryan Roberts wrote:
> There are situations where a change to a single PTE could cause the
> contpte block in which it resides to become foldable (i.e. could be
> repainted with the contiguous bit). Such situations arise, for example,
> when user space temporarily changes protections, via mprotect, for
> individual pages, as can be the case for certain garbage collectors.
> 
> We would like to detect when such a PTE change occurs. However this can
> be expensive due to the amount of checking required. Therefore only
> perform the checks when an individual PTE is modified via mprotect
> (ptep_modify_prot_commit() -> set_pte_at() -> set_ptes(nr=1)) and only
> when we are setting the final PTE in a contpte-aligned block.
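
As a concrete illustration of the trigger condition (the numbers below
assume a 4K-page kernel where CONT_PTES is 16, i.e. a 64K contpte block;
they are not taken from the patch itself):

	/* the fold check only runs for the last entry of a contpte block */
	contmask = CONT_PTES - 1;                       /* 0xf when CONT_PTES == 16 */
	((0x4f000 >> PAGE_SHIFT) & contmask) == 0xf     /* index 15: try to fold    */
	((0x47000 >> PAGE_SHIFT) & contmask) == 0x7     /* index  7: skip           */
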
> 
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  arch/arm64/include/asm/pgtable.h | 26 +++++++++++++
>  arch/arm64/mm/contpte.c          | 64 ++++++++++++++++++++++++++++++++
>  2 files changed, 90 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index cdc310880a3b..d3357fe4eb89 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -1192,6 +1192,8 @@ void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte);
>   * where it is possible and makes sense to do so. The PTE_CONT bit is considered
>   * a private implementation detail of the public ptep API (see below).
>   */
> +extern void __contpte_try_fold(struct mm_struct *mm, unsigned long addr,
> +				pte_t *ptep, pte_t pte);
>  extern void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>  				pte_t *ptep, pte_t pte);
>  extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
> @@ -1213,6 +1215,29 @@ extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
>  				unsigned long addr, pte_t *ptep,
>  				pte_t entry, int dirty);
>  
> +static __always_inline void contpte_try_fold(struct mm_struct *mm,
> +				unsigned long addr, pte_t *ptep, pte_t pte)
> +{
> +	/*
> +	 * Only bother trying if both the virtual and physical addresses are
> +	 * aligned and correspond to the last entry in a contig range. The core
> +	 * code mostly modifies ranges from low to high, so this is likely
> +	 * the last modification in the contig range, so a good time to fold.
> +	 * We can't fold special mappings, because there is no associated folio.
> +	 */
> +
> +	const unsigned long contmask = CONT_PTES - 1;
> +	bool valign = ((addr >> PAGE_SHIFT) & contmask) == contmask;
> +
> +	if (unlikely(valign)) {
> +		bool palign = (pte_pfn(pte) & contmask) == contmask;
> +
> +		if (unlikely(palign &&
> +		    pte_valid(pte) && !pte_cont(pte) && !pte_special(pte)))
> +			__contpte_try_fold(mm, addr, ptep, pte);
> +	}
> +}
> +
>  static __always_inline void contpte_try_unfold(struct mm_struct *mm,
>  				unsigned long addr, pte_t *ptep, pte_t pte)
>  {
> @@ -1287,6 +1312,7 @@ static __always_inline void set_ptes(struct mm_struct *mm, unsigned long addr,
>  	if (likely(nr == 1)) {
>  		contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>  		__set_ptes(mm, addr, ptep, pte, 1);
> +		contpte_try_fold(mm, addr, ptep, pte);
>  	} else {
>  		contpte_set_ptes(mm, addr, ptep, pte, nr);
>  	}
> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> index 80346108450b..2c7dafd0552a 100644
> --- a/arch/arm64/mm/contpte.c
> +++ b/arch/arm64/mm/contpte.c
> @@ -67,6 +67,70 @@ static void contpte_convert(struct mm_struct *mm, unsigned long addr,
>  	__set_ptes(mm, start_addr, start_ptep, pte, CONT_PTES);
>  }
>  
> +void __contpte_try_fold(struct mm_struct *mm, unsigned long addr,
> +			pte_t *ptep, pte_t pte)
> +{
> +	/*
> +	 * We have already checked that the virtual and physical addresses are
> +	 * correctly aligned for a contpte mapping in contpte_try_fold() so the
> +	 * remaining checks are to ensure that the contpte range is fully
> +	 * covered by a single folio, and ensure that all the ptes are valid
> +	 * with contiguous PFNs and matching prots. We ignore the state of the
> +	 * access and dirty bits for the purpose of deciding if it's a contiguous
> +	 * range; the folding process will generate a single contpte entry which
> +	 * has a single access and dirty bit. Those 2 bits are the logical OR of
> +	 * their respective bits in the constituent pte entries. In order to
> +	 * ensure the contpte range is covered by a single folio, we must
> +	 * recover the folio from the pfn, but special mappings don't have a
> +	 * folio backing them. Fortunately contpte_try_fold() already checked
> +	 * that the pte is not special - we never try to fold special mappings.
> +	 * Note we can't use vm_normal_page() for this since we don't have the
> +	 * vma.
> +	 */
> +
> +	unsigned long folio_saddr, folio_eaddr;
> +	unsigned long cont_saddr, cont_eaddr;
> +	pte_t expected_pte, subpte;
> +	struct folio *folio;
> +	struct page *page;
> +	unsigned long pfn;
> +	pte_t *orig_ptep;
> +	pgprot_t prot;
> +
> +	int i;
> +
> +	if (!mm_is_user(mm))
> +		return;
> +
> +	page = pte_page(pte);
> +	folio = page_folio(page);
> +	folio_saddr = addr - (page - &folio->page) * PAGE_SIZE;
> +	folio_eaddr = folio_saddr + folio_nr_pages(folio) * PAGE_SIZE;
> +	cont_saddr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
> +	cont_eaddr = cont_saddr + CONT_PTE_SIZE;

I assume that the 's' in *_saddr is for "start", and the 'e' in *_eaddr is for
"end". Could we use "start" and "end" directly, e.g. folio_start, folio_end?

> +
> +	if (folio_saddr > cont_saddr || folio_eaddr < cont_eaddr)
> +		return;
> +
> +	pfn = pte_pfn(pte) - ((addr - cont_saddr) >> PAGE_SHIFT);

IIUC this should be the same as:

	pfn = ALIGN_DOWN(pte_pfn(pte), NR_CONT_PTES);

... which would align with the way we generate 'cont_saddr' above.
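
FWIW the two expressions do agree given the checks contpte_try_fold() has
already made (using CONT_PTES, the name used elsewhere in the series):

	(addr - cont_saddr) >> PAGE_SHIFT == (addr >> PAGE_SHIFT) & (CONT_PTES - 1)
	                                  == CONT_PTES - 1           /* valign */
	pte_pfn(pte) & (CONT_PTES - 1)    == CONT_PTES - 1           /* palign */

	so pte_pfn(pte) - (CONT_PTES - 1) == ALIGN_DOWN(pte_pfn(pte), CONT_PTES)
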

Otherwise, this looks good to me.

Mark.

> +	prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
> +	expected_pte = pfn_pte(pfn, prot);
> +	orig_ptep = ptep;
> +	ptep = contpte_align_down(ptep);
> +
> +	for (i = 0; i < CONT_PTES; i++) {
> +		subpte = pte_mkold(pte_mkclean(__ptep_get(ptep)));
> +		if (!pte_same(subpte, expected_pte))
> +			return;
> +		expected_pte = pte_advance_pfn(expected_pte, 1);
> +		ptep++;
> +	}
> +
> +	pte = pte_mkcont(pte);
> +	contpte_convert(mm, addr, orig_ptep, pte);
> +}
> +EXPORT_SYMBOL(__contpte_try_fold);
> +
>  void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>  			pte_t *ptep, pte_t pte)
>  {
> -- 
> 2.25.1
> 

^ permalink raw reply	[flat|nested] 240+ messages in thread

* Re: [PATCH v5 25/25] arm64/mm: Automatically fold contpte mappings
  2024-02-13 17:44     ` Mark Rutland
@ 2024-02-13 18:05       ` Ryan Roberts
  -1 siblings, 0 replies; 240+ messages in thread
From: Ryan Roberts @ 2024-02-13 18:05 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
	David Hildenbrand, Kefeng Wang, John Hubbard, Zi Yan, Barry Song,
	Alistair Popple, Yang Shi, Nicholas Piggin, Christophe Leroy,
	Aneesh Kumar K.V, Naveen N. Rao, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, linux-arm-kernel,
	x86, linuxppc-dev, linux-mm, linux-kernel

On 13/02/2024 17:44, Mark Rutland wrote:
> On Fri, Feb 02, 2024 at 08:07:56AM +0000, Ryan Roberts wrote:
>> There are situations where a change to a single PTE could cause the
>> contpte block in which it resides to become foldable (i.e. could be
>> repainted with the contiguous bit). Such situations arise, for example,
>> when user space temporarily changes protections, via mprotect, for
>> individual pages, as can be the case for certain garbage collectors.
>>
>> We would like to detect when such a PTE change occurs. However this can
>> be expensive due to the amount of checking required. Therefore only
>> perform the checks when an individual PTE is modified via mprotect
>> (ptep_modify_prot_commit() -> set_pte_at() -> set_ptes(nr=1)) and only
>> when we are setting the final PTE in a contpte-aligned block.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>  arch/arm64/include/asm/pgtable.h | 26 +++++++++++++
>>  arch/arm64/mm/contpte.c          | 64 ++++++++++++++++++++++++++++++++
>>  2 files changed, 90 insertions(+)
>>
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index cdc310880a3b..d3357fe4eb89 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -1192,6 +1192,8 @@ void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte);
>>   * where it is possible and makes sense to do so. The PTE_CONT bit is considered
>>   * a private implementation detail of the public ptep API (see below).
>>   */
>> +extern void __contpte_try_fold(struct mm_struct *mm, unsigned long addr,
>> +				pte_t *ptep, pte_t pte);
>>  extern void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>>  				pte_t *ptep, pte_t pte);
>>  extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
>> @@ -1213,6 +1215,29 @@ extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
>>  				unsigned long addr, pte_t *ptep,
>>  				pte_t entry, int dirty);
>>  
>> +static __always_inline void contpte_try_fold(struct mm_struct *mm,
>> +				unsigned long addr, pte_t *ptep, pte_t pte)
>> +{
>> +	/*
>> +	 * Only bother trying if both the virtual and physical addresses are
>> +	 * aligned and correspond to the last entry in a contig range. The core
>> +	 * code mostly modifies ranges from low to high, so this is likely
>> +	 * the last modification in the contig range, so a good time to fold.
>> +	 * We can't fold special mappings, because there is no associated folio.
>> +	 */
>> +
>> +	const unsigned long contmask = CONT_PTES - 1;
>> +	bool valign = ((addr >> PAGE_SHIFT) & contmask) == contmask;
>> +
>> +	if (unlikely(valign)) {
>> +		bool palign = (pte_pfn(pte) & contmask) == contmask;
>> +
>> +		if (unlikely(palign &&
>> +		    pte_valid(pte) && !pte_cont(pte) && !pte_special(pte)))
>> +			__contpte_try_fold(mm, addr, ptep, pte);
>> +	}
>> +}
>> +
>>  static __always_inline void contpte_try_unfold(struct mm_struct *mm,
>>  				unsigned long addr, pte_t *ptep, pte_t pte)
>>  {
>> @@ -1287,6 +1312,7 @@ static __always_inline void set_ptes(struct mm_struct *mm, unsigned long addr,
>>  	if (likely(nr == 1)) {
>>  		contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>>  		__set_ptes(mm, addr, ptep, pte, 1);
>> +		contpte_try_fold(mm, addr, ptep, pte);
>>  	} else {
>>  		contpte_set_ptes(mm, addr, ptep, pte, nr);
>>  	}
>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>> index 80346108450b..2c7dafd0552a 100644
>> --- a/arch/arm64/mm/contpte.c
>> +++ b/arch/arm64/mm/contpte.c
>> @@ -67,6 +67,70 @@ static void contpte_convert(struct mm_struct *mm, unsigned long addr,
>>  	__set_ptes(mm, start_addr, start_ptep, pte, CONT_PTES);
>>  }
>>  
>> +void __contpte_try_fold(struct mm_struct *mm, unsigned long addr,
>> +			pte_t *ptep, pte_t pte)
>> +{
>> +	/*
>> +	 * We have already checked that the virtual and physical addresses are
>> +	 * correctly aligned for a contpte mapping in contpte_try_fold() so the
>> +	 * remaining checks are to ensure that the contpte range is fully
>> +	 * covered by a single folio, and ensure that all the ptes are valid
>> +	 * with contiguous PFNs and matching prots. We ignore the state of the
>> +	 * access and dirty bits for the purpose of deciding if it's a contiguous
>> +	 * range; the folding process will generate a single contpte entry which
>> +	 * has a single access and dirty bit. Those 2 bits are the logical OR of
>> +	 * their respective bits in the constituent pte entries. In order to
>> +	 * ensure the contpte range is covered by a single folio, we must
>> +	 * recover the folio from the pfn, but special mappings don't have a
>> +	 * folio backing them. Fortunately contpte_try_fold() already checked
>> +	 * that the pte is not special - we never try to fold special mappings.
>> +	 * Note we can't use vm_normal_page() for this since we don't have the
>> +	 * vma.
>> +	 */
>> +
>> +	unsigned long folio_saddr, folio_eaddr;
>> +	unsigned long cont_saddr, cont_eaddr;
>> +	pte_t expected_pte, subpte;
>> +	struct folio *folio;
>> +	struct page *page;
>> +	unsigned long pfn;
>> +	pte_t *orig_ptep;
>> +	pgprot_t prot;
>> +
>> +	int i;
>> +
>> +	if (!mm_is_user(mm))
>> +		return;
>> +
>> +	page = pte_page(pte);
>> +	folio = page_folio(page);
>> +	folio_saddr = addr - (page - &folio->page) * PAGE_SIZE;
>> +	folio_eaddr = folio_saddr + folio_nr_pages(folio) * PAGE_SIZE;
>> +	cont_saddr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>> +	cont_eaddr = cont_saddr + CONT_PTE_SIZE;
> 
> I assume that the 's' in *_saddr is for "start", and the 'e' in *_eaddr is for
> "end". Could we use "start" and "end" directly, e.g. folio_start, folio_end?

ACK; will fix.

> 
>> +
>> +	if (folio_saddr > cont_saddr || folio_eaddr < cont_eaddr)
>> +		return;
>> +
>> +	pfn = pte_pfn(pte) - ((addr - cont_saddr) >> PAGE_SHIFT);
> 
> IIUC this should be the same as:
> 
> 	pfn = ALIGN_DOWN(pte_pfn(pte), NR_CONT_PTES);
> 
> ... which would align with the way we generate 'cont_saddr' above.

ACK; will fix.

> 
> Otherwise, this looks good to me.

Great thanks!

I'll get these changes done and rebase onto mm-unstable once David's zap
batching series is in, retest and re-post (hopefully in the next couple of days!)

> 
> Mark.
> 
>> +	prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
>> +	expected_pte = pfn_pte(pfn, prot);
>> +	orig_ptep = ptep;
>> +	ptep = contpte_align_down(ptep);
>> +
>> +	for (i = 0; i < CONT_PTES; i++) {
>> +		subpte = pte_mkold(pte_mkclean(__ptep_get(ptep)));
>> +		if (!pte_same(subpte, expected_pte))
>> +			return;
>> +		expected_pte = pte_advance_pfn(expected_pte, 1);
>> +		ptep++;
>> +	}
>> +
>> +	pte = pte_mkcont(pte);
>> +	contpte_convert(mm, addr, orig_ptep, pte);
>> +}
>> +EXPORT_SYMBOL(__contpte_try_fold);
>> +
>>  void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>>  			pte_t *ptep, pte_t pte)
>>  {
>> -- 
>> 2.25.1
>>


^ permalink raw reply	[flat|nested] 240+ messages in thread

end of thread, other threads:[~2024-02-13 18:06 UTC | newest]

Thread overview: 240+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-02-02  8:07 [PATCH v5 00/25] Transparent Contiguous PTEs for User Mappings Ryan Roberts
2024-02-02  8:07 ` Ryan Roberts
2024-02-02  8:07 ` Ryan Roberts
2024-02-02  8:07 ` [PATCH v5 01/25] mm: Clarify the spec for set_ptes() Ryan Roberts
2024-02-02  8:07   ` Ryan Roberts
2024-02-02  8:07   ` Ryan Roberts
2024-02-12 12:03   ` David Hildenbrand
2024-02-12 12:03     ` David Hildenbrand
2024-02-12 12:03     ` David Hildenbrand
2024-02-02  8:07 ` [PATCH v5 02/25] mm: thp: Batch-collapse PMD with set_ptes() Ryan Roberts
2024-02-02  8:07   ` Ryan Roberts
2024-02-02  8:07   ` Ryan Roberts
2024-02-02  8:07 ` [PATCH v5 03/25] mm: Make pte_next_pfn() a wrapper around pte_advance_pfn() Ryan Roberts
2024-02-02  8:07   ` Ryan Roberts
2024-02-02  8:07   ` Ryan Roberts
2024-02-12 12:14   ` David Hildenbrand
2024-02-12 12:14     ` David Hildenbrand
2024-02-12 12:14     ` David Hildenbrand
2024-02-12 14:10     ` Ryan Roberts
2024-02-12 14:10       ` Ryan Roberts
2024-02-12 14:10       ` Ryan Roberts
2024-02-12 14:29       ` David Hildenbrand
2024-02-12 14:29         ` David Hildenbrand
2024-02-12 14:29         ` David Hildenbrand
2024-02-12 21:34         ` Ryan Roberts
2024-02-12 21:34           ` Ryan Roberts
2024-02-12 21:34           ` Ryan Roberts
2024-02-13  9:54           ` David Hildenbrand
2024-02-13  9:54             ` David Hildenbrand
2024-02-13  9:54             ` David Hildenbrand
2024-02-02  8:07 ` [PATCH v5 04/25] arm/mm: Convert pte_next_pfn() to pte_advance_pfn() Ryan Roberts
2024-02-02  8:07   ` Ryan Roberts
2024-02-02  8:07   ` Ryan Roberts
2024-02-02  8:07 ` [PATCH v5 05/25] arm64/mm: " Ryan Roberts
2024-02-02  8:07   ` Ryan Roberts
2024-02-02  8:07   ` Ryan Roberts
2024-02-02  8:07 ` [PATCH v5 06/25] powerpc/mm: " Ryan Roberts
2024-02-02  8:07   ` Ryan Roberts
2024-02-02  8:07   ` Ryan Roberts
2024-02-02  8:07 ` [PATCH v5 07/25] x86/mm: " Ryan Roberts
2024-02-02  8:07   ` Ryan Roberts
2024-02-02  8:07   ` Ryan Roberts
2024-02-02  8:07 ` [PATCH v5 08/25] mm: Remove pte_next_pfn() and replace with pte_advance_pfn() Ryan Roberts
2024-02-02  8:07   ` Ryan Roberts
2024-02-02  8:07   ` Ryan Roberts
2024-02-02  8:07 ` [PATCH v5 09/25] arm64/mm: set_pte(): New layer to manage contig bit Ryan Roberts
2024-02-02  8:07   ` Ryan Roberts
2024-02-02  8:07   ` Ryan Roberts
2024-02-02  8:07 ` [PATCH v5 10/25] arm64/mm: set_ptes()/set_pte_at(): " Ryan Roberts
2024-02-02  8:07   ` Ryan Roberts
2024-02-02  8:07   ` Ryan Roberts
2024-02-02  8:07 ` [PATCH v5 11/25] arm64/mm: pte_clear(): " Ryan Roberts
2024-02-02  8:07   ` Ryan Roberts
2024-02-02  8:07   ` Ryan Roberts
2024-02-02  8:07 ` [PATCH v5 12/25] arm64/mm: ptep_get_and_clear(): " Ryan Roberts
2024-02-02  8:07   ` Ryan Roberts
2024-02-02  8:07   ` Ryan Roberts
2024-02-02  8:07 ` [PATCH v5 13/25] arm64/mm: ptep_test_and_clear_young(): " Ryan Roberts
2024-02-02  8:07   ` Ryan Roberts
2024-02-02  8:07   ` Ryan Roberts
2024-02-02  8:07 ` [PATCH v5 14/25] arm64/mm: ptep_clear_flush_young(): " Ryan Roberts
2024-02-02  8:07   ` Ryan Roberts
2024-02-02  8:07   ` Ryan Roberts
2024-02-02  8:07 ` [PATCH v5 15/25] arm64/mm: ptep_set_wrprotect(): " Ryan Roberts
2024-02-02  8:07   ` Ryan Roberts
2024-02-02  8:07   ` Ryan Roberts
2024-02-02  8:07 ` [PATCH v5 16/25] arm64/mm: ptep_set_access_flags(): " Ryan Roberts
2024-02-02  8:07   ` Ryan Roberts
2024-02-02  8:07   ` Ryan Roberts
2024-02-02  8:07 ` [PATCH v5 17/25] arm64/mm: ptep_get(): " Ryan Roberts
2024-02-02  8:07   ` Ryan Roberts
2024-02-02  8:07   ` Ryan Roberts
2024-02-02  8:07 ` [PATCH v5 18/25] arm64/mm: Split __flush_tlb_range() to elide trailing DSB Ryan Roberts
2024-02-02  8:07   ` Ryan Roberts
2024-02-02  8:07   ` Ryan Roberts
2024-02-12 12:44   ` David Hildenbrand
2024-02-12 12:44     ` David Hildenbrand
2024-02-12 12:44     ` David Hildenbrand
2024-02-12 13:05     ` Ryan Roberts
2024-02-12 13:05       ` Ryan Roberts
2024-02-12 13:05       ` Ryan Roberts
2024-02-12 13:15       ` David Hildenbrand
2024-02-12 13:15         ` David Hildenbrand
2024-02-12 13:15         ` David Hildenbrand
2024-02-12 13:27         ` Ryan Roberts
2024-02-02  8:07 ` [PATCH v5 19/25] arm64/mm: Wire up PTE_CONT for user mappings Ryan Roberts
2024-02-12 12:00   ` Mark Rutland
2024-02-12 12:59     ` Ryan Roberts
2024-02-12 13:54       ` David Hildenbrand
2024-02-12 14:45         ` Ryan Roberts
2024-02-12 15:26           ` David Hildenbrand
2024-02-12 15:34             ` Ryan Roberts
2024-02-12 16:24               ` David Hildenbrand
2024-02-13 15:29                 ` Ryan Roberts
2024-02-12 15:30       ` Ryan Roberts
2024-02-12 20:38         ` Ryan Roberts
2024-02-13 10:01           ` David Hildenbrand
2024-02-13 12:06           ` Ryan Roberts
2024-02-13 12:19             ` David Hildenbrand
2024-02-13 13:06               ` Ryan Roberts
2024-02-13 13:13                 ` David Hildenbrand
2024-02-13 13:20                   ` Ryan Roberts
2024-02-13 13:22                     ` David Hildenbrand
2024-02-13 13:24                       ` Ryan Roberts
2024-02-13 13:33                     ` Ard Biesheuvel
2024-02-13 13:45                       ` David Hildenbrand
2024-02-13 14:02                         ` Ryan Roberts
2024-02-13 14:05                           ` David Hildenbrand
2024-02-13 14:08                             ` Ard Biesheuvel
2024-02-13 14:21                               ` Ryan Roberts
2024-02-13 12:02       ` Mark Rutland
2024-02-13 13:03         ` Ryan Roberts
2024-02-02  8:07 ` [PATCH v5 20/25] arm64/mm: Implement new wrprotect_ptes() batch API Ryan Roberts
2024-02-13 16:31   ` Mark Rutland
2024-02-13 16:36     ` Ryan Roberts
2024-02-02  8:07 ` [PATCH v5 21/25] arm64/mm: Implement new [get_and_]clear_full_ptes() batch APIs Ryan Roberts
2024-02-13 16:43   ` Mark Rutland
2024-02-13 16:48     ` Ryan Roberts
2024-02-13 16:53       ` Mark Rutland
2024-02-02  8:07 ` [PATCH v5 22/25] mm: Add pte_batch_hint() to reduce scanning in folio_pte_batch() Ryan Roberts
2024-02-12 13:43   ` David Hildenbrand
2024-02-12 15:00     ` Ryan Roberts
2024-02-12 15:47     ` Ryan Roberts
2024-02-12 16:27       ` David Hildenbrand
2024-02-02  8:07 ` [PATCH v5 23/25] arm64/mm: Implement pte_batch_hint() Ryan Roberts
2024-02-12 13:46   ` David Hildenbrand
2024-02-13 16:54   ` Mark Rutland
2024-02-02  8:07 ` [PATCH v5 24/25] arm64/mm: __always_inline to improve fork() perf Ryan Roberts
2024-02-13 16:55   ` Mark Rutland
2024-02-02  8:07 ` [PATCH v5 25/25] arm64/mm: Automatically fold contpte mappings Ryan Roberts
2024-02-13 17:44   ` Mark Rutland
2024-02-13 18:05     ` Ryan Roberts
2024-02-08 17:34 ` [PATCH v5 00/25] Transparent Contiguous PTEs for User Mappings Mark Rutland
2024-02-09  8:54   ` Ryan Roberts
2024-02-09 22:16     ` David Hildenbrand
2024-02-09 23:52       ` Ryan Roberts
