* [PATCH v1 00/14] Transparent Contiguous PTEs for User Mappings
@ 2023-06-22 14:41 ` Ryan Roberts
  0 siblings, 0 replies; 46+ messages in thread
From: Ryan Roberts @ 2023-06-22 14:41 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	Oliver Upton, James Morse, Suzuki K Poulose, Zenghui Yu,
	Andrey Ryabinin, Alexander Potapenko, Andrey Konovalov,
	Dmitry Vyukov, Vincenzo Frascino, Andrew Morton,
	Anshuman Khandual, Matthew Wilcox, Yu Zhao, Mark Rutland
  Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-mm

Hi All,

This is a series to opportunistically and transparently use contpte mappings
(set the contiguous bit in ptes) for user memory when those mappings meet the
requirements. It is part of a wider effort to improve the performance of the 4K
kernel with the aim of approaching the performance of the 16K kernel, but
without breaking compatibility and without the associated increase in memory
use. It also benefits the 16K and 64K kernels by enabling 2M THP, since that is
the contpte size for those kernels.

Of course this is only one half of the change. We require the mapped physical
memory to be the correct size and alignment for this to actually be useful (i.e.
64K for 4K pages, or 2M for 16K/64K pages). Fortunately folios are solving this
problem for us. Filesystems that support it (XFS, AFS, EROFS, tmpfs) will
allocate large folios up to the PMD size today, and more filesystems are coming.
And the other half of my work, to enable the use of large folios for anonymous
memory, aims to make contpte-sized folios prevalent for anonymous memory too.
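
To make the size/alignment requirement concrete, here is a small standalone
sketch (purely illustrative; can_use_contpte() and its constants are names made
up for this cover letter, not code from the series) of the kind of check the
arch side can make when it is handed a whole batch of ptes, e.g. via the
batched set_ptes() API mentioned in the dependencies below: the batch must
cover a full contpte block and be naturally aligned in both the virtual and
physical address spaces. Shown for 4K base pages, where 16 ptes cover a 64K
block; for 16K and 64K pages the block is 2M.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT      12                      /* 4K base pages */
#define CONT_PTES       16                      /* ptes per contpte block */
#define CONT_MASK       ((uint64_t)CONT_PTES - 1)

/* True if [addr, addr + nr*4K) could be mapped with the contiguous bit set. */
static bool can_use_contpte(uint64_t addr, uint64_t pfn, unsigned int nr)
{
        return nr == CONT_PTES &&                         /* whole block */
               ((addr >> PAGE_SHIFT) & CONT_MASK) == 0 && /* VA aligned  */
               (pfn & CONT_MASK) == 0;                    /* PA aligned  */
}

int main(void)
{
        /* 64K-aligned VA backed by a 64K-aligned, 64K-sized run of pages. */
        printf("%d\n", can_use_contpte(0x10000, 0x800, 16));
        /* Same VA, but the physical run starts misaligned. */
        printf("%d\n", can_use_contpte(0x10000, 0x801, 16));
        return 0;
}

The first call prints 1 (eligible); the second prints 0 because the physical
run is not 64K-aligned.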


Dependencies
------------

While this patch set has a complicated set of hard and soft dependencies, I
wanted to split it out as best I could and kick off proper review
independently.

The series applies on top of these other patch sets, with a tree at:
https://gitlab.arm.com/linux-arm/linux-rr/-/tree/features/granule_perf/contpte-lkml_v1

v6.4-rc6
  - base

set_ptes()
  - hard dependency
  - Patch set from Matthew Wilcox to set multiple ptes with a single API call
  - Allows the arch backend to apply contpte mappings more optimally
  - https://lore.kernel.org/linux-mm/20230315051444.3229621-1-willy@infradead.org/

ptep_get() pte encapsulation
  - hard dependency
  - Enabler series from me to ensure none of the core code ever directly
    dereferences a pte_t that lies within a live page table.
  - Enables gathering access/dirty bits from across the whole contpte range
  - In mm-stable and linux-next at the time of writing
  - https://lore.kernel.org/linux-mm/d38dc237-6093-d4c5-993e-e8ffdd6cb6fa@arm.com/

Report on physically contiguous memory in smaps
  - soft dependency
  - Enables visibility into how much memory is physically contiguous and how
    much is contpte-mapped - useful for debug
  - https://lore.kernel.org/linux-mm/20230613160950.3554675-1-ryan.roberts@arm.com/

Additionally there are a couple of other dependencies:

anonfolio
  - soft dependency
  - Ensures more anonymous memory is allocated in contpte-sized folios, which
    is needed to realize the performance improvements (this is the "other half"
    mentioned above).
  - RFC: https://lore.kernel.org/linux-mm/20230414130303.2345383-1-ryan.roberts@arm.com/
  - Intending to post v1 shortly.

exefolio
  - soft dependency
  - Tweaks readahead to ensure executable memory is in 64K-sized folios, which
    is needed to see the reduction in iTLB pressure.
  - I don't intend to post this until we are further down the track with
    contpte and anonfolio.

Arm ARM Clarification
  - hard dependency
  - Current wording disallows the fork() optimization in the final patch.
  - Arm (ATG) have proposed tightening the wording to permit it.
  - We are in conversation with partners to check this wouldn't cause problems
    for any existing HW deployments.

All of the _hard_ dependencies need to be resolved before this can be considered
for merging.


Performance
-----------

The results below cover 2 benchmarks: kernel compilation and Speedometer 2.0 (a
JavaScript benchmark running in Chromium). Both are run on an Ampere Altra with
1 NUMA node enabled, Ubuntu 22.04 and an XFS filesystem. Each benchmark is
repeated 15 times over 5 reboots and the results are averaged.

All improvements are relative to baseline-4k. anonfolio and exefolio are as
described above. contpte is this series. (Note that exefolio only gives an
improvement because contpte is already in place).

Kernel Compilation (smaller is better):

| kernel       |   real-time |   kern-time |   user-time |
|:-------------|------------:|------------:|------------:|
| baseline-4k  |        0.0% |        0.0% |        0.0% |
| anonfolio    |       -5.4% |      -46.0% |       -0.3% |
| contpte      |       -6.8% |      -45.7% |       -2.1% |
| exefolio     |       -8.4% |      -46.4% |       -3.7% |
| baseline-16k |       -8.7% |      -49.2% |       -3.7% |
| baseline-64k |      -10.5% |      -66.0% |       -3.5% |

Speedometer 2.0 (bigger is better):

| kernel       |   runs_per_min |
|:-------------|---------------:|
| baseline-4k  |           0.0% |
| anonfolio    |           1.2% |
| contpte      |           3.1% |
| exefolio     |           4.2% |
| baseline-16k |           5.3% |

I've also run Speedometer 2.0 on Pixel 6 with an Ubuntu SW stack and see similar
gains.

I've also verified that running the contpte changes without anonfolio and
exefolio does not cause any regression vs baseline-4k.


Opens
-----

The only potential issue I see right now is that, because there is only 1
access/dirty bit per contpte range, if a single page in the range is
accessed/dirtied then all the adjacent pages are reported as accessed/dirtied
too. Access/dirty is managed by the kernel per _folio_, so this information gets
collapsed down anyway, and nothing changes there. However, the per _page_
access/dirty information is reported through pagemap to user space. I'm not sure
whether this would/should be considered a break? Thoughts?
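
To make that concrete, here is a small userspace mock (purely illustrative;
mock_pte, mock_ptep_get() and this CONT_PTES are names invented for this cover
letter, not the arm64 implementation): when a read of any single pte gathers
the access/dirty state from across its contpte block, dirtying one page makes
every page in the block report as accessed/dirty.

#include <stdio.h>
#include <string.h>

#define CONT_PTES 16    /* e.g. 4K base pages: 16 ptes -> one 64K block */

struct mock_pte { unsigned char accessed, dirty; };

/*
 * Mock of a "gathering" read: start from the requested pte, then OR in the
 * access/dirty state of every pte in the same contpte block.
 */
static struct mock_pte mock_ptep_get(const struct mock_pte *block, int i)
{
        struct mock_pte res = block[i];
        int j;

        for (j = 0; j < CONT_PTES; j++) {
                res.accessed |= block[j].accessed;
                res.dirty |= block[j].dirty;
        }
        return res;
}

int main(void)
{
        struct mock_pte block[CONT_PTES];
        int i;

        memset(block, 0, sizeof(block));
        block[3].accessed = 1;
        block[3].dirty = 1;     /* touch and dirty a single page */

        for (i = 0; i < CONT_PTES; i++) {
                struct mock_pte pte = mock_ptep_get(block, i);

                printf("page %2d: accessed=%d dirty=%d\n",
                       i, pte.accessed, pte.dirty);
        }
        return 0;
}

Every page in the block prints accessed=1 dirty=1, which is the behaviour user
space would now see via pagemap.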

Thanks,
Ryan


Ryan Roberts (14):
  arm64/mm: set_pte(): New layer to manage contig bit
  arm64/mm: set_ptes()/set_pte_at(): New layer to manage contig bit
  arm64/mm: pte_clear(): New layer to manage contig bit
  arm64/mm: ptep_get_and_clear(): New layer to manage contig bit
  arm64/mm: ptep_test_and_clear_young(): New layer to manage contig bit
  arm64/mm: ptep_clear_flush_young(): New layer to manage contig bit
  arm64/mm: ptep_set_wrprotect(): New layer to manage contig bit
  arm64/mm: ptep_set_access_flags(): New layer to manage contig bit
  arm64/mm: ptep_get(): New layer to manage contig bit
  arm64/mm: Split __flush_tlb_range() to elide trailing DSB
  arm64/mm: Wire up PTE_CONT for user mappings
  arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown
  mm: Batch-copy PTE ranges during fork()
  arm64/mm: Implement ptep_set_wrprotects() to optimize fork()

 arch/arm64/include/asm/pgtable.h  | 305 +++++++++++++++++---
 arch/arm64/include/asm/tlbflush.h |  11 +-
 arch/arm64/kernel/efi.c           |   4 +-
 arch/arm64/kernel/mte.c           |   2 +-
 arch/arm64/kvm/guest.c            |   2 +-
 arch/arm64/mm/Makefile            |   3 +-
 arch/arm64/mm/contpte.c           | 443 ++++++++++++++++++++++++++++++
 arch/arm64/mm/fault.c             |  12 +-
 arch/arm64/mm/fixmap.c            |   4 +-
 arch/arm64/mm/hugetlbpage.c       |  40 +--
 arch/arm64/mm/kasan_init.c        |   6 +-
 arch/arm64/mm/mmu.c               |  16 +-
 arch/arm64/mm/pageattr.c          |   6 +-
 arch/arm64/mm/trans_pgd.c         |   6 +-
 include/linux/pgtable.h           |  13 +
 mm/memory.c                       | 149 +++++++---
 16 files changed, 896 insertions(+), 126 deletions(-)
 create mode 100644 arch/arm64/mm/contpte.c

--
2.25.1


* [PATCH v1 01/14] arm64/mm: set_pte(): New layer to manage contig bit
  2023-06-22 14:41 ` Ryan Roberts
@ 2023-06-22 14:41   ` Ryan Roberts
  -1 siblings, 0 replies; 46+ messages in thread
From: Ryan Roberts @ 2023-06-22 14:41 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	Oliver Upton, James Morse, Suzuki K Poulose, Zenghui Yu,
	Andrey Ryabinin, Alexander Potapenko, Andrey Konovalov,
	Dmitry Vyukov, Vincenzo Frascino, Andrew Morton,
	Anshuman Khandual, Matthew Wilcox, Yu Zhao, Mark Rutland
  Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-mm

Create a new layer for the in-table PTE manipulation APIs. For now, the
existing API is prefixed with double underscore to become the
arch-private API and the public API is just a simple wrapper that calls
the private API.

The public API implementation will subsequently be used to transparently
manipulate the contiguous bit where appropriate. But since there are
already some contig-aware users (e.g. hugetlb, kernel mapper), we must
first ensure those users use the private API directly so that the future
contig-bit manipulations in the public API do not interfere with those
existing uses.

No behavioural changes intended.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/arm64/include/asm/pgtable.h | 23 ++++++++++++++++++++---
 arch/arm64/kernel/efi.c          |  2 +-
 arch/arm64/mm/fixmap.c           |  2 +-
 arch/arm64/mm/kasan_init.c       |  4 ++--
 arch/arm64/mm/mmu.c              |  2 +-
 arch/arm64/mm/pageattr.c         |  2 +-
 arch/arm64/mm/trans_pgd.c        |  4 ++--
 7 files changed, 28 insertions(+), 11 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 6fd012663a01..7f5ce5687466 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -93,7 +93,8 @@ static inline pteval_t __phys_to_pte_val(phys_addr_t phys)
 	__pte(__phys_to_pte_val((phys_addr_t)(pfn) << PAGE_SHIFT) | pgprot_val(prot))
 
 #define pte_none(pte)		(!pte_val(pte))
-#define pte_clear(mm,addr,ptep)	set_pte(ptep, __pte(0))
+#define pte_clear(mm, addr, ptep) \
+				__set_pte(ptep, __pte(0))
 #define pte_page(pte)		(pfn_to_page(pte_pfn(pte)))
 
 /*
@@ -260,7 +261,7 @@ static inline pte_t pte_mkdevmap(pte_t pte)
 	return set_pte_bit(pte, __pgprot(PTE_DEVMAP | PTE_SPECIAL));
 }
 
-static inline void set_pte(pte_t *ptep, pte_t pte)
+static inline void __set_pte(pte_t *ptep, pte_t pte)
 {
 	WRITE_ONCE(*ptep, pte);
 
@@ -352,7 +353,7 @@ static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr,
 
 	__check_safe_pte_update(mm, ptep, pte);
 
-	set_pte(ptep, pte);
+	__set_pte(ptep, pte);
 }
 
 static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
@@ -1117,6 +1118,22 @@ extern pte_t ptep_modify_prot_start(struct vm_area_struct *vma,
 extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
 				    unsigned long addr, pte_t *ptep,
 				    pte_t old_pte, pte_t new_pte);
+
+/*
+ * The below functions constitute the public API that arm64 presents to the
+ * core-mm to manipulate PTE entries within their page tables (or at least
+ * this is the subset of the API that arm64 needs to implement). These public
+ * versions will automatically and transparently apply the contiguous bit where
+ * it makes sense to do so. Therefore any users that are contig-aware (e.g.
+ * hugetlb, kernel mapper) should NOT use these APIs, but instead use the
+ * private versions, which are prefixed with double underscore.
+ */
+
+static inline void set_pte(pte_t *ptep, pte_t pte)
+{
+	__set_pte(ptep, pte);
+}
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* __ASM_PGTABLE_H */
diff --git a/arch/arm64/kernel/efi.c b/arch/arm64/kernel/efi.c
index baab8dd3ead3..7a28b6a08a82 100644
--- a/arch/arm64/kernel/efi.c
+++ b/arch/arm64/kernel/efi.c
@@ -115,7 +115,7 @@ static int __init set_permissions(pte_t *ptep, unsigned long addr, void *data)
 	else if (IS_ENABLED(CONFIG_ARM64_BTI_KERNEL) &&
 		 system_supports_bti() && spd->has_bti)
 		pte = set_pte_bit(pte, __pgprot(PTE_GP));
-	set_pte(ptep, pte);
+	__set_pte(ptep, pte);
 	return 0;
 }
 
diff --git a/arch/arm64/mm/fixmap.c b/arch/arm64/mm/fixmap.c
index c0a3301203bd..51cd4501816d 100644
--- a/arch/arm64/mm/fixmap.c
+++ b/arch/arm64/mm/fixmap.c
@@ -121,7 +121,7 @@ void __set_fixmap(enum fixed_addresses idx,
 	ptep = fixmap_pte(addr);
 
 	if (pgprot_val(flags)) {
-		set_pte(ptep, pfn_pte(phys >> PAGE_SHIFT, flags));
+		__set_pte(ptep, pfn_pte(phys >> PAGE_SHIFT, flags));
 	} else {
 		pte_clear(&init_mm, addr, ptep);
 		flush_tlb_kernel_range(addr, addr+PAGE_SIZE);
diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c
index e969e68de005..40125b217195 100644
--- a/arch/arm64/mm/kasan_init.c
+++ b/arch/arm64/mm/kasan_init.c
@@ -112,7 +112,7 @@ static void __init kasan_pte_populate(pmd_t *pmdp, unsigned long addr,
 		if (!early)
 			memset(__va(page_phys), KASAN_SHADOW_INIT, PAGE_SIZE);
 		next = addr + PAGE_SIZE;
-		set_pte(ptep, pfn_pte(__phys_to_pfn(page_phys), PAGE_KERNEL));
+		__set_pte(ptep, pfn_pte(__phys_to_pfn(page_phys), PAGE_KERNEL));
 	} while (ptep++, addr = next, addr != end && pte_none(READ_ONCE(*ptep)));
 }
 
@@ -275,7 +275,7 @@ static void __init kasan_init_shadow(void)
 	 * so we should make sure that it maps the zero page read-only.
 	 */
 	for (i = 0; i < PTRS_PER_PTE; i++)
-		set_pte(&kasan_early_shadow_pte[i],
+		__set_pte(&kasan_early_shadow_pte[i],
 			pfn_pte(sym_to_pfn(kasan_early_shadow_page),
 				PAGE_KERNEL_RO));
 
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index af6bc8403ee4..c84dc87d08b9 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -178,7 +178,7 @@ static void init_pte(pmd_t *pmdp, unsigned long addr, unsigned long end,
 	do {
 		pte_t old_pte = READ_ONCE(*ptep);
 
-		set_pte(ptep, pfn_pte(__phys_to_pfn(phys), prot));
+		__set_pte(ptep, pfn_pte(__phys_to_pfn(phys), prot));
 
 		/*
 		 * After the PTE entry has been populated once, we
diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
index 8e2017ba5f1b..057097acf9e0 100644
--- a/arch/arm64/mm/pageattr.c
+++ b/arch/arm64/mm/pageattr.c
@@ -41,7 +41,7 @@ static int change_page_range(pte_t *ptep, unsigned long addr, void *data)
 	pte = clear_pte_bit(pte, cdata->clear_mask);
 	pte = set_pte_bit(pte, cdata->set_mask);
 
-	set_pte(ptep, pte);
+	__set_pte(ptep, pte);
 	return 0;
 }
 
diff --git a/arch/arm64/mm/trans_pgd.c b/arch/arm64/mm/trans_pgd.c
index 4ea2eefbc053..f9997b226614 100644
--- a/arch/arm64/mm/trans_pgd.c
+++ b/arch/arm64/mm/trans_pgd.c
@@ -40,7 +40,7 @@ static void _copy_pte(pte_t *dst_ptep, pte_t *src_ptep, unsigned long addr)
 		 * read only (code, rodata). Clear the RDONLY bit from
 		 * the temporary mappings we use during restore.
 		 */
-		set_pte(dst_ptep, pte_mkwrite(pte));
+		__set_pte(dst_ptep, pte_mkwrite(pte));
 	} else if (debug_pagealloc_enabled() && !pte_none(pte)) {
 		/*
 		 * debug_pagealloc will removed the PTE_VALID bit if
@@ -53,7 +53,7 @@ static void _copy_pte(pte_t *dst_ptep, pte_t *src_ptep, unsigned long addr)
 		 */
 		BUG_ON(!pfn_valid(pte_pfn(pte)));
 
-		set_pte(dst_ptep, pte_mkpresent(pte_mkwrite(pte)));
+		__set_pte(dst_ptep, pte_mkpresent(pte_mkwrite(pte)));
 	}
 }
 
-- 
2.25.1


* [PATCH v1 02/14] arm64/mm: set_ptes()/set_pte_at(): New layer to manage contig bit
  2023-06-22 14:41 ` Ryan Roberts
@ 2023-06-22 14:41   ` Ryan Roberts
  -1 siblings, 0 replies; 46+ messages in thread
From: Ryan Roberts @ 2023-06-22 14:41 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	Oliver Upton, James Morse, Suzuki K Poulose, Zenghui Yu,
	Andrey Ryabinin, Alexander Potapenko, Andrey Konovalov,
	Dmitry Vyukov, Vincenzo Frascino, Andrew Morton,
	Anshuman Khandual, Matthew Wilcox, Yu Zhao, Mark Rutland
  Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-mm

Create a new layer for the in-table PTE manipulation APIs. For now, the
existing API is prefixed with double underscore to become the
arch-private API and the public API is just a simple wrapper that calls
the private API.

The public API implementation will subsequently be used to transparently
manipulate the contiguous bit where appropriate. But since there are
already some contig-aware users (e.g. hugetlb, kernel mapper), we must
first ensure those users use the private API directly so that the future
contig-bit manipulations in the public API do not interfere with those
existing uses.

set_pte_at() is a core macro that forwards to set_ptes() (with nr=1).
Instead of creating a __set_pte_at() internal macro, convert all arch
users to use set_ptes()/__set_ptes() directly, as appropriate. Callers
in hugetlb may benefit from calling __set_ptes() once for their whole
range rather than managing their own loop. This is left for future
improvement.

No behavioural changes intended.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/arm64/include/asm/pgtable.h | 12 +++++++++---
 arch/arm64/kernel/mte.c          |  2 +-
 arch/arm64/kvm/guest.c           |  2 +-
 arch/arm64/mm/fault.c            |  2 +-
 arch/arm64/mm/hugetlbpage.c      | 10 +++++-----
 5 files changed, 17 insertions(+), 11 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 7f5ce5687466..84919a3c558e 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -356,7 +356,7 @@ static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr,
 	__set_pte(ptep, pte);
 }
 
-static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
+static inline void __set_ptes(struct mm_struct *mm, unsigned long addr,
 			      pte_t *ptep, pte_t pte, unsigned int nr)
 {
 	page_table_check_ptes_set(mm, addr, ptep, pte, nr);
@@ -370,7 +370,6 @@ static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
 		pte_val(pte) += PAGE_SIZE;
 	}
 }
-#define set_ptes set_ptes
 
 /*
  * Huge pte definitions.
@@ -1067,7 +1066,7 @@ static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
 #endif /* CONFIG_ARM64_MTE */
 
 /*
- * On AArch64, the cache coherency is handled via the set_pte_at() function.
+ * On AArch64, the cache coherency is handled via the __set_ptes() function.
  */
 static inline void update_mmu_cache_range(struct vm_area_struct *vma,
 		unsigned long addr, pte_t *ptep, unsigned int nr)
@@ -1134,6 +1133,13 @@ static inline void set_pte(pte_t *ptep, pte_t pte)
 	__set_pte(ptep, pte);
 }
 
+#define set_ptes set_ptes
+static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
+				pte_t *ptep, pte_t pte, unsigned int nr)
+{
+	__set_ptes(mm, addr, ptep, pte, nr);
+}
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* __ASM_PGTABLE_H */
diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
index 7e89968bd282..9b248549a020 100644
--- a/arch/arm64/kernel/mte.c
+++ b/arch/arm64/kernel/mte.c
@@ -90,7 +90,7 @@ int memcmp_pages(struct page *page1, struct page *page2)
 	/*
 	 * If the page content is identical but at least one of the pages is
 	 * tagged, return non-zero to avoid KSM merging. If only one of the
-	 * pages is tagged, set_pte_at() may zero or change the tags of the
+	 * pages is tagged, __set_ptes() may zero or change the tags of the
 	 * other page via mte_sync_tags().
 	 */
 	if (page_mte_tagged(page1) || page_mte_tagged(page2))
diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
index 20280a5233f6..478df2edcf99 100644
--- a/arch/arm64/kvm/guest.c
+++ b/arch/arm64/kvm/guest.c
@@ -1087,7 +1087,7 @@ int kvm_vm_ioctl_mte_copy_tags(struct kvm *kvm,
 		} else {
 			/*
 			 * Only locking to serialise with a concurrent
-			 * set_pte_at() in the VMM but still overriding the
+			 * __set_ptes() in the VMM but still overriding the
 			 * tags, hence ignoring the return value.
 			 */
 			try_page_mte_tagging(page);
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 6045a5117ac1..d3a64624ed88 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -191,7 +191,7 @@ static void show_pte(unsigned long addr)
  *
  * It needs to cope with hardware update of the accessed/dirty state by other
  * agents in the system and can safely skip the __sync_icache_dcache() call as,
- * like set_pte_at(), the PTE is never changed from no-exec to exec here.
+ * like __set_ptes(), the PTE is never changed from no-exec to exec here.
  *
  * Returns whether or not the PTE actually changed.
  */
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 95364e8bdc19..31a1da655bf1 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -264,12 +264,12 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
 		ncontig = num_contig_ptes(folio_size(folio), &pgsize);
 
 		for (i = 0; i < ncontig; i++, ptep++)
-			set_pte_at(mm, addr, ptep, pte);
+			__set_ptes(mm, addr, ptep, pte, 1);
 		return;
 	}
 
 	if (!pte_cont(pte)) {
-		set_pte_at(mm, addr, ptep, pte);
+		__set_ptes(mm, addr, ptep, pte, 1);
 		return;
 	}
 
@@ -281,7 +281,7 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
 	clear_flush(mm, addr, ptep, pgsize, ncontig);
 
 	for (i = 0; i < ncontig; i++, ptep++, addr += pgsize, pfn += dpfn)
-		set_pte_at(mm, addr, ptep, pfn_pte(pfn, hugeprot));
+		__set_ptes(mm, addr, ptep, pfn_pte(pfn, hugeprot), 1);
 }
 
 pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
@@ -496,7 +496,7 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
 
 	hugeprot = pte_pgprot(pte);
 	for (i = 0; i < ncontig; i++, ptep++, addr += pgsize, pfn += dpfn)
-		set_pte_at(mm, addr, ptep, pfn_pte(pfn, hugeprot));
+		__set_ptes(mm, addr, ptep, pfn_pte(pfn, hugeprot), 1);
 
 	return 1;
 }
@@ -525,7 +525,7 @@ void huge_ptep_set_wrprotect(struct mm_struct *mm,
 	pfn = pte_pfn(pte);
 
 	for (i = 0; i < ncontig; i++, ptep++, addr += pgsize, pfn += dpfn)
-		set_pte_at(mm, addr, ptep, pfn_pte(pfn, hugeprot));
+		__set_ptes(mm, addr, ptep, pfn_pte(pfn, hugeprot), 1);
 }
 
 pte_t huge_ptep_clear_flush(struct vm_area_struct *vma,
-- 
2.25.1


* [PATCH v1 03/14] arm64/mm: pte_clear(): New layer to manage contig bit
  2023-06-22 14:41 ` Ryan Roberts
@ 2023-06-22 14:41   ` Ryan Roberts
  -1 siblings, 0 replies; 46+ messages in thread
From: Ryan Roberts @ 2023-06-22 14:41 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	Oliver Upton, James Morse, Suzuki K Poulose, Zenghui Yu,
	Andrey Ryabinin, Alexander Potapenko, Andrey Konovalov,
	Dmitry Vyukov, Vincenzo Frascino, Andrew Morton,
	Anshuman Khandual, Matthew Wilcox, Yu Zhao, Mark Rutland
  Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-mm

Create a new layer for the in-table PTE manipulation APIs. For now, the
existing API is prefixed with double underscore to become the
arch-private API and the public API is just a simple wrapper that calls
the private API.

The public API implementation will subsequently be used to transparently
manipulate the contiguous bit where appropriate. But since there are
already some contig-aware users (e.g. hugetlb, kernel mapper), we must
first ensure those users use the private API directly so that the future
contig-bit manipulations in the public API do not interfere with those
existing uses.

No behavioural changes intended.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/arm64/include/asm/pgtable.h | 8 +++++++-
 arch/arm64/mm/fixmap.c           | 2 +-
 arch/arm64/mm/hugetlbpage.c      | 4 ++--
 arch/arm64/mm/mmu.c              | 2 +-
 4 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 84919a3c558e..06b5dca469f5 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -93,7 +93,7 @@ static inline pteval_t __phys_to_pte_val(phys_addr_t phys)
 	__pte(__phys_to_pte_val((phys_addr_t)(pfn) << PAGE_SHIFT) | pgprot_val(prot))
 
 #define pte_none(pte)		(!pte_val(pte))
-#define pte_clear(mm, addr, ptep) \
+#define __pte_clear(mm, addr, ptep) \
 				__set_pte(ptep, __pte(0))
 #define pte_page(pte)		(pfn_to_page(pte_pfn(pte)))
 
@@ -1140,6 +1140,12 @@ static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
 	__set_ptes(mm, addr, ptep, pte, nr);
 }
 
+static inline void pte_clear(struct mm_struct *mm,
+				unsigned long addr, pte_t *ptep)
+{
+	__pte_clear(mm, addr, ptep);
+}
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* __ASM_PGTABLE_H */
diff --git a/arch/arm64/mm/fixmap.c b/arch/arm64/mm/fixmap.c
index 51cd4501816d..bfc02568805a 100644
--- a/arch/arm64/mm/fixmap.c
+++ b/arch/arm64/mm/fixmap.c
@@ -123,7 +123,7 @@ void __set_fixmap(enum fixed_addresses idx,
 	if (pgprot_val(flags)) {
 		__set_pte(ptep, pfn_pte(phys >> PAGE_SHIFT, flags));
 	} else {
-		pte_clear(&init_mm, addr, ptep);
+		__pte_clear(&init_mm, addr, ptep);
 		flush_tlb_kernel_range(addr, addr+PAGE_SIZE);
 	}
 }
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 31a1da655bf1..eebd3107c7d2 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -236,7 +236,7 @@ static void clear_flush(struct mm_struct *mm,
 	unsigned long i, saddr = addr;
 
 	for (i = 0; i < ncontig; i++, addr += pgsize, ptep++)
-		pte_clear(mm, addr, ptep);
+		__pte_clear(mm, addr, ptep);
 
 	flush_tlb_range(&vma, saddr, addr);
 }
@@ -418,7 +418,7 @@ void huge_pte_clear(struct mm_struct *mm, unsigned long addr,
 	ncontig = num_contig_ptes(sz, &pgsize);
 
 	for (i = 0; i < ncontig; i++, addr += pgsize, ptep++)
-		pte_clear(mm, addr, ptep);
+		__pte_clear(mm, addr, ptep);
 }
 
 pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index c84dc87d08b9..085a7e3eec98 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -853,7 +853,7 @@ static void unmap_hotplug_pte_range(pmd_t *pmdp, unsigned long addr,
 			continue;
 
 		WARN_ON(!pte_present(pte));
-		pte_clear(&init_mm, addr, ptep);
+		__pte_clear(&init_mm, addr, ptep);
 		flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
 		if (free_mapped)
 			free_hotplug_page_range(pte_page(pte),
-- 
2.25.1


* [PATCH v1 04/14] arm64/mm: ptep_get_and_clear(): New layer to manage contig bit
  2023-06-22 14:41 ` Ryan Roberts
@ 2023-06-22 14:41   ` Ryan Roberts
  -1 siblings, 0 replies; 46+ messages in thread
From: Ryan Roberts @ 2023-06-22 14:41 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	Oliver Upton, James Morse, Suzuki K Poulose, Zenghui Yu,
	Andrey Ryabinin, Alexander Potapenko, Andrey Konovalov,
	Dmitry Vyukov, Vincenzo Frascino, Andrew Morton,
	Anshuman Khandual, Matthew Wilcox, Yu Zhao, Mark Rutland
  Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-mm

Create a new layer for the in-table PTE manipulation APIs. For now, the
existing API is prefixed with double underscore to become the
arch-private API and the public API is just a simple wrapper that calls
the private API.

The public API implementation will subsequently be used to transparently
manipulate the contiguous bit where appropriate. But since there are
already some contig-aware users (e.g. hugetlb, kernel mapper), we must
first ensure those users use the private API directly so that the future
contig-bit manipulations in the public API do not interfere with those
existing uses.

No behavioural changes intended.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/arm64/include/asm/pgtable.h | 10 ++++++++--
 arch/arm64/mm/hugetlbpage.c      |  4 ++--
 2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 06b5dca469f5..2a525e72537d 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -941,8 +941,7 @@ static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
-#define __HAVE_ARCH_PTEP_GET_AND_CLEAR
-static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
+static inline pte_t __ptep_get_and_clear(struct mm_struct *mm,
 				       unsigned long address, pte_t *ptep)
 {
 	pte_t pte = __pte(xchg_relaxed(&pte_val(*ptep), 0));
@@ -1146,6 +1145,13 @@ static inline void pte_clear(struct mm_struct *mm,
 	__pte_clear(mm, addr, ptep);
 }
 
+#define __HAVE_ARCH_PTEP_GET_AND_CLEAR
+static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
+				unsigned long addr, pte_t *ptep)
+{
+	return __ptep_get_and_clear(mm, addr, ptep);
+}
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* __ASM_PGTABLE_H */
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index eebd3107c7d2..931a17f3c3fb 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -188,7 +188,7 @@ static pte_t get_clear_contig(struct mm_struct *mm,
 	unsigned long i;
 
 	for (i = 0; i < ncontig; i++, addr += pgsize, ptep++) {
-		pte_t pte = ptep_get_and_clear(mm, addr, ptep);
+		pte_t pte = __ptep_get_and_clear(mm, addr, ptep);
 
 		/*
 		 * If HW_AFDBM is enabled, then the HW could turn on
@@ -429,7 +429,7 @@ pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
 	pte_t orig_pte = ptep_get(ptep);
 
 	if (!pte_cont(orig_pte))
-		return ptep_get_and_clear(mm, addr, ptep);
+		return __ptep_get_and_clear(mm, addr, ptep);
 
 	ncontig = find_num_contig(mm, addr, ptep, &pgsize);
 
-- 
2.25.1


* [PATCH v1 05/14] arm64/mm: ptep_test_and_clear_young(): New layer to manage contig bit
  2023-06-22 14:41 ` Ryan Roberts
@ 2023-06-22 14:42   ` Ryan Roberts
  -1 siblings, 0 replies; 46+ messages in thread
From: Ryan Roberts @ 2023-06-22 14:42 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	Oliver Upton, James Morse, Suzuki K Poulose, Zenghui Yu,
	Andrey Ryabinin, Alexander Potapenko, Andrey Konovalov,
	Dmitry Vyukov, Vincenzo Frascino, Andrew Morton,
	Anshuman Khandual, Matthew Wilcox, Yu Zhao, Mark Rutland
  Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-mm

Create a new layer for the in-table PTE manipulation APIs. For now, the
existing API is prefixed with a double underscore to become the
arch-private API, and the public API is a simple wrapper that calls the
private API.

The public API implementation will subsequently be used to transparently
manipulate the contiguous bit where appropriate. But since there are
already some contig-aware users (e.g. hugetlb, kernel mapper), we must
first ensure those users use the private API directly so that the future
contig-bit manipulations in the public API do not interfere with those
existing uses.

No behavioural changes intended.
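
(Not part of this patch; background for reviewers. The
__HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG define is kept next to the
public-named wrapper because it is what tells include/linux/pgtable.h
not to emit its generic fallback, which looks roughly like the sketch
below, paraphrased from the generic header rather than quoted.)

#ifndef __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
					    unsigned long address,
					    pte_t *ptep)
{
	pte_t pte = ptep_get(ptep);
	int r = 1;

	if (!pte_young(pte))
		r = 0;
	else
		set_pte_at(vma->vm_mm, address, ptep, pte_mkold(pte));

	return r;
}
#endif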

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/arm64/include/asm/pgtable.h | 23 ++++++++++++-----------
 1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 2a525e72537d..1f4efa17cc39 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -887,8 +887,9 @@ static inline bool pud_user_accessible_page(pud_t pud)
 /*
  * Atomic pte/pmd modifications.
  */
-#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
-static inline int __ptep_test_and_clear_young(pte_t *ptep)
+static inline int __ptep_test_and_clear_young(struct vm_area_struct *vma,
+					      unsigned long address,
+					      pte_t *ptep)
 {
 	pte_t old_pte, pte;
 
@@ -903,18 +904,11 @@ static inline int __ptep_test_and_clear_young(pte_t *ptep)
 	return pte_young(pte);
 }
 
-static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
-					    unsigned long address,
-					    pte_t *ptep)
-{
-	return __ptep_test_and_clear_young(ptep);
-}
-
 #define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
 static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
 					 unsigned long address, pte_t *ptep)
 {
-	int young = ptep_test_and_clear_young(vma, address, ptep);
+	int young = __ptep_test_and_clear_young(vma, address, ptep);
 
 	if (young) {
 		/*
@@ -937,7 +931,7 @@ static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
 					    unsigned long address,
 					    pmd_t *pmdp)
 {
-	return ptep_test_and_clear_young(vma, address, (pte_t *)pmdp);
+	return __ptep_test_and_clear_young(vma, address, (pte_t *)pmdp);
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
@@ -1152,6 +1146,13 @@ static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
 	return __ptep_get_and_clear(mm, addr, ptep);
 }
 
+#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
+static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
+				unsigned long addr, pte_t *ptep)
+{
+	return __ptep_test_and_clear_young(vma, addr, ptep);
+}
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* __ASM_PGTABLE_H */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v1 06/14] arm64/mm: ptep_clear_flush_young(): New layer to manage contig bit
  2023-06-22 14:41 ` Ryan Roberts
@ 2023-06-22 14:42   ` Ryan Roberts
  -1 siblings, 0 replies; 46+ messages in thread
From: Ryan Roberts @ 2023-06-22 14:42 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	Oliver Upton, James Morse, Suzuki K Poulose, Zenghui Yu,
	Andrey Ryabinin, Alexander Potapenko, Andrey Konovalov,
	Dmitry Vyukov, Vincenzo Frascino, Andrew Morton,
	Anshuman Khandual, Matthew Wilcox, Yu Zhao, Mark Rutland
  Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-mm

Create a new layer for the in-table PTE manipulation APIs. For now, the
existing API is prefixed with a double underscore to become the
arch-private API, and the public API is a simple wrapper that calls the
private API.

The public API implementation will subsequently be used to transparently
manipulate the contiguous bit where appropriate. But since there are
already some contig-aware users (e.g. hugetlb, kernel mapper), we must
first ensure those users use the private API directly so that the future
contig-bit manipulations in the public API do not interfere with those
existing uses.

No behavioural changes intended.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/arm64/include/asm/pgtable.h | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 1f4efa17cc39..450428b11c49 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -137,7 +137,7 @@ static inline pteval_t __phys_to_pte_val(phys_addr_t phys)
  * so that we don't erroneously return false for pages that have been
  * remapped as PROT_NONE but are yet to be flushed from the TLB.
  * Note that we can't make any assumptions based on the state of the access
- * flag, since ptep_clear_flush_young() elides a DSB when invalidating the
+ * flag, since __ptep_clear_flush_young() elides a DSB when invalidating the
  * TLB.
  */
 #define pte_accessible(mm, pte)	\
@@ -904,8 +904,7 @@ static inline int __ptep_test_and_clear_young(struct vm_area_struct *vma,
 	return pte_young(pte);
 }
 
-#define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
-static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
+static inline int __ptep_clear_flush_young(struct vm_area_struct *vma,
 					 unsigned long address, pte_t *ptep)
 {
 	int young = __ptep_test_and_clear_young(vma, address, ptep);
@@ -1153,6 +1152,13 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
 	return __ptep_test_and_clear_young(vma, addr, ptep);
 }
 
+#define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
+static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
+				unsigned long addr, pte_t *ptep)
+{
+	return __ptep_clear_flush_young(vma, addr, ptep);
+}
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* __ASM_PGTABLE_H */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v1 07/14] arm64/mm: ptep_set_wrprotect(): New layer to manage contig bit
  2023-06-22 14:41 ` Ryan Roberts
@ 2023-06-22 14:42   ` Ryan Roberts
  -1 siblings, 0 replies; 46+ messages in thread
From: Ryan Roberts @ 2023-06-22 14:42 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	Oliver Upton, James Morse, Suzuki K Poulose, Zenghui Yu,
	Andrey Ryabinin, Alexander Potapenko, Andrey Konovalov,
	Dmitry Vyukov, Vincenzo Frascino, Andrew Morton,
	Anshuman Khandual, Matthew Wilcox, Yu Zhao, Mark Rutland
  Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-mm

Create a new layer for the in-table PTE manipulation APIs. For now, the
existing API is prefixed with a double underscore to become the
arch-private API, and the public API is a simple wrapper that calls the
private API.

The public API implementation will subsequently be used to transparently
manipulate the contiguous bit where appropriate. But since there are
already some contig-aware users (e.g. hugetlb, kernel mapper), we must
first ensure those users use the private API directly so that the future
contig-bit manipulations in the public API do not interfere with those
existing uses.

No behavioural changes intended.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/arm64/include/asm/pgtable.h | 15 +++++++++++----
 arch/arm64/mm/hugetlbpage.c      |  2 +-
 2 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 450428b11c49..2fcc3b19c873 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -958,11 +958,11 @@ static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm,
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 /*
- * ptep_set_wrprotect - mark read-only while trasferring potential hardware
+ * __ptep_set_wrprotect - mark read-only while trasferring potential hardware
  * dirty status (PTE_DBM && !PTE_RDONLY) to the software PTE_DIRTY bit.
  */
-#define __HAVE_ARCH_PTEP_SET_WRPROTECT
-static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long address, pte_t *ptep)
+static inline void __ptep_set_wrprotect(struct mm_struct *mm,
+					unsigned long address, pte_t *ptep)
 {
 	pte_t old_pte, pte;
 
@@ -980,7 +980,7 @@ static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addres
 static inline void pmdp_set_wrprotect(struct mm_struct *mm,
 				      unsigned long address, pmd_t *pmdp)
 {
-	ptep_set_wrprotect(mm, address, (pte_t *)pmdp);
+	__ptep_set_wrprotect(mm, address, (pte_t *)pmdp);
 }
 
 #define pmdp_establish pmdp_establish
@@ -1159,6 +1159,13 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
 	return __ptep_clear_flush_young(vma, addr, ptep);
 }
 
+#define __HAVE_ARCH_PTEP_SET_WRPROTECT
+static inline void ptep_set_wrprotect(struct mm_struct *mm,
+				unsigned long addr, pte_t *ptep)
+{
+	__ptep_set_wrprotect(mm, addr, ptep);
+}
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* __ASM_PGTABLE_H */
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 931a17f3c3fb..7d5eb71db396 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -511,7 +511,7 @@ void huge_ptep_set_wrprotect(struct mm_struct *mm,
 	pte_t pte;
 
 	if (!pte_cont(READ_ONCE(*ptep))) {
-		ptep_set_wrprotect(mm, addr, ptep);
+		__ptep_set_wrprotect(mm, addr, ptep);
 		return;
 	}
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v1 08/14] arm64/mm: ptep_set_access_flags(): New layer to manage contig bit
  2023-06-22 14:41 ` Ryan Roberts
@ 2023-06-22 14:42   ` Ryan Roberts
  -1 siblings, 0 replies; 46+ messages in thread
From: Ryan Roberts @ 2023-06-22 14:42 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	Oliver Upton, James Morse, Suzuki K Poulose, Zenghui Yu,
	Andrey Ryabinin, Alexander Potapenko, Andrey Konovalov,
	Dmitry Vyukov, Vincenzo Frascino, Andrew Morton,
	Anshuman Khandual, Matthew Wilcox, Yu Zhao, Mark Rutland
  Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-mm

Create a new layer for the in-table PTE manipulation APIs. For now, the
existing API is prefixed with a double underscore to become the
arch-private API, and the public API is a simple wrapper that calls the
private API.

The public API implementation will subsequently be used to transparently
manipulate the contiguous bit where appropriate. But since there are
already some contig-aware users (e.g. hugetlb, kernel mapper), we must
first ensure those users use the private API directly so that the future
contig-bit manipulations in the public API do not interfere with those
existing uses.

No behavioural changes intended.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/arm64/include/asm/pgtable.h | 16 ++++++++++++----
 arch/arm64/mm/fault.c            |  6 +++---
 arch/arm64/mm/hugetlbpage.c      |  2 +-
 3 files changed, 16 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 2fcc3b19c873..ff79578fd806 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -311,7 +311,7 @@ static inline void __check_safe_pte_update(struct mm_struct *mm, pte_t *ptep,
 
 	/*
 	 * Check for potential race with hardware updates of the pte
-	 * (ptep_set_access_flags safely changes valid ptes without going
+	 * (__ptep_set_access_flags safely changes valid ptes without going
 	 * through an invalid entry).
 	 */
 	VM_WARN_ONCE(!pte_young(pte),
@@ -842,8 +842,7 @@ static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
 	return pte_pmd(pte_modify(pmd_pte(pmd), newprot));
 }
 
-#define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
-extern int ptep_set_access_flags(struct vm_area_struct *vma,
+extern int __ptep_set_access_flags(struct vm_area_struct *vma,
 				 unsigned long address, pte_t *ptep,
 				 pte_t entry, int dirty);
 
@@ -853,7 +852,8 @@ static inline int pmdp_set_access_flags(struct vm_area_struct *vma,
 					unsigned long address, pmd_t *pmdp,
 					pmd_t entry, int dirty)
 {
-	return ptep_set_access_flags(vma, address, (pte_t *)pmdp, pmd_pte(entry), dirty);
+	return __ptep_set_access_flags(vma, address, (pte_t *)pmdp,
+							pmd_pte(entry), dirty);
 }
 
 static inline int pud_devmap(pud_t pud)
@@ -1166,6 +1166,14 @@ static inline void ptep_set_wrprotect(struct mm_struct *mm,
 	__ptep_set_wrprotect(mm, addr, ptep);
 }
 
+#define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
+static inline int ptep_set_access_flags(struct vm_area_struct *vma,
+				unsigned long addr, pte_t *ptep,
+				pte_t entry, int dirty)
+{
+	return __ptep_set_access_flags(vma, addr, ptep, entry, dirty);
+}
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* __ASM_PGTABLE_H */
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index d3a64624ed88..f5a7a5ff6814 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -195,9 +195,9 @@ static void show_pte(unsigned long addr)
  *
  * Returns whether or not the PTE actually changed.
  */
-int ptep_set_access_flags(struct vm_area_struct *vma,
-			  unsigned long address, pte_t *ptep,
-			  pte_t entry, int dirty)
+int __ptep_set_access_flags(struct vm_area_struct *vma,
+			    unsigned long address, pte_t *ptep,
+			    pte_t entry, int dirty)
 {
 	pteval_t old_pteval, pteval;
 	pte_t pte = READ_ONCE(*ptep);
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 7d5eb71db396..9a87b1c5661a 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -477,7 +477,7 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
 	pte_t orig_pte;
 
 	if (!pte_cont(pte))
-		return ptep_set_access_flags(vma, addr, ptep, pte, dirty);
+		return __ptep_set_access_flags(vma, addr, ptep, pte, dirty);
 
 	ncontig = find_num_contig(mm, addr, ptep, &pgsize);
 	dpfn = pgsize >> PAGE_SHIFT;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v1 09/14] arm64/mm: ptep_get(): New layer to manage contig bit
  2023-06-22 14:41 ` Ryan Roberts
@ 2023-06-22 14:42   ` Ryan Roberts
  -1 siblings, 0 replies; 46+ messages in thread
From: Ryan Roberts @ 2023-06-22 14:42 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	Oliver Upton, James Morse, Suzuki K Poulose, Zenghui Yu,
	Andrey Ryabinin, Alexander Potapenko, Andrey Konovalov,
	Dmitry Vyukov, Vincenzo Frascino, Andrew Morton,
	Anshuman Khandual, Matthew Wilcox, Yu Zhao, Mark Rutland
  Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-mm

Create a new layer for the in-table PTE manipulation APIs. For now, the
existing API is prefixed with a double underscore to become the
arch-private API, and the public API is a simple wrapper that calls the
private API.

The public API implementation will subsequently be used to transparently
manipulate the contiguous bit where appropriate. But since there are
already some contig-aware users (e.g. hugetlb, kernel mapper), we must
first ensure those users use the private API directly so that the future
contig-bit manipulations in the public API do not interfere with those
existing uses.

arm64 did not previously define an arch-specific ptep_get(), so override
the default version in the arch code, and also define the private
__ptep_get() version. Currently both do the same thing as the default
version (a READ_ONCE() of the pte). Some arch users (hugetlb) were
already using ptep_get(), so convert those to the private API. Other
callsites were doing a direct READ_ONCE(), so convert those to use the
appropriate (public or private) API too.

There are some core kernel locations that directly dereference the ptep,
so these will need to be updated separately.

No behavioural changes intended.
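
(Not part of this patch; a rough sketch of what the conversion buys,
using a made-up helper name. A plain dereference of a live pte gives
the compiler licence to split or re-read the access while hardware may
be updating the access/dirty bits underneath it. __ptep_get() wraps
READ_ONCE() so each read is one consistent snapshot, and the public
ptep_get() becomes the single place that can later gather access/dirty
bits across a whole contpte range transparently.)

static inline bool pte_was_accessed(pte_t *ptep)
{
	pte_t pte = __ptep_get(ptep);	/* not: pte_t pte = *ptep; */

	return pte_valid(pte) && pte_young(pte);
}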

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/arm64/include/asm/pgtable.h | 19 +++++++++++++++----
 arch/arm64/kernel/efi.c          |  2 +-
 arch/arm64/mm/fault.c            |  4 ++--
 arch/arm64/mm/hugetlbpage.c      | 18 +++++++++---------
 arch/arm64/mm/kasan_init.c       |  2 +-
 arch/arm64/mm/mmu.c              | 12 ++++++------
 arch/arm64/mm/pageattr.c         |  4 ++--
 arch/arm64/mm/trans_pgd.c        |  2 +-
 8 files changed, 37 insertions(+), 26 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index ff79578fd806..31df4d73f9ac 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -275,6 +275,11 @@ static inline void __set_pte(pte_t *ptep, pte_t pte)
 	}
 }
 
+static inline pte_t __ptep_get(pte_t *ptep)
+{
+	return READ_ONCE(*ptep);
+}
+
 extern void __sync_icache_dcache(pte_t pteval);
 bool pgattr_change_is_safe(u64 old, u64 new);
 
@@ -302,7 +307,7 @@ static inline void __check_safe_pte_update(struct mm_struct *mm, pte_t *ptep,
 	if (!IS_ENABLED(CONFIG_DEBUG_VM))
 		return;
 
-	old_pte = READ_ONCE(*ptep);
+	old_pte = __ptep_get(ptep);
 
 	if (!pte_valid(old_pte) || !pte_valid(pte))
 		return;
@@ -339,7 +344,7 @@ static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr,
 	 */
 	if (system_supports_mte() && pte_access_permitted(pte, false) &&
 	    !pte_special(pte)) {
-		pte_t old_pte = READ_ONCE(*ptep);
+		pte_t old_pte = __ptep_get(ptep);
 		/*
 		 * We only need to synchronise if the new PTE has tags enabled
 		 * or if swapping in (in which case another mapping may have
@@ -893,7 +898,7 @@ static inline int __ptep_test_and_clear_young(struct vm_area_struct *vma,
 {
 	pte_t old_pte, pte;
 
-	pte = READ_ONCE(*ptep);
+	pte = __ptep_get(ptep);
 	do {
 		old_pte = pte;
 		pte = pte_mkold(pte);
@@ -966,7 +971,7 @@ static inline void __ptep_set_wrprotect(struct mm_struct *mm,
 {
 	pte_t old_pte, pte;
 
-	pte = READ_ONCE(*ptep);
+	pte = __ptep_get(ptep);
 	do {
 		old_pte = pte;
 		pte = pte_wrprotect(pte);
@@ -1120,6 +1125,12 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
  * private versions, which are prefixed with double underscore.
  */
 
+#define ptep_get ptep_get
+static inline pte_t ptep_get(pte_t *ptep)
+{
+	return __ptep_get(ptep);
+}
+
 static inline void set_pte(pte_t *ptep, pte_t pte)
 {
 	__set_pte(ptep, pte);
diff --git a/arch/arm64/kernel/efi.c b/arch/arm64/kernel/efi.c
index 7a28b6a08a82..9536fbce77a2 100644
--- a/arch/arm64/kernel/efi.c
+++ b/arch/arm64/kernel/efi.c
@@ -106,7 +106,7 @@ static int __init set_permissions(pte_t *ptep, unsigned long addr, void *data)
 {
 	struct set_perm_data *spd = data;
 	const efi_memory_desc_t *md = spd->md;
-	pte_t pte = READ_ONCE(*ptep);
+	pte_t pte = __ptep_get(ptep);
 
 	if (md->attribute & EFI_MEMORY_RO)
 		pte = set_pte_bit(pte, __pgprot(PTE_RDONLY));
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index f5a7a5ff6814..3193526b226d 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -177,7 +177,7 @@ static void show_pte(unsigned long addr)
 			break;
 
 		ptep = pte_offset_map(pmdp, addr);
-		pte = READ_ONCE(*ptep);
+		pte = __ptep_get(ptep);
 		pr_cont(", pte=%016llx", pte_val(pte));
 		pte_unmap(ptep);
 	} while(0);
@@ -200,7 +200,7 @@ int __ptep_set_access_flags(struct vm_area_struct *vma,
 			    pte_t entry, int dirty)
 {
 	pteval_t old_pteval, pteval;
-	pte_t pte = READ_ONCE(*ptep);
+	pte_t pte = __ptep_get(ptep);
 
 	if (pte_same(pte, entry))
 		return 0;
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 9a87b1c5661a..82b2036dbe2f 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -152,14 +152,14 @@ pte_t huge_ptep_get(pte_t *ptep)
 {
 	int ncontig, i;
 	size_t pgsize;
-	pte_t orig_pte = ptep_get(ptep);
+	pte_t orig_pte = __ptep_get(ptep);
 
 	if (!pte_present(orig_pte) || !pte_cont(orig_pte))
 		return orig_pte;
 
 	ncontig = num_contig_ptes(page_size(pte_page(orig_pte)), &pgsize);
 	for (i = 0; i < ncontig; i++, ptep++) {
-		pte_t pte = ptep_get(ptep);
+		pte_t pte = __ptep_get(ptep);
 
 		if (pte_dirty(pte))
 			orig_pte = pte_mkdirty(orig_pte);
@@ -184,7 +184,7 @@ static pte_t get_clear_contig(struct mm_struct *mm,
 			     unsigned long pgsize,
 			     unsigned long ncontig)
 {
-	pte_t orig_pte = ptep_get(ptep);
+	pte_t orig_pte = __ptep_get(ptep);
 	unsigned long i;
 
 	for (i = 0; i < ncontig; i++, addr += pgsize, ptep++) {
@@ -426,7 +426,7 @@ pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
 {
 	int ncontig;
 	size_t pgsize;
-	pte_t orig_pte = ptep_get(ptep);
+	pte_t orig_pte = __ptep_get(ptep);
 
 	if (!pte_cont(orig_pte))
 		return __ptep_get_and_clear(mm, addr, ptep);
@@ -449,11 +449,11 @@ static int __cont_access_flags_changed(pte_t *ptep, pte_t pte, int ncontig)
 {
 	int i;
 
-	if (pte_write(pte) != pte_write(ptep_get(ptep)))
+	if (pte_write(pte) != pte_write(__ptep_get(ptep)))
 		return 1;
 
 	for (i = 0; i < ncontig; i++) {
-		pte_t orig_pte = ptep_get(ptep + i);
+		pte_t orig_pte = __ptep_get(ptep + i);
 
 		if (pte_dirty(pte) != pte_dirty(orig_pte))
 			return 1;
@@ -510,7 +510,7 @@ void huge_ptep_set_wrprotect(struct mm_struct *mm,
 	size_t pgsize;
 	pte_t pte;
 
-	if (!pte_cont(READ_ONCE(*ptep))) {
+	if (!pte_cont(__ptep_get(ptep))) {
 		__ptep_set_wrprotect(mm, addr, ptep);
 		return;
 	}
@@ -535,7 +535,7 @@ pte_t huge_ptep_clear_flush(struct vm_area_struct *vma,
 	size_t pgsize;
 	int ncontig;
 
-	if (!pte_cont(READ_ONCE(*ptep)))
+	if (!pte_cont(__ptep_get(ptep)))
 		return ptep_clear_flush(vma, addr, ptep);
 
 	ncontig = find_num_contig(mm, addr, ptep, &pgsize);
@@ -569,7 +569,7 @@ pte_t huge_ptep_modify_prot_start(struct vm_area_struct *vma, unsigned long addr
 		 * when the permission changes from executable to non-executable
 		 * in cases where cpu is affected with errata #2645198.
 		 */
-		if (pte_user_exec(READ_ONCE(*ptep)))
+		if (pte_user_exec(__ptep_get(ptep)))
 			return huge_ptep_clear_flush(vma, addr, ptep);
 	}
 	return huge_ptep_get_and_clear(vma->vm_mm, addr, ptep);
diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c
index 40125b217195..65074cf7f3a3 100644
--- a/arch/arm64/mm/kasan_init.c
+++ b/arch/arm64/mm/kasan_init.c
@@ -113,7 +113,7 @@ static void __init kasan_pte_populate(pmd_t *pmdp, unsigned long addr,
 			memset(__va(page_phys), KASAN_SHADOW_INIT, PAGE_SIZE);
 		next = addr + PAGE_SIZE;
 		__set_pte(ptep, pfn_pte(__phys_to_pfn(page_phys), PAGE_KERNEL));
-	} while (ptep++, addr = next, addr != end && pte_none(READ_ONCE(*ptep)));
+	} while (ptep++, addr = next, addr != end && pte_none(__ptep_get(ptep)));
 }
 
 static void __init kasan_pmd_populate(pud_t *pudp, unsigned long addr,
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 085a7e3eec98..d5dc36ff3827 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -176,7 +176,7 @@ static void init_pte(pmd_t *pmdp, unsigned long addr, unsigned long end,
 
 	ptep = pte_set_fixmap_offset(pmdp, addr);
 	do {
-		pte_t old_pte = READ_ONCE(*ptep);
+		pte_t old_pte = __ptep_get(ptep);
 
 		__set_pte(ptep, pfn_pte(__phys_to_pfn(phys), prot));
 
@@ -185,7 +185,7 @@ static void init_pte(pmd_t *pmdp, unsigned long addr, unsigned long end,
 		 * only allow updates to the permission attributes.
 		 */
 		BUG_ON(!pgattr_change_is_safe(pte_val(old_pte),
-					      READ_ONCE(pte_val(*ptep))));
+					      pte_val(__ptep_get(ptep))));
 
 		phys += PAGE_SIZE;
 	} while (ptep++, addr += PAGE_SIZE, addr != end);
@@ -848,7 +848,7 @@ static void unmap_hotplug_pte_range(pmd_t *pmdp, unsigned long addr,
 
 	do {
 		ptep = pte_offset_kernel(pmdp, addr);
-		pte = READ_ONCE(*ptep);
+		pte = __ptep_get(ptep);
 		if (pte_none(pte))
 			continue;
 
@@ -981,7 +981,7 @@ static void free_empty_pte_table(pmd_t *pmdp, unsigned long addr,
 
 	do {
 		ptep = pte_offset_kernel(pmdp, addr);
-		pte = READ_ONCE(*ptep);
+		pte = __ptep_get(ptep);
 
 		/*
 		 * This is just a sanity check here which verifies that
@@ -1000,7 +1000,7 @@ static void free_empty_pte_table(pmd_t *pmdp, unsigned long addr,
 	 */
 	ptep = pte_offset_kernel(pmdp, 0UL);
 	for (i = 0; i < PTRS_PER_PTE; i++) {
-		if (!pte_none(READ_ONCE(ptep[i])))
+		if (!pte_none(__ptep_get(ptep++)))
 			return;
 	}
 
@@ -1470,7 +1470,7 @@ pte_t ptep_modify_prot_start(struct vm_area_struct *vma, unsigned long addr, pte
 		 * when the permission changes from executable to non-executable
 		 * in cases where cpu is affected with errata #2645198.
 		 */
-		if (pte_user_exec(READ_ONCE(*ptep)))
+		if (pte_user_exec(ptep_get(ptep)))
 			return ptep_clear_flush(vma, addr, ptep);
 	}
 	return ptep_get_and_clear(vma->vm_mm, addr, ptep);
diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
index 057097acf9e0..624b0b0982e3 100644
--- a/arch/arm64/mm/pageattr.c
+++ b/arch/arm64/mm/pageattr.c
@@ -36,7 +36,7 @@ bool can_set_direct_map(void)
 static int change_page_range(pte_t *ptep, unsigned long addr, void *data)
 {
 	struct page_change_data *cdata = data;
-	pte_t pte = READ_ONCE(*ptep);
+	pte_t pte = __ptep_get(ptep);
 
 	pte = clear_pte_bit(pte, cdata->clear_mask);
 	pte = set_pte_bit(pte, cdata->set_mask);
@@ -246,5 +246,5 @@ bool kernel_page_present(struct page *page)
 		return true;
 
 	ptep = pte_offset_kernel(pmdp, addr);
-	return pte_valid(READ_ONCE(*ptep));
+	return pte_valid(__ptep_get(ptep));
 }
diff --git a/arch/arm64/mm/trans_pgd.c b/arch/arm64/mm/trans_pgd.c
index f9997b226614..b130a65092c1 100644
--- a/arch/arm64/mm/trans_pgd.c
+++ b/arch/arm64/mm/trans_pgd.c
@@ -32,7 +32,7 @@ static void *trans_alloc(struct trans_pgd_info *info)
 
 static void _copy_pte(pte_t *dst_ptep, pte_t *src_ptep, unsigned long addr)
 {
-	pte_t pte = READ_ONCE(*src_ptep);
+	pte_t pte = __ptep_get(src_ptep);
 
 	if (pte_valid(pte)) {
 		/*
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v1 10/14] arm64/mm: Split __flush_tlb_range() to elide trailing DSB
  2023-06-22 14:41 ` Ryan Roberts
@ 2023-06-22 14:42   ` Ryan Roberts
  -1 siblings, 0 replies; 46+ messages in thread
From: Ryan Roberts @ 2023-06-22 14:42 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	Oliver Upton, James Morse, Suzuki K Poulose, Zenghui Yu,
	Andrey Ryabinin, Alexander Potapenko, Andrey Konovalov,
	Dmitry Vyukov, Vincenzo Frascino, Andrew Morton,
	Anshuman Khandual, Matthew Wilcox, Yu Zhao, Mark Rutland
  Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-mm

Split __flush_tlb_range() into __flush_tlb_range_nosync() +
__flush_tlb_range(), in the same way as the existing flush_tlb_page()
arrangement. This allows calling __flush_tlb_range_nosync() to elide the
trailing DSB. Forthcoming "contpte" code will take advantage of this
when clearing the young bit from a contiguous range of ptes.
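
(Not part of this patch; a sketch of the intended usage pattern, with a
made-up helper name. A caller invalidating a contiguous block of ptes
can batch all the TLBI operations through __flush_tlb_range_nosync()
and then decide for itself whether a single trailing DSB is required,
instead of paying for one barrier per call.)

static inline void flush_contpte_block(struct vm_area_struct *vma,
				       unsigned long start,
				       unsigned long end, bool sync)
{
	/* Issue the per-page TLBIs without the trailing barrier... */
	__flush_tlb_range_nosync(vma, start, end, PAGE_SIZE, true, 3);

	/* ...then synchronise at most once for the whole range. */
	if (sync)
		dsb(ish);
}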

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/arm64/include/asm/tlbflush.h | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index 412a3b9a3c25..de1f5d9a546e 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -278,7 +278,7 @@ static inline void flush_tlb_page(struct vm_area_struct *vma,
  */
 #define MAX_TLBI_OPS	PTRS_PER_PTE
 
-static inline void __flush_tlb_range(struct vm_area_struct *vma,
+static inline void __flush_tlb_range_nosync(struct vm_area_struct *vma,
 				     unsigned long start, unsigned long end,
 				     unsigned long stride, bool last_level,
 				     int tlb_level)
@@ -357,6 +357,15 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma,
 		}
 		scale++;
 	}
+}
+
+static inline void __flush_tlb_range(struct vm_area_struct *vma,
+				     unsigned long start, unsigned long end,
+				     unsigned long stride, bool last_level,
+				     int tlb_level)
+{
+	__flush_tlb_range_nosync(vma, start, end, stride,
+				 last_level, tlb_level);
 	dsb(ish);
 }
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v1 11/14] arm64/mm: Wire up PTE_CONT for user mappings
  2023-06-22 14:41 ` Ryan Roberts
@ 2023-06-22 14:42   ` Ryan Roberts
  -1 siblings, 0 replies; 46+ messages in thread
From: Ryan Roberts @ 2023-06-22 14:42 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	Oliver Upton, James Morse, Suzuki K Poulose, Zenghui Yu,
	Andrey Ryabinin, Alexander Potapenko, Andrey Konovalov,
	Dmitry Vyukov, Vincenzo Frascino, Andrew Morton,
	Anshuman Khandual, Matthew Wilcox, Yu Zhao, Mark Rutland
  Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-mm

With the ptep API sufficiently refactored, we can now introduce a new
"contpte" API layer, which transparently manages the PTE_CONT bit for
user mappings. Whenever it detects a set of PTEs that meet the
requirements for a contiguous range, the PTEs are re-painted with the
PTE_CONT bit.

This initial change provides a baseline that can be optimized in future
commits. That said, fold/unfold operations (which imply tlb
invalidation) are avoided where possible with a few tricks for
access/dirty bit management.

Write-enable and write-protect modifications are currently non-optimal
and will likely incur a regression in fork() performance. This will be
addressed separately.
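
For orientation, here is a stripped-down, illustrative-only sketch of the
fold trigger; the real logic lives in contpte_try_fold() and
__contpte_try_fold() in the diff below.

/*
 * Illustrative sketch: folding is only attempted when the pte being
 * written is the last entry of a naturally aligned block whose pfn is
 * equally aligned; __contpte_try_fold() then checks that the remaining
 * ptes are present, contiguous and have matching prots.
 */
static bool fold_candidate(struct mm_struct *mm, pte_t *ptep, pte_t pte)
{
	bool valign = ((unsigned long)ptep / sizeof(pte_t)) % CONT_PTES ==
		      CONT_PTES - 1;
	bool palign = pte_pfn(pte) % CONT_PTES == CONT_PTES - 1;

	return mm != &init_mm &&	/* user mappings only */
	       pte_present(pte) && !pte_cont(pte) && valign && palign;
}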

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/arm64/include/asm/pgtable.h | 137 ++++++++++++-
 arch/arm64/mm/Makefile           |   3 +-
 arch/arm64/mm/contpte.c          | 334 +++++++++++++++++++++++++++++++
 3 files changed, 466 insertions(+), 8 deletions(-)
 create mode 100644 arch/arm64/mm/contpte.c

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 31df4d73f9ac..17ea534bc5b0 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1115,6 +1115,71 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
 				    unsigned long addr, pte_t *ptep,
 				    pte_t old_pte, pte_t new_pte);
 
+/*
+ * The contpte APIs are used to transparently manage the contiguous bit in ptes
+ * where it is possible and makes sense to do so. The PTE_CONT bit is considered
+ * a private implementation detail of the public ptep API (see below).
+ */
+extern void __contpte_try_fold(struct mm_struct *mm, unsigned long addr,
+				pte_t *ptep, pte_t pte);
+extern void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
+				pte_t *ptep, pte_t pte);
+extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
+extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
+extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
+				pte_t *ptep, pte_t pte, unsigned int nr);
+extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
+				unsigned long addr, pte_t *ptep);
+extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
+				unsigned long addr, pte_t *ptep);
+extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
+				unsigned long addr, pte_t *ptep,
+				pte_t entry, int dirty);
+
+static inline pte_t *contpte_align_down(pte_t *ptep)
+{
+	return (pte_t *)(ALIGN_DOWN((unsigned long)ptep >> 3, CONT_PTES) << 3);
+}
+
+static inline bool contpte_is_enabled(struct mm_struct *mm)
+{
+	/*
+	 * Don't attempt to apply the contig bit to kernel mappings, because
+	 * dynamically adding/removing the contig bit can cause page faults.
+	 * These racing faults are ok for user space, since they get serialized
+	 * on the PTL. But kernel mappings can't tolerate faults.
+	 */
+
+	return mm != &init_mm;
+}
+
+static inline void contpte_try_fold(struct mm_struct *mm, unsigned long addr,
+					pte_t *ptep, pte_t pte)
+{
+	/*
+	 * Only bother trying if both the virtual and physical addresses are
+	 * aligned and correspond to the last entry in a contig range. The core
+	 * code mostly modifies ranges from low to high, so this is likely the
+	 * last modification in the contig range, and so a good time to fold.
+	 */
+
+	bool valign = ((unsigned long)ptep >> 3) % CONT_PTES == CONT_PTES - 1;
+	bool palign = pte_pfn(pte) % CONT_PTES == CONT_PTES - 1;
+
+	if (contpte_is_enabled(mm) &&
+	    pte_present(pte) && !pte_cont(pte) &&
+	    valign && palign)
+		__contpte_try_fold(mm, addr, ptep, pte);
+}
+
+static inline void contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
+					pte_t *ptep, pte_t pte)
+{
+	if (contpte_is_enabled(mm) &&
+	    pte_present(pte) && pte_cont(pte))
+		__contpte_try_unfold(mm, addr, ptep, pte);
+}
+
 /*
  * The below functions constitute the public API that arm64 presents to the
  * core-mm to manipulate PTE entries within the their page tables (or at least
@@ -1122,30 +1187,68 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
  * versions will automatically and transparently apply the contiguous bit where
  * it makes sense to do so. Therefore any users that are contig-aware (e.g.
  * hugetlb, kernel mapper) should NOT use these APIs, but instead use the
- * private versions, which are prefixed with double underscore.
+ * private versions, which are prefixed with double underscore. All of these
+ * APIs except for ptep_get_lockless() are expected to be called with the PTL
+ * held.
  */
 
 #define ptep_get ptep_get
 static inline pte_t ptep_get(pte_t *ptep)
 {
-	return __ptep_get(ptep);
+	pte_t pte = __ptep_get(ptep);
+
+	if (!pte_present(pte) || !pte_cont(pte))
+		return pte;
+
+	return contpte_ptep_get(ptep, pte);
+}
+
+#define ptep_get_lockless ptep_get_lockless
+static inline pte_t ptep_get_lockless(pte_t *ptep)
+{
+	pte_t pte = __ptep_get(ptep);
+
+	if (!pte_present(pte) || !pte_cont(pte))
+		return pte;
+
+	return contpte_ptep_get_lockless(ptep);
 }
 
 static inline void set_pte(pte_t *ptep, pte_t pte)
 {
-	__set_pte(ptep, pte);
+	/*
+	 * We don't have the mm or vaddr so cannot unfold or fold contig entries
+	 * (since it requires tlb maintenance). set_pte() is not used in core
+	 * code, so this should never even be called. Regardless, do our best to
+	 * service any call and emit a warning if there is any attempt to set a
+	 * pte on top of an existing contig range.
+	 */
+	pte_t orig_pte = __ptep_get(ptep);
+
+	WARN_ON_ONCE(pte_present(orig_pte) && pte_cont(orig_pte));
+	__set_pte(ptep, pte_mknoncont(pte));
 }
 
 #define set_ptes set_ptes
 static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
 				pte_t *ptep, pte_t pte, unsigned int nr)
 {
-	__set_ptes(mm, addr, ptep, pte, nr);
+	pte = pte_mknoncont(pte);
+
+	if (!contpte_is_enabled(mm))
+		__set_ptes(mm, addr, ptep, pte, nr);
+	else if (nr == 1) {
+		contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
+		__set_ptes(mm, addr, ptep, pte, nr);
+		contpte_try_fold(mm, addr, ptep, pte);
+	} else
+		contpte_set_ptes(mm, addr, ptep, pte, nr);
 }
 
 static inline void pte_clear(struct mm_struct *mm,
 				unsigned long addr, pte_t *ptep)
 {
+	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
 	__pte_clear(mm, addr, ptep);
 }
 
@@ -1153,6 +1256,7 @@ static inline void pte_clear(struct mm_struct *mm,
 static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
 				unsigned long addr, pte_t *ptep)
 {
+	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
 	return __ptep_get_and_clear(mm, addr, ptep);
 }
 
@@ -1160,21 +1264,33 @@ static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
 static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
 				unsigned long addr, pte_t *ptep)
 {
-	return __ptep_test_and_clear_young(vma, addr, ptep);
+	pte_t orig_pte = __ptep_get(ptep);
+
+	if (!pte_present(orig_pte) || !pte_cont(orig_pte))
+		return __ptep_test_and_clear_young(vma, addr, ptep);
+
+	return contpte_ptep_test_and_clear_young(vma, addr, ptep);
 }
 
 #define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
 static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
 				unsigned long addr, pte_t *ptep)
 {
-	return __ptep_clear_flush_young(vma, addr, ptep);
+	pte_t orig_pte = __ptep_get(ptep);
+
+	if (!pte_present(orig_pte) || !pte_cont(orig_pte))
+		return __ptep_clear_flush_young(vma, addr, ptep);
+
+	return contpte_ptep_clear_flush_young(vma, addr, ptep);
 }
 
 #define __HAVE_ARCH_PTEP_SET_WRPROTECT
 static inline void ptep_set_wrprotect(struct mm_struct *mm,
 				unsigned long addr, pte_t *ptep)
 {
+	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
 	__ptep_set_wrprotect(mm, addr, ptep);
+	contpte_try_fold(mm, addr, ptep, __ptep_get(ptep));
 }
 
 #define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
@@ -1182,7 +1298,14 @@ static inline int ptep_set_access_flags(struct vm_area_struct *vma,
 				unsigned long addr, pte_t *ptep,
 				pte_t entry, int dirty)
 {
-	return __ptep_set_access_flags(vma, addr, ptep, entry, dirty);
+	pte_t orig_pte = __ptep_get(ptep);
+
+	entry = pte_mknoncont(entry);
+
+	if (!pte_present(orig_pte) || !pte_cont(orig_pte))
+		return __ptep_set_access_flags(vma, addr, ptep, entry, dirty);
+
+	return contpte_ptep_set_access_flags(vma, addr, ptep, entry, dirty);
 }
 
 #endif /* !__ASSEMBLY__ */
diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
index dbd1bc95967d..70b6aba09b5d 100644
--- a/arch/arm64/mm/Makefile
+++ b/arch/arm64/mm/Makefile
@@ -2,7 +2,8 @@
 obj-y				:= dma-mapping.o extable.o fault.o init.o \
 				   cache.o copypage.o flush.o \
 				   ioremap.o mmap.o pgd.o mmu.o \
-				   context.o proc.o pageattr.o fixmap.o
+				   context.o proc.o pageattr.o fixmap.o \
+				   contpte.o
 obj-$(CONFIG_HUGETLB_PAGE)	+= hugetlbpage.o
 obj-$(CONFIG_PTDUMP_CORE)	+= ptdump.o
 obj-$(CONFIG_PTDUMP_DEBUGFS)	+= ptdump_debugfs.o
diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
new file mode 100644
index 000000000000..e8e4a298fd53
--- /dev/null
+++ b/arch/arm64/mm/contpte.c
@@ -0,0 +1,334 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2023 ARM Ltd.
+ */
+
+#include <linux/mm.h>
+#include <asm/tlbflush.h>
+
+static void ptep_clear_flush_range(struct mm_struct *mm, unsigned long addr,
+				pte_t *ptep, int nr)
+{
+	struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
+	unsigned long start_addr = addr;
+	int i;
+
+	for (i = 0; i < nr; i++, ptep++, addr += PAGE_SIZE)
+		__pte_clear(mm, addr, ptep);
+
+	__flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
+}
+
+static bool ptep_any_present(pte_t *ptep, int nr)
+{
+	int i;
+
+	for (i = 0; i < nr; i++, ptep++) {
+		if (pte_present(__ptep_get(ptep)))
+			return true;
+	}
+
+	return false;
+}
+
+static void contpte_fold(struct mm_struct *mm, unsigned long addr,
+			pte_t *ptep, pte_t pte, bool fold)
+{
+	struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
+	unsigned long start_addr;
+	pte_t *start_ptep;
+	int i;
+
+	start_ptep = ptep = contpte_align_down(ptep);
+	start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
+	pte = pfn_pte(ALIGN_DOWN(pte_pfn(pte), CONT_PTES), pte_pgprot(pte));
+	pte = fold ? pte_mkcont(pte) : pte_mknoncont(pte);
+
+	for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE) {
+		pte_t ptent = __ptep_get_and_clear(mm, addr, ptep);
+
+		if (pte_dirty(ptent))
+			pte = pte_mkdirty(pte);
+
+		if (pte_young(ptent))
+			pte = pte_mkyoung(pte);
+	}
+
+	__flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
+
+	__set_ptes(mm, start_addr, start_ptep, pte, CONT_PTES);
+}
+
+void __contpte_try_fold(struct mm_struct *mm, unsigned long addr,
+			pte_t *ptep, pte_t pte)
+{
+	/*
+	 * We have already checked that the virtual and physical addresses are
+	 * correctly aligned for a contig mapping in contpte_try_fold() so the
+	 * remaining checks are to ensure that the contig range is fully covered
+	 * by a single folio, and ensure that all the ptes are present with
+	 * contiguous PFNs and matching prots.
+	 */
+
+	struct page *page = pte_page(pte);
+	struct folio *folio = page_folio(page);
+	unsigned long folio_saddr = addr - (page - &folio->page) * PAGE_SIZE;
+	unsigned long folio_eaddr = folio_saddr + folio_nr_pages(folio) * PAGE_SIZE;
+	unsigned long cont_saddr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
+	unsigned long cont_eaddr = cont_saddr + CONT_PTE_SIZE;
+	unsigned long pfn;
+	pgprot_t prot;
+	pte_t subpte;
+	pte_t *orig_ptep;
+	int i;
+
+	if (folio_saddr > cont_saddr || folio_eaddr < cont_eaddr)
+		return;
+
+	pfn = pte_pfn(pte) - ((addr - cont_saddr) >> PAGE_SHIFT);
+	prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
+	orig_ptep = ptep;
+	ptep = contpte_align_down(ptep);
+
+	for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
+		subpte = __ptep_get(ptep);
+		subpte = pte_mkold(pte_mkclean(subpte));
+
+		if (!pte_present(subpte) ||
+		    pte_pfn(subpte) != pfn ||
+		    pgprot_val(pte_pgprot(subpte)) != pgprot_val(prot))
+			return;
+	}
+
+	contpte_fold(mm, addr, orig_ptep, pte, true);
+}
+
+void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
+			pte_t *ptep, pte_t pte)
+{
+	/*
+	 * We have already checked that the ptes are contiguous in
+	 * contpte_try_unfold(), so we can unfold unconditionally here.
+	 */
+
+	contpte_fold(mm, addr, ptep, pte, false);
+}
+
+pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
+{
+	/*
+	 * Gather access/dirty bits, which may be populated in any of the ptes
+	 * of the contig range. We are guaranteed to be holding the PTL, so any
+	 * contiguous range cannot be unfolded or otherwise modified under our
+	 * feet.
+	 */
+
+	pte_t pte;
+	int i;
+
+	ptep = contpte_align_down(ptep);
+
+	for (i = 0; i < CONT_PTES; i++, ptep++) {
+		pte = __ptep_get(ptep);
+
+		/*
+		 * Deal with the partial contpte_ptep_get_and_clear_full() case,
+		 * where some of the ptes in the range may already have been
+		 * cleared while others have not. See
+		 * contpte_ptep_get_and_clear_full().
+		 */
+		if (pte_val(pte) == 0)
+			continue;
+
+		if (pte_dirty(pte))
+			orig_pte = pte_mkdirty(orig_pte);
+
+		if (pte_young(pte))
+			orig_pte = pte_mkyoung(orig_pte);
+	}
+
+	return orig_pte;
+}
+
+pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
+{
+	/*
+	 * Gather access/dirty bits, which may be populated in any of the ptes
+	 * of the contig range. We may not be holding the PTL, so any contiguous
+	 * range may be unfolded/modified/refolded under our feet.
+	 */
+
+	pte_t orig_pte;
+	pgprot_t orig_prot;
+	pte_t *ptep;
+	unsigned long pfn;
+	pte_t pte;
+	pgprot_t prot;
+	int i;
+
+retry:
+	orig_pte = __ptep_get(orig_ptep);
+
+	if (!pte_present(orig_pte) || !pte_cont(orig_pte))
+		return orig_pte;
+
+	orig_prot = pte_pgprot(pte_mkold(pte_mkclean(orig_pte)));
+	ptep = contpte_align_down(orig_ptep);
+	pfn = pte_pfn(orig_pte) - (orig_ptep - ptep);
+
+	for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
+		pte = __ptep_get(ptep);
+		prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
+
+		if (!pte_present(pte) || !pte_cont(pte) ||
+		   pte_pfn(pte) != pfn ||
+		   pgprot_val(prot) != pgprot_val(orig_prot))
+			goto retry;
+
+		if (pte_dirty(pte))
+			orig_pte = pte_mkdirty(orig_pte);
+
+		if (pte_young(pte))
+			orig_pte = pte_mkyoung(orig_pte);
+	}
+
+	return orig_pte;
+}
+
+void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
+					pte_t *ptep, pte_t pte, unsigned int nr)
+{
+	unsigned long next;
+	unsigned long end = addr + (nr << PAGE_SHIFT);
+	unsigned long pfn = pte_pfn(pte);
+	pgprot_t prot = pte_pgprot(pte);
+	pte_t orig_pte;
+
+	do {
+		next = pte_cont_addr_end(addr, end);
+		nr = (next - addr) >> PAGE_SHIFT;
+		pte = pfn_pte(pfn, prot);
+
+		if (((addr | next | (pfn << PAGE_SHIFT)) & ~CONT_PTE_MASK) == 0)
+			pte = pte_mkcont(pte);
+		else
+			pte = pte_mknoncont(pte);
+
+		/*
+		 * If operating on a partial contiguous range then we must first
+		 * unfold the contiguous range if it was previously folded.
+		 * Otherwise we could end up with overlapping tlb entries.
+		 */
+		if (nr != CONT_PTES)
+			contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
+
+		/*
+		 * If we are replacing ptes that were contiguous or if the new
+		 * ptes are contiguous and any of the ptes being replaced are
+		 * present, we need to clear and flush the range to prevent
+		 * overlapping tlb entries.
+		 */
+		orig_pte = __ptep_get(ptep);
+		if ((pte_present(orig_pte) && pte_cont(orig_pte)) ||
+		    (pte_cont(pte) && ptep_any_present(ptep, nr)))
+			ptep_clear_flush_range(mm, addr, ptep, nr);
+
+		__set_ptes(mm, addr, ptep, pte, nr);
+
+		addr = next;
+		ptep += nr;
+		pfn += nr;
+
+	} while (addr != end);
+}
+
+int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
+					unsigned long addr, pte_t *ptep)
+{
+	/*
+	 * ptep_clear_flush_young() technically requires us to clear the access
+	 * flag for a _single_ pte. However, the core-mm code actually tracks
+	 * access/dirty per folio, not per page. And since we only create a
+	 * contig range when the range is covered by a single folio, we can get
+	 * away with clearing young for the whole contig range here, so we avoid
+	 * having to unfold.
+	 */
+
+	int i;
+	int young = 0;
+
+	ptep = contpte_align_down(ptep);
+	addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
+
+	for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
+		young |= __ptep_test_and_clear_young(vma, addr, ptep);
+
+	return young;
+}
+
+int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
+					unsigned long addr, pte_t *ptep)
+{
+	int young;
+
+	young = contpte_ptep_test_and_clear_young(vma, addr, ptep);
+	addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
+
+	if (young) {
+		/*
+		 * See comment in __ptep_clear_flush_young(); same rationale for
+		 * eliding the trailing DSB applies here.
+		 */
+		__flush_tlb_range_nosync(vma, addr, addr + CONT_PTE_SIZE,
+					 PAGE_SIZE, true, 3);
+	}
+
+	return young;
+}
+
+int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
+					unsigned long addr, pte_t *ptep,
+					pte_t entry, int dirty)
+{
+	pte_t orig_pte;
+	int i;
+
+	/*
+	 * Gather the access/dirty bits for the contiguous range. If nothing has
+	 * changed, it's a noop.
+	 */
+	orig_pte = ptep_get(ptep);
+	if (pte_val(orig_pte) == pte_val(entry))
+		return 0;
+
+	/*
+	 * We can fix up access/dirty bits without having to unfold/fold the
+	 * contig range. But if the write bit is changing, we need to go through
+	 * the full unfold/fold cycle.
+	 */
+	if (pte_write(orig_pte) == pte_write(entry)) {
+		/*
+		 * No need to flush here; this is always "more permissive", so we
+		 * can only be _adding_ the access or dirty bit. And since the
+		 * tlb can't cache an entry without the AF set and the dirty bit
+		 * is a SW bit, there can be no confusion. For HW access
+		 * management, we technically only need to update the flag on a
+		 * single pte in the range. But for SW access management, we
+		 * need to update all the ptes to prevent extra faults.
+		 */
+		ptep = contpte_align_down(ptep);
+		addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
+
+		for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
+			__ptep_set_access_flags(vma, addr, ptep, entry, 0);
+	} else {
+		/*
+		 * No need to flush in __ptep_set_access_flags() because we just
+		 * flushed the whole range in __contpte_try_unfold().
+		 */
+		__contpte_try_unfold(vma->vm_mm, addr, ptep, orig_pte);
+		__ptep_set_access_flags(vma, addr, ptep, entry, 0);
+		contpte_try_fold(vma->vm_mm, addr, ptep, entry);
+	}
+
+	return 1;
+}
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v1 12/14] arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown
  2023-06-22 14:41 ` Ryan Roberts
@ 2023-06-22 14:42   ` Ryan Roberts
  -1 siblings, 0 replies; 46+ messages in thread
From: Ryan Roberts @ 2023-06-22 14:42 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	Oliver Upton, James Morse, Suzuki K Poulose, Zenghui Yu,
	Andrey Ryabinin, Alexander Potapenko, Andrey Konovalov,
	Dmitry Vyukov, Vincenzo Frascino, Andrew Morton,
	Anshuman Khandual, Matthew Wilcox, Yu Zhao, Mark Rutland
  Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-mm

ptep_get_and_clear_full() adds a 'full' parameter which is not present
for the fallback ptep_get_and_clear() function. 'full' is set to 1 when
a full address space teardown is in progress. We use this information to
optimize arm64_sys_exit_group() by avoiding the unfolding (and therefore
the tlbi) of contiguous ranges. Instead we just clear the PTE but allow
all the contiguous neighbours to keep their contig bit set, because we
know we are about to clear the rest too.

When compiling the kernel, the cost of arm64_sys_exit_group() exploded
to 32x what it was before PTE_CONT support was wired up. With this
optimization in place, we are back down to the
original cost.
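
For context, the 'full' flag originates from the mmu_gather in the
core-mm zap path; roughly (paraphrased, not copied verbatim from
mm/memory.c):

/*
 * tlb->fullmm is set when the whole address space is being torn down
 * (e.g. at process exit), which is what enables the fast path above.
 */
static pte_t zap_one_pte(struct mmu_gather *tlb, struct mm_struct *mm,
			 unsigned long addr, pte_t *ptep)
{
	return ptep_get_and_clear_full(mm, addr, ptep, tlb->fullmm);
}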

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/arm64/include/asm/pgtable.h | 18 ++++++++-
 arch/arm64/mm/contpte.c          | 68 ++++++++++++++++++++++++++++++++
 2 files changed, 84 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 17ea534bc5b0..5963da651da7 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1128,6 +1128,8 @@ extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
 extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
 extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
 				pte_t *ptep, pte_t pte, unsigned int nr);
+extern pte_t contpte_ptep_get_and_clear_full(struct mm_struct *mm,
+				unsigned long addr, pte_t *ptep);
 extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
 				unsigned long addr, pte_t *ptep);
 extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
@@ -1252,12 +1254,24 @@ static inline void pte_clear(struct mm_struct *mm,
 	__pte_clear(mm, addr, ptep);
 }
 
+#define __HAVE_ARCH_PTEP_GET_AND_CLEAR_FULL
+static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
+				unsigned long addr, pte_t *ptep, int full)
+{
+	pte_t orig_pte = __ptep_get(ptep);
+
+	if (!pte_present(orig_pte) || !pte_cont(orig_pte) || !full) {
+		contpte_try_unfold(mm, addr, ptep, orig_pte);
+		return __ptep_get_and_clear(mm, addr, ptep);
+	} else
+		return contpte_ptep_get_and_clear_full(mm, addr, ptep);
+}
+
 #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
 static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
 				unsigned long addr, pte_t *ptep)
 {
-	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
-	return __ptep_get_and_clear(mm, addr, ptep);
+	return ptep_get_and_clear_full(mm, addr, ptep, 0);
 }
 
 #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
index e8e4a298fd53..0b585d1c4c94 100644
--- a/arch/arm64/mm/contpte.c
+++ b/arch/arm64/mm/contpte.c
@@ -241,6 +241,74 @@ void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
 	} while (addr != end);
 }
 
+pte_t contpte_ptep_get_and_clear_full(struct mm_struct *mm,
+					unsigned long addr, pte_t *ptep)
+{
+	/*
+	 * When doing a full address space teardown, we can avoid unfolding the
+	 * contiguous range, and therefore avoid the associated tlbi. Instead,
+	 * just clear the pte. The caller is promising to call us for every pte,
+	 * so every pte in the range will be cleared by the time the tlbi is
+	 * issued.
+	 *
+	 * However, this approach will leave the ptes in an inconsistent state
+	 * until ptep_get_and_clear_full() has been called for every pte in the
+	 * range. This could cause ptep_get() to fail to return the correct
+	 * access/dirty bits, if ptep_get() calls are interleaved with
+	 * ptep_get_and_clear_full() (which they are). Solve this by copying the
+	 * access/dirty bits to every pte in the range so that ptep_get() still
+	 * sees them if we have already cleared the pte that the hw chose to
+	 * update. Note that a full teardown will only happen when the process
+	 * is exiting, so we do not expect any more accesses and therefore no
+	 * more access/dirty bit updates, so there is no race here.
+	 */
+
+	pte_t *orig_ptep = ptep;
+	pte_t pte;
+	bool flags_propagated = false;
+	bool dirty = false;
+	bool young = false;
+	int i;
+
+	/* First, gather access and dirty bits. */
+	ptep = contpte_align_down(orig_ptep);
+	for (i = 0; i < CONT_PTES; i++, ptep++) {
+		pte = __ptep_get(ptep);
+
+		/*
+		 * If we find a zeroed PTE, contpte_ptep_get_and_clear_full()
+		 * must have already been called for it, so we have already
+		 * propagated the flags to the other ptes.
+		 */
+		if (pte_val(pte) == 0) {
+			flags_propagated = true;
+			break;
+		}
+
+		if (pte_dirty(pte))
+			dirty = true;
+
+		if (pte_young(pte))
+			young = true;
+	}
+
+	/* Now copy the access and dirty bits into each pte in the range. */
+	if (!flags_propagated) {
+		ptep = contpte_align_down(orig_ptep);
+		for (i = 0; i < CONT_PTES; i++, ptep++) {
+			pte = __ptep_get(ptep);
+
+			if (dirty)
+				pte = pte_mkdirty(pte);
+
+			if (young)
+				pte = pte_mkyoung(pte);
+
+			/* Write back so ptep_get() sees the propagated bits. */
+			__set_pte(ptep, pte);
+		}
+	}
+
+	return __ptep_get_and_clear(mm, addr, orig_ptep);
+}
+
 int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
 					unsigned long addr, pte_t *ptep)
 {
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v1 13/14] mm: Batch-copy PTE ranges during fork()
  2023-06-22 14:41 ` Ryan Roberts
@ 2023-06-22 14:42   ` Ryan Roberts
  -1 siblings, 0 replies; 46+ messages in thread
From: Ryan Roberts @ 2023-06-22 14:42 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	Oliver Upton, James Morse, Suzuki K Poulose, Zenghui Yu,
	Andrey Ryabinin, Alexander Potapenko, Andrey Konovalov,
	Dmitry Vyukov, Vincenzo Frascino, Andrew Morton,
	Anshuman Khandual, Matthew Wilcox, Yu Zhao, Mark Rutland
  Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-mm

Convert copy_pte_range() to copy a set of ptes that map a physically
contiguous block of memory in a batch. This will likely improve
performance by a tiny amount due to batching the folio reference count
management and calling set_ptes() rather than making individual calls to
set_pte_at().

However, the primary motivation for this change is to reduce the number
of tlb maintenance operations that the arm64 backend has to perform
during fork, now that it transparently supports the "contiguous bit" in
its ptes. By write-protecting the parent using the new
ptep_set_wrprotects() (note the 's' at the end) function, the backend
can avoid having to unfold contig ranges of PTEs, which is expensive,
when all ptes in the range are being write-protected. Similarly, by
using set_ptes() rather than set_pte_at() to set up ptes in the child,
the backend does not need to fold a contiguous range once its ptes are all
populated - they can be initially populated as a contiguous range in the
first place.

This change addresses the core-mm refactoring only, and introduces
ptep_set_wrprotects() with a default implementation that calls
ptep_set_wrprotect() for each pte in the range. A separate change will
implement ptep_set_wrprotects() in the arm64 backend to realize the
performance improvement.
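
To make the arm64 benefit concrete, one possible shape for that separate
change is sketched below. This is an assumption about the follow-up, not
its actual implementation: when the batch covers exactly one aligned
contpte block, all ptes change together, so the contig bit can stay set
and no unfold/tlbi is needed.

/* ASSUMPTION: hypothetical arm64 override, for illustration only. */
static inline void ptep_set_wrprotects(struct mm_struct *mm,
				unsigned long addr, pte_t *ptep,
				unsigned int nr)
{
	unsigned int i;

	if (nr == CONT_PTES &&
	    ((unsigned long)ptep / sizeof(pte_t)) % CONT_PTES == 0) {
		/* Whole block: wrprotect in place, keep PTE_CONT set. */
		for (i = 0; i < nr; i++, addr += PAGE_SIZE, ptep++)
			__ptep_set_wrprotect(mm, addr, ptep);
		return;
	}

	/* Partial block(s): fall back to the per-pte public API. */
	for (i = 0; i < nr; i++, addr += PAGE_SIZE, ptep++)
		ptep_set_wrprotect(mm, addr, ptep);
}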

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 include/linux/pgtable.h |  13 ++++
 mm/memory.c             | 149 +++++++++++++++++++++++++++++++---------
 2 files changed, 128 insertions(+), 34 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index a661a17173fa..6a7b28d520de 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -547,6 +547,19 @@ static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addres
 }
 #endif
 
+#ifndef ptep_set_wrprotects
+struct mm_struct;
+static inline void ptep_set_wrprotects(struct mm_struct *mm,
+				unsigned long address, pte_t *ptep,
+				unsigned int nr)
+{
+	unsigned int i;
+
+	for (i = 0; i < nr; i++, address += PAGE_SIZE, ptep++)
+		ptep_set_wrprotect(mm, address, ptep);
+}
+#endif
+
 /*
  * On some architectures hardware does not set page access bit when accessing
  * memory page, it is responsibility of software setting this bit. It brings
diff --git a/mm/memory.c b/mm/memory.c
index fb30f7523550..9a041cc31c74 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -911,57 +911,126 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
 		/* Uffd-wp needs to be delivered to dest pte as well */
 		pte = pte_mkuffd_wp(pte);
 	set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
-	return 0;
+	return 1;
+}
+
+static inline unsigned long page_addr(struct page *page,
+				struct page *anchor, unsigned long anchor_addr)
+{
+	unsigned long offset;
+	unsigned long addr;
+
+	offset = (page_to_pfn(page) - page_to_pfn(anchor)) << PAGE_SHIFT;
+	addr = anchor_addr + offset;
+
+	if (anchor > page) {
+		if (addr > anchor_addr)
+			return 0;
+	} else {
+		if (addr < anchor_addr)
+			return ULONG_MAX;
+	}
+
+	return addr;
+}
+
+static int calc_anon_folio_map_pgcount(struct folio *folio,
+				       struct page *page, pte_t *pte,
+				       unsigned long addr, unsigned long end)
+{
+	pte_t ptent;
+	int floops;
+	int i;
+	unsigned long pfn;
+
+	end = min(page_addr(&folio->page + folio_nr_pages(folio), page, addr),
+		  end);
+	floops = (end - addr) >> PAGE_SHIFT;
+	pfn = page_to_pfn(page);
+	pfn++;
+	pte++;
+
+	for (i = 1; i < floops; i++) {
+		ptent = ptep_get(pte);
+
+		if (!pte_present(ptent) ||
+		    pte_pfn(ptent) != pfn) {
+			return i;
+		}
+
+		pfn++;
+		pte++;
+	}
+
+	return floops;
 }
 
 /*
- * Copy one pte.  Returns 0 if succeeded, or -EAGAIN if one preallocated page
- * is required to copy this pte.
+ * Copy a set of contiguous ptes.  Returns the number of ptes copied on success
+ * (always >= 1), or -EAGAIN if one preallocated page is required to copy the
+ * first pte.
  */
 static inline int
-copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
-		 pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
-		 struct folio **prealloc)
+copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
+		  pte_t *dst_pte, pte_t *src_pte,
+		  unsigned long addr, unsigned long end,
+		  int *rss, struct folio **prealloc)
 {
 	struct mm_struct *src_mm = src_vma->vm_mm;
 	unsigned long vm_flags = src_vma->vm_flags;
 	pte_t pte = ptep_get(src_pte);
 	struct page *page;
 	struct folio *folio;
+	bool anon;
+	int nr;
+	int i;
 
 	page = vm_normal_page(src_vma, addr, pte);
-	if (page)
+	if (page) {
 		folio = page_folio(page);
-	if (page && folio_test_anon(folio)) {
-		/*
-		 * If this page may have been pinned by the parent process,
-		 * copy the page immediately for the child so that we'll always
-		 * guarantee the pinned page won't be randomly replaced in the
-		 * future.
-		 */
-		folio_get(folio);
-		if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
-			/* Page may be pinned, we have to copy. */
-			folio_put(folio);
-			return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
-						 addr, rss, prealloc, page);
+		anon = folio_test_anon(folio);
+		nr = calc_anon_folio_map_pgcount(folio, page, src_pte, addr, end);
+
+		for (i = 0; i < nr; i++, page++) {
+			if (anon) {
+				/*
+				 * If this page may have been pinned by the
+				 * parent process, copy the page immediately for
+				 * the child so that we'll always guarantee the
+				 * pinned page won't be randomly replaced in the
+				 * future.
+				 */
+				if (unlikely(page_try_dup_anon_rmap(
+						page, false, src_vma))) {
+					if (i != 0)
+						break;
+					/* Page may be pinned, we have to copy. */
+					return copy_present_page(dst_vma, src_vma,
+								 dst_pte, src_pte,
+								 addr, rss,
+								 prealloc, page);
+				}
+				rss[MM_ANONPAGES]++;
+				VM_BUG_ON(PageAnonExclusive(page));
+			} else {
+				page_dup_file_rmap(page, false);
+				rss[mm_counter_file(page)]++;
+			}
 		}
-		rss[MM_ANONPAGES]++;
-	} else if (page) {
-		folio_get(folio);
-		page_dup_file_rmap(page, false);
-		rss[mm_counter_file(page)]++;
-	}
+
+		nr = i;
+		folio_ref_add(folio, nr);
+	} else
+		nr = 1;
 
 	/*
 	 * If it's a COW mapping, write protect it both
 	 * in the parent and the child
 	 */
 	if (is_cow_mapping(vm_flags) && pte_write(pte)) {
-		ptep_set_wrprotect(src_mm, addr, src_pte);
+		ptep_set_wrprotects(src_mm, addr, src_pte, nr);
 		pte = pte_wrprotect(pte);
 	}
-	VM_BUG_ON(page && folio_test_anon(folio) && PageAnonExclusive(page));
 
 	/*
 	 * If it's a shared mapping, mark it clean in
@@ -974,8 +1043,8 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 	if (!userfaultfd_wp(dst_vma))
 		pte = pte_clear_uffd_wp(pte);
 
-	set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
-	return 0;
+	set_ptes(dst_vma->vm_mm, addr, dst_pte, pte, nr);
+	return nr;
 }
 
 static inline struct folio *page_copy_prealloc(struct mm_struct *src_mm,
@@ -1065,15 +1134,28 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 			 */
 			WARN_ON_ONCE(ret != -ENOENT);
 		}
-		/* copy_present_pte() will clear `*prealloc' if consumed */
-		ret = copy_present_pte(dst_vma, src_vma, dst_pte, src_pte,
-				       addr, rss, &prealloc);
+		/* copy_present_ptes() will clear `*prealloc' if consumed */
+		ret = copy_present_ptes(dst_vma, src_vma, dst_pte, src_pte,
+				       addr, end, rss, &prealloc);
+
 		/*
 		 * If we need a pre-allocated page for this pte, drop the
 		 * locks, allocate, and try again.
 		 */
 		if (unlikely(ret == -EAGAIN))
 			break;
+
+		/*
+		 * Positive return value is the number of ptes copied.
+		 */
+		VM_WARN_ON_ONCE(ret < 1);
+		progress += 8 * ret;
+		ret--;
+		dst_pte += ret;
+		src_pte += ret;
+		addr += ret << PAGE_SHIFT;
+		ret = 0;
+
 		if (unlikely(prealloc)) {
 			/*
 			 * pre-alloc page cannot be reused by next time so as
@@ -1084,7 +1166,6 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 			folio_put(prealloc);
 			prealloc = NULL;
 		}
-		progress += 8;
 	} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
 
 	arch_leave_lazy_mmu_mode();
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v1 14/14] arm64/mm: Implement ptep_set_wrprotects() to optimize fork()
  2023-06-22 14:41 ` Ryan Roberts
@ 2023-06-22 14:42   ` Ryan Roberts
  -1 siblings, 0 replies; 46+ messages in thread
From: Ryan Roberts @ 2023-06-22 14:42 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	Oliver Upton, James Morse, Suzuki K Poulose, Zenghui Yu,
	Andrey Ryabinin, Alexander Potapenko, Andrey Konovalov,
	Dmitry Vyukov, Vincenzo Frascino, Andrew Morton,
	Anshuman Khandual, Matthew Wilcox, Yu Zhao, Mark Rutland
  Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-mm

With the core-mm changes in place to batch-copy ptes during fork, we can
take advantage of this in arm64 to greatly reduce the number of tlbis we
have to issue, and recover the lost fork performance incurred when adding
support for transparent contiguous ptes.

If we are write-protecting a whole contig range, we can apply the
write-protection to the whole range and know that it won't change
whether the range should have the contiguous bit set or not. For ranges
smaller than the contig range, we will still have to unfold, apply the
write-protection, then fold if the change now means the range is
foldable.

Performance tested with the following test written for the will-it-scale
framework:

-------

char *testcase_description = "fork and exit";

void testcase(unsigned long long *iterations, unsigned long nr)
{
	int pid;
	char *mem;

	mem = malloc(SZ_128M);
	assert(mem);
	memset(mem, 1, SZ_128M);

	while (1) {
		pid = fork();
		assert(pid >= 0);

		if (!pid)
			exit(0);

		waitpid(pid, NULL, 0);

		(*iterations)++;
	}
}

-------

I see a huge performance regression when PTE_CONT support was added; the
regression is mostly fixed with the addition of this change. The
following shows regression relative to before PTE_CONT was enabled
(bigger negative value is bigger regression):

|   cpus |   before opt |   after opt |
|-------:|-------------:|------------:|
|      1 |       -10.4% |       -5.2% |
|      8 |       -15.4% |       -3.5% |
|     16 |       -38.7% |       -3.7% |
|     24 |       -57.0% |       -4.4% |
|     32 |       -65.8% |       -5.4% |

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/arm64/include/asm/pgtable.h | 48 ++++++++++++++++++++++----------
 arch/arm64/mm/contpte.c          | 41 +++++++++++++++++++++++++++
 arch/arm64/mm/hugetlbpage.c      |  2 +-
 3 files changed, 75 insertions(+), 16 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 5963da651da7..479a9e5ab756 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -963,21 +963,25 @@ static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm,
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 /*
- * __ptep_set_wrprotect - mark read-only while trasferring potential hardware
+ * __ptep_set_wrprotects - mark read-only while transferring potential hardware
  * dirty status (PTE_DBM && !PTE_RDONLY) to the software PTE_DIRTY bit.
  */
-static inline void __ptep_set_wrprotect(struct mm_struct *mm,
-					unsigned long address, pte_t *ptep)
+static inline void __ptep_set_wrprotects(struct mm_struct *mm,
+					unsigned long address, pte_t *ptep,
+					unsigned int nr)
 {
 	pte_t old_pte, pte;
-
-	pte = __ptep_get(ptep);
-	do {
-		old_pte = pte;
-		pte = pte_wrprotect(pte);
-		pte_val(pte) = cmpxchg_relaxed(&pte_val(*ptep),
-					       pte_val(old_pte), pte_val(pte));
-	} while (pte_val(pte) != pte_val(old_pte));
+	unsigned int i;
+
+	for (i = 0; i < nr; i++, address += PAGE_SIZE, ptep++) {
+		pte = __ptep_get(ptep);
+		do {
+			old_pte = pte;
+			pte = pte_wrprotect(pte);
+			pte_val(pte) = cmpxchg_relaxed(&pte_val(*ptep),
+						pte_val(old_pte), pte_val(pte));
+		} while (pte_val(pte) != pte_val(old_pte));
+	}
 }
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
@@ -985,7 +989,7 @@ static inline void __ptep_set_wrprotect(struct mm_struct *mm,
 static inline void pmdp_set_wrprotect(struct mm_struct *mm,
 				      unsigned long address, pmd_t *pmdp)
 {
-	__ptep_set_wrprotect(mm, address, (pte_t *)pmdp);
+	__ptep_set_wrprotects(mm, address, (pte_t *)pmdp, 1);
 }
 
 #define pmdp_establish pmdp_establish
@@ -1134,6 +1138,8 @@ extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
 				unsigned long addr, pte_t *ptep);
 extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
 				unsigned long addr, pte_t *ptep);
+extern void contpte_set_wrprotects(struct mm_struct *mm, unsigned long addr,
+				pte_t *ptep, unsigned int nr);
 extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
 				unsigned long addr, pte_t *ptep,
 				pte_t entry, int dirty);
@@ -1298,13 +1304,25 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
 	return contpte_ptep_clear_flush_young(vma, addr, ptep);
 }
 
+#define ptep_set_wrprotects ptep_set_wrprotects
+static inline void ptep_set_wrprotects(struct mm_struct *mm, unsigned long addr,
+				pte_t *ptep, unsigned int nr)
+{
+	if (!contpte_is_enabled(mm))
+		__ptep_set_wrprotects(mm, addr, ptep, nr);
+	else if (nr == 1) {
+		contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
+		__ptep_set_wrprotects(mm, addr, ptep, 1);
+		contpte_try_fold(mm, addr, ptep, __ptep_get(ptep));
+	} else
+		contpte_set_wrprotects(mm, addr, ptep, nr);
+}
+
 #define __HAVE_ARCH_PTEP_SET_WRPROTECT
 static inline void ptep_set_wrprotect(struct mm_struct *mm,
 				unsigned long addr, pte_t *ptep)
 {
-	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
-	__ptep_set_wrprotect(mm, addr, ptep);
-	contpte_try_fold(mm, addr, ptep, __ptep_get(ptep));
+	ptep_set_wrprotects(mm, addr, ptep, 1);
 }
 
 #define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
index 0b585d1c4c94..4f666697547d 100644
--- a/arch/arm64/mm/contpte.c
+++ b/arch/arm64/mm/contpte.c
@@ -353,6 +353,47 @@ int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
 	return young;
 }
 
+void contpte_set_wrprotects(struct mm_struct *mm, unsigned long addr,
+					pte_t *ptep, unsigned int nr)
+{
+	unsigned long next;
+	unsigned long end = addr + (nr << PAGE_SHIFT);
+
+	do {
+		next = pte_cont_addr_end(addr, end);
+		nr = (next - addr) >> PAGE_SHIFT;
+
+		/*
+		 * If wrprotecting an entire contig range, we can avoid
+		 * unfolding. Just set wrprotect and wait for the later
+		 * mmu_gather flush to invalidate the tlb. Until the flush, the
+		 * page may or may not be wrprotected. After the flush, it is
+		 * guaranteed wrprotected. If it's a partial range though, we
+		 * must unfold, because we can't have a case where CONT_PTE is
+		 * set but wrprotect applies to a subset of the PTEs; this would
+		 * cause it to continue to be unpredictable after the flush.
+		 */
+		if (nr != CONT_PTES)
+			contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
+
+		__ptep_set_wrprotects(mm, addr, ptep, nr);
+
+		addr = next;
+		ptep += nr;
+
+		/*
+		 * If applying to a partial contig range, the change could have
+		 * made the range foldable. Use the last pte in the range we
+		 * just set for comparison, since contpte_try_fold() only
+		 * triggers when acting on the last pte in the contig range.
+		 */
+		if (nr != CONT_PTES)
+			contpte_try_fold(mm, addr - PAGE_SIZE, ptep - 1,
+					 __ptep_get(ptep - 1));
+
+	} while (addr != end);
+}
+
 int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
 					unsigned long addr, pte_t *ptep,
 					pte_t entry, int dirty)
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 82b2036dbe2f..ecf7bfa761c3 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -511,7 +511,7 @@ void huge_ptep_set_wrprotect(struct mm_struct *mm,
 	pte_t pte;
 
 	if (!pte_cont(__ptep_get(ptep))) {
-		__ptep_set_wrprotect(mm, addr, ptep);
+		__ptep_set_wrprotects(mm, addr, ptep, 1);
 		return;
 	}
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [PATCH v1 11/14] arm64/mm: Wire up PTE_CONT for user mappings
  2023-06-22 14:42   ` Ryan Roberts
@ 2023-06-30  1:54     ` John Hubbard
  -1 siblings, 0 replies; 46+ messages in thread
From: John Hubbard @ 2023-06-30  1:54 UTC (permalink / raw)
  To: Ryan Roberts, Catalin Marinas, Will Deacon, Ard Biesheuvel,
	Marc Zyngier, Oliver Upton, James Morse, Suzuki K Poulose,
	Zenghui Yu, Andrey Ryabinin, Alexander Potapenko,
	Andrey Konovalov, Dmitry Vyukov, Vincenzo Frascino,
	Andrew Morton, Anshuman Khandual, Matthew Wilcox, Yu Zhao,
	Mark Rutland
  Cc: linux-arm-kernel, linux-kernel, linux-mm

On 6/22/23 07:42, Ryan Roberts wrote:
> With the ptep API sufficiently refactored, we can now introduce a new
> "contpte" API layer, which transparently manages the PTE_CONT bit for
> user mappings. Whenever it detects a set of PTEs that meet the
> requirements for a contiguous range, the PTEs are re-painted with the
> PTE_CONT bit.
> 
> This initial change provides a baseline that can be optimized in future
> commits. That said, fold/unfold operations (which imply tlb
> invalidation) are avoided where possible with a few tricks for
> access/dirty bit management.
> 
> Write-enable and write-protect modifications are likely non-optimal and
> likely incure a regression in fork() performance. This will be addressed
> separately.
> 
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---

Hi Ryan!

While trying out the full series from your gitlab features/granule_perf/all
branch, I found it necessary to EXPORT a symbol in order to build this.
Please see below:

...
> +
> +pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
> +{
> +	/*
> +	 * Gather access/dirty bits, which may be populated in any of the ptes
> +	 * of the contig range. We are guarranteed to be holding the PTL, so any
> +	 * contiguous range cannot be unfolded or otherwise modified under our
> +	 * feet.
> +	 */
> +
> +	pte_t pte;
> +	int i;
> +
> +	ptep = contpte_align_down(ptep);
> +
> +	for (i = 0; i < CONT_PTES; i++, ptep++) {
> +		pte = __ptep_get(ptep);
> +
> +		/*
> +		 * Deal with the partial contpte_ptep_get_and_clear_full() case,
> +		 * where some of the ptes in the range may be cleared but others
> +		 * are still to do. See contpte_ptep_get_and_clear_full().
> +		 */
> +		if (pte_val(pte) == 0)
> +			continue;
> +
> +		if (pte_dirty(pte))
> +			orig_pte = pte_mkdirty(orig_pte);
> +
> +		if (pte_young(pte))
> +			orig_pte = pte_mkyoung(orig_pte);
> +	}
> +
> +	return orig_pte;
> +}

Here we need something like this, in order to get it to build in all
possible configurations:

EXPORT_SYMBOL_GPL(contpte_ptep_get);

(and a corresponding "#include <linux/export.h>" at the top of the file).

This is because the static inline functions invoke this routine, above.
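
In diff form, roughly this (sketch only of the two additions, on top of
arch/arm64/mm/contpte.c from this series):

--- a/arch/arm64/mm/contpte.c
+++ b/arch/arm64/mm/contpte.c
@@
 #include <linux/mm.h>
+#include <linux/export.h>
 #include <asm/tlbflush.h>
@@
 	return orig_pte;
 }
+EXPORT_SYMBOL_GPL(contpte_ptep_get);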

thanks,
-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v1 11/14] arm64/mm: Wire up PTE_CONT for user mappings
  2023-06-30  1:54     ` John Hubbard
@ 2023-07-03  9:48       ` Ryan Roberts
  -1 siblings, 0 replies; 46+ messages in thread
From: Ryan Roberts @ 2023-07-03  9:48 UTC (permalink / raw)
  To: John Hubbard, Catalin Marinas, Will Deacon, Ard Biesheuvel,
	Marc Zyngier, Oliver Upton, James Morse, Suzuki K Poulose,
	Zenghui Yu, Andrey Ryabinin, Alexander Potapenko,
	Andrey Konovalov, Dmitry Vyukov, Vincenzo Frascino,
	Andrew Morton, Anshuman Khandual, Matthew Wilcox, Yu Zhao,
	Mark Rutland
  Cc: linux-arm-kernel, linux-kernel, linux-mm

On 30/06/2023 02:54, John Hubbard wrote:
> On 6/22/23 07:42, Ryan Roberts wrote:
>> With the ptep API sufficiently refactored, we can now introduce a new
>> "contpte" API layer, which transparently manages the PTE_CONT bit for
>> user mappings. Whenever it detects a set of PTEs that meet the
>> requirements for a contiguous range, the PTEs are re-painted with the
>> PTE_CONT bit.
>>
>> This initial change provides a baseline that can be optimized in future
>> commits. That said, fold/unfold operations (which imply tlb
>> invalidation) are avoided where possible with a few tricks for
>> access/dirty bit management.
>>
>> Write-enable and write-protect modifications are likely non-optimal and
>> likely incure a regression in fork() performance. This will be addressed
>> separately.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
> 
> Hi Ryan!
> 
> While trying out the full series from your gitlab features/granule_perf/all
> branch, I found it necessary to EXPORT a symbol in order to build this.

Thanks for the bug report!

> Please see below:
> 
> ...
>> +
>> +pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>> +{
>> +    /*
>> +     * Gather access/dirty bits, which may be populated in any of the ptes
>> +     * of the contig range. We are guarranteed to be holding the PTL, so any
>> +     * contiguous range cannot be unfolded or otherwise modified under our
>> +     * feet.
>> +     */
>> +
>> +    pte_t pte;
>> +    int i;
>> +
>> +    ptep = contpte_align_down(ptep);
>> +
>> +    for (i = 0; i < CONT_PTES; i++, ptep++) {
>> +        pte = __ptep_get(ptep);
>> +
>> +        /*
>> +         * Deal with the partial contpte_ptep_get_and_clear_full() case,
>> +         * where some of the ptes in the range may be cleared but others
>> +         * are still to do. See contpte_ptep_get_and_clear_full().
>> +         */
>> +        if (pte_val(pte) == 0)
>> +            continue;
>> +
>> +        if (pte_dirty(pte))
>> +            orig_pte = pte_mkdirty(orig_pte);
>> +
>> +        if (pte_young(pte))
>> +            orig_pte = pte_mkyoung(orig_pte);
>> +    }
>> +
>> +    return orig_pte;
>> +}
> 
> Here we need something like this, in order to get it to build in all
> possible configurations:
> 
> EXPORT_SYMBOL_GPL(contpte_ptep_get);
> 
> (and a corresponding "#include linux/export.h" at the top of the file).
> 
> Because, the static inline functions invoke this routine, above.


A quick grep through the drivers directory shows:

ptep_get() is used by:
  - drivers/gpu/drm/i915/gem/selftests/i915_gem_mman.c
  - drivers/misc/sgi-gru/grufault.c
  - drivers/vfio/vfio_iommu_type1.c
  - drivers/xen/privcmd.c

ptep_set_at() is used by:
  - drivers/gpu/drm/i915/i915_mm.c
  - drivers/xen/xlate_mmu.c

None of the other symbols are called, but I guess it is possible that out of
tree modules are calling others.

So on the basis that these symbols were previously pure inline, I propose to
export all the contpte_* symbols using EXPORT_SYMBOL() so that anything that was
previously calling them successfully can continue to do so. Will include in v2.
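
Concretely, something like the below (one EXPORT_SYMBOL() per function
definition in contpte.c; sketch only, the final list will be in v2):

EXPORT_SYMBOL(__contpte_try_fold);
EXPORT_SYMBOL(__contpte_try_unfold);
EXPORT_SYMBOL(contpte_ptep_get);
EXPORT_SYMBOL(contpte_ptep_get_lockless);
EXPORT_SYMBOL(contpte_set_ptes);
EXPORT_SYMBOL(contpte_ptep_test_and_clear_young);
EXPORT_SYMBOL(contpte_ptep_clear_flush_young);
EXPORT_SYMBOL(contpte_ptep_set_access_flags);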

Thanks,
Ryan


> 
> thanks,


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v1 11/14] arm64/mm: Wire up PTE_CONT for user mappings
  2023-06-22 14:42   ` Ryan Roberts
@ 2023-07-03 15:17     ` Catalin Marinas
  -1 siblings, 0 replies; 46+ messages in thread
From: Catalin Marinas @ 2023-07-03 15:17 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Will Deacon, Ard Biesheuvel, Marc Zyngier, Oliver Upton,
	James Morse, Suzuki K Poulose, Zenghui Yu, Andrey Ryabinin,
	Alexander Potapenko, Andrey Konovalov, Dmitry Vyukov,
	Vincenzo Frascino, Andrew Morton, Anshuman Khandual,
	Matthew Wilcox, Yu Zhao, Mark Rutland, linux-arm-kernel,
	linux-kernel, linux-mm

Hi Ryan,

Some comments below. I did not have time to trim down the quoted text,
so you may need to scroll through it.

On Thu, Jun 22, 2023 at 03:42:06PM +0100, Ryan Roberts wrote:
> With the ptep API sufficiently refactored, we can now introduce a new
> "contpte" API layer, which transparently manages the PTE_CONT bit for
> user mappings. Whenever it detects a set of PTEs that meet the
> requirements for a contiguous range, the PTEs are re-painted with the
> PTE_CONT bit.
> 
> This initial change provides a baseline that can be optimized in future
> commits. That said, fold/unfold operations (which imply tlb
> invalidation) are avoided where possible with a few tricks for
> access/dirty bit management.
> 
> Write-enable and write-protect modifications are likely non-optimal and
> likely incure a regression in fork() performance. This will be addressed
> separately.
> 
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  arch/arm64/include/asm/pgtable.h | 137 ++++++++++++-
>  arch/arm64/mm/Makefile           |   3 +-
>  arch/arm64/mm/contpte.c          | 334 +++++++++++++++++++++++++++++++
>  3 files changed, 466 insertions(+), 8 deletions(-)
>  create mode 100644 arch/arm64/mm/contpte.c
> 
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 31df4d73f9ac..17ea534bc5b0 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -1115,6 +1115,71 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
>  				    unsigned long addr, pte_t *ptep,
>  				    pte_t old_pte, pte_t new_pte);
>  
> +/*
> + * The contpte APIs are used to transparently manage the contiguous bit in ptes
> + * where it is possible and makes sense to do so. The PTE_CONT bit is considered
> + * a private implementation detail of the public ptep API (see below).
> + */
> +extern void __contpte_try_fold(struct mm_struct *mm, unsigned long addr,
> +				pte_t *ptep, pte_t pte);
> +extern void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
> +				pte_t *ptep, pte_t pte);
> +extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
> +extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
> +extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
> +				pte_t *ptep, pte_t pte, unsigned int nr);
> +extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
> +				unsigned long addr, pte_t *ptep);
> +extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
> +				unsigned long addr, pte_t *ptep);
> +extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
> +				unsigned long addr, pte_t *ptep,
> +				pte_t entry, int dirty);
> +
> +static inline pte_t *contpte_align_down(pte_t *ptep)
> +{
> +	return (pte_t *)(ALIGN_DOWN((unsigned long)ptep >> 3, CONT_PTES) << 3);
> +}
> +
> +static inline bool contpte_is_enabled(struct mm_struct *mm)
> +{
> +	/*
> +	 * Don't attempt to apply the contig bit to kernel mappings, because
> +	 * dynamically adding/removing the contig bit can cause page faults.
> +	 * These racing faults are ok for user space, since they get serialized
> +	 * on the PTL. But kernel mappings can't tolerate faults.
> +	 */
> +
> +	return mm != &init_mm;
> +}
> +
> +static inline void contpte_try_fold(struct mm_struct *mm, unsigned long addr,
> +					pte_t *ptep, pte_t pte)
> +{
> +	/*
> +	 * Only bother trying if both the virtual and physical addresses are
> +	 * aligned and correspond to the last entry in a contig range. The core
> +	 * code mostly modifies ranges from low to high, so this is the likely
> +	 * the last modification in the contig range, so a good time to fold.
> +	 */
> +
> +	bool valign = ((unsigned long)ptep >> 3) % CONT_PTES == CONT_PTES - 1;
> +	bool palign = pte_pfn(pte) % CONT_PTES == CONT_PTES - 1;
> +
> +	if (contpte_is_enabled(mm) &&
> +	    pte_present(pte) && !pte_cont(pte) &&
> +	    valign && palign)
> +		__contpte_try_fold(mm, addr, ptep, pte);

I would use pte_valid() here instead. pte_present() also includes the
PTE_PROT_NONE option which we don't really care about as it's not
accessible.

I've been discussing with Alexandru Elisei about PTE_PROT_NONE and
whether we can use other bits from the pte even if they clash with other
valid permissions. Since the pte is not valid, in theory we could, as
long as nothing corrupts the pte (like a cont bit). The background to this
is multiple migrate types (not just NUMA) for the MTE tag carveout
reuse.

> +}
> +
> +static inline void contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
> +					pte_t *ptep, pte_t pte)
> +{
> +	if (contpte_is_enabled(mm) &&
> +	    pte_present(pte) && pte_cont(pte))
> +		__contpte_try_unfold(mm, addr, ptep, pte);
> +}

Same here and probably most other places where pte_present() is used in
this patch.

> +
>  /*
>   * The below functions constitute the public API that arm64 presents to the
>   * core-mm to manipulate PTE entries within the their page tables (or at least
> @@ -1122,30 +1187,68 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
>   * versions will automatically and transparently apply the contiguous bit where
>   * it makes sense to do so. Therefore any users that are contig-aware (e.g.
>   * hugetlb, kernel mapper) should NOT use these APIs, but instead use the
> - * private versions, which are prefixed with double underscore.
> + * private versions, which are prefixed with double underscore. All of these
> + * APIs except for ptep_get_lockless() are expected to be called with the PTL
> + * held.
>   */
>  
>  #define ptep_get ptep_get
>  static inline pte_t ptep_get(pte_t *ptep)
>  {
> -	return __ptep_get(ptep);
> +	pte_t pte = __ptep_get(ptep);
> +
> +	if (!pte_present(pte) || !pte_cont(pte))
> +		return pte;
> +
> +	return contpte_ptep_get(ptep, pte);
> +}
> +
> +#define ptep_get_lockless ptep_get_lockless
> +static inline pte_t ptep_get_lockless(pte_t *ptep)
> +{
> +	pte_t pte = __ptep_get(ptep);
> +
> +	if (!pte_present(pte) || !pte_cont(pte))
> +		return pte;
> +
> +	return contpte_ptep_get_lockless(ptep);
>  }
>  
>  static inline void set_pte(pte_t *ptep, pte_t pte)
>  {
> -	__set_pte(ptep, pte);
> +	/*
> +	 * We don't have the mm or vaddr so cannot unfold or fold contig entries
> +	 * (since it requires tlb maintenance). set_pte() is not used in core
> +	 * code, so this should never even be called. Regardless do our best to
> +	 * service any call and emit a warning if there is any attempt to set a
> +	 * pte on top of an existing contig range.
> +	 */
> +	pte_t orig_pte = __ptep_get(ptep);
> +
> +	WARN_ON_ONCE(pte_present(orig_pte) && pte_cont(orig_pte));
> +	__set_pte(ptep, pte_mknoncont(pte));

Why the pte_mknoncont() here? Do we expect a contiguous pte? The warning
only checks the old entry.

>  }
>  
>  #define set_ptes set_ptes
>  static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
>  				pte_t *ptep, pte_t pte, unsigned int nr)
>  {
> -	__set_ptes(mm, addr, ptep, pte, nr);
> +	pte = pte_mknoncont(pte);
> +
> +	if (!contpte_is_enabled(mm))
> +		__set_ptes(mm, addr, ptep, pte, nr);
> +	else if (nr == 1) {
> +		contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
> +		__set_ptes(mm, addr, ptep, pte, nr);
> +		contpte_try_fold(mm, addr, ptep, pte);
> +	} else
> +		contpte_set_ptes(mm, addr, ptep, pte, nr);
>  }
>  
>  static inline void pte_clear(struct mm_struct *mm,
>  				unsigned long addr, pte_t *ptep)
>  {
> +	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>  	__pte_clear(mm, addr, ptep);
>  }
>  
> @@ -1153,6 +1256,7 @@ static inline void pte_clear(struct mm_struct *mm,
>  static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>  				unsigned long addr, pte_t *ptep)
>  {
> +	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>  	return __ptep_get_and_clear(mm, addr, ptep);
>  }
>  
> @@ -1160,21 +1264,33 @@ static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>  static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>  				unsigned long addr, pte_t *ptep)
>  {
> -	return __ptep_test_and_clear_young(vma, addr, ptep);
> +	pte_t orig_pte = __ptep_get(ptep);
> +
> +	if (!pte_present(orig_pte) || !pte_cont(orig_pte))
> +		return __ptep_test_and_clear_young(vma, addr, ptep);

Since I've seen this construct a few times, you may want to turn it into
a specific check: pte_valid_cont().
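
Something like this perhaps (just a sketch; name/placement up to you):

	static inline bool pte_valid_cont(pte_t pte)
	{
		return pte_valid(pte) && pte_cont(pte);
	}

so that checks like the one above become:

	if (!pte_valid_cont(orig_pte))
		return __ptep_test_and_clear_young(vma, addr, ptep);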

> +
> +	return contpte_ptep_test_and_clear_young(vma, addr, ptep);
>  }
>  
>  #define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
>  static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>  				unsigned long addr, pte_t *ptep)
>  {
> -	return __ptep_clear_flush_young(vma, addr, ptep);
> +	pte_t orig_pte = __ptep_get(ptep);
> +
> +	if (!pte_present(orig_pte) || !pte_cont(orig_pte))
> +		return __ptep_clear_flush_young(vma, addr, ptep);
> +
> +	return contpte_ptep_clear_flush_young(vma, addr, ptep);
>  }
>  
>  #define __HAVE_ARCH_PTEP_SET_WRPROTECT
>  static inline void ptep_set_wrprotect(struct mm_struct *mm,
>  				unsigned long addr, pte_t *ptep)
>  {
> +	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>  	__ptep_set_wrprotect(mm, addr, ptep);
> +	contpte_try_fold(mm, addr, ptep, __ptep_get(ptep));
>  }
>  
>  #define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
> @@ -1182,7 +1298,14 @@ static inline int ptep_set_access_flags(struct vm_area_struct *vma,
>  				unsigned long addr, pte_t *ptep,
>  				pte_t entry, int dirty)
>  {
> -	return __ptep_set_access_flags(vma, addr, ptep, entry, dirty);
> +	pte_t orig_pte = __ptep_get(ptep);
> +
> +	entry = pte_mknoncont(entry);

As in a few other places, it's not clear to me why the pte_mknoncont()
is needed. Here I expect 'entry' to be cont if *ptep is cont.

> +
> +	if (!pte_present(orig_pte) || !pte_cont(orig_pte))
> +		return __ptep_set_access_flags(vma, addr, ptep, entry, dirty);

Also wondering, can we have this check on 'entry' rather than
'orig_pte'? And maybe a warning if the cont bit differs between them.
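
i.e. roughly this shape (sketch only, untested; assumes the pte_mknoncont()
above is dropped and uses the pte_valid_cont() helper suggested earlier):

	pte_t orig_pte = __ptep_get(ptep);

	VM_WARN_ON_ONCE(pte_cont(entry) != pte_cont(orig_pte));

	if (!pte_valid_cont(entry))
		return __ptep_set_access_flags(vma, addr, ptep, entry, dirty);

	return contpte_ptep_set_access_flags(vma, addr, ptep, entry, dirty);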

> +
> +	return contpte_ptep_set_access_flags(vma, addr, ptep, entry, dirty);
>  }
>  
>  #endif /* !__ASSEMBLY__ */
> diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
> index dbd1bc95967d..70b6aba09b5d 100644
> --- a/arch/arm64/mm/Makefile
> +++ b/arch/arm64/mm/Makefile
> @@ -2,7 +2,8 @@
>  obj-y				:= dma-mapping.o extable.o fault.o init.o \
>  				   cache.o copypage.o flush.o \
>  				   ioremap.o mmap.o pgd.o mmu.o \
> -				   context.o proc.o pageattr.o fixmap.o
> +				   context.o proc.o pageattr.o fixmap.o \
> +				   contpte.o
>  obj-$(CONFIG_HUGETLB_PAGE)	+= hugetlbpage.o
>  obj-$(CONFIG_PTDUMP_CORE)	+= ptdump.o
>  obj-$(CONFIG_PTDUMP_DEBUGFS)	+= ptdump_debugfs.o
> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> new file mode 100644
> index 000000000000..e8e4a298fd53
> --- /dev/null
> +++ b/arch/arm64/mm/contpte.c
> @@ -0,0 +1,334 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (C) 2023 ARM Ltd.
> + */
> +
> +#include <linux/mm.h>
> +#include <asm/tlbflush.h>
> +
> +static void ptep_clear_flush_range(struct mm_struct *mm, unsigned long addr,
> +				pte_t *ptep, int nr)
> +{
> +	struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
> +	unsigned long start_addr = addr;
> +	int i;
> +
> +	for (i = 0; i < nr; i++, ptep++, addr += PAGE_SIZE)
> +		__pte_clear(mm, addr, ptep);
> +
> +	__flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
> +}
> +
> +static bool ptep_any_present(pte_t *ptep, int nr)

Valid?

> +{
> +	int i;
> +
> +	for (i = 0; i < nr; i++, ptep++) {
> +		if (pte_present(__ptep_get(ptep)))
> +			return true;
> +	}
> +
> +	return false;
> +}
> +
> +static void contpte_fold(struct mm_struct *mm, unsigned long addr,
> +			pte_t *ptep, pte_t pte, bool fold)
> +{
> +	struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
> +	unsigned long start_addr;
> +	pte_t *start_ptep;
> +	int i;
> +
> +	start_ptep = ptep = contpte_align_down(ptep);
> +	start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
> +	pte = pfn_pte(ALIGN_DOWN(pte_pfn(pte), CONT_PTES), pte_pgprot(pte));
> +	pte = fold ? pte_mkcont(pte) : pte_mknoncont(pte);
> +
> +	for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE) {
> +		pte_t ptent = __ptep_get_and_clear(mm, addr, ptep);
> +
> +		if (pte_dirty(ptent))
> +			pte = pte_mkdirty(pte);
> +
> +		if (pte_young(ptent))
> +			pte = pte_mkyoung(pte);
> +	}

I presume this can be unsafe if any of the ptes in the range differ, so
we need some higher level check. But that means we now have three loops
for folding, one to check, another to clear and the last one via
__set_ptes(). I guess we can't collapse the first two loops in a 'try'
function as we need to do the cleaning (and would have to re-instate the
old entries if they can't be made contiguous).

> +
> +	__flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
> +
> +	__set_ptes(mm, start_addr, start_ptep, pte, CONT_PTES);
> +}
> +
> +void __contpte_try_fold(struct mm_struct *mm, unsigned long addr,
> +			pte_t *ptep, pte_t pte)
> +{
> +	/*
> +	 * We have already checked that the virtual and pysical addresses are
> +	 * correctly aligned for a contig mapping in contpte_try_fold() so the
> +	 * remaining checks are to ensure that the contig range is fully covered
> +	 * by a single folio, and ensure that all the ptes are present with
> +	 * contiguous PFNs and matching prots.
> +	 */
> +
> +	struct page *page = pte_page(pte);
> +	struct folio *folio = page_folio(page);
> +	unsigned long folio_saddr = addr - (page - &folio->page) * PAGE_SIZE;
> +	unsigned long folio_eaddr = folio_saddr + folio_nr_pages(folio) * PAGE_SIZE;
> +	unsigned long cont_saddr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
> +	unsigned long cont_eaddr = cont_saddr + CONT_PTE_SIZE;
> +	unsigned long pfn;
> +	pgprot_t prot;
> +	pte_t subpte;
> +	pte_t *orig_ptep;
> +	int i;
> +
> +	if (folio_saddr > cont_saddr || folio_eaddr < cont_eaddr)
> +		return;
> +
> +	pfn = pte_pfn(pte) - ((addr - cont_saddr) >> PAGE_SHIFT);
> +	prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
> +	orig_ptep = ptep;
> +	ptep = contpte_align_down(ptep);
> +
> +	for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
> +		subpte = __ptep_get(ptep);
> +		subpte = pte_mkold(pte_mkclean(subpte));

IIUC, this function assumes ptes that only differ by the dirty status
can be contiguous. That's probably ok, with a chance of the dirty status
spreading to the adjacent ptes in the fold function. Maybe add a comment
on why this is ok (or why it doesn't happen).

> +
> +		if (!pte_present(subpte) ||
> +		    pte_pfn(subpte) != pfn ||
> +		    pgprot_val(pte_pgprot(subpte)) != pgprot_val(prot))
> +			return;
> +	}
> +
> +	contpte_fold(mm, addr, orig_ptep, pte, true);
> +}
> +
> +void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
> +			pte_t *ptep, pte_t pte)
> +{
> +	/*
> +	 * We have already checked that the ptes are contiguous in
> +	 * contpte_try_unfold(), so we can unfold unconditionally here.
> +	 */
> +
> +	contpte_fold(mm, addr, ptep, pte, false);
> +}

So the pte_mkyoung(), pte_mkdirty() calls in contpte_fold() are mostly
for the unfold case. Maybe it's clearer if we just have two separate
functions (or document why the pte_mk*() are needed).

> +
> +pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
> +{
> +	/*
> +	 * Gather access/dirty bits, which may be populated in any of the ptes
> +	 * of the contig range. We are guarranteed to be holding the PTL, so any
> +	 * contiguous range cannot be unfolded or otherwise modified under our
> +	 * feet.
> +	 */
> +
> +	pte_t pte;
> +	int i;
> +
> +	ptep = contpte_align_down(ptep);
> +
> +	for (i = 0; i < CONT_PTES; i++, ptep++) {
> +		pte = __ptep_get(ptep);
> +
> +		/*
> +		 * Deal with the partial contpte_ptep_get_and_clear_full() case,
> +		 * where some of the ptes in the range may be cleared but others
> +		 * are still to do. See contpte_ptep_get_and_clear_full().
> +		 */
> +		if (pte_val(pte) == 0)
> +			continue;
> +
> +		if (pte_dirty(pte))
> +			orig_pte = pte_mkdirty(orig_pte);
> +
> +		if (pte_young(pte))
> +			orig_pte = pte_mkyoung(orig_pte);
> +	}
> +
> +	return orig_pte;
> +}
> +
> +pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
> +{
> +	/*
> +	 * Gather access/dirty bits, which may be populated in any of the ptes
> +	 * of the contig range. We may not be holding the PTL, so any contiguous
> +	 * range may be unfolded/modified/refolded under our feet.
> +	 */
> +
> +	pte_t orig_pte;
> +	pgprot_t orig_prot;
> +	pte_t *ptep;
> +	unsigned long pfn;
> +	pte_t pte;
> +	pgprot_t prot;
> +	int i;
> +
> +retry:
> +	orig_pte = __ptep_get(orig_ptep);
> +
> +	if (!pte_present(orig_pte) || !pte_cont(orig_pte))
> +		return orig_pte;

I haven't looked through all the patches, so not entirely sure when this
function is called. But since you mention that the range may be
folded/unfolded, how do we deal with pte_cont() racing with something
setting the contig bit?

> +
> +	orig_prot = pte_pgprot(pte_mkold(pte_mkclean(orig_pte)));
> +	ptep = contpte_align_down(orig_ptep);
> +	pfn = pte_pfn(orig_pte) - (orig_ptep - ptep);
> +
> +	for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
> +		pte = __ptep_get(ptep);
> +		prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
> +
> +		if (!pte_present(pte) || !pte_cont(pte) ||
> +		   pte_pfn(pte) != pfn ||
> +		   pgprot_val(prot) != pgprot_val(orig_prot))
> +			goto retry;

It needs better documenting, I don't understand what the retry here is
for (presumably to handle some races). Do we care about some memory
ordering as well? __ptep_get() only takes care of reading the ptep once.

> +
> +		if (pte_dirty(pte))
> +			orig_pte = pte_mkdirty(orig_pte);
> +
> +		if (pte_young(pte))
> +			orig_pte = pte_mkyoung(orig_pte);
> +	}
> +
> +	return orig_pte;
> +}
> +
> +void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
> +					pte_t *ptep, pte_t pte, unsigned int nr)
> +{
> +	unsigned long next;
> +	unsigned long end = addr + (nr << PAGE_SHIFT);
> +	unsigned long pfn = pte_pfn(pte);
> +	pgprot_t prot = pte_pgprot(pte);
> +	pte_t orig_pte;
> +
> +	do {
> +		next = pte_cont_addr_end(addr, end);
> +		nr = (next - addr) >> PAGE_SHIFT;
> +		pte = pfn_pte(pfn, prot);
> +
> +		if (((addr | next | (pfn << PAGE_SHIFT)) & ~CONT_PTE_MASK) == 0)
> +			pte = pte_mkcont(pte);
> +		else
> +			pte = pte_mknoncont(pte);
> +
> +		/*
> +		 * If operating on a partial contiguous range then we must first
> +		 * unfold the contiguous range if it was previously folded.
> +		 * Otherwise we could end up with overlapping tlb entries.
> +		 */
> +		if (nr != CONT_PTES)
> +			contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
> +
> +		/*
> +		 * If we are replacing ptes that were contiguous or if the new
> +		 * ptes are contiguous and any of the ptes being replaced are
> +		 * present, we need to clear and flush the range to prevent
> +		 * overlapping tlb entries.
> +		 */
> +		orig_pte = __ptep_get(ptep);
> +		if ((pte_present(orig_pte) && pte_cont(orig_pte)) ||
> +		    (pte_cont(pte) && ptep_any_present(ptep, nr)))
> +			ptep_clear_flush_range(mm, addr, ptep, nr);
> +
> +		__set_ptes(mm, addr, ptep, pte, nr);
> +
> +		addr = next;
> +		ptep += nr;
> +		pfn += nr;
> +
> +	} while (addr != end);
> +}
> +
> +int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
> +					unsigned long addr, pte_t *ptep)
> +{
> +	/*
> +	 * ptep_clear_flush_young() technically requires us to clear the access
> +	 * flag for a _single_ pte. However, the core-mm code actually tracks
> +	 * access/dirty per folio, not per page. And since we only create a
> +	 * contig range when the range is covered by a single folio, we can get
> +	 * away with clearing young for the whole contig range here, so we avoid
> +	 * having to unfold.
> +	 */
> +
> +	int i;
> +	int young = 0;
> +
> +	ptep = contpte_align_down(ptep);
> +	addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
> +
> +	for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
> +		young |= __ptep_test_and_clear_young(vma, addr, ptep);
> +
> +	return young;
> +}
> +
> +int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
> +					unsigned long addr, pte_t *ptep)
> +{
> +	int young;
> +
> +	young = contpte_ptep_test_and_clear_young(vma, addr, ptep);
> +	addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
> +
> +	if (young) {
> +		/*
> +		 * See comment in __ptep_clear_flush_young(); same rationale for
> +		 * eliding the trailing DSB applies here.
> +		 */
> +		__flush_tlb_range_nosync(vma, addr, addr + CONT_PTE_SIZE,
> +					 PAGE_SIZE, true, 3);
> +	}
> +
> +	return young;
> +}
> +
> +int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
> +					unsigned long addr, pte_t *ptep,
> +					pte_t entry, int dirty)
> +{
> +	pte_t orig_pte;
> +	int i;
> +
> +	/*
> +	 * Gather the access/dirty bits for the contiguous range. If nothing has
> +	 * changed, its a noop.
> +	 */
> +	orig_pte = ptep_get(ptep);
> +	if (pte_val(orig_pte) == pte_val(entry))
> +		return 0;
> +
> +	/*
> +	 * We can fix up access/dirty bits without having to unfold/fold the
> +	 * contig range. But if the write bit is changing, we need to go through
> +	 * the full unfold/fold cycle.
> +	 */
> +	if (pte_write(orig_pte) == pte_write(entry)) {

Depending on the architecture version, pte_write() either checks a
software only bit or it checks the DBM one.

> +		/*
> +		 * No need to flush here; This is always "more permissive" so we
> +		 * can only be _adding_ the access or dirty bit. And since the
> +		 * tlb can't cache an entry without the AF set and the dirty bit
> +		 * is a SW bit, there can be no confusion. For HW access
> +		 * management, we technically only need to update the flag on a
> +		 * single pte in the range. But for SW access management, we
> +		 * need to update all the ptes to prevent extra faults.
> +		 */

On pre-DBM hardware, a PTE_RDONLY entry (writable from the kernel
perspective but clean) may be cached in the TLB and we do need flushing.

> +		ptep = contpte_align_down(ptep);
> +		addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
> +
> +		for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
> +			__ptep_set_access_flags(vma, addr, ptep, entry, 0);
> +	} else {
> +		/*
> +		 * No need to flush in __ptep_set_access_flags() because we just
> +		 * flushed the whole range in __contpte_try_unfold().
> +		 */
> +		__contpte_try_unfold(vma->vm_mm, addr, ptep, orig_pte);
> +		__ptep_set_access_flags(vma, addr, ptep, entry, 0);
> +		contpte_try_fold(vma->vm_mm, addr, ptep, entry);
> +	}
> +
> +	return 1;
> +}

-- 
Catalin

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v1 11/14] arm64/mm: Wire up PTE_CONT for user mappings
@ 2023-07-03 15:17     ` Catalin Marinas
  0 siblings, 0 replies; 46+ messages in thread
From: Catalin Marinas @ 2023-07-03 15:17 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Will Deacon, Ard Biesheuvel, Marc Zyngier, Oliver Upton,
	James Morse, Suzuki K Poulose, Zenghui Yu, Andrey Ryabinin,
	Alexander Potapenko, Andrey Konovalov, Dmitry Vyukov,
	Vincenzo Frascino, Andrew Morton, Anshuman Khandual,
	Matthew Wilcox, Yu Zhao, Mark Rutland, linux-arm-kernel,
	linux-kernel, linux-mm

Hi Ryan,

Some comments below. I did not have time to trim down the quoted text,
so you may need to scroll through it.

On Thu, Jun 22, 2023 at 03:42:06PM +0100, Ryan Roberts wrote:
> With the ptep API sufficiently refactored, we can now introduce a new
> "contpte" API layer, which transparently manages the PTE_CONT bit for
> user mappings. Whenever it detects a set of PTEs that meet the
> requirements for a contiguous range, the PTEs are re-painted with the
> PTE_CONT bit.
> 
> This initial change provides a baseline that can be optimized in future
> commits. That said, fold/unfold operations (which imply tlb
> invalidation) are avoided where possible with a few tricks for
> access/dirty bit management.
> 
> Write-enable and write-protect modifications are likely non-optimal and
> likely incure a regression in fork() performance. This will be addressed
> separately.
> 
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  arch/arm64/include/asm/pgtable.h | 137 ++++++++++++-
>  arch/arm64/mm/Makefile           |   3 +-
>  arch/arm64/mm/contpte.c          | 334 +++++++++++++++++++++++++++++++
>  3 files changed, 466 insertions(+), 8 deletions(-)
>  create mode 100644 arch/arm64/mm/contpte.c
> 
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 31df4d73f9ac..17ea534bc5b0 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -1115,6 +1115,71 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
>  				    unsigned long addr, pte_t *ptep,
>  				    pte_t old_pte, pte_t new_pte);
>  
> +/*
> + * The contpte APIs are used to transparently manage the contiguous bit in ptes
> + * where it is possible and makes sense to do so. The PTE_CONT bit is considered
> + * a private implementation detail of the public ptep API (see below).
> + */
> +extern void __contpte_try_fold(struct mm_struct *mm, unsigned long addr,
> +				pte_t *ptep, pte_t pte);
> +extern void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
> +				pte_t *ptep, pte_t pte);
> +extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
> +extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
> +extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
> +				pte_t *ptep, pte_t pte, unsigned int nr);
> +extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
> +				unsigned long addr, pte_t *ptep);
> +extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
> +				unsigned long addr, pte_t *ptep);
> +extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
> +				unsigned long addr, pte_t *ptep,
> +				pte_t entry, int dirty);
> +
> +static inline pte_t *contpte_align_down(pte_t *ptep)
> +{
> +	return (pte_t *)(ALIGN_DOWN((unsigned long)ptep >> 3, CONT_PTES) << 3);
> +}
> +
> +static inline bool contpte_is_enabled(struct mm_struct *mm)
> +{
> +	/*
> +	 * Don't attempt to apply the contig bit to kernel mappings, because
> +	 * dynamically adding/removing the contig bit can cause page faults.
> +	 * These racing faults are ok for user space, since they get serialized
> +	 * on the PTL. But kernel mappings can't tolerate faults.
> +	 */
> +
> +	return mm != &init_mm;
> +}
> +
> +static inline void contpte_try_fold(struct mm_struct *mm, unsigned long addr,
> +					pte_t *ptep, pte_t pte)
> +{
> +	/*
> +	 * Only bother trying if both the virtual and physical addresses are
> +	 * aligned and correspond to the last entry in a contig range. The core
> +	 * code mostly modifies ranges from low to high, so this is the likely
> +	 * the last modification in the contig range, so a good time to fold.
> +	 */
> +
> +	bool valign = ((unsigned long)ptep >> 3) % CONT_PTES == CONT_PTES - 1;
> +	bool palign = pte_pfn(pte) % CONT_PTES == CONT_PTES - 1;
> +
> +	if (contpte_is_enabled(mm) &&
> +	    pte_present(pte) && !pte_cont(pte) &&
> +	    valign && palign)
> +		__contpte_try_fold(mm, addr, ptep, pte);

I would use pte_valid() here instead. pte_present() also includes the
PTE_PROT_NONE option which we don't really care about as it's not
accessible.
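
For reference, the relevant arm64 checks are roughly the following (quoting
from memory, so double-check pgtable.h):

	#define pte_present(pte)	(!!(pte_val(pte) & (PTE_VALID | PTE_PROT_NONE)))
	#define pte_valid(pte)		(!!(pte_val(pte) & PTE_VALID))

i.e. pte_valid() is the stricter check and excludes the PROT_NONE case.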

I've been discussing with Alexandru Elisei about PTE_PROT_NONE and
whether we can use other bits from the pte even if they clash with other
valid permissions. Since the pte is not valid, in theory we could, as
long as nothing corrupts the pte (like a cont bit). The background to this
is multiple migrate types (not just NUMA) for the MTE tag carveout
reuse.

> +}
> +
> +static inline void contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
> +					pte_t *ptep, pte_t pte)
> +{
> +	if (contpte_is_enabled(mm) &&
> +	    pte_present(pte) && pte_cont(pte))
> +		__contpte_try_unfold(mm, addr, ptep, pte);
> +}

Same here and probably most other places where pte_present() is used in
this patch.

> +
>  /*
>   * The below functions constitute the public API that arm64 presents to the
>   * core-mm to manipulate PTE entries within the their page tables (or at least
> @@ -1122,30 +1187,68 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
>   * versions will automatically and transparently apply the contiguous bit where
>   * it makes sense to do so. Therefore any users that are contig-aware (e.g.
>   * hugetlb, kernel mapper) should NOT use these APIs, but instead use the
> - * private versions, which are prefixed with double underscore.
> + * private versions, which are prefixed with double underscore. All of these
> + * APIs except for ptep_get_lockless() are expected to be called with the PTL
> + * held.
>   */
>  
>  #define ptep_get ptep_get
>  static inline pte_t ptep_get(pte_t *ptep)
>  {
> -	return __ptep_get(ptep);
> +	pte_t pte = __ptep_get(ptep);
> +
> +	if (!pte_present(pte) || !pte_cont(pte))
> +		return pte;
> +
> +	return contpte_ptep_get(ptep, pte);
> +}
> +
> +#define ptep_get_lockless ptep_get_lockless
> +static inline pte_t ptep_get_lockless(pte_t *ptep)
> +{
> +	pte_t pte = __ptep_get(ptep);
> +
> +	if (!pte_present(pte) || !pte_cont(pte))
> +		return pte;
> +
> +	return contpte_ptep_get_lockless(ptep);
>  }
>  
>  static inline void set_pte(pte_t *ptep, pte_t pte)
>  {
> -	__set_pte(ptep, pte);
> +	/*
> +	 * We don't have the mm or vaddr so cannot unfold or fold contig entries
> +	 * (since it requires tlb maintenance). set_pte() is not used in core
> +	 * code, so this should never even be called. Regardless do our best to
> +	 * service any call and emit a warning if there is any attempt to set a
> +	 * pte on top of an existing contig range.
> +	 */
> +	pte_t orig_pte = __ptep_get(ptep);
> +
> +	WARN_ON_ONCE(pte_present(orig_pte) && pte_cont(orig_pte));
> +	__set_pte(ptep, pte_mknoncont(pte));

Why the pte_mknoncont() here? Do we expect a contiguous pte? The warning
only checks the old entry.

>  }
>  
>  #define set_ptes set_ptes
>  static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
>  				pte_t *ptep, pte_t pte, unsigned int nr)
>  {
> -	__set_ptes(mm, addr, ptep, pte, nr);
> +	pte = pte_mknoncont(pte);
> +
> +	if (!contpte_is_enabled(mm))
> +		__set_ptes(mm, addr, ptep, pte, nr);
> +	else if (nr == 1) {
> +		contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
> +		__set_ptes(mm, addr, ptep, pte, nr);
> +		contpte_try_fold(mm, addr, ptep, pte);
> +	} else
> +		contpte_set_ptes(mm, addr, ptep, pte, nr);
>  }
>  
>  static inline void pte_clear(struct mm_struct *mm,
>  				unsigned long addr, pte_t *ptep)
>  {
> +	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>  	__pte_clear(mm, addr, ptep);
>  }
>  
> @@ -1153,6 +1256,7 @@ static inline void pte_clear(struct mm_struct *mm,
>  static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>  				unsigned long addr, pte_t *ptep)
>  {
> +	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>  	return __ptep_get_and_clear(mm, addr, ptep);
>  }
>  
> @@ -1160,21 +1264,33 @@ static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>  static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>  				unsigned long addr, pte_t *ptep)
>  {
> -	return __ptep_test_and_clear_young(vma, addr, ptep);
> +	pte_t orig_pte = __ptep_get(ptep);
> +
> +	if (!pte_present(orig_pte) || !pte_cont(orig_pte))
> +		return __ptep_test_and_clear_young(vma, addr, ptep);

Since I've seen this construct a few times, you may want to turn it into
a specific check: pte_valid_cont().
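
Something like this, perhaps (untested sketch, name up to you):

	static inline bool pte_valid_cont(pte_t pte)
	{
		return pte_valid(pte) && pte_cont(pte);
	}

so the wrappers become "if (!pte_valid_cont(orig_pte)) return __ptep_...;".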

> +
> +	return contpte_ptep_test_and_clear_young(vma, addr, ptep);
>  }
>  
>  #define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
>  static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>  				unsigned long addr, pte_t *ptep)
>  {
> -	return __ptep_clear_flush_young(vma, addr, ptep);
> +	pte_t orig_pte = __ptep_get(ptep);
> +
> +	if (!pte_present(orig_pte) || !pte_cont(orig_pte))
> +		return __ptep_clear_flush_young(vma, addr, ptep);
> +
> +	return contpte_ptep_clear_flush_young(vma, addr, ptep);
>  }
>  
>  #define __HAVE_ARCH_PTEP_SET_WRPROTECT
>  static inline void ptep_set_wrprotect(struct mm_struct *mm,
>  				unsigned long addr, pte_t *ptep)
>  {
> +	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>  	__ptep_set_wrprotect(mm, addr, ptep);
> +	contpte_try_fold(mm, addr, ptep, __ptep_get(ptep));
>  }
>  
>  #define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
> @@ -1182,7 +1298,14 @@ static inline int ptep_set_access_flags(struct vm_area_struct *vma,
>  				unsigned long addr, pte_t *ptep,
>  				pte_t entry, int dirty)
>  {
> -	return __ptep_set_access_flags(vma, addr, ptep, entry, dirty);
> +	pte_t orig_pte = __ptep_get(ptep);
> +
> +	entry = pte_mknoncont(entry);

As in a few other places, it's not clear to me why the pte_mknoncont()
is needed. Here I expect 'entry' to be cont if *ptep is cont.

> +
> +	if (!pte_present(orig_pte) || !pte_cont(orig_pte))
> +		return __ptep_set_access_flags(vma, addr, ptep, entry, dirty);

Also wondering, can we have this check on 'entry' rather than
'orig_pte'? And maybe a warning if the cont bit differs between them.

> +
> +	return contpte_ptep_set_access_flags(vma, addr, ptep, entry, dirty);
>  }
>  
>  #endif /* !__ASSEMBLY__ */
> diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
> index dbd1bc95967d..70b6aba09b5d 100644
> --- a/arch/arm64/mm/Makefile
> +++ b/arch/arm64/mm/Makefile
> @@ -2,7 +2,8 @@
>  obj-y				:= dma-mapping.o extable.o fault.o init.o \
>  				   cache.o copypage.o flush.o \
>  				   ioremap.o mmap.o pgd.o mmu.o \
> -				   context.o proc.o pageattr.o fixmap.o
> +				   context.o proc.o pageattr.o fixmap.o \
> +				   contpte.o
>  obj-$(CONFIG_HUGETLB_PAGE)	+= hugetlbpage.o
>  obj-$(CONFIG_PTDUMP_CORE)	+= ptdump.o
>  obj-$(CONFIG_PTDUMP_DEBUGFS)	+= ptdump_debugfs.o
> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> new file mode 100644
> index 000000000000..e8e4a298fd53
> --- /dev/null
> +++ b/arch/arm64/mm/contpte.c
> @@ -0,0 +1,334 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (C) 2023 ARM Ltd.
> + */
> +
> +#include <linux/mm.h>
> +#include <asm/tlbflush.h>
> +
> +static void ptep_clear_flush_range(struct mm_struct *mm, unsigned long addr,
> +				pte_t *ptep, int nr)
> +{
> +	struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
> +	unsigned long start_addr = addr;
> +	int i;
> +
> +	for (i = 0; i < nr; i++, ptep++, addr += PAGE_SIZE)
> +		__pte_clear(mm, addr, ptep);
> +
> +	__flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
> +}
> +
> +static bool ptep_any_present(pte_t *ptep, int nr)

Valid?

> +{
> +	int i;
> +
> +	for (i = 0; i < nr; i++, ptep++) {
> +		if (pte_present(__ptep_get(ptep)))
> +			return true;
> +	}
> +
> +	return false;
> +}
> +
> +static void contpte_fold(struct mm_struct *mm, unsigned long addr,
> +			pte_t *ptep, pte_t pte, bool fold)
> +{
> +	struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
> +	unsigned long start_addr;
> +	pte_t *start_ptep;
> +	int i;
> +
> +	start_ptep = ptep = contpte_align_down(ptep);
> +	start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
> +	pte = pfn_pte(ALIGN_DOWN(pte_pfn(pte), CONT_PTES), pte_pgprot(pte));
> +	pte = fold ? pte_mkcont(pte) : pte_mknoncont(pte);
> +
> +	for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE) {
> +		pte_t ptent = __ptep_get_and_clear(mm, addr, ptep);
> +
> +		if (pte_dirty(ptent))
> +			pte = pte_mkdirty(pte);
> +
> +		if (pte_young(ptent))
> +			pte = pte_mkyoung(pte);
> +	}

I presume this can be unsafe if any of the ptes in the range differ, so
we need some higher level check. But that means we now have three loops
for folding, one to check, another to clear and the last one via
__set_ptes(). I guess we can't collapse the first two loops in a 'try'
function as we need to do the cleaning (and would have to re-instate the
old entries if they can't be made contiguous).

> +
> +	__flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
> +
> +	__set_ptes(mm, start_addr, start_ptep, pte, CONT_PTES);
> +}
> +
> +void __contpte_try_fold(struct mm_struct *mm, unsigned long addr,
> +			pte_t *ptep, pte_t pte)
> +{
> +	/*
> +	 * We have already checked that the virtual and pysical addresses are
> +	 * correctly aligned for a contig mapping in contpte_try_fold() so the
> +	 * remaining checks are to ensure that the contig range is fully covered
> +	 * by a single folio, and ensure that all the ptes are present with
> +	 * contiguous PFNs and matching prots.
> +	 */
> +
> +	struct page *page = pte_page(pte);
> +	struct folio *folio = page_folio(page);
> +	unsigned long folio_saddr = addr - (page - &folio->page) * PAGE_SIZE;
> +	unsigned long folio_eaddr = folio_saddr + folio_nr_pages(folio) * PAGE_SIZE;
> +	unsigned long cont_saddr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
> +	unsigned long cont_eaddr = cont_saddr + CONT_PTE_SIZE;
> +	unsigned long pfn;
> +	pgprot_t prot;
> +	pte_t subpte;
> +	pte_t *orig_ptep;
> +	int i;
> +
> +	if (folio_saddr > cont_saddr || folio_eaddr < cont_eaddr)
> +		return;
> +
> +	pfn = pte_pfn(pte) - ((addr - cont_saddr) >> PAGE_SHIFT);
> +	prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
> +	orig_ptep = ptep;
> +	ptep = contpte_align_down(ptep);
> +
> +	for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
> +		subpte = __ptep_get(ptep);
> +		subpte = pte_mkold(pte_mkclean(subpte));

IIUC, this function assumes ptes that only differ by the dirty status
can be contiguous. That's probably ok, with a chance of the dirty status
spreading to the adjacent ptes in the fold function. Maybe add a comment
on why this is ok (or why it doesn't happen).

> +
> +		if (!pte_present(subpte) ||
> +		    pte_pfn(subpte) != pfn ||
> +		    pgprot_val(pte_pgprot(subpte)) != pgprot_val(prot))
> +			return;
> +	}
> +
> +	contpte_fold(mm, addr, orig_ptep, pte, true);
> +}
> +
> +void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
> +			pte_t *ptep, pte_t pte)
> +{
> +	/*
> +	 * We have already checked that the ptes are contiguous in
> +	 * contpte_try_unfold(), so we can unfold unconditionally here.
> +	 */
> +
> +	contpte_fold(mm, addr, ptep, pte, false);
> +}

So the pte_mkyoung(), pte_mkdirty() calls in contpte_fold() are mostly
for the unfold case. Maybe it's clearer if we just have two separate
functions (or document why the pte_mk*() are needed).

> +
> +pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
> +{
> +	/*
> +	 * Gather access/dirty bits, which may be populated in any of the ptes
> +	 * of the contig range. We are guarranteed to be holding the PTL, so any
> +	 * contiguous range cannot be unfolded or otherwise modified under our
> +	 * feet.
> +	 */
> +
> +	pte_t pte;
> +	int i;
> +
> +	ptep = contpte_align_down(ptep);
> +
> +	for (i = 0; i < CONT_PTES; i++, ptep++) {
> +		pte = __ptep_get(ptep);
> +
> +		/*
> +		 * Deal with the partial contpte_ptep_get_and_clear_full() case,
> +		 * where some of the ptes in the range may be cleared but others
> +		 * are still to do. See contpte_ptep_get_and_clear_full().
> +		 */
> +		if (pte_val(pte) == 0)
> +			continue;
> +
> +		if (pte_dirty(pte))
> +			orig_pte = pte_mkdirty(orig_pte);
> +
> +		if (pte_young(pte))
> +			orig_pte = pte_mkyoung(orig_pte);
> +	}
> +
> +	return orig_pte;
> +}
> +
> +pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
> +{
> +	/*
> +	 * Gather access/dirty bits, which may be populated in any of the ptes
> +	 * of the contig range. We may not be holding the PTL, so any contiguous
> +	 * range may be unfolded/modified/refolded under our feet.
> +	 */
> +
> +	pte_t orig_pte;
> +	pgprot_t orig_prot;
> +	pte_t *ptep;
> +	unsigned long pfn;
> +	pte_t pte;
> +	pgprot_t prot;
> +	int i;
> +
> +retry:
> +	orig_pte = __ptep_get(orig_ptep);
> +
> +	if (!pte_present(orig_pte) || !pte_cont(orig_pte))
> +		return orig_pte;

I haven't looked through all the patches, so not entirely sure when this
function is called. But since you mention that the range may be
folded/unfolded, how do we deal with pte_cont() racing with something
setting the contig bit?

> +
> +	orig_prot = pte_pgprot(pte_mkold(pte_mkclean(orig_pte)));
> +	ptep = contpte_align_down(orig_ptep);
> +	pfn = pte_pfn(orig_pte) - (orig_ptep - ptep);
> +
> +	for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
> +		pte = __ptep_get(ptep);
> +		prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
> +
> +		if (!pte_present(pte) || !pte_cont(pte) ||
> +		   pte_pfn(pte) != pfn ||
> +		   pgprot_val(prot) != pgprot_val(orig_prot))
> +			goto retry;

It needs better documenting, I don't understand what the retry here is
for (presumably to handle some races). Do we care about some memory
ordering as well? __ptep_get() only takes care of reading the ptep once.

> +
> +		if (pte_dirty(pte))
> +			orig_pte = pte_mkdirty(orig_pte);
> +
> +		if (pte_young(pte))
> +			orig_pte = pte_mkyoung(orig_pte);
> +	}
> +
> +	return orig_pte;
> +}
> +
> +void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
> +					pte_t *ptep, pte_t pte, unsigned int nr)
> +{
> +	unsigned long next;
> +	unsigned long end = addr + (nr << PAGE_SHIFT);
> +	unsigned long pfn = pte_pfn(pte);
> +	pgprot_t prot = pte_pgprot(pte);
> +	pte_t orig_pte;
> +
> +	do {
> +		next = pte_cont_addr_end(addr, end);
> +		nr = (next - addr) >> PAGE_SHIFT;
> +		pte = pfn_pte(pfn, prot);
> +
> +		if (((addr | next | (pfn << PAGE_SHIFT)) & ~CONT_PTE_MASK) == 0)
> +			pte = pte_mkcont(pte);
> +		else
> +			pte = pte_mknoncont(pte);
> +
> +		/*
> +		 * If operating on a partial contiguous range then we must first
> +		 * unfold the contiguous range if it was previously folded.
> +		 * Otherwise we could end up with overlapping tlb entries.
> +		 */
> +		if (nr != CONT_PTES)
> +			contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
> +
> +		/*
> +		 * If we are replacing ptes that were contiguous or if the new
> +		 * ptes are contiguous and any of the ptes being replaced are
> +		 * present, we need to clear and flush the range to prevent
> +		 * overlapping tlb entries.
> +		 */
> +		orig_pte = __ptep_get(ptep);
> +		if ((pte_present(orig_pte) && pte_cont(orig_pte)) ||
> +		    (pte_cont(pte) && ptep_any_present(ptep, nr)))
> +			ptep_clear_flush_range(mm, addr, ptep, nr);
> +
> +		__set_ptes(mm, addr, ptep, pte, nr);
> +
> +		addr = next;
> +		ptep += nr;
> +		pfn += nr;
> +
> +	} while (addr != end);
> +}
> +
> +int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
> +					unsigned long addr, pte_t *ptep)
> +{
> +	/*
> +	 * ptep_clear_flush_young() technically requires us to clear the access
> +	 * flag for a _single_ pte. However, the core-mm code actually tracks
> +	 * access/dirty per folio, not per page. And since we only create a
> +	 * contig range when the range is covered by a single folio, we can get
> +	 * away with clearing young for the whole contig range here, so we avoid
> +	 * having to unfold.
> +	 */
> +
> +	int i;
> +	int young = 0;
> +
> +	ptep = contpte_align_down(ptep);
> +	addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
> +
> +	for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
> +		young |= __ptep_test_and_clear_young(vma, addr, ptep);
> +
> +	return young;
> +}
> +
> +int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
> +					unsigned long addr, pte_t *ptep)
> +{
> +	int young;
> +
> +	young = contpte_ptep_test_and_clear_young(vma, addr, ptep);
> +	addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
> +
> +	if (young) {
> +		/*
> +		 * See comment in __ptep_clear_flush_young(); same rationale for
> +		 * eliding the trailing DSB applies here.
> +		 */
> +		__flush_tlb_range_nosync(vma, addr, addr + CONT_PTE_SIZE,
> +					 PAGE_SIZE, true, 3);
> +	}
> +
> +	return young;
> +}
> +
> +int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
> +					unsigned long addr, pte_t *ptep,
> +					pte_t entry, int dirty)
> +{
> +	pte_t orig_pte;
> +	int i;
> +
> +	/*
> +	 * Gather the access/dirty bits for the contiguous range. If nothing has
> +	 * changed, its a noop.
> +	 */
> +	orig_pte = ptep_get(ptep);
> +	if (pte_val(orig_pte) == pte_val(entry))
> +		return 0;
> +
> +	/*
> +	 * We can fix up access/dirty bits without having to unfold/fold the
> +	 * contig range. But if the write bit is changing, we need to go through
> +	 * the full unfold/fold cycle.
> +	 */
> +	if (pte_write(orig_pte) == pte_write(entry)) {

Depending on the architecture version, pte_write() either checks a
software only bit or it checks the DBM one.

> +		/*
> +		 * No need to flush here; This is always "more permissive" so we
> +		 * can only be _adding_ the access or dirty bit. And since the
> +		 * tlb can't cache an entry without the AF set and the dirty bit
> +		 * is a SW bit, there can be no confusion. For HW access
> +		 * management, we technically only need to update the flag on a
> +		 * single pte in the range. But for SW access management, we
> +		 * need to update all the ptes to prevent extra faults.
> +		 */

On pre-DBM hardware, a PTE_RDONLY entry (writable from the kernel
perspective but clean) may be cached in the TLB and we do need flushing.

> +		ptep = contpte_align_down(ptep);
> +		addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
> +
> +		for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
> +			__ptep_set_access_flags(vma, addr, ptep, entry, 0);
> +	} else {
> +		/*
> +		 * No need to flush in __ptep_set_access_flags() because we just
> +		 * flushed the whole range in __contpte_try_unfold().
> +		 */
> +		__contpte_try_unfold(vma->vm_mm, addr, ptep, orig_pte);
> +		__ptep_set_access_flags(vma, addr, ptep, entry, 0);
> +		contpte_try_fold(vma->vm_mm, addr, ptep, entry);
> +	}
> +
> +	return 1;
> +}

-- 
Catalin


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v1 11/14] arm64/mm: Wire up PTE_CONT for user mappings
  2023-07-03 15:17     ` Catalin Marinas
@ 2023-07-04 11:09       ` Ryan Roberts
  -1 siblings, 0 replies; 46+ messages in thread
From: Ryan Roberts @ 2023-07-04 11:09 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Will Deacon, Ard Biesheuvel, Marc Zyngier, Oliver Upton,
	James Morse, Suzuki K Poulose, Zenghui Yu, Andrey Ryabinin,
	Alexander Potapenko, Andrey Konovalov, Dmitry Vyukov,
	Vincenzo Frascino, Andrew Morton, Anshuman Khandual,
	Matthew Wilcox, Yu Zhao, Mark Rutland, linux-arm-kernel,
	linux-kernel, linux-mm

On 03/07/2023 16:17, Catalin Marinas wrote:
> Hi Ryan,
> 
> Some comments below. I did not have time to trim down the quoted text,
> so you may need to scroll through it.

Thanks for the review!

Looking at the comments, I think they all relate to implementation. Does that
imply that you are happy with the shape/approach?

Talking with Anshuman yesterday, he suggested putting this behind a new Kconfig
option that defaults to disabled and also adding a command line option to
disable it when compiled in. I think that makes sense for now, at least to reduce
the risk of a performance regression.
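
For the command line side I was thinking of something simple along these lines
(just a sketch; the "contpte" parameter name and the default are made up, and
contpte_is_enabled() would additionally check the new Kconfig option):

	static bool contpte_enabled __ro_after_init = true;

	static int __init parse_contpte(char *arg)
	{
		if (!arg)
			return -EINVAL;
		return kstrtobool(arg, &contpte_enabled);
	}
	early_param("contpte", parse_contpte);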

> 
> On Thu, Jun 22, 2023 at 03:42:06PM +0100, Ryan Roberts wrote:
>> With the ptep API sufficiently refactored, we can now introduce a new
>> "contpte" API layer, which transparently manages the PTE_CONT bit for
>> user mappings. Whenever it detects a set of PTEs that meet the
>> requirements for a contiguous range, the PTEs are re-painted with the
>> PTE_CONT bit.
>>
>> This initial change provides a baseline that can be optimized in future
>> commits. That said, fold/unfold operations (which imply tlb
>> invalidation) are avoided where possible with a few tricks for
>> access/dirty bit management.
>>
>> Write-enable and write-protect modifications are likely non-optimal and
>> likely incure a regression in fork() performance. This will be addressed
>> separately.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>  arch/arm64/include/asm/pgtable.h | 137 ++++++++++++-
>>  arch/arm64/mm/Makefile           |   3 +-
>>  arch/arm64/mm/contpte.c          | 334 +++++++++++++++++++++++++++++++
>>  3 files changed, 466 insertions(+), 8 deletions(-)
>>  create mode 100644 arch/arm64/mm/contpte.c
>>
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index 31df4d73f9ac..17ea534bc5b0 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -1115,6 +1115,71 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
>>  				    unsigned long addr, pte_t *ptep,
>>  				    pte_t old_pte, pte_t new_pte);
>>  
>> +/*
>> + * The contpte APIs are used to transparently manage the contiguous bit in ptes
>> + * where it is possible and makes sense to do so. The PTE_CONT bit is considered
>> + * a private implementation detail of the public ptep API (see below).
>> + */
>> +extern void __contpte_try_fold(struct mm_struct *mm, unsigned long addr,
>> +				pte_t *ptep, pte_t pte);
>> +extern void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>> +				pte_t *ptep, pte_t pte);
>> +extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
>> +extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
>> +extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
>> +				pte_t *ptep, pte_t pte, unsigned int nr);
>> +extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>> +				unsigned long addr, pte_t *ptep);
>> +extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>> +				unsigned long addr, pte_t *ptep);
>> +extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
>> +				unsigned long addr, pte_t *ptep,
>> +				pte_t entry, int dirty);
>> +
>> +static inline pte_t *contpte_align_down(pte_t *ptep)
>> +{
>> +	return (pte_t *)(ALIGN_DOWN((unsigned long)ptep >> 3, CONT_PTES) << 3);
>> +}
>> +
>> +static inline bool contpte_is_enabled(struct mm_struct *mm)
>> +{
>> +	/*
>> +	 * Don't attempt to apply the contig bit to kernel mappings, because
>> +	 * dynamically adding/removing the contig bit can cause page faults.
>> +	 * These racing faults are ok for user space, since they get serialized
>> +	 * on the PTL. But kernel mappings can't tolerate faults.
>> +	 */
>> +
>> +	return mm != &init_mm;
>> +}
>> +
>> +static inline void contpte_try_fold(struct mm_struct *mm, unsigned long addr,
>> +					pte_t *ptep, pte_t pte)
>> +{
>> +	/*
>> +	 * Only bother trying if both the virtual and physical addresses are
>> +	 * aligned and correspond to the last entry in a contig range. The core
>> +	 * code mostly modifies ranges from low to high, so this is the likely
>> +	 * the last modification in the contig range, so a good time to fold.
>> +	 */
>> +
>> +	bool valign = ((unsigned long)ptep >> 3) % CONT_PTES == CONT_PTES - 1;
>> +	bool palign = pte_pfn(pte) % CONT_PTES == CONT_PTES - 1;
>> +
>> +	if (contpte_is_enabled(mm) &&
>> +	    pte_present(pte) && !pte_cont(pte) &&
>> +	    valign && palign)
>> +		__contpte_try_fold(mm, addr, ptep, pte);
> 
> I would use pte_valid() here instead. pte_present() also includes the
> PTE_PROT_NONE option which we don't really care about as it's not
> accessible.

Yep good point. I'll audit all of this and make the appropriate changes for v2.

> 
> I've been discussing with Alexandru Elisei about PTE_PROT_NONE and
> whether we can use other bits from the pte even if they clash with other
> valid permissions. Since the pte is not valid, in theory we could as
> long as nothing corrupts the (like a cont bit). The background to this
> is multiple migrate types (not just NUMA) for the MTE tag carveout
> reuse.

ACK.

> 
>> +}
>> +
>> +static inline void contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>> +					pte_t *ptep, pte_t pte)
>> +{
>> +	if (contpte_is_enabled(mm) &&
>> +	    pte_present(pte) && pte_cont(pte))
>> +		__contpte_try_unfold(mm, addr, ptep, pte);
>> +}
> 
> Same here and probably most other places where pte_present() is used in
> this patch.

ACK.

> 
>> +
>>  /*
>>   * The below functions constitute the public API that arm64 presents to the
>>   * core-mm to manipulate PTE entries within the their page tables (or at least
>> @@ -1122,30 +1187,68 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
>>   * versions will automatically and transparently apply the contiguous bit where
>>   * it makes sense to do so. Therefore any users that are contig-aware (e.g.
>>   * hugetlb, kernel mapper) should NOT use these APIs, but instead use the
>> - * private versions, which are prefixed with double underscore.
>> + * private versions, which are prefixed with double underscore. All of these
>> + * APIs except for ptep_get_lockless() are expected to be called with the PTL
>> + * held.
>>   */
>>  
>>  #define ptep_get ptep_get
>>  static inline pte_t ptep_get(pte_t *ptep)
>>  {
>> -	return __ptep_get(ptep);
>> +	pte_t pte = __ptep_get(ptep);
>> +
>> +	if (!pte_present(pte) || !pte_cont(pte))
>> +		return pte;
>> +
>> +	return contpte_ptep_get(ptep, pte);
>> +}
>> +
>> +#define ptep_get_lockless ptep_get_lockless
>> +static inline pte_t ptep_get_lockless(pte_t *ptep)
>> +{
>> +	pte_t pte = __ptep_get(ptep);
>> +
>> +	if (!pte_present(pte) || !pte_cont(pte))
>> +		return pte;
>> +
>> +	return contpte_ptep_get_lockless(ptep);
>>  }
>>  
>>  static inline void set_pte(pte_t *ptep, pte_t pte)
>>  {
>> -	__set_pte(ptep, pte);
>> +	/*
>> +	 * We don't have the mm or vaddr so cannot unfold or fold contig entries
>> +	 * (since it requires tlb maintenance). set_pte() is not used in core
>> +	 * code, so this should never even be called. Regardless do our best to
>> +	 * service any call and emit a warning if there is any attempt to set a
>> +	 * pte on top of an existing contig range.
>> +	 */
>> +	pte_t orig_pte = __ptep_get(ptep);
>> +
>> +	WARN_ON_ONCE(pte_present(orig_pte) && pte_cont(orig_pte));
>> +	__set_pte(ptep, pte_mknoncont(pte));
> 
> Why the pte_mknoncont() here? Do we expect a contiguous pte? The warning
> only checks the old entry.

Originally, my intent was that the PTE_CONT bit would be totally private to this
layer and that the bit should never leak to the generic code (i.e. ptep_get() would
clear it before returning the pte, and all functions that accept a pte would WARN
if the bit was set on entry).

However, this approach proved problematic for accounting; I have a separate
change that logs the amount of memory mapped as contpte in
/proc/<pid>/smaps[_rollup]. For this to work, the PTE_CONT bit must be leaked to
the generic code (ptep_get() no longer explicitly clears it). But if we
deliberately leak it, then it's possible that it will be set in functions that
take a pte, which would lead to incorrect behavior (potentially leading to a
contpte range that has some PTE_CONT bits set and others cleared). This happens
because there is generic code that follows a pattern like this:

  pte = ptep_get_and_clear(ptep)
  pte = modify_some_bits(pte)
  set_pte_at(pte)

To solve this, I'm explicitly clearing PTE_CONT from any pte that is passed in
to one of these functions.
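
To spell out the failure mode with the accounting change applied (hand-wavy,
not any real core-mm call site):

	pte = ptep_get(ptep);	/* PTE_CONT now leaks out of the arch layer */
	pte = pte_mkold(pte);	/* generic code tweaks some bits */
	/*
	 * Without pte_mknoncont() at the API boundary, this would write
	 * PTE_CONT back into a single entry, giving a block where the bit is
	 * set on some entries and clear on others.
	 */
	set_pte_at(mm, addr, ptep, pte);

Hence the pte_mknoncont() on entry to set_ptes(), ptep_set_access_flags(), etc.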

> 
>>  }
>>  
>>  #define set_ptes set_ptes
>>  static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
>>  				pte_t *ptep, pte_t pte, unsigned int nr)
>>  {
>> -	__set_ptes(mm, addr, ptep, pte, nr);
>> +	pte = pte_mknoncont(pte);
>> +
>> +	if (!contpte_is_enabled(mm))
>> +		__set_ptes(mm, addr, ptep, pte, nr);
>> +	else if (nr == 1) {
>> +		contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>> +		__set_ptes(mm, addr, ptep, pte, nr);
>> +		contpte_try_fold(mm, addr, ptep, pte);
>> +	} else
>> +		contpte_set_ptes(mm, addr, ptep, pte, nr);
>>  }
>>  
>>  static inline void pte_clear(struct mm_struct *mm,
>>  				unsigned long addr, pte_t *ptep)
>>  {
>> +	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>>  	__pte_clear(mm, addr, ptep);
>>  }
>>  
>> @@ -1153,6 +1256,7 @@ static inline void pte_clear(struct mm_struct *mm,
>>  static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>>  				unsigned long addr, pte_t *ptep)
>>  {
>> +	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>>  	return __ptep_get_and_clear(mm, addr, ptep);
>>  }
>>  
>> @@ -1160,21 +1264,33 @@ static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>>  static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>>  				unsigned long addr, pte_t *ptep)
>>  {
>> -	return __ptep_test_and_clear_young(vma, addr, ptep);
>> +	pte_t orig_pte = __ptep_get(ptep);
>> +
>> +	if (!pte_present(orig_pte) || !pte_cont(orig_pte))
>> +		return __ptep_test_and_clear_young(vma, addr, ptep);
> 
> Since I've seen this construct a few times, you may want to turn it into
> a specific check: pte_valid_cont().

ACK - will do for v2.

> 
>> +
>> +	return contpte_ptep_test_and_clear_young(vma, addr, ptep);
>>  }
>>  
>>  #define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
>>  static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>>  				unsigned long addr, pte_t *ptep)
>>  {
>> -	return __ptep_clear_flush_young(vma, addr, ptep);
>> +	pte_t orig_pte = __ptep_get(ptep);
>> +
>> +	if (!pte_present(orig_pte) || !pte_cont(orig_pte))
>> +		return __ptep_clear_flush_young(vma, addr, ptep);
>> +
>> +	return contpte_ptep_clear_flush_young(vma, addr, ptep);
>>  }
>>  
>>  #define __HAVE_ARCH_PTEP_SET_WRPROTECT
>>  static inline void ptep_set_wrprotect(struct mm_struct *mm,
>>  				unsigned long addr, pte_t *ptep)
>>  {
>> +	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>>  	__ptep_set_wrprotect(mm, addr, ptep);
>> +	contpte_try_fold(mm, addr, ptep, __ptep_get(ptep));
>>  }
>>  
>>  #define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
>> @@ -1182,7 +1298,14 @@ static inline int ptep_set_access_flags(struct vm_area_struct *vma,
>>  				unsigned long addr, pte_t *ptep,
>>  				pte_t entry, int dirty)
>>  {
>> -	return __ptep_set_access_flags(vma, addr, ptep, entry, dirty);
>> +	pte_t orig_pte = __ptep_get(ptep);
>> +
>> +	entry = pte_mknoncont(entry);
> 
> As in a few other places, it's not clear to me why the pte_mknoncont()
> is needed. Here I expect 'entry' to be cont if *ptep is cont.

See explanation above.

> 
>> +
>> +	if (!pte_present(orig_pte) || !pte_cont(orig_pte))
>> +		return __ptep_set_access_flags(vma, addr, ptep, entry, dirty);
> 
> Also wondering, can we have this check on 'entry' rather than
> 'orig_pte'? And maybe a warning if the cont bit differs between them.

No - the idea is that this API layer has exclusive control over whether
PTE_CONT is set in the pgtable. Upper layers should never pass a pte with
PTE_CONT set (except for the corner case described above, which we deal with by
explicitly clearing PTE_CONT from the passed in pte).

So the check must be on orig_pte - we are checking if a contpte range is present
over the pte we are about to modify. If it is, then we need to handle it
carefully (potentially by unfolding it first; handled by
contpte_ptep_set_access_flags()). If there is no contpte range, then we can
just handle it the "normal" way.

> 
>> +
>> +	return contpte_ptep_set_access_flags(vma, addr, ptep, entry, dirty);
>>  }
>>  
>>  #endif /* !__ASSEMBLY__ */
>> diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
>> index dbd1bc95967d..70b6aba09b5d 100644
>> --- a/arch/arm64/mm/Makefile
>> +++ b/arch/arm64/mm/Makefile
>> @@ -2,7 +2,8 @@
>>  obj-y				:= dma-mapping.o extable.o fault.o init.o \
>>  				   cache.o copypage.o flush.o \
>>  				   ioremap.o mmap.o pgd.o mmu.o \
>> -				   context.o proc.o pageattr.o fixmap.o
>> +				   context.o proc.o pageattr.o fixmap.o \
>> +				   contpte.o
>>  obj-$(CONFIG_HUGETLB_PAGE)	+= hugetlbpage.o
>>  obj-$(CONFIG_PTDUMP_CORE)	+= ptdump.o
>>  obj-$(CONFIG_PTDUMP_DEBUGFS)	+= ptdump_debugfs.o
>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>> new file mode 100644
>> index 000000000000..e8e4a298fd53
>> --- /dev/null
>> +++ b/arch/arm64/mm/contpte.c
>> @@ -0,0 +1,334 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +/*
>> + * Copyright (C) 2023 ARM Ltd.
>> + */
>> +
>> +#include <linux/mm.h>
>> +#include <asm/tlbflush.h>
>> +
>> +static void ptep_clear_flush_range(struct mm_struct *mm, unsigned long addr,
>> +				pte_t *ptep, int nr)
>> +{
>> +	struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
>> +	unsigned long start_addr = addr;
>> +	int i;
>> +
>> +	for (i = 0; i < nr; i++, ptep++, addr += PAGE_SIZE)
>> +		__pte_clear(mm, addr, ptep);
>> +
>> +	__flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
>> +}
>> +
>> +static bool ptep_any_present(pte_t *ptep, int nr)
> 
> Valid?

ACK

> 
>> +{
>> +	int i;
>> +
>> +	for (i = 0; i < nr; i++, ptep++) {
>> +		if (pte_present(__ptep_get(ptep)))
>> +			return true;
>> +	}
>> +
>> +	return false;
>> +}
>> +
>> +static void contpte_fold(struct mm_struct *mm, unsigned long addr,
>> +			pte_t *ptep, pte_t pte, bool fold)
>> +{
>> +	struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
>> +	unsigned long start_addr;
>> +	pte_t *start_ptep;
>> +	int i;
>> +
>> +	start_ptep = ptep = contpte_align_down(ptep);
>> +	start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>> +	pte = pfn_pte(ALIGN_DOWN(pte_pfn(pte), CONT_PTES), pte_pgprot(pte));
>> +	pte = fold ? pte_mkcont(pte) : pte_mknoncont(pte);
>> +
>> +	for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE) {
>> +		pte_t ptent = __ptep_get_and_clear(mm, addr, ptep);
>> +
>> +		if (pte_dirty(ptent))
>> +			pte = pte_mkdirty(pte);
>> +
>> +		if (pte_young(ptent))
>> +			pte = pte_mkyoung(pte);
>> +	}
> 
> I presume this can be unsafe if any of the ptes in the range differ, so
> we need some higher level check. 

Sorry I'm not quite sure what you mean here? The higher level check is where we
look at the current value of the target PTE; if PTE_CONT is set then we know it
is part of a contpte range. We are careful that PTE_CONT is set consistently for
all (valid) PTEs in a contpte range, so we only need to check 1 entry. There is
no risk of racing here because we are always serialized by the PTL.

> But that means we now have three loops
> for folding, one to check, another to clear and the last one via
> __set_ptes(). I guess we can't collapse the first two loops in a 'try'
> function as we need to do the cleaning (and would have to re-instate the
> old entries if they can't be made contiguous).

Yes, 3 loops, and I don't see how you would reduce that. The good news is that
this folding path should be rarely taken; most qualifying ranges will be set via
set_ptes() so they are written "pre-folded". We only attempt to fold after
setting the pte at the _end_ of the range (see contpte_try_fold()), and the
checker loop in __contpte_try_fold() will usually exit on the second iteration
if the memory is not physically contiguous.

> 
>> +
>> +	__flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
>> +
>> +	__set_ptes(mm, start_addr, start_ptep, pte, CONT_PTES);
>> +}
>> +
>> +void __contpte_try_fold(struct mm_struct *mm, unsigned long addr,
>> +			pte_t *ptep, pte_t pte)
>> +{
>> +	/*
>> +	 * We have already checked that the virtual and pysical addresses are
>> +	 * correctly aligned for a contig mapping in contpte_try_fold() so the
>> +	 * remaining checks are to ensure that the contig range is fully covered
>> +	 * by a single folio, and ensure that all the ptes are present with
>> +	 * contiguous PFNs and matching prots.
>> +	 */
>> +
>> +	struct page *page = pte_page(pte);
>> +	struct folio *folio = page_folio(page);
>> +	unsigned long folio_saddr = addr - (page - &folio->page) * PAGE_SIZE;
>> +	unsigned long folio_eaddr = folio_saddr + folio_nr_pages(folio) * PAGE_SIZE;
>> +	unsigned long cont_saddr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>> +	unsigned long cont_eaddr = cont_saddr + CONT_PTE_SIZE;
>> +	unsigned long pfn;
>> +	pgprot_t prot;
>> +	pte_t subpte;
>> +	pte_t *orig_ptep;
>> +	int i;
>> +
>> +	if (folio_saddr > cont_saddr || folio_eaddr < cont_eaddr)
>> +		return;
>> +
>> +	pfn = pte_pfn(pte) - ((addr - cont_saddr) >> PAGE_SHIFT);
>> +	prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
>> +	orig_ptep = ptep;
>> +	ptep = contpte_align_down(ptep);
>> +
>> +	for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
>> +		subpte = __ptep_get(ptep);
>> +		subpte = pte_mkold(pte_mkclean(subpte));
> 
> IIUC, this function assumes ptes that only differ by the dirty status
> can be contiguous. That's probably ok, with a chance of the dirty status
> spreading to the adjacent ptes in the fold function. Maybe add a comment
> on why this is ok (or why it doesn't happen).

Conceptually a contpte range only has a single access and dirty bit. So when
folding, we OR all the access bits and all the dirty bits from the constituent
ptes to determine the single access and dirty bits for the contpte mapping. And
when unfolding, we take the single access and dirty bit for the contpte mapping
and apply those values to every individual entry.

So yes, we ignore the access and dirty values for the subptes when evaluating
whether a contiguous range exists. I'll add a comment.
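
Probably something along these lines above the checker loop (exact wording TBD):

	/*
	 * A contpte block logically has a single access bit and a single dirty
	 * bit, so ignore the per-pte values when deciding whether the block can
	 * be folded; folding ORs them into the block and unfolding propagates
	 * them back out to every constituent pte.
	 */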

> 
>> +
>> +		if (!pte_present(subpte) ||
>> +		    pte_pfn(subpte) != pfn ||
>> +		    pgprot_val(pte_pgprot(subpte)) != pgprot_val(prot))
>> +			return;
>> +	}
>> +
>> +	contpte_fold(mm, addr, orig_ptep, pte, true);
>> +}
>> +
>> +void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>> +			pte_t *ptep, pte_t pte)
>> +{
>> +	/*
>> +	 * We have already checked that the ptes are contiguous in
>> +	 * contpte_try_unfold(), so we can unfold unconditionally here.
>> +	 */
>> +
>> +	contpte_fold(mm, addr, ptep, pte, false);
>> +}
> 
> So the pte_mkyoung(), pte_mkdirty() calls in contpte_fold() are mostly
> for the unfold case. Maybe it's clearer if we just have two separate
> functions (or document why the pte_mk*() are needed).

No, that's not the case. In the unfold case, we need to "collect" the single
access and dirty bit from the contpte mapping (these may be in any of the
entries), and set the final values for all individual ptes during unfolding.

The obvious side effect here is that if any one page is dirty at fold time, the
whole range will be marked as dirty after folding, then at unfolding all pages
will be marked as dirty (same goes for access). This is the same concern that I
raised in the cover letter. I don't think this is a problem from the kernel's
point of view; the kernel will compress the per-page access/dirty info to
per-folio and we only fold if the whole range is covered by a single folio. But
user space could observe this "over-dirtying" through /proc/<pid>/pagemap. I'm
not sure if that's a problem in practice?
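
To make it concrete for 4K base pages (CONT_PTES == 16):

  - before folding:  only 1 of the 16 ptes is dirty
  - after folding:   the 64K block carries a single dirty bit
  - after unfolding: all 16 ptes read back as dirty through pagemap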

> 
>> +
>> +pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>> +{
>> +	/*
>> +	 * Gather access/dirty bits, which may be populated in any of the ptes
>> +	 * of the contig range. We are guarranteed to be holding the PTL, so any
>> +	 * contiguous range cannot be unfolded or otherwise modified under our
>> +	 * feet.
>> +	 */
>> +
>> +	pte_t pte;
>> +	int i;
>> +
>> +	ptep = contpte_align_down(ptep);
>> +
>> +	for (i = 0; i < CONT_PTES; i++, ptep++) {
>> +		pte = __ptep_get(ptep);
>> +
>> +		/*
>> +		 * Deal with the partial contpte_ptep_get_and_clear_full() case,
>> +		 * where some of the ptes in the range may be cleared but others
>> +		 * are still to do. See contpte_ptep_get_and_clear_full().
>> +		 */
>> +		if (pte_val(pte) == 0)
>> +			continue;
>> +
>> +		if (pte_dirty(pte))
>> +			orig_pte = pte_mkdirty(orig_pte);
>> +
>> +		if (pte_young(pte))
>> +			orig_pte = pte_mkyoung(orig_pte);
>> +	}
>> +
>> +	return orig_pte;
>> +}
>> +
>> +pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
>> +{
>> +	/*
>> +	 * Gather access/dirty bits, which may be populated in any of the ptes
>> +	 * of the contig range. We may not be holding the PTL, so any contiguous
>> +	 * range may be unfolded/modified/refolded under our feet.
>> +	 */
>> +
>> +	pte_t orig_pte;
>> +	pgprot_t orig_prot;
>> +	pte_t *ptep;
>> +	unsigned long pfn;
>> +	pte_t pte;
>> +	pgprot_t prot;
>> +	int i;
>> +
>> +retry:
>> +	orig_pte = __ptep_get(orig_ptep);
>> +
>> +	if (!pte_present(orig_pte) || !pte_cont(orig_pte))
>> +		return orig_pte;
> 
> I haven't looked through all the patches, so not entirely sure when this
> function is called. 

ptep_get_lockless() is one of the mm optional arch interfaces. arm64 doesn't
currently implement it, because ptep_get() (READ_ONCE()) is safe without the
lock being held. But with the introduction of contpte mappings, there are cases
now where we have to read the whole contpte range to gather access and dirty,
which obviously isn't atomic. And doing that without the PTL is harder than when
we have the PTL, so I've implemented ptep_get_lockless() so we can assume the
PTL is held in ptep_get() and do the simple thing there (the common case).
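
For context, the consistency requirement I have in mind is the usual lockless
"read, use, re-check" pattern, roughly (illustrative only, not lifted from any
specific call site):

	pte_t pte = ptep_get_lockless(ptep);

	if (pte_present(pte)) {
		/* ... speculative work on pte ... */
		if (pte_val(pte) != pte_val(ptep_get(ptep))) {
			/* raced with an update; discard the work and retry */
		}
	}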

> But since you mention that the range may be
> folded/unfolded, how do we deal with pte_cont() racing with something
> setting the contig bit?

ptep_get_lockless() is inherently racy. My intention was that we just need to
ensure we read a pte or contpte range that is consistent with itself.

> 
>> +
>> +	orig_prot = pte_pgprot(pte_mkold(pte_mkclean(orig_pte)));
>> +	ptep = contpte_align_down(orig_ptep);
>> +	pfn = pte_pfn(orig_pte) - (orig_ptep - ptep);
>> +
>> +	for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
>> +		pte = __ptep_get(ptep);
>> +		prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
>> +
>> +		if (!pte_present(pte) || !pte_cont(pte) ||
>> +		   pte_pfn(pte) != pfn ||
>> +		   pgprot_val(prot) != pgprot_val(orig_prot))
>> +			goto retry;
> 
> It needs better documenting, I don't understand what the retry here is
> for (presumably to handle some races). Do we care about some memory
> ordering as well? __pte_get() only takes care of reading the ptep once.

The intention is that the loop keeps retrying until it scans a whole contpte
range that is consistent with itself (i.e. the PTE_CONT bit is set in all
entries, the pfn increments monotonically and the pgprots are all the same).
If any of those conditions are not true, it indicates we are racing with an
update and need to retry until it's consistent. I'd need to think a bit more
on whether we need anything special for memory ordering...

To be honest, I'm not a big fan of this function. As far as I can tell, the only
user of ptep_get_lockless() that cares about access/dirty is ptdump. Perhaps we
can re-spec this to not return access/dirty info (that would simplify it back to
a READ_ONCE()), then figure out a way to hold the PTL for ptdump and use
ptep_get() which will return the access/dirty info correctly. Do you think
something like that could work?
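
For illustration, that re-spec would let the arm64 lockless variant collapse
back to a single read; something like this (sketch only, assuming no caller
still needs accurate access/dirty from the lockless path):

	#define ptep_get_lockless ptep_get_lockless
	static inline pte_t ptep_get_lockless(pte_t *ptep)
	{
		/*
		 * Racy single read; under this re-spec, access/dirty may be
		 * stale or incomplete for contpte ranges and callers must not
		 * rely on them.
		 */
		return __ptep_get(ptep);
	}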

> 
>> +
>> +		if (pte_dirty(pte))
>> +			orig_pte = pte_mkdirty(orig_pte);
>> +
>> +		if (pte_young(pte))
>> +			orig_pte = pte_mkyoung(orig_pte);
>> +	}
>> +
>> +	return orig_pte;
>> +}
>> +
>> +void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
>> +					pte_t *ptep, pte_t pte, unsigned int nr)
>> +{
>> +	unsigned long next;
>> +	unsigned long end = addr + (nr << PAGE_SHIFT);
>> +	unsigned long pfn = pte_pfn(pte);
>> +	pgprot_t prot = pte_pgprot(pte);
>> +	pte_t orig_pte;
>> +
>> +	do {
>> +		next = pte_cont_addr_end(addr, end);
>> +		nr = (next - addr) >> PAGE_SHIFT;
>> +		pte = pfn_pte(pfn, prot);
>> +
>> +		if (((addr | next | (pfn << PAGE_SHIFT)) & ~CONT_PTE_MASK) == 0)
>> +			pte = pte_mkcont(pte);
>> +		else
>> +			pte = pte_mknoncont(pte);
>> +
>> +		/*
>> +		 * If operating on a partial contiguous range then we must first
>> +		 * unfold the contiguous range if it was previously folded.
>> +		 * Otherwise we could end up with overlapping tlb entries.
>> +		 */
>> +		if (nr != CONT_PTES)
>> +			contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>> +
>> +		/*
>> +		 * If we are replacing ptes that were contiguous or if the new
>> +		 * ptes are contiguous and any of the ptes being replaced are
>> +		 * present, we need to clear and flush the range to prevent
>> +		 * overlapping tlb entries.
>> +		 */
>> +		orig_pte = __ptep_get(ptep);
>> +		if ((pte_present(orig_pte) && pte_cont(orig_pte)) ||
>> +		    (pte_cont(pte) && ptep_any_present(ptep, nr)))
>> +			ptep_clear_flush_range(mm, addr, ptep, nr);
>> +
>> +		__set_ptes(mm, addr, ptep, pte, nr);
>> +
>> +		addr = next;
>> +		ptep += nr;
>> +		pfn += nr;
>> +
>> +	} while (addr != end);
>> +}
>> +
>> +int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>> +					unsigned long addr, pte_t *ptep)
>> +{
>> +	/*
>> +	 * ptep_clear_flush_young() technically requires us to clear the access
>> +	 * flag for a _single_ pte. However, the core-mm code actually tracks
>> +	 * access/dirty per folio, not per page. And since we only create a
>> +	 * contig range when the range is covered by a single folio, we can get
>> +	 * away with clearing young for the whole contig range here, so we avoid
>> +	 * having to unfold.
>> +	 */
>> +
>> +	int i;
>> +	int young = 0;
>> +
>> +	ptep = contpte_align_down(ptep);
>> +	addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>> +
>> +	for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
>> +		young |= __ptep_test_and_clear_young(vma, addr, ptep);
>> +
>> +	return young;
>> +}
>> +
>> +int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>> +					unsigned long addr, pte_t *ptep)
>> +{
>> +	int young;
>> +
>> +	young = contpte_ptep_test_and_clear_young(vma, addr, ptep);
>> +	addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>> +
>> +	if (young) {
>> +		/*
>> +		 * See comment in __ptep_clear_flush_young(); same rationale for
>> +		 * eliding the trailing DSB applies here.
>> +		 */
>> +		__flush_tlb_range_nosync(vma, addr, addr + CONT_PTE_SIZE,
>> +					 PAGE_SIZE, true, 3);
>> +	}
>> +
>> +	return young;
>> +}
>> +
>> +int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
>> +					unsigned long addr, pte_t *ptep,
>> +					pte_t entry, int dirty)
>> +{
>> +	pte_t orig_pte;
>> +	int i;
>> +
>> +	/*
>> +	 * Gather the access/dirty bits for the contiguous range. If nothing has
>> +	 * changed, it's a noop.
>> +	 */
>> +	orig_pte = ptep_get(ptep);
>> +	if (pte_val(orig_pte) == pte_val(entry))
>> +		return 0;
>> +
>> +	/*
>> +	 * We can fix up access/dirty bits without having to unfold/fold the
>> +	 * contig range. But if the write bit is changing, we need to go through
>> +	 * the full unfold/fold cycle.
>> +	 */
>> +	if (pte_write(orig_pte) == pte_write(entry)) {
> 
> Depending on the architecture version, pte_write() either checks a
> software only bit or it checks the DBM one.
> 
>> +		/*
>> +		 * No need to flush here; This is always "more permissive" so we
>> +		 * can only be _adding_ the access or dirty bit. And since the
>> +		 * tlb can't cache an entry without the AF set and the dirty bit
>> +		 * is a SW bit, there can be no confusion. For HW access
>> +		 * management, we technically only need to update the flag on a
>> +		 * single pte in the range. But for SW access management, we
>> +		 * need to update all the ptes to prevent extra faults.
>> +		 */
> 
> On pre-DBM hardware, a PTE_RDONLY entry (writable from the kernel
> perspective but clean) may be cached in the TLB and we do need flushing.

I don't follow; The Arm ARM says:

  IPNQBP When an Access flag fault is generated, the translation table entry
         causing the fault is not cached in a TLB.

So the entry can only be in the TLB if AF is already 1. And given the dirty bit
is SW, it shouldn't affect the TLB state. And this function promises to only
change the bits so they are more permissive (so AF=0 -> AF=1, D=0 -> D=1).

So I'm not sure what case you are describing here?

> 
>> +		ptep = contpte_align_down(ptep);
>> +		addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>> +
>> +		for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
>> +			__ptep_set_access_flags(vma, addr, ptep, entry, 0);
>> +	} else {
>> +		/*
>> +		 * No need to flush in __ptep_set_access_flags() because we just
>> +		 * flushed the whole range in __contpte_try_unfold().
>> +		 */
>> +		__contpte_try_unfold(vma->vm_mm, addr, ptep, orig_pte);
>> +		__ptep_set_access_flags(vma, addr, ptep, entry, 0);
>> +		contpte_try_fold(vma->vm_mm, addr, ptep, entry);
>> +	}
>> +
>> +	return 1;
>> +}
> 

Thanks,
Ryan


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v1 11/14] arm64/mm: Wire up PTE_CONT for user mappings
  2023-07-04 11:09       ` Ryan Roberts
@ 2023-07-05 13:13         ` Ryan Roberts
  -1 siblings, 0 replies; 46+ messages in thread
From: Ryan Roberts @ 2023-07-05 13:13 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Will Deacon, Ard Biesheuvel, Marc Zyngier, Oliver Upton,
	James Morse, Suzuki K Poulose, Zenghui Yu, Andrey Ryabinin,
	Alexander Potapenko, Andrey Konovalov, Dmitry Vyukov,
	Vincenzo Frascino, Andrew Morton, Anshuman Khandual,
	Matthew Wilcox, Yu Zhao, Mark Rutland, linux-arm-kernel,
	linux-kernel, linux-mm

On 04/07/2023 12:09, Ryan Roberts wrote:
> On 03/07/2023 16:17, Catalin Marinas wrote:
>> Hi Ryan,
>>

...

>>> +
>>> +int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
>>> +					unsigned long addr, pte_t *ptep,
>>> +					pte_t entry, int dirty)
>>> +{
>>> +	pte_t orig_pte;
>>> +	int i;
>>> +
>>> +	/*
>>> +	 * Gather the access/dirty bits for the contiguous range. If nothing has
>>> +	 * changed, its a noop.
>>> +	 */
>>> +	orig_pte = ptep_get(ptep);
>>> +	if (pte_val(orig_pte) == pte_val(entry))
>>> +		return 0;
>>> +
>>> +	/*
>>> +	 * We can fix up access/dirty bits without having to unfold/fold the
>>> +	 * contig range. But if the write bit is changing, we need to go through
>>> +	 * the full unfold/fold cycle.
>>> +	 */
>>> +	if (pte_write(orig_pte) == pte_write(entry)) {
>>
>> Depending on the architecture version, pte_write() either checks a
>> software only bit or it checks the DBM one.
>>
>>> +		/*
>>> +		 * No need to flush here; This is always "more permissive" so we
>>> +		 * can only be _adding_ the access or dirty bit. And since the
>>> +		 * tlb can't cache an entry without the AF set and the dirty bit
>>> +		 * is a SW bit, there can be no confusion. For HW access
>>> +		 * management, we technically only need to update the flag on a
>>> +		 * single pte in the range. But for SW access management, we
>>> +		 * need to update all the ptes to prevent extra faults.
>>> +		 */
>>
>> On pre-DBM hardware, a PTE_RDONLY entry (writable from the kernel
>> perspective but clean) may be cached in the TLB and we do need flushing.
> 
> I don't follow; The Arm ARM says:
> 
>   IPNQBP When an Access flag fault is generated, the translation table entry
>          causing the fault is not cached in a TLB.
> 
> So the entry can only be in the TLB if AF is already 1. And given the dirty bit
> is SW, it shouldn't affect the TLB state. And this function promises to only
> change the bits so they are more permissive (so AF=0 -> AF=1, D=0 -> D=1).
> 
> So I'm not sure what case you are describing here?

Ahh sorry, I get your point now - on pre-DBM hardware, the HW sees a read-only
PTE when the kernel considers it clean and this can be in the TLB. Then when
making it dirty (from kernel's perspective), we are removing the read-only
protection from the HW perspective, so we need to flush the TLB entry.

> 
>>
>>> +		ptep = contpte_align_down(ptep);
>>> +		addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>>> +
>>> +		for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
>>> +			__ptep_set_access_flags(vma, addr, ptep, entry, 0);

Fixed by adding this after iterating through the ptes; the intent is to avoid
the per-page tlb flush and instead flush the whole range at the end:

		if (dirty)
			__flush_tlb_range(vma, start_addr, addr,
							PAGE_SIZE, true, 3);
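
So the whole access/dirty-only branch would end up looking roughly like this
(sketch, not the final code; start_addr is a new local captured before the
loop):

	ptep = contpte_align_down(ptep);
	addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
	start_addr = addr;

	for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
		__ptep_set_access_flags(vma, addr, ptep, entry, 0);

	if (dirty)
		__flush_tlb_range(vma, start_addr, addr, PAGE_SIZE, true, 3);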

>>> +	} else {
>>> +		/*
>>> +		 * No need to flush in __ptep_set_access_flags() because we just
>>> +		 * flushed the whole range in __contpte_try_unfold().
>>> +		 */
>>> +		__contpte_try_unfold(vma->vm_mm, addr, ptep, orig_pte);
>>> +		__ptep_set_access_flags(vma, addr, ptep, entry, 0);

I also think this is wrong; we must pass `dirty` as the last parameter so that
__ptep_set_access_flags will flush if necessary. My comment about having just
done the flush is incorrect - we have just done a flush, but the ptes are still
valid with their old value so the HW could pull this into the TLB before we
modify the value.
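
I.e. the second call in the else branch becomes (sketch):

	__contpte_try_unfold(vma->vm_mm, addr, ptep, orig_pte);
	__ptep_set_access_flags(vma, addr, ptep, entry, dirty);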

>>> +		contpte_try_fold(vma->vm_mm, addr, ptep, entry);
>>> +	}
>>> +
>>> +	return 1;
>>> +}
>>
> 
> Thanks,
> Ryan
> 


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v1 00/14] Transparent Contiguous PTEs for User Mappings
  2023-06-22 14:41 ` Ryan Roberts
@ 2023-07-10 12:05   ` Barry Song
  -1 siblings, 0 replies; 46+ messages in thread
From: Barry Song @ 2023-07-10 12:05 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	Oliver Upton, James Morse, Suzuki K Poulose, Zenghui Yu,
	Andrey Ryabinin, Alexander Potapenko, Andrey Konovalov,
	Dmitry Vyukov, Vincenzo Frascino, Andrew Morton,
	Anshuman Khandual, Matthew Wilcox, Yu Zhao, Mark Rutland,
	linux-arm-kernel, linux-kernel, linux-mm

On Thu, Jun 22, 2023 at 11:00 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> Hi All,
>
> This is a series to opportunistically and transparently use contpte mappings
> (set the contiguous bit in ptes) for user memory when those mappings meet the
> requirements. It is part of a wider effort to improve performance of the 4K
> kernel with the aim of approaching the performance of the 16K kernel, but
> without breaking compatibility and without the associated increase in memory. It
> also benefits the 16K and 64K kernels by enabling 2M THP, since this is the
> contpte size for those kernels.
>
> Of course this is only one half of the change. We require the mapped physical
> memory to be the correct size and alignment for this to actually be useful (i.e.
> 64K for 4K pages, or 2M for 16K/64K pages). Fortunately folios are solving this
> problem for us. Filesystems that support it (XFS, AFS, EROFS, tmpfs) will
> allocate large folios up to the PMD size today, and more filesystems are coming.
> And the other half of my work, to enable the use of large folios for anonymous
> memory, aims to make contpte sized folios prevalent for anonymous memory too.
>
>
> Dependencies
> ------------
>
> While there is a complicated set of hard and soft dependencies that this patch
> set depends on, I wanted to split it out as best I could and kick off proper
> review independently.
>
> The series applies on top of these other patch sets, with a tree at:
> https://gitlab.arm.com/linux-arm/linux-rr/-/tree/features/granule_perf/contpte-lkml_v1
>
> v6.4-rc6
>   - base
>
> set_ptes()
>   - hard dependency
>   - Patch set from Matthew Wilcox to set multiple ptes with a single API call
>   - Allows arch backend to more optimally apply contpte mappings
>   - https://lore.kernel.org/linux-mm/20230315051444.3229621-1-willy@infradead.org/
>
> ptep_get() pte encapsulation
>   - hard dependency
>   - Enabler series from me to ensure none of the core code ever directly
>     dereferences a pte_t that lies within a live page table.
>   - Enables gathering access/dirty bits from across the whole contpte range
>   - in mm-stable and linux-next at time of writing
>   - https://lore.kernel.org/linux-mm/d38dc237-6093-d4c5-993e-e8ffdd6cb6fa@arm.com/
>
> Report on physically contiguous memory in smaps
>   - soft dependency
>   - Enables visibility on how much memory is physically contiguous and how much
>     is contpte-mapped - useful for debug
>   - https://lore.kernel.org/linux-mm/20230613160950.3554675-1-ryan.roberts@arm.com/
>
> Additionally there are a couple of other dependencies:
>
> anonfolio
>   - soft dependency
>   - ensures more anonymous memory is allocated in contpte-sized folios, so
>     needed to realize the performance improvements (this is the "other half"
>     mentioned above).
>   - RFC: https://lore.kernel.org/linux-mm/20230414130303.2345383-1-ryan.roberts@arm.com/
>   - Intending to post v1 shortly.
>
> exefolio
>   - soft dependency
>   - Tweak readahead to ensure executable memory are in 64K-sized folios, so
>     needed to see reduction in iTLB pressure.
>   - Don't intend to post this until we are further down the track with contpte
>     and anonfolio.
>
> Arm ARM Clarification
>   - hard dependency
>   - Current wording disallows the fork() optimization in the final patch.
>   - Arm (ATG) have proposed tightening the wording to permit it.
>   - In conversation with partners to check this wouldn't cause problems for any
>     existing HW deployments
>
> All of the _hard_ dependencies need to be resolved before this can be considered
> for merging.
>
>
> Performance
> -----------
>
> Below results show 2 benchmarks; kernel compilation and speedometer 2.0 (a
> javascript benchmark running in Chromium). Both cases are running on Ampere
> Altra with 1 NUMA node enabled, Ubuntu 22.04 and XFS filesystem. Each benchmark
> is repeated 15 times over 5 reboots and averaged.
>
> All improvements are relative to baseline-4k. anonfolio and exefolio are as
> described above. contpte is this series. (Note that exefolio only gives an
> improvement because contpte is already in place).
>
> Kernel Compilation (smaller is better):
>
> | kernel       |   real-time |   kern-time |   user-time |
> |:-------------|------------:|------------:|------------:|
> | baseline-4k  |        0.0% |        0.0% |        0.0% |
> | anonfolio    |       -5.4% |      -46.0% |       -0.3% |
> | contpte      |       -6.8% |      -45.7% |       -2.1% |
> | exefolio     |       -8.4% |      -46.4% |       -3.7% |

Sorry, I am a bit confused. In the exefolio case, is anonfolio included,
or does it only have large cont-pte folios for exe code? In other words,
does the 8.4% improvement come from iTLB miss reduction only,
or from both dTLB and iTLB miss reduction?

> | baseline-16k |       -8.7% |      -49.2% |       -3.7% |
> | baseline-64k |      -10.5% |      -66.0% |       -3.5% |
>
> Speedometer 2.0 (bigger is better):
>
> | kernel       |   runs_per_min |
> |:-------------|---------------:|
> | baseline-4k  |           0.0% |
> | anonfolio    |           1.2% |
> | contpte      |           3.1% |
> | exefolio     |           4.2% |

Same question as above.

> | baseline-16k |           5.3% |
>
> I've also run Speedometer 2.0 on Pixel 6 with an Ubuntu SW stack and see similar
> gains.
>
> I've also verified that running the contpte changes without anonfolio and
> exefolio does not cause any regression vs baseline-4k.
>
>
> Opens
> -----
>
> The only potential issue that I see right now is that due to there only being 1
> access/dirty bit per contpte range, if a single page in the range is
> accessed/dirtied then all the adjacent pages are reported as accessed/dirtied
> too. Access/dirty is managed by the kernel per _folio_, so this information gets
> collapsed down anyway, and nothing changes there. However, the per _page_
> access/dirty information is reported through pagemap to user space. I'm not sure
> if this would/should be considered a break? Thoughts?
>
> Thanks,
> Ryan

Thanks
Barry

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v1 00/14] Transparent Contiguous PTEs for User Mappings
@ 2023-07-10 12:05   ` Barry Song
  0 siblings, 0 replies; 46+ messages in thread
From: Barry Song @ 2023-07-10 12:05 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	Oliver Upton, James Morse, Suzuki K Poulose, Zenghui Yu,
	Andrey Ryabinin, Alexander Potapenko, Andrey Konovalov,
	Dmitry Vyukov, Vincenzo Frascino, Andrew Morton,
	Anshuman Khandual, Matthew Wilcox, Yu Zhao, Mark Rutland,
	linux-arm-kernel, linux-kernel, linux-mm

On Thu, Jun 22, 2023 at 11:00 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> Hi All,
>
> This is a series to opportunistically and transparently use contpte mappings
> (set the contiguous bit in ptes) for user memory when those mappings meet the
> requirements. It is part of a wider effort to improve performance of the 4K
> kernel with the aim of approaching the performance of the 16K kernel, but
> without breaking compatibility and without the associated increase in memory. It
> also benefits the 16K and 64K kernels by enabling 2M THP, since this is the
> contpte size for those kernels.
>
> Of course this is only one half of the change. We require the mapped physical
> memory to be the correct size and alignment for this to actually be useful (i.e.
> 64K for 4K pages, or 2M for 16K/64K pages). Fortunately folios are solving this
> problem for us. Filesystems that support it (XFS, AFS, EROFS, tmpfs) will
> allocate large folios up to the PMD size today, and more filesystems are coming.
> And the other half of my work, to enable the use of large folios for anonymous
> memory, aims to make contpte sized folios prevalent for anonymous memory too.
>
>
> Dependencies
> ------------
>
> While there is a complicated set of hard and soft dependencies that this patch
> set depends on, I wanted to split it out as best I could and kick off proper
> review independently.
>
> The series applies on top of these other patch sets, with a tree at:
> https://gitlab.arm.com/linux-arm/linux-rr/-/tree/features/granule_perf/contpte-lkml_v1
>
> v6.4-rc6
>   - base
>
> set_ptes()
>   - hard dependency
>   - Patch set from Matthew Wilcox to set multiple ptes with a single API call
>   - Allows arch backend to more optimally apply contpte mappings
>   - https://lore.kernel.org/linux-mm/20230315051444.3229621-1-willy@infradead.org/
>
> ptep_get() pte encapsulation
>   - hard dependency
>   - Enabler series from me to ensure none of the core code ever directly
>     dereferences a pte_t that lies within a live page table.
>   - Enables gathering access/dirty bits from across the whole contpte range
>   - in mm-stable and linux-next at time of writing
>   - https://lore.kernel.org/linux-mm/d38dc237-6093-d4c5-993e-e8ffdd6cb6fa@arm.com/
>
> Report on physically contiguous memory in smaps
>   - soft dependency
>   - Enables visibility on how much memory is physically contiguous and how much
>     is contpte-mapped - useful for debug
>   - https://lore.kernel.org/linux-mm/20230613160950.3554675-1-ryan.roberts@arm.com/
>
> Additionally there are a couple of other dependencies:
>
> anonfolio
>   - soft dependency
>   - ensures more anonymous memory is allocated in contpte-sized folios, so
>     needed to realize the performance improvements (this is the "other half"
>     mentioned above).
>   - RFC: https://lore.kernel.org/linux-mm/20230414130303.2345383-1-ryan.roberts@arm.com/
>   - Intending to post v1 shortly.
>
> exefolio
>   - soft dependency
>   - Tweak readahead to ensure executable memory are in 64K-sized folios, so
>     needed to see reduction in iTLB pressure.
>   - Don't intend to post this until we are further down the track with contpte
>     and anonfolio.
>
> Arm ARM Clarification
>   - hard dependency
>   - Current wording disallows the fork() optimization in the final patch.
>   - Arm (ATG) have proposed tightening the wording to permit it.
>   - In conversation with partners to check this wouldn't cause problems for any
>     existing HW deployments
>
> All of the _hard_ dependencies need to be resolved before this can be considered
> for merging.
>
>
> Performance
> -----------
>
> The results below show 2 benchmarks: kernel compilation and Speedometer 2.0 (a
> JavaScript benchmark running in Chromium). Both are run on an Ampere
> Altra with 1 NUMA node enabled, Ubuntu 22.04 and the XFS filesystem. Each benchmark
> is repeated 15 times over 5 reboots and the results averaged.
>
> All improvements are relative to baseline-4k. anonfolio and exefolio are as
> described above. contpte is this series. (Note that exefolio only gives an
> improvement because contpte is already in place).
>
> Kernel Compilation (smaller is better):
>
> | kernel       |   real-time |   kern-time |   user-time |
> |:-------------|------------:|------------:|------------:|
> | baseline-4k  |        0.0% |        0.0% |        0.0% |
> | anonfolio    |       -5.4% |      -46.0% |       -0.3% |
> | contpte      |       -6.8% |      -45.7% |       -2.1% |
> | exefolio     |       -8.4% |      -46.4% |       -3.7% |

Sorry, I am a bit confused. In the exefolio case, is anonfolio included,
or does it only have large cont-pte folios for the exe code? In other words,
does the 8.4% improvement come from iTLB miss reduction only,
or from both dTLB and iTLB miss reduction?

> | baseline-16k |       -8.7% |      -49.2% |       -3.7% |
> | baseline-64k |      -10.5% |      -66.0% |       -3.5% |
>
> Speedometer 2.0 (bigger is better):
>
> | kernel       |   runs_per_min |
> |:-------------|---------------:|
> | baseline-4k  |           0.0% |
> | anonfolio    |           1.2% |
> | contpte      |           3.1% |
> | exefolio     |           4.2% |

Same question as above.

> | baseline-16k |           5.3% |
>
> I've also run Speedometer 2.0 on Pixel 6 with an Ubuntu SW stack and see similar
> gains.
>
> I've also verified that running the contpte changes without anonfolio and
> exefolio does not cause any regression vs baseline-4k.
>
>
> Opens
> -----
>
> The only potential issue that I see right now is that, because there is only 1
> access/dirty bit per contpte range, if a single page in the range is
> accessed/dirtied then all the adjacent pages are reported as accessed/dirtied
> too. Access/dirty is managed by the kernel per _folio_, so this information gets
> collapsed down anyway, and nothing changes there. However, the per _page_
> access/dirty information is reported through pagemap to user space. I'm not sure
> whether this would/should be considered a break? Thoughts?
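
To make the mechanics behind this concrete: since the contiguous range shares a
single access/dirty view, a ptep_get()-style read has to gather (OR together)
the bits from every pte in the block, so a per-page consumer such as pagemap
sees the whole block as accessed/dirtied when any one page is. Below is a rough
sketch of that gathering; the helper name and the hard-coded 16-entry block
size are illustrative assumptions, not code from this series.

static pte_t example_contpte_get(pte_t *ptep)
{
        /*
         * Assume ptep points at the first entry of a 16-entry contpte
         * block; fold the per-entry access/dirty bits into one view.
         */
        pte_t pte = ptep_get(ptep);
        unsigned int i;

        for (i = 1; i < 16; i++) {
                pte_t entry = ptep_get(ptep + i);

                if (pte_young(entry))
                        pte = pte_mkyoung(pte);
                if (pte_dirty(entry))
                        pte = pte_mkdirty(pte);
        }

        return pte;
}
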
>
> Thanks,
> Ryan

Thanks
Barry

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v1 00/14] Transparent Contiguous PTEs for User Mappings
  2023-07-10 12:05   ` Barry Song
@ 2023-07-10 13:28     ` Ryan Roberts
  -1 siblings, 0 replies; 46+ messages in thread
From: Ryan Roberts @ 2023-07-10 13:28 UTC (permalink / raw)
  To: Barry Song
  Cc: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
	Oliver Upton, James Morse, Suzuki K Poulose, Zenghui Yu,
	Andrey Ryabinin, Alexander Potapenko, Andrey Konovalov,
	Dmitry Vyukov, Vincenzo Frascino, Andrew Morton,
	Anshuman Khandual, Matthew Wilcox, Yu Zhao, Mark Rutland,
	linux-arm-kernel, linux-kernel, linux-mm

On 10/07/2023 13:05, Barry Song wrote:
> On Thu, Jun 22, 2023 at 11:00 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> Hi All,
>>
[...]
>>
>> Performance
>> -----------
>>
>> The results below show 2 benchmarks: kernel compilation and Speedometer 2.0 (a
>> JavaScript benchmark running in Chromium). Both are run on an Ampere
>> Altra with 1 NUMA node enabled, Ubuntu 22.04 and the XFS filesystem. Each benchmark
>> is repeated 15 times over 5 reboots and the results averaged.
>>
>> All improvements are relative to baseline-4k. anonfolio and exefolio are as
>> described above. contpte is this series. (Note that exefolio only gives an
>> improvement because contpte is already in place).
>>
>> Kernel Compilation (smaller is better):
>>
>> | kernel       |   real-time |   kern-time |   user-time |
>> |:-------------|------------:|------------:|------------:|
>> | baseline-4k  |        0.0% |        0.0% |        0.0% |
>> | anonfolio    |       -5.4% |      -46.0% |       -0.3% |
>> | contpte      |       -6.8% |      -45.7% |       -2.1% |
>> | exefolio     |       -8.4% |      -46.4% |       -3.7% |
> 
> Sorry, I am a bit confused. In the exefolio case, is anonfolio included,
> or does it only have large cont-pte folios for the exe code? In other words,
> does the 8.4% improvement come from iTLB miss reduction only,
> or from both dTLB and iTLB miss reduction?

The anonfolio -> contpte -> exefolio results are incremental. So:

anonfolio: baseline-4k + anonfolio changes
contpte: anonfolio + contpte changes
exefolio: contpte + exefolio changes

So yes, exefolio includes anonfolio. Sorry for the confusion.

> 
>> | baseline-16k |       -8.7% |      -49.2% |       -3.7% |
>> | baseline-64k |      -10.5% |      -66.0% |       -3.5% |
>>
>> Speedometer 2.0 (bigger is better):
>>
>> | kernel       |   runs_per_min |
>> |:-------------|---------------:|
>> | baseline-4k  |           0.0% |
>> | anonfolio    |           1.2% |
>> | contpte      |           3.1% |
>> | exefolio     |           4.2% |
> 
> Same question as above.

Same answer as above.

Thanks,
Ryan


> 
>> | baseline-16k |           5.3% |
>>


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v1 11/14] arm64/mm: Wire up PTE_CONT for user mappings
  2023-07-04 11:09       ` Ryan Roberts
@ 2023-07-16 15:09         ` Catalin Marinas
  -1 siblings, 0 replies; 46+ messages in thread
From: Catalin Marinas @ 2023-07-16 15:09 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Will Deacon, Ard Biesheuvel, Marc Zyngier, Oliver Upton,
	James Morse, Suzuki K Poulose, Zenghui Yu, Andrey Ryabinin,
	Alexander Potapenko, Andrey Konovalov, Dmitry Vyukov,
	Vincenzo Frascino, Andrew Morton, Anshuman Khandual,
	Matthew Wilcox, Yu Zhao, Mark Rutland, linux-arm-kernel,
	linux-kernel, linux-mm

On Tue, Jul 04, 2023 at 12:09:31PM +0100, Ryan Roberts wrote:
> On 03/07/2023 16:17, Catalin Marinas wrote:
> > Hi Ryan,
> > 
> > Some comments below. I did not have time to trim down the quoted text,
> > so you may need to scroll through it.
> 
> Thanks for the review!
> 
> Looking at the comments, I think they all relate to implementation. Does that
> imply that you are happy with the shape/approach?

I can't really tell yet, as there are a few dependencies and I haven't
applied them to look at the bigger picture. My preference would be to
handle the large folio breaking/making in the core code via APIs like
set_ptes() and eliminate the loop heuristics in the arm64 code that
fold/unfold. Maybe that's not entirely possible; I need to look at the
bigger picture with all the series applied (and on a bigger screen; I'm
writing this reply on a laptop in flight).
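
As a rough illustration of the kind of check that could sit behind a
set_ptes() batch, the sketch below applies the contiguous bit only when the
batch exactly covers one naturally aligned contpte block. The constant and
helper names are hypothetical stand-ins, not code from this series.

#define EXAMPLE_CONT_PTES       16      /* e.g. 16 x 4K ptes = 64K */

static bool example_contpte_suitable(unsigned long addr, unsigned long pfn,
                                     unsigned int nr)
{
        /* Both the virtual and the physical range must be block-aligned. */
        return nr == EXAMPLE_CONT_PTES &&
               IS_ALIGNED(addr, EXAMPLE_CONT_PTES * PAGE_SIZE) &&
               IS_ALIGNED(pfn, EXAMPLE_CONT_PTES);
}

static void example_set_ptes(struct mm_struct *mm, unsigned long addr,
                             pte_t *ptep, pte_t pte, unsigned int nr)
{
        unsigned int i;

        if (example_contpte_suitable(addr, pte_pfn(pte), nr))
                pte = pte_mkcont(pte);

        for (i = 0; i < nr; i++, addr += PAGE_SIZE, ptep++) {
                set_pte_at(mm, addr, ptep, pte);
                /* Advance to the next pfn, keeping the same attributes. */
                pte = pfn_pte(pte_pfn(pte) + 1, pte_pgprot(pte));
        }
}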

> When I talked with Anshuman yesterday, he suggested putting this behind a new
> Kconfig option that defaults to disabled and also adding a command-line option
> to disable it when compiled in. I think that makes sense for now, at least to
> reduce the risk of performance regression?

I'm fine with a Kconfig option (maybe expert) but default enabled,
otherwise it won't get enough coverage. AFAICT, the biggest risk of
regression is the heuristics for folding/unfolding. In general the
overhead should be offset by the reduced TLB pressure but we may find
some pathological case where this gets in the way.
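
For reference, a minimal sketch of the command-line override suggested above
(the parameter and variable names are hypothetical, not part of this series):
a default-enabled switch that can be turned off at boot.

static bool contpte_enabled __ro_after_init = true;

static int __init parse_nocontpte(char *unused)
{
        /* "nocontpte" on the kernel command line disables the feature. */
        contpte_enabled = false;
        return 0;
}
early_param("nocontpte", parse_nocontpte);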

> > On Thu, Jun 22, 2023 at 03:42:06PM +0100, Ryan Roberts wrote:
> >> +		/*
> >> +		 * No need to flush here; This is always "more permissive" so we
> >> +		 * can only be _adding_ the access or dirty bit. And since the
> >> +		 * tlb can't cache an entry without the AF set and the dirty bit
> >> +		 * is a SW bit, there can be no confusion. For HW access
> >> +		 * management, we technically only need to update the flag on a
> >> +		 * single pte in the range. But for SW access management, we
> >> +		 * need to update all the ptes to prevent extra faults.
> >> +		 */
> > 
> > On pre-DBM hardware, a PTE_RDONLY entry (writable from the kernel
> > perspective but clean) may be cached in the TLB and we do need flushing.
> 
> I don't follow; The Arm ARM says:
> 
>   IPNQBP When an Access flag fault is generated, the translation table entry
>          causing the fault is not cached in a TLB.
> 
> So the entry can only be in the TLB if AF is already 1. And given the dirty bit
> is SW, it shouldn't affect the TLB state. And this function promises to only
> change the bits so they are more permissive (so AF=0 -> AF=1, D=0 -> D=1).
> 
> So I'm not sure what case you are describing here?

The comment for this function states that it sets the access/dirty flags
as well as the write permission. Prior to DBM, a clean page is mapped
PTE_RDONLY and we take a fault on write. This function marks the page
dirty by setting the software PTE_DIRTY bit (no TLB concern there) but
also by clearing PTE_RDONLY so that a subsequent access won't fault
again. We do need the TLBI here since a PTE_RDONLY entry is allowed to
be cached in the TLB.
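
To make that distinction concrete, here is a rough sketch of the pre-DBM path
described above, with a hypothetical helper name rather than the real arm64
ptep_set_access_flags(): the software dirty bit needs no TLB maintenance, but
dropping PTE_RDONLY does, because the old read-only translation may still be
cached.

static void example_make_writable_dirty(struct vm_area_struct *vma,
                                        unsigned long addr, pte_t *ptep)
{
        pte_t pte = ptep_get(ptep);

        pte = pte_mkdirty(pte); /* software PTE_DIRTY: invisible to the TLB */
        pte = pte_mkwrite(pte); /* clears PTE_RDONLY on pre-DBM hardware */

        set_pte_at(vma->vm_mm, addr, ptep, pte);

        /*
         * Unlike an AF=0 entry, the stale read-only translation may be
         * cached in the TLB, so invalidate it before the write retries.
         */
        flush_tlb_page(vma, addr);
}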

Sorry, I did not reply to your other comments (we can talk in person in
about a week's time). I also noticed you had already figured out the
above, but I had written this reply already.

-- 
Catalin

^ permalink raw reply	[flat|nested] 46+ messages in thread

end of thread, other threads:[~2023-07-16 15:09 UTC | newest]

Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-06-22 14:41 [PATCH v1 00/14] Transparent Contiguous PTEs for User Mappings Ryan Roberts
2023-06-22 14:41 ` Ryan Roberts
2023-06-22 14:41 ` [PATCH v1 01/14] arm64/mm: set_pte(): New layer to manage contig bit Ryan Roberts
2023-06-22 14:41   ` Ryan Roberts
2023-06-22 14:41 ` [PATCH v1 02/14] arm64/mm: set_ptes()/set_pte_at(): " Ryan Roberts
2023-06-22 14:41   ` Ryan Roberts
2023-06-22 14:41 ` [PATCH v1 03/14] arm64/mm: pte_clear(): " Ryan Roberts
2023-06-22 14:41   ` Ryan Roberts
2023-06-22 14:41 ` [PATCH v1 04/14] arm64/mm: ptep_get_and_clear(): " Ryan Roberts
2023-06-22 14:41   ` Ryan Roberts
2023-06-22 14:42 ` [PATCH v1 05/14] arm64/mm: ptep_test_and_clear_young(): " Ryan Roberts
2023-06-22 14:42   ` Ryan Roberts
2023-06-22 14:42 ` [PATCH v1 06/14] arm64/mm: ptep_clear_flush_young(): " Ryan Roberts
2023-06-22 14:42   ` Ryan Roberts
2023-06-22 14:42 ` [PATCH v1 07/14] arm64/mm: ptep_set_wrprotect(): " Ryan Roberts
2023-06-22 14:42   ` Ryan Roberts
2023-06-22 14:42 ` [PATCH v1 08/14] arm64/mm: ptep_set_access_flags(): " Ryan Roberts
2023-06-22 14:42   ` Ryan Roberts
2023-06-22 14:42 ` [PATCH v1 09/14] arm64/mm: ptep_get(): " Ryan Roberts
2023-06-22 14:42   ` Ryan Roberts
2023-06-22 14:42 ` [PATCH v1 10/14] arm64/mm: Split __flush_tlb_range() to elide trailing DSB Ryan Roberts
2023-06-22 14:42   ` Ryan Roberts
2023-06-22 14:42 ` [PATCH v1 11/14] arm64/mm: Wire up PTE_CONT for user mappings Ryan Roberts
2023-06-22 14:42   ` Ryan Roberts
2023-06-30  1:54   ` John Hubbard
2023-06-30  1:54     ` John Hubbard
2023-07-03  9:48     ` Ryan Roberts
2023-07-03  9:48       ` Ryan Roberts
2023-07-03 15:17   ` Catalin Marinas
2023-07-03 15:17     ` Catalin Marinas
2023-07-04 11:09     ` Ryan Roberts
2023-07-04 11:09       ` Ryan Roberts
2023-07-05 13:13       ` Ryan Roberts
2023-07-05 13:13         ` Ryan Roberts
2023-07-16 15:09       ` Catalin Marinas
2023-07-16 15:09         ` Catalin Marinas
2023-06-22 14:42 ` [PATCH v1 12/14] arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown Ryan Roberts
2023-06-22 14:42   ` Ryan Roberts
2023-06-22 14:42 ` [PATCH v1 13/14] mm: Batch-copy PTE ranges during fork() Ryan Roberts
2023-06-22 14:42   ` Ryan Roberts
2023-06-22 14:42 ` [PATCH v1 14/14] arm64/mm: Implement ptep_set_wrprotects() to optimize fork() Ryan Roberts
2023-06-22 14:42   ` Ryan Roberts
2023-07-10 12:05 ` [PATCH v1 00/14] Transparent Contiguous PTEs for User Mappings Barry Song
2023-07-10 12:05   ` Barry Song
2023-07-10 13:28   ` Ryan Roberts
2023-07-10 13:28     ` Ryan Roberts
