* [PATCH v1 0/3] Speed up boot with faster linear map creation
@ 2024-03-26 10:14 ` Ryan Roberts
  0 siblings, 0 replies; 38+ messages in thread
From: Ryan Roberts @ 2024-03-26 10:14 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Mark Rutland, Ard Biesheuvel,
	David Hildenbrand, Donald Dutile, Eric Chanudet
  Cc: Ryan Roberts, linux-arm-kernel, linux-kernel

Hi All,

It turns out that creating the linear map can take a significant proportion of
the total boot time, especially when rodata=full. And a large portion of the
time it takes to create the linear map is issuing TLBIs. This series reworks the
kernel pgtable generation code to significantly reduce the number of TLBIs. See
each patch for details.

The table below shows the execution time of map_mem() across a couple of
different systems with different RAM configurations. We measure after
applying each patch and show the improvement relative to base (v6.9-rc1):

               | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
               | VM, 16G     | VM, 64G     | VM, 256G    | Metal, 512G
---------------|-------------|-------------|-------------|-------------
               |   ms    (%) |   ms    (%) |   ms    (%) |    ms    (%)
---------------|-------------|-------------|-------------|-------------
base           |  151   (0%) | 2191   (0%) | 8990   (0%) | 17443   (0%)
no-cont-remap  |   77 (-49%) |  429 (-80%) | 1753 (-80%) |  3796 (-78%)
no-alloc-remap |   77 (-49%) |  375 (-83%) | 1532 (-83%) |  3366 (-81%)
lazy-unmap     |   63 (-58%) |  330 (-85%) | 1312 (-85%) |  2929 (-83%)

This series applies on top of v6.9-rc1. All mm selftests pass. I haven't yet
tested all VA size configs (although I don't anticipate any issues); I'll do
this as part of a follow-up.

Thanks,
Ryan


Ryan Roberts (3):
  arm64: mm: Don't remap pgtables per- cont(pte|pmd) block
  arm64: mm: Don't remap pgtables for allocate vs populate
  arm64: mm: Lazily clear pte table mappings from fixmap

 arch/arm64/include/asm/fixmap.h  |   5 +-
 arch/arm64/include/asm/mmu.h     |   8 +
 arch/arm64/include/asm/pgtable.h |   4 -
 arch/arm64/kernel/cpufeature.c   |  10 +-
 arch/arm64/mm/fixmap.c           |  11 +
 arch/arm64/mm/mmu.c              | 364 +++++++++++++++++++++++--------
 include/linux/pgtable.h          |   8 +
 7 files changed, 307 insertions(+), 103 deletions(-)

--
2.25.1


* [PATCH v1 1/3] arm64: mm: Don't remap pgtables per- cont(pte|pmd) block
  2024-03-26 10:14 ` Ryan Roberts
@ 2024-03-26 10:14   ` Ryan Roberts
  -1 siblings, 0 replies; 38+ messages in thread
From: Ryan Roberts @ 2024-03-26 10:14 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Mark Rutland, Ard Biesheuvel,
	David Hildenbrand, Donald Dutile, Eric Chanudet
  Cc: Ryan Roberts, linux-arm-kernel, linux-kernel

A large part of the kernel boot time is creating the kernel linear map
page tables. When rodata=full, all memory is mapped by pte. And when
there is lots of physical RAM, there are lots of pte tables to populate.
The primary cost associated with this is mapping and unmapping the pte
table memory in the fixmap; at unmap time, the TLB entry must be
invalidated and this is expensive.

Previously, each pmd and pte table was fixmapped/fixunmapped for each
cont(pte|pmd) block of mappings (16 entries with 4K granule). This meant
we ended up issuing 32 TLBIs per (pmd|pte) table during the population
phase.

Let's fix that, and fixmap/fixunmap each page once per population, for a
saving of 31 TLBIs per (pmd|pte) table. This gives a significant boot
speedup.
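
As a rough sanity check of those numbers (assuming a 4K granule, i.e. 512
ptes per pte table, and the 16-entry cont(pte) block size mentioned above):

    entries per pte table       = 512
    entries per cont(pte) block =  16
    fixmap/fixunmap cycles      = 512 / 16 = 32  ->  32 TLBIs/table (before)
    fixmap/fixunmap cycles      = 1              ->   1 TLBI/table  (after)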

Execution time of map_mem(), which creates the kernel linear map page
tables, was measured on different machines with different RAM configs:

               | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
               | VM, 16G     | VM, 64G     | VM, 256G    | Metal, 512G
---------------|-------------|-------------|-------------|-------------
               |   ms    (%) |   ms    (%) |   ms    (%) |    ms    (%)
---------------|-------------|-------------|-------------|-------------
before         |  151   (0%) | 2191   (0%) | 8990   (0%) | 17443   (0%)
after          |   77 (-49%) |  429 (-80%) | 1753 (-80%) |  3796 (-78%)

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/arm64/mm/mmu.c | 32 ++++++++++++++++++--------------
 1 file changed, 18 insertions(+), 14 deletions(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 495b732d5af3..fd91b5bdb514 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -172,12 +172,9 @@ bool pgattr_change_is_safe(u64 old, u64 new)
 	return ((old ^ new) & ~mask) == 0;
 }
 
-static void init_pte(pmd_t *pmdp, unsigned long addr, unsigned long end,
-		     phys_addr_t phys, pgprot_t prot)
+static pte_t *init_pte(pte_t *ptep, unsigned long addr, unsigned long end,
+		       phys_addr_t phys, pgprot_t prot)
 {
-	pte_t *ptep;
-
-	ptep = pte_set_fixmap_offset(pmdp, addr);
 	do {
 		pte_t old_pte = __ptep_get(ptep);
 
@@ -193,7 +190,7 @@ static void init_pte(pmd_t *pmdp, unsigned long addr, unsigned long end,
 		phys += PAGE_SIZE;
 	} while (ptep++, addr += PAGE_SIZE, addr != end);
 
-	pte_clear_fixmap();
+	return ptep;
 }
 
 static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
@@ -204,6 +201,7 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
 {
 	unsigned long next;
 	pmd_t pmd = READ_ONCE(*pmdp);
+	pte_t *ptep;
 
 	BUG_ON(pmd_sect(pmd));
 	if (pmd_none(pmd)) {
@@ -219,6 +217,7 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
 	}
 	BUG_ON(pmd_bad(pmd));
 
+	ptep = pte_set_fixmap_offset(pmdp, addr);
 	do {
 		pgprot_t __prot = prot;
 
@@ -229,20 +228,20 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
 		    (flags & NO_CONT_MAPPINGS) == 0)
 			__prot = __pgprot(pgprot_val(prot) | PTE_CONT);
 
-		init_pte(pmdp, addr, next, phys, __prot);
+		ptep = init_pte(ptep, addr, next, phys, __prot);
 
 		phys += next - addr;
 	} while (addr = next, addr != end);
+
+	pte_clear_fixmap();
 }
 
-static void init_pmd(pud_t *pudp, unsigned long addr, unsigned long end,
-		     phys_addr_t phys, pgprot_t prot,
-		     phys_addr_t (*pgtable_alloc)(int), int flags)
+static pmd_t *init_pmd(pmd_t *pmdp, unsigned long addr, unsigned long end,
+		       phys_addr_t phys, pgprot_t prot,
+		       phys_addr_t (*pgtable_alloc)(int), int flags)
 {
 	unsigned long next;
-	pmd_t *pmdp;
 
-	pmdp = pmd_set_fixmap_offset(pudp, addr);
 	do {
 		pmd_t old_pmd = READ_ONCE(*pmdp);
 
@@ -269,7 +268,7 @@ static void init_pmd(pud_t *pudp, unsigned long addr, unsigned long end,
 		phys += next - addr;
 	} while (pmdp++, addr = next, addr != end);
 
-	pmd_clear_fixmap();
+	return pmdp;
 }
 
 static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
@@ -279,6 +278,7 @@ static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
 {
 	unsigned long next;
 	pud_t pud = READ_ONCE(*pudp);
+	pmd_t *pmdp;
 
 	/*
 	 * Check for initial section mappings in the pgd/pud.
@@ -297,6 +297,7 @@ static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
 	}
 	BUG_ON(pud_bad(pud));
 
+	pmdp = pmd_set_fixmap_offset(pudp, addr);
 	do {
 		pgprot_t __prot = prot;
 
@@ -307,10 +308,13 @@ static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
 		    (flags & NO_CONT_MAPPINGS) == 0)
 			__prot = __pgprot(pgprot_val(prot) | PTE_CONT);
 
-		init_pmd(pudp, addr, next, phys, __prot, pgtable_alloc, flags);
+		pmdp = init_pmd(pmdp, addr, next, phys, __prot, pgtable_alloc,
+				flags);
 
 		phys += next - addr;
 	} while (addr = next, addr != end);
+
+	pmd_clear_fixmap();
 }
 
 static void alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end,
-- 
2.25.1


* [PATCH v1 2/3] arm64: mm: Don't remap pgtables for allocate vs populate
  2024-03-26 10:14 ` Ryan Roberts
@ 2024-03-26 10:14   ` Ryan Roberts
  -1 siblings, 0 replies; 38+ messages in thread
From: Ryan Roberts @ 2024-03-26 10:14 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Mark Rutland, Ard Biesheuvel,
	David Hildenbrand, Donald Dutile, Eric Chanudet
  Cc: Ryan Roberts, linux-arm-kernel, linux-kernel

The previous change reduced remapping in the fixmap during the
population stage, but the code was still separately
fixmapping/fixunmapping each table during allocation in order to clear
the contents to zero, which means each table still has 2 TLB
invalidations issued against it. Let's fix this so that each table is
only mapped/unmapped once, halving the number of TLBIs.

Achieve this by abstracting pgtable allocate, map and unmap operations
out of the main pgtable population loop code and into a `struct
pgtable_ops` function pointer structure. This allows us to formalize the
semantics of "alloc" to mean "alloc and map", requiring an "unmap" when
finished. So "map" is only performed (and also matched by "unmap") if
the pgtable has already been allocated.
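
Concretely, each per-level population function now takes roughly this
shape (a simplified sketch of the pte level, condensed from the mmu.c
hunk below rather than new code):

	pmd_t pmd = READ_ONCE(*pmdp);

	if (pmd_none(pmd)) {
		/* alloc() both allocates and maps the new pte table */
		ptep = ops->alloc(TYPE_PTE, &pte_phys);
		ptep += pte_index(addr);
		__pmd_populate(pmdp, pte_phys, pmdval);
	} else {
		/* table already exists; map() only makes it accessible */
		ptep = ops->map(TYPE_PTE, pmdp, addr);
	}

	/* ... initialise the pte entries for [addr, end) ... */

	ops->unmap(TYPE_PTE);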

As a side effect of this refactoring, we no longer need to use the
fixmap at all once pages have been mapped in the linear map because
their "map" operation can simply do a __va() translation. So with this
change, we are down to 1 TLBI per table when doing early pgtable
manipulations, and 0 TLBIs when doing late pgtable manipulations.
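
For those late cases the "map" callback is just the usual offset helper
on the linearly-mapped table (see pgd_pgtable_map() below), e.g. for a
pte table:

	/* __va()-based lookup through the linear map: no fixmap, no TLBI */
	ptep = pte_offset_kernel(pmdp, addr);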

Execution time of map_mem(), which creates the kernel linear map page
tables, was measured on different machines with different RAM configs:

               | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
               | VM, 16G     | VM, 64G     | VM, 256G    | Metal, 512G
---------------|-------------|-------------|-------------|-------------
               |   ms    (%) |   ms    (%) |   ms    (%) |    ms    (%)
---------------|-------------|-------------|-------------|-------------
before         |   77   (0%) |  429   (0%) | 1753   (0%) |  3796   (0%)
after          |   77   (0%) |  375 (-13%) | 1532 (-13%) |  3366 (-11%)

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/arm64/include/asm/mmu.h   |   8 +
 arch/arm64/kernel/cpufeature.c |  10 +-
 arch/arm64/mm/mmu.c            | 308 ++++++++++++++++++++++++---------
 include/linux/pgtable.h        |   8 +
 4 files changed, 243 insertions(+), 91 deletions(-)

diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
index 65977c7783c5..ae44353010e8 100644
--- a/arch/arm64/include/asm/mmu.h
+++ b/arch/arm64/include/asm/mmu.h
@@ -109,6 +109,14 @@ static inline bool kaslr_requires_kpti(void)
 	return true;
 }
 
+#ifdef CONFIG_UNMAP_KERNEL_AT_EL0
+extern
+void create_kpti_ng_temp_pgd(pgd_t *pgdir, phys_addr_t phys, unsigned long virt,
+			     phys_addr_t size, pgprot_t prot,
+			     void *(*pgtable_alloc)(int, phys_addr_t *),
+			     int flags);
+#endif
+
 #define INIT_MM_CONTEXT(name)	\
 	.pgd = swapper_pg_dir,
 
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index 56583677c1f2..9a70b1954706 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -1866,17 +1866,13 @@ static bool has_lpa2(const struct arm64_cpu_capabilities *entry, int scope)
 #ifdef CONFIG_UNMAP_KERNEL_AT_EL0
 #define KPTI_NG_TEMP_VA		(-(1UL << PMD_SHIFT))
 
-extern
-void create_kpti_ng_temp_pgd(pgd_t *pgdir, phys_addr_t phys, unsigned long virt,
-			     phys_addr_t size, pgprot_t prot,
-			     phys_addr_t (*pgtable_alloc)(int), int flags);
-
 static phys_addr_t __initdata kpti_ng_temp_alloc;
 
-static phys_addr_t __init kpti_ng_pgd_alloc(int shift)
+static void *__init kpti_ng_pgd_alloc(int type, phys_addr_t *pa)
 {
 	kpti_ng_temp_alloc -= PAGE_SIZE;
-	return kpti_ng_temp_alloc;
+	*pa = kpti_ng_temp_alloc;
+	return __va(kpti_ng_temp_alloc);
 }
 
 static int __init __kpti_install_ng_mappings(void *__unused)
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index fd91b5bdb514..81702b91b107 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -41,9 +41,42 @@
 #include <asm/pgalloc.h>
 #include <asm/kfence.h>
 
+enum pgtable_type {
+	TYPE_P4D = 0,
+	TYPE_PUD = 1,
+	TYPE_PMD = 2,
+	TYPE_PTE = 3,
+};
+
+/**
+ * struct pgtable_ops - Ops to allocate and access pgtable memory. Calls must be
+ * serialized by the caller.
+ * @alloc:      Allocates 1 page of memory for use as pgtable `type` and maps it
+ *              into va space. Returned memory is zeroed. Puts physical address
+ *              of page in *pa, and returns virtual address of the mapping. User
+ *              must explicitly unmap() before doing another alloc() or map() of
+ *              the same `type`.
+ * @map:        Determines the physical address of the pgtable of `type` by
+ *              interpreting `parent` as the pgtable entry for the next level
+ *              up. Maps the page and returns virtual address of the pgtable
+ *              entry within the table that corresponds to `addr`. User must
+ *              explicitly unmap() before doing another alloc() or map() of the
+ *              same `type`.
+ * @unmap:      Unmap the currently mapped page of `type`, which will have been
+ *              mapped either as a result of a previous call to alloc() or
+ *              map(). The page's virtual address must be considered invalid
+ *              after this call returns.
+ */
+struct pgtable_ops {
+	void *(*alloc)(int type, phys_addr_t *pa);
+	void *(*map)(int type, void *parent, unsigned long addr);
+	void (*unmap)(int type);
+};
+
 #define NO_BLOCK_MAPPINGS	BIT(0)
 #define NO_CONT_MAPPINGS	BIT(1)
 #define NO_EXEC_MAPPINGS	BIT(2)	/* assumes FEAT_HPDS is not used */
+#define NO_ALLOC		BIT(3)
 
 u64 kimage_voffset __ro_after_init;
 EXPORT_SYMBOL(kimage_voffset);
@@ -106,34 +139,89 @@ pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn,
 }
 EXPORT_SYMBOL(phys_mem_access_prot);
 
-static phys_addr_t __init early_pgtable_alloc(int shift)
+static void __init early_pgtable_unmap(int type)
+{
+	switch (type) {
+	case TYPE_P4D:
+		p4d_clear_fixmap();
+		break;
+	case TYPE_PUD:
+		pud_clear_fixmap();
+		break;
+	case TYPE_PMD:
+		pmd_clear_fixmap();
+		break;
+	case TYPE_PTE:
+		pte_clear_fixmap();
+		break;
+	default:
+		BUG();
+	}
+}
+
+static void *__init early_pgtable_map(int type, void *parent, unsigned long addr)
+{
+	void *entry;
+
+	switch (type) {
+	case TYPE_P4D:
+		entry = p4d_set_fixmap_offset((pgd_t *)parent, addr);
+		break;
+	case TYPE_PUD:
+		entry = pud_set_fixmap_offset((p4d_t *)parent, addr);
+		break;
+	case TYPE_PMD:
+		entry = pmd_set_fixmap_offset((pud_t *)parent, addr);
+		break;
+	case TYPE_PTE:
+		entry = pte_set_fixmap_offset((pmd_t *)parent, addr);
+		break;
+	default:
+		BUG();
+	}
+
+	return entry;
+}
+
+static void *__init early_pgtable_alloc(int type, phys_addr_t *pa)
 {
-	phys_addr_t phys;
-	void *ptr;
+	void *va;
 
-	phys = memblock_phys_alloc_range(PAGE_SIZE, PAGE_SIZE, 0,
-					 MEMBLOCK_ALLOC_NOLEAKTRACE);
-	if (!phys)
+	*pa = memblock_phys_alloc_range(PAGE_SIZE, PAGE_SIZE, 0,
+					MEMBLOCK_ALLOC_NOLEAKTRACE);
+	if (!*pa)
 		panic("Failed to allocate page table page\n");
 
-	/*
-	 * The FIX_{PGD,PUD,PMD} slots may be in active use, but the FIX_PTE
-	 * slot will be free, so we can (ab)use the FIX_PTE slot to initialise
-	 * any level of table.
-	 */
-	ptr = pte_set_fixmap(phys);
-
-	memset(ptr, 0, PAGE_SIZE);
+	switch (type) {
+	case TYPE_P4D:
+		va = p4d_set_fixmap(*pa);
+		break;
+	case TYPE_PUD:
+		va = pud_set_fixmap(*pa);
+		break;
+	case TYPE_PMD:
+		va = pmd_set_fixmap(*pa);
+		break;
+	case TYPE_PTE:
+		va = pte_set_fixmap(*pa);
+		break;
+	default:
+		BUG();
+	}
+	memset(va, 0, PAGE_SIZE);
 
-	/*
-	 * Implicit barriers also ensure the zeroed page is visible to the page
-	 * table walker
-	 */
-	pte_clear_fixmap();
+	/* Ensure the zeroed page is visible to the page table walker */
+	dsb(ishst);
 
-	return phys;
+	return va;
 }
 
+static struct pgtable_ops early_pgtable_ops = {
+	.alloc = early_pgtable_alloc,
+	.map = early_pgtable_map,
+	.unmap = early_pgtable_unmap,
+};
+
 bool pgattr_change_is_safe(u64 old, u64 new)
 {
 	/*
@@ -196,7 +284,7 @@ static pte_t *init_pte(pte_t *ptep, unsigned long addr, unsigned long end,
 static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
 				unsigned long end, phys_addr_t phys,
 				pgprot_t prot,
-				phys_addr_t (*pgtable_alloc)(int),
+				struct pgtable_ops *ops,
 				int flags)
 {
 	unsigned long next;
@@ -210,14 +298,15 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
 
 		if (flags & NO_EXEC_MAPPINGS)
 			pmdval |= PMD_TABLE_PXN;
-		BUG_ON(!pgtable_alloc);
-		pte_phys = pgtable_alloc(PAGE_SHIFT);
+		BUG_ON(flags & NO_ALLOC);
+		ptep = ops->alloc(TYPE_PTE, &pte_phys);
+		ptep += pte_index(addr);
 		__pmd_populate(pmdp, pte_phys, pmdval);
-		pmd = READ_ONCE(*pmdp);
+	} else {
+		BUG_ON(pmd_bad(pmd));
+		ptep = ops->map(TYPE_PTE, pmdp, addr);
 	}
-	BUG_ON(pmd_bad(pmd));
 
-	ptep = pte_set_fixmap_offset(pmdp, addr);
 	do {
 		pgprot_t __prot = prot;
 
@@ -233,12 +322,12 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
 		phys += next - addr;
 	} while (addr = next, addr != end);
 
-	pte_clear_fixmap();
+	ops->unmap(TYPE_PTE);
 }
 
 static pmd_t *init_pmd(pmd_t *pmdp, unsigned long addr, unsigned long end,
 		       phys_addr_t phys, pgprot_t prot,
-		       phys_addr_t (*pgtable_alloc)(int), int flags)
+		       struct pgtable_ops *ops, int flags)
 {
 	unsigned long next;
 
@@ -260,7 +349,7 @@ static pmd_t *init_pmd(pmd_t *pmdp, unsigned long addr, unsigned long end,
 						      READ_ONCE(pmd_val(*pmdp))));
 		} else {
 			alloc_init_cont_pte(pmdp, addr, next, phys, prot,
-					    pgtable_alloc, flags);
+					    ops, flags);
 
 			BUG_ON(pmd_val(old_pmd) != 0 &&
 			       pmd_val(old_pmd) != READ_ONCE(pmd_val(*pmdp)));
@@ -274,7 +363,7 @@ static pmd_t *init_pmd(pmd_t *pmdp, unsigned long addr, unsigned long end,
 static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
 				unsigned long end, phys_addr_t phys,
 				pgprot_t prot,
-				phys_addr_t (*pgtable_alloc)(int), int flags)
+				struct pgtable_ops *ops, int flags)
 {
 	unsigned long next;
 	pud_t pud = READ_ONCE(*pudp);
@@ -290,14 +379,15 @@ static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
 
 		if (flags & NO_EXEC_MAPPINGS)
 			pudval |= PUD_TABLE_PXN;
-		BUG_ON(!pgtable_alloc);
-		pmd_phys = pgtable_alloc(PMD_SHIFT);
+		BUG_ON(flags & NO_ALLOC);
+		pmdp = ops->alloc(TYPE_PMD, &pmd_phys);
+		pmdp += pmd_index(addr);
 		__pud_populate(pudp, pmd_phys, pudval);
-		pud = READ_ONCE(*pudp);
+	} else {
+		BUG_ON(pud_bad(pud));
+		pmdp = ops->map(TYPE_PMD, pudp, addr);
 	}
-	BUG_ON(pud_bad(pud));
 
-	pmdp = pmd_set_fixmap_offset(pudp, addr);
 	do {
 		pgprot_t __prot = prot;
 
@@ -308,18 +398,17 @@ static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
 		    (flags & NO_CONT_MAPPINGS) == 0)
 			__prot = __pgprot(pgprot_val(prot) | PTE_CONT);
 
-		pmdp = init_pmd(pmdp, addr, next, phys, __prot, pgtable_alloc,
-				flags);
+		pmdp = init_pmd(pmdp, addr, next, phys, __prot, ops, flags);
 
 		phys += next - addr;
 	} while (addr = next, addr != end);
 
-	pmd_clear_fixmap();
+	ops->unmap(TYPE_PMD);
 }
 
 static void alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end,
 			   phys_addr_t phys, pgprot_t prot,
-			   phys_addr_t (*pgtable_alloc)(int),
+			   struct pgtable_ops *ops,
 			   int flags)
 {
 	unsigned long next;
@@ -332,14 +421,15 @@ static void alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end,
 
 		if (flags & NO_EXEC_MAPPINGS)
 			p4dval |= P4D_TABLE_PXN;
-		BUG_ON(!pgtable_alloc);
-		pud_phys = pgtable_alloc(PUD_SHIFT);
+		BUG_ON(flags & NO_ALLOC);
+		pudp = ops->alloc(TYPE_PUD, &pud_phys);
+		pudp += pud_index(addr);
 		__p4d_populate(p4dp, pud_phys, p4dval);
-		p4d = READ_ONCE(*p4dp);
+	} else {
+		BUG_ON(p4d_bad(p4d));
+		pudp = ops->map(TYPE_PUD, p4dp, addr);
 	}
-	BUG_ON(p4d_bad(p4d));
 
-	pudp = pud_set_fixmap_offset(p4dp, addr);
 	do {
 		pud_t old_pud = READ_ONCE(*pudp);
 
@@ -361,7 +451,7 @@ static void alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end,
 						      READ_ONCE(pud_val(*pudp))));
 		} else {
 			alloc_init_cont_pmd(pudp, addr, next, phys, prot,
-					    pgtable_alloc, flags);
+					    ops, flags);
 
 			BUG_ON(pud_val(old_pud) != 0 &&
 			       pud_val(old_pud) != READ_ONCE(pud_val(*pudp)));
@@ -369,12 +459,12 @@ static void alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end,
 		phys += next - addr;
 	} while (pudp++, addr = next, addr != end);
 
-	pud_clear_fixmap();
+	ops->unmap(TYPE_PUD);
 }
 
 static void alloc_init_p4d(pgd_t *pgdp, unsigned long addr, unsigned long end,
 			   phys_addr_t phys, pgprot_t prot,
-			   phys_addr_t (*pgtable_alloc)(int),
+			   struct pgtable_ops *ops,
 			   int flags)
 {
 	unsigned long next;
@@ -387,21 +477,21 @@ static void alloc_init_p4d(pgd_t *pgdp, unsigned long addr, unsigned long end,
 
 		if (flags & NO_EXEC_MAPPINGS)
 			pgdval |= PGD_TABLE_PXN;
-		BUG_ON(!pgtable_alloc);
-		p4d_phys = pgtable_alloc(P4D_SHIFT);
+		BUG_ON(flags & NO_ALLOC);
+		p4dp = ops->alloc(TYPE_P4D, &p4d_phys);
+		p4dp += p4d_index(addr);
 		__pgd_populate(pgdp, p4d_phys, pgdval);
-		pgd = READ_ONCE(*pgdp);
+	} else {
+		BUG_ON(pgd_bad(pgd));
+		p4dp = ops->map(TYPE_P4D, pgdp, addr);
 	}
-	BUG_ON(pgd_bad(pgd));
 
-	p4dp = p4d_set_fixmap_offset(pgdp, addr);
 	do {
 		p4d_t old_p4d = READ_ONCE(*p4dp);
 
 		next = p4d_addr_end(addr, end);
 
-		alloc_init_pud(p4dp, addr, next, phys, prot,
-			       pgtable_alloc, flags);
+		alloc_init_pud(p4dp, addr, next, phys, prot, ops, flags);
 
 		BUG_ON(p4d_val(old_p4d) != 0 &&
 		       p4d_val(old_p4d) != READ_ONCE(p4d_val(*p4dp)));
@@ -409,13 +499,13 @@ static void alloc_init_p4d(pgd_t *pgdp, unsigned long addr, unsigned long end,
 		phys += next - addr;
 	} while (p4dp++, addr = next, addr != end);
 
-	p4d_clear_fixmap();
+	ops->unmap(TYPE_P4D);
 }
 
 static void __create_pgd_mapping_locked(pgd_t *pgdir, phys_addr_t phys,
 					unsigned long virt, phys_addr_t size,
 					pgprot_t prot,
-					phys_addr_t (*pgtable_alloc)(int),
+					struct pgtable_ops *ops,
 					int flags)
 {
 	unsigned long addr, end, next;
@@ -434,8 +524,7 @@ static void __create_pgd_mapping_locked(pgd_t *pgdir, phys_addr_t phys,
 
 	do {
 		next = pgd_addr_end(addr, end);
-		alloc_init_p4d(pgdp, addr, next, phys, prot, pgtable_alloc,
-			       flags);
+		alloc_init_p4d(pgdp, addr, next, phys, prot, ops, flags);
 		phys += next - addr;
 	} while (pgdp++, addr = next, addr != end);
 }
@@ -443,36 +532,59 @@ static void __create_pgd_mapping_locked(pgd_t *pgdir, phys_addr_t phys,
 static void __create_pgd_mapping(pgd_t *pgdir, phys_addr_t phys,
 				 unsigned long virt, phys_addr_t size,
 				 pgprot_t prot,
-				 phys_addr_t (*pgtable_alloc)(int),
+				 struct pgtable_ops *ops,
 				 int flags)
 {
 	mutex_lock(&fixmap_lock);
 	__create_pgd_mapping_locked(pgdir, phys, virt, size, prot,
-				    pgtable_alloc, flags);
+				    ops, flags);
 	mutex_unlock(&fixmap_lock);
 }
 
-#ifdef CONFIG_UNMAP_KERNEL_AT_EL0
-extern __alias(__create_pgd_mapping_locked)
-void create_kpti_ng_temp_pgd(pgd_t *pgdir, phys_addr_t phys, unsigned long virt,
-			     phys_addr_t size, pgprot_t prot,
-			     phys_addr_t (*pgtable_alloc)(int), int flags);
-#endif
+static void pgd_pgtable_unmap(int type)
+{
+}
+
+static void *pgd_pgtable_map(int type, void *parent, unsigned long addr)
+{
+	void *entry;
+
+	switch (type) {
+	case TYPE_P4D:
+		entry = p4d_offset((pgd_t *)parent, addr);
+		break;
+	case TYPE_PUD:
+		entry = pud_offset((p4d_t *)parent, addr);
+		break;
+	case TYPE_PMD:
+		entry = pmd_offset((pud_t *)parent, addr);
+		break;
+	case TYPE_PTE:
+		entry = pte_offset_kernel((pmd_t *)parent, addr);
+		break;
+	default:
+		BUG();
+	}
+
+	return entry;
+}
 
-static phys_addr_t __pgd_pgtable_alloc(int shift)
+static void *__pgd_pgtable_alloc(int type, phys_addr_t *pa)
 {
-	void *ptr = (void *)__get_free_page(GFP_PGTABLE_KERNEL);
-	BUG_ON(!ptr);
+	void *va = (void *)__get_free_page(GFP_PGTABLE_KERNEL);
+
+	BUG_ON(!va);
 
 	/* Ensure the zeroed page is visible to the page table walker */
 	dsb(ishst);
-	return __pa(ptr);
+	*pa = __pa(va);
+	return va;
 }
 
-static phys_addr_t pgd_pgtable_alloc(int shift)
+static void *pgd_pgtable_alloc(int type, phys_addr_t *pa)
 {
-	phys_addr_t pa = __pgd_pgtable_alloc(shift);
-	struct ptdesc *ptdesc = page_ptdesc(phys_to_page(pa));
+	void *va = __pgd_pgtable_alloc(type, pa);
+	struct ptdesc *ptdesc = page_ptdesc(phys_to_page(*pa));
 
 	/*
 	 * Call proper page table ctor in case later we need to
@@ -482,13 +594,41 @@ static phys_addr_t pgd_pgtable_alloc(int shift)
 	 * We don't select ARCH_ENABLE_SPLIT_PMD_PTLOCK if pmd is
 	 * folded, and if so pagetable_pte_ctor() becomes nop.
 	 */
-	if (shift == PAGE_SHIFT)
+	if (type == TYPE_PTE)
 		BUG_ON(!pagetable_pte_ctor(ptdesc));
-	else if (shift == PMD_SHIFT)
+	else if (type == TYPE_PMD)
 		BUG_ON(!pagetable_pmd_ctor(ptdesc));
 
-	return pa;
+	return va;
+}
+
+static struct pgtable_ops pgd_pgtable_ops = {
+	.alloc = pgd_pgtable_alloc,
+	.map = pgd_pgtable_map,
+	.unmap = pgd_pgtable_unmap,
+};
+
+static struct pgtable_ops __pgd_pgtable_ops = {
+	.alloc = __pgd_pgtable_alloc,
+	.map = pgd_pgtable_map,
+	.unmap = pgd_pgtable_unmap,
+};
+
+#ifdef CONFIG_UNMAP_KERNEL_AT_EL0
+void create_kpti_ng_temp_pgd(pgd_t *pgdir, phys_addr_t phys, unsigned long virt,
+			     phys_addr_t size, pgprot_t prot,
+			     void *(*pgtable_alloc)(int, phys_addr_t *),
+			     int flags)
+{
+	struct pgtable_ops ops = {
+		.alloc = pgtable_alloc,
+		.map = pgd_pgtable_map,
+		.unmap = pgd_pgtable_unmap,
+	};
+
+	__create_pgd_mapping_locked(pgdir, phys, virt, size, prot, &ops, flags);
 }
+#endif
 
 /*
  * This function can only be used to modify existing table entries,
@@ -503,8 +643,8 @@ void __init create_mapping_noalloc(phys_addr_t phys, unsigned long virt,
 			&phys, virt);
 		return;
 	}
-	__create_pgd_mapping(init_mm.pgd, phys, virt, size, prot, NULL,
-			     NO_CONT_MAPPINGS);
+	__create_pgd_mapping(init_mm.pgd, phys, virt, size, prot,
+			     &early_pgtable_ops, NO_CONT_MAPPINGS | NO_ALLOC);
 }
 
 void __init create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys,
@@ -519,7 +659,7 @@ void __init create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys,
 		flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
 
 	__create_pgd_mapping(mm->pgd, phys, virt, size, prot,
-			     pgd_pgtable_alloc, flags);
+			     &pgd_pgtable_ops, flags);
 }
 
 static void update_mapping_prot(phys_addr_t phys, unsigned long virt,
@@ -531,8 +671,8 @@ static void update_mapping_prot(phys_addr_t phys, unsigned long virt,
 		return;
 	}
 
-	__create_pgd_mapping(init_mm.pgd, phys, virt, size, prot, NULL,
-			     NO_CONT_MAPPINGS);
+	__create_pgd_mapping(init_mm.pgd, phys, virt, size, prot,
+			     &pgd_pgtable_ops, NO_CONT_MAPPINGS | NO_ALLOC);
 
 	/* flush the TLBs after updating live kernel mappings */
 	flush_tlb_kernel_range(virt, virt + size);
@@ -542,7 +682,7 @@ static void __init __map_memblock(pgd_t *pgdp, phys_addr_t start,
 				  phys_addr_t end, pgprot_t prot, int flags)
 {
 	__create_pgd_mapping(pgdp, start, __phys_to_virt(start), end - start,
-			     prot, early_pgtable_alloc, flags);
+			     prot, &early_pgtable_ops, flags);
 }
 
 void __init mark_linear_text_alias_ro(void)
@@ -733,7 +873,7 @@ static int __init map_entry_trampoline(void)
 	memset(tramp_pg_dir, 0, PGD_SIZE);
 	__create_pgd_mapping(tramp_pg_dir, pa_start, TRAMP_VALIAS,
 			     entry_tramp_text_size(), prot,
-			     __pgd_pgtable_alloc, NO_BLOCK_MAPPINGS);
+			     &__pgd_pgtable_ops, NO_BLOCK_MAPPINGS);
 
 	/* Map both the text and data into the kernel page table */
 	for (i = 0; i < DIV_ROUND_UP(entry_tramp_text_size(), PAGE_SIZE); i++)
@@ -1335,7 +1475,7 @@ int arch_add_memory(int nid, u64 start, u64 size,
 		flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
 
 	__create_pgd_mapping(swapper_pg_dir, start, __phys_to_virt(start),
-			     size, params->pgprot, __pgd_pgtable_alloc,
+			     size, params->pgprot, &__pgd_pgtable_ops,
 			     flags);
 
 	memblock_clear_nomap(start, size);
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 85fc7554cd52..1d9e91847cd8 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -83,6 +83,14 @@ static inline unsigned long pud_index(unsigned long address)
 #define pud_index pud_index
 #endif
 
+#ifndef p4d_index
+static inline unsigned long p4d_index(unsigned long address)
+{
+	return (address >> P4D_SHIFT) & (PTRS_PER_P4D - 1);
+}
+#define p4d_index p4d_index
+#endif
+
 #ifndef pgd_index
 /* Must be a compile-time constant, so implement it as a macro */
 #define pgd_index(a)  (((a) >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1))
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v1 2/3] arm64: mm: Don't remap pgtables for allocate vs populate
@ 2024-03-26 10:14   ` Ryan Roberts
  0 siblings, 0 replies; 38+ messages in thread
From: Ryan Roberts @ 2024-03-26 10:14 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Mark Rutland, Ard Biesheuvel,
	David Hildenbrand, Donald Dutile, Eric Chanudet
  Cc: Ryan Roberts, linux-arm-kernel, linux-kernel

The previous change reduced remapping in the fixmap during the
population stage, but the code was still separately
fixmapping/fixunmapping each table during allocation in order to clear
the contents to zero. Which means each table still has 2 TLB
invalidations issued against it. Let's fix this so that each table is
only mapped/unmapped once, halving the number of TLBIs.

Achieve this by abstracting pgtable allocate, map and unmap operations
out of the main pgtable population loop code and into a `struct
pgtable_ops` function pointer structure. This allows us to formalize the
semantics of "alloc" to mean "alloc and map", requiring an "unmap" when
finished. So "map" is only performed (and also matched by "unmap") if
the pgtable is already been allocated.

As a side effect of this refactoring, we no longer need to use the
fixmap at all once pages have been mapped in the linear map because
their "map" operation can simply do a __va() translation. So with this
change, we are down to 1 TLBI per table when doing early pgtable
manipulations, and 0 TLBIs when doing late pgtable manipulations.

Execution time of map_mem(), which creates the kernel linear map page
tables, was measured on different machines with different RAM configs:

               | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
               | VM, 16G     | VM, 64G     | VM, 256G    | Metal, 512G
---------------|-------------|-------------|-------------|-------------
               |   ms    (%) |   ms    (%) |   ms    (%) |    ms    (%)
---------------|-------------|-------------|-------------|-------------
before         |   77   (0%) |  429   (0%) | 1753   (0%) |  3796   (0%)
after          |   77   (0%) |  375 (-13%) | 1532 (-13%) |  3366 (-11%)

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/arm64/include/asm/mmu.h   |   8 +
 arch/arm64/kernel/cpufeature.c |  10 +-
 arch/arm64/mm/mmu.c            | 308 ++++++++++++++++++++++++---------
 include/linux/pgtable.h        |   8 +
 4 files changed, 243 insertions(+), 91 deletions(-)

diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
index 65977c7783c5..ae44353010e8 100644
--- a/arch/arm64/include/asm/mmu.h
+++ b/arch/arm64/include/asm/mmu.h
@@ -109,6 +109,14 @@ static inline bool kaslr_requires_kpti(void)
 	return true;
 }
 
+#ifdef CONFIG_UNMAP_KERNEL_AT_EL0
+extern
+void create_kpti_ng_temp_pgd(pgd_t *pgdir, phys_addr_t phys, unsigned long virt,
+			     phys_addr_t size, pgprot_t prot,
+			     void *(*pgtable_alloc)(int, phys_addr_t *),
+			     int flags);
+#endif
+
 #define INIT_MM_CONTEXT(name)	\
 	.pgd = swapper_pg_dir,
 
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index 56583677c1f2..9a70b1954706 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -1866,17 +1866,13 @@ static bool has_lpa2(const struct arm64_cpu_capabilities *entry, int scope)
 #ifdef CONFIG_UNMAP_KERNEL_AT_EL0
 #define KPTI_NG_TEMP_VA		(-(1UL << PMD_SHIFT))
 
-extern
-void create_kpti_ng_temp_pgd(pgd_t *pgdir, phys_addr_t phys, unsigned long virt,
-			     phys_addr_t size, pgprot_t prot,
-			     phys_addr_t (*pgtable_alloc)(int), int flags);
-
 static phys_addr_t __initdata kpti_ng_temp_alloc;
 
-static phys_addr_t __init kpti_ng_pgd_alloc(int shift)
+static void *__init kpti_ng_pgd_alloc(int type, phys_addr_t *pa)
 {
 	kpti_ng_temp_alloc -= PAGE_SIZE;
-	return kpti_ng_temp_alloc;
+	*pa = kpti_ng_temp_alloc;
+	return __va(kpti_ng_temp_alloc);
 }
 
 static int __init __kpti_install_ng_mappings(void *__unused)
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index fd91b5bdb514..81702b91b107 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -41,9 +41,42 @@
 #include <asm/pgalloc.h>
 #include <asm/kfence.h>
 
+enum pgtable_type {
+	TYPE_P4D = 0,
+	TYPE_PUD = 1,
+	TYPE_PMD = 2,
+	TYPE_PTE = 3,
+};
+
+/**
+ * struct pgtable_ops - Ops to allocate and access pgtable memory. Calls must be
+ * serialized by the caller.
+ * @alloc:      Allocates 1 page of memory for use as pgtable `type` and maps it
+ *              into va space. Returned memory is zeroed. Puts physical address
+ *              of page in *pa, and returns virtual address of the mapping. User
+ *              must explicitly unmap() before doing another alloc() or map() of
+ *              the same `type`.
+ * @map:        Determines the physical address of the pgtable of `type` by
+ *              interpretting `parent` as the pgtable entry for the next level
+ *              up. Maps the page and returns virtual address of the pgtable
+ *              entry within the table that corresponds to `addr`. User must
+ *              explicitly unmap() before doing another alloc() or map() of the
+ *              same `type`.
+ * @unmap:      Unmap the currently mapped page of `type`, which will have been
+ *              mapped either as a result of a previous call to alloc() or
+ *              map(). The page's virtual address must be considered invalid
+ *              after this call returns.
+ */
+struct pgtable_ops {
+	void *(*alloc)(int type, phys_addr_t *pa);
+	void *(*map)(int type, void *parent, unsigned long addr);
+	void (*unmap)(int type);
+};
+
 #define NO_BLOCK_MAPPINGS	BIT(0)
 #define NO_CONT_MAPPINGS	BIT(1)
 #define NO_EXEC_MAPPINGS	BIT(2)	/* assumes FEAT_HPDS is not used */
+#define NO_ALLOC		BIT(3)
 
 u64 kimage_voffset __ro_after_init;
 EXPORT_SYMBOL(kimage_voffset);
@@ -106,34 +139,89 @@ pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn,
 }
 EXPORT_SYMBOL(phys_mem_access_prot);
 
-static phys_addr_t __init early_pgtable_alloc(int shift)
+static void __init early_pgtable_unmap(int type)
+{
+	switch (type) {
+	case TYPE_P4D:
+		p4d_clear_fixmap();
+		break;
+	case TYPE_PUD:
+		pud_clear_fixmap();
+		break;
+	case TYPE_PMD:
+		pmd_clear_fixmap();
+		break;
+	case TYPE_PTE:
+		pte_clear_fixmap();
+		break;
+	default:
+		BUG();
+	}
+}
+
+static void *__init early_pgtable_map(int type, void *parent, unsigned long addr)
+{
+	void *entry;
+
+	switch (type) {
+	case TYPE_P4D:
+		entry = p4d_set_fixmap_offset((pgd_t *)parent, addr);
+		break;
+	case TYPE_PUD:
+		entry = pud_set_fixmap_offset((p4d_t *)parent, addr);
+		break;
+	case TYPE_PMD:
+		entry = pmd_set_fixmap_offset((pud_t *)parent, addr);
+		break;
+	case TYPE_PTE:
+		entry = pte_set_fixmap_offset((pmd_t *)parent, addr);
+		break;
+	default:
+		BUG();
+	}
+
+	return entry;
+}
+
+static void *__init early_pgtable_alloc(int type, phys_addr_t *pa)
 {
-	phys_addr_t phys;
-	void *ptr;
+	void *va;
 
-	phys = memblock_phys_alloc_range(PAGE_SIZE, PAGE_SIZE, 0,
-					 MEMBLOCK_ALLOC_NOLEAKTRACE);
-	if (!phys)
+	*pa = memblock_phys_alloc_range(PAGE_SIZE, PAGE_SIZE, 0,
+					MEMBLOCK_ALLOC_NOLEAKTRACE);
+	if (!*pa)
 		panic("Failed to allocate page table page\n");
 
-	/*
-	 * The FIX_{PGD,PUD,PMD} slots may be in active use, but the FIX_PTE
-	 * slot will be free, so we can (ab)use the FIX_PTE slot to initialise
-	 * any level of table.
-	 */
-	ptr = pte_set_fixmap(phys);
-
-	memset(ptr, 0, PAGE_SIZE);
+	switch (type) {
+	case TYPE_P4D:
+		va = p4d_set_fixmap(*pa);
+		break;
+	case TYPE_PUD:
+		va = pud_set_fixmap(*pa);
+		break;
+	case TYPE_PMD:
+		va = pmd_set_fixmap(*pa);
+		break;
+	case TYPE_PTE:
+		va = pte_set_fixmap(*pa);
+		break;
+	default:
+		BUG();
+	}
+	memset(va, 0, PAGE_SIZE);
 
-	/*
-	 * Implicit barriers also ensure the zeroed page is visible to the page
-	 * table walker
-	 */
-	pte_clear_fixmap();
+	/* Ensure the zeroed page is visible to the page table walker */
+	dsb(ishst);
 
-	return phys;
+	return va;
 }
 
+static struct pgtable_ops early_pgtable_ops = {
+	.alloc = early_pgtable_alloc,
+	.map = early_pgtable_map,
+	.unmap = early_pgtable_unmap,
+};
+
 bool pgattr_change_is_safe(u64 old, u64 new)
 {
 	/*
@@ -196,7 +284,7 @@ static pte_t *init_pte(pte_t *ptep, unsigned long addr, unsigned long end,
 static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
 				unsigned long end, phys_addr_t phys,
 				pgprot_t prot,
-				phys_addr_t (*pgtable_alloc)(int),
+				struct pgtable_ops *ops,
 				int flags)
 {
 	unsigned long next;
@@ -210,14 +298,15 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
 
 		if (flags & NO_EXEC_MAPPINGS)
 			pmdval |= PMD_TABLE_PXN;
-		BUG_ON(!pgtable_alloc);
-		pte_phys = pgtable_alloc(PAGE_SHIFT);
+		BUG_ON(flags & NO_ALLOC);
+		ptep = ops->alloc(TYPE_PTE, &pte_phys);
+		ptep += pte_index(addr);
 		__pmd_populate(pmdp, pte_phys, pmdval);
-		pmd = READ_ONCE(*pmdp);
+	} else {
+		BUG_ON(pmd_bad(pmd));
+		ptep = ops->map(TYPE_PTE, pmdp, addr);
 	}
-	BUG_ON(pmd_bad(pmd));
 
-	ptep = pte_set_fixmap_offset(pmdp, addr);
 	do {
 		pgprot_t __prot = prot;
 
@@ -233,12 +322,12 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
 		phys += next - addr;
 	} while (addr = next, addr != end);
 
-	pte_clear_fixmap();
+	ops->unmap(TYPE_PTE);
 }
 
 static pmd_t *init_pmd(pmd_t *pmdp, unsigned long addr, unsigned long end,
 		       phys_addr_t phys, pgprot_t prot,
-		       phys_addr_t (*pgtable_alloc)(int), int flags)
+		       struct pgtable_ops *ops, int flags)
 {
 	unsigned long next;
 
@@ -260,7 +349,7 @@ static pmd_t *init_pmd(pmd_t *pmdp, unsigned long addr, unsigned long end,
 						      READ_ONCE(pmd_val(*pmdp))));
 		} else {
 			alloc_init_cont_pte(pmdp, addr, next, phys, prot,
-					    pgtable_alloc, flags);
+					    ops, flags);
 
 			BUG_ON(pmd_val(old_pmd) != 0 &&
 			       pmd_val(old_pmd) != READ_ONCE(pmd_val(*pmdp)));
@@ -274,7 +363,7 @@ static pmd_t *init_pmd(pmd_t *pmdp, unsigned long addr, unsigned long end,
 static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
 				unsigned long end, phys_addr_t phys,
 				pgprot_t prot,
-				phys_addr_t (*pgtable_alloc)(int), int flags)
+				struct pgtable_ops *ops, int flags)
 {
 	unsigned long next;
 	pud_t pud = READ_ONCE(*pudp);
@@ -290,14 +379,15 @@ static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
 
 		if (flags & NO_EXEC_MAPPINGS)
 			pudval |= PUD_TABLE_PXN;
-		BUG_ON(!pgtable_alloc);
-		pmd_phys = pgtable_alloc(PMD_SHIFT);
+		BUG_ON(flags & NO_ALLOC);
+		pmdp = ops->alloc(TYPE_PMD, &pmd_phys);
+		pmdp += pmd_index(addr);
 		__pud_populate(pudp, pmd_phys, pudval);
-		pud = READ_ONCE(*pudp);
+	} else {
+		BUG_ON(pud_bad(pud));
+		pmdp = ops->map(TYPE_PMD, pudp, addr);
 	}
-	BUG_ON(pud_bad(pud));
 
-	pmdp = pmd_set_fixmap_offset(pudp, addr);
 	do {
 		pgprot_t __prot = prot;
 
@@ -308,18 +398,17 @@ static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
 		    (flags & NO_CONT_MAPPINGS) == 0)
 			__prot = __pgprot(pgprot_val(prot) | PTE_CONT);
 
-		pmdp = init_pmd(pmdp, addr, next, phys, __prot, pgtable_alloc,
-				flags);
+		pmdp = init_pmd(pmdp, addr, next, phys, __prot, ops, flags);
 
 		phys += next - addr;
 	} while (addr = next, addr != end);
 
-	pmd_clear_fixmap();
+	ops->unmap(TYPE_PMD);
 }
 
 static void alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end,
 			   phys_addr_t phys, pgprot_t prot,
-			   phys_addr_t (*pgtable_alloc)(int),
+			   struct pgtable_ops *ops,
 			   int flags)
 {
 	unsigned long next;
@@ -332,14 +421,15 @@ static void alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end,
 
 		if (flags & NO_EXEC_MAPPINGS)
 			p4dval |= P4D_TABLE_PXN;
-		BUG_ON(!pgtable_alloc);
-		pud_phys = pgtable_alloc(PUD_SHIFT);
+		BUG_ON(flags & NO_ALLOC);
+		pudp = ops->alloc(TYPE_PUD, &pud_phys);
+		pudp += pud_index(addr);
 		__p4d_populate(p4dp, pud_phys, p4dval);
-		p4d = READ_ONCE(*p4dp);
+	} else {
+		BUG_ON(p4d_bad(p4d));
+		pudp = ops->map(TYPE_PUD, p4dp, addr);
 	}
-	BUG_ON(p4d_bad(p4d));
 
-	pudp = pud_set_fixmap_offset(p4dp, addr);
 	do {
 		pud_t old_pud = READ_ONCE(*pudp);
 
@@ -361,7 +451,7 @@ static void alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end,
 						      READ_ONCE(pud_val(*pudp))));
 		} else {
 			alloc_init_cont_pmd(pudp, addr, next, phys, prot,
-					    pgtable_alloc, flags);
+					    ops, flags);
 
 			BUG_ON(pud_val(old_pud) != 0 &&
 			       pud_val(old_pud) != READ_ONCE(pud_val(*pudp)));
@@ -369,12 +459,12 @@ static void alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end,
 		phys += next - addr;
 	} while (pudp++, addr = next, addr != end);
 
-	pud_clear_fixmap();
+	ops->unmap(TYPE_PUD);
 }
 
 static void alloc_init_p4d(pgd_t *pgdp, unsigned long addr, unsigned long end,
 			   phys_addr_t phys, pgprot_t prot,
-			   phys_addr_t (*pgtable_alloc)(int),
+			   struct pgtable_ops *ops,
 			   int flags)
 {
 	unsigned long next;
@@ -387,21 +477,21 @@ static void alloc_init_p4d(pgd_t *pgdp, unsigned long addr, unsigned long end,
 
 		if (flags & NO_EXEC_MAPPINGS)
 			pgdval |= PGD_TABLE_PXN;
-		BUG_ON(!pgtable_alloc);
-		p4d_phys = pgtable_alloc(P4D_SHIFT);
+		BUG_ON(flags & NO_ALLOC);
+		p4dp = ops->alloc(TYPE_P4D, &p4d_phys);
+		p4dp += p4d_index(addr);
 		__pgd_populate(pgdp, p4d_phys, pgdval);
-		pgd = READ_ONCE(*pgdp);
+	} else {
+		BUG_ON(pgd_bad(pgd));
+		p4dp = ops->map(TYPE_P4D, pgdp, addr);
 	}
-	BUG_ON(pgd_bad(pgd));
 
-	p4dp = p4d_set_fixmap_offset(pgdp, addr);
 	do {
 		p4d_t old_p4d = READ_ONCE(*p4dp);
 
 		next = p4d_addr_end(addr, end);
 
-		alloc_init_pud(p4dp, addr, next, phys, prot,
-			       pgtable_alloc, flags);
+		alloc_init_pud(p4dp, addr, next, phys, prot, ops, flags);
 
 		BUG_ON(p4d_val(old_p4d) != 0 &&
 		       p4d_val(old_p4d) != READ_ONCE(p4d_val(*p4dp)));
@@ -409,13 +499,13 @@ static void alloc_init_p4d(pgd_t *pgdp, unsigned long addr, unsigned long end,
 		phys += next - addr;
 	} while (p4dp++, addr = next, addr != end);
 
-	p4d_clear_fixmap();
+	ops->unmap(TYPE_P4D);
 }
 
 static void __create_pgd_mapping_locked(pgd_t *pgdir, phys_addr_t phys,
 					unsigned long virt, phys_addr_t size,
 					pgprot_t prot,
-					phys_addr_t (*pgtable_alloc)(int),
+					struct pgtable_ops *ops,
 					int flags)
 {
 	unsigned long addr, end, next;
@@ -434,8 +524,7 @@ static void __create_pgd_mapping_locked(pgd_t *pgdir, phys_addr_t phys,
 
 	do {
 		next = pgd_addr_end(addr, end);
-		alloc_init_p4d(pgdp, addr, next, phys, prot, pgtable_alloc,
-			       flags);
+		alloc_init_p4d(pgdp, addr, next, phys, prot, ops, flags);
 		phys += next - addr;
 	} while (pgdp++, addr = next, addr != end);
 }
@@ -443,36 +532,59 @@ static void __create_pgd_mapping_locked(pgd_t *pgdir, phys_addr_t phys,
 static void __create_pgd_mapping(pgd_t *pgdir, phys_addr_t phys,
 				 unsigned long virt, phys_addr_t size,
 				 pgprot_t prot,
-				 phys_addr_t (*pgtable_alloc)(int),
+				 struct pgtable_ops *ops,
 				 int flags)
 {
 	mutex_lock(&fixmap_lock);
 	__create_pgd_mapping_locked(pgdir, phys, virt, size, prot,
-				    pgtable_alloc, flags);
+				    ops, flags);
 	mutex_unlock(&fixmap_lock);
 }
 
-#ifdef CONFIG_UNMAP_KERNEL_AT_EL0
-extern __alias(__create_pgd_mapping_locked)
-void create_kpti_ng_temp_pgd(pgd_t *pgdir, phys_addr_t phys, unsigned long virt,
-			     phys_addr_t size, pgprot_t prot,
-			     phys_addr_t (*pgtable_alloc)(int), int flags);
-#endif
+static void pgd_pgtable_unmap(int type)
+{
+}
+
+static void *pgd_pgtable_map(int type, void *parent, unsigned long addr)
+{
+	void *entry;
+
+	switch (type) {
+	case TYPE_P4D:
+		entry = p4d_offset((pgd_t *)parent, addr);
+		break;
+	case TYPE_PUD:
+		entry = pud_offset((p4d_t *)parent, addr);
+		break;
+	case TYPE_PMD:
+		entry = pmd_offset((pud_t *)parent, addr);
+		break;
+	case TYPE_PTE:
+		entry = pte_offset_kernel((pmd_t *)parent, addr);
+		break;
+	default:
+		BUG();
+	}
+
+	return entry;
+}
 
-static phys_addr_t __pgd_pgtable_alloc(int shift)
+static void *__pgd_pgtable_alloc(int type, phys_addr_t *pa)
 {
-	void *ptr = (void *)__get_free_page(GFP_PGTABLE_KERNEL);
-	BUG_ON(!ptr);
+	void *va = (void *)__get_free_page(GFP_PGTABLE_KERNEL);
+
+	BUG_ON(!va);
 
 	/* Ensure the zeroed page is visible to the page table walker */
 	dsb(ishst);
-	return __pa(ptr);
+	*pa = __pa(va);
+	return va;
 }
 
-static phys_addr_t pgd_pgtable_alloc(int shift)
+static void *pgd_pgtable_alloc(int type, phys_addr_t *pa)
 {
-	phys_addr_t pa = __pgd_pgtable_alloc(shift);
-	struct ptdesc *ptdesc = page_ptdesc(phys_to_page(pa));
+	void *va = __pgd_pgtable_alloc(type, pa);
+	struct ptdesc *ptdesc = page_ptdesc(phys_to_page(*pa));
 
 	/*
 	 * Call proper page table ctor in case later we need to
@@ -482,13 +594,41 @@ static phys_addr_t pgd_pgtable_alloc(int shift)
 	 * We don't select ARCH_ENABLE_SPLIT_PMD_PTLOCK if pmd is
 	 * folded, and if so pagetable_pte_ctor() becomes nop.
 	 */
-	if (shift == PAGE_SHIFT)
+	if (type == TYPE_PTE)
 		BUG_ON(!pagetable_pte_ctor(ptdesc));
-	else if (shift == PMD_SHIFT)
+	else if (type == TYPE_PMD)
 		BUG_ON(!pagetable_pmd_ctor(ptdesc));
 
-	return pa;
+	return va;
+}
+
+static struct pgtable_ops pgd_pgtable_ops = {
+	.alloc = pgd_pgtable_alloc,
+	.map = pgd_pgtable_map,
+	.unmap = pgd_pgtable_unmap,
+};
+
+static struct pgtable_ops __pgd_pgtable_ops = {
+	.alloc = __pgd_pgtable_alloc,
+	.map = pgd_pgtable_map,
+	.unmap = pgd_pgtable_unmap,
+};
+
+#ifdef CONFIG_UNMAP_KERNEL_AT_EL0
+void create_kpti_ng_temp_pgd(pgd_t *pgdir, phys_addr_t phys, unsigned long virt,
+			     phys_addr_t size, pgprot_t prot,
+			     void *(*pgtable_alloc)(int, phys_addr_t *),
+			     int flags)
+{
+	struct pgtable_ops ops = {
+		.alloc = pgtable_alloc,
+		.map = pgd_pgtable_map,
+		.unmap = pgd_pgtable_unmap,
+	};
+
+	__create_pgd_mapping_locked(pgdir, phys, virt, size, prot, &ops, flags);
 }
+#endif
 
 /*
  * This function can only be used to modify existing table entries,
@@ -503,8 +643,8 @@ void __init create_mapping_noalloc(phys_addr_t phys, unsigned long virt,
 			&phys, virt);
 		return;
 	}
-	__create_pgd_mapping(init_mm.pgd, phys, virt, size, prot, NULL,
-			     NO_CONT_MAPPINGS);
+	__create_pgd_mapping(init_mm.pgd, phys, virt, size, prot,
+			     &early_pgtable_ops, NO_CONT_MAPPINGS | NO_ALLOC);
 }
 
 void __init create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys,
@@ -519,7 +659,7 @@ void __init create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys,
 		flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
 
 	__create_pgd_mapping(mm->pgd, phys, virt, size, prot,
-			     pgd_pgtable_alloc, flags);
+			     &pgd_pgtable_ops, flags);
 }
 
 static void update_mapping_prot(phys_addr_t phys, unsigned long virt,
@@ -531,8 +671,8 @@ static void update_mapping_prot(phys_addr_t phys, unsigned long virt,
 		return;
 	}
 
-	__create_pgd_mapping(init_mm.pgd, phys, virt, size, prot, NULL,
-			     NO_CONT_MAPPINGS);
+	__create_pgd_mapping(init_mm.pgd, phys, virt, size, prot,
+			     &pgd_pgtable_ops, NO_CONT_MAPPINGS | NO_ALLOC);
 
 	/* flush the TLBs after updating live kernel mappings */
 	flush_tlb_kernel_range(virt, virt + size);
@@ -542,7 +682,7 @@ static void __init __map_memblock(pgd_t *pgdp, phys_addr_t start,
 				  phys_addr_t end, pgprot_t prot, int flags)
 {
 	__create_pgd_mapping(pgdp, start, __phys_to_virt(start), end - start,
-			     prot, early_pgtable_alloc, flags);
+			     prot, &early_pgtable_ops, flags);
 }
 
 void __init mark_linear_text_alias_ro(void)
@@ -733,7 +873,7 @@ static int __init map_entry_trampoline(void)
 	memset(tramp_pg_dir, 0, PGD_SIZE);
 	__create_pgd_mapping(tramp_pg_dir, pa_start, TRAMP_VALIAS,
 			     entry_tramp_text_size(), prot,
-			     __pgd_pgtable_alloc, NO_BLOCK_MAPPINGS);
+			     &__pgd_pgtable_ops, NO_BLOCK_MAPPINGS);
 
 	/* Map both the text and data into the kernel page table */
 	for (i = 0; i < DIV_ROUND_UP(entry_tramp_text_size(), PAGE_SIZE); i++)
@@ -1335,7 +1475,7 @@ int arch_add_memory(int nid, u64 start, u64 size,
 		flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
 
 	__create_pgd_mapping(swapper_pg_dir, start, __phys_to_virt(start),
-			     size, params->pgprot, __pgd_pgtable_alloc,
+			     size, params->pgprot, &__pgd_pgtable_ops,
 			     flags);
 
 	memblock_clear_nomap(start, size);
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 85fc7554cd52..1d9e91847cd8 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -83,6 +83,14 @@ static inline unsigned long pud_index(unsigned long address)
 #define pud_index pud_index
 #endif
 
+#ifndef p4d_index
+static inline unsigned long p4d_index(unsigned long address)
+{
+	return (address >> P4D_SHIFT) & (PTRS_PER_P4D - 1);
+}
+#define p4d_index p4d_index
+#endif
+
 #ifndef pgd_index
 /* Must be a compile-time constant, so implement it as a macro */
 #define pgd_index(a)  (((a) >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1))
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v1 3/3] arm64: mm: Lazily clear pte table mappings from fixmap
  2024-03-26 10:14 ` Ryan Roberts
@ 2024-03-26 10:14   ` Ryan Roberts
  -1 siblings, 0 replies; 38+ messages in thread
From: Ryan Roberts @ 2024-03-26 10:14 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Mark Rutland, Ard Biesheuvel,
	David Hildenbrand, Donald Dutile, Eric Chanudet
  Cc: Ryan Roberts, linux-arm-kernel, linux-kernel

With the pgtable operations abstracted into `struct pgtable_ops`, the
early pgtable alloc, map and unmap operations are now nicely
centralized. So let's enhance the implementation to speed up the
clearing of pte table mappings in the fixmap.

Extend the fixmap so that we now have 16 slots dedicated for pte
tables. At alloc/map time, we select the next slot in the series and
map it. If we are at the end and no more slots are available, clear
down all of the slots and start at the beginning again. Batching the
clear like this means we can issue the TLBIs more efficiently.

Due to the batching, there may still be some slots mapped at the end,
so address this by adding an optional cleanup() function to `struct
pgtable_ops` to handle this for us.
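
In outline, the slot handling added in the mmu.c hunk below boils down to
the following (an annotated copy of the new helpers, reproduced here for
clarity):

	/* Hand out fixmap slots for pte tables; clear them 16 at a time. */
	static int pte_slot_next __initdata = FIX_PTE_BEGIN;	/* counts down towards FIX_PTE_END */

	static int __init pte_fixmap_slot(void)
	{
		if (pte_slot_next < FIX_PTE_END)	/* all NR_PTE_SLOTS slots used... */
			clear_pte_fixmap_slots();	/* ...so clear them with one ranged TLBI */
		return pte_slot_next--;			/* hand out the next free slot */
	}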

Execution time of map_mem(), which creates the kernel linear map page
tables, was measured on different machines with different RAM configs:

               | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
               | VM, 16G     | VM, 64G     | VM, 256G    | Metal, 512G
---------------|-------------|-------------|-------------|-------------
               |   ms    (%) |   ms    (%) |   ms    (%) |    ms    (%)
---------------|-------------|-------------|-------------|-------------
before         |   77   (0%) |  375   (0%) | 1532   (0%) |  3366   (0%)
after          |   63 (-18%) |  330 (-12%) | 1312 (-14%) |  2929 (-13%)

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/arm64/include/asm/fixmap.h  |  5 +++-
 arch/arm64/include/asm/pgtable.h |  4 ---
 arch/arm64/mm/fixmap.c           | 11 ++++++++
 arch/arm64/mm/mmu.c              | 44 +++++++++++++++++++++++++++++---
 4 files changed, 56 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/include/asm/fixmap.h b/arch/arm64/include/asm/fixmap.h
index 87e307804b99..91fcd7c5c513 100644
--- a/arch/arm64/include/asm/fixmap.h
+++ b/arch/arm64/include/asm/fixmap.h
@@ -84,7 +84,9 @@ enum fixed_addresses {
 	 * Used for kernel page table creation, so unmapped memory may be used
 	 * for tables.
 	 */
-	FIX_PTE,
+#define NR_PTE_SLOTS		16
+	FIX_PTE_END,
+	FIX_PTE_BEGIN = FIX_PTE_END + NR_PTE_SLOTS - 1,
 	FIX_PMD,
 	FIX_PUD,
 	FIX_P4D,
@@ -108,6 +110,7 @@ void __init early_fixmap_init(void);
 #define __late_clear_fixmap(idx) __set_fixmap((idx), 0, FIXMAP_PAGE_CLEAR)
 
 extern void __set_fixmap(enum fixed_addresses idx, phys_addr_t phys, pgprot_t prot);
+void __init clear_fixmap_nosync(enum fixed_addresses idx);
 
 #include <asm-generic/fixmap.h>
 
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index afdd56d26ad7..bd5d02f3f0a3 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -686,10 +686,6 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd)
 /* Find an entry in the third-level page table. */
 #define pte_offset_phys(dir,addr)	(pmd_page_paddr(READ_ONCE(*(dir))) + pte_index(addr) * sizeof(pte_t))
 
-#define pte_set_fixmap(addr)		((pte_t *)set_fixmap_offset(FIX_PTE, addr))
-#define pte_set_fixmap_offset(pmd, addr)	pte_set_fixmap(pte_offset_phys(pmd, addr))
-#define pte_clear_fixmap()		clear_fixmap(FIX_PTE)
-
 #define pmd_page(pmd)			phys_to_page(__pmd_to_phys(pmd))
 
 /* use ONLY for statically allocated translation tables */
diff --git a/arch/arm64/mm/fixmap.c b/arch/arm64/mm/fixmap.c
index de1e09d986ad..f83385f6ab86 100644
--- a/arch/arm64/mm/fixmap.c
+++ b/arch/arm64/mm/fixmap.c
@@ -131,6 +131,17 @@ void __set_fixmap(enum fixed_addresses idx,
 	}
 }
 
+void __init clear_fixmap_nosync(enum fixed_addresses idx)
+{
+	unsigned long addr = __fix_to_virt(idx);
+	pte_t *ptep;
+
+	BUG_ON(idx <= FIX_HOLE || idx >= __end_of_fixed_addresses);
+
+	ptep = fixmap_pte(addr);
+	pte_clear(&init_mm, addr, ptep);
+}
+
 void *__init fixmap_remap_fdt(phys_addr_t dt_phys, int *size, pgprot_t prot)
 {
 	const u64 dt_virt_base = __fix_to_virt(FIX_FDT);
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 81702b91b107..1b2a2a2d09b7 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -66,11 +66,14 @@ enum pgtable_type {
  *              mapped either as a result of a previous call to alloc() or
  *              map(). The page's virtual address must be considered invalid
  *              after this call returns.
+ * @cleanup:    (Optional) Called at the end of a set of operations to cleanup
+ *              any lazy state.
  */
 struct pgtable_ops {
 	void *(*alloc)(int type, phys_addr_t *pa);
 	void *(*map)(int type, void *parent, unsigned long addr);
 	void (*unmap)(int type);
+	void (*cleanup)(void);
 };
 
 #define NO_BLOCK_MAPPINGS	BIT(0)
@@ -139,6 +142,29 @@ pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn,
 }
 EXPORT_SYMBOL(phys_mem_access_prot);
 
+static int pte_slot_next __initdata = FIX_PTE_BEGIN;
+
+static void __init clear_pte_fixmap_slots(void)
+{
+	unsigned long start = __fix_to_virt(FIX_PTE_BEGIN);
+	unsigned long end = __fix_to_virt(pte_slot_next);
+	int i;
+
+	for (i = FIX_PTE_BEGIN; i > pte_slot_next; i--)
+		clear_fixmap_nosync(i);
+
+	flush_tlb_kernel_range(start, end);
+	pte_slot_next = FIX_PTE_BEGIN;
+}
+
+static int __init pte_fixmap_slot(void)
+{
+	if (pte_slot_next < FIX_PTE_END)
+		clear_pte_fixmap_slots();
+
+	return pte_slot_next--;
+}
+
 static void __init early_pgtable_unmap(int type)
 {
 	switch (type) {
@@ -152,7 +178,7 @@ static void __init early_pgtable_unmap(int type)
 		pmd_clear_fixmap();
 		break;
 	case TYPE_PTE:
-		pte_clear_fixmap();
+		// Unmap lazily: see clear_pte_fixmap_slots().
 		break;
 	default:
 		BUG();
@@ -161,7 +187,9 @@ static void __init early_pgtable_unmap(int type)
 
 static void *__init early_pgtable_map(int type, void *parent, unsigned long addr)
 {
+	phys_addr_t pa;
 	void *entry;
+	int slot;
 
 	switch (type) {
 	case TYPE_P4D:
@@ -174,7 +202,10 @@ static void *__init early_pgtable_map(int type, void *parent, unsigned long addr
 		entry = pmd_set_fixmap_offset((pud_t *)parent, addr);
 		break;
 	case TYPE_PTE:
-		entry = pte_set_fixmap_offset((pmd_t *)parent, addr);
+		slot = pte_fixmap_slot();
+		pa = pte_offset_phys((pmd_t *)parent, addr);
+		set_fixmap(slot, pa);
+		entry = (pte_t *)(__fix_to_virt(slot) + (pa & (PAGE_SIZE - 1)));
 		break;
 	default:
 		BUG();
@@ -186,6 +217,7 @@ static void *__init early_pgtable_map(int type, void *parent, unsigned long addr
 static void *__init early_pgtable_alloc(int type, phys_addr_t *pa)
 {
 	void *va;
+	int slot;
 
 	*pa = memblock_phys_alloc_range(PAGE_SIZE, PAGE_SIZE, 0,
 					MEMBLOCK_ALLOC_NOLEAKTRACE);
@@ -203,7 +235,9 @@ static void *__init early_pgtable_alloc(int type, phys_addr_t *pa)
 		va = pmd_set_fixmap(*pa);
 		break;
 	case TYPE_PTE:
-		va = pte_set_fixmap(*pa);
+		slot = pte_fixmap_slot();
+		set_fixmap(slot, *pa);
+		va = (pte_t *)__fix_to_virt(slot);
 		break;
 	default:
 		BUG();
@@ -220,6 +254,7 @@ static struct pgtable_ops early_pgtable_ops = {
 	.alloc = early_pgtable_alloc,
 	.map = early_pgtable_map,
 	.unmap = early_pgtable_unmap,
+	.cleanup = clear_pte_fixmap_slots,
 };
 
 bool pgattr_change_is_safe(u64 old, u64 new)
@@ -527,6 +562,9 @@ static void __create_pgd_mapping_locked(pgd_t *pgdir, phys_addr_t phys,
 		alloc_init_p4d(pgdp, addr, next, phys, prot, ops, flags);
 		phys += next - addr;
 	} while (pgdp++, addr = next, addr != end);
+
+	if (ops->cleanup)
+		ops->cleanup();
 }
 
 static void __create_pgd_mapping(pgd_t *pgdir, phys_addr_t phys,
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH v1 2/3] arm64: mm: Don't remap pgtables for allocate vs populate
  2024-03-26 10:14   ` Ryan Roberts
@ 2024-03-27  2:05     ` kernel test robot
  -1 siblings, 0 replies; 38+ messages in thread
From: kernel test robot @ 2024-03-27  2:05 UTC (permalink / raw)
  To: Ryan Roberts, Catalin Marinas, Will Deacon, Mark Rutland,
	Ard Biesheuvel, David Hildenbrand, Donald Dutile, Eric Chanudet
  Cc: oe-kbuild-all, Ryan Roberts, linux-arm-kernel, linux-kernel

Hi Ryan,

kernel test robot noticed the following build errors:

[auto build test ERROR on linus/master]
[also build test ERROR on v6.9-rc1 next-20240326]
[cannot apply to arm64/for-next/core arm-perf/for-next/perf arm/for-next arm/fixes kvmarm/next soc/for-next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Ryan-Roberts/arm64-mm-Don-t-remap-pgtables-per-cont-pte-pmd-block/20240326-181754
base:   linus/master
patch link:    https://lore.kernel.org/r/20240326101448.3453626-3-ryan.roberts%40arm.com
patch subject: [PATCH v1 2/3] arm64: mm: Don't remap pgtables for allocate vs populate
config: x86_64-rhel-8.3 (https://download.01.org/0day-ci/archive/20240327/202403270906.zFGZ2FXl-lkp@intel.com/config)
compiler: gcc-12 (Ubuntu 12.3.0-9ubuntu2) 12.3.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240327/202403270906.zFGZ2FXl-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202403270906.zFGZ2FXl-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from include/linux/mm.h:29,
                    from include/linux/memcontrol.h:21,
                    from include/linux/swap.h:9,
                    from include/linux/suspend.h:5,
                    from arch/x86/kernel/asm-offsets.c:14:
>> include/linux/pgtable.h:87:29: error: redefinition of 'p4d_index'
      87 | static inline unsigned long p4d_index(unsigned long address)
         |                             ^~~~~~~~~
   In file included from arch/x86/include/asm/tlbflush.h:16,
                    from arch/x86/include/asm/uaccess.h:17,
                    from include/linux/uaccess.h:11,
                    from include/linux/sched/task.h:13,
                    from include/linux/sched/signal.h:9,
                    from include/linux/rcuwait.h:6,
                    from include/linux/percpu-rwsem.h:7,
                    from include/linux/fs.h:33,
                    from include/linux/compat.h:17,
                    from arch/x86/include/asm/ia32.h:7,
                    from arch/x86/include/asm/elf.h:10,
                    from include/linux/elf.h:6,
                    from include/linux/module.h:19,
                    from include/crypto/aria.h:22,
                    from arch/x86/kernel/asm-offsets.c:10:
   arch/x86/include/asm/pgtable.h:1134:29: note: previous definition of 'p4d_index' with type 'long unsigned int(long unsigned int)'
    1134 | static inline unsigned long p4d_index(unsigned long address)
         |                             ^~~~~~~~~
   make[3]: *** [scripts/Makefile.build:117: arch/x86/kernel/asm-offsets.s] Error 1
   make[3]: Target 'prepare' not remade because of errors.
   make[2]: *** [Makefile:1197: prepare0] Error 2
   make[2]: Target 'prepare' not remade because of errors.
   make[1]: *** [Makefile:240: __sub-make] Error 2
   make[1]: Target 'prepare' not remade because of errors.
   make: *** [Makefile:240: __sub-make] Error 2
   make: Target 'prepare' not remade because of errors.


vim +/p4d_index +87 include/linux/pgtable.h

    85	
    86	#ifndef p4d_index
  > 87	static inline unsigned long p4d_index(unsigned long address)
    88	{
    89		return (address >> P4D_SHIFT) & (PTRS_PER_P4D - 1);
    90	}
    91	#define p4d_index p4d_index
    92	#endif
    93	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 38+ messages in thread
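
The clash above happens because x86 already provides its own p4d_index()
but does not advertise it with a matching #define, so the new generic
fallback guarded by #ifndef p4d_index gets compiled as well:

	/* The generic fallback (from the patch) only backs off if the arch
	 * also defines a macro with the same name: */
	#ifndef p4d_index
	static inline unsigned long p4d_index(unsigned long address)
	{
		return (address >> P4D_SHIFT) & (PTRS_PER_P4D - 1);
	}
	#define p4d_index p4d_index
	#endif

	/* x86 has a static inline p4d_index() but no such #define, so both
	 * definitions end up compiled. One hypothetical fix (for illustration
	 * only, not part of this series) is to add, right after x86's
	 * existing definition: */
	#define p4d_index p4d_index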

* Re: [PATCH v1 0/3] Speed up boot with faster linear map creation
  2024-03-26 10:14 ` Ryan Roberts
@ 2024-03-27 10:09   ` Ard Biesheuvel
  -1 siblings, 0 replies; 38+ messages in thread
From: Ard Biesheuvel @ 2024-03-27 10:09 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Catalin Marinas, Will Deacon, Mark Rutland, David Hildenbrand,
	Donald Dutile, Eric Chanudet, linux-arm-kernel, linux-kernel

Hi Ryan,

On Tue, 26 Mar 2024 at 12:15, Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> Hi All,
>
> It turns out that creating the linear map can take a significant proportion of
> the total boot time, especially when rodata=full. And a large portion of the
> time it takes to create the linear map is issuing TLBIs. This series reworks the
> kernel pgtable generation code to significantly reduce the number of TLBIs. See
> each patch for details.
>
> The below shows the execution time of map_mem() across a couple of different
> systems with different RAM configurations. We measure after applying each patch
> and show the improvement relative to base (v6.9-rc1):
>
>                | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
>                | VM, 16G     | VM, 64G     | VM, 256G    | Metal, 512G
> ---------------|-------------|-------------|-------------|-------------
>                |   ms    (%) |   ms    (%) |   ms    (%) |    ms    (%)
> ---------------|-------------|-------------|-------------|-------------
> base           |  151   (0%) | 2191   (0%) | 8990   (0%) | 17443   (0%)
> no-cont-remap  |   77 (-49%) |  429 (-80%) | 1753 (-80%) |  3796 (-78%)
> no-alloc-remap |   77 (-49%) |  375 (-83%) | 1532 (-83%) |  3366 (-81%)
> lazy-unmap     |   63 (-58%) |  330 (-85%) | 1312 (-85%) |  2929 (-83%)
>
> This series applies on top of v6.9-rc1. All mm selftests pass. I haven't yet
> tested all VA size configs (although I don't anticipate any issues); I'll do
> this as part of followup.
>

These are very nice results!

Before digging into the details: do we still have a strong case for
supporting contiguous PTEs and PMDs in these routines?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v1 0/3] Speed up boot with faster linear map creation
  2024-03-27 10:09   ` Ard Biesheuvel
@ 2024-03-27 10:43     ` Ryan Roberts
  -1 siblings, 0 replies; 38+ messages in thread
From: Ryan Roberts @ 2024-03-27 10:43 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Catalin Marinas, Will Deacon, Mark Rutland, David Hildenbrand,
	Donald Dutile, Eric Chanudet, linux-arm-kernel, linux-kernel

On 27/03/2024 10:09, Ard Biesheuvel wrote:
> Hi Ryan,
> 
> On Tue, 26 Mar 2024 at 12:15, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> Hi All,
>>
>> It turns out that creating the linear map can take a significant proportion of
>> the total boot time, especially when rodata=full. And a large portion of the
>> time it takes to create the linear map is issuing TLBIs. This series reworks the
>> kernel pgtable generation code to significantly reduce the number of TLBIs. See
>> each patch for details.
>>
>> The below shows the execution time of map_mem() across a couple of different
>> systems with different RAM configurations. We measure after applying each patch
>> and show the improvement relative to base (v6.9-rc1):
>>
>>                | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
>>                | VM, 16G     | VM, 64G     | VM, 256G    | Metal, 512G
>> ---------------|-------------|-------------|-------------|-------------
>>                |   ms    (%) |   ms    (%) |   ms    (%) |    ms    (%)
>> ---------------|-------------|-------------|-------------|-------------
>> base           |  151   (0%) | 2191   (0%) | 8990   (0%) | 17443   (0%)
>> no-cont-remap  |   77 (-49%) |  429 (-80%) | 1753 (-80%) |  3796 (-78%)
>> no-alloc-remap |   77 (-49%) |  375 (-83%) | 1532 (-83%) |  3366 (-81%)
>> lazy-unmap     |   63 (-58%) |  330 (-85%) | 1312 (-85%) |  2929 (-83%)
>>
>> This series applies on top of v6.9-rc1. All mm selftests pass. I haven't yet
>> tested all VA size configs (although I don't anticipate any issues); I'll do
>> this as part of followup.
>>
> 
> These are very nice results!
> 
> Before digging into the details: do we still have a strong case for
> supporting contiguous PTEs and PMDs in these routines?

We are currently using contptes and pmds for the linear map when rodata=[on|off]
IIRC? I don't see a need to remove the capability personally.

Also I was talking with Mark R yesterday and he suggested that an even better
solution might be to create a temp pgtable that maps the linear map with pmds,
switch to it, then create the real pgtable that maps the linear map with ptes,
then switch to that. The benefit being that we can avoid the fixmap entirely
when creating the second pgtable - we think this would likely be significantly
faster still.
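
Roughly, that flow would look something like this (purely illustrative -
the helper names are made up and the ttbr1 switching details are
hand-waved):

	/* Illustrative sketch only; none of these helpers exist in this series. */
	create_linear_map(tmp_pg_dir, /*leaf=*/PMD_SIZE);	/* pass 1: block mappings */
	switch_ttbr1_to(tmp_pg_dir);				/* linear map is now live */
	create_linear_map(swapper_pg_dir, /*leaf=*/PAGE_SIZE);	/* pass 2: build the real
								   tables through the live
								   linear map - no fixmap */
	switch_ttbr1_to(swapper_pg_dir);			/* and switch to them */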

My second patch adds the infrastructure to make this possible. But your changes
for LPA2 make it significantly more effort; since that change we are now using
the swapper pgtable when we populate the linear map into it - the kernel is
already mapped and that isn't done in paging_init() anymore. So I'm not quite
sure how we can easily make that work at the moment.

Thanks,
Ryan


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v1 0/3] Speed up boot with faster linear map creation
  2024-03-26 10:14 ` Ryan Roberts
@ 2024-03-27 11:06   ` Itaru Kitayama
  -1 siblings, 0 replies; 38+ messages in thread
From: Itaru Kitayama @ 2024-03-27 11:06 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Catalin Marinas, Will Deacon, Mark Rutland, Ard Biesheuvel,
	David Hildenbrand, Donald Dutile, Eric Chanudet,
	linux-arm-kernel, linux-kernel

On Tue, Mar 26, 2024 at 10:14:45AM +0000, Ryan Roberts wrote:
> Hi All,
> 
> It turns out that creating the linear map can take a significant proportion of
> the total boot time, especially when rodata=full. And a large portion of the
> time it takes to create the linear map is issuing TLBIs. This series reworks the
> kernel pgtable generation code to significantly reduce the number of TLBIs. See
> each patch for details.
> 
> The below shows the execution time of map_mem() across a couple of different
> systems with different RAM configurations. We measure after applying each patch
> and show the improvement relative to base (v6.9-rc1):
> 
>                | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
>                | VM, 16G     | VM, 64G     | VM, 256G    | Metal, 512G
> ---------------|-------------|-------------|-------------|-------------
>                |   ms    (%) |   ms    (%) |   ms    (%) |    ms    (%)
> ---------------|-------------|-------------|-------------|-------------
> base           |  151   (0%) | 2191   (0%) | 8990   (0%) | 17443   (0%)
> no-cont-remap  |   77 (-49%) |  429 (-80%) | 1753 (-80%) |  3796 (-78%)
> no-alloc-remap |   77 (-49%) |  375 (-83%) | 1532 (-83%) |  3366 (-81%)
> lazy-unmap     |   63 (-58%) |  330 (-85%) | 1312 (-85%) |  2929 (-83%)
> 
> This series applies on top of v6.9-rc1. All mm selftests pass. I haven't yet
> tested all VA size configs (although I don't anticipate any issues); I'll do
> this as part of followup.

The series was applied cleanly on top of v6.9-rc1+ of Linus's master
branch, and boots fine on M1 VM with 14GB of memory.

Just out of curiosity, how did you measure the boot time and obtain the
breakdown of the execution times of each phase?

Tested-by: Itaru Kitayama <itaru.kitayama@fujitsu.com>

Thanks,
Itaru.

> 
> Thanks,
> Ryan
> 
> 
> Ryan Roberts (3):
>   arm64: mm: Don't remap pgtables per- cont(pte|pmd) block
>   arm64: mm: Don't remap pgtables for allocate vs populate
>   arm64: mm: Lazily clear pte table mappings from fixmap
> 
>  arch/arm64/include/asm/fixmap.h  |   5 +-
>  arch/arm64/include/asm/mmu.h     |   8 +
>  arch/arm64/include/asm/pgtable.h |   4 -
>  arch/arm64/kernel/cpufeature.c   |  10 +-
>  arch/arm64/mm/fixmap.c           |  11 +
>  arch/arm64/mm/mmu.c              | 364 +++++++++++++++++++++++--------
>  include/linux/pgtable.h          |   8 +
>  7 files changed, 307 insertions(+), 103 deletions(-)
> 
> --
> 2.25.1
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v1 0/3] Speed up boot with faster linear map creation
  2024-03-27 11:06   ` Itaru Kitayama
@ 2024-03-27 11:10     ` Ryan Roberts
  -1 siblings, 0 replies; 38+ messages in thread
From: Ryan Roberts @ 2024-03-27 11:10 UTC (permalink / raw)
  To: Itaru Kitayama
  Cc: Catalin Marinas, Will Deacon, Mark Rutland, Ard Biesheuvel,
	David Hildenbrand, Donald Dutile, Eric Chanudet,
	linux-arm-kernel, linux-kernel

On 27/03/2024 11:06, Itaru Kitayama wrote:
> On Tue, Mar 26, 2024 at 10:14:45AM +0000, Ryan Roberts wrote:
>> Hi All,
>>
>> It turns out that creating the linear map can take a significant proportion of
>> the total boot time, especially when rodata=full. And a large portion of the
>> time it takes to create the linear map is issuing TLBIs. This series reworks the
>> kernel pgtable generation code to significantly reduce the number of TLBIs. See
>> each patch for details.
>>
>> The below shows the execution time of map_mem() across a couple of different
>> systems with different RAM configurations. We measure after applying each patch
>> and show the improvement relative to base (v6.9-rc1):
>>
>>                | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
>>                | VM, 16G     | VM, 64G     | VM, 256G    | Metal, 512G
>> ---------------|-------------|-------------|-------------|-------------
>>                |   ms    (%) |   ms    (%) |   ms    (%) |    ms    (%)
>> ---------------|-------------|-------------|-------------|-------------
>> base           |  151   (0%) | 2191   (0%) | 8990   (0%) | 17443   (0%)
>> no-cont-remap  |   77 (-49%) |  429 (-80%) | 1753 (-80%) |  3796 (-78%)
>> no-alloc-remap |   77 (-49%) |  375 (-83%) | 1532 (-83%) |  3366 (-81%)
>> lazy-unmap     |   63 (-58%) |  330 (-85%) | 1312 (-85%) |  2929 (-83%)
>>
>> This series applies on top of v6.9-rc1. All mm selftests pass. I haven't yet
>> tested all VA size configs (although I don't anticipate any issues); I'll do
>> this as part of followup.
> 
> The series was applied cleanly on top of v6.9-rc1+ of Linus's master
> branch, and boots fine on M1 VM with 14GB of memory.
> 
> Just out of curiosity, how did you measure the boot time and obtain the
> breakdown of the execution times of each phase?

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 495b732d5af3..8a9d47115784 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -792,7 +792,14 @@ static void __init create_idmap(void)

 void __init paging_init(void)
 {
+       u64 start, end;
+
+       start = __arch_counter_get_cntvct();
        map_mem(swapper_pg_dir);
+       end = __arch_counter_get_cntvct();
+
+       pr_err("map_mem: time=%llu us\n",
+               ((end - start) * 1000000) / arch_timer_get_cntfrq());

        memblock_allow_resize();

> 
> Tested-by: Itaru Kitayama <itaru.kitayama@fujitsu.com>

Thanks!

> 
> Thanks,
> Itaru.
> 
>>
>> Thanks,
>> Ryan
>>
>>
>> Ryan Roberts (3):
>>   arm64: mm: Don't remap pgtables per- cont(pte|pmd) block
>>   arm64: mm: Don't remap pgtables for allocate vs populate
>>   arm64: mm: Lazily clear pte table mappings from fixmap
>>
>>  arch/arm64/include/asm/fixmap.h  |   5 +-
>>  arch/arm64/include/asm/mmu.h     |   8 +
>>  arch/arm64/include/asm/pgtable.h |   4 -
>>  arch/arm64/kernel/cpufeature.c   |  10 +-
>>  arch/arm64/mm/fixmap.c           |  11 +
>>  arch/arm64/mm/mmu.c              | 364 +++++++++++++++++++++++--------
>>  include/linux/pgtable.h          |   8 +
>>  7 files changed, 307 insertions(+), 103 deletions(-)
>>
>> --
>> 2.25.1
>>


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH v1 0/3] Speed up boot with faster linear map creation
  2024-03-27 10:43     ` Ryan Roberts
@ 2024-03-27 13:36       ` Ard Biesheuvel
  -1 siblings, 0 replies; 38+ messages in thread
From: Ard Biesheuvel @ 2024-03-27 13:36 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Catalin Marinas, Will Deacon, Mark Rutland, David Hildenbrand,
	Donald Dutile, Eric Chanudet, linux-arm-kernel, linux-kernel

On Wed, 27 Mar 2024 at 12:43, Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 27/03/2024 10:09, Ard Biesheuvel wrote:
> > Hi Ryan,
> >
> > On Tue, 26 Mar 2024 at 12:15, Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> Hi All,
> >>
> >> It turns out that creating the linear map can take a significant proportion of
> >> the total boot time, especially when rodata=full. And a large portion of the
> >> time it takes to create the linear map is issuing TLBIs. This series reworks the
> >> kernel pgtable generation code to significantly reduce the number of TLBIs. See
> >> each patch for details.
> >>
> >> The below shows the execution time of map_mem() across a couple of different
> >> systems with different RAM configurations. We measure after applying each patch
> >> and show the improvement relative to base (v6.9-rc1):
> >>
> >>                | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
> >>                | VM, 16G     | VM, 64G     | VM, 256G    | Metal, 512G
> >> ---------------|-------------|-------------|-------------|-------------
> >>                |   ms    (%) |   ms    (%) |   ms    (%) |    ms    (%)
> >> ---------------|-------------|-------------|-------------|-------------
> >> base           |  151   (0%) | 2191   (0%) | 8990   (0%) | 17443   (0%)
> >> no-cont-remap  |   77 (-49%) |  429 (-80%) | 1753 (-80%) |  3796 (-78%)
> >> no-alloc-remap |   77 (-49%) |  375 (-83%) | 1532 (-83%) |  3366 (-81%)
> >> lazy-unmap     |   63 (-58%) |  330 (-85%) | 1312 (-85%) |  2929 (-83%)
> >>
> >> This series applies on top of v6.9-rc1. All mm selftests pass. I haven't yet
> >> tested all VA size configs (although I don't anticipate any issues); I'll do
> >> this as part of followup.
> >>
> >
> > These are very nice results!
> >
> > Before digging into the details: do we still have a strong case for
> > supporting contiguous PTEs and PMDs in these routines?
>
> We are currently using contptes and pmds for the linear map when rodata=[on|off]
> IIRC?

In principle, yes. In practice?

> I don't see a need to remove the capability personally.
>

Since we are making changes here, it is a relevant question to ask imho.

> Also I was talking with Mark R yesterday and he suggested that an even better
> solution might be to create a temp pgtable that maps the linear map with pmds,
> switch to it, then create the real pgtable that maps the linear map with ptes,
> then switch to that. The benefit being that we can avoid the fixmap entirely
> when creating the second pgtable - we think this would likely be significantly
> faster still.
>

If this is going to be a temporary mapping for the duration of the
initial population of the linear map page tables, we might just as
well use a 1:1 TTBR0 mapping here, which would be completely disjoint
from swapper. And we'd only need to map memory that is being used for
page tables, so on those large systems we'd need to map only a small
slice. Maybe it's time to bring back the memblock alloc limit so we
can manage this more easily?

> My second patch adds the infrastructure to make this possible. But your changes
> for LPA2 make it significantly more effort; since that change we are now using
> the swapper pgtable when we populate the linear map into it - the kernel is
> already mapped and that isn't done in paging_init() anymore. So I'm not quite
> sure how we can easily make that work at the moment.
>

I think a mix of the fixmap approach with a 1:1 map could work here,
roughly as sketched below:
- use TTBR0 to create a temp 1:1 map of DRAM
- map page tables lazily as they are allocated but using a coarse mapping
- avoid all TLB maintenance except at the end when tearing down the 1:1 mapping.
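
In rough pseudocode (illustrative only; the install/uninstall helpers
below are invented names, not real API):

	/* Sketch of the 1:1-map idea above - not code from this series. */
	void __init map_mem(pgd_t *pgdp)
	{
		/* 1. coarse TTBR0 1:1 map of the memory that will hold pgtables */
		install_ttbr0_idmap_of_pgtable_memory();

		/* 2. populate the linear map; freshly allocated tables are
		 *    written through the 1:1 window, so no fixmap and no
		 *    per-table TLBI is needed
		 */
		/* ... existing __map_memblock() loops ... */

		/* 3. tear down the 1:1 map with a single round of TLB maintenance */
		uninstall_ttbr0_idmap();
		flush_tlb_all();
	}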

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v1 0/3] Speed up boot with faster linear map creation
  2024-03-27 13:36       ` Ard Biesheuvel
@ 2024-03-27 15:01         ` Ryan Roberts
  -1 siblings, 0 replies; 38+ messages in thread
From: Ryan Roberts @ 2024-03-27 15:01 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Catalin Marinas, Will Deacon, Mark Rutland, David Hildenbrand,
	Donald Dutile, Eric Chanudet, linux-arm-kernel, linux-kernel

On 27/03/2024 13:36, Ard Biesheuvel wrote:
> On Wed, 27 Mar 2024 at 12:43, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 27/03/2024 10:09, Ard Biesheuvel wrote:
>>> Hi Ryan,
>>>
>>> On Tue, 26 Mar 2024 at 12:15, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> Hi All,
>>>>
>>>> It turns out that creating the linear map can take a significant proportion of
>>>> the total boot time, especially when rodata=full. And a large portion of the
>>>> time it takes to create the linear map is issuing TLBIs. This series reworks the
>>>> kernel pgtable generation code to significantly reduce the number of TLBIs. See
>>>> each patch for details.
>>>>
>>>> The below shows the execution time of map_mem() across a couple of different
>>>> systems with different RAM configurations. We measure after applying each patch
>>>> and show the improvement relative to base (v6.9-rc1):
>>>>
>>>>                | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
>>>>                | VM, 16G     | VM, 64G     | VM, 256G    | Metal, 512G
>>>> ---------------|-------------|-------------|-------------|-------------
>>>>                |   ms    (%) |   ms    (%) |   ms    (%) |    ms    (%)
>>>> ---------------|-------------|-------------|-------------|-------------
>>>> base           |  151   (0%) | 2191   (0%) | 8990   (0%) | 17443   (0%)
>>>> no-cont-remap  |   77 (-49%) |  429 (-80%) | 1753 (-80%) |  3796 (-78%)
>>>> no-alloc-remap |   77 (-49%) |  375 (-83%) | 1532 (-83%) |  3366 (-81%)
>>>> lazy-unmap     |   63 (-58%) |  330 (-85%) | 1312 (-85%) |  2929 (-83%)
>>>>
>>>> This series applies on top of v6.9-rc1. All mm selftests pass. I haven't yet
>>>> tested all VA size configs (although I don't anticipate any issues); I'll do
>>>> this as part of followup.
>>>>
>>>
>>> These are very nice results!
>>>
>>> Before digging into the details: do we still have a strong case for
>>> supporting contiguous PTEs and PMDs in these routines?
>>
>> We are currently using contptes and pmds for the linear map when rodata=[on|off]
>> IIRC?
> 
> In principle, yes. In practice?
> 
>> I don't see a need to remove the capability personally.
>>
> 
> Since we are making changes here, it is a relevant question to ask imho.
> 
>> Also I was talking with Mark R yesterday and he suggested that an even better
>> solution might be to create a temp pgtable that maps the linear map with pmds,
>> switch to it, then create the real pgtable that maps the linear map with ptes,
>> then switch to that. The benefit being that we can avoid the fixmap entirely
>> when creating the second pgtable - we think this would likely be significantly
>> faster still.
>>
> 
> If this is going to be a temporary mapping for the duration of the
> initial population of the linear map page tables, we might just as
> well use a 1:1 TTBR0 mapping here, which would be completely disjoint
> from swapper. And we'd only need to map memory that is being used for
> page tables, so on those large systems we'd need to map only a small
> slice. Maybe it's time to bring back the memblock alloc limit so we
> can manage this more easily?
> 
>> My second patch adds the infrastructure to make this possible. But your changes
>> for LPA2 make it significantly more effort; since that change we are now using
>> the swapper pgtable when we populate the linear map into it - the kernel is
>> already mapped and that isn't done in paging_init() anymore. So I'm not quite
>> sure how we can easily make that work at the moment.
>>
> 
> I think a mix of the fixmap approach with a 1:1 map could work here:
> - use TTBR0 to create a temp 1:1 map of DRAM
> - map page tables lazily as they are allocated but using a coarse mapping
> - avoid all TLB maintenance except at the end when tearing down the 1:1 mapping.

Yes that could work I think. So to make sure I've understood:

 - create a 1:1 map for all of DRAM using block and cont mappings where possible
     - use memblock_phys_alloc_*() to allocate pgtable memory
     - access via fixmap (should be minimal due to block mappings)
 - install it in TTBR0
 - create all the swapper mappings as normal (no block or cont mappings)
     - use memblock_phys_alloc_*() to alloc pgtable memory
     - phys address is also virtual address due to installed 1:1 map
 - Remove 1:1 map from TTBR0
 - memblock_phys_free() all the memory associated with 1:1 map

That sounds doable on top of the first 2 patches in this series - I'll have a
crack. The only missing piece is depth-first 1:1 map traversal to free the
tables. I'm guessing something already exists that I can repurpose?
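
A minimal sketch of the flow I have in mind (create_dram_idmap(),
map_mem_via_idmap() and free_idmap_tables() are made-up names for the steps
above; the TTBR0/TLB helpers and the memblock calls are the existing
interfaces already mentioned in this thread):

static void __init map_mem_with_temp_idmap(void)
{
	/* 1:1 map of DRAM: block/cont mappings, pgtables written via fixmap */
	phys_addr_t dram_idmap = create_dram_idmap();

	cpu_install_ttbr0(dram_idmap, TCR_T0SZ(vabits_actual));

	/*
	 * Populate swapper's linear map at page granularity; pgtable memory
	 * comes from memblock_phys_alloc_*() and can be written directly
	 * through the 1:1 map, so no fixmap accesses and no TLBIs here.
	 */
	map_mem_via_idmap(swapper_pg_dir);

	/* Tear down the temporary map; a single flush at the end. */
	cpu_set_reserved_ttbr0();
	local_flush_tlb_all();

	/* Depth-first walk handing the idmap's tables back to memblock. */
	free_idmap_tables(dram_idmap);
}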

Thanks,
Ryan


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v1 0/3] Speed up boot with faster linear map creation
  2024-03-27 15:01         ` Ryan Roberts
@ 2024-03-27 15:57           ` Ard Biesheuvel
  -1 siblings, 0 replies; 38+ messages in thread
From: Ard Biesheuvel @ 2024-03-27 15:57 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Catalin Marinas, Will Deacon, Mark Rutland, David Hildenbrand,
	Donald Dutile, Eric Chanudet, linux-arm-kernel, linux-kernel

On Wed, 27 Mar 2024 at 17:01, Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 27/03/2024 13:36, Ard Biesheuvel wrote:
> > On Wed, 27 Mar 2024 at 12:43, Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> On 27/03/2024 10:09, Ard Biesheuvel wrote:
...
> >
> > I think a mix of the fixmap approach with a 1:1 map could work here:
> > - use TTBR0 to create a temp 1:1 map of DRAM
> > - map page tables lazily as they are allocated but using a coarse mapping
> > - avoid all TLB maintenance except at the end when tearing down the 1:1 mapping.
>
> Yes that could work I think. So to make sure I've understood:
>
>  - create a 1:1 map for all of DRAM using block and cont mappings where possible
>      - use memblock_phys_alloc_*() to allocate pgtable memory
>      - access via fixmap (should be minimal due to block mappings)

Yes but you'd only need the fixmap for pages that are not in the 1:1
map yet, so after an initial ramp up you wouldn't need it at all,
assuming locality of memblock allocations and the use of PMD mappings.
The only tricky thing here is ensuring that we are not mapping memory
that we shouldn't be touching.

>  - install it in TTBR0
>  - create all the swapper mappings as normal (no block or cont mappings)
>      - use memblock_phys_alloc_*() to alloc pgtable memory
>      - phys address is also virtual address due to installed 1:1 map
>  - Remove 1:1 map from TTBR0
>  - memblock_phys_free() all the memory associated with 1:1 map
>

Indeed.

> That sounds doable on top of the first 2 patches in this series - I'll have a
> crack. The only missing piece is depth-first 1:1 map traversal to free the
> tables. I'm guessing something already exists that I can repurpose?
>

Not that I am aware of, but that doesn't sound too complicated.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v1 0/3] Speed up boot with faster linear map creation
  2024-03-27 15:57           ` Ard Biesheuvel
@ 2024-03-27 16:11             ` Ryan Roberts
  -1 siblings, 0 replies; 38+ messages in thread
From: Ryan Roberts @ 2024-03-27 16:11 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Catalin Marinas, Will Deacon, Mark Rutland, David Hildenbrand,
	Donald Dutile, Eric Chanudet, linux-arm-kernel, linux-kernel

On 27/03/2024 15:57, Ard Biesheuvel wrote:
> On Wed, 27 Mar 2024 at 17:01, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 27/03/2024 13:36, Ard Biesheuvel wrote:
>>> On Wed, 27 Mar 2024 at 12:43, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> On 27/03/2024 10:09, Ard Biesheuvel wrote:
> ...
>>>
>>> I think a mix of the fixmap approach with a 1:1 map could work here:
>>> - use TTBR0 to create a temp 1:1 map of DRAM
>>> - map page tables lazily as they are allocated but using a coarse mapping
>>> - avoid all TLB maintenance except at the end when tearing down the 1:1 mapping.
>>
>> Yes that could work I think. So to make sure I've understood:
>>
>>  - create a 1:1 map for all of DRAM using block and cont mappings where possible
>>      - use memblock_phys_alloc_*() to allocate pgtable memory
>>      - access via fixmap (should be minimal due to block mappings)
> 
> Yes but you'd only need the fixmap for pages that are not in the 1:1
> map yet, so after an initial ramp up you wouldn't need it at all,
> assuming locality of memblock allocations and the use of PMD mappings.
> The only tricky thing here is ensuring that we are not mapping memory
> that we shouldn't be touching.

That sounds a bit nasty though. I think it would be simpler to just reuse the
machinery we have, doing the 1:1 map using blocks and fixmap; it should be a
factor of 512 better than what we have, so probably not a problem at that point.
That way, we can rely on memblock to tell us what to map. If it's still
problematic I can add a layer to support 1G mappings too.
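
For the "rely on memblock" part, the core of that 1:1 map creation could be
as simple as the below (sketch only: populate_dram_idmap() is a made-up
name and the prot/flags choice is illustrative, but __create_pgd_mapping()
and early_pgtable_alloc() are the existing mmu.c helpers):

static void __init populate_dram_idmap(pgd_t *pgdp)
{
	u64 i;
	phys_addr_t start, end;

	/* Map only memblock-known memory; block/cont mappings allowed. */
	for_each_mem_range(i, &start, &end)
		__create_pgd_mapping(pgdp, start, start, end - start,
				     PAGE_KERNEL, early_pgtable_alloc,
				     NO_EXEC_MAPPINGS);
}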

> 
>>  - install it in TTBR0
>>  - create all the swapper mappings as normal (no block or cont mappings)
>>      - use memblock_phys_alloc_*() to alloc pgtable memory
>>      - phys address is also virtual address due to installed 1:1 map
>>  - Remove 1:1 map from TTBR0
>>  - memblock_phys_free() all the memory associated with 1:1 map
>>
> 
> Indeed.

One question on the state of TTBR0 on entry to paging_init(): what is it? I
need to know so I can set it back afterwards.

Currently I'm thinking I can do:

cpu_install_ttbr0(my_dram_idmap, TCR_T0SZ(vabits_actual));
<create swapper>
cpu_set_reserved_ttbr0();
local_flush_tlb_all();

But is it ok to leave the reserved pgd in ttbr0, or is it expecting something else?

> 
>> That sounds doable on top of the first 2 patches in this series - I'll have a
>> crack. The only missing piece is depth-first 1:1 map traversal to free the
>> tables. I'm guessing something already exists that I can repurpose?
>>
> 
> Not that I am aware of, but that doesn't sound too complicated.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v1] arm64: mm: Batch dsb and isb when populating pgtables
  2024-03-26 10:14 ` Ryan Roberts
@ 2024-03-27 19:07   ` Ryan Roberts
  -1 siblings, 0 replies; 38+ messages in thread
From: Ryan Roberts @ 2024-03-27 19:07 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Mark Rutland, Ard Biesheuvel,
	David Hildenbrand, Donald Dutile, Eric Chanudet
  Cc: Ryan Roberts, linux-arm-kernel, linux-kernel

After removing unnecessary TLBIs, the next bottleneck when creating the
page tables for the linear map is DSB and ISB, which were previously
issued per-pte in __set_pte(). Since we are writing multiple ptes in a
given pte table, we can elide these barriers and insert them once we
have finished writing to the table.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/arm64/include/asm/pgtable.h |  7 ++++++-
 arch/arm64/mm/mmu.c              | 13 ++++++++++++-
 2 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index bd5d02f3f0a3..81e427b23b3f 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -271,9 +271,14 @@ static inline pte_t pte_mkdevmap(pte_t pte)
 	return set_pte_bit(pte, __pgprot(PTE_DEVMAP | PTE_SPECIAL));
 }

-static inline void __set_pte(pte_t *ptep, pte_t pte)
+static inline void ___set_pte(pte_t *ptep, pte_t pte)
 {
 	WRITE_ONCE(*ptep, pte);
+}
+
+static inline void __set_pte(pte_t *ptep, pte_t pte)
+{
+	___set_pte(ptep, pte);

 	/*
 	 * Only if the new pte is valid and kernel, otherwise TLB maintenance
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 1b2a2a2d09b7..c6d5a76732d4 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -301,7 +301,11 @@ static pte_t *init_pte(pte_t *ptep, unsigned long addr, unsigned long end,
 	do {
 		pte_t old_pte = __ptep_get(ptep);

-		__set_pte(ptep, pfn_pte(__phys_to_pfn(phys), prot));
+		/*
+		 * Required barriers to make this visible to the table walker
+		 * are deferred to the end of alloc_init_cont_pte().
+		 */
+		___set_pte(ptep, pfn_pte(__phys_to_pfn(phys), prot));

 		/*
 		 * After the PTE entry has been populated once, we
@@ -358,6 +362,13 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
 	} while (addr = next, addr != end);

 	ops->unmap(TYPE_PTE);
+
+	/*
+	 * Ensure all previous pgtable writes are visible to the table walker.
+	 * See init_pte().
+	 */
+	dsb(ishst);
+	isb();
 }

 static pmd_t *init_pmd(pmd_t *pmdp, unsigned long addr, unsigned long end,
--
2.25.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH v1 0/3] Speed up boot with faster linear map creation
  2024-03-26 10:14 ` Ryan Roberts
@ 2024-03-27 19:12   ` Ryan Roberts
  -1 siblings, 0 replies; 38+ messages in thread
From: Ryan Roberts @ 2024-03-27 19:12 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Mark Rutland, Ard Biesheuvel,
	David Hildenbrand, Donald Dutile, Eric Chanudet
  Cc: linux-arm-kernel, linux-kernel

On 26/03/2024 10:14, Ryan Roberts wrote:
> Hi All,
> 
> It turns out that creating the linear map can take a significant proportion of
> the total boot time, especially when rodata=full. And a large portion of the
> time it takes to create the linear map is issuing TLBIs. This series reworks the
> kernel pgtable generation code to significantly reduce the number of TLBIs. See
> each patch for details.
> 
> The below shows the execution time of map_mem() across a couple of different
> systems with different RAM configurations. We measure after applying each patch
> and show the improvement relative to base (v6.9-rc1):
> 
>                | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
>                | VM, 16G     | VM, 64G     | VM, 256G    | Metal, 512G
> ---------------|-------------|-------------|-------------|-------------
>                |   ms    (%) |   ms    (%) |   ms    (%) |    ms    (%)
> ---------------|-------------|-------------|-------------|-------------
> base           |  151   (0%) | 2191   (0%) | 8990   (0%) | 17443   (0%)
> no-cont-remap  |   77 (-49%) |  429 (-80%) | 1753 (-80%) |  3796 (-78%)
> no-alloc-remap |   77 (-49%) |  375 (-83%) | 1532 (-83%) |  3366 (-81%)
> lazy-unmap     |   63 (-58%) |  330 (-85%) | 1312 (-85%) |  2929 (-83%)

I've just appended an additional patch to this series. This takes us to a ~95%
reduction overall:

               | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
               | VM, 16G     | VM, 64G     | VM, 256G    | Metal, 512G
---------------|-------------|-------------|-------------|-------------
               |   ms    (%) |   ms    (%) |   ms    (%) |    ms    (%)
---------------|-------------|-------------|-------------|-------------
base           |  151   (0%) | 2191   (0%) | 8990   (0%) | 17443   (0%)
no-cont-remap  |   77 (-49%) |  429 (-80%) | 1753 (-80%) |  3796 (-78%)
no-alloc-remap |   77 (-49%) |  375 (-83%) | 1532 (-83%) |  3366 (-81%)
lazy-unmap     |   63 (-58%) |  330 (-85%) | 1312 (-85%) |  2929 (-83%)
batch-barriers |   11 (-93%) |   61 (-97%) |  261 (-97%) |   837 (-95%)

I don't believe the intermediate block-based pgtable idea will now be necessary,
so I don't intend to pursue that. It might be that we choose to drop the middle
two patches; I'm keen to hear opinions.

Thanks,
Ryan


> 
> This series applies on top of v6.9-rc1. All mm selftests pass. I haven't yet
> tested all VA size configs (although I don't anticipate any issues); I'll do
> this as part of followup.
> 
> Thanks,
> Ryan
> 
> 
> Ryan Roberts (3):
>   arm64: mm: Don't remap pgtables per- cont(pte|pmd) block
>   arm64: mm: Don't remap pgtables for allocate vs populate
>   arm64: mm: Lazily clear pte table mappings from fixmap
> 
>  arch/arm64/include/asm/fixmap.h  |   5 +-
>  arch/arm64/include/asm/mmu.h     |   8 +
>  arch/arm64/include/asm/pgtable.h |   4 -
>  arch/arm64/kernel/cpufeature.c   |  10 +-
>  arch/arm64/mm/fixmap.c           |  11 +
>  arch/arm64/mm/mmu.c              | 364 +++++++++++++++++++++++--------
>  include/linux/pgtable.h          |   8 +
>  7 files changed, 307 insertions(+), 103 deletions(-)
> 
> --
> 2.25.1
> 


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v1] arm64: mm: Batch dsb and isb when populating pgtables
  2024-03-27 19:07   ` Ryan Roberts
@ 2024-03-28  7:23     ` Ard Biesheuvel
  -1 siblings, 0 replies; 38+ messages in thread
From: Ard Biesheuvel @ 2024-03-28  7:23 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Catalin Marinas, Will Deacon, Mark Rutland, David Hildenbrand,
	Donald Dutile, Eric Chanudet, linux-arm-kernel, linux-kernel

On Wed, 27 Mar 2024 at 21:07, Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> After removing unnecessary TLBIs, the next bottleneck when creating the
> page tables for the linear map is DSB and ISB, which were previously
> issued per-pte in __set_pte(). Since we are writing multiple ptes in a
> given pte table, we can elide these barriers and insert them once we
> have finished writing to the table.
>

Nice!

> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  arch/arm64/include/asm/pgtable.h |  7 ++++++-
>  arch/arm64/mm/mmu.c              | 13 ++++++++++++-
>  2 files changed, 18 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index bd5d02f3f0a3..81e427b23b3f 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -271,9 +271,14 @@ static inline pte_t pte_mkdevmap(pte_t pte)
>         return set_pte_bit(pte, __pgprot(PTE_DEVMAP | PTE_SPECIAL));
>  }
>
> -static inline void __set_pte(pte_t *ptep, pte_t pte)
> +static inline void ___set_pte(pte_t *ptep, pte_t pte)

IMHO, we should either use WRITE_ONCE() directly in the caller, or
find a better name.

>  {
>         WRITE_ONCE(*ptep, pte);
> +}
> +
> +static inline void __set_pte(pte_t *ptep, pte_t pte)
> +{
> +       ___set_pte(ptep, pte);
>
>         /*
>          * Only if the new pte is valid and kernel, otherwise TLB maintenance
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 1b2a2a2d09b7..c6d5a76732d4 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -301,7 +301,11 @@ static pte_t *init_pte(pte_t *ptep, unsigned long addr, unsigned long end,
>         do {
>                 pte_t old_pte = __ptep_get(ptep);
>
> -               __set_pte(ptep, pfn_pte(__phys_to_pfn(phys), prot));
> +               /*
> +                * Required barriers to make this visible to the table walker
> +                * are deferred to the end of alloc_init_cont_pte().
> +                */
> +               ___set_pte(ptep, pfn_pte(__phys_to_pfn(phys), prot));
>
>                 /*
>                  * After the PTE entry has been populated once, we
> @@ -358,6 +362,13 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
>         } while (addr = next, addr != end);
>
>         ops->unmap(TYPE_PTE);
> +
> +       /*
> +        * Ensure all previous pgtable writes are visible to the table walker.
> +        * See init_pte().
> +        */
> +       dsb(ishst);
> +       isb();
>  }
>
>  static pmd_t *init_pmd(pmd_t *pmdp, unsigned long addr, unsigned long end,
> --
> 2.25.1
>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v1] arm64: mm: Batch dsb and isb when populating pgtables
  2024-03-28  7:23     ` Ard Biesheuvel
@ 2024-03-28  8:45       ` Ryan Roberts
  -1 siblings, 0 replies; 38+ messages in thread
From: Ryan Roberts @ 2024-03-28  8:45 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Catalin Marinas, Will Deacon, Mark Rutland, David Hildenbrand,
	Donald Dutile, Eric Chanudet, linux-arm-kernel, linux-kernel

On 28/03/2024 07:23, Ard Biesheuvel wrote:
> On Wed, 27 Mar 2024 at 21:07, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> After removing unnecessary TLBIs, the next bottleneck when creating the
>> page tables for the linear map is DSB and ISB, which were previously
>> issued per-pte in __set_pte(). Since we are writing multiple ptes in a
>> given pte table, we can elide these barriers and insert them once we
>> have finished writing to the table.
>>
> 
> Nice!
> 
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>  arch/arm64/include/asm/pgtable.h |  7 ++++++-
>>  arch/arm64/mm/mmu.c              | 13 ++++++++++++-
>>  2 files changed, 18 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index bd5d02f3f0a3..81e427b23b3f 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -271,9 +271,14 @@ static inline pte_t pte_mkdevmap(pte_t pte)
>>         return set_pte_bit(pte, __pgprot(PTE_DEVMAP | PTE_SPECIAL));
>>  }
>>
>> -static inline void __set_pte(pte_t *ptep, pte_t pte)
>> +static inline void ___set_pte(pte_t *ptep, pte_t pte)
> 
> IMHO, we should either use WRITE_ONCE() directly in the caller, or
> find a better name.

How about __set_pte_nosync() ?
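
i.e. something like the below (sketch of the rename only; the barrier path
is paraphrased from the current __set_pte(), it is not part of this diff):

static inline void __set_pte_nosync(pte_t *ptep, pte_t pte)
{
	WRITE_ONCE(*ptep, pte);
}

static inline void __set_pte(pte_t *ptep, pte_t pte)
{
	__set_pte_nosync(ptep, pte);

	/* Existing behaviour: barriers only for valid kernel mappings. */
	if (pte_valid_not_user(pte)) {
		dsb(ishst);
		isb();
	}
}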

> 
>>  {
>>         WRITE_ONCE(*ptep, pte);
>> +}
>> +
>> +static inline void __set_pte(pte_t *ptep, pte_t pte)
>> +{
>> +       ___set_pte(ptep, pte);
>>
>>         /*
>>          * Only if the new pte is valid and kernel, otherwise TLB maintenance
>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>> index 1b2a2a2d09b7..c6d5a76732d4 100644
>> --- a/arch/arm64/mm/mmu.c
>> +++ b/arch/arm64/mm/mmu.c
>> @@ -301,7 +301,11 @@ static pte_t *init_pte(pte_t *ptep, unsigned long addr, unsigned long end,
>>         do {
>>                 pte_t old_pte = __ptep_get(ptep);
>>
>> -               __set_pte(ptep, pfn_pte(__phys_to_pfn(phys), prot));
>> +               /*
>> +                * Required barriers to make this visible to the table walker
>> +                * are deferred to the end of alloc_init_cont_pte().
>> +                */
>> +               ___set_pte(ptep, pfn_pte(__phys_to_pfn(phys), prot));
>>
>>                 /*
>>                  * After the PTE entry has been populated once, we
>> @@ -358,6 +362,13 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
>>         } while (addr = next, addr != end);
>>
>>         ops->unmap(TYPE_PTE);
>> +
>> +       /*
>> +        * Ensure all previous pgtable writes are visible to the table walker.
>> +        * See init_pte().
>> +        */
>> +       dsb(ishst);
>> +       isb();
>>  }
>>
>>  static pmd_t *init_pmd(pmd_t *pmdp, unsigned long addr, unsigned long end,
>> --
>> 2.25.1
>>


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v1] arm64: mm: Batch dsb and isb when populating pgtables
  2024-03-28  8:45       ` Ryan Roberts
@ 2024-03-28  8:56         ` Ard Biesheuvel
  -1 siblings, 0 replies; 38+ messages in thread
From: Ard Biesheuvel @ 2024-03-28  8:56 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Catalin Marinas, Will Deacon, Mark Rutland, David Hildenbrand,
	Donald Dutile, Eric Chanudet, linux-arm-kernel, linux-kernel

On Thu, 28 Mar 2024 at 10:45, Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 28/03/2024 07:23, Ard Biesheuvel wrote:
> > On Wed, 27 Mar 2024 at 21:07, Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> After removing unnecessary TLBIs, the next bottleneck when creating the
> >> page tables for the linear map is DSB and ISB, which were previously
> >> issued per-pte in __set_pte(). Since we are writing multiple ptes in a
> >> given pte table, we can elide these barriers and insert them once we
> >> have finished writing to the table.
> >>
> >
> > Nice!
> >
> >> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> >> ---
> >>  arch/arm64/include/asm/pgtable.h |  7 ++++++-
> >>  arch/arm64/mm/mmu.c              | 13 ++++++++++++-
> >>  2 files changed, 18 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> >> index bd5d02f3f0a3..81e427b23b3f 100644
> >> --- a/arch/arm64/include/asm/pgtable.h
> >> +++ b/arch/arm64/include/asm/pgtable.h
> >> @@ -271,9 +271,14 @@ static inline pte_t pte_mkdevmap(pte_t pte)
> >>         return set_pte_bit(pte, __pgprot(PTE_DEVMAP | PTE_SPECIAL));
> >>  }
> >>
> >> -static inline void __set_pte(pte_t *ptep, pte_t pte)
> >> +static inline void ___set_pte(pte_t *ptep, pte_t pte)
> >
> > IMHO, we should either use WRITE_ONCE() directly in the caller, or
> > find a better name.
>
> How about __set_pte_nosync() ?
>

Works for me.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v1 0/3] Speed up boot with faster linear map creation
  2024-03-27 19:12   ` Ryan Roberts
@ 2024-03-28 23:08     ` Eric Chanudet
  -1 siblings, 0 replies; 38+ messages in thread
From: Eric Chanudet @ 2024-03-28 23:08 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Catalin Marinas, Will Deacon, Mark Rutland, Ard Biesheuvel,
	David Hildenbrand, Donald Dutile, linux-arm-kernel, linux-kernel

On Wed, Mar 27, 2024 at 07:12:06PM +0000, Ryan Roberts wrote:
> On 26/03/2024 10:14, Ryan Roberts wrote:
> > Hi All,
> > 
> > It turns out that creating the linear map can take a significant proportion of
> > the total boot time, especially when rodata=full. And a large portion of the
> > time it takes to create the linear map is issuing TLBIs. This series reworks the
> > kernel pgtable generation code to significantly reduce the number of TLBIs. See
> > each patch for details.
> > 
> > The below shows the execution time of map_mem() across a couple of different
> > systems with different RAM configurations. We measure after applying each patch
> > and show the improvement relative to base (v6.9-rc1):
> > 
> >                | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
> >                | VM, 16G     | VM, 64G     | VM, 256G    | Metal, 512G
> > ---------------|-------------|-------------|-------------|-------------
> >                |   ms    (%) |   ms    (%) |   ms    (%) |    ms    (%)
> > ---------------|-------------|-------------|-------------|-------------
> > base           |  151   (0%) | 2191   (0%) | 8990   (0%) | 17443   (0%)
> > no-cont-remap  |   77 (-49%) |  429 (-80%) | 1753 (-80%) |  3796 (-78%)
> > no-alloc-remap |   77 (-49%) |  375 (-83%) | 1532 (-83%) |  3366 (-81%)
> > lazy-unmap     |   63 (-58%) |  330 (-85%) | 1312 (-85%) |  2929 (-83%)
> 
> I've just appended an additional patch to this series. This takes us to a ~95%
> reduction overall:
> 
>                | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
>                | VM, 16G     | VM, 64G     | VM, 256G    | Metal, 512G
> ---------------|-------------|-------------|-------------|-------------
>                |   ms    (%) |   ms    (%) |   ms    (%) |    ms    (%)
> ---------------|-------------|-------------|-------------|-------------
> base           |  151   (0%) | 2191   (0%) | 8990   (0%) | 17443   (0%)
> no-cont-remap  |   77 (-49%) |  429 (-80%) | 1753 (-80%) |  3796 (-78%)
> no-alloc-remap |   77 (-49%) |  375 (-83%) | 1532 (-83%) |  3366 (-81%)
> lazy-unmap     |   63 (-58%) |  330 (-85%) | 1312 (-85%) |  2929 (-83%)
> batch-barriers |   11 (-93%) |   61 (-97%) |  261 (-97%) |   837 (-95%)
> 
> I don't believe the intermediate block-based pgtable idea will now be necessary,
> so I don't intend to pursue that. It might be that we choose to drop the middle
> two patches; I'm keen to hear opinions.
> 

Applied on v6.9-rc1, I have much shorter base timing on a similar
machine (Ampere HR350A). no-alloc-remap didn't show much difference
either.

               | SA8775p-ride | Ampere HR350A|
               | VM, 36G      | Metal, 256G  |
---------------|--------------|--------------|
               |   ms     (%) |   ms     (%) |
---------------|--------------|--------------|
base           |  358    (0%) | 2213    (0%) |
no-cont-remap  |  232  (-35%) | 1283  (-42%) |
no-alloc-remap |  228  (-36%) | 1282  (-42%) |
lazy-unmap     |  231  (-35%) | 1248  (-44%) |
batch-barriers |   25  (-93%) |  204  (-91%) |

Tested-by: Eric Chanudet <echanude@redhat.com>


> > This series applies on top of v6.9-rc1. All mm selftests pass. I haven't yet
> > tested all VA size configs (although I don't anticipate any issues); I'll do
> > this as part of followup.
> > 
> > Thanks,
> > Ryan
> > 
> > 
> > Ryan Roberts (3):
> >   arm64: mm: Don't remap pgtables per- cont(pte|pmd) block
> >   arm64: mm: Don't remap pgtables for allocate vs populate
> >   arm64: mm: Lazily clear pte table mappings from fixmap
> > 
> >  arch/arm64/include/asm/fixmap.h  |   5 +-
> >  arch/arm64/include/asm/mmu.h     |   8 +
> >  arch/arm64/include/asm/pgtable.h |   4 -
> >  arch/arm64/kernel/cpufeature.c   |  10 +-
> >  arch/arm64/mm/fixmap.c           |  11 +
> >  arch/arm64/mm/mmu.c              | 364 +++++++++++++++++++++++--------
> >  include/linux/pgtable.h          |   8 +
> >  7 files changed, 307 insertions(+), 103 deletions(-)
> > 
> > --
> > 2.25.1
> > 
> 

-- 
Eric Chanudet


^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2024-03-28 23:10 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-03-26 10:14 [PATCH v1 0/3] Speed up boot with faster linear map creation Ryan Roberts
2024-03-26 10:14 ` [PATCH v1 1/3] arm64: mm: Don't remap pgtables per- cont(pte|pmd) block Ryan Roberts
2024-03-26 10:14 ` [PATCH v1 2/3] arm64: mm: Don't remap pgtables for allocate vs populate Ryan Roberts
2024-03-27  2:05   ` kernel test robot
2024-03-26 10:14 ` [PATCH v1 3/3] arm64: mm: Lazily clear pte table mappings from fixmap Ryan Roberts
2024-03-27 10:09 ` [PATCH v1 0/3] Speed up boot with faster linear map creation Ard Biesheuvel
2024-03-27 10:43   ` Ryan Roberts
2024-03-27 13:36     ` Ard Biesheuvel
2024-03-27 15:01       ` Ryan Roberts
2024-03-27 15:57         ` Ard Biesheuvel
2024-03-27 16:11           ` Ryan Roberts
2024-03-27 11:06 ` Itaru Kitayama
2024-03-27 11:10   ` Ryan Roberts
2024-03-27 19:07 ` [PATCH v1] arm64: mm: Batch dsb and isb when populating pgtables Ryan Roberts
2024-03-28  7:23   ` Ard Biesheuvel
2024-03-28  8:45     ` Ryan Roberts
2024-03-28  8:56       ` Ard Biesheuvel
2024-03-27 19:12 ` [PATCH v1 0/3] Speed up boot with faster linear map creation Ryan Roberts
2024-03-28 23:08   ` Eric Chanudet
