* [RFC PATCH 00/11] Avoid synchronous TLB invalidation for intermediate page-table entries on arm64
@ 2018-08-24 15:52 Will Deacon
  2018-08-24 15:52 ` [RFC PATCH 01/11] arm64: tlb: Use last-level invalidation in flush_tlb_kernel_range() Will Deacon
                   ` (12 more replies)
  0 siblings, 13 replies; 29+ messages in thread
From: Will Deacon @ 2018-08-24 15:52 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, benh, torvalds, npiggin, catalin.marinas,
	linux-arm-kernel, Will Deacon

Hi all,

I hacked up this RFC on the back of the recent changes to the mmu_gather
stuff in mainline. It's had a bit of testing and it looks pretty good so
far.

The main changes in the series are:

  - Avoid emitting a DSB barrier after clearing each page-table entry.
    Instead, we can have a single DSB prior to the actual TLB invalidation.

  - Batch last-level TLB invalidation until the end of the VMA, and use
    last-level-only invalidation instructions

  - Batch intermediate TLB invalidation until the end of the gather, and
    use all-level invalidation instructions

  - Adjust the stride of TLB invalidation based upon the smallest unflushed
    granule in the gather

As a really stupid benchmark, unmapping a populated mapping of
0x4_3333_3000 bytes using munmap() takes around 20% of the time it took
before.
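
A minimal reproducer for that kind of measurement looks something like
this (the mapping size matches the number above; the mmap flags and the
timing are illustrative, not necessarily the exact harness used):

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>

int main(void)
{
	size_t len = 0x433333000UL;	/* ~16.8 GiB, as above */
	struct timespec t0, t1;
	char *p;

	/* MAP_POPULATE pre-faults the mapping so munmap() has real work to do */
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
	if (p == MAP_FAILED)
		return 1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	munmap(p, len);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("munmap took %.3fs\n", (t1.tv_sec - t0.tv_sec) +
	       (t1.tv_nsec - t0.tv_nsec) / 1e9);
	return 0;
}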

The core changes now track the levels of page-table that have been visited
by the mmu_gather since the last flush. It may be possible to use the
page_size field instead if we decide to resurrect that from its current
"debug" status, but I think I'd prefer to track the levels explicitly.

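For anyone skimming, a condensed sketch of the idea (lifted from patches
8-10 below; the names match those patches, and the fullmm handling and
other details are elided):

struct mmu_gather {
	/* ... */
	unsigned int	freed_tables : 1;	/* we removed page directories */
	unsigned int	cleared_ptes : 1;	/* which levels had entries cleared */
	unsigned int	cleared_pmds : 1;
	unsigned int	cleared_puds : 1;
	unsigned int	cleared_p4ds : 1;
};

/* Smallest granule that still covers everything we unmapped */
static inline unsigned long tlb_get_unmap_granule(struct mmu_gather *tlb)
{
	if (tlb->cleared_ptes)
		return PAGE_SIZE;
	if (tlb->cleared_pmds)
		return PMD_SIZE;
	if (tlb->cleared_puds)
		return PUD_SIZE;
	if (tlb->cleared_p4ds)
		return P4D_SIZE;
	return PAGE_SIZE;
}

/* arm64 then derives both the TLBI stride and its type from the gather */
static inline void tlb_flush(struct mmu_gather *tlb)
{
	struct vm_area_struct vma = TLB_FLUSH_VMA(tlb->mm, 0);
	bool last_level = !tlb->freed_tables;	/* leaf-only invalidation suffices */
	unsigned long stride = tlb_get_unmap_granule(tlb);

	__flush_tlb_range(&vma, tlb->start, tlb->end, stride, last_level);
}
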
Anyway, I wanted to post this before disappearing for the long weekend
(Monday is a holiday in the UK) in the hope that it helps some of the
ongoing discussions.

Cheers,

Will

--->8

Peter Zijlstra (1):
  asm-generic/tlb: Track freeing of page-table directories in struct
    mmu_gather

Will Deacon (10):
  arm64: tlb: Use last-level invalidation in flush_tlb_kernel_range()
  arm64: tlb: Add DSB ISHST prior to TLBI in
    __flush_tlb_[kernel_]pgtable()
  arm64: pgtable: Implement p[mu]d_valid() and check in set_p[mu]d()
  arm64: tlb: Justify non-leaf invalidation in flush_tlb_range()
  arm64: tlbflush: Allow stride to be specified for __flush_tlb_range()
  arm64: tlb: Remove redundant !CONFIG_HAVE_RCU_TABLE_FREE code
  asm-generic/tlb: Guard with #ifdef CONFIG_MMU
  asm-generic/tlb: Track which levels of the page tables have been
    cleared
  arm64: tlb: Adjust stride and type of TLBI according to mmu_gather
  arm64: tlb: Avoid synchronous TLBIs when freeing page tables

 arch/arm64/Kconfig                |  1 +
 arch/arm64/include/asm/pgtable.h  | 10 ++++-
 arch/arm64/include/asm/tlb.h      | 34 +++++++----------
 arch/arm64/include/asm/tlbflush.h | 28 +++++++-------
 include/asm-generic/tlb.h         | 79 +++++++++++++++++++++++++++++++++------
 mm/memory.c                       |  4 +-
 6 files changed, 105 insertions(+), 51 deletions(-)

-- 
2.1.4


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [RFC PATCH 01/11] arm64: tlb: Use last-level invalidation in flush_tlb_kernel_range()
  2018-08-24 15:52 [RFC PATCH 00/11] Avoid synchronous TLB invalidation for intermediate page-table entries on arm64 Will Deacon
@ 2018-08-24 15:52 ` Will Deacon
  2018-08-24 15:52 ` [RFC PATCH 02/11] arm64: tlb: Add DSB ISHST prior to TLBI in __flush_tlb_[kernel_]pgtable() Will Deacon
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 29+ messages in thread
From: Will Deacon @ 2018-08-24 15:52 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, benh, torvalds, npiggin, catalin.marinas,
	linux-arm-kernel, Will Deacon

flush_tlb_kernel_range() is only ever used to invalidate last-level
entries, so we can restrict the scope of the TLB invalidation
instruction.

Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/arm64/include/asm/tlbflush.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index a4a1901140ee..7e2a35424ca4 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -199,7 +199,7 @@ static inline void flush_tlb_kernel_range(unsigned long start, unsigned long end
 
 	dsb(ishst);
 	for (addr = start; addr < end; addr += 1 << (PAGE_SHIFT - 12))
-		__tlbi(vaae1is, addr);
+		__tlbi(vaale1is, addr);
 	dsb(ish);
 	isb();
 }
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [RFC PATCH 02/11] arm64: tlb: Add DSB ISHST prior to TLBI in __flush_tlb_[kernel_]pgtable()
  2018-08-24 15:52 [RFC PATCH 00/11] Avoid synchronous TLB invalidation for intermediate page-table entries on arm64 Will Deacon
  2018-08-24 15:52 ` [RFC PATCH 01/11] arm64: tlb: Use last-level invalidation in flush_tlb_kernel_range() Will Deacon
@ 2018-08-24 15:52 ` Will Deacon
  2018-08-24 17:56   ` Peter Zijlstra
  2018-08-24 15:52 ` [RFC PATCH 03/11] arm64: pgtable: Implement p[mu]d_valid() and check in set_p[mu]d() Will Deacon
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 29+ messages in thread
From: Will Deacon @ 2018-08-24 15:52 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, benh, torvalds, npiggin, catalin.marinas,
	linux-arm-kernel, Will Deacon

__flush_tlb_[kernel_]pgtable() rely on set_pXd() having a DSB after
writing the new table entry and therefore avoid the barrier prior to the
TLBI instruction.

In preparation for delaying our walk-cache invalidation on the unmap()
path, move the DSB into the TLB invalidation routines.

Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/arm64/include/asm/tlbflush.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index 7e2a35424ca4..e257f8655b84 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -213,6 +213,7 @@ static inline void __flush_tlb_pgtable(struct mm_struct *mm,
 {
 	unsigned long addr = __TLBI_VADDR(uaddr, ASID(mm));
 
+	dsb(ishst);
 	__tlbi(vae1is, addr);
 	__tlbi_user(vae1is, addr);
 	dsb(ish);
@@ -222,6 +223,7 @@ static inline void __flush_tlb_kernel_pgtable(unsigned long kaddr)
 {
 	unsigned long addr = __TLBI_VADDR(kaddr, 0);
 
+	dsb(ishst);
 	__tlbi(vaae1is, addr);
 	dsb(ish);
 }
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [RFC PATCH 03/11] arm64: pgtable: Implement p[mu]d_valid() and check in set_p[mu]d()
  2018-08-24 15:52 [RFC PATCH 00/11] Avoid synchronous TLB invalidation for intermediate page-table entries on arm64 Will Deacon
  2018-08-24 15:52 ` [RFC PATCH 01/11] arm64: tlb: Use last-level invalidation in flush_tlb_kernel_range() Will Deacon
  2018-08-24 15:52 ` [RFC PATCH 02/11] arm64: tlb: Add DSB ISHST prior to TLBI in __flush_tlb_[kernel_]pgtable() Will Deacon
@ 2018-08-24 15:52 ` Will Deacon
  2018-08-24 16:15   ` Linus Torvalds
  2018-08-24 15:52 ` [RFC PATCH 04/11] arm64: tlb: Justify non-leaf invalidation in flush_tlb_range() Will Deacon
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 29+ messages in thread
From: Will Deacon @ 2018-08-24 15:52 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, benh, torvalds, npiggin, catalin.marinas,
	linux-arm-kernel, Will Deacon

Now that our walk-cache invalidation routines imply a DSB before the
invalidation, we no longer need one when we are clearing an entry during
unmap.

Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/arm64/include/asm/pgtable.h | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 1bdeca8918a6..2ab2031b778c 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -360,6 +360,7 @@ static inline int pmd_protnone(pmd_t pmd)
 #define pmd_present(pmd)	pte_present(pmd_pte(pmd))
 #define pmd_dirty(pmd)		pte_dirty(pmd_pte(pmd))
 #define pmd_young(pmd)		pte_young(pmd_pte(pmd))
+#define pmd_valid(pmd)		pte_valid(pmd_pte(pmd))
 #define pmd_wrprotect(pmd)	pte_pmd(pte_wrprotect(pmd_pte(pmd)))
 #define pmd_mkold(pmd)		pte_pmd(pte_mkold(pmd_pte(pmd)))
 #define pmd_mkwrite(pmd)	pte_pmd(pte_mkwrite(pmd_pte(pmd)))
@@ -431,7 +432,9 @@ extern pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn,
 static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
 {
 	WRITE_ONCE(*pmdp, pmd);
-	dsb(ishst);
+
+	if (pmd_valid(pmd))
+		dsb(ishst);
 }
 
 static inline void pmd_clear(pmd_t *pmdp)
@@ -477,11 +480,14 @@ static inline phys_addr_t pmd_page_paddr(pmd_t pmd)
 #define pud_none(pud)		(!pud_val(pud))
 #define pud_bad(pud)		(!(pud_val(pud) & PUD_TABLE_BIT))
 #define pud_present(pud)	pte_present(pud_pte(pud))
+#define pud_valid(pud)		pte_valid(pud_pte(pud))
 
 static inline void set_pud(pud_t *pudp, pud_t pud)
 {
 	WRITE_ONCE(*pudp, pud);
-	dsb(ishst);
+
+	if (pud_valid(pud))
+		dsb(ishst);
 }
 
 static inline void pud_clear(pud_t *pudp)
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [RFC PATCH 04/11] arm64: tlb: Justify non-leaf invalidation in flush_tlb_range()
  2018-08-24 15:52 [RFC PATCH 00/11] Avoid synchronous TLB invalidation for intermediate page-table entries on arm64 Will Deacon
                   ` (2 preceding siblings ...)
  2018-08-24 15:52 ` [RFC PATCH 03/11] arm64: pgtable: Implement p[mu]d_valid() and check in set_p[mu]d() Will Deacon
@ 2018-08-24 15:52 ` Will Deacon
  2018-08-24 15:52 ` [RFC PATCH 05/11] arm64: tlbflush: Allow stride to be specified for __flush_tlb_range() Will Deacon
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 29+ messages in thread
From: Will Deacon @ 2018-08-24 15:52 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, benh, torvalds, npiggin, catalin.marinas,
	linux-arm-kernel, Will Deacon

Add a comment to explain why we can't get away with last-level
invalidation in flush_tlb_range().

Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/arm64/include/asm/tlbflush.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index e257f8655b84..ddbf1718669d 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -182,6 +182,10 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma,
 static inline void flush_tlb_range(struct vm_area_struct *vma,
 				   unsigned long start, unsigned long end)
 {
+	/*
+	 * We cannot use leaf-only invalidation here, since we may be invalidating
+	 * table entries as part of collapsing hugepages or moving page tables.
+	 */
 	__flush_tlb_range(vma, start, end, false);
 }
 
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [RFC PATCH 05/11] arm64: tlbflush: Allow stride to be specified for __flush_tlb_range()
  2018-08-24 15:52 [RFC PATCH 00/11] Avoid synchronous TLB invalidation for intermediate page-table entries on arm64 Will Deacon
                   ` (3 preceding siblings ...)
  2018-08-24 15:52 ` [RFC PATCH 04/11] arm64: tlb: Justify non-leaf invalidation in flush_tlb_range() Will Deacon
@ 2018-08-24 15:52 ` Will Deacon
  2018-08-24 15:52 ` [RFC PATCH 06/11] arm64: tlb: Remove redundant !CONFIG_HAVE_RCU_TABLE_FREE code Will Deacon
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 29+ messages in thread
From: Will Deacon @ 2018-08-24 15:52 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, benh, torvalds, npiggin, catalin.marinas,
	linux-arm-kernel, Will Deacon

When we are unmapping intermediate page-table entries or huge pages, we
don't need to issue a TLBI instruction for every PAGE_SIZE chunk in the
VA range being unmapped.

Allow the invalidation stride to be passed to __flush_tlb_range(), and
adjust our "just nuke the ASID" heuristic to take this into account.

Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/arm64/include/asm/tlb.h      |  2 +-
 arch/arm64/include/asm/tlbflush.h | 11 +++++++----
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/include/asm/tlb.h b/arch/arm64/include/asm/tlb.h
index a3233167be60..1e1f68ce28f4 100644
--- a/arch/arm64/include/asm/tlb.h
+++ b/arch/arm64/include/asm/tlb.h
@@ -53,7 +53,7 @@ static inline void tlb_flush(struct mmu_gather *tlb)
 	 * the __(pte|pmd|pud)_free_tlb() functions, so last level
 	 * TLBI is sufficient here.
 	 */
-	__flush_tlb_range(&vma, tlb->start, tlb->end, true);
+	__flush_tlb_range(&vma, tlb->start, tlb->end, PAGE_SIZE, true);
 }
 
 static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t pte,
diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index ddbf1718669d..1f77d08e638b 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -153,12 +153,15 @@ static inline void flush_tlb_page(struct vm_area_struct *vma,
 
 static inline void __flush_tlb_range(struct vm_area_struct *vma,
 				     unsigned long start, unsigned long end,
-				     bool last_level)
+				     unsigned long stride, bool last_level)
 {
 	unsigned long asid = ASID(vma->vm_mm);
 	unsigned long addr;
 
-	if ((end - start) > MAX_TLB_RANGE) {
+	/* Convert the stride into units of 4k */
+	stride >>= 12;
+
+	if ((end - start) > (MAX_TLB_RANGE * stride)) {
 		flush_tlb_mm(vma->vm_mm);
 		return;
 	}
@@ -167,7 +170,7 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma,
 	end = __TLBI_VADDR(end, asid);
 
 	dsb(ishst);
-	for (addr = start; addr < end; addr += 1 << (PAGE_SHIFT - 12)) {
+	for (addr = start; addr < end; addr += stride) {
 		if (last_level) {
 			__tlbi(vale1is, addr);
 			__tlbi_user(vale1is, addr);
@@ -186,7 +189,7 @@ static inline void flush_tlb_range(struct vm_area_struct *vma,
 	 * We cannot use leaf-only invalidation here, since we may be invalidating
 	 * table entries as part of collapsing hugepages or moving page tables.
 	 */
-	__flush_tlb_range(vma, start, end, false);
+	__flush_tlb_range(vma, start, end, PAGE_SIZE, false);
 }
 
 static inline void flush_tlb_kernel_range(unsigned long start, unsigned long end)
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [RFC PATCH 06/11] arm64: tlb: Remove redundant !CONFIG_HAVE_RCU_TABLE_FREE code
  2018-08-24 15:52 [RFC PATCH 00/11] Avoid synchronous TLB invalidation for intermediate page-table entries on arm64 Will Deacon
                   ` (4 preceding siblings ...)
  2018-08-24 15:52 ` [RFC PATCH 05/11] arm64: tlbflush: Allow stride to be specified for __flush_tlb_range() Will Deacon
@ 2018-08-24 15:52 ` Will Deacon
  2018-08-24 15:52 ` [RFC PATCH 07/11] asm-generic/tlb: Guard with #ifdef CONFIG_MMU Will Deacon
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 29+ messages in thread
From: Will Deacon @ 2018-08-24 15:52 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, benh, torvalds, npiggin, catalin.marinas,
	linux-arm-kernel, Will Deacon

If there's one thing the RCU-based table freeing doesn't need, it's more
ifdeffery.

Remove the redundant !CONFIG_HAVE_RCU_TABLE_FREE code, since this option
is unconditionally selected in our Kconfig.

Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/arm64/include/asm/tlb.h | 12 +++---------
 1 file changed, 3 insertions(+), 9 deletions(-)

diff --git a/arch/arm64/include/asm/tlb.h b/arch/arm64/include/asm/tlb.h
index 1e1f68ce28f4..bd00017d529a 100644
--- a/arch/arm64/include/asm/tlb.h
+++ b/arch/arm64/include/asm/tlb.h
@@ -22,16 +22,10 @@
 #include <linux/pagemap.h>
 #include <linux/swap.h>
 
-#ifdef CONFIG_HAVE_RCU_TABLE_FREE
-
-#define tlb_remove_entry(tlb, entry)	tlb_remove_table(tlb, entry)
 static inline void __tlb_remove_table(void *_table)
 {
 	free_page_and_swap_cache((struct page *)_table);
 }
-#else
-#define tlb_remove_entry(tlb, entry)	tlb_remove_page(tlb, entry)
-#endif /* CONFIG_HAVE_RCU_TABLE_FREE */
 
 static void tlb_flush(struct mmu_gather *tlb);
 
@@ -61,7 +55,7 @@ static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t pte,
 {
 	__flush_tlb_pgtable(tlb->mm, addr);
 	pgtable_page_dtor(pte);
-	tlb_remove_entry(tlb, pte);
+	tlb_remove_table(tlb, pte);
 }
 
 #if CONFIG_PGTABLE_LEVELS > 2
@@ -69,7 +63,7 @@ static inline void __pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmdp,
 				  unsigned long addr)
 {
 	__flush_tlb_pgtable(tlb->mm, addr);
-	tlb_remove_entry(tlb, virt_to_page(pmdp));
+	tlb_remove_table(tlb, virt_to_page(pmdp));
 }
 #endif
 
@@ -78,7 +72,7 @@ static inline void __pud_free_tlb(struct mmu_gather *tlb, pud_t *pudp,
 				  unsigned long addr)
 {
 	__flush_tlb_pgtable(tlb->mm, addr);
-	tlb_remove_entry(tlb, virt_to_page(pudp));
+	tlb_remove_table(tlb, virt_to_page(pudp));
 }
 #endif
 
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [RFC PATCH 07/11] asm-generic/tlb: Guard with #ifdef CONFIG_MMU
  2018-08-24 15:52 [RFC PATCH 00/11] Avoid synchronous TLB invalidation for intermediate page-table entries on arm64 Will Deacon
                   ` (5 preceding siblings ...)
  2018-08-24 15:52 ` [RFC PATCH 06/11] arm64: tlb: Remove redundant !CONFIG_HAVE_RCU_TABLE_FREE code Will Deacon
@ 2018-08-24 15:52 ` Will Deacon
  2018-08-24 15:52 ` [RFC PATCH 08/11] asm-generic/tlb: Track freeing of page-table directories in struct mmu_gather Will Deacon
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 29+ messages in thread
From: Will Deacon @ 2018-08-24 15:52 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, benh, torvalds, npiggin, catalin.marinas,
	linux-arm-kernel, Will Deacon

The inner workings of the mmu_gather-based TLB invalidation mechanism
are not relevant to nommu configurations, so guard them with an ifdef.
This allows us to implement future functions using static inlines
without breaking the build.

Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 include/asm-generic/tlb.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index b3353e21f3b3..859f897d3d53 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -20,6 +20,8 @@
 #include <asm/pgalloc.h>
 #include <asm/tlbflush.h>
 
+#ifdef CONFIG_MMU
+
 #ifdef CONFIG_HAVE_RCU_TABLE_FREE
 /*
  * Semi RCU freeing of the page directories.
@@ -312,4 +314,5 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
 
 #define tlb_migrate_finish(mm) do {} while (0)
 
+#endif /* CONFIG_MMU */
 #endif /* _ASM_GENERIC__TLB_H */
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [RFC PATCH 08/11] asm-generic/tlb: Track freeing of page-table directories in struct mmu_gather
  2018-08-24 15:52 [RFC PATCH 00/11] Avoid synchronous TLB invalidation for intermediate page-table entries on arm64 Will Deacon
                   ` (6 preceding siblings ...)
  2018-08-24 15:52 ` [RFC PATCH 07/11] asm-generic/tlb: Guard with #ifdef CONFIG_MMU Will Deacon
@ 2018-08-24 15:52 ` Will Deacon
  2018-08-27  4:44   ` Nicholas Piggin
  2018-08-24 15:52 ` [RFC PATCH 09/11] asm-generic/tlb: Track which levels of the page tables have been cleared Will Deacon
                   ` (4 subsequent siblings)
  12 siblings, 1 reply; 29+ messages in thread
From: Will Deacon @ 2018-08-24 15:52 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, benh, torvalds, npiggin, catalin.marinas,
	linux-arm-kernel, Will Deacon

From: Peter Zijlstra <peterz@infradead.org>

Some architectures require different TLB invalidation instructions
depending on whether it is only the last-level of page table being
changed, or whether there are also changes to the intermediate
(directory) entries higher up the tree.

Add a new bit to the flags bitfield in struct mmu_gather so that the
architecture code can operate accordingly if it's the intermediate
levels being invalidated.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 include/asm-generic/tlb.h | 31 +++++++++++++++++++++++--------
 1 file changed, 23 insertions(+), 8 deletions(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 859f897d3d53..a5caf90264e6 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -99,12 +99,22 @@ struct mmu_gather {
 #endif
 	unsigned long		start;
 	unsigned long		end;
-	/* we are in the middle of an operation to clear
-	 * a full mm and can make some optimizations */
-	unsigned int		fullmm : 1,
-	/* we have performed an operation which
-	 * requires a complete flush of the tlb */
-				need_flush_all : 1;
+	/*
+	 * we are in the middle of an operation to clear
+	 * a full mm and can make some optimizations
+	 */
+	unsigned int		fullmm : 1;
+
+	/*
+	 * we have performed an operation which
+	 * requires a complete flush of the tlb
+	 */
+	unsigned int		need_flush_all : 1;
+
+	/*
+	 * we have removed page directories
+	 */
+	unsigned int		freed_tables : 1;
 
 	struct mmu_gather_batch *active;
 	struct mmu_gather_batch	local;
@@ -139,6 +149,7 @@ static inline void __tlb_reset_range(struct mmu_gather *tlb)
 		tlb->start = TASK_SIZE;
 		tlb->end = 0;
 	}
+	tlb->freed_tables = 0;
 }
 
 static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
@@ -280,6 +291,7 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
 #define pte_free_tlb(tlb, ptep, address)			\
 	do {							\
 		__tlb_adjust_range(tlb, address, PAGE_SIZE);	\
+		tlb->freed_tables = 1;			\
 		__pte_free_tlb(tlb, ptep, address);		\
 	} while (0)
 #endif
@@ -287,7 +299,8 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
 #ifndef pmd_free_tlb
 #define pmd_free_tlb(tlb, pmdp, address)			\
 	do {							\
-		__tlb_adjust_range(tlb, address, PAGE_SIZE);		\
+		__tlb_adjust_range(tlb, address, PAGE_SIZE);	\
+		tlb->freed_tables = 1;			\
 		__pmd_free_tlb(tlb, pmdp, address);		\
 	} while (0)
 #endif
@@ -297,6 +310,7 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
 #define pud_free_tlb(tlb, pudp, address)			\
 	do {							\
 		__tlb_adjust_range(tlb, address, PAGE_SIZE);	\
+		tlb->freed_tables = 1;			\
 		__pud_free_tlb(tlb, pudp, address);		\
 	} while (0)
 #endif
@@ -306,7 +320,8 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
 #ifndef p4d_free_tlb
 #define p4d_free_tlb(tlb, pudp, address)			\
 	do {							\
-		__tlb_adjust_range(tlb, address, PAGE_SIZE);		\
+		__tlb_adjust_range(tlb, address, PAGE_SIZE);	\
+		tlb->freed_tables = 1;			\
 		__p4d_free_tlb(tlb, pudp, address);		\
 	} while (0)
 #endif
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [RFC PATCH 09/11] asm-generic/tlb: Track which levels of the page tables have been cleared
  2018-08-24 15:52 [RFC PATCH 00/11] Avoid synchronous TLB invalidation for intermediate page-table entries on arm64 Will Deacon
                   ` (7 preceding siblings ...)
  2018-08-24 15:52 ` [RFC PATCH 08/11] asm-generic/tlb: Track freeing of page-table directories in struct mmu_gather Will Deacon
@ 2018-08-24 15:52 ` Will Deacon
  2018-08-27  7:53   ` Peter Zijlstra
  2018-08-24 15:52 ` [RFC PATCH 10/11] arm64: tlb: Adjust stride and type of TLBI according to mmu_gather Will Deacon
                   ` (3 subsequent siblings)
  12 siblings, 1 reply; 29+ messages in thread
From: Will Deacon @ 2018-08-24 15:52 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, benh, torvalds, npiggin, catalin.marinas,
	linux-arm-kernel, Will Deacon

It is common for architectures with hugepage support to require only a
single TLB invalidation operation per hugepage during unmap(), rather than
iterating through the mapping at a PAGE_SIZE increment. Currently,
however, the level in the page table where the unmap() operation occurs
is not stored in the mmu_gather structure, therefore forcing
architectures to issue additional TLB invalidation operations or to give
up and over-invalidate by e.g. invalidating the entire TLB.

Ideally, we could add an interval rbtree to the mmu_gather structure,
which would allow us to associate the correct mapping granule with the
various sub-mappings within the range being invalidated. However, this
is costly in terms of book-keeping and memory management, so instead we
approximate by keeping track of the page table levels that are cleared
and provide a means to query the smallest granule required for invalidation.

Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 include/asm-generic/tlb.h | 53 ++++++++++++++++++++++++++++++++++++++++-------
 mm/memory.c               |  4 +++-
 2 files changed, 48 insertions(+), 9 deletions(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index a5caf90264e6..081b105d7215 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -116,6 +116,14 @@ struct mmu_gather {
 	 */
 	unsigned int		freed_tables : 1;
 
+	/*
+	 * at which levels have we cleared entries?
+	 */
+	unsigned int		cleared_ptes : 1;
+	unsigned int		cleared_pmds : 1;
+	unsigned int		cleared_puds : 1;
+	unsigned int		cleared_p4ds : 1;
+
 	struct mmu_gather_batch *active;
 	struct mmu_gather_batch	local;
 	struct page		*__pages[MMU_GATHER_BUNDLE];
@@ -150,6 +158,10 @@ static inline void __tlb_reset_range(struct mmu_gather *tlb)
 		tlb->end = 0;
 	}
 	tlb->freed_tables = 0;
+	tlb->cleared_ptes = 0;
+	tlb->cleared_pmds = 0;
+	tlb->cleared_puds = 0;
+	tlb->cleared_p4ds = 0;
 }
 
 static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
@@ -199,6 +211,20 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
 }
 #endif
 
+static inline unsigned long tlb_get_unmap_granule(struct mmu_gather *tlb)
+{
+	if (tlb->cleared_ptes)
+		return PAGE_SIZE;
+	if (tlb->cleared_pmds)
+		return PMD_SIZE;
+	if (tlb->cleared_puds)
+		return PUD_SIZE;
+	if (tlb->cleared_p4ds)
+		return P4D_SIZE;
+
+	return PAGE_SIZE;
+}
+
 /*
  * In the case of tlb vma handling, we can optimise these away in the
  * case where we're doing a full MM flush.  When we're doing a munmap,
@@ -232,13 +258,19 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
 #define tlb_remove_tlb_entry(tlb, ptep, address)		\
 	do {							\
 		__tlb_adjust_range(tlb, address, PAGE_SIZE);	\
+		tlb->cleared_ptes = 1;				\
 		__tlb_remove_tlb_entry(tlb, ptep, address);	\
 	} while (0)
 
-#define tlb_remove_huge_tlb_entry(h, tlb, ptep, address)	     \
-	do {							     \
-		__tlb_adjust_range(tlb, address, huge_page_size(h)); \
-		__tlb_remove_tlb_entry(tlb, ptep, address);	     \
+#define tlb_remove_huge_tlb_entry(h, tlb, ptep, address)	\
+	do {							\
+		unsigned long _sz = huge_page_size(h);		\
+		__tlb_adjust_range(tlb, address, _sz);		\
+		if (_sz == PMD_SIZE)				\
+			tlb->cleared_pmds = 1;			\
+		else if (_sz == PUD_SIZE)			\
+			tlb->cleared_puds = 1;			\
+		__tlb_remove_tlb_entry(tlb, ptep, address);	\
 	} while (0)
 
 /**
@@ -252,6 +284,7 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
 #define tlb_remove_pmd_tlb_entry(tlb, pmdp, address)			\
 	do {								\
 		__tlb_adjust_range(tlb, address, HPAGE_PMD_SIZE);	\
+		tlb->cleared_pmds = 1;					\
 		__tlb_remove_pmd_tlb_entry(tlb, pmdp, address);		\
 	} while (0)
 
@@ -266,6 +299,7 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
 #define tlb_remove_pud_tlb_entry(tlb, pudp, address)			\
 	do {								\
 		__tlb_adjust_range(tlb, address, HPAGE_PUD_SIZE);	\
+		tlb->cleared_puds = 1;					\
 		__tlb_remove_pud_tlb_entry(tlb, pudp, address);		\
 	} while (0)
 
@@ -291,7 +325,8 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
 #define pte_free_tlb(tlb, ptep, address)			\
 	do {							\
 		__tlb_adjust_range(tlb, address, PAGE_SIZE);	\
-		tlb->freed_tables = 1;			\
+		tlb->freed_tables = 1;				\
+		tlb->cleared_pmds = 1;				\
 		__pte_free_tlb(tlb, ptep, address);		\
 	} while (0)
 #endif
@@ -300,7 +335,8 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
 #define pmd_free_tlb(tlb, pmdp, address)			\
 	do {							\
 		__tlb_adjust_range(tlb, address, PAGE_SIZE);	\
-		tlb->freed_tables = 1;			\
+		tlb->freed_tables = 1;				\
+		tlb->cleared_puds = 1;				\
 		__pmd_free_tlb(tlb, pmdp, address);		\
 	} while (0)
 #endif
@@ -310,7 +346,8 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
 #define pud_free_tlb(tlb, pudp, address)			\
 	do {							\
 		__tlb_adjust_range(tlb, address, PAGE_SIZE);	\
-		tlb->freed_tables = 1;			\
+		tlb->freed_tables = 1;				\
+		tlb->cleared_p4ds = 1;				\
 		__pud_free_tlb(tlb, pudp, address);		\
 	} while (0)
 #endif
@@ -321,7 +358,7 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
 #define p4d_free_tlb(tlb, pudp, address)			\
 	do {							\
 		__tlb_adjust_range(tlb, address, PAGE_SIZE);	\
-		tlb->freed_tables = 1;			\
+		tlb->freed_tables = 1;				\
 		__p4d_free_tlb(tlb, pudp, address);		\
 	} while (0)
 #endif
diff --git a/mm/memory.c b/mm/memory.c
index 83aef222f11b..69982d1d9b7a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -267,8 +267,10 @@ void arch_tlb_finish_mmu(struct mmu_gather *tlb,
 {
 	struct mmu_gather_batch *batch, *next;
 
-	if (force)
+	if (force) {
+		__tlb_reset_range(tlb);
 		__tlb_adjust_range(tlb, start, end - start);
+	}
 
 	tlb_flush_mmu(tlb);
 
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [RFC PATCH 10/11] arm64: tlb: Adjust stride and type of TLBI according to mmu_gather
  2018-08-24 15:52 [RFC PATCH 00/11] Avoid synchronous TLB invalidation for intermediate page-table entries on arm64 Will Deacon
                   ` (8 preceding siblings ...)
  2018-08-24 15:52 ` [RFC PATCH 09/11] asm-generic/tlb: Track which levels of the page tables have been cleared Will Deacon
@ 2018-08-24 15:52 ` Will Deacon
  2018-08-24 15:52 ` [RFC PATCH 11/11] arm64: tlb: Avoid synchronous TLBIs when freeing page tables Will Deacon
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 29+ messages in thread
From: Will Deacon @ 2018-08-24 15:52 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, benh, torvalds, npiggin, catalin.marinas,
	linux-arm-kernel, Will Deacon

Now that the core mmu_gather code keeps track of both the levels of page
table cleared and also whether or not these entries correspond to
intermediate entries, we can use this in our tlb_flush() callback to
reduce the number of invalidations we issue as well as their scope.

Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/arm64/include/asm/tlb.h | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/arch/arm64/include/asm/tlb.h b/arch/arm64/include/asm/tlb.h
index bd00017d529a..baca8dff6884 100644
--- a/arch/arm64/include/asm/tlb.h
+++ b/arch/arm64/include/asm/tlb.h
@@ -34,20 +34,21 @@ static void tlb_flush(struct mmu_gather *tlb);
 static inline void tlb_flush(struct mmu_gather *tlb)
 {
 	struct vm_area_struct vma = TLB_FLUSH_VMA(tlb->mm, 0);
+	bool last_level = !tlb->freed_tables;
+	unsigned long stride = tlb_get_unmap_granule(tlb);
 
 	/*
-	 * The ASID allocator will either invalidate the ASID or mark
-	 * it as used.
+	 * If we're tearing down the address space then we only care about
+	 * invalidating the walk-cache, since the ASID allocator won't
+	 * reallocate our ASID without invalidating the entire TLB.
 	 */
-	if (tlb->fullmm)
+	if (tlb->fullmm) {
+		if (!last_level)
+			flush_tlb_mm(tlb->mm);
 		return;
+	}
 
-	/*
-	 * The intermediate page table levels are already handled by
-	 * the __(pte|pmd|pud)_free_tlb() functions, so last level
-	 * TLBI is sufficient here.
-	 */
-	__flush_tlb_range(&vma, tlb->start, tlb->end, PAGE_SIZE, true);
+	__flush_tlb_range(&vma, tlb->start, tlb->end, stride, last_level);
 }
 
 static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t pte,
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [RFC PATCH 11/11] arm64: tlb: Avoid synchronous TLBIs when freeing page tables
  2018-08-24 15:52 [RFC PATCH 00/11] Avoid synchronous TLB invalidation for intermediate page-table entries on arm64 Will Deacon
                   ` (9 preceding siblings ...)
  2018-08-24 15:52 ` [RFC PATCH 10/11] arm64: tlb: Adjust stride and type of TLBI according to mmu_gather Will Deacon
@ 2018-08-24 15:52 ` Will Deacon
  2018-08-24 16:20 ` [RFC PATCH 00/11] Avoid synchronous TLB invalidation for intermediate page-table entries on arm64 Linus Torvalds
  2018-09-04 18:38 ` Jon Masters
  12 siblings, 0 replies; 29+ messages in thread
From: Will Deacon @ 2018-08-24 15:52 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, benh, torvalds, npiggin, catalin.marinas,
	linux-arm-kernel, Will Deacon

By selecting HAVE_RCU_TABLE_INVALIDATE, we can rely on tlb_flush() being
called if we fail to batch table pages for freeing. This in turn allows
us to postpone walk-cache invalidation until tlb_finish_mmu(), which
avoids lots of unnecessary DSBs and means we can shoot down the ASID if
the range is large enough.

Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/arm64/Kconfig                |  1 +
 arch/arm64/include/asm/tlb.h      |  3 ---
 arch/arm64/include/asm/tlbflush.h | 11 -----------
 3 files changed, 1 insertion(+), 14 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 29e75b47becd..89059ee1eccc 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -142,6 +142,7 @@ config ARM64
 	select HAVE_PERF_USER_STACK_DUMP
 	select HAVE_REGS_AND_STACK_ACCESS_API
 	select HAVE_RCU_TABLE_FREE
+	select HAVE_RCU_TABLE_INVALIDATE
 	select HAVE_RSEQ
 	select HAVE_STACKPROTECTOR
 	select HAVE_SYSCALL_TRACEPOINTS
diff --git a/arch/arm64/include/asm/tlb.h b/arch/arm64/include/asm/tlb.h
index baca8dff6884..4f65614b3141 100644
--- a/arch/arm64/include/asm/tlb.h
+++ b/arch/arm64/include/asm/tlb.h
@@ -54,7 +54,6 @@ static inline void tlb_flush(struct mmu_gather *tlb)
 static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t pte,
 				  unsigned long addr)
 {
-	__flush_tlb_pgtable(tlb->mm, addr);
 	pgtable_page_dtor(pte);
 	tlb_remove_table(tlb, pte);
 }
@@ -63,7 +62,6 @@ static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t pte,
 static inline void __pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmdp,
 				  unsigned long addr)
 {
-	__flush_tlb_pgtable(tlb->mm, addr);
 	tlb_remove_table(tlb, virt_to_page(pmdp));
 }
 #endif
@@ -72,7 +70,6 @@ static inline void __pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmdp,
 static inline void __pud_free_tlb(struct mmu_gather *tlb, pud_t *pudp,
 				  unsigned long addr)
 {
-	__flush_tlb_pgtable(tlb->mm, addr);
 	tlb_remove_table(tlb, virt_to_page(pudp));
 }
 #endif
diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index 1f77d08e638b..2064ba97845f 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -215,17 +215,6 @@ static inline void flush_tlb_kernel_range(unsigned long start, unsigned long end
  * Used to invalidate the TLB (walk caches) corresponding to intermediate page
  * table levels (pgd/pud/pmd).
  */
-static inline void __flush_tlb_pgtable(struct mm_struct *mm,
-				       unsigned long uaddr)
-{
-	unsigned long addr = __TLBI_VADDR(uaddr, ASID(mm));
-
-	dsb(ishst);
-	__tlbi(vae1is, addr);
-	__tlbi_user(vae1is, addr);
-	dsb(ish);
-}
-
 static inline void __flush_tlb_kernel_pgtable(unsigned long kaddr)
 {
 	unsigned long addr = __TLBI_VADDR(kaddr, 0);
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 03/11] arm64: pgtable: Implement p[mu]d_valid() and check in set_p[mu]d()
  2018-08-24 15:52 ` [RFC PATCH 03/11] arm64: pgtable: Implement p[mu]d_valid() and check in set_p[mu]d() Will Deacon
@ 2018-08-24 16:15   ` Linus Torvalds
  2018-08-28 12:49     ` Will Deacon
  0 siblings, 1 reply; 29+ messages in thread
From: Linus Torvalds @ 2018-08-24 16:15 UTC (permalink / raw)
  To: Will Deacon
  Cc: Linux Kernel Mailing List, Peter Zijlstra,
	Benjamin Herrenschmidt, Nick Piggin, Catalin Marinas,
	linux-arm-kernel

On Fri, Aug 24, 2018 at 8:52 AM Will Deacon <will.deacon@arm.com> wrote:
>
> Now that our walk-cache invalidation routines imply a DSB before the
> invalidation, we no longer need one when we are clearing an entry during
> unmap.

Do you really still need it when *setting* it?

I'm wondering if you could just remove the thing unconditionally.

Why would you need a barrier for another CPU for a mapping that is
just being created? It's ok if they see the old lack of mapping until
they are told about it, and that eventual "being told about it" must
involve a data transfer already.

And I'm assuming arm doesn't cache negative page table entries, so
there's no issue with any stale tlb.

And any other kernel thread looking at the page tables will have to
honor the page table locking, so you don't need it for some direct
page table lookup either.

Hmm? It seems like you shouldn't need to order the "set page directory
entry" with anything.

But maybe there's some magic arm64 rule I'm not aware of. Maybe even
the local TLB hardware walker isn't coherent with local stores?

                Linus

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 00/11] Avoid synchronous TLB invalidation for intermediate page-table entries on arm64
  2018-08-24 15:52 [RFC PATCH 00/11] Avoid synchronous TLB invalidation for intermediate page-table entries on arm64 Will Deacon
                   ` (10 preceding siblings ...)
  2018-08-24 15:52 ` [RFC PATCH 11/11] arm64: tlb: Avoid synchronous TLBIs when freeing page tables Will Deacon
@ 2018-08-24 16:20 ` Linus Torvalds
  2018-08-26 10:56   ` Peter Zijlstra
  2018-09-04 18:38 ` Jon Masters
  12 siblings, 1 reply; 29+ messages in thread
From: Linus Torvalds @ 2018-08-24 16:20 UTC (permalink / raw)
  To: Will Deacon
  Cc: Linux Kernel Mailing List, Peter Zijlstra,
	Benjamin Herrenschmidt, Nick Piggin, Catalin Marinas,
	linux-arm-kernel

On Fri, Aug 24, 2018 at 8:52 AM Will Deacon <will.deacon@arm.com> wrote:
>
> I hacked up this RFC on the back of the recent changes to the mmu_gather
> stuff in mainline. It's had a bit of testing and it looks pretty good so
> far.

Looks good to me.

Apart from the arm64-specific question I had, I wonder whether we need
to have that single "freed_tables" bit at all, since you wanted to
have the four individual bits for the different levels.

Even if somebody doesn't care about the individual bits, it's
generally exactly as cheap to test four bits as it is to test one, so
it seems unnecessary to have that summary bit.

             Linus

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 02/11] arm64: tlb: Add DSB ISHST prior to TLBI in __flush_tlb_[kernel_]pgtable()
  2018-08-24 15:52 ` [RFC PATCH 02/11] arm64: tlb: Add DSB ISHST prior to TLBI in __flush_tlb_[kernel_]pgtable() Will Deacon
@ 2018-08-24 17:56   ` Peter Zijlstra
  2018-08-28 13:03     ` Will Deacon
  0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2018-08-24 17:56 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-kernel, benh, torvalds, npiggin, catalin.marinas, linux-arm-kernel

On Fri, Aug 24, 2018 at 04:52:37PM +0100, Will Deacon wrote:
> __flush_tlb_[kernel_]pgtable() rely on set_pXd() having a DSB after
> writing the new table entry and therefore avoid the barrier prior to the
> TLBI instruction.
> 
> In preparation for delaying our walk-cache invalidation on the unmap()
> path, move the DSB into the TLB invalidation routines.
> 
> Signed-off-by: Will Deacon <will.deacon@arm.com>
> ---
>  arch/arm64/include/asm/tlbflush.h | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> index 7e2a35424ca4..e257f8655b84 100644
> --- a/arch/arm64/include/asm/tlbflush.h
> +++ b/arch/arm64/include/asm/tlbflush.h
> @@ -213,6 +213,7 @@ static inline void __flush_tlb_pgtable(struct mm_struct *mm,
>  {
>  	unsigned long addr = __TLBI_VADDR(uaddr, ASID(mm));
>  
> +	dsb(ishst);
>  	__tlbi(vae1is, addr);
>  	__tlbi_user(vae1is, addr);
>  	dsb(ish);
> @@ -222,6 +223,7 @@ static inline void __flush_tlb_kernel_pgtable(unsigned long kaddr)
>  {
>  	unsigned long addr = __TLBI_VADDR(kaddr, 0);
>  
> +	dsb(ishst);
>  	__tlbi(vaae1is, addr);
>  	dsb(ish);
>  }

I would suggest these barriers -- like any other barriers -- carry a
comment that explains the required ordering.

I think this here reads like:

	STORE: unhook page

	DSB-ishst: wait for all stores to complete
	TLBI: invalidate broadcast
	DSB-ish: wait for TLBI to complete

And the 'newly' placed DSB-ishst ensures the page is observed unlinked
before we issue the invalidate.
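
(Not part of the patch, just to illustrate: with that ordering spelled
out, the commented version of e.g. __flush_tlb_kernel_pgtable() might
read something like the below.)

static inline void __flush_tlb_kernel_pgtable(unsigned long kaddr)
{
	unsigned long addr = __TLBI_VADDR(kaddr, 0);

	/* Ensure the table entry update is visible before invalidating */
	dsb(ishst);
	/* Broadcast invalidation of the walk-cache entry */
	__tlbi(vaae1is, addr);
	/* Wait for the broadcast invalidation to complete */
	dsb(ish);
}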

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 00/11] Avoid synchronous TLB invalidation for intermediate page-table entries on arm64
  2018-08-24 16:20 ` [RFC PATCH 00/11] Avoid synchronous TLB invalidation for intermediate page-table entries on arm64 Linus Torvalds
@ 2018-08-26 10:56   ` Peter Zijlstra
  0 siblings, 0 replies; 29+ messages in thread
From: Peter Zijlstra @ 2018-08-26 10:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Will Deacon, Linux Kernel Mailing List, Benjamin Herrenschmidt,
	Nick Piggin, Catalin Marinas, linux-arm-kernel

On Fri, Aug 24, 2018 at 09:20:00AM -0700, Linus Torvalds wrote:
> On Fri, Aug 24, 2018 at 8:52 AM Will Deacon <will.deacon@arm.com> wrote:
> >
> > I hacked up this RFC on the back of the recent changes to the mmu_gather
> > stuff in mainline. It's had a bit of testing and it looks pretty good so
> > far.
> 
> Looks good to me.
> 
> Apart from the arm64-specific question I had, I wonder whether we need
> to have that single "freed_tables" bit at all, since you wanted to
> have the four individual bits for the different levels.

I think so, because he also sets those size bits for things like hugetlb
and THP user page frees, not only table page frees. So they're not
exactly the same.

And I think x86 could use this too; if we know we only freed 2M
pages, we can use that in flush_tlb_mm_range() to range flush in 2M
increments instead of 4K.

Something a little like so..

diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
index cb0a1f470980..cb0898fe9d37 100644
--- a/arch/x86/include/asm/tlb.h
+++ b/arch/x86/include/asm/tlb.h
@@ -8,10 +8,15 @@
 
 #define tlb_flush(tlb)							\
 {									\
-	if (!tlb->fullmm && !tlb->need_flush_all) 			\
-		flush_tlb_mm_range(tlb->mm, tlb->start, tlb->end, 0UL);	\
-	else								\
-		flush_tlb_mm_range(tlb->mm, 0UL, TLB_FLUSH_ALL, 0UL);	\
+	unsigned long start = 0UL, end = TLB_FLUSH_ALL;			\
+	unsigned int invl_shift = tlb_get_unmap_shift(tlb);		\
+									\
+	if (!tlb->fullmm && !tlb->need_flush_all) {			\
+		start = tlb->start;					\
+		end = tlb->end;						\
+	}								\
+									\
+	flush_tlb_mm_range(tlb->mm, start, end, invl_shift);		\
 }
 
 #include <asm-generic/tlb.h>
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 511bf5fae8b8..8ac1cac34f63 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -491,23 +491,25 @@ struct flush_tlb_info {
 	unsigned long		start;
 	unsigned long		end;
 	u64			new_tlb_gen;
+	unsigned int		invl_shift;
 };
 
 #define local_flush_tlb() __flush_tlb()
 
 #define flush_tlb_mm(mm)	flush_tlb_mm_range(mm, 0UL, TLB_FLUSH_ALL, 0UL)
 
-#define flush_tlb_range(vma, start, end)	\
-		flush_tlb_mm_range(vma->vm_mm, start, end, vma->vm_flags)
+#define flush_tlb_range(vma, start, end)			\
+		flush_tlb_mm_range(vma->vm_mm, start, end,	\
+				vma->vm_flags & VM_HUGETLB ? PMD_SHIFT : PAGE_SHIFT)
 
 extern void flush_tlb_all(void);
 extern void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
-				unsigned long end, unsigned long vmflag);
+				unsigned long end, unsigned int invl_shift);
 extern void flush_tlb_kernel_range(unsigned long start, unsigned long end);
 
 static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
 {
-	flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, VM_NONE);
+	flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, PAGE_SHIFT);
 }
 
 void native_flush_tlb_others(const struct cpumask *cpumask,
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 752dbf4e0e50..806aa74a8fb4 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -537,12 +537,12 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
 	    f->new_tlb_gen == mm_tlb_gen) {
 		/* Partial flush */
 		unsigned long addr;
-		unsigned long nr_pages = (f->end - f->start) >> PAGE_SHIFT;
+		unsigned long nr_pages = (f->end - f->start) >> f->invl_shift;
 
 		addr = f->start;
 		while (addr < f->end) {
 			__flush_tlb_one_user(addr);
-			addr += PAGE_SIZE;
+			addr += 1UL << f->invl_shift;
 		}
 		if (local)
 			count_vm_tlb_events(NR_TLB_LOCAL_FLUSH_ONE, nr_pages);
@@ -653,12 +653,13 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 static unsigned long tlb_single_page_flush_ceiling __read_mostly = 33;
 
 void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
-				unsigned long end, unsigned long vmflag)
+				unsigned long end, unsigned int invl_shift)
 {
 	int cpu;
 
 	struct flush_tlb_info info __aligned(SMP_CACHE_BYTES) = {
 		.mm = mm,
+		.invl_shift = invl_shift,
 	};
 
 	cpu = get_cpu();
@@ -668,8 +669,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 
 	/* Should we flush just the requested range? */
 	if ((end != TLB_FLUSH_ALL) &&
-	    !(vmflag & VM_HUGETLB) &&
-	    ((end - start) >> PAGE_SHIFT) <= tlb_single_page_flush_ceiling) {
+	    ((end - start) >> invl_shift) <= tlb_single_page_flush_ceiling) {
 		info.start = start;
 		info.end = end;
 	} else {
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index e811ef7b8350..cdde0cdb23e7 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -175,6 +200,25 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
 }
 #endif
 
+static inline unsigned long tlb_get_unmap_shift(struct mmu_gather *tlb)
+{
+	if (tlb->cleared_ptes)
+		return PAGE_SHIFT;
+	if (tlb->cleared_pmds)
+		return PMD_SHIFT;
+	if (tlb->cleared_puds)
+		return PUD_SHIFT;
+	if (tlb->cleared_p4ds)
+		return P4D_SHIFT;
+
+	return PAGE_SHIFT;
+}
+
+static inline unsigned long tlb_get_unmap_size(struct mmu_gather *tlb)
+{
+	return 1ULL << tlb_get_unmap_shift(tlb);
+}
+
 /*
  * In the case of tlb vma handling, we can optimise these away in the
  * case where we're doing a full MM flush.  When we're doing a munmap,

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 08/11] asm-generic/tlb: Track freeing of page-table directories in struct mmu_gather
  2018-08-24 15:52 ` [RFC PATCH 08/11] asm-generic/tlb: Track freeing of page-table directories in struct mmu_gather Will Deacon
@ 2018-08-27  4:44   ` Nicholas Piggin
  2018-08-28 13:46     ` Peter Zijlstra
  0 siblings, 1 reply; 29+ messages in thread
From: Nicholas Piggin @ 2018-08-27  4:44 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-kernel, peterz, benh, torvalds, catalin.marinas, linux-arm-kernel

On Fri, 24 Aug 2018 16:52:43 +0100
Will Deacon <will.deacon@arm.com> wrote:

> From: Peter Zijlstra <peterz@infradead.org>
> 
> Some architectures require different TLB invalidation instructions
> depending on whether it is only the last-level of page table being
> changed, or whether there are also changes to the intermediate
> (directory) entries higher up the tree.
> 
> Add a new bit to the flags bitfield in struct mmu_gather so that the
> architecture code can operate accordingly if it's the intermediate
> levels being invalidated.
> 
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Will Deacon <will.deacon@arm.com>

powerpc should be able to move right over to using this rather
than keeping the bit in need_flush_all.

powerpc may be able to use the unmap granule thing to improve
its page size dependent flushes, but it might prefer to go
a different way and track start-end for different page sizes.
I wonder how much of that stuff should go into generic code,
and whether it should instead go into a struct arch_mmu_gather.
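
Purely to illustrate that alternative (nothing like this exists and the
names are invented), per-page-size tracking might look roughly like:

/* Hypothetical: one range per page size, instead of one range + granule */
struct arch_mmu_gather {
	unsigned long	start_4k, end_4k;
	unsigned long	start_2m, end_2m;
	unsigned long	start_1g, end_1g;
};

static void arch_flush_gathered_ranges(struct mm_struct *mm,
				       struct arch_mmu_gather *ag)
{
	/* flush_range() is a stand-in for the arch's ranged invalidation */
	if (ag->end_4k > ag->start_4k)
		flush_range(mm, ag->start_4k, ag->end_4k, SZ_4K);
	if (ag->end_2m > ag->start_2m)
		flush_range(mm, ag->start_2m, ag->end_2m, SZ_2M);
	if (ag->end_1g > ag->start_1g)
		flush_range(mm, ag->start_1g, ag->end_1g, SZ_1G);
}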

Thanks,
Nick

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 09/11] asm-generic/tlb: Track which levels of the page tables have been cleared
  2018-08-24 15:52 ` [RFC PATCH 09/11] asm-generic/tlb: Track which levels of the page tables have been cleared Will Deacon
@ 2018-08-27  7:53   ` Peter Zijlstra
  2018-08-28 13:12     ` Will Deacon
  0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2018-08-27  7:53 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-kernel, benh, torvalds, npiggin, catalin.marinas, linux-arm-kernel

On Fri, Aug 24, 2018 at 04:52:44PM +0100, Will Deacon wrote:
> +static inline unsigned long tlb_get_unmap_granule(struct mmu_gather *tlb)
> +{
> +	if (tlb->cleared_ptes)
> +		return PAGE_SIZE;
> +	if (tlb->cleared_pmds)
> +		return PMD_SIZE;
> +	if (tlb->cleared_puds)
> +		return PUD_SIZE;
> +	if (tlb->cleared_p4ds)
> +		return P4D_SIZE;
> +
> +	return PAGE_SIZE;
> +}

When doing the x86 patch; I found _SHIFT more useful than _SIZE.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 03/11] arm64: pgtable: Implement p[mu]d_valid() and check in set_p[mu]d()
  2018-08-24 16:15   ` Linus Torvalds
@ 2018-08-28 12:49     ` Will Deacon
  0 siblings, 0 replies; 29+ messages in thread
From: Will Deacon @ 2018-08-28 12:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, Peter Zijlstra,
	Benjamin Herrenschmidt, Nick Piggin, Catalin Marinas,
	linux-arm-kernel

Hi Linus,

On Fri, Aug 24, 2018 at 09:15:17AM -0700, Linus Torvalds wrote:
> On Fri, Aug 24, 2018 at 8:52 AM Will Deacon <will.deacon@arm.com> wrote:
> >
> > Now that our walk-cache invalidation routines imply a DSB before the
> > invalidation, we no longer need one when we are clearing an entry during
> > unmap.
> 
> Do you really still need it when *setting* it?
> 
> I'm wondering if you could just remove the thing unconditionally.
> 
> Why would you need a barrier for another CPU for a mapping that is
> just being created? It's ok if they see the old lack of mapping until
> they are told about it, and that eventual "being told about it" must
> involve a data transfer already.
> 
> And I'm assuming arm doesn't cache negative page table entries, so
> there's no issue with any stale tlb.
> 
> And any other kernel thread looking at the page tables will have to
> honor the page table locking, so you don't need it for some direct
> page table lookup either.
> 
> Hmm? It seems like you shouldn't need to order the "set page directory
> entry" with anything.
> 
> But maybe there's some magic arm64 rule I'm not aware of. Maybe even
> the local TLB hardware walker isn't coherent with local stores?

Yup, you got it: it's not related to ordering of accesses by other CPUs, but
actually because the page-table walker is treated as a separate observer by
the architecture and therefore we need the DSB to push out the store to the
page-table so that the walker can see it (practically speaking, the walker
isn't guaranteed to snoop the store buffer).

For PTEs mapping user addresses, we actually don't bother with the DSB
when writing a valid entry because it's extremely unlikely that we'd get
back to userspace with the entry sitting in the store buffer. If that
*did* happen, we'd just take the fault a second time. However, if we played
that same trick for pXds, I think that:

	(a) We'd need to distinguish between user and kernel mappings
	    in set_pXd(), since we can't tolerate spurious faults on
	    kernel addresses.
	(b) We'd need to be careful about allocating page-table pages,
	    so that e.g. the walker sees zeroes for a new pgtable

We could probably achieve (a) with a software bit and (b) is a non-issue
because mm/memory.c uses smp_wmb(), which is always a DMB for us (which
will enforce the eventual ordering but doesn't necessarily publish the
stores immediately).

Will

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 02/11] arm64: tlb: Add DSB ISHST prior to TLBI in __flush_tlb_[kernel_]pgtable()
  2018-08-24 17:56   ` Peter Zijlstra
@ 2018-08-28 13:03     ` Will Deacon
  0 siblings, 0 replies; 29+ messages in thread
From: Will Deacon @ 2018-08-28 13:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, benh, torvalds, npiggin, catalin.marinas, linux-arm-kernel

On Fri, Aug 24, 2018 at 07:56:09PM +0200, Peter Zijlstra wrote:
> On Fri, Aug 24, 2018 at 04:52:37PM +0100, Will Deacon wrote:
> > __flush_tlb_[kernel_]pgtable() rely on set_pXd() having a DSB after
> > writing the new table entry and therefore avoid the barrier prior to the
> > TLBI instruction.
> > 
> > In preparation for delaying our walk-cache invalidation on the unmap()
> > path, move the DSB into the TLB invalidation routines.
> > 
> > Signed-off-by: Will Deacon <will.deacon@arm.com>
> > ---
> >  arch/arm64/include/asm/tlbflush.h | 2 ++
> >  1 file changed, 2 insertions(+)
> > 
> > diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> > index 7e2a35424ca4..e257f8655b84 100644
> > --- a/arch/arm64/include/asm/tlbflush.h
> > +++ b/arch/arm64/include/asm/tlbflush.h
> > @@ -213,6 +213,7 @@ static inline void __flush_tlb_pgtable(struct mm_struct *mm,
> >  {
> >  	unsigned long addr = __TLBI_VADDR(uaddr, ASID(mm));
> >  
> > +	dsb(ishst);
> >  	__tlbi(vae1is, addr);
> >  	__tlbi_user(vae1is, addr);
> >  	dsb(ish);
> > @@ -222,6 +223,7 @@ static inline void __flush_tlb_kernel_pgtable(unsigned long kaddr)
> >  {
> >  	unsigned long addr = __TLBI_VADDR(kaddr, 0);
> >  
> > +	dsb(ishst);
> >  	__tlbi(vaae1is, addr);
> >  	dsb(ish);
> >  }
> 
> I would suggest these barrier -- like any other barriers, carry a
> comment that explain the required ordering.
> 
> I think this here reads like:
> 
> 	STORE: unhook page
> 
> 	DSB-ishst: wait for all stores to complete
> 	TLBI: invalidate broadcast
> 	DSB-ish: wait for TLBI to complete
> 
> And the 'newly' placed DSB-ishst ensures the page is observed unlinked
> before we issue the invalidate.

Yeah, not a bad idea. We already have a big block comment in this file but
it's desperately out of date, so lemme rewrite that and justify the barriers
at the same time.

Will

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 09/11] asm-generic/tlb: Track which levels of the page tables have been cleared
  2018-08-27  7:53   ` Peter Zijlstra
@ 2018-08-28 13:12     ` Will Deacon
  0 siblings, 0 replies; 29+ messages in thread
From: Will Deacon @ 2018-08-28 13:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, benh, torvalds, npiggin, catalin.marinas, linux-arm-kernel

On Mon, Aug 27, 2018 at 09:53:41AM +0200, Peter Zijlstra wrote:
> On Fri, Aug 24, 2018 at 04:52:44PM +0100, Will Deacon wrote:
> > +static inline unsigned long tlb_get_unmap_granule(struct mmu_gather *tlb)
> > +{
> > +	if (tlb->cleared_ptes)
> > +		return PAGE_SIZE;
> > +	if (tlb->cleared_pmds)
> > +		return PMD_SIZE;
> > +	if (tlb->cleared_puds)
> > +		return PUD_SIZE;
> > +	if (tlb->cleared_p4ds)
> > +		return P4D_SIZE;
> > +
> > +	return PAGE_SIZE;
> > +}
> 
> When doing the x86 patch; I found _SHIFT more useful than _SIZE.

I'll make that change for the next version.

Cheers,

Will

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 08/11] asm-generic/tlb: Track freeing of page-table directories in struct mmu_gather
  2018-08-27  4:44   ` Nicholas Piggin
@ 2018-08-28 13:46     ` Peter Zijlstra
  2018-08-28 13:48       ` Peter Zijlstra
  2018-08-28 14:12       ` Nicholas Piggin
  0 siblings, 2 replies; 29+ messages in thread
From: Peter Zijlstra @ 2018-08-28 13:46 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Will Deacon, linux-kernel, benh, torvalds, catalin.marinas,
	linux-arm-kernel

On Mon, Aug 27, 2018 at 02:44:57PM +1000, Nicholas Piggin wrote:

> powerpc may be able to use the unmap granule thing to improve
> its page size dependent flushes, but it might prefer to go
> a different way and track start-end for different page sizes.

I don't really see how tracking multiple ranges would help much with
THP. The ranges would end up being almost the same if there is a good
mix of page sizes.

But something like:

void tlb_flush_one(struct mmu_gather *tlb, unsigned long addr)
{
	if (tlb->cleared_ptes && (addr << BITS_PER_LONG - PAGE_SHIFT))
		tblie_pte(addr);
	if (tlb->cleared_pmds && (addr << BITS_PER_LONG - PMD_SHIFT))
		tlbie_pmd(addr);
	if (tlb->cleared_puds && (addr << BITS_PER_LONG - PUD_SHIFT))
		tlbie_pud(addr);
}

void tlb_flush_range(struct mmu_gather *tlb)
{
	unsigned long stride = 1UL << tlb_get_unmap_shift(tlb);
	unsigned long addr;

	for (addr = tlb->start; addr < tlb->end; addr += stride)
		tlb_flush_one(tlb, addr);

	ptesync();
}

Should work, I think. You'll only issue multiple TLBIEs on the
boundaries, not on every stride.

And for hugetlb the above should be optimal, since stride and
tlb->cleared_* match up 1:1.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 08/11] asm-generic/tlb: Track freeing of page-table directories in struct mmu_gather
  2018-08-28 13:46     ` Peter Zijlstra
@ 2018-08-28 13:48       ` Peter Zijlstra
  2018-08-28 14:12       ` Nicholas Piggin
  1 sibling, 0 replies; 29+ messages in thread
From: Peter Zijlstra @ 2018-08-28 13:48 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Will Deacon, linux-kernel, benh, torvalds, catalin.marinas,
	linux-arm-kernel

On Tue, Aug 28, 2018 at 03:46:38PM +0200, Peter Zijlstra wrote:
> On Mon, Aug 27, 2018 at 02:44:57PM +1000, Nicholas Piggin wrote:
> 
> > powerpc may be able to use the unmap granule thing to improve
> > its page size dependent flushes, but it might prefer to go
> > a different way and track start-end for different page sizes.
> 
> I don't really see how tracking multiple ranges would help much with
> THP. The ranges would end up being almost the same if there is a good
> mix of page sizes.
> 
> But something like:
> 
> void tlb_flush_one(struct mmu_gather *tlb, unsigned long addr)
> {
> 	if (tlb->cleared_ptes && (addr << BITS_PER_LONG - PAGE_SHIFT))
> 		tlbie_pte(addr);
> 	if (tlb->cleared_pmds && (addr << BITS_PER_LONG - PMD_SHIFT))
> 		tlbie_pmd(addr);
> 	if (tlb->cleared_puds && (addr << BITS_PER_LONG - PUD_SHIFT))
> 		tlbie_pud(addr);
> }

Sorry, those all should (of course) be !(addr << ...).

> void tlb_flush_range(struct mmu_gather *tlb)
> {
> 	unsigned long stride = 1UL << tlb_get_unmap_shift(tlb);
> 	unsigned long addr;
> 
> 	for (addr = tlb->start; addr < tlb->end; addr += stride)
> 		tlb_flush_one(tlb, addr);
> 
> 	ptesync();
> }

And one could, like x86 does, add a threshold above which you just
flush the whole mm instead.
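
A sketch of that on top of the loop above (FLUSH_CEILING is an
invented tuning knob, as is the flush_tlb_mm() fallback; x86 uses
tlb_single_page_flush_ceiling for the same idea):

	void tlb_flush_range(struct mmu_gather *tlb)
	{
		unsigned int shift = tlb_get_unmap_shift(tlb);
		unsigned long stride = 1UL << shift;
		unsigned long addr;

		/* FLUSH_CEILING: illustrative threshold, cf. x86's ceiling. */
		if (((tlb->end - tlb->start) >> shift) > FLUSH_CEILING) {
			/* Past the ceiling, one flush of the whole mm is cheaper. */
			flush_tlb_mm(tlb->mm);
			return;
		}

		for (addr = tlb->start; addr < tlb->end; addr += stride)
			tlb_flush_one(tlb, addr);

		ptesync();
	}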

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 08/11] asm-generic/tlb: Track freeing of page-table directories in struct mmu_gather
  2018-08-28 13:46     ` Peter Zijlstra
  2018-08-28 13:48       ` Peter Zijlstra
@ 2018-08-28 14:12       ` Nicholas Piggin
  1 sibling, 0 replies; 29+ messages in thread
From: Nicholas Piggin @ 2018-08-28 14:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Will Deacon, linux-kernel, benh, torvalds, catalin.marinas,
	linux-arm-kernel

On Tue, 28 Aug 2018 15:46:38 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Mon, Aug 27, 2018 at 02:44:57PM +1000, Nicholas Piggin wrote:
> 
> > powerpc may be able to use the unmap granule thing to improve
> > its page size dependent flushes, but it might prefer to go
> > a different way and track start-end for different page sizes.  
> 
> I don't really see how tracking multiple ranges would help much with
> THP. The ranges would end up being almost the same if there is a good
> mix of page sizes.

That's assuming quite large unmaps. But a lot of the time they are
going to go to a full PID flush.

> 
> But something like:
> 
> void tlb_flush_one(struct mmu_gather *tlb, unsigned long addr)
> {
> 	if (tlb->cleared_ptes && (addr << BITS_PER_LONG - PAGE_SHIFT))
> 		tlbie_pte(addr);
> 	if (tlb->cleared_pmds && (addr << BITS_PER_LONG - PMD_SHIFT))
> 		tlbie_pmd(addr);
> 	if (tlb->cleared_puds && (addr << BITS_PER_LONG - PUD_SHIFT))
> 		tlbie_pud(addr);
> }
> 
> void tlb_flush_range(struct mmu_gather *tlb)
> {
> 	unsigned long stride = 1UL << tlb_get_unmap_shift(tlb);
> 	unsigned long addr;
> 
> 	for (addr = tlb->start; addr < tlb->end; addr += stride)
> 		tlb_flush_one(tlb, addr);
> 
> 	ptesync();
> }
> 
> Should work, I think. You'll only issue multiple TLBIEs on the
> boundaries, not on every stride.

Yeah, we already do basically that today in the flush_tlb_range path,
just without the precise test for which page sizes were actually
unmapped:

                if (hflush) {
                        hstart = (start + PMD_SIZE - 1) & PMD_MASK;
                        hend = end & PMD_MASK;
                        if (hstart == hend)
                                hflush = false;
                }

                if (gflush) {
                        gstart = (start + PUD_SIZE - 1) & PUD_MASK;
                        gend = end & PUD_MASK;
                        if (gstart == gend)
                                gflush = false;
                }

                asm volatile("ptesync": : :"memory");
                if (local) {
                        __tlbiel_va_range(start, end, pid, page_size, mmu_virtual_psize);
                        if (hflush)
                                __tlbiel_va_range(hstart, hend, pid,
                                                PMD_SIZE, MMU_PAGE_2M);
                        if (gflush)
                                __tlbiel_va_range(gstart, gend, pid,
                                                PUD_SIZE, MMU_PAGE_1G);
                        asm volatile("ptesync": : :"memory");

The thing is, I think it's the smallish-range cases you want to
optimize for. And for those we'll probably do something even smarter
(like keeping a bitmap of pages to flush), because we really want to
keep tlbies off the bus, whereas that's less important for x86.

Still not really seeing a reason not to implement a struct
arch_mmu_gather. A little bit of data confined to the arch is nothing
compared with the multitude of hooks and divergence of code.
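
Purely to illustrate the shape I mean (field names invented here, not
a proposal for what powerpc would actually carry):

	/* Arch-private state embedded in struct mmu_gather. */
	struct arch_mmu_gather {
		/* One range per page size, rather than a single blunt range... */
		unsigned long base_start, base_end;	/* base pages	*/
		unsigned long pmd_start, pmd_end;	/* 2M		*/
		unsigned long pud_start, pud_end;	/* 1G		*/
		/* ...and/or a small bitmap of pending pages for tiny unmaps. */
		DECLARE_BITMAP(pending_pages, 64);
	};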

Thanks,
Nick

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 00/11] Avoid synchronous TLB invalidation for intermediate page-table entries on arm64
  2018-08-24 15:52 [RFC PATCH 00/11] Avoid synchronous TLB invalidation for intermediate page-table entries on arm64 Will Deacon
                   ` (11 preceding siblings ...)
  2018-08-24 16:20 ` [RFC PATCH 00/11] Avoid synchronous TLB invalidation for intermediate page-table entries on arm64 Linus Torvalds
@ 2018-09-04 18:38 ` Jon Masters
  2018-09-05 12:28   ` Will Deacon
  12 siblings, 1 reply; 29+ messages in thread
From: Jon Masters @ 2018-09-04 18:38 UTC (permalink / raw)
  To: Will Deacon, linux-kernel
  Cc: peterz, benh, torvalds, npiggin, catalin.marinas, linux-arm-kernel

On 08/24/2018 11:52 AM, Will Deacon wrote:

> I hacked up this RFC on the back of the recent changes to the mmu_gather
> stuff in mainline. It's had a bit of testing and it looks pretty good so
> far.

I will request the server folks go and test this. You'll probably
remember a couple of parts we've seen where aggressive walker caches
ended up (correctly) seeing stale page table entries and we had all
manner of horrifically hard to debug problems. We have some fairly nice
reproducers that were able to find this last time that we can test.

Jon.

-- 
Computer Architect | Sent from my Fedora powered laptop

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 00/11] Avoid synchronous TLB invalidation for intermediate page-table entries on arm64
  2018-09-04 18:38 ` Jon Masters
@ 2018-09-05 12:28   ` Will Deacon
  2018-09-07  6:36     ` Jon Masters
  0 siblings, 1 reply; 29+ messages in thread
From: Will Deacon @ 2018-09-05 12:28 UTC (permalink / raw)
  To: Jon Masters
  Cc: linux-kernel, peterz, benh, torvalds, npiggin, catalin.marinas,
	linux-arm-kernel

On Tue, Sep 04, 2018 at 02:38:02PM -0400, Jon Masters wrote:
> On 08/24/2018 11:52 AM, Will Deacon wrote:
> 
> > I hacked up this RFC on the back of the recent changes to the mmu_gather
> > stuff in mainline. It's had a bit of testing and it looks pretty good so
> > far.
> 
> I will request the server folks go and test this. You'll probably
> remember a couple of parts we've seen where aggressive walker caches
> ended up (correctly) seeing stale page table entries and we had all
> manner of horrifically hard to debug problems. We have some fairly nice
> reproducers that were able to find this last time that we can test.

Cheers, Jon, that would be very helpful. You're probably best off using
my (rebasing) tlb branch rather than picking the RFC:

  git://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git tlb

Let me know if you'd prefer something stable (I can tag it with a date).

Will

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 00/11] Avoid synchronous TLB invalidation for intermediate page-table entries on arm64
  2018-09-05 12:28   ` Will Deacon
@ 2018-09-07  6:36     ` Jon Masters
  2018-09-13 15:53       ` Will Deacon
  0 siblings, 1 reply; 29+ messages in thread
From: Jon Masters @ 2018-09-07  6:36 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-kernel, peterz, benh, torvalds, npiggin, catalin.marinas,
	linux-arm-kernel

On 09/05/2018 08:28 AM, Will Deacon wrote:
> On Tue, Sep 04, 2018 at 02:38:02PM -0400, Jon Masters wrote:
>> On 08/24/2018 11:52 AM, Will Deacon wrote:
>>
>>> I hacked up this RFC on the back of the recent changes to the mmu_gather
>>> stuff in mainline. It's had a bit of testing and it looks pretty good so
>>> far.
>>
>> I will request the server folks go and test this. You'll probably
>> remember a couple of parts we've seen where aggressive walker caches
>> ended up (correctly) seeing stale page table entries and we had all
>> manner of horrifically hard to debug problems. We have some fairly nice
>> reproducers that were able to find this last time that we can test.
> 
> Cheers, Jon, that would be very helpful. You're probably best off using
> my (rebasing) tlb branch rather than picking the RFC:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git tlb
> 
> Let me know if you'd prefer something stable (I can tag it with a date).

That would be useful. I've prodded each of the Arm server SoC vendors I
work with via our weekly call to have them each specifically check this.
A tag would be helpful to that effort I expect. They all claim to be
watching this thread now, so we'll see if they see cabbages here.

Jon.

-- 
Computer Architect | Sent from my Fedora powered laptop

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 00/11] Avoid synchronous TLB invalidation for intermediate page-table entries on arm64
  2018-09-07  6:36     ` Jon Masters
@ 2018-09-13 15:53       ` Will Deacon
  2018-09-13 16:53         ` Jon Masters
  0 siblings, 1 reply; 29+ messages in thread
From: Will Deacon @ 2018-09-13 15:53 UTC (permalink / raw)
  To: Jon Masters
  Cc: linux-kernel, peterz, benh, torvalds, npiggin, catalin.marinas,
	linux-arm-kernel

On Fri, Sep 07, 2018 at 02:36:08AM -0400, Jon Masters wrote:
> On 09/05/2018 08:28 AM, Will Deacon wrote:
> > On Tue, Sep 04, 2018 at 02:38:02PM -0400, Jon Masters wrote:
> >> On 08/24/2018 11:52 AM, Will Deacon wrote:
> >>
> >>> I hacked up this RFC on the back of the recent changes to the mmu_gather
> >>> stuff in mainline. It's had a bit of testing and it looks pretty good so
> >>> far.
> >>
> >> I will request the server folks go and test this. You'll probably
> >> remember a couple of parts we've seen where aggressive walker caches
> >> ended up (correctly) seeing stale page table entries and we had all
> >> manner of horrifically hard to debug problems. We have some fairly nice
> >> reproducers that were able to find this last time that we can test.
> > 
> > Cheers, Jon, that would be very helpful. You're probably best off using
> > my (rebasing) tlb branch rather than picking the RFC:
> > 
> >   git://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git tlb
> > 
> > Let me know if you'd prefer something stable (I can tag it with a date).
> 
> That would be useful. I've prodded each of the Arm server SoC vendors I
> work with via our weekly call to have them each specifically check this.
> A tag would be helpful to that effort I expect. They all claim to be
> watching this thread now, so we'll see if they see cabbages here.

This is now all queued up in the (stable) arm64 for-next/core branch:

  https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/log/?h=for-next/core

so that's the best place to grab the patches.

Will

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 00/11] Avoid synchronous TLB invalidation for intermediate page-table entries on arm64
  2018-09-13 15:53       ` Will Deacon
@ 2018-09-13 16:53         ` Jon Masters
  0 siblings, 0 replies; 29+ messages in thread
From: Jon Masters @ 2018-09-13 16:53 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-kernel, peterz, benh, torvalds, npiggin, catalin.marinas,
	linux-arm-kernel

Tnx

-- 
Computer Architect


> On Sep 13, 2018, at 11:52, Will Deacon <will.deacon@arm.com> wrote:
> 
>> On Fri, Sep 07, 2018 at 02:36:08AM -0400, Jon Masters wrote:
>>> On 09/05/2018 08:28 AM, Will Deacon wrote:
>>>> On Tue, Sep 04, 2018 at 02:38:02PM -0400, Jon Masters wrote:
>>>>> On 08/24/2018 11:52 AM, Will Deacon wrote:
>>>>> 
>>>>> I hacked up this RFC on the back of the recent changes to the mmu_gather
>>>>> stuff in mainline. It's had a bit of testing and it looks pretty good so
>>>>> far.
>>>> 
>>>> I will request the server folks go and test this. You'll probably
>>>> remember a couple of parts we've seen where aggressive walker caches
>>>> ended up (correctly) seeing stale page table entries and we had all
>>>> manner of horrifically hard to debug problems. We have some fairly nice
>>>> reproducers that were able to find this last time that we can test.
>>> 
>>> Cheers, Jon, that would be very helpful. You're probably best off using
>>> my (rebasing) tlb branch rather than picking the RFC:
>>> 
>>>  git://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git tlb
>>> 
>>> Let me know if you'd prefer something stable (I can tag it with a date).
>> 
>> That would be useful. I've prodded each of the Arm server SoC vendors I
>> work with via our weekly call to have them each specifically check this.
>> A tag would be helpful to that effort I expect. They all claim to be
>> watching this thread now, so we'll see if they see cabbages here.
> 
> This is now all queued up in the (stable) arm64 for-next/core branch:
> 
>  https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/log/?h=for-next/core
> 
> so that's the best place to grab the patches.
> 
> Will

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2018-09-13 17:35 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-08-24 15:52 [RFC PATCH 00/11] Avoid synchronous TLB invalidation for intermediate page-table entries on arm64 Will Deacon
2018-08-24 15:52 ` [RFC PATCH 01/11] arm64: tlb: Use last-level invalidation in flush_tlb_kernel_range() Will Deacon
2018-08-24 15:52 ` [RFC PATCH 02/11] arm64: tlb: Add DSB ISHST prior to TLBI in __flush_tlb_[kernel_]pgtable() Will Deacon
2018-08-24 17:56   ` Peter Zijlstra
2018-08-28 13:03     ` Will Deacon
2018-08-24 15:52 ` [RFC PATCH 03/11] arm64: pgtable: Implement p[mu]d_valid() and check in set_p[mu]d() Will Deacon
2018-08-24 16:15   ` Linus Torvalds
2018-08-28 12:49     ` Will Deacon
2018-08-24 15:52 ` [RFC PATCH 04/11] arm64: tlb: Justify non-leaf invalidation in flush_tlb_range() Will Deacon
2018-08-24 15:52 ` [RFC PATCH 05/11] arm64: tlbflush: Allow stride to be specified for __flush_tlb_range() Will Deacon
2018-08-24 15:52 ` [RFC PATCH 06/11] arm64: tlb: Remove redundant !CONFIG_HAVE_RCU_TABLE_FREE code Will Deacon
2018-08-24 15:52 ` [RFC PATCH 07/11] asm-generic/tlb: Guard with #ifdef CONFIG_MMU Will Deacon
2018-08-24 15:52 ` [RFC PATCH 08/11] asm-generic/tlb: Track freeing of page-table directories in struct mmu_gather Will Deacon
2018-08-27  4:44   ` Nicholas Piggin
2018-08-28 13:46     ` Peter Zijlstra
2018-08-28 13:48       ` Peter Zijlstra
2018-08-28 14:12       ` Nicholas Piggin
2018-08-24 15:52 ` [RFC PATCH 09/11] asm-generic/tlb: Track which levels of the page tables have been cleared Will Deacon
2018-08-27  7:53   ` Peter Zijlstra
2018-08-28 13:12     ` Will Deacon
2018-08-24 15:52 ` [RFC PATCH 10/11] arm64: tlb: Adjust stride and type of TLBI according to mmu_gather Will Deacon
2018-08-24 15:52 ` [RFC PATCH 11/11] arm64: tlb: Avoid synchronous TLBIs when freeing page tables Will Deacon
2018-08-24 16:20 ` [RFC PATCH 00/11] Avoid synchronous TLB invalidation for intermediate page-table entries on arm64 Linus Torvalds
2018-08-26 10:56   ` Peter Zijlstra
2018-09-04 18:38 ` Jon Masters
2018-09-05 12:28   ` Will Deacon
2018-09-07  6:36     ` Jon Masters
2018-09-13 15:53       ` Will Deacon
2018-09-13 16:53         ` Jon Masters
