linux-mm.kvack.org archive mirror
* [RFC 00/20] TLB batching consolidation and enhancements
@ 2021-01-31  0:11 Nadav Amit
  2021-01-31  0:11 ` [RFC 01/20] mm/tlb: fix fullmm semantics Nadav Amit
                   ` (21 more replies)
  0 siblings, 22 replies; 67+ messages in thread
From: Nadav Amit @ 2021-01-31  0:11 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Nadav Amit, Andrea Arcangeli, Andrew Morton, Andy Lutomirski,
	Dave Hansen, linux-csky, linuxppc-dev, linux-s390, Mel Gorman,
	Nick Piggin, Peter Zijlstra, Thomas Gleixner, Will Deacon, x86,
	Yu Zhao

From: Nadav Amit <namit@vmware.com>

There are currently (at least?) 5 different TLB batching schemes in the
kernel:

1. Using mmu_gather (e.g., zap_page_range()).

2. Using {inc|dec}_tlb_flush_pending() to inform other threads on the
   ongoing deferred TLB flush and flushing the entire range eventually
   (e.g., change_protection_range()).

3. arch_{enter|leave}_lazy_mmu_mode() for sparc and powerpc (and Xen?).

4. Batching per-table flushes (move_ptes()).

5. Setting a flag to indicate that a deferred TLB flush is pending and
   flushing later (try_to_unmap_one() on x86).

It seems that (1)-(4) can be consolidated. In addition, it seems that
(5) is racy. It also seems there can be many redundant TLB flushes, and
potentially TLB-shootdown storms, for instance during batched
reclamation (using try_to_unmap_one()) if at the same time mmu_gather
defers TLB flushes.
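
As a rough illustration of scheme (2), here is a user-space toy model of the
pending-flush counter. The toy_* names and struct toy_mm are made up for this
sketch, and the memory barriers that the real helpers need are deliberately
left out:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy stand-in for the part of mm_struct that scheme (2) relies on. */
struct toy_mm {
	atomic_int tlb_flush_pending;	/* number of in-flight deferred flushes */
};

/* Announce that PTEs are about to change and that the flush comes later. */
static inline void toy_inc_tlb_flush_pending(struct toy_mm *mm)
{
	atomic_fetch_add(&mm->tlb_flush_pending, 1);
}

/* Called once the deferred flush has actually been issued. */
static inline void toy_dec_tlb_flush_pending(struct toy_mm *mm)
{
	atomic_fetch_sub(&mm->tlb_flush_pending, 1);
}

/* Other threads check this before trusting a PTE they have just read. */
static inline bool toy_tlb_flush_pending(struct toy_mm *mm)
{
	return atomic_load(&mm->tlb_flush_pending) > 0;
}

/* More than one batcher is active at the same time. */
static inline bool toy_tlb_flush_nested(struct toy_mm *mm)
{
	return atomic_load(&mm->tlb_flush_pending) > 1;
}
```

The counter only says that some flush is pending somewhere in the mm, which
is why a reader that sees it set may end up flushing more than it needs to.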

More aggressive TLB batching may be possible, but this patch-set does
not add such batching. The proposed changes would enable such batching
at a later time.

Admittedly, I do not understand how things are not broken today, which
makes me wary of adding further batching before getting things in
order. For instance, why is it OK for zap_pte_range() to batch dirty-PTE
flushes for each page-table (but not at greater granularity)? Can't
ClearPageDirty() be called before the flush, causing writes after
ClearPageDirty() and before the flush to be lost?

This patch-set therefore performs the following changes:

1. Change mprotect, task_mmu and mapping_dirty_helpers to use mmu_gather
   instead of {inc|dec}_tlb_flush_pending().

2. Avoid TLB flushes if PTE permission is not demoted.

3. Clean up mmu_gather to be less arch-dependent.

4. Use mm's generations to track at finer granularity, either per-VMA
   or per page-table, whether a pending mmu_gather operation is
   outstanding. This should make it possible to avoid some TLB flushes
   when KSM or memory reclamation takes place while another operation
   such as munmap() or mprotect() is running.

5. Change the try_to_unmap_one() flushing scheme, as the current one
   seems broken, to track in a bitmap which CPUs have outstanding TLB
   flushes instead of using a flag.
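
To make the generation idea in (4) more concrete, here is a user-space
sketch. All names (toy_mm, toy_vma, defer_gen, tlb_gen_completed) are
hypothetical and do not match the fields the patches add. It also assumes
that a flush covers everything deferred up to its generation:

```c
#include <stdbool.h>
#include <stdint.h>

struct toy_mm {
	uint64_t tlb_gen;		/* bumped whenever a flush is deferred */
	uint64_t tlb_gen_completed;	/* newest generation known to be flushed */
};

struct toy_vma {
	uint64_t defer_gen;		/* generation of the last deferred flush here */
};

/* A batcher changed PTEs in @vma and postponed the flush. */
static inline uint64_t toy_defer_flush(struct toy_mm *mm, struct toy_vma *vma)
{
	vma->defer_gen = ++mm->tlb_gen;
	return vma->defer_gen;
}

/* The batcher finished flushing everything deferred up to @gen. */
static inline void toy_complete_flush(struct toy_mm *mm, uint64_t gen)
{
	if (gen > mm->tlb_gen_completed)
		mm->tlb_gen_completed = gen;
}

/* Must a reader of @vma flush before trusting the PTEs it sees? */
static inline bool toy_vma_flush_pending(const struct toy_mm *mm,
					 const struct toy_vma *vma)
{
	return vma->defer_gen > mm->tlb_gen_completed;
}
```

Unlike a single mm-wide counter, a VMA that no batcher touched never reports
a pending flush here, so KSM or reclaim working on it would not need to flush
on behalf of an unrelated munmap() or mprotect().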

Further optimizations are possible, such as changing move_ptes() to use
mmu_gather.

The patches were only very lightly tested. I am looking forward to your
feedback regarding the overall approach, and on whether to split them
into multiple patch-sets.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: linux-csky@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s390@vger.kernel.org
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: x86@kernel.org
Cc: Yu Zhao <yuzhao@google.com>


Nadav Amit (20):
  mm/tlb: fix fullmm semantics
  mm/mprotect: use mmu_gather
  mm/mprotect: do not flush on permission promotion
  mm/mapping_dirty_helpers: use mmu_gather
  mm/tlb: move BATCHED_UNMAP_TLB_FLUSH to tlb.h
  fs/task_mmu: use mmu_gather interface of clear-soft-dirty
  mm: move x86 tlb_gen to generic code
  mm: store completed TLB generation
  mm: create pte/pmd_tlb_flush_pending()
  mm: add pte_to_page()
  mm/tlb: remove arch-specific tlb_start/end_vma()
  mm/tlb: save the VMA that is flushed during tlb_start_vma()
  mm/tlb: introduce tlb_start_ptes() and tlb_end_ptes()
  mm: move inc/dec_tlb_flush_pending() to mmu_gather.c
  mm: detect deferred TLB flushes in vma granularity
  mm/tlb: per-page table generation tracking
  mm/tlb: updated completed deferred TLB flush conditionally
  mm: make mm_cpumask() volatile
  lib/cpumask: introduce cpumask_atomic_or()
  mm/rmap: avoid potential races

 arch/arm/include/asm/bitops.h         |   4 +-
 arch/arm/include/asm/pgtable.h        |   4 +-
 arch/arm64/include/asm/pgtable.h      |   4 +-
 arch/csky/Kconfig                     |   1 +
 arch/csky/include/asm/tlb.h           |  12 --
 arch/powerpc/Kconfig                  |   1 +
 arch/powerpc/include/asm/tlb.h        |   2 -
 arch/s390/Kconfig                     |   1 +
 arch/s390/include/asm/tlb.h           |   3 -
 arch/sparc/Kconfig                    |   1 +
 arch/sparc/include/asm/pgtable_64.h   |   9 +-
 arch/sparc/include/asm/tlb_64.h       |   2 -
 arch/sparc/mm/init_64.c               |   2 +-
 arch/x86/Kconfig                      |   3 +
 arch/x86/hyperv/mmu.c                 |   2 +-
 arch/x86/include/asm/mmu.h            |  10 -
 arch/x86/include/asm/mmu_context.h    |   1 -
 arch/x86/include/asm/paravirt_types.h |   2 +-
 arch/x86/include/asm/pgtable.h        |  24 +--
 arch/x86/include/asm/tlb.h            |  21 +-
 arch/x86/include/asm/tlbbatch.h       |  15 --
 arch/x86/include/asm/tlbflush.h       |  61 ++++--
 arch/x86/mm/tlb.c                     |  52 +++--
 arch/x86/xen/mmu_pv.c                 |   2 +-
 drivers/firmware/efi/efi.c            |   1 +
 fs/proc/task_mmu.c                    |  29 ++-
 include/asm-generic/bitops/find.h     |   8 +-
 include/asm-generic/tlb.h             | 291 +++++++++++++++++++++-----
 include/linux/bitmap.h                |  21 +-
 include/linux/cpumask.h               |  40 ++--
 include/linux/huge_mm.h               |   3 +-
 include/linux/mm.h                    |  29 ++-
 include/linux/mm_types.h              | 166 ++++++++++-----
 include/linux/mm_types_task.h         |  13 --
 include/linux/pgtable.h               |   2 +-
 include/linux/smp.h                   |   6 +-
 init/Kconfig                          |  21 ++
 kernel/fork.c                         |   2 +
 kernel/smp.c                          |   8 +-
 lib/bitmap.c                          |  33 ++-
 lib/cpumask.c                         |   8 +-
 lib/find_bit.c                        |  10 +-
 mm/huge_memory.c                      |   6 +-
 mm/init-mm.c                          |   1 +
 mm/internal.h                         |  16 --
 mm/ksm.c                              |   2 +-
 mm/madvise.c                          |   6 +-
 mm/mapping_dirty_helpers.c            |  52 +++--
 mm/memory.c                           |   2 +
 mm/mmap.c                             |   1 +
 mm/mmu_gather.c                       |  59 +++++-
 mm/mprotect.c                         |  55 ++---
 mm/mremap.c                           |   2 +-
 mm/pgtable-generic.c                  |   2 +-
 mm/rmap.c                             |  42 ++--
 mm/vmscan.c                           |   1 +
 56 files changed, 803 insertions(+), 374 deletions(-)
 delete mode 100644 arch/x86/include/asm/tlbbatch.h

-- 
2.25.1




* [RFC 01/20] mm/tlb: fix fullmm semantics
  2021-01-31  0:11 [RFC 00/20] TLB batching consolidation and enhancements Nadav Amit
@ 2021-01-31  0:11 ` Nadav Amit
  2021-01-31  1:02   ` Andy Lutomirski
  2021-02-01 11:36   ` Peter Zijlstra
  2021-01-31  0:11 ` [RFC 02/20] mm/mprotect: use mmu_gather Nadav Amit
                   ` (20 subsequent siblings)
  21 siblings, 2 replies; 67+ messages in thread
From: Nadav Amit @ 2021-01-31  0:11 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Nadav Amit, Andrea Arcangeli, Andrew Morton, Andy Lutomirski,
	Dave Hansen, Peter Zijlstra, Thomas Gleixner, Will Deacon,
	Yu Zhao, Nick Piggin, x86

From: Nadav Amit <namit@vmware.com>

fullmm in mmu_gather is supposed to indicate that the mm is torn-down
(e.g., on process exit) and can therefore allow certain optimizations.
However, tlb_finish_mmu() sets fullmm, when in fact it wants to say that
the TLB should be fully flushed.

Change tlb_finish_mmu() to set need_flush_all and check this flag in
tlb_flush_mmu_tlbonly() when deciding whether a flush is needed.
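
A simplified user-space model of the two flags may help to see the
difference. struct toy_gather and the toy_* helpers are invented for this
sketch and only mirror the fields touched by the diff below:

```c
#include <stdbool.h>

struct toy_gather {
	unsigned int fullmm : 1;		/* the whole mm is being torn down */
	unsigned int need_flush_all : 1;	/* flush everything, mm stays alive */
	unsigned int freed_tables : 1;
	unsigned int cleared_ptes : 1;
	unsigned int cleared_pmds : 1;
	unsigned int cleared_puds : 1;
	unsigned int cleared_p4ds : 1;
};

/* Mirrors the check in tlb_flush_mmu_tlbonly() after this patch. */
static inline bool toy_flush_needed(const struct toy_gather *tlb)
{
	return tlb->freed_tables || tlb->cleared_ptes || tlb->cleared_pmds ||
	       tlb->cleared_puds || tlb->cleared_p4ds || tlb->need_flush_all;
}

/* Models the forced full flush in tlb_finish_mmu() after this patch. */
static inline void toy_force_full_flush(struct toy_gather *tlb)
{
	tlb->need_flush_all = 1;	/* before the patch: tlb->fullmm = 1 */
	tlb->freed_tables = 1;
}
```

With this split, code that tests fullmm to enable teardown-only
optimizations no longer sees it set for an mm that is still alive.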

Signed-off-by: Nadav Amit <namit@vmware.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: x86@kernel.org
---
 include/asm-generic/tlb.h | 2 +-
 mm/mmu_gather.c           | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 2c68a545ffa7..eea113323468 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -420,7 +420,7 @@ static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
 	 * these bits.
 	 */
 	if (!(tlb->freed_tables || tlb->cleared_ptes || tlb->cleared_pmds ||
-	      tlb->cleared_puds || tlb->cleared_p4ds))
+	      tlb->cleared_puds || tlb->cleared_p4ds || tlb->need_flush_all))
 		return;
 
 	tlb_flush(tlb);
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 0dc7149b0c61..5a659d4e59eb 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -323,7 +323,7 @@ void tlb_finish_mmu(struct mmu_gather *tlb)
 		 * On x86 non-fullmm doesn't yield significant difference
 		 * against fullmm.
 		 */
-		tlb->fullmm = 1;
+		tlb->need_flush_all = 1;
 		__tlb_reset_range(tlb);
 		tlb->freed_tables = 1;
 	}
-- 
2.25.1




* [RFC 02/20] mm/mprotect: use mmu_gather
  2021-01-31  0:11 [RFC 00/20] TLB batching consolidation and enhancements Nadav Amit
  2021-01-31  0:11 ` [RFC 01/20] mm/tlb: fix fullmm semantics Nadav Amit
@ 2021-01-31  0:11 ` Nadav Amit
  2021-01-31  0:11 ` [RFC 03/20] mm/mprotect: do not flush on permission promotion Nadav Amit
                   ` (19 subsequent siblings)
  21 siblings, 0 replies; 67+ messages in thread
From: Nadav Amit @ 2021-01-31  0:11 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Nadav Amit, Andrea Arcangeli, Andrew Morton, Andy Lutomirski,
	Dave Hansen, Peter Zijlstra, Thomas Gleixner, Will Deacon,
	Yu Zhao, Nick Piggin, x86

From: Nadav Amit <namit@vmware.com>

change_pXX_range() currently does not use mmu_gather, but instead
implements its own deferred TLB flush scheme. This both complicates
the code, as developers need to be aware of different invalidation
schemes, and prevents.

Use mmu_gather in change_pXX_range(). As the pages are not released,
only record the flushed range using tlb_flush_pXX_range().
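
The range recording can be pictured with the following user-space sketch.
It is only a model: struct toy_gather, TOY_PAGE_SIZE and the toy_* helpers
are assumptions made for illustration, not the kernel's definitions:

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_PAGE_SIZE 4096UL

struct toy_gather {
	unsigned long start;	/* lowest address that needs a flush */
	unsigned long end;	/* one past the highest such address */
};

static inline void toy_gather_init(struct toy_gather *tlb)
{
	/* Empty range: start is above end until something is recorded. */
	tlb->start = ~0UL;
	tlb->end = 0;
}

/* Record that [addr, addr + size) was modified; nothing is flushed yet. */
static inline void toy_flush_range(struct toy_gather *tlb,
				   unsigned long addr, unsigned long size)
{
	assert(size > 0);
	if (addr < tlb->start)
		tlb->start = addr;
	if (addr + size > tlb->end)
		tlb->end = addr + size;
}

/* True if at least one range was recorded since initialization. */
static inline bool toy_gather_pending(const struct toy_gather *tlb)
{
	return tlb->end > tlb->start;
}
```

A caller would record each changed PTE with toy_flush_range() and issue a
single flush over [start, end) at the end, and only if
toy_gather_pending() is true.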

Signed-off-by: Nadav Amit <namit@vmware.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: x86@kernel.org
---
 include/linux/huge_mm.h |  3 ++-
 mm/huge_memory.c        |  4 +++-
 mm/mprotect.c           | 51 ++++++++++++++++++++---------------------
 3 files changed, 30 insertions(+), 28 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 6a19f35f836b..6eff7f59a778 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -37,7 +37,8 @@ int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma, pud_t *pud,
 bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
 		   unsigned long new_addr, pmd_t *old_pmd, pmd_t *new_pmd);
 int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
-		    pgprot_t newprot, unsigned long cp_flags);
+		    pgprot_t newprot, unsigned long cp_flags,
+		    struct mmu_gather *tlb);
 vm_fault_t vmf_insert_pfn_pmd_prot(struct vm_fault *vmf, pfn_t pfn,
 				   pgprot_t pgprot, bool write);
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9237976abe72..c345b8b06183 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1797,7 +1797,8 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
  *  - HPAGE_PMD_NR is protections changed and TLB flush necessary
  */
 int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
-		unsigned long addr, pgprot_t newprot, unsigned long cp_flags)
+		unsigned long addr, pgprot_t newprot, unsigned long cp_flags,
+		struct mmu_gather *tlb)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	spinlock_t *ptl;
@@ -1885,6 +1886,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 		entry = pmd_clear_uffd_wp(entry);
 	}
 	ret = HPAGE_PMD_NR;
+	tlb_flush_pmd_range(tlb, addr, HPAGE_PMD_SIZE);
 	set_pmd_at(mm, addr, pmd, entry);
 	BUG_ON(vma_is_anonymous(vma) && !preserve_write && pmd_write(entry));
 unlock:
diff --git a/mm/mprotect.c b/mm/mprotect.c
index ab709023e9aa..632d5a677d3f 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -32,12 +32,13 @@
 #include <asm/cacheflush.h>
 #include <asm/mmu_context.h>
 #include <asm/tlbflush.h>
+#include <asm/tlb.h>
 
 #include "internal.h"
 
-static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
-		unsigned long addr, unsigned long end, pgprot_t newprot,
-		unsigned long cp_flags)
+static unsigned long change_pte_range(struct mmu_gather *tlb,
+		struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
+		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
 {
 	pte_t *pte, oldpte;
 	spinlock_t *ptl;
@@ -138,6 +139,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 				ptent = pte_mkwrite(ptent);
 			}
 			ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
+			tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
 			pages++;
 		} else if (is_swap_pte(oldpte)) {
 			swp_entry_t entry = pte_to_swp_entry(oldpte);
@@ -209,9 +211,9 @@ static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
 	return 0;
 }
 
-static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
-		pud_t *pud, unsigned long addr, unsigned long end,
-		pgprot_t newprot, unsigned long cp_flags)
+static inline unsigned long change_pmd_range(struct mmu_gather *tlb,
+		struct vm_area_struct *vma, pud_t *pud, unsigned long addr,
+		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
 {
 	pmd_t *pmd;
 	unsigned long next;
@@ -252,7 +254,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 				__split_huge_pmd(vma, pmd, addr, false, NULL);
 			} else {
 				int nr_ptes = change_huge_pmd(vma, pmd, addr,
-							      newprot, cp_flags);
+							      newprot, cp_flags, tlb);
 
 				if (nr_ptes) {
 					if (nr_ptes == HPAGE_PMD_NR) {
@@ -266,8 +268,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 			}
 			/* fall through, the trans huge pmd just split */
 		}
-		this_pages = change_pte_range(vma, pmd, addr, next, newprot,
-					      cp_flags);
+		this_pages = change_pte_range(tlb, vma, pmd, addr, next,
+					      newprot, cp_flags);
 		pages += this_pages;
 next:
 		cond_resched();
@@ -281,9 +283,9 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 	return pages;
 }
 
-static inline unsigned long change_pud_range(struct vm_area_struct *vma,
-		p4d_t *p4d, unsigned long addr, unsigned long end,
-		pgprot_t newprot, unsigned long cp_flags)
+static inline unsigned long change_pud_range(struct mmu_gather *tlb,
+		struct vm_area_struct *vma, p4d_t *p4d, unsigned long addr,
+		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
 {
 	pud_t *pud;
 	unsigned long next;
@@ -294,16 +296,16 @@ static inline unsigned long change_pud_range(struct vm_area_struct *vma,
 		next = pud_addr_end(addr, end);
 		if (pud_none_or_clear_bad(pud))
 			continue;
-		pages += change_pmd_range(vma, pud, addr, next, newprot,
+		pages += change_pmd_range(tlb, vma, pud, addr, next, newprot,
 					  cp_flags);
 	} while (pud++, addr = next, addr != end);
 
 	return pages;
 }
 
-static inline unsigned long change_p4d_range(struct vm_area_struct *vma,
-		pgd_t *pgd, unsigned long addr, unsigned long end,
-		pgprot_t newprot, unsigned long cp_flags)
+static inline unsigned long change_p4d_range(struct mmu_gather *tlb,
+		struct vm_area_struct *vma, pgd_t *pgd, unsigned long addr,
+		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
 {
 	p4d_t *p4d;
 	unsigned long next;
@@ -314,7 +316,7 @@ static inline unsigned long change_p4d_range(struct vm_area_struct *vma,
 		next = p4d_addr_end(addr, end);
 		if (p4d_none_or_clear_bad(p4d))
 			continue;
-		pages += change_pud_range(vma, p4d, addr, next, newprot,
+		pages += change_pud_range(tlb, vma, p4d, addr, next, newprot,
 					  cp_flags);
 	} while (p4d++, addr = next, addr != end);
 
@@ -328,25 +330,22 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
 	struct mm_struct *mm = vma->vm_mm;
 	pgd_t *pgd;
 	unsigned long next;
-	unsigned long start = addr;
 	unsigned long pages = 0;
+	struct mmu_gather tlb;
 
 	BUG_ON(addr >= end);
 	pgd = pgd_offset(mm, addr);
-	flush_cache_range(vma, addr, end);
-	inc_tlb_flush_pending(mm);
+	tlb_gather_mmu(&tlb, mm);
+	tlb_start_vma(&tlb, vma);
 	do {
 		next = pgd_addr_end(addr, end);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
-		pages += change_p4d_range(vma, pgd, addr, next, newprot,
+		pages += change_p4d_range(&tlb, vma, pgd, addr, next, newprot,
 					  cp_flags);
 	} while (pgd++, addr = next, addr != end);
-
-	/* Only flush the TLB if we actually modified any entries: */
-	if (pages)
-		flush_tlb_range(vma, start, end);
-	dec_tlb_flush_pending(mm);
+	tlb_end_vma(&tlb, vma);
+	tlb_finish_mmu(&tlb);
 
 	return pages;
 }
-- 
2.25.1




* [RFC 03/20] mm/mprotect: do not flush on permission promotion
  2021-01-31  0:11 [RFC 00/20] TLB batching consolidation and enhancements Nadav Amit
  2021-01-31  0:11 ` [RFC 01/20] mm/tlb: fix fullmm semantics Nadav Amit
  2021-01-31  0:11 ` [RFC 02/20] mm/mprotect: use mmu_gather Nadav Amit
@ 2021-01-31  0:11 ` Nadav Amit
  2021-01-31  1:07   ` Andy Lutomirski
  2021-01-31  0:11 ` [RFC 04/20] mm/mapping_dirty_helpers: use mmu_gather Nadav Amit
                   ` (18 subsequent siblings)
  21 siblings, 1 reply; 67+ messages in thread
From: Nadav Amit @ 2021-01-31  0:11 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Nadav Amit, Andrea Arcangeli, Andrew Morton, Andy Lutomirski,
	Dave Hansen, Peter Zijlstra, Thomas Gleixner, Will Deacon,
	Yu Zhao, Nick Piggin, x86

From: Nadav Amit <namit@vmware.com>

Currently, unprotecting a memory region with mprotect() or uffd causes a
TLB flush. At least on x86, no TLB flush is needed when protection is
promoted.

Add an arch-specific pte_may_need_flush() which tells whether a TLB
flush is needed based on the old PTE and the new one. Implement an x86
pte_may_need_flush().

For x86, besides the simple logic that PTE protection promotion or
changes of software bits do not require a flush, also add logic that
considers the dirty-bit. If the dirty-bit is clear and the PTE is being
write-protected, no TLB flush is needed, as x86 updates the dirty-bit
atomically on write, and if the bit is clear, the PTE is reread.
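
For experimenting with the rule outside the kernel, here is a user-space
model of the same decision. The TOY_PAGE_* constants copy the x86-64 PTE bit
positions, toy_pte_may_need_flush() is a name made up for this sketch,
an empty PTE is approximated as the value 0, and NX is assumed to be
available:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t pteval_t;

/* x86-64 PTE bit positions, copied so the model builds in user space. */
#define TOY_PAGE_PRESENT	(1ULL << 0)
#define TOY_PAGE_RW		(1ULL << 1)
#define TOY_PAGE_ACCESSED	(1ULL << 5)
#define TOY_PAGE_DIRTY		(1ULL << 6)
#define TOY_PAGE_GLOBAL		(1ULL << 8)
#define TOY_PAGE_SOFTW1		(1ULL << 9)
#define TOY_PAGE_SOFTW2		(1ULL << 10)
#define TOY_PAGE_SOFTW3		(1ULL << 11)
#define TOY_PAGE_NX		(1ULL << 63)

static inline bool toy_pte_may_need_flush(pteval_t oldval, pteval_t newval)
{
	const pteval_t ignore_mask = TOY_PAGE_SOFTW1 | TOY_PAGE_SOFTW2 |
				     TOY_PAGE_SOFTW3 | TOY_PAGE_ACCESSED;
	const pteval_t enable_mask = TOY_PAGE_RW | TOY_PAGE_DIRTY |
				     TOY_PAGE_GLOBAL;
	const pteval_t disable_mask = TOY_PAGE_NX;
	const pteval_t diff = oldval ^ newval;

	/* New PTE is empty: flush only if there was something to evict. */
	if (newval == 0)
		return oldval != 0;

	/* Only RW/dirty/ignored bits change and old was writable and clean. */
	if (!(diff & ~(TOY_PAGE_RW | TOY_PAGE_DIRTY | ignore_mask)) &&
	    !(oldval & TOY_PAGE_DIRTY) && (oldval & TOY_PAGE_RW))
		return false;

	/*
	 * Flush if a permission-granting bit was set before and changed,
	 * if NX is newly set, or if any other bit (e.g. the PFN) changed.
	 */
	return (diff & ((oldval & enable_mask) |
			(newval & disable_mask) |
			~(enable_mask | disable_mask | ignore_mask))) != 0;
}
```

For example, turning a read-only PTE into a writable one, or clearing NX,
reports no flush, whereas write-protecting a dirty PTE, setting NX, or
changing the PFN reports that a flush may be needed.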

Signed-off-by: Nadav Amit <namit@vmware.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: x86@kernel.org
---
 arch/x86/include/asm/tlbflush.h | 44 +++++++++++++++++++++++++++++++++
 include/asm-generic/tlb.h       |  4 +++
 mm/mprotect.c                   |  3 ++-
 3 files changed, 50 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 8c87a2e0b660..a617dc0a9b06 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -255,6 +255,50 @@ static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
 
 extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
 
+static inline bool pte_may_need_flush(pte_t oldpte, pte_t newpte)
+{
+	const pteval_t ignore_mask = _PAGE_SOFTW1 | _PAGE_SOFTW2 |
+				     _PAGE_SOFTW3 | _PAGE_ACCESSED;
+	const pteval_t enable_mask = _PAGE_RW | _PAGE_DIRTY | _PAGE_GLOBAL;
+	pteval_t oldval = pte_val(oldpte);
+	pteval_t newval = pte_val(newpte);
+	pteval_t diff = oldval ^ newval;
+	pteval_t disable_mask = 0;
+
+	if (IS_ENABLED(CONFIG_X86_64) || IS_ENABLED(CONFIG_X86_PAE))
+		disable_mask = _PAGE_NX;
+
+	/* new is non-present: need only if old is present */
+	if (pte_none(newpte))
+		return !pte_none(oldpte);
+
+	/*
+	 * If, excluding the ignored bits, only RW and dirty are cleared and the
+	 * old PTE does not have the dirty-bit set, we can avoid a flush. This
+	 * is possible since x86 architecture set the dirty bit atomically while
+	 * it caches the PTE in the TLB.
+	 *
+	 * The condition considers any change to RW and dirty as not requiring
+	 * flush if the old PTE is not dirty or not writable for simplification
+	 * of the code and to consider (unlikely) cases of changing dirty-bit of
+	 * write-protected PTE.
+	 */
+	if (!(diff & ~(_PAGE_RW | _PAGE_DIRTY | ignore_mask)) &&
+	    (!(pte_dirty(oldpte) || !pte_write(oldpte))))
+		return false;
+
+	/*
+	 * Any change of PFN and any flag other than those that we consider
+	 * requires a flush (e.g., PAT, protection keys). To save flushes we do
+	 * not consider the access bit as it is considered by the kernel as
+	 * best-effort.
+	 */
+	return diff & ((oldval & enable_mask) |
+		       (newval & disable_mask) |
+		       ~(enable_mask | disable_mask | ignore_mask));
+}
+#define pte_may_need_flush pte_may_need_flush
+
 #endif /* !MODULE */
 
 #endif /* _ASM_X86_TLBFLUSH_H */
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index eea113323468..c2deec0b6919 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -654,6 +654,10 @@ static inline void tlb_flush_p4d_range(struct mmu_gather *tlb,
 	} while (0)
 #endif
 
+#ifndef pte_may_need_flush
+static inline bool pte_may_need_flush(pte_t oldpte, pte_t newpte) { return true; }
+#endif
+
 #endif /* CONFIG_MMU */
 
 #endif /* _ASM_GENERIC__TLB_H */
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 632d5a677d3f..b7473d2c9a1f 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -139,7 +139,8 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
 				ptent = pte_mkwrite(ptent);
 			}
 			ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
-			tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
+			if (pte_may_need_flush(oldpte, ptent))
+				tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
 			pages++;
 		} else if (is_swap_pte(oldpte)) {
 			swp_entry_t entry = pte_to_swp_entry(oldpte);
-- 
2.25.1




* [RFC 04/20] mm/mapping_dirty_helpers: use mmu_gather
  2021-01-31  0:11 [RFC 00/20] TLB batching consolidation and enhancements Nadav Amit
                   ` (2 preceding siblings ...)
  2021-01-31  0:11 ` [RFC 03/20] mm/mprotect: do not flush on permission promotion Nadav Amit
@ 2021-01-31  0:11 ` Nadav Amit
  2021-01-31  0:11 ` [RFC 05/20] mm/tlb: move BATCHED_UNMAP_TLB_FLUSH to tlb.h Nadav Amit
                   ` (17 subsequent siblings)
  21 siblings, 0 replies; 67+ messages in thread
From: Nadav Amit @ 2021-01-31  0:11 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Nadav Amit, Andrea Arcangeli, Andrew Morton, Andy Lutomirski,
	Dave Hansen, Peter Zijlstra, Thomas Gleixner, Will Deacon,
	Yu Zhao, Nick Piggin, x86

From: Nadav Amit <namit@vmware.com>

Avoid open-coding mmu_gather for no reason. There is no apparent reason
not to use the existing mmu_gather interfaces.

Use the newly introduced pte_may_need_flush() to check whether a flush
is needed to avoid unnecessary flushes.

Signed-off-by: Nadav Amit <namit@vmware.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: x86@kernel.org
---
 mm/mapping_dirty_helpers.c | 37 +++++++++++--------------------------
 1 file changed, 11 insertions(+), 26 deletions(-)

diff --git a/mm/mapping_dirty_helpers.c b/mm/mapping_dirty_helpers.c
index b59054ef2e10..2ce6cf431026 100644
--- a/mm/mapping_dirty_helpers.c
+++ b/mm/mapping_dirty_helpers.c
@@ -4,7 +4,7 @@
 #include <linux/bitops.h>
 #include <linux/mmu_notifier.h>
 #include <asm/cacheflush.h>
-#include <asm/tlbflush.h>
+#include <asm/tlb.h>
 
 /**
  * struct wp_walk - Private struct for pagetable walk callbacks
@@ -15,8 +15,7 @@
  */
 struct wp_walk {
 	struct mmu_notifier_range range;
-	unsigned long tlbflush_start;
-	unsigned long tlbflush_end;
+	struct mmu_gather tlb;
 	unsigned long total;
 };
 
@@ -42,9 +41,9 @@ static int wp_pte(pte_t *pte, unsigned long addr, unsigned long end,
 		ptent = pte_wrprotect(old_pte);
 		ptep_modify_prot_commit(walk->vma, addr, pte, old_pte, ptent);
 		wpwalk->total++;
-		wpwalk->tlbflush_start = min(wpwalk->tlbflush_start, addr);
-		wpwalk->tlbflush_end = max(wpwalk->tlbflush_end,
-					   addr + PAGE_SIZE);
+
+		if (pte_may_need_flush(old_pte, ptent))
+			tlb_flush_pte_range(&wpwalk->tlb, addr, PAGE_SIZE);
 	}
 
 	return 0;
@@ -101,9 +100,7 @@ static int clean_record_pte(pte_t *pte, unsigned long addr,
 		ptep_modify_prot_commit(walk->vma, addr, pte, old_pte, ptent);
 
 		wpwalk->total++;
-		wpwalk->tlbflush_start = min(wpwalk->tlbflush_start, addr);
-		wpwalk->tlbflush_end = max(wpwalk->tlbflush_end,
-					   addr + PAGE_SIZE);
+		tlb_flush_pte_range(&wpwalk->tlb, addr, PAGE_SIZE);
 
 		__set_bit(pgoff, cwalk->bitmap);
 		cwalk->start = min(cwalk->start, pgoff);
@@ -184,20 +181,13 @@ static int wp_clean_pre_vma(unsigned long start, unsigned long end,
 {
 	struct wp_walk *wpwalk = walk->private;
 
-	wpwalk->tlbflush_start = end;
-	wpwalk->tlbflush_end = start;
-
 	mmu_notifier_range_init(&wpwalk->range, MMU_NOTIFY_PROTECTION_PAGE, 0,
 				walk->vma, walk->mm, start, end);
 	mmu_notifier_invalidate_range_start(&wpwalk->range);
 	flush_cache_range(walk->vma, start, end);
 
-	/*
-	 * We're not using tlb_gather_mmu() since typically
-	 * only a small subrange of PTEs are affected, whereas
-	 * tlb_gather_mmu() records the full range.
-	 */
-	inc_tlb_flush_pending(walk->mm);
+	tlb_gather_mmu(&wpwalk->tlb, walk->mm);
+	tlb_start_vma(&wpwalk->tlb, walk->vma);
 
 	return 0;
 }
@@ -212,15 +202,10 @@ static void wp_clean_post_vma(struct mm_walk *walk)
 {
 	struct wp_walk *wpwalk = walk->private;
 
-	if (mm_tlb_flush_nested(walk->mm))
-		flush_tlb_range(walk->vma, wpwalk->range.start,
-				wpwalk->range.end);
-	else if (wpwalk->tlbflush_end > wpwalk->tlbflush_start)
-		flush_tlb_range(walk->vma, wpwalk->tlbflush_start,
-				wpwalk->tlbflush_end);
-
 	mmu_notifier_invalidate_range_end(&wpwalk->range);
-	dec_tlb_flush_pending(walk->mm);
+
+	tlb_end_vma(&wpwalk->tlb, walk->vma);
+	tlb_finish_mmu(&wpwalk->tlb);
 }
 
 /*
-- 
2.25.1




* [RFC 05/20] mm/tlb: move BATCHED_UNMAP_TLB_FLUSH to tlb.h
  2021-01-31  0:11 [RFC 00/20] TLB batching consolidation and enhancements Nadav Amit
                   ` (3 preceding siblings ...)
  2021-01-31  0:11 ` [RFC 04/20] mm/mapping_dirty_helpers: use mmu_gather Nadav Amit
@ 2021-01-31  0:11 ` Nadav Amit
  2021-01-31  0:11 ` [RFC 06/20] fs/task_mmu: use mmu_gather interface of clear-soft-dirty Nadav Amit
                   ` (16 subsequent siblings)
  21 siblings, 0 replies; 67+ messages in thread
From: Nadav Amit @ 2021-01-31  0:11 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Nadav Amit, Mel Gorman, Andrea Arcangeli, Andrew Morton,
	Andy Lutomirski, Dave Hansen, Peter Zijlstra, Thomas Gleixner,
	Will Deacon, Yu Zhao, x86

From: Nadav Amit <namit@vmware.com>

Arguably, tlb.h is the natural place for TLB related code. In addition,
task_mmu needs to be able to call flush_tlb_batched_pending() and
therefore cannot (or should not) use mm/internal.h.

Move all the functions that are controlled by
CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH to tlb.h.

Signed-off-by: Nadav Amit <namit@vmware.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: x86@kernel.org
---
 include/asm-generic/tlb.h | 17 +++++++++++++++++
 mm/internal.h             | 16 ----------------
 mm/mremap.c               |  2 +-
 mm/vmscan.c               |  1 +
 4 files changed, 19 insertions(+), 17 deletions(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index c2deec0b6919..517c89398c83 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -658,6 +658,23 @@ static inline void tlb_flush_p4d_range(struct mmu_gather *tlb,
 static inline bool pte_may_need_flush(pte_t oldpte, pte_t newpte) { return true; }
 #endif
 
+#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+void try_to_unmap_flush(void);
+void try_to_unmap_flush_dirty(void);
+void flush_tlb_batched_pending(struct mm_struct *mm);
+#else
+static inline void try_to_unmap_flush(void)
+{
+}
+static inline void try_to_unmap_flush_dirty(void)
+{
+}
+static inline void flush_tlb_batched_pending(struct mm_struct *mm)
+{
+}
+static inline void tlb_batch_init(void) { }
+#endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
+
 #endif /* CONFIG_MMU */
 
 #endif /* _ASM_GENERIC__TLB_H */
diff --git a/mm/internal.h b/mm/internal.h
index 25d2b2439f19..d3860f9fbb83 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -585,22 +585,6 @@ struct tlbflush_unmap_batch;
  */
 extern struct workqueue_struct *mm_percpu_wq;
 
-#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
-void try_to_unmap_flush(void);
-void try_to_unmap_flush_dirty(void);
-void flush_tlb_batched_pending(struct mm_struct *mm);
-#else
-static inline void try_to_unmap_flush(void)
-{
-}
-static inline void try_to_unmap_flush_dirty(void)
-{
-}
-static inline void flush_tlb_batched_pending(struct mm_struct *mm)
-{
-}
-#endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
-
 extern const struct trace_print_flags pageflag_names[];
 extern const struct trace_print_flags vmaflag_names[];
 extern const struct trace_print_flags gfpflag_names[];
diff --git a/mm/mremap.c b/mm/mremap.c
index f554320281cc..57655d1b1031 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -26,7 +26,7 @@
 #include <linux/userfaultfd_k.h>
 
 #include <asm/cacheflush.h>
-#include <asm/tlbflush.h>
+#include <asm/tlb.h>
 
 #include "internal.h"
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b1b574ad199d..ee144c359b41 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -52,6 +52,7 @@
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
+#include <asm/tlb.h>
 
 #include <linux/swapops.h>
 #include <linux/balloon_compaction.h>
-- 
2.25.1




* [RFC 06/20] fs/task_mmu: use mmu_gather interface of clear-soft-dirty
  2021-01-31  0:11 [RFC 00/20] TLB batching consolidation and enhancements Nadav Amit
                   ` (4 preceding siblings ...)
  2021-01-31  0:11 ` [RFC 05/20] mm/tlb: move BATCHED_UNMAP_TLB_FLUSH to tlb.h Nadav Amit
@ 2021-01-31  0:11 ` Nadav Amit
  2021-01-31  0:11 ` [RFC 07/20] mm: move x86 tlb_gen to generic code Nadav Amit
                   ` (15 subsequent siblings)
  21 siblings, 0 replies; 67+ messages in thread
From: Nadav Amit @ 2021-01-31  0:11 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Nadav Amit, Andrea Arcangeli, Andrew Morton, Andy Lutomirski,
	Dave Hansen, Peter Zijlstra, Thomas Gleixner, Will Deacon,
	Yu Zhao, Nick Piggin, x86

From: Nadav Amit <namit@vmware.com>

Use the mmu_gather interface in task_mmu instead of
{inc|dec}_tlb_flush_pending(). This allows the code to be consolidated
and avoids potential bugs.

Signed-off-by: Nadav Amit <namit@vmware.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: x86@kernel.org
---
 fs/proc/task_mmu.c | 27 ++++++++++++++++++++++++---
 1 file changed, 24 insertions(+), 3 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 3cec6fbef725..4cd048ffa0f6 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1032,8 +1032,25 @@ enum clear_refs_types {
 
 struct clear_refs_private {
 	enum clear_refs_types type;
+	struct mmu_gather tlb;
 };
 
+static int tlb_pre_vma(unsigned long start, unsigned long end,
+		       struct mm_walk *walk)
+{
+	struct clear_refs_private *cp = walk->private;
+
+	tlb_start_vma(&cp->tlb, walk->vma);
+	return 0;
+}
+
+static void tlb_post_vma(struct mm_walk *walk)
+{
+	struct clear_refs_private *cp = walk->private;
+
+	tlb_end_vma(&cp->tlb, walk->vma);
+}
+
 #ifdef CONFIG_MEM_SOFT_DIRTY
 
 #define is_cow_mapping(flags) (((flags) & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE)
@@ -1140,6 +1157,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
 		/* Clear accessed and referenced bits. */
 		pmdp_test_and_clear_young(vma, addr, pmd);
 		test_and_clear_page_young(page);
+		tlb_flush_pmd_range(&cp->tlb, addr, HPAGE_PMD_SIZE);
 		ClearPageReferenced(page);
 out:
 		spin_unlock(ptl);
@@ -1155,6 +1173,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
 
 		if (cp->type == CLEAR_REFS_SOFT_DIRTY) {
 			clear_soft_dirty(vma, addr, pte);
+			tlb_flush_pte_range(&cp->tlb, addr, PAGE_SIZE);
 			continue;
 		}
 
@@ -1168,6 +1187,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
 		/* Clear accessed and referenced bits. */
 		ptep_test_and_clear_young(vma, addr, pte);
 		test_and_clear_page_young(page);
+		tlb_flush_pte_range(&cp->tlb, addr, PAGE_SIZE);
 		ClearPageReferenced(page);
 	}
 	pte_unmap_unlock(pte - 1, ptl);
@@ -1198,6 +1218,8 @@ static int clear_refs_test_walk(unsigned long start, unsigned long end,
 }
 
 static const struct mm_walk_ops clear_refs_walk_ops = {
+	.pre_vma		= tlb_pre_vma,
+	.post_vma		= tlb_post_vma,
 	.pmd_entry		= clear_refs_pte_range,
 	.test_walk		= clear_refs_test_walk,
 };
@@ -1248,6 +1270,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 			goto out_unlock;
 		}
 
+		tlb_gather_mmu(&cp.tlb, mm);
 		if (type == CLEAR_REFS_SOFT_DIRTY) {
 			for (vma = mm->mmap; vma; vma = vma->vm_next) {
 				if (!(vma->vm_flags & VM_SOFTDIRTY))
@@ -1256,7 +1279,6 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 				vma_set_page_prot(vma);
 			}
 
-			inc_tlb_flush_pending(mm);
 			mmu_notifier_range_init(&range, MMU_NOTIFY_SOFT_DIRTY,
 						0, NULL, mm, 0, -1UL);
 			mmu_notifier_invalidate_range_start(&range);
@@ -1265,10 +1287,9 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 				&cp);
 		if (type == CLEAR_REFS_SOFT_DIRTY) {
 			mmu_notifier_invalidate_range_end(&range);
-			flush_tlb_mm(mm);
-			dec_tlb_flush_pending(mm);
 		}
 out_unlock:
+		tlb_finish_mmu(&cp.tlb);
 		mmap_write_unlock(mm);
 out_mm:
 		mmput(mm);
-- 
2.25.1




* [RFC 07/20] mm: move x86 tlb_gen to generic code
  2021-01-31  0:11 [RFC 00/20] TLB batching consolidation and enhancements Nadav Amit
                   ` (5 preceding siblings ...)
  2021-01-31  0:11 ` [RFC 06/20] fs/task_mmu: use mmu_gather interface of clear-soft-dirty Nadav Amit
@ 2021-01-31  0:11 ` Nadav Amit
  2021-01-31 18:26   ` Andy Lutomirski
  2021-01-31  0:11 ` [RFC 08/20] mm: store completed TLB generation Nadav Amit
                   ` (14 subsequent siblings)
  21 siblings, 1 reply; 67+ messages in thread
From: Nadav Amit @ 2021-01-31  0:11 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Nadav Amit, Andrea Arcangeli, Andrew Morton, Andy Lutomirski,
	Dave Hansen, Peter Zijlstra, Thomas Gleixner, Will Deacon,
	Yu Zhao, Nick Piggin, x86

From: Nadav Amit <namit@vmware.com>

x86 currently has TLB-generation tracking logic that other architectures
could use as well (as long as they implement some additional logic).

Extract the relevant pieces of code from x86 into the generic TLB code.
This allows the following "fine granularity deferred TLB flushes
detection" patches to be written without making them x86-specific.

Signed-off-by: Nadav Amit <namit@vmware.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: x86@kernel.org
---
 arch/x86/Kconfig                   |  1 +
 arch/x86/include/asm/mmu.h         | 10 --------
 arch/x86/include/asm/mmu_context.h |  1 -
 arch/x86/include/asm/tlbflush.h    | 18 --------------
 arch/x86/mm/tlb.c                  |  8 +++---
 drivers/firmware/efi/efi.c         |  1 +
 include/linux/mm_types.h           | 39 ++++++++++++++++++++++++++++++
 init/Kconfig                       |  6 +++++
 kernel/fork.c                      |  2 ++
 mm/init-mm.c                       |  1 +
 mm/rmap.c                          |  9 ++++++-
 11 files changed, 62 insertions(+), 34 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 591efb2476bc..6bd4d626a6b3 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -86,6 +86,7 @@ config X86
 	select ARCH_HAS_SYSCALL_WRAPPER
 	select ARCH_HAS_UBSAN_SANITIZE_ALL
 	select ARCH_HAS_DEBUG_WX
+	select ARCH_HAS_TLB_GENERATIONS
 	select ARCH_HAVE_NMI_SAFE_CMPXCHG
 	select ARCH_MIGHT_HAVE_ACPI_PDC		if ACPI
 	select ARCH_MIGHT_HAVE_PC_PARPORT
diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
index 5d7494631ea9..134454956c96 100644
--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -23,16 +23,6 @@ typedef struct {
 	 */
 	u64 ctx_id;
 
-	/*
-	 * Any code that needs to do any sort of TLB flushing for this
-	 * mm will first make its changes to the page tables, then
-	 * increment tlb_gen, then flush.  This lets the low-level
-	 * flushing code keep track of what needs flushing.
-	 *
-	 * This is not used on Xen PV.
-	 */
-	atomic64_t tlb_gen;
-
 #ifdef CONFIG_MODIFY_LDT_SYSCALL
 	struct rw_semaphore	ldt_usr_sem;
 	struct ldt_struct	*ldt;
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 27516046117a..e7597c642270 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -105,7 +105,6 @@ static inline int init_new_context(struct task_struct *tsk,
 	mutex_init(&mm->context.lock);
 
 	mm->context.ctx_id = atomic64_inc_return(&last_mm_ctx_id);
-	atomic64_set(&mm->context.tlb_gen, 0);
 
 #ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
 	if (cpu_feature_enabled(X86_FEATURE_OSPKE)) {
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index a617dc0a9b06..2110b98026a7 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -235,24 +235,6 @@ static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
 	flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, PAGE_SHIFT, false);
 }
 
-static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
-{
-	/*
-	 * Bump the generation count.  This also serves as a full barrier
-	 * that synchronizes with switch_mm(): callers are required to order
-	 * their read of mm_cpumask after their writes to the paging
-	 * structures.
-	 */
-	return atomic64_inc_return(&mm->context.tlb_gen);
-}
-
-static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
-					struct mm_struct *mm)
-{
-	inc_mm_tlb_gen(mm);
-	cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
-}
-
 extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
 
 static inline bool pte_may_need_flush(pte_t oldpte, pte_t newpte)
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 569ac1d57f55..7ab21430be41 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -511,7 +511,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		 * the TLB shootdown code.
 		 */
 		smp_mb();
-		next_tlb_gen = atomic64_read(&next->context.tlb_gen);
+		next_tlb_gen = atomic64_read(&next->tlb_gen);
 		if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) ==
 				next_tlb_gen)
 			return;
@@ -546,7 +546,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		 */
 		if (next != &init_mm)
 			cpumask_set_cpu(cpu, mm_cpumask(next));
-		next_tlb_gen = atomic64_read(&next->context.tlb_gen);
+		next_tlb_gen = atomic64_read(&next->tlb_gen);
 
 		choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);
 
@@ -618,7 +618,7 @@ void initialize_tlbstate_and_flush(void)
 {
 	int i;
 	struct mm_struct *mm = this_cpu_read(cpu_tlbstate.loaded_mm);
-	u64 tlb_gen = atomic64_read(&init_mm.context.tlb_gen);
+	u64 tlb_gen = atomic64_read(&init_mm.tlb_gen);
 	unsigned long cr3 = __read_cr3();
 
 	/* Assert that CR3 already references the right mm. */
@@ -667,7 +667,7 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
 	 */
 	struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
 	u32 loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
-	u64 mm_tlb_gen = atomic64_read(&loaded_mm->context.tlb_gen);
+	u64 mm_tlb_gen = atomic64_read(&loaded_mm->tlb_gen);
 	u64 local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen);
 
 	/* This code cannot presently handle being reentered. */
diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index df3f9bcab581..02a6a1c81576 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -62,6 +62,7 @@ struct mm_struct efi_mm = {
 	.page_table_lock	= __SPIN_LOCK_UNLOCKED(efi_mm.page_table_lock),
 	.mmlist			= LIST_HEAD_INIT(efi_mm.mmlist),
 	.cpu_bitmap		= { [BITS_TO_LONGS(NR_CPUS)] = 0},
+	INIT_TLB_GENERATIONS
 };
 
 struct workqueue_struct *efi_rts_wq;
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 0974ad501a47..2035ac319c2b 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -560,6 +560,17 @@ struct mm_struct {
 
 #ifdef CONFIG_IOMMU_SUPPORT
 		u32 pasid;
+#endif
+#ifdef CONFIG_ARCH_HAS_TLB_GENERATIONS
+		/*
+		 * Any code that needs to do any sort of TLB flushing for this
+		 * mm will first make its changes to the page tables, then
+		 * increment tlb_gen, then flush.  This lets the low-level
+		 * flushing code keep track of what needs flushing.
+		 *
+		 * This is not used on Xen PV.
+		 */
+		atomic64_t tlb_gen;
 #endif
 	} __randomize_layout;
 
@@ -676,6 +687,34 @@ static inline bool mm_tlb_flush_nested(struct mm_struct *mm)
 	return atomic_read(&mm->tlb_flush_pending) > 1;
 }
 
+#ifdef CONFIG_ARCH_HAS_TLB_GENERATIONS
+static inline void init_mm_tlb_gen(struct mm_struct *mm)
+{
+	atomic64_set(&mm->tlb_gen, 0);
+}
+
+static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
+{
+	/*
+	 * Bump the generation count.  This also serves as a full barrier
+	 * that synchronizes with switch_mm(): callers are required to order
+	 * their read of mm_cpumask after their writes to the paging
+	 * structures.
+	 */
+	return atomic64_inc_return(&mm->tlb_gen);
+}
+
+#define INIT_TLB_GENERATIONS					\
+	.tlb_gen	= ATOMIC64_INIT(0),
+
+#else
+
+static inline void init_mm_tlb_gen(struct mm_struct *mm) { }
+
+#define INIT_TLB_GENERATIONS
+
+#endif
+
 struct vm_fault;
 
 /**
diff --git a/init/Kconfig b/init/Kconfig
index b77c60f8b963..3d11a0f7c8cc 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -842,6 +842,12 @@ config ARCH_SUPPORTS_NUMA_BALANCING
 # and the refill costs are offset by the savings of sending fewer IPIs.
 config ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
 	bool
+	depends on ARCH_HAS_TLB_GENERATIONS
+
+#
+# For architectures that track the TLB generation of each address space.
+config ARCH_HAS_TLB_GENERATIONS
+	bool
 
 config CC_HAS_INT128
 	def_bool !$(cc-option,$(m64-flag) -D__SIZEOF_INT128__=0) && 64BIT
diff --git a/kernel/fork.c b/kernel/fork.c
index d66cd1014211..3e735a86ab2c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1027,6 +1027,8 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	RCU_INIT_POINTER(mm->exe_file, NULL);
 	mmu_notifier_subscriptions_init(mm);
 	init_tlb_flush_pending(mm);
+	init_mm_tlb_gen(mm);
+
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
 	mm->pmd_huge_pte = NULL;
 #endif
diff --git a/mm/init-mm.c b/mm/init-mm.c
index 153162669f80..ef3a471f4de4 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -38,5 +38,6 @@ struct mm_struct init_mm = {
 	.mmlist		= LIST_HEAD_INIT(init_mm.mmlist),
 	.user_ns	= &init_user_ns,
 	.cpu_bitmap	= CPU_BITS_NONE,
+	INIT_TLB_GENERATIONS
 	INIT_MM_CONTEXT(init_mm)
 };
diff --git a/mm/rmap.c b/mm/rmap.c
index 08c56aaf72eb..9655e1fc328a 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -613,11 +613,18 @@ void try_to_unmap_flush_dirty(void)
 		try_to_unmap_flush();
 }
 
+static inline void tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
+				   struct mm_struct *mm)
+{
+	inc_mm_tlb_gen(mm);
+	cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
+}
+
 static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
 {
 	struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
 
-	arch_tlbbatch_add_mm(&tlb_ubc->arch, mm);
+	tlbbatch_add_mm(&tlb_ubc->arch, mm);
 	tlb_ubc->flush_required = true;
 
 	/*
-- 
2.25.1




* [RFC 08/20] mm: store completed TLB generation
  2021-01-31  0:11 [RFC 00/20] TLB batching consolidation and enhancements Nadav Amit
                   ` (6 preceding siblings ...)
  2021-01-31  0:11 ` [RFC 07/20] mm: move x86 tlb_gen to generic code Nadav Amit
@ 2021-01-31  0:11 ` Nadav Amit
  2021-01-31 20:32   ` Andy Lutomirski
  2021-02-01 11:52   ` Peter Zijlstra
  2021-01-31  0:11 ` [RFC 09/20] mm: create pte/pmd_tlb_flush_pending() Nadav Amit
                   ` (13 subsequent siblings)
  21 siblings, 2 replies; 67+ messages in thread
From: Nadav Amit @ 2021-01-31  0:11 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Nadav Amit, Andrea Arcangeli, Andrew Morton, Andy Lutomirski,
	Dave Hansen, Peter Zijlstra, Thomas Gleixner, Will Deacon,
	Yu Zhao, Nick Piggin, x86

From: Nadav Amit <namit@vmware.com>

To detect deferred TLB flushes at fine granularity, we need to keep
track of the completed TLB flush generation for each mm.

Add logic to track tlb_gen_completed for each mm, which holds the
completed TLB generation. It is the architecture's responsibility to
call mark_mm_tlb_gen_done() whenever a TLB flush is completed.

Start the generation numbers from 1 instead of 0. This allows later
patches to detect whether flushes of a certain generation were completed.
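
For illustration only (this is not part of the patch), the scheme can be
sketched in userspace C11. struct mm_gen, mm_gen_init(), mm_gen_inc() and
gen_update_max() are hypothetical stand-ins for mm_struct,
init_mm_tlb_gen(), inc_mm_tlb_gen() and tlb_update_generation(). The
mm_gen_flush_pending() check is an assumption about how the two counters
might be compared, not something this patch adds.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for the two per-mm counters. */
struct mm_gen {
	_Atomic uint64_t tlb_gen;           /* bumped before each flush */
	_Atomic uint64_t tlb_gen_completed; /* highest generation known flushed */
};

/* Both counters start at 1, so generation 0 always counts as flushed. */
static void mm_gen_init(struct mm_gen *mm)
{
	atomic_init(&mm->tlb_gen, 1);
	atomic_init(&mm->tlb_gen_completed, 1);
}

/* Bump the generation and return the new value. */
static uint64_t mm_gen_inc(struct mm_gen *mm)
{
	uint64_t gen = atomic_fetch_add(&mm->tlb_gen, 1) + 1;

	/* The 64-bit counter is assumed never to wrap. */
	assert(gen != 0);
	return gen;
}

/* Raise *gen to new_gen; a stale (smaller) value never lowers it. */
static void gen_update_max(_Atomic uint64_t *gen, uint64_t new_gen)
{
	uint64_t cur = atomic_load(gen);

	while (cur < new_gen) {
		/* On failure, cur is reloaded with the current value. */
		if (atomic_compare_exchange_weak(gen, &cur, new_gen))
			break;
	}
}

/* Assumed check: a flush is outstanding if completed lags behind tlb_gen. */
static int mm_gen_flush_pending(struct mm_gen *mm)
{
	return atomic_load(&mm->tlb_gen_completed) < atomic_load(&mm->tlb_gen);
}
```

Because the update only ever raises the completed generation, flushes
that finish out of order cannot move it backwards.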

Signed-off-by: Nadav Amit <namit@vmware.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: x86@kernel.org
---
 arch/x86/mm/tlb.c         | 10 ++++++++++
 include/asm-generic/tlb.h | 33 +++++++++++++++++++++++++++++++++
 include/linux/mm_types.h  | 15 ++++++++++++++-
 3 files changed, 57 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 7ab21430be41..d17b5575531e 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -14,6 +14,7 @@
 #include <asm/nospec-branch.h>
 #include <asm/cache.h>
 #include <asm/apic.h>
+#include <asm/tlb.h>
 
 #include "mm_internal.h"
 
@@ -915,6 +916,9 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
 		flush_tlb_others(mm_cpumask(mm), info);
 
+	/* Update the completed generation */
+	mark_mm_tlb_gen_done(mm, new_tlb_gen);
+
 	put_flush_tlb_info();
 	put_cpu();
 }
@@ -1147,6 +1151,12 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 
 	cpumask_clear(&batch->cpumask);
 
+	/*
+	 * We cannot call mark_mm_tlb_gen_done() since we do not know which
+	 * mm's should be flushed. This may lead to some unwarranted TLB
+	 * flushes, but not to correctness problems.
+	 */
+
 	put_cpu();
 }
 
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 517c89398c83..427bfcc6cdec 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -513,6 +513,39 @@ static inline void tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vm
 }
 #endif
 
+#ifdef CONFIG_ARCH_HAS_TLB_GENERATIONS
+
+/*
+ * Helper function to raise a generation to a new value. The update is applied
+ * only if the new value is greater than the current one.
+ */
+static inline void tlb_update_generation(atomic64_t *gen, u64 new_gen)
+{
+	u64 cur_gen = atomic64_read(gen);
+
+	while (cur_gen < new_gen) {
+		u64 old_gen = atomic64_cmpxchg(gen, cur_gen, new_gen);
+
+		/* Check if we succeeded in the cmpxchg */
+		if (likely(cur_gen == old_gen))
+			break;
+
+		cur_gen = old_gen;
+	}
+}
+
+
+static inline void mark_mm_tlb_gen_done(struct mm_struct *mm, u64 gen)
+{
+	/*
+	 * Update the completed generation to the new generation if the new
+	 * generation is greater than the previous one.
+	 */
+	tlb_update_generation(&mm->tlb_gen_completed, gen);
+}
+
+#endif /* CONFIG_ARCH_HAS_TLB_GENERATIONS */
+
 /*
  * tlb_flush_{pte|pmd|pud|p4d}_range() adjust the tlb->start and tlb->end,
  * and set corresponding cleared_*.
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 2035ac319c2b..8a5eb4bfac59 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -571,6 +571,13 @@ struct mm_struct {
 		 * This is not used on Xen PV.
 		 */
 		atomic64_t tlb_gen;
+
+		/*
+		 * TLB generation which is guaranteed to be flushed, including
+		 * all the PTE changes that were performed before tlb_gen was
+		 * incremented.
+		 */
+		atomic64_t tlb_gen_completed;
 #endif
 	} __randomize_layout;
 
@@ -690,7 +697,13 @@ static inline bool mm_tlb_flush_nested(struct mm_struct *mm)
 #ifdef CONFIG_ARCH_HAS_TLB_GENERATIONS
 static inline void init_mm_tlb_gen(struct mm_struct *mm)
 {
-	atomic64_set(&mm->tlb_gen, 0);
+	/*
+	 * Start from generation of 1, so default generation 0 will be
+	 * considered as flushed and would not be regarded as an outstanding
+	 * deferred invalidation.
+	 */
+	atomic64_set(&mm->tlb_gen, 1);
+	atomic64_set(&mm->tlb_gen_completed, 1);
 }
 
 static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
-- 
2.25.1




* [RFC 09/20] mm: create pte/pmd_tlb_flush_pending()
  2021-01-31  0:11 [RFC 00/20] TLB batching consolidation and enhancements Nadav Amit
                   ` (7 preceding siblings ...)
  2021-01-31  0:11 ` [RFC 08/20] mm: store completed TLB generation Nadav Amit
@ 2021-01-31  0:11 ` Nadav Amit
  2021-01-31  0:11 ` [RFC 10/20] mm: add pte_to_page() Nadav Amit
                   ` (12 subsequent siblings)
  21 siblings, 0 replies; 67+ messages in thread
From: Nadav Amit @ 2021-01-31  0:11 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Nadav Amit, Andrea Arcangeli, Andrew Morton, Andy Lutomirski,
	Dave Hansen, Peter Zijlstra, Thomas Gleixner, Will Deacon,
	Yu Zhao, Nick Piggin, x86

From: Nadav Amit <namit@vmware.com>

In preparation for fine(r) granularity, introduce
pte_tlb_flush_pending() and pmd_tlb_flush_pending(). For now, both
functions simply call mm_tlb_flush_pending().

Change pte_accessible() to take the vma and a pointer to the PTE instead
of the mm and the PTE value.

No functional change. Later patches will use this information on
architectures that use per-table deferred TLB tracking.
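
As a rough userspace sketch of the new calling convention (not kernel
code), the x86 variant of the check looks roughly like the following. The
demo_* types, the acc_pte_t type and the DEMO_PAGE_* bit values are
hypothetical and chosen only for the example; the real definitions are
architecture-specific.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits; real values are architecture-specific. */
#define DEMO_PAGE_PRESENT  (1u << 0)
#define DEMO_PAGE_PROTNONE (1u << 8)

typedef struct { uint64_t val; } acc_pte_t;

struct demo_mm  { int tlb_flush_pending; };
struct demo_vma { struct demo_mm *vm_mm; };

/* For now the PTE pointer is unused: the answer is per-mm. */
static bool demo_pte_tlb_flush_pending(const struct demo_vma *vma,
				       const acc_pte_t *pte)
{
	(void)pte;
	assert(vma && vma->vm_mm);
	return vma->vm_mm->tlb_flush_pending > 0;
}

/*
 * A PTE may still be cached in a TLB if it is present, or if it is
 * PROT_NONE while a deferred flush is outstanding.
 */
static bool demo_pte_accessible(const struct demo_vma *vma,
				const acc_pte_t *pte)
{
	if (pte->val & DEMO_PAGE_PRESENT)
		return true;

	if ((pte->val & DEMO_PAGE_PROTNONE) &&
	    demo_pte_tlb_flush_pending(vma, pte))
		return true;

	return false;
}
```

Passing the PTE pointer rather than its value gives a later, finer-grained
implementation a way to locate the page table that holds the entry.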

Signed-off-by: Nadav Amit <namit@vmware.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: x86@kernel.org
---
 arch/arm/include/asm/pgtable.h      |  4 +++-
 arch/arm64/include/asm/pgtable.h    |  4 ++--
 arch/sparc/include/asm/pgtable_64.h |  9 ++++++---
 arch/sparc/mm/init_64.c             |  2 +-
 arch/x86/include/asm/pgtable.h      |  7 +++----
 include/linux/mm_types.h            | 10 ++++++++++
 include/linux/pgtable.h             |  2 +-
 mm/huge_memory.c                    |  2 +-
 mm/ksm.c                            |  2 +-
 mm/pgtable-generic.c                |  2 +-
 10 files changed, 29 insertions(+), 15 deletions(-)

diff --git a/arch/arm/include/asm/pgtable.h b/arch/arm/include/asm/pgtable.h
index c02f24400369..59bcacc14dc3 100644
--- a/arch/arm/include/asm/pgtable.h
+++ b/arch/arm/include/asm/pgtable.h
@@ -190,7 +190,9 @@ static inline pte_t *pmd_page_vaddr(pmd_t pmd)
 #define pte_none(pte)		(!pte_val(pte))
 #define pte_present(pte)	(pte_isset((pte), L_PTE_PRESENT))
 #define pte_valid(pte)		(pte_isset((pte), L_PTE_VALID))
-#define pte_accessible(mm, pte)	(mm_tlb_flush_pending(mm) ? pte_present(pte) : pte_valid(pte))
+#define pte_accessible(vma, pte)					\
+				(pte_tlb_flush_pending(vma, pte) ?	\
+				 pte_present(*pte) : pte_valid(*pte))
 #define pte_write(pte)		(pte_isclear((pte), L_PTE_RDONLY))
 #define pte_dirty(pte)		(pte_isset((pte), L_PTE_DIRTY))
 #define pte_young(pte)		(pte_isset((pte), L_PTE_YOUNG))
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 501562793ce2..f14f1e9dbc3e 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -126,8 +126,8 @@ extern unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)];
  * flag, since ptep_clear_flush_young() elides a DSB when invalidating the
  * TLB.
  */
-#define pte_accessible(mm, pte)	\
-	(mm_tlb_flush_pending(mm) ? pte_present(pte) : pte_valid(pte))
+#define pte_accessible(vma, pte)	\
+	(pte_tlb_flush_pending(vma, pte) ? pte_present(*pte) : pte_valid(*pte))
 
 /*
  * p??_access_permitted() is true for valid user mappings (subject to the
diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index 550d3904de65..749efd9c49c9 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -673,9 +673,9 @@ static inline unsigned long pte_present(pte_t pte)
 }
 
 #define pte_accessible pte_accessible
-static inline unsigned long pte_accessible(struct mm_struct *mm, pte_t a)
+static inline unsigned long pte_accessible(struct vm_area_struct *vma, pte_t *a)
 {
-	return pte_val(a) & _PAGE_VALID;
+	return pte_val(*a) & _PAGE_VALID;
 }
 
 static inline unsigned long pte_special(pte_t pte)
@@ -906,8 +906,11 @@ static void maybe_tlb_batch_add(struct mm_struct *mm, unsigned long vaddr,
 	 *
 	 * SUN4V NOTE: _PAGE_VALID is the same value in both the SUN4U
 	 *             and SUN4V pte layout, so this inline test is fine.
+	 *
+	 * The vma is not propagated to this point, but it is not used by
+	 * sparc's pte_accessible(). We therefore provide NULL.
 	 */
-	if (likely(mm != &init_mm) && pte_accessible(mm, orig))
+	if (likely(mm != &init_mm) && pte_accessible(NULL, ptep))
 		tlb_batch_add(mm, vaddr, ptep, orig, fullmm, hugepage_shift);
 }
 
diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 182bb7bdaa0a..bda397aa9709 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -404,7 +404,7 @@ void update_mmu_cache(struct vm_area_struct *vma, unsigned long address, pte_t *
 	mm = vma->vm_mm;
 
 	/* Don't insert a non-valid PTE into the TSB, we'll deadlock.  */
-	if (!pte_accessible(mm, pte))
+	if (!pte_accessible(vma, ptep))
 		return;
 
 	spin_lock_irqsave(&mm->context.lock, flags);
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index a02c67291cfc..a0e069c15dbc 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -775,13 +775,12 @@ static inline int pte_devmap(pte_t a)
 #endif
 
 #define pte_accessible pte_accessible
-static inline bool pte_accessible(struct mm_struct *mm, pte_t a)
+static inline bool pte_accessible(struct vm_area_struct *vma, pte_t *a)
 {
-	if (pte_flags(a) & _PAGE_PRESENT)
+	if (pte_flags(*a) & _PAGE_PRESENT)
 		return true;
 
-	if ((pte_flags(a) & _PAGE_PROTNONE) &&
-			mm_tlb_flush_pending(mm))
+	if ((pte_flags(*a) & _PAGE_PROTNONE) && pte_tlb_flush_pending(vma, a))
 		return true;
 
 	return false;
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 8a5eb4bfac59..812ee0fd4c35 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -682,6 +682,16 @@ static inline bool mm_tlb_flush_pending(struct mm_struct *mm)
 	return atomic_read(&mm->tlb_flush_pending);
 }
 
+static inline bool pte_tlb_flush_pending(struct vm_area_struct *vma, pte_t *pte)
+{
+	return mm_tlb_flush_pending(vma->vm_mm);
+}
+
+static inline bool pmd_tlb_flush_pending(struct vm_area_struct *vma, pmd_t *pmd)
+{
+	return mm_tlb_flush_pending(vma->vm_mm);
+}
+
 static inline bool mm_tlb_flush_nested(struct mm_struct *mm)
 {
 	/*
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 8fcdfa52eb4b..e8bce53ca3e8 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -725,7 +725,7 @@ static inline void arch_swap_restore(swp_entry_t entry, struct page *page)
 #endif
 
 #ifndef pte_accessible
-# define pte_accessible(mm, pte)	((void)(pte), 1)
+# define pte_accessible(vma, pte)	((void)(pte), 1)
 #endif
 
 #ifndef flush_tlb_fix_spurious_fault
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c345b8b06183..c4b7c00cc69c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1514,7 +1514,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd)
 	 * We are not sure a pending tlb flush here is for a huge page
 	 * mapping or not. Hence use the tlb range variant
 	 */
-	if (mm_tlb_flush_pending(vma->vm_mm)) {
+	if (pmd_tlb_flush_pending(vma, vmf->pmd)) {
 		flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
 		/*
 		 * change_huge_pmd() released the pmd lock before
diff --git a/mm/ksm.c b/mm/ksm.c
index 9694ee2c71de..515acbffc283 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1060,7 +1060,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 
 	if (pte_write(*pvmw.pte) || pte_dirty(*pvmw.pte) ||
 	    (pte_protnone(*pvmw.pte) && pte_savedwrite(*pvmw.pte)) ||
-						mm_tlb_flush_pending(mm)) {
+					pte_tlb_flush_pending(vma, pvmw.pte)) {
 		pte_t entry;
 
 		swapped = PageSwapCache(page);
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 9578db83e312..2ca66e269d33 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -93,7 +93,7 @@ pte_t ptep_clear_flush(struct vm_area_struct *vma, unsigned long address,
 	struct mm_struct *mm = (vma)->vm_mm;
 	pte_t pte;
 	pte = ptep_get_and_clear(mm, address, ptep);
-	if (pte_accessible(mm, pte))
+	if (pte_accessible(vma, ptep))
 		flush_tlb_page(vma, address);
 	return pte;
 }
-- 
2.25.1




* [RFC 10/20] mm: add pte_to_page()
  2021-01-31  0:11 [RFC 00/20] TLB batching consolidation and enhancements Nadav Amit
                   ` (8 preceding siblings ...)
  2021-01-31  0:11 ` [RFC 09/20] mm: create pte/pmd_tlb_flush_pending() Nadav Amit
@ 2021-01-31  0:11 ` Nadav Amit
  2021-01-31  0:11 ` [RFC 11/20] mm/tlb: remove arch-specific tlb_start/end_vma() Nadav Amit
                   ` (11 subsequent siblings)
  21 siblings, 0 replies; 67+ messages in thread
From: Nadav Amit @ 2021-01-31  0:11 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Nadav Amit, Andrea Arcangeli, Andrew Morton, Andy Lutomirski,
	Dave Hansen, Peter Zijlstra, Thomas Gleixner, Will Deacon,
	Nick Piggin, Yu Zhao

From: Nadav Amit <namit@vmware.com>

Add pte_to_page(), which is similar to pmd_to_page() and will be used
later.

Inline pmd_to_page() as well.
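
The masking trick can be illustrated in userspace (a sketch, not kernel
code). It assumes a page table is a naturally aligned array of
PTRS_PER_PTE entries, so clearing the low bits of a pointer to any entry
yields the start of the table. The demo_* names and the values 512 and
8 bytes are assumptions picked for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed layout: 512 entries of 8 bytes each, i.e. a 4096-byte table. */
typedef uint64_t demo_pte_t;
#define DEMO_PTRS_PER_PTE 512
#define DEMO_TABLE_BYTES  (DEMO_PTRS_PER_PTE * sizeof(demo_pte_t))

/* Round a pointer to an entry down to the start of its table. */
static demo_pte_t *demo_pte_table_base(demo_pte_t *pte)
{
	uintptr_t mask = ~((uintptr_t)DEMO_TABLE_BYTES - 1);

	assert(pte != NULL);
	return (demo_pte_t *)((uintptr_t)pte & mask);
}

/* Allocate a table aligned to its own size, as the mask requires. */
static demo_pte_t *demo_alloc_table(void)
{
	return aligned_alloc(DEMO_TABLE_BYTES, DEMO_TABLE_BYTES);
}
```

In the kernel the resulting address is then passed to virt_to_page() to
obtain the struct page of the page table.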

Signed-off-by: Nadav Amit <namit@vmware.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
---
 include/linux/mm.h | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ecdf8a8cd6ae..d78a79fbb012 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2161,6 +2161,13 @@ static inline spinlock_t *ptlock_ptr(struct page *page)
 }
 #endif /* ALLOC_SPLIT_PTLOCKS */
 
+static inline struct page *pte_to_page(pte_t *pte)
+{
+	unsigned long mask = ~(PTRS_PER_PTE * sizeof(pte_t) - 1);
+
+	return virt_to_page((void *)((unsigned long) pte & mask));
+}
+
 static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
 {
 	return ptlock_ptr(pmd_page(*pmd));
@@ -2246,7 +2253,7 @@ static inline void pgtable_pte_page_dtor(struct page *page)
 
 #if USE_SPLIT_PMD_PTLOCKS
 
-static struct page *pmd_to_page(pmd_t *pmd)
+static inline struct page *pmd_to_page(pmd_t *pmd)
 {
 	unsigned long mask = ~(PTRS_PER_PMD * sizeof(pmd_t) - 1);
 	return virt_to_page((void *)((unsigned long) pmd & mask));
-- 
2.25.1




* [RFC 11/20] mm/tlb: remove arch-specific tlb_start/end_vma()
  2021-01-31  0:11 [RFC 00/20] TLB batching consolidation and enhancements Nadav Amit
                   ` (9 preceding siblings ...)
  2021-01-31  0:11 ` [RFC 10/20] mm: add pte_to_page() Nadav Amit
@ 2021-01-31  0:11 ` Nadav Amit
  2021-02-01 12:09   ` Peter Zijlstra
  2021-01-31  0:11 ` [RFC 12/20] mm/tlb: save the VMA that is flushed during tlb_start_vma() Nadav Amit
                   ` (10 subsequent siblings)
  21 siblings, 1 reply; 67+ messages in thread
From: Nadav Amit @ 2021-01-31  0:11 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Nadav Amit, Andrea Arcangeli, Andrew Morton, Andy Lutomirski,
	Dave Hansen, Peter Zijlstra, Thomas Gleixner, Will Deacon,
	Yu Zhao, Nick Piggin, linux-csky, linuxppc-dev, linux-s390, x86

From: Nadav Amit <namit@vmware.com>

Architecture-specific tlb_start_vma() and tlb_end_vma() seem
unnecessary. They are currently used to:

1. Avoid per-VMA TLB flushes. This can be determined by introducing
   a new config option.

2. Avoid saving information on the vma that is being flushed. Saving
   this information, even for architectures that do not need it, is
   cheap and we will need it for per-VMA deferred TLB flushing.

3. Avoid calling flush_cache_range().

Remove the architecture specific tlb_start_vma() and tlb_end_vma() in
the following manner, corresponding to the previous requirements:

1. Introduce a new config option -
   ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING - to allow architectures to
   define whether they want aggressive TLB flush batching (instead of
   flushing mappings of each VMA separately).

2. Save information on the vma regardless of architecture. Saving this
   information should have negligible overhead, and it will be needed
   for fine-granularity TLB flushes.

3. flush_cache_range() is anyhow not defined for the architectures that
   implement tlb_start/end_vma().

No functional change intended.

Signed-off-by: Nadav Amit <namit@vmware.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: linux-csky@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s390@vger.kernel.org
Cc: x86@kernel.org
---
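A rough userspace sketch of the behaviour described in the commit
message above, not kernel code. All names here (toy_gather,
toy_end_vma() and so on) are hypothetical, and the aggressive_batching
field is an assumption that stands in for the compile-time
IS_ENABLED(CONFIG_ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING) check so
that both settings can be exercised in one build:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the few mmu_gather fields that matter here. */
struct toy_gather {
	bool fullmm;			/* whole address space is torn down */
	bool aggressive_batching;	/* models the new config option */
	unsigned long start, end;	/* pending range; end == 0: empty */
	unsigned int flushes;		/* range flushes issued so far */
};

static void toy_gather_init(struct toy_gather *tlb, bool aggressive)
{
	tlb->fullmm = false;
	tlb->aggressive_batching = aggressive;
	tlb->start = ~0UL;
	tlb->end = 0;
	tlb->flushes = 0;
}

/* Grow the pending range so that it covers [start, end). */
static void toy_track_range(struct toy_gather *tlb, unsigned long start,
			    unsigned long end)
{
	assert(start < end);
	if (start < tlb->start)
		tlb->start = start;
	if (end > tlb->end)
		tlb->end = end;
}

/* Flush the pending range, if any, and reset it. */
static void toy_flush_tlbonly(struct toy_gather *tlb)
{
	if (!tlb->end)
		return;
	tlb->flushes++;
	tlb->start = ~0UL;
	tlb->end = 0;
}

/* Mirrors the shape of the generic tlb_end_vma() after this patch. */
static void toy_end_vma(struct toy_gather *tlb)
{
	if (tlb->fullmm)
		return;
	if (tlb->aggressive_batching)
		return;	/* keep batching across VMA boundaries */
	toy_flush_tlbonly(tlb);
}
```

With aggressive batching the pending range simply keeps growing across
VMA boundaries (including any hole between the VMAs) and is flushed
once at the end; without it, each toy_end_vma() call flushes and resets
the range.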
 arch/csky/Kconfig               |  1 +
 arch/csky/include/asm/tlb.h     | 12 ------------
 arch/powerpc/Kconfig            |  1 +
 arch/powerpc/include/asm/tlb.h  |  2 --
 arch/s390/Kconfig               |  1 +
 arch/s390/include/asm/tlb.h     |  3 ---
 arch/sparc/Kconfig              |  1 +
 arch/sparc/include/asm/tlb_64.h |  2 --
 arch/x86/Kconfig                |  1 +
 arch/x86/include/asm/tlb.h      |  3 ---
 include/asm-generic/tlb.h       | 15 +++++----------
 init/Kconfig                    |  8 ++++++++
 12 files changed, 18 insertions(+), 32 deletions(-)

diff --git a/arch/csky/Kconfig b/arch/csky/Kconfig
index 89dd2fcf38fa..924ff5721240 100644
--- a/arch/csky/Kconfig
+++ b/arch/csky/Kconfig
@@ -8,6 +8,7 @@ config CSKY
 	select ARCH_HAS_SYNC_DMA_FOR_DEVICE
 	select ARCH_USE_BUILTIN_BSWAP
 	select ARCH_USE_QUEUED_RWLOCKS if NR_CPUS>2
+	select ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING
 	select ARCH_WANT_FRAME_POINTERS if !CPU_CK610
 	select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
 	select COMMON_CLK
diff --git a/arch/csky/include/asm/tlb.h b/arch/csky/include/asm/tlb.h
index fdff9b8d70c8..8130a5f09a6b 100644
--- a/arch/csky/include/asm/tlb.h
+++ b/arch/csky/include/asm/tlb.h
@@ -6,18 +6,6 @@
 
 #include <asm/cacheflush.h>
 
-#define tlb_start_vma(tlb, vma) \
-	do { \
-		if (!(tlb)->fullmm) \
-			flush_cache_range(vma, (vma)->vm_start, (vma)->vm_end); \
-	}  while (0)
-
-#define tlb_end_vma(tlb, vma) \
-	do { \
-		if (!(tlb)->fullmm) \
-			flush_tlb_range(vma, (vma)->vm_start, (vma)->vm_end); \
-	}  while (0)
-
 #define tlb_flush(tlb) flush_tlb_mm((tlb)->mm)
 
 #include <asm-generic/tlb.h>
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 107bb4319e0e..d9761b6f192a 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -151,6 +151,7 @@ config PPC
 	select ARCH_USE_CMPXCHG_LOCKREF		if PPC64
 	select ARCH_USE_QUEUED_RWLOCKS		if PPC_QUEUED_SPINLOCKS
 	select ARCH_USE_QUEUED_SPINLOCKS	if PPC_QUEUED_SPINLOCKS
+	select ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING
 	select ARCH_WANT_IPC_PARSE_VERSION
 	select ARCH_WANT_IRQS_OFF_ACTIVATE_MM
 	select ARCH_WANT_LD_ORPHAN_WARN
diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
index 160422a439aa..880b7daf904e 100644
--- a/arch/powerpc/include/asm/tlb.h
+++ b/arch/powerpc/include/asm/tlb.h
@@ -19,8 +19,6 @@
 
 #include <linux/pagemap.h>
 
-#define tlb_start_vma(tlb, vma)	do { } while (0)
-#define tlb_end_vma(tlb, vma)	do { } while (0)
 #define __tlb_remove_tlb_entry	__tlb_remove_tlb_entry
 
 #define tlb_flush tlb_flush
diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index c72874f09741..5b3dc5ca9873 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -113,6 +113,7 @@ config S390
 	select ARCH_USE_BUILTIN_BSWAP
 	select ARCH_USE_CMPXCHG_LOCKREF
 	select ARCH_WANTS_DYNAMIC_TASK_STRUCT
+	select ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING
 	select ARCH_WANT_DEFAULT_BPF_JIT
 	select ARCH_WANT_IPC_PARSE_VERSION
 	select BUILDTIME_TABLE_SORT
diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h
index 954fa8ca6cbd..03f31d59f97c 100644
--- a/arch/s390/include/asm/tlb.h
+++ b/arch/s390/include/asm/tlb.h
@@ -27,9 +27,6 @@ static inline void tlb_flush(struct mmu_gather *tlb);
 static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
 					  struct page *page, int page_size);
 
-#define tlb_start_vma(tlb, vma)			do { } while (0)
-#define tlb_end_vma(tlb, vma)			do { } while (0)
-
 #define tlb_flush tlb_flush
 #define pte_free_tlb pte_free_tlb
 #define pmd_free_tlb pmd_free_tlb
diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index c9c34dc52b7d..fb46e1b6f177 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -51,6 +51,7 @@ config SPARC
 	select NEED_DMA_MAP_STATE
 	select NEED_SG_DMA_LENGTH
 	select SET_FS
+	select ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING
 
 config SPARC32
 	def_bool !64BIT
diff --git a/arch/sparc/include/asm/tlb_64.h b/arch/sparc/include/asm/tlb_64.h
index 779a5a0f0608..3037187482db 100644
--- a/arch/sparc/include/asm/tlb_64.h
+++ b/arch/sparc/include/asm/tlb_64.h
@@ -22,8 +22,6 @@ void smp_flush_tlb_mm(struct mm_struct *mm);
 void __flush_tlb_pending(unsigned long, unsigned long, unsigned long *);
 void flush_tlb_pending(void);
 
-#define tlb_start_vma(tlb, vma) do { } while (0)
-#define tlb_end_vma(tlb, vma)	do { } while (0)
 #define tlb_flush(tlb)	flush_tlb_pending()
 
 /*
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 6bd4d626a6b3..d56b0f5cb00c 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -101,6 +101,7 @@ config X86
 	select ARCH_USE_QUEUED_RWLOCKS
 	select ARCH_USE_QUEUED_SPINLOCKS
 	select ARCH_USE_SYM_ANNOTATIONS
+	select ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING
 	select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
 	select ARCH_WANT_DEFAULT_BPF_JIT	if X86_64
 	select ARCH_WANTS_DYNAMIC_TASK_STRUCT
diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
index 1bfe979bb9bc..580636cdc257 100644
--- a/arch/x86/include/asm/tlb.h
+++ b/arch/x86/include/asm/tlb.h
@@ -2,9 +2,6 @@
 #ifndef _ASM_X86_TLB_H
 #define _ASM_X86_TLB_H
 
-#define tlb_start_vma(tlb, vma) do { } while (0)
-#define tlb_end_vma(tlb, vma) do { } while (0)
-
 #define tlb_flush tlb_flush
 static inline void tlb_flush(struct mmu_gather *tlb);
 
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 427bfcc6cdec..b97136b7010b 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -334,8 +334,8 @@ static inline void __tlb_reset_range(struct mmu_gather *tlb)
 
 #ifdef CONFIG_MMU_GATHER_NO_RANGE
 
-#if defined(tlb_flush) || defined(tlb_start_vma) || defined(tlb_end_vma)
-#error MMU_GATHER_NO_RANGE relies on default tlb_flush(), tlb_start_vma() and tlb_end_vma()
+#if defined(tlb_flush)
+#error MMU_GATHER_NO_RANGE relies on default tlb_flush()
 #endif
 
 /*
@@ -362,10 +362,6 @@ static inline void tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vm
 
 #ifndef tlb_flush
 
-#if defined(tlb_start_vma) || defined(tlb_end_vma)
-#error Default tlb_flush() relies on default tlb_start_vma() and tlb_end_vma()
-#endif
-
 /*
  * When an architecture does not provide its own tlb_flush() implementation
  * but does have a reasonably efficient flush_vma_range() implementation
@@ -486,7 +482,6 @@ static inline unsigned long tlb_get_unmap_size(struct mmu_gather *tlb)
  * case where we're doing a full MM flush.  When we're doing a munmap,
  * the vmas are adjusted to only cover the region to be torn down.
  */
-#ifndef tlb_start_vma
 static inline void tlb_start_vma(struct mmu_gather *tlb, struct vm_area_struct *vma)
 {
 	if (tlb->fullmm)
@@ -495,14 +490,15 @@ static inline void tlb_start_vma(struct mmu_gather *tlb, struct vm_area_struct *
 	tlb_update_vma_flags(tlb, vma);
 	flush_cache_range(vma, vma->vm_start, vma->vm_end);
 }
-#endif
 
-#ifndef tlb_end_vma
 static inline void tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vma)
 {
 	if (tlb->fullmm)
 		return;
 
+	if (IS_ENABLED(CONFIG_ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING))
+		return;
+
 	/*
 	 * Do a TLB flush and reset the range at VMA boundaries; this avoids
 	 * the ranges growing with the unused space between consecutive VMAs,
@@ -511,7 +507,6 @@ static inline void tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vm
 	 */
 	tlb_flush_mmu_tlbonly(tlb);
 }
-#endif
 
 #ifdef CONFIG_ARCH_HAS_TLB_GENERATIONS
 
diff --git a/init/Kconfig b/init/Kconfig
index 3d11a0f7c8cc..14a599a48738 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -849,6 +849,14 @@ config ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
 config ARCH_HAS_TLB_GENERATIONS
 	bool
 
+#
+# For architectures that prefer to batch TLB flushes aggressively, i.e.,
+# not to flush after changing or removing each VMA. The architecture must
+# provide its own tlb_flush() function.
+config ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING
+	bool
+	depends on !MMU_GATHER_NO_GATHER
+
 config CC_HAS_INT128
 	def_bool !$(cc-option,$(m64-flag) -D__SIZEOF_INT128__=0) && 64BIT
 
-- 
2.25.1




* [RFC 12/20] mm/tlb: save the VMA that is flushed during tlb_start_vma()
  2021-01-31  0:11 [RFC 00/20] TLB batching consolidation and enhancements Nadav Amit
                   ` (10 preceding siblings ...)
  2021-01-31  0:11 ` [RFC 11/20] mm/tlb: remove arch-specific tlb_start/end_vma() Nadav Amit
@ 2021-01-31  0:11 ` Nadav Amit
  2021-02-01 12:28   ` Peter Zijlstra
  2021-01-31  0:11 ` [RFC 13/20] mm/tlb: introduce tlb_start_ptes() and tlb_end_ptes() Nadav Amit
                   ` (9 subsequent siblings)
  21 siblings, 1 reply; 67+ messages in thread
From: Nadav Amit @ 2021-01-31  0:11 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Nadav Amit, Andrea Arcangeli, Andrew Morton, Andy Lutomirski,
	Dave Hansen, Peter Zijlstra, Thomas Gleixner, Will Deacon,
	Yu Zhao, Nick Piggin, x86

From: Nadav Amit <namit@vmware.com>

Certain architectures need information about the vma that is about to be
flushed. Currently, an artificial vma is constructed using the original
vma information. Instead of saving the flags, record the vma during
tlb_start_vma() and use this vma when calling flush_tlb_range().

Record the vma unconditionally as it would be needed for per-VMA
deferred TLB flush tracking and the overhead of tracking it
unconditionally should be negligible.

Signed-off-by: Nadav Amit <namit@vmware.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: x86@kernel.org
---
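A minimal userspace sketch of the idea in the commit message above,
under stated assumptions: the toy_* types and helpers are hypothetical,
assert() stands in for VM_BUG_ON(), and the flush itself is reduced to
remembering which flags the flush routine was able to see. It only
models the default path in which tlb_end_vma() flushes:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_VM_EXEC	0x4UL	/* hypothetical flag value */

struct toy_vma {
	unsigned long vm_flags;
};

struct toy_gather {
	struct toy_vma *vma;		/* valid between start_vma/end_vma */
	bool fullmm;
	unsigned long start, end;	/* pending range; end == 0: empty */
	unsigned long last_flushed_flags; /* what the last flush observed */
};

/* Flush the pending range using the recorded vma, not a fake one. */
static void toy_tlb_flush(struct toy_gather *tlb)
{
	if (!tlb->end)
		return;
	assert(tlb->vma != NULL);	/* models VM_BUG_ON(!tlb->vma) */
	tlb->last_flushed_flags = tlb->vma->vm_flags;
	tlb->start = ~0UL;
	tlb->end = 0;
}

static void toy_start_vma(struct toy_gather *tlb, struct toy_vma *vma)
{
	if (tlb->fullmm)
		return;
	tlb->vma = vma;
}

static void toy_end_vma(struct toy_gather *tlb)
{
	if (!tlb->fullmm)
		toy_tlb_flush(tlb);
	/* Reset the vma as a precaution, as the patch does. */
	tlb->vma = NULL;
}
```

The point of the sketch is that the flush sees the real vma (and hence
all of its flags) instead of a synthetic one rebuilt from two saved
bits, and that the pointer is only meaningful inside the
start/end bracket.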
 include/asm-generic/tlb.h | 56 +++++++++++++--------------------------
 1 file changed, 19 insertions(+), 37 deletions(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index b97136b7010b..041be2ef4426 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -252,6 +252,13 @@ extern bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
 struct mmu_gather {
 	struct mm_struct	*mm;
 
+	/*
+	 * The current vma. This information is changing upon tlb_start_vma()
+	 * and is therefore only valid between tlb_start_vma() and tlb_end_vma()
+	 * calls.
+	 */
+	struct vm_area_struct   *vma;
+
 #ifdef CONFIG_MMU_GATHER_TABLE_FREE
 	struct mmu_table_batch	*batch;
 #endif
@@ -283,12 +290,6 @@ struct mmu_gather {
 	unsigned int		cleared_puds : 1;
 	unsigned int		cleared_p4ds : 1;
 
-	/*
-	 * tracks VM_EXEC | VM_HUGETLB in tlb_start_vma
-	 */
-	unsigned int		vma_exec : 1;
-	unsigned int		vma_huge : 1;
-
 	unsigned int		batch_count;
 
 #ifndef CONFIG_MMU_GATHER_NO_GATHER
@@ -352,10 +353,6 @@ static inline void tlb_flush(struct mmu_gather *tlb)
 		flush_tlb_mm(tlb->mm);
 }
 
-static inline void
-tlb_update_vma_flags(struct mmu_gather *tlb, struct vm_area_struct *vma) { }
-
-#define tlb_end_vma tlb_end_vma
 static inline void tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vma) { }
 
 #else /* CONFIG_MMU_GATHER_NO_RANGE */
@@ -364,7 +361,7 @@ static inline void tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vm
 
 /*
  * When an architecture does not provide its own tlb_flush() implementation
- * but does have a reasonably efficient flush_vma_range() implementation
+ * but does have a reasonably efficient flush_tlb_range() implementation
  * use that.
  */
 static inline void tlb_flush(struct mmu_gather *tlb)
@@ -372,38 +369,20 @@ static inline void tlb_flush(struct mmu_gather *tlb)
 	if (tlb->fullmm || tlb->need_flush_all) {
 		flush_tlb_mm(tlb->mm);
 	} else if (tlb->end) {
-		struct vm_area_struct vma = {
-			.vm_mm = tlb->mm,
-			.vm_flags = (tlb->vma_exec ? VM_EXEC    : 0) |
-				    (tlb->vma_huge ? VM_HUGETLB : 0),
-		};
-
-		flush_tlb_range(&vma, tlb->start, tlb->end);
+		VM_BUG_ON(!tlb->vma);
+		flush_tlb_range(tlb->vma, tlb->start, tlb->end);
 	}
 }
 
 static inline void
-tlb_update_vma_flags(struct mmu_gather *tlb, struct vm_area_struct *vma)
+tlb_update_vma(struct mmu_gather *tlb, struct vm_area_struct *vma)
 {
-	/*
-	 * flush_tlb_range() implementations that look at VM_HUGETLB (tile,
-	 * mips-4k) flush only large pages.
-	 *
-	 * flush_tlb_range() implementations that flush I-TLB also flush D-TLB
-	 * (tile, xtensa, arm), so it's ok to just add VM_EXEC to an existing
-	 * range.
-	 *
-	 * We rely on tlb_end_vma() to issue a flush, such that when we reset
-	 * these values the batch is empty.
-	 */
-	tlb->vma_huge = is_vm_hugetlb_page(vma);
-	tlb->vma_exec = !!(vma->vm_flags & VM_EXEC);
+	tlb->vma = vma;
 }
-
 #else
 
 static inline void
-tlb_update_vma_flags(struct mmu_gather *tlb, struct vm_area_struct *vma) { }
+tlb_update_vma(struct mmu_gather *tlb, struct vm_area_struct *vma) { }
 
 #endif
 
@@ -487,17 +466,17 @@ static inline void tlb_start_vma(struct mmu_gather *tlb, struct vm_area_struct *
 	if (tlb->fullmm)
 		return;
 
-	tlb_update_vma_flags(tlb, vma);
+	tlb_update_vma(tlb, vma);
 	flush_cache_range(vma, vma->vm_start, vma->vm_end);
 }
 
 static inline void tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vma)
 {
 	if (tlb->fullmm)
-		return;
+		goto out;
 
 	if (IS_ENABLED(CONFIG_ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING))
-		return;
+		goto out;
 
 	/*
 	 * Do a TLB flush and reset the range at VMA boundaries; this avoids
@@ -506,6 +485,9 @@ static inline void tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vm
 	 * this.
 	 */
 	tlb_flush_mmu_tlbonly(tlb);
+out:
+	/* Reset the VMA as a precaution. */
+	tlb_update_vma(tlb, NULL);
 }
 
 #ifdef CONFIG_ARCH_HAS_TLB_GENERATIONS
-- 
2.25.1




* [RFC 13/20] mm/tlb: introduce tlb_start_ptes() and tlb_end_ptes()
  2021-01-31  0:11 [RFC 00/20] TLB batching consolidation and enhancements Nadav Amit
                   ` (11 preceding siblings ...)
  2021-01-31  0:11 ` [RFC 12/20] mm/tlb: save the VMA that is flushed during tlb_start_vma() Nadav Amit
@ 2021-01-31  0:11 ` Nadav Amit
  2021-01-31  9:57   ` Damian Tometzki
                     ` (2 more replies)
  2021-01-31  0:11 ` [RFC 14/20] mm: move inc/dec_tlb_flush_pending() to mmu_gather.c Nadav Amit
                   ` (8 subsequent siblings)
  21 siblings, 3 replies; 67+ messages in thread
From: Nadav Amit @ 2021-01-31  0:11 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Nadav Amit, Andrea Arcangeli, Andrew Morton, Andy Lutomirski,
	Dave Hansen, Peter Zijlstra, Thomas Gleixner, Will Deacon,
	Yu Zhao, Nick Piggin, x86

From: Nadav Amit <namit@vmware.com>

Introduce tlb_start_ptes() and tlb_end_ptes() which would be called
before and after PTEs are updated and TLB flushes are deferred. This
will later be used for fine-granularity deferred TLB flush detection.

In the meantime, move flush_tlb_batched_pending() into
tlb_start_ptes(). Previously, wp_pte() and clean_record_pte() in
mapping_dirty_helpers did not call it, which might be a bug.

No additional functional change is intended.

Signed-off-by: Nadav Amit <namit@vmware.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: x86@kernel.org
---
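A small userspace sketch of the bracketing pattern described above, as
an illustration only. The toy_* names, the TOY_PTE_WRITE bit and the
open_sections counter are hypothetical; in this patch tlb_end_ptes() is
an empty function, and the counter is added here only to make the
pairing visible:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_PTE_WRITE	0x2UL	/* hypothetical "writable" bit */

struct toy_mm {
	bool tlb_flush_batched;	/* a batched (reclaim) flush is pending */
	unsigned int flushes;	/* flushes issued so far */
};

struct toy_gather {
	struct toy_mm *mm;
	unsigned int open_sections;	/* illustration-only bookkeeping */
};

/* Models flush_tlb_batched_pending(): flush if reclaim left one behind. */
static void toy_flush_batched_pending(struct toy_mm *mm)
{
	if (mm->tlb_flush_batched) {
		mm->flushes++;
		mm->tlb_flush_batched = false;
	}
}

static void toy_start_ptes(struct toy_gather *tlb)
{
	toy_flush_batched_pending(tlb->mm);
	tlb->open_sections++;
}

static void toy_end_ptes(struct toy_gather *tlb)
{
	assert(tlb->open_sections > 0);
	tlb->open_sections--;
}

/* Write-protect n toy PTEs; returns how many were actually changed. */
static size_t toy_wrprotect_range(struct toy_gather *tlb,
				  unsigned long *ptes, size_t n)
{
	size_t i, changed = 0;

	toy_start_ptes(tlb);
	for (i = 0; i < n; i++) {
		if (ptes[i] & TOY_PTE_WRITE) {
			ptes[i] &= ~TOY_PTE_WRITE;
			changed++;
		}
	}
	toy_end_ptes(tlb);
	return changed;
}
```

Every PTE-modifying loop gets the pending-batched-flush check for free
by opening the section, which is how the callers in
mapping_dirty_helpers pick it up.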
 fs/proc/task_mmu.c         |  2 ++
 include/asm-generic/tlb.h  | 18 ++++++++++++++++++
 mm/madvise.c               |  6 ++++--
 mm/mapping_dirty_helpers.c | 15 +++++++++++++--
 mm/memory.c                |  2 ++
 mm/mprotect.c              |  3 ++-
 6 files changed, 41 insertions(+), 5 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 4cd048ffa0f6..d0cce961fa5c 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1168,6 +1168,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
 		return 0;
 
 	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	tlb_start_ptes(&cp->tlb);
 	for (; addr != end; pte++, addr += PAGE_SIZE) {
 		ptent = *pte;
 
@@ -1190,6 +1191,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
 		tlb_flush_pte_range(&cp->tlb, addr, PAGE_SIZE);
 		ClearPageReferenced(page);
 	}
+	tlb_end_ptes(&cp->tlb);
 	pte_unmap_unlock(pte - 1, ptl);
 	cond_resched();
 	return 0;
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 041be2ef4426..10690763090a 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -58,6 +58,11 @@
  *    Defaults to flushing at tlb_end_vma() to reset the range; helps when
  *    there's large holes between the VMAs.
  *
+ *  - tlb_start_ptes() / tlb_end_ptes(); marks the start / end of PTE changes.
+ *
+ *    Does internal accounting to allow fine(r) granularity checks for
+ *    pte_accessible() on certain configurations.
+ *
  *  - tlb_remove_table()
  *
  *    tlb_remove_table() is the basic primitive to free page-table directories
@@ -373,6 +378,10 @@ static inline void tlb_flush(struct mmu_gather *tlb)
 		flush_tlb_range(tlb->vma, tlb->start, tlb->end);
 	}
 }
+#endif
+
+#if __is_defined(tlb_flush) ||						\
+	IS_ENABLED(CONFIG_ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING)
 
 static inline void
 tlb_update_vma(struct mmu_gather *tlb, struct vm_area_struct *vma)
@@ -523,6 +532,15 @@ static inline void mark_mm_tlb_gen_done(struct mm_struct *mm, u64 gen)
 
 #endif /* CONFIG_ARCH_HAS_TLB_GENERATIONS */
 
+#define tlb_start_ptes(tlb)						\
+	do {								\
+		struct mmu_gather *_tlb = (tlb);			\
+									\
+		flush_tlb_batched_pending(_tlb->mm);			\
+	} while (0)
+
+static inline void tlb_end_ptes(struct mmu_gather *tlb) { }
+
 /*
  * tlb_flush_{pte|pmd|pud|p4d}_range() adjust the tlb->start and tlb->end,
  * and set corresponding cleared_*.
diff --git a/mm/madvise.c b/mm/madvise.c
index 0938fd3ad228..932c1c2eb9a3 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -392,7 +392,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 #endif
 	tlb_change_page_size(tlb, PAGE_SIZE);
 	orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
-	flush_tlb_batched_pending(mm);
+	tlb_start_ptes(tlb);
 	arch_enter_lazy_mmu_mode();
 	for (; addr < end; pte++, addr += PAGE_SIZE) {
 		ptent = *pte;
@@ -468,6 +468,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 	}
 
 	arch_leave_lazy_mmu_mode();
+	tlb_end_ptes(tlb);
 	pte_unmap_unlock(orig_pte, ptl);
 	if (pageout)
 		reclaim_pages(&page_list);
@@ -588,7 +589,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 
 	tlb_change_page_size(tlb, PAGE_SIZE);
 	orig_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
-	flush_tlb_batched_pending(mm);
+	tlb_start_ptes(tlb);
 	arch_enter_lazy_mmu_mode();
 	for (; addr != end; pte++, addr += PAGE_SIZE) {
 		ptent = *pte;
@@ -692,6 +693,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 		add_mm_counter(mm, MM_SWAPENTS, nr_swap);
 	}
 	arch_leave_lazy_mmu_mode();
+	tlb_end_ptes(tlb);
 	pte_unmap_unlock(orig_pte, ptl);
 	cond_resched();
 next:
diff --git a/mm/mapping_dirty_helpers.c b/mm/mapping_dirty_helpers.c
index 2ce6cf431026..063419ade304 100644
--- a/mm/mapping_dirty_helpers.c
+++ b/mm/mapping_dirty_helpers.c
@@ -6,6 +6,8 @@
 #include <asm/cacheflush.h>
 #include <asm/tlb.h>
 
+#include "internal.h"
+
 /**
  * struct wp_walk - Private struct for pagetable walk callbacks
  * @range: Range for mmu notifiers
@@ -36,7 +38,10 @@ static int wp_pte(pte_t *pte, unsigned long addr, unsigned long end,
 	pte_t ptent = *pte;
 
 	if (pte_write(ptent)) {
-		pte_t old_pte = ptep_modify_prot_start(walk->vma, addr, pte);
+		pte_t old_pte;
+
+		tlb_start_ptes(&wpwalk->tlb);
+		old_pte = ptep_modify_prot_start(walk->vma, addr, pte);
 
 		ptent = pte_wrprotect(old_pte);
 		ptep_modify_prot_commit(walk->vma, addr, pte, old_pte, ptent);
@@ -44,6 +49,7 @@ static int wp_pte(pte_t *pte, unsigned long addr, unsigned long end,
 
 		if (pte_may_need_flush(old_pte, ptent))
 			tlb_flush_pte_range(&wpwalk->tlb, addr, PAGE_SIZE);
+		tlb_end_ptes(&wpwalk->tlb);
 	}
 
 	return 0;
@@ -94,13 +100,18 @@ static int clean_record_pte(pte_t *pte, unsigned long addr,
 	if (pte_dirty(ptent)) {
 		pgoff_t pgoff = ((addr - walk->vma->vm_start) >> PAGE_SHIFT) +
 			walk->vma->vm_pgoff - cwalk->bitmap_pgoff;
-		pte_t old_pte = ptep_modify_prot_start(walk->vma, addr, pte);
+		pte_t old_pte;
+
+		tlb_start_ptes(&wpwalk->tlb);
+
+		old_pte = ptep_modify_prot_start(walk->vma, addr, pte);
 
 		ptent = pte_mkclean(old_pte);
 		ptep_modify_prot_commit(walk->vma, addr, pte, old_pte, ptent);
 
 		wpwalk->total++;
 		tlb_flush_pte_range(&wpwalk->tlb, addr, PAGE_SIZE);
+		tlb_end_ptes(&wpwalk->tlb);
 
 		__set_bit(pgoff, cwalk->bitmap);
 		cwalk->start = min(cwalk->start, pgoff);
diff --git a/mm/memory.c b/mm/memory.c
index 9e8576a83147..929a93c50d9a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1221,6 +1221,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 	init_rss_vec(rss);
 	start_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
 	pte = start_pte;
+	tlb_start_ptes(tlb);
 	flush_tlb_batched_pending(mm);
 	arch_enter_lazy_mmu_mode();
 	do {
@@ -1314,6 +1315,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 	add_mm_rss_vec(mm, rss);
 	arch_leave_lazy_mmu_mode();
 
+	tlb_end_ptes(tlb);
 	/* Do the actual TLB flush before dropping ptl */
 	if (force_flush)
 		tlb_flush_mmu_tlbonly(tlb);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index b7473d2c9a1f..1258bbe42ee1 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -70,7 +70,7 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
 	    atomic_read(&vma->vm_mm->mm_users) == 1)
 		target_node = numa_node_id();
 
-	flush_tlb_batched_pending(vma->vm_mm);
+	tlb_start_ptes(tlb);
 	arch_enter_lazy_mmu_mode();
 	do {
 		oldpte = *pte;
@@ -182,6 +182,7 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
 		}
 	} while (pte++, addr += PAGE_SIZE, addr != end);
 	arch_leave_lazy_mmu_mode();
+	tlb_end_ptes(tlb);
 	pte_unmap_unlock(pte - 1, ptl);
 
 	return pages;
-- 
2.25.1




* [RFC 14/20] mm: move inc/dec_tlb_flush_pending() to mmu_gather.c
  2021-01-31  0:11 [RFC 00/20] TLB batching consolidation and enhancements Nadav Amit
                   ` (12 preceding siblings ...)
  2021-01-31  0:11 ` [RFC 13/20] mm/tlb: introduce tlb_start_ptes() and tlb_end_ptes() Nadav Amit
@ 2021-01-31  0:11 ` Nadav Amit
  2021-01-31  0:11 ` [RFC 15/20] mm: detect deferred TLB flushes in vma granularity Nadav Amit
                   ` (7 subsequent siblings)
  21 siblings, 0 replies; 67+ messages in thread
From: Nadav Amit @ 2021-01-31  0:11 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Nadav Amit, Andrea Arcangeli, Andrew Morton, Andy Lutomirski,
	Dave Hansen, Peter Zijlstra, Thomas Gleixner, Will Deacon,
	Yu Zhao, Nick Piggin, x86

From: Nadav Amit <namit@vmware.com>

Reduce the chances that inc/dec_tlb_flush_pending() will be abused by
moving them into mmu_gather.c, which is a more natural place for them.
This also reduces the clutter in mm_types.h.

Signed-off-by: Nadav Amit <namit@vmware.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: x86@kernel.org
---
 include/linux/mm_types.h | 54 ----------------------------------------
 mm/mmu_gather.c          | 54 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 54 insertions(+), 54 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 812ee0fd4c35..676795dfd5d4 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -615,60 +615,6 @@ static inline void init_tlb_flush_pending(struct mm_struct *mm)
 	atomic_set(&mm->tlb_flush_pending, 0);
 }
 
-static inline void inc_tlb_flush_pending(struct mm_struct *mm)
-{
-	atomic_inc(&mm->tlb_flush_pending);
-	/*
-	 * The only time this value is relevant is when there are indeed pages
-	 * to flush. And we'll only flush pages after changing them, which
-	 * requires the PTL.
-	 *
-	 * So the ordering here is:
-	 *
-	 *	atomic_inc(&mm->tlb_flush_pending);
-	 *	spin_lock(&ptl);
-	 *	...
-	 *	set_pte_at();
-	 *	spin_unlock(&ptl);
-	 *
-	 *				spin_lock(&ptl)
-	 *				mm_tlb_flush_pending();
-	 *				....
-	 *				spin_unlock(&ptl);
-	 *
-	 *	flush_tlb_range();
-	 *	atomic_dec(&mm->tlb_flush_pending);
-	 *
-	 * Where the increment if constrained by the PTL unlock, it thus
-	 * ensures that the increment is visible if the PTE modification is
-	 * visible. After all, if there is no PTE modification, nobody cares
-	 * about TLB flushes either.
-	 *
-	 * This very much relies on users (mm_tlb_flush_pending() and
-	 * mm_tlb_flush_nested()) only caring about _specific_ PTEs (and
-	 * therefore specific PTLs), because with SPLIT_PTE_PTLOCKS and RCpc
-	 * locks (PPC) the unlock of one doesn't order against the lock of
-	 * another PTL.
-	 *
-	 * The decrement is ordered by the flush_tlb_range(), such that
-	 * mm_tlb_flush_pending() will not return false unless all flushes have
-	 * completed.
-	 */
-}
-
-static inline void dec_tlb_flush_pending(struct mm_struct *mm)
-{
-	/*
-	 * See inc_tlb_flush_pending().
-	 *
-	 * This cannot be smp_mb__before_atomic() because smp_mb() simply does
-	 * not order against TLB invalidate completion, which is what we need.
-	 *
-	 * Therefore we must rely on tlb_flush_*() to guarantee order.
-	 */
-	atomic_dec(&mm->tlb_flush_pending);
-}
-
 static inline bool mm_tlb_flush_pending(struct mm_struct *mm)
 {
 	/*
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 5a659d4e59eb..13338c096cc6 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -249,6 +249,60 @@ void tlb_flush_mmu(struct mmu_gather *tlb)
 	tlb_flush_mmu_free(tlb);
 }
 
+static inline void inc_tlb_flush_pending(struct mm_struct *mm)
+{
+	atomic_inc(&mm->tlb_flush_pending);
+	/*
+	 * The only time this value is relevant is when there are indeed pages
+	 * to flush. And we'll only flush pages after changing them, which
+	 * requires the PTL.
+	 *
+	 * So the ordering here is:
+	 *
+	 *	atomic_inc(&mm->tlb_flush_pending);
+	 *	spin_lock(&ptl);
+	 *	...
+	 *	set_pte_at();
+	 *	spin_unlock(&ptl);
+	 *
+	 *				spin_lock(&ptl)
+	 *				mm_tlb_flush_pending();
+	 *				....
+	 *				spin_unlock(&ptl);
+	 *
+	 *	flush_tlb_range();
+	 *	atomic_dec(&mm->tlb_flush_pending);
+	 *
+	 * Where the increment if constrained by the PTL unlock, it thus
+	 * ensures that the increment is visible if the PTE modification is
+	 * visible. After all, if there is no PTE modification, nobody cares
+	 * about TLB flushes either.
+	 *
+	 * This very much relies on users (mm_tlb_flush_pending() and
+	 * mm_tlb_flush_nested()) only caring about _specific_ PTEs (and
+	 * therefore specific PTLs), because with SPLIT_PTE_PTLOCKS and RCpc
+	 * locks (PPC) the unlock of one doesn't order against the lock of
+	 * another PTL.
+	 *
+	 * The decrement is ordered by the flush_tlb_range(), such that
+	 * mm_tlb_flush_pending() will not return false unless all flushes have
+	 * completed.
+	 */
+}
+
+static inline void dec_tlb_flush_pending(struct mm_struct *mm)
+{
+	/*
+	 * See inc_tlb_flush_pending().
+	 *
+	 * This cannot be smp_mb__before_atomic() because smp_mb() simply does
+	 * not order against TLB invalidate completion, which is what we need.
+	 *
+	 * Therefore we must rely on tlb_flush_*() to guarantee order.
+	 */
+	atomic_dec(&mm->tlb_flush_pending);
+}
+
 /**
  * tlb_gather_mmu - initialize an mmu_gather structure for page-table tear-down
  * @tlb: the mmu_gather structure to initialize
-- 
2.25.1




* [RFC 15/20] mm: detect deferred TLB flushes in vma granularity
  2021-01-31  0:11 [RFC 00/20] TLB batching consolidation and enhancements Nadav Amit
                   ` (13 preceding siblings ...)
  2021-01-31  0:11 ` [RFC 14/20] mm: move inc/dec_tlb_flush_pending() to mmu_gather.c Nadav Amit
@ 2021-01-31  0:11 ` Nadav Amit
  2021-02-01 22:04   ` Nadav Amit
  2021-01-31  0:11 ` [RFC 16/20] mm/tlb: per-page table generation tracking Nadav Amit
                   ` (6 subsequent siblings)
  21 siblings, 1 reply; 67+ messages in thread
From: Nadav Amit @ 2021-01-31  0:11 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Nadav Amit, Andrea Arcangeli, Andrew Morton, Andy Lutomirski,
	Dave Hansen, Peter Zijlstra, Thomas Gleixner, Will Deacon,
	Yu Zhao, x86

From: Nadav Amit <namit@vmware.com>

Currently, deferred TLB flushes are detected at mm granularity: if
there is any deferred TLB flush in the entire address space due to NUMA
migration, pte_accessible() on x86 would return true, and
ptep_clear_flush() would require a TLB flush. This would happen even if
the PTE resides in a completely different vma.

Recent code changes and possible future enhancements might require
detecting TLB flushes at a finer granularity. Detection at a finer
granularity can also enable more aggressive TLB deferring in the future.

Record for each vma the mm's TLB generation after the last deferred
PTE/PMD change while the page-table lock is still held. Increase the mm
generation before recording to indicate that a TLB flush is pending.
Record in the mmu_gather struct the mm's TLB generation at the time at
which the last TLB flush was deferred.

Once the deferred TLB flush takes place, use the deferred TLB generation
that is recorded in mmu_gather. Detection of deferred TLB flushes is
performed by checking whether the mm's completed TLB generation is lower
than or equal to the mm's TLB generation. Architectures that use the TLB
generation logic are required to perform a full TLB flush if they detect
that a new TLB flush request "skips" a generation (as already done by
x86 code).

To indicate that a deferred TLB flush takes place, increase the mm's TLB
generation after updating the PTEs. However, try to avoid increasing the
mm's generation after subsequent PTE updates, as increasing it again
would lead to a full TLB flush once the deferred TLB flushes are
performed (due to the "skipped" TLB generation). Therefore, if the mm
generation did not change after subsequent PTE update, use the previous
generation.

As multiple updates of the vma generation can be performed concurrently,
use atomic operations to ensure that the TLB generation as recorded in
the vma is the last (most recent) one.
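
A minimal sketch of such a "keep the most recent generation" update,
assuming a compare-and-exchange loop. gen_update_max() is a hypothetical
userspace name that uses C11 atomics; it is not claimed to be the body of
the kernel's tlb_update_generation().

```c
#include <stdatomic.h>
#include <stdint.h>

/*
 * Advance *gen to new_gen unless a concurrent updater has already stored
 * a newer (greater or equal) value.
 */
static inline void gen_update_max(_Atomic uint64_t *gen, uint64_t new_gen)
{
	uint64_t cur = atomic_load(gen);

	/*
	 * On failure, atomic_compare_exchange_weak() reloads cur with the
	 * current value, so the "cur < new_gen" test is re-evaluated.
	 */
	while (cur < new_gen &&
	       !atomic_compare_exchange_weak(gen, &cur, new_gen))
		;
}
```

Because the stored value only ever grows, two racing updaters leave the
greater of their generations in place regardless of ordering.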

Once a deferred TLB flush is eventually performed, it might be redundant
if another TLB flush has already covered it (by doing a full TLB flush
upon detecting the "skipped" generation). This case can be detected if
the deferred TLB generation, as recorded in mmu_gather, was already
completed. However, we do not record deferred PUD/P4D flushes, and
freeing tables also requires a flush on cores in lazy TLB mode. In such
cases a TLB flush is needed even if the mm's completed TLB generation
indicates the flush was already "performed".
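
The redundancy check described above can be modeled as follows. This is a
sketch under assumptions: struct gather_model and
deferred_flush_is_redundant() are hypothetical names, and the strict
comparison simply mirrors the condition in the tlb_flush() hunk of this
patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical subset of the mmu_gather fields that the check reads. */
struct gather_model {
	bool freed_tables;
	bool cleared_puds;
	bool cleared_p4ds;
	uint64_t defer_gen;
};

/*
 * Return true if the deferred flush can be skipped because a later flush
 * already completed a newer generation. Freed tables and PUD/P4D changes
 * are not tracked by generation, so they always force a flush.
 */
static inline bool
deferred_flush_is_redundant(const struct gather_model *tlb,
			    uint64_t completed_gen)
{
	if (tlb->freed_tables || tlb->cleared_puds || tlb->cleared_p4ds)
		return false;

	return completed_gen > tlb->defer_gen;
}
```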

Signed-off-by: Nadav Amit <namit@vmware.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: x86@kernel.org
---
 arch/x86/include/asm/tlb.h      |  18 ++++--
 arch/x86/include/asm/tlbflush.h |   5 ++
 arch/x86/mm/tlb.c               |  14 ++++-
 include/asm-generic/tlb.h       | 104 ++++++++++++++++++++++++++++++--
 include/linux/mm_types.h        |  19 ++++++
 mm/mmap.c                       |   1 +
 mm/mmu_gather.c                 |   3 +
 7 files changed, 150 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
index 580636cdc257..ecf538e6c6d5 100644
--- a/arch/x86/include/asm/tlb.h
+++ b/arch/x86/include/asm/tlb.h
@@ -9,15 +9,23 @@ static inline void tlb_flush(struct mmu_gather *tlb);
 
 static inline void tlb_flush(struct mmu_gather *tlb)
 {
-	unsigned long start = 0UL, end = TLB_FLUSH_ALL;
 	unsigned int stride_shift = tlb_get_unmap_shift(tlb);
 
-	if (!tlb->fullmm && !tlb->need_flush_all) {
-		start = tlb->start;
-		end = tlb->end;
+	/* Perform full flush when needed */
+	if (tlb->fullmm || tlb->need_flush_all) {
+		flush_tlb_mm_range(tlb->mm, 0, TLB_FLUSH_ALL, stride_shift,
+				   tlb->freed_tables);
+		return;
 	}
 
-	flush_tlb_mm_range(tlb->mm, start, end, stride_shift, tlb->freed_tables);
+	/* Check if flush was already performed */
+	if (!tlb->freed_tables && !tlb->cleared_puds &&
+	    !tlb->cleared_p4ds &&
+	    atomic64_read(&tlb->mm->tlb_gen_completed) > tlb->defer_gen)
+		return;
+
+	flush_tlb_mm_range_gen(tlb->mm, tlb->start, tlb->end, stride_shift,
+			       tlb->freed_tables, tlb->defer_gen);
 }
 
 /*
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 2110b98026a7..296a00545056 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -225,6 +225,11 @@ void flush_tlb_others(const struct cpumask *cpumask,
 				: PAGE_SHIFT, false)
 
 extern void flush_tlb_all(void);
+
+extern void flush_tlb_mm_range_gen(struct mm_struct *mm, unsigned long start,
+				unsigned long end, unsigned int stride_shift,
+				bool freed_tables, u64 gen);
+
 extern void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 				unsigned long end, unsigned int stride_shift,
 				bool freed_tables);
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index d17b5575531e..48f4b56fc4a7 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -883,12 +883,11 @@ static inline void put_flush_tlb_info(void)
 #endif
 }
 
-void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
+void flush_tlb_mm_range_gen(struct mm_struct *mm, unsigned long start,
 				unsigned long end, unsigned int stride_shift,
-				bool freed_tables)
+				bool freed_tables, u64 new_tlb_gen)
 {
 	struct flush_tlb_info *info;
-	u64 new_tlb_gen;
 	int cpu;
 
 	cpu = get_cpu();
@@ -923,6 +922,15 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 	put_cpu();
 }
 
+void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
+				unsigned long end, unsigned int stride_shift,
+				bool freed_tables)
+{
+	u64 new_tlb_gen = inc_mm_tlb_gen(mm);
+
+	flush_tlb_mm_range_gen(mm, start, end, stride_shift, freed_tables,
+			       new_tlb_gen);
+}
 
 static void do_flush_tlb_all(void *info)
 {
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 10690763090a..f25d2d955076 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -295,6 +295,11 @@ struct mmu_gather {
 	unsigned int		cleared_puds : 1;
 	unsigned int		cleared_p4ds : 1;
 
+	/*
+	 * Whether a TLB flush was needed for PTEs in the current table
+	 */
+	unsigned int		cleared_ptes_in_table : 1;
+
 	unsigned int		batch_count;
 
 #ifndef CONFIG_MMU_GATHER_NO_GATHER
@@ -305,6 +310,10 @@ struct mmu_gather {
 #ifdef CONFIG_MMU_GATHER_PAGE_SIZE
 	unsigned int page_size;
 #endif
+
+#ifdef CONFIG_ARCH_HAS_TLB_GENERATIONS
+	u64			defer_gen;
+#endif
 #endif
 };
 
@@ -381,7 +390,8 @@ static inline void tlb_flush(struct mmu_gather *tlb)
 #endif
 
 #if __is_defined(tlb_flush) ||						\
-	IS_ENABLED(CONFIG_ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING)
+	IS_ENABLED(CONFIG_ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING) ||	\
+	IS_ENABLED(CONFIG_ARCH_HAS_TLB_GENERATIONS)
 
 static inline void
 tlb_update_vma(struct mmu_gather *tlb, struct vm_area_struct *vma)
@@ -472,7 +482,8 @@ static inline unsigned long tlb_get_unmap_size(struct mmu_gather *tlb)
  */
 static inline void tlb_start_vma(struct mmu_gather *tlb, struct vm_area_struct *vma)
 {
-	if (tlb->fullmm)
+	if (IS_ENABLED(CONFIG_ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING) &&
+		       tlb->fullmm)
 		return;
 
 	tlb_update_vma(tlb, vma);
@@ -530,16 +541,87 @@ static inline void mark_mm_tlb_gen_done(struct mm_struct *mm, u64 gen)
 	tlb_update_generation(&mm->tlb_gen_completed, gen);
 }
 
-#endif /* CONFIG_ARCH_HAS_TLB_GENERATIONS */
+static inline void read_defer_tlb_flush_gen(struct mmu_gather *tlb)
+{
+	struct mm_struct *mm = tlb->mm;
+	u64 mm_gen;
+
+	/*
+	 * Any change of PTE before calling __track_deferred_tlb_flush() must be
+	 * performed using an RMW atomic operation that provides a memory barrier,
+	 * such as ptep_modify_prot_start().  The barrier ensures the PTEs are
+	 * written before the current generation is read, synchronizing
+	 * (implicitly) with flush_tlb_mm_range().
+	 */
+	smp_mb__after_atomic();
+
+	mm_gen = atomic64_read(&mm->tlb_gen);
+
+	/*
+	 * This condition checks for both first deferred TLB flush and for other
+	 * TLB pending or executed TLB flushes after the last table that we
+	 * updated. In the latter case, we are going to skip a generation, which
+	 * would lead to a full TLB flush. This should therefore not cause
+	 * correctness issues, and should not induce overheads, since anyhow in
+	 * TLB storms it is better to perform full TLB flush.
+	 */
+	if (mm_gen != tlb->defer_gen) {
+		VM_BUG_ON(mm_gen < tlb->defer_gen);
+
+		tlb->defer_gen = inc_mm_tlb_gen(mm);
+	}
+}
+
+/*
+ * Store the deferred TLB generation in the VMA
+ */
+static inline void store_deferred_tlb_gen(struct mmu_gather *tlb)
+{
+	tlb_update_generation(&tlb->vma->defer_tlb_gen, tlb->defer_gen);
+}
+
+/*
+ * Track deferred TLB flushes for PTEs and PMDs to allow fine granularity checks
+ * whether a PTE is accessible. The TLB generation after the PTE is flushed is
+ * saved in the mmu_gather struct. Once a flush is performed, the geneartion is
+ * advanced.
+ */
+static inline void track_defer_tlb_flush(struct mmu_gather *tlb)
+{
+	if (tlb->fullmm)
+		return;
+
+	BUG_ON(!tlb->vma);
+
+	read_defer_tlb_flush_gen(tlb);
+	store_deferred_tlb_gen(tlb);
+}
+
+#define init_vma_tlb_generation(vma)				\
+	atomic64_set(&(vma)->defer_tlb_gen, 0)
+#else
+static inline void init_vma_tlb_generation(struct vm_area_struct *vma) { }
+#endif
 
 #define tlb_start_ptes(tlb)						\
 	do {								\
 		struct mmu_gather *_tlb = (tlb);			\
 									\
 		flush_tlb_batched_pending(_tlb->mm);			\
+		if (IS_ENABLED(CONFIG_ARCH_HAS_TLB_GENERATIONS))	\
+			_tlb->cleared_ptes_in_table = 0;		\
 	} while (0)
 
-static inline void tlb_end_ptes(struct mmu_gather *tlb) { }
+static inline void tlb_end_ptes(struct mmu_gather *tlb)
+{
+	if (!IS_ENABLED(CONFIG_ARCH_HAS_TLB_GENERATIONS))
+		return;
+
+	if (tlb->cleared_ptes_in_table)
+		track_defer_tlb_flush(tlb);
+
+	tlb->cleared_ptes_in_table = 0;
+}
 
 /*
  * tlb_flush_{pte|pmd|pud|p4d}_range() adjust the tlb->start and tlb->end,
@@ -550,15 +632,25 @@ static inline void tlb_flush_pte_range(struct mmu_gather *tlb,
 {
 	__tlb_adjust_range(tlb, address, size);
 	tlb->cleared_ptes = 1;
+
+	if (IS_ENABLED(CONFIG_ARCH_HAS_TLB_GENERATIONS))
+		tlb->cleared_ptes_in_table = 1;
 }
 
-static inline void tlb_flush_pmd_range(struct mmu_gather *tlb,
+static inline void __tlb_flush_pmd_range(struct mmu_gather *tlb,
 				     unsigned long address, unsigned long size)
 {
 	__tlb_adjust_range(tlb, address, size);
 	tlb->cleared_pmds = 1;
 }
 
+static inline void tlb_flush_pmd_range(struct mmu_gather *tlb,
+				     unsigned long address, unsigned long size)
+{
+	__tlb_flush_pmd_range(tlb, address, size);
+	track_defer_tlb_flush(tlb);
+}
+
 static inline void tlb_flush_pud_range(struct mmu_gather *tlb,
 				     unsigned long address, unsigned long size)
 {
@@ -649,7 +741,7 @@ static inline void tlb_flush_p4d_range(struct mmu_gather *tlb,
 #ifndef pte_free_tlb
 #define pte_free_tlb(tlb, ptep, address)			\
 	do {							\
-		tlb_flush_pmd_range(tlb, address, PAGE_SIZE);	\
+		__tlb_flush_pmd_range(tlb, address, PAGE_SIZE);	\
 		tlb->freed_tables = 1;				\
 		__pte_free_tlb(tlb, ptep, address);		\
 	} while (0)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 676795dfd5d4..bbe5d4a422f7 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -367,6 +367,9 @@ struct vm_area_struct {
 #endif
 #ifdef CONFIG_NUMA
 	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
+#endif
+#ifdef CONFIG_ARCH_HAS_TLB_GENERATIONS
+	atomic64_t defer_tlb_gen;	/* Deferred TLB flushes generation */
 #endif
 	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
 } __randomize_layout;
@@ -628,6 +631,21 @@ static inline bool mm_tlb_flush_pending(struct mm_struct *mm)
 	return atomic_read(&mm->tlb_flush_pending);
 }
 
+#ifdef CONFIG_ARCH_HAS_TLB_GENERATIONS
+static inline bool pte_tlb_flush_pending(struct vm_area_struct *vma, pte_t *pte)
+{
+	struct mm_struct *mm = vma->vm_mm;
+
+	return atomic64_read(&vma->defer_tlb_gen) < atomic64_read(&mm->tlb_gen_completed);
+}
+
+static inline bool pmd_tlb_flush_pending(struct vm_area_struct *vma, pmd_t *pmd)
+{
+	struct mm_struct *mm = vma->vm_mm;
+
+	return atomic64_read(&vma->defer_tlb_gen) < atomic64_read(&mm->tlb_gen_completed);
+}
+#else /* CONFIG_ARCH_HAS_TLB_GENERATIONS */
 static inline bool pte_tlb_flush_pending(struct vm_area_struct *vma, pte_t *pte)
 {
 	return mm_tlb_flush_pending(vma->vm_mm);
@@ -637,6 +655,7 @@ static inline bool pmd_tlb_flush_pending(struct vm_area_struct *vma, pmd_t *pmd)
 {
 	return mm_tlb_flush_pending(vma->vm_mm);
 }
+#endif /* CONFIG_ARCH_HAS_TLB_GENERATIONS */
 
 static inline bool mm_tlb_flush_nested(struct mm_struct *mm)
 {
diff --git a/mm/mmap.c b/mm/mmap.c
index 90673febce6a..a81ef902e296 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3337,6 +3337,7 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
 			get_file(new_vma->vm_file);
 		if (new_vma->vm_ops && new_vma->vm_ops->open)
 			new_vma->vm_ops->open(new_vma);
+		init_vma_tlb_generation(new_vma);
 		vma_link(mm, new_vma, prev, rb_link, rb_parent);
 		*need_rmap_locks = false;
 	}
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 13338c096cc6..0d554f2f92ac 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -329,6 +329,9 @@ static void __tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
 #endif
 
 	tlb_table_init(tlb);
+#ifdef CONFIG_ARCH_HAS_TLB_GENERATIONS
+	tlb->defer_gen = 0;
+#endif
 #ifdef CONFIG_MMU_GATHER_PAGE_SIZE
 	tlb->page_size = 0;
 #endif
-- 
2.25.1



^ permalink raw reply	[flat|nested] 67+ messages in thread

* [RFC 16/20] mm/tlb: per-page table generation tracking
  2021-01-31  0:11 [RFC 00/20] TLB batching consolidation and enhancements Nadav Amit
                   ` (14 preceding siblings ...)
  2021-01-31  0:11 ` [RFC 15/20] mm: detect deferred TLB flushes in vma granularity Nadav Amit
@ 2021-01-31  0:11 ` Nadav Amit
  2021-01-31  0:11 ` [RFC 17/20] mm/tlb: updated completed deferred TLB flush conditionally Nadav Amit
                   ` (5 subsequent siblings)
  21 siblings, 0 replies; 67+ messages in thread
From: Nadav Amit @ 2021-01-31  0:11 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Nadav Amit, Andrea Arcangeli, Andrew Morton, Andy Lutomirski,
	Dave Hansen, Peter Zijlstra, Thomas Gleixner, Will Deacon,
	Yu Zhao, x86

From: Nadav Amit <namit@vmware.com>

Detecting deferred TLB flushes per-VMA has two drawbacks:

1. It requires an atomic cmpxchg to record mm's TLB generation at the
time of the last TLB flush, as two deferred TLB flushes on the same VMA
can race.

2. It might be too coarse for large VMAs.

On 64-bit architectures that have available space in the page-struct, we
can resolve these two drawbacks by recording the TLB generation at the
time of the last deferred flush in the page-struct of the page-table
whose TLB flushes were deferred.

Introduce a new CONFIG_PER_TABLE_DEFERRED_FLUSHES config option. When it
is enabled, record the deferred TLB flush generation in the page-struct,
which is protected by the page-table lock.

Signed-off-by: Nadav Amit <namit@vmware.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: x86@kernel.org
---
 arch/x86/Kconfig               |  1 +
 arch/x86/include/asm/pgtable.h | 23 ++++++------
 fs/proc/task_mmu.c             |  6 ++--
 include/asm-generic/tlb.h      | 65 ++++++++++++++++++++++++++--------
 include/linux/mm.h             | 13 +++++++
 include/linux/mm_types.h       | 22 ++++++++++++
 init/Kconfig                   |  7 ++++
 mm/huge_memory.c               |  2 +-
 mm/mapping_dirty_helpers.c     |  4 +--
 mm/mprotect.c                  |  2 +-
 10 files changed, 113 insertions(+), 32 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index d56b0f5cb00c..dfc6ee9dbe9c 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -250,6 +250,7 @@ config X86
 	select X86_FEATURE_NAMES		if PROC_FS
 	select PROC_PID_ARCH_STATUS		if PROC_FS
 	select MAPPING_DIRTY_HELPERS
+	select PER_TABLE_DEFERRED_FLUSHES	if X86_64
 	imply IMA_SECURE_AND_OR_TRUSTED_BOOT    if EFI
 
 config INSTRUCTION_DECODER
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index a0e069c15dbc..b380a849be90 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -774,17 +774,18 @@ static inline int pte_devmap(pte_t a)
 }
 #endif
 
-#define pte_accessible pte_accessible
-static inline bool pte_accessible(struct vm_area_struct *vma, pte_t *a)
-{
-	if (pte_flags(*a) & _PAGE_PRESENT)
-		return true;
-
-	if ((pte_flags(*a) & _PAGE_PROTNONE) && pte_tlb_flush_pending(vma, a))
-		return true;
-
-	return false;
-}
+#define pte_accessible(vma, a)						\
+	({								\
+		pte_t *_a = (a);					\
+		bool r = false;						\
+									\
+		if (pte_flags(*_a) & _PAGE_PRESENT)			\
+			r = true;					\
+		else							\
+			r = ((pte_flags(*_a) & _PAGE_PROTNONE) &&	\
+			     pte_tlb_flush_pending((vma), _a));		\
+		r;							\
+	})
 
 static inline int pmd_present(pmd_t pmd)
 {
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index d0cce961fa5c..00e116feb62c 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1157,7 +1157,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
 		/* Clear accessed and referenced bits. */
 		pmdp_test_and_clear_young(vma, addr, pmd);
 		test_and_clear_page_young(page);
-		tlb_flush_pmd_range(&cp->tlb, addr, HPAGE_PMD_SIZE);
+		tlb_flush_pmd_range(&cp->tlb, pmd, addr, HPAGE_PMD_SIZE);
 		ClearPageReferenced(page);
 out:
 		spin_unlock(ptl);
@@ -1174,7 +1174,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
 
 		if (cp->type == CLEAR_REFS_SOFT_DIRTY) {
 			clear_soft_dirty(vma, addr, pte);
-			tlb_flush_pte_range(&cp->tlb, addr, PAGE_SIZE);
+			tlb_flush_pte_range(&cp->tlb, pte, addr, PAGE_SIZE);
 			continue;
 		}
 
@@ -1188,7 +1188,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
 		/* Clear accessed and referenced bits. */
 		ptep_test_and_clear_young(vma, addr, pte);
 		test_and_clear_page_young(page);
-		tlb_flush_pte_range(&cp->tlb, addr, PAGE_SIZE);
+		tlb_flush_pte_range(&cp->tlb, pte, addr, PAGE_SIZE);
 		ClearPageReferenced(page);
 	}
 	tlb_end_ptes(&cp->tlb);
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index f25d2d955076..74dbb56d816d 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -310,10 +310,12 @@ struct mmu_gather {
 #ifdef CONFIG_MMU_GATHER_PAGE_SIZE
 	unsigned int page_size;
 #endif
-
 #ifdef CONFIG_ARCH_HAS_TLB_GENERATIONS
 	u64			defer_gen;
 #endif
+#ifdef CONFIG_PER_TABLE_DEFERRED_FLUSHES
+	pte_t			*last_pte;
+#endif
 #endif
 };
 
@@ -572,21 +574,45 @@ static inline void read_defer_tlb_flush_gen(struct mmu_gather *tlb)
 	}
 }
 
+#ifndef CONFIG_PER_TABLE_DEFERRED_FLUSHES
+
 /*
- * Store the deferred TLB generation in the VMA
+ * Store the deferred TLB generation in the VMA or page-table for PTEs or PMDs
  */
-static inline void store_deferred_tlb_gen(struct mmu_gather *tlb)
+static inline void store_deferred_tlb_gen(struct mmu_gather *tlb,
+					  struct page *page)
 {
 	tlb_update_generation(&tlb->vma->defer_tlb_gen, tlb->defer_gen);
 }
 
+static inline void tlb_set_last_pte(struct mmu_gather *tlb, pte_t *pte) { }
+
+#else /* CONFIG_PER_TABLE_DEFERRED_FLUSHES */
+
+/*
+ * Store the deferred TLB generation in the page-struct of the page-table
+ */
+static inline void store_deferred_tlb_gen(struct mmu_gather *tlb,
+					  struct page *page)
+{
+	page->deferred_tlb_gen = tlb->defer_gen;
+}
+
+static inline void tlb_set_last_pte(struct mmu_gather *tlb, pte_t *pte)
+{
+	tlb->last_pte = pte;
+}
+
+#endif /* CONFIG_PER_TABLE_DEFERRED_FLUSHES */
+
 /*
  * Track deferred TLB flushes for PTEs and PMDs to allow fine granularity checks
  * whether a PTE is accessible. The TLB generation after the PTE is flushed is
  * saved in the mmu_gather struct. Once a flush is performed, the geneartion is
  * advanced.
  */
-static inline void track_defer_tlb_flush(struct mmu_gather *tlb)
+static inline void track_defer_tlb_flush(struct mmu_gather *tlb,
+					 struct page *page)
 {
 	if (tlb->fullmm)
 		return;
@@ -594,7 +620,7 @@ static inline void track_defer_tlb_flush(struct mmu_gather *tlb)
 	BUG_ON(!tlb->vma);
 
 	read_defer_tlb_flush_gen(tlb);
-	store_deferred_tlb_gen(tlb);
+	store_deferred_tlb_gen(tlb, page);
 }
 
 #define init_vma_tlb_generation(vma)				\
@@ -610,6 +636,7 @@ static inline void init_vma_tlb_generation(struct vm_area_struct *vma) { }
 		flush_tlb_batched_pending(_tlb->mm);			\
 		if (IS_ENABLED(CONFIG_ARCH_HAS_TLB_GENERATIONS))	\
 			_tlb->cleared_ptes_in_table = 0;		\
+		tlb_set_last_pte(_tlb, NULL);				\
 	} while (0)
 
 static inline void tlb_end_ptes(struct mmu_gather *tlb)
@@ -617,24 +644,31 @@ static inline void tlb_end_ptes(struct mmu_gather *tlb)
 	if (!IS_ENABLED(CONFIG_ARCH_HAS_TLB_GENERATIONS))
 		return;
 
+#ifdef CONFIG_PER_TABLE_DEFERRED_FLUSHES
+	if (tlb->last_pte)
+		track_defer_tlb_flush(tlb, pte_to_page(tlb->last_pte));
+#elif CONFIG_ARCH_HAS_TLB_GENERATIONS /* && !CONFIG_PER_TABLE_DEFERRED_FLUSHES */
 	if (tlb->cleared_ptes_in_table)
-		track_defer_tlb_flush(tlb);
-
+		track_defer_tlb_flush(tlb, NULL);
 	tlb->cleared_ptes_in_table = 0;
+#endif /* CONFIG_PER_TABLE_DEFERRED_FLUSHES */
 }
 
 /*
  * tlb_flush_{pte|pmd|pud|p4d}_range() adjust the tlb->start and tlb->end,
  * and set corresponding cleared_*.
  */
-static inline void tlb_flush_pte_range(struct mmu_gather *tlb,
+static inline void tlb_flush_pte_range(struct mmu_gather *tlb, pte_t *pte,
 				     unsigned long address, unsigned long size)
 {
 	__tlb_adjust_range(tlb, address, size);
 	tlb->cleared_ptes = 1;
 
-	if (IS_ENABLED(CONFIG_ARCH_HAS_TLB_GENERATIONS))
+	if (IS_ENABLED(CONFIG_ARCH_HAS_TLB_GENERATIONS) &&
+	    !IS_ENABLED(CONFIG_PER_TABLE_DEFERRED_FLUSHES))
 		tlb->cleared_ptes_in_table = 1;
+
+	tlb_set_last_pte(tlb, pte);
 }
 
 static inline void __tlb_flush_pmd_range(struct mmu_gather *tlb,
@@ -644,11 +678,11 @@ static inline void __tlb_flush_pmd_range(struct mmu_gather *tlb,
 	tlb->cleared_pmds = 1;
 }
 
-static inline void tlb_flush_pmd_range(struct mmu_gather *tlb,
+static inline void tlb_flush_pmd_range(struct mmu_gather *tlb, pmd_t *pmd,
 				     unsigned long address, unsigned long size)
 {
 	__tlb_flush_pmd_range(tlb, address, size);
-	track_defer_tlb_flush(tlb);
+	track_defer_tlb_flush(tlb, pmd_to_page(pmd));
 }
 
 static inline void tlb_flush_pud_range(struct mmu_gather *tlb,
@@ -678,7 +712,8 @@ static inline void tlb_flush_p4d_range(struct mmu_gather *tlb,
  */
 #define tlb_remove_tlb_entry(tlb, ptep, address)		\
 	do {							\
-		tlb_flush_pte_range(tlb, address, PAGE_SIZE);	\
+		tlb_flush_pte_range(tlb, ptep, address,		\
+				    PAGE_SIZE);	\
 		__tlb_remove_tlb_entry(tlb, ptep, address);	\
 	} while (0)
 
@@ -686,7 +721,8 @@ static inline void tlb_flush_p4d_range(struct mmu_gather *tlb,
 	do {							\
 		unsigned long _sz = huge_page_size(h);		\
 		if (_sz == PMD_SIZE)				\
-			tlb_flush_pmd_range(tlb, address, _sz);	\
+			tlb_flush_pmd_range(tlb, (pmd_t *)ptep,	\
+					    address, _sz);	\
 		else if (_sz == PUD_SIZE)			\
 			tlb_flush_pud_range(tlb, address, _sz);	\
 		__tlb_remove_tlb_entry(tlb, ptep, address);	\
@@ -702,7 +738,8 @@ static inline void tlb_flush_p4d_range(struct mmu_gather *tlb,
 
 #define tlb_remove_pmd_tlb_entry(tlb, pmdp, address)			\
 	do {								\
-		tlb_flush_pmd_range(tlb, address, HPAGE_PMD_SIZE);	\
+		tlb_flush_pmd_range(tlb, pmdp, address,			\
+				    HPAGE_PMD_SIZE);			\
 		__tlb_remove_pmd_tlb_entry(tlb, pmdp, address);		\
 	} while (0)
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index d78a79fbb012..a8a5bf82bd03 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2208,11 +2208,21 @@ static inline void pgtable_init(void)
 	pgtable_cache_init();
 }
 
+#ifdef CONFIG_PER_TABLE_DEFERRED_FLUSHES
+static inline void page_table_tlb_gen_init(struct page *page)
+{
+	page->deferred_tlb_gen = 0;
+}
+#else /* CONFIG_PER_TABLE_DEFERRED_FLUSHES */
+static inline void page_table_tlb_gen_init(struct page *page) { }
+#endif /* CONFIG_PER_TABLE_DEFERRED_FLUSHES */
+
 static inline bool pgtable_pte_page_ctor(struct page *page)
 {
 	if (!ptlock_init(page))
 		return false;
 	__SetPageTable(page);
+	page_table_tlb_gen_init(page);
 	inc_lruvec_page_state(page, NR_PAGETABLE);
 	return true;
 }
@@ -2221,6 +2231,7 @@ static inline void pgtable_pte_page_dtor(struct page *page)
 {
 	ptlock_free(page);
 	__ClearPageTable(page);
+	page_table_tlb_gen_init(page);
 	dec_lruvec_page_state(page, NR_PAGETABLE);
 }
 
@@ -2308,6 +2319,7 @@ static inline bool pgtable_pmd_page_ctor(struct page *page)
 	if (!pmd_ptlock_init(page))
 		return false;
 	__SetPageTable(page);
+	page_table_tlb_gen_init(page);
 	inc_lruvec_page_state(page, NR_PAGETABLE);
 	return true;
 }
@@ -2316,6 +2328,7 @@ static inline void pgtable_pmd_page_dtor(struct page *page)
 {
 	pmd_ptlock_free(page);
 	__ClearPageTable(page);
+	page_table_tlb_gen_init(page);
 	dec_lruvec_page_state(page, NR_PAGETABLE);
 }
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index bbe5d4a422f7..cae9e8bbf8e6 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -148,6 +148,9 @@ struct page {
 			pgtable_t pmd_huge_pte; /* protected by page->ptl */
 			unsigned long _pt_pad_2;	/* mapping */
 			union {
+#ifdef CONFIG_PER_TABLE_DEFERRED_FLUSHES
+				u64 deferred_tlb_gen; /* x86 non-pgd protected by page->ptl */
+#endif
 				struct mm_struct *pt_mm; /* x86 pgds only */
 				atomic_t pt_frag_refcount; /* powerpc */
 			};
@@ -632,6 +635,7 @@ static inline bool mm_tlb_flush_pending(struct mm_struct *mm)
 }
 
 #ifdef CONFIG_ARCH_HAS_TLB_GENERATIONS
+#ifndef CONFIG_PER_TABLE_DEFERRED_FLUSHES
 static inline bool pte_tlb_flush_pending(struct vm_area_struct *vma, pte_t *pte)
 {
 	struct mm_struct *mm = vma->vm_mm;
@@ -645,6 +649,24 @@ static inline bool pmd_tlb_flush_pending(struct vm_area_struct *vma, pmd_t *pmd)
 
 	return atomic64_read(&vma->defer_tlb_gen) < atomic64_read(&mm->tlb_gen_completed);
 }
+#else /* CONFIG_PER_TABLE_DEFERRED_FLUSHES */
+#define pte_tlb_flush_pending(vma, pte)					\
+	({								\
+		struct mm_struct *mm = (vma)->vm_mm;			\
+									\
+		(pte_to_page(pte))->deferred_tlb_gen <			\
+			atomic64_read(&mm->tlb_gen_completed);		\
+	 })
+
+#define pmd_tlb_flush_pending(vma, pmd)					\
+	({								\
+		struct mm_struct *mm = (vma)->vm_mm;			\
+									\
+		(pmd_to_page(pmd))->deferred_tlb_gen <			\
+			atomic64_read(&mm->tlb_gen_completed);		\
+	 })
+
+#endif /* CONFIG_PER_TABLE_DEFERRED_FLUSHES */
 #else /* CONFIG_ARCH_HAS_TLB_GENERATIONS */
 static inline bool pte_tlb_flush_pending(struct vm_area_struct *vma, pte_t *pte)
 {
diff --git a/init/Kconfig b/init/Kconfig
index 14a599a48738..e0d8a9ea7dd0 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -857,6 +857,13 @@ config ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING
 	bool
 	depends on !CONFIG_MMU_GATHER_NO_GATHER
 
+#
+# For architectures that prefer to save deferred TLB generations in the
+# page-table instead of the VMA.
+config PER_TABLE_DEFERRED_FLUSHES
+	bool
+	depends on ARCH_HAS_TLB_GENERATIONS && 64BIT
+
 config CC_HAS_INT128
 	def_bool !$(cc-option,$(m64-flag) -D__SIZEOF_INT128__=0) && 64BIT
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c4b7c00cc69c..8f6c0e1a7ff7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1886,7 +1886,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 		entry = pmd_clear_uffd_wp(entry);
 	}
 	ret = HPAGE_PMD_NR;
-	tlb_flush_pmd_range(tlb, addr, HPAGE_PMD_SIZE);
+	tlb_flush_pmd_range(tlb, pmd, addr, HPAGE_PMD_SIZE);
 	set_pmd_at(mm, addr, pmd, entry);
 	BUG_ON(vma_is_anonymous(vma) && !preserve_write && pmd_write(entry));
 unlock:
diff --git a/mm/mapping_dirty_helpers.c b/mm/mapping_dirty_helpers.c
index 063419ade304..923b8c0ec837 100644
--- a/mm/mapping_dirty_helpers.c
+++ b/mm/mapping_dirty_helpers.c
@@ -48,7 +48,7 @@ static int wp_pte(pte_t *pte, unsigned long addr, unsigned long end,
 		wpwalk->total++;
 
 		if (pte_may_need_flush(old_pte, ptent))
-			tlb_flush_pte_range(&wpwalk->tlb, addr, PAGE_SIZE);
+			tlb_flush_pte_range(&wpwalk->tlb, pte, addr, PAGE_SIZE);
 		tlb_end_ptes(&wpwalk->tlb);
 	}
 
@@ -110,7 +110,7 @@ static int clean_record_pte(pte_t *pte, unsigned long addr,
 		ptep_modify_prot_commit(walk->vma, addr, pte, old_pte, ptent);
 
 		wpwalk->total++;
-		tlb_flush_pte_range(&wpwalk->tlb, addr, PAGE_SIZE);
+		tlb_flush_pte_range(&wpwalk->tlb, pte, addr, PAGE_SIZE);
 		tlb_end_ptes(&wpwalk->tlb);
 
 		__set_bit(pgoff, cwalk->bitmap);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 1258bbe42ee1..c3aa3030f4d9 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -140,7 +140,7 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
 			}
 			ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
 			if (pte_may_need_flush(oldpte, ptent))
-				tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
+				tlb_flush_pte_range(tlb, pte, addr, PAGE_SIZE);
 			pages++;
 		} else if (is_swap_pte(oldpte)) {
 			swp_entry_t entry = pte_to_swp_entry(oldpte);
-- 
2.25.1



^ permalink raw reply	[flat|nested] 67+ messages in thread

* [RFC 17/20] mm/tlb: updated completed deferred TLB flush conditionally
  2021-01-31  0:11 [RFC 00/20] TLB batching consolidation and enhancements Nadav Amit
                   ` (15 preceding siblings ...)
  2021-01-31  0:11 ` [RFC 16/20] mm/tlb: per-page table generation tracking Nadav Amit
@ 2021-01-31  0:11 ` Nadav Amit
  2021-01-31  0:11 ` [RFC 18/20] mm: make mm_cpumask() volatile Nadav Amit
                   ` (4 subsequent siblings)
  21 siblings, 0 replies; 67+ messages in thread
From: Nadav Amit @ 2021-01-31  0:11 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Nadav Amit, Andrea Arcangeli, Andrew Morton, Andy Lutomirski,
	Dave Hansen, Peter Zijlstra, Thomas Gleixner, Will Deacon,
	Yu Zhao, x86

From: Nadav Amit <namit@vmware.com>

If all the deferred TLB flushes were completed, there is no need to
update the completed TLB generation. This update requires an atomic
cmpxchg, so we would like to skip it.

To do so, save for each mm the last TLB generation in which TLB flushes
were deferred. While saving this information requires another atomic
cmpxchg, assume that deferred TLB flushes are less frequent than TLB
flushes.
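
A userspace sketch of the conditional update, assuming C11 atomics. The
names mm_model, gen_update_max() and mark_gen_done() are hypothetical, and
the bool return value exists only to make the skipped case visible (the
kernel helper returns void).

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two mm_struct generation counters. */
struct mm_model {
	_Atomic uint64_t tlb_gen_deferred;
	_Atomic uint64_t tlb_gen_completed;
};

/* Advance *gen to new_gen unless it already holds a newer value. */
static inline void gen_update_max(_Atomic uint64_t *gen, uint64_t new_gen)
{
	uint64_t cur = atomic_load(gen);

	while (cur < new_gen &&
	       !atomic_compare_exchange_weak(gen, &cur, new_gen))
		;
}

/*
 * Record that generation gen was flushed. Return false if the cmpxchg was
 * skipped because every deferred generation had already completed.
 */
static inline bool mark_gen_done(struct mm_model *mm, uint64_t gen)
{
	if (atomic_load(&mm->tlb_gen_deferred) ==
	    atomic_load(&mm->tlb_gen_completed))
		return false;

	gen_update_max(&mm->tlb_gen_completed, gen);
	return true;
}
```

The fast path costs two plain atomic loads, which is the saving the commit
message is after when no deferred flush is outstanding.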

Signed-off-by: Nadav Amit <namit@vmware.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: x86@kernel.org
---
 include/asm-generic/tlb.h | 23 ++++++++++++++++++-----
 include/linux/mm_types.h  |  5 +++++
 2 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 74dbb56d816d..a41af03fbede 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -536,6 +536,14 @@ static inline void tlb_update_generation(atomic64_t *gen, u64 new_gen)
 
 static inline void mark_mm_tlb_gen_done(struct mm_struct *mm, u64 gen)
 {
+	/*
+	 * If all the deferred TLB generations were completed, we can skip
+	 * the update of tlb_gen_completed and save a few cycles on cmpxchg.
+	 */
+	if (atomic64_read(&mm->tlb_gen_deferred) ==
+	    atomic64_read(&mm->tlb_gen_completed))
+		return;
+
 	/*
 	 * Update the completed generation to the new generation if the new
 	 * generation is greater than the previous one.
@@ -546,7 +554,7 @@ static inline void mark_mm_tlb_gen_done(struct mm_struct *mm, u64 gen)
 static inline void read_defer_tlb_flush_gen(struct mmu_gather *tlb)
 {
 	struct mm_struct *mm = tlb->mm;
-	u64 mm_gen;
+	u64 mm_gen, new_gen;
 
 	/*
 	 * Any change of PTE before calling __track_deferred_tlb_flush() must be
@@ -567,11 +575,16 @@ static inline void read_defer_tlb_flush_gen(struct mmu_gather *tlb)
 	 * correctness issues, and should not induce overheads, since anyhow in
 	 * TLB storms it is better to perform full TLB flush.
 	 */
-	if (mm_gen != tlb->defer_gen) {
-		VM_BUG_ON(mm_gen < tlb->defer_gen);
+	if (mm_gen == tlb->defer_gen)
+		return;
 
-		tlb->defer_gen = inc_mm_tlb_gen(mm);
-	}
+	VM_BUG_ON(mm_gen < tlb->defer_gen);
+
+	new_gen = inc_mm_tlb_gen(mm);
+	tlb->defer_gen = new_gen;
+
+	/* Update mm->tlb_gen_deferred */
+	tlb_update_generation(&mm->tlb_gen_deferred, new_gen);
 }
 
 #ifndef CONFIG_PER_TABLE_DEFERRED_FLUSHES
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index cae9e8bbf8e6..4122a9b8b56f 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -578,6 +578,11 @@ struct mm_struct {
 		 */
 		atomic64_t tlb_gen;
 
+		/*
+		 * The last TLB generation which was deferred.
+		 */
+		atomic64_t tlb_gen_deferred;
+
 		/*
 		 * TLB generation which is guarnateed to be flushed, including
 		 * all the PTE changes that were performed before tlb_gen was
-- 
2.25.1



^ permalink raw reply	[flat|nested] 67+ messages in thread

* [RFC 18/20] mm: make mm_cpumask() volatile
  2021-01-31  0:11 [RFC 00/20] TLB batching consolidation and enhancements Nadav Amit
                   ` (16 preceding siblings ...)
  2021-01-31  0:11 ` [RFC 17/20] mm/tlb: updated completed deferred TLB flush conditionally Nadav Amit
@ 2021-01-31  0:11 ` Nadav Amit
  2021-01-31  0:11 ` [RFC 19/20] lib/cpumask: introduce cpumask_atomic_or() Nadav Amit
                   ` (3 subsequent siblings)
  21 siblings, 0 replies; 67+ messages in thread
From: Nadav Amit @ 2021-01-31  0:11 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Nadav Amit, Andrea Arcangeli, Andrew Morton, Andy Lutomirski,
	Dave Hansen, Peter Zijlstra, Thomas Gleixner, Will Deacon,
	Yu Zhao, x86

From: Nadav Amit <namit@vmware.com>

mm_cpumask() is effectively volatile: a bit might be turned on or off at
any given moment, and it is not protected by any lock. While the kernel
coding guidelines strongly discourage the use of volatile, not marking
mm_cpumask() as volatile seems wrong.

Cpumask and bitmap manipulation functions may work fine, as they are
allowed to use either the new or old value. Apparently they do, as no
bugs were reported. However, the fact that mm_cpumask() is not volatile
might lead to theoretical bugs due to compiler optimizations.

For example, cpumask_next() uses _find_next_bit(). A compiler might add
invented loads to _find_next_bit(), which would cause __ffs() to run on
a different value than the one read before. Consequently, if something
like that happens, the result might be a CPU that was set in neither the
old nor the new mask. I could not find what might go wrong in such a
case, but it seems like an improper result.
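
As an illustration only, here is a minimal userspace sketch of the
single-load pattern that the qualifier is meant to enforce. The name
find_first_set_sketch() is hypothetical and this is not the kernel's
_find_next_bit(); __builtin_ctzl() (a GCC/Clang builtin) stands in for
__ffs(). Each word is read exactly once into a local, and both the
zero test and the bit scan use that local copy:

```c
#include <limits.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/*
 * Hypothetical stand-in for a find-first-bit helper. Returns the index
 * of the first set bit, or nbits if no bit is set.
 */
static unsigned long find_first_set_sketch(const volatile unsigned long *addr,
					   unsigned long nbits)
{
	unsigned long idx;

	for (idx = 0; idx * BITS_PER_WORD < nbits; idx++) {
		/* One volatile load; the compiler may not re-read addr[idx]. */
		unsigned long word = addr[idx];

		if (word) {
			unsigned long bit = idx * BITS_PER_WORD +
					    (unsigned long)__builtin_ctzl(word);

			/* Ignore stray bits beyond nbits in the last word. */
			return bit < nbits ? bit : nbits;
		}
	}
	return nbits;
}
```

Without the qualifier, a compiler would be free to test one load of the
word against zero and scan a second, later load, which is the case
described above.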

Mark the mm_cpumask() result as volatile and propagate the "volatile"
qualifier wherever the compiler complains.

Signed-off-by: Nadav Amit <namit@vmware.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: x86@kernel.org
---
 arch/arm/include/asm/bitops.h         |  4 ++--
 arch/x86/hyperv/mmu.c                 |  2 +-
 arch/x86/include/asm/paravirt_types.h |  2 +-
 arch/x86/include/asm/tlbflush.h       |  2 +-
 arch/x86/mm/tlb.c                     |  4 ++--
 arch/x86/xen/mmu_pv.c                 |  2 +-
 include/asm-generic/bitops/find.h     |  8 ++++----
 include/linux/bitmap.h                | 16 +++++++--------
 include/linux/cpumask.h               | 28 +++++++++++++--------------
 include/linux/mm_types.h              |  4 ++--
 include/linux/smp.h                   |  6 +++---
 kernel/smp.c                          |  8 ++++----
 lib/bitmap.c                          |  8 ++++----
 lib/cpumask.c                         |  8 ++++----
 lib/find_bit.c                        | 10 +++++-----
 15 files changed, 56 insertions(+), 56 deletions(-)

diff --git a/arch/arm/include/asm/bitops.h b/arch/arm/include/asm/bitops.h
index c92e42a5c8f7..c8690c0ff15a 100644
--- a/arch/arm/include/asm/bitops.h
+++ b/arch/arm/include/asm/bitops.h
@@ -162,8 +162,8 @@ extern int _test_and_change_bit(int nr, volatile unsigned long * p);
  */
 extern int _find_first_zero_bit_le(const unsigned long *p, unsigned size);
 extern int _find_next_zero_bit_le(const unsigned long *p, int size, int offset);
-extern int _find_first_bit_le(const unsigned long *p, unsigned size);
-extern int _find_next_bit_le(const unsigned long *p, int size, int offset);
+extern int _find_first_bit_le(const volatile unsigned long *p, unsigned size);
+extern int _find_next_bit_le(const volatile unsigned long *p, int size, int offset);
 
 /*
  * Big endian assembly bitops.  nr = 0 -> byte 3 bit 0.
diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
index 2c87350c1fb0..76ce8a0f19ef 100644
--- a/arch/x86/hyperv/mmu.c
+++ b/arch/x86/hyperv/mmu.c
@@ -52,7 +52,7 @@ static inline int fill_gva_list(u64 gva_list[], int offset,
 	return gva_n - offset;
 }
 
-static void hyperv_flush_tlb_others(const struct cpumask *cpus,
+static void hyperv_flush_tlb_others(const volatile struct cpumask *cpus,
 				    const struct flush_tlb_info *info)
 {
 	int cpu, vcpu, gva_n, max_gvas;
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index b6b02b7c19cc..35b5696aedc7 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -201,7 +201,7 @@ struct pv_mmu_ops {
 	void (*flush_tlb_user)(void);
 	void (*flush_tlb_kernel)(void);
 	void (*flush_tlb_one_user)(unsigned long addr);
-	void (*flush_tlb_others)(const struct cpumask *cpus,
+	void (*flush_tlb_others)(const volatile struct cpumask *cpus,
 				 const struct flush_tlb_info *info);
 
 	void (*tlb_remove_table)(struct mmu_gather *tlb, void *table);
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 296a00545056..a4e7c90d11a8 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -208,7 +208,7 @@ struct flush_tlb_info {
 void flush_tlb_local(void);
 void flush_tlb_one_user(unsigned long addr);
 void flush_tlb_one_kernel(unsigned long addr);
-void flush_tlb_others(const struct cpumask *cpumask,
+void flush_tlb_others(const volatile struct cpumask *cpumask,
 		      const struct flush_tlb_info *info);
 
 #ifdef CONFIG_PARAVIRT
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 48f4b56fc4a7..ba85d6bb4988 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -796,7 +796,7 @@ static bool tlb_is_not_lazy(int cpu, void *data)
 	return !per_cpu(cpu_tlbstate.is_lazy, cpu);
 }
 
-STATIC_NOPV void native_flush_tlb_others(const struct cpumask *cpumask,
+STATIC_NOPV void native_flush_tlb_others(const volatile struct cpumask *cpumask,
 					 const struct flush_tlb_info *info)
 {
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
@@ -824,7 +824,7 @@ STATIC_NOPV void native_flush_tlb_others(const struct cpumask *cpumask,
 				(void *)info, 1, cpumask);
 }
 
-void flush_tlb_others(const struct cpumask *cpumask,
+void flush_tlb_others(const volatile struct cpumask *cpumask,
 		      const struct flush_tlb_info *info)
 {
 	__flush_tlb_others(cpumask, info);
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index cf2ade864c30..0f9e1ff1e388 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -1247,7 +1247,7 @@ static void xen_flush_tlb_one_user(unsigned long addr)
 	preempt_enable();
 }
 
-static void xen_flush_tlb_others(const struct cpumask *cpus,
+static void xen_flush_tlb_others(const volatile struct cpumask *cpus,
 				 const struct flush_tlb_info *info)
 {
 	struct {
diff --git a/include/asm-generic/bitops/find.h b/include/asm-generic/bitops/find.h
index 9fdf21302fdf..324078362ea1 100644
--- a/include/asm-generic/bitops/find.h
+++ b/include/asm-generic/bitops/find.h
@@ -12,8 +12,8 @@
  * Returns the bit number for the next set bit
  * If no bits are set, returns @size.
  */
-extern unsigned long find_next_bit(const unsigned long *addr, unsigned long
-		size, unsigned long offset);
+extern unsigned long find_next_bit(const volatile unsigned long *addr,
+				   unsigned long size, unsigned long offset);
 #endif
 
 #ifndef find_next_and_bit
@@ -27,8 +27,8 @@ extern unsigned long find_next_bit(const unsigned long *addr, unsigned long
  * Returns the bit number for the next set bit
  * If no bits are set, returns @size.
  */
-extern unsigned long find_next_and_bit(const unsigned long *addr1,
-		const unsigned long *addr2, unsigned long size,
+extern unsigned long find_next_and_bit(const volatile unsigned long *addr1,
+		const volatile unsigned long *addr2, unsigned long size,
 		unsigned long offset);
 #endif
 
diff --git a/include/linux/bitmap.h b/include/linux/bitmap.h
index 70a932470b2d..769b7a98e12f 100644
--- a/include/linux/bitmap.h
+++ b/include/linux/bitmap.h
@@ -141,8 +141,8 @@ extern void __bitmap_shift_left(unsigned long *dst, const unsigned long *src,
 extern void bitmap_cut(unsigned long *dst, const unsigned long *src,
 		       unsigned int first, unsigned int cut,
 		       unsigned int nbits);
-extern int __bitmap_and(unsigned long *dst, const unsigned long *bitmap1,
-			const unsigned long *bitmap2, unsigned int nbits);
+extern int __bitmap_and(unsigned long *dst, const volatile unsigned long *bitmap1,
+			const volatile unsigned long *bitmap2, unsigned int nbits);
 extern void __bitmap_or(unsigned long *dst, const unsigned long *bitmap1,
 			const unsigned long *bitmap2, unsigned int nbits);
 extern void __bitmap_xor(unsigned long *dst, const unsigned long *bitmap1,
@@ -152,8 +152,8 @@ extern int __bitmap_andnot(unsigned long *dst, const unsigned long *bitmap1,
 extern void __bitmap_replace(unsigned long *dst,
 			const unsigned long *old, const unsigned long *new,
 			const unsigned long *mask, unsigned int nbits);
-extern int __bitmap_intersects(const unsigned long *bitmap1,
-			const unsigned long *bitmap2, unsigned int nbits);
+extern int __bitmap_intersects(const volatile unsigned long *bitmap1,
+			const volatile unsigned long *bitmap2, unsigned int nbits);
 extern int __bitmap_subset(const unsigned long *bitmap1,
 			const unsigned long *bitmap2, unsigned int nbits);
 extern int __bitmap_weight(const unsigned long *bitmap, unsigned int nbits);
@@ -278,8 +278,8 @@ extern void bitmap_to_arr32(u32 *buf, const unsigned long *bitmap,
 			(const unsigned long *) (bitmap), (nbits))
 #endif
 
-static inline int bitmap_and(unsigned long *dst, const unsigned long *src1,
-			const unsigned long *src2, unsigned int nbits)
+static inline int bitmap_and(unsigned long *dst, const volatile unsigned long *src1,
+			const volatile unsigned long *src2, unsigned int nbits)
 {
 	if (small_const_nbits(nbits))
 		return (*dst = *src1 & *src2 & BITMAP_LAST_WORD_MASK(nbits)) != 0;
@@ -359,8 +359,8 @@ static inline bool bitmap_or_equal(const unsigned long *src1,
 	return !(((*src1 | *src2) ^ *src3) & BITMAP_LAST_WORD_MASK(nbits));
 }
 
-static inline int bitmap_intersects(const unsigned long *src1,
-			const unsigned long *src2, unsigned int nbits)
+static inline int bitmap_intersects(const volatile unsigned long *src1,
+			const volatile unsigned long *src2, unsigned int nbits)
 {
 	if (small_const_nbits(nbits))
 		return ((*src1 & *src2) & BITMAP_LAST_WORD_MASK(nbits)) != 0;
diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index 383684e30f12..3d7e418aa113 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -158,7 +158,7 @@ static inline unsigned int cpumask_last(const struct cpumask *srcp)
 }
 
 /* Valid inputs for n are -1 and 0. */
-static inline unsigned int cpumask_next(int n, const struct cpumask *srcp)
+static inline unsigned int cpumask_next(int n, const volatile struct cpumask *srcp)
 {
 	return n+1;
 }
@@ -169,8 +169,8 @@ static inline unsigned int cpumask_next_zero(int n, const struct cpumask *srcp)
 }
 
 static inline unsigned int cpumask_next_and(int n,
-					    const struct cpumask *srcp,
-					    const struct cpumask *andp)
+					    const volatile struct cpumask *srcp,
+					    const volatile struct cpumask *andp)
 {
 	return n+1;
 }
@@ -183,7 +183,7 @@ static inline unsigned int cpumask_next_wrap(int n, const struct cpumask *mask,
 }
 
 /* cpu must be a valid cpu, ie 0, so there's no other choice. */
-static inline unsigned int cpumask_any_but(const struct cpumask *mask,
+static inline unsigned int cpumask_any_but(const volatile struct cpumask *mask,
 					   unsigned int cpu)
 {
 	return 1;
@@ -235,7 +235,7 @@ static inline unsigned int cpumask_last(const struct cpumask *srcp)
 	return find_last_bit(cpumask_bits(srcp), nr_cpumask_bits);
 }
 
-unsigned int cpumask_next(int n, const struct cpumask *srcp);
+unsigned int cpumask_next(int n, const volatile struct cpumask *srcp);
 
 /**
  * cpumask_next_zero - get the next unset cpu in a cpumask
@@ -252,8 +252,8 @@ static inline unsigned int cpumask_next_zero(int n, const struct cpumask *srcp)
 	return find_next_zero_bit(cpumask_bits(srcp), nr_cpumask_bits, n+1);
 }
 
-int cpumask_next_and(int n, const struct cpumask *, const struct cpumask *);
-int cpumask_any_but(const struct cpumask *mask, unsigned int cpu);
+int cpumask_next_and(int n, const volatile struct cpumask *, const volatile struct cpumask *);
+int cpumask_any_but(const volatile struct cpumask *mask, unsigned int cpu);
 unsigned int cpumask_local_spread(unsigned int i, int node);
 int cpumask_any_and_distribute(const struct cpumask *src1p,
 			       const struct cpumask *src2p);
@@ -335,7 +335,7 @@ extern int cpumask_next_wrap(int n, const struct cpumask *mask, int start, bool
  * @cpu: cpu number (< nr_cpu_ids)
  * @dstp: the cpumask pointer
  */
-static inline void cpumask_set_cpu(unsigned int cpu, struct cpumask *dstp)
+static inline void cpumask_set_cpu(unsigned int cpu, volatile struct cpumask *dstp)
 {
 	set_bit(cpumask_check(cpu), cpumask_bits(dstp));
 }
@@ -351,7 +351,7 @@ static inline void __cpumask_set_cpu(unsigned int cpu, struct cpumask *dstp)
  * @cpu: cpu number (< nr_cpu_ids)
  * @dstp: the cpumask pointer
  */
-static inline void cpumask_clear_cpu(int cpu, struct cpumask *dstp)
+static inline void cpumask_clear_cpu(int cpu, volatile struct cpumask *dstp)
 {
 	clear_bit(cpumask_check(cpu), cpumask_bits(dstp));
 }
@@ -368,7 +368,7 @@ static inline void __cpumask_clear_cpu(int cpu, struct cpumask *dstp)
  *
  * Returns 1 if @cpu is set in @cpumask, else returns 0
  */
-static inline int cpumask_test_cpu(int cpu, const struct cpumask *cpumask)
+static inline int cpumask_test_cpu(int cpu, const volatile struct cpumask *cpumask)
 {
 	return test_bit(cpumask_check(cpu), cpumask_bits((cpumask)));
 }
@@ -428,8 +428,8 @@ static inline void cpumask_clear(struct cpumask *dstp)
  * If *@dstp is empty, returns 0, else returns 1
  */
 static inline int cpumask_and(struct cpumask *dstp,
-			       const struct cpumask *src1p,
-			       const struct cpumask *src2p)
+			       const volatile struct cpumask *src1p,
+			       const volatile struct cpumask *src2p)
 {
 	return bitmap_and(cpumask_bits(dstp), cpumask_bits(src1p),
 				       cpumask_bits(src2p), nr_cpumask_bits);
@@ -521,8 +521,8 @@ static inline bool cpumask_or_equal(const struct cpumask *src1p,
  * @src1p: the first input
  * @src2p: the second input
  */
-static inline bool cpumask_intersects(const struct cpumask *src1p,
-				     const struct cpumask *src2p)
+static inline bool cpumask_intersects(const volatile struct cpumask *src1p,
+				     const volatile struct cpumask *src2p)
 {
 	return bitmap_intersects(cpumask_bits(src1p), cpumask_bits(src2p),
 						      nr_cpumask_bits);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 4122a9b8b56f..5a9b8c417f23 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -611,9 +611,9 @@ static inline void mm_init_cpumask(struct mm_struct *mm)
 }
 
 /* Future-safe accessor for struct mm_struct's cpu_vm_mask. */
-static inline cpumask_t *mm_cpumask(struct mm_struct *mm)
+static inline volatile cpumask_t *mm_cpumask(struct mm_struct *mm)
 {
-	return (struct cpumask *)&mm->cpu_bitmap;
+	return (volatile struct cpumask *)&mm->cpu_bitmap;
 }
 
 struct mmu_gather;
diff --git a/include/linux/smp.h b/include/linux/smp.h
index 70c6f6284dcf..62b3456fec04 100644
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -59,7 +59,7 @@ void on_each_cpu(smp_call_func_t func, void *info, int wait);
  * Call a function on processors specified by mask, which might include
  * the local one.
  */
-void on_each_cpu_mask(const struct cpumask *mask, smp_call_func_t func,
+void on_each_cpu_mask(const volatile struct cpumask *mask, smp_call_func_t func,
 		void *info, bool wait);
 
 /*
@@ -71,7 +71,7 @@ void on_each_cpu_cond(smp_cond_func_t cond_func, smp_call_func_t func,
 		      void *info, bool wait);
 
 void on_each_cpu_cond_mask(smp_cond_func_t cond_func, smp_call_func_t func,
-			   void *info, bool wait, const struct cpumask *mask);
+			   void *info, bool wait, const volatile struct cpumask *mask);
 
 int smp_call_function_single_async(int cpu, call_single_data_t *csd);
 
@@ -118,7 +118,7 @@ extern void smp_cpus_done(unsigned int max_cpus);
  * Call a function on all other processors
  */
 void smp_call_function(smp_call_func_t func, void *info, int wait);
-void smp_call_function_many(const struct cpumask *mask,
+void smp_call_function_many(const volatile struct cpumask *mask,
 			    smp_call_func_t func, void *info, bool wait);
 
 int smp_call_function_any(const struct cpumask *mask,
diff --git a/kernel/smp.c b/kernel/smp.c
index 1b6070bf97bb..fa6e080251bf 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -604,7 +604,7 @@ int smp_call_function_any(const struct cpumask *mask,
 }
 EXPORT_SYMBOL_GPL(smp_call_function_any);
 
-static void smp_call_function_many_cond(const struct cpumask *mask,
+static void smp_call_function_many_cond(const volatile struct cpumask *mask,
 					smp_call_func_t func, void *info,
 					bool wait, smp_cond_func_t cond_func)
 {
@@ -705,7 +705,7 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
  * hardware interrupt handler or from a bottom half handler. Preemption
  * must be disabled when calling this function.
  */
-void smp_call_function_many(const struct cpumask *mask,
+void smp_call_function_many(const volatile struct cpumask *mask,
 			    smp_call_func_t func, void *info, bool wait)
 {
 	smp_call_function_many_cond(mask, func, info, wait, NULL);
@@ -853,7 +853,7 @@ EXPORT_SYMBOL(on_each_cpu);
  * exception is that it may be used during early boot while
  * early_boot_irqs_disabled is set.
  */
-void on_each_cpu_mask(const struct cpumask *mask, smp_call_func_t func,
+void on_each_cpu_mask(const volatile struct cpumask *mask, smp_call_func_t func,
 			void *info, bool wait)
 {
 	int cpu = get_cpu();
@@ -892,7 +892,7 @@ EXPORT_SYMBOL(on_each_cpu_mask);
  * from a hardware interrupt handler or from a bottom half handler.
  */
 void on_each_cpu_cond_mask(smp_cond_func_t cond_func, smp_call_func_t func,
-			   void *info, bool wait, const struct cpumask *mask)
+			   void *info, bool wait, const volatile struct cpumask *mask)
 {
 	int cpu = get_cpu();
 
diff --git a/lib/bitmap.c b/lib/bitmap.c
index 75006c4036e9..6df7b13727d3 100644
--- a/lib/bitmap.c
+++ b/lib/bitmap.c
@@ -235,8 +235,8 @@ void bitmap_cut(unsigned long *dst, const unsigned long *src,
 }
 EXPORT_SYMBOL(bitmap_cut);
 
-int __bitmap_and(unsigned long *dst, const unsigned long *bitmap1,
-				const unsigned long *bitmap2, unsigned int bits)
+int __bitmap_and(unsigned long *dst, const volatile unsigned long *bitmap1,
+		 const volatile unsigned long *bitmap2, unsigned int bits)
 {
 	unsigned int k;
 	unsigned int lim = bits/BITS_PER_LONG;
@@ -301,8 +301,8 @@ void __bitmap_replace(unsigned long *dst,
 }
 EXPORT_SYMBOL(__bitmap_replace);
 
-int __bitmap_intersects(const unsigned long *bitmap1,
-			const unsigned long *bitmap2, unsigned int bits)
+int __bitmap_intersects(const volatile unsigned long *bitmap1,
+			const volatile unsigned long *bitmap2, unsigned int bits)
 {
 	unsigned int k, lim = bits/BITS_PER_LONG;
 	for (k = 0; k < lim; ++k)
diff --git a/lib/cpumask.c b/lib/cpumask.c
index 35924025097b..28763b992beb 100644
--- a/lib/cpumask.c
+++ b/lib/cpumask.c
@@ -15,7 +15,7 @@
  *
  * Returns >= nr_cpu_ids if no further cpus set.
  */
-unsigned int cpumask_next(int n, const struct cpumask *srcp)
+unsigned int cpumask_next(int n, const volatile struct cpumask *srcp)
 {
 	/* -1 is a legal arg here. */
 	if (n != -1)
@@ -32,8 +32,8 @@ EXPORT_SYMBOL(cpumask_next);
  *
  * Returns >= nr_cpu_ids if no further cpus set in both.
  */
-int cpumask_next_and(int n, const struct cpumask *src1p,
-		     const struct cpumask *src2p)
+int cpumask_next_and(int n, const volatile struct cpumask *src1p,
+		     const volatile struct cpumask *src2p)
 {
 	/* -1 is a legal arg here. */
 	if (n != -1)
@@ -51,7 +51,7 @@ EXPORT_SYMBOL(cpumask_next_and);
  * Often used to find any cpu but smp_processor_id() in a mask.
  * Returns >= nr_cpu_ids if no cpus set.
  */
-int cpumask_any_but(const struct cpumask *mask, unsigned int cpu)
+int cpumask_any_but(const volatile struct cpumask *mask, unsigned int cpu)
 {
 	unsigned int i;
 
diff --git a/lib/find_bit.c b/lib/find_bit.c
index f67f86fd2f62..08cd64aecc96 100644
--- a/lib/find_bit.c
+++ b/lib/find_bit.c
@@ -29,8 +29,8 @@
  *    searching it for one bits.
  *  - The optional "addr2", which is anded with "addr1" if present.
  */
-static unsigned long _find_next_bit(const unsigned long *addr1,
-		const unsigned long *addr2, unsigned long nbits,
+static unsigned long _find_next_bit(const volatile unsigned long *addr1,
+		const volatile unsigned long *addr2, unsigned long nbits,
 		unsigned long start, unsigned long invert, unsigned long le)
 {
 	unsigned long tmp, mask;
@@ -74,7 +74,7 @@ static unsigned long _find_next_bit(const unsigned long *addr1,
 /*
  * Find the next set bit in a memory region.
  */
-unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
+unsigned long find_next_bit(const volatile unsigned long *addr, unsigned long size,
 			    unsigned long offset)
 {
 	return _find_next_bit(addr, NULL, size, offset, 0UL, 0);
@@ -92,8 +92,8 @@ EXPORT_SYMBOL(find_next_zero_bit);
 #endif
 
 #if !defined(find_next_and_bit)
-unsigned long find_next_and_bit(const unsigned long *addr1,
-		const unsigned long *addr2, unsigned long size,
+unsigned long find_next_and_bit(const volatile unsigned long *addr1,
+		const volatile unsigned long *addr2, unsigned long size,
 		unsigned long offset)
 {
 	return _find_next_bit(addr1, addr2, size, offset, 0UL, 0);
-- 
2.25.1



^ permalink raw reply	[flat|nested] 67+ messages in thread

* [RFC 19/20] lib/cpumask: introduce cpumask_atomic_or()
  2021-01-31  0:11 [RFC 00/20] TLB batching consolidation and enhancements Nadav Amit
                   ` (17 preceding siblings ...)
  2021-01-31  0:11 ` [RFC 18/20] mm: make mm_cpumask() volatile Nadav Amit
@ 2021-01-31  0:11 ` Nadav Amit
  2021-01-31  0:11 ` [RFC 20/20] mm/rmap: avoid potential races Nadav Amit
                   ` (2 subsequent siblings)
  21 siblings, 0 replies; 67+ messages in thread
From: Nadav Amit @ 2021-01-31  0:11 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Nadav Amit, Mel Gorman, Andrea Arcangeli, Andrew Morton,
	Andy Lutomirski, Dave Hansen, Peter Zijlstra, Thomas Gleixner,
	Will Deacon, Yu Zhao, x86

From: Nadav Amit <namit@vmware.com>

Introduce cpumask_atomic_or() and bitmap_atomic_or() to perform OR
operations atomically on cpumasks and bitmaps. This will be used by the
next patch.

To be more efficient, skip atomic operations when no changes are needed.
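
The skip can be sketched in userspace with C11 atomics. This is only an
approximation under stated assumptions: bitmap_atomic_or_sketch() is a
hypothetical name, and <stdatomic.h> replaces the kernel's atomic_or()
and atomic64_or(). A word is updated with an atomic read-modify-write
only if the source would set a bit that is still clear in the
destination:

```c
#include <limits.h>
#include <stdatomic.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical userspace analogue of bitmap_atomic_or(). */
static void bitmap_atomic_or_sketch(_Atomic unsigned long *dst,
				    const unsigned long *src,
				    unsigned int nbits)
{
	unsigned int nr = (nbits + BITS_PER_WORD - 1) / BITS_PER_WORD;
	unsigned int k;

	for (k = 0; k < nr; k++) {
		unsigned long s = src[k];
		unsigned long d = atomic_load_explicit(&dst[k],
						       memory_order_relaxed);

		/* Every bit of s is already set in dst[k]: skip the RMW. */
		if (!(s & ~d))
			continue;

		atomic_fetch_or(&dst[k], s);
	}
}
```

The plain read that precedes the atomic operation is only a hint. If
another CPU clears a bit between the read and the skip, the skipped
word is not updated, so callers must be able to tolerate that.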

Signed-off-by: Nadav Amit <namit@vmware.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: x86@kernel.org
---
 include/linux/bitmap.h  |  5 +++++
 include/linux/cpumask.h | 12 ++++++++++++
 lib/bitmap.c            | 25 +++++++++++++++++++++++++
 3 files changed, 42 insertions(+)

diff --git a/include/linux/bitmap.h b/include/linux/bitmap.h
index 769b7a98e12f..c9a9b784b244 100644
--- a/include/linux/bitmap.h
+++ b/include/linux/bitmap.h
@@ -76,6 +76,7 @@
  *  bitmap_to_arr32(buf, src, nbits)            Copy nbits from buf to u32[] dst
  *  bitmap_get_value8(map, start)               Get 8bit value from map at start
  *  bitmap_set_value8(map, value, start)        Set 8bit value to map at start
+ *  bitmap_atomic_or(dst, src, nbits)		*dst |= *src (atomically)
  *
  * Note, bitmap_zero() and bitmap_fill() operate over the region of
  * unsigned longs, that is, bits behind bitmap till the unsigned long
@@ -577,6 +578,10 @@ static inline void bitmap_set_value8(unsigned long *map, unsigned long value,
 	map[index] |= value << offset;
 }
 
+extern void bitmap_atomic_or(volatile unsigned long *dst,
+		const volatile unsigned long *bitmap, unsigned int bits);
+
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* __LINUX_BITMAP_H */
diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index 3d7e418aa113..0567d73a0192 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -699,6 +699,18 @@ static inline unsigned int cpumask_size(void)
 	return BITS_TO_LONGS(nr_cpumask_bits) * sizeof(long);
 }
 
+/**
+ * cpumask_atomic_or - *dstp |= *srcp (*dstp is set atomically)
+ * @dstp: the cpumask result (and source which is or'd)
+ * @srcp: the source input
+ */
+static inline void cpumask_atomic_or(volatile struct cpumask *dstp,
+				     const volatile struct cpumask *srcp)
+{
+	bitmap_atomic_or(cpumask_bits(dstp), cpumask_bits(srcp),
+			 nr_cpumask_bits);
+}
+
 /*
  * cpumask_var_t: struct cpumask for stack usage.
  *
diff --git a/lib/bitmap.c b/lib/bitmap.c
index 6df7b13727d3..50f1842ff891 100644
--- a/lib/bitmap.c
+++ b/lib/bitmap.c
@@ -1310,3 +1310,28 @@ void bitmap_to_arr32(u32 *buf, const unsigned long *bitmap, unsigned int nbits)
 EXPORT_SYMBOL(bitmap_to_arr32);
 
 #endif
+
+void bitmap_atomic_or(volatile unsigned long *dst,
+		      const volatile unsigned long *bitmap, unsigned int bits)
+{
+	unsigned int k;
+	unsigned int nr = BITS_TO_LONGS(bits);
+
+	for (k = 0; k < nr; k++) {
+		unsigned long src = bitmap[k];
+
+		/*
+		 * Skip atomic operations when no bits are changed. Do not use
+		 * bitmap[k] directly to avoid redundant loads as bitmap
+		 * variable is volatile.
+		 */
+		if (!(src & ~dst[k]))
+			continue;
+
+		if (BITS_PER_LONG == 64)
+			atomic64_or(src, (atomic64_t*)&dst[k]);
+		else
+			atomic_or(src, (atomic_t*)&dst[k]);
+	}
+}
+EXPORT_SYMBOL(bitmap_atomic_or);
-- 
2.25.1



^ permalink raw reply	[flat|nested] 67+ messages in thread

* [RFC 20/20] mm/rmap: avoid potential races
  2021-01-31  0:11 [RFC 00/20] TLB batching consolidation and enhancements Nadav Amit
                   ` (18 preceding siblings ...)
  2021-01-31  0:11 ` [RFC 19/20] lib/cpumask: introduce cpumask_atomic_or() Nadav Amit
@ 2021-01-31  0:11 ` Nadav Amit
  2021-08-23  8:05   ` Huang, Ying
  2021-01-31  0:39 ` [RFC 00/20] TLB batching consolidation and enhancements Andy Lutomirski
  2021-01-31  3:30 ` Nicholas Piggin
  21 siblings, 1 reply; 67+ messages in thread
From: Nadav Amit @ 2021-01-31  0:11 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Nadav Amit, Mel Gorman, Andrea Arcangeli, Andrew Morton,
	Andy Lutomirski, Dave Hansen, Peter Zijlstra, Thomas Gleixner,
	Will Deacon, Yu Zhao, x86

From: Nadav Amit <namit@vmware.com>

flush_tlb_batched_pending() appears to have a theoretical race:
tlb_flush_batched is being cleared after the TLB flush, and if in
between another core calls set_tlb_ubc_flush_pending() and sets the
pending TLB flush indication, this indication might be lost. Holding the
page-table lock when SPLIT_LOCK is set cannot eliminate this race.
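
The problematic interleaving can be replayed step by step in a small
single-threaded model. Everything here is hypothetical (struct mm_model
and the model_*() helpers are not kernel code); the model only tracks
whether PTE changes exist that no flush has covered while the pending
indication is clear:

```c
#include <stdbool.h>

struct mm_model {
	bool tlb_flush_batched;	/* pending-flush indication */
	unsigned int stale;	/* PTE changes not yet covered by a flush */
};

/* Unmapping side: change a PTE and record that a flush is pending. */
static void model_set_pending(struct mm_model *mm)
{
	mm->stale++;
	mm->tlb_flush_batched = true;
}

/* Flushing side, step 1: the flush covers every change made so far. */
static void model_flush(struct mm_model *mm)
{
	mm->stale = 0;
}

/* Flushing side, step 2: clear the indication after the flush. */
static void model_clear(struct mm_model *mm)
{
	mm->tlb_flush_batched = false;
}

/* True if stale entries may exist but nothing records a pending flush. */
static bool model_lost(const struct mm_model *mm)
{
	return mm->stale != 0 && !mm->tlb_flush_batched;
}
```

Calling model_set_pending() between model_flush() and model_clear()
leaves model_lost() true, which corresponds to the lost indication
described above.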

The current batched TLB invalidation scheme therefore does not seem
viable or easily repairable.

Introduce a new scheme, in which a cpumask is maintained for pending
batched TLB flushes. When a full TLB flush is performed, clear the bit
that corresponds to the CPU that performs the TLB flush.
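
A rough userspace model of this scheme, assuming at most one word of
CPUs and using C11 atomics in place of the kernel cpumask helpers. The
names pending_cpus, batch_add(), full_flush_done() and others_pending()
are hypothetical and only mirror the roles of
tlb_flush_batched_cpumask, tlbbatch_add_mm(), the full-flush path and
the cpumask_any_but() check:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* One bit per CPU that still owes a full TLB flush. */
static _Atomic unsigned long pending_cpus;

/* Unmap side: mark every CPU that may cache entries of the mm. */
static void batch_add(unsigned long mm_cpus)
{
	unsigned long cur = atomic_load_explicit(&pending_cpus,
						 memory_order_relaxed);

	if (mm_cpus & ~cur)
		atomic_fetch_or(&pending_cpus, mm_cpus);
}

/* Called on @cpu right after it performed a full local flush. */
static void full_flush_done(unsigned int cpu)
{
	unsigned long bit = 1UL << cpu;

	if (atomic_load_explicit(&pending_cpus, memory_order_relaxed) & bit)
		atomic_fetch_and(&pending_cpus, ~bit);
}

/* Does any CPU other than @cpu still need a flush? */
static bool others_pending(unsigned int cpu)
{
	return (atomic_load(&pending_cpus) & ~(1UL << cpu)) != 0;
}
```

In this model each CPU clears only its own bit, so the flusher never
clears the whole mask the way cpumask_clear() did in the old code.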

This scheme is only suitable for architectures that use IPIs for TLB
shootdowns. As x86 is the only architecture that currently uses batched
TLB flushes, this is not an issue.

Signed-off-by: Nadav Amit <namit@vmware.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: x86@kernel.org
---
 arch/x86/include/asm/tlbbatch.h | 15 ------------
 arch/x86/include/asm/tlbflush.h |  2 +-
 arch/x86/mm/tlb.c               | 18 ++++++++++-----
 include/linux/mm.h              |  7 ++++++
 include/linux/mm_types_task.h   | 13 -----------
 mm/rmap.c                       | 41 ++++++++++++++++-----------------
 6 files changed, 40 insertions(+), 56 deletions(-)
 delete mode 100644 arch/x86/include/asm/tlbbatch.h

diff --git a/arch/x86/include/asm/tlbbatch.h b/arch/x86/include/asm/tlbbatch.h
deleted file mode 100644
index 1ad56eb3e8a8..000000000000
--- a/arch/x86/include/asm/tlbbatch.h
+++ /dev/null
@@ -1,15 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _ARCH_X86_TLBBATCH_H
-#define _ARCH_X86_TLBBATCH_H
-
-#include <linux/cpumask.h>
-
-struct arch_tlbflush_unmap_batch {
-	/*
-	 * Each bit set is a CPU that potentially has a TLB entry for one of
-	 * the PFNs being flushed..
-	 */
-	struct cpumask cpumask;
-};
-
-#endif /* _ARCH_X86_TLBBATCH_H */
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index a4e7c90d11a8..0e681a565b78 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -240,7 +240,7 @@ static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
 	flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, PAGE_SHIFT, false);
 }
 
-extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
+extern void arch_tlbbatch_flush(void);
 
 static inline bool pte_may_need_flush(pte_t oldpte, pte_t newpte)
 {
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index ba85d6bb4988..f7304d45e6b9 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -760,8 +760,15 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
 			count_vm_tlb_events(NR_TLB_LOCAL_FLUSH_ONE, nr_invalidate);
 		trace_tlb_flush(reason, nr_invalidate);
 	} else {
+		int cpu = smp_processor_id();
+
 		/* Full flush. */
 		flush_tlb_local();
+
+		/* If there are batched TLB flushes, mark them as done */
+		if (cpumask_test_cpu(cpu, &tlb_flush_batched_cpumask))
+			cpumask_clear_cpu(cpu, &tlb_flush_batched_cpumask);
+
 		if (local)
 			count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
 		trace_tlb_flush(reason, TLB_FLUSH_ALL);
@@ -1143,21 +1150,20 @@ static const struct flush_tlb_info full_flush_tlb_info = {
 	.end = TLB_FLUSH_ALL,
 };
 
-void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
+void arch_tlbbatch_flush(void)
 {
 	int cpu = get_cpu();
 
-	if (cpumask_test_cpu(cpu, &batch->cpumask)) {
+	if (cpumask_test_cpu(cpu, &tlb_flush_batched_cpumask)) {
 		lockdep_assert_irqs_enabled();
 		local_irq_disable();
 		flush_tlb_func_local(&full_flush_tlb_info, TLB_LOCAL_SHOOTDOWN);
 		local_irq_enable();
 	}
 
-	if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids)
-		flush_tlb_others(&batch->cpumask, &full_flush_tlb_info);
-
-	cpumask_clear(&batch->cpumask);
+	if (cpumask_any_but(&tlb_flush_batched_cpumask, cpu) < nr_cpu_ids)
+		flush_tlb_others(&tlb_flush_batched_cpumask,
+				 &full_flush_tlb_info);
 
 	/*
 	 * We cannot call mark_mm_tlb_gen_done() since we do not know which
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a8a5bf82bd03..e4eeee985cf6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3197,5 +3197,12 @@ unsigned long wp_shared_mapping_range(struct address_space *mapping,
 
 extern int sysctl_nr_trim_pages;
 
+#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+extern volatile cpumask_t tlb_flush_batched_cpumask;
+void tlb_batch_init(void);
+#else
+static inline void tlb_batch_init(void) { }
+#endif
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
diff --git a/include/linux/mm_types_task.h b/include/linux/mm_types_task.h
index c1bc6731125c..742c542aaf3f 100644
--- a/include/linux/mm_types_task.h
+++ b/include/linux/mm_types_task.h
@@ -15,10 +15,6 @@
 
 #include <asm/page.h>
 
-#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
-#include <asm/tlbbatch.h>
-#endif
-
 #define USE_SPLIT_PTE_PTLOCKS	(NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS)
 #define USE_SPLIT_PMD_PTLOCKS	(USE_SPLIT_PTE_PTLOCKS && \
 		IS_ENABLED(CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK))
@@ -75,15 +71,6 @@ struct page_frag {
 /* Track pages that require TLB flushes */
 struct tlbflush_unmap_batch {
 #ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
-	/*
-	 * The arch code makes the following promise: generic code can modify a
-	 * PTE, then call arch_tlbbatch_add_mm() (which internally provides all
-	 * needed barriers), then call arch_tlbbatch_flush(), and the entries
-	 * will be flushed on all CPUs by the time that arch_tlbbatch_flush()
-	 * returns.
-	 */
-	struct arch_tlbflush_unmap_batch arch;
-
 	/* True if a flush is needed. */
 	bool flush_required;
 
diff --git a/mm/rmap.c b/mm/rmap.c
index 9655e1fc328a..0d2ac5a72d19 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -586,6 +586,18 @@ void page_unlock_anon_vma_read(struct anon_vma *anon_vma)
 }
 
 #ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+
+/*
+ * TLB batching requires arch code to make the following promise: upon a full
+ * TLB flush, the CPU that performs the flush will clear its bit in
+ * tlb_flush_batched_cpumask atomically (i.e., during an IRQ or while interrupts
+ * are disabled). arch_tlbbatch_flush() is required to flush all the CPUs that
+ * are set in tlb_flush_batched_cpumask.
+ *
+ * This scheme is therefore only suitable for IPI-based TLB shootdowns.
+ */
+volatile cpumask_t tlb_flush_batched_cpumask = { 0 };
+
 /*
  * Flush TLB entries for recently unmapped pages from remote CPUs. It is
  * important if a PTE was dirty when it was unmapped that it's flushed
@@ -599,7 +611,7 @@ void try_to_unmap_flush(void)
 	if (!tlb_ubc->flush_required)
 		return;
 
-	arch_tlbbatch_flush(&tlb_ubc->arch);
+	arch_tlbbatch_flush();
 	tlb_ubc->flush_required = false;
 	tlb_ubc->writable = false;
 }
@@ -613,27 +625,20 @@ void try_to_unmap_flush_dirty(void)
 		try_to_unmap_flush();
 }
 
-static inline void tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
-				   struct mm_struct *mm)
+static inline void tlbbatch_add_mm(struct mm_struct *mm)
 {
+	cpumask_atomic_or(&tlb_flush_batched_cpumask, mm_cpumask(mm));
+
 	inc_mm_tlb_gen(mm);
-	cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
 }
 
 static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
 {
 	struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
 
-	tlbbatch_add_mm(&tlb_ubc->arch, mm);
+	tlbbatch_add_mm(mm);
 	tlb_ubc->flush_required = true;
 
-	/*
-	 * Ensure compiler does not re-order the setting of tlb_flush_batched
-	 * before the PTE is cleared.
-	 */
-	barrier();
-	mm->tlb_flush_batched = true;
-
 	/*
 	 * If the PTE was dirty then it's best to assume it's writable. The
 	 * caller must use try_to_unmap_flush_dirty() or try_to_unmap_flush()
@@ -679,16 +684,10 @@ static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
  */
 void flush_tlb_batched_pending(struct mm_struct *mm)
 {
-	if (data_race(mm->tlb_flush_batched)) {
-		flush_tlb_mm(mm);
+	if (!cpumask_intersects(mm_cpumask(mm), &tlb_flush_batched_cpumask))
+		return;
 
-		/*
-		 * Do not allow the compiler to re-order the clearing of
-		 * tlb_flush_batched before the tlb is flushed.
-		 */
-		barrier();
-		mm->tlb_flush_batched = false;
-	}
+	flush_tlb_mm(mm);
 }
 #else
 static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
-- 
2.25.1
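
For readers following the scheme in the patch above, here is a minimal
user-space sketch of the bookkeeping it describes. It is not kernel code.
The type cpumask_model_t and every model_* helper are hypothetical names
introduced only for illustration, the mask is limited to one machine word,
and the assumption that a full flush clears only the flushing CPU's own bit
is one reading of the comment in mm/rmap.c, not something the patch text
confirms.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a cpumask: one bit per CPU, single word only. */
typedef unsigned long cpumask_model_t;

/* Models tlb_flush_batched_cpumask: CPUs that still owe a deferred flush. */
static _Atomic cpumask_model_t batched_mask;

/* Models tlbbatch_add_mm(): atomically OR the mm's CPUs into the global mask. */
static void model_batch_add_mm(cpumask_model_t mm_cpus)
{
	atomic_fetch_or(&batched_mask, mm_cpus);
}

/* Assumed arch-side promise: a full flush on 'cpu' clears that CPU's bit. */
static void model_cpu_full_flush(unsigned int cpu)
{
	atomic_fetch_and(&batched_mask, ~(1UL << cpu));
}

/* Models the test in flush_tlb_batched_pending(): flush only on overlap. */
static bool model_flush_pending(cpumask_model_t mm_cpus)
{
	return (atomic_load(&batched_mask) & mm_cpus) != 0;
}
```

In this model, an mm whose CPUs have all performed a full flush since the
batch was recorded no longer intersects the mask, so the check reports that
no extra flush is pending for it.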



* Re: [RFC 00/20] TLB batching consolidation and enhancements
  2021-01-31  0:11 [RFC 00/20] TLB batching consolidation and enhancements Nadav Amit
                   ` (19 preceding siblings ...)
  2021-01-31  0:11 ` [RFC 20/20] mm/rmap: avoid potential races Nadav Amit
@ 2021-01-31  0:39 ` Andy Lutomirski
  2021-01-31  1:08   ` Nadav Amit
  2021-01-31  3:30 ` Nicholas Piggin
  21 siblings, 1 reply; 67+ messages in thread
From: Andy Lutomirski @ 2021-01-31  0:39 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Linux-MM, LKML, Nadav Amit, Andrea Arcangeli, Andrew Morton,
	Andy Lutomirski, Dave Hansen, linux-csky, linuxppc-dev,
	linux-s390, Mel Gorman, Nick Piggin, Peter Zijlstra,
	Thomas Gleixner, Will Deacon, X86 ML, Yu Zhao

On Sat, Jan 30, 2021 at 4:16 PM Nadav Amit <nadav.amit@gmail.com> wrote:
>
> From: Nadav Amit <namit@vmware.com>
>
> There are currently (at least?) 5 different TLB batching schemes in the
> kernel:
>
> 1. Using mmu_gather (e.g., zap_page_range()).
>
> 2. Using {inc|dec}_tlb_flush_pending() to inform other threads on the
>    ongoing deferred TLB flush and flushing the entire range eventually
>    (e.g., change_protection_range()).
>
> 3. arch_{enter|leave}_lazy_mmu_mode() for sparc and powerpc (and Xen?).
>
> 4. Batching per-table flushes (move_ptes()).
>
> 5. By setting a flag on that a deferred TLB flush operation takes place,
>    flushing when (try_to_unmap_one() on x86).

Are you referring to the arch_tlbbatch_add_mm/flush mechanism?


* Re: [RFC 01/20] mm/tlb: fix fullmm semantics
  2021-01-31  0:11 ` [RFC 01/20] mm/tlb: fix fullmm semantics Nadav Amit
@ 2021-01-31  1:02   ` Andy Lutomirski
  2021-01-31  1:19     ` Nadav Amit
  2021-02-01 11:36   ` Peter Zijlstra
  1 sibling, 1 reply; 67+ messages in thread
From: Andy Lutomirski @ 2021-01-31  1:02 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Linux-MM, LKML, Nadav Amit, Andrea Arcangeli, Andrew Morton,
	Andy Lutomirski, Dave Hansen, Peter Zijlstra, Thomas Gleixner,
	Will Deacon, Yu Zhao, Nick Piggin, X86 ML

On Sat, Jan 30, 2021 at 4:16 PM Nadav Amit <nadav.amit@gmail.com> wrote:
>
> From: Nadav Amit <namit@vmware.com>
>
> fullmm in mmu_gather is supposed to indicate that the mm is torn-down
> (e.g., on process exit) and can therefore allow certain optimizations.
> However, tlb_finish_mmu() sets fullmm, when in fact it wants to say that
> the TLB should be fully flushed.

Maybe also rename fullmm?


* Re: [RFC 03/20] mm/mprotect: do not flush on permission promotion
  2021-01-31  0:11 ` [RFC 03/20] mm/mprotect: do not flush on permission promotion Nadav Amit
@ 2021-01-31  1:07   ` Andy Lutomirski
  2021-01-31  1:17     ` Nadav Amit
       [not found]     ` <7a6de15a-a570-31f2-14d6-a8010296e694@citrix.com>
  0 siblings, 2 replies; 67+ messages in thread
From: Andy Lutomirski @ 2021-01-31  1:07 UTC (permalink / raw)
  To: Nadav Amit, Andrew Cooper
  Cc: Linux-MM, LKML, Nadav Amit, Andrea Arcangeli, Andrew Morton,
	Andy Lutomirski, Dave Hansen, Peter Zijlstra, Thomas Gleixner,
	Will Deacon, Yu Zhao, Nick Piggin, X86 ML

Adding Andrew Cooper, who has a distressingly extensive understanding
of the x86 PTE magic.

On Sat, Jan 30, 2021 at 4:16 PM Nadav Amit <nadav.amit@gmail.com> wrote:
>
> From: Nadav Amit <namit@vmware.com>
>
> Currently, using mprotect() to unprotect a memory region or uffd to
> unprotect a memory region causes a TLB flush. At least on x86, as
> protection is promoted, no TLB flush is needed.
>
> Add an arch-specific pte_may_need_flush() which tells whether a TLB
> flush is needed based on the old PTE and the new one. Implement an x86
> pte_may_need_flush().
>
> For x86, besides the simple logic that PTE protection promotion or
> changes of software bits does not require a flush, also add logic that
> considers the dirty-bit. If the dirty-bit is clear and write-protect is
> set, no TLB flush is needed, as x86 updates the dirty-bit atomically
> on write, and if the bit is clear, the PTE is reread.
>
> Signed-off-by: Nadav Amit <namit@vmware.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Will Deacon <will@kernel.org>
> Cc: Yu Zhao <yuzhao@google.com>
> Cc: Nick Piggin <npiggin@gmail.com>
> Cc: x86@kernel.org
> ---
>  arch/x86/include/asm/tlbflush.h | 44 +++++++++++++++++++++++++++++++++
>  include/asm-generic/tlb.h       |  4 +++
>  mm/mprotect.c                   |  3 ++-
>  3 files changed, 50 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index 8c87a2e0b660..a617dc0a9b06 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -255,6 +255,50 @@ static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
>
>  extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
>
> +static inline bool pte_may_need_flush(pte_t oldpte, pte_t newpte)
> +{
> +       const pteval_t ignore_mask = _PAGE_SOFTW1 | _PAGE_SOFTW2 |
> +                                    _PAGE_SOFTW3 | _PAGE_ACCESSED;

Why is accessed ignored?  Surely clearing the accessed bit needs a
flush if the old PTE is present.

> +       const pteval_t enable_mask = _PAGE_RW | _PAGE_DIRTY | _PAGE_GLOBAL;
> +       pteval_t oldval = pte_val(oldpte);
> +       pteval_t newval = pte_val(newpte);
> +       pteval_t diff = oldval ^ newval;
> +       pteval_t disable_mask = 0;
> +
> +       if (IS_ENABLED(CONFIG_X86_64) || IS_ENABLED(CONFIG_X86_PAE))
> +               disable_mask = _PAGE_NX;
> +
> +       /* new is non-present: need only if old is present */
> +       if (pte_none(newpte))
> +               return !pte_none(oldpte);
> +
> +       /*
> +        * If, excluding the ignored bits, only RW and dirty are cleared and the
> +        * old PTE does not have the dirty-bit set, we can avoid a flush. This
> +        * is possible since x86 architecture set the dirty bit atomically while

s/set/sets/

> +        * it caches the PTE in the TLB.
> +        *
> +        * The condition considers any change to RW and dirty as not requiring
> +        * flush if the old PTE is not dirty or not writable for simplification
> +        * of the code and to consider (unlikely) cases of changing dirty-bit of
> +        * write-protected PTE.
> +        */
> +       if (!(diff & ~(_PAGE_RW | _PAGE_DIRTY | ignore_mask)) &&
> +           (!(pte_dirty(oldpte) || !pte_write(oldpte))))
> +               return false;

This logic seems confusing to me.  Is your goal to say that, if the
old PTE was clean and writable and the new PTE is write-protected,
then no flush is needed?  If so, I would believe you're right, but I'm
not convinced you've actually implemented this.  Also, there may be
other things going on that need flushing, e.g. a change of the address
or an accessed bit or NX change.

Also, CET makes this extra bizarre.

> +
> +       /*
> +        * Any change of PFN and any flag other than those that we consider
> +        * requires a flush (e.g., PAT, protection keys). To save flushes we do
> +        * not consider the access bit as it is considered by the kernel as
> +        * best-effort.
> +        */
> +       return diff & ((oldval & enable_mask) |
> +                      (newval & disable_mask) |
> +                      ~(enable_mask | disable_mask | ignore_mask));
> +}
> +#define pte_may_need_flush pte_may_need_flush
> +
>  #endif /* !MODULE */
>
>  #endif /* _ASM_X86_TLBFLUSH_H */
> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> index eea113323468..c2deec0b6919 100644
> --- a/include/asm-generic/tlb.h
> +++ b/include/asm-generic/tlb.h
> @@ -654,6 +654,10 @@ static inline void tlb_flush_p4d_range(struct mmu_gather *tlb,
>         } while (0)
>  #endif
>
> +#ifndef pte_may_need_flush
> +static inline bool pte_may_need_flush(pte_t oldpte, pte_t newpte) { return true; }
> +#endif
> +
>  #endif /* CONFIG_MMU */
>
>  #endif /* _ASM_GENERIC__TLB_H */
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 632d5a677d3f..b7473d2c9a1f 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -139,7 +139,8 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
>                                 ptent = pte_mkwrite(ptent);
>                         }
>                         ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
> -                       tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
> +                       if (pte_may_need_flush(oldpte, ptent))
> +                               tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
>                         pages++;
>                 } else if (is_swap_pte(oldpte)) {
>                         swp_entry_t entry = pte_to_swp_entry(oldpte);
> --
> 2.25.1
>
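
As an aid to reading the final return statement of the quoted
pte_may_need_flush(), here is a small user-space model of only that mask
expression. It is a sketch under assumptions: the M_* bit positions follow
the usual x86 PTE layout but are defined locally for illustration, the three
software bits are folded into one mask, NX is always treated as available,
and the pte_none() check and the early-return condition under discussion are
deliberately left out. The name model_final_check() is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t pteval_model_t;

/* Illustrative bit positions, defined locally for this model only. */
#define M_PRESENT  (1ULL << 0)
#define M_RW       (1ULL << 1)
#define M_ACCESSED (1ULL << 5)
#define M_DIRTY    (1ULL << 6)
#define M_GLOBAL   (1ULL << 8)
#define M_SOFTW    (7ULL << 9)	/* stands in for SOFTW1..SOFTW3 */
#define M_NX       (1ULL << 63)

/* Models only the final return expression of the quoted function. */
static bool model_final_check(pteval_model_t oldval, pteval_model_t newval)
{
	const pteval_model_t ignore_mask = M_SOFTW | M_ACCESSED;
	const pteval_model_t enable_mask = M_RW | M_DIRTY | M_GLOBAL;
	const pteval_model_t disable_mask = M_NX;
	pteval_model_t diff = oldval ^ newval;

	/*
	 * A changed bit counts if it is an "enable" bit that was set before
	 * (so it is being cleared), a "disable" bit that is set now (so it is
	 * being set), or any bit outside the three masks (PFN, PAT, ...).
	 */
	return (diff & ((oldval & enable_mask) |
			(newval & disable_mask) |
			~(enable_mask | disable_mask | ignore_mask))) != 0;
}
```

In this model, granting write permission or clearing NX reports no flush,
while revoking write permission, setting NX, or changing the PFN reports
that a flush may be needed. A change confined to the accessed bit is
ignored, which is the behaviour questioned in the review comments above.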


* Re: [RFC 00/20] TLB batching consolidation and enhancements
  2021-01-31  0:39 ` [RFC 00/20] TLB batching consolidation and enhancements Andy Lutomirski
@ 2021-01-31  1:08   ` Nadav Amit
  0 siblings, 0 replies; 67+ messages in thread
From: Nadav Amit @ 2021-01-31  1:08 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linux-MM, LKML, Andrea Arcangeli, Andrew Morton, Dave Hansen,
	linux-csky, linuxppc-dev, linux-s390, Mel Gorman, Nick Piggin,
	Peter Zijlstra, Thomas Gleixner, Will Deacon, X86 ML, Yu Zhao

> On Jan 30, 2021, at 4:39 PM, Andy Lutomirski <luto@kernel.org> wrote:
> 
> On Sat, Jan 30, 2021 at 4:16 PM Nadav Amit <nadav.amit@gmail.com> wrote:
>> From: Nadav Amit <namit@vmware.com>
>> 
>> There are currently (at least?) 5 different TLB batching schemes in the
>> kernel:
>> 
>> 1. Using mmu_gather (e.g., zap_page_range()).
>> 
>> 2. Using {inc|dec}_tlb_flush_pending() to inform other threads on the
>>   ongoing deferred TLB flush and flushing the entire range eventually
>>   (e.g., change_protection_range()).
>> 
>> 3. arch_{enter|leave}_lazy_mmu_mode() for sparc and powerpc (and Xen?).
>> 
>> 4. Batching per-table flushes (move_ptes()).
>> 
>> 5. By setting a flag on that a deferred TLB flush operation takes place,
>>   flushing when (try_to_unmap_one() on x86).
> 
> Are you referring to the arch_tlbbatch_add_mm/flush mechanism?

Yes.


* Re: [RFC 03/20] mm/mprotect: do not flush on permission promotion
  2021-01-31  1:07   ` Andy Lutomirski
@ 2021-01-31  1:17     ` Nadav Amit
  2021-01-31  2:59       ` Andy Lutomirski
       [not found]     ` <7a6de15a-a570-31f2-14d6-a8010296e694@citrix.com>
  1 sibling, 1 reply; 67+ messages in thread
From: Nadav Amit @ 2021-01-31  1:17 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andrew Cooper, Linux-MM, LKML, Andrea Arcangeli, Andrew Morton,
	Dave Hansen, Peter Zijlstra, Thomas Gleixner, Will Deacon,
	Yu Zhao, Nick Piggin, X86 ML

> On Jan 30, 2021, at 5:07 PM, Andy Lutomirski <luto@kernel.org> wrote:
> 
> Adding Andrew Cooper, who has a distressingly extensive understanding
> of the x86 PTE magic.
> 
> On Sat, Jan 30, 2021 at 4:16 PM Nadav Amit <nadav.amit@gmail.com> wrote:
>> From: Nadav Amit <namit@vmware.com>
>> 
>> Currently, using mprotect() to unprotect a memory region or uffd to
>> unprotect a memory region causes a TLB flush. At least on x86, as
>> protection is promoted, no TLB flush is needed.
>> 
>> Add an arch-specific pte_may_need_flush() which tells whether a TLB
>> flush is needed based on the old PTE and the new one. Implement an x86
>> pte_may_need_flush().
>> 
>> For x86, besides the simple logic that PTE protection promotion or
>> changes of software bits does not require a flush, also add logic that
>> considers the dirty-bit. If the dirty-bit is clear and write-protect is
>> set, no TLB flush is needed, as x86 updates the dirty-bit atomically
>> on write, and if the bit is clear, the PTE is reread.
>> 
>> Signed-off-by: Nadav Amit <namit@vmware.com>
>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Andy Lutomirski <luto@kernel.org>
>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Will Deacon <will@kernel.org>
>> Cc: Yu Zhao <yuzhao@google.com>
>> Cc: Nick Piggin <npiggin@gmail.com>
>> Cc: x86@kernel.org
>> ---
>> arch/x86/include/asm/tlbflush.h | 44 +++++++++++++++++++++++++++++++++
>> include/asm-generic/tlb.h       |  4 +++
>> mm/mprotect.c                   |  3 ++-
>> 3 files changed, 50 insertions(+), 1 deletion(-)
>> 
>> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
>> index 8c87a2e0b660..a617dc0a9b06 100644
>> --- a/arch/x86/include/asm/tlbflush.h
>> +++ b/arch/x86/include/asm/tlbflush.h
>> @@ -255,6 +255,50 @@ static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
>> 
>> extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
>> 
>> +static inline bool pte_may_need_flush(pte_t oldpte, pte_t newpte)
>> +{
>> +       const pteval_t ignore_mask = _PAGE_SOFTW1 | _PAGE_SOFTW2 |
>> +                                    _PAGE_SOFTW3 | _PAGE_ACCESSED;
> 
> Why is accessed ignored?  Surely clearing the accessed bit needs a
> flush if the old PTE is present.

I am just following the current scheme in the kernel (x86):

int ptep_clear_flush_young(struct vm_area_struct *vma,
                           unsigned long address, pte_t *ptep)
{
        /*
         * On x86 CPUs, clearing the accessed bit without a TLB flush
         * doesn't cause data corruption. [ It could cause incorrect
         * page aging and the (mistaken) reclaim of hot pages, but the
         * chance of that should be relatively low. ]
         *
         * So as a performance optimization don't flush the TLB when
         * clearing the accessed bit, it will eventually be flushed by
         * a context switch or a VM operation anyway. [ In the rare
         * event of it not getting flushed for a long time the delay
         * shouldn't really matter because there's no real memory
         * pressure for swapout to react to. ]
         */
        return ptep_test_and_clear_young(vma, address, ptep);
}


> 
>> +       const pteval_t enable_mask = _PAGE_RW | _PAGE_DIRTY | _PAGE_GLOBAL;
>> +       pteval_t oldval = pte_val(oldpte);
>> +       pteval_t newval = pte_val(newpte);
>> +       pteval_t diff = oldval ^ newval;
>> +       pteval_t disable_mask = 0;
>> +
>> +       if (IS_ENABLED(CONFIG_X86_64) || IS_ENABLED(CONFIG_X86_PAE))
>> +               disable_mask = _PAGE_NX;
>> +
>> +       /* new is non-present: need only if old is present */
>> +       if (pte_none(newpte))
>> +               return !pte_none(oldpte);
>> +
>> +       /*
>> +        * If, excluding the ignored bits, only RW and dirty are cleared and the
>> +        * old PTE does not have the dirty-bit set, we can avoid a flush. This
>> +        * is possible since x86 architecture set the dirty bit atomically while
> 
> s/set/sets/
> 
>> +        * it caches the PTE in the TLB.
>> +        *
>> +        * The condition considers any change to RW and dirty as not requiring
>> +        * flush if the old PTE is not dirty or not writable for simplification
>> +        * of the code and to consider (unlikely) cases of changing dirty-bit of
>> +        * write-protected PTE.
>> +        */
>> +       if (!(diff & ~(_PAGE_RW | _PAGE_DIRTY | ignore_mask)) &&
>> +           (!(pte_dirty(oldpte) || !pte_write(oldpte))))
>> +               return false;
> 
> This logic seems confusing to me.  Is your goal to say that, if the
> old PTE was clean and writable and the new PTE is write-protected,
> then no flush is needed?

Yes.

> If so, I would believe you're right, but I'm
> not convinced you've actually implemented this.  Also, there may be
> other things going on that need flushing, e.g. a change of the address
> or an accessed bit or NX change.

The first part, (diff & ~(_PAGE_RW | _PAGE_DIRTY | ignore_mask)), is supposed
to capture changes of address, NX-bit, etc.

The second part is indeed wrong. It should have been:
 (!pte_dirty(oldpte) || !pte_write(oldpte))
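
To make the difference between the posted and the corrected expression
concrete, here is a tiny user-space sketch that models only the second half
of the condition as plain booleans. The helper names posted_cond() and
corrected_cond() are hypothetical, and the sketch says nothing about which
form is architecturally safe; it only shows where the two disagree.

```c
#include <stdbool.h>

/* Second half of the early-return condition as posted in the RFC. */
static bool posted_cond(bool old_dirty, bool old_write)
{
	/* Equivalent to: !old_dirty && old_write */
	return !(old_dirty || !old_write);
}

/* Second half as corrected in this reply. */
static bool corrected_cond(bool old_dirty, bool old_write)
{
	return !old_dirty || !old_write;
}
```

The two forms agree when the old PTE is writable and differ whenever the old
PTE is write-protected: the posted form is then always false, while the
corrected form is always true.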

> 
> Also, CET makes this extra bizarre.

I saw something about not-writable-and-dirty PTEs being treated differently.
I need to have a look, but I am not sure it affects anything.



* Re: [RFC 01/20] mm/tlb: fix fullmm semantics
  2021-01-31  1:02   ` Andy Lutomirski
@ 2021-01-31  1:19     ` Nadav Amit
  2021-01-31  2:57       ` Andy Lutomirski
  0 siblings, 1 reply; 67+ messages in thread
From: Nadav Amit @ 2021-01-31  1:19 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linux-MM, LKML, Andrea Arcangeli, Andrew Morton, Dave Hansen,
	Peter Zijlstra, Thomas Gleixner, Will Deacon, Yu Zhao,
	Nick Piggin, X86 ML

> On Jan 30, 2021, at 5:02 PM, Andy Lutomirski <luto@kernel.org> wrote:
> 
> On Sat, Jan 30, 2021 at 4:16 PM Nadav Amit <nadav.amit@gmail.com> wrote:
>> From: Nadav Amit <namit@vmware.com>
>> 
>> fullmm in mmu_gather is supposed to indicate that the mm is torn-down
>> (e.g., on process exit) and can therefore allow certain optimizations.
>> However, tlb_finish_mmu() sets fullmm, when in fact it wants to say that
>> the TLB should be fully flushed.
> 
> Maybe also rename fullmm?

Possible. How about mm_torn_down?

I should have also changed the comment in tlb_finish_mmu().


* Re: [RFC 01/20] mm/tlb: fix fullmm semantics
  2021-01-31  1:19     ` Nadav Amit
@ 2021-01-31  2:57       ` Andy Lutomirski
  2021-02-01  7:30         ` Nadav Amit
  0 siblings, 1 reply; 67+ messages in thread
From: Andy Lutomirski @ 2021-01-31  2:57 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andy Lutomirski, Linux-MM, LKML, Andrea Arcangeli, Andrew Morton,
	Dave Hansen, Peter Zijlstra, Thomas Gleixner, Will Deacon,
	Yu Zhao, Nick Piggin, X86 ML

On Sat, Jan 30, 2021 at 5:19 PM Nadav Amit <nadav.amit@gmail.com> wrote:
>
> > On Jan 30, 2021, at 5:02 PM, Andy Lutomirski <luto@kernel.org> wrote:
> >
> > On Sat, Jan 30, 2021 at 4:16 PM Nadav Amit <nadav.amit@gmail.com> wrote:
> >> From: Nadav Amit <namit@vmware.com>
> >>
> >> fullmm in mmu_gather is supposed to indicate that the mm is torn-down
> >> (e.g., on process exit) and can therefore allow certain optimizations.
> >> However, tlb_finish_mmu() sets fullmm, when in fact it wants to say that
> >> the TLB should be fully flushed.
> >
> > Maybe also rename fullmm?
>
> Possible. How about mm_torn_down?

Sure.  Or mm_exiting, perhaps?

>
> I should have also changed the comment in tlb_finish_mmu().


* Re: [RFC 03/20] mm/mprotect: do not flush on permission promotion
  2021-01-31  1:17     ` Nadav Amit
@ 2021-01-31  2:59       ` Andy Lutomirski
  0 siblings, 0 replies; 67+ messages in thread
From: Andy Lutomirski @ 2021-01-31  2:59 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andy Lutomirski, Andrew Cooper, Linux-MM, LKML, Andrea Arcangeli,
	Andrew Morton, Dave Hansen, Peter Zijlstra, Thomas Gleixner,
	Will Deacon, Yu Zhao, Nick Piggin, X86 ML

On Sat, Jan 30, 2021 at 5:17 PM Nadav Amit <nadav.amit@gmail.com> wrote:
>
> > On Jan 30, 2021, at 5:07 PM, Andy Lutomirski <luto@kernel.org> wrote:
> >
> > Adding Andrew Cooper, who has a distressingly extensive understanding
> > of the x86 PTE magic.
> >
> > On Sat, Jan 30, 2021 at 4:16 PM Nadav Amit <nadav.amit@gmail.com> wrote:
> >> From: Nadav Amit <namit@vmware.com>
> >>
> >> Currently, using mprotect() to unprotect a memory region or uffd to
> >> unprotect a memory region causes a TLB flush. At least on x86, as
> >> protection is promoted, no TLB flush is needed.
> >>
> >> Add an arch-specific pte_may_need_flush() which tells whether a TLB
> >> flush is needed based on the old PTE and the new one. Implement an x86
> >> pte_may_need_flush().
> >>
> >> For x86, besides the simple logic that PTE protection promotion or
> >> changes of software bits does not require a flush, also add logic that
> >> considers the dirty-bit. If the dirty-bit is clear and write-protect is
> >> set, no TLB flush is needed, as x86 updates the dirty-bit atomically
> >> on write, and if the bit is clear, the PTE is reread.
> >>
> >> Signed-off-by: Nadav Amit <namit@vmware.com>
> >> Cc: Andrea Arcangeli <aarcange@redhat.com>
> >> Cc: Andrew Morton <akpm@linux-foundation.org>
> >> Cc: Andy Lutomirski <luto@kernel.org>
> >> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> >> Cc: Peter Zijlstra <peterz@infradead.org>
> >> Cc: Thomas Gleixner <tglx@linutronix.de>
> >> Cc: Will Deacon <will@kernel.org>
> >> Cc: Yu Zhao <yuzhao@google.com>
> >> Cc: Nick Piggin <npiggin@gmail.com>
> >> Cc: x86@kernel.org
> >> ---
> >> arch/x86/include/asm/tlbflush.h | 44 +++++++++++++++++++++++++++++++++
> >> include/asm-generic/tlb.h       |  4 +++
> >> mm/mprotect.c                   |  3 ++-
> >> 3 files changed, 50 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> >> index 8c87a2e0b660..a617dc0a9b06 100644
> >> --- a/arch/x86/include/asm/tlbflush.h
> >> +++ b/arch/x86/include/asm/tlbflush.h
> >> @@ -255,6 +255,50 @@ static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
> >>
> >> extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
> >>
> >> +static inline bool pte_may_need_flush(pte_t oldpte, pte_t newpte)
> >> +{
> >> +       const pteval_t ignore_mask = _PAGE_SOFTW1 | _PAGE_SOFTW2 |
> >> +                                    _PAGE_SOFTW3 | _PAGE_ACCESSED;
> >
> > Why is accessed ignored?  Surely clearing the accessed bit needs a
> > flush if the old PTE is present.
>
> I am just following the current scheme in the kernel (x86):
>
> int ptep_clear_flush_young(struct vm_area_struct *vma,
>                            unsigned long address, pte_t *ptep)
> {
>         /*
>          * On x86 CPUs, clearing the accessed bit without a TLB flush
>          * doesn't cause data corruption. [ It could cause incorrect
>          * page aging and the (mistaken) reclaim of hot pages, but the
>          * chance of that should be relatively low. ]
>          *

If anyone ever implements the optimization of skipping the flush when
unmapping a !accessed page, then this will cause data corruption.

If nothing else, this deserves a nice comment in the new code.

>          * So as a performance optimization don't flush the TLB when
>          * clearing the accessed bit, it will eventually be flushed by
>          * a context switch or a VM operation anyway. [ In the rare
>          * event of it not getting flushed for a long time the delay
>          * shouldn't really matter because there's no real memory
>          * pressure for swapout to react to. ]
>          */
>         return ptep_test_and_clear_young(vma, address, ptep);
> }
>
>
> >
> >> +       const pteval_t enable_mask = _PAGE_RW | _PAGE_DIRTY | _PAGE_GLOBAL;
> >> +       pteval_t oldval = pte_val(oldpte);
> >> +       pteval_t newval = pte_val(newpte);
> >> +       pteval_t diff = oldval ^ newval;
> >> +       pteval_t disable_mask = 0;
> >> +
> >> +       if (IS_ENABLED(CONFIG_X86_64) || IS_ENABLED(CONFIG_X86_PAE))
> >> +               disable_mask = _PAGE_NX;
> >> +
> >> +       /* new is non-present: need only if old is present */
> >> +       if (pte_none(newpte))
> >> +               return !pte_none(oldpte);
> >> +
> >> +       /*
> >> +        * If, excluding the ignored bits, only RW and dirty are cleared and the
> >> +        * old PTE does not have the dirty-bit set, we can avoid a flush. This
> >> +        * is possible since x86 architecture set the dirty bit atomically while
> >
> > s/set/sets/
> >
> >> +        * it caches the PTE in the TLB.
> >> +        *
> >> +        * The condition considers any change to RW and dirty as not requiring
> >> +        * flush if the old PTE is not dirty or not writable for simplification
> >> +        * of the code and to consider (unlikely) cases of changing dirty-bit of
> >> +        * write-protected PTE.
> >> +        */
> >> +       if (!(diff & ~(_PAGE_RW | _PAGE_DIRTY | ignore_mask)) &&
> >> +           (!(pte_dirty(oldpte) || !pte_write(oldpte))))
> >> +               return false;
> >
> > This logic seems confusing to me.  Is your goal to say that, if the
> > old PTE was clean and writable and the new PTE is write-protected,
> > then no flush is needed?
>
> Yes.
>
> > If so, I would believe you're right, but I'm
> > not convinced you've actually implemented this.  Also, there may be
> > other things going on that need flushing, e.g. a change of the address
> > or an accessed bit or NX change.
>
> The first part, (diff & ~(_PAGE_RW | _PAGE_DIRTY | ignore_mask)), is supposed
> to capture changes of address, NX-bit, etc.
>
> The second part is indeed wrong. It should have been:
>  (!pte_dirty(oldpte) || !pte_write(oldpte))
>
> >
> > Also, CET makes this extra bizarre.
>
> I saw something about not-writable-and-dirty PTEs being treated differently.
> I need to have a look, but I am not sure it affects anything.
>

It affects everyone's sanity. I don't yet have an opinion as to
whether it affects correctness :)


* Re: [RFC 00/20] TLB batching consolidation and enhancements
  2021-01-31  0:11 [RFC 00/20] TLB batching consolidation and enhancements Nadav Amit
                   ` (20 preceding siblings ...)
  2021-01-31  0:39 ` [RFC 00/20] TLB batching consolidation and enhancements Andy Lutomirski
@ 2021-01-31  3:30 ` Nicholas Piggin
  2021-01-31  7:57   ` Nadav Amit
  21 siblings, 1 reply; 67+ messages in thread
From: Nicholas Piggin @ 2021-01-31  3:30 UTC (permalink / raw)
  To: linux-kernel, linux-mm, Nadav Amit
  Cc: Andrea Arcangeli, Andrew Morton, Dave Hansen, linux-csky,
	linuxppc-dev, linux-s390, Andy Lutomirski, Mel Gorman,
	Nadav Amit, Peter Zijlstra, Thomas Gleixner, Will Deacon, x86,
	Yu Zhao

Excerpts from Nadav Amit's message of January 31, 2021 10:11 am:
> From: Nadav Amit <namit@vmware.com>
> 
> There are currently (at least?) 5 different TLB batching schemes in the
> kernel:
> 
> 1. Using mmu_gather (e.g., zap_page_range()).
> 
> 2. Using {inc|dec}_tlb_flush_pending() to inform other threads on the
>    ongoing deferred TLB flush and flushing the entire range eventually
>    (e.g., change_protection_range()).
> 
> 3. arch_{enter|leave}_lazy_mmu_mode() for sparc and powerpc (and Xen?).
> 
> 4. Batching per-table flushes (move_ptes()).
> 
> 5. By setting a flag on that a deferred TLB flush operation takes place,
>    flushing when (try_to_unmap_one() on x86).
> 
> It seems that (1)-(4) can be consolidated. In addition, it seems that
> (5) is racy. It also seems there can be many redundant TLB flushes, and
> potentially TLB-shootdown storms, for instance during batched
> reclamation (using try_to_unmap_one()) if at the same time mmu_gather
> defers TLB flushes.
> 
> More aggressive TLB batching may be possible, but this patch-set does
> not add such batching. The proposed changes would enable such batching
> in a later time.
> 
> Admittedly, I do not understand how things are not broken today, which
> frightens me to make further batching before getting things in order.
> For instance, why is ok for zap_pte_range() to batch dirty-PTE flushes
> for each page-table (but not in greater granularity). Can't
> ClearPageDirty() be called before the flush, causing writes after
> ClearPageDirty() and before the flush to be lost?

Because it's holding the page table lock which stops page_mkclean from 
cleaning the page. Or am I misunderstanding the question?

I'll go through the patches a bit more closely when they all come 
through. Sparc and powerpc of course need the arch lazy mode to get 
per-page/pte information for operations that are not freeing pages, 
which is what mmu gather is designed for.

I wouldn't mind using a similar API so it's less of a black box when 
reading generic code, but it might not quite fit the mmu gather API
exactly (most of these paths don't want a full mmu_gather on stack).

> 
> This patch-set therefore performs the following changes:
> 
> 1. Change mprotect, task_mmu and mapping_dirty_helpers to use mmu_gather
>    instead of {inc|dec}_tlb_flush_pending().
> 
> 2. Avoid TLB flushes if PTE permission is not demoted.
> 
> 3. Cleans up mmu_gather to be less arch-dependent.
> 
> 4. Uses mm's generations to track in finer granularity, either per-VMA
>    or per page-table, whether a pending mmu_gather operation is
>    outstanding. This should allow to avoid some TLB flushes when KSM or
>    memory reclamation takes place while another operation such as
>    munmap() or mprotect() is running.
> 
> 5. Changes try_to_unmap_one() flushing scheme, as the current seems
>    broken to track in a bitmap which CPUs have outstanding TLB flushes
>    instead of having a flag.

Putting fixes first, and cleanups and independent patches (like #2) next
would help with getting stuff merged and backported.

Thanks,
Nick



* Re: [RFC 00/20] TLB batching consolidation and enhancements
  2021-01-31  3:30 ` Nicholas Piggin
@ 2021-01-31  7:57   ` Nadav Amit
  2021-01-31  8:14     ` Nadav Amit
  2021-02-01 12:44     ` Peter Zijlstra
  0 siblings, 2 replies; 67+ messages in thread
From: Nadav Amit @ 2021-01-31  7:57 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: LKML, Linux-MM, Andrea Arcangeli, Andrew Morton, Dave Hansen,
	linux-csky, linuxppc-dev, linux-s390, Andy Lutomirski,
	Mel Gorman, Peter Zijlstra, Thomas Gleixner, Will Deacon, X86 ML,
	Yu Zhao

> On Jan 30, 2021, at 7:30 PM, Nicholas Piggin <npiggin@gmail.com> wrote:
> 
> Excerpts from Nadav Amit's message of January 31, 2021 10:11 am:
>> From: Nadav Amit <namit@vmware.com>
>> 
>> There are currently (at least?) 5 different TLB batching schemes in the
>> kernel:
>> 
>> 1. Using mmu_gather (e.g., zap_page_range()).
>> 
>> 2. Using {inc|dec}_tlb_flush_pending() to inform other threads on the
>>   ongoing deferred TLB flush and flushing the entire range eventually
>>   (e.g., change_protection_range()).
>> 
>> 3. arch_{enter|leave}_lazy_mmu_mode() for sparc and powerpc (and Xen?).
>> 
>> 4. Batching per-table flushes (move_ptes()).
>> 
>> 5. By setting a flag on that a deferred TLB flush operation takes place,
>>   flushing when (try_to_unmap_one() on x86).
>> 
>> It seems that (1)-(4) can be consolidated. In addition, it seems that
>> (5) is racy. It also seems there can be many redundant TLB flushes, and
>> potentially TLB-shootdown storms, for instance during batched
>> reclamation (using try_to_unmap_one()) if at the same time mmu_gather
>> defers TLB flushes.
>> 
>> More aggressive TLB batching may be possible, but this patch-set does
>> not add such batching. The proposed changes would enable such batching
>> in a later time.
>> 
>> Admittedly, I do not understand how things are not broken today, which
>> frightens me to make further batching before getting things in order.
>> For instance, why is ok for zap_pte_range() to batch dirty-PTE flushes
>> for each page-table (but not in greater granularity). Can't
>> ClearPageDirty() be called before the flush, causing writes after
>> ClearPageDirty() and before the flush to be lost?
> 
> Because it's holding the page table lock which stops page_mkclean from 
> cleaning the page. Or am I misunderstanding the question?

Thanks. I understood this part. Looking again at the code, I now understand
my confusion: I forgot that the reverse mapping is removed after the PTE is
zapped.

Makes me wonder whether it is ok to defer the TLB flush to tlb_finish_mmu(),
by performing set_page_dirty() for the batched pages when needed in
tlb_finish_mmu() [ i.e., by marking for each batched page whether
set_page_dirty() should be issued for that page while collecting them ].

> I'll go through the patches a bit more closely when they all come 
> through. Sparc and powerpc of course need the arch lazy mode to get 
> per-page/pte information for operations that are not freeing pages, 
> which is what mmu gather is designed for.

IIUC you mean any PTE change requires a TLB flush. Even setting up a new PTE
where no previous PTE was set, right?

> I wouldn't mind using a similar API so it's less of a black box when 
> reading generic code, but it might not quite fit the mmu gather API
> exactly (most of these paths don't want a full mmu_gather on stack).

I see your point. It may be possible to create two mmu_gather structs: a
small one that only holds the flush information and another that also holds
the pages. 
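
As a rough illustration of that split (a sketch only; every name below is
hypothetical and none of it is taken from the patch-set or from the kernel's
struct mmu_gather), the small struct could carry just the range to flush, and
the larger one could embed it and add the batched pages:

```c
#include <stdbool.h>
#include <stddef.h>

/* Small part: only what is needed to decide on and issue a flush. */
struct flush_range {
	unsigned long start;	/* lowest address that needs flushing */
	unsigned long end;	/* one past the highest such address */
};

/* Full part: flush information plus pages whose freeing is deferred. */
#define GATHER_MAX_PAGES 8

struct page_gather {
	struct flush_range range;	/* embedded, so helpers are shared */
	void *pages[GATHER_MAX_PAGES];
	size_t nr_pages;
};

/* Start with an empty range (start > end means nothing to flush). */
static inline void flush_range_init(struct flush_range *r)
{
	r->start = ~0UL;
	r->end = 0;
}

/* Grow the range so that it covers [addr, addr + size). */
static inline void flush_range_add(struct flush_range *r,
				   unsigned long addr, unsigned long size)
{
	if (addr < r->start)
		r->start = addr;
	if (addr + size > r->end)
		r->end = addr + size;
}

/* True once at least one address has been recorded. */
static inline bool flush_range_pending(const struct flush_range *r)
{
	return r->start < r->end;
}
```

Paths that only change PTEs would keep a struct flush_range on the stack,
while paths that also free pages would use struct page_gather and pass
&gather.range to the same helpers.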

>> This patch-set therefore performs the following changes:
>> 
>> 1. Change mprotect, task_mmu and mapping_dirty_helpers to use mmu_gather
>>   instead of {inc|dec}_tlb_flush_pending().
>> 
>> 2. Avoid TLB flushes if PTE permission is not demoted.
>> 
>> 3. Cleans up mmu_gather to be less arch-dependant.
>> 
>> 4. Uses mm's generations to track in finer granularity, either per-VMA
>>   or per page-table, whether a pending mmu_gather operation is
>>   outstanding. This should allow to avoid some TLB flushes when KSM or
>>   memory reclamation takes place while another operation such as
>>   munmap() or mprotect() is running.
>> 
>> 5. Changes try_to_unmap_one() flushing scheme, as the current seems
>>   broken to track in a bitmap which CPUs have outstanding TLB flushes
>>   instead of having a flag.
> 
> Putting fixes first, and cleanups and independent patches (like #2) next
> would help with getting stuff merged and backported.

I tried to do it mostly this way. There are some theoretical races which
I did not manage (or try hard enough) to trigger, so I did not include them
in the “fixes” section. I will restructure the patch-set according to the
feedback.

Thanks,
Nadav


* Re: [RFC 00/20] TLB batching consolidation and enhancements
  2021-01-31  7:57   ` Nadav Amit
@ 2021-01-31  8:14     ` Nadav Amit
  2021-02-01 12:44     ` Peter Zijlstra
  1 sibling, 0 replies; 67+ messages in thread
From: Nadav Amit @ 2021-01-31  8:14 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Nicholas Piggin, LKML, Linux-MM, Andrea Arcangeli, Andrew Morton,
	Dave Hansen, linux-csky, linuxppc-dev, linux-s390,
	Andy Lutomirski, Mel Gorman, Peter Zijlstra, Thomas Gleixner,
	Will Deacon, X86 ML, Yu Zhao

> On Jan 30, 2021, at 11:57 PM, Nadav Amit <namit@vmware.com> wrote:
> 
>> On Jan 30, 2021, at 7:30 PM, Nicholas Piggin <npiggin@gmail.com> wrote:
>> 
>> Excerpts from Nadav Amit's message of January 31, 2021 10:11 am:
>>> From: Nadav Amit <namit@vmware.com>
>>> 
>>> There are currently (at least?) 5 different TLB batching schemes in the
>>> kernel:
>>> 
>>> 1. Using mmu_gather (e.g., zap_page_range()).
>>> 
>>> 2. Using {inc|dec}_tlb_flush_pending() to inform other threads on the
>>>  ongoing deferred TLB flush and flushing the entire range eventually
>>>  (e.g., change_protection_range()).
>>> 
>>> 3. arch_{enter|leave}_lazy_mmu_mode() for sparc and powerpc (and Xen?).
>>> 
>>> 4. Batching per-table flushes (move_ptes()).
>>> 
>>> 5. By setting a flag on that a deferred TLB flush operation takes place,
>>>  flushing when (try_to_unmap_one() on x86).
>>> 
>>> It seems that (1)-(4) can be consolidated. In addition, it seems that
>>> (5) is racy. It also seems there can be many redundant TLB flushes, and
>>> potentially TLB-shootdown storms, for instance during batched
>>> reclamation (using try_to_unmap_one()) if at the same time mmu_gather
>>> defers TLB flushes.
>>> 
>>> More aggressive TLB batching may be possible, but this patch-set does
>>> not add such batching. The proposed changes would enable such batching
>>> in a later time.
>>> 
>>> Admittedly, I do not understand how things are not broken today, which
>>> frightens me to make further batching before getting things in order.
>>> For instance, why is ok for zap_pte_range() to batch dirty-PTE flushes
>>> for each page-table (but not in greater granularity). Can't
>>> ClearPageDirty() be called before the flush, causing writes after
>>> ClearPageDirty() and before the flush to be lost?
>> 
>> Because it's holding the page table lock which stops page_mkclean from 
>> cleaning the page. Or am I misunderstanding the question?
> 
> Thanks. I understood this part. Looking again at the code, I now understand
> my confusion: I forgot that the reverse mapping is removed after the PTE is
> zapped.
> 
> Makes me wonder whether it is ok to defer the TLB flush to tlb_finish_mmu(),
> by performing set_page_dirty() for the batched pages when needed in
> tlb_finish_mmu() [ i.e., by marking for each batched page whether
> set_page_dirty() should be issued for that page while collecting them ].

Correcting myself (I hope): no, we cannot do so, since the buffers might be
removed from the page at that point.




* Re: [RFC 13/20] mm/tlb: introduce tlb_start_ptes() and tlb_end_ptes()
  2021-01-31  0:11 ` [RFC 13/20] mm/tlb: introduce tlb_start_ptes() and tlb_end_ptes() Nadav Amit
@ 2021-01-31  9:57   ` Damian Tometzki
  2021-01-31 10:07   ` Damian Tometzki
  2021-02-01 13:19   ` Peter Zijlstra
  2 siblings, 0 replies; 67+ messages in thread
From: Damian Tometzki @ 2021-01-31  9:57 UTC (permalink / raw)
  To: Nadav Amit
  Cc: linux-mm, linux-kernel, Nadav Amit, Andrea Arcangeli,
	Andrew Morton, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Gleixner, Will Deacon, Yu Zhao, Nick Piggin, x86

On Sat, 30. Jan 16:11, Nadav Amit wrote:
> From: Nadav Amit <namit@vmware.com>
> 
> Introduce tlb_start_ptes() and tlb_end_ptes() which would be called
> before and after PTEs are updated and TLB flushes are deferred. This
> will be later be used for fine granualrity deferred TLB flushing
> detection.
> 
> In the meanwhile, move flush_tlb_batched_pending() into
> tlb_start_ptes(). It was not called from mapping_dirty_helpers by
> wp_pte() and clean_record_pte(), which might be a bug.
> 
> No additional functional change is intended.
> 
> Signed-off-by: Nadav Amit <namit@vmware.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Will Deacon <will@kernel.org>
> Cc: Yu Zhao <yuzhao@google.com>
> Cc: Nick Piggin <npiggin@gmail.com>
> Cc: x86@kernel.org
> ---
>  fs/proc/task_mmu.c         |  2 ++
>  include/asm-generic/tlb.h  | 18 ++++++++++++++++++
>  mm/madvise.c               |  6 ++++--
>  mm/mapping_dirty_helpers.c | 15 +++++++++++++--
>  mm/memory.c                |  2 ++
>  mm/mprotect.c              |  3 ++-
>  6 files changed, 41 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 4cd048ffa0f6..d0cce961fa5c 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -1168,6 +1168,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
>  		return 0;
>  
>  	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> +	tlb_start_ptes(&cp->tlb);
>  	for (; addr != end; pte++, addr += PAGE_SIZE) {
>  		ptent = *pte;
>  
> @@ -1190,6 +1191,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
>  		tlb_flush_pte_range(&cp->tlb, addr, PAGE_SIZE);
>  		ClearPageReferenced(page);
>  	}
> +	tlb_end_ptes(&cp->tlb);
>  	pte_unmap_unlock(pte - 1, ptl);
>  	cond_resched();
>  	return 0;
> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> index 041be2ef4426..10690763090a 100644
> --- a/include/asm-generic/tlb.h
> +++ b/include/asm-generic/tlb.h
> @@ -58,6 +58,11 @@
>   *    Defaults to flushing at tlb_end_vma() to reset the range; helps when
>   *    there's large holes between the VMAs.
>   *
> + *  - tlb_start_ptes() / tlb_end_ptes; makr the start / end of PTEs change.

Hello Nadav,

small nit: s/makr/mark/

Damian

> + *
> + *    Does internal accounting to allow fine(r) granularity checks for
> + *    pte_accessible() on certain configuration.
> + *
>   *  - tlb_remove_table()
>   *
>   *    tlb_remove_table() is the basic primitive to free page-table directories
> @@ -373,6 +378,10 @@ static inline void tlb_flush(struct mmu_gather *tlb)
>  		flush_tlb_range(tlb->vma, tlb->start, tlb->end);
>  	}
>  }
> +#endif
> +
> +#if __is_defined(tlb_flush) ||						\
> +	IS_ENABLED(CONFIG_ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING)
>  
>  static inline void
>  tlb_update_vma(struct mmu_gather *tlb, struct vm_area_struct *vma)
> @@ -523,6 +532,15 @@ static inline void mark_mm_tlb_gen_done(struct mm_struct *mm, u64 gen)
>  
>  #endif /* CONFIG_ARCH_HAS_TLB_GENERATIONS */
>  
> +#define tlb_start_ptes(tlb)						\
> +	do {								\
> +		struct mmu_gather *_tlb = (tlb);			\
> +									\
> +		flush_tlb_batched_pending(_tlb->mm);			\
> +	} while (0)
> +
> +static inline void tlb_end_ptes(struct mmu_gather *tlb) { }
> +
>  /*
>   * tlb_flush_{pte|pmd|pud|p4d}_range() adjust the tlb->start and tlb->end,
>   * and set corresponding cleared_*.
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 0938fd3ad228..932c1c2eb9a3 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -392,7 +392,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>  #endif
>  	tlb_change_page_size(tlb, PAGE_SIZE);
>  	orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> -	flush_tlb_batched_pending(mm);
> +	tlb_start_ptes(tlb);
>  	arch_enter_lazy_mmu_mode();
>  	for (; addr < end; pte++, addr += PAGE_SIZE) {
>  		ptent = *pte;
> @@ -468,6 +468,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>  	}
>  
>  	arch_leave_lazy_mmu_mode();
> +	tlb_end_ptes(tlb);
>  	pte_unmap_unlock(orig_pte, ptl);
>  	if (pageout)
>  		reclaim_pages(&page_list);
> @@ -588,7 +589,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>  
>  	tlb_change_page_size(tlb, PAGE_SIZE);
>  	orig_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
> -	flush_tlb_batched_pending(mm);
> +	tlb_start_ptes(tlb);
>  	arch_enter_lazy_mmu_mode();
>  	for (; addr != end; pte++, addr += PAGE_SIZE) {
>  		ptent = *pte;
> @@ -692,6 +693,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>  		add_mm_counter(mm, MM_SWAPENTS, nr_swap);
>  	}
>  	arch_leave_lazy_mmu_mode();
> +	tlb_end_ptes(tlb);
>  	pte_unmap_unlock(orig_pte, ptl);
>  	cond_resched();
>  next:
> diff --git a/mm/mapping_dirty_helpers.c b/mm/mapping_dirty_helpers.c
> index 2ce6cf431026..063419ade304 100644
> --- a/mm/mapping_dirty_helpers.c
> +++ b/mm/mapping_dirty_helpers.c
> @@ -6,6 +6,8 @@
>  #include <asm/cacheflush.h>
>  #include <asm/tlb.h>
>  
> +#include "internal.h"
> +
>  /**
>   * struct wp_walk - Private struct for pagetable walk callbacks
>   * @range: Range for mmu notifiers
> @@ -36,7 +38,10 @@ static int wp_pte(pte_t *pte, unsigned long addr, unsigned long end,
>  	pte_t ptent = *pte;
>  
>  	if (pte_write(ptent)) {
> -		pte_t old_pte = ptep_modify_prot_start(walk->vma, addr, pte);
> +		pte_t old_pte;
> +
> +		tlb_start_ptes(&wpwalk->tlb);
> +		old_pte = ptep_modify_prot_start(walk->vma, addr, pte);
>  
>  		ptent = pte_wrprotect(old_pte);
>  		ptep_modify_prot_commit(walk->vma, addr, pte, old_pte, ptent);
> @@ -44,6 +49,7 @@ static int wp_pte(pte_t *pte, unsigned long addr, unsigned long end,
>  
>  		if (pte_may_need_flush(old_pte, ptent))
>  			tlb_flush_pte_range(&wpwalk->tlb, addr, PAGE_SIZE);
> +		tlb_end_ptes(&wpwalk->tlb);
>  	}
>  
>  	return 0;
> @@ -94,13 +100,18 @@ static int clean_record_pte(pte_t *pte, unsigned long addr,
>  	if (pte_dirty(ptent)) {
>  		pgoff_t pgoff = ((addr - walk->vma->vm_start) >> PAGE_SHIFT) +
>  			walk->vma->vm_pgoff - cwalk->bitmap_pgoff;
> -		pte_t old_pte = ptep_modify_prot_start(walk->vma, addr, pte);
> +		pte_t old_pte;
> +
> +		tlb_start_ptes(&wpwalk->tlb);
> +
> +		old_pte = ptep_modify_prot_start(walk->vma, addr, pte);
>  
>  		ptent = pte_mkclean(old_pte);
>  		ptep_modify_prot_commit(walk->vma, addr, pte, old_pte, ptent);
>  
>  		wpwalk->total++;
>  		tlb_flush_pte_range(&wpwalk->tlb, addr, PAGE_SIZE);
> +		tlb_end_ptes(&wpwalk->tlb);
>  
>  		__set_bit(pgoff, cwalk->bitmap);
>  		cwalk->start = min(cwalk->start, pgoff);
> diff --git a/mm/memory.c b/mm/memory.c
> index 9e8576a83147..929a93c50d9a 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1221,6 +1221,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>  	init_rss_vec(rss);
>  	start_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
>  	pte = start_pte;
> +	tlb_start_ptes(tlb);
>  	flush_tlb_batched_pending(mm);
>  	arch_enter_lazy_mmu_mode();
>  	do {
> @@ -1314,6 +1315,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>  	add_mm_rss_vec(mm, rss);
>  	arch_leave_lazy_mmu_mode();
>  
> +	tlb_end_ptes(tlb);
>  	/* Do the actual TLB flush before dropping ptl */
>  	if (force_flush)
>  		tlb_flush_mmu_tlbonly(tlb);
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index b7473d2c9a1f..1258bbe42ee1 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -70,7 +70,7 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
>  	    atomic_read(&vma->vm_mm->mm_users) == 1)
>  		target_node = numa_node_id();
>  
> -	flush_tlb_batched_pending(vma->vm_mm);
> +	tlb_start_ptes(tlb);
>  	arch_enter_lazy_mmu_mode();
>  	do {
>  		oldpte = *pte;
> @@ -182,6 +182,7 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
>  		}
>  	} while (pte++, addr += PAGE_SIZE, addr != end);
>  	arch_leave_lazy_mmu_mode();
> +	tlb_end_ptes(tlb);
>  	pte_unmap_unlock(pte - 1, ptl);
>  
>  	return pages;
> -- 
> 2.25.1
> 
> 



* Re: [RFC 13/20] mm/tlb: introduce tlb_start_ptes() and tlb_end_ptes()
  2021-01-31  0:11 ` [RFC 13/20] mm/tlb: introduce tlb_start_ptes() and tlb_end_ptes() Nadav Amit
  2021-01-31  9:57   ` Damian Tometzki
@ 2021-01-31 10:07   ` Damian Tometzki
  2021-02-01  7:29     ` Nadav Amit
  2021-02-01 13:19   ` Peter Zijlstra
  2 siblings, 1 reply; 67+ messages in thread
From: Damian Tometzki @ 2021-01-31 10:07 UTC (permalink / raw)
  To: Nadav Amit
  Cc: linux-mm, linux-kernel, Nadav Amit, Andrea Arcangeli,
	Andrew Morton, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Gleixner, Will Deacon, Yu Zhao, Nick Piggin, x86

On Sat, 30. Jan 16:11, Nadav Amit wrote:
> From: Nadav Amit <namit@vmware.com>
> 
> Introduce tlb_start_ptes() and tlb_end_ptes() which would be called
> before and after PTEs are updated and TLB flushes are deferred. This
> will be later be used for fine granualrity deferred TLB flushing
> detection.
> 
> In the meanwhile, move flush_tlb_batched_pending() into
> tlb_start_ptes(). It was not called from mapping_dirty_helpers by
> wp_pte() and clean_record_pte(), which might be a bug.
> 
> No additional functional change is intended.
> 
> Signed-off-by: Nadav Amit <namit@vmware.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Will Deacon <will@kernel.org>
> Cc: Yu Zhao <yuzhao@google.com>
> Cc: Nick Piggin <npiggin@gmail.com>
> Cc: x86@kernel.org
> ---
>  fs/proc/task_mmu.c         |  2 ++
>  include/asm-generic/tlb.h  | 18 ++++++++++++++++++
>  mm/madvise.c               |  6 ++++--
>  mm/mapping_dirty_helpers.c | 15 +++++++++++++--
>  mm/memory.c                |  2 ++
>  mm/mprotect.c              |  3 ++-
>  6 files changed, 41 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 4cd048ffa0f6..d0cce961fa5c 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -1168,6 +1168,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
>  		return 0;
>  
>  	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> +	tlb_start_ptes(&cp->tlb);
>  	for (; addr != end; pte++, addr += PAGE_SIZE) {
>  		ptent = *pte;
>  
> @@ -1190,6 +1191,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
>  		tlb_flush_pte_range(&cp->tlb, addr, PAGE_SIZE);
>  		ClearPageReferenced(page);
>  	}
> +	tlb_end_ptes(&cp->tlb);
>  	pte_unmap_unlock(pte - 1, ptl);
>  	cond_resched();
>  	return 0;
> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> index 041be2ef4426..10690763090a 100644
> --- a/include/asm-generic/tlb.h
> +++ b/include/asm-generic/tlb.h
> @@ -58,6 +58,11 @@
>   *    Defaults to flushing at tlb_end_vma() to reset the range; helps when
>   *    there's large holes between the VMAs.
>   *
> + *  - tlb_start_ptes() / tlb_end_ptes; makr the start / end of PTEs change.

Hello Nadav,

small nit: s/makr/mark/

Damian

> + *
> + *    Does internal accounting to allow fine(r) granularity checks for
> + *    pte_accessible() on certain configuration.
> + *
>   *  - tlb_remove_table()
>   *
>   *    tlb_remove_table() is the basic primitive to free page-table directories
> @@ -373,6 +378,10 @@ static inline void tlb_flush(struct mmu_gather *tlb)
>  		flush_tlb_range(tlb->vma, tlb->start, tlb->end);
>  	}
>  }
> +#endif
> +
> +#if __is_defined(tlb_flush) ||						\
> +	IS_ENABLED(CONFIG_ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING)
>  
>  static inline void
>  tlb_update_vma(struct mmu_gather *tlb, struct vm_area_struct *vma)
> @@ -523,6 +532,15 @@ static inline void mark_mm_tlb_gen_done(struct mm_struct *mm, u64 gen)
>  
>  #endif /* CONFIG_ARCH_HAS_TLB_GENERATIONS */
>  
> +#define tlb_start_ptes(tlb)						\
> +	do {								\
> +		struct mmu_gather *_tlb = (tlb);			\
> +									\
> +		flush_tlb_batched_pending(_tlb->mm);			\
> +	} while (0)
> +
> +static inline void tlb_end_ptes(struct mmu_gather *tlb) { }
> +
>  /*
>   * tlb_flush_{pte|pmd|pud|p4d}_range() adjust the tlb->start and tlb->end,
>   * and set corresponding cleared_*.
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 0938fd3ad228..932c1c2eb9a3 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -392,7 +392,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>  #endif
>  	tlb_change_page_size(tlb, PAGE_SIZE);
>  	orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> -	flush_tlb_batched_pending(mm);
> +	tlb_start_ptes(tlb);
>  	arch_enter_lazy_mmu_mode();
>  	for (; addr < end; pte++, addr += PAGE_SIZE) {
>  		ptent = *pte;
> @@ -468,6 +468,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>  	}
>  
>  	arch_leave_lazy_mmu_mode();
> +	tlb_end_ptes(tlb);
>  	pte_unmap_unlock(orig_pte, ptl);
>  	if (pageout)
>  		reclaim_pages(&page_list);
> @@ -588,7 +589,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>  
>  	tlb_change_page_size(tlb, PAGE_SIZE);
>  	orig_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
> -	flush_tlb_batched_pending(mm);
> +	tlb_start_ptes(tlb);
>  	arch_enter_lazy_mmu_mode();
>  	for (; addr != end; pte++, addr += PAGE_SIZE) {
>  		ptent = *pte;
> @@ -692,6 +693,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>  		add_mm_counter(mm, MM_SWAPENTS, nr_swap);
>  	}
>  	arch_leave_lazy_mmu_mode();
> +	tlb_end_ptes(tlb);
>  	pte_unmap_unlock(orig_pte, ptl);
>  	cond_resched();
>  next:
> diff --git a/mm/mapping_dirty_helpers.c b/mm/mapping_dirty_helpers.c
> index 2ce6cf431026..063419ade304 100644
> --- a/mm/mapping_dirty_helpers.c
> +++ b/mm/mapping_dirty_helpers.c
> @@ -6,6 +6,8 @@
>  #include <asm/cacheflush.h>
>  #include <asm/tlb.h>
>  
> +#include "internal.h"
> +
>  /**
>   * struct wp_walk - Private struct for pagetable walk callbacks
>   * @range: Range for mmu notifiers
> @@ -36,7 +38,10 @@ static int wp_pte(pte_t *pte, unsigned long addr, unsigned long end,
>  	pte_t ptent = *pte;
>  
>  	if (pte_write(ptent)) {
> -		pte_t old_pte = ptep_modify_prot_start(walk->vma, addr, pte);
> +		pte_t old_pte;
> +
> +		tlb_start_ptes(&wpwalk->tlb);
> +		old_pte = ptep_modify_prot_start(walk->vma, addr, pte);
>  
>  		ptent = pte_wrprotect(old_pte);
>  		ptep_modify_prot_commit(walk->vma, addr, pte, old_pte, ptent);
> @@ -44,6 +49,7 @@ static int wp_pte(pte_t *pte, unsigned long addr, unsigned long end,
>  
>  		if (pte_may_need_flush(old_pte, ptent))
>  			tlb_flush_pte_range(&wpwalk->tlb, addr, PAGE_SIZE);
> +		tlb_end_ptes(&wpwalk->tlb);
>  	}
>  
>  	return 0;
> @@ -94,13 +100,18 @@ static int clean_record_pte(pte_t *pte, unsigned long addr,
>  	if (pte_dirty(ptent)) {
>  		pgoff_t pgoff = ((addr - walk->vma->vm_start) >> PAGE_SHIFT) +
>  			walk->vma->vm_pgoff - cwalk->bitmap_pgoff;
> -		pte_t old_pte = ptep_modify_prot_start(walk->vma, addr, pte);
> +		pte_t old_pte;
> +
> +		tlb_start_ptes(&wpwalk->tlb);
> +
> +		old_pte = ptep_modify_prot_start(walk->vma, addr, pte);
>  
>  		ptent = pte_mkclean(old_pte);
>  		ptep_modify_prot_commit(walk->vma, addr, pte, old_pte, ptent);
>  
>  		wpwalk->total++;
>  		tlb_flush_pte_range(&wpwalk->tlb, addr, PAGE_SIZE);
> +		tlb_end_ptes(&wpwalk->tlb);
>  
>  		__set_bit(pgoff, cwalk->bitmap);
>  		cwalk->start = min(cwalk->start, pgoff);
> diff --git a/mm/memory.c b/mm/memory.c
> index 9e8576a83147..929a93c50d9a 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1221,6 +1221,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>  	init_rss_vec(rss);
>  	start_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
>  	pte = start_pte;
> +	tlb_start_ptes(tlb);
>  	flush_tlb_batched_pending(mm);
>  	arch_enter_lazy_mmu_mode();
>  	do {
> @@ -1314,6 +1315,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>  	add_mm_rss_vec(mm, rss);
>  	arch_leave_lazy_mmu_mode();
>  
> +	tlb_end_ptes(tlb);
>  	/* Do the actual TLB flush before dropping ptl */
>  	if (force_flush)
>  		tlb_flush_mmu_tlbonly(tlb);
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index b7473d2c9a1f..1258bbe42ee1 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -70,7 +70,7 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
>  	    atomic_read(&vma->vm_mm->mm_users) == 1)
>  		target_node = numa_node_id();
>  
> -	flush_tlb_batched_pending(vma->vm_mm);
> +	tlb_start_ptes(tlb);
>  	arch_enter_lazy_mmu_mode();
>  	do {
>  		oldpte = *pte;
> @@ -182,6 +182,7 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
>  		}
>  	} while (pte++, addr += PAGE_SIZE, addr != end);
>  	arch_leave_lazy_mmu_mode();
> +	tlb_end_ptes(tlb);
>  	pte_unmap_unlock(pte - 1, ptl);
>  
>  	return pages;
> -- 
> 2.25.1
> 
> 



* Re: [RFC 07/20] mm: move x86 tlb_gen to generic code
  2021-01-31  0:11 ` [RFC 07/20] mm: move x86 tlb_gen to generic code Nadav Amit
@ 2021-01-31 18:26   ` Andy Lutomirski
  0 siblings, 0 replies; 67+ messages in thread
From: Andy Lutomirski @ 2021-01-31 18:26 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Linux-MM, LKML, Nadav Amit, Andrea Arcangeli, Andrew Morton,
	Andy Lutomirski, Dave Hansen, Peter Zijlstra, Thomas Gleixner,
	Will Deacon, Yu Zhao, Nick Piggin, X86 ML

On Sat, Jan 30, 2021 at 4:16 PM Nadav Amit <nadav.amit@gmail.com> wrote:
>
> From: Nadav Amit <namit@vmware.com>
>
> x86 currently has a TLB-generation tracking logic that can be used by
> additional architectures (as long as they implement some additional
> logic).
>
> Extract the relevant pieces of code from x86 to general TLB code. This
> would be useful to allow to write the next "fine granularity deferred
> TLB flushes detection" patches without making them x86-specific.

Tentative ACK.

My biggest concern about this is that, once it's exposed to core code,
people might come up with clever-but-incorrect ways to abuse it.  Oh
well.

>  struct workqueue_struct *efi_rts_wq;
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 0974ad501a47..2035ac319c2b 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -560,6 +560,17 @@ struct mm_struct {
>
>  #ifdef CONFIG_IOMMU_SUPPORT
>                 u32 pasid;
> +#endif
> +#ifdef CONFIG_ARCH_HAS_TLB_GENERATIONS
> +               /*
> +                * Any code that needs to do any sort of TLB flushing for this
> +                * mm will first make its changes to the page tables, then
> +                * increment tlb_gen, then flush.  This lets the low-level
> +                * flushing code keep track of what needs flushing.
> +                *
> +                * This is not used on Xen PV.

That last comment should probably go away.



* Re: [RFC 08/20] mm: store completed TLB generation
  2021-01-31  0:11 ` [RFC 08/20] mm: store completed TLB generation Nadav Amit
@ 2021-01-31 20:32   ` Andy Lutomirski
  2021-02-01  7:28     ` Nadav Amit
  2021-02-01 11:52   ` Peter Zijlstra
  1 sibling, 1 reply; 67+ messages in thread
From: Andy Lutomirski @ 2021-01-31 20:32 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Linux-MM, LKML, Nadav Amit, Andrea Arcangeli, Andrew Morton,
	Andy Lutomirski, Dave Hansen, Peter Zijlstra, Thomas Gleixner,
	Will Deacon, Yu Zhao, Nick Piggin, X86 ML

On Sat, Jan 30, 2021 at 4:16 PM Nadav Amit <nadav.amit@gmail.com> wrote:
>
> From: Nadav Amit <namit@vmware.com>
>
> To detect deferred TLB flushes in fine granularity, we need to keep
> track on the completed TLB flush generation for each mm.
>
> Add logic to track for each mm the tlb_gen_completed, which tracks the
> completed TLB generation. It is the arch responsibility to call
> mark_mm_tlb_gen_done() whenever a TLB flush is completed.
>
> Start the generation numbers from 1 instead of 0. This would allow later
> to detect whether flushes of a certain generation were completed.

Can you elaborate on how this helps?

I think you should document that tlb_gen_completed only means that no
outdated TLB entries will be observably used.  In the x86
implementation it's possible for older TLB entries to still exist,
unused, in TLBs of cpus running other mms.

How does this work with arch_tlbbatch_flush()?

>
> Signed-off-by: Nadav Amit <namit@vmware.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Will Deacon <will@kernel.org>
> Cc: Yu Zhao <yuzhao@google.com>
> Cc: Nick Piggin <npiggin@gmail.com>
> Cc: x86@kernel.org
> ---
>  arch/x86/mm/tlb.c         | 10 ++++++++++
>  include/asm-generic/tlb.h | 33 +++++++++++++++++++++++++++++++++
>  include/linux/mm_types.h  | 15 ++++++++++++++-
>  3 files changed, 57 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 7ab21430be41..d17b5575531e 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -14,6 +14,7 @@
>  #include <asm/nospec-branch.h>
>  #include <asm/cache.h>
>  #include <asm/apic.h>
> +#include <asm/tlb.h>
>
>  #include "mm_internal.h"
>
> @@ -915,6 +916,9 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
>         if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
>                 flush_tlb_others(mm_cpumask(mm), info);
>
> +       /* Update the completed generation */
> +       mark_mm_tlb_gen_done(mm, new_tlb_gen);
> +
>         put_flush_tlb_info();
>         put_cpu();
>  }
> @@ -1147,6 +1151,12 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
>
>         cpumask_clear(&batch->cpumask);
>
> +       /*
> +        * We cannot call mark_mm_tlb_gen_done() since we do not know which
> +        * mm's should be flushed. This may lead to some unwarranted TLB
> +        * flushes, but not to correction problems.
> +        */
> +
>         put_cpu();
>  }
>
> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> index 517c89398c83..427bfcc6cdec 100644
> --- a/include/asm-generic/tlb.h
> +++ b/include/asm-generic/tlb.h
> @@ -513,6 +513,39 @@ static inline void tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vm
>  }
>  #endif
>
> +#ifdef CONFIG_ARCH_HAS_TLB_GENERATIONS
> +
> +/*
> + * Helper function to update a generation to have a new value, as long as new
> + * value is greater or equal to gen.
> + */

I read this a couple of times, and I don't understand it.  How about:

Helper function to atomically set *gen = max(*gen, new_gen)
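
To make the intended semantics concrete, here is a userspace C11 sketch of
an atomic max (not kernel code: it uses <stdatomic.h> rather than
atomic64_t, and gen_update_max() is a hypothetical name):

```c
#include <stdatomic.h>
#include <stdint.h>

/* Atomically perform *gen = max(*gen, new_gen). */
static inline void gen_update_max(_Atomic uint64_t *gen, uint64_t new_gen)
{
	uint64_t cur = atomic_load(gen);

	/*
	 * When the compare-exchange fails it stores the value currently
	 * in *gen into cur, so the next iteration re-checks fresh data.
	 * The loop ends once *gen is already >= new_gen or the swap
	 * succeeds.
	 */
	while (cur < new_gen &&
	       !atomic_compare_exchange_weak(gen, &cur, new_gen))
		;
}
```

A smaller or equal new_gen leaves *gen untouched, which is why the
generation can only move forward.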

> +static inline void tlb_update_generation(atomic64_t *gen, u64 new_gen)
> +{
> +       u64 cur_gen = atomic64_read(gen);
> +
> +       while (cur_gen < new_gen) {
> +               u64 old_gen = atomic64_cmpxchg(gen, cur_gen, new_gen);
> +
> +               /* Check if we succeeded in the cmpxchg */
> +               if (likely(cur_gen == old_gen))
> +                       break;
> +
> +               cur_gen = old_gen;
> +       };
> +}
> +
> +
> +static inline void mark_mm_tlb_gen_done(struct mm_struct *mm, u64 gen)
> +{
> +       /*
> +        * Update the completed generation to the new generation if the new
> +        * generation is greater than the previous one.
> +        */
> +       tlb_update_generation(&mm->tlb_gen_completed, gen);
> +}
> +
> +#endif /* CONFIG_ARCH_HAS_TLB_GENERATIONS */
> +
>  /*
>   * tlb_flush_{pte|pmd|pud|p4d}_range() adjust the tlb->start and tlb->end,
>   * and set corresponding cleared_*.
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 2035ac319c2b..8a5eb4bfac59 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -571,6 +571,13 @@ struct mm_struct {
>                  * This is not used on Xen PV.
>                  */
>                 atomic64_t tlb_gen;
> +
> +               /*
> +                * TLB generation which is guarnateed to be flushed, including

guaranteed

> +                * all the PTE changes that were performed before tlb_gen was
> +                * incremented.
> +                */

I will defer judgment to future patches before I believe that this isn't racy :)

> +               atomic64_t tlb_gen_completed;
>  #endif
>         } __randomize_layout;
>
> @@ -690,7 +697,13 @@ static inline bool mm_tlb_flush_nested(struct mm_struct *mm)
>  #ifdef CONFIG_ARCH_HAS_TLB_GENERATIONS
>  static inline void init_mm_tlb_gen(struct mm_struct *mm)
>  {
> -       atomic64_set(&mm->tlb_gen, 0);
> +       /*
> +        * Start from generation of 1, so default generation 0 will be
> +        * considered as flushed and would not be regarded as an outstanding
> +        * deferred invalidation.
> +        */

Aha, this makes sense.

> +       atomic64_set(&mm->tlb_gen, 1);
> +       atomic64_set(&mm->tlb_gen_completed, 1);
>  }
>
>  static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
> --
> 2.25.1
>



* Re: [RFC 03/20] mm/mprotect: do not flush on permission promotion
       [not found]     ` <7a6de15a-a570-31f2-14d6-a8010296e694@citrix.com>
@ 2021-02-01  5:58       ` Nadav Amit
  2021-02-01 15:38         ` Andrew Cooper
  0 siblings, 1 reply; 67+ messages in thread
From: Nadav Amit @ 2021-02-01  5:58 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Andy Lutomirski, Linux-MM, LKML, Andrea Arcangeli, Andrew Morton,
	Dave Hansen, Peter Zijlstra, Thomas Gleixner, Will Deacon,
	Yu Zhao, Nick Piggin, X86 ML, Andy Lutomirski

> On Jan 31, 2021, at 4:10 AM, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> 
> On 31/01/2021 01:07, Andy Lutomirski wrote:
>> Adding Andrew Cooper, who has a distressingly extensive understanding
>> of the x86 PTE magic.
> 
> Pretty sure it is all learning things the hard way...
> 
>> On Sat, Jan 30, 2021 at 4:16 PM Nadav Amit <nadav.amit@gmail.com> wrote:
>>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>>> index 632d5a677d3f..b7473d2c9a1f 100644
>>> --- a/mm/mprotect.c
>>> +++ b/mm/mprotect.c
>>> @@ -139,7 +139,8 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
>>>                                ptent = pte_mkwrite(ptent);
>>>                        }
>>>                        ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
>>> -                       tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
>>> +                       if (pte_may_need_flush(oldpte, ptent))
>>> +                               tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
> 
> You're choosing to avoid the flush, based on A/D bits read ahead of the
> actual modification of the PTE.
> 
> In this example, another thread can write into the range (sets A and D),
> and get a suitable TLB entry which goes unflushed while the rest of the
> kernel thinks the memory is write-protected and clean.
> 
> The only safe way to do this is to use XCHG/etc to modify the PTE, and
> base flush calculations on the results.  Atomic operations are ordered
> with A/D updates from pagewalks on other CPUs, even on AMD where A
> updates are explicitly not ordered with regular memory reads, for
> performance reasons.

Thanks Andrew for the feedback, but I think the patch does it exactly in
this safe manner that you describe (at least on native x86, but I see a
similar path elsewhere as well):

oldpte = ptep_modify_prot_start()
-> __ptep_modify_prot_start()
-> ptep_get_and_clear
-> native_ptep_get_and_clear()
-> xchg()

Note that the xchg() will clear the PTE (i.e., making it non-present), and
no further updates of A/D are possible until ptep_modify_prot_commit() is
called.

On non-SMP setups this is not atomic (no xchg), but since we hold the lock,
we should be safe.

I guess you are right that pte_may_need_flush() deserves a comment to
clarify that oldpte must be obtained by an atomic operation to ensure no A/D
bits are lost (as you say).

Yet, I do not see a correctness problem. Am I missing something?
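
For illustration only, the ordering argument above can be sketched in
userspace C11. This is not the kernel code: the sk_* names are hypothetical,
the bit positions are merely chosen to resemble the x86 P/RW/A/D layout, and
sk_may_need_flush() is a deliberately simplified policy rather than the
pte_may_need_flush() from the patch. The point it shows is that the flush
decision is based on the value returned by the atomic exchange, so any A/D
update that raced with the modification is part of the snapshot.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative bit layout (assumption, resembling x86 P/RW/A/D). */
#define SK_PRESENT	(UINT64_C(1) << 0)
#define SK_RW		(UINT64_C(1) << 1)
#define SK_ACCESSED	(UINT64_C(1) << 5)
#define SK_DIRTY	(UINT64_C(1) << 6)

/* Atomically fetch the entry and leave it non-present (xchg with 0). */
static uint64_t sk_get_and_clear(_Atomic uint64_t *ptep)
{
	assert(ptep != NULL);
	return atomic_exchange(ptep, 0);
}

/* Simplified, hypothetical policy: flush only when rights are removed. */
static bool sk_may_need_flush(uint64_t oldpte, uint64_t newpte)
{
	if (!(oldpte & SK_PRESENT))
		return false;	/* nothing could have been cached */
	if (!(newpte & SK_PRESENT))
		return true;	/* mapping went away */
	/* A bit that was set in the old entry and is now clear is a demotion. */
	return (oldpte & ~newpte & (SK_RW | SK_DIRTY)) != 0;
}

/* Write-protect one entry; returns true if a flush must be queued. */
static bool sk_write_protect(_Atomic uint64_t *ptep)
{
	/* The snapshot is taken by the exchange itself, not by an earlier read. */
	uint64_t oldpte = sk_get_and_clear(ptep);
	uint64_t newpte = oldpte & ~SK_RW;

	/* While the entry is zero, no walker can set A/D bits in it. */
	atomic_store(ptep, newpte);
	return sk_may_need_flush(oldpte, newpte);
}
```

Write-protecting a writable entry reports that a flush is needed, while
repeating the operation on the now read-only entry, or promoting a read-only
entry to writable, does not.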




* Re: [RFC 08/20] mm: store completed TLB generation
  2021-01-31 20:32   ` Andy Lutomirski
@ 2021-02-01  7:28     ` Nadav Amit
  2021-02-01 16:53       ` Andy Lutomirski
  0 siblings, 1 reply; 67+ messages in thread
From: Nadav Amit @ 2021-02-01  7:28 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linux-MM, LKML, Andrea Arcangeli, Andrew Morton, Dave Hansen,
	Peter Zijlstra, Thomas Gleixner, Will Deacon, Yu Zhao,
	Nick Piggin, X86 ML

> On Jan 31, 2021, at 12:32 PM, Andy Lutomirski <luto@kernel.org> wrote:
> 
> On Sat, Jan 30, 2021 at 4:16 PM Nadav Amit <nadav.amit@gmail.com> wrote:
>> From: Nadav Amit <namit@vmware.com>
>> 
>> To detect deferred TLB flushes in fine granularity, we need to keep
>> track on the completed TLB flush generation for each mm.
>> 
>> Add logic to track for each mm the tlb_gen_completed, which tracks the
>> completed TLB generation. It is the arch responsibility to call
>> mark_mm_tlb_gen_done() whenever a TLB flush is completed.
>> 
>> Start the generation numbers from 1 instead of 0. This would allow later
>> to detect whether flushes of a certain generation were completed.
> 
> Can you elaborate on how this helps?

I guess it should have gone to patch 15.

The relevant code it interacts with is in read_defer_tlb_flush_gen(). It
allows a single check to detect an “outdated” deferred TLB gen. Initially
tlb->defer_gen is zero. We are going to do inc_mm_tlb_gen() both the
first time we defer TLB entries and whenever we see that mm_gen is newer than
tlb->defer_gen:

+       mm_gen = atomic64_read(&mm->tlb_gen);
+
+       /*
+        * This condition checks for both first deferred TLB flush and for other
+        * TLB pending or executed TLB flushes after the last table that we
+        * updated. In the latter case, we are going to skip a generation, which
+        * would lead to a full TLB flush. This should therefore not cause
+        * correctness issues, and should not induce overheads, since anyhow in
+        * TLB storms it is better to perform full TLB flush.
+        */
+       if (mm_gen != tlb->defer_gen) {
+               VM_BUG_ON(mm_gen < tlb->defer_gen);
+
+               tlb->defer_gen = inc_mm_tlb_gen(mm);
+       }
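
As a rough userspace illustration of why starting at 1 helps (a sketch under
assumptions, not the code from patch 15): the struct and function names below
are hypothetical, and the model is single-threaded. Because the mm generation
never equals 0, a gather whose defer_gen is still 0 always takes the
"increment" branch, so the first deferral and the "someone else advanced the
generation" case share one comparison.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the generation counter in mm_struct. */
struct sk_mm {
	_Atomic uint64_t tlb_gen;	/* starts at 1, so 0 is never current */
};

/* Hypothetical stand-in for the relevant mmu_gather state. */
struct sk_gather {
	uint64_t defer_gen;		/* 0 until the first deferral */
};

static void sk_mm_init(struct sk_mm *mm)
{
	assert(mm != NULL);
	atomic_store(&mm->tlb_gen, 1);
}

/* Models inc_mm_tlb_gen(): returns the incremented generation. */
static uint64_t sk_inc_gen(struct sk_mm *mm)
{
	return atomic_fetch_add(&mm->tlb_gen, 1) + 1;
}

/* Returns the generation under which the caller defers its flush. */
static uint64_t sk_defer_flush_gen(struct sk_mm *mm, struct sk_gather *tlb)
{
	uint64_t mm_gen = atomic_load(&mm->tlb_gen);

	/* One check covers "first deferral" and "generation moved on". */
	if (mm_gen != tlb->defer_gen) {
		assert(mm_gen > tlb->defer_gen);
		tlb->defer_gen = sk_inc_gen(mm);
	}
	return tlb->defer_gen;
}
```

With a fresh mm, the first call moves the generation from 1 to 2, a second
call reuses 2, and after an unrelated increment to 3 the next call skips to 4.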


> 
> I think you should document that tlb_gen_completed only means that no
> outdated TLB entries will be observably used.  In the x86
> implementation it's possible for older TLB entries to still exist,
> unused, in TLBs of cpus running other mms.

You mean entries that will later be flushed during switch_mm_irqs_off(), right?
Yes, I think that overall my comments need some work.

> How does this work with arch_tlbbatch_flush()?

tlb_gen_completed is not updated by arch_tlbbatch_flush(), since I couldn’t
find a way to combine them. tlb_gen_completed might not catch up with tlb_gen
in this case until another TLB flush takes place. I do not see a correctness
issue, but it might result in a redundant TLB flush.

>> Signed-off-by: Nadav Amit <namit@vmware.com>
>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Andy Lutomirski <luto@kernel.org>
>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Will Deacon <will@kernel.org>
>> Cc: Yu Zhao <yuzhao@google.com>
>> Cc: Nick Piggin <npiggin@gmail.com>
>> Cc: x86@kernel.org
>> ---
>> arch/x86/mm/tlb.c         | 10 ++++++++++
>> include/asm-generic/tlb.h | 33 +++++++++++++++++++++++++++++++++
>> include/linux/mm_types.h  | 15 ++++++++++++++-
>> 3 files changed, 57 insertions(+), 1 deletion(-)
>> 
>> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
>> index 7ab21430be41..d17b5575531e 100644
>> --- a/arch/x86/mm/tlb.c
>> +++ b/arch/x86/mm/tlb.c
>> @@ -14,6 +14,7 @@
>> #include <asm/nospec-branch.h>
>> #include <asm/cache.h>
>> #include <asm/apic.h>
>> +#include <asm/tlb.h>
>> 
>> #include "mm_internal.h"
>> 
>> @@ -915,6 +916,9 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
>>        if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
>>                flush_tlb_others(mm_cpumask(mm), info);
>> 
>> +       /* Update the completed generation */
>> +       mark_mm_tlb_gen_done(mm, new_tlb_gen);
>> +
>>        put_flush_tlb_info();
>>        put_cpu();
>> }
>> @@ -1147,6 +1151,12 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
>> 
>>        cpumask_clear(&batch->cpumask);
>> 
>> +       /*
>> +        * We cannot call mark_mm_tlb_gen_done() since we do not know which
>> +        * mm's should be flushed. This may lead to some unwarranted TLB
>> +        * flushes, but not to correction problems.
>> +        */
>> +
>>        put_cpu();
>> }
>> 
>> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
>> index 517c89398c83..427bfcc6cdec 100644
>> --- a/include/asm-generic/tlb.h
>> +++ b/include/asm-generic/tlb.h
>> @@ -513,6 +513,39 @@ static inline void tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vm
>> }
>> #endif
>> 
>> +#ifdef CONFIG_ARCH_HAS_TLB_GENERATIONS
>> +
>> +/*
>> + * Helper function to update a generation to have a new value, as long as new
>> + * value is greater or equal to gen.
>> + */
> 
> I read this a couple of times, and I don't understand it.  How about:
> 
> Helper function to atomically set *gen = max(*gen, new_gen)
> 
>> +static inline void tlb_update_generation(atomic64_t *gen, u64 new_gen)
>> +{
>> +       u64 cur_gen = atomic64_read(gen);
>> +
>> +       while (cur_gen < new_gen) {
>> +               u64 old_gen = atomic64_cmpxchg(gen, cur_gen, new_gen);
>> +
>> +               /* Check if we succeeded in the cmpxchg */
>> +               if (likely(cur_gen == old_gen))
>> +                       break;
>> +
>> +               cur_gen = old_gen;
>> +       };
>> +}
>> +
>> +
>> +static inline void mark_mm_tlb_gen_done(struct mm_struct *mm, u64 gen)
>> +{
>> +       /*
>> +        * Update the completed generation to the new generation if the new
>> +        * generation is greater than the previous one.
>> +        */
>> +       tlb_update_generation(&mm->tlb_gen_completed, gen);
>> +}
>> +
>> +#endif /* CONFIG_ARCH_HAS_TLB_GENERATIONS */
>> +
>> /*
>>  * tlb_flush_{pte|pmd|pud|p4d}_range() adjust the tlb->start and tlb->end,
>>  * and set corresponding cleared_*.
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index 2035ac319c2b..8a5eb4bfac59 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -571,6 +571,13 @@ struct mm_struct {
>>                 * This is not used on Xen PV.
>>                 */
>>                atomic64_t tlb_gen;
>> +
>> +               /*
>> +                * TLB generation which is guarnateed to be flushed, including
> 
> guaranteed
> 
>> +                * all the PTE changes that were performed before tlb_gen was
>> +                * incremented.
>> +                */
> 
> I will defer judgment to future patches before I believe that this isn't racy :)

Fair enough. Thanks for the review.



* Re: [RFC 13/20] mm/tlb: introduce tlb_start_ptes() and tlb_end_ptes()
  2021-01-31 10:07   ` Damian Tometzki
@ 2021-02-01  7:29     ` Nadav Amit
  0 siblings, 0 replies; 67+ messages in thread
From: Nadav Amit @ 2021-02-01  7:29 UTC (permalink / raw)
  To: Damian Tometzki
  Cc: Linux-MM, LKML, Andrea Arcangeli, Andrew Morton, Andy Lutomirski,
	Dave Hansen, Peter Zijlstra, Thomas Gleixner, Will Deacon,
	Yu Zhao, Nick Piggin, x86

> On Jan 31, 2021, at 2:07 AM, Damian Tometzki <linux@tometzki.de> wrote:
> 
> On Sat, 30. Jan 16:11, Nadav Amit wrote:
>> From: Nadav Amit <namit@vmware.com>
>> 
>> Introduce tlb_start_ptes() and tlb_end_ptes() which would be called
>> before and after PTEs are updated and TLB flushes are deferred. This
>> will later be used for fine-granularity deferred TLB flushing
>> detection.
>> 
>> In the meanwhile, move flush_tlb_batched_pending() into
>> tlb_start_ptes(). It was not called from mapping_dirty_helpers by
>> wp_pte() and clean_record_pte(), which might be a bug.
>> 
>> No additional functional change is intended.
>> 
>> Signed-off-by: Nadav Amit <namit@vmware.com>
>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Andy Lutomirski <luto@kernel.org>
>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Will Deacon <will@kernel.org>
>> Cc: Yu Zhao <yuzhao@google.com>
>> Cc: Nick Piggin <npiggin@gmail.com>
>> Cc: x86@kernel.org
>> ---
>> fs/proc/task_mmu.c         |  2 ++
>> include/asm-generic/tlb.h  | 18 ++++++++++++++++++
>> mm/madvise.c               |  6 ++++--
>> mm/mapping_dirty_helpers.c | 15 +++++++++++++--
>> mm/memory.c                |  2 ++
>> mm/mprotect.c              |  3 ++-
>> 6 files changed, 41 insertions(+), 5 deletions(-)
>> 
>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>> index 4cd048ffa0f6..d0cce961fa5c 100644
>> --- a/fs/proc/task_mmu.c
>> +++ b/fs/proc/task_mmu.c
>> @@ -1168,6 +1168,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
>> 		return 0;
>> 
>> 	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
>> +	tlb_start_ptes(&cp->tlb);
>> 	for (; addr != end; pte++, addr += PAGE_SIZE) {
>> 		ptent = *pte;
>> 
>> @@ -1190,6 +1191,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
>> 		tlb_flush_pte_range(&cp->tlb, addr, PAGE_SIZE);
>> 		ClearPageReferenced(page);
>> 	}
>> +	tlb_end_ptes(&cp->tlb);
>> 	pte_unmap_unlock(pte - 1, ptl);
>> 	cond_resched();
>> 	return 0;
>> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
>> index 041be2ef4426..10690763090a 100644
>> --- a/include/asm-generic/tlb.h
>> +++ b/include/asm-generic/tlb.h
>> @@ -58,6 +58,11 @@
>>  *    Defaults to flushing at tlb_end_vma() to reset the range; helps when
>>  *    there's large holes between the VMAs.
>>  *
>> + *  - tlb_start_ptes() / tlb_end_ptes; makr the start / end of PTEs change.
> 
> Hello Nadav,
> 
> short nit: makr/mark

Thanks! I will fix it.





* Re: [RFC 01/20] mm/tlb: fix fullmm semantics
  2021-01-31  2:57       ` Andy Lutomirski
@ 2021-02-01  7:30         ` Nadav Amit
  0 siblings, 0 replies; 67+ messages in thread
From: Nadav Amit @ 2021-02-01  7:30 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linux-MM, LKML, Andrea Arcangeli, Andrew Morton, Dave Hansen,
	Peter Zijlstra, Thomas Gleixner, Will Deacon, Yu Zhao,
	Nick Piggin, X86 ML

> On Jan 30, 2021, at 6:57 PM, Andy Lutomirski <luto@kernel.org> wrote:
> 
> On Sat, Jan 30, 2021 at 5:19 PM Nadav Amit <nadav.amit@gmail.com> wrote:
>>> On Jan 30, 2021, at 5:02 PM, Andy Lutomirski <luto@kernel.org> wrote:
>>> 
>>> On Sat, Jan 30, 2021 at 4:16 PM Nadav Amit <nadav.amit@gmail.com> wrote:
>>>> From: Nadav Amit <namit@vmware.com>
>>>> 
>>>> fullmm in mmu_gather is supposed to indicate that the mm is torn-down
>>>> (e.g., on process exit) and can therefore allow certain optimizations.
>>>> However, tlb_finish_mmu() sets fullmm, when in fact it want to say that
>>>> the TLB should be fully flushed.
>>> 
>>> Maybe also rename fullmm?
>> 
>> Possible. How about mm_torn_down?
> 
> Sure.  Or mm_exiting, perhaps?

mm_exiting indeed sounds better.



* Re: [RFC 01/20] mm/tlb: fix fullmm semantics
  2021-01-31  0:11 ` [RFC 01/20] mm/tlb: fix fullmm semantics Nadav Amit
  2021-01-31  1:02   ` Andy Lutomirski
@ 2021-02-01 11:36   ` Peter Zijlstra
  2021-02-02  9:32     ` Nadav Amit
  1 sibling, 1 reply; 67+ messages in thread
From: Peter Zijlstra @ 2021-02-01 11:36 UTC (permalink / raw)
  To: Nadav Amit
  Cc: linux-mm, linux-kernel, Nadav Amit, Andrea Arcangeli,
	Andrew Morton, Andy Lutomirski, Dave Hansen, Thomas Gleixner,
	Will Deacon, Yu Zhao, Nick Piggin, x86


https://lkml.kernel.org/r/20210127235347.1402-1-will@kernel.org



* Re: [RFC 08/20] mm: store completed TLB generation
  2021-01-31  0:11 ` [RFC 08/20] mm: store completed TLB generation Nadav Amit
  2021-01-31 20:32   ` Andy Lutomirski
@ 2021-02-01 11:52   ` Peter Zijlstra
  1 sibling, 0 replies; 67+ messages in thread
From: Peter Zijlstra @ 2021-02-01 11:52 UTC (permalink / raw)
  To: Nadav Amit
  Cc: linux-mm, linux-kernel, Nadav Amit, Andrea Arcangeli,
	Andrew Morton, Andy Lutomirski, Dave Hansen, Thomas Gleixner,
	Will Deacon, Yu Zhao, Nick Piggin, x86

On Sat, Jan 30, 2021 at 04:11:20PM -0800, Nadav Amit wrote:
> +static inline void tlb_update_generation(atomic64_t *gen, u64 new_gen)
> +{
> +	u64 cur_gen = atomic64_read(gen);
> +
> +	while (cur_gen < new_gen) {
> +		u64 old_gen = atomic64_cmpxchg(gen, cur_gen, new_gen);
> +
> +		/* Check if we succeeded in the cmpxchg */
> +		if (likely(cur_gen == old_gen))
> +			break;
> +
> +		cur_gen = old_gen;
> +	};
> +}

	u64 cur_gen = atomic64_read(gen);
	while (cur_gen < new_gen && !atomic64_try_cmpxchg(gen, &cur_gen, new_gen))
		;
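
For readers outside the kernel, the same "atomically set *gen = max(*gen,
new_gen)" loop can be sketched with C11 atomics. This is a userspace analogue
only, and sk_update_generation() is a hypothetical name. Like
atomic64_try_cmpxchg(), atomic_compare_exchange_weak() writes the observed
value back into the expected argument on failure, which is what lets the loop
body stay empty.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Atomically set *gen = max(*gen, new_gen). */
static void sk_update_generation(_Atomic uint64_t *gen, uint64_t new_gen)
{
	uint64_t cur;

	assert(gen != NULL);
	cur = atomic_load(gen);

	/*
	 * On failure the compare-exchange refreshes cur with the value now
	 * stored in *gen, so each retry re-evaluates cur < new_gen against
	 * fresh data and the generation can never move backwards.
	 */
	while (cur < new_gen &&
	       !atomic_compare_exchange_weak(gen, &cur, new_gen))
		;
}
```

Calling it with a larger value advances the counter, while calling it with a
smaller or equal value leaves the counter untouched.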




* Re: [RFC 11/20] mm/tlb: remove arch-specific tlb_start/end_vma()
  2021-01-31  0:11 ` [RFC 11/20] mm/tlb: remove arch-specific tlb_start/end_vma() Nadav Amit
@ 2021-02-01 12:09   ` Peter Zijlstra
  2021-02-02  6:41     ` Nicholas Piggin
  0 siblings, 1 reply; 67+ messages in thread
From: Peter Zijlstra @ 2021-02-01 12:09 UTC (permalink / raw)
  To: Nadav Amit
  Cc: linux-mm, linux-kernel, Nadav Amit, Andrea Arcangeli,
	Andrew Morton, Andy Lutomirski, Dave Hansen, Thomas Gleixner,
	Will Deacon, Yu Zhao, Nick Piggin, linux-csky, linuxppc-dev,
	linux-s390, x86

On Sat, Jan 30, 2021 at 04:11:23PM -0800, Nadav Amit wrote:

> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> index 427bfcc6cdec..b97136b7010b 100644
> --- a/include/asm-generic/tlb.h
> +++ b/include/asm-generic/tlb.h
> @@ -334,8 +334,8 @@ static inline void __tlb_reset_range(struct mmu_gather *tlb)
>  
>  #ifdef CONFIG_MMU_GATHER_NO_RANGE
>  
> -#if defined(tlb_flush) || defined(tlb_start_vma) || defined(tlb_end_vma)
> -#error MMU_GATHER_NO_RANGE relies on default tlb_flush(), tlb_start_vma() and tlb_end_vma()
> +#if defined(tlb_flush)
> +#error MMU_GATHER_NO_RANGE relies on default tlb_flush()
>  #endif
>  
>  /*
> @@ -362,10 +362,6 @@ static inline void tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vm
>  
>  #ifndef tlb_flush
>  
> -#if defined(tlb_start_vma) || defined(tlb_end_vma)
> -#error Default tlb_flush() relies on default tlb_start_vma() and tlb_end_vma()
> -#endif

#ifdef CONFIG_ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING
#error ....
#endif

goes here...


>  static inline void tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vma)
>  {
>  	if (tlb->fullmm)
>  		return;
>  
> +	if (IS_ENABLED(CONFIG_ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING))
> +		return;

Also, can you please stick to the CONFIG_MMU_GATHER_* namespace?

I also don't think AGGRESSIVE_TLB_FLUSH_BATCHING quite captures what it does.
How about:

	CONFIG_MMU_GATHER_NO_PER_VMA_FLUSH

?





* Re: [RFC 12/20] mm/tlb: save the VMA that is flushed during tlb_start_vma()
  2021-01-31  0:11 ` [RFC 12/20] mm/tlb: save the VMA that is flushed during tlb_start_vma() Nadav Amit
@ 2021-02-01 12:28   ` Peter Zijlstra
  0 siblings, 0 replies; 67+ messages in thread
From: Peter Zijlstra @ 2021-02-01 12:28 UTC (permalink / raw)
  To: Nadav Amit
  Cc: linux-mm, linux-kernel, Nadav Amit, Andrea Arcangeli,
	Andrew Morton, Andy Lutomirski, Dave Hansen, Thomas Gleixner,
	Will Deacon, Yu Zhao, Nick Piggin, x86

On Sat, Jan 30, 2021 at 04:11:24PM -0800, Nadav Amit wrote:

> @@ -283,12 +290,6 @@ struct mmu_gather {
>  	unsigned int		cleared_puds : 1;
>  	unsigned int		cleared_p4ds : 1;
>  
> -	/*
> -	 * tracks VM_EXEC | VM_HUGETLB in tlb_start_vma
> -	 */
> -	unsigned int		vma_exec : 1;
> -	unsigned int		vma_huge : 1;
> -
>  	unsigned int		batch_count;
>  
>  #ifndef CONFIG_MMU_GATHER_NO_GATHER

> @@ -372,38 +369,20 @@ static inline void tlb_flush(struct mmu_gather *tlb)
>  	if (tlb->fullmm || tlb->need_flush_all) {
>  		flush_tlb_mm(tlb->mm);
>  	} else if (tlb->end) {
> -		struct vm_area_struct vma = {
> -			.vm_mm = tlb->mm,
> -			.vm_flags = (tlb->vma_exec ? VM_EXEC    : 0) |
> -				    (tlb->vma_huge ? VM_HUGETLB : 0),
> -		};
> -
> -		flush_tlb_range(&vma, tlb->start, tlb->end);
> +		VM_BUG_ON(!tlb->vma);
> +		flush_tlb_range(tlb->vma, tlb->start, tlb->end);
>  	}
>  }

I don't much like this, and I think this is a step in the wrong
direction.

The idea is to extend the tlb_{remove,flush}_*() API to provide the
needed information to do TLB flushing. In fact, I think
tlb_remove_huge*() is already sufficient to set the VM_HUGETLB 'hint'. We
just don't have anything that covers the EXEC thing.

(also, I suspect the page_size crud we have also covers that)

Constructing a fake vma very much ensures arch tlb routines don't go
about and look at anything else either.
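
A standalone sketch of that property, for illustration only: the sk_* types
are hypothetical stand-ins, and the flag values are assumptions chosen to
resemble VM_EXEC and VM_HUGETLB. The gather records just two hint bits, and
the vma handed to the flush routine is rebuilt from those bits, so nothing
else from the original vma can leak through.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative flag values (assumption). */
#define SK_VM_EXEC	0x00000004UL
#define SK_VM_HUGETLB	0x00400000UL

struct sk_vma {
	unsigned long vm_flags;
};

/* Only the two hints survive; everything else about the vma is dropped. */
struct sk_hints {
	unsigned int vma_exec : 1;
	unsigned int vma_huge : 1;
};

static void sk_record_vma(struct sk_hints *tlb, const struct sk_vma *vma)
{
	assert(tlb != NULL && vma != NULL);
	tlb->vma_exec = !!(vma->vm_flags & SK_VM_EXEC);
	tlb->vma_huge = !!(vma->vm_flags & SK_VM_HUGETLB);
}

/* Build the minimal vma that a flush routine would be given. */
static struct sk_vma sk_fake_vma(const struct sk_hints *tlb)
{
	struct sk_vma vma = {
		.vm_flags = (tlb->vma_exec ? SK_VM_EXEC : 0) |
			    (tlb->vma_huge ? SK_VM_HUGETLB : 0),
	};

	return vma;
}
```

Any other flag set on the original vma is absent from the reconstructed one,
which is the guarantee that passing the real vma pointer gives up.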

> +tlb_update_vma(struct mmu_gather *tlb, struct vm_area_struct *vma)
>  {
> -	/*
> -	 * flush_tlb_range() implementations that look at VM_HUGETLB (tile,
> -	 * mips-4k) flush only large pages.
> -	 *
> -	 * flush_tlb_range() implementations that flush I-TLB also flush D-TLB
> -	 * (tile, xtensa, arm), so it's ok to just add VM_EXEC to an existing
> -	 * range.
> -	 *
> -	 * We rely on tlb_end_vma() to issue a flush, such that when we reset
> -	 * these values the batch is empty.
> -	 */
> -	tlb->vma_huge = is_vm_hugetlb_page(vma);
> -	tlb->vma_exec = !!(vma->vm_flags & VM_EXEC);
> +	tlb->vma = vma;
>  }

And you're also removing the useful information about arch tlb flush
functions.



* Re: [RFC 00/20] TLB batching consolidation and enhancements
  2021-01-31  7:57   ` Nadav Amit
  2021-01-31  8:14     ` Nadav Amit
@ 2021-02-01 12:44     ` Peter Zijlstra
  2021-02-02  7:14       ` Nicholas Piggin
  1 sibling, 1 reply; 67+ messages in thread
From: Peter Zijlstra @ 2021-02-01 12:44 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Nicholas Piggin, LKML, Linux-MM, Andrea Arcangeli, Andrew Morton,
	Dave Hansen, linux-csky, linuxppc-dev, linux-s390,
	Andy Lutomirski, Mel Gorman, Thomas Gleixner, Will Deacon,
	X86 ML, Yu Zhao

On Sun, Jan 31, 2021 at 07:57:01AM +0000, Nadav Amit wrote:
> > On Jan 30, 2021, at 7:30 PM, Nicholas Piggin <npiggin@gmail.com> wrote:

> > I'll go through the patches a bit more closely when they all come 
> > through. Sparc and powerpc of course need the arch lazy mode to get 
> > per-page/pte information for operations that are not freeing pages, 
> > which is what mmu gather is designed for.
> 
> IIUC you mean any PTE change requires a TLB flush. Even setting up a new PTE
> where no previous PTE was set, right?

These are the HASH architectures. Their hardware doesn't walk the
page-tables, but it consults a hash-table to resolve page translations.

They _MUST_ flush the entries under the PTL to avoid ever seeing
conflicting information, which will make them really unhappy. They can
do this because they have TLBI broadcast.

There's a few more details I'm sure, but those seem to have slipped from
my mind.



* Re: [RFC 13/20] mm/tlb: introduce tlb_start_ptes() and tlb_end_ptes()
  2021-01-31  0:11 ` [RFC 13/20] mm/tlb: introduce tlb_start_ptes() and tlb_end_ptes() Nadav Amit
  2021-01-31  9:57   ` Damian Tometzki
  2021-01-31 10:07   ` Damian Tometzki
@ 2021-02-01 13:19   ` Peter Zijlstra
  2021-02-01 23:00     ` Nadav Amit
  2 siblings, 1 reply; 67+ messages in thread
From: Peter Zijlstra @ 2021-02-01 13:19 UTC (permalink / raw)
  To: Nadav Amit
  Cc: linux-mm, linux-kernel, Nadav Amit, Andrea Arcangeli,
	Andrew Morton, Andy Lutomirski, Dave Hansen, Thomas Gleixner,
	Will Deacon, Yu Zhao, Nick Piggin, x86

On Sat, Jan 30, 2021 at 04:11:25PM -0800, Nadav Amit wrote:
> +#define tlb_start_ptes(tlb)						\
> +	do {								\
> +		struct mmu_gather *_tlb = (tlb);			\
> +									\
> +		flush_tlb_batched_pending(_tlb->mm);			\
> +	} while (0)
> +
> +static inline void tlb_end_ptes(struct mmu_gather *tlb) { }

>  	tlb_change_page_size(tlb, PAGE_SIZE);
>  	orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> -	flush_tlb_batched_pending(mm);
> +	tlb_start_ptes(tlb);
>  	arch_enter_lazy_mmu_mode();
>  	for (; addr < end; pte++, addr += PAGE_SIZE) {
>  		ptent = *pte;
> @@ -468,6 +468,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>  	}
>  
>  	arch_leave_lazy_mmu_mode();
> +	tlb_end_ptes(tlb);
>  	pte_unmap_unlock(orig_pte, ptl);
>  	if (pageout)
>  		reclaim_pages(&page_list);

I don't like how you're doubling up on arch_*_lazy_mmu_mode(). It seems
to me those should be folded into tlb_{start,end}_ptes().

Alternatively, an approach that takes even more work would be to add an
optional @tlb argument to pte_offset_map_lock()/pte_unmap_unlock() and friends.
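
To make the first suggestion concrete, here is a minimal userspace model of
the folded helpers. It is only a sketch: the sk_* functions are hypothetical
stand-ins for flush_tlb_batched_pending() and
arch_{enter,leave}_lazy_mmu_mode(), and the stand-ins take the gather as an
argument purely so that their effect is observable, which the real arch hooks
do not.

```c
#include <assert.h>
#include <stddef.h>

struct sk_mm {
	int batched_pending;	/* models a pending batched flush count */
};

struct sk_gather {
	struct sk_mm *mm;
	int lazy_depth;		/* models "inside lazy MMU mode" */
};

/* Stand-in for flush_tlb_batched_pending(). */
static void sk_flush_batched_pending(struct sk_mm *mm)
{
	mm->batched_pending = 0;
}

/* Stand-ins for arch_enter_lazy_mmu_mode() / arch_leave_lazy_mmu_mode(). */
static void sk_enter_lazy(struct sk_gather *tlb) { tlb->lazy_depth++; }
static void sk_leave_lazy(struct sk_gather *tlb) { tlb->lazy_depth--; }

/* Callers bracket a PTE loop with one pair of calls instead of two. */
static void sk_tlb_start_ptes(struct sk_gather *tlb)
{
	assert(tlb != NULL && tlb->mm != NULL);
	sk_flush_batched_pending(tlb->mm);
	sk_enter_lazy(tlb);
}

static void sk_tlb_end_ptes(struct sk_gather *tlb)
{
	assert(tlb != NULL && tlb->lazy_depth > 0);
	sk_leave_lazy(tlb);
}
```

A caller would then wrap its PTE loop in sk_tlb_start_ptes() and
sk_tlb_end_ptes() and never call the lazy-mode hooks directly, which also
keeps the enter/leave pairing in one place.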



* Re: [RFC 03/20] mm/mprotect: do not flush on permission promotion
  2021-02-01  5:58       ` Nadav Amit
@ 2021-02-01 15:38         ` Andrew Cooper
  0 siblings, 0 replies; 67+ messages in thread
From: Andrew Cooper @ 2021-02-01 15:38 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andy Lutomirski, Linux-MM, LKML, Andrea Arcangeli, Andrew Morton,
	Dave Hansen, Peter Zijlstra, Thomas Gleixner, Will Deacon,
	Yu Zhao, Nick Piggin, X86 ML, Andy Lutomirski

On 01/02/2021 05:58, Nadav Amit wrote:
>> On Jan 31, 2021, at 4:10 AM, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>>
>> On 31/01/2021 01:07, Andy Lutomirski wrote:
>>> Adding Andrew Cooper, who has a distressingly extensive understanding
>>> of the x86 PTE magic.
>> Pretty sure it is all learning things the hard way...
>>
>>> On Sat, Jan 30, 2021 at 4:16 PM Nadav Amit <nadav.amit@gmail.com> wrote:
>>>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>>>> index 632d5a677d3f..b7473d2c9a1f 100644
>>>> --- a/mm/mprotect.c
>>>> +++ b/mm/mprotect.c
>>>> @@ -139,7 +139,8 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
>>>>                                ptent = pte_mkwrite(ptent);
>>>>                        }
>>>>                        ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
>>>> -                       tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
>>>> +                       if (pte_may_need_flush(oldpte, ptent))
>>>> +                               tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
>> You're choosing to avoid the flush, based on A/D bits read ahead of the
>> actual modification of the PTE.
>>
>> In this example, another thread can write into the range (sets A and D),
>> and get a suitable TLB entry which goes unflushed while the rest of the
>> kernel thinks the memory is write-protected and clean.
>>
>> The only safe way to do this is to use XCHG/etc to modify the PTE, and
>> base flush calculations on the results.  Atomic operations are ordered
>> with A/D updates from pagewalks on other CPUs, even on AMD where A
>> updates are explicitly not ordered with regular memory reads, for
>> performance reasons.
> Thanks Andrew for the feedback, but I think the patch does it exactly in
> this safe manner that you describe (at least on native x86, but I see a
> similar path elsewhere as well):
>
> oldpte = ptep_modify_prot_start()
> -> __ptep_modify_prot_start()
> -> ptep_get_and_clear
> -> native_ptep_get_and_clear()
> -> xchg()
>
> Note that the xchg() will clear the PTE (i.e., making it non-present), and
> no further updates of A/D are possible until ptep_modify_prot_commit() is
> called.
>
> On non-SMP setups this is not atomic (no xchg), but since we hold the lock,
> we should be safe.
>
> I guess you are right that pte_may_need_flush() deserves a comment to
> clarify that oldpte must be obtained by an atomic operation to ensure no A/D
> bits are lost (as you say).
>
> Yet, I do not see a correctness problem. Am I missing something?

No(ish) - I failed to spot that path.

But native_ptep_get_and_clear() is broken on !SMP builds.  It needs to
be an XCHG even in that case, to spot A/D updates from prefetch or
shared-virtual-memory DMA.

~Andrew



* Re: [RFC 08/20] mm: store completed TLB generation
  2021-02-01  7:28     ` Nadav Amit
@ 2021-02-01 16:53       ` Andy Lutomirski
  0 siblings, 0 replies; 67+ messages in thread
From: Andy Lutomirski @ 2021-02-01 16:53 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andy Lutomirski, Linux-MM, LKML, Andrea Arcangeli, Andrew Morton,
	Dave Hansen, Peter Zijlstra, Thomas Gleixner, Will Deacon,
	Yu Zhao, Nick Piggin, X86 ML

On Sun, Jan 31, 2021 at 11:28 PM Nadav Amit <namit@vmware.com> wrote:
>
> > On Jan 31, 2021, at 12:32 PM, Andy Lutomirski <luto@kernel.org> wrote:
> >
> > On Sat, Jan 30, 2021 at 4:16 PM Nadav Amit <nadav.amit@gmail.com> wrote:
> >> From: Nadav Amit <namit@vmware.com>
> >>
> >> To detect deferred TLB flushes in fine granularity, we need to keep
> >> track on the completed TLB flush generation for each mm.
> >>
> >> Add logic to track for each mm the tlb_gen_completed, which tracks the
> >> completed TLB generation. It is the arch responsibility to call
> >> mark_mm_tlb_gen_done() whenever a TLB flush is completed.
> >>
> >> Start the generation numbers from 1 instead of 0. This would allow later
> >> to detect whether flushes of a certain generation were completed.
> >
> > Can you elaborate on how this helps?
>
> I guess it should have gone to patch 15.
>
> The relevant code it interacts with is in read_defer_tlb_flush_gen(). It
> allows a single check to detect an “outdated” deferred TLB gen. Initially
> tlb->defer_gen is zero. We are going to do inc_mm_tlb_gen() both the
> first time we defer TLB entries and whenever we see that mm_gen is newer than
> tlb->defer_gen:
>
> +       mm_gen = atomic64_read(&mm->tlb_gen);
> +
> +       /*
> +        * This condition checks for both first deferred TLB flush and for other
> +        * TLB pending or executed TLB flushes after the last table that we
> +        * updated. In the latter case, we are going to skip a generation, which
> +        * would lead to a full TLB flush. This should therefore not cause
> +        * correctness issues, and should not induce overheads, since anyhow in
> +        * TLB storms it is better to perform full TLB flush.
> +        */
> +       if (mm_gen != tlb->defer_gen) {
> +               VM_BUG_ON(mm_gen < tlb->defer_gen);
> +
> +               tlb->defer_gen = inc_mm_tlb_gen(mm);
> +       }
>
>
> >
> > I think you should document that tlb_gen_completed only means that no
> > outdated TLB entries will be observably used.  In the x86
> > implementation it's possible for older TLB entries to still exist,
> > unused, in TLBs of cpus running other mms.
>
> > You mean entries that will be flushed later, during switch_mm_irqs_off(),
> > right? I think that overall my comments need some work. Yes.

That's exactly what I mean.

>
> > How does this work with arch_tlbbatch_flush()?
>
> completed_gen is not updated by arch_tlbbatch_flush(), since I couldn’t find
> a way to combine them. completed_gen might not catch up with tlb_gen in this
> case until another TLB flush takes place. I do not see a correctness issue,
> but it might result in a redundant TLB flush.

Please at least document this.

FWIW, arch_tlbbatch_flush() is gross.  I'm not convinced it's really
supportable with proper broadcast invalidation. I suppose we could
remove it or explicitly track the set of mms that need flushing.

>
> >> Signed-off-by: Nadav Amit <namit@vmware.com>
> >> Cc: Andrea Arcangeli <aarcange@redhat.com>
> >> Cc: Andrew Morton <akpm@linux-foundation.org>
> >> Cc: Andy Lutomirski <luto@kernel.org>
> >> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> >> Cc: Peter Zijlstra <peterz@infradead.org>
> >> Cc: Thomas Gleixner <tglx@linutronix.de>
> >> Cc: Will Deacon <will@kernel.org>
> >> Cc: Yu Zhao <yuzhao@google.com>
> >> Cc: Nick Piggin <npiggin@gmail.com>
> >> Cc: x86@kernel.org
> >> ---
> >> arch/x86/mm/tlb.c         | 10 ++++++++++
> >> include/asm-generic/tlb.h | 33 +++++++++++++++++++++++++++++++++
> >> include/linux/mm_types.h  | 15 ++++++++++++++-
> >> 3 files changed, 57 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> >> index 7ab21430be41..d17b5575531e 100644
> >> --- a/arch/x86/mm/tlb.c
> >> +++ b/arch/x86/mm/tlb.c
> >> @@ -14,6 +14,7 @@
> >> #include <asm/nospec-branch.h>
> >> #include <asm/cache.h>
> >> #include <asm/apic.h>
> >> +#include <asm/tlb.h>
> >>
> >> #include "mm_internal.h"
> >>
> >> @@ -915,6 +916,9 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
> >>        if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
> >>                flush_tlb_others(mm_cpumask(mm), info);
> >>
> >> +       /* Update the completed generation */
> >> +       mark_mm_tlb_gen_done(mm, new_tlb_gen);
> >> +
> >>        put_flush_tlb_info();
> >>        put_cpu();
> >> }
> >> @@ -1147,6 +1151,12 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
> >>
> >>        cpumask_clear(&batch->cpumask);
> >>
> >> +       /*
> >> +        * We cannot call mark_mm_tlb_gen_done() since we do not know which
> >> +        * mm's should be flushed. This may lead to some unwarranted TLB
> >> +        * flushes, but not to correctness problems.
> >> +        */
> >> +
> >>        put_cpu();
> >> }
> >>
> >> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> >> index 517c89398c83..427bfcc6cdec 100644
> >> --- a/include/asm-generic/tlb.h
> >> +++ b/include/asm-generic/tlb.h
> >> @@ -513,6 +513,39 @@ static inline void tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vm
> >> }
> >> #endif
> >>
> >> +#ifdef CONFIG_ARCH_HAS_TLB_GENERATIONS
> >> +
> >> +/*
> >> + * Helper function to update a generation to have a new value, as long as new
> >> + * value is greater or equal to gen.
> >> + */
> >
> > I read this a couple of times, and I don't understand it.  How about:
> >
> > Helper function to atomically set *gen = max(*gen, new_gen)
> >
> >> +static inline void tlb_update_generation(atomic64_t *gen, u64 new_gen)
> >> +{
> >> +       u64 cur_gen = atomic64_read(gen);
> >> +
> >> +       while (cur_gen < new_gen) {
> >> +               u64 old_gen = atomic64_cmpxchg(gen, cur_gen, new_gen);
> >> +
> >> +               /* Check if we succeeded in the cmpxchg */
> >> +               if (likely(cur_gen == old_gen))
> >> +                       break;
> >> +
> >> +               cur_gen = old_gen;
> >> +       };
> >> +}
> >> +
> >> +
> >> +static inline void mark_mm_tlb_gen_done(struct mm_struct *mm, u64 gen)
> >> +{
> >> +       /*
> >> +        * Update the completed generation to the new generation if the new
> >> +        * generation is greater than the previous one.
> >> +        */
> >> +       tlb_update_generation(&mm->tlb_gen_completed, gen);
> >> +}
> >> +
> >> +#endif /* CONFIG_ARCH_HAS_TLB_GENERATIONS */
> >> +
> >> /*
> >>  * tlb_flush_{pte|pmd|pud|p4d}_range() adjust the tlb->start and tlb->end,
> >>  * and set corresponding cleared_*.
> >> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> >> index 2035ac319c2b..8a5eb4bfac59 100644
> >> --- a/include/linux/mm_types.h
> >> +++ b/include/linux/mm_types.h
> >> @@ -571,6 +571,13 @@ struct mm_struct {
> >>                 * This is not used on Xen PV.
> >>                 */
> >>                atomic64_t tlb_gen;
> >> +
> >> +               /*
> >> +                * TLB generation which is guarnateed to be flushed, including
> >
> > guaranteed
> >
> >> +                * all the PTE changes that were performed before tlb_gen was
> >> +                * incremented.
> >> +                */
> >
> > I will defer judgment to future patches before I believe that this isn't racy :)
>
> Fair enough. Thanks for the review.
>
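
The "atomically set *gen = max(*gen, new_gen)" reading of
tlb_update_generation() can be illustrated with a minimal userspace sketch.
It assumes C11 atomics in place of the kernel's atomic64_t, and
model_update_generation() is a hypothetical name, not the patch's code:

```c
#include <stdatomic.h>
#include <stdint.h>

/*
 * Userspace model (assumption: C11 atomics stand in for atomic64_t).
 * Atomically performs *gen = max(*gen, new_gen).
 */
static inline void model_update_generation(_Atomic uint64_t *gen,
					   uint64_t new_gen)
{
	uint64_t cur_gen = atomic_load(gen);

	/*
	 * On failure, atomic_compare_exchange_weak() stores the value it
	 * found into cur_gen, so the loop condition is re-evaluated against
	 * fresh data and the stored generation never moves backwards.
	 */
	while (cur_gen < new_gen) {
		if (atomic_compare_exchange_weak(gen, &cur_gen, new_gen))
			break;
	}
}
```

A smaller new_gen leaves the stored value untouched, which is the property
mark_mm_tlb_gen_done() relies on when flushes complete out of order.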



* Re: [RFC 15/20] mm: detect deferred TLB flushes in vma granularity
  2021-01-31  0:11 ` [RFC 15/20] mm: detect deferred TLB flushes in vma granularity Nadav Amit
@ 2021-02-01 22:04   ` Nadav Amit
  2021-02-02  0:14     ` Andy Lutomirski
  0 siblings, 1 reply; 67+ messages in thread
From: Nadav Amit @ 2021-02-01 22:04 UTC (permalink / raw)
  To: Linux-MM, LKML, Andy Lutomirski
  Cc: Andrea Arcangeli, Andrew Morton, Dave Hansen, Peter Zijlstra,
	Thomas Gleixner, Will Deacon, Yu Zhao, X86 ML

> On Jan 30, 2021, at 4:11 PM, Nadav Amit <nadav.amit@gmail.com> wrote:
> 
> From: Nadav Amit <namit@vmware.com>
> 
> Currently, deferred TLB flushes are detected in the mm granularity: if
> there is any deferred TLB flush in the entire address space due to NUMA
> migration, pte_accessible() in x86 would return true, and
> ptep_clear_flush() would require a TLB flush. This would happen even if
> the PTE resides in a completely different vma.

[ snip ]

> +static inline void read_defer_tlb_flush_gen(struct mmu_gather *tlb)
> +{
> +	struct mm_struct *mm = tlb->mm;
> +	u64 mm_gen;
> +
> +	/*
> +	 * Any change of PTE before calling __track_deferred_tlb_flush() must be
> +	 * performed using RMW atomic operation that provides a memory barriers,
> +	 * such as ptep_modify_prot_start().  The barrier ensure the PTEs are
> +	 * written before the current generation is read, synchronizing
> +	 * (implicitly) with flush_tlb_mm_range().
> +	 */
> +	smp_mb__after_atomic();
> +
> +	mm_gen = atomic64_read(&mm->tlb_gen);
> +
> +	/*
> +	 * This condition checks for both first deferred TLB flush and for other
> +	 * TLB pending or executed TLB flushes after the last table that we
> +	 * updated. In the latter case, we are going to skip a generation, which
> +	 * would lead to a full TLB flush. This should therefore not cause
> +	 * correctness issues, and should not induce overheads, since anyhow in
> +	 * TLB storms it is better to perform full TLB flush.
> +	 */
> +	if (mm_gen != tlb->defer_gen) {
> +		VM_BUG_ON(mm_gen < tlb->defer_gen);
> +
> +		tlb->defer_gen = inc_mm_tlb_gen(mm);
> +	}
> +}

Andy’s comments managed to make me realize this code is wrong. We must
call inc_mm_tlb_gen(mm) every time.

Otherwise, a CPU may have seen the old tlb_gen and recorded it in its local
cpu_tlbstate on a context-switch. If the process was not running when the
TLB flush was issued, no IPI will be sent to the CPU. Therefore, a later
switch_mm_irqs_off() back to the process will not flush the local TLB.

I need to think if there is a better solution. Multiple calls to
inc_mm_tlb_gen() during deferred flushes would trigger a full TLB flush
instead of one that is specific to the ranges, once the flush actually takes
place. On x86 it’s practically a non-issue, since anyhow any update of more
than 33 entries or so would cause a full TLB flush, but this is still ugly.
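
Why a skipped generation turns a ranged flush into a full one can be modeled
roughly as follows. This is a simplified userspace sketch of how the x86
flush path compares generations, as far as I understand it; the enum, struct
and function names are hypothetical, and the real code tracks more state:

```c
#include <stdbool.h>
#include <stdint.h>

enum flush_kind { FLUSH_NONE, FLUSH_PARTIAL, FLUSH_FULL };

/* Hypothetical, simplified stand-in for struct flush_tlb_info. */
struct model_flush_info {
	uint64_t new_tlb_gen;	/* generation this request brings the CPU to */
	bool full;		/* request already covers the whole address space */
};

/*
 * local_gen: generation this CPU's TLB is known to be up to date with.
 * mm_gen:    current mm->tlb_gen.
 */
static enum flush_kind model_flush_decision(uint64_t local_gen, uint64_t mm_gen,
					    const struct model_flush_info *f)
{
	/* Already caught up: nothing to do. */
	if (local_gen == mm_gen)
		return FLUSH_NONE;

	/*
	 * A ranged flush is only safe if this request is exactly the next
	 * generation and also the latest one; otherwise ranges belonging to
	 * other generations would be missed.
	 */
	if (!f->full && f->new_tlb_gen == local_gen + 1 &&
	    f->new_tlb_gen == mm_gen)
		return FLUSH_PARTIAL;

	return FLUSH_FULL;
}
```

With two inc_mm_tlb_gen() calls before the flush is issued, mm_gen is
local_gen + 2, so the second branch cannot match and the model falls back to
FLUSH_FULL.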




* Re: [RFC 13/20] mm/tlb: introduce tlb_start_ptes() and tlb_end_ptes()
  2021-02-01 13:19   ` Peter Zijlstra
@ 2021-02-01 23:00     ` Nadav Amit
  0 siblings, 0 replies; 67+ messages in thread
From: Nadav Amit @ 2021-02-01 23:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linux-MM, linux-kernel, Andrea Arcangeli, Andrew Morton,
	Andy Lutomirski, Dave Hansen, Thomas Gleixner, Will Deacon,
	Yu Zhao, Nick Piggin, x86

> On Feb 1, 2021, at 5:19 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Sat, Jan 30, 2021 at 04:11:25PM -0800, Nadav Amit wrote:
>> +#define tlb_start_ptes(tlb)						\
>> +	do {								\
>> +		struct mmu_gather *_tlb = (tlb);			\
>> +									\
>> +		flush_tlb_batched_pending(_tlb->mm);			\
>> +	} while (0)
>> +
>> +static inline void tlb_end_ptes(struct mmu_gather *tlb) { }
> 
>> 	tlb_change_page_size(tlb, PAGE_SIZE);
>> 	orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
>> -	flush_tlb_batched_pending(mm);
>> +	tlb_start_ptes(tlb);
>> 	arch_enter_lazy_mmu_mode();
>> 	for (; addr < end; pte++, addr += PAGE_SIZE) {
>> 		ptent = *pte;
>> @@ -468,6 +468,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>> 	}
>> 
>> 	arch_leave_lazy_mmu_mode();
>> +	tlb_end_ptes(tlb);
>> 	pte_unmap_unlock(orig_pte, ptl);
>> 	if (pageout)
>> 		reclaim_pages(&page_list);
> 
> I don't like how you're doubling up on arch_*_lazy_mmu_mode(). It seems
> to me those should be folded into tlb_{start,end}_ptes().
> 
> Alternatively, the even-more-work approach would be to add an optional
> @tlb argument to pte_offset_map_lock()/pte_unmap_unlock() and friends.

Not too fond of the “even more work” approach. I still have debts I need to
pay to the kernel community on old patches that didn’t make it through.

I will fold arch_*_lazy_mmu_mode() as you suggested. Admittedly, I do not
understand this arch_*_lazy_mmu_mode() very well - I would have assumed
they would be needed only when PTEs are established, and in other cases
the arch code will hook directly to the TLB flushing interface.

However, based on the code, it seems that powerpc does not even flush PTEs
that are established (only removed/demoted). Probably I am missing
something. I will just blindly fold it.
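
The folding suggested above could look roughly like this. It is a userspace
sketch only: the kernel calls are replaced by entries in a hypothetical event
log so the ordering can be observed, and all model_* names are assumptions
rather than anything in the series:

```c
#include <stddef.h>

/* Hypothetical event log standing in for the real kernel calls. */
enum model_event { EV_FLUSH_BATCHED_PENDING, EV_ENTER_LAZY, EV_LEAVE_LAZY };

struct model_mmu_gather {
	enum model_event log[8];
	size_t nr_events;
};

static void model_record(struct model_mmu_gather *tlb, enum model_event ev)
{
	if (tlb->nr_events < 8)
		tlb->log[tlb->nr_events++] = ev;
}

/* Would wrap flush_tlb_batched_pending() and arch_enter_lazy_mmu_mode(). */
static inline void model_tlb_start_ptes(struct model_mmu_gather *tlb)
{
	model_record(tlb, EV_FLUSH_BATCHED_PENDING);
	model_record(tlb, EV_ENTER_LAZY);
}

/* Would wrap arch_leave_lazy_mmu_mode(). */
static inline void model_tlb_end_ptes(struct model_mmu_gather *tlb)
{
	model_record(tlb, EV_LEAVE_LAZY);
}
```

Callers would then bracket the PTE loop with the two helpers only, keeping
the same order as the open-coded sequence in the quoted hunk.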


* Re: [RFC 15/20] mm: detect deferred TLB flushes in vma granularity
  2021-02-01 22:04   ` Nadav Amit
@ 2021-02-02  0:14     ` Andy Lutomirski
  2021-02-02 20:51       ` Nadav Amit
  0 siblings, 1 reply; 67+ messages in thread
From: Andy Lutomirski @ 2021-02-02  0:14 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Linux-MM, LKML, Andy Lutomirski, Andrea Arcangeli, Andrew Morton,
	Dave Hansen, Peter Zijlstra, Thomas Gleixner, Will Deacon,
	Yu Zhao, X86 ML


> On Feb 1, 2021, at 2:04 PM, Nadav Amit <nadav.amit@gmail.com> wrote:
> 
> 
>> 
>> On Jan 30, 2021, at 4:11 PM, Nadav Amit <nadav.amit@gmail.com> wrote:
>> 
>> From: Nadav Amit <namit@vmware.com>
>> 
>> Currently, deferred TLB flushes are detected in the mm granularity: if
>> there is any deferred TLB flush in the entire address space due to NUMA
>> migration, pte_accessible() in x86 would return true, and
>> ptep_clear_flush() would require a TLB flush. This would happen even if
>> the PTE resides in a completely different vma.
> 
> [ snip ]
> 
>> +static inline void read_defer_tlb_flush_gen(struct mmu_gather *tlb)
>> +{
>> +    struct mm_struct *mm = tlb->mm;
>> +    u64 mm_gen;
>> +
>> +    /*
>> +     * Any change of PTE before calling __track_deferred_tlb_flush() must be
>> +     * performed using RMW atomic operation that provides a memory barriers,
>> +     * such as ptep_modify_prot_start().  The barrier ensure the PTEs are
>> +     * written before the current generation is read, synchronizing
>> +     * (implicitly) with flush_tlb_mm_range().
>> +     */
>> +    smp_mb__after_atomic();
>> +
>> +    mm_gen = atomic64_read(&mm->tlb_gen);
>> +
>> +    /*
>> +     * This condition checks for both first deferred TLB flush and for other
>> +     * TLB pending or executed TLB flushes after the last table that we
>> +     * updated. In the latter case, we are going to skip a generation, which
>> +     * would lead to a full TLB flush. This should therefore not cause
>> +     * correctness issues, and should not induce overheads, since anyhow in
>> +     * TLB storms it is better to perform full TLB flush.
>> +     */
>> +    if (mm_gen != tlb->defer_gen) {
>> +        VM_BUG_ON(mm_gen < tlb->defer_gen);
>> +
>> +        tlb->defer_gen = inc_mm_tlb_gen(mm);
>> +    }
>> +}
> 
> Andy’s comments managed to make me realize this code is wrong. We must
> call inc_mm_tlb_gen(mm) every time.
> 
> Otherwise, a CPU may have seen the old tlb_gen and recorded it in its local
> cpu_tlbstate on a context-switch. If the process was not running when the
> TLB flush was issued, no IPI will be sent to the CPU. Therefore, a later
> switch_mm_irqs_off() back to the process will not flush the local TLB.
> 
> I need to think if there is a better solution. Multiple calls to
> inc_mm_tlb_gen() during deferred flushes would trigger a full TLB flush
> instead of one that is specific to the ranges, once the flush actually takes
> place. On x86 it’s practically a non-issue, since anyhow any update of more
> than 33 entries or so would cause a full TLB flush, but this is still ugly.
> 

What if we had a per-mm ring buffer of flushes?  When starting a flush, we would stick the range in the ring buffer and, when flushing, we would read the ring buffer to catch up.  This would mostly replace the flush_tlb_info struct, and it would let us process multiple partial flushes together.


* Re: [RFC 11/20] mm/tlb: remove arch-specific tlb_start/end_vma()
  2021-02-01 12:09   ` Peter Zijlstra
@ 2021-02-02  6:41     ` Nicholas Piggin
  2021-02-02  7:20       ` Nadav Amit
  0 siblings, 1 reply; 67+ messages in thread
From: Nicholas Piggin @ 2021-02-02  6:41 UTC (permalink / raw)
  To: Nadav Amit, Peter Zijlstra
  Cc: Andrea Arcangeli, Andrew Morton, Dave Hansen, linux-csky,
	linux-kernel, linux-mm, linuxppc-dev, linux-s390,
	Andy Lutomirski, Nadav Amit, Thomas Gleixner, Will Deacon, x86,
	Yu Zhao

Excerpts from Peter Zijlstra's message of February 1, 2021 10:09 pm:
> On Sat, Jan 30, 2021 at 04:11:23PM -0800, Nadav Amit wrote:
> 
>> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
>> index 427bfcc6cdec..b97136b7010b 100644
>> --- a/include/asm-generic/tlb.h
>> +++ b/include/asm-generic/tlb.h
>> @@ -334,8 +334,8 @@ static inline void __tlb_reset_range(struct mmu_gather *tlb)
>>  
>>  #ifdef CONFIG_MMU_GATHER_NO_RANGE
>>  
>> -#if defined(tlb_flush) || defined(tlb_start_vma) || defined(tlb_end_vma)
>> -#error MMU_GATHER_NO_RANGE relies on default tlb_flush(), tlb_start_vma() and tlb_end_vma()
>> +#if defined(tlb_flush)
>> +#error MMU_GATHER_NO_RANGE relies on default tlb_flush()
>>  #endif
>>  
>>  /*
>> @@ -362,10 +362,6 @@ static inline void tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vm
>>  
>>  #ifndef tlb_flush
>>  
>> -#if defined(tlb_start_vma) || defined(tlb_end_vma)
>> -#error Default tlb_flush() relies on default tlb_start_vma() and tlb_end_vma()
>> -#endif
> 
> #ifdef CONFIG_ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING
> #error ....
> #endif
> 
> goes here...
> 
> 
>>  static inline void tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vma)
>>  {
>>  	if (tlb->fullmm)
>>  		return;
>>  
>> +	if (IS_ENABLED(CONFIG_ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING))
>> +		return;
> 
> Also, can you please stick to the CONFIG_MMU_GATHER_* namespace?
> 
> I also don't think AGRESSIVE_FLUSH_BATCHING quite captures what it does.
> How about:
> 
> 	CONFIG_MMU_GATHER_NO_PER_VMA_FLUSH

Yes please, have to have descriptive names.

I didn't quite see why this was much of an improvement though. Maybe 
follow up patches take advantage of it? I didn't see how they all fit 
together.

Thanks,
Nick



* Re: [RFC 00/20] TLB batching consolidation and enhancements
  2021-02-01 12:44     ` Peter Zijlstra
@ 2021-02-02  7:14       ` Nicholas Piggin
  0 siblings, 0 replies; 67+ messages in thread
From: Nicholas Piggin @ 2021-02-02  7:14 UTC (permalink / raw)
  To: Nadav Amit, Peter Zijlstra
  Cc: Andrea Arcangeli, Andrew Morton, Dave Hansen, linux-csky, LKML,
	Linux-MM, linuxppc-dev, linux-s390, Andy Lutomirski, Mel Gorman,
	Thomas Gleixner, Will Deacon, X86 ML, Yu Zhao

Excerpts from Peter Zijlstra's message of February 1, 2021 10:44 pm:
> On Sun, Jan 31, 2021 at 07:57:01AM +0000, Nadav Amit wrote:
>> > On Jan 30, 2021, at 7:30 PM, Nicholas Piggin <npiggin@gmail.com> wrote:
> 
>> > I'll go through the patches a bit more closely when they all come 
>> > through. Sparc and powerpc of course need the arch lazy mode to get 
>> > per-page/pte information for operations that are not freeing pages, 
>> > which is what mmu gather is designed for.
>> 
>> IIUC you mean any PTE change requires a TLB flush. Even setting up a new PTE
>> where no previous PTE was set, right?

In cases of increasing permissiveness of access, yes it may want to 
update the "TLB" (read hash table) to avoid taking hash table faults.

But whatever the reason for the flush, there may have to be more
data carried than just the virtual address range and/or physical
pages.

If you clear out the PTE then you have no guarantee of actually being
able to go back and address the in-memory or in-hardware translation 
structures to update them, depending on what exact scheme is used
(powerpc probably could if all page sizes were the same, but THP or 
64k/4k sub pages would throw a spanner in those works).

> These are the HASH architectures. Their hardware doesn't walk the
> page-tables, but it consults a hash-table to resolve page translations.

Yeah, it's very cool in a masochistic way.

I actually don't know if it's worth doing a big rework of it, as much 
as I'd like to, rather than just keeping it in place and eventually
dismantling some of the go-fast hooks from core code if we can one day
deprecate it in favour of the much easier radix mode.

The whole thing is like a big steam train, years ago Paul and Ben and 
Anton and co got the boiler stoked up and set all the valves just right 
so it runs unbelievably well for what it's actually doing but look at it
the wrong way and the whole thing could blow up. (at least that's what 
it feels like to me probably because I don't know the code that well).

Sparc could probably do the same, not sure about Xen. I don't suppose
vmware is intending to add any kind of paravirt mode related to this stuff?

Thanks,
Nick



* Re: [RFC 11/20] mm/tlb: remove arch-specific tlb_start/end_vma()
  2021-02-02  6:41     ` Nicholas Piggin
@ 2021-02-02  7:20       ` Nadav Amit
  2021-02-02  9:31         ` Peter Zijlstra
  0 siblings, 1 reply; 67+ messages in thread
From: Nadav Amit @ 2021-02-02  7:20 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Peter Zijlstra, Andrea Arcangeli, Andrew Morton, Dave Hansen,
	linux-csky, LKML, Linux-MM, linuxppc-dev, linux-s390,
	Andy Lutomirski, Thomas Gleixner, Will Deacon, X86 ML, Yu Zhao

> On Feb 1, 2021, at 10:41 PM, Nicholas Piggin <npiggin@gmail.com> wrote:
> 
> Excerpts from Peter Zijlstra's message of February 1, 2021 10:09 pm:
>> I also don't think AGRESSIVE_FLUSH_BATCHING quite captures what it does.
>> How about:
>> 
>> 	CONFIG_MMU_GATHER_NO_PER_VMA_FLUSH
> 
> Yes please, have to have descriptive names.

Point taken. I will fix it.

> 
> I didn't quite see why this was much of an improvement though. Maybe 
> follow up patches take advantage of it? I didn't see how they all fit 
> together.

They do, but I realized, as I said in other emails, that I have a serious bug
in the deferred invalidation scheme.

Having said that, I think there is an advantage of having an explicit config
option instead of relying on whether tlb_end_vma is defined. For instance,
Arm does not define tlb_end_vma, and consequently it flushes the TLB after
each VMA. I suspect it is not intentional.




* Re: [RFC 11/20] mm/tlb: remove arch-specific tlb_start/end_vma()
  2021-02-02  7:20       ` Nadav Amit
@ 2021-02-02  9:31         ` Peter Zijlstra
  2021-02-02  9:54           ` Nadav Amit
  0 siblings, 1 reply; 67+ messages in thread
From: Peter Zijlstra @ 2021-02-02  9:31 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Nicholas Piggin, Andrea Arcangeli, Andrew Morton, Dave Hansen,
	linux-csky, LKML, Linux-MM, linuxppc-dev, linux-s390,
	Andy Lutomirski, Thomas Gleixner, Will Deacon, X86 ML, Yu Zhao

On Tue, Feb 02, 2021 at 07:20:55AM +0000, Nadav Amit wrote:
> Arm does not define tlb_end_vma, and consequently it flushes the TLB after
> each VMA. I suspect it is not intentional.

ARM is one of those that look at the VM_EXEC bit to explicitly flush
ITLB IIRC, so it has to.



* Re: [RFC 01/20] mm/tlb: fix fullmm semantics
  2021-02-01 11:36   ` Peter Zijlstra
@ 2021-02-02  9:32     ` Nadav Amit
  2021-02-02 11:00       ` Peter Zijlstra
  0 siblings, 1 reply; 67+ messages in thread
From: Nadav Amit @ 2021-02-02  9:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linux-MM, linux-kernel, Andrea Arcangeli, Andrew Morton,
	Andy Lutomirski, Dave Hansen, Thomas Gleixner, Will Deacon,
	Yu Zhao, Nick Piggin, x86

> On Feb 1, 2021, at 3:36 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> 
> https://lkml.kernel.org/r/20210127235347.1402-1-will@kernel.org

I have seen this series, and applied my patches on it.

Despite Will’s patches, there were still inconsistencies between fullmm
and need_flush_all.

Am I missing something?


* Re: [RFC 11/20] mm/tlb: remove arch-specific tlb_start/end_vma()
  2021-02-02  9:31         ` Peter Zijlstra
@ 2021-02-02  9:54           ` Nadav Amit
  2021-02-02 11:04             ` Peter Zijlstra
  0 siblings, 1 reply; 67+ messages in thread
From: Nadav Amit @ 2021-02-02  9:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nicholas Piggin, Andrea Arcangeli, Andrew Morton, Dave Hansen,
	linux-csky, LKML, Linux-MM, linuxppc-dev, linux-s390,
	Andy Lutomirski, Thomas Gleixner, Will Deacon, X86 ML, Yu Zhao

> On Feb 2, 2021, at 1:31 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Tue, Feb 02, 2021 at 07:20:55AM +0000, Nadav Amit wrote:
>> Arm does not define tlb_end_vma, and consequently it flushes the TLB after
>> each VMA. I suspect it is not intentional.
> 
> ARM is one of those that look at the VM_EXEC bit to explicitly flush
> ITLB IIRC, so it has to.

Hmm… I don’t think Arm is doing that. At least arm64 does not use the
default tlb_flush(), and it does not seem to consider VM_EXEC (at least in
this path):

static inline void tlb_flush(struct mmu_gather *tlb)
{
        struct vm_area_struct vma = TLB_FLUSH_VMA(tlb->mm, 0);
        bool last_level = !tlb->freed_tables;
        unsigned long stride = tlb_get_unmap_size(tlb);
        int tlb_level = tlb_get_level(tlb);
        
        /*
         * If we're tearing down the address space then we only care about
         * invalidating the walk-cache, since the ASID allocator won't
         * reallocate our ASID without invalidating the entire TLB.
         */
        if (tlb->mm_exiting) {
                if (!last_level)
                        flush_tlb_mm(tlb->mm);
                return;
        }       
        
        __flush_tlb_range(&vma, tlb->start, tlb->end, stride,
                          last_level, tlb_level);
}


* Re: [RFC 01/20] mm/tlb: fix fullmm semantics
  2021-02-02  9:32     ` Nadav Amit
@ 2021-02-02 11:00       ` Peter Zijlstra
  2021-02-02 21:35         ` Nadav Amit
  0 siblings, 1 reply; 67+ messages in thread
From: Peter Zijlstra @ 2021-02-02 11:00 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Linux-MM, linux-kernel, Andrea Arcangeli, Andrew Morton,
	Andy Lutomirski, Dave Hansen, Thomas Gleixner, Will Deacon,
	Yu Zhao, Nick Piggin, x86

On Tue, Feb 02, 2021 at 01:32:36AM -0800, Nadav Amit wrote:
> > On Feb 1, 2021, at 3:36 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > 
> > https://lkml.kernel.org/r/20210127235347.1402-1-will@kernel.org
> 
> I have seen this series, and applied my patches on it.
> 
> Despite Will’s patches, there were still inconsistencies between fullmm
> and need_flush_all.
> 
> Am I missing something?

I wasn't aware you were on top. I'll look again.



* Re: [RFC 11/20] mm/tlb: remove arch-specific tlb_start/end_vma()
  2021-02-02  9:54           ` Nadav Amit
@ 2021-02-02 11:04             ` Peter Zijlstra
  0 siblings, 0 replies; 67+ messages in thread
From: Peter Zijlstra @ 2021-02-02 11:04 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Nicholas Piggin, Andrea Arcangeli, Andrew Morton, Dave Hansen,
	linux-csky, LKML, Linux-MM, linuxppc-dev, linux-s390,
	Andy Lutomirski, Thomas Gleixner, Will Deacon, X86 ML, Yu Zhao

On Tue, Feb 02, 2021 at 09:54:36AM +0000, Nadav Amit wrote:
> > On Feb 2, 2021, at 1:31 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > On Tue, Feb 02, 2021 at 07:20:55AM +0000, Nadav Amit wrote:
> >> Arm does not define tlb_end_vma, and consequently it flushes the TLB after
> >> each VMA. I suspect it is not intentional.
> > 
> > ARM is one of those that look at the VM_EXEC bit to explicitly flush
> > ITLB IIRC, so it has to.
> 
> Hmm… I don’t think Arm is doing that. At least arm64 does not use the
> default tlb_flush(), and it does not seem to consider VM_EXEC (at least in
> this path):
> 

ARM != ARM64. ARM certainly does, but you're right, I don't think ARM64
does this.





* Re: [RFC 15/20] mm: detect deferred TLB flushes in vma granularity
  2021-02-02  0:14     ` Andy Lutomirski
@ 2021-02-02 20:51       ` Nadav Amit
  2021-02-04  4:35         ` Andy Lutomirski
  0 siblings, 1 reply; 67+ messages in thread
From: Nadav Amit @ 2021-02-02 20:51 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linux-MM, LKML, Andy Lutomirski, Andrea Arcangeli, Andrew Morton,
	Dave Hansen, Peter Zijlstra, Thomas Gleixner, Will Deacon,
	Yu Zhao, X86 ML

> On Feb 1, 2021, at 4:14 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> 
> 
>> On Feb 1, 2021, at 2:04 PM, Nadav Amit <nadav.amit@gmail.com> wrote:
>> 
>> Andy’s comments managed to make me realize this code is wrong. We must
>> call inc_mm_tlb_gen(mm) every time.
>> 
>> Otherwise, a CPU may have seen the old tlb_gen and recorded it in its local
>> cpu_tlbstate on a context-switch. If the process was not running when the
>> TLB flush was issued, no IPI will be sent to the CPU. Therefore, a later
>> switch_mm_irqs_off() back to the process will not flush the local TLB.
>> 
>> I need to think if there is a better solution. Multiple calls to
>> inc_mm_tlb_gen() during deferred flushes would trigger a full TLB flush
>> instead of one that is specific to the ranges, once the flush actually takes
>> place. On x86 it’s practically a non-issue, since anyhow any update of more
>> than 33 entries or so would cause a full TLB flush, but this is still ugly.
> 
> What if we had a per-mm ring buffer of flushes?  When starting a flush, we would stick the range in the ring buffer and, when flushing, we would read the ring buffer to catch up.  This would mostly replace the flush_tlb_info struct, and it would let us process multiple partial flushes together.

I wanted to sleep on it, and went back and forth on whether it is the right
direction, hence the late response.

I think that what you say makes sense. I think that I even once tried to do
something similar for some reason, but my memory plays tricks on me.

So tell me what you think of this ring-based solution. As you said, you keep
a per-mm ring of flush_tlb_info. When you queue an entry, you do something
like:

#define RING_ENTRY_INVALID (0)

  gen = inc_mm_tlb_gen(mm);
  struct flush_tlb_info *info = &mm->ring[gen % RING_SIZE];
  spin_lock(&mm->ring_lock);
  WRITE_ONCE(info->new_tlb_gen, RING_ENTRY_INVALID);
  smp_wmb();
  info->start = start;
  info->end = end;
  info->stride_shift = stride_shift;
  info->freed_tables = freed_tables;
  smp_store_release(&info->new_tlb_gen, gen);
  spin_unlock(&mm->ring_lock);
  
When you flush you use the entry generation as a sequence lock. On overflow
of the ring (i.e., sequence number mismatch) you perform a full flush:

  for (gen = mm->tlb_gen_completed; gen < mm->tlb_gen; gen++) {
	struct flush_tlb_info *info = &mm->ring[gen % RING_SIZE];

	// detect overflow and invalid entries
	if (smp_load_acquire(&info->new_tlb_gen) != gen)
		goto full_flush;

	start = min(start, info->start);
	end = max(end, info->end);
	stride_shift = min(stride_shift, info->stride_shift);
	freed_tables |= info->freed_tables;
	smp_rmb();

	// seqlock-like check that the information was not updated 
	if (READ_ONCE(info->new_tlb_gen) != gen)
		goto full_flush;
  }

On x86 I suspect that performing a full TLB flush would anyhow be the best
thing to do if there is more than a single entry. I am also not sure that it
makes sense to check the ring from flush_tlb_func_common() (i.e., in each
IPI handler) as it might cause cache thrashing.

Instead it may be better to do so from flush_tlb_mm_range(), when the
flushes are initiated, and use an aggregated flush_tlb_info for the flush.

It may also be better to have the ring arch-independent, so it would
more closely resemble mmu_gather (the parts about the TLB flush information,
without the freed pages stuff).

We can detect deferred TLB flushes either by storing “deferred_gen” in the
page-tables/VMA (as I did) or by going over the ring, from tlb_gen_completed
to tlb_gen, and checking for an overlap. I think page-tables would be most
efficient/scalable, but perhaps going over the ring would make the logic
easier to understand.

Makes sense? Thoughts?
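
The indexing and overflow detection of the ring described above can be tried
out with a single-threaded userspace model. It is a sketch under stated
assumptions: the lock, barriers and seqlock-style recheck are deliberately
left out, MODEL_RING_SIZE is an arbitrary value, the loop bounds
(tlb_gen_completed + 1 up to tlb_gen) are my reading of the intent, and all
model_* names are hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_RING_SIZE 4	/* hypothetical; the thread does not pick a size */
#define MODEL_ENTRY_INVALID 0	/* generations start from 1, so 0 is free */

struct model_flush_entry {
	uint64_t gen;		/* generation that owns this slot */
	unsigned long start, end;
};

struct model_mm {
	uint64_t tlb_gen;		/* last generation handed out */
	uint64_t tlb_gen_completed;	/* last generation known to be flushed */
	struct model_flush_entry ring[MODEL_RING_SIZE];
};

/* Queue a deferred range; returns the generation assigned to it. */
static uint64_t model_queue_flush(struct model_mm *mm, unsigned long start,
				  unsigned long end)
{
	uint64_t gen = ++mm->tlb_gen;
	struct model_flush_entry *e = &mm->ring[gen % MODEL_RING_SIZE];

	e->gen = MODEL_ENTRY_INVALID;	/* slot is being rewritten */
	e->start = start;
	e->end = end;
	e->gen = gen;			/* publish the entry */
	return gen;
}

/*
 * Merge all pending entries into [*start, *end). Returns false when a slot
 * no longer holds the expected generation (ring overflow), in which case the
 * caller has to fall back to a full flush.
 */
static bool model_collect_flushes(const struct model_mm *mm,
				  unsigned long *start, unsigned long *end)
{
	unsigned long s = ~0UL, e = 0;
	uint64_t gen;

	for (gen = mm->tlb_gen_completed + 1; gen <= mm->tlb_gen; gen++) {
		const struct model_flush_entry *ent =
			&mm->ring[gen % MODEL_RING_SIZE];

		if (ent->gen != gen)
			return false;
		if (ent->start < s)
			s = ent->start;
		if (ent->end > e)
			e = ent->end;
	}

	*start = s;
	*end = e;
	return true;
}
```

Because generations start from 1, a zero-initialized slot never matches a
real generation, which is what lets a single comparison catch both unused
and overwritten entries.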


* Re: [RFC 01/20] mm/tlb: fix fullmm semantics
  2021-02-02 11:00       ` Peter Zijlstra
@ 2021-02-02 21:35         ` Nadav Amit
  2021-02-03  9:44           ` Will Deacon
  0 siblings, 1 reply; 67+ messages in thread
From: Nadav Amit @ 2021-02-02 21:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Arcangeli, Andrew Morton, Andy Lutomirski, Dave Hansen,
	Thomas Gleixner, Will Deacon, Yu Zhao, Nick Piggin, X86 ML,
	Linux-MM, LKML

> On Feb 2, 2021, at 3:00 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Tue, Feb 02, 2021 at 01:32:36AM -0800, Nadav Amit wrote:
>>> On Feb 1, 2021, at 3:36 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>> 
>>> 
>>> https://lkml.kernel.org/r/20210127235347.1402-1-will@kernel.org
>> 
>> I have seen this series, and applied my patches on it.
>> 
>> Despite Will’s patches, there were still inconsistencies between fullmm
>> and need_flush_all.
>> 
>> Am I missing something?
> 
> I wasn't aware you were on top. I'll look again.

Looking at arm64’s tlb_flush() makes me think that there is currently a bug
that this patch fixes. Arm64’s tlb_flush() does:

       /*
        * If we're tearing down the address space then we only care about
        * invalidating the walk-cache, since the ASID allocator won't
        * reallocate our ASID without invalidating the entire TLB.
        */
       if (tlb->fullmm) {
               if (!last_level)
                       flush_tlb_mm(tlb->mm);
               return;
       } 

But currently tlb_finish_mmu() can mistakenly set fullmm (if
mm_tlb_flush_nested() is true), which might skip the TLB flush.

Lucky for us, arm64 flushes each VMA separately (which, as we discussed
separately, may not be necessary), so the only PTEs that might not be flushed
are PTEs that are updated concurrently by another thread that also defers
its flushes. It therefore seems that the implications are more on the
correctness of certain syscalls (e.g., madvise(DONT_NEED)) without
implications on security or memory corruptions.

Let me know if you want me to send this patch separately with an updated
commit log for faster inclusion.


* Re: [RFC 01/20] mm/tlb: fix fullmm semantics
  2021-02-02 21:35         ` Nadav Amit
@ 2021-02-03  9:44           ` Will Deacon
  2021-02-04  3:20             ` Nadav Amit
  0 siblings, 1 reply; 67+ messages in thread
From: Will Deacon @ 2021-02-03  9:44 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Peter Zijlstra, Andrea Arcangeli, Andrew Morton, Andy Lutomirski,
	Dave Hansen, Thomas Gleixner, Yu Zhao, Nick Piggin, X86 ML,
	Linux-MM, LKML

On Tue, Feb 02, 2021 at 01:35:38PM -0800, Nadav Amit wrote:
> > On Feb 2, 2021, at 3:00 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > On Tue, Feb 02, 2021 at 01:32:36AM -0800, Nadav Amit wrote:
> >>> On Feb 1, 2021, at 3:36 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> >>> 
> >>> 
> >>> https://lkml.kernel.org/r/20210127235347.1402-1-will@kernel.org
> >> 
> >> I have seen this series, and applied my patches on it.
> >> 
> >> Despite Will’s patches, there were still inconsistencies between fullmm
> >> and need_flush_all.
> >> 
> >> Am I missing something?
> > 
> > I wasn't aware you were on top. I'll look again.
> 
> Looking on arm64’s tlb_flush() makes me think that there is currently a bug
> that this patch fixes. Arm64’s tlb_flush() does:
> 
>        /*
>         * If we're tearing down the address space then we only care about
>         * invalidating the walk-cache, since the ASID allocator won't
>         * reallocate our ASID without invalidating the entire TLB.
>         */
>        if (tlb->fullmm) {
>                if (!last_level)
>                        flush_tlb_mm(tlb->mm);
>                return;
>        } 
> 
> But currently tlb_mmu_finish() can mistakenly set fullmm incorrectly (if
> mm_tlb_flush_nested() is true), which might skip the TLB flush.

But in that case isn't 'freed_tables' set to 1, so 'last_level' will be
false and we'll do the flush in the code above?

Will



* Re: [RFC 01/20] mm/tlb: fix fullmm semantics
  2021-02-03  9:44           ` Will Deacon
@ 2021-02-04  3:20             ` Nadav Amit
  0 siblings, 0 replies; 67+ messages in thread
From: Nadav Amit @ 2021-02-04  3:20 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, Andrea Arcangeli, Andrew Morton, Andy Lutomirski,
	Dave Hansen, Thomas Gleixner, Yu Zhao, Nick Piggin, X86 ML,
	Linux-MM, LKML

> On Feb 3, 2021, at 1:44 AM, Will Deacon <will@kernel.org> wrote:
> 
> On Tue, Feb 02, 2021 at 01:35:38PM -0800, Nadav Amit wrote:
>>> On Feb 2, 2021, at 3:00 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>> 
>>> On Tue, Feb 02, 2021 at 01:32:36AM -0800, Nadav Amit wrote:
>>>>> On Feb 1, 2021, at 3:36 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>>>> 
>>>>> 
>>>>> https://lkml.kernel.org/r/20210127235347.1402-1-will@kernel.org
>>>> 
>>>> I have seen this series, and applied my patches on it.
>>>> 
>>>> Despite Will’s patches, there were still inconsistencies between fullmm
>>>> and need_flush_all.
>>>> 
>>>> Am I missing something?
>>> 
>>> I wasn't aware you were on top. I'll look again.
>> 
>> Looking on arm64’s tlb_flush() makes me think that there is currently a bug
>> that this patch fixes. Arm64’s tlb_flush() does:
>> 
>>       /*
>>        * If we're tearing down the address space then we only care about
>>        * invalidating the walk-cache, since the ASID allocator won't
>>        * reallocate our ASID without invalidating the entire TLB.
>>        */
>>       if (tlb->fullmm) {
>>               if (!last_level)
>>                       flush_tlb_mm(tlb->mm);
>>               return;
>>       } 
>> 
>> But currently tlb_mmu_finish() can mistakenly set fullmm incorrectly (if
>> mm_tlb_flush_nested() is true), which might skip the TLB flush.
> 
> But in that case isn't 'freed_tables' set to 1, so 'last_level' will be
> false and we'll do the flush in the code above?

Indeed. You are right. So no rush.
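
For reference, the branch under discussion can be modelled in userspace as
below. This is only a sketch: struct gather_model, model_tlb_flush() and
model_nested_fixup() are made-up names, last_level is assumed to be
!freed_tables, and the nested-flush path is assumed to set both fullmm and
freed_tables, as described above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the kernel's struct mmu_gather. */
struct gather_model {
	bool fullmm;
	bool freed_tables;
};

enum flush_action { FLUSH_NONE, FLUSH_MM, FLUSH_RANGE };

/* Mirrors the quoted branch; last_level is assumed to be !freed_tables. */
static enum flush_action model_tlb_flush(const struct gather_model *tlb)
{
	bool last_level = !tlb->freed_tables;

	if (tlb->fullmm)
		return last_level ? FLUSH_NONE : FLUSH_MM;

	return FLUSH_RANGE;	/* the regular ranged flush */
}

/* Assumed effect of the mm_tlb_flush_nested() path, per the thread. */
static void model_nested_fixup(struct gather_model *tlb)
{
	tlb->fullmm = true;
	tlb->freed_tables = true;
}
```

In this model, a gather that went through model_nested_fixup() always ends
up in the flush_tlb_mm() case; the flush is only skipped when fullmm is set
while freed_tables is clear.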


* Re: [RFC 15/20] mm: detect deferred TLB flushes in vma granularity
  2021-02-02 20:51       ` Nadav Amit
@ 2021-02-04  4:35         ` Andy Lutomirski
  0 siblings, 0 replies; 67+ messages in thread
From: Andy Lutomirski @ 2021-02-04  4:35 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Linux-MM, LKML, Andy Lutomirski, Andrea Arcangeli, Andrew Morton,
	Dave Hansen, Peter Zijlstra, Thomas Gleixner, Will Deacon,
	Yu Zhao, X86 ML



> On Feb 2, 2021, at 12:52 PM, Nadav Amit <nadav.amit@gmail.com> wrote:
> 
> 
>> 
>>> On Feb 1, 2021, at 4:14 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>> 
>>> 
>>>> On Feb 1, 2021, at 2:04 PM, Nadav Amit <nadav.amit@gmail.com> wrote:
>>> 
>>> Andy’s comments managed to make me realize this code is wrong. We must
>>> call inc_mm_tlb_gen(mm) every time.
>>> 
>>> Otherwise, a CPU that saw the old tlb_gen and updated it in its local
>>> cpu_tlbstate on a context-switch. If the process was not running when the
>>> TLB flush was issued, no IPI will be sent to the CPU. Therefore, later
>>> switch_mm_irqs_off() back to the process will not flush the local TLB.
>>> 
>>> I need to think if there is a better solution. Multiple calls to
>>> inc_mm_tlb_gen() during deferred flushes would trigger a full TLB flush
>>> instead of one that is specific to the ranges, once the flush actually takes
>>> place. On x86 it’s practically a non-issue, since anyhow any update of more
>>> than 33-entries or so would cause a full TLB flush, but this is still ugly.
>> 
>> What if we had a per-mm ring buffer of flushes?  When starting a flush, we would stick the range in the ring buffer and, when flushing, we would read the ring buffer to catch up.  This would mostly replace the flush_tlb_info struct, and it would let us process multiple partial flushes together.
> 
> I wanted to sleep on it, and went back and forth on whether it is the right
> direction, hence the late response.
> 
> I think that what you say make sense. I think that I even tried to do once
> something similar for some reason, but my memory plays tricks on me.
> 
> So tell me what you think on this ring-based solution. As you said, you keep
> per-mm ring of flush_tlb_info. When you queue an entry, you do something
> like:
> 
> #define RING_ENTRY_INVALID (0)
> 
>  gen = inc_mm_tlb_gen(mm);
>  struct flush_tlb_info *info = &mm->ring[gen % RING_SIZE];
>  spin_lock(&mm->ring_lock);

Once you are holding the lock, you should presumably check that the ring
didn’t overflow while you were getting the lock: if info->new_tlb_gen > gen,
abort.

>  WRITE_ONCE(info->new_tlb_gen, RING_ENTRY_INVALID);
>  smp_wmb();
>  info->start = start;
>  info->end = end;
>  info->stride_shift = stride_shift;
>  info->freed_tables = freed_tables;
>  smp_store_release(&info->new_tlb_gen, gen);
>  spin_unlock(&mm->ring_lock);
> 

Seems reasonable.  I’m curious how this ends up getting used.
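
To make the overflow check concrete, here is a userspace sketch of the
queueing side using C11 atomics. It is not kernel code: struct mm_model,
struct flush_entry, model_lock()/model_unlock() and ring_queue_flush() are
hypothetical names, RING_SIZE is arbitrary, tlb_gen is assumed to start at 0
so that generation 0 can mean "invalid", and the fence is only a rough
analogue of smp_wmb().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 8u			/* hypothetical ring size */
#define RING_ENTRY_INVALID 0u

/* Userspace stand-in for flush_tlb_info; field names follow the thread. */
struct flush_entry {
	_Atomic uint64_t new_tlb_gen;
	unsigned long start, end;
	unsigned int stride_shift;
	bool freed_tables;
};

/* Hypothetical stand-in for the per-mm state. */
struct mm_model {
	_Atomic uint64_t tlb_gen;
	atomic_flag ring_lock;
	struct flush_entry ring[RING_SIZE];
};

static void mm_model_init(struct mm_model *mm)
{
	/* The first queued generation is 1, so 0 can mean "invalid". */
	atomic_init(&mm->tlb_gen, 0);
	atomic_flag_clear(&mm->ring_lock);
	for (unsigned int i = 0; i < RING_SIZE; i++)
		atomic_init(&mm->ring[i].new_tlb_gen, RING_ENTRY_INVALID);
}

static void model_lock(struct mm_model *mm)
{
	while (atomic_flag_test_and_set_explicit(&mm->ring_lock,
						 memory_order_acquire))
		;	/* spin */
}

static void model_unlock(struct mm_model *mm)
{
	atomic_flag_clear_explicit(&mm->ring_lock, memory_order_release);
}

/* Returns false if the slot already belongs to a newer generation. */
static bool ring_queue_flush(struct mm_model *mm, unsigned long start,
			     unsigned long end, unsigned int stride_shift,
			     bool freed_tables)
{
	/* inc_mm_tlb_gen() analogue: returns the new generation. */
	uint64_t gen = atomic_fetch_add(&mm->tlb_gen, 1) + 1;
	struct flush_entry *info = &mm->ring[gen % RING_SIZE];
	bool queued = false;

	assert(start <= end);
	model_lock(mm);
	/* The ring wrapped while we waited for the lock if the slot is newer. */
	if (atomic_load_explicit(&info->new_tlb_gen,
				 memory_order_relaxed) <= gen) {
		atomic_store_explicit(&info->new_tlb_gen, RING_ENTRY_INVALID,
				      memory_order_relaxed);
		atomic_thread_fence(memory_order_release);
		/* Plain stores; real concurrent readers would need more care. */
		info->start = start;
		info->end = end;
		info->stride_shift = stride_shift;
		info->freed_tables = freed_tables;
		atomic_store_explicit(&info->new_tlb_gen, gen,
				      memory_order_release);
		queued = true;
	}
	model_unlock(mm);
	return queued;
}
```

When ring_queue_flush() returns false nothing is written, so a reader looking
for that generation sees a mismatching entry and has to fall back to a full
flush.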

> When you flush you use the entry generation as a sequence lock. On overflow
> of the ring (i.e., sequence number mismatch) you perform a full flush:
> 
>  for (gen = mm->tlb_gen_completed; gen < mm->tlb_gen; gen++) {
>    struct flush_tlb_info *info = &mm->ring[gen % RING_SIZE];
> 
>    // detect overflow and invalid entries
>    if (smp_load_acquire(&info->new_tlb_gen) != gen)
>        goto full_flush;
> 
>    start = min(start, info->start);
>    end = max(end, info->end);
>    stride_shift = min(stride_shift, info->stride_shift);
>    freed_tables |= info->freed_tables;
>    smp_rmb();
> 
>    // seqlock-like check that the information was not updated 
>    if (READ_ONCE(info->new_tlb_gen) != gen)
>        goto full_flush;
>  }
> 
> On x86 I suspect that performing a full TLB flush would anyhow be the best
> thing to do if there is more than a single entry. I am also not sure that it
> makes sense to check the ring from flush_tlb_func_common() (i.e., in each
> IPI handler) as it might cause cache thrashing.
> 
> Instead it may be better to do so from flush_tlb_mm_range(), when the
> flushes are initiated, and use an aggregated flush_tlb_info for the flush.
> 
> It may also be better to have the ring arch-independent, so it would
> resemble more of mmu_gather (the parts about the TLB flush information,
> without the freed pages stuff).
> 
> We can detect deferred TLB flushes either by storing “deferred_gen” in the
> page-tables/VMA (as I did) or by going over the ring, from tlb_gen_completed
> to tlb_gen, and checking for an overlap. I think page-tables would be most
> efficient/scalable, but perhaps going over the ring would be easier to
> understand logic.
> 
> Makes sense? Thoughts?
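
A userspace sketch of the flush-side walk quoted above, for illustration
only. struct flush_summary and ring_collect() are hypothetical names, and
the loop covers generations (completed, latest], which is an assumption
about the intended meaning of tlb_gen_completed and differs slightly from
the bounds in the quoted pseudo-code. The acquire fence is a rough analogue
of smp_rmb().

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 8u			/* hypothetical ring size */

/* Userspace stand-in for flush_tlb_info; field names follow the thread. */
struct flush_entry {
	_Atomic uint64_t new_tlb_gen;
	unsigned long start, end;
	unsigned int stride_shift;
	bool freed_tables;
};

/* Hypothetical aggregate of all pending ring entries. */
struct flush_summary {
	bool full_flush;
	unsigned long start, end;
	unsigned int stride_shift;
	bool freed_tables;
};

static struct flush_summary ring_collect(struct flush_entry *ring,
					 uint64_t completed, uint64_t latest)
{
	struct flush_summary s = {
		.full_flush = false,
		.start = ULONG_MAX,	/* empty range until an entry is merged */
		.end = 0,
		.stride_shift = UINT_MAX,
		.freed_tables = false,
	};

	assert(completed <= latest);
	for (uint64_t gen = completed + 1; gen <= latest; gen++) {
		struct flush_entry *info = &ring[gen % RING_SIZE];

		/* Detect overflow and invalid entries. */
		if (atomic_load_explicit(&info->new_tlb_gen,
					 memory_order_acquire) != gen)
			goto full_flush;

		if (info->start < s.start)
			s.start = info->start;
		if (info->end > s.end)
			s.end = info->end;
		if (info->stride_shift < s.stride_shift)
			s.stride_shift = info->stride_shift;
		s.freed_tables |= info->freed_tables;
		atomic_thread_fence(memory_order_acquire);

		/* Seqlock-like check that the entry was not reused meanwhile. */
		if (atomic_load_explicit(&info->new_tlb_gen,
					 memory_order_relaxed) != gen)
			goto full_flush;
	}
	return s;

full_flush:
	s.full_flush = true;
	return s;
}
```

Called from the place where flushes are initiated, the returned summary
would play the role of the aggregated flush_tlb_info mentioned above; when
full_flush is set the range fields should be ignored.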



* Re: [RFC 20/20] mm/rmap: avoid potential races
  2021-01-31  0:11 ` [RFC 20/20] mm/rmap: avoid potential races Nadav Amit
@ 2021-08-23  8:05   ` Huang, Ying
  2021-08-23 15:50     ` Nadav Amit
  0 siblings, 1 reply; 67+ messages in thread
From: Huang, Ying @ 2021-08-23  8:05 UTC (permalink / raw)
  To: Nadav Amit
  Cc: linux-mm, linux-kernel, Nadav Amit, Mel Gorman, Andrea Arcangeli,
	Andrew Morton, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Gleixner, Will Deacon, Yu Zhao, x86

Hi, Nadav,

Nadav Amit <nadav.amit@gmail.com> writes:

> From: Nadav Amit <namit@vmware.com>
>
> flush_tlb_batched_pending() appears to have a theoretical race:
> tlb_flush_batched is being cleared after the TLB flush, and if in
> between another core calls set_tlb_ubc_flush_pending() and sets the
> pending TLB flush indication, this indication might be lost. Holding the
> page-table lock when SPLIT_LOCK is set cannot eliminate this race.

Recently, when I read the corresponding code, I found the exact same race
too.  Do you still think the race is possible, at least in theory?  If
so, why hasn't your fix been merged?

> The current batched TLB invalidation scheme therefore does not seem
> viable or easily repairable.

I have an idea for fixing this without too much code.  If necessary, I
will send it out.

Best Regards,
Huang, Ying



* Re: [RFC 20/20] mm/rmap: avoid potential races
  2021-08-23  8:05   ` Huang, Ying
@ 2021-08-23 15:50     ` Nadav Amit
  2021-08-24  0:36       ` Huang, Ying
  0 siblings, 1 reply; 67+ messages in thread
From: Nadav Amit @ 2021-08-23 15:50 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Linux-MM, Linux Kernel Mailing List, Mel Gorman,
	Andrea Arcangeli, Andrew Morton, Andy Lutomirski, Dave Hansen,
	Peter Zijlstra, Thomas Gleixner, Will Deacon, Yu Zhao, x86



> On Aug 23, 2021, at 1:05 AM, Huang, Ying <ying.huang@intel.com> wrote:
> 
> Hi, Nadav,
> 
> Nadav Amit <nadav.amit@gmail.com> writes:
> 
>> From: Nadav Amit <namit@vmware.com>
>> 
>> flush_tlb_batched_pending() appears to have a theoretical race:
>> tlb_flush_batched is being cleared after the TLB flush, and if in
>> between another core calls set_tlb_ubc_flush_pending() and sets the
>> pending TLB flush indication, this indication might be lost. Holding the
>> page-table lock when SPLIT_LOCK is set cannot eliminate this race.
> 
> Recently, when I read the corresponding code, I find the exact same race
> too.  Do you still think the race is possible at least in theory?  If
> so, why hasn't your fix been merged?

I think the race is possible. It didn’t get merged, IIRC, due to some
addressable criticism and lack of enthusiasm from other people, and
my laziness/busy-ness.
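
For anyone following along, the interleaving from the commit message can be
replayed deterministically in a small userspace model. Everything here is
hypothetical (struct mm_model, the stale_entries counter, the helper names);
it only replays the order of events described above and is not the kernel
code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the per-mm batching state. */
struct mm_model {
	bool tlb_flush_batched;	/* "a batched flush is pending" indication */
	int stale_entries;	/* made-up count of not-yet-flushed entries */
};

/* Core B: unmaps a page and records that a deferred flush is pending. */
static void model_set_pending(struct mm_model *mm)
{
	mm->stale_entries++;
	mm->tlb_flush_batched = true;
}

/* Core A, step 1: the TLB flush itself. */
static void model_flush(struct mm_model *mm)
{
	mm->stale_entries = 0;
}

/* Core A, step 2: clearing the indication after the flush. */
static void model_clear(struct mm_model *mm)
{
	mm->tlb_flush_batched = false;
}

/*
 * Replays one check-flush-clear sequence on core A. If interleave is true,
 * core B sets the indication between the flush and the clear. Returns true
 * if stale entries remain while the indication is clear.
 */
static bool replay(bool interleave)
{
	struct mm_model mm = { false, 0 };

	model_set_pending(&mm);			/* core B: first unmap */
	if (mm.tlb_flush_batched) {		/* core A: sees the flag */
		model_flush(&mm);
		if (interleave)
			model_set_pending(&mm);	/* core B: in the window */
		model_clear(&mm);
	}
	return mm.stale_entries > 0 && !mm.tlb_flush_batched;
}
```

replay(true) ends with a stale entry and a cleared flag, which is the lost
indication; replay(false) does not.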

> 
>> The current batched TLB invalidation scheme therefore does not seem
>> viable or easily repairable.
> 
> I have some idea to fix this without too much code.  If necessary, I
> will send it out.

Arguably, it would be preferable to have a small back-portable fix for
this issue specifically. Just try to ensure that you do not introduce
performance overheads. Any solution should be clear about its impact
on additional TLB flushes in the worst-case scenario and the number
of additional atomic operations that would be required.



* Re: [RFC 20/20] mm/rmap: avoid potential races
  2021-08-23 15:50     ` Nadav Amit
@ 2021-08-24  0:36       ` Huang, Ying
  0 siblings, 0 replies; 67+ messages in thread
From: Huang, Ying @ 2021-08-24  0:36 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Linux-MM, Linux Kernel Mailing List, Mel Gorman,
	Andrea Arcangeli, Andrew Morton, Andy Lutomirski, Dave Hansen,
	Peter Zijlstra, Thomas Gleixner, Will Deacon, Yu Zhao, x86

Nadav Amit <namit@vmware.com> writes:

>> On Aug 23, 2021, at 1:05 AM, Huang, Ying <ying.huang@intel.com> wrote:
>> 
>> Hi, Nadav,
>> 
>> Nadav Amit <nadav.amit@gmail.com> writes:
>> 
>>> From: Nadav Amit <namit@vmware.com>
>>> 
>>> flush_tlb_batched_pending() appears to have a theoretical race:
>>> tlb_flush_batched is being cleared after the TLB flush, and if in
>>> between another core calls set_tlb_ubc_flush_pending() and sets the
>>> pending TLB flush indication, this indication might be lost. Holding the
>>> page-table lock when SPLIT_LOCK is set cannot eliminate this race.
>> 
>> Recently, when I read the corresponding code, I find the exact same race
>> too.  Do you still think the race is possible at least in theory?  If
>> so, why hasn't your fix been merged?
>
> I think the race is possible. It didn’t get merged, IIRC, due to some
> addressable criticism and lack of enthusiasm from other people, and
> my laziness/busy-ness.

Got it!  Thanks for the information!

>>> The current batched TLB invalidation scheme therefore does not seem
>>> viable or easily repairable.
>> 
>> I have some idea to fix this without too much code.  If necessary, I
>> will send it out.
>
> Arguably, it would be preferable to have a small back-portable fix for
> this issue specifically. Just try to ensure that you do not introduce
> performance overheads. Any solution should be clear about its impact
> on additional TLB flushes on the worst-case scenario and the number
> of additional atomic operations that would be required.

Sure.  Will do that.

Best Regards,
Huang, Ying




Thread overview: 67+ messages
2021-01-31  0:11 [RFC 00/20] TLB batching consolidation and enhancements Nadav Amit
2021-01-31  0:11 ` [RFC 01/20] mm/tlb: fix fullmm semantics Nadav Amit
2021-01-31  1:02   ` Andy Lutomirski
2021-01-31  1:19     ` Nadav Amit
2021-01-31  2:57       ` Andy Lutomirski
2021-02-01  7:30         ` Nadav Amit
2021-02-01 11:36   ` Peter Zijlstra
2021-02-02  9:32     ` Nadav Amit
2021-02-02 11:00       ` Peter Zijlstra
2021-02-02 21:35         ` Nadav Amit
2021-02-03  9:44           ` Will Deacon
2021-02-04  3:20             ` Nadav Amit
2021-01-31  0:11 ` [RFC 02/20] mm/mprotect: use mmu_gather Nadav Amit
2021-01-31  0:11 ` [RFC 03/20] mm/mprotect: do not flush on permission promotion Nadav Amit
2021-01-31  1:07   ` Andy Lutomirski
2021-01-31  1:17     ` Nadav Amit
2021-01-31  2:59       ` Andy Lutomirski
     [not found]     ` <7a6de15a-a570-31f2-14d6-a8010296e694@citrix.com>
2021-02-01  5:58       ` Nadav Amit
2021-02-01 15:38         ` Andrew Cooper
2021-01-31  0:11 ` [RFC 04/20] mm/mapping_dirty_helpers: use mmu_gather Nadav Amit
2021-01-31  0:11 ` [RFC 05/20] mm/tlb: move BATCHED_UNMAP_TLB_FLUSH to tlb.h Nadav Amit
2021-01-31  0:11 ` [RFC 06/20] fs/task_mmu: use mmu_gather interface of clear-soft-dirty Nadav Amit
2021-01-31  0:11 ` [RFC 07/20] mm: move x86 tlb_gen to generic code Nadav Amit
2021-01-31 18:26   ` Andy Lutomirski
2021-01-31  0:11 ` [RFC 08/20] mm: store completed TLB generation Nadav Amit
2021-01-31 20:32   ` Andy Lutomirski
2021-02-01  7:28     ` Nadav Amit
2021-02-01 16:53       ` Andy Lutomirski
2021-02-01 11:52   ` Peter Zijlstra
2021-01-31  0:11 ` [RFC 09/20] mm: create pte/pmd_tlb_flush_pending() Nadav Amit
2021-01-31  0:11 ` [RFC 10/20] mm: add pte_to_page() Nadav Amit
2021-01-31  0:11 ` [RFC 11/20] mm/tlb: remove arch-specific tlb_start/end_vma() Nadav Amit
2021-02-01 12:09   ` Peter Zijlstra
2021-02-02  6:41     ` Nicholas Piggin
2021-02-02  7:20       ` Nadav Amit
2021-02-02  9:31         ` Peter Zijlstra
2021-02-02  9:54           ` Nadav Amit
2021-02-02 11:04             ` Peter Zijlstra
2021-01-31  0:11 ` [RFC 12/20] mm/tlb: save the VMA that is flushed during tlb_start_vma() Nadav Amit
2021-02-01 12:28   ` Peter Zijlstra
2021-01-31  0:11 ` [RFC 13/20] mm/tlb: introduce tlb_start_ptes() and tlb_end_ptes() Nadav Amit
2021-01-31  9:57   ` Damian Tometzki
2021-01-31 10:07   ` Damian Tometzki
2021-02-01  7:29     ` Nadav Amit
2021-02-01 13:19   ` Peter Zijlstra
2021-02-01 23:00     ` Nadav Amit
2021-01-31  0:11 ` [RFC 14/20] mm: move inc/dec_tlb_flush_pending() to mmu_gather.c Nadav Amit
2021-01-31  0:11 ` [RFC 15/20] mm: detect deferred TLB flushes in vma granularity Nadav Amit
2021-02-01 22:04   ` Nadav Amit
2021-02-02  0:14     ` Andy Lutomirski
2021-02-02 20:51       ` Nadav Amit
2021-02-04  4:35         ` Andy Lutomirski
2021-01-31  0:11 ` [RFC 16/20] mm/tlb: per-page table generation tracking Nadav Amit
2021-01-31  0:11 ` [RFC 17/20] mm/tlb: updated completed deferred TLB flush conditionally Nadav Amit
2021-01-31  0:11 ` [RFC 18/20] mm: make mm_cpumask() volatile Nadav Amit
2021-01-31  0:11 ` [RFC 19/20] lib/cpumask: introduce cpumask_atomic_or() Nadav Amit
2021-01-31  0:11 ` [RFC 20/20] mm/rmap: avoid potential races Nadav Amit
2021-08-23  8:05   ` Huang, Ying
2021-08-23 15:50     ` Nadav Amit
2021-08-24  0:36       ` Huang, Ying
2021-01-31  0:39 ` [RFC 00/20] TLB batching consolidation and enhancements Andy Lutomirski
2021-01-31  1:08   ` Nadav Amit
2021-01-31  3:30 ` Nicholas Piggin
2021-01-31  7:57   ` Nadav Amit
2021-01-31  8:14     ` Nadav Amit
2021-02-01 12:44     ` Peter Zijlstra
2021-02-02  7:14       ` Nicholas Piggin
