linux-kernel.vger.kernel.org archive mirror
* [PATCH v2 0/7] x86/mm/tlb: make lazy TLB mode even lazier
@ 2018-09-26  3:58 Rik van Riel
  2018-09-26  3:58 ` [PATCH 1/7] x86/mm/tlb: Always use lazy TLB mode Rik van Riel
                   ` (8 more replies)
  0 siblings, 9 replies; 21+ messages in thread
From: Rik van Riel @ 2018-09-26  3:58 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, kernel-team, will.deacon, songliubraving, mingo, luto,
	hpa, npiggin

Linus asked me to come up with a smaller patch set to get the benefits
of lazy TLB mode, so I spent some time trying out various permutations
of the code, with a few workloads that do lots of context switches, and
also happen to have a fair number of TLB flushes a second.

Both of the workloads tested are memcache-style workloads, running
on two-socket systems. One of the workloads has around 300,000
context switches a second, and around 19,000 TLB flushes a second.

The first patch in the series, which makes the kernel always use
lazy TLB mode, reduces CPU use by around 1% on both Haswell and
Broadwell systems.

The rest of the series reduces the number of TLB flush IPIs by
about 1,500 a second, resulting in a 0.2% reduction in CPU use,
on top of the 1% seen by just enabling lazy TLB mode.

These are the low-hanging fruit in the context switch code.

The big thing remaining is the reference count overhead of
the lazy TLB mm_struct, but getting rid of that is rather a
lot of code for a small performance gain. Not quite what
Linus asked for :)

This v2 is "identical" to the version I posted yesterday,
except this one is actually against current -tip (not sure
what went wrong before), with a number of relevant patches
on top:
- tip x86/core
	012e77a903d ("x86/nmi: Fix NMI uaccess race against CR3 switching")
- arm64 tlb/asm-generic (entire branch)
- peterz queue mm/tlb
	12b2b80ec6f4 ("x86/mm: Page size aware flush_tlb_mm_range()")



^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 1/7] x86/mm/tlb: Always use lazy TLB mode
  2018-09-26  3:58 [PATCH v2 0/7] x86/mm/tlb: make lazy TLB mode even lazier Rik van Riel
@ 2018-09-26  3:58 ` Rik van Riel
  2018-10-01 15:58   ` Peter Zijlstra
  2018-09-26  3:58 ` [PATCH 2/7] x86/mm/tlb: Restructure switch_mm_irqs_off() Rik van Riel
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 21+ messages in thread
From: Rik van Riel @ 2018-09-26  3:58 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, kernel-team, will.deacon, songliubraving, mingo, luto,
	hpa, npiggin, Rik van Riel, Linus Torvalds, Thomas Gleixner,
	efault

Now that CPUs in lazy TLB mode no longer receive TLB shootdown IPIs, except
at page table freeing time, and idle CPUs will no longer get shootdown IPIs
for things like mprotect and madvise, we can always use lazy TLB mode.

Tested-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Rik van Riel <riel@surriel.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: efault@gmx.de
Cc: kernel-team@fb.com
Cc: luto@kernel.org
Link: http://lkml.kernel.org/r/20180716190337.26133-7-riel@surriel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 95b0e6357d3e4e05349668940d7ff8f3b7e7e11e)
---
 arch/x86/include/asm/tlbflush.h | 16 ----------------
 arch/x86/mm/tlb.c               | 15 +--------------
 2 files changed, 1 insertion(+), 30 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 671f65309ce7..d6c0cd9e9591 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -148,22 +148,6 @@ static inline unsigned long build_cr3_noflush(pgd_t *pgd, u16 asid)
 #define __flush_tlb_one_user(addr) __native_flush_tlb_one_user(addr)
 #endif
 
-static inline bool tlb_defer_switch_to_init_mm(void)
-{
-	/*
-	 * If we have PCID, then switching to init_mm is reasonably
-	 * fast.  If we don't have PCID, then switching to init_mm is
-	 * quite slow, so we try to defer it in the hopes that we can
-	 * avoid it entirely.  The latter approach runs the risk of
-	 * receiving otherwise unnecessary IPIs.
-	 *
-	 * This choice is just a heuristic.  The tlb code can handle this
-	 * function returning true or false regardless of whether we have
-	 * PCID.
-	 */
-	return !static_cpu_has(X86_FEATURE_PCID);
-}
-
 struct tlb_context {
 	u64 ctx_id;
 	u64 tlb_gen;
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 6aa195796dec..54a5870190a6 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -368,20 +368,7 @@ void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
 	if (this_cpu_read(cpu_tlbstate.loaded_mm) == &init_mm)
 		return;
 
-	if (tlb_defer_switch_to_init_mm()) {
-		/*
-		 * There's a significant optimization that may be possible
-		 * here.  We have accurate enough TLB flush tracking that we
-		 * don't need to maintain coherence of TLB per se when we're
-		 * lazy.  We do, however, need to maintain coherence of
-		 * paging-structure caches.  We could, in principle, leave our
-		 * old mm loaded and only switch to init_mm when
-		 * tlb_remove_page() happens.
-		 */
-		this_cpu_write(cpu_tlbstate.is_lazy, true);
-	} else {
-		switch_mm(NULL, &init_mm, NULL);
-	}
+	this_cpu_write(cpu_tlbstate.is_lazy, true);
 }
 
 /*
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 2/7] x86/mm/tlb: Restructure switch_mm_irqs_off()
  2018-09-26  3:58 [PATCH v2 0/7] x86/mm/tlb: make lazy TLB mode even lazier Rik van Riel
  2018-09-26  3:58 ` [PATCH 1/7] x86/mm/tlb: Always use lazy TLB mode Rik van Riel
@ 2018-09-26  3:58 ` Rik van Riel
  2018-10-02  7:32   ` Peter Zijlstra
  2018-09-26  3:58 ` [PATCH 3/7] smp: use __cpumask_set_cpu in on_each_cpu_cond Rik van Riel
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 21+ messages in thread
From: Rik van Riel @ 2018-09-26  3:58 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, kernel-team, will.deacon, songliubraving, mingo, luto,
	hpa, npiggin, Rik van Riel, Linus Torvalds, Thomas Gleixner,
	efault

Move some code that will be needed for the lazy -> !lazy state
transition when a lazy TLB CPU has gotten out of date.

No functional changes, since the if (real_prev == next) branch
always returns.

Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Rik van Riel <riel@surriel.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: efault@gmx.de
Cc: kernel-team@fb.com
Link: http://lkml.kernel.org/r/20180716190337.26133-4-riel@surriel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 61d0beb5796ab11f7f3bf38cb2eccc6579aaa70b)
---
 arch/x86/mm/tlb.c | 64 ++++++++++++++++++++++++-----------------------
 1 file changed, 33 insertions(+), 31 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 54a5870190a6..1224f7fb1311 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -187,6 +187,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 	u16 prev_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
 	unsigned cpu = smp_processor_id();
 	u64 next_tlb_gen;
+	bool need_flush;
+	u16 new_asid;
 
 	/*
 	 * NB: The scheduler will call us with prev == next when switching
@@ -308,44 +310,44 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		/* Let nmi_uaccess_okay() know that we're changing CR3. */
 		this_cpu_write(cpu_tlbstate.loaded_mm, LOADED_MM_SWITCHING);
 		barrier();
+	}
 
-		if (need_flush) {
-			this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id);
-			this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen);
-			load_new_mm_cr3(next->pgd, new_asid, true);
-
-			/*
-			 * NB: This gets called via leave_mm() in the idle path
-			 * where RCU functions differently.  Tracing normally
-			 * uses RCU, so we need to use the _rcuidle variant.
-			 *
-			 * (There is no good reason for this.  The idle code should
-			 *  be rearranged to call this before rcu_idle_enter().)
-			 */
-			trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
-		} else {
-			/* The new ASID is already up to date. */
-			load_new_mm_cr3(next->pgd, new_asid, false);
-
-			/* See above wrt _rcuidle. */
-			trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, 0);
-		}
+	if (need_flush) {
+		this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id);
+		this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen);
+		load_new_mm_cr3(next->pgd, new_asid, true);
 
 		/*
-		 * Record last user mm's context id, so we can avoid
-		 * flushing branch buffer with IBPB if we switch back
-		 * to the same user.
+		 * NB: This gets called via leave_mm() in the idle path
+		 * where RCU functions differently.  Tracing normally
+		 * uses RCU, so we need to use the _rcuidle variant.
+		 *
+		 * (There is no good reason for this.  The idle code should
+		 *  be rearranged to call this before rcu_idle_enter().)
 		 */
-		if (next != &init_mm)
-			this_cpu_write(cpu_tlbstate.last_ctx_id, next->context.ctx_id);
-
-		/* Make sure we write CR3 before loaded_mm. */
-		barrier();
+		trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
+	} else {
+		/* The new ASID is already up to date. */
+		load_new_mm_cr3(next->pgd, new_asid, false);
 
-		this_cpu_write(cpu_tlbstate.loaded_mm, next);
-		this_cpu_write(cpu_tlbstate.loaded_mm_asid, new_asid);
+		/* See above wrt _rcuidle. */
+		trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, 0);
 	}
 
+	/*
+	 * Record last user mm's context id, so we can avoid
+	 * flushing branch buffer with IBPB if we switch back
+	 * to the same user.
+	 */
+	if (next != &init_mm)
+		this_cpu_write(cpu_tlbstate.last_ctx_id, next->context.ctx_id);
+
+	/* Make sure we write CR3 before loaded_mm. */
+	barrier();
+
+	this_cpu_write(cpu_tlbstate.loaded_mm, next);
+	this_cpu_write(cpu_tlbstate.loaded_mm_asid, new_asid);
+
 	load_mm_cr4(next);
 	switch_ldt(real_prev, next);
 }
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 3/7] smp: use __cpumask_set_cpu in on_each_cpu_cond
  2018-09-26  3:58 [PATCH v2 0/7] x86/mm/tlb: make lazy TLB mode even lazier Rik van Riel
  2018-09-26  3:58 ` [PATCH 1/7] x86/mm/tlb: Always use lazy TLB mode Rik van Riel
  2018-09-26  3:58 ` [PATCH 2/7] x86/mm/tlb: Restructure switch_mm_irqs_off() Rik van Riel
@ 2018-09-26  3:58 ` Rik van Riel
  2018-10-09 14:59   ` [tip:x86/mm] " tip-bot for Rik van Riel
  2018-09-26  3:58 ` [PATCH 4/7] smp,cpumask: introduce on_each_cpu_cond_mask Rik van Riel
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 21+ messages in thread
From: Rik van Riel @ 2018-09-26  3:58 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, kernel-team, will.deacon, songliubraving, mingo, luto,
	hpa, npiggin, Rik van Riel

The code in on_each_cpu_cond sets CPUs in a locally allocated bitmask,
which should never be used by other CPUs simultaneously. There is no
need to use locked memory accesses to set the bits in this bitmap.

Switch to __cpumask_set_cpu.
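
For reference, the two helpers differ only in which bitop they use,
roughly as defined in include/linux/cpumask.h:

	/*
	 * Atomic (lock-prefixed) variant: needed when other CPUs may
	 * update the same mask concurrently.
	 */
	static inline void cpumask_set_cpu(unsigned int cpu, struct cpumask *dstp)
	{
		set_bit(cpumask_check(cpu), cpumask_bits(dstp));
	}

	/*
	 * Non-atomic variant: sufficient for a mask that only the local
	 * CPU writes, as in on_each_cpu_cond().
	 */
	static inline void __cpumask_set_cpu(unsigned int cpu, struct cpumask *dstp)
	{
		__set_bit(cpumask_check(cpu), cpumask_bits(dstp));
	}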

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Rik van Riel <riel@surriel.com>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
---
 kernel/smp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index d86eec5f51c1..a7d4f9f50a49 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -682,7 +682,7 @@ void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
 		preempt_disable();
 		for_each_online_cpu(cpu)
 			if (cond_func(cpu, info))
-				cpumask_set_cpu(cpu, cpus);
+				__cpumask_set_cpu(cpu, cpus);
 		on_each_cpu_mask(cpus, func, info, wait);
 		preempt_enable();
 		free_cpumask_var(cpus);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 4/7] smp,cpumask: introduce on_each_cpu_cond_mask
  2018-09-26  3:58 [PATCH v2 0/7] x86/mm/tlb: make lazy TLB mode even lazier Rik van Riel
                   ` (2 preceding siblings ...)
  2018-09-26  3:58 ` [PATCH 3/7] smp: use __cpumask_set_cpu in on_each_cpu_cond Rik van Riel
@ 2018-09-26  3:58 ` Rik van Riel
  2018-10-09 14:59   ` [tip:x86/mm] " tip-bot for Rik van Riel
  2018-09-26  3:58 ` [PATCH 5/7] Add freed_tables argument to flush_tlb_mm_range Rik van Riel
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 21+ messages in thread
From: Rik van Riel @ 2018-09-26  3:58 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, kernel-team, will.deacon, songliubraving, mingo, luto,
	hpa, npiggin, Rik van Riel

Introduce a variant of on_each_cpu_cond that iterates only over the
CPUs in a cpumask, in order to avoid making callbacks for every single
CPU in the system when we only need to test a subset.
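
As a sketch of the intended usage (the names needs_work, cpu_needs_work,
do_work and mask below are made up for illustration, not part of this
series):

	static DEFINE_PER_CPU(bool, needs_work);

	/* Condition callback: invoked for every CPU in the mask. */
	static bool cpu_needs_work(int cpu, void *info)
	{
		return per_cpu(needs_work, cpu);
	}

	/* Work callback: runs, via IPI, on each CPU that passed the test. */
	static void do_work(void *info)
	{
	}

	/* Only CPUs in 'mask' are considered, not every online CPU. */
	on_each_cpu_cond_mask(cpu_needs_work, do_work, NULL, true,
			      GFP_ATOMIC, mask);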

Signed-off-by: Rik van Riel <riel@surriel.com>
---
 include/linux/smp.h |  4 ++++
 kernel/smp.c        | 17 +++++++++++++----
 kernel/up.c         | 14 +++++++++++---
 3 files changed, 28 insertions(+), 7 deletions(-)

diff --git a/include/linux/smp.h b/include/linux/smp.h
index 9fb239e12b82..a56f08ff3097 100644
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -53,6 +53,10 @@ void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
 		smp_call_func_t func, void *info, bool wait,
 		gfp_t gfp_flags);
 
+void on_each_cpu_cond_mask(bool (*cond_func)(int cpu, void *info),
+		smp_call_func_t func, void *info, bool wait,
+		gfp_t gfp_flags, const struct cpumask *mask);
+
 int smp_call_function_single_async(int cpu, call_single_data_t *csd);
 
 #ifdef CONFIG_SMP
diff --git a/kernel/smp.c b/kernel/smp.c
index a7d4f9f50a49..163c451af42e 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -669,9 +669,9 @@ EXPORT_SYMBOL(on_each_cpu_mask);
  * You must not call this function with disabled interrupts or
  * from a hardware interrupt handler or from a bottom half handler.
  */
-void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
+void on_each_cpu_cond_mask(bool (*cond_func)(int cpu, void *info),
 			smp_call_func_t func, void *info, bool wait,
-			gfp_t gfp_flags)
+			gfp_t gfp_flags, const struct cpumask *mask)
 {
 	cpumask_var_t cpus;
 	int cpu, ret;
@@ -680,7 +680,7 @@ void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
 
 	if (likely(zalloc_cpumask_var(&cpus, (gfp_flags|__GFP_NOWARN)))) {
 		preempt_disable();
-		for_each_online_cpu(cpu)
+		for_each_cpu(cpu, mask)
 			if (cond_func(cpu, info))
 				__cpumask_set_cpu(cpu, cpus);
 		on_each_cpu_mask(cpus, func, info, wait);
@@ -692,7 +692,7 @@ void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
 		 * just have to IPI them one by one.
 		 */
 		preempt_disable();
-		for_each_online_cpu(cpu)
+		for_each_cpu(cpu, mask)
 			if (cond_func(cpu, info)) {
 				ret = smp_call_function_single(cpu, func,
 								info, wait);
@@ -701,6 +701,15 @@ void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
 		preempt_enable();
 	}
 }
+EXPORT_SYMBOL(on_each_cpu_cond_mask);
+
+void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
+			smp_call_func_t func, void *info, bool wait,
+			gfp_t gfp_flags)
+{
+	on_each_cpu_cond_mask(cond_func, func, info, wait, gfp_flags,
+				cpu_online_mask);
+}
 EXPORT_SYMBOL(on_each_cpu_cond);
 
 static void do_nothing(void *unused)
diff --git a/kernel/up.c b/kernel/up.c
index 42c46bf3e0a5..ff536f9cc8a2 100644
--- a/kernel/up.c
+++ b/kernel/up.c
@@ -68,9 +68,9 @@ EXPORT_SYMBOL(on_each_cpu_mask);
  * Preemption is disabled here to make sure the cond_func is called under the
  * same condtions in UP and SMP.
  */
-void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
-		      smp_call_func_t func, void *info, bool wait,
-		      gfp_t gfp_flags)
+void on_each_cpu_cond_mask(bool (*cond_func)(int cpu, void *info),
+			   smp_call_func_t func, void *info, bool wait,
+			   gfp_t gfp_flags, const struct cpumask *mask)
 {
 	unsigned long flags;
 
@@ -82,6 +82,14 @@ void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
 	}
 	preempt_enable();
 }
+EXPORT_SYMBOL(on_each_cpu_cond_mask);
+
+void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
+		      smp_call_func_t func, void *info, bool wait,
+		      gfp_t gfp_flags)
+{
+	on_each_cpu_cond_mask(cond_func, func, info, wait, gfp_flags, NULL);
+}
 EXPORT_SYMBOL(on_each_cpu_cond);
 
 int smp_call_on_cpu(unsigned int cpu, int (*func)(void *), void *par, bool phys)
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 5/7] Add freed_tables argument to flush_tlb_mm_range
  2018-09-26  3:58 [PATCH v2 0/7] x86/mm/tlb: make lazy TLB mode even lazier Rik van Riel
                   ` (3 preceding siblings ...)
  2018-09-26  3:58 ` [PATCH 4/7] smp,cpumask: introduce on_each_cpu_cond_mask Rik van Riel
@ 2018-09-26  3:58 ` Rik van Riel
  2018-10-09 15:00   ` [tip:x86/mm] x86/mm/tlb: " tip-bot for Rik van Riel
  2018-09-26  3:58 ` [PATCH 6/7] Add freed_tables element to flush_tlb_info Rik van Riel
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 21+ messages in thread
From: Rik van Riel @ 2018-09-26  3:58 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, kernel-team, will.deacon, songliubraving, mingo, luto,
	hpa, npiggin, Rik van Riel

Add an argument to flush_tlb_mm_range to indicate whether page tables
are about to be freed after this TLB flush. This allows for an
optimization of flush_tlb_mm_range to skip CPUs in lazy TLB mode.

No functional changes.

Signed-off-by: Rik van Riel <riel@surriel.com>
---
 arch/x86/include/asm/tlb.h      |  2 +-
 arch/x86/include/asm/tlbflush.h | 10 ++++++----
 arch/x86/kernel/ldt.c           |  2 +-
 arch/x86/kernel/vm86_32.c       |  2 +-
 arch/x86/mm/tlb.c               |  3 ++-
 5 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
index afbe7d1e68cf..404b8b1d44f5 100644
--- a/arch/x86/include/asm/tlb.h
+++ b/arch/x86/include/asm/tlb.h
@@ -20,7 +20,7 @@ static inline void tlb_flush(struct mmu_gather *tlb)
 		end = tlb->end;
 	}
 
-	flush_tlb_mm_range(tlb->mm, start, end, stride_shift);
+	flush_tlb_mm_range(tlb->mm, start, end, stride_shift, tlb->freed_tables);
 }
 
 /*
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index d6c0cd9e9591..1dea9860ce5b 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -536,22 +536,24 @@ struct flush_tlb_info {
 
 #define local_flush_tlb() __flush_tlb()
 
-#define flush_tlb_mm(mm)	flush_tlb_mm_range(mm, 0UL, TLB_FLUSH_ALL, 0UL)
+#define flush_tlb_mm(mm)						\
+		flush_tlb_mm_range(mm, 0UL, TLB_FLUSH_ALL, 0UL, true)
 
 #define flush_tlb_range(vma, start, end)				\
 	flush_tlb_mm_range((vma)->vm_mm, start, end,			\
 			   ((vma)->vm_flags & VM_HUGETLB)		\
 				? huge_page_shift(hstate_vma(vma))	\
-				: PAGE_SHIFT)
+				: PAGE_SHIFT, false)
 
 extern void flush_tlb_all(void);
 extern void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
-				unsigned long end, unsigned int stride_shift);
+				unsigned long end, unsigned int stride_shift,
+				bool freed_tables);
 extern void flush_tlb_kernel_range(unsigned long start, unsigned long end);
 
 static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
 {
-	flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, PAGE_SHIFT);
+	flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, PAGE_SHIFT, false);
 }
 
 void native_flush_tlb_others(const struct cpumask *cpumask,
diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
index 733e6ace0fa4..91eae79ef686 100644
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -273,7 +273,7 @@ map_ldt_struct(struct mm_struct *mm, struct ldt_struct *ldt, int slot)
 	map_ldt_struct_to_user(mm);
 
 	va = (unsigned long)ldt_slot_va(slot);
-	flush_tlb_mm_range(mm, va, va + LDT_SLOT_STRIDE, 0);
+	flush_tlb_mm_range(mm, va, va + LDT_SLOT_STRIDE, 0, false);
 
 	ldt->slot = slot;
 	return 0;
diff --git a/arch/x86/kernel/vm86_32.c b/arch/x86/kernel/vm86_32.c
index 1c03e4aa6474..91460acbb650 100644
--- a/arch/x86/kernel/vm86_32.c
+++ b/arch/x86/kernel/vm86_32.c
@@ -199,7 +199,7 @@ static void mark_screen_rdonly(struct mm_struct *mm)
 	pte_unmap_unlock(pte, ptl);
 out:
 	up_write(&mm->mmap_sem);
-	flush_tlb_mm_range(mm, 0xA0000, 0xA0000 + 32*PAGE_SIZE, 0UL);
+	flush_tlb_mm_range(mm, 0xA0000, 0xA0000 + 32*PAGE_SIZE, 0UL, false);
 }
 
 
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 1224f7fb1311..1d74fbc71ad6 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -611,7 +611,8 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 static unsigned long tlb_single_page_flush_ceiling __read_mostly = 33;
 
 void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
-				unsigned long end, unsigned int stride_shift)
+				unsigned long end, unsigned int stride_shift,
+				bool freed_tables)
 {
 	int cpu;
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 6/7] Add freed_tables element to flush_tlb_info
  2018-09-26  3:58 [PATCH v2 0/7] x86/mm/tlb: make lazy TLB mode even lazier Rik van Riel
                   ` (4 preceding siblings ...)
  2018-09-26  3:58 ` [PATCH 5/7] Add freed_tables argument to flush_tlb_mm_range Rik van Riel
@ 2018-09-26  3:58 ` Rik van Riel
  2018-10-09 15:00   ` [tip:x86/mm] x86/mm/tlb: " tip-bot for Rik van Riel
  2018-09-26  3:58 ` [PATCH 7/7] x86/mm/tlb: Make lazy TLB mode lazier Rik van Riel
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 21+ messages in thread
From: Rik van Riel @ 2018-09-26  3:58 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, kernel-team, will.deacon, songliubraving, mingo, luto,
	hpa, npiggin, Rik van Riel

Pass the information on to native_flush_tlb_others.

No functional changes.

Signed-off-by: Rik van Riel <riel@surriel.com>
---
 arch/x86/include/asm/tlbflush.h | 1 +
 arch/x86/mm/tlb.c               | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 1dea9860ce5b..323a313947e0 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -532,6 +532,7 @@ struct flush_tlb_info {
 	unsigned long		end;
 	u64			new_tlb_gen;
 	unsigned int		stride_shift;
+	bool			freed_tables;
 };
 
 #define local_flush_tlb() __flush_tlb()
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 1d74fbc71ad6..b228d2a6b5fa 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -619,6 +619,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 	struct flush_tlb_info info __aligned(SMP_CACHE_BYTES) = {
 		.mm = mm,
 		.stride_shift = stride_shift,
+		.freed_tables = freed_tables,
 	};
 
 	cpu = get_cpu();
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 7/7] x86/mm/tlb: Make lazy TLB mode lazier
  2018-09-26  3:58 [PATCH v2 0/7] x86/mm/tlb: make lazy TLB mode even lazier Rik van Riel
                   ` (5 preceding siblings ...)
  2018-09-26  3:58 ` [PATCH 6/7] Add freed_tables element to flush_tlb_info Rik van Riel
@ 2018-09-26  3:58 ` Rik van Riel
  2018-10-01 16:07   ` Peter Zijlstra
  2018-10-09 15:01   ` [tip:x86/mm] " tip-bot for Rik van Riel
  2018-10-01 16:09 ` [PATCH v2 0/7] x86/mm/tlb: make lazy TLB mode even lazier Peter Zijlstra
  2018-10-02  7:44 ` Peter Zijlstra
  8 siblings, 2 replies; 21+ messages in thread
From: Rik van Riel @ 2018-09-26  3:58 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, kernel-team, will.deacon, songliubraving, mingo, luto,
	hpa, npiggin, Rik van Riel

Lazy TLB mode can result in an idle CPU being woken up by a TLB flush,
when all it really needs to do is reload %CR3 at the next context switch,
assuming no page table pages got freed.

Memory ordering is used to prevent race conditions between switch_mm_irqs_off,
which checks whether .tlb_gen changed, and the TLB invalidation code, which
increments .tlb_gen whenever page table entries get invalidated.

The atomic increment in inc_mm_tlb_gen is its own barrier; the context
switch code adds an explicit barrier between reading tlbstate.is_lazy and
next->context.tlb_gen.

CPUs in lazy TLB mode remain part of the mm_cpumask(mm), both because
that allows TLB flush IPIs to be sent at page table freeing time, and
because the cache line bouncing on the mm_cpumask(mm) was responsible
for about half the CPU use in switch_mm_irqs_off().
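
In heavily simplified pseudo-C, the ordering looks roughly like this;
send_ipi_to() is only a placeholder for the real IPI path:

	/* flush side: flush_tlb_mm_range() / native_flush_tlb_others() */
	inc_mm_tlb_gen(mm);				/* atomic inc, full barrier */
	for_each_cpu(cpu, mm_cpumask(mm))
		if (!per_cpu(cpu_tlbstate.is_lazy, cpu))	/* read after the bump */
			send_ipi_to(cpu);		/* lazy CPUs get skipped */

	/* context switch side: switch_mm_irqs_off() */
	was_lazy = this_cpu_read(cpu_tlbstate.is_lazy);
	smp_mb();		/* pairs with the barrier in inc_mm_tlb_gen() */
	next_tlb_gen = atomic64_read(&next->context.tlb_gen);
	if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) != next_tlb_gen)
		need_flush = true;	/* a skipped IPI is made up for here */

Roughly speaking: either the flush side still sees the CPU as non-lazy
and sends it the IPI, or the switching CPU sees the new tlb_gen and
flushes itself.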

Tested-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Rik van Riel <riel@surriel.com>
---
 arch/x86/mm/tlb.c | 67 ++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 58 insertions(+), 9 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index b228d2a6b5fa..707d757e70ec 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -185,6 +185,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 {
 	struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm);
 	u16 prev_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
+	bool was_lazy = this_cpu_read(cpu_tlbstate.is_lazy);
 	unsigned cpu = smp_processor_id();
 	u64 next_tlb_gen;
 	bool need_flush;
@@ -242,17 +243,40 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 			   next->context.ctx_id);
 
 		/*
-		 * We don't currently support having a real mm loaded without
-		 * our cpu set in mm_cpumask().  We have all the bookkeeping
-		 * in place to figure out whether we would need to flush
-		 * if our cpu were cleared in mm_cpumask(), but we don't
-		 * currently use it.
+		 * Even in lazy TLB mode, the CPU should stay set in the
+		 * mm_cpumask. The TLB shootdown code can figure out from
+		 * cpu_tlbstate.is_lazy whether or not to send an IPI.
 		 */
 		if (WARN_ON_ONCE(real_prev != &init_mm &&
 				 !cpumask_test_cpu(cpu, mm_cpumask(next))))
 			cpumask_set_cpu(cpu, mm_cpumask(next));
 
-		return;
+		/*
+		 * If the CPU is not in lazy TLB mode, we are just switching
+		 * from one thread in a process to another thread in the same
+		 * process. No TLB flush required.
+		 */
+		if (!was_lazy)
+			return;
+
+		/*
+		 * Read the tlb_gen to check whether a flush is needed.
+		 * If the TLB is up to date, just use it.
+		 * The barrier synchronizes with the tlb_gen increment in
+		 * the TLB shootdown code.
+		 */
+		smp_mb();
+		next_tlb_gen = atomic64_read(&next->context.tlb_gen);
+		if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) ==
+				next_tlb_gen)
+			return;
+
+		/*
+		 * TLB contents went out of date while we were in lazy
+		 * mode. Fall through to the TLB switching code below.
+		 */
+		new_asid = prev_asid;
+		need_flush = true;
 	} else {
 		u16 new_asid;
 		bool need_flush;
@@ -348,8 +372,10 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 	this_cpu_write(cpu_tlbstate.loaded_mm, next);
 	this_cpu_write(cpu_tlbstate.loaded_mm_asid, new_asid);
 
-	load_mm_cr4(next);
-	switch_ldt(real_prev, next);
+	if (next != real_prev) {
+		load_mm_cr4(next);
+		switch_ldt(real_prev, next);
+	}
 }
 
 /*
@@ -457,6 +483,9 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
 		 * paging-structure cache to avoid speculatively reading
 		 * garbage into our TLB.  Since switching to init_mm is barely
 		 * slower than a minimal flush, just switch to init_mm.
+		 *
+		 * This should be rare, with native_flush_tlb_others skipping
+		 * IPIs to lazy TLB mode CPUs.
 		 */
 		switch_mm_irqs_off(NULL, &init_mm, NULL);
 		return;
@@ -559,6 +588,11 @@ static void flush_tlb_func_remote(void *info)
 	flush_tlb_func_common(f, false, TLB_REMOTE_SHOOTDOWN);
 }
 
+static bool tlb_is_not_lazy(int cpu, void *data)
+{
+	return !per_cpu(cpu_tlbstate.is_lazy, cpu);
+}
+
 void native_flush_tlb_others(const struct cpumask *cpumask,
 			     const struct flush_tlb_info *info)
 {
@@ -594,8 +628,23 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 					       (void *)info, 1);
 		return;
 	}
-	smp_call_function_many(cpumask, flush_tlb_func_remote,
+
+	/*
+	 * If no page tables were freed, we can skip sending IPIs to
+	 * CPUs in lazy TLB mode. They will flush the CPU themselves
+	 * at the next context switch.
+	 *
+	 * However, if page tables are getting freed, we need to send the
+	 * IPI everywhere, to prevent CPUs in lazy TLB mode from tripping
+	 * up on the new contents of what used to be page tables, while
+	 * doing a speculative memory access.
+	 */
+	if (info->freed_tables)
+		smp_call_function_many(cpumask, flush_tlb_func_remote,
 			       (void *)info, 1);
+	else
+		on_each_cpu_cond_mask(tlb_is_not_lazy, flush_tlb_func_remote,
+				(void *)info, 1, GFP_ATOMIC, cpumask);
 }
 
 /*
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/7] x86/mm/tlb: Always use lazy TLB mode
  2018-09-26  3:58 ` [PATCH 1/7] x86/mm/tlb: Always use lazy TLB mode Rik van Riel
@ 2018-10-01 15:58   ` Peter Zijlstra
  2018-10-01 16:07     ` Rik van Riel
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2018-10-01 15:58 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, kernel-team, will.deacon, songliubraving, mingo,
	luto, hpa, npiggin, Linus Torvalds, Thomas Gleixner, efault

On Tue, Sep 25, 2018 at 11:58:38PM -0400, Rik van Riel wrote:
> Now that CPUs in lazy TLB mode no longer receive TLB shootdown IPIs, except
> at page table freeing time, and idle CPUs will no longer get shootdown IPIs
> for things like mprotect and madvise, we can always use lazy TLB mode.

But that's only so at the end of this patch series; either change the
Changelog or the location of this patch?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/7] x86/mm/tlb: Always use lazy TLB mode
  2018-10-01 15:58   ` Peter Zijlstra
@ 2018-10-01 16:07     ` Rik van Riel
  0 siblings, 0 replies; 21+ messages in thread
From: Rik van Riel @ 2018-10-01 16:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, kernel-team, will.deacon, songliubraving, mingo,
	luto, hpa, npiggin, Linus Torvalds, Thomas Gleixner, efault


On Mon, 2018-10-01 at 17:58 +0200, Peter Zijlstra wrote:
> On Tue, Sep 25, 2018 at 11:58:38PM -0400, Rik van Riel wrote:
> > Now that CPUs in lazy TLB mode no longer receive TLB shootdown
> > IPIs, except
> > at page table freeing time, and idle CPUs will no longer get
> > shootdown IPIs
> > for things like mprotect and madvise, we can always use lazy TLB
> > mode.
> 
> But that's only so at the end of this patch series; either change the
> Changelog or the location of this patch?

You are right, I should remove that from the changelog.

Want me to resend, or would you like to just replace
it with something like this:

"On most workloads, the number of context switches
far exceeds the number of TLB flushes sent. Optimizing
the context switches, by always using lazy TLB mode,
speeds up those workloads.

This patch results in about a 1% reduction in CPU use
on a two socket Broadwell system running a memcache
like workload."

-- 
All Rights Reversed.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 7/7] x86/mm/tlb: Make lazy TLB mode lazier
  2018-09-26  3:58 ` [PATCH 7/7] x86/mm/tlb: Make lazy TLB mode lazier Rik van Riel
@ 2018-10-01 16:07   ` Peter Zijlstra
  2018-10-09 15:01   ` [tip:x86/mm] " tip-bot for Rik van Riel
  1 sibling, 0 replies; 21+ messages in thread
From: Peter Zijlstra @ 2018-10-01 16:07 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, kernel-team, will.deacon, songliubraving, mingo,
	luto, hpa, npiggin

On Tue, Sep 25, 2018 at 11:58:44PM -0400, Rik van Riel wrote:
> @@ -594,8 +628,23 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
>  					       (void *)info, 1);
>  		return;
>  	}
> -	smp_call_function_many(cpumask, flush_tlb_func_remote,
> +
> +	/*
> +	 * If no page tables were freed, we can skip sending IPIs to
> +	 * CPUs in lazy TLB mode. They will flush the CPU themselves
> +	 * at the next context switch.
> +	 *
> +	 * However, if page tables are getting freed, we need to send the
> +	 * IPI everywhere, to prevent CPUs in lazy TLB mode from tripping
> +	 * up on the new contents of what used to be page tables, while
> +	 * doing a speculative memory access.
> +	 */
> +	if (info->freed_tables)
> +		smp_call_function_many(cpumask, flush_tlb_func_remote,
>  			       (void *)info, 1);
> +	else
> +		on_each_cpu_cond_mask(tlb_is_not_lazy, flush_tlb_func_remote,
> +				(void *)info, 1, GFP_ATOMIC, cpumask);
>  }

And this is safe vs paravirt, because for native we now do _fewer_
invalidations.

That might warrant a mention in the Changelog perhaps.



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 0/7] x86/mm/tlb: make lazy TLB mode even lazier
  2018-09-26  3:58 [PATCH v2 0/7] x86/mm/tlb: make lazy TLB mode even lazier Rik van Riel
                   ` (6 preceding siblings ...)
  2018-09-26  3:58 ` [PATCH 7/7] x86/mm/tlb: Make lazy TLB mode lazier Rik van Riel
@ 2018-10-01 16:09 ` Peter Zijlstra
  2018-10-02  7:44 ` Peter Zijlstra
  8 siblings, 0 replies; 21+ messages in thread
From: Peter Zijlstra @ 2018-10-01 16:09 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, kernel-team, will.deacon, songliubraving, mingo,
	luto, hpa, npiggin



Looks good to me,

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/7] x86/mm/tlb: Restructure switch_mm_irqs_off()
  2018-09-26  3:58 ` [PATCH 2/7] x86/mm/tlb: Restructure switch_mm_irqs_off() Rik van Riel
@ 2018-10-02  7:32   ` Peter Zijlstra
  0 siblings, 0 replies; 21+ messages in thread
From: Peter Zijlstra @ 2018-10-02  7:32 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, kernel-team, will.deacon, songliubraving, mingo,
	luto, hpa, npiggin, Linus Torvalds, Thomas Gleixner, efault

On Tue, Sep 25, 2018 at 11:58:39PM -0400, Rik van Riel wrote:
> Move some code that will be needed for the lazy -> !lazy state
> transition when a lazy TLB CPU has gotten out of date.
> 
> No functional changes, since the if (real_prev == next) branch
> always returns.
> 
> Suggested-by: Andy Lutomirski <luto@kernel.org>
> Signed-off-by: Rik van Riel <riel@surriel.com>
> Acked-by: Dave Hansen <dave.hansen@intel.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: efault@gmx.de
> Cc: kernel-team@fb.com
> Link: http://lkml.kernel.org/r/20180716190337.26133-4-riel@surriel.com
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> (cherry picked from commit 61d0beb5796ab11f7f3bf38cb2eccc6579aaa70b)
> ---
>  arch/x86/mm/tlb.c | 64 ++++++++++++++++++++++++-----------------------
>  1 file changed, 33 insertions(+), 31 deletions(-)
> 
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 54a5870190a6..1224f7fb1311 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -187,6 +187,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
>  	u16 prev_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
>  	unsigned cpu = smp_processor_id();
>  	u64 next_tlb_gen;
> +	bool need_flush;
> +	u16 new_asid;
>  
>  	/*
>  	 * NB: The scheduler will call us with prev == next when switching
> @@ -308,44 +310,44 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
>  		/* Let nmi_uaccess_okay() know that we're changing CR3. */
>  		this_cpu_write(cpu_tlbstate.loaded_mm, LOADED_MM_SWITCHING);
>  		barrier();
> +	}

Compiling this gives me:

CC      arch/x86/mm/tlb.o
../arch/x86/mm/tlb.c: In function ‘switch_mm_irqs_off’:
../arch/x86/mm/tlb.c:315:5: warning: ‘need_flush’ may be used uninitialized in this function [-Wmaybe-uninitialized]
if (need_flush) {
^
../arch/x86/mm/tlb.c:316:45: warning: ‘new_asid’ may be used uninitialized in this function [-Wmaybe-uninitialized]
this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id);
^

Because you shadow need_flush and new_asid in a branch. I need the below
delta to make it happy again.

---
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -254,8 +254,6 @@ void switch_mm_irqs_off(struct mm_struct
 
 		return;
 	} else {
-		u16 new_asid;
-		bool need_flush;
 		u64 last_ctx_id = this_cpu_read(cpu_tlbstate.last_ctx_id);
 
 		/*

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 0/7] x86/mm/tlb: make lazy TLB mode even lazier
  2018-09-26  3:58 [PATCH v2 0/7] x86/mm/tlb: make lazy TLB mode even lazier Rik van Riel
                   ` (7 preceding siblings ...)
  2018-10-01 16:09 ` [PATCH v2 0/7] x86/mm/tlb: make lazy TLB mode even lazier Peter Zijlstra
@ 2018-10-02  7:44 ` Peter Zijlstra
  2018-10-02 13:41   ` Rik van Riel
  8 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2018-10-02  7:44 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, kernel-team, will.deacon, songliubraving, mingo,
	luto, hpa, npiggin, Dave Hansen

On Tue, Sep 25, 2018 at 11:58:37PM -0400, Rik van Riel wrote:

> This v2 is "identical" to the version I posted yesterday,
> except this one is actually against current -tip (not sure
> what went wrong before), with a number of relevant patches
> on top:
> - tip x86/core
> 	012e77a903d ("x86/nmi: Fix NMI uaccess race against CR3 switching")
> - arm64 tlb/asm-generic (entire branch)
> - peterz queue mm/tlb
> 	12b2b80ec6f4 ("x86/mm: Page size aware flush_tlb_mm_range()")

If you could please double check:

  https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=x86/mm

If 0day comes up green on that, I'll attempt to push it to tip.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 0/7] x86/mm/tlb: make lazy TLB mode even lazier
  2018-10-02  7:44 ` Peter Zijlstra
@ 2018-10-02 13:41   ` Rik van Riel
  0 siblings, 0 replies; 21+ messages in thread
From: Rik van Riel @ 2018-10-02 13:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, kernel-team, will.deacon, songliubraving, mingo,
	luto, hpa, npiggin, Dave Hansen


On Tue, 2018-10-02 at 09:44 +0200, Peter Zijlstra wrote:
> On Tue, Sep 25, 2018 at 11:58:37PM -0400, Rik van Riel wrote:
> 
> > This v2 is "identical" to the version I posted yesterday,
> > except this one is actually against current -tip (not sure
> > what went wrong before), with a number of relevant patches
> > on top:
> > - tip x86/core
> > 	012e77a903d ("x86/nmi: Fix NMI uaccess race against CR3
> > switching")
> > - arm64 tlb/asm-generic (entire branch)
> > - peterz queue mm/tlb
> > 	12b2b80ec6f4 ("x86/mm: Page size aware flush_tlb_mm_range()")
> 
> If you could please double check:
> 
>   
> https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=x86/mm
> 
> If 0day comes up green on that, I'll attempt to push it to tip.

That looks good to me.

Thank you!

-- 
All Rights Reversed.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [tip:x86/mm] smp: use __cpumask_set_cpu in on_each_cpu_cond
  2018-09-26  3:58 ` [PATCH 3/7] smp: use __cpumask_set_cpu in on_each_cpu_cond Rik van Riel
@ 2018-10-09 14:59   ` tip-bot for Rik van Riel
  0 siblings, 0 replies; 21+ messages in thread
From: tip-bot for Rik van Riel @ 2018-10-09 14:59 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: tglx, riel, peterz, linux-kernel, hpa, mingo, luto

Commit-ID:  c3f7f2c7eba1a53d2e5ffbc2dcc9a20c5f094890
Gitweb:     https://git.kernel.org/tip/c3f7f2c7eba1a53d2e5ffbc2dcc9a20c5f094890
Author:     Rik van Riel <riel@surriel.com>
AuthorDate: Tue, 25 Sep 2018 23:58:40 -0400
Committer:  Peter Zijlstra <peterz@infradead.org>
CommitDate: Tue, 9 Oct 2018 16:51:11 +0200

smp: use __cpumask_set_cpu in on_each_cpu_cond

The code in on_each_cpu_cond sets CPUs in a locally allocated bitmask,
which should never be used by other CPUs simultaneously. There is no
need to use locked memory accesses to set the bits in this bitmap.

Switch to __cpumask_set_cpu.

Cc: npiggin@gmail.com
Cc: mingo@kernel.org
Cc: will.deacon@arm.com
Cc: songliubraving@fb.com
Cc: kernel-team@fb.com
Cc: hpa@zytor.com
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Rik van Riel <riel@surriel.com>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20180926035844.1420-4-riel@surriel.com
---
 kernel/smp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index d86eec5f51c1..a7d4f9f50a49 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -682,7 +682,7 @@ void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
 		preempt_disable();
 		for_each_online_cpu(cpu)
 			if (cond_func(cpu, info))
-				cpumask_set_cpu(cpu, cpus);
+				__cpumask_set_cpu(cpu, cpus);
 		on_each_cpu_mask(cpus, func, info, wait);
 		preempt_enable();
 		free_cpumask_var(cpus);

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [tip:x86/mm] smp,cpumask: introduce on_each_cpu_cond_mask
  2018-09-26  3:58 ` [PATCH 4/7] smp,cpumask: introduce on_each_cpu_cond_mask Rik van Riel
@ 2018-10-09 14:59   ` tip-bot for Rik van Riel
  0 siblings, 0 replies; 21+ messages in thread
From: tip-bot for Rik van Riel @ 2018-10-09 14:59 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: riel, hpa, linux-kernel, mingo, tglx, peterz

Commit-ID:  7d49b28a80b830c3ca876d33bedc58d62a78e16f
Gitweb:     https://git.kernel.org/tip/7d49b28a80b830c3ca876d33bedc58d62a78e16f
Author:     Rik van Riel <riel@surriel.com>
AuthorDate: Tue, 25 Sep 2018 23:58:41 -0400
Committer:  Peter Zijlstra <peterz@infradead.org>
CommitDate: Tue, 9 Oct 2018 16:51:11 +0200

smp,cpumask: introduce on_each_cpu_cond_mask

Introduce a variant of on_each_cpu_cond that iterates only over the
CPUs in a cpumask, in order to avoid making callbacks for every single
CPU in the system when we only need to test a subset.

Cc: npiggin@gmail.com
Cc: mingo@kernel.org
Cc: will.deacon@arm.com
Cc: songliubraving@fb.com
Cc: kernel-team@fb.com
Cc: hpa@zytor.com
Cc: luto@kernel.org
Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20180926035844.1420-5-riel@surriel.com
---
 include/linux/smp.h |  4 ++++
 kernel/smp.c        | 17 +++++++++++++----
 kernel/up.c         | 14 +++++++++++---
 3 files changed, 28 insertions(+), 7 deletions(-)

diff --git a/include/linux/smp.h b/include/linux/smp.h
index 9fb239e12b82..a56f08ff3097 100644
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -53,6 +53,10 @@ void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
 		smp_call_func_t func, void *info, bool wait,
 		gfp_t gfp_flags);
 
+void on_each_cpu_cond_mask(bool (*cond_func)(int cpu, void *info),
+		smp_call_func_t func, void *info, bool wait,
+		gfp_t gfp_flags, const struct cpumask *mask);
+
 int smp_call_function_single_async(int cpu, call_single_data_t *csd);
 
 #ifdef CONFIG_SMP
diff --git a/kernel/smp.c b/kernel/smp.c
index a7d4f9f50a49..163c451af42e 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -669,9 +669,9 @@ EXPORT_SYMBOL(on_each_cpu_mask);
  * You must not call this function with disabled interrupts or
  * from a hardware interrupt handler or from a bottom half handler.
  */
-void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
+void on_each_cpu_cond_mask(bool (*cond_func)(int cpu, void *info),
 			smp_call_func_t func, void *info, bool wait,
-			gfp_t gfp_flags)
+			gfp_t gfp_flags, const struct cpumask *mask)
 {
 	cpumask_var_t cpus;
 	int cpu, ret;
@@ -680,7 +680,7 @@ void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
 
 	if (likely(zalloc_cpumask_var(&cpus, (gfp_flags|__GFP_NOWARN)))) {
 		preempt_disable();
-		for_each_online_cpu(cpu)
+		for_each_cpu(cpu, mask)
 			if (cond_func(cpu, info))
 				__cpumask_set_cpu(cpu, cpus);
 		on_each_cpu_mask(cpus, func, info, wait);
@@ -692,7 +692,7 @@ void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
 		 * just have to IPI them one by one.
 		 */
 		preempt_disable();
-		for_each_online_cpu(cpu)
+		for_each_cpu(cpu, mask)
 			if (cond_func(cpu, info)) {
 				ret = smp_call_function_single(cpu, func,
 								info, wait);
@@ -701,6 +701,15 @@ void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
 		preempt_enable();
 	}
 }
+EXPORT_SYMBOL(on_each_cpu_cond_mask);
+
+void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
+			smp_call_func_t func, void *info, bool wait,
+			gfp_t gfp_flags)
+{
+	on_each_cpu_cond_mask(cond_func, func, info, wait, gfp_flags,
+				cpu_online_mask);
+}
 EXPORT_SYMBOL(on_each_cpu_cond);
 
 static void do_nothing(void *unused)
diff --git a/kernel/up.c b/kernel/up.c
index 42c46bf3e0a5..ff536f9cc8a2 100644
--- a/kernel/up.c
+++ b/kernel/up.c
@@ -68,9 +68,9 @@ EXPORT_SYMBOL(on_each_cpu_mask);
  * Preemption is disabled here to make sure the cond_func is called under the
  * same condtions in UP and SMP.
  */
-void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
-		      smp_call_func_t func, void *info, bool wait,
-		      gfp_t gfp_flags)
+void on_each_cpu_cond_mask(bool (*cond_func)(int cpu, void *info),
+			   smp_call_func_t func, void *info, bool wait,
+			   gfp_t gfp_flags, const struct cpumask *mask)
 {
 	unsigned long flags;
 
@@ -82,6 +82,14 @@ void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
 	}
 	preempt_enable();
 }
+EXPORT_SYMBOL(on_each_cpu_cond_mask);
+
+void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
+		      smp_call_func_t func, void *info, bool wait,
+		      gfp_t gfp_flags)
+{
+	on_each_cpu_cond_mask(cond_func, func, info, wait, gfp_flags, NULL);
+}
 EXPORT_SYMBOL(on_each_cpu_cond);
 
 int smp_call_on_cpu(unsigned int cpu, int (*func)(void *), void *par, bool phys)

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [tip:x86/mm] x86/mm/tlb: Add freed_tables argument to flush_tlb_mm_range
  2018-09-26  3:58 ` [PATCH 5/7] Add freed_tables argument to flush_tlb_mm_range Rik van Riel
@ 2018-10-09 15:00   ` tip-bot for Rik van Riel
  0 siblings, 0 replies; 21+ messages in thread
From: tip-bot for Rik van Riel @ 2018-10-09 15:00 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: linux-kernel, hpa, peterz, mingo, tglx, riel

Commit-ID:  016c4d92cd16f569c6485ae62b076c1a4b779536
Gitweb:     https://git.kernel.org/tip/016c4d92cd16f569c6485ae62b076c1a4b779536
Author:     Rik van Riel <riel@surriel.com>
AuthorDate: Tue, 25 Sep 2018 23:58:42 -0400
Committer:  Peter Zijlstra <peterz@infradead.org>
CommitDate: Tue, 9 Oct 2018 16:51:12 +0200

x86/mm/tlb: Add freed_tables argument to flush_tlb_mm_range

Add an argument to flush_tlb_mm_range to indicate whether page tables
are about to be freed after this TLB flush. This allows for an
optimization of flush_tlb_mm_range to skip CPUs in lazy TLB mode.

No functional changes.

Cc: npiggin@gmail.com
Cc: mingo@kernel.org
Cc: will.deacon@arm.com
Cc: songliubraving@fb.com
Cc: kernel-team@fb.com
Cc: luto@kernel.org
Cc: hpa@zytor.com
Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20180926035844.1420-6-riel@surriel.com
---
 arch/x86/include/asm/tlb.h      |  2 +-
 arch/x86/include/asm/tlbflush.h | 10 ++++++----
 arch/x86/kernel/ldt.c           |  2 +-
 arch/x86/kernel/vm86_32.c       |  2 +-
 arch/x86/mm/tlb.c               |  3 ++-
 5 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
index afbe7d1e68cf..404b8b1d44f5 100644
--- a/arch/x86/include/asm/tlb.h
+++ b/arch/x86/include/asm/tlb.h
@@ -20,7 +20,7 @@ static inline void tlb_flush(struct mmu_gather *tlb)
 		end = tlb->end;
 	}
 
-	flush_tlb_mm_range(tlb->mm, start, end, stride_shift);
+	flush_tlb_mm_range(tlb->mm, start, end, stride_shift, tlb->freed_tables);
 }
 
 /*
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index d6c0cd9e9591..1dea9860ce5b 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -536,22 +536,24 @@ struct flush_tlb_info {
 
 #define local_flush_tlb() __flush_tlb()
 
-#define flush_tlb_mm(mm)	flush_tlb_mm_range(mm, 0UL, TLB_FLUSH_ALL, 0UL)
+#define flush_tlb_mm(mm)						\
+		flush_tlb_mm_range(mm, 0UL, TLB_FLUSH_ALL, 0UL, true)
 
 #define flush_tlb_range(vma, start, end)				\
 	flush_tlb_mm_range((vma)->vm_mm, start, end,			\
 			   ((vma)->vm_flags & VM_HUGETLB)		\
 				? huge_page_shift(hstate_vma(vma))	\
-				: PAGE_SHIFT)
+				: PAGE_SHIFT, false)
 
 extern void flush_tlb_all(void);
 extern void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
-				unsigned long end, unsigned int stride_shift);
+				unsigned long end, unsigned int stride_shift,
+				bool freed_tables);
 extern void flush_tlb_kernel_range(unsigned long start, unsigned long end);
 
 static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
 {
-	flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, PAGE_SHIFT);
+	flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, PAGE_SHIFT, false);
 }
 
 void native_flush_tlb_others(const struct cpumask *cpumask,
diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
index 7fdb2414ca65..ab18e0884dc6 100644
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -273,7 +273,7 @@ map_ldt_struct(struct mm_struct *mm, struct ldt_struct *ldt, int slot)
 	map_ldt_struct_to_user(mm);
 
 	va = (unsigned long)ldt_slot_va(slot);
-	flush_tlb_mm_range(mm, va, va + LDT_SLOT_STRIDE, PAGE_SHIFT);
+	flush_tlb_mm_range(mm, va, va + LDT_SLOT_STRIDE, PAGE_SHIFT, false);
 
 	ldt->slot = slot;
 	return 0;
diff --git a/arch/x86/kernel/vm86_32.c b/arch/x86/kernel/vm86_32.c
index 52fed70f671e..c2fd39752da8 100644
--- a/arch/x86/kernel/vm86_32.c
+++ b/arch/x86/kernel/vm86_32.c
@@ -199,7 +199,7 @@ static void mark_screen_rdonly(struct mm_struct *mm)
 	pte_unmap_unlock(pte, ptl);
 out:
 	up_write(&mm->mmap_sem);
-	flush_tlb_mm_range(mm, 0xA0000, 0xA0000 + 32*PAGE_SIZE, PAGE_SHIFT);
+	flush_tlb_mm_range(mm, 0xA0000, 0xA0000 + 32*PAGE_SIZE, PAGE_SHIFT, false);
 }
 
 
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 9fb30d27854b..14bf39fc0447 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -609,7 +609,8 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 static unsigned long tlb_single_page_flush_ceiling __read_mostly = 33;
 
 void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
-				unsigned long end, unsigned int stride_shift)
+				unsigned long end, unsigned int stride_shift,
+				bool freed_tables)
 {
 	int cpu;
 

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [tip:x86/mm] x86/mm/tlb: Add freed_tables element to flush_tlb_info
  2018-09-26  3:58 ` [PATCH 6/7] Add freed_tables element to flush_tlb_info Rik van Riel
@ 2018-10-09 15:00   ` tip-bot for Rik van Riel
  0 siblings, 0 replies; 21+ messages in thread
From: tip-bot for Rik van Riel @ 2018-10-09 15:00 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: linux-kernel, riel, peterz, mingo, tglx, hpa

Commit-ID:  97807813fe7074ee865d6bc1df1d0f8fb878ee9d
Gitweb:     https://git.kernel.org/tip/97807813fe7074ee865d6bc1df1d0f8fb878ee9d
Author:     Rik van Riel <riel@surriel.com>
AuthorDate: Tue, 25 Sep 2018 23:58:43 -0400
Committer:  Peter Zijlstra <peterz@infradead.org>
CommitDate: Tue, 9 Oct 2018 16:51:12 +0200

x86/mm/tlb: Add freed_tables element to flush_tlb_info

Pass the information on to native_flush_tlb_others.

No functional changes.

Cc: npiggin@gmail.com
Cc: mingo@kernel.org
Cc: will.deacon@arm.com
Cc: songliubraving@fb.com
Cc: kernel-team@fb.com
Cc: hpa@zytor.com
Cc: luto@kernel.org
Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20180926035844.1420-7-riel@surriel.com
---
 arch/x86/include/asm/tlbflush.h | 1 +
 arch/x86/mm/tlb.c               | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 1dea9860ce5b..323a313947e0 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -532,6 +532,7 @@ struct flush_tlb_info {
 	unsigned long		end;
 	u64			new_tlb_gen;
 	unsigned int		stride_shift;
+	bool			freed_tables;
 };
 
 #define local_flush_tlb() __flush_tlb()
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 14bf39fc0447..92e46f4c058c 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -617,6 +617,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 	struct flush_tlb_info info __aligned(SMP_CACHE_BYTES) = {
 		.mm = mm,
 		.stride_shift = stride_shift,
+		.freed_tables = freed_tables,
 	};
 
 	cpu = get_cpu();

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [tip:x86/mm] x86/mm/tlb: Make lazy TLB mode lazier
  2018-09-26  3:58 ` [PATCH 7/7] x86/mm/tlb: Make lazy TLB mode lazier Rik van Riel
  2018-10-01 16:07   ` Peter Zijlstra
@ 2018-10-09 15:01   ` tip-bot for Rik van Riel
  1 sibling, 0 replies; 21+ messages in thread
From: tip-bot for Rik van Riel @ 2018-10-09 15:01 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: songliubraving, mingo, riel, peterz, linux-kernel, tglx, hpa

Commit-ID:  145f573b89a62bf53cfc0144fa9b1c56b0f70b45
Gitweb:     https://git.kernel.org/tip/145f573b89a62bf53cfc0144fa9b1c56b0f70b45
Author:     Rik van Riel <riel@surriel.com>
AuthorDate: Tue, 25 Sep 2018 23:58:44 -0400
Committer:  Peter Zijlstra <peterz@infradead.org>
CommitDate: Tue, 9 Oct 2018 16:51:12 +0200

x86/mm/tlb: Make lazy TLB mode lazier

Lazy TLB mode can result in an idle CPU being woken up by a TLB flush,
when all it really needs to do is reload %CR3 at the next context switch,
assuming no page table pages got freed.

Memory ordering is used to prevent race conditions between switch_mm_irqs_off,
which checks whether .tlb_gen changed, and the TLB invalidation code, which
increments .tlb_gen whenever page table entries get invalidated.

The atomic increment in inc_mm_tlb_gen is its own barrier; the context
switch code adds an explicit barrier between reading tlbstate.is_lazy and
next->context.tlb_gen.

CPUs in lazy TLB mode remain part of the mm_cpumask(mm), both because
that allows TLB flush IPIs to be sent at page table freeing time, and
because the cache line bouncing on the mm_cpumask(mm) was responsible
for about half the CPU use in switch_mm_irqs_off().

We can change native_flush_tlb_others() without touching other
(paravirt) implementations of flush_tlb_others() because we'll be
flushing less. The existing implementations flush more and are
therefore still correct.

Cc: npiggin@gmail.com
Cc: mingo@kernel.org
Cc: will.deacon@arm.com
Cc: kernel-team@fb.com
Cc: luto@kernel.org
Cc: hpa@zytor.com
Tested-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20180926035844.1420-8-riel@surriel.com
---
 arch/x86/mm/tlb.c | 67 +++++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 58 insertions(+), 9 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 92e46f4c058c..7d68489cfdb1 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -185,6 +185,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 {
 	struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm);
 	u16 prev_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
+	bool was_lazy = this_cpu_read(cpu_tlbstate.is_lazy);
 	unsigned cpu = smp_processor_id();
 	u64 next_tlb_gen;
 	bool need_flush;
@@ -242,17 +243,40 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 			   next->context.ctx_id);
 
 		/*
-		 * We don't currently support having a real mm loaded without
-		 * our cpu set in mm_cpumask().  We have all the bookkeeping
-		 * in place to figure out whether we would need to flush
-		 * if our cpu were cleared in mm_cpumask(), but we don't
-		 * currently use it.
+		 * Even in lazy TLB mode, the CPU should stay set in the
+		 * mm_cpumask. The TLB shootdown code can figure out from
+		 * cpu_tlbstate.is_lazy whether or not to send an IPI.
 		 */
 		if (WARN_ON_ONCE(real_prev != &init_mm &&
 				 !cpumask_test_cpu(cpu, mm_cpumask(next))))
 			cpumask_set_cpu(cpu, mm_cpumask(next));
 
-		return;
+		/*
+		 * If the CPU is not in lazy TLB mode, we are just switching
+		 * from one thread in a process to another thread in the same
+		 * process. No TLB flush required.
+		 */
+		if (!was_lazy)
+			return;
+
+		/*
+		 * Read the tlb_gen to check whether a flush is needed.
+		 * If the TLB is up to date, just use it.
+		 * The barrier synchronizes with the tlb_gen increment in
+		 * the TLB shootdown code.
+		 */
+		smp_mb();
+		next_tlb_gen = atomic64_read(&next->context.tlb_gen);
+		if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) ==
+				next_tlb_gen)
+			return;
+
+		/*
+		 * TLB contents went out of date while we were in lazy
+		 * mode. Fall through to the TLB switching code below.
+		 */
+		new_asid = prev_asid;
+		need_flush = true;
 	} else {
 		u64 last_ctx_id = this_cpu_read(cpu_tlbstate.last_ctx_id);
 
@@ -346,8 +370,10 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 	this_cpu_write(cpu_tlbstate.loaded_mm, next);
 	this_cpu_write(cpu_tlbstate.loaded_mm_asid, new_asid);
 
-	load_mm_cr4(next);
-	switch_ldt(real_prev, next);
+	if (next != real_prev) {
+		load_mm_cr4(next);
+		switch_ldt(real_prev, next);
+	}
 }
 
 /*
@@ -455,6 +481,9 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
 		 * paging-structure cache to avoid speculatively reading
 		 * garbage into our TLB.  Since switching to init_mm is barely
 		 * slower than a minimal flush, just switch to init_mm.
+		 *
+		 * This should be rare, with native_flush_tlb_others skipping
+		 * IPIs to lazy TLB mode CPUs.
 		 */
 		switch_mm_irqs_off(NULL, &init_mm, NULL);
 		return;
@@ -557,6 +586,11 @@ static void flush_tlb_func_remote(void *info)
 	flush_tlb_func_common(f, false, TLB_REMOTE_SHOOTDOWN);
 }
 
+static bool tlb_is_not_lazy(int cpu, void *data)
+{
+	return !per_cpu(cpu_tlbstate.is_lazy, cpu);
+}
+
 void native_flush_tlb_others(const struct cpumask *cpumask,
 			     const struct flush_tlb_info *info)
 {
@@ -592,8 +626,23 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 					       (void *)info, 1);
 		return;
 	}
-	smp_call_function_many(cpumask, flush_tlb_func_remote,
+
+	/*
+	 * If no page tables were freed, we can skip sending IPIs to
+	 * CPUs in lazy TLB mode. They will flush the CPU themselves
+	 * at the next context switch.
+	 *
+	 * However, if page tables are getting freed, we need to send the
+	 * IPI everywhere, to prevent CPUs in lazy TLB mode from tripping
+	 * up on the new contents of what used to be page tables, while
+	 * doing a speculative memory access.
+	 */
+	if (info->freed_tables)
+		smp_call_function_many(cpumask, flush_tlb_func_remote,
 			       (void *)info, 1);
+	else
+		on_each_cpu_cond_mask(tlb_is_not_lazy, flush_tlb_func_remote,
+				(void *)info, 1, GFP_ATOMIC, cpumask);
 }
 
 /*

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 6/7] Add freed_tables element to flush_tlb_info
  2018-09-24 18:37 [PATCH " Rik van Riel
@ 2018-09-24 18:37 ` Rik van Riel
  0 siblings, 0 replies; 21+ messages in thread
From: Rik van Riel @ 2018-09-24 18:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, kernel-team, songliubraving, mingo, will.deacon, hpa,
	luto, npiggin, Rik van Riel

Pass the information on to native_flush_tlb_others.

No functional changes.

Signed-off-by: Rik van Riel <riel@surriel.com>
---
 arch/x86/include/asm/tlbflush.h | 1 +
 arch/x86/mm/tlb.c               | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index ae1ef755cb94..dc4d99579984 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -511,6 +511,7 @@ struct flush_tlb_info {
 	unsigned long		start;
 	unsigned long		end;
 	u64			new_tlb_gen;
+	bool			freed_tables;
 };
 
 #define local_flush_tlb() __flush_tlb()
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 62f147b0f6b7..5d73bde0e4c8 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -578,6 +578,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 
 	struct flush_tlb_info info = {
 		.mm = mm,
+		.freed_tables = freed_tables,
 	};
 
 	cpu = get_cpu();
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

end of thread

Thread overview: 21+ messages
2018-09-26  3:58 [PATCH v2 0/7] x86/mm/tlb: make lazy TLB mode even lazier Rik van Riel
2018-09-26  3:58 ` [PATCH 1/7] x86/mm/tlb: Always use lazy TLB mode Rik van Riel
2018-10-01 15:58   ` Peter Zijlstra
2018-10-01 16:07     ` Rik van Riel
2018-09-26  3:58 ` [PATCH 2/7] x86/mm/tlb: Restructure switch_mm_irqs_off() Rik van Riel
2018-10-02  7:32   ` Peter Zijlstra
2018-09-26  3:58 ` [PATCH 3/7] smp: use __cpumask_set_cpu in on_each_cpu_cond Rik van Riel
2018-10-09 14:59   ` [tip:x86/mm] " tip-bot for Rik van Riel
2018-09-26  3:58 ` [PATCH 4/7] smp,cpumask: introduce on_each_cpu_cond_mask Rik van Riel
2018-10-09 14:59   ` [tip:x86/mm] " tip-bot for Rik van Riel
2018-09-26  3:58 ` [PATCH 5/7] Add freed_tables argument to flush_tlb_mm_range Rik van Riel
2018-10-09 15:00   ` [tip:x86/mm] x86/mm/tlb: " tip-bot for Rik van Riel
2018-09-26  3:58 ` [PATCH 6/7] Add freed_tables element to flush_tlb_info Rik van Riel
2018-10-09 15:00   ` [tip:x86/mm] x86/mm/tlb: " tip-bot for Rik van Riel
2018-09-26  3:58 ` [PATCH 7/7] x86/mm/tlb: Make lazy TLB mode lazier Rik van Riel
2018-10-01 16:07   ` Peter Zijlstra
2018-10-09 15:01   ` [tip:x86/mm] " tip-bot for Rik van Riel
2018-10-01 16:09 ` [PATCH v2 0/7] x86/mm/tlb: make lazy TLB mode even lazier Peter Zijlstra
2018-10-02  7:44 ` Peter Zijlstra
2018-10-02 13:41   ` Rik van Riel
  -- strict thread matches above, loose matches on Subject: below --
2018-09-24 18:37 [PATCH " Rik van Riel
2018-09-24 18:37 ` [PATCH 6/7] Add freed_tables element to flush_tlb_info Rik van Riel
