* [PATCH v2 00/10] PCID and improved laziness
@ 2017-06-14  4:56 Andy Lutomirski
  2017-06-14  4:56 ` [PATCH v2 01/10] x86/ldt: Simplify LDT switching logic Andy Lutomirski
                   ` (11 more replies)
  0 siblings, 12 replies; 30+ messages in thread
From: Andy Lutomirski @ 2017-06-14  4:56 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Borislav Petkov, Linus Torvalds, Andrew Morton,
	Mel Gorman, linux-mm, Nadav Amit, Rik van Riel, Dave Hansen,
	Arjan van de Ven, Peter Zijlstra, Andy Lutomirski

There are three performance benefits here:

1. TLB flushing is slow.  (I.e. the flush itself takes a while.)
   This avoids many of them when switching tasks by using PCID.  In
   a stupid little benchmark I did, it saves about 100ns on my laptop
   per context switch.  I'll try to improve that benchmark; a sketch
   of one way to do this sort of measurement follows this list.

2. Mms that have been used recently on a given CPU might get to keep
   their TLB entries alive across process switches with this patch
   set.  TLB fills are pretty fast on modern CPUs, but they're even
   faster when they don't happen.

3. Lazy TLB is way better.  We used to do two stupid things when we
   ran kernel threads: we'd send IPIs to flush user contexts on their
   CPUs and then we'd write to CR3 for no particular reason as an excuse
   to stop further IPIs.  With this patch, we do neither.
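
The ~100ns/switch number in (1) can be ballparked with a simple pipe
ping-pong between two processes pinned to one CPU, so that each round
trip costs two context switches.  The sketch below is illustrative
only -- it is not the benchmark actually used for this series:

/*
 * Hedged sketch of a context-switch ping-pong microbenchmark.
 * Run with both processes pinned to one CPU, e.g. "taskset -c 0".
 */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
	int a[2], b[2];
	char c = 0;
	long i, iters = 200000;
	struct timespec t0, t1;
	double ns;

	if (pipe(a) || pipe(b))
		return 1;

	if (fork() == 0) {
		/* child: echo each byte straight back */
		while (read(a[0], &c, 1) == 1 && write(b[1], &c, 1) == 1)
			;
		_exit(0);
	}

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < iters; i++)
		if (write(a[1], &c, 1) != 1 || read(b[0], &c, 1) != 1)
			return 1;
	clock_gettime(CLOCK_MONOTONIC, &t1);

	ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
	printf("%.1f ns per context switch\n", ns / (2.0 * iters));
	return 0;
}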

This will, in general, perform suboptimally if paravirt TLB flushing
is in use (currently just Xen, I think, but Hyper-V is in the works).
The code is structured so we could fix it in one of two ways: we
could take a spinlock when touching the percpu state so we can update
it remotely after a paravirt flush, or we could be more careful about
exactly how we access the state and use cmpxchg16b to do atomic
remote updates.  (On SMP systems without cmpxchg16b, we'd just skip
the optimization entirely.)
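
To make the first option a bit more concrete, here is a minimal,
hypothetical sketch (not part of this series; the lock and the helper
are invented names, and it uses the tlb_state::ctxs[] state added
later in the series):

/*
 * Sketch only: guard each CPU's (ctx_id, tlb_gen) records with a
 * per-cpu lock so that a paravirt flush hook can publish the new
 * generation to a remote CPU.  Lock initialization is omitted, and
 * the local fast paths (switch_mm_irqs_off(), flush_tlb_func_*())
 * would need to take the same lock when touching ctxs[].
 */
static DEFINE_PER_CPU(raw_spinlock_t, tlb_ctx_lock);

static void note_remote_paravirt_flush(int cpu, u64 ctx_id, u64 new_tlb_gen)
{
	struct tlb_state *s = &per_cpu(cpu_tlbstate, cpu);

	raw_spin_lock(&per_cpu(tlb_ctx_lock, cpu));
	if (s->ctxs[0].ctx_id == ctx_id && s->ctxs[0].tlb_gen < new_tlb_gen)
		s->ctxs[0].tlb_gen = new_tlb_gen;
	raw_spin_unlock(&per_cpu(tlb_ctx_lock, cpu));
}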

This is based on tip:x86/mm.  The branch is here if you want to play:
https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/log/?h=x86/pcid

Changes from RFC:
 - flush_tlb_func_common() no longer gets reentered (Nadav)
 - Fix ASID corruption on unlazying (kbuild bot)
 - Move Xen init to the right place
 - Misc cleanups

Andy Lutomirski (10):
  x86/ldt: Simplify LDT switching logic
  x86/mm: Remove reset_lazy_tlbstate()
  x86/mm: Give each mm TLB flush generation a unique ID
  x86/mm: Track the TLB's tlb_gen and update the flushing algorithm
  x86/mm: Rework lazy TLB mode and TLB freshness tracking
  x86/mm: Stop calling leave_mm() in idle code
  x86/mm: Disable PCID on 32-bit kernels
  x86/mm: Add nopcid to turn off PCID
  x86/mm: Enable CR4.PCIDE on supported systems
  x86/mm: Try to preserve old TLB entries using PCID

 Documentation/admin-guide/kernel-parameters.txt |   2 +
 arch/ia64/include/asm/acpi.h                    |   2 -
 arch/x86/include/asm/acpi.h                     |   2 -
 arch/x86/include/asm/disabled-features.h        |   4 +-
 arch/x86/include/asm/mmu.h                      |  25 +-
 arch/x86/include/asm/mmu_context.h              |  40 ++-
 arch/x86/include/asm/processor-flags.h          |   2 +
 arch/x86/include/asm/tlbflush.h                 |  89 +++++-
 arch/x86/kernel/cpu/bugs.c                      |   8 +
 arch/x86/kernel/cpu/common.c                    |  33 +++
 arch/x86/kernel/smpboot.c                       |   1 -
 arch/x86/mm/init.c                              |   2 +-
 arch/x86/mm/tlb.c                               | 368 +++++++++++++++---------
 arch/x86/xen/enlighten_pv.c                     |   6 +
 drivers/acpi/processor_idle.c                   |   2 -
 drivers/idle/intel_idle.c                       |   9 +-
 16 files changed, 429 insertions(+), 166 deletions(-)

-- 
2.9.4


* [PATCH v2 01/10] x86/ldt: Simplify LDT switching logic
  2017-06-14  4:56 [PATCH v2 00/10] PCID and improved laziness Andy Lutomirski
@ 2017-06-14  4:56 ` Andy Lutomirski
  2017-06-15 18:53   ` Rik van Riel
  2017-06-14  4:56 ` [PATCH v2 02/10] x86/mm: Remove reset_lazy_tlbstate() Andy Lutomirski
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 30+ messages in thread
From: Andy Lutomirski @ 2017-06-14  4:56 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Borislav Petkov, Linus Torvalds, Andrew Morton,
	Mel Gorman, linux-mm, Nadav Amit, Rik van Riel, Dave Hansen,
	Arjan van de Ven, Peter Zijlstra, Andy Lutomirski

Originally, Linux reloaded the LDT whenever the prev mm or the next
mm had an LDT.  It was changed in 0bbed3beb4f2 ("[PATCH]
Thread-Local Storage (TLS) support") (from the historical tree) like
this:

-		/* load_LDT, if either the previous or next thread
-		 * has a non-default LDT.
+		/*
+		 * load the LDT, if the LDT is different:
		 */
-		if (next->context.size+prev->context.size)
+		if (unlikely(prev->context.ldt != next->context.ldt))
			load_LDT(&next->context);

The current code is unlikely to avoid any LDT reloads, since different
mms won't share an LDT.

When we redo lazy mode to stop flush IPIs without switching to
init_mm, though, the current logic would become incorrect: it would
be possible to have real_prev == next but nonetheless have a stale
LDT descriptor.

Simplify the code to update LDTR if either the previous or the next
mm has an LDT, i.e. effectively restore the historical logic.
While we're at it, clean up the code by moving all the ifdeffery to
a header where it belongs.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/mmu_context.h | 26 ++++++++++++++++++++++++++
 arch/x86/mm/tlb.c                  | 20 ++------------------
 2 files changed, 28 insertions(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 1458f530948b..ecfcb6643c9b 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -93,6 +93,32 @@ static inline void load_mm_ldt(struct mm_struct *mm)
 #else
 	clear_LDT();
 #endif
+}
+
+static inline void switch_ldt(struct mm_struct *prev, struct mm_struct *next)
+{
+#ifdef CONFIG_MODIFY_LDT_SYSCALL
+	/*
+	 * Load the LDT if either the old or new mm had an LDT.
+	 *
+	 * An mm will never go from having an LDT to not having an LDT.  Two
+	 * mms never share an LDT, so we don't gain anything by checking to
+	 * see whether the LDT changed.  There's also no guarantee that
+	 * prev->context.ldt actually matches LDTR, but, if LDTR is non-NULL,
+	 * then prev->context.ldt will also be non-NULL.
+	 *
+	 * If we really cared, we could optimize the case where prev == next
+	 * and we're exiting lazy mode.  Most of the time, if this happens,
+	 * we don't actually need to reload LDTR, but modify_ldt() is mostly
+	 * used by legacy code and emulators where we don't need this level of
+	 * performance.
+	 *
+	 * This uses | instead of || because it generates better code.
+	 */
+	if (unlikely((unsigned long)prev->context.ldt |
+		     (unsigned long)next->context.ldt))
+		load_mm_ldt(next);
+#endif
 
 	DEBUG_LOCKS_WARN_ON(preemptible());
 }
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 2a5e851f2035..b2485d69f7c2 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -148,25 +148,9 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		     real_prev != &init_mm);
 	cpumask_clear_cpu(cpu, mm_cpumask(real_prev));
 
-	/* Load per-mm CR4 state */
+	/* Load per-mm CR4 and LDTR state */
 	load_mm_cr4(next);
-
-#ifdef CONFIG_MODIFY_LDT_SYSCALL
-	/*
-	 * Load the LDT, if the LDT is different.
-	 *
-	 * It's possible that prev->context.ldt doesn't match
-	 * the LDT register.  This can happen if leave_mm(prev)
-	 * was called and then modify_ldt changed
-	 * prev->context.ldt but suppressed an IPI to this CPU.
-	 * In this case, prev->context.ldt != NULL, because we
-	 * never set context.ldt to NULL while the mm still
-	 * exists.  That means that next->context.ldt !=
-	 * prev->context.ldt, because mms never share an LDT.
-	 */
-	if (unlikely(real_prev->context.ldt != next->context.ldt))
-		load_mm_ldt(next);
-#endif
+	switch_ldt(real_prev, next);
 }
 
 /*
-- 
2.9.4


* [PATCH v2 02/10] x86/mm: Remove reset_lazy_tlbstate()
  2017-06-14  4:56 [PATCH v2 00/10] PCID and improved laziness Andy Lutomirski
  2017-06-14  4:56 ` [PATCH v2 01/10] x86/ldt: Simplify LDT switching logic Andy Lutomirski
@ 2017-06-14  4:56 ` Andy Lutomirski
  2017-06-15 19:29   ` Rik van Riel
  2017-06-14  4:56 ` [PATCH v2 03/10] x86/mm: Give each mm TLB flush generation a unique ID Andy Lutomirski
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 30+ messages in thread
From: Andy Lutomirski @ 2017-06-14  4:56 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Borislav Petkov, Linus Torvalds, Andrew Morton,
	Mel Gorman, linux-mm, Nadav Amit, Rik van Riel, Dave Hansen,
	Arjan van de Ven, Peter Zijlstra, Andy Lutomirski

The only call site also calls idle_task_exit(), and idle_task_exit()
puts us into a clean state by explicitly switching to init_mm.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/tlbflush.h | 8 --------
 arch/x86/kernel/smpboot.c       | 1 -
 2 files changed, 9 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 5f78c6a77578..50ea3482e1d1 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -259,14 +259,6 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 #define TLBSTATE_OK	1
 #define TLBSTATE_LAZY	2
 
-static inline void reset_lazy_tlbstate(void)
-{
-	this_cpu_write(cpu_tlbstate.state, 0);
-	this_cpu_write(cpu_tlbstate.loaded_mm, &init_mm);
-
-	WARN_ON(read_cr3_pa() != __pa_symbol(swapper_pg_dir));
-}
-
 static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
 					struct mm_struct *mm)
 {
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index f04479a8f74f..6169a56aab49 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1589,7 +1589,6 @@ void native_cpu_die(unsigned int cpu)
 void play_dead_common(void)
 {
 	idle_task_exit();
-	reset_lazy_tlbstate();
 
 	/* Ack it */
 	(void)cpu_report_death();
-- 
2.9.4


* [PATCH v2 03/10] x86/mm: Give each mm TLB flush generation a unique ID
  2017-06-14  4:56 [PATCH v2 00/10] PCID and improved laziness Andy Lutomirski
  2017-06-14  4:56 ` [PATCH v2 01/10] x86/ldt: Simplify LDT switching logic Andy Lutomirski
  2017-06-14  4:56 ` [PATCH v2 02/10] x86/mm: Remove reset_lazy_tlbstate() Andy Lutomirski
@ 2017-06-14  4:56 ` Andy Lutomirski
  2017-06-14 15:54   ` Dave Hansen
  2017-06-14  4:56 ` [PATCH v2 04/10] x86/mm: Track the TLB's tlb_gen and update the flushing algorithm Andy Lutomirski
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 30+ messages in thread
From: Andy Lutomirski @ 2017-06-14  4:56 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Borislav Petkov, Linus Torvalds, Andrew Morton,
	Mel Gorman, linux-mm, Nadav Amit, Rik van Riel, Dave Hansen,
	Arjan van de Ven, Peter Zijlstra, Andy Lutomirski

This adds two new variables to mmu_context_t: ctx_id and tlb_gen.
ctx_id uniquely identifies the mm_struct and will never be reused.
For a given mm_struct (and hence ctx_id), tlb_gen is a monotonic
count of the number of times that a TLB flush has been requested.
The pair (ctx_id, tlb_gen) can be used as an identifier for TLB
flush actions and will be used in subsequent patches to reliably
determine whether all needed TLB flushes have occurred on a given
CPU.
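
As a hedged illustration of the intended usage (the two helpers below
are hypothetical stand-ins, not functions added by this series), a
flush requester is expected to follow this order:

/*
 * Sketch only: change_the_page_tables() and tell_cpus_to_flush() are
 * placeholders for the real PTE-modification and IPI machinery.
 */
static void example_flush_request(struct mm_struct *mm)
{
	u64 new_tlb_gen;

	change_the_page_tables(mm);		/* 1: update the PTEs */
	new_tlb_gen = bump_mm_tlb_gen(mm);	/* 2: bump mm->context.tlb_gen */
	tell_cpus_to_flush(mm, new_tlb_gen);	/* 3: ask CPUs to catch up */
}

A CPU that has already recorded this ctx_id with a tlb_gen of at least
new_tlb_gen can then skip the flush; that is what later patches in
this series implement.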

This patch is split out for ease of review.  By itself, it has no
real effect other than creating and updating the new variables.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/mmu.h         | 25 +++++++++++++++++++++++--
 arch/x86/include/asm/mmu_context.h |  5 +++++
 arch/x86/include/asm/tlbflush.h    | 18 ++++++++++++++++++
 arch/x86/mm/tlb.c                  |  6 ++++--
 4 files changed, 50 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
index 79b647a7ebd0..bb8c597c2248 100644
--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -3,12 +3,28 @@
 
 #include <linux/spinlock.h>
 #include <linux/mutex.h>
+#include <linux/atomic.h>
 
 /*
- * The x86 doesn't have a mmu context, but
- * we put the segment information here.
+ * x86 has arch-specific MMU state beyond what lives in mm_struct.
  */
 typedef struct {
+	/*
+	 * ctx_id uniquely identifies this mm_struct.  A ctx_id will never
+	 * be reused, and zero is not a valid ctx_id.
+	 */
+	u64 ctx_id;
+
+	/*
+	 * Any code that needs to do any sort of TLB flushing for this
+	 * mm will first make its changes to the page tables, then
+	 * increment tlb_gen, then flush.  This lets the low-level
+	 * flushing code keep track of what needs flushing.
+	 *
+	 * This is not used on Xen PV.
+	 */
+	atomic64_t tlb_gen;
+
 #ifdef CONFIG_MODIFY_LDT_SYSCALL
 	struct ldt_struct *ldt;
 #endif
@@ -37,6 +53,11 @@ typedef struct {
 #endif
 } mm_context_t;
 
+#define INIT_MM_CONTEXT(mm)						\
+	.context = {							\
+		.ctx_id = 1,						\
+	}
+
 void leave_mm(int cpu);
 
 #endif /* _ASM_X86_MMU_H */
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index ecfcb6643c9b..e5295d485899 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -129,9 +129,14 @@ static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
 		this_cpu_write(cpu_tlbstate.state, TLBSTATE_LAZY);
 }
 
+extern atomic64_t last_mm_ctx_id;
+
 static inline int init_new_context(struct task_struct *tsk,
 				   struct mm_struct *mm)
 {
+	mm->context.ctx_id = atomic64_inc_return(&last_mm_ctx_id);
+	atomic64_set(&mm->context.tlb_gen, 0);
+
 	#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
 	if (cpu_feature_enabled(X86_FEATURE_OSPKE)) {
 		/* pkey 0 is the default and always allocated */
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 50ea3482e1d1..1eb946c0507e 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -57,6 +57,23 @@ static inline void invpcid_flush_all_nonglobals(void)
 	__invpcid(0, 0, INVPCID_TYPE_ALL_NON_GLOBAL);
 }
 
+static inline u64 bump_mm_tlb_gen(struct mm_struct *mm)
+{
+	u64 new_tlb_gen;
+
+	/*
+	 * Bump the generation count.  This also serves as a full barrier
+	 * that synchronizes with switch_mm: callers are required to order
+	 * their read of mm_cpumask after their writes to the paging
+	 * structures.
+	 */
+	smp_mb__before_atomic();
+	new_tlb_gen = atomic64_inc_return(&mm->context.tlb_gen);
+	smp_mb__after_atomic();
+
+	return new_tlb_gen;
+}
+
 #ifdef CONFIG_PARAVIRT
 #include <asm/paravirt.h>
 #else
@@ -262,6 +279,7 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
 					struct mm_struct *mm)
 {
+	bump_mm_tlb_gen(mm);
 	cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
 }
 
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index b2485d69f7c2..7c99c50e8bc9 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -28,6 +28,8 @@
  *	Implement flush IPI by CALL_FUNCTION_VECTOR, Alex Shi
  */
 
+atomic64_t last_mm_ctx_id = ATOMIC64_INIT(1);
+
 void leave_mm(int cpu)
 {
 	struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
@@ -283,8 +285,8 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 
 	cpu = get_cpu();
 
-	/* Synchronize with switch_mm. */
-	smp_mb();
+	/* This is also a barrier that synchronizes with switch_mm(). */
+	bump_mm_tlb_gen(mm);
 
 	/* Should we flush just the requested range? */
 	if ((end != TLB_FLUSH_ALL) &&
-- 
2.9.4


* [PATCH v2 04/10] x86/mm: Track the TLB's tlb_gen and update the flushing algorithm
  2017-06-14  4:56 [PATCH v2 00/10] PCID and improved laziness Andy Lutomirski
                   ` (2 preceding siblings ...)
  2017-06-14  4:56 ` [PATCH v2 03/10] x86/mm: Give each mm TLB flush generation a unique ID Andy Lutomirski
@ 2017-06-14  4:56 ` Andy Lutomirski
  2017-06-14  4:56 ` [PATCH v2 05/10] x86/mm: Rework lazy TLB mode and TLB freshness tracking Andy Lutomirski
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 30+ messages in thread
From: Andy Lutomirski @ 2017-06-14  4:56 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Borislav Petkov, Linus Torvalds, Andrew Morton,
	Mel Gorman, linux-mm, Nadav Amit, Rik van Riel, Dave Hansen,
	Arjan van de Ven, Peter Zijlstra, Andy Lutomirski

There are two kernel features that would benefit from tracking
how up-to-date each CPU's TLB is in the case where IPIs aren't keeping
it up to date in real time:

 - Lazy mm switching currently works by switching to init_mm when
   it would otherwise flush.  This is wasteful: there isn't fundamentally
   any need to update CR3 at all when going lazy or when returning from
   lazy mode, nor is there any need to receive flush IPIs at all.  Instead,
   we should just stop trying to keep the TLB coherent when we go lazy and,
   when unlazying, check whether we missed any flushes.

 - PCID will let us keep recent user contexts alive in the TLB.  If we
   start doing this, we need a way to decide whether those contexts are
   up to date.

On some paravirt systems, remote TLBs can be flushed without IPIs.
This won't update the target CPUs' tlb_gens, which may cause
unnecessary local flushes later on.  We can address this if it becomes
a problem by carefully updating the target CPU's tlb_gen directly.

By itself, this patch is a very minor optimization that avoids
unnecessary flushes when multiple TLB flushes targeting the same CPU
race.
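
Condensed from the flush_tlb_func_common() changes below, the per-CPU
catch-up logic amounts to roughly this (lazy-mode check, warnings, and
tracing omitted):

	u64 mm_tlb_gen    = atomic64_read(&loaded_mm->context.tlb_gen);
	u64 local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[0].tlb_gen);

	if (local_tlb_gen == mm_tlb_gen)
		return;				/* already fully up to date */

	if (f->end != TLB_FLUSH_ALL &&
	    f->new_tlb_gen == local_tlb_gen + 1 &&
	    f->new_tlb_gen == mm_tlb_gen) {
		/* Partial flush: this request is the only missing step. */
		unsigned long addr;

		for (addr = f->start; addr < f->end; addr += PAGE_SIZE)
			__flush_tlb_single(addr);
	} else {
		/* Full flush catches us all the way up. */
		local_flush_tlb();
	}

	this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen, mm_tlb_gen);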

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/tlbflush.h | 37 +++++++++++++++++++
 arch/x86/mm/tlb.c               | 79 +++++++++++++++++++++++++++++++++++++----
 2 files changed, 109 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 1eb946c0507e..4f6c30d6ec39 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -82,6 +82,11 @@ static inline u64 bump_mm_tlb_gen(struct mm_struct *mm)
 #define __flush_tlb_single(addr) __native_flush_tlb_single(addr)
 #endif
 
+struct tlb_context {
+	u64 ctx_id;
+	u64 tlb_gen;
+};
+
 struct tlb_state {
 	/*
 	 * cpu_tlbstate.loaded_mm should match CR3 whenever interrupts
@@ -97,6 +102,21 @@ struct tlb_state {
 	 * disabling interrupts when modifying either one.
 	 */
 	unsigned long cr4;
+
+	/*
+	 * This is a list of all contexts that might exist in the TLB.
+	 * Since we don't yet use PCID, there is only one context.
+	 *
+	 * For each context, ctx_id indicates which mm the TLB's user
+	 * entries came from.  As an invariant, the TLB will never
+	 * contain entries that are out-of-date as when that mm reached
+	 * the tlb_gen in the list.
+	 *
+	 * To be clear, this means that it's legal for the TLB code to
+	 * flush the TLB without updating tlb_gen.  This can happen
+	 * (for now, at least) due to paravirt remote flushes.
+	 */
+	struct tlb_context ctxs[1];
 };
 DECLARE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate);
 
@@ -248,9 +268,26 @@ static inline void __flush_tlb_one(unsigned long addr)
  * and page-granular flushes are available only on i486 and up.
  */
 struct flush_tlb_info {
+	/*
+	 * We support several kinds of flushes.
+	 *
+	 * - Fully flush a single mm.  flush_mm will be set, flush_end will be
+	 *   TLB_FLUSH_ALL, and new_tlb_gen will be the tlb_gen to which the
+	 *   IPI sender is trying to catch us up.
+	 *
+	 * - Partially flush a single mm.  flush_mm will be set, flush_start
+	 *   and flush_end will indicate the range, and new_tlb_gen will be
+	 *   set such that the changes between generation new_tlb_gen-1 and
+	 *   new_tlb_gen are entirely contained in the indicated range.
+	 *
+	 * - Fully flush all mms whose tlb_gens have been updated.  flush_mm
+	 *   will be NULL, flush_end will be TLB_FLUSH_ALL, and new_tlb_gen
+	 *   will be zero.
+	 */
 	struct mm_struct *mm;
 	unsigned long start;
 	unsigned long end;
+	u64 new_tlb_gen;
 };
 
 #define local_flush_tlb() __flush_tlb()
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 7c99c50e8bc9..3b19ba748e92 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -105,6 +105,9 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 	}
 
 	this_cpu_write(cpu_tlbstate.loaded_mm, next);
+	this_cpu_write(cpu_tlbstate.ctxs[0].ctx_id, next->context.ctx_id);
+	this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen,
+		       atomic64_read(&next->context.tlb_gen));
 
 	WARN_ON_ONCE(cpumask_test_cpu(cpu, mm_cpumask(next)));
 	cpumask_set_cpu(cpu, mm_cpumask(next));
@@ -194,17 +197,70 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 static void flush_tlb_func_common(const struct flush_tlb_info *f,
 				  bool local, enum tlb_flush_reason reason)
 {
+	struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
+
+	/*
+	 * Our memory ordering requirement is that any TLB fills that
+	 * happen after we flush the TLB are ordered after we read
+	 * active_mm's tlb_gen.  We don't need any explicit barrier
+	 * because all x86 flush operations are serializing and the
+	 * atomic64_read operation won't be reordered by the compiler.
+	 */
+	u64 mm_tlb_gen = atomic64_read(&loaded_mm->context.tlb_gen);
+	u64 local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[0].tlb_gen);
+
+	VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[0].ctx_id) !=
+		   loaded_mm->context.ctx_id);
+
 	if (this_cpu_read(cpu_tlbstate.state) != TLBSTATE_OK) {
+		/*
+		 * leave_mm() is adequate to handle any type of flush, and
+		 * we would prefer not to receive further IPIs.
+		 */
 		leave_mm(smp_processor_id());
 		return;
 	}
 
-	if (f->end == TLB_FLUSH_ALL) {
-		local_flush_tlb();
-		if (local)
-			count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
-		trace_tlb_flush(reason, TLB_FLUSH_ALL);
-	} else {
+	if (local_tlb_gen == mm_tlb_gen) {
+		/*
+		 * There's nothing to do: we're already up to date.  This can
+		 * happen if two concurrent flushes happen -- the first IPI to
+		 * be handled can catch us all the way up, leaving no work for
+		 * the second IPI to be handled.
+		 */
+		return;
+	}
+
+	WARN_ON_ONCE(local_tlb_gen > mm_tlb_gen);
+	WARN_ON_ONCE(f->new_tlb_gen > mm_tlb_gen);
+
+	/*
+	 * If we get to this point, we know that our TLB is out of date.
+	 * This does not strictly imply that we need to flush (it's
+	 * possible that f->new_tlb_gen <= local_tlb_gen), but we're
+	 * going to need to flush in the very near future, so we might
+	 * as well get it over with.
+	 *
+	 * The only question is whether to do a full or partial flush.
+	 *
+	 * A partial TLB flush is safe and worthwhile if two conditions are
+	 * met:
+	 *
+	 * 1. We wouldn't be skipping a tlb_gen.  If the requester bumped
+	 *    the mm's tlb_gen from p to p+1, a partial flush is only correct
+	 *    if we would be bumping the local CPU's tlb_gen from p to p+1 as
+	 *    well.
+	 *
+	 * 2. If there are no more flushes on their way.  Partial TLB
+	 *    flushes are not all that much cheaper than full TLB
+	 *    flushes, so it seems unlikely that it would be a
+	 *    performance win to do a partial flush if that won't bring
+	 *    our TLB fully up to date.
+	 */
+	if (f->end != TLB_FLUSH_ALL &&
+	    f->new_tlb_gen == local_tlb_gen + 1 &&
+	    f->new_tlb_gen == mm_tlb_gen) {
+		/* Partial flush */
 		unsigned long addr;
 		unsigned long nr_pages = (f->end - f->start) >> PAGE_SHIFT;
 		addr = f->start;
@@ -215,7 +271,16 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
 		if (local)
 			count_vm_tlb_events(NR_TLB_LOCAL_FLUSH_ONE, nr_pages);
 		trace_tlb_flush(reason, nr_pages);
+	} else {
+		/* Full flush. */
+		local_flush_tlb();
+		if (local)
+			count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
+		trace_tlb_flush(reason, TLB_FLUSH_ALL);
 	}
+
+	/* Both paths above update our state to mm_tlb_gen. */
+	this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen, mm_tlb_gen);
 }
 
 static void flush_tlb_func_local(void *info, enum tlb_flush_reason reason)
@@ -286,7 +351,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 	cpu = get_cpu();
 
 	/* This is also a barrier that synchronizes with switch_mm(). */
-	bump_mm_tlb_gen(mm);
+	info.new_tlb_gen = bump_mm_tlb_gen(mm);
 
 	/* Should we flush just the requested range? */
 	if ((end != TLB_FLUSH_ALL) &&
-- 
2.9.4


* [PATCH v2 05/10] x86/mm: Rework lazy TLB mode and TLB freshness tracking
  2017-06-14  4:56 [PATCH v2 00/10] PCID and improved laziness Andy Lutomirski
                   ` (3 preceding siblings ...)
  2017-06-14  4:56 ` [PATCH v2 04/10] x86/mm: Track the TLB's tlb_gen and update the flushing algorithm Andy Lutomirski
@ 2017-06-14  4:56 ` Andy Lutomirski
  2017-06-14  6:09   ` Juergen Gross
                     ` (2 more replies)
  2017-06-14  4:56 ` [PATCH v2 06/10] x86/mm: Stop calling leave_mm() in idle code Andy Lutomirski
                   ` (6 subsequent siblings)
  11 siblings, 3 replies; 30+ messages in thread
From: Andy Lutomirski @ 2017-06-14  4:56 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Borislav Petkov, Linus Torvalds, Andrew Morton,
	Mel Gorman, linux-mm, Nadav Amit, Rik van Riel, Dave Hansen,
	Arjan van de Ven, Peter Zijlstra, Andy Lutomirski, Andrew Banman,
	Mike Travis, Dimitri Sivanich, Juergen Gross, Boris Ostrovsky

x86's lazy TLB mode used to be fairly weak -- it would switch to
init_mm the first time it tried to flush a lazy TLB.  This meant an
unnecessary CR3 write and, if the flush was remote, an unnecessary
IPI.

Rewrite it entirely.  When we enter lazy mode, we simply remove the
cpu from mm_cpumask.  This means that we need a way to figure out
whether we've missed a flush when we switch back out of lazy mode.
I use the tlb_gen machinery to track whether a context is up to
date.
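
In condensed form (the full diff below has the checks, warnings, and
the CR4/LDTR handling), the new protocol is roughly:

/* Going lazy: just stop participating in flushes for this mm. */
static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
{
	int cpu = smp_processor_id();

	if (cpumask_test_cpu(cpu, mm_cpumask(mm)))
		cpumask_clear_cpu(cpu, mm_cpumask(mm));
}

/*
 * Unlazying (the prev == next case in switch_mm_irqs_off()): resume
 * remote flushes first, then catch up on anything missed while lazy.
 */
	cpumask_set_cpu(cpu, mm_cpumask(next));
	next_tlb_gen = atomic64_read(&next->context.tlb_gen);
	if (this_cpu_read(cpu_tlbstate.ctxs[0].tlb_gen) < next_tlb_gen) {
		this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen, next_tlb_gen);
		write_cr3(__pa(next->pgd));
	}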

Note to reviewers: this patch, by itself, looks a bit odd.  I'm
using an array of length 1 containing (ctx_id, tlb_gen) rather than
just storing tlb_gen, and making it an array isn't necessary yet.
I'm doing this because the next few patches add PCID support, and,
with PCID, we need ctx_id, and the array will end up with a length
greater than 1.  Making it an array now means that there will be
less churn and therefore less stress on your eyeballs.

NB: This is dubious but, AFAICT, still correct on Xen and UV.
xen_exit_mmap() uses mm_cpumask() for nefarious purposes and this
patch changes the way that mm_cpumask() works.  This should be okay,
since Xen *also* iterates all online CPUs to find all the CPUs it
needs to twiddle.

The UV tlbflush code is rather dated and should be changed.

Cc: Andrew Banman <abanman@sgi.com>
Cc: Mike Travis <travis@sgi.com>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/mmu_context.h |   6 +-
 arch/x86/include/asm/tlbflush.h    |   4 -
 arch/x86/mm/init.c                 |   1 -
 arch/x86/mm/tlb.c                  | 242 +++++++++++++++++++------------------
 4 files changed, 131 insertions(+), 122 deletions(-)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index e5295d485899..69a4f1ee86ac 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -125,8 +125,10 @@ static inline void switch_ldt(struct mm_struct *prev, struct mm_struct *next)
 
 static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
 {
-	if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK)
-		this_cpu_write(cpu_tlbstate.state, TLBSTATE_LAZY);
+	int cpu = smp_processor_id();
+
+	if (cpumask_test_cpu(cpu, mm_cpumask(mm)))
+		cpumask_clear_cpu(cpu, mm_cpumask(mm));
 }
 
 extern atomic64_t last_mm_ctx_id;
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 4f6c30d6ec39..87b13e51e867 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -95,7 +95,6 @@ struct tlb_state {
 	 * mode even if we've already switched back to swapper_pg_dir.
 	 */
 	struct mm_struct *loaded_mm;
-	int state;
 
 	/*
 	 * Access to this CR4 shadow and to H/W CR4 is protected by
@@ -310,9 +309,6 @@ static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
 void native_flush_tlb_others(const struct cpumask *cpumask,
 			     const struct flush_tlb_info *info);
 
-#define TLBSTATE_OK	1
-#define TLBSTATE_LAZY	2
-
 static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
 					struct mm_struct *mm)
 {
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 88ee942cb47d..7d6fa4676af9 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -812,7 +812,6 @@ void __init zone_sizes_init(void)
 
 DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate) = {
 	.loaded_mm = &init_mm,
-	.state = 0,
 	.cr4 = ~0UL,	/* fail hard if we screw up cr4 shadow initialization */
 };
 EXPORT_SYMBOL_GPL(cpu_tlbstate);
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 3b19ba748e92..fea2b07ac7d8 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -45,8 +45,8 @@ void leave_mm(int cpu)
 	if (loaded_mm == &init_mm)
 		return;
 
-	if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK)
-		BUG();
+	/* Warn if we're not lazy. */
+	WARN_ON(cpumask_test_cpu(smp_processor_id(), mm_cpumask(loaded_mm)));
 
 	switch_mm(NULL, &init_mm, NULL);
 }
@@ -67,133 +67,118 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 {
 	unsigned cpu = smp_processor_id();
 	struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm);
+	u64 next_tlb_gen;
 
 	/*
-	 * NB: The scheduler will call us with prev == next when
-	 * switching from lazy TLB mode to normal mode if active_mm
-	 * isn't changing.  When this happens, there is no guarantee
-	 * that CR3 (and hence cpu_tlbstate.loaded_mm) matches next.
+	 * NB: The scheduler will call us with prev == next when switching
+	 * from lazy TLB mode to normal mode if active_mm isn't changing.
+	 * When this happens, we don't assume that CR3 (and hence
+	 * cpu_tlbstate.loaded_mm) matches next.
 	 *
 	 * NB: leave_mm() calls us with prev == NULL and tsk == NULL.
 	 */
 
-	this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
+	/* We don't want flush_tlb_func_* to run concurrently with us. */
+	if (IS_ENABLED(CONFIG_PROVE_LOCKING))
+		WARN_ON_ONCE(!irqs_disabled());
+
+	VM_BUG_ON(read_cr3_pa() != __pa(real_prev->pgd));
 
 	if (real_prev == next) {
-		/*
-		 * There's nothing to do: we always keep the per-mm control
-		 * regs in sync with cpu_tlbstate.loaded_mm.  Just
-		 * sanity-check mm_cpumask.
-		 */
-		if (WARN_ON_ONCE(!cpumask_test_cpu(cpu, mm_cpumask(next))))
-			cpumask_set_cpu(cpu, mm_cpumask(next));
-		return;
-	}
+		if (cpumask_test_cpu(cpu, mm_cpumask(next))) {
+			/*
+			 * There's nothing to do: we weren't lazy, and we
+			 * aren't changing our mm.  We don't need to flush
+			 * anything, nor do we need to update CR3, CR4, or
+			 * LDTR.
+			 */
+			return;
+		}
+
+		/* Resume remote flushes and then read tlb_gen. */
+		cpumask_set_cpu(cpu, mm_cpumask(next));
+		next_tlb_gen = atomic64_read(&next->context.tlb_gen);
+
+		VM_BUG_ON(this_cpu_read(cpu_tlbstate.ctxs[0].ctx_id) !=
+			  next->context.ctx_id);
+
+		if (this_cpu_read(cpu_tlbstate.ctxs[0].tlb_gen) <
+		    next_tlb_gen) {
+			/*
+			 * Ideally, we'd have a flush_tlb() variant that
+			 * takes the known CR3 value as input.  This would
+			 * be faster on Xen PV and on hypothetical CPUs
+			 * on which INVPCID is fast.
+			 */
+			this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen,
+				       next_tlb_gen);
+			write_cr3(__pa(next->pgd));
+			/*
+			 * This gets called via leave_mm() in the idle path
+			 * where RCU functions differently.  Tracing normally
+			 * uses RCU, so we have to call the tracepoint
+			 * specially here.
+			 */
+			trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH,
+						TLB_FLUSH_ALL);
+		}
 
-	if (IS_ENABLED(CONFIG_VMAP_STACK)) {
 		/*
-		 * If our current stack is in vmalloc space and isn't
-		 * mapped in the new pgd, we'll double-fault.  Forcibly
-		 * map it.
+		 * We just exited lazy mode, which means that CR4 and/or LDTR
+		 * may be stale.  (Changes to the required CR4 and LDTR states
+		 * are not reflected in tlb_gen.)
 		 */
-		unsigned int stack_pgd_index = pgd_index(current_stack_pointer());
-
-		pgd_t *pgd = next->pgd + stack_pgd_index;
+	} else {
+		if (IS_ENABLED(CONFIG_VMAP_STACK)) {
+			/*
+			 * If our current stack is in vmalloc space and isn't
+			 * mapped in the new pgd, we'll double-fault.  Forcibly
+			 * map it.
+			 */
+			unsigned int stack_pgd_index =
+				pgd_index(current_stack_pointer());
+
+			pgd_t *pgd = next->pgd + stack_pgd_index;
+
+			if (unlikely(pgd_none(*pgd)))
+				set_pgd(pgd, init_mm.pgd[stack_pgd_index]);
+		}
 
-		if (unlikely(pgd_none(*pgd)))
-			set_pgd(pgd, init_mm.pgd[stack_pgd_index]);
-	}
+		/* Stop remote flushes for the previous mm */
+		if (cpumask_test_cpu(cpu, mm_cpumask(real_prev)))
+			cpumask_clear_cpu(cpu, mm_cpumask(real_prev));
 
-	this_cpu_write(cpu_tlbstate.loaded_mm, next);
-	this_cpu_write(cpu_tlbstate.ctxs[0].ctx_id, next->context.ctx_id);
-	this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen,
-		       atomic64_read(&next->context.tlb_gen));
+		WARN_ON_ONCE(cpumask_test_cpu(cpu, mm_cpumask(next)));
 
-	WARN_ON_ONCE(cpumask_test_cpu(cpu, mm_cpumask(next)));
-	cpumask_set_cpu(cpu, mm_cpumask(next));
+		/*
+		 * Start remote flushes and then read tlb_gen.
+		 */
+		cpumask_set_cpu(cpu, mm_cpumask(next));
+		next_tlb_gen = atomic64_read(&next->context.tlb_gen);
 
-	/*
-	 * Re-load page tables.
-	 *
-	 * This logic has an ordering constraint:
-	 *
-	 *  CPU 0: Write to a PTE for 'next'
-	 *  CPU 0: load bit 1 in mm_cpumask.  if nonzero, send IPI.
-	 *  CPU 1: set bit 1 in next's mm_cpumask
-	 *  CPU 1: load from the PTE that CPU 0 writes (implicit)
-	 *
-	 * We need to prevent an outcome in which CPU 1 observes
-	 * the new PTE value and CPU 0 observes bit 1 clear in
-	 * mm_cpumask.  (If that occurs, then the IPI will never
-	 * be sent, and CPU 0's TLB will contain a stale entry.)
-	 *
-	 * The bad outcome can occur if either CPU's load is
-	 * reordered before that CPU's store, so both CPUs must
-	 * execute full barriers to prevent this from happening.
-	 *
-	 * Thus, switch_mm needs a full barrier between the
-	 * store to mm_cpumask and any operation that could load
-	 * from next->pgd.  TLB fills are special and can happen
-	 * due to instruction fetches or for no reason at all,
-	 * and neither LOCK nor MFENCE orders them.
-	 * Fortunately, load_cr3() is serializing and gives the
-	 * ordering guarantee we need.
-	 */
-	load_cr3(next->pgd);
+		VM_BUG_ON(this_cpu_read(cpu_tlbstate.ctxs[0].ctx_id) ==
+			  next->context.ctx_id);
 
-	/*
-	 * This gets called via leave_mm() in the idle path where RCU
-	 * functions differently.  Tracing normally uses RCU, so we have to
-	 * call the tracepoint specially here.
-	 */
-	trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
+		this_cpu_write(cpu_tlbstate.ctxs[0].ctx_id,
+			       next->context.ctx_id);
+		this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen,
+			       next_tlb_gen);
+		this_cpu_write(cpu_tlbstate.loaded_mm, next);
+		write_cr3(__pa(next->pgd));
 
-	/* Stop flush ipis for the previous mm */
-	WARN_ON_ONCE(!cpumask_test_cpu(cpu, mm_cpumask(real_prev)) &&
-		     real_prev != &init_mm);
-	cpumask_clear_cpu(cpu, mm_cpumask(real_prev));
+		/*
+		 * This gets called via leave_mm() in the idle path where RCU
+		 * functions differently.  Tracing normally uses RCU, so we
+		 * have to call the tracepoint specially here.
+		 */
+		trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH,
+					TLB_FLUSH_ALL);
+	}
 
-	/* Load per-mm CR4 and LDTR state */
 	load_mm_cr4(next);
 	switch_ldt(real_prev, next);
 }
 
-/*
- * The flush IPI assumes that a thread switch happens in this order:
- * [cpu0: the cpu that switches]
- * 1) switch_mm() either 1a) or 1b)
- * 1a) thread switch to a different mm
- * 1a1) set cpu_tlbstate to TLBSTATE_OK
- *	Now the tlb flush NMI handler flush_tlb_func won't call leave_mm
- *	if cpu0 was in lazy tlb mode.
- * 1a2) update cpu active_mm
- *	Now cpu0 accepts tlb flushes for the new mm.
- * 1a3) cpu_set(cpu, new_mm->cpu_vm_mask);
- *	Now the other cpus will send tlb flush ipis.
- * 1a4) change cr3.
- * 1a5) cpu_clear(cpu, old_mm->cpu_vm_mask);
- *	Stop ipi delivery for the old mm. This is not synchronized with
- *	the other cpus, but flush_tlb_func ignore flush ipis for the wrong
- *	mm, and in the worst case we perform a superfluous tlb flush.
- * 1b) thread switch without mm change
- *	cpu active_mm is correct, cpu0 already handles flush ipis.
- * 1b1) set cpu_tlbstate to TLBSTATE_OK
- * 1b2) test_and_set the cpu bit in cpu_vm_mask.
- *	Atomically set the bit [other cpus will start sending flush ipis],
- *	and test the bit.
- * 1b3) if the bit was 0: leave_mm was called, flush the tlb.
- * 2) switch %%esp, ie current
- *
- * The interrupt must handle 2 special cases:
- * - cr3 is changed before %%esp, ie. it cannot use current->{active_,}mm.
- * - the cpu performs speculative tlb reads, i.e. even if the cpu only
- *   runs in kernel space, the cpu could load tlb entries for user space
- *   pages.
- *
- * The good news is that cpu_tlbstate is local to each cpu, no
- * write/read ordering problems.
- */
-
 static void flush_tlb_func_common(const struct flush_tlb_info *f,
 				  bool local, enum tlb_flush_reason reason)
 {
@@ -209,15 +194,19 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
 	u64 mm_tlb_gen = atomic64_read(&loaded_mm->context.tlb_gen);
 	u64 local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[0].tlb_gen);
 
+	/* This code cannot presently handle being reentered. */
+	VM_WARN_ON(!irqs_disabled());
+
 	VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[0].ctx_id) !=
 		   loaded_mm->context.ctx_id);
 
-	if (this_cpu_read(cpu_tlbstate.state) != TLBSTATE_OK) {
+	if (!cpumask_test_cpu(smp_processor_id(), mm_cpumask(loaded_mm))) {
 		/*
-		 * leave_mm() is adequate to handle any type of flush, and
-		 * we would prefer not to receive further IPIs.
+		 * We're in lazy mode -- don't flush.  We can get here on
+		 * remote flushes due to races and on local flushes if a
+		 * kernel thread coincidentally flushes the mm it's lazily
+		 * still using.
 		 */
-		leave_mm(smp_processor_id());
 		return;
 	}
 
@@ -314,6 +303,21 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 				(info->end - info->start) >> PAGE_SHIFT);
 
 	if (is_uv_system()) {
+		/*
+		 * This whole special case is confused.  UV has a "Broadcast
+		 * Assist Unit", which seems to be a fancy way to send IPIs.
+		 * Back when x86 used an explicit TLB flush IPI, UV was
+		 * optimized to use its own mechanism.  These days, x86 uses
+		 * smp_call_function_many(), but UV still uses a manual IPI,
+		 * and that IPI's action is out of date -- it does a manual
+		 * flush instead of calling flush_tlb_func_remote().  This
+		 * means that the percpu tlb_gen variables won't be updated
+		 * and we'll do pointless flushes on future context switches.
+		 *
+		 * Rather than hooking native_flush_tlb_others() here, I think
+		 * that UV should be updated so that smp_call_function_many(),
+		 * etc, are optimal on UV.
+		 */
 		unsigned int cpu;
 
 		cpu = smp_processor_id();
@@ -364,10 +368,15 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 		info.end = TLB_FLUSH_ALL;
 	}
 
-	if (mm == this_cpu_read(cpu_tlbstate.loaded_mm))
+	if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
+		local_irq_disable();
 		flush_tlb_func_local(&info, TLB_LOCAL_MM_SHOOTDOWN);
+		local_irq_enable();
+	}
+
 	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
 		flush_tlb_others(mm_cpumask(mm), &info);
+
 	put_cpu();
 }
 
@@ -376,8 +385,6 @@ static void do_flush_tlb_all(void *info)
 {
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
 	__flush_tlb_all();
-	if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_LAZY)
-		leave_mm(smp_processor_id());
 }
 
 void flush_tlb_all(void)
@@ -421,10 +428,15 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 
 	int cpu = get_cpu();
 
-	if (cpumask_test_cpu(cpu, &batch->cpumask))
+	if (cpumask_test_cpu(cpu, &batch->cpumask)) {
+		local_irq_disable();
 		flush_tlb_func_local(&info, TLB_LOCAL_SHOOTDOWN);
+		local_irq_enable();
+	}
+
 	if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids)
 		flush_tlb_others(&batch->cpumask, &info);
+
 	cpumask_clear(&batch->cpumask);
 
 	put_cpu();
-- 
2.9.4


* [PATCH v2 06/10] x86/mm: Stop calling leave_mm() in idle code
  2017-06-14  4:56 [PATCH v2 00/10] PCID and improved laziness Andy Lutomirski
                   ` (4 preceding siblings ...)
  2017-06-14  4:56 ` [PATCH v2 05/10] x86/mm: Rework lazy TLB mode and TLB freshness tracking Andy Lutomirski
@ 2017-06-14  4:56 ` Andy Lutomirski
  2017-06-14  4:56 ` [PATCH v2 07/10] x86/mm: Disable PCID on 32-bit kernels Andy Lutomirski
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 30+ messages in thread
From: Andy Lutomirski @ 2017-06-14  4:56 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Borislav Petkov, Linus Torvalds, Andrew Morton,
	Mel Gorman, linux-mm, Nadav Amit, Rik van Riel, Dave Hansen,
	Arjan van de Ven, Peter Zijlstra, Andy Lutomirski

Now that lazy TLB suppresses all flush IPIs (as opposed to all but
the first), there's no need to leave_mm() when going idle.

This means we can get rid of the rcuidle hack in
switch_mm_irqs_off() and we can unexport leave_mm().

This also removes acpi_unlazy_tlb() from the x86 and ia64 headers,
since it has no callers any more.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/ia64/include/asm/acpi.h  |  2 --
 arch/x86/include/asm/acpi.h   |  2 --
 arch/x86/mm/tlb.c             | 19 +++----------------
 drivers/acpi/processor_idle.c |  2 --
 drivers/idle/intel_idle.c     |  9 ++++-----
 5 files changed, 7 insertions(+), 27 deletions(-)

diff --git a/arch/ia64/include/asm/acpi.h b/arch/ia64/include/asm/acpi.h
index a3d0211970e9..c86a947f5368 100644
--- a/arch/ia64/include/asm/acpi.h
+++ b/arch/ia64/include/asm/acpi.h
@@ -112,8 +112,6 @@ static inline void arch_acpi_set_pdc_bits(u32 *buf)
 	buf[2] |= ACPI_PDC_EST_CAPABILITY_SMP;
 }
 
-#define acpi_unlazy_tlb(x)
-
 #ifdef CONFIG_ACPI_NUMA
 extern cpumask_t early_cpu_possible_map;
 #define for_each_possible_early_cpu(cpu)  \
diff --git a/arch/x86/include/asm/acpi.h b/arch/x86/include/asm/acpi.h
index 2efc768e4362..562286fa151f 100644
--- a/arch/x86/include/asm/acpi.h
+++ b/arch/x86/include/asm/acpi.h
@@ -150,8 +150,6 @@ static inline void disable_acpi(void) { }
 extern int x86_acpi_numa_init(void);
 #endif /* CONFIG_ACPI_NUMA */
 
-#define acpi_unlazy_tlb(x)	leave_mm(x)
-
 #ifdef CONFIG_ACPI_APEI
 static inline pgprot_t arch_apei_get_mem_attribute(phys_addr_t addr)
 {
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index fea2b07ac7d8..5f932fd80881 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -50,7 +50,6 @@ void leave_mm(int cpu)
 
 	switch_mm(NULL, &init_mm, NULL);
 }
-EXPORT_SYMBOL_GPL(leave_mm);
 
 void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 	       struct task_struct *tsk)
@@ -113,14 +112,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 			this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen,
 				       next_tlb_gen);
 			write_cr3(__pa(next->pgd));
-			/*
-			 * This gets called via leave_mm() in the idle path
-			 * where RCU functions differently.  Tracing normally
-			 * uses RCU, so we have to call the tracepoint
-			 * specially here.
-			 */
-			trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH,
-						TLB_FLUSH_ALL);
+			trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH,
+					TLB_FLUSH_ALL);
 		}
 
 		/*
@@ -166,13 +159,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		this_cpu_write(cpu_tlbstate.loaded_mm, next);
 		write_cr3(__pa(next->pgd));
 
-		/*
-		 * This gets called via leave_mm() in the idle path where RCU
-		 * functions differently.  Tracing normally uses RCU, so we
-		 * have to call the tracepoint specially here.
-		 */
-		trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH,
-					TLB_FLUSH_ALL);
+		trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
 	}
 
 	load_mm_cr4(next);
diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
index 5c8aa9cf62d7..fe3d2a40f311 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -708,8 +708,6 @@ static DEFINE_RAW_SPINLOCK(c3_lock);
 static void acpi_idle_enter_bm(struct acpi_processor *pr,
 			       struct acpi_processor_cx *cx, bool timer_bc)
 {
-	acpi_unlazy_tlb(smp_processor_id());
-
 	/*
 	 * Must be done before busmaster disable as we might need to
 	 * access HPET !
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index 216d7ec88c0c..2ae43f59091d 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -912,16 +912,15 @@ static __cpuidle int intel_idle(struct cpuidle_device *dev,
 	struct cpuidle_state *state = &drv->states[index];
 	unsigned long eax = flg2MWAIT(state->flags);
 	unsigned int cstate;
-	int cpu = smp_processor_id();
 
 	cstate = (((eax) >> MWAIT_SUBSTATE_SIZE) & MWAIT_CSTATE_MASK) + 1;
 
 	/*
-	 * leave_mm() to avoid costly and often unnecessary wakeups
-	 * for flushing the user TLB's associated with the active mm.
+	 * NB: if CPUIDLE_FLAG_TLB_FLUSHED is set, this idle transition
+	 * will probably flush the TLB.  It's not guaranteed to flush
+	 * the TLB, though, so it's not clear that we can do anything
+	 * useful with this knowledge.
 	 */
-	if (state->flags & CPUIDLE_FLAG_TLB_FLUSHED)
-		leave_mm(cpu);
 
 	if (!(lapic_timer_reliable_states & (1 << (cstate))))
 		tick_broadcast_enter();
-- 
2.9.4


* [PATCH v2 07/10] x86/mm: Disable PCID on 32-bit kernels
  2017-06-14  4:56 [PATCH v2 00/10] PCID and improved laziness Andy Lutomirski
                   ` (5 preceding siblings ...)
  2017-06-14  4:56 ` [PATCH v2 06/10] x86/mm: Stop calling leave_mm() in idle code Andy Lutomirski
@ 2017-06-14  4:56 ` Andy Lutomirski
  2017-06-14  4:56 ` [PATCH v2 08/10] x86/mm: Add nopcid to turn off PCID Andy Lutomirski
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 30+ messages in thread
From: Andy Lutomirski @ 2017-06-14  4:56 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Borislav Petkov, Linus Torvalds, Andrew Morton,
	Mel Gorman, linux-mm, Nadav Amit, Rik van Riel, Dave Hansen,
	Arjan van de Ven, Peter Zijlstra, Andy Lutomirski

32-bit kernels on new hardware will see PCID in CPUID, but PCID can
only be used in 64-bit mode.  Rather than making all PCID code
conditional, just disable the feature on 32-bit builds.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/disabled-features.h | 4 +++-
 arch/x86/kernel/cpu/bugs.c               | 8 ++++++++
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index 5dff775af7cd..c10c9128f54e 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -21,11 +21,13 @@
 # define DISABLE_K6_MTRR	(1<<(X86_FEATURE_K6_MTRR & 31))
 # define DISABLE_CYRIX_ARR	(1<<(X86_FEATURE_CYRIX_ARR & 31))
 # define DISABLE_CENTAUR_MCR	(1<<(X86_FEATURE_CENTAUR_MCR & 31))
+# define DISABLE_PCID		0
 #else
 # define DISABLE_VME		0
 # define DISABLE_K6_MTRR	0
 # define DISABLE_CYRIX_ARR	0
 # define DISABLE_CENTAUR_MCR	0
+# define DISABLE_PCID		(1<<(X86_FEATURE_PCID & 31))
 #endif /* CONFIG_X86_64 */
 
 #ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
@@ -49,7 +51,7 @@
 #define DISABLED_MASK1	0
 #define DISABLED_MASK2	0
 #define DISABLED_MASK3	(DISABLE_CYRIX_ARR|DISABLE_CENTAUR_MCR|DISABLE_K6_MTRR)
-#define DISABLED_MASK4	0
+#define DISABLED_MASK4	(DISABLE_PCID)
 #define DISABLED_MASK5	0
 #define DISABLED_MASK6	0
 #define DISABLED_MASK7	0
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 0af86d9242da..db684880d74a 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -21,6 +21,14 @@
 
 void __init check_bugs(void)
 {
+#ifdef CONFIG_X86_32
+	/*
+	 * Regardless of whether PCID is enumerated, the SDM says
+	 * that it can't be enabled in 32-bit mode.
+	 */
+	setup_clear_cpu_cap(X86_FEATURE_PCID);
+#endif
+
 	identify_boot_cpu();
 
 	if (!IS_ENABLED(CONFIG_SMP)) {
-- 
2.9.4


* [PATCH v2 08/10] x86/mm: Add nopcid to turn off PCID
  2017-06-14  4:56 [PATCH v2 00/10] PCID and improved laziness Andy Lutomirski
                   ` (6 preceding siblings ...)
  2017-06-14  4:56 ` [PATCH v2 07/10] x86/mm: Disable PCID on 32-bit kernels Andy Lutomirski
@ 2017-06-14  4:56 ` Andy Lutomirski
  2017-06-14  4:56 ` [PATCH v2 09/10] x86/mm: Enable CR4.PCIDE on supported systems Andy Lutomirski
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 30+ messages in thread
From: Andy Lutomirski @ 2017-06-14  4:56 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Borislav Petkov, Linus Torvalds, Andrew Morton,
	Mel Gorman, linux-mm, Nadav Amit, Rik van Riel, Dave Hansen,
	Arjan van de Ven, Peter Zijlstra, Andy Lutomirski

The parameter is only present on x86_64 systems to save a few bytes,
as PCID is always disabled on x86_32.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 Documentation/admin-guide/kernel-parameters.txt |  2 ++
 arch/x86/kernel/cpu/common.c                    | 18 ++++++++++++++++++
 2 files changed, 20 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 0f5c3b4347c6..aa385109ae58 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2648,6 +2648,8 @@
 	nopat		[X86] Disable PAT (page attribute table extension of
 			pagetables) support.
 
+	nopcid		[X86-64] Disable the PCID cpu feature.
+
 	norandmaps	Don't use address space randomization.  Equivalent to
 			echo 0 > /proc/sys/kernel/randomize_va_space
 
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index c8b39870f33e..904485e7b230 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -168,6 +168,24 @@ static int __init x86_mpx_setup(char *s)
 }
 __setup("nompx", x86_mpx_setup);
 
+#ifdef CONFIG_X86_64
+static int __init x86_pcid_setup(char *s)
+{
+	/* require an exact match without trailing characters */
+	if (strlen(s))
+		return 0;
+
+	/* do not emit a message if the feature is not present */
+	if (!boot_cpu_has(X86_FEATURE_PCID))
+		return 1;
+
+	setup_clear_cpu_cap(X86_FEATURE_PCID);
+	pr_info("nopcid: PCID feature disabled\n");
+	return 1;
+}
+__setup("nopcid", x86_pcid_setup);
+#endif
+
 static int __init x86_noinvpcid_setup(char *s)
 {
 	/* noinvpcid doesn't accept parameters */
-- 
2.9.4


* [PATCH v2 09/10] x86/mm: Enable CR4.PCIDE on supported systems
  2017-06-14  4:56 [PATCH v2 00/10] PCID and improved laziness Andy Lutomirski
                   ` (7 preceding siblings ...)
  2017-06-14  4:56 ` [PATCH v2 08/10] x86/mm: Add nopcid to turn off PCID Andy Lutomirski
@ 2017-06-14  4:56 ` Andy Lutomirski
  2017-06-14  5:30   ` Juergen Gross
  2017-06-14  4:56 ` [PATCH v2 10/10] x86/mm: Try to preserve old TLB entries using PCID Andy Lutomirski
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 30+ messages in thread
From: Andy Lutomirski @ 2017-06-14  4:56 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Borislav Petkov, Linus Torvalds, Andrew Morton,
	Mel Gorman, linux-mm, Nadav Amit, Rik van Riel, Dave Hansen,
	Arjan van de Ven, Peter Zijlstra, Andy Lutomirski, Juergen Gross,
	Boris Ostrovsky

We can use PCID if the CPU has PCID and PGE and we're not on Xen.

By itself, this has no effect.  The next patch will start using
PCID.

Cc: Juergen Gross <jgross@suse.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/tlbflush.h |  8 ++++++++
 arch/x86/kernel/cpu/common.c    | 15 +++++++++++++++
 arch/x86/xen/enlighten_pv.c     |  6 ++++++
 3 files changed, 29 insertions(+)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 87b13e51e867..57b305e13c4c 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -243,6 +243,14 @@ static inline void __flush_tlb_all(void)
 		__flush_tlb_global();
 	else
 		__flush_tlb();
+
+	/*
+	 * Note: if we somehow had PCID but not PGE, then this wouldn't work --
+	 * we'd end up flushing kernel translations for the current ASID but
+	 * we might fail to flush kernel translations for other cached ASIDs.
+	 *
+	 * To avoid this issue, we force PCID off if PGE is off.
+	 */
 }
 
 static inline void __flush_tlb_one(unsigned long addr)
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 904485e7b230..01caf66b270f 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1143,6 +1143,21 @@ static void identify_cpu(struct cpuinfo_x86 *c)
 	setup_smep(c);
 	setup_smap(c);
 
+	/* Set up PCID */
+	if (cpu_has(c, X86_FEATURE_PCID)) {
+		if (cpu_has(c, X86_FEATURE_PGE)) {
+			cr4_set_bits(X86_CR4_PCIDE);
+		} else {
+			/*
+			 * flush_tlb_all(), as currently implemented, won't
+			 * work if PCID is on but PGE is not.  Since that
+			 * combination doesn't exist on real hardware, there's
+			 * no reason to try to fully support it.
+			 */
+			clear_cpu_cap(c, X86_FEATURE_PCID);
+		}
+	}
+
 	/*
 	 * The vendor-specific functions might have changed features.
 	 * Now we do "generic changes."
diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c
index f33eef4ebd12..a136aac543c3 100644
--- a/arch/x86/xen/enlighten_pv.c
+++ b/arch/x86/xen/enlighten_pv.c
@@ -295,6 +295,12 @@ static void __init xen_init_capabilities(void)
 	setup_clear_cpu_cap(X86_FEATURE_ACC);
 	setup_clear_cpu_cap(X86_FEATURE_X2APIC);
 
+	/*
+	 * Xen PV would need some work to support PCID: CR3 handling as well
+	 * as xen_flush_tlb_others() would need updating.
+	 */
+	setup_clear_cpu_cap(X86_FEATURE_PCID);
+
 	if (!xen_initial_domain())
 		setup_clear_cpu_cap(X86_FEATURE_ACPI);
 
-- 
2.9.4


* [PATCH v2 10/10] x86/mm: Try to preserve old TLB entries using PCID
  2017-06-14  4:56 [PATCH v2 00/10] PCID and improved laziness Andy Lutomirski
                   ` (8 preceding siblings ...)
  2017-06-14  4:56 ` [PATCH v2 09/10] x86/mm: Enable CR4.PCIDE on supported systems Andy Lutomirski
@ 2017-06-14  4:56 ` Andy Lutomirski
  2017-06-18  6:26   ` Nadav Amit
  2017-06-14 22:18 ` [PATCH v2 00/10] PCID and improved laziness Dave Hansen
  2017-06-18 21:29 ` Levin, Alexander (Sasha Levin)
  11 siblings, 1 reply; 30+ messages in thread
From: Andy Lutomirski @ 2017-06-14  4:56 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Borislav Petkov, Linus Torvalds, Andrew Morton,
	Mel Gorman, linux-mm, Nadav Amit, Rik van Riel, Dave Hansen,
	Arjan van de Ven, Peter Zijlstra, Andy Lutomirski

PCID is a "process context ID" -- it's what other architectures call
an address space ID.  Every non-global TLB entry is tagged with a
PCID, only TLB entries that match the currently selected PCID are
used, and we can switch PGDs without flushing the TLB.  x86's
PCID is 12 bits.

This is an unorthodox approach to using PCID.  x86's PCID is far too
short to uniquely identify a process, and we can't even really
uniquely identify a running process because there are monster
systems with over 4096 CPUs.  To make matters worse, past attempts
to use all 12 PCID bits have resulted in slowdowns instead of
speedups.

This patch uses PCID differently.  We use a PCID to identify a
recently-used mm on a per-cpu basis.  An mm has no fixed PCID
binding at all; instead, we give it a fresh PCID each time it's
loaded except in cases where we want to preserve the TLB, in which
case we reuse a recent value.

In particular, we use PCIDs 1-3 for recently-used mms and we reserve
PCID 0 for swapper_pg_dir and for PCID-unaware CR3 users (e.g. EFI).
Nothing ever switches to PCID 0 without flushing PCID 0 non-global
pages, so PCID 0 conflicts won't cause problems.

This seems to save about 100ns on context switches between mms.
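
For illustration only -- this helper is not part of the patch and its
name is invented -- the CR3 value the new code writes boils down to the
PGD's physical address with the chosen ASID in the low bits, plus the
no-flush bit when we want to keep the old TLB entries:

	static unsigned long sketch_build_cr3(pgd_t *pgd, u16 asid, bool noflush)
	{
		/* Bits 11:0 of CR3 carry the PCID (our per-cpu ASID). */
		unsigned long cr3 = __pa(pgd) | asid;

		/* Bit 63 asks the CPU not to flush this PCID's entries. */
		if (noflush)
			cr3 |= CR3_NOFLUSH;

		return cr3;
	}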

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/mmu_context.h     |  3 ++
 arch/x86/include/asm/processor-flags.h |  2 +
 arch/x86/include/asm/tlbflush.h        | 18 +++++++-
 arch/x86/mm/init.c                     |  1 +
 arch/x86/mm/tlb.c                      | 82 ++++++++++++++++++++++++++--------
 5 files changed, 86 insertions(+), 20 deletions(-)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 69a4f1ee86ac..2537ec03c9b7 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -299,6 +299,9 @@ static inline unsigned long __get_current_cr3_fast(void)
 {
 	unsigned long cr3 = __pa(this_cpu_read(cpu_tlbstate.loaded_mm)->pgd);
 
+	if (static_cpu_has(X86_FEATURE_PCID))
+		cr3 |= this_cpu_read(cpu_tlbstate.loaded_mm_asid);
+
 	/* For now, be very restrictive about when this can be called. */
 	VM_WARN_ON(in_nmi() || !in_atomic());
 
diff --git a/arch/x86/include/asm/processor-flags.h b/arch/x86/include/asm/processor-flags.h
index 79aa2f98398d..791b60199aa4 100644
--- a/arch/x86/include/asm/processor-flags.h
+++ b/arch/x86/include/asm/processor-flags.h
@@ -35,6 +35,7 @@
 /* Mask off the address space ID bits. */
 #define CR3_ADDR_MASK 0x7FFFFFFFFFFFF000ull
 #define CR3_PCID_MASK 0xFFFull
+#define CR3_NOFLUSH (1UL << 63)
 #else
 /*
  * CR3_ADDR_MASK needs at least bits 31:5 set on PAE systems, and we save
@@ -42,6 +43,7 @@
  */
 #define CR3_ADDR_MASK 0xFFFFFFFFull
 #define CR3_PCID_MASK 0ull
+#define CR3_NOFLUSH 0
 #endif
 
 #endif /* _ASM_X86_PROCESSOR_FLAGS_H */
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 57b305e13c4c..a9a5aa6f45f7 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -82,6 +82,12 @@ static inline u64 bump_mm_tlb_gen(struct mm_struct *mm)
 #define __flush_tlb_single(addr) __native_flush_tlb_single(addr)
 #endif
 
+/*
+ * 6 because 6 should be plenty and struct tlb_state will fit in
+ * two cache lines.
+ */
+#define NR_DYNAMIC_ASIDS 6
+
 struct tlb_context {
 	u64 ctx_id;
 	u64 tlb_gen;
@@ -95,6 +101,8 @@ struct tlb_state {
 	 * mode even if we've already switched back to swapper_pg_dir.
 	 */
 	struct mm_struct *loaded_mm;
+	u16 loaded_mm_asid;
+	u16 next_asid;
 
 	/*
 	 * Access to this CR4 shadow and to H/W CR4 is protected by
@@ -104,7 +112,8 @@ struct tlb_state {
 
 	/*
 	 * This is a list of all contexts that might exist in the TLB.
-	 * Since we don't yet use PCID, there is only one context.
+	 * There is one per ASID that we use, and the ASID (what the
+	 * CPU calls PCID) is the index into ctxts.
 	 *
 	 * For each context, ctx_id indicates which mm the TLB's user
 	 * entries came from.  As an invariant, the TLB will never
@@ -114,8 +123,13 @@ struct tlb_state {
 	 * To be clear, this means that it's legal for the TLB code to
 	 * flush the TLB without updating tlb_gen.  This can happen
 	 * (for now, at least) due to paravirt remote flushes.
+	 *
+	 * NB: context 0 is a bit special, since it's also used by
+	 * various bits of init code.  This is fine -- code that
+	 * isn't aware of PCID will end up harmlessly flushing
+	 * context 0.
 	 */
-	struct tlb_context ctxs[1];
+	struct tlb_context ctxs[NR_DYNAMIC_ASIDS];
 };
 DECLARE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate);
 
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 7d6fa4676af9..9c9570d300ba 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -812,6 +812,7 @@ void __init zone_sizes_init(void)
 
 DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate) = {
 	.loaded_mm = &init_mm,
+	.next_asid = 1,
 	.cr4 = ~0UL,	/* fail hard if we screw up cr4 shadow initialization */
 };
 EXPORT_SYMBOL_GPL(cpu_tlbstate);
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 5f932fd80881..cd7f604ee818 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -30,6 +30,40 @@
 
 atomic64_t last_mm_ctx_id = ATOMIC64_INIT(1);
 
+static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
+			    u16 *new_asid, bool *need_flush)
+{
+	u16 asid;
+
+	if (!static_cpu_has(X86_FEATURE_PCID)) {
+		*new_asid = 0;
+		*need_flush = true;
+		return;
+	}
+
+	for (asid = 0; asid < NR_DYNAMIC_ASIDS; asid++) {
+		if (this_cpu_read(cpu_tlbstate.ctxs[asid].ctx_id) !=
+		    next->context.ctx_id)
+			continue;
+
+		*new_asid = asid;
+		*need_flush = (this_cpu_read(cpu_tlbstate.ctxs[asid].tlb_gen) <
+			       next_tlb_gen);
+		return;
+	}
+
+	/*
+	 * We don't currently own an ASID slot on this CPU.
+	 * Allocate a slot.
+	 */
+	*new_asid = this_cpu_add_return(cpu_tlbstate.next_asid, 1) - 1;
+	if (*new_asid >= NR_DYNAMIC_ASIDS) {
+		*new_asid = 0;
+		this_cpu_write(cpu_tlbstate.next_asid, 1);
+	}
+	*need_flush = true;
+}
+
 void leave_mm(int cpu)
 {
 	struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
@@ -66,6 +100,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 {
 	unsigned cpu = smp_processor_id();
 	struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm);
+	u16 prev_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
 	u64 next_tlb_gen;
 
 	/*
@@ -81,7 +116,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 	if (IS_ENABLED(CONFIG_PROVE_LOCKING))
 		WARN_ON_ONCE(!irqs_disabled());
 
-	VM_BUG_ON(read_cr3_pa() != __pa(real_prev->pgd));
+	VM_BUG_ON(__read_cr3() != (__pa(real_prev->pgd) | prev_asid));
 
 	if (real_prev == next) {
 		if (cpumask_test_cpu(cpu, mm_cpumask(next))) {
@@ -98,10 +133,10 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		cpumask_set_cpu(cpu, mm_cpumask(next));
 		next_tlb_gen = atomic64_read(&next->context.tlb_gen);
 
-		VM_BUG_ON(this_cpu_read(cpu_tlbstate.ctxs[0].ctx_id) !=
+		VM_BUG_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
 			  next->context.ctx_id);
 
-		if (this_cpu_read(cpu_tlbstate.ctxs[0].tlb_gen) <
+		if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) <
 		    next_tlb_gen) {
 			/*
 			 * Ideally, we'd have a flush_tlb() variant that
@@ -109,9 +144,9 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 			 * be faster on Xen PV and on hypothetical CPUs
 			 * on which INVPCID is fast.
 			 */
-			this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen,
+			this_cpu_write(cpu_tlbstate.ctxs[prev_asid].tlb_gen,
 				       next_tlb_gen);
-			write_cr3(__pa(next->pgd));
+			write_cr3(__pa(next->pgd) | prev_asid);
 			trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH,
 					TLB_FLUSH_ALL);
 		}
@@ -122,6 +157,9 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		 * are not reflected in tlb_gen.)
 		 */
 	} else {
+		u16 new_asid;
+		bool need_flush;
+
 		if (IS_ENABLED(CONFIG_VMAP_STACK)) {
 			/*
 			 * If our current stack is in vmalloc space and isn't
@@ -141,7 +179,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		if (cpumask_test_cpu(cpu, mm_cpumask(real_prev)))
 			cpumask_clear_cpu(cpu, mm_cpumask(real_prev));
 
-		WARN_ON_ONCE(cpumask_test_cpu(cpu, mm_cpumask(next)));
+		VM_WARN_ON_ONCE(cpumask_test_cpu(cpu, mm_cpumask(next)));
 
 		/*
 		 * Start remote flushes and then read tlb_gen.
@@ -149,17 +187,24 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		cpumask_set_cpu(cpu, mm_cpumask(next));
 		next_tlb_gen = atomic64_read(&next->context.tlb_gen);
 
-		VM_BUG_ON(this_cpu_read(cpu_tlbstate.ctxs[0].ctx_id) ==
-			  next->context.ctx_id);
+		choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);
 
-		this_cpu_write(cpu_tlbstate.ctxs[0].ctx_id,
-			       next->context.ctx_id);
-		this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen,
-			       next_tlb_gen);
-		this_cpu_write(cpu_tlbstate.loaded_mm, next);
-		write_cr3(__pa(next->pgd));
+		if (need_flush) {
+			this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id,
+				       next->context.ctx_id);
+			this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen,
+				       next_tlb_gen);
+			write_cr3(__pa(next->pgd) | new_asid);
+			trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH,
+					TLB_FLUSH_ALL);
+		} else {
+			/* The new ASID is already up to date. */
+			write_cr3(__pa(next->pgd) | new_asid | CR3_NOFLUSH);
+			trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, 0);
+		}
 
-		trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
+		this_cpu_write(cpu_tlbstate.loaded_mm, next);
+		this_cpu_write(cpu_tlbstate.loaded_mm_asid, new_asid);
 	}
 
 	load_mm_cr4(next);
@@ -170,6 +215,7 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
 				  bool local, enum tlb_flush_reason reason)
 {
 	struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
+	u32 loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
 
 	/*
 	 * Our memory ordering requirement is that any TLB fills that
@@ -179,12 +225,12 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
 	 * atomic64_read operation won't be reordered by the compiler.
 	 */
 	u64 mm_tlb_gen = atomic64_read(&loaded_mm->context.tlb_gen);
-	u64 local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[0].tlb_gen);
+	u64 local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen);
 
 	/* This code cannot presently handle being reentered. */
 	VM_WARN_ON(!irqs_disabled());
 
-	VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[0].ctx_id) !=
+	VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].ctx_id) !=
 		   loaded_mm->context.ctx_id);
 
 	if (!cpumask_test_cpu(smp_processor_id(), mm_cpumask(loaded_mm))) {
@@ -256,7 +302,7 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
 	}
 
 	/* Both paths above update our state to mm_tlb_gen. */
-	this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen, mm_tlb_gen);
+	this_cpu_write(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen, mm_tlb_gen);
 }
 
 static void flush_tlb_func_local(void *info, enum tlb_flush_reason reason)
-- 
2.9.4

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 09/10] x86/mm: Enable CR4.PCIDE on supported systems
  2017-06-14  4:56 ` [PATCH v2 09/10] x86/mm: Enable CR4.PCIDE on supported systems Andy Lutomirski
@ 2017-06-14  5:30   ` Juergen Gross
  0 siblings, 0 replies; 30+ messages in thread
From: Juergen Gross @ 2017-06-14  5:30 UTC (permalink / raw)
  To: Andy Lutomirski, x86
  Cc: linux-kernel, Borislav Petkov, Linus Torvalds, Andrew Morton,
	Mel Gorman, linux-mm, Nadav Amit, Rik van Riel, Dave Hansen,
	Arjan van de Ven, Peter Zijlstra, Boris Ostrovsky

On 14/06/17 06:56, Andy Lutomirski wrote:
> We can use PCID if the CPU has PCID and PGE and we're not on Xen.
> 
> By itself, this has no effect.  The next patch will start using
> PCID.
> 
> Cc: Juergen Gross <jgross@suse.com>
> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
> Signed-off-by: Andy Lutomirski <luto@kernel.org>

Reviewed-by: Juergen Gross <jgross@suse.com>


Thanks,

Juergen

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 05/10] x86/mm: Rework lazy TLB mode and TLB freshness tracking
  2017-06-14  4:56 ` [PATCH v2 05/10] x86/mm: Rework lazy TLB mode and TLB freshness tracking Andy Lutomirski
@ 2017-06-14  6:09   ` Juergen Gross
  2017-06-19 22:00     ` Andy Lutomirski
  2017-06-14 22:33   ` Dave Hansen
  2017-06-18  8:06   ` Nadav Amit
  2 siblings, 1 reply; 30+ messages in thread
From: Juergen Gross @ 2017-06-14  6:09 UTC (permalink / raw)
  To: Andy Lutomirski, x86
  Cc: linux-kernel, Borislav Petkov, Linus Torvalds, Andrew Morton,
	Mel Gorman, linux-mm, Nadav Amit, Rik van Riel, Dave Hansen,
	Arjan van de Ven, Peter Zijlstra, Andrew Banman, Mike Travis,
	Dimitri Sivanich, Boris Ostrovsky

On 14/06/17 06:56, Andy Lutomirski wrote:
> x86's lazy TLB mode used to be fairly weak -- it would switch to
> init_mm the first time it tried to flush a lazy TLB.  This meant an
> unnecessary CR3 write and, if the flush was remote, an unnecessary
> IPI.
> 
> Rewrite it entirely.  When we enter lazy mode, we simply remove the
> cpu from mm_cpumask.  This means that we need a way to figure out
> whether we've missed a flush when we switch back out of lazy mode.
> I use the tlb_gen machinery to track whether a context is up to
> date.
> 
> Note to reviewers: this patch, by itself, looks a bit odd.  I'm
> using an array of length 1 containing (ctx_id, tlb_gen) rather than
> just storing tlb_gen, and making it an array isn't necessary yet.
> I'm doing this because the next few patches add PCID support, and,
> with PCID, we need ctx_id, and the array will end up with a length
> greater than 1.  Making it an array now means that there will be
> less churn and therefore less stress on your eyeballs.
> 
> NB: This is dubious but, AFAICT, still correct on Xen and UV.
> xen_exit_mmap() uses mm_cpumask() for nefarious purposes and this
> patch changes the way that mm_cpumask() works.  This should be okay,
> since Xen *also* iterates all online CPUs to find all the CPUs it
> needs to twiddle.

There is an allocation failure path in xen_drop_mm_ref() which might
be wrong with this patch. As this path should only be taken very
rarely, I'd suggest removing the test for mm_cpumask() bit zero in
this path.


Juergen

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 03/10] x86/mm: Give each mm TLB flush generation a unique ID
  2017-06-14  4:56 ` [PATCH v2 03/10] x86/mm: Give each mm TLB flush generation a unique ID Andy Lutomirski
@ 2017-06-14 15:54   ` Dave Hansen
  2017-06-14 17:16     ` Andy Lutomirski
  0 siblings, 1 reply; 30+ messages in thread
From: Dave Hansen @ 2017-06-14 15:54 UTC (permalink / raw)
  To: Andy Lutomirski, x86
  Cc: linux-kernel, Borislav Petkov, Linus Torvalds, Andrew Morton,
	Mel Gorman, linux-mm, Nadav Amit, Rik van Riel, Arjan van de Ven,
	Peter Zijlstra

On 06/13/2017 09:56 PM, Andy Lutomirski wrote:
>  typedef struct {
> +	/*
> +	 * ctx_id uniquely identifies this mm_struct.  A ctx_id will never
> +	 * be reused, and zero is not a valid ctx_id.
> +	 */
> +	u64 ctx_id;

Ahh, and you need this because an mm itself might get reused by being
freed and reallocated?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 03/10] x86/mm: Give each mm TLB flush generation a unique ID
  2017-06-14 15:54   ` Dave Hansen
@ 2017-06-14 17:16     ` Andy Lutomirski
  0 siblings, 0 replies; 30+ messages in thread
From: Andy Lutomirski @ 2017-06-14 17:16 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, X86 ML, linux-kernel, Borislav Petkov,
	Linus Torvalds, Andrew Morton, Mel Gorman, linux-mm, Nadav Amit,
	Rik van Riel, Arjan van de Ven, Peter Zijlstra

On Wed, Jun 14, 2017 at 8:54 AM, Dave Hansen <dave.hansen@intel.com> wrote:
> On 06/13/2017 09:56 PM, Andy Lutomirski wrote:
>>  typedef struct {
>> +     /*
>> +      * ctx_id uniquely identifies this mm_struct.  A ctx_id will never
>> +      * be reused, and zero is not a valid ctx_id.
>> +      */
>> +     u64 ctx_id;
>
> Ahh, and you need this because an mm itself might get reused by being
> freed and reallocated?

Exactly.  I didn't want to have to zap the data structures on each CPU
every time an mm is freed.
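
Roughly speaking (a sketch of the idea, assuming the allocation looks
like what patch 3 does), each mm just takes the next value from a
global 64-bit counter when it's created, so a recycled mm_struct still
gets a ctx_id that has never been seen before:

	/* at mm creation; the counter starts at 1, so 0 is never handed out */
	mm->context.ctx_id = atomic64_inc_return(&last_mm_ctx_id);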

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 00/10] PCID and improved laziness
  2017-06-14  4:56 [PATCH v2 00/10] PCID and improved laziness Andy Lutomirski
                   ` (9 preceding siblings ...)
  2017-06-14  4:56 ` [PATCH v2 10/10] x86/mm: Try to preserve old TLB entries using PCID Andy Lutomirski
@ 2017-06-14 22:18 ` Dave Hansen
  2017-06-14 22:48   ` Andy Lutomirski
  2017-06-18 21:29 ` Levin, Alexander (Sasha Levin)
  11 siblings, 1 reply; 30+ messages in thread
From: Dave Hansen @ 2017-06-14 22:18 UTC (permalink / raw)
  To: Andy Lutomirski, x86
  Cc: linux-kernel, Borislav Petkov, Linus Torvalds, Andrew Morton,
	Mel Gorman, linux-mm, Nadav Amit, Rik van Riel, Arjan van de Ven,
	Peter Zijlstra

On 06/13/2017 09:56 PM, Andy Lutomirski wrote:
> 2. Mms that have been used recently on a given CPU might get to keep
>    their TLB entries alive across process switches with this patch
>    set.  TLB fills are pretty fast on modern CPUs, but they're even
>    faster when they don't happen.

Let's not forget that TLBs are also getting bigger.  The bigger TLBs
help ensure that they *can* survive across another process's timeslice.

Also, the cost to refill the paging structure caches is going up.  Just
think of how many cachelines you have to pull in to populate a
~1500-entry TLB, even if the CPU hid the latency of those loads.
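
(Rough arithmetic, assuming 4k pages, 8-byte PTEs and 64-byte cache
lines: the leaf PTEs alone for ~1500 mappings are up to 1500 * 8 =
~12KB of page table data -- on the order of 190 cache lines if they're
densely packed, and up to one line per entry if they're sparse --
before you even count the upper-level tables.)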

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 05/10] x86/mm: Rework lazy TLB mode and TLB freshness tracking
  2017-06-14  4:56 ` [PATCH v2 05/10] x86/mm: Rework lazy TLB mode and TLB freshness tracking Andy Lutomirski
  2017-06-14  6:09   ` Juergen Gross
@ 2017-06-14 22:33   ` Dave Hansen
  2017-06-14 22:42     ` Andy Lutomirski
  2017-06-18  8:06   ` Nadav Amit
  2 siblings, 1 reply; 30+ messages in thread
From: Dave Hansen @ 2017-06-14 22:33 UTC (permalink / raw)
  To: Andy Lutomirski, x86
  Cc: linux-kernel, Borislav Petkov, Linus Torvalds, Andrew Morton,
	Mel Gorman, linux-mm, Nadav Amit, Rik van Riel, Arjan van de Ven,
	Peter Zijlstra, Andrew Banman, Mike Travis, Dimitri Sivanich,
	Juergen Gross, Boris Ostrovsky

On 06/13/2017 09:56 PM, Andy Lutomirski wrote:
> -	if (cpumask_test_cpu(cpu, &batch->cpumask))
> +	if (cpumask_test_cpu(cpu, &batch->cpumask)) {
> +		local_irq_disable();
>  		flush_tlb_func_local(&info, TLB_LOCAL_SHOOTDOWN);
> +		local_irq_enable();
> +	}
> +

Could you talk a little about why this needs to be local_irq_disable()
and not preempt_disable()?  Is it about the case where somebody is
trying to call flush_tlb_func_*() from an interrupt handler?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 05/10] x86/mm: Rework lazy TLB mode and TLB freshness tracking
  2017-06-14 22:33   ` Dave Hansen
@ 2017-06-14 22:42     ` Andy Lutomirski
  0 siblings, 0 replies; 30+ messages in thread
From: Andy Lutomirski @ 2017-06-14 22:42 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, X86 ML, linux-kernel, Borislav Petkov,
	Linus Torvalds, Andrew Morton, Mel Gorman, linux-mm, Nadav Amit,
	Rik van Riel, Arjan van de Ven, Peter Zijlstra, Andrew Banman,
	Mike Travis, Dimitri Sivanich, Juergen Gross, Boris Ostrovsky

On Wed, Jun 14, 2017 at 3:33 PM, Dave Hansen <dave.hansen@intel.com> wrote:
> On 06/13/2017 09:56 PM, Andy Lutomirski wrote:
>> -     if (cpumask_test_cpu(cpu, &batch->cpumask))
>> +     if (cpumask_test_cpu(cpu, &batch->cpumask)) {
>> +             local_irq_disable();
>>               flush_tlb_func_local(&info, TLB_LOCAL_SHOOTDOWN);
>> +             local_irq_enable();
>> +     }
>> +
>
> Could you talk a little about why this needs to be local_irq_disable()
> and not preempt_disable()?  Is it about the case where somebody is
> trying to call flush_tlb_func_*() from an interrupt handler?

It's to prevent flush_tlb_func_local() and flush_tlb_func_remote()
from being run concurrently, which would cause flush_tlb_func_common()
to be reentered.  Either we'd need to be very careful in
flush_tlb_func_common() to avoid races if this happened, or we could
just disable interrupts around flush_tlb_func_local().  The latter is
fast and easy.
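
To sketch the hazard (not code from the series, just the shape of it):

  flush_tlb_func_local()
    flush_tlb_func_common()              <- partway through its updates
      <remote-flush IPI arrives>
        flush_tlb_func_remote()
          flush_tlb_func_common()        <- reentered on the same CPU

preempt_disable() wouldn't keep the IPI out; local_irq_disable() does.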

--Andy

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 00/10] PCID and improved laziness
  2017-06-14 22:18 ` [PATCH v2 00/10] PCID and improved laziness Dave Hansen
@ 2017-06-14 22:48   ` Andy Lutomirski
  0 siblings, 0 replies; 30+ messages in thread
From: Andy Lutomirski @ 2017-06-14 22:48 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, X86 ML, linux-kernel, Borislav Petkov,
	Linus Torvalds, Andrew Morton, Mel Gorman, linux-mm, Nadav Amit,
	Rik van Riel, Arjan van de Ven, Peter Zijlstra

On Wed, Jun 14, 2017 at 3:18 PM, Dave Hansen <dave.hansen@intel.com> wrote:
> On 06/13/2017 09:56 PM, Andy Lutomirski wrote:
>> 2. Mms that have been used recently on a given CPU might get to keep
>>    their TLB entries alive across process switches with this patch
>>    set.  TLB fills are pretty fast on modern CPUs, but they're even
>>    faster when they don't happen.
>
> Let's not forget that TLBs are also getting bigger.  The bigger TLBs
> help ensure that they *can* survive across another process's timeslice.
>
> Also, the cost to refill the paging structure caches is going up.  Just
> think of how many cachelines you have to pull in to populate a
> ~1500-entry TLB, even if the CPU hid the latency of those loads.

Then throw EPT into the mix for extra fun.  I wonder if we should try
to allocate page tables from nearby physical addresses if we think we
might be running as a guest.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 01/10] x86/ldt: Simplify LDT switching logic
  2017-06-14  4:56 ` [PATCH v2 01/10] x86/ldt: Simplify LDT switching logic Andy Lutomirski
@ 2017-06-15 18:53   ` Rik van Riel
  0 siblings, 0 replies; 30+ messages in thread
From: Rik van Riel @ 2017-06-15 18:53 UTC (permalink / raw)
  To: Andy Lutomirski, x86
  Cc: linux-kernel, Borislav Petkov, Linus Torvalds, Andrew Morton,
	Mel Gorman, linux-mm, Nadav Amit, Dave Hansen, Arjan van de Ven,
	Peter Zijlstra

On Tue, 2017-06-13 at 21:56 -0700, Andy Lutomirski wrote:

> Simplify the code to update LDTR if either the previous or the next
> mm has an LDT, i.e. effectively restore the historical logic.
> While we're at it, clean up the code by moving all the ifdeffery to
> a header where it belongs.
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 02/10] x86/mm: Remove reset_lazy_tlbstate()
  2017-06-14  4:56 ` [PATCH v2 02/10] x86/mm: Remove reset_lazy_tlbstate() Andy Lutomirski
@ 2017-06-15 19:29   ` Rik van Riel
  0 siblings, 0 replies; 30+ messages in thread
From: Rik van Riel @ 2017-06-15 19:29 UTC (permalink / raw)
  To: Andy Lutomirski, x86
  Cc: linux-kernel, Borislav Petkov, Linus Torvalds, Andrew Morton,
	Mel Gorman, linux-mm, Nadav Amit, Dave Hansen, Arjan van de Ven,
	Peter Zijlstra

On Tue, 2017-06-13 at 21:56 -0700, Andy Lutomirski wrote:
> The only call site also calls idle_task_exit(), and idle_task_exit()
> puts us into a clean state by explicitly switching to init_mm.
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 10/10] x86/mm: Try to preserve old TLB entries using PCID
  2017-06-14  4:56 ` [PATCH v2 10/10] x86/mm: Try to preserve old TLB entries using PCID Andy Lutomirski
@ 2017-06-18  6:26   ` Nadav Amit
  2017-06-19 22:02     ` Andy Lutomirski
  0 siblings, 1 reply; 30+ messages in thread
From: Nadav Amit @ 2017-06-18  6:26 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, linux-kernel, Borislav Petkov, Linus Torvalds,
	Andrew Morton, Mel Gorman, linux-mm, Rik van Riel, Dave Hansen,
	Arjan van de Ven, Peter Zijlstra


> On Jun 13, 2017, at 9:56 PM, Andy Lutomirski <luto@kernel.org> wrote:
> 
> PCID is a "process context ID" -- it's what other architectures call
> an address space ID.  Every non-global TLB entry is tagged with a
> PCID, only TLB entries that match the currently selected PCID are
> used, and we can switch PGDs without flushing the TLB.  x86's
> PCID is 12 bits.
> 
> This is an unorthodox approach to using PCID.  x86's PCID is far too
> short to uniquely identify a process, and we can't even really
> uniquely identify a running process because there are monster
> systems with over 4096 CPUs.  To make matters worse, past attempts
> to use all 12 PCID bits have resulted in slowdowns instead of
> speedups.
> 
> This patch uses PCID differently.  We use a PCID to identify a
> recently-used mm on a per-cpu basis.  An mm has no fixed PCID
> binding at all; instead, we give it a fresh PCID each time it's
> loaded except in cases where we want to preserve the TLB, in which
> case we reuse a recent value.
> 
> In particular, we use PCIDs 1-3 for recently-used mms and we reserve
> PCID 0 for swapper_pg_dir and for PCID-unaware CR3 users (e.g. EFI).
> Nothing ever switches to PCID 0 without flushing PCID 0 non-global
> pages, so PCID 0 conflicts won't cause problems.

Is this commit message outdated? NR_DYNAMIC_ASIDS is set to 6.
More importantly, I do not see PCID 0 as reserved:

> +static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
> +			    u16 *new_asid, bool *need_flush)
> +{
> 

[snip]

> +	if (*new_asid >= NR_DYNAMIC_ASIDS) {
> +		*new_asid = 0;
> +		this_cpu_write(cpu_tlbstate.next_asid, 1);
> +	}
> +	*need_flush = true;
> +}


Am I missing something?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 05/10] x86/mm: Rework lazy TLB mode and TLB freshness tracking
  2017-06-14  4:56 ` [PATCH v2 05/10] x86/mm: Rework lazy TLB mode and TLB freshness tracking Andy Lutomirski
  2017-06-14  6:09   ` Juergen Gross
  2017-06-14 22:33   ` Dave Hansen
@ 2017-06-18  8:06   ` Nadav Amit
  2017-06-19 21:58     ` Andy Lutomirski
  2 siblings, 1 reply; 30+ messages in thread
From: Nadav Amit @ 2017-06-18  8:06 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, LKML, Borislav Petkov, Linus Torvalds, Andrew Morton,
	Mel Gorman, linux-mm, Rik van Riel, Dave Hansen,
	Arjan van de Ven, Peter Zijlstra, Andrew Banman, Mike Travis,
	Dimitri Sivanich, Juergen Gross, Boris Ostrovsky


> On Jun 13, 2017, at 9:56 PM, Andy Lutomirski <luto@kernel.org> wrote:
> 
> x86's lazy TLB mode used to be fairly weak -- it would switch to
> init_mm the first time it tried to flush a lazy TLB.  This meant an
> unnecessary CR3 write and, if the flush was remote, an unnecessary
> IPI.
> 
> Rewrite it entirely.  When we enter lazy mode, we simply remove the
> cpu from mm_cpumask.  This means that we need a way to figure out
> whether we've missed a flush when we switch back out of lazy mode.
> I use the tlb_gen machinery to track whether a context is up to
> date.
> 
> Note to reviewers: this patch, by itself, looks a bit odd.  I'm
> using an array of length 1 containing (ctx_id, tlb_gen) rather than
> just storing tlb_gen, and making it an array isn't necessary yet.
> I'm doing this because the next few patches add PCID support, and,
> with PCID, we need ctx_id, and the array will end up with a length
> greater than 1.  Making it an array now means that there will be
> less churn and therefore less stress on your eyeballs.
> 
> NB: This is dubious but, AFAICT, still correct on Xen and UV.
> xen_exit_mmap() uses mm_cpumask() for nefarious purposes and this
> patch changes the way that mm_cpumask() works.  This should be okay,
> since Xen *also* iterates all online CPUs to find all the CPUs it
> needs to twiddle.
> 
> The UV tlbflush code is rather dated and should be changed.
> 
> Cc: Andrew Banman <abanman@sgi.com>
> Cc: Mike Travis <travis@sgi.com>
> Cc: Dimitri Sivanich <sivanich@sgi.com>
> Cc: Juergen Gross <jgross@suse.com>
> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
> arch/x86/include/asm/mmu_context.h |   6 +-
> arch/x86/include/asm/tlbflush.h    |   4 -
> arch/x86/mm/init.c                 |   1 -
> arch/x86/mm/tlb.c                  | 242 +++++++++++++++++++------------------
> 4 files changed, 131 insertions(+), 122 deletions(-)
> 
> diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
> index e5295d485899..69a4f1ee86ac 100644
> --- a/arch/x86/include/asm/mmu_context.h
> +++ b/arch/x86/include/asm/mmu_context.h
> @@ -125,8 +125,10 @@ static inline void switch_ldt(struct mm_struct *prev, struct mm_struct *next)
> 
> static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
> {
> -	if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK)
> -		this_cpu_write(cpu_tlbstate.state, TLBSTATE_LAZY);
> +	int cpu = smp_processor_id();
> +
> +	if (cpumask_test_cpu(cpu, mm_cpumask(mm)))
> +		cpumask_clear_cpu(cpu, mm_cpumask(mm));

The indication for laziness that was in cpu_tlbstate.state may be a better
indication of whether the cpu needs to be cleared from the previous mm_cpumask().
If you kept this indication, you could have used this per-cpu information in
switch_mm_irqs_off() instead of "cpumask_test_cpu(cpu, mm_cpumask(next))”,
which might have been accessed by another core.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 00/10] PCID and improved laziness
  2017-06-14  4:56 [PATCH v2 00/10] PCID and improved laziness Andy Lutomirski
                   ` (10 preceding siblings ...)
  2017-06-14 22:18 ` [PATCH v2 00/10] PCID and improved laziness Dave Hansen
@ 2017-06-18 21:29 ` Levin, Alexander (Sasha Levin)
  2017-06-19  4:43   ` Andy Lutomirski
  11 siblings, 1 reply; 30+ messages in thread
From: Levin, Alexander (Sasha Levin) @ 2017-06-18 21:29 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, linux-kernel, Borislav Petkov, Linus Torvalds,
	Andrew Morton, Mel Gorman, linux-mm, Nadav Amit, Rik van Riel,
	Dave Hansen, Arjan van de Ven, Peter Zijlstra

On Tue, Jun 13, 2017 at 09:56:18PM -0700, Andy Lutomirski wrote:
>There are three performance benefits here:
>
>1. TLB flushing is slow.  (I.e. the flush itself takes a while.)
>   This avoids many of them when switching tasks by using PCID.  In
>   a stupid little benchmark I did, it saves about 100ns on my laptop
>   per context switch.  I'll try to improve that benchmark.
>
>2. Mms that have been used recently on a given CPU might get to keep
>   their TLB entries alive across process switches with this patch
>   set.  TLB fills are pretty fast on modern CPUs, but they're even
>   faster when they don't happen.
>
>3. Lazy TLB is way better.  We used to do two stupid things when we
>   ran kernel threads: we'd send IPIs to flush user contexts on their
>   CPUs and then we'd write to CR3 for no particular reason as an excuse
>   to stop further IPIs.  With this patch, we do neither.
>
>This will, in general, perform suboptimally if paravirt TLB flushing
>is in use (currently just Xen, I think, but Hyper-V is in the works).
>The code is structured so we could fix it in one of two ways: we
>could take a spinlock when touching the percpu state so we can update
>it remotely after a paravirt flush, or we could be more careful about
>our exactly how we access the state and use cmpxchg16b to do atomic
>remote updates.  (On SMP systems without cmpxchg16b, we'd just skip
>the optimization entirely.)

Hey Andy,

I've started seeing the following in -next:

------------[ cut here ]------------
kernel BUG at arch/x86/mm/tlb.c:47!
invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
Dumping ftrace buffer:
   (ftrace buffer empty)
Modules linked in:
CPU: 0 PID: 5302 Comm: kworker/u9:1 Not tainted 4.12.0-rc5+ #142
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1ubuntu1 04/01/2014
Workqueue: writeback wb_workfn (flush-259:0)
task: ffff880030ad0040 task.stack: ffff880036e78000
RIP: 0010:leave_mm+0x33/0x40 arch/x86/mm/tlb.c:50
RSP: 0018:ffff880036e7d4c8 EFLAGS: 00010246
RAX: 0000000000000001 RBX: ffff88006a65e240 RCX: dffffc0000000000
RDX: 0000000000000000 RSI: ffffffffb1475fa0 RDI: 0000000000000000
RBP: ffff880036e7d638 R08: 1ffff10006dcfad1 R09: ffff880030ad0040
R10: ffff880036e7d3b8 R11: 0000000000000000 R12: 1ffff10006dcfa9e
R13: ffff880036e7d6c0 R14: ffff880036e7d680 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff88003ea00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000c420019318 CR3: 0000000047a28000 CR4: 00000000000406f0
Call Trace:
 flush_tlb_func_local arch/x86/mm/tlb.c:239 [inline]
 flush_tlb_mm_range+0x26d/0x370 arch/x86/mm/tlb.c:317
 flush_tlb_page arch/x86/include/asm/tlbflush.h:253 [inline]
 ptep_clear_flush+0xd5/0x110 mm/pgtable-generic.c:86
 page_mkclean_one+0x242/0x540 mm/rmap.c:867
 rmap_walk_file+0x5e3/0xd20 mm/rmap.c:1681
 rmap_walk+0x1cd/0x2f0 mm/rmap.c:1699
 page_mkclean+0x2a0/0x380 mm/rmap.c:928
 clear_page_dirty_for_io+0x37e/0x9d0 mm/page-writeback.c:2703
 mpage_submit_page+0x77/0x230 fs/ext4/inode.c:2131
 mpage_process_page_bufs+0x427/0x500 fs/ext4/inode.c:2261
 mpage_prepare_extent_to_map+0x78d/0xcf0 fs/ext4/inode.c:2638
 ext4_writepages+0x13be/0x3dd0 fs/ext4/inode.c:2784
 do_writepages+0xff/0x170 mm/page-writeback.c:2357
 __writeback_single_inode+0x1d9/0x1480 fs/fs-writeback.c:1319
 writeback_sb_inodes+0x6e2/0x1260 fs/fs-writeback.c:1583
 wb_writeback+0x45d/0xed0 fs/fs-writeback.c:1759
 wb_do_writeback fs/fs-writeback.c:1891 [inline]
 wb_workfn+0x2b5/0x1460 fs/fs-writeback.c:1927
 process_one_work+0xbfa/0x1d30 kernel/workqueue.c:2097
 worker_thread+0x221/0x1860 kernel/workqueue.c:2231
 kthread+0x35f/0x430 kernel/kthread.c:231
 ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:425
Code: 48 3d 80 96 f8 b1 74 22 65 8b 05 f1 42 8c 53 83 f8 01 74 17 55 31 d2 48 c7 c6 80 96 f8 b1 31 ff 48 89 e5 e8 60 ff ff ff 5d c3 c3 <0f> 0b 90 66 2e 0f 1f 84 00 00 00 00 00 48 c7 c0 b4 10 73 b2 55 
RIP: leave_mm+0x33/0x40 arch/x86/mm/tlb.c:50 RSP: ffff880036e7d4c8
---[ end trace 3b5d5a6fb6e394f8 ]---
Kernel panic - not syncing: Fatal exception
Dumping ftrace buffer:
   (ftrace buffer empty)
Kernel Offset: 0x2b800000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
Rebooting in 86400 seconds..

Don't really have an easy way to reproduce it...

-- 

Thanks,
Sasha

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 00/10] PCID and improved laziness
  2017-06-18 21:29 ` Levin, Alexander (Sasha Levin)
@ 2017-06-19  4:43   ` Andy Lutomirski
  0 siblings, 0 replies; 30+ messages in thread
From: Andy Lutomirski @ 2017-06-19  4:43 UTC (permalink / raw)
  To: Levin, Alexander (Sasha Levin)
  Cc: Andy Lutomirski, x86, linux-kernel, Borislav Petkov,
	Linus Torvalds, Andrew Morton, Mel Gorman, linux-mm, Nadav Amit,
	Rik van Riel, Dave Hansen, Arjan van de Ven, Peter Zijlstra

On Sun, Jun 18, 2017 at 2:29 PM, Levin, Alexander (Sasha Levin)
<alexander.levin@verizon.com> wrote:
> On Tue, Jun 13, 2017 at 09:56:18PM -0700, Andy Lutomirski wrote:
>>There are three performance benefits here:
>>
>>1. TLB flushing is slow.  (I.e. the flush itself takes a while.)
>>   This avoids many of them when switching tasks by using PCID.  In
>>   a stupid little benchmark I did, it saves about 100ns on my laptop
>>   per context switch.  I'll try to improve that benchmark.
>>
>>2. Mms that have been used recently on a given CPU might get to keep
>>   their TLB entries alive across process switches with this patch
>>   set.  TLB fills are pretty fast on modern CPUs, but they're even
>>   faster when they don't happen.
>>
>>3. Lazy TLB is way better.  We used to do two stupid things when we
>>   ran kernel threads: we'd send IPIs to flush user contexts on their
>>   CPUs and then we'd write to CR3 for no particular reason as an excuse
>>   to stop further IPIs.  With this patch, we do neither.
>>
>>This will, in general, perform suboptimally if paravirt TLB flushing
>>is in use (currently just Xen, I think, but Hyper-V is in the works).
>>The code is structured so we could fix it in one of two ways: we
>>could take a spinlock when touching the percpu state so we can update
>>it remotely after a paravirt flush, or we could be more careful about
>>our exactly how we access the state and use cmpxchg16b to do atomic
>>remote updates.  (On SMP systems without cmpxchg16b, we'd just skip
>>the optimization entirely.)
>
> Hey Andy,
>
> I've started seeing the following in -next:
>
> ------------[ cut here ]------------
> kernel BUG at arch/x86/mm/tlb.c:47!

...

> Call Trace:
>  flush_tlb_func_local arch/x86/mm/tlb.c:239 [inline]
>  flush_tlb_mm_range+0x26d/0x370 arch/x86/mm/tlb.c:317
>  flush_tlb_page arch/x86/include/asm/tlbflush.h:253 [inline]

I think I see what's going on, and it should be fixed in the PCID
series.  I'll split out the fix.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 05/10] x86/mm: Rework lazy TLB mode and TLB freshness tracking
  2017-06-18  8:06   ` Nadav Amit
@ 2017-06-19 21:58     ` Andy Lutomirski
  0 siblings, 0 replies; 30+ messages in thread
From: Andy Lutomirski @ 2017-06-19 21:58 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andy Lutomirski, X86 ML, LKML, Borislav Petkov, Linus Torvalds,
	Andrew Morton, Mel Gorman, linux-mm, Rik van Riel, Dave Hansen,
	Arjan van de Ven, Peter Zijlstra, Andrew Banman, Mike Travis,
	Dimitri Sivanich, Juergen Gross, Boris Ostrovsky

On Sun, Jun 18, 2017 at 1:06 AM, Nadav Amit <nadav.amit@gmail.com> wrote:
>
>> On Jun 13, 2017, at 9:56 PM, Andy Lutomirski <luto@kernel.org> wrote:
>>
>> x86's lazy TLB mode used to be fairly weak -- it would switch to
>> init_mm the first time it tried to flush a lazy TLB.  This meant an
>> unnecessary CR3 write and, if the flush was remote, an unnecessary
>> IPI.
>>
>> Rewrite it entirely.  When we enter lazy mode, we simply remove the
>> cpu from mm_cpumask.  This means that we need a way to figure out
>> whether we've missed a flush when we switch back out of lazy mode.
>> I use the tlb_gen machinery to track whether a context is up to
>> date.
>>
>> Note to reviewers: this patch, by itself, looks a bit odd.  I'm
>> using an array of length 1 containing (ctx_id, tlb_gen) rather than
>> just storing tlb_gen, and making it an array isn't necessary yet.
>> I'm doing this because the next few patches add PCID support, and,
>> with PCID, we need ctx_id, and the array will end up with a length
>> greater than 1.  Making it an array now means that there will be
>> less churn and therefore less stress on your eyeballs.
>>
>> NB: This is dubious but, AFAICT, still correct on Xen and UV.
>> xen_exit_mmap() uses mm_cpumask() for nefarious purposes and this
>> patch changes the way that mm_cpumask() works.  This should be okay,
>> since Xen *also* iterates all online CPUs to find all the CPUs it
>> needs to twiddle.
>>
>> The UV tlbflush code is rather dated and should be changed.
>>
>> Cc: Andrew Banman <abanman@sgi.com>
>> Cc: Mike Travis <travis@sgi.com>
>> Cc: Dimitri Sivanich <sivanich@sgi.com>
>> Cc: Juergen Gross <jgross@suse.com>
>> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
>> Signed-off-by: Andy Lutomirski <luto@kernel.org>
>> ---
>> arch/x86/include/asm/mmu_context.h |   6 +-
>> arch/x86/include/asm/tlbflush.h    |   4 -
>> arch/x86/mm/init.c                 |   1 -
>> arch/x86/mm/tlb.c                  | 242 +++++++++++++++++++------------------
>> 4 files changed, 131 insertions(+), 122 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
>> index e5295d485899..69a4f1ee86ac 100644
>> --- a/arch/x86/include/asm/mmu_context.h
>> +++ b/arch/x86/include/asm/mmu_context.h
>> @@ -125,8 +125,10 @@ static inline void switch_ldt(struct mm_struct *prev, struct mm_struct *next)
>>
>> static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
>> {
>> -     if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK)
>> -             this_cpu_write(cpu_tlbstate.state, TLBSTATE_LAZY);
>> +     int cpu = smp_processor_id();
>> +
>> +     if (cpumask_test_cpu(cpu, mm_cpumask(mm)))
>> +             cpumask_clear_cpu(cpu, mm_cpumask(mm));
>
> The indication for laziness that was in cpu_tlbstate.state may be a better
> indication of whether the cpu needs to be cleared from the previous mm_cpumask().
> If you kept this indication, you could have used this per-cpu information in
> switch_mm_irqs_off() instead of "cpumask_test_cpu(cpu, mm_cpumask(next))”,
> which might have been accessed by another core.

Hmm, fair enough.  On the other hand, this is the least of our
problems in this particular case -- the scheduler's use of mmgrab()
and mmdrop() is probably at least as bad, if not worse.  My preference
would be to get all this stuff merged and then see if we want to add
some scalability improvements on top.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 05/10] x86/mm: Rework lazy TLB mode and TLB freshness tracking
  2017-06-14  6:09   ` Juergen Gross
@ 2017-06-19 22:00     ` Andy Lutomirski
  0 siblings, 0 replies; 30+ messages in thread
From: Andy Lutomirski @ 2017-06-19 22:00 UTC (permalink / raw)
  To: Juergen Gross
  Cc: Andy Lutomirski, X86 ML, linux-kernel, Borislav Petkov,
	Linus Torvalds, Andrew Morton, Mel Gorman, linux-mm, Nadav Amit,
	Rik van Riel, Dave Hansen, Arjan van de Ven, Peter Zijlstra,
	Andrew Banman, Mike Travis, Dimitri Sivanich, Boris Ostrovsky

On Tue, Jun 13, 2017 at 11:09 PM, Juergen Gross <jgross@suse.com> wrote:
> On 14/06/17 06:56, Andy Lutomirski wrote:
>> x86's lazy TLB mode used to be fairly weak -- it would switch to
>> init_mm the first time it tried to flush a lazy TLB.  This meant an
>> unnecessary CR3 write and, if the flush was remote, an unnecessary
>> IPI.
>>
>> Rewrite it entirely.  When we enter lazy mode, we simply remove the
>> cpu from mm_cpumask.  This means that we need a way to figure out
>> whether we've missed a flush when we switch back out of lazy mode.
>> I use the tlb_gen machinery to track whether a context is up to
>> date.
>>
>> Note to reviewers: this patch, by itself, looks a bit odd.  I'm
>> using an array of length 1 containing (ctx_id, tlb_gen) rather than
>> just storing tlb_gen, and making it an array isn't necessary yet.
>> I'm doing this because the next few patches add PCID support, and,
>> with PCID, we need ctx_id, and the array will end up with a length
>> greater than 1.  Making it an array now means that there will be
>> less churn and therefore less stress on your eyeballs.
>>
>> NB: This is dubious but, AFAICT, still correct on Xen and UV.
>> xen_exit_mmap() uses mm_cpumask() for nefarious purposes and this
>> patch changes the way that mm_cpumask() works.  This should be okay,
>> since Xen *also* iterates all online CPUs to find all the CPUs it
>> needs to twiddle.
>
> There is an allocation failure path in xen_drop_mm_ref() which might
> be wrong with this patch. As this path should only be taken very
> rarely, I'd suggest removing the test for mm_cpumask() bit zero in
> this path.
>

Right, fixed.

>
> Juergen

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 10/10] x86/mm: Try to preserve old TLB entries using PCID
  2017-06-18  6:26   ` Nadav Amit
@ 2017-06-19 22:02     ` Andy Lutomirski
  2017-06-19 22:53       ` Nadav Amit
  0 siblings, 1 reply; 30+ messages in thread
From: Andy Lutomirski @ 2017-06-19 22:02 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andy Lutomirski, X86 ML, linux-kernel, Borislav Petkov,
	Linus Torvalds, Andrew Morton, Mel Gorman, linux-mm,
	Rik van Riel, Dave Hansen, Arjan van de Ven, Peter Zijlstra

On Sat, Jun 17, 2017 at 11:26 PM, Nadav Amit <nadav.amit@gmail.com> wrote:
>
>> On Jun 13, 2017, at 9:56 PM, Andy Lutomirski <luto@kernel.org> wrote:
>>
>> PCID is a "process context ID" -- it's what other architectures call
>> an address space ID.  Every non-global TLB entry is tagged with a
>> PCID, only TLB entries that match the currently selected PCID are
>> used, and we can switch PGDs without flushing the TLB.  x86's
>> PCID is 12 bits.
>>
>> This is an unorthodox approach to using PCID.  x86's PCID is far too
>> short to uniquely identify a process, and we can't even really
>> uniquely identify a running process because there are monster
>> systems with over 4096 CPUs.  To make matters worse, past attempts
>> to use all 12 PCID bits have resulted in slowdowns instead of
>> speedups.
>>
>> This patch uses PCID differently.  We use a PCID to identify a
>> recently-used mm on a per-cpu basis.  An mm has no fixed PCID
>> binding at all; instead, we give it a fresh PCID each time it's
>> loaded except in cases where we want to preserve the TLB, in which
>> case we reuse a recent value.
>>
>> In particular, we use PCIDs 1-3 for recently-used mms and we reserve
>> PCID 0 for swapper_pg_dir and for PCID-unaware CR3 users (e.g. EFI).
>> Nothing ever switches to PCID 0 without flushing PCID 0 non-global
>> pages, so PCID 0 conflicts won't cause problems.
>
> Is this commit message outdated?

Yes, it's old.  Will fix.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 10/10] x86/mm: Try to preserve old TLB entries using PCID
  2017-06-19 22:02     ` Andy Lutomirski
@ 2017-06-19 22:53       ` Nadav Amit
  2017-06-19 23:04         ` Nadav Amit
  0 siblings, 1 reply; 30+ messages in thread
From: Nadav Amit @ 2017-06-19 22:53 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, linux-kernel, Borislav Petkov, Linus Torvalds,
	Andrew Morton, Mel Gorman, linux-mm, Rik van Riel, Dave Hansen,
	Arjan van de Ven, Peter Zijlstra

Andy Lutomirski <luto@kernel.org> wrote:

> On Sat, Jun 17, 2017 at 11:26 PM, Nadav Amit <nadav.amit@gmail.com> wrote:
>>> On Jun 13, 2017, at 9:56 PM, Andy Lutomirski <luto@kernel.org> wrote:
>>> 
>>> PCID is a "process context ID" -- it's what other architectures call
>>> an address space ID.  Every non-global TLB entry is tagged with a
>>> PCID, only TLB entries that match the currently selected PCID are
>>> used, and we can switch PGDs without flushing the TLB.  x86's
>>> PCID is 12 bits.
>>> 
>>> This is an unorthodox approach to using PCID.  x86's PCID is far too
>>> short to uniquely identify a process, and we can't even really
>>> uniquely identify a running process because there are monster
>>> systems with over 4096 CPUs.  To make matters worse, past attempts
>>> to use all 12 PCID bits have resulted in slowdowns instead of
>>> speedups.
>>> 
>>> This patch uses PCID differently.  We use a PCID to identify a
>>> recently-used mm on a per-cpu basis.  An mm has no fixed PCID
>>> binding at all; instead, we give it a fresh PCID each time it's
>>> loaded except in cases where we want to preserve the TLB, in which
>>> case we reuse a recent value.
>>> 
>>> In particular, we use PCIDs 1-3 for recently-used mms and we reserve
>>> PCID 0 for swapper_pg_dir and for PCID-unaware CR3 users (e.g. EFI).
>>> Nothing ever switches to PCID 0 without flushing PCID 0 non-global
>>> pages, so PCID 0 conflicts won't cause problems.
>> 
>> Is this commit message outdated?
> 
> Yes, it's old.  Will fix.

Just to clarify: I asked since I don’t understand how the interaction with
PCID-unaware CR3 users goes. Specifically, IIUC, arch_efi_call_virt_teardown()
can reload CR3 with an old PCID value. No?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 10/10] x86/mm: Try to preserve old TLB entries using PCID
  2017-06-19 22:53       ` Nadav Amit
@ 2017-06-19 23:04         ` Nadav Amit
  0 siblings, 0 replies; 30+ messages in thread
From: Nadav Amit @ 2017-06-19 23:04 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, linux-kernel, Borislav Petkov, Linus Torvalds,
	Andrew Morton, Mel Gorman, linux-mm, Rik van Riel, Dave Hansen,
	Arjan van de Ven, Peter Zijlstra

Nadav Amit <nadav.amit@gmail.com> wrote:

>> 
> Just to clarify: I asked since I don’t understand how the interaction with
> PCID-unaware CR3 users go. Specifically, IIUC, arch_efi_call_virt_teardown()
> can reload CR3 with an old PCID value. No?

Please ignore this email. I realized it is not a problem.

Nadav

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2017-06-19 23:04 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-06-14  4:56 [PATCH v2 00/10] PCID and improved laziness Andy Lutomirski
2017-06-14  4:56 ` [PATCH v2 01/10] x86/ldt: Simplify LDT switching logic Andy Lutomirski
2017-06-15 18:53   ` Rik van Riel
2017-06-14  4:56 ` [PATCH v2 02/10] x86/mm: Remove reset_lazy_tlbstate() Andy Lutomirski
2017-06-15 19:29   ` Rik van Riel
2017-06-14  4:56 ` [PATCH v2 03/10] x86/mm: Give each mm TLB flush generation a unique ID Andy Lutomirski
2017-06-14 15:54   ` Dave Hansen
2017-06-14 17:16     ` Andy Lutomirski
2017-06-14  4:56 ` [PATCH v2 04/10] x86/mm: Track the TLB's tlb_gen and update the flushing algorithm Andy Lutomirski
2017-06-14  4:56 ` [PATCH v2 05/10] x86/mm: Rework lazy TLB mode and TLB freshness tracking Andy Lutomirski
2017-06-14  6:09   ` Juergen Gross
2017-06-19 22:00     ` Andy Lutomirski
2017-06-14 22:33   ` Dave Hansen
2017-06-14 22:42     ` Andy Lutomirski
2017-06-18  8:06   ` Nadav Amit
2017-06-19 21:58     ` Andy Lutomirski
2017-06-14  4:56 ` [PATCH v2 06/10] x86/mm: Stop calling leave_mm() in idle code Andy Lutomirski
2017-06-14  4:56 ` [PATCH v2 07/10] x86/mm: Disable PCID on 32-bit kernels Andy Lutomirski
2017-06-14  4:56 ` [PATCH v2 08/10] x86/mm: Add nopcid to turn off PCID Andy Lutomirski
2017-06-14  4:56 ` [PATCH v2 09/10] x86/mm: Enable CR4.PCIDE on supported systems Andy Lutomirski
2017-06-14  5:30   ` Juergen Gross
2017-06-14  4:56 ` [PATCH v2 10/10] x86/mm: Try to preserve old TLB entries using PCID Andy Lutomirski
2017-06-18  6:26   ` Nadav Amit
2017-06-19 22:02     ` Andy Lutomirski
2017-06-19 22:53       ` Nadav Amit
2017-06-19 23:04         ` Nadav Amit
2017-06-14 22:18 ` [PATCH v2 00/10] PCID and improved laziness Dave Hansen
2017-06-14 22:48   ` Andy Lutomirski
2017-06-18 21:29 ` Levin, Alexander (Sasha Levin)
2017-06-19  4:43   ` Andy Lutomirski

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).