* [RFC 00/11] PCID and improved laziness
@ 2017-06-05 22:36 Andy Lutomirski
  2017-06-05 22:36 ` [RFC 01/11] x86/ldt: Simplify LDT switching logic Andy Lutomirski
                   ` (10 more replies)
  0 siblings, 11 replies; 31+ messages in thread
From: Andy Lutomirski @ 2017-06-05 22:36 UTC (permalink / raw)
  To: X86 ML
  Cc: Borislav Petkov, Linus Torvalds, Andrew Morton, Mel Gorman,
	linux-mm, Nadav Amit, Rik van Riel, Andy Lutomirski

I think that this is in good enough shape to review.  I'm hoping to get
it in for 4.13.

There are three performance benefits here:

1. TLB flushing is slow.  (I.e. the flush itself takes a while.)
   This avoids many of them when switching tasks by using PCID.  In
   a stupid little benchmark I did, it saves about 100ns on my laptop
   per context switch.  I'll try to improve that benchmark; a rough
   sketch of the sort of thing I mean is below, after this list.

2. Mms that have been used recently on a given CPU might get to keep
   their TLB entries alive across process switches with this patch
   set.  TLB fills are pretty fast on modern CPUs, but they're even
   faster when they don't happen.

3. Lazy TLB is way better.  We used to do two stupid things when we
   ran kernel threads: we'd send IPIs to flush user contexts on their
   CPUs and then we'd write to CR3 for no particular reason as an excuse
   to stop further IPIs.  With this patch, we do neither.
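
Here's that sketch (illustrative only -- not the exact program I ran):
a pipe ping-pong between a parent and a forked child.  Pin both to one
CPU (e.g. taskset -c 0) so each round trip forces two switches between
distinct mms:

#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/wait.h>

#define ITERS 200000L

int main(void)
{
	int to_child[2], to_parent[2];
	struct timespec t0, t1;
	char c = 0;
	long i;

	if (pipe(to_child) || pipe(to_parent))
		return 1;

	if (fork() == 0) {
		/* Child: bounce every byte straight back. */
		for (i = 0; i < ITERS; i++) {
			if (read(to_child[0], &c, 1) != 1 ||
			    write(to_parent[1], &c, 1) != 1)
				return 1;
		}
		return 0;
	}

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < ITERS; i++) {
		if (write(to_child[1], &c, 1) != 1 ||
		    read(to_parent[0], &c, 1) != 1)
			return 1;
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("%.1f ns per round trip (two switches when pinned)\n",
	       ((t1.tv_sec - t0.tv_sec) * 1e9 +
		(t1.tv_nsec - t0.tv_nsec)) / ITERS);
	wait(NULL);
	return 0;
}

Each round trip is two context switches, so halve the printed number
for a rough per-switch estimate.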

This will, in general, perform suboptimally if paravirt TLB flushing
is in use (currently just Xen, I think, but Hyper-V is in the works).
The code is structured so we could fix it in one of two ways: we
could take a spinlock when touching the percpu state so we can update
it remotely after a paravirt flush, or we could be more careful about
exactly how we access the state and use cmpxchg16b to do atomic
remote updates.  (On SMP systems without cmpxchg16b, we'd just skip
the optimization entirely.)
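
If we went the cmpxchg16b route, the remote update would look very
roughly like the sketch below.  To be clear, none of this is in the
series and the names are invented; the real thing would hang off
cpu_tlbstate and would need the pair kept 16-byte aligned:

/*
 * Illustrative kernel-context sketch only: atomically advance another
 * CPU's (ctx_id, tlb_gen) pair after a paravirt remote flush, using
 * cmpxchg_double() (i.e. cmpxchg16b, so this needs X86_FEATURE_CX16).
 */
struct tlb_ctx_pair {
	u64 ctx_id;
	u64 tlb_gen;
} __aligned(2 * sizeof(u64));

static bool remote_catch_up(struct tlb_ctx_pair *p, u64 ctx_id,
			    u64 old_gen, u64 new_gen)
{
	/* Fails harmlessly if the target CPU moved on to another mm/gen. */
	return cmpxchg_double(&p->ctx_id, &p->tlb_gen,
			      ctx_id, old_gen,
			      ctx_id, new_gen);
}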

This code is running on my laptop right now and it hasn't blown up
yet, so it's obviously entirely bug-free. :)

What do you all think?

This is based on tip:x86/mm.  The branch is here if you want to play:
https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/log/?h=x86/pcid

Andy Lutomirski (11):
  x86/ldt: Simplify LDT switching logic
  x86/mm: Remove reset_lazy_tlbstate()
  x86/mm: Give each mm TLB flush generation a unique ID
  x86/mm: Track the TLB's tlb_gen and update the flushing algorithm
  x86/mm: Rework lazy TLB mode and TLB freshness tracking
  x86/mm: Stop calling leave_mm() in idle code
  x86/mm: Disable PCID on 32-bit kernels
  x86/mm: Add nopcid to turn off PCID
  x86/mm: Teach CR3 readers about PCID
  x86/mm: Enable CR4.PCIDE on supported systems
  x86/mm: Try to preserve old TLB entries using PCID

 Documentation/admin-guide/kernel-parameters.txt |   2 +
 arch/ia64/include/asm/acpi.h                    |   2 -
 arch/x86/boot/compressed/pagetable.c            |   2 +-
 arch/x86/include/asm/acpi.h                     |   2 -
 arch/x86/include/asm/disabled-features.h        |   4 +-
 arch/x86/include/asm/efi.h                      |   2 +-
 arch/x86/include/asm/mmu.h                      |  25 +-
 arch/x86/include/asm/mmu_context.h              |  41 ++-
 arch/x86/include/asm/paravirt.h                 |   2 +-
 arch/x86/include/asm/processor-flags.h          |  32 +++
 arch/x86/include/asm/processor.h                |   8 +
 arch/x86/include/asm/special_insns.h            |  10 +-
 arch/x86/include/asm/tlbflush.h                 |  91 +++++-
 arch/x86/kernel/cpu/bugs.c                      |   8 +
 arch/x86/kernel/cpu/common.c                    |  33 +++
 arch/x86/kernel/head64.c                        |   3 +-
 arch/x86/kernel/paravirt.c                      |   2 +-
 arch/x86/kernel/process_32.c                    |   2 +-
 arch/x86/kernel/process_64.c                    |   2 +-
 arch/x86/kernel/smpboot.c                       |   1 -
 arch/x86/kvm/vmx.c                              |   2 +-
 arch/x86/mm/fault.c                             |  10 +-
 arch/x86/mm/init.c                              |   2 +-
 arch/x86/mm/ioremap.c                           |   2 +-
 arch/x86/mm/tlb.c                               | 351 +++++++++++++++---------
 arch/x86/platform/efi/efi_64.c                  |   4 +-
 arch/x86/platform/olpc/olpc-xo1-pm.c            |   2 +-
 arch/x86/power/cpu.c                            |   2 +-
 arch/x86/power/hibernate_64.c                   |   3 +-
 arch/x86/xen/mmu_pv.c                           |   6 +-
 arch/x86/xen/setup.c                            |   6 +
 drivers/acpi/processor_idle.c                   |   2 -
 drivers/idle/intel_idle.c                       |   8 +-
 33 files changed, 483 insertions(+), 191 deletions(-)

-- 
2.9.3

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [RFC 01/11] x86/ldt: Simplify LDT switching logic
  2017-06-05 22:36 [RFC 00/11] PCID and improved laziness Andy Lutomirski
@ 2017-06-05 22:36 ` Andy Lutomirski
  2017-06-05 22:40   ` Linus Torvalds
  2017-06-05 22:36 ` [RFC 02/11] x86/mm: Remove reset_lazy_tlbstate() Andy Lutomirski
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 31+ messages in thread
From: Andy Lutomirski @ 2017-06-05 22:36 UTC (permalink / raw)
  To: X86 ML
  Cc: Borislav Petkov, Linus Torvalds, Andrew Morton, Mel Gorman,
	linux-mm, Nadav Amit, Rik van Riel, Andy Lutomirski

We used to switch the LDT if the prev and next mms' LDTs didn't
match.  This was correct but overcomplicated -- it was subject to a
harmless race if prev called modify_ldt() while switching.  It was
also a pointless optimization, since different mms' LDTs are always
different.

Simplify the code to update LDTR if either the previous or the next
mm has an LDT.  While we're at it, clean up the code by moving all
the ifdeffery to a header where it belongs.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/mmu_context.h | 23 +++++++++++++++++++++++
 arch/x86/mm/tlb.c                  | 20 ++------------------
 2 files changed, 25 insertions(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index f20d7ea47095..d59bbfb4c8b4 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -93,6 +93,29 @@ static inline void load_mm_ldt(struct mm_struct *mm)
 #else
 	clear_LDT();
 #endif
+}
+
+static inline void switch_ldt(struct mm_struct *prev, struct mm_struct *next)
+{
+#ifdef CONFIG_MODIFY_LDT_SYSCALL
+	/*
+	 * Load the LDT if either the old or new mm had an LDT.
+	 *
+	 * An mm will never go from having an LDT to not having an LDT.  Two
+	 * mms never share an LDT, so we don't gain anything by checking to
+	 * see whether the LDT changed.  There's also no guarantee that
+	 * prev->context.ldt actually matches LDTR, but, if LDTR is non-NULL,
+	 * then prev->context.ldt will also be non-NULL.
+	 *
+	 * If we really cared, we could optimize the case where prev == next
+	 * and we're exiting lazy mode.  Most of the time, if this happens,
+	 * we don't actually need to reload LDTR, but modify_ldt() is mostly
+	 * used by legacy code and emulators where we don't need this level of
+	 * performance.
+	 */
+	if (unlikely(prev->context.ldt || next->context.ldt))
+		load_mm_ldt(next);
+#endif
 
 	DEBUG_LOCKS_WARN_ON(preemptible());
 }
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 2a5e851f2035..b2485d69f7c2 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -148,25 +148,9 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		     real_prev != &init_mm);
 	cpumask_clear_cpu(cpu, mm_cpumask(real_prev));
 
-	/* Load per-mm CR4 state */
+	/* Load per-mm CR4 and LDTR state */
 	load_mm_cr4(next);
-
-#ifdef CONFIG_MODIFY_LDT_SYSCALL
-	/*
-	 * Load the LDT, if the LDT is different.
-	 *
-	 * It's possible that prev->context.ldt doesn't match
-	 * the LDT register.  This can happen if leave_mm(prev)
-	 * was called and then modify_ldt changed
-	 * prev->context.ldt but suppressed an IPI to this CPU.
-	 * In this case, prev->context.ldt != NULL, because we
-	 * never set context.ldt to NULL while the mm still
-	 * exists.  That means that next->context.ldt !=
-	 * prev->context.ldt, because mms never share an LDT.
-	 */
-	if (unlikely(real_prev->context.ldt != next->context.ldt))
-		load_mm_ldt(next);
-#endif
+	switch_ldt(real_prev, next);
 }
 
 /*
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [RFC 02/11] x86/mm: Remove reset_lazy_tlbstate()
  2017-06-05 22:36 [RFC 00/11] PCID and improved laziness Andy Lutomirski
  2017-06-05 22:36 ` [RFC 01/11] x86/ldt: Simplify LDT switching logic Andy Lutomirski
@ 2017-06-05 22:36 ` Andy Lutomirski
  2017-06-05 22:36 ` [RFC 03/11] x86/mm: Give each mm TLB flush generation a unique ID Andy Lutomirski
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 31+ messages in thread
From: Andy Lutomirski @ 2017-06-05 22:36 UTC (permalink / raw)
  To: X86 ML
  Cc: Borislav Petkov, Linus Torvalds, Andrew Morton, Mel Gorman,
	linux-mm, Nadav Amit, Rik van Riel, Andy Lutomirski

The only call site also calls idle_task_exit(), and idle_task_exit()
puts us into a clean state by explicitly switching to init_mm.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/tlbflush.h | 8 --------
 arch/x86/kernel/smpboot.c       | 1 -
 2 files changed, 9 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 388c2463fde6..ee5a138602e8 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -259,14 +259,6 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 #define TLBSTATE_OK	1
 #define TLBSTATE_LAZY	2
 
-static inline void reset_lazy_tlbstate(void)
-{
-	this_cpu_write(cpu_tlbstate.state, 0);
-	this_cpu_write(cpu_tlbstate.loaded_mm, &init_mm);
-
-	WARN_ON(read_cr3() != __pa_symbol(swapper_pg_dir));
-}
-
 static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
 					struct mm_struct *mm)
 {
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index f04479a8f74f..6169a56aab49 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1589,7 +1589,6 @@ void native_cpu_die(unsigned int cpu)
 void play_dead_common(void)
 {
 	idle_task_exit();
-	reset_lazy_tlbstate();
 
 	/* Ack it */
 	(void)cpu_report_death();
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [RFC 03/11] x86/mm: Give each mm TLB flush generation a unique ID
  2017-06-05 22:36 [RFC 00/11] PCID and improved laziness Andy Lutomirski
  2017-06-05 22:36 ` [RFC 01/11] x86/ldt: Simplify LDT switching logic Andy Lutomirski
  2017-06-05 22:36 ` [RFC 02/11] x86/mm: Remove reset_lazy_tlbstate() Andy Lutomirski
@ 2017-06-05 22:36 ` Andy Lutomirski
  2017-06-05 22:36 ` [RFC 04/11] x86/mm: Track the TLB's tlb_gen and update the flushing algorithm Andy Lutomirski
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 31+ messages in thread
From: Andy Lutomirski @ 2017-06-05 22:36 UTC (permalink / raw)
  To: X86 ML
  Cc: Borislav Petkov, Linus Torvalds, Andrew Morton, Mel Gorman,
	linux-mm, Nadav Amit, Rik van Riel, Andy Lutomirski

This adds two new variables to mmu_context_t: ctx_id and tlb_gen.
ctx_id uniquely identifies the mm_struct and will never be reused.
For a given mm_struct (and hence ctx_id), tlb_gen is a monotonic
count of the number of times that a TLB flush has been requested.
The pair (ctx_id, tlb_gen) can be used as an identifier for TLB
flush actions and will be used in subsequent patches to reliably
determine whether all needed TLB flushes have occurred on a given
CPU.

This patch is split out for ease of review.  By itself, it has no
real effect other than creating and updating the new variables.
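
As a preview of how later patches will consume the pair (simplified,
and not literally the code that lands there): a CPU's TLB is fully up
to date for an mm iff it recorded that mm's ctx_id and its recorded
generation has caught up with the mm's tlb_gen:

static bool cpu_tlb_up_to_date(u64 cpu_ctx_id, u64 cpu_tlb_gen,
			       struct mm_struct *mm)
{
	/* Same mm (ctx_id) and no flush requests newer than what we've done. */
	return cpu_ctx_id == mm->context.ctx_id &&
	       cpu_tlb_gen >= atomic64_read(&mm->context.tlb_gen);
}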

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/mmu.h         | 25 +++++++++++++++++++++++--
 arch/x86/include/asm/mmu_context.h |  5 +++++
 arch/x86/include/asm/tlbflush.h    | 18 ++++++++++++++++++
 arch/x86/mm/tlb.c                  |  6 ++++--
 4 files changed, 50 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
index 79b647a7ebd0..bb8c597c2248 100644
--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -3,12 +3,28 @@
 
 #include <linux/spinlock.h>
 #include <linux/mutex.h>
+#include <linux/atomic.h>
 
 /*
- * The x86 doesn't have a mmu context, but
- * we put the segment information here.
+ * x86 has arch-specific MMU state beyond what lives in mm_struct.
  */
 typedef struct {
+	/*
+	 * ctx_id uniquely identifies this mm_struct.  A ctx_id will never
+	 * be reused, and zero is not a valid ctx_id.
+	 */
+	u64 ctx_id;
+
+	/*
+	 * Any code that needs to do any sort of TLB flushing for this
+	 * mm will first make its changes to the page tables, then
+	 * increment tlb_gen, then flush.  This lets the low-level
+	 * flushing code keep track of what needs flushing.
+	 *
+	 * This is not used on Xen PV.
+	 */
+	atomic64_t tlb_gen;
+
 #ifdef CONFIG_MODIFY_LDT_SYSCALL
 	struct ldt_struct *ldt;
 #endif
@@ -37,6 +53,11 @@ typedef struct {
 #endif
 } mm_context_t;
 
+#define INIT_MM_CONTEXT(mm)						\
+	.context = {							\
+		.ctx_id = 1,						\
+	}
+
 void leave_mm(int cpu);
 
 #endif /* _ASM_X86_MMU_H */
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index d59bbfb4c8b4..e691b4d46b9d 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -126,9 +126,14 @@ static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
 		this_cpu_write(cpu_tlbstate.state, TLBSTATE_LAZY);
 }
 
+extern atomic64_t last_mm_ctx_id;
+
 static inline int init_new_context(struct task_struct *tsk,
 				   struct mm_struct *mm)
 {
+	mm->context.ctx_id = atomic64_inc_return(&last_mm_ctx_id);
+	atomic64_set(&mm->context.tlb_gen, 0);
+
 	#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
 	if (cpu_feature_enabled(X86_FEATURE_OSPKE)) {
 		/* pkey 0 is the default and always allocated */
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index ee5a138602e8..5438f7e07fef 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -57,6 +57,23 @@ static inline void invpcid_flush_all_nonglobals(void)
 	__invpcid(0, 0, INVPCID_TYPE_ALL_NON_GLOBAL);
 }
 
+static inline u64 bump_mm_tlb_gen(struct mm_struct *mm)
+{
+	u64 new_tlb_gen;
+
+	/*
+	 * Bump the generation count.  This also serves as a full barrier
+	 * that synchronizes with switch_mm: callers are required to order
+	 * their read of mm_cpumask after their writes to the paging
+	 * structures.
+	 */
+	smp_mb__before_atomic();
+	new_tlb_gen = atomic64_inc_return(&mm->context.tlb_gen);
+	smp_mb__after_atomic();
+
+	return new_tlb_gen;
+}
+
 #ifdef CONFIG_PARAVIRT
 #include <asm/paravirt.h>
 #else
@@ -262,6 +279,7 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
 					struct mm_struct *mm)
 {
+	bump_mm_tlb_gen(mm);
 	cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
 }
 
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index b2485d69f7c2..7c99c50e8bc9 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -28,6 +28,8 @@
  *	Implement flush IPI by CALL_FUNCTION_VECTOR, Alex Shi
  */
 
+atomic64_t last_mm_ctx_id = ATOMIC64_INIT(1);
+
 void leave_mm(int cpu)
 {
 	struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
@@ -283,8 +285,8 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 
 	cpu = get_cpu();
 
-	/* Synchronize with switch_mm. */
-	smp_mb();
+	/* This is also a barrier that synchronizes with switch_mm(). */
+	bump_mm_tlb_gen(mm);
 
 	/* Should we flush just the requested range? */
 	if ((end != TLB_FLUSH_ALL) &&
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [RFC 04/11] x86/mm: Track the TLB's tlb_gen and update the flushing algorithm
  2017-06-05 22:36 [RFC 00/11] PCID and improved laziness Andy Lutomirski
                   ` (2 preceding siblings ...)
  2017-06-05 22:36 ` [RFC 03/11] x86/mm: Give each mm TLB flush generation a unique ID Andy Lutomirski
@ 2017-06-05 22:36 ` Andy Lutomirski
  2017-06-06  5:03   ` Nadav Amit
  2017-06-05 22:36 ` [RFC 05/11] x86/mm: Rework lazy TLB mode and TLB freshness tracking Andy Lutomirski
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 31+ messages in thread
From: Andy Lutomirski @ 2017-06-05 22:36 UTC (permalink / raw)
  To: X86 ML
  Cc: Borislav Petkov, Linus Torvalds, Andrew Morton, Mel Gorman,
	linux-mm, Nadav Amit, Rik van Riel, Andy Lutomirski

There are two kernel features that would benefit from tracking
how up-to-date each CPU's TLB is in the case where IPIs aren't keeping
it up to date in real time:

 - Lazy mm switching currently works by switching to init_mm when
   it would otherwise flush.  This is wasteful: there isn't fundamentally
   any need to update CR3 at all when going lazy or when returning from
   lazy mode, nor is there any need to receive flush IPIs at all.  Instead,
   we should just stop trying to keep the TLB coherent when we go lazy and,
   when unlazying, check whether we missed any flushes.

 - PCID will let us keep recent user contexts alive in the TLB.  If we
   start doing this, we need a way to decide whether those contexts are
   up to date.

On some paravirt systems, remote TLBs can be flushed without IPIs.
This won't update the target CPUs' tlb_gens, which may cause
unnecessary local flushes later on.  We can address this if it becomes
a problem by carefully updating the target CPU's tlb_gen directly.

By itself, this patch is a very minor optimization that avoids
unnecessary flushes when multiple TLB flushes targeting the same CPU
race.
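
To make that race concrete (illustrative numbers, not from a real trace):

  CPU A: flush_tlb_mm_range() bumps mm->context.tlb_gen from 1 to 2, sends IPIs
  CPU B: flush_tlb_mm_range() bumps mm->context.tlb_gen from 2 to 3, sends IPIs
  target CPU: the first IPI to arrive reads mm_tlb_gen == 3, does a single
              (full) flush, and records local tlb_gen = 3
  target CPU: the second IPI sees local_tlb_gen == mm_tlb_gen == 3 and
              returns without flushing anything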

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/tlbflush.h | 37 +++++++++++++++++++
 arch/x86/mm/tlb.c               | 79 +++++++++++++++++++++++++++++++++++++----
 2 files changed, 109 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 5438f7e07fef..646787ff1a01 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -82,6 +82,11 @@ static inline u64 bump_mm_tlb_gen(struct mm_struct *mm)
 #define __flush_tlb_single(addr) __native_flush_tlb_single(addr)
 #endif
 
+struct tlb_context {
+	u64 ctx_id;
+	u64 tlb_gen;
+};
+
 struct tlb_state {
 	/*
 	 * cpu_tlbstate.loaded_mm should match CR3 whenever interrupts
@@ -97,6 +102,21 @@ struct tlb_state {
 	 * disabling interrupts when modifying either one.
 	 */
 	unsigned long cr4;
+
+	/*
+	 * This is a list of all contexts that might exist in the TLB.
+	 * Since we don't yet use PCID, there is only one context.
+	 *
+	 * For each context, ctx_id indicates which mm the TLB's user
+	 * entries came from.  As an invariant, the TLB will never
+	 * contain entries that are out-of-date as when that mm reached
+	 * contain entries that are out-of-date as of when that mm reached
+	 *
+	 * To be clear, this means that it's legal for the TLB code to
+	 * flush the TLB without updating tlb_gen.  This can happen
+	 * (for now, at least) due to paravirt remote flushes.
+	 */
+	struct tlb_context ctxs[1];
 };
 DECLARE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate);
 
@@ -248,9 +268,26 @@ static inline void __flush_tlb_one(unsigned long addr)
  * and page-granular flushes are available only on i486 and up.
  */
 struct flush_tlb_info {
+	/*
+	 * We support several kinds of flushes.
+	 *
+	 * - Fully flush a single mm.  flush_mm will be set, flush_end will be
+	 *   TLB_FLUSH_ALL, and new_tlb_gen will be the tlb_gen to which the
+	 *   IPI sender is trying to catch us up.
+	 *
+	 * - Partially flush a single mm.  flush_mm will be set, flush_start
+	 *   and flush_end will indicate the range, and new_tlb_gen will be
+	 *   set such that the changes between generation new_tlb_gen-1 and
+	 *   new_tlb_gen are entirely contained in the indicated range.
+	 *
+	 * - Fully flush all mms whose tlb_gens have been updated.  flush_mm
+	 *   will be NULL, flush_end will be TLB_FLUSH_ALL, and new_tlb_gen
+	 *   will be zero.
+	 */
 	struct mm_struct *mm;
 	unsigned long start;
 	unsigned long end;
+	u64 new_tlb_gen;
 };
 
 #define local_flush_tlb() __flush_tlb()
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 7c99c50e8bc9..3b19ba748e92 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -105,6 +105,9 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 	}
 
 	this_cpu_write(cpu_tlbstate.loaded_mm, next);
+	this_cpu_write(cpu_tlbstate.ctxs[0].ctx_id, next->context.ctx_id);
+	this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen,
+		       atomic64_read(&next->context.tlb_gen));
 
 	WARN_ON_ONCE(cpumask_test_cpu(cpu, mm_cpumask(next)));
 	cpumask_set_cpu(cpu, mm_cpumask(next));
@@ -194,17 +197,70 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 static void flush_tlb_func_common(const struct flush_tlb_info *f,
 				  bool local, enum tlb_flush_reason reason)
 {
+	struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
+
+	/*
+	 * Our memory ordering requirement is that any TLB fills that
+	 * happen after we flush the TLB are ordered after we read
+	 * active_mm's tlb_gen.  We don't need any explicit barrier
+	 * because all x86 flush operations are serializing and the
+	 * atomic64_read operation won't be reordered by the compiler.
+	 */
+	u64 mm_tlb_gen = atomic64_read(&loaded_mm->context.tlb_gen);
+	u64 local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[0].tlb_gen);
+
+	VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[0].ctx_id) !=
+		   loaded_mm->context.ctx_id);
+
 	if (this_cpu_read(cpu_tlbstate.state) != TLBSTATE_OK) {
+		/*
+		 * leave_mm() is adequate to handle any type of flush, and
+		 * we would prefer not to receive further IPIs.
+		 */
 		leave_mm(smp_processor_id());
 		return;
 	}
 
-	if (f->end == TLB_FLUSH_ALL) {
-		local_flush_tlb();
-		if (local)
-			count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
-		trace_tlb_flush(reason, TLB_FLUSH_ALL);
-	} else {
+	if (local_tlb_gen == mm_tlb_gen) {
+		/*
+		 * There's nothing to do: we're already up to date.  This can
+		 * happen if two concurrent flushes happen -- the first IPI to
+		 * be handled can catch us all the way up, leaving no work for
+		 * the second IPI to be handled.
+		 */
+		return;
+	}
+
+	WARN_ON_ONCE(local_tlb_gen > mm_tlb_gen);
+	WARN_ON_ONCE(f->new_tlb_gen > mm_tlb_gen);
+
+	/*
+	 * If we get to this point, we know that our TLB is out of date.
+	 * This does not strictly imply that we need to flush (it's
+	 * possible that f->new_tlb_gen <= local_tlb_gen), but we're
+	 * going to need to flush in the very near future, so we might
+	 * as well get it over with.
+	 *
+	 * The only question is whether to do a full or partial flush.
+	 *
+	 * A partial TLB flush is safe and worthwhile if two conditions are
+	 * met:
+	 *
+	 * 1. We wouldn't be skipping a tlb_gen.  If the requester bumped
+	 *    the mm's tlb_gen from p to p+1, a partial flush is only correct
+	 *    if we would be bumping the local CPU's tlb_gen from p to p+1 as
+	 *    well.
+	 *
+	 * 2. If there are no more flushes on their way.  Partial TLB
+	 *    flushes are not all that much cheaper than full TLB
+	 *    flushes, so it seems unlikely that it would be a
+	 *    performance win to do a partial flush if that won't bring
+	 *    our TLB fully up to date.
+	 */
+	if (f->end != TLB_FLUSH_ALL &&
+	    f->new_tlb_gen == local_tlb_gen + 1 &&
+	    f->new_tlb_gen == mm_tlb_gen) {
+		/* Partial flush */
 		unsigned long addr;
 		unsigned long nr_pages = (f->end - f->start) >> PAGE_SHIFT;
 		addr = f->start;
@@ -215,7 +271,16 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
 		if (local)
 			count_vm_tlb_events(NR_TLB_LOCAL_FLUSH_ONE, nr_pages);
 		trace_tlb_flush(reason, nr_pages);
+	} else {
+		/* Full flush. */
+		local_flush_tlb();
+		if (local)
+			count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
+		trace_tlb_flush(reason, TLB_FLUSH_ALL);
 	}
+
+	/* Both paths above update our state to mm_tlb_gen. */
+	this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen, mm_tlb_gen);
 }
 
 static void flush_tlb_func_local(void *info, enum tlb_flush_reason reason)
@@ -286,7 +351,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 	cpu = get_cpu();
 
 	/* This is also a barrier that synchronizes with switch_mm(). */
-	bump_mm_tlb_gen(mm);
+	info.new_tlb_gen = bump_mm_tlb_gen(mm);
 
 	/* Should we flush just the requested range? */
 	if ((end != TLB_FLUSH_ALL) &&
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [RFC 05/11] x86/mm: Rework lazy TLB mode and TLB freshness tracking
  2017-06-05 22:36 [RFC 00/11] PCID and improved laziness Andy Lutomirski
                   ` (3 preceding siblings ...)
  2017-06-05 22:36 ` [RFC 04/11] x86/mm: Track the TLB's tlb_gen and update the flushing algorithm Andy Lutomirski
@ 2017-06-05 22:36 ` Andy Lutomirski
  2017-06-06  1:39   ` Nadav Amit
  2017-06-06 19:11   ` Rik van Riel
  2017-06-05 22:36 ` [RFC 06/11] x86/mm: Stop calling leave_mm() in idle code Andy Lutomirski
                   ` (5 subsequent siblings)
  10 siblings, 2 replies; 31+ messages in thread
From: Andy Lutomirski @ 2017-06-05 22:36 UTC (permalink / raw)
  To: X86 ML
  Cc: Borislav Petkov, Linus Torvalds, Andrew Morton, Mel Gorman,
	linux-mm, Nadav Amit, Rik van Riel, Andy Lutomirski,
	Andrew Banman, Mike Travis, Dimitri Sivanich, Juergen Gross,
	Boris Ostrovsky

x86's lazy TLB mode used to be fairly weak -- it would switch to
init_mm the first time it tried to flush a lazy TLB.  This meant an
unnecessary CR3 write and, if the flush was remote, an unnecessary
IPI.

Rewrite it entirely.  When we enter lazy mode, we simply remove the
cpu from mm_cpumask.  This means that we need a way to figure out
whether we've missed a flush when we switch back out of lazy mode.
I use the tlb_gen machinery to track whether a context is up to
date.
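
For reviewers who don't want to untangle the reindentation in the diff,
the interesting new unlazy path (prev == next, but we had taken
ourselves out of mm_cpumask) boils down to roughly this, simplified
from the patch below:

	/* Resume remote flush IPIs ... */
	cpumask_set_cpu(cpu, mm_cpumask(next));
	next_tlb_gen = atomic64_read(&next->context.tlb_gen);

	/* ... and catch up if any flushes were missed while we were lazy. */
	if (this_cpu_read(cpu_tlbstate.ctxs[0].tlb_gen) < next_tlb_gen) {
		this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen, next_tlb_gen);
		write_cr3(__pa(next->pgd));
	}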

Note to reviewers: this patch, by itself, looks a bit odd.  I'm
using an array of length 1 containing (ctx_id, tlb_gen) rather than
just storing tlb_gen, and making it an array isn't necessary yet.
I'm doing this because the next few patches add PCID support, and,
with PCID, we need ctx_id, and the array will end up with a length
greater than 1.  Making it an array now means that there will be
less churn and therefore less stress on your eyeballs.

NB: This is dubious but, AFAICT, still correct on Xen and UV.
xen_exit_mmap() uses mm_cpumask() for nefarious purposes and this
patch changes the way that mm_cpumask() works.  This should be okay,
since Xen *also* iterates all online CPUs to find all the CPUs it
needs to twiddle.

The UV tlbflush code is rather dated and should be changed.

Cc: Andrew Banman <abanman@sgi.com>
Cc: Mike Travis <travis@sgi.com>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/mmu_context.h |   6 +-
 arch/x86/include/asm/tlbflush.h    |   4 -
 arch/x86/mm/init.c                 |   1 -
 arch/x86/mm/tlb.c                  | 225 ++++++++++++++++++-------------------
 4 files changed, 116 insertions(+), 120 deletions(-)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index e691b4d46b9d..da0cd502b4bd 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -122,8 +122,10 @@ static inline void switch_ldt(struct mm_struct *prev, struct mm_struct *next)
 
 static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
 {
-	if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK)
-		this_cpu_write(cpu_tlbstate.state, TLBSTATE_LAZY);
+	int cpu = smp_processor_id();
+
+	if (cpumask_test_cpu(cpu, mm_cpumask(mm)))
+		cpumask_clear_cpu(cpu, mm_cpumask(mm));
 }
 
 extern atomic64_t last_mm_ctx_id;
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 646787ff1a01..c68b7c9a7d77 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -95,7 +95,6 @@ struct tlb_state {
 	 * mode even if we've already switched back to swapper_pg_dir.
 	 */
 	struct mm_struct *loaded_mm;
-	int state;
 
 	/*
 	 * Access to this CR4 shadow and to H/W CR4 is protected by
@@ -310,9 +309,6 @@ static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
 void native_flush_tlb_others(const struct cpumask *cpumask,
 			     const struct flush_tlb_info *info);
 
-#define TLBSTATE_OK	1
-#define TLBSTATE_LAZY	2
-
 static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
 					struct mm_struct *mm)
 {
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 88ee942cb47d..7d6fa4676af9 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -812,7 +812,6 @@ void __init zone_sizes_init(void)
 
 DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate) = {
 	.loaded_mm = &init_mm,
-	.state = 0,
 	.cr4 = ~0UL,	/* fail hard if we screw up cr4 shadow initialization */
 };
 EXPORT_SYMBOL_GPL(cpu_tlbstate);
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 3b19ba748e92..95d71407247a 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -45,8 +45,8 @@ void leave_mm(int cpu)
 	if (loaded_mm == &init_mm)
 		return;
 
-	if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK)
-		BUG();
+	/* Warn if we're not lazy. */
+	WARN_ON(cpumask_test_cpu(smp_processor_id(), mm_cpumask(loaded_mm)));
 
 	switch_mm(NULL, &init_mm, NULL);
 }
@@ -67,133 +67,118 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 {
 	unsigned cpu = smp_processor_id();
 	struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm);
+	u64 next_tlb_gen;
 
 	/*
-	 * NB: The scheduler will call us with prev == next when
-	 * switching from lazy TLB mode to normal mode if active_mm
-	 * isn't changing.  When this happens, there is no guarantee
-	 * that CR3 (and hence cpu_tlbstate.loaded_mm) matches next.
+	 * NB: The scheduler will call us with prev == next when switching
+	 * from lazy TLB mode to normal mode if active_mm isn't changing.
+	 * When this happens, we don't assume that CR3 (and hence
+	 * cpu_tlbstate.loaded_mm) matches next.
 	 *
 	 * NB: leave_mm() calls us with prev == NULL and tsk == NULL.
 	 */
 
-	this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
+	/* We don't want flush_tlb_func_* to run concurrently with us. */
+	if (IS_ENABLED(CONFIG_PROVE_LOCKING))
+		WARN_ON_ONCE(!irqs_disabled());
+
+	VM_BUG_ON(read_cr3() != __pa(real_prev->pgd));
 
 	if (real_prev == next) {
-		/*
-		 * There's nothing to do: we always keep the per-mm control
-		 * regs in sync with cpu_tlbstate.loaded_mm.  Just
-		 * sanity-check mm_cpumask.
-		 */
-		if (WARN_ON_ONCE(!cpumask_test_cpu(cpu, mm_cpumask(next))))
-			cpumask_set_cpu(cpu, mm_cpumask(next));
-		return;
-	}
+		if (cpumask_test_cpu(cpu, mm_cpumask(next))) {
+			/*
+			 * There's nothing to do: we weren't lazy, and we
+			 * aren't changing our mm.  We don't need to flush
+			 * anything, nor do we need to update CR3, CR4, or
+			 * LDTR.
+			 */
+			return;
+		}
+
+		/* Resume remote flushes and then read tlb_gen. */
+		cpumask_set_cpu(cpu, mm_cpumask(next));
+		next_tlb_gen = atomic64_read(&next->context.tlb_gen);
+
+		VM_BUG_ON(this_cpu_read(cpu_tlbstate.ctxs[0].ctx_id) !=
+			  next->context.ctx_id);
+
+		if (this_cpu_read(cpu_tlbstate.ctxs[0].tlb_gen) <
+		    next_tlb_gen) {
+			/*
+			 * Ideally, we'd have a flush_tlb() variant that
+			 * takes the known CR3 value as input.  This would
+			 * be faster on Xen PV and on hypothetical CPUs
+			 * on which INVPCID is fast.
+			 */
+			this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen,
+				       next_tlb_gen);
+			write_cr3(__pa(next->pgd));
+			/*
+			 * This gets called via leave_mm() in the idle path
+			 * where RCU functions differently.  Tracing normally
+			 * uses RCU, so we have to call the tracepoint
+			 * specially here.
+			 */
+			trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH,
+						TLB_FLUSH_ALL);
+		}
 
-	if (IS_ENABLED(CONFIG_VMAP_STACK)) {
 		/*
-		 * If our current stack is in vmalloc space and isn't
-		 * mapped in the new pgd, we'll double-fault.  Forcibly
-		 * map it.
+		 * We just exited lazy mode, which means that CR4 and/or LDTR
+		 * may be stale.  (Changes to the required CR4 and LDTR states
+		 * are not reflected in tlb_gen.)
 		 */
-		unsigned int stack_pgd_index = pgd_index(current_stack_pointer());
+	} else {
+		if (IS_ENABLED(CONFIG_VMAP_STACK)) {
+			/*
+			 * If our current stack is in vmalloc space and isn't
+			 * mapped in the new pgd, we'll double-fault.  Forcibly
+			 * map it.
+			 */
+			unsigned int stack_pgd_index =
+				pgd_index(current_stack_pointer());
+
+			pgd_t *pgd = next->pgd + stack_pgd_index;
+
+			if (unlikely(pgd_none(*pgd)))
+				set_pgd(pgd, init_mm.pgd[stack_pgd_index]);
+		}
 
-		pgd_t *pgd = next->pgd + stack_pgd_index;
+		/* Stop remote flushes for the previous mm */
+		if (cpumask_test_cpu(cpu, mm_cpumask(real_prev)))
+			cpumask_clear_cpu(cpu, mm_cpumask(real_prev));
 
-		if (unlikely(pgd_none(*pgd)))
-			set_pgd(pgd, init_mm.pgd[stack_pgd_index]);
-	}
+		WARN_ON_ONCE(cpumask_test_cpu(cpu, mm_cpumask(next)));
 
-	this_cpu_write(cpu_tlbstate.loaded_mm, next);
-	this_cpu_write(cpu_tlbstate.ctxs[0].ctx_id, next->context.ctx_id);
-	this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen,
-		       atomic64_read(&next->context.tlb_gen));
+		/*
+		 * Start remote flushes and then read tlb_gen.
+		 */
+		cpumask_set_cpu(cpu, mm_cpumask(next));
+		next_tlb_gen = atomic64_read(&next->context.tlb_gen);
 
-	WARN_ON_ONCE(cpumask_test_cpu(cpu, mm_cpumask(next)));
-	cpumask_set_cpu(cpu, mm_cpumask(next));
+		VM_BUG_ON(this_cpu_read(cpu_tlbstate.ctxs[0].ctx_id) ==
+			  next->context.ctx_id);
 
-	/*
-	 * Re-load page tables.
-	 *
-	 * This logic has an ordering constraint:
-	 *
-	 *  CPU 0: Write to a PTE for 'next'
-	 *  CPU 0: load bit 1 in mm_cpumask.  if nonzero, send IPI.
-	 *  CPU 1: set bit 1 in next's mm_cpumask
-	 *  CPU 1: load from the PTE that CPU 0 writes (implicit)
-	 *
-	 * We need to prevent an outcome in which CPU 1 observes
-	 * the new PTE value and CPU 0 observes bit 1 clear in
-	 * mm_cpumask.  (If that occurs, then the IPI will never
-	 * be sent, and CPU 0's TLB will contain a stale entry.)
-	 *
-	 * The bad outcome can occur if either CPU's load is
-	 * reordered before that CPU's store, so both CPUs must
-	 * execute full barriers to prevent this from happening.
-	 *
-	 * Thus, switch_mm needs a full barrier between the
-	 * store to mm_cpumask and any operation that could load
-	 * from next->pgd.  TLB fills are special and can happen
-	 * due to instruction fetches or for no reason at all,
-	 * and neither LOCK nor MFENCE orders them.
-	 * Fortunately, load_cr3() is serializing and gives the
-	 * ordering guarantee we need.
-	 */
-	load_cr3(next->pgd);
-
-	/*
-	 * This gets called via leave_mm() in the idle path where RCU
-	 * functions differently.  Tracing normally uses RCU, so we have to
-	 * call the tracepoint specially here.
-	 */
-	trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
+		this_cpu_write(cpu_tlbstate.ctxs[0].ctx_id,
+			       next->context.ctx_id);
+		this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen,
+			       next_tlb_gen);
+		this_cpu_write(cpu_tlbstate.loaded_mm, next);
+		write_cr3(__pa(next->pgd));
 
-	/* Stop flush ipis for the previous mm */
-	WARN_ON_ONCE(!cpumask_test_cpu(cpu, mm_cpumask(real_prev)) &&
-		     real_prev != &init_mm);
-	cpumask_clear_cpu(cpu, mm_cpumask(real_prev));
+		/*
+		 * This gets called via leave_mm() in the idle path where RCU
+		 * functions differently.  Tracing normally uses RCU, so we
+		 * have to call the tracepoint specially here.
+		 */
+		trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH,
+					TLB_FLUSH_ALL);
+	}
 
-	/* Load per-mm CR4 and LDTR state */
 	load_mm_cr4(next);
 	switch_ldt(real_prev, next);
 }
 
-/*
- * The flush IPI assumes that a thread switch happens in this order:
- * [cpu0: the cpu that switches]
- * 1) switch_mm() either 1a) or 1b)
- * 1a) thread switch to a different mm
- * 1a1) set cpu_tlbstate to TLBSTATE_OK
- *	Now the tlb flush NMI handler flush_tlb_func won't call leave_mm
- *	if cpu0 was in lazy tlb mode.
- * 1a2) update cpu active_mm
- *	Now cpu0 accepts tlb flushes for the new mm.
- * 1a3) cpu_set(cpu, new_mm->cpu_vm_mask);
- *	Now the other cpus will send tlb flush ipis.
- * 1a4) change cr3.
- * 1a5) cpu_clear(cpu, old_mm->cpu_vm_mask);
- *	Stop ipi delivery for the old mm. This is not synchronized with
- *	the other cpus, but flush_tlb_func ignore flush ipis for the wrong
- *	mm, and in the worst case we perform a superfluous tlb flush.
- * 1b) thread switch without mm change
- *	cpu active_mm is correct, cpu0 already handles flush ipis.
- * 1b1) set cpu_tlbstate to TLBSTATE_OK
- * 1b2) test_and_set the cpu bit in cpu_vm_mask.
- *	Atomically set the bit [other cpus will start sending flush ipis],
- *	and test the bit.
- * 1b3) if the bit was 0: leave_mm was called, flush the tlb.
- * 2) switch %%esp, ie current
- *
- * The interrupt must handle 2 special cases:
- * - cr3 is changed before %%esp, ie. it cannot use current->{active_,}mm.
- * - the cpu performs speculative tlb reads, i.e. even if the cpu only
- *   runs in kernel space, the cpu could load tlb entries for user space
- *   pages.
- *
- * The good news is that cpu_tlbstate is local to each cpu, no
- * write/read ordering problems.
- */
-
 static void flush_tlb_func_common(const struct flush_tlb_info *f,
 				  bool local, enum tlb_flush_reason reason)
 {
@@ -212,12 +197,13 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
 	VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[0].ctx_id) !=
 		   loaded_mm->context.ctx_id);
 
-	if (this_cpu_read(cpu_tlbstate.state) != TLBSTATE_OK) {
+	if (!cpumask_test_cpu(smp_processor_id(), mm_cpumask(loaded_mm))) {
 		/*
-		 * leave_mm() is adequate to handle any type of flush, and
-		 * we would prefer not to receive further IPIs.
+		 * We're in lazy mode -- don't flush.  We can get here on
+		 * remote flushes due to races and on local flushes if a
+		 * kernel thread coincidentally flushes the mm it's lazily
+		 * still using.
 		 */
-		leave_mm(smp_processor_id());
 		return;
 	}
 
@@ -314,6 +300,21 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 				(info->end - info->start) >> PAGE_SHIFT);
 
 	if (is_uv_system()) {
+		/*
+		 * This whole special case is confused.  UV has a "Broadcast
+		 * Assist Unit", which seems to be a fancy way to send IPIs.
+		 * Back when x86 used an explicit TLB flush IPI, UV was
+		 * optimized to use its own mechanism.  These days, x86 uses
+		 * smp_call_function_many(), but UV still uses a manual IPI,
+		 * and that IPI's action is out of date -- it does a manual
+		 * flush instead of calling flush_tlb_func_remote().  This
+		 * means that the percpu tlb_gen variables won't be updated
+		 * and we'll do pointless flushes on future context switches.
+		 *
+		 * Rather than hooking native_flush_tlb_others() here, I think
+		 * that UV should be updated so that smp_call_function_many(),
+		 * etc, are optimal on UV.
+		 */
 		unsigned int cpu;
 
 		cpu = smp_processor_id();
@@ -376,8 +377,6 @@ static void do_flush_tlb_all(void *info)
 {
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
 	__flush_tlb_all();
-	if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_LAZY)
-		leave_mm(smp_processor_id());
 }
 
 void flush_tlb_all(void)
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [RFC 06/11] x86/mm: Stop calling leave_mm() in idle code
  2017-06-05 22:36 [RFC 00/11] PCID and improved laziness Andy Lutomirski
                   ` (4 preceding siblings ...)
  2017-06-05 22:36 ` [RFC 05/11] x86/mm: Rework lazy TLB mode and TLB freshness tracking Andy Lutomirski
@ 2017-06-05 22:36 ` Andy Lutomirski
  2017-06-05 22:36 ` [RFC 07/11] x86/mm: Disable PCID on 32-bit kernels Andy Lutomirski
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 31+ messages in thread
From: Andy Lutomirski @ 2017-06-05 22:36 UTC (permalink / raw)
  To: X86 ML
  Cc: Borislav Petkov, Linus Torvalds, Andrew Morton, Mel Gorman,
	linux-mm, Nadav Amit, Rik van Riel, Andy Lutomirski

Now that lazy TLB suppresses all flush IPIs (as opposed to all but
the first), there's no need to leave_mm() when going idle.

This means we can get rid of the rcuidle hack in
switch_mm_irqs_off() and we can unexport leave_mm().

This also removes acpi_unlazy_tlb() from the x86 and ia64 headers,
since it has no callers any more.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/ia64/include/asm/acpi.h  |  2 --
 arch/x86/include/asm/acpi.h   |  2 --
 arch/x86/mm/tlb.c             | 19 +++----------------
 drivers/acpi/processor_idle.c |  2 --
 drivers/idle/intel_idle.c     |  8 ++++----
 5 files changed, 7 insertions(+), 26 deletions(-)

diff --git a/arch/ia64/include/asm/acpi.h b/arch/ia64/include/asm/acpi.h
index a3d0211970e9..c86a947f5368 100644
--- a/arch/ia64/include/asm/acpi.h
+++ b/arch/ia64/include/asm/acpi.h
@@ -112,8 +112,6 @@ static inline void arch_acpi_set_pdc_bits(u32 *buf)
 	buf[2] |= ACPI_PDC_EST_CAPABILITY_SMP;
 }
 
-#define acpi_unlazy_tlb(x)
-
 #ifdef CONFIG_ACPI_NUMA
 extern cpumask_t early_cpu_possible_map;
 #define for_each_possible_early_cpu(cpu)  \
diff --git a/arch/x86/include/asm/acpi.h b/arch/x86/include/asm/acpi.h
index 2efc768e4362..562286fa151f 100644
--- a/arch/x86/include/asm/acpi.h
+++ b/arch/x86/include/asm/acpi.h
@@ -150,8 +150,6 @@ static inline void disable_acpi(void) { }
 extern int x86_acpi_numa_init(void);
 #endif /* CONFIG_ACPI_NUMA */
 
-#define acpi_unlazy_tlb(x)	leave_mm(x)
-
 #ifdef CONFIG_ACPI_APEI
 static inline pgprot_t arch_apei_get_mem_attribute(phys_addr_t addr)
 {
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 95d71407247a..09775cf5cb1b 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -50,7 +50,6 @@ void leave_mm(int cpu)
 
 	switch_mm(NULL, &init_mm, NULL);
 }
-EXPORT_SYMBOL_GPL(leave_mm);
 
 void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 	       struct task_struct *tsk)
@@ -113,14 +112,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 			this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen,
 				       next_tlb_gen);
 			write_cr3(__pa(next->pgd));
-			/*
-			 * This gets called via leave_mm() in the idle path
-			 * where RCU functions differently.  Tracing normally
-			 * uses RCU, so we have to call the tracepoint
-			 * specially here.
-			 */
-			trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH,
-						TLB_FLUSH_ALL);
+			trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH,
+					TLB_FLUSH_ALL);
 		}
 
 		/*
@@ -166,13 +159,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		this_cpu_write(cpu_tlbstate.loaded_mm, next);
 		write_cr3(__pa(next->pgd));
 
-		/*
-		 * This gets called via leave_mm() in the idle path where RCU
-		 * functions differently.  Tracing normally uses RCU, so we
-		 * have to call the tracepoint specially here.
-		 */
-		trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH,
-					TLB_FLUSH_ALL);
+		trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
 	}
 
 	load_mm_cr4(next);
diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
index 5c8aa9cf62d7..fe3d2a40f311 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -708,8 +708,6 @@ static DEFINE_RAW_SPINLOCK(c3_lock);
 static void acpi_idle_enter_bm(struct acpi_processor *pr,
 			       struct acpi_processor_cx *cx, bool timer_bc)
 {
-	acpi_unlazy_tlb(smp_processor_id());
-
 	/*
 	 * Must be done before busmaster disable as we might need to
 	 * access HPET !
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index 216d7ec88c0c..596b57311de6 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -917,11 +917,11 @@ static __cpuidle int intel_idle(struct cpuidle_device *dev,
 	cstate = (((eax) >> MWAIT_SUBSTATE_SIZE) & MWAIT_CSTATE_MASK) + 1;
 
 	/*
-	 * leave_mm() to avoid costly and often unnecessary wakeups
-	 * for flushing the user TLB's associated with the active mm.
+	 * NB: if CPUIDLE_FLAG_TLB_FLUSHED is set, this idle transition
+	 * will probably flush the TLB.  It's not guaranteed to flush
+	 * the TLB, though, so it's not clear that we can do anything
+	 * useful with this knowledge.
 	 */
-	if (state->flags & CPUIDLE_FLAG_TLB_FLUSHED)
-		leave_mm(cpu);
 
 	if (!(lapic_timer_reliable_states & (1 << (cstate))))
 		tick_broadcast_enter();
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [RFC 07/11] x86/mm: Disable PCID on 32-bit kernels
  2017-06-05 22:36 [RFC 00/11] PCID and improved laziness Andy Lutomirski
                   ` (5 preceding siblings ...)
  2017-06-05 22:36 ` [RFC 06/11] x86/mm: Stop calling leave_mm() in idle code Andy Lutomirski
@ 2017-06-05 22:36 ` Andy Lutomirski
  2017-06-05 22:36 ` [RFC 08/11] x86/mm: Add nopcid to turn off PCID Andy Lutomirski
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 31+ messages in thread
From: Andy Lutomirski @ 2017-06-05 22:36 UTC (permalink / raw)
  To: X86 ML
  Cc: Borislav Petkov, Linus Torvalds, Andrew Morton, Mel Gorman,
	linux-mm, Nadav Amit, Rik van Riel, Andy Lutomirski

32-bit kernels on new hardware will see PCID in CPUID, but PCID can
only be used in 64-bit mode.  Rather than making all PCID code
conditional, just disable the feature on 32-bit builds.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/disabled-features.h | 4 +++-
 arch/x86/kernel/cpu/bugs.c               | 8 ++++++++
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index 5dff775af7cd..c10c9128f54e 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -21,11 +21,13 @@
 # define DISABLE_K6_MTRR	(1<<(X86_FEATURE_K6_MTRR & 31))
 # define DISABLE_CYRIX_ARR	(1<<(X86_FEATURE_CYRIX_ARR & 31))
 # define DISABLE_CENTAUR_MCR	(1<<(X86_FEATURE_CENTAUR_MCR & 31))
+# define DISABLE_PCID		0
 #else
 # define DISABLE_VME		0
 # define DISABLE_K6_MTRR	0
 # define DISABLE_CYRIX_ARR	0
 # define DISABLE_CENTAUR_MCR	0
+# define DISABLE_PCID		(1<<(X86_FEATURE_PCID & 31))
 #endif /* CONFIG_X86_64 */
 
 #ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
@@ -49,7 +51,7 @@
 #define DISABLED_MASK1	0
 #define DISABLED_MASK2	0
 #define DISABLED_MASK3	(DISABLE_CYRIX_ARR|DISABLE_CENTAUR_MCR|DISABLE_K6_MTRR)
-#define DISABLED_MASK4	0
+#define DISABLED_MASK4	(DISABLE_PCID)
 #define DISABLED_MASK5	0
 #define DISABLED_MASK6	0
 #define DISABLED_MASK7	0
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 0af86d9242da..db684880d74a 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -21,6 +21,14 @@
 
 void __init check_bugs(void)
 {
+#ifdef CONFIG_X86_32
+	/*
+	 * Regardless of whether PCID is enumerated, the SDM says
+	 * that it can't be enabled in 32-bit mode.
+	 */
+	setup_clear_cpu_cap(X86_FEATURE_PCID);
+#endif
+
 	identify_boot_cpu();
 
 	if (!IS_ENABLED(CONFIG_SMP)) {
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [RFC 08/11] x86/mm: Add nopcid to turn off PCID
  2017-06-05 22:36 [RFC 00/11] PCID and improved laziness Andy Lutomirski
                   ` (6 preceding siblings ...)
  2017-06-05 22:36 ` [RFC 07/11] x86/mm: Disable PCID on 32-bit kernels Andy Lutomirski
@ 2017-06-05 22:36 ` Andy Lutomirski
  2017-06-06  3:22   ` Andi Kleen
  2017-06-05 22:36 ` [RFC 09/11] x86/mm: Teach CR3 readers about PCID Andy Lutomirski
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 31+ messages in thread
From: Andy Lutomirski @ 2017-06-05 22:36 UTC (permalink / raw)
  To: X86 ML
  Cc: Borislav Petkov, Linus Torvalds, Andrew Morton, Mel Gorman,
	linux-mm, Nadav Amit, Rik van Riel, Andy Lutomirski

The parameter is only present on x86_64 systems to save a few bytes,
as PCID is always disabled on x86_32.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 Documentation/admin-guide/kernel-parameters.txt |  2 ++
 arch/x86/kernel/cpu/common.c                    | 18 ++++++++++++++++++
 2 files changed, 20 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 15f79c27748d..9e2ec142dc7e 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2639,6 +2639,8 @@
 	nopat		[X86] Disable PAT (page attribute table extension of
 			pagetables) support.
 
+	nopcid		[X86-64] Disable the PCID cpu feature.
+
 	norandmaps	Don't use address space randomization.  Equivalent to
 			echo 0 > /proc/sys/kernel/randomize_va_space
 
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index c8b39870f33e..904485e7b230 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -168,6 +168,24 @@ static int __init x86_mpx_setup(char *s)
 }
 __setup("nompx", x86_mpx_setup);
 
+#ifdef CONFIG_X86_64
+static int __init x86_pcid_setup(char *s)
+{
+	/* require an exact match without trailing characters */
+	if (strlen(s))
+		return 0;
+
+	/* do not emit a message if the feature is not present */
+	if (!boot_cpu_has(X86_FEATURE_PCID))
+		return 1;
+
+	setup_clear_cpu_cap(X86_FEATURE_PCID);
+	pr_info("nopcid: PCID feature disabled\n");
+	return 1;
+}
+__setup("nopcid", x86_pcid_setup);
+#endif
+
 static int __init x86_noinvpcid_setup(char *s)
 {
 	/* noinvpcid doesn't accept parameters */
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [RFC 09/11] x86/mm: Teach CR3 readers about PCID
  2017-06-05 22:36 [RFC 00/11] PCID and improved laziness Andy Lutomirski
                   ` (7 preceding siblings ...)
  2017-06-05 22:36 ` [RFC 08/11] x86/mm: Add nopcid to turn off PCID Andy Lutomirski
@ 2017-06-05 22:36 ` Andy Lutomirski
  2017-06-05 22:36 ` [RFC 10/11] x86/mm: Enable CR4.PCIDE on supported systems Andy Lutomirski
  2017-06-05 22:36 ` [RFC 11/11] x86/mm: Try to preserve old TLB entries using PCID Andy Lutomirski
  10 siblings, 0 replies; 31+ messages in thread
From: Andy Lutomirski @ 2017-06-05 22:36 UTC (permalink / raw)
  To: X86 ML
  Cc: Borislav Petkov, Linus Torvalds, Andrew Morton, Mel Gorman,
	linux-mm, Nadav Amit, Rik van Riel, Andy Lutomirski

The kernel has several code paths that read CR3.  Most of them assume that
CR3 contains the PGD's physical address, whereas some of them awkwardly
use PHYSICAL_PAGE_MASK to mask off low bits.

Add explicit mask macros for CR3 and convert all of the CR3 readers.
This will keep them from breaking when PCID is enabled.
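
As a concrete example of what the new masks mean on x86_64 (made-up
CR3 value, purely for illustration):

/*
 * With PCID in use, a CR3 value such as 0x0000000123456014 splits as:
 *
 *   cr3 & CR3_ADDR_MASK == 0x0000000123456000  (PGD physical address)
 *   cr3 & CR3_PCID_MASK == 0x014               (the PCID)
 *
 * read_cr3_addr(), added below, returns the former, which is what
 * nearly all of the existing read_cr3() callers actually wanted.
 */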

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/boot/compressed/pagetable.c   |  2 +-
 arch/x86/include/asm/efi.h             |  2 +-
 arch/x86/include/asm/mmu_context.h     |  4 ++--
 arch/x86/include/asm/paravirt.h        |  2 +-
 arch/x86/include/asm/processor-flags.h | 30 ++++++++++++++++++++++++++++++
 arch/x86/include/asm/processor.h       |  8 ++++++++
 arch/x86/include/asm/special_insns.h   | 10 +++++++---
 arch/x86/include/asm/tlbflush.h        |  2 +-
 arch/x86/kernel/head64.c               |  3 ++-
 arch/x86/kernel/paravirt.c             |  2 +-
 arch/x86/kernel/process_32.c           |  2 +-
 arch/x86/kernel/process_64.c           |  2 +-
 arch/x86/kvm/vmx.c                     |  2 +-
 arch/x86/mm/fault.c                    | 10 +++++-----
 arch/x86/mm/ioremap.c                  |  2 +-
 arch/x86/mm/tlb.c                      |  2 +-
 arch/x86/platform/efi/efi_64.c         |  4 ++--
 arch/x86/platform/olpc/olpc-xo1-pm.c   |  2 +-
 arch/x86/power/cpu.c                   |  2 +-
 arch/x86/power/hibernate_64.c          |  3 ++-
 arch/x86/xen/mmu_pv.c                  |  6 +++---
 21 files changed, 73 insertions(+), 29 deletions(-)

diff --git a/arch/x86/boot/compressed/pagetable.c b/arch/x86/boot/compressed/pagetable.c
index 1d78f1739087..16e8320f8658 100644
--- a/arch/x86/boot/compressed/pagetable.c
+++ b/arch/x86/boot/compressed/pagetable.c
@@ -92,7 +92,7 @@ void initialize_identity_maps(void)
 	 * and we must append to the existing area instead of entirely
 	 * overwriting it.
 	 */
-	level4p = read_cr3();
+	level4p = read_cr3_addr();
 	if (level4p == (unsigned long)_pgtable) {
 		debug_putstr("booted via startup_32()\n");
 		pgt_data.pgt_buf = _pgtable + BOOT_INIT_PGT_SIZE;
diff --git a/arch/x86/include/asm/efi.h b/arch/x86/include/asm/efi.h
index 2f77bcefe6b4..d2ff779f347e 100644
--- a/arch/x86/include/asm/efi.h
+++ b/arch/x86/include/asm/efi.h
@@ -74,7 +74,7 @@ struct efi_scratch {
 	__kernel_fpu_begin();						\
 									\
 	if (efi_scratch.use_pgd) {					\
-		efi_scratch.prev_cr3 = read_cr3();			\
+		efi_scratch.prev_cr3 = __read_cr3();			\
 		write_cr3((unsigned long)efi_scratch.efi_pgt);		\
 		__flush_tlb_all();					\
 	}								\
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index da0cd502b4bd..793cbe858ebf 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -299,7 +299,7 @@ static inline bool arch_vma_access_permitted(struct vm_area_struct *vma,
 
 /*
  * This can be used from process context to figure out what the value of
- * CR3 is without needing to do a (slow) read_cr3().
+ * CR3 is without needing to do a (slow) __read_cr3().
  *
  * It's intended to be used for code like KVM that sneakily changes CR3
  * and needs to restore it.  It needs to be used very carefully.
@@ -311,7 +311,7 @@ static inline unsigned long __get_current_cr3_fast(void)
 	/* For now, be very restrictive about when this can be called. */
 	VM_WARN_ON(in_nmi() || !in_atomic());
 
-	VM_BUG_ON(cr3 != read_cr3());
+	VM_BUG_ON(cr3 != __read_cr3());
 	return cr3;
 }
 
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 9a15739d9f4b..a63e77f8eb41 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -61,7 +61,7 @@ static inline void write_cr2(unsigned long x)
 	PVOP_VCALL1(pv_mmu_ops.write_cr2, x);
 }
 
-static inline unsigned long read_cr3(void)
+static inline unsigned long __read_cr3(void)
 {
 	return PVOP_CALL0(unsigned long, pv_mmu_ops.read_cr3);
 }
diff --git a/arch/x86/include/asm/processor-flags.h b/arch/x86/include/asm/processor-flags.h
index 39fb618e2211..ce25ac7945c4 100644
--- a/arch/x86/include/asm/processor-flags.h
+++ b/arch/x86/include/asm/processor-flags.h
@@ -8,4 +8,34 @@
 #else
 #define X86_VM_MASK	0 /* No VM86 support */
 #endif
+
+/*
+ * CR3 field masks.  On 32-bit systems, bits 31:12 of CR3 give the
+ * physical page frame number of the top-level page table.  (Yes, this
+ * means that the page directory pointer table on PAE needs to live
+ * below 4 GB.)  On 64-bit systems, bits MAXPHYADDR:12 are the PGD page
+ * frame number, bits 62:MAXPHYADDR are reserved (and will presumably be
+ * read as zero forever unless the OS sets some new feature flag), and
+ * bit 63 is read as zero.
+ *
+ * The upshot is that masking off the low 12 bits gives the physical
+ * address of the top-level paging structure on all x86 systems.
+ *
+ * If PCID is enabled, writing 1 to bit 63 suppresses the normal TLB
+ * flush implied by a CR3 write but does *not* set bit 63 of CR3.
+ *
+ * If PCID is enabled, the low 12 bits are the process context ID.  If
+ * PCID is disabled, the low 12 bits are actively counterproductive to
+ * use, and Linux will always set them to zero.  PCID cannot be enabled
+ * on x86_32, so, to save some code size, we fudge the masks so that CR3
+ * reads can skip masking off the known-zero bits on x86_32.
+ */
+#ifdef CONFIG_X86_64
+#define CR3_ADDR_MASK 0x7FFFFFFFFFFFF000ull
+#define CR3_PCID_MASK 0xFFFull
+#else
+#define CR3_ADDR_MASK 0xFFFFFFFFull
+#define CR3_PCID_MASK 0ull
+#endif
+
 #endif /* _ASM_X86_PROCESSOR_FLAGS_H */
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 3cada998a402..f9142a1fb0d3 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -231,6 +231,14 @@ native_cpuid_reg(ebx)
 native_cpuid_reg(ecx)
 native_cpuid_reg(edx)
 
+/*
+ * Friendlier CR3 helpers.
+ */
+static inline unsigned long read_cr3_addr(void)
+{
+	return __read_cr3() & CR3_ADDR_MASK;
+}
+
 static inline void load_cr3(pgd_t *pgdir)
 {
 	write_cr3(__pa(pgdir));
diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h
index 12af3e35edfa..b3af02c7fa52 100644
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -39,7 +39,7 @@ static inline void native_write_cr2(unsigned long val)
 	asm volatile("mov %0,%%cr2": : "r" (val), "m" (__force_order));
 }
 
-static inline unsigned long native_read_cr3(void)
+static inline unsigned long __native_read_cr3(void)
 {
 	unsigned long val;
 	asm volatile("mov %%cr3,%0\n\t" : "=r" (val), "=m" (__force_order));
@@ -159,9 +159,13 @@ static inline void write_cr2(unsigned long x)
 	native_write_cr2(x);
 }
 
-static inline unsigned long read_cr3(void)
+/*
+ * Careful!  CR3 contains more than just an address.  You probably want
+ * read_cr3_addr() instead.
+ */
+static inline unsigned long __read_cr3(void)
 {
-	return native_read_cr3();
+	return __native_read_cr3();
 }
 
 static inline void write_cr3(unsigned long x)
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index c68b7c9a7d77..87b13e51e867 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -192,7 +192,7 @@ static inline void __native_flush_tlb(void)
 	 * back:
 	 */
 	preempt_disable();
-	native_write_cr3(native_read_cr3());
+	native_write_cr3(__native_read_cr3());
 	preempt_enable();
 }
 
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 43b7002f44fb..75fa59b22837 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -55,7 +55,8 @@ int __init early_make_pgtable(unsigned long address)
 	pmdval_t pmd, *pmd_p;
 
 	/* Invalid address or early pgt is done ?  */
-	if (physaddr >= MAXMEM || read_cr3() != __pa_nodebug(early_level4_pgt))
+	if (physaddr >= MAXMEM ||
+	    read_cr3_addr() != __pa_nodebug(early_level4_pgt))
 		return -1;
 
 again:
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 3586996fc50d..bc0a849589bb 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -391,7 +391,7 @@ struct pv_mmu_ops pv_mmu_ops __ro_after_init = {
 
 	.read_cr2 = native_read_cr2,
 	.write_cr2 = native_write_cr2,
-	.read_cr3 = native_read_cr3,
+	.read_cr3 = __native_read_cr3,
 	.write_cr3 = native_write_cr3,
 
 	.flush_tlb_user = native_flush_tlb,
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index ffeae818aa7a..c6d6dc5f8bb2 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -92,7 +92,7 @@ void __show_regs(struct pt_regs *regs, int all)
 
 	cr0 = read_cr0();
 	cr2 = read_cr2();
-	cr3 = read_cr3();
+	cr3 = __read_cr3();
 	cr4 = __read_cr4();
 	printk(KERN_DEFAULT "CR0: %08lx CR2: %08lx CR3: %08lx CR4: %08lx\n",
 			cr0, cr2, cr3, cr4);
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index b6840bf3940b..75e235a91e3c 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -104,7 +104,7 @@ void __show_regs(struct pt_regs *regs, int all)
 
 	cr0 = read_cr0();
 	cr2 = read_cr2();
-	cr3 = read_cr3();
+	cr3 = __read_cr3();
 	cr4 = __read_cr4();
 
 	printk(KERN_DEFAULT "FS:  %016lx(%04x) GS:%016lx(%04x) knlGS:%016lx\n",
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 19cde555d73f..d143dd397dc9 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -5024,7 +5024,7 @@ static void vmx_set_constant_host_state(struct vcpu_vmx *vmx)
 	 * Save the most likely value for this task's CR3 in the VMCS.
 	 * We can't use __get_current_cr3_fast() because we're not atomic.
 	 */
-	cr3 = read_cr3();
+	cr3 = __read_cr3();
 	vmcs_writel(HOST_CR3, cr3);		/* 22.2.3  FIXME: shadow tables */
 	vmx->host_state.vmcs_host_cr3 = cr3;
 
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 8ad91a01cbc8..6fc2dfa28124 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -346,7 +346,7 @@ static noinline int vmalloc_fault(unsigned long address)
 	 * Do _not_ use "current" here. We might be inside
 	 * an interrupt in the middle of a task switch..
 	 */
-	pgd_paddr = read_cr3();
+	pgd_paddr = read_cr3_addr();
 	pmd_k = vmalloc_sync_one(__va(pgd_paddr), address);
 	if (!pmd_k)
 		return -1;
@@ -388,7 +388,7 @@ static bool low_pfn(unsigned long pfn)
 
 static void dump_pagetable(unsigned long address)
 {
-	pgd_t *base = __va(read_cr3());
+	pgd_t *base = __va(read_cr3_addr());
 	pgd_t *pgd = &base[pgd_index(address)];
 	p4d_t *p4d;
 	pud_t *pud;
@@ -451,7 +451,7 @@ static noinline int vmalloc_fault(unsigned long address)
 	 * happen within a race in page table update. In the later
 	 * case just flush:
 	 */
-	pgd = (pgd_t *)__va(read_cr3()) + pgd_index(address);
+	pgd = (pgd_t *)__va(read_cr3_addr()) + pgd_index(address);
 	pgd_ref = pgd_offset_k(address);
 	if (pgd_none(*pgd_ref))
 		return -1;
@@ -555,7 +555,7 @@ static int bad_address(void *p)
 
 static void dump_pagetable(unsigned long address)
 {
-	pgd_t *base = __va(read_cr3() & PHYSICAL_PAGE_MASK);
+	pgd_t *base = __va(read_cr3_addr());
 	pgd_t *pgd = base + pgd_index(address);
 	p4d_t *p4d;
 	pud_t *pud;
@@ -700,7 +700,7 @@ show_fault_oops(struct pt_regs *regs, unsigned long error_code,
 		pgd_t *pgd;
 		pte_t *pte;
 
-		pgd = __va(read_cr3() & PHYSICAL_PAGE_MASK);
+		pgd = __va(read_cr3_addr());
 		pgd += pgd_index(address);
 
 		pte = lookup_address_in_pgd(pgd, address, &level);
diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index bbc558b88a88..b21e00712dd7 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -424,7 +424,7 @@ static pte_t bm_pte[PAGE_SIZE/sizeof(pte_t)] __page_aligned_bss;
 static inline pmd_t * __init early_ioremap_pmd(unsigned long addr)
 {
 	/* Don't assume we're using swapper_pg_dir at this point */
-	pgd_t *base = __va(read_cr3());
+	pgd_t *base = __va(read_cr3_addr());
 	pgd_t *pgd = &base[pgd_index(addr)];
 	p4d_t *p4d = p4d_offset(pgd, addr);
 	pud_t *pud = pud_offset(p4d, addr);
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 09775cf5cb1b..3773ba72cf2d 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -81,7 +81,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 	if (IS_ENABLED(CONFIG_PROVE_LOCKING))
 		WARN_ON_ONCE(!irqs_disabled());
 
-	VM_BUG_ON(read_cr3() != __pa(real_prev->pgd));
+	VM_BUG_ON(read_cr3_addr() != __pa(real_prev->pgd));
 
 	if (real_prev == next) {
 		if (cpumask_test_cpu(cpu, mm_cpumask(next))) {
diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
index eb8dff15a7f6..f40bf6230480 100644
--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -80,7 +80,7 @@ pgd_t * __init efi_call_phys_prolog(void)
 	int n_pgds, i, j;
 
 	if (!efi_enabled(EFI_OLD_MEMMAP)) {
-		save_pgd = (pgd_t *)read_cr3();
+		save_pgd = (pgd_t *)__read_cr3();
 		write_cr3((unsigned long)efi_scratch.efi_pgt);
 		goto out;
 	}
@@ -646,7 +646,7 @@ efi_status_t efi_thunk_set_virtual_address_map(
 	efi_sync_low_kernel_mappings();
 	local_irq_save(flags);
 
-	efi_scratch.prev_cr3 = read_cr3();
+	efi_scratch.prev_cr3 = __read_cr3();
 	write_cr3((unsigned long)efi_scratch.efi_pgt);
 	__flush_tlb_all();
 
diff --git a/arch/x86/platform/olpc/olpc-xo1-pm.c b/arch/x86/platform/olpc/olpc-xo1-pm.c
index c5350fd27d70..1ad9932ded7c 100644
--- a/arch/x86/platform/olpc/olpc-xo1-pm.c
+++ b/arch/x86/platform/olpc/olpc-xo1-pm.c
@@ -77,7 +77,7 @@ static int xo1_power_state_enter(suspend_state_t pm_state)
 
 asmlinkage __visible int xo1_do_sleep(u8 sleep_state)
 {
-	void *pgd_addr = __va(read_cr3());
+	void *pgd_addr = __va(read_cr3_addr());
 
 	/* Program wakeup mask (using dword access to CS5536_PM1_EN) */
 	outl(wakeup_mask << 16, acpi_base + CS5536_PM1_STS);
diff --git a/arch/x86/power/cpu.c b/arch/x86/power/cpu.c
index 6b05a9219ea2..78459a6d455a 100644
--- a/arch/x86/power/cpu.c
+++ b/arch/x86/power/cpu.c
@@ -129,7 +129,7 @@ static void __save_processor_state(struct saved_context *ctxt)
 	 */
 	ctxt->cr0 = read_cr0();
 	ctxt->cr2 = read_cr2();
-	ctxt->cr3 = read_cr3();
+	ctxt->cr3 = __read_cr3();
 	ctxt->cr4 = __read_cr4();
 #ifdef CONFIG_X86_64
 	ctxt->cr8 = read_cr8();
diff --git a/arch/x86/power/hibernate_64.c b/arch/x86/power/hibernate_64.c
index a6e21fee22ea..98a17db2b214 100644
--- a/arch/x86/power/hibernate_64.c
+++ b/arch/x86/power/hibernate_64.c
@@ -150,7 +150,8 @@ static int relocate_restore_code(void)
 	memcpy((void *)relocated_restore_code, &core_restore_code, PAGE_SIZE);
 
 	/* Make the page containing the relocated code executable */
-	pgd = (pgd_t *)__va(read_cr3()) + pgd_index(relocated_restore_code);
+	pgd = (pgd_t *)__va(read_cr3_addr()) +
+		pgd_index(relocated_restore_code);
 	p4d = p4d_offset(pgd, relocated_restore_code);
 	if (p4d_large(*p4d)) {
 		set_p4d(p4d, __p4d(p4d_val(*p4d) & ~_PAGE_NX));
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index 21beb37114b7..73e8595621c1 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -2017,7 +2017,7 @@ static phys_addr_t __init xen_early_virt_to_phys(unsigned long vaddr)
 	pmd_t pmd;
 	pte_t pte;
 
-	pa = read_cr3();
+	pa = read_cr3_addr();
 	pgd = native_make_pgd(xen_read_phys_ulong(pa + pgd_index(vaddr) *
 						       sizeof(pgd)));
 	if (!pgd_present(pgd))
@@ -2097,7 +2097,7 @@ void __init xen_relocate_p2m(void)
 	pt_phys = pmd_phys + PFN_PHYS(n_pmd);
 	p2m_pfn = PFN_DOWN(pt_phys) + n_pt;
 
-	pgd = __va(read_cr3());
+	pgd = __va(read_cr3_addr());
 	new_p2m = (unsigned long *)(2 * PGDIR_SIZE);
 	idx_p4d = 0;
 	save_pud = n_pud;
@@ -2204,7 +2204,7 @@ static void __init xen_write_cr3_init(unsigned long cr3)
 {
 	unsigned long pfn = PFN_DOWN(__pa(swapper_pg_dir));
 
-	BUG_ON(read_cr3() != __pa(initial_page_table));
+	BUG_ON(read_cr3_addr() != __pa(initial_page_table));
 	BUG_ON(cr3 != __pa(swapper_pg_dir));
 
 	/*
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [RFC 10/11] x86/mm: Enable CR4.PCIDE on supported systems
  2017-06-05 22:36 [RFC 00/11] PCID and improved laziness Andy Lutomirski
                   ` (8 preceding siblings ...)
  2017-06-05 22:36 ` [RFC 09/11] x86/mm: Teach CR3 readers about PCID Andy Lutomirski
@ 2017-06-05 22:36 ` Andy Lutomirski
  2017-06-06 21:31   ` Boris Ostrovsky
  2017-06-05 22:36 ` [RFC 11/11] x86/mm: Try to preserve old TLB entries using PCID Andy Lutomirski
  10 siblings, 1 reply; 31+ messages in thread
From: Andy Lutomirski @ 2017-06-05 22:36 UTC (permalink / raw)
  To: X86 ML
  Cc: Borislav Petkov, Linus Torvalds, Andrew Morton, Mel Gorman,
	linux-mm, Nadav Amit, Rik van Riel, Andy Lutomirski,
	Juergen Gross, Boris Ostrovsky

We can use PCID if the CPU has PCID and PGE and we're not on Xen.

By itself, this has no effect.  The next patch will start using
PCID.

Cc: Juergen Gross <jgross@suse.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/tlbflush.h |  8 ++++++++
 arch/x86/kernel/cpu/common.c    | 15 +++++++++++++++
 arch/x86/xen/setup.c            |  6 ++++++
 3 files changed, 29 insertions(+)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 87b13e51e867..57b305e13c4c 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -243,6 +243,14 @@ static inline void __flush_tlb_all(void)
 		__flush_tlb_global();
 	else
 		__flush_tlb();
+
+	/*
+	 * Note: if we somehow had PCID but not PGE, then this wouldn't work --
+	 * we'd end up flushing kernel translations for the current ASID but
+	 * we might fail to flush kernel translations for other cached ASIDs.
+	 *
+	 * To avoid this issue, we force PCID off if PGE is off.
+	 */
 }
 
 static inline void __flush_tlb_one(unsigned long addr)
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 904485e7b230..01caf66b270f 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1143,6 +1143,21 @@ static void identify_cpu(struct cpuinfo_x86 *c)
 	setup_smep(c);
 	setup_smap(c);
 
+	/* Set up PCID */
+	if (cpu_has(c, X86_FEATURE_PCID)) {
+		if (cpu_has(c, X86_FEATURE_PGE)) {
+			cr4_set_bits(X86_CR4_PCIDE);
+		} else {
+			/*
+			 * flush_tlb_all(), as currently implemented, won't
+			 * work if PCID is on but PGE is not.  Since that
+			 * combination doesn't exist on real hardware, there's
+			 * no reason to try to fully support it.
+			 */
+			clear_cpu_cap(c, X86_FEATURE_PCID);
+		}
+	}
+
 	/*
 	 * The vendor-specific functions might have changed features.
 	 * Now we do "generic changes."
diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index a5bf7c451435..7681202b2857 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -1037,6 +1037,12 @@ void __init xen_arch_setup(void)
 	}
 #endif
 
+	/*
+	 * Xen would need some work to support PCID: CR3 handling as well
+	 * as xen_flush_tlb_others() would need updating.
+	 */
+	setup_clear_cpu_cap(X86_FEATURE_PCID);
+
 	memcpy(boot_command_line, xen_start_info->cmd_line,
 	       MAX_GUEST_CMDLINE > COMMAND_LINE_SIZE ?
 	       COMMAND_LINE_SIZE : MAX_GUEST_CMDLINE);
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [RFC 11/11] x86/mm: Try to preserve old TLB entries using PCID
  2017-06-05 22:36 [RFC 00/11] PCID and improved laziness Andy Lutomirski
                   ` (9 preceding siblings ...)
  2017-06-05 22:36 ` [RFC 10/11] x86/mm: Enable CR4.PCIDE on supported systems Andy Lutomirski
@ 2017-06-05 22:36 ` Andy Lutomirski
  10 siblings, 0 replies; 31+ messages in thread
From: Andy Lutomirski @ 2017-06-05 22:36 UTC (permalink / raw)
  To: X86 ML
  Cc: Borislav Petkov, Linus Torvalds, Andrew Morton, Mel Gorman,
	linux-mm, Nadav Amit, Rik van Riel, Andy Lutomirski

PCID is a "process context ID" -- it's what other architectures call
an address space ID.  Every non-global TLB entry is tagged with a
PCID, only TLB entries that match the currently selected PCID are
used, and we can switch PGDs without flushing the TLB.  x86's
PCID is 12 bits.
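
Concretely, with this patch a non-flushing PGD switch boils down to the
CR3 write in the switch_mm_irqs_off() hunk below:

	/* Switch PGDs; TLB entries tagged with new_asid stay live. */
	write_cr3(__pa(next->pgd) | new_asid | CR3_NOFLUSH);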

This is an unorthodox approach to using PCID.  x86's PCID is far too
short to uniquely identify a process, and we can't even really
uniquely identify a running process because there are monster
systems with over 4096 CPUs.  To make matters worse, past attempts
to use all 12 PCID bits have resulted in slowdowns instead of
speedups.

This patch uses PCID differently.  We use a PCID to identify a
recently-used mm on a per-cpu basis.  An mm has no fixed PCID
binding at all; instead, we give it a fresh PCID each time it's
loaded except in cases where we want to preserve the TLB, in which
case we reuse a recent value.

In particular, we use a small pool of dynamically assigned PCIDs
(NR_DYNAMIC_ASIDS, currently 6) for recently-used mms.  PCID 0 is also
used by swapper_pg_dir and by PCID-unaware CR3 users (e.g. EFI);
nothing ever switches to PCID 0 without flushing PCID 0's non-global
pages, so PCID 0 conflicts won't cause problems.

This seems to save about 100ns on context switches between mms.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/mmu_context.h     |  3 ++
 arch/x86/include/asm/processor-flags.h |  2 +
 arch/x86/include/asm/tlbflush.h        | 18 +++++++-
 arch/x86/mm/init.c                     |  1 +
 arch/x86/mm/tlb.c                      | 80 ++++++++++++++++++++++++++--------
 5 files changed, 85 insertions(+), 19 deletions(-)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 793cbe858ebf..b3d4a6bec5b1 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -308,6 +308,9 @@ static inline unsigned long __get_current_cr3_fast(void)
 {
 	unsigned long cr3 = __pa(this_cpu_read(cpu_tlbstate.loaded_mm)->pgd);
 
+	if (static_cpu_has(X86_FEATURE_PCID))
+		cr3 |= this_cpu_read(cpu_tlbstate.loaded_mm_asid);
+
 	/* For now, be very restrictive about when this can be called. */
 	VM_WARN_ON(in_nmi() || !in_atomic());
 
diff --git a/arch/x86/include/asm/processor-flags.h b/arch/x86/include/asm/processor-flags.h
index ce25ac7945c4..c8bd8a22d82e 100644
--- a/arch/x86/include/asm/processor-flags.h
+++ b/arch/x86/include/asm/processor-flags.h
@@ -33,9 +33,11 @@
 #ifdef CONFIG_X86_64
 #define CR3_ADDR_MASK 0x7FFFFFFFFFFFF000ull
 #define CR3_PCID_MASK 0xFFFull
+#define CR3_NOFLUSH (1UL << 63)
 #else
 #define CR3_ADDR_MASK 0xFFFFFFFFull
 #define CR3_PCID_MASK 0ull
+#define CR3_NOFLUSH 0
 #endif
 
 #endif /* _ASM_X86_PROCESSOR_FLAGS_H */
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 57b305e13c4c..a9a5aa6f45f7 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -82,6 +82,12 @@ static inline u64 bump_mm_tlb_gen(struct mm_struct *mm)
 #define __flush_tlb_single(addr) __native_flush_tlb_single(addr)
 #endif
 
+/*
+ * 6 because 6 should be plenty and struct tlb_state will fit in
+ * two cache lines.
+ */
+#define NR_DYNAMIC_ASIDS 6
+
 struct tlb_context {
 	u64 ctx_id;
 	u64 tlb_gen;
@@ -95,6 +101,8 @@ struct tlb_state {
 	 * mode even if we've already switched back to swapper_pg_dir.
 	 */
 	struct mm_struct *loaded_mm;
+	u16 loaded_mm_asid;
+	u16 next_asid;
 
 	/*
 	 * Access to this CR4 shadow and to H/W CR4 is protected by
@@ -104,7 +112,8 @@ struct tlb_state {
 
 	/*
 	 * This is a list of all contexts that might exist in the TLB.
-	 * Since we don't yet use PCID, there is only one context.
+	 * There is one per ASID that we use, and the ASID (what the
+	 * CPU calls PCID) is the index into ctxts.
 	 *
 	 * For each context, ctx_id indicates which mm the TLB's user
 	 * entries came from.  As an invariant, the TLB will never
@@ -114,8 +123,13 @@ struct tlb_state {
 	 * To be clear, this means that it's legal for the TLB code to
 	 * flush the TLB without updating tlb_gen.  This can happen
 	 * (for now, at least) due to paravirt remote flushes.
+	 *
+	 * NB: context 0 is a bit special, since it's also used by
+	 * various bits of init code.  This is fine -- code that
+	 * isn't aware of PCID will end up harmlessly flushing
+	 * context 0.
 	 */
-	struct tlb_context ctxs[1];
+	struct tlb_context ctxs[NR_DYNAMIC_ASIDS];
 };
 DECLARE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate);
 
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 7d6fa4676af9..9c9570d300ba 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -812,6 +812,7 @@ void __init zone_sizes_init(void)
 
 DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate) = {
 	.loaded_mm = &init_mm,
+	.next_asid = 1,
 	.cr4 = ~0UL,	/* fail hard if we screw up cr4 shadow initialization */
 };
 EXPORT_SYMBOL_GPL(cpu_tlbstate);
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 3773ba72cf2d..9828f3444cba 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -30,6 +30,40 @@
 
 atomic64_t last_mm_ctx_id = ATOMIC64_INIT(1);
 
+static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
+			    u16 *new_asid, bool *need_flush)
+{
+	u16 asid;
+
+	if (!static_cpu_has(X86_FEATURE_PCID)) {
+		*new_asid = 0;
+		*need_flush = true;
+		return;
+	}
+
+	for (asid = 0; asid < NR_DYNAMIC_ASIDS; asid++) {
+		if (this_cpu_read(cpu_tlbstate.ctxs[asid].ctx_id) !=
+		    next->context.ctx_id)
+			continue;
+
+		*new_asid = asid;
+		*need_flush = (this_cpu_read(cpu_tlbstate.ctxs[asid].tlb_gen) <
+			       next_tlb_gen);
+		return;
+	}
+
+	/*
+	 * We don't currently own an ASID slot on this CPU.
+	 * Allocate a slot.
+	 */
+	*new_asid = this_cpu_add_return(cpu_tlbstate.next_asid, 1) - 1;
+	if (*new_asid >= NR_DYNAMIC_ASIDS) {
+		*new_asid = 0;
+		this_cpu_write(cpu_tlbstate.next_asid, 1);
+	}
+	*need_flush = true;
+}
+
 void leave_mm(int cpu)
 {
 	struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
@@ -66,6 +100,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 {
 	unsigned cpu = smp_processor_id();
 	struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm);
+	u16 prev_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
 	u64 next_tlb_gen;
 
 	/*
@@ -81,7 +116,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 	if (IS_ENABLED(CONFIG_PROVE_LOCKING))
 		WARN_ON_ONCE(!irqs_disabled());
 
-	VM_BUG_ON(read_cr3_addr() != __pa(real_prev->pgd));
+	VM_BUG_ON(__read_cr3() != (__pa(real_prev->pgd) | prev_asid));
 
 	if (real_prev == next) {
 		if (cpumask_test_cpu(cpu, mm_cpumask(next))) {
@@ -98,10 +133,10 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		cpumask_set_cpu(cpu, mm_cpumask(next));
 		next_tlb_gen = atomic64_read(&next->context.tlb_gen);
 
-		VM_BUG_ON(this_cpu_read(cpu_tlbstate.ctxs[0].ctx_id) !=
+		VM_BUG_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
 			  next->context.ctx_id);
 
-		if (this_cpu_read(cpu_tlbstate.ctxs[0].tlb_gen) <
+		if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) <
 		    next_tlb_gen) {
 			/*
 			 * Ideally, we'd have a flush_tlb() variant that
@@ -109,7 +144,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 			 * be faster on Xen PV and on hypothetical CPUs
 			 * on which INVPCID is fast.
 			 */
-			this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen,
+			this_cpu_write(cpu_tlbstate.ctxs[prev_asid].tlb_gen,
 				       next_tlb_gen);
 			write_cr3(__pa(next->pgd));
 			trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH,
@@ -122,6 +157,9 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		 * are not reflected in tlb_gen.)
 		 */
 	} else {
+		u16 new_asid;
+		bool need_flush;
+
 		if (IS_ENABLED(CONFIG_VMAP_STACK)) {
 			/*
 			 * If our current stack is in vmalloc space and isn't
@@ -141,7 +179,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		if (cpumask_test_cpu(cpu, mm_cpumask(real_prev)))
 			cpumask_clear_cpu(cpu, mm_cpumask(real_prev));
 
-		WARN_ON_ONCE(cpumask_test_cpu(cpu, mm_cpumask(next)));
+		VM_WARN_ON_ONCE(cpumask_test_cpu(cpu, mm_cpumask(next)));
 
 		/*
 		 * Start remote flushes and then read tlb_gen.
@@ -149,17 +187,24 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		cpumask_set_cpu(cpu, mm_cpumask(next));
 		next_tlb_gen = atomic64_read(&next->context.tlb_gen);
 
-		VM_BUG_ON(this_cpu_read(cpu_tlbstate.ctxs[0].ctx_id) ==
-			  next->context.ctx_id);
+		choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);
 
-		this_cpu_write(cpu_tlbstate.ctxs[0].ctx_id,
-			       next->context.ctx_id);
-		this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen,
-			       next_tlb_gen);
-		this_cpu_write(cpu_tlbstate.loaded_mm, next);
-		write_cr3(__pa(next->pgd));
+		if (need_flush) {
+			this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id,
+				       next->context.ctx_id);
+			this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen,
+				       next_tlb_gen);
+			write_cr3(__pa(next->pgd) | new_asid);
+			trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH,
+					TLB_FLUSH_ALL);
+		} else {
+			/* The new ASID is already up to date. */
+			write_cr3(__pa(next->pgd) | new_asid | CR3_NOFLUSH);
+			trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, 0);
+		}
 
-		trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
+		this_cpu_write(cpu_tlbstate.loaded_mm, next);
+		this_cpu_write(cpu_tlbstate.loaded_mm_asid, new_asid);
 	}
 
 	load_mm_cr4(next);
@@ -170,6 +215,7 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
 				  bool local, enum tlb_flush_reason reason)
 {
 	struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
+	u32 loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
 
 	/*
 	 * Our memory ordering requirement is that any TLB fills that
@@ -179,9 +225,9 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
 	 * atomic64_read operation won't be reordered by the compiler.
 	 */
 	u64 mm_tlb_gen = atomic64_read(&loaded_mm->context.tlb_gen);
-	u64 local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[0].tlb_gen);
+	u64 local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen);
 
-	VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[0].ctx_id) !=
+	VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].ctx_id) !=
 		   loaded_mm->context.ctx_id);
 
 	if (!cpumask_test_cpu(smp_processor_id(), mm_cpumask(loaded_mm))) {
@@ -253,7 +299,7 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
 	}
 
 	/* Both paths above update our state to mm_tlb_gen. */
-	this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen, mm_tlb_gen);
+	this_cpu_write(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen, mm_tlb_gen);
 }
 
 static void flush_tlb_func_local(void *info, enum tlb_flush_reason reason)
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [RFC 01/11] x86/ldt: Simplify LDT switching logic
  2017-06-05 22:36 ` [RFC 01/11] x86/ldt: Simplify LDT switching logic Andy Lutomirski
@ 2017-06-05 22:40   ` Linus Torvalds
  2017-06-05 22:44     ` Andy Lutomirski
  2017-06-05 22:51     ` Linus Torvalds
  0 siblings, 2 replies; 31+ messages in thread
From: Linus Torvalds @ 2017-06-05 22:40 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, Borislav Petkov, Andrew Morton, Mel Gorman, linux-mm,
	Nadav Amit, Rik van Riel

On Mon, Jun 5, 2017 at 3:36 PM, Andy Lutomirski <luto@kernel.org> wrote:
> We used to switch the LDT if the prev and next mms' LDTs didn't
> match.

I think the "LDT didn't match" was really just a simpler and more
efficient way to say "they weren't both NULL".

I think you actually broke that optimization, and it now does *two*
tests instead of just one.

             Linus

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC 01/11] x86/ldt: Simplify LDT switching logic
  2017-06-05 22:40   ` Linus Torvalds
@ 2017-06-05 22:44     ` Andy Lutomirski
  2017-06-05 22:51     ` Linus Torvalds
  1 sibling, 0 replies; 31+ messages in thread
From: Andy Lutomirski @ 2017-06-05 22:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, X86 ML, Borislav Petkov, Andrew Morton,
	Mel Gorman, linux-mm, Nadav Amit, Rik van Riel

On Mon, Jun 5, 2017 at 3:40 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Mon, Jun 5, 2017 at 3:36 PM, Andy Lutomirski <luto@kernel.org> wrote:
>> We used to switch the LDT if the prev and next mms' LDTs didn't
>> match.
>
> I think the "LDT didn't match" was really just a simpler and more
> efficient way to say "they weren't both NULL".

Once we go fully lazy (later in this series), though, I'd start
worrying that the optimization would be wrong:

1. Load LDT 0x1234
2. Become lazy
3. LDT changes twice from a remote cpu and the second change reuses
the pointer 0x1234.
4. We go unlazy, prev == next, but LDTR is wrong.

This isn't a bug in current kernels because step 3 will force a leave_mm().

>
> I think you actually broke that optimization, and it now does *two*
> tests instead of just one.

I haven't looked at the generated code, but shouldn't it be just orq; jnz?
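
Something like this is what I have in mind (rough sketch, not
necessarily the exact patch text):

	/* One test: reload the LDT iff either mm has one. */
	if (unlikely((unsigned long)prev->context.ldt |
		     (unsigned long)next->context.ldt))
		load_mm_ldt(next);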

--Andy

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC 01/11] x86/ldt: Simplify LDT switching logic
  2017-06-05 22:40   ` Linus Torvalds
  2017-06-05 22:44     ` Andy Lutomirski
@ 2017-06-05 22:51     ` Linus Torvalds
  1 sibling, 0 replies; 31+ messages in thread
From: Linus Torvalds @ 2017-06-05 22:51 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, Borislav Petkov, Andrew Morton, Mel Gorman, linux-mm,
	Nadav Amit, Rik van Riel

On Mon, Jun 5, 2017 at 3:40 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I think the "LDT didn't match" was really just a simpler and more
> efficient way to say "they weren't both NULL".

In fact, looking back in the history, it used to instead add the sizes
of the context (and then similar logic: "if the sum is non-zero, one
or the other was non-zero").

Commit 0bbed3beb4 ("[PATCH] Thread-Local Storage (TLS) support") in
the historical tree then did this:

-               if (next->context.size+prev->context.size)
+               if (unlikely(prev->context.ldt != next->context.ldt))

I'm ok with your change, but I reacted to the commit log about how
this was "overcomplicated". It was actually an optimization exactly to
avoid two compares..

               Linus

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC 05/11] x86/mm: Rework lazy TLB mode and TLB freshness tracking
  2017-06-05 22:36 ` [RFC 05/11] x86/mm: Rework lazy TLB mode and TLB freshness tracking Andy Lutomirski
@ 2017-06-06  1:39   ` Nadav Amit
  2017-06-06 21:23     ` Andy Lutomirski
  2017-06-06 19:11   ` Rik van Riel
  1 sibling, 1 reply; 31+ messages in thread
From: Nadav Amit @ 2017-06-06  1:39 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, Borislav Petkov, Linus Torvalds, Andrew Morton,
	Mel Gorman, linux-mm, Rik van Riel, Andrew Banman, Mike Travis,
	Dimitri Sivanich, Juergen Gross, Boris Ostrovsky


> On Jun 5, 2017, at 3:36 PM, Andy Lutomirski <luto@kernel.org> wrote:
> 
> x86's lazy TLB mode used to be fairly weak -- it would switch to
> init_mm the first time it tried to flush a lazy TLB.  This meant an
> unnecessary CR3 write and, if the flush was remote, an unnecessary
> IPI.
> 
> Rewrite it entirely.  When we enter lazy mode, we simply remove the
> cpu from mm_cpumask.  This means that we need a way to figure out
> whether we've missed a flush when we switch back out of lazy mode.
> I use the tlb_gen machinery to track whether a context is up to
> date.
> 

[snip]

> @@ -67,133 +67,118 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
> {
> 

[snip]

> +		/* Resume remote flushes and then read tlb_gen. */
> +		cpumask_set_cpu(cpu, mm_cpumask(next));
> +		next_tlb_gen = atomic64_read(&next->context.tlb_gen);

It seems correct, but it got me somewhat confused at first.

Perhaps it is worth a comment that a memory barrier is not needed since
cpumask_set_cpu() uses a locked instruction. Otherwise, somebody may
even copy-paste it to another architecture...
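
Maybe something along these lines (the wording is just a suggestion):

	/*
	 * Resume remote flushes and then read tlb_gen.  No explicit
	 * barrier is needed: cpumask_set_cpu() is a locked RMW, which
	 * on x86 orders the cpumask update before the tlb_gen read.
	 */
	cpumask_set_cpu(cpu, mm_cpumask(next));
	next_tlb_gen = atomic64_read(&next->context.tlb_gen);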

Thanks,
Nadav

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC 08/11] x86/mm: Add nopcid to turn off PCID
  2017-06-05 22:36 ` [RFC 08/11] x86/mm: Add nopcid to turn off PCID Andy Lutomirski
@ 2017-06-06  3:22   ` Andi Kleen
  2017-06-14  4:52     ` Andy Lutomirski
  0 siblings, 1 reply; 31+ messages in thread
From: Andi Kleen @ 2017-06-06  3:22 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, Borislav Petkov, Linus Torvalds, Andrew Morton,
	Mel Gorman, linux-mm, Nadav Amit, Rik van Riel

Andy Lutomirski <luto@kernel.org> writes:

> The parameter is only present on x86_64 systems to save a few bytes,
> as PCID is always disabled on x86_32.

Seems redundant with clearcpuid.

-Andi

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC 04/11] x86/mm: Track the TLB's tlb_gen and update the flushing algorithm
  2017-06-05 22:36 ` [RFC 04/11] x86/mm: Track the TLB's tlb_gen and update the flushing algorithm Andy Lutomirski
@ 2017-06-06  5:03   ` Nadav Amit
  2017-06-06 22:45     ` Andy Lutomirski
  0 siblings, 1 reply; 31+ messages in thread
From: Nadav Amit @ 2017-06-06  5:03 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, Borislav Petkov, Linus Torvalds, Andrew Morton,
	Mel Gorman, linux-mm, Rik van Riel

Maybe it’s me, but I find it rather hard to figure out whether
flush_tlb_func_common() is safe, since it can be re-entered - if a local TLB
flush is performed, and during this local flush a remote shootdown IPI is
received.

Did I miss irq being disabled during the local flush?

Otherwise, it raises the question of whether the flush_tlb_func_common()
changes were designed with re-entry in mind. Mentioning that in the
comments would really be helpful.

Anyhow, I suspect that at least the following warning can be triggered:

	WARN_ON_ONCE(local_tlb_gen > mm_tlb_gen);


> static void flush_tlb_func_common(const struct flush_tlb_info *f,
> 				  bool local, enum tlb_flush_reason reason)
> {
> +	struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
> +
> +	/*
> +	 * Our memory ordering requirement is that any TLB fills that
> +	 * happen after we flush the TLB are ordered after we read
> +	 * active_mm's tlb_gen.  We don't need any explicit barrier
> +	 * because all x86 flush operations are serializing and the
> +	 * atomic64_read operation won't be reordered by the compiler.
> +	 */
> +	u64 mm_tlb_gen = atomic64_read(&loaded_mm->context.tlb_gen);

If, for example, a shootdown IPI can be delivered here...

> +	u64 local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[0].tlb_gen);
> +


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC 05/11] x86/mm: Rework lazy TLB mode and TLB freshness tracking
  2017-06-05 22:36 ` [RFC 05/11] x86/mm: Rework lazy TLB mode and TLB freshness tracking Andy Lutomirski
  2017-06-06  1:39   ` Nadav Amit
@ 2017-06-06 19:11   ` Rik van Riel
  2017-06-06 21:34     ` Andy Lutomirski
  1 sibling, 1 reply; 31+ messages in thread
From: Rik van Riel @ 2017-06-06 19:11 UTC (permalink / raw)
  To: Andy Lutomirski, X86 ML
  Cc: Borislav Petkov, Linus Torvalds, Andrew Morton, Mel Gorman,
	linux-mm, Nadav Amit, Andrew Banman, Mike Travis,
	Dimitri Sivanich, Juergen Gross, Boris Ostrovsky

[-- Attachment #1: Type: text/plain, Size: 1044 bytes --]

On Mon, 2017-06-05 at 15:36 -0700, Andy Lutomirski wrote:

> +++ b/arch/x86/include/asm/mmu_context.h
> @@ -122,8 +122,10 @@ static inline void switch_ldt(struct mm_struct
> *prev, struct mm_struct *next)
>  
>  static inline void enter_lazy_tlb(struct mm_struct *mm, struct
> task_struct *tsk)
>  {
> -	if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK)
> -		this_cpu_write(cpu_tlbstate.state, TLBSTATE_LAZY);
> +	int cpu = smp_processor_id();
> +
> +	if (cpumask_test_cpu(cpu, mm_cpumask(mm)))
> +		cpumask_clear_cpu(cpu, mm_cpumask(mm));
>  }

This is an atomic write to a shared cacheline,
every time a CPU goes idle.

I am not sure you really want to do this, since
there are some workloads out there that have a
crazy number of threads, which go idle hundreds,
or even thousands of times a second, on dozens
of CPUs at a time. *cough*Java*cough*

Keeping track of the state in a CPU-local variable,
written with a non-atomic write, would be much more
CPU cache friendly here.
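
Roughly what I have in mind (sketch only; the field name is made up):

	/* Assumes a new per-cpu field in struct tlb_state: bool is_lazy */
	static inline void enter_lazy_tlb(struct mm_struct *mm,
					  struct task_struct *tsk)
	{
		/* Plain per-cpu store; no shared cacheline is dirtied. */
		this_cpu_write(cpu_tlbstate.is_lazy, true);
	}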

-- 
All rights reversed

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC 05/11] x86/mm: Rework lazy TLB mode and TLB freshness tracking
  2017-06-06  1:39   ` Nadav Amit
@ 2017-06-06 21:23     ` Andy Lutomirski
  0 siblings, 0 replies; 31+ messages in thread
From: Andy Lutomirski @ 2017-06-06 21:23 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andy Lutomirski, X86 ML, Borislav Petkov, Linus Torvalds,
	Andrew Morton, Mel Gorman, linux-mm, Rik van Riel, Andrew Banman,
	Mike Travis, Dimitri Sivanich, Juergen Gross, Boris Ostrovsky

On Mon, Jun 5, 2017 at 6:39 PM, Nadav Amit <nadav.amit@gmail.com> wrote:
>
>> On Jun 5, 2017, at 3:36 PM, Andy Lutomirski <luto@kernel.org> wrote:
>>
>> x86's lazy TLB mode used to be fairly weak -- it would switch to
>> init_mm the first time it tried to flush a lazy TLB.  This meant an
>> unnecessary CR3 write and, if the flush was remote, an unnecessary
>> IPI.
>>
>> Rewrite it entirely.  When we enter lazy mode, we simply remove the
>> cpu from mm_cpumask.  This means that we need a way to figure out
>> whether we've missed a flush when we switch back out of lazy mode.
>> I use the tlb_gen machinery to track whether a context is up to
>> date.
>>
>
> [snip]
>
>> @@ -67,133 +67,118 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
>> {
>>
>
> [snip]
>
>> +             /* Resume remote flushes and then read tlb_gen. */
>> +             cpumask_set_cpu(cpu, mm_cpumask(next));
>> +             next_tlb_gen = atomic64_read(&next->context.tlb_gen);
>
> It seems correct, but it got me somewhat confused at first.
>
> Perhaps it is worth a comment that a memory barrier is not needed since
> cpumask_set_cpu() uses a locked instruction. Otherwise, somebody may
> even copy-paste it to another architecture...

Agreed.  I'll do something here.

>
> Thanks,
> Nadav

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC 10/11] x86/mm: Enable CR4.PCIDE on supported systems
  2017-06-05 22:36 ` [RFC 10/11] x86/mm: Enable CR4.PCIDE on supported systems Andy Lutomirski
@ 2017-06-06 21:31   ` Boris Ostrovsky
  2017-06-06 21:35     ` Andy Lutomirski
  0 siblings, 1 reply; 31+ messages in thread
From: Boris Ostrovsky @ 2017-06-06 21:31 UTC (permalink / raw)
  To: Andy Lutomirski, X86 ML
  Cc: Borislav Petkov, Linus Torvalds, Andrew Morton, Mel Gorman,
	linux-mm, Nadav Amit, Rik van Riel, Juergen Gross


> --- a/arch/x86/xen/setup.c
> +++ b/arch/x86/xen/setup.c
> @@ -1037,6 +1037,12 @@ void __init xen_arch_setup(void)
>  	}
>  #endif
>  
> +	/*
> +	 * Xen would need some work to support PCID: CR3 handling as well
> +	 * as xen_flush_tlb_others() would need updating.
> +	 */
> +	setup_clear_cpu_cap(X86_FEATURE_PCID);


Capabilities for PV guests are typically set in xen_init_capabilities() now.
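
I.e. roughly (sketch):

	static void __init xen_init_capabilities(void)
	{
		/* ... existing capability setup ... */

		/*
		 * Xen PV would need CR3 handling and
		 * xen_flush_tlb_others() updates before PCID could be used.
		 */
		setup_clear_cpu_cap(X86_FEATURE_PCID);
	}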


-boris

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC 05/11] x86/mm: Rework lazy TLB mode and TLB freshness tracking
  2017-06-06 19:11   ` Rik van Riel
@ 2017-06-06 21:34     ` Andy Lutomirski
  2017-06-07  3:33       ` Rik van Riel
  0 siblings, 1 reply; 31+ messages in thread
From: Andy Lutomirski @ 2017-06-06 21:34 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andy Lutomirski, X86 ML, Borislav Petkov, Linus Torvalds,
	Andrew Morton, Mel Gorman, linux-mm, Nadav Amit, Andrew Banman,
	Mike Travis, Dimitri Sivanich, Juergen Gross, Boris Ostrovsky

On Tue, Jun 6, 2017 at 12:11 PM, Rik van Riel <riel@redhat.com> wrote:
> On Mon, 2017-06-05 at 15:36 -0700, Andy Lutomirski wrote:
>
>> +++ b/arch/x86/include/asm/mmu_context.h
>> @@ -122,8 +122,10 @@ static inline void switch_ldt(struct mm_struct
>> *prev, struct mm_struct *next)
>>
>>  static inline void enter_lazy_tlb(struct mm_struct *mm, struct
>> task_struct *tsk)
>>  {
>> -     if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK)
>> -             this_cpu_write(cpu_tlbstate.state, TLBSTATE_LAZY);
>> +     int cpu = smp_processor_id();
>> +
>> +     if (cpumask_test_cpu(cpu, mm_cpumask(mm)))
>> +             cpumask_clear_cpu(cpu, mm_cpumask(mm));
>>  }
>
> This is an atomic write to a shared cacheline,
> every time a CPU goes idle.
>
> I am not sure you really want to do this, since
> there are some workloads out there that have a
> crazy number of threads, which go idle hundreds,
> or even thousands of times a second, on dozens
> of CPUs at a time. *cough*Java*cough*

It seems to me that the set of workloads on which this patch will hurt
performance is rather limited.  We'd need an mm with a lot of threads,
probably spread among a lot of nodes, that is constantly going idle
and non-idle on multiple CPUs on the same node, where there's nothing
else happening on those CPUs.

If there's a low-priority background task on the relevant CPUs, then
existing kernels will act just like patched kernels: the same bit will
be written by the same atomic operation at the same times.

>
> Keeping track of the state in a CPU-local variable,
> written with a non-atomic write, would be much more
> CPU cache friendly here.

We could, but then handling remote flushes becomes more complicated.

My inclination would be to keep the patch as is and, if this is
actually a problem, think about solving it more generally.  The real
issue is that we need a way to reasonably efficiently find the set of
CPUs for which a given mm is currently loaded and non-lazy.  A simple
improvement would be to split up mm_cpumask so that we'd have one
cache line per node.  (And we'd presumably allow several mms to share
the same pile of memory.)  Or we could go all out and use percpu state
only and iterate over all online CPUs when flushing (ick!).  Or
something in between.
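
For the per-node variant, I'm imagining something vaguely like this
(very rough sketch, all names invented):

	/* One mask per node, each on its own cache line. */
	struct mm_cpumask_node {
		struct cpumask mask;
	} ____cacheline_aligned_in_smp;

	struct mm_cpumasks {
		struct mm_cpumask_node node[MAX_NUMNODES];
	};

so that entering/leaving lazy mode only dirties the local node's
cacheline and a flush walks the per-node masks.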

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC 10/11] x86/mm: Enable CR4.PCIDE on supported systems
  2017-06-06 21:31   ` Boris Ostrovsky
@ 2017-06-06 21:35     ` Andy Lutomirski
  2017-06-06 21:48       ` Boris Ostrovsky
  0 siblings, 1 reply; 31+ messages in thread
From: Andy Lutomirski @ 2017-06-06 21:35 UTC (permalink / raw)
  To: Boris Ostrovsky
  Cc: Andy Lutomirski, X86 ML, Borislav Petkov, Linus Torvalds,
	Andrew Morton, Mel Gorman, linux-mm, Nadav Amit, Rik van Riel,
	Juergen Gross

On Tue, Jun 6, 2017 at 2:31 PM, Boris Ostrovsky
<boris.ostrovsky@oracle.com> wrote:
>
>> --- a/arch/x86/xen/setup.c
>> +++ b/arch/x86/xen/setup.c
>> @@ -1037,6 +1037,12 @@ void __init xen_arch_setup(void)
>>       }
>>  #endif
>>
>> +     /*
>> +      * Xen would need some work to support PCID: CR3 handling as well
>> +      * as xen_flush_tlb_others() would need updating.
>> +      */
>> +     setup_clear_cpu_cap(X86_FEATURE_PCID);
>
>
> Capabilities for PV guests are typically set in xen_init_capabilities() now.

Do I need this just for PV or for all Xen guests?  Do the
hardware-assisted guests still use paravirt flushes?  Does the
hypervisor either support PCID or correctly clear the PCID CPUID bit?

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC 10/11] x86/mm: Enable CR4.PCIDE on supported systems
  2017-06-06 21:35     ` Andy Lutomirski
@ 2017-06-06 21:48       ` Boris Ostrovsky
  2017-06-06 21:54         ` Andy Lutomirski
  0 siblings, 1 reply; 31+ messages in thread
From: Boris Ostrovsky @ 2017-06-06 21:48 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, Borislav Petkov, Linus Torvalds, Andrew Morton,
	Mel Gorman, linux-mm, Nadav Amit, Rik van Riel, Juergen Gross

On 06/06/2017 05:35 PM, Andy Lutomirski wrote:
> On Tue, Jun 6, 2017 at 2:31 PM, Boris Ostrovsky
> <boris.ostrovsky@oracle.com> wrote:
>>> --- a/arch/x86/xen/setup.c
>>> +++ b/arch/x86/xen/setup.c
>>> @@ -1037,6 +1037,12 @@ void __init xen_arch_setup(void)
>>>       }
>>>  #endif
>>>
>>> +     /*
>>> +      * Xen would need some work to support PCID: CR3 handling as well
>>> +      * as xen_flush_tlb_others() would need updating.
>>> +      */
>>> +     setup_clear_cpu_cap(X86_FEATURE_PCID);
>>
>> Capabilities for PV guests are typically set in xen_init_capabilities() now.
> Do I need this just for PV or for all Xen guests?  Do the
> hardware-assisted guests still use paravirt flushes?  Does the
> hypervisor either support PCID or correctly clear the PCID CPUID bit?


For HVM guests, Xen will DTRT for CPUID, so dealing with PV should be
sufficient (and xen_arch_setup() is called on PV only anyway).

As far as flushes are concerned, it's PV only for now, although I
believe Juergen is thinking about doing this on HVM too.

-boris

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC 10/11] x86/mm: Enable CR4.PCIDE on supported systems
  2017-06-06 21:48       ` Boris Ostrovsky
@ 2017-06-06 21:54         ` Andy Lutomirski
  0 siblings, 0 replies; 31+ messages in thread
From: Andy Lutomirski @ 2017-06-06 21:54 UTC (permalink / raw)
  To: Boris Ostrovsky
  Cc: Andy Lutomirski, X86 ML, Borislav Petkov, Linus Torvalds,
	Andrew Morton, Mel Gorman, linux-mm, Nadav Amit, Rik van Riel,
	Juergen Gross

On Tue, Jun 6, 2017 at 2:48 PM, Boris Ostrovsky
<boris.ostrovsky@oracle.com> wrote:
> On 06/06/2017 05:35 PM, Andy Lutomirski wrote:
>> On Tue, Jun 6, 2017 at 2:31 PM, Boris Ostrovsky
>> <boris.ostrovsky@oracle.com> wrote:
>>>> --- a/arch/x86/xen/setup.c
>>>> +++ b/arch/x86/xen/setup.c
>>>> @@ -1037,6 +1037,12 @@ void __init xen_arch_setup(void)
>>>>       }
>>>>  #endif
>>>>
>>>> +     /*
>>>> +      * Xen would need some work to support PCID: CR3 handling as well
>>>> +      * as xen_flush_tlb_others() would need updating.
>>>> +      */
>>>> +     setup_clear_cpu_cap(X86_FEATURE_PCID);
>>>
>>> Capabilities for PV guests are typically set in xen_init_capabilities() now.
>> Do I need this just for PV or for all Xen guests?  Do the
>> hardware-assisted guests still use paravirt flushes?  Does the
>> hypervisor either support PCID or correctly clear the PCID CPUID bit?
>
>
> For HVM guests Xen will DTRT for CPUID so dealing with PV should be
> sufficient (and xen_arch_setup() is called on PV only anyway)
>
> As far as flushes are concerned for now it's PV only although I believe
> Juergen is thinking about doing this on HVM too.

OK.  I'll move the code to xen_init_capabilities() for the next version.

>
> -boris

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC 04/11] x86/mm: Track the TLB's tlb_gen and update the flushing algorithm
  2017-06-06  5:03   ` Nadav Amit
@ 2017-06-06 22:45     ` Andy Lutomirski
  0 siblings, 0 replies; 31+ messages in thread
From: Andy Lutomirski @ 2017-06-06 22:45 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andy Lutomirski, X86 ML, Borislav Petkov, Linus Torvalds,
	Andrew Morton, Mel Gorman, linux-mm, Rik van Riel

On Mon, Jun 5, 2017 at 10:03 PM, Nadav Amit <nadav.amit@gmail.com> wrote:
> Maybe it’s me, but I find it rather hard to figure out whether
> flush_tlb_func_common() is safe, since it can be re-entered - if a local TLB
> flush is performed, and during this local flush a remote shootdown IPI is
> received.
>
> Did I miss irq being disabled during the local flush?
>

Whoops!  In my head, it was disabled.
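
Presumably the fix is to do the local flush with IRQs off, something
like this in the local path (untested sketch, assuming the
flush_tlb_info is in a local "info" as in flush_tlb_mm_range()):

	/* Keep a remote shootdown IPI from re-entering the flush. */
	local_irq_disable();
	flush_tlb_func_local(&info, TLB_LOCAL_MM_SHOOTDOWN);
	local_irq_enable();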

--Andy

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC 05/11] x86/mm: Rework lazy TLB mode and TLB freshness tracking
  2017-06-06 21:34     ` Andy Lutomirski
@ 2017-06-07  3:33       ` Rik van Riel
  2017-06-07  4:54         ` Andy Lutomirski
  0 siblings, 1 reply; 31+ messages in thread
From: Rik van Riel @ 2017-06-07  3:33 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, Borislav Petkov, Linus Torvalds, Andrew Morton,
	Mel Gorman, linux-mm, Nadav Amit, Andrew Banman, Mike Travis,
	Dimitri Sivanich, Juergen Gross, Boris Ostrovsky

On Tue, 2017-06-06 at 14:34 -0700, Andy Lutomirski wrote:
> On Tue, Jun 6, 2017 at 12:11 PM, Rik van Riel <riel@redhat.com>
> wrote:
> > On Mon, 2017-06-05 at 15:36 -0700, Andy Lutomirski wrote:
> > 
> > > +++ b/arch/x86/include/asm/mmu_context.h
> > > @@ -122,8 +122,10 @@ static inline void switch_ldt(struct
> > > mm_struct
> > > *prev, struct mm_struct *next)
> > > 
> > >  static inline void enter_lazy_tlb(struct mm_struct *mm, struct
> > > task_struct *tsk)
> > >  {
> > > -     if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK)
> > > -             this_cpu_write(cpu_tlbstate.state, TLBSTATE_LAZY);
> > > +     int cpu = smp_processor_id();
> > > +
> > > +     if (cpumask_test_cpu(cpu, mm_cpumask(mm)))
> > > +             cpumask_clear_cpu(cpu, mm_cpumask(mm));
> > >  }
> > 
> > This is an atomic write to a shared cacheline,
> > every time a CPU goes idle.
> > 
> > I am not sure you really want to do this, since
> > there are some workloads out there that have a
> > crazy number of threads, which go idle hundreds,
> > or even thousands of times a second, on dozens
> > of CPUs at a time. *cough*Java*cough*
> 
> It seems to me that the set of workloads on which this patch will
> hurt
> performance is rather limited.  We'd need an mm with a lot of
> threads,
> probably spread among a lot of nodes, that is constantly going idle
> and non-idle on multiple CPUs on the same node, where there's nothing
> else happening on those CPUs.

I am assuming the SPECjbb2015 benchmark is representative
of how some actual (albeit crazy) Java workloads behave.

> > Keeping track of the state in a CPU-local variable,
> > written with a non-atomic write, would be much more
> > CPU cache friendly here.
> 
> We could, but then handling remote flushes becomes more complicated.

I already wrote that code. It's not that hard.

> My inclination would be to keep the patch as is and, if this is
> actually a problem, think about solving it more generally.  The real
> issue is that we need a way to reasonably efficiently find the set of
> CPUs for which a given mm is currently loaded and non-lazy.  A simple
> improvement would be to split up mm_cpumask so that we'd have one
> cache line per node.  (And we'd presumably allow several mms to share
> the same pile of memory.)  Or we could go all out and use percpu
> state
> only and iterate over all online CPUs when flushing (ick!).  Or
> something in between.

Reading per cpu state is relatively cheap. Writing is
more expensive, but that only needs to be done at TLB
flush time, and is much cheaper than sending an IPI.

Tasks going idle and waking back up seems to be a much
more common operation than doing a TLB flush. Having the
idle path be the more expensive one makes little sense
to me, but I may be overlooking something.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC 05/11] x86/mm: Rework lazy TLB mode and TLB freshness tracking
  2017-06-07  3:33       ` Rik van Riel
@ 2017-06-07  4:54         ` Andy Lutomirski
  2017-06-07  5:11           ` Andy Lutomirski
  0 siblings, 1 reply; 31+ messages in thread
From: Andy Lutomirski @ 2017-06-07  4:54 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andy Lutomirski, X86 ML, Borislav Petkov, Linus Torvalds,
	Andrew Morton, Mel Gorman, linux-mm, Nadav Amit, Andrew Banman,
	Mike Travis, Dimitri Sivanich, Juergen Gross, Boris Ostrovsky

On Tue, Jun 6, 2017 at 8:33 PM, Rik van Riel <riel@redhat.com> wrote:
> On Tue, 2017-06-06 at 14:34 -0700, Andy Lutomirski wrote:
>> On Tue, Jun 6, 2017 at 12:11 PM, Rik van Riel <riel@redhat.com>
>> wrote:
>> > On Mon, 2017-06-05 at 15:36 -0700, Andy Lutomirski wrote:
>> >
>> > > +++ b/arch/x86/include/asm/mmu_context.h
>> > > @@ -122,8 +122,10 @@ static inline void switch_ldt(struct
>> > > mm_struct
>> > > *prev, struct mm_struct *next)
>> > >
>> > >  static inline void enter_lazy_tlb(struct mm_struct *mm, struct
>> > > task_struct *tsk)
>> > >  {
>> > > -     if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK)
>> > > -             this_cpu_write(cpu_tlbstate.state, TLBSTATE_LAZY);
>> > > +     int cpu = smp_processor_id();
>> > > +
>> > > +     if (cpumask_test_cpu(cpu, mm_cpumask(mm)))
>> > > +             cpumask_clear_cpu(cpu, mm_cpumask(mm));
>> > >  }
>> >
>> > This is an atomic write to a shared cacheline,
>> > every time a CPU goes idle.
>> >
>> > I am not sure you really want to do this, since
>> > there are some workloads out there that have a
>> > crazy number of threads, which go idle hundreds,
>> > or even thousands of times a second, on dozens
>> > of CPUs at a time. *cough*Java*cough*
>>
>> It seems to me that the set of workloads on which this patch will
>> hurt
>> performance is rather limited.  We'd need an mm with a lot of
>> threads,
>> probably spread among a lot of nodes, that is constantly going idle
>> and non-idle on multiple CPUs on the same node, where there's nothing
>> else happening on those CPUs.
>
> I am assuming the SPECjbb2015 benchmark is representative
> of how some actual (albeit crazy) Java workloads behave.

The little picture on the SPECjbb2015 site shows inter-java-process
communication, which suggests that we'll bounce between two non-idle
mms; that case should get significantly faster with this patch set
applied.
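
Very roughly, the reason the two-mm ping-pong gets cheaper is the per-CPU
ASID cache that PCID makes possible; the sketch below is an approximation
with illustrative names (asid_cache, NR_DYN_ASIDS), not the literal patch.
Switching back to an mm whose ASID is still cached can skip the flush
entirely, as long as the mm hasn't been modified in the meantime:

#define NR_DYN_ASIDS 6		/* illustrative; the real constant differs */

struct asid_slot {
	u64 ctx_id;	/* which mm last used this ASID on this CPU */
	u64 tlb_gen;	/* how recent its TLB entries are */
};
static DEFINE_PER_CPU(struct asid_slot, asid_cache[NR_DYN_ASIDS]);

/* Pick an ASID for @next; reuse a cached one to avoid a full flush. */
static void choose_new_asid(struct mm_struct *next, u16 *new_asid,
			    bool *need_flush)
{
	u16 asid;

	for (asid = 0; asid < NR_DYN_ASIDS; asid++) {
		if (this_cpu_read(asid_cache[asid].ctx_id) !=
		    next->context.ctx_id)
			continue;

		*new_asid = asid;
		/* Flush only if the mm changed since we last ran it here. */
		*need_flush = this_cpu_read(asid_cache[asid].tlb_gen) <
			      atomic64_read(&next->context.tlb_gen);
		return;
	}

	/* Not cached: pick a victim slot (eviction policy omitted) and flush. */
	*new_asid = 0;
	*need_flush = true;
}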

>
>> > Keeping track of the state in a CPU-local variable,
>> > written with a non-atomic write, would be much more
>> > CPU cache friendly here.
>>
>> We could, but then handling remote flushes becomes more complicated.
>
> I already wrote that code. It's not that hard.
>
>> My inclination would be to keep the patch as is and, if this is
>> actually a problem, think about solving it more generally.  The real
>> issue is that we need a way to reasonably efficiently find the set of
>> CPUs for which a given mm is currently loaded and non-lazy.  A simple
>> improvement would be to split up mm_cpumask so that we'd have one
>> cache line per node.  (And we'd presumably allow several mms to share
>> the same pile of memory.)  Or we could go all out and use percpu
>> state only and iterate over all online CPUs when flushing (ick!).
>> Or something in between.
>
> Reading per cpu state is relatively cheap. Writing is
> more expensive, but that only needs to be done at TLB
> flush time, and is much cheaper than sending an IPI.
>
> Tasks going idle and waking back up seems to be a much
> more common operation than doing a TLB flush. Having the
> idle path be the more expensive one makes little sense
> to me, but I may be overlooking something.

I agree with all of this.  I'm not saying that we shouldn't try to
optimize these transitions on large systems where an mm is in use on a
lot of cores at once.  My point is that we shouldn't treat idle as a
special case that functions completely differently from everything
else.  With this series applied, we have the ability to accurately
determine whether the current CPU's TLB is up to date for a given mm
with good cache performance: we look at a single shared cacheline
(mm->context) and two (one if !PCID) percpu cachelines.  Idle is
almost exactly the same condition as switched-out-but-still-live: we
have a context that's represented in the TLB, but we're not trying to
keep it coherent.  I think that, especially in the presence of PCID,
it would be rather odd to treat them differently.
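
Concretely, the check described above boils down to something like the
following; this is a rough sketch whose field names approximate the series
rather than quote it:

/* Rough sketch of the freshness test; names approximate the series. */
static bool tlb_is_up_to_date(struct mm_struct *mm, u16 asid)
{
	/* One shared cacheline: the mm's identity and its flush generation. */
	u64 ctx_id = mm->context.ctx_id;
	u64 mm_gen = atomic64_read(&mm->context.tlb_gen);

	/* Percpu state: what this CPU has loaded and how far it has flushed. */
	if (this_cpu_read(cpu_tlbstate.ctxs[asid].ctx_id) != ctx_id)
		return false;	/* the TLB holds some other mm for this ASID */

	return this_cpu_read(cpu_tlbstate.ctxs[asid].tlb_gen) >= mm_gen;
}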

The only real question in my book is how we should be tracking which
CPUs are currently attempting to maintain their TLBs *coherently* for
a given mm, which is exactly the same as the set of CPUs that are
currently running that mm non-lazily.  With my patches applied, it's a
cpumask.  In 4.11, it's a cpumask that's inaccurate, leading to
unnecessary IPIs, but that inaccuracy speeds up one particular type of
workload.  We could come up with other data structures for this.  The
very simplest is to ditch mm_cpumask entirely and just use percpu
variables for everything, but that would pessimize the single-threaded
case.
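
For what it's worth, the "one cache line per node" variant mentioned earlier
could look roughly like this; CPUS_PER_NODE_MAX and cpu_to_node_local_id()
are made-up helpers for the sketch, not existing kernel API:

/* Illustrative only: a node-sharded replacement for mm_cpumask. */
#define CPUS_PER_NODE_MAX 64		/* made-up bound for the sketch */

struct mm_node_cpumask {
	unsigned long bits[BITS_TO_LONGS(CPUS_PER_NODE_MAX)]
		____cacheline_aligned_in_smp;
};

struct mm_sharded_cpumask {
	struct mm_node_cpumask node[MAX_NUMNODES];
};

static inline void mm_sharded_set_cpu(struct mm_sharded_cpumask *m, int cpu)
{
	/* Still atomic, but only bounces a cacheline within this CPU's node. */
	set_bit(cpu_to_node_local_id(cpu), m->node[cpu_to_node(cpu)].bits);
}

static inline void mm_sharded_clear_cpu(struct mm_sharded_cpumask *m, int cpu)
{
	clear_bit(cpu_to_node_local_id(cpu), m->node[cpu_to_node(cpu)].bits);
}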

Anyway, my point is that I think that, if this is really a problem, we
should optimize mm_cpumask updating more generally instead of coming
up with something that's specific to idle transitions.

Anyway, I kind of suspect that we're arguing over something quite
minor.  I found a paper suggesting that cmpxchg took about 100ns for a
worst-case access to a cache line that's exclusively owned by another
socket.  We're using lock bts/btr, which should be a little faster.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC 05/11] x86/mm: Rework lazy TLB mode and TLB freshness tracking
  2017-06-07  4:54         ` Andy Lutomirski
@ 2017-06-07  5:11           ` Andy Lutomirski
  0 siblings, 0 replies; 31+ messages in thread
From: Andy Lutomirski @ 2017-06-07  5:11 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Rik van Riel, X86 ML, Borislav Petkov, Linus Torvalds,
	Andrew Morton, Mel Gorman, linux-mm, Nadav Amit, Andrew Banman,
	Mike Travis, Dimitri Sivanich, Juergen Gross, Boris Ostrovsky

On Tue, Jun 6, 2017 at 9:54 PM, Andy Lutomirski <luto@kernel.org> wrote:
> Anyway, my point is that I think that, if this is really a problem, we
> should optimize mm_cpumask updating more generally instead of coming
> up with something that's specific to idle transitions.
>

I suspect the right data structure may be some kind of linked list,
not a bitmask at all.  The operations that should be fast are adding
yourself to the list, removing yourself from the list, and iterating
over all CPUs in the list.  Iterating over all CPUs in the list should
be reasonably fast, but the other two operations are more important.
We have the nice property that a given CPU is only on one such list at
a time.
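
A naive sketch of that list idea, purely for illustration: the active_cpus
fields on mm_struct and the per-mm spinlock are hypothetical, and the
locking shown here is exactly the part a real design would have to do
better, since a shared lock would bounce just like mm_cpumask does.

struct mm_cpu_node {
	struct list_head link;
	int cpu;
};
static DEFINE_PER_CPU(struct mm_cpu_node, mm_cpu_node);

/* hypothetical fields in struct mm_struct:
 *	spinlock_t       active_cpus_lock;
 *	struct list_head active_cpus;
 */

static void mm_join(struct mm_struct *mm)
{
	struct mm_cpu_node *node = this_cpu_ptr(&mm_cpu_node);

	node->cpu = smp_processor_id();
	spin_lock(&mm->active_cpus_lock);
	list_add(&node->link, &mm->active_cpus);
	spin_unlock(&mm->active_cpus_lock);
}

static void mm_leave(struct mm_struct *mm)
{
	struct mm_cpu_node *node = this_cpu_ptr(&mm_cpu_node);

	spin_lock(&mm->active_cpus_lock);
	list_del(&node->link);
	spin_unlock(&mm->active_cpus_lock);
}

/* Flushers iterate the CPUs currently running @mm non-lazily. */
static void mm_for_each_active_cpu(struct mm_struct *mm,
				   void (*fn)(int cpu, void *data), void *data)
{
	struct mm_cpu_node *node;

	spin_lock(&mm->active_cpus_lock);
	list_for_each_entry(node, &mm->active_cpus, link)
		fn(node->cpu, data);
	spin_unlock(&mm->active_cpus_lock);
}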

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC 08/11] x86/mm: Add nopcid to turn off PCID
  2017-06-06  3:22   ` Andi Kleen
@ 2017-06-14  4:52     ` Andy Lutomirski
  2017-06-14  9:51       ` Borislav Petkov
  0 siblings, 1 reply; 31+ messages in thread
From: Andy Lutomirski @ 2017-06-14  4:52 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andy Lutomirski, X86 ML, Borislav Petkov, Linus Torvalds,
	Andrew Morton, Mel Gorman, linux-mm, Nadav Amit, Rik van Riel

On Mon, Jun 5, 2017 at 8:22 PM, Andi Kleen <andi@firstfloor.org> wrote:
> Andy Lutomirski <luto@kernel.org> writes:
>
>> The parameter is only present on x86_64 systems to save a few bytes,
>> as PCID is always disabled on x86_32.
>
> Seems redundant with clearcpuid.
>

It is.  OTOH, there are lots of noxyz options, and they're easier to
type and to remember.  Borislav?  Sometimes I wonder whether we should
autogenerate noxyz options from the capflags table.
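
For reference, a nopcid option amounts to a handler along these lines
(sketch only; the actual patch may differ in details):

static int __init x86_nopcid_setup(char *s)
{
	/* nopcid doesn't accept parameters */
	if (s)
		return -EINVAL;

	/* Stay quiet if the CPU doesn't have PCID in the first place. */
	if (!boot_cpu_has(X86_FEATURE_PCID))
		return 0;

	setup_clear_cpu_cap(X86_FEATURE_PCID);
	pr_info("nopcid: PCID feature disabled\n");
	return 0;
}
early_param("nopcid", x86_nopcid_setup);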

> -Andi

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC 08/11] x86/mm: Add nopcid to turn off PCID
  2017-06-14  4:52     ` Andy Lutomirski
@ 2017-06-14  9:51       ` Borislav Petkov
  0 siblings, 0 replies; 31+ messages in thread
From: Borislav Petkov @ 2017-06-14  9:51 UTC (permalink / raw)
  To: Andy Lutomirski, H. Peter Anvin
  Cc: Andi Kleen, X86 ML, Linus Torvalds, Andrew Morton, Mel Gorman,
	linux-mm, Nadav Amit, Rik van Riel

On Tue, Jun 13, 2017 at 09:52:03PM -0700, Andy Lutomirski wrote:
> It is.  OTOH, there are lots of noxyz options, and they're easier to
> type and to remember.  Borislav?  Sometimes I wonder whether we should
> autogenerate noxyz options from the capflags table.

Maybe.

Although, last time hpa said that all those old chicken bits can simply
be removed now that they're not really needed anymore. I even had a
patch somewhere but then something more important happened...

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2017-06-14  9:51 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-06-05 22:36 [RFC 00/11] PCID and improved laziness Andy Lutomirski
2017-06-05 22:36 ` [RFC 01/11] x86/ldt: Simplify LDT switching logic Andy Lutomirski
2017-06-05 22:40   ` Linus Torvalds
2017-06-05 22:44     ` Andy Lutomirski
2017-06-05 22:51     ` Linus Torvalds
2017-06-05 22:36 ` [RFC 02/11] x86/mm: Remove reset_lazy_tlbstate() Andy Lutomirski
2017-06-05 22:36 ` [RFC 03/11] x86/mm: Give each mm TLB flush generation a unique ID Andy Lutomirski
2017-06-05 22:36 ` [RFC 04/11] x86/mm: Track the TLB's tlb_gen and update the flushing algorithm Andy Lutomirski
2017-06-06  5:03   ` Nadav Amit
2017-06-06 22:45     ` Andy Lutomirski
2017-06-05 22:36 ` [RFC 05/11] x86/mm: Rework lazy TLB mode and TLB freshness tracking Andy Lutomirski
2017-06-06  1:39   ` Nadav Amit
2017-06-06 21:23     ` Andy Lutomirski
2017-06-06 19:11   ` Rik van Riel
2017-06-06 21:34     ` Andy Lutomirski
2017-06-07  3:33       ` Rik van Riel
2017-06-07  4:54         ` Andy Lutomirski
2017-06-07  5:11           ` Andy Lutomirski
2017-06-05 22:36 ` [RFC 06/11] x86/mm: Stop calling leave_mm() in idle code Andy Lutomirski
2017-06-05 22:36 ` [RFC 07/11] x86/mm: Disable PCID on 32-bit kernels Andy Lutomirski
2017-06-05 22:36 ` [RFC 08/11] x86/mm: Add nopcid to turn off PCID Andy Lutomirski
2017-06-06  3:22   ` Andi Kleen
2017-06-14  4:52     ` Andy Lutomirski
2017-06-14  9:51       ` Borislav Petkov
2017-06-05 22:36 ` [RFC 09/11] x86/mm: Teach CR3 readers about PCID Andy Lutomirski
2017-06-05 22:36 ` [RFC 10/11] x86/mm: Enable CR4.PCIDE on supported systems Andy Lutomirski
2017-06-06 21:31   ` Boris Ostrovsky
2017-06-06 21:35     ` Andy Lutomirski
2017-06-06 21:48       ` Boris Ostrovsky
2017-06-06 21:54         ` Andy Lutomirski
2017-06-05 22:36 ` [RFC 11/11] x86/mm: Try to preserve old TLB entries using PCID Andy Lutomirski

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).