linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC 00/13] x86/mm: PCID and INVPCID
@ 2016-01-08 23:15 Andy Lutomirski
  2016-01-08 23:15 ` [RFC 01/13] x86/paravirt: Turn KASAN off for parvirt.o Andy Lutomirski
                   ` (13 more replies)
  0 siblings, 14 replies; 43+ messages in thread
From: Andy Lutomirski @ 2016-01-08 23:15 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Borislav Petkov, Brian Gerst, Dave Hansen, Linus Torvalds,
	Oleg Nesterov, linux-mm, Andy Lutomirski

Here's my PCID and INVPCID work-in-progress.  It seems to work well
enough to play with it.  (That is, I'm not aware of anything wrong
with it, although it may eat your data.)

PCID and INVPCID use are orthogonal here.  INVPCID is a
straightforward speedup for global TLB flushes.  Other than that, I
don't use INVPCID at all, since it seems slower than just
manipulating CR3 carefully, at least on my Skylake laptop.

Please play around and suggest (and run?) good benchmarks.  It seems
to save around 100ns on cross-process context switches for me.
Unfortunately, we suck at context switches in general, so this is,
at best, a little over a 10% speedup.  Most of the time is spent in
the scheduler, not in arch code.

Andy Lutomirski (13):
  x86/paravirt: Turn KASAN off for parvirt.o
  x86/mm: Add INVPCID helpers
  x86/mm: Add a noinvpcid option to turn off INVPCID
  x86/mm: If INVPCID is available, use it to flush global mappings
  x86/mm: Add barriers and document switch_mm-vs-flush synchronization
  x86/mm: Disable PCID on 32-bit kernels
  x86/mm: Add nopcid to turn off PCID
  x86/mm: Teach CR3 readers about PCID
  x86/mm: Disable interrupts when flushing the TLB using CR3
  x86/mm: Factor out remote TLB flushing
  x86/mm: Build arch/x86/mm/tlb.c even on !SMP
  x86/mm: Uninline switch_mm
  x86/mm: Try to preserve old TLB entries using PCID

 Documentation/kernel-parameters.txt      |   4 +
 arch/x86/include/asm/disabled-features.h |   4 +-
 arch/x86/include/asm/mmu.h               |   7 +-
 arch/x86/include/asm/mmu_context.h       |  62 +-----
 arch/x86/include/asm/tlbflush.h          |  86 ++++++++
 arch/x86/kernel/Makefile                 |   1 +
 arch/x86/kernel/cpu/bugs.c               |   6 +
 arch/x86/kernel/cpu/common.c             |  38 ++++
 arch/x86/kernel/head64.c                 |   3 +-
 arch/x86/kernel/ldt.c                    |   2 +
 arch/x86/kernel/process_64.c             |   2 +
 arch/x86/mm/Makefile                     |   3 +-
 arch/x86/mm/fault.c                      |   8 +-
 arch/x86/mm/tlb.c                        | 324 +++++++++++++++++++++++++++++--
 14 files changed, 467 insertions(+), 83 deletions(-)

-- 
2.5.0

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [RFC 01/13] x86/paravirt: Turn KASAN off for parvirt.o
  2016-01-08 23:15 [RFC 00/13] x86/mm: PCID and INVPCID Andy Lutomirski
@ 2016-01-08 23:15 ` Andy Lutomirski
  2016-01-10 18:59   ` Borislav Petkov
  2016-01-08 23:15 ` [RFC 02/13] x86/mm: Add INVPCID helpers Andy Lutomirski
                   ` (12 subsequent siblings)
  13 siblings, 1 reply; 43+ messages in thread
From: Andy Lutomirski @ 2016-01-08 23:15 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Borislav Petkov, Brian Gerst, Dave Hansen, Linus Torvalds,
	Oleg Nesterov, linux-mm, Andy Lutomirski

Otherwise terrible things happen if some of the callbacks end up
calling into KASAN in unexpected places.

This has no obvious symptoms yet, but adding a memory reference to
native_flush_tlb_global without this blows up on KASAN kernels.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/kernel/Makefile | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index b1b78ffe01d0..b7cd5bdf314b 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -19,6 +19,7 @@ endif
 KASAN_SANITIZE_head$(BITS).o := n
 KASAN_SANITIZE_dumpstack.o := n
 KASAN_SANITIZE_dumpstack_$(BITS).o := n
+KASAN_SANITIZE_paravirt.o := n
 
 CFLAGS_irq.o := -I$(src)/../include/asm/trace
 
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [RFC 02/13] x86/mm: Add INVPCID helpers
  2016-01-08 23:15 [RFC 00/13] x86/mm: PCID and INVPCID Andy Lutomirski
  2016-01-08 23:15 ` [RFC 01/13] x86/paravirt: Turn KASAN off for parvirt.o Andy Lutomirski
@ 2016-01-08 23:15 ` Andy Lutomirski
  2016-01-08 23:15 ` [RFC 03/13] x86/mm: Add a noinvpcid option to turn off INVPCID Andy Lutomirski
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 43+ messages in thread
From: Andy Lutomirski @ 2016-01-08 23:15 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Borislav Petkov, Brian Gerst, Dave Hansen, Linus Torvalds,
	Oleg Nesterov, linux-mm, Andy Lutomirski

This adds helpers for each of the four currently-specified INVPCID
modes.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/tlbflush.h | 41 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 6df2029405a3..20fc38d8478a 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -7,6 +7,47 @@
 #include <asm/processor.h>
 #include <asm/special_insns.h>
 
+static inline void __invpcid(unsigned long pcid, unsigned long addr,
+			     unsigned long type)
+{
+	u64 desc[2] = { pcid, addr };
+
+	/*
+	 * The memory clobber is because the whole point is to invalidate
+	 * stale TLB entries and, especially if we're flushing global
+	 * mappings, we don't want the compiler to reorder any subsequent
+	 * memory accesses before the TLB flush.
+	 */
+	asm volatile (
+		".byte 0x66, 0x0f, 0x38, 0x82, 0x01"	/* invpcid (%cx), %ax */
+		: : "m" (desc), "a" (type), "c" (desc) : "memory");
+}
+
+/* Flush all mappings for a given pcid and addr, not including globals. */
+static inline void invpcid_flush_one(unsigned long pcid,
+				     unsigned long addr)
+{
+	__invpcid(pcid, addr, 0);
+}
+
+/* Flush all mappings for a given PCID, not including globals. */
+static inline void invpcid_flush_single_context(unsigned long pcid)
+{
+	__invpcid(pcid, 0, 1);
+}
+
+/* Flush all mappings, including globals, for all PCIDs. */
+static inline void invpcid_flush_everything(void)
+{
+	__invpcid(0, 0, 2);
+}
+
+/* Flush all mappings for all PCIDs except globals. */
+static inline void invpcid_flush_all_nonglobals(void)
+{
+	__invpcid(0, 0, 3);
+}
+
 #ifdef CONFIG_PARAVIRT
 #include <asm/paravirt.h>
 #else
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [RFC 03/13] x86/mm: Add a noinvpcid option to turn off INVPCID
  2016-01-08 23:15 [RFC 00/13] x86/mm: PCID and INVPCID Andy Lutomirski
  2016-01-08 23:15 ` [RFC 01/13] x86/paravirt: Turn KASAN off for parvirt.o Andy Lutomirski
  2016-01-08 23:15 ` [RFC 02/13] x86/mm: Add INVPCID helpers Andy Lutomirski
@ 2016-01-08 23:15 ` Andy Lutomirski
  2016-01-08 23:15 ` [RFC 04/13] x86/mm: If INVPCID is available, use it to flush global mappings Andy Lutomirski
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 43+ messages in thread
From: Andy Lutomirski @ 2016-01-08 23:15 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Borislav Petkov, Brian Gerst, Dave Hansen, Linus Torvalds,
	Oleg Nesterov, linux-mm, Andy Lutomirski

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 Documentation/kernel-parameters.txt |  2 ++
 arch/x86/kernel/cpu/common.c        | 16 ++++++++++++++++
 2 files changed, 18 insertions(+)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 742f69d18fc8..b34e55e00bae 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2508,6 +2508,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 
 	nointroute	[IA-64]
 
+	noinvpcid	[X86] Disable the INVPCID cpu feature.
+
 	nojitter	[IA-64] Disables jitter checking for ITC timers.
 
 	no-kvmclock	[X86,KVM] Disable paravirtualized KVM clock driver
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index c2b7522cbf35..48196980c1c7 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -162,6 +162,22 @@ static int __init x86_mpx_setup(char *s)
 }
 __setup("nompx", x86_mpx_setup);
 
+static int __init x86_noinvpcid_setup(char *s)
+{
+	/* noinvpcid doesn't accept parameters */
+	if (s)
+		return -EINVAL;
+
+	/* do not emit a message if the feature is not present */
+	if (!boot_cpu_has(X86_FEATURE_INVPCID))
+		return 0;
+
+	setup_clear_cpu_cap(X86_FEATURE_INVPCID);
+	pr_info("noinvpcid: INVPCID feature disabled\n");
+	return 0;
+}
+early_param("noinvpcid", x86_noinvpcid_setup);
+
 #ifdef CONFIG_X86_32
 static int cachesize_override = -1;
 static int disable_x86_serial_nr = 1;
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [RFC 04/13] x86/mm: If INVPCID is available, use it to flush global mappings
  2016-01-08 23:15 [RFC 00/13] x86/mm: PCID and INVPCID Andy Lutomirski
                   ` (2 preceding siblings ...)
  2016-01-08 23:15 ` [RFC 03/13] x86/mm: Add a noinvpcid option to turn off INVPCID Andy Lutomirski
@ 2016-01-08 23:15 ` Andy Lutomirski
  2016-01-08 23:15 ` [RFC 05/13] x86/mm: Add barriers and document switch_mm-vs-flush synchronization Andy Lutomirski
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 43+ messages in thread
From: Andy Lutomirski @ 2016-01-08 23:15 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Borislav Petkov, Brian Gerst, Dave Hansen, Linus Torvalds,
	Oleg Nesterov, linux-mm, Andy Lutomirski

On my Skylake laptop, INVPCID function 2 (flush absolutely
everything) takes about 376ns, whereas saving flags, twiddling
CR4.PGE to flush global mappings, and restoring flags takes about
539ns.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/tlbflush.h | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 20fc38d8478a..4eba5164430d 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -145,6 +145,15 @@ static inline void __native_flush_tlb_global(void)
 {
 	unsigned long flags;
 
+	if (static_cpu_has_safe(X86_FEATURE_INVPCID)) {
+		/*
+		 * Using INVPCID is considerably faster than a pair of writes
+		 * to CR4 sandwiched inside an IRQ flag save/restore.
+		 */
+		invpcid_flush_everything();
+		return;
+	}
+
 	/*
 	 * Read-modify-write to CR4 - protect it from preemption and
 	 * from interrupts. (Use the raw variant because this code can
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [RFC 05/13] x86/mm: Add barriers and document switch_mm-vs-flush synchronization
  2016-01-08 23:15 [RFC 00/13] x86/mm: PCID and INVPCID Andy Lutomirski
                   ` (3 preceding siblings ...)
  2016-01-08 23:15 ` [RFC 04/13] x86/mm: If INVPCID is available, use it to flush global mappings Andy Lutomirski
@ 2016-01-08 23:15 ` Andy Lutomirski
  2016-06-03 17:42   ` Nadav Amit
  2016-09-06  1:22   ` Wanpeng Li
  2016-01-08 23:15 ` [RFC 06/13] x86/mm: Disable PCID on 32-bit kernels Andy Lutomirski
                   ` (8 subsequent siblings)
  13 siblings, 2 replies; 43+ messages in thread
From: Andy Lutomirski @ 2016-01-08 23:15 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Borislav Petkov, Brian Gerst, Dave Hansen, Linus Torvalds,
	Oleg Nesterov, linux-mm, Andy Lutomirski, stable

When switch_mm activates a new pgd, it also sets a bit that tells
other CPUs that the pgd is in use so that tlb flush IPIs will be
sent.  In order for that to work correctly, the bit needs to be
visible prior to loading the pgd and therefore starting to fill the
local TLB.

Document all the barriers that make this work correctly and add a
couple that were missing.

Cc: stable@vger.kernel.org
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/mmu_context.h | 33 ++++++++++++++++++++++++++++++++-
 arch/x86/mm/tlb.c                  | 29 ++++++++++++++++++++++++++---
 2 files changed, 58 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 379cd3658799..1edc9cd198b8 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -116,8 +116,34 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 #endif
 		cpumask_set_cpu(cpu, mm_cpumask(next));
 
-		/* Re-load page tables */
+		/*
+		 * Re-load page tables.
+		 *
+		 * This logic has an ordering constraint:
+		 *
+		 *  CPU 0: Write to a PTE for 'next'
+		 *  CPU 0: load bit 1 in mm_cpumask.  if nonzero, send IPI.
+		 *  CPU 1: set bit 1 in next's mm_cpumask
+		 *  CPU 1: load from the PTE that CPU 0 writes (implicit)
+		 *
+		 * We need to prevent an outcome in which CPU 1 observes
+		 * the new PTE value and CPU 0 observes bit 1 clear in
+		 * mm_cpumask.  (If that occurs, then the IPI will never
+		 * be sent, and CPU 0's TLB will contain a stale entry.)
+		 *
+		 * The bad outcome can occur if either CPU's load is
+		 * reordered before that CPU's store, so both CPUs much
+		 * execute full barriers to prevent this from happening.
+		 *
+		 * Thus, switch_mm needs a full barrier between the
+		 * store to mm_cpumask and any operation that could load
+		 * from next->pgd.  This barrier synchronizes with
+		 * remote TLB flushers.  Fortunately, load_cr3 is
+		 * serializing and thus acts as a full barrier.
+		 *
+		 */
 		load_cr3(next->pgd);
+
 		trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
 
 		/* Stop flush ipis for the previous mm */
@@ -156,10 +182,15 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 			 * schedule, protecting us from simultaneous changes.
 			 */
 			cpumask_set_cpu(cpu, mm_cpumask(next));
+
 			/*
 			 * We were in lazy tlb mode and leave_mm disabled
 			 * tlb flush IPI delivery. We must reload CR3
 			 * to make sure to use no freed page tables.
+			 *
+			 * As above, this is a barrier that forces
+			 * TLB repopulation to be ordered after the
+			 * store to mm_cpumask.
 			 */
 			load_cr3(next->pgd);
 			trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 8ddb5d0d66fb..8f4cc3dfac32 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -161,7 +161,10 @@ void flush_tlb_current_task(void)
 	preempt_disable();
 
 	count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
+
+	/* This is an implicit full barrier that synchronizes with switch_mm. */
 	local_flush_tlb();
+
 	trace_tlb_flush(TLB_LOCAL_SHOOTDOWN, TLB_FLUSH_ALL);
 	if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids)
 		flush_tlb_others(mm_cpumask(mm), mm, 0UL, TLB_FLUSH_ALL);
@@ -188,17 +191,29 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 	unsigned long base_pages_to_flush = TLB_FLUSH_ALL;
 
 	preempt_disable();
-	if (current->active_mm != mm)
+	if (current->active_mm != mm) {
+		/* Synchronize with switch_mm. */
+		smp_mb();
+
 		goto out;
+	}
 
 	if (!current->mm) {
 		leave_mm(smp_processor_id());
+
+		/* Synchronize with switch_mm. */
+		smp_mb();
+
 		goto out;
 	}
 
 	if ((end != TLB_FLUSH_ALL) && !(vmflag & VM_HUGETLB))
 		base_pages_to_flush = (end - start) >> PAGE_SHIFT;
 
+	/*
+	 * Both branches below are implicit full barriers (MOV to CR or
+	 * INVLPG) that synchronize with switch_mm.
+	 */
 	if (base_pages_to_flush > tlb_single_page_flush_ceiling) {
 		base_pages_to_flush = TLB_FLUSH_ALL;
 		count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
@@ -228,10 +243,18 @@ void flush_tlb_page(struct vm_area_struct *vma, unsigned long start)
 	preempt_disable();
 
 	if (current->active_mm == mm) {
-		if (current->mm)
+		if (current->mm) {
+			/*
+			 * Implicit full barrier (INVLPG) that synchronizes
+			 * with switch_mm.
+			 */
 			__flush_tlb_one(start);
-		else
+		} else {
 			leave_mm(smp_processor_id());
+
+			/* Synchronize with switch_mm. */
+			smp_mb();
+		}
 	}
 
 	if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids)
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [RFC 06/13] x86/mm: Disable PCID on 32-bit kernels
  2016-01-08 23:15 [RFC 00/13] x86/mm: PCID and INVPCID Andy Lutomirski
                   ` (4 preceding siblings ...)
  2016-01-08 23:15 ` [RFC 05/13] x86/mm: Add barriers and document switch_mm-vs-flush synchronization Andy Lutomirski
@ 2016-01-08 23:15 ` Andy Lutomirski
  2016-01-08 23:15 ` [RFC 07/13] x86/mm: Add nopcid to turn off PCID Andy Lutomirski
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 43+ messages in thread
From: Andy Lutomirski @ 2016-01-08 23:15 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Borislav Petkov, Brian Gerst, Dave Hansen, Linus Torvalds,
	Oleg Nesterov, linux-mm, Andy Lutomirski

32-bit kernels on new hardware will see PCID in CPUID, but PCID can
only be used in 64-bit mode.  Rather than making all PCID code
conditional, just disable the feature on 32-bit builds.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/disabled-features.h | 4 +++-
 arch/x86/kernel/cpu/bugs.c               | 6 ++++++
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index f226df064660..8b17c2ad1048 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -21,11 +21,13 @@
 # define DISABLE_K6_MTRR	(1<<(X86_FEATURE_K6_MTRR & 31))
 # define DISABLE_CYRIX_ARR	(1<<(X86_FEATURE_CYRIX_ARR & 31))
 # define DISABLE_CENTAUR_MCR	(1<<(X86_FEATURE_CENTAUR_MCR & 31))
+# define DISABLE_PCID		0
 #else
 # define DISABLE_VME		0
 # define DISABLE_K6_MTRR	0
 # define DISABLE_CYRIX_ARR	0
 # define DISABLE_CENTAUR_MCR	0
+# define DISABLE_PCID		(1<<(X86_FEATURE_PCID & 31))
 #endif /* CONFIG_X86_64 */
 
 /*
@@ -35,7 +37,7 @@
 #define DISABLED_MASK1	0
 #define DISABLED_MASK2	0
 #define DISABLED_MASK3	(DISABLE_CYRIX_ARR|DISABLE_CENTAUR_MCR|DISABLE_K6_MTRR)
-#define DISABLED_MASK4	0
+#define DISABLED_MASK4	(DISABLE_PCID)
 #define DISABLED_MASK5	0
 #define DISABLED_MASK6	0
 #define DISABLED_MASK7	0
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index bd17db15a2c1..741d107e5376 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -19,6 +19,12 @@
 
 void __init check_bugs(void)
 {
+	/*
+	 * Regardless of whether PCID is enumerated, it can't be used in
+	 * 32-bit mode.
+	 */
+	setup_clear_cpu_cap(X86_FEATURE_PCID);
+
 	identify_boot_cpu();
 #ifndef CONFIG_SMP
 	pr_info("CPU: ");
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [RFC 07/13] x86/mm: Add nopcid to turn off PCID
  2016-01-08 23:15 [RFC 00/13] x86/mm: PCID and INVPCID Andy Lutomirski
                   ` (5 preceding siblings ...)
  2016-01-08 23:15 ` [RFC 06/13] x86/mm: Disable PCID on 32-bit kernels Andy Lutomirski
@ 2016-01-08 23:15 ` Andy Lutomirski
  2016-01-08 23:15 ` [RFC 08/13] x86/mm: Teach CR3 readers about PCID Andy Lutomirski
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 43+ messages in thread
From: Andy Lutomirski @ 2016-01-08 23:15 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Borislav Petkov, Brian Gerst, Dave Hansen, Linus Torvalds,
	Oleg Nesterov, linux-mm, Andy Lutomirski

The parameter is only present on x86_64 systems to save a few bytes,
as PCID is always disabled on x86_32.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 Documentation/kernel-parameters.txt |  2 ++
 arch/x86/kernel/cpu/common.c        | 18 ++++++++++++++++++
 2 files changed, 20 insertions(+)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index b34e55e00bae..2f61809706c5 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2544,6 +2544,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 	nopat		[X86] Disable PAT (page attribute table extension of
 			pagetables) support.
 
+	nopcid		[X86-64] Disable the PCID cpu feature.
+
 	norandmaps	Don't use address space randomization.  Equivalent to
 			echo 0 > /proc/sys/kernel/randomize_va_space
 
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 48196980c1c7..7e1fc53a4ba5 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -162,6 +162,24 @@ static int __init x86_mpx_setup(char *s)
 }
 __setup("nompx", x86_mpx_setup);
 
+#ifdef CONFIG_X86_64
+static int __init x86_pcid_setup(char *s)
+{
+	/* require an exact match without trailing characters */
+	if (strlen(s))
+		return 0;
+
+	/* do not emit a message if the feature is not present */
+	if (!boot_cpu_has(X86_FEATURE_PCID))
+		return 1;
+
+	setup_clear_cpu_cap(X86_FEATURE_PCID);
+	pr_info("nopcid: PCID feature disabled\n");
+	return 1;
+}
+__setup("nopcid", x86_pcid_setup);
+#endif
+
 static int __init x86_noinvpcid_setup(char *s)
 {
 	/* noinvpcid doesn't accept parameters */
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [RFC 08/13] x86/mm: Teach CR3 readers about PCID
  2016-01-08 23:15 [RFC 00/13] x86/mm: PCID and INVPCID Andy Lutomirski
                   ` (6 preceding siblings ...)
  2016-01-08 23:15 ` [RFC 07/13] x86/mm: Add nopcid to turn off PCID Andy Lutomirski
@ 2016-01-08 23:15 ` Andy Lutomirski
  2016-01-08 23:15 ` [RFC 09/13] x86/mm: Disable interrupts when flushing the TLB using CR3 Andy Lutomirski
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 43+ messages in thread
From: Andy Lutomirski @ 2016-01-08 23:15 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Borislav Petkov, Brian Gerst, Dave Hansen, Linus Torvalds,
	Oleg Nesterov, linux-mm, Andy Lutomirski

The kernel has several code paths that read CR3.  Most of them assume that
CR3 contains the PGD's physical address, whereas some of them awkwardly
use PHYSICAL_PAGE_MASK to mask off low bits.

Add explicit mask macros for CR3 and convert all of the CR3 readers.
This will keep them from breaking when PCID is enabled.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/tlbflush.h | 8 ++++++++
 arch/x86/kernel/head64.c        | 3 ++-
 arch/x86/mm/fault.c             | 8 ++++----
 3 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 4eba5164430d..3d905f12cda9 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -48,6 +48,14 @@ static inline void invpcid_flush_all_nonglobals(void)
 	__invpcid(0, 0, 3);
 }
 
+#ifdef CONFIG_X86_64
+#define CR3_ADDR_MASK 0x7FFFFFFFFFFFF000ull
+#define CR3_PCID_MASK 0xFFFull
+#else
+#define CR3_ADDR_MASK 0xFFFFF000ull
+#define CR3_PCID_MASK 0ull
+#endif
+
 #ifdef CONFIG_PARAVIRT
 #include <asm/paravirt.h>
 #else
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index f129a9af6357..3d075ac01a47 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -60,7 +60,8 @@ int __init early_make_pgtable(unsigned long address)
 	pmdval_t pmd, *pmd_p;
 
 	/* Invalid address or early pgt is done ?  */
-	if (physaddr >= MAXMEM || read_cr3() != __pa_nodebug(early_level4_pgt))
+	if (physaddr >= MAXMEM ||
+	    (read_cr3() & CR3_ADDR_MASK) != __pa_nodebug(early_level4_pgt))
 		return -1;
 
 again:
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index eef44d9a3f77..9ceae2dc9be1 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -282,7 +282,7 @@ static noinline int vmalloc_fault(unsigned long address)
 	 * Do _not_ use "current" here. We might be inside
 	 * an interrupt in the middle of a task switch..
 	 */
-	pgd_paddr = read_cr3();
+	pgd_paddr = read_cr3() & CR3_ADDR_MASK;
 	pmd_k = vmalloc_sync_one(__va(pgd_paddr), address);
 	if (!pmd_k)
 		return -1;
@@ -321,7 +321,7 @@ static bool low_pfn(unsigned long pfn)
 
 static void dump_pagetable(unsigned long address)
 {
-	pgd_t *base = __va(read_cr3());
+	pgd_t *base = __va(read_cr3() & CR3_ADDR_MASK);
 	pgd_t *pgd = &base[pgd_index(address)];
 	pmd_t *pmd;
 	pte_t *pte;
@@ -459,7 +459,7 @@ static int bad_address(void *p)
 
 static void dump_pagetable(unsigned long address)
 {
-	pgd_t *base = __va(read_cr3() & PHYSICAL_PAGE_MASK);
+	pgd_t *base = __va(read_cr3() & CR3_ADDR_MASK);
 	pgd_t *pgd = base + pgd_index(address);
 	pud_t *pud;
 	pmd_t *pmd;
@@ -595,7 +595,7 @@ show_fault_oops(struct pt_regs *regs, unsigned long error_code,
 		pgd_t *pgd;
 		pte_t *pte;
 
-		pgd = __va(read_cr3() & PHYSICAL_PAGE_MASK);
+		pgd = __va(read_cr3() & CR3_ADDR_MASK);
 		pgd += pgd_index(address);
 
 		pte = lookup_address_in_pgd(pgd, address, &level);
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [RFC 09/13] x86/mm: Disable interrupts when flushing the TLB using CR3
  2016-01-08 23:15 [RFC 00/13] x86/mm: PCID and INVPCID Andy Lutomirski
                   ` (7 preceding siblings ...)
  2016-01-08 23:15 ` [RFC 08/13] x86/mm: Teach CR3 readers about PCID Andy Lutomirski
@ 2016-01-08 23:15 ` Andy Lutomirski
  2016-01-08 23:41   ` Linus Torvalds
  2016-01-08 23:15 ` [RFC 10/13] x86/mm: Factor out remote TLB flushing Andy Lutomirski
                   ` (4 subsequent siblings)
  13 siblings, 1 reply; 43+ messages in thread
From: Andy Lutomirski @ 2016-01-08 23:15 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Borislav Petkov, Brian Gerst, Dave Hansen, Linus Torvalds,
	Oleg Nesterov, linux-mm, Andy Lutomirski

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/tlbflush.h | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 3d905f12cda9..32e3d8769a22 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -135,7 +135,17 @@ static inline void cr4_set_bits_and_update_boot(unsigned long mask)
 
 static inline void __native_flush_tlb(void)
 {
+	unsigned long flags;
+
+	/*
+	 * We mustn't be preempted or handle an IPI while reading and
+	 * writing CR3.  Preemption could switch mms and switch back, and
+	 * an IPI could call leave_mm.  Either of those could cause our
+	 * PCID to change asynchronously.
+	 */
+	raw_local_irq_save(flags);
 	native_write_cr3(native_read_cr3());
+	raw_local_irq_restore(flags);
 }
 
 static inline void __native_flush_tlb_global_irq_disabled(void)
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [RFC 10/13] x86/mm: Factor out remote TLB flushing
  2016-01-08 23:15 [RFC 00/13] x86/mm: PCID and INVPCID Andy Lutomirski
                   ` (8 preceding siblings ...)
  2016-01-08 23:15 ` [RFC 09/13] x86/mm: Disable interrupts when flushing the TLB using CR3 Andy Lutomirski
@ 2016-01-08 23:15 ` Andy Lutomirski
  2016-01-08 23:15 ` [RFC 11/13] x86/mm: Build arch/x86/mm/tlb.c even on !SMP Andy Lutomirski
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 43+ messages in thread
From: Andy Lutomirski @ 2016-01-08 23:15 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Borislav Petkov, Brian Gerst, Dave Hansen, Linus Torvalds,
	Oleg Nesterov, linux-mm, Andy Lutomirski

There are three call sites that propagate TLB flushes, and they all
do exactly the same thing.  Factor the code out into a helper.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/mm/tlb.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 8f4cc3dfac32..b208a33571b0 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -154,6 +154,14 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 	smp_call_function_many(cpumask, flush_tlb_func, &info, 1);
 }
 
+static void propagate_tlb_flush(unsigned int this_cpu,
+				struct mm_struct *mm, unsigned long start,
+				unsigned long end)
+{
+	if (cpumask_any_but(mm_cpumask(mm), this_cpu) < nr_cpu_ids)
+		flush_tlb_others(mm_cpumask(mm), mm, 0UL, TLB_FLUSH_ALL);
+}
+
 void flush_tlb_current_task(void)
 {
 	struct mm_struct *mm = current->mm;
@@ -166,8 +174,7 @@ void flush_tlb_current_task(void)
 	local_flush_tlb();
 
 	trace_tlb_flush(TLB_LOCAL_SHOOTDOWN, TLB_FLUSH_ALL);
-	if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids)
-		flush_tlb_others(mm_cpumask(mm), mm, 0UL, TLB_FLUSH_ALL);
+	propagate_tlb_flush(smp_processor_id(), mm, 0UL, TLB_FLUSH_ALL);
 	preempt_enable();
 }
 
@@ -231,8 +238,7 @@ out:
 		start = 0UL;
 		end = TLB_FLUSH_ALL;
 	}
-	if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids)
-		flush_tlb_others(mm_cpumask(mm), mm, start, end);
+	propagate_tlb_flush(smp_processor_id(), mm, start, end);
 	preempt_enable();
 }
 
@@ -257,8 +263,7 @@ void flush_tlb_page(struct vm_area_struct *vma, unsigned long start)
 		}
 	}
 
-	if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids)
-		flush_tlb_others(mm_cpumask(mm), mm, start, 0UL);
+	propagate_tlb_flush(smp_processor_id(), mm, start, 0UL);
 
 	preempt_enable();
 }
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [RFC 11/13] x86/mm: Build arch/x86/mm/tlb.c even on !SMP
  2016-01-08 23:15 [RFC 00/13] x86/mm: PCID and INVPCID Andy Lutomirski
                   ` (9 preceding siblings ...)
  2016-01-08 23:15 ` [RFC 10/13] x86/mm: Factor out remote TLB flushing Andy Lutomirski
@ 2016-01-08 23:15 ` Andy Lutomirski
  2016-01-08 23:55   ` Dave Hansen
  2016-01-08 23:15 ` [RFC 12/13] x86/mm: Uninline switch_mm Andy Lutomirski
                   ` (2 subsequent siblings)
  13 siblings, 1 reply; 43+ messages in thread
From: Andy Lutomirski @ 2016-01-08 23:15 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Borislav Petkov, Brian Gerst, Dave Hansen, Linus Torvalds,
	Oleg Nesterov, linux-mm, Andy Lutomirski

Currently all of the functions that live in tlb.c are inlined on
!SMP builds.  One can debate whether this is a good idea (in many
respects the code in tlb.c is better than the inlined UP code).

Regardless, I want to add code that needs to be built on UP and SMP
kernels and relates to tlb flushing, so arrange for tlb.c to be
compiled unconditionally.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/mm/Makefile | 3 +--
 arch/x86/mm/tlb.c    | 4 ++++
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 65c47fda26fc..1ae7c141f778 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -1,5 +1,5 @@
 obj-y	:=  init.o init_$(BITS).o fault.o ioremap.o extable.o pageattr.o mmap.o \
-	    pat.o pgtable.o physaddr.o gup.o setup_nx.o
+	    pat.o pgtable.o physaddr.o gup.o setup_nx.o tlb.o
 
 # Make sure __phys_addr has no stackprotector
 nostackp := $(call cc-option, -fno-stack-protector)
@@ -9,7 +9,6 @@ CFLAGS_setup_nx.o		:= $(nostackp)
 CFLAGS_fault.o := -I$(src)/../include/asm/trace
 
 obj-$(CONFIG_X86_PAT)		+= pat_rbtree.o
-obj-$(CONFIG_SMP)		+= tlb.o
 
 obj-$(CONFIG_X86_32)		+= pgtable_32.o iomap_32.o
 
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index b208a33571b0..87fcc7a62e71 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -28,6 +28,8 @@
  *	Implement flush IPI by CALL_FUNCTION_VECTOR, Alex Shi
  */
 
+#ifdef CONFIG_SMP
+
 struct flush_tlb_info {
 	struct mm_struct *flush_mm;
 	unsigned long flush_start;
@@ -352,3 +354,5 @@ static int __init create_tlb_single_page_flush_ceiling(void)
 	return 0;
 }
 late_initcall(create_tlb_single_page_flush_ceiling);
+
+#endif /* CONFIG_SMP */
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [RFC 12/13] x86/mm: Uninline switch_mm
  2016-01-08 23:15 [RFC 00/13] x86/mm: PCID and INVPCID Andy Lutomirski
                   ` (10 preceding siblings ...)
  2016-01-08 23:15 ` [RFC 11/13] x86/mm: Build arch/x86/mm/tlb.c even on !SMP Andy Lutomirski
@ 2016-01-08 23:15 ` Andy Lutomirski
  2016-01-08 23:15 ` [RFC 13/13] x86/mm: Try to preserve old TLB entries using PCID Andy Lutomirski
  2016-01-08 23:31 ` [RFC 00/13] x86/mm: PCID and INVPCID Linus Torvalds
  13 siblings, 0 replies; 43+ messages in thread
From: Andy Lutomirski @ 2016-01-08 23:15 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Borislav Petkov, Brian Gerst, Dave Hansen, Linus Torvalds,
	Oleg Nesterov, linux-mm, Andy Lutomirski

It's fairly large and it has quite a few callers.  This may also
help untangle some headers down the road.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/mmu_context.h | 93 +-----------------------------------
 arch/x86/mm/tlb.c                  | 97 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 99 insertions(+), 91 deletions(-)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 1edc9cd198b8..05c4d0ab64bb 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -104,102 +104,13 @@ static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
 #endif
 }
 
-static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
-			     struct task_struct *tsk)
-{
-	unsigned cpu = smp_processor_id();
-
-	if (likely(prev != next)) {
-#ifdef CONFIG_SMP
-		this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
-		this_cpu_write(cpu_tlbstate.active_mm, next);
-#endif
-		cpumask_set_cpu(cpu, mm_cpumask(next));
+extern void switch_mm(struct mm_struct *prev, struct mm_struct *next,
+		      struct task_struct *tsk);
 
-		/*
-		 * Re-load page tables.
-		 *
-		 * This logic has an ordering constraint:
-		 *
-		 *  CPU 0: Write to a PTE for 'next'
-		 *  CPU 0: load bit 1 in mm_cpumask.  if nonzero, send IPI.
-		 *  CPU 1: set bit 1 in next's mm_cpumask
-		 *  CPU 1: load from the PTE that CPU 0 writes (implicit)
-		 *
-		 * We need to prevent an outcome in which CPU 1 observes
-		 * the new PTE value and CPU 0 observes bit 1 clear in
-		 * mm_cpumask.  (If that occurs, then the IPI will never
-		 * be sent, and CPU 0's TLB will contain a stale entry.)
-		 *
-		 * The bad outcome can occur if either CPU's load is
-		 * reordered before that CPU's store, so both CPUs much
-		 * execute full barriers to prevent this from happening.
-		 *
-		 * Thus, switch_mm needs a full barrier between the
-		 * store to mm_cpumask and any operation that could load
-		 * from next->pgd.  This barrier synchronizes with
-		 * remote TLB flushers.  Fortunately, load_cr3 is
-		 * serializing and thus acts as a full barrier.
-		 *
-		 */
-		load_cr3(next->pgd);
 
-		trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
 
-		/* Stop flush ipis for the previous mm */
-		cpumask_clear_cpu(cpu, mm_cpumask(prev));
 
-		/* Load per-mm CR4 state */
-		load_mm_cr4(next);
 
-#ifdef CONFIG_MODIFY_LDT_SYSCALL
-		/*
-		 * Load the LDT, if the LDT is different.
-		 *
-		 * It's possible that prev->context.ldt doesn't match
-		 * the LDT register.  This can happen if leave_mm(prev)
-		 * was called and then modify_ldt changed
-		 * prev->context.ldt but suppressed an IPI to this CPU.
-		 * In this case, prev->context.ldt != NULL, because we
-		 * never set context.ldt to NULL while the mm still
-		 * exists.  That means that next->context.ldt !=
-		 * prev->context.ldt, because mms never share an LDT.
-		 */
-		if (unlikely(prev->context.ldt != next->context.ldt))
-			load_mm_ldt(next);
-#endif
-	}
-#ifdef CONFIG_SMP
-	  else {
-		this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
-		BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next);
-
-		if (!cpumask_test_cpu(cpu, mm_cpumask(next))) {
-			/*
-			 * On established mms, the mm_cpumask is only changed
-			 * from irq context, from ptep_clear_flush() while in
-			 * lazy tlb mode, and here. Irqs are blocked during
-			 * schedule, protecting us from simultaneous changes.
-			 */
-			cpumask_set_cpu(cpu, mm_cpumask(next));
-
-			/*
-			 * We were in lazy tlb mode and leave_mm disabled
-			 * tlb flush IPI delivery. We must reload CR3
-			 * to make sure to use no freed page tables.
-			 *
-			 * As above, this is a barrier that forces
-			 * TLB repopulation to be ordered after the
-			 * store to mm_cpumask.
-			 */
-			load_cr3(next->pgd);
-			trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
-			load_mm_cr4(next);
-			load_mm_ldt(next);
-		}
-	}
-#endif
-}
 
 #define activate_mm(prev, next)			\
 do {						\
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 87fcc7a62e71..9790c9338e52 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -59,6 +59,103 @@ void leave_mm(int cpu)
 }
 EXPORT_SYMBOL_GPL(leave_mm);
 
+void switch_mm(struct mm_struct *prev, struct mm_struct *next,
+	       struct task_struct *tsk)
+{
+	unsigned cpu = smp_processor_id();
+
+	if (likely(prev != next)) {
+#ifdef CONFIG_SMP
+		this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
+		this_cpu_write(cpu_tlbstate.active_mm, next);
+#endif
+		cpumask_set_cpu(cpu, mm_cpumask(next));
+
+		/*
+		 * Re-load page tables.
+		 *
+		 * This logic has an ordering constraint:
+		 *
+		 *  CPU 0: Write to a PTE for 'next'
+		 *  CPU 0: load bit 1 in mm_cpumask.  if nonzero, send IPI.
+		 *  CPU 1: set bit 1 in next's mm_cpumask
+		 *  CPU 1: load from the PTE that CPU 0 writes (implicit)
+		 *
+		 * We need to prevent an outcome in which CPU 1 observes
+		 * the new PTE value and CPU 0 observes bit 1 clear in
+		 * mm_cpumask.  (If that occurs, then the IPI will never
+		 * be sent, and CPU 0's TLB will contain a stale entry.)
+		 *
+		 * The bad outcome can occur if either CPU's load is
+		 * reordered before that CPU's store, so both CPUs much
+		 * execute full barriers to prevent this from happening.
+		 *
+		 * Thus, switch_mm needs a full barrier between the
+		 * store to mm_cpumask and any operation that could load
+		 * from next->pgd.  This barrier synchronizes with
+		 * remote TLB flushers.  Fortunately, load_cr3 is
+		 * serializing and thus acts as a full barrier.
+		 *
+		 */
+		load_cr3(next->pgd);
+
+		trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
+
+		/* Stop flush ipis for the previous mm */
+		cpumask_clear_cpu(cpu, mm_cpumask(prev));
+
+		/* Load per-mm CR4 state */
+		load_mm_cr4(next);
+
+#ifdef CONFIG_MODIFY_LDT_SYSCALL
+		/*
+		 * Load the LDT, if the LDT is different.
+		 *
+		 * It's possible that prev->context.ldt doesn't match
+		 * the LDT register.  This can happen if leave_mm(prev)
+		 * was called and then modify_ldt changed
+		 * prev->context.ldt but suppressed an IPI to this CPU.
+		 * In this case, prev->context.ldt != NULL, because we
+		 * never set context.ldt to NULL while the mm still
+		 * exists.  That means that next->context.ldt !=
+		 * prev->context.ldt, because mms never share an LDT.
+		 */
+		if (unlikely(prev->context.ldt != next->context.ldt))
+			load_mm_ldt(next);
+#endif
+	}
+#ifdef CONFIG_SMP
+	  else {
+		this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
+		BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next);
+
+		if (!cpumask_test_cpu(cpu, mm_cpumask(next))) {
+			/*
+			 * On established mms, the mm_cpumask is only changed
+			 * from irq context, from ptep_clear_flush() while in
+			 * lazy tlb mode, and here. Irqs are blocked during
+			 * schedule, protecting us from simultaneous changes.
+			 */
+			cpumask_set_cpu(cpu, mm_cpumask(next));
+
+			/*
+			 * We were in lazy tlb mode and leave_mm disabled
+			 * tlb flush IPI delivery. We must reload CR3
+			 * to make sure to use no freed page tables.
+			 *
+			 * As above, this is a barrier that forces
+			 * TLB repopulation to be ordered after the
+			 * store to mm_cpumask.
+			 */
+			load_cr3(next->pgd);
+			trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
+			load_mm_cr4(next);
+			load_mm_ldt(next);
+		}
+	}
+#endif
+}
+
 /*
  * The flush IPI assumes that a thread switch happens in this order:
  * [cpu0: the cpu that switches]
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [RFC 13/13] x86/mm: Try to preserve old TLB entries using PCID
  2016-01-08 23:15 [RFC 00/13] x86/mm: PCID and INVPCID Andy Lutomirski
                   ` (11 preceding siblings ...)
  2016-01-08 23:15 ` [RFC 12/13] x86/mm: Uninline switch_mm Andy Lutomirski
@ 2016-01-08 23:15 ` Andy Lutomirski
  2016-01-09  0:27   ` Dave Hansen
  2016-01-08 23:31 ` [RFC 00/13] x86/mm: PCID and INVPCID Linus Torvalds
  13 siblings, 1 reply; 43+ messages in thread
From: Andy Lutomirski @ 2016-01-08 23:15 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Borislav Petkov, Brian Gerst, Dave Hansen, Linus Torvalds,
	Oleg Nesterov, linux-mm, Andy Lutomirski

PCID is a "process context ID" -- it's what other architectures call
an address space ID.  Every non-global TLB entry is tagged with a
PCID, only TLB entries that match the currently selected PCID are
used, and we can switch PGDs without flushing the TLB.  x86's
PCID is 12 bits.

This is an unorthodox approach to using PCID.  x86's PCID is far too
short to uniquely identify a process, and we can't even really
uniquely identify a running process because there are monster
systems with over 4096 CPUs.  To make matters worse, past attempts
to use all 12 PCID bits have resulted in slowdowns instead of
speedups.

This patch uses PCID differently.  We use a PCID to identify a
recently-used mm on a per-cpu basis.  An mm has no fixed PCID
binding at all; instead, we give it a fresh PCID each time it's
loaded except in cases where we want to preserve the TLB, in which
case we reuse a recent value.

In particular, we use PCIDs 1-7 for recently-used mms and we reserve
PCID 0 for swapper_pg_dir and for PCID-unaware CR3 users (e.g. EFI).
Nothing ever switches to PCID 0 without flushing PCID 0 non-global
pages, so PCID 0 conflicts won't cause problems.

This also leaves the door open for UDEREF-style address space
switching: the kernel will use PCID 0, and exits could simply switch
back.  (As a practical matter, an in-tree implementation of that
feature would probably forego the full syscall fast path and just
invoke some or all of switch_mm in prepare_exit_to_usermode.)

This seems to save about 100ns on context switches between mms.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/mmu.h      |   7 +-
 arch/x86/include/asm/tlbflush.h |  18 ++++
 arch/x86/kernel/cpu/common.c    |   4 +
 arch/x86/kernel/ldt.c           |   2 +
 arch/x86/kernel/process_64.c    |   2 +
 arch/x86/mm/tlb.c               | 195 +++++++++++++++++++++++++++++++++++++---
 6 files changed, 213 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
index 55234d5e7160..adb958d41bde 100644
--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -5,10 +5,13 @@
 #include <linux/mutex.h>
 
 /*
- * The x86 doesn't have a mmu context, but
- * we put the segment information here.
+ * x86 has an MMU context if PCID is enabled, and x86 also has arch-specific
+ * MMU state beyond what lives in mm_struct.
  */
 typedef struct {
+	/* See arch/x86/mm/tlb.c for details. */
+	struct cpumask pcid_live_cpus;
+
 #ifdef CONFIG_MODIFY_LDT_SYSCALL
 	struct ldt_struct *ldt;
 #endif
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 32e3d8769a22..407c6f5dd4a6 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -56,6 +56,13 @@ static inline void invpcid_flush_all_nonglobals(void)
 #define CR3_PCID_MASK 0ull
 #endif
 
+#ifdef CONFIG_SMP
+extern void zap_old_pcids(void);
+#else
+/* Until PCID is implemented for !SMP, there's nothing to do. */
+static inline void zap_old_pcids(void) {}
+#endif
+
 #ifdef CONFIG_PARAVIRT
 #include <asm/paravirt.h>
 #else
@@ -195,6 +202,8 @@ static inline void __flush_tlb_all(void)
 		__flush_tlb_global();
 	else
 		__flush_tlb();
+
+	zap_old_pcids();
 }
 
 static inline void __flush_tlb_one(unsigned long addr)
@@ -238,6 +247,7 @@ static inline void flush_tlb_all(void)
 {
 	count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
 	__flush_tlb_all();
+	zap_old_pcids();
 }
 
 static inline void flush_tlb(void)
@@ -254,6 +264,8 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
 {
 	if (mm == current->active_mm)
 		__flush_tlb_up();
+	else
+		zap_old_pcids();
 }
 
 static inline void flush_tlb_page(struct vm_area_struct *vma,
@@ -261,6 +273,8 @@ static inline void flush_tlb_page(struct vm_area_struct *vma,
 {
 	if (vma->vm_mm == current->active_mm)
 		__flush_tlb_one(addr);
+	else
+		zap_old_pcids();
 }
 
 static inline void flush_tlb_range(struct vm_area_struct *vma,
@@ -268,6 +282,8 @@ static inline void flush_tlb_range(struct vm_area_struct *vma,
 {
 	if (vma->vm_mm == current->active_mm)
 		__flush_tlb_up();
+	else
+		zap_old_pcids();
 }
 
 static inline void flush_tlb_mm_range(struct mm_struct *mm,
@@ -275,6 +291,8 @@ static inline void flush_tlb_mm_range(struct mm_struct *mm,
 {
 	if (mm == current->active_mm)
 		__flush_tlb_up();
+	else
+		zap_old_pcids();
 }
 
 static inline void native_flush_tlb_others(const struct cpumask *cpumask,
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 7e1fc53a4ba5..00bdf5806566 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -953,6 +953,10 @@ static void identify_cpu(struct cpuinfo_x86 *c)
 	setup_smep(c);
 	setup_smap(c);
 
+	/* Enable PCID if available. */
+	if (cpu_has(c, X86_FEATURE_PCID))
+		cr4_set_bits(X86_CR4_PCIDE);
+
 	/*
 	 * The vendor-specific functions might have changed features.
 	 * Now we do "generic changes."
diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
index 6acc9dd91f36..3d73c0ddc773 100644
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -109,6 +109,8 @@ int init_new_context(struct task_struct *tsk, struct mm_struct *mm)
 	struct mm_struct *old_mm;
 	int retval = 0;
 
+	cpumask_clear(&mm->context.pcid_live_cpus);
+
 	mutex_init(&mm->context.lock);
 	old_mm = current->mm;
 	if (!old_mm) {
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index e835d263a33b..2cdb3ba715e6 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -578,6 +578,8 @@ long do_arch_prctl(struct task_struct *task, int code, unsigned long addr)
 		break;
 	}
 
+		__flush_tlb();
+		invpcid_flush_all_nonglobals();
 	default:
 		ret = -EINVAL;
 		break;
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 9790c9338e52..eb84240b8c92 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -28,6 +28,92 @@
  *	Implement flush IPI by CALL_FUNCTION_VECTOR, Alex Shi
  */
 
+/*
+ * An x86 PCID is what everyone else calls an ASID.  TLB entries on each
+ * CPU are tagged with a PCID, and the current CR3 value stores a PCID.
+ * Switching CR3 can change the PCID and can optionally flush all TLB
+ * entries associated with the new PCID.
+ *
+ * The PCID is 12 bits, but we don't use anywhere near that many.
+ *
+ * The guiding principle of this code is that TLB entries that have
+ * survived more than a small number of context switches are mostly
+ * useless, so we don't try very hard not to evict them.
+ *
+ * PCID 0 is used for swapper_pg_dir and for any other special PGDs that
+ * are loaded directly by PCID-naive users of load_cr3.  (This includes
+ * UEFI runtime services, etc.)  Note that we never switch to PCID 0 with
+ * the preserve flag set -- the only TLB entries with PCID == 0 that are
+ * worth preserving have the global flag set.
+ *
+ * PCIDs 1 through NR_PCIDS are used for real user mms.
+ */
+
+#define NR_DYNAMIC_PCIDS 7
+
+struct pcid_cpu_entry {
+	const struct mm_struct *mm;
+};
+
+struct pcid_cpu_state {
+	/* This entire data structure fits in one cache line. */
+	struct pcid_cpu_entry pcids[NR_DYNAMIC_PCIDS];	/* starts with PCID 1 */
+	unsigned long next_pcid_minus_two;
+};
+
+static DEFINE_PER_CPU_ALIGNED(struct pcid_cpu_state, pcid_cpu_state);
+
+/*
+ * Allocate and return a fresh PCID for mm on this CPU.
+ */
+static unsigned long allocate_pcid(const struct mm_struct *mm)
+{
+	unsigned long pcid;
+
+	pcid = this_cpu_add_return(pcid_cpu_state.next_pcid_minus_two, 1) + 1;
+	if (pcid > NR_DYNAMIC_PCIDS) {
+		pcid = 1;
+		this_cpu_write(pcid_cpu_state.next_pcid_minus_two, 0);
+	}
+
+	this_cpu_write(pcid_cpu_state.pcids[pcid-1].mm, mm);
+
+	/*
+	 * We don't bother setting our cpu's bit on pcid_live_cpus.  Any
+	 * remote CPU that needs to shoot down one of our TLB entries will
+	 * do it via IPI because we're all the way live.  We'll take care
+	 * of pcid_live_cpus when we remove ourselves from mm_cpumask.
+	 */
+	return pcid;
+}
+
+/*
+ * Finds the PCID for the given pgd on this cpu.  If that PCID was last
+ * used by this mm, the high bit will be set in the return value.  Otherwise
+ * we claim ownership of the PCID and return the PCID with the high bit
+ * clear.
+ *
+ * This function must not be called if pgd has never been loaded on this
+ * CPU.  Otherwise we might return a PCID allocated to a dead mm whose pgd
+ * page has been reused.
+ */
+static unsigned long choose_pcid(struct mm_struct *mm)
+{
+	unsigned long pcid;
+
+	if (!static_cpu_has(X86_FEATURE_PCID))
+		return 0;
+
+	if (cpumask_test_cpu(smp_processor_id(), &mm->context.pcid_live_cpus)) {
+		for (pcid = 0; pcid < NR_DYNAMIC_PCIDS; pcid++) {
+			if (this_cpu_read(pcid_cpu_state.pcids[pcid].mm) == mm)
+				return (pcid + 1) | (1UL << 63);
+		}
+	}
+
+	return allocate_pcid(mm);
+}
+
 #ifdef CONFIG_SMP
 
 struct flush_tlb_info {
@@ -37,6 +123,55 @@ struct flush_tlb_info {
 };
 
 /*
+ * This effectively invalidates non-global mappings belonging to non-current
+ * PCIDs on the calling CPU.  Rather than doing NR_DYNAMIC_PCIDS-1 INVPCIDs,
+ * it invalidates the mm-to-PCID mappings.
+ */
+void zap_old_pcids(void)
+{
+	struct mm_struct *active_mm;
+	int i;
+
+	if (!static_cpu_has(X86_FEATURE_PCID))
+		return;
+
+	active_mm = this_cpu_read(cpu_tlbstate.active_mm);
+
+	for (i = 0; i < NR_DYNAMIC_PCIDS; i++)
+		if (this_cpu_read(pcid_cpu_state.pcids[i].mm) != active_mm)
+			this_cpu_write(pcid_cpu_state.pcids[i].mm, NULL);
+}
+EXPORT_SYMBOL(zap_old_pcids);
+
+static void zap_local_inactive_mm(struct mm_struct *mm)
+{
+	int i;
+
+	if (!static_cpu_has(X86_FEATURE_PCID))
+		return;
+
+	for (i = 0; i < NR_DYNAMIC_PCIDS; i++) {
+		if (this_cpu_read(pcid_cpu_state.pcids[i].mm) == mm) {
+			this_cpu_write(pcid_cpu_state.pcids[i].mm, NULL);
+			return;
+		}
+	}
+}
+
+static void stop_tlbflush_ipis(int cpu, struct mm_struct *mm)
+{
+	/*
+	 * Stop flush ipis for the previous mm.  First mark us live in
+	 * the PCID cache.  We need our store to pcid_live_cpus to
+	 * happen before remote CPUs stop sending us IPIs; the barrier
+	 * here synchronizes with the barrier in flush_tlb_remote.
+	 */
+	cpumask_set_cpu(cpu, &mm->context.pcid_live_cpus);
+	smp_mb__before_atomic();
+	cpumask_clear_cpu(cpu, mm_cpumask(mm));
+}
+
+/*
  * We cannot call mmdrop() because we are in interrupt context,
  * instead update mm->cpu_vm_mask.
  */
@@ -46,7 +181,7 @@ void leave_mm(int cpu)
 	if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK)
 		BUG();
 	if (cpumask_test_cpu(cpu, mm_cpumask(active_mm))) {
-		cpumask_clear_cpu(cpu, mm_cpumask(active_mm));
+		stop_tlbflush_ipis(cpu, active_mm);
 		load_cr3(swapper_pg_dir);
 		/*
 		 * This gets called in the idle path where RCU
@@ -63,6 +198,7 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 	       struct task_struct *tsk)
 {
 	unsigned cpu = smp_processor_id();
+	unsigned long pcid;
 
 	if (likely(prev != next)) {
 #ifdef CONFIG_SMP
@@ -76,9 +212,6 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 		 *
 		 * This logic has an ordering constraint:
 		 *
-		 *  CPU 0: Write to a PTE for 'next'
-		 *  CPU 0: load bit 1 in mm_cpumask.  if nonzero, send IPI.
-		 *  CPU 1: set bit 1 in next's mm_cpumask
 		 *  CPU 1: load from the PTE that CPU 0 writes (implicit)
 		 *
 		 * We need to prevent an outcome in which CPU 1 observes
@@ -97,12 +230,14 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 		 * serializing and thus acts as a full barrier.
 		 *
 		 */
-		load_cr3(next->pgd);
 
-		trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
+		pcid = choose_pcid(next);
+		write_cr3(__pa(next->pgd) | pcid);
 
-		/* Stop flush ipis for the previous mm */
-		cpumask_clear_cpu(cpu, mm_cpumask(prev));
+		trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH,
+				(pcid & (1ULL << 63)) ? 0 : TLB_FLUSH_ALL);
+
+		stop_tlbflush_ipis(cpu, prev);
 
 		/* Load per-mm CR4 state */
 		load_mm_cr4(next);
@@ -146,9 +281,18 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 			 * As above, this is a barrier that forces
 			 * TLB repopulation to be ordered after the
 			 * store to mm_cpumask.
+
+			 *
+			 * XXX: speedup possibility: if we end up preserving
+			 * PCID data, then the write_cr3 is a no-op.
 			 */
-			load_cr3(next->pgd);
-			trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
+			pcid = choose_pcid(next);
+			write_cr3(__pa(next->pgd) | pcid);
+
+			trace_tlb_flush(
+				TLB_FLUSH_ON_TASK_SWITCH,
+				(pcid & (1ULL << 63)) ? 0 : TLB_FLUSH_ALL);
+;
 			load_mm_cr4(next);
 			load_mm_ldt(next);
 		}
@@ -203,8 +347,24 @@ static void flush_tlb_func(void *info)
 
 	inc_irq_stat(irq_tlb_count);
 
-	if (f->flush_mm != this_cpu_read(cpu_tlbstate.active_mm))
+	/*
+	 * After all relevant CPUs call flush_tlb_func, pcid_live_cpus
+	 * will be clear.  That's not enough, though: if the mm we're
+	 * targeting isn't active, we won't directly flush the TLB, but
+	 * we need to guarantee that the mm won't be reactivated and
+	 * therefore reinstate stale entries prior to cpumask_clear.
+	 *
+	 * Solve this problem by brute force: if we don't flush the TLB
+	 * directly, zap the PCID mapping.  (We zap it using
+	 * pcid_cpu_state instead of pcid_live_cpus to avoid excessive
+	 * cacheline bounding.)
+	 */
+
+	if (f->flush_mm != this_cpu_read(cpu_tlbstate.active_mm)) {
+		zap_local_inactive_mm(f->flush_mm);
 		return;
+	}
+
 	if (!f->flush_end)
 		f->flush_end = f->flush_start + PAGE_SIZE;
 
@@ -224,9 +384,10 @@ static void flush_tlb_func(void *info)
 			}
 			trace_tlb_flush(TLB_REMOTE_SHOOTDOWN, nr_pages);
 		}
-	} else
+	} else {
 		leave_mm(smp_processor_id());
-
+		zap_local_inactive_mm(f->flush_mm);
+	}
 }
 
 void native_flush_tlb_others(const struct cpumask *cpumask,
@@ -259,6 +420,14 @@ static void propagate_tlb_flush(unsigned int this_cpu,
 {
 	if (cpumask_any_but(mm_cpumask(mm), this_cpu) < nr_cpu_ids)
 		flush_tlb_others(mm_cpumask(mm), mm, 0UL, TLB_FLUSH_ALL);
+	/*
+	 * Synchronize with barrier in stop_tlbflush_ipis; cpumask_clear
+	 * must not be overridden by the pcid_live_cpus write in
+	 * stop_tlbflush_ipis unless we sent an IPI.
+	 */
+	smp_wmb();
+
+	cpumask_clear(&mm->context.pcid_live_cpus);
 }
 
 void flush_tlb_current_task(void)
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [RFC 00/13] x86/mm: PCID and INVPCID
  2016-01-08 23:15 [RFC 00/13] x86/mm: PCID and INVPCID Andy Lutomirski
                   ` (12 preceding siblings ...)
  2016-01-08 23:15 ` [RFC 13/13] x86/mm: Try to preserve old TLB entries using PCID Andy Lutomirski
@ 2016-01-08 23:31 ` Linus Torvalds
  2016-01-08 23:36   ` Andy Lutomirski
  13 siblings, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2016-01-08 23:31 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: the arch/x86 maintainers, Linux Kernel Mailing List,
	Borislav Petkov, Brian Gerst, Dave Hansen, Oleg Nesterov,
	linux-mm

On Fri, Jan 8, 2016 at 3:15 PM, Andy Lutomirski <luto@kernel.org> wrote:
>
> Please play around and suggest (and run?) good benchmarks.  It seems
> to save around 100ns on cross-process context switches for me.

Interesting. There was reportedly (I never saw it) a test-patch to use
pcids inside of Intel a couple of years ago, and it never got outside
because it didn't make a difference.

Either things have changed (newer hardware with more pcids perhaps?)
or you did a better job at it.

              Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC 00/13] x86/mm: PCID and INVPCID
  2016-01-08 23:31 ` [RFC 00/13] x86/mm: PCID and INVPCID Linus Torvalds
@ 2016-01-08 23:36   ` Andy Lutomirski
  2016-01-08 23:42     ` Linus Torvalds
  0 siblings, 1 reply; 43+ messages in thread
From: Andy Lutomirski @ 2016-01-08 23:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, the arch/x86 maintainers,
	Linux Kernel Mailing List, Borislav Petkov, Brian Gerst,
	Dave Hansen, Oleg Nesterov, linux-mm

On Fri, Jan 8, 2016 at 3:31 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Fri, Jan 8, 2016 at 3:15 PM, Andy Lutomirski <luto@kernel.org> wrote:
>>
>> Please play around and suggest (and run?) good benchmarks.  It seems
>> to save around 100ns on cross-process context switches for me.
>
> Interesting. There was reportedly (I never saw it) a test-patch to use
> pcids inside of Intel a couple of years ago, and it never got outside
> because it didn't make a difference.

I have a copy of that patch, and my code works very differently.  I
only use 3 bits of PCID in this series.  I could probably reduce that
to 2 with little loss.  4 or more would be a waste.

>
> Either things have changed (newer hardware with more pcids perhaps?)
> or you did a better job at it.

On my Skylake laptop, all of the PCID bits appear to have at least
some effect.  Whether this means it gets hashed or whether this means
that all of the bits are real, I don't know.  I'll fiddle with it on
an older machine.

--Andy

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC 09/13] x86/mm: Disable interrupts when flushing the TLB using CR3
  2016-01-08 23:15 ` [RFC 09/13] x86/mm: Disable interrupts when flushing the TLB using CR3 Andy Lutomirski
@ 2016-01-08 23:41   ` Linus Torvalds
  2016-01-09  0:18     ` Andy Lutomirski
  0 siblings, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2016-01-08 23:41 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: the arch/x86 maintainers, Linux Kernel Mailing List,
	Borislav Petkov, Brian Gerst, Dave Hansen, Oleg Nesterov,
	linux-mm

On Fri, Jan 8, 2016 at 3:15 PM, Andy Lutomirski <luto@kernel.org> wrote:
> +       /*
> +        * We mustn't be preempted or handle an IPI while reading and
> +        * writing CR3.  Preemption could switch mms and switch back, and
> +        * an IPI could call leave_mm.  Either of those could cause our
> +        * PCID to change asynchronously.
> +        */
> +       raw_local_irq_save(flags);
>         native_write_cr3(native_read_cr3());
> +       raw_local_irq_restore(flags);

This seems sad for two reasons:

 - it adds unnecessary overhead on non-pcid setups (32-bit being an
example of that)

 - on pcid setups, wouldn't invpcid_flush_single_context() be better?

So on the whole I hate it.

Why isn't this something like

        if (static_cpu_has_safe(X86_FEATURE_INVPCID)) {
                invpcid_flush_single_context();
                return;
        }
        native_write_cr3(native_read_cr3());

*without* any flag saving crud?

And yes, that means that we'd require X86_FEATURE_INVPCID in order to
use X86_FEATURE_PCID, but that seems fine.

Or is there some reason you wanted the odd flags version? If so, that
should be documented.

               Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC 00/13] x86/mm: PCID and INVPCID
  2016-01-08 23:36   ` Andy Lutomirski
@ 2016-01-08 23:42     ` Linus Torvalds
  0 siblings, 0 replies; 43+ messages in thread
From: Linus Torvalds @ 2016-01-08 23:42 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, the arch/x86 maintainers,
	Linux Kernel Mailing List, Borislav Petkov, Brian Gerst,
	Dave Hansen, Oleg Nesterov, linux-mm

On Fri, Jan 8, 2016 at 3:36 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>
>> Either things have changed (newer hardware with more pcids perhaps?)
>> or you did a better job at it.
>
> On my Skylake laptop, all of the PCID bits appear to have at least
> some effect.  Whether this means it gets hashed or whether this means
> that all of the bits are real, I don't know.

They have always gotten hashed, and no the bits aren't real - hardware
doesn't actually have as many bits in the pcid as there are in cr3.

              Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC 11/13] x86/mm: Build arch/x86/mm/tlb.c even on !SMP
  2016-01-08 23:15 ` [RFC 11/13] x86/mm: Build arch/x86/mm/tlb.c even on !SMP Andy Lutomirski
@ 2016-01-08 23:55   ` Dave Hansen
  0 siblings, 0 replies; 43+ messages in thread
From: Dave Hansen @ 2016-01-08 23:55 UTC (permalink / raw)
  To: Andy Lutomirski, x86, linux-kernel
  Cc: Borislav Petkov, Brian Gerst, Linus Torvalds, Oleg Nesterov, linux-mm

On 01/08/2016 03:15 PM, Andy Lutomirski wrote:
> @@ -352,3 +354,5 @@ static int __init create_tlb_single_page_flush_ceiling(void)
>  	return 0;
>  }
>  late_initcall(create_tlb_single_page_flush_ceiling);
> +
> +#endif /* CONFIG_SMP */

Heh, I was about to complain that you #ifdef'd out my lovely INVLPG
tunable.  But I guess on UP you just get flush_tlb_mm_range() from:

> static inline void flush_tlb_mm_range(struct mm_struct *mm,
>            unsigned long start, unsigned long end, unsigned long vmflag)
> {
>         if (mm == current->active_mm)
>                 __flush_tlb_up();
> }

which doesn't even do INVLPG.  How sad.  Poor UP.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC 09/13] x86/mm: Disable interrupts when flushing the TLB using CR3
  2016-01-08 23:41   ` Linus Torvalds
@ 2016-01-09  0:18     ` Andy Lutomirski
  2016-01-09  2:20       ` Linus Torvalds
  0 siblings, 1 reply; 43+ messages in thread
From: Andy Lutomirski @ 2016-01-09  0:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, X86 ML, Dave Hansen, Borislav Petkov,
	Linux Kernel Mailing List, linux-mm, Brian Gerst

On Jan 8, 2016 3:41 PM, "Linus Torvalds" <torvalds@linux-foundation.org> wrote:
>
> On Fri, Jan 8, 2016 at 3:15 PM, Andy Lutomirski <luto@kernel.org> wrote:
> > +       /*
> > +        * We mustn't be preempted or handle an IPI while reading and
> > +        * writing CR3.  Preemption could switch mms and switch back, and
> > +        * an IPI could call leave_mm.  Either of those could cause our
> > +        * PCID to change asynchronously.
> > +        */
> > +       raw_local_irq_save(flags);
> >         native_write_cr3(native_read_cr3());
> > +       raw_local_irq_restore(flags);
>
> This seems sad for two reasons:
>
>  - it adds unnecessary overhead on non-pcid setups (32-bit being an
> example of that)

I can certainly skip the flag saving on !PCID.

>
>  - on pcid setups, wouldn't invpcid_flush_single_context() be better?
>

I played with that and it was slower.  I don't pretend that makes any sense.

> So on the whole I hate it.
>
> Why isn't this something like
>
>         if (static_cpu_has_safe(X86_FEATURE_INVPCID)) {
>                 invpcid_flush_single_context();
>                 return;
>         }
>         native_write_cr3(native_read_cr3());
>
> *without* any flag saving crud?
>
> And yes, that means that we'd require X86_FEATURE_INVPCID in order to
> use X86_FEATURE_PCID, but that seems fine.

I have an SNB "Extreme" with PCID but not INVPCID, and there could be
a whole generation of servers like that.  I think we should fully
support them.

We might be able to get away with just disabling preemption instead of
IRQs, at least if mm == active_mm.

>
> Or is there some reason you wanted the odd flags version? If so, that
> should be documented.

What do you mean "odd"?

--Andy

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC 13/13] x86/mm: Try to preserve old TLB entries using PCID
  2016-01-08 23:15 ` [RFC 13/13] x86/mm: Try to preserve old TLB entries using PCID Andy Lutomirski
@ 2016-01-09  0:27   ` Dave Hansen
  2016-01-09  2:19     ` Andy Lutomirski
  0 siblings, 1 reply; 43+ messages in thread
From: Dave Hansen @ 2016-01-09  0:27 UTC (permalink / raw)
  To: Andy Lutomirski, x86, linux-kernel
  Cc: Borislav Petkov, Brian Gerst, Linus Torvalds, Oleg Nesterov, linux-mm

On 01/08/2016 03:15 PM, Andy Lutomirski wrote:
> + * The guiding principle of this code is that TLB entries that have
> + * survived more than a small number of context switches are mostly
> + * useless, so we don't try very hard not to evict them.

Big ack on that.  The original approach tried to keep track of the full
4k worth of possible PCIDs, it also needed an additional cpumask (which
it dynamically allocated) for where the PCID was active in addition to
the normal "where has this mm been" mask.

That's a lot of extra data to mangle, and I can definitely see your
approach as being nicer, *IF* the hardware isn't doing something useful
with the other 9 bits of PCID that you're throwing away. ;)

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC 13/13] x86/mm: Try to preserve old TLB entries using PCID
  2016-01-09  0:27   ` Dave Hansen
@ 2016-01-09  2:19     ` Andy Lutomirski
  0 siblings, 0 replies; 43+ messages in thread
From: Andy Lutomirski @ 2016-01-09  2:19 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, X86 ML, linux-kernel, Borislav Petkov,
	Brian Gerst, Linus Torvalds, Oleg Nesterov, linux-mm

On Fri, Jan 8, 2016 at 4:27 PM, Dave Hansen <dave.hansen@linux.intel.com> wrote:
> On 01/08/2016 03:15 PM, Andy Lutomirski wrote:
>> + * The guiding principle of this code is that TLB entries that have
>> + * survived more than a small number of context switches are mostly
>> + * useless, so we don't try very hard not to evict them.
>
> Big ack on that.  The original approach tried to keep track of the full
> 4k worth of possible PCIDs, it also needed an additional cpumask (which
> it dynamically allocated) for where the PCID was active in addition to
> the normal "where has this mm been" mask.

My patch has a similar extra cpumask, but at least I didn't
dynamically allocate it.  I did it because I need a 100% reliable way
to tell whether a given mm has a valid PCID in a cpu's PCID LRU list,
as opposed to just matching due to struct mm reuse or similar.  I also
need the ability to blow away old mappings, which I can do by clearing
the cpumask.  This happens in init_new_context and in
propagate_tlb_flush.

The other way to do it would be to store some kind of generation
counter in the per-cpu list.  I could use a global 64-bit atomic
counter to allocate never-reused mm ids (it's highly unlikely that a
system will run long enough for such a counter to overflow -- it could
only ever be incremented every few ns, giving hundreds of years of
safety), but that's kind of expensive.  I could use a per-cpu
allocator, but 54 bits per cpu is uncomfortably small unless we have
wraparound handling.  We could do 64 bits per cpu for very cheap
counter allocation, but then the "zap the pcid" logic gets much
nastier in that neither the percpu entries nor the per-mm generation
counter entries don't fit in a word any more.  Maybe that's fine.

What we can't do easily is have a per-mm generation counter, because
freeing an mm and reallocating it in the same place needs to reliably
zap the pcid on all CPUs.

Anyway, this problem is clearly solvable, but I haven't thought of a
straightforward solution that doesn't involve rarely-executed code
paths, and that makes me a bit nervous.

--Andy

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC 09/13] x86/mm: Disable interrupts when flushing the TLB using CR3
  2016-01-09  0:18     ` Andy Lutomirski
@ 2016-01-09  2:20       ` Linus Torvalds
  2016-01-11 10:51         ` Ingo Molnar
  2016-01-13 23:35         ` Andy Lutomirski
  0 siblings, 2 replies; 43+ messages in thread
From: Linus Torvalds @ 2016-01-09  2:20 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Oleg Nesterov, X86 ML, Dave Hansen, Borislav Petkov,
	Linux Kernel Mailing List, linux-mm, Brian Gerst

On Fri, Jan 8, 2016 at 4:18 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>
>>  - on pcid setups, wouldn't invpcid_flush_single_context() be better?
>
> I played with that and it was slower.  I don't pretend that makes any sense.

Ugh. I guess reading and writing cr3 has been optimized.

>> And yes, that means that we'd require X86_FEATURE_INVPCID in order to
>> use X86_FEATURE_PCID, but that seems fine.
>
> I have an SNB "Extreme" with PCID but not INVPCID, and there could be
> a whole generation of servers like that.  I think we should fully
> support them.

Can you check the timings? IOW, is it a win on SNB?

I think originally Intel only had two actual bits of process context
ID in the TLB, and it was meant to be used for virtualization or
something. Together with the hashing (to make it always appear as 12
bits to software - a nice idea but also means that the hardware ends
up invalidating more than software really expects), it may not work
all that well.

That _could_ explain why the original patch from intel didn't work.

> We might be able to get away with just disabling preemption instead of
> IRQs, at least if mm == active_mm.

I'm not convinced it is all that much faster. Of course, it's nicer on
non-preempt, but nobody seems to run things that way.

>> Or is there some reason you wanted the odd flags version? If so, that
>> should be documented.
>
> What do you mean "odd"?

It's odd because it makes no sense for non-pcid (christ, I wish Intel
had just called it "asid" instead, "pcid" always makes me react to
"pci"), and I think it would make more sense to pair up the pcid case
with the invpcid rather than have those preemption rules here.

                 Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC 01/13] x86/paravirt: Turn KASAN off for parvirt.o
  2016-01-08 23:15 ` [RFC 01/13] x86/paravirt: Turn KASAN off for parvirt.o Andy Lutomirski
@ 2016-01-10 18:59   ` Borislav Petkov
  2016-01-11 12:51     ` Andrey Ryabinin
  0 siblings, 1 reply; 43+ messages in thread
From: Borislav Petkov @ 2016-01-10 18:59 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, linux-kernel, Brian Gerst, Dave Hansen, Linus Torvalds,
	Oleg Nesterov, linux-mm, Andrey Ryabinin

+ Andrey.

On Fri, Jan 08, 2016 at 03:15:19PM -0800, Andy Lutomirski wrote:
> Otherwise terrible things happen if some of the callbacks end up
> calling into KASAN in unexpected places.
> 
> This has no obvious symptoms yet, but adding a memory reference to
> native_flush_tlb_global without this blows up on KASAN kernels.
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  arch/x86/kernel/Makefile | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
> index b1b78ffe01d0..b7cd5bdf314b 100644
> --- a/arch/x86/kernel/Makefile
> +++ b/arch/x86/kernel/Makefile
> @@ -19,6 +19,7 @@ endif
>  KASAN_SANITIZE_head$(BITS).o := n
>  KASAN_SANITIZE_dumpstack.o := n
>  KASAN_SANITIZE_dumpstack_$(BITS).o := n
> +KASAN_SANITIZE_paravirt.o := n
>  
>  CFLAGS_irq.o := -I$(src)/../include/asm/trace

Shouldn't we take this one irrespectively of what happens to the rest in
the patchset?

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC 09/13] x86/mm: Disable interrupts when flushing the TLB using CR3
  2016-01-09  2:20       ` Linus Torvalds
@ 2016-01-11 10:51         ` Ingo Molnar
  2016-01-13 23:32           ` Andy Lutomirski
  2016-01-13 23:35         ` Andy Lutomirski
  1 sibling, 1 reply; 43+ messages in thread
From: Ingo Molnar @ 2016-01-11 10:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Oleg Nesterov, X86 ML, Dave Hansen,
	Borislav Petkov, Linux Kernel Mailing List, linux-mm,
	Brian Gerst


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> >> Or is there some reason you wanted the odd flags version? If so, that
> >> should be documented.
> >
> > What do you mean "odd"?
> 
> It's odd because it makes no sense for non-pcid (christ, I wish Intel had just 
> called it "asid" instead, "pcid" always makes me react to "pci"), and I think it 
> would make more sense to pair up the pcid case with the invpcid rather than have 
> those preemption rules here.

The naming is really painful, so a trivial suggestion: could we just name all the 
Linux side bits 'asid' or 'ctx_id' (even in x86 arch code) and only use 'PCID' 
nomenclature in the very lowest level code?

I.e. rename pcid_live_cpus et al and most functions to the asid or ctx_id or asid 
naming scheme or so. That would hide most of the naming ugliness.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC 01/13] x86/paravirt: Turn KASAN off for parvirt.o
  2016-01-10 18:59   ` Borislav Petkov
@ 2016-01-11 12:51     ` Andrey Ryabinin
  2016-01-11 12:51       ` [PATCH 1/2] x86/kasan: clear kasan_zero_page after TLB flush Andrey Ryabinin
                         ` (2 more replies)
  0 siblings, 3 replies; 43+ messages in thread
From: Andrey Ryabinin @ 2016-01-11 12:51 UTC (permalink / raw)
  To: Borislav Petkov, Andy Lutomirski, x86, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin
  Cc: Brian Gerst, Dave Hansen, Linus Torvalds, Oleg Nesterov,
	linux-mm, linux-kernel, Andrey Ryabinin

On 01/10/2016 09:59 PM, Borislav Petkov wrote:
> + Andrey.
> 
> On Fri, Jan 08, 2016 at 03:15:19PM -0800, Andy Lutomirski wrote:
>> Otherwise terrible things happen if some of the callbacks end up
>> calling into KASAN in unexpected places.
>>
>> This has no obvious symptoms yet, but adding a memory reference to
>> native_flush_tlb_global without this blows up on KASAN kernels.
>>
>> Signed-off-by: Andy Lutomirski <luto@kernel.org>
>> ---
>>  arch/x86/kernel/Makefile | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
>> index b1b78ffe01d0..b7cd5bdf314b 100644
>> --- a/arch/x86/kernel/Makefile
>> +++ b/arch/x86/kernel/Makefile
>> @@ -19,6 +19,7 @@ endif
>>  KASAN_SANITIZE_head$(BITS).o := n
>>  KASAN_SANITIZE_dumpstack.o := n
>>  KASAN_SANITIZE_dumpstack_$(BITS).o := n
>> +KASAN_SANITIZE_paravirt.o := n
>>  
>>  CFLAGS_irq.o := -I$(src)/../include/asm/trace
> 
> Shouldn't we take this one irrespectively of what happens to the rest in
> the patchset?
>

I don't think that this patch is the right way to solve the problem.
The follow-up patch "x86/kasan: clear kasan_zero_page after TLB flush" should fix Andy's problem.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH 1/2] x86/kasan: clear kasan_zero_page after TLB flush
  2016-01-11 12:51     ` Andrey Ryabinin
@ 2016-01-11 12:51       ` Andrey Ryabinin
  2016-01-18 22:24         ` Andy Lutomirski
  2016-02-09 16:06         ` [tip:x86/mm] x86/kasan: Clear " tip-bot for Andrey Ryabinin
  2016-01-11 12:51       ` [PATCH 2/2] x86/kasan: write protect kasan zero shadow Andrey Ryabinin
  2016-01-29 10:35       ` [RFC 01/13] x86/paravirt: Turn KASAN off for parvirt.o Borislav Petkov
  2 siblings, 2 replies; 43+ messages in thread
From: Andrey Ryabinin @ 2016-01-11 12:51 UTC (permalink / raw)
  To: Borislav Petkov, Andy Lutomirski, x86, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin
  Cc: Brian Gerst, Dave Hansen, Linus Torvalds, Oleg Nesterov,
	linux-mm, linux-kernel, Andrey Ryabinin

Currently we clear kasan_zero_page before __flush_tlb_all(). This
works with current implementation of native_flush_tlb[_global]()
because it doesn't cause do any writes to kasan shadow memory.
But any subtle change made in native_flush_tlb*() could break this.
Also current code seems doesn't work for paravirt guests (lguest).

Only after the TLB flush we can be sure that kasan_zero_page is not
used as early shadow anymore (instrumented code will not write to it).
So it should cleared it only after the TLB flush.

Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
---
 arch/x86/mm/kasan_init_64.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index d470cf2..303e470 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -120,11 +120,16 @@ void __init kasan_init(void)
 	kasan_populate_zero_shadow(kasan_mem_to_shadow((void *)MODULES_END),
 			(void *)KASAN_SHADOW_END);
 
-	memset(kasan_zero_page, 0, PAGE_SIZE);
-
 	load_cr3(init_level4_pgt);
 	__flush_tlb_all();
-	init_task.kasan_depth = 0;
 
+	/*
+	 * kasan_zero_page has been used as early shadow memory, thus it may
+	 * contain some garbage. Now we can clear it, since after the TLB flush
+	 * no one should write to it.
+	 */
+	memset(kasan_zero_page, 0, PAGE_SIZE);
+
+	init_task.kasan_depth = 0;
 	pr_info("KernelAddressSanitizer initialized\n");
 }
-- 
2.4.10

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 2/2] x86/kasan: write protect kasan zero shadow
  2016-01-11 12:51     ` Andrey Ryabinin
  2016-01-11 12:51       ` [PATCH 1/2] x86/kasan: clear kasan_zero_page after TLB flush Andrey Ryabinin
@ 2016-01-11 12:51       ` Andrey Ryabinin
  2016-01-18 22:24         ` Andy Lutomirski
  2016-02-09 16:07         ` [tip:x86/mm] x86/kasan: Write " tip-bot for Andrey Ryabinin
  2016-01-29 10:35       ` [RFC 01/13] x86/paravirt: Turn KASAN off for parvirt.o Borislav Petkov
  2 siblings, 2 replies; 43+ messages in thread
From: Andrey Ryabinin @ 2016-01-11 12:51 UTC (permalink / raw)
  To: Borislav Petkov, Andy Lutomirski, x86, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin
  Cc: Brian Gerst, Dave Hansen, Linus Torvalds, Oleg Nesterov,
	linux-mm, linux-kernel, Andrey Ryabinin

After kasan_init() executed, no one is allowed to write to kasan_zero_page,
so write protect it.

Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
---
 arch/x86/mm/kasan_init_64.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index 303e470..1b1110f 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -125,10 +125,16 @@ void __init kasan_init(void)
 
 	/*
 	 * kasan_zero_page has been used as early shadow memory, thus it may
-	 * contain some garbage. Now we can clear it, since after the TLB flush
-	 * no one should write to it.
+	 * contain some garbage. Now we can clear and write protect it, since
+	 * after the TLB flush no one should write to it.
 	 */
 	memset(kasan_zero_page, 0, PAGE_SIZE);
+	for (i = 0; i < PTRS_PER_PTE; i++) {
+		pte_t pte = __pte(__pa(kasan_zero_page) | __PAGE_KERNEL_RO);
+		set_pte(&kasan_zero_pte[i], pte);
+	}
+	/* Flush TLBs again to be sure that write protection applied. */
+	__flush_tlb_all();
 
 	init_task.kasan_depth = 0;
 	pr_info("KernelAddressSanitizer initialized\n");
-- 
2.4.10

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [RFC 09/13] x86/mm: Disable interrupts when flushing the TLB using CR3
  2016-01-11 10:51         ` Ingo Molnar
@ 2016-01-13 23:32           ` Andy Lutomirski
  0 siblings, 0 replies; 43+ messages in thread
From: Andy Lutomirski @ 2016-01-13 23:32 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Oleg Nesterov, X86 ML, Dave Hansen,
	Borislav Petkov, Linux Kernel Mailing List, linux-mm,
	Brian Gerst

On Mon, Jan 11, 2016 at 2:51 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
>> >> Or is there some reason you wanted the odd flags version? If so, that
>> >> should be documented.
>> >
>> > What do you mean "odd"?
>>
>> It's odd because it makes no sense for non-pcid (christ, I wish Intel had just
>> called it "asid" instead, "pcid" always makes me react to "pci"), and I think it
>> would make more sense to pair up the pcid case with the invpcid rather than have
>> those preemption rules here.
>
> The naming is really painful, so a trivial suggestion: could we just name all the
> Linux side bits 'asid' or 'ctx_id' (even in x86 arch code) and only use 'PCID'
> nomenclature in the very lowest level code?

I'd be okay with "pctx_id" or "pctxid" for this, I think.  I'd like to
at least make it somewhat obvious how it maps back to hardware.

FWIW, I'd guess that Intel deviated from convention because their
actual address space id is (vpid, pcid), and calling it (vpid, asid)
might have been slightly confusing.  Or not.

--Andy

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC 09/13] x86/mm: Disable interrupts when flushing the TLB using CR3
  2016-01-09  2:20       ` Linus Torvalds
  2016-01-11 10:51         ` Ingo Molnar
@ 2016-01-13 23:35         ` Andy Lutomirski
  2016-01-13 23:43           ` Dave Hansen
  1 sibling, 1 reply; 43+ messages in thread
From: Andy Lutomirski @ 2016-01-13 23:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, X86 ML, Dave Hansen, Borislav Petkov,
	Linux Kernel Mailing List, linux-mm, Brian Gerst

On Fri, Jan 8, 2016 at 6:20 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Fri, Jan 8, 2016 at 4:18 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>
>>>  - on pcid setups, wouldn't invpcid_flush_single_context() be better?
>>
>> I played with that and it was slower.  I don't pretend that makes any sense.
>
> Ugh. I guess reading and writing cr3 has been optimized.
>
>>> And yes, that means that we'd require X86_FEATURE_INVPCID in order to
>>> use X86_FEATURE_PCID, but that seems fine.
>>
>> I have an SNB "Extreme" with PCID but not INVPCID, and there could be
>> a whole generation of servers like that.  I think we should fully
>> support them.
>
> Can you check the timings? IOW, is it a win on SNB?

~80ns gain on SNB.  It's actually quite impressive on SNB: it knocks
the penalty for mm switches down to 20ns or so, which I find to be
fairly amazing.  (This is at 3.8GHz or thereabouts.)

>
> I think originally Intel only had two actual bits of process context
> ID in the TLB, and it was meant to be used for virtualization or
> something. Together with the hashing (to make it always appear as 12
> bits to software - a nice idea but also means that the hardware ends
> up invalidating more than software really expects), it may not work
> all that well.
>
> That _could_ explain why the original patch from intel didn't work.
>
>> We might be able to get away with just disabling preemption instead of
>> IRQs, at least if mm == active_mm.
>
> I'm not convinced it is all that much faster. Of course, it's nicer on
> non-preempt, but nobody seems to run things that way.

My current testing version has three different code paths now.  If
INVPCID and PCID are both available, then it uses INVPCID.  If PCID is
available but INVPCID is not, it does raw_local_irqsave.  If PCID is
not available, it just does the CR3 read/write.

Yeah, it's ugly, and it's a big blob of code to do something trivial,
but it seems to work and it should be the right thing to do in most
cases.

Can anyone here ask a hardware or microcode person what's going on
with CR3 writes possibly being faster than INVPCID?  Is there some
trick to it?

--Andy

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC 09/13] x86/mm: Disable interrupts when flushing the TLB using CR3
  2016-01-13 23:35         ` Andy Lutomirski
@ 2016-01-13 23:43           ` Dave Hansen
  2016-01-13 23:51             ` Andy Lutomirski
  0 siblings, 1 reply; 43+ messages in thread
From: Dave Hansen @ 2016-01-13 23:43 UTC (permalink / raw)
  To: Andy Lutomirski, Linus Torvalds
  Cc: Oleg Nesterov, X86 ML, Borislav Petkov,
	Linux Kernel Mailing List, linux-mm, Brian Gerst

On 01/13/2016 03:35 PM, Andy Lutomirski wrote:
> Can anyone here ask a hardware or microcode person what's going on
> with CR3 writes possibly being faster than INVPCID?  Is there some
> trick to it?

I just went and measured it myself this morning.  "INVPCID Type 3" (all
contexts no global) on a Skylake system was 15% slower than a CR3 write.

Is that in the same ballpark from what you've observed?

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC 09/13] x86/mm: Disable interrupts when flushing the TLB using CR3
  2016-01-13 23:43           ` Dave Hansen
@ 2016-01-13 23:51             ` Andy Lutomirski
  2016-01-13 23:56               ` Dave Hansen
  0 siblings, 1 reply; 43+ messages in thread
From: Andy Lutomirski @ 2016-01-13 23:51 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Linus Torvalds, Oleg Nesterov, X86 ML, Borislav Petkov,
	Linux Kernel Mailing List, linux-mm, Brian Gerst

On Wed, Jan 13, 2016 at 3:43 PM, Dave Hansen
<dave.hansen@linux.intel.com> wrote:
> On 01/13/2016 03:35 PM, Andy Lutomirski wrote:
>> Can anyone here ask a hardware or microcode person what's going on
>> with CR3 writes possibly being faster than INVPCID?  Is there some
>> trick to it?
>
> I just went and measured it myself this morning.  "INVPCID Type 3" (all
> contexts no global) on a Skylake system was 15% slower than a CR3 write.
>
> Is that in the same ballpark from what you've observed?
>
>

It's similar, except that I was comparing "INVPCID Type 1" (single
context no globals) to a CR3 write.

Type 2, at least, is dramatically faster than the pair of CR4 writes
it replaces.

--Andy

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC 09/13] x86/mm: Disable interrupts when flushing the TLB using CR3
  2016-01-13 23:51             ` Andy Lutomirski
@ 2016-01-13 23:56               ` Dave Hansen
  2016-01-14  0:34                 ` Andy Lutomirski
  0 siblings, 1 reply; 43+ messages in thread
From: Dave Hansen @ 2016-01-13 23:56 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linus Torvalds, Oleg Nesterov, X86 ML, Borislav Petkov,
	Linux Kernel Mailing List, linux-mm, Brian Gerst

On 01/13/2016 03:51 PM, Andy Lutomirski wrote:
> On Wed, Jan 13, 2016 at 3:43 PM, Dave Hansen
> <dave.hansen@linux.intel.com> wrote:
>> On 01/13/2016 03:35 PM, Andy Lutomirski wrote:
>>> Can anyone here ask a hardware or microcode person what's going on
>>> with CR3 writes possibly being faster than INVPCID?  Is there some
>>> trick to it?
>>
>> I just went and measured it myself this morning.  "INVPCID Type 3" (all
>> contexts no global) on a Skylake system was 15% slower than a CR3 write.
>>
>> Is that in the same ballpark from what you've observed?
> 
> It's similar, except that I was comparing "INVPCID Type 1" (single
> context no globals) to a CR3 write.

Ahh, because you're using PCID...  That one I saw as being ~1.85x the
number of cycles that a CR3 write was.

> Type 2, at least, is dramatically faster than the pair of CR4 writes
> it replaces.

Yeah, I saw the same thing.  Type 2 was ~2.4x faster than the CR4 writes.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC 09/13] x86/mm: Disable interrupts when flushing the TLB using CR3
  2016-01-13 23:56               ` Dave Hansen
@ 2016-01-14  0:34                 ` Andy Lutomirski
  0 siblings, 0 replies; 43+ messages in thread
From: Andy Lutomirski @ 2016-01-14  0:34 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Linus Torvalds, Oleg Nesterov, X86 ML, Borislav Petkov,
	Linux Kernel Mailing List, linux-mm, Brian Gerst

On Wed, Jan 13, 2016 at 3:56 PM, Dave Hansen
<dave.hansen@linux.intel.com> wrote:
> On 01/13/2016 03:51 PM, Andy Lutomirski wrote:
>> On Wed, Jan 13, 2016 at 3:43 PM, Dave Hansen
>> <dave.hansen@linux.intel.com> wrote:
>>> On 01/13/2016 03:35 PM, Andy Lutomirski wrote:
>>>> Can anyone here ask a hardware or microcode person what's going on
>>>> with CR3 writes possibly being faster than INVPCID?  Is there some
>>>> trick to it?
>>>
>>> I just went and measured it myself this morning.  "INVPCID Type 3" (all
>>> contexts no global) on a Skylake system was 15% slower than a CR3 write.
>>>
>>> Is that in the same ballpark from what you've observed?
>>
>> It's similar, except that I was comparing "INVPCID Type 1" (single
>> context no globals) to a CR3 write.
>
> Ahh, because you're using PCID...  That one I saw as being ~1.85x the
> number of cycles that a CR3 write was.
>

I think that settles it, then:

if (static_cpu_has_safe(X86_FEATURE_PCID)) {
  raw_local_irqsave();
  native_write_cr3(native_read_cr3());
  raw_local_irqrestore();
} else {
  native_write_cr3(native_read_cr3());
}

I don't think it's worth hacking more complexity into switch_mm to
make that annoyance go away.

--Andy

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/2] x86/kasan: clear kasan_zero_page after TLB flush
  2016-01-11 12:51       ` [PATCH 1/2] x86/kasan: clear kasan_zero_page after TLB flush Andrey Ryabinin
@ 2016-01-18 22:24         ` Andy Lutomirski
  2016-02-09 16:06         ` [tip:x86/mm] x86/kasan: Clear " tip-bot for Andrey Ryabinin
  1 sibling, 0 replies; 43+ messages in thread
From: Andy Lutomirski @ 2016-01-18 22:24 UTC (permalink / raw)
  To: Andrey Ryabinin
  Cc: Borislav Petkov, Andy Lutomirski, X86 ML, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, Brian Gerst, Dave Hansen,
	Linus Torvalds, Oleg Nesterov, linux-mm, linux-kernel

On Mon, Jan 11, 2016 at 4:51 AM, Andrey Ryabinin
<aryabinin@virtuozzo.com> wrote:
> Currently we clear kasan_zero_page before __flush_tlb_all(). This
> works with current implementation of native_flush_tlb[_global]()
> because it doesn't cause do any writes to kasan shadow memory.
> But any subtle change made in native_flush_tlb*() could break this.
> Also current code seems doesn't work for paravirt guests (lguest).
>
> Only after the TLB flush we can be sure that kasan_zero_page is not
> used as early shadow anymore (instrumented code will not write to it).
> So it should cleared it only after the TLB flush.

This seems to fix the issue with my patch set.  Thanks.

Tested-by: Andy Lutomirski <luto@kernel.org>

--Andy

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/2] x86/kasan: write protect kasan zero shadow
  2016-01-11 12:51       ` [PATCH 2/2] x86/kasan: write protect kasan zero shadow Andrey Ryabinin
@ 2016-01-18 22:24         ` Andy Lutomirski
  2016-02-09 16:07         ` [tip:x86/mm] x86/kasan: Write " tip-bot for Andrey Ryabinin
  1 sibling, 0 replies; 43+ messages in thread
From: Andy Lutomirski @ 2016-01-18 22:24 UTC (permalink / raw)
  To: Andrey Ryabinin
  Cc: Borislav Petkov, Andy Lutomirski, X86 ML, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, Brian Gerst, Dave Hansen,
	Linus Torvalds, Oleg Nesterov, linux-mm, linux-kernel

On Mon, Jan 11, 2016 at 4:51 AM, Andrey Ryabinin
<aryabinin@virtuozzo.com> wrote:
> After kasan_init() executed, no one is allowed to write to kasan_zero_page,
> so write protect it.

This seems to work for me.

--Andy

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC 01/13] x86/paravirt: Turn KASAN off for parvirt.o
  2016-01-11 12:51     ` Andrey Ryabinin
  2016-01-11 12:51       ` [PATCH 1/2] x86/kasan: clear kasan_zero_page after TLB flush Andrey Ryabinin
  2016-01-11 12:51       ` [PATCH 2/2] x86/kasan: write protect kasan zero shadow Andrey Ryabinin
@ 2016-01-29 10:35       ` Borislav Petkov
  2 siblings, 0 replies; 43+ messages in thread
From: Borislav Petkov @ 2016-01-29 10:35 UTC (permalink / raw)
  To: Andrey Ryabinin
  Cc: Andy Lutomirski, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Brian Gerst, Dave Hansen, Linus Torvalds,
	Oleg Nesterov, linux-mm, linux-kernel

On Mon, Jan 11, 2016 at 03:51:17PM +0300, Andrey Ryabinin wrote:
> I don't think that this patch is the right way to solve the problem.
> The follow-up patch "x86/kasan: clear kasan_zero_page after TLB flush"
> should fix Andy's problem.

Both applied, thanks.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [tip:x86/mm] x86/kasan: Clear kasan_zero_page after TLB flush
  2016-01-11 12:51       ` [PATCH 1/2] x86/kasan: clear kasan_zero_page after TLB flush Andrey Ryabinin
  2016-01-18 22:24         ` Andy Lutomirski
@ 2016-02-09 16:06         ` tip-bot for Andrey Ryabinin
  1 sibling, 0 replies; 43+ messages in thread
From: tip-bot for Andrey Ryabinin @ 2016-02-09 16:06 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: oleg, hpa, peterz, luto, tglx, dave.hansen, dvlasenk, bp, mcgrof,
	brgerst, bp, mingo, akpm, linux-kernel, luto, toshi.kani,
	aryabinin, torvalds

Commit-ID:  69e0210fd01ff157d332102219aaf5c26ca8069b
Gitweb:     http://git.kernel.org/tip/69e0210fd01ff157d332102219aaf5c26ca8069b
Author:     Andrey Ryabinin <aryabinin@virtuozzo.com>
AuthorDate: Mon, 11 Jan 2016 15:51:18 +0300
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 9 Feb 2016 13:33:14 +0100

x86/kasan: Clear kasan_zero_page after TLB flush

Currently we clear kasan_zero_page before __flush_tlb_all(). This
works with current implementation of native_flush_tlb[_global]()
because it doesn't cause do any writes to kasan shadow memory.
But any subtle change made in native_flush_tlb*() could break this.
Also current code seems doesn't work for paravirt guests (lguest).

Only after the TLB flush we can be sure that kasan_zero_page is not
used as early shadow anymore (instrumented code will not write to it).
So it should cleared it only after the TLB flush.

Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/1452516679-32040-2-git-send-email-aryabinin@virtuozzo.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/mm/kasan_init_64.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index d470cf2..303e470 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -120,11 +120,16 @@ void __init kasan_init(void)
 	kasan_populate_zero_shadow(kasan_mem_to_shadow((void *)MODULES_END),
 			(void *)KASAN_SHADOW_END);
 
-	memset(kasan_zero_page, 0, PAGE_SIZE);
-
 	load_cr3(init_level4_pgt);
 	__flush_tlb_all();
-	init_task.kasan_depth = 0;
 
+	/*
+	 * kasan_zero_page has been used as early shadow memory, thus it may
+	 * contain some garbage. Now we can clear it, since after the TLB flush
+	 * no one should write to it.
+	 */
+	memset(kasan_zero_page, 0, PAGE_SIZE);
+
+	init_task.kasan_depth = 0;
 	pr_info("KernelAddressSanitizer initialized\n");
 }

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [tip:x86/mm] x86/kasan: Write protect kasan zero shadow
  2016-01-11 12:51       ` [PATCH 2/2] x86/kasan: write protect kasan zero shadow Andrey Ryabinin
  2016-01-18 22:24         ` Andy Lutomirski
@ 2016-02-09 16:07         ` tip-bot for Andrey Ryabinin
  1 sibling, 0 replies; 43+ messages in thread
From: tip-bot for Andrey Ryabinin @ 2016-02-09 16:07 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: bp, torvalds, toshi.kani, dave.hansen, bp, mcgrof, dvlasenk,
	akpm, linux-kernel, luto, aryabinin, oleg, peterz, hpa, brgerst,
	luto, mingo, tglx

Commit-ID:  063fb3e56f6dd29b2633b678b837e1d904200e6f
Gitweb:     http://git.kernel.org/tip/063fb3e56f6dd29b2633b678b837e1d904200e6f
Author:     Andrey Ryabinin <aryabinin@virtuozzo.com>
AuthorDate: Mon, 11 Jan 2016 15:51:19 +0300
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 9 Feb 2016 13:33:14 +0100

x86/kasan: Write protect kasan zero shadow

After kasan_init() executed, no one is allowed to write to kasan_zero_page,
so write protect it.

Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/1452516679-32040-3-git-send-email-aryabinin@virtuozzo.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/mm/kasan_init_64.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index 303e470..1b1110f 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -125,10 +125,16 @@ void __init kasan_init(void)
 
 	/*
 	 * kasan_zero_page has been used as early shadow memory, thus it may
-	 * contain some garbage. Now we can clear it, since after the TLB flush
-	 * no one should write to it.
+	 * contain some garbage. Now we can clear and write protect it, since
+	 * after the TLB flush no one should write to it.
 	 */
 	memset(kasan_zero_page, 0, PAGE_SIZE);
+	for (i = 0; i < PTRS_PER_PTE; i++) {
+		pte_t pte = __pte(__pa(kasan_zero_page) | __PAGE_KERNEL_RO);
+		set_pte(&kasan_zero_pte[i], pte);
+	}
+	/* Flush TLBs again to be sure that write protection applied. */
+	__flush_tlb_all();
 
 	init_task.kasan_depth = 0;
 	pr_info("KernelAddressSanitizer initialized\n");

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [RFC 05/13] x86/mm: Add barriers and document switch_mm-vs-flush synchronization
  2016-01-08 23:15 ` [RFC 05/13] x86/mm: Add barriers and document switch_mm-vs-flush synchronization Andy Lutomirski
@ 2016-06-03 17:42   ` Nadav Amit
  2016-06-09 17:24     ` Andy Lutomirski
  2016-09-06  1:22   ` Wanpeng Li
  1 sibling, 1 reply; 43+ messages in thread
From: Nadav Amit @ 2016-06-03 17:42 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, LKML, Borislav Petkov, Brian Gerst, Dave Hansen,
	Oleg Nesterov, linux-mm

Following this patch, if (current->active_mm != mm), flush_tlb_page() still
doesn’t call smp_mb() before checking mm_cpumask(mm).

In contrast, flush_tlb_mm_range() does call smp_mb().

Is there a reason for this discrepancy?

Thanks,
Nadav

Andy Lutomirski <luto@kernel.org> wrote:

> When switch_mm activates a new pgd, it also sets a bit that tells
> other CPUs that the pgd is in use so that tlb flush IPIs will be
> sent.  In order for that to work correctly, the bit needs to be
> visible prior to loading the pgd and therefore starting to fill the
> local TLB.
> 
> Document all the barriers that make this work correctly and add a
> couple that were missing.
> 
> Cc: stable@vger.kernel.org
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
> arch/x86/include/asm/mmu_context.h | 33 ++++++++++++++++++++++++++++++++-
> arch/x86/mm/tlb.c                  | 29 ++++++++++++++++++++++++++---
> 2 files changed, 58 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
> index 379cd3658799..1edc9cd198b8 100644
> --- a/arch/x86/include/asm/mmu_context.h
> +++ b/arch/x86/include/asm/mmu_context.h
> @@ -116,8 +116,34 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
> #endif
> 		cpumask_set_cpu(cpu, mm_cpumask(next));
> 
> -		/* Re-load page tables */
> +		/*
> +		 * Re-load page tables.
> +		 *
> +		 * This logic has an ordering constraint:
> +		 *
> +		 *  CPU 0: Write to a PTE for 'next'
> +		 *  CPU 0: load bit 1 in mm_cpumask.  if nonzero, send IPI.
> +		 *  CPU 1: set bit 1 in next's mm_cpumask
> +		 *  CPU 1: load from the PTE that CPU 0 writes (implicit)
> +		 *
> +		 * We need to prevent an outcome in which CPU 1 observes
> +		 * the new PTE value and CPU 0 observes bit 1 clear in
> +		 * mm_cpumask.  (If that occurs, then the IPI will never
> +		 * be sent, and CPU 0's TLB will contain a stale entry.)
> +		 *
> +		 * The bad outcome can occur if either CPU's load is
> +		 * reordered before that CPU's store, so both CPUs much
> +		 * execute full barriers to prevent this from happening.
> +		 *
> +		 * Thus, switch_mm needs a full barrier between the
> +		 * store to mm_cpumask and any operation that could load
> +		 * from next->pgd.  This barrier synchronizes with
> +		 * remote TLB flushers.  Fortunately, load_cr3 is
> +		 * serializing and thus acts as a full barrier.
> +		 *
> +		 */
> 		load_cr3(next->pgd);
> +
> 		trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
> 
> 		/* Stop flush ipis for the previous mm */
> @@ -156,10 +182,15 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
> 			 * schedule, protecting us from simultaneous changes.
> 			 */
> 			cpumask_set_cpu(cpu, mm_cpumask(next));
> +
> 			/*
> 			 * We were in lazy tlb mode and leave_mm disabled
> 			 * tlb flush IPI delivery. We must reload CR3
> 			 * to make sure to use no freed page tables.
> +			 *
> +			 * As above, this is a barrier that forces
> +			 * TLB repopulation to be ordered after the
> +			 * store to mm_cpumask.
> 			 */
> 			load_cr3(next->pgd);
> 			trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 8ddb5d0d66fb..8f4cc3dfac32 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -161,7 +161,10 @@ void flush_tlb_current_task(void)
> 	preempt_disable();
> 
> 	count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
> +
> +	/* This is an implicit full barrier that synchronizes with switch_mm. */
> 	local_flush_tlb();
> +
> 	trace_tlb_flush(TLB_LOCAL_SHOOTDOWN, TLB_FLUSH_ALL);
> 	if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids)
> 		flush_tlb_others(mm_cpumask(mm), mm, 0UL, TLB_FLUSH_ALL);
> @@ -188,17 +191,29 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
> 	unsigned long base_pages_to_flush = TLB_FLUSH_ALL;
> 
> 	preempt_disable();
> -	if (current->active_mm != mm)
> +	if (current->active_mm != mm) {
> +		/* Synchronize with switch_mm. */
> +		smp_mb();
> +
> 		goto out;
> +	}
> 
> 	if (!current->mm) {
> 		leave_mm(smp_processor_id());
> +
> +		/* Synchronize with switch_mm. */
> +		smp_mb();
> +
> 		goto out;
> 	}
> 
> 	if ((end != TLB_FLUSH_ALL) && !(vmflag & VM_HUGETLB))
> 		base_pages_to_flush = (end - start) >> PAGE_SHIFT;
> 
> +	/*
> +	 * Both branches below are implicit full barriers (MOV to CR or
> +	 * INVLPG) that synchronize with switch_mm.
> +	 */
> 	if (base_pages_to_flush > tlb_single_page_flush_ceiling) {
> 		base_pages_to_flush = TLB_FLUSH_ALL;
> 		count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
> @@ -228,10 +243,18 @@ void flush_tlb_page(struct vm_area_struct *vma, unsigned long start)
> 	preempt_disable();
> 
> 	if (current->active_mm == mm) {
> -		if (current->mm)
> +		if (current->mm) {
> +			/*
> +			 * Implicit full barrier (INVLPG) that synchronizes
> +			 * with switch_mm.
> +			 */
> 			__flush_tlb_one(start);
> -		else
> +		} else {
> 			leave_mm(smp_processor_id());
> +
> +			/* Synchronize with switch_mm. */
> +			smp_mb();
> +		}
> 	}
> 
> 	if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids)
> -- 
> 2.5.0
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC 05/13] x86/mm: Add barriers and document switch_mm-vs-flush synchronization
  2016-06-03 17:42   ` Nadav Amit
@ 2016-06-09 17:24     ` Andy Lutomirski
  2016-06-09 19:45       ` Nadav Amit
  0 siblings, 1 reply; 43+ messages in thread
From: Andy Lutomirski @ 2016-06-09 17:24 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andy Lutomirski, X86 ML, LKML, Borislav Petkov, Brian Gerst,
	Dave Hansen, Oleg Nesterov, linux-mm

On Fri, Jun 3, 2016 at 10:42 AM, Nadav Amit <nadav.amit@gmail.com> wrote:
> Following this patch, if (current->active_mm != mm), flush_tlb_page() still
> doesn’t call smp_mb() before checking mm_cpumask(mm).
>
> In contrast, flush_tlb_mm_range() does call smp_mb().
>
> Is there a reason for this discrepancy?

Not that I can remember.  Is the remote flush case likely to be racy?

--Andy

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC 05/13] x86/mm: Add barriers and document switch_mm-vs-flush synchronization
  2016-06-09 17:24     ` Andy Lutomirski
@ 2016-06-09 19:45       ` Nadav Amit
  0 siblings, 0 replies; 43+ messages in thread
From: Nadav Amit @ 2016-06-09 19:45 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, X86 ML, LKML, Borislav Petkov, Brian Gerst,
	Dave Hansen, Oleg Nesterov, linux-mm

Andy Lutomirski <luto@amacapital.net> wrote:

> On Fri, Jun 3, 2016 at 10:42 AM, Nadav Amit <nadav.amit@gmail.com> wrote:
>> Following this patch, if (current->active_mm != mm), flush_tlb_page() still
>> doesn’t call smp_mb() before checking mm_cpumask(mm).
>> 
>> In contrast, flush_tlb_mm_range() does call smp_mb().
>> 
>> Is there a reason for this discrepancy?
> 
> Not that I can remember.  Is the remote flush case likely to be racy?

You replied separately on another email that included a patch to fix
this case. It turns out smp_mb is not needed on flush_tlb_page, since
the PTE is always updated using an atomic operation. Yet, a compiler 
barrier is still needed, so I added smp_mb__after_atomic instead.

Nadav

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC 05/13] x86/mm: Add barriers and document switch_mm-vs-flush synchronization
  2016-01-08 23:15 ` [RFC 05/13] x86/mm: Add barriers and document switch_mm-vs-flush synchronization Andy Lutomirski
  2016-06-03 17:42   ` Nadav Amit
@ 2016-09-06  1:22   ` Wanpeng Li
  1 sibling, 0 replies; 43+ messages in thread
From: Wanpeng Li @ 2016-09-06  1:22 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: the arch/x86 maintainers, linux-kernel, Borislav Petkov,
	Brian Gerst, Dave Hansen, Linus Torvalds, Oleg Nesterov,
	linux-mm, stable

Hi Andy,
2016-01-09 7:15 GMT+08:00 Andy Lutomirski <luto@kernel.org>:
> When switch_mm activates a new pgd, it also sets a bit that tells
> other CPUs that the pgd is in use so that tlb flush IPIs will be
> sent.  In order for that to work correctly, the bit needs to be
> visible prior to loading the pgd and therefore starting to fill the
> local TLB.
>
> Document all the barriers that make this work correctly and add a
> couple that were missing.
>
> Cc: stable@vger.kernel.org
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  arch/x86/include/asm/mmu_context.h | 33 ++++++++++++++++++++++++++++++++-
>  arch/x86/mm/tlb.c                  | 29 ++++++++++++++++++++++++++---
>  2 files changed, 58 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
> index 379cd3658799..1edc9cd198b8 100644
> --- a/arch/x86/include/asm/mmu_context.h
> +++ b/arch/x86/include/asm/mmu_context.h
> @@ -116,8 +116,34 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
>  #endif
>                 cpumask_set_cpu(cpu, mm_cpumask(next));
>
> -               /* Re-load page tables */
> +               /*
> +                * Re-load page tables.
> +                *
> +                * This logic has an ordering constraint:
> +                *
> +                *  CPU 0: Write to a PTE for 'next'
> +                *  CPU 0: load bit 1 in mm_cpumask.  if nonzero, send IPI.
> +                *  CPU 1: set bit 1 in next's mm_cpumask
> +                *  CPU 1: load from the PTE that CPU 0 writes (implicit)
> +                *
> +                * We need to prevent an outcome in which CPU 1 observes
> +                * the new PTE value and CPU 0 observes bit 1 clear in
> +                * mm_cpumask.  (If that occurs, then the IPI will never
> +                * be sent, and CPU 0's TLB will contain a stale entry.)

I misunderstand this comments, CPU0 write to a PTE for 'next', and
CPU0 observes bit 1 clear in mm_cpumask, so CPU0 won't kick IPI to
CPU1, why CPU0's TLB will contain a stale entry instead of CPU1?

Regards,
Wanpeng Li

^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2016-09-06  1:22 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-01-08 23:15 [RFC 00/13] x86/mm: PCID and INVPCID Andy Lutomirski
2016-01-08 23:15 ` [RFC 01/13] x86/paravirt: Turn KASAN off for parvirt.o Andy Lutomirski
2016-01-10 18:59   ` Borislav Petkov
2016-01-11 12:51     ` Andrey Ryabinin
2016-01-11 12:51       ` [PATCH 1/2] x86/kasan: clear kasan_zero_page after TLB flush Andrey Ryabinin
2016-01-18 22:24         ` Andy Lutomirski
2016-02-09 16:06         ` [tip:x86/mm] x86/kasan: Clear " tip-bot for Andrey Ryabinin
2016-01-11 12:51       ` [PATCH 2/2] x86/kasan: write protect kasan zero shadow Andrey Ryabinin
2016-01-18 22:24         ` Andy Lutomirski
2016-02-09 16:07         ` [tip:x86/mm] x86/kasan: Write " tip-bot for Andrey Ryabinin
2016-01-29 10:35       ` [RFC 01/13] x86/paravirt: Turn KASAN off for parvirt.o Borislav Petkov
2016-01-08 23:15 ` [RFC 02/13] x86/mm: Add INVPCID helpers Andy Lutomirski
2016-01-08 23:15 ` [RFC 03/13] x86/mm: Add a noinvpcid option to turn off INVPCID Andy Lutomirski
2016-01-08 23:15 ` [RFC 04/13] x86/mm: If INVPCID is available, use it to flush global mappings Andy Lutomirski
2016-01-08 23:15 ` [RFC 05/13] x86/mm: Add barriers and document switch_mm-vs-flush synchronization Andy Lutomirski
2016-06-03 17:42   ` Nadav Amit
2016-06-09 17:24     ` Andy Lutomirski
2016-06-09 19:45       ` Nadav Amit
2016-09-06  1:22   ` Wanpeng Li
2016-01-08 23:15 ` [RFC 06/13] x86/mm: Disable PCID on 32-bit kernels Andy Lutomirski
2016-01-08 23:15 ` [RFC 07/13] x86/mm: Add nopcid to turn off PCID Andy Lutomirski
2016-01-08 23:15 ` [RFC 08/13] x86/mm: Teach CR3 readers about PCID Andy Lutomirski
2016-01-08 23:15 ` [RFC 09/13] x86/mm: Disable interrupts when flushing the TLB using CR3 Andy Lutomirski
2016-01-08 23:41   ` Linus Torvalds
2016-01-09  0:18     ` Andy Lutomirski
2016-01-09  2:20       ` Linus Torvalds
2016-01-11 10:51         ` Ingo Molnar
2016-01-13 23:32           ` Andy Lutomirski
2016-01-13 23:35         ` Andy Lutomirski
2016-01-13 23:43           ` Dave Hansen
2016-01-13 23:51             ` Andy Lutomirski
2016-01-13 23:56               ` Dave Hansen
2016-01-14  0:34                 ` Andy Lutomirski
2016-01-08 23:15 ` [RFC 10/13] x86/mm: Factor out remote TLB flushing Andy Lutomirski
2016-01-08 23:15 ` [RFC 11/13] x86/mm: Build arch/x86/mm/tlb.c even on !SMP Andy Lutomirski
2016-01-08 23:55   ` Dave Hansen
2016-01-08 23:15 ` [RFC 12/13] x86/mm: Uninline switch_mm Andy Lutomirski
2016-01-08 23:15 ` [RFC 13/13] x86/mm: Try to preserve old TLB entries using PCID Andy Lutomirski
2016-01-09  0:27   ` Dave Hansen
2016-01-09  2:19     ` Andy Lutomirski
2016-01-08 23:31 ` [RFC 00/13] x86/mm: PCID and INVPCID Linus Torvalds
2016-01-08 23:36   ` Andy Lutomirski
2016-01-08 23:42     ` Linus Torvalds

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).