* [PATCH v2 0/5] x86 switch_mm uninlining and IRQ improvements
@ 2016-04-26 16:39 Andy Lutomirski
  2016-04-26 16:39 ` [PATCH v2 1/5] arm: Include linux/preempt.h from asm/mmu_context.h Andy Lutomirski
                   ` (5 more replies)
  0 siblings, 6 replies; 14+ messages in thread
From: Andy Lutomirski @ 2016-04-26 16:39 UTC (permalink / raw)
  To: x86; +Cc: linux-kernel, Peter Zijlstra, Borislav Petkov, Andy Lutomirski

Hi all-

I've been playing with context switching lately, and I'm going to start
sending out some of the patches that should be mostly self-contained and
ready for -tip.

Here's a little batch to start improving switch_mm.  It uninlines it
and makes it run with IRQs off.  (AFAICT everyone who's modified it
thought it ran with IRQs off, but that's not always the case.  I
don't know of any bugs that this fixes, but it'll be needed for PCID
to avoid introducing really nasty races.)

This may also help a bit with FSGSBASE -- not sure yet.  It certainly
won't hurt.

It contains a trivial off-topic ARM patch to avoid breaking the build.
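
For orientation, the end state this series aims for looks roughly like the
sketch below -- an illustration distilled from patches 2 and 5, not the
literal patch text.  The generic header falls back to switch_mm, while x86
moves the real work into switch_mm_irqs_off() and turns switch_mm() into a
thin wrapper that masks IRQs around it:

  /* include/linux/mmu_context.h: generic fallback (patch 2) */
  #ifndef switch_mm_irqs_off
  # define switch_mm_irqs_off switch_mm
  #endif

  /* arch/x86/mm/tlb.c: switch_mm() becomes a wrapper that masks IRQs
   * around switch_mm_irqs_off(), which now does the real work (patch 5).
   */
  void switch_mm(struct mm_struct *prev, struct mm_struct *next,
                 struct task_struct *tsk)
  {
          unsigned long flags;

          local_irq_save(flags);
          switch_mm_irqs_off(prev, next, tsk);
          local_irq_restore(flags);
  }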

Andy Lutomirski (5):
  arm: Include linux/preempt.h from asm/mmu_context.h
  sched: Add switch_mm_irqs_off and use it in the scheduler
  x86/mm: Build arch/x86/mm/tlb.c even on !SMP
  x86/mm: Uninline switch_mm
  x86/mm: Turn off IRQs in switch_mm

 arch/arm/include/asm/mmu_context.h |   1 +
 arch/x86/include/asm/mmu_context.h | 101 ++------------------------------
 arch/x86/mm/Makefile               |   3 +-
 arch/x86/mm/tlb.c                  | 116 +++++++++++++++++++++++++++++++++++++
 include/linux/mmu_context.h        |   7 +++
 kernel/sched/core.c                |   6 +-
 6 files changed, 133 insertions(+), 101 deletions(-)

-- 
2.5.5


* [PATCH v2 1/5] arm: Include linux/preempt.h from asm/mmu_context.h
  2016-04-26 16:39 [PATCH v2 0/5] x86 switch_mm uninlining and IRQ improvements Andy Lutomirski
@ 2016-04-26 16:39 ` Andy Lutomirski
  2016-04-28 10:29   ` [tip:sched/core] sched/core, ARM: " tip-bot for Andy Lutomirski
  2016-04-28 13:27   ` [PATCH v2 1/5] arm: " Russell King - ARM Linux
  2016-04-26 16:39 ` [PATCH v2 2/5] sched: Add switch_mm_irqs_off and use it in the scheduler Andy Lutomirski
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 14+ messages in thread
From: Andy Lutomirski @ 2016-04-26 16:39 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Peter Zijlstra, Borislav Petkov, Andy Lutomirski,
	Russell King, Catalin Marinas

arm's mmu_context.h uses preempt_enable_no_resched() but doesn't
include anything that would pull in the declaration.

If I start including <asm/mmu_context.h> from <linux/mmu_context.h>
without this, the build breaks.

Cc: Russell King <linux@arm.linux.org.uk>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/arm/include/asm/mmu_context.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm/include/asm/mmu_context.h b/arch/arm/include/asm/mmu_context.h
index fa5b42d44985..ed73babc0dc9 100644
--- a/arch/arm/include/asm/mmu_context.h
+++ b/arch/arm/include/asm/mmu_context.h
@@ -15,6 +15,7 @@
 
 #include <linux/compiler.h>
 #include <linux/sched.h>
+#include <linux/preempt.h>
 #include <asm/cacheflush.h>
 #include <asm/cachetype.h>
 #include <asm/proc-fns.h>
-- 
2.5.5


* [PATCH v2 2/5] sched: Add switch_mm_irqs_off and use it in the scheduler
  2016-04-26 16:39 [PATCH v2 0/5] x86 switch_mm uninlining and IRQ improvements Andy Lutomirski
  2016-04-26 16:39 ` [PATCH v2 1/5] arm: Include linux/preempt.h from asm/mmu_context.h Andy Lutomirski
@ 2016-04-26 16:39 ` Andy Lutomirski
  2016-04-28 10:46   ` [tip:sched/core] sched/core: Add switch_mm_irqs_off() " tip-bot for Andy Lutomirski
  2016-04-26 16:39 ` [PATCH v2 3/5] x86/mm: Build arch/x86/mm/tlb.c even on !SMP Andy Lutomirski
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 14+ messages in thread
From: Andy Lutomirski @ 2016-04-26 16:39 UTC (permalink / raw)
  To: x86; +Cc: linux-kernel, Peter Zijlstra, Borislav Petkov, Andy Lutomirski

By default, this is the same thing as switch_mm().  x86 will override it
as an optimization.
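
The override detection uses the usual self-referential define idiom: an
architecture that provides its own implementation declares the function and
defines the macro to its own name, so the #ifndef fallback in
<linux/mmu_context.h> is skipped.  A sketch of the arch-side opt-in (this is
what the x86 patch later in this series does):

  /* <asm/mmu_context.h> on an architecture that opts in: declare the
   * function and define the macro to its own name so the generic
   * #ifndef fallback is not used.
   */
  extern void switch_mm_irqs_off(struct mm_struct *prev,
                                 struct mm_struct *next,
                                 struct task_struct *tsk);
  #define switch_mm_irqs_off switch_mm_irqs_off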

Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 include/linux/mmu_context.h | 7 +++++++
 kernel/sched/core.c         | 6 +++---
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/include/linux/mmu_context.h b/include/linux/mmu_context.h
index 70fffeba7495..f9e09613c113 100644
--- a/include/linux/mmu_context.h
+++ b/include/linux/mmu_context.h
@@ -1,9 +1,16 @@
 #ifndef _LINUX_MMU_CONTEXT_H
 #define _LINUX_MMU_CONTEXT_H
 
+#include <asm/mmu_context.h>
+
 struct mm_struct;
 
 void use_mm(struct mm_struct *mm);
 void unuse_mm(struct mm_struct *mm);
 
+/* Architectures that care about IRQ state in switch_mm can override this. */
+#ifndef switch_mm_irqs_off
+#define switch_mm_irqs_off switch_mm
+#endif
+
 #endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d8465eeab8b3..3636abcafac0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -33,7 +33,7 @@
 #include <linux/init.h>
 #include <linux/uaccess.h>
 #include <linux/highmem.h>
-#include <asm/mmu_context.h>
+#include <linux/mmu_context.h>
 #include <linux/interrupt.h>
 #include <linux/capability.h>
 #include <linux/completion.h>
@@ -2712,7 +2712,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
 		atomic_inc(&oldmm->mm_count);
 		enter_lazy_tlb(oldmm, next);
 	} else
-		switch_mm(oldmm, mm, next);
+		switch_mm_irqs_off(oldmm, mm, next);
 
 	if (!prev->mm) {
 		prev->active_mm = NULL;
@@ -5202,7 +5202,7 @@ void idle_task_exit(void)
 	BUG_ON(cpu_online(smp_processor_id()));
 
 	if (mm != &init_mm) {
-		switch_mm(mm, &init_mm, current);
+		switch_mm_irqs_off(mm, &init_mm, current);
 		finish_arch_post_lock_switch();
 	}
 	mmdrop(mm);
-- 
2.5.5


* [PATCH v2 3/5] x86/mm: Build arch/x86/mm/tlb.c even on !SMP
  2016-04-26 16:39 [PATCH v2 0/5] x86 switch_mm uninlining and IRQ improvements Andy Lutomirski
  2016-04-26 16:39 ` [PATCH v2 1/5] arm: Include linux/preempt.h from asm/mmu_context.h Andy Lutomirski
  2016-04-26 16:39 ` [PATCH v2 2/5] sched: Add switch_mm_irqs_off and use it in the scheduler Andy Lutomirski
@ 2016-04-26 16:39 ` Andy Lutomirski
  2016-04-28 10:47   ` [tip:sched/core] " tip-bot for Andy Lutomirski
  2016-04-26 16:39 ` [PATCH v2 4/5] x86/mm: Uninline switch_mm Andy Lutomirski
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 14+ messages in thread
From: Andy Lutomirski @ 2016-04-26 16:39 UTC (permalink / raw)
  To: x86; +Cc: linux-kernel, Peter Zijlstra, Borislav Petkov, Andy Lutomirski

Currently all of the functions that live in tlb.c are inlined on
!SMP builds.  One can debate whether this is a good idea (in many
respects the code in tlb.c is better than the inlined UP code).

Regardless, I want to add code that needs to be built on UP and SMP
kernels and relates to tlb flushing, so arrange for tlb.c to be
compiled unconditionally.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/mm/Makefile | 3 +--
 arch/x86/mm/tlb.c    | 4 ++++
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index f98913258c63..62c0043a5fd5 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -2,7 +2,7 @@
 KCOV_INSTRUMENT_tlb.o	:= n
 
 obj-y	:=  init.o init_$(BITS).o fault.o ioremap.o extable.o pageattr.o mmap.o \
-	    pat.o pgtable.o physaddr.o gup.o setup_nx.o
+	    pat.o pgtable.o physaddr.o gup.o setup_nx.o tlb.o
 
 # Make sure __phys_addr has no stackprotector
 nostackp := $(call cc-option, -fno-stack-protector)
@@ -12,7 +12,6 @@ CFLAGS_setup_nx.o		:= $(nostackp)
 CFLAGS_fault.o := -I$(src)/../include/asm/trace
 
 obj-$(CONFIG_X86_PAT)		+= pat_rbtree.o
-obj-$(CONFIG_SMP)		+= tlb.o
 
 obj-$(CONFIG_X86_32)		+= pgtable_32.o iomap_32.o
 
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 8f4cc3dfac32..3f7a7939e65e 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -28,6 +28,8 @@
  *	Implement flush IPI by CALL_FUNCTION_VECTOR, Alex Shi
  */
 
+#ifdef CONFIG_SMP
+
 struct flush_tlb_info {
 	struct mm_struct *flush_mm;
 	unsigned long flush_start;
@@ -347,3 +349,5 @@ static int __init create_tlb_single_page_flush_ceiling(void)
 	return 0;
 }
 late_initcall(create_tlb_single_page_flush_ceiling);
+
+#endif /* CONFIG_SMP */
-- 
2.5.5


* [PATCH v2 4/5] x86/mm: Uninline switch_mm
  2016-04-26 16:39 [PATCH v2 0/5] x86 switch_mm uninlining and IRQ improvements Andy Lutomirski
                   ` (2 preceding siblings ...)
  2016-04-26 16:39 ` [PATCH v2 3/5] x86/mm: Build arch/x86/mm/tlb.c even on !SMP Andy Lutomirski
@ 2016-04-26 16:39 ` Andy Lutomirski
  2016-04-28 10:47   ` [tip:sched/core] x86/mm, sched/core: Uninline switch_mm() tip-bot for Andy Lutomirski
  2016-04-26 16:39 ` [PATCH v2 5/5] x86/mm: Turn off IRQs in switch_mm Andy Lutomirski
  2016-04-27 18:09 ` [PATCH v2 0/5] x86 switch_mm uninlining and IRQ improvements Borislav Petkov
  5 siblings, 1 reply; 14+ messages in thread
From: Andy Lutomirski @ 2016-04-26 16:39 UTC (permalink / raw)
  To: x86; +Cc: linux-kernel, Peter Zijlstra, Borislav Petkov, Andy Lutomirski

It's fairly large and it has quite a few callers.  This may also
help untangle some headers down the road.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/mmu_context.h |  98 +----------------------------------
 arch/x86/mm/tlb.c                  | 102 +++++++++++++++++++++++++++++++++++++
 2 files changed, 104 insertions(+), 96 deletions(-)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 84280029cafd..bb911dd7cd01 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -115,103 +115,9 @@ static inline void destroy_context(struct mm_struct *mm)
 	destroy_context_ldt(mm);
 }
 
-static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
-			     struct task_struct *tsk)
-{
-	unsigned cpu = smp_processor_id();
+extern void switch_mm(struct mm_struct *prev, struct mm_struct *next,
+		      struct task_struct *tsk);
 
-	if (likely(prev != next)) {
-#ifdef CONFIG_SMP
-		this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
-		this_cpu_write(cpu_tlbstate.active_mm, next);
-#endif
-		cpumask_set_cpu(cpu, mm_cpumask(next));
-
-		/*
-		 * Re-load page tables.
-		 *
-		 * This logic has an ordering constraint:
-		 *
-		 *  CPU 0: Write to a PTE for 'next'
-		 *  CPU 0: load bit 1 in mm_cpumask.  if nonzero, send IPI.
-		 *  CPU 1: set bit 1 in next's mm_cpumask
-		 *  CPU 1: load from the PTE that CPU 0 writes (implicit)
-		 *
-		 * We need to prevent an outcome in which CPU 1 observes
-		 * the new PTE value and CPU 0 observes bit 1 clear in
-		 * mm_cpumask.  (If that occurs, then the IPI will never
-		 * be sent, and CPU 0's TLB will contain a stale entry.)
-		 *
-		 * The bad outcome can occur if either CPU's load is
-		 * reordered before that CPU's store, so both CPUs must
-		 * execute full barriers to prevent this from happening.
-		 *
-		 * Thus, switch_mm needs a full barrier between the
-		 * store to mm_cpumask and any operation that could load
-		 * from next->pgd.  TLB fills are special and can happen
-		 * due to instruction fetches or for no reason at all,
-		 * and neither LOCK nor MFENCE orders them.
-		 * Fortunately, load_cr3() is serializing and gives the
-		 * ordering guarantee we need.
-		 *
-		 */
-		load_cr3(next->pgd);
-
-		trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
-
-		/* Stop flush ipis for the previous mm */
-		cpumask_clear_cpu(cpu, mm_cpumask(prev));
-
-		/* Load per-mm CR4 state */
-		load_mm_cr4(next);
-
-#ifdef CONFIG_MODIFY_LDT_SYSCALL
-		/*
-		 * Load the LDT, if the LDT is different.
-		 *
-		 * It's possible that prev->context.ldt doesn't match
-		 * the LDT register.  This can happen if leave_mm(prev)
-		 * was called and then modify_ldt changed
-		 * prev->context.ldt but suppressed an IPI to this CPU.
-		 * In this case, prev->context.ldt != NULL, because we
-		 * never set context.ldt to NULL while the mm still
-		 * exists.  That means that next->context.ldt !=
-		 * prev->context.ldt, because mms never share an LDT.
-		 */
-		if (unlikely(prev->context.ldt != next->context.ldt))
-			load_mm_ldt(next);
-#endif
-	}
-#ifdef CONFIG_SMP
-	  else {
-		this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
-		BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next);
-
-		if (!cpumask_test_cpu(cpu, mm_cpumask(next))) {
-			/*
-			 * On established mms, the mm_cpumask is only changed
-			 * from irq context, from ptep_clear_flush() while in
-			 * lazy tlb mode, and here. Irqs are blocked during
-			 * schedule, protecting us from simultaneous changes.
-			 */
-			cpumask_set_cpu(cpu, mm_cpumask(next));
-
-			/*
-			 * We were in lazy tlb mode and leave_mm disabled
-			 * tlb flush IPI delivery. We must reload CR3
-			 * to make sure to use no freed page tables.
-			 *
-			 * As above, load_cr3() is serializing and orders TLB
-			 * fills with respect to the mm_cpumask write.
-			 */
-			load_cr3(next->pgd);
-			trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
-			load_mm_cr4(next);
-			load_mm_ldt(next);
-		}
-	}
-#endif
-}
 
 #define activate_mm(prev, next)			\
 do {						\
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 3f7a7939e65e..07558fabc222 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -59,6 +59,108 @@ void leave_mm(int cpu)
 }
 EXPORT_SYMBOL_GPL(leave_mm);
 
+#endif /* CONFIG_SMP */
+
+void switch_mm(struct mm_struct *prev, struct mm_struct *next,
+	       struct task_struct *tsk)
+{
+	unsigned cpu = smp_processor_id();
+
+	if (likely(prev != next)) {
+#ifdef CONFIG_SMP
+		this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
+		this_cpu_write(cpu_tlbstate.active_mm, next);
+#endif
+		cpumask_set_cpu(cpu, mm_cpumask(next));
+
+		/*
+		 * Re-load page tables.
+		 *
+		 * This logic has an ordering constraint:
+		 *
+		 *  CPU 0: Write to a PTE for 'next'
+		 *  CPU 0: load bit 1 in mm_cpumask.  if nonzero, send IPI.
+		 *  CPU 1: set bit 1 in next's mm_cpumask
+		 *  CPU 1: load from the PTE that CPU 0 writes (implicit)
+		 *
+		 * We need to prevent an outcome in which CPU 1 observes
+		 * the new PTE value and CPU 0 observes bit 1 clear in
+		 * mm_cpumask.  (If that occurs, then the IPI will never
+		 * be sent, and CPU 0's TLB will contain a stale entry.)
+		 *
+		 * The bad outcome can occur if either CPU's load is
+		 * reordered before that CPU's store, so both CPUs must
+		 * execute full barriers to prevent this from happening.
+		 *
+		 * Thus, switch_mm needs a full barrier between the
+		 * store to mm_cpumask and any operation that could load
+		 * from next->pgd.  TLB fills are special and can happen
+		 * due to instruction fetches or for no reason at all,
+		 * and neither LOCK nor MFENCE orders them.
+		 * Fortunately, load_cr3() is serializing and gives the
+		 * ordering guarantee we need.
+		 *
+		 */
+		load_cr3(next->pgd);
+
+		trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
+
+		/* Stop flush ipis for the previous mm */
+		cpumask_clear_cpu(cpu, mm_cpumask(prev));
+
+		/* Load per-mm CR4 state */
+		load_mm_cr4(next);
+
+#ifdef CONFIG_MODIFY_LDT_SYSCALL
+		/*
+		 * Load the LDT, if the LDT is different.
+		 *
+		 * It's possible that prev->context.ldt doesn't match
+		 * the LDT register.  This can happen if leave_mm(prev)
+		 * was called and then modify_ldt changed
+		 * prev->context.ldt but suppressed an IPI to this CPU.
+		 * In this case, prev->context.ldt != NULL, because we
+		 * never set context.ldt to NULL while the mm still
+		 * exists.  That means that next->context.ldt !=
+		 * prev->context.ldt, because mms never share an LDT.
+		 */
+		if (unlikely(prev->context.ldt != next->context.ldt))
+			load_mm_ldt(next);
+#endif
+	}
+#ifdef CONFIG_SMP
+	  else {
+		this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
+		BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next);
+
+		if (!cpumask_test_cpu(cpu, mm_cpumask(next))) {
+			/*
+			 * On established mms, the mm_cpumask is only changed
+			 * from irq context, from ptep_clear_flush() while in
+			 * lazy tlb mode, and here. Irqs are blocked during
+			 * schedule, protecting us from simultaneous changes.
+			 */
+			cpumask_set_cpu(cpu, mm_cpumask(next));
+
+			/*
+			 * We were in lazy tlb mode and leave_mm disabled
+			 * tlb flush IPI delivery. We must reload CR3
+			 * to make sure to use no freed page tables.
+			 *
+			 * As above, load_cr3() is serializing and orders TLB
+			 * fills with respect to the mm_cpumask write.
+			 */
+			load_cr3(next->pgd);
+			trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
+			load_mm_cr4(next);
+			load_mm_ldt(next);
+		}
+	}
+#endif
+}
+
+#ifdef CONFIG_SMP
+
 /*
  * The flush IPI assumes that a thread switch happens in this order:
  * [cpu0: the cpu that switches]
-- 
2.5.5


* [PATCH v2 5/5] x86/mm: Turn off IRQs in switch_mm
  2016-04-26 16:39 [PATCH v2 0/5] x86 switch_mm uninlining and IRQ improvements Andy Lutomirski
                   ` (3 preceding siblings ...)
  2016-04-26 16:39 ` [PATCH v2 4/5] x86/mm: Uninline switch_mm Andy Lutomirski
@ 2016-04-26 16:39 ` Andy Lutomirski
  2016-04-28 10:48   ` [tip:sched/core] x86/mm, sched/core: Turn off IRQs in switch_mm() tip-bot for Andy Lutomirski
  2016-04-27 18:09 ` [PATCH v2 0/5] x86 switch_mm uninlining and IRQ improvements Borislav Petkov
  5 siblings, 1 reply; 14+ messages in thread
From: Andy Lutomirski @ 2016-04-26 16:39 UTC (permalink / raw)
  To: x86; +Cc: linux-kernel, Peter Zijlstra, Borislav Petkov, Andy Lutomirski

Potential races between switch_mm and TLB-flush or LDT-flush IPIs
could be very messy.  AFAICT the code is currently okay, whether by
accident or by careful design, but enabling PCID will make it
considerably more complicated and will no longer be obviously safe.

Fix it with a big hammer: run switch_mm() with IRQs off.

To avoid a performance hit in the scheduler, we take advantage of
our knowledge that the scheduler already has IRQs disabled when it
calls switch_mm.
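
Concretely, patch 2 already makes the scheduler call the IRQs-off variant
directly, so the new local_irq_save()/local_irq_restore() in switch_mm() is
only paid by other callers (e.g. activate_mm() on the exec path), not by
context switches.  Illustrative call site:

  /* kernel/sched/core.c, context_switch(): IRQs are already disabled
   * here, so the scheduler calls the _irqs_off variant directly.
   */
  switch_mm_irqs_off(oldmm, mm, next);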

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/mmu_context.h |  3 +++
 arch/x86/mm/tlb.c                  | 10 ++++++++++
 2 files changed, 13 insertions(+)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index bb911dd7cd01..396348196aa7 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -118,6 +118,9 @@ static inline void destroy_context(struct mm_struct *mm)
 extern void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 		      struct task_struct *tsk);
 
+extern void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
+			       struct task_struct *tsk);
+#define switch_mm_irqs_off switch_mm_irqs_off
 
 #define activate_mm(prev, next)			\
 do {						\
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 07558fabc222..c93481569e3c 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -64,6 +64,16 @@ EXPORT_SYMBOL_GPL(leave_mm);
 void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 	       struct task_struct *tsk)
 {
+	unsigned long flags;
+
+	local_irq_save(flags);
+	switch_mm_irqs_off(prev, next, tsk);
+	local_irq_restore(flags);
+}
+
+void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
+			struct task_struct *tsk)
+{
 	unsigned cpu = smp_processor_id();
 
 	if (likely(prev != next)) {
-- 
2.5.5


* Re: [PATCH v2 0/5] x86 switch_mm uninlining and IRQ improvements
  2016-04-26 16:39 [PATCH v2 0/5] x86 switch_mm uninlining and IRQ improvements Andy Lutomirski
                   ` (4 preceding siblings ...)
  2016-04-26 16:39 ` [PATCH v2 5/5] x86/mm: Turn off IRQs in switch_mm Andy Lutomirski
@ 2016-04-27 18:09 ` Borislav Petkov
  5 siblings, 0 replies; 14+ messages in thread
From: Borislav Petkov @ 2016-04-27 18:09 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: x86, linux-kernel, Peter Zijlstra

On Tue, Apr 26, 2016 at 09:39:04AM -0700, Andy Lutomirski wrote:
> Hi all-
> 
> I've been playing with context switching lately, and I'm going to start
> sending out some of the patches that should be mostly self-contained and
> ready for -tip.
> 
> Here's a little batch to start improving switch_mm.  It uninlines it
> and makes it run with IRQs off.  (AFAICT everyone who's modified it
> thought it ran with IRQs off, but that's not always the case.  I
> don't know of any bugs that this fixes, but it'll be needed for PCID
> to avoid introducing really nasty races.)
> 
> This may also help a bit with FSGSBASE -- not sure yet.  It certainly
> won't hurt.
> 
> It contains a trivial off-topic ARM patch to avoid breaking the build.
> 
> Andy Lutomirski (5):
>   arm: Include linux/preempt.h from asm/mmu_context.h
>   sched: Add switch_mm_irqs_off and use it in the scheduler
>   x86/mm: Build arch/x86/mm/tlb.c even on !SMP
>   x86/mm: Uninline switch_mm
>   x86/mm: Turn off IRQs in switch_mm
> 
>  arch/arm/include/asm/mmu_context.h |   1 +
>  arch/x86/include/asm/mmu_context.h | 101 ++------------------------------
>  arch/x86/mm/Makefile               |   3 +-
>  arch/x86/mm/tlb.c                  | 116 +++++++++++++++++++++++++++++++++++++
>  include/linux/mmu_context.h        |   7 +++
>  kernel/sched/core.c                |   6 +-
>  6 files changed, 133 insertions(+), 101 deletions(-)

Reviewed-by: Borislav Petkov <bp@suse.de>

There was a concern that maybe disabling IRQs in
exec_mmap->activate_mm()->switch_mm() would be a little bit of a
slowdown but that's not a hot path anyway.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.


* [tip:sched/core] sched/core, ARM: Include linux/preempt.h from asm/mmu_context.h
  2016-04-26 16:39 ` [PATCH v2 1/5] arm: Include linux/preempt.h from asm/mmu_context.h Andy Lutomirski
@ 2016-04-28 10:29   ` tip-bot for Andy Lutomirski
  2016-04-28 10:41     ` Russell King - ARM Linux
  2016-04-28 13:27   ` [PATCH v2 1/5] arm: " Russell King - ARM Linux
  1 sibling, 1 reply; 14+ messages in thread
From: tip-bot for Andy Lutomirski @ 2016-04-28 10:29 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: bp, bp, hpa, peterz, linux-kernel, tglx, mingo, catalin.marinas,
	luto, torvalds, linux

Commit-ID:  88f10e37e150569a390be7a6161fa0f26b7372e9
Gitweb:     http://git.kernel.org/tip/88f10e37e150569a390be7a6161fa0f26b7372e9
Author:     Andy Lutomirski <luto@kernel.org>
AuthorDate: Tue, 26 Apr 2016 09:39:05 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 28 Apr 2016 11:08:41 +0200

sched/core, ARM: Include linux/preempt.h from asm/mmu_context.h

arm's mmu_context.h uses preempt_enable_no_resched() but doesn't
include anything that would pull in the declaration.

If I start including <asm/mmu_context.h> from <linux/mmu_context.h>
without this, the build breaks.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/5b95730a70f2dafe12d4fbf38d20eb7330d67ba3.1461688545.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/arm/include/asm/mmu_context.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm/include/asm/mmu_context.h b/arch/arm/include/asm/mmu_context.h
index fa5b42d..ed73bab 100644
--- a/arch/arm/include/asm/mmu_context.h
+++ b/arch/arm/include/asm/mmu_context.h
@@ -15,6 +15,7 @@
 
 #include <linux/compiler.h>
 #include <linux/sched.h>
+#include <linux/preempt.h>
 #include <asm/cacheflush.h>
 #include <asm/cachetype.h>
 #include <asm/proc-fns.h>


* Re: [tip:sched/core] sched/core, ARM: Include linux/preempt.h from asm/mmu_context.h
  2016-04-28 10:29   ` [tip:sched/core] sched/core, ARM: " tip-bot for Andy Lutomirski
@ 2016-04-28 10:41     ` Russell King - ARM Linux
  0 siblings, 0 replies; 14+ messages in thread
From: Russell King - ARM Linux @ 2016-04-28 10:41 UTC (permalink / raw)
  To: torvalds, luto, mingo, catalin.marinas, tglx, linux-kernel, bp,
	hpa, bp, peterz
  Cc: linux-tip-commits

I've not got around to looking at this yet as I've been away, and I'm
still catching up with email.

On Thu, Apr 28, 2016 at 03:29:57AM -0700, tip-bot for Andy Lutomirski wrote:
> Commit-ID:  88f10e37e150569a390be7a6161fa0f26b7372e9
> Gitweb:     http://git.kernel.org/tip/88f10e37e150569a390be7a6161fa0f26b7372e9
> Author:     Andy Lutomirski <luto@kernel.org>
> AuthorDate: Tue, 26 Apr 2016 09:39:05 -0700
> Committer:  Ingo Molnar <mingo@kernel.org>
> CommitDate: Thu, 28 Apr 2016 11:08:41 +0200
> 
> sched/core, ARM: Include linux/preempt.h from asm/mmu_context.h
> 
> arm's mmu_context.h uses preempt_enable_no_resched() but doesn't
> include anything that would pull in the declaration.
> 
> If I start including <asm/mmu_context.h> from <linux/mmu_context.h>
> without this, the build breaks.
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> Reviewed-by: Borislav Petkov <bp@suse.de>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Russell King <linux@arm.linux.org.uk>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Link: http://lkml.kernel.org/r/5b95730a70f2dafe12d4fbf38d20eb7330d67ba3.1461688545.git.luto@kernel.org
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> ---
>  arch/arm/include/asm/mmu_context.h | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/arch/arm/include/asm/mmu_context.h b/arch/arm/include/asm/mmu_context.h
> index fa5b42d..ed73bab 100644
> --- a/arch/arm/include/asm/mmu_context.h
> +++ b/arch/arm/include/asm/mmu_context.h
> @@ -15,6 +15,7 @@
>  
>  #include <linux/compiler.h>
>  #include <linux/sched.h>
> +#include <linux/preempt.h>
>  #include <asm/cacheflush.h>
>  #include <asm/cachetype.h>
>  #include <asm/proc-fns.h>

-- 
RMK's Patch system: http://www.arm.linux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.


* [tip:sched/core] sched/core: Add switch_mm_irqs_off() and use it in the scheduler
  2016-04-26 16:39 ` [PATCH v2 2/5] sched: Add switch_mm_irqs_off and use it in the scheduler Andy Lutomirski
@ 2016-04-28 10:46   ` tip-bot for Andy Lutomirski
  0 siblings, 0 replies; 14+ messages in thread
From: tip-bot for Andy Lutomirski @ 2016-04-28 10:46 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: torvalds, peterz, luto, bp, tglx, linux-kernel, bp, mingo, hpa

Commit-ID:  f98db6013c557c216da5038d9c52045be55cd039
Gitweb:     http://git.kernel.org/tip/f98db6013c557c216da5038d9c52045be55cd039
Author:     Andy Lutomirski <luto@kernel.org>
AuthorDate: Tue, 26 Apr 2016 09:39:06 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 28 Apr 2016 11:44:19 +0200

sched/core: Add switch_mm_irqs_off() and use it in the scheduler

By default, this is the same thing as switch_mm().

x86 will override it as an optimization.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/df401df47bdd6be3e389c6f1e3f5310d70e81b2c.1461688545.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/mmu_context.h | 7 +++++++
 kernel/sched/core.c         | 6 +++---
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/include/linux/mmu_context.h b/include/linux/mmu_context.h
index 70fffeb..a444178 100644
--- a/include/linux/mmu_context.h
+++ b/include/linux/mmu_context.h
@@ -1,9 +1,16 @@
 #ifndef _LINUX_MMU_CONTEXT_H
 #define _LINUX_MMU_CONTEXT_H
 
+#include <asm/mmu_context.h>
+
 struct mm_struct;
 
 void use_mm(struct mm_struct *mm);
 void unuse_mm(struct mm_struct *mm);
 
+/* Architectures that care about IRQ state in switch_mm can override this. */
+#ifndef switch_mm_irqs_off
+# define switch_mm_irqs_off switch_mm
+#endif
+
 #endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9d84d60..adcafda 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -33,7 +33,7 @@
 #include <linux/init.h>
 #include <linux/uaccess.h>
 #include <linux/highmem.h>
-#include <asm/mmu_context.h>
+#include <linux/mmu_context.h>
 #include <linux/interrupt.h>
 #include <linux/capability.h>
 #include <linux/completion.h>
@@ -2733,7 +2733,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
 		atomic_inc(&oldmm->mm_count);
 		enter_lazy_tlb(oldmm, next);
 	} else
-		switch_mm(oldmm, mm, next);
+		switch_mm_irqs_off(oldmm, mm, next);
 
 	if (!prev->mm) {
 		prev->active_mm = NULL;
@@ -5274,7 +5274,7 @@ void idle_task_exit(void)
 	BUG_ON(cpu_online(smp_processor_id()));
 
 	if (mm != &init_mm) {
-		switch_mm(mm, &init_mm, current);
+		switch_mm_irqs_off(mm, &init_mm, current);
 		finish_arch_post_lock_switch();
 	}
 	mmdrop(mm);


* [tip:sched/core] x86/mm: Build arch/x86/mm/tlb.c even on !SMP
  2016-04-26 16:39 ` [PATCH v2 3/5] x86/mm: Build arch/x86/mm/tlb.c even on !SMP Andy Lutomirski
@ 2016-04-28 10:47   ` tip-bot for Andy Lutomirski
  0 siblings, 0 replies; 14+ messages in thread
From: tip-bot for Andy Lutomirski @ 2016-04-28 10:47 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: bp, linux-kernel, bp, tglx, luto, peterz, hpa, torvalds, mingo

Commit-ID:  e1074888c326038340a1ada9129d679e661f2ea6
Gitweb:     http://git.kernel.org/tip/e1074888c326038340a1ada9129d679e661f2ea6
Author:     Andy Lutomirski <luto@kernel.org>
AuthorDate: Tue, 26 Apr 2016 09:39:07 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 28 Apr 2016 11:44:19 +0200

x86/mm: Build arch/x86/mm/tlb.c even on !SMP

Currently all of the functions that live in tlb.c are inlined on
!SMP builds.  One can debate whether this is a good idea (in many
respects the code in tlb.c is better than the inlined UP code).

Regardless, I want to add code that needs to be built on UP and SMP
kernels and relates to tlb flushing, so arrange for tlb.c to be
compiled unconditionally.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/f0d778f0d828fc46e5d1946bca80f0aaf9abf032.1461688545.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/mm/Makefile | 3 +--
 arch/x86/mm/tlb.c    | 4 ++++
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index f989132..62c0043 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -2,7 +2,7 @@
 KCOV_INSTRUMENT_tlb.o	:= n
 
 obj-y	:=  init.o init_$(BITS).o fault.o ioremap.o extable.o pageattr.o mmap.o \
-	    pat.o pgtable.o physaddr.o gup.o setup_nx.o
+	    pat.o pgtable.o physaddr.o gup.o setup_nx.o tlb.o
 
 # Make sure __phys_addr has no stackprotector
 nostackp := $(call cc-option, -fno-stack-protector)
@@ -12,7 +12,6 @@ CFLAGS_setup_nx.o		:= $(nostackp)
 CFLAGS_fault.o := -I$(src)/../include/asm/trace
 
 obj-$(CONFIG_X86_PAT)		+= pat_rbtree.o
-obj-$(CONFIG_SMP)		+= tlb.o
 
 obj-$(CONFIG_X86_32)		+= pgtable_32.o iomap_32.o
 
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index fe9b9f7..a4530e2 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -28,6 +28,8 @@
  *	Implement flush IPI by CALL_FUNCTION_VECTOR, Alex Shi
  */
 
+#ifdef CONFIG_SMP
+
 struct flush_tlb_info {
 	struct mm_struct *flush_mm;
 	unsigned long flush_start;
@@ -353,3 +355,5 @@ static int __init create_tlb_single_page_flush_ceiling(void)
 	return 0;
 }
 late_initcall(create_tlb_single_page_flush_ceiling);
+
+#endif /* CONFIG_SMP */


* [tip:sched/core] x86/mm, sched/core: Uninline switch_mm()
  2016-04-26 16:39 ` [PATCH v2 4/5] x86/mm: Uninline switch_mm Andy Lutomirski
@ 2016-04-28 10:47   ` tip-bot for Andy Lutomirski
  0 siblings, 0 replies; 14+ messages in thread
From: tip-bot for Andy Lutomirski @ 2016-04-28 10:47 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: torvalds, hpa, bp, bp, peterz, mingo, linux-kernel, tglx, luto

Commit-ID:  69c0319aabba45bcf33178916a2f06967b4adede
Gitweb:     http://git.kernel.org/tip/69c0319aabba45bcf33178916a2f06967b4adede
Author:     Andy Lutomirski <luto@kernel.org>
AuthorDate: Tue, 26 Apr 2016 09:39:08 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 28 Apr 2016 11:44:19 +0200

x86/mm, sched/core: Uninline switch_mm()

It's fairly large and it has quite a few callers.  This may also
help untangle some headers down the road.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/54f3367803e7f80b2be62c8a21879aa74b1a5f57.1461688545.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/mmu_context.h |  98 +----------------------------------
 arch/x86/mm/tlb.c                  | 102 +++++++++++++++++++++++++++++++++++++
 2 files changed, 104 insertions(+), 96 deletions(-)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 8428002..bb911dd 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -115,103 +115,9 @@ static inline void destroy_context(struct mm_struct *mm)
 	destroy_context_ldt(mm);
 }
 
-static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
-			     struct task_struct *tsk)
-{
-	unsigned cpu = smp_processor_id();
+extern void switch_mm(struct mm_struct *prev, struct mm_struct *next,
+		      struct task_struct *tsk);
 
-	if (likely(prev != next)) {
-#ifdef CONFIG_SMP
-		this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
-		this_cpu_write(cpu_tlbstate.active_mm, next);
-#endif
-		cpumask_set_cpu(cpu, mm_cpumask(next));
-
-		/*
-		 * Re-load page tables.
-		 *
-		 * This logic has an ordering constraint:
-		 *
-		 *  CPU 0: Write to a PTE for 'next'
-		 *  CPU 0: load bit 1 in mm_cpumask.  if nonzero, send IPI.
-		 *  CPU 1: set bit 1 in next's mm_cpumask
-		 *  CPU 1: load from the PTE that CPU 0 writes (implicit)
-		 *
-		 * We need to prevent an outcome in which CPU 1 observes
-		 * the new PTE value and CPU 0 observes bit 1 clear in
-		 * mm_cpumask.  (If that occurs, then the IPI will never
-		 * be sent, and CPU 0's TLB will contain a stale entry.)
-		 *
-		 * The bad outcome can occur if either CPU's load is
-		 * reordered before that CPU's store, so both CPUs must
-		 * execute full barriers to prevent this from happening.
-		 *
-		 * Thus, switch_mm needs a full barrier between the
-		 * store to mm_cpumask and any operation that could load
-		 * from next->pgd.  TLB fills are special and can happen
-		 * due to instruction fetches or for no reason at all,
-		 * and neither LOCK nor MFENCE orders them.
-		 * Fortunately, load_cr3() is serializing and gives the
-		 * ordering guarantee we need.
-		 *
-		 */
-		load_cr3(next->pgd);
-
-		trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
-
-		/* Stop flush ipis for the previous mm */
-		cpumask_clear_cpu(cpu, mm_cpumask(prev));
-
-		/* Load per-mm CR4 state */
-		load_mm_cr4(next);
-
-#ifdef CONFIG_MODIFY_LDT_SYSCALL
-		/*
-		 * Load the LDT, if the LDT is different.
-		 *
-		 * It's possible that prev->context.ldt doesn't match
-		 * the LDT register.  This can happen if leave_mm(prev)
-		 * was called and then modify_ldt changed
-		 * prev->context.ldt but suppressed an IPI to this CPU.
-		 * In this case, prev->context.ldt != NULL, because we
-		 * never set context.ldt to NULL while the mm still
-		 * exists.  That means that next->context.ldt !=
-		 * prev->context.ldt, because mms never share an LDT.
-		 */
-		if (unlikely(prev->context.ldt != next->context.ldt))
-			load_mm_ldt(next);
-#endif
-	}
-#ifdef CONFIG_SMP
-	  else {
-		this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
-		BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next);
-
-		if (!cpumask_test_cpu(cpu, mm_cpumask(next))) {
-			/*
-			 * On established mms, the mm_cpumask is only changed
-			 * from irq context, from ptep_clear_flush() while in
-			 * lazy tlb mode, and here. Irqs are blocked during
-			 * schedule, protecting us from simultaneous changes.
-			 */
-			cpumask_set_cpu(cpu, mm_cpumask(next));
-
-			/*
-			 * We were in lazy tlb mode and leave_mm disabled
-			 * tlb flush IPI delivery. We must reload CR3
-			 * to make sure to use no freed page tables.
-			 *
-			 * As above, load_cr3() is serializing and orders TLB
-			 * fills with respect to the mm_cpumask write.
-			 */
-			load_cr3(next->pgd);
-			trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
-			load_mm_cr4(next);
-			load_mm_ldt(next);
-		}
-	}
-#endif
-}
 
 #define activate_mm(prev, next)			\
 do {						\
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index a4530e2..ce7a0c9 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -59,6 +59,108 @@ void leave_mm(int cpu)
 }
 EXPORT_SYMBOL_GPL(leave_mm);
 
+#endif /* CONFIG_SMP */
+
+void switch_mm(struct mm_struct *prev, struct mm_struct *next,
+	       struct task_struct *tsk)
+{
+	unsigned cpu = smp_processor_id();
+
+	if (likely(prev != next)) {
+#ifdef CONFIG_SMP
+		this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
+		this_cpu_write(cpu_tlbstate.active_mm, next);
+#endif
+		cpumask_set_cpu(cpu, mm_cpumask(next));
+
+		/*
+		 * Re-load page tables.
+		 *
+		 * This logic has an ordering constraint:
+		 *
+		 *  CPU 0: Write to a PTE for 'next'
+		 *  CPU 0: load bit 1 in mm_cpumask.  if nonzero, send IPI.
+		 *  CPU 1: set bit 1 in next's mm_cpumask
+		 *  CPU 1: load from the PTE that CPU 0 writes (implicit)
+		 *
+		 * We need to prevent an outcome in which CPU 1 observes
+		 * the new PTE value and CPU 0 observes bit 1 clear in
+		 * mm_cpumask.  (If that occurs, then the IPI will never
+		 * be sent, and CPU 0's TLB will contain a stale entry.)
+		 *
+		 * The bad outcome can occur if either CPU's load is
+		 * reordered before that CPU's store, so both CPUs must
+		 * execute full barriers to prevent this from happening.
+		 *
+		 * Thus, switch_mm needs a full barrier between the
+		 * store to mm_cpumask and any operation that could load
+		 * from next->pgd.  TLB fills are special and can happen
+		 * due to instruction fetches or for no reason at all,
+		 * and neither LOCK nor MFENCE orders them.
+		 * Fortunately, load_cr3() is serializing and gives the
+		 * ordering guarantee we need.
+		 *
+		 */
+		load_cr3(next->pgd);
+
+		trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
+
+		/* Stop flush ipis for the previous mm */
+		cpumask_clear_cpu(cpu, mm_cpumask(prev));
+
+		/* Load per-mm CR4 state */
+		load_mm_cr4(next);
+
+#ifdef CONFIG_MODIFY_LDT_SYSCALL
+		/*
+		 * Load the LDT, if the LDT is different.
+		 *
+		 * It's possible that prev->context.ldt doesn't match
+		 * the LDT register.  This can happen if leave_mm(prev)
+		 * was called and then modify_ldt changed
+		 * prev->context.ldt but suppressed an IPI to this CPU.
+		 * In this case, prev->context.ldt != NULL, because we
+		 * never set context.ldt to NULL while the mm still
+		 * exists.  That means that next->context.ldt !=
+		 * prev->context.ldt, because mms never share an LDT.
+		 */
+		if (unlikely(prev->context.ldt != next->context.ldt))
+			load_mm_ldt(next);
+#endif
+	}
+#ifdef CONFIG_SMP
+	  else {
+		this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
+		BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next);
+
+		if (!cpumask_test_cpu(cpu, mm_cpumask(next))) {
+			/*
+			 * On established mms, the mm_cpumask is only changed
+			 * from irq context, from ptep_clear_flush() while in
+			 * lazy tlb mode, and here. Irqs are blocked during
+			 * schedule, protecting us from simultaneous changes.
+			 */
+			cpumask_set_cpu(cpu, mm_cpumask(next));
+
+			/*
+			 * We were in lazy tlb mode and leave_mm disabled
+			 * tlb flush IPI delivery. We must reload CR3
+			 * to make sure to use no freed page tables.
+			 *
+			 * As above, load_cr3() is serializing and orders TLB
+			 * fills with respect to the mm_cpumask write.
+			 */
+			load_cr3(next->pgd);
+			trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
+			load_mm_cr4(next);
+			load_mm_ldt(next);
+		}
+	}
+#endif
+}
+
+#ifdef CONFIG_SMP
+
 /*
  * The flush IPI assumes that a thread switch happens in this order:
  * [cpu0: the cpu that switches]


* [tip:sched/core] x86/mm, sched/core: Turn off IRQs in switch_mm()
  2016-04-26 16:39 ` [PATCH v2 5/5] x86/mm: Turn off IRQs in switch_mm Andy Lutomirski
@ 2016-04-28 10:48   ` tip-bot for Andy Lutomirski
  0 siblings, 0 replies; 14+ messages in thread
From: tip-bot for Andy Lutomirski @ 2016-04-28 10:48 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: peterz, mingo, bp, luto, tglx, hpa, torvalds, linux-kernel, bp

Commit-ID:  078194f8e9fe3cf54c8fd8bded48a1db5bd8eb8a
Gitweb:     http://git.kernel.org/tip/078194f8e9fe3cf54c8fd8bded48a1db5bd8eb8a
Author:     Andy Lutomirski <luto@kernel.org>
AuthorDate: Tue, 26 Apr 2016 09:39:09 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 28 Apr 2016 11:44:20 +0200

x86/mm, sched/core: Turn off IRQs in switch_mm()

Potential races between switch_mm() and TLB-flush or LDT-flush IPIs
could be very messy.  AFAICT the code is currently okay, whether by
accident or by careful design, but enabling PCID will make it
considerably more complicated and will no longer be obviously safe.

Fix it with a big hammer: run switch_mm() with IRQs off.

To avoid a performance hit in the scheduler, we take advantage of
our knowledge that the scheduler already has IRQs disabled when it
calls switch_mm().

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/f19baf759693c9dcae64bbff76189db77cb13398.1461688545.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/mmu_context.h |  3 +++
 arch/x86/mm/tlb.c                  | 10 ++++++++++
 2 files changed, 13 insertions(+)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index bb911dd..39634819 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -118,6 +118,9 @@ static inline void destroy_context(struct mm_struct *mm)
 extern void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 		      struct task_struct *tsk);
 
+extern void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
+			       struct task_struct *tsk);
+#define switch_mm_irqs_off switch_mm_irqs_off
 
 #define activate_mm(prev, next)			\
 do {						\
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index ce7a0c9..5643fd0 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -64,6 +64,16 @@ EXPORT_SYMBOL_GPL(leave_mm);
 void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 	       struct task_struct *tsk)
 {
+	unsigned long flags;
+
+	local_irq_save(flags);
+	switch_mm_irqs_off(prev, next, tsk);
+	local_irq_restore(flags);
+}
+
+void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
+			struct task_struct *tsk)
+{
 	unsigned cpu = smp_processor_id();
 
 	if (likely(prev != next)) {


* Re: [PATCH v2 1/5] arm: Include linux/preempt.h from asm/mmu_context.h
  2016-04-26 16:39 ` [PATCH v2 1/5] arm: Include linux/preempt.h from asm/mmu_context.h Andy Lutomirski
  2016-04-28 10:29   ` [tip:sched/core] sched/core, ARM: " tip-bot for Andy Lutomirski
@ 2016-04-28 13:27   ` Russell King - ARM Linux
  1 sibling, 0 replies; 14+ messages in thread
From: Russell King - ARM Linux @ 2016-04-28 13:27 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, linux-kernel, Peter Zijlstra, Borislav Petkov, Catalin Marinas

On Tue, Apr 26, 2016 at 09:39:05AM -0700, Andy Lutomirski wrote:
> arm's mmu_context.h uses preempt_enable_no_resched() but doesn't
> include anything that would pull in the declaration.
> 
> If I start including <asm/mmu_context.h> from <linux/mmu_context.h>
> without this, the build breaks.

From an inspection of the include paths, this looks okay.
> 
> Cc: Russell King <linux@arm.linux.org.uk>

Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>

Thanks.

-- 
RMK's Patch system: http://www.arm.linux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.

