* [PATCH v3 0/8] x86 TLB flush cleanups, moving toward PCID support
@ 2017-05-26  0:47 Andy Lutomirski
  2017-05-26  0:47 ` [PATCH v3 1/8] x86/mm: Pass flush_tlb_info to flush_tlb_others() etc Andy Lutomirski
                   ` (7 more replies)
  0 siblings, 8 replies; 15+ messages in thread
From: Andy Lutomirski @ 2017-05-26  0:47 UTC
  To: X86 ML
  Cc: linux-kernel, Borislav Petkov, Linus Torvalds, Andrew Morton,
	Mel Gorman, linux-mm, Nadav Amit, Andy Lutomirski

As I've been working on polishing my PCID code, a major problem I've
encountered is that there are too many x86 TLB flushing code paths and
that they have too many inconsequential differences.  The result was
that earlier versions of the PCID code were a colossal mess and very
difficult to understand.

This series goes a long way toward cleaning up the mess.  With all the
patches applied, there is a single function that contains the meat of
the code to flush the TLB on a given CPU, and all the tlb flushing
APIs call it for both local and remote CPUs.
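
For illustration, the shape of the end result is roughly this (a
simplified sketch; these helpers are introduced in patch 3):

  /* One common worker does the actual flush on whichever CPU runs it. */
  static void flush_tlb_func_common(const struct flush_tlb_info *f,
                                    bool local, enum tlb_flush_reason reason);

  /* The local path calls it directly... */
  static void flush_tlb_func_local(void *info, enum tlb_flush_reason reason);

  /* ...and the remote path calls it from the flush IPI handler. */
  static void flush_tlb_func_remote(void *info);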

This series should only adversely affect the kernel in a couple of
minor ways:

 - It makes smp_mb() unconditional when flushing TLBs.  We used to
   use the TLB flush itself to mostly avoid smp_mb() on the initiating
   CPU.

 - On UP kernels, we lose the dubious optimization of inlining nerfed
   variants of all the TLB flush APIs.  This bloats the kernel a tiny
   bit, although it should increase performance, since the SMP
   versions were better.

Patch 8 in here is a little bit off topic.  It's a cleanup that's
also needed before PCID can go in, but it's not directly about
TLB flushing.

Changes from v2:
 - Rebased onto tip:x86/mm to pick up UV and Xen changes.
 - Dropped the patches that Ingo already applied.

Changes from RFC:
 - Fixed missing call to arch_tlbbatch_flush().
 - "Be more consistent wrt PAGE_SHIFT vs PAGE_SIZE in tlb flush code" is new
 - Misc typos fixed.
 - Actually compiles when UV is enabled.

Andy Lutomirski (8):
  x86/mm: Pass flush_tlb_info to flush_tlb_others() etc
  x86/mm: Change the leave_mm() condition for local TLB flushes
  x86/mm: Refactor flush_tlb_mm_range() to merge local and remote cases
  x86/mm: Use new merged flush logic in arch_tlbbatch_flush()
  x86/mm: Remove the UP tlbflush code; always use the formerly SMP code
  x86/mm: Rework lazy TLB to track the actual loaded mm
  x86/mm: Be more consistent wrt PAGE_SHIFT vs PAGE_SIZE in tlb flush
    code
  x86,kvm: Teach KVM's VMX code that CR3 isn't a constant

 arch/x86/Kconfig                      |   2 +-
 arch/x86/events/core.c                |   3 +-
 arch/x86/include/asm/hardirq.h        |   2 +-
 arch/x86/include/asm/mmu.h            |   6 -
 arch/x86/include/asm/mmu_context.h    |  21 +-
 arch/x86/include/asm/paravirt.h       |   6 +-
 arch/x86/include/asm/paravirt_types.h |   5 +-
 arch/x86/include/asm/tlbbatch.h       |   2 -
 arch/x86/include/asm/tlbflush.h       | 104 ++-------
 arch/x86/include/asm/uv/uv.h          |   9 +-
 arch/x86/kernel/ldt.c                 |   7 +-
 arch/x86/kvm/vmx.c                    |  21 +-
 arch/x86/mm/init.c                    |   4 +-
 arch/x86/mm/tlb.c                     | 389 ++++++++++++++++------------------
 arch/x86/platform/uv/tlb_uv.c         |  10 +-
 arch/x86/xen/mmu_pv.c                 |  61 +++---
 16 files changed, 281 insertions(+), 371 deletions(-)

-- 
2.9.4


* [PATCH v3 1/8] x86/mm: Pass flush_tlb_info to flush_tlb_others() etc
  2017-05-26  0:47 [PATCH v3 0/8] x86 TLB flush cleanups, moving toward PCID support Andy Lutomirski
@ 2017-05-26  0:47 ` Andy Lutomirski
  2017-05-26  0:47 ` [PATCH v3 2/8] x86/mm: Change the leave_mm() condition for local TLB flushes Andy Lutomirski
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: Andy Lutomirski @ 2017-05-26  0:47 UTC
  To: X86 ML
  Cc: linux-kernel, Borislav Petkov, Linus Torvalds, Andrew Morton,
	Mel Gorman, linux-mm, Nadav Amit, Andy Lutomirski, Rik van Riel,
	Dave Hansen, Nadav Amit, Michal Hocko

Rather than passing all the contents of flush_tlb_info to
flush_tlb_others(), pass a pointer to the structure directly. For
consistency, this also removes the unnecessary cpu parameter from
uv_flush_tlb_others() to make its signature match the other
*flush_tlb_others() functions.

This serves two purposes:

 - It will dramatically simplify future patches that change struct
   flush_tlb_info, which I'm planning to do.

 - struct flush_tlb_info is an adequate description of what to do
   for a local flush, too, so by reusing it we can remove duplicated
   code between local and remote flushes in a future patch.
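
For illustration, a caller that used to pass mm, start, and end as
loose arguments now fills in the struct once and passes a pointer
(sketch based on the flush_tlb_mm_range() hunk below):

  struct flush_tlb_info info = {
          .mm = mm,
          .start = start,
          .end = end,
  };

  if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids)
          flush_tlb_others(mm_cpumask(mm), &info);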

Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/paravirt.h       |  6 ++--
 arch/x86/include/asm/paravirt_types.h |  5 ++-
 arch/x86/include/asm/tlbflush.h       | 19 ++++++-----
 arch/x86/include/asm/uv/uv.h          |  9 ++---
 arch/x86/mm/tlb.c                     | 64 +++++++++++++++++------------------
 arch/x86/platform/uv/tlb_uv.c         | 10 +++---
 arch/x86/xen/mmu_pv.c                 | 10 +++---
 7 files changed, 59 insertions(+), 64 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 55fa56fe4e45..9a15739d9f4b 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -312,11 +312,9 @@ static inline void __flush_tlb_single(unsigned long addr)
 }
 
 static inline void flush_tlb_others(const struct cpumask *cpumask,
-				    struct mm_struct *mm,
-				    unsigned long start,
-				    unsigned long end)
+				    const struct flush_tlb_info *info)
 {
-	PVOP_VCALL4(pv_mmu_ops.flush_tlb_others, cpumask, mm, start, end);
+	PVOP_VCALL2(pv_mmu_ops.flush_tlb_others, cpumask, info);
 }
 
 static inline int paravirt_pgd_alloc(struct mm_struct *mm)
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 7465d6fe336f..cb976bab6299 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -51,6 +51,7 @@ struct mm_struct;
 struct desc_struct;
 struct task_struct;
 struct cpumask;
+struct flush_tlb_info;
 
 /*
  * Wrapper type for pointers to code which uses the non-standard
@@ -223,9 +224,7 @@ struct pv_mmu_ops {
 	void (*flush_tlb_kernel)(void);
 	void (*flush_tlb_single)(unsigned long addr);
 	void (*flush_tlb_others)(const struct cpumask *cpus,
-				 struct mm_struct *mm,
-				 unsigned long start,
-				 unsigned long end);
+				 const struct flush_tlb_info *info);
 
 	/* Hooks for allocating and freeing a pagetable top-level */
 	int  (*pgd_alloc)(struct mm_struct *mm);
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 8f6e2f87511b..6f439ac92026 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -220,12 +220,18 @@ static inline void __flush_tlb_one(unsigned long addr)
  *  - flush_tlb_page(vma, vmaddr) flushes one page
  *  - flush_tlb_range(vma, start, end) flushes a range of pages
  *  - flush_tlb_kernel_range(start, end) flushes a range of kernel pages
- *  - flush_tlb_others(cpumask, mm, start, end) flushes TLBs on other cpus
+ *  - flush_tlb_others(cpumask, info) flushes TLBs on other cpus
  *
  * ..but the i386 has somewhat limited tlb flushing capabilities,
  * and page-granular flushes are available only on i486 and up.
  */
 
+struct flush_tlb_info {
+	struct mm_struct *mm;
+	unsigned long start;
+	unsigned long end;
+};
+
 #ifndef CONFIG_SMP
 
 /* "_up" is for UniProcessor.
@@ -279,9 +285,7 @@ static inline void flush_tlb_mm_range(struct mm_struct *mm,
 }
 
 static inline void native_flush_tlb_others(const struct cpumask *cpumask,
-					   struct mm_struct *mm,
-					   unsigned long start,
-					   unsigned long end)
+					   const struct flush_tlb_info *info)
 {
 }
 
@@ -317,8 +321,7 @@ static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
 }
 
 void native_flush_tlb_others(const struct cpumask *cpumask,
-				struct mm_struct *mm,
-				unsigned long start, unsigned long end);
+			     const struct flush_tlb_info *info);
 
 #define TLBSTATE_OK	1
 #define TLBSTATE_LAZY	2
@@ -340,8 +343,8 @@ extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
 #endif	/* SMP */
 
 #ifndef CONFIG_PARAVIRT
-#define flush_tlb_others(mask, mm, start, end)	\
-	native_flush_tlb_others(mask, mm, start, end)
+#define flush_tlb_others(mask, info)	\
+	native_flush_tlb_others(mask, info)
 #endif
 
 #endif /* _ASM_X86_TLBFLUSH_H */
diff --git a/arch/x86/include/asm/uv/uv.h b/arch/x86/include/asm/uv/uv.h
index 6686820feae9..d2db4517930c 100644
--- a/arch/x86/include/asm/uv/uv.h
+++ b/arch/x86/include/asm/uv/uv.h
@@ -15,10 +15,7 @@ extern void uv_cpu_init(void);
 extern void uv_nmi_init(void);
 extern void uv_system_init(void);
 extern const struct cpumask *uv_flush_tlb_others(const struct cpumask *cpumask,
-						 struct mm_struct *mm,
-						 unsigned long start,
-						 unsigned long end,
-						 unsigned int cpu);
+						 const struct flush_tlb_info *info);
 
 #else	/* X86_UV */
 
@@ -28,8 +25,8 @@ static inline int is_uv_hubless(void)	{ return 0; }
 static inline void uv_cpu_init(void)	{ }
 static inline void uv_system_init(void)	{ }
 static inline const struct cpumask *
-uv_flush_tlb_others(const struct cpumask *cpumask, struct mm_struct *mm,
-		    unsigned long start, unsigned long end, unsigned int cpu)
+uv_flush_tlb_others(const struct cpumask *cpumask,
+		    const struct flush_tlb_info *info)
 { return cpumask; }
 
 #endif	/* X86_UV */
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 743e4c6b4529..776469cc54e0 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -30,12 +30,6 @@
 
 #ifdef CONFIG_SMP
 
-struct flush_tlb_info {
-	struct mm_struct *flush_mm;
-	unsigned long flush_start;
-	unsigned long flush_end;
-};
-
 /*
  * We cannot call mmdrop() because we are in interrupt context,
  * instead update mm->cpu_vm_mask.
@@ -229,11 +223,11 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
  */
 static void flush_tlb_func(void *info)
 {
-	struct flush_tlb_info *f = info;
+	const struct flush_tlb_info *f = info;
 
 	inc_irq_stat(irq_tlb_count);
 
-	if (f->flush_mm && f->flush_mm != this_cpu_read(cpu_tlbstate.active_mm))
+	if (f->mm && f->mm != this_cpu_read(cpu_tlbstate.active_mm))
 		return;
 
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
@@ -243,15 +237,15 @@ static void flush_tlb_func(void *info)
 		return;
 	}
 
-	if (f->flush_end == TLB_FLUSH_ALL) {
+	if (f->end == TLB_FLUSH_ALL) {
 		local_flush_tlb();
 		trace_tlb_flush(TLB_REMOTE_SHOOTDOWN, TLB_FLUSH_ALL);
 	} else {
 		unsigned long addr;
 		unsigned long nr_pages =
-			(f->flush_end - f->flush_start) / PAGE_SIZE;
-		addr = f->flush_start;
-		while (addr < f->flush_end) {
+			(f->end - f->start) / PAGE_SIZE;
+		addr = f->start;
+		while (addr < f->end) {
 			__flush_tlb_single(addr);
 			addr += PAGE_SIZE;
 		}
@@ -260,33 +254,27 @@ static void flush_tlb_func(void *info)
 }
 
 void native_flush_tlb_others(const struct cpumask *cpumask,
-				 struct mm_struct *mm, unsigned long start,
-				 unsigned long end)
+			     const struct flush_tlb_info *info)
 {
-	struct flush_tlb_info info;
-
-	info.flush_mm = mm;
-	info.flush_start = start;
-	info.flush_end = end;
-
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
-	if (end == TLB_FLUSH_ALL)
+	if (info->end == TLB_FLUSH_ALL)
 		trace_tlb_flush(TLB_REMOTE_SEND_IPI, TLB_FLUSH_ALL);
 	else
 		trace_tlb_flush(TLB_REMOTE_SEND_IPI,
-				(end - start) >> PAGE_SHIFT);
+				(info->end - info->start) >> PAGE_SHIFT);
 
 	if (is_uv_system()) {
 		unsigned int cpu;
 
 		cpu = smp_processor_id();
-		cpumask = uv_flush_tlb_others(cpumask, mm, start, end, cpu);
+		cpumask = uv_flush_tlb_others(cpumask, info);
 		if (cpumask)
 			smp_call_function_many(cpumask, flush_tlb_func,
-								&info, 1);
+					       (void *)info, 1);
 		return;
 	}
-	smp_call_function_many(cpumask, flush_tlb_func, &info, 1);
+	smp_call_function_many(cpumask, flush_tlb_func,
+			       (void *)info, 1);
 }
 
 /*
@@ -305,6 +293,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 				unsigned long end, unsigned long vmflag)
 {
 	unsigned long addr;
+	struct flush_tlb_info info;
 	/* do a global flush by default */
 	unsigned long base_pages_to_flush = TLB_FLUSH_ALL;
 
@@ -347,15 +336,20 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 	}
 	trace_tlb_flush(TLB_LOCAL_MM_SHOOTDOWN, base_pages_to_flush);
 out:
+	info.mm = mm;
 	if (base_pages_to_flush == TLB_FLUSH_ALL) {
-		start = 0UL;
-		end = TLB_FLUSH_ALL;
+		info.start = 0UL;
+		info.end = TLB_FLUSH_ALL;
+	} else {
+		info.start = start;
+		info.end = end;
 	}
 	if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids)
-		flush_tlb_others(mm_cpumask(mm), mm, start, end);
+		flush_tlb_others(mm_cpumask(mm), &info);
 	preempt_enable();
 }
 
+
 static void do_flush_tlb_all(void *info)
 {
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
@@ -376,7 +370,7 @@ static void do_kernel_range_flush(void *info)
 	unsigned long addr;
 
 	/* flush range by one by one 'invlpg' */
-	for (addr = f->flush_start; addr < f->flush_end; addr += PAGE_SIZE)
+	for (addr = f->start; addr < f->end; addr += PAGE_SIZE)
 		__flush_tlb_single(addr);
 }
 
@@ -389,14 +383,20 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end)
 		on_each_cpu(do_flush_tlb_all, NULL, 1);
 	} else {
 		struct flush_tlb_info info;
-		info.flush_start = start;
-		info.flush_end = end;
+		info.start = start;
+		info.end = end;
 		on_each_cpu(do_kernel_range_flush, &info, 1);
 	}
 }
 
 void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 {
+	struct flush_tlb_info info = {
+		.mm = NULL,
+		.start = 0UL,
+		.end = TLB_FLUSH_ALL,
+	};
+
 	int cpu = get_cpu();
 
 	if (cpumask_test_cpu(cpu, &batch->cpumask)) {
@@ -406,7 +406,7 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 	}
 
 	if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids)
-		flush_tlb_others(&batch->cpumask, NULL, 0, TLB_FLUSH_ALL);
+		flush_tlb_others(&batch->cpumask, &info);
 	cpumask_clear(&batch->cpumask);
 
 	put_cpu();
diff --git a/arch/x86/platform/uv/tlb_uv.c b/arch/x86/platform/uv/tlb_uv.c
index 42e65fee5673..5a22d77ffc2a 100644
--- a/arch/x86/platform/uv/tlb_uv.c
+++ b/arch/x86/platform/uv/tlb_uv.c
@@ -1121,11 +1121,9 @@ static int set_distrib_bits(struct cpumask *flush_mask, struct bau_control *bcp,
  * done.  The returned pointer is valid till preemption is re-enabled.
  */
 const struct cpumask *uv_flush_tlb_others(const struct cpumask *cpumask,
-						struct mm_struct *mm,
-						unsigned long start,
-						unsigned long end,
-						unsigned int cpu)
+					  const struct flush_tlb_info *info)
 {
+	unsigned int cpu = smp_processor_id();
 	int locals = 0, remotes = 0, hubs = 0;
 	struct bau_desc *bau_desc;
 	struct cpumask *flush_mask;
@@ -1179,8 +1177,8 @@ const struct cpumask *uv_flush_tlb_others(const struct cpumask *cpumask,
 
 	record_send_statistics(stat, locals, hubs, remotes, bau_desc);
 
-	if (!end || (end - start) <= PAGE_SIZE)
-		address = start;
+	if (!info->end || (info->end - info->start) <= PAGE_SIZE)
+		address = info->start;
 	else
 		address = TLB_FLUSH_ALL;
 
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index 1f386d7fdf70..4b926c6b813c 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -1366,8 +1366,7 @@ static void xen_flush_tlb_single(unsigned long addr)
 }
 
 static void xen_flush_tlb_others(const struct cpumask *cpus,
-				 struct mm_struct *mm, unsigned long start,
-				 unsigned long end)
+				 const struct flush_tlb_info *info)
 {
 	struct {
 		struct mmuext_op op;
@@ -1379,7 +1378,7 @@ static void xen_flush_tlb_others(const struct cpumask *cpus,
 	} *args;
 	struct multicall_space mcs;
 
-	trace_xen_mmu_flush_tlb_others(cpus, mm, start, end);
+	trace_xen_mmu_flush_tlb_others(cpus, info->mm, info->start, info->end);
 
 	if (cpumask_empty(cpus))
 		return;		/* nothing to do */
@@ -1393,9 +1392,10 @@ static void xen_flush_tlb_others(const struct cpumask *cpus,
 	cpumask_clear_cpu(smp_processor_id(), to_cpumask(args->mask));
 
 	args->op.cmd = MMUEXT_TLB_FLUSH_MULTI;
-	if (end != TLB_FLUSH_ALL && (end - start) <= PAGE_SIZE) {
+	if (info->end != TLB_FLUSH_ALL &&
+	    (info->end - info->start) <= PAGE_SIZE) {
 		args->op.cmd = MMUEXT_INVLPG_MULTI;
-		args->op.arg1.linear_addr = start;
+		args->op.arg1.linear_addr = info->start;
 	}
 
 	MULTI_mmuext_op(mcs.mc, &args->op, 1, NULL, DOMID_SELF);
-- 
2.9.4


* [PATCH v3 2/8] x86/mm: Change the leave_mm() condition for local TLB flushes
  2017-05-26  0:47 [PATCH v3 0/8] x86 TLB flush cleanups, moving toward PCID support Andy Lutomirski
  2017-05-26  0:47 ` [PATCH v3 1/8] x86/mm: Pass flush_tlb_info to flush_tlb_others() etc Andy Lutomirski
@ 2017-05-26  0:47 ` Andy Lutomirski
  2017-05-26  1:39   ` Rik van Riel
  2017-05-26  1:43   ` Nadav Amit
  2017-05-26  0:47 ` [PATCH v3 3/8] x86/mm: Refactor flush_tlb_mm_range() to merge local and remote cases Andy Lutomirski
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 15+ messages in thread
From: Andy Lutomirski @ 2017-05-26  0:47 UTC
  To: X86 ML
  Cc: linux-kernel, Borislav Petkov, Linus Torvalds, Andrew Morton,
	Mel Gorman, linux-mm, Nadav Amit, Andy Lutomirski, Rik van Riel,
	Dave Hansen, Nadav Amit, Michal Hocko, Arjan van de Ven

On a remote TLB flush, we leave_mm() if we're TLBSTATE_LAZY.  For a
local flush_tlb_mm_range(), we leave_mm() if !current->mm.  These
are approximately the same condition -- the scheduler sets lazy TLB
mode when switching to a thread with no mm.

I'm about to merge the local and remote flush code, but for ease of
verifying and bisecting the patch, I want the local and remote flush
behavior to match first.  This patch changes the local code to match
the remote code.
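
For reference, the two conditions roughly coincide because the
scheduler marks lazy mode via enter_lazy_tlb() when switching to a
task with no mm (existing helper, quoted here for context):

  static inline void enter_lazy_tlb(struct mm_struct *mm,
                                    struct task_struct *tsk)
  {
          if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK)
                  this_cpu_write(cpu_tlbstate.state, TLBSTATE_LAZY);
  }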

Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/mm/tlb.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 776469cc54e0..3143c9a180e5 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -311,7 +311,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 		goto out;
 	}
 
-	if (!current->mm) {
+	if (this_cpu_read(cpu_tlbstate.state) != TLBSTATE_OK) {
 		leave_mm(smp_processor_id());
 
 		/* Synchronize with switch_mm. */
-- 
2.9.4


* [PATCH v3 3/8] x86/mm: Refactor flush_tlb_mm_range() to merge local and remote cases
  2017-05-26  0:47 [PATCH v3 0/8] x86 TLB flush cleanups, moving toward PCID support Andy Lutomirski
  2017-05-26  0:47 ` [PATCH v3 1/8] x86/mm: Pass flush_tlb_info to flush_tlb_others() etc Andy Lutomirski
  2017-05-26  0:47 ` [PATCH v3 2/8] x86/mm: Change the leave_mm() condition for local TLB flushes Andy Lutomirski
@ 2017-05-26  0:47 ` Andy Lutomirski
  2017-05-26  0:47 ` [PATCH v3 4/8] x86/mm: Use new merged flush logic in arch_tlbbatch_flush() Andy Lutomirski
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: Andy Lutomirski @ 2017-05-26  0:47 UTC
  To: X86 ML
  Cc: linux-kernel, Borislav Petkov, Linus Torvalds, Andrew Morton,
	Mel Gorman, linux-mm, Nadav Amit, Andy Lutomirski, Rik van Riel,
	Dave Hansen, Nadav Amit, Michal Hocko, Arjan van de Ven

The local flush path is very similar to the remote flush path.
Merge them.

This is intended to make no difference to behavior whatsoever.  It
removes some code and will make future changes to the flushing
mechanics simpler.

This patch does remove one small optimization: flush_tlb_mm_range()
now has an unconditional smp_mb() instead of using MOV to CR3 or
INVLPG as a full barrier when applicable.  I think this is okay for
a few reasons.  First, smp_mb() is quite cheap compared to the cost
of a TLB flush.  Second, this rearrangement makes a bigger
optimization available: with some work on the SMP function call
code, we could do the local and remote flushes in parallel.  Third,
I'm planning a rework of the TLB flush algorithm that will require
an atomic operation at the beginning of each flush, and that
operation will replace the smp_mb().
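
Concretely, the merged path pays one explicit barrier up front and
then treats the local CPU as just another flush target (rough sketch
of the resulting flush_tlb_mm_range() flow; the full hunk is below):

  cpu = get_cpu();

  /* Synchronize with switch_mm; now unconditional. */
  smp_mb();

  if (mm == current->active_mm)
          flush_tlb_func_local(&info, TLB_LOCAL_MM_SHOOTDOWN);
  if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
          flush_tlb_others(mm_cpumask(mm), &info);

  put_cpu();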

Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/tlbflush.h |   1 -
 arch/x86/mm/tlb.c               | 113 +++++++++++++++++-----------------------
 2 files changed, 48 insertions(+), 66 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 6f439ac92026..9934c7c99213 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -225,7 +225,6 @@ static inline void __flush_tlb_one(unsigned long addr)
  * ..but the i386 has somewhat limited tlb flushing capabilities,
  * and page-granular flushes are available only on i486 and up.
  */
-
 struct flush_tlb_info {
 	struct mm_struct *mm;
 	unsigned long start;
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 3143c9a180e5..12b8812e8926 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -216,22 +216,9 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
  * write/read ordering problems.
  */
 
-/*
- * TLB flush funcation:
- * 1) Flush the tlb entries if the cpu uses the mm that's being flushed.
- * 2) Leave the mm if we are in the lazy tlb mode.
- */
-static void flush_tlb_func(void *info)
+static void flush_tlb_func_common(const struct flush_tlb_info *f,
+				  bool local, enum tlb_flush_reason reason)
 {
-	const struct flush_tlb_info *f = info;
-
-	inc_irq_stat(irq_tlb_count);
-
-	if (f->mm && f->mm != this_cpu_read(cpu_tlbstate.active_mm))
-		return;
-
-	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
-
 	if (this_cpu_read(cpu_tlbstate.state) != TLBSTATE_OK) {
 		leave_mm(smp_processor_id());
 		return;
@@ -239,7 +226,9 @@ static void flush_tlb_func(void *info)
 
 	if (f->end == TLB_FLUSH_ALL) {
 		local_flush_tlb();
-		trace_tlb_flush(TLB_REMOTE_SHOOTDOWN, TLB_FLUSH_ALL);
+		if (local)
+			count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
+		trace_tlb_flush(reason, TLB_FLUSH_ALL);
 	} else {
 		unsigned long addr;
 		unsigned long nr_pages =
@@ -249,10 +238,32 @@ static void flush_tlb_func(void *info)
 			__flush_tlb_single(addr);
 			addr += PAGE_SIZE;
 		}
-		trace_tlb_flush(TLB_REMOTE_SHOOTDOWN, nr_pages);
+		if (local)
+			count_vm_tlb_events(NR_TLB_LOCAL_FLUSH_ONE, nr_pages);
+		trace_tlb_flush(reason, nr_pages);
 	}
 }
 
+static void flush_tlb_func_local(void *info, enum tlb_flush_reason reason)
+{
+	const struct flush_tlb_info *f = info;
+
+	flush_tlb_func_common(f, true, reason);
+}
+
+static void flush_tlb_func_remote(void *info)
+{
+	const struct flush_tlb_info *f = info;
+
+	inc_irq_stat(irq_tlb_count);
+
+	if (f->mm && f->mm != this_cpu_read(cpu_tlbstate.active_mm))
+		return;
+
+	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
+	flush_tlb_func_common(f, false, TLB_REMOTE_SHOOTDOWN);
+}
+
 void native_flush_tlb_others(const struct cpumask *cpumask,
 			     const struct flush_tlb_info *info)
 {
@@ -269,11 +280,11 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 		cpu = smp_processor_id();
 		cpumask = uv_flush_tlb_others(cpumask, info);
 		if (cpumask)
-			smp_call_function_many(cpumask, flush_tlb_func,
+			smp_call_function_many(cpumask, flush_tlb_func_remote,
 					       (void *)info, 1);
 		return;
 	}
-	smp_call_function_many(cpumask, flush_tlb_func,
+	smp_call_function_many(cpumask, flush_tlb_func_remote,
 			       (void *)info, 1);
 }
 
@@ -292,61 +303,33 @@ static unsigned long tlb_single_page_flush_ceiling __read_mostly = 33;
 void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 				unsigned long end, unsigned long vmflag)
 {
-	unsigned long addr;
-	struct flush_tlb_info info;
-	/* do a global flush by default */
-	unsigned long base_pages_to_flush = TLB_FLUSH_ALL;
-
-	preempt_disable();
+	int cpu;
 
-	if ((end != TLB_FLUSH_ALL) && !(vmflag & VM_HUGETLB))
-		base_pages_to_flush = (end - start) >> PAGE_SHIFT;
-	if (base_pages_to_flush > tlb_single_page_flush_ceiling)
-		base_pages_to_flush = TLB_FLUSH_ALL;
-
-	if (current->active_mm != mm) {
-		/* Synchronize with switch_mm. */
-		smp_mb();
-
-		goto out;
-	}
-
-	if (this_cpu_read(cpu_tlbstate.state) != TLBSTATE_OK) {
-		leave_mm(smp_processor_id());
+	struct flush_tlb_info info = {
+		.mm = mm,
+	};
 
-		/* Synchronize with switch_mm. */
-		smp_mb();
+	cpu = get_cpu();
 
-		goto out;
-	}
+	/* Synchronize with switch_mm. */
+	smp_mb();
 
-	/*
-	 * Both branches below are implicit full barriers (MOV to CR or
-	 * INVLPG) that synchronize with switch_mm.
-	 */
-	if (base_pages_to_flush == TLB_FLUSH_ALL) {
-		count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
-		local_flush_tlb();
+	/* Should we flush just the requested range? */
+	if ((end != TLB_FLUSH_ALL) &&
+	    !(vmflag & VM_HUGETLB) &&
+	    ((end - start) >> PAGE_SHIFT) <= tlb_single_page_flush_ceiling) {
+		info.start = start;
+		info.end = end;
 	} else {
-		/* flush range by one by one 'invlpg' */
-		for (addr = start; addr < end;	addr += PAGE_SIZE) {
-			count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ONE);
-			__flush_tlb_single(addr);
-		}
-	}
-	trace_tlb_flush(TLB_LOCAL_MM_SHOOTDOWN, base_pages_to_flush);
-out:
-	info.mm = mm;
-	if (base_pages_to_flush == TLB_FLUSH_ALL) {
 		info.start = 0UL;
 		info.end = TLB_FLUSH_ALL;
-	} else {
-		info.start = start;
-		info.end = end;
 	}
-	if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids)
+
+	if (mm == current->active_mm)
+		flush_tlb_func_local(&info, TLB_LOCAL_MM_SHOOTDOWN);
+	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
 		flush_tlb_others(mm_cpumask(mm), &info);
-	preempt_enable();
+	put_cpu();
 }
 
 
-- 
2.9.4


* [PATCH v3 4/8] x86/mm: Use new merged flush logic in arch_tlbbatch_flush()
  2017-05-26  0:47 [PATCH v3 0/8] x86 TLB flush cleanups, moving toward PCID support Andy Lutomirski
                   ` (2 preceding siblings ...)
  2017-05-26  0:47 ` [PATCH v3 3/8] x86/mm: Refactor flush_tlb_mm_range() to merge local and remote cases Andy Lutomirski
@ 2017-05-26  0:47 ` Andy Lutomirski
  2017-05-26  0:47 ` [PATCH v3 5/8] x86/mm: Remove the UP tlbflush code; always use the formerly SMP code Andy Lutomirski
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: Andy Lutomirski @ 2017-05-26  0:47 UTC
  To: X86 ML
  Cc: linux-kernel, Borislav Petkov, Linus Torvalds, Andrew Morton,
	Mel Gorman, linux-mm, Nadav Amit, Andy Lutomirski, Rik van Riel,
	Dave Hansen, Nadav Amit, Michal Hocko, Arjan van de Ven

Now there's only one copy of the local tlb flush logic for
non-kernel pages on SMP kernels.

The only functional change is that arch_tlbbatch_flush() will now
leave_mm() on the local CPU if that CPU is in the batch and is in
TLBSTATE_LAZY mode.
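
Concretely, the open-coded local flush collapses into the shared
helper, which is where the new leave_mm() behavior comes from
(sketch; the full diff is below):

  if (cpumask_test_cpu(cpu, &batch->cpumask))
          flush_tlb_func_local(&info, TLB_LOCAL_SHOOTDOWN);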

Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/mm/tlb.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 12b8812e8926..c03b4a0ce58c 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -382,12 +382,8 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 
 	int cpu = get_cpu();
 
-	if (cpumask_test_cpu(cpu, &batch->cpumask)) {
-		count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
-		local_flush_tlb();
-		trace_tlb_flush(TLB_LOCAL_SHOOTDOWN, TLB_FLUSH_ALL);
-	}
-
+	if (cpumask_test_cpu(cpu, &batch->cpumask))
+		flush_tlb_func_local(&info, TLB_LOCAL_SHOOTDOWN);
 	if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids)
 		flush_tlb_others(&batch->cpumask, &info);
 	cpumask_clear(&batch->cpumask);
-- 
2.9.4


* [PATCH v3 5/8] x86/mm: Remove the UP tlbflush code; always use the formerly SMP code
  2017-05-26  0:47 [PATCH v3 0/8] x86 TLB flush cleanups, moving toward PCID support Andy Lutomirski
                   ` (3 preceding siblings ...)
  2017-05-26  0:47 ` [PATCH v3 4/8] x86/mm: Use new merged flush logic in arch_tlbbatch_flush() Andy Lutomirski
@ 2017-05-26  0:47 ` Andy Lutomirski
  2017-05-26  0:47 ` [PATCH v3 6/8] x86/mm: Rework lazy TLB to track the actual loaded mm Andy Lutomirski
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: Andy Lutomirski @ 2017-05-26  0:47 UTC
  To: X86 ML
  Cc: linux-kernel, Borislav Petkov, Linus Torvalds, Andrew Morton,
	Mel Gorman, linux-mm, Nadav Amit, Andy Lutomirski, Rik van Riel,
	Dave Hansen, Nadav Amit, Michal Hocko, Arjan van de Ven

The UP tlbflush generates somewhat nicer code than the SMP version.
Aside from that, it's fallen quite a bit behind the SMP code:

 - flush_tlb_mm_range() didn't flush individual pages if the range
   was small.

 - The lazy TLB code was much weaker.  This usually wouldn't matter,
   but, if a kernel thread flushed its lazy "active_mm" more than
   once (due to reclaim or similar), it wouldn't be unlazied and
   would instead pointlessly flush repeatedly.

 - Tracepoints were missing.

Aside from that, simply having the UP code around was a maintenance
burden: any change to the TLB flush code had to take care not to
break it.

Simplify everything by deleting the UP code.

Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/Kconfig                   |  2 +-
 arch/x86/include/asm/hardirq.h     |  2 +-
 arch/x86/include/asm/mmu.h         |  6 ---
 arch/x86/include/asm/mmu_context.h |  2 -
 arch/x86/include/asm/tlbbatch.h    |  2 -
 arch/x86/include/asm/tlbflush.h    | 76 +-------------------------------------
 arch/x86/mm/init.c                 |  2 -
 arch/x86/mm/tlb.c                  | 17 +--------
 8 files changed, 5 insertions(+), 104 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index cd18994a9555..d65f87e083d1 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -69,7 +69,7 @@ config X86
 	select ARCH_USE_BUILTIN_BSWAP
 	select ARCH_USE_QUEUED_RWLOCKS
 	select ARCH_USE_QUEUED_SPINLOCKS
-	select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH if SMP
+	select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
 	select ARCH_WANT_FRAME_POINTERS
 	select ARCH_WANTS_DYNAMIC_TASK_STRUCT
 	select BUILDTIME_EXTABLE_SORT
diff --git a/arch/x86/include/asm/hardirq.h b/arch/x86/include/asm/hardirq.h
index 59405a248fc2..9b76cd331990 100644
--- a/arch/x86/include/asm/hardirq.h
+++ b/arch/x86/include/asm/hardirq.h
@@ -22,8 +22,8 @@ typedef struct {
 #ifdef CONFIG_SMP
 	unsigned int irq_resched_count;
 	unsigned int irq_call_count;
-	unsigned int irq_tlb_count;
 #endif
+	unsigned int irq_tlb_count;
 #ifdef CONFIG_X86_THERMAL_VECTOR
 	unsigned int irq_thermal_count;
 #endif
diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
index f9813b6d8b80..79b647a7ebd0 100644
--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -37,12 +37,6 @@ typedef struct {
 #endif
 } mm_context_t;
 
-#ifdef CONFIG_SMP
 void leave_mm(int cpu);
-#else
-static inline void leave_mm(int cpu)
-{
-}
-#endif
 
 #endif /* _ASM_X86_MMU_H */
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 68b329d77b3a..187c39470a0b 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -99,10 +99,8 @@ static inline void load_mm_ldt(struct mm_struct *mm)
 
 static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
 {
-#ifdef CONFIG_SMP
 	if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK)
 		this_cpu_write(cpu_tlbstate.state, TLBSTATE_LAZY);
-#endif
 }
 
 static inline int init_new_context(struct task_struct *tsk,
diff --git a/arch/x86/include/asm/tlbbatch.h b/arch/x86/include/asm/tlbbatch.h
index 01a6de16fb96..f4a6ff352a0e 100644
--- a/arch/x86/include/asm/tlbbatch.h
+++ b/arch/x86/include/asm/tlbbatch.h
@@ -3,7 +3,6 @@
 
 #include <linux/cpumask.h>
 
-#ifdef CONFIG_SMP
 struct arch_tlbflush_unmap_batch {
 	/*
 	 * Each bit set is a CPU that potentially has a TLB entry for one of
@@ -11,6 +10,5 @@ struct arch_tlbflush_unmap_batch {
 	 */
 	struct cpumask cpumask;
 };
-#endif
 
 #endif /* _ARCH_X86_TLBBATCH_H */
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 9934c7c99213..dbb5a9f0fed8 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -7,6 +7,7 @@
 #include <asm/processor.h>
 #include <asm/cpufeature.h>
 #include <asm/special_insns.h>
+#include <asm/smp.h>
 
 static inline void __invpcid(unsigned long pcid, unsigned long addr,
 			     unsigned long type)
@@ -65,10 +66,8 @@ static inline void invpcid_flush_all_nonglobals(void)
 #endif
 
 struct tlb_state {
-#ifdef CONFIG_SMP
 	struct mm_struct *active_mm;
 	int state;
-#endif
 
 	/*
 	 * Access to this CR4 shadow and to H/W CR4 is protected by
@@ -231,77 +230,6 @@ struct flush_tlb_info {
 	unsigned long end;
 };
 
-#ifndef CONFIG_SMP
-
-/* "_up" is for UniProcessor.
- *
- * This is a helper for other header functions.  *Not* intended to be called
- * directly.  All global TLB flushes need to either call this, or to bump the
- * vm statistics themselves.
- */
-static inline void __flush_tlb_up(void)
-{
-	count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
-	__flush_tlb();
-}
-
-static inline void flush_tlb_all(void)
-{
-	count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
-	__flush_tlb_all();
-}
-
-static inline void local_flush_tlb(void)
-{
-	__flush_tlb_up();
-}
-
-static inline void flush_tlb_mm(struct mm_struct *mm)
-{
-	if (mm == current->active_mm)
-		__flush_tlb_up();
-}
-
-static inline void flush_tlb_page(struct vm_area_struct *vma,
-				  unsigned long addr)
-{
-	if (vma->vm_mm == current->active_mm)
-		__flush_tlb_one(addr);
-}
-
-static inline void flush_tlb_range(struct vm_area_struct *vma,
-				   unsigned long start, unsigned long end)
-{
-	if (vma->vm_mm == current->active_mm)
-		__flush_tlb_up();
-}
-
-static inline void flush_tlb_mm_range(struct mm_struct *mm,
-	   unsigned long start, unsigned long end, unsigned long vmflag)
-{
-	if (mm == current->active_mm)
-		__flush_tlb_up();
-}
-
-static inline void native_flush_tlb_others(const struct cpumask *cpumask,
-					   const struct flush_tlb_info *info)
-{
-}
-
-static inline void reset_lazy_tlbstate(void)
-{
-}
-
-static inline void flush_tlb_kernel_range(unsigned long start,
-					  unsigned long end)
-{
-	flush_tlb_all();
-}
-
-#else  /* SMP */
-
-#include <asm/smp.h>
-
 #define local_flush_tlb() __flush_tlb()
 
 #define flush_tlb_mm(mm)	flush_tlb_mm_range(mm, 0UL, TLB_FLUSH_ALL, 0UL)
@@ -339,8 +267,6 @@ static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
 
 extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
 
-#endif	/* SMP */
-
 #ifndef CONFIG_PARAVIRT
 #define flush_tlb_others(mask, info)	\
 	native_flush_tlb_others(mask, info)
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index cbc87ea98751..c61183b57427 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -811,10 +811,8 @@ void __init zone_sizes_init(void)
 }
 
 DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate) = {
-#ifdef CONFIG_SMP
 	.active_mm = &init_mm,
 	.state = 0,
-#endif
 	.cr4 = ~0UL,	/* fail hard if we screw up cr4 shadow initialization */
 };
 EXPORT_SYMBOL_GPL(cpu_tlbstate);
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index c03b4a0ce58c..da1416c77bfb 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -15,7 +15,7 @@
 #include <linux/debugfs.h>
 
 /*
- *	Smarter SMP flushing macros.
+ *	TLB flushing, formerly SMP-only
  *		c/o Linus Torvalds.
  *
  *	These mean you can really definitely utterly forget about
@@ -28,8 +28,6 @@
  *	Implement flush IPI by CALL_FUNCTION_VECTOR, Alex Shi
  */
 
-#ifdef CONFIG_SMP
-
 /*
  * We cannot call mmdrop() because we are in interrupt context,
  * instead update mm->cpu_vm_mask.
@@ -53,8 +51,6 @@ void leave_mm(int cpu)
 }
 EXPORT_SYMBOL_GPL(leave_mm);
 
-#endif /* CONFIG_SMP */
-
 void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 	       struct task_struct *tsk)
 {
@@ -85,10 +81,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 				set_pgd(pgd, init_mm.pgd[stack_pgd_index]);
 		}
 
-#ifdef CONFIG_SMP
 		this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
 		this_cpu_write(cpu_tlbstate.active_mm, next);
-#endif
 
 		cpumask_set_cpu(cpu, mm_cpumask(next));
 
@@ -146,9 +140,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		if (unlikely(prev->context.ldt != next->context.ldt))
 			load_mm_ldt(next);
 #endif
-	}
-#ifdef CONFIG_SMP
-	  else {
+	} else {
 		this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
 		BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next);
 
@@ -175,11 +167,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 			load_mm_ldt(next);
 		}
 	}
-#endif
 }
 
-#ifdef CONFIG_SMP
-
 /*
  * The flush IPI assumes that a thread switch happens in this order:
  * [cpu0: the cpu that switches]
@@ -436,5 +425,3 @@ static int __init create_tlb_single_page_flush_ceiling(void)
 	return 0;
 }
 late_initcall(create_tlb_single_page_flush_ceiling);
-
-#endif /* CONFIG_SMP */
-- 
2.9.4


* [PATCH v3 6/8] x86/mm: Rework lazy TLB to track the actual loaded mm
  2017-05-26  0:47 [PATCH v3 0/8] x86 TLB flush cleanups, moving toward PCID support Andy Lutomirski
                   ` (4 preceding siblings ...)
  2017-05-26  0:47 ` [PATCH v3 5/8] x86/mm: Remove the UP tlbflush code; always use the formerly SMP code Andy Lutomirski
@ 2017-05-26  0:47 ` Andy Lutomirski
  2017-05-26  0:47 ` [PATCH v3 7/8] x86/mm: Be more consistent wrt PAGE_SHIFT vs PAGE_SIZE in tlb flush code Andy Lutomirski
  2017-05-26  0:47 ` [PATCH v3 8/8] x86,kvm: Teach KVM's VMX code that CR3 isn't a constant Andy Lutomirski
  7 siblings, 0 replies; 15+ messages in thread
From: Andy Lutomirski @ 2017-05-26  0:47 UTC
  To: X86 ML
  Cc: linux-kernel, Borislav Petkov, Linus Torvalds, Andrew Morton,
	Mel Gorman, linux-mm, Nadav Amit, Andy Lutomirski, Rik van Riel,
	Dave Hansen, Nadav Amit, Michal Hocko, Arjan van de Ven

Lazy TLB state is currently managed in a rather baroque manner.
AFAICT, there are three possible states:

 - Non-lazy.  This means that we're running a user thread or a
   kernel thread that has called use_mm().  current->mm ==
   current->active_mm == cpu_tlbstate.active_mm and
   cpu_tlbstate.state == TLBSTATE_OK.

 - Lazy with user mm.  We're running a kernel thread without an mm
   and we're borrowing an mm_struct.  We have current->mm == NULL,
   current->active_mm == cpu_tlbstate.active_mm, cpu_tlbstate.state
   != TLBSTATE_OK (i.e. TLBSTATE_LAZY or 0).  The current cpu is set
   in mm_cpumask(current->active_mm).  CR3 points to
   current->active_mm->pgd.  The TLB is up to date.

 - Lazy with init_mm.  This happens when we call leave_mm().  We
   have current->mm == NULL, current->active_mm ==
   cpu_tlbstate.active_mm, but that mm is only relevant insofar as
   the scheduler is tracking it for refcounting.  cpu_tlbstate.state
   != TLBSTATE_OK.  The current cpu is clear in
   mm_cpumask(current->active_mm).  CR3 points to swapper_pg_dir,
   i.e. init_mm->pgd.

This patch simplifies the situation.  Other than perf, x86 stops
caring about current->active_mm at all.  We have
cpu_tlbstate.loaded_mm pointing to the mm that CR3 references.  The
TLB is always up to date for that mm.  leave_mm() just switches us
to init_mm.  There are no longer any special cases for mm_cpumask,
and switch_mm() switches mms without worrying about laziness.

After this patch, cpu_tlbstate.state serves only to tell the TLB
flush code whether it may switch to init_mm instead of doing a
normal flush.
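
A sketch of the resulting invariant (simplified; the real leave_mm()
below also BUG()s if called while not lazy):

  /* cpu_tlbstate.loaded_mm matches CR3 whenever interrupts are on. */
  void leave_mm(int cpu)
  {
          if (this_cpu_read(cpu_tlbstate.loaded_mm) == &init_mm)
                  return;         /* no user TLB entries to worry about */

          switch_mm(NULL, &init_mm, NULL);        /* a plain mm switch now */
  }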

This makes fairly extensive changes to xen_exit_mmap(), which used
to look a bit like black magic.

Perf is unchanged.  With or without this change, perf may behave a bit
erratically if it tries to read user memory in kernel thread context.
We should build on this patch to teach perf to never look at user
memory when cpu_tlbstate.loaded_mm != current->mm.

Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/events/core.c          |   3 +-
 arch/x86/include/asm/tlbflush.h |  12 ++-
 arch/x86/kernel/ldt.c           |   7 +-
 arch/x86/mm/init.c              |   2 +-
 arch/x86/mm/tlb.c               | 208 ++++++++++++++++++++--------------------
 arch/x86/xen/mmu_pv.c           |  51 +++++-----
 6 files changed, 143 insertions(+), 140 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 580b60f5ac83..77a33096728d 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2101,8 +2101,7 @@ static int x86_pmu_event_init(struct perf_event *event)
 
 static void refresh_pce(void *ignored)
 {
-	if (current->active_mm)
-		load_mm_cr4(current->active_mm);
+	load_mm_cr4(this_cpu_read(cpu_tlbstate.loaded_mm));
 }
 
 static void x86_pmu_event_mapped(struct perf_event *event)
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index dbb5a9f0fed8..388c2463fde6 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -66,7 +66,13 @@ static inline void invpcid_flush_all_nonglobals(void)
 #endif
 
 struct tlb_state {
-	struct mm_struct *active_mm;
+	/*
+	 * cpu_tlbstate.loaded_mm should match CR3 whenever interrupts
+	 * are on.  This means that it may not match current->active_mm,
+	 * which will contain the previous user mm when we're in lazy TLB
+	 * mode even if we've already switched back to swapper_pg_dir.
+	 */
+	struct mm_struct *loaded_mm;
 	int state;
 
 	/*
@@ -256,7 +262,9 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 static inline void reset_lazy_tlbstate(void)
 {
 	this_cpu_write(cpu_tlbstate.state, 0);
-	this_cpu_write(cpu_tlbstate.active_mm, &init_mm);
+	this_cpu_write(cpu_tlbstate.loaded_mm, &init_mm);
+
+	WARN_ON(read_cr3() != __pa_symbol(swapper_pg_dir));
 }
 
 static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
index d4a15831ac58..de503e7a64ad 100644
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -22,14 +22,15 @@
 #include <asm/syscalls.h>
 
 /* context.lock is held for us, so we don't need any locking. */
-static void flush_ldt(void *current_mm)
+static void flush_ldt(void *__mm)
 {
+	struct mm_struct *mm = __mm;
 	mm_context_t *pc;
 
-	if (current->active_mm != current_mm)
+	if (this_cpu_read(cpu_tlbstate.loaded_mm) != mm)
 		return;
 
-	pc = &current->active_mm->context;
+	pc = &mm->context;
 	set_ldt(pc->ldt->entries, pc->ldt->size);
 }
 
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index c61183b57427..88ee942cb47d 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -811,7 +811,7 @@ void __init zone_sizes_init(void)
 }
 
 DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate) = {
-	.active_mm = &init_mm,
+	.loaded_mm = &init_mm,
 	.state = 0,
 	.cr4 = ~0UL,	/* fail hard if we screw up cr4 shadow initialization */
 };
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index da1416c77bfb..4bfadb869a1e 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -34,20 +34,19 @@
  */
 void leave_mm(int cpu)
 {
-	struct mm_struct *active_mm = this_cpu_read(cpu_tlbstate.active_mm);
+	struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
 	if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK)
 		BUG();
-	if (cpumask_test_cpu(cpu, mm_cpumask(active_mm))) {
-		cpumask_clear_cpu(cpu, mm_cpumask(active_mm));
-		load_cr3(swapper_pg_dir);
-		/*
-		 * This gets called in the idle path where RCU
-		 * functions differently.  Tracing normally
-		 * uses RCU, so we have to call the tracepoint
-		 * specially here.
-		 */
-		trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
-	}
+
+	/*
+	 * It's plausible that we're in lazy TLB mode while our mm is init_mm.
+	 * If so, our callers still expect us to flush the TLB, but there
+	 * aren't any user TLB entries in init_mm to worry about.
+	 */
+	if (loaded_mm == &init_mm)
+		return;
+
+	switch_mm(NULL, &init_mm, NULL);
 }
 EXPORT_SYMBOL_GPL(leave_mm);
 
@@ -65,108 +64,109 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 			struct task_struct *tsk)
 {
 	unsigned cpu = smp_processor_id();
+	struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm);
 
-	if (likely(prev != next)) {
-		if (IS_ENABLED(CONFIG_VMAP_STACK)) {
-			/*
-			 * If our current stack is in vmalloc space and isn't
-			 * mapped in the new pgd, we'll double-fault.  Forcibly
-			 * map it.
-			 */
-			unsigned int stack_pgd_index = pgd_index(current_stack_pointer());
+	/*
+	 * NB: The scheduler will call us with prev == next when
+	 * switching from lazy TLB mode to normal mode if active_mm
+	 * isn't changing.  When this happens, there is no guarantee
+	 * that CR3 (and hence cpu_tlbstate.loaded_mm) matches next.
+	 *
+	 * NB: leave_mm() calls us with prev == NULL and tsk == NULL.
+	 */
 
-			pgd_t *pgd = next->pgd + stack_pgd_index;
+	this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
 
-			if (unlikely(pgd_none(*pgd)))
-				set_pgd(pgd, init_mm.pgd[stack_pgd_index]);
-		}
-
-		this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
-		this_cpu_write(cpu_tlbstate.active_mm, next);
-
-		cpumask_set_cpu(cpu, mm_cpumask(next));
+	if (real_prev == next) {
+		/*
+		 * There's nothing to do: we always keep the per-mm control
+		 * regs in sync with cpu_tlbstate.loaded_mm.  Just
+		 * sanity-check mm_cpumask.
+		 */
+		if (WARN_ON_ONCE(!cpumask_test_cpu(cpu, mm_cpumask(next))))
+			cpumask_set_cpu(cpu, mm_cpumask(next));
+		return;
+	}
 
+	if (IS_ENABLED(CONFIG_VMAP_STACK)) {
 		/*
-		 * Re-load page tables.
-		 *
-		 * This logic has an ordering constraint:
-		 *
-		 *  CPU 0: Write to a PTE for 'next'
-		 *  CPU 0: load bit 1 in mm_cpumask.  if nonzero, send IPI.
-		 *  CPU 1: set bit 1 in next's mm_cpumask
-		 *  CPU 1: load from the PTE that CPU 0 writes (implicit)
-		 *
-		 * We need to prevent an outcome in which CPU 1 observes
-		 * the new PTE value and CPU 0 observes bit 1 clear in
-		 * mm_cpumask.  (If that occurs, then the IPI will never
-		 * be sent, and CPU 0's TLB will contain a stale entry.)
-		 *
-		 * The bad outcome can occur if either CPU's load is
-		 * reordered before that CPU's store, so both CPUs must
-		 * execute full barriers to prevent this from happening.
-		 *
-		 * Thus, switch_mm needs a full barrier between the
-		 * store to mm_cpumask and any operation that could load
-		 * from next->pgd.  TLB fills are special and can happen
-		 * due to instruction fetches or for no reason at all,
-		 * and neither LOCK nor MFENCE orders them.
-		 * Fortunately, load_cr3() is serializing and gives the
-		 * ordering guarantee we need.
-		 *
+		 * If our current stack is in vmalloc space and isn't
+		 * mapped in the new pgd, we'll double-fault.  Forcibly
+		 * map it.
 		 */
-		load_cr3(next->pgd);
+		unsigned int stack_pgd_index = pgd_index(current_stack_pointer());
 
-		trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
+		pgd_t *pgd = next->pgd + stack_pgd_index;
 
-		/* Stop flush ipis for the previous mm */
-		cpumask_clear_cpu(cpu, mm_cpumask(prev));
+		if (unlikely(pgd_none(*pgd)))
+			set_pgd(pgd, init_mm.pgd[stack_pgd_index]);
+	}
 
-		/* Load per-mm CR4 state */
-		load_mm_cr4(next);
+	this_cpu_write(cpu_tlbstate.loaded_mm, next);
+
+	WARN_ON_ONCE(cpumask_test_cpu(cpu, mm_cpumask(next)));
+	cpumask_set_cpu(cpu, mm_cpumask(next));
+
+	/*
+	 * Re-load page tables.
+	 *
+	 * This logic has an ordering constraint:
+	 *
+	 *  CPU 0: Write to a PTE for 'next'
+	 *  CPU 0: load bit 1 in mm_cpumask.  if nonzero, send IPI.
+	 *  CPU 1: set bit 1 in next's mm_cpumask
+	 *  CPU 1: load from the PTE that CPU 0 writes (implicit)
+	 *
+	 * We need to prevent an outcome in which CPU 1 observes
+	 * the new PTE value and CPU 0 observes bit 1 clear in
+	 * mm_cpumask.  (If that occurs, then the IPI will never
+	 * be sent, and CPU 0's TLB will contain a stale entry.)
+	 *
+	 * The bad outcome can occur if either CPU's load is
+	 * reordered before that CPU's store, so both CPUs must
+	 * execute full barriers to prevent this from happening.
+	 *
+	 * Thus, switch_mm needs a full barrier between the
+	 * store to mm_cpumask and any operation that could load
+	 * from next->pgd.  TLB fills are special and can happen
+	 * due to instruction fetches or for no reason at all,
+	 * and neither LOCK nor MFENCE orders them.
+	 * Fortunately, load_cr3() is serializing and gives the
+	 * ordering guarantee we need.
+	 */
+	load_cr3(next->pgd);
+
+	/*
+	 * This gets called via leave_mm() in the idle path where RCU
+	 * functions differently.  Tracing normally uses RCU, so we have to
+	 * call the tracepoint specially here.
+	 */
+	trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
+
+	/* Stop flush ipis for the previous mm */
+	WARN_ON_ONCE(!cpumask_test_cpu(cpu, mm_cpumask(real_prev)) &&
+		     real_prev != &init_mm);
+	cpumask_clear_cpu(cpu, mm_cpumask(real_prev));
+
+	/* Load per-mm CR4 state */
+	load_mm_cr4(next);
 
 #ifdef CONFIG_MODIFY_LDT_SYSCALL
-		/*
-		 * Load the LDT, if the LDT is different.
-		 *
-		 * It's possible that prev->context.ldt doesn't match
-		 * the LDT register.  This can happen if leave_mm(prev)
-		 * was called and then modify_ldt changed
-		 * prev->context.ldt but suppressed an IPI to this CPU.
-		 * In this case, prev->context.ldt != NULL, because we
-		 * never set context.ldt to NULL while the mm still
-		 * exists.  That means that next->context.ldt !=
-		 * prev->context.ldt, because mms never share an LDT.
-		 */
-		if (unlikely(prev->context.ldt != next->context.ldt))
-			load_mm_ldt(next);
+	/*
+	 * Load the LDT, if the LDT is different.
+	 *
+	 * It's possible that prev->context.ldt doesn't match
+	 * the LDT register.  This can happen if leave_mm(prev)
+	 * was called and then modify_ldt changed
+	 * prev->context.ldt but suppressed an IPI to this CPU.
+	 * In this case, prev->context.ldt != NULL, because we
+	 * never set context.ldt to NULL while the mm still
+	 * exists.  That means that next->context.ldt !=
+	 * prev->context.ldt, because mms never share an LDT.
+	 */
+	if (unlikely(real_prev->context.ldt != next->context.ldt))
+		load_mm_ldt(next);
 #endif
-	} else {
-		this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
-		BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next);
-
-		if (!cpumask_test_cpu(cpu, mm_cpumask(next))) {
-			/*
-			 * On established mms, the mm_cpumask is only changed
-			 * from irq context, from ptep_clear_flush() while in
-			 * lazy tlb mode, and here. Irqs are blocked during
-			 * schedule, protecting us from simultaneous changes.
-			 */
-			cpumask_set_cpu(cpu, mm_cpumask(next));
-
-			/*
-			 * We were in lazy tlb mode and leave_mm disabled
-			 * tlb flush IPI delivery. We must reload CR3
-			 * to make sure to use no freed page tables.
-			 *
-			 * As above, load_cr3() is serializing and orders TLB
-			 * fills with respect to the mm_cpumask write.
-			 */
-			load_cr3(next->pgd);
-			trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
-			load_mm_cr4(next);
-			load_mm_ldt(next);
-		}
-	}
 }
 
 /*
@@ -246,7 +246,7 @@ static void flush_tlb_func_remote(void *info)
 
 	inc_irq_stat(irq_tlb_count);
 
-	if (f->mm && f->mm != this_cpu_read(cpu_tlbstate.active_mm))
+	if (f->mm && f->mm != this_cpu_read(cpu_tlbstate.loaded_mm))
 		return;
 
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
@@ -314,7 +314,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 		info.end = TLB_FLUSH_ALL;
 	}
 
-	if (mm == current->active_mm)
+	if (mm == this_cpu_read(cpu_tlbstate.loaded_mm))
 		flush_tlb_func_local(&info, TLB_LOCAL_MM_SHOOTDOWN);
 	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
 		flush_tlb_others(mm_cpumask(mm), &info);
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index 4b926c6b813c..21beb37114b7 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -975,37 +975,32 @@ static void xen_dup_mmap(struct mm_struct *oldmm, struct mm_struct *mm)
 	spin_unlock(&mm->page_table_lock);
 }
 
-
-#ifdef CONFIG_SMP
-/* Another cpu may still have their %cr3 pointing at the pagetable, so
-   we need to repoint it somewhere else before we can unpin it. */
-static void drop_other_mm_ref(void *info)
+static void drop_mm_ref_this_cpu(void *info)
 {
 	struct mm_struct *mm = info;
-	struct mm_struct *active_mm;
-
-	active_mm = this_cpu_read(cpu_tlbstate.active_mm);
 
-	if (active_mm == mm && this_cpu_read(cpu_tlbstate.state) != TLBSTATE_OK)
+	if (this_cpu_read(cpu_tlbstate.loaded_mm) == mm)
 		leave_mm(smp_processor_id());
 
-	/* If this cpu still has a stale cr3 reference, then make sure
-	   it has been flushed. */
+	/*
+	 * If this cpu still has a stale cr3 reference, then make sure
+	 * it has been flushed.
+	 */
 	if (this_cpu_read(xen_current_cr3) == __pa(mm->pgd))
-		load_cr3(swapper_pg_dir);
+		xen_mc_flush();
 }
 
+#ifdef CONFIG_SMP
+/*
+ * Another cpu may still have their %cr3 pointing at the pagetable, so
+ * we need to repoint it somewhere else before we can unpin it.
+ */
 static void xen_drop_mm_ref(struct mm_struct *mm)
 {
 	cpumask_var_t mask;
 	unsigned cpu;
 
-	if (current->active_mm == mm) {
-		if (current->mm == mm)
-			load_cr3(swapper_pg_dir);
-		else
-			leave_mm(smp_processor_id());
-	}
+	drop_mm_ref_this_cpu(mm);
 
 	/* Get the "official" set of cpus referring to our pagetable. */
 	if (!alloc_cpumask_var(&mask, GFP_ATOMIC)) {
@@ -1013,31 +1008,31 @@ static void xen_drop_mm_ref(struct mm_struct *mm)
 			if (!cpumask_test_cpu(cpu, mm_cpumask(mm))
 			    && per_cpu(xen_current_cr3, cpu) != __pa(mm->pgd))
 				continue;
-			smp_call_function_single(cpu, drop_other_mm_ref, mm, 1);
+			smp_call_function_single(cpu, drop_mm_ref_this_cpu, mm, 1);
 		}
 		return;
 	}
 	cpumask_copy(mask, mm_cpumask(mm));
 
-	/* It's possible that a vcpu may have a stale reference to our
-	   cr3, because its in lazy mode, and it hasn't yet flushed
-	   its set of pending hypercalls yet.  In this case, we can
-	   look at its actual current cr3 value, and force it to flush
-	   if needed. */
+	/*
+	 * It's possible that a vcpu may have a stale reference to our
+	 * cr3 because it's in lazy mode and hasn't yet flushed its
+	 * set of pending hypercalls.  In this case, we can look at
+	 * its actual current cr3 value and force it to flush if
+	 * needed.
+	 */
 	for_each_online_cpu(cpu) {
 		if (per_cpu(xen_current_cr3, cpu) == __pa(mm->pgd))
 			cpumask_set_cpu(cpu, mask);
 	}
 
-	if (!cpumask_empty(mask))
-		smp_call_function_many(mask, drop_other_mm_ref, mm, 1);
+	smp_call_function_many(mask, drop_mm_ref_this_cpu, mm, 1);
 	free_cpumask_var(mask);
 }
 #else
 static void xen_drop_mm_ref(struct mm_struct *mm)
 {
-	if (current->active_mm == mm)
-		load_cr3(swapper_pg_dir);
+	drop_mm_ref_this_cpu(mm);
 }
 #endif
 
-- 
2.9.4


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v3 7/8] x86/mm: Be more consistent wrt PAGE_SHIFT vs PAGE_SIZE in tlb flush code
  2017-05-26  0:47 [PATCH v3 0/8] x86 TLB flush cleanups, moving toward PCID support Andy Lutomirski
                   ` (5 preceding siblings ...)
  2017-05-26  0:47 ` [PATCH v3 6/8] x86/mm: Rework lazy TLB to track the actual loaded mm Andy Lutomirski
@ 2017-05-26  0:47 ` Andy Lutomirski
  2017-05-26  0:47 ` [PATCH v3 8/8] x86,kvm: Teach KVM's VMX code that CR3 isn't a constant Andy Lutomirski
  7 siblings, 0 replies; 15+ messages in thread
From: Andy Lutomirski @ 2017-05-26  0:47 UTC (permalink / raw)
  To: X86 ML
  Cc: linux-kernel, Borislav Petkov, Linus Torvalds, Andrew Morton,
	Mel Gorman, linux-mm, Nadav Amit, Andy Lutomirski, Rik van Riel,
	Dave Hansen, Nadav Amit, Michal Hocko, Arjan van de Ven

Nadav pointed out that some code used PAGE_SIZE and other code used
PAGE_SHIFT.  Use shifts by PAGE_SHIFT consistently instead of
multiplying or dividing by PAGE_SIZE.
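
For unsigned, page-aligned byte counts the two forms compute the same
page count; a standalone toy illustration (not from this patch; the
values and the userspace framing are assumptions for demonstration):

#include <stdio.h>

#define PAGE_SHIFT 12			/* x86 uses 4 KiB base pages */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

int main(void)
{
	unsigned long start = 0x1000, end = 0x6000;

	/* Both expressions yield 5 pages for this 20 KiB range. */
	printf("div:   %lu\n", (end - start) / PAGE_SIZE);
	printf("shift: %lu\n", (end - start) >> PAGE_SHIFT);
	return 0;
}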

Requested-by: Nadav Amit <nadav.amit@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/mm/tlb.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 4bfadb869a1e..4bcec510174c 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -220,8 +220,7 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
 		trace_tlb_flush(reason, TLB_FLUSH_ALL);
 	} else {
 		unsigned long addr;
-		unsigned long nr_pages =
-			(f->end - f->start) / PAGE_SIZE;
+		unsigned long nr_pages = (f->end - f->start) >> PAGE_SHIFT;
 		addr = f->start;
 		while (addr < f->end) {
 			__flush_tlb_single(addr);
@@ -351,7 +350,7 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end)
 
 	/* Balance as user space task's flush, a bit conservative */
 	if (end == TLB_FLUSH_ALL ||
-	    (end - start) > tlb_single_page_flush_ceiling * PAGE_SIZE) {
+	    (end - start) > tlb_single_page_flush_ceiling << PAGE_SHIFT) {
 		on_each_cpu(do_flush_tlb_all, NULL, 1);
 	} else {
 		struct flush_tlb_info info;
-- 
2.9.4


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v3 8/8] x86,kvm: Teach KVM's VMX code that CR3 isn't a constant
  2017-05-26  0:47 [PATCH v3 0/8] x86 TLB flush cleanups, moving toward PCID support Andy Lutomirski
                   ` (6 preceding siblings ...)
  2017-05-26  0:47 ` [PATCH v3 7/8] x86/mm: Be more consistent wrt PAGE_SHIFT vs PAGE_SIZE in tlb flush code Andy Lutomirski
@ 2017-05-26  0:47 ` Andy Lutomirski
  2017-05-26 15:30   ` Paolo Bonzini
  7 siblings, 1 reply; 15+ messages in thread
From: Andy Lutomirski @ 2017-05-26  0:47 UTC (permalink / raw)
  To: X86 ML
  Cc: linux-kernel, Borislav Petkov, Linus Torvalds, Andrew Morton,
	Mel Gorman, linux-mm, Nadav Amit, Andy Lutomirski, Paolo Bonzini,
	Radim Krčmář,
	kvm, Rik van Riel, Dave Hansen, Nadav Amit, Michal Hocko,
	Arjan van de Ven

When PCID is enabled, CR3's PCID bits can change during context
switches, so KVM won't be able to treat CR3 as a per-mm constant any
more.

I structured this like the existing CR4 handling.  Under ordinary
circumstances (PCID is disabled, or the current PCID matches the value
that's already in the VMCS), we won't do an extra VMCS
write, and we'll never do an extra direct CR3 read.  The overhead
should be minimal.

I disallowed using the new helper in non-atomic context because
PCID support will cause CR3 to stop being constant in non-atomic
process context.

(Frankly, it also scares me a bit that KVM ever treated CR3 as
constant, but it looks like it was okay before.)
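
Abstractly, the CR3 handling below follows the same write-avoidance
pattern as the existing CR4 code: cache the last value written to the
VMCS field and only rewrite it when the desired value changes.  A
minimal sketch of that pattern (kernel context assumed; the cache
variable here is illustrative -- the real code keeps it in struct
vcpu_vmx):

static unsigned long cached_vmcs_host_cr3;	/* illustrative cache slot */

static void sync_vmcs_host_cr3(unsigned long cr3)
{
	/* Skip the relatively costly VMCS write when nothing changed. */
	if (unlikely(cr3 != cached_vmcs_host_cr3)) {
		vmcs_writel(HOST_CR3, cr3);
		cached_vmcs_host_cr3 = cr3;
	}
}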

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: kvm@vger.kernel.org
Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/mmu_context.h | 19 +++++++++++++++++++
 arch/x86/kvm/vmx.c                 | 21 ++++++++++++++++++---
 2 files changed, 37 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 187c39470a0b..f20d7ea47095 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -266,4 +266,23 @@ static inline bool arch_vma_access_permitted(struct vm_area_struct *vma,
 	return __pkru_allows_pkey(vma_pkey(vma), write);
 }
 
+
+/*
+ * This can be used from process context to figure out what the value of
+ * CR3 is without needing to do a (slow) read_cr3().
+ *
+ * It's intended to be used for code like KVM that sneakily changes CR3
+ * and needs to restore it.  It needs to be used very carefully.
+ */
+static inline unsigned long __get_current_cr3_fast(void)
+{
+	unsigned long cr3 = __pa(this_cpu_read(cpu_tlbstate.loaded_mm)->pgd);
+
+	/* For now, be very restrictive about when this can be called. */
+	VM_WARN_ON(in_nmi() || !in_atomic());
+
+	VM_BUG_ON(cr3 != read_cr3());
+	return cr3;
+}
+
 #endif /* _ASM_X86_MMU_CONTEXT_H */
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 72f78396bc09..b7b36c9ffa3d 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -48,6 +48,7 @@
 #include <asm/kexec.h>
 #include <asm/apic.h>
 #include <asm/irq_remapping.h>
+#include <asm/mmu_context.h>
 
 #include "trace.h"
 #include "pmu.h"
@@ -596,6 +597,7 @@ struct vcpu_vmx {
 		int           gs_ldt_reload_needed;
 		int           fs_reload_needed;
 		u64           msr_host_bndcfgs;
+		unsigned long vmcs_host_cr3;	/* May not match real cr3 */
 		unsigned long vmcs_host_cr4;	/* May not match real cr4 */
 	} host_state;
 	struct {
@@ -5012,12 +5014,19 @@ static void vmx_set_constant_host_state(struct vcpu_vmx *vmx)
 	u32 low32, high32;
 	unsigned long tmpl;
 	struct desc_ptr dt;
-	unsigned long cr0, cr4;
+	unsigned long cr0, cr3, cr4;
 
 	cr0 = read_cr0();
 	WARN_ON(cr0 & X86_CR0_TS);
 	vmcs_writel(HOST_CR0, cr0);  /* 22.2.3 */
-	vmcs_writel(HOST_CR3, read_cr3());  /* 22.2.3  FIXME: shadow tables */
+
+	/*
+	 * Save the most likely value for this task's CR3 in the VMCS.
+	 * We can't use __get_current_cr3_fast() because we're not atomic.
+	 */
+	cr3 = read_cr3();
+	vmcs_writel(HOST_CR3, cr3);		/* 22.2.3  FIXME: shadow tables */
+	vmx->host_state.vmcs_host_cr3 = cr3;
 
 	/* Save the most likely value for this task's CR4 in the VMCS. */
 	cr4 = cr4_read_shadow();
@@ -8843,7 +8852,7 @@ static void vmx_arm_hv_timer(struct kvm_vcpu *vcpu)
 static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
-	unsigned long debugctlmsr, cr4;
+	unsigned long debugctlmsr, cr3, cr4;
 
 	/* Don't enter VMX if guest state is invalid, let the exit handler
 	   start emulation until we arrive back to a valid state */
@@ -8865,6 +8874,12 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
 	if (test_bit(VCPU_REGS_RIP, (unsigned long *)&vcpu->arch.regs_dirty))
 		vmcs_writel(GUEST_RIP, vcpu->arch.regs[VCPU_REGS_RIP]);
 
+	cr3 = __get_current_cr3_fast();
+	if (unlikely(cr3 != vmx->host_state.vmcs_host_cr3)) {
+		vmcs_writel(HOST_CR3, cr3);
+		vmx->host_state.vmcs_host_cr3 = cr3;
+	}
+
 	cr4 = cr4_read_shadow();
 	if (unlikely(cr4 != vmx->host_state.vmcs_host_cr4)) {
 		vmcs_writel(HOST_CR4, cr4);
-- 
2.9.4


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 2/8] x86/mm: Change the leave_mm() condition for local TLB flushes
  2017-05-26  0:47 ` [PATCH v3 2/8] x86/mm: Change the leave_mm() condition for local TLB flushes Andy Lutomirski
@ 2017-05-26  1:39   ` Rik van Riel
  2017-05-26  2:01     ` Andy Lutomirski
  2017-05-26  1:43   ` Nadav Amit
  1 sibling, 1 reply; 15+ messages in thread
From: Rik van Riel @ 2017-05-26  1:39 UTC (permalink / raw)
  To: Andy Lutomirski, X86 ML
  Cc: linux-kernel, Borislav Petkov, Linus Torvalds, Andrew Morton,
	Mel Gorman, linux-mm, Nadav Amit, Dave Hansen, Nadav Amit,
	Michal Hocko, Arjan van de Ven

On Thu, 2017-05-25 at 17:47 -0700, Andy Lutomirski wrote:
> 
> +++ b/arch/x86/mm/tlb.c
> @@ -311,7 +311,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
>  		goto out;
>  	}
>  
> -	if (!current->mm) {
> +	if (this_cpu_read(cpu_tlbstate.state) != TLBSTATE_OK) {
>  		leave_mm(smp_processor_id());

Unless -mm changed leave_mm (I did not check), this
is not quite correct yet.

The reason is leave_mm (at least in the latest Linus
tree) ignores the cpu argument for one of its checks.

You should probably fix that in an earlier patch,
assuming you haven't already done so in -mm.

void leave_mm(int cpu)
{
	struct mm_struct *active_mm = this_cpu_read(cpu_tlbstate.active_mm);
	if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK)
		BUG();
	if (cpumask_test_cpu(cpu, mm_cpumask(active_mm))) {
		cpumask_clear_cpu(cpu, mm_cpumask(active_mm));
		load_cr3(swapper_pg_dir);


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 2/8] x86/mm: Change the leave_mm() condition for local TLB flushes
  2017-05-26  0:47 ` [PATCH v3 2/8] x86/mm: Change the leave_mm() condition for local TLB flushes Andy Lutomirski
  2017-05-26  1:39   ` Rik van Riel
@ 2017-05-26  1:43   ` Nadav Amit
  1 sibling, 0 replies; 15+ messages in thread
From: Nadav Amit @ 2017-05-26  1:43 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, linux-kernel, Borislav Petkov, Linus Torvalds,
	Andrew Morton, Mel Gorman, linux-mm, Rik van Riel, Dave Hansen,
	Nadav Amit, Michal Hocko, Arjan van de Ven


> On May 25, 2017, at 5:47 PM, Andy Lutomirski <luto@kernel.org> wrote:
> 
> On a remote TLB flush, we leave_mm() if we're TLBSTATE_LAZY.  For a
> local flush_tlb_mm_range(), we leave_mm() if !current->mm.  These
> are approximately the same condition -- the scheduler sets lazy TLB
> mode when switching to a thread with no mm.
> 
> I'm about to merge the local and remote flush code, but for ease of
> verifying and bisecting the patch, I want the local and remote flush
> behavior to match first.  This patch changes the local code to match
> the remote code.
> 
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: Nadav Amit <namit@vmware.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Arjan van de Ven <arjan@linux.intel.com>
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
> arch/x86/mm/tlb.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 776469cc54e0..3143c9a180e5 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -311,7 +311,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
> 		goto out;
> 	}
> 
> -	if (!current->mm) {
> +	if (this_cpu_read(cpu_tlbstate.state) != TLBSTATE_OK) {
> 		leave_mm(smp_processor_id());

Maybe it is overkill, but you may want to have two variants: leave_mm()
and leave_mm_irq_off().  Currently, leave_mm() does not disable IRQs, but
in patch 6 it does.  Here you indeed need to disable IRQs, but in the
cases prior to this patch you do not.
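
A rough sketch of that split (hypothetical code; the names and the
exact bookkeeping are assumptions, not something from this series):

/* Core logic; assumes the caller already disabled IRQs. */
void leave_mm_irq_off(int cpu)
{
	struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);

	if (loaded_mm == &init_mm)
		return;

	/* Stop receiving flush IPIs for this mm, then switch away. */
	cpumask_clear_cpu(cpu, mm_cpumask(loaded_mm));
	load_cr3(swapper_pg_dir);
	this_cpu_write(cpu_tlbstate.loaded_mm, &init_mm);
}

/* IRQ-safe wrapper for callers running with IRQs enabled. */
void leave_mm(int cpu)
{
	unsigned long flags;

	local_irq_save(flags);
	leave_mm_irq_off(cpu);
	local_irq_restore(flags);
}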


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 2/8] x86/mm: Change the leave_mm() condition for local TLB flushes
  2017-05-26  1:39   ` Rik van Riel
@ 2017-05-26  2:01     ` Andy Lutomirski
  2017-05-26 13:25       ` Rik van Riel
  0 siblings, 1 reply; 15+ messages in thread
From: Andy Lutomirski @ 2017-05-26  2:01 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andy Lutomirski, X86 ML, linux-kernel, Borislav Petkov,
	Linus Torvalds, Andrew Morton, Mel Gorman, linux-mm, Nadav Amit,
	Dave Hansen, Nadav Amit, Michal Hocko, Arjan van de Ven

On Thu, May 25, 2017 at 6:39 PM, Rik van Riel <riel@redhat.com> wrote:
> On Thu, 2017-05-25 at 17:47 -0700, Andy Lutomirski wrote:
>>
>> +++ b/arch/x86/mm/tlb.c
>> @@ -311,7 +311,7 @@ void flush_tlb_mm_range(struct mm_struct *mm,
>> unsigned long start,
>>               goto out;
>>       }
>>
>> -     if (!current->mm) {
>> +     if (this_cpu_read(cpu_tlbstate.state) != TLBSTATE_OK) {
>>               leave_mm(smp_processor_id());
>
> Unless -mm changed leave_mm (I did not check), this
> is not quite correct yet.
>
> The reason is leave_mm (at least in the latest Linus
> tree) ignores the cpu argument for one of its checks.
>
> You should probably fix that in an earlier patch,
> assuming you haven't already done so in -mm.
>
> void leave_mm(int cpu)
> {
>         struct mm_struct *active_mm =
> this_cpu_read(cpu_tlbstate.active_mm);
>         if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK)
>                 BUG();
>         if (cpumask_test_cpu(cpu, mm_cpumask(active_mm))) {
>                 cpumask_clear_cpu(cpu, mm_cpumask(active_mm));
>                 load_cr3(swapper_pg_dir);

I agree it's odd, but what's the bug?  Both before and after, leave_mm
needed to be called with cpu == smp_processor_id(), and
smp_processor_id() warns if it's called in a preemptible context.
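
In other words, a correct caller already looks roughly like this
(a sketch; in practice the preemption-off region usually comes from
the scheduler or an IRQs-off section rather than an explicit pair):

preempt_disable();
/*
 * With CONFIG_DEBUG_PREEMPT, smp_processor_id() would warn here if
 * preemption were still enabled, so the cpu argument cannot go stale.
 */
leave_mm(smp_processor_id());
preempt_enable();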


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 2/8] x86/mm: Change the leave_mm() condition for local TLB flushes
  2017-05-26  2:01     ` Andy Lutomirski
@ 2017-05-26 13:25       ` Rik van Riel
  0 siblings, 0 replies; 15+ messages in thread
From: Rik van Riel @ 2017-05-26 13:25 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, linux-kernel, Borislav Petkov, Linus Torvalds,
	Andrew Morton, Mel Gorman, linux-mm, Nadav Amit, Dave Hansen,
	Nadav Amit, Michal Hocko, Arjan van de Ven

On Thu, 2017-05-25 at 19:01 -0700, Andy Lutomirski wrote:
> On Thu, May 25, 2017 at 6:39 PM, Rik van Riel <riel@redhat.com>
> wrote:
> > On Thu, 2017-05-25 at 17:47 -0700, Andy Lutomirski wrote:
> > > 
> > > +++ b/arch/x86/mm/tlb.c
> > > @@ -311,7 +311,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
> > >  		goto out;
> > >  	}
> > > 
> > > -	if (!current->mm) {
> > > +	if (this_cpu_read(cpu_tlbstate.state) != TLBSTATE_OK) {
> > >  		leave_mm(smp_processor_id());
> > 
> > Unless -mm changed leave_mm (I did not check), this
> > is not quite correct yet.
> > 
> > The reason is leave_mm (at least in the latest Linus
> > tree) ignores the cpu argument for one of its checks.
> > 
> > You should probably fix that in an earlier patch,
> > assuming you haven't already done so in -mm.
> > 
> > void leave_mm(int cpu)
> > {
> > 	struct mm_struct *active_mm = this_cpu_read(cpu_tlbstate.active_mm);
> > 	if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK)
> > 		BUG();
> > 	if (cpumask_test_cpu(cpu, mm_cpumask(active_mm))) {
> > 		cpumask_clear_cpu(cpu, mm_cpumask(active_mm));
> > 		load_cr3(swapper_pg_dir);
> 
> I agree it's odd, but what's the bug?  Both before and after, leave_mm
> needed to be called with cpu == smp_processor_id(), and
> smp_processor_id() warns if it's called in a preemptible context.

Indeed, you are right. Looking at too much code at once...


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 8/8] x86,kvm: Teach KVM's VMX code that CR3 isn't a constant
  2017-05-26  0:47 ` [PATCH v3 8/8] x86,kvm: Teach KVM's VMX code that CR3 isn't a constant Andy Lutomirski
@ 2017-05-26 15:30   ` Paolo Bonzini
  2017-05-26 16:05     ` Paolo Bonzini
  0 siblings, 1 reply; 15+ messages in thread
From: Paolo Bonzini @ 2017-05-26 15:30 UTC (permalink / raw)
  To: Andy Lutomirski, X86 ML
  Cc: linux-kernel, Borislav Petkov, Linus Torvalds, Andrew Morton,
	Mel Gorman, linux-mm, Nadav Amit, Radim Krčmář,
	kvm, Rik van Riel, Dave Hansen, Nadav Amit, Michal Hocko,
	Arjan van de Ven



On 26/05/2017 02:47, Andy Lutomirski wrote:
> When PCID is enabled, CR3's PCID bits can change during context
> switches, so KVM won't be able to treat CR3 as a per-mm constant any
> more.
> 
> I structured this like the existing CR4 handling.  Under ordinary
> circumstances (PCID is disabled, or the current PCID matches the value
> that's already in the VMCS), we won't do an extra VMCS
> write, and we'll never do an extra direct CR3 read.  The overhead
> should be minimal.
> 
> I disallowed using the new helper in non-atomic context because
> PCID support will cause CR3 to stop being constant in non-atomic
> process context.
> 
> (Frankly, it also scares me a bit that KVM ever treated CR3 as
> constant, but it looks like it was okay before.)
> 
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Radim Krčmář <rkrcmar@redhat.com>
> Cc: kvm@vger.kernel.org
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: Nadav Amit <namit@vmware.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Arjan van de Ven <arjan@linux.intel.com>
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  arch/x86/include/asm/mmu_context.h | 19 +++++++++++++++++++
>  arch/x86/kvm/vmx.c                 | 21 ++++++++++++++++++---
>  2 files changed, 37 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
> index 187c39470a0b..f20d7ea47095 100644
> --- a/arch/x86/include/asm/mmu_context.h
> +++ b/arch/x86/include/asm/mmu_context.h
> @@ -266,4 +266,23 @@ static inline bool arch_vma_access_permitted(struct vm_area_struct *vma,
>  	return __pkru_allows_pkey(vma_pkey(vma), write);
>  }
>  
> +
> +/*
> + * This can be used from process context to figure out what the value of
> + * CR3 is without needing to do a (slow) read_cr3().
> + *
> + * It's intended to be used for code like KVM that sneakily changes CR3
> + * and needs to restore it.  It needs to be used very carefully.
> + */
> +static inline unsigned long __get_current_cr3_fast(void)
> +{
> +	unsigned long cr3 = __pa(this_cpu_read(cpu_tlbstate.loaded_mm)->pgd);
> +
> +	/* For now, be very restrictive about when this can be called. */
> +	VM_WARN_ON(in_nmi() || !in_atomic());
> +
> +	VM_BUG_ON(cr3 != read_cr3());
> +	return cr3;
> +}
> +
>  #endif /* _ASM_X86_MMU_CONTEXT_H */
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 72f78396bc09..b7b36c9ffa3d 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -48,6 +48,7 @@
>  #include <asm/kexec.h>
>  #include <asm/apic.h>
>  #include <asm/irq_remapping.h>
> +#include <asm/mmu_context.h>
>  
>  #include "trace.h"
>  #include "pmu.h"
> @@ -596,6 +597,7 @@ struct vcpu_vmx {
>  		int           gs_ldt_reload_needed;
>  		int           fs_reload_needed;
>  		u64           msr_host_bndcfgs;
> +		unsigned long vmcs_host_cr3;	/* May not match real cr3 */
>  		unsigned long vmcs_host_cr4;	/* May not match real cr4 */
>  	} host_state;
>  	struct {
> @@ -5012,12 +5014,19 @@ static void vmx_set_constant_host_state(struct vcpu_vmx *vmx)
>  	u32 low32, high32;
>  	unsigned long tmpl;
>  	struct desc_ptr dt;
> -	unsigned long cr0, cr4;
> +	unsigned long cr0, cr3, cr4;
>  
>  	cr0 = read_cr0();
>  	WARN_ON(cr0 & X86_CR0_TS);
>  	vmcs_writel(HOST_CR0, cr0);  /* 22.2.3 */
> -	vmcs_writel(HOST_CR3, read_cr3());  /* 22.2.3  FIXME: shadow tables */
> +
> +	/*
> +	 * Save the most likely value for this task's CR3 in the VMCS.
> +	 * We can't use __get_current_cr3_fast() because we're not atomic.
> +	 */
> +	cr3 = read_cr3();
> +	vmcs_writel(HOST_CR3, cr3);		/* 22.2.3  FIXME: shadow tables */
> +	vmx->host_state.vmcs_host_cr3 = cr3;
>  
>  	/* Save the most likely value for this task's CR4 in the VMCS. */
>  	cr4 = cr4_read_shadow();
> @@ -8843,7 +8852,7 @@ static void vmx_arm_hv_timer(struct kvm_vcpu *vcpu)
>  static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
>  {
>  	struct vcpu_vmx *vmx = to_vmx(vcpu);
> -	unsigned long debugctlmsr, cr4;
> +	unsigned long debugctlmsr, cr3, cr4;
>  
>  	/* Don't enter VMX if guest state is invalid, let the exit handler
>  	   start emulation until we arrive back to a valid state */
> @@ -8865,6 +8874,12 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
>  	if (test_bit(VCPU_REGS_RIP, (unsigned long *)&vcpu->arch.regs_dirty))
>  		vmcs_writel(GUEST_RIP, vcpu->arch.regs[VCPU_REGS_RIP]);
>  
> +	cr3 = __get_current_cr3_fast();
> +	if (unlikely(cr3 != vmx->host_state.vmcs_host_cr3)) {
> +		vmcs_writel(HOST_CR3, cr3);
> +		vmx->host_state.vmcs_host_cr3 = cr3;
> +	}
> +
>  	cr4 = cr4_read_shadow();
>  	if (unlikely(cr4 != vmx->host_state.vmcs_host_cr4)) {
>  		vmcs_writel(HOST_CR4, cr4);
> 

Queued, thanks.  If anybody needs a topic branch, please holler.

Paolo


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 8/8] x86,kvm: Teach KVM's VMX code that CR3 isn't a constant
  2017-05-26 15:30   ` Paolo Bonzini
@ 2017-05-26 16:05     ` Paolo Bonzini
  0 siblings, 0 replies; 15+ messages in thread
From: Paolo Bonzini @ 2017-05-26 16:05 UTC (permalink / raw)
  To: Andy Lutomirski, X86 ML
  Cc: linux-kernel, Borislav Petkov, Linus Torvalds, Andrew Morton,
	Mel Gorman, linux-mm, Nadav Amit, Radim Krčmář,
	kvm, Rik van Riel, Dave Hansen, Nadav Amit, Michal Hocko,
	Arjan van de Ven



On 26/05/2017 17:30, Paolo Bonzini wrote:
> 
> 
> On 26/05/2017 02:47, Andy Lutomirski wrote:
>> [...]
> 
> Queued, thanks.  If anybody needs a topic branch, please holler.

Ah, no, it depends on the others.  Note to self, compile first, answer
second.

Paolo


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2017-05-26 16:05 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-05-26  0:47 [PATCH v3 0/8] x86 TLB flush cleanups, moving toward PCID support Andy Lutomirski
2017-05-26  0:47 ` [PATCH v3 1/8] x86/mm: Pass flush_tlb_info to flush_tlb_others() etc Andy Lutomirski
2017-05-26  0:47 ` [PATCH v3 2/8] x86/mm: Change the leave_mm() condition for local TLB flushes Andy Lutomirski
2017-05-26  1:39   ` Rik van Riel
2017-05-26  2:01     ` Andy Lutomirski
2017-05-26 13:25       ` Rik van Riel
2017-05-26  1:43   ` Nadav Amit
2017-05-26  0:47 ` [PATCH v3 3/8] x86/mm: Refactor flush_tlb_mm_range() to merge local and remote cases Andy Lutomirski
2017-05-26  0:47 ` [PATCH v3 4/8] x86/mm: Use new merged flush logic in arch_tlbbatch_flush() Andy Lutomirski
2017-05-26  0:47 ` [PATCH v3 5/8] x86/mm: Remove the UP tlbflush code; always use the formerly SMP code Andy Lutomirski
2017-05-26  0:47 ` [PATCH v3 6/8] x86/mm: Rework lazy TLB to track the actual loaded mm Andy Lutomirski
2017-05-26  0:47 ` [PATCH v3 7/8] x86/mm: Be more consistent wrt PAGE_SHIFT vs PAGE_SIZE in tlb flush code Andy Lutomirski
2017-05-26  0:47 ` [PATCH v3 8/8] x86,kvm: Teach KVM's VMX code that CR3 isn't a constant Andy Lutomirski
2017-05-26 15:30   ` Paolo Bonzini
2017-05-26 16:05     ` Paolo Bonzini
