[RFC 00/10] x86 TLB flush cleanups, moving toward PCID support

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [RFC 00/10] x86 TLB flush cleanups, moving toward PCID support
@ 2017-05-07 12:38 Andy Lutomirski
  2017-05-07 12:38 ` [RFC 01/10] x86/mm: Reimplement flush_tlb_page() using flush_tlb_mm_range() Andy Lutomirski
                   ` (12 more replies)
  0 siblings, 13 replies; 32+ messages in thread
From: Andy Lutomirski @ 2017-05-07 12:38 UTC (permalink / raw)
  To: X86 ML
  Cc: linux-kernel, Borislav Petkov, Linus Torvalds, Andrew Morton,
	Mel Gorman, linux-mm, Andy Lutomirski

As I've been working on polishing my PCID code, a major problem I've
encountered is that there are too many x86 TLB flushing code paths and
that they have too many inconsequential differences.  The result was
that earlier versions of the PCID code were a colossal mess and very
difficult to understand.

This series goes a long way toward cleaning up the mess.  With all the
patches applied, there is a single function that contains the meat of
the code to flush the TLB on a given CPU, and all the tlb flushing
APIs call it for both local and remote CPUs.

This series should only adversely affect the kernel in a couple of
minor ways:

 - It makes smp_mb() unconditional when flushing TLBs.  We used to
   use the TLB flush itself to mostly avoid smp_mb() on the initiating
   CPU.

 - On UP kernels, we lose the dubious optimization of inlining nerfed
   variants of all the TLB flush APIs.  This bloats the kernel a tiny
   bit, although it should increase performance, since the SMP
   versions were better.

Patch 10 in here is a little bit off topic.  It's a cleanup that's
also needed before PCID can go in, but it's not directly about
TLB flushing.

Thoughts?

This applies to tip:x86/mm.  You can see it fully applied here:

https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/tlbflush_cleanup&id=59ea83a0a78025439e3d15e09b693846fa1f4770

Andy Lutomirski (10):
  x86/mm: Reimplement flush_tlb_page() using flush_tlb_mm_range()
  x86/mm: Reduce indentation in flush_tlb_func()
  x86/mm: Make the batched unmap TLB flush API more generic
  x86/mm: Pass flush_tlb_info to flush_tlb_others() etc
  x86/mm: Change the leave_mm() condition for local TLB flushes
  x86/mm: Refactor flush_tlb_mm_range() to merge local and remote cases
  x86/mm: Use new merged flush logic in arch_tlbbatch_flush()
  x86/mm: Remove the UP tlbflush code; always use the formerly SMP code
  x86/mm: Rework lazy TLB to track the actual loaded mm
  x86,kvm: Teach KVM's VMX code that CR3 isn't a constant

 arch/x86/Kconfig                      |   2 +-
 arch/x86/events/core.c                |   3 +-
 arch/x86/include/asm/hardirq.h        |   2 +-
 arch/x86/include/asm/mmu.h            |   6 -
 arch/x86/include/asm/mmu_context.h    |  21 +-
 arch/x86/include/asm/paravirt.h       |   6 +-
 arch/x86/include/asm/paravirt_types.h |   5 +-
 arch/x86/include/asm/tlbbatch.h       |  14 ++
 arch/x86/include/asm/tlbflush.h       | 116 +++------
 arch/x86/include/asm/uv/uv.h          |   9 +-
 arch/x86/kernel/ldt.c                 |   5 +-
 arch/x86/kvm/vmx.c                    |  21 +-
 arch/x86/mm/init.c                    |   4 +-
 arch/x86/mm/tlb.c                     | 429 +++++++++++++++-------------------
 arch/x86/platform/uv/tlb_uv.c         |   8 +-
 arch/x86/xen/mmu.c                    |  61 +++--
 include/linux/mm_types_task.h         |  15 +-
 mm/rmap.c                             |  15 +-
 18 files changed, 334 insertions(+), 408 deletions(-)
 create mode 100644 arch/x86/include/asm/tlbbatch.h

-- 
2.9.3

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [RFC 01/10] x86/mm: Reimplement flush_tlb_page() using flush_tlb_mm_range()
  2017-05-07 12:38 [RFC 00/10] x86 TLB flush cleanups, moving toward PCID support Andy Lutomirski
@ 2017-05-07 12:38 ` Andy Lutomirski
  2017-05-11 17:41   ` Borislav Petkov
  2017-05-07 12:38 ` [RFC 02/10] x86/mm: Reduce indentation in flush_tlb_func() Andy Lutomirski
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 32+ messages in thread
From: Andy Lutomirski @ 2017-05-07 12:38 UTC (permalink / raw)
  To: X86 ML
  Cc: linux-kernel, Borislav Petkov, Linus Torvalds, Andrew Morton,
	Mel Gorman, linux-mm, Andy Lutomirski, Rik van Riel, Dave Hansen,
	Nadav Amit, Michal Hocko

flush_tlb_page() was very similar to flush_tlb_mm_range() except that
it had a couple of issues:

 - It was missing an smp_mb() in the case where
   current->active_mm != mm.  (This is a longstanding bug reported by
   Nadav Amit.)

 - It was missing tracepoints and vm counter updates.

The only reason that I can see for keeping it at as a separate
function is that it could avoid a few branches that
flush_tlb_mm_range() needs to decide to flush just one page.  This
hardly seems worthwhile.  If we decide we want to get rid of those
branches again, a better way would be to introduce an
__flush_tlb_mm_range() helper and make both flush_tlb_page() and
flush_tlb_mm_range() use it.

Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/tlbflush.h |  6 +++++-
 arch/x86/mm/tlb.c               | 27 ---------------------------
 2 files changed, 5 insertions(+), 28 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 6ed9ea469b48..5ed64cdaf536 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -307,11 +307,15 @@ static inline void flush_tlb_kernel_range(unsigned long start,
 		flush_tlb_mm_range(vma->vm_mm, start, end, vma->vm_flags)
 
 extern void flush_tlb_all(void);
-extern void flush_tlb_page(struct vm_area_struct *, unsigned long);
 extern void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 				unsigned long end, unsigned long vmflag);
 extern void flush_tlb_kernel_range(unsigned long start, unsigned long end);
 
+static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
+{
+	flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, 0);
+}
+
 void native_flush_tlb_others(const struct cpumask *cpumask,
 				struct mm_struct *mm,
 				unsigned long start, unsigned long end);
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 6e7bedf69af7..fe6471132ea3 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -354,33 +354,6 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 	preempt_enable();
 }
 
-void flush_tlb_page(struct vm_area_struct *vma, unsigned long start)
-{
-	struct mm_struct *mm = vma->vm_mm;
-
-	preempt_disable();
-
-	if (current->active_mm == mm) {
-		if (current->mm) {
-			/*
-			 * Implicit full barrier (INVLPG) that synchronizes
-			 * with switch_mm.
-			 */
-			__flush_tlb_one(start);
-		} else {
-			leave_mm(smp_processor_id());
-
-			/* Synchronize with switch_mm. */
-			smp_mb();
-		}
-	}
-
-	if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids)
-		flush_tlb_others(mm_cpumask(mm), mm, start, start + PAGE_SIZE);
-
-	preempt_enable();
-}
-
 static void do_flush_tlb_all(void *info)
 {
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC 02/10] x86/mm: Reduce indentation in flush_tlb_func()
  2017-05-07 12:38 [RFC 00/10] x86 TLB flush cleanups, moving toward PCID support Andy Lutomirski
  2017-05-07 12:38 ` [RFC 01/10] x86/mm: Reimplement flush_tlb_page() using flush_tlb_mm_range() Andy Lutomirski
@ 2017-05-07 12:38 ` Andy Lutomirski
  2017-05-07 12:38 ` [RFC 03/10] x86/mm: Make the batched unmap TLB flush API more generic Andy Lutomirski
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 32+ messages in thread
From: Andy Lutomirski @ 2017-05-07 12:38 UTC (permalink / raw)
  To: X86 ML
  Cc: linux-kernel, Borislav Petkov, Linus Torvalds, Andrew Morton,
	Mel Gorman, linux-mm, Andy Lutomirski, Rik van Riel, Dave Hansen,
	Nadav Amit, Michal Hocko

The leave_mm() case can just exit the function early so we don't
need to indent the entire remainder of the function.

Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/mm/tlb.c | 34 ++++++++++++++++++----------------
 1 file changed, 18 insertions(+), 16 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index fe6471132ea3..4d303864b310 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -237,24 +237,26 @@ static void flush_tlb_func(void *info)
 		return;
 
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
-	if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK) {
-		if (f->flush_end == TLB_FLUSH_ALL) {
-			local_flush_tlb();
-			trace_tlb_flush(TLB_REMOTE_SHOOTDOWN, TLB_FLUSH_ALL);
-		} else {
-			unsigned long addr;
-			unsigned long nr_pages =
-				(f->flush_end - f->flush_start) / PAGE_SIZE;
-			addr = f->flush_start;
-			while (addr < f->flush_end) {
-				__flush_tlb_single(addr);
-				addr += PAGE_SIZE;
-			}
-			trace_tlb_flush(TLB_REMOTE_SHOOTDOWN, nr_pages);
-		}
-	} else
+
+	if (this_cpu_read(cpu_tlbstate.state) != TLBSTATE_OK) {
 		leave_mm(smp_processor_id());
+		return;
+	}
 
+	if (f->flush_end == TLB_FLUSH_ALL) {
+		local_flush_tlb();
+		trace_tlb_flush(TLB_REMOTE_SHOOTDOWN, TLB_FLUSH_ALL);
+	} else {
+		unsigned long addr;
+		unsigned long nr_pages =
+			(f->flush_end - f->flush_start) / PAGE_SIZE;
+		addr = f->flush_start;
+		while (addr < f->flush_end) {
+			__flush_tlb_single(addr);
+			addr += PAGE_SIZE;
+		}
+		trace_tlb_flush(TLB_REMOTE_SHOOTDOWN, nr_pages);
+	}
 }
 
 void native_flush_tlb_others(const struct cpumask *cpumask,
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC 03/10] x86/mm: Make the batched unmap TLB flush API more generic
  2017-05-07 12:38 [RFC 00/10] x86 TLB flush cleanups, moving toward PCID support Andy Lutomirski
  2017-05-07 12:38 ` [RFC 01/10] x86/mm: Reimplement flush_tlb_page() using flush_tlb_mm_range() Andy Lutomirski
  2017-05-07 12:38 ` [RFC 02/10] x86/mm: Reduce indentation in flush_tlb_func() Andy Lutomirski
@ 2017-05-07 12:38 ` Andy Lutomirski
  2017-05-08 15:34   ` Dave Hansen
  2017-05-07 12:38 ` [RFC 04/10] x86/mm: Pass flush_tlb_info to flush_tlb_others() etc Andy Lutomirski
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 32+ messages in thread
From: Andy Lutomirski @ 2017-05-07 12:38 UTC (permalink / raw)
  To: X86 ML
  Cc: linux-kernel, Borislav Petkov, Linus Torvalds, Andrew Morton,
	Mel Gorman, linux-mm, Andy Lutomirski, Rik van Riel, Dave Hansen,
	Nadav Amit, Michal Hocko, Sasha Levin

try_to_unmap_flush() used to open-code a rather x86-centric flush
sequence: local_flush_tlb() + flush_tlb_others().  Rearrange the
code so that the arch (only x86 for now) provides
arch_tlbbatch_add_mm() and arch_tlbbatch_flush() and the core code
calls those functions instead.

I'll want this for x86 because, to enable address space ids, I can't
support the flush_tlb_others() mode used by exising
try_to_unmap_flush() implementation with good performance.  I can
support the new API fairly easily, though.

I imagine that other architectures may be in a similar position.
Architectures with strong remote flush primitives (arm64?) may have
even worse performance problems with flush_tlb_others() the way that
try_to_unmap_flush() uses it.

Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/tlbbatch.h | 16 ++++++++++++++++
 arch/x86/include/asm/tlbflush.h |  8 ++++++++
 arch/x86/mm/tlb.c               | 17 +++++++++++++++++
 include/linux/mm_types_task.h   | 15 +++++++++++----
 mm/rmap.c                       | 15 +--------------
 5 files changed, 53 insertions(+), 18 deletions(-)
 create mode 100644 arch/x86/include/asm/tlbbatch.h

diff --git a/arch/x86/include/asm/tlbbatch.h b/arch/x86/include/asm/tlbbatch.h
new file mode 100644
index 000000000000..01a6de16fb96
--- /dev/null
+++ b/arch/x86/include/asm/tlbbatch.h
@@ -0,0 +1,16 @@
+#ifndef _ARCH_X86_TLBBATCH_H
+#define _ARCH_X86_TLBBATCH_H
+
+#include <linux/cpumask.h>
+
+#ifdef CONFIG_SMP
+struct arch_tlbflush_unmap_batch {
+	/*
+	 * Each bit set is a CPU that potentially has a TLB entry for one of
+	 * the PFNs being flushed..
+	 */
+	struct cpumask cpumask;
+};
+#endif
+
+#endif /* _ARCH_X86_TLBBATCH_H */
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 5ed64cdaf536..df71e3f2fe4d 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -329,6 +329,14 @@ static inline void reset_lazy_tlbstate(void)
 	this_cpu_write(cpu_tlbstate.active_mm, &init_mm);
 }
 
+static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
+					struct mm_struct *mm)
+{
+	cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
+}
+
+extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
+
 #endif	/* SMP */
 
 #ifndef CONFIG_PARAVIRT
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 4d303864b310..743e4c6b4529 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -395,6 +395,23 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end)
 	}
 }
 
+void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
+{
+	int cpu = get_cpu();
+
+	if (cpumask_test_cpu(cpu, &batch->cpumask)) {
+		count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
+		local_flush_tlb();
+		trace_tlb_flush(TLB_LOCAL_SHOOTDOWN, TLB_FLUSH_ALL);
+	}
+
+	if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids)
+		flush_tlb_others(&batch->cpumask, NULL, 0, TLB_FLUSH_ALL);
+	cpumask_clear(&batch->cpumask);
+
+	put_cpu();
+}
+
 static ssize_t tlbflush_read_file(struct file *file, char __user *user_buf,
 			     size_t count, loff_t *ppos)
 {
diff --git a/include/linux/mm_types_task.h b/include/linux/mm_types_task.h
index 136dfdf63ba1..fc412fbd80bd 100644
--- a/include/linux/mm_types_task.h
+++ b/include/linux/mm_types_task.h
@@ -14,6 +14,10 @@
 
 #include <asm/page.h>
 
+#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+#include <asm/tlbbatch.h>
+#endif
+
 #define USE_SPLIT_PTE_PTLOCKS	(NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS)
 #define USE_SPLIT_PMD_PTLOCKS	(USE_SPLIT_PTE_PTLOCKS && \
 		IS_ENABLED(CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK))
@@ -67,12 +71,15 @@ struct page_frag {
 struct tlbflush_unmap_batch {
 #ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
 	/*
-	 * Each bit set is a CPU that potentially has a TLB entry for one of
-	 * the PFNs being flushed. See set_tlb_ubc_flush_pending().
+	 * The arch code makes the following promise: generic code can modify a
+	 * PTE, then call arch_tlbbatch_add_mm() (which internally provides all
+	 * needed barriers), then call arch_tlbbatch_flush(), and the entries
+	 * will be flushed on all CPUs by the time that arch_tlbbatch_flush()
+	 * returns.
 	 */
-	struct cpumask cpumask;
+	struct arch_tlbflush_unmap_batch arch;
 
-	/* True if any bit in cpumask is set */
+	/* True if a flush is needed. */
 	bool flush_required;
 
 	/*
diff --git a/mm/rmap.c b/mm/rmap.c
index f6838015810f..2e568c82f477 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -579,25 +579,12 @@ void page_unlock_anon_vma_read(struct anon_vma *anon_vma)
 void try_to_unmap_flush(void)
 {
 	struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
-	int cpu;
 
 	if (!tlb_ubc->flush_required)
 		return;
 
-	cpu = get_cpu();
-
-	if (cpumask_test_cpu(cpu, &tlb_ubc->cpumask)) {
-		count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
-		local_flush_tlb();
-		trace_tlb_flush(TLB_LOCAL_SHOOTDOWN, TLB_FLUSH_ALL);
-	}
-
-	if (cpumask_any_but(&tlb_ubc->cpumask, cpu) < nr_cpu_ids)
-		flush_tlb_others(&tlb_ubc->cpumask, NULL, 0, TLB_FLUSH_ALL);
-	cpumask_clear(&tlb_ubc->cpumask);
 	tlb_ubc->flush_required = false;
 	tlb_ubc->writable = false;
-	put_cpu();
 }
 
 /* Flush iff there are potentially writable TLB entries that can race with IO */
@@ -613,7 +600,7 @@ static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
 {
 	struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
 
-	cpumask_or(&tlb_ubc->cpumask, &tlb_ubc->cpumask, mm_cpumask(mm));
+	arch_tlbbatch_add_mm(&tlb_ubc->arch, mm);
 	tlb_ubc->flush_required = true;
 
 	/*
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC 04/10] x86/mm: Pass flush_tlb_info to flush_tlb_others() etc
  2017-05-07 12:38 [RFC 00/10] x86 TLB flush cleanups, moving toward PCID support Andy Lutomirski
                   ` (2 preceding siblings ...)
  2017-05-07 12:38 ` [RFC 03/10] x86/mm: Make the batched unmap TLB flush API more generic Andy Lutomirski
@ 2017-05-07 12:38 ` Andy Lutomirski
  2017-05-11 20:01   ` Nadav Amit
  2017-05-07 12:38 ` [RFC 05/10] x86/mm: Change the leave_mm() condition for local TLB flushes Andy Lutomirski
                   ` (8 subsequent siblings)
  12 siblings, 1 reply; 32+ messages in thread
From: Andy Lutomirski @ 2017-05-07 12:38 UTC (permalink / raw)
  To: X86 ML
  Cc: linux-kernel, Borislav Petkov, Linus Torvalds, Andrew Morton,
	Mel Gorman, linux-mm, Andy Lutomirski, Rik van Riel, Dave Hansen,
	Nadav Amit, Michal Hocko, Sasha Levin

Rather than passing all the contents of flush_tlb_info to
flush_tlb_others(), pass a pointer to the structure directly. For
consistency, this also removes the unnecessary cpu parameter from
uv_flush_tlb_others() to make its signature match the other
*flush_tlb_others() functions.

This serves two purposes:

 - It will dramatically simplify future patches that change struct
   flush_tlb_info, which I'm planning to do.

 - struct flush_tlb_info is an adequate description of what to do
   for a local flush, too, so by reusing it we can remove duplicated
   code between local and remove flushes in a future patch.

Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/paravirt.h       |  6 ++--
 arch/x86/include/asm/paravirt_types.h |  5 ++-
 arch/x86/include/asm/tlbflush.h       | 19 ++++++-----
 arch/x86/include/asm/uv/uv.h          |  9 ++---
 arch/x86/mm/tlb.c                     | 64 +++++++++++++++++------------------
 arch/x86/platform/uv/tlb_uv.c         |  8 ++---
 arch/x86/xen/mmu.c                    | 10 +++---
 7 files changed, 58 insertions(+), 63 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 55fa56fe4e45..9a15739d9f4b 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -312,11 +312,9 @@ static inline void __flush_tlb_single(unsigned long addr)
 }
 
 static inline void flush_tlb_others(const struct cpumask *cpumask,
-				    struct mm_struct *mm,
-				    unsigned long start,
-				    unsigned long end)
+				    const struct flush_tlb_info *info)
 {
-	PVOP_VCALL4(pv_mmu_ops.flush_tlb_others, cpumask, mm, start, end);
+	PVOP_VCALL2(pv_mmu_ops.flush_tlb_others, cpumask, info);
 }
 
 static inline int paravirt_pgd_alloc(struct mm_struct *mm)
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 7465d6fe336f..cb976bab6299 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -51,6 +51,7 @@ struct mm_struct;
 struct desc_struct;
 struct task_struct;
 struct cpumask;
+struct flush_tlb_info;
 
 /*
  * Wrapper type for pointers to code which uses the non-standard
@@ -223,9 +224,7 @@ struct pv_mmu_ops {
 	void (*flush_tlb_kernel)(void);
 	void (*flush_tlb_single)(unsigned long addr);
 	void (*flush_tlb_others)(const struct cpumask *cpus,
-				 struct mm_struct *mm,
-				 unsigned long start,
-				 unsigned long end);
+				 const struct flush_tlb_info *info);
 
 	/* Hooks for allocating and freeing a pagetable top-level */
 	int  (*pgd_alloc)(struct mm_struct *mm);
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index df71e3f2fe4d..6d236c1aff89 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -220,12 +220,18 @@ static inline void __flush_tlb_one(unsigned long addr)
  *  - flush_tlb_page(vma, vmaddr) flushes one page
  *  - flush_tlb_range(vma, start, end) flushes a range of pages
  *  - flush_tlb_kernel_range(start, end) flushes a range of kernel pages
- *  - flush_tlb_others(cpumask, mm, start, end) flushes TLBs on other cpus
+ *  - flush_tlb_others(cpumask, info) flushes TLBs on other cpus
  *
  * ..but the i386 has somewhat limited tlb flushing capabilities,
  * and page-granular flushes are available only on i486 and up.
  */
 
+struct flush_tlb_info {
+	struct mm_struct *mm;
+	unsigned long start;
+	unsigned long end;
+};
+
 #ifndef CONFIG_SMP
 
 /* "_up" is for UniProcessor.
@@ -279,9 +285,7 @@ static inline void flush_tlb_mm_range(struct mm_struct *mm,
 }
 
 static inline void native_flush_tlb_others(const struct cpumask *cpumask,
-					   struct mm_struct *mm,
-					   unsigned long start,
-					   unsigned long end)
+					   const struct flush_tlb_info *info)
 {
 }
 
@@ -317,8 +321,7 @@ static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
 }
 
 void native_flush_tlb_others(const struct cpumask *cpumask,
-				struct mm_struct *mm,
-				unsigned long start, unsigned long end);
+			     const struct flush_tlb_info *info);
 
 #define TLBSTATE_OK	1
 #define TLBSTATE_LAZY	2
@@ -340,8 +343,8 @@ extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
 #endif	/* SMP */
 
 #ifndef CONFIG_PARAVIRT
-#define flush_tlb_others(mask, mm, start, end)	\
-	native_flush_tlb_others(mask, mm, start, end)
+#define flush_tlb_others(mask, info)	\
+	native_flush_tlb_others(mask, info)
 #endif
 
 #endif /* _ASM_X86_TLBFLUSH_H */
diff --git a/arch/x86/include/asm/uv/uv.h b/arch/x86/include/asm/uv/uv.h
index 6686820feae9..d2db4517930c 100644
--- a/arch/x86/include/asm/uv/uv.h
+++ b/arch/x86/include/asm/uv/uv.h
@@ -15,10 +15,7 @@ extern void uv_cpu_init(void);
 extern void uv_nmi_init(void);
 extern void uv_system_init(void);
 extern const struct cpumask *uv_flush_tlb_others(const struct cpumask *cpumask,
-						 struct mm_struct *mm,
-						 unsigned long start,
-						 unsigned long end,
-						 unsigned int cpu);
+						 const struct flush_tlb_info *info);
 
 #else	/* X86_UV */
 
@@ -28,8 +25,8 @@ static inline int is_uv_hubless(void)	{ return 0; }
 static inline void uv_cpu_init(void)	{ }
 static inline void uv_system_init(void)	{ }
 static inline const struct cpumask *
-uv_flush_tlb_others(const struct cpumask *cpumask, struct mm_struct *mm,
-		    unsigned long start, unsigned long end, unsigned int cpu)
+uv_flush_tlb_others(const struct cpumask *cpumask,
+		    const struct flush_tlb_info *info)
 { return cpumask; }
 
 #endif	/* X86_UV */
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 743e4c6b4529..776469cc54e0 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -30,12 +30,6 @@
 
 #ifdef CONFIG_SMP
 
-struct flush_tlb_info {
-	struct mm_struct *flush_mm;
-	unsigned long flush_start;
-	unsigned long flush_end;
-};
-
 /*
  * We cannot call mmdrop() because we are in interrupt context,
  * instead update mm->cpu_vm_mask.
@@ -229,11 +223,11 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
  */
 static void flush_tlb_func(void *info)
 {
-	struct flush_tlb_info *f = info;
+	const struct flush_tlb_info *f = info;
 
 	inc_irq_stat(irq_tlb_count);
 
-	if (f->flush_mm && f->flush_mm != this_cpu_read(cpu_tlbstate.active_mm))
+	if (f->mm && f->mm != this_cpu_read(cpu_tlbstate.active_mm))
 		return;
 
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
@@ -243,15 +237,15 @@ static void flush_tlb_func(void *info)
 		return;
 	}
 
-	if (f->flush_end == TLB_FLUSH_ALL) {
+	if (f->end == TLB_FLUSH_ALL) {
 		local_flush_tlb();
 		trace_tlb_flush(TLB_REMOTE_SHOOTDOWN, TLB_FLUSH_ALL);
 	} else {
 		unsigned long addr;
 		unsigned long nr_pages =
-			(f->flush_end - f->flush_start) / PAGE_SIZE;
-		addr = f->flush_start;
-		while (addr < f->flush_end) {
+			(f->end - f->start) / PAGE_SIZE;
+		addr = f->start;
+		while (addr < f->end) {
 			__flush_tlb_single(addr);
 			addr += PAGE_SIZE;
 		}
@@ -260,33 +254,27 @@ static void flush_tlb_func(void *info)
 }
 
 void native_flush_tlb_others(const struct cpumask *cpumask,
-				 struct mm_struct *mm, unsigned long start,
-				 unsigned long end)
+			     const struct flush_tlb_info *info)
 {
-	struct flush_tlb_info info;
-
-	info.flush_mm = mm;
-	info.flush_start = start;
-	info.flush_end = end;
-
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
-	if (end == TLB_FLUSH_ALL)
+	if (info->end == TLB_FLUSH_ALL)
 		trace_tlb_flush(TLB_REMOTE_SEND_IPI, TLB_FLUSH_ALL);
 	else
 		trace_tlb_flush(TLB_REMOTE_SEND_IPI,
-				(end - start) >> PAGE_SHIFT);
+				(info->end - info->start) >> PAGE_SHIFT);
 
 	if (is_uv_system()) {
 		unsigned int cpu;
 
 		cpu = smp_processor_id();
-		cpumask = uv_flush_tlb_others(cpumask, mm, start, end, cpu);
+		cpumask = uv_flush_tlb_others(cpumask, info);
 		if (cpumask)
 			smp_call_function_many(cpumask, flush_tlb_func,
-								&info, 1);
+					       (void *)info, 1);
 		return;
 	}
-	smp_call_function_many(cpumask, flush_tlb_func, &info, 1);
+	smp_call_function_many(cpumask, flush_tlb_func,
+			       (void *)info, 1);
 }
 
 /*
@@ -305,6 +293,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 				unsigned long end, unsigned long vmflag)
 {
 	unsigned long addr;
+	struct flush_tlb_info info;
 	/* do a global flush by default */
 	unsigned long base_pages_to_flush = TLB_FLUSH_ALL;
 
@@ -347,15 +336,20 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 	}
 	trace_tlb_flush(TLB_LOCAL_MM_SHOOTDOWN, base_pages_to_flush);
 out:
+	info.mm = mm;
 	if (base_pages_to_flush == TLB_FLUSH_ALL) {
-		start = 0UL;
-		end = TLB_FLUSH_ALL;
+		info.start = 0UL;
+		info.end = TLB_FLUSH_ALL;
+	} else {
+		info.start = start;
+		info.end = end;
 	}
 	if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids)
-		flush_tlb_others(mm_cpumask(mm), mm, start, end);
+		flush_tlb_others(mm_cpumask(mm), &info);
 	preempt_enable();
 }
 
+
 static void do_flush_tlb_all(void *info)
 {
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
@@ -376,7 +370,7 @@ static void do_kernel_range_flush(void *info)
 	unsigned long addr;
 
 	/* flush range by one by one 'invlpg' */
-	for (addr = f->flush_start; addr < f->flush_end; addr += PAGE_SIZE)
+	for (addr = f->start; addr < f->end; addr += PAGE_SIZE)
 		__flush_tlb_single(addr);
 }
 
@@ -389,14 +383,20 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end)
 		on_each_cpu(do_flush_tlb_all, NULL, 1);
 	} else {
 		struct flush_tlb_info info;
-		info.flush_start = start;
-		info.flush_end = end;
+		info.start = start;
+		info.end = end;
 		on_each_cpu(do_kernel_range_flush, &info, 1);
 	}
 }
 
 void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 {
+	struct flush_tlb_info info = {
+		.mm = NULL,
+		.start = 0UL,
+		.end = TLB_FLUSH_ALL,
+	};
+
 	int cpu = get_cpu();
 
 	if (cpumask_test_cpu(cpu, &batch->cpumask)) {
@@ -406,7 +406,7 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 	}
 
 	if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids)
-		flush_tlb_others(&batch->cpumask, NULL, 0, TLB_FLUSH_ALL);
+		flush_tlb_others(&batch->cpumask, &info);
 	cpumask_clear(&batch->cpumask);
 
 	put_cpu();
diff --git a/arch/x86/platform/uv/tlb_uv.c b/arch/x86/platform/uv/tlb_uv.c
index f25982cdff90..018af012e646 100644
--- a/arch/x86/platform/uv/tlb_uv.c
+++ b/arch/x86/platform/uv/tlb_uv.c
@@ -1109,11 +1109,9 @@ static int set_distrib_bits(struct cpumask *flush_mask, struct bau_control *bcp,
  * done.  The returned pointer is valid till preemption is re-enabled.
  */
 const struct cpumask *uv_flush_tlb_others(const struct cpumask *cpumask,
-						struct mm_struct *mm,
-						unsigned long start,
-						unsigned long end,
-						unsigned int cpu)
+					  const struct flush_tlb_info *info)
 {
+	unsigned int cpu = smp_processor_id();
 	int locals = 0;
 	int remotes = 0;
 	int hubs = 0;
@@ -1170,7 +1168,7 @@ const struct cpumask *uv_flush_tlb_others(const struct cpumask *cpumask,
 
 	record_send_statistics(stat, locals, hubs, remotes, bau_desc);
 
-	if (!end || (end - start) <= PAGE_SIZE)
+	if (!info->end || (info->end - info->start) <= PAGE_SIZE)
 		bau_desc->payload.address = start;
 	else
 		bau_desc->payload.address = TLB_FLUSH_ALL;
diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index f226038a39ca..894daefd6958 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1427,8 +1427,7 @@ static void xen_flush_tlb_single(unsigned long addr)
 }
 
 static void xen_flush_tlb_others(const struct cpumask *cpus,
-				 struct mm_struct *mm, unsigned long start,
-				 unsigned long end)
+				 const struct flush_tlb_info *info)
 {
 	struct {
 		struct mmuext_op op;
@@ -1440,7 +1439,7 @@ static void xen_flush_tlb_others(const struct cpumask *cpus,
 	} *args;
 	struct multicall_space mcs;
 
-	trace_xen_mmu_flush_tlb_others(cpus, mm, start, end);
+	trace_xen_mmu_flush_tlb_others(cpus, info->mm, info->start, info->end);
 
 	if (cpumask_empty(cpus))
 		return;		/* nothing to do */
@@ -1454,9 +1453,10 @@ static void xen_flush_tlb_others(const struct cpumask *cpus,
 	cpumask_clear_cpu(smp_processor_id(), to_cpumask(args->mask));
 
 	args->op.cmd = MMUEXT_TLB_FLUSH_MULTI;
-	if (end != TLB_FLUSH_ALL && (end - start) <= PAGE_SIZE) {
+	if (info->end != TLB_FLUSH_ALL &&
+	    (info->end - info->start) <= PAGE_SIZE) {
 		args->op.cmd = MMUEXT_INVLPG_MULTI;
-		args->op.arg1.linear_addr = start;
+		args->op.arg1.linear_addr = info->start;
 	}
 
 	MULTI_mmuext_op(mcs.mc, &args->op, 1, NULL, DOMID_SELF);
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC 05/10] x86/mm: Change the leave_mm() condition for local TLB flushes
  2017-05-07 12:38 [RFC 00/10] x86 TLB flush cleanups, moving toward PCID support Andy Lutomirski
                   ` (3 preceding siblings ...)
  2017-05-07 12:38 ` [RFC 04/10] x86/mm: Pass flush_tlb_info to flush_tlb_others() etc Andy Lutomirski
@ 2017-05-07 12:38 ` Andy Lutomirski
  2017-05-07 12:38 ` [RFC 06/10] x86/mm: Refactor flush_tlb_mm_range() to merge local and remote cases Andy Lutomirski
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 32+ messages in thread
From: Andy Lutomirski @ 2017-05-07 12:38 UTC (permalink / raw)
  To: X86 ML
  Cc: linux-kernel, Borislav Petkov, Linus Torvalds, Andrew Morton,
	Mel Gorman, linux-mm, Andy Lutomirski, Rik van Riel, Dave Hansen,
	Nadav Amit, Michal Hocko, Arjan van de Ven

On a remote TLB flush, we leave_mm() if we're TLBSTATE_LAZY.  For a
local flush_tlb_mm_range(), we leave_mm() if !current->mm.  These
are approximately the same condition -- the scheduler sets lazy TLB
mode when switching to a thread with no mm.

I'm about to merge the local and remote flush code, but for ease of
verifying and bisecting the patch, I want the local and remote flush
behavior to match first.  This patch changes the local code to match
the remote code.

Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/mm/tlb.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 776469cc54e0..3143c9a180e5 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -311,7 +311,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 		goto out;
 	}
 
-	if (!current->mm) {
+	if (this_cpu_read(cpu_tlbstate.state) != TLBSTATE_OK) {
 		leave_mm(smp_processor_id());
 
 		/* Synchronize with switch_mm. */
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC 06/10] x86/mm: Refactor flush_tlb_mm_range() to merge local and remote cases
  2017-05-07 12:38 [RFC 00/10] x86 TLB flush cleanups, moving toward PCID support Andy Lutomirski
                   ` (4 preceding siblings ...)
  2017-05-07 12:38 ` [RFC 05/10] x86/mm: Change the leave_mm() condition for local TLB flushes Andy Lutomirski
@ 2017-05-07 12:38 ` Andy Lutomirski
  2017-05-07 12:38 ` [RFC 07/10] x86/mm: Use new merged flush logic in arch_tlbbatch_flush() Andy Lutomirski
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 32+ messages in thread
From: Andy Lutomirski @ 2017-05-07 12:38 UTC (permalink / raw)
  To: X86 ML
  Cc: linux-kernel, Borislav Petkov, Linus Torvalds, Andrew Morton,
	Mel Gorman, linux-mm, Andy Lutomirski, Rik van Riel, Dave Hansen,
	Nadav Amit, Michal Hocko, Arjan van de Ven

The local flush path is very similar to the remote flush path.
Merge them.

This is intended to make no difference to behavior whatsoever.  It
removes some code and will make future changes to the flushing
mechanics simpler.

This patch does remove one small optimization: flush_tlb_mm_range()
now has an unconditional smp_mb() instead of using MOV to CR3 or
INVLPG as a full barrier when applicable.  I think this is okay for
a few reasons.  First, smp_mb() is quite cheap compared to the cost
of a TLB flush.  Second, this rearrangement makes a bigger
optimization available: with some work on the SMP function call
code, we could do the local and remote flushes in parallel.  Third,
I'm planning a rework of the TLB flush algorithm that will require
an atomic operation at the beginning of each flush, and that
operation will replace the smp_mb().

Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/tlbflush.h |   1 -
 arch/x86/mm/tlb.c               | 113 +++++++++++++++++-----------------------
 2 files changed, 48 insertions(+), 66 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 6d236c1aff89..c600d2d4a501 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -225,7 +225,6 @@ static inline void __flush_tlb_one(unsigned long addr)
  * ..but the i386 has somewhat limited tlb flushing capabilities,
  * and page-granular flushes are available only on i486 and up.
  */
-
 struct flush_tlb_info {
 	struct mm_struct *mm;
 	unsigned long start;
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 3143c9a180e5..12b8812e8926 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -216,22 +216,9 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
  * write/read ordering problems.
  */
 
-/*
- * TLB flush funcation:
- * 1) Flush the tlb entries if the cpu uses the mm that's being flushed.
- * 2) Leave the mm if we are in the lazy tlb mode.
- */
-static void flush_tlb_func(void *info)
+static void flush_tlb_func_common(const struct flush_tlb_info *f,
+				  bool local, enum tlb_flush_reason reason)
 {
-	const struct flush_tlb_info *f = info;
-
-	inc_irq_stat(irq_tlb_count);
-
-	if (f->mm && f->mm != this_cpu_read(cpu_tlbstate.active_mm))
-		return;
-
-	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
-
 	if (this_cpu_read(cpu_tlbstate.state) != TLBSTATE_OK) {
 		leave_mm(smp_processor_id());
 		return;
@@ -239,7 +226,9 @@ static void flush_tlb_func(void *info)
 
 	if (f->end == TLB_FLUSH_ALL) {
 		local_flush_tlb();
-		trace_tlb_flush(TLB_REMOTE_SHOOTDOWN, TLB_FLUSH_ALL);
+		if (local)
+			count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
+		trace_tlb_flush(reason, TLB_FLUSH_ALL);
 	} else {
 		unsigned long addr;
 		unsigned long nr_pages =
@@ -249,10 +238,32 @@ static void flush_tlb_func(void *info)
 			__flush_tlb_single(addr);
 			addr += PAGE_SIZE;
 		}
-		trace_tlb_flush(TLB_REMOTE_SHOOTDOWN, nr_pages);
+		if (local)
+			count_vm_tlb_events(NR_TLB_LOCAL_FLUSH_ONE, nr_pages);
+		trace_tlb_flush(reason, nr_pages);
 	}
 }
 
+static void flush_tlb_func_local(void *info, enum tlb_flush_reason reason)
+{
+	const struct flush_tlb_info *f = info;
+
+	flush_tlb_func_common(f, true, reason);
+}
+
+static void flush_tlb_func_remote(void *info)
+{
+	const struct flush_tlb_info *f = info;
+
+	inc_irq_stat(irq_tlb_count);
+
+	if (f->mm && f->mm != this_cpu_read(cpu_tlbstate.active_mm))
+		return;
+
+	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
+	flush_tlb_func_common(f, false, TLB_REMOTE_SHOOTDOWN);
+}
+
 void native_flush_tlb_others(const struct cpumask *cpumask,
 			     const struct flush_tlb_info *info)
 {
@@ -269,11 +280,11 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 		cpu = smp_processor_id();
 		cpumask = uv_flush_tlb_others(cpumask, info);
 		if (cpumask)
-			smp_call_function_many(cpumask, flush_tlb_func,
+			smp_call_function_many(cpumask, flush_tlb_func_remote,
 					       (void *)info, 1);
 		return;
 	}
-	smp_call_function_many(cpumask, flush_tlb_func,
+	smp_call_function_many(cpumask, flush_tlb_func_remote,
 			       (void *)info, 1);
 }
 
@@ -292,61 +303,33 @@ static unsigned long tlb_single_page_flush_ceiling __read_mostly = 33;
 void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 				unsigned long end, unsigned long vmflag)
 {
-	unsigned long addr;
-	struct flush_tlb_info info;
-	/* do a global flush by default */
-	unsigned long base_pages_to_flush = TLB_FLUSH_ALL;
-
-	preempt_disable();
+	int cpu;
 
-	if ((end != TLB_FLUSH_ALL) && !(vmflag & VM_HUGETLB))
-		base_pages_to_flush = (end - start) >> PAGE_SHIFT;
-	if (base_pages_to_flush > tlb_single_page_flush_ceiling)
-		base_pages_to_flush = TLB_FLUSH_ALL;
-
-	if (current->active_mm != mm) {
-		/* Synchronize with switch_mm. */
-		smp_mb();
-
-		goto out;
-	}
-
-	if (this_cpu_read(cpu_tlbstate.state) != TLBSTATE_OK) {
-		leave_mm(smp_processor_id());
+	struct flush_tlb_info info = {
+		.mm = mm,
+	};
 
-		/* Synchronize with switch_mm. */
-		smp_mb();
+	cpu = get_cpu();
 
-		goto out;
-	}
+	/* Synchronize with switch_mm. */
+	smp_mb();
 
-	/*
-	 * Both branches below are implicit full barriers (MOV to CR or
-	 * INVLPG) that synchronize with switch_mm.
-	 */
-	if (base_pages_to_flush == TLB_FLUSH_ALL) {
-		count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
-		local_flush_tlb();
+	/* Should we flush just the requested range? */
+	if ((end != TLB_FLUSH_ALL) &&
+	    !(vmflag & VM_HUGETLB) &&
+	    ((end - start) >> PAGE_SHIFT) <= tlb_single_page_flush_ceiling) {
+		info.start = start;
+		info.end = end;
 	} else {
-		/* flush range by one by one 'invlpg' */
-		for (addr = start; addr < end;	addr += PAGE_SIZE) {
-			count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ONE);
-			__flush_tlb_single(addr);
-		}
-	}
-	trace_tlb_flush(TLB_LOCAL_MM_SHOOTDOWN, base_pages_to_flush);
-out:
-	info.mm = mm;
-	if (base_pages_to_flush == TLB_FLUSH_ALL) {
 		info.start = 0UL;
 		info.end = TLB_FLUSH_ALL;
-	} else {
-		info.start = start;
-		info.end = end;
 	}
-	if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids)
+
+	if (mm == current->active_mm)
+		flush_tlb_func_local(&info, TLB_LOCAL_MM_SHOOTDOWN);
+	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
 		flush_tlb_others(mm_cpumask(mm), &info);
-	preempt_enable();
+	put_cpu();
 }
 
 
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC 07/10] x86/mm: Use new merged flush logic in arch_tlbbatch_flush()
  2017-05-07 12:38 [RFC 00/10] x86 TLB flush cleanups, moving toward PCID support Andy Lutomirski
                   ` (5 preceding siblings ...)
  2017-05-07 12:38 ` [RFC 06/10] x86/mm: Refactor flush_tlb_mm_range() to merge local and remote cases Andy Lutomirski
@ 2017-05-07 12:38 ` Andy Lutomirski
  2017-05-07 12:38 ` [RFC 08/10] x86/mm: Remove the UP tlbflush code; always use the formerly SMP code Andy Lutomirski
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 32+ messages in thread
From: Andy Lutomirski @ 2017-05-07 12:38 UTC (permalink / raw)
  To: X86 ML
  Cc: linux-kernel, Borislav Petkov, Linus Torvalds, Andrew Morton,
	Mel Gorman, linux-mm, Andy Lutomirski, Rik van Riel, Dave Hansen,
	Nadav Amit, Michal Hocko, Arjan van de Ven

Now there's only one copy of the local tlb flush logic for
non-kernel pages on SMP kernels.

The only functional change is that arch_tlbbatch_flush() will now
leave_mm() on the local CPU if that CPU is in the batch and is in
TLBSTATE_LAZY mode.

Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/mm/tlb.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 12b8812e8926..c03b4a0ce58c 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -382,12 +382,8 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 
 	int cpu = get_cpu();
 
-	if (cpumask_test_cpu(cpu, &batch->cpumask)) {
-		count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
-		local_flush_tlb();
-		trace_tlb_flush(TLB_LOCAL_SHOOTDOWN, TLB_FLUSH_ALL);
-	}
-
+	if (cpumask_test_cpu(cpu, &batch->cpumask))
+		flush_tlb_func_local(&info, TLB_LOCAL_SHOOTDOWN);
 	if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids)
 		flush_tlb_others(&batch->cpumask, &info);
 	cpumask_clear(&batch->cpumask);
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC 08/10] x86/mm: Remove the UP tlbflush code; always use the formerly SMP code
  2017-05-07 12:38 [RFC 00/10] x86 TLB flush cleanups, moving toward PCID support Andy Lutomirski
                   ` (6 preceding siblings ...)
  2017-05-07 12:38 ` [RFC 07/10] x86/mm: Use new merged flush logic in arch_tlbbatch_flush() Andy Lutomirski
@ 2017-05-07 12:38 ` Andy Lutomirski
  2017-05-07 12:38 ` [RFC 09/10] x86/mm: Rework lazy TLB to track the actual loaded mm Andy Lutomirski
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 32+ messages in thread
From: Andy Lutomirski @ 2017-05-07 12:38 UTC (permalink / raw)
  To: X86 ML
  Cc: linux-kernel, Borislav Petkov, Linus Torvalds, Andrew Morton,
	Mel Gorman, linux-mm, Andy Lutomirski, Rik van Riel, Dave Hansen,
	Nadav Amit, Michal Hocko, Arjan van de Ven

The UP tlbflush generates somewhat nicer code than the SMP version.
Aside from that, it's fallen quite a bit behind the SMP code:

 - flush_tlb_mm_range() didn't flush individual pages if the range
   was small.

 - The lazy TLB code was much weaker.  This usually wouldn't matter,
   but, if a kernel thread flushed its lazy "active_mm" more than
   once (due to reclaim or similar), it wouldn't be unlazied and
   would instead pointlessly flush repeatedly.

 - Tracepoints were missing.

Aside from that, simply having the UP code around was a maintanence
burden, since it means that any change to the TLB flush code had to
make sure not to break it.

Simplify everything by deleting the UP code.

Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/Kconfig                   |  2 +-
 arch/x86/include/asm/hardirq.h     |  2 +-
 arch/x86/include/asm/mmu.h         |  6 ---
 arch/x86/include/asm/mmu_context.h |  2 -
 arch/x86/include/asm/tlbbatch.h    |  2 -
 arch/x86/include/asm/tlbflush.h    | 76 +-------------------------------------
 arch/x86/mm/init.c                 |  2 -
 arch/x86/mm/tlb.c                  | 17 +--------
 8 files changed, 5 insertions(+), 104 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2bde14451e54..1b781bcde1a0 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -69,7 +69,7 @@ config X86
 	select ARCH_USE_BUILTIN_BSWAP
 	select ARCH_USE_QUEUED_RWLOCKS
 	select ARCH_USE_QUEUED_SPINLOCKS
-	select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH if SMP
+	select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
 	select ARCH_WANT_FRAME_POINTERS
 	select ARCH_WANTS_DYNAMIC_TASK_STRUCT
 	select BUILDTIME_EXTABLE_SORT
diff --git a/arch/x86/include/asm/hardirq.h b/arch/x86/include/asm/hardirq.h
index 59405a248fc2..9b76cd331990 100644
--- a/arch/x86/include/asm/hardirq.h
+++ b/arch/x86/include/asm/hardirq.h
@@ -22,8 +22,8 @@ typedef struct {
 #ifdef CONFIG_SMP
 	unsigned int irq_resched_count;
 	unsigned int irq_call_count;
-	unsigned int irq_tlb_count;
 #endif
+	unsigned int irq_tlb_count;
 #ifdef CONFIG_X86_THERMAL_VECTOR
 	unsigned int irq_thermal_count;
 #endif
diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
index f9813b6d8b80..79b647a7ebd0 100644
--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -37,12 +37,6 @@ typedef struct {
 #endif
 } mm_context_t;
 
-#ifdef CONFIG_SMP
 void leave_mm(int cpu);
-#else
-static inline void leave_mm(int cpu)
-{
-}
-#endif
 
 #endif /* _ASM_X86_MMU_H */
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 68b329d77b3a..187c39470a0b 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -99,10 +99,8 @@ static inline void load_mm_ldt(struct mm_struct *mm)
 
 static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
 {
-#ifdef CONFIG_SMP
 	if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK)
 		this_cpu_write(cpu_tlbstate.state, TLBSTATE_LAZY);
-#endif
 }
 
 static inline int init_new_context(struct task_struct *tsk,
diff --git a/arch/x86/include/asm/tlbbatch.h b/arch/x86/include/asm/tlbbatch.h
index 01a6de16fb96..f4a6ff352a0e 100644
--- a/arch/x86/include/asm/tlbbatch.h
+++ b/arch/x86/include/asm/tlbbatch.h
@@ -3,7 +3,6 @@
 
 #include <linux/cpumask.h>
 
-#ifdef CONFIG_SMP
 struct arch_tlbflush_unmap_batch {
 	/*
 	 * Each bit set is a CPU that potentially has a TLB entry for one of
@@ -11,6 +10,5 @@ struct arch_tlbflush_unmap_batch {
 	 */
 	struct cpumask cpumask;
 };
-#endif
 
 #endif /* _ARCH_X86_TLBBATCH_H */
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index c600d2d4a501..3c9289637669 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -7,6 +7,7 @@
 #include <asm/processor.h>
 #include <asm/cpufeature.h>
 #include <asm/special_insns.h>
+#include <asm/smp.h>
 
 static inline void __invpcid(unsigned long pcid, unsigned long addr,
 			     unsigned long type)
@@ -65,10 +66,8 @@ static inline void invpcid_flush_all_nonglobals(void)
 #endif
 
 struct tlb_state {
-#ifdef CONFIG_SMP
 	struct mm_struct *active_mm;
 	int state;
-#endif
 
 	/*
 	 * Access to this CR4 shadow and to H/W CR4 is protected by
@@ -231,77 +230,6 @@ struct flush_tlb_info {
 	unsigned long end;
 };
 
-#ifndef CONFIG_SMP
-
-/* "_up" is for UniProcessor.
- *
- * This is a helper for other header functions.  *Not* intended to be called
- * directly.  All global TLB flushes need to either call this, or to bump the
- * vm statistics themselves.
- */
-static inline void __flush_tlb_up(void)
-{
-	count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
-	__flush_tlb();
-}
-
-static inline void flush_tlb_all(void)
-{
-	count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
-	__flush_tlb_all();
-}
-
-static inline void local_flush_tlb(void)
-{
-	__flush_tlb_up();
-}
-
-static inline void flush_tlb_mm(struct mm_struct *mm)
-{
-	if (mm == current->active_mm)
-		__flush_tlb_up();
-}
-
-static inline void flush_tlb_page(struct vm_area_struct *vma,
-				  unsigned long addr)
-{
-	if (vma->vm_mm == current->active_mm)
-		__flush_tlb_one(addr);
-}
-
-static inline void flush_tlb_range(struct vm_area_struct *vma,
-				   unsigned long start, unsigned long end)
-{
-	if (vma->vm_mm == current->active_mm)
-		__flush_tlb_up();
-}
-
-static inline void flush_tlb_mm_range(struct mm_struct *mm,
-	   unsigned long start, unsigned long end, unsigned long vmflag)
-{
-	if (mm == current->active_mm)
-		__flush_tlb_up();
-}
-
-static inline void native_flush_tlb_others(const struct cpumask *cpumask,
-					   const struct flush_tlb_info *info)
-{
-}
-
-static inline void reset_lazy_tlbstate(void)
-{
-}
-
-static inline void flush_tlb_kernel_range(unsigned long start,
-					  unsigned long end)
-{
-	flush_tlb_all();
-}
-
-#else  /* SMP */
-
-#include <asm/smp.h>
-
 #define local_flush_tlb() __flush_tlb()
 
 #define flush_tlb_mm(mm)	flush_tlb_mm_range(mm, 0UL, TLB_FLUSH_ALL, 0UL)
@@ -339,8 +267,6 @@ static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
 
 extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
 
-#endif	/* SMP */
-
 #ifndef CONFIG_PARAVIRT
 #define flush_tlb_others(mask, info)	\
 	native_flush_tlb_others(mask, info)
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 2193799ca800..078972129885 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -792,10 +792,8 @@ void __init zone_sizes_init(void)
 }
 
 DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate) = {
-#ifdef CONFIG_SMP
 	.active_mm = &init_mm,
 	.state = 0,
-#endif
 	.cr4 = ~0UL,	/* fail hard if we screw up cr4 shadow initialization */
 };
 EXPORT_SYMBOL_GPL(cpu_tlbstate);
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index c03b4a0ce58c..da1416c77bfb 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -15,7 +15,7 @@
 #include <linux/debugfs.h>
 
 /*
- *	Smarter SMP flushing macros.
+ *	TLB flushing, formerly SMP-only
  *		c/o Linus Torvalds.
  *
  *	These mean you can really definitely utterly forget about
@@ -28,8 +28,6 @@
  *	Implement flush IPI by CALL_FUNCTION_VECTOR, Alex Shi
  */
 
-#ifdef CONFIG_SMP
-
 /*
  * We cannot call mmdrop() because we are in interrupt context,
  * instead update mm->cpu_vm_mask.
@@ -53,8 +51,6 @@ void leave_mm(int cpu)
 }
 EXPORT_SYMBOL_GPL(leave_mm);
 
-#endif /* CONFIG_SMP */
-
 void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 	       struct task_struct *tsk)
 {
@@ -85,10 +81,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 				set_pgd(pgd, init_mm.pgd[stack_pgd_index]);
 		}
 
-#ifdef CONFIG_SMP
 		this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
 		this_cpu_write(cpu_tlbstate.active_mm, next);
-#endif
 
 		cpumask_set_cpu(cpu, mm_cpumask(next));
 
@@ -146,9 +140,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		if (unlikely(prev->context.ldt != next->context.ldt))
 			load_mm_ldt(next);
 #endif
-	}
-#ifdef CONFIG_SMP
-	  else {
+	} else {
 		this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
 		BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next);
 
@@ -175,11 +167,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 			load_mm_ldt(next);
 		}
 	}
-#endif
 }
 
-#ifdef CONFIG_SMP
-
 /*
  * The flush IPI assumes that a thread switch happens in this order:
  * [cpu0: the cpu that switches]
@@ -436,5 +425,3 @@ static int __init create_tlb_single_page_flush_ceiling(void)
 	return 0;
 }
 late_initcall(create_tlb_single_page_flush_ceiling);
-
-#endif /* CONFIG_SMP */
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC 09/10] x86/mm: Rework lazy TLB to track the actual loaded mm
  2017-05-07 12:38 [RFC 00/10] x86 TLB flush cleanups, moving toward PCID support Andy Lutomirski
                   ` (7 preceding siblings ...)
  2017-05-07 12:38 ` [RFC 08/10] x86/mm: Remove the UP tlbflush code; always use the formerly SMP code Andy Lutomirski
@ 2017-05-07 12:38 ` Andy Lutomirski
  2017-05-09 20:41   ` Thomas Gleixner
  2017-05-07 12:38 ` [RFC 10/10] x86,kvm: Teach KVM's VMX code that CR3 isn't a constant Andy Lutomirski
                   ` (3 subsequent siblings)
  12 siblings, 1 reply; 32+ messages in thread
From: Andy Lutomirski @ 2017-05-07 12:38 UTC (permalink / raw)
  To: X86 ML
  Cc: linux-kernel, Borislav Petkov, Linus Torvalds, Andrew Morton,
	Mel Gorman, linux-mm, Andy Lutomirski, Rik van Riel, Dave Hansen,
	Nadav Amit, Michal Hocko, Arjan van de Ven

Lazy TLB state is current managed in a rather baroque manner.
AFAICT, there are three possible states:

 - Non-lazy.  This means that we're running a user thread or a
   kernel thread that has called use_mm().  current->mm ==
   current->active_mm == cpu_tlbstate.active_mm and
   cpu_tlbstate.state == TLBSTATE_OK.

 - Lazy with user mm.  We're running a kernel thread without an mm
   and we're borrowing an mm_struct.  We have current->mm == NULL,
   current->active_mm == cpu_tlbstate.active_mm, cpu_tlbstate.state
   != TLBSTATE_OK (i.e. TLBSTATE_LAZY or 0).  The current cpu is set
   in mm_cpumask(current->active_mm).  CR3 points to
   current->active_mm->pgd.  The TLB is up to date.

 - Lazy with init_mm.  This happens when we call leave_mm().  We
   have current->mm == NULL, current->active_mm ==
   cpu_tlbstate.active_mm, but that mm is only relelvant insofar as
   the scheduler is tracking it for refcounting.  cpu_tlbstate.state
   != TLBSTATE_OK.  The current cpu is clear in
   mm_cpumask(current->active_mm).  CR3 points to swapper_pg_dir,
   i.e. init_mm->pgd.

This patch simplifies the situation.  Other than perf, x86 stops
caring about current->active_mm at all.  We have
cpu_tlbstate.loaded_mm pointing to the mm that CR3 references.  The
TLB is always up to date for that mm.  leave_mm() just switches us
to init_mm.  There are no longer any special cases for mm_cpumask,
and switch_mm() switches mms without worrying about laziness.

After this patch, cpu_tlbstate.state serves only to tell the TLB
flush code whether it may switch to init_mm instead of doing a
normal flush.

This makes fairly extensive changes to xen_exit_mmap(), which used
to look a bit like black magic.

Perf is unchanged.  With or without this change, perf may behave a bit
erratically if it tries to read user memory in kernel thread context.
We should build on this patch to teach perf to never look at user
memory when cpu_tlbstate.loaded_mm != current->mm.

Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/events/core.c          |   3 +-
 arch/x86/include/asm/tlbflush.h |  12 ++-
 arch/x86/kernel/ldt.c           |   5 +-
 arch/x86/mm/init.c              |   2 +-
 arch/x86/mm/tlb.c               | 209 ++++++++++++++++++++--------------------
 arch/x86/xen/mmu.c              |  51 +++++-----
 6 files changed, 143 insertions(+), 139 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 580b60f5ac83..77a33096728d 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2101,8 +2101,7 @@ static int x86_pmu_event_init(struct perf_event *event)
 
 static void refresh_pce(void *ignored)
 {
-	if (current->active_mm)
-		load_mm_cr4(current->active_mm);
+	load_mm_cr4(this_cpu_read(cpu_tlbstate.loaded_mm));
 }
 
 static void x86_pmu_event_mapped(struct perf_event *event)
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 3c9289637669..a08680d3edaa 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -66,7 +66,13 @@ static inline void invpcid_flush_all_nonglobals(void)
 #endif
 
 struct tlb_state {
-	struct mm_struct *active_mm;
+	/*
+	 * cpu_tlbstate.loaded_mm should match CR3 whenever interrupts
+	 * are on.  This means that it may not match current->active_mm,
+	 * which will contain the previous user mm when we're in lazy TLB
+	 * mode even if we've already switched back to swapper_pg_dir.
+	 */
+	struct mm_struct *loaded_mm;
 	int state;
 
 	/*
@@ -256,7 +262,9 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 static inline void reset_lazy_tlbstate(void)
 {
 	this_cpu_write(cpu_tlbstate.state, 0);
-	this_cpu_write(cpu_tlbstate.active_mm, &init_mm);
+	this_cpu_write(cpu_tlbstate.loaded_mm, &init_mm);
+
+	WARN_ON(read_cr3() != __pa_symbol(swapper_pg_dir));
 }
 
 static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
index d4a15831ac58..d7e01790ae86 100644
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -24,12 +24,13 @@
 /* context.lock is held for us, so we don't need any locking. */
 static void flush_ldt(void *current_mm)
 {
+	struct mm_struct *mm = current_mm;
 	mm_context_t *pc;
 
-	if (current->active_mm != current_mm)
+	if (this_cpu_read(cpu_tlbstate.loaded_mm) != current_mm)
 		return;
 
-	pc = &current->active_mm->context;
+	pc = &mm->context;
 	set_ldt(pc->ldt->entries, pc->ldt->size);
 }
 
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 078972129885..47453ce2374e 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -792,7 +792,7 @@ void __init zone_sizes_init(void)
 }
 
 DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate) = {
-	.active_mm = &init_mm,
+	.loaded_mm = &init_mm,
 	.state = 0,
 	.cr4 = ~0UL,	/* fail hard if we screw up cr4 shadow initialization */
 };
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index da1416c77bfb..c33fc1eaace4 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -34,20 +34,19 @@
  */
 void leave_mm(int cpu)
 {
-	struct mm_struct *active_mm = this_cpu_read(cpu_tlbstate.active_mm);
+	struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
 	if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK)
 		BUG();
-	if (cpumask_test_cpu(cpu, mm_cpumask(active_mm))) {
-		cpumask_clear_cpu(cpu, mm_cpumask(active_mm));
-		load_cr3(swapper_pg_dir);
-		/*
-		 * This gets called in the idle path where RCU
-		 * functions differently.  Tracing normally
-		 * uses RCU, so we have to call the tracepoint
-		 * specially here.
-		 */
-		trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
-	}
+
+	/*
+	 * It's plausible that we're in lazy TLB mode while our mm is init_mm.
+	 * If so, our callers still expect us to flush the TLB, but there
+	 * aren't any user TLB entries in init_mm to worry about.
+	 */
+	if (loaded_mm == &init_mm)
+		return;
+
+	switch_mm(NULL, &init_mm, NULL);
 }
 EXPORT_SYMBOL_GPL(leave_mm);
 
@@ -65,108 +64,110 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 			struct task_struct *tsk)
 {
 	unsigned cpu = smp_processor_id();
+	struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm);
 
-	if (likely(prev != next)) {
-		if (IS_ENABLED(CONFIG_VMAP_STACK)) {
-			/*
-			 * If our current stack is in vmalloc space and isn't
-			 * mapped in the new pgd, we'll double-fault.  Forcibly
-			 * map it.
-			 */
-			unsigned int stack_pgd_index = pgd_index(current_stack_pointer());
+	/*
+	 * NB: The scheduler will call us with prev == next when
+	 * switching from lazy TLB mode to normal mode if active_mm
+	 * isn't changing.  When this happens, there is no guarantee
+	 * that CR3 (and hence cpu_tlbstate.loaded_mm) matches next.
+	 *
+	 * NB: leave_mm() calls us with prev == NULL and tsk == NULL.
+	 */
 
-			pgd_t *pgd = next->pgd + stack_pgd_index;
+	this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
 
-			if (unlikely(pgd_none(*pgd)))
-				set_pgd(pgd, init_mm.pgd[stack_pgd_index]);
-		}
-
-		this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
-		this_cpu_write(cpu_tlbstate.active_mm, next);
-
-		cpumask_set_cpu(cpu, mm_cpumask(next));
+	if (real_prev == next) {
+		/*
+		 * There's nothing to do: we always keep the per-mm control
+		 * regs in sync with cpu_tlbstate.loaded_mm.  Just
+		 * sanity-check mm_cpumask.
+		 */
+		if (WARN_ON_ONCE(!cpumask_test_cpu(cpu, mm_cpumask(next))))
+			cpumask_set_cpu(cpu, mm_cpumask(next));
+		return;
+	}
 
+	if (IS_ENABLED(CONFIG_VMAP_STACK)) {
 		/*
-		 * Re-load page tables.
-		 *
-		 * This logic has an ordering constraint:
-		 *
-		 *  CPU 0: Write to a PTE for 'next'
-		 *  CPU 0: load bit 1 in mm_cpumask.  if nonzero, send IPI.
-		 *  CPU 1: set bit 1 in next's mm_cpumask
-		 *  CPU 1: load from the PTE that CPU 0 writes (implicit)
-		 *
-		 * We need to prevent an outcome in which CPU 1 observes
-		 * the new PTE value and CPU 0 observes bit 1 clear in
-		 * mm_cpumask.  (If that occurs, then the IPI will never
-		 * be sent, and CPU 0's TLB will contain a stale entry.)
-		 *
-		 * The bad outcome can occur if either CPU's load is
-		 * reordered before that CPU's store, so both CPUs must
-		 * execute full barriers to prevent this from happening.
-		 *
-		 * Thus, switch_mm needs a full barrier between the
-		 * store to mm_cpumask and any operation that could load
-		 * from next->pgd.  TLB fills are special and can happen
-		 * due to instruction fetches or for no reason at all,
-		 * and neither LOCK nor MFENCE orders them.
-		 * Fortunately, load_cr3() is serializing and gives the
-		 * ordering guarantee we need.
-		 *
+		 * If our current stack is in vmalloc space and isn't
+		 * mapped in the new pgd, we'll double-fault.  Forcibly
+		 * map it.
 		 */
-		load_cr3(next->pgd);
+		unsigned int stack_pgd_index = pgd_index(current_stack_pointer());
 
-		trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
+		pgd_t *pgd = next->pgd + stack_pgd_index;
 
-		/* Stop flush ipis for the previous mm */
-		cpumask_clear_cpu(cpu, mm_cpumask(prev));
+		if (unlikely(pgd_none(*pgd)))
+			set_pgd(pgd, init_mm.pgd[stack_pgd_index]);
+	}
 
-		/* Load per-mm CR4 state */
-		load_mm_cr4(next);
+	this_cpu_write(cpu_tlbstate.loaded_mm, next);
+
+	WARN_ON_ONCE(cpumask_test_cpu(cpu, mm_cpumask(next)));
+	cpumask_set_cpu(cpu, mm_cpumask(next));
+
+	/*
+	 * Re-load page tables.
+	 *
+	 * This logic has an ordering constraint:
+	 *
+	 *  CPU 0: Write to a PTE for 'next'
+	 *  CPU 0: load bit 1 in mm_cpumask.  if nonzero, send IPI.
+	 *  CPU 1: set bit 1 in next's mm_cpumask
+	 *  CPU 1: load from the PTE that CPU 0 writes (implicit)
+	 *
+	 * We need to prevent an outcome in which CPU 1 observes
+	 * the new PTE value and CPU 0 observes bit 1 clear in
+	 * mm_cpumask.  (If that occurs, then the IPI will never
+	 * be sent, and CPU 0's TLB will contain a stale entry.)
+	 *
+	 * The bad outcome can occur if either CPU's load is
+	 * reordered before that CPU's store, so both CPUs must
+	 * execute full barriers to prevent this from happening.
+	 *
+	 * Thus, switch_mm needs a full barrier between the
+	 * store to mm_cpumask and any operation that could load
+	 * from next->pgd.  TLB fills are special and can happen
+	 * due to instruction fetches or for no reason at all,
+	 * and neither LOCK nor MFENCE orders them.
+	 * Fortunately, load_cr3() is serializing and gives the
+	 * ordering guarantee we need.
+	 *
+	 */
+	load_cr3(next->pgd);
+
+	/*
+	 * This gets called via leave_mm() in the idle path where RCU
+	 * functions differently.  Tracing normally uses RCU, so we have to
+	 * call the tracepoint specially here.
+	 */
+	trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
+
+	/* Stop flush ipis for the previous mm */
+	WARN_ON_ONCE(!cpumask_test_cpu(cpu, mm_cpumask(real_prev)) &&
+		     real_prev != &init_mm);
+	cpumask_clear_cpu(cpu, mm_cpumask(real_prev));
+
+	/* Load per-mm CR4 state */
+	load_mm_cr4(next);
 
 #ifdef CONFIG_MODIFY_LDT_SYSCALL
-		/*
-		 * Load the LDT, if the LDT is different.
-		 *
-		 * It's possible that prev->context.ldt doesn't match
-		 * the LDT register.  This can happen if leave_mm(prev)
-		 * was called and then modify_ldt changed
-		 * prev->context.ldt but suppressed an IPI to this CPU.
-		 * In this case, prev->context.ldt != NULL, because we
-		 * never set context.ldt to NULL while the mm still
-		 * exists.  That means that next->context.ldt !=
-		 * prev->context.ldt, because mms never share an LDT.
-		 */
-		if (unlikely(prev->context.ldt != next->context.ldt))
-			load_mm_ldt(next);
+	/*
+	 * Load the LDT, if the LDT is different.
+	 *
+	 * It's possible that prev->context.ldt doesn't match
+	 * the LDT register.  This can happen if leave_mm(prev)
+	 * was called and then modify_ldt changed
+	 * prev->context.ldt but suppressed an IPI to this CPU.
+	 * In this case, prev->context.ldt != NULL, because we
+	 * never set context.ldt to NULL while the mm still
+	 * exists.  That means that next->context.ldt !=
+	 * prev->context.ldt, because mms never share an LDT.
+	 */
+	if (unlikely(real_prev->context.ldt != next->context.ldt))
+		load_mm_ldt(next);
 #endif
-	} else {
-		this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
-		BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next);
-
-		if (!cpumask_test_cpu(cpu, mm_cpumask(next))) {
-			/*
-			 * On established mms, the mm_cpumask is only changed
-			 * from irq context, from ptep_clear_flush() while in
-			 * lazy tlb mode, and here. Irqs are blocked during
-			 * schedule, protecting us from simultaneous changes.
-			 */
-			cpumask_set_cpu(cpu, mm_cpumask(next));
-
-			/*
-			 * We were in lazy tlb mode and leave_mm disabled
-			 * tlb flush IPI delivery. We must reload CR3
-			 * to make sure to use no freed page tables.
-			 *
-			 * As above, load_cr3() is serializing and orders TLB
-			 * fills with respect to the mm_cpumask write.
-			 */
-			load_cr3(next->pgd);
-			trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
-			load_mm_cr4(next);
-			load_mm_ldt(next);
-		}
-	}
 }
 
 /*
@@ -246,7 +247,7 @@ static void flush_tlb_func_remote(void *info)
 
 	inc_irq_stat(irq_tlb_count);
 
-	if (f->mm && f->mm != this_cpu_read(cpu_tlbstate.active_mm))
+	if (f->mm && f->mm != this_cpu_read(cpu_tlbstate.loaded_mm))
 		return;
 
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
@@ -314,7 +315,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 		info.end = TLB_FLUSH_ALL;
 	}
 
-	if (mm == current->active_mm)
+	if (mm == this_cpu_read(cpu_tlbstate.loaded_mm))
 		flush_tlb_func_local(&info, TLB_LOCAL_MM_SHOOTDOWN);
 	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
 		flush_tlb_others(mm_cpumask(mm), &info);
diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 894daefd6958..c34fd91a0f5a 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1013,37 +1013,32 @@ static void xen_dup_mmap(struct mm_struct *oldmm, struct mm_struct *mm)
 	spin_unlock(&mm->page_table_lock);
 }
 
-
-#ifdef CONFIG_SMP
-/* Another cpu may still have their %cr3 pointing at the pagetable, so
-   we need to repoint it somewhere else before we can unpin it. */
-static void drop_other_mm_ref(void *info)
+static void drop_mm_ref_this_cpu(void *info)
 {
 	struct mm_struct *mm = info;
-	struct mm_struct *active_mm;
-
-	active_mm = this_cpu_read(cpu_tlbstate.active_mm);
 
-	if (active_mm == mm && this_cpu_read(cpu_tlbstate.state) != TLBSTATE_OK)
+	if (this_cpu_read(cpu_tlbstate.loaded_mm) == mm)
 		leave_mm(smp_processor_id());
 
-	/* If this cpu still has a stale cr3 reference, then make sure
-	   it has been flushed. */
+	/*
+	 * If this cpu still has a stale cr3 reference, then make sure
+	 * it has been flushed.
+	 */
 	if (this_cpu_read(xen_current_cr3) == __pa(mm->pgd))
-		load_cr3(swapper_pg_dir);
+		xen_mc_flush();
 }
 
+#ifdef CONFIG_SMP
+/*
+ * Another cpu may still have their %cr3 pointing at the pagetable, so
+ * we need to repoint it somewhere else before we can unpin it.
+ */
 static void xen_drop_mm_ref(struct mm_struct *mm)
 {
 	cpumask_var_t mask;
 	unsigned cpu;
 
-	if (current->active_mm == mm) {
-		if (current->mm == mm)
-			load_cr3(swapper_pg_dir);
-		else
-			leave_mm(smp_processor_id());
-	}
+	drop_mm_ref_this_cpu(mm);
 
 	/* Get the "official" set of cpus referring to our pagetable. */
 	if (!alloc_cpumask_var(&mask, GFP_ATOMIC)) {
@@ -1051,31 +1046,31 @@ static void xen_drop_mm_ref(struct mm_struct *mm)
 			if (!cpumask_test_cpu(cpu, mm_cpumask(mm))
 			    && per_cpu(xen_current_cr3, cpu) != __pa(mm->pgd))
 				continue;
-			smp_call_function_single(cpu, drop_other_mm_ref, mm, 1);
+			smp_call_function_single(cpu, drop_mm_ref_this_cpu, mm, 1);
 		}
 		return;
 	}
 	cpumask_copy(mask, mm_cpumask(mm));
 
-	/* It's possible that a vcpu may have a stale reference to our
-	   cr3, because its in lazy mode, and it hasn't yet flushed
-	   its set of pending hypercalls yet.  In this case, we can
-	   look at its actual current cr3 value, and force it to flush
-	   if needed. */
+	/*
+	 * It's possible that a vcpu may have a stale reference to our
+	 * cr3, because its in lazy mode, and it hasn't yet flushed
+	 * its set of pending hypercalls yet.  In this case, we can
+	 * look at its actual current cr3 value, and force it to flush
+	 * if needed.
+	 */
 	for_each_online_cpu(cpu) {
 		if (per_cpu(xen_current_cr3, cpu) == __pa(mm->pgd))
 			cpumask_set_cpu(cpu, mask);
 	}
 
-	if (!cpumask_empty(mask))
-		smp_call_function_many(mask, drop_other_mm_ref, mm, 1);
+	smp_call_function_many(mask, drop_mm_ref_this_cpu, mm, 1);
 	free_cpumask_var(mask);
 }
 #else
 static void xen_drop_mm_ref(struct mm_struct *mm)
 {
-	if (current->active_mm == mm)
-		load_cr3(swapper_pg_dir);
+	drop_mm_ref_this_cpu(mm);
 }
 #endif
 
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC 10/10] x86,kvm: Teach KVM's VMX code that CR3 isn't a constant
  2017-05-07 12:38 [RFC 00/10] x86 TLB flush cleanups, moving toward PCID support Andy Lutomirski
                   ` (8 preceding siblings ...)
  2017-05-07 12:38 ` [RFC 09/10] x86/mm: Rework lazy TLB to track the actual loaded mm Andy Lutomirski
@ 2017-05-07 12:38 ` Andy Lutomirski
  2017-05-07 13:00 ` [RFC 00/10] x86 TLB flush cleanups, moving toward PCID support Ingo Molnar
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 32+ messages in thread
From: Andy Lutomirski @ 2017-05-07 12:38 UTC (permalink / raw)
  To: X86 ML
  Cc: linux-kernel, Borislav Petkov, Linus Torvalds, Andrew Morton,
	Mel Gorman, linux-mm, Andy Lutomirski, Paolo Bonzini,
	Radim Krčmář,
	kvm, Rik van Riel, Dave Hansen, Nadav Amit, Michal Hocko,
	Arjan van de Ven

When PCID is enabled, CR3's PCID bits can change during context
switches, so KVM won't be able to treat CR3 as a per-mm constant any
more.

I structured this like the existing CR4 handling.  Under ordinary
circumstances (PCID disabled or if the current PCID and the value
that's already in the VMCS match), then we won't do an extra VMCS
write, and we'll never do an extra direct CR3 read.  The overhead
should be minimal.

I disallowed using the new helper in non-atomic context because
PCID support will cause CR3 to stop being constant in non-atomic
process context.

(Frankly, it also scares me a bit that KVM ever treated CR3 as
constant, but it looks like it was okay before.)

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: kvm@vger.kernel.org
Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/mmu_context.h | 19 +++++++++++++++++++
 arch/x86/kvm/vmx.c                 | 21 ++++++++++++++++++---
 2 files changed, 37 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 187c39470a0b..f20d7ea47095 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -266,4 +266,23 @@ static inline bool arch_vma_access_permitted(struct vm_area_struct *vma,
 	return __pkru_allows_pkey(vma_pkey(vma), write);
 }
 
+
+/*
+ * This can be used from process context to figure out what the value of
+ * CR3 is without needing to do a (slow) read_cr3().
+ *
+ * It's intended to be used for code like KVM that sneakily changes CR3
+ * and needs to restore it.  It needs to be used very carefully.
+ */
+static inline unsigned long __get_current_cr3_fast(void)
+{
+	unsigned long cr3 = __pa(this_cpu_read(cpu_tlbstate.loaded_mm)->pgd);
+
+	/* For now, be very restrictive about when this can be called. */
+	VM_WARN_ON(in_nmi() || !in_atomic());
+
+	VM_BUG_ON(cr3 != read_cr3());
+	return cr3;
+}
+
 #endif /* _ASM_X86_MMU_CONTEXT_H */
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 535cc065b844..2771235072aa 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -48,6 +48,7 @@
 #include <asm/kexec.h>
 #include <asm/apic.h>
 #include <asm/irq_remapping.h>
+#include <asm/mmu_context.h>
 
 #include "trace.h"
 #include "pmu.h"
@@ -596,6 +597,7 @@ struct vcpu_vmx {
 		int           gs_ldt_reload_needed;
 		int           fs_reload_needed;
 		u64           msr_host_bndcfgs;
+		unsigned long vmcs_host_cr3;	/* May not match real cr3 */
 		unsigned long vmcs_host_cr4;	/* May not match real cr4 */
 	} host_state;
 	struct {
@@ -5017,12 +5019,19 @@ static void vmx_set_constant_host_state(struct vcpu_vmx *vmx)
 	u32 low32, high32;
 	unsigned long tmpl;
 	struct desc_ptr dt;
-	unsigned long cr0, cr4;
+	unsigned long cr0, cr3, cr4;
 
 	cr0 = read_cr0();
 	WARN_ON(cr0 & X86_CR0_TS);
 	vmcs_writel(HOST_CR0, cr0);  /* 22.2.3 */
-	vmcs_writel(HOST_CR3, read_cr3());  /* 22.2.3  FIXME: shadow tables */
+
+	/*
+	 * Save the most likely value for this task's CR3 in the VMCS.
+	 * We can't use __get_current_cr3_fast() because we're not atomic.
+	 */
+	cr3 = read_cr3();
+	vmcs_writel(HOST_CR3, cr3);		/* 22.2.3  FIXME: shadow tables */
+	vmx->host_state.vmcs_host_cr3 = cr3;
 
 	/* Save the most likely value for this task's CR4 in the VMCS. */
 	cr4 = cr4_read_shadow();
@@ -8927,7 +8936,7 @@ static void vmx_arm_hv_timer(struct kvm_vcpu *vcpu)
 static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
-	unsigned long debugctlmsr, cr4;
+	unsigned long debugctlmsr, cr3, cr4;
 
 	/* Record the guest's net vcpu time for enforced NMI injections. */
 	if (unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked))
@@ -8953,6 +8962,12 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
 	if (test_bit(VCPU_REGS_RIP, (unsigned long *)&vcpu->arch.regs_dirty))
 		vmcs_writel(GUEST_RIP, vcpu->arch.regs[VCPU_REGS_RIP]);
 
+	cr3 = __get_current_cr3_fast();
+	if (unlikely(cr3 != vmx->host_state.vmcs_host_cr3)) {
+		vmcs_writel(HOST_CR3, cr3);
+		vmx->host_state.vmcs_host_cr3 = cr3;
+	}
+
 	cr4 = cr4_read_shadow();
 	if (unlikely(cr4 != vmx->host_state.vmcs_host_cr4)) {
 		vmcs_writel(HOST_CR4, cr4);
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [RFC 00/10] x86 TLB flush cleanups, moving toward PCID support
  2017-05-07 12:38 [RFC 00/10] x86 TLB flush cleanups, moving toward PCID support Andy Lutomirski
                   ` (9 preceding siblings ...)
  2017-05-07 12:38 ` [RFC 10/10] x86,kvm: Teach KVM's VMX code that CR3 isn't a constant Andy Lutomirski
@ 2017-05-07 13:00 ` Ingo Molnar
  2017-05-07 16:05 ` Linus Torvalds
  2017-05-08 16:36 ` Nadav Amit
  12 siblings, 0 replies; 32+ messages in thread
From: Ingo Molnar @ 2017-05-07 13:00 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, linux-kernel, Borislav Petkov, Linus Torvalds,
	Andrew Morton, Mel Gorman, linux-mm


* Andy Lutomirski <luto@kernel.org> wrote:

> As I've been working on polishing my PCID code, a major problem I've
> encountered is that there are too many x86 TLB flushing code paths and
> that they have too many inconsequential differences.  The result was
> that earlier versions of the PCID code were a colossal mess and very
> difficult to understand.
> 
> This series goes a long way toward cleaning up the mess.  With all the
> patches applied, there is a single function that contains the meat of
> the code to flush the TLB on a given CPU, and all the tlb flushing
> APIs call it for both local and remote CPUs.
> 
> This series should only adversely affect the kernel in a couple of
> minor ways:
> 
>  - It makes smp_mb() unconditional when flushing TLBs.  We used to
>    use the TLB flush itself to mostly avoid smp_mb() on the initiating
>    CPU.
> 
>  - On UP kernels, we lose the dubious optimization of inlining nerfed
>    variants of all the TLB flush APIs.  This bloats the kernel a tiny
>    bit, although it should increase performance, since the SMP
>    versions were better.
> 
> Patch 10 in here is a little bit off topic.  It's a cleanup that's
> also needed before PCID can go in, but it's not directly about
> TLB flushing.
> 
> Thoughts?

Looks really nifty! The diffstat alone appears to be worth it:

 18 files changed, 334 insertions(+), 408 deletions(-)

Will have a closer look next week.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 00/10] x86 TLB flush cleanups, moving toward PCID support
  2017-05-07 12:38 [RFC 00/10] x86 TLB flush cleanups, moving toward PCID support Andy Lutomirski
                   ` (10 preceding siblings ...)
  2017-05-07 13:00 ` [RFC 00/10] x86 TLB flush cleanups, moving toward PCID support Ingo Molnar
@ 2017-05-07 16:05 ` Linus Torvalds
  2017-05-08 16:36 ` Nadav Amit
  12 siblings, 0 replies; 32+ messages in thread
From: Linus Torvalds @ 2017-05-07 16:05 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, linux-kernel, Borislav Petkov, Andrew Morton, Mel Gorman,
	linux-mm

On Sun, May 7, 2017 at 5:38 AM, Andy Lutomirski <luto@kernel.org> wrote:
>
> This series goes a long way toward cleaning up the mess.  With all the
> patches applied, there is a single function that contains the meat of
> the code to flush the TLB on a given CPU, and all the tlb flushing
> APIs call it for both local and remote CPUs.

Looks fine to me. I'm always a bit nervous about TLB changes like this
just because any potential bugs tend to be really really hard to see
and catch, but I don't see anything wrong in the series.

                  Linus

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 03/10] x86/mm: Make the batched unmap TLB flush API more generic
  2017-05-07 12:38 ` [RFC 03/10] x86/mm: Make the batched unmap TLB flush API more generic Andy Lutomirski
@ 2017-05-08 15:34   ` Dave Hansen
  2017-05-09 13:02     ` Andy Lutomirski
  0 siblings, 1 reply; 32+ messages in thread
From: Dave Hansen @ 2017-05-08 15:34 UTC (permalink / raw)
  To: Andy Lutomirski, X86 ML
  Cc: linux-kernel, Borislav Petkov, Linus Torvalds, Andrew Morton,
	Mel Gorman, linux-mm, Rik van Riel, Nadav Amit, Michal Hocko,
	Sasha Levin

On 05/07/2017 05:38 AM, Andy Lutomirski wrote:
> diff --git a/mm/rmap.c b/mm/rmap.c
> index f6838015810f..2e568c82f477 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -579,25 +579,12 @@ void page_unlock_anon_vma_read(struct anon_vma *anon_vma)
>  void try_to_unmap_flush(void)
>  {
>  	struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
> -	int cpu;
>  
>  	if (!tlb_ubc->flush_required)
>  		return;
>  
> -	cpu = get_cpu();
> -
> -	if (cpumask_test_cpu(cpu, &tlb_ubc->cpumask)) {
> -		count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
> -		local_flush_tlb();
> -		trace_tlb_flush(TLB_LOCAL_SHOOTDOWN, TLB_FLUSH_ALL);
> -	}
> -
> -	if (cpumask_any_but(&tlb_ubc->cpumask, cpu) < nr_cpu_ids)
> -		flush_tlb_others(&tlb_ubc->cpumask, NULL, 0, TLB_FLUSH_ALL);
> -	cpumask_clear(&tlb_ubc->cpumask);
>  	tlb_ubc->flush_required = false;
>  	tlb_ubc->writable = false;
> -	put_cpu();
>  }
>  
>  /* Flush iff there are potentially writable TLB entries that can race with IO */
> @@ -613,7 +600,7 @@ static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
>  {
>  	struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
>  
> -	cpumask_or(&tlb_ubc->cpumask, &tlb_ubc->cpumask, mm_cpumask(mm));
> +	arch_tlbbatch_add_mm(&tlb_ubc->arch, mm);
>  	tlb_ubc->flush_required = true;
>  
>  	/*

Looking at this patch in isolation, how can this be safe?  It removes
TLB flushes from the generic code.  Do other patches in the series fix
this up?

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 00/10] x86 TLB flush cleanups, moving toward PCID support
  2017-05-07 12:38 [RFC 00/10] x86 TLB flush cleanups, moving toward PCID support Andy Lutomirski
                   ` (11 preceding siblings ...)
  2017-05-07 16:05 ` Linus Torvalds
@ 2017-05-08 16:36 ` Nadav Amit
  2017-05-09 12:43   ` Andy Lutomirski
  12 siblings, 1 reply; 32+ messages in thread
From: Nadav Amit @ 2017-05-08 16:36 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, linux-kernel, Borislav Petkov, Linus Torvalds,
	Andrew Morton, Mel Gorman, linux-mm


> On May 7, 2017, at 5:38 AM, Andy Lutomirski <luto@kernel.org> wrote:
> 
> As I've been working on polishing my PCID code, a major problem I've
> encountered is that there are too many x86 TLB flushing code paths and
> that they have too many inconsequential differences.  The result was
> that earlier versions of the PCID code were a colossal mess and very
> difficult to understand.
> 
> This series goes a long way toward cleaning up the mess.  With all the
> patches applied, there is a single function that contains the meat of
> the code to flush the TLB on a given CPU, and all the tlb flushing
> APIs call it for both local and remote CPUs.
> 
> This series should only adversely affect the kernel in a couple of
> minor ways:
> 
> - It makes smp_mb() unconditional when flushing TLBs.  We used to
>   use the TLB flush itself to mostly avoid smp_mb() on the initiating
>   CPU.
> 
> - On UP kernels, we lose the dubious optimization of inlining nerfed
>   variants of all the TLB flush APIs.  This bloats the kernel a tiny
>   bit, although it should increase performance, since the SMP
>   versions were better.
> 
> Patch 10 in here is a little bit off topic.  It's a cleanup that's
> also needed before PCID can go in, but it's not directly about
> TLB flushing.
> 
> Thoughts?

In general I like the changes. I needed to hack Linux TLB shootdowns for
a research project just because I could not handle the code otherwise.
I ended up doing some of changes that you have done.

I just have two general comments:

- You may want to consider merging the kernel mappings invalidation
  with the userspace mappings invalidations as well, since there are
  still code redundancies.

- Don’t expect too much from concurrent TLB invalidations. In many
  cases the IPI latency dominates the overhead from my experience.

Regards,
Nadav

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 00/10] x86 TLB flush cleanups, moving toward PCID support
  2017-05-08 16:36 ` Nadav Amit
@ 2017-05-09 12:43   ` Andy Lutomirski
  0 siblings, 0 replies; 32+ messages in thread
From: Andy Lutomirski @ 2017-05-09 12:43 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andy Lutomirski, X86 ML, linux-kernel, Borislav Petkov,
	Linus Torvalds, Andrew Morton, Mel Gorman, linux-mm

On Mon, May 8, 2017 at 9:36 AM, Nadav Amit <nadav.amit@gmail.com> wrote:
>
>> On May 7, 2017, at 5:38 AM, Andy Lutomirski <luto@kernel.org> wrote:
>>
>> As I've been working on polishing my PCID code, a major problem I've
>> encountered is that there are too many x86 TLB flushing code paths and
>> that they have too many inconsequential differences.  The result was
>> that earlier versions of the PCID code were a colossal mess and very
>> difficult to understand.
>>
>> This series goes a long way toward cleaning up the mess.  With all the
>> patches applied, there is a single function that contains the meat of
>> the code to flush the TLB on a given CPU, and all the tlb flushing
>> APIs call it for both local and remote CPUs.
>>
>> This series should only adversely affect the kernel in a couple of
>> minor ways:
>>
>> - It makes smp_mb() unconditional when flushing TLBs.  We used to
>>   use the TLB flush itself to mostly avoid smp_mb() on the initiating
>>   CPU.
>>
>> - On UP kernels, we lose the dubious optimization of inlining nerfed
>>   variants of all the TLB flush APIs.  This bloats the kernel a tiny
>>   bit, although it should increase performance, since the SMP
>>   versions were better.
>>
>> Patch 10 in here is a little bit off topic.  It's a cleanup that's
>> also needed before PCID can go in, but it's not directly about
>> TLB flushing.
>>
>> Thoughts?
>
> In general I like the changes. I needed to hack Linux TLB shootdowns for
> a research project just because I could not handle the code otherwise.
> I ended up doing some of changes that you have done.
>
> I just have two general comments:
>
> - You may want to consider merging the kernel mappings invalidation
>   with the userspace mappings invalidations as well, since there are
>   still code redundancies.
>

Hmm.  The code for kernel mappings is quite short, and I'm not sure
how well it would fit in if I tried to merge it.

> - Don’t expect too much from concurrent TLB invalidations. In many
>   cases the IPI latency dominates the overhead from my experience.
>

Fair enough.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 03/10] x86/mm: Make the batched unmap TLB flush API more generic
  2017-05-08 15:34   ` Dave Hansen
@ 2017-05-09 13:02     ` Andy Lutomirski
  2017-05-09 14:39       ` Mel Gorman
  2017-05-09 17:13       ` Dave Hansen
  0 siblings, 2 replies; 32+ messages in thread
From: Andy Lutomirski @ 2017-05-09 13:02 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, X86 ML, linux-kernel, Borislav Petkov,
	Linus Torvalds, Andrew Morton, Mel Gorman, linux-mm,
	Rik van Riel, Nadav Amit, Michal Hocko, Sasha Levin

On Mon, May 8, 2017 at 8:34 AM, Dave Hansen <dave.hansen@intel.com> wrote:
> On 05/07/2017 05:38 AM, Andy Lutomirski wrote:
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index f6838015810f..2e568c82f477 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -579,25 +579,12 @@ void page_unlock_anon_vma_read(struct anon_vma *anon_vma)
>>  void try_to_unmap_flush(void)
>>  {
>>       struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
>> -     int cpu;
>>
>>       if (!tlb_ubc->flush_required)
>>               return;
>>
>> -     cpu = get_cpu();
>> -
>> -     if (cpumask_test_cpu(cpu, &tlb_ubc->cpumask)) {
>> -             count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
>> -             local_flush_tlb();
>> -             trace_tlb_flush(TLB_LOCAL_SHOOTDOWN, TLB_FLUSH_ALL);
>> -     }
>> -
>> -     if (cpumask_any_but(&tlb_ubc->cpumask, cpu) < nr_cpu_ids)
>> -             flush_tlb_others(&tlb_ubc->cpumask, NULL, 0, TLB_FLUSH_ALL);
>> -     cpumask_clear(&tlb_ubc->cpumask);
>>       tlb_ubc->flush_required = false;
>>       tlb_ubc->writable = false;
>> -     put_cpu();
>>  }
>>
>>  /* Flush iff there are potentially writable TLB entries that can race with IO */
>> @@ -613,7 +600,7 @@ static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
>>  {
>>       struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
>>
>> -     cpumask_or(&tlb_ubc->cpumask, &tlb_ubc->cpumask, mm_cpumask(mm));
>> +     arch_tlbbatch_add_mm(&tlb_ubc->arch, mm);
>>       tlb_ubc->flush_required = true;
>>
>>       /*
>
> Looking at this patch in isolation, how can this be safe?  It removes
> TLB flushes from the generic code.  Do other patches in the series fix
> this up?

Hmm?  Unless I totally screwed this up, this patch just moves the
flushes around -- it shouldn't remove any flushes.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 03/10] x86/mm: Make the batched unmap TLB flush API more generic
  2017-05-09 13:02     ` Andy Lutomirski
@ 2017-05-09 14:39       ` Mel Gorman
  2017-05-09 17:13       ` Dave Hansen
  1 sibling, 0 replies; 32+ messages in thread
From: Mel Gorman @ 2017-05-09 14:39 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, X86 ML, linux-kernel, Borislav Petkov,
	Linus Torvalds, Andrew Morton, linux-mm, Rik van Riel,
	Nadav Amit, Michal Hocko, Sasha Levin

On Tue, May 09, 2017 at 06:02:49AM -0700, Andrew Lutomirski wrote:
> On Mon, May 8, 2017 at 8:34 AM, Dave Hansen <dave.hansen@intel.com> wrote:
> > On 05/07/2017 05:38 AM, Andy Lutomirski wrote:
> >> diff --git a/mm/rmap.c b/mm/rmap.c
> >> index f6838015810f..2e568c82f477 100644
> >> --- a/mm/rmap.c
> >> +++ b/mm/rmap.c
> >> @@ -579,25 +579,12 @@ void page_unlock_anon_vma_read(struct anon_vma *anon_vma)
> >>  void try_to_unmap_flush(void)
> >>  {
> >>       struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
> >> -     int cpu;
> >>
> >>       if (!tlb_ubc->flush_required)
> >>               return;
> >>
> >> -     cpu = get_cpu();
> >> -
> >> -     if (cpumask_test_cpu(cpu, &tlb_ubc->cpumask)) {
> >> -             count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
> >> -             local_flush_tlb();
> >> -             trace_tlb_flush(TLB_LOCAL_SHOOTDOWN, TLB_FLUSH_ALL);
> >> -     }
> >> -
> >> -     if (cpumask_any_but(&tlb_ubc->cpumask, cpu) < nr_cpu_ids)
> >> -             flush_tlb_others(&tlb_ubc->cpumask, NULL, 0, TLB_FLUSH_ALL);
> >> -     cpumask_clear(&tlb_ubc->cpumask);
> >>       tlb_ubc->flush_required = false;
> >>       tlb_ubc->writable = false;
> >> -     put_cpu();
> >>  }
> >>
> >>  /* Flush iff there are potentially writable TLB entries that can race with IO */
> >> @@ -613,7 +600,7 @@ static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
> >>  {
> >>       struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
> >>
> >> -     cpumask_or(&tlb_ubc->cpumask, &tlb_ubc->cpumask, mm_cpumask(mm));
> >> +     arch_tlbbatch_add_mm(&tlb_ubc->arch, mm);
> >>       tlb_ubc->flush_required = true;
> >>
> >>       /*
> >
> > Looking at this patch in isolation, how can this be safe?  It removes
> > TLB flushes from the generic code.  Do other patches in the series fix
> > this up?
> 
> Hmm?  Unless I totally screwed this up, this patch just moves the
> flushes around -- it shouldn't remove any flushes.

I think he's asking when or how arch_tlbbatch_flush gets called because
it doesn't happen in try_to_unmap_flush().

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 03/10] x86/mm: Make the batched unmap TLB flush API more generic
  2017-05-09 13:02     ` Andy Lutomirski
  2017-05-09 14:39       ` Mel Gorman
@ 2017-05-09 17:13       ` Dave Hansen
  2017-05-09 22:54         ` Andy Lutomirski
  1 sibling, 1 reply; 32+ messages in thread
From: Dave Hansen @ 2017-05-09 17:13 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, linux-kernel, Borislav Petkov, Linus Torvalds,
	Andrew Morton, Mel Gorman, linux-mm, Rik van Riel, Nadav Amit,
	Michal Hocko, Sasha Levin

On 05/09/2017 06:02 AM, Andy Lutomirski wrote:
> On Mon, May 8, 2017 at 8:34 AM, Dave Hansen <dave.hansen@intel.com> wrote:
>> On 05/07/2017 05:38 AM, Andy Lutomirski wrote:
>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>> index f6838015810f..2e568c82f477 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -579,25 +579,12 @@ void page_unlock_anon_vma_read(struct anon_vma *anon_vma)
>>>  void try_to_unmap_flush(void)
>>>  {
>>>       struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
>>> -     int cpu;
>>>
>>>       if (!tlb_ubc->flush_required)
>>>               return;
>>>
>>> -     cpu = get_cpu();
>>> -
>>> -     if (cpumask_test_cpu(cpu, &tlb_ubc->cpumask)) {
>>> -             count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
>>> -             local_flush_tlb();
>>> -             trace_tlb_flush(TLB_LOCAL_SHOOTDOWN, TLB_FLUSH_ALL);
>>> -     }
>>> -
>>> -     if (cpumask_any_but(&tlb_ubc->cpumask, cpu) < nr_cpu_ids)
>>> -             flush_tlb_others(&tlb_ubc->cpumask, NULL, 0, TLB_FLUSH_ALL);
>>> -     cpumask_clear(&tlb_ubc->cpumask);
>>>       tlb_ubc->flush_required = false;
>>>       tlb_ubc->writable = false;
>>> -     put_cpu();
>>>  }
>>>
>>>  /* Flush iff there are potentially writable TLB entries that can race with IO */
>>> @@ -613,7 +600,7 @@ static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
>>>  {
>>>       struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
>>>
>>> -     cpumask_or(&tlb_ubc->cpumask, &tlb_ubc->cpumask, mm_cpumask(mm));
>>> +     arch_tlbbatch_add_mm(&tlb_ubc->arch, mm);
>>>       tlb_ubc->flush_required = true;
>>>
>>>       /*
>>
>> Looking at this patch in isolation, how can this be safe?  It removes
>> TLB flushes from the generic code.  Do other patches in the series fix
>> this up?
> 
> Hmm?  Unless I totally screwed this up, this patch just moves the
> flushes around -- it shouldn't remove any flushes.

This takes a flush out of try_to_unmap_flush().  It adds code for
arch_tlbbatch_flush(), but not *calls* to arch_tlbbatch_flush() that I
can see.

I actually don't see _any_ in the whole series in a quick grepping.  Am
I just missing them?

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 09/10] x86/mm: Rework lazy TLB to track the actual loaded mm
  2017-05-07 12:38 ` [RFC 09/10] x86/mm: Rework lazy TLB to track the actual loaded mm Andy Lutomirski
@ 2017-05-09 20:41   ` Thomas Gleixner
  2017-05-09 22:54     ` Andy Lutomirski
  2017-05-10  5:57     ` Ingo Molnar
  0 siblings, 2 replies; 32+ messages in thread
From: Thomas Gleixner @ 2017-05-09 20:41 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, linux-kernel, Borislav Petkov, Linus Torvalds,
	Andrew Morton, Mel Gorman, linux-mm, Rik van Riel, Dave Hansen,
	Nadav Amit, Michal Hocko, Arjan van de Ven

On Sun, 7 May 2017, Andy Lutomirski wrote:
>  /* context.lock is held for us, so we don't need any locking. */
>  static void flush_ldt(void *current_mm)
>  {
> +	struct mm_struct *mm = current_mm;
>  	mm_context_t *pc;
>  
> -	if (current->active_mm != current_mm)
> +	if (this_cpu_read(cpu_tlbstate.loaded_mm) != current_mm)

While functional correct, this really should compare against 'mm'.

>  		return;
>  
> -	pc = &current->active_mm->context;
> +	pc = &mm->context;

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 09/10] x86/mm: Rework lazy TLB to track the actual loaded mm
  2017-05-09 20:41   ` Thomas Gleixner
@ 2017-05-09 22:54     ` Andy Lutomirski
  2017-05-10  5:57     ` Ingo Molnar
  1 sibling, 0 replies; 32+ messages in thread
From: Andy Lutomirski @ 2017-05-09 22:54 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andy Lutomirski, X86 ML, linux-kernel, Borislav Petkov,
	Linus Torvalds, Andrew Morton, Mel Gorman, linux-mm,
	Rik van Riel, Dave Hansen, Nadav Amit, Michal Hocko,
	Arjan van de Ven

On Tue, May 9, 2017 at 1:41 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Sun, 7 May 2017, Andy Lutomirski wrote:
>>  /* context.lock is held for us, so we don't need any locking. */
>>  static void flush_ldt(void *current_mm)
>>  {
>> +     struct mm_struct *mm = current_mm;
>>       mm_context_t *pc;
>>
>> -     if (current->active_mm != current_mm)
>> +     if (this_cpu_read(cpu_tlbstate.loaded_mm) != current_mm)
>
> While functional correct, this really should compare against 'mm'.
>

Fixed.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 03/10] x86/mm: Make the batched unmap TLB flush API more generic
  2017-05-09 17:13       ` Dave Hansen
@ 2017-05-09 22:54         ` Andy Lutomirski
  0 siblings, 0 replies; 32+ messages in thread
From: Andy Lutomirski @ 2017-05-09 22:54 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, X86 ML, linux-kernel, Borislav Petkov,
	Linus Torvalds, Andrew Morton, Mel Gorman, linux-mm,
	Rik van Riel, Nadav Amit, Michal Hocko, Sasha Levin

On Tue, May 9, 2017 at 10:13 AM, Dave Hansen <dave.hansen@intel.com> wrote:
> On 05/09/2017 06:02 AM, Andy Lutomirski wrote:
>> On Mon, May 8, 2017 at 8:34 AM, Dave Hansen <dave.hansen@intel.com> wrote:
>>> On 05/07/2017 05:38 AM, Andy Lutomirski wrote:
>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>> index f6838015810f..2e568c82f477 100644
>>>> --- a/mm/rmap.c
>>>> +++ b/mm/rmap.c
>>>> @@ -579,25 +579,12 @@ void page_unlock_anon_vma_read(struct anon_vma *anon_vma)
>>>>  void try_to_unmap_flush(void)
>>>>  {
>>>>       struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
>>>> -     int cpu;
>>>>
>>>>       if (!tlb_ubc->flush_required)
>>>>               return;
>>>>
>>>> -     cpu = get_cpu();
>>>> -
>>>> -     if (cpumask_test_cpu(cpu, &tlb_ubc->cpumask)) {
>>>> -             count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
>>>> -             local_flush_tlb();
>>>> -             trace_tlb_flush(TLB_LOCAL_SHOOTDOWN, TLB_FLUSH_ALL);
>>>> -     }
>>>> -
>>>> -     if (cpumask_any_but(&tlb_ubc->cpumask, cpu) < nr_cpu_ids)
>>>> -             flush_tlb_others(&tlb_ubc->cpumask, NULL, 0, TLB_FLUSH_ALL);
>>>> -     cpumask_clear(&tlb_ubc->cpumask);
>>>>       tlb_ubc->flush_required = false;
>>>>       tlb_ubc->writable = false;
>>>> -     put_cpu();
>>>>  }
>>>>
>>>>  /* Flush iff there are potentially writable TLB entries that can race with IO */
>>>> @@ -613,7 +600,7 @@ static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
>>>>  {
>>>>       struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
>>>>
>>>> -     cpumask_or(&tlb_ubc->cpumask, &tlb_ubc->cpumask, mm_cpumask(mm));
>>>> +     arch_tlbbatch_add_mm(&tlb_ubc->arch, mm);
>>>>       tlb_ubc->flush_required = true;
>>>>
>>>>       /*
>>>
>>> Looking at this patch in isolation, how can this be safe?  It removes
>>> TLB flushes from the generic code.  Do other patches in the series fix
>>> this up?
>>
>> Hmm?  Unless I totally screwed this up, this patch just moves the
>> flushes around -- it shouldn't remove any flushes.
>
> This takes a flush out of try_to_unmap_flush().  It adds code for
> arch_tlbbatch_flush(), but not *calls* to arch_tlbbatch_flush() that I
> can see.
>
> I actually don't see _any_ in the whole series in a quick grepping.  Am
> I just missing them?

Oops!  I must have stared at that function for so long that I started
seeing the invisible call.  I'll fix that.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 09/10] x86/mm: Rework lazy TLB to track the actual loaded mm
  2017-05-09 20:41   ` Thomas Gleixner
  2017-05-09 22:54     ` Andy Lutomirski
@ 2017-05-10  5:57     ` Ingo Molnar
  2017-05-10  8:19       ` Thomas Gleixner
  1 sibling, 1 reply; 32+ messages in thread
From: Ingo Molnar @ 2017-05-10  5:57 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andy Lutomirski, X86 ML, linux-kernel, Borislav Petkov,
	Linus Torvalds, Andrew Morton, Mel Gorman, linux-mm,
	Rik van Riel, Dave Hansen, Nadav Amit, Michal Hocko,
	Arjan van de Ven


* Thomas Gleixner <tglx@linutronix.de> wrote:

> On Sun, 7 May 2017, Andy Lutomirski wrote:
> >  /* context.lock is held for us, so we don't need any locking. */
> >  static void flush_ldt(void *current_mm)
> >  {
> > +	struct mm_struct *mm = current_mm;
> >  	mm_context_t *pc;
> >  
> > -	if (current->active_mm != current_mm)
> > +	if (this_cpu_read(cpu_tlbstate.loaded_mm) != current_mm)
> 
> While functional correct, this really should compare against 'mm'.
> 
> >  		return;
> >  
> > -	pc = &current->active_mm->context;
> > +	pc = &mm->context;

So this appears to be the function:

 static void flush_ldt(void *current_mm)
 {
        struct mm_struct *mm = current_mm;
        mm_context_t *pc;

        if (this_cpu_read(cpu_tlbstate.loaded_mm) != current_mm)
                return;

        pc = &mm->context;
        set_ldt(pc->ldt->entries, pc->ldt->size);
 }

why not rename 'current_mm' to 'mm' and remove the 'mm' local variable?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 09/10] x86/mm: Rework lazy TLB to track the actual loaded mm
  2017-05-10  5:57     ` Ingo Molnar
@ 2017-05-10  8:19       ` Thomas Gleixner
  2017-05-10  8:24         ` Ingo Molnar
  0 siblings, 1 reply; 32+ messages in thread
From: Thomas Gleixner @ 2017-05-10  8:19 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andy Lutomirski, X86 ML, linux-kernel, Borislav Petkov,
	Linus Torvalds, Andrew Morton, Mel Gorman, linux-mm,
	Rik van Riel, Dave Hansen, Nadav Amit, Michal Hocko,
	Arjan van de Ven

On Wed, 10 May 2017, Ingo Molnar wrote:
> 
> * Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> > On Sun, 7 May 2017, Andy Lutomirski wrote:
> > >  /* context.lock is held for us, so we don't need any locking. */
> > >  static void flush_ldt(void *current_mm)
> > >  {
> > > +	struct mm_struct *mm = current_mm;
> > >  	mm_context_t *pc;
> > >  
> > > -	if (current->active_mm != current_mm)
> > > +	if (this_cpu_read(cpu_tlbstate.loaded_mm) != current_mm)
> > 
> > While functional correct, this really should compare against 'mm'.
> > 
> > >  		return;
> > >  
> > > -	pc = &current->active_mm->context;
> > > +	pc = &mm->context;
> 
> So this appears to be the function:
> 
>  static void flush_ldt(void *current_mm)
>  {
>         struct mm_struct *mm = current_mm;
>         mm_context_t *pc;
> 
>         if (this_cpu_read(cpu_tlbstate.loaded_mm) != current_mm)
>                 return;
> 
>         pc = &mm->context;
>         set_ldt(pc->ldt->entries, pc->ldt->size);
>  }
> 
> why not rename 'current_mm' to 'mm' and remove the 'mm' local variable?

Because you cannot dereference a void pointer, i.e. &mm->context ....

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 09/10] x86/mm: Rework lazy TLB to track the actual loaded mm
  2017-05-10  8:19       ` Thomas Gleixner
@ 2017-05-10  8:24         ` Ingo Molnar
  2017-05-10 22:42           ` Andy Lutomirski
  0 siblings, 1 reply; 32+ messages in thread
From: Ingo Molnar @ 2017-05-10  8:24 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andy Lutomirski, X86 ML, linux-kernel, Borislav Petkov,
	Linus Torvalds, Andrew Morton, Mel Gorman, linux-mm,
	Rik van Riel, Dave Hansen, Nadav Amit, Michal Hocko,
	Arjan van de Ven


* Thomas Gleixner <tglx@linutronix.de> wrote:

> On Wed, 10 May 2017, Ingo Molnar wrote:
> > 
> > * Thomas Gleixner <tglx@linutronix.de> wrote:
> > 
> > > On Sun, 7 May 2017, Andy Lutomirski wrote:
> > > >  /* context.lock is held for us, so we don't need any locking. */
> > > >  static void flush_ldt(void *current_mm)
> > > >  {
> > > > +	struct mm_struct *mm = current_mm;
> > > >  	mm_context_t *pc;
> > > >  
> > > > -	if (current->active_mm != current_mm)
> > > > +	if (this_cpu_read(cpu_tlbstate.loaded_mm) != current_mm)
> > > 
> > > While functional correct, this really should compare against 'mm'.
> > > 
> > > >  		return;
> > > >  
> > > > -	pc = &current->active_mm->context;
> > > > +	pc = &mm->context;
> > 
> > So this appears to be the function:
> > 
> >  static void flush_ldt(void *current_mm)
> >  {
> >         struct mm_struct *mm = current_mm;
> >         mm_context_t *pc;
> > 
> >         if (this_cpu_read(cpu_tlbstate.loaded_mm) != current_mm)
> >                 return;
> > 
> >         pc = &mm->context;
> >         set_ldt(pc->ldt->entries, pc->ldt->size);
> >  }
> > 
> > why not rename 'current_mm' to 'mm' and remove the 'mm' local variable?
> 
> Because you cannot dereference a void pointer, i.e. &mm->context ....

Indeed, doh! The naming totally confused me. The way I'd write it is the canonical 
form for such callbacks:

	static void flush_ldt(void *data)
	{
		struct mm_struct *mm = data;

... which beyond unconfusing me would probably also have prevented any accidental 
use of the 'current_mm' callback argument.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 09/10] x86/mm: Rework lazy TLB to track the actual loaded mm
  2017-05-10  8:24         ` Ingo Molnar
@ 2017-05-10 22:42           ` Andy Lutomirski
  2017-05-11  7:13             ` Ingo Molnar
  0 siblings, 1 reply; 32+ messages in thread
From: Andy Lutomirski @ 2017-05-10 22:42 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Thomas Gleixner, Andy Lutomirski, X86 ML, linux-kernel,
	Borislav Petkov, Linus Torvalds, Andrew Morton, Mel Gorman,
	linux-mm, Rik van Riel, Dave Hansen, Nadav Amit, Michal Hocko,
	Arjan van de Ven

On Wed, May 10, 2017 at 1:24 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Thomas Gleixner <tglx@linutronix.de> wrote:
>
>> On Wed, 10 May 2017, Ingo Molnar wrote:
>> >
>> > * Thomas Gleixner <tglx@linutronix.de> wrote:
>> >
>> > > On Sun, 7 May 2017, Andy Lutomirski wrote:
>> > > >  /* context.lock is held for us, so we don't need any locking. */
>> > > >  static void flush_ldt(void *current_mm)
>> > > >  {
>> > > > +       struct mm_struct *mm = current_mm;
>> > > >         mm_context_t *pc;
>> > > >
>> > > > -       if (current->active_mm != current_mm)
>> > > > +       if (this_cpu_read(cpu_tlbstate.loaded_mm) != current_mm)
>> > >
>> > > While functional correct, this really should compare against 'mm'.
>> > >
>> > > >                 return;
>> > > >
>> > > > -       pc = &current->active_mm->context;
>> > > > +       pc = &mm->context;
>> >
>> > So this appears to be the function:
>> >
>> >  static void flush_ldt(void *current_mm)
>> >  {
>> >         struct mm_struct *mm = current_mm;
>> >         mm_context_t *pc;
>> >
>> >         if (this_cpu_read(cpu_tlbstate.loaded_mm) != current_mm)
>> >                 return;
>> >
>> >         pc = &mm->context;
>> >         set_ldt(pc->ldt->entries, pc->ldt->size);
>> >  }
>> >
>> > why not rename 'current_mm' to 'mm' and remove the 'mm' local variable?
>>
>> Because you cannot dereference a void pointer, i.e. &mm->context ....
>
> Indeed, doh! The naming totally confused me. The way I'd write it is the canonical
> form for such callbacks:
>
>         static void flush_ldt(void *data)
>         {
>                 struct mm_struct *mm = data;
>
> ... which beyond unconfusing me would probably also have prevented any accidental
> use of the 'current_mm' callback argument.
>
>

void *data and void *info both seem fairly common in the kernel.  How
about my personal favorite for non-kernel work, though: void *mm_void?
 It documents what the parameter means and avoids the confusion.

--Andy

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 09/10] x86/mm: Rework lazy TLB to track the actual loaded mm
  2017-05-10 22:42           ` Andy Lutomirski
@ 2017-05-11  7:13             ` Ingo Molnar
  2017-05-12  3:36               ` Andy Lutomirski
  0 siblings, 1 reply; 32+ messages in thread
From: Ingo Molnar @ 2017-05-11  7:13 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, X86 ML, linux-kernel, Borislav Petkov,
	Linus Torvalds, Andrew Morton, Mel Gorman, linux-mm,
	Rik van Riel, Dave Hansen, Nadav Amit, Michal Hocko,
	Arjan van de Ven


* Andy Lutomirski <luto@kernel.org> wrote:

> On Wed, May 10, 2017 at 1:24 AM, Ingo Molnar <mingo@kernel.org> wrote:
> >
> > * Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> >> On Wed, 10 May 2017, Ingo Molnar wrote:
> >> >
> >> > * Thomas Gleixner <tglx@linutronix.de> wrote:
> >> >
> >> > > On Sun, 7 May 2017, Andy Lutomirski wrote:
> >> > > >  /* context.lock is held for us, so we don't need any locking. */
> >> > > >  static void flush_ldt(void *current_mm)
> >> > > >  {
> >> > > > +       struct mm_struct *mm = current_mm;
> >> > > >         mm_context_t *pc;
> >> > > >
> >> > > > -       if (current->active_mm != current_mm)
> >> > > > +       if (this_cpu_read(cpu_tlbstate.loaded_mm) != current_mm)
> >> > >
> >> > > While functional correct, this really should compare against 'mm'.
> >> > >
> >> > > >                 return;
> >> > > >
> >> > > > -       pc = &current->active_mm->context;
> >> > > > +       pc = &mm->context;
> >> >
> >> > So this appears to be the function:
> >> >
> >> >  static void flush_ldt(void *current_mm)
> >> >  {
> >> >         struct mm_struct *mm = current_mm;
> >> >         mm_context_t *pc;
> >> >
> >> >         if (this_cpu_read(cpu_tlbstate.loaded_mm) != current_mm)
> >> >                 return;
> >> >
> >> >         pc = &mm->context;
> >> >         set_ldt(pc->ldt->entries, pc->ldt->size);
> >> >  }
> >> >
> >> > why not rename 'current_mm' to 'mm' and remove the 'mm' local variable?
> >>
> >> Because you cannot dereference a void pointer, i.e. &mm->context ....
> >
> > Indeed, doh! The naming totally confused me. The way I'd write it is the canonical
> > form for such callbacks:
> >
> >         static void flush_ldt(void *data)
> >         {
> >                 struct mm_struct *mm = data;
> >
> > ... which beyond unconfusing me would probably also have prevented any accidental
> > use of the 'current_mm' callback argument.
> >
> >
> 
> void *data and void *info both seem fairly common in the kernel.

Yes, the most common variants are:

  triton:~/tip> git grep -E 'void.*\(.*void \*.*' | grep -vE ',|\*\*|;' | cut -d\( -f2- | cut -d\) -f1 | sort | uniq -c | sort -n | tail -10
     38 void *args
     38 void *p
     39 void *ptr
     42 void *foo
     46 void *context
     55 void *addr
     69 void *priv
     95 void *info
    235 void *arg
    292 void *data

> How about my personal favorite for non-kernel work, though: void *mm_void? It 
> documents what the parameter means and avoids the confusion.

Dunno, and at the risk of painting that shed bright red it reads a bit weird to 
me: void pointers are fine and are often primary parameters - the _real_ quality 
here is not that it's void, but that's it's an opaque value passed in from a 
common callback. Note that sometimes opaque data is 'unsigned long' (such as in 
the case of timers), so it's really not the 'void' that matters.

In that sense 'data', 'arg' or 'info' seem the most readable names, as they 
clearly express the type opaqueness.

My personal favorite is double underscores prefix, i.e. 'void *__mm', which would 
clearly signal that this is something special. But this does not appear to have 
been picked up overly widely:

  triton:~/tip> git grep -E 'void.*\(.*void \*.*' | grep -vE ',|\*\*|;' | cut -d\( -f2- | cut -d\) -f1 | sort | uniq -c | sort -n | grep __
      1 void *__data
      1 void *__info
      2 void *__dev
      2 void *__tdata
      2 void *__tve
      3 void *__lock
      3 void * __user *
      3 volatile void *__p
      4 void *__map

... but either of these variants is fine to me.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 01/10] x86/mm: Reimplement flush_tlb_page() using flush_tlb_mm_range()
  2017-05-07 12:38 ` [RFC 01/10] x86/mm: Reimplement flush_tlb_page() using flush_tlb_mm_range() Andy Lutomirski
@ 2017-05-11 17:41   ` Borislav Petkov
  2017-05-12  3:35     ` Andy Lutomirski
  0 siblings, 1 reply; 32+ messages in thread
From: Borislav Petkov @ 2017-05-11 17:41 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, linux-kernel, Linus Torvalds, Andrew Morton, Mel Gorman,
	linux-mm, Rik van Riel, Dave Hansen, Nadav Amit, Michal Hocko

On Sun, May 07, 2017 at 05:38:30AM -0700, Andy Lutomirski wrote:
> flush_tlb_page() was very similar to flush_tlb_mm_range() except that
> it had a couple of issues:
> 
>  - It was missing an smp_mb() in the case where
>    current->active_mm != mm.  (This is a longstanding bug reported by
>    Nadav Amit.)
> 
>  - It was missing tracepoints and vm counter updates.
> 
> The only reason that I can see for keeping it at as a separate
> function is that it could avoid a few branches that
> flush_tlb_mm_range() needs to decide to flush just one page.  This
> hardly seems worthwhile.  If we decide we want to get rid of those
> branches again, a better way would be to introduce an
> __flush_tlb_mm_range() helper and make both flush_tlb_page() and
> flush_tlb_mm_range() use it.
> 
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: Nadav Amit <namit@vmware.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  arch/x86/include/asm/tlbflush.h |  6 +++++-
>  arch/x86/mm/tlb.c               | 27 ---------------------------
>  2 files changed, 5 insertions(+), 28 deletions(-)
> 
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index 6ed9ea469b48..5ed64cdaf536 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -307,11 +307,15 @@ static inline void flush_tlb_kernel_range(unsigned long start,
>  		flush_tlb_mm_range(vma->vm_mm, start, end, vma->vm_flags)
>  
>  extern void flush_tlb_all(void);
> -extern void flush_tlb_page(struct vm_area_struct *, unsigned long);
>  extern void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
>  				unsigned long end, unsigned long vmflag);
>  extern void flush_tlb_kernel_range(unsigned long start, unsigned long end);
>  
> +static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
> +{
> +	flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, 0);

							 VM_NONE);

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 04/10] x86/mm: Pass flush_tlb_info to flush_tlb_others() etc
  2017-05-07 12:38 ` [RFC 04/10] x86/mm: Pass flush_tlb_info to flush_tlb_others() etc Andy Lutomirski
@ 2017-05-11 20:01   ` Nadav Amit
  2017-05-12  3:41     ` Andy Lutomirski
  0 siblings, 1 reply; 32+ messages in thread
From: Nadav Amit @ 2017-05-11 20:01 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, linux-kernel, Borislav Petkov, Linus Torvalds,
	Andrew Morton, Mel Gorman, linux-mm, Rik van Riel, Dave Hansen,
	Nadav Amit, Michal Hocko, Sasha Levin


> On May 7, 2017, at 5:38 AM, Andy Lutomirski <luto@kernel.org> wrote:
> 
> @@ -243,15 +237,15 @@ static void flush_tlb_func(void *info)
> 		return;
> 	}
> 
> -	if (f->flush_end == TLB_FLUSH_ALL) {
> +	if (f->end == TLB_FLUSH_ALL) {
> 		local_flush_tlb();
> 		trace_tlb_flush(TLB_REMOTE_SHOOTDOWN, TLB_FLUSH_ALL);
> 	} else {
> 		unsigned long addr;
> 		unsigned long nr_pages =
> -			(f->flush_end - f->flush_start) / PAGE_SIZE;
> -		addr = f->flush_start;
> -		while (addr < f->flush_end) {
> +			(f->end - f->start) / PAGE_SIZE;
> +		addr = f->start;
> +		while (addr < f->end) {
> 			__flush_tlb_single(addr);
> 			addr += PAGE_SIZE;
> 		}
> @@ -260,33 +254,27 @@ static void flush_tlb_func(void *info)
> }
> 
> void native_flush_tlb_others(const struct cpumask *cpumask,
> -				 struct mm_struct *mm, unsigned long start,
> -				 unsigned long end)
> +			     const struct flush_tlb_info *info)
> {
> -	struct flush_tlb_info info;
> -
> -	info.flush_mm = mm;
> -	info.flush_start = start;
> -	info.flush_end = end;
> -
> 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
> -	if (end == TLB_FLUSH_ALL)
> +	if (info->end == TLB_FLUSH_ALL)
> 		trace_tlb_flush(TLB_REMOTE_SEND_IPI, TLB_FLUSH_ALL);
> 	else
> 		trace_tlb_flush(TLB_REMOTE_SEND_IPI,
> -				(end - start) >> PAGE_SHIFT);
> +				(info->end - info->start) >> PAGE_SHIFT);

I know it is stupid, but since you already change the code, can you make
flush_tlb_func() and native_flush_tlb_others() consistent in the way
they compute the number of pages? (either ‘>> PAGE_SHIFT’ or ‘/ PAGE_SIZE’)?

On a different topic: I do not like or actually understand why TLBSTATE_OK
is defined as 1 and not 0. The very least it would generate a better code.

Thanks,
Nadav

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 01/10] x86/mm: Reimplement flush_tlb_page() using flush_tlb_mm_range()
  2017-05-11 17:41   ` Borislav Petkov
@ 2017-05-12  3:35     ` Andy Lutomirski
  0 siblings, 0 replies; 32+ messages in thread
From: Andy Lutomirski @ 2017-05-12  3:35 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Andy Lutomirski, X86 ML, linux-kernel, Linus Torvalds,
	Andrew Morton, Mel Gorman, linux-mm, Rik van Riel, Dave Hansen,
	Nadav Amit, Michal Hocko

On Thu, May 11, 2017 at 10:41 AM, Borislav Petkov <bp@suse.de> wrote:
>> +{
>> +     flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, 0);
>
>                                                          VM_NONE);
>

Fixed, although this won't have any effect.

--Andy

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 09/10] x86/mm: Rework lazy TLB to track the actual loaded mm
  2017-05-11  7:13             ` Ingo Molnar
@ 2017-05-12  3:36               ` Andy Lutomirski
  0 siblings, 0 replies; 32+ messages in thread
From: Andy Lutomirski @ 2017-05-12  3:36 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andy Lutomirski, Thomas Gleixner, X86 ML, linux-kernel,
	Borislav Petkov, Linus Torvalds, Andrew Morton, Mel Gorman,
	linux-mm, Rik van Riel, Dave Hansen, Nadav Amit, Michal Hocko,
	Arjan van de Ven

On Thu, May 11, 2017 at 12:13 AM, Ingo Molnar <mingo@kernel.org> wrote:
> My personal favorite is double underscores prefix, i.e. 'void *__mm', which would
> clearly signal that this is something special. But this does not appear to have
> been picked up overly widely:

Nice bikeshed!  I'll use it.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 04/10] x86/mm: Pass flush_tlb_info to flush_tlb_others() etc
  2017-05-11 20:01   ` Nadav Amit
@ 2017-05-12  3:41     ` Andy Lutomirski
  0 siblings, 0 replies; 32+ messages in thread
From: Andy Lutomirski @ 2017-05-12  3:41 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andy Lutomirski, X86 ML, linux-kernel, Borislav Petkov,
	Linus Torvalds, Andrew Morton, Mel Gorman, linux-mm,
	Rik van Riel, Dave Hansen, Nadav Amit, Michal Hocko, Sasha Levin

On Thu, May 11, 2017 at 1:01 PM, Nadav Amit <nadav.amit@gmail.com> wrote:
>
>> On May 7, 2017, at 5:38 AM, Andy Lutomirski <luto@kernel.org> wrote:
>>
>> @@ -243,15 +237,15 @@ static void flush_tlb_func(void *info)
>>               return;
>>       }
>>
>> -     if (f->flush_end == TLB_FLUSH_ALL) {
>> +     if (f->end == TLB_FLUSH_ALL) {
>>               local_flush_tlb();
>>               trace_tlb_flush(TLB_REMOTE_SHOOTDOWN, TLB_FLUSH_ALL);
>>       } else {
>>               unsigned long addr;
>>               unsigned long nr_pages =
>> -                     (f->flush_end - f->flush_start) / PAGE_SIZE;
>> -             addr = f->flush_start;
>> -             while (addr < f->flush_end) {
>> +                     (f->end - f->start) / PAGE_SIZE;
>> +             addr = f->start;
>> +             while (addr < f->end) {
>>                       __flush_tlb_single(addr);
>>                       addr += PAGE_SIZE;
>>               }
>> @@ -260,33 +254,27 @@ static void flush_tlb_func(void *info)
>> }
>>
>> void native_flush_tlb_others(const struct cpumask *cpumask,
>> -                              struct mm_struct *mm, unsigned long start,
>> -                              unsigned long end)
>> +                          const struct flush_tlb_info *info)
>> {
>> -     struct flush_tlb_info info;
>> -
>> -     info.flush_mm = mm;
>> -     info.flush_start = start;
>> -     info.flush_end = end;
>> -
>>       count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
>> -     if (end == TLB_FLUSH_ALL)
>> +     if (info->end == TLB_FLUSH_ALL)
>>               trace_tlb_flush(TLB_REMOTE_SEND_IPI, TLB_FLUSH_ALL);
>>       else
>>               trace_tlb_flush(TLB_REMOTE_SEND_IPI,
>> -                             (end - start) >> PAGE_SHIFT);
>> +                             (info->end - info->start) >> PAGE_SHIFT);
>
> I know it is stupid, but since you already change the code, can you make
> flush_tlb_func() and native_flush_tlb_others() consistent in the way
> they compute the number of pages? (either ‘>> PAGE_SHIFT’ or ‘/ PAGE_SIZE’)?

I added this to the queue.

>
> On a different topic: I do not like or actually understand why TLBSTATE_OK
> is defined as 1 and not 0. The very least it would generate a better code.

Me neither.  The value 0 can happen too, e.g. for init_mm.  Maybe I'll
clean that up when I'm all done with this stuff.

>
> Thanks,
> Nadav

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2017-05-12  3:41 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-05-07 12:38 [RFC 00/10] x86 TLB flush cleanups, moving toward PCID support Andy Lutomirski
2017-05-07 12:38 ` [RFC 01/10] x86/mm: Reimplement flush_tlb_page() using flush_tlb_mm_range() Andy Lutomirski
2017-05-11 17:41   ` Borislav Petkov
2017-05-12  3:35     ` Andy Lutomirski
2017-05-07 12:38 ` [RFC 02/10] x86/mm: Reduce indentation in flush_tlb_func() Andy Lutomirski
2017-05-07 12:38 ` [RFC 03/10] x86/mm: Make the batched unmap TLB flush API more generic Andy Lutomirski
2017-05-08 15:34   ` Dave Hansen
2017-05-09 13:02     ` Andy Lutomirski
2017-05-09 14:39       ` Mel Gorman
2017-05-09 17:13       ` Dave Hansen
2017-05-09 22:54         ` Andy Lutomirski
2017-05-07 12:38 ` [RFC 04/10] x86/mm: Pass flush_tlb_info to flush_tlb_others() etc Andy Lutomirski
2017-05-11 20:01   ` Nadav Amit
2017-05-12  3:41     ` Andy Lutomirski
2017-05-07 12:38 ` [RFC 05/10] x86/mm: Change the leave_mm() condition for local TLB flushes Andy Lutomirski
2017-05-07 12:38 ` [RFC 06/10] x86/mm: Refactor flush_tlb_mm_range() to merge local and remote cases Andy Lutomirski
2017-05-07 12:38 ` [RFC 07/10] x86/mm: Use new merged flush logic in arch_tlbbatch_flush() Andy Lutomirski
2017-05-07 12:38 ` [RFC 08/10] x86/mm: Remove the UP tlbflush code; always use the formerly SMP code Andy Lutomirski
2017-05-07 12:38 ` [RFC 09/10] x86/mm: Rework lazy TLB to track the actual loaded mm Andy Lutomirski
2017-05-09 20:41   ` Thomas Gleixner
2017-05-09 22:54     ` Andy Lutomirski
2017-05-10  5:57     ` Ingo Molnar
2017-05-10  8:19       ` Thomas Gleixner
2017-05-10  8:24         ` Ingo Molnar
2017-05-10 22:42           ` Andy Lutomirski
2017-05-11  7:13             ` Ingo Molnar
2017-05-12  3:36               ` Andy Lutomirski
2017-05-07 12:38 ` [RFC 10/10] x86,kvm: Teach KVM's VMX code that CR3 isn't a constant Andy Lutomirski
2017-05-07 13:00 ` [RFC 00/10] x86 TLB flush cleanups, moving toward PCID support Ingo Molnar
2017-05-07 16:05 ` Linus Torvalds
2017-05-08 16:36 ` Nadav Amit
2017-05-09 12:43   ` Andy Lutomirski

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).