linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH v2 0/3] x86/mm/tlb: Defer TLB flushes with PTI
@ 2019-08-23 22:52 Nadav Amit
  2019-08-23 22:52 ` [RFC PATCH v2 1/3] x86/mm/tlb: Change __flush_tlb_one_user interface Nadav Amit
                   ` (4 more replies)
  0 siblings, 5 replies; 18+ messages in thread
From: Nadav Amit @ 2019-08-23 22:52 UTC (permalink / raw)
  To: Andy Lutomirski, Dave Hansen
  Cc: x86, linux-kernel, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Nadav Amit

INVPCID is considerably slower than INVLPG of a single PTE, but it is
currently used to flush PTEs in the user page-table when PTI is enabled.

Instead, it is possible to defer TLB flushes until after the user
page-tables are loaded. Preventing speculation over the TLB flushes
should keep the whole thing safe. In some cases, deferring TLB flushes
in such a way can result in more full TLB flushes, but arguably this
behavior is oftentimes beneficial.

These patches are based and evaluated on top of the concurrent
TLB-flushes v4 patch-set.

I will provide more results later, but it might be easier to look at the
time an isolated TLB flush takes. These numbers are from Skylake and
show the number of cycles that a madvise(MADV_DONTNEED) call, which
results in local TLB flushes, takes:

n_pages		concurrent	+deferred-pti		change
-------		----------	-------------		------
 1		2119		1986 			-6.7%
 10		6791		5417 			 -20%
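
[ For reference, numbers of this kind can be gathered with a small
  userspace microbenchmark along the following lines. This is an
  illustrative sketch rather than the exact harness used above; the
  single-shot RDTSC timing and the anonymous mapping are assumptions,
  and a real measurement would average many iterations. ]

/* Touch n_pages pages, then time madvise(MADV_DONTNEED), which results
 * in local TLB flushes for the zapped range. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <x86intrin.h>

int main(int argc, char **argv)
{
	size_t n_pages = argc > 1 ? strtoul(argv[1], NULL, 0) : 1;
	size_t len = n_pages * 4096;
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	uint64_t t0, t1;

	if (p == MAP_FAILED)
		return 1;

	memset(p, 1, len);			/* populate the PTEs */
	t0 = __rdtsc();
	madvise(p, len, MADV_DONTNEED);		/* triggers the TLB flushes */
	t1 = __rdtsc();
	printf("%zu pages: %llu cycles\n", n_pages,
	       (unsigned long long)(t1 - t0));
	return 0;
}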

Please let me know if I missed something that affects security or
performance.

[ Yes, I know there is another pending RFC for async TLB flushes, but I
  think it might be easier to merge this one first ]

RFC v1 -> RFC v2:
  * Wrong patches were sent before

Nadav Amit (3):
  x86/mm/tlb: Change __flush_tlb_one_user interface
  x86/mm/tlb: Defer PTI flushes
  x86/mm/tlb: Avoid deferring PTI flushes on shootdown

 arch/x86/entry/calling.h              |  52 +++++++++++-
 arch/x86/include/asm/paravirt.h       |   5 +-
 arch/x86/include/asm/paravirt_types.h |   3 +-
 arch/x86/include/asm/tlbflush.h       |  55 +++++++-----
 arch/x86/kernel/asm-offsets.c         |   3 +
 arch/x86/kernel/paravirt.c            |   7 +-
 arch/x86/mm/tlb.c                     | 117 ++++++++++++++++++++++++--
 arch/x86/xen/mmu_pv.c                 |  21 +++--
 8 files changed, 218 insertions(+), 45 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [RFC PATCH v2 1/3] x86/mm/tlb: Change __flush_tlb_one_user interface
  2019-08-23 22:52 [RFC PATCH v2 0/3] x86/mm/tlb: Defer TLB flushes with PTI Nadav Amit
@ 2019-08-23 22:52 ` Nadav Amit
  2019-08-26  7:51   ` Juergen Gross
  2019-08-23 22:52 ` [RFC PATCH v2 2/3] x86/mm/tlb: Defer PTI flushes Nadav Amit
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 18+ messages in thread
From: Nadav Amit @ 2019-08-23 22:52 UTC (permalink / raw)
  To: Andy Lutomirski, Dave Hansen
  Cc: x86, linux-kernel, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Nadav Amit, Boris Ostrovsky, Juergen Gross, Stefano Stabellini,
	xen-devel

__flush_tlb_one_user() currently flushes a single entry, and flushes it
both in the kernel and user page-tables, when PTI is enabled.

Change __flush_tlb_one_user() and related interfaces into
__flush_tlb_range() that flushes a range and does not flush the user
page-table.

This refactoring is needed for the next patch, but it makes sense
regardless and has several advantages. First, only Xen-PV, which does
not use PTI, implements the paravirtual interface of
flush_tlb_one_user(), so nothing is broken by separating the user and
kernel page-table flushes, and the interface is more intuitive.

Second, INVLPG can flush unrelated mappings, and it is also a
serializing instruction. It is therefore better to flush the entries in
a tight loop.

Third, __flush_tlb_one_kernel() currently also flushes the user
page-tables, which is not needed. The new interface allows this
redundant flush to be avoided.
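
As a concrete example of the conversion (illustrative only; the actual
call-site changes are in the diff below), a caller that used to flush a
single entry now expresses it as a one-page range:

/* Before: flush one entry, in both the kernel and, with PTI, the user
 * page-tables. */
__flush_tlb_one_user(addr);

/* After: flush a range in the current page-tables only; the user
 * page-table flush is handled separately. */
__flush_tlb_range(addr, addr + PAGE_SIZE, PAGE_SHIFT);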

Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: xen-devel@lists.xenproject.org
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/paravirt.h       |  5 ++--
 arch/x86/include/asm/paravirt_types.h |  3 ++-
 arch/x86/include/asm/tlbflush.h       | 24 +++++------------
 arch/x86/kernel/paravirt.c            |  7 ++---
 arch/x86/mm/tlb.c                     | 39 ++++++++++++++++++++++-----
 arch/x86/xen/mmu_pv.c                 | 21 +++++++++------
 6 files changed, 62 insertions(+), 37 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index bc4829c9b3f9..7347328eacd3 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -57,9 +57,10 @@ static inline void __flush_tlb_global(void)
 	PVOP_VCALL0(mmu.flush_tlb_kernel);
 }
 
-static inline void __flush_tlb_one_user(unsigned long addr)
+static inline void __flush_tlb_range(unsigned long start, unsigned long end,
+				     u8 stride_shift)
 {
-	PVOP_VCALL1(mmu.flush_tlb_one_user, addr);
+	PVOP_VCALL3(mmu.flush_tlb_range, start, end, stride_shift);
 }
 
 static inline void flush_tlb_multi(const struct cpumask *cpumask,
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 63fa751344bf..a87a5f236251 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -205,7 +205,8 @@ struct pv_mmu_ops {
 	/* TLB operations */
 	void (*flush_tlb_user)(void);
 	void (*flush_tlb_kernel)(void);
-	void (*flush_tlb_one_user)(unsigned long addr);
+	void (*flush_tlb_range)(unsigned long start, unsigned long end,
+				u8 stride_shift);
 	void (*flush_tlb_multi)(const struct cpumask *cpus,
 				const struct flush_tlb_info *info);
 
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 1f88ea410ff3..421bc82504e2 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -145,7 +145,7 @@ static inline unsigned long build_cr3_noflush(pgd_t *pgd, u16 asid)
 #else
 #define __flush_tlb() __native_flush_tlb()
 #define __flush_tlb_global() __native_flush_tlb_global()
-#define __flush_tlb_one_user(addr) __native_flush_tlb_one_user(addr)
+#define __flush_tlb_range(start, end, stride_shift) __native_flush_tlb_range(start, end, stride_shift)
 #endif
 
 struct tlb_context {
@@ -454,23 +454,13 @@ static inline void __native_flush_tlb_global(void)
 /*
  * flush one page in the user mapping
  */
-static inline void __native_flush_tlb_one_user(unsigned long addr)
+static inline void __native_flush_tlb_range(unsigned long start,
+				unsigned long end, u8 stride_shift)
 {
-	u32 loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
+	unsigned long addr;
 
-	asm volatile("invlpg (%0)" ::"r" (addr) : "memory");
-
-	if (!static_cpu_has(X86_FEATURE_PTI))
-		return;
-
-	/*
-	 * Some platforms #GP if we call invpcid(type=1/2) before CR4.PCIDE=1.
-	 * Just use invalidate_user_asid() in case we are called early.
-	 */
-	if (!this_cpu_has(X86_FEATURE_INVPCID_SINGLE))
-		invalidate_user_asid(loaded_mm_asid);
-	else
-		invpcid_flush_one(user_pcid(loaded_mm_asid), addr);
+	for (addr = start; addr < end; addr += 1ul << stride_shift)
+		asm volatile("invlpg (%0)" ::"r" (addr) : "memory");
 }
 
 /*
@@ -512,7 +502,7 @@ static inline void __flush_tlb_one_kernel(unsigned long addr)
 	 * kernel address space and for its usermode counterpart, but it does
 	 * not flush it for other address spaces.
 	 */
-	__flush_tlb_one_user(addr);
+	__flush_tlb_range(addr, addr + PAGE_SIZE, PAGE_SHIFT);
 
 	if (!static_cpu_has(X86_FEATURE_PTI))
 		return;
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 5520a04c84ba..195f5577d0d5 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -173,9 +173,10 @@ static void native_flush_tlb_global(void)
 	__native_flush_tlb_global();
 }
 
-static void native_flush_tlb_one_user(unsigned long addr)
+static void native_flush_tlb_range(unsigned long start, unsigned long end,
+				   u8 stride_shift)
 {
-	__native_flush_tlb_one_user(addr);
+	__native_flush_tlb_range(start, end, stride_shift);
 }
 
 struct static_key paravirt_steal_enabled;
@@ -358,7 +359,7 @@ struct paravirt_patch_template pv_ops = {
 	/* Mmu ops. */
 	.mmu.flush_tlb_user	= native_flush_tlb,
 	.mmu.flush_tlb_kernel	= native_flush_tlb_global,
-	.mmu.flush_tlb_one_user	= native_flush_tlb_one_user,
+	.mmu.flush_tlb_range	= native_flush_tlb_range,
 	.mmu.flush_tlb_multi	= native_flush_tlb_multi,
 	.mmu.tlb_remove_table	=
 			(void (*)(struct mmu_gather *, void *))tlb_remove_page,
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 0addc6e84126..ad15fc2c0790 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -512,6 +512,35 @@ void initialize_tlbstate_and_flush(void)
 		this_cpu_write(cpu_tlbstate.ctxs[i].ctx_id, 0);
 }
 
+static void flush_tlb_user_pt_range(u16 asid, const struct flush_tlb_info *f)
+{
+	unsigned long start, end, addr;
+	u8 stride_shift;
+
+	if (!static_cpu_has(X86_FEATURE_PTI))
+		return;
+
+	/*
+	 * Read the data into variables on the stack to prevent them from
+	 * being re-evaluated due to the invpcid memory clobber.
+	 */
+	start = f->start;
+	end = f->end;
+	stride_shift = f->stride_shift;
+
+	/*
+	 * Some platforms #GP if we call invpcid(type=1/2) before CR4.PCIDE=1.
+	 * Just use invalidate_user_asid() in case we are called early.
+	 */
+	if (!this_cpu_has(X86_FEATURE_INVPCID_SINGLE)) {
+		invalidate_user_asid(asid);
+		return;
+	}
+
+	for (addr = start; addr < end; addr += 1ul << stride_shift)
+		invpcid_flush_one(user_pcid(asid), addr);
+}
+
 /*
  * flush_tlb_func()'s memory ordering requirement is that any
  * TLB fills that happen after we flush the TLB are ordered after we
@@ -624,14 +653,12 @@ static void flush_tlb_func(void *info)
 	    f->new_tlb_gen == local_tlb_gen + 1 &&
 	    f->new_tlb_gen == mm_tlb_gen) {
 		/* Partial flush */
-		unsigned long addr = f->start;
-
 		nr_invalidate = (f->end - f->start) >> f->stride_shift;
 
-		while (addr < f->end) {
-			__flush_tlb_one_user(addr);
-			addr += 1UL << f->stride_shift;
-		}
+		__flush_tlb_range(f->start, f->end, f->stride_shift);
+
+		flush_tlb_user_pt_range(loaded_mm_asid, f);
+
 		if (local)
 			count_vm_tlb_events(NR_TLB_LOCAL_FLUSH_ONE, nr_invalidate);
 	} else {
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index 48f7c7eb4dbc..ed68657f5e77 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -1325,22 +1325,27 @@ static noinline void xen_flush_tlb(void)
 	preempt_enable();
 }
 
-static void xen_flush_tlb_one_user(unsigned long addr)
+static void xen_flush_tlb_range(unsigned long start, unsigned long end,
+				u8 stride_shift)
 {
 	struct mmuext_op *op;
 	struct multicall_space mcs;
-
-	trace_xen_mmu_flush_tlb_one_user(addr);
+	unsigned long addr;
 
 	preempt_disable();
 
 	mcs = xen_mc_entry(sizeof(*op));
 	op = mcs.args;
-	op->cmd = MMUEXT_INVLPG_LOCAL;
-	op->arg1.linear_addr = addr & PAGE_MASK;
-	MULTI_mmuext_op(mcs.mc, op, 1, NULL, DOMID_SELF);
 
-	xen_mc_issue(PARAVIRT_LAZY_MMU);
+	for (addr = start; addr < end; addr += 1ul << stride_shift) {
+		trace_xen_mmu_flush_tlb_one_user(addr);
+
+		op->cmd = MMUEXT_INVLPG_LOCAL;
+		op->arg1.linear_addr = addr & PAGE_MASK;
+		MULTI_mmuext_op(mcs.mc, op, 1, NULL, DOMID_SELF);
+
+		xen_mc_issue(PARAVIRT_LAZY_MMU);
+	}
 
 	preempt_enable();
 }
@@ -2394,7 +2399,7 @@ static const struct pv_mmu_ops xen_mmu_ops __initconst = {
 
 	.flush_tlb_user = xen_flush_tlb,
 	.flush_tlb_kernel = xen_flush_tlb,
-	.flush_tlb_one_user = xen_flush_tlb_one_user,
+	.flush_tlb_range = xen_flush_tlb_range,
 	.flush_tlb_multi = xen_flush_tlb_multi,
 	.tlb_remove_table = tlb_remove_table,
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [RFC PATCH v2 2/3] x86/mm/tlb: Defer PTI flushes
  2019-08-23 22:52 [RFC PATCH v2 0/3] x86/mm/tlb: Defer TLB flushes with PTI Nadav Amit
  2019-08-23 22:52 ` [RFC PATCH v2 1/3] x86/mm/tlb: Change __flush_tlb_one_user interface Nadav Amit
@ 2019-08-23 22:52 ` Nadav Amit
  2019-08-27 18:28   ` Dave Hansen
  2019-08-27 23:13   ` Andy Lutomirski
  2019-08-23 22:52 ` [RFC PATCH v2 3/3] x86/mm/tlb: Avoid deferring PTI flushes on shootdown Nadav Amit
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 18+ messages in thread
From: Nadav Amit @ 2019-08-23 22:52 UTC (permalink / raw)
  To: Andy Lutomirski, Dave Hansen
  Cc: x86, linux-kernel, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Nadav Amit

INVPCID is considerably slower than INVLPG of a single PTE. Using it to
flush the user page-tables when PTI is enabled therefore introduces
significant overhead.

Instead, unless page-tables are released, it is possible to defer the
flushing of the user page-tables until the code returns to userspace.
These page-tables are not in use at that point, so deferring the flushes
is not a security hazard. When CR3 is loaded as part of returning to
userspace, use INVLPG to flush the relevant PTEs. Use LFENCE to prevent
speculative execution from skipping the INVLPG.

There are some caveats, which sometimes require a full TLB flush of the
user page-tables. Some (uncommon) code-paths reload CR3 without a stack
available. If a context-switch happens while there are pending flushes,
tracking which TLB flushes are still needed is complicated and
expensive. And if there are multiple TLB flushes of different ranges
before the kernel returns to userspace, the overhead of tracking them
can exceed the benefit.

In these cases, perform a full TLB flush instead. It would be possible
to avoid the full flush in some of these cases, but the benefit of doing
so is questionable.
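
As a simplified model of the bookkeeping this patch adds (the real code
is the per-CPU flush_user_tlb_deferred() in the diff below; the types
and the ceiling value here are illustrative only), the deferral logic
amounts to the following:

#include <stdbool.h>

#define TLB_FLUSH_ALL	(~0ul)

struct pending_flush {
	bool		valid;
	unsigned long	start, end;
	unsigned int	stride_shift;
};

/* Stand-in for tlb_single_page_flush_ceiling. */
static unsigned long flush_ceiling = 33;

static void defer_flush(struct pending_flush *p, unsigned long start,
			unsigned long end, unsigned int stride_shift)
{
	if (!p->valid) {
		/* First deferred flush: just record the range. */
		*p = (struct pending_flush){ true, start, end, stride_shift };
		return;
	}
	if (p->end == TLB_FLUSH_ALL)
		return;		/* a full flush is already pending */
	if (start >= p->start && stride_shift == p->stride_shift) {
		if (end <= p->end)
			return;	/* contained in the pending range */
		if (((end - p->start) >> stride_shift) < flush_ceiling) {
			p->end = end;	/* extend the pending range */
			return;
		}
	}
	p->end = TLB_FLUSH_ALL;	/* cannot merge: fall back to a full flush */
}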

Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/entry/calling.h        | 52 ++++++++++++++++++++++--
 arch/x86/include/asm/tlbflush.h | 30 +++++++++++---
 arch/x86/kernel/asm-offsets.c   |  3 ++
 arch/x86/mm/tlb.c               | 70 +++++++++++++++++++++++++++++++++
 4 files changed, 147 insertions(+), 8 deletions(-)

diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 515c0ceeb4a3..a4d46416853d 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -6,6 +6,7 @@
 #include <asm/percpu.h>
 #include <asm/asm-offsets.h>
 #include <asm/processor-flags.h>
+#include <asm/tlbflush.h>
 
 /*
 
@@ -205,7 +206,16 @@ For 32-bit we have the following conventions - kernel is built with
 #define THIS_CPU_user_pcid_flush_mask   \
 	PER_CPU_VAR(cpu_tlbstate) + TLB_STATE_user_pcid_flush_mask
 
-.macro SWITCH_TO_USER_CR3_NOSTACK scratch_reg:req scratch_reg2:req
+#define THIS_CPU_user_flush_start	\
+	PER_CPU_VAR(cpu_tlbstate) + TLB_STATE_user_flush_start
+
+#define THIS_CPU_user_flush_end	\
+	PER_CPU_VAR(cpu_tlbstate) + TLB_STATE_user_flush_end
+
+#define THIS_CPU_user_flush_stride_shift	\
+	PER_CPU_VAR(cpu_tlbstate) + TLB_STATE_user_flush_stride_shift
+
+.macro SWITCH_TO_USER_CR3 scratch_reg:req scratch_reg2:req has_stack:req
 	ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
 	mov	%cr3, \scratch_reg
 
@@ -221,9 +231,41 @@ For 32-bit we have the following conventions - kernel is built with
 
 	/* Flush needed, clear the bit */
 	btr	\scratch_reg, THIS_CPU_user_pcid_flush_mask
+.if \has_stack
+	cmpq	$(TLB_FLUSH_ALL), THIS_CPU_user_flush_end
+	jnz	.Lpartial_flush_\@
+.Ldo_full_flush_\@:
+.endif
 	movq	\scratch_reg2, \scratch_reg
 	jmp	.Lwrcr3_pcid_\@
-
+.if \has_stack
+.Lpartial_flush_\@:
+	/* Prepare CR3 with PGD of user, and no flush set */
+	orq	$(PTI_USER_PGTABLE_AND_PCID_MASK), \scratch_reg2
+	SET_NOFLUSH_BIT \scratch_reg2
+	pushq	%rsi
+	pushq	%rbx
+	pushq	%rcx
+	movb	THIS_CPU_user_flush_stride_shift, %cl
+	movq	$1, %rbx
+	shl	%cl, %rbx
+	movq	THIS_CPU_user_flush_start, %rsi
+	movq	THIS_CPU_user_flush_end, %rcx
+	/* Load the new cr3 and flush */
+	mov	\scratch_reg2, %cr3
+.Lflush_loop_\@:
+	invlpg	(%rsi)
+	addq	%rbx, %rsi
+	cmpq	%rsi, %rcx
+	ja	.Lflush_loop_\@
+	/* Prevent speculatively skipping flushes */
+	lfence
+
+	popq	%rcx
+	popq	%rbx
+	popq	%rsi
+	jmp	.Lend_\@
+.endif
 .Lnoflush_\@:
 	movq	\scratch_reg2, \scratch_reg
 	SET_NOFLUSH_BIT \scratch_reg
@@ -239,9 +281,13 @@ For 32-bit we have the following conventions - kernel is built with
 .Lend_\@:
 .endm
 
+.macro SWITCH_TO_USER_CR3_NOSTACK scratch_reg:req scratch_reg2:req
+	SWITCH_TO_USER_CR3 scratch_reg=\scratch_reg scratch_reg2=%rax has_stack=0
+.endm
+
 .macro SWITCH_TO_USER_CR3_STACK	scratch_reg:req
 	pushq	%rax
-	SWITCH_TO_USER_CR3_NOSTACK scratch_reg=\scratch_reg scratch_reg2=%rax
+	SWITCH_TO_USER_CR3 scratch_reg=\scratch_reg scratch_reg2=%rax has_stack=1
 	popq	%rax
 .endm
 
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 421bc82504e2..da56aa3ccd07 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -2,6 +2,10 @@
 #ifndef _ASM_X86_TLBFLUSH_H
 #define _ASM_X86_TLBFLUSH_H
 
+#define TLB_FLUSH_ALL	-1UL
+
+#ifndef __ASSEMBLY__
+
 #include <linux/mm.h>
 #include <linux/sched.h>
 
@@ -222,6 +226,10 @@ struct tlb_state {
 	 * context 0.
 	 */
 	struct tlb_context ctxs[TLB_NR_DYN_ASIDS];
+
+	unsigned long user_flush_start;
+	unsigned long user_flush_end;
+	unsigned long user_flush_stride_shift;
 };
 DECLARE_PER_CPU_ALIGNED(struct tlb_state, cpu_tlbstate);
 
@@ -373,6 +381,16 @@ static inline void cr4_set_bits_and_update_boot(unsigned long mask)
 
 extern void initialize_tlbstate_and_flush(void);
 
+static unsigned long *this_cpu_user_pcid_flush_mask(void)
+{
+	return (unsigned long *)this_cpu_ptr(&cpu_tlbstate.user_pcid_flush_mask);
+}
+
+static inline void set_pending_user_pcid_flush(u16 asid)
+{
+	__set_bit(kern_pcid(asid), this_cpu_user_pcid_flush_mask());
+}
+
 /*
  * Given an ASID, flush the corresponding user ASID.  We can delay this
  * until the next time we switch to it.
@@ -395,8 +413,10 @@ static inline void invalidate_user_asid(u16 asid)
 	if (!static_cpu_has(X86_FEATURE_PTI))
 		return;
 
-	__set_bit(kern_pcid(asid),
-		  (unsigned long *)this_cpu_ptr(&cpu_tlbstate.user_pcid_flush_mask));
+	set_pending_user_pcid_flush(asid);
+
+	/* Mark the flush as global */
+	__this_cpu_write(cpu_tlbstate.user_flush_end, TLB_FLUSH_ALL);
 }
 
 /*
@@ -516,8 +536,6 @@ static inline void __flush_tlb_one_kernel(unsigned long addr)
 	invalidate_other_asid();
 }
 
-#define TLB_FLUSH_ALL	-1UL
-
 /*
  * TLB flushing:
  *
@@ -580,7 +598,7 @@ static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
 }
 
 void native_flush_tlb_multi(const struct cpumask *cpumask,
-			     const struct flush_tlb_info *info);
+			    const struct flush_tlb_info *info);
 
 static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
 {
@@ -610,4 +628,6 @@ extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
 	tlb_remove_page(tlb, (void *)(page))
 #endif
 
+#endif /* __ASSEMBLY__ */
+
 #endif /* _ASM_X86_TLBFLUSH_H */
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 5c7ee3df4d0b..bfbe393a5f46 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -95,6 +95,9 @@ static void __used common(void)
 
 	/* TLB state for the entry code */
 	OFFSET(TLB_STATE_user_pcid_flush_mask, tlb_state, user_pcid_flush_mask);
+	OFFSET(TLB_STATE_user_flush_start, tlb_state, user_flush_start);
+	OFFSET(TLB_STATE_user_flush_end, tlb_state, user_flush_end);
+	OFFSET(TLB_STATE_user_flush_stride_shift, tlb_state, user_flush_stride_shift);
 
 	/* Layout info for cpu_entry_area */
 	OFFSET(CPU_ENTRY_AREA_entry_stack, cpu_entry_area, entry_stack_page);
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index ad15fc2c0790..31260c55d597 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -407,6 +407,16 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 
 		choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);
 
+		/*
+		 * If a partial flush indication is already set, setting the end
+		 * to TLB_FLUSH_ALL marks that a full flush is needed. Do it
+		 * unconditionally, since it is benign anyway.  Alternatively,
+		 * we could conditionally flush the deferred range, but that is
+		 * likely to perform worse.
+		 */
+		if (static_cpu_has(X86_FEATURE_PTI))
+			__this_cpu_write(cpu_tlbstate.user_flush_end, TLB_FLUSH_ALL);
+
 		/* Let nmi_uaccess_okay() know that we're changing CR3. */
 		this_cpu_write(cpu_tlbstate.loaded_mm, LOADED_MM_SWITCHING);
 		barrier();
@@ -512,6 +522,58 @@ void initialize_tlbstate_and_flush(void)
 		this_cpu_write(cpu_tlbstate.ctxs[i].ctx_id, 0);
 }
 
+/*
+ * Defer the TLB flush to the point we return to userspace.
+ */
+static void flush_user_tlb_deferred(u16 asid, unsigned long start,
+				    unsigned long end, u8 stride_shift)
+{
+	unsigned long prev_start, prev_end;
+	u8 prev_stride_shift;
+
+	/*
+	 * Check if this is the first deferred flush of the user page tables.
+	 * If it is the first one, we simply record the pending flush.
+	 */
+	if (!test_bit(kern_pcid(asid), this_cpu_user_pcid_flush_mask())) {
+		__this_cpu_write(cpu_tlbstate.user_flush_start, start);
+		__this_cpu_write(cpu_tlbstate.user_flush_end, end);
+		__this_cpu_write(cpu_tlbstate.user_flush_stride_shift, stride_shift);
+		set_pending_user_pcid_flush(asid);
+		return;
+	}
+
+	prev_end = __this_cpu_read(cpu_tlbstate.user_flush_end);
+	prev_start = __this_cpu_read(cpu_tlbstate.user_flush_start);
+	prev_stride_shift = __this_cpu_read(cpu_tlbstate.user_flush_stride_shift);
+
+	/* If we already have a full pending flush, we are done */
+	if (prev_end == TLB_FLUSH_ALL)
+		return;
+
+	/*
+	 * We already have a pending flush, check if we can merge with the
+	 * previous one.
+	 */
+	if (start >= prev_start && stride_shift == prev_stride_shift) {
+		/*
+		 * Unlikely, but if the new range falls inside the old range we
+		 * are done. This check is required for correctness.
+		 */
+		if (end < prev_end)
+			return;
+
+		/* Check if a single range can also hold this flush. */
+		if ((end - prev_start) >> stride_shift < tlb_single_page_flush_ceiling) {
+			__this_cpu_write(cpu_tlbstate.user_flush_end, end);
+			return;
+		}
+	}
+
+	/* We cannot merge. Do a full flush instead */
+	__this_cpu_write(cpu_tlbstate.user_flush_end, TLB_FLUSH_ALL);
+}
+
 static void flush_tlb_user_pt_range(u16 asid, const struct flush_tlb_info *f)
 {
 	unsigned long start, end, addr;
@@ -528,6 +590,14 @@ static void flush_tlb_user_pt_range(u16 asid, const struct flush_tlb_info *f)
 	end = f->end;
 	stride_shift = f->stride_shift;
 
+	/*
+	 * We can defer flushes as long as page-tables were not freed.
+	 */
+	if (IS_ENABLED(CONFIG_X86_64) && !f->freed_tables) {
+		flush_user_tlb_deferred(asid, start, end, stride_shift);
+		return;
+	}
+
 	/*
 	 * Some platforms #GP if we call invpcid(type=1/2) before CR4.PCIDE=1.
 	 * Just use invalidate_user_asid() in case we are called early.
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [RFC PATCH v2 3/3] x86/mm/tlb: Avoid deferring PTI flushes on shootdown
  2019-08-23 22:52 [RFC PATCH v2 0/3] x86/mm/tlb: Defer TLB flushes with PTI Nadav Amit
  2019-08-23 22:52 ` [RFC PATCH v2 1/3] x86/mm/tlb: Change __flush_tlb_one_user interface Nadav Amit
  2019-08-23 22:52 ` [RFC PATCH v2 2/3] x86/mm/tlb: Defer PTI flushes Nadav Amit
@ 2019-08-23 22:52 ` Nadav Amit
  2019-08-27 23:07   ` Andy Lutomirski
  2019-08-27 18:17 ` [RFC PATCH v2 0/3] x86/mm/tlb: Defer TLB flushes with PTI Dave Hansen
  2019-09-03 15:17 ` Dave Hansen
  4 siblings, 1 reply; 18+ messages in thread
From: Nadav Amit @ 2019-08-23 22:52 UTC (permalink / raw)
  To: Andy Lutomirski, Dave Hansen
  Cc: x86, linux-kernel, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Nadav Amit

When a shootdown is initiated, the initiating CPU has cycles to burn as
it waits for the responding CPUs to receive the IPI and acknowledge it.
In these cycles it is better to flush the user page-tables using
INVPCID, instead of deferring the TLB flush.

The best way to figure out whether there are cycles to burn is arguably
to expose from the SMP layer when an acknowledgment is received.
However, this would break some abstractions.

Instead, use a simpler solution: the initiating CPU of a TLB shootdown
does not defer PTI flushes. This is not always a win relative to
deferring user page-table flushes, but it prevents a performance
regression.

Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/tlbflush.h |  1 +
 arch/x86/mm/tlb.c               | 10 +++++++++-
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index da56aa3ccd07..066b3804f876 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -573,6 +573,7 @@ struct flush_tlb_info {
 	unsigned int		initiating_cpu;
 	u8			stride_shift;
 	u8			freed_tables;
+	u8			shootdown;
 };
 
 #define local_flush_tlb() __flush_tlb()
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 31260c55d597..ba50430275d4 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -592,8 +592,13 @@ static void flush_tlb_user_pt_range(u16 asid, const struct flush_tlb_info *f)
 
 	/*
 	 * We can defer flushes as long as page-tables were not freed.
+	 *
+	 * However, if there is a shootdown the initiating CPU has cycles to
+	 * spare, while it waits for the other cores to respond. In this case,
+	 * deferring the flushing can cause overheads, so avoid it.
 	 */
-	if (IS_ENABLED(CONFIG_X86_64) && !f->freed_tables) {
+	if (IS_ENABLED(CONFIG_X86_64) && !f->freed_tables &&
+	    (!f->shootdown || f->initiating_cpu != smp_processor_id())) {
 		flush_user_tlb_deferred(asid, start, end, stride_shift);
 		return;
 	}
@@ -861,6 +866,7 @@ static struct flush_tlb_info *get_flush_tlb_info(struct mm_struct *mm,
 	info->freed_tables	= freed_tables;
 	info->new_tlb_gen	= new_tlb_gen;
 	info->initiating_cpu	= smp_processor_id();
+	info->shootdown		= false;
 
 	return info;
 }
@@ -903,6 +909,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 	 * flush_tlb_func_local() directly in this case.
 	 */
 	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
+		info->shootdown = true;
 		flush_tlb_multi(mm_cpumask(mm), info);
 	} else if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
 		lockdep_assert_irqs_enabled();
@@ -970,6 +977,7 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 	 * flush_tlb_func_local() directly in this case.
 	 */
 	if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids) {
+		info->shootdown = true;
 		flush_tlb_multi(&batch->cpumask, info);
 	} else if (cpumask_test_cpu(cpu, &batch->cpumask)) {
 		lockdep_assert_irqs_enabled();
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH v2 1/3] x86/mm/tlb: Change __flush_tlb_one_user interface
  2019-08-23 22:52 ` [RFC PATCH v2 1/3] x86/mm/tlb: Change __flush_tlb_one_user interface Nadav Amit
@ 2019-08-26  7:51   ` Juergen Gross
  2019-08-26 16:38     ` Nadav Amit
  0 siblings, 1 reply; 18+ messages in thread
From: Juergen Gross @ 2019-08-26  7:51 UTC (permalink / raw)
  To: Nadav Amit, Andy Lutomirski, Dave Hansen
  Cc: Peter Zijlstra, Stefano Stabellini, x86, Thomas Gleixner,
	xen-devel, Boris Ostrovsky, Ingo Molnar, linux-kernel

On 24.08.19 00:52, Nadav Amit wrote:
> __flush_tlb_one_user() currently flushes a single entry, and flushes it
> both in the kernel and user page-tables, when PTI is enabled.
> 
> Change __flush_tlb_one_user() and related interfaces into
> __flush_tlb_range() that flushes a range and does not flush the user
> page-table.
> 
> This refactoring is needed for the next patch, but regardless makes
> sense and has several advantages. First, only Xen-PV, which does not
> use PTI, implements the paravirtual interface of flush_tlb_one_user() so
> nothing is broken by separating the user and kernel page-table flushes,
> and the interface is more intuitive.
> 
> Second, INVLPG can flush unrelated mappings, and it is also a
> serializing instruction. It is better to have a tight loop that flushes
> the entries.
> 
> Third, currently __flush_tlb_one_kernel() also flushes the user
> page-tables, which is not needed. This allows to avoid this redundant
> flush.
> 
> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
> Cc: Juergen Gross <jgross@suse.com>
> Cc: Stefano Stabellini <sstabellini@kernel.org>
> Cc: xen-devel@lists.xenproject.org
> Signed-off-by: Nadav Amit <namit@vmware.com>
> ---
>   arch/x86/include/asm/paravirt.h       |  5 ++--
>   arch/x86/include/asm/paravirt_types.h |  3 ++-
>   arch/x86/include/asm/tlbflush.h       | 24 +++++------------
>   arch/x86/kernel/paravirt.c            |  7 ++---
>   arch/x86/mm/tlb.c                     | 39 ++++++++++++++++++++++-----
>   arch/x86/xen/mmu_pv.c                 | 21 +++++++++------
>   6 files changed, 62 insertions(+), 37 deletions(-)

...

> diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
> index 48f7c7eb4dbc..ed68657f5e77 100644
> --- a/arch/x86/xen/mmu_pv.c
> +++ b/arch/x86/xen/mmu_pv.c
> @@ -1325,22 +1325,27 @@ static noinline void xen_flush_tlb(void)
>   	preempt_enable();
>   }
>   
> -static void xen_flush_tlb_one_user(unsigned long addr)
> +static void xen_flush_tlb_range(unsigned long start, unsigned long end,
> +				u8 stride_shift)
>   {
>   	struct mmuext_op *op;
>   	struct multicall_space mcs;
> -
> -	trace_xen_mmu_flush_tlb_one_user(addr);
> +	unsigned long addr;
>   
>   	preempt_disable();
>   
>   	mcs = xen_mc_entry(sizeof(*op));
>   	op = mcs.args;
> -	op->cmd = MMUEXT_INVLPG_LOCAL;
> -	op->arg1.linear_addr = addr & PAGE_MASK;
> -	MULTI_mmuext_op(mcs.mc, op, 1, NULL, DOMID_SELF);
>   
> -	xen_mc_issue(PARAVIRT_LAZY_MMU);
> +	for (addr = start; addr < end; addr += 1ul << stride_shift) {
> +		trace_xen_mmu_flush_tlb_one_user(addr);
> +
> +		op->cmd = MMUEXT_INVLPG_LOCAL;
> +		op->arg1.linear_addr = addr & PAGE_MASK;
> +		MULTI_mmuext_op(mcs.mc, op, 1, NULL, DOMID_SELF);
> +
> +		xen_mc_issue(PARAVIRT_LAZY_MMU);
> +	}

For this kind of usage (a loop) you should:

- replace the call of xen_mc_entry() with xen_mc_batch()
- use xen_extend_mmuext_op() for each loop iteration
- call xen_mc_issue() after the loop

Additionally I'd like you to replace trace_xen_mmu_flush_tlb_one_user()
with trace_xen_mmu_flush_tlb_range() taking all three parameters and
keep it where it was (out of the loop).

The paravirt parts seem to be okay.


Juergen

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH v2 1/3] x86/mm/tlb: Change __flush_tlb_one_user interface
  2019-08-26  7:51   ` Juergen Gross
@ 2019-08-26 16:38     ` Nadav Amit
  2019-08-27  5:53       ` Juergen Gross
  0 siblings, 1 reply; 18+ messages in thread
From: Nadav Amit @ 2019-08-26 16:38 UTC (permalink / raw)
  To: Juergen Gross
  Cc: Andy Lutomirski, Dave Hansen, Peter Zijlstra, Stefano Stabellini,
	the arch/x86 maintainers, Thomas Gleixner, xen-devel,
	Boris Ostrovsky, Ingo Molnar, linux-kernel

> On Aug 26, 2019, at 12:51 AM, Juergen Gross <jgross@suse.com> wrote:
> 
> On 24.08.19 00:52, Nadav Amit wrote:
>> __flush_tlb_one_user() currently flushes a single entry, and flushes it
>> both in the kernel and user page-tables, when PTI is enabled.
>> Change __flush_tlb_one_user() and related interfaces into
>> __flush_tlb_range() that flushes a range and does not flush the user
>> page-table.
>> This refactoring is needed for the next patch, but regardless makes
>> sense and has several advantages. First, only Xen-PV, which does not
>> use PTI, implements the paravirtual interface of flush_tlb_one_user() so
>> nothing is broken by separating the user and kernel page-table flushes,
>> and the interface is more intuitive.
>> Second, INVLPG can flush unrelated mappings, and it is also a
>> serializing instruction. It is better to have a tight loop that flushes
>> the entries.
>> Third, currently __flush_tlb_one_kernel() also flushes the user
>> page-tables, which is not needed. This allows to avoid this redundant
>> flush.
>> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
>> Cc: Juergen Gross <jgross@suse.com>
>> Cc: Stefano Stabellini <sstabellini@kernel.org>
>> Cc: xen-devel@lists.xenproject.org
>> Signed-off-by: Nadav Amit <namit@vmware.com>
>> ---
>>  arch/x86/include/asm/paravirt.h       |  5 ++--
>>  arch/x86/include/asm/paravirt_types.h |  3 ++-
>>  arch/x86/include/asm/tlbflush.h       | 24 +++++------------
>>  arch/x86/kernel/paravirt.c            |  7 ++---
>>  arch/x86/mm/tlb.c                     | 39 ++++++++++++++++++++++-----
>>  arch/x86/xen/mmu_pv.c                 | 21 +++++++++------
>>  6 files changed, 62 insertions(+), 37 deletions(-)
> 
> ...
> 
>> diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
>> index 48f7c7eb4dbc..ed68657f5e77 100644
>> --- a/arch/x86/xen/mmu_pv.c
>> +++ b/arch/x86/xen/mmu_pv.c
>> @@ -1325,22 +1325,27 @@ static noinline void xen_flush_tlb(void)
>>  	preempt_enable();
>>  }
>>  -static void xen_flush_tlb_one_user(unsigned long addr)
>> +static void xen_flush_tlb_range(unsigned long start, unsigned long end,
>> +				u8 stride_shift)
>>  {
>>  	struct mmuext_op *op;
>>  	struct multicall_space mcs;
>> -
>> -	trace_xen_mmu_flush_tlb_one_user(addr);
>> +	unsigned long addr;
>>    	preempt_disable();
>>    	mcs = xen_mc_entry(sizeof(*op));
>>  	op = mcs.args;
>> -	op->cmd = MMUEXT_INVLPG_LOCAL;
>> -	op->arg1.linear_addr = addr & PAGE_MASK;
>> -	MULTI_mmuext_op(mcs.mc, op, 1, NULL, DOMID_SELF);
>>  -	xen_mc_issue(PARAVIRT_LAZY_MMU);
>> +	for (addr = start; addr < end; addr += 1ul << stride_shift) {
>> +		trace_xen_mmu_flush_tlb_one_user(addr);
>> +
>> +		op->cmd = MMUEXT_INVLPG_LOCAL;
>> +		op->arg1.linear_addr = addr & PAGE_MASK;
>> +		MULTI_mmuext_op(mcs.mc, op, 1, NULL, DOMID_SELF);
>> +
>> +		xen_mc_issue(PARAVIRT_LAZY_MMU);
>> +	}
> 
> For this kind of usage (a loop) you should:
> 
> - replace the call of xen_mc_entry() with xen_mc_batch()
> - use xen_extend_mmuext_op() for each loop iteration
> - call xen_mc_issue() after the loop
> 
> Additionally I'd like you to replace trace_xen_mmu_flush_tlb_one_user()
> with trace_xen_mmu_flush_tlb_range() taking all three parameters and
> keep it where it was (out of the loop).
> 
> The paravirt parts seem to be okay.

Thanks for the quick response. I don’t think the preempt_disable/enable() is
needed in this case, but I kept them. Does the following match what you had
in mind?

---
 arch/x86/xen/mmu_pv.c      | 25 ++++++++++++++-----------
 include/trace/events/xen.h | 18 ++++++++++++------
 2 files changed, 26 insertions(+), 17 deletions(-)

diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index 48f7c7eb4dbc..faa01591df36 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -1325,20 +1325,23 @@ static noinline void xen_flush_tlb(void)
 	preempt_enable();
 }
 
-static void xen_flush_tlb_one_user(unsigned long addr)
+static void xen_flush_tlb_range(unsigned long start, unsigned long end,
+				u8 stride_shift)
 {
-	struct mmuext_op *op;
-	struct multicall_space mcs;
-
-	trace_xen_mmu_flush_tlb_one_user(addr);
+	struct mmuext_op op;
+	unsigned long addr;
 
 	preempt_disable();
 
-	mcs = xen_mc_entry(sizeof(*op));
-	op = mcs.args;
-	op->cmd = MMUEXT_INVLPG_LOCAL;
-	op->arg1.linear_addr = addr & PAGE_MASK;
-	MULTI_mmuext_op(mcs.mc, op, 1, NULL, DOMID_SELF);
+	xen_mc_batch();
+	op.cmd = MMUEXT_INVLPG_LOCAL;
+
+	trace_xen_mmu_flush_tlb_range(start, end, stride_shift);
+
+	for (addr = start; addr < end; addr += 1ul << stride_shift) {
+		op.arg1.linear_addr = addr & PAGE_MASK;
+		xen_extend_mmuext_op(&op);
+	}
 
 	xen_mc_issue(PARAVIRT_LAZY_MMU);
 
@@ -2394,7 +2397,7 @@ static const struct pv_mmu_ops xen_mmu_ops __initconst = {
 
 	.flush_tlb_user = xen_flush_tlb,
 	.flush_tlb_kernel = xen_flush_tlb,
-	.flush_tlb_one_user = xen_flush_tlb_one_user,
+	.flush_tlb_range = xen_flush_tlb_range,
 	.flush_tlb_multi = xen_flush_tlb_multi,
 	.tlb_remove_table = tlb_remove_table,
 
diff --git a/include/trace/events/xen.h b/include/trace/events/xen.h
index 546022acf160..32bf0a94b4d8 100644
--- a/include/trace/events/xen.h
+++ b/include/trace/events/xen.h
@@ -352,14 +352,20 @@ DECLARE_EVENT_CLASS(xen_mmu_pgd,
 DEFINE_XEN_MMU_PGD_EVENT(xen_mmu_pgd_pin);
 DEFINE_XEN_MMU_PGD_EVENT(xen_mmu_pgd_unpin);
 
-TRACE_EVENT(xen_mmu_flush_tlb_one_user,
-	    TP_PROTO(unsigned long addr),
-	    TP_ARGS(addr),
+TRACE_EVENT(xen_mmu_flush_tlb_range,
+	    TP_PROTO(unsigned long start, unsigned long end,
+		     unsigned char stride_shift),
+	    TP_ARGS(start, end, stride_shift),
 	    TP_STRUCT__entry(
-		    __field(unsigned long, addr)
+		    __field(unsigned long, start)
+		    __field(unsigned long, end)
+		    __field(unsigned char, stride_shift)
 		    ),
-	    TP_fast_assign(__entry->addr = addr),
-	    TP_printk("addr %lx", __entry->addr)
+	    TP_fast_assign(__entry->start = start;
+			   __entry->end = end;
+			   __entry->stride_shift = stride_shift),
+	    TP_printk("start %lx end %lx stride_shift %d", __entry->start,
+		      __entry->end, __entry->stride_shift)
 	);
 
 TRACE_EVENT(xen_mmu_flush_tlb_multi,
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH v2 1/3] x86/mm/tlb: Change __flush_tlb_one_user interface
  2019-08-26 16:38     ` Nadav Amit
@ 2019-08-27  5:53       ` Juergen Gross
  0 siblings, 0 replies; 18+ messages in thread
From: Juergen Gross @ 2019-08-27  5:53 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Peter Zijlstra, Andy Lutomirski, Stefano Stabellini,
	the arch/x86 maintainers, Thomas Gleixner, Dave Hansen,
	xen-devel, Boris Ostrovsky, Ingo Molnar, linux-kernel

On 26.08.19 18:38, Nadav Amit wrote:
>> On Aug 26, 2019, at 12:51 AM, Juergen Gross <jgross@suse.com> wrote:
>>
>> On 24.08.19 00:52, Nadav Amit wrote:
>>> __flush_tlb_one_user() currently flushes a single entry, and flushes it
>>> both in the kernel and user page-tables, when PTI is enabled.
>>> Change __flush_tlb_one_user() and related interfaces into
>>> __flush_tlb_range() that flushes a range and does not flush the user
>>> page-table.
>>> This refactoring is needed for the next patch, but regardless makes
>>> sense and has several advantages. First, only Xen-PV, which does not
>>> use PTI, implements the paravirtual interface of flush_tlb_one_user() so
>>> nothing is broken by separating the user and kernel page-table flushes,
>>> and the interface is more intuitive.
>>> Second, INVLPG can flush unrelated mappings, and it is also a
>>> serializing instruction. It is better to have a tight loop that flushes
>>> the entries.
>>> Third, currently __flush_tlb_one_kernel() also flushes the user
>>> page-tables, which is not needed. This allows to avoid this redundant
>>> flush.
>>> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
>>> Cc: Juergen Gross <jgross@suse.com>
>>> Cc: Stefano Stabellini <sstabellini@kernel.org>
>>> Cc: xen-devel@lists.xenproject.org
>>> Signed-off-by: Nadav Amit <namit@vmware.com>
>>> ---
>>>   arch/x86/include/asm/paravirt.h       |  5 ++--
>>>   arch/x86/include/asm/paravirt_types.h |  3 ++-
>>>   arch/x86/include/asm/tlbflush.h       | 24 +++++------------
>>>   arch/x86/kernel/paravirt.c            |  7 ++---
>>>   arch/x86/mm/tlb.c                     | 39 ++++++++++++++++++++++-----
>>>   arch/x86/xen/mmu_pv.c                 | 21 +++++++++------
>>>   6 files changed, 62 insertions(+), 37 deletions(-)
>>
>> ...
>>
>>> diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
>>> index 48f7c7eb4dbc..ed68657f5e77 100644
>>> --- a/arch/x86/xen/mmu_pv.c
>>> +++ b/arch/x86/xen/mmu_pv.c
>>> @@ -1325,22 +1325,27 @@ static noinline void xen_flush_tlb(void)
>>>   	preempt_enable();
>>>   }
>>>   -static void xen_flush_tlb_one_user(unsigned long addr)
>>> +static void xen_flush_tlb_range(unsigned long start, unsigned long end,
>>> +				u8 stride_shift)
>>>   {
>>>   	struct mmuext_op *op;
>>>   	struct multicall_space mcs;
>>> -
>>> -	trace_xen_mmu_flush_tlb_one_user(addr);
>>> +	unsigned long addr;
>>>     	preempt_disable();
>>>     	mcs = xen_mc_entry(sizeof(*op));
>>>   	op = mcs.args;
>>> -	op->cmd = MMUEXT_INVLPG_LOCAL;
>>> -	op->arg1.linear_addr = addr & PAGE_MASK;
>>> -	MULTI_mmuext_op(mcs.mc, op, 1, NULL, DOMID_SELF);
>>>   -	xen_mc_issue(PARAVIRT_LAZY_MMU);
>>> +	for (addr = start; addr < end; addr += 1ul << stride_shift) {
>>> +		trace_xen_mmu_flush_tlb_one_user(addr);
>>> +
>>> +		op->cmd = MMUEXT_INVLPG_LOCAL;
>>> +		op->arg1.linear_addr = addr & PAGE_MASK;
>>> +		MULTI_mmuext_op(mcs.mc, op, 1, NULL, DOMID_SELF);
>>> +
>>> +		xen_mc_issue(PARAVIRT_LAZY_MMU);
>>> +	}
>>
>> For this kind of usage (a loop) you should:
>>
>> - replace the call of xen_mc_entry() with xen_mc_batch()
>> - use xen_extend_mmuext_op() for each loop iteration
>> - call xen_mc_issue() after the loop
>>
>> Additionally I'd like you to replace trace_xen_mmu_flush_tlb_one_user()
>> with trace_xen_mmu_flush_tlb_range() taking all three parameters and
>> keep it where it was (out of the loop).
>>
>> The paravirt parts seem to be okay.
> 
> Thanks for the quick response. I don’t think the preempt_disable/enable() is
> needed in this case, but I kept them. Does the following match what you had
> in mind?

Yes, it does.

BTW, preempt_disable/enable() is needed as xen_mc_batch() ...
xen_mc_issue() is using a percpu buffer.


Juergen

> 
> ---
>   arch/x86/xen/mmu_pv.c      | 25 ++++++++++++++-----------
>   include/trace/events/xen.h | 18 ++++++++++++------
>   2 files changed, 26 insertions(+), 17 deletions(-)
> 
> diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
> index 48f7c7eb4dbc..faa01591df36 100644
> --- a/arch/x86/xen/mmu_pv.c
> +++ b/arch/x86/xen/mmu_pv.c
> @@ -1325,20 +1325,23 @@ static noinline void xen_flush_tlb(void)
>   	preempt_enable();
>   }
>   
> -static void xen_flush_tlb_one_user(unsigned long addr)
> +static void xen_flush_tlb_range(unsigned long start, unsigned long end,
> +				u8 stride_shift)
>   {
> -	struct mmuext_op *op;
> -	struct multicall_space mcs;
> -
> -	trace_xen_mmu_flush_tlb_one_user(addr);
> +	struct mmuext_op op;
> +	unsigned long addr;
>   
>   	preempt_disable();
>   
> -	mcs = xen_mc_entry(sizeof(*op));
> -	op = mcs.args;
> -	op->cmd = MMUEXT_INVLPG_LOCAL;
> -	op->arg1.linear_addr = addr & PAGE_MASK;
> -	MULTI_mmuext_op(mcs.mc, op, 1, NULL, DOMID_SELF);
> +	xen_mc_batch();
> +	op.cmd = MMUEXT_INVLPG_LOCAL;
> +
> +	trace_xen_mmu_flush_tlb_range(start, end, stride_shift);
> +
> +	for (addr = start; addr < end; addr += 1ul << stride_shift) {
> +		op.arg1.linear_addr = addr & PAGE_MASK;
> +		xen_extend_mmuext_op(&op);
> +	}
>   
>   	xen_mc_issue(PARAVIRT_LAZY_MMU);
>   
> @@ -2394,7 +2397,7 @@ static const struct pv_mmu_ops xen_mmu_ops __initconst = {
>   
>   	.flush_tlb_user = xen_flush_tlb,
>   	.flush_tlb_kernel = xen_flush_tlb,
> -	.flush_tlb_one_user = xen_flush_tlb_one_user,
> +	.flush_tlb_range = xen_flush_tlb_range,
>   	.flush_tlb_multi = xen_flush_tlb_multi,
>   	.tlb_remove_table = tlb_remove_table,
>   
> diff --git a/include/trace/events/xen.h b/include/trace/events/xen.h
> index 546022acf160..32bf0a94b4d8 100644
> --- a/include/trace/events/xen.h
> +++ b/include/trace/events/xen.h
> @@ -352,14 +352,20 @@ DECLARE_EVENT_CLASS(xen_mmu_pgd,
>   DEFINE_XEN_MMU_PGD_EVENT(xen_mmu_pgd_pin);
>   DEFINE_XEN_MMU_PGD_EVENT(xen_mmu_pgd_unpin);
>   
> -TRACE_EVENT(xen_mmu_flush_tlb_one_user,
> -	    TP_PROTO(unsigned long addr),
> -	    TP_ARGS(addr),
> +TRACE_EVENT(xen_mmu_flush_tlb_range,
> +	    TP_PROTO(unsigned long start, unsigned long end,
> +		     unsigned char stride_shift),
> +	    TP_ARGS(start, end, stride_shift),
>   	    TP_STRUCT__entry(
> -		    __field(unsigned long, addr)
> +		    __field(unsigned long, start)
> +		    __field(unsigned long, end)
> +		    __field(unsigned char, stride_shift)
>   		    ),
> -	    TP_fast_assign(__entry->addr = addr),
> -	    TP_printk("addr %lx", __entry->addr)
> +	    TP_fast_assign(__entry->start = start;
> +			   __entry->end = end;
> +			   __entry->stride_shift = stride_shift),
> +	    TP_printk("start %lx end %lx stride_shift %d", __entry->start,
> +		      __entry->end, __entry->stride_shift)
>   	);
>   
>   TRACE_EVENT(xen_mmu_flush_tlb_multi,
> 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH v2 0/3] x86/mm/tlb: Defer TLB flushes with PTI
  2019-08-23 22:52 [RFC PATCH v2 0/3] x86/mm/tlb: Defer TLB flushes with PTI Nadav Amit
                   ` (2 preceding siblings ...)
  2019-08-23 22:52 ` [RFC PATCH v2 3/3] x86/mm/tlb: Avoid deferring PTI flushes on shootdown Nadav Amit
@ 2019-08-27 18:17 ` Dave Hansen
  2019-09-03 15:17 ` Dave Hansen
  4 siblings, 0 replies; 18+ messages in thread
From: Dave Hansen @ 2019-08-27 18:17 UTC (permalink / raw)
  To: Nadav Amit, Andy Lutomirski, Dave Hansen
  Cc: x86, linux-kernel, Peter Zijlstra, Thomas Gleixner, Ingo Molnar

On 8/23/19 3:52 PM, Nadav Amit wrote:
> n_pages		concurrent	+deferred-pti		change
> -------		----------	-------------		------
>  1		2119		1986 			-6.7%

Is this implying that a single-page INVPCID is 133 cycles slower than
INVLPG?  That seems a bit larger than I'd expect.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH v2 2/3] x86/mm/tlb: Defer PTI flushes
  2019-08-23 22:52 ` [RFC PATCH v2 2/3] x86/mm/tlb: Defer PTI flushes Nadav Amit
@ 2019-08-27 18:28   ` Dave Hansen
  2019-08-27 19:46     ` Nadav Amit
  2019-08-27 23:13   ` Andy Lutomirski
  1 sibling, 1 reply; 18+ messages in thread
From: Dave Hansen @ 2019-08-27 18:28 UTC (permalink / raw)
  To: Nadav Amit, Andy Lutomirski, Dave Hansen
  Cc: x86, linux-kernel, Peter Zijlstra, Thomas Gleixner, Ingo Molnar

On 8/23/19 3:52 PM, Nadav Amit wrote:
> INVPCID is considerably slower than INVLPG of a single PTE. Using it to
> flush the user page-tables when PTI is enabled therefore introduces
> significant overhead.

I'm not sure this is worth all the churn, especially in the entry code.
 For large flushes (> tlb_single_page_flush_ceiling), we don't do
INVPCIDs in the first place.

I'd really want to understand what the heck is going on that makes
INVPCID so slow, first.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH v2 2/3] x86/mm/tlb: Defer PTI flushes
  2019-08-27 18:28   ` Dave Hansen
@ 2019-08-27 19:46     ` Nadav Amit
  0 siblings, 0 replies; 18+ messages in thread
From: Nadav Amit @ 2019-08-27 19:46 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, Dave Hansen, the arch/x86 maintainers, LKML,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar

> On Aug 27, 2019, at 11:28 AM, Dave Hansen <dave.hansen@intel.com> wrote:
> 
> On 8/23/19 3:52 PM, Nadav Amit wrote:
>> INVPCID is considerably slower than INVLPG of a single PTE. Using it to
>> flush the user page-tables when PTI is enabled therefore introduces
>> significant overhead.
> 
> I'm not sure this is worth all the churn, especially in the entry code.
> For large flushes (> tlb_single_page_flush_ceiling), we don't do
> INVPCIDs in the first place.

It is possible to jump from flush_tlb_func() into the trampoline page,
instead of flushing the TLB in the entry code. However, that induces higher
overhead (switching CR3s), so it is only useful when multiple TLB entries
are flushed at once. It also prevents exploiting opportunities to promote
individual entry flushes into a full TLB flush when multiple flushes are
issued or when a context switch takes place before returning to userspace.

There are cases/workloads that flush multiple (but not too many) TLB entries
on every syscall, for instance issuing msync() or running the Apache
webserver. So I am not sure that tlb_single_page_flush_ceiling saves the
day. Besides, you may want to recalibrate (lower)
tlb_single_page_flush_ceiling when PTI is used.

> I'd really want to understand what the heck is going on that makes
> INVPCID so slow, first.

INVPCID-single is slow (even more than the 133 cycles slower than INVLPG
that you mentioned; I don’t have the numbers in front of me). I thought this
was a known fact, although, obviously, it does not make much sense.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH v2 3/3] x86/mm/tlb: Avoid deferring PTI flushes on shootdown
  2019-08-23 22:52 ` [RFC PATCH v2 3/3] x86/mm/tlb: Avoid deferring PTI flushes on shootdown Nadav Amit
@ 2019-08-27 23:07   ` Andy Lutomirski
  2019-08-27 23:57     ` Nadav Amit
  0 siblings, 1 reply; 18+ messages in thread
From: Andy Lutomirski @ 2019-08-27 23:07 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andy Lutomirski, Dave Hansen, X86 ML, LKML, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar

On Fri, Aug 23, 2019 at 11:13 PM Nadav Amit <namit@vmware.com> wrote:
>
> When a shootdown is initiated, the initiating CPU has cycles to burn as
> it waits for the responding CPUs to receive the IPI and acknowledge it.
> In these cycles it is better to flush the user page-tables using
> INVPCID, instead of deferring the TLB flush.
>
> The best way to figure out whether there are cycles to burn is arguably
> to expose from the SMP layer when an acknowledgment is received.
> However, this would break some abstractions.
>
> Instead, use a simpler solution: the initiating CPU of a TLB shootdown
> would not defer PTI flushes. It is not always a win, relatively to
> deferring user page-table flushes, but it prevents performance
> regression.
>
> Signed-off-by: Nadav Amit <namit@vmware.com>
> ---
>  arch/x86/include/asm/tlbflush.h |  1 +
>  arch/x86/mm/tlb.c               | 10 +++++++++-
>  2 files changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index da56aa3ccd07..066b3804f876 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -573,6 +573,7 @@ struct flush_tlb_info {
>         unsigned int            initiating_cpu;
>         u8                      stride_shift;
>         u8                      freed_tables;
> +       u8                      shootdown;

I find the name "shootdown" to be confusing.  How about "more_than_one_cpu"?

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH v2 2/3] x86/mm/tlb: Defer PTI flushes
  2019-08-23 22:52 ` [RFC PATCH v2 2/3] x86/mm/tlb: Defer PTI flushes Nadav Amit
  2019-08-27 18:28   ` Dave Hansen
@ 2019-08-27 23:13   ` Andy Lutomirski
  2019-08-27 23:55     ` Nadav Amit
  1 sibling, 1 reply; 18+ messages in thread
From: Andy Lutomirski @ 2019-08-27 23:13 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andy Lutomirski, Dave Hansen, X86 ML, LKML, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar

On Fri, Aug 23, 2019 at 11:13 PM Nadav Amit <namit@vmware.com> wrote:
>
> INVPCID is considerably slower than INVLPG of a single PTE. Using it to
> flush the user page-tables when PTI is enabled therefore introduces
> significant overhead.
>
> Instead, unless page-tables are released, it is possible to defer the
> flushing of the user page-tables until the time the code returns to
> userspace. These page tables are not in use, so deferring them is not a
> security hazard.

I agree and, in fact, I argued against ever using INVPCID in the
original PTI code.

However, I don't see what freeing page tables has to do with this.  If
the CPU can actually do speculative page walks based on the contents
of non-current-PCID TLB entries, then we have major problems, since we
don't actively flush the TLB for non-running mms at all.

I suppose that, if we free a page table, then we can't activate the
PCID by writing to CR3 before flushing things.  But we can still defer
the flush and just set the flush bit when we write to CR3.

--Andy

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH v2 2/3] x86/mm/tlb: Defer PTI flushes
  2019-08-27 23:13   ` Andy Lutomirski
@ 2019-08-27 23:55     ` Nadav Amit
  2019-08-28  0:30       ` Andy Lutomirski
  0 siblings, 1 reply; 18+ messages in thread
From: Nadav Amit @ 2019-08-27 23:55 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, X86 ML, LKML, Peter Zijlstra, Thomas Gleixner, Ingo Molnar

> On Aug 27, 2019, at 4:13 PM, Andy Lutomirski <luto@kernel.org> wrote:
> 
> On Fri, Aug 23, 2019 at 11:13 PM Nadav Amit <namit@vmware.com> wrote:
>> INVPCID is considerably slower than INVLPG of a single PTE. Using it to
>> flush the user page-tables when PTI is enabled therefore introduces
>> significant overhead.
>> 
>> Instead, unless page-tables are released, it is possible to defer the
>> flushing of the user page-tables until the time the code returns to
>> userspace. These page tables are not in use, so deferring them is not a
>> security hazard.
> 
> I agree and, in fact, I argued against ever using INVPCID in the
> original PTI code.
> 
> However, I don't see what freeing page tables has to do with this.  If
> the CPU can actually do speculative page walks based on the contents
> of non-current-PCID TLB entries, then we have major problems, since we
> don't actively flush the TLB for non-running mms at all.

That was not my concern.

> 
> I suppose that, if we free a page table, then we can't activate the
> PCID by writing to CR3 before flushing things.  But we can still defer
> the flush and just set the flush bit when we write to CR3.

This was my concern. I can change the behavior so the code would flush the
whole TLB instead. I just tried not to change the existing behavior too
much.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH v2 3/3] x86/mm/tlb: Avoid deferring PTI flushes on shootdown
  2019-08-27 23:07   ` Andy Lutomirski
@ 2019-08-27 23:57     ` Nadav Amit
  2019-08-28  0:30       ` Andy Lutomirski
  0 siblings, 1 reply; 18+ messages in thread
From: Nadav Amit @ 2019-08-27 23:57 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, X86 ML, LKML, Peter Zijlstra, Thomas Gleixner, Ingo Molnar

> On Aug 27, 2019, at 4:07 PM, Andy Lutomirski <luto@kernel.org> wrote:
> 
> On Fri, Aug 23, 2019 at 11:13 PM Nadav Amit <namit@vmware.com> wrote:
>> When a shootdown is initiated, the initiating CPU has cycles to burn as
>> it waits for the responding CPUs to receive the IPI and acknowledge it.
>> In these cycles it is better to flush the user page-tables using
>> INVPCID, instead of deferring the TLB flush.
>> 
>> The best way to figure out whether there are cycles to burn is arguably
>> to expose from the SMP layer when an acknowledgment is received.
>> However, this would break some abstractions.
>> 
>> Instead, use a simpler solution: the initiating CPU of a TLB shootdown
>> would not defer PTI flushes. It is not always a win, relatively to
>> deferring user page-table flushes, but it prevents performance
>> regression.
>> 
>> Signed-off-by: Nadav Amit <namit@vmware.com>
>> ---
>> arch/x86/include/asm/tlbflush.h |  1 +
>> arch/x86/mm/tlb.c               | 10 +++++++++-
>> 2 files changed, 10 insertions(+), 1 deletion(-)
>> 
>> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
>> index da56aa3ccd07..066b3804f876 100644
>> --- a/arch/x86/include/asm/tlbflush.h
>> +++ b/arch/x86/include/asm/tlbflush.h
>> @@ -573,6 +573,7 @@ struct flush_tlb_info {
>>        unsigned int            initiating_cpu;
>>        u8                      stride_shift;
>>        u8                      freed_tables;
>> +       u8                      shootdown;
> 
> I find the name "shootdown" to be confusing.  How about "more_than_one_cpu”?

I think the current semantic is more of “includes remote cpus”. How about
calling it “local_only”, and negating its value?

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH v2 3/3] x86/mm/tlb: Avoid deferring PTI flushes on shootdown
  2019-08-27 23:57     ` Nadav Amit
@ 2019-08-28  0:30       ` Andy Lutomirski
  0 siblings, 0 replies; 18+ messages in thread
From: Andy Lutomirski @ 2019-08-28  0:30 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andy Lutomirski, Dave Hansen, X86 ML, LKML, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar

On Tue, Aug 27, 2019 at 4:57 PM Nadav Amit <namit@vmware.com> wrote:
>
> > On Aug 27, 2019, at 4:07 PM, Andy Lutomirski <luto@kernel.org> wrote:
> >
> > On Fri, Aug 23, 2019 at 11:13 PM Nadav Amit <namit@vmware.com> wrote:
> >> When a shootdown is initiated, the initiating CPU has cycles to burn as
> >> it waits for the responding CPUs to receive the IPI and acknowledge it.
> >> In these cycles it is better to flush the user page-tables using
> >> INVPCID, instead of deferring the TLB flush.
> >>
> >> The best way to figure out whether there are cycles to burn is arguably
> >> to expose from the SMP layer when an acknowledgment is received.
> >> However, this would break some abstractions.
> >>
> >> Instead, use a simpler solution: the initiating CPU of a TLB shootdown
> >> would not defer PTI flushes. It is not always a win relative to
> >> deferring user page-table flushes, but it prevents a performance
> >> regression.
> >>
> >> Signed-off-by: Nadav Amit <namit@vmware.com>
> >> ---
> >> arch/x86/include/asm/tlbflush.h |  1 +
> >> arch/x86/mm/tlb.c               | 10 +++++++++-
> >> 2 files changed, 10 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> >> index da56aa3ccd07..066b3804f876 100644
> >> --- a/arch/x86/include/asm/tlbflush.h
> >> +++ b/arch/x86/include/asm/tlbflush.h
> >> @@ -573,6 +573,7 @@ struct flush_tlb_info {
> >>        unsigned int            initiating_cpu;
> >>        u8                      stride_shift;
> >>        u8                      freed_tables;
> >> +       u8                      shootdown;
> >
> > I find the name "shootdown" to be confusing.  How about "more_than_one_cpu"?
>
> I think the current semantics are more like “includes remote CPUs”. How about
> calling it “local_only” and negating its value?

Sure.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH v2 2/3] x86/mm/tlb: Defer PTI flushes
  2019-08-27 23:55     ` Nadav Amit
@ 2019-08-28  0:30       ` Andy Lutomirski
  0 siblings, 0 replies; 18+ messages in thread
From: Andy Lutomirski @ 2019-08-28  0:30 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andy Lutomirski, Dave Hansen, X86 ML, LKML, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar

On Tue, Aug 27, 2019 at 4:55 PM Nadav Amit <namit@vmware.com> wrote:
>
> > On Aug 27, 2019, at 4:13 PM, Andy Lutomirski <luto@kernel.org> wrote:
> >
> > On Fri, Aug 23, 2019 at 11:13 PM Nadav Amit <namit@vmware.com> wrote:
> >> INVPCID is considerably slower than INVLPG of a single PTE. Using it to
> >> flush the user page-tables when PTI is enabled therefore introduces
> >> significant overhead.
> >>
> >> Instead, unless page-tables are released, it is possible to defer the
> >> flushing of the user page-tables until the code returns to
> >> userspace. These page tables are not in use, so deferring the flushes is
> >> not a security hazard.
> >
> > I agree and, in fact, I argued against ever using INVPCID in the
> > original PTI code.
> >
> > However, I don't see what freeing page tables has to do with this.  If
> > the CPU can actually do speculative page walks based on the contents
> > of non-current-PCID TLB entries, then we have major problems, since we
> > don't actively flush the TLB for non-running mms at all.
>
> That was not my concern.
>
> >
> > I suppose that, if we free a page table, then we can't activate the
> > PCID by writing to CR3 before flushing things.  But we can still defer
> > the flush and just set the flush bit when we write to CR3.
>
> This was my concern. I can change the behavior so the code would flush the
> whole TLB instead. I just tried not to change the existing behavior too
> much.
>

We do this anyway if we don't have INVPCID_SINGLE, so it doesn't seem
so bad to also do it if there's a freed page table.
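
Roughly, the fallback under discussion would look like this on the
return-to-userspace path (a sketch only; apart from X86_CR3_PCID_NOFLUSH and
write_cr3(), the names below are illustrative rather than actual kernel
symbols):

	if (user_pt_flush_pending) {
		if (freed_tables) {
			/* A page table was freed: skip the NOFLUSH bit so the
			 * CR3 write itself flushes the whole user ASID. */
			write_cr3(user_cr3);
		} else {
			/* Nothing was freed: keep the cached ASID and flush
			 * only the deferred range with INVLPG afterwards. */
			write_cr3(user_cr3 | X86_CR3_PCID_NOFLUSH);
			flush_deferred_user_range();
		}
	}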

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH v2 0/3] x86/mm/tlb: Defer TLB flushes with PTI
  2019-08-23 22:52 [RFC PATCH v2 0/3] x86/mm/tlb: Defer TLB flushes with PTI Nadav Amit
                   ` (3 preceding siblings ...)
  2019-08-27 18:17 ` [RFC PATCH v2 0/3] x86/mm/tlb: Defer TLB flushes with PTI Dave Hansen
@ 2019-09-03 15:17 ` Dave Hansen
  2019-09-03 16:13   ` Nadav Amit
  4 siblings, 1 reply; 18+ messages in thread
From: Dave Hansen @ 2019-09-03 15:17 UTC (permalink / raw)
  To: Nadav Amit, Andy Lutomirski, Dave Hansen
  Cc: x86, linux-kernel, Peter Zijlstra, Thomas Gleixner, Ingo Molnar

On 8/23/19 3:52 PM, Nadav Amit wrote:
> n_pages		concurrent	+deferred-pti		change
> -------		----------	-------------		------
>  1		2119		1986 			-6.7%
>  10		6791		5417 			 -20%
> 
> Please let me know if I missed something that affects security or
> performance.

Hi Nadav,

Is this in a VM or on bare metal?  Could you also share the bare
/proc/cpuinfo for one of the CPU threads?  I want to make sure that any
oddities in the stepping or microcode are known.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH v2 0/3] x86/mm/tlb: Defer TLB flushes with PTI
  2019-09-03 15:17 ` Dave Hansen
@ 2019-09-03 16:13   ` Nadav Amit
  0 siblings, 0 replies; 18+ messages in thread
From: Nadav Amit @ 2019-09-03 16:13 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, Dave Hansen, x86, linux-kernel, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar

> On Sep 3, 2019, at 8:17 AM, Dave Hansen <dave.hansen@intel.com> wrote:
> 
> On 8/23/19 3:52 PM, Nadav Amit wrote:
>> n_pages		concurrent	+deferred-pti		change
>> -------		----------	-------------		------
>> 1		2119		1986 			-6.7%
>> 10		6791		5417 			 -20%
>> 
>> Please let me know if I missed something that affects security or
>> performance.
> 
> Hi Nadav,
> 
> Is this in a VM or on bare metal?  Could you also share the bare
> /proc/cpuinfo for one of the CPU threads?  I want to make sure that any
> oddities in the stepping or microcode are known.

Bare-metal. Note that some of the benefit does come from having tighter
loops for INVPCID/INVLPG, but the big difference is due to INVPCID being
slower.
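
For reference, the two per-page primitives being compared boil down to
roughly the following (simplified; the kernel's __invpcid() and
__flush_tlb_one_user() carry more plumbing and feature checks):

	/* Flush one entry of the current address space: */
	static inline void flush_one_invlpg(unsigned long addr)
	{
		asm volatile("invlpg (%0)" :: "r" (addr) : "memory");
	}

	/* Flush one entry of a given PCID, e.g. the PTI user ASID: */
	static inline void flush_one_invpcid(unsigned long pcid, unsigned long addr)
	{
		struct { u64 pcid, addr; } desc = { pcid, addr };

		/* type 0 == individual-address invalidation */
		asm volatile("invpcid %0, %1"
			     :: "m" (desc), "r" (0UL) : "memory");
	}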

I am kind of surprised that you are surprised. I was under the impression
that the lower performance of INVPCID-single is well known. Let me know if I
did something wrong.

processor	: 55
vendor_id	: GenuineIntel
cpu family	: 6
model		: 79
model name	: Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz
stepping	: 1
microcode	: 0xb000036
cpu MHz		: 2403.035
cache size	: 35840 KB
physical id	: 1
siblings	: 28
core id		: 14
cpu cores	: 14
apicid		: 61
initial apicid	: 61
fpu		: yes
fpu_exception	: yes
cpuid level	: 20
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap intel_pt xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts md_clear flush_l1d
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs
bogomips	: 4001.56
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2019-09-03 16:13 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-23 22:52 [RFC PATCH v2 0/3] x86/mm/tlb: Defer TLB flushes with PTI Nadav Amit
2019-08-23 22:52 ` [RFC PATCH v2 1/3] x86/mm/tlb: Change __flush_tlb_one_user interface Nadav Amit
2019-08-26  7:51   ` Juergen Gross
2019-08-26 16:38     ` Nadav Amit
2019-08-27  5:53       ` Juergen Gross
2019-08-23 22:52 ` [RFC PATCH v2 2/3] x86/mm/tlb: Defer PTI flushes Nadav Amit
2019-08-27 18:28   ` Dave Hansen
2019-08-27 19:46     ` Nadav Amit
2019-08-27 23:13   ` Andy Lutomirski
2019-08-27 23:55     ` Nadav Amit
2019-08-28  0:30       ` Andy Lutomirski
2019-08-23 22:52 ` [RFC PATCH v2 3/3] x86/mm/tlb: Avoid deferring PTI flushes on shootdown Nadav Amit
2019-08-27 23:07   ` Andy Lutomirski
2019-08-27 23:57     ` Nadav Amit
2019-08-28  0:30       ` Andy Lutomirski
2019-08-27 18:17 ` [RFC PATCH v2 0/3] x86/mm/tlb: Defer TLB flushes with PTI Dave Hansen
2019-09-03 15:17 ` Dave Hansen
2019-09-03 16:13   ` Nadav Amit
