linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH 0/3] x86/mm/tlb: Defer TLB flushes with PTI
@ 2019-08-23 22:46 Nadav Amit
  2019-08-23 22:46 ` [RFC PATCH 1/3] x86/mm/tlb: Defer PTI flushes Nadav Amit
                   ` (4 more replies)
  0 siblings, 5 replies; 10+ messages in thread
From: Nadav Amit @ 2019-08-23 22:46 UTC (permalink / raw)
  To: Andy Lutomirski, Dave Hansen
  Cc: x86, linux-kernel, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Nadav Amit

INVPCID is considerably slower than INVLPG of a single PTE, but it is
currently used to flush PTEs in the user page-table when PTI is used.

Instead, it is possible to defer TLB flushes until after the user
page-tables are loaded. Preventing speculation over the TLB flushes
should keep the whole thing safe. In some cases, deferring TLB flushes
in such a way can result in more full TLB flushes, but arguably this
behavior is oftentimes beneficial.
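
In C terms, the deferral amounts to roughly the following (an illustrative
sketch with a made-up function name; the real flush loop lives in the entry
assembly in patch 1, since general kernel C code is no longer mapped once
the user CR3 is loaded):

	/* Sketch: replay the recorded range right after loading the user CR3 */
	static void pti_replay_deferred_flush(void)
	{
		unsigned long addr = __this_cpu_read(cpu_tlbstate.user_flush_start);
		unsigned long end  = __this_cpu_read(cpu_tlbstate.user_flush_end);

		for (; addr < end; addr += PAGE_SIZE)
			asm volatile("invlpg (%0)" : : "r" (addr) : "memory");

		/* prevent speculation from skipping the INVLPGs */
		asm volatile("lfence" ::: "memory");
	}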

These patches are based and evaluated on top of the concurrent
TLB-flushes v4 patch-set.

I will provide more results later, but it might be easier to look at the
time an isolated TLB flush takes. These numbers are from Skylake and show
the number of cycles taken by a madvise(DONTNEED) call that results in
local TLB flushes:

n_pages		concurrent	+deferred-pti		change
-------		----------	-------------		------
 1		2119		1986 			-6.7%
 10		6791		5417 			 -20%
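
For reference, the numbers above were collected with a loop of roughly this
shape (a sketch of the benchmark, not the exact harness; the iteration
count and the use of RDTSC for timing are assumptions here):

	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <x86intrin.h>

	#define N_PAGES	10		/* 1 or 10 in the table above */
	#define PAGE_SZ	4096UL
	#define ITERS	100000

	int main(void)
	{
		char *p = mmap(NULL, N_PAGES * PAGE_SZ, PROT_READ | PROT_WRITE,
			       MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
		unsigned long long t, cycles = 0;

		if (p == MAP_FAILED)
			return 1;

		for (int i = 0; i < ITERS; i++) {
			/* populate the PTEs so madvise() has something to zap */
			memset(p, 1, N_PAGES * PAGE_SZ);
			t = __rdtsc();
			madvise(p, N_PAGES * PAGE_SZ, MADV_DONTNEED);
			cycles += __rdtsc() - t;
		}
		printf("%llu cycles per madvise()\n", cycles / ITERS);
		return 0;
	}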

Please let me know if I missed something that affects security or
performance.

[ Yes, I know there is another pending RFC for async TLB flushes, but I
  think it might be easier to merge this one first ]

Nadav Amit (3):
  x86/mm/tlb: Defer PTI flushes
  x86/mm/tlb: Avoid deferring PTI flushes on shootdown
  x86/mm/tlb: Use lockdep irq assertions

 arch/x86/entry/calling.h        | 52 +++++++++++++++++++--
 arch/x86/include/asm/tlbflush.h | 31 ++++++++++--
 arch/x86/kernel/asm-offsets.c   |  3 ++
 arch/x86/mm/tlb.c               | 83 +++++++++++++++++++++++++++++++--
 4 files changed, 158 insertions(+), 11 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [RFC PATCH 1/3] x86/mm/tlb: Defer PTI flushes
  2019-08-23 22:46 [RFC PATCH 0/3] x86/mm/tlb: Defer TLB flushes with PTI Nadav Amit
@ 2019-08-23 22:46 ` Nadav Amit
  2019-08-23 22:46 ` [RFC PATCH 2/3] x86/mm/tlb: Avoid deferring PTI flushes on shootdown Nadav Amit
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 10+ messages in thread
From: Nadav Amit @ 2019-08-23 22:46 UTC (permalink / raw)
  To: Andy Lutomirski, Dave Hansen
  Cc: x86, linux-kernel, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Nadav Amit

INVPCID is considerably slower than INVLPG of a single PTE. Using it to
flush the user page-tables when PTI is enabled therefore introduces
significant overhead.

Instead, unless page-tables are released, it is possible to defer the
flushing of the user page-tables until the kernel returns to userspace.
These page-tables are not in use at that point, so deferring the flushes is
not a security hazard. When CR3 is loaded, as part of returning to
userspace, use INVLPG to flush the relevant PTEs. Use LFENCE to prevent
speculative execution from skipping the INVLPGs.

There are some caveats that sometimes require a full TLB flush of the user
page-tables: some (uncommon) code-paths reload CR3 with no stack available;
if a context-switch happens while flushes are pending, tracking which TLB
flushes are still needed is complicated and expensive; and if there are
multiple TLB flushes of different ranges before the kernel returns to
userspace, the overhead of tracking them can exceed the benefit.

In these cases, perform a full TLB flush. It is possible to avoid the full
flush in some of these cases, but the benefit of doing so is questionable.
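
As a worked example of the merging rules below (addresses are illustrative):

	1st deferred flush: [0x1000, 0x3000), 4KB stride
		-> the range is recorded in cpu_tlbstate
	2nd deferred flush: [0x2000, 0x5000), 4KB stride
		-> merged; the recorded end is extended to 0x5000, since the
		   combined range is still under tlb_single_page_flush_ceiling
	3rd deferred flush: [0x7f0000000000, 0x7f0000002000)
		-> the combined range would exceed the ceiling, so fall back
		   to a full flush of the user page-tables (TLB_FLUSH_ALL)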

Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/entry/calling.h        | 52 ++++++++++++++++++++++--
 arch/x86/include/asm/tlbflush.h | 30 +++++++++++---
 arch/x86/kernel/asm-offsets.c   |  3 ++
 arch/x86/mm/tlb.c               | 70 +++++++++++++++++++++++++++++++++
 4 files changed, 147 insertions(+), 8 deletions(-)

diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 515c0ceeb4a3..a4d46416853d 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -6,6 +6,7 @@
 #include <asm/percpu.h>
 #include <asm/asm-offsets.h>
 #include <asm/processor-flags.h>
+#include <asm/tlbflush.h>
 
 /*
 
@@ -205,7 +206,16 @@ For 32-bit we have the following conventions - kernel is built with
 #define THIS_CPU_user_pcid_flush_mask   \
 	PER_CPU_VAR(cpu_tlbstate) + TLB_STATE_user_pcid_flush_mask
 
-.macro SWITCH_TO_USER_CR3_NOSTACK scratch_reg:req scratch_reg2:req
+#define THIS_CPU_user_flush_start	\
+	PER_CPU_VAR(cpu_tlbstate) + TLB_STATE_user_flush_start
+
+#define THIS_CPU_user_flush_end	\
+	PER_CPU_VAR(cpu_tlbstate) + TLB_STATE_user_flush_end
+
+#define THIS_CPU_user_flush_stride_shift	\
+	PER_CPU_VAR(cpu_tlbstate) + TLB_STATE_user_flush_stride_shift
+
+.macro SWITCH_TO_USER_CR3 scratch_reg:req scratch_reg2:req has_stack:req
 	ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
 	mov	%cr3, \scratch_reg
 
@@ -221,9 +231,41 @@ For 32-bit we have the following conventions - kernel is built with
 
 	/* Flush needed, clear the bit */
 	btr	\scratch_reg, THIS_CPU_user_pcid_flush_mask
+.if \has_stack
+	cmpq	$(TLB_FLUSH_ALL), THIS_CPU_user_flush_end
+	jnz	.Lpartial_flush_\@
+.Ldo_full_flush_\@:
+.endif
 	movq	\scratch_reg2, \scratch_reg
 	jmp	.Lwrcr3_pcid_\@
-
+.if \has_stack
+.Lpartial_flush_\@:
+	/* Prepare CR3 with PGD of user, and no flush set */
+	orq	$(PTI_USER_PGTABLE_AND_PCID_MASK), \scratch_reg2
+	SET_NOFLUSH_BIT \scratch_reg2
+	pushq	%rsi
+	pushq	%rbx
+	pushq	%rcx
+	movb	THIS_CPU_user_flush_stride_shift, %cl
+	movq	$1, %rbx
+	shl	%cl, %rbx
+	movq	THIS_CPU_user_flush_start, %rsi
+	movq	THIS_CPU_user_flush_end, %rcx
+	/* Load the new cr3 and flush */
+	mov	\scratch_reg2, %cr3
+.Lflush_loop_\@:
+	invlpg	(%rsi)
+	addq	%rbx, %rsi
+	cmpq	%rsi, %rcx
+	ja	.Lflush_loop_\@
+	/* Prevent speculatively skipping flushes */
+	lfence
+
+	popq	%rcx
+	popq	%rbx
+	popq	%rsi
+	jmp	.Lend_\@
+.endif
 .Lnoflush_\@:
 	movq	\scratch_reg2, \scratch_reg
 	SET_NOFLUSH_BIT \scratch_reg
@@ -239,9 +281,13 @@ For 32-bit we have the following conventions - kernel is built with
 .Lend_\@:
 .endm
 
+.macro SWITCH_TO_USER_CR3_NOSTACK scratch_reg:req scratch_reg2:req
+	SWITCH_TO_USER_CR3 scratch_reg=\scratch_reg scratch_reg2=%rax has_stack=0
+.endm
+
 .macro SWITCH_TO_USER_CR3_STACK	scratch_reg:req
 	pushq	%rax
-	SWITCH_TO_USER_CR3_NOSTACK scratch_reg=\scratch_reg scratch_reg2=%rax
+	SWITCH_TO_USER_CR3 scratch_reg=\scratch_reg scratch_reg2=%rax has_stack=1
 	popq	%rax
 .endm
 
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 421bc82504e2..da56aa3ccd07 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -2,6 +2,10 @@
 #ifndef _ASM_X86_TLBFLUSH_H
 #define _ASM_X86_TLBFLUSH_H
 
+#define TLB_FLUSH_ALL	-1UL
+
+#ifndef __ASSEMBLY__
+
 #include <linux/mm.h>
 #include <linux/sched.h>
 
@@ -222,6 +226,10 @@ struct tlb_state {
 	 * context 0.
 	 */
 	struct tlb_context ctxs[TLB_NR_DYN_ASIDS];
+
+	unsigned long user_flush_start;
+	unsigned long user_flush_end;
+	unsigned long user_flush_stride_shift;
 };
 DECLARE_PER_CPU_ALIGNED(struct tlb_state, cpu_tlbstate);
 
@@ -373,6 +381,16 @@ static inline void cr4_set_bits_and_update_boot(unsigned long mask)
 
 extern void initialize_tlbstate_and_flush(void);
 
+static unsigned long *this_cpu_user_pcid_flush_mask(void)
+{
+	return (unsigned long *)this_cpu_ptr(&cpu_tlbstate.user_pcid_flush_mask);
+}
+
+static inline void set_pending_user_pcid_flush(u16 asid)
+{
+	__set_bit(kern_pcid(asid), this_cpu_user_pcid_flush_mask());
+}
+
 /*
  * Given an ASID, flush the corresponding user ASID.  We can delay this
  * until the next time we switch to it.
@@ -395,8 +413,10 @@ static inline void invalidate_user_asid(u16 asid)
 	if (!static_cpu_has(X86_FEATURE_PTI))
 		return;
 
-	__set_bit(kern_pcid(asid),
-		  (unsigned long *)this_cpu_ptr(&cpu_tlbstate.user_pcid_flush_mask));
+	set_pending_user_pcid_flush(asid);
+
+	/* Mark the flush as global */
+	__this_cpu_write(cpu_tlbstate.user_flush_end, TLB_FLUSH_ALL);
 }
 
 /*
@@ -516,8 +536,6 @@ static inline void __flush_tlb_one_kernel(unsigned long addr)
 	invalidate_other_asid();
 }
 
-#define TLB_FLUSH_ALL	-1UL
-
 /*
  * TLB flushing:
  *
@@ -580,7 +598,7 @@ static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
 }
 
 void native_flush_tlb_multi(const struct cpumask *cpumask,
-			     const struct flush_tlb_info *info);
+			    const struct flush_tlb_info *info);
 
 static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
 {
@@ -610,4 +628,6 @@ extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
 	tlb_remove_page(tlb, (void *)(page))
 #endif
 
+#endif /* __ASSEMBLY__ */
+
 #endif /* _ASM_X86_TLBFLUSH_H */
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 5c7ee3df4d0b..bfbe393a5f46 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -95,6 +95,9 @@ static void __used common(void)
 
 	/* TLB state for the entry code */
 	OFFSET(TLB_STATE_user_pcid_flush_mask, tlb_state, user_pcid_flush_mask);
+	OFFSET(TLB_STATE_user_flush_start, tlb_state, user_flush_start);
+	OFFSET(TLB_STATE_user_flush_end, tlb_state, user_flush_end);
+	OFFSET(TLB_STATE_user_flush_stride_shift, tlb_state, user_flush_stride_shift);
 
 	/* Layout info for cpu_entry_area */
 	OFFSET(CPU_ENTRY_AREA_entry_stack, cpu_entry_area, entry_stack_page);
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index ad15fc2c0790..31260c55d597 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -407,6 +407,16 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 
 		choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);
 
+		/*
+		 * If a partial flush is already pending, setting the end to
+		 * TLB_FLUSH_ALL marks that a full flush is needed. Do it
+		 * unconditionally, since it is benign anyhow.  Alternatively,
+		 * we could conditionally flush the deferred range, but that is
+		 * likely to perform worse.
+		 */
+		if (static_cpu_has(X86_FEATURE_PTI))
+			__this_cpu_write(cpu_tlbstate.user_flush_end, TLB_FLUSH_ALL);
+
 		/* Let nmi_uaccess_okay() know that we're changing CR3. */
 		this_cpu_write(cpu_tlbstate.loaded_mm, LOADED_MM_SWITCHING);
 		barrier();
@@ -512,6 +522,58 @@ void initialize_tlbstate_and_flush(void)
 		this_cpu_write(cpu_tlbstate.ctxs[i].ctx_id, 0);
 }
 
+/*
+ * Defer the TLB flush to the point we return to userspace.
+ */
+static void flush_user_tlb_deferred(u16 asid, unsigned long start,
+				    unsigned long end, u8 stride_shift)
+{
+	unsigned long prev_start, prev_end;
+	u8 prev_stride_shift;
+
+	/*
+	 * Check if this is the first deferred flush of the user page tables.
+	 * If it is the first one, we simply record the pending flush.
+	 */
+	if (!test_bit(kern_pcid(asid), this_cpu_user_pcid_flush_mask())) {
+		__this_cpu_write(cpu_tlbstate.user_flush_start, start);
+		__this_cpu_write(cpu_tlbstate.user_flush_end, end);
+		__this_cpu_write(cpu_tlbstate.user_flush_stride_shift, stride_shift);
+		set_pending_user_pcid_flush(asid);
+		return;
+	}
+
+	prev_end = __this_cpu_read(cpu_tlbstate.user_flush_end);
+	prev_start = __this_cpu_read(cpu_tlbstate.user_flush_start);
+	prev_stride_shift = __this_cpu_read(cpu_tlbstate.user_flush_stride_shift);
+
+	/* If we already have a full pending flush, we are done */
+	if (prev_end == TLB_FLUSH_ALL)
+		return;
+
+	/*
+	 * We already have a pending flush, check if we can merge with the
+	 * previous one.
+	 */
+	if (start >= prev_start && stride_shift == prev_stride_shift) {
+		/*
+		 * Unlikely, but if the new range falls inside the old range we
+		 * are done. This check is required for correctness.
+		 */
+		if (end < prev_end)
+			return;
+
+		/* Check if a single range can also hold this flush. */
+		if ((end - prev_start) >> stride_shift < tlb_single_page_flush_ceiling) {
+			__this_cpu_write(cpu_tlbstate.user_flush_end, end);
+			return;
+		}
+	}
+
+	/* We cannot merge. Do a full flush instead */
+	__this_cpu_write(cpu_tlbstate.user_flush_end, TLB_FLUSH_ALL);
+}
+
 static void flush_tlb_user_pt_range(u16 asid, const struct flush_tlb_info *f)
 {
 	unsigned long start, end, addr;
@@ -528,6 +590,14 @@ static void flush_tlb_user_pt_range(u16 asid, const struct flush_tlb_info *f)
 	end = f->end;
 	stride_shift = f->stride_shift;
 
+	/*
+	 * We can defer flushes as long as page-tables were not freed.
+	 */
+	if (IS_ENABLED(CONFIG_X86_64) && !f->freed_tables) {
+		flush_user_tlb_deferred(asid, start, end, stride_shift);
+		return;
+	}
+
 	/*
 	 * Some platforms #GP if we call invpcid(type=1/2) before CR4.PCIDE=1.
 	 * Just use invalidate_user_asid() in case we are called early.
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [RFC PATCH 2/3] x86/mm/tlb: Avoid deferring PTI flushes on shootdown
  2019-08-23 22:46 [RFC PATCH 0/3] x86/mm/tlb: Defer TLB flushes with PTI Nadav Amit
  2019-08-23 22:46 ` [RFC PATCH 1/3] x86/mm/tlb: Defer PTI flushes Nadav Amit
@ 2019-08-23 22:46 ` Nadav Amit
  2019-08-23 22:46 ` [RFC PATCH 3/3] x86/mm/tlb: Use lockdep irq assertions Nadav Amit
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 10+ messages in thread
From: Nadav Amit @ 2019-08-23 22:46 UTC (permalink / raw)
  To: Andy Lutomirski, Dave Hansen
  Cc: x86, linux-kernel, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Nadav Amit

When a shootdown is initiated, the initiating CPU has cycles to burn as
it waits for the responding CPUs to receive the IPI and acknowledge it.
In these cycles it is better to flush the user page-tables using
INVPCID, instead of deferring the TLB flush.

The best way to figure out whether there are cycles to burn is arguably
to expose from the SMP layer when an acknowledgment is received.
However, this would break some abstractions.

Instead, use a simpler solution: the CPU that initiates a TLB shootdown
does not defer PTI flushes. This is not always a win relative to deferring
the user page-table flushes, but it prevents a performance regression.

Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/tlbflush.h |  1 +
 arch/x86/mm/tlb.c               | 10 +++++++++-
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index da56aa3ccd07..066b3804f876 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -573,6 +573,7 @@ struct flush_tlb_info {
 	unsigned int		initiating_cpu;
 	u8			stride_shift;
 	u8			freed_tables;
+	u8			shootdown;
 };
 
 #define local_flush_tlb() __flush_tlb()
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 31260c55d597..ba50430275d4 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -592,8 +592,13 @@ static void flush_tlb_user_pt_range(u16 asid, const struct flush_tlb_info *f)
 
 	/*
 	 * We can defer flushes as long as page-tables were not freed.
+	 *
+	 * However, if there is a shootdown the initiating CPU has cycles to
+	 * spare, while it waits for the other cores to respond. In this case,
+	 * deferring the flushing can cause overheads, so avoid it.
 	 */
-	if (IS_ENABLED(CONFIG_X86_64) && !f->freed_tables) {
+	if (IS_ENABLED(CONFIG_X86_64) && !f->freed_tables &&
+	    (!f->shootdown || f->initiating_cpu != smp_processor_id())) {
 		flush_user_tlb_deferred(asid, start, end, stride_shift);
 		return;
 	}
@@ -861,6 +866,7 @@ static struct flush_tlb_info *get_flush_tlb_info(struct mm_struct *mm,
 	info->freed_tables	= freed_tables;
 	info->new_tlb_gen	= new_tlb_gen;
 	info->initiating_cpu	= smp_processor_id();
+	info->shootdown		= false;
 
 	return info;
 }
@@ -903,6 +909,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 	 * flush_tlb_func_local() directly in this case.
 	 */
 	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
+		info->shootdown = true;
 		flush_tlb_multi(mm_cpumask(mm), info);
 	} else if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
 		lockdep_assert_irqs_enabled();
@@ -970,6 +977,7 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 	 * flush_tlb_func_local() directly in this case.
 	 */
 	if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids) {
+		info->shootdown = true;
 		flush_tlb_multi(&batch->cpumask, info);
 	} else if (cpumask_test_cpu(cpu, &batch->cpumask)) {
 		lockdep_assert_irqs_enabled();
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [RFC PATCH 3/3] x86/mm/tlb: Use lockdep irq assertions
  2019-08-23 22:46 [RFC PATCH 0/3] x86/mm/tlb: Defer TLB flushes with PTI Nadav Amit
  2019-08-23 22:46 ` [RFC PATCH 1/3] x86/mm/tlb: Defer PTI flushes Nadav Amit
  2019-08-23 22:46 ` [RFC PATCH 2/3] x86/mm/tlb: Avoid deferring PTI flushes on shootdown Nadav Amit
@ 2019-08-23 22:46 ` Nadav Amit
  2019-08-24  6:09 ` [RFC PATCH 0/3] x86/mm/tlb: Defer TLB flushes with PTI Nadav Amit
  2019-08-27 23:18 ` Andy Lutomirski
  4 siblings, 0 replies; 10+ messages in thread
From: Nadav Amit @ 2019-08-23 22:46 UTC (permalink / raw)
  To: Andy Lutomirski, Dave Hansen
  Cc: x86, linux-kernel, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Nadav Amit

The assertions that check whether IRQs are disabled currently depend on
different debug features. Instead, use lockdep_assert_irqs_disabled(),
which is standard, enabled by the same debug feature, and provides more
information upon failure.

Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/mm/tlb.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index ba50430275d4..6f4ce02e2c5b 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -293,8 +293,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 	 */
 
 	/* We don't want flush_tlb_func() to run concurrently with us. */
-	if (IS_ENABLED(CONFIG_PROVE_LOCKING))
-		WARN_ON_ONCE(!irqs_disabled());
+	lockdep_assert_irqs_disabled();
 
 	/*
 	 * Verify that CR3 is what we think it is.  This will catch
@@ -643,7 +642,7 @@ static void flush_tlb_func(void *info)
 	unsigned long nr_invalidate = 0;
 
 	/* This code cannot presently handle being reentered. */
-	VM_WARN_ON(!irqs_disabled());
+	lockdep_assert_irqs_disabled();
 
 	if (!local) {
 		inc_irq_stat(irq_tlb_count);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH 0/3] x86/mm/tlb: Defer TLB flushes with PTI
  2019-08-23 22:46 [RFC PATCH 0/3] x86/mm/tlb: Defer TLB flushes with PTI Nadav Amit
                   ` (2 preceding siblings ...)
  2019-08-23 22:46 ` [RFC PATCH 3/3] x86/mm/tlb: Use lockdep irq assertions Nadav Amit
@ 2019-08-24  6:09 ` Nadav Amit
  2019-08-27 23:18 ` Andy Lutomirski
  4 siblings, 0 replies; 10+ messages in thread
From: Nadav Amit @ 2019-08-24  6:09 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andy Lutomirski, Dave Hansen, the arch/x86 maintainers, LKML,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar

Sorry, I made a mistake and included the wrong patches. I will send
RFC v2 in a few minutes.


> On Aug 23, 2019, at 3:46 PM, Nadav Amit <namit@vmware.com> wrote:
> 
> INVPCID is considerably slower than INVLPG of a single PTE, but it is
> currently used to flush PTEs in the user page-table when PTI is used.
> 
> Instead, it is possible to defer TLB flushes until after the user
> page-tables are loaded. Preventing speculation over the TLB flushes
> should keep the whole thing safe. In some cases, deferring TLB flushes
> in such a way can result in more full TLB flushes, but arguably this
> behavior is oftentimes beneficial.
> 
> These patches are based and evaluated on top of the concurrent
> TLB-flushes v4 patch-set.
> 
> I will provide more results later, but it might be easier to look at the
> time an isolated TLB flush takes. These numbers are from skylake,
> showing the number of cycles that running madvise(DONTNEED) which
> results in local TLB flushes takes:
> 
> n_pages		concurrent	+deferred-pti		change
> -------		----------	-------------		------
> 1		2119		1986 			-6.7%
> 10		6791		5417 			 -20%
> 
> Please let me know if I missed something that affects security or
> performance.
> 
> [ Yes, I know there is another pending RFC for async TLB flushes, but I
>  think it might be easier to merge this one first ]
> 
> Nadav Amit (3):
>  x86/mm/tlb: Defer PTI flushes
>  x86/mm/tlb: Avoid deferring PTI flushes on shootdown
>  x86/mm/tlb: Use lockdep irq assertions
> 
> arch/x86/entry/calling.h        | 52 +++++++++++++++++++--
> arch/x86/include/asm/tlbflush.h | 31 ++++++++++--
> arch/x86/kernel/asm-offsets.c   |  3 ++
> arch/x86/mm/tlb.c               | 83 +++++++++++++++++++++++++++++++--
> 4 files changed, 158 insertions(+), 11 deletions(-)
> 
> -- 
> 2.17.1



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH 0/3] x86/mm/tlb: Defer TLB flushes with PTI
  2019-08-23 22:46 [RFC PATCH 0/3] x86/mm/tlb: Defer TLB flushes with PTI Nadav Amit
                   ` (3 preceding siblings ...)
  2019-08-24  6:09 ` [RFC PATCH 0/3] x86/mm/tlb: Defer TLB flushes with PTI Nadav Amit
@ 2019-08-27 23:18 ` Andy Lutomirski
  2019-08-27 23:52   ` Nadav Amit
  4 siblings, 1 reply; 10+ messages in thread
From: Andy Lutomirski @ 2019-08-27 23:18 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andy Lutomirski, Dave Hansen, X86 ML, LKML, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar

On Fri, Aug 23, 2019 at 11:07 PM Nadav Amit <namit@vmware.com> wrote:
>
> INVPCID is considerably slower than INVLPG of a single PTE, but it is
> currently used to flush PTEs in the user page-table when PTI is used.
>
> Instead, it is possible to defer TLB flushes until after the user
> page-tables are loaded. Preventing speculation over the TLB flushes
> should keep the whole thing safe. In some cases, deferring TLB flushes
> in such a way can result in more full TLB flushes, but arguably this
> behavior is oftentimes beneficial.

I have a somewhat horrible suggestion.

Would it make sense to refactor this so that it works for user *and*
kernel tables?  In particular, if we flush a *kernel* mapping (vfree,
vunmap, set_memory_ro, etc), we shouldn't need to send an IPI to a
task that is running user code to flush most kernel mappings or even
to free kernel pagetables.  The same trick could be done if we treat
idle like user mode for this purpose.

In code, this could mostly consist of changing all the "user" data
structures involved to something like struct deferred_flush_info and
having one for user and one for kernel.
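
Something like this, roughly (a sketch; the field names are invented):

	struct deferred_flush_info {
		unsigned long	start;
		unsigned long	end;
		u8		stride_shift;
	};

	/* in struct tlb_state: one instance per half of the address space */
	struct deferred_flush_info	user_deferred_flush;
	struct deferred_flush_info	kernel_deferred_flush;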

I think this is horrible because it will enable certain workloads to
work considerably faster with PTI on than with PTI off, and that would
be a barely excusable moral failing. :-p

For what it's worth, other than register clobber issues, the whole
"switch CR3 for PTI" logic ought to be doable in C.  I don't know a
priori whether that would end up being an improvement.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH 0/3] x86/mm/tlb: Defer TLB flushes with PTI
  2019-08-27 23:18 ` Andy Lutomirski
@ 2019-08-27 23:52   ` Nadav Amit
  2019-08-28  0:30     ` Andy Lutomirski
  0 siblings, 1 reply; 10+ messages in thread
From: Nadav Amit @ 2019-08-27 23:52 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, X86 ML, LKML, Peter Zijlstra, Thomas Gleixner, Ingo Molnar

> On Aug 27, 2019, at 4:18 PM, Andy Lutomirski <luto@kernel.org> wrote:
> 
> On Fri, Aug 23, 2019 at 11:07 PM Nadav Amit <namit@vmware.com> wrote:
>> INVPCID is considerably slower than INVLPG of a single PTE, but it is
>> currently used to flush PTEs in the user page-table when PTI is used.
>> 
>> Instead, it is possible to defer TLB flushes until after the user
>> page-tables are loaded. Preventing speculation over the TLB flushes
>> should keep the whole thing safe. In some cases, deferring TLB flushes
>> in such a way can result in more full TLB flushes, but arguably this
>> behavior is oftentimes beneficial.
> 
> I have a somewhat horrible suggestion.
> 
> Would it make sense to refactor this so that it works for user *and*
> kernel tables?  In particular, if we flush a *kernel* mapping (vfree,
> vunmap, set_memory_ro, etc), we shouldn't need to send an IPI to a
> task that is running user code to flush most kernel mappings or even
> to free kernel pagetables.  The same trick could be done if we treat
> idle like user mode for this purpose.
> 
> In code, this could mostly consist of changing all the "user" data
> structures involved to something like struct deferred_flush_info and
> having one for user and one for kernel.
> 
> I think this is horrible because it will enable certain workloads to
> work considerably faster with PTI on than with PTI off, and that would
> be a barely excusable moral failing. :-p
> 
> For what it's worth, other than register clobber issues, the whole
> "switch CR3 for PTI" logic ought to be doable in C.  I don't know a
> priori whether that would end up being an improvement.

I implemented (and have not yet sent) another TLB deferring mechanism. It is
intended for user mappings and not kernel ones, but I think your suggestion
shares a similar underlying rationale, and therefore similar challenges and
solutions. Let me rephrase what you say to ensure we are on the same page.

The basic idea is context-tracking to check whether each CPU is in kernel or
user mode. Accordingly, TLB flushes can be deferred, but I don’t see that
this solution is limited to PTI. There are 2 possible reasons, according to
my understanding, that you limit the discussion to PTI:

1. PTI provides clear boundaries when user and kernel mappings are used. I
am not sure that privilege-levels (and SMAP) do not do the same.

2. CR3 switching already imposes a memory barrier, which eliminates most of
the cost of implementing such a scheme, which requires something similar
to:

	write new context (kernel/user)
	mb();
	if (need_flush) flush;

I do agree that PTI addresses (2), but there is another problem. A
reasonable implementation would store in a per-cpu state whether each CPU is
in user/kernel, and the TLB shootdown initiator CPU would check the state to
decide whether an IPI is needed. This means that pretty much every TLB
shootdown would incur a cache-miss per-target CPU. This might cause
performance regressions, at least in some cases.
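
To make the cache-miss concern concrete, the per-cpu tracking would look
roughly like this (a sketch; all names are made up):

	enum pt_mode { PT_MODE_KERNEL, PT_MODE_USER };
	static DEFINE_PER_CPU(enum pt_mode, cpu_pt_mode);

	static inline void note_kernel_entry(void)
	{
		this_cpu_write(cpu_pt_mode, PT_MODE_KERNEL);
		smp_mb();	/* order the mode write against later PTE accesses */
	}

	/*
	 * Shootdown initiator, per target CPU: this read is the cache-miss
	 * that worries me.
	 */
	static bool cpu_needs_flush_ipi(int cpu)
	{
		return per_cpu(cpu_pt_mode, cpu) == PT_MODE_KERNEL;
	}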

Admittedly, I did implement something similar (not sent) for user mappings:
defer all TLB flushes and shootdowns if the CPUs are known to be in kernel
mode. But I limited myself to certain cases, specifically “long” syscalls
that are already likely to cause a TLB flush (e.g., msync()). I am not sure
that tracking each CPU entry/exit would be a good idea.

I will give some more thought to kernel mapping invalidations, which I have
not thought about enough. I tried to send what I considered “saner” and
cleaner patches first. I still have the patches I mentioned here, the
async-flushes, and another patch that avoids a local TLB flush on CoW and
instead accesses the data.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH 0/3] x86/mm/tlb: Defer TLB flushes with PTI
  2019-08-27 23:52   ` Nadav Amit
@ 2019-08-28  0:30     ` Andy Lutomirski
  2019-08-29 17:23       ` Nadav Amit
  0 siblings, 1 reply; 10+ messages in thread
From: Andy Lutomirski @ 2019-08-28  0:30 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andy Lutomirski, Dave Hansen, X86 ML, LKML, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar

On Tue, Aug 27, 2019 at 4:52 PM Nadav Amit <namit@vmware.com> wrote:
>
> > On Aug 27, 2019, at 4:18 PM, Andy Lutomirski <luto@kernel.org> wrote:
> >
> > On Fri, Aug 23, 2019 at 11:07 PM Nadav Amit <namit@vmware.com> wrote:
> >> INVPCID is considerably slower than INVLPG of a single PTE, but it is
> >> currently used to flush PTEs in the user page-table when PTI is used.
> >>
> >> Instead, it is possible to defer TLB flushes until after the user
> >> page-tables are loaded. Preventing speculation over the TLB flushes
> >> should keep the whole thing safe. In some cases, deferring TLB flushes
> >> in such a way can result in more full TLB flushes, but arguably this
> >> behavior is oftentimes beneficial.
> >
> > I have a somewhat horrible suggestion.
> >
> > Would it make sense to refactor this so that it works for user *and*
> > kernel tables?  In particular, if we flush a *kernel* mapping (vfree,
> > vunmap, set_memory_ro, etc), we shouldn't need to send an IPI to a
> > task that is running user code to flush most kernel mappings or even
> > to free kernel pagetables.  The same trick could be done if we treat
> > idle like user mode for this purpose.
> >
> > In code, this could mostly consist of changing all the "user" data
> > structures involved to something like struct deferred_flush_info and
> > having one for user and one for kernel.
> >
> > I think this is horrible because it will enable certain workloads to
> > work considerably faster with PTI on than with PTI off, and that would
> > be a barely excusable moral failing. :-p
> >
> > For what it's worth, other than register clobber issues, the whole
> > "switch CR3 for PTI" logic ought to be doable in C.  I don't know a
> > priori whether that would end up being an improvement.
>
> I implemented (and have not yet sent) another TLB deferring mechanism. It is
> intended for user mappings and not kernel one, but I think your suggestion
> shares some similar underlying rationale, and therefore challenges and
> solutions. Let me rephrase what you say to ensure we are on the same page.
>
> The basic idea is context-tracking to check whether each CPU is in kernel or
> user mode. Accordingly, TLB flushes can be deferred, but I don’t see that
> this solution is limited to PTI. There are 2 possible reasons, according to
> my understanding, that you limit the discussion to PTI:
>
> 1. PTI provides clear boundaries when user and kernel mappings are used. I
> am not sure that privilege-levels (and SMAP) do not do the same.
>
> 2. CR3 switching already imposes a memory barrier, which eliminates most of
> the cost of implementing such scheme which requires something which is
> similar to:
>
>         write new context (kernel/user)
>         mb();
>         if (need_flush) flush;
>
> I do agree that PTI addresses (2), but there is another problem. A
> reasonable implementation would store in a per-cpu state whether each CPU is
> in user/kernel, and the TLB shootdown initiator CPU would check the state to
> decide whether an IPI is needed. This means that pretty much every TLB
> shootdown would incur a cache-miss per-target CPU. This might cause
> performance regressions, at least in some cases.

We already more or less do this: we have mm_cpumask(), which is
particularly awful since it writes to a falsely-shared line for each
context switch.

For what it's worth, in some sense, your patch series is reinventing
the tracking that is already in cpu_tlbstate -- when we do a flush on
one mm and some cpu is running another mm, we don't do an IPI
shootdown -- instead we set flags so that it will be flushed the next
time it's used.  Maybe we could actually refactor this so we only have
one copy of this code that handles all the various deferred flush
variants.  Perhaps each tracked mm context could have a user
tlb_gen_id and a kernel tlb_gen_id.  I guess one thing that makes this
nasty is that we need to flush the kernel PCID for kernel *and* user
invalidations.  Sigh.
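
To be concrete, the split generations I have in mind would look roughly
like this (field names made up):

	/* per-mm: separate generation counters for the two halves */
	struct mm_context_tlb_gens {
		atomic64_t	user_tlb_gen;
		atomic64_t	kernel_tlb_gen;
	};

	/* per-cpu, per-ASID: the generations that have been flushed so far */
	struct tlb_context_gens {
		u64	flushed_user_tlb_gen;
		u64	flushed_kernel_tlb_gen;
	};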

--Andy

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH 0/3] x86/mm/tlb: Defer TLB flushes with PTI
  2019-08-28  0:30     ` Andy Lutomirski
@ 2019-08-29 17:23       ` Nadav Amit
  2019-09-03 21:33         ` Andy Lutomirski
  0 siblings, 1 reply; 10+ messages in thread
From: Nadav Amit @ 2019-08-29 17:23 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, X86 ML, LKML, Peter Zijlstra, Thomas Gleixner, Ingo Molnar

> On Aug 27, 2019, at 5:30 PM, Andy Lutomirski <luto@kernel.org> wrote:
> 
> On Tue, Aug 27, 2019 at 4:52 PM Nadav Amit <namit@vmware.com> wrote:
>>> On Aug 27, 2019, at 4:18 PM, Andy Lutomirski <luto@kernel.org> wrote:
>>> 
>>> On Fri, Aug 23, 2019 at 11:07 PM Nadav Amit <namit@vmware.com> wrote:
>>>> INVPCID is considerably slower than INVLPG of a single PTE, but it is
>>>> currently used to flush PTEs in the user page-table when PTI is used.
>>>> 
>>>> Instead, it is possible to defer TLB flushes until after the user
>>>> page-tables are loaded. Preventing speculation over the TLB flushes
>>>> should keep the whole thing safe. In some cases, deferring TLB flushes
>>>> in such a way can result in more full TLB flushes, but arguably this
>>>> behavior is oftentimes beneficial.
>>> 
>>> I have a somewhat horrible suggestion.
>>> 
>>> Would it make sense to refactor this so that it works for user *and*
>>> kernel tables?  In particular, if we flush a *kernel* mapping (vfree,
>>> vunmap, set_memory_ro, etc), we shouldn't need to send an IPI to a
>>> task that is running user code to flush most kernel mappings or even
>>> to free kernel pagetables.  The same trick could be done if we treat
>>> idle like user mode for this purpose.
>>> 
>>> In code, this could mostly consist of changing all the "user" data
>>> structures involved to something like struct deferred_flush_info and
>>> having one for user and one for kernel.
>>> 
>>> I think this is horrible because it will enable certain workloads to
>>> work considerably faster with PTI on than with PTI off, and that would
>>> be a barely excusable moral failing. :-p
>>> 
>>> For what it's worth, other than register clobber issues, the whole
>>> "switch CR3 for PTI" logic ought to be doable in C.  I don't know a
>>> priori whether that would end up being an improvement.
>> 
>> I implemented (and have not yet sent) another TLB deferring mechanism. It is
>> intended for user mappings and not kernel one, but I think your suggestion
>> shares some similar underlying rationale, and therefore challenges and
>> solutions. Let me rephrase what you say to ensure we are on the same page.
>> 
>> The basic idea is context-tracking to check whether each CPU is in kernel or
>> user mode. Accordingly, TLB flushes can be deferred, but I don’t see that
>> this solution is limited to PTI. There are 2 possible reasons, according to
>> my understanding, that you limit the discussion to PTI:
>> 
>> 1. PTI provides clear boundaries when user and kernel mappings are used. I
>> am not sure that privilege-levels (and SMAP) do not do the same.
>> 
>> 2. CR3 switching already imposes a memory barrier, which eliminates most of
>> the cost of implementing such scheme which requires something which is
>> similar to:
>> 
>>        write new context (kernel/user)
>>        mb();
>>        if (need_flush) flush;
>> 
>> I do agree that PTI addresses (2), but there is another problem. A
>> reasonable implementation would store in a per-cpu state whether each CPU is
>> in user/kernel, and the TLB shootdown initiator CPU would check the state to
>> decide whether an IPI is needed. This means that pretty much every TLB
>> shutdown would incur a cache-miss per-target CPU. This might cause
>> performance regressions, at least in some cases.
> 
> We already more or less do this: we have mm_cpumask(), which is
> particularly awful since it writes to a falsely-shared line for each
> context switch.

> For what it's worth, in some sense, your patch series is reinventing
> the tracking that is already in cpu_tlbstate -- when we do a flush on
> one mm and some cpu is running another mm, we don't do an IPI
> shootdown -- instead we set flags so that it will be flushed the next
> time it's used.  Maybe we could actually refactor this so we only have
> one copy of this code that handles all the various deferred flush
> variants.  Perhaps each tracked mm context could have a user
> tlb_gen_id and a kernel tlb_gen_id.  I guess one thing that makes this
> nasty is that we need to flush the kernel PCID for kernel *and* user
> invalidations.  Sigh.

Sorry for the late response - I was feeling under the weather.

There is a tradeoff between how often the state changes and how often it is
being checked. So actually, with this patch-set, we have three indications
of deferred TLB flushes:

1. mm_cpumask(), since mm changes infrequently

2. "is_lazy", which changes frequently, making per-cpu cacheline checks more
efficient than (1).

3. Deferred-PTI, which is only updated locally. 

This patch-set only introduces (3). Your suggestion, IIUC, is to somehow
combine (1) and (2), which I suspect might introduce some performance
regressions. Changing a cpumask, or even writing to a cacheline on *every*
kernel entry/exit can induce overheads (in the latter case, when the
shootdown initiator checks whether the flush can be deferred).

IOW, deferring remote TLB shootdowns is hard since it can induce some
overheads. Deferring local TLB flushes (or those initiated by a remote
CPU, after the IPI was received) is easy. I deferred only the user
page-table flushes. If you want, I can try to extend it to all user flushes.
This would introduce some small overheads (a check before each uaccess) and
small gains. This local deferral is inapplicable to kernel TLB flushes, of
course.

Let me know what you think.


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH 0/3] x86/mm/tlb: Defer TLB flushes with PTI
  2019-08-29 17:23       ` Nadav Amit
@ 2019-09-03 21:33         ` Andy Lutomirski
  0 siblings, 0 replies; 10+ messages in thread
From: Andy Lutomirski @ 2019-09-03 21:33 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andy Lutomirski, Dave Hansen, X86 ML, LKML, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar

On Thu, Aug 29, 2019 at 10:24 AM Nadav Amit <namit@vmware.com> wrote:
>
> > On Aug 27, 2019, at 5:30 PM, Andy Lutomirski <luto@kernel.org> wrote:
> >
> > On Tue, Aug 27, 2019 at 4:52 PM Nadav Amit <namit@vmware.com> wrote:
> >>> On Aug 27, 2019, at 4:18 PM, Andy Lutomirski <luto@kernel.org> wrote:
> >>>
> >>> On Fri, Aug 23, 2019 at 11:07 PM Nadav Amit <namit@vmware.com> wrote:
> >>>> INVPCID is considerably slower than INVLPG of a single PTE, but it is
> >>>> currently used to flush PTEs in the user page-table when PTI is used.
> >>>>
> >>>> Instead, it is possible to defer TLB flushes until after the user
> >>>> page-tables are loaded. Preventing speculation over the TLB flushes
> >>>> should keep the whole thing safe. In some cases, deferring TLB flushes
> >>>> in such a way can result in more full TLB flushes, but arguably this
> >>>> behavior is oftentimes beneficial.
> >>>
> >>> I have a somewhat horrible suggestion.
> >>>
> >>> Would it make sense to refactor this so that it works for user *and*
> >>> kernel tables?  In particular, if we flush a *kernel* mapping (vfree,
> >>> vunmap, set_memory_ro, etc), we shouldn't need to send an IPI to a
> >>> task that is running user code to flush most kernel mappings or even
> >>> to free kernel pagetables.  The same trick could be done if we treat
> >>> idle like user mode for this purpose.
> >>>
> >>> In code, this could mostly consist of changing all the "user" data
> >>> structures involved to something like struct deferred_flush_info and
> >>> having one for user and one for kernel.
> >>>
> >>> I think this is horrible because it will enable certain workloads to
> >>> work considerably faster with PTI on than with PTI off, and that would
> >>> be a barely excusable moral failing. :-p
> >>>
> >>> For what it's worth, other than register clobber issues, the whole
> >>> "switch CR3 for PTI" logic ought to be doable in C.  I don't know a
> >>> priori whether that would end up being an improvement.
> >>
> >> I implemented (and have not yet sent) another TLB deferring mechanism. It is
> >> intended for user mappings and not kernel one, but I think your suggestion
> >> shares some similar underlying rationale, and therefore challenges and
> >> solutions. Let me rephrase what you say to ensure we are on the same page.
> >>
> >> The basic idea is context-tracking to check whether each CPU is in kernel or
> >> user mode. Accordingly, TLB flushes can be deferred, but I don’t see that
> >> this solution is limited to PTI. There are 2 possible reasons, according to
> >> my understanding, that you limit the discussion to PTI:
> >>
> >> 1. PTI provides clear boundaries when user and kernel mappings are used. I
> >> am not sure that privilege-levels (and SMAP) do not do the same.
> >>
> >> 2. CR3 switching already imposes a memory barrier, which eliminates most of
> >> the cost of implementing such scheme which requires something which is
> >> similar to:
> >>
> >>        write new context (kernel/user)
> >>        mb();
> >>        if (need_flush) flush;
> >>
> >> I do agree that PTI addresses (2), but there is another problem. A
> >> reasonable implementation would store in a per-cpu state whether each CPU is
> >> in user/kernel, and the TLB shootdown initiator CPU would check the state to
> >> decide whether an IPI is needed. This means that pretty much every TLB
> >> shutdown would incur a cache-miss per-target CPU. This might cause
> >> performance regressions, at least in some cases.
> >
> > We already more or less do this: we have mm_cpumask(), which is
> > particularly awful since it writes to a falsely-shared line for each
> > context switch.
>
> > For what it's worth, in some sense, your patch series is reinventing
> > the tracking that is already in cpu_tlbstate -- when we do a flush on
> > one mm and some cpu is running another mm, we don't do an IPI
> > shootdown -- instead we set flags so that it will be flushed the next
> > time it's used.  Maybe we could actually refactor this so we only have
> > one copy of this code that handles all the various deferred flush
> > variants.  Perhaps each tracked mm context could have a user
> > tlb_gen_id and a kernel tlb_gen_id.  I guess one thing that makes this
> > nasty is that we need to flush the kernel PCID for kernel *and* user
> > invalidations.  Sigh.
>
> Sorry for the late response - I was feeling under the weather.
>
> There is a tradeoff between how often the state changes and how often it is
> being checked. So actually, with this patch-set, we have three indications
> of deferred TLB flushes:
>
> 1. mm_cpumask(), since mm changes infrequently
>
> 2. “is_lazy", which changes frequently, making per-cpu cacheline checks more
> efficient than (1).
>
> 3. Deferred-PTI, which is only updated locally.
>
> This patch-set only introduces (3). Your suggestion, IIUC, is to somehow
> combine (1) and (2), which I suspect might introduce some performance
> regressions. Changing a cpumask, or even writing to a cacheline on *every*
> kernel entry/exit can induce overheads (in the latter case, when the
> shootdown initiator checks whether the flush can be deferred).

Hmm.  It's entirely possible that my idea wasn't so good.  Although
mm_cpumask() writes really are a problem in some workloads.  Rik has
benchmarked this.

My thought is that, *maybe*, writing to a percpu cacheline on kernel
entry and exit is cheap enough that it will make up for itself in the
ability to avoid some IPIs.  Writing to mm_cpumask() on each entry
would be horrible.

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2019-09-03 21:33 UTC | newest]

Thread overview: 10+ messages
2019-08-23 22:46 [RFC PATCH 0/3] x86/mm/tlb: Defer TLB flushes with PTI Nadav Amit
2019-08-23 22:46 ` [RFC PATCH 1/3] x86/mm/tlb: Defer PTI flushes Nadav Amit
2019-08-23 22:46 ` [RFC PATCH 2/3] x86/mm/tlb: Avoid deferring PTI flushes on shootdown Nadav Amit
2019-08-23 22:46 ` [RFC PATCH 3/3] x86/mm/tlb: Use lockdep irq assertions Nadav Amit
2019-08-24  6:09 ` [RFC PATCH 0/3] x86/mm/tlb: Defer TLB flushes with PTI Nadav Amit
2019-08-27 23:18 ` Andy Lutomirski
2019-08-27 23:52   ` Nadav Amit
2019-08-28  0:30     ` Andy Lutomirski
2019-08-29 17:23       ` Nadav Amit
2019-09-03 21:33         ` Andy Lutomirski
