* [RFC PATCH v2 00/12] x86: Flush remote TLBs concurrently and async
@ 2019-05-31  6:36 Nadav Amit
  2019-05-31  6:36 ` [RFC PATCH v2 01/12] smp: Remove smp_call_function() and on_each_cpu() return values Nadav Amit
                   ` (11 more replies)
  0 siblings, 12 replies; 36+ messages in thread
From: Nadav Amit @ 2019-05-31  6:36 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski
  Cc: Borislav Petkov, Dave Hansen, Ingo Molnar, Thomas Gleixner, x86,
	linux-kernel, Nadav Amit

Currently, local and remote TLB flushes are not performed concurrently,
which introduces unnecessary overhead - each INVLPG can take 100s of
cycles. This patch-set allows TLB flushes to be run concurrently: first
request the remote CPUs to initiate the flush, then run it locally, and
finally wait for the remote CPUs to finish their work.
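
In rough pseudo-code, the change is the following (the helper names here
are only illustrative; the real implementation is the reworked
__smp_call_function_many() in patch 2):

	/* Before: send IPIs, wait for the remote CPUs, then flush locally. */
	/* After - the helpers below are hypothetical, for illustration:    */
	send_flush_ipis(remote_cpus, info);	/* 1. kick remote CPUs, no wait */
	flush_tlb_local(info);			/* 2. flush locally meanwhile   */
	wait_for_remote_flushes(remote_cpus);	/* 3. only now wait for them    */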

In addition, this patch-set performs remote TLB shootdowns of userspace
mappings asynchronously. It inlines the information that describes what
to flush into the SMP call data (which also reduces cache coherency
traffic), and the initiator continues to run once the remote CPUs have
received the IPI, before the actual flush is performed. This is only
possible if page tables were not removed, as otherwise speculative
page-walks might lead to machine checks. Access to userspace is
prevented from NMI handlers and kprobes until the flush is actually
carried out. Arguably, this should be safe.

Finally, there are various small optimizations to avoid unwarranted
false-sharing and atomic operations.

The proposed changes should also improve the performance of other
invocations of on_each_cpu(). Hopefully, no one has relied on the
previous behavior of on_each_cpu(), in which the function was first
executed remotely and only then locally.

On my Haswell machine (bare-metal), a TLB flush microbenchmark
(MADV_DONTNEED/touch of a single page in a loop on one thread, while the
other threads run a busy-wait loop) takes the following time (ns per
iteration) on the thread that invokes the TLB shootdown:

	n_threads	before		after
	---------	------		-----
	1		661		663
	2		1436		923 (-36%)
	4		1571		1070 (-32%)

Note that since the benchmark also causes page-faults, the actual
speedup of the TLB shootdowns themselves is greater. Also note the
larger improvement with 2 threads (a single remote TLB flush target).
This seems to be a side-effect of keeping the synchronization data
structures (csd) off the stack, unlike the current code, which places
them on the stack (in smp_call_function_single()).
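
For reference, a simplified sketch of such a microbenchmark (this is an
illustration of the MADV_DONTNEED/touch loop described above, not the
exact harness that produced the numbers; CPU pinning and error handling
are omitted):

	#include <pthread.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <sys/mman.h>
	#include <time.h>

	#define ITERS	100000UL

	static volatile int go = 1;

	/* Secondary threads just spin so that their CPUs remain in the
	 * mm's cpumask and become TLB shootdown targets. */
	static void *busy_wait(void *arg)
	{
		(void)arg;
		while (go)
			;
		return NULL;
	}

	int main(int argc, char **argv)
	{
		int i, n_threads = argc > 1 ? atoi(argv[1]) : 1;
		pthread_t tids[64];
		struct timespec t0, t1;
		unsigned long it;
		char *page;

		page = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		for (i = 0; i < n_threads - 1; i++)
			pthread_create(&tids[i], NULL, busy_wait, NULL);

		clock_gettime(CLOCK_MONOTONIC, &t0);
		for (it = 0; it < ITERS; it++) {
			/* Zapping the PTE triggers a TLB shootdown to the
			 * CPUs that run the other threads. */
			madvise(page, 4096, MADV_DONTNEED);
			/* Touch the page so it is faulted back in. */
			page[0] = (char)it;
		}
		clock_gettime(CLOCK_MONOTONIC, &t1);

		go = 0;
		for (i = 0; i < n_threads - 1; i++)
			pthread_join(tids[i], NULL);

		printf("%.0f ns/iteration\n",
		       ((t1.tv_sec - t0.tv_sec) * 1e9 +
			(t1.tv_nsec - t0.tv_nsec)) / ITERS);
		return 0;
	}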

This microbenchmark only measures the time of the thread that invokes
the TLB shootdown. Running sysbench on dax w/emulated-pmem:

  sysbench fileio --file-total-size=3G --file-test-mode=rndwr \
   --file-io-mode=mmap --threads=4 --file-fsync-mode=fdatasync run

Shows a performance improvement of 5% (607k vs 575k events).

Patch 1 is a small cleanup. Patches 3-5 implement the concurrent
execution of TLB flushes. Patches 6-8 deal with false-sharing and
unnecessary atomic operations. Patches 9-12 perform async TLB flushes
when possible and inline the data in the IPIs.

RFC v1 -> RFC v2:
- Removed a patch which did not improve performance
- Patches 6-8: false-sharing and atomic operation optimizations
- Patches 9-12: asynchronous TLB flushes

Nadav Amit (12):
  smp: Remove smp_call_function() and on_each_cpu() return values
  smp: Run functions concurrently in smp_call_function_many()
  x86/mm/tlb: Refactor common code into flush_tlb_on_cpus()
  x86/mm/tlb: Flush remote and local TLBs concurrently
  x86/mm/tlb: Optimize local TLB flushes
  KVM: x86: Provide paravirtualized flush_tlb_multi()
  smp: Do not mark call_function_data as shared
  x86/tlb: Privatize cpu_tlbstate
  x86/apic: Use non-atomic operations when possible
  smp: Enable data inlining for inter-processor function call
  x86/mm/tlb: Use async and inline messages for flushing
  x86/mm/tlb: Reverting the removal of flush_tlb_info from stack

 arch/alpha/kernel/smp.c               |  19 +--
 arch/alpha/oprofile/common.c          |   6 +-
 arch/ia64/kernel/perfmon.c            |  12 +-
 arch/ia64/kernel/uncached.c           |   8 +-
 arch/x86/hyperv/mmu.c                 |   2 +
 arch/x86/include/asm/paravirt.h       |   8 +
 arch/x86/include/asm/paravirt_types.h |   6 +
 arch/x86/include/asm/tlbflush.h       |  58 ++++---
 arch/x86/kernel/apic/apic_flat_64.c   |   4 +-
 arch/x86/kernel/apic/x2apic_cluster.c |   2 +-
 arch/x86/kernel/kvm.c                 |  11 +-
 arch/x86/kernel/paravirt.c            |   3 +
 arch/x86/kernel/smp.c                 |   2 +-
 arch/x86/lib/cache-smp.c              |   3 +-
 arch/x86/mm/init.c                    |   2 +-
 arch/x86/mm/tlb.c                     | 204 +++++++++++++++++--------
 arch/x86/xen/mmu_pv.c                 |   2 +
 drivers/char/agp/generic.c            |   3 +-
 include/linux/smp.h                   |  45 ++++--
 kernel/smp.c                          | 210 ++++++++++++++++----------
 kernel/up.c                           |   3 +-
 21 files changed, 397 insertions(+), 216 deletions(-)

-- 
2.20.1


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [RFC PATCH v2 01/12] smp: Remove smp_call_function() and on_each_cpu() return values
  2019-05-31  6:36 [RFC PATCH v2 00/12] x86: Flush remote TLBs concurrently and async Nadav Amit
@ 2019-05-31  6:36 ` Nadav Amit
  2019-05-31  6:36 ` [RFC PATCH v2 02/12] smp: Run functions concurrently in smp_call_function_many() Nadav Amit
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 36+ messages in thread
From: Nadav Amit @ 2019-05-31  6:36 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski
  Cc: Borislav Petkov, Dave Hansen, Ingo Molnar, Thomas Gleixner, x86,
	linux-kernel, Nadav Amit, Richard Henderson, Ivan Kokshaysky,
	Matt Turner, Tony Luck, Fenghua Yu, Andrew Morton

The return value of these functions is always zero. Remove it and amend
the callers.

Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/alpha/kernel/smp.c      | 19 +++++--------------
 arch/alpha/oprofile/common.c |  6 +++---
 arch/ia64/kernel/perfmon.c   | 12 ++----------
 arch/ia64/kernel/uncached.c  |  8 ++++----
 arch/x86/lib/cache-smp.c     |  3 ++-
 drivers/char/agp/generic.c   |  3 +--
 include/linux/smp.h          |  7 +++----
 kernel/smp.c                 | 10 +++-------
 kernel/up.c                  |  3 +--
 9 files changed, 24 insertions(+), 47 deletions(-)

diff --git a/arch/alpha/kernel/smp.c b/arch/alpha/kernel/smp.c
index d0dccae53ba9..5f90df30be20 100644
--- a/arch/alpha/kernel/smp.c
+++ b/arch/alpha/kernel/smp.c
@@ -614,8 +614,7 @@ void
 smp_imb(void)
 {
 	/* Must wait other processors to flush their icache before continue. */
-	if (on_each_cpu(ipi_imb, NULL, 1))
-		printk(KERN_CRIT "smp_imb: timed out\n");
+	on_each_cpu(ipi_imb, NULL, 1);
 }
 EXPORT_SYMBOL(smp_imb);
 
@@ -630,9 +629,7 @@ flush_tlb_all(void)
 {
 	/* Although we don't have any data to pass, we do want to
 	   synchronize with the other processors.  */
-	if (on_each_cpu(ipi_flush_tlb_all, NULL, 1)) {
-		printk(KERN_CRIT "flush_tlb_all: timed out\n");
-	}
+	on_each_cpu(ipi_flush_tlb_all, NULL, 1);
 }
 
 #define asn_locked() (cpu_data[smp_processor_id()].asn_lock)
@@ -667,9 +664,7 @@ flush_tlb_mm(struct mm_struct *mm)
 		}
 	}
 
-	if (smp_call_function(ipi_flush_tlb_mm, mm, 1)) {
-		printk(KERN_CRIT "flush_tlb_mm: timed out\n");
-	}
+	smp_call_function(ipi_flush_tlb_mm, mm, 1);
 
 	preempt_enable();
 }
@@ -720,9 +715,7 @@ flush_tlb_page(struct vm_area_struct *vma, unsigned long addr)
 	data.mm = mm;
 	data.addr = addr;
 
-	if (smp_call_function(ipi_flush_tlb_page, &data, 1)) {
-		printk(KERN_CRIT "flush_tlb_page: timed out\n");
-	}
+	smp_call_function(ipi_flush_tlb_page, &data, 1);
 
 	preempt_enable();
 }
@@ -772,9 +765,7 @@ flush_icache_user_range(struct vm_area_struct *vma, struct page *page,
 		}
 	}
 
-	if (smp_call_function(ipi_flush_icache_page, mm, 1)) {
-		printk(KERN_CRIT "flush_icache_page: timed out\n");
-	}
+	smp_call_function(ipi_flush_icache_page, mm, 1);
 
 	preempt_enable();
 }
diff --git a/arch/alpha/oprofile/common.c b/arch/alpha/oprofile/common.c
index 310a4ce1dccc..1b1259c7d7d1 100644
--- a/arch/alpha/oprofile/common.c
+++ b/arch/alpha/oprofile/common.c
@@ -65,7 +65,7 @@ op_axp_setup(void)
 	model->reg_setup(&reg, ctr, &sys);
 
 	/* Configure the registers on all cpus.  */
-	(void)smp_call_function(model->cpu_setup, &reg, 1);
+	smp_call_function(model->cpu_setup, &reg, 1);
 	model->cpu_setup(&reg);
 	return 0;
 }
@@ -86,7 +86,7 @@ op_axp_cpu_start(void *dummy)
 static int
 op_axp_start(void)
 {
-	(void)smp_call_function(op_axp_cpu_start, NULL, 1);
+	smp_call_function(op_axp_cpu_start, NULL, 1);
 	op_axp_cpu_start(NULL);
 	return 0;
 }
@@ -101,7 +101,7 @@ op_axp_cpu_stop(void *dummy)
 static void
 op_axp_stop(void)
 {
-	(void)smp_call_function(op_axp_cpu_stop, NULL, 1);
+	smp_call_function(op_axp_cpu_stop, NULL, 1);
 	op_axp_cpu_stop(NULL);
 }
 
diff --git a/arch/ia64/kernel/perfmon.c b/arch/ia64/kernel/perfmon.c
index 7a969f4c3534..6cf7db1daba2 100644
--- a/arch/ia64/kernel/perfmon.c
+++ b/arch/ia64/kernel/perfmon.c
@@ -6389,11 +6389,7 @@ pfm_install_alt_pmu_interrupt(pfm_intr_handler_desc_t *hdl)
 	}
 
 	/* save the current system wide pmu states */
-	ret = on_each_cpu(pfm_alt_save_pmu_state, NULL, 1);
-	if (ret) {
-		DPRINT(("on_each_cpu() failed: %d\n", ret));
-		goto cleanup_reserve;
-	}
+	on_each_cpu(pfm_alt_save_pmu_state, NULL, 1);
 
 	/* officially change to the alternate interrupt handler */
 	pfm_alt_intr_handler = hdl;
@@ -6420,7 +6416,6 @@ int
 pfm_remove_alt_pmu_interrupt(pfm_intr_handler_desc_t *hdl)
 {
 	int i;
-	int ret;
 
 	if (hdl == NULL) return -EINVAL;
 
@@ -6434,10 +6429,7 @@ pfm_remove_alt_pmu_interrupt(pfm_intr_handler_desc_t *hdl)
 
 	pfm_alt_intr_handler = NULL;
 
-	ret = on_each_cpu(pfm_alt_restore_pmu_state, NULL, 1);
-	if (ret) {
-		DPRINT(("on_each_cpu() failed: %d\n", ret));
-	}
+	on_each_cpu(pfm_alt_restore_pmu_state, NULL, 1);
 
 	for_each_online_cpu(i) {
 		pfm_unreserve_session(NULL, 1, i);
diff --git a/arch/ia64/kernel/uncached.c b/arch/ia64/kernel/uncached.c
index 583f7ff6b589..c618d0745e22 100644
--- a/arch/ia64/kernel/uncached.c
+++ b/arch/ia64/kernel/uncached.c
@@ -124,8 +124,8 @@ static int uncached_add_chunk(struct uncached_pool *uc_pool, int nid)
 	status = ia64_pal_prefetch_visibility(PAL_VISIBILITY_PHYSICAL);
 	if (status == PAL_VISIBILITY_OK_REMOTE_NEEDED) {
 		atomic_set(&uc_pool->status, 0);
-		status = smp_call_function(uncached_ipi_visibility, uc_pool, 1);
-		if (status || atomic_read(&uc_pool->status))
+		smp_call_function(uncached_ipi_visibility, uc_pool, 1);
+		if (atomic_read(&uc_pool->status))
 			goto failed;
 	} else if (status != PAL_VISIBILITY_OK)
 		goto failed;
@@ -146,8 +146,8 @@ static int uncached_add_chunk(struct uncached_pool *uc_pool, int nid)
 	if (status != PAL_STATUS_SUCCESS)
 		goto failed;
 	atomic_set(&uc_pool->status, 0);
-	status = smp_call_function(uncached_ipi_mc_drain, uc_pool, 1);
-	if (status || atomic_read(&uc_pool->status))
+	smp_call_function(uncached_ipi_mc_drain, uc_pool, 1);
+	if (atomic_read(&uc_pool->status))
 		goto failed;
 
 	/*
diff --git a/arch/x86/lib/cache-smp.c b/arch/x86/lib/cache-smp.c
index 1811fa4a1b1a..7c48ff4ae8d1 100644
--- a/arch/x86/lib/cache-smp.c
+++ b/arch/x86/lib/cache-smp.c
@@ -15,6 +15,7 @@ EXPORT_SYMBOL(wbinvd_on_cpu);
 
 int wbinvd_on_all_cpus(void)
 {
-	return on_each_cpu(__wbinvd, NULL, 1);
+	on_each_cpu(__wbinvd, NULL, 1);
+	return 0;
 }
 EXPORT_SYMBOL(wbinvd_on_all_cpus);
diff --git a/drivers/char/agp/generic.c b/drivers/char/agp/generic.c
index 658664a5a5aa..df1edb5ec0ad 100644
--- a/drivers/char/agp/generic.c
+++ b/drivers/char/agp/generic.c
@@ -1311,8 +1311,7 @@ static void ipi_handler(void *null)
 
 void global_cache_flush(void)
 {
-	if (on_each_cpu(ipi_handler, NULL, 1) != 0)
-		panic(PFX "timed out waiting for the other CPUs!\n");
+	on_each_cpu(ipi_handler, NULL, 1);
 }
 EXPORT_SYMBOL(global_cache_flush);
 
diff --git a/include/linux/smp.h b/include/linux/smp.h
index a56f08ff3097..bb8b451ab01f 100644
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -35,7 +35,7 @@ int smp_call_function_single(int cpuid, smp_call_func_t func, void *info,
 /*
  * Call a function on all processors
  */
-int on_each_cpu(smp_call_func_t func, void *info, int wait);
+void on_each_cpu(smp_call_func_t func, void *info, int wait);
 
 /*
  * Call a function on processors specified by mask, which might include
@@ -101,7 +101,7 @@ extern void smp_cpus_done(unsigned int max_cpus);
 /*
  * Call a function on all other processors
  */
-int smp_call_function(smp_call_func_t func, void *info, int wait);
+void smp_call_function(smp_call_func_t func, void *info, int wait);
 void smp_call_function_many(const struct cpumask *mask,
 			    smp_call_func_t func, void *info, bool wait);
 
@@ -144,9 +144,8 @@ static inline void smp_send_stop(void) { }
  *	These macros fold the SMP functionality into a single CPU system
  */
 #define raw_smp_processor_id()			0
-static inline int up_smp_call_function(smp_call_func_t func, void *info)
+static inline void up_smp_call_function(smp_call_func_t func, void *info)
 {
-	return 0;
 }
 #define smp_call_function(func, info, wait) \
 			(up_smp_call_function(func, info))
diff --git a/kernel/smp.c b/kernel/smp.c
index f4cf1b0bb3b8..e0b05ac53108 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -486,13 +486,11 @@ EXPORT_SYMBOL(smp_call_function_many);
  * You must not call this function with disabled interrupts or from a
  * hardware interrupt handler or from a bottom half handler.
  */
-int smp_call_function(smp_call_func_t func, void *info, int wait)
+void smp_call_function(smp_call_func_t func, void *info, int wait)
 {
 	preempt_disable();
 	smp_call_function_many(cpu_online_mask, func, info, wait);
 	preempt_enable();
-
-	return 0;
 }
 EXPORT_SYMBOL(smp_call_function);
 
@@ -593,18 +591,16 @@ void __init smp_init(void)
  * early_boot_irqs_disabled is set.  Use local_irq_save/restore() instead
  * of local_irq_disable/enable().
  */
-int on_each_cpu(void (*func) (void *info), void *info, int wait)
+void on_each_cpu(void (*func) (void *info), void *info, int wait)
 {
 	unsigned long flags;
-	int ret = 0;
 
 	preempt_disable();
-	ret = smp_call_function(func, info, wait);
+	smp_call_function(func, info, wait);
 	local_irq_save(flags);
 	func(info);
 	local_irq_restore(flags);
 	preempt_enable();
-	return ret;
 }
 EXPORT_SYMBOL(on_each_cpu);
 
diff --git a/kernel/up.c b/kernel/up.c
index ff536f9cc8a2..378551e79894 100644
--- a/kernel/up.c
+++ b/kernel/up.c
@@ -34,14 +34,13 @@ int smp_call_function_single_async(int cpu, call_single_data_t *csd)
 }
 EXPORT_SYMBOL(smp_call_function_single_async);
 
-int on_each_cpu(smp_call_func_t func, void *info, int wait)
+void on_each_cpu(smp_call_func_t func, void *info, int wait)
 {
 	unsigned long flags;
 
 	local_irq_save(flags);
 	func(info);
 	local_irq_restore(flags);
-	return 0;
 }
 EXPORT_SYMBOL(on_each_cpu);
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [RFC PATCH v2 02/12] smp: Run functions concurrently in smp_call_function_many()
  2019-05-31  6:36 [RFC PATCH v2 00/12] x86: Flush remote TLBs concurrently and async Nadav Amit
  2019-05-31  6:36 ` [RFC PATCH v2 01/12] smp: Remove smp_call_function() and on_each_cpu() return values Nadav Amit
@ 2019-05-31  6:36 ` Nadav Amit
  2019-05-31  6:36 ` [RFC PATCH v2 03/12] x86/mm/tlb: Refactor common code into flush_tlb_on_cpus() Nadav Amit
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 36+ messages in thread
From: Nadav Amit @ 2019-05-31  6:36 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski
  Cc: Borislav Petkov, Dave Hansen, Ingo Molnar, Thomas Gleixner, x86,
	linux-kernel, Nadav Amit, Dave Hansen, Rik van Riel,
	Josh Poimboeuf

Currently, on_each_cpu() and similar functions do not exploit the
potential of concurrency: the function is first executed remotely and
only then locally. Functions such as a TLB flush can take considerable
time, so this provides an opportunity for performance optimization.

To do so, introduce __smp_call_function_many(), which allows the callers
to provide separate local and remote functions that should be executed,
and runs them concurrently. Keep the semantics of
smp_call_function_many() as they are today for backward compatibility:
in that case the called function is not executed locally.

__smp_call_function_many() does not use the optimized path for a single
remote target that smp_call_function_single() implements. For a
synchronous function call, smp_call_function_single() keeps the
call_single_data (which is used for synchronization) on the stack.
Interestingly, it seems that not using this optimization provides a
greater performance improvement (greater speedup with a single remote
target than with multiple ones). Presumably, holding data structures
that are intended for synchronization on the stack can introduce
overheads due to TLB misses and false-sharing when the stack is used
for other purposes.

Adding support to run the functions concurrently required removing a
micro-optimization in on_each_cpu() that disabled/enabled IRQs instead
of saving/restoring them. The benefit of running the local and remote
code concurrently is expected to be greater.
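
With the new interface, a caller that wants a function to run on all the
CPUs in a mask, local and remote concurrently, does roughly the
following (a sketch based on the on_each_cpu_mask() conversion in this
patch; remote_func, local_func and info are placeholders):

	preempt_disable();
	/* remote_func runs on the other CPUs while local_func runs here. */
	__smp_call_function_many(mask, remote_func, local_func, info, true);
	preempt_enable();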

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 include/linux/smp.h |  27 ++++++---
 kernel/smp.c        | 133 +++++++++++++++++++++-----------------------
 2 files changed, 83 insertions(+), 77 deletions(-)

diff --git a/include/linux/smp.h b/include/linux/smp.h
index bb8b451ab01f..b69abc88259d 100644
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -32,11 +32,6 @@ extern unsigned int total_cpus;
 int smp_call_function_single(int cpuid, smp_call_func_t func, void *info,
 			     int wait);
 
-/*
- * Call a function on all processors
- */
-void on_each_cpu(smp_call_func_t func, void *info, int wait);
-
 /*
  * Call a function on processors specified by mask, which might include
  * the local one.
@@ -44,6 +39,15 @@ void on_each_cpu(smp_call_func_t func, void *info, int wait);
 void on_each_cpu_mask(const struct cpumask *mask, smp_call_func_t func,
 		void *info, bool wait);
 
+/*
+ * Call a function on all processors.  May be used during early boot while
+ * early_boot_irqs_disabled is set.
+ */
+static inline void on_each_cpu(smp_call_func_t func, void *info, int wait)
+{
+	on_each_cpu_mask(cpu_online_mask, func, info, wait);
+}
+
 /*
  * Call a function on each processor for which the supplied function
  * cond_func returns a positive value. This may include the local
@@ -102,8 +106,17 @@ extern void smp_cpus_done(unsigned int max_cpus);
  * Call a function on all other processors
  */
 void smp_call_function(smp_call_func_t func, void *info, int wait);
-void smp_call_function_many(const struct cpumask *mask,
-			    smp_call_func_t func, void *info, bool wait);
+
+void __smp_call_function_many(const struct cpumask *mask,
+			      smp_call_func_t remote_func,
+			      smp_call_func_t local_func,
+			      void *info, bool wait);
+
+static inline void smp_call_function_many(const struct cpumask *mask,
+				smp_call_func_t func, void *info, bool wait)
+{
+	__smp_call_function_many(mask, func, NULL, info, wait);
+}
 
 int smp_call_function_any(const struct cpumask *mask,
 			  smp_call_func_t func, void *info, int wait);
diff --git a/kernel/smp.c b/kernel/smp.c
index e0b05ac53108..6b411ee86ef6 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -387,9 +387,13 @@ int smp_call_function_any(const struct cpumask *mask,
 EXPORT_SYMBOL_GPL(smp_call_function_any);
 
 /**
- * smp_call_function_many(): Run a function on a set of other CPUs.
+ * __smp_call_function_many(): Run a function on a set of CPUs.
  * @mask: The set of cpus to run on (only runs on online subset).
- * @func: The function to run. This must be fast and non-blocking.
+ * @remote_func: The function to run on remote cores. This must be fast and
+ *		 non-blocking.
+ * @local_func: The function that should be run on this CPU. This must be
+ *		fast and non-blocking. If NULL is provided, no function will
+ *		be executed on this CPU.
  * @info: An arbitrary pointer to pass to the function.
  * @wait: If true, wait (atomically) until function has completed
  *        on other CPUs.
@@ -400,11 +404,16 @@ EXPORT_SYMBOL_GPL(smp_call_function_any);
  * hardware interrupt handler or from a bottom half handler. Preemption
  * must be disabled when calling this function.
  */
-void smp_call_function_many(const struct cpumask *mask,
-			    smp_call_func_t func, void *info, bool wait)
+void __smp_call_function_many(const struct cpumask *mask,
+			      smp_call_func_t remote_func,
+			      smp_call_func_t local_func,
+			      void *info, bool wait)
 {
+	int cpu, last_cpu, this_cpu = smp_processor_id();
 	struct call_function_data *cfd;
-	int cpu, next_cpu, this_cpu = smp_processor_id();
+	bool run_remote = false;
+	bool run_local = false;
+	int nr_cpus = 0;
 
 	/*
 	 * Can deadlock when called with interrupts disabled.
@@ -412,55 +421,62 @@ void smp_call_function_many(const struct cpumask *mask,
 	 * send smp call function interrupt to this cpu and as such deadlocks
 	 * can't happen.
 	 */
-	WARN_ON_ONCE(cpu_online(this_cpu) && irqs_disabled()
-		     && !oops_in_progress && !early_boot_irqs_disabled);
+	if (cpu_online(this_cpu) && !oops_in_progress && !early_boot_irqs_disabled)
+		lockdep_assert_irqs_enabled();
+
+	/* Check if we need local execution. */
+	if (local_func && cpumask_test_cpu(this_cpu, mask))
+		run_local = true;
 
-	/* Try to fastpath.  So, what's a CPU they want? Ignoring this one. */
+	/* Check if we need remote execution, i.e., any CPU excluding this one. */
 	cpu = cpumask_first_and(mask, cpu_online_mask);
 	if (cpu == this_cpu)
 		cpu = cpumask_next_and(cpu, mask, cpu_online_mask);
+	if (cpu < nr_cpu_ids)
+		run_remote = true;
 
-	/* No online cpus?  We're done. */
-	if (cpu >= nr_cpu_ids)
-		return;
+	if (run_remote) {
+		cfd = this_cpu_ptr(&cfd_data);
 
-	/* Do we have another CPU which isn't us? */
-	next_cpu = cpumask_next_and(cpu, mask, cpu_online_mask);
-	if (next_cpu == this_cpu)
-		next_cpu = cpumask_next_and(next_cpu, mask, cpu_online_mask);
-
-	/* Fastpath: do that cpu by itself. */
-	if (next_cpu >= nr_cpu_ids) {
-		smp_call_function_single(cpu, func, info, wait);
-		return;
-	}
+		cpumask_and(cfd->cpumask, mask, cpu_online_mask);
+		__cpumask_clear_cpu(this_cpu, cfd->cpumask);
 
-	cfd = this_cpu_ptr(&cfd_data);
-
-	cpumask_and(cfd->cpumask, mask, cpu_online_mask);
-	__cpumask_clear_cpu(this_cpu, cfd->cpumask);
+		cpumask_clear(cfd->cpumask_ipi);
+		for_each_cpu(cpu, cfd->cpumask) {
+			call_single_data_t *csd = per_cpu_ptr(cfd->csd, cpu);
+
+			nr_cpus++;
+			last_cpu = cpu;
+
+			csd_lock(csd);
+			if (wait)
+				csd->flags |= CSD_FLAG_SYNCHRONOUS;
+			csd->func = remote_func;
+			csd->info = info;
+			if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
+				__cpumask_set_cpu(cpu, cfd->cpumask_ipi);
+		}
 
-	/* Some callers race with other cpus changing the passed mask */
-	if (unlikely(!cpumask_weight(cfd->cpumask)))
-		return;
+		/*
+		 * Choose the most efficient way to send an IPI. Note that the
+		 * number of CPUs might be zero due to concurrent changes to the
+		 * provided mask.
+		 */
+		if (nr_cpus == 1)
+			arch_send_call_function_single_ipi(last_cpu);
+		else if (likely(nr_cpus > 1))
+			arch_send_call_function_ipi_mask(cfd->cpumask_ipi);
+	}
 
-	cpumask_clear(cfd->cpumask_ipi);
-	for_each_cpu(cpu, cfd->cpumask) {
-		call_single_data_t *csd = per_cpu_ptr(cfd->csd, cpu);
+	if (run_local) {
+		unsigned long flags;
 
-		csd_lock(csd);
-		if (wait)
-			csd->flags |= CSD_FLAG_SYNCHRONOUS;
-		csd->func = func;
-		csd->info = info;
-		if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
-			__cpumask_set_cpu(cpu, cfd->cpumask_ipi);
+		local_irq_save(flags);
+		local_func(info);
+		local_irq_restore(flags);
 	}
 
-	/* Send a message to all CPUs in the map */
-	arch_send_call_function_ipi_mask(cfd->cpumask_ipi);
-
-	if (wait) {
+	if (run_remote && wait) {
 		for_each_cpu(cpu, cfd->cpumask) {
 			call_single_data_t *csd;
 
@@ -469,7 +485,7 @@ void smp_call_function_many(const struct cpumask *mask,
 		}
 	}
 }
-EXPORT_SYMBOL(smp_call_function_many);
+EXPORT_SYMBOL(__smp_call_function_many);
 
 /**
  * smp_call_function(): Run a function on all other CPUs.
@@ -586,24 +602,6 @@ void __init smp_init(void)
 	smp_cpus_done(setup_max_cpus);
 }
 
-/*
- * Call a function on all processors.  May be used during early boot while
- * early_boot_irqs_disabled is set.  Use local_irq_save/restore() instead
- * of local_irq_disable/enable().
- */
-void on_each_cpu(void (*func) (void *info), void *info, int wait)
-{
-	unsigned long flags;
-
-	preempt_disable();
-	smp_call_function(func, info, wait);
-	local_irq_save(flags);
-	func(info);
-	local_irq_restore(flags);
-	preempt_enable();
-}
-EXPORT_SYMBOL(on_each_cpu);
-
 /**
  * on_each_cpu_mask(): Run a function on processors specified by
  * cpumask, which may include the local processor.
@@ -623,16 +621,11 @@ EXPORT_SYMBOL(on_each_cpu);
 void on_each_cpu_mask(const struct cpumask *mask, smp_call_func_t func,
 			void *info, bool wait)
 {
-	int cpu = get_cpu();
+	preempt_disable();
 
-	smp_call_function_many(mask, func, info, wait);
-	if (cpumask_test_cpu(cpu, mask)) {
-		unsigned long flags;
-		local_irq_save(flags);
-		func(info);
-		local_irq_restore(flags);
-	}
-	put_cpu();
+	__smp_call_function_many(mask, func, func, info, wait);
+
+	preempt_enable();
 }
 EXPORT_SYMBOL(on_each_cpu_mask);
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [RFC PATCH v2 03/12] x86/mm/tlb: Refactor common code into flush_tlb_on_cpus()
  2019-05-31  6:36 [RFC PATCH v2 00/12] x86: Flush remote TLBs concurrently and async Nadav Amit
  2019-05-31  6:36 ` [RFC PATCH v2 01/12] smp: Remove smp_call_function() and on_each_cpu() return values Nadav Amit
  2019-05-31  6:36 ` [RFC PATCH v2 02/12] smp: Run functions concurrently in smp_call_function_many() Nadav Amit
@ 2019-05-31  6:36 ` Nadav Amit
  2019-05-31  6:36 ` [RFC PATCH v2 04/12] x86/mm/tlb: Flush remote and local TLBs concurrently Nadav Amit
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 36+ messages in thread
From: Nadav Amit @ 2019-05-31  6:36 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski
  Cc: Borislav Petkov, Dave Hansen, Ingo Molnar, Thomas Gleixner, x86,
	linux-kernel, Nadav Amit, Dave Hansen

arch_tlbbatch_flush() and flush_tlb_mm_range() have very similar code,
which is effectively the same. Extract the common code into a new
function, flush_tlb_on_cpus().

There is one functional change, which should not affect correctness:
flush_tlb_mm_range() compared loaded_mm and the given mm to figure out
whether a local flush is needed. Instead, the common code looks at
mm_cpumask(), which should give the same result. Performance should not
be affected, since this cpumask should not change at a frequency that
would introduce cache contention.

Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/mm/tlb.c | 62 ++++++++++++++++++++++++++---------------------
 1 file changed, 34 insertions(+), 28 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 7f61431c75fb..ac98ad76f695 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -733,7 +733,11 @@ static inline struct flush_tlb_info *get_flush_tlb_info(struct mm_struct *mm,
 			unsigned int stride_shift, bool freed_tables,
 			u64 new_tlb_gen)
 {
-	struct flush_tlb_info *info = this_cpu_ptr(&flush_tlb_info);
+	struct flush_tlb_info *info;
+
+	preempt_disable();
+
+	info = this_cpu_ptr(&flush_tlb_info);
 
 #ifdef CONFIG_DEBUG_VM
 	/*
@@ -761,6 +765,23 @@ static inline void put_flush_tlb_info(void)
 	barrier();
 	this_cpu_dec(flush_tlb_info_idx);
 #endif
+	preempt_enable();
+}
+
+static void flush_tlb_on_cpus(const cpumask_t *cpumask,
+			      const struct flush_tlb_info *info)
+{
+	int this_cpu = smp_processor_id();
+
+	if (cpumask_test_cpu(this_cpu, cpumask)) {
+		lockdep_assert_irqs_enabled();
+		local_irq_disable();
+		flush_tlb_func_local(info, TLB_LOCAL_MM_SHOOTDOWN);
+		local_irq_enable();
+	}
+
+	if (cpumask_any_but(cpumask, this_cpu) < nr_cpu_ids)
+		flush_tlb_others(cpumask, info);
 }
 
 void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
@@ -769,9 +790,6 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 {
 	struct flush_tlb_info *info;
 	u64 new_tlb_gen;
-	int cpu;
-
-	cpu = get_cpu();
 
 	/* Should we flush just the requested range? */
 	if ((end == TLB_FLUSH_ALL) ||
@@ -786,18 +804,18 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 	info = get_flush_tlb_info(mm, start, end, stride_shift, freed_tables,
 				  new_tlb_gen);
 
-	if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
-		lockdep_assert_irqs_enabled();
-		local_irq_disable();
-		flush_tlb_func_local(info, TLB_LOCAL_MM_SHOOTDOWN);
-		local_irq_enable();
-	}
+	/*
+	 * Assert that mm_cpumask() corresponds with the loaded mm. We got one
+	 * exception: for init_mm we do not need to flush anything, and the
+	 * cpumask does not correspond with loaded_mm.
+	 */
+	VM_WARN_ON_ONCE(cpumask_test_cpu(smp_processor_id(), mm_cpumask(mm)) !=
+			(mm == this_cpu_read(cpu_tlbstate.loaded_mm)) ||
+			mm == &init_mm);
 
-	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
-		flush_tlb_others(mm_cpumask(mm), info);
+	flush_tlb_on_cpus(mm_cpumask(mm), info);
 
 	put_flush_tlb_info();
-	put_cpu();
 }
 
 
@@ -832,13 +850,11 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end)
 	} else {
 		struct flush_tlb_info *info;
 
-		preempt_disable();
 		info = get_flush_tlb_info(NULL, start, end, 0, false, 0);
 
 		on_each_cpu(do_kernel_range_flush, info, 1);
 
 		put_flush_tlb_info();
-		preempt_enable();
 	}
 }
 
@@ -856,21 +872,11 @@ static const struct flush_tlb_info full_flush_tlb_info = {
 
 void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 {
-	int cpu = get_cpu();
-
-	if (cpumask_test_cpu(cpu, &batch->cpumask)) {
-		lockdep_assert_irqs_enabled();
-		local_irq_disable();
-		flush_tlb_func_local(&full_flush_tlb_info, TLB_LOCAL_SHOOTDOWN);
-		local_irq_enable();
-	}
-
-	if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids)
-		flush_tlb_others(&batch->cpumask, &full_flush_tlb_info);
+	preempt_disable();
+	flush_tlb_on_cpus(&batch->cpumask, &full_flush_tlb_info);
+	preempt_enable();
 
 	cpumask_clear(&batch->cpumask);
-
-	put_cpu();
 }
 
 static ssize_t tlbflush_read_file(struct file *file, char __user *user_buf,
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [RFC PATCH v2 04/12] x86/mm/tlb: Flush remote and local TLBs concurrently
  2019-05-31  6:36 [RFC PATCH v2 00/12] x86: Flush remote TLBs concurrently and async Nadav Amit
                   ` (2 preceding siblings ...)
  2019-05-31  6:36 ` [RFC PATCH v2 03/12] x86/mm/tlb: Refactor common code into flush_tlb_on_cpus() Nadav Amit
@ 2019-05-31  6:36 ` Nadav Amit
  2019-05-31 11:48   ` Juergen Gross
  2019-05-31  6:36 ` [RFC PATCH v2 05/12] x86/mm/tlb: Optimize local TLB flushes Nadav Amit
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 36+ messages in thread
From: Nadav Amit @ 2019-05-31  6:36 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski
  Cc: Borislav Petkov, Dave Hansen, Ingo Molnar, Thomas Gleixner, x86,
	linux-kernel, Nadav Amit, K. Y. Srinivasan, Haiyang Zhang,
	Stephen Hemminger, Sasha Levin, Juergen Gross, Paolo Bonzini,
	Dave Hansen, Boris Ostrovsky, linux-hyperv, virtualization, kvm,
	xen-devel

To improve TLB shootdown performance, flush the remote and local TLBs
concurrently. Introduce flush_tlb_multi(), which does so. The current
flush_tlb_others() interface is kept, since the paravirtual interfaces
need to be adapted first before it can be removed. This is left for
future work. In such PV environments, TLB flushes are, at this time, not
performed concurrently.

Add a static key to tell whether this new interface is supported.

Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: x86@kernel.org
Cc: Juergen Gross <jgross@suse.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: linux-hyperv@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Cc: kvm@vger.kernel.org
Cc: xen-devel@lists.xenproject.org
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/hyperv/mmu.c                 |  2 +
 arch/x86/include/asm/paravirt.h       |  8 +++
 arch/x86/include/asm/paravirt_types.h |  6 ++
 arch/x86/include/asm/tlbflush.h       |  6 ++
 arch/x86/kernel/kvm.c                 |  1 +
 arch/x86/kernel/paravirt.c            |  3 +
 arch/x86/mm/tlb.c                     | 80 +++++++++++++++++++++++----
 arch/x86/xen/mmu_pv.c                 |  2 +
 8 files changed, 96 insertions(+), 12 deletions(-)

diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
index e65d7fe6489f..ca28b400c87c 100644
--- a/arch/x86/hyperv/mmu.c
+++ b/arch/x86/hyperv/mmu.c
@@ -233,4 +233,6 @@ void hyperv_setup_mmu_ops(void)
 	pr_info("Using hypercall for remote TLB flush\n");
 	pv_ops.mmu.flush_tlb_others = hyperv_flush_tlb_others;
 	pv_ops.mmu.tlb_remove_table = tlb_remove_table;
+
+	static_key_disable(&flush_tlb_multi_enabled.key);
 }
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index c25c38a05c1c..192be7254457 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -47,6 +47,8 @@ static inline void slow_down_io(void)
 #endif
 }
 
+DECLARE_STATIC_KEY_TRUE(flush_tlb_multi_enabled);
+
 static inline void __flush_tlb(void)
 {
 	PVOP_VCALL0(mmu.flush_tlb_user);
@@ -62,6 +64,12 @@ static inline void __flush_tlb_one_user(unsigned long addr)
 	PVOP_VCALL1(mmu.flush_tlb_one_user, addr);
 }
 
+static inline void flush_tlb_multi(const struct cpumask *cpumask,
+				   const struct flush_tlb_info *info)
+{
+	PVOP_VCALL2(mmu.flush_tlb_multi, cpumask, info);
+}
+
 static inline void flush_tlb_others(const struct cpumask *cpumask,
 				    const struct flush_tlb_info *info)
 {
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 946f8f1f1efc..3a156e63c57d 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -211,6 +211,12 @@ struct pv_mmu_ops {
 	void (*flush_tlb_user)(void);
 	void (*flush_tlb_kernel)(void);
 	void (*flush_tlb_one_user)(unsigned long addr);
+	/*
+	 * flush_tlb_multi() is the preferred interface. When it is used,
+	 * flush_tlb_others() should return false.
+	 */
+	void (*flush_tlb_multi)(const struct cpumask *cpus,
+				const struct flush_tlb_info *info);
 	void (*flush_tlb_others)(const struct cpumask *cpus,
 				 const struct flush_tlb_info *info);
 
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index dee375831962..79272938cf79 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -569,6 +569,9 @@ static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
 	flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, PAGE_SHIFT, false);
 }
 
+void native_flush_tlb_multi(const struct cpumask *cpumask,
+			     const struct flush_tlb_info *info);
+
 void native_flush_tlb_others(const struct cpumask *cpumask,
 			     const struct flush_tlb_info *info);
 
@@ -593,6 +596,9 @@ static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
 extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
 
 #ifndef CONFIG_PARAVIRT
+#define flush_tlb_multi(mask, info)	\
+	native_flush_tlb_multi(mask, info)
+
 #define flush_tlb_others(mask, info)	\
 	native_flush_tlb_others(mask, info)
 
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 3f0cc828cc36..c1c2b88ea3f1 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -643,6 +643,7 @@ static void __init kvm_guest_init(void)
 	    kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
 		pv_ops.mmu.flush_tlb_others = kvm_flush_tlb_others;
 		pv_ops.mmu.tlb_remove_table = tlb_remove_table;
+		static_key_disable(&flush_tlb_multi_enabled.key);
 	}
 
 	if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 5492a669f658..1314f89304a8 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -171,6 +171,8 @@ unsigned paravirt_patch_insns(void *insn_buff, unsigned len,
 	return insn_len;
 }
 
+DEFINE_STATIC_KEY_TRUE(flush_tlb_multi_enabled);
+
 static void native_flush_tlb(void)
 {
 	__native_flush_tlb();
@@ -375,6 +377,7 @@ struct paravirt_patch_template pv_ops = {
 	.mmu.flush_tlb_user	= native_flush_tlb,
 	.mmu.flush_tlb_kernel	= native_flush_tlb_global,
 	.mmu.flush_tlb_one_user	= native_flush_tlb_one_user,
+	.mmu.flush_tlb_multi	= native_flush_tlb_multi,
 	.mmu.flush_tlb_others	= native_flush_tlb_others,
 	.mmu.tlb_remove_table	=
 			(void (*)(struct mmu_gather *, void *))tlb_remove_page,
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index ac98ad76f695..73d0d51b0f61 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -550,7 +550,7 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
 		 * garbage into our TLB.  Since switching to init_mm is barely
 		 * slower than a minimal flush, just switch to init_mm.
 		 *
-		 * This should be rare, with native_flush_tlb_others skipping
+		 * This should be rare, with native_flush_tlb_multi skipping
 		 * IPIs to lazy TLB mode CPUs.
 		 */
 		switch_mm_irqs_off(NULL, &init_mm, NULL);
@@ -634,9 +634,12 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
 	this_cpu_write(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen, mm_tlb_gen);
 }
 
-static void flush_tlb_func_local(const void *info, enum tlb_flush_reason reason)
+static void flush_tlb_func_local(void *info)
 {
 	const struct flush_tlb_info *f = info;
+	enum tlb_flush_reason reason;
+
+	reason = (f->mm == NULL) ? TLB_LOCAL_SHOOTDOWN : TLB_LOCAL_MM_SHOOTDOWN;
 
 	flush_tlb_func_common(f, true, reason);
 }
@@ -654,14 +657,30 @@ static void flush_tlb_func_remote(void *info)
 	flush_tlb_func_common(f, false, TLB_REMOTE_SHOOTDOWN);
 }
 
-static bool tlb_is_not_lazy(int cpu, void *data)
+static inline bool tlb_is_not_lazy(int cpu)
 {
 	return !per_cpu(cpu_tlbstate.is_lazy, cpu);
 }
 
-void native_flush_tlb_others(const struct cpumask *cpumask,
-			     const struct flush_tlb_info *info)
+static DEFINE_PER_CPU(cpumask_t, flush_tlb_mask);
+
+void native_flush_tlb_multi(const struct cpumask *cpumask,
+			    const struct flush_tlb_info *info)
 {
+	/*
+	 * native_flush_tlb_multi() can handle a single CPU, but it is
+	 * suboptimal if the local TLB should be flushed, and therefore should
+	 * not be used in such case. Check that it is not used in such case,
+	 * and use this assumption for tracing and accounting of remote TLB
+	 * flushes.
+	 */
+	VM_WARN_ON(!cpumask_any_but(cpumask, smp_processor_id()));
+
+	/*
+	 * Do accounting and tracing. Note that there are (and have always been)
+	 * cases in which a remote TLB flush will be traced, but eventually
+	 * would not happen.
+	 */
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
 	if (info->end == TLB_FLUSH_ALL)
 		trace_tlb_flush(TLB_REMOTE_SEND_IPI, TLB_FLUSH_ALL);
@@ -681,10 +700,14 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 		 * means that the percpu tlb_gen variables won't be updated
 		 * and we'll do pointless flushes on future context switches.
 		 *
-		 * Rather than hooking native_flush_tlb_others() here, I think
+		 * Rather than hooking native_flush_tlb_multi() here, I think
 		 * that UV should be updated so that smp_call_function_many(),
 		 * etc, are optimal on UV.
 		 */
+		local_irq_disable();
+		flush_tlb_func_local((__force void *)info);
+		local_irq_enable();
+
 		cpumask = uv_flush_tlb_others(cpumask, info);
 		if (cpumask)
 			smp_call_function_many(cpumask, flush_tlb_func_remote,
@@ -703,11 +726,39 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 	 * doing a speculative memory access.
 	 */
 	if (info->freed_tables)
-		smp_call_function_many(cpumask, flush_tlb_func_remote,
-			       (void *)info, 1);
-	else
-		on_each_cpu_cond_mask(tlb_is_not_lazy, flush_tlb_func_remote,
-				(void *)info, 1, GFP_ATOMIC, cpumask);
+		__smp_call_function_many(cpumask, flush_tlb_func_remote,
+					 flush_tlb_func_local, (void *)info, 1);
+	else {
+		/*
+		 * Although we could have used on_each_cpu_cond_mask(),
+		 * open-coding it has several performance advantages: (1) we can
+		 * use specialized functions for remote and local flushes; (2)
+		 * no need for indirect branch to test if TLB is lazy; (3) we
+		 * can use a designated cpumask for evaluating the condition
+		 * instead of allocating a new one.
+		 *
+		 * This works under the assumption that there are no nested TLB
+		 * flushes, an assumption that is already made in
+		 * flush_tlb_mm_range().
+		 */
+		struct cpumask *cond_cpumask = this_cpu_ptr(&flush_tlb_mask);
+		int cpu;
+
+		cpumask_clear(cond_cpumask);
+
+		for_each_cpu(cpu, cpumask) {
+			if (tlb_is_not_lazy(cpu))
+				__cpumask_set_cpu(cpu, cond_cpumask);
+		}
+		__smp_call_function_many(cond_cpumask, flush_tlb_func_remote,
+					 flush_tlb_func_local, (void *)info, 1);
+	}
+}
+
+void native_flush_tlb_others(const struct cpumask *cpumask,
+			     const struct flush_tlb_info *info)
+{
+	native_flush_tlb_multi(cpumask, info);
 }
 
 /*
@@ -773,10 +824,15 @@ static void flush_tlb_on_cpus(const cpumask_t *cpumask,
 {
 	int this_cpu = smp_processor_id();
 
+	if (static_branch_likely(&flush_tlb_multi_enabled)) {
+		flush_tlb_multi(cpumask, info);
+		return;
+	}
+
 	if (cpumask_test_cpu(this_cpu, cpumask)) {
 		lockdep_assert_irqs_enabled();
 		local_irq_disable();
-		flush_tlb_func_local(info, TLB_LOCAL_MM_SHOOTDOWN);
+		flush_tlb_func_local((__force void *)info);
 		local_irq_enable();
 	}
 
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index beb44e22afdf..0cb277848cb4 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -2474,6 +2474,8 @@ void __init xen_init_mmu_ops(void)
 
 	pv_ops.mmu = xen_mmu_ops;
 
+	static_key_disable(&flush_tlb_multi_enabled.key);
+
 	memset(dummy_mapping, 0xff, PAGE_SIZE);
 }
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [RFC PATCH v2 05/12] x86/mm/tlb: Optimize local TLB flushes
  2019-05-31  6:36 [RFC PATCH v2 00/12] x86: Flush remote TLBs concurrently and async Nadav Amit
                   ` (3 preceding siblings ...)
  2019-05-31  6:36 ` [RFC PATCH v2 04/12] x86/mm/tlb: Flush remote and local TLBs concurrently Nadav Amit
@ 2019-05-31  6:36 ` Nadav Amit
  2019-05-31  6:36 ` [RFC PATCH v2 06/12] KVM: x86: Provide paravirtualized flush_tlb_multi() Nadav Amit
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 36+ messages in thread
From: Nadav Amit @ 2019-05-31  6:36 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski
  Cc: Borislav Petkov, Dave Hansen, Ingo Molnar, Thomas Gleixner, x86,
	linux-kernel, Nadav Amit, Dave Hansen

While the updated smp infrastructure is capable of running a function on
a single local core, it is not optimized for this case. The multiple
function calls and the indirect branch introduce some overhead, making
local TLB flushes slower than they were before the recent changes.

Before calling the SMP infrastructure, check whether only a local TLB
flush is needed, to restore the lost performance in this common case.
This requires checking mm_cpumask() one more time, but unless this mask
is updated very frequently, it should not impact performance negatively.

Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/mm/tlb.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 73d0d51b0f61..b0c3065aad5d 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -823,8 +823,12 @@ static void flush_tlb_on_cpus(const cpumask_t *cpumask,
 			      const struct flush_tlb_info *info)
 {
 	int this_cpu = smp_processor_id();
+	bool flush_others = false;
 
-	if (static_branch_likely(&flush_tlb_multi_enabled)) {
+	if (cpumask_any_but(cpumask, this_cpu) < nr_cpu_ids)
+		flush_others = true;
+
+	if (static_branch_likely(&flush_tlb_multi_enabled) && flush_others) {
 		flush_tlb_multi(cpumask, info);
 		return;
 	}
@@ -836,7 +840,7 @@ static void flush_tlb_on_cpus(const cpumask_t *cpumask,
 		local_irq_enable();
 	}
 
-	if (cpumask_any_but(cpumask, this_cpu) < nr_cpu_ids)
+	if (flush_others)
 		flush_tlb_others(cpumask, info);
 }
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [RFC PATCH v2 06/12] KVM: x86: Provide paravirtualized flush_tlb_multi()
  2019-05-31  6:36 [RFC PATCH v2 00/12] x86: Flush remote TLBs concurrently and async Nadav Amit
                   ` (4 preceding siblings ...)
  2019-05-31  6:36 ` [RFC PATCH v2 05/12] x86/mm/tlb: Optimize local TLB flushes Nadav Amit
@ 2019-05-31  6:36 ` Nadav Amit
  2019-05-31  6:36 ` [RFC PATCH v2 07/12] smp: Do not mark call_function_data as shared Nadav Amit
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 36+ messages in thread
From: Nadav Amit @ 2019-05-31  6:36 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski
  Cc: Borislav Petkov, Dave Hansen, Ingo Molnar, Thomas Gleixner, x86,
	linux-kernel, Nadav Amit, Paolo Bonzini, kvm

Support the new flush_tlb_multi() interface, which also flushes the
local CPU's TLB, instead of flush_tlb_others(), which does not. The new
interface is more performant, since it parallelizes remote and local TLB
flushes.

The actual implementation of flush_tlb_multi() is almost identical to
that of flush_tlb_others().

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: kvm@vger.kernel.org
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/kernel/kvm.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index c1c2b88ea3f1..86b8267166e6 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -593,7 +593,7 @@ static void __init kvm_apf_trap_init(void)
 
 static DEFINE_PER_CPU(cpumask_var_t, __pv_tlb_mask);
 
-static void kvm_flush_tlb_others(const struct cpumask *cpumask,
+static void kvm_flush_tlb_multi(const struct cpumask *cpumask,
 			const struct flush_tlb_info *info)
 {
 	u8 state;
@@ -607,6 +607,11 @@ static void kvm_flush_tlb_others(const struct cpumask *cpumask,
 	 * queue flush_on_enter for pre-empted vCPUs
 	 */
 	for_each_cpu(cpu, flushmask) {
+		/*
+		 * The local vCPU is never preempted, so we do not explicitly
+		 * skip check for local vCPU - it will never be cleared from
+		 * flushmask.
+		 */
 		src = &per_cpu(steal_time, cpu);
 		state = READ_ONCE(src->preempted);
 		if ((state & KVM_VCPU_PREEMPTED)) {
@@ -616,7 +621,7 @@ static void kvm_flush_tlb_others(const struct cpumask *cpumask,
 		}
 	}
 
-	native_flush_tlb_others(flushmask, info);
+	native_flush_tlb_multi(flushmask, info);
 }
 
 static void __init kvm_guest_init(void)
@@ -641,9 +646,8 @@ static void __init kvm_guest_init(void)
 	if (kvm_para_has_feature(KVM_FEATURE_PV_TLB_FLUSH) &&
 	    !kvm_para_has_hint(KVM_HINTS_REALTIME) &&
 	    kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
-		pv_ops.mmu.flush_tlb_others = kvm_flush_tlb_others;
+		pv_ops.mmu.flush_tlb_multi = kvm_flush_tlb_multi;
 		pv_ops.mmu.tlb_remove_table = tlb_remove_table;
-		static_key_disable(&flush_tlb_multi_enabled.key);
 	}
 
 	if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [RFC PATCH v2 07/12] smp: Do not mark call_function_data as shared
  2019-05-31  6:36 [RFC PATCH v2 00/12] x86: Flush remote TLBs concurrently and async Nadav Amit
                   ` (5 preceding siblings ...)
  2019-05-31  6:36 ` [RFC PATCH v2 06/12] KVM: x86: Provide paravirtualized flush_tlb_multi() Nadav Amit
@ 2019-05-31  6:36 ` Nadav Amit
  2019-05-31 10:17   ` Peter Zijlstra
  2019-05-31  6:36 ` [RFC PATCH v2 08/12] x86/tlb: Privatize cpu_tlbstate Nadav Amit
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 36+ messages in thread
From: Nadav Amit @ 2019-05-31  6:36 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski
  Cc: Borislav Petkov, Dave Hansen, Ingo Molnar, Thomas Gleixner, x86,
	linux-kernel, Nadav Amit, Rik van Riel

cfd_data is marked as shared, but although it holds pointers to shared
data structures, it is private per core.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@surriel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 kernel/smp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index 6b411ee86ef6..f1a358f9c34c 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -33,7 +33,7 @@ struct call_function_data {
 	cpumask_var_t		cpumask_ipi;
 };
 
-static DEFINE_PER_CPU_SHARED_ALIGNED(struct call_function_data, cfd_data);
+static DEFINE_PER_CPU(struct call_function_data, cfd_data);
 
 static DEFINE_PER_CPU_SHARED_ALIGNED(struct llist_head, call_single_queue);
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [RFC PATCH v2 08/12] x86/tlb: Privatize cpu_tlbstate
  2019-05-31  6:36 [RFC PATCH v2 00/12] x86: Flush remote TLBs concurrently and async Nadav Amit
                   ` (6 preceding siblings ...)
  2019-05-31  6:36 ` [RFC PATCH v2 07/12] smp: Do not mark call_function_data as shared Nadav Amit
@ 2019-05-31  6:36 ` Nadav Amit
  2019-05-31 18:48   ` Andy Lutomirski
  2019-05-31  6:36 ` [RFC PATCH v2 09/12] x86/apic: Use non-atomic operations when possible Nadav Amit
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 36+ messages in thread
From: Nadav Amit @ 2019-05-31  6:36 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski
  Cc: Borislav Petkov, Dave Hansen, Ingo Molnar, Thomas Gleixner, x86,
	linux-kernel, Nadav Amit, Dave Hansen

cpu_tlbstate is mostly private and only the variable is_lazy is shared.
This causes some false-sharing when TLB flushes are performed.

Break cpu_tlbstate into cpu_tlbstate and cpu_tlbstate_shared, and mark
each one accordingly.

Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/tlbflush.h | 40 ++++++++++++++++++---------------
 arch/x86/mm/init.c              |  2 +-
 arch/x86/mm/tlb.c               | 15 ++++++++-----
 3 files changed, 32 insertions(+), 25 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 79272938cf79..a1fea36d5292 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -178,23 +178,6 @@ struct tlb_state {
 	u16 loaded_mm_asid;
 	u16 next_asid;
 
-	/*
-	 * We can be in one of several states:
-	 *
-	 *  - Actively using an mm.  Our CPU's bit will be set in
-	 *    mm_cpumask(loaded_mm) and is_lazy == false;
-	 *
-	 *  - Not using a real mm.  loaded_mm == &init_mm.  Our CPU's bit
-	 *    will not be set in mm_cpumask(&init_mm) and is_lazy == false.
-	 *
-	 *  - Lazily using a real mm.  loaded_mm != &init_mm, our bit
-	 *    is set in mm_cpumask(loaded_mm), but is_lazy == true.
-	 *    We're heuristically guessing that the CR3 load we
-	 *    skipped more than makes up for the overhead added by
-	 *    lazy mode.
-	 */
-	bool is_lazy;
-
 	/*
 	 * If set we changed the page tables in such a way that we
 	 * needed an invalidation of all contexts (aka. PCIDs / ASIDs).
@@ -240,7 +223,27 @@ struct tlb_state {
 	 */
 	struct tlb_context ctxs[TLB_NR_DYN_ASIDS];
 };
-DECLARE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate);
+DECLARE_PER_CPU_ALIGNED(struct tlb_state, cpu_tlbstate);
+
+struct tlb_state_shared {
+	/*
+	 * We can be in one of several states:
+	 *
+	 *  - Actively using an mm.  Our CPU's bit will be set in
+	 *    mm_cpumask(loaded_mm) and is_lazy == false;
+	 *
+	 *  - Not using a real mm.  loaded_mm == &init_mm.  Our CPU's bit
+	 *    will not be set in mm_cpumask(&init_mm) and is_lazy == false.
+	 *
+	 *  - Lazily using a real mm.  loaded_mm != &init_mm, our bit
+	 *    is set in mm_cpumask(loaded_mm), but is_lazy == true.
+	 *    We're heuristically guessing that the CR3 load we
+	 *    skipped more than makes up for the overhead added by
+	 *    lazy mode.
+	 */
+	bool is_lazy;
+};
+DECLARE_PER_CPU_SHARED_ALIGNED(struct tlb_state_shared, cpu_tlbstate_shared);
 
 /*
  * Blindly accessing user memory from NMI context can be dangerous
@@ -439,6 +442,7 @@ static inline void __native_flush_tlb_one_user(unsigned long addr)
 {
 	u32 loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
 
+	//invpcid_flush_one(kern_pcid(loaded_mm_asid), addr);
 	asm volatile("invlpg (%0)" ::"r" (addr) : "memory");
 
 	if (!static_cpu_has(X86_FEATURE_PTI))
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index fd10d91a6115..34027f36a944 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -951,7 +951,7 @@ void __init zone_sizes_init(void)
 	free_area_init_nodes(max_zone_pfns);
 }
 
-__visible DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate) = {
+__visible DEFINE_PER_CPU_ALIGNED(struct tlb_state, cpu_tlbstate) = {
 	.loaded_mm = &init_mm,
 	.next_asid = 1,
 	.cr4 = ~0UL,	/* fail hard if we screw up cr4 shadow initialization */
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index b0c3065aad5d..755b2bb3e5b6 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -144,7 +144,7 @@ void leave_mm(int cpu)
 		return;
 
 	/* Warn if we're not lazy. */
-	WARN_ON(!this_cpu_read(cpu_tlbstate.is_lazy));
+	WARN_ON(!this_cpu_read(cpu_tlbstate_shared.is_lazy));
 
 	switch_mm(NULL, &init_mm, NULL);
 }
@@ -276,7 +276,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 {
 	struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm);
 	u16 prev_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
-	bool was_lazy = this_cpu_read(cpu_tlbstate.is_lazy);
+	bool was_lazy = this_cpu_read(cpu_tlbstate_shared.is_lazy);
 	unsigned cpu = smp_processor_id();
 	u64 next_tlb_gen;
 	bool need_flush;
@@ -321,7 +321,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		__flush_tlb_all();
 	}
 #endif
-	this_cpu_write(cpu_tlbstate.is_lazy, false);
+	this_cpu_write(cpu_tlbstate_shared.is_lazy, false);
 
 	/*
 	 * The membarrier system call requires a full memory barrier and
@@ -462,7 +462,7 @@ void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
 	if (this_cpu_read(cpu_tlbstate.loaded_mm) == &init_mm)
 		return;
 
-	this_cpu_write(cpu_tlbstate.is_lazy, true);
+	this_cpu_write(cpu_tlbstate_shared.is_lazy, true);
 }
 
 /*
@@ -543,7 +543,7 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
 	VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].ctx_id) !=
 		   loaded_mm->context.ctx_id);
 
-	if (this_cpu_read(cpu_tlbstate.is_lazy)) {
+	if (this_cpu_read(cpu_tlbstate_shared.is_lazy)) {
 		/*
 		 * We're in lazy mode.  We need to at least flush our
 		 * paging-structure cache to avoid speculatively reading
@@ -659,11 +659,14 @@ static void flush_tlb_func_remote(void *info)
 
 static inline bool tlb_is_not_lazy(int cpu)
 {
-	return !per_cpu(cpu_tlbstate.is_lazy, cpu);
+	return !per_cpu(cpu_tlbstate_shared.is_lazy, cpu);
 }
 
 static DEFINE_PER_CPU(cpumask_t, flush_tlb_mask);
 
+DEFINE_PER_CPU_ALIGNED(struct tlb_state_shared, cpu_tlbstate_shared);
+EXPORT_PER_CPU_SYMBOL(cpu_tlbstate_shared);
+
 void native_flush_tlb_multi(const struct cpumask *cpumask,
 			    const struct flush_tlb_info *info)
 {
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [RFC PATCH v2 09/12] x86/apic: Use non-atomic operations when possible
  2019-05-31  6:36 [RFC PATCH v2 00/12] x86: Flush remote TLBs concurrently and async Nadav Amit
                   ` (7 preceding siblings ...)
  2019-05-31  6:36 ` [RFC PATCH v2 08/12] x86/tlb: Privatize cpu_tlbstate Nadav Amit
@ 2019-05-31  6:36 ` Nadav Amit
  2019-05-31  6:36 ` [RFC PATCH v2 10/12] smp: Enable data inlining for inter-processor function call Nadav Amit
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 36+ messages in thread
From: Nadav Amit @ 2019-05-31  6:36 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski
  Cc: Borislav Petkov, Dave Hansen, Ingo Molnar, Thomas Gleixner, x86,
	linux-kernel, Nadav Amit, Dave Hansen

Using __clear_bit() and __cpumask_clear_cpu() is more efficient than
using their atomic counterparts. Use them when atomicity is not needed,
such as when manipulating bitmasks that are on the stack.
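
As a rough illustration of the pattern (a sketch, not code from this
patch; example_mask and example_ipi_to_others are made-up names): when a
mask is on the stack or otherwise private to the current CPU, the
non-atomic helpers avoid a LOCK-prefixed read-modify-write.

	static DEFINE_PER_CPU(cpumask_t, example_mask);

	/* Must be called with preemption disabled. */
	static void example_ipi_to_others(const struct cpumask *mask)
	{
		cpumask_t *tmpmask = this_cpu_ptr(&example_mask);

		cpumask_copy(tmpmask, mask);
		/*
		 * tmpmask is only touched by this CPU, so the non-atomic
		 * helper suffices; cpumask_clear_cpu() would emit a
		 * LOCK-prefixed instruction instead.
		 */
		__cpumask_clear_cpu(smp_processor_id(), tmpmask);

		/* ... send the IPI to the CPUs left in tmpmask ... */
	}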

Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/kernel/apic/apic_flat_64.c   | 4 ++--
 arch/x86/kernel/apic/x2apic_cluster.c | 2 +-
 arch/x86/kernel/smp.c                 | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/apic/apic_flat_64.c b/arch/x86/kernel/apic/apic_flat_64.c
index 0005c284a5c5..65072858f553 100644
--- a/arch/x86/kernel/apic/apic_flat_64.c
+++ b/arch/x86/kernel/apic/apic_flat_64.c
@@ -78,7 +78,7 @@ flat_send_IPI_mask_allbutself(const struct cpumask *cpumask, int vector)
 	int cpu = smp_processor_id();
 
 	if (cpu < BITS_PER_LONG)
-		clear_bit(cpu, &mask);
+		__clear_bit(cpu, &mask);
 
 	_flat_send_IPI_mask(mask, vector);
 }
@@ -92,7 +92,7 @@ static void flat_send_IPI_allbutself(int vector)
 			unsigned long mask = cpumask_bits(cpu_online_mask)[0];
 
 			if (cpu < BITS_PER_LONG)
-				clear_bit(cpu, &mask);
+				__clear_bit(cpu, &mask);
 
 			_flat_send_IPI_mask(mask, vector);
 		}
diff --git a/arch/x86/kernel/apic/x2apic_cluster.c b/arch/x86/kernel/apic/x2apic_cluster.c
index 7685444a106b..609e499387a1 100644
--- a/arch/x86/kernel/apic/x2apic_cluster.c
+++ b/arch/x86/kernel/apic/x2apic_cluster.c
@@ -50,7 +50,7 @@ __x2apic_send_IPI_mask(const struct cpumask *mask, int vector, int apic_dest)
 	cpumask_copy(tmpmsk, mask);
 	/* If IPI should not be sent to self, clear current CPU */
 	if (apic_dest != APIC_DEST_ALLINC)
-		cpumask_clear_cpu(smp_processor_id(), tmpmsk);
+		__cpumask_clear_cpu(smp_processor_id(), tmpmsk);
 
 	/* Collapse cpus in a cluster so a single IPI per cluster is sent */
 	for_each_cpu(cpu, tmpmsk) {
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 04adc8d60aed..acddd988602d 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -146,7 +146,7 @@ void native_send_call_func_ipi(const struct cpumask *mask)
 	}
 
 	cpumask_copy(allbutself, cpu_online_mask);
-	cpumask_clear_cpu(smp_processor_id(), allbutself);
+	__cpumask_clear_cpu(smp_processor_id(), allbutself);
 
 	if (cpumask_equal(mask, allbutself) &&
 	    cpumask_equal(cpu_online_mask, cpu_callout_mask))
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [RFC PATCH v2 10/12] smp: Enable data inlining for inter-processor function call
  2019-05-31  6:36 [RFC PATCH v2 00/12] x86: Flush remote TLBs concurrently and async Nadav Amit
                   ` (8 preceding siblings ...)
  2019-05-31  6:36 ` [RFC PATCH v2 09/12] x86/apic: Use non-atomic operations when possible Nadav Amit
@ 2019-05-31  6:36 ` Nadav Amit
  2019-05-31  6:36 ` [RFC PATCH v2 11/12] x86/mm/tlb: Use async and inline messages for flushing Nadav Amit
  2019-05-31  6:36 ` [RFC PATCH v2 12/12] x86/mm/tlb: Reverting the removal of flush_tlb_info from stack Nadav Amit
  11 siblings, 0 replies; 36+ messages in thread
From: Nadav Amit @ 2019-05-31  6:36 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski
  Cc: Borislav Petkov, Dave Hansen, Ingo Molnar, Thomas Gleixner, x86,
	linux-kernel, Nadav Amit, Dave Hansen

There are some opportunities to improve the performance of
inter-processor function calls.

First, call_single_data, which is used for communicating the data that
is consumed by the function call, is currently not cacheline aligned.
This might lead to false sharing, since the structures that are used for
communication with two different CPUs can share a cacheline.

Second, the data that is used as an argument for the remotely executed
function resides at a different address (and on a different cacheline),
so it also needs to traverse between caches (in addition to
call_single_data).

Third, some functions can be executed asynchronously, but their data
(i.e., info, which is used as an argument to the function) must stay
untouched until they are done. Such functions might be executed
synchronously right now due to lack of infrastructure.

Allow callers of __smp_call_function_many() and __on_each_cpu_mask() to
request that the function data be inlined with the interprocessor
message on the same cache-line. Once the IPI is received, the info is
first copied to the receiver's stack, both to avoid false sharing and to
support asynchronous mode, in which the initiator of the remote call
does not wait for the receiver.

While a remote function is executed, save the function that is
currently executing in an externally visible variable. This information
will soon be needed for asynchronous remote TLB flushing.
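
A rough usage sketch (example_arg, example_func and example_caller below
are made-up names, not part of this series): a small argument block is
copied into the IPI message, so per the semantics above the caller's
copy may go out of scope once the call returns, even in asynchronous
mode.

	struct example_arg {			/* hypothetical argument */
		unsigned long val;
	};

	static void example_func(void *info)
	{
		/* On remote CPUs, info points at the receiver's stack copy. */
		struct example_arg *arg = info;

		pr_debug("example: %lu\n", arg->val);
	}

	static void example_caller(void)
	{
		struct example_arg arg = { .val = 42 };

		/*
		 * info_size > 0 requests inlining; wait == false makes the
		 * call asynchronous, which is fine here because @info is
		 * copied before the remote CPU starts running the function.
		 */
		__on_each_cpu_mask(cpu_online_mask, example_func, &arg,
				   sizeof(arg), false);
	}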

Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 include/linux/smp.h | 21 ++++++++----
 kernel/smp.c        | 83 +++++++++++++++++++++++++++++++++++++++------
 2 files changed, 87 insertions(+), 17 deletions(-)

diff --git a/include/linux/smp.h b/include/linux/smp.h
index b69abc88259d..eb449086feb3 100644
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -18,8 +18,11 @@ typedef void (*smp_call_func_t)(void *info);
 struct __call_single_data {
 	struct llist_node llist;
 	smp_call_func_t func;
-	void *info;
 	unsigned int flags;
+	union {
+		void *info;
+		u8 inlined_info[0];
+	};
 };
 
 /* Use __aligned() to avoid to use 2 cache lines for 1 csd */
@@ -32,12 +35,18 @@ extern unsigned int total_cpus;
 int smp_call_function_single(int cpuid, smp_call_func_t func, void *info,
 			     int wait);
 
+void __on_each_cpu_mask(const struct cpumask *mask, smp_call_func_t func,
+		void *info, size_t info_size, bool wait);
+
 /*
  * Call a function on processors specified by mask, which might include
  * the local one.
  */
-void on_each_cpu_mask(const struct cpumask *mask, smp_call_func_t func,
-		void *info, bool wait);
+static inline void on_each_cpu_mask(const struct cpumask *mask,
+				smp_call_func_t func, void *info, bool wait)
+{
+	__on_each_cpu_mask(mask, func, info, 0, wait);
+}
 
 /*
  * Call a function on all processors.  May be used during early boot while
@@ -109,13 +118,13 @@ void smp_call_function(smp_call_func_t func, void *info, int wait);
 
 void __smp_call_function_many(const struct cpumask *mask,
 			      smp_call_func_t remote_func,
-			      smp_call_func_t local_func,
-			      void *info, bool wait);
+			      smp_call_func_t local_func, void *info,
+			      size_t info_size, bool wait);
 
 static inline void smp_call_function_many(const struct cpumask *mask,
 				smp_call_func_t func, void *info, bool wait)
 {
-	__smp_call_function_many(mask, func, NULL, info, wait);
+	__smp_call_function_many(mask, func, NULL, info, 0, wait);
 }
 
 int smp_call_function_any(const struct cpumask *mask,
diff --git a/kernel/smp.c b/kernel/smp.c
index f1a358f9c34c..46cbf611a38d 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -25,6 +25,7 @@
 enum {
 	CSD_FLAG_LOCK		= 0x01,
 	CSD_FLAG_SYNCHRONOUS	= 0x02,
+	CSD_FLAG_INLINED	= 0x04,
 };
 
 struct call_function_data {
@@ -35,8 +36,16 @@ struct call_function_data {
 
 static DEFINE_PER_CPU(struct call_function_data, cfd_data);
 
+/* Function which is currently executed asynchronously. */
+DECLARE_PER_CPU(smp_call_func_t, async_func_in_progress);
+DEFINE_PER_CPU(smp_call_func_t, async_func_in_progress);
+EXPORT_PER_CPU_SYMBOL(async_func_in_progress);
+
 static DEFINE_PER_CPU_SHARED_ALIGNED(struct llist_head, call_single_queue);
 
+#define MAX_FUNC_INFO_SIZE						\
+	(SMP_CACHE_BYTES - offsetof(struct __call_single_data, inlined_info[0]))
+
 static void flush_smp_call_function_queue(bool warn_cpu_offline);
 
 int smpcfd_prepare_cpu(unsigned int cpu)
@@ -51,7 +60,15 @@ int smpcfd_prepare_cpu(unsigned int cpu)
 		free_cpumask_var(cfd->cpumask);
 		return -ENOMEM;
 	}
-	cfd->csd = alloc_percpu(call_single_data_t);
+
+	/*
+	 * Allocate a whole cache line to leave space for inlined data
+	 * adjacent to call_single_data_t. First ensure that the struct is
+	 * indeed no larger than a cache line.
+	 */
+	BUILD_BUG_ON(sizeof(call_single_data_t) > SMP_CACHE_BYTES);
+
+	cfd->csd = __alloc_percpu(SMP_CACHE_BYTES, SMP_CACHE_BYTES);
 	if (!cfd->csd) {
 		free_cpumask_var(cfd->cpumask);
 		free_cpumask_var(cfd->cpumask_ipi);
@@ -235,16 +252,36 @@ static void flush_smp_call_function_queue(bool warn_cpu_offline)
 	}
 
 	llist_for_each_entry_safe(csd, csd_next, entry, llist) {
+		u8 inlined_info[MAX_FUNC_INFO_SIZE] __aligned(sizeof(long));
 		smp_call_func_t func = csd->func;
-		void *info = csd->info;
+		void *info;
+
+		/*
+		 * Check if we are requested to copy the info to the local storage
+		 * and use it.
+		 */
+		if (csd->flags & CSD_FLAG_INLINED) {
+			memcpy(inlined_info, csd->inlined_info, MAX_FUNC_INFO_SIZE);
+			info = &inlined_info;
+		} else {
+			info = csd->info;
+		}
 
 		/* Do we wait until *after* callback? */
 		if (csd->flags & CSD_FLAG_SYNCHRONOUS) {
 			func(info);
 			csd_unlock(csd);
 		} else {
+			this_cpu_write(async_func_in_progress, csd->func);
 			csd_unlock(csd);
 			func(info);
+
+			/*
+			 * Ensure the write indicating that the function has
+			 * completed is only performed after it has returned.
+			 */
+			barrier();
+			this_cpu_write(async_func_in_progress, NULL);
 		}
 	}
 
@@ -395,10 +432,15 @@ EXPORT_SYMBOL_GPL(smp_call_function_any);
  *		fast and non-blocking. If NULL is provided, no function will
  *		be executed on this CPU.
  * @info: An arbitrary pointer to pass to the function.
+ * @info_size: Size of @info to copy into the inlined message that is sent to
+ *	       the target CPUs. If zero, @info will not be copied but will be
+ *	       accessed in place. Must not exceed %MAX_FUNC_INFO_SIZE.
  * @wait: If true, wait (atomically) until function has completed
  *        on other CPUs.
  *
- * If @wait is true, then returns once @func has returned.
+ * If @wait is true, then returns once @func has returned. Otherwise, returns
+ * after the IPI is delivered. If @info_size > 0, the function returns only
+ * after @info has been copied by the remote CPU to local storage.
  *
  * You must not call this function with disabled interrupts or from a
  * hardware interrupt handler or from a bottom half handler. Preemption
@@ -406,11 +448,12 @@ EXPORT_SYMBOL_GPL(smp_call_function_any);
  */
 void __smp_call_function_many(const struct cpumask *mask,
 			      smp_call_func_t remote_func,
-			      smp_call_func_t local_func,
-			      void *info, bool wait)
+			      smp_call_func_t local_func, void *info,
+			      size_t info_size, bool wait)
 {
 	int cpu, last_cpu, this_cpu = smp_processor_id();
 	struct call_function_data *cfd;
+	bool copy = (info_size != 0);
 	bool run_remote = false;
 	bool run_local = false;
 	int nr_cpus = 0;
@@ -424,6 +467,16 @@ void __smp_call_function_many(const struct cpumask *mask,
 	if (cpu_online(this_cpu) && !oops_in_progress && !early_boot_irqs_disabled)
 		lockdep_assert_irqs_enabled();
 
+	/*
+	 * If size is too big, issue a warning. To be safe, we have to wait for
+	 * the function to be done.
+	 */
+	if (unlikely(info_size > MAX_FUNC_INFO_SIZE && copy)) {
+		WARN_ONCE(1, "Inlined IPI info size exceeds maximum\n");
+		copy = false;
+		wait = 1;
+	}
+
 	/* Check if we need local execution. */
 	if (local_func && cpumask_test_cpu(this_cpu, mask))
 		run_local = true;
@@ -449,10 +502,18 @@ void __smp_call_function_many(const struct cpumask *mask,
 			last_cpu = cpu;
 
 			csd_lock(csd);
+
+			if (copy) {
+				csd->flags |= CSD_FLAG_INLINED;
+				memcpy(csd->inlined_info, info, info_size);
+			} else {
+				csd->info = info;
+			}
+
 			if (wait)
 				csd->flags |= CSD_FLAG_SYNCHRONOUS;
 			csd->func = remote_func;
-			csd->info = info;
+
 			if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
 				__cpumask_set_cpu(cpu, cfd->cpumask_ipi);
 		}
@@ -603,7 +664,7 @@ void __init smp_init(void)
 }
 
 /**
- * on_each_cpu_mask(): Run a function on processors specified by
+ * __on_each_cpu_mask(): Run a function on processors specified by
  * cpumask, which may include the local processor.
  * @mask: The set of cpus to run on (only runs on online subset).
  * @func: The function to run. This must be fast and non-blocking.
@@ -618,16 +679,16 @@ void __init smp_init(void)
  * exception is that it may be used during early boot while
  * early_boot_irqs_disabled is set.
  */
-void on_each_cpu_mask(const struct cpumask *mask, smp_call_func_t func,
-			void *info, bool wait)
+void __on_each_cpu_mask(const struct cpumask *mask, smp_call_func_t func,
+			void *info, size_t info_size, bool wait)
 {
 	preempt_disable();
 
-	__smp_call_function_many(mask, func, func, info, wait);
+	__smp_call_function_many(mask, func, func, info, info_size, wait);
 
 	preempt_enable();
 }
-EXPORT_SYMBOL(on_each_cpu_mask);
+EXPORT_SYMBOL(__on_each_cpu_mask);
 
 /*
  * on_each_cpu_cond(): Call a function on each processor for which
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [RFC PATCH v2 11/12] x86/mm/tlb: Use async and inline messages for flushing
  2019-05-31  6:36 [RFC PATCH v2 00/12] x86: Flush remote TLBs concurrently and async Nadav Amit
                   ` (9 preceding siblings ...)
  2019-05-31  6:36 ` [RFC PATCH v2 10/12] smp: Enable data inlining for inter-processor function call Nadav Amit
@ 2019-05-31  6:36 ` Nadav Amit
  2019-05-31 10:57   ` Peter Zijlstra
  2019-05-31 21:14   ` Andy Lutomirski
  2019-05-31  6:36 ` [RFC PATCH v2 12/12] x86/mm/tlb: Reverting the removal of flush_tlb_info from stack Nadav Amit
  11 siblings, 2 replies; 36+ messages in thread
From: Nadav Amit @ 2019-05-31  6:36 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski
  Cc: Borislav Petkov, Dave Hansen, Ingo Molnar, Thomas Gleixner, x86,
	linux-kernel, Nadav Amit, Dave Hansen

When we flush userspace mappings, we can defer the TLB flushes, as long
as the following conditions are met:

1. No tables are freed, since otherwise speculative page walks might
   cause machine-checks.

2. No one would access userspace before flush takes place. Specifically,
   NMI handlers and kprobes would avoid accessing userspace.

Use the new SMP support to execute remote function calls with inlined
data for this purpose. The remote TLB flushing function is executed
asynchronously: the local CPU continues execution as soon as the IPI is
delivered, before the function is actually executed. Since
flush_tlb_info is copied, there is no risk that it would change before
the TLB flush is actually executed.

Change nmi_uaccess_okay() to check whether a remote TLB flush is
currently in progress on this CPU by checking whether the asynchronously
called function is the remote TLB flushing function. The current
implementation disallows access in such cases, but it is also possible
to flush the entire TLB in such case and allow access.

When page-tables are freed or when kernel PTEs are removed, perform
synchronous TLB flushes, but still inline the data with the IPI message,
although the performance gains in this case are likely to be much
smaller.
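
As a sketch of how condition 2 is expected to be honored (the function
and variable names below are made up; nmi_uaccess_okay() and
copy_from_user_nmi() are existing helpers, and the latter checks
nmi_uaccess_okay() internally as well - it is spelled out here only for
clarity):

	/* Called from NMI context, e.g. while taking a sample. */
	static int example_nmi_read_user(void *buf, const void __user *uptr,
					 unsigned long size)
	{
		/*
		 * With this patch, nmi_uaccess_okay() also returns false
		 * while an asynchronous remote TLB flush is pending on
		 * this CPU, so skip the access instead of risking the use
		 * of a stale translation.
		 */
		if (!nmi_uaccess_okay())
			return -EFAULT;

		/* Returns the number of bytes that could not be copied. */
		return copy_from_user_nmi(buf, uptr, size) ? -EFAULT : 0;
	}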

Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/tlbflush.h | 12 ++++++++++++
 arch/x86/mm/tlb.c               | 15 +++++++++++----
 2 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index a1fea36d5292..75e4c4263af6 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -245,6 +245,10 @@ struct tlb_state_shared {
 };
 DECLARE_PER_CPU_SHARED_ALIGNED(struct tlb_state_shared, cpu_tlbstate_shared);
 
+DECLARE_PER_CPU(smp_call_func_t, async_func_in_progress);
+
+extern void flush_tlb_func_remote(void *info);
+
 /*
  * Blindly accessing user memory from NMI context can be dangerous
  * if we're in the middle of switching the current user task or
@@ -259,6 +263,14 @@ static inline bool nmi_uaccess_okay(void)
 
 	VM_WARN_ON_ONCE(!loaded_mm);
 
+	/*
+	 * If we are in the middle of a TLB flush, access is not allowed. We
+	 * could have just flushed the entire TLB and allowed access, but it is
+	 * easier and safer just to disallow access.
+	 */
+	if (this_cpu_read(async_func_in_progress) == flush_tlb_func_remote)
+		return false;
+
 	/*
 	 * The condition we want to check is
 	 * current_mm->pgd == __va(read_cr3_pa()).  This may be slow, though,
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 755b2bb3e5b6..fd7e90adbe43 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -644,7 +644,7 @@ static void flush_tlb_func_local(void *info)
 	flush_tlb_func_common(f, true, reason);
 }
 
-static void flush_tlb_func_remote(void *info)
+void flush_tlb_func_remote(void *info)
 {
 	const struct flush_tlb_info *f = info;
 
@@ -730,7 +730,7 @@ void native_flush_tlb_multi(const struct cpumask *cpumask,
 	 */
 	if (info->freed_tables)
 		__smp_call_function_many(cpumask, flush_tlb_func_remote,
-					 flush_tlb_func_local, (void *)info, 1);
+				flush_tlb_func_local, (void *)info, sizeof(*info), 1);
 	else {
 		/*
 		 * Although we could have used on_each_cpu_cond_mask(),
@@ -753,8 +753,10 @@ void native_flush_tlb_multi(const struct cpumask *cpumask,
 			if (tlb_is_not_lazy(cpu))
 				__cpumask_set_cpu(cpu, cond_cpumask);
 		}
+
 		__smp_call_function_many(cond_cpumask, flush_tlb_func_remote,
-					 flush_tlb_func_local, (void *)info, 1);
+					 flush_tlb_func_local, (void *)info,
+					 sizeof(*info), 0);
 	}
 }
 
@@ -915,7 +917,12 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end)
 
 		info = get_flush_tlb_info(NULL, start, end, 0, false, 0);
 
-		on_each_cpu(do_kernel_range_flush, info, 1);
+		/*
+		 * We have to wait for the remote shootdown to be done since it
+		 * is kernel space.
+		 */
+		__on_each_cpu_mask(cpu_online_mask, do_kernel_range_flush,
+				   info, sizeof(*info), 1);
 
 		put_flush_tlb_info();
 	}
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [RFC PATCH v2 12/12] x86/mm/tlb: Reverting the removal of flush_tlb_info from stack
  2019-05-31  6:36 [RFC PATCH v2 00/12] x86: Flush remote TLBs concurrently and async Nadav Amit
                   ` (10 preceding siblings ...)
  2019-05-31  6:36 ` [RFC PATCH v2 11/12] x86/mm/tlb: Use async and inline messages for flushing Nadav Amit
@ 2019-05-31  6:36 ` Nadav Amit
  11 siblings, 0 replies; 36+ messages in thread
From: Nadav Amit @ 2019-05-31  6:36 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski
  Cc: Borislav Petkov, Dave Hansen, Ingo Molnar, Thomas Gleixner, x86,
	linux-kernel, Nadav Amit, Brian Gerst, H . Peter Anvin,
	Rik van Riel

Partially revert 3db6d5a5eca ("x86/mm/tlb: Remove 'struct
flush_tlb_info' from the stack").

Now that we copy flush_tlb_info and inline it with the IPI information,
we can put it back onto the stack. This simplifies the code and should be
slightly more robust. The stack is also a bit more likely to be cached
than a global variable.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/mm/tlb.c | 46 +++++++++++++++++++++++++---------------------
 1 file changed, 25 insertions(+), 21 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index fd7e90adbe43..81170fd6b1dd 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -827,15 +827,17 @@ static inline void put_flush_tlb_info(void)
 static void flush_tlb_on_cpus(const cpumask_t *cpumask,
 			      const struct flush_tlb_info *info)
 {
-	int this_cpu = smp_processor_id();
 	bool flush_others = false;
+	int this_cpu;
+
+	this_cpu = get_cpu();
 
 	if (cpumask_any_but(cpumask, this_cpu) < nr_cpu_ids)
 		flush_others = true;
 
 	if (static_branch_likely(&flush_tlb_multi_enabled) && flush_others) {
 		flush_tlb_multi(cpumask, info);
-		return;
+		goto out;
 	}
 
 	if (cpumask_test_cpu(this_cpu, cpumask)) {
@@ -847,27 +849,33 @@ static void flush_tlb_on_cpus(const cpumask_t *cpumask,
 
 	if (flush_others)
 		flush_tlb_others(cpumask, info);
+
+out:
+	put_cpu();
 }
 
 void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 				unsigned long end, unsigned int stride_shift,
 				bool freed_tables)
 {
-	struct flush_tlb_info *info;
-	u64 new_tlb_gen;
+	struct flush_tlb_info info = {
+		.mm		= mm,
+		.stride_shift	= stride_shift,
+		.freed_tables	= freed_tables,
+	};
 
 	/* Should we flush just the requested range? */
 	if ((end == TLB_FLUSH_ALL) ||
 	    ((end - start) >> stride_shift) > tlb_single_page_flush_ceiling) {
-		start = 0;
-		end = TLB_FLUSH_ALL;
+		info.start = 0;
+		info.end = TLB_FLUSH_ALL;
+	} else {
+		info.start = start;
+		info.end = end;
 	}
 
 	/* This is also a barrier that synchronizes with switch_mm(). */
-	new_tlb_gen = inc_mm_tlb_gen(mm);
-
-	info = get_flush_tlb_info(mm, start, end, stride_shift, freed_tables,
-				  new_tlb_gen);
+	info.new_tlb_gen = inc_mm_tlb_gen(mm);
 
 	/*
 	 * Assert that mm_cpumask() corresponds with the loaded mm. We got one
@@ -878,9 +886,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 			(mm == this_cpu_read(cpu_tlbstate.loaded_mm)) ||
 			mm == &init_mm);
 
-	flush_tlb_on_cpus(mm_cpumask(mm), info);
-
-	put_flush_tlb_info();
+	flush_tlb_on_cpus(mm_cpumask(mm), &info);
 }
 
 
@@ -913,18 +919,18 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end)
 	    (end - start) > tlb_single_page_flush_ceiling << PAGE_SHIFT) {
 		on_each_cpu(do_flush_tlb_all, NULL, 1);
 	} else {
-		struct flush_tlb_info *info;
-
-		info = get_flush_tlb_info(NULL, start, end, 0, false, 0);
+		struct flush_tlb_info info = {
+			.mm		= NULL,
+			.start		= start,
+			.end		= end,
+		};
 
 		/*
 		 * We have to wait for the remote shootdown to be done since it
 		 * is kernel space.
 		 */
 		__on_each_cpu_mask(cpu_online_mask, do_kernel_range_flush,
-				   info, sizeof(*info), 1);
-
-		put_flush_tlb_info();
+				   &info, sizeof(info), 1);
 	}
 }
 
@@ -942,9 +948,7 @@ static const struct flush_tlb_info full_flush_tlb_info = {
 
 void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 {
-	preempt_disable();
 	flush_tlb_on_cpus(&batch->cpumask, &full_flush_tlb_info);
-	preempt_enable();
 
 	cpumask_clear(&batch->cpumask);
 }
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v2 07/12] smp: Do not mark call_function_data as shared
  2019-05-31  6:36 ` [RFC PATCH v2 07/12] smp: Do not mark call_function_data as shared Nadav Amit
@ 2019-05-31 10:17   ` Peter Zijlstra
  2019-05-31 17:50     ` Nadav Amit
  0 siblings, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2019-05-31 10:17 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andy Lutomirski, Borislav Petkov, Dave Hansen, Ingo Molnar,
	Thomas Gleixner, x86, linux-kernel, Rik van Riel

On Thu, May 30, 2019 at 11:36:40PM -0700, Nadav Amit wrote:
> cfd_data is marked as shared, but although it holds pointers to shared
> data structures, it is private per core.
> 
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Rik van Riel <riel@surriel.com>
> Cc: Andy Lutomirski <luto@kernel.org>
> Signed-off-by: Nadav Amit <namit@vmware.com>
> ---
>  kernel/smp.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/smp.c b/kernel/smp.c
> index 6b411ee86ef6..f1a358f9c34c 100644
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -33,7 +33,7 @@ struct call_function_data {
>  	cpumask_var_t		cpumask_ipi;
>  };
>  
> -static DEFINE_PER_CPU_SHARED_ALIGNED(struct call_function_data, cfd_data);
> +static DEFINE_PER_CPU(struct call_function_data, cfd_data);

Should that not be DEFINE_PER_CPU_ALIGNED() then?

>  static DEFINE_PER_CPU_SHARED_ALIGNED(struct llist_head, call_single_queue);
>  
> -- 
> 2.20.1
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v2 11/12] x86/mm/tlb: Use async and inline messages for flushing
  2019-05-31  6:36 ` [RFC PATCH v2 11/12] x86/mm/tlb: Use async and inline messages for flushing Nadav Amit
@ 2019-05-31 10:57   ` Peter Zijlstra
  2019-05-31 18:29     ` Nadav Amit
  2019-05-31 18:44     ` Andy Lutomirski
  2019-05-31 21:14   ` Andy Lutomirski
  1 sibling, 2 replies; 36+ messages in thread
From: Peter Zijlstra @ 2019-05-31 10:57 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andy Lutomirski, Borislav Petkov, Dave Hansen, Ingo Molnar,
	Thomas Gleixner, x86, linux-kernel, Dave Hansen

On Thu, May 30, 2019 at 11:36:44PM -0700, Nadav Amit wrote:
> When we flush userspace mappings, we can defer the TLB flushes, as long
> the following conditions are met:
> 
> 1. No tables are freed, since otherwise speculative page walks might
>    cause machine-checks.
> 
> 2. No one would access userspace before flush takes place. Specifically,
>    NMI handlers and kprobes would avoid accessing userspace.
> 
> Use the new SMP support to execute remote function calls with inlined
> data for the matter. The function remote TLB flushing function would be
> executed asynchronously and the local CPU would continue execution as
> soon as the IPI was delivered, before the function was actually
> executed. Since tlb_flush_info is copied, there is no risk it would
> change before the TLB flush is actually executed.
> 
> Change nmi_uaccess_okay() to check whether a remote TLB flush is
> currently in progress on this CPU by checking whether the asynchronously
> called function is the remote TLB flushing function. The current
> implementation disallows access in such cases, but it is also possible
> to flush the entire TLB in such case and allow access.

ARGGH, brain hurt. I'm not sure I fully understand this one. How is it
different from today, where the NMI can hit in the middle of the TLB
invalidation?

Also; since we're not waiting on the IPI, what prevents us from freeing
the user pages before the remote CPU is 'done' with them? Currently the
synchronous IPI is like a sync point where we *know* the remote CPU is
completely done accessing the page.

Where getting an IPI stops speculation, speculation again restarts
inside the interrupt handler, and until we've passed the INVLPG/MOV CR3,
speculation can happen on that TLB entry, even though we've already
freed and re-used the user-page.

Also, what happens if the TLB invalidation IPI is stuck behind another
smp_function_call IPI that is doing user-access?

As said,.. brain hurts.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v2 04/12] x86/mm/tlb: Flush remote and local TLBs concurrently
  2019-05-31  6:36 ` [RFC PATCH v2 04/12] x86/mm/tlb: Flush remote and local TLBs concurrently Nadav Amit
@ 2019-05-31 11:48   ` Juergen Gross
  2019-05-31 19:44     ` Nadav Amit
  0 siblings, 1 reply; 36+ messages in thread
From: Juergen Gross @ 2019-05-31 11:48 UTC (permalink / raw)
  To: Nadav Amit, Peter Zijlstra, Andy Lutomirski
  Cc: Borislav Petkov, Dave Hansen, Ingo Molnar, Thomas Gleixner, x86,
	linux-kernel, K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger,
	Sasha Levin, Paolo Bonzini, Dave Hansen, Boris Ostrovsky,
	linux-hyperv, virtualization, kvm, xen-devel

On 31/05/2019 08:36, Nadav Amit wrote:
> To improve TLB shootdown performance, flush the remote and local TLBs
> concurrently. Introduce flush_tlb_multi() that does so. The current
> flush_tlb_others() interface is kept, since paravirtual interfaces need
> to be adapted first before it can be removed. This is left for future
> work. In such PV environments, TLB flushes are not performed, at this
> time, concurrently.
> 
> Add a static key to tell whether this new interface is supported.
> 
> Cc: "K. Y. Srinivasan" <kys@microsoft.com>
> Cc: Haiyang Zhang <haiyangz@microsoft.com>
> Cc: Stephen Hemminger <sthemmin@microsoft.com>
> Cc: Sasha Levin <sashal@kernel.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: x86@kernel.org
> Cc: Juergen Gross <jgross@suse.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
> Cc: linux-hyperv@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: virtualization@lists.linux-foundation.org
> Cc: kvm@vger.kernel.org
> Cc: xen-devel@lists.xenproject.org
> Signed-off-by: Nadav Amit <namit@vmware.com>
> ---
>  arch/x86/hyperv/mmu.c                 |  2 +
>  arch/x86/include/asm/paravirt.h       |  8 +++
>  arch/x86/include/asm/paravirt_types.h |  6 ++
>  arch/x86/include/asm/tlbflush.h       |  6 ++
>  arch/x86/kernel/kvm.c                 |  1 +
>  arch/x86/kernel/paravirt.c            |  3 +
>  arch/x86/mm/tlb.c                     | 80 +++++++++++++++++++++++----
>  arch/x86/xen/mmu_pv.c                 |  2 +
>  8 files changed, 96 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
> index e65d7fe6489f..ca28b400c87c 100644
> --- a/arch/x86/hyperv/mmu.c
> +++ b/arch/x86/hyperv/mmu.c
> @@ -233,4 +233,6 @@ void hyperv_setup_mmu_ops(void)
>  	pr_info("Using hypercall for remote TLB flush\n");
>  	pv_ops.mmu.flush_tlb_others = hyperv_flush_tlb_others;
>  	pv_ops.mmu.tlb_remove_table = tlb_remove_table;
> +
> +	static_key_disable(&flush_tlb_multi_enabled.key);
>  }
> diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
> index c25c38a05c1c..192be7254457 100644
> --- a/arch/x86/include/asm/paravirt.h
> +++ b/arch/x86/include/asm/paravirt.h
> @@ -47,6 +47,8 @@ static inline void slow_down_io(void)
>  #endif
>  }
>  
> +DECLARE_STATIC_KEY_TRUE(flush_tlb_multi_enabled);
> +
>  static inline void __flush_tlb(void)
>  {
>  	PVOP_VCALL0(mmu.flush_tlb_user);
> @@ -62,6 +64,12 @@ static inline void __flush_tlb_one_user(unsigned long addr)
>  	PVOP_VCALL1(mmu.flush_tlb_one_user, addr);
>  }
>  
> +static inline void flush_tlb_multi(const struct cpumask *cpumask,
> +				   const struct flush_tlb_info *info)
> +{
> +	PVOP_VCALL2(mmu.flush_tlb_multi, cpumask, info);
> +}
> +
>  static inline void flush_tlb_others(const struct cpumask *cpumask,
>  				    const struct flush_tlb_info *info)
>  {
> diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
> index 946f8f1f1efc..3a156e63c57d 100644
> --- a/arch/x86/include/asm/paravirt_types.h
> +++ b/arch/x86/include/asm/paravirt_types.h
> @@ -211,6 +211,12 @@ struct pv_mmu_ops {
>  	void (*flush_tlb_user)(void);
>  	void (*flush_tlb_kernel)(void);
>  	void (*flush_tlb_one_user)(unsigned long addr);
> +	/*
> +	 * flush_tlb_multi() is the preferred interface. When it is used,
> +	 * flush_tlb_others() should return false.

Didn't you want to remove/change this comment?


Juergen

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v2 07/12] smp: Do not mark call_function_data as shared
  2019-05-31 10:17   ` Peter Zijlstra
@ 2019-05-31 17:50     ` Nadav Amit
  0 siblings, 0 replies; 36+ messages in thread
From: Nadav Amit @ 2019-05-31 17:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andy Lutomirski, Borislav Petkov, Dave Hansen, Ingo Molnar,
	Thomas Gleixner, the arch/x86 maintainers, LKML, Rik van Riel

> On May 31, 2019, at 3:17 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Thu, May 30, 2019 at 11:36:40PM -0700, Nadav Amit wrote:
>> cfd_data is marked as shared, but although it holds pointers to shared
>> data structures, it is private per core.
>> 
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Rik van Riel <riel@surriel.com>
>> Cc: Andy Lutomirski <luto@kernel.org>
>> Signed-off-by: Nadav Amit <namit@vmware.com>
>> ---
>> kernel/smp.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>> 
>> diff --git a/kernel/smp.c b/kernel/smp.c
>> index 6b411ee86ef6..f1a358f9c34c 100644
>> --- a/kernel/smp.c
>> +++ b/kernel/smp.c
>> @@ -33,7 +33,7 @@ struct call_function_data {
>> 	cpumask_var_t		cpumask_ipi;
>> };
>> 
>> -static DEFINE_PER_CPU_SHARED_ALIGNED(struct call_function_data, cfd_data);
>> +static DEFINE_PER_CPU(struct call_function_data, cfd_data);
> 
> Should that not be DEFINE_PER_CPU_ALIGNED() then?

Yes. I don’t know what I was thinking. I’ll change it.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v2 11/12] x86/mm/tlb: Use async and inline messages for flushing
  2019-05-31 10:57   ` Peter Zijlstra
@ 2019-05-31 18:29     ` Nadav Amit
  2019-05-31 19:20       ` Jann Horn
  2019-05-31 18:44     ` Andy Lutomirski
  1 sibling, 1 reply; 36+ messages in thread
From: Nadav Amit @ 2019-05-31 18:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andy Lutomirski, Borislav Petkov, Dave Hansen, Ingo Molnar,
	Thomas Gleixner, the arch/x86 maintainers, LKML, Dave Hansen,
	Jann Horn

[ +Jann Horn ]

> On May 31, 2019, at 3:57 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Thu, May 30, 2019 at 11:36:44PM -0700, Nadav Amit wrote:
>> When we flush userspace mappings, we can defer the TLB flushes, as long
>> the following conditions are met:
>> 
>> 1. No tables are freed, since otherwise speculative page walks might
>>   cause machine-checks.
>> 
>> 2. No one would access userspace before flush takes place. Specifically,
>>   NMI handlers and kprobes would avoid accessing userspace.
>> 
>> Use the new SMP support to execute remote function calls with inlined
>> data for the matter. The function remote TLB flushing function would be
>> executed asynchronously and the local CPU would continue execution as
>> soon as the IPI was delivered, before the function was actually
>> executed. Since tlb_flush_info is copied, there is no risk it would
>> change before the TLB flush is actually executed.
>> 
>> Change nmi_uaccess_okay() to check whether a remote TLB flush is
>> currently in progress on this CPU by checking whether the asynchronously
>> called function is the remote TLB flushing function. The current
>> implementation disallows access in such cases, but it is also possible
>> to flush the entire TLB in such case and allow access.
> 
> ARGGH, brain hurt. I'm not sure I fully understand this one. How is it
> different from today, where the NMI can hit in the middle of the TLB
> invalidation?

Let’s assume that a page mapping is removed so a PTE is changed from present
to non-present. Today, the page would only be recycled after all the cores
flushed their TLB. So NMI handlers accessing the page are safe - they might
either access the previously mapped page (which still has not been reused)
or fault, but they cannot access the wrong data.

Following this change it might happen that one core would still have stale
mappings for a page that has already been recycled for a new use (e.g., it is
allocated to a different process). The core that has the stale mapping would
find itself in the interrupt handler. But we must ensure the NMI handler
does not access this page.

> Also; since we're not waiting on the IPI, what prevents us from freeing
> the user pages before the remote CPU is 'done' with them? Currently the
> synchronous IPI is like a sync point where we *know* the remote CPU is
> completely done accessing the page.

We might free them before the remote TLB flush is done, but only after the
IPI has been acknowledged and right before the actual TLB flush is about to
start. See flush_smp_call_function_queue().

                if (csd->flags & CSD_FLAG_SYNCHRONOUS) {
                        func(info);
                        csd_unlock(csd);
                } else {
                        this_cpu_write(async_func_in_progress, csd->func);
                        csd_unlock(csd); (*)
                        func(info);

    
We do wait for the IPI to be received and the entry of the specific TLB
invalidation to be processed. Following the instruction that is marked with
(*), the initiator of the TLB shootdown would be able to continue, but not
before. We just do not wait for the actual flush (called by func()) to take
place.

> Where getting an IPI stops speculation, speculation again restarts
> inside the interrupt handler, and until we've passed the INVLPG/MOV CR3,
> speculation can happen on that TLB entry, even though we've already
> freed and re-used the user-page.

Let’s consider what the impact of such speculation might be. Virtually
indexed caches might have the wrong data. But this data should not be used
before the flush, since we are already in the interrupt handler, and we
should be able to prevent accesses to userspace.

A new vector for Spectre-type attacks might be opened, but I don’t see how.

A #MC might be caused. I tried to avoid it by not allowing freeing of
page-tables in such a way. Did I miss something else? Some interaction with
MTRR changes? I’ll think about it some more, but I don’t see how.


> Also, what happens if the TLB invalidation IPI is stuck behind another
> smp_function_call IPI that is doing user-access?

It’s not exactly the IPI but the call_single_data. In such a case, as shown
above, csd_unlock() is only called once you are about to start handling
the specific TLB flush.

> 
> As said,.. brain hurts.

I understand. Mine too. The textbook says that we need to flush all the TLBs
before we can recycle the page, be guaranteed that a write-protected page will
not change, and so on. I think that since we should be able to guarantee
that the userspace addresses, which are invalidated, would never be used by
the kernel before the flush, we should be safe. After the flush, all the
artifacts of speculative caching should disappear.

I might be wrong, and I do question myself, but I don’t see how.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v2 11/12] x86/mm/tlb: Use async and inline messages for flushing
  2019-05-31 10:57   ` Peter Zijlstra
  2019-05-31 18:29     ` Nadav Amit
@ 2019-05-31 18:44     ` Andy Lutomirski
  2019-05-31 19:31       ` Nadav Amit
  1 sibling, 1 reply; 36+ messages in thread
From: Andy Lutomirski @ 2019-05-31 18:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nadav Amit, Andy Lutomirski, Borislav Petkov, Dave Hansen,
	Ingo Molnar, Thomas Gleixner, x86, linux-kernel, Dave Hansen



> On May 31, 2019, at 3:57 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
>> On Thu, May 30, 2019 at 11:36:44PM -0700, Nadav Amit wrote:
>> When we flush userspace mappings, we can defer the TLB flushes, as long
>> the following conditions are met:
>> 
>> 1. No tables are freed, since otherwise speculative page walks might
>>   cause machine-checks.
>> 
>> 2. No one would access userspace before flush takes place. Specifically,
>>   NMI handlers and kprobes would avoid accessing userspace.
>> 
>> Use the new SMP support to execute remote function calls with inlined
>> data for the matter. The function remote TLB flushing function would be
>> executed asynchronously and the local CPU would continue execution as
>> soon as the IPI was delivered, before the function was actually
>> executed. Since tlb_flush_info is copied, there is no risk it would
>> change before the TLB flush is actually executed.
>> 
>> Change nmi_uaccess_okay() to check whether a remote TLB flush is
>> currently in progress on this CPU by checking whether the asynchronously
>> called function is the remote TLB flushing function. The current
>> implementation disallows access in such cases, but it is also possible
>> to flush the entire TLB in such case and allow access.
> 
> ARGGH, brain hurt. I'm not sure I fully understand this one. How is it
> different from today, where the NMI can hit in the middle of the TLB
> invalidation?
> 
> Also; since we're not waiting on the IPI, what prevents us from freeing
> the user pages before the remote CPU is 'done' with them? Currently the
> synchronous IPI is like a sync point where we *know* the remote CPU is
> completely done accessing the page.
> 
> Where getting an IPI stops speculation, speculation again restarts
> inside the interrupt handler, and until we've passed the INVLPG/MOV CR3,
> speculation can happen on that TLB entry, even though we've already
> freed and re-used the user-page.
> 
> Also, what happens if the TLB invalidation IPI is stuck behind another
> smp_function_call IPI that is doing user-access?
> 
> As said,.. brain hurts.

Speculation aside, any code doing dirty tracking needs the flush to happen for real before it reads the dirty bit.

How does this patch guarantee that the flush is really done before someone depends on it?

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v2 08/12] x86/tlb: Privatize cpu_tlbstate
  2019-05-31  6:36 ` [RFC PATCH v2 08/12] x86/tlb: Privatize cpu_tlbstate Nadav Amit
@ 2019-05-31 18:48   ` Andy Lutomirski
  2019-05-31 19:42     ` Nadav Amit
  0 siblings, 1 reply; 36+ messages in thread
From: Andy Lutomirski @ 2019-05-31 18:48 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Peter Zijlstra, Andy Lutomirski, Borislav Petkov, Dave Hansen,
	Ingo Molnar, Thomas Gleixner, x86, linux-kernel, Dave Hansen


> On May 30, 2019, at 11:36 PM, Nadav Amit <namit@vmware.com> wrote:
> 
> cpu_tlbstate is mostly private and only the variable is_lazy is shared.
> This causes some false-sharing when TLB flushes are performed.
> 
> Break cpu_tlbstate intro cpu_tlbstate and cpu_tlbstate_shared, and mark
> each one accordingly.
> 

At some point we’ll want to do a better job with our PV flush code, and I suspect we’ll end up with more of this shared again.

> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Nadav Amit <namit@vmware.com>
> ---
> arch/x86/include/asm/tlbflush.h | 40 ++++++++++++++++++---------------
> arch/x86/mm/init.c              |  2 +-
> arch/x86/mm/tlb.c               | 15 ++++++++-----
> 3 files changed, 32 insertions(+), 25 deletions(-)
> 
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index 79272938cf79..a1fea36d5292 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -178,23 +178,6 @@ struct tlb_state {
>    u16 loaded_mm_asid;
>    u16 next_asid;
> 
> -    /*
> -     * We can be in one of several states:
> -     *
> -     *  - Actively using an mm.  Our CPU's bit will be set in
> -     *    mm_cpumask(loaded_mm) and is_lazy == false;
> -     *
> -     *  - Not using a real mm.  loaded_mm == &init_mm.  Our CPU's bit
> -     *    will not be set in mm_cpumask(&init_mm) and is_lazy == false.
> -     *
> -     *  - Lazily using a real mm.  loaded_mm != &init_mm, our bit
> -     *    is set in mm_cpumask(loaded_mm), but is_lazy == true.
> -     *    We're heuristically guessing that the CR3 load we
> -     *    skipped more than makes up for the overhead added by
> -     *    lazy mode.
> -     */
> -    bool is_lazy;
> -
>    /*
>     * If set we changed the page tables in such a way that we
>     * needed an invalidation of all contexts (aka. PCIDs / ASIDs).
> @@ -240,7 +223,27 @@ struct tlb_state {
>     */
>    struct tlb_context ctxs[TLB_NR_DYN_ASIDS];
> };
> -DECLARE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate);
> +DECLARE_PER_CPU_ALIGNED(struct tlb_state, cpu_tlbstate);
> +
> +struct tlb_state_shared {
> +    /*
> +     * We can be in one of several states:
> +     *
> +     *  - Actively using an mm.  Our CPU's bit will be set in
> +     *    mm_cpumask(loaded_mm) and is_lazy == false;
> +     *
> +     *  - Not using a real mm.  loaded_mm == &init_mm.  Our CPU's bit
> +     *    will not be set in mm_cpumask(&init_mm) and is_lazy == false.
> +     *
> +     *  - Lazily using a real mm.  loaded_mm != &init_mm, our bit
> +     *    is set in mm_cpumask(loaded_mm), but is_lazy == true.
> +     *    We're heuristically guessing that the CR3 load we
> +     *    skipped more than makes up for the overhead added by
> +     *    lazy mode.
> +     */
> +    bool is_lazy;
> +};
> +DECLARE_PER_CPU_SHARED_ALIGNED(struct tlb_state_shared, cpu_tlbstate_shared);
> 
> /*
>  * Blindly accessing user memory from NMI context can be dangerous
> @@ -439,6 +442,7 @@ static inline void __native_flush_tlb_one_user(unsigned long addr)
> {
>    u32 loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
> 
> +    //invpcid_flush_one(kern_pcid(loaded_mm_asid), addr);
>    asm volatile("invlpg (%0)" ::"r" (addr) : "memory");
> 
>    if (!static_cpu_has(X86_FEATURE_PTI))
> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> index fd10d91a6115..34027f36a944 100644
> --- a/arch/x86/mm/init.c
> +++ b/arch/x86/mm/init.c
> @@ -951,7 +951,7 @@ void __init zone_sizes_init(void)
>    free_area_init_nodes(max_zone_pfns);
> }
> 
> -__visible DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate) = {
> +__visible DEFINE_PER_CPU_ALIGNED(struct tlb_state, cpu_tlbstate) = {
>    .loaded_mm = &init_mm,
>    .next_asid = 1,
>    .cr4 = ~0UL,    /* fail hard if we screw up cr4 shadow initialization */
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index b0c3065aad5d..755b2bb3e5b6 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -144,7 +144,7 @@ void leave_mm(int cpu)
>        return;
> 
>    /* Warn if we're not lazy. */
> -    WARN_ON(!this_cpu_read(cpu_tlbstate.is_lazy));
> +    WARN_ON(!this_cpu_read(cpu_tlbstate_shared.is_lazy));
> 
>    switch_mm(NULL, &init_mm, NULL);
> }
> @@ -276,7 +276,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
> {
>    struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm);
>    u16 prev_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
> -    bool was_lazy = this_cpu_read(cpu_tlbstate.is_lazy);
> +    bool was_lazy = this_cpu_read(cpu_tlbstate_shared.is_lazy);
>    unsigned cpu = smp_processor_id();
>    u64 next_tlb_gen;
>    bool need_flush;
> @@ -321,7 +321,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
>        __flush_tlb_all();
>    }
> #endif
> -    this_cpu_write(cpu_tlbstate.is_lazy, false);
> +    this_cpu_write(cpu_tlbstate_shared.is_lazy, false);
> 
>    /*
>     * The membarrier system call requires a full memory barrier and
> @@ -462,7 +462,7 @@ void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
>    if (this_cpu_read(cpu_tlbstate.loaded_mm) == &init_mm)
>        return;
> 
> -    this_cpu_write(cpu_tlbstate.is_lazy, true);
> +    this_cpu_write(cpu_tlbstate_shared.is_lazy, true);
> }
> 
> /*
> @@ -543,7 +543,7 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
>    VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].ctx_id) !=
>           loaded_mm->context.ctx_id);
> 
> -    if (this_cpu_read(cpu_tlbstate.is_lazy)) {
> +    if (this_cpu_read(cpu_tlbstate_shared.is_lazy)) {
>        /*
>         * We're in lazy mode.  We need to at least flush our
>         * paging-structure cache to avoid speculatively reading
> @@ -659,11 +659,14 @@ static void flush_tlb_func_remote(void *info)
> 
> static inline bool tlb_is_not_lazy(int cpu)
> {
> -    return !per_cpu(cpu_tlbstate.is_lazy, cpu);
> +    return !per_cpu(cpu_tlbstate_shared.is_lazy, cpu);
> }
> 
> static DEFINE_PER_CPU(cpumask_t, flush_tlb_mask);
> 
> +DEFINE_PER_CPU_ALIGNED(struct tlb_state_shared, cpu_tlbstate_shared);
> +EXPORT_PER_CPU_SYMBOL(cpu_tlbstate_shared);
> +
> void native_flush_tlb_multi(const struct cpumask *cpumask,
>                const struct flush_tlb_info *info)
> {
> -- 
> 2.20.1
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v2 11/12] x86/mm/tlb: Use async and inline messages for flushing
  2019-05-31 18:29     ` Nadav Amit
@ 2019-05-31 19:20       ` Jann Horn
  2019-05-31 20:04         ` Nadav Amit
  0 siblings, 1 reply; 36+ messages in thread
From: Jann Horn @ 2019-05-31 19:20 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Peter Zijlstra, Andy Lutomirski, Borislav Petkov, Dave Hansen,
	Ingo Molnar, Thomas Gleixner, the arch/x86 maintainers, LKML,
	Dave Hansen

On Fri, May 31, 2019 at 8:29 PM Nadav Amit <namit@vmware.com> wrote:
>
> [ +Jann Horn ]
>
> > On May 31, 2019, at 3:57 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Thu, May 30, 2019 at 11:36:44PM -0700, Nadav Amit wrote:
> >> When we flush userspace mappings, we can defer the TLB flushes, as long
> >> the following conditions are met:
> >>
> >> 1. No tables are freed, since otherwise speculative page walks might
> >>   cause machine-checks.
> >>
> >> 2. No one would access userspace before flush takes place. Specifically,
> >>   NMI handlers and kprobes would avoid accessing userspace.
[...]
> A #MC might be caused. I tried to avoid it by not allowing freeing of
> page-tables in such way. Did I miss something else? Some interaction with
> MTRR changes? I’ll think about it some more, but I don’t see how.

I don't really know much about this topic, but here's a random comment
since you cc'ed me: If the physical memory range was freed and
reallocated, could you end up with speculatively executed cached
memory reads from I/O memory? (And if so, would that be bad?)

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v2 11/12] x86/mm/tlb: Use async and inline messages for flushing
  2019-05-31 18:44     ` Andy Lutomirski
@ 2019-05-31 19:31       ` Nadav Amit
  2019-05-31 20:13         ` Dave Hansen
  0 siblings, 1 reply; 36+ messages in thread
From: Nadav Amit @ 2019-05-31 19:31 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Peter Zijlstra, Andy Lutomirski, Borislav Petkov, Dave Hansen,
	Ingo Molnar, Thomas Gleixner, x86, linux-kernel, Dave Hansen

> On May 31, 2019, at 11:44 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> 
> 
> 
>> On May 31, 2019, at 3:57 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>> 
>>> On Thu, May 30, 2019 at 11:36:44PM -0700, Nadav Amit wrote:
>>> When we flush userspace mappings, we can defer the TLB flushes, as long
>>> the following conditions are met:
>>> 
>>> 1. No tables are freed, since otherwise speculative page walks might
>>>  cause machine-checks.
>>> 
>>> 2. No one would access userspace before flush takes place. Specifically,
>>>  NMI handlers and kprobes would avoid accessing userspace.
>>> 
>>> Use the new SMP support to execute remote function calls with inlined
>>> data for the matter. The function remote TLB flushing function would be
>>> executed asynchronously and the local CPU would continue execution as
>>> soon as the IPI was delivered, before the function was actually
>>> executed. Since tlb_flush_info is copied, there is no risk it would
>>> change before the TLB flush is actually executed.
>>> 
>>> Change nmi_uaccess_okay() to check whether a remote TLB flush is
>>> currently in progress on this CPU by checking whether the asynchronously
>>> called function is the remote TLB flushing function. The current
>>> implementation disallows access in such cases, but it is also possible
>>> to flush the entire TLB in such case and allow access.
>> 
>> ARGGH, brain hurt. I'm not sure I fully understand this one. How is it
>> different from today, where the NMI can hit in the middle of the TLB
>> invalidation?
>> 
>> Also; since we're not waiting on the IPI, what prevents us from freeing
>> the user pages before the remote CPU is 'done' with them? Currently the
>> synchronous IPI is like a sync point where we *know* the remote CPU is
>> completely done accessing the page.
>> 
>> Where getting an IPI stops speculation, speculation again restarts
>> inside the interrupt handler, and until we've passed the INVLPG/MOV CR3,
>> speculation can happen on that TLB entry, even though we've already
>> freed and re-used the user-page.
>> 
>> Also, what happens if the TLB invalidation IPI is stuck behind another
>> smp_function_call IPI that is doing user-access?
>> 
>> As said,.. brain hurts.
> 
> Speculation aside, any code doing dirty tracking needs the flush to happen
> for real before it reads the dirty bit.
> 
> How does this patch guarantee that the flush is really done before someone
> depends on it?

I was always under the impression that the dirty-bit is pass-through - the
A/D-assist walks the tables and sets the dirty bit upon access. Otherwise,
what happens when you invalidate the PTE, and have already marked the PTE as
non-present? Would the CPU set the dirty-bit at this point?

In this regard, I remember this thread of Dave Hansen [1], which also seems
to me to support the notion that the dirty-bit is set on write and not on
INVLPG.

Looking at the SDM on the "Virtual TLB Scheme” also seems to support this
claim. The guest dirty bit is set in section 32.3.5.2 "Response to Page
Faults”, and not in section 32.3.5. "Response to Uses of INVLPG”.

Am I wrong?


[1] https://groups.google.com/forum/#!topic/linux.kernel/HBgh0uT24K8


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v2 08/12] x86/tlb: Privatize cpu_tlbstate
  2019-05-31 18:48   ` Andy Lutomirski
@ 2019-05-31 19:42     ` Nadav Amit
  0 siblings, 0 replies; 36+ messages in thread
From: Nadav Amit @ 2019-05-31 19:42 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Peter Zijlstra, Andy Lutomirski, Borislav Petkov, Dave Hansen,
	Ingo Molnar, Thomas Gleixner, the arch/x86 maintainers, LKML,
	Dave Hansen

> On May 31, 2019, at 11:48 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> 
> 
>> On May 30, 2019, at 11:36 PM, Nadav Amit <namit@vmware.com> wrote:
>> 
>> cpu_tlbstate is mostly private and only the variable is_lazy is shared.
>> This causes some false-sharing when TLB flushes are performed.
>> 
>> Break cpu_tlbstate intro cpu_tlbstate and cpu_tlbstate_shared, and mark
>> each one accordingly.
> 
> At some point we’ll want to do a better job with our PV flush code, and I
> suspect we’ll end up with more of this shared again.

In the usual use-case, when you flush the TLB, will you write something to
cpu_tlbstate that should be visible to other cores? I don’t see why, even if
PV is used.

Anyhow, I was always under the impression that PV should not punish
bare-metal.

The other option is to take cpu_tlbstate and rearrange it so the shared
stuff will not be next to the private. I just don’t know how to do it
without making an assumption about the cacheline size.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v2 04/12] x86/mm/tlb: Flush remote and local TLBs concurrently
  2019-05-31 11:48   ` Juergen Gross
@ 2019-05-31 19:44     ` Nadav Amit
  0 siblings, 0 replies; 36+ messages in thread
From: Nadav Amit @ 2019-05-31 19:44 UTC (permalink / raw)
  To: Juergen Gross
  Cc: Peter Zijlstra, Andy Lutomirski, Borislav Petkov, Dave Hansen,
	Ingo Molnar, Thomas Gleixner, the arch/x86 maintainers, LKML,
	K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger, Sasha Levin,
	Paolo Bonzini, Dave Hansen, Boris Ostrovsky, linux-hyperv,
	virtualization, kvm, xen-devel

> On May 31, 2019, at 4:48 AM, Juergen Gross <jgross@suse.com> wrote:
> 
> On 31/05/2019 08:36, Nadav Amit wrote:
>> 
>> --- a/arch/x86/include/asm/paravirt_types.h
>> +++ b/arch/x86/include/asm/paravirt_types.h
>> @@ -211,6 +211,12 @@ struct pv_mmu_ops {
>> 	void (*flush_tlb_user)(void);
>> 	void (*flush_tlb_kernel)(void);
>> 	void (*flush_tlb_one_user)(unsigned long addr);
>> +	/*
>> +	 * flush_tlb_multi() is the preferred interface. When it is used,
>> +	 * flush_tlb_others() should return false.
> 
> Didn't you want to remove/change this comment?

Yes! Sorry for that. Fixed now.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v2 11/12] x86/mm/tlb: Use async and inline messages for flushing
  2019-05-31 19:20       ` Jann Horn
@ 2019-05-31 20:04         ` Nadav Amit
  2019-05-31 20:37           ` Jann Horn
  0 siblings, 1 reply; 36+ messages in thread
From: Nadav Amit @ 2019-05-31 20:04 UTC (permalink / raw)
  To: Jann Horn
  Cc: Peter Zijlstra, Andy Lutomirski, Borislav Petkov, Dave Hansen,
	Ingo Molnar, Thomas Gleixner, the arch/x86 maintainers, LKML,
	Dave Hansen

> On May 31, 2019, at 12:20 PM, Jann Horn <jannh@google.com> wrote:
> 
> On Fri, May 31, 2019 at 8:29 PM Nadav Amit <namit@vmware.com> wrote:
>> [ +Jann Horn ]
>> 
>>> On May 31, 2019, at 3:57 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>> 
>>> On Thu, May 30, 2019 at 11:36:44PM -0700, Nadav Amit wrote:
>>>> When we flush userspace mappings, we can defer the TLB flushes, as long
>>>> the following conditions are met:
>>>> 
>>>> 1. No tables are freed, since otherwise speculative page walks might
>>>>  cause machine-checks.
>>>> 
>>>> 2. No one would access userspace before flush takes place. Specifically,
>>>>  NMI handlers and kprobes would avoid accessing userspace.
> [...]
>> A #MC might be caused. I tried to avoid it by not allowing freeing of
>> page-tables in such way. Did I miss something else? Some interaction with
>> MTRR changes? I’ll think about it some more, but I don’t see how.
> 
> I don't really know much about this topic, but here's a random comment
> since you cc'ed me: If the physical memory range was freed and
> reallocated, could you end up with speculatively executed cached
> memory reads from I/O memory? (And if so, would that be bad?)

Thanks. I thought that your experience with TLB page-freeing bugs may
be valuable, and you frequently find my mistakes. ;-)

Yes, speculatively executed cached reads from I/O memory are a concern.
IIRC they caused #MC on AMD. If the page-table structure is not changed,
but only PTEs are changed, I don’t see how it can be a problem. I also
looked at the MTRR setting code, but I don’t see a concrete problem.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v2 11/12] x86/mm/tlb: Use async and inline messages for flushing
  2019-05-31 19:31       ` Nadav Amit
@ 2019-05-31 20:13         ` Dave Hansen
  2019-05-31 20:37           ` Andy Lutomirski
  0 siblings, 1 reply; 36+ messages in thread
From: Dave Hansen @ 2019-05-31 20:13 UTC (permalink / raw)
  To: Nadav Amit, Andy Lutomirski
  Cc: Peter Zijlstra, Andy Lutomirski, Borislav Petkov, Ingo Molnar,
	Thomas Gleixner, x86, linux-kernel, Dave Hansen

On 5/31/19 12:31 PM, Nadav Amit wrote:
>> On May 31, 2019, at 11:44 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>>
>>
>>
>>> On May 31, 2019, at 3:57 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>>
>>>> On Thu, May 30, 2019 at 11:36:44PM -0700, Nadav Amit wrote:
>>>> When we flush userspace mappings, we can defer the TLB flushes, as long
>>>> the following conditions are met:
>>>>
>>>> 1. No tables are freed, since otherwise speculative page walks might
>>>>  cause machine-checks.
>>>>
>>>> 2. No one would access userspace before flush takes place. Specifically,
>>>>  NMI handlers and kprobes would avoid accessing userspace.
>>>>
>>>> Use the new SMP support to execute remote function calls with inlined
>>>> data for the matter. The function remote TLB flushing function would be
>>>> executed asynchronously and the local CPU would continue execution as
>>>> soon as the IPI was delivered, before the function was actually
>>>> executed. Since tlb_flush_info is copied, there is no risk it would
>>>> change before the TLB flush is actually executed.
>>>>
>>>> Change nmi_uaccess_okay() to check whether a remote TLB flush is
>>>> currently in progress on this CPU by checking whether the asynchronously
>>>> called function is the remote TLB flushing function. The current
>>>> implementation disallows access in such cases, but it is also possible
>>>> to flush the entire TLB in such case and allow access.
>>>
>>> ARGGH, brain hurt. I'm not sure I fully understand this one. How is it
>>> different from today, where the NMI can hit in the middle of the TLB
>>> invalidation?
>>>
>>> Also; since we're not waiting on the IPI, what prevents us from freeing
>>> the user pages before the remote CPU is 'done' with them? Currently the
>>> synchronous IPI is like a sync point where we *know* the remote CPU is
>>> completely done accessing the page.
>>>
>>> Where getting an IPI stops speculation, speculation again restarts
>>> inside the interrupt handler, and until we've passed the INVLPG/MOV CR3,
>>> speculation can happen on that TLB entry, even though we've already
>>> freed and re-used the user-page.
>>>
>>> Also, what happens if the TLB invalidation IPI is stuck behind another
>>> smp_function_call IPI that is doing user-access?
>>>
>>> As said,.. brain hurts.
>>
>> Speculation aside, any code doing dirty tracking needs the flush to happen
>> for real before it reads the dirty bit.
>>
>> How does this patch guarantee that the flush is really done before someone
>> depends on it?
> 
> I was always under the impression that the dirty-bit is pass-through - the
> A/D-assist walks the tables and sets the dirty bit upon access. Otherwise,
> what happens when you invalidate the PTE, and have already marked the PTE as
> non-present? Would the CPU set the dirty-bit at this point?

Modulo bugs^Werrata...  No.  What actually happens is that a
try-to-set-dirty-bit page table walk acts just like a TLB miss.  The old
contents of the TLB are discarded and only the in-memory contents matter
for forward progress.  If Present=0 when the PTE is reached, you'll get
a normal Present=0 page fault.

> In this regard, I remember this thread of Dave Hansen [1], which also seems
> to me as supporting the notion the dirty-bit is set on write and not on
> INVLPG.

... and that's the erratum I was hoping you wouldn't mention. :)

But, yeah, I don't think it's possible to set the Dirty bit on INVLPG.
The bits are set on establishing TLB entries, not on evicting or
flushing them.

I hope that clears it up.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v2 11/12] x86/mm/tlb: Use async and inline messages for flushing
  2019-05-31 20:13         ` Dave Hansen
@ 2019-05-31 20:37           ` Andy Lutomirski
  2019-05-31 20:42             ` Nadav Amit
  2019-05-31 21:06             ` Dave Hansen
  0 siblings, 2 replies; 36+ messages in thread
From: Andy Lutomirski @ 2019-05-31 20:37 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Nadav Amit, Peter Zijlstra, Andy Lutomirski, Borislav Petkov,
	Ingo Molnar, Thomas Gleixner, x86, linux-kernel, Dave Hansen

On Fri, May 31, 2019 at 1:13 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 5/31/19 12:31 PM, Nadav Amit wrote:
> >> On May 31, 2019, at 11:44 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> >>
> >>
> >>
> >>> On May 31, 2019, at 3:57 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> >>>
> >>>> On Thu, May 30, 2019 at 11:36:44PM -0700, Nadav Amit wrote:
> >>>> When we flush userspace mappings, we can defer the TLB flushes, as long
> >>>> the following conditions are met:
> >>>>
> >>>> 1. No tables are freed, since otherwise speculative page walks might
> >>>>  cause machine-checks.
> >>>>
> >>>> 2. No one would access userspace before flush takes place. Specifically,
> >>>>  NMI handlers and kprobes would avoid accessing userspace.
> >>>>
> >>>> Use the new SMP support to execute remote function calls with inlined
> >>>> data for the matter. The function remote TLB flushing function would be
> >>>> executed asynchronously and the local CPU would continue execution as
> >>>> soon as the IPI was delivered, before the function was actually
> >>>> executed. Since tlb_flush_info is copied, there is no risk it would
> >>>> change before the TLB flush is actually executed.
> >>>>
> >>>> Change nmi_uaccess_okay() to check whether a remote TLB flush is
> >>>> currently in progress on this CPU by checking whether the asynchronously
> >>>> called function is the remote TLB flushing function. The current
> >>>> implementation disallows access in such cases, but it is also possible
> >>>> to flush the entire TLB in such case and allow access.
> >>>
> >>> ARGGH, brain hurt. I'm not sure I fully understand this one. How is it
> >>> different from today, where the NMI can hit in the middle of the TLB
> >>> invalidation?
> >>>
> >>> Also; since we're not waiting on the IPI, what prevents us from freeing
> >>> the user pages before the remote CPU is 'done' with them? Currently the
> >>> synchronous IPI is like a sync point where we *know* the remote CPU is
> >>> completely done accessing the page.
> >>>
> >>> Where getting an IPI stops speculation, speculation again restarts
> >>> inside the interrupt handler, and until we've passed the INVLPG/MOV CR3,
> >>> speculation can happen on that TLB entry, even though we've already
> >>> freed and re-used the user-page.
> >>>
> >>> Also, what happens if the TLB invalidation IPI is stuck behind another
> >>> smp_function_call IPI that is doing user-access?
> >>>
> >>> As said,.. brain hurts.
> >>
> >> Speculation aside, any code doing dirty tracking needs the flush to happen
> >> for real before it reads the dirty bit.
> >>
> >> How does this patch guarantee that the flush is really done before someone
> >> depends on it?
> >
> > I was always under the impression that the dirty-bit is pass-through - the
> > A/D-assist walks the tables and sets the dirty bit upon access. Otherwise,
> > what happens when you invalidate the PTE, and have already marked the PTE as
> > non-present? Would the CPU set the dirty-bit at this point?
>
> Modulo bugs^Werrata...  No.  What actually happens is that a
> try-to-set-dirty-bit page table walk acts just like a TLB miss.  The old
> contents of the TLB are discarded and only the in-memory contents matter
> for forward progress.  If Present=0 when the PTE is reached, you'll get
> a normal Present=0 page fault.

Wait, does that mean that you can do a lock cmpxchg or similar to
clear the dirty and writable bits together and, if the dirty bit was
clear, skip the TLB flush?  If so, nifty!  Modulo errata, of course.
And I seem to remember some exceptions relating to CET shadow stack
involving the dirty bit being set on not-present pages.
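
Concretely, something along these lines (an untested x86 sketch; the
helper name is made up, and it ignores the software issues raised later
in the thread):

	/*
	 * Drop Dirty and Writable atomically; return true if the TLB flush
	 * can be skipped because no writable translation was established.
	 */
	static bool pte_wrprotect_clean(pte_t *ptep)
	{
		pteval_t old = READ_ONCE(ptep->pte);
		pteval_t new, prev;

		for (;;) {
			new = old & ~(_PAGE_DIRTY | _PAGE_RW);
			prev = cmpxchg(&ptep->pte, old, new);
			if (prev == old)
				break;
			old = prev;
		}

		/* Dirty clear in the value we replaced -> nothing writable
		 * was ever cached for this PTE. */
		return !(old & _PAGE_DIRTY);
	}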

>
> > In this regard, I remember this thread of Dave Hansen [1], which also seems
> > to me as supporting the notion the dirty-bit is set on write and not on
> > INVLPG.
>
> ... and that's the erratum I was hoping you wouldn't mention. :)
>
> But, yeah, I don't think it's possible to set the Dirty bit on INVLPG.
> The bits are set on establishing TLB entries, not on evicting or
> flushing them.
>
> I hope that clears it up.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v2 11/12] x86/mm/tlb: Use async and inline messages for flushing
  2019-05-31 20:04         ` Nadav Amit
@ 2019-05-31 20:37           ` Jann Horn
  0 siblings, 0 replies; 36+ messages in thread
From: Jann Horn @ 2019-05-31 20:37 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Peter Zijlstra, Andy Lutomirski, Borislav Petkov, Dave Hansen,
	Ingo Molnar, Thomas Gleixner, the arch/x86 maintainers, LKML,
	Dave Hansen

On Fri, May 31, 2019 at 10:04 PM Nadav Amit <namit@vmware.com> wrote:
> > On May 31, 2019, at 12:20 PM, Jann Horn <jannh@google.com> wrote:
> > On Fri, May 31, 2019 at 8:29 PM Nadav Amit <namit@vmware.com> wrote:
> >> [ +Jann Horn ]
> >>
> >>> On May 31, 2019, at 3:57 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> >>>
> >>> On Thu, May 30, 2019 at 11:36:44PM -0700, Nadav Amit wrote:
> >>>> When we flush userspace mappings, we can defer the TLB flushes, as long
> >>>> the following conditions are met:
> >>>>
> >>>> 1. No tables are freed, since otherwise speculative page walks might
> >>>>  cause machine-checks.
> >>>>
> >>>> 2. No one would access userspace before flush takes place. Specifically,
> >>>>  NMI handlers and kprobes would avoid accessing userspace.
> > [...]
> >> A #MC might be caused. I tried to avoid it by not allowing freeing of
> >> page-tables in such way. Did I miss something else? Some interaction with
> >> MTRR changes? I’ll think about it some more, but I don’t see how.
> >
> > I don't really know much about this topic, but here's a random comment
> > since you cc'ed me: If the physical memory range was freed and
> > reallocated, could you end up with speculatively executed cached
> > memory reads from I/O memory? (And if so, would that be bad?)
>
> Thanks. I thought that your experience with TLB page-freeing bugs may
> be valuable, and you frequently find my mistakes. ;-)
>
> Yes, speculatively executed cached reads from the I/O memory are a concern.
> IIRC they caused #MC on AMD. If page-tables are not changes, but only PTEs
> are changed, I don’t see how it can be a problem. I also looked at the MTRR
> setting code, but I don’t see a concrete problem.

Can the *physical memory range* not be freed and assigned to another
device? Like, when you mess around with memory hotplug and PCI hotplug?

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v2 11/12] x86/mm/tlb: Use async and inline messages for flushing
  2019-05-31 20:37           ` Andy Lutomirski
@ 2019-05-31 20:42             ` Nadav Amit
  2019-05-31 21:06             ` Dave Hansen
  1 sibling, 0 replies; 36+ messages in thread
From: Nadav Amit @ 2019-05-31 20:42 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, Peter Zijlstra, Borislav Petkov, Ingo Molnar,
	Thomas Gleixner, x86, linux-kernel, Dave Hansen

> On May 31, 2019, at 1:37 PM, Andy Lutomirski <luto@kernel.org> wrote:
> 
> On Fri, May 31, 2019 at 1:13 PM Dave Hansen <dave.hansen@intel.com> wrote:
>> On 5/31/19 12:31 PM, Nadav Amit wrote:
>>>> On May 31, 2019, at 11:44 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>> 
>>>> 
>>>> 
>>>>> On May 31, 2019, at 3:57 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>>>> 
>>>>>> On Thu, May 30, 2019 at 11:36:44PM -0700, Nadav Amit wrote:
>>>>>> When we flush userspace mappings, we can defer the TLB flushes, as long
>>>>>> the following conditions are met:
>>>>>> 
>>>>>> 1. No tables are freed, since otherwise speculative page walks might
>>>>>> cause machine-checks.
>>>>>> 
>>>>>> 2. No one would access userspace before flush takes place. Specifically,
>>>>>> NMI handlers and kprobes would avoid accessing userspace.
>>>>>> 
>>>>>> Use the new SMP support to execute remote function calls with inlined
>>>>>> data for the matter. The function remote TLB flushing function would be
>>>>>> executed asynchronously and the local CPU would continue execution as
>>>>>> soon as the IPI was delivered, before the function was actually
>>>>>> executed. Since tlb_flush_info is copied, there is no risk it would
>>>>>> change before the TLB flush is actually executed.
>>>>>> 
>>>>>> Change nmi_uaccess_okay() to check whether a remote TLB flush is
>>>>>> currently in progress on this CPU by checking whether the asynchronously
>>>>>> called function is the remote TLB flushing function. The current
>>>>>> implementation disallows access in such cases, but it is also possible
>>>>>> to flush the entire TLB in such case and allow access.
>>>>> 
>>>>> ARGGH, brain hurt. I'm not sure I fully understand this one. How is it
>>>>> different from today, where the NMI can hit in the middle of the TLB
>>>>> invalidation?
>>>>> 
>>>>> Also; since we're not waiting on the IPI, what prevents us from freeing
>>>>> the user pages before the remote CPU is 'done' with them? Currently the
>>>>> synchronous IPI is like a sync point where we *know* the remote CPU is
>>>>> completely done accessing the page.
>>>>> 
>>>>> Where getting an IPI stops speculation, speculation again restarts
>>>>> inside the interrupt handler, and until we've passed the INVLPG/MOV CR3,
>>>>> speculation can happen on that TLB entry, even though we've already
>>>>> freed and re-used the user-page.
>>>>> 
>>>>> Also, what happens if the TLB invalidation IPI is stuck behind another
>>>>> smp_function_call IPI that is doing user-access?
>>>>> 
>>>>> As said,.. brain hurts.
>>>> 
>>>> Speculation aside, any code doing dirty tracking needs the flush to happen
>>>> for real before it reads the dirty bit.
>>>> 
>>>> How does this patch guarantee that the flush is really done before someone
>>>> depends on it?
>>> 
>>> I was always under the impression that the dirty-bit is pass-through - the
>>> A/D-assist walks the tables and sets the dirty bit upon access. Otherwise,
>>> what happens when you invalidate the PTE, and have already marked the PTE as
>>> non-present? Would the CPU set the dirty-bit at this point?
>> 
>> Modulo bugs^Werrata...  No.  What actually happens is that a
>> try-to-set-dirty-bit page table walk acts just like a TLB miss.  The old
>> contents of the TLB are discarded and only the in-memory contents matter
>> for forward progress.  If Present=0 when the PTE is reached, you'll get
>> a normal Present=0 page fault.
> 
> Wait, does that mean that you can do a lock cmpxchg or similar to
> clear the dirty and writable bits together and, if the dirty bit was
> clear, skip the TLB flush?  If so, nifty!  Modulo errata, of course.
> And I seem to remember some exceptions relating to CET shadow stack
> involving the dirty bit being set on not-present pages.

I did something similar with the access-bit in the past.

Anyhow, I have a bug here - the code does not wait for the indication that
the IPI was received. I need to rerun the performance measurements once I
fix it.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v2 11/12] x86/mm/tlb: Use async and inline messages for flushing
  2019-05-31 20:37           ` Andy Lutomirski
  2019-05-31 20:42             ` Nadav Amit
@ 2019-05-31 21:06             ` Dave Hansen
  1 sibling, 0 replies; 36+ messages in thread
From: Dave Hansen @ 2019-05-31 21:06 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Nadav Amit, Peter Zijlstra, Borislav Petkov, Ingo Molnar,
	Thomas Gleixner, x86, linux-kernel, Dave Hansen

On 5/31/19 1:37 PM, Andy Lutomirski wrote:
>> Modulo bugs^Werrata...  No.  What actually happens is that a
>> try-to-set-dirty-bit page table walk acts just like a TLB miss.  The old
>> contents of the TLB are discarded and only the in-memory contents matter
>> for forward progress.  If Present=0 when the PTE is reached, you'll get
>> a normal Present=0 page fault.
> Wait, does that mean that you can do a lock cmpxchg or similar to
> clear the dirty and writable bits together and, if the dirty bit was
> clear, skip the TLB flush?

Yeah, in the case that you're going from R/W->R/O, you can be assured
that no writable TLB entries were established if D=0.

Is it totally safe against other things?  Hell if I know. :)

I'd want to go look very closely at software things like GUP-for-write
before we went and actually did this.  But I can't think of any hardware
reasons off the top of my head why it wouldn't work.

A quick perusal of the SDM didn't turn up any slam dunks to support doing
this.  It's a bit cagey about exactly what can be cached and when.  The
supporting reasoning might have escaped my quick scan, though.

> If so, nifty!  Modulo errata, of course. And I seem to remember some
> exceptions relating to CET shadow stack involving the dirty bit being
> set on not-present pages.

Yeah: "no processor that supports CET will ever set the dirty flag in a
paging-structure entry in which the R/W flag is 0"

> https://software.intel.com/sites/default/files/managed/4d/2a/control-flow-enforcement-technology-preview.pdf

Which probably means that the things we were saying above are
technically only architectural on CET-enabled processors.  I think the
behavior is actually much more widespread than that, though.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v2 11/12] x86/mm/tlb: Use async and inline messages for flushing
  2019-05-31  6:36 ` [RFC PATCH v2 11/12] x86/mm/tlb: Use async and inline messages for flushing Nadav Amit
  2019-05-31 10:57   ` Peter Zijlstra
@ 2019-05-31 21:14   ` Andy Lutomirski
  2019-05-31 21:33     ` Nadav Amit
  1 sibling, 1 reply; 36+ messages in thread
From: Andy Lutomirski @ 2019-05-31 21:14 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Peter Zijlstra, Andy Lutomirski, Borislav Petkov, Dave Hansen,
	Ingo Molnar, Thomas Gleixner, X86 ML, LKML, Dave Hansen

On Thu, May 30, 2019 at 11:37 PM Nadav Amit <namit@vmware.com> wrote:
>
> When we flush userspace mappings, we can defer the TLB flushes, as long
> the following conditions are met:
>
> 1. No tables are freed, since otherwise speculative page walks might
>    cause machine-checks.
>
> 2. No one would access userspace before flush takes place. Specifically,
>    NMI handlers and kprobes would avoid accessing userspace.
>

I think I need to ask the big picture question.  When someone calls
flush_tlb_mm_range() (or the other entry points), if no page tables
were freed, they want the guarantee that future accesses (initiated
observably after the flush returns) will not use paging entries that
were replaced by stores ordered before flush_tlb_mm_range().  We also
need the guarantee that any effects from any memory access using the
old paging entries will become globally visible before
flush_tlb_mm_range().

I'm wondering if receipt of an IPI is enough to guarantee any of this.
If CPU 1 sets a dirty bit and CPU 2 writes to the APIC to send an IPI
to CPU 1, at what point is CPU 2 guaranteed to be able to observe the
dirty bit?  An interrupt entry today is fully serializing by the time
it finishes, but interrupt entries are epicly slow, and I don't know
if the APIC waits long enough.  Heck, what if IRQs are off on the
remote CPU?  There are a handful of places where we touch user memory
with IRQs off, and it's (sadly) possible for user code to turn off
IRQs with iopl().

I *think* that Intel has stated recently that SMT siblings are
guaranteed to stop speculating when you write to the APIC ICR to poke
them, but SMT is very special.

My general conclusion is that I think the code needs to document what
is guaranteed and why.

--Andy

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v2 11/12] x86/mm/tlb: Use async and inline messages for flushing
  2019-05-31 21:14   ` Andy Lutomirski
@ 2019-05-31 21:33     ` Nadav Amit
  2019-05-31 21:47       ` Andy Lutomirski
  0 siblings, 1 reply; 36+ messages in thread
From: Nadav Amit @ 2019-05-31 21:33 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Peter Zijlstra, Borislav Petkov, Dave Hansen, Ingo Molnar,
	Thomas Gleixner, X86 ML, LKML, Dave Hansen

> On May 31, 2019, at 2:14 PM, Andy Lutomirski <luto@kernel.org> wrote:
> 
> On Thu, May 30, 2019 at 11:37 PM Nadav Amit <namit@vmware.com> wrote:
>> When we flush userspace mappings, we can defer the TLB flushes, as long
>> the following conditions are met:
>> 
>> 1. No tables are freed, since otherwise speculative page walks might
>>   cause machine-checks.
>> 
>> 2. No one would access userspace before flush takes place. Specifically,
>>   NMI handlers and kprobes would avoid accessing userspace.
> 
> I think I need to ask the big picture question.  When someone calls
> flush_tlb_mm_range() (or the other entry points), if no page tables
> were freed, they want the guarantee that future accesses (initiated
> observably after the flush returns) will not use paging entries that
> were replaced by stores ordered before flush_tlb_mm_range().  We also
> need the guarantee that any effects from any memory access using the
> old paging entries will become globally visible before
> flush_tlb_mm_range().
> 
> I'm wondering if receipt of an IPI is enough to guarantee any of this.
> If CPU 1 sets a dirty bit and CPU 2 writes to the APIC to send an IPI
> to CPU 1, at what point is CPU 2 guaranteed to be able to observe the
> dirty bit?  An interrupt entry today is fully serializing by the time
> it finishes, but interrupt entries are epicly slow, and I don't know
> if the APIC waits long enough.  Heck, what if IRQs are off on the
> remote CPU?  There are a handful of places where we touch user memory
> with IRQs off, and it's (sadly) possible for user code to turn off
> IRQs with iopl().
> 
> I *think* that Intel has stated recently that SMT siblings are
> guaranteed to stop speculating when you write to the APIC ICR to poke
> them, but SMT is very special.
> 
> My general conclusion is that I think the code needs to document what
> is guaranteed and why.

I think I might have managed to confuse you with a bug I introduced (a
last-minute bug while doing some cleanup). This bug does not affect the
performance much, but it might have led you to think that I use the APIC
send as the synchronization.

The idea is not for us to rely on the write to the ICR as something
serializing. The flow should be as follows:


	CPU0					CPU1

flush_tlb_mm_range()
 __smp_call_function_many()
  [ prepare call_single_data (csd) ]
  [ lock csd ] 
  [ send IPI ]
	(*)
  [ wait for csd to be unlocked ]
					[ interrupt ]
					[ copy csd info to stack ]
					[ csd unlock ]
  [ find csd is unlocked ]
  [ continue (**) ]
					[ flush TLB ]


At (**) the pages might be recycled, written-back to disk, etc. Note that
during (*), CPU0 might do some local TLB flushes, making it very likely that
CSD will be unlocked by the time it gets there.

As you can see, I don’t rely on any special micro-architectural behavior.
The synchronization is done purely in software.
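
In (pseudo-)code, the initiator side of that handshake is roughly the
following sketch (csd_lock()/csd_lock_wait() are the existing helpers in
kernel/smp.c; the flush-specific names are illustrative, and the real
patch inlines the flush_tlb_info into the IPI message rather than passing
a pointer):

	/* CPU0 */
	for_each_cpu(cpu, cpumask) {
		call_single_data_t *csd = per_cpu_ptr(&flush_csd, cpu);

		csd_lock(csd);			/* sets CSD_FLAG_LOCK */
		csd->func = flush_tlb_func_remote;
		csd->info = &flush_info;	/* copied by the handler */
	}
	arch_send_call_function_ipi_mask(cpumask);

	do_local_flush(&flush_info);		/* (*) overlaps the IPIs */

	for_each_cpu(cpu, cpumask)		/* wait for csd unlock */
		csd_lock_wait(per_cpu_ptr(&flush_csd, cpu));

	/* (**) every target copied the info; pages may be recycled */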

Does it make more sense now?


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v2 11/12] x86/mm/tlb: Use async and inline messages for flushing
  2019-05-31 21:33     ` Nadav Amit
@ 2019-05-31 21:47       ` Andy Lutomirski
  2019-05-31 22:07         ` Nadav Amit
  0 siblings, 1 reply; 36+ messages in thread
From: Andy Lutomirski @ 2019-05-31 21:47 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andy Lutomirski, Peter Zijlstra, Borislav Petkov, Dave Hansen,
	Ingo Molnar, Thomas Gleixner, X86 ML, LKML, Dave Hansen


On May 31, 2019, at 2:33 PM, Nadav Amit <namit@vmware.com> wrote:

>> On May 31, 2019, at 2:14 PM, Andy Lutomirski <luto@kernel.org> wrote:
>> 
>>> On Thu, May 30, 2019 at 11:37 PM Nadav Amit <namit@vmware.com> wrote:
>>> When we flush userspace mappings, we can defer the TLB flushes, as long
>>> the following conditions are met:
>>> 
>>> 1. No tables are freed, since otherwise speculative page walks might
>>>  cause machine-checks.
>>> 
>>> 2. No one would access userspace before flush takes place. Specifically,
>>>  NMI handlers and kprobes would avoid accessing userspace.
>> 
>> I think I need to ask the big picture question.  When someone calls
>> flush_tlb_mm_range() (or the other entry points), if no page tables
>> were freed, they want the guarantee that future accesses (initiated
>> observably after the flush returns) will not use paging entries that
>> were replaced by stores ordered before flush_tlb_mm_range().  We also
>> need the guarantee that any effects from any memory access using the
>> old paging entries will become globally visible before
>> flush_tlb_mm_range().
>> 
>> I'm wondering if receipt of an IPI is enough to guarantee any of this.
>> If CPU 1 sets a dirty bit and CPU 2 writes to the APIC to send an IPI
>> to CPU 1, at what point is CPU 2 guaranteed to be able to observe the
>> dirty bit?  An interrupt entry today is fully serializing by the time
>> it finishes, but interrupt entries are epicly slow, and I don't know
>> if the APIC waits long enough.  Heck, what if IRQs are off on the
>> remote CPU?  There are a handful of places where we touch user memory
>> with IRQs off, and it's (sadly) possible for user code to turn off
>> IRQs with iopl().
>> 
>> I *think* that Intel has stated recently that SMT siblings are
>> guaranteed to stop speculating when you write to the APIC ICR to poke
>> them, but SMT is very special.
>> 
>> My general conclusion is that I think the code needs to document what
>> is guaranteed and why.
> 
> I think I might have managed to confuse you with a bug I made (last minute
> bug when I was doing some cleanup). This bug does not affect the performance
> much, but it might led you to think that I use the APIC sending as
> synchronization.
> 
> The idea is not for us to rely on write to ICR as something serializing. The
> flow should be as follows:
> 
> 
>    CPU0                    CPU1
> 
> flush_tlb_mm_range()
> __smp_call_function_many()
>  [ prepare call_single_data (csd) ]
>  [ lock csd ] 
>  [ send IPI ]
>    (*)
>  [ wait for csd to be unlocked ]
>                    [ interrupt ]
>                    [ copy csd info to stack ]
>                    [ csd unlock ]
>  [ find csd is unlocked ]
>  [ continue (**) ]
>                    [ flush TLB ]
> 
> 
> At (**) the pages might be recycled, written-back to disk, etc. Note that
> during (*), CPU0 might do some local TLB flushes, making it very likely that
> CSD will be unlocked by the time it gets there.
> 
> As you can see, I don’t rely on any special micro-architectural behavior.
> The synchronization is done purely in software.
> 
> Does it make more sense now?
> 

Yes.  Have you benchmarked this?

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v2 11/12] x86/mm/tlb: Use async and inline messages for flushing
  2019-05-31 21:47       ` Andy Lutomirski
@ 2019-05-31 22:07         ` Nadav Amit
  2019-06-07  5:28           ` Nadav Amit
  0 siblings, 1 reply; 36+ messages in thread
From: Nadav Amit @ 2019-05-31 22:07 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, Peter Zijlstra, Borislav Petkov, Dave Hansen,
	Ingo Molnar, Thomas Gleixner, X86 ML, LKML, Dave Hansen

> On May 31, 2019, at 2:47 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> 
> 
> On May 31, 2019, at 2:33 PM, Nadav Amit <namit@vmware.com> wrote:
> 
>>> On May 31, 2019, at 2:14 PM, Andy Lutomirski <luto@kernel.org> wrote:
>>> 
>>>> On Thu, May 30, 2019 at 11:37 PM Nadav Amit <namit@vmware.com> wrote:
>>>> When we flush userspace mappings, we can defer the TLB flushes, as long
>>>> the following conditions are met:
>>>> 
>>>> 1. No tables are freed, since otherwise speculative page walks might
>>>> cause machine-checks.
>>>> 
>>>> 2. No one would access userspace before flush takes place. Specifically,
>>>> NMI handlers and kprobes would avoid accessing userspace.
>>> 
>>> I think I need to ask the big picture question.  When someone calls
>>> flush_tlb_mm_range() (or the other entry points), if no page tables
>>> were freed, they want the guarantee that future accesses (initiated
>>> observably after the flush returns) will not use paging entries that
>>> were replaced by stores ordered before flush_tlb_mm_range().  We also
>>> need the guarantee that any effects from any memory access using the
>>> old paging entries will become globally visible before
>>> flush_tlb_mm_range().
>>> 
>>> I'm wondering if receipt of an IPI is enough to guarantee any of this.
>>> If CPU 1 sets a dirty bit and CPU 2 writes to the APIC to send an IPI
>>> to CPU 1, at what point is CPU 2 guaranteed to be able to observe the
>>> dirty bit?  An interrupt entry today is fully serializing by the time
>>> it finishes, but interrupt entries are epicly slow, and I don't know
>>> if the APIC waits long enough.  Heck, what if IRQs are off on the
>>> remote CPU?  There are a handful of places where we touch user memory
>>> with IRQs off, and it's (sadly) possible for user code to turn off
>>> IRQs with iopl().
>>> 
>>> I *think* that Intel has stated recently that SMT siblings are
>>> guaranteed to stop speculating when you write to the APIC ICR to poke
>>> them, but SMT is very special.
>>> 
>>> My general conclusion is that I think the code needs to document what
>>> is guaranteed and why.
>> 
>> I think I might have managed to confuse you with a bug I made (last minute
>> bug when I was doing some cleanup). This bug does not affect the performance
>> much, but it might led you to think that I use the APIC sending as
>> synchronization.
>> 
>> The idea is not for us to rely on write to ICR as something serializing. The
>> flow should be as follows:
>> 
>> 
>>   CPU0                    CPU1
>> 
>> flush_tlb_mm_range()
>> __smp_call_function_many()
>> [ prepare call_single_data (csd) ]
>> [ lock csd ] 
>> [ send IPI ]
>>   (*)
>> [ wait for csd to be unlocked ]
>>                   [ interrupt ]
>>                   [ copy csd info to stack ]
>>                   [ csd unlock ]
>> [ find csd is unlocked ]
>> [ continue (**) ]
>>                   [ flush TLB ]
>> 
>> 
>> At (**) the pages might be recycled, written-back to disk, etc. Note that
>> during (*), CPU0 might do some local TLB flushes, making it very likely that
>> CSD will be unlocked by the time it gets there.
>> 
>> As you can see, I don’t rely on any special micro-architectural behavior.
>> The synchronization is done purely in software.
>> 
>> Does it make more sense now?
> 
> Yes.  Have you benchmarked this?

Partially. Numbers are indeed worse. Here are preliminary results, comparing
to v1 (concurrent):

	n_threads	before		concurrent	+async
	---------	------		---------- 	------
	1		661		663		663
	2		1436		1225 (-14%)	1115 (-22%)
	4		1571		1421 (-10%)	1289 (-18%)

Note that the benefit of "async" would be greater if the initiator does not
flush the TLB at all. This might happen in the case of kswapd, for example.
Let me try some micro-optimizations first, run more benchmarks and get back
to you.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v2 11/12] x86/mm/tlb: Use async and inline messages for flushing
  2019-05-31 22:07         ` Nadav Amit
@ 2019-06-07  5:28           ` Nadav Amit
  2019-06-07 16:42             ` Andy Lutomirski
  0 siblings, 1 reply; 36+ messages in thread
From: Nadav Amit @ 2019-06-07  5:28 UTC (permalink / raw)
  To: Andy Lutomirski, Peter Zijlstra
  Cc: Andy Lutomirski, Borislav Petkov, Dave Hansen, Ingo Molnar,
	Thomas Gleixner, X86 ML, LKML, Dave Hansen

> On May 31, 2019, at 3:07 PM, Nadav Amit <namit@vmware.com> wrote:
> 
>> On May 31, 2019, at 2:47 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> 
>> 
>> On May 31, 2019, at 2:33 PM, Nadav Amit <namit@vmware.com> wrote:
>> 
>>>> On May 31, 2019, at 2:14 PM, Andy Lutomirski <luto@kernel.org> wrote:
>>>> 
>>>>> On Thu, May 30, 2019 at 11:37 PM Nadav Amit <namit@vmware.com> wrote:
>>>>> When we flush userspace mappings, we can defer the TLB flushes, as long
>>>>> the following conditions are met:
>>>>> 
>>>>> 1. No tables are freed, since otherwise speculative page walks might
>>>>> cause machine-checks.
>>>>> 
>>>>> 2. No one would access userspace before flush takes place. Specifically,
>>>>> NMI handlers and kprobes would avoid accessing userspace.
>>>> 
>>>> I think I need to ask the big picture question.  When someone calls
>>>> flush_tlb_mm_range() (or the other entry points), if no page tables
>>>> were freed, they want the guarantee that future accesses (initiated
>>>> observably after the flush returns) will not use paging entries that
>>>> were replaced by stores ordered before flush_tlb_mm_range().  We also
>>>> need the guarantee that any effects from any memory access using the
>>>> old paging entries will become globally visible before
>>>> flush_tlb_mm_range().
>>>> 
>>>> I'm wondering if receipt of an IPI is enough to guarantee any of this.
>>>> If CPU 1 sets a dirty bit and CPU 2 writes to the APIC to send an IPI
>>>> to CPU 1, at what point is CPU 2 guaranteed to be able to observe the
>>>> dirty bit?  An interrupt entry today is fully serializing by the time
>>>> it finishes, but interrupt entries are epicly slow, and I don't know
>>>> if the APIC waits long enough.  Heck, what if IRQs are off on the
>>>> remote CPU?  There are a handful of places where we touch user memory
>>>> with IRQs off, and it's (sadly) possible for user code to turn off
>>>> IRQs with iopl().
>>>> 
>>>> I *think* that Intel has stated recently that SMT siblings are
>>>> guaranteed to stop speculating when you write to the APIC ICR to poke
>>>> them, but SMT is very special.
>>>> 
>>>> My general conclusion is that I think the code needs to document what
>>>> is guaranteed and why.
>>> 
>>> I think I might have managed to confuse you with a bug I made (last minute
>>> bug when I was doing some cleanup). This bug does not affect the performance
>>> much, but it might led you to think that I use the APIC sending as
>>> synchronization.
>>> 
>>> The idea is not for us to rely on write to ICR as something serializing. The
>>> flow should be as follows:
>>> 
>>> 
>>>  CPU0                    CPU1
>>> 
>>> flush_tlb_mm_range()
>>> __smp_call_function_many()
>>> [ prepare call_single_data (csd) ]
>>> [ lock csd ] 
>>> [ send IPI ]
>>>  (*)
>>> [ wait for csd to be unlocked ]
>>>                  [ interrupt ]
>>>                  [ copy csd info to stack ]
>>>                  [ csd unlock ]
>>> [ find csd is unlocked ]
>>> [ continue (**) ]
>>>                  [ flush TLB ]
>>> 
>>> 
>>> At (**) the pages might be recycled, written-back to disk, etc. Note that
>>> during (*), CPU0 might do some local TLB flushes, making it very likely that
>>> CSD will be unlocked by the time it gets there.
>>> 
>>> As you can see, I don’t rely on any special micro-architectural behavior.
>>> The synchronization is done purely in software.
>>> 
>>> Does it make more sense now?
>> 
>> Yes.  Have you benchmarked this?
> 
> Partially. Numbers are indeed worse. Here are preliminary results, comparing
> to v1 (concurrent):
> 
> 	n_threads	before		concurrent	+async
> 	---------	------		---------- 	------
> 	1		661		663		663
> 	2		1436		1225 (-14%)	1115 (-22%)
> 	4		1571		1421 (-10%)	1289 (-18%)
> 
> Note that the benefit of “async" would be greater if the initiator does not
> flush the TLB at all. This might happen in the case of kswapd, for example.
> Let me try some micro-optimizations first, run more benchmarks and get back
> to you.

So I ran some more benchmarking (my benchmark is not very suitable), and
tried more stuff that did not help (checking for more work before returning
from the IPI handler, and avoiding redundant IPIs in that case).

Anyhow, with a fixed version, I ran a more standard benchmark on DAX:

$ mkfs.ext4 /dev/pmem0
$ mount -o dax /dev/pmem0 /mnt/mem
$ cd /mnt/mem
$ bash -c 'echo 0 > /sys/devices/platform/e820_pmem/ndbus0/region0/namespace0.0/block/pmem0/dax/write_cache'
$ sysbench fileio --file-total-size=3G --file-test-mode=rndwr 	\
	--file-io-mode=mmap --threads=4 --file-fsync-mode=fdatasync prepare
$ sysbench fileio --file-total-size=3G --file-test-mode=rndwr 	\
	--file-io-mode=mmap --threads=4 --file-fsync-mode=fdatasync run

( as you can see, I disabled the write-cache, since my machine does not have
  clwb/clflushopt and clflush appears to become a bottleneck otherwise )


The results are:
				events (avg/stddev)
				-------------------
base				1263689.0000/11481.10
concurrent			1310123.5000/19205.79	(+3.6%)
concurrent + async		1326750.2500/24563.61	(+4.9%)

So which version do you want me to submit? With or without the async part?


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v2 11/12] x86/mm/tlb: Use async and inline messages for flushing
  2019-06-07  5:28           ` Nadav Amit
@ 2019-06-07 16:42             ` Andy Lutomirski
  0 siblings, 0 replies; 36+ messages in thread
From: Andy Lutomirski @ 2019-06-07 16:42 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Peter Zijlstra, Andy Lutomirski, Borislav Petkov, Dave Hansen,
	Ingo Molnar, Thomas Gleixner, X86 ML, LKML, Dave Hansen


On Jun 6, 2019, at 10:28 PM, Nadav Amit <namit@vmware.com> wrote:

>> On May 31, 2019, at 3:07 PM, Nadav Amit <namit@vmware.com> wrote:
>> 
>>> On May 31, 2019, at 2:47 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>> 
>>> 
>>> On May 31, 2019, at 2:33 PM, Nadav Amit <namit@vmware.com> wrote:
>>> 
>>>>> On May 31, 2019, at 2:14 PM, Andy Lutomirski <luto@kernel.org> wrote:
>>>>> 
>>>>>> On Thu, May 30, 2019 at 11:37 PM Nadav Amit <namit@vmware.com> wrote:
>>>>>> When we flush userspace mappings, we can defer the TLB flushes, as long
>>>>>> the following conditions are met:
>>>>>> 
>>>>>> 1. No tables are freed, since otherwise speculative page walks might
>>>>>> cause machine-checks.
>>>>>> 
>>>>>> 2. No one would access userspace before flush takes place. Specifically,
>>>>>> NMI handlers and kprobes would avoid accessing userspace.
>>>>> 
>>>>> I think I need to ask the big picture question.  When someone calls
>>>>> flush_tlb_mm_range() (or the other entry points), if no page tables
>>>>> were freed, they want the guarantee that future accesses (initiated
>>>>> observably after the flush returns) will not use paging entries that
>>>>> were replaced by stores ordered before flush_tlb_mm_range().  We also
>>>>> need the guarantee that any effects from any memory access using the
>>>>> old paging entries will become globally visible before
>>>>> flush_tlb_mm_range().
>>>>> 
>>>>> I'm wondering if receipt of an IPI is enough to guarantee any of this.
>>>>> If CPU 1 sets a dirty bit and CPU 2 writes to the APIC to send an IPI
>>>>> to CPU 1, at what point is CPU 2 guaranteed to be able to observe the
>>>>> dirty bit?  An interrupt entry today is fully serializing by the time
>>>>> it finishes, but interrupt entries are epicly slow, and I don't know
>>>>> if the APIC waits long enough.  Heck, what if IRQs are off on the
>>>>> remote CPU?  There are a handful of places where we touch user memory
>>>>> with IRQs off, and it's (sadly) possible for user code to turn off
>>>>> IRQs with iopl().
>>>>> 
>>>>> I *think* that Intel has stated recently that SMT siblings are
>>>>> guaranteed to stop speculating when you write to the APIC ICR to poke
>>>>> them, but SMT is very special.
>>>>> 
>>>>> My general conclusion is that I think the code needs to document what
>>>>> is guaranteed and why.
>>>> 
>>>> I think I might have managed to confuse you with a bug I made (last minute
>>>> bug when I was doing some cleanup). This bug does not affect the performance
>>>> much, but it might led you to think that I use the APIC sending as
>>>> synchronization.
>>>> 
>>>> The idea is not for us to rely on write to ICR as something serializing. The
>>>> flow should be as follows:
>>>> 
>>>> 
>>>> CPU0                    CPU1
>>>> 
>>>> flush_tlb_mm_range()
>>>> __smp_call_function_many()
>>>> [ prepare call_single_data (csd) ]
>>>> [ lock csd ] 
>>>> [ send IPI ]
>>>> (*)
>>>> [ wait for csd to be unlocked ]
>>>>                 [ interrupt ]
>>>>                 [ copy csd info to stack ]
>>>>                 [ csd unlock ]
>>>> [ find csd is unlocked ]
>>>> [ continue (**) ]
>>>>                 [ flush TLB ]
>>>> 
>>>> 
>>>> At (**) the pages might be recycled, written-back to disk, etc. Note that
>>>> during (*), CPU0 might do some local TLB flushes, making it very likely that
>>>> CSD will be unlocked by the time it gets there.
>>>> 
>>>> As you can see, I don’t rely on any special micro-architectural behavior.
>>>> The synchronization is done purely in software.
>>>> 
>>>> Does it make more sense now?
>>> 
>>> Yes.  Have you benchmarked this?
>> 
>> Partially. Numbers are indeed worse. Here are preliminary results, comparing
>> to v1 (concurrent):
>> 
>>    n_threads    before        concurrent    +async
>>    ---------    ------        ----------    ------
>>    1        661        663        663
>>    2        1436        1225 (-14%)    1115 (-22%)
>>    4        1571        1421 (-10%)    1289 (-18%)
>> 
>> Note that the benefit of “async" would be greater if the initiator does not
>> flush the TLB at all. This might happen in the case of kswapd, for example.
>> Let me try some micro-optimizations first, run more benchmarks and get back
>> to you.
> 
> So I ran some more benchmarking (my benchmark is not very suitable), and tried
> more stuff that did not help (checking for more work before returning from the
> IPI handler, and avoid redundant IPIs in such case).
> 
> Anyhow, with a fixed version, I ran a more standard benchmark on DAX:
> 
> $ mkfs.ext4 /dev/pmem0
> $ mount -o dax /dev/pmem0 /mnt/mem
> $ cd /mnt/mem
> $ bash -c 'echo 0 > /sys/devices/platform/e820_pmem/ndbus0/region0/namespace0.0/block/pmem0/dax/write_cache’
> $ sysbench fileio --file-total-size=3G --file-test-mode=rndwr    \
>    --file-io-mode=mmap --threads=4 --file-fsync-mode=fdatasync prepare
> $ sysbench fileio --file-total-size=3G --file-test-mode=rndwr    \
>    --file-io-mode=mmap --threads=4 --file-fsync-mode=fdatasync run
> 
> ( as you can see, I disabled the write-cache, since my machine does not have
>  clwb/clflushopt and clflush appears to become a bottleneck otherwise )
> 
> 
> The results are:
>                events (avg/stddev)
>                -------------------
> base                1263689.0000/11481.10
> concurrent            1310123.5000/19205.79    (+3.6%)
> concurrent + async        1326750.2500/24563.61    (+4.9%)
> 
> So which version do you want me to submit? With or without the async part?

I think it would be best to submit it without the async part. You can always submit that later.

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2019-06-07 16:42 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-05-31  6:36 [RFC PATCH v2 00/12] x86: Flush remote TLBs concurrently and async Nadav Amit
2019-05-31  6:36 ` [RFC PATCH v2 01/12] smp: Remove smp_call_function() and on_each_cpu() return values Nadav Amit
2019-05-31  6:36 ` [RFC PATCH v2 02/12] smp: Run functions concurrently in smp_call_function_many() Nadav Amit
2019-05-31  6:36 ` [RFC PATCH v2 03/12] x86/mm/tlb: Refactor common code into flush_tlb_on_cpus() Nadav Amit
2019-05-31  6:36 ` [RFC PATCH v2 04/12] x86/mm/tlb: Flush remote and local TLBs concurrently Nadav Amit
2019-05-31 11:48   ` Juergen Gross
2019-05-31 19:44     ` Nadav Amit
2019-05-31  6:36 ` [RFC PATCH v2 05/12] x86/mm/tlb: Optimize local TLB flushes Nadav Amit
2019-05-31  6:36 ` [RFC PATCH v2 06/12] KVM: x86: Provide paravirtualized flush_tlb_multi() Nadav Amit
2019-05-31  6:36 ` [RFC PATCH v2 07/12] smp: Do not mark call_function_data as shared Nadav Amit
2019-05-31 10:17   ` Peter Zijlstra
2019-05-31 17:50     ` Nadav Amit
2019-05-31  6:36 ` [RFC PATCH v2 08/12] x86/tlb: Privatize cpu_tlbstate Nadav Amit
2019-05-31 18:48   ` Andy Lutomirski
2019-05-31 19:42     ` Nadav Amit
2019-05-31  6:36 ` [RFC PATCH v2 09/12] x86/apic: Use non-atomic operations when possible Nadav Amit
2019-05-31  6:36 ` [RFC PATCH v2 10/12] smp: Enable data inlining for inter-processor function call Nadav Amit
2019-05-31  6:36 ` [RFC PATCH v2 11/12] x86/mm/tlb: Use async and inline messages for flushing Nadav Amit
2019-05-31 10:57   ` Peter Zijlstra
2019-05-31 18:29     ` Nadav Amit
2019-05-31 19:20       ` Jann Horn
2019-05-31 20:04         ` Nadav Amit
2019-05-31 20:37           ` Jann Horn
2019-05-31 18:44     ` Andy Lutomirski
2019-05-31 19:31       ` Nadav Amit
2019-05-31 20:13         ` Dave Hansen
2019-05-31 20:37           ` Andy Lutomirski
2019-05-31 20:42             ` Nadav Amit
2019-05-31 21:06             ` Dave Hansen
2019-05-31 21:14   ` Andy Lutomirski
2019-05-31 21:33     ` Nadav Amit
2019-05-31 21:47       ` Andy Lutomirski
2019-05-31 22:07         ` Nadav Amit
2019-06-07  5:28           ` Nadav Amit
2019-06-07 16:42             ` Andy Lutomirski
2019-05-31  6:36 ` [RFC PATCH v2 12/12] x86/mm/tlb: Reverting the removal of flush_tlb_info from stack Nadav Amit
