* [RFC PATCH 0/6] x86/mm: Flush remote and local TLBs concurrently
@ 2019-05-25  8:21 Nadav Amit
  2019-05-25  8:21 ` [RFC PATCH 1/6] smp: Remove smp_call_function() and on_each_cpu() return values Nadav Amit
                   ` (7 more replies)
  0 siblings, 8 replies; 24+ messages in thread
From: Nadav Amit @ 2019-05-25  8:21 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Andy Lutomirski
  Cc: Borislav Petkov, linux-kernel, Nadav Amit

Currently, local and remote TLB flushes are not performed concurrently,
which introduces unnecessary overhead - each INVLPG can take 100s of
cycles. This patch-set allows TLB flushes to be run concurrently: first
request the remote CPUs to initiate the flush, then run it locally, and
finally wait for the remote CPUs to finish their work.
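
Schematically, the intended ordering looks as follows. This is an
illustration only; the function names below are placeholders, not the
interfaces introduced by this series.

/* Sketch of the concurrent flush ordering described above. */
static void flush_tlb_concurrently(const struct cpumask *cpus,
                                   const struct flush_tlb_info *info)
{
        /* 1) Kick the remote CPUs, but do not wait for them yet. */
        send_flush_ipis(cpus, info);

        /* 2) Flush the local TLB while the remote flushes are in flight. */
        flush_local_tlb(info);

        /* 3) Only now wait for the remote CPUs to acknowledge completion. */
        wait_for_remote_flushes(cpus);
}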

The proposed changes should also improve the performance of other
invocations of on_each_cpu(). Hopefully, no one has relied on
on_each_cpu()'s previous behavior of executing the function remotely
first and only then locally.

On my Haswell machine (bare-metal), a TLB flush microbenchmark
(MADV_DONTNEED/touch of a single page on one thread) takes the
following time (ns):

	n_threads	before		after
	---------	------		-----
	1		661		663
	2		1436		1225 (-14%)
	4		1571		1421 (-10%)

Note that since the benchmark also causes page-faults, the speedup of
the TLB shootdowns themselves is actually greater. Also note the higher
improvement in performance with 2 threads (a single remote TLB flush
target). This seems to be a side-effect of keeping the synchronization
data structure (csd) off the stack, unlike the current code, in which
smp_call_function_single() keeps it on the stack.
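
For reference, a minimal single-threaded sketch of the
MADV_DONTNEED/touch loop is shown below. It is an illustration only:
the real benchmark additionally runs the remaining threads to keep the
mm active on other CPUs (so that remote shootdowns occur), and its
exact parameters differ.

/* Illustrative reconstruction of the microbenchmark core loop. */
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>

int main(void)
{
        const long iters = 1000000;
        const size_t len = 4096;
        struct timespec t0, t1;
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED)
                return 1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < iters; i++) {
                /* Drop the page; this is what triggers the TLB flush. */
                madvise(p, len, MADV_DONTNEED);
                /* Touch it again; this takes a page fault and remaps it. */
                p[0] = 1;
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        printf("%.0f ns/iter\n",
               ((t1.tv_sec - t0.tv_sec) * 1e9 +
                (t1.tv_nsec - t0.tv_nsec)) / iters);
        return 0;
}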

Patches 1-2 are small cleanups. Patches 3-5 implement the concurrent
execution of TLB flushes. Patch 6 introduces a fast path for the purely
local case, restoring the performance of local TLB flushes, which the
optimization otherwise hurts, to what it was before these changes.

Nadav Amit (6):
  smp: Remove smp_call_function() and on_each_cpu() return values
  cpumask: Purify cpumask_next()
  smp: Run functions concurrently in smp_call_function_many()
  x86/mm/tlb: Refactor common code into flush_tlb_on_cpus()
  x86/mm/tlb: Flush remote and local TLBs concurrently
  x86/mm/tlb: Optimize local TLB flushes

 arch/alpha/kernel/smp.c               |  19 +---
 arch/alpha/oprofile/common.c          |   6 +-
 arch/ia64/kernel/perfmon.c            |  12 +--
 arch/ia64/kernel/uncached.c           |   8 +-
 arch/x86/hyperv/mmu.c                 |   2 +
 arch/x86/include/asm/paravirt.h       |   8 ++
 arch/x86/include/asm/paravirt_types.h |   6 ++
 arch/x86/include/asm/tlbflush.h       |   6 ++
 arch/x86/kernel/kvm.c                 |   1 +
 arch/x86/kernel/paravirt.c            |   3 +
 arch/x86/lib/cache-smp.c              |   3 +-
 arch/x86/mm/tlb.c                     | 137 +++++++++++++++++--------
 arch/x86/xen/mmu_pv.c                 |   2 +
 drivers/char/agp/generic.c            |   3 +-
 include/linux/cpumask.h               |   2 +-
 include/linux/smp.h                   |  32 ++++--
 kernel/smp.c                          | 139 ++++++++++++--------------
 kernel/up.c                           |   3 +-
 18 files changed, 230 insertions(+), 162 deletions(-)

-- 
2.20.1



* [RFC PATCH 1/6] smp: Remove smp_call_function() and on_each_cpu() return values
  2019-05-25  8:21 [RFC PATCH 0/6] x86/mm: Flush remote and local TLBs concurrently Nadav Amit
@ 2019-05-25  8:21 ` Nadav Amit
  2019-05-25  8:21 ` [RFC PATCH 2/6] cpumask: Purify cpumask_next() Nadav Amit
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 24+ messages in thread
From: Nadav Amit @ 2019-05-25  8:21 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Andy Lutomirski
  Cc: Borislav Petkov, linux-kernel, Nadav Amit, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, Tony Luck, Fenghua Yu,
	Thomas Gleixner, Andrew Morton

The return value of these functions is fixed (always zero). Remove it
and amend the callers.

Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/alpha/kernel/smp.c      | 19 +++++--------------
 arch/alpha/oprofile/common.c |  6 +++---
 arch/ia64/kernel/perfmon.c   | 12 ++----------
 arch/ia64/kernel/uncached.c  |  8 ++++----
 arch/x86/lib/cache-smp.c     |  3 ++-
 drivers/char/agp/generic.c   |  3 +--
 include/linux/smp.h          |  7 +++----
 kernel/smp.c                 | 10 +++-------
 kernel/up.c                  |  3 +--
 9 files changed, 24 insertions(+), 47 deletions(-)

diff --git a/arch/alpha/kernel/smp.c b/arch/alpha/kernel/smp.c
index d0dccae53ba9..5f90df30be20 100644
--- a/arch/alpha/kernel/smp.c
+++ b/arch/alpha/kernel/smp.c
@@ -614,8 +614,7 @@ void
 smp_imb(void)
 {
 	/* Must wait other processors to flush their icache before continue. */
-	if (on_each_cpu(ipi_imb, NULL, 1))
-		printk(KERN_CRIT "smp_imb: timed out\n");
+	on_each_cpu(ipi_imb, NULL, 1);
 }
 EXPORT_SYMBOL(smp_imb);
 
@@ -630,9 +629,7 @@ flush_tlb_all(void)
 {
 	/* Although we don't have any data to pass, we do want to
 	   synchronize with the other processors.  */
-	if (on_each_cpu(ipi_flush_tlb_all, NULL, 1)) {
-		printk(KERN_CRIT "flush_tlb_all: timed out\n");
-	}
+	on_each_cpu(ipi_flush_tlb_all, NULL, 1);
 }
 
 #define asn_locked() (cpu_data[smp_processor_id()].asn_lock)
@@ -667,9 +664,7 @@ flush_tlb_mm(struct mm_struct *mm)
 		}
 	}
 
-	if (smp_call_function(ipi_flush_tlb_mm, mm, 1)) {
-		printk(KERN_CRIT "flush_tlb_mm: timed out\n");
-	}
+	smp_call_function(ipi_flush_tlb_mm, mm, 1);
 
 	preempt_enable();
 }
@@ -720,9 +715,7 @@ flush_tlb_page(struct vm_area_struct *vma, unsigned long addr)
 	data.mm = mm;
 	data.addr = addr;
 
-	if (smp_call_function(ipi_flush_tlb_page, &data, 1)) {
-		printk(KERN_CRIT "flush_tlb_page: timed out\n");
-	}
+	smp_call_function(ipi_flush_tlb_page, &data, 1);
 
 	preempt_enable();
 }
@@ -772,9 +765,7 @@ flush_icache_user_range(struct vm_area_struct *vma, struct page *page,
 		}
 	}
 
-	if (smp_call_function(ipi_flush_icache_page, mm, 1)) {
-		printk(KERN_CRIT "flush_icache_page: timed out\n");
-	}
+	smp_call_function(ipi_flush_icache_page, mm, 1);
 
 	preempt_enable();
 }
diff --git a/arch/alpha/oprofile/common.c b/arch/alpha/oprofile/common.c
index 310a4ce1dccc..1b1259c7d7d1 100644
--- a/arch/alpha/oprofile/common.c
+++ b/arch/alpha/oprofile/common.c
@@ -65,7 +65,7 @@ op_axp_setup(void)
 	model->reg_setup(&reg, ctr, &sys);
 
 	/* Configure the registers on all cpus.  */
-	(void)smp_call_function(model->cpu_setup, &reg, 1);
+	smp_call_function(model->cpu_setup, &reg, 1);
 	model->cpu_setup(&reg);
 	return 0;
 }
@@ -86,7 +86,7 @@ op_axp_cpu_start(void *dummy)
 static int
 op_axp_start(void)
 {
-	(void)smp_call_function(op_axp_cpu_start, NULL, 1);
+	smp_call_function(op_axp_cpu_start, NULL, 1);
 	op_axp_cpu_start(NULL);
 	return 0;
 }
@@ -101,7 +101,7 @@ op_axp_cpu_stop(void *dummy)
 static void
 op_axp_stop(void)
 {
-	(void)smp_call_function(op_axp_cpu_stop, NULL, 1);
+	smp_call_function(op_axp_cpu_stop, NULL, 1);
 	op_axp_cpu_stop(NULL);
 }
 
diff --git a/arch/ia64/kernel/perfmon.c b/arch/ia64/kernel/perfmon.c
index 7a969f4c3534..6cf7db1daba2 100644
--- a/arch/ia64/kernel/perfmon.c
+++ b/arch/ia64/kernel/perfmon.c
@@ -6389,11 +6389,7 @@ pfm_install_alt_pmu_interrupt(pfm_intr_handler_desc_t *hdl)
 	}
 
 	/* save the current system wide pmu states */
-	ret = on_each_cpu(pfm_alt_save_pmu_state, NULL, 1);
-	if (ret) {
-		DPRINT(("on_each_cpu() failed: %d\n", ret));
-		goto cleanup_reserve;
-	}
+	on_each_cpu(pfm_alt_save_pmu_state, NULL, 1);
 
 	/* officially change to the alternate interrupt handler */
 	pfm_alt_intr_handler = hdl;
@@ -6420,7 +6416,6 @@ int
 pfm_remove_alt_pmu_interrupt(pfm_intr_handler_desc_t *hdl)
 {
 	int i;
-	int ret;
 
 	if (hdl == NULL) return -EINVAL;
 
@@ -6434,10 +6429,7 @@ pfm_remove_alt_pmu_interrupt(pfm_intr_handler_desc_t *hdl)
 
 	pfm_alt_intr_handler = NULL;
 
-	ret = on_each_cpu(pfm_alt_restore_pmu_state, NULL, 1);
-	if (ret) {
-		DPRINT(("on_each_cpu() failed: %d\n", ret));
-	}
+	on_each_cpu(pfm_alt_restore_pmu_state, NULL, 1);
 
 	for_each_online_cpu(i) {
 		pfm_unreserve_session(NULL, 1, i);
diff --git a/arch/ia64/kernel/uncached.c b/arch/ia64/kernel/uncached.c
index 583f7ff6b589..c618d0745e22 100644
--- a/arch/ia64/kernel/uncached.c
+++ b/arch/ia64/kernel/uncached.c
@@ -124,8 +124,8 @@ static int uncached_add_chunk(struct uncached_pool *uc_pool, int nid)
 	status = ia64_pal_prefetch_visibility(PAL_VISIBILITY_PHYSICAL);
 	if (status == PAL_VISIBILITY_OK_REMOTE_NEEDED) {
 		atomic_set(&uc_pool->status, 0);
-		status = smp_call_function(uncached_ipi_visibility, uc_pool, 1);
-		if (status || atomic_read(&uc_pool->status))
+		smp_call_function(uncached_ipi_visibility, uc_pool, 1);
+		if (atomic_read(&uc_pool->status))
 			goto failed;
 	} else if (status != PAL_VISIBILITY_OK)
 		goto failed;
@@ -146,8 +146,8 @@ static int uncached_add_chunk(struct uncached_pool *uc_pool, int nid)
 	if (status != PAL_STATUS_SUCCESS)
 		goto failed;
 	atomic_set(&uc_pool->status, 0);
-	status = smp_call_function(uncached_ipi_mc_drain, uc_pool, 1);
-	if (status || atomic_read(&uc_pool->status))
+	smp_call_function(uncached_ipi_mc_drain, uc_pool, 1);
+	if (atomic_read(&uc_pool->status))
 		goto failed;
 
 	/*
diff --git a/arch/x86/lib/cache-smp.c b/arch/x86/lib/cache-smp.c
index 1811fa4a1b1a..7c48ff4ae8d1 100644
--- a/arch/x86/lib/cache-smp.c
+++ b/arch/x86/lib/cache-smp.c
@@ -15,6 +15,7 @@ EXPORT_SYMBOL(wbinvd_on_cpu);
 
 int wbinvd_on_all_cpus(void)
 {
-	return on_each_cpu(__wbinvd, NULL, 1);
+	on_each_cpu(__wbinvd, NULL, 1);
+	return 0;
 }
 EXPORT_SYMBOL(wbinvd_on_all_cpus);
diff --git a/drivers/char/agp/generic.c b/drivers/char/agp/generic.c
index 658664a5a5aa..df1edb5ec0ad 100644
--- a/drivers/char/agp/generic.c
+++ b/drivers/char/agp/generic.c
@@ -1311,8 +1311,7 @@ static void ipi_handler(void *null)
 
 void global_cache_flush(void)
 {
-	if (on_each_cpu(ipi_handler, NULL, 1) != 0)
-		panic(PFX "timed out waiting for the other CPUs!\n");
+	on_each_cpu(ipi_handler, NULL, 1);
 }
 EXPORT_SYMBOL(global_cache_flush);
 
diff --git a/include/linux/smp.h b/include/linux/smp.h
index a56f08ff3097..bb8b451ab01f 100644
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -35,7 +35,7 @@ int smp_call_function_single(int cpuid, smp_call_func_t func, void *info,
 /*
  * Call a function on all processors
  */
-int on_each_cpu(smp_call_func_t func, void *info, int wait);
+void on_each_cpu(smp_call_func_t func, void *info, int wait);
 
 /*
  * Call a function on processors specified by mask, which might include
@@ -101,7 +101,7 @@ extern void smp_cpus_done(unsigned int max_cpus);
 /*
  * Call a function on all other processors
  */
-int smp_call_function(smp_call_func_t func, void *info, int wait);
+void smp_call_function(smp_call_func_t func, void *info, int wait);
 void smp_call_function_many(const struct cpumask *mask,
 			    smp_call_func_t func, void *info, bool wait);
 
@@ -144,9 +144,8 @@ static inline void smp_send_stop(void) { }
  *	These macros fold the SMP functionality into a single CPU system
  */
 #define raw_smp_processor_id()			0
-static inline int up_smp_call_function(smp_call_func_t func, void *info)
+static inline void up_smp_call_function(smp_call_func_t func, void *info)
 {
-	return 0;
 }
 #define smp_call_function(func, info, wait) \
 			(up_smp_call_function(func, info))
diff --git a/kernel/smp.c b/kernel/smp.c
index f4cf1b0bb3b8..e0b05ac53108 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -486,13 +486,11 @@ EXPORT_SYMBOL(smp_call_function_many);
  * You must not call this function with disabled interrupts or from a
  * hardware interrupt handler or from a bottom half handler.
  */
-int smp_call_function(smp_call_func_t func, void *info, int wait)
+void smp_call_function(smp_call_func_t func, void *info, int wait)
 {
 	preempt_disable();
 	smp_call_function_many(cpu_online_mask, func, info, wait);
 	preempt_enable();
-
-	return 0;
 }
 EXPORT_SYMBOL(smp_call_function);
 
@@ -593,18 +591,16 @@ void __init smp_init(void)
  * early_boot_irqs_disabled is set.  Use local_irq_save/restore() instead
  * of local_irq_disable/enable().
  */
-int on_each_cpu(void (*func) (void *info), void *info, int wait)
+void on_each_cpu(void (*func) (void *info), void *info, int wait)
 {
 	unsigned long flags;
-	int ret = 0;
 
 	preempt_disable();
-	ret = smp_call_function(func, info, wait);
+	smp_call_function(func, info, wait);
 	local_irq_save(flags);
 	func(info);
 	local_irq_restore(flags);
 	preempt_enable();
-	return ret;
 }
 EXPORT_SYMBOL(on_each_cpu);
 
diff --git a/kernel/up.c b/kernel/up.c
index ff536f9cc8a2..378551e79894 100644
--- a/kernel/up.c
+++ b/kernel/up.c
@@ -34,14 +34,13 @@ int smp_call_function_single_async(int cpu, call_single_data_t *csd)
 }
 EXPORT_SYMBOL(smp_call_function_single_async);
 
-int on_each_cpu(smp_call_func_t func, void *info, int wait)
+void on_each_cpu(smp_call_func_t func, void *info, int wait)
 {
 	unsigned long flags;
 
 	local_irq_save(flags);
 	func(info);
 	local_irq_restore(flags);
-	return 0;
 }
 EXPORT_SYMBOL(on_each_cpu);
 
-- 
2.20.1



* [RFC PATCH 2/6] cpumask: Purify cpumask_next()
  2019-05-25  8:21 [RFC PATCH 0/6] x86/mm: Flush remote and local TLBs concurrently Nadav Amit
  2019-05-25  8:21 ` [RFC PATCH 1/6] smp: Remove smp_call_function() and on_each_cpu() return values Nadav Amit
@ 2019-05-25  8:21 ` Nadav Amit
  2019-05-25  8:32   ` Ingo Molnar
  2019-05-27  8:30   ` Peter Zijlstra
  2019-05-25  8:22 ` [RFC PATCH 3/6] smp: Run functions concurrently in smp_call_function_many() Nadav Amit
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 24+ messages in thread
From: Nadav Amit @ 2019-05-25  8:21 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Andy Lutomirski
  Cc: Borislav Petkov, linux-kernel, Nadav Amit, David S. Miller

cpumask_next() has no side-effects. Mark it as pure.
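
For context (an illustration, not part of the patch): __pure tells the
compiler that the function has no side-effects and that its result
depends only on its arguments and the memory they point to, so repeated
calls with unchanged inputs may be merged. It cannot be marked
__attribute__((const)), since it reads the cpumask through srcp.

/* Illustrative only: with __pure the compiler may evaluate the call once. */
static int next_cpu_twice(int n, const struct cpumask *mask)
{
        /*
         * Identical arguments and no intervening writes: the second
         * call may be folded into the first.
         */
        return cpumask_next(n, mask) + cpumask_next(n, mask);
}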

Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 include/linux/cpumask.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index 147bdec42215..20df46705f9c 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -209,7 +209,7 @@ static inline unsigned int cpumask_last(const struct cpumask *srcp)
 	return find_last_bit(cpumask_bits(srcp), nr_cpumask_bits);
 }
 
-unsigned int cpumask_next(int n, const struct cpumask *srcp);
+unsigned int __pure cpumask_next(int n, const struct cpumask *srcp);
 
 /**
  * cpumask_next_zero - get the next unset cpu in a cpumask
-- 
2.20.1



* [RFC PATCH 3/6] smp: Run functions concurrently in smp_call_function_many()
  2019-05-25  8:21 [RFC PATCH 0/6] x86/mm: Flush remote and local TLBs concurrently Nadav Amit
  2019-05-25  8:21 ` [RFC PATCH 1/6] smp: Remove smp_call_function() and on_each_cpu() return values Nadav Amit
  2019-05-25  8:21 ` [RFC PATCH 2/6] cpumask: Purify cpumask_next() Nadav Amit
@ 2019-05-25  8:22 ` Nadav Amit
  2019-05-27  9:15   ` Peter Zijlstra
  2019-05-25  8:22 ` [RFC PATCH 4/6] x86/mm/tlb: Refactor common code into flush_tlb_on_cpus() Nadav Amit
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 24+ messages in thread
From: Nadav Amit @ 2019-05-25  8:22 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Andy Lutomirski
  Cc: Borislav Petkov, linux-kernel, Nadav Amit, Rik van Riel,
	Thomas Gleixner, Josh Poimboeuf

Currently, on_each_cpu() and similar functions do not exploit the
potential for concurrency: the function is first executed remotely and
only then locally. Functions such as a TLB flush can take considerable
time, so this provides an opportunity for performance optimization.

To do so, introduce __smp_call_function_many(), which allows the
callers to provide the local and remote functions that should be
executed, and runs them concurrently. Keep the semantics of
smp_call_function_many() as they are today for backward compatibility:
in that case, the called function is not executed locally.
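
A caller-side sketch (illustrative caller and callback names, using the
signature introduced below):

/* Illustrative callback; must be fast and non-blocking. */
static void do_flush(void *info)
{
        /* ... per-CPU work ... */
}

static void flush_everywhere(const struct cpumask *mask, void *info)
{
        preempt_disable();
        /*
         * The remote CPUs are IPI'd first, the local callback then runs
         * while they work, and with wait=true we only return once all
         * remote calls have completed.
         */
        __smp_call_function_many(mask, do_flush, do_flush, info, true);
        preempt_enable();
}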

__smp_call_function_many() does not use the optimized code path for a
single remote target that smp_call_function_single() implements. For a
synchronous function call, smp_call_function_single() keeps the
call_single_data (which is used for synchronization) on the stack.
Interestingly, it seems that not using this optimization provides
greater performance improvements (greater speedup with a single remote
target than with multiple ones). Presumably, holding data structures
that are intended for synchronization on the stack can introduce
overheads due to TLB misses and false-sharing when the stack is used
for other purposes.

Adding support to run the functions concurrently required removing a
micro-optimization in on_each_cpu() that disabled/enabled IRQs instead
of saving/restoring them. The benefit of running the local and remote
code concurrently is expected to outweigh this.

CC: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 include/linux/smp.h |  27 ++++++---
 kernel/smp.c        | 133 +++++++++++++++++++++-----------------------
 2 files changed, 83 insertions(+), 77 deletions(-)

diff --git a/include/linux/smp.h b/include/linux/smp.h
index bb8b451ab01f..b69abc88259d 100644
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -32,11 +32,6 @@ extern unsigned int total_cpus;
 int smp_call_function_single(int cpuid, smp_call_func_t func, void *info,
 			     int wait);
 
-/*
- * Call a function on all processors
- */
-void on_each_cpu(smp_call_func_t func, void *info, int wait);
-
 /*
  * Call a function on processors specified by mask, which might include
  * the local one.
@@ -44,6 +39,15 @@ void on_each_cpu(smp_call_func_t func, void *info, int wait);
 void on_each_cpu_mask(const struct cpumask *mask, smp_call_func_t func,
 		void *info, bool wait);
 
+/*
+ * Call a function on all processors.  May be used during early boot while
+ * early_boot_irqs_disabled is set.
+ */
+static inline void on_each_cpu(smp_call_func_t func, void *info, int wait)
+{
+	on_each_cpu_mask(cpu_online_mask, func, info, wait);
+}
+
 /*
  * Call a function on each processor for which the supplied function
  * cond_func returns a positive value. This may include the local
@@ -102,8 +106,17 @@ extern void smp_cpus_done(unsigned int max_cpus);
  * Call a function on all other processors
  */
 void smp_call_function(smp_call_func_t func, void *info, int wait);
-void smp_call_function_many(const struct cpumask *mask,
-			    smp_call_func_t func, void *info, bool wait);
+
+void __smp_call_function_many(const struct cpumask *mask,
+			      smp_call_func_t remote_func,
+			      smp_call_func_t local_func,
+			      void *info, bool wait);
+
+static inline void smp_call_function_many(const struct cpumask *mask,
+				smp_call_func_t func, void *info, bool wait)
+{
+	__smp_call_function_many(mask, func, NULL, info, wait);
+}
 
 int smp_call_function_any(const struct cpumask *mask,
 			  smp_call_func_t func, void *info, int wait);
diff --git a/kernel/smp.c b/kernel/smp.c
index e0b05ac53108..17750608c52c 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -387,9 +387,13 @@ int smp_call_function_any(const struct cpumask *mask,
 EXPORT_SYMBOL_GPL(smp_call_function_any);
 
 /**
- * smp_call_function_many(): Run a function on a set of other CPUs.
+ * __smp_call_function_many(): Run a function on a set of CPUs.
  * @mask: The set of cpus to run on (only runs on online subset).
- * @func: The function to run. This must be fast and non-blocking.
+ * @remote_func: The function to run on remote cores. This must be fast and
+ *		 non-blocking.
+ * @local_func: The function that should be run on this CPU. This must be
+ *		fast and non-blocking. If NULL is provided, no function will
+ *		be executed on this CPU.
  * @info: An arbitrary pointer to pass to the function.
  * @wait: If true, wait (atomically) until function has completed
  *        on other CPUs.
@@ -400,11 +404,16 @@ EXPORT_SYMBOL_GPL(smp_call_function_any);
  * hardware interrupt handler or from a bottom half handler. Preemption
  * must be disabled when calling this function.
  */
-void smp_call_function_many(const struct cpumask *mask,
-			    smp_call_func_t func, void *info, bool wait)
+void __smp_call_function_many(const struct cpumask *mask,
+			      smp_call_func_t remote_func,
+			      smp_call_func_t local_func,
+			      void *info, bool wait)
 {
+	int cpu, last_cpu, this_cpu = smp_processor_id();
 	struct call_function_data *cfd;
-	int cpu, next_cpu, this_cpu = smp_processor_id();
+	bool run_remote = false;
+	bool run_local = false;
+	int nr_cpus = 0;
 
 	/*
 	 * Can deadlock when called with interrupts disabled.
@@ -412,55 +421,62 @@ void smp_call_function_many(const struct cpumask *mask,
 	 * send smp call function interrupt to this cpu and as such deadlocks
 	 * can't happen.
 	 */
-	WARN_ON_ONCE(cpu_online(this_cpu) && irqs_disabled()
-		     && !oops_in_progress && !early_boot_irqs_disabled);
+	if (cpu_online(this_cpu) && !oops_in_progress && !early_boot_irqs_disabled)
+		lockdep_assert_irqs_enabled();
+
+	/* Check if we need local execution. */
+	if (local_func && cpumask_test_cpu(this_cpu, mask))
+		run_local = true;
 
-	/* Try to fastpath.  So, what's a CPU they want? Ignoring this one. */
+	/* Check if we need remote execution, i.e., any CPU excluding this one. */
 	cpu = cpumask_first_and(mask, cpu_online_mask);
 	if (cpu == this_cpu)
 		cpu = cpumask_next_and(cpu, mask, cpu_online_mask);
+	if (cpu < nr_cpu_ids)
+		run_remote = true;
 
-	/* No online cpus?  We're done. */
-	if (cpu >= nr_cpu_ids)
-		return;
+	if (run_remote) {
+		cfd = this_cpu_ptr(&cfd_data);
 
-	/* Do we have another CPU which isn't us? */
-	next_cpu = cpumask_next_and(cpu, mask, cpu_online_mask);
-	if (next_cpu == this_cpu)
-		next_cpu = cpumask_next_and(next_cpu, mask, cpu_online_mask);
-
-	/* Fastpath: do that cpu by itself. */
-	if (next_cpu >= nr_cpu_ids) {
-		smp_call_function_single(cpu, func, info, wait);
-		return;
-	}
+		cpumask_and(cfd->cpumask, mask, cpu_online_mask);
+		__cpumask_clear_cpu(this_cpu, cfd->cpumask);
 
-	cfd = this_cpu_ptr(&cfd_data);
-
-	cpumask_and(cfd->cpumask, mask, cpu_online_mask);
-	__cpumask_clear_cpu(this_cpu, cfd->cpumask);
+		cpumask_clear(cfd->cpumask_ipi);
+		for_each_cpu(cpu, cfd->cpumask) {
+			call_single_data_t *csd = per_cpu_ptr(cfd->csd, cpu);
+
+			nr_cpus++;
+			last_cpu = cpu;
+
+			csd_lock(csd);
+			if (wait)
+				csd->flags |= CSD_FLAG_SYNCHRONOUS;
+			csd->func = remote_func;
+			csd->info = info;
+			if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
+				__cpumask_set_cpu(cpu, cfd->cpumask_ipi);
+		}
 
-	/* Some callers race with other cpus changing the passed mask */
-	if (unlikely(!cpumask_weight(cfd->cpumask)))
-		return;
+		/*
+		 * Choose the most efficient way to send an IPI. Note that the
+		 * number of CPUs might be zero due to concurrent changes to the
+		 * provided mask or cpu_online_mask.
+		 */
+		if (nr_cpus == 1)
+			arch_send_call_function_single_ipi(last_cpu);
+		else if (likely(nr_cpus > 1))
+			arch_send_call_function_ipi_mask(cfd->cpumask_ipi);
+	}
 
-	cpumask_clear(cfd->cpumask_ipi);
-	for_each_cpu(cpu, cfd->cpumask) {
-		call_single_data_t *csd = per_cpu_ptr(cfd->csd, cpu);
+	if (run_local) {
+		unsigned long flags;
 
-		csd_lock(csd);
-		if (wait)
-			csd->flags |= CSD_FLAG_SYNCHRONOUS;
-		csd->func = func;
-		csd->info = info;
-		if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
-			__cpumask_set_cpu(cpu, cfd->cpumask_ipi);
+		local_irq_save(flags);
+		local_func(info);
+		local_irq_restore(flags);
 	}
 
-	/* Send a message to all CPUs in the map */
-	arch_send_call_function_ipi_mask(cfd->cpumask_ipi);
-
-	if (wait) {
+	if (run_remote && wait) {
 		for_each_cpu(cpu, cfd->cpumask) {
 			call_single_data_t *csd;
 
@@ -469,7 +485,7 @@ void smp_call_function_many(const struct cpumask *mask,
 		}
 	}
 }
-EXPORT_SYMBOL(smp_call_function_many);
+EXPORT_SYMBOL(__smp_call_function_many);
 
 /**
  * smp_call_function(): Run a function on all other CPUs.
@@ -586,24 +602,6 @@ void __init smp_init(void)
 	smp_cpus_done(setup_max_cpus);
 }
 
-/*
- * Call a function on all processors.  May be used during early boot while
- * early_boot_irqs_disabled is set.  Use local_irq_save/restore() instead
- * of local_irq_disable/enable().
- */
-void on_each_cpu(void (*func) (void *info), void *info, int wait)
-{
-	unsigned long flags;
-
-	preempt_disable();
-	smp_call_function(func, info, wait);
-	local_irq_save(flags);
-	func(info);
-	local_irq_restore(flags);
-	preempt_enable();
-}
-EXPORT_SYMBOL(on_each_cpu);
-
 /**
  * on_each_cpu_mask(): Run a function on processors specified by
  * cpumask, which may include the local processor.
@@ -623,16 +621,11 @@ EXPORT_SYMBOL(on_each_cpu);
 void on_each_cpu_mask(const struct cpumask *mask, smp_call_func_t func,
 			void *info, bool wait)
 {
-	int cpu = get_cpu();
+	preempt_disable();
 
-	smp_call_function_many(mask, func, info, wait);
-	if (cpumask_test_cpu(cpu, mask)) {
-		unsigned long flags;
-		local_irq_save(flags);
-		func(info);
-		local_irq_restore(flags);
-	}
-	put_cpu();
+	__smp_call_function_many(mask, func, func, info, wait);
+
+	preempt_enable();
 }
 EXPORT_SYMBOL(on_each_cpu_mask);
 
-- 
2.20.1



* [RFC PATCH 4/6] x86/mm/tlb: Refactor common code into flush_tlb_on_cpus()
  2019-05-25  8:21 [RFC PATCH 0/6] x86/mm: Flush remote and local TLBs concurrently Nadav Amit
                   ` (2 preceding siblings ...)
  2019-05-25  8:22 ` [RFC PATCH 3/6] smp: Run functions concurrently in smp_call_function_many() Nadav Amit
@ 2019-05-25  8:22 ` Nadav Amit
  2019-05-27  9:24   ` Peter Zijlstra
  2019-05-25  8:22 ` [RFC PATCH 5/6] x86/mm/tlb: Flush remote and local TLBs concurrently Nadav Amit
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 24+ messages in thread
From: Nadav Amit @ 2019-05-25  8:22 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Andy Lutomirski
  Cc: Borislav Petkov, linux-kernel, Nadav Amit, Dave Hansen,
	Thomas Gleixner, H. Peter Anvin, x86

arch_tlbbatch_flush() and flush_tlb_mm_range() contain effectively the
same code. Extract the common code into a new function,
flush_tlb_on_cpus().

There is one functional change, which should not affect correctness:
flush_tlb_mm_range() compared loaded_mm with the mm to figure out
whether a local flush is needed. Instead, the common code looks at
mm_cpumask(), which should give the same result. Performance should not
be affected, since this cpumask should not change at a frequency that
would introduce cache contention.

Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/mm/tlb.c | 55 ++++++++++++++++++++++-------------------------
 1 file changed, 26 insertions(+), 29 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 7f61431c75fb..afd2859a6966 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -733,7 +733,11 @@ static inline struct flush_tlb_info *get_flush_tlb_info(struct mm_struct *mm,
 			unsigned int stride_shift, bool freed_tables,
 			u64 new_tlb_gen)
 {
-	struct flush_tlb_info *info = this_cpu_ptr(&flush_tlb_info);
+	struct flush_tlb_info *info;
+
+	preempt_disable();
+
+	info = this_cpu_ptr(&flush_tlb_info);
 
 #ifdef CONFIG_DEBUG_VM
 	/*
@@ -761,6 +765,23 @@ static inline void put_flush_tlb_info(void)
 	barrier();
 	this_cpu_dec(flush_tlb_info_idx);
 #endif
+	preempt_enable();
+}
+
+static void flush_tlb_on_cpus(const cpumask_t *cpumask,
+			      const struct flush_tlb_info *info)
+{
+	int this_cpu = smp_processor_id();
+
+	if (cpumask_test_cpu(this_cpu, cpumask)) {
+		lockdep_assert_irqs_enabled();
+		local_irq_disable();
+		flush_tlb_func_local(info, TLB_LOCAL_MM_SHOOTDOWN);
+		local_irq_enable();
+	}
+
+	if (cpumask_any_but(cpumask, this_cpu) < nr_cpu_ids)
+		flush_tlb_others(cpumask, info);
 }
 
 void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
@@ -769,9 +790,6 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 {
 	struct flush_tlb_info *info;
 	u64 new_tlb_gen;
-	int cpu;
-
-	cpu = get_cpu();
 
 	/* Should we flush just the requested range? */
 	if ((end == TLB_FLUSH_ALL) ||
@@ -786,18 +804,9 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 	info = get_flush_tlb_info(mm, start, end, stride_shift, freed_tables,
 				  new_tlb_gen);
 
-	if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
-		lockdep_assert_irqs_enabled();
-		local_irq_disable();
-		flush_tlb_func_local(info, TLB_LOCAL_MM_SHOOTDOWN);
-		local_irq_enable();
-	}
-
-	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
-		flush_tlb_others(mm_cpumask(mm), info);
+	flush_tlb_on_cpus(mm_cpumask(mm), info);
 
 	put_flush_tlb_info();
-	put_cpu();
 }
 
 
@@ -832,13 +841,11 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end)
 	} else {
 		struct flush_tlb_info *info;
 
-		preempt_disable();
 		info = get_flush_tlb_info(NULL, start, end, 0, false, 0);
 
 		on_each_cpu(do_kernel_range_flush, info, 1);
 
 		put_flush_tlb_info();
-		preempt_enable();
 	}
 }
 
@@ -856,21 +863,11 @@ static const struct flush_tlb_info full_flush_tlb_info = {
 
 void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 {
-	int cpu = get_cpu();
-
-	if (cpumask_test_cpu(cpu, &batch->cpumask)) {
-		lockdep_assert_irqs_enabled();
-		local_irq_disable();
-		flush_tlb_func_local(&full_flush_tlb_info, TLB_LOCAL_SHOOTDOWN);
-		local_irq_enable();
-	}
-
-	if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids)
-		flush_tlb_others(&batch->cpumask, &full_flush_tlb_info);
+	preempt_disable();
+	flush_tlb_on_cpus(&batch->cpumask, &full_flush_tlb_info);
+	preempt_enable();
 
 	cpumask_clear(&batch->cpumask);
-
-	put_cpu();
 }
 
 static ssize_t tlbflush_read_file(struct file *file, char __user *user_buf,
-- 
2.20.1



* [RFC PATCH 5/6] x86/mm/tlb: Flush remote and local TLBs concurrently
  2019-05-25  8:21 [RFC PATCH 0/6] x86/mm: Flush remote and local TLBs concurrently Nadav Amit
                   ` (3 preceding siblings ...)
  2019-05-25  8:22 ` [RFC PATCH 4/6] x86/mm/tlb: Refactor common code into flush_tlb_on_cpus() Nadav Amit
@ 2019-05-25  8:22 ` Nadav Amit
  2019-05-25  8:38   ` Nadav Amit
  2019-05-25  8:54   ` Juergen Gross
  2019-05-25  8:22 ` [RFC PATCH 6/6] x86/mm/tlb: Optimize local TLB flushes Nadav Amit
                   ` (2 subsequent siblings)
  7 siblings, 2 replies; 24+ messages in thread
From: Nadav Amit @ 2019-05-25  8:22 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Andy Lutomirski
  Cc: Borislav Petkov, linux-kernel, Nadav Amit, K. Y. Srinivasan,
	Haiyang Zhang, Stephen Hemminger, Sasha Levin, Thomas Gleixner,
	x86, Juergen Gross, Paolo Bonzini, Dave Hansen, Boris Ostrovsky,
	linux-hyperv, virtualization, kvm, xen-devel

To improve TLB shootdown performance, flush the remote and local TLBs
concurrently. Introduce flush_tlb_multi(), which does so. The current
flush_tlb_others() interface is kept, since paravirtual interfaces need
to be adapted before it can be removed; this is left for future work.
In such PV environments, TLB flushes are not yet performed
concurrently.

Add a static key to tell whether this new interface is supported.
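
Condensed from the diff below (not new code), the resulting dispatch is
roughly:

/* Default-on key; PV setups that only provide flush_tlb_others()
 * disable it with static_key_disable(&flush_tlb_multi_enabled.key). */
DEFINE_STATIC_KEY_TRUE(flush_tlb_multi_enabled);

/* Callers then pick a path as follows: */
if (static_branch_likely(&flush_tlb_multi_enabled)) {
        /* New interface: remote and local flushes run concurrently. */
        flush_tlb_multi(cpumask, info);
} else {
        /* Old behavior: local flush, then flush_tlb_others(). */
        ...
}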

Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: x86@kernel.org
Cc: Juergen Gross <jgross@suse.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: linux-hyperv@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Cc: kvm@vger.kernel.org
Cc: xen-devel@lists.xenproject.org
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/hyperv/mmu.c                 |  2 +
 arch/x86/include/asm/paravirt.h       |  8 +++
 arch/x86/include/asm/paravirt_types.h |  6 ++
 arch/x86/include/asm/tlbflush.h       |  6 ++
 arch/x86/kernel/kvm.c                 |  1 +
 arch/x86/kernel/paravirt.c            |  3 +
 arch/x86/mm/tlb.c                     | 80 +++++++++++++++++++++++----
 arch/x86/xen/mmu_pv.c                 |  2 +
 8 files changed, 96 insertions(+), 12 deletions(-)

diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
index e65d7fe6489f..ca28b400c87c 100644
--- a/arch/x86/hyperv/mmu.c
+++ b/arch/x86/hyperv/mmu.c
@@ -233,4 +233,6 @@ void hyperv_setup_mmu_ops(void)
 	pr_info("Using hypercall for remote TLB flush\n");
 	pv_ops.mmu.flush_tlb_others = hyperv_flush_tlb_others;
 	pv_ops.mmu.tlb_remove_table = tlb_remove_table;
+
+	static_key_disable(&flush_tlb_multi_enabled.key);
 }
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index c25c38a05c1c..192be7254457 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -47,6 +47,8 @@ static inline void slow_down_io(void)
 #endif
 }
 
+DECLARE_STATIC_KEY_TRUE(flush_tlb_multi_enabled);
+
 static inline void __flush_tlb(void)
 {
 	PVOP_VCALL0(mmu.flush_tlb_user);
@@ -62,6 +64,12 @@ static inline void __flush_tlb_one_user(unsigned long addr)
 	PVOP_VCALL1(mmu.flush_tlb_one_user, addr);
 }
 
+static inline void flush_tlb_multi(const struct cpumask *cpumask,
+				   const struct flush_tlb_info *info)
+{
+	PVOP_VCALL2(mmu.flush_tlb_multi, cpumask, info);
+}
+
 static inline void flush_tlb_others(const struct cpumask *cpumask,
 				    const struct flush_tlb_info *info)
 {
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 946f8f1f1efc..3a156e63c57d 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -211,6 +211,12 @@ struct pv_mmu_ops {
 	void (*flush_tlb_user)(void);
 	void (*flush_tlb_kernel)(void);
 	void (*flush_tlb_one_user)(unsigned long addr);
+	/*
+	 * flush_tlb_multi() is the preferred interface. When it is used,
+	 * flush_tlb_others() should return false.
+	 */
+	void (*flush_tlb_multi)(const struct cpumask *cpus,
+				const struct flush_tlb_info *info);
 	void (*flush_tlb_others)(const struct cpumask *cpus,
 				 const struct flush_tlb_info *info);
 
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index dee375831962..79272938cf79 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -569,6 +569,9 @@ static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
 	flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, PAGE_SHIFT, false);
 }
 
+void native_flush_tlb_multi(const struct cpumask *cpumask,
+			     const struct flush_tlb_info *info);
+
 void native_flush_tlb_others(const struct cpumask *cpumask,
 			     const struct flush_tlb_info *info);
 
@@ -593,6 +596,9 @@ static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
 extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
 
 #ifndef CONFIG_PARAVIRT
+#define flush_tlb_multi(mask, info)	\
+	native_flush_tlb_multi(mask, info)
+
 #define flush_tlb_others(mask, info)	\
 	native_flush_tlb_others(mask, info)
 
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 3f0cc828cc36..c1c2b88ea3f1 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -643,6 +643,7 @@ static void __init kvm_guest_init(void)
 	    kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
 		pv_ops.mmu.flush_tlb_others = kvm_flush_tlb_others;
 		pv_ops.mmu.tlb_remove_table = tlb_remove_table;
+		static_key_disable(&flush_tlb_multi_enabled.key);
 	}
 
 	if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 5492a669f658..1314f89304a8 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -171,6 +171,8 @@ unsigned paravirt_patch_insns(void *insn_buff, unsigned len,
 	return insn_len;
 }
 
+DEFINE_STATIC_KEY_TRUE(flush_tlb_multi_enabled);
+
 static void native_flush_tlb(void)
 {
 	__native_flush_tlb();
@@ -375,6 +377,7 @@ struct paravirt_patch_template pv_ops = {
 	.mmu.flush_tlb_user	= native_flush_tlb,
 	.mmu.flush_tlb_kernel	= native_flush_tlb_global,
 	.mmu.flush_tlb_one_user	= native_flush_tlb_one_user,
+	.mmu.flush_tlb_multi	= native_flush_tlb_multi,
 	.mmu.flush_tlb_others	= native_flush_tlb_others,
 	.mmu.tlb_remove_table	=
 			(void (*)(struct mmu_gather *, void *))tlb_remove_page,
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index afd2859a6966..0ec2bfca7581 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -550,7 +550,7 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
 		 * garbage into our TLB.  Since switching to init_mm is barely
 		 * slower than a minimal flush, just switch to init_mm.
 		 *
-		 * This should be rare, with native_flush_tlb_others skipping
+		 * This should be rare, with native_flush_tlb_multi skipping
 		 * IPIs to lazy TLB mode CPUs.
 		 */
 		switch_mm_irqs_off(NULL, &init_mm, NULL);
@@ -634,9 +634,12 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
 	this_cpu_write(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen, mm_tlb_gen);
 }
 
-static void flush_tlb_func_local(const void *info, enum tlb_flush_reason reason)
+static void flush_tlb_func_local(void *info)
 {
 	const struct flush_tlb_info *f = info;
+	enum tlb_flush_reason reason;
+
+	reason = (f->mm == NULL) ? TLB_LOCAL_SHOOTDOWN : TLB_LOCAL_MM_SHOOTDOWN;
 
 	flush_tlb_func_common(f, true, reason);
 }
@@ -654,14 +657,30 @@ static void flush_tlb_func_remote(void *info)
 	flush_tlb_func_common(f, false, TLB_REMOTE_SHOOTDOWN);
 }
 
-static bool tlb_is_not_lazy(int cpu, void *data)
+static inline bool tlb_is_not_lazy(int cpu)
 {
 	return !per_cpu(cpu_tlbstate.is_lazy, cpu);
 }
 
-void native_flush_tlb_others(const struct cpumask *cpumask,
-			     const struct flush_tlb_info *info)
+static DEFINE_PER_CPU(cpumask_t, flush_tlb_mask);
+
+void native_flush_tlb_multi(const struct cpumask *cpumask,
+			    const struct flush_tlb_info *info)
 {
+	/*
+	 * native_flush_tlb_multi() can handle a single CPU, but it is
+	 * suboptimal if the local TLB should be flushed, and therefore should
+	 * not be used in such case. Check that it is not used in such case,
+	 * and use this assumption for tracing and accounting of remote TLB
+	 * flushes.
+	 */
+	VM_WARN_ON(!cpumask_any_but(cpumask, smp_processor_id()));
+
+	/*
+	 * Do accounting and tracing. Note that there are (and have always been)
+	 * cases in which a remote TLB flush will be traced, but eventually
+	 * would not happen.
+	 */
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
 	if (info->end == TLB_FLUSH_ALL)
 		trace_tlb_flush(TLB_REMOTE_SEND_IPI, TLB_FLUSH_ALL);
@@ -681,10 +700,14 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 		 * means that the percpu tlb_gen variables won't be updated
 		 * and we'll do pointless flushes on future context switches.
 		 *
-		 * Rather than hooking native_flush_tlb_others() here, I think
+		 * Rather than hooking native_flush_tlb_multi() here, I think
 		 * that UV should be updated so that smp_call_function_many(),
 		 * etc, are optimal on UV.
 		 */
+		local_irq_disable();
+		flush_tlb_func_local((__force void *)info);
+		local_irq_enable();
+
 		cpumask = uv_flush_tlb_others(cpumask, info);
 		if (cpumask)
 			smp_call_function_many(cpumask, flush_tlb_func_remote,
@@ -703,11 +726,39 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 	 * doing a speculative memory access.
 	 */
 	if (info->freed_tables)
-		smp_call_function_many(cpumask, flush_tlb_func_remote,
-			       (void *)info, 1);
-	else
-		on_each_cpu_cond_mask(tlb_is_not_lazy, flush_tlb_func_remote,
-				(void *)info, 1, GFP_ATOMIC, cpumask);
+		__smp_call_function_many(cpumask, flush_tlb_func_remote,
+					 flush_tlb_func_local, (void *)info, 1);
+	else {
+		/*
+		 * Although we could have used on_each_cpu_cond_mask(),
+		 * open-coding it has several performance advantages: (1) we can
+		 * use specialized functions for remote and local flushes; (2)
+		 * no need for indirect branch to test if TLB is lazy; (3) we
+		 * can use a designated cpumask for evaluating the condition
+		 * instead of allocating a new one.
+		 *
+		 * This works under the assumption that there are no nested TLB
+		 * flushes, an assumption that is already made in
+		 * flush_tlb_mm_range().
+		 */
+		struct cpumask *cond_cpumask = this_cpu_ptr(&flush_tlb_mask);
+		int cpu;
+
+		cpumask_clear(cond_cpumask);
+
+		for_each_cpu(cpu, cpumask) {
+			if (tlb_is_not_lazy(cpu))
+				__cpumask_set_cpu(cpu, cond_cpumask);
+		}
+		__smp_call_function_many(cond_cpumask, flush_tlb_func_remote,
+					 flush_tlb_func_local, (void *)info, 1);
+	}
+}
+
+void native_flush_tlb_others(const struct cpumask *cpumask,
+			     const struct flush_tlb_info *info)
+{
+	native_flush_tlb_multi(cpumask, info);
 }
 
 /*
@@ -773,10 +824,15 @@ static void flush_tlb_on_cpus(const cpumask_t *cpumask,
 {
 	int this_cpu = smp_processor_id();
 
+	if (static_branch_likely(&flush_tlb_multi_enabled)) {
+		flush_tlb_multi(cpumask, info);
+		return;
+	}
+
 	if (cpumask_test_cpu(this_cpu, cpumask)) {
 		lockdep_assert_irqs_enabled();
 		local_irq_disable();
-		flush_tlb_func_local(info, TLB_LOCAL_MM_SHOOTDOWN);
+		flush_tlb_func_local((__force void *)info);
 		local_irq_enable();
 	}
 
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index beb44e22afdf..0cb277848cb4 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -2474,6 +2474,8 @@ void __init xen_init_mmu_ops(void)
 
 	pv_ops.mmu = xen_mmu_ops;
 
+	static_key_disable(&flush_tlb_multi_enabled.key);
+
 	memset(dummy_mapping, 0xff, PAGE_SIZE);
 }
 
-- 
2.20.1



* [RFC PATCH 6/6] x86/mm/tlb: Optimize local TLB flushes
  2019-05-25  8:21 [RFC PATCH 0/6] x86/mm: Flush remote and local TLBs concurrently Nadav Amit
                   ` (4 preceding siblings ...)
  2019-05-25  8:22 ` [RFC PATCH 5/6] x86/mm/tlb: Flush remote and local TLBs concurrently Nadav Amit
@ 2019-05-25  8:22 ` Nadav Amit
  2019-05-27  8:28 ` [RFC PATCH 0/6] x86/mm: Flush remote and local TLBs concurrently Peter Zijlstra
  2019-05-27  9:59 ` Peter Zijlstra
  7 siblings, 0 replies; 24+ messages in thread
From: Nadav Amit @ 2019-05-25  8:22 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Andy Lutomirski
  Cc: Borislav Petkov, linux-kernel, Nadav Amit, Dave Hansen,
	Thomas Gleixner, x86

While the updated smp infrastructure is capable of running a function on
a single local core, it is not optimized for this case. The multiple
function calls and the indirect branch introduce some overhead, making
local TLB flushes slower than they were before the recent changes.

Before calling the SMP infrastructure, check whether only a local TLB
flush is needed, to restore the lost performance in this common case.
This requires checking mm_cpumask() one more time, but unless this mask
is updated very frequently, this should not hurt performance.

Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: x86@kernel.org
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/mm/tlb.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 0ec2bfca7581..3f3f983e224e 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -823,8 +823,12 @@ static void flush_tlb_on_cpus(const cpumask_t *cpumask,
 			      const struct flush_tlb_info *info)
 {
 	int this_cpu = smp_processor_id();
+	bool flush_others = false;
 
-	if (static_branch_likely(&flush_tlb_multi_enabled)) {
+	if (cpumask_any_but(cpumask, this_cpu) < nr_cpu_ids)
+		flush_others = true;
+
+	if (static_branch_likely(&flush_tlb_multi_enabled) && flush_others) {
 		flush_tlb_multi(cpumask, info);
 		return;
 	}
@@ -836,7 +840,7 @@ static void flush_tlb_on_cpus(const cpumask_t *cpumask,
 		local_irq_enable();
 	}
 
-	if (cpumask_any_but(cpumask, this_cpu) < nr_cpu_ids)
+	if (flush_others)
 		flush_tlb_others(cpumask, info);
 }
 
-- 
2.20.1



* Re: [RFC PATCH 2/6] cpumask: Purify cpumask_next()
  2019-05-25  8:21 ` [RFC PATCH 2/6] cpumask: Purify cpumask_next() Nadav Amit
@ 2019-05-25  8:32   ` Ingo Molnar
  2019-05-27  8:30   ` Peter Zijlstra
  1 sibling, 0 replies; 24+ messages in thread
From: Ingo Molnar @ 2019-05-25  8:32 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Ingo Molnar, Peter Zijlstra, Andy Lutomirski, Borislav Petkov,
	linux-kernel, David S. Miller


* Nadav Amit <namit@vmware.com> wrote:

> cpumask_next() has no side-effects. Mark it as pure.
> 
> Cc: "David S. Miller" <davem@davemloft.net>
> Signed-off-by: Nadav Amit <namit@vmware.com>
> ---
>  include/linux/cpumask.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
> index 147bdec42215..20df46705f9c 100644
> --- a/include/linux/cpumask.h
> +++ b/include/linux/cpumask.h
> @@ -209,7 +209,7 @@ static inline unsigned int cpumask_last(const struct cpumask *srcp)
>  	return find_last_bit(cpumask_bits(srcp), nr_cpumask_bits);
>  }
>  
> -unsigned int cpumask_next(int n, const struct cpumask *srcp);
> +unsigned int __pure cpumask_next(int n, const struct cpumask *srcp);

I suppose this makes a code generation difference somewhere, right?

I'm wondering, couldn't it also be marked a const function? That's 
supposedly an even better category.

Thanks,

	Ingo


* Re: [RFC PATCH 5/6] x86/mm/tlb: Flush remote and local TLBs concurrently
  2019-05-25  8:22 ` [RFC PATCH 5/6] x86/mm/tlb: Flush remote and local TLBs concurrently Nadav Amit
@ 2019-05-25  8:38   ` Nadav Amit
  2019-05-25  8:54   ` Juergen Gross
  1 sibling, 0 replies; 24+ messages in thread
From: Nadav Amit @ 2019-05-25  8:38 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Ingo Molnar, Peter Zijlstra, Andy Lutomirski, Borislav Petkov,
	linux-kernel, K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger,
	Sasha Levin, Thomas Gleixner, x86, Juergen Gross, Paolo Bonzini,
	Dave Hansen, Boris Ostrovsky, linux-hyperv, virtualization, kvm,
	xen-devel

> On May 25, 2019, at 1:22 AM, Nadav Amit <namit@vmware.com> wrote:
> 
> To improve TLB shootdown performance, flush the remote and local TLBs
> concurrently. Introduce flush_tlb_multi(), which does so. The current
> flush_tlb_others() interface is kept, since paravirtual interfaces need
> to be adapted before it can be removed; this is left for future work.
> In such PV environments, TLB flushes are not yet performed
> concurrently.
> 
> +void native_flush_tlb_multi(const struct cpumask *cpumask,
> +			    const struct flush_tlb_info *info)
> {
> +	/*
> +	 * native_flush_tlb_multi() can handle a single CPU, but it is
> +	 * suboptimal if the local TLB should be flushed, and therefore should
> +	 * not be used in such case. Check that it is not used in such case,
> +	 * and use this assumption for tracing and accounting of remote TLB
> +	 * flushes.
> +	 */
> +	VM_WARN_ON(!cpumask_any_but(cpumask, smp_processor_id()));

This warning might fire off incorrectly and will be removed.



* Re: [RFC PATCH 5/6] x86/mm/tlb: Flush remote and local TLBs concurrently
  2019-05-25  8:22 ` [RFC PATCH 5/6] x86/mm/tlb: Flush remote and local TLBs concurrently Nadav Amit
  2019-05-25  8:38   ` Nadav Amit
@ 2019-05-25  8:54   ` Juergen Gross
  2019-05-27  9:47     ` Peter Zijlstra
  1 sibling, 1 reply; 24+ messages in thread
From: Juergen Gross @ 2019-05-25  8:54 UTC (permalink / raw)
  To: Nadav Amit, Ingo Molnar, Peter Zijlstra, Andy Lutomirski
  Cc: Borislav Petkov, linux-kernel, K. Y. Srinivasan, Haiyang Zhang,
	Stephen Hemminger, Sasha Levin, Thomas Gleixner, x86,
	Paolo Bonzini, Dave Hansen, Boris Ostrovsky, linux-hyperv,
	virtualization, kvm, xen-devel

On 25/05/2019 10:22, Nadav Amit wrote:
> To improve TLB shootdown performance, flush the remote and local TLBs
> concurrently. Introduce flush_tlb_multi(), which does so. The current
> flush_tlb_others() interface is kept, since paravirtual interfaces need
> to be adapted before it can be removed; this is left for future work.
> In such PV environments, TLB flushes are not yet performed
> concurrently.
> 
> Add a static key to tell whether this new interface is supported.
> 
> Cc: "K. Y. Srinivasan" <kys@microsoft.com>
> Cc: Haiyang Zhang <haiyangz@microsoft.com>
> Cc: Stephen Hemminger <sthemmin@microsoft.com>
> Cc: Sasha Levin <sashal@kernel.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: x86@kernel.org
> Cc: Juergen Gross <jgross@suse.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
> Cc: linux-hyperv@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: virtualization@lists.linux-foundation.org
> Cc: kvm@vger.kernel.org
> Cc: xen-devel@lists.xenproject.org
> Signed-off-by: Nadav Amit <namit@vmware.com>
> ---
>  arch/x86/hyperv/mmu.c                 |  2 +
>  arch/x86/include/asm/paravirt.h       |  8 +++
>  arch/x86/include/asm/paravirt_types.h |  6 ++
>  arch/x86/include/asm/tlbflush.h       |  6 ++
>  arch/x86/kernel/kvm.c                 |  1 +
>  arch/x86/kernel/paravirt.c            |  3 +
>  arch/x86/mm/tlb.c                     | 80 +++++++++++++++++++++++----
>  arch/x86/xen/mmu_pv.c                 |  2 +
>  8 files changed, 96 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
> index e65d7fe6489f..ca28b400c87c 100644
> --- a/arch/x86/hyperv/mmu.c
> +++ b/arch/x86/hyperv/mmu.c
> @@ -233,4 +233,6 @@ void hyperv_setup_mmu_ops(void)
>  	pr_info("Using hypercall for remote TLB flush\n");
>  	pv_ops.mmu.flush_tlb_others = hyperv_flush_tlb_others;
>  	pv_ops.mmu.tlb_remove_table = tlb_remove_table;
> +
> +	static_key_disable(&flush_tlb_multi_enabled.key);
>  }
> diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
> index c25c38a05c1c..192be7254457 100644
> --- a/arch/x86/include/asm/paravirt.h
> +++ b/arch/x86/include/asm/paravirt.h
> @@ -47,6 +47,8 @@ static inline void slow_down_io(void)
>  #endif
>  }
>  
> +DECLARE_STATIC_KEY_TRUE(flush_tlb_multi_enabled);
> +
>  static inline void __flush_tlb(void)
>  {
>  	PVOP_VCALL0(mmu.flush_tlb_user);
> @@ -62,6 +64,12 @@ static inline void __flush_tlb_one_user(unsigned long addr)
>  	PVOP_VCALL1(mmu.flush_tlb_one_user, addr);
>  }
>  
> +static inline void flush_tlb_multi(const struct cpumask *cpumask,
> +				   const struct flush_tlb_info *info)
> +{
> +	PVOP_VCALL2(mmu.flush_tlb_multi, cpumask, info);
> +}
> +
>  static inline void flush_tlb_others(const struct cpumask *cpumask,
>  				    const struct flush_tlb_info *info)
>  {
> diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
> index 946f8f1f1efc..3a156e63c57d 100644
> --- a/arch/x86/include/asm/paravirt_types.h
> +++ b/arch/x86/include/asm/paravirt_types.h
> @@ -211,6 +211,12 @@ struct pv_mmu_ops {
>  	void (*flush_tlb_user)(void);
>  	void (*flush_tlb_kernel)(void);
>  	void (*flush_tlb_one_user)(unsigned long addr);
> +	/*
> +	 * flush_tlb_multi() is the preferred interface. When it is used,
> +	 * flush_tlb_others() should return false.

This comment does not make sense. flush_tlb_others() return type is
void.


Juergen


* Re: [RFC PATCH 0/6] x86/mm: Flush remote and local TLBs concurrently
  2019-05-25  8:21 [RFC PATCH 0/6] x86/mm: Flush remote and local TLBs concurrently Nadav Amit
                   ` (5 preceding siblings ...)
  2019-05-25  8:22 ` [RFC PATCH 6/6] x86/mm/tlb: Optimize local TLB flushes Nadav Amit
@ 2019-05-27  8:28 ` Peter Zijlstra
  2019-05-27  9:59 ` Peter Zijlstra
  7 siblings, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2019-05-27  8:28 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Ingo Molnar, Andy Lutomirski, Borislav Petkov, linux-kernel,
	Thomas Gleixner

On Sat, May 25, 2019 at 01:21:57AM -0700, Nadav Amit wrote:
> The proposed changes should also improve the performance of other
> invocations of on_each_cpu(). Hopefully, no one has relied on the
> behavior of on_each_cpu() that functions were first executed remotely
> and only then locally.

Oh gawd... I actually have vague memories of code doing exactly that,
but I can't for the life of me remember where :/


* Re: [RFC PATCH 2/6] cpumask: Purify cpumask_next()
  2019-05-25  8:21 ` [RFC PATCH 2/6] cpumask: Purify cpumask_next() Nadav Amit
  2019-05-25  8:32   ` Ingo Molnar
@ 2019-05-27  8:30   ` Peter Zijlstra
  2019-05-27 17:34     ` Nadav Amit
  1 sibling, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2019-05-27  8:30 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Ingo Molnar, Andy Lutomirski, Borislav Petkov, linux-kernel,
	David S. Miller

On Sat, May 25, 2019 at 01:21:59AM -0700, Nadav Amit wrote:
> cpumask_next() has no side-effects. Mark it as pure.

It would be good to have a few word on why... because apparently you
found this makes a difference.

> Cc: "David S. Miller" <davem@davemloft.net>
> Signed-off-by: Nadav Amit <namit@vmware.com>
> ---
>  include/linux/cpumask.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
> index 147bdec42215..20df46705f9c 100644
> --- a/include/linux/cpumask.h
> +++ b/include/linux/cpumask.h
> @@ -209,7 +209,7 @@ static inline unsigned int cpumask_last(const struct cpumask *srcp)
>  	return find_last_bit(cpumask_bits(srcp), nr_cpumask_bits);
>  }
>  
> -unsigned int cpumask_next(int n, const struct cpumask *srcp);
> +unsigned int __pure cpumask_next(int n, const struct cpumask *srcp);
>  
>  /**
>   * cpumask_next_zero - get the next unset cpu in a cpumask
> -- 
> 2.20.1
> 


* Re: [RFC PATCH 3/6] smp: Run functions concurrently in smp_call_function_many()
  2019-05-25  8:22 ` [RFC PATCH 3/6] smp: Run functions concurrently in smp_call_function_many() Nadav Amit
@ 2019-05-27  9:15   ` Peter Zijlstra
  2019-05-27 17:39     ` Nadav Amit
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2019-05-27  9:15 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Ingo Molnar, Andy Lutomirski, Borislav Petkov, linux-kernel,
	Rik van Riel, Thomas Gleixner, Josh Poimboeuf

> +		/*
> +		 * Choose the most efficient way to send an IPI. Note that the
> +		 * number of CPUs might be zero due to concurrent changes to the
> +		 * provided mask or cpu_online_mask.
> +		 */

Since we have preemption disabled here, I don't think online mask can
shrink, cpu-offline uses stop_machine().

> +		if (nr_cpus == 1)
> +			arch_send_call_function_single_ipi(last_cpu);
> +		else if (likely(nr_cpus > 1))
> +			arch_send_call_function_ipi_mask(cfd->cpumask_ipi);
> +	}


* Re: [RFC PATCH 4/6] x86/mm/tlb: Refactor common code into flush_tlb_on_cpus()
  2019-05-25  8:22 ` [RFC PATCH 4/6] x86/mm/tlb: Refactor common code into flush_tlb_on_cpus() Nadav Amit
@ 2019-05-27  9:24   ` Peter Zijlstra
  2019-05-27 18:59     ` Nadav Amit
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2019-05-27  9:24 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Ingo Molnar, Andy Lutomirski, Borislav Petkov, linux-kernel,
	Dave Hansen, Thomas Gleixner, H. Peter Anvin, x86

On Sat, May 25, 2019 at 01:22:01AM -0700, Nadav Amit wrote:

> There is one functional change, which should not affect correctness:
> flush_tlb_mm_range compared loaded_mm and the mm to figure out if local
> flush is needed. Instead, the common code would look at the mm_cpumask()
> which should give the same result.

> @@ -786,18 +804,9 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
>  	info = get_flush_tlb_info(mm, start, end, stride_shift, freed_tables,
>  				  new_tlb_gen);
>  
> -	if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
> -		lockdep_assert_irqs_enabled();
> -		local_irq_disable();
> -		flush_tlb_func_local(info, TLB_LOCAL_MM_SHOOTDOWN);
> -		local_irq_enable();
> -	}
> -
> -	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
> -		flush_tlb_others(mm_cpumask(mm), info);

So if we want to double check that; we'd add:

	WARN_ON_ONCE(cpumask_test_cpu(smp_processor_id(), mm_cpumask(mm)) ==
		     (mm == this_cpu_read(cpu_tlbstate.loaded_mm)));

right?

> +	flush_tlb_on_cpus(mm_cpumask(mm), info);
>  
>  	put_flush_tlb_info();
> -	put_cpu();
>  }

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 5/6] x86/mm/tlb: Flush remote and local TLBs concurrently
  2019-05-25  8:54   ` Juergen Gross
@ 2019-05-27  9:47     ` Peter Zijlstra
  2019-05-27 10:21       ` Paolo Bonzini
  2019-05-27 17:49       ` Nadav Amit
  0 siblings, 2 replies; 24+ messages in thread
From: Peter Zijlstra @ 2019-05-27  9:47 UTC (permalink / raw)
  To: Juergen Gross
  Cc: Nadav Amit, Ingo Molnar, Andy Lutomirski, Borislav Petkov,
	linux-kernel, K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger,
	Sasha Levin, Thomas Gleixner, x86, Paolo Bonzini, Dave Hansen,
	Boris Ostrovsky, linux-hyperv, virtualization, kvm, xen-devel

On Sat, May 25, 2019 at 10:54:50AM +0200, Juergen Gross wrote:
> On 25/05/2019 10:22, Nadav Amit wrote:

> > diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
> > index 946f8f1f1efc..3a156e63c57d 100644
> > --- a/arch/x86/include/asm/paravirt_types.h
> > +++ b/arch/x86/include/asm/paravirt_types.h
> > @@ -211,6 +211,12 @@ struct pv_mmu_ops {
> >  	void (*flush_tlb_user)(void);
> >  	void (*flush_tlb_kernel)(void);
> >  	void (*flush_tlb_one_user)(unsigned long addr);
> > +	/*
> > +	 * flush_tlb_multi() is the preferred interface. When it is used,
> > +	 * flush_tlb_others() should return false.
> 
> This comment does not make sense. flush_tlb_others() return type is
> void.

I suspect that is an artifact from before the static_key; an attempt to
make the pv interface less awkward.

Something like the below would work for KVM I suspect, the others
(Hyper-V and Xen are more 'interesting').

---
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -580,7 +580,7 @@ static void __init kvm_apf_trap_init(voi
 
 static DEFINE_PER_CPU(cpumask_var_t, __pv_tlb_mask);
 
-static void kvm_flush_tlb_others(const struct cpumask *cpumask,
+static void kvm_flush_tlb_multi(const struct cpumask *cpumask,
 			const struct flush_tlb_info *info)
 {
 	u8 state;
@@ -594,6 +594,9 @@ static void kvm_flush_tlb_others(const s
 	 * queue flush_on_enter for pre-empted vCPUs
 	 */
 	for_each_cpu(cpu, flushmask) {
+		if (cpu == smp_processor_id())
+			continue;
+
 		src = &per_cpu(steal_time, cpu);
 		state = READ_ONCE(src->preempted);
 		if ((state & KVM_VCPU_PREEMPTED)) {
@@ -603,7 +606,7 @@ static void kvm_flush_tlb_others(const s
 		}
 	}
 
-	native_flush_tlb_others(flushmask, info);
+	native_flush_tlb_multi(flushmask, info);
 }
 
 static void __init kvm_guest_init(void)
@@ -628,9 +631,8 @@ static void __init kvm_guest_init(void)
 	if (kvm_para_has_feature(KVM_FEATURE_PV_TLB_FLUSH) &&
 	    !kvm_para_has_hint(KVM_HINTS_REALTIME) &&
 	    kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
-		pv_ops.mmu.flush_tlb_others = kvm_flush_tlb_others;
+		pv_ops.mmu.flush_tlb_multi = kvm_flush_tlb_multi;
 		pv_ops.mmu.tlb_remove_table = tlb_remove_table;
-		static_key_disable(&flush_tlb_multi_enabled.key);
 	}
 
 	if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 0/6] x86/mm: Flush remote and local TLBs concurrently
  2019-05-25  8:21 [RFC PATCH 0/6] x86/mm: Flush remote and local TLBs concurrently Nadav Amit
                   ` (6 preceding siblings ...)
  2019-05-27  8:28 ` [RFC PATCH 0/6] x86/mm: Flush remote and local TLBs concurrently Peter Zijlstra
@ 2019-05-27  9:59 ` Peter Zijlstra
  7 siblings, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2019-05-27  9:59 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Ingo Molnar, Andy Lutomirski, Borislav Petkov, linux-kernel,
	jgross, kys, haiyangz, sthemmin, sashal

On Sat, May 25, 2019 at 01:21:57AM -0700, Nadav Amit wrote:
> Currently, local and remote TLB flushes are not performed concurrently,
> which introduces unnecessary overhead - each INVLPG can take 100s of
> cycles. This patch-set allows TLB flushes to be run concurrently: first
> request the remote CPUs to initiate the flush, then run it locally, and
> finally wait for the remote CPUs to finish their work.
> 
> The proposed changes should also improve the performance of other
> invocations of on_each_cpu(). Hopefully, no one has relied on the
> behavior of on_each_cpu() that functions were first executed remotely
> and only then locally.
> 
> On my Haswell machine (bare-metal), running a TLB flush microbenchmark
> (MADV_DONTNEED/touch for a single page on one thread), takes the
> following time (ns):
> 
> 	n_threads	before		after
> 	---------	------		-----
> 	1		661		663
> 	2		1436		1225 (-14%)
> 	4		1571		1421 (-10%)
> 
> Note that since the benchmark also causes page-faults, the actual
> speedup of TLB shootdowns is actually greater. Also note the higher
> improvement in performance with 2 thread (a single remote TLB flush
> target). This seems to be a side-effect of holding synchronization
> data-structures (csd) off the stack, unlike the way it is currently done
> (in smp_call_function_single()).
> 
> Patches 1-2 do small cleanup. Patches 3-5 actually implement the
> concurrent execution of TLB flushes. Patch 6 restores local TLB flushes
> performance, which was hurt by the optimization, to be as good as it was
> before these changes by introducing a fast-pass for this specific case.

I like; ideally we'll get Hyper-V and Xen sorted before the final
version and avoid having to introduce more PV crud and static keys for
that.

The Hyper-V implementation in particular is horrifically ugly, the Xen
one also doesn't win any prizes, esp. that on-stack CPU mask needs to
go.
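
For reference, the usual replacement for an on-stack mask looks roughly like
the sketch below (names invented here, this is not the actual Xen code); with
CONFIG_CPUMASK_OFFSTACK a struct cpumask is too large for the stack, so a
preallocated per-CPU cpumask_var_t is used instead.

	/* Would need alloc_cpumask_var() per CPU at init time when
	 * CONFIG_CPUMASK_OFFSTACK=y. */
	static DEFINE_PER_CPU(cpumask_var_t, flush_cpumask);

	static void prepare_flush_mask(const struct cpumask *cpus)
	{
		struct cpumask *mask = this_cpu_cpumask_var_ptr(flush_cpumask);

		cpumask_and(mask, cpus, cpu_online_mask);
		cpumask_clear_cpu(smp_processor_id(), mask);
	}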

Looking at them, I'm not sure they can actually win anything from using
the new interface, but at least we can avoid making our PV crud uglier
than it has to be.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 5/6] x86/mm/tlb: Flush remote and local TLBs concurrently
  2019-05-27  9:47     ` Peter Zijlstra
@ 2019-05-27 10:21       ` Paolo Bonzini
  2019-05-27 12:32         ` Peter Zijlstra
  2019-05-27 17:49       ` Nadav Amit
  1 sibling, 1 reply; 24+ messages in thread
From: Paolo Bonzini @ 2019-05-27 10:21 UTC (permalink / raw)
  To: Peter Zijlstra, Juergen Gross
  Cc: Nadav Amit, Ingo Molnar, Andy Lutomirski, Borislav Petkov,
	linux-kernel, K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger,
	Sasha Levin, Thomas Gleixner, x86, Dave Hansen, Boris Ostrovsky,
	linux-hyperv, virtualization, kvm, xen-devel

On 27/05/19 11:47, Peter Zijlstra wrote:
> On Sat, May 25, 2019 at 10:54:50AM +0200, Juergen Gross wrote:
>> On 25/05/2019 10:22, Nadav Amit wrote:
> 
>>> diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
>>> index 946f8f1f1efc..3a156e63c57d 100644
>>> --- a/arch/x86/include/asm/paravirt_types.h
>>> +++ b/arch/x86/include/asm/paravirt_types.h
>>> @@ -211,6 +211,12 @@ struct pv_mmu_ops {
>>>  	void (*flush_tlb_user)(void);
>>>  	void (*flush_tlb_kernel)(void);
>>>  	void (*flush_tlb_one_user)(unsigned long addr);
>>> +	/*
>>> +	 * flush_tlb_multi() is the preferred interface. When it is used,
>>> +	 * flush_tlb_others() should return false.
>>
>> This comment does not make sense. flush_tlb_others() return type is
>> void.
> 
> I suspect that is an artifact from before the static_key; an attempt to
> make the pv interface less awkward.
> 
> Something like the below would work for KVM I suspect, the others
> (Hyper-V and Xen are more 'interesting').
> 
> ---
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -580,7 +580,7 @@ static void __init kvm_apf_trap_init(voi
>  
>  static DEFINE_PER_CPU(cpumask_var_t, __pv_tlb_mask);
>  
> -static void kvm_flush_tlb_others(const struct cpumask *cpumask,
> +static void kvm_flush_tlb_multi(const struct cpumask *cpumask,
>  			const struct flush_tlb_info *info)
>  {
>  	u8 state;
> @@ -594,6 +594,9 @@ static void kvm_flush_tlb_others(const s
>  	 * queue flush_on_enter for pre-empted vCPUs
>  	 */
>  	for_each_cpu(cpu, flushmask) {
> +		if (cpu == smp_processor_id())
> +			continue;
> +

Even this would be just an optimization; the vCPU you're running on
cannot be preempted.  You can just change others to multi.

Paolo

>  		src = &per_cpu(steal_time, cpu);
>  		state = READ_ONCE(src->preempted);
>  		if ((state & KVM_VCPU_PREEMPTED)) {
> @@ -603,7 +606,7 @@ static void kvm_flush_tlb_others(const s
>  		}
>  	}
>  
> -	native_flush_tlb_others(flushmask, info);
> +	native_flush_tlb_multi(flushmask, info);
>  }
>  
>  static void __init kvm_guest_init(void)
> @@ -628,9 +631,8 @@ static void __init kvm_guest_init(void)
>  	if (kvm_para_has_feature(KVM_FEATURE_PV_TLB_FLUSH) &&
>  	    !kvm_para_has_hint(KVM_HINTS_REALTIME) &&
>  	    kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
> -		pv_ops.mmu.flush_tlb_others = kvm_flush_tlb_others;
> +		pv_ops.mmu.flush_tlb_multi = kvm_flush_tlb_multi;
>  		pv_ops.mmu.tlb_remove_table = tlb_remove_table;
> -		static_key_disable(&flush_tlb_multi_enabled.key);
>  	}
>  
>  	if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 5/6] x86/mm/tlb: Flush remote and local TLBs concurrently
  2019-05-27 10:21       ` Paolo Bonzini
@ 2019-05-27 12:32         ` Peter Zijlstra
  2019-05-27 12:45           ` Paolo Bonzini
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2019-05-27 12:32 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Juergen Gross, Nadav Amit, Ingo Molnar, Andy Lutomirski,
	Borislav Petkov, linux-kernel, K. Y. Srinivasan, Haiyang Zhang,
	Stephen Hemminger, Sasha Levin, Thomas Gleixner, x86,
	Dave Hansen, Boris Ostrovsky, linux-hyperv, virtualization, kvm,
	xen-devel

On Mon, May 27, 2019 at 12:21:59PM +0200, Paolo Bonzini wrote:
> On 27/05/19 11:47, Peter Zijlstra wrote:

> > --- a/arch/x86/kernel/kvm.c
> > +++ b/arch/x86/kernel/kvm.c
> > @@ -580,7 +580,7 @@ static void __init kvm_apf_trap_init(voi
> >  
> >  static DEFINE_PER_CPU(cpumask_var_t, __pv_tlb_mask);
> >  
> > -static void kvm_flush_tlb_others(const struct cpumask *cpumask,
> > +static void kvm_flush_tlb_multi(const struct cpumask *cpumask,
> >  			const struct flush_tlb_info *info)
> >  {
> >  	u8 state;
> > @@ -594,6 +594,9 @@ static void kvm_flush_tlb_others(const s
> >  	 * queue flush_on_enter for pre-empted vCPUs
> >  	 */
> >  	for_each_cpu(cpu, flushmask) {
> > +		if (cpu == smp_processor_id())
> > +			continue;
> > +
> 
> Even this would be just an optimization; the vCPU you're running on
> cannot be preempted.  You can just change others to multi.

Yeah, I know, but it felt weird so I added the explicit skip. No strong
feelings though.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 5/6] x86/mm/tlb: Flush remote and local TLBs concurrently
  2019-05-27 12:32         ` Peter Zijlstra
@ 2019-05-27 12:45           ` Paolo Bonzini
  0 siblings, 0 replies; 24+ messages in thread
From: Paolo Bonzini @ 2019-05-27 12:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Juergen Gross, Nadav Amit, Ingo Molnar, Andy Lutomirski,
	Borislav Petkov, linux-kernel, K. Y. Srinivasan, Haiyang Zhang,
	Stephen Hemminger, Sasha Levin, Thomas Gleixner, x86,
	Dave Hansen, Boris Ostrovsky, linux-hyperv, virtualization, kvm,
	xen-devel

On 27/05/19 14:32, Peter Zijlstra wrote:
> On Mon, May 27, 2019 at 12:21:59PM +0200, Paolo Bonzini wrote:
>> On 27/05/19 11:47, Peter Zijlstra wrote:
> 
>>> --- a/arch/x86/kernel/kvm.c
>>> +++ b/arch/x86/kernel/kvm.c
>>> @@ -580,7 +580,7 @@ static void __init kvm_apf_trap_init(voi
>>>  
>>>  static DEFINE_PER_CPU(cpumask_var_t, __pv_tlb_mask);
>>>  
>>> -static void kvm_flush_tlb_others(const struct cpumask *cpumask,
>>> +static void kvm_flush_tlb_multi(const struct cpumask *cpumask,
>>>  			const struct flush_tlb_info *info)
>>>  {
>>>  	u8 state;
>>> @@ -594,6 +594,9 @@ static void kvm_flush_tlb_others(const s
>>>  	 * queue flush_on_enter for pre-empted vCPUs
>>>  	 */
>>>  	for_each_cpu(cpu, flushmask) {
>>> +		if (cpu == smp_processor_id())
>>> +			continue;
>>> +
>>
>> Even this would be just an optimization; the vCPU you're running on
>> cannot be preempted.  You can just change others to multi.
> 
> Yeah, I know, but it felt weird so I added the explicit skip. No strong
> feelings though.

Neither here, and it would indeed deserve a comment if you left the if out.
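
Something along these lines, perhaps (only a sketch of the wording, based on
the hunk quoted above):

	/*
	 * No need to skip the CPU we are running on: this vCPU is never
	 * preempted while it executes this code, so its steal_time
	 * ->preempted cannot have KVM_VCPU_PREEMPTED set here and the
	 * loop below leaves it alone anyway.
	 */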

Paolo


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 2/6] cpumask: Purify cpumask_next()
  2019-05-27  8:30   ` Peter Zijlstra
@ 2019-05-27 17:34     ` Nadav Amit
  0 siblings, 0 replies; 24+ messages in thread
From: Nadav Amit @ 2019-05-27 17:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Andy Lutomirski, Borislav Petkov, linux-kernel,
	David S. Miller

> On May 27, 2019, at 1:30 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Sat, May 25, 2019 at 01:21:59AM -0700, Nadav Amit wrote:
>> cpumask_next() has no side-effects. Mark it as pure.
> 
> It would be good to have a few words on why... because apparently you
> found this makes a difference.

I see that in the end it did not make any difference. I saw in the past that
some kernel configurations were affected by it, but I will remove it from
this patch-set in the next iteration.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 3/6] smp: Run functions concurrently in smp_call_function_many()
  2019-05-27  9:15   ` Peter Zijlstra
@ 2019-05-27 17:39     ` Nadav Amit
  0 siblings, 0 replies; 24+ messages in thread
From: Nadav Amit @ 2019-05-27 17:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Andy Lutomirski, Borislav Petkov, LKML,
	Rik van Riel, Thomas Gleixner, Josh Poimboeuf

> On May 27, 2019, at 2:15 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
>> +		/*
>> +		 * Choose the most efficient way to send an IPI. Note that the
>> +		 * number of CPUs might be zero due to concurrent changes to the
>> +		 * provided mask or cpu_online_mask.
>> +		 */
> 
> Since we have preemption disabled here, I don't think the online mask can
> shrink; cpu-offline uses stop_machine().

Right. So I’ll update the comment, but IIUC the provided mask might still
change, so I’ll leave the rest of the comment and the code as is.
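
For reference, the reworked comment could read roughly like this (just a
sketch, not final wording):

		/*
		 * Choose the most efficient way to send an IPI. Note that
		 * nr_cpus can still end up as zero: cpu_online_mask cannot
		 * shrink here (preemption is disabled and CPU offlining
		 * goes through stop_machine()), but the caller-provided
		 * mask may change concurrently.
		 */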

>> +		if (nr_cpus == 1)
>> +			arch_send_call_function_single_ipi(last_cpu);
>> +		else if (likely(nr_cpus > 1))
>> +			arch_send_call_function_ipi_mask(cfd->cpumask_ipi);
>> +	}



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 5/6] x86/mm/tlb: Flush remote and local TLBs concurrently
  2019-05-27  9:47     ` Peter Zijlstra
  2019-05-27 10:21       ` Paolo Bonzini
@ 2019-05-27 17:49       ` Nadav Amit
  1 sibling, 0 replies; 24+ messages in thread
From: Nadav Amit @ 2019-05-27 17:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Juergen Gross, Ingo Molnar, Andy Lutomirski, Borislav Petkov,
	linux-kernel, K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger,
	Sasha Levin, Thomas Gleixner, x86, Paolo Bonzini, Dave Hansen,
	Boris Ostrovsky, linux-hyperv, virtualization, kvm, xen-devel

> On May 27, 2019, at 2:47 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Sat, May 25, 2019 at 10:54:50AM +0200, Juergen Gross wrote:
>> On 25/05/2019 10:22, Nadav Amit wrote:
> 
>>> diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
>>> index 946f8f1f1efc..3a156e63c57d 100644
>>> --- a/arch/x86/include/asm/paravirt_types.h
>>> +++ b/arch/x86/include/asm/paravirt_types.h
>>> @@ -211,6 +211,12 @@ struct pv_mmu_ops {
>>> 	void (*flush_tlb_user)(void);
>>> 	void (*flush_tlb_kernel)(void);
>>> 	void (*flush_tlb_one_user)(unsigned long addr);
>>> +	/*
>>> +	 * flush_tlb_multi() is the preferred interface. When it is used,
>>> +	 * flush_tlb_others() should return false.
>> 
>> This comment does not make sense. flush_tlb_others() return type is
>> void.
> 
> I suspect that is an artifact from before the static_key; an attempt to
> make the pv interface less awkward.

Yes, those are leftovers that should have been removed - I will drop them in
the next version.

> Something like the below would work for KVM I suspect, the others
> (Hyper-V and Xen are more 'interesting').
> 
> ---
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -580,7 +580,7 @@ static void __init kvm_apf_trap_init(voi
> 
> static DEFINE_PER_CPU(cpumask_var_t, __pv_tlb_mask);
> 
> -static void kvm_flush_tlb_others(const struct cpumask *cpumask,
> +static void kvm_flush_tlb_multi(const struct cpumask *cpumask,
> 			const struct flush_tlb_info *info)
> {
> 	u8 state;
> @@ -594,6 +594,9 @@ static void kvm_flush_tlb_others(const s
> 	 * queue flush_on_enter for pre-empted vCPUs
> 	 */
> 	for_each_cpu(cpu, flushmask) {
> +		if (cpu == smp_processor_id())
> +			continue;
> +
> 		src = &per_cpu(steal_time, cpu);
> 		state = READ_ONCE(src->preempted);
> 		if ((state & KVM_VCPU_PREEMPTED)) {
> @@ -603,7 +606,7 @@ static void kvm_flush_tlb_others(const s
> 		}
> 	}
> 
> -	native_flush_tlb_others(flushmask, info);
> +	native_flush_tlb_multi(flushmask, info);
> }
> 
> static void __init kvm_guest_init(void)
> @@ -628,9 +631,8 @@ static void __init kvm_guest_init(void)
> 	if (kvm_para_has_feature(KVM_FEATURE_PV_TLB_FLUSH) &&
> 	    !kvm_para_has_hint(KVM_HINTS_REALTIME) &&
> 	    kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
> -		pv_ops.mmu.flush_tlb_others = kvm_flush_tlb_others;
> +		pv_ops.mmu.flush_tlb_multi = kvm_flush_tlb_multi;
> 		pv_ops.mmu.tlb_remove_table = tlb_remove_table;
> -		static_key_disable(&flush_tlb_multi_enabled.key);
> 	}
> 
> 	if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))

That’s what I have as well ;-).

As you mentioned (in another email), the hyper-v code specifically seems
convoluted to me. In general, I prefer not to touch KVM/Xen/hyper-v, but
since you are twisting my arm, I will send a compile-tested version for Xen
and hyper-v.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 4/6] x86/mm/tlb: Refactor common code into flush_tlb_on_cpus()
  2019-05-27  9:24   ` Peter Zijlstra
@ 2019-05-27 18:59     ` Nadav Amit
  2019-05-27 19:14       ` Peter Zijlstra
  0 siblings, 1 reply; 24+ messages in thread
From: Nadav Amit @ 2019-05-27 18:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Andy Lutomirski, Borislav Petkov, linux-kernel,
	Dave Hansen, Thomas Gleixner, H. Peter Anvin, x86

> On May 27, 2019, at 2:24 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Sat, May 25, 2019 at 01:22:01AM -0700, Nadav Amit wrote:
> 
>> There is one functional change, which should not affect correctness:
>> flush_tlb_mm_range compared loaded_mm and the mm to figure out if local
>> flush is needed. Instead, the common code would look at the mm_cpumask()
>> which should give the same result.
> 
>> @@ -786,18 +804,9 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
>> 	info = get_flush_tlb_info(mm, start, end, stride_shift, freed_tables,
>> 				  new_tlb_gen);
>> 
>> -	if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
>> -		lockdep_assert_irqs_enabled();
>> -		local_irq_disable();
>> -		flush_tlb_func_local(info, TLB_LOCAL_MM_SHOOTDOWN);
>> -		local_irq_enable();
>> -	}
>> -
>> -	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
>> -		flush_tlb_others(mm_cpumask(mm), info);
> 
> So if we want to double check that; we'd add:
> 
> 	WARN_ON_ONCE(cpumask_test_cpu(smp_processor_id(), mm_cpumask(mm)) ==
> 		     (mm == this_cpu_read(cpu_tlbstate.loaded_mm)));
> 
> right?

Yes, except the condition should be inverted (“!=“ instead of “==“), and I
would prefer to use VM_WARN_ON_ONCE().

Unfortunately, this condition does fire when copy_init_mm() calls dup_mm().
I don’t think there is a correctness issue, and I am tempted to just check,
before warning, that (mm != init_mm).

What do you say?
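
Concretely, combining both suggestions, the check might look something like
this (only a sketch):

	/*
	 * Outside of init_mm, being present in mm_cpumask(mm) and having
	 * mm loaded should imply one another; warn on debug builds if
	 * they ever disagree.
	 */
	VM_WARN_ON_ONCE(mm != &init_mm &&
			cpumask_test_cpu(smp_processor_id(), mm_cpumask(mm)) !=
			(mm == this_cpu_read(cpu_tlbstate.loaded_mm)));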

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 4/6] x86/mm/tlb: Refactor common code into flush_tlb_on_cpus()
  2019-05-27 18:59     ` Nadav Amit
@ 2019-05-27 19:14       ` Peter Zijlstra
  0 siblings, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2019-05-27 19:14 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Ingo Molnar, Andy Lutomirski, Borislav Petkov, linux-kernel,
	Dave Hansen, Thomas Gleixner, H. Peter Anvin, x86

On Mon, May 27, 2019 at 06:59:01PM +0000, Nadav Amit wrote:
> > On May 27, 2019, at 2:24 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > On Sat, May 25, 2019 at 01:22:01AM -0700, Nadav Amit wrote:
> > 
> >> There is one functional change, which should not affect correctness:
> >> flush_tlb_mm_range compared loaded_mm and the mm to figure out if local
> >> flush is needed. Instead, the common code would look at the mm_cpumask()
> >> which should give the same result.
> > 
> >> @@ -786,18 +804,9 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
> >> 	info = get_flush_tlb_info(mm, start, end, stride_shift, freed_tables,
> >> 				  new_tlb_gen);
> >> 
> >> -	if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
> >> -		lockdep_assert_irqs_enabled();
> >> -		local_irq_disable();
> >> -		flush_tlb_func_local(info, TLB_LOCAL_MM_SHOOTDOWN);
> >> -		local_irq_enable();
> >> -	}
> >> -
> >> -	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
> >> -		flush_tlb_others(mm_cpumask(mm), info);
> > 
> > So if we want to double check that; we'd add:
> > 
> > 	WARN_ON_ONCE(cpumask_test_cpu(smp_processor_id(), mm_cpumask(mm)) ==
> > 		     (mm == this_cpu_read(cpu_tlbstate.loaded_mm)));
> > 
> > right?
> 
> Yes, except the condition should be inverted (“!=“ instead of “==“), and I
> would prefer to use VM_WARN_ON_ONCE().
> 
> Unfortunately, this condition does fire when copy_init_mm() calls dup_mm().
> I don’t think there is a correctness issue, and I am tempted just check,
> before warning, that (mm != init_mm) .
> 
> What do you say?

Works for me.

^ permalink raw reply	[flat|nested] 24+ messages in thread

Thread overview: 24+ messages
2019-05-25  8:21 [RFC PATCH 0/6] x86/mm: Flush remote and local TLBs concurrently Nadav Amit
2019-05-25  8:21 ` [RFC PATCH 1/6] smp: Remove smp_call_function() and on_each_cpu() return values Nadav Amit
2019-05-25  8:21 ` [RFC PATCH 2/6] cpumask: Purify cpumask_next() Nadav Amit
2019-05-25  8:32   ` Ingo Molnar
2019-05-27  8:30   ` Peter Zijlstra
2019-05-27 17:34     ` Nadav Amit
2019-05-25  8:22 ` [RFC PATCH 3/6] smp: Run functions concurrently in smp_call_function_many() Nadav Amit
2019-05-27  9:15   ` Peter Zijlstra
2019-05-27 17:39     ` Nadav Amit
2019-05-25  8:22 ` [RFC PATCH 4/6] x86/mm/tlb: Refactor common code into flush_tlb_on_cpus() Nadav Amit
2019-05-27  9:24   ` Peter Zijlstra
2019-05-27 18:59     ` Nadav Amit
2019-05-27 19:14       ` Peter Zijlstra
2019-05-25  8:22 ` [RFC PATCH 5/6] x86/mm/tlb: Flush remote and local TLBs concurrently Nadav Amit
2019-05-25  8:38   ` Nadav Amit
2019-05-25  8:54   ` Juergen Gross
2019-05-27  9:47     ` Peter Zijlstra
2019-05-27 10:21       ` Paolo Bonzini
2019-05-27 12:32         ` Peter Zijlstra
2019-05-27 12:45           ` Paolo Bonzini
2019-05-27 17:49       ` Nadav Amit
2019-05-25  8:22 ` [RFC PATCH 6/6] x86/mm/tlb: Optimize local TLB flushes Nadav Amit
2019-05-27  8:28 ` [RFC PATCH 0/6] x86/mm: Flush remote and local TLBs concurrently Peter Zijlstra
2019-05-27  9:59 ` Peter Zijlstra
