linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v3 0/9] x86: Concurrent TLB flushes
@ 2019-07-19  0:58 Nadav Amit
  2019-07-19  0:58 ` [PATCH v3 1/9] smp: Run functions concurrently in smp_call_function_many() Nadav Amit
                   ` (9 more replies)
  0 siblings, 10 replies; 45+ messages in thread
From: Nadav Amit @ 2019-07-19  0:58 UTC (permalink / raw)
  To: Andy Lutomirski, Dave Hansen
  Cc: x86, linux-kernel, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Nadav Amit, Borislav Petkov, Boris Ostrovsky, Haiyang Zhang,
	Josh Poimboeuf, Juergen Gross, K. Y. Srinivasan, Paolo Bonzini,
	Rik van Riel, Sasha Levin, Stephen Hemminger, kvm, linux-hyperv,
	virtualization, xen-devel

[ Cover-letter is identical to v2, including benchmark results,
  excluding the change log. ] 

Currently, local and remote TLB flushes are not performed concurrently,
which introduces unnecessary overhead - each INVLPG can take 100s of
cycles. This patch-set allows TLB flushes to be run concurrently: first
request the remote CPUs to initiate the flush, then run it locally, and
finally wait for the remote CPUs to finish their work.

In addition, there are various small optimizations to avoid unwarranted
false-sharing and atomic operations.

The proposed changes should also improve the performance of other
invocations of on_each_cpu(). Hopefully, no one has relied on this
behavior of on_each_cpu() that invoked functions first remotely and only
then locally [Peter says he remembers someone might do so, but without
further information it is hard to know how to address it].

Running sysbench on dax/ext4 w/emulated-pmem, write-cache disabled on
2-socket, 48-logical-cores (24+SMT) Haswell-X, 5 repetitions:

 sysbench fileio --file-total-size=3G --file-test-mode=rndwr \
  --file-io-mode=mmap --threads=X --file-fsync-mode=fdatasync run

  Th.   tip-jun28 avg (stdev)   +patch-set avg (stdev)  change
  ---   ---------------------   ----------------------  ------
  1     1267765 (14146)         1299253 (5715)          +2.4%
  2     1734644 (11936)         1799225 (19577)         +3.7%
  4     2821268 (41184)         2919132 (40149)         +3.4%
  8     4171652 (31243)         4376925 (65416)         +4.9%
  16    5590729 (24160)         5829866 (8127)          +4.2%
  24    6250212 (24481)         6522303 (28044)         +4.3%
  32    3994314 (26606)         4077543 (10685)         +2.0%
  48    4345177 (28091)         4417821 (41337)         +1.6%

(Note that on configurations with up to 24 threads numactl was used to
set all threads on socket 1, which explains the drop in performance when
going to 32 threads).

Running the same benchmark with security mitigations disabled (PTI,
Spectre, MDS):

  Th.   tip-jun28 avg (stdev)   +patch-set avg (stdev)  change
  ---   ---------------------   ----------------------  ------
  1     1598896 (5174)          1607903 (4091)          +0.5%
  2     2109472 (17827)         2224726 (4372)          +5.4%
  4     3448587 (11952)         3668551 (30219)         +6.3%
  8     5425778 (29641)         5606266 (33519)         +3.3%
  16    6931232 (34677)         7054052 (27873)         +1.7%
  24    7612473 (23482)         7783138 (13871)         +2.2%
  32    4296274 (18029)         4283279 (32323)         -0.3%
  48    4770029 (35541)         4764760 (13575)         -0.1%

Presumably, PTI requires two invalidations of each mapping, which allows
to get higher benefits from concurrency when PTI is on. At the same
time, when mitigations are on, other overheads reduce the potential
speedup.

I tried to reduce the size of the code of the main patch, which required
restructuring of the series.

v2 -> v3:
* Open-code the remote/local-flush decision code [Andy]
* Fix hyper-v, Xen implementations [Andrew]
* Fix redundant TLB flushes.

v1 -> v2:
* Removing the patches that Thomas took [tglx]
* Adding hyper-v, Xen compile-tested implementations [Dave]
* Removing UV [Andy]
* Adding lazy optimization, removing inline keyword [Dave]
* Restructuring patch-set

RFCv2 -> v1:
* Fix comment on flush_tlb_multi [Juergen]
* Removing async invalidation optimizations [Andy]
* Adding KVM support [Paolo]

Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kvm@vger.kernel.org
Cc: linux-hyperv@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Cc: x86@kernel.org
Cc: xen-devel@lists.xenproject.org

Nadav Amit (9):
  smp: Run functions concurrently in smp_call_function_many()
  x86/mm/tlb: Remove reason as argument for flush_tlb_func_local()
  x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy()
  x86/mm/tlb: Flush remote and local TLBs concurrently
  x86/mm/tlb: Privatize cpu_tlbstate
  x86/mm/tlb: Do not make is_lazy dirty for no reason
  cpumask: Mark functions as pure
  x86/mm/tlb: Remove UV special case
  x86/mm/tlb: Remove unnecessary uses of the inline keyword

 arch/x86/hyperv/mmu.c                 |  10 +-
 arch/x86/include/asm/paravirt.h       |   6 +-
 arch/x86/include/asm/paravirt_types.h |   4 +-
 arch/x86/include/asm/tlbflush.h       |  47 ++++-----
 arch/x86/include/asm/trace/hyperv.h   |   2 +-
 arch/x86/kernel/kvm.c                 |  11 ++-
 arch/x86/kernel/paravirt.c            |   2 +-
 arch/x86/mm/init.c                    |   2 +-
 arch/x86/mm/tlb.c                     | 133 ++++++++++++++++----------
 arch/x86/xen/mmu_pv.c                 |  11 +--
 include/linux/cpumask.h               |   6 +-
 include/linux/smp.h                   |  27 ++++--
 include/trace/events/xen.h            |   2 +-
 kernel/smp.c                          | 133 ++++++++++++--------------
 14 files changed, 218 insertions(+), 178 deletions(-)

-- 
2.20.1


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v3 1/9] smp: Run functions concurrently in smp_call_function_many()
  2019-07-19  0:58 [PATCH v3 0/9] x86: Concurrent TLB flushes Nadav Amit
@ 2019-07-19  0:58 ` Nadav Amit
  2019-07-19 18:23   ` Dave Hansen
  2019-07-22 18:21   ` Peter Zijlstra
  2019-07-19  0:58 ` [PATCH v3 2/9] x86/mm/tlb: Remove reason as argument for flush_tlb_func_local() Nadav Amit
                   ` (8 subsequent siblings)
  9 siblings, 2 replies; 45+ messages in thread
From: Nadav Amit @ 2019-07-19  0:58 UTC (permalink / raw)
  To: Andy Lutomirski, Dave Hansen
  Cc: x86, linux-kernel, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Nadav Amit, Rik van Riel, Josh Poimboeuf

Currently, on_each_cpu() and similar functions do not exploit the
potential of concurrency: the function is first executed remotely and
only then it is executed locally. Functions such as TLB flush can take
considerable time, so this provides an opportunity for performance
optimization.

To do so, introduce __smp_call_function_many(), which allows the callers
to provide local and remote functions that should be executed, and run
them concurrently. Keep smp_call_function_many() semantic as it is today
for backward compatibility: the called function is not executed in this
case locally.

__smp_call_function_many() does not use the optimized version for a
single remote target that smp_call_function_single() implements. For
synchronous function call, smp_call_function_single() keeps a
call_single_data (which is used for synchronization) on the stack.
Interestingly, it seems that not using this optimization provides
greater performance improvements (greater speedup with a single remote
target than with multiple ones). Presumably, holding data structures
that are intended for synchronization on the stack can introduce
overheads due to TLB misses and false-sharing when the stack is used for
other purposes.

Adding support to run the functions concurrently required to remove a
micro-optimization in on_each_cpu() that disabled/enabled IRQs instead
of saving/restoring them. The benefit of running the local and remote
code concurrently is expected to be greater.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 include/linux/smp.h |  27 ++++++---
 kernel/smp.c        | 133 +++++++++++++++++++++-----------------------
 2 files changed, 83 insertions(+), 77 deletions(-)

diff --git a/include/linux/smp.h b/include/linux/smp.h
index 6fc856c9eda5..c31c7cde0f77 100644
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -32,11 +32,6 @@ extern unsigned int total_cpus;
 int smp_call_function_single(int cpuid, smp_call_func_t func, void *info,
 			     int wait);
 
-/*
- * Call a function on all processors
- */
-void on_each_cpu(smp_call_func_t func, void *info, int wait);
-
 /*
  * Call a function on processors specified by mask, which might include
  * the local one.
@@ -44,6 +39,15 @@ void on_each_cpu(smp_call_func_t func, void *info, int wait);
 void on_each_cpu_mask(const struct cpumask *mask, smp_call_func_t func,
 		void *info, bool wait);
 
+/*
+ * Call a function on all processors.  May be used during early boot while
+ * early_boot_irqs_disabled is set.
+ */
+static inline void on_each_cpu(smp_call_func_t func, void *info, int wait)
+{
+	on_each_cpu_mask(cpu_online_mask, func, info, wait);
+}
+
 /*
  * Call a function on each processor for which the supplied function
  * cond_func returns a positive value. This may include the local
@@ -102,8 +106,17 @@ extern void smp_cpus_done(unsigned int max_cpus);
  * Call a function on all other processors
  */
 void smp_call_function(smp_call_func_t func, void *info, int wait);
-void smp_call_function_many(const struct cpumask *mask,
-			    smp_call_func_t func, void *info, bool wait);
+
+void __smp_call_function_many(const struct cpumask *mask,
+			      smp_call_func_t remote_func,
+			      smp_call_func_t local_func,
+			      void *info, bool wait);
+
+static inline void smp_call_function_many(const struct cpumask *mask,
+				smp_call_func_t func, void *info, bool wait)
+{
+	__smp_call_function_many(mask, func, NULL, info, wait);
+}
 
 int smp_call_function_any(const struct cpumask *mask,
 			  smp_call_func_t func, void *info, int wait);
diff --git a/kernel/smp.c b/kernel/smp.c
index 616d4d114847..d5d8dee8e3f3 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -388,9 +388,13 @@ int smp_call_function_any(const struct cpumask *mask,
 EXPORT_SYMBOL_GPL(smp_call_function_any);
 
 /**
- * smp_call_function_many(): Run a function on a set of other CPUs.
+ * __smp_call_function_many(): Run a function on a set of CPUs.
  * @mask: The set of cpus to run on (only runs on online subset).
- * @func: The function to run. This must be fast and non-blocking.
+ * @remote_func: The function to run on remote cores. This must be fast and
+ *		 non-blocking.
+ * @local_func: The function that should be run on this CPU. This must be
+ *		fast and non-blocking. If NULL is provided, no function will
+ *		be executed on this CPU.
  * @info: An arbitrary pointer to pass to the function.
  * @wait: If true, wait (atomically) until function has completed
  *        on other CPUs.
@@ -401,11 +405,16 @@ EXPORT_SYMBOL_GPL(smp_call_function_any);
  * hardware interrupt handler or from a bottom half handler. Preemption
  * must be disabled when calling this function.
  */
-void smp_call_function_many(const struct cpumask *mask,
-			    smp_call_func_t func, void *info, bool wait)
+void __smp_call_function_many(const struct cpumask *mask,
+			      smp_call_func_t remote_func,
+			      smp_call_func_t local_func,
+			      void *info, bool wait)
 {
+	int cpu, last_cpu, this_cpu = smp_processor_id();
 	struct call_function_data *cfd;
-	int cpu, next_cpu, this_cpu = smp_processor_id();
+	bool run_remote = false;
+	bool run_local = false;
+	int nr_cpus = 0;
 
 	/*
 	 * Can deadlock when called with interrupts disabled.
@@ -413,55 +422,62 @@ void smp_call_function_many(const struct cpumask *mask,
 	 * send smp call function interrupt to this cpu and as such deadlocks
 	 * can't happen.
 	 */
-	WARN_ON_ONCE(cpu_online(this_cpu) && irqs_disabled()
-		     && !oops_in_progress && !early_boot_irqs_disabled);
+	if (cpu_online(this_cpu) && !oops_in_progress && !early_boot_irqs_disabled)
+		lockdep_assert_irqs_enabled();
+
+	/* Check if we need local execution. */
+	if (local_func && cpumask_test_cpu(this_cpu, mask))
+		run_local = true;
 
-	/* Try to fastpath.  So, what's a CPU they want? Ignoring this one. */
+	/* Check if we need remote execution, i.e., any CPU excluding this one. */
 	cpu = cpumask_first_and(mask, cpu_online_mask);
 	if (cpu == this_cpu)
 		cpu = cpumask_next_and(cpu, mask, cpu_online_mask);
+	if (cpu < nr_cpu_ids)
+		run_remote = true;
 
-	/* No online cpus?  We're done. */
-	if (cpu >= nr_cpu_ids)
-		return;
+	if (run_remote) {
+		cfd = this_cpu_ptr(&cfd_data);
 
-	/* Do we have another CPU which isn't us? */
-	next_cpu = cpumask_next_and(cpu, mask, cpu_online_mask);
-	if (next_cpu == this_cpu)
-		next_cpu = cpumask_next_and(next_cpu, mask, cpu_online_mask);
-
-	/* Fastpath: do that cpu by itself. */
-	if (next_cpu >= nr_cpu_ids) {
-		smp_call_function_single(cpu, func, info, wait);
-		return;
-	}
+		cpumask_and(cfd->cpumask, mask, cpu_online_mask);
+		__cpumask_clear_cpu(this_cpu, cfd->cpumask);
 
-	cfd = this_cpu_ptr(&cfd_data);
-
-	cpumask_and(cfd->cpumask, mask, cpu_online_mask);
-	__cpumask_clear_cpu(this_cpu, cfd->cpumask);
+		cpumask_clear(cfd->cpumask_ipi);
+		for_each_cpu(cpu, cfd->cpumask) {
+			call_single_data_t *csd = per_cpu_ptr(cfd->csd, cpu);
+
+			nr_cpus++;
+			last_cpu = cpu;
+
+			csd_lock(csd);
+			if (wait)
+				csd->flags |= CSD_FLAG_SYNCHRONOUS;
+			csd->func = remote_func;
+			csd->info = info;
+			if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
+				__cpumask_set_cpu(cpu, cfd->cpumask_ipi);
+		}
 
-	/* Some callers race with other cpus changing the passed mask */
-	if (unlikely(!cpumask_weight(cfd->cpumask)))
-		return;
+		/*
+		 * Choose the most efficient way to send an IPI. Note that the
+		 * number of CPUs might be zero due to concurrent changes to the
+		 * provided mask.
+		 */
+		if (nr_cpus == 1)
+			arch_send_call_function_single_ipi(last_cpu);
+		else if (likely(nr_cpus > 1))
+			arch_send_call_function_ipi_mask(cfd->cpumask_ipi);
+	}
 
-	cpumask_clear(cfd->cpumask_ipi);
-	for_each_cpu(cpu, cfd->cpumask) {
-		call_single_data_t *csd = per_cpu_ptr(cfd->csd, cpu);
+	if (run_local) {
+		unsigned long flags;
 
-		csd_lock(csd);
-		if (wait)
-			csd->flags |= CSD_FLAG_SYNCHRONOUS;
-		csd->func = func;
-		csd->info = info;
-		if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
-			__cpumask_set_cpu(cpu, cfd->cpumask_ipi);
+		local_irq_save(flags);
+		local_func(info);
+		local_irq_restore(flags);
 	}
 
-	/* Send a message to all CPUs in the map */
-	arch_send_call_function_ipi_mask(cfd->cpumask_ipi);
-
-	if (wait) {
+	if (run_remote && wait) {
 		for_each_cpu(cpu, cfd->cpumask) {
 			call_single_data_t *csd;
 
@@ -470,7 +486,7 @@ void smp_call_function_many(const struct cpumask *mask,
 		}
 	}
 }
-EXPORT_SYMBOL(smp_call_function_many);
+EXPORT_SYMBOL(__smp_call_function_many);
 
 /**
  * smp_call_function(): Run a function on all other CPUs.
@@ -587,24 +603,6 @@ void __init smp_init(void)
 	smp_cpus_done(setup_max_cpus);
 }
 
-/*
- * Call a function on all processors.  May be used during early boot while
- * early_boot_irqs_disabled is set.  Use local_irq_save/restore() instead
- * of local_irq_disable/enable().
- */
-void on_each_cpu(void (*func) (void *info), void *info, int wait)
-{
-	unsigned long flags;
-
-	preempt_disable();
-	smp_call_function(func, info, wait);
-	local_irq_save(flags);
-	func(info);
-	local_irq_restore(flags);
-	preempt_enable();
-}
-EXPORT_SYMBOL(on_each_cpu);
-
 /**
  * on_each_cpu_mask(): Run a function on processors specified by
  * cpumask, which may include the local processor.
@@ -624,16 +622,11 @@ EXPORT_SYMBOL(on_each_cpu);
 void on_each_cpu_mask(const struct cpumask *mask, smp_call_func_t func,
 			void *info, bool wait)
 {
-	int cpu = get_cpu();
+	preempt_disable();
 
-	smp_call_function_many(mask, func, info, wait);
-	if (cpumask_test_cpu(cpu, mask)) {
-		unsigned long flags;
-		local_irq_save(flags);
-		func(info);
-		local_irq_restore(flags);
-	}
-	put_cpu();
+	__smp_call_function_many(mask, func, func, info, wait);
+
+	preempt_enable();
 }
 EXPORT_SYMBOL(on_each_cpu_mask);
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v3 2/9] x86/mm/tlb: Remove reason as argument for flush_tlb_func_local()
  2019-07-19  0:58 [PATCH v3 0/9] x86: Concurrent TLB flushes Nadav Amit
  2019-07-19  0:58 ` [PATCH v3 1/9] smp: Run functions concurrently in smp_call_function_many() Nadav Amit
@ 2019-07-19  0:58 ` Nadav Amit
  2019-07-19  0:58 ` [PATCH v3 3/9] x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy() Nadav Amit
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 45+ messages in thread
From: Nadav Amit @ 2019-07-19  0:58 UTC (permalink / raw)
  To: Andy Lutomirski, Dave Hansen
  Cc: x86, linux-kernel, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Nadav Amit, Rik van Riel, Josh Poimboeuf

To use flush_tlb_func_local() as an argument to
__smp_call_function_many() we need it to have a single (void *)
parameter. Eliminate the second parameter and deduce the reason for the
flush.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/mm/tlb.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 4de9704c4aaf..233f3d8308db 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -635,9 +635,12 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
 	this_cpu_write(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen, mm_tlb_gen);
 }
 
-static void flush_tlb_func_local(const void *info, enum tlb_flush_reason reason)
+static void flush_tlb_func_local(void *info)
 {
 	const struct flush_tlb_info *f = info;
+	enum tlb_flush_reason reason;
+
+	reason = (f->mm == NULL) ? TLB_LOCAL_SHOOTDOWN : TLB_LOCAL_MM_SHOOTDOWN;
 
 	flush_tlb_func_common(f, true, reason);
 }
@@ -790,7 +793,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 	if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
 		lockdep_assert_irqs_enabled();
 		local_irq_disable();
-		flush_tlb_func_local(info, TLB_LOCAL_MM_SHOOTDOWN);
+		flush_tlb_func_local(info);
 		local_irq_enable();
 	}
 
@@ -862,7 +865,7 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 	if (cpumask_test_cpu(cpu, &batch->cpumask)) {
 		lockdep_assert_irqs_enabled();
 		local_irq_disable();
-		flush_tlb_func_local(&full_flush_tlb_info, TLB_LOCAL_SHOOTDOWN);
+		flush_tlb_func_local(&full_flush_tlb_info);
 		local_irq_enable();
 	}
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v3 3/9] x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy()
  2019-07-19  0:58 [PATCH v3 0/9] x86: Concurrent TLB flushes Nadav Amit
  2019-07-19  0:58 ` [PATCH v3 1/9] smp: Run functions concurrently in smp_call_function_many() Nadav Amit
  2019-07-19  0:58 ` [PATCH v3 2/9] x86/mm/tlb: Remove reason as argument for flush_tlb_func_local() Nadav Amit
@ 2019-07-19  0:58 ` Nadav Amit
  2019-07-19 18:36   ` Dave Hansen
                     ` (2 more replies)
  2019-07-19  0:58 ` [PATCH v3 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently Nadav Amit
                   ` (6 subsequent siblings)
  9 siblings, 3 replies; 45+ messages in thread
From: Nadav Amit @ 2019-07-19  0:58 UTC (permalink / raw)
  To: Andy Lutomirski, Dave Hansen
  Cc: x86, linux-kernel, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Nadav Amit, Rik van Riel, Josh Poimboeuf

Open-code on_each_cpu_cond_mask() in native_flush_tlb_others() to
optimize the code. Open-coding eliminates the need for the indirect branch
that is used to call is_lazy(), and in CPUs that are vulnerable to
Spectre v2, it eliminates the retpoline. In addition, it allows to use a
preallocated cpumask to compute the CPUs that should be.

This would later allow us not to adapt on_each_cpu_cond_mask() to
support local and remote functions.

Note that calling tlb_is_not_lazy() for every CPU that needs to be
flushed, as done in native_flush_tlb_multi() might look ugly, but it is
equivalent to what is currently done in on_each_cpu_cond_mask().
Actually, native_flush_tlb_multi() does it more efficiently since it
avoids using an indirect branch for the matter.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/mm/tlb.c | 40 ++++++++++++++++++++++++++++++++++------
 1 file changed, 34 insertions(+), 6 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 233f3d8308db..abbf55fa8b81 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -658,11 +658,13 @@ static void flush_tlb_func_remote(void *info)
 	flush_tlb_func_common(f, false, TLB_REMOTE_SHOOTDOWN);
 }
 
-static bool tlb_is_not_lazy(int cpu, void *data)
+static bool tlb_is_not_lazy(int cpu)
 {
 	return !per_cpu(cpu_tlbstate.is_lazy, cpu);
 }
 
+static DEFINE_PER_CPU(cpumask_t, flush_tlb_mask);
+
 void native_flush_tlb_others(const struct cpumask *cpumask,
 			     const struct flush_tlb_info *info)
 {
@@ -706,12 +708,38 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 	 * up on the new contents of what used to be page tables, while
 	 * doing a speculative memory access.
 	 */
-	if (info->freed_tables)
+	if (info->freed_tables) {
 		smp_call_function_many(cpumask, flush_tlb_func_remote,
 			       (void *)info, 1);
-	else
-		on_each_cpu_cond_mask(tlb_is_not_lazy, flush_tlb_func_remote,
-				(void *)info, 1, GFP_ATOMIC, cpumask);
+	} else {
+		/*
+		 * Although we could have used on_each_cpu_cond_mask(),
+		 * open-coding it has performance advantages, as it eliminates
+		 * the need for indirect calls or retpolines. In addition, it
+		 * allows to use a designated cpumask for evaluating the
+		 * condition, instead of allocating one.
+		 *
+		 * This code works under the assumption that there are no nested
+		 * TLB flushes, an assumption that is already made in
+		 * flush_tlb_mm_range().
+		 *
+		 * cond_cpumask is logically a stack-local variable, but it is
+		 * more efficient to have it off the stack and not to allocate
+		 * it on demand. Preemption is disabled and this code is
+		 * non-reentrant.
+		 */
+		struct cpumask *cond_cpumask = this_cpu_ptr(&flush_tlb_mask);
+		int cpu;
+
+		cpumask_clear(cond_cpumask);
+
+		for_each_cpu(cpu, cpumask) {
+			if (tlb_is_not_lazy(cpu))
+				__cpumask_set_cpu(cpu, cond_cpumask);
+		}
+		smp_call_function_many(cond_cpumask, flush_tlb_func_remote,
+					 (void *)info, 1);
+	}
 }
 
 /*
@@ -865,7 +893,7 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 	if (cpumask_test_cpu(cpu, &batch->cpumask)) {
 		lockdep_assert_irqs_enabled();
 		local_irq_disable();
-		flush_tlb_func_local(&full_flush_tlb_info);
+		flush_tlb_func_local((void *)&full_flush_tlb_info);
 		local_irq_enable();
 	}
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v3 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently
  2019-07-19  0:58 [PATCH v3 0/9] x86: Concurrent TLB flushes Nadav Amit
                   ` (2 preceding siblings ...)
  2019-07-19  0:58 ` [PATCH v3 3/9] x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy() Nadav Amit
@ 2019-07-19  0:58 ` Nadav Amit
  2019-07-22 19:14   ` Peter Zijlstra
                     ` (2 more replies)
  2019-07-19  0:58 ` [PATCH v3 5/9] x86/mm/tlb: Privatize cpu_tlbstate Nadav Amit
                   ` (5 subsequent siblings)
  9 siblings, 3 replies; 45+ messages in thread
From: Nadav Amit @ 2019-07-19  0:58 UTC (permalink / raw)
  To: Andy Lutomirski, Dave Hansen
  Cc: x86, linux-kernel, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Nadav Amit, K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger,
	Sasha Levin, Borislav Petkov, Juergen Gross, Paolo Bonzini,
	Boris Ostrovsky, linux-hyperv, virtualization, kvm, xen-devel

To improve TLB shootdown performance, flush the remote and local TLBs
concurrently. Introduce flush_tlb_multi() that does so. Introduce
paravirtual versions of flush_tlb_multi() for KVM, Xen and hyper-v (Xen
and hyper-v are only compile-tested).

While the updated smp infrastructure is capable of running a function on
a single local core, it is not optimized for this case. The multiple
function calls and the indirect branch introduce some overhead, and
might make local TLB flushes slower than they were before the recent
changes.

Before calling the SMP infrastructure, check if only a local TLB flush
is needed to restore the lost performance in this common case. This
requires to check mm_cpumask() one more time, but unless this mask is
updated very frequently, this should impact performance negatively.

Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: x86@kernel.org
Cc: Juergen Gross <jgross@suse.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: linux-hyperv@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Cc: kvm@vger.kernel.org
Cc: xen-devel@lists.xenproject.org
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/hyperv/mmu.c                 | 10 +++---
 arch/x86/include/asm/paravirt.h       |  6 ++--
 arch/x86/include/asm/paravirt_types.h |  4 +--
 arch/x86/include/asm/tlbflush.h       |  8 ++---
 arch/x86/include/asm/trace/hyperv.h   |  2 +-
 arch/x86/kernel/kvm.c                 | 11 +++++--
 arch/x86/kernel/paravirt.c            |  2 +-
 arch/x86/mm/tlb.c                     | 47 ++++++++++++++++++---------
 arch/x86/xen/mmu_pv.c                 | 11 +++----
 include/trace/events/xen.h            |  2 +-
 10 files changed, 62 insertions(+), 41 deletions(-)

diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
index e65d7fe6489f..8740d8b21db3 100644
--- a/arch/x86/hyperv/mmu.c
+++ b/arch/x86/hyperv/mmu.c
@@ -50,8 +50,8 @@ static inline int fill_gva_list(u64 gva_list[], int offset,
 	return gva_n - offset;
 }
 
-static void hyperv_flush_tlb_others(const struct cpumask *cpus,
-				    const struct flush_tlb_info *info)
+static void hyperv_flush_tlb_multi(const struct cpumask *cpus,
+				   const struct flush_tlb_info *info)
 {
 	int cpu, vcpu, gva_n, max_gvas;
 	struct hv_tlb_flush **flush_pcpu;
@@ -59,7 +59,7 @@ static void hyperv_flush_tlb_others(const struct cpumask *cpus,
 	u64 status = U64_MAX;
 	unsigned long flags;
 
-	trace_hyperv_mmu_flush_tlb_others(cpus, info);
+	trace_hyperv_mmu_flush_tlb_multi(cpus, info);
 
 	if (!hv_hypercall_pg)
 		goto do_native;
@@ -156,7 +156,7 @@ static void hyperv_flush_tlb_others(const struct cpumask *cpus,
 	if (!(status & HV_HYPERCALL_RESULT_MASK))
 		return;
 do_native:
-	native_flush_tlb_others(cpus, info);
+	native_flush_tlb_multi(cpus, info);
 }
 
 static u64 hyperv_flush_tlb_others_ex(const struct cpumask *cpus,
@@ -231,6 +231,6 @@ void hyperv_setup_mmu_ops(void)
 		return;
 
 	pr_info("Using hypercall for remote TLB flush\n");
-	pv_ops.mmu.flush_tlb_others = hyperv_flush_tlb_others;
+	pv_ops.mmu.flush_tlb_multi = hyperv_flush_tlb_multi;
 	pv_ops.mmu.tlb_remove_table = tlb_remove_table;
 }
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index dce26f1d13e1..8c6c2394393b 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -62,10 +62,10 @@ static inline void __flush_tlb_one_user(unsigned long addr)
 	PVOP_VCALL1(mmu.flush_tlb_one_user, addr);
 }
 
-static inline void flush_tlb_others(const struct cpumask *cpumask,
-				    const struct flush_tlb_info *info)
+static inline void flush_tlb_multi(const struct cpumask *cpumask,
+				   const struct flush_tlb_info *info)
 {
-	PVOP_VCALL2(mmu.flush_tlb_others, cpumask, info);
+	PVOP_VCALL2(mmu.flush_tlb_multi, cpumask, info);
 }
 
 static inline void paravirt_tlb_remove_table(struct mmu_gather *tlb, void *table)
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 639b2df445ee..c82969f38845 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -211,8 +211,8 @@ struct pv_mmu_ops {
 	void (*flush_tlb_user)(void);
 	void (*flush_tlb_kernel)(void);
 	void (*flush_tlb_one_user)(unsigned long addr);
-	void (*flush_tlb_others)(const struct cpumask *cpus,
-				 const struct flush_tlb_info *info);
+	void (*flush_tlb_multi)(const struct cpumask *cpus,
+				const struct flush_tlb_info *info);
 
 	void (*tlb_remove_table)(struct mmu_gather *tlb, void *table);
 
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index dee375831962..610e47dc66ef 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -517,7 +517,7 @@ static inline void __flush_tlb_one_kernel(unsigned long addr)
  *  - flush_tlb_page(vma, vmaddr) flushes one page
  *  - flush_tlb_range(vma, start, end) flushes a range of pages
  *  - flush_tlb_kernel_range(start, end) flushes a range of kernel pages
- *  - flush_tlb_others(cpumask, info) flushes TLBs on other cpus
+ *  - flush_tlb_multi(cpumask, info) flushes TLBs on multiple cpus
  *
  * ..but the i386 has somewhat limited tlb flushing capabilities,
  * and page-granular flushes are available only on i486 and up.
@@ -569,7 +569,7 @@ static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
 	flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, PAGE_SHIFT, false);
 }
 
-void native_flush_tlb_others(const struct cpumask *cpumask,
+void native_flush_tlb_multi(const struct cpumask *cpumask,
 			     const struct flush_tlb_info *info);
 
 static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
@@ -593,8 +593,8 @@ static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
 extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
 
 #ifndef CONFIG_PARAVIRT
-#define flush_tlb_others(mask, info)	\
-	native_flush_tlb_others(mask, info)
+#define flush_tlb_multi(mask, info)	\
+	native_flush_tlb_multi(mask, info)
 
 #define paravirt_tlb_remove_table(tlb, page) \
 	tlb_remove_page(tlb, (void *)(page))
diff --git a/arch/x86/include/asm/trace/hyperv.h b/arch/x86/include/asm/trace/hyperv.h
index ace464f09681..85ca8560c7f9 100644
--- a/arch/x86/include/asm/trace/hyperv.h
+++ b/arch/x86/include/asm/trace/hyperv.h
@@ -8,7 +8,7 @@
 
 #if IS_ENABLED(CONFIG_HYPERV)
 
-TRACE_EVENT(hyperv_mmu_flush_tlb_others,
+TRACE_EVENT(hyperv_mmu_flush_tlb_multi,
 	    TP_PROTO(const struct cpumask *cpus,
 		     const struct flush_tlb_info *info),
 	    TP_ARGS(cpus, info),
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index b7f34fe2171e..de40657d9025 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -595,7 +595,7 @@ static void __init kvm_apf_trap_init(void)
 
 static DEFINE_PER_CPU(cpumask_var_t, __pv_tlb_mask);
 
-static void kvm_flush_tlb_others(const struct cpumask *cpumask,
+static void kvm_flush_tlb_multi(const struct cpumask *cpumask,
 			const struct flush_tlb_info *info)
 {
 	u8 state;
@@ -609,6 +609,11 @@ static void kvm_flush_tlb_others(const struct cpumask *cpumask,
 	 * queue flush_on_enter for pre-empted vCPUs
 	 */
 	for_each_cpu(cpu, flushmask) {
+		/*
+		 * The local vCPU is never preempted, so we do not explicitly
+		 * skip check for local vCPU - it will never be cleared from
+		 * flushmask.
+		 */
 		src = &per_cpu(steal_time, cpu);
 		state = READ_ONCE(src->preempted);
 		if ((state & KVM_VCPU_PREEMPTED)) {
@@ -618,7 +623,7 @@ static void kvm_flush_tlb_others(const struct cpumask *cpumask,
 		}
 	}
 
-	native_flush_tlb_others(flushmask, info);
+	native_flush_tlb_multi(flushmask, info);
 }
 
 static void __init kvm_guest_init(void)
@@ -643,7 +648,7 @@ static void __init kvm_guest_init(void)
 	if (kvm_para_has_feature(KVM_FEATURE_PV_TLB_FLUSH) &&
 	    !kvm_para_has_hint(KVM_HINTS_REALTIME) &&
 	    kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
-		pv_ops.mmu.flush_tlb_others = kvm_flush_tlb_others;
+		pv_ops.mmu.flush_tlb_multi = kvm_flush_tlb_multi;
 		pv_ops.mmu.tlb_remove_table = tlb_remove_table;
 	}
 
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 0aa6256eedd8..6af40844a730 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -363,7 +363,7 @@ struct paravirt_patch_template pv_ops = {
 	.mmu.flush_tlb_user	= native_flush_tlb,
 	.mmu.flush_tlb_kernel	= native_flush_tlb_global,
 	.mmu.flush_tlb_one_user	= native_flush_tlb_one_user,
-	.mmu.flush_tlb_others	= native_flush_tlb_others,
+	.mmu.flush_tlb_multi	= native_flush_tlb_multi,
 	.mmu.tlb_remove_table	=
 			(void (*)(struct mmu_gather *, void *))tlb_remove_page,
 
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index abbf55fa8b81..63c00908bdd9 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -551,7 +551,7 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
 		 * garbage into our TLB.  Since switching to init_mm is barely
 		 * slower than a minimal flush, just switch to init_mm.
 		 *
-		 * This should be rare, with native_flush_tlb_others skipping
+		 * This should be rare, with native_flush_tlb_multi() skipping
 		 * IPIs to lazy TLB mode CPUs.
 		 */
 		switch_mm_irqs_off(NULL, &init_mm, NULL);
@@ -665,9 +665,14 @@ static bool tlb_is_not_lazy(int cpu)
 
 static DEFINE_PER_CPU(cpumask_t, flush_tlb_mask);
 
-void native_flush_tlb_others(const struct cpumask *cpumask,
-			     const struct flush_tlb_info *info)
+void native_flush_tlb_multi(const struct cpumask *cpumask,
+			    const struct flush_tlb_info *info)
 {
+	/*
+	 * Do accounting and tracing. Note that there are (and have always been)
+	 * cases in which a remote TLB flush will be traced, but eventually
+	 * would not happen.
+	 */
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
 	if (info->end == TLB_FLUSH_ALL)
 		trace_tlb_flush(TLB_REMOTE_SEND_IPI, TLB_FLUSH_ALL);
@@ -687,10 +692,12 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 		 * means that the percpu tlb_gen variables won't be updated
 		 * and we'll do pointless flushes on future context switches.
 		 *
-		 * Rather than hooking native_flush_tlb_others() here, I think
+		 * Rather than hooking native_flush_tlb_multi() here, I think
 		 * that UV should be updated so that smp_call_function_many(),
 		 * etc, are optimal on UV.
 		 */
+		flush_tlb_func_local((void *)info);
+
 		cpumask = uv_flush_tlb_others(cpumask, info);
 		if (cpumask)
 			smp_call_function_many(cpumask, flush_tlb_func_remote,
@@ -709,8 +716,9 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 	 * doing a speculative memory access.
 	 */
 	if (info->freed_tables) {
-		smp_call_function_many(cpumask, flush_tlb_func_remote,
-			       (void *)info, 1);
+		__smp_call_function_many(cpumask, flush_tlb_func_remote,
+					 flush_tlb_func_local,
+					 (void *)info, 1);
 	} else {
 		/*
 		 * Although we could have used on_each_cpu_cond_mask(),
@@ -737,7 +745,8 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 			if (tlb_is_not_lazy(cpu))
 				__cpumask_set_cpu(cpu, cond_cpumask);
 		}
-		smp_call_function_many(cond_cpumask, flush_tlb_func_remote,
+		__smp_call_function_many(cond_cpumask, flush_tlb_func_remote,
+					 flush_tlb_func_local,
 					 (void *)info, 1);
 	}
 }
@@ -818,16 +827,20 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 	info = get_flush_tlb_info(mm, start, end, stride_shift, freed_tables,
 				  new_tlb_gen);
 
-	if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
+	/*
+	 * flush_tlb_multi() is not optimized for the common case in which only
+	 * a local TLB flush is needed. Optimize this use-case by calling
+	 * flush_tlb_func_local() directly in this case.
+	 */
+	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
+		flush_tlb_multi(mm_cpumask(mm), info);
+	} else if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
 		lockdep_assert_irqs_enabled();
 		local_irq_disable();
 		flush_tlb_func_local(info);
 		local_irq_enable();
 	}
 
-	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
-		flush_tlb_others(mm_cpumask(mm), info);
-
 	put_flush_tlb_info();
 	put_cpu();
 }
@@ -890,16 +903,20 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 {
 	int cpu = get_cpu();
 
-	if (cpumask_test_cpu(cpu, &batch->cpumask)) {
+	/*
+	 * flush_tlb_multi() is not optimized for the common case in which only
+	 * a local TLB flush is needed. Optimize this use-case by calling
+	 * flush_tlb_func_local() directly in this case.
+	 */
+	if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids) {
+		flush_tlb_multi(&batch->cpumask, &full_flush_tlb_info);
+	} else if (cpumask_test_cpu(cpu, &batch->cpumask)) {
 		lockdep_assert_irqs_enabled();
 		local_irq_disable();
 		flush_tlb_func_local((void *)&full_flush_tlb_info);
 		local_irq_enable();
 	}
 
-	if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids)
-		flush_tlb_others(&batch->cpumask, &full_flush_tlb_info);
-
 	cpumask_clear(&batch->cpumask);
 
 	put_cpu();
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index 26e8b326966d..48f7c7eb4dbc 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -1345,8 +1345,8 @@ static void xen_flush_tlb_one_user(unsigned long addr)
 	preempt_enable();
 }
 
-static void xen_flush_tlb_others(const struct cpumask *cpus,
-				 const struct flush_tlb_info *info)
+static void xen_flush_tlb_multi(const struct cpumask *cpus,
+				const struct flush_tlb_info *info)
 {
 	struct {
 		struct mmuext_op op;
@@ -1356,7 +1356,7 @@ static void xen_flush_tlb_others(const struct cpumask *cpus,
 	const size_t mc_entry_size = sizeof(args->op) +
 		sizeof(args->mask[0]) * BITS_TO_LONGS(num_possible_cpus());
 
-	trace_xen_mmu_flush_tlb_others(cpus, info->mm, info->start, info->end);
+	trace_xen_mmu_flush_tlb_multi(cpus, info->mm, info->start, info->end);
 
 	if (cpumask_empty(cpus))
 		return;		/* nothing to do */
@@ -1365,9 +1365,8 @@ static void xen_flush_tlb_others(const struct cpumask *cpus,
 	args = mcs.args;
 	args->op.arg2.vcpumask = to_cpumask(args->mask);
 
-	/* Remove us, and any offline CPUS. */
+	/* Remove any offline CPUs */
 	cpumask_and(to_cpumask(args->mask), cpus, cpu_online_mask);
-	cpumask_clear_cpu(smp_processor_id(), to_cpumask(args->mask));
 
 	args->op.cmd = MMUEXT_TLB_FLUSH_MULTI;
 	if (info->end != TLB_FLUSH_ALL &&
@@ -2396,7 +2395,7 @@ static const struct pv_mmu_ops xen_mmu_ops __initconst = {
 	.flush_tlb_user = xen_flush_tlb,
 	.flush_tlb_kernel = xen_flush_tlb,
 	.flush_tlb_one_user = xen_flush_tlb_one_user,
-	.flush_tlb_others = xen_flush_tlb_others,
+	.flush_tlb_multi = xen_flush_tlb_multi,
 	.tlb_remove_table = tlb_remove_table,
 
 	.pgd_alloc = xen_pgd_alloc,
diff --git a/include/trace/events/xen.h b/include/trace/events/xen.h
index 9a0e8af21310..546022acf160 100644
--- a/include/trace/events/xen.h
+++ b/include/trace/events/xen.h
@@ -362,7 +362,7 @@ TRACE_EVENT(xen_mmu_flush_tlb_one_user,
 	    TP_printk("addr %lx", __entry->addr)
 	);
 
-TRACE_EVENT(xen_mmu_flush_tlb_others,
+TRACE_EVENT(xen_mmu_flush_tlb_multi,
 	    TP_PROTO(const struct cpumask *cpus, struct mm_struct *mm,
 		     unsigned long addr, unsigned long end),
 	    TP_ARGS(cpus, mm, addr, end),
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v3 5/9] x86/mm/tlb: Privatize cpu_tlbstate
  2019-07-19  0:58 [PATCH v3 0/9] x86: Concurrent TLB flushes Nadav Amit
                   ` (3 preceding siblings ...)
  2019-07-19  0:58 ` [PATCH v3 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently Nadav Amit
@ 2019-07-19  0:58 ` Nadav Amit
  2019-07-19 18:38   ` Dave Hansen
  2019-07-19  0:58 ` [PATCH v3 6/9] x86/mm/tlb: Do not make is_lazy dirty for no reason Nadav Amit
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 45+ messages in thread
From: Nadav Amit @ 2019-07-19  0:58 UTC (permalink / raw)
  To: Andy Lutomirski, Dave Hansen
  Cc: x86, linux-kernel, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Nadav Amit

cpu_tlbstate is mostly private and only the variable is_lazy is shared.
This causes some false-sharing when TLB flushes are performed.

Break cpu_tlbstate intro cpu_tlbstate and cpu_tlbstate_shared, and mark
each one accordingly.

Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/tlbflush.h | 39 ++++++++++++++++++---------------
 arch/x86/mm/init.c              |  2 +-
 arch/x86/mm/tlb.c               | 15 ++++++++-----
 3 files changed, 31 insertions(+), 25 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 610e47dc66ef..8490591d3cdf 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -178,23 +178,6 @@ struct tlb_state {
 	u16 loaded_mm_asid;
 	u16 next_asid;
 
-	/*
-	 * We can be in one of several states:
-	 *
-	 *  - Actively using an mm.  Our CPU's bit will be set in
-	 *    mm_cpumask(loaded_mm) and is_lazy == false;
-	 *
-	 *  - Not using a real mm.  loaded_mm == &init_mm.  Our CPU's bit
-	 *    will not be set in mm_cpumask(&init_mm) and is_lazy == false.
-	 *
-	 *  - Lazily using a real mm.  loaded_mm != &init_mm, our bit
-	 *    is set in mm_cpumask(loaded_mm), but is_lazy == true.
-	 *    We're heuristically guessing that the CR3 load we
-	 *    skipped more than makes up for the overhead added by
-	 *    lazy mode.
-	 */
-	bool is_lazy;
-
 	/*
 	 * If set we changed the page tables in such a way that we
 	 * needed an invalidation of all contexts (aka. PCIDs / ASIDs).
@@ -240,7 +223,27 @@ struct tlb_state {
 	 */
 	struct tlb_context ctxs[TLB_NR_DYN_ASIDS];
 };
-DECLARE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate);
+DECLARE_PER_CPU_ALIGNED(struct tlb_state, cpu_tlbstate);
+
+struct tlb_state_shared {
+	/*
+	 * We can be in one of several states:
+	 *
+	 *  - Actively using an mm.  Our CPU's bit will be set in
+	 *    mm_cpumask(loaded_mm) and is_lazy == false;
+	 *
+	 *  - Not using a real mm.  loaded_mm == &init_mm.  Our CPU's bit
+	 *    will not be set in mm_cpumask(&init_mm) and is_lazy == false.
+	 *
+	 *  - Lazily using a real mm.  loaded_mm != &init_mm, our bit
+	 *    is set in mm_cpumask(loaded_mm), but is_lazy == true.
+	 *    We're heuristically guessing that the CR3 load we
+	 *    skipped more than makes up for the overhead added by
+	 *    lazy mode.
+	 */
+	bool is_lazy;
+};
+DECLARE_PER_CPU_SHARED_ALIGNED(struct tlb_state_shared, cpu_tlbstate_shared);
 
 /*
  * Blindly accessing user memory from NMI context can be dangerous
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index fd10d91a6115..34027f36a944 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -951,7 +951,7 @@ void __init zone_sizes_init(void)
 	free_area_init_nodes(max_zone_pfns);
 }
 
-__visible DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate) = {
+__visible DEFINE_PER_CPU_ALIGNED(struct tlb_state, cpu_tlbstate) = {
 	.loaded_mm = &init_mm,
 	.next_asid = 1,
 	.cr4 = ~0UL,	/* fail hard if we screw up cr4 shadow initialization */
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 63c00908bdd9..af80c274c88d 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -145,7 +145,7 @@ void leave_mm(int cpu)
 		return;
 
 	/* Warn if we're not lazy. */
-	WARN_ON(!this_cpu_read(cpu_tlbstate.is_lazy));
+	WARN_ON(!this_cpu_read(cpu_tlbstate_shared.is_lazy));
 
 	switch_mm(NULL, &init_mm, NULL);
 }
@@ -277,7 +277,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 {
 	struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm);
 	u16 prev_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
-	bool was_lazy = this_cpu_read(cpu_tlbstate.is_lazy);
+	bool was_lazy = this_cpu_read(cpu_tlbstate_shared.is_lazy);
 	unsigned cpu = smp_processor_id();
 	u64 next_tlb_gen;
 	bool need_flush;
@@ -322,7 +322,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		__flush_tlb_all();
 	}
 #endif
-	this_cpu_write(cpu_tlbstate.is_lazy, false);
+	this_cpu_write(cpu_tlbstate_shared.is_lazy, false);
 
 	/*
 	 * The membarrier system call requires a full memory barrier and
@@ -463,7 +463,7 @@ void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
 	if (this_cpu_read(cpu_tlbstate.loaded_mm) == &init_mm)
 		return;
 
-	this_cpu_write(cpu_tlbstate.is_lazy, true);
+	this_cpu_write(cpu_tlbstate_shared.is_lazy, true);
 }
 
 /*
@@ -544,7 +544,7 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
 	VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].ctx_id) !=
 		   loaded_mm->context.ctx_id);
 
-	if (this_cpu_read(cpu_tlbstate.is_lazy)) {
+	if (this_cpu_read(cpu_tlbstate_shared.is_lazy)) {
 		/*
 		 * We're in lazy mode.  We need to at least flush our
 		 * paging-structure cache to avoid speculatively reading
@@ -660,11 +660,14 @@ static void flush_tlb_func_remote(void *info)
 
 static bool tlb_is_not_lazy(int cpu)
 {
-	return !per_cpu(cpu_tlbstate.is_lazy, cpu);
+	return !per_cpu(cpu_tlbstate_shared.is_lazy, cpu);
 }
 
 static DEFINE_PER_CPU(cpumask_t, flush_tlb_mask);
 
+DEFINE_PER_CPU_ALIGNED(struct tlb_state_shared, cpu_tlbstate_shared);
+EXPORT_PER_CPU_SYMBOL(cpu_tlbstate_shared);
+
 void native_flush_tlb_multi(const struct cpumask *cpumask,
 			    const struct flush_tlb_info *info)
 {
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v3 6/9] x86/mm/tlb: Do not make is_lazy dirty for no reason
  2019-07-19  0:58 [PATCH v3 0/9] x86: Concurrent TLB flushes Nadav Amit
                   ` (4 preceding siblings ...)
  2019-07-19  0:58 ` [PATCH v3 5/9] x86/mm/tlb: Privatize cpu_tlbstate Nadav Amit
@ 2019-07-19  0:58 ` Nadav Amit
  2019-07-19  0:58 ` [PATCH v3 7/9] cpumask: Mark functions as pure Nadav Amit
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 45+ messages in thread
From: Nadav Amit @ 2019-07-19  0:58 UTC (permalink / raw)
  To: Andy Lutomirski, Dave Hansen
  Cc: x86, linux-kernel, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Nadav Amit, Dave Hansen

Blindly writing to is_lazy for no reason, when the written value is
identical to the old value, makes the cacheline dirty for no reason.
Avoid making such writes to prevent cache coherency traffic for no
reason.

Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/mm/tlb.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index af80c274c88d..89f83ad19507 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -322,7 +322,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		__flush_tlb_all();
 	}
 #endif
-	this_cpu_write(cpu_tlbstate_shared.is_lazy, false);
+	if (was_lazy)
+		this_cpu_write(cpu_tlbstate_shared.is_lazy, false);
 
 	/*
 	 * The membarrier system call requires a full memory barrier and
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v3 7/9] cpumask: Mark functions as pure
  2019-07-19  0:58 [PATCH v3 0/9] x86: Concurrent TLB flushes Nadav Amit
                   ` (5 preceding siblings ...)
  2019-07-19  0:58 ` [PATCH v3 6/9] x86/mm/tlb: Do not make is_lazy dirty for no reason Nadav Amit
@ 2019-07-19  0:58 ` Nadav Amit
  2019-07-19  0:58 ` [PATCH v3 8/9] x86/mm/tlb: Remove UV special case Nadav Amit
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 45+ messages in thread
From: Nadav Amit @ 2019-07-19  0:58 UTC (permalink / raw)
  To: Andy Lutomirski, Dave Hansen
  Cc: x86, linux-kernel, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Nadav Amit

cpumask_next_and() and cpumask_any_but() are pure, and marking them as
such seems to generate different and presumably better code for
native_flush_tlb_multi().

Signed-off-by: Nadav Amit <namit@vmware.com>
---
 include/linux/cpumask.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index 0c7db5efe66c..6d57e6372b9d 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -211,7 +211,7 @@ static inline unsigned int cpumask_last(const struct cpumask *srcp)
 	return find_last_bit(cpumask_bits(srcp), nr_cpumask_bits);
 }
 
-unsigned int cpumask_next(int n, const struct cpumask *srcp);
+unsigned int __pure cpumask_next(int n, const struct cpumask *srcp);
 
 /**
  * cpumask_next_zero - get the next unset cpu in a cpumask
@@ -228,8 +228,8 @@ static inline unsigned int cpumask_next_zero(int n, const struct cpumask *srcp)
 	return find_next_zero_bit(cpumask_bits(srcp), nr_cpumask_bits, n+1);
 }
 
-int cpumask_next_and(int n, const struct cpumask *, const struct cpumask *);
-int cpumask_any_but(const struct cpumask *mask, unsigned int cpu);
+int __pure cpumask_next_and(int n, const struct cpumask *, const struct cpumask *);
+int __pure cpumask_any_but(const struct cpumask *mask, unsigned int cpu);
 unsigned int cpumask_local_spread(unsigned int i, int node);
 
 /**
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v3 8/9] x86/mm/tlb: Remove UV special case
  2019-07-19  0:58 [PATCH v3 0/9] x86: Concurrent TLB flushes Nadav Amit
                   ` (6 preceding siblings ...)
  2019-07-19  0:58 ` [PATCH v3 7/9] cpumask: Mark functions as pure Nadav Amit
@ 2019-07-19  0:58 ` Nadav Amit
  2019-07-19  2:25   ` Mike Travis
  2019-07-19  0:58 ` [PATCH v3 9/9] x86/mm/tlb: Remove unnecessary uses of the inline keyword Nadav Amit
  2019-07-19 21:36 ` [PATCH v3 0/9] x86: Concurrent TLB flushes Dave Hansen
  9 siblings, 1 reply; 45+ messages in thread
From: Nadav Amit @ 2019-07-19  0:58 UTC (permalink / raw)
  To: Andy Lutomirski, Dave Hansen
  Cc: x86, linux-kernel, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Nadav Amit, Dave Hansen, Mike Travis

SGI UV support is outdated and not maintained, and it is not clear how
it performs relatively to non-UV. Remove the code to simplify the code.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Acked-by: Mike Travis <mike.travis@hpe.com>
Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/mm/tlb.c | 25 -------------------------
 1 file changed, 25 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 89f83ad19507..40daad52ec7d 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -684,31 +684,6 @@ void native_flush_tlb_multi(const struct cpumask *cpumask,
 		trace_tlb_flush(TLB_REMOTE_SEND_IPI,
 				(info->end - info->start) >> PAGE_SHIFT);
 
-	if (is_uv_system()) {
-		/*
-		 * This whole special case is confused.  UV has a "Broadcast
-		 * Assist Unit", which seems to be a fancy way to send IPIs.
-		 * Back when x86 used an explicit TLB flush IPI, UV was
-		 * optimized to use its own mechanism.  These days, x86 uses
-		 * smp_call_function_many(), but UV still uses a manual IPI,
-		 * and that IPI's action is out of date -- it does a manual
-		 * flush instead of calling flush_tlb_func_remote().  This
-		 * means that the percpu tlb_gen variables won't be updated
-		 * and we'll do pointless flushes on future context switches.
-		 *
-		 * Rather than hooking native_flush_tlb_multi() here, I think
-		 * that UV should be updated so that smp_call_function_many(),
-		 * etc, are optimal on UV.
-		 */
-		flush_tlb_func_local((void *)info);
-
-		cpumask = uv_flush_tlb_others(cpumask, info);
-		if (cpumask)
-			smp_call_function_many(cpumask, flush_tlb_func_remote,
-					       (void *)info, 1);
-		return;
-	}
-
 	/*
 	 * If no page tables were freed, we can skip sending IPIs to
 	 * CPUs in lazy TLB mode. They will flush the CPU themselves
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v3 9/9] x86/mm/tlb: Remove unnecessary uses of the inline keyword
  2019-07-19  0:58 [PATCH v3 0/9] x86: Concurrent TLB flushes Nadav Amit
                   ` (7 preceding siblings ...)
  2019-07-19  0:58 ` [PATCH v3 8/9] x86/mm/tlb: Remove UV special case Nadav Amit
@ 2019-07-19  0:58 ` Nadav Amit
  2019-07-19 21:36 ` [PATCH v3 0/9] x86: Concurrent TLB flushes Dave Hansen
  9 siblings, 0 replies; 45+ messages in thread
From: Nadav Amit @ 2019-07-19  0:58 UTC (permalink / raw)
  To: Andy Lutomirski, Dave Hansen
  Cc: x86, linux-kernel, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Nadav Amit

The compiler is smart enough without these hints.

Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/mm/tlb.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 40daad52ec7d..2ddc32e787f3 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -189,7 +189,7 @@ static void sync_current_stack_to_mm(struct mm_struct *mm)
 	}
 }
 
-static inline unsigned long mm_mangle_tif_spec_ib(struct task_struct *next)
+static unsigned long mm_mangle_tif_spec_ib(struct task_struct *next)
 {
 	unsigned long next_tif = task_thread_info(next)->flags;
 	unsigned long ibpb = (next_tif >> TIF_SPEC_IB) & LAST_USER_MM_IBPB;
@@ -748,7 +748,7 @@ static DEFINE_PER_CPU_SHARED_ALIGNED(struct flush_tlb_info, flush_tlb_info);
 static DEFINE_PER_CPU(unsigned int, flush_tlb_info_idx);
 #endif
 
-static inline struct flush_tlb_info *get_flush_tlb_info(struct mm_struct *mm,
+static struct flush_tlb_info *get_flush_tlb_info(struct mm_struct *mm,
 			unsigned long start, unsigned long end,
 			unsigned int stride_shift, bool freed_tables,
 			u64 new_tlb_gen)
@@ -774,7 +774,7 @@ static inline struct flush_tlb_info *get_flush_tlb_info(struct mm_struct *mm,
 	return info;
 }
 
-static inline void put_flush_tlb_info(void)
+static void put_flush_tlb_info(void)
 {
 #ifdef CONFIG_DEBUG_VM
 	/* Complete reentrency prevention checks */
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 8/9] x86/mm/tlb: Remove UV special case
  2019-07-19  0:58 ` [PATCH v3 8/9] x86/mm/tlb: Remove UV special case Nadav Amit
@ 2019-07-19  2:25   ` Mike Travis
  2019-07-19  4:58     ` Nadav Amit
  2019-07-31  3:11     ` Nadav Amit
  0 siblings, 2 replies; 45+ messages in thread
From: Mike Travis @ 2019-07-19  2:25 UTC (permalink / raw)
  To: Nadav Amit, Andy Lutomirski, Dave Hansen
  Cc: x86, linux-kernel, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Dave Hansen

It is a fact that the UV is still the UV and SGI is now part of HPE. 
The current external product is known as SuperDome Flex.  It is both up 
to date as well as very well maintained.  The ACK I provided was an okay 
to change the code, but please make the description accurate.

On 7/18/2019 5:58 PM, Nadav Amit wrote:
> SGI UV support is outdated and not maintained, and it is not clear how
> it performs relatively to non-UV. Remove the code to simplify the code.
> 
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Acked-by: Mike Travis <mike.travis@hpe.com>
> Suggested-by: Andy Lutomirski <luto@kernel.org>
> Signed-off-by: Nadav Amit <namit@vmware.com>
> ---
>   arch/x86/mm/tlb.c | 25 -------------------------
>   1 file changed, 25 deletions(-)
> 
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 89f83ad19507..40daad52ec7d 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -684,31 +684,6 @@ void native_flush_tlb_multi(const struct cpumask *cpumask,
>   		trace_tlb_flush(TLB_REMOTE_SEND_IPI,
>   				(info->end - info->start) >> PAGE_SHIFT);
>   
> -	if (is_uv_system()) {
> -		/*
> -		 * This whole special case is confused.  UV has a "Broadcast
> -		 * Assist Unit", which seems to be a fancy way to send IPIs.
> -		 * Back when x86 used an explicit TLB flush IPI, UV was
> -		 * optimized to use its own mechanism.  These days, x86 uses
> -		 * smp_call_function_many(), but UV still uses a manual IPI,
> -		 * and that IPI's action is out of date -- it does a manual
> -		 * flush instead of calling flush_tlb_func_remote().  This
> -		 * means that the percpu tlb_gen variables won't be updated
> -		 * and we'll do pointless flushes on future context switches.
> -		 *
> -		 * Rather than hooking native_flush_tlb_multi() here, I think
> -		 * that UV should be updated so that smp_call_function_many(),
> -		 * etc, are optimal on UV.
> -		 */
> -		flush_tlb_func_local((void *)info);
> -
> -		cpumask = uv_flush_tlb_others(cpumask, info);
> -		if (cpumask)
> -			smp_call_function_many(cpumask, flush_tlb_func_remote,
> -					       (void *)info, 1);
> -		return;
> -	}
> -
>   	/*
>   	 * If no page tables were freed, we can skip sending IPIs to
>   	 * CPUs in lazy TLB mode. They will flush the CPU themselves
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 8/9] x86/mm/tlb: Remove UV special case
  2019-07-19  2:25   ` Mike Travis
@ 2019-07-19  4:58     ` Nadav Amit
  2019-07-31  3:11     ` Nadav Amit
  1 sibling, 0 replies; 45+ messages in thread
From: Nadav Amit @ 2019-07-19  4:58 UTC (permalink / raw)
  To: Mike Travis
  Cc: Andy Lutomirski, Dave Hansen, x86, linux-kernel, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Dave Hansen

> On Jul 18, 2019, at 7:25 PM, Mike Travis <mike.travis@hpe.com> wrote:
> 
> It is a fact that the UV is still the UV and SGI is now part of HPE. The current external product is known as SuperDome Flex.  It is both up to date as well as very well maintained.  The ACK I provided was an okay to change the code, but please make the description accurate.

Indeed. Sorry for that - I will update it in v4. I guess you will be ok with
me copy-pasting parts of your response into the commit log.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 1/9] smp: Run functions concurrently in smp_call_function_many()
  2019-07-19  0:58 ` [PATCH v3 1/9] smp: Run functions concurrently in smp_call_function_many() Nadav Amit
@ 2019-07-19 18:23   ` Dave Hansen
  2019-07-22 18:16     ` Peter Zijlstra
  2019-07-22 18:21   ` Peter Zijlstra
  1 sibling, 1 reply; 45+ messages in thread
From: Dave Hansen @ 2019-07-19 18:23 UTC (permalink / raw)
  To: Nadav Amit, Andy Lutomirski, Dave Hansen
  Cc: x86, linux-kernel, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Rik van Riel, Josh Poimboeuf

On 7/18/19 5:58 PM, Nadav Amit wrote:
> @@ -624,16 +622,11 @@ EXPORT_SYMBOL(on_each_cpu);
>  void on_each_cpu_mask(const struct cpumask *mask, smp_call_func_t func,
>  			void *info, bool wait)
>  {
> -	int cpu = get_cpu();
> +	preempt_disable();
>  
> -	smp_call_function_many(mask, func, info, wait);
> -	if (cpumask_test_cpu(cpu, mask)) {
> -		unsigned long flags;
> -		local_irq_save(flags);
> -		func(info);
> -		local_irq_restore(flags);
> -	}
> -	put_cpu();
> +	__smp_call_function_many(mask, func, func, info, wait);
> +
> +	preempt_enable();
>  }

The get_cpu() was missing it too, but it would be nice to add some
comments about why preempt needs to be off.  I was also thinking it
might make sense to do:

	cfd = get_cpu_var(cfd_data);
	__smp_call_function_many(cfd, ...);
	put_cpu_var(cfd_data);
	
instead of the explicit preempt_enable/disable(), but I don't feel too
strongly about it.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 3/9] x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy()
  2019-07-19  0:58 ` [PATCH v3 3/9] x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy() Nadav Amit
@ 2019-07-19 18:36   ` Dave Hansen
  2019-07-19 18:41     ` Nadav Amit
  2019-07-22 18:27   ` Peter Zijlstra
  2019-07-22 19:47   ` Rasmus Villemoes
  2 siblings, 1 reply; 45+ messages in thread
From: Dave Hansen @ 2019-07-19 18:36 UTC (permalink / raw)
  To: Nadav Amit, Andy Lutomirski, Dave Hansen
  Cc: x86, linux-kernel, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Rik van Riel, Josh Poimboeuf

On 7/18/19 5:58 PM, Nadav Amit wrote:
> @@ -865,7 +893,7 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
>  	if (cpumask_test_cpu(cpu, &batch->cpumask)) {
>  		lockdep_assert_irqs_enabled();
>  		local_irq_disable();
> -		flush_tlb_func_local(&full_flush_tlb_info);
> +		flush_tlb_func_local((void *)&full_flush_tlb_info);
>  		local_irq_enable();
>  	}

This looks like superfluous churn.  Is it?

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 5/9] x86/mm/tlb: Privatize cpu_tlbstate
  2019-07-19  0:58 ` [PATCH v3 5/9] x86/mm/tlb: Privatize cpu_tlbstate Nadav Amit
@ 2019-07-19 18:38   ` Dave Hansen
  2019-07-19 18:43     ` Nadav Amit
  2019-07-21 20:21     ` Nadav Amit
  0 siblings, 2 replies; 45+ messages in thread
From: Dave Hansen @ 2019-07-19 18:38 UTC (permalink / raw)
  To: Nadav Amit, Andy Lutomirski, Dave Hansen
  Cc: x86, linux-kernel, Peter Zijlstra, Thomas Gleixner, Ingo Molnar

On 7/18/19 5:58 PM, Nadav Amit wrote:
> +struct tlb_state_shared {
> +	/*
> +	 * We can be in one of several states:
> +	 *
> +	 *  - Actively using an mm.  Our CPU's bit will be set in
> +	 *    mm_cpumask(loaded_mm) and is_lazy == false;
> +	 *
> +	 *  - Not using a real mm.  loaded_mm == &init_mm.  Our CPU's bit
> +	 *    will not be set in mm_cpumask(&init_mm) and is_lazy == false.
> +	 *
> +	 *  - Lazily using a real mm.  loaded_mm != &init_mm, our bit
> +	 *    is set in mm_cpumask(loaded_mm), but is_lazy == true.
> +	 *    We're heuristically guessing that the CR3 load we
> +	 *    skipped more than makes up for the overhead added by
> +	 *    lazy mode.
> +	 */
> +	bool is_lazy;
> +};
> +DECLARE_PER_CPU_SHARED_ALIGNED(struct tlb_state_shared, cpu_tlbstate_shared);

Could we get a comment about what "shared" means and why we need shared
state?

Should we change 'tlb_state' to 'tlb_state_private'?


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 3/9] x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy()
  2019-07-19 18:36   ` Dave Hansen
@ 2019-07-19 18:41     ` Nadav Amit
  2019-07-19 22:44       ` Joe Perches
  0 siblings, 1 reply; 45+ messages in thread
From: Nadav Amit @ 2019-07-19 18:41 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, Dave Hansen, x86, linux-kernel, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Rik van Riel, Josh Poimboeuf

> On Jul 19, 2019, at 11:36 AM, Dave Hansen <dave.hansen@intel.com> wrote:
> 
> On 7/18/19 5:58 PM, Nadav Amit wrote:
>> @@ -865,7 +893,7 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
>> 	if (cpumask_test_cpu(cpu, &batch->cpumask)) {
>> 		lockdep_assert_irqs_enabled();
>> 		local_irq_disable();
>> -		flush_tlb_func_local(&full_flush_tlb_info);
>> +		flush_tlb_func_local((void *)&full_flush_tlb_info);
>> 		local_irq_enable();
>> 	}
> 
> This looks like superfluous churn.  Is it?

Unfortunately not, since full_flush_tlb_info is defined as const. Without it
you would get:

warning: passing argument 1 of ‘flush_tlb_func_local’ discards ‘const’ qualifier from pointer target type [-Wdiscarded-qualifiers]

And flush_tlb_func_local() should get (void *) argument since it is also
used by the SMP infrastructure.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 5/9] x86/mm/tlb: Privatize cpu_tlbstate
  2019-07-19 18:38   ` Dave Hansen
@ 2019-07-19 18:43     ` Nadav Amit
  2019-07-19 18:48       ` Dave Hansen
  2019-07-21 20:21     ` Nadav Amit
  1 sibling, 1 reply; 45+ messages in thread
From: Nadav Amit @ 2019-07-19 18:43 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, Dave Hansen, the arch/x86 maintainers,
	linux-kernel, Peter Zijlstra, Thomas Gleixner, Ingo Molnar

> On Jul 19, 2019, at 11:38 AM, Dave Hansen <dave.hansen@intel.com> wrote:
> 
> On 7/18/19 5:58 PM, Nadav Amit wrote:
>> +struct tlb_state_shared {
>> +	/*
>> +	 * We can be in one of several states:
>> +	 *
>> +	 *  - Actively using an mm.  Our CPU's bit will be set in
>> +	 *    mm_cpumask(loaded_mm) and is_lazy == false;
>> +	 *
>> +	 *  - Not using a real mm.  loaded_mm == &init_mm.  Our CPU's bit
>> +	 *    will not be set in mm_cpumask(&init_mm) and is_lazy == false.
>> +	 *
>> +	 *  - Lazily using a real mm.  loaded_mm != &init_mm, our bit
>> +	 *    is set in mm_cpumask(loaded_mm), but is_lazy == true.
>> +	 *    We're heuristically guessing that the CR3 load we
>> +	 *    skipped more than makes up for the overhead added by
>> +	 *    lazy mode.
>> +	 */
>> +	bool is_lazy;
>> +};
>> +DECLARE_PER_CPU_SHARED_ALIGNED(struct tlb_state_shared, cpu_tlbstate_shared);
> 
> Could we get a comment about what "shared" means and why we need shared
> state?
> 
> Should we change 'tlb_state' to 'tlb_state_private’?

Andy said that for the lazy tlb optimizations there might soon be more
shared state. If you prefer, I can move is_lazy outside of tlb_state, and
not set it in any alternative struct.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 5/9] x86/mm/tlb: Privatize cpu_tlbstate
  2019-07-19 18:43     ` Nadav Amit
@ 2019-07-19 18:48       ` Dave Hansen
  2019-07-19 18:54         ` Nadav Amit
  0 siblings, 1 reply; 45+ messages in thread
From: Dave Hansen @ 2019-07-19 18:48 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andy Lutomirski, Dave Hansen, the arch/x86 maintainers,
	linux-kernel, Peter Zijlstra, Thomas Gleixner, Ingo Molnar

On 7/19/19 11:43 AM, Nadav Amit wrote:
> Andy said that for the lazy tlb optimizations there might soon be more
> shared state. If you prefer, I can move is_lazy outside of tlb_state, and
> not set it in any alternative struct.

I just wanted to make sure that we capture these rules:

1. If the data is only ever accessed on the "owning" CPU via
   this_cpu_*(), put it in 'tlb_state'.
2. If the data is read by other CPUs, put it in 'tlb_state_shared'.

I actually like the idea of having two structs.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 5/9] x86/mm/tlb: Privatize cpu_tlbstate
  2019-07-19 18:48       ` Dave Hansen
@ 2019-07-19 18:54         ` Nadav Amit
  2019-07-20 13:58           ` Andy Lutomirski
  0 siblings, 1 reply; 45+ messages in thread
From: Nadav Amit @ 2019-07-19 18:54 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, Dave Hansen, the arch/x86 maintainers,
	linux-kernel, Peter Zijlstra, Thomas Gleixner, Ingo Molnar

> On Jul 19, 2019, at 11:48 AM, Dave Hansen <dave.hansen@intel.com> wrote:
> 
> On 7/19/19 11:43 AM, Nadav Amit wrote:
>> Andy said that for the lazy tlb optimizations there might soon be more
>> shared state. If you prefer, I can move is_lazy outside of tlb_state, and
>> not set it in any alternative struct.
> 
> I just wanted to make sure that we capture these rules:
> 
> 1. If the data is only ever accessed on the "owning" CPU via
>   this_cpu_*(), put it in 'tlb_state'.
> 2. If the data is read by other CPUs, put it in 'tlb_state_shared'.
> 
> I actually like the idea of having two structs.

Yes, that’s exactly the idea. In the (1) case, we may even be able to mark
the struct with __thread qualifier, which IIRC would prevent memory barriers
from causing these values being reread.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 0/9] x86: Concurrent TLB flushes
  2019-07-19  0:58 [PATCH v3 0/9] x86: Concurrent TLB flushes Nadav Amit
                   ` (8 preceding siblings ...)
  2019-07-19  0:58 ` [PATCH v3 9/9] x86/mm/tlb: Remove unnecessary uses of the inline keyword Nadav Amit
@ 2019-07-19 21:36 ` Dave Hansen
  9 siblings, 0 replies; 45+ messages in thread
From: Dave Hansen @ 2019-07-19 21:36 UTC (permalink / raw)
  To: Nadav Amit, Andy Lutomirski, Dave Hansen
  Cc: x86, linux-kernel, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Boris Ostrovsky, Haiyang Zhang, Josh Poimboeuf,
	Juergen Gross, K. Y. Srinivasan, Paolo Bonzini, Rik van Riel,
	Sasha Levin, Stephen Hemminger, kvm, linux-hyperv,
	virtualization, xen-devel

Thanks for doing this, it's something I've been hoping someone would do
for a long time.

While I kinda wish we had performance data for each individual patch (at
least the behavior-changing ones), this does look like a good
improvement.  That might, for instance, tell is a bit about how the
separating out "is_lazy" compares to the "check before setting"
optimization.  But, they're both sane enough on their own that I'm not
too worried.

I had some nits that I hope get covered in later revisions, if sent.
But, overall looks fine.  For the series:

Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 3/9] x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy()
  2019-07-19 18:41     ` Nadav Amit
@ 2019-07-19 22:44       ` Joe Perches
  2019-07-19 23:02         ` Nadav Amit
  0 siblings, 1 reply; 45+ messages in thread
From: Joe Perches @ 2019-07-19 22:44 UTC (permalink / raw)
  To: Nadav Amit, Dave Hansen
  Cc: Andy Lutomirski, Dave Hansen, x86, linux-kernel, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Rik van Riel, Josh Poimboeuf

On Fri, 2019-07-19 at 18:41 +0000, Nadav Amit wrote:
> > On Jul 19, 2019, at 11:36 AM, Dave Hansen <dave.hansen@intel.com> wrote:
> > 
> > On 7/18/19 5:58 PM, Nadav Amit wrote:
> > > @@ -865,7 +893,7 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
> > > 	if (cpumask_test_cpu(cpu, &batch->cpumask)) {
> > > 		lockdep_assert_irqs_enabled();
> > > 		local_irq_disable();
> > > -		flush_tlb_func_local(&full_flush_tlb_info);
> > > +		flush_tlb_func_local((void *)&full_flush_tlb_info);
> > > 		local_irq_enable();
> > > 	}
> > 
> > This looks like superfluous churn.  Is it?
> 
> Unfortunately not, since full_flush_tlb_info is defined as const. Without it
> you would get:
> 
> warning: passing argument 1 of ‘flush_tlb_func_local’ discards ‘const’ qualifier from pointer target type [-Wdiscarded-qualifiers]
> 
> And flush_tlb_func_local() should get (void *) argument since it is also
> used by the SMP infrastructure.

huh?

commit 3db6d5a5ecaf0a778d721ccf9809248350d4bfaf
Author: Nadav Amit <namit@vmware.com>
Date:   Thu Apr 25 16:01:43 2019 -0700

[]

-static void flush_tlb_func_local(void *info, enum tlb_flush_reason reason)
+static void flush_tlb_func_local(const void *info, enum tlb_flush_reason reason)


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 3/9] x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy()
  2019-07-19 22:44       ` Joe Perches
@ 2019-07-19 23:02         ` Nadav Amit
  0 siblings, 0 replies; 45+ messages in thread
From: Nadav Amit @ 2019-07-19 23:02 UTC (permalink / raw)
  To: Joe Perches
  Cc: Dave Hansen, Andy Lutomirski, Dave Hansen, x86, linux-kernel,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Rik van Riel,
	Josh Poimboeuf

> On Jul 19, 2019, at 3:44 PM, Joe Perches <joe@perches.com> wrote:
> 
> On Fri, 2019-07-19 at 18:41 +0000, Nadav Amit wrote:
>>> On Jul 19, 2019, at 11:36 AM, Dave Hansen <dave.hansen@intel.com> wrote:
>>> 
>>> On 7/18/19 5:58 PM, Nadav Amit wrote:
>>>> @@ -865,7 +893,7 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
>>>> 	if (cpumask_test_cpu(cpu, &batch->cpumask)) {
>>>> 		lockdep_assert_irqs_enabled();
>>>> 		local_irq_disable();
>>>> -		flush_tlb_func_local(&full_flush_tlb_info);
>>>> +		flush_tlb_func_local((void *)&full_flush_tlb_info);
>>>> 		local_irq_enable();
>>>> 	}
>>> 
>>> This looks like superfluous churn.  Is it?
>> 
>> Unfortunately not, since full_flush_tlb_info is defined as const. Without it
>> you would get:
>> 
>> warning: passing argument 1 of ‘flush_tlb_func_local’ discards ‘const’ qualifier from pointer target type [-Wdiscarded-qualifiers]
>> 
>> And flush_tlb_func_local() should get (void *) argument since it is also
>> used by the SMP infrastructure.
> 
> huh?
> 
> commit 3db6d5a5ecaf0a778d721ccf9809248350d4bfaf
> Author: Nadav Amit <namit@vmware.com>
> Date:   Thu Apr 25 16:01:43 2019 -0700
> 
> []
> 
> -static void flush_tlb_func_local(void *info, enum tlb_flush_reason reason)
> +static void flush_tlb_func_local(const void *info, enum tlb_flush_reason reason)

Hopefully I decipher the “huh?” correctly...

When we moved the flush_tlb_info off the stack, in the patch that you
reference, we created a global full_flush_tlb_info variable, which was set
as const and delivered to flush_tlb_func_local. I changed the signature of
the function to avoid casting. IIRC, setting the variable as constant
slightly improved the generated assembly, and anyhow made sense.

In this patch-set I need to change the first argument of
flush_tlb_func_local to be non-const to match the SMP function signature
which takes a single non-const void * argument. Yes, there is what seems to
be non-safe casting, but flush_tlb_func_common() casts it back into a const
variable.

I know that I also added the infrastructure to run a function locally in the
SMP infrastructure, but it made sense to have the same signature for local
function and remote ones.

Does it make more sense now?


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 5/9] x86/mm/tlb: Privatize cpu_tlbstate
  2019-07-19 18:54         ` Nadav Amit
@ 2019-07-20 13:58           ` Andy Lutomirski
  0 siblings, 0 replies; 45+ messages in thread
From: Andy Lutomirski @ 2019-07-20 13:58 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Dave Hansen, Andy Lutomirski, Dave Hansen,
	the arch/x86 maintainers, linux-kernel, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar

On Fri, Jul 19, 2019 at 11:54 AM Nadav Amit <namit@vmware.com> wrote:
>
> > On Jul 19, 2019, at 11:48 AM, Dave Hansen <dave.hansen@intel.com> wrote:
> >
> > On 7/19/19 11:43 AM, Nadav Amit wrote:
> >> Andy said that for the lazy tlb optimizations there might soon be more
> >> shared state. If you prefer, I can move is_lazy outside of tlb_state, and
> >> not set it in any alternative struct.
> >
> > I just wanted to make sure that we capture these rules:
> >
> > 1. If the data is only ever accessed on the "owning" CPU via
> >   this_cpu_*(), put it in 'tlb_state'.
> > 2. If the data is read by other CPUs, put it in 'tlb_state_shared'.
> >
> > I actually like the idea of having two structs.
>
> Yes, that’s exactly the idea. In the (1) case, we may even be able to mark
> the struct with __thread qualifier, which IIRC would prevent memory barriers
> from causing these values being reread.

I'm okay with the patch.  If we end up changing things later, we can
rearrange as needed.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 5/9] x86/mm/tlb: Privatize cpu_tlbstate
  2019-07-19 18:38   ` Dave Hansen
  2019-07-19 18:43     ` Nadav Amit
@ 2019-07-21 20:21     ` Nadav Amit
  1 sibling, 0 replies; 45+ messages in thread
From: Nadav Amit @ 2019-07-21 20:21 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, Dave Hansen, x86, linux-kernel, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar

> On Jul 19, 2019, at 11:38 AM, Dave Hansen <dave.hansen@intel.com> wrote:
> 
> On 7/18/19 5:58 PM, Nadav Amit wrote:
>> +struct tlb_state_shared {
>> +	/*
>> +	 * We can be in one of several states:
>> +	 *
>> +	 *  - Actively using an mm.  Our CPU's bit will be set in
>> +	 *    mm_cpumask(loaded_mm) and is_lazy == false;
>> +	 *
>> +	 *  - Not using a real mm.  loaded_mm == &init_mm.  Our CPU's bit
>> +	 *    will not be set in mm_cpumask(&init_mm) and is_lazy == false.
>> +	 *
>> +	 *  - Lazily using a real mm.  loaded_mm != &init_mm, our bit
>> +	 *    is set in mm_cpumask(loaded_mm), but is_lazy == true.
>> +	 *    We're heuristically guessing that the CR3 load we
>> +	 *    skipped more than makes up for the overhead added by
>> +	 *    lazy mode.
>> +	 */
>> +	bool is_lazy;
>> +};
>> +DECLARE_PER_CPU_SHARED_ALIGNED(struct tlb_state_shared, cpu_tlbstate_shared);
> 
> Could we get a comment about what "shared" means and why we need shared
> state?
> 
> Should we change 'tlb_state' to 'tlb_state_private’?

I don’t feel strongly about either one. I perferred the one that is likely
to cause fewer changes and potential conflicts. Anyhow, I would add a better
comment as you asked for.

So it is really up to you.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 1/9] smp: Run functions concurrently in smp_call_function_many()
  2019-07-19 18:23   ` Dave Hansen
@ 2019-07-22 18:16     ` Peter Zijlstra
  2019-07-22 18:41       ` Nadav Amit
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2019-07-22 18:16 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Nadav Amit, Andy Lutomirski, Dave Hansen, x86, linux-kernel,
	Thomas Gleixner, Ingo Molnar, Rik van Riel, Josh Poimboeuf

On Fri, Jul 19, 2019 at 11:23:06AM -0700, Dave Hansen wrote:
> On 7/18/19 5:58 PM, Nadav Amit wrote:
> > @@ -624,16 +622,11 @@ EXPORT_SYMBOL(on_each_cpu);
> >  void on_each_cpu_mask(const struct cpumask *mask, smp_call_func_t func,
> >  			void *info, bool wait)
> >  {
> > -	int cpu = get_cpu();
> > +	preempt_disable();
> >  
> > -	smp_call_function_many(mask, func, info, wait);
> > -	if (cpumask_test_cpu(cpu, mask)) {
> > -		unsigned long flags;
> > -		local_irq_save(flags);
> > -		func(info);
> > -		local_irq_restore(flags);
> > -	}
> > -	put_cpu();
> > +	__smp_call_function_many(mask, func, func, info, wait);
> > +
> > +	preempt_enable();
> >  }
> 
> The get_cpu() was missing it too, but it would be nice to add some
> comments about why preempt needs to be off.  I was also thinking it
> might make sense to do:
> 
> 	cfd = get_cpu_var(cfd_data);
> 	__smp_call_function_many(cfd, ...);
> 	put_cpu_var(cfd_data);
> 	
> instead of the explicit preempt_enable/disable(), but I don't feel too
> strongly about it.

It is also required for cpu hotplug.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 1/9] smp: Run functions concurrently in smp_call_function_many()
  2019-07-19  0:58 ` [PATCH v3 1/9] smp: Run functions concurrently in smp_call_function_many() Nadav Amit
  2019-07-19 18:23   ` Dave Hansen
@ 2019-07-22 18:21   ` Peter Zijlstra
  2019-07-22 18:34     ` Nadav Amit
  2019-07-22 18:37     ` Thomas Gleixner
  1 sibling, 2 replies; 45+ messages in thread
From: Peter Zijlstra @ 2019-07-22 18:21 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andy Lutomirski, Dave Hansen, x86, linux-kernel, Thomas Gleixner,
	Ingo Molnar, Rik van Riel, Josh Poimboeuf

On Thu, Jul 18, 2019 at 05:58:29PM -0700, Nadav Amit wrote:
> +/*
> + * Call a function on all processors.  May be used during early boot while
> + * early_boot_irqs_disabled is set.
> + */
> +static inline void on_each_cpu(smp_call_func_t func, void *info, int wait)
> +{
> +	on_each_cpu_mask(cpu_online_mask, func, info, wait);
> +}

I'm thinking that one if buggy, nothing protects online mask here.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 3/9] x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy()
  2019-07-19  0:58 ` [PATCH v3 3/9] x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy() Nadav Amit
  2019-07-19 18:36   ` Dave Hansen
@ 2019-07-22 18:27   ` Peter Zijlstra
  2019-07-22 19:47   ` Rasmus Villemoes
  2 siblings, 0 replies; 45+ messages in thread
From: Peter Zijlstra @ 2019-07-22 18:27 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andy Lutomirski, Dave Hansen, x86, linux-kernel, Thomas Gleixner,
	Ingo Molnar, Rik van Riel, Josh Poimboeuf

On Thu, Jul 18, 2019 at 05:58:31PM -0700, Nadav Amit wrote:
> +static DEFINE_PER_CPU(cpumask_t, flush_tlb_mask);

I'm thinking it should be possible to allocate this before we switch to
SMP, no? cpumask_t really should not be used wherever possible.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 1/9] smp: Run functions concurrently in smp_call_function_many()
  2019-07-22 18:21   ` Peter Zijlstra
@ 2019-07-22 18:34     ` Nadav Amit
  2019-07-22 19:19       ` Peter Zijlstra
  2019-07-22 18:37     ` Thomas Gleixner
  1 sibling, 1 reply; 45+ messages in thread
From: Nadav Amit @ 2019-07-22 18:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andy Lutomirski, Dave Hansen, the arch/x86 maintainers, LKML,
	Thomas Gleixner, Ingo Molnar, Rik van Riel, Josh Poimboeuf

> On Jul 22, 2019, at 11:21 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Thu, Jul 18, 2019 at 05:58:29PM -0700, Nadav Amit wrote:
>> +/*
>> + * Call a function on all processors.  May be used during early boot while
>> + * early_boot_irqs_disabled is set.
>> + */
>> +static inline void on_each_cpu(smp_call_func_t func, void *info, int wait)
>> +{
>> +	on_each_cpu_mask(cpu_online_mask, func, info, wait);
>> +}
> 
> I'm thinking that one if buggy, nothing protects online mask here.

on_each_cpu_mask() calls __on_each_cpu_mask() which would disable preemption.
The mask might change, but anyhow __smp_call_function_many() would “and” it,
after disabling preemption, with (the potentially updated) cpu_online_mask.

What is your concern?

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 1/9] smp: Run functions concurrently in smp_call_function_many()
  2019-07-22 18:21   ` Peter Zijlstra
  2019-07-22 18:34     ` Nadav Amit
@ 2019-07-22 18:37     ` Thomas Gleixner
  2019-07-22 18:40       ` Nadav Amit
  1 sibling, 1 reply; 45+ messages in thread
From: Thomas Gleixner @ 2019-07-22 18:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nadav Amit, Andy Lutomirski, Dave Hansen, x86, linux-kernel,
	Ingo Molnar, Rik van Riel, Josh Poimboeuf

On Mon, 22 Jul 2019, Peter Zijlstra wrote:

> On Thu, Jul 18, 2019 at 05:58:29PM -0700, Nadav Amit wrote:
> > +/*
> > + * Call a function on all processors.  May be used during early boot while
> > + * early_boot_irqs_disabled is set.
> > + */
> > +static inline void on_each_cpu(smp_call_func_t func, void *info, int wait)
> > +{
> > +	on_each_cpu_mask(cpu_online_mask, func, info, wait);
> > +}
> 
> I'm thinking that one if buggy, nothing protects online mask here.

The current implementation has preemption disabled before touching
cpu_online_mask which at least protects against a CPU going away as that
prevents the stomp machine thread from getting on the CPU. But it's not
protected against a CPU coming online concurrently.

Thanks,

	tglx


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 1/9] smp: Run functions concurrently in smp_call_function_many()
  2019-07-22 18:37     ` Thomas Gleixner
@ 2019-07-22 18:40       ` Nadav Amit
  2019-07-22 18:51         ` Thomas Gleixner
  0 siblings, 1 reply; 45+ messages in thread
From: Nadav Amit @ 2019-07-22 18:40 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen,
	the arch/x86 maintainers, LKML, Ingo Molnar, Rik van Riel,
	Josh Poimboeuf

> On Jul 22, 2019, at 11:37 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> On Mon, 22 Jul 2019, Peter Zijlstra wrote:
> 
>> On Thu, Jul 18, 2019 at 05:58:29PM -0700, Nadav Amit wrote:
>>> +/*
>>> + * Call a function on all processors.  May be used during early boot while
>>> + * early_boot_irqs_disabled is set.
>>> + */
>>> +static inline void on_each_cpu(smp_call_func_t func, void *info, int wait)
>>> +{
>>> +	on_each_cpu_mask(cpu_online_mask, func, info, wait);
>>> +}
>> 
>> I'm thinking that one if buggy, nothing protects online mask here.
> 
> The current implementation has preemption disabled before touching
> cpu_online_mask which at least protects against a CPU going away as that
> prevents the stomp machine thread from getting on the CPU. But it's not
> protected against a CPU coming online concurrently.

I still don’t understand. If you called cpu_online_mask() and did not
disable preemption before calling it, you are already (today) not protected
against another CPU coming online. Disabling preemption in on_each_cpu()
will not solve it.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 1/9] smp: Run functions concurrently in smp_call_function_many()
  2019-07-22 18:16     ` Peter Zijlstra
@ 2019-07-22 18:41       ` Nadav Amit
  2019-07-22 19:34         ` Peter Zijlstra
  0 siblings, 1 reply; 45+ messages in thread
From: Nadav Amit @ 2019-07-22 18:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dave Hansen, Andy Lutomirski, Dave Hansen,
	the arch/x86 maintainers, LKML, Thomas Gleixner, Ingo Molnar,
	Rik van Riel, Josh Poimboeuf

> On Jul 22, 2019, at 11:16 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Fri, Jul 19, 2019 at 11:23:06AM -0700, Dave Hansen wrote:
>> On 7/18/19 5:58 PM, Nadav Amit wrote:
>>> @@ -624,16 +622,11 @@ EXPORT_SYMBOL(on_each_cpu);
>>> void on_each_cpu_mask(const struct cpumask *mask, smp_call_func_t func,
>>> 			void *info, bool wait)
>>> {
>>> -	int cpu = get_cpu();
>>> +	preempt_disable();
>>> 
>>> -	smp_call_function_many(mask, func, info, wait);
>>> -	if (cpumask_test_cpu(cpu, mask)) {
>>> -		unsigned long flags;
>>> -		local_irq_save(flags);
>>> -		func(info);
>>> -		local_irq_restore(flags);
>>> -	}
>>> -	put_cpu();
>>> +	__smp_call_function_many(mask, func, func, info, wait);
>>> +
>>> +	preempt_enable();
>>> }
>> 
>> The get_cpu() was missing it too, but it would be nice to add some
>> comments about why preempt needs to be off.  I was also thinking it
>> might make sense to do:
>> 
>> 	cfd = get_cpu_var(cfd_data);
>> 	__smp_call_function_many(cfd, ...);
>> 	put_cpu_var(cfd_data);
>> 	
>> instead of the explicit preempt_enable/disable(), but I don't feel too
>> strongly about it.
> 
> It is also required for cpu hotplug.

But then smpcfd_dead_cpu() will not respect the “cpu” argument. Do you still
prefer it this way (instead of the current preempt_enable() /
preempt_disable())?

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 1/9] smp: Run functions concurrently in smp_call_function_many()
  2019-07-22 18:40       ` Nadav Amit
@ 2019-07-22 18:51         ` Thomas Gleixner
  2019-07-22 19:02           ` Nadav Amit
  0 siblings, 1 reply; 45+ messages in thread
From: Thomas Gleixner @ 2019-07-22 18:51 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen,
	the arch/x86 maintainers, LKML, Ingo Molnar, Rik van Riel,
	Josh Poimboeuf

[-- Attachment #1: Type: text/plain, Size: 1919 bytes --]

On Mon, 22 Jul 2019, Nadav Amit wrote:
> > On Jul 22, 2019, at 11:37 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > 
> > On Mon, 22 Jul 2019, Peter Zijlstra wrote:
> > 
> >> On Thu, Jul 18, 2019 at 05:58:29PM -0700, Nadav Amit wrote:
> >>> +/*
> >>> + * Call a function on all processors.  May be used during early boot while
> >>> + * early_boot_irqs_disabled is set.
> >>> + */
> >>> +static inline void on_each_cpu(smp_call_func_t func, void *info, int wait)
> >>> +{
> >>> +	on_each_cpu_mask(cpu_online_mask, func, info, wait);
> >>> +}
> >> 
> >> I'm thinking that one if buggy, nothing protects online mask here.
> > 
> > The current implementation has preemption disabled before touching
> > cpu_online_mask which at least protects against a CPU going away as that
> > prevents the stomp machine thread from getting on the CPU. But it's not
> > protected against a CPU coming online concurrently.
> 
> I still don’t understand. If you called cpu_online_mask() and did not
> disable preemption before calling it, you are already (today) not protected
> against another CPU coming online. Disabling preemption in on_each_cpu()
> will not solve it.

Disabling preemption _cannot_ protect against a CPU coming online. It only
can protect against a CPU being offlined.

The current implementation of on_each_cpu() disables preemption _before_
touching cpu_online_mask.

void on_each_cpu(void (*func) (void *info), void *info, int wait)
{
        unsigned long flags;

        preempt_disable();
	smp_call_function(func, info, wait);

smp_call_function() has another preempt_disable as it can be called
separately and it does:

        preempt_disable();
        smp_call_function_many(cpu_online_mask, func, info, wait);

Your new on_each_cpu() implementation does not. So there is a
difference. Whether it matters or not is a different question, but that
needs to be explained and documented.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 1/9] smp: Run functions concurrently in smp_call_function_many()
  2019-07-22 18:51         ` Thomas Gleixner
@ 2019-07-22 19:02           ` Nadav Amit
  2019-07-25 12:36             ` Thomas Gleixner
  0 siblings, 1 reply; 45+ messages in thread
From: Nadav Amit @ 2019-07-22 19:02 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen,
	the arch/x86 maintainers, LKML, Ingo Molnar, Rik van Riel,
	Josh Poimboeuf

> On Jul 22, 2019, at 11:51 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> On Mon, 22 Jul 2019, Nadav Amit wrote:
>>> On Jul 22, 2019, at 11:37 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>> 
>>> On Mon, 22 Jul 2019, Peter Zijlstra wrote:
>>> 
>>>> On Thu, Jul 18, 2019 at 05:58:29PM -0700, Nadav Amit wrote:
>>>>> +/*
>>>>> + * Call a function on all processors.  May be used during early boot while
>>>>> + * early_boot_irqs_disabled is set.
>>>>> + */
>>>>> +static inline void on_each_cpu(smp_call_func_t func, void *info, int wait)
>>>>> +{
>>>>> +	on_each_cpu_mask(cpu_online_mask, func, info, wait);
>>>>> +}
>>>> 
>>>> I'm thinking that one if buggy, nothing protects online mask here.
>>> 
>>> The current implementation has preemption disabled before touching
>>> cpu_online_mask which at least protects against a CPU going away as that
>>> prevents the stomp machine thread from getting on the CPU. But it's not
>>> protected against a CPU coming online concurrently.
>> 
>> I still don’t understand. If you called cpu_online_mask() and did not
>> disable preemption before calling it, you are already (today) not protected
>> against another CPU coming online. Disabling preemption in on_each_cpu()
>> will not solve it.
> 
> Disabling preemption _cannot_ protect against a CPU coming online. It only
> can protect against a CPU being offlined.
> 
> The current implementation of on_each_cpu() disables preemption _before_
> touching cpu_online_mask.
> 
> void on_each_cpu(void (*func) (void *info), void *info, int wait)
> {
>        unsigned long flags;
> 
>        preempt_disable();
> 	smp_call_function(func, info, wait);
> 
> smp_call_function() has another preempt_disable as it can be called
> separately and it does:
> 
>        preempt_disable();
>        smp_call_function_many(cpu_online_mask, func, info, wait);
> 
> Your new on_each_cpu() implementation does not. So there is a
> difference. Whether it matters or not is a different question, but that
> needs to be explained and documented.

Thanks for explaining - so your concern is for CPUs being offlined.

But unless I am missing something: on_each_cpu() calls __on_each_cpu_mask(),
which disables preemption and calls __smp_call_function_many().

Then  __smp_call_function_many() runs:

	cpumask_and(cfd->cpumask, mask, cpu_online_mask);

… before choosing which remote CPUs should run the function. So the only
case that I was missing is if the current CPU goes away and the function is
called locally.

Can it happen? I can add documentation and a debug assertion for this case.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently
  2019-07-19  0:58 ` [PATCH v3 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently Nadav Amit
@ 2019-07-22 19:14   ` Peter Zijlstra
  2019-07-22 19:27     ` Nadav Amit
  2019-07-26  7:28   ` Juergen Gross
  2019-07-31  0:13   ` Michael Kelley
  2 siblings, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2019-07-22 19:14 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andy Lutomirski, Dave Hansen, x86, linux-kernel, Thomas Gleixner,
	Ingo Molnar, K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger,
	Sasha Levin, Borislav Petkov, Juergen Gross, Paolo Bonzini,
	Boris Ostrovsky, linux-hyperv, virtualization, kvm, xen-devel

On Thu, Jul 18, 2019 at 05:58:32PM -0700, Nadav Amit wrote:
> @@ -709,8 +716,9 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
>  	 * doing a speculative memory access.
>  	 */
>  	if (info->freed_tables) {
> -		smp_call_function_many(cpumask, flush_tlb_func_remote,
> -			       (void *)info, 1);
> +		__smp_call_function_many(cpumask, flush_tlb_func_remote,
> +					 flush_tlb_func_local,
> +					 (void *)info, 1);
>  	} else {
>  		/*
>  		 * Although we could have used on_each_cpu_cond_mask(),
> @@ -737,7 +745,8 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
>  			if (tlb_is_not_lazy(cpu))
>  				__cpumask_set_cpu(cpu, cond_cpumask);
>  		}
> -		smp_call_function_many(cond_cpumask, flush_tlb_func_remote,
> +		__smp_call_function_many(cond_cpumask, flush_tlb_func_remote,
> +					 flush_tlb_func_local,
>  					 (void *)info, 1);
>  	}
>  }

Do we really need that _local/_remote distinction? ISTR you had a patch
that frobbed flush_tlb_info into the csd and that gave space
constraints, but I'm not seeing that here (probably a wise, get stuff
merged etc..).

struct __call_single_data {
        struct llist_node          llist;                /*     0     8 */
        smp_call_func_t            func;                 /*     8     8 */
        void *                     info;                 /*    16     8 */
        unsigned int               flags;                /*    24     4 */

        /* size: 32, cachelines: 1, members: 4 */
        /* padding: 4 */
        /* last cacheline: 32 bytes */
};

struct flush_tlb_info {
        struct mm_struct *         mm;                   /*     0     8 */
        long unsigned int          start;                /*     8     8 */
        long unsigned int          end;                  /*    16     8 */
        u64                        new_tlb_gen;          /*    24     8 */
        unsigned int               stride_shift;         /*    32     4 */
        bool                       freed_tables;         /*    36     1 */

        /* size: 40, cachelines: 1, members: 6 */
        /* padding: 3 */
        /* last cacheline: 40 bytes */
};

IIRC what you did was make void *__call_single_data::info the last
member and a union until the full cacheline size (64). Given the above
that would get us 24 bytes for csd, leaving us 40 for that
flush_tlb_info.

But then we can still do something like the below, which doesn't change
things and still gets rid of that dual function crud, simplifying
smp_call_function_many again.

Index: linux-2.6/arch/x86/include/asm/tlbflush.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/tlbflush.h
+++ linux-2.6/arch/x86/include/asm/tlbflush.h
@@ -546,8 +546,9 @@ struct flush_tlb_info {
 	unsigned long		start;
 	unsigned long		end;
 	u64			new_tlb_gen;
-	unsigned int		stride_shift;
-	bool			freed_tables;
+	unsigned int		cpu;
+	unsigned short		stride_shift;
+	unsigned char		freed_tables;
 };
 
 #define local_flush_tlb() __flush_tlb()
Index: linux-2.6/arch/x86/mm/tlb.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/tlb.c
+++ linux-2.6/arch/x86/mm/tlb.c
@@ -659,6 +659,27 @@ static void flush_tlb_func_remote(void *
 	flush_tlb_func_common(f, false, TLB_REMOTE_SHOOTDOWN);
 }
 
+static void flush_tlb_func(void *info)
+{
+	const struct flush_tlb_info *f = info;
+	enum tlb_flush_reason reason = TLB_REMOTE_SHOOTDOWN;
+	bool local = false;
+
+	if (f->cpu == smp_processor_id()) {
+		local = true;
+		reason = (f->mm == NULL) ? TLB_LOCAL_SHOOTDOWN : TLB_LOCAL_MM_SHOOTDOWN;
+	} else {
+		inc_irq_stat(irq_tlb_count);
+
+		if (f->mm && f->mm != this_cpu_read(cpu_tlbstate.loaded_mm))
+			return;
+
+		count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
+	}
+
+	flush_tlb_func_common(f, local, reason);
+}
+
 static bool tlb_is_not_lazy(int cpu)
 {
 	return !per_cpu(cpu_tlbstate_shared.is_lazy, cpu);


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 1/9] smp: Run functions concurrently in smp_call_function_many()
  2019-07-22 18:34     ` Nadav Amit
@ 2019-07-22 19:19       ` Peter Zijlstra
  0 siblings, 0 replies; 45+ messages in thread
From: Peter Zijlstra @ 2019-07-22 19:19 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andy Lutomirski, Dave Hansen, the arch/x86 maintainers, LKML,
	Thomas Gleixner, Ingo Molnar, Rik van Riel, Josh Poimboeuf

On Mon, Jul 22, 2019 at 06:34:22PM +0000, Nadav Amit wrote:
> > On Jul 22, 2019, at 11:21 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > On Thu, Jul 18, 2019 at 05:58:29PM -0700, Nadav Amit wrote:
> >> +/*
> >> + * Call a function on all processors.  May be used during early boot while
> >> + * early_boot_irqs_disabled is set.
> >> + */
> >> +static inline void on_each_cpu(smp_call_func_t func, void *info, int wait)
> >> +{
> >> +	on_each_cpu_mask(cpu_online_mask, func, info, wait);
> >> +}
> > 
> > I'm thinking that one if buggy, nothing protects online mask here.
> 
> on_each_cpu_mask() calls __on_each_cpu_mask() which would disable preemption.
> The mask might change, but anyhow __smp_call_function_many() would “and” it,
> after disabling preemption, with (the potentially updated) cpu_online_mask.

Ah, indeed, as long as we double check the state after disabling
preemption things should be fine.

> What is your concern?

Pavlov reaction to seeing a naked cpu_online_mask :-)


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently
  2019-07-22 19:14   ` Peter Zijlstra
@ 2019-07-22 19:27     ` Nadav Amit
  2019-07-22 19:32       ` Peter Zijlstra
  0 siblings, 1 reply; 45+ messages in thread
From: Nadav Amit @ 2019-07-22 19:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andy Lutomirski, Dave Hansen, the arch/x86 maintainers, LKML,
	Thomas Gleixner, Ingo Molnar, K. Y. Srinivasan, Haiyang Zhang,
	Stephen Hemminger, Sasha Levin, Borislav Petkov, Juergen Gross,
	Paolo Bonzini, Boris Ostrovsky, linux-hyperv, virtualization,
	kvm, xen-devel

> On Jul 22, 2019, at 12:14 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Thu, Jul 18, 2019 at 05:58:32PM -0700, Nadav Amit wrote:
>> @@ -709,8 +716,9 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
>> 	 * doing a speculative memory access.
>> 	 */
>> 	if (info->freed_tables) {
>> -		smp_call_function_many(cpumask, flush_tlb_func_remote,
>> -			       (void *)info, 1);
>> +		__smp_call_function_many(cpumask, flush_tlb_func_remote,
>> +					 flush_tlb_func_local,
>> +					 (void *)info, 1);
>> 	} else {
>> 		/*
>> 		 * Although we could have used on_each_cpu_cond_mask(),
>> @@ -737,7 +745,8 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
>> 			if (tlb_is_not_lazy(cpu))
>> 				__cpumask_set_cpu(cpu, cond_cpumask);
>> 		}
>> -		smp_call_function_many(cond_cpumask, flush_tlb_func_remote,
>> +		__smp_call_function_many(cond_cpumask, flush_tlb_func_remote,
>> +					 flush_tlb_func_local,
>> 					 (void *)info, 1);
>> 	}
>> }
> 
> Do we really need that _local/_remote distinction? ISTR you had a patch
> that frobbed flush_tlb_info into the csd and that gave space
> constraints, but I'm not seeing that here (probably a wise, get stuff
> merged etc..).
> 
> struct __call_single_data {
>        struct llist_node          llist;                /*     0     8 */
>        smp_call_func_t            func;                 /*     8     8 */
>        void *                     info;                 /*    16     8 */
>        unsigned int               flags;                /*    24     4 */
> 
>        /* size: 32, cachelines: 1, members: 4 */
>        /* padding: 4 */
>        /* last cacheline: 32 bytes */
> };
> 
> struct flush_tlb_info {
>        struct mm_struct *         mm;                   /*     0     8 */
>        long unsigned int          start;                /*     8     8 */
>        long unsigned int          end;                  /*    16     8 */
>        u64                        new_tlb_gen;          /*    24     8 */
>        unsigned int               stride_shift;         /*    32     4 */
>        bool                       freed_tables;         /*    36     1 */
> 
>        /* size: 40, cachelines: 1, members: 6 */
>        /* padding: 3 */
>        /* last cacheline: 40 bytes */
> };
> 
> IIRC what you did was make void *__call_single_data::info the last
> member and a union until the full cacheline size (64). Given the above
> that would get us 24 bytes for csd, leaving us 40 for that
> flush_tlb_info.
> 
> But then we can still do something like the below, which doesn't change
> things and still gets rid of that dual function crud, simplifying
> smp_call_function_many again.
> 
> Index: linux-2.6/arch/x86/include/asm/tlbflush.h
> ===================================================================
> --- linux-2.6.orig/arch/x86/include/asm/tlbflush.h
> +++ linux-2.6/arch/x86/include/asm/tlbflush.h
> @@ -546,8 +546,9 @@ struct flush_tlb_info {
> 	unsigned long		start;
> 	unsigned long		end;
> 	u64			new_tlb_gen;
> -	unsigned int		stride_shift;
> -	bool			freed_tables;
> +	unsigned int		cpu;
> +	unsigned short		stride_shift;
> +	unsigned char		freed_tables;
> };
> 
> #define local_flush_tlb() __flush_tlb()
> Index: linux-2.6/arch/x86/mm/tlb.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/mm/tlb.c
> +++ linux-2.6/arch/x86/mm/tlb.c
> @@ -659,6 +659,27 @@ static void flush_tlb_func_remote(void *
> 	flush_tlb_func_common(f, false, TLB_REMOTE_SHOOTDOWN);
> }
> 
> +static void flush_tlb_func(void *info)
> +{
> +	const struct flush_tlb_info *f = info;
> +	enum tlb_flush_reason reason = TLB_REMOTE_SHOOTDOWN;
> +	bool local = false;
> +
> +	if (f->cpu == smp_processor_id()) {
> +		local = true;
> +		reason = (f->mm == NULL) ? TLB_LOCAL_SHOOTDOWN : TLB_LOCAL_MM_SHOOTDOWN;
> +	} else {
> +		inc_irq_stat(irq_tlb_count);
> +
> +		if (f->mm && f->mm != this_cpu_read(cpu_tlbstate.loaded_mm))
> +			return;
> +
> +		count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
> +	}
> +
> +	flush_tlb_func_common(f, local, reason);
> +}
> +
> static bool tlb_is_not_lazy(int cpu)
> {
> 	return !per_cpu(cpu_tlbstate_shared.is_lazy, cpu);

Nice! I will add it on top, if you don’t mind (instead squashing it).

The original decision to have local/remote functions was mostly to provide
the generality.

I would change the last argument of __smp_call_function_many() from “wait”
to “flags” that would indicate whether to run the function locally, since I
don’t want to change the semantics of smp_call_function_many() and decide
whether to run the function locally purely based on the mask. Let me know if
you disagree.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently
  2019-07-22 19:27     ` Nadav Amit
@ 2019-07-22 19:32       ` Peter Zijlstra
  0 siblings, 0 replies; 45+ messages in thread
From: Peter Zijlstra @ 2019-07-22 19:32 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andy Lutomirski, Dave Hansen, the arch/x86 maintainers, LKML,
	Thomas Gleixner, Ingo Molnar, K. Y. Srinivasan, Haiyang Zhang,
	Stephen Hemminger, Sasha Levin, Borislav Petkov, Juergen Gross,
	Paolo Bonzini, Boris Ostrovsky, linux-hyperv, virtualization,
	kvm, xen-devel

On Mon, Jul 22, 2019 at 07:27:09PM +0000, Nadav Amit wrote:
> > On Jul 22, 2019, at 12:14 PM, Peter Zijlstra <peterz@infradead.org> wrote:

> > But then we can still do something like the below, which doesn't change
> > things and still gets rid of that dual function crud, simplifying
> > smp_call_function_many again.

> Nice! I will add it on top, if you don’t mind (instead squashing it).

Not at all.

> The original decision to have local/remote functions was mostly to provide
> the generality.
> 
> I would change the last argument of __smp_call_function_many() from “wait”
> to “flags” that would indicate whether to run the function locally, since I
> don’t want to change the semantics of smp_call_function_many() and decide
> whether to run the function locally purely based on the mask. Let me know if
> you disagree.

Agreed.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 1/9] smp: Run functions concurrently in smp_call_function_many()
  2019-07-22 18:41       ` Nadav Amit
@ 2019-07-22 19:34         ` Peter Zijlstra
  0 siblings, 0 replies; 45+ messages in thread
From: Peter Zijlstra @ 2019-07-22 19:34 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Dave Hansen, Andy Lutomirski, Dave Hansen,
	the arch/x86 maintainers, LKML, Thomas Gleixner, Ingo Molnar,
	Rik van Riel, Josh Poimboeuf

On Mon, Jul 22, 2019 at 06:41:44PM +0000, Nadav Amit wrote:
> > On Jul 22, 2019, at 11:16 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > On Fri, Jul 19, 2019 at 11:23:06AM -0700, Dave Hansen wrote:
> >> On 7/18/19 5:58 PM, Nadav Amit wrote:
> >>> @@ -624,16 +622,11 @@ EXPORT_SYMBOL(on_each_cpu);
> >>> void on_each_cpu_mask(const struct cpumask *mask, smp_call_func_t func,
> >>> 			void *info, bool wait)
> >>> {
> >>> -	int cpu = get_cpu();
> >>> +	preempt_disable();
> >>> 
> >>> -	smp_call_function_many(mask, func, info, wait);
> >>> -	if (cpumask_test_cpu(cpu, mask)) {
> >>> -		unsigned long flags;
> >>> -		local_irq_save(flags);
> >>> -		func(info);
> >>> -		local_irq_restore(flags);
> >>> -	}
> >>> -	put_cpu();
> >>> +	__smp_call_function_many(mask, func, func, info, wait);
> >>> +
> >>> +	preempt_enable();
> >>> }
> >> 
> >> The get_cpu() was missing it too, but it would be nice to add some
> >> comments about why preempt needs to be off.  I was also thinking it
> >> might make sense to do:
> >> 
> >> 	cfd = get_cpu_var(cfd_data);
> >> 	__smp_call_function_many(cfd, ...);
> >> 	put_cpu_var(cfd_data);
> >> 	
> >> instead of the explicit preempt_enable/disable(), but I don't feel too
> >> strongly about it.
> > 
> > It is also required for cpu hotplug.
> 
> But then smpcfd_dead_cpu() will not respect the “cpu” argument. Do you still
> prefer it this way (instead of the current preempt_enable() /
> preempt_disable())?

I just meant that the preempt_disable() (either form) is required for
hotplug (we must not send IPIs to offline CPUs, that gets things upset).

Personally I don't mind the bare preempt_disable() as you have; but I
think Dave's idea of a comment has merrit.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 3/9] x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy()
  2019-07-19  0:58 ` [PATCH v3 3/9] x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy() Nadav Amit
  2019-07-19 18:36   ` Dave Hansen
  2019-07-22 18:27   ` Peter Zijlstra
@ 2019-07-22 19:47   ` Rasmus Villemoes
  2019-07-22 19:51     ` Nadav Amit
  2 siblings, 1 reply; 45+ messages in thread
From: Rasmus Villemoes @ 2019-07-22 19:47 UTC (permalink / raw)
  To: Nadav Amit, Andy Lutomirski, Dave Hansen
  Cc: x86, linux-kernel, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Rik van Riel, Josh Poimboeuf, Joe Perches

On 19/07/2019 02.58, Nadav Amit wrote:

>  /*
> @@ -865,7 +893,7 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
>  	if (cpumask_test_cpu(cpu, &batch->cpumask)) {
>  		lockdep_assert_irqs_enabled();
>  		local_irq_disable();
> -		flush_tlb_func_local(&full_flush_tlb_info);
> +		flush_tlb_func_local((void *)&full_flush_tlb_info);
>  		local_irq_enable();
>  	}

I think the confusion could be cleared up if you moved this hunk to
patch 2 where it belongs - i.e. where you change the prototype of
flush_tlb_func_local() and hence introduce the warning.

Rasmus

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 3/9] x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy()
  2019-07-22 19:47   ` Rasmus Villemoes
@ 2019-07-22 19:51     ` Nadav Amit
  0 siblings, 0 replies; 45+ messages in thread
From: Nadav Amit @ 2019-07-22 19:51 UTC (permalink / raw)
  To: Rasmus Villemoes
  Cc: Andy Lutomirski, Dave Hansen, the arch/x86 maintainers, LKML,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Rik van Riel,
	Josh Poimboeuf, Joe Perches

> On Jul 22, 2019, at 12:47 PM, Rasmus Villemoes <linux@rasmusvillemoes.dk> wrote:
> 
> On 19/07/2019 02.58, Nadav Amit wrote:
> 
>> /*
>> @@ -865,7 +893,7 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
>> 	if (cpumask_test_cpu(cpu, &batch->cpumask)) {
>> 		lockdep_assert_irqs_enabled();
>> 		local_irq_disable();
>> -		flush_tlb_func_local(&full_flush_tlb_info);
>> +		flush_tlb_func_local((void *)&full_flush_tlb_info);
>> 		local_irq_enable();
>> 	}
> 
> I think the confusion could be cleared up if you moved this hunk to
> patch 2 where it belongs - i.e. where you change the prototype of
> flush_tlb_func_local() and hence introduce the warning.

Yes, there is a small mess here - the constification should actually go
to a different patch… I’ll fix it.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 1/9] smp: Run functions concurrently in smp_call_function_many()
  2019-07-22 19:02           ` Nadav Amit
@ 2019-07-25 12:36             ` Thomas Gleixner
  2019-07-25 19:10               ` Nadav Amit
  0 siblings, 1 reply; 45+ messages in thread
From: Thomas Gleixner @ 2019-07-25 12:36 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen,
	the arch/x86 maintainers, LKML, Ingo Molnar, Rik van Riel,
	Josh Poimboeuf

[-- Attachment #1: Type: text/plain, Size: 1718 bytes --]

On Mon, 22 Jul 2019, Nadav Amit wrote:
> > On Jul 22, 2019, at 11:51 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > void on_each_cpu(void (*func) (void *info), void *info, int wait)
> > {
> >        unsigned long flags;
> > 
> >        preempt_disable();
> > 	smp_call_function(func, info, wait);
> > 
> > smp_call_function() has another preempt_disable as it can be called
> > separately and it does:
> > 
> >        preempt_disable();
> >        smp_call_function_many(cpu_online_mask, func, info, wait);
> > 
> > Your new on_each_cpu() implementation does not. So there is a
> > difference. Whether it matters or not is a different question, but that
> > needs to be explained and documented.
> 
> Thanks for explaining - so your concern is for CPUs being offlined.
> 
> But unless I am missing something: on_each_cpu() calls __on_each_cpu_mask(),
> which disables preemption and calls __smp_call_function_many().
> 
> Then  __smp_call_function_many() runs:
> 
> 	cpumask_and(cfd->cpumask, mask, cpu_online_mask);
> 
> … before choosing which remote CPUs should run the function. So the only
> case that I was missing is if the current CPU goes away and the function is
> called locally.
>
> Can it happen? I can add documentation and a debug assertion for this case.

I don't think it can happen:

  on_each_cpu()
    on_each_cpu_mask(....)
      preempt_disable()
        __smp_call_function_many()

So if a CPU goes offline between on_each_cpu() and preempt_disable() then
there is no damage. After the preempt_disable() it can't go away anymore
and the task executing this cannot be migrated either.

So yes, it's safe, but please add a big fat comment so future readers won't
be puzzled.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 1/9] smp: Run functions concurrently in smp_call_function_many()
  2019-07-25 12:36             ` Thomas Gleixner
@ 2019-07-25 19:10               ` Nadav Amit
  0 siblings, 0 replies; 45+ messages in thread
From: Nadav Amit @ 2019-07-25 19:10 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen,
	the arch/x86 maintainers, LKML, Ingo Molnar, Rik van Riel,
	Josh Poimboeuf

> On Jul 25, 2019, at 5:36 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> On Mon, 22 Jul 2019, Nadav Amit wrote:
>>> On Jul 22, 2019, at 11:51 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>> void on_each_cpu(void (*func) (void *info), void *info, int wait)
>>> {
>>>       unsigned long flags;
>>> 
>>>       preempt_disable();
>>> 	smp_call_function(func, info, wait);
>>> 
>>> smp_call_function() has another preempt_disable as it can be called
>>> separately and it does:
>>> 
>>>       preempt_disable();
>>>       smp_call_function_many(cpu_online_mask, func, info, wait);
>>> 
>>> Your new on_each_cpu() implementation does not. So there is a
>>> difference. Whether it matters or not is a different question, but that
>>> needs to be explained and documented.
>> 
>> Thanks for explaining - so your concern is for CPUs being offlined.
>> 
>> But unless I am missing something: on_each_cpu() calls __on_each_cpu_mask(),
>> which disables preemption and calls __smp_call_function_many().
>> 
>> Then  __smp_call_function_many() runs:
>> 
>> 	cpumask_and(cfd->cpumask, mask, cpu_online_mask);
>> 
>> … before choosing which remote CPUs should run the function. So the only
>> case that I was missing is if the current CPU goes away and the function is
>> called locally.
>> 
>> Can it happen? I can add documentation and a debug assertion for this case.
> 
> I don't think it can happen:
> 
>  on_each_cpu()
>    on_each_cpu_mask(....)
>      preempt_disable()
>        __smp_call_function_many()
> 
> So if a CPU goes offline between on_each_cpu() and preempt_disable() then
> there is no damage. After the preempt_disable() it can't go away anymore
> and the task executing this cannot be migrated either.
> 
> So yes, it's safe, but please add a big fat comment so future readers won't
> be puzzled.

I will do. I will need some more time to respin the next version. I see that
what I build on top of it might require some changes, and I want to minimize
them.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently
  2019-07-19  0:58 ` [PATCH v3 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently Nadav Amit
  2019-07-22 19:14   ` Peter Zijlstra
@ 2019-07-26  7:28   ` Juergen Gross
  2019-07-31  0:13   ` Michael Kelley
  2 siblings, 0 replies; 45+ messages in thread
From: Juergen Gross @ 2019-07-26  7:28 UTC (permalink / raw)
  To: Nadav Amit, Andy Lutomirski, Dave Hansen
  Cc: Borislav Petkov, Peter Zijlstra, Sasha Levin, x86,
	Thomas Gleixner, virtualization, xen-devel, Haiyang Zhang,
	K. Y. Srinivasan, Stephen Hemminger, Boris Ostrovsky,
	Ingo Molnar, Paolo Bonzini, kvm, linux-hyperv, linux-kernel

On 19.07.19 02:58, Nadav Amit wrote:
> To improve TLB shootdown performance, flush the remote and local TLBs
> concurrently. Introduce flush_tlb_multi() that does so. Introduce
> paravirtual versions of flush_tlb_multi() for KVM, Xen and hyper-v (Xen
> and hyper-v are only compile-tested).
> 
> While the updated smp infrastructure is capable of running a function on
> a single local core, it is not optimized for this case. The multiple
> function calls and the indirect branch introduce some overhead, and
> might make local TLB flushes slower than they were before the recent
> changes.
> 
> Before calling the SMP infrastructure, check if only a local TLB flush
> is needed to restore the lost performance in this common case. This
> requires to check mm_cpumask() one more time, but unless this mask is
> updated very frequently, this should impact performance negatively.
> 
> Cc: "K. Y. Srinivasan" <kys@microsoft.com>
> Cc: Haiyang Zhang <haiyangz@microsoft.com>
> Cc: Stephen Hemminger <sthemmin@microsoft.com>
> Cc: Sasha Levin <sashal@kernel.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: x86@kernel.org
> Cc: Juergen Gross <jgross@suse.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
> Cc: linux-hyperv@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: virtualization@lists.linux-foundation.org
> Cc: kvm@vger.kernel.org
> Cc: xen-devel@lists.xenproject.org
> Signed-off-by: Nadav Amit <namit@vmware.com>
> ---
>   arch/x86/hyperv/mmu.c                 | 10 +++---
>   arch/x86/include/asm/paravirt.h       |  6 ++--
>   arch/x86/include/asm/paravirt_types.h |  4 +--
>   arch/x86/include/asm/tlbflush.h       |  8 ++---
>   arch/x86/include/asm/trace/hyperv.h   |  2 +-
>   arch/x86/kernel/kvm.c                 | 11 +++++--
>   arch/x86/kernel/paravirt.c            |  2 +-
>   arch/x86/mm/tlb.c                     | 47 ++++++++++++++++++---------
>   arch/x86/xen/mmu_pv.c                 | 11 +++----
>   include/trace/events/xen.h            |  2 +-
>   10 files changed, 62 insertions(+), 41 deletions(-)

Xen and paravirt parts: Reviewed-by: Juergen Gross <jgross@suse.com>


Juergen

^ permalink raw reply	[flat|nested] 45+ messages in thread

* RE: [PATCH v3 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently
  2019-07-19  0:58 ` [PATCH v3 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently Nadav Amit
  2019-07-22 19:14   ` Peter Zijlstra
  2019-07-26  7:28   ` Juergen Gross
@ 2019-07-31  0:13   ` Michael Kelley
  2 siblings, 0 replies; 45+ messages in thread
From: Michael Kelley @ 2019-07-31  0:13 UTC (permalink / raw)
  To: Nadav Amit, Andy Lutomirski, Dave Hansen
  Cc: x86, linux-kernel, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	KY Srinivasan, Haiyang Zhang, Stephen Hemminger, Sasha Levin,
	Borislav Petkov, Juergen Gross, Paolo Bonzini, Boris Ostrovsky,
	linux-hyperv, virtualization, kvm, xen-devel

From: Nadav Amit <namit@vmware.com> Sent: Thursday, July 18, 2019 5:59 PM
> 
> To improve TLB shootdown performance, flush the remote and local TLBs
> concurrently. Introduce flush_tlb_multi() that does so. Introduce
> paravirtual versions of flush_tlb_multi() for KVM, Xen and hyper-v (Xen
> and hyper-v are only compile-tested).
> 
> While the updated smp infrastructure is capable of running a function on
> a single local core, it is not optimized for this case. The multiple
> function calls and the indirect branch introduce some overhead, and
> might make local TLB flushes slower than they were before the recent
> changes.
> 
> Before calling the SMP infrastructure, check if only a local TLB flush
> is needed to restore the lost performance in this common case. This
> requires to check mm_cpumask() one more time, but unless this mask is
> updated very frequently, this should impact performance negatively.
> 
> Cc: "K. Y. Srinivasan" <kys@microsoft.com>
> Cc: Haiyang Zhang <haiyangz@microsoft.com>
> Cc: Stephen Hemminger <sthemmin@microsoft.com>
> Cc: Sasha Levin <sashal@kernel.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: x86@kernel.org
> Cc: Juergen Gross <jgross@suse.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
> Cc: linux-hyperv@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: virtualization@lists.linux-foundation.org
> Cc: kvm@vger.kernel.org
> Cc: xen-devel@lists.xenproject.org
> Signed-off-by: Nadav Amit <namit@vmware.com>
> ---
>  arch/x86/hyperv/mmu.c                 | 10 +++---
>  arch/x86/include/asm/paravirt.h       |  6 ++--
>  arch/x86/include/asm/paravirt_types.h |  4 +--
>  arch/x86/include/asm/tlbflush.h       |  8 ++---
>  arch/x86/include/asm/trace/hyperv.h   |  2 +-
>  arch/x86/kernel/kvm.c                 | 11 +++++--
>  arch/x86/kernel/paravirt.c            |  2 +-
>  arch/x86/mm/tlb.c                     | 47 ++++++++++++++++++---------
>  arch/x86/xen/mmu_pv.c                 | 11 +++----
>  include/trace/events/xen.h            |  2 +-
>  10 files changed, 62 insertions(+), 41 deletions(-)
> 

For the Hyper-V parts --
Reviewed-by: Michael Kelley <mikelley@microsoft.com>


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 8/9] x86/mm/tlb: Remove UV special case
  2019-07-19  2:25   ` Mike Travis
  2019-07-19  4:58     ` Nadav Amit
@ 2019-07-31  3:11     ` Nadav Amit
  1 sibling, 0 replies; 45+ messages in thread
From: Nadav Amit @ 2019-07-31  3:11 UTC (permalink / raw)
  To: Mike Travis
  Cc: Andy Lutomirski, Dave Hansen, x86, linux-kernel, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Dave Hansen

> On Jul 18, 2019, at 7:25 PM, Mike Travis <mike.travis@hpe.com> wrote:
> 
> It is a fact that the UV is still the UV and SGI is now part of HPE. The
> current external product is known as SuperDome Flex. It is both up to date
> as well as very well maintained. The ACK I provided was an okay to change
> the code, but please make the description accurate.

Looking at the code again, if I remove the call to uv_flush_tlb_others(), is
there any use for tlb_uv.c?


^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2019-07-31  3:11 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-07-19  0:58 [PATCH v3 0/9] x86: Concurrent TLB flushes Nadav Amit
2019-07-19  0:58 ` [PATCH v3 1/9] smp: Run functions concurrently in smp_call_function_many() Nadav Amit
2019-07-19 18:23   ` Dave Hansen
2019-07-22 18:16     ` Peter Zijlstra
2019-07-22 18:41       ` Nadav Amit
2019-07-22 19:34         ` Peter Zijlstra
2019-07-22 18:21   ` Peter Zijlstra
2019-07-22 18:34     ` Nadav Amit
2019-07-22 19:19       ` Peter Zijlstra
2019-07-22 18:37     ` Thomas Gleixner
2019-07-22 18:40       ` Nadav Amit
2019-07-22 18:51         ` Thomas Gleixner
2019-07-22 19:02           ` Nadav Amit
2019-07-25 12:36             ` Thomas Gleixner
2019-07-25 19:10               ` Nadav Amit
2019-07-19  0:58 ` [PATCH v3 2/9] x86/mm/tlb: Remove reason as argument for flush_tlb_func_local() Nadav Amit
2019-07-19  0:58 ` [PATCH v3 3/9] x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy() Nadav Amit
2019-07-19 18:36   ` Dave Hansen
2019-07-19 18:41     ` Nadav Amit
2019-07-19 22:44       ` Joe Perches
2019-07-19 23:02         ` Nadav Amit
2019-07-22 18:27   ` Peter Zijlstra
2019-07-22 19:47   ` Rasmus Villemoes
2019-07-22 19:51     ` Nadav Amit
2019-07-19  0:58 ` [PATCH v3 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently Nadav Amit
2019-07-22 19:14   ` Peter Zijlstra
2019-07-22 19:27     ` Nadav Amit
2019-07-22 19:32       ` Peter Zijlstra
2019-07-26  7:28   ` Juergen Gross
2019-07-31  0:13   ` Michael Kelley
2019-07-19  0:58 ` [PATCH v3 5/9] x86/mm/tlb: Privatize cpu_tlbstate Nadav Amit
2019-07-19 18:38   ` Dave Hansen
2019-07-19 18:43     ` Nadav Amit
2019-07-19 18:48       ` Dave Hansen
2019-07-19 18:54         ` Nadav Amit
2019-07-20 13:58           ` Andy Lutomirski
2019-07-21 20:21     ` Nadav Amit
2019-07-19  0:58 ` [PATCH v3 6/9] x86/mm/tlb: Do not make is_lazy dirty for no reason Nadav Amit
2019-07-19  0:58 ` [PATCH v3 7/9] cpumask: Mark functions as pure Nadav Amit
2019-07-19  0:58 ` [PATCH v3 8/9] x86/mm/tlb: Remove UV special case Nadav Amit
2019-07-19  2:25   ` Mike Travis
2019-07-19  4:58     ` Nadav Amit
2019-07-31  3:11     ` Nadav Amit
2019-07-19  0:58 ` [PATCH v3 9/9] x86/mm/tlb: Remove unnecessary uses of the inline keyword Nadav Amit
2019-07-19 21:36 ` [PATCH v3 0/9] x86: Concurrent TLB flushes Dave Hansen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).