linux-hyperv.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 0/9] x86: Concurrent TLB flushes
@ 2019-07-02 23:51 Nadav Amit
  2019-07-02 23:51 ` [PATCH v2 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently Nadav Amit
  0 siblings, 1 reply; 6+ messages in thread
From: Nadav Amit @ 2019-07-02 23:51 UTC (permalink / raw)
  To: Andy Lutomirski, Dave Hansen
  Cc: x86, linux-kernel, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Nadav Amit, Borislav Petkov, Boris Ostrovsky, Haiyang Zhang,
	Josh Poimboeuf, Juergen Gross, K. Y. Srinivasan, Paolo Bonzini,
	Rik van Riel, Sasha Levin, Stephen Hemminger, kvm, linux-hyperv,
	virtualization, xen-devel

Currently, local and remote TLB flushes are not performed concurrently,
which introduces unnecessary overhead - each INVLPG can take 100s of
cycles. This patch-set allows TLB flushes to be run concurrently: first
request the remote CPUs to initiate the flush, then run it locally, and
finally wait for the remote CPUs to finish their work.

In addition, there are various small optimizations to avoid unwarranted
false-sharing and atomic operations.

The proposed changes should also improve the performance of other
invocations of on_each_cpu(). Hopefully, no one has relied on this
behavior of on_each_cpu() that invoked functions first remotely and only
then locally [Peter says he remembers someone might do so, but without
further information it is hard to know how to address it].

Running sysbench on dax/ext4 w/emulated-pmem, write-cache disabled on
2-socket, 48-logical-cores (24+SMT) Haswell-X, 5 repetitions:

 sysbench fileio --file-total-size=3G --file-test-mode=rndwr \
  --file-io-mode=mmap --threads=X --file-fsync-mode=fdatasync run

  Th.	tip-jun28 avg (stdev)	+patch-set avg (stdev)	change
  ---	---------------------	----------------------	------
  1	1267765 (14146)		1299253 (5715)		+2.4%
  2	1734644 (11936)		1799225 (19577)		+3.7%
  4	2821268 (41184)		2919132 (40149)		+3.4%
  8	4171652 (31243)		4376925 (65416)		+4.9%
  16	5590729 (24160)		5829866 (8127)		+4.2%
  24	6250212 (24481)		6522303 (28044)		+4.3%
  32	3994314 (26606)		4077543 (10685)		+2.0%
  48	4345177 (28091)		4417821 (41337)		+1.6%

(Note that on configurations with up to 24 threads numactl was used to
set all threads on socket 1, which explains the drop in performance when
going to 32 threads).

Running the same benchmark with security mitigations disabled (PTI,
Spectre, MDS):

  Th.	tip-jun28 avg (stdev)	+patch-set avg (stdev)	change
  ---	---------------------	----------------------	------
  1	1598896 (5174)		1607903 (4091)		+0.5%
  2	2109472 (17827)		2224726 (4372)		+5.4%
  4	3448587 (11952)		3668551 (30219)		+6.3%
  8	5425778 (29641)		5606266 (33519)		+3.3%
  16	6931232 (34677)		7054052 (27873)		+1.7%
  24	7612473 (23482)		7783138 (13871)		+2.2%
  32	4296274 (18029)		4283279 (32323)		-0.3%
  48	4770029 (35541)		4764760 (13575)		-0.1%

Presumably, PTI requires two invalidations of each mapping, which allows
to get higher benefits from concurrency when PTI is on. At the same
time, when mitigations are on, other overheads reduce the potential
speedup.

I tried to reduce the size of the code of the main patch, which required
restructuring of the series.

v1 -> v2:
* Removing the patches that Thomas took [tglx]
* Adding hyper-v, Xen compile-tested implementations [Dave]
* Removing UV [Andy]
* Adding lazy optimization, removing inline keyword [Dave]
* Restructuring patch-set

RFCv2 -> v1:
* Fix comment on flush_tlb_multi [Juergen]
* Removing async invalidation optimizations [Andy]
* Adding KVM support [Paolo]

Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kvm@vger.kernel.org
Cc: linux-hyperv@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Cc: x86@kernel.org
Cc: xen-devel@lists.xenproject.org

Nadav Amit (9):
  smp: Run functions concurrently in smp_call_function_many()
  x86/mm/tlb: Remove reason as argument for flush_tlb_func_local()
  x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy()
  x86/mm/tlb: Flush remote and local TLBs concurrently
  x86/mm/tlb: Privatize cpu_tlbstate
  x86/mm/tlb: Do not make is_lazy dirty for no reason
  cpumask: Mark functions as pure
  x86/mm/tlb: Remove UV special case
  x86/mm/tlb: Remove unnecessary uses of the inline keyword

 arch/x86/hyperv/mmu.c                 |  13 ++-
 arch/x86/include/asm/paravirt.h       |   6 +-
 arch/x86/include/asm/paravirt_types.h |   4 +-
 arch/x86/include/asm/tlbflush.h       |  48 +++++----
 arch/x86/include/asm/trace/hyperv.h   |   2 +-
 arch/x86/kernel/kvm.c                 |  11 +-
 arch/x86/kernel/paravirt.c            |   2 +-
 arch/x86/mm/init.c                    |   2 +-
 arch/x86/mm/tlb.c                     | 147 ++++++++++++++++----------
 arch/x86/xen/mmu_pv.c                 |  20 ++--
 include/linux/cpumask.h               |   6 +-
 include/linux/smp.h                   |  27 +++--
 include/trace/events/xen.h            |   2 +-
 kernel/smp.c                          | 133 +++++++++++------------
 14 files changed, 245 insertions(+), 178 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH v2 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently
  2019-07-02 23:51 [PATCH v2 0/9] x86: Concurrent TLB flushes Nadav Amit
@ 2019-07-02 23:51 ` Nadav Amit
  2019-07-03 14:04   ` Juergen Gross
  0 siblings, 1 reply; 6+ messages in thread
From: Nadav Amit @ 2019-07-02 23:51 UTC (permalink / raw)
  To: Andy Lutomirski, Dave Hansen
  Cc: x86, linux-kernel, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Nadav Amit, K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger,
	Sasha Levin, Borislav Petkov, Juergen Gross, Paolo Bonzini,
	Boris Ostrovsky, linux-hyperv, virtualization, kvm, xen-devel

To improve TLB shootdown performance, flush the remote and local TLBs
concurrently. Introduce flush_tlb_multi() that does so. Introduce
paravirtual versions of flush_tlb_multi() for KVM, Xen and hyper-v (Xen
and hyper-v are only compile-tested).

While the updated smp infrastructure is capable of running a function on
a single local core, it is not optimized for this case. The multiple
function calls and the indirect branch introduce some overhead, and
might make local TLB flushes slower than they were before the recent
changes.

Before calling the SMP infrastructure, check if only a local TLB flush
is needed to restore the lost performance in this common case. This
requires to check mm_cpumask() one more time, but unless this mask is
updated very frequently, this should impact performance negatively.

Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: x86@kernel.org
Cc: Juergen Gross <jgross@suse.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: linux-hyperv@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Cc: kvm@vger.kernel.org
Cc: xen-devel@lists.xenproject.org
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/hyperv/mmu.c                 | 13 +++---
 arch/x86/include/asm/paravirt.h       |  6 +--
 arch/x86/include/asm/paravirt_types.h |  4 +-
 arch/x86/include/asm/tlbflush.h       |  9 ++--
 arch/x86/include/asm/trace/hyperv.h   |  2 +-
 arch/x86/kernel/kvm.c                 | 11 +++--
 arch/x86/kernel/paravirt.c            |  2 +-
 arch/x86/mm/tlb.c                     | 65 ++++++++++++++++++++-------
 arch/x86/xen/mmu_pv.c                 | 20 ++++++---
 include/trace/events/xen.h            |  2 +-
 10 files changed, 91 insertions(+), 43 deletions(-)

diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
index e65d7fe6489f..1177f863e4cd 100644
--- a/arch/x86/hyperv/mmu.c
+++ b/arch/x86/hyperv/mmu.c
@@ -50,8 +50,8 @@ static inline int fill_gva_list(u64 gva_list[], int offset,
 	return gva_n - offset;
 }
 
-static void hyperv_flush_tlb_others(const struct cpumask *cpus,
-				    const struct flush_tlb_info *info)
+static void hyperv_flush_tlb_multi(const struct cpumask *cpus,
+				   const struct flush_tlb_info *info)
 {
 	int cpu, vcpu, gva_n, max_gvas;
 	struct hv_tlb_flush **flush_pcpu;
@@ -59,7 +59,7 @@ static void hyperv_flush_tlb_others(const struct cpumask *cpus,
 	u64 status = U64_MAX;
 	unsigned long flags;
 
-	trace_hyperv_mmu_flush_tlb_others(cpus, info);
+	trace_hyperv_mmu_flush_tlb_multi(cpus, info);
 
 	if (!hv_hypercall_pg)
 		goto do_native;
@@ -69,6 +69,9 @@ static void hyperv_flush_tlb_others(const struct cpumask *cpus,
 
 	local_irq_save(flags);
 
+	if (cpumask_test_cpu(smp_processor_id(), cpus))
+		flush_tlb_func_local(info);
+
 	flush_pcpu = (struct hv_tlb_flush **)
 		     this_cpu_ptr(hyperv_pcpu_input_arg);
 
@@ -156,7 +159,7 @@ static void hyperv_flush_tlb_others(const struct cpumask *cpus,
 	if (!(status & HV_HYPERCALL_RESULT_MASK))
 		return;
 do_native:
-	native_flush_tlb_others(cpus, info);
+	native_flush_tlb_multi(cpus, info);
 }
 
 static u64 hyperv_flush_tlb_others_ex(const struct cpumask *cpus,
@@ -231,6 +234,6 @@ void hyperv_setup_mmu_ops(void)
 		return;
 
 	pr_info("Using hypercall for remote TLB flush\n");
-	pv_ops.mmu.flush_tlb_others = hyperv_flush_tlb_others;
+	pv_ops.mmu.flush_tlb_multi = hyperv_flush_tlb_multi;
 	pv_ops.mmu.tlb_remove_table = tlb_remove_table;
 }
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index c25c38a05c1c..316959e89258 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -62,10 +62,10 @@ static inline void __flush_tlb_one_user(unsigned long addr)
 	PVOP_VCALL1(mmu.flush_tlb_one_user, addr);
 }
 
-static inline void flush_tlb_others(const struct cpumask *cpumask,
-				    const struct flush_tlb_info *info)
+static inline void flush_tlb_multi(const struct cpumask *cpumask,
+				   const struct flush_tlb_info *info)
 {
-	PVOP_VCALL2(mmu.flush_tlb_others, cpumask, info);
+	PVOP_VCALL2(mmu.flush_tlb_multi, cpumask, info);
 }
 
 static inline void paravirt_tlb_remove_table(struct mmu_gather *tlb, void *table)
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 946f8f1f1efc..54f4c718b5b0 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -211,8 +211,8 @@ struct pv_mmu_ops {
 	void (*flush_tlb_user)(void);
 	void (*flush_tlb_kernel)(void);
 	void (*flush_tlb_one_user)(unsigned long addr);
-	void (*flush_tlb_others)(const struct cpumask *cpus,
-				 const struct flush_tlb_info *info);
+	void (*flush_tlb_multi)(const struct cpumask *cpus,
+				const struct flush_tlb_info *info);
 
 	void (*tlb_remove_table)(struct mmu_gather *tlb, void *table);
 
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index dee375831962..36aa2a9b7597 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -517,7 +517,7 @@ static inline void __flush_tlb_one_kernel(unsigned long addr)
  *  - flush_tlb_page(vma, vmaddr) flushes one page
  *  - flush_tlb_range(vma, start, end) flushes a range of pages
  *  - flush_tlb_kernel_range(start, end) flushes a range of kernel pages
- *  - flush_tlb_others(cpumask, info) flushes TLBs on other cpus
+ *  - flush_tlb_multi(cpumask, info) flushes TLBs on multiple cpus
  *
  * ..but the i386 has somewhat limited tlb flushing capabilities,
  * and page-granular flushes are available only on i486 and up.
@@ -563,13 +563,14 @@ extern void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 				unsigned long end, unsigned int stride_shift,
 				bool freed_tables);
 extern void flush_tlb_kernel_range(unsigned long start, unsigned long end);
+extern void flush_tlb_func_local(const struct flush_tlb_info *info);
 
 static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
 {
 	flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, PAGE_SHIFT, false);
 }
 
-void native_flush_tlb_others(const struct cpumask *cpumask,
+void native_flush_tlb_multi(const struct cpumask *cpumask,
 			     const struct flush_tlb_info *info);
 
 static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
@@ -593,8 +594,8 @@ static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
 extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
 
 #ifndef CONFIG_PARAVIRT
-#define flush_tlb_others(mask, info)	\
-	native_flush_tlb_others(mask, info)
+#define flush_tlb_multi(mask, info)	\
+	native_flush_tlb_multi(mask, info)
 
 #define paravirt_tlb_remove_table(tlb, page) \
 	tlb_remove_page(tlb, (void *)(page))
diff --git a/arch/x86/include/asm/trace/hyperv.h b/arch/x86/include/asm/trace/hyperv.h
index ace464f09681..85ca8560c7f9 100644
--- a/arch/x86/include/asm/trace/hyperv.h
+++ b/arch/x86/include/asm/trace/hyperv.h
@@ -8,7 +8,7 @@
 
 #if IS_ENABLED(CONFIG_HYPERV)
 
-TRACE_EVENT(hyperv_mmu_flush_tlb_others,
+TRACE_EVENT(hyperv_mmu_flush_tlb_multi,
 	    TP_PROTO(const struct cpumask *cpus,
 		     const struct flush_tlb_info *info),
 	    TP_ARGS(cpus, info),
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 5169b8cc35bb..d00d551d4a2a 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -580,7 +580,7 @@ static void __init kvm_apf_trap_init(void)
 
 static DEFINE_PER_CPU(cpumask_var_t, __pv_tlb_mask);
 
-static void kvm_flush_tlb_others(const struct cpumask *cpumask,
+static void kvm_flush_tlb_multi(const struct cpumask *cpumask,
 			const struct flush_tlb_info *info)
 {
 	u8 state;
@@ -594,6 +594,11 @@ static void kvm_flush_tlb_others(const struct cpumask *cpumask,
 	 * queue flush_on_enter for pre-empted vCPUs
 	 */
 	for_each_cpu(cpu, flushmask) {
+		/*
+		 * The local vCPU is never preempted, so we do not explicitly
+		 * skip check for local vCPU - it will never be cleared from
+		 * flushmask.
+		 */
 		src = &per_cpu(steal_time, cpu);
 		state = READ_ONCE(src->preempted);
 		if ((state & KVM_VCPU_PREEMPTED)) {
@@ -603,7 +608,7 @@ static void kvm_flush_tlb_others(const struct cpumask *cpumask,
 		}
 	}
 
-	native_flush_tlb_others(flushmask, info);
+	native_flush_tlb_multi(flushmask, info);
 }
 
 static void __init kvm_guest_init(void)
@@ -628,7 +633,7 @@ static void __init kvm_guest_init(void)
 	if (kvm_para_has_feature(KVM_FEATURE_PV_TLB_FLUSH) &&
 	    !kvm_para_has_hint(KVM_HINTS_REALTIME) &&
 	    kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
-		pv_ops.mmu.flush_tlb_others = kvm_flush_tlb_others;
+		pv_ops.mmu.flush_tlb_multi = kvm_flush_tlb_multi;
 		pv_ops.mmu.tlb_remove_table = tlb_remove_table;
 	}
 
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 98039d7fb998..7cdcffe2a028 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -363,7 +363,7 @@ struct paravirt_patch_template pv_ops = {
 	.mmu.flush_tlb_user	= native_flush_tlb,
 	.mmu.flush_tlb_kernel	= native_flush_tlb_global,
 	.mmu.flush_tlb_one_user	= native_flush_tlb_one_user,
-	.mmu.flush_tlb_others	= native_flush_tlb_others,
+	.mmu.flush_tlb_multi	= native_flush_tlb_multi,
 	.mmu.tlb_remove_table	=
 			(void (*)(struct mmu_gather *, void *))tlb_remove_page,
 
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 5c9b1607191d..074288a6916e 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -551,7 +551,7 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
 		 * garbage into our TLB.  Since switching to init_mm is barely
 		 * slower than a minimal flush, just switch to init_mm.
 		 *
-		 * This should be rare, with native_flush_tlb_others skipping
+		 * This should be rare, with native_flush_tlb_multi() skipping
 		 * IPIs to lazy TLB mode CPUs.
 		 */
 		switch_mm_irqs_off(NULL, &init_mm, NULL);
@@ -635,7 +635,7 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
 	this_cpu_write(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen, mm_tlb_gen);
 }
 
-static void flush_tlb_func_local(void *info)
+static void __flush_tlb_func_local(void *info)
 {
 	const struct flush_tlb_info *f = info;
 	enum tlb_flush_reason reason;
@@ -645,6 +645,11 @@ static void flush_tlb_func_local(void *info)
 	flush_tlb_func_common(f, true, reason);
 }
 
+void flush_tlb_func_local(const struct flush_tlb_info *info)
+{
+	__flush_tlb_func_local((void *)info);
+}
+
 static void flush_tlb_func_remote(void *info)
 {
 	const struct flush_tlb_info *f = info;
@@ -665,9 +670,14 @@ static bool tlb_is_not_lazy(int cpu)
 
 static DEFINE_PER_CPU(cpumask_t, flush_tlb_mask);
 
-void native_flush_tlb_others(const struct cpumask *cpumask,
-			     const struct flush_tlb_info *info)
+void native_flush_tlb_multi(const struct cpumask *cpumask,
+			    const struct flush_tlb_info *info)
 {
+	/*
+	 * Do accounting and tracing. Note that there are (and have always been)
+	 * cases in which a remote TLB flush will be traced, but eventually
+	 * would not happen.
+	 */
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
 	if (info->end == TLB_FLUSH_ALL)
 		trace_tlb_flush(TLB_REMOTE_SEND_IPI, TLB_FLUSH_ALL);
@@ -687,10 +697,12 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 		 * means that the percpu tlb_gen variables won't be updated
 		 * and we'll do pointless flushes on future context switches.
 		 *
-		 * Rather than hooking native_flush_tlb_others() here, I think
+		 * Rather than hooking native_flush_tlb_multi() here, I think
 		 * that UV should be updated so that smp_call_function_many(),
 		 * etc, are optimal on UV.
 		 */
+		flush_tlb_func_local(info);
+
 		cpumask = uv_flush_tlb_others(cpumask, info);
 		if (cpumask)
 			smp_call_function_many(cpumask, flush_tlb_func_remote,
@@ -709,8 +721,9 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 	 * doing a speculative memory access.
 	 */
 	if (info->freed_tables) {
-		smp_call_function_many(cpumask, flush_tlb_func_remote,
-			       (void *)info, 1);
+		__smp_call_function_many(cpumask, flush_tlb_func_remote,
+					 __flush_tlb_func_local,
+					 (void *)info, 1);
 	} else {
 		/*
 		 * Although we could have used on_each_cpu_cond_mask(),
@@ -737,7 +750,8 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 			if (tlb_is_not_lazy(cpu))
 				__cpumask_set_cpu(cpu, cond_cpumask);
 		}
-		smp_call_function_many(cond_cpumask, flush_tlb_func_remote,
+		__smp_call_function_many(cond_cpumask, flush_tlb_func_remote,
+					 __flush_tlb_func_local,
 					 (void *)info, 1);
 	}
 }
@@ -818,16 +832,29 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 	info = get_flush_tlb_info(mm, start, end, stride_shift, freed_tables,
 				  new_tlb_gen);
 
-	if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
+	/*
+	 * Assert that mm_cpumask() corresponds with the loaded mm. We got one
+	 * exception: for init_mm we do not need to flush anything, and the
+	 * cpumask does not correspond with loaded_mm.
+	 */
+	VM_WARN_ON_ONCE(cpumask_test_cpu(smp_processor_id(), mm_cpumask(mm)) !=
+			(mm == this_cpu_read(cpu_tlbstate.loaded_mm)) &&
+			mm != &init_mm);
+
+	/*
+	 * flush_tlb_multi() is not optimized for the common case in which only
+	 * a local TLB flush is needed. Optimize this use-case by calling
+	 * flush_tlb_func_local() directly in this case.
+	 */
+	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
+		flush_tlb_multi(mm_cpumask(mm), info);
+	} else {
 		lockdep_assert_irqs_enabled();
 		local_irq_disable();
 		flush_tlb_func_local(info);
 		local_irq_enable();
 	}
 
-	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
-		flush_tlb_others(mm_cpumask(mm), info);
-
 	put_flush_tlb_info();
 	put_cpu();
 }
@@ -890,16 +917,20 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 {
 	int cpu = get_cpu();
 
-	if (cpumask_test_cpu(cpu, &batch->cpumask)) {
+	/*
+	 * flush_tlb_multi() is not optimized for the common case in which only
+	 * a local TLB flush is needed. Optimize this use-case by calling
+	 * flush_tlb_func_local() directly in this case.
+	 */
+	if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids) {
+		flush_tlb_multi(&batch->cpumask, &full_flush_tlb_info);
+	} else {
 		lockdep_assert_irqs_enabled();
 		local_irq_disable();
-		flush_tlb_func_local((void *)&full_flush_tlb_info);
+		flush_tlb_func_local(&full_flush_tlb_info);
 		local_irq_enable();
 	}
 
-	if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids)
-		flush_tlb_others(&batch->cpumask, &full_flush_tlb_info);
-
 	cpumask_clear(&batch->cpumask);
 
 	put_cpu();
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index beb44e22afdf..19e481e6e904 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -1355,8 +1355,8 @@ static void xen_flush_tlb_one_user(unsigned long addr)
 	preempt_enable();
 }
 
-static void xen_flush_tlb_others(const struct cpumask *cpus,
-				 const struct flush_tlb_info *info)
+static void xen_flush_tlb_multi(const struct cpumask *cpus,
+				const struct flush_tlb_info *info)
 {
 	struct {
 		struct mmuext_op op;
@@ -1366,7 +1366,7 @@ static void xen_flush_tlb_others(const struct cpumask *cpus,
 	const size_t mc_entry_size = sizeof(args->op) +
 		sizeof(args->mask[0]) * BITS_TO_LONGS(num_possible_cpus());
 
-	trace_xen_mmu_flush_tlb_others(cpus, info->mm, info->start, info->end);
+	trace_xen_mmu_flush_tlb_multi(cpus, info->mm, info->start, info->end);
 
 	if (cpumask_empty(cpus))
 		return;		/* nothing to do */
@@ -1375,9 +1375,17 @@ static void xen_flush_tlb_others(const struct cpumask *cpus,
 	args = mcs.args;
 	args->op.arg2.vcpumask = to_cpumask(args->mask);
 
-	/* Remove us, and any offline CPUS. */
+	/* Flush locally if needed and remove us */
+	if (cpumask_test_cpu(smp_processor_id(), to_cpumask(args->mask))) {
+		local_irq_disable();
+		flush_tlb_func_local(info);
+		local_irq_enable();
+
+		cpumask_clear_cpu(smp_processor_id(), to_cpumask(args->mask));
+	}
+
+	/* Remove offline CPUS */
 	cpumask_and(to_cpumask(args->mask), cpus, cpu_online_mask);
-	cpumask_clear_cpu(smp_processor_id(), to_cpumask(args->mask));
 
 	args->op.cmd = MMUEXT_TLB_FLUSH_MULTI;
 	if (info->end != TLB_FLUSH_ALL &&
@@ -2406,7 +2414,7 @@ static const struct pv_mmu_ops xen_mmu_ops __initconst = {
 	.flush_tlb_user = xen_flush_tlb,
 	.flush_tlb_kernel = xen_flush_tlb,
 	.flush_tlb_one_user = xen_flush_tlb_one_user,
-	.flush_tlb_others = xen_flush_tlb_others,
+	.flush_tlb_multi = xen_flush_tlb_multi,
 	.tlb_remove_table = tlb_remove_table,
 
 	.pgd_alloc = xen_pgd_alloc,
diff --git a/include/trace/events/xen.h b/include/trace/events/xen.h
index 9a0e8af21310..546022acf160 100644
--- a/include/trace/events/xen.h
+++ b/include/trace/events/xen.h
@@ -362,7 +362,7 @@ TRACE_EVENT(xen_mmu_flush_tlb_one_user,
 	    TP_printk("addr %lx", __entry->addr)
 	);
 
-TRACE_EVENT(xen_mmu_flush_tlb_others,
+TRACE_EVENT(xen_mmu_flush_tlb_multi,
 	    TP_PROTO(const struct cpumask *cpus, struct mm_struct *mm,
 		     unsigned long addr, unsigned long end),
 	    TP_ARGS(cpus, mm, addr, end),
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH v2 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently
  2019-07-02 23:51 ` [PATCH v2 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently Nadav Amit
@ 2019-07-03 14:04   ` Juergen Gross
  2019-07-03 17:02     ` Nadav Amit
  0 siblings, 1 reply; 6+ messages in thread
From: Juergen Gross @ 2019-07-03 14:04 UTC (permalink / raw)
  To: Nadav Amit, Andy Lutomirski, Dave Hansen
  Cc: Borislav Petkov, Peter Zijlstra, Sasha Levin, x86,
	Thomas Gleixner, virtualization, xen-devel, Haiyang Zhang,
	K. Y. Srinivasan, Stephen Hemminger, Boris Ostrovsky,
	Ingo Molnar, Paolo Bonzini, kvm, linux-hyperv, linux-kernel

On 03.07.19 01:51, Nadav Amit wrote:
> To improve TLB shootdown performance, flush the remote and local TLBs
> concurrently. Introduce flush_tlb_multi() that does so. Introduce
> paravirtual versions of flush_tlb_multi() for KVM, Xen and hyper-v (Xen
> and hyper-v are only compile-tested).
> 
> While the updated smp infrastructure is capable of running a function on
> a single local core, it is not optimized for this case. The multiple
> function calls and the indirect branch introduce some overhead, and
> might make local TLB flushes slower than they were before the recent
> changes.
> 
> Before calling the SMP infrastructure, check if only a local TLB flush
> is needed to restore the lost performance in this common case. This
> requires to check mm_cpumask() one more time, but unless this mask is
> updated very frequently, this should impact performance negatively.
> 
> Cc: "K. Y. Srinivasan" <kys@microsoft.com>
> Cc: Haiyang Zhang <haiyangz@microsoft.com>
> Cc: Stephen Hemminger <sthemmin@microsoft.com>
> Cc: Sasha Levin <sashal@kernel.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: x86@kernel.org
> Cc: Juergen Gross <jgross@suse.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
> Cc: linux-hyperv@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: virtualization@lists.linux-foundation.org
> Cc: kvm@vger.kernel.org
> Cc: xen-devel@lists.xenproject.org
> Signed-off-by: Nadav Amit <namit@vmware.com>
> ---
>   arch/x86/hyperv/mmu.c                 | 13 +++---
>   arch/x86/include/asm/paravirt.h       |  6 +--
>   arch/x86/include/asm/paravirt_types.h |  4 +-
>   arch/x86/include/asm/tlbflush.h       |  9 ++--
>   arch/x86/include/asm/trace/hyperv.h   |  2 +-
>   arch/x86/kernel/kvm.c                 | 11 +++--
>   arch/x86/kernel/paravirt.c            |  2 +-
>   arch/x86/mm/tlb.c                     | 65 ++++++++++++++++++++-------
>   arch/x86/xen/mmu_pv.c                 | 20 ++++++---
>   include/trace/events/xen.h            |  2 +-
>   10 files changed, 91 insertions(+), 43 deletions(-)

...

> diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
> index beb44e22afdf..19e481e6e904 100644
> --- a/arch/x86/xen/mmu_pv.c
> +++ b/arch/x86/xen/mmu_pv.c
> @@ -1355,8 +1355,8 @@ static void xen_flush_tlb_one_user(unsigned long addr)
>   	preempt_enable();
>   }
>   
> -static void xen_flush_tlb_others(const struct cpumask *cpus,
> -				 const struct flush_tlb_info *info)
> +static void xen_flush_tlb_multi(const struct cpumask *cpus,
> +				const struct flush_tlb_info *info)
>   {
>   	struct {
>   		struct mmuext_op op;
> @@ -1366,7 +1366,7 @@ static void xen_flush_tlb_others(const struct cpumask *cpus,
>   	const size_t mc_entry_size = sizeof(args->op) +
>   		sizeof(args->mask[0]) * BITS_TO_LONGS(num_possible_cpus());
>   
> -	trace_xen_mmu_flush_tlb_others(cpus, info->mm, info->start, info->end);
> +	trace_xen_mmu_flush_tlb_multi(cpus, info->mm, info->start, info->end);
>   
>   	if (cpumask_empty(cpus))
>   		return;		/* nothing to do */
> @@ -1375,9 +1375,17 @@ static void xen_flush_tlb_others(const struct cpumask *cpus,
>   	args = mcs.args;
>   	args->op.arg2.vcpumask = to_cpumask(args->mask);
>   
> -	/* Remove us, and any offline CPUS. */
> +	/* Flush locally if needed and remove us */
> +	if (cpumask_test_cpu(smp_processor_id(), to_cpumask(args->mask))) {
> +		local_irq_disable();
> +		flush_tlb_func_local(info);

I think this isn't the correct function for PV guests.

In fact it should be much easier: just don't clear the own cpu from the
mask, that's all what's needed. The hypervisor is just fine having the
current cpu in the mask and it will do the right thing.


Juergen

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v2 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently
  2019-07-03 14:04   ` Juergen Gross
@ 2019-07-03 17:02     ` Nadav Amit
  2019-07-03 17:43       ` [Xen-devel] " Andrew Cooper
  0 siblings, 1 reply; 6+ messages in thread
From: Nadav Amit @ 2019-07-03 17:02 UTC (permalink / raw)
  To: Juergen Gross
  Cc: Andy Lutomirski, Dave Hansen, Borislav Petkov, Peter Zijlstra,
	Sasha Levin, the arch/x86 maintainers, Thomas Gleixner,
	virtualization, xen-devel, Haiyang Zhang, K. Y. Srinivasan,
	Stephen Hemminger, Boris Ostrovsky, Ingo Molnar, Paolo Bonzini,
	kvm list, linux-hyperv, linux-kernel

> On Jul 3, 2019, at 7:04 AM, Juergen Gross <jgross@suse.com> wrote:
> 
> On 03.07.19 01:51, Nadav Amit wrote:
>> To improve TLB shootdown performance, flush the remote and local TLBs
>> concurrently. Introduce flush_tlb_multi() that does so. Introduce
>> paravirtual versions of flush_tlb_multi() for KVM, Xen and hyper-v (Xen
>> and hyper-v are only compile-tested).
>> While the updated smp infrastructure is capable of running a function on
>> a single local core, it is not optimized for this case. The multiple
>> function calls and the indirect branch introduce some overhead, and
>> might make local TLB flushes slower than they were before the recent
>> changes.
>> Before calling the SMP infrastructure, check if only a local TLB flush
>> is needed to restore the lost performance in this common case. This
>> requires to check mm_cpumask() one more time, but unless this mask is
>> updated very frequently, this should impact performance negatively.
>> Cc: "K. Y. Srinivasan" <kys@microsoft.com>
>> Cc: Haiyang Zhang <haiyangz@microsoft.com>
>> Cc: Stephen Hemminger <sthemmin@microsoft.com>
>> Cc: Sasha Levin <sashal@kernel.org>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: Borislav Petkov <bp@alien8.de>
>> Cc: x86@kernel.org
>> Cc: Juergen Gross <jgross@suse.com>
>> Cc: Paolo Bonzini <pbonzini@redhat.com>
>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>> Cc: Andy Lutomirski <luto@kernel.org>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
>> Cc: linux-hyperv@vger.kernel.org
>> Cc: linux-kernel@vger.kernel.org
>> Cc: virtualization@lists.linux-foundation.org
>> Cc: kvm@vger.kernel.org
>> Cc: xen-devel@lists.xenproject.org
>> Signed-off-by: Nadav Amit <namit@vmware.com>
>> ---
>>  arch/x86/hyperv/mmu.c                 | 13 +++---
>>  arch/x86/include/asm/paravirt.h       |  6 +--
>>  arch/x86/include/asm/paravirt_types.h |  4 +-
>>  arch/x86/include/asm/tlbflush.h       |  9 ++--
>>  arch/x86/include/asm/trace/hyperv.h   |  2 +-
>>  arch/x86/kernel/kvm.c                 | 11 +++--
>>  arch/x86/kernel/paravirt.c            |  2 +-
>>  arch/x86/mm/tlb.c                     | 65 ++++++++++++++++++++-------
>>  arch/x86/xen/mmu_pv.c                 | 20 ++++++---
>>  include/trace/events/xen.h            |  2 +-
>>  10 files changed, 91 insertions(+), 43 deletions(-)
> 
> ...
> 
>> diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
>> index beb44e22afdf..19e481e6e904 100644
>> --- a/arch/x86/xen/mmu_pv.c
>> +++ b/arch/x86/xen/mmu_pv.c
>> @@ -1355,8 +1355,8 @@ static void xen_flush_tlb_one_user(unsigned long addr)
>>  	preempt_enable();
>>  }
>>  -static void xen_flush_tlb_others(const struct cpumask *cpus,
>> -				 const struct flush_tlb_info *info)
>> +static void xen_flush_tlb_multi(const struct cpumask *cpus,
>> +				const struct flush_tlb_info *info)
>>  {
>>  	struct {
>>  		struct mmuext_op op;
>> @@ -1366,7 +1366,7 @@ static void xen_flush_tlb_others(const struct cpumask *cpus,
>>  	const size_t mc_entry_size = sizeof(args->op) +
>>  		sizeof(args->mask[0]) * BITS_TO_LONGS(num_possible_cpus());
>>  -	trace_xen_mmu_flush_tlb_others(cpus, info->mm, info->start, info->end);
>> +	trace_xen_mmu_flush_tlb_multi(cpus, info->mm, info->start, info->end);
>>    	if (cpumask_empty(cpus))
>>  		return;		/* nothing to do */
>> @@ -1375,9 +1375,17 @@ static void xen_flush_tlb_others(const struct cpumask *cpus,
>>  	args = mcs.args;
>>  	args->op.arg2.vcpumask = to_cpumask(args->mask);
>>  -	/* Remove us, and any offline CPUS. */
>> +	/* Flush locally if needed and remove us */
>> +	if (cpumask_test_cpu(smp_processor_id(), to_cpumask(args->mask))) {
>> +		local_irq_disable();
>> +		flush_tlb_func_local(info);
> 
> I think this isn't the correct function for PV guests.
> 
> In fact it should be much easier: just don't clear the own cpu from the
> mask, that's all what's needed. The hypervisor is just fine having the
> current cpu in the mask and it will do the right thing.

Thanks. I will do so in v3. I don’t think Hyper-V people would want to do
the same, unfortunately, since it would induce VM-exit on TLB flushes. But
if they do - I’ll be able not to expose flush_tlb_func_local().


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [Xen-devel] [PATCH v2 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently
  2019-07-03 17:02     ` Nadav Amit
@ 2019-07-03 17:43       ` Andrew Cooper
  2019-07-03 18:09         ` Nadav Amit
  0 siblings, 1 reply; 6+ messages in thread
From: Andrew Cooper @ 2019-07-03 17:43 UTC (permalink / raw)
  To: Nadav Amit, Juergen Gross
  Cc: Sasha Levin, linux-hyperv, the arch/x86 maintainers,
	Stephen Hemminger, kvm list, Peter Zijlstra, Haiyang Zhang,
	Dave Hansen, linux-kernel, virtualization, Ingo Molnar,
	Borislav Petkov, Andy Lutomirski, Paolo Bonzini, xen-devel,
	Thomas Gleixner, K. Y. Srinivasan, Boris Ostrovsky

On 03/07/2019 18:02, Nadav Amit wrote:
>> On Jul 3, 2019, at 7:04 AM, Juergen Gross <jgross@suse.com> wrote:
>>
>> On 03.07.19 01:51, Nadav Amit wrote:
>>> To improve TLB shootdown performance, flush the remote and local TLBs
>>> concurrently. Introduce flush_tlb_multi() that does so. Introduce
>>> paravirtual versions of flush_tlb_multi() for KVM, Xen and hyper-v (Xen
>>> and hyper-v are only compile-tested).
>>> While the updated smp infrastructure is capable of running a function on
>>> a single local core, it is not optimized for this case. The multiple
>>> function calls and the indirect branch introduce some overhead, and
>>> might make local TLB flushes slower than they were before the recent
>>> changes.
>>> Before calling the SMP infrastructure, check if only a local TLB flush
>>> is needed to restore the lost performance in this common case. This
>>> requires to check mm_cpumask() one more time, but unless this mask is
>>> updated very frequently, this should impact performance negatively.
>>> Cc: "K. Y. Srinivasan" <kys@microsoft.com>
>>> Cc: Haiyang Zhang <haiyangz@microsoft.com>
>>> Cc: Stephen Hemminger <sthemmin@microsoft.com>
>>> Cc: Sasha Levin <sashal@kernel.org>
>>> Cc: Thomas Gleixner <tglx@linutronix.de>
>>> Cc: Ingo Molnar <mingo@redhat.com>
>>> Cc: Borislav Petkov <bp@alien8.de>
>>> Cc: x86@kernel.org
>>> Cc: Juergen Gross <jgross@suse.com>
>>> Cc: Paolo Bonzini <pbonzini@redhat.com>
>>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>>> Cc: Andy Lutomirski <luto@kernel.org>
>>> Cc: Peter Zijlstra <peterz@infradead.org>
>>> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
>>> Cc: linux-hyperv@vger.kernel.org
>>> Cc: linux-kernel@vger.kernel.org
>>> Cc: virtualization@lists.linux-foundation.org
>>> Cc: kvm@vger.kernel.org
>>> Cc: xen-devel@lists.xenproject.org
>>> Signed-off-by: Nadav Amit <namit@vmware.com>
>>> ---
>>>  arch/x86/hyperv/mmu.c                 | 13 +++---
>>>  arch/x86/include/asm/paravirt.h       |  6 +--
>>>  arch/x86/include/asm/paravirt_types.h |  4 +-
>>>  arch/x86/include/asm/tlbflush.h       |  9 ++--
>>>  arch/x86/include/asm/trace/hyperv.h   |  2 +-
>>>  arch/x86/kernel/kvm.c                 | 11 +++--
>>>  arch/x86/kernel/paravirt.c            |  2 +-
>>>  arch/x86/mm/tlb.c                     | 65 ++++++++++++++++++++-------
>>>  arch/x86/xen/mmu_pv.c                 | 20 ++++++---
>>>  include/trace/events/xen.h            |  2 +-
>>>  10 files changed, 91 insertions(+), 43 deletions(-)
>> ...
>>
>>> diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
>>> index beb44e22afdf..19e481e6e904 100644
>>> --- a/arch/x86/xen/mmu_pv.c
>>> +++ b/arch/x86/xen/mmu_pv.c
>>> @@ -1355,8 +1355,8 @@ static void xen_flush_tlb_one_user(unsigned long addr)
>>>  	preempt_enable();
>>>  }
>>>  -static void xen_flush_tlb_others(const struct cpumask *cpus,
>>> -				 const struct flush_tlb_info *info)
>>> +static void xen_flush_tlb_multi(const struct cpumask *cpus,
>>> +				const struct flush_tlb_info *info)
>>>  {
>>>  	struct {
>>>  		struct mmuext_op op;
>>> @@ -1366,7 +1366,7 @@ static void xen_flush_tlb_others(const struct cpumask *cpus,
>>>  	const size_t mc_entry_size = sizeof(args->op) +
>>>  		sizeof(args->mask[0]) * BITS_TO_LONGS(num_possible_cpus());
>>>  -	trace_xen_mmu_flush_tlb_others(cpus, info->mm, info->start, info->end);
>>> +	trace_xen_mmu_flush_tlb_multi(cpus, info->mm, info->start, info->end);
>>>    	if (cpumask_empty(cpus))
>>>  		return;		/* nothing to do */
>>> @@ -1375,9 +1375,17 @@ static void xen_flush_tlb_others(const struct cpumask *cpus,
>>>  	args = mcs.args;
>>>  	args->op.arg2.vcpumask = to_cpumask(args->mask);
>>>  -	/* Remove us, and any offline CPUS. */
>>> +	/* Flush locally if needed and remove us */
>>> +	if (cpumask_test_cpu(smp_processor_id(), to_cpumask(args->mask))) {
>>> +		local_irq_disable();
>>> +		flush_tlb_func_local(info);
>> I think this isn't the correct function for PV guests.
>>
>> In fact it should be much easier: just don't clear the own cpu from the
>> mask, that's all what's needed. The hypervisor is just fine having the
>> current cpu in the mask and it will do the right thing.
> Thanks. I will do so in v3. I don’t think Hyper-V people would want to do
> the same, unfortunately, since it would induce VM-exit on TLB flushes.

Why do you believe the vmexit matters?  You're talking one anyway for
the IPI.

Intel only have virtualised self-IPI, and while AMD do have working
non-self IPIs, you still take a vmexit anyway if any destination vcpu
isn't currently running in non-root mode (IIRC).

At that point, you might as well have the hypervisor do all the hard
work via a multi-cpu shootdown/flush hypercall, rather than trying to
arrange it locally.

~Andrew

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [Xen-devel] [PATCH v2 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently
  2019-07-03 17:43       ` [Xen-devel] " Andrew Cooper
@ 2019-07-03 18:09         ` Nadav Amit
  0 siblings, 0 replies; 6+ messages in thread
From: Nadav Amit @ 2019-07-03 18:09 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Juergen Gross, Sasha Levin, linux-hyperv,
	the arch/x86 maintainers, Stephen Hemminger, kvm list,
	Peter Zijlstra, Haiyang Zhang, Dave Hansen, linux-kernel,
	virtualization, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Paolo Bonzini, xen-devel, Thomas Gleixner, K. Y. Srinivasan,
	Boris Ostrovsky

> On Jul 3, 2019, at 10:43 AM, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> 
> On 03/07/2019 18:02, Nadav Amit wrote:
>>> On Jul 3, 2019, at 7:04 AM, Juergen Gross <jgross@suse.com> wrote:
>>> 
>>> On 03.07.19 01:51, Nadav Amit wrote:
>>>> To improve TLB shootdown performance, flush the remote and local TLBs
>>>> concurrently. Introduce flush_tlb_multi() that does so. Introduce
>>>> paravirtual versions of flush_tlb_multi() for KVM, Xen and hyper-v (Xen
>>>> and hyper-v are only compile-tested).
>>>> While the updated smp infrastructure is capable of running a function on
>>>> a single local core, it is not optimized for this case. The multiple
>>>> function calls and the indirect branch introduce some overhead, and
>>>> might make local TLB flushes slower than they were before the recent
>>>> changes.
>>>> Before calling the SMP infrastructure, check if only a local TLB flush
>>>> is needed to restore the lost performance in this common case. This
>>>> requires to check mm_cpumask() one more time, but unless this mask is
>>>> updated very frequently, this should impact performance negatively.
>>>> Cc: "K. Y. Srinivasan" <kys@microsoft.com>
>>>> Cc: Haiyang Zhang <haiyangz@microsoft.com>
>>>> Cc: Stephen Hemminger <sthemmin@microsoft.com>
>>>> Cc: Sasha Levin <sashal@kernel.org>
>>>> Cc: Thomas Gleixner <tglx@linutronix.de>
>>>> Cc: Ingo Molnar <mingo@redhat.com>
>>>> Cc: Borislav Petkov <bp@alien8.de>
>>>> Cc: x86@kernel.org
>>>> Cc: Juergen Gross <jgross@suse.com>
>>>> Cc: Paolo Bonzini <pbonzini@redhat.com>
>>>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>>>> Cc: Andy Lutomirski <luto@kernel.org>
>>>> Cc: Peter Zijlstra <peterz@infradead.org>
>>>> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
>>>> Cc: linux-hyperv@vger.kernel.org
>>>> Cc: linux-kernel@vger.kernel.org
>>>> Cc: virtualization@lists.linux-foundation.org
>>>> Cc: kvm@vger.kernel.org
>>>> Cc: xen-devel@lists.xenproject.org
>>>> Signed-off-by: Nadav Amit <namit@vmware.com>
>>>> ---
>>>> arch/x86/hyperv/mmu.c                 | 13 +++---
>>>> arch/x86/include/asm/paravirt.h       |  6 +--
>>>> arch/x86/include/asm/paravirt_types.h |  4 +-
>>>> arch/x86/include/asm/tlbflush.h       |  9 ++--
>>>> arch/x86/include/asm/trace/hyperv.h   |  2 +-
>>>> arch/x86/kernel/kvm.c                 | 11 +++--
>>>> arch/x86/kernel/paravirt.c            |  2 +-
>>>> arch/x86/mm/tlb.c                     | 65 ++++++++++++++++++++-------
>>>> arch/x86/xen/mmu_pv.c                 | 20 ++++++---
>>>> include/trace/events/xen.h            |  2 +-
>>>> 10 files changed, 91 insertions(+), 43 deletions(-)
>>> ...
>>> 
>>>> diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
>>>> index beb44e22afdf..19e481e6e904 100644
>>>> --- a/arch/x86/xen/mmu_pv.c
>>>> +++ b/arch/x86/xen/mmu_pv.c
>>>> @@ -1355,8 +1355,8 @@ static void xen_flush_tlb_one_user(unsigned long addr)
>>>> 	preempt_enable();
>>>> }
>>>> -static void xen_flush_tlb_others(const struct cpumask *cpus,
>>>> -				 const struct flush_tlb_info *info)
>>>> +static void xen_flush_tlb_multi(const struct cpumask *cpus,
>>>> +				const struct flush_tlb_info *info)
>>>> {
>>>> 	struct {
>>>> 		struct mmuext_op op;
>>>> @@ -1366,7 +1366,7 @@ static void xen_flush_tlb_others(const struct cpumask *cpus,
>>>> 	const size_t mc_entry_size = sizeof(args->op) +
>>>> 		sizeof(args->mask[0]) * BITS_TO_LONGS(num_possible_cpus());
>>>> -	trace_xen_mmu_flush_tlb_others(cpus, info->mm, info->start, info->end);
>>>> +	trace_xen_mmu_flush_tlb_multi(cpus, info->mm, info->start, info->end);
>>>>   	if (cpumask_empty(cpus))
>>>> 		return;		/* nothing to do */
>>>> @@ -1375,9 +1375,17 @@ static void xen_flush_tlb_others(const struct cpumask *cpus,
>>>> 	args = mcs.args;
>>>> 	args->op.arg2.vcpumask = to_cpumask(args->mask);
>>>> -	/* Remove us, and any offline CPUS. */
>>>> +	/* Flush locally if needed and remove us */
>>>> +	if (cpumask_test_cpu(smp_processor_id(), to_cpumask(args->mask))) {
>>>> +		local_irq_disable();
>>>> +		flush_tlb_func_local(info);
>>> I think this isn't the correct function for PV guests.
>>> 
>>> In fact it should be much easier: just don't clear the own cpu from the
>>> mask, that's all what's needed. The hypervisor is just fine having the
>>> current cpu in the mask and it will do the right thing.
>> Thanks. I will do so in v3. I don’t think Hyper-V people would want to do
>> the same, unfortunately, since it would induce VM-exit on TLB flushes.
> 
> Why do you believe the vmexit matters?  You're talking one anyway for
> the IPI.
> 
> Intel only have virtualised self-IPI, and while AMD do have working
> non-self IPIs, you still take a vmexit anyway if any destination vcpu
> isn't currently running in non-root mode (IIRC).
> 
> At that point, you might as well have the hypervisor do all the hard
> work via a multi-cpu shootdown/flush hypercall, rather than trying to
> arrange it locally.

I forgot that xen_flush_tlb_multi() should actually only be called when
there are some remote CPUs (as I optimized the case in which there is only a
single local CPU that needs to be flushed), so you are right.


^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2019-07-03 18:09 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-07-02 23:51 [PATCH v2 0/9] x86: Concurrent TLB flushes Nadav Amit
2019-07-02 23:51 ` [PATCH v2 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently Nadav Amit
2019-07-03 14:04   ` Juergen Gross
2019-07-03 17:02     ` Nadav Amit
2019-07-03 17:43       ` [Xen-devel] " Andrew Cooper
2019-07-03 18:09         ` Nadav Amit

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).