* [PATCH v4 0/4] KVM: X86: Paravirt remote TLB flush
@ 2017-11-13  0:33 Wanpeng Li
  2017-11-13  0:33 ` [PATCH v4 1/4] KVM: X86: Add vCPU running/preempted state Wanpeng Li
                   ` (3 more replies)
  0 siblings, 4 replies; 18+ messages in thread
From: Wanpeng Li @ 2017-11-13  0:33 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Radim Krčmář, Peter Zijlstra, Wanpeng Li

Remote TLB flush APIs busy-wait, which is fine in a bare-metal
scenario. Within a guest, however, the target vCPUs may have been
preempted or blocked, so the initiating vCPU can end up busy-waiting
for a long time.

This patch set implements paravirtualized TLB flushing that does not
wait for vCPUs that are not running; instead, those vCPUs flush the
TLB on guest entry. The idea was discussed here:
https://lkml.org/lkml/2012/2/20/157
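
In short, the guest-side hook added in patch 2 does roughly the following
(condensed from that patch; flushmask is a per-CPU scratch cpumask and the
allocation check is omitted):

	cpumask_copy(flushmask, cpumask);
	for_each_cpu(cpu, cpumask) {
		struct kvm_steal_time *src = &per_cpu(steal_time, cpu);
		u8 state = READ_ONCE(src->preempted);

		/* Preempted vCPU: mark it to flush on guest entry, skip the IPI. */
		if ((state & KVM_VCPU_PREEMPTED) &&
		    try_cmpxchg(&src->preempted, &state,
				state | KVM_VCPU_SHOULD_FLUSH))
			__cpumask_clear_cpu(cpu, flushmask);
	}
	/* Busy-wait only on the vCPUs that are still running. */
	native_flush_tlb_others(flushmask, info);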

The best results are achieved when the host is overcommitted, i.e.
multiple vCPUs run on each pCPU. In that case the PV TLB flush avoids
touching vCPUs that are not scheduled and spares the initiating CPU
the busy wait.

In addition, this relies on commit 9e52fc2b50d ("x86/mm: Enable RCU based
page table freeing (CONFIG_HAVE_RCU_TABLE_FREE=y)"): with page table
pages freed via RCU, it is safe not to wait for preempted vCPUs to
perform the flush.

Tested on a Haswell i7 desktop with 4 cores (2 threads each), i.e. 8
pCPUs, running ebizzy in one Linux guest.

ebizzy -M 
              vanilla    optimized     boost
 8 vCPUs       10152       10083       -0.68% 
16 vCPUs        1224        4866       297.5% 
24 vCPUs        1109        3871       249%
32 vCPUs        1025        3375       229.3% 

Note: The patchset is rebased against "locking/qspinlock/x86: Avoid
   test-and-set when PV_DEDICATED is set" v3

v3 -> v4:
 * use READ_ONCE()
 * use try_cmpxchg instead of cmpxchg
 * add {} to if
 * no FLUSH flags to preserve during set_preempted
 * "KVM: X86" prefix to patch subject

v2 -> v3: 
 * percpu cpumask

v1 -> v2:
 * a new CPUID feature bit
 * fix cmpxchg check
 * use kvm_vcpu_flush_tlb() to get the statistics right
 * just OR the KVM_VCPU_PREEMPTED in kvm_steal_time_set_preempted
 * add a new bool argument to kvm_x86_ops->tlb_flush
 * __cpumask_clear_cpu() instead of cpumask_clear_cpu()
 * not put cpumask_t on stack
 * rebase the patchset against "locking/qspinlock/x86: Avoid
   test-and-set when PV_DEDICATED is set" v3

Wanpeng Li (4):
  KVM: X86: Add vCPU running/preempted state
  KVM: X86: Add paravirt remote TLB flush
  KVM: X86: introduce invalidate_gpa argument to tlb flush
  KVM: X86: Add flush_on_enter before guest enter

 Documentation/virtual/kvm/cpuid.txt  |  4 ++++
 arch/x86/include/asm/kvm_host.h      |  2 +-
 arch/x86/include/uapi/asm/kvm_para.h |  6 +++++
 arch/x86/kernel/kvm.c                | 46 ++++++++++++++++++++++++++++++++++--
 arch/x86/kvm/cpuid.c                 |  3 ++-
 arch/x86/kvm/svm.c                   | 14 +++++------
 arch/x86/kvm/vmx.c                   | 21 ++++++++--------
 arch/x86/kvm/x86.c                   | 25 +++++++++++++-------
 8 files changed, 88 insertions(+), 30 deletions(-)

-- 
2.7.4

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v4 1/4] KVM: X86: Add vCPU running/preempted state
  2017-11-13  0:33 [PATCH v4 0/4] KVM: X86: Paravirt remote TLB flush Wanpeng Li
@ 2017-11-13  0:33 ` Wanpeng Li
  2017-11-14 16:45   ` David Hildenbrand
  2017-11-13  0:33 ` [PATCH v4 2/4] KVM: X86: Add paravirt remote TLB flush Wanpeng Li
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 18+ messages in thread
From: Wanpeng Li @ 2017-11-13  0:33 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Radim Krčmář, Peter Zijlstra, Wanpeng Li

From: Wanpeng Li <wanpeng.li@hotmail.com>

This patch reuses the preempted field in kvm_steal_time to export vCPU
running/preempted information from the host to the guest. This enables
the guest to send IPIs only to running vCPUs and to set a flag for
preempted vCPUs, so it does not end up waiting for vCPUs that are not
running.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
---
 arch/x86/include/uapi/asm/kvm_para.h | 3 +++
 arch/x86/kernel/kvm.c                | 2 +-
 arch/x86/kvm/x86.c                   | 4 ++--
 3 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index 554aa8f..bf17b30 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -51,6 +51,9 @@ struct kvm_steal_time {
 	__u32 pad[11];
 };
 
+#define KVM_VCPU_NOT_PREEMPTED      (0 << 0)
+#define KVM_VCPU_PREEMPTED          (1 << 0)
+
 #define KVM_CLOCK_PAIRING_WALLCLOCK 0
 struct kvm_clock_pairing {
 	__s64 sec;
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 8bb9594..1b1b641 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -608,7 +608,7 @@ __visible bool __kvm_vcpu_is_preempted(long cpu)
 {
 	struct kvm_steal_time *src = &per_cpu(steal_time, cpu);
 
-	return !!src->preempted;
+	return !!(src->preempted & KVM_VCPU_PREEMPTED);
 }
 PV_CALLEE_SAVE_REGS_THUNK(__kvm_vcpu_is_preempted);
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 03869eb..5e63033 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2113,7 +2113,7 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
 		&vcpu->arch.st.steal, sizeof(struct kvm_steal_time))))
 		return;
 
-	vcpu->arch.st.steal.preempted = 0;
+	vcpu->arch.st.steal.preempted = KVM_VCPU_NOT_PREEMPTED;
 
 	if (vcpu->arch.st.steal.version & 1)
 		vcpu->arch.st.steal.version += 1;  /* first time write, random junk */
@@ -2884,7 +2884,7 @@ static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
 	if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
 		return;
 
-	vcpu->arch.st.steal.preempted = 1;
+	vcpu->arch.st.steal.preempted = KVM_VCPU_PREEMPTED;
 
 	kvm_write_guest_offset_cached(vcpu->kvm, &vcpu->arch.st.stime,
 			&vcpu->arch.st.steal.preempted,
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v4 2/4] KVM: X86: Add paravirt remote TLB flush
  2017-11-13  0:33 [PATCH v4 0/4] KVM: X86: Paravirt remote TLB flush Wanpeng Li
  2017-11-13  0:33 ` [PATCH v4 1/4] KVM: X86: Add vCPU running/preempted state Wanpeng Li
@ 2017-11-13  0:33 ` Wanpeng Li
  2017-11-13  7:59   ` Peter Zijlstra
  2017-11-13  8:04   ` Peter Zijlstra
  2017-11-13  0:33 ` [PATCH v4 3/4] KVM: X86: introduce invalidate_gpa argument to tlb flush Wanpeng Li
  2017-11-13  0:33 ` [PATCH v4 4/4] KVM: X86: Add flush_on_enter before guest enter Wanpeng Li
  3 siblings, 2 replies; 18+ messages in thread
From: Wanpeng Li @ 2017-11-13  0:33 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Radim Krčmář, Peter Zijlstra, Wanpeng Li

From: Wanpeng Li <wanpeng.li@hotmail.com>

Remote TLB flush APIs busy-wait, which is fine in a bare-metal
scenario. Within a guest, however, the target vCPUs may have been
preempted or blocked, so the initiating vCPU can end up busy-waiting
for a long time.

This patch implements paravirtualized TLB flushing that does not wait
for vCPUs that are not running; instead, those vCPUs flush the TLB on
guest entry.

The best results are achieved when the host is overcommitted, i.e.
multiple vCPUs run on each pCPU. In that case the PV TLB flush avoids
touching vCPUs that are not scheduled and spares the initiating CPU
the busy wait.

Tested on a Haswell i7 desktop with 4 cores (2 threads each), i.e. 8
pCPUs, running ebizzy in one Linux guest.

ebizzy -M 
              vanilla    optimized     boost
 8 vCPUs       10152       10083       -0.68% 
16 vCPUs        1224        4866       297.5% 
24 vCPUs        1109        3871       249%
32 vCPUs        1025        3375       229.3% 

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
---
 Documentation/virtual/kvm/cpuid.txt  |  4 ++++
 arch/x86/include/uapi/asm/kvm_para.h |  2 ++
 arch/x86/kernel/kvm.c                | 42 +++++++++++++++++++++++++++++++++++-
 3 files changed, 47 insertions(+), 1 deletion(-)

diff --git a/Documentation/virtual/kvm/cpuid.txt b/Documentation/virtual/kvm/cpuid.txt
index 117066a..9693fcc 100644
--- a/Documentation/virtual/kvm/cpuid.txt
+++ b/Documentation/virtual/kvm/cpuid.txt
@@ -60,6 +60,10 @@ KVM_FEATURE_PV_DEDICATED           ||     8 || guest checks this feature bit
                                    ||       || mizations such as usage of
                                    ||       || qspinlocks.
 ------------------------------------------------------------------------------
+KVM_FEATURE_PV_TLB_FLUSH           ||     9 || guest checks this feature bit
+                                   ||       || before enabling paravirtualized
+                                   ||       || tlb flush.
+------------------------------------------------------------------------------
 KVM_FEATURE_CLOCKSOURCE_STABLE_BIT ||    24 || host will warn if no guest-side
                                    ||       || per-cpu warps are expected in
                                    ||       || kvmclock.
diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index 6d66556..e267d83 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -26,6 +26,7 @@
 #define KVM_FEATURE_PV_EOI		6
 #define KVM_FEATURE_PV_UNHALT		7
 #define KVM_FEATURE_PV_DEDICATED	8
+#define KVM_FEATURE_PV_TLB_FLUSH	9
 
 /* The last 8 bits are used to indicate how to interpret the flags field
  * in pvclock structure. If no bits are set, all flags are ignored.
@@ -54,6 +55,7 @@ struct kvm_steal_time {
 
 #define KVM_VCPU_NOT_PREEMPTED      (0 << 0)
 #define KVM_VCPU_PREEMPTED          (1 << 0)
+#define KVM_VCPU_SHOULD_FLUSH       (1 << 1)
 
 #define KVM_CLOCK_PAIRING_WALLCLOCK 0
 struct kvm_clock_pairing {
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 66ed3bc..78794c1 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -465,9 +465,40 @@ static void __init kvm_apf_trap_init(void)
 	update_intr_gate(X86_TRAP_PF, async_page_fault);
 }
 
+static DEFINE_PER_CPU(cpumask_var_t, __pv_tlb_mask);
+
+static void kvm_flush_tlb_others(const struct cpumask *cpumask,
+			const struct flush_tlb_info *info)
+{
+	u8 state;
+	int cpu;
+	struct kvm_steal_time *src;
+	struct cpumask *flushmask = this_cpu_cpumask_var_ptr(__pv_tlb_mask);
+
+	if (unlikely(!flushmask))
+		return;
+
+	cpumask_copy(flushmask, cpumask);
+	/*
+	 * We have to call flush only on online vCPUs. And
+	 * queue flush_on_enter for pre-empted vCPUs
+	 */
+	for_each_cpu(cpu, cpumask) {
+		src = &per_cpu(steal_time, cpu);
+		state = READ_ONCE(src->preempted);
+		if ((state & KVM_VCPU_PREEMPTED)) {
+			if (try_cmpxchg(&src->preempted, &state,
+				state | KVM_VCPU_SHOULD_FLUSH))
+				__cpumask_clear_cpu(cpu, flushmask);
+		}
+	}
+
+	native_flush_tlb_others(flushmask, info);
+}
+
 void __init kvm_guest_init(void)
 {
-	int i;
+	int i, cpu;
 
 	if (!kvm_para_available())
 		return;
@@ -484,6 +515,15 @@ void __init kvm_guest_init(void)
 		pv_time_ops.steal_clock = kvm_steal_clock;
 	}
 
+	if (kvm_para_has_feature(KVM_FEATURE_PV_TLB_FLUSH) &&
+		!kvm_para_has_feature(KVM_FEATURE_PV_DEDICATED)) {
+		for_each_possible_cpu(cpu) {
+			zalloc_cpumask_var_node(per_cpu_ptr(&__pv_tlb_mask, cpu),
+					GFP_KERNEL, cpu_to_node(cpu));
+		}
+		pv_mmu_ops.flush_tlb_others = kvm_flush_tlb_others;
+	}
+
 	if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
 		apic_set_eoi_write(kvm_guest_apic_eoi_write);
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v4 3/4] KVM: X86: introduce invalidate_gpa argument to tlb flush
  2017-11-13  0:33 [PATCH v4 0/4] KVM: X86: Paravirt remote TLB flush Wanpeng Li
  2017-11-13  0:33 ` [PATCH v4 1/4] KVM: X86: Add vCPU running/preempted state Wanpeng Li
  2017-11-13  0:33 ` [PATCH v4 2/4] KVM: X86: Add paravirt remote TLB flush Wanpeng Li
@ 2017-11-13  0:33 ` Wanpeng Li
  2017-11-13  0:33 ` [PATCH v4 4/4] KVM: X86: Add flush_on_enter before guest enter Wanpeng Li
  3 siblings, 0 replies; 18+ messages in thread
From: Wanpeng Li @ 2017-11-13  0:33 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Radim Krčmář, Peter Zijlstra, Wanpeng Li

From: Wanpeng Li <wanpeng.li@hotmail.com>

Introduce a new bool invalidate_gpa argument to kvm_x86_ops->tlb_flush;
it will be used by later patches to flush only the guest's TLB entries,
without invalidating guest-physical mappings.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
---
 arch/x86/include/asm/kvm_host.h |  2 +-
 arch/x86/kvm/svm.c              | 14 +++++++-------
 arch/x86/kvm/vmx.c              | 21 +++++++++++----------
 arch/x86/kvm/x86.c              |  6 +++---
 4 files changed, 22 insertions(+), 21 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c73e493..b4f7bb1 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -952,7 +952,7 @@ struct kvm_x86_ops {
 	unsigned long (*get_rflags)(struct kvm_vcpu *vcpu);
 	void (*set_rflags)(struct kvm_vcpu *vcpu, unsigned long rflags);
 
-	void (*tlb_flush)(struct kvm_vcpu *vcpu);
+	void (*tlb_flush)(struct kvm_vcpu *vcpu, bool invalidate_gpa);
 
 	void (*run)(struct kvm_vcpu *vcpu);
 	int (*handle_exit)(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 0e68f0b..efaf95f 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -285,7 +285,7 @@ static int vgif = true;
 module_param(vgif, int, 0444);
 
 static void svm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0);
-static void svm_flush_tlb(struct kvm_vcpu *vcpu);
+static void svm_flush_tlb(struct kvm_vcpu *vcpu, bool invalidate_gpa);
 static void svm_complete_interrupts(struct vcpu_svm *svm);
 
 static int nested_svm_exit_handled(struct vcpu_svm *svm);
@@ -2032,7 +2032,7 @@ static int svm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 		return 1;
 
 	if (npt_enabled && ((old_cr4 ^ cr4) & X86_CR4_PGE))
-		svm_flush_tlb(vcpu);
+		svm_flush_tlb(vcpu, true);
 
 	vcpu->arch.cr4 = cr4;
 	if (!npt_enabled)
@@ -2368,7 +2368,7 @@ static void nested_svm_set_tdp_cr3(struct kvm_vcpu *vcpu,
 
 	svm->vmcb->control.nested_cr3 = __sme_set(root);
 	mark_dirty(svm->vmcb, VMCB_NPT);
-	svm_flush_tlb(vcpu);
+	svm_flush_tlb(vcpu, true);
 }
 
 static void nested_svm_inject_npf_exit(struct kvm_vcpu *vcpu,
@@ -3033,7 +3033,7 @@ static bool nested_svm_vmrun(struct vcpu_svm *svm)
 	svm->nested.intercept_exceptions = nested_vmcb->control.intercept_exceptions;
 	svm->nested.intercept            = nested_vmcb->control.intercept;
 
-	svm_flush_tlb(&svm->vcpu);
+	svm_flush_tlb(&svm->vcpu, true);
 	svm->vmcb->control.int_ctl = nested_vmcb->control.int_ctl | V_INTR_MASKING_MASK;
 	if (nested_vmcb->control.int_ctl & V_INTR_MASKING_MASK)
 		svm->vcpu.arch.hflags |= HF_VINTR_MASK;
@@ -4755,7 +4755,7 @@ static int svm_set_tss_addr(struct kvm *kvm, unsigned int addr)
 	return 0;
 }
 
-static void svm_flush_tlb(struct kvm_vcpu *vcpu)
+static void svm_flush_tlb(struct kvm_vcpu *vcpu, bool invalidate_gpa)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
 
@@ -5046,7 +5046,7 @@ static void svm_set_cr3(struct kvm_vcpu *vcpu, unsigned long root)
 
 	svm->vmcb->save.cr3 = __sme_set(root);
 	mark_dirty(svm->vmcb, VMCB_CR);
-	svm_flush_tlb(vcpu);
+	svm_flush_tlb(vcpu, true);
 }
 
 static void set_tdp_cr3(struct kvm_vcpu *vcpu, unsigned long root)
@@ -5060,7 +5060,7 @@ static void set_tdp_cr3(struct kvm_vcpu *vcpu, unsigned long root)
 	svm->vmcb->save.cr3 = kvm_read_cr3(vcpu);
 	mark_dirty(svm->vmcb, VMCB_CR);
 
-	svm_flush_tlb(vcpu);
+	svm_flush_tlb(vcpu, true);
 }
 
 static int is_disabled(void)
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index a6f4f09..e2157d4 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4113,9 +4113,10 @@ static void exit_lmode(struct kvm_vcpu *vcpu)
 
 #endif
 
-static inline void __vmx_flush_tlb(struct kvm_vcpu *vcpu, int vpid)
+static inline void __vmx_flush_tlb(struct kvm_vcpu *vcpu, int vpid,
+				bool invalidate_gpa)
 {
-	if (enable_ept) {
+	if (enable_ept && (invalidate_gpa || !enable_vpid)) {
 		if (!VALID_PAGE(vcpu->arch.mmu.root_hpa))
 			return;
 		ept_sync_context(construct_eptp(vcpu, vcpu->arch.mmu.root_hpa));
@@ -4124,15 +4125,15 @@ static inline void __vmx_flush_tlb(struct kvm_vcpu *vcpu, int vpid)
 	}
 }
 
-static void vmx_flush_tlb(struct kvm_vcpu *vcpu)
+static void vmx_flush_tlb(struct kvm_vcpu *vcpu, bool invalidate_gpa)
 {
-	__vmx_flush_tlb(vcpu, to_vmx(vcpu)->vpid);
+	__vmx_flush_tlb(vcpu, to_vmx(vcpu)->vpid, invalidate_gpa);
 }
 
 static void vmx_flush_tlb_ept_only(struct kvm_vcpu *vcpu)
 {
 	if (enable_ept)
-		vmx_flush_tlb(vcpu);
+		vmx_flush_tlb(vcpu, true);
 }
 
 static void vmx_decache_cr0_guest_bits(struct kvm_vcpu *vcpu)
@@ -4330,7 +4331,7 @@ static void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
 		ept_load_pdptrs(vcpu);
 	}
 
-	vmx_flush_tlb(vcpu);
+	vmx_flush_tlb(vcpu, true);
 	vmcs_writel(GUEST_CR3, guest_cr3);
 }
 
@@ -7959,7 +7960,7 @@ static int handle_invvpid(struct kvm_vcpu *vcpu)
 		return kvm_skip_emulated_instruction(vcpu);
 	}
 
-	__vmx_flush_tlb(vcpu, vmx->nested.vpid02);
+	__vmx_flush_tlb(vcpu, vmx->nested.vpid02, true);
 	nested_vmx_succeed(vcpu);
 
 	return kvm_skip_emulated_instruction(vcpu);
@@ -10611,11 +10612,11 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
 			vmcs_write16(VIRTUAL_PROCESSOR_ID, vmx->nested.vpid02);
 			if (vmcs12->virtual_processor_id != vmx->nested.last_vpid) {
 				vmx->nested.last_vpid = vmcs12->virtual_processor_id;
-				__vmx_flush_tlb(vcpu, to_vmx(vcpu)->nested.vpid02);
+				__vmx_flush_tlb(vcpu, to_vmx(vcpu)->nested.vpid02, true);
 			}
 		} else {
 			vmcs_write16(VIRTUAL_PROCESSOR_ID, vmx->vpid);
-			vmx_flush_tlb(vcpu);
+			vmx_flush_tlb(vcpu, true);
 		}
 
 	}
@@ -11314,7 +11315,7 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
 		 * L1's vpid. TODO: move to a more elaborate solution, giving
 		 * each L2 its own vpid and exposing the vpid feature to L1.
 		 */
-		vmx_flush_tlb(vcpu);
+		vmx_flush_tlb(vcpu, true);
 	}
 	/* Restore posted intr vector. */
 	if (nested_cpu_has_posted_intr(vmcs12))
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 5e63033..41339ee 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6734,10 +6734,10 @@ static void vcpu_scan_ioapic(struct kvm_vcpu *vcpu)
 	kvm_x86_ops->load_eoi_exitmap(vcpu, eoi_exit_bitmap);
 }
 
-static void kvm_vcpu_flush_tlb(struct kvm_vcpu *vcpu)
+static void kvm_vcpu_flush_tlb(struct kvm_vcpu *vcpu, bool invalidate_gpa)
 {
 	++vcpu->stat.tlb_flush;
-	kvm_x86_ops->tlb_flush(vcpu);
+	kvm_x86_ops->tlb_flush(vcpu, invalidate_gpa);
 }
 
 void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
@@ -6794,7 +6794,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 		if (kvm_check_request(KVM_REQ_MMU_SYNC, vcpu))
 			kvm_mmu_sync_roots(vcpu);
 		if (kvm_check_request(KVM_REQ_TLB_FLUSH, vcpu))
-			kvm_vcpu_flush_tlb(vcpu);
+			kvm_vcpu_flush_tlb(vcpu, true);
 		if (kvm_check_request(KVM_REQ_REPORT_TPR_ACCESS, vcpu)) {
 			vcpu->run->exit_reason = KVM_EXIT_TPR_ACCESS;
 			r = 0;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v4 4/4] KVM: X86: Add flush_on_enter before guest enter
  2017-11-13  0:33 [PATCH v4 0/4] KVM: X86: Paravirt remote TLB flush Wanpeng Li
                   ` (2 preceding siblings ...)
  2017-11-13  0:33 ` [PATCH v4 3/4] KVM: X86: introduce invalidate_gpa argument to tlb flush Wanpeng Li
@ 2017-11-13  0:33 ` Wanpeng Li
  3 siblings, 0 replies; 18+ messages in thread
From: Wanpeng Li @ 2017-11-13  0:33 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Radim Krčmář, Peter Zijlstra, Wanpeng Li

From: Wanpeng Li <wanpeng.li@hotmail.com>

When a PV TLB flush guest has indicated that a flush is pending
(KVM_VCPU_SHOULD_FLUSH), flush the TLB before entering the guest.
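
Condensed from the diff below: the check added to record_steal_time()
resets the preempted byte with xchg() and, if the guest requested a flush
while this vCPU was preempted, flushes with invalidate_gpa == false, since
only the guest's linear translations need to go (with EPT and VPID enabled
this becomes a VPID flush rather than an EPT sync, per patch 3):

	if (xchg(&vcpu->arch.st.steal.preempted, KVM_VCPU_NOT_PREEMPTED) ==
			(KVM_VCPU_SHOULD_FLUSH | KVM_VCPU_PREEMPTED))
		/* Guest-requested flush; guest-physical mappings are untouched. */
		kvm_vcpu_flush_tlb(vcpu, false);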

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
---
 arch/x86/kvm/cpuid.c |  3 ++-
 arch/x86/kvm/x86.c   | 21 ++++++++++++++-------
 2 files changed, 16 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 0099e10..2724a5c 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -594,7 +594,8 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
 			     (1 << KVM_FEATURE_ASYNC_PF) |
 			     (1 << KVM_FEATURE_PV_EOI) |
 			     (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT) |
-			     (1 << KVM_FEATURE_PV_UNHALT);
+			     (1 << KVM_FEATURE_PV_UNHALT) |
+			     (1 << KVM_FEATURE_PV_TLB_FLUSH);
 
 		if (sched_info_on())
 			entry->eax |= (1 << KVM_FEATURE_STEAL_TIME);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 41339ee..bf52936 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2104,6 +2104,12 @@ static void kvmclock_reset(struct kvm_vcpu *vcpu)
 	vcpu->arch.pv_time_enabled = false;
 }
 
+static void kvm_vcpu_flush_tlb(struct kvm_vcpu *vcpu, bool invalidate_gpa)
+{
+	++vcpu->stat.tlb_flush;
+	kvm_x86_ops->tlb_flush(vcpu, invalidate_gpa);
+}
+
 static void record_steal_time(struct kvm_vcpu *vcpu)
 {
 	if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
@@ -2113,7 +2119,14 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
 		&vcpu->arch.st.steal, sizeof(struct kvm_steal_time))))
 		return;
 
-	vcpu->arch.st.steal.preempted = KVM_VCPU_NOT_PREEMPTED;
+	if (xchg(&vcpu->arch.st.steal.preempted, KVM_VCPU_NOT_PREEMPTED) ==
+			(KVM_VCPU_SHOULD_FLUSH | KVM_VCPU_PREEMPTED)) {
+		/*
+		 * Do TLB_FLUSH before entering the guest, its passed
+		 * the stage of request checking
+		 */
+		kvm_vcpu_flush_tlb(vcpu, false);
+	}
 
 	if (vcpu->arch.st.steal.version & 1)
 		vcpu->arch.st.steal.version += 1;  /* first time write, random junk */
@@ -6734,12 +6747,6 @@ static void vcpu_scan_ioapic(struct kvm_vcpu *vcpu)
 	kvm_x86_ops->load_eoi_exitmap(vcpu, eoi_exit_bitmap);
 }
 
-static void kvm_vcpu_flush_tlb(struct kvm_vcpu *vcpu, bool invalidate_gpa)
-{
-	++vcpu->stat.tlb_flush;
-	kvm_x86_ops->tlb_flush(vcpu, invalidate_gpa);
-}
-
 void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
 {
 	struct page *page = NULL;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 2/4] KVM: X86: Add paravirt remote TLB flush
  2017-11-13  0:33 ` [PATCH v4 2/4] KVM: X86: Add paravirt remote TLB flush Wanpeng Li
@ 2017-11-13  7:59   ` Peter Zijlstra
  2017-11-13  8:12     ` Wanpeng Li
  2017-11-13  8:04   ` Peter Zijlstra
  1 sibling, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2017-11-13  7:59 UTC (permalink / raw)
  To: Wanpeng Li; +Cc: linux-kernel, kvm, Paolo Bonzini, Radim Krčmář, Wanpeng Li

On Sun, Nov 12, 2017 at 04:33:24PM -0800, Wanpeng Li wrote:
> +static void kvm_flush_tlb_others(const struct cpumask *cpumask,
> +			const struct flush_tlb_info *info)
> +{
> +	u8 state;
> +	int cpu;
> +	struct kvm_steal_time *src;
> +	struct cpumask *flushmask = this_cpu_cpumask_var_ptr(__pv_tlb_mask);
> +
> +	if (unlikely(!flushmask))
> +		return;
> +
> +	cpumask_copy(flushmask, cpumask);
> +	/*
> +	 * We have to call flush only on online vCPUs. And
> +	 * queue flush_on_enter for pre-empted vCPUs
> +	 */
> +	for_each_cpu(cpu, cpumask) {

Should this not iterate flushmask? It's far too early to think, so I'm
not sure this is an actual problem, but it does seem weird.

> +		src = &per_cpu(steal_time, cpu);
> +		state = READ_ONCE(src->preempted);
> +		if ((state & KVM_VCPU_PREEMPTED)) {
> +			if (try_cmpxchg(&src->preempted, &state,
> +				state | KVM_VCPU_SHOULD_FLUSH))
> +				__cpumask_clear_cpu(cpu, flushmask);
> +		}
> +	}
> +
> +	native_flush_tlb_others(flushmask, info);
> +}

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 2/4] KVM: X86: Add paravirt remote TLB flush
  2017-11-13  0:33 ` [PATCH v4 2/4] KVM: X86: Add paravirt remote TLB flush Wanpeng Li
  2017-11-13  7:59   ` Peter Zijlstra
@ 2017-11-13  8:04   ` Peter Zijlstra
  2017-11-13  8:26     ` Wanpeng Li
  1 sibling, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2017-11-13  8:04 UTC (permalink / raw)
  To: Wanpeng Li; +Cc: linux-kernel, kvm, Paolo Bonzini, Radim Krčmář, Wanpeng Li

On Sun, Nov 12, 2017 at 04:33:24PM -0800, Wanpeng Li wrote:
> +static void kvm_flush_tlb_others(const struct cpumask *cpumask,
> +			const struct flush_tlb_info *info)
> +{
> +	u8 state;
> +	int cpu;
> +	struct kvm_steal_time *src;
> +	struct cpumask *flushmask = this_cpu_cpumask_var_ptr(__pv_tlb_mask);
> +
> +	if (unlikely(!flushmask))
> +		return;
> +
> +	cpumask_copy(flushmask, cpumask);
> +	/*
> +	 * We have to call flush only on online vCPUs. And
> +	 * queue flush_on_enter for pre-empted vCPUs
> +	 */
> +	for_each_cpu(cpu, cpumask) {
> +		src = &per_cpu(steal_time, cpu);
> +		state = READ_ONCE(src->preempted);
> +		if ((state & KVM_VCPU_PREEMPTED)) {
> +			if (try_cmpxchg(&src->preempted, &state,
> +				state | KVM_VCPU_SHOULD_FLUSH))
> +				__cpumask_clear_cpu(cpu, flushmask);
> +		}
> +	}

So if at this point a vCPU gets preempted we'll still spin-wait for it,
which is sub-optimal.

I think we can come up with something to get around that 'problem' if
indeed it is a problem. But we can easily do that as follow up patches.
Just let me know if you think its worth spending more time on.

> +	native_flush_tlb_others(flushmask, info);
> +}

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 2/4] KVM: X86: Add paravirt remote TLB flush
  2017-11-13  7:59   ` Peter Zijlstra
@ 2017-11-13  8:12     ` Wanpeng Li
  0 siblings, 0 replies; 18+ messages in thread
From: Wanpeng Li @ 2017-11-13  8:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, kvm, Paolo Bonzini, Radim Krčmář, Wanpeng Li

2017-11-13 15:59 GMT+08:00 Peter Zijlstra <peterz@infradead.org>:
> On Sun, Nov 12, 2017 at 04:33:24PM -0800, Wanpeng Li wrote:
>> +static void kvm_flush_tlb_others(const struct cpumask *cpumask,
>> +                     const struct flush_tlb_info *info)
>> +{
>> +     u8 state;
>> +     int cpu;
>> +     struct kvm_steal_time *src;
>> +     struct cpumask *flushmask = this_cpu_cpumask_var_ptr(__pv_tlb_mask);
>> +
>> +     if (unlikely(!flushmask))
>> +             return;
>> +
>> +     cpumask_copy(flushmask, cpumask);
>> +     /*
>> +      * We have to call flush only on online vCPUs. And
>> +      * queue flush_on_enter for pre-empted vCPUs
>> +      */
>> +     for_each_cpu(cpu, cpumask) {
>
> Should this not iterate flushmask? It's far too early to think, so I'm
> not sure this is an actual problem, but it does seem weird.

Agreed, should be flushmask in next version. :)

Regards,
Wanpeng Li

>
>> +             src = &per_cpu(steal_time, cpu);
>> +             state = READ_ONCE(src->preempted);
>> +             if ((state & KVM_VCPU_PREEMPTED)) {
>> +                     if (try_cmpxchg(&src->preempted, &state,
>> +                             state | KVM_VCPU_SHOULD_FLUSH))
>> +                             __cpumask_clear_cpu(cpu, flushmask);
>> +             }
>> +     }
>> +
>> +     native_flush_tlb_others(flushmask, info);
>> +}

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 2/4] KVM: X86: Add paravirt remote TLB flush
  2017-11-13  8:04   ` Peter Zijlstra
@ 2017-11-13  8:26     ` Wanpeng Li
  2017-11-13 10:46       ` Peter Zijlstra
  0 siblings, 1 reply; 18+ messages in thread
From: Wanpeng Li @ 2017-11-13  8:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, kvm, Paolo Bonzini, Radim Krčmář, Wanpeng Li

2017-11-13 16:04 GMT+08:00 Peter Zijlstra <peterz@infradead.org>:
> On Sun, Nov 12, 2017 at 04:33:24PM -0800, Wanpeng Li wrote:
>> +static void kvm_flush_tlb_others(const struct cpumask *cpumask,
>> +                     const struct flush_tlb_info *info)
>> +{
>> +     u8 state;
>> +     int cpu;
>> +     struct kvm_steal_time *src;
>> +     struct cpumask *flushmask = this_cpu_cpumask_var_ptr(__pv_tlb_mask);
>> +
>> +     if (unlikely(!flushmask))
>> +             return;
>> +
>> +     cpumask_copy(flushmask, cpumask);
>> +     /*
>> +      * We have to call flush only on online vCPUs. And
>> +      * queue flush_on_enter for pre-empted vCPUs
>> +      */
>> +     for_each_cpu(cpu, cpumask) {
>> +             src = &per_cpu(steal_time, cpu);
>> +             state = READ_ONCE(src->preempted);
>> +             if ((state & KVM_VCPU_PREEMPTED)) {
>> +                     if (try_cmpxchg(&src->preempted, &state,
>> +                             state | KVM_VCPU_SHOULD_FLUSH))
>> +                             __cpumask_clear_cpu(cpu, flushmask);
>> +             }
>> +     }
>
> So if at this point a vCPU gets preempted we'll still spin-wait for it,
> which is sub-optimal.
>
> I think we can come up with something to get around that 'problem' if
> indeed it is a problem. But we can easily do that as follow up patches.
> Just let me know if you think its worth spending more time on.

You can post your idea, it is always smart. :) Then we can evaluate
the complexity and gains.

Regards,
Wanpeng Li

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 2/4] KVM: X86: Add paravirt remote TLB flush
  2017-11-13  8:26     ` Wanpeng Li
@ 2017-11-13 10:46       ` Peter Zijlstra
  2017-11-13 13:02         ` Peter Zijlstra
  2017-11-14  6:28         ` Wanpeng Li
  0 siblings, 2 replies; 18+ messages in thread
From: Peter Zijlstra @ 2017-11-13 10:46 UTC (permalink / raw)
  To: Wanpeng Li; +Cc: linux-kernel, kvm, Paolo Bonzini, Radim Krčmář, Wanpeng Li

On Mon, Nov 13, 2017 at 04:26:57PM +0800, Wanpeng Li wrote:
> 2017-11-13 16:04 GMT+08:00 Peter Zijlstra <peterz@infradead.org>:

> > So if at this point a vCPU gets preempted we'll still spin-wait for it,
> > which is sub-optimal.
> >
> > I think we can come up with something to get around that 'problem' if
> > indeed it is a problem. But we can easily do that as follow up patches.
> > Just let me know if you think its worth spending more time on.
> 
> You can post your idea, it is always smart. :) Then we can evaluate
> the complexity and gains.

I'm not sure I have a fully baked idea just yet, but the general idea
would be something like:

 - switch (back) to a dedicated TLB invalidate IPI

 - introduce KVM_VCPU_IPI_PENDING

 - change flush_tlb_others() into something like:

   for_each_cpu(cpu, flushmask) {
	 src = &per_cpu(steal_time, cpu);
	 state = READ_ONCE(src->preempted);
	 do {
		 if (state & KVM_VCPU_PREEMPTED) {
			 if (try_cmpxchg(&src->preempted, &state,
						 state | KVM_VCPU_SHOULD_FLUSH)) {
				 __cpumask_clear_cpu(cpu, flushmask);
				 break;
			 }
		 }
	 } while (!try_cmpxchg(&src->preempted, &state,
				 state | KVM_VCPU_IPI_PENDING));
   }

   apic->send_IPI_mask(flushmask, CALL_TLB_VECTOR);

   for_each_cpu(cpu, flushmask) {
	 src = &per_cpu(steal_time, cpu);
	 smp_cond_load_acquire(&src->preempted, !(VAL & KVM_VCPU_IPI_PENDING);
   }


 - have the TLB invalidate handler do something like:

   state = READ_ONCE(src->preempted);
   if (!(state & KVM_VCPU_IPI_PENDING))
	   return;

   local_flush_tlb();

   do {
   } while (!try_cmpxchg(&src->preempted, &state,
			 state & ~KVM_VCPU_IPI_PENDING));

 - then at VMEXIT time do something like:

   state = READ_ONCE(src->preempted);
   do {
	if (!(state & KVM_VCPU_IPI_PENDING))
		break;
   } while (!try_cmpxchg(&src->preempted, state,
			 (state & ~KVM_VCPU_IPI_PENDING) |
			 KVM_VCPU_SHOULD_FLUSH));

   and clear any possible pending TLB_VECTOR in the guest state to avoid
   raising that IPI spuriously on enter again.


This way the preemption will clear the IPI_PENDING and the
flush_others() wait loop will terminate.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 2/4] KVM: X86: Add paravirt remote TLB flush
  2017-11-13 10:46       ` Peter Zijlstra
@ 2017-11-13 13:02         ` Peter Zijlstra
  2017-11-14  6:10           ` Wanpeng Li
  2017-11-14  6:28         ` Wanpeng Li
  1 sibling, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2017-11-13 13:02 UTC (permalink / raw)
  To: Wanpeng Li; +Cc: linux-kernel, kvm, Paolo Bonzini, Radim Krčmář, Wanpeng Li

On Mon, Nov 13, 2017 at 11:46:34AM +0100, Peter Zijlstra wrote:
> On Mon, Nov 13, 2017 at 04:26:57PM +0800, Wanpeng Li wrote:
> > 2017-11-13 16:04 GMT+08:00 Peter Zijlstra <peterz@infradead.org>:
> 
> > > So if at this point a vCPU gets preempted we'll still spin-wait for it,
> > > which is sub-optimal.
> > >
> > > I think we can come up with something to get around that 'problem' if
> > > indeed it is a problem. But we can easily do that as follow up patches.
> > > Just let me know if you think its worth spending more time on.
> > 
> > You can post your idea, it is always smart. :) Then we can evaluate
> > the complexity and gains.
> 
> I'm not sure I have a fully baked idea just yet, but the general idea
> would be something like:
> 
>  - switch (back) to a dedicated TLB invalidate IPI

Just for PV that is; the !PV code can continue doing what it does today.

>  - introduce KVM_VCPU_IPI_PENDING
> 
>  - change flush_tlb_others() into something like:
> 
>    for_each_cpu(cpu, flushmask) {
> 	 src = &per_cpu(steal_time, cpu);
> 	 state = READ_ONCE(src->preempted);
> 	 do {
> 		 if (state & KVM_VCPU_PREEMPTED) {
> 			 if (try_cmpxchg(&src->preempted, &state,
> 						 state | KVM_VCPU_SHOULD_FLUSH)) {
> 				 __cpumask_clear_cpu(cpu, flushmask);
> 				 break;
> 			 }
> 		 }
> 	 } while (!try_cmpxchg(&src->preempted, &state,
> 				 state | KVM_VCPU_IPI_PENDING));

That can be written like:

	do {
		if (state & KVM_VCPU_PREEMPTED)
			new_state = state | KVM_VCPU_SHOULD_FLUSH;
		else
			new_state = state | KVM_VCPU_IPI_PENDING;
	} while (!try_cmpxchg(&src->preempted, state, new_state);

	if (new_state & KVM_VCPU_IPI_PENDING)
		__cpumask_clear_cpu(cpu, flushmask);

>    }
> 
>    apic->send_IPI_mask(flushmask, CALL_TLB_VECTOR);
> 
>    for_each_cpu(cpu, flushmask) {
> 	 src = &per_cpu(steal_time, cpu);

	/*
	 * The ACQUIRE pairs with the cmpxchg clearing IPI_PENDING,
	 * which is either the TLB IPI handler, or the VMEXIT path.
	 * It ensures that the invalidate happens-before.
	 */
> 	 smp_cond_load_acquire(&src->preempted, !(VAL & KVM_VCPU_IPI_PENDING);
>    }

And here we wait for completion of the invalidate; but because of the
VMEXIT change below, this will never stall on a !running vCPU.

Note that PLE will not help (much) here; without this extra IPI_PENDING
state and the VMEXIT transferring it to SHOULD_FLUSH, this vCPU's progress
will be held up until all vCPUs you've IPI'd have run the IPI handler,
which in the worst case is still a very long time.

>  - have the TLB invalidate handler do something like:
> 
>    state = READ_ONCE(src->preempted);
>    if (!(state & KVM_VCPU_IPI_PENDING))
> 	   return;
> 
>    local_flush_tlb();
> 
>    do {
>    } while (!try_cmpxchg(&src->preempted, &state,
> 			 state & ~KVM_VCPU_IPI_PENDING));

That needs to be:

	/*
	 * Clear KVM_VCPU_IPI_PENDING to 'complete' flush_tlb_others().
	 */
	do {
		/*
		 * VMEXIT could have cleared this for us, in which case
		 * we're done.
		 */
		if (!(state & KVM_VCPU_IPI_PENDING))
			return;

	} while (!try_cmpxchg(&src->preempted, state,
				state & ~KVM_VCPU_IPI_PENDING));

>  - then at VMEXIT time do something like:
> 
	/*
	 * If we have IPI_PENDING set at VMEXIT time, transfer it to
	 * SHOULD_FLUSH. Clearing IPI_PENDING here allows the
	 * flush_others() vCPU to continue while the SHOULD_FLUSH
	 * guarantees this vCPU will flush TLBs before it continues
	 * execution.
	 */

>    state = READ_ONCE(src->preempted);
>    do {
> 	if (!(state & KVM_VCPU_IPI_PENDING))
> 		break;
>    } while (!try_cmpxchg(&src->preempted, state,
> 			 (state & ~KVM_VCPU_IPI_PENDING) |
> 			 KVM_VCPU_SHOULD_FLUSH));
> 
>    and clear any possible pending TLB_VECTOR in the guest state to avoid
>    raising that IPI spuriously on enter again.
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 2/4] KVM: X86: Add paravirt remote TLB flush
  2017-11-13 13:02         ` Peter Zijlstra
@ 2017-11-14  6:10           ` Wanpeng Li
  2017-11-14  7:35             ` Peter Zijlstra
  0 siblings, 1 reply; 18+ messages in thread
From: Wanpeng Li @ 2017-11-14  6:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, kvm, Paolo Bonzini, Radim Krčmář, Wanpeng Li

2017-11-13 21:02 GMT+08:00 Peter Zijlstra <peterz@infradead.org>:
> On Mon, Nov 13, 2017 at 11:46:34AM +0100, Peter Zijlstra wrote:
>> On Mon, Nov 13, 2017 at 04:26:57PM +0800, Wanpeng Li wrote:
>> > 2017-11-13 16:04 GMT+08:00 Peter Zijlstra <peterz@infradead.org>:
>>
>> > > So if at this point a vCPU gets preempted we'll still spin-wait for it,
>> > > which is sub-optimal.
>> > >
>> > > I think we can come up with something to get around that 'problem' if
>> > > indeed it is a problem. But we can easily do that as follow up patches.
>> > > Just let me know if you think its worth spending more time on.
>> >
>> > You can post your idea, it is always smart. :) Then we can evaluate
>> > the complexity and gains.
>>
>> I'm not sure I have a fully baked idea just yet, but the general idea
>> would be something like:
>>
>>  - switch (back) to a dedicated TLB invalidate IPI
>
> Just for PV that is; the !PV code can continue doing what it does today.
>
>>  - introduce KVM_VCPU_IPI_PENDING
>>
>>  - change flush_tlb_others() into something like:
>>
>>    for_each_cpu(cpu, flushmask) {
>>        src = &per_cpu(steal_time, cpu);
>>        state = READ_ONCE(src->preempted);
>>        do {
>>                if (state & KVM_VCPU_PREEMPTED) {
>>                        if (try_cmpxchg(&src->preempted, &state,
>>                                                state | KVM_VCPU_SHOULD_FLUSH)) {
>>                                __cpumask_clear_cpu(cpu, flushmask);
>>                                break;
>>                        }
>>                }
>>        } while (!try_cmpxchg(&src->preempted, &state,
>>                                state | KVM_VCPU_IPI_PENDING));
>
> That can be written like:
>
>         do {
>                 if (state & KVM_VCPU_PREEMPTED)
>                         new_state = state | KVM_VCPU_SHOULD_FLUSH;
>                 else
>                         new_state = state | KVM_VCPU_IPI_PENDING;
>         } while (!try_cmpxchg(&src->preempted, state, new_state);
>
>         if (new_state & KVM_VCPU_IPI_PENDING)

Should be new_state & KVM_VCPU_SHOULD_FLUSH I think.

Regards,
Wanpeng Li

>                 __cpumask_clear_cpu(cpu, flushmask);
>
>>    }
>>
>>    apic->send_IPI_mask(flushmask, CALL_TLB_VECTOR);
>>
>>    for_each_cpu(cpu, flushmask) {
>>        src = &per_cpu(steal_time, cpu);
>
>         /*
>          * The ACQUIRE pairs with the cmpxchg clearing IPI_PENDING,
>          * which is either the TLB IPI handler, or the VMEXIT path.
>          * It ensures that the invalidate happens-before.
>          */
>>        smp_cond_load_acquire(&src->preempted, !(VAL & KVM_VCPU_IPI_PENDING);
>>    }
>
> And here we wait for completion of the invalidate; but because of the
> VMEXIT change below, this will never stall on a !running vCPU.
>
> Note that PLE will not help (much) here; without this extra IPI_PENDING
> state and the VMEXIT transferring it to SHOULD_FLUSH, this vCPU's progress
> will be held up until all vCPUs you've IPI'd have run the IPI handler,
> which in the worst case is still a very long time.
>
>>  - have the TLB invalidate handler do something like:
>>
>>    state = READ_ONCE(src->preempted);
>>    if (!(state & KVM_VCPU_IPI_PENDING))
>>          return;
>>
>>    local_flush_tlb();
>>
>>    do {
>>    } while (!try_cmpxchg(&src->preempted, &state,
>>                        state & ~KVM_VCPU_IPI_PENDING));
>
> That needs to be:
>
>         /*
>          * Clear KVM_VCPU_IPI_PENDING to 'complete' flush_tlb_others().
>          */
>         do {
>                 /*
>                  * VMEXIT could have cleared this for us, in which case
>                  * we're done.
>                  */
>                 if (!(state & KVM_VCPU_IPI_PENDING))
>                         return;
>
>         } while (!try_cmpxchg(&src->preempted, state,
>                                 state & ~KVM_VCPU_IPI_PENDING));
>
>>  - then at VMEXIT time do something like:
>>
>         /*
>          * If we have IPI_PENDING set at VMEXIT time, transfer it to
>          * SHOULD_FLUSH. Clearing IPI_PENDING here allows the
>          * flush_others() vCPU to continue while the SHOULD_FLUSH
>          * guarantees this vCPU will flush TLBs before it continues
>          * execution.
>          */
>
>>    state = READ_ONCE(src->preempted);
>>    do {
>>       if (!(state & KVM_VCPU_IPI_PENDING))
>>               break;
>>    } while (!try_cmpxchg(&src->preempted, state,
>>                        (state & ~KVM_VCPU_IPI_PENDING) |
>>                        KVM_VCPU_SHOULD_FLUSH));
>>
>>    and clear any possible pending TLB_VECTOR in the guest state to avoid
>>    raising that IPI spuriously on enter again.
>>
>
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 2/4] KVM: X86: Add paravirt remote TLB flush
  2017-11-13 10:46       ` Peter Zijlstra
  2017-11-13 13:02         ` Peter Zijlstra
@ 2017-11-14  6:28         ` Wanpeng Li
  2017-11-14  7:56           ` Peter Zijlstra
  1 sibling, 1 reply; 18+ messages in thread
From: Wanpeng Li @ 2017-11-14  6:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, kvm, Paolo Bonzini, Radim Krčmář, Wanpeng Li

2017-11-13 18:46 GMT+08:00 Peter Zijlstra <peterz@infradead.org>:
> On Mon, Nov 13, 2017 at 04:26:57PM +0800, Wanpeng Li wrote:
>> 2017-11-13 16:04 GMT+08:00 Peter Zijlstra <peterz@infradead.org>:
>
>> > So if at this point a vCPU gets preempted we'll still spin-wait for it,
>> > which is sub-optimal.
>> >
>> > I think we can come up with something to get around that 'problem' if
>> > indeed it is a problem. But we can easily do that as follow up patches.
>> > Just let me know if you think its worth spending more time on.
>>
>> You can post your idea, it is always smart. :) Then we can evaluate
>> the complexity and gains.
>
> I'm not sure I have a fully baked idea just yet, but the general idea
> would be something like:
>
>  - switch (back) to a dedicated TLB invalidate IPI
>
>  - introduce KVM_VCPU_IPI_PENDING
>
>  - change flush_tlb_others() into something like:
>
>    for_each_cpu(cpu, flushmask) {
>          src = &per_cpu(steal_time, cpu);
>          state = READ_ONCE(src->preempted);
>          do {
>                  if (state & KVM_VCPU_PREEMPTED) {
>                          if (try_cmpxchg(&src->preempted, &state,
>                                                  state | KVM_VCPU_SHOULD_FLUSH)) {
>                                  __cpumask_clear_cpu(cpu, flushmask);
>                                  break;
>                          }
>                  }
>          } while (!try_cmpxchg(&src->preempted, &state,
>                                  state | KVM_VCPU_IPI_PENDING));
>    }
>
>    apic->send_IPI_mask(flushmask, CALL_TLB_VECTOR);
>
>    for_each_cpu(cpu, flushmask) {
>          src = &per_cpu(steal_time, cpu);
>          smp_cond_load_acquire(&src->preempted, !(VAL & KVM_VCPU_IPI_PENDING);
>    }
>
>
>  - have the TLB invalidate handler do something like:
>
>    state = READ_ONCE(src->preempted);
>    if (!(state & KVM_VCPU_IPI_PENDING))
>            return;
>
>    local_flush_tlb();
>
>    do {
>    } while (!try_cmpxchg(&src->preempted, &state,
>                          state & ~KVM_VCPU_IPI_PENDING));

There are a lot of cases handled by flush_tlb_func_remote() ->
flush_tlb_func_common(), so I'm afraid of leaving a hole.

Regards,
Wanpeng Li

>
>  - then at VMEXIT time do something like:
>
>    state = READ_ONCE(src->preempted);
>    do {
>         if (!(state & KVM_VCPU_IPI_PENDING))
>                 break;
>    } while (!try_cmpxchg(&src->preempted, state,
>                          (state & ~KVM_VCPU_IPI_PENDING) |
>                          KVM_VCPU_SHOULD_FLUSH));
>
>    and clear any possible pending TLB_VECTOR in the guest state to avoid
>    raising that IPI spuriously on enter again.
>
>
> This way the preemption will clear the IPI_PENDING and the
> flush_others() wait loop will terminate.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 2/4] KVM: X86: Add paravirt remote TLB flush
  2017-11-14  6:10           ` Wanpeng Li
@ 2017-11-14  7:35             ` Peter Zijlstra
  0 siblings, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2017-11-14  7:35 UTC (permalink / raw)
  To: Wanpeng Li; +Cc: linux-kernel, kvm, Paolo Bonzini, Radim Krčmář, Wanpeng Li

On Tue, Nov 14, 2017 at 02:10:16PM +0800, Wanpeng Li wrote:
> 2017-11-13 21:02 GMT+08:00 Peter Zijlstra <peterz@infradead.org>:
> > That can be written like:
> >
> >         do {
> >                 if (state & KVM_VCPU_PREEMPTED)
> >                         new_state = state | KVM_VCPU_SHOULD_FLUSH;
> >                 else
> >                         new_state = state | KVM_VCPU_IPI_PENDING;
> >         } while (!try_cmpxchg(&src->preempted, state, new_state);
> >
> >         if (new_state & KVM_VCPU_IPI_PENDING)
> 
> Should be new_state & KVM_VCPU_SHOULD_FLUSH I think.

Quite so indeed.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 2/4] KVM: X86: Add paravirt remote TLB flush
  2017-11-14  6:28         ` Wanpeng Li
@ 2017-11-14  7:56           ` Peter Zijlstra
  0 siblings, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2017-11-14  7:56 UTC (permalink / raw)
  To: Wanpeng Li; +Cc: linux-kernel, kvm, Paolo Bonzini, Radim Krčmář, Wanpeng Li

On Tue, Nov 14, 2017 at 02:28:56PM +0800, Wanpeng Li wrote:
> >  - have the TLB invalidate handler do something like:
> >
> >    state = READ_ONCE(src->preempted);
> >    if (!(state & KVM_VCPU_IPI_PENDING))
> >            return;
> >
> >    local_flush_tlb();
> >
> >    do {
> >    } while (!try_cmpxchg(&src->preempted, &state,
> >                          state & ~KVM_VCPU_IPI_PENDING));
> 
> There are a lot of cases handled by flush_tlb_func_remote() ->
> flush_tlb_func_common(), so I'm afraid of leaving a hole.

Sure, just fix the handler to do what must be done. The above was merely
a sketch. The important part is to only clear IPI_PENDING after we do
the actual flushing, since the caller is waiting for that bit.


So flush_tlb_others() has two callers:

 - arch_tlbbatch_flush() -- info::end = TLB_FLUSH_ALL
 - flush_tlb_mm_range()  -- info::mm = mm

native_flush_tlb_others() does smp_call_function_many(
.func=flush_tlb_func_remote) which in turn calls flush_tlb_func_common(
.local=false, .reason=TLB_REMOTE_SHOOTDOWN).


So something like:

	struct flush_tlb_info info = {
		.start = 0,
		.end = TLB_FLUSH_ALL,
	};

	flush_tlb_func_common(&info, false, TLB_REMOTE_SHOOTDOWN);

should work for the new IPI. It 'upgrades' all ranges to full flushes,
but meh.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 1/4] KVM: X86: Add vCPU running/preempted state
  2017-11-13  0:33 ` [PATCH v4 1/4] KVM: X86: Add vCPU running/preempted state Wanpeng Li
@ 2017-11-14 16:45   ` David Hildenbrand
  2017-11-14 17:03     ` Peter Zijlstra
  0 siblings, 1 reply; 18+ messages in thread
From: David Hildenbrand @ 2017-11-14 16:45 UTC (permalink / raw)
  To: Wanpeng Li, linux-kernel, kvm
  Cc: Paolo Bonzini, Radim Krčmář, Peter Zijlstra, Wanpeng Li

On 13.11.2017 01:33, Wanpeng Li wrote:
> From: Wanpeng Li <wanpeng.li@hotmail.com>
> 
> This patch reuses the preempted field in kvm_steal_time to export vCPU
> running/preempted information from the host to the guest. This enables
> the guest to send IPIs only to running vCPUs and to set a flag for
> preempted vCPUs, so it does not end up waiting for vCPUs that are not
> running.
> 
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Radim Krčmář <rkrcmar@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
> ---
>  arch/x86/include/uapi/asm/kvm_para.h | 3 +++
>  arch/x86/kernel/kvm.c                | 2 +-
>  arch/x86/kvm/x86.c                   | 4 ++--
>  3 files changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
> index 554aa8f..bf17b30 100644
> --- a/arch/x86/include/uapi/asm/kvm_para.h
> +++ b/arch/x86/include/uapi/asm/kvm_para.h
> @@ -51,6 +51,9 @@ struct kvm_steal_time {
>  	__u32 pad[11];
>  };
>  
> +#define KVM_VCPU_NOT_PREEMPTED      (0 << 0)
> +#define KVM_VCPU_PREEMPTED          (1 << 0)

These should have a prefix that makes it obvious that they are used
for kvm_steal_time/preempted.

What about renaming preempted to "flags" or something like that?
Then we could have

KVM_STEAL_TIME_(FLAG_)PREEMPTED
KVM_STEAL_TIME_(FLAG_)NOT_PREEMPTED

> +
>  #define KVM_CLOCK_PAIRING_WALLCLOCK 0
>  struct kvm_clock_pairing {
>  	__s64 sec;
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index 8bb9594..1b1b641 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -608,7 +608,7 @@ __visible bool __kvm_vcpu_is_preempted(long cpu)
>  {
>  	struct kvm_steal_time *src = &per_cpu(steal_time, cpu);
>  
> -	return !!src->preempted;
> +	return !!(src->preempted & KVM_VCPU_PREEMPTED);
>  }
>  PV_CALLEE_SAVE_REGS_THUNK(__kvm_vcpu_is_preempted);
>  
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 03869eb..5e63033 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2113,7 +2113,7 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
>  		&vcpu->arch.st.steal, sizeof(struct kvm_steal_time))))
>  		return;
>  
> -	vcpu->arch.st.steal.preempted = 0;
> +	vcpu->arch.st.steal.preempted = KVM_VCPU_NOT_PREEMPTED;
>  
>  	if (vcpu->arch.st.steal.version & 1)
>  		vcpu->arch.st.steal.version += 1;  /* first time write, random junk */
> @@ -2884,7 +2884,7 @@ static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
>  	if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
>  		return;
>  
> -	vcpu->arch.st.steal.preempted = 1;
> +	vcpu->arch.st.steal.preempted = KVM_VCPU_PREEMPTED;
>  
>  	kvm_write_guest_offset_cached(vcpu->kvm, &vcpu->arch.st.stime,
>  			&vcpu->arch.st.steal.preempted,
> 


-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 1/4] KVM: X86: Add vCPU running/preempted state
  2017-11-14 16:45   ` David Hildenbrand
@ 2017-11-14 17:03     ` Peter Zijlstra
  2017-11-14 17:04       ` Paolo Bonzini
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2017-11-14 17:03 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Wanpeng Li, linux-kernel, kvm, Paolo Bonzini,
	Radim Krčmář,
	Wanpeng Li

On Tue, Nov 14, 2017 at 05:45:30PM +0100, David Hildenbrand wrote:
> KVM_STEAL_TIME_(FLAG_)PREEMPTED
> KVM_STEAL_TIME_(FLAG_)NOT_PREEMPTED

Shees, and I thought the KVM_CPU_ crap was already long winded.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 1/4] KVM: X86: Add vCPU running/preempted state
  2017-11-14 17:03     ` Peter Zijlstra
@ 2017-11-14 17:04       ` Paolo Bonzini
  0 siblings, 0 replies; 18+ messages in thread
From: Paolo Bonzini @ 2017-11-14 17:04 UTC (permalink / raw)
  To: Peter Zijlstra, David Hildenbrand
  Cc: Wanpeng Li, linux-kernel, kvm, Radim Krčmář, Wanpeng Li

On 14/11/2017 18:03, Peter Zijlstra wrote:
>> KVM_STEAL_TIME_(FLAG_)PREEMPTED
>> KVM_STEAL_TIME_(FLAG_)NOT_PREEMPTED
> Shees, and I thought the KVM_CPU_ crap was already long winded.

Yeah, it's too long.  KVM_VCPU_* is not a great name, but it's clear and
concise.

Paolo

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2017-11-14 17:04 UTC | newest]

Thread overview: 18+ messages
2017-11-13  0:33 [PATCH v4 0/4] KVM: X86: Paravirt remote TLB flush Wanpeng Li
2017-11-13  0:33 ` [PATCH v4 1/4] KVM: X86: Add vCPU running/preempted state Wanpeng Li
2017-11-14 16:45   ` David Hildenbrand
2017-11-14 17:03     ` Peter Zijlstra
2017-11-14 17:04       ` Paolo Bonzini
2017-11-13  0:33 ` [PATCH v4 2/4] KVM: X86: Add paravirt remote TLB flush Wanpeng Li
2017-11-13  7:59   ` Peter Zijlstra
2017-11-13  8:12     ` Wanpeng Li
2017-11-13  8:04   ` Peter Zijlstra
2017-11-13  8:26     ` Wanpeng Li
2017-11-13 10:46       ` Peter Zijlstra
2017-11-13 13:02         ` Peter Zijlstra
2017-11-14  6:10           ` Wanpeng Li
2017-11-14  7:35             ` Peter Zijlstra
2017-11-14  6:28         ` Wanpeng Li
2017-11-14  7:56           ` Peter Zijlstra
2017-11-13  0:33 ` [PATCH v4 3/4] KVM: X86: introduce invalidate_gpa argument to tlb flush Wanpeng Li
2017-11-13  0:33 ` [PATCH v4 4/4] KVM: X86: Add flush_on_enter before guest enter Wanpeng Li
