kvm.vger.kernel.org archive mirror
* [RFC PATCH 0/2] Mitigating Excessive Pause-Loop Exiting in VM-Agnostic KVM
From: Kenta Ishiguro @ 2021-04-21 15:08 UTC
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Hildenbrand, kvm, linux-kernel, pl, kono,
	Kenta Ishiguro

Dear KVM developers and maintainers,

In our research work presented last week at the VEE 2021 conference [1], we
found that many consecutive Pause-Loop-Exiting (PLE) events occur
due to three problems we have identified: 1) Linux CFS ignores hints from
KVM; 2) IPI-receiver vCPUs in user mode are not boosted; 3) an IPI receiver
that has halted is always a candidate for boost.  We have introduced two
mitigations against these problems.

To solve problem (1), patch 1 increases the vruntime of the yielding vCPU
so that it passes the check `if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next,
left) < 1)` in `struct sched_entity *pick_next_entity()` if the cfs_rq's
skip and next buddies are both vCPUs of the same VM. To keep fairness, it
does not prioritize the guest VM that causes PLE; however, it improves
performance by eliminating unnecessary PLE events. We have also confirmed
that `yield_to_task_fair` is called only from KVM.
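
For reference, the check above relies on wakeup_preempt_entity(), which
compares the vruntime gap between two entities against the wakeup
granularity. As in kernel/sched/fair.c of v5.6 (shown for context only,
comments added; this is not part of the patches):

static int
wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
{
	/* How far 'se' is behind 'curr' in virtual runtime. */
	s64 gran, vdiff = curr->vruntime - se->vruntime;

	if (vdiff <= 0)
		return -1;	/* 'curr' is not ahead of 'se' at all */

	gran = wakeup_gran(se);
	if (vdiff > gran)
		return 1;	/* 'se' deserves to preempt 'curr' */

	return 0;		/* within one wakeup granularity */
}

Patch 1 sets the yielding vCPU's vruntime to next_se->vruntime -
wakeup_gran(yield_se), so vdiff becomes exactly one granularity, the
function returns 0 (< 1), and pick_next_entity() selects the next buddy.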

To solve problems (2) and (3), patch 2 monitors IPI communication between
vCPUs and leverages the relationship between vCPUs to select boost
candidates.  The "[PATCH] KVM: Boost vCPU candidiate in user mode which is
delivering interrupt" patch
(https://lore.kernel.org/kvm/CANRm+Cy-78UnrkX8nh5WdHut2WW5NU=UL84FRJnUNjsAPK+Uww@mail.gmail.com/T/)
seems to be effective for (2), although it uses only the IPI-receiver
information.
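
In essence, each vCPU keeps a bitmask of the vCPUs that sent it an IPI
while it was scheduled out, and kvm_vcpu_on_spin() prefers candidates
whose mask contains the spinning vCPU. A minimal sketch of the idea
(field name as in patch 2 below; `sender` is shorthand for the vCPU
issuing the IPI, and the shift is written with 1ULL here, a suffix the
posted patch omits, as pointed out in the review below):

	/* IPI-sender side: remember who poked the scheduled-out target */
	dest_vcpu->ipi_received |= 1ULL << sender->vcpu_id;

	/* directed-yield side: prefer vCPUs that 'me' is waiting on */
	if (READ_ONCE(vcpu->ipi_received) & (1ULL << me->vcpu_id))
		; /* vcpu is a strong boost candidate */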

Our approach reduces the total number of PLE events by up to 87.6% in four
8-vCPU VMs in an over-subscribed scenario with Linux kernel 5.6.0. Please
find the patches below.

We would greatly appreciate your valuable feedback on our approach and
patch.

Thank you very much for your consideration.
Kenta Ishiguro

[1] Kenta Ishiguro, Naoki Yasuno, Pierre-Louis Aublin, and Kenji Kono.
    "Mitigating excessive vCPU spinning in VM-agnostic KVM".
    In Proceedings of the 17th ACM SIGPLAN/SIGOPS International Conference
    on Virtual Execution Environments (VEE 2021).
    Association for Computing Machinery, New York,
    NY, USA, 139--152.  https://dl.acm.org/doi/abs/10.1145/3453933.3454020

Kenta Ishiguro (2):
  Prevent CFS from ignoring boost requests from KVM
  Boost vCPUs based on IPI-sender and receiver information

 arch/x86/kvm/lapic.c     | 14 ++++++++++++++
 arch/x86/kvm/vmx/vmx.c   |  2 ++
 include/linux/kvm_host.h |  5 +++++
 kernel/sched/fair.c      | 31 +++++++++++++++++++++++++++++++
 virt/kvm/kvm_main.c      | 26 ++++++++++++++++++++++++--
 5 files changed, 76 insertions(+), 2 deletions(-)

-- 
2.30.2


* [RFC PATCH 1/2] Prevent CFS from ignoring boost requests from KVM
From: Kenta Ishiguro @ 2021-04-21 15:08 UTC
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Hildenbrand, kvm, linux-kernel, pl, kono,
	Kenta Ishiguro

This commit increases the vruntime of the yielding vCPU so that the vCPU
it yields to is picked instead, when the two vCPUs belong to the same VM.
This change avoids situations where scheduling the boosted vCPU would be
too unfair.

Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Kenta Ishiguro <kentaishiguro@sslab.ics.keio.ac.jp>
---
 kernel/sched/fair.c | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 794c2cb945f8..2908da3f4c77 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7204,9 +7204,36 @@ static void yield_task_fair(struct rq *rq)
 	set_skip_buddy(se);
 }
 
+static void deboost_yield_task_vruntime(struct sched_entity *next_se, struct sched_entity *yield_se)
+{
+	if (wakeup_preempt_entity(next_se, yield_se) < 1)
+		return;
+	yield_se->vruntime = next_se->vruntime - wakeup_gran(yield_se);
+}
+
+static void deboost_yield_task(struct sched_entity *next_se, struct sched_entity *yield_se)
+{
+	struct sched_entity *next_se_base = next_se;
+
+	if (rq_of(cfs_rq_of(yield_se)) != rq_of(cfs_rq_of(next_se)))
+		return;
+
+	for_each_sched_entity(yield_se) {
+		next_se = next_se_base;
+		for_each_sched_entity(next_se) {
+			if (cfs_rq_of(yield_se) == cfs_rq_of(next_se)) {
+				deboost_yield_task_vruntime(next_se, yield_se);
+				return;
+			}
+		}
+	}
+}
+
 static bool yield_to_task_fair(struct rq *rq, struct task_struct *p)
 {
 	struct sched_entity *se = &p->se;
+	struct task_struct *curr;
+	struct sched_entity *yield_se;
 
 	/* throttled hierarchies are not runnable */
 	if (!se->on_rq || throttled_hierarchy(cfs_rq_of(se)))
@@ -7215,6 +7242,10 @@ static bool yield_to_task_fair(struct rq *rq, struct task_struct *p)
 	/* Tell the scheduler that we'd really like pse to run next. */
 	set_next_buddy(se);
 
+	curr = rq->curr;
+	yield_se = &curr->se;
+	deboost_yield_task(se, yield_se);
+
 	yield_task_fair(rq);
 
 	return true;
-- 
2.30.2


* [RFC PATCH 2/2] Boost vCPUs based on IPI-sender and receiver information
From: Kenta Ishiguro @ 2021-04-21 15:08 UTC
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Hildenbrand, kvm, linux-kernel, pl, kono,
	Kenta Ishiguro

This commit monitors IPI communication between vCPUs and leverages the
relationship between vCPUs to select boost candidates.

Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Kenta Ishiguro <kentaishiguro@sslab.ics.keio.ac.jp>
---
 arch/x86/kvm/lapic.c     | 14 ++++++++++++++
 arch/x86/kvm/vmx/vmx.c   |  2 ++
 include/linux/kvm_host.h |  5 +++++
 virt/kvm/kvm_main.c      | 26 ++++++++++++++++++++++++--
 4 files changed, 45 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index cc369b9ad8f1..c8d967ddecf9 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1269,6 +1269,18 @@ void kvm_apic_set_eoi_accelerated(struct kvm_vcpu *vcpu, int vector)
 }
 EXPORT_SYMBOL_GPL(kvm_apic_set_eoi_accelerated);
 
+static void mark_ipi_receiver(struct kvm_lapic *apic, struct kvm_lapic_irq *irq)
+{
+	struct kvm_vcpu *dest_vcpu;
+	u64 prev_ipi_received;
+
+	dest_vcpu = kvm_get_vcpu_by_id(apic->vcpu->kvm, irq->dest_id);
+	if (READ_ONCE(dest_vcpu->sched_outed)) {
+		prev_ipi_received = READ_ONCE(dest_vcpu->ipi_received);
+		WRITE_ONCE(dest_vcpu->ipi_received, prev_ipi_received | (1 << apic->vcpu->vcpu_id));
+	}
+}
+
 void kvm_apic_send_ipi(struct kvm_lapic *apic, u32 icr_low, u32 icr_high)
 {
 	struct kvm_lapic_irq irq;
@@ -1287,6 +1299,8 @@ void kvm_apic_send_ipi(struct kvm_lapic *apic, u32 icr_low, u32 icr_high)
 
 	trace_kvm_apic_ipi(icr_low, irq.dest_id);
 
+	mark_ipi_receiver(apic, &irq);
+
 	kvm_irq_delivery_to_apic(apic->vcpu->kvm, apic, &irq, NULL);
 }
 
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 29b40e092d13..ced50935a38b 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6718,6 +6718,8 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
 		vmcs_write32(PLE_WINDOW, vmx->ple_window);
 	}
 
+	WRITE_ONCE(vcpu->ipi_received, 0);
+
 	/*
 	 * We did this in prepare_switch_to_guest, because it needs to
 	 * be within srcu_read_lock.
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 1b65e7204344..6726aeec03e7 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -320,6 +320,11 @@ struct kvm_vcpu {
 #endif
 	bool preempted;
 	bool ready;
+	bool sched_outed;
+	/*
+	 * The current implementation of strict boost supports up to 64 vCPUs
+	 */
+	u64 ipi_received;
 	struct kvm_vcpu_arch arch;
 	struct kvm_dirty_ring dirty_ring;
 };
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 383df23514b9..08e629957e7e 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -413,6 +413,10 @@ static void kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
 	kvm_vcpu_set_dy_eligible(vcpu, false);
 	vcpu->preempted = false;
 	vcpu->ready = false;
+
+	vcpu->sched_outed = false;
+	vcpu->ipi_received = 0;
+
 	preempt_notifier_init(&vcpu->preempt_notifier, &kvm_preempt_ops);
 }
 
@@ -3011,6 +3015,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
 	int try = 3;
 	int pass;
 	int i;
+	u64 prev_ipi_received;
 
 	kvm_vcpu_set_in_spin_loop(me, true);
 	/*
@@ -3031,12 +3036,25 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
 				continue;
 			if (vcpu == me)
 				continue;
+			prev_ipi_received = READ_ONCE(vcpu->ipi_received);
+			if (!READ_ONCE(vcpu->preempted) &&
+			    !(prev_ipi_received & (1 << me->vcpu_id))) {
+				WRITE_ONCE(vcpu->ipi_received,
+					   prev_ipi_received | (1 << me->vcpu_id));
+				continue;
+			}
 			if (rcuwait_active(&vcpu->wait) &&
 			    !vcpu_dy_runnable(vcpu))
 				continue;
 			if (READ_ONCE(vcpu->preempted) && yield_to_kernel_mode &&
-				!kvm_arch_vcpu_in_kernel(vcpu))
-				continue;
+				!kvm_arch_vcpu_in_kernel(vcpu)) {
+				prev_ipi_received = READ_ONCE(vcpu->ipi_received);
+				if (!(prev_ipi_received & (1 << me->vcpu_id))) {
+					WRITE_ONCE(vcpu->ipi_received,
+						   prev_ipi_received | (1 << me->vcpu_id));
+					continue;
+				}
+			}
 			if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
 				continue;
 
@@ -4859,6 +4877,9 @@ static void kvm_sched_in(struct preempt_notifier *pn, int cpu)
 	WRITE_ONCE(vcpu->preempted, false);
 	WRITE_ONCE(vcpu->ready, false);
 
+	WRITE_ONCE(vcpu->sched_outed, false);
+	WRITE_ONCE(vcpu->ipi_received, 0);
+
 	__this_cpu_write(kvm_running_vcpu, vcpu);
 	kvm_arch_sched_in(vcpu, cpu);
 	kvm_arch_vcpu_load(vcpu, cpu);
@@ -4873,6 +4894,7 @@ static void kvm_sched_out(struct preempt_notifier *pn,
 		WRITE_ONCE(vcpu->preempted, true);
 		WRITE_ONCE(vcpu->ready, true);
 	}
+	WRITE_ONCE(vcpu->sched_outed, true);
 	kvm_arch_vcpu_put(vcpu);
 	__this_cpu_write(kvm_running_vcpu, NULL);
 }
-- 
2.30.2


* Re: [RFC PATCH 2/2] Boost vCPUs based on IPI-sender and receiver information
From: Sean Christopherson @ 2021-04-21 15:43 UTC
  To: Kenta Ishiguro
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Hildenbrand, kvm, linux-kernel, pl, kono

On Thu, Apr 22, 2021, Kenta Ishiguro wrote:
> This commit monitors IPI communication between vCPUs and leverages the
> relationship between vCPUs to select boost candidates.
> 
> Cc: David Hildenbrand <david@redhat.com>
> Signed-off-by: Kenta Ishiguro <kentaishiguro@sslab.ics.keio.ac.jp>
> ---
>  arch/x86/kvm/lapic.c     | 14 ++++++++++++++
>  arch/x86/kvm/vmx/vmx.c   |  2 ++
>  include/linux/kvm_host.h |  5 +++++
>  virt/kvm/kvm_main.c      | 26 ++++++++++++++++++++++++--
>  4 files changed, 45 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index cc369b9ad8f1..c8d967ddecf9 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -1269,6 +1269,18 @@ void kvm_apic_set_eoi_accelerated(struct kvm_vcpu *vcpu, int vector)
>  }
>  EXPORT_SYMBOL_GPL(kvm_apic_set_eoi_accelerated);
>  
> +static void mark_ipi_receiver(struct kvm_lapic *apic, struct kvm_lapic_irq *irq)
> +{
> +	struct kvm_vcpu *dest_vcpu;
> +	u64 prev_ipi_received;
> +
> +	dest_vcpu = kvm_get_vcpu_by_id(apic->vcpu->kvm, irq->dest_id);
> +	if (READ_ONCE(dest_vcpu->sched_outed)) {
> +		prev_ipi_received = READ_ONCE(dest_vcpu->ipi_received);
> +		WRITE_ONCE(dest_vcpu->ipi_received, prev_ipi_received | (1 << apic->vcpu->vcpu_id));

The lack of "ull" on '1' actually means this will probably go sideways at >32 vCPUs.
Regardless, there needs to be an actual fallback path when the number of vCPUs
exceeds what the IPI-boost can support.  That said, it should be quite easy to
allocate a bitmap to support an arbitrary number of vCPUs.
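
A minimal sketch of that bitmap variant (illustrative only; the
ipi_senders name and the allocation site are assumptions, not code from
any posted patch):

	/* in struct kvm_vcpu, replacing "u64 ipi_received" */
	unsigned long *ipi_senders;	/* one bit per possible sender */

	/* at vCPU creation, e.g. in kvm_vcpu_init() */
	vcpu->ipi_senders = bitmap_zalloc(KVM_MAX_VCPU_ID, GFP_KERNEL);

	/* IPI-sender side, e.g. in mark_ipi_receiver() */
	set_bit(apic->vcpu->vcpu_id, dest_vcpu->ipi_senders);

	/* consumer side, e.g. in kvm_vcpu_on_spin() */
	if (test_bit(me->vcpu_id, vcpu->ipi_senders))
		; /* vcpu is a boost candidate */

	/* on guest entry, instead of WRITE_ONCE(vcpu->ipi_received, 0) */
	bitmap_zero(vcpu->ipi_senders, KVM_MAX_VCPU_ID);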

> +	}
> +}
> +
>  void kvm_apic_send_ipi(struct kvm_lapic *apic, u32 icr_low, u32 icr_high)
>  {
>  	struct kvm_lapic_irq irq;
> @@ -1287,6 +1299,8 @@ void kvm_apic_send_ipi(struct kvm_lapic *apic, u32 icr_low, u32 icr_high)
>  
>  	trace_kvm_apic_ipi(icr_low, irq.dest_id);
>  
> +	mark_ipi_receiver(apic, &irq);
> +
>  	kvm_irq_delivery_to_apic(apic->vcpu->kvm, apic, &irq, NULL);
>  }
>  
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 29b40e092d13..ced50935a38b 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -6718,6 +6718,8 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
>  		vmcs_write32(PLE_WINDOW, vmx->ple_window);
>  	}
>  
> +	WRITE_ONCE(vcpu->ipi_received, 0);
> +
>  	/*
>  	 * We did this in prepare_switch_to_guest, because it needs to
>  	 * be within srcu_read_lock.
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 1b65e7204344..6726aeec03e7 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -320,6 +320,11 @@ struct kvm_vcpu {
>  #endif
>  	bool preempted;
>  	bool ready;
> +	bool sched_outed;
> +	/*
> +	 * The current implementation of strict boost supports up to 64 vCPUs
> +	 */
> +	u64 ipi_received;

...

>  	struct kvm_vcpu_arch arch;
>  	struct kvm_dirty_ring dirty_ring;
>  };

* Re: [RFC PATCH 2/2] Boost vCPUs based on IPI-sender and receiver information
From: Sean Christopherson @ 2021-04-21 16:16 UTC
  To: Kenta Ishiguro
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Hildenbrand, kvm, linux-kernel, pl, kono

On Thu, Apr 22, 2021, Kenta Ishiguro wrote:
> This commit monitors IPI communication between vCPUs and leverages the
> relationship between vCPUs to select boost candidates.
> 
> Cc: David Hildenbrand <david@redhat.com>
> Signed-off-by: Kenta Ishiguro <kentaishiguro@sslab.ics.keio.ac.jp>
> ---
>  arch/x86/kvm/lapic.c     | 14 ++++++++++++++
>  arch/x86/kvm/vmx/vmx.c   |  2 ++
>  include/linux/kvm_host.h |  5 +++++
>  virt/kvm/kvm_main.c      | 26 ++++++++++++++++++++++++--
>  4 files changed, 45 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index cc369b9ad8f1..c8d967ddecf9 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -1269,6 +1269,18 @@ void kvm_apic_set_eoi_accelerated(struct kvm_vcpu *vcpu, int vector)
>  }
>  EXPORT_SYMBOL_GPL(kvm_apic_set_eoi_accelerated);
>  
> +static void mark_ipi_receiver(struct kvm_lapic *apic, struct kvm_lapic_irq *irq)
> +{
> +	struct kvm_vcpu *dest_vcpu;
> +	u64 prev_ipi_received;
> +
> +	dest_vcpu = kvm_get_vcpu_by_id(apic->vcpu->kvm, irq->dest_id);
> +	if (READ_ONCE(dest_vcpu->sched_outed)) {

dest_vcpu needs to be checked for NULL.

> +		prev_ipi_received = READ_ONCE(dest_vcpu->ipi_received);
> +		WRITE_ONCE(dest_vcpu->ipi_received, prev_ipi_received | (1 << apic->vcpu->vcpu_id));
> +	}
> +}
> +
>  void kvm_apic_send_ipi(struct kvm_lapic *apic, u32 icr_low, u32 icr_high)
>  {
>  	struct kvm_lapic_irq irq;
> @@ -1287,6 +1299,8 @@ void kvm_apic_send_ipi(struct kvm_lapic *apic, u32 icr_low, u32 icr_high)
>  
>  	trace_kvm_apic_ipi(icr_low, irq.dest_id);
>  
> +	mark_ipi_receiver(apic, &irq);
> +
>  	kvm_irq_delivery_to_apic(apic->vcpu->kvm, apic, &irq, NULL);
>  }
>  
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 29b40e092d13..ced50935a38b 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -6718,6 +6718,8 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
>  		vmcs_write32(PLE_WINDOW, vmx->ple_window);
>  	}
>  
> +	WRITE_ONCE(vcpu->ipi_received, 0);

Given that ipi_received is cleared when the vCPU is run, is there actually an
observable advantage to tracking which vCPU sent the IPI?  I.e. how do the
numbers look if ipi_received is a simple bool, and kvm_vcpu_on_spin() yields to
any vCPU that has an IPI pending?

>  	/*
>  	 * We did this in prepare_switch_to_guest, because it needs to
>  	 * be within srcu_read_lock.

...

> @@ -4873,6 +4894,7 @@ static void kvm_sched_out(struct preempt_notifier *pn,
>  		WRITE_ONCE(vcpu->preempted, true);
>  		WRITE_ONCE(vcpu->ready, true);
>  	}
> +	WRITE_ONCE(vcpu->sched_outed, true);

s/sched_outed/scheduled_out to be more grammatically correct.

It might also make sense to introduce the flag in a separate path.  Or even
better, can the existing "preempted" and "ready" be massaged so that we don't
have three flags that are tracking the same basic thing, with slightly different
semantics?
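
Purely as an illustration of that direction (all names below are
invented, not from any posted patch), the three booleans could collapse
into a single scheduling state:

	/* hypothetical replacement for preempted/ready/sched_outed */
	enum kvm_vcpu_sched_state {
		KVM_VCPU_SCHED_IN,		/* loaded on a physical CPU */
		KVM_VCPU_SCHED_OUT_BLOCKED,	/* descheduled, not runnable */
		KVM_VCPU_SCHED_OUT_PREEMPTED,	/* descheduled while runnable */
	};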

>  	kvm_arch_vcpu_put(vcpu);
>  	__this_cpu_write(kvm_running_vcpu, NULL);
>  }
> -- 
> 2.30.2
> 

* Re: [RFC PATCH 0/2] Mitigating Excessive Pause-Loop Exiting in VM-Agnostic KVM
From: Sean Christopherson @ 2021-04-21 16:19 UTC
  To: Kenta Ishiguro
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Hildenbrand, kvm, linux-kernel, pl, kono

On Thu, Apr 22, 2021, Kenta Ishiguro wrote:
> To solve problems (2) and (3), patch 2 monitors IPI communication between
> vCPUs and leverages the relationship between vCPUs to select boost
> candidates.  The "[PATCH] KVM: Boost vCPU candidiate in user mode which is
> delivering interrupt" patch
> (https://lore.kernel.org/kvm/CANRm+Cy-78UnrkX8nh5WdHut2WW5NU=UL84FRJnUNjsAPK+Uww@mail.gmail.com/T/)
> seems to be effective for (2), although it uses only the IPI-receiver
> information.

On the IPI side of things, I like the idea of explicitly tracking the IPIs,
especially if we can simplify the implementation, e.g. by losing the receiver
info and making ipi_received a bool.  Maybe temporarily table Wanpeng's patch
while this approach is analyzed?

* Re: [RFC PATCH 0/2] Mitigating Excessive Pause-Loop Exiting in VM-Agnostic KVM
From: Wanpeng Li @ 2021-04-22  0:55 UTC
  To: Kenta Ishiguro
  Cc: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, David Hildenbrand, kvm, LKML, pl,
	kono

On Thu, 22 Apr 2021 at 06:13, Kenta Ishiguro
<kentaishiguro@sslab.ics.keio.ac.jp> wrote:
>
> Dear KVM developers and maintainers,
>
> In our research work presented last week at the VEE 2021 conference [1], we
> found that many consecutive Pause-Loop-Exiting (PLE) events occur
> due to three problems we have identified: 1) Linux CFS ignores hints from
> KVM; 2) IPI-receiver vCPUs in user mode are not boosted; 3) an IPI receiver
> that has halted is always a candidate for boost.  We have introduced two
> mitigations against these problems.
>
> To solve problem (1), patch 1 increases the vruntime of the yielding vCPU
> so that it passes the check `if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next,
> left) < 1)` in `struct sched_entity *pick_next_entity()` if the cfs_rq's
> skip and next buddies are both vCPUs of the same VM. To keep fairness, it
> does not prioritize the guest VM that causes PLE; however, it improves
> performance by eliminating unnecessary PLE events. We have also confirmed
> that `yield_to_task_fair` is called only from KVM.
>
> To solve problems (2) and (3), patch 2 monitors IPI communication between
> vCPUs and leverages the relationship between vCPUs to select boost
> candidates.  The "[PATCH] KVM: Boost vCPU candidiate in user mode which is
> delivering interrupt" patch
> (https://lore.kernel.org/kvm/CANRm+Cy-78UnrkX8nh5WdHut2WW5NU=UL84FRJnUNjsAPK+Uww@mail.gmail.com/T/)
> seems to be effective for (2), although it uses only the IPI-receiver
> information.
>
> Our approach reduces the total number of PLE events by up to 87.6% in four
> 8-vCPU VMs in an over-subscribed scenario with Linux kernel 5.6.0. Please
> find the patches below.

You should mention that this improvement mainly comes from the scheduler
hacking for your problem (1); however, a KVM vCPU task is just an
ordinary task, and the scheduler maintainers generally do not accept
special treatment for it.  As for the worst case of problem (1)
mentioned in your paper, I guess it is the vCPU stacking issue; I tried
to mitigate it before
(https://lore.kernel.org/kvm/1564479235-25074-1-git-send-email-wanpengli@tencent.com/).
For your problem (3), we evaluated hackbench, which heavily contends rq
locks and generates heavy async IPIs (reschedule IPIs); the async IPI
influence is around 0.X%, so I don't expect normal workloads to feel any
effect. In addition, four 8-vCPU VMs are not suitable for a scalability
evaluation. I don't think the complexity introduced by your patch 2 is
worth it, since it achieves a similar effect to my version with the
current heuristic algorithm.

    Wanpeng

* Re: [RFC PATCH 0/2] Mitigating Excessive Pause-Loop Exiting in VM-Agnostic KVM
From: Wanpeng Li @ 2021-04-22 12:21 UTC
  To: Sean Christopherson
  Cc: Kenta Ishiguro, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, David Hildenbrand, kvm, LKML,
	Pierre-Louis Aublin, 河野健二

On Thu, 22 Apr 2021 at 09:45, Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Apr 22, 2021, Kenta Ishiguro wrote:
> > To solve problems (2) and (3), patch 2 monitors IPI communication between
> > vCPUs and leverages the relationship between vCPUs to select boost
> > candidates.  The "[PATCH] KVM: Boost vCPU candidiate in user mode which is
> > delivering interrupt" patch
> > (https://lore.kernel.org/kvm/CANRm+Cy-78UnrkX8nh5WdHut2WW5NU=UL84FRJnUNjsAPK+Uww@mail.gmail.com/T/)
> > seems to be effective for (2), although it uses only the IPI-receiver
> > information.
>
> On the IPI side of things, I like the idea of explicitly tracking the IPIs,
> especially if we can simplify the implementation, e.g. by losing the receiver
> info and making ipi_received a bool.  Maybe temporarily table Wanpeng's patch
> while this approach is analyzed?

Hi all,

I evaluated my patch
(https://lore.kernel.org/kvm/1618542490-14756-1-git-send-email-wanpengli@tencent.com),
Kenta's patch 2, and Sean's suggestion. The testing environment is
pbzip2 in a 96-vCPU VM in an over-subscribed scenario (the host machine
is a 2-socket, 48-core, 96-HT Intel CLX box). Note: Kenta's scheduler
hacking is not applied. My patch has the most stable score and the best
performance.

Wanpeng's patch

The average: vanilla -> boost: 69.124 -> 61.975, 10.3%

* Wall Clock: 61.695359 seconds
* Wall Clock: 63.343579 seconds
* Wall Clock: 61.567513 seconds
* Wall Clock: 62.144722 seconds
* Wall Clock: 61.091442 seconds
* Wall Clock: 62.085912 seconds
* Wall Clock: 61.311954 seconds

Kenta' patch

The average: vanilla -> boost: 69.148 -> 64.567, 6.6%

* Wall Clock:  66.288113 seconds
* Wall Clock:  61.228642 seconds
* Wall Clock:  62.100524 seconds
* Wall Clock:  68.355473 seconds
* Wall Clock:  64.864608 seconds

Sean's suggestion:

The average: vanilla -> boost: 69.148 -> 66.505, 3.8%

* Wall Clock: 60.583562 seconds
* Wall Clock: 58.533960 seconds
* Wall Clock: 70.103489 seconds
* Wall Clock: 74.279028 seconds
* Wall Clock: 69.024194 seconds

I followed Sean's suggestion (almost exactly):

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 0050f39..78b5eb6 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1272,6 +1272,7 @@ EXPORT_SYMBOL_GPL(kvm_apic_set_eoi_accelerated);
 void kvm_apic_send_ipi(struct kvm_lapic *apic, u32 icr_low, u32 icr_high)
 {
 	struct kvm_lapic_irq irq;
+	struct kvm_vcpu *dest_vcpu;
 
 	irq.vector = icr_low & APIC_VECTOR_MASK;
 	irq.delivery_mode = icr_low & APIC_MODE_MASK;
@@ -1285,6 +1286,10 @@ void kvm_apic_send_ipi(struct kvm_lapic *apic, u32 icr_low, u32 icr_high)
 	else
 		irq.dest_id = GET_APIC_DEST_FIELD(icr_high);
 
+	dest_vcpu = kvm_get_vcpu_by_id(apic->vcpu->kvm, irq.dest_id);
+	if (dest_vcpu)
+		WRITE_ONCE(dest_vcpu->ipi_received, true);
+
 	trace_kvm_apic_ipi(icr_low, irq.dest_id);
 
 	kvm_irq_delivery_to_apic(apic->vcpu->kvm, apic, &irq, NULL);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 303fb55..a98bf571 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9298,6 +9298,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 	if (test_thread_flag(TIF_NEED_FPU_LOAD))
 		switch_fpu_return();
 
+	WRITE_ONCE(vcpu->ipi_received, false);
+
 	if (unlikely(vcpu->arch.switch_db_regs)) {
 		set_debugreg(0, 7);
 		set_debugreg(vcpu->arch.eff_db[0], 0);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 5ef09a4..81e39fa 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -332,6 +332,8 @@ struct kvm_vcpu {
 		bool dy_eligible;
 	} spin_loop;
 #endif
+
+	bool ipi_received;
 	bool preempted;
 	bool ready;
 	struct kvm_vcpu_arch arch;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index c682f82..5098929 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -411,6 +411,7 @@ static void kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
 
 	kvm_vcpu_set_in_spin_loop(vcpu, false);
 	kvm_vcpu_set_dy_eligible(vcpu, false);
+	vcpu->ipi_received = false;
 	vcpu->preempted = false;
 	vcpu->ready = false;
 	preempt_notifier_init(&vcpu->preempt_notifier, &kvm_preempt_ops);
@@ -3220,6 +3221,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
 			    !vcpu_dy_runnable(vcpu))
 				continue;
 			if (READ_ONCE(vcpu->preempted) && yield_to_kernel_mode &&
+			    !READ_ONCE(vcpu->ipi_received) &&
 				!kvm_arch_vcpu_in_kernel(vcpu))
 				continue;
 			if (!kvm_vcpu_eligible_for_directed_yield(vcpu))

* Re: [RFC PATCH 0/2] Mitigating Excessive Pause-Loop Exiting in VM-Agnostic KVM
From: Sean Christopherson @ 2021-04-22 15:58 UTC
  To: Wanpeng Li
  Cc: Kenta Ishiguro, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, David Hildenbrand, kvm, LKML,
	Pierre-Louis Aublin, 河野健二

On Thu, Apr 22, 2021, Wanpeng Li wrote:
> On Thu, 22 Apr 2021 at 09:45, Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Thu, Apr 22, 2021, Kenta Ishiguro wrote:
> > > To solve problems (2) and (3), patch 2 monitors IPI communication between
> > > vCPUs and leverages the relationship between vCPUs to select boost
> > > candidates.  The "[PATCH] KVM: Boost vCPU candidiate in user mode which is
> > > delivering interrupt" patch
> > > (https://lore.kernel.org/kvm/CANRm+Cy-78UnrkX8nh5WdHut2WW5NU=UL84FRJnUNjsAPK+Uww@mail.gmail.com/T/)
> > > seems to be effective for (2), although it uses only the IPI-receiver
> > > information.
> >
> > On the IPI side of things, I like the idea of explicitly tracking the IPIs,
> > especially if we can simplify the implementation, e.g. by losing the receiver
> > info and making ipi_received a bool.  Maybe temporarily table Wanpeng's patch
> > while this approach is analyzed?
> 
> Hi all,
> 
> I evaluate my patch

Thanks for doing the testing, much appreciated!

> (https://lore.kernel.org/kvm/1618542490-14756-1-git-send-email-wanpengli@tencent.com),
> Kenta's patch 2 and Sean's suggestion. The testing environment is
> pbzip2 in 96 vCPUs VM in over-subscribe scenario (The host machine is
> 2 socket, 48 cores, 96 HTs Intel CLX box).

Are the vCPUs affined in any way?  How many VMs are running?  Are there other
workloads on the host?  Not criticising, just asking so that others can
reproduce your setup.

> Note: the Kenta's scheduler hacking is not applied. The score of my patch is
> the most stable and the best performance.

On the other hand, Kenta's approach has the advantage of working for both Intel
and AMD.  But I'm also not very familiar with AMD's AVIC, so I don't know if it's
feasible to implement a performant equivalent in svm_dy_apicv_has_pending_interrupt().

Kenta's patch is also flawed as it doesn't scale to 96 vCPUs; vCPUs 64-95 will
never get boosted.

> Wanpeng's patch
> 
> The average: vanilla -> boost: 69.124 -> 61.975, 10.3%
> 
> * Wall Clock: 61.695359 seconds
> * Wall Clock: 63.343579 seconds
> * Wall Clock: 61.567513 seconds
> * Wall Clock: 62.144722 seconds
> * Wall Clock: 61.091442 seconds
> * Wall Clock: 62.085912 seconds
> * Wall Clock: 61.311954 seconds
> 
> Kenta' patch
> 
> The average: vanilla -> boost: 69.148 -> 64.567, 6.6%
> 
> * Wall Clock:  66.288113 seconds
> * Wall Clock:  61.228642 seconds
> * Wall Clock:  62.100524 seconds
> * Wall Clock:  68.355473 seconds
> * Wall Clock:  64.864608 seconds
> 
> Sean's suggestion:
> 
> The average: vanilla -> boost: 69.148 -> 66.505, 3.8%
> 
> * Wall Clock: 60.583562 seconds
> * Wall Clock: 58.533960 seconds
> * Wall Clock: 70.103489 seconds
> * Wall Clock: 74.279028 seconds
> * Wall Clock: 69.024194 seconds

* Re: [RFC PATCH 0/2] Mitigating Excessive Pause-Loop Exiting in VM-Agnostic KVM
From: Wanpeng Li @ 2021-04-26  3:02 UTC
  To: Kenta Ishiguro
  Cc: Sean Christopherson, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, David Hildenbrand, kvm, LKML,
	Pierre-Louis Aublin, 河野健二

On Mon, 26 Apr 2021 at 10:56, Kenta Ishiguro
<kentaishiguro@sslab.ics.keio.ac.jp> wrote:
>
> Dear all,
>
> Thank you for the insightful feedback.
>
> Does Sean's suggested version of Wanpeng's patch mark a running vCPU as an IPI
> receiver? If so, I think the candidate set of vCPUs for boosting is
> slightly different between using kvm_arch_interrupt_delivery and using a
> boolean ipi_received. In the version using a boolean ipi_received, vCPUs
> that receive an IPI while running are also candidates for a boost.
> However, they have likely already responded to their IPI before they exit.

             if (READ_ONCE(vcpu->preempted) && yield_to_kernel_mode &&
+                !READ_ONCE(vcpu->ipi_received) &&

There is a vcpu->preempted check here.

    Wanpeng

* Re: [RFC PATCH 0/2] Mitigating Excessive Pause-Loop Exiting in VM-Agnostic KVM
From: Kenta Ishiguro @ 2021-04-26  3:18 UTC
  To: kernellwp
  Cc: david, jmattson, joro, kentaishiguro, kono, kvm, linux-kernel,
	pbonzini, pl, seanjc, vkuznets, wanpengli

Thank you for the reply.

My question is about the following scenario:
1. a running vCPU receives an IPI and its ipi_received becomes true
2. the vCPU responds to the IPI
3. the vCPU exits
4. the vCPU is preempted by KVM
5. the vCPU is boosted, but it has already responded to the IPI
6. the vCPU re-enters the guest and its ipi_received is cleared

In this case, I think the vcpu->preempted check does not limit the candidate vCPUs.


* Re: [RFC PATCH 0/2] Mitigating Excessive Pause-Loop Exiting in VM-Agnostic KVM
From: Wanpeng Li @ 2021-04-26  3:58 UTC
  To: Kenta Ishiguro
  Cc: David Hildenbrand, Jim Mattson, Joerg Roedel,
	河野健二,
	kvm, LKML, Paolo Bonzini, Pierre-Louis Aublin,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li

On Mon, 26 Apr 2021 at 11:19, Kenta Ishiguro
<kentaishiguro@sslab.ics.keio.ac.jp> wrote:
>
> Thank you for the reply.
>
> My question is about following scenario:
> 1. running vCPU receives IPI and the vCPU's ipi_received gets true
> 2. the vCPU responds to the IPI
> 3. the vCPU exits
> 4. the vCPU is preempted by KVM
> 5. the vCPU is boosted, but it has already responded to the IPI
> 6. the vCPU enters and the vCPU's ipi_received is cleaned
>
> In this case, I think the check of vcpu->preempted does not limit the candidate vCPUs.

Good point, you are right. However, I actually played with that code a
bit before; I have another version that adds a vcpu->preempted check
when marking the IPI receiver, and its score was not as good as
expected.

    Wanpeng

* Re: [RFC PATCH 0/2] Mitigating Excessive Pause-Loop Exiting in VM-Agnostic KVM
From: Kenta Ishiguro @ 2021-04-26  4:10 UTC
  To: kernellwp
  Cc: david, jmattson, joro, kentaishiguro, kono, kvm, linux-kernel,
	pbonzini, pl, seanjc, vkuznets, wanpengli

I see. Thank you!

