* [PATCH 1/3] KVM: Don't need to wakeup vCPU twice afer timer fire
@ 2019-07-31 11:27 Wanpeng Li
2019-07-31 11:27 ` [PATCH 2/3] KVM: Check preempted_in_kernel for involuntary preemption Wanpeng Li
` (2 more replies)
0 siblings, 3 replies; 9+ messages in thread
From: Wanpeng Li @ 2019-07-31 11:27 UTC (permalink / raw)
To: linux-kernel, kvm; +Cc: Paolo Bonzini, Radim Krčmář
From: Wanpeng Li <wanpengli@tencent.com>
kvm_set_pending_timer() already wakes up the sleeping vCPU that has a
pending timer, so there is no need to check for a sleeper again in
apic_timer_expired().
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
---
arch/x86/kvm/lapic.c | 8 --------
1 file changed, 8 deletions(-)
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 0aa1586..685d17c 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1548,7 +1548,6 @@ static void kvm_apic_inject_pending_timer_irqs(struct kvm_lapic *apic)
static void apic_timer_expired(struct kvm_lapic *apic)
{
struct kvm_vcpu *vcpu = apic->vcpu;
- struct swait_queue_head *q = &vcpu->wq;
struct kvm_timer *ktimer = &apic->lapic_timer;
if (atomic_read(&apic->lapic_timer.pending))
@@ -1566,13 +1565,6 @@ static void apic_timer_expired(struct kvm_lapic *apic)
atomic_inc(&apic->lapic_timer.pending);
kvm_set_pending_timer(vcpu);
-
- /*
- * For x86, the atomic_inc() is serialized, thus
- * using swait_active() is safe.
- */
- if (swait_active(q))
- swake_up_one(q);
}
static void start_sw_tscdeadline(struct kvm_lapic *apic)
--
2.7.4
^ permalink raw reply related [flat|nested] 9+ messages in thread
* [PATCH 2/3] KVM: Check preempted_in_kernel for involuntary preemption
2019-07-31 11:27 [PATCH 1/3] KVM: Don't need to wakeup vCPU twice afer timer fire Wanpeng Li
@ 2019-07-31 11:27 ` Wanpeng Li
2019-07-31 11:27 ` [PATCH 3/3] KVM: Fix leak vCPU's VMCS value into other pCPU Wanpeng Li
2019-07-31 12:56 ` [PATCH 1/3] KVM: Don't need to wakeup vCPU twice afer timer fire Paolo Bonzini
2 siblings, 0 replies; 9+ messages in thread
From: Wanpeng Li @ 2019-07-31 11:27 UTC (permalink / raw)
To: linux-kernel, kvm; +Cc: Paolo Bonzini, Radim Krčmář
From: Wanpeng Li <wanpengli@tencent.com>
preempted_in_kernel is updated in the preempt notifier when involuntary
preemption occurs, so it can be stale when voluntarily preempted vCPUs are
taken into account by the kvm_vcpu_on_spin() loop. This patch checks
preempted_in_kernel only for involuntarily preempted vCPUs.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
---
virt/kvm/kvm_main.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 887f3b0..ed061d8 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2508,7 +2508,8 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
continue;
if (swait_active(&vcpu->wq) && !kvm_arch_vcpu_runnable(vcpu))
continue;
- if (yield_to_kernel_mode && !kvm_arch_vcpu_in_kernel(vcpu))
+ if (READ_ONCE(vcpu->preempted) && yield_to_kernel_mode &&
+ !kvm_arch_vcpu_in_kernel(vcpu))
continue;
if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
continue;
@@ -4205,7 +4206,7 @@ static void kvm_sched_in(struct preempt_notifier *pn, int cpu)
{
struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);
- vcpu->preempted = false;
+ WRITE_ONCE(vcpu->preempted, false);
WRITE_ONCE(vcpu->ready, false);
kvm_arch_sched_in(vcpu, cpu);
@@ -4219,7 +4220,7 @@ static void kvm_sched_out(struct preempt_notifier *pn,
struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);
if (current->state == TASK_RUNNING) {
- vcpu->preempted = true;
+ WRITE_ONCE(vcpu->preempted, true);
WRITE_ONCE(vcpu->ready, true);
}
kvm_arch_vcpu_put(vcpu);
--
2.7.4
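The skip condition introduced by this patch can be sketched in user space as follows. This is only an illustrative model, not the kernel code: `struct spin_candidate`, `READ_ONCE_BOOL()` and `skip_candidate()` are hypothetical names, and the `in_kernel` field stands in for whatever kvm_arch_vcpu_in_kernel() reports.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the fields of struct kvm_vcpu used here. */
struct spin_candidate {
	bool preempted;	/* set in sched_out only if the task was still runnable */
	bool in_kernel;	/* snapshot, refreshed only on involuntary preemption */
};

/* User-space stand-in for the kernel's READ_ONCE() on a bool field. */
#define READ_ONCE_BOOL(x) (*(const volatile bool *)&(x))

/*
 * Returns true when the directed-yield loop should skip this candidate.
 * The in_kernel snapshot is trusted only when the candidate was
 * involuntarily preempted; otherwise it may be stale and is ignored.
 */
static bool skip_candidate(const struct spin_candidate *c,
			   bool yield_to_kernel_mode)
{
	assert(c != NULL);
	return READ_ONCE_BOOL(c->preempted) && yield_to_kernel_mode &&
	       !c->in_kernel;
}
```

In this model a candidate that blocked voluntarily (`preempted == false`) is never skipped on the basis of `in_kernel`, which matches the intent stated in the commit message.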
^ permalink raw reply related [flat|nested] 9+ messages in thread
* [PATCH 3/3] KVM: Fix leak vCPU's VMCS value into other pCPU
2019-07-31 11:27 [PATCH 1/3] KVM: Don't need to wakeup vCPU twice afer timer fire Wanpeng Li
2019-07-31 11:27 ` [PATCH 2/3] KVM: Check preempted_in_kernel for involuntary preemption Wanpeng Li
@ 2019-07-31 11:27 ` Wanpeng Li
2019-07-31 11:39 ` [PATCH v2 " Wanpeng Li
2019-07-31 12:56 ` [PATCH 1/3] KVM: Don't need to wakeup vCPU twice afer timer fire Paolo Bonzini
2 siblings, 1 reply; 9+ messages in thread
From: Wanpeng Li @ 2019-07-31 11:27 UTC (permalink / raw)
To: linux-kernel, kvm; +Cc: Paolo Bonzini, Radim Krčmář, stable
From: Wanpeng Li <wanpengli@tencent.com>
After commit d73eb57b80b (KVM: Boost vCPUs that are delivering interrupts), a
five-year-old bug is exposed. When running the ebizzy benchmark in three
80-vCPU VMs on one 80-pCPU Skylake server, many rcu_sched stall warnings
splat in the VMs after stress testing:
INFO: rcu_sched detected stalls on CPUs/tasks: { 4 41 57 62 77} (detected by 15, t=60004 jiffies, g=899, c=898, q=15073)
Call Trace:
flush_tlb_mm_range+0x68/0x140
tlb_flush_mmu.part.75+0x37/0xe0
tlb_finish_mmu+0x55/0x60
zap_page_range+0x142/0x190
SyS_madvise+0x3cd/0x9c0
system_call_fastpath+0x1c/0x21
swait_active() remains true until finish_swait() is called in
kvm_vcpu_block(). Taking voluntarily preempted vCPUs into account in the
kvm_vcpu_on_spin() loop greatly increases the probability that the
kvm_arch_vcpu_runnable(vcpu) condition is checked and evaluates to true.
When APICv is enabled, the yield-candidate vCPU's VMCS RVI field then
leaks (via vmx_sync_pir_to_irr()) into the current VMCS of the vCPU that
is spinning on a taken lock.
This patch fixes it by reverting the kvm_arch_vcpu_runnable() condition
in the kvm_vcpu_on_spin() loop.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Fixes: 98f4a1467 (KVM: add kvm_arch_vcpu_runnable() test to kvm_vcpu_on_spin() loop)
Cc: stable@vger.kernel.org
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
---
virt/kvm/kvm_main.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ed061d8..12f2c91 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2506,7 +2506,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
continue;
if (vcpu == me)
continue;
- if (swait_active(&vcpu->wq) && !kvm_arch_vcpu_runnable(vcpu))
+ if (swait_active(&vcpu->wq))
continue;
if (READ_ONCE(vcpu->preempted) && yield_to_kernel_mode &&
!kvm_arch_vcpu_in_kernel(vcpu))
--
2.7.4
^ permalink raw reply related [flat|nested] 9+ messages in thread
* [PATCH v2 3/3] KVM: Fix leak vCPU's VMCS value into other pCPU
2019-07-31 11:27 ` [PATCH 3/3] KVM: Fix leak vCPU's VMCS value into other pCPU Wanpeng Li
@ 2019-07-31 11:39 ` Wanpeng Li
2019-07-31 12:55 ` Paolo Bonzini
0 siblings, 1 reply; 9+ messages in thread
From: Wanpeng Li @ 2019-07-31 11:39 UTC (permalink / raw)
To: linux-kernel, kvm; +Cc: Paolo Bonzini, Radim Krčmář, stable
From: Wanpeng Li <wanpengli@tencent.com>
After commit d73eb57b80b (KVM: Boost vCPUs that are delivering interrupts), a
five-year-old bug is exposed. When running the ebizzy benchmark in three
80-vCPU VMs on one 80-pCPU Skylake server, many rcu_sched stall warnings
splat in the VMs after stress testing:
INFO: rcu_sched detected stalls on CPUs/tasks: { 4 41 57 62 77} (detected by 15, t=60004 jiffies, g=899, c=898, q=15073)
Call Trace:
flush_tlb_mm_range+0x68/0x140
tlb_flush_mmu.part.75+0x37/0xe0
tlb_finish_mmu+0x55/0x60
zap_page_range+0x142/0x190
SyS_madvise+0x3cd/0x9c0
system_call_fastpath+0x1c/0x21
swait_active() remains true until finish_swait() is called in
kvm_vcpu_block(). Taking voluntarily preempted vCPUs into account in the
kvm_vcpu_on_spin() loop greatly increases the probability that the
kvm_arch_vcpu_runnable(vcpu) condition is checked and evaluates to true.
When APICv is enabled, the yield-candidate vCPU's VMCS RVI field then
leaks (via vmx_sync_pir_to_irr()) into the current VMCS of the vCPU that
is spinning on a taken lock.
This patch fixes it by reverting the kvm_arch_vcpu_runnable() condition
in the kvm_vcpu_on_spin() loop and checking swait_active(&vcpu->wq) only
for involuntarily preempted vCPUs.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Fixes: 98f4a1467 (KVM: add kvm_arch_vcpu_runnable() test to kvm_vcpu_on_spin() loop)
Cc: stable@vger.kernel.org
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
---
v1 -> v2:
* check swait_active(&vcpu->wq) only for involuntarily preempted vCPUs
virt/kvm/kvm_main.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ed061d8..12f2c91 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2506,7 +2506,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
continue;
if (vcpu == me)
continue;
- if (swait_active(&vcpu->wq) && !kvm_arch_vcpu_runnable(vcpu))
+ if (READ_ONCE(vcpu->preempted) && swait_active(&vcpu->wq))
continue;
if (READ_ONCE(vcpu->preempted) && yield_to_kernel_mode &&
!kvm_arch_vcpu_in_kernel(vcpu))
--
2.7.4
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH v2 3/3] KVM: Fix leak vCPU's VMCS value into other pCPU
2019-07-31 11:39 ` [PATCH v2 " Wanpeng Li
@ 2019-07-31 12:55 ` Paolo Bonzini
2019-08-01 3:35 ` Wanpeng Li
0 siblings, 1 reply; 9+ messages in thread
From: Paolo Bonzini @ 2019-07-31 12:55 UTC (permalink / raw)
To: Wanpeng Li, linux-kernel, kvm
Cc: Radim Krčmář, stable, Marc Zyngier, Christian Borntraeger
On 31/07/19 13:39, Wanpeng Li wrote:
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index ed061d8..12f2c91 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2506,7 +2506,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
> continue;
> if (vcpu == me)
> continue;
> - if (swait_active(&vcpu->wq) && !kvm_arch_vcpu_runnable(vcpu))
> + if (READ_ONCE(vcpu->preempted) && swait_active(&vcpu->wq))
> continue;
> if (READ_ONCE(vcpu->preempted) && yield_to_kernel_mode &&
> !kvm_arch_vcpu_in_kernel(vcpu))
>
This cannot work. swait_active means you are waiting, so you cannot be
involuntarily preempted.
The problem here is simply that kvm_vcpu_has_events is being called
without holding the lock. So kvm_arch_vcpu_runnable is okay, it's the
implementation that's wrong.
Just rename the existing function to vcpu_runnable and make a new
arch callback kvm_arch_dy_runnable. kvm_arch_dy_runnable can be
conservative and only return true for a subset of events; in particular,
for x86 it can check:
- vcpu->arch.pv.pv_unhalted
- KVM_REQ_NMI or KVM_REQ_SMI or KVM_REQ_EVENT
- PIR.ON if APICv is set
Ultimately, all variables accessed in kvm_arch_dy_runnable should be
accessed with READ_ONCE or atomic_read.
And for all architectures, kvm_vcpu_on_spin should check
list_empty_careful(&vcpu->async_pf.done)
It's okay if your patch renames the function in non-x86 architectures,
leaving the fix to maintainers. So, let's CC Marc and Christian since
ARM and s390 have pretty complex kvm_arch_vcpu_runnable as well.
Paolo
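A minimal user-space sketch of the conservative check suggested above, under stated assumptions: the struct, the `MODEL_REQ_*` bits and `model_dy_runnable()` are hypothetical stand-ins and do not reproduce the real KVM data structures or request numbers. C11 atomics take the place of READ_ONCE/atomic_read so that every field can be read without holding a lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical request bits; the real KVM_REQ_* values differ. */
enum {
	MODEL_REQ_NMI   = 1 << 0,
	MODEL_REQ_SMI   = 1 << 1,
	MODEL_REQ_EVENT = 1 << 2,
	MODEL_REQ_OTHER = 1 << 3,	/* any request not in the subset */
};

/* Hypothetical model of the vCPU state consulted by the check. */
struct dy_vcpu_model {
	atomic_bool pv_unhalted;	/* models vcpu->arch.pv.pv_unhalted */
	atomic_uint requests;		/* models the pending request bitmap */
	bool apicv_active;		/* fixed for the lifetime of the model */
	atomic_bool pir_on;		/* models the PIR.ON bit */
};

/*
 * Conservative "is it worth yielding to this vCPU" test: returns true
 * only for the subset of events listed in the reply, using lockless
 * reads exclusively.
 */
static bool model_dy_runnable(struct dy_vcpu_model *v)
{
	assert(v != NULL);

	if (atomic_load(&v->pv_unhalted))
		return true;
	if (atomic_load(&v->requests) &
	    (MODEL_REQ_NMI | MODEL_REQ_SMI | MODEL_REQ_EVENT))
		return true;
	if (v->apicv_active && atomic_load(&v->pir_on))
		return true;
	return false;
}
```

Being conservative here only costs a missed boost: a vCPU with some other pending event is simply not chosen as a yield target in this model.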
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 1/3] KVM: Don't need to wakeup vCPU twice afer timer fire
2019-07-31 11:27 [PATCH 1/3] KVM: Don't need to wakeup vCPU twice afer timer fire Wanpeng Li
2019-07-31 11:27 ` [PATCH 2/3] KVM: Check preempted_in_kernel for involuntary preemption Wanpeng Li
2019-07-31 11:27 ` [PATCH 3/3] KVM: Fix leak vCPU's VMCS value into other pCPU Wanpeng Li
@ 2019-07-31 12:56 ` Paolo Bonzini
2019-07-31 13:14 ` Vitaly Kuznetsov
2 siblings, 1 reply; 9+ messages in thread
From: Paolo Bonzini @ 2019-07-31 12:56 UTC (permalink / raw)
To: Wanpeng Li, linux-kernel, kvm; +Cc: Radim Krčmář
On 31/07/19 13:27, Wanpeng Li wrote:
> From: Wanpeng Li <wanpengli@tencent.com>
>
> kvm_set_pending_timer() will take care to wake up the sleeping vCPU which
> has pending timer, don't need to check this in apic_timer_expired() again.
No, it doesn't. kvm_make_request never kicks the vCPU.
Paolo
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Radim Krčmář <rkrcmar@redhat.com>
> Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
> ---
> arch/x86/kvm/lapic.c | 8 --------
> 1 file changed, 8 deletions(-)
>
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 0aa1586..685d17c 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -1548,7 +1548,6 @@ static void kvm_apic_inject_pending_timer_irqs(struct kvm_lapic *apic)
> static void apic_timer_expired(struct kvm_lapic *apic)
> {
> struct kvm_vcpu *vcpu = apic->vcpu;
> - struct swait_queue_head *q = &vcpu->wq;
> struct kvm_timer *ktimer = &apic->lapic_timer;
>
> if (atomic_read(&apic->lapic_timer.pending))
> @@ -1566,13 +1565,6 @@ static void apic_timer_expired(struct kvm_lapic *apic)
>
> atomic_inc(&apic->lapic_timer.pending);
> kvm_set_pending_timer(vcpu);
> -
> - /*
> - * For x86, the atomic_inc() is serialized, thus
> - * using swait_active() is safe.
> - */
> - if (swait_active(q))
> - swake_up_one(q);
> }
>
> static void start_sw_tscdeadline(struct kvm_lapic *apic)
>
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 1/3] KVM: Don't need to wakeup vCPU twice afer timer fire
2019-07-31 12:56 ` [PATCH 1/3] KVM: Don't need to wakeup vCPU twice afer timer fire Paolo Bonzini
@ 2019-07-31 13:14 ` Vitaly Kuznetsov
2019-07-31 16:39 ` Paolo Bonzini
0 siblings, 1 reply; 9+ messages in thread
From: Vitaly Kuznetsov @ 2019-07-31 13:14 UTC (permalink / raw)
To: Paolo Bonzini, Wanpeng Li; +Cc: Radim Krčmář, linux-kernel, kvm
Paolo Bonzini <pbonzini@redhat.com> writes:
> On 31/07/19 13:27, Wanpeng Li wrote:
>> From: Wanpeng Li <wanpengli@tencent.com>
>>
>> kvm_set_pending_timer() will take care to wake up the sleeping vCPU which
>> has pending timer, don't need to check this in apic_timer_expired() again.
>
> No, it doesn't. kvm_make_request never kicks the vCPU.
>
Hm, but kvm_set_pending_timer() currently looks like:
void kvm_set_pending_timer(struct kvm_vcpu *vcpu)
{
kvm_make_request(KVM_REQ_PENDING_TIMER, vcpu);
kvm_vcpu_kick(vcpu);
}
--
Vitaly
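The point above can be illustrated with a small user-space model. It assumes, as a simplification, that a kick first tries to wake a sleeping vCPU and only otherwise falls back to interrupting a running one; all `model_*` names and the struct are hypothetical and exist only for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a vCPU that may be sleeping on its wait queue. */
struct kick_model {
	bool sleeping;		/* models a waiter on vcpu->wq */
	bool timer_request;	/* models KVM_REQ_PENDING_TIMER */
	unsigned int wakeups;	/* wake-ups actually delivered */
};

/* Wakes the vCPU if it sleeps; returns true if a wake-up was delivered. */
static bool model_wake_up(struct kick_model *v)
{
	if (!v->sleeping)
		return false;
	v->sleeping = false;
	v->wakeups++;
	return true;
}

/* Models a kick: wake a sleeper, otherwise poke a running vCPU (omitted). */
static void model_kick(struct kick_model *v)
{
	if (model_wake_up(v))
		return;
	/* A real kick would send a reschedule IPI here if needed. */
}

/* Models kvm_set_pending_timer(): make the request, then kick. */
static void model_set_pending_timer(struct kick_model *v)
{
	assert(v != NULL);
	v->timer_request = true;
	model_kick(v);
}

/* Models the pre-patch timer expiry path with its extra wake-up check. */
static void model_timer_expired_old(struct kick_model *v)
{
	model_set_pending_timer(v);
	if (v->sleeping)	/* already cleared by the kick in this model */
		model_wake_up(v);
}
```

In this model the second check never delivers a wake-up, because the kick has already cleared the sleeping state.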
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 1/3] KVM: Don't need to wakeup vCPU twice afer timer fire
2019-07-31 13:14 ` Vitaly Kuznetsov
@ 2019-07-31 16:39 ` Paolo Bonzini
0 siblings, 0 replies; 9+ messages in thread
From: Paolo Bonzini @ 2019-07-31 16:39 UTC (permalink / raw)
To: Vitaly Kuznetsov, Wanpeng Li
Cc: Radim Krčmář, linux-kernel, kvm
On 31/07/19 15:14, Vitaly Kuznetsov wrote:
> Paolo Bonzini <pbonzini@redhat.com> writes:
>
>> On 31/07/19 13:27, Wanpeng Li wrote:
>>> From: Wanpeng Li <wanpengli@tencent.com>
>>>
>>> kvm_set_pending_timer() will take care to wake up the sleeping vCPU which
>>> has pending timer, don't need to check this in apic_timer_expired() again.
>>
>> No, it doesn't. kvm_make_request never kicks the vCPU.
>>
>
> Hm, but kvm_set_pending_timer() currently looks like:
>
> void kvm_set_pending_timer(struct kvm_vcpu *vcpu)
> {
> kvm_make_request(KVM_REQ_PENDING_TIMER, vcpu);
> kvm_vcpu_kick(vcpu);
> }
Doing "git fetch" could have helped indeed.
Paolo
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH v2 3/3] KVM: Fix leak vCPU's VMCS value into other pCPU
2019-07-31 12:55 ` Paolo Bonzini
@ 2019-08-01 3:35 ` Wanpeng Li
0 siblings, 0 replies; 9+ messages in thread
From: Wanpeng Li @ 2019-08-01 3:35 UTC (permalink / raw)
To: Paolo Bonzini
Cc: LKML, kvm, Radim Krčmář, # v3 . 10+,
Marc Zyngier, Christian Borntraeger
On Wed, 31 Jul 2019 at 20:55, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 31/07/19 13:39, Wanpeng Li wrote:
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index ed061d8..12f2c91 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -2506,7 +2506,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
> > continue;
> > if (vcpu == me)
> > continue;
> > - if (swait_active(&vcpu->wq) && !kvm_arch_vcpu_runnable(vcpu))
> > + if (READ_ONCE(vcpu->preempted) && swait_active(&vcpu->wq))
> > continue;
> > if (READ_ONCE(vcpu->preempted) && yield_to_kernel_mode &&
> > !kvm_arch_vcpu_in_kernel(vcpu))
> >
>
> This cannot work. swait_active means you are waiting, so you cannot be
> involuntarily preempted.
>
> The problem here is simply that kvm_vcpu_has_events is being called
> without holding the lock. So kvm_arch_vcpu_runnable is okay, it's the
> implementation that's wrong.
>
> Just rename the existing function to just vcpu_runnable and make a new
> arch callback kvm_arch_dy_runnable. kvm_arch_dy_runnable can be
> conservative and only returns true for a subset of events, in particular
> for x86 it can check:
>
> - vcpu->arch.pv.pv_unhalted
>
> - KVM_REQ_NMI or KVM_REQ_SMI or KVM_REQ_EVENT
>
> - PIR.ON if APICv is set
>
> Ultimately, all variables accessed in kvm_arch_dy_runnable should be
> accessed with READ_ONCE or atomic_read.
>
> And for all architectures, kvm_vcpu_on_spin should check
> list_empty_careful(&vcpu->async_pf.done)
>
> It's okay if your patch renames the function in non-x86 architectures,
> leaving the fix to maintainers. So, let's CC Marc and Christian since
> ARM and s390 have pretty complex kvm_arch_vcpu_runnable as well.
OK, I just sent a patch to do this.
Regards,
Wanpeng Li
^ permalink raw reply [flat|nested] 9+ messages in thread