* [PATCH] KVM: Disable wake-affine vCPU process to mitigate lock holder preemption
@ 2019-07-30 9:33 Wanpeng Li
2019-07-30 11:46 ` Paolo Bonzini
` (2 more replies)
0 siblings, 3 replies; 8+ messages in thread
From: Wanpeng Li @ 2019-07-30 9:33 UTC (permalink / raw)
To: linux-kernel, kvm
Cc: Paolo Bonzini, Radim Krčmář,
Peter Zijlstra, Thomas Gleixner
From: Wanpeng Li <wanpengli@tencent.com>
Wake-affine is a scheduler feature that attempts to place communicating processes
close together, mostly to benefit from cache hits. When a waker tries to wake up
a wakee, it needs to select a cpu for the wakee to run on; the wake-affine
heuristic may select the cpu the waker is currently running on instead of the
cpu the wakee last ran on.
However, in an over-subscribed scenario with multiple VMs, this increases the
probability of vCPU stacking, i.e. sibling vCPUs from the same VM being stacked
on one pCPU. In a test with three 80-vCPU VMs running on one 80-pCPU Skylake
server (PLE supported), the ebizzy score increased by 17% after disabling
wake-affine for the vCPU processes.
When qemu or another vCPU injects a virtual interrupt into the guest by waking
up a sleeping vCPU, it increases the probability that the scheduler's
wake-affine logic stacks vCPUs and qemu. vCPU stacking can greatly increase
lock synchronization latency in a virtualized environment. This patch disables
wake-affine for vCPU processes to mitigate lock holder preemption.
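For readers outside the scheduler code, the wake_wide() heuristic that this
patch short-circuits can be sketched as a small userspace model. This is an
illustration only, not kernel code: the function names, the llc_size default,
and the way flags are passed in are simplifications of the real per-cpu state.

```python
PF_NO_WAKE_AFFINE = 0x20000000  # flag bit introduced by this patch

def wake_wide(waker_flips, wakee_flips, flags, llc_size=8):
    """Return 1 to reject an affine wakeup (go 'wide'), 0 to allow it.

    Userspace sketch of kernel/sched/fair.c:wake_wide() with the
    proposed PF_NO_WAKE_AFFINE short-circuit applied first.
    """
    if flags & PF_NO_WAKE_AFFINE:
        return 1  # vCPU threads never wake affine under this patch
    master, slave = waker_flips, wakee_flips
    if master < slave:
        master, slave = slave, master
    # Heuristic: few wakee flips relative to LLC size means the pair
    # likely shares cache, so an affine wakeup is still worthwhile.
    if slave < llc_size or master < slave * llc_size:
        return 0
    return 1
```

With the flag set the affine path is rejected unconditionally, regardless of
the flip counters that normally drive the decision.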
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
---
include/linux/sched.h | 1 +
kernel/sched/fair.c | 3 +++
virt/kvm/kvm_main.c | 1 +
3 files changed, 5 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8dc1811..3dd33d8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1468,6 +1468,7 @@ extern struct pid *cad_pid;
#define PF_NO_SETAFFINITY 0x04000000 /* Userland is not allowed to meddle with cpus_mask */
#define PF_MCE_EARLY 0x08000000 /* Early kill for mce process policy */
#define PF_MEMALLOC_NOCMA 0x10000000 /* All allocation request will have _GFP_MOVABLE cleared */
+#define PF_NO_WAKE_AFFINE 0x20000000 /* This thread should not be wake affine */
#define PF_FREEZER_SKIP 0x40000000 /* Freezer should not count it as freezable */
#define PF_SUSPEND_TASK 0x80000000 /* This thread called freeze_processes() and should not be frozen */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 036be95..18eb1fa 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5428,6 +5428,9 @@ static int wake_wide(struct task_struct *p)
unsigned int slave = p->wakee_flips;
int factor = this_cpu_read(sd_llc_size);
+ if (unlikely(p->flags & PF_NO_WAKE_AFFINE))
+ return 1;
+
if (master < slave)
swap(master, slave);
if (slave < factor || master < slave * factor)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 887f3b0..b9f75c3 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2680,6 +2680,7 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id)
mutex_unlock(&kvm->lock);
kvm_arch_vcpu_postcreate(vcpu);
+ current->flags |= PF_NO_WAKE_AFFINE;
return r;
unlock_vcpu_destroy:
--
2.7.4
* Re: [PATCH] KVM: Disable wake-affine vCPU process to mitigate lock holder preemption
2019-07-30 9:33 [PATCH] KVM: Disable wake-affine vCPU process to mitigate lock holder preemption Wanpeng Li
@ 2019-07-30 11:46 ` Paolo Bonzini
2019-08-01 12:39 ` Dario Faggioli
2019-07-30 12:09 ` Peter Zijlstra
2019-08-01 12:57 ` Dario Faggioli
2 siblings, 1 reply; 8+ messages in thread
From: Paolo Bonzini @ 2019-07-30 11:46 UTC (permalink / raw)
To: Wanpeng Li, linux-kernel, kvm
Cc: Radim Krčmář, Peter Zijlstra, Thomas Gleixner
On 30/07/19 11:33, Wanpeng Li wrote:
> When qemu or another vCPU injects a virtual interrupt into the guest by
> waking up a sleeping vCPU, it increases the probability that the
> scheduler's wake-affine logic stacks vCPUs and qemu. vCPU stacking can
> greatly increase lock synchronization latency in a virtualized
> environment. This patch disables wake-affine for vCPU processes to
> mitigate lock holder preemption.
There is no guarantee that the vCPU remains on the thread where it's
created, so the patch is not enough.
If many vCPUs are stacked on the same pCPU, why doesn't the wake_cap
kick in sooner or later?
Paolo
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Radim Krčmář <rkrcmar@redhat.com>
> Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
> ---
> include/linux/sched.h | 1 +
> kernel/sched/fair.c | 3 +++
> virt/kvm/kvm_main.c | 1 +
> 3 files changed, 5 insertions(+)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 8dc1811..3dd33d8 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1468,6 +1468,7 @@ extern struct pid *cad_pid;
> #define PF_NO_SETAFFINITY 0x04000000 /* Userland is not allowed to meddle with cpus_mask */
> #define PF_MCE_EARLY 0x08000000 /* Early kill for mce process policy */
> #define PF_MEMALLOC_NOCMA 0x10000000 /* All allocation request will have _GFP_MOVABLE cleared */
> +#define PF_NO_WAKE_AFFINE 0x20000000 /* This thread should not be wake affine */
> #define PF_FREEZER_SKIP 0x40000000 /* Freezer should not count it as freezable */
> #define PF_SUSPEND_TASK 0x80000000 /* This thread called freeze_processes() and should not be frozen */
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 036be95..18eb1fa 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5428,6 +5428,9 @@ static int wake_wide(struct task_struct *p)
> unsigned int slave = p->wakee_flips;
> int factor = this_cpu_read(sd_llc_size);
>
> + if (unlikely(p->flags & PF_NO_WAKE_AFFINE))
> + return 1;
> +
> if (master < slave)
> swap(master, slave);
> if (slave < factor || master < slave * factor)
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 887f3b0..b9f75c3 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2680,6 +2680,7 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id)
>
> mutex_unlock(&kvm->lock);
> kvm_arch_vcpu_postcreate(vcpu);
> + current->flags |= PF_NO_WAKE_AFFINE;
> return r;
* Re: [PATCH] KVM: Disable wake-affine vCPU process to mitigate lock holder preemption
2019-07-30 9:33 [PATCH] KVM: Disable wake-affine vCPU process to mitigate lock holder preemption Wanpeng Li
2019-07-30 11:46 ` Paolo Bonzini
@ 2019-07-30 12:09 ` Peter Zijlstra
2019-08-01 12:57 ` Dario Faggioli
2 siblings, 0 replies; 8+ messages in thread
From: Peter Zijlstra @ 2019-07-30 12:09 UTC (permalink / raw)
To: Wanpeng Li
Cc: linux-kernel, kvm, Paolo Bonzini, Radim Krčmář,
Thomas Gleixner
On Tue, Jul 30, 2019 at 05:33:55PM +0800, Wanpeng Li wrote:
> From: Wanpeng Li <wanpengli@tencent.com>
>
> Wake-affine is a scheduler feature that attempts to place communicating
> processes close together, mostly to benefit from cache hits. When a
> waker tries to wake up a wakee, it needs to select a cpu for the wakee
> to run on; the wake-affine heuristic may select the cpu the waker is
> currently running on instead of the cpu the wakee last ran on.
>
> However, in an over-subscribed scenario with multiple VMs, this
> increases the probability of vCPU stacking, i.e. sibling vCPUs from the
> same VM being stacked on one pCPU. In a test with three 80-vCPU VMs
> running on one 80-pCPU Skylake server (PLE supported), the ebizzy score
> increased by 17% after disabling wake-affine for the vCPU processes.
>
> When qemu or another vCPU injects a virtual interrupt into the guest by
> waking up a sleeping vCPU, it increases the probability that the
> scheduler's wake-affine logic stacks vCPUs and qemu. vCPU stacking can
> greatly increase lock synchronization latency in a virtualized
> environment. This patch disables wake-affine for vCPU processes to
> mitigate lock holder preemption.
>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Radim Krčmář <rkrcmar@redhat.com>
> Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
> ---
> include/linux/sched.h | 1 +
> kernel/sched/fair.c | 3 +++
> virt/kvm/kvm_main.c | 1 +
> 3 files changed, 5 insertions(+)
> index 036be95..18eb1fa 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5428,6 +5428,9 @@ static int wake_wide(struct task_struct *p)
> unsigned int slave = p->wakee_flips;
> int factor = this_cpu_read(sd_llc_size);
>
> + if (unlikely(p->flags & PF_NO_WAKE_AFFINE))
> + return 1;
> +
> if (master < slave)
> swap(master, slave);
> if (slave < factor || master < slave * factor)
I intensely dislike how you misrepresent this patch as a KVM patch.
Also the above is very much not the right place, even if this PF_flag
were to live.
* Re: [PATCH] KVM: Disable wake-affine vCPU process to mitigate lock holder preemption
2019-07-30 11:46 ` Paolo Bonzini
@ 2019-08-01 12:39 ` Dario Faggioli
0 siblings, 0 replies; 8+ messages in thread
From: Dario Faggioli @ 2019-08-01 12:39 UTC (permalink / raw)
To: Paolo Bonzini, Wanpeng Li, linux-kernel, kvm
Cc: Radim Krčmář, Peter Zijlstra, Thomas Gleixner
On Tue, 2019-07-30 at 13:46 +0200, Paolo Bonzini wrote:
> On 30/07/19 11:33, Wanpeng Li wrote:
> > When qemu or another vCPU injects a virtual interrupt into the guest
> > by waking up a sleeping vCPU, it increases the probability that the
> > scheduler's wake-affine logic stacks vCPUs and qemu. vCPU stacking
> > can greatly increase lock synchronization latency in a virtualized
> > environment. This patch disables wake-affine for vCPU processes to
> > mitigate lock holder preemption.
>
> There is no guarantee that the vCPU remains on the thread where it's
> created, so the patch is not enough.
>
> If many vCPUs are stacked on the same pCPU, why doesn't the wake_cap
> kick in sooner or later?
>
Assuming it actually is the case that vcpus *do* get stacked *and* that
wake_cap() *doesn't* kick in, maybe it could be because of this check?
/* Minimum capacity is close to max, no need to abort wake_affine */
if (max_cap - min_cap < max_cap >> 3)
return 0;
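That check can be modeled standalone. One plausible reading, assuming a
symmetric-capacity host (the capacity values below are made up; the real code
reads per-cpu capacity): on a server where every pCPU has the same capacity,
max_cap - min_cap is always 0, so this early return fires every time and
wake_affine is never aborted on capacity grounds.

```python
def wake_cap_aborts_affine(min_cap, max_cap):
    """Model of the quoted wake_cap() check: returns True if the
    capacity spread is large enough to abort an affine wakeup."""
    # Minimum capacity is close to max, no need to abort wake_affine
    if max_cap - min_cap < max_cap >> 3:
        return False  # affine wakeup proceeds; stacking still possible
    return True
```

If that reading is right, wake_cap() would indeed never kick in on the
symmetric 80-pCPU Skylake host from the report.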
Regards
--
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)
* Re: [PATCH] KVM: Disable wake-affine vCPU process to mitigate lock holder preemption
2019-07-30 9:33 [PATCH] KVM: Disable wake-affine vCPU process to mitigate lock holder preemption Wanpeng Li
2019-07-30 11:46 ` Paolo Bonzini
2019-07-30 12:09 ` Peter Zijlstra
@ 2019-08-01 12:57 ` Dario Faggioli
2019-08-02 0:51 ` Wanpeng Li
` (2 more replies)
2 siblings, 3 replies; 8+ messages in thread
From: Dario Faggioli @ 2019-08-01 12:57 UTC (permalink / raw)
To: Wanpeng Li, linux-kernel, kvm
Cc: Paolo Bonzini, Radim Krčmář,
Peter Zijlstra, Thomas Gleixner
On Tue, 2019-07-30 at 17:33 +0800, Wanpeng Li wrote:
> However, in an over-subscribed scenario with multiple VMs, this
> increases the probability of vCPU stacking, i.e. sibling vCPUs from the
> same VM being stacked on one pCPU. In a test with three 80-vCPU VMs
> running on one 80-pCPU Skylake server (PLE supported), the ebizzy score
> increased by 17% after disabling wake-affine for the vCPU processes.
>
Can't we achieve this by removing SD_WAKE_AFFINE from the relevant
scheduling domains? By acting on
/proc/sys/kernel/sched_domain/cpuX/domainY/flags, I mean?
Of course this will impact all tasks, not only KVM vcpus. But if the
host does KVM only anyway...
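A sketch of what that knob does, as a plain bit operation. The SD_WAKE_AFFINE
bit value below is an assumption (check include/linux/sched/topology.h for the
running kernel), and the /proc path in the comment may be read-only or absent
depending on kernel version and CONFIG_SCHED_DEBUG.

```python
SD_WAKE_AFFINE = 0x20  # assumed bit value; verify against the running kernel

def clear_wake_affine(domain_flags):
    """Return scheduling-domain flags with SD_WAKE_AFFINE cleared."""
    return domain_flags & ~SD_WAKE_AFFINE

# In practice one would read-modify-write a file such as
#   /proc/sys/kernel/sched_domain/cpu0/domain0/flags
# for every cpu/domain pair (path and writability vary by kernel).
```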
Regards
--
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)
* Re: [PATCH] KVM: Disable wake-affine vCPU process to mitigate lock holder preemption
2019-08-01 12:57 ` Dario Faggioli
@ 2019-08-02 0:51 ` Wanpeng Li
2019-08-02 8:30 ` Christophe de Dinechin
2019-08-02 8:38 ` Paolo Bonzini
2 siblings, 0 replies; 8+ messages in thread
From: Wanpeng Li @ 2019-08-02 0:51 UTC (permalink / raw)
To: Dario Faggioli
Cc: LKML, kvm, Paolo Bonzini, Radim Krčmář,
Peter Zijlstra, Thomas Gleixner
On Thu, 1 Aug 2019 at 20:57, Dario Faggioli <dfaggioli@suse.com> wrote:
>
> On Tue, 2019-07-30 at 17:33 +0800, Wanpeng Li wrote:
> > However, in an over-subscribed scenario with multiple VMs, this
> > increases the probability of vCPU stacking, i.e. sibling vCPUs from
> > the same VM being stacked on one pCPU. In a test with three 80-vCPU
> > VMs running on one 80-pCPU Skylake server (PLE supported), the ebizzy
> > score increased by 17% after disabling wake-affine for the vCPU
> > processes.
> >
> Can't we achieve this by removing SD_WAKE_AFFINE from the relevant
> scheduling domains? By acting on
> /proc/sys/kernel/sched_domain/cpuX/domainY/flags, I mean?
>
> Of course this will impact all tasks, not only KVM vcpus. But if the
> host does KVM only anyway...
Yes, but there are both mixed hosts and dedicated kvm hosts; unless we
introduce a per-process flag, we can't cater to both.
Regards,
Wanpeng Li
* Re: [PATCH] KVM: Disable wake-affine vCPU process to mitigate lock holder preemption
2019-08-01 12:57 ` Dario Faggioli
2019-08-02 0:51 ` Wanpeng Li
@ 2019-08-02 8:30 ` Christophe de Dinechin
2019-08-02 8:38 ` Paolo Bonzini
2 siblings, 0 replies; 8+ messages in thread
From: Christophe de Dinechin @ 2019-08-02 8:30 UTC (permalink / raw)
To: Dario Faggioli
Cc: Wanpeng Li, linux-kernel, kvm, Paolo Bonzini,
Radim Krčmář,
Peter Zijlstra, Thomas Gleixner
Dario Faggioli writes:
> On Tue, 2019-07-30 at 17:33 +0800, Wanpeng Li wrote:
>> However, in an over-subscribed scenario with multiple VMs, this
>> increases the probability of vCPU stacking, i.e. sibling vCPUs from
>> the same VM being stacked on one pCPU. In a test with three 80-vCPU
>> VMs running on one 80-pCPU Skylake server (PLE supported), the ebizzy
>> score increased by 17% after disabling wake-affine for the vCPU
>> processes.
>>
> Can't we achieve this by removing SD_WAKE_AFFINE from the relevant
> scheduling domains? By acting on
> /proc/sys/kernel/sched_domain/cpuX/domainY/flags, I mean?
>
> Of course this will impact all tasks, not only KVM vcpus. But if the
> host does KVM only anyway...
Even a host dedicated to KVM has many non-KVM processes. I suspect an
increasing number of hosts will be split between VMs and containers.
>
> Regards
--
Cheers,
Christophe de Dinechin
* Re: [PATCH] KVM: Disable wake-affine vCPU process to mitigate lock holder preemption
2019-08-01 12:57 ` Dario Faggioli
2019-08-02 0:51 ` Wanpeng Li
2019-08-02 8:30 ` Christophe de Dinechin
@ 2019-08-02 8:38 ` Paolo Bonzini
2 siblings, 0 replies; 8+ messages in thread
From: Paolo Bonzini @ 2019-08-02 8:38 UTC (permalink / raw)
To: Dario Faggioli, Wanpeng Li, linux-kernel, kvm
Cc: Radim Krčmář, Peter Zijlstra, Thomas Gleixner
On 01/08/19 14:57, Dario Faggioli wrote:
> Can't we achieve this by removing SD_WAKE_AFFINE from the relevant
> scheduling domains? By acting on
> /proc/sys/kernel/sched_domain/cpuX/domainY/flags, I mean?
>
> Of course this will impact all tasks, not only KVM vcpus. But if the
> host does KVM only anyway...
Perhaps add flags to the unified cgroups hierarchy instead. But if the
"min_cap close to max_cap" heuristics are wrong they should indeed be fixed.
Paolo