* [patch 0/3] KVM KVM_HC_RT_PRIO hypercall support
  From: Marcelo Tosatti @ 2017-09-21 11:38 UTC
  To: kvm, linux-kernel

Guest vcpu-0 must be executed with FIFO:1 priority to deal with the
following situation:

    VCPU-0 (housekeeping VCPU)           VCPU-1 (realtime VCPU)

    raw_spin_lock(A)
    interrupted, schedule task T-1       raw_spin_lock(A) (spin)

    raw_spin_unlock(A)

However, certain operations must be able to interrupt guest vcpu-0 (see
trace below). To fix this issue, change guest vcpu-0 to FIFO priority
only during spinlock critical sections (see patch).

Hang trace
==========

Without FIFO priority:

qemu-kvm-6705  [002] ....1.. 767785.648964: kvm_exit: reason IO_INSTRUCTION rip 0xe8fe info 1f00039 0
qemu-kvm-6705  [002] ....1.. 767785.648965: kvm_exit: reason IO_INSTRUCTION rip 0xe911 info 3f60008 0
qemu-kvm-6705  [002] ....1.. 767785.648968: kvm_exit: reason IO_INSTRUCTION rip 0x8984 info 608000b 0
qemu-kvm-6705  [002] ....1.. 767785.648971: kvm_exit: reason IO_INSTRUCTION rip 0xb313 info 1f70008 0
qemu-kvm-6705  [002] ....1.. 767785.648974: kvm_exit: reason IO_INSTRUCTION rip 0xb514 info 3f60000 0
qemu-kvm-6705  [002] ....1.. 767785.648977: kvm_exit: reason PENDING_INTERRUPT rip 0x8052 info 0 0
qemu-kvm-6705  [002] ....1.. 767785.648980: kvm_exit: reason IO_INSTRUCTION rip 0xeee6 info 200040 0
qemu-kvm-6705  [002] ....1.. 767785.648999: kvm_exit: reason EPT_MISCONFIG rip 0x2120 info 0 0

With FIFO priority:

qemu-kvm-7636  [002] ....1.. 768218.205065: kvm_exit: reason IO_INSTRUCTION rip 0xb313 info 1f70008 0
qemu-kvm-7636  [002] ....1.. 768218.205068: kvm_exit: reason IO_INSTRUCTION rip 0x8984 info 608000b 0
qemu-kvm-7636  [002] ....1.. 768218.205071: kvm_exit: reason IO_INSTRUCTION rip 0xb313 info 1f70008 0
qemu-kvm-7636  [002] ....1.. 768218.205074: kvm_exit: reason IO_INSTRUCTION rip 0x8984 info 608000b 0
qemu-kvm-7636  [002] ....1.. 768218.205077: kvm_exit: reason IO_INSTRUCTION rip 0xb313 info 1f70008 0
..

Performance numbers (kernel compilation with make -j2)
======================================================

With hypercall:    4:40
Without hypercall: 3:38

Note: for NFV workloads spinlock performance is not relevant, since
DPDK should not enter the kernel (and housekeeping vcpu performance is
far from a key factor).

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

^ permalink raw reply  [flat|nested] 52+ messages in thread
* [patch 1/3] KVM: x86: add per-vcpu option to set guest vcpu -RT priority
  From: Marcelo Tosatti @ 2017-09-21 11:38 UTC
  To: kvm, linux-kernel; Cc: Marcelo Tosatti

In preparation for the patch which adds a hypercall for the guest vcpu
to switch between:

 -> SCHED_FIFO
 -> SCHED_NORMAL

add controls to:

1) Select which SCHED_FIFO priority will be used (in the host).
2) Enable/disable the ability to use the hypercall, on a per-vcpu basis.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---
 arch/x86/include/asm/kvm_host.h |    4 ++++
 arch/x86/include/uapi/asm/kvm.h |    5 +++++
 arch/x86/kvm/x86.c              |   29 +++++++++++++++++++++++++++++
 include/uapi/linux/kvm.h        |    3 +++
 4 files changed, 41 insertions(+)

Index: kvm.fifopriohc-submit/arch/x86/include/asm/kvm_host.h
===================================================================
--- kvm.fifopriohc-submit.orig/arch/x86/include/asm/kvm_host.h
+++ kvm.fifopriohc-submit/arch/x86/include/asm/kvm_host.h
@@ -688,6 +688,10 @@ struct kvm_vcpu_arch {

 	/* GPA available (AMD only) */
 	bool gpa_available;
+
+	/* Enable RT prio hypercall */
+	bool enable_rt_prio_hc;
+	int rt_sched_priority;
 };

 struct kvm_lpage_info {
Index: kvm.fifopriohc-submit/arch/x86/kvm/x86.c
===================================================================
--- kvm.fifopriohc-submit.orig/arch/x86/kvm/x86.c
+++ kvm.fifopriohc-submit/arch/x86/kvm/x86.c
@@ -2683,6 +2683,7 @@ int kvm_vm_ioctl_check_extension(struct
 	case KVM_CAP_SET_BOOT_CPU_ID:
 	case KVM_CAP_SPLIT_IRQCHIP:
 	case KVM_CAP_IMMEDIATE_EXIT:
+	case KVM_CAP_VCPU_RT_PRIO_HC:
 		r = 1;
 		break;
 	case KVM_CAP_ADJUST_CLOCK:
@@ -3386,6 +3387,25 @@ static int kvm_set_guest_paused(struct k
 	return 0;
 }

+static int kvm_vcpu_ioctl_set_rt_prio_hc(struct kvm_vcpu *vcpu,
+					 struct kvm_vcpu_rt_prio *rt_prio)
+{
+	if (rt_prio->enabled == 0) {
+		vcpu->arch.enable_rt_prio_hc = false;
+		return 0;
+	}
+
+	if (rt_prio->sched_priority < 1 ||
+	    rt_prio->sched_priority > 99)
+		return -EINVAL;
+
+	vcpu->arch.enable_rt_prio_hc = true;
+	vcpu->arch.rt_sched_priority = rt_prio->sched_priority;
+
+	return 0;
+}
+
 static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
 				     struct kvm_enable_cap *cap)
 {
@@ -3682,6 +3702,15 @@ long kvm_arch_vcpu_ioctl(struct file *fi
 		r = kvm_vcpu_ioctl_enable_cap(vcpu, &cap);
 		break;
 	}
+	case KVM_SET_VCPU_RT_PRIO_HC: {
+		struct kvm_vcpu_rt_prio rt_prio;
+
+		r = -EFAULT;
+		if (copy_from_user(&rt_prio, argp, sizeof(rt_prio)))
+			goto out;
+		r = kvm_vcpu_ioctl_set_rt_prio_hc(vcpu, &rt_prio);
+		break;
+	}
 	default:
 		r = -EINVAL;
 	}
Index: kvm.fifopriohc-submit/include/uapi/linux/kvm.h
===================================================================
--- kvm.fifopriohc-submit.orig/include/uapi/linux/kvm.h
+++ kvm.fifopriohc-submit/include/uapi/linux/kvm.h
@@ -929,6 +929,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_PPC_SMT_POSSIBLE 147
 #define KVM_CAP_HYPERV_SYNIC2 148
 #define KVM_CAP_HYPERV_VP_INDEX 149
+#define KVM_CAP_VCPU_RT_PRIO_HC 150

 #ifdef KVM_CAP_IRQ_ROUTING
@@ -1355,6 +1356,8 @@ struct kvm_s390_ucas_mapping {
 /* Available with KVM_CAP_S390_CMMA_MIGRATION */
 #define KVM_S390_GET_CMMA_BITS _IOWR(KVMIO, 0xb8, struct kvm_s390_cmma_log)
 #define KVM_S390_SET_CMMA_BITS _IOW(KVMIO, 0xb9, struct kvm_s390_cmma_log)
+/* Available with KVM_CAP_VCPU_RT_PRIO_HC */
+#define KVM_SET_VCPU_RT_PRIO_HC _IOW(KVMIO, 0xba, struct kvm_vcpu_rt_prio)

 #define KVM_DEV_ASSIGN_ENABLE_IOMMU (1 << 0)
 #define KVM_DEV_ASSIGN_PCI_2_3 (1 << 1)
Index: kvm.fifopriohc-submit/arch/x86/include/uapi/asm/kvm.h
===================================================================
--- kvm.fifopriohc-submit.orig/arch/x86/include/uapi/asm/kvm.h
+++ kvm.fifopriohc-submit/arch/x86/include/uapi/asm/kvm.h
@@ -353,6 +353,11 @@ struct kvm_xcrs {
 	__u64 padding[16];
 };

+struct kvm_vcpu_rt_prio {
+	__u32 enabled;
+	__u32 sched_priority;
+};
+
 /* definition of registers in kvm_run */
 struct kvm_sync_regs {
 };
* [patch 2/3] KVM: x86: KVM_HC_RT_PRIO hypercall (host-side)
  From: Marcelo Tosatti @ 2017-09-21 11:38 UTC
  To: kvm, linux-kernel; Cc: Marcelo Tosatti

Guest vcpu-0 must be executed with FIFO:1 priority to deal with the
following situation:

    VCPU-0 (housekeeping VCPU)           VCPU-1 (realtime VCPU)

    raw_spin_lock(A)
    interrupted, schedule task T-1       raw_spin_lock(A) (spin)

    raw_spin_unlock(A)

However, certain operations must be able to interrupt guest vcpu-0 (see
hang trace in the cover letter). To fix this issue, change guest vcpu-0
to FIFO priority only during spinlock critical sections (see patch).

Performance numbers (kernel compilation with make -j2)
======================================================

With hypercall:    4:40
Without hypercall: 3:38

Note: for NFV workloads spinlock performance is not relevant, since
DPDK should not enter the kernel (and housekeeping vcpu performance is
far from a key factor).

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---
 Documentation/virtual/kvm/hypercalls.txt |   22 +++++++++++++++
 arch/x86/kvm/x86.c                       |   43 +++++++++++++++++++++++++++++++
 include/uapi/linux/kvm_para.h            |    2 +
 3 files changed, 67 insertions(+)

Index: kvm.fifopriohc-submit/Documentation/virtual/kvm/hypercalls.txt
===================================================================
--- kvm.fifopriohc-submit.orig/Documentation/virtual/kvm/hypercalls.txt
+++ kvm.fifopriohc-submit/Documentation/virtual/kvm/hypercalls.txt
@@ -121,3 +121,25 @@ compute the CLOCK_REALTIME for its clock

 Returns KVM_EOPNOTSUPP if the host does not use TSC clocksource,
 or if clock type is different than KVM_CLOCK_PAIRING_WALLCLOCK.
+
+6. KVM_HC_RT_PRIO
+------------------------
+Architecture: x86
+Status: active
+Purpose: Hypercall used to change the qemu vcpu process's -RT priority.
+
+Usage: Having a pCPU shared by a FIFO:1 vcpu and a QEMU emulator thread
+can be problematic: especially if the vcpu busy-spins on memory waiting
+for the QEMU emulator thread to write to it, which leads to a hang
+(because the FIFO:1 vcpu is never scheduled out in favor of the QEMU
+emulator thread).
+So this hypercall is supposed to be called by the guest when
+the OS knows it is not going to busy-spin on memory that is
+written by the emulator thread as above.
+
+a0: bit 0 contains the enable bit: if 0, it indicates that SCHED_OTHER
+priority should be set for the vcpu; if 1, SCHED_FIFO priority
+(the actual value for the FIFO priority is decided by the host).
+
Index: kvm.fifopriohc-submit/include/uapi/linux/kvm_para.h
===================================================================
--- kvm.fifopriohc-submit.orig/include/uapi/linux/kvm_para.h
+++ kvm.fifopriohc-submit/include/uapi/linux/kvm_para.h
@@ -15,6 +15,7 @@
 #define KVM_E2BIG E2BIG
 #define KVM_EPERM EPERM
 #define KVM_EOPNOTSUPP 95
+#define KVM_EINVAL EINVAL

 #define KVM_HC_VAPIC_POLL_IRQ 1
 #define KVM_HC_MMU_OP 2
@@ -25,6 +26,7 @@
 #define KVM_HC_MIPS_EXIT_VM 7
 #define KVM_HC_MIPS_CONSOLE_OUTPUT 8
 #define KVM_HC_CLOCK_PAIRING 9
+#define KVM_HC_RT_PRIO 10

 /*
  * hypercalls use architecture specific
Index: kvm.fifopriohc-submit/arch/x86/kvm/x86.c
===================================================================
--- kvm.fifopriohc-submit.orig/arch/x86/kvm/x86.c
+++ kvm.fifopriohc-submit/arch/x86/kvm/x86.c
@@ -66,6 +66,8 @@
 #include <asm/pvclock.h>
 #include <asm/div64.h>
 #include <asm/irq_remapping.h>
+#include <uapi/linux/sched/types.h>
+#include <uapi/linux/sched.h>

 #define CREATE_TRACE_POINTS
 #include "trace.h"
@@ -6261,6 +6263,44 @@ void kvm_vcpu_deactivate_apicv(struct kv
 	kvm_x86_ops->refresh_apicv_exec_ctrl(vcpu);
 }

+static int convert_to_kvm_errcode(int error)
+{
+	switch (error) {
+	case -EPERM:
+		return -KVM_EPERM;
+	case -EINVAL:
+	default:
+		return -KVM_EINVAL;
+	}
+}
+
+int kvm_pv_rt_prio(struct kvm_vcpu *vcpu, unsigned long a0)
+{
+	int ret;
+	bool enable;
+	struct sched_param param;
+
+	memset(&param, 0, sizeof(struct sched_param));
+	param.sched_priority = vcpu->arch.rt_sched_priority;
+
+	enable = a0 & 0x1;
+
+	if (vcpu->arch.enable_rt_prio_hc == false)
+		return -KVM_EPERM;
+
+	if (enable) {
+		ret = sched_setscheduler(current, SCHED_FIFO, &param);
+	} else {
+		param.sched_priority = 0;
+		ret = sched_setscheduler(current, SCHED_NORMAL, &param);
+	}
+
+	if (ret)
+		ret = convert_to_kvm_errcode(ret);
+
+	return ret;
+}
+
 int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
 {
 	unsigned long nr, a0, a1, a2, a3, ret;
@@ -6306,6 +6346,9 @@ int kvm_emulate_hypercall(struct kvm_vcp
 		ret = kvm_pv_clock_pairing(vcpu, a0, a1);
 		break;
 #endif
+	case KVM_HC_RT_PRIO:
+		ret = kvm_pv_rt_prio(vcpu, a0);
+		break;
 	default:
 		ret = -KVM_ENOSYS;
 		break;
* Re: [patch 2/3] KVM: x86: KVM_HC_RT_PRIO hypercall (host-side)
  From: Konrad Rzeszutek Wilk @ 2017-09-21 13:32 UTC
  To: Marcelo Tosatti; Cc: kvm, linux-kernel

On Thu, Sep 21, 2017 at 08:38:37AM -0300, Marcelo Tosatti wrote:
> When executing guest vcpu-0 with FIFO:1 priority, which is necessary to
> deal with the following situation:
[quoted commit message, hang trace and performance numbers snipped]
> +6. KVM_HC_RT_PRIO
> +------------------------
> +Architecture: x86
> +Status: active
> +Purpose: Hypercall used to change qemu vcpu process -RT priority.

So the guest can change the scheduling decisions at the host level?
And the host HAS to follow it? There is no policy override for the
host to say - nah, not going to do it?

Also wouldn't the guest want to always be at SCHED_FIFO? [I am thinking
of a guest admin who wants all the CPU resources he can get]

> +Usage: Having a pCPU share a FIFO:1 vcpu and a QEMU emulator thread
> +can be problematic: especially if the vcpu busy-spins on memory waiting
> +for the QEMU emulator thread to write to, which leads to a hang ..

Is the QEMU emulator writing to the spinlock memory?

> +(because the FIFO:1 vcpu is never scheduled in favor of QEMU emulator
> +thread).
> +So this hypercall is supposed to be called by the guest when
> +the OS knows its not going to busy spin on memory thats
> +written by the emulator thread as above.
> +
> +a0: bit 0 contains enable bit, if 0 indicates that SCHED_OTHER
> +priority should be set for vcpu, if 1 indicates SCHED_FIFO
> +priority (the actual value for FIFO priority is decided
> +by the host).
[rest of quoted patch snipped]
* Re: [patch 2/3] KVM: x86: KVM_HC_RT_PRIO hypercall (host-side)
  From: Paolo Bonzini @ 2017-09-21 13:49 UTC
  To: Konrad Rzeszutek Wilk, Marcelo Tosatti; Cc: kvm, linux-kernel

On 21/09/2017 15:32, Konrad Rzeszutek Wilk wrote:
> So the guest can change the scheduling decisions at the host level?
> And the host HAS to follow it? There is no policy override for the
> host to say - nah, not going to do it?
>
> Also wouldn't the guest want to always be at SCHED_FIFO? [I am thinking
> of a guest admin who wants all the CPU resources he can get]

Yeah, I do not understand why there should be a housekeeping VCPU that
is running at SCHED_NORMAL. If it hurts, don't do it...

Paolo
* Re: [patch 2/3] KVM: x86: KVM_HC_RT_PRIO hypercall (host-side)
  From: Marcelo Tosatti @ 2017-09-22 1:08 UTC
  To: Paolo Bonzini; Cc: Konrad Rzeszutek Wilk, kvm, linux-kernel

On Thu, Sep 21, 2017 at 03:49:33PM +0200, Paolo Bonzini wrote:
> On 21/09/2017 15:32, Konrad Rzeszutek Wilk wrote:
> > So the guest can change the scheduling decisions at the host level?
> > And the host HAS to follow it? There is no policy override for the
> > host to say - nah, not going to do it?

In that case the host should not even configure the guest with this
option (this is QEMU's 'enable-rt-fifo-hc' option).

> > Also wouldn't the guest want to always be at SCHED_FIFO? [I am thinking
> > of a guest admin who wants all the CPU resources he can get]

No. Because of the following code, executed by the housekeeping vCPU
running at constant SCHED_FIFO priority:

	1. Start disk I/O.
	2. Busy-spin.

With the emulator thread sharing the same pCPU as the housekeeping
vCPU, the emulator thread (which runs at SCHED_NORMAL) will never be
scheduled in place of the vcpu thread running at SCHED_FIFO.

This causes a hang.

> Yeah, I do not understand why there should be a housekeeping VCPU that
> is running at SCHED_NORMAL. If it hurts, don't do it...

Hope the explanation above makes sense (in fact, it was you who pointed
out that SCHED_FIFO should not be constant on the housekeeping vCPU
when it shares a pCPU with an emulator thread at SCHED_NORMAL).
* Re: [patch 2/3] KVM: x86: KVM_HC_RT_PRIO hypercall (host-side)
  From: Paolo Bonzini @ 2017-09-22 7:23 UTC
  To: Marcelo Tosatti; Cc: Konrad Rzeszutek Wilk, kvm, linux-kernel

On 22/09/2017 03:08, Marcelo Tosatti wrote:
[...]
> No. Because in the following code, executed by the housekeeping vCPU
> running at constant SCHED_FIFO priority:
>
> 1. Start disk I/O.
> 2. busy spin
>
> With the emulator thread sharing the same pCPU with the housekeeping
> vCPU, the emulator thread (which runs at SCHED_NORMAL), will never
> be scheduled in place of the vcpu thread at SCHED_FIFO.
>
> This causes a hang.

But if the emulator thread can interrupt the housekeeping thread, the
emulator thread should also be SCHED_FIFO at higher priority; IIRC this
was in Jan's talk from a few years ago.

QEMU would also have to use PI mutexes (which is the main reason why
it's using QemuMutex instead of e.g. GMutex).

>> Yeah, I do not understand why there should be a housekeeping VCPU that
>> is running at SCHED_NORMAL. If it hurts, don't do it...
>
> Hope explanation above makes sense (in fact, it was you who pointed
> out SCHED_FIFO should not be constant on the housekeeping vCPU,
> when sharing pCPU with emulator thread at SCHED_NORMAL).

The two are not exclusive... As you point out, it depends on the
workload. For DPDK you can put both of them at SCHED_NORMAL. For
kernel-intensive uses you must use SCHED_FIFO.

Perhaps we could consider running these threads at SCHED_RR instead.
Unlike SCHED_NORMAL, I am not against a hypercall that temporarily bumps
SCHED_RR to SCHED_FIFO, but perhaps that's not even necessary.

Paolo
* Re: [patch 2/3] KVM: x86: KVM_HC_RT_PRIO hypercall (host-side)
  From: Marcelo Tosatti @ 2017-09-22 12:24 UTC
  To: Paolo Bonzini; Cc: Konrad Rzeszutek Wilk, kvm, linux-kernel

On Fri, Sep 22, 2017 at 09:23:47AM +0200, Paolo Bonzini wrote:
[...]
> But if the emulator thread can interrupt the housekeeping thread, the
> emulator thread should also be SCHED_FIFO at higher priority; IIRC this
> was in Jan's talk from a few years ago.

The point is that we do not want the emulator thread to interrupt the
housekeeping thread at all times: we only want it to interrupt the
housekeeping thread when the latter is not in a spinlock-protected
section (because that has an effect on realtime vcpus attempting to
grab that particular spinlock). Otherwise, it can interrupt the
housekeeping thread.

> QEMU would also have to use PI mutexes (which is the main reason why
> it's using QemuMutex instead of e.g. GMutex).
>
> The two are not exclusive... As you point out, it depends on the
> workload. For DPDK you can put both of them at SCHED_NORMAL. For
> kernel-intensive uses you must use SCHED_FIFO.
>
> Perhaps we could consider running these threads at SCHED_RR instead.
> Unlike SCHED_NORMAL, I am not against a hypercall that bumps temporarily
> SCHED_RR to SCHED_FIFO, but perhaps that's not even necessary.

Sorry Paolo, I don't see how SCHED_RR is going to help here:

"SCHED_RR: Round-robin scheduling

 SCHED_RR is a simple enhancement of SCHED_FIFO. Everything described
 above for SCHED_FIFO also applies to SCHED_RR, except that each thread
 is allowed to run only for a maximum time quantum."

What must happen is that vcpu0 runs _until it is finished with the
spinlock-protected section_ (that is, any job the emulator thread has,
in the period where vcpu0 has work to do, is of lower priority and must
not execute). Otherwise vcpu1, running a realtime workload, will
attempt to grab the spinlock vcpu0 has grabbed, and busy-spin waiting
on the emulator thread to finish.

If you have the emulator thread at a higher priority than vcpu0, as you
suggested above, the same problem will happen. So that option is not
viable.

We tried to keep vcpu0 at SCHED_FIFO at all times, to avoid this
hypercall, but unfortunately that causes the hang described in the
trace.

So I fail to see how SCHED_RR would help here?

Thanks
* [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall
  From: Marcelo Tosatti @ 2017-09-21 11:38 UTC
  To: kvm, linux-kernel; Cc: Marcelo Tosatti

Add hypercalls to the spin lock/unlock paths to set/unset FIFO priority
for the vcpu, protected by a static branch to avoid overhead in normal
kernels.

The option is enabled by the "kvmfifohc" kernel command line parameter
(disabled by default).

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---
 arch/x86/kernel/kvm.c            |   31 +++++++++++++++++++++++++++++++
 include/linux/spinlock.h         |    2 +-
 include/linux/spinlock_api_smp.h |   17 +++++++++++++++++
 3 files changed, 49 insertions(+), 1 deletion(-)

Index: kvm.fifopriohc-submit/arch/x86/kernel/kvm.c
===================================================================
--- kvm.fifopriohc-submit.orig/arch/x86/kernel/kvm.c
+++ kvm.fifopriohc-submit/arch/x86/kernel/kvm.c
@@ -37,6 +37,7 @@
 #include <linux/debugfs.h>
 #include <linux/nmi.h>
 #include <linux/swait.h>
+#include <linux/static_key.h>
 #include <asm/timer.h>
 #include <asm/cpu.h>
 #include <asm/traps.h>
@@ -321,6 +322,36 @@ static notrace void kvm_guest_apic_eoi_w
 	apic->native_eoi_write(APIC_EOI, APIC_EOI_ACK);
 }

+static int kvmfifohc;
+
+static int parse_kvmfifohc(char *arg)
+{
+	kvmfifohc = 1;
+	return 0;
+}
+
+early_param("kvmfifohc", parse_kvmfifohc);
+
+DEFINE_STATIC_KEY_FALSE(kvm_fifo_hc_key);
+
+static void kvm_init_fifo_hc(void)
+{
+	long ret;
+
+	ret = kvm_hypercall1(KVM_HC_RT_PRIO, 0);
+
+	if (ret == 0 && kvmfifohc == 1)
+		static_branch_enable(&kvm_fifo_hc_key);
+}
+
+static __init int kvmguest_late_init(void)
+{
+	kvm_init_fifo_hc();
+	return 0;
+}
+
+late_initcall(kvmguest_late_init);
+
 static void kvm_guest_cpu_init(void)
 {
 	if (!kvm_para_available())
Index: kvm.fifopriohc-submit/include/linux/spinlock_api_smp.h
===================================================================
--- kvm.fifopriohc-submit.orig/include/linux/spinlock_api_smp.h
+++ kvm.fifopriohc-submit/include/linux/spinlock_api_smp.h
@@ -136,11 +136,28 @@ static inline void __raw_spin_lock_bh(ra
 	LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
 }

+#ifdef CONFIG_KVM_GUEST
+DECLARE_STATIC_KEY_FALSE(kvm_fifo_hc_key);
+#endif
+
 static inline void __raw_spin_lock(raw_spinlock_t *lock)
 {
 	preempt_disable();
+
+#if defined(CONFIG_KVM_GUEST) && defined(CONFIG_SMP)
+	/* enable FIFO priority */
+	if (static_branch_unlikely(&kvm_fifo_hc_key))
+		kvm_hypercall1(KVM_HC_RT_PRIO, 0x1);
+#endif
+
 	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
 	LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
+
+#if defined(CONFIG_KVM_GUEST) && defined(CONFIG_SMP)
+	/* disable FIFO priority */
+	if (static_branch_unlikely(&kvm_fifo_hc_key))
+		kvm_hypercall1(KVM_HC_RT_PRIO, 0);
+#endif
 }
 #endif /* !CONFIG_GENERIC_LOCKBREAK || CONFIG_DEBUG_LOCK_ALLOC */
Index: kvm.fifopriohc-submit/include/linux/spinlock.h
===================================================================
--- kvm.fifopriohc-submit.orig/include/linux/spinlock.h
+++ kvm.fifopriohc-submit/include/linux/spinlock.h
@@ -56,7 +56,7 @@
 #include <linux/stringify.h>
 #include <linux/bottom_half.h>
 #include <asm/barrier.h>
-
+#include <uapi/linux/kvm_para.h>
 /*
  * Must define these before including other files, inline functions need them
* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall
  2017-09-21 11:38 ` [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall Marcelo Tosatti
@ 2017-09-21 13:36   ` Konrad Rzeszutek Wilk
  2017-09-21 14:06     ` Peter Zijlstra
  0 siblings, 1 reply; 52+ messages in thread
From: Konrad Rzeszutek Wilk @ 2017-09-21 13:36 UTC (permalink / raw)
  To: Marcelo Tosatti, peterz, mingo; +Cc: kvm, linux-kernel

On Thu, Sep 21, 2017 at 08:38:38AM -0300, Marcelo Tosatti wrote:
> Add hypercalls to spinlock/unlock to set/unset FIFO priority
> for the vcpu, protected by a static branch to avoid performance
> increase in the normal kernels.
>
> Enable option by "kvmfifohc" kernel command line parameter (disabled
> by default).

Wouldn't it be better if there was a global 'kvm=' parameter which
could hold the various overrides?

>
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
>
> ---
>  arch/x86/kernel/kvm.c            |   31 +++++++++++++++++++++++++++++++
>  include/linux/spinlock.h         |    2 +-
>  include/linux/spinlock_api_smp.h |   17 +++++++++++++++++

Hm. It looks like you forgot to CC the maintainers:

$ scripts/get_maintainer.pl -f include/linux/spinlock.h
Peter Zijlstra <peterz@infradead.org> (maintainer:LOCKING PRIMITIVES)
Ingo Molnar <mingo@redhat.com> (maintainer:LOCKING PRIMITIVES)
linux-kernel@vger.kernel.org (open list:LOCKING PRIMITIVES)

Doing that for you.
> 3 files changed, 49 insertions(+), 1 deletion(-)
>
> Index: kvm.fifopriohc-submit/arch/x86/kernel/kvm.c
> ===================================================================
> --- kvm.fifopriohc-submit.orig/arch/x86/kernel/kvm.c
> +++ kvm.fifopriohc-submit/arch/x86/kernel/kvm.c
> @@ -37,6 +37,7 @@
>  #include <linux/debugfs.h>
>  #include <linux/nmi.h>
>  #include <linux/swait.h>
> +#include <linux/static_key.h>
>  #include <asm/timer.h>
>  #include <asm/cpu.h>
>  #include <asm/traps.h>
> @@ -321,6 +322,36 @@ static notrace void kvm_guest_apic_eoi_w
>  	apic->native_eoi_write(APIC_EOI, APIC_EOI_ACK);
>  }
>
> +static int kvmfifohc;
> +
> +static int parse_kvmfifohc(char *arg)
> +{
> +	kvmfifohc = 1;
> +	return 0;
> +}
> +
> +early_param("kvmfifohc", parse_kvmfifohc);
> +
> +DEFINE_STATIC_KEY_FALSE(kvm_fifo_hc_key);
> +
> +static void kvm_init_fifo_hc(void)
> +{
> +	long ret;
> +
> +	ret = kvm_hypercall1(KVM_HC_RT_PRIO, 0);
> +
> +	if (ret == 0 && kvmfifohc == 1)
> +		static_branch_enable(&kvm_fifo_hc_key);
> +}
> +
> +static __init int kvmguest_late_init(void)
> +{
> +	kvm_init_fifo_hc();
> +	return 0;
> +}
> +
> +late_initcall(kvmguest_late_init);
> +
>  static void kvm_guest_cpu_init(void)
>  {
>  	if (!kvm_para_available())
> Index: kvm.fifopriohc-submit/include/linux/spinlock_api_smp.h
> ===================================================================
> --- kvm.fifopriohc-submit.orig/include/linux/spinlock_api_smp.h
> +++ kvm.fifopriohc-submit/include/linux/spinlock_api_smp.h
> @@ -136,11 +136,28 @@ static inline void __raw_spin_lock_bh(ra
>  	LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
>  }
>
> +#ifdef CONFIG_KVM_GUEST
> +DECLARE_STATIC_KEY_FALSE(kvm_fifo_hc_key);
> +#endif
> +
>  static inline void __raw_spin_lock(raw_spinlock_t *lock)
>  {
>  	preempt_disable();
> +
> +#if defined(CONFIG_KVM_GUEST) && defined(CONFIG_SMP)
> +	/* enable FIFO priority */
> +	if (static_branch_unlikely(&kvm_fifo_hc_key))
> +		kvm_hypercall1(KVM_HC_RT_PRIO, 0x1);
> +#endif

I am assuming the reason you chose not to wrap this in a pvops or any
other structure that is more hypervisor agnostic is that only KVM
exposes this. But what if other hypervisors expose something similar?
Or some other mechanism similar to this?

> +
>  	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
>  	LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
> +
> +#if defined(CONFIG_KVM_GUEST) && defined(CONFIG_SMP)
> +	/* disable FIFO priority */
> +	if (static_branch_unlikely(&kvm_fifo_hc_key))
> +		kvm_hypercall1(KVM_HC_RT_PRIO, 0);
> +#endif
>  }
>
>  #endif /* !CONFIG_GENERIC_LOCKBREAK || CONFIG_DEBUG_LOCK_ALLOC */
> Index: kvm.fifopriohc-submit/include/linux/spinlock.h
> ===================================================================
> --- kvm.fifopriohc-submit.orig/include/linux/spinlock.h
> +++ kvm.fifopriohc-submit/include/linux/spinlock.h
> @@ -56,7 +56,7 @@
>  #include <linux/stringify.h>
>  #include <linux/bottom_half.h>
>  #include <asm/barrier.h>
> -
> +#include <uapi/linux/kvm_para.h>
>
>  /*
>   * Must define these before including other files, inline functions need them
* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall
  2017-09-21 13:36 ` Konrad Rzeszutek Wilk
@ 2017-09-21 14:06   ` Peter Zijlstra
  2017-09-22  1:10     ` Marcelo Tosatti
  0 siblings, 1 reply; 52+ messages in thread
From: Peter Zijlstra @ 2017-09-21 14:06 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Marcelo Tosatti, mingo, kvm, linux-kernel, Thomas Gleixner

On Thu, Sep 21, 2017 at 09:36:53AM -0400, Konrad Rzeszutek Wilk wrote:
> On Thu, Sep 21, 2017 at 08:38:38AM -0300, Marcelo Tosatti wrote:
> > Add hypercalls to spinlock/unlock to set/unset FIFO priority
> > for the vcpu, protected by a static branch to avoid performance
> > increase in the normal kernels.
> >
> > Enable option by "kvmfifohc" kernel command line parameter (disabled
> > by default).

WTF kind of fudge is this? Changelog completely fails to explain the
problem this would solve. Why are you doing insane things like this?

NAK!

> > Index: kvm.fifopriohc-submit/include/linux/spinlock_api_smp.h
> > ===================================================================
> > --- kvm.fifopriohc-submit.orig/include/linux/spinlock_api_smp.h
> > +++ kvm.fifopriohc-submit/include/linux/spinlock_api_smp.h
> > @@ -136,11 +136,28 @@ static inline void __raw_spin_lock_bh(ra
> >  	LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
> >  }
> >
> > +#ifdef CONFIG_KVM_GUEST
> > +DECLARE_STATIC_KEY_FALSE(kvm_fifo_hc_key);
> > +#endif
> > +
> >  static inline void __raw_spin_lock(raw_spinlock_t *lock)
> >  {
> >  	preempt_disable();
> > +
> > +#if defined(CONFIG_KVM_GUEST) && defined(CONFIG_SMP)
> > +	/* enable FIFO priority */
> > +	if (static_branch_unlikely(&kvm_fifo_hc_key))
> > +		kvm_hypercall1(KVM_HC_RT_PRIO, 0x1);
> > +#endif
> > +
> >  	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
> >  	LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
> > +
> > +#if defined(CONFIG_KVM_GUEST) && defined(CONFIG_SMP)
> > +	/* disable FIFO priority */
> > +	if (static_branch_unlikely(&kvm_fifo_hc_key))
> > +		kvm_hypercall1(KVM_HC_RT_PRIO, 0);
> > +#endif
> >  }
> >
> >  #endif /* !CONFIG_GENERIC_LOCKBREAK || CONFIG_DEBUG_LOCK_ALLOC */
* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall
  2017-09-21 14:06 ` Peter Zijlstra
@ 2017-09-22  1:10   ` Marcelo Tosatti
  2017-09-22 10:00     ` Peter Zijlstra
  0 siblings, 1 reply; 52+ messages in thread
From: Marcelo Tosatti @ 2017-09-22 1:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel, Thomas Gleixner

On Thu, Sep 21, 2017 at 04:06:28PM +0200, Peter Zijlstra wrote:
> On Thu, Sep 21, 2017 at 09:36:53AM -0400, Konrad Rzeszutek Wilk wrote:
> > On Thu, Sep 21, 2017 at 08:38:38AM -0300, Marcelo Tosatti wrote:
> > > Add hypercalls to spinlock/unlock to set/unset FIFO priority
> > > for the vcpu, protected by a static branch to avoid performance
> > > increase in the normal kernels.
> > >
> > > Enable option by "kvmfifohc" kernel command line parameter (disabled
> > > by default).
>
> WTF kind of fudge is this? Changelog completely fails to explain the
> problem this would solve. Why are you doing insane things like this?
>
> NAK!

Copy&pasting from the initial message; please point out whether this
explanation makes sense (better solutions to this problem are welcome):

When executing guest vcpu-0 with FIFO:1 priority, which is necessary to
deal with the following situation:

VCPU-0 (housekeeping VCPU)              VCPU-1 (realtime VCPU)

raw_spin_lock(A)
interrupted, schedule task T-1          raw_spin_lock(A) (spin)

raw_spin_unlock(A)

Certain operations must interrupt guest vcpu-0 (see trace below).

To fix this issue, only change guest vcpu-0 to FIFO priority
on spinlock critical sections (see patch).

Hang trace
==========

Without FIFO priority:

qemu-kvm-6705  [002] ....1.. 767785.648964: kvm_exit: reason IO_INSTRUCTION rip 0xe8fe info 1f00039 0
qemu-kvm-6705  [002] ....1.. 767785.648965: kvm_exit: reason IO_INSTRUCTION rip 0xe911 info 3f60008 0
qemu-kvm-6705  [002] ....1.. 767785.648968: kvm_exit: reason IO_INSTRUCTION rip 0x8984 info 608000b 0
qemu-kvm-6705  [002] ....1.. 767785.648971: kvm_exit: reason IO_INSTRUCTION rip 0xb313 info 1f70008 0
qemu-kvm-6705  [002] ....1.. 767785.648974: kvm_exit: reason IO_INSTRUCTION rip 0xb514 info 3f60000 0
qemu-kvm-6705  [002] ....1.. 767785.648977: kvm_exit: reason PENDING_INTERRUPT rip 0x8052 info 0 0
qemu-kvm-6705  [002] ....1.. 767785.648980: kvm_exit: reason IO_INSTRUCTION rip 0xeee6 info 200040 0
qemu-kvm-6705  [002] ....1.. 767785.648999: kvm_exit: reason EPT_MISCONFIG rip 0x2120 info 0 0

With FIFO priority:

qemu-kvm-7636  [002] ....1.. 768218.205065: kvm_exit: reason IO_INSTRUCTION rip 0xb313 info 1f70008 0
qemu-kvm-7636  [002] ....1.. 768218.205068: kvm_exit: reason IO_INSTRUCTION rip 0x8984 info 608000b 0
qemu-kvm-7636  [002] ....1.. 768218.205071: kvm_exit: reason IO_INSTRUCTION rip 0xb313 info 1f70008 0
qemu-kvm-7636  [002] ....1.. 768218.205074: kvm_exit: reason IO_INSTRUCTION rip 0x8984 info 608000b 0
qemu-kvm-7636  [002] ....1.. 768218.205077: kvm_exit: reason IO_INSTRUCTION rip 0xb313 info 1f70008 0
..

Performance numbers (kernel compilation with make -j2)
======================================================

With hypercall:    4:40. (make -j2)
Without hypercall: 3:38. (make -j2)

Note for NFV workloads spinlock performance is not relevant since
DPDK should not enter the kernel (and housekeeping vcpu performance
is far from a key factor).

>
> > > Index: kvm.fifopriohc-submit/include/linux/spinlock_api_smp.h
> > > ===================================================================
> > > --- kvm.fifopriohc-submit.orig/include/linux/spinlock_api_smp.h
> > > +++ kvm.fifopriohc-submit/include/linux/spinlock_api_smp.h
> > > @@ -136,11 +136,28 @@ static inline void __raw_spin_lock_bh(ra
> > >  	LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
> > >  }
> > >
> > > +#ifdef CONFIG_KVM_GUEST
> > > +DECLARE_STATIC_KEY_FALSE(kvm_fifo_hc_key);
> > > +#endif
> > > +
> > >  static inline void __raw_spin_lock(raw_spinlock_t *lock)
> > >  {
> > >  	preempt_disable();
> > > +
> > > +#if defined(CONFIG_KVM_GUEST) && defined(CONFIG_SMP)
> > > +	/* enable FIFO priority */
> > > +	if (static_branch_unlikely(&kvm_fifo_hc_key))
> > > +		kvm_hypercall1(KVM_HC_RT_PRIO, 0x1);
> > > +#endif
> > > +
> > >  	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
> > >  	LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
> > > +
> > > +#if defined(CONFIG_KVM_GUEST) && defined(CONFIG_SMP)
> > > +	/* disable FIFO priority */
> > > +	if (static_branch_unlikely(&kvm_fifo_hc_key))
> > > +		kvm_hypercall1(KVM_HC_RT_PRIO, 0);
> > > +#endif
> > >  }
> > >
> > >  #endif /* !CONFIG_GENERIC_LOCKBREAK || CONFIG_DEBUG_LOCK_ALLOC */
* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall
  2017-09-22  1:10 ` Marcelo Tosatti
@ 2017-09-22 10:00   ` Peter Zijlstra
  2017-09-22 10:56     ` Peter Zijlstra
  2017-09-22 12:16     ` Marcelo Tosatti
  0 siblings, 2 replies; 52+ messages in thread
From: Peter Zijlstra @ 2017-09-22 10:00 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel, Thomas Gleixner

On Thu, Sep 21, 2017 at 10:10:41PM -0300, Marcelo Tosatti wrote:
> When executing guest vcpu-0 with FIFO:1 priority, which is necessary to
> deal with the following situation:
>
> VCPU-0 (housekeeping VCPU)              VCPU-1 (realtime VCPU)
>
> raw_spin_lock(A)
> interrupted, schedule task T-1          raw_spin_lock(A) (spin)
>
> raw_spin_unlock(A)
>
> Certain operations must interrupt guest vcpu-0 (see trace below).

Those traces don't make any sense. All they include is kvm_exit and you
can't tell anything from that.

> To fix this issue, only change guest vcpu-0 to FIFO priority
> on spinlock critical sections (see patch).

This doesn't make sense. So you're saying that if you run all VCPUs as
FIFO things come apart? Why?

And why can't they still come apart when the guest holds a spinlock?
* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall
  2017-09-22 10:00 ` Peter Zijlstra
@ 2017-09-22 10:56   ` Peter Zijlstra
  2017-09-22 12:33     ` Marcelo Tosatti
  2017-09-22 12:16   ` Marcelo Tosatti
  1 sibling, 1 reply; 52+ messages in thread
From: Peter Zijlstra @ 2017-09-22 10:56 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel, Thomas Gleixner

On Fri, Sep 22, 2017 at 12:00:04PM +0200, Peter Zijlstra wrote:
> On Thu, Sep 21, 2017 at 10:10:41PM -0300, Marcelo Tosatti wrote:
> > When executing guest vcpu-0 with FIFO:1 priority, which is necessary to
> > deal with the following situation:
> >
> > VCPU-0 (housekeeping VCPU)              VCPU-1 (realtime VCPU)
> >
> > raw_spin_lock(A)
> > interrupted, schedule task T-1          raw_spin_lock(A) (spin)
> >
> > raw_spin_unlock(A)
> >
> > Certain operations must interrupt guest vcpu-0 (see trace below).
>
> Those traces don't make any sense. All they include is kvm_exit and you
> can't tell anything from that.
>
> > To fix this issue, only change guest vcpu-0 to FIFO priority
> > on spinlock critical sections (see patch).
>
> This doesn't make sense. So you're saying that if you run all VCPUs as
> FIFO things come apart? Why?
>
> And why can't they still come apart when the guest holds a spinlock?

That is, running a RT guest and not having _all_ VCPUs being RT tasks on
the host is absolutely and completely insane and broken.

Fix whatever needs fixing to allow your VCPU0 to be RT, don't do insane
things like this.
* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall
  2017-09-22 10:56 ` Peter Zijlstra
@ 2017-09-22 12:33   ` Marcelo Tosatti
  2017-09-22 12:55     ` Peter Zijlstra
  0 siblings, 1 reply; 52+ messages in thread
From: Marcelo Tosatti @ 2017-09-22 12:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel, Thomas Gleixner

On Fri, Sep 22, 2017 at 12:56:09PM +0200, Peter Zijlstra wrote:
> On Fri, Sep 22, 2017 at 12:00:04PM +0200, Peter Zijlstra wrote:
> > This doesn't make sense. So you're saying that if you run all VCPUs as
> > FIFO things come apart? Why?
> >
> > And why can't they still come apart when the guest holds a spinlock?
>
> That is, running a RT guest and not having _all_ VCPUs being RT tasks on
> the host is absolutely and completely insane and broken.

Can you explain why, please?

> Fix whatever needs fixing to allow your VCPU0 to be RT, don't do insane
> things like this.

VCPU0 can be RT, but you'll get the following hang, if the emulator
thread is sharing a pCPU with VCPU0:

1. submit IO.
2. busy spin.

As executed by the guest vcpu (it's a natural problem).

Do you have a better suggestion as to how to fix the problem?

We can fix the BIOS, but userspace will still be allowed to
generate the code pattern above.

And increasing the priority of the emulator thread, at random times
(so it can inject interrupts to vcpu-0), can cause it to interrupt
vcpu-0 in a spinlock protected section.

The only other option is for customers to live with the decreased
packing (that is, require one pcpu for each vcpu, and an additional
pcpu for emulator threads).

Is that what you are suggesting?
* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall
  2017-09-22 12:33 ` Marcelo Tosatti
@ 2017-09-22 12:55   ` Peter Zijlstra
  2017-09-23 10:56     ` Paolo Bonzini
  0 siblings, 1 reply; 52+ messages in thread
From: Peter Zijlstra @ 2017-09-22 12:55 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel, Thomas Gleixner

On Fri, Sep 22, 2017 at 09:33:05AM -0300, Marcelo Tosatti wrote:
> > That is, running a RT guest and not having _all_ VCPUs being RT tasks on
> > the host is absolutely and completely insane and broken.
>
> Can you explain why, please?

You just explained it yourself. If the thread that needs to complete
what you're waiting on has lower priority, it will _never_ get to run if
you're busy waiting on it.

This is _trivial_.

And even for !RT it can be quite costly, because you can end up having
to burn your entire slot of CPU time before you run the other task.

Userspace spinning is _bad_, do not do this.

(the one exception where it works is where you have a single thread per
cpu, because then there's effectively no scheduling).

> > Fix whatever needs fixing to allow your VCPU0 to be RT, don't do insane
> > things like this.
>
> VCPU0 can be RT, but you'll get the following hang, if the emulator
> thread is sharing a pCPU with VCPU0:
>
> 1. submit IO.
> 2. busy spin.
>
> As executed by the guest vcpu (its a natural problem).
>
> Do you have a better suggestion as how to fix the problem?

Yes, not busy wait. Go to sleep and make sure you're woken up once the
IO completes.

> We can fix the BIOS, but userspace will still be allowed to
> generate the code pattern above.

What does the BIOS have to do with anything?

> And increasing the priority of the emulator thread, at random times
> (so it can inject interrupts to vcpu-0), can cause it to interrupt
> vcpu-0 in a spinlock protected section.

You can equally boost the emulator thread while you're spin-waiting,
but that's ugly as heck too.

The normal, sane solution is to not spin-wait but block.
* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall
  2017-09-22 12:55 ` Peter Zijlstra
@ 2017-09-23 10:56   ` Paolo Bonzini
  2017-09-23 13:41     ` Peter Zijlstra
  0 siblings, 1 reply; 52+ messages in thread
From: Paolo Bonzini @ 2017-09-23 10:56 UTC (permalink / raw)
  To: Peter Zijlstra, Marcelo Tosatti
  Cc: Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel, Thomas Gleixner

On 22/09/2017 14:55, Peter Zijlstra wrote:
> You just explained it yourself. If the thread that needs to complete
> what you're waiting on has lower priority, it will _never_ get to run if
> you're busy waiting on it.
>
> This is _trivial_.
>
> And even for !RT it can be quite costly, because you can end up having
> to burn your entire slot of CPU time before you run the other task.
>
> Userspace spinning is _bad_, do not do this.

This is not userspace spinning, it is guest spinning, which has
effectively the same effect but which you cannot quite avoid.

But I agree that the solution is properly prioritizing threads that can
interrupt the VCPU, and using PI mutexes.

I'm not a priori opposed to paravirt scheduling primitives, but I am not
at all sure that it's required.

Paolo

> (the one exception where it works is where you have a single thread per
> cpu, because then there's effectively no scheduling).
* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall
  2017-09-23 10:56 ` Paolo Bonzini
@ 2017-09-23 13:41   ` Peter Zijlstra
  2017-09-24 13:05     ` Paolo Bonzini
  0 siblings, 1 reply; 52+ messages in thread
From: Peter Zijlstra @ 2017-09-23 13:41 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marcelo Tosatti, Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel,
	Thomas Gleixner

On Sat, Sep 23, 2017 at 12:56:12PM +0200, Paolo Bonzini wrote:
> On 22/09/2017 14:55, Peter Zijlstra wrote:
> > You just explained it yourself. If the thread that needs to complete
> > what you're waiting on has lower priority, it will _never_ get to run if
> > you're busy waiting on it.
> >
> > This is _trivial_.
> >
> > And even for !RT it can be quite costly, because you can end up having
> > to burn your entire slot of CPU time before you run the other task.
> >
> > Userspace spinning is _bad_, do not do this.
>
> This is not userspace spinning, it is guest spinning---which has
> effectively the same effect but you cannot quite avoid.

So I'm virt illiterate and have no clue on how all this works; but
wasn't this a vmexit? (that's what Marcelo traced). And once you've
done a vmexit you're a regular task again, not a vcpu.

> But I agree that the solution is properly prioritizing threads that can
> interrupt the VCPU, and using PI mutexes.

Right, if you want to run RT VCPUs the whole emulator/vcpu interaction
needs to be designed for RT.

> I'm not a priori opposed to paravirt scheduling primitives, but I am not
> at all sure that it's required.

Problem is that the proposed thing doesn't solve anything. There is
nothing that prohibits the guest from triggering a vmexit while holding
a spinlock and landing in the self-same problems.
* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall
  2017-09-23 13:41 ` Peter Zijlstra
@ 2017-09-24 13:05   ` Paolo Bonzini
  2017-09-25  2:57     ` Marcelo Tosatti
  0 siblings, 1 reply; 52+ messages in thread
From: Paolo Bonzini @ 2017-09-24 13:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Marcelo Tosatti, Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel,
	Thomas Gleixner

----- Original Message -----
> From: "Peter Zijlstra" <peterz@infradead.org>
> To: "Paolo Bonzini" <pbonzini@redhat.com>
> Cc: "Marcelo Tosatti" <mtosatti@redhat.com>, "Konrad Rzeszutek Wilk" <konrad.wilk@oracle.com>, mingo@redhat.com,
>     kvm@vger.kernel.org, linux-kernel@vger.kernel.org, "Thomas Gleixner" <tglx@linutronix.de>
> Sent: Saturday, September 23, 2017 3:41:14 PM
> Subject: Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall
>
> On Sat, Sep 23, 2017 at 12:56:12PM +0200, Paolo Bonzini wrote:
> > On 22/09/2017 14:55, Peter Zijlstra wrote:
> > > Userspace spinning is _bad_, do not do this.
> >
> > This is not userspace spinning, it is guest spinning---which has
> > effectively the same effect but you cannot quite avoid.
>
> So I'm virt illiterate and have no clue on how all this works; but
> wasn't this a vmexit? (that's what Marcelo traced). And once you've
> done a vmexit you're a regular task again, not a vcpu.

His trace simply shows that the timer tick happened and the SCHED_NORMAL
thread was preempted. Bumping the vCPU thread to SCHED_FIFO drops
the scheduler tick (the system is NOHZ_FULL) and thus 1) the frequency
of EXTERNAL_INTERRUPT vmexits drops to one per second, 2) the thread is
not preempted anymore.

> > But I agree that the solution is properly prioritizing threads that can
> > interrupt the VCPU, and using PI mutexes.
>
> Right, if you want to run RT VCPUs the whole emulator/vcpu interaction
> needs to be designed for RT.
>
> > I'm not a priori opposed to paravirt scheduling primitives, but I am not
> > at all sure that it's required.
>
> Problem is that the proposed thing doesn't solve anything. There is
> nothing that prohibits the guest from triggering a vmexit while holding
> a spinlock and landing in the self-same problems.

Well, part of configuring virt for RT (at all levels: host hypervisor+QEMU
and guest kernel+userspace) is that vmexits while holding a spinlock are
either confined to one vCPU or are handled in the host hypervisor very
quickly, like less than 2000 clock cycles.

So I'm not denying that Marcelo's approach solves the problem, but it's very
heavyweight and it masks an important misconfiguration (as you write above,
everything needs to be RT and the priorities must be designed carefully).

_However_, even if you do this, you may want to put the less important vCPUs
and the emulator threads on the same physical CPU. In that case, the vCPU
can be placed at SCHED_RR to avoid starvation (while the emulator thread needs
to stay at SCHED_FIFO and higher priority). Some kind of trick that bumps
spinlock critical sections in that vCPU to SCHED_FIFO, for a limited time only,
might still be useful.

Paolo
* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall
  2017-09-24 13:05 ` Paolo Bonzini
@ 2017-09-25  2:57   ` Marcelo Tosatti
  2017-09-25  9:13     ` Peter Zijlstra
  2017-09-25 16:20     ` Konrad Rzeszutek Wilk
  0 siblings, 2 replies; 52+ messages in thread
From: Marcelo Tosatti @ 2017-09-25 2:57 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Peter Zijlstra, Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel,
	Thomas Gleixner

On Sun, Sep 24, 2017 at 09:05:44AM -0400, Paolo Bonzini wrote:
> > So I'm virt illiterate and have no clue on how all this works; but
> > wasn't this a vmexit? (that's what Marcelo traced). And once you've
> > done a vmexit you're a regular task again, not a vcpu.
>
> His trace simply shows that the timer tick happened and the SCHED_NORMAL
> thread was preempted. Bumping the vCPU thread to SCHED_FIFO drops
> the scheduler tick (the system is NOHZ_FULL) and thus 1) the frequency
> of EXTERNAL_INTERRUPT vmexits drops to one per second, 2) the thread is
> not preempted anymore.
>
> > > But I agree that the solution is properly prioritizing threads that can
> > > interrupt the VCPU, and using PI mutexes.

That's exactly what the patch does; the prioritization is not fixed in
time, and depends on whether or not vcpu-0 is in a spinlock protected
section.

Are you suggesting a different prioritization? Can you describe it
please, even if incomplete?

> > Right, if you want to run RT VCPUs the whole emulator/vcpu interaction
> > needs to be designed for RT.
> >
> > > I'm not a priori opposed to paravirt scheduling primitives, but I am not
> > > at all sure that it's required.
> >
> > Problem is that the proposed thing doesn't solve anything. There is
> > nothing that prohibits the guest from triggering a vmexit while holding
> > a spinlock and landing in the self-same problems.
>
> Well, part of configuring virt for RT (at all levels: host hypervisor+QEMU
> and guest kernel+userspace) is that vmexits while holding a spinlock are
> either confined to one vCPU or are handled in the host hypervisor very
> quickly, like less than 2000 clock cycles.
>
> So I'm not denying that Marcelo's approach solves the problem, but it's very
> heavyweight and it masks an important misconfiguration (as you write above,
> everything needs to be RT and the priorities must be designed carefully).

I think you are missing the following point:

"vcpu0 can be interrupted when it's not in a spinlock protected section,
otherwise it can't."

So you _have_ to communicate to the host when the guest enters/leaves a
critical section.

So this point of "everything needs to be RT and the priorities must be
designed carefully" is this:

	WHEN in a spinlock protected section (more specifically, when in a
	spinlock protected section _shared with realtime vcpus_):

		priority of vcpu0 > priority of emulator thread

	OTHERWISE:

		priority of vcpu0 < priority of emulator thread.	(*)

So the emulator thread can interrupt and inject interrupts to vcpu0.

> _However_, even if you do this, you may want to put the less important vCPUs
> and the emulator threads on the same physical CPU. In that case, the vCPU
> can be placed at SCHED_RR to avoid starvation (while the emulator thread needs
> to stay at SCHED_FIFO and higher priority). Some kind of trick that bumps
> spinlock critical sections in that vCPU to SCHED_FIFO, for a limited time only,
> might still be useful.

Anything that violates (*) above is going to cause excessive latencies
in realtime vcpus, via:

PCPU-0:
	* vcpu-0 grabs spinlock A.
	* event wakes up emulator thread, vcpu-0 sched out, vcpu-0 sched in.

PCPU-1:
	* realtime vcpu grabs spinlock A, busy spins on emulator thread's
	  completion.

So it's more than useful, it's necessary.

I'm open to suggestions as to better ways to solve this problem while
sharing the emulator thread with vcpu-0 (which is something users are
interested in, for obvious economical reasons), but:

1) Don't get the point of Peter's rejection.

2) Don't get how SCHED_RR can help the situation.
* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall
  2017-09-25  2:57 ` Marcelo Tosatti
@ 2017-09-25  9:13   ` Peter Zijlstra
  2017-09-25 15:12     ` Paolo Bonzini
  2017-09-26 23:22     ` [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall Marcelo Tosatti
  1 sibling, 2 replies; 52+ messages in thread
From: Peter Zijlstra @ 2017-09-25 9:13 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Paolo Bonzini, Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel,
	Thomas Gleixner

On Sun, Sep 24, 2017 at 11:57:53PM -0300, Marcelo Tosatti wrote:
> I think you are missing the following point:
>
> "vcpu0 can be interrupted when its not in a spinlock protected section,
> otherwise it can't."
>
> So you _have_ to communicate to the host when the guest enters/leaves a
> critical section.
>
> So this point of "everything needs to be RT and the priorities must be
> designed carefully", is this:
>
> 	WHEN in spinlock protected section (more specifically, when
> 	spinlock protected section _shared with realtime vcpus_),
>
> 		priority of vcpu0 > priority of emulator thread
>
> 	OTHERWISE
>
> 		priority of vcpu0 < priority of emulator thread.
>
> 	(*)
>
> So emulator thread can interrupt and inject interrupts to vcpu0.

Spinlock protected regions are not everything. What about lock-free
constructs where CPUs spin-wait on one another (there's plenty)?

And I'm clearly ignorant of how this emulation thread works, but why
would it run for a long time? Either it is needed for forward progress
of the VCPU or it's not. If it's not, it shouldn't run.
* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall 2017-09-25 9:13 ` Peter Zijlstra @ 2017-09-25 15:12 ` Paolo Bonzini 2017-09-26 22:49 ` [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall\ Marcelo Tosatti 2017-09-26 23:22 ` [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall Marcelo Tosatti 1 sibling, 1 reply; 52+ messages in thread From: Paolo Bonzini @ 2017-09-25 15:12 UTC (permalink / raw) To: Peter Zijlstra, Marcelo Tosatti Cc: Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel, Thomas Gleixner On 25/09/2017 11:13, Peter Zijlstra wrote: > On Sun, Sep 24, 2017 at 11:57:53PM -0300, Marcelo Tosatti wrote: >> I think you are missing the following point: >> >> "vcpu0 can be interrupted when its not in a spinlock protected section, >> otherwise it can't." Who says that? Certainly a driver can dedicate a single VCPU to periodic polling of the device, in such a way that the polling does not require a spinlock. >> So you _have_ to communicate to the host when the guest enters/leaves a >> critical section. >> >> So this point of "everything needs to be RT and the priorities must be >> designed carefully", is this: >> >> WHEN in spinlock protected section (more specifically, when >> spinlock protected section _shared with realtime vcpus_), >> >> priority of vcpu0 > priority of emulator thread >> >> OTHERWISE >> >> priority of vcpu0 < priority of emulator thread. This is _not_ designed carefully, this is messy. The emulator thread can interrupt the VCPU thread, so it has to be at higher RT priority (+ priority inheritance of mutexes). Once you have done that we can decide on other approaches that e.g. let you get more sharing by placing housekeeping VCPUs at SCHED_NORMAL or SCHED_RR. >> So emulator thread can interrupt and inject interrupts to vcpu0. > > spinlock protected regions are not everything. What about lock-free > constructs where CPU's spin-wait on one another (there's plenty). 
>
> And I'm clearly ignorant of how this emulation thread works, but why
> would it run for a long time? Either it is needed for forward progress
> of the VCPU or it's not. If it's not, it shouldn't run.

The emulator thread 1) should indeed not run for long periods of time, and 2) it is needed for forward progress of the VCPU. So it has to be at higher RT priority. I agree with Peter, sorry. Spinlocks are a red herring here.

Paolo
* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall\ 2017-09-25 15:12 ` Paolo Bonzini @ 2017-09-26 22:49 ` Marcelo Tosatti 2017-09-27 9:37 ` Paolo Bonzini 0 siblings, 1 reply; 52+ messages in thread From: Marcelo Tosatti @ 2017-09-26 22:49 UTC (permalink / raw) To: Paolo Bonzini Cc: Peter Zijlstra, Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel, Thomas Gleixner On Mon, Sep 25, 2017 at 05:12:42PM +0200, Paolo Bonzini wrote: > On 25/09/2017 11:13, Peter Zijlstra wrote: > > On Sun, Sep 24, 2017 at 11:57:53PM -0300, Marcelo Tosatti wrote: > >> I think you are missing the following point: > >> > >> "vcpu0 can be interrupted when its not in a spinlock protected section, > >> otherwise it can't." > > Who says that? Certainly a driver can dedicate a single VCPU to > periodic polling of the device, in such a way that the polling does not > require a spinlock. This sequence: VCPU-0 VCPU-1 (running realtime workload) takes spinlock A scheduled out spinlock(A) (busy spins until VCPU-0 is scheduled back in) scheduled in finishes execution of code under protected section releases spinlock(A) takes spinlock(A) You get that point, right? (*) > >> So you _have_ to communicate to the host when the guest enters/leaves a > >> critical section. > >> > >> So this point of "everything needs to be RT and the priorities must be > >> designed carefully", is this: > >> > >> WHEN in spinlock protected section (more specifically, when > >> spinlock protected section _shared with realtime vcpus_), > >> > >> priority of vcpu0 > priority of emulator thread > >> > >> OTHERWISE > >> > >> priority of vcpu0 < priority of emulator thread. > > This is _not_ designed carefully, this is messy. This is very precise to me. What is "messy" about it? (its clearly defined). > The emulator thread can interrupt the VCPU thread, so it has to be at > higher RT priority (+ priority inheritance of mutexes). 
It can only do that _when_ the VCPU thread is not running a critical section which a higher priority task depends on.

> Once you have
> done that we can decide on other approaches that e.g. let you get more
> sharing by placing housekeeping VCPUs at SCHED_NORMAL or SCHED_RR.

Well, if someone looks at (*) they see that if the interruption delay (the length between "scheduled out" and "scheduled in" in that diagram) exceeds a given threshold, the realtime vcpu1 also overruns its realtime task's deadline by a corresponding amount.

So when you say "The emulator thread can interrupt the VCPU thread", you're saying that it has to be modified to interrupt for a maximum amount of time (say 15us).

Is that what you are suggesting?

> >> So emulator thread can interrupt and inject interrupts to vcpu0.
> >
> > spinlock protected regions are not everything. What about lock-free
> > constructs where CPUs spin-wait on one another (there's plenty).
> >
> > And I'm clearly ignorant of how this emulation thread works, but why
> > would it run for a long time? Either it is needed for forward progress
> > of the VCPU or it's not. If it's not, it shouldn't run.
>
> The emulator thread 1) should not run for long periods of time indeed,
> and 2) it is needed for forward progress of the VCPU. So it has to be
> at higher RT priority. I agree with Peter, sorry. Spinlocks are a red
> herring here.
>
> Paolo

Paolo, you don't control how many interruptions of the emulator thread happen per second. So if you let the emulator thread interrupt the vcpu thread at all times, without some kind of bounding of these interruptions per time unit, you have a problem similar to (*) (where the realtime task is scheduled).

Another approach to the problem was suggested to OpenStack.
* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall\ 2017-09-26 22:49 ` [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall\ Marcelo Tosatti @ 2017-09-27 9:37 ` Paolo Bonzini 2017-09-28 0:44 ` Marcelo Tosatti 0 siblings, 1 reply; 52+ messages in thread From: Paolo Bonzini @ 2017-09-27 9:37 UTC (permalink / raw) To: Marcelo Tosatti Cc: Peter Zijlstra, Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel, Thomas Gleixner On 27/09/2017 00:49, Marcelo Tosatti wrote: > On Mon, Sep 25, 2017 at 05:12:42PM +0200, Paolo Bonzini wrote: >> On 25/09/2017 11:13, Peter Zijlstra wrote: >>> On Sun, Sep 24, 2017 at 11:57:53PM -0300, Marcelo Tosatti wrote: >>>> I think you are missing the following point: >>>> >>>> "vcpu0 can be interrupted when its not in a spinlock protected section, >>>> otherwise it can't." >> >> Who says that? Certainly a driver can dedicate a single VCPU to >> periodic polling of the device, in such a way that the polling does not >> require a spinlock. > > This sequence: > > > VCPU-0 VCPU-1 (running realtime workload) > > takes spinlock A > scheduled out > spinlock(A) (busy spins until > VCPU-0 is scheduled > back in) > scheduled in > finishes execution of > code under protected section > releases spinlock(A) > > takes spinlock(A) Yes, but then you have busy waits for flag to be set polls device scheduled out device receives data ... scheduled in set flag or check for work to do scheduled out submit work busy waits for work to complete scheduled in do work None of which have anything to do with spinlocks. They're just different ways to get priority inversion, and this... 
>>>> So this point of "everything needs to be RT and the priorities must be >>>> designed carefully", is this: >>>> >>>> WHEN in spinlock protected section (more specifically, when >>>> spinlock protected section _shared with realtime vcpus_), >>>> >>>> priority of vcpu0 > priority of emulator thread >>>> >>>> OTHERWISE >>>> >>>> priority of vcpu0 < priority of emulator thread. >> >> This is _not_ designed carefully, this is messy. > > This is very precise to me. What is "messy" about it? (its clearly > defined). ... it's not how you design RT systems to avoid priority inversion. It's just solving an instance of the issue. >> The emulator thread can interrupt the VCPU thread, so it has to be at >> higher RT priority (+ priority inheritance of mutexes). > > It can only do that _when_ the VCPU thread is not running a critical > section which a higher priority task depends on. All VCPUs must have the same priority. There's no such thing as a housekeeping VCPU. There could be one sacrificial VCPU that you place on the same physical CPU as the emulator thread, but that's it. The whole system must run smoothly even if you place those on the same physical CPU, without any PV hacks. > So when you say "The emulator thread can interrupt the VCPU thread", > you're saying that it has to be modified to interrupt for a maximum > amount of time (say 15us). > > Is that what you are suggesting? This is correct. But I think you are missing a fundamental point, that is not specific to virt. If a thread 1 can trigger an event in thread 2, and thread 2 runs at priority N, thread 1 must be placed at priority >N. In this case, the emulator thread can signal I/O completion to the VCPU: it doesn't matter if the VCPU is polling or using interrupts, the emulator thread must be placed at higher priority than the VCPU. This is the root cause of the SeaBIOS issue that you were seeing. 
Yes, it was me who suggested moving VCPU0 and the emulator thread to SCHED_NORMAL, but that was just a bandaid until the real fix was done (which is to set up SCHED_FIFO priorities correctly and use PI mutexes). In fact, this is not specific to the emulator thread. It applies just as well to vhost threads, to QEMU iothreads, even to the KVM PIT emulation thread if the guest were using it. > Paolo, you don't control how many interruptions of the emulator thread > happen per second. Indeed these non-VCPU threads should indeed run rarely and for a small amount of time only to achieve bounded latencies in the VCPUs. But if they don't, those are simply bugs and we fix them. In fact, sources of frequent interruptions have all been fixed or moved outside the main thread; for example, disks can use separate iothreads. Configuration problems are also bugs, just in a different place. For a very basic VM that I've just tried (whatever QEMU does by default, only tweak was not using the QEMU GUI), there is exactly one interruption per second when the VM is idle. That interruption in turn is caused by udev periodically probing the guest CDROM (!). So that's a configuration problem: remove the CDROM, and it does one interruption every 10-30 seconds, again all of them guest triggered. Again: if you have many interruptions, it's not a flaw in KVM or QEMU's design, it's just that someone is doing something stupid. It could be the guest (e.g. unnecessary devices or daemons as in the example above), QEMU (e.g. the RTC emulation used to trigger QEMU timers twice a second just to increment the clock), or the management (e.g. polling "is the VM running" 50 times per second). But it can and must be fixed. The design should be the basic one: attribute the right priority to each thread and things just work. 
Paolo

> So if you let the emulator thread interrupt the
> vcpu thread at all times, without some kind of bounding
> of these interruptions per time unit, you have a similar
> problem as (*) (where the realtime task is scheduled).
* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall\ 2017-09-27 9:37 ` Paolo Bonzini @ 2017-09-28 0:44 ` Marcelo Tosatti 2017-09-28 7:22 ` Paolo Bonzini 0 siblings, 1 reply; 52+ messages in thread From: Marcelo Tosatti @ 2017-09-28 0:44 UTC (permalink / raw) To: Paolo Bonzini Cc: Peter Zijlstra, Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel, Thomas Gleixner On Wed, Sep 27, 2017 at 11:37:48AM +0200, Paolo Bonzini wrote: > On 27/09/2017 00:49, Marcelo Tosatti wrote: > > On Mon, Sep 25, 2017 at 05:12:42PM +0200, Paolo Bonzini wrote: > >> On 25/09/2017 11:13, Peter Zijlstra wrote: > >>> On Sun, Sep 24, 2017 at 11:57:53PM -0300, Marcelo Tosatti wrote: > >>>> I think you are missing the following point: > >>>> > >>>> "vcpu0 can be interrupted when its not in a spinlock protected section, > >>>> otherwise it can't." > >> > >> Who says that? Certainly a driver can dedicate a single VCPU to > >> periodic polling of the device, in such a way that the polling does not > >> require a spinlock. > > > > This sequence: > > > > > > VCPU-0 VCPU-1 (running realtime workload) > > > > takes spinlock A > > scheduled out > > spinlock(A) (busy spins until > > VCPU-0 is scheduled > > back in) > > scheduled in > > finishes execution of > > code under protected section > > releases spinlock(A) > > > > takes spinlock(A) > > Yes, but then you have > > busy waits for flag to be set > polls device > scheduled out > device receives data > ... > scheduled in > set flag > > or > > check for work to do > scheduled out > submit work > busy waits for work to complete > scheduled in > do work > > None of which have anything to do with spinlocks. They're just > different ways to get priority inversion, and this... 
> > >>>> So this point of "everything needs to be RT and the priorities must be > >>>> designed carefully", is this: > >>>> > >>>> WHEN in spinlock protected section (more specifically, when > >>>> spinlock protected section _shared with realtime vcpus_), > >>>> > >>>> priority of vcpu0 > priority of emulator thread > >>>> > >>>> OTHERWISE > >>>> > >>>> priority of vcpu0 < priority of emulator thread. > >> > >> This is _not_ designed carefully, this is messy. > > > > This is very precise to me. What is "messy" about it? (its clearly > > defined). > > ... it's not how you design RT systems to avoid priority inversion. > It's just solving an instance of the issue. > > >> The emulator thread can interrupt the VCPU thread, so it has to be at > >> higher RT priority (+ priority inheritance of mutexes). > > > > It can only do that _when_ the VCPU thread is not running a critical > > section which a higher priority task depends on. > > All VCPUs must have the same priority. There's no such thing as a > housekeeping VCPU. > > There could be one sacrificial VCPU that you place on the same physical > CPU as the emulator thread, but that's it. The whole system must run > smoothly even if you place those on the same physical CPU, without any > PV hacks. > > > So when you say "The emulator thread can interrupt the VCPU thread", > > you're saying that it has to be modified to interrupt for a maximum > > amount of time (say 15us). > > > > Is that what you are suggesting? > > This is correct. But I think you are missing a fundamental point, that > is not specific to virt. If a thread 1 can trigger an event in thread > 2, and thread 2 runs at priority N, thread 1 must be placed at priority >N. > > In this case, the emulator thread can signal I/O completion to the VCPU: > it doesn't matter if the VCPU is polling or using interrupts, the > emulator thread must be placed at higher priority than the VCPU. This > is the root cause of the SeaBIOS issue that you were seeing. 
Yes, it > was me who suggested moving VCPU0 and the emulator thread to > SCHED_NORMAL, but that was just a bandaid until the real fix was done > (which is to set up SCHED_FIFO priorities correctly and use PI mutexes). > > In fact, this is not specific to the emulator thread. It applies just > as well to vhost threads, to QEMU iothreads, even to the KVM PIT > emulation thread if the guest were using it. > > > Paolo, you don't control how many interruptions of the emulator thread > > happen per second. > > Indeed these non-VCPU threads should indeed run rarely and for a small > amount of time only to achieve bounded latencies in the VCPUs. But if > they don't, those are simply bugs and we fix them. In fact, sources of > frequent interruptions have all been fixed or moved outside the main > thread; for example, disks can use separate iothreads. Configuration > problems are also bugs, just in a different place. > > For a very basic VM that I've just tried (whatever QEMU does by default, > only tweak was not using the QEMU GUI), there is exactly one > interruption per second when the VM is idle. That interruption in turn > is caused by udev periodically probing the guest CDROM (!). So that's a > configuration problem: remove the CDROM, and it does one interruption > every 10-30 seconds, again all of them guest triggered. > > Again: if you have many interruptions, it's not a flaw in KVM or QEMU's > design, it's just that someone is doing something stupid. It could be > the guest (e.g. unnecessary devices or daemons as in the example above), > QEMU (e.g. the RTC emulation used to trigger QEMU timers twice a second > just to increment the clock), or the management (e.g. polling "is the VM > running" 50 times per second). But it can and must be fixed. No, i mean you can run anything in VCPU-0 (it is valid to do that). And that "anything" can generate 1 interrupt per second, 1000 or 10.000 interrupts per second. Which are all valid things to be done. 
"I can't run a kernel compilation on VCPU-0 because that will impact latency on the realtime VCPU-1" is not acceptable. > The design should be the basic one: attribute the right priority to each > thread and things just work. > > Paolo > > > So if you let the emulator thread interrupt the > > emulator thread at all times, without some kind of bounding > > of these interruptions per time unit, you have a similar > > problem as (*) (where the realtime task is scheduled). > > ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall\ 2017-09-28 0:44 ` Marcelo Tosatti @ 2017-09-28 7:22 ` Paolo Bonzini 2017-09-28 21:35 ` Marcelo Tosatti 0 siblings, 1 reply; 52+ messages in thread From: Paolo Bonzini @ 2017-09-28 7:22 UTC (permalink / raw) To: Marcelo Tosatti Cc: Peter Zijlstra, Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel, Thomas Gleixner On 28/09/2017 02:44, Marcelo Tosatti wrote: >> Again: if you have many interruptions, it's not a flaw in KVM or QEMU's >> design, it's just that someone is doing something stupid. It could be >> the guest (e.g. unnecessary devices or daemons as in the example above), >> QEMU (e.g. the RTC emulation used to trigger QEMU timers twice a second >> just to increment the clock), or the management (e.g. polling "is the VM >> running" 50 times per second). But it can and must be fixed. > > No, i mean you can run anything in VCPU-0 (it is valid to do that). > And that "anything" can generate 1 interrupt per second, 1000 or 10.000 > interrupts per second. Which are all valid things to be done. > > "I can't run a kernel compilation on VCPU-0 because that will impact > latency on the realtime VCPU-1" is not acceptable. That shouldn't happen. Sources of frequent interruptions have all been fixed or moved outside the main thread. If there are more left, report the bug and we'll see how to fix it in userspace. Paolo ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall\ 2017-09-28 7:22 ` Paolo Bonzini @ 2017-09-28 21:35 ` Marcelo Tosatti 2017-09-28 21:41 ` Marcelo Tosatti 2017-09-29 8:18 ` Paolo Bonzini 0 siblings, 2 replies; 52+ messages in thread From: Marcelo Tosatti @ 2017-09-28 21:35 UTC (permalink / raw) To: Paolo Bonzini Cc: Peter Zijlstra, Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel, Thomas Gleixner On Thu, Sep 28, 2017 at 09:22:02AM +0200, Paolo Bonzini wrote: > On 28/09/2017 02:44, Marcelo Tosatti wrote: > >> Again: if you have many interruptions, it's not a flaw in KVM or QEMU's > >> design, it's just that someone is doing something stupid. It could be > >> the guest (e.g. unnecessary devices or daemons as in the example above), > >> QEMU (e.g. the RTC emulation used to trigger QEMU timers twice a second > >> just to increment the clock), or the management (e.g. polling "is the VM > >> running" 50 times per second). But it can and must be fixed. > > > > No, i mean you can run anything in VCPU-0 (it is valid to do that). > > And that "anything" can generate 1 interrupt per second, 1000 or 10.000 > > interrupts per second. Which are all valid things to be done. > > > > "I can't run a kernel compilation on VCPU-0 because that will impact > > latency on the realtime VCPU-1" is not acceptable. > > That shouldn't happen. Sources of frequent interruptions have all been > fixed or moved outside the main thread. > > If there are more left, report the bug and we'll see how to fix it in > userspace. > > Paolo What should not happen? The generation of 10.000 interrupts per second (say disk IO completion) on a given workload ? ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall\ 2017-09-28 21:35 ` Marcelo Tosatti @ 2017-09-28 21:41 ` Marcelo Tosatti 2017-09-29 8:18 ` Paolo Bonzini 1 sibling, 0 replies; 52+ messages in thread From: Marcelo Tosatti @ 2017-09-28 21:41 UTC (permalink / raw) To: Paolo Bonzini Cc: Peter Zijlstra, Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel, Thomas Gleixner On Thu, Sep 28, 2017 at 06:35:08PM -0300, Marcelo Tosatti wrote: > On Thu, Sep 28, 2017 at 09:22:02AM +0200, Paolo Bonzini wrote: > > On 28/09/2017 02:44, Marcelo Tosatti wrote: > > >> Again: if you have many interruptions, it's not a flaw in KVM or QEMU's > > >> design, it's just that someone is doing something stupid. It could be > > >> the guest (e.g. unnecessary devices or daemons as in the example above), > > >> QEMU (e.g. the RTC emulation used to trigger QEMU timers twice a second > > >> just to increment the clock), or the management (e.g. polling "is the VM > > >> running" 50 times per second). But it can and must be fixed. > > > > > > No, i mean you can run anything in VCPU-0 (it is valid to do that). > > > And that "anything" can generate 1 interrupt per second, 1000 or 10.000 > > > interrupts per second. Which are all valid things to be done. > > > > > > "I can't run a kernel compilation on VCPU-0 because that will impact > > > latency on the realtime VCPU-1" is not acceptable. > > > > That shouldn't happen. Sources of frequent interruptions have all been > > fixed or moved outside the main thread. > > > > If there are more left, report the bug and we'll see how to fix it in > > userspace. > > > > Paolo > > What should not happen? The generation of 10.000 interrupts per second > (say disk IO completion) on a given workload ? Are you suggesting that, workloads in vcpu-0 should be limited in the number of interrupts (and durations of each interruption), so that the realtime vcpu-1's latency requirement is met ? 
I don't see how that suggestion can work: even if you make each exit small, their sheer frequency will still cause a latency violation on vcpu-1.
* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall\ 2017-09-28 21:35 ` Marcelo Tosatti 2017-09-28 21:41 ` Marcelo Tosatti @ 2017-09-29 8:18 ` Paolo Bonzini 2017-09-29 16:40 ` Marcelo Tosatti 1 sibling, 1 reply; 52+ messages in thread From: Paolo Bonzini @ 2017-09-29 8:18 UTC (permalink / raw) To: Marcelo Tosatti Cc: Peter Zijlstra, Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel, Thomas Gleixner On 28/09/2017 23:35, Marcelo Tosatti wrote: > On Thu, Sep 28, 2017 at 09:22:02AM +0200, Paolo Bonzini wrote: >> On 28/09/2017 02:44, Marcelo Tosatti wrote: >>>> Again: if you have many interruptions, it's not a flaw in KVM or QEMU's >>>> design, it's just that someone is doing something stupid. It could be >>>> the guest (e.g. unnecessary devices or daemons as in the example above), >>>> QEMU (e.g. the RTC emulation used to trigger QEMU timers twice a second >>>> just to increment the clock), or the management (e.g. polling "is the VM >>>> running" 50 times per second). But it can and must be fixed. >>> >>> No, i mean you can run anything in VCPU-0 (it is valid to do that). >>> And that "anything" can generate 1 interrupt per second, 1000 or 10.000 >>> interrupts per second. Which are all valid things to be done. >>> >>> "I can't run a kernel compilation on VCPU-0 because that will impact >>> latency on the realtime VCPU-1" is not acceptable. >> >> That shouldn't happen. Sources of frequent interruptions have all been >> fixed or moved outside the main thread. >> >> If there are more left, report the bug and we'll see how to fix it in >> userspace. > > What should not happen? The generation of 10.000 interrupts per second > (say disk IO completion) on a given workload ? If you know you have this kind disk workload, you must use virtio-blk or virtio-scsi with iothreads and place the iothreads on their own physical CPUs. 
Among "run arbitrary workloads", "run real-time workloads", "pack stuff into as few physical CPUs as possible", you can only pick two.

Paolo
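The virtio-blk-with-iothread setup Paolo describes might look like the following invocation. This is illustrative only: the image path, object ids and CPU numbers are invented, and the pinning step is sketched as a comment:

```shell
# Invented example: 2 vcpus on host CPUs 1-2, disk iothread kept off them.
qemu-system-x86_64 -enable-kvm -smp 2 -m 2048 \
    -object iothread,id=io0 \
    -drive file=disk.img,format=raw,if=none,id=d0 \
    -device virtio-blk-pci,drive=d0,iothread=io0 \
    -display none

# After boot, pin the iothread to its own physical CPU, e.g.:
#   taskset -pc 3 <iothread-tid>   (tid via the QMP query-iothreads command)
```

With the iothread on its own physical CPU, disk completion work no longer interrupts the vcpu threads at all.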
* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall\ 2017-09-29 8:18 ` Paolo Bonzini @ 2017-09-29 16:40 ` Marcelo Tosatti 2017-09-29 17:05 ` Paolo Bonzini 0 siblings, 1 reply; 52+ messages in thread From: Marcelo Tosatti @ 2017-09-29 16:40 UTC (permalink / raw) To: Paolo Bonzini Cc: Peter Zijlstra, Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel, Thomas Gleixner On Fri, Sep 29, 2017 at 10:18:25AM +0200, Paolo Bonzini wrote: > On 28/09/2017 23:35, Marcelo Tosatti wrote: > > On Thu, Sep 28, 2017 at 09:22:02AM +0200, Paolo Bonzini wrote: > >> On 28/09/2017 02:44, Marcelo Tosatti wrote: > >>>> Again: if you have many interruptions, it's not a flaw in KVM or QEMU's > >>>> design, it's just that someone is doing something stupid. It could be > >>>> the guest (e.g. unnecessary devices or daemons as in the example above), > >>>> QEMU (e.g. the RTC emulation used to trigger QEMU timers twice a second > >>>> just to increment the clock), or the management (e.g. polling "is the VM > >>>> running" 50 times per second). But it can and must be fixed. > >>> > >>> No, i mean you can run anything in VCPU-0 (it is valid to do that). > >>> And that "anything" can generate 1 interrupt per second, 1000 or 10.000 > >>> interrupts per second. Which are all valid things to be done. > >>> > >>> "I can't run a kernel compilation on VCPU-0 because that will impact > >>> latency on the realtime VCPU-1" is not acceptable. > >> > >> That shouldn't happen. Sources of frequent interruptions have all been > >> fixed or moved outside the main thread. > >> > >> If there are more left, report the bug and we'll see how to fix it in > >> userspace. > > > > What should not happen? The generation of 10.000 interrupts per second > > (say disk IO completion) on a given workload ? > > If you know you have this kind disk workload, you must use virtio-blk or > virtio-scsi with iothreads and place the iothreads on their own physical > CPUs. 
> Among "run arbitrary workloads", "run real-time workloads", "pack stuff
> into as few physical CPUs as possible", you can only pick two.
>
> Paolo

That's not the state of things (userspace in vcpu-0 is not specially tailored to avoid violating latencies in vcpu-1): that is, not all user-triggered actions can be verified.

Think "updatedb", and so on...
* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall\ 2017-09-29 16:40 ` Marcelo Tosatti @ 2017-09-29 17:05 ` Paolo Bonzini 2017-09-29 20:17 ` Marcelo Tosatti 0 siblings, 1 reply; 52+ messages in thread From: Paolo Bonzini @ 2017-09-29 17:05 UTC (permalink / raw) To: Marcelo Tosatti Cc: Peter Zijlstra, Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel, Thomas Gleixner On 29/09/2017 18:40, Marcelo Tosatti wrote: >> If you know you have this kind disk workload, you must use virtio-blk or >> virtio-scsi with iothreads and place the iothreads on their own physical >> CPUs. >> >> Among "run arbitrary workloads", "run real-time workloads", "pack stuff >> into as few physical CPUs as possible", you can only pick two. > > Thats not the state of things (userspace in vcpu-0 is not specially tailored > to not violate latencies in vcpu-1): that is not all user triggered > actions can be verified. > > Think "updatedb", and so on... _Which_ spinlock is it that can cause unwanted latency while running updatedb on VCPU0 and a real-time workload on VCPU1, and only so on virt because of the emulator thread? Is this still broken if you set up priorities for the emulator thread correctly and use PI mutexes in QEMU? And if so, what is the cause of interruptions in the emulator thread and how are these interruptions causing the jitter? Priorities and priority inheritance (or lack of them) is a _known_ issue. Jan was doing his KVM-RT things in 2009 and he was talking about priorities[1] back then. The effect of correct priorities is to _lower_ jitter, not to make it worse, and anyway certainly not worse than SCHED_NORMAL I/O thread. Once that's fixed, we can look at other problems. Paolo [1] http://static.lwn.net/images/conf/rtlws11/papers/proc/p18.pdf which also mentions pv scheduling ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall\ 2017-09-29 17:05 ` Paolo Bonzini @ 2017-09-29 20:17 ` Marcelo Tosatti 2017-10-02 12:30 ` Paolo Bonzini 0 siblings, 1 reply; 52+ messages in thread From: Marcelo Tosatti @ 2017-09-29 20:17 UTC (permalink / raw) To: Paolo Bonzini Cc: Peter Zijlstra, Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel, Thomas Gleixner On Fri, Sep 29, 2017 at 07:05:41PM +0200, Paolo Bonzini wrote: > On 29/09/2017 18:40, Marcelo Tosatti wrote: > >> If you know you have this kind disk workload, you must use virtio-blk or > >> virtio-scsi with iothreads and place the iothreads on their own physical > >> CPUs. > >> > >> Among "run arbitrary workloads", "run real-time workloads", "pack stuff > >> into as few physical CPUs as possible", you can only pick two. > > > > Thats not the state of things (userspace in vcpu-0 is not specially tailored > > to not violate latencies in vcpu-1): that is not all user triggered > > actions can be verified. > > > > Think "updatedb", and so on... > > _Which_ spinlock is it that can cause unwanted latency while running > updatedb on VCPU0 and a real-time workload on VCPU1, and only so on virt > because of the emulator thread? Hundreds of them (the one being hit is in timer_interrupt), but i went to check and there are hundreds of raw spinlocks shared between the kernel threads that run on isolated CPUs and vcpu-0. > Is this still broken if you set up > priorities for the emulator thread correctly and use PI mutexes in QEMU? I don't see why it would not, if you have to schedule the emulator thread to process and inject I/O interrupts for example. > And if so, what is the cause of interruptions in the emulator thread > and how are these interruptions causing the jitter? Interrupt injections. > Priorities and priority inheritance (or lack of them) is a _known_ > issue. Jan was doing his KVM-RT things in 2009 and he was talking about > priorities[1] back then. 
The effect of correct priorities is to _lower_
> jitter, not to make it worse, and anyway certainly not worse than
> SCHED_NORMAL I/O thread. Once that's fixed, we can look at other problems.
>
> Paolo
>
> [1] http://static.lwn.net/images/conf/rtlws11/papers/proc/p18.pdf which
> also mentions pv scheduling
* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall\ 2017-09-29 20:17 ` Marcelo Tosatti @ 2017-10-02 12:30 ` Paolo Bonzini 2017-10-02 12:48 ` Peter Zijlstra 0 siblings, 1 reply; 52+ messages in thread From: Paolo Bonzini @ 2017-10-02 12:30 UTC (permalink / raw) To: Marcelo Tosatti Cc: Peter Zijlstra, Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel, Thomas Gleixner On 29/09/2017 22:17, Marcelo Tosatti wrote: > On Fri, Sep 29, 2017 at 07:05:41PM +0200, Paolo Bonzini wrote: >> On 29/09/2017 18:40, Marcelo Tosatti wrote: >>> Thats not the state of things (userspace in vcpu-0 is not specially tailored >>> to not violate latencies in vcpu-1): that is not all user triggered >>> actions can be verified. >>> >>> Think "updatedb", and so on... >> >> _Which_ spinlock is it that can cause unwanted latency while running >> updatedb on VCPU0 and a real-time workload on VCPU1, and only so on virt >> because of the emulator thread? > > Hundreds of them (the one being hit is in timer_interrupt), but i went > to check and there are hundreds of raw spinlocks shared between the > kernel threads that run on isolated CPUs and vcpu-0. > >> Is this still broken if you set up >> priorities for the emulator thread correctly and use PI mutexes in QEMU? > > I don't see why it would not, if you have to schedule the emulator > thread to process and inject I/O interrupts for example. Yes, you're right if it's interrupt injections. If it's unexpected disk accesses, you can just add a QEMU I/O thread on a different physical CPU. The same physical CPU can host I/O threads for different guests if you expect them to do little. I don't understand why is it correct to delay interrupt injection just because VCPU0 is running in a spinlock-protected region? I just cannot see the reason why it's safe and not a recipe for priority inversions. Paolo >> And if so, what is the cause of interruptions in the emulator thread >> and how are these interruptions causing the jitter? 
> > Interrupt injections. > >> Priorities and priority inheritance (or lack of them) is a _known_ >> issue. Jan was doing his KVM-RT things in 2009 and he was talking about >> priorities[1] back then. The effect of correct priorities is to _lower_ >> jitter, not to make it worse, and anyway certainly not worse than >> SCHED_NORMAL I/O thread. Once that's fixed, we can look at other problems. >> >> Paolo >> >> [1] http://static.lwn.net/images/conf/rtlws11/papers/proc/p18.pdf which >> also mentions pv scheduling ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall\ 2017-10-02 12:30 ` Paolo Bonzini @ 2017-10-02 12:48 ` Peter Zijlstra 0 siblings, 0 replies; 52+ messages in thread From: Peter Zijlstra @ 2017-10-02 12:48 UTC (permalink / raw) To: Paolo Bonzini Cc: Marcelo Tosatti, Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel, Thomas Gleixner On Mon, Oct 02, 2017 at 02:30:33PM +0200, Paolo Bonzini wrote: > I don't understand why is it correct to delay interrupt injection just > because VCPU0 is running in a spinlock-protected region? I just cannot > see the reason why it's safe and not a recipe for priority inversions. It is indeed not right. Something like: raw_spin_lock(&some_lock); /* do crud */ raw_spin_unlock(&some_lock); Should not hold off the interrupt that tells you your finger is in imminent danger of becoming detached. Only when we do local_irq_disable() (ie. raw_spin_lock_irq*() and the like) should we avoid interrupt delivery. This whole fixation on spinlock regions is misguided and must stop, its wrong on all levels. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall 2017-09-25 9:13 ` Peter Zijlstra 2017-09-25 15:12 ` Paolo Bonzini @ 2017-09-26 23:22 ` Marcelo Tosatti 1 sibling, 0 replies; 52+ messages in thread From: Marcelo Tosatti @ 2017-09-26 23:22 UTC (permalink / raw) To: Peter Zijlstra Cc: Paolo Bonzini, Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel, Thomas Gleixner On Mon, Sep 25, 2017 at 11:13:16AM +0200, Peter Zijlstra wrote: > On Sun, Sep 24, 2017 at 11:57:53PM -0300, Marcelo Tosatti wrote: > > I think you are missing the following point: > > > > "vcpu0 can be interrupted when its not in a spinlock protected section, > > otherwise it can't." > > > > So you _have_ to communicate to the host when the guest enters/leaves a > > critical section. > > > > So this point of "everything needs to be RT and the priorities must be > > designed carefully", is this: > > > > WHEN in spinlock protected section (more specifically, when > > spinlock protected section _shared with realtime vcpus_), > > > > priority of vcpu0 > priority of emulator thread > > > > OTHERWISE > > > > priority of vcpu0 < priority of emulator thread. > > > > (*) > > > > So emulator thread can interrupt and inject interrupts to vcpu0. > > spinlock protected regions are not everything. What about lock-free > constructs where CPU's spin-wait on one another (there's plenty). True. Could add the "i am in a critical section" notifier to those constructs as well, which would call the hypercall. > And I'm clearly ignorant of how this emulation thread works, but why > would it run for a long time? Either it is needed for forward progress > of the VCPU or its not. If its not, it shouldn't run. It is needed only when not in a critical section. And when in a critical section, the vcpu should not get interrupted. But the solution to reserve one pCPU per socket, to run all emulator threads, achieves reasonable packing numbers without the downsides of the hypercall. 
* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall 2017-09-25 2:57 ` Marcelo Tosatti 2017-09-25 9:13 ` Peter Zijlstra @ 2017-09-25 16:20 ` Konrad Rzeszutek Wilk 1 sibling, 0 replies; 52+ messages in thread From: Konrad Rzeszutek Wilk @ 2017-09-25 16:20 UTC (permalink / raw) To: Marcelo Tosatti Cc: Paolo Bonzini, Peter Zijlstra, mingo, kvm, linux-kernel, Thomas Gleixner > I think you are missing the following point: > > "vcpu0 can be interrupted when its not in a spinlock protected section, > otherwise it can't." > > So you _have_ to communicate to the host when the guest enters/leaves a > critical section. How would this work for Windows or FreeBSD? ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall 2017-09-22 10:00 ` Peter Zijlstra 2017-09-22 10:56 ` Peter Zijlstra @ 2017-09-22 12:16 ` Marcelo Tosatti 2017-09-22 12:31 ` Peter Zijlstra 1 sibling, 1 reply; 52+ messages in thread From: Marcelo Tosatti @ 2017-09-22 12:16 UTC (permalink / raw) To: Peter Zijlstra Cc: Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel, Thomas Gleixner On Fri, Sep 22, 2017 at 12:00:05PM +0200, Peter Zijlstra wrote: > On Thu, Sep 21, 2017 at 10:10:41PM -0300, Marcelo Tosatti wrote: > > When executing guest vcpu-0 with FIFO:1 priority, which is necessary > > to > > deal with the following situation: > > > > VCPU-0 (housekeeping VCPU) VCPU-1 (realtime VCPU) > > > > raw_spin_lock(A) > > interrupted, schedule task T-1 raw_spin_lock(A) (spin) > > > > raw_spin_unlock(A) > > > > Certain operations must interrupt guest vcpu-0 (see trace below). > > Those traces don't make any sense. All they include is kvm_exit and you > can't tell anything from that. Hi Peter, OK lets describe whats happening: With QEMU emulator thread and vcpu-0 sharing a physical CPU (which is a request from several NFV customers, to improve guest packing), the following occurs when the guest generates the following pattern: 1. submit IO. 2. busy spin. Hang trace ========== Without FIFO priority: qemu-kvm-6705 [002] ....1.. 767785.648964: kvm_exit: reason IO_INSTRUCTION rip 0xe8fe info 1f00039 0 qemu-kvm-6705 [002] ....1.. 767785.648965: kvm_exit: reason IO_INSTRUCTION rip 0xe911 info 3f60008 0 qemu-kvm-6705 [002] ....1.. 767785.648968: kvm_exit: reason IO_INSTRUCTION rip 0x8984 info 608000b 0 qemu-kvm-6705 [002] ....1.. 767785.648971: kvm_exit: reason IO_INSTRUCTION rip 0xb313 info 1f70008 0 qemu-kvm-6705 [002] ....1.. 767785.648974: kvm_exit: reason IO_INSTRUCTION rip 0xb514 info 3f60000 0 qemu-kvm-6705 [002] ....1.. 767785.648977: kvm_exit: reason PENDING_INTERRUPT rip 0x8052 info 0 0 qemu-kvm-6705 [002] ....1.. 
767785.648980: kvm_exit: reason IO_INSTRUCTION rip 0xeee6 info 200040 0 qemu-kvm-6705 [002] ....1.. 767785.648999: kvm_exit: reason EPT_MISCONFIG rip 0x2120 info 0 0 The emulator thread is able to interrupt qemu vcpu0 at SCHED_NORMAL priority. With FIFO priority: Now, with qemu vcpu0 at SCHED_FIFO priority, which is necessary to avoid the following scenario: (*) VCPU-0 (housekeeping VCPU) VCPU-1 (realtime VCPU) raw_spin_lock(A) interrupted, schedule task T-1 raw_spin_lock(A) (spin) raw_spin_unlock(A) And the following code pattern by vcpu0: 1. submit IO. 2. busy spin. The emulator thread is unable to interrupt vcpu0 thread (vcpu0 busy spinning at SCHED_FIFO, emulator thread at SCHED_NORMAL), and you get a hang at boot as follows: qemu-kvm-7636 [002] ....1.. 768218.205065: kvm_exit: reason IO_INSTRUCTION rip 0xb313 info 1f70008 0 qemu-kvm-7636 [002] ....1.. 768218.205068: kvm_exit: reason IO_INSTRUCTION rip 0x8984 info 608000b 0 qemu-kvm-7636 [002] ....1.. 768218.205071: kvm_exit: reason IO_INSTRUCTION rip 0xb313 info 1f70008 0 qemu-kvm-7636 [002] ....1.. 768218.205074: kvm_exit: reason IO_INSTRUCTION rip 0x8984 info 608000b 0 qemu-kvm-7636 [002] ....1.. 768218.205077: kvm_exit: reason IO_INSTRUCTION rip 0xb313 info 1f70008 0 So to fix this problem, the patchset changes the priority of the VCPU thread (to fix (*)), only when taking spinlocks. Does that make sense now? > > > To fix this issue, only change guest vcpu-0 to FIFO priority > > on spinlock critical sections (see patch). > > This doesn't make sense. So you're saying that if you run all VCPUs as > FIFO things come apart? Why? Please see above. > And why can't they still come apart when the guest holds a spinlock? Hopefully the above makes sense. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall 2017-09-22 12:16 ` Marcelo Tosatti @ 2017-09-22 12:31 ` Peter Zijlstra 2017-09-22 12:36 ` Marcelo Tosatti 2017-09-22 12:40 ` [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall\ Marcelo Tosatti 0 siblings, 2 replies; 52+ messages in thread From: Peter Zijlstra @ 2017-09-22 12:31 UTC (permalink / raw) To: Marcelo Tosatti Cc: Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel, Thomas Gleixner On Fri, Sep 22, 2017 at 09:16:40AM -0300, Marcelo Tosatti wrote: > On Fri, Sep 22, 2017 at 12:00:05PM +0200, Peter Zijlstra wrote: > > On Thu, Sep 21, 2017 at 10:10:41PM -0300, Marcelo Tosatti wrote: > > > When executing guest vcpu-0 with FIFO:1 priority, which is necessary > > > to > > > deal with the following situation: > > > > > > VCPU-0 (housekeeping VCPU) VCPU-1 (realtime VCPU) > > > > > > raw_spin_lock(A) > > > interrupted, schedule task T-1 raw_spin_lock(A) (spin) > > > > > > raw_spin_unlock(A) > > > > > > Certain operations must interrupt guest vcpu-0 (see trace below). > > > > Those traces don't make any sense. All they include is kvm_exit and you > > can't tell anything from that. > > Hi Peter, > > OK lets describe whats happening: > > With QEMU emulator thread and vcpu-0 sharing a physical CPU > (which is a request from several NFV customers, to improve > guest packing), the following occurs when the guest generates > the following pattern: > > 1. submit IO. > 2. busy spin. User-space spinning is a bad idea in general and terminally broken in a RT setup. Sounds like you need to go fix qemu to not suck. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall 2017-09-22 12:31 ` Peter Zijlstra @ 2017-09-22 12:36 ` Marcelo Tosatti 2017-09-22 12:59 ` Peter Zijlstra 2017-09-22 12:40 ` [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall\ Marcelo Tosatti 1 sibling, 1 reply; 52+ messages in thread From: Marcelo Tosatti @ 2017-09-22 12:36 UTC (permalink / raw) To: Peter Zijlstra Cc: Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel, Thomas Gleixner On Fri, Sep 22, 2017 at 02:31:07PM +0200, Peter Zijlstra wrote: > On Fri, Sep 22, 2017 at 09:16:40AM -0300, Marcelo Tosatti wrote: > > On Fri, Sep 22, 2017 at 12:00:05PM +0200, Peter Zijlstra wrote: > > > On Thu, Sep 21, 2017 at 10:10:41PM -0300, Marcelo Tosatti wrote: > > > > When executing guest vcpu-0 with FIFO:1 priority, which is necessary > > > > to > > > > deal with the following situation: > > > > > > > > VCPU-0 (housekeeping VCPU) VCPU-1 (realtime VCPU) > > > > > > > > raw_spin_lock(A) > > > > interrupted, schedule task T-1 raw_spin_lock(A) (spin) > > > > > > > > raw_spin_unlock(A) > > > > > > > > Certain operations must interrupt guest vcpu-0 (see trace below). > > > > > > Those traces don't make any sense. All they include is kvm_exit and you > > > can't tell anything from that. > > > > Hi Peter, > > > > OK lets describe whats happening: > > > > With QEMU emulator thread and vcpu-0 sharing a physical CPU > > (which is a request from several NFV customers, to improve > > guest packing), the following occurs when the guest generates > > the following pattern: > > > > 1. submit IO. > > 2. busy spin. > > User-space spinning is a bad idea in general and terminally broken in > a RT setup. Sounds like you need to go fix qemu to not suck. One can run whatever application they want on the housekeeping vcpus. This is why rteval exists. This is not the realtime vcpu we are talking about. 
We can fix the BIOS, which is hanging now, but userspace can do whatever it wants on non-realtime vcpus (again, this is why the rteval test exists and is used by the -RT community as a testcase). I haven't understood what is wrong with the patch. Are you trying to avoid polluting the spinlock codepath to keep it simple?
* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall 2017-09-22 12:36 ` Marcelo Tosatti @ 2017-09-22 12:59 ` Peter Zijlstra 2017-09-25 1:52 ` Marcelo Tosatti 0 siblings, 1 reply; 52+ messages in thread From: Peter Zijlstra @ 2017-09-22 12:59 UTC (permalink / raw) To: Marcelo Tosatti Cc: Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel, Thomas Gleixner On Fri, Sep 22, 2017 at 09:36:39AM -0300, Marcelo Tosatti wrote: > On Fri, Sep 22, 2017 at 02:31:07PM +0200, Peter Zijlstra wrote: > > On Fri, Sep 22, 2017 at 09:16:40AM -0300, Marcelo Tosatti wrote: > > > On Fri, Sep 22, 2017 at 12:00:05PM +0200, Peter Zijlstra wrote: > > > > On Thu, Sep 21, 2017 at 10:10:41PM -0300, Marcelo Tosatti wrote: > > > > > When executing guest vcpu-0 with FIFO:1 priority, which is necessary > > > > > to > > > > > deal with the following situation: > > > > > > > > > > VCPU-0 (housekeeping VCPU) VCPU-1 (realtime VCPU) > > > > > > > > > > raw_spin_lock(A) > > > > > interrupted, schedule task T-1 raw_spin_lock(A) (spin) > > > > > > > > > > raw_spin_unlock(A) > > > > > > > > > > Certain operations must interrupt guest vcpu-0 (see trace below). > > > > > > > > Those traces don't make any sense. All they include is kvm_exit and you > > > > can't tell anything from that. > > > > > > Hi Peter, > > > > > > OK lets describe whats happening: > > > > > > With QEMU emulator thread and vcpu-0 sharing a physical CPU > > > (which is a request from several NFV customers, to improve > > > guest packing), the following occurs when the guest generates > > > the following pattern: > > > > > > 1. submit IO. > > > 2. busy spin. > > > > User-space spinning is a bad idea in general and terminally broken in > > a RT setup. Sounds like you need to go fix qemu to not suck. > > One can run whatever application they want on the housekeeping > vcpus. This is why rteval exists. Nobody cares about other tasks. The problem is between the VCPU and emulator thread. 
They get a priority inversion and live-lock because of spin-waiting. > This is not the realtime vcpu we are talking about. You're being confused: it's an RT _guest_, all VCPUs _must_ be RT. Because, as you ran into, the guest functions as a whole, not as a bunch of individual CPUs. > We can fix the BIOS, which is hanging now, but userspace can > do whatever it wants, on non realtime vcpus (again, this is why > rteval test exists and is used by the -RT community as > a testcase). But nobody cares what other tasks on the system do; all you care about is that the VCPUs make deterministic forward progress. > I haven't understood what is the wrong with the patch? Are you trying > to avoid pollution of the spinlock codepath to keep it simple? Your patch is voodoo programming. You don't solve the actual problem, you try to paper over it.
* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall 2017-09-22 12:59 ` Peter Zijlstra @ 2017-09-25 1:52 ` Marcelo Tosatti 2017-09-25 8:35 ` Peter Zijlstra 0 siblings, 1 reply; 52+ messages in thread From: Marcelo Tosatti @ 2017-09-25 1:52 UTC (permalink / raw) To: Peter Zijlstra Cc: Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel, Thomas Gleixner On Fri, Sep 22, 2017 at 02:59:51PM +0200, Peter Zijlstra wrote: > On Fri, Sep 22, 2017 at 09:36:39AM -0300, Marcelo Tosatti wrote: > > On Fri, Sep 22, 2017 at 02:31:07PM +0200, Peter Zijlstra wrote: > > > On Fri, Sep 22, 2017 at 09:16:40AM -0300, Marcelo Tosatti wrote: > > > > On Fri, Sep 22, 2017 at 12:00:05PM +0200, Peter Zijlstra wrote: > > > > > On Thu, Sep 21, 2017 at 10:10:41PM -0300, Marcelo Tosatti wrote: > > > > > > When executing guest vcpu-0 with FIFO:1 priority, which is necessary > > > > > > to > > > > > > deal with the following situation: > > > > > > > > > > > > VCPU-0 (housekeeping VCPU) VCPU-1 (realtime VCPU) > > > > > > > > > > > > raw_spin_lock(A) > > > > > > interrupted, schedule task T-1 raw_spin_lock(A) (spin) > > > > > > > > > > > > raw_spin_unlock(A) > > > > > > > > > > > > Certain operations must interrupt guest vcpu-0 (see trace below). > > > > > > > > > > Those traces don't make any sense. All they include is kvm_exit and you > > > > > can't tell anything from that. > > > > > > > > Hi Peter, > > > > > > > > OK lets describe whats happening: > > > > > > > > With QEMU emulator thread and vcpu-0 sharing a physical CPU > > > > (which is a request from several NFV customers, to improve > > > > guest packing), the following occurs when the guest generates > > > > the following pattern: > > > > > > > > 1. submit IO. > > > > 2. busy spin. > > > > > > User-space spinning is a bad idea in general and terminally broken in > > > a RT setup. Sounds like you need to go fix qemu to not suck. > > > > One can run whatever application they want on the housekeeping > > vcpus. 
This is why rteval exists. > > Nobody cares about other tasks. The problem is between the VCPU and > emulator thread. They get a priority inversion and live-lock because of > spin-waiting. > > > This is not the realtime vcpu we are talking about. > > You're being confused, its a RT _guest_, all VCPUs _must_ be RT. > Because, as you ran into, the guest functions as a whole, not as a bunch > of individual CPUs. > > > We can fix the BIOS, which is hanging now, but userspace can > > do whatever it wants, on non realtime vcpus (again, this is why > > rteval test exists and is used by the -RT community as > > a testcase). > > But nobody cares what other tasks on the system do, all you care about > is that the VCPUs make deterministic forward progress. > > > I haven't understood what is the wrong with the patch? Are you trying > > to avoid pollution of the spinlock codepath to keep it simple? > > Your patch is voodoo programming. You don't solve the actual problem, > you try and paper over it. Priority boosting on a particular section of code is voodoo programming? ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall 2017-09-25 1:52 ` Marcelo Tosatti @ 2017-09-25 8:35 ` Peter Zijlstra 0 siblings, 0 replies; 52+ messages in thread From: Peter Zijlstra @ 2017-09-25 8:35 UTC (permalink / raw) To: Marcelo Tosatti Cc: Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel, Thomas Gleixner On Sun, Sep 24, 2017 at 10:52:58PM -0300, Marcelo Tosatti wrote: > On Fri, Sep 22, 2017 at 02:59:51PM +0200, Peter Zijlstra wrote: > > Your patch is voodoo programming. You don't solve the actual problem, > > you try and paper over it. > > Priority boosting on a particular section of code is voodoo programming? Yes, because there's nothing that prevents said section of code from triggering the exact problem you're trying to avoid. The real fix is making sure the problem cannot happen to begin with, which is done by fixing the interaction between the VCPU and that emulation thread thing. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall\ 2017-09-22 12:31 ` Peter Zijlstra 2017-09-22 12:36 ` Marcelo Tosatti @ 2017-09-22 12:40 ` Marcelo Tosatti 2017-09-22 13:01 ` Peter Zijlstra 1 sibling, 1 reply; 52+ messages in thread From: Marcelo Tosatti @ 2017-09-22 12:40 UTC (permalink / raw) To: Peter Zijlstra Cc: Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel, Thomas Gleixner On Fri, Sep 22, 2017 at 02:31:07PM +0200, Peter Zijlstra wrote: > On Fri, Sep 22, 2017 at 09:16:40AM -0300, Marcelo Tosatti wrote: > > On Fri, Sep 22, 2017 at 12:00:05PM +0200, Peter Zijlstra wrote: > > > On Thu, Sep 21, 2017 at 10:10:41PM -0300, Marcelo Tosatti wrote: > > > > When executing guest vcpu-0 with FIFO:1 priority, which is necessary > > > > to > > > > deal with the following situation: > > > > > > > > VCPU-0 (housekeeping VCPU) VCPU-1 (realtime VCPU) > > > > > > > > raw_spin_lock(A) > > > > interrupted, schedule task T-1 raw_spin_lock(A) (spin) > > > > > > > > raw_spin_unlock(A) > > > > > > > > Certain operations must interrupt guest vcpu-0 (see trace below). > > > > > > Those traces don't make any sense. All they include is kvm_exit and you > > > can't tell anything from that. > > > > Hi Peter, > > > > OK lets describe whats happening: > > > > With QEMU emulator thread and vcpu-0 sharing a physical CPU > > (which is a request from several NFV customers, to improve > > guest packing), the following occurs when the guest generates > > the following pattern: > > > > 1. submit IO. > > 2. busy spin. > > User-space spinning is a bad idea in general and terminally broken in > a RT setup. Sounds like you need to go fix qemu to not suck. Are you arguing its invalid for the following application to execute on housekeeping vcpu of a realtime system: void main(void) { submit_IO(); do { computation(); } while (!interrupted()); } Really? Replace "busy spin" by "useful computation until interrupted". ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall\ 2017-09-22 12:40 ` [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall\ Marcelo Tosatti @ 2017-09-22 13:01 ` Peter Zijlstra 2017-09-25 2:22 ` Marcelo Tosatti 0 siblings, 1 reply; 52+ messages in thread From: Peter Zijlstra @ 2017-09-22 13:01 UTC (permalink / raw) To: Marcelo Tosatti Cc: Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel, Thomas Gleixner On Fri, Sep 22, 2017 at 09:40:05AM -0300, Marcelo Tosatti wrote: > Are you arguing its invalid for the following application to execute on > housekeeping vcpu of a realtime system: > > void main(void) > { > > submit_IO(); > do { > computation(); > } while (!interrupted()); > } > > Really? No. Nobody cares about random crap tasks. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall\ 2017-09-22 13:01 ` Peter Zijlstra @ 2017-09-25 2:22 ` Marcelo Tosatti 2017-09-25 8:58 ` Peter Zijlstra 2017-09-25 10:41 ` Thomas Gleixner 0 siblings, 2 replies; 52+ messages in thread From: Marcelo Tosatti @ 2017-09-25 2:22 UTC (permalink / raw) To: Peter Zijlstra Cc: Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel, Thomas Gleixner On Fri, Sep 22, 2017 at 03:01:41PM +0200, Peter Zijlstra wrote: > On Fri, Sep 22, 2017 at 09:40:05AM -0300, Marcelo Tosatti wrote: > > > Are you arguing its invalid for the following application to execute on > > housekeeping vcpu of a realtime system: > > > > void main(void) > > { > > > > submit_IO(); > > do { > > computation(); > > } while (!interrupted()); > > } > > > > Really? > > No. Nobody cares about random crap tasks. Nobody has control over all code that runs in userspace Peter. And not supporting a valid sequence of steps because its "crap" (whatever your definition of crap is) makes no sense. It might be that someone decides to do the above (i really can't see any actual reasoning i can follow and agree on your "its crap" argument), this truly seems valid to me. So lets follow the reasoning steps: 1) "NACK, because you didnt understand the problem". OK thats an invalid NACK, you did understand the problem later and now your argument is the following. 2) "NACK, because all VCPUs should be SCHED_FIFO all the time". But the existence of this code path from userspace: submit_IO(); do { computation(); } while (!interrupted()); Its a supported code sequence, and works fine in a non-RT environment. Therefore it should work on an -RT environment. Think of any two applications, such as an IO application and a CPU bound application. The IO application will be severely impacted, or never execute, in such scenario. Is that combination of tasks "random crap tasks" ? 
(No, it's not, which makes me think you're just NACKing without giving enough thought to the problem). So please give me some logical reasoning for the NACK (people can live with it, but it has to be good enough to justify the decreased packing of guests onto pCPUs): 1) "Voodoo programming" (it's hard for me to parse what you mean by that... do you mean you foresee this style of priority boosting causing problems in the future? Can you give an example?). Is there something fundamentally wrong with priority boosting in spinlock sections, or is this particular style of priority boosting wrong? 2) "Pollution of the kernel code path". That makes sense to me, if that's what you're concerned about. 3) "Reduction of spinlock performance". It's true, but NFV workloads don't care about it. 4) "All vcpus should be SCHED_FIFO all the time". OK, why is that? What dictates that to be true? What the patch does is the following: it reduces the window where SCHED_FIFO is applied to vcpu0 to those sections where a spinlock is shared between -RT vcpus and vcpu0 (why: because otherwise, when the emulator thread is sharing a pCPU with vcpu0, it is unable to inject interrupts into vcpu0). And it's being rejected because: Please fill in.
* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall\ 2017-09-25 2:22 ` Marcelo Tosatti @ 2017-09-25 8:58 ` Peter Zijlstra 2017-09-25 10:41 ` Thomas Gleixner 1 sibling, 0 replies; 52+ messages in thread From: Peter Zijlstra @ 2017-09-25 8:58 UTC (permalink / raw) To: Marcelo Tosatti Cc: Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel, Thomas Gleixner On Sun, Sep 24, 2017 at 11:22:38PM -0300, Marcelo Tosatti wrote: > On Fri, Sep 22, 2017 at 03:01:41PM +0200, Peter Zijlstra wrote: > > On Fri, Sep 22, 2017 at 09:40:05AM -0300, Marcelo Tosatti wrote: > > > > > Are you arguing its invalid for the following application to execute on > > > housekeeping vcpu of a realtime system: > > > > > > void main(void) > > > { > > > > > > submit_IO(); > > > do { > > > computation(); > > > } while (!interrupted()); > > > } > > > > > > Really? > > > > No. Nobody cares about random crap tasks. > > Nobody has control over all code that runs in userspace Peter. And not > supporting a valid sequence of steps because its "crap" (whatever your > definition of crap is) makes no sense. > > It might be that someone decides to do the above (i really can't see > any actual reasoning i can follow and agree on your "its crap" > argument), this truly seems valid to me. We don't care what other tasks do. This isn't a hard thing to understand. You're free to run whatever junk on your CPUs. This doesn't (much) affect the correct functioning of RT tasks that you also run there. > So lets follow the reasoning steps: > > 1) "NACK, because you didnt understand the problem". > > OK thats an invalid NACK, you did understand the problem > later and now your argument is the following. It was a NACK because you wrote a shit changelog that didn't explain the problem. But yes. > 2) "NACK, because all VCPUs should be SCHED_FIFO all the time". 
Very much, if you want a RT guest, all VCPU's should run at RT prio and the interaction between the VCPUs and all supporting threads should be designed for RT. > But the existence of this code path from userspace: > > submit_IO(); > do { > computation(); > } while (!interrupted()); > > Its a supported code sequence, and works fine in a non-RT environment. Who cares about that chunk of code? Have you forgotten to mention that this is the form of the emulation thread? > Therefore it should work on an -RT environment. No, this is where you're wrong. That code works on -RT as long as you don't expect it to be a valid RT program. -RT kernels will run !RT stuff just fine. But the moment you run a program as RT (FIFO/RR/DEADLINE) it had better damn well be a valid RT program, and that excludes a lot of code. > So please give me some logical reasoning for the NACK (people can live with > it, but it has to be good enough to justify the decreasing packing of > guests in pCPUs): > > 1) "Voodoo programming" (its hard for me to parse what you mean with > that... do you mean you foresee this style of priority boosting causing > problems in the future? Can you give an example?). Your 'solution' only works if you sacrifice a goat on a full moon, because only that ensures the guest doesn't VM_EXIT and cause the self-same problem while you've boosted it. Because you've _not_ fixed the actual problem! > Is there fundamentally wrong about priority boosting in spinlock > sections, or this particular style of priority boosting is wrong? Yes, its fundamentally crap, because it doesn't guarantee anything. RT is about making guarantees. An RT program needs a provable forward progress guarantee at the very least. It including a priority inversion disqualifies it from being sane. > 2) "Pollution of the kernel code path". That makes sense to me, if thats > whats your concerned about. Also.. > 3) "Reduction of spinlock performance". Its true, but for NFV workloads > people don't care about. 
I've no idea what an NFV is. > 4) "All vcpus should be SCHED_FIFO all the time". OK, why is that? > What dictates that to be true? Solid engineering. Does the guest kernel function as a bunch of independent CPUs, or does it assume all CPUs are equal and have strong inter-cpu connections? Linux is the latter; therefore if one VCPU is RT, they all should be. Dammit, you even recognise this in the spin-owner preemption issue you're hacking around, but then go arse-about-face 'solving' it. > What the patch does is the following: > It reduces the window where SCHED_FIFO is applied vcpu0 > to those were a spinlock is shared between -RT vcpus and vcpu0 > (why: because otherwise, when the emulator thread is sharing a > pCPU with vcpu0, its unable to generate interrupts vcpu0). > And its being rejected because: It's not fixing the actual problem. The real problem is the priority inversion between the VCPU and the emulator thread; _that_ is what needs fixing. Rewrite the VCPU/emulator interaction to be a proper RT construct. Then you can run the VCPU at RT prio as you should, and the guest can issue all the VM_EXIT things it wants at any time and still function correctly.
* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall\
  2017-09-25  2:22 ` Marcelo Tosatti
  2017-09-25  8:58 ` Peter Zijlstra
@ 2017-09-25 10:41 ` Thomas Gleixner
  2017-09-25 18:28   ` Jan Kiszka
  1 sibling, 1 reply; 52+ messages in thread

From: Thomas Gleixner @ 2017-09-25 10:41 UTC (permalink / raw)
To: Marcelo Tosatti
Cc: Peter Zijlstra, Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel

On Sun, 24 Sep 2017, Marcelo Tosatti wrote:
> On Fri, Sep 22, 2017 at 03:01:41PM +0200, Peter Zijlstra wrote:
> What the patch does is the following:
> It reduces the window where SCHED_FIFO is applied to vcpu0
> to those where a spinlock is shared between -RT vcpus and vcpu0
> (why: because otherwise, when the emulator thread is sharing a
> pCPU with vcpu0, it's unable to generate interrupts to vcpu0).
>
> And it's being rejected because:
> Please fill in.

Your patch is just papering over one particular problem, but it's not
fixing the root cause. That's the worst engineering approach, and we all
know how fast this kind of crap falls over.

There are enough other issues which can cause starvation of the RT VCPUs
when the housekeeping VCPU is preempted, not just the particular problem
which you observed.

Back when I did the first prototype of RT in KVM, I made it entirely
clear that you have to spend one physical CPU for _each_ VCPU,
independent of whether the VCPU is reserved for RT workers or is the
housekeeping VCPU. The emulator thread needs to run on a separate
physical CPU.

If you want to run the housekeeping VCPU and the emulator thread on the
same physical CPU, then you have to make sure that both the emulator and
the housekeeper side of affairs are designed and implemented with RT in
mind. As long as that is not the case, you simply cannot run them on the
same physical CPU. RT is about guarantees, and guarantees cannot be
achieved with bandaid engineering. It's that simple, end of story.

Thanks,

	tglx
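[Editor's note: the "one physical CPU per VCPU, emulator thread on its own CPU" layout Thomas describes is usually expressed via libvirt vCPU/emulator pinning. A purely illustrative fragment — the CPU numbers and the FIFO priority are placeholders for a specific machine, not values from this thread:]

```xml
<domain type='kvm'>
  <vcpu placement='static'>2</vcpu>
  <cputune>
    <!-- housekeeping VCPU-0 on its own physical CPU -->
    <vcpupin vcpu='0' cpuset='2'/>
    <!-- realtime VCPU-1 on its own physical CPU -->
    <vcpupin vcpu='1' cpuset='3'/>
    <!-- emulator thread kept off both VCPU CPUs -->
    <emulatorpin cpuset='1'/>
    <!-- run the RT VCPU as SCHED_FIFO -->
    <vcpusched vcpus='1' scheduler='fifo' priority='1'/>
  </cputune>
</domain>
```

With this layout the emulator thread can never preempt vcpu0, which is exactly the prio-inversion window the hypercall in this thread tries to paper over.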
* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall\
  2017-09-25 10:41 ` Thomas Gleixner
@ 2017-09-25 18:28 ` Jan Kiszka
  0 siblings, 0 replies; 52+ messages in thread

From: Jan Kiszka @ 2017-09-25 18:28 UTC (permalink / raw)
To: Thomas Gleixner, Marcelo Tosatti
Cc: Peter Zijlstra, Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel

On 2017-09-25 12:41, Thomas Gleixner wrote:
> On Sun, 24 Sep 2017, Marcelo Tosatti wrote:
>> [...]
>
> Your patch is just papering over one particular problem, but it's not
> fixing the root cause. That's the worst engineering approach and we all
> know how fast this kind of crap falls over.
>
> There are enough other issues which can cause starvation of the RT VCPUs
> when the housekeeping VCPU is preempted, not just the particular problem
> which you observed.
>
> Back then when I did the first prototype of RT in KVM, I made it entirely
> clear that you have to spend one physical CPU for _each_ VCPU, independent
> whether the VCPU is reserved for RT workers or the housekeeping VCPU. The
> emulator thread needs to run on a separate physical CPU.
>
> If you want to run the housekeeping VCPU and the emulator thread on the
> same physical CPU then you have to make sure that both the emulator and
> the housekeeper side of affairs are designed and implemented with RT in
> mind. As long as that is not the case, you simply cannot run them on the
> same physical CPU. RT is about guarantees and guarantees cannot be
> achieved with bandaid engineering.

It's even more complicated for the guest: it needs to be aware of the
latencies its interaction with a VM - instead of a real machine - may
cause while it is in whatever critical sections. That's an additional
design dimension that would be very hard to establish and maintain,
even in Linux.

The only way around that is to truly decouple guest CPUs via full core
isolation inside the Linux guest and have your RT guest application
exploit this partitioning, e.g. by using lock-less inter-core
communication without kernel help.

The reason I was playing with PV-sched back then was to explore how you
could map the guest's task prio dynamically onto its host vcpu. That
involved boosting whenever an event (aka irq) came in for the guest
vcpu. It turned out to be a more or less working solution looking for a
real-world problem.

Jan

-- 
Siemens AG, Corporate Technology, CT RDA ITP SES-DE
Corporate Competence Center Embedded Linux
* Re: [patch 0/3] KVM KVM_HC_RT_PRIO hypercall support
  2017-09-21 11:38 [patch 0/3] KVM KVM_HC_RT_PRIO hypercall support Marcelo Tosatti
  ` (2 preceding siblings ...)
  2017-09-21 11:38 ` [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall Marcelo Tosatti
@ 2017-09-21 17:45 ` Jan Kiszka
  2017-09-22  1:19   ` Marcelo Tosatti
  3 siblings, 1 reply; 52+ messages in thread

From: Jan Kiszka @ 2017-09-21 17:45 UTC (permalink / raw)
To: Marcelo Tosatti, kvm, linux-kernel

On 2017-09-21 13:38, Marcelo Tosatti wrote:
> When executing guest vcpu-0 with FIFO:1 priority, which is necessary to
> deal with the following situation:
>
>   VCPU-0 (housekeeping VCPU)          VCPU-1 (realtime VCPU)
>
>   raw_spin_lock(A)
>   interrupted, schedule task T-1      raw_spin_lock(A) (spin)
>   raw_spin_unlock(A)
>
> Certain operations must interrupt guest vcpu-0 (see trace below).
>
> To fix this issue, only change guest vcpu-0 to FIFO priority
> on spinlock critical sections (see patch).
>
> Hang trace
> ==========
> [...]
>
> Performance numbers (kernel compilation with make -j2)
> ======================================================
>
> With hypercall:    4:40 (make -j2)
> Without hypercall: 3:38 (make -j2)
>
> Note for NFV workloads spinlock performance is not relevant
> since DPDK should not enter the kernel (and housekeeping vcpu
> performance is far from a key factor).
>
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

That sounds familiar, though not yet the same: :)

http://git.kiszka.org/?p=linux-kvm.git;a=shortlog;h=refs/heads/queues/paravirt-sched
(paper: http://lwn.net/images/conf/rtlws11/papers/proc/p18.pdf)

I suppose your goal is not to enable the host to follow the guest
scheduler priority completely but only to have prio-ceiling for such
short critical sections. Maybe still useful to think ahead about future
extensions when actually introducing such an interface.

But shouldn't there be some limit on the maximum prio the guest can
select?

Jan

-- 
Siemens AG, Corporate Technology, CT RDA ITP SES-DE
Corporate Competence Center Embedded Linux
* Re: [patch 0/3] KVM KVM_HC_RT_PRIO hypercall support
  2017-09-21 17:45 ` [patch 0/3] KVM KVM_HC_RT_PRIO hypercall support Jan Kiszka
@ 2017-09-22  1:19 ` Marcelo Tosatti
  2017-09-22  6:23   ` Jan Kiszka
  0 siblings, 1 reply; 52+ messages in thread

From: Marcelo Tosatti @ 2017-09-22  1:19 UTC (permalink / raw)
To: Jan Kiszka; +Cc: kvm, linux-kernel

On Thu, Sep 21, 2017 at 07:45:32PM +0200, Jan Kiszka wrote:
> On 2017-09-21 13:38, Marcelo Tosatti wrote:
>> [...]
>
> That sounds familiar, though not yet the same: :)
>
> http://git.kiszka.org/?p=linux-kvm.git;a=shortlog;h=refs/heads/queues/paravirt-sched
> (paper: http://lwn.net/images/conf/rtlws11/papers/proc/p18.pdf)
>
> I suppose your goal is not to enable the host to follow the guest
> scheduler priority completely but only to have prio-ceiling for such
> short critical sections. Maybe still useful to think ahead about future
> extensions when actually introducing such an interface.

Hi Jan!

Hum... I'll take a look at your interface/paper and get back to you.

> But shouldn't there be some limit on the maximum prio the guest can
> select?

The SCHED_FIFO prio is fixed, selectable when QEMU starts. Do you
envision any other use case than a fixed priority value selectable
at QEMU initialization?

Thanks
* Re: [patch 0/3] KVM KVM_HC_RT_PRIO hypercall support
  2017-09-22  1:19 ` Marcelo Tosatti
@ 2017-09-22  6:23 ` Jan Kiszka
  2017-09-26 23:59   ` Marcelo Tosatti
  0 siblings, 1 reply; 52+ messages in thread

From: Jan Kiszka @ 2017-09-22  6:23 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: kvm, linux-kernel

On 2017-09-22 03:19, Marcelo Tosatti wrote:
> On Thu, Sep 21, 2017 at 07:45:32PM +0200, Jan Kiszka wrote:
>> On 2017-09-21 13:38, Marcelo Tosatti wrote:
>>> [...]
>>
>> That sounds familiar, though not yet the same: :)
>>
>> http://git.kiszka.org/?p=linux-kvm.git;a=shortlog;h=refs/heads/queues/paravirt-sched
>> (paper: http://lwn.net/images/conf/rtlws11/papers/proc/p18.pdf)
>>
>> I suppose your goal is not to enable the host to follow the guest
>> scheduler priority completely but only to have prio-ceiling for such
>> short critical sections. Maybe still useful to think ahead about future
>> extensions when actually introducing such an interface.
>
> Hi Jan!
>
> Hum... I'll take a look at your interface/paper and get back to you.
>
>> But shouldn't there be some limit on the maximum prio the guest can
>> select?
>
> The SCHED_FIFO prio is fixed, selectable when QEMU starts. Do you
> envision any other use case than a fixed priority value selectable
> at QEMU initialization?

Oh, indeed, this is a pure prio-ceiling variant with a host-defined
ceiling value.

But it's very inefficient to use a hypercall for entering and leaving
each and every section. I would strongly recommend using a lazy scheme
where the guest writes the desired state into a shared memory page, and
the host only evaluates that prior to taking a scheduling decision, or
at least only on real vmexits. We're using such a scheme successfully to
accelerate the fast path of prio-ceiling for pthread mutexes in the
Xenomai real-time extension.

Jan

-- 
Siemens AG, Corporate Technology, CT RDA ITP SES-DE
Corporate Competence Center Embedded Linux
* Re: [patch 0/3] KVM KVM_HC_RT_PRIO hypercall support
  2017-09-22  6:23 ` Jan Kiszka
@ 2017-09-26 23:59 ` Marcelo Tosatti
  0 siblings, 0 replies; 52+ messages in thread

From: Marcelo Tosatti @ 2017-09-26 23:59 UTC (permalink / raw)
To: Jan Kiszka; +Cc: kvm, linux-kernel

On Fri, Sep 22, 2017 at 08:23:02AM +0200, Jan Kiszka wrote:
> On 2017-09-22 03:19, Marcelo Tosatti wrote:
>> [...]
>
> Oh, indeed, this is a pure prio-ceiling variant with host-defined
> ceiling value.
>
> But it's very inefficient to use a hypercall for entering and leaving
> each and every section. I would strongly recommend using a lazy scheme
> where the guest writes the desired state into a shared memory page, and
> the host only evaluates that prior to taking a scheduling decision, or
> at least only on real vmexits. We're using such a scheme successfully to
> accelerate the fast path of prio-ceiling for pthread mutexes in the
> Xenomai real-time extension.

Yes, a faster scheme was envisioned, but not developed.
end of thread, other threads:[~2017-10-02 12:48 UTC | newest]

Thread overview: 52+ messages

2017-09-21 11:38 [patch 0/3] KVM KVM_HC_RT_PRIO hypercall support Marcelo Tosatti
2017-09-21 11:38 ` [patch 1/3] KVM: x86: add per-vcpu option to set guest vcpu -RT priority Marcelo Tosatti
2017-09-21 11:38 ` [patch 2/3] KVM: x86: KVM_HC_RT_PRIO hypercall (host-side) Marcelo Tosatti
2017-09-21 13:32   ` Konrad Rzeszutek Wilk
2017-09-21 13:49   ` Paolo Bonzini
2017-09-22  1:08   ` Marcelo Tosatti
2017-09-22  7:23   ` Paolo Bonzini
2017-09-22 12:24   ` Marcelo Tosatti
2017-09-21 11:38 ` [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall Marcelo Tosatti
2017-09-21 13:36   ` Konrad Rzeszutek Wilk
2017-09-21 14:06   ` Peter Zijlstra
2017-09-22  1:10   ` Marcelo Tosatti
2017-09-22 10:00   ` Peter Zijlstra
2017-09-22 10:56   ` Peter Zijlstra
2017-09-22 12:33   ` Marcelo Tosatti
2017-09-22 12:55   ` Peter Zijlstra
2017-09-23 10:56   ` Paolo Bonzini
2017-09-23 13:41   ` Peter Zijlstra
2017-09-24 13:05   ` Paolo Bonzini
2017-09-25  2:57   ` Marcelo Tosatti
2017-09-25  9:13   ` Peter Zijlstra
2017-09-25 15:12   ` Paolo Bonzini
2017-09-26 22:49   ` [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall\ Marcelo Tosatti
2017-09-27  9:37   ` Paolo Bonzini
2017-09-28  0:44   ` Marcelo Tosatti
2017-09-28  7:22   ` Paolo Bonzini
2017-09-28 21:35   ` Marcelo Tosatti
2017-09-28 21:41   ` Marcelo Tosatti
2017-09-29  8:18   ` Paolo Bonzini
2017-09-29 16:40   ` Marcelo Tosatti
2017-09-29 17:05   ` Paolo Bonzini
2017-09-29 20:17   ` Marcelo Tosatti
2017-10-02 12:30   ` Paolo Bonzini
2017-10-02 12:48   ` Peter Zijlstra
2017-09-26 23:22   ` [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall Marcelo Tosatti
2017-09-25 16:20   ` Konrad Rzeszutek Wilk
2017-09-22 12:16   ` Marcelo Tosatti
2017-09-22 12:31   ` Peter Zijlstra
2017-09-22 12:36   ` Marcelo Tosatti
2017-09-22 12:59   ` Peter Zijlstra
2017-09-25  1:52   ` Marcelo Tosatti
2017-09-25  8:35   ` Peter Zijlstra
2017-09-22 12:40   ` [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall\ Marcelo Tosatti
2017-09-22 13:01   ` Peter Zijlstra
2017-09-25  2:22   ` Marcelo Tosatti
2017-09-25  8:58   ` Peter Zijlstra
2017-09-25 10:41   ` Thomas Gleixner
2017-09-25 18:28   ` Jan Kiszka
2017-09-21 17:45 ` [patch 0/3] KVM KVM_HC_RT_PRIO hypercall support Jan Kiszka
2017-09-22  1:19   ` Marcelo Tosatti
2017-09-22  6:23   ` Jan Kiszka
2017-09-26 23:59   ` Marcelo Tosatti