* [patch 0/3] KVM KVM_HC_RT_PRIO hypercall support
@ 2017-09-21 11:38 Marcelo Tosatti
  2017-09-21 11:38 ` [patch 1/3] KVM: x86: add per-vcpu option to set guest vcpu -RT priority Marcelo Tosatti
                   ` (3 more replies)
  0 siblings, 4 replies; 52+ messages in thread
From: Marcelo Tosatti @ 2017-09-21 11:38 UTC (permalink / raw)
  To: kvm, linux-kernel

Executing guest vcpu-0 with FIFO:1 priority is necessary to
deal with the following situation:

VCPU-0 (housekeeping VCPU)              VCPU-1 (realtime VCPU)

raw_spin_lock(A)
interrupted, schedule task T-1          raw_spin_lock(A) (spin)

raw_spin_unlock(A)

However, certain operations must be able to interrupt guest vcpu-0
(see trace below), which a constant FIFO:1 priority prevents.

To fix this issue, change guest vcpu-0 to FIFO priority only inside
spinlock critical sections (see patch).

Hang trace
==========

Without FIFO priority:

qemu-kvm-6705  [002] ....1.. 767785.648964: kvm_exit: reason IO_INSTRUCTION rip 0xe8fe info 1f00039 0
qemu-kvm-6705  [002] ....1.. 767785.648965: kvm_exit: reason IO_INSTRUCTION rip 0xe911 info 3f60008 0
qemu-kvm-6705  [002] ....1.. 767785.648968: kvm_exit: reason IO_INSTRUCTION rip 0x8984 info 608000b 0
qemu-kvm-6705  [002] ....1.. 767785.648971: kvm_exit: reason IO_INSTRUCTION rip 0xb313 info 1f70008 0
qemu-kvm-6705  [002] ....1.. 767785.648974: kvm_exit: reason IO_INSTRUCTION rip 0xb514 info 3f60000 0
qemu-kvm-6705  [002] ....1.. 767785.648977: kvm_exit: reason PENDING_INTERRUPT rip 0x8052 info 0 0
qemu-kvm-6705  [002] ....1.. 767785.648980: kvm_exit: reason IO_INSTRUCTION rip 0xeee6 info 200040 0
qemu-kvm-6705  [002] ....1.. 767785.648999: kvm_exit: reason EPT_MISCONFIG rip 0x2120 info 0 0

With FIFO priority:

qemu-kvm-7636  [002] ....1.. 768218.205065: kvm_exit: reason IO_INSTRUCTION rip 0xb313 info 1f70008 0
qemu-kvm-7636  [002] ....1.. 768218.205068: kvm_exit: reason IO_INSTRUCTION rip 0x8984 info 608000b 0
qemu-kvm-7636  [002] ....1.. 768218.205071: kvm_exit: reason IO_INSTRUCTION rip 0xb313 info 1f70008 0
qemu-kvm-7636  [002] ....1.. 768218.205074: kvm_exit: reason IO_INSTRUCTION rip 0x8984 info 608000b 0
qemu-kvm-7636  [002] ....1.. 768218.205077: kvm_exit: reason IO_INSTRUCTION rip 0xb313 info 1f70008 0
..

Performance numbers (kernel compilation with make -j2)
======================================================

With hypercall: 4:40.  (make -j2)
Without hypercall: 3:38.  (make -j2)

Note that for NFV workloads spinlock performance is not relevant,
since DPDK should not enter the kernel (and housekeeping vcpu
performance is far from a key factor).

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [patch 1/3] KVM: x86: add per-vcpu option to set guest vcpu -RT priority
  2017-09-21 11:38 [patch 0/3] KVM KVM_HC_RT_PRIO hypercall support Marcelo Tosatti
@ 2017-09-21 11:38 ` Marcelo Tosatti
  2017-09-21 11:38 ` [patch 2/3] KVM: x86: KVM_HC_RT_PRIO hypercall (host-side) Marcelo Tosatti
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 52+ messages in thread
From: Marcelo Tosatti @ 2017-09-21 11:38 UTC (permalink / raw)
  To: kvm, linux-kernel; +Cc: Marcelo Tosatti

[-- Attachment #1: 01-kvm-vcpu-allow-setfifo --]
[-- Type: text/plain, Size: 3906 bytes --]

In preparation for the patch which adds a hypercall allowing
the guest vcpu to switch between:

	-> SCHED_FIFO.
	-> SCHED_NORMAL.

Add controls to:

	1) Select which SCHED_FIFO priority will be used on the host
	   when SCHED_FIFO is requested.

	2) Enable/disable ability to use the hypercall, on a per-vcpu
	   basis.
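
A minimal userspace usage sketch (illustration only, not part of this
patch; vcpu_fd is assumed to be an open KVM vcpu file descriptor, and
KVM_CAP_VCPU_RT_PRIO_HC is assumed to have been checked beforehand):

	#include <linux/kvm.h>
	#include <sys/ioctl.h>
	#include <stdio.h>

	struct kvm_vcpu_rt_prio rt_prio = {
		.enabled = 1,
		.sched_priority = 1,	/* host-side FIFO priority, 1..99 */
	};

	/* Allow this vcpu to use KVM_HC_RT_PRIO, with FIFO priority 1. */
	if (ioctl(vcpu_fd, KVM_SET_VCPU_RT_PRIO_HC, &rt_prio) < 0)
		perror("KVM_SET_VCPU_RT_PRIO_HC");

Passing enabled == 0 disables the hypercall again for that vcpu.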


Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---
 arch/x86/include/asm/kvm_host.h |    4 ++++
 arch/x86/include/uapi/asm/kvm.h |    5 +++++
 arch/x86/kvm/x86.c              |   29 +++++++++++++++++++++++++++++
 include/uapi/linux/kvm.h        |    3 +++
 4 files changed, 41 insertions(+)

Index: kvm.fifopriohc-submit/arch/x86/include/asm/kvm_host.h
===================================================================
--- kvm.fifopriohc-submit.orig/arch/x86/include/asm/kvm_host.h
+++ kvm.fifopriohc-submit/arch/x86/include/asm/kvm_host.h
@@ -688,6 +688,10 @@ struct kvm_vcpu_arch {
 
 	/* GPA available (AMD only) */
 	bool gpa_available;
+
+	/* Enable RT prio hypercall */
+	bool enable_rt_prio_hc;
+	int rt_sched_priority;
 };
 
 struct kvm_lpage_info {
Index: kvm.fifopriohc-submit/arch/x86/kvm/x86.c
===================================================================
--- kvm.fifopriohc-submit.orig/arch/x86/kvm/x86.c
+++ kvm.fifopriohc-submit/arch/x86/kvm/x86.c
@@ -2683,6 +2683,7 @@ int kvm_vm_ioctl_check_extension(struct
 	case KVM_CAP_SET_BOOT_CPU_ID:
  	case KVM_CAP_SPLIT_IRQCHIP:
 	case KVM_CAP_IMMEDIATE_EXIT:
+	case KVM_CAP_VCPU_RT_PRIO_HC:
 		r = 1;
 		break;
 	case KVM_CAP_ADJUST_CLOCK:
@@ -3386,6 +3387,25 @@ static int kvm_set_guest_paused(struct k
 	return 0;
 }
 
+static int kvm_vcpu_ioctl_set_rt_prio_hc(struct kvm_vcpu *vcpu,
+					    struct kvm_vcpu_rt_prio *rt_prio)
+{
+	if (rt_prio->enabled == 0) {
+		vcpu->arch.enable_rt_prio_hc = false;
+		return 0;
+	}
+
+	if (rt_prio->sched_priority < 1 ||
+	    rt_prio->sched_priority > 99)
+		return -EINVAL;
+
+
+	vcpu->arch.enable_rt_prio_hc = true;
+	vcpu->arch.rt_sched_priority = rt_prio->sched_priority;
+
+	return 0;
+}
+
 static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
 				     struct kvm_enable_cap *cap)
 {
@@ -3682,6 +3702,15 @@ long kvm_arch_vcpu_ioctl(struct file *fi
 		r = kvm_vcpu_ioctl_enable_cap(vcpu, &cap);
 		break;
 	}
+	case KVM_SET_VCPU_RT_PRIO_HC: {
+		struct kvm_vcpu_rt_prio rt_prio;
+
+		r = -EFAULT;
+		if (copy_from_user(&rt_prio, argp, sizeof(rt_prio)))
+			goto out;
+		r = kvm_vcpu_ioctl_set_rt_prio_hc(vcpu, &rt_prio);
+		break;
+	}
 	default:
 		r = -EINVAL;
 	}
Index: kvm.fifopriohc-submit/include/uapi/linux/kvm.h
===================================================================
--- kvm.fifopriohc-submit.orig/include/uapi/linux/kvm.h
+++ kvm.fifopriohc-submit/include/uapi/linux/kvm.h
@@ -929,6 +929,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_PPC_SMT_POSSIBLE 147
 #define KVM_CAP_HYPERV_SYNIC2 148
 #define KVM_CAP_HYPERV_VP_INDEX 149
+#define KVM_CAP_VCPU_RT_PRIO_HC 150
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -1355,6 +1356,8 @@ struct kvm_s390_ucas_mapping {
 /* Available with KVM_CAP_S390_CMMA_MIGRATION */
 #define KVM_S390_GET_CMMA_BITS      _IOWR(KVMIO, 0xb8, struct kvm_s390_cmma_log)
 #define KVM_S390_SET_CMMA_BITS      _IOW(KVMIO, 0xb9, struct kvm_s390_cmma_log)
+/* Available with KVM_CAP_VCPU_RT_PRIO_HC */
+#define KVM_SET_VCPU_RT_PRIO_HC	_IOW(KVMIO, 0xba, struct kvm_vcpu_rt_prio)
 
 #define KVM_DEV_ASSIGN_ENABLE_IOMMU	(1 << 0)
 #define KVM_DEV_ASSIGN_PCI_2_3		(1 << 1)
Index: kvm.fifopriohc-submit/arch/x86/include/uapi/asm/kvm.h
===================================================================
--- kvm.fifopriohc-submit.orig/arch/x86/include/uapi/asm/kvm.h
+++ kvm.fifopriohc-submit/arch/x86/include/uapi/asm/kvm.h
@@ -353,6 +353,11 @@ struct kvm_xcrs {
 	__u64 padding[16];
 };
 
+struct kvm_vcpu_rt_prio {
+	__u32 enabled;
+	__u32 sched_priority;
+};
+
 /* definition of registers in kvm_run */
 struct kvm_sync_regs {
 };

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [patch 2/3] KVM: x86: KVM_HC_RT_PRIO hypercall (host-side)
  2017-09-21 11:38 [patch 0/3] KVM KVM_HC_RT_PRIO hypercall support Marcelo Tosatti
  2017-09-21 11:38 ` [patch 1/3] KVM: x86: add per-vcpu option to set guest vcpu -RT priority Marcelo Tosatti
@ 2017-09-21 11:38 ` Marcelo Tosatti
  2017-09-21 13:32   ` Konrad Rzeszutek Wilk
  2017-09-21 11:38 ` [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall Marcelo Tosatti
  2017-09-21 17:45 ` [patch 0/3] KVM KVM_HC_RT_PRIO hypercall support Jan Kiszka
  3 siblings, 1 reply; 52+ messages in thread
From: Marcelo Tosatti @ 2017-09-21 11:38 UTC (permalink / raw)
  To: kvm, linux-kernel; +Cc: Marcelo Tosatti

[-- Attachment #1: 02-kvm-setfifoprio-hc --]
[-- Type: text/plain, Size: 5963 bytes --]

Executing guest vcpu-0 with FIFO:1 priority is necessary to
deal with the following situation:

VCPU-0 (housekeeping VCPU)		VCPU-1 (realtime VCPU)

raw_spin_lock(A)
interrupted, schedule task T-1		raw_spin_lock(A) (spin)

raw_spin_unlock(A)

However, certain operations must be able to interrupt guest vcpu-0
(see trace below), which a constant FIFO:1 priority prevents.

To fix this issue, change guest vcpu-0 to FIFO priority only inside
spinlock critical sections (see patch).

Hang trace
==========

Without FIFO priority:

qemu-kvm-6705  [002] ....1.. 767785.648964: kvm_exit: reason IO_INSTRUCTION rip 0xe8fe info 1f00039 0
qemu-kvm-6705  [002] ....1.. 767785.648965: kvm_exit: reason IO_INSTRUCTION rip 0xe911 info 3f60008 0
qemu-kvm-6705  [002] ....1.. 767785.648968: kvm_exit: reason IO_INSTRUCTION rip 0x8984 info 608000b 0
qemu-kvm-6705  [002] ....1.. 767785.648971: kvm_exit: reason IO_INSTRUCTION rip 0xb313 info 1f70008 0
qemu-kvm-6705  [002] ....1.. 767785.648974: kvm_exit: reason IO_INSTRUCTION rip 0xb514 info 3f60000 0
qemu-kvm-6705  [002] ....1.. 767785.648977: kvm_exit: reason PENDING_INTERRUPT rip 0x8052 info 0 0
qemu-kvm-6705  [002] ....1.. 767785.648980: kvm_exit: reason IO_INSTRUCTION rip 0xeee6 info 200040 0
qemu-kvm-6705  [002] ....1.. 767785.648999: kvm_exit: reason EPT_MISCONFIG rip 0x2120 info 0 0

With FIFO priority:

qemu-kvm-7636  [002] ....1.. 768218.205065: kvm_exit: reason IO_INSTRUCTION rip 0xb313 info 1f70008 0
qemu-kvm-7636  [002] ....1.. 768218.205068: kvm_exit: reason IO_INSTRUCTION rip 0x8984 info 608000b 0
qemu-kvm-7636  [002] ....1.. 768218.205071: kvm_exit: reason IO_INSTRUCTION rip 0xb313 info 1f70008 0
qemu-kvm-7636  [002] ....1.. 768218.205074: kvm_exit: reason IO_INSTRUCTION rip 0x8984 info 608000b 0
qemu-kvm-7636  [002] ....1.. 768218.205077: kvm_exit: reason IO_INSTRUCTION rip 0xb313 info 1f70008 0
..

Performance numbers (kernel compilation with make -j2)
======================================================

With hypercall: 4:40.  (make -j2)
Without hypercall: 3:38.  (make -j2)

Note that for NFV workloads spinlock performance is not relevant,
since DPDK should not enter the kernel (and housekeeping vcpu
performance is far from a key factor).

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---
 Documentation/virtual/kvm/hypercalls.txt |   22 +++++++++++++++
 arch/x86/kvm/x86.c                       |   43 +++++++++++++++++++++++++++++++
 include/uapi/linux/kvm_para.h            |    2 +
 3 files changed, 67 insertions(+)

Index: kvm.fifopriohc-submit/Documentation/virtual/kvm/hypercalls.txt
===================================================================
--- kvm.fifopriohc-submit.orig/Documentation/virtual/kvm/hypercalls.txt
+++ kvm.fifopriohc-submit/Documentation/virtual/kvm/hypercalls.txt
@@ -121,3 +121,25 @@ compute the CLOCK_REALTIME for its clock
 
 Returns KVM_EOPNOTSUPP if the host does not use TSC clocksource,
 or if clock type is different than KVM_CLOCK_PAIRING_WALLCLOCK.
+
+6. KVM_HC_RT_PRIO
+------------------------
+Architecture: x86
+Status: active
+Purpose: Hypercall used to change the QEMU vcpu thread's -RT priority.
+
+Usage: Having a pCPU shared between a FIFO:1 vcpu and a QEMU emulator
+thread can be problematic: if the vcpu busy-spins on memory that it
+expects the emulator thread to write to, the result is a hang
+(because the FIFO:1 vcpu is never scheduled out in favor of the QEMU
+emulator thread).
+So this hypercall is supposed to be called by the guest when
+the OS knows it is not going to busy-spin on memory that is
+written by the emulator thread, as described above.
+
+a0: bit 0 is the enable bit. If 0, it indicates that SCHED_OTHER
+priority should be set for the vcpu; if 1, it indicates SCHED_FIFO
+priority (the actual value for the FIFO priority is decided
+by the host).
+
+
Index: kvm.fifopriohc-submit/include/uapi/linux/kvm_para.h
===================================================================
--- kvm.fifopriohc-submit.orig/include/uapi/linux/kvm_para.h
+++ kvm.fifopriohc-submit/include/uapi/linux/kvm_para.h
@@ -15,6 +15,7 @@
 #define KVM_E2BIG		E2BIG
 #define KVM_EPERM		EPERM
 #define KVM_EOPNOTSUPP		95
+#define KVM_EINVAL		EINVAL
 
 #define KVM_HC_VAPIC_POLL_IRQ		1
 #define KVM_HC_MMU_OP			2
@@ -25,6 +26,7 @@
 #define KVM_HC_MIPS_EXIT_VM		7
 #define KVM_HC_MIPS_CONSOLE_OUTPUT	8
 #define KVM_HC_CLOCK_PAIRING		9
+#define KVM_HC_RT_PRIO			10
 
 /*
  * hypercalls use architecture specific
Index: kvm.fifopriohc-submit/arch/x86/kvm/x86.c
===================================================================
--- kvm.fifopriohc-submit.orig/arch/x86/kvm/x86.c
+++ kvm.fifopriohc-submit/arch/x86/kvm/x86.c
@@ -66,6 +66,8 @@
 #include <asm/pvclock.h>
 #include <asm/div64.h>
 #include <asm/irq_remapping.h>
+#include <uapi/linux/sched/types.h>
+#include <uapi/linux/sched.h>
 
 #define CREATE_TRACE_POINTS
 #include "trace.h"
@@ -6261,6 +6263,44 @@ void kvm_vcpu_deactivate_apicv(struct kv
 	kvm_x86_ops->refresh_apicv_exec_ctrl(vcpu);
 }
 
+static int convert_to_kvm_errcode(int error)
+{
+	switch (error) {
+	case -EPERM:
+		return -KVM_EPERM;
+	case -EINVAL:
+	default:
+		return -KVM_EINVAL;
+	}
+}
+
+int kvm_pv_rt_prio(struct kvm_vcpu *vcpu, unsigned long a0)
+{
+	int ret;
+	bool enable;
+	struct sched_param param;
+
+	memset(&param, 0, sizeof(struct sched_param));
+	param.sched_priority = vcpu->arch.rt_sched_priority;
+
+	enable = a0 & 0x1;
+
+	if (vcpu->arch.enable_rt_prio_hc == false)
+		return -KVM_EPERM;
+
+	if (enable) {
+		ret = sched_setscheduler(current, SCHED_FIFO, &param);
+	} else {
+		param.sched_priority = 0;
+		ret = sched_setscheduler(current, SCHED_NORMAL, &param);
+	}
+
+	if (ret)
+		ret = convert_to_kvm_errcode(ret);
+
+	return ret;
+}
+
 int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
 {
 	unsigned long nr, a0, a1, a2, a3, ret;
@@ -6306,6 +6346,9 @@ int kvm_emulate_hypercall(struct kvm_vcp
 		ret = kvm_pv_clock_pairing(vcpu, a0, a1);
 		break;
 #endif
+	case KVM_HC_RT_PRIO:
+		ret = kvm_pv_rt_prio(vcpu, a0);
+		break;
 	default:
 		ret = -KVM_ENOSYS;
 		break;

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall
  2017-09-21 11:38 [patch 0/3] KVM KVM_HC_RT_PRIO hypercall support Marcelo Tosatti
  2017-09-21 11:38 ` [patch 1/3] KVM: x86: add per-vcpu option to set guest vcpu -RT priority Marcelo Tosatti
  2017-09-21 11:38 ` [patch 2/3] KVM: x86: KVM_HC_RT_PRIO hypercall (host-side) Marcelo Tosatti
@ 2017-09-21 11:38 ` Marcelo Tosatti
  2017-09-21 13:36   ` Konrad Rzeszutek Wilk
  2017-09-21 17:45 ` [patch 0/3] KVM KVM_HC_RT_PRIO hypercall support Jan Kiszka
  3 siblings, 1 reply; 52+ messages in thread
From: Marcelo Tosatti @ 2017-09-21 11:38 UTC (permalink / raw)
  To: kvm, linux-kernel; +Cc: Marcelo Tosatti

[-- Attachment #1: 03-kvm-hv-fifo-guest-support --]
[-- Type: text/plain, Size: 3173 bytes --]

Add hypercalls to spinlock lock/unlock to set/unset FIFO priority
for the vcpu, protected by a static branch to avoid a performance
impact on normal kernels.

Enable the option with the "kvmfifohc" kernel command line parameter
(disabled by default).

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---
 arch/x86/kernel/kvm.c            |   31 +++++++++++++++++++++++++++++++
 include/linux/spinlock.h         |    2 +-
 include/linux/spinlock_api_smp.h |   17 +++++++++++++++++
 3 files changed, 49 insertions(+), 1 deletion(-)

Index: kvm.fifopriohc-submit/arch/x86/kernel/kvm.c
===================================================================
--- kvm.fifopriohc-submit.orig/arch/x86/kernel/kvm.c
+++ kvm.fifopriohc-submit/arch/x86/kernel/kvm.c
@@ -37,6 +37,7 @@
 #include <linux/debugfs.h>
 #include <linux/nmi.h>
 #include <linux/swait.h>
+#include <linux/static_key.h>
 #include <asm/timer.h>
 #include <asm/cpu.h>
 #include <asm/traps.h>
@@ -321,6 +322,36 @@ static notrace void kvm_guest_apic_eoi_w
 	apic->native_eoi_write(APIC_EOI, APIC_EOI_ACK);
 }
 
+static int kvmfifohc;
+
+static int parse_kvmfifohc(char *arg)
+{
+	kvmfifohc = 1;
+	return 0;
+}
+
+early_param("kvmfifohc", parse_kvmfifohc);
+
+DEFINE_STATIC_KEY_FALSE(kvm_fifo_hc_key);
+
+static void kvm_init_fifo_hc(void)
+{
+	long ret;
+
+	ret = kvm_hypercall1(KVM_HC_RT_PRIO, 0);
+
+	if (ret == 0 && kvmfifohc == 1)
+		static_branch_enable(&kvm_fifo_hc_key);
+}
+
+static __init int kvmguest_late_init(void)
+{
+	kvm_init_fifo_hc();
+	return 0;
+}
+
+late_initcall(kvmguest_late_init);
+
 static void kvm_guest_cpu_init(void)
 {
 	if (!kvm_para_available())
Index: kvm.fifopriohc-submit/include/linux/spinlock_api_smp.h
===================================================================
--- kvm.fifopriohc-submit.orig/include/linux/spinlock_api_smp.h
+++ kvm.fifopriohc-submit/include/linux/spinlock_api_smp.h
@@ -136,11 +136,28 @@ static inline void __raw_spin_lock_bh(ra
 	LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
 }
 
+#ifdef CONFIG_KVM_GUEST
+DECLARE_STATIC_KEY_FALSE(kvm_fifo_hc_key);
+#endif
+
 static inline void __raw_spin_lock(raw_spinlock_t *lock)
 {
 	preempt_disable();
+
+#if defined(CONFIG_KVM_GUEST) && defined(CONFIG_SMP)
+	/* enable FIFO priority */
+	if (static_branch_unlikely(&kvm_fifo_hc_key))
+		kvm_hypercall1(KVM_HC_RT_PRIO, 0x1);
+#endif
+
 	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
 	LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
+
+#if defined(CONFIG_KVM_GUEST) && defined(CONFIG_SMP)
+	/* disable FIFO priority */
+	if (static_branch_unlikely(&kvm_fifo_hc_key))
+		kvm_hypercall1(KVM_HC_RT_PRIO, 0);
+#endif
 }
 
 #endif /* !CONFIG_GENERIC_LOCKBREAK || CONFIG_DEBUG_LOCK_ALLOC */
Index: kvm.fifopriohc-submit/include/linux/spinlock.h
===================================================================
--- kvm.fifopriohc-submit.orig/include/linux/spinlock.h
+++ kvm.fifopriohc-submit/include/linux/spinlock.h
@@ -56,7 +56,7 @@
 #include <linux/stringify.h>
 #include <linux/bottom_half.h>
 #include <asm/barrier.h>
-
+#include <uapi/linux/kvm_para.h>
 
 /*
  * Must define these before including other files, inline functions need them

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch 2/3] KVM: x86: KVM_HC_RT_PRIO hypercall (host-side)
  2017-09-21 11:38 ` [patch 2/3] KVM: x86: KVM_HC_RT_PRIO hypercall (host-side) Marcelo Tosatti
@ 2017-09-21 13:32   ` Konrad Rzeszutek Wilk
  2017-09-21 13:49     ` Paolo Bonzini
  0 siblings, 1 reply; 52+ messages in thread
From: Konrad Rzeszutek Wilk @ 2017-09-21 13:32 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm, linux-kernel

On Thu, Sep 21, 2017 at 08:38:37AM -0300, Marcelo Tosatti wrote:
> When executing guest vcpu-0 with FIFO:1 priority, which is necessary to 
> deal with the following situation:
> 
> VCPU-0 (housekeeping VCPU)		VCPU-1 (realtime VCPU)
> 
> raw_spin_lock(A)
> interrupted, schedule task T-1		raw_spin_lock(A) (spin)
> 
> raw_spin_unlock(A)
> 
> Certain operations must interrupt guest vcpu-0 (see trace below).
> 
> To fix this issue, only change guest vcpu-0 to FIFO priority
> on spinlock critical sections (see patch).
> 
> Hang trace
> ==========
> 
> Without FIFO priority:
> 
> qemu-kvm-6705  [002] ....1.. 767785.648964: kvm_exit: reason IO_INSTRUCTION rip 0xe8fe info 1f00039 0
> qemu-kvm-6705  [002] ....1.. 767785.648965: kvm_exit: reason IO_INSTRUCTION rip 0xe911 info 3f60008 0
> qemu-kvm-6705  [002] ....1.. 767785.648968: kvm_exit: reason IO_INSTRUCTION rip 0x8984 info 608000b 0
> qemu-kvm-6705  [002] ....1.. 767785.648971: kvm_exit: reason IO_INSTRUCTION rip 0xb313 info 1f70008 0
> qemu-kvm-6705  [002] ....1.. 767785.648974: kvm_exit: reason IO_INSTRUCTION rip 0xb514 info 3f60000 0
> qemu-kvm-6705  [002] ....1.. 767785.648977: kvm_exit: reason PENDING_INTERRUPT rip 0x8052 info 0 0
> qemu-kvm-6705  [002] ....1.. 767785.648980: kvm_exit: reason IO_INSTRUCTION rip 0xeee6 info 200040 0
> qemu-kvm-6705  [002] ....1.. 767785.648999: kvm_exit: reason EPT_MISCONFIG rip 0x2120 info 0 0
> 
> With FIFO priority:
> 
> qemu-kvm-7636  [002] ....1.. 768218.205065: kvm_exit: reason IO_INSTRUCTION rip 0xb313 info 1f70008 0
> qemu-kvm-7636  [002] ....1.. 768218.205068: kvm_exit: reason IO_INSTRUCTION rip 0x8984 info 608000b 0
> qemu-kvm-7636  [002] ....1.. 768218.205071: kvm_exit: reason IO_INSTRUCTION rip 0xb313 info 1f70008 0
> qemu-kvm-7636  [002] ....1.. 768218.205074: kvm_exit: reason IO_INSTRUCTION rip 0x8984 info 608000b 0
> qemu-kvm-7636  [002] ....1.. 768218.205077: kvm_exit: reason IO_INSTRUCTION rip 0xb313 info 1f70008 0
> ..
> 
> Performance numbers (kernel compilation with make -j2)
> ======================================================
> 
> With hypercall: 4:40.  (make -j2)
> Without hypercall: 3:38.  (make -j2)
> 
> Note for NFV workloads spinlock performance is not relevant
> since DPDK should not enter the kernel (and housekeeping vcpu
> performance is far from a key factor).
> 
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> 
> ---
>  Documentation/virtual/kvm/hypercalls.txt |   22 +++++++++++++++
>  arch/x86/kvm/x86.c                       |   43 +++++++++++++++++++++++++++++++
>  include/uapi/linux/kvm_para.h            |    2 +
>  3 files changed, 67 insertions(+)
> 
> Index: kvm.fifopriohc-submit/Documentation/virtual/kvm/hypercalls.txt
> ===================================================================
> --- kvm.fifopriohc-submit.orig/Documentation/virtual/kvm/hypercalls.txt
> +++ kvm.fifopriohc-submit/Documentation/virtual/kvm/hypercalls.txt
> @@ -121,3 +121,25 @@ compute the CLOCK_REALTIME for its clock
>  
>  Returns KVM_EOPNOTSUPP if the host does not use TSC clocksource,
>  or if clock type is different than KVM_CLOCK_PAIRING_WALLCLOCK.
> +
> +6. KVM_HC_RT_PRIO
> +------------------------
> +Architecture: x86
> +Status: active
> +Purpose: Hypercall used to change qemu vcpu process -RT priority.

So the guest can change the scheduling decisions at the host level?
And the host HAS to follow it? There is no policy override for the
host to say - nah, not going to do it?

Also wouldn't the guest want to always be at SCHED_FIFO? [I am thinking
of a guest admin who wants all the CPU resources he can get]


> +
> +Usage: Having a pCPU share a FIFO:1 vcpu and a QEMU emulator thread
> +can be problematic: especially if the vcpu busy-spins on memory waiting
> +for the QEMU emulator thread to write to, which leads to a hang

.. Is the QEMU emulator writing to the spinlock memory?

> +(because the FIFO:1 vcpu is never scheduled in favor of QEMU emulator
> +thread).
> +So this hypercall is supposed to be called by the guest when
> +the OS knows its not going to busy spin on memory thats
> +written by the emulator thread as above.
> +
> +a0: bit 0 contains enable bit, if 0 indicates that SCHED_OTHER
> +priority should be set for vcpu, if 1 indicates SCHED_FIFO
> +priority (the actual value for FIFO priority is decided
> +by the host).
> +
> +
> Index: kvm.fifopriohc-submit/include/uapi/linux/kvm_para.h
> ===================================================================
> --- kvm.fifopriohc-submit.orig/include/uapi/linux/kvm_para.h
> +++ kvm.fifopriohc-submit/include/uapi/linux/kvm_para.h
> @@ -15,6 +15,7 @@
>  #define KVM_E2BIG		E2BIG
>  #define KVM_EPERM		EPERM
>  #define KVM_EOPNOTSUPP		95
> +#define KVM_EINVAL		EINVAL
>  
>  #define KVM_HC_VAPIC_POLL_IRQ		1
>  #define KVM_HC_MMU_OP			2
> @@ -25,6 +26,7 @@
>  #define KVM_HC_MIPS_EXIT_VM		7
>  #define KVM_HC_MIPS_CONSOLE_OUTPUT	8
>  #define KVM_HC_CLOCK_PAIRING		9
> +#define KVM_HC_RT_PRIO			10
>  
>  /*
>   * hypercalls use architecture specific
> Index: kvm.fifopriohc-submit/arch/x86/kvm/x86.c
> ===================================================================
> --- kvm.fifopriohc-submit.orig/arch/x86/kvm/x86.c
> +++ kvm.fifopriohc-submit/arch/x86/kvm/x86.c
> @@ -66,6 +66,8 @@
>  #include <asm/pvclock.h>
>  #include <asm/div64.h>
>  #include <asm/irq_remapping.h>
> +#include <uapi/linux/sched/types.h>
> +#include <uapi/linux/sched.h>
>  
>  #define CREATE_TRACE_POINTS
>  #include "trace.h"
> @@ -6261,6 +6263,44 @@ void kvm_vcpu_deactivate_apicv(struct kv
>  	kvm_x86_ops->refresh_apicv_exec_ctrl(vcpu);
>  }
>  
> +static int convert_to_kvm_errcode(int error)
> +{
> +	switch (error) {
> +	case -EPERM:
> +		return -KVM_EPERM;
> +	case -EINVAL:
> +	default:
> +		return -KVM_EINVAL;
> +	}
> +}
> +
> +int kvm_pv_rt_prio(struct kvm_vcpu *vcpu, unsigned long a0)
> +{
> +	int ret;
> +	bool enable;
> +	struct sched_param param;
> +
> +	memset(&param, 0, sizeof(struct sched_param));
> +	param.sched_priority = vcpu->arch.rt_sched_priority;
> +
> +	enable = a0 & 0x1;
> +
> +	if (vcpu->arch.enable_rt_prio_hc == false)
> +		return -KVM_EPERM;
> +
> +	if (enable) {
> +		ret = sched_setscheduler(current, SCHED_FIFO, &param);
> +	} else {
> +		param.sched_priority = 0;
> +		ret = sched_setscheduler(current, SCHED_NORMAL, &param);
> +	}
> +
> +	if (ret)
> +		ret = convert_to_kvm_errcode(ret);
> +
> +	return ret;
> +}
> +
>  int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
>  {
>  	unsigned long nr, a0, a1, a2, a3, ret;
> @@ -6306,6 +6346,9 @@ int kvm_emulate_hypercall(struct kvm_vcp
>  		ret = kvm_pv_clock_pairing(vcpu, a0, a1);
>  		break;
>  #endif
> +	case KVM_HC_RT_PRIO:
> +		ret = kvm_pv_rt_prio(vcpu, a0);
> +		break;
>  	default:
>  		ret = -KVM_ENOSYS;
>  		break;
> 
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall
  2017-09-21 11:38 ` [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall Marcelo Tosatti
@ 2017-09-21 13:36   ` Konrad Rzeszutek Wilk
  2017-09-21 14:06     ` Peter Zijlstra
  0 siblings, 1 reply; 52+ messages in thread
From: Konrad Rzeszutek Wilk @ 2017-09-21 13:36 UTC (permalink / raw)
  To: Marcelo Tosatti, peterz, mingo; +Cc: kvm, linux-kernel

On Thu, Sep 21, 2017 at 08:38:38AM -0300, Marcelo Tosatti wrote:
> Add hypercalls to spinlock/unlock to set/unset FIFO priority
> for the vcpu, protected by a static branch to avoid performance
> increase in the normal kernels.
> 
> Enable option by "kvmfifohc" kernel command line parameter (disabled
> by default).

Wouldn't be better if there was a global 'kvm=' which could have the
various overrides?

> 
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> 
> ---
>  arch/x86/kernel/kvm.c            |   31 +++++++++++++++++++++++++++++++
>  include/linux/spinlock.h         |    2 +-
>  include/linux/spinlock_api_smp.h |   17 +++++++++++++++++

Hm. It looks like you forgot to CC the maintainers:

$ scripts/get_maintainer.pl -f include/linux/spinlock.h
Peter Zijlstra <peterz@infradead.org> (maintainer:LOCKING PRIMITIVES)
Ingo Molnar <mingo@redhat.com> (maintainer:LOCKING PRIMITIVES)
linux-kernel@vger.kernel.org (open list:LOCKING PRIMITIVES)

Doing that for you.

>  3 files changed, 49 insertions(+), 1 deletion(-)
> 
> Index: kvm.fifopriohc-submit/arch/x86/kernel/kvm.c
> ===================================================================
> --- kvm.fifopriohc-submit.orig/arch/x86/kernel/kvm.c
> +++ kvm.fifopriohc-submit/arch/x86/kernel/kvm.c
> @@ -37,6 +37,7 @@
>  #include <linux/debugfs.h>
>  #include <linux/nmi.h>
>  #include <linux/swait.h>
> +#include <linux/static_key.h>
>  #include <asm/timer.h>
>  #include <asm/cpu.h>
>  #include <asm/traps.h>
> @@ -321,6 +322,36 @@ static notrace void kvm_guest_apic_eoi_w
>  	apic->native_eoi_write(APIC_EOI, APIC_EOI_ACK);
>  }
>  
> +static int kvmfifohc;
> +
> +static int parse_kvmfifohc(char *arg)
> +{
> +	kvmfifohc = 1;
> +	return 0;
> +}
> +
> +early_param("kvmfifohc", parse_kvmfifohc);
> +
> +DEFINE_STATIC_KEY_FALSE(kvm_fifo_hc_key);
> +
> +static void kvm_init_fifo_hc(void)
> +{
> +	long ret;
> +
> +	ret = kvm_hypercall1(KVM_HC_RT_PRIO, 0);
> +
> +	if (ret == 0 && kvmfifohc == 1)
> +		static_branch_enable(&kvm_fifo_hc_key);
> +}
> +
> +static __init int kvmguest_late_init(void)
> +{
> +	kvm_init_fifo_hc();
> +	return 0;
> +}
> +
> +late_initcall(kvmguest_late_init);
> +
>  static void kvm_guest_cpu_init(void)
>  {
>  	if (!kvm_para_available())
> Index: kvm.fifopriohc-submit/include/linux/spinlock_api_smp.h
> ===================================================================
> --- kvm.fifopriohc-submit.orig/include/linux/spinlock_api_smp.h
> +++ kvm.fifopriohc-submit/include/linux/spinlock_api_smp.h
> @@ -136,11 +136,28 @@ static inline void __raw_spin_lock_bh(ra
>  	LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
>  }
>  
> +#ifdef CONFIG_KVM_GUEST
> +DECLARE_STATIC_KEY_FALSE(kvm_fifo_hc_key);
> +#endif
> +
>  static inline void __raw_spin_lock(raw_spinlock_t *lock)
>  {
>  	preempt_disable();
> +
> +#if defined(CONFIG_KVM_GUEST) && defined(CONFIG_SMP)
> +	/* enable FIFO priority */
> +	if (static_branch_unlikely(&kvm_fifo_hc_key))
> +		kvm_hypercall1(KVM_HC_RT_PRIO, 0x1);
> +#endif

I am assuming the reason you choose not to wrap this in a pvops
or any other structure that is more of hypervisor agnostic is
that only KVM exposes this. But what if other hypervisors expose
something similar? Or some other mechanism similar to this?

> +
>  	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
>  	LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
> +
> +#if defined(CONFIG_KVM_GUEST) && defined(CONFIG_SMP)
> +	/* disable FIFO priority */
> +	if (static_branch_unlikely(&kvm_fifo_hc_key))
> +		kvm_hypercall1(KVM_HC_RT_PRIO, 0);
> +#endif
>  }
>  
>  #endif /* !CONFIG_GENERIC_LOCKBREAK || CONFIG_DEBUG_LOCK_ALLOC */
> Index: kvm.fifopriohc-submit/include/linux/spinlock.h
> ===================================================================
> --- kvm.fifopriohc-submit.orig/include/linux/spinlock.h
> +++ kvm.fifopriohc-submit/include/linux/spinlock.h
> @@ -56,7 +56,7 @@
>  #include <linux/stringify.h>
>  #include <linux/bottom_half.h>
>  #include <asm/barrier.h>
> -
> +#include <uapi/linux/kvm_para.h>
>  
>  /*
>   * Must define these before including other files, inline functions need them
> 
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch 2/3] KVM: x86: KVM_HC_RT_PRIO hypercall (host-side)
  2017-09-21 13:32   ` Konrad Rzeszutek Wilk
@ 2017-09-21 13:49     ` Paolo Bonzini
  2017-09-22  1:08       ` Marcelo Tosatti
  0 siblings, 1 reply; 52+ messages in thread
From: Paolo Bonzini @ 2017-09-21 13:49 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk, Marcelo Tosatti; +Cc: kvm, linux-kernel

On 21/09/2017 15:32, Konrad Rzeszutek Wilk wrote:
> So the guest can change the scheduling decisions at the host level?
> And the host HAS to follow it? There is no policy override for the
> host to say - nah, not going to do it?
> 
> Also wouldn't the guest want to always be at SCHED_FIFO? [I am thinking
> of a guest admin who wants all the CPU resources he can get]

Yeah, I do not understand why there should be a housekeeping VCPU that
is running at SCHED_NORMAL.  If it hurts, don't do it...

Paolo

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall
  2017-09-21 13:36   ` Konrad Rzeszutek Wilk
@ 2017-09-21 14:06     ` Peter Zijlstra
  2017-09-22  1:10       ` Marcelo Tosatti
  0 siblings, 1 reply; 52+ messages in thread
From: Peter Zijlstra @ 2017-09-21 14:06 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Marcelo Tosatti, mingo, kvm, linux-kernel, Thomas Gleixner

On Thu, Sep 21, 2017 at 09:36:53AM -0400, Konrad Rzeszutek Wilk wrote:
> On Thu, Sep 21, 2017 at 08:38:38AM -0300, Marcelo Tosatti wrote:
> > Add hypercalls to spinlock/unlock to set/unset FIFO priority
> > for the vcpu, protected by a static branch to avoid performance
> > increase in the normal kernels.
> > 
> > Enable option by "kvmfifohc" kernel command line parameter (disabled
> > by default).

WTF kind of fudge is this? Changelog completely fails to explain the
problem this would solve. Why are you doing insane things like this?


NAK!

> > Index: kvm.fifopriohc-submit/include/linux/spinlock_api_smp.h
> > ===================================================================
> > --- kvm.fifopriohc-submit.orig/include/linux/spinlock_api_smp.h
> > +++ kvm.fifopriohc-submit/include/linux/spinlock_api_smp.h
> > @@ -136,11 +136,28 @@ static inline void __raw_spin_lock_bh(ra
> >  	LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
> >  }
> >  
> > +#ifdef CONFIG_KVM_GUEST
> > +DECLARE_STATIC_KEY_FALSE(kvm_fifo_hc_key);
> > +#endif
> > +
> >  static inline void __raw_spin_lock(raw_spinlock_t *lock)
> >  {
> >  	preempt_disable();
> > +
> > +#if defined(CONFIG_KVM_GUEST) && defined(CONFIG_SMP)
> > +	/* enable FIFO priority */
> > +	if (static_branch_unlikely(&kvm_fifo_hc_key))
> > +		kvm_hypercall1(KVM_HC_RT_PRIO, 0x1);
> > +#endif
> > +
> >  	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
> >  	LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
> > +
> > +#if defined(CONFIG_KVM_GUEST) && defined(CONFIG_SMP)
> > +	/* disable FIFO priority */
> > +	if (static_branch_unlikely(&kvm_fifo_hc_key))
> > +		kvm_hypercall1(KVM_HC_RT_PRIO, 0);
> > +#endif
> >  }
> >  
> >  #endif /* !CONFIG_GENERIC_LOCKBREAK || CONFIG_DEBUG_LOCK_ALLOC */

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch 0/3] KVM KVM_HC_RT_PRIO hypercall support
  2017-09-21 11:38 [patch 0/3] KVM KVM_HC_RT_PRIO hypercall support Marcelo Tosatti
                   ` (2 preceding siblings ...)
  2017-09-21 11:38 ` [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall Marcelo Tosatti
@ 2017-09-21 17:45 ` Jan Kiszka
  2017-09-22  1:19   ` Marcelo Tosatti
  3 siblings, 1 reply; 52+ messages in thread
From: Jan Kiszka @ 2017-09-21 17:45 UTC (permalink / raw)
  To: Marcelo Tosatti, kvm, linux-kernel

On 2017-09-21 13:38, Marcelo Tosatti wrote:
> When executing guest vcpu-0 with FIFO:1 priority, which is necessary to
> deal with the following situation:
> 
> VCPU-0 (housekeeping VCPU)              VCPU-1 (realtime VCPU)
> 
> raw_spin_lock(A)
> interrupted, schedule task T-1          raw_spin_lock(A) (spin)
> 
> raw_spin_unlock(A)
> 
> Certain operations must interrupt guest vcpu-0 (see trace below).
> 
> To fix this issue, only change guest vcpu-0 to FIFO priority
> on spinlock critical sections (see patch).
> 
> Hang trace
> ==========
> 
> Without FIFO priority:
> 
> qemu-kvm-6705  [002] ....1.. 767785.648964: kvm_exit: reason IO_INSTRUCTION rip 0xe8fe info 1f00039 0
> qemu-kvm-6705  [002] ....1.. 767785.648965: kvm_exit: reason IO_INSTRUCTION rip 0xe911 info 3f60008 0
> qemu-kvm-6705  [002] ....1.. 767785.648968: kvm_exit: reason IO_INSTRUCTION rip 0x8984 info 608000b 0
> qemu-kvm-6705  [002] ....1.. 767785.648971: kvm_exit: reason IO_INSTRUCTION rip 0xb313 info 1f70008 0
> qemu-kvm-6705  [002] ....1.. 767785.648974: kvm_exit: reason IO_INSTRUCTION rip 0xb514 info 3f60000 0
> qemu-kvm-6705  [002] ....1.. 767785.648977: kvm_exit: reason PENDING_INTERRUPT rip 0x8052 info 0 0
> qemu-kvm-6705  [002] ....1.. 767785.648980: kvm_exit: reason IO_INSTRUCTION rip 0xeee6 info 200040 0
> qemu-kvm-6705  [002] ....1.. 767785.648999: kvm_exit: reason EPT_MISCONFIG rip 0x2120 info 0 0
> 
> With FIFO priority:
> 
> qemu-kvm-7636  [002] ....1.. 768218.205065: kvm_exit: reason IO_INSTRUCTION rip 0xb313 info 1f70008 0
> qemu-kvm-7636  [002] ....1.. 768218.205068: kvm_exit: reason IO_INSTRUCTION rip 0x8984 info 608000b 0
> qemu-kvm-7636  [002] ....1.. 768218.205071: kvm_exit: reason IO_INSTRUCTION rip 0xb313 info 1f70008 0
> qemu-kvm-7636  [002] ....1.. 768218.205074: kvm_exit: reason IO_INSTRUCTION rip 0x8984 info 608000b 0
> qemu-kvm-7636  [002] ....1.. 768218.205077: kvm_exit: reason IO_INSTRUCTION rip 0xb313 info 1f70008 0
> ..
> 
> Performance numbers (kernel compilation with make -j2)
> ======================================================
> 
> With hypercall: 4:40.  (make -j2)
> Without hypercall: 3:38.  (make -j2)
> 
> Note for NFV workloads spinlock performance is not relevant
> since DPDK should not enter the kernel (and housekeeping vcpu
> performance is far from a key factor).
> 
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> 

That sounds familiar, though not yet the same: :)

http://git.kiszka.org/?p=linux-kvm.git;a=shortlog;h=refs/heads/queues/paravirt-sched
(paper: http://lwn.net/images/conf/rtlws11/papers/proc/p18.pdf)

I suppose your goal is not to enable the host to follow the guest
scheduler priority completely but only to have prio-ceiling for such
short critical sections. Maybe still useful to think ahead about future
extensions when actually introducing such an interface.

But shouldn't there be some limits on the maximum prio the guest can select?

Jan

-- 
Siemens AG, Corporate Technology, CT RDA ITP SES-DE
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch 2/3] KVM: x86: KVM_HC_RT_PRIO hypercall (host-side)
  2017-09-21 13:49     ` Paolo Bonzini
@ 2017-09-22  1:08       ` Marcelo Tosatti
  2017-09-22  7:23         ` Paolo Bonzini
  0 siblings, 1 reply; 52+ messages in thread
From: Marcelo Tosatti @ 2017-09-22  1:08 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Konrad Rzeszutek Wilk, kvm, linux-kernel

On Thu, Sep 21, 2017 at 03:49:33PM +0200, Paolo Bonzini wrote:
> On 21/09/2017 15:32, Konrad Rzeszutek Wilk wrote:
> > So the guest can change the scheduling decisions at the host level?
> > And the host HAS to follow it? There is no policy override for the
> > host to say - nah, not going to do it?

In that case the host should not even configure the guest with this
option (this is QEMU's 'enable-rt-fifo-hc' option).

> > Also wouldn't the guest want to always be at SCHED_FIFO? [I am thinking
> > of a guest admin who wants all the CPU resources he can get]

No. Because in the following code, executed by the housekeeping vCPU
running at constant SCHED_FIFO priority:

	1. Start disk I/O.
	2. busy spin

With the emulator thread sharing the same pCPU as the housekeeping
vCPU, the emulator thread (which runs at SCHED_NORMAL) will never
be scheduled in place of the vcpu thread running at SCHED_FIFO.

This causes a hang.
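
For illustration only, the problematic guest-side pattern looks roughly
like this (hypothetical driver code, not taken from the series; "dev"
and its fields are made up):

	/*
	 * vcpu-0 kicks an emulated device and then busy-waits for a
	 * completion flag that only the QEMU emulator thread (SCHED_NORMAL,
	 * on the same pCPU) can set.  With vcpu-0 pinned at SCHED_FIFO the
	 * emulator thread never runs, so the loop below spins forever.
	 */
	writel(1, dev->doorbell);		/* 1. start disk I/O	*/
	while (!READ_ONCE(dev->completed))	/* 2. busy spin		*/
		cpu_relax();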

> Yeah, I do not understand why there should be a housekeeping VCPU that
> is running at SCHED_NORMAL.  If it hurts, don't do it...

Hope the explanation above makes sense (in fact, it was you who pointed
out that SCHED_FIFO should not be constant on the housekeeping vCPU
when it shares a pCPU with the emulator thread at SCHED_NORMAL).

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall
  2017-09-21 14:06     ` Peter Zijlstra
@ 2017-09-22  1:10       ` Marcelo Tosatti
  2017-09-22 10:00         ` Peter Zijlstra
  0 siblings, 1 reply; 52+ messages in thread
From: Marcelo Tosatti @ 2017-09-22  1:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel, Thomas Gleixner

On Thu, Sep 21, 2017 at 04:06:28PM +0200, Peter Zijlstra wrote:
> On Thu, Sep 21, 2017 at 09:36:53AM -0400, Konrad Rzeszutek Wilk wrote:
> > On Thu, Sep 21, 2017 at 08:38:38AM -0300, Marcelo Tosatti wrote:
> > > Add hypercalls to spinlock/unlock to set/unset FIFO priority
> > > for the vcpu, protected by a static branch to avoid performance
> > > increase in the normal kernels.
> > > 
> > > Enable option by "kvmfifohc" kernel command line parameter (disabled
> > > by default).
> 
> WTF kind of fudge is this? Changelog completely fails to explain the
> problem this would solve. Why are you doing insane things like this?
> 
> 
> NAK!

Copy&pasting from the initial message, please point out whether this
explanation makes sense (better solutions to this problem are welcome):


Executing guest vcpu-0 with FIFO:1 priority is necessary to
deal with the following situation:

VCPU-0 (housekeeping VCPU)              VCPU-1 (realtime VCPU)

raw_spin_lock(A)
interrupted, schedule task T-1          raw_spin_lock(A) (spin)

raw_spin_unlock(A)

However, certain operations must be able to interrupt guest vcpu-0
(see trace below), which a constant FIFO:1 priority prevents.

To fix this issue, change guest vcpu-0 to FIFO priority only inside
spinlock critical sections (see patch).

Hang trace
==========

Without FIFO priority:

qemu-kvm-6705  [002] ....1.. 767785.648964: kvm_exit: reason IO_INSTRUCTION rip 0xe8fe info 1f00039 0
qemu-kvm-6705  [002] ....1.. 767785.648965: kvm_exit: reason IO_INSTRUCTION rip 0xe911 info 3f60008 0
qemu-kvm-6705  [002] ....1.. 767785.648968: kvm_exit: reason IO_INSTRUCTION rip 0x8984 info 608000b 0
qemu-kvm-6705  [002] ....1.. 767785.648971: kvm_exit: reason IO_INSTRUCTION rip 0xb313 info 1f70008 0
qemu-kvm-6705  [002] ....1.. 767785.648974: kvm_exit: reason IO_INSTRUCTION rip 0xb514 info 3f60000 0
qemu-kvm-6705  [002] ....1.. 767785.648977: kvm_exit: reason PENDING_INTERRUPT rip 0x8052 info 0 0
qemu-kvm-6705  [002] ....1.. 767785.648980: kvm_exit: reason IO_INSTRUCTION rip 0xeee6 info 200040 0
qemu-kvm-6705  [002] ....1.. 767785.648999: kvm_exit: reason EPT_MISCONFIG rip 0x2120 info 0 0

With FIFO priority:

qemu-kvm-7636  [002] ....1.. 768218.205065: kvm_exit: reason IO_INSTRUCTION rip 0xb313 info 1f70008 0
qemu-kvm-7636  [002] ....1.. 768218.205068: kvm_exit: reason IO_INSTRUCTION rip 0x8984 info 608000b 0
qemu-kvm-7636  [002] ....1.. 768218.205071: kvm_exit: reason IO_INSTRUCTION rip 0xb313 info 1f70008 0
qemu-kvm-7636  [002] ....1.. 768218.205074: kvm_exit: reason IO_INSTRUCTION rip 0x8984 info 608000b 0
qemu-kvm-7636  [002] ....1.. 768218.205077: kvm_exit: reason IO_INSTRUCTION rip 0xb313 info 1f70008 0
..

Performance numbers (kernel compilation with make -j2)
======================================================

With hypercall: 4:40.  (make -j2)
Without hypercall: 3:38.  (make -j2)

Note that for NFV workloads spinlock performance is not relevant,
since DPDK should not enter the kernel (and housekeeping vcpu
performance is far from a key factor).


> 
> > > Index: kvm.fifopriohc-submit/include/linux/spinlock_api_smp.h
> > > ===================================================================
> > > --- kvm.fifopriohc-submit.orig/include/linux/spinlock_api_smp.h
> > > +++ kvm.fifopriohc-submit/include/linux/spinlock_api_smp.h
> > > @@ -136,11 +136,28 @@ static inline void __raw_spin_lock_bh(ra
> > >  	LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
> > >  }
> > >  
> > > +#ifdef CONFIG_KVM_GUEST
> > > +DECLARE_STATIC_KEY_FALSE(kvm_fifo_hc_key);
> > > +#endif
> > > +
> > >  static inline void __raw_spin_lock(raw_spinlock_t *lock)
> > >  {
> > >  	preempt_disable();
> > > +
> > > +#if defined(CONFIG_KVM_GUEST) && defined(CONFIG_SMP)
> > > +	/* enable FIFO priority */
> > > +	if (static_branch_unlikely(&kvm_fifo_hc_key))
> > > +		kvm_hypercall1(KVM_HC_RT_PRIO, 0x1);
> > > +#endif
> > > +
> > >  	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
> > >  	LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
> > > +
> > > +#if defined(CONFIG_KVM_GUEST) && defined(CONFIG_SMP)
> > > +	/* disable FIFO priority */
> > > +	if (static_branch_unlikely(&kvm_fifo_hc_key))
> > > +		kvm_hypercall1(KVM_HC_RT_PRIO, 0);
> > > +#endif
> > >  }
> > >  
> > >  #endif /* !CONFIG_GENERIC_LOCKBREAK || CONFIG_DEBUG_LOCK_ALLOC */

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch 0/3] KVM KVM_HC_RT_PRIO hypercall support
  2017-09-21 17:45 ` [patch 0/3] KVM KVM_HC_RT_PRIO hypercall support Jan Kiszka
@ 2017-09-22  1:19   ` Marcelo Tosatti
  2017-09-22  6:23     ` Jan Kiszka
  0 siblings, 1 reply; 52+ messages in thread
From: Marcelo Tosatti @ 2017-09-22  1:19 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: kvm, linux-kernel

On Thu, Sep 21, 2017 at 07:45:32PM +0200, Jan Kiszka wrote:
> On 2017-09-21 13:38, Marcelo Tosatti wrote:
> > When executing guest vcpu-0 with FIFO:1 priority, which is necessary to
> > deal with the following situation:
> > 
> > VCPU-0 (housekeeping VCPU)              VCPU-1 (realtime VCPU)
> > 
> > raw_spin_lock(A)
> > interrupted, schedule task T-1          raw_spin_lock(A) (spin)
> > 
> > raw_spin_unlock(A)
> > 
> > Certain operations must interrupt guest vcpu-0 (see trace below).
> > 
> > To fix this issue, only change guest vcpu-0 to FIFO priority
> > on spinlock critical sections (see patch).
> > 
> > Hang trace
> > ==========
> > 
> > Without FIFO priority:
> > 
> > qemu-kvm-6705  [002] ....1.. 767785.648964: kvm_exit: reason IO_INSTRUCTION rip 0xe8fe info 1f00039 0
> > qemu-kvm-6705  [002] ....1.. 767785.648965: kvm_exit: reason IO_INSTRUCTION rip 0xe911 info 3f60008 0
> > qemu-kvm-6705  [002] ....1.. 767785.648968: kvm_exit: reason IO_INSTRUCTION rip 0x8984 info 608000b 0
> > qemu-kvm-6705  [002] ....1.. 767785.648971: kvm_exit: reason IO_INSTRUCTION rip 0xb313 info 1f70008 0
> > qemu-kvm-6705  [002] ....1.. 767785.648974: kvm_exit: reason IO_INSTRUCTION rip 0xb514 info 3f60000 0
> > qemu-kvm-6705  [002] ....1.. 767785.648977: kvm_exit: reason PENDING_INTERRUPT rip 0x8052 info 0 0
> > qemu-kvm-6705  [002] ....1.. 767785.648980: kvm_exit: reason IO_INSTRUCTION rip 0xeee6 info 200040 0
> > qemu-kvm-6705  [002] ....1.. 767785.648999: kvm_exit: reason EPT_MISCONFIG rip 0x2120 info 0 0
> > 
> > With FIFO priority:
> > 
> > qemu-kvm-7636  [002] ....1.. 768218.205065: kvm_exit: reason IO_INSTRUCTION rip 0xb313 info 1f70008 0
> > qemu-kvm-7636  [002] ....1.. 768218.205068: kvm_exit: reason IO_INSTRUCTION rip 0x8984 info 608000b 0
> > qemu-kvm-7636  [002] ....1.. 768218.205071: kvm_exit: reason IO_INSTRUCTION rip 0xb313 info 1f70008 0
> > qemu-kvm-7636  [002] ....1.. 768218.205074: kvm_exit: reason IO_INSTRUCTION rip 0x8984 info 608000b 0
> > qemu-kvm-7636  [002] ....1.. 768218.205077: kvm_exit: reason IO_INSTRUCTION rip 0xb313 info 1f70008 0
> > ..
> > 
> > Performance numbers (kernel compilation with make -j2)
> > ======================================================
> > 
> > With hypercall: 4:40.  (make -j2)
> > Without hypercall: 3:38.  (make -j2)
> > 
> > Note for NFV workloads spinlock performance is not relevant
> > since DPDK should not enter the kernel (and housekeeping vcpu
> > performance is far from a key factor).
> > 
> > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> > 
> 
> That sounds familiar, though not yet the same: :)
> 
> http://git.kiszka.org/?p=linux-kvm.git;a=shortlog;h=refs/heads/queues/paravirt-sched
> (paper: http://lwn.net/images/conf/rtlws11/papers/proc/p18.pdf)
> 
> I suppose your goal is not to enable the host to follow the guest
> scheduler priority completely but only to have prio-ceiling for such
> short critical sections. Maybe still useful to think ahead about future
> extensions when actually introducing such an interface.

Hi Jan!

Hum... I'll take a look at your interface/paper and get back to you.

> But shouldn't there be some limits on the maximum prio the guest can select?

The SCHED_FIFO prio is fixed, selectable when QEMU starts. Do you
envision any other use case than a fixed priority value selectable
at QEMU initialization?

Thanks

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch 0/3] KVM KVM_HC_RT_PRIO hypercall support
  2017-09-22  1:19   ` Marcelo Tosatti
@ 2017-09-22  6:23     ` Jan Kiszka
  2017-09-26 23:59       ` Marcelo Tosatti
  0 siblings, 1 reply; 52+ messages in thread
From: Jan Kiszka @ 2017-09-22  6:23 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm, linux-kernel

On 2017-09-22 03:19, Marcelo Tosatti wrote:
> On Thu, Sep 21, 2017 at 07:45:32PM +0200, Jan Kiszka wrote:
>> On 2017-09-21 13:38, Marcelo Tosatti wrote:
>>> When executing guest vcpu-0 with FIFO:1 priority, which is necessary to
>>> deal with the following situation:
>>>
>>> VCPU-0 (housekeeping VCPU)              VCPU-1 (realtime VCPU)
>>>
>>> raw_spin_lock(A)
>>> interrupted, schedule task T-1          raw_spin_lock(A) (spin)
>>>
>>> raw_spin_unlock(A)
>>>
>>> Certain operations must interrupt guest vcpu-0 (see trace below).
>>>
>>> To fix this issue, only change guest vcpu-0 to FIFO priority
>>> on spinlock critical sections (see patch).
>>>
>>> Hang trace
>>> ==========
>>>
>>> Without FIFO priority:
>>>
>>> qemu-kvm-6705  [002] ....1.. 767785.648964: kvm_exit: reason IO_INSTRUCTION rip 0xe8fe info 1f00039 0
>>> qemu-kvm-6705  [002] ....1.. 767785.648965: kvm_exit: reason IO_INSTRUCTION rip 0xe911 info 3f60008 0
>>> qemu-kvm-6705  [002] ....1.. 767785.648968: kvm_exit: reason IO_INSTRUCTION rip 0x8984 info 608000b 0
>>> qemu-kvm-6705  [002] ....1.. 767785.648971: kvm_exit: reason IO_INSTRUCTION rip 0xb313 info 1f70008 0
>>> qemu-kvm-6705  [002] ....1.. 767785.648974: kvm_exit: reason IO_INSTRUCTION rip 0xb514 info 3f60000 0
>>> qemu-kvm-6705  [002] ....1.. 767785.648977: kvm_exit: reason PENDING_INTERRUPT rip 0x8052 info 0 0
>>> qemu-kvm-6705  [002] ....1.. 767785.648980: kvm_exit: reason IO_INSTRUCTION rip 0xeee6 info 200040 0
>>> qemu-kvm-6705  [002] ....1.. 767785.648999: kvm_exit: reason EPT_MISCONFIG rip 0x2120 info 0 0
>>>
>>> With FIFO priority:
>>>
>>> qemu-kvm-7636  [002] ....1.. 768218.205065: kvm_exit: reason IO_INSTRUCTION rip 0xb313 info 1f70008 0
>>> qemu-kvm-7636  [002] ....1.. 768218.205068: kvm_exit: reason IO_INSTRUCTION rip 0x8984 info 608000b 0
>>> qemu-kvm-7636  [002] ....1.. 768218.205071: kvm_exit: reason IO_INSTRUCTION rip 0xb313 info 1f70008 0
>>> qemu-kvm-7636  [002] ....1.. 768218.205074: kvm_exit: reason IO_INSTRUCTION rip 0x8984 info 608000b 0
>>> qemu-kvm-7636  [002] ....1.. 768218.205077: kvm_exit: reason IO_INSTRUCTION rip 0xb313 info 1f70008 0
>>> ..
>>>
>>> Performance numbers (kernel compilation with make -j2)
>>> ======================================================
>>>
>>> With hypercall: 4:40.  (make -j2)
>>> Without hypercall: 3:38.  (make -j2)
>>>
>>> Note for NFV workloads spinlock performance is not relevant
>>> since DPDK should not enter the kernel (and housekeeping vcpu
>>> performance is far from a key factor).
>>>
>>> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
>>>
>>
>> That sounds familiar, though not yet the same: :)
>>
>> http://git.kiszka.org/?p=linux-kvm.git;a=shortlog;h=refs/heads/queues/paravirt-sched
>> (paper: http://lwn.net/images/conf/rtlws11/papers/proc/p18.pdf)
>>
>> I suppose your goal is not to enable the host to follow the guest
>> scheduler priority completely but only to have prio-ceiling for such
>> short critical sections. Maybe still useful to think ahead about future
>> extensions when actually introducing such an interface.
> 
> Hi Jan!
> 
> Hum... I'll take a look at your interface/paper and get back to you.
> 
>> But shouldn't there be some limits on the maximum prio the guest can select?
> 
> The SCHED_FIFO prio is fixed, selectable when QEMU starts. Do you
> envision any other use case than a fixed priority value selectable
> at QEMU initialization?

Oh, indeed, this is a pure prio-ceiling variant with host-defined
ceiling value.

But it's very inefficient to use a hypercall for entering and leaving
each and every section. I would strongly recommend using a lazy scheme
where the guest writes the desired state into a shared memory page, and
the host only evaluates that prior to taking a scheduling decision, or
at least only on real vmexits. We're using such a scheme successfully to
accelerate the fast path of prio-ceiling for pthread mutexes in the
Xenomai real-time extension.
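
Purely as an illustration of what I mean (the shared page layout and
function names below are made up, this is not an existing interface):

	/* One page shared between guest and host. */
	struct prio_shared {
		__u32 want_fifo;	/* guest stores 1 to request boosting */
	};

	/* Guest side: the fast path is a plain store, no hypercall. */
	static inline void guest_set_prio(struct prio_shared *sp, int boost)
	{
		WRITE_ONCE(sp->want_fifo, boost);
	}

	/*
	 * Host side: evaluated lazily, e.g. only on a real vmexit or
	 * right before a scheduling decision.
	 */
	static void host_apply_prio(struct kvm_vcpu *vcpu, struct prio_shared *sp)
	{
		struct sched_param param = { };

		if (READ_ONCE(sp->want_fifo)) {
			param.sched_priority = vcpu->arch.rt_sched_priority;
			sched_setscheduler(current, SCHED_FIFO, &param);
		} else {
			sched_setscheduler(current, SCHED_NORMAL, &param);
		}
	}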

Jan

-- 
Siemens AG, Corporate Technology, CT RDA ITP SES-DE
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch 2/3] KVM: x86: KVM_HC_RT_PRIO hypercall (host-side)
  2017-09-22  1:08       ` Marcelo Tosatti
@ 2017-09-22  7:23         ` Paolo Bonzini
  2017-09-22 12:24           ` Marcelo Tosatti
  0 siblings, 1 reply; 52+ messages in thread
From: Paolo Bonzini @ 2017-09-22  7:23 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Konrad Rzeszutek Wilk, kvm, linux-kernel

On 22/09/2017 03:08, Marcelo Tosatti wrote:
> On Thu, Sep 21, 2017 at 03:49:33PM +0200, Paolo Bonzini wrote:
>> On 21/09/2017 15:32, Konrad Rzeszutek Wilk wrote:
>>> So the guest can change the scheduling decisions at the host level?
>>> And the host HAS to follow it? There is no policy override for the
>>> host to say - nah, not going to do it?
> 
> In that case the host should not even configure the guest with this
> option (this is QEMU's 'enable-rt-fifo-hc' option).
> 
>>> Also wouldn't the guest want to always be at SCHED_FIFO? [I am thinking
>>> of a guest admin who wants all the CPU resources he can get]
> 
> No. Because in the following code, executed by the housekeeping vCPU
> running at constant SCHED_FIFO priority:
> 
> 	1. Start disk I/O.
> 	2. busy spin
> 
> With the emulator thread sharing the same pCPU with the housekeeping
> vCPU, the emulator thread (which runs at SCHED_NORMAL), will never
> be scheduled in in place of the vcpu thread at SCHED_FIFO.
> 
> This causes a hang.

But if the emulator thread can interrupt the housekeeping thread, the
emulator thread should also be SCHED_FIFO at higher priority; IIRC this
was in Jan's talk from a few years ago.

QEMU would also have to use PI mutexes (which is the main reason why
it's using QemuMutex instead of e.g. GMutex).
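
(For reference, a priority-inheriting pthread mutex is set up roughly
like this; illustration only, not actual QEMU code:)

	#include <pthread.h>

	pthread_mutex_t lock;
	pthread_mutexattr_t attr;

	pthread_mutexattr_init(&attr);
	/*
	 * With PTHREAD_PRIO_INHERIT, a SCHED_NORMAL holder temporarily
	 * inherits the priority of the highest-priority waiter, so it
	 * cannot be starved while holding the lock.
	 */
	pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
	pthread_mutex_init(&lock, &attr);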

>> Yeah, I do not understand why there should be a housekeeping VCPU that
>> is running at SCHED_NORMAL.  If it hurts, don't do it...
> 
> Hope explanation above makes sense (in fact, it was you who pointed 
> out SCHED_FIFO should not be constant on the housekeeping vCPU,
> when sharing pCPU with emulator thread at SCHED_NORMAL).

The two are not exclusive... As you point out, it depends on the
workload.  For DPDK you can put both of them at SCHED_NORMAL.  For
kernel-intensive uses you must use SCHED_FIFO.

Perhaps we could consider running these threads at SCHED_RR instead.
Unlike with SCHED_NORMAL, I am not against a hypercall that temporarily
bumps SCHED_RR to SCHED_FIFO, but perhaps that's not even necessary.

Paolo

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall
  2017-09-22  1:10       ` Marcelo Tosatti
@ 2017-09-22 10:00         ` Peter Zijlstra
  2017-09-22 10:56           ` Peter Zijlstra
  2017-09-22 12:16           ` Marcelo Tosatti
  0 siblings, 2 replies; 52+ messages in thread
From: Peter Zijlstra @ 2017-09-22 10:00 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel, Thomas Gleixner

On Thu, Sep 21, 2017 at 10:10:41PM -0300, Marcelo Tosatti wrote:
> When executing guest vcpu-0 with FIFO:1 priority, which is necessary
> to
> deal with the following situation:
> 
> VCPU-0 (housekeeping VCPU)              VCPU-1 (realtime VCPU)
> 
> raw_spin_lock(A)
> interrupted, schedule task T-1          raw_spin_lock(A) (spin)
> 
> raw_spin_unlock(A)
> 
> Certain operations must interrupt guest vcpu-0 (see trace below).

Those traces don't make any sense. All they include is kvm_exit and you
can't tell anything from that.

> To fix this issue, only change guest vcpu-0 to FIFO priority
> on spinlock critical sections (see patch).

This doesn't make sense. So you're saying that if you run all VCPUs as
FIFO things come apart? Why?

And why can't they still come apart when the guest holds a spinlock?

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall
  2017-09-22 10:00         ` Peter Zijlstra
@ 2017-09-22 10:56           ` Peter Zijlstra
  2017-09-22 12:33             ` Marcelo Tosatti
  2017-09-22 12:16           ` Marcelo Tosatti
  1 sibling, 1 reply; 52+ messages in thread
From: Peter Zijlstra @ 2017-09-22 10:56 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel, Thomas Gleixner

On Fri, Sep 22, 2017 at 12:00:04PM +0200, Peter Zijlstra wrote:
> On Thu, Sep 21, 2017 at 10:10:41PM -0300, Marcelo Tosatti wrote:
> > When executing guest vcpu-0 with FIFO:1 priority, which is necessary
> > to
> > deal with the following situation:
> > 
> > VCPU-0 (housekeeping VCPU)              VCPU-1 (realtime VCPU)
> > 
> > raw_spin_lock(A)
> > interrupted, schedule task T-1          raw_spin_lock(A) (spin)
> > 
> > raw_spin_unlock(A)
> > 
> > Certain operations must interrupt guest vcpu-0 (see trace below).
> 
> Those traces don't make any sense. All they include is kvm_exit and you
> can't tell anything from that.
> 
> > To fix this issue, only change guest vcpu-0 to FIFO priority
> > on spinlock critical sections (see patch).
> 
> This doesn't make sense. So you're saying that if you run all VCPUs as
> FIFO things come apart? Why?
> 
> And why can't they still come apart when the guest holds a spinlock?

That is, running a RT guest and not having _all_ VCPUs being RT tasks on
the host is absolutely and completely insane and broken.

Fix whatever needs fixing to allow your VCPU0 to be RT, don't do insane
things like this.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall
  2017-09-22 10:00         ` Peter Zijlstra
  2017-09-22 10:56           ` Peter Zijlstra
@ 2017-09-22 12:16           ` Marcelo Tosatti
  2017-09-22 12:31             ` Peter Zijlstra
  1 sibling, 1 reply; 52+ messages in thread
From: Marcelo Tosatti @ 2017-09-22 12:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel, Thomas Gleixner

On Fri, Sep 22, 2017 at 12:00:05PM +0200, Peter Zijlstra wrote:
> On Thu, Sep 21, 2017 at 10:10:41PM -0300, Marcelo Tosatti wrote:
> > When executing guest vcpu-0 with FIFO:1 priority, which is necessary
> > to
> > deal with the following situation:
> > 
> > VCPU-0 (housekeeping VCPU)              VCPU-1 (realtime VCPU)
> > 
> > raw_spin_lock(A)
> > interrupted, schedule task T-1          raw_spin_lock(A) (spin)
> > 
> > raw_spin_unlock(A)
> > 
> > Certain operations must interrupt guest vcpu-0 (see trace below).
> 
> Those traces don't make any sense. All they include is kvm_exit and you
> can't tell anything from that.

Hi Peter,

OK, let's describe what's happening:

With the QEMU emulator thread and vcpu-0 sharing a physical CPU
(which is a request from several NFV customers, to improve
guest packing), the following occurs when the guest executes
this pattern:

		1. submit IO.
		2. busy spin.

Hang trace
==========

Without FIFO priority:

qemu-kvm-6705  [002] ....1.. 767785.648964: kvm_exit: reason IO_INSTRUCTION rip 0xe8fe info 1f00039 0
qemu-kvm-6705  [002] ....1.. 767785.648965: kvm_exit: reason IO_INSTRUCTION rip 0xe911 info 3f60008 0
qemu-kvm-6705  [002] ....1.. 767785.648968: kvm_exit: reason IO_INSTRUCTION rip 0x8984 info 608000b 0
qemu-kvm-6705  [002] ....1.. 767785.648971: kvm_exit: reason IO_INSTRUCTION rip 0xb313 info 1f70008 0
qemu-kvm-6705  [002] ....1.. 767785.648974: kvm_exit: reason IO_INSTRUCTION rip 0xb514 info 3f60000 0
qemu-kvm-6705  [002] ....1.. 767785.648977: kvm_exit: reason PENDING_INTERRUPT rip 0x8052 info 0 0
qemu-kvm-6705  [002] ....1.. 767785.648980: kvm_exit: reason IO_INSTRUCTION rip 0xeee6 info 200040 0
qemu-kvm-6705  [002] ....1.. 767785.648999: kvm_exit: reason EPT_MISCONFIG rip 0x2120 info 0 0

The emulator thread is able to interrupt qemu vcpu0 at SCHED_NORMAL
priority.

With FIFO priority:

Now, with qemu vcpu0 at SCHED_FIFO priority, which is necessary to
avoid the following scenario:

(*)
VCPU-0 (housekeeping VCPU)              VCPU-1 (realtime VCPU)
 
raw_spin_lock(A)
interrupted, schedule task T-1          raw_spin_lock(A) (spin)
 
raw_spin_unlock(A)

And the following code pattern by vcpu0:

		1. submit IO.
		2. busy spin.

The emulator thread is unable to interrupt the vcpu0 thread
(vcpu0 busy spinning at SCHED_FIFO, emulator thread at SCHED_NORMAL),
and you get a hang at boot as follows:

qemu-kvm-7636  [002] ....1.. 768218.205065: kvm_exit: reason IO_INSTRUCTION rip 0xb313 info 1f70008 0
qemu-kvm-7636  [002] ....1.. 768218.205068: kvm_exit: reason IO_INSTRUCTION rip 0x8984 info 608000b 0
qemu-kvm-7636  [002] ....1.. 768218.205071: kvm_exit: reason IO_INSTRUCTION rip 0xb313 info 1f70008 0
qemu-kvm-7636  [002] ....1.. 768218.205074: kvm_exit: reason IO_INSTRUCTION rip 0x8984 info 608000b 0
qemu-kvm-7636  [002] ....1.. 768218.205077: kvm_exit: reason IO_INSTRUCTION rip 0xb313 info 1f70008 0

So to fix this problem, the patchset changes the priority
of the VCPU thread (to fix (*)), only when taking spinlocks.
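
For reference, a minimal sketch of the guest-side idea, using the
existing kvm_hypercall1() helper.  The argument convention here
(1 = boost to SCHED_FIFO, 0 = drop back) is an assumption for
illustration, not necessarily what the patch uses:

	#include <asm/kvm_para.h>

	/* ask the host to raise this vcpu to SCHED_FIFO around the
	   spinlock critical section */
	static inline void kvm_rt_prio_boost(void)
	{
		kvm_hypercall1(KVM_HC_RT_PRIO, 1);
	}

	/* ... and to drop it back to SCHED_NORMAL afterwards */
	static inline void kvm_rt_prio_unboost(void)
	{
		kvm_hypercall1(KVM_HC_RT_PRIO, 0);
	}

The guest spinlock path would call the first helper before taking the
lock and the second one after releasing it.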

Does that make sense now?

> 
> > To fix this issue, only change guest vcpu-0 to FIFO priority
> > on spinlock critical sections (see patch).
> 
> This doesn't make sense. So you're saying that if you run all VCPUs as
> FIFO things come apart? Why?

Please see above.

> And why can't they still come apart when the guest holds a spinlock?

Hopefully the above makes sense.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch 2/3] KVM: x86: KVM_HC_RT_PRIO hypercall (host-side)
  2017-09-22  7:23         ` Paolo Bonzini
@ 2017-09-22 12:24           ` Marcelo Tosatti
  0 siblings, 0 replies; 52+ messages in thread
From: Marcelo Tosatti @ 2017-09-22 12:24 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Konrad Rzeszutek Wilk, kvm, linux-kernel

On Fri, Sep 22, 2017 at 09:23:47AM +0200, Paolo Bonzini wrote:
> On 22/09/2017 03:08, Marcelo Tosatti wrote:
> > On Thu, Sep 21, 2017 at 03:49:33PM +0200, Paolo Bonzini wrote:
> >> On 21/09/2017 15:32, Konrad Rzeszutek Wilk wrote:
> >>> So the guest can change the scheduling decisions at the host level?
> >>> And the host HAS to follow it? There is no policy override for the
> >>> host to say - nah, not going to do it?
> > 
> > In that case the host should not even configure the guest with this
> > option (this is QEMU's 'enable-rt-fifo-hc' option).
> > 
> >>> Also wouldn't the guest want to always be at SCHED_FIFO? [I am thinking
> >>> of a guest admin who wants all the CPU resources he can get]
> > 
> > No. Because in the following code, executed by the housekeeping vCPU
> > running at constant SCHED_FIFO priority:
> > 
> > 	1. Start disk I/O.
> > 	2. busy spin
> > 
> > With the emulator thread sharing the same pCPU with the housekeeping
> > vCPU, the emulator thread (which runs at SCHED_NORMAL), will never
> > be scheduled in in place of the vcpu thread at SCHED_FIFO.
> > 
> > This causes a hang.
> 
> But if the emulator thread can interrupt the housekeeping thread, the
> emulator thread should also be SCHED_FIFO at higher priority; IIRC this
> was in Jan's talk from a few years ago.

The point is that we do not want the emulator thread to be able to
interrupt the housekeeping thread at all times: we only want it to
interrupt the housekeeping thread when it is not in a spinlock protected
section (because preempting it there has an effect on realtime vcpus
attempting to grab that particular spinlock).

Outside such sections, it can interrupt the housekeeping thread.

> QEMU would also have to use PI mutexes (which is the main reason why
> it's using QemuMutex instead of e.g. GMutex).
> 
> >> Yeah, I do not understand why there should be a housekeeping VCPU that
> >> is running at SCHED_NORMAL.  If it hurts, don't do it...
> > 
> > Hope explanation above makes sense (in fact, it was you who pointed 
> > out SCHED_FIFO should not be constant on the housekeeping vCPU,
> > when sharing pCPU with emulator thread at SCHED_NORMAL).
> 
> The two are not exclusive... As you point out, it depends on the
> workload.  For DPDK you can put both of them at SCHED_NORMAL.  For
> kernel-intensive uses you must use SCHED_FIFO.
> 
> Perhaps we could consider running these threads at SCHED_RR instead.
> Unlike SCHED_NORMAL, I am not against a hypercall that bumps temporarily
> SCHED_RR to SCHED_FIFO, but perhaps that's not even necessary.

Sorry Paolo, I don't see how SCHED_RR is going to help here:

"   SCHED_RR: Round-robin scheduling
       SCHED_RR is a simple enhancement of SCHED_FIFO.  Everything
described
       above for SCHED_FIFO also applies to SCHED_RR, except that each
       thread is allowed to run only for a maximum time quantum."

What must happen is that vcpu0 should run _until it's finished with the
spinlock protected section_ (that is, any job the emulator thread
has, in that period where vcpu0 has work to do, is of lower priority
and must not execute). Otherwise vcpu1, running a realtime workload,
will attempt to grab the spinlock vcpu0 has grabbed, and busy
spin waiting for the emulator thread to finish.

If you have the emulator thread at a higher priority than vcpu0, as you 
suggested above, the same problem will happen. So that option is not
viable.

We tried to have vcpu0 with SCHED_FIFO at all times, to avoid this
hypercall, but unfortunately that'll cause the hang as described in the
trace.

So I fail to see how SCHED_RR would help here.

Thanks

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall
  2017-09-22 12:16           ` Marcelo Tosatti
@ 2017-09-22 12:31             ` Peter Zijlstra
  2017-09-22 12:36               ` Marcelo Tosatti
  2017-09-22 12:40               ` [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall\ Marcelo Tosatti
  0 siblings, 2 replies; 52+ messages in thread
From: Peter Zijlstra @ 2017-09-22 12:31 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel, Thomas Gleixner

On Fri, Sep 22, 2017 at 09:16:40AM -0300, Marcelo Tosatti wrote:
> On Fri, Sep 22, 2017 at 12:00:05PM +0200, Peter Zijlstra wrote:
> > On Thu, Sep 21, 2017 at 10:10:41PM -0300, Marcelo Tosatti wrote:
> > > When executing guest vcpu-0 with FIFO:1 priority, which is necessary
> > > to
> > > deal with the following situation:
> > > 
> > > VCPU-0 (housekeeping VCPU)              VCPU-1 (realtime VCPU)
> > > 
> > > raw_spin_lock(A)
> > > interrupted, schedule task T-1          raw_spin_lock(A) (spin)
> > > 
> > > raw_spin_unlock(A)
> > > 
> > > Certain operations must interrupt guest vcpu-0 (see trace below).
> > 
> > Those traces don't make any sense. All they include is kvm_exit and you
> > can't tell anything from that.
> 
> Hi Peter,
> 
> OK lets describe whats happening:
> 
> With QEMU emulator thread and vcpu-0 sharing a physical CPU
> (which is a request from several NFV customers, to improve
> guest packing), the following occurs when the guest generates 
> the following pattern:
> 
> 		1. submit IO.
> 		2. busy spin.

User-space spinning is a bad idea in general and terminally broken in
a RT setup. Sounds like you need to go fix qemu to not suck.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall
  2017-09-22 10:56           ` Peter Zijlstra
@ 2017-09-22 12:33             ` Marcelo Tosatti
  2017-09-22 12:55               ` Peter Zijlstra
  0 siblings, 1 reply; 52+ messages in thread
From: Marcelo Tosatti @ 2017-09-22 12:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel, Thomas Gleixner

On Fri, Sep 22, 2017 at 12:56:09PM +0200, Peter Zijlstra wrote:
> On Fri, Sep 22, 2017 at 12:00:04PM +0200, Peter Zijlstra wrote:
> > On Thu, Sep 21, 2017 at 10:10:41PM -0300, Marcelo Tosatti wrote:
> > > When executing guest vcpu-0 with FIFO:1 priority, which is necessary
> > > to
> > > deal with the following situation:
> > > 
> > > VCPU-0 (housekeeping VCPU)              VCPU-1 (realtime VCPU)
> > > 
> > > raw_spin_lock(A)
> > > interrupted, schedule task T-1          raw_spin_lock(A) (spin)
> > > 
> > > raw_spin_unlock(A)
> > > 
> > > Certain operations must interrupt guest vcpu-0 (see trace below).
> > 
> > Those traces don't make any sense. All they include is kvm_exit and you
> > can't tell anything from that.
> > 
> > > To fix this issue, only change guest vcpu-0 to FIFO priority
> > > on spinlock critical sections (see patch).
> > 
> > This doesn't make sense. So you're saying that if you run all VCPUs as
> > FIFO things come apart? Why?
> > 
> > And why can't they still come apart when the guest holds a spinlock?
> 
> That is, running a RT guest and not having _all_ VCPUs being RT tasks on
> the host is absolutely and completely insane and broken.

Can you explain why, please?

> Fix whatever needs fixing to allow your VCPU0 to be RT, don't do insane
> things like this.

VCPU0 can be RT, but you'll get the following hang, if the emulator
thread is sharing a pCPU with VCPU0:

	1. submit IO.
	2. busy spin.

As executed by the guest vcpu (it's a natural problem).

Do you have a better suggestion as how to fix the problem?

We can fix the BIOS, but userspace will still be allowed to
generate the code pattern above.

And increasing the priority of the emulator thread, at random times 
(so it can inject interrupts to vcpu-0), can cause it to interrupt 
vcpu-0 in a spinlock protected section.

The only other option is for customers to live with the decreased
packing (that is, requiring one pcpu for each vcpu, and an additional
pcpu for emulator threads). Is that what you are suggesting?

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall
  2017-09-22 12:31             ` Peter Zijlstra
@ 2017-09-22 12:36               ` Marcelo Tosatti
  2017-09-22 12:59                 ` Peter Zijlstra
  2017-09-22 12:40               ` [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall\ Marcelo Tosatti
  1 sibling, 1 reply; 52+ messages in thread
From: Marcelo Tosatti @ 2017-09-22 12:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel, Thomas Gleixner

On Fri, Sep 22, 2017 at 02:31:07PM +0200, Peter Zijlstra wrote:
> On Fri, Sep 22, 2017 at 09:16:40AM -0300, Marcelo Tosatti wrote:
> > On Fri, Sep 22, 2017 at 12:00:05PM +0200, Peter Zijlstra wrote:
> > > On Thu, Sep 21, 2017 at 10:10:41PM -0300, Marcelo Tosatti wrote:
> > > > When executing guest vcpu-0 with FIFO:1 priority, which is necessary
> > > > to
> > > > deal with the following situation:
> > > > 
> > > > VCPU-0 (housekeeping VCPU)              VCPU-1 (realtime VCPU)
> > > > 
> > > > raw_spin_lock(A)
> > > > interrupted, schedule task T-1          raw_spin_lock(A) (spin)
> > > > 
> > > > raw_spin_unlock(A)
> > > > 
> > > > Certain operations must interrupt guest vcpu-0 (see trace below).
> > > 
> > > Those traces don't make any sense. All they include is kvm_exit and you
> > > can't tell anything from that.
> > 
> > Hi Peter,
> > 
> > OK lets describe whats happening:
> > 
> > With QEMU emulator thread and vcpu-0 sharing a physical CPU
> > (which is a request from several NFV customers, to improve
> > guest packing), the following occurs when the guest generates 
> > the following pattern:
> > 
> > 		1. submit IO.
> > 		2. busy spin.
> 
> User-space spinning is a bad idea in general and terminally broken in
> a RT setup. Sounds like you need to go fix qemu to not suck.

One can run whatever application they want on the housekeeping
vcpus. This is why rteval exists.

This is not the realtime vcpu we are talking about.

We can fix the BIOS, which is hanging now, but userspace can
do whatever it wants on non-realtime vcpus (again, this is why
the rteval test exists and is used by the -RT community as
a testcase).

I haven't understood what is wrong with the patch. Are you trying
to avoid polluting the spinlock codepath to keep it simple?

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall\
  2017-09-22 12:31             ` Peter Zijlstra
  2017-09-22 12:36               ` Marcelo Tosatti
@ 2017-09-22 12:40               ` Marcelo Tosatti
  2017-09-22 13:01                 ` Peter Zijlstra
  1 sibling, 1 reply; 52+ messages in thread
From: Marcelo Tosatti @ 2017-09-22 12:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel, Thomas Gleixner

On Fri, Sep 22, 2017 at 02:31:07PM +0200, Peter Zijlstra wrote:
> On Fri, Sep 22, 2017 at 09:16:40AM -0300, Marcelo Tosatti wrote:
> > On Fri, Sep 22, 2017 at 12:00:05PM +0200, Peter Zijlstra wrote:
> > > On Thu, Sep 21, 2017 at 10:10:41PM -0300, Marcelo Tosatti wrote:
> > > > When executing guest vcpu-0 with FIFO:1 priority, which is necessary
> > > > to
> > > > deal with the following situation:
> > > > 
> > > > VCPU-0 (housekeeping VCPU)              VCPU-1 (realtime VCPU)
> > > > 
> > > > raw_spin_lock(A)
> > > > interrupted, schedule task T-1          raw_spin_lock(A) (spin)
> > > > 
> > > > raw_spin_unlock(A)
> > > > 
> > > > Certain operations must interrupt guest vcpu-0 (see trace below).
> > > 
> > > Those traces don't make any sense. All they include is kvm_exit and you
> > > can't tell anything from that.
> > 
> > Hi Peter,
> > 
> > OK lets describe whats happening:
> > 
> > With QEMU emulator thread and vcpu-0 sharing a physical CPU
> > (which is a request from several NFV customers, to improve
> > guest packing), the following occurs when the guest generates 
> > the following pattern:
> > 
> > 		1. submit IO.
> > 		2. busy spin.
> 
> User-space spinning is a bad idea in general and terminally broken in
> a RT setup. Sounds like you need to go fix qemu to not suck.

Are you arguing it's invalid for the following application to execute on
a housekeeping vcpu of a realtime system:

void main(void)
{

    submit_IO();
    do {
       computation();
    } while (!interrupted());
}

Really?

Replace "busy spin" by "useful computation until interrupted". 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall
  2017-09-22 12:33             ` Marcelo Tosatti
@ 2017-09-22 12:55               ` Peter Zijlstra
  2017-09-23 10:56                 ` Paolo Bonzini
  0 siblings, 1 reply; 52+ messages in thread
From: Peter Zijlstra @ 2017-09-22 12:55 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel, Thomas Gleixner

On Fri, Sep 22, 2017 at 09:33:05AM -0300, Marcelo Tosatti wrote:
> > That is, running a RT guest and not having _all_ VCPUs being RT tasks on
> > the host is absolutely and completely insane and broken.
> 
> Can you explain why, please?

You just explained it yourself. If the thread that needs to complete
what you're waiting on has lower priority, it will _never_ get to run if
you're busy waiting on it.

This is _trivial_.

And even for !RT it can be quite costly, because you can end up having
to burn your entire slot of CPU time before you run the other task.

Userspace spinning is _bad_, do not do this.

(the one exception where it works is where you have a single thread per
cpu, because then there's effectively no scheduling).

> > Fix whatever needs fixing to allow your VCPU0 to be RT, don't do insane
> > things like this.
> 
> VCPU0 can be RT, but you'll get the following hang, if the emulator
> thread is sharing a pCPU with VCPU0:
> 
> 	1. submit IO.
> 	2. busy spin.
> 
> As executed by the guest vcpu (its a natural problem).
> 
> Do you have a better suggestion as how to fix the problem?

Yes, not busy wait. Go to sleep and make sure you're woken up once the
IO completes.
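
(Purely illustrative, using submit_IO()/io_fd as placeholder names from
the example above rather than a real API, the blocking variant would
look something like:

	#include <poll.h>

	void wait_for_io(int io_fd)
	{
		struct pollfd pfd = { .fd = io_fd, .events = POLLIN };

		/* sleep until the IO completes instead of spinning,
		   so the thread that completes the IO can run */
		poll(&pfd, 1, -1);
	}
)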

> We can fix the BIOS, but userspace will still be allowed to
> generate the code pattern above.

What does the BIOS have to do with anything?

> And increasing the priority of the emulator thread, at random times 
> (so it can inject interrupts to vcpu-0), can cause it to interrupt 
> vcpu-0 in a spinlock protected section.

You can equally boost the emulator thread while you're spin-waiting, but
that's ugly as heck too.

The normal, sane solution is to not spin-wait but block.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall
  2017-09-22 12:36               ` Marcelo Tosatti
@ 2017-09-22 12:59                 ` Peter Zijlstra
  2017-09-25  1:52                   ` Marcelo Tosatti
  0 siblings, 1 reply; 52+ messages in thread
From: Peter Zijlstra @ 2017-09-22 12:59 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel, Thomas Gleixner

On Fri, Sep 22, 2017 at 09:36:39AM -0300, Marcelo Tosatti wrote:
> On Fri, Sep 22, 2017 at 02:31:07PM +0200, Peter Zijlstra wrote:
> > On Fri, Sep 22, 2017 at 09:16:40AM -0300, Marcelo Tosatti wrote:
> > > On Fri, Sep 22, 2017 at 12:00:05PM +0200, Peter Zijlstra wrote:
> > > > On Thu, Sep 21, 2017 at 10:10:41PM -0300, Marcelo Tosatti wrote:
> > > > > When executing guest vcpu-0 with FIFO:1 priority, which is necessary
> > > > > to
> > > > > deal with the following situation:
> > > > > 
> > > > > VCPU-0 (housekeeping VCPU)              VCPU-1 (realtime VCPU)
> > > > > 
> > > > > raw_spin_lock(A)
> > > > > interrupted, schedule task T-1          raw_spin_lock(A) (spin)
> > > > > 
> > > > > raw_spin_unlock(A)
> > > > > 
> > > > > Certain operations must interrupt guest vcpu-0 (see trace below).
> > > > 
> > > > Those traces don't make any sense. All they include is kvm_exit and you
> > > > can't tell anything from that.
> > > 
> > > Hi Peter,
> > > 
> > > OK lets describe whats happening:
> > > 
> > > With QEMU emulator thread and vcpu-0 sharing a physical CPU
> > > (which is a request from several NFV customers, to improve
> > > guest packing), the following occurs when the guest generates 
> > > the following pattern:
> > > 
> > > 		1. submit IO.
> > > 		2. busy spin.
> > 
> > User-space spinning is a bad idea in general and terminally broken in
> > a RT setup. Sounds like you need to go fix qemu to not suck.
> 
> One can run whatever application they want on the housekeeping
> vcpus. This is why rteval exists.

Nobody cares about other tasks. The problem is between the VCPU and
emulator thread. They get a priority inversion and live-lock because of
spin-waiting.

> This is not the realtime vcpu we are talking about.

You're being confused, it's an RT _guest_, all VCPUs _must_ be RT.
Because, as you ran into, the guest functions as a whole, not as a bunch
of individual CPUs.

> We can fix the BIOS, which is hanging now, but userspace can 
> do whatever it wants, on non realtime vcpus (again, this is why
> rteval test exists and is used by the -RT community as 
> a testcase).

But nobody cares what other tasks on the system do, all you care about
is that the VCPUs make deterministic forward progress.

> I haven't understood what is the wrong with the patch? Are you trying
> to avoid pollution of the spinlock codepath to keep it simple?

Your patch is voodoo programming. You don't solve the actual problem,
you try and paper over it.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall\
  2017-09-22 12:40               ` [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall\ Marcelo Tosatti
@ 2017-09-22 13:01                 ` Peter Zijlstra
  2017-09-25  2:22                   ` Marcelo Tosatti
  0 siblings, 1 reply; 52+ messages in thread
From: Peter Zijlstra @ 2017-09-22 13:01 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel, Thomas Gleixner

On Fri, Sep 22, 2017 at 09:40:05AM -0300, Marcelo Tosatti wrote:

> Are you arguing its invalid for the following application to execute on 
> housekeeping vcpu of a realtime system:
> 
> void main(void)
> {
> 
>     submit_IO();
>     do {
>        computation();
>     } while (!interrupted());
> }
> 
> Really?

No. Nobody cares about random crap tasks.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall
  2017-09-22 12:55               ` Peter Zijlstra
@ 2017-09-23 10:56                 ` Paolo Bonzini
  2017-09-23 13:41                   ` Peter Zijlstra
  0 siblings, 1 reply; 52+ messages in thread
From: Paolo Bonzini @ 2017-09-23 10:56 UTC (permalink / raw)
  To: Peter Zijlstra, Marcelo Tosatti
  Cc: Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel, Thomas Gleixner

On 22/09/2017 14:55, Peter Zijlstra wrote:
> You just explained it yourself. If the thread that needs to complete
> what you're waiting on has lower priority, it will _never_ get to run if
> you're busy waiting on it.
> 
> This is _trivial_.
> 
> And even for !RT it can be quite costly, because you can end up having
> to burn your entire slot of CPU time before you run the other task.
> 
> Userspace spinning is _bad_, do not do this.

This is not userspace spinning, it is guest spinning---which has
effectively the same effect, but which you cannot quite avoid.

But I agree that the solution is properly prioritizing threads that can
interrupt the VCPU, and using PI mutexes.
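
(For concreteness, "PI mutex" here means a pthread mutex created with
the priority-inheritance protocol.  A minimal sketch, not actual QEMU
code; the function name is made up:

	#include <pthread.h>

	static void pi_mutex_init(pthread_mutex_t *lock)
	{
		pthread_mutexattr_t attr;

		pthread_mutexattr_init(&attr);
		/* a low-priority lock holder inherits the priority of
		   the highest-priority waiter, avoiding inversion */
		pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
		pthread_mutex_init(lock, &attr);
		pthread_mutexattr_destroy(&attr);
	}

Using QemuMutex rather than e.g. GMutex is what makes this kind of
initialization possible, as mentioned earlier in the thread.)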

I'm not a priori opposed to paravirt scheduling primitives, but I am not
at all sure that it's required.

Paolo

> (the one exception where it works is where you have a single thread per
> cpu, because then there's effectively no scheduling).

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall
  2017-09-23 10:56                 ` Paolo Bonzini
@ 2017-09-23 13:41                   ` Peter Zijlstra
  2017-09-24 13:05                     ` Paolo Bonzini
  0 siblings, 1 reply; 52+ messages in thread
From: Peter Zijlstra @ 2017-09-23 13:41 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marcelo Tosatti, Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel,
	Thomas Gleixner

On Sat, Sep 23, 2017 at 12:56:12PM +0200, Paolo Bonzini wrote:
> On 22/09/2017 14:55, Peter Zijlstra wrote:
> > You just explained it yourself. If the thread that needs to complete
> > what you're waiting on has lower priority, it will _never_ get to run if
> > you're busy waiting on it.
> > 
> > This is _trivial_.
> > 
> > And even for !RT it can be quite costly, because you can end up having
> > to burn your entire slot of CPU time before you run the other task.
> > 
> > Userspace spinning is _bad_, do not do this.
> 
> This is not userspace spinning, it is guest spinning---which has
> effectively the same effect but you cannot quite avoid.

So I'm virt illiterate and have no clue on how all this works; but
wasn't this a vmexit? (that's what Marcelo traced). And once you've
done a vmexit you're a regular task again, not a vcpu.

> But I agree that the solution is properly prioritizing threads that can
> interrupt the VCPU, and using PI mutexes.

Right, if you want to run RT VCPUs the whole emulator/vcpu interaction
needs to be designed for RT.

> I'm not a priori opposed to paravirt scheduling primitives, but I am not
> at all sure that it's required.

Problem is that the proposed thing doesn't solve anything. There is
nothing that prohibits the guest from triggering a vmexit while holding
a spinlock and landing in the self-same problems.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall
  2017-09-23 13:41                   ` Peter Zijlstra
@ 2017-09-24 13:05                     ` Paolo Bonzini
  2017-09-25  2:57                       ` Marcelo Tosatti
  0 siblings, 1 reply; 52+ messages in thread
From: Paolo Bonzini @ 2017-09-24 13:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Marcelo Tosatti, Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel,
	Thomas Gleixner



----- Original Message -----
> From: "Peter Zijlstra" <peterz@infradead.org>
> To: "Paolo Bonzini" <pbonzini@redhat.com>
> Cc: "Marcelo Tosatti" <mtosatti@redhat.com>, "Konrad Rzeszutek Wilk" <konrad.wilk@oracle.com>, mingo@redhat.com,
> kvm@vger.kernel.org, linux-kernel@vger.kernel.org, "Thomas Gleixner" <tglx@linutronix.de>
> Sent: Saturday, September 23, 2017 3:41:14 PM
> Subject: Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall
> 
> On Sat, Sep 23, 2017 at 12:56:12PM +0200, Paolo Bonzini wrote:
> > On 22/09/2017 14:55, Peter Zijlstra wrote:
> > > You just explained it yourself. If the thread that needs to complete
> > > what you're waiting on has lower priority, it will _never_ get to run if
> > > you're busy waiting on it.
> > > 
> > > This is _trivial_.
> > > 
> > > And even for !RT it can be quite costly, because you can end up having
> > > to burn your entire slot of CPU time before you run the other task.
> > > 
> > > Userspace spinning is _bad_, do not do this.
> > 
> > This is not userspace spinning, it is guest spinning---which has
> > effectively the same effect but you cannot quite avoid.
> 
> So I'm virt illiterate and have no clue on how all this works; but
> wasn't this a vmexit ? (that's what marcelo traced). And once you've
> done a vmexit you're a regular task again, not a vcpu.

His trace simply shows that the timer tick happened and the SCHED_NORMAL
thread was preempted.  Bumping the vCPU thread to SCHED_FIFO drops
the scheduler tick (the system is NOHZ_FULL) and thus 1) the frequency
of EXTERNAL_INTERRUPT vmexits drops to one per second, and 2) the thread
is not preempted anymore.

> > But I agree that the solution is properly prioritizing threads that can
> > interrupt the VCPU, and using PI mutexes.
> 
> Right, if you want to run RT VCPUs the whole emulator/vcpu interaction
> needs to be designed for RT.
> 
> > I'm not a priori opposed to paravirt scheduling primitives, but I am not
> > at all sure that it's required.
> 
> Problem is that the proposed thing doesn't solve anything. There is
> nothing that prohibits the guest from triggering a vmexit while holding
> a spinlock and landing in the self-same problems.

Well, part of configuring virt for RT (at all levels: host hypervisor+QEMU
and guest kernel+userspace) is that vmexits while holding a spinlock are either
confined to one vCPU or are handled in the host hypervisor very quickly, like
less than 2000 clock cycles.

So I'm not denying that Marcelo's approach solves the problem, but it's very
heavyweight and it masks an important misconfiguration (as you write above,
everything needs to be RT and the priorities must be designed carefully).

_However_, even if you do this, you may want to put the less important vCPUs
and the emulator threads on the same physical CPU.  In that case, the vCPU
can be placed at SCHED_RR to avoid starvation (while the emulator thread needs
to stay at SCHED_FIFO and higher priority).  Some kind of trick that bumps
spinlock critical sections in that vCPU to SCHED_FIFO, for a limited time only,
might still be useful.
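
(At the pthread level that placement would look roughly like the sketch
below; the thread handles and priority values are made up for the
example:

	#include <pthread.h>
	#include <sched.h>

	static void set_policy(pthread_t tid, int policy, int prio)
	{
		struct sched_param sp = { .sched_priority = prio };

		pthread_setschedparam(tid, policy, &sp);
	}

	static void place_threads(pthread_t emulator_tid, pthread_t vcpu0_tid)
	{
		/* emulator thread: SCHED_FIFO, higher priority */
		set_policy(emulator_tid, SCHED_FIFO, 2);
		/* housekeeping vCPU: SCHED_RR, lower priority */
		set_policy(vcpu0_tid, SCHED_RR, 1);
	}
)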

Paolo

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall
  2017-09-22 12:59                 ` Peter Zijlstra
@ 2017-09-25  1:52                   ` Marcelo Tosatti
  2017-09-25  8:35                     ` Peter Zijlstra
  0 siblings, 1 reply; 52+ messages in thread
From: Marcelo Tosatti @ 2017-09-25  1:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel, Thomas Gleixner

On Fri, Sep 22, 2017 at 02:59:51PM +0200, Peter Zijlstra wrote:
> On Fri, Sep 22, 2017 at 09:36:39AM -0300, Marcelo Tosatti wrote:
> > On Fri, Sep 22, 2017 at 02:31:07PM +0200, Peter Zijlstra wrote:
> > > On Fri, Sep 22, 2017 at 09:16:40AM -0300, Marcelo Tosatti wrote:
> > > > On Fri, Sep 22, 2017 at 12:00:05PM +0200, Peter Zijlstra wrote:
> > > > > On Thu, Sep 21, 2017 at 10:10:41PM -0300, Marcelo Tosatti wrote:
> > > > > > When executing guest vcpu-0 with FIFO:1 priority, which is necessary
> > > > > > to
> > > > > > deal with the following situation:
> > > > > > 
> > > > > > VCPU-0 (housekeeping VCPU)              VCPU-1 (realtime VCPU)
> > > > > > 
> > > > > > raw_spin_lock(A)
> > > > > > interrupted, schedule task T-1          raw_spin_lock(A) (spin)
> > > > > > 
> > > > > > raw_spin_unlock(A)
> > > > > > 
> > > > > > Certain operations must interrupt guest vcpu-0 (see trace below).
> > > > > 
> > > > > Those traces don't make any sense. All they include is kvm_exit and you
> > > > > can't tell anything from that.
> > > > 
> > > > Hi Peter,
> > > > 
> > > > OK lets describe whats happening:
> > > > 
> > > > With QEMU emulator thread and vcpu-0 sharing a physical CPU
> > > > (which is a request from several NFV customers, to improve
> > > > guest packing), the following occurs when the guest generates 
> > > > the following pattern:
> > > > 
> > > > 		1. submit IO.
> > > > 		2. busy spin.
> > > 
> > > User-space spinning is a bad idea in general and terminally broken in
> > > a RT setup. Sounds like you need to go fix qemu to not suck.
> > 
> > One can run whatever application they want on the housekeeping
> > vcpus. This is why rteval exists.
> 
> Nobody cares about other tasks. The problem is between the VCPU and
> emulator thread. They get a priority inversion and live-lock because of
> spin-waiting.
> 
> > This is not the realtime vcpu we are talking about.
> 
> You're being confused, its a RT _guest_, all VCPUs _must_ be RT.
> Because, as you ran into, the guest functions as a whole, not as a bunch
> of individual CPUs.
> 
> > We can fix the BIOS, which is hanging now, but userspace can 
> > do whatever it wants, on non realtime vcpus (again, this is why
> > rteval test exists and is used by the -RT community as 
> > a testcase).
> 
> But nobody cares what other tasks on the system do, all you care about
> is that the VCPUs make deterministic forward progress.
> 
> > I haven't understood what is the wrong with the patch? Are you trying
> > to avoid pollution of the spinlock codepath to keep it simple?
> 
> Your patch is voodoo programming. You don't solve the actual problem,
> you try and paper over it.

Priority boosting on a particular section of code is voodoo programming? 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall\
  2017-09-22 13:01                 ` Peter Zijlstra
@ 2017-09-25  2:22                   ` Marcelo Tosatti
  2017-09-25  8:58                     ` Peter Zijlstra
  2017-09-25 10:41                     ` Thomas Gleixner
  0 siblings, 2 replies; 52+ messages in thread
From: Marcelo Tosatti @ 2017-09-25  2:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel, Thomas Gleixner

On Fri, Sep 22, 2017 at 03:01:41PM +0200, Peter Zijlstra wrote:
> On Fri, Sep 22, 2017 at 09:40:05AM -0300, Marcelo Tosatti wrote:
> 
> > Are you arguing its invalid for the following application to execute on 
> > housekeeping vcpu of a realtime system:
> > 
> > void main(void)
> > {
> > 
> >     submit_IO();
> >     do {
> >        computation();
> >     } while (!interrupted());
> > }
> > 
> > Really?
> 
> No. Nobody cares about random crap tasks.

Nobody has control over all code that runs in userspace, Peter. And not
supporting a valid sequence of steps because it's "crap" (whatever your
definition of crap is) makes no sense.

It might be that someone decides to do the above (I really can't see
any actual reasoning I can follow and agree with behind your "it's crap"
argument); this truly seems valid to me.

So let's follow the reasoning steps:

1) "NACK, because you didn't understand the problem".

	OK, that's an invalid NACK: you did understand the problem
	later, and now your argument is the following.

2) "NACK, because all VCPUs should be SCHED_FIFO all the time".

But the existence of this code path from userspace:

  submit_IO();
  do {
     computation();
  } while (!interrupted());

It's a supported code sequence, and it works fine in a non-RT environment.
Therefore it should work in an -RT environment.
Think of any two applications, such as an IO application
and a CPU-bound application. The IO application will be severely
impacted, or never execute, in such a scenario.

Is that combination of tasks "random crap tasks"? (No, it's not, which
makes me think you're just NACKing without giving enough thought to the
problem).

So please give me some logical reasoning for the NACK (people can live with
it, but it has to be good enough to justify the decreased packing of
guests onto pCPUs):

1) "Voodoo programming" (its hard for me to parse what you mean with
that... do you mean you foresee this style of priority boosting causing
problems in the future? Can you give an example?).

Is there something fundamentally wrong about priority boosting in spinlock
sections, or is this particular style of priority boosting wrong?

2) "Pollution of the kernel code path". That makes sense to me, if thats
whats your concerned about.

3) "Reduction of spinlock performance". Its true, but for NFV workloads
people don't care about.

4) "All vcpus should be SCHED_FIFO all the time". OK, why is that?
What dictates that to be true?

What the patch does is the following:
it reduces the window where SCHED_FIFO is applied to vcpu0
to those sections where a spinlock is shared between -RT vcpus and vcpu0
(why: because otherwise, when the emulator thread is sharing a
pCPU with vcpu0, it's unable to inject interrupts into vcpu0).

And it's being rejected because:
Please fill in.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall
  2017-09-24 13:05                     ` Paolo Bonzini
@ 2017-09-25  2:57                       ` Marcelo Tosatti
  2017-09-25  9:13                         ` Peter Zijlstra
  2017-09-25 16:20                         ` Konrad Rzeszutek Wilk
  0 siblings, 2 replies; 52+ messages in thread
From: Marcelo Tosatti @ 2017-09-25  2:57 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Peter Zijlstra, Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel,
	Thomas Gleixner

On Sun, Sep 24, 2017 at 09:05:44AM -0400, Paolo Bonzini wrote:
> 
> 
> ----- Original Message -----
> > From: "Peter Zijlstra" <peterz@infradead.org>
> > To: "Paolo Bonzini" <pbonzini@redhat.com>
> > Cc: "Marcelo Tosatti" <mtosatti@redhat.com>, "Konrad Rzeszutek Wilk" <konrad.wilk@oracle.com>, mingo@redhat.com,
> > kvm@vger.kernel.org, linux-kernel@vger.kernel.org, "Thomas Gleixner" <tglx@linutronix.de>
> > Sent: Saturday, September 23, 2017 3:41:14 PM
> > Subject: Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall
> > 
> > On Sat, Sep 23, 2017 at 12:56:12PM +0200, Paolo Bonzini wrote:
> > > On 22/09/2017 14:55, Peter Zijlstra wrote:
> > > > You just explained it yourself. If the thread that needs to complete
> > > > what you're waiting on has lower priority, it will _never_ get to run if
> > > > you're busy waiting on it.
> > > > 
> > > > This is _trivial_.
> > > > 
> > > > And even for !RT it can be quite costly, because you can end up having
> > > > to burn your entire slot of CPU time before you run the other task.
> > > > 
> > > > Userspace spinning is _bad_, do not do this.
> > > 
> > > This is not userspace spinning, it is guest spinning---which has
> > > effectively the same effect but you cannot quite avoid.
> > 
> > So I'm virt illiterate and have no clue on how all this works; but
> > wasn't this a vmexit ? (that's what marcelo traced). And once you've
> > done a vmexit you're a regular task again, not a vcpu.
> 
> His trace simply shows that the timer tick happened and the SCHED_NORMAL
> thread was preempted.  Bumping the vCPU thread to SCHED_FIFO drops
> the scheduler tick (the system is NOHZ_FULL) and thus 1) the frequency
> of EXTERNAL_INTERRUPT vmexits drops to 1 second 2) the thread is not
> preempted anymore.
> 
> > > But I agree that the solution is properly prioritizing threads that can
> > > interrupt the VCPU, and using PI mutexes.

That's exactly what the patch does: the prioritization is not fixed in
time, and depends on whether or not vcpu-0 is in a spinlock protected
section.

Are you suggesting a different prioritization? Can you describe it
please, even if incomplete?

> > 
> > Right, if you want to run RT VCPUs the whole emulator/vcpu interaction
> > needs to be designed for RT.
> > 
> > > I'm not a priori opposed to paravirt scheduling primitives, but I am not
> > > at all sure that it's required.
> > 
> > Problem is that the proposed thing doesn't solve anything. There is
> > nothing that prohibits the guest from triggering a vmexit while holding
> > a spinlock and landing in the self-same problems.
> 
> Well, part of configuring virt for RT is (at all levels: host hypervisor+QEMU
> and guest kernel+userspace) is that vmexits while holding a spinlock are either
> confined to one vCPU or are handled in the host hypervisor very quickly, like
> less than 2000 clock cycles.
> 
> So I'm not denying that Marcelo's approach solves the problem, but it's very
> heavyweight and it masks an important misconfiguration (as you write above,
> everything needs to be RT and the priorities must be designed carefully).

I think you are missing the following point:

"vcpu0 can be interrupted when its not in a spinlock protected section, 
otherwise it can't."

So you _have_ to communicate to the host when the guest enters/leaves a
critical section.

So this point of "everything needs to be RT and the priorities must be
designed carefully", is this: 

	WHEN in spinlock protected section (more specifically, when 
	spinlock protected section _shared with realtime vcpus_),

	priority of vcpu0 > priority of emulator thread

	OTHERWISE

	priority of vcpu0 < priority of emulator thread.

(*)

So emulator thread can interrupt and inject interrupts to vcpu0.
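
(Host side, a hypothetical sketch of acting on that policy -- not the
actual patch 2/3.  On the hypercall, the vcpu thread switches its own
scheduling class; rt_prio stands for whatever FIFO priority the host
admin configured:

	static int kvm_hc_rt_prio(int rt_prio, bool boost)
	{
		struct sched_param fifo = { .sched_priority = rt_prio };
		struct sched_param normal = { .sched_priority = 0 };

		if (boost)
			return sched_setscheduler(current, SCHED_FIFO,
						  &fifo);

		return sched_setscheduler(current, SCHED_NORMAL, &normal);
	}
)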

> 
> _However_, even if you do this, you may want to put the less important vCPUs
> and the emulator threads on the same physical CPU.  In that case, the vCPU
> can be placed at SCHED_RR to avoid starvation (while the emulator thread needs
> to stay at SCHED_FIFO and higher priority).  Some kind of trick that bumps
> spinlock critical sections in that vCPU to SCHED_FIFO, for a limited time only,
> might still be useful.

Anything that violates (*) above is going to cause excessive latencies
in realtime vcpus, via:

PCPU-0:
	* vcpu-0 grabs spinlock A.
	* event wakes up emulator thread: vcpu-0 sched out, emulator
	  thread sched in.
PCPU-1:
	* realtime vcpu tries to grab spinlock A, busy spins until the
	  emulator thread completes.

So it's more than useful, it's necessary.

I'm open to suggestions for better ways to solve this problem
while the emulator thread shares a pCPU with vcpu-0 (which is something
users are interested in, for obvious economic reasons), but:

	1) I don't get the point of Peter's rejection.

	2) I don't get how SCHED_RR can help the situation.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall
  2017-09-25  1:52                   ` Marcelo Tosatti
@ 2017-09-25  8:35                     ` Peter Zijlstra
  0 siblings, 0 replies; 52+ messages in thread
From: Peter Zijlstra @ 2017-09-25  8:35 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel, Thomas Gleixner

On Sun, Sep 24, 2017 at 10:52:58PM -0300, Marcelo Tosatti wrote:
> On Fri, Sep 22, 2017 at 02:59:51PM +0200, Peter Zijlstra wrote:

> > Your patch is voodoo programming. You don't solve the actual problem,
> > you try and paper over it.
> 
> Priority boosting on a particular section of code is voodoo programming? 

Yes, because there's nothing that prevents said section of code from
triggering the exact problem you're trying to avoid.

The real fix is making sure the problem cannot happen to begin with,
which is done by fixing the interaction between the VCPU and that
emulation thread thing.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall\
  2017-09-25  2:22                   ` Marcelo Tosatti
@ 2017-09-25  8:58                     ` Peter Zijlstra
  2017-09-25 10:41                     ` Thomas Gleixner
  1 sibling, 0 replies; 52+ messages in thread
From: Peter Zijlstra @ 2017-09-25  8:58 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel, Thomas Gleixner

On Sun, Sep 24, 2017 at 11:22:38PM -0300, Marcelo Tosatti wrote:
> On Fri, Sep 22, 2017 at 03:01:41PM +0200, Peter Zijlstra wrote:
> > On Fri, Sep 22, 2017 at 09:40:05AM -0300, Marcelo Tosatti wrote:
> > 
> > > Are you arguing its invalid for the following application to execute on 
> > > housekeeping vcpu of a realtime system:
> > > 
> > > void main(void)
> > > {
> > > 
> > >     submit_IO();
> > >     do {
> > >        computation();
> > >     } while (!interrupted());
> > > }
> > > 
> > > Really?
> > 
> > No. Nobody cares about random crap tasks.
> 
> Nobody has control over all code that runs in userspace Peter. And not
> supporting a valid sequence of steps because its "crap" (whatever your 
> definition of crap is) makes no sense.
> 
> It might be that someone decides to do the above (i really can't see 
> any actual reasoning i can follow and agree on your "its crap"
> argument), this truly seems valid to me.

We don't care what other tasks do. This isn't a hard thing to
understand. You're free to run whatever junk on your CPUs. This doesn't
(much) affect the correct functioning of RT tasks that you also run
there.

> So lets follow the reasoning steps:
> 
> 1) "NACK, because you didnt understand the problem".
> 
> 	OK thats an invalid NACK, you did understand the problem
> 	later and now your argument is the following.

It was a NACK because you wrote a shit changelog that didn't explain the
problem. But yes.

> 2) "NACK, because all VCPUs should be SCHED_FIFO all the time".

Very much, if you want a RT guest, all VCPU's should run at RT prio and
the interaction between the VCPUs and all supporting threads should be
designed for RT.

> But the existence of this code path from userspace:
> 
>   submit_IO();
>   do {
>      computation();
>   } while (!interrupted());
> 
> Its a supported code sequence, and works fine in a non-RT environment.

Who cares about that chunk of code? Have you forgotten to mention that
this is the form of the emulation thread?

> Therefore it should work on an -RT environment.

No, this is where you're wrong. That code works on -RT as long as you
don't expect it to be a valid RT program. -RT kernels will run !RT stuff
just fine.

But the moment you run a program as RT (FIFO/RR/DEADLINE) it had better
damn well be a valid RT program, and that excludes a lot of code.

> So please give me some logical reasoning for the NACK (people can live with
> it, but it has to be good enough to justify the decreasing packing of 
> guests in pCPUs):
> 
> 1) "Voodoo programming" (its hard for me to parse what you mean with
> that... do you mean you foresee this style of priority boosting causing
> problems in the future? Can you give an example?).

Your 'solution' only works if you sacrifice a goat on a full moon,
because only that ensures the guest doesn't VM_EXIT and cause the
self-same problem while you've boosted it.

Because you've _not_ fixed the actual problem!

> Is there fundamentally wrong about priority boosting in spinlock
> sections, or this particular style of priority boosting is wrong?

Yes, it's fundamentally crap, because it doesn't guarantee anything.

RT is about making guarantees. An RT program needs a provable forward
progress guarantee at the very least. The fact that it includes a
priority inversion disqualifies it from being sane.

> 2) "Pollution of the kernel code path". That makes sense to me, if thats
> whats your concerned about.

Also..

> 3) "Reduction of spinlock performance". Its true, but for NFV workloads
> people don't care about.

I've no idea what an NFV is.

> 4) "All vcpus should be SCHED_FIFO all the time". OK, why is that?
> What dictates that to be true?

Solid engineering. Does the guest kernel function as a bunch of
independent CPUs or does it assume all CPUs are equal and have strong
inter-cpu connections? Linux is the latter, therefore if one VCPU is RT
they all should be.

Dammit, you even recognise this in the spin-owner preemption issue
you're hacking around, but then go arse-about-face 'solving' it.

> What the patch does is the following:
> It reduces the window where SCHED_FIFO is applied vcpu0
> to those were a spinlock is shared between -RT vcpus and vcpu0
> (why: because otherwise, when the emulator thread is sharing a
> pCPU with vcpu0, its unable to generate interrupts vcpu0).
> 
> And its being rejected because:

Its not fixing the actual problem. The real problem is the prio
inversion between the VCPU and the emulation thread, _That_ is what
needs fixing.

Rewrite that VCPU/emulator interaction to be a proper RT construct.

Then you can run the VCPU at RT prio as you should, and the guest can
issue all the VM_EXIT things it wants at any time and still function
correctly.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall
  2017-09-25  2:57                       ` Marcelo Tosatti
@ 2017-09-25  9:13                         ` Peter Zijlstra
  2017-09-25 15:12                           ` Paolo Bonzini
  2017-09-26 23:22                           ` [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall Marcelo Tosatti
  2017-09-25 16:20                         ` Konrad Rzeszutek Wilk
  1 sibling, 2 replies; 52+ messages in thread
From: Peter Zijlstra @ 2017-09-25  9:13 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Paolo Bonzini, Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel,
	Thomas Gleixner

On Sun, Sep 24, 2017 at 11:57:53PM -0300, Marcelo Tosatti wrote:
> I think you are missing the following point:
> 
> "vcpu0 can be interrupted when its not in a spinlock protected section, 
> otherwise it can't."
> 
> So you _have_ to communicate to the host when the guest enters/leaves a
> critical section.
> 
> So this point of "everything needs to be RT and the priorities must be
> designed carefully", is this: 
> 
> 	WHEN in spinlock protected section (more specifically, when 
> 	spinlock protected section _shared with realtime vcpus_),
> 
> 	priority of vcpu0 > priority of emulator thread
> 
> 	OTHERWISE
> 
> 	priority of vcpu0 < priority of emulator thread.
> 
> (*)
> 
> So emulator thread can interrupt and inject interrupts to vcpu0.

Spinlock protected regions are not everything. What about lock-free
constructs where CPUs spin-wait on one another (there are plenty)?

And I'm clearly ignorant of how this emulation thread works, but why
would it run for a long time? Either it is needed for forward progress
of the VCPU or it's not. If it's not, it shouldn't run.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall\
  2017-09-25  2:22                   ` Marcelo Tosatti
  2017-09-25  8:58                     ` Peter Zijlstra
@ 2017-09-25 10:41                     ` Thomas Gleixner
  2017-09-25 18:28                       ` Jan Kiszka
  1 sibling, 1 reply; 52+ messages in thread
From: Thomas Gleixner @ 2017-09-25 10:41 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Peter Zijlstra, Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel

On Sun, 24 Sep 2017, Marcelo Tosatti wrote:
> On Fri, Sep 22, 2017 at 03:01:41PM +0200, Peter Zijlstra wrote:
> What the patch does is the following:
> It reduces the window where SCHED_FIFO is applied vcpu0
> to those were a spinlock is shared between -RT vcpus and vcpu0
> (why: because otherwise, when the emulator thread is sharing a
> pCPU with vcpu0, its unable to generate interrupts vcpu0).
> 
> And its being rejected because:
> Please fill in.

Your patch is just papering over one particular problem, but it's not
fixing the root cause. That's the worst engineering approach and we all
know how fast this kind of crap falls over.

There are enough other issues which can cause starvation of the RT VCPUs
when the housekeeping VCPU is preempted, not just the particular problem
which you observed.

Back then when I did the first prototype of RT in KVM, I made it entirely
clear, that you have to spend one physical CPU for _each_ VCPU, independent
whether the VCPU is reserved for RT workers or the housekeeping VCPU. The
emulator thread needs to run on a separate physical CPU.

If you want to run the housekeeping VCPU and the emulator thread on the
same physical CPU then you have to make sure that both the emulator and the
housekeeper side of affairs are designed and implemented with RT in
mind. As long as that is not the case, you simply cannot run them on the
same physical CPU. RT is about guarantees and guarantees cannot be achieved
with bandaid engineering.

It's that simple, end of story.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall
  2017-09-25  9:13                         ` Peter Zijlstra
@ 2017-09-25 15:12                           ` Paolo Bonzini
  2017-09-26 22:49                             ` [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall\ Marcelo Tosatti
  2017-09-26 23:22                           ` [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall Marcelo Tosatti
  1 sibling, 1 reply; 52+ messages in thread
From: Paolo Bonzini @ 2017-09-25 15:12 UTC (permalink / raw)
  To: Peter Zijlstra, Marcelo Tosatti
  Cc: Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel, Thomas Gleixner

On 25/09/2017 11:13, Peter Zijlstra wrote:
> On Sun, Sep 24, 2017 at 11:57:53PM -0300, Marcelo Tosatti wrote:
>> I think you are missing the following point:
>>
>> "vcpu0 can be interrupted when its not in a spinlock protected section, 
>> otherwise it can't."

Who says that?  Certainly a driver can dedicate a single VCPU to
periodic polling of the device, in such a way that the polling does not
require a spinlock.

>> So you _have_ to communicate to the host when the guest enters/leaves a
>> critical section.
>>
>> So this point of "everything needs to be RT and the priorities must be
>> designed carefully", is this: 
>>
>> 	WHEN in spinlock protected section (more specifically, when 
>> 	spinlock protected section _shared with realtime vcpus_),
>>
>> 	priority of vcpu0 > priority of emulator thread
>>
>> 	OTHERWISE
>>
>> 	priority of vcpu0 < priority of emulator thread.

This is _not_ designed carefully, this is messy.

The emulator thread can interrupt the VCPU thread, so it has to be at
higher RT priority (+ priority inheritance of mutexes).  Once you have
done that we can decide on other approaches that e.g. let you get more
sharing by placing housekeeping VCPUs at SCHED_NORMAL or SCHED_RR.

>> So emulator thread can interrupt and inject interrupts to vcpu0.
> 
> spinlock protected regions are not everything. What about lock-free
> constructs where CPU's spin-wait on one another (there's plenty).
> 
> And I'm clearly ignorant of how this emulation thread works, but why
> would it run for a long time? Either it is needed for forward progress
> of the VCPU or its not. If its not, it shouldn't run.

The emulator thread 1) should not run for long periods of time indeed,
and 2) it is needed for forward progress of the VCPU.  So it has to be
at higher RT priority.  I agree with Peter, sorry.  Spinlocks are a red
herring here.

Paolo

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall
  2017-09-25  2:57                       ` Marcelo Tosatti
  2017-09-25  9:13                         ` Peter Zijlstra
@ 2017-09-25 16:20                         ` Konrad Rzeszutek Wilk
  1 sibling, 0 replies; 52+ messages in thread
From: Konrad Rzeszutek Wilk @ 2017-09-25 16:20 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Paolo Bonzini, Peter Zijlstra, mingo, kvm, linux-kernel, Thomas Gleixner

> I think you are missing the following point:
> 
> "vcpu0 can be interrupted when its not in a spinlock protected section, 
> otherwise it can't."
> 
> So you _have_ to communicate to the host when the guest enters/leaves a
> critical section.


How would this work for Windows or FreeBSD?

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall\
  2017-09-25 10:41                     ` Thomas Gleixner
@ 2017-09-25 18:28                       ` Jan Kiszka
  0 siblings, 0 replies; 52+ messages in thread
From: Jan Kiszka @ 2017-09-25 18:28 UTC (permalink / raw)
  To: Thomas Gleixner, Marcelo Tosatti
  Cc: Peter Zijlstra, Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel

On 2017-09-25 12:41, Thomas Gleixner wrote:
> On Sun, 24 Sep 2017, Marcelo Tosatti wrote:
>> On Fri, Sep 22, 2017 at 03:01:41PM +0200, Peter Zijlstra wrote:
>> What the patch does is the following:
>> It reduces the window where SCHED_FIFO is applied vcpu0
>> to those were a spinlock is shared between -RT vcpus and vcpu0
>> (why: because otherwise, when the emulator thread is sharing a
>> pCPU with vcpu0, its unable to generate interrupts vcpu0).
>>
>> And its being rejected because:
>> Please fill in.
> 
> Your patch is just papering over one particular problem, but it's not
> fixing the root cause. That's the worst engineering approach and we all
> know how fast this kind of crap falls over.
> 
> There are enough other issues which can cause starvation of the RT VCPUs
> when the housekeeping VCPU is preempted, not just the particular problem
> which you observed.
> 
> Back then when I did the first prototype of RT in KVM, I made it entirely
> clear, that you have to spend one physical CPU for _each_ VCPU, independent
> whether the VCPU is reserved for RT workers or the housekeeping VCPU. The
> emulator thread needs to run on a separate physical CPU.
> 
> If you want to run the housekeeping VCPU and the emulator thread on the
> same physical CPU then you have to make sure that both the emulator and the
> housekeeper side of affairs are designed and implemented with RT in
> mind. As long as that is not the case, you simply cannot run them on the
> same physical CPU. RT is about guarantees and guarantees cannot be achieved
> with bandaid engineering.

It's even more complicated for the guest: it needs to be aware of the
latencies that its interaction with a VM - instead of a real machine -
may cause while it is inside whatever critical sections. That's an
additional design dimension that would be very hard to establish and
maintain, even in Linux.

The only way around that is to truly decouple guest CPUs via full core
isolation inside the Linux guest and have your RT guest application
exploit this partitioning, e.g. by using lock-less inter-core
communication without kernel help.

The reason I was playing with PV-sched back then was to explore how you
could map the guest's task prio dynamically onto its host vcpu. That
involved boosting whenever an event (aka irq) came in for the guest
vcpu. It turned out to be a more or less working solution looking for a
real-world problem.

Jan

-- 
Siemens AG, Corporate Technology, CT RDA ITP SES-DE
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall\
  2017-09-25 15:12                           ` Paolo Bonzini
@ 2017-09-26 22:49                             ` Marcelo Tosatti
  2017-09-27  9:37                               ` Paolo Bonzini
  0 siblings, 1 reply; 52+ messages in thread
From: Marcelo Tosatti @ 2017-09-26 22:49 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Peter Zijlstra, Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel,
	Thomas Gleixner

On Mon, Sep 25, 2017 at 05:12:42PM +0200, Paolo Bonzini wrote:
> On 25/09/2017 11:13, Peter Zijlstra wrote:
> > On Sun, Sep 24, 2017 at 11:57:53PM -0300, Marcelo Tosatti wrote:
> >> I think you are missing the following point:
> >>
> >> "vcpu0 can be interrupted when its not in a spinlock protected section, 
> >> otherwise it can't."
> 
> Who says that?  Certainly a driver can dedicate a single VCPU to
> periodic polling of the device, in such a way that the polling does not
> require a spinlock.

This sequence:


VCPU-0					VCPU-1 (running realtime workload)

takes spinlock A
scheduled out				
					spinlock(A) (busy spins until
						     VCPU-0 is scheduled
						     back in)
scheduled in
finishes execution of 
code under protected section
releases spinlock(A)			

					takes spinlock(A)

You get that point, right?

(*)

> >> So you _have_ to communicate to the host when the guest enters/leaves a
> >> critical section.
> >>
> >> So this point of "everything needs to be RT and the priorities must be
> >> designed carefully", is this: 
> >>
> >> 	WHEN in spinlock protected section (more specifically, when 
> >> 	spinlock protected section _shared with realtime vcpus_),
> >>
> >> 	priority of vcpu0 > priority of emulator thread
> >>
> >> 	OTHERWISE
> >>
> >> 	priority of vcpu0 < priority of emulator thread.
> 
> This is _not_ designed carefully, this is messy.

This is very precise to me. What is "messy" about it? (it's clearly
defined).

> The emulator thread can interrupt the VCPU thread, so it has to be at
> higher RT priority (+ priority inheritance of mutexes).  

It can only do that _when_ the VCPU thread is not running a critical
section which a higher priority task depends on.

> Once you have
> done that we can decide on other approaches that e.g. let you get more
> sharing by placing housekeeping VCPUs at SCHED_NORMAL or SCHED_RR.

Well, if someone looks at (*), he sees that if the interruption delay
(the length between "scheduled out" and "scheduled in" in that diagram)
exceeds a given threshold, then the realtime vcpu1 also exceeds its
latency threshold for processing the realtime task.

So when you say "The emulator thread can interrupt the VCPU thread", 
you're saying that it has to be modified to interrupt for a maximum
amount of time (say 15us).

Is that what you are suggesting?

> >> So emulator thread can interrupt and inject interrupts to vcpu0.
> > 
> > spinlock protected regions are not everything. What about lock-free
> > constructs where CPU's spin-wait on one another (there's plenty).
> > 
> > And I'm clearly ignorant of how this emulation thread works, but why
> > would it run for a long time? Either it is needed for forward progress
> > of the VCPU or its not. If its not, it shouldn't run.
> 
> The emulator thread 1) should not run for long period of times indeed,
> and 2) it is needed for forward progress of the VCPU.  So it has to be
> at higher RT priority.  I agree with Peter, sorry.  Spinlocks are a red
> herring here.
> 
> Paolo

Paolo, you don't control how many interruptions by the emulator thread
happen per second. So if you let the emulator thread interrupt the
VCPU thread at all times, without some kind of bound on these
interruptions per time unit, you have a similar problem to (*)
(where the realtime task is scheduled).

Another approach to the problem was suggested to OpenStack.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall
  2017-09-25  9:13                         ` Peter Zijlstra
  2017-09-25 15:12                           ` Paolo Bonzini
@ 2017-09-26 23:22                           ` Marcelo Tosatti
  1 sibling, 0 replies; 52+ messages in thread
From: Marcelo Tosatti @ 2017-09-26 23:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paolo Bonzini, Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel,
	Thomas Gleixner

On Mon, Sep 25, 2017 at 11:13:16AM +0200, Peter Zijlstra wrote:
> On Sun, Sep 24, 2017 at 11:57:53PM -0300, Marcelo Tosatti wrote:
> > I think you are missing the following point:
> > 
> > "vcpu0 can be interrupted when its not in a spinlock protected section, 
> > otherwise it can't."
> > 
> > So you _have_ to communicate to the host when the guest enters/leaves a
> > critical section.
> > 
> > So this point of "everything needs to be RT and the priorities must be
> > designed carefully", is this: 
> > 
> > 	WHEN in spinlock protected section (more specifically, when 
> > 	spinlock protected section _shared with realtime vcpus_),
> > 
> > 	priority of vcpu0 > priority of emulator thread
> > 
> > 	OTHERWISE
> > 
> > 	priority of vcpu0 < priority of emulator thread.
> > 
> > (*)
> > 
> > So emulator thread can interrupt and inject interrupts to vcpu0.
> 
> spinlock protected regions are not everything. What about lock-free
> constructs where CPU's spin-wait on one another (there's plenty).

True. Could add the "i am in a critical section" notifier to those
constructs as well, which would call the hypercall.
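
A sketch of what such a notifier pair could look like on the guest side
(the KVM_HC_RT_PRIO number and the 0/1 argument convention follow this
series' proposal; they are not an upstream ABI):

  #include <asm/kvm_para.h>

  /* Enter a critical section shared with -RT vcpus: ask the host to
   * boost this vcpu to SCHED_FIFO for the duration of the section. */
  static inline void rt_critical_enter(void)
  {
          kvm_hypercall1(KVM_HC_RT_PRIO, 1);
  }

  /* Leave the critical section: drop back to the normal priority. */
  static inline void rt_critical_exit(void)
  {
          kvm_hypercall1(KVM_HC_RT_PRIO, 0);
  }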

> And I'm clearly ignorant of how this emulation thread works, but why
> would it run for a long time? Either it is needed for forward progress
> of the VCPU or its not. If its not, it shouldn't run.

It is needed only when not in a critical section.
And when in a critical section, the vcpu should not get interrupted.

But the solution of reserving one pCPU per socket to run all emulator
threads achieves reasonable packing numbers without the downsides
of the hypercall.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch 0/3] KVM KVM_HC_RT_PRIO hypercall support
  2017-09-22  6:23     ` Jan Kiszka
@ 2017-09-26 23:59       ` Marcelo Tosatti
  0 siblings, 0 replies; 52+ messages in thread
From: Marcelo Tosatti @ 2017-09-26 23:59 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: kvm, linux-kernel

On Fri, Sep 22, 2017 at 08:23:02AM +0200, Jan Kiszka wrote:
> On 2017-09-22 03:19, Marcelo Tosatti wrote:
> > On Thu, Sep 21, 2017 at 07:45:32PM +0200, Jan Kiszka wrote:
> >> On 2017-09-21 13:38, Marcelo Tosatti wrote:
> >>> When executing guest vcpu-0 with FIFO:1 priority, which is necessary to
> >>> deal with the following situation:
> >>>
> >>> VCPU-0 (housekeeping VCPU)              VCPU-1 (realtime VCPU)
> >>>
> >>> raw_spin_lock(A)
> >>> interrupted, schedule task T-1          raw_spin_lock(A) (spin)
> >>>
> >>> raw_spin_unlock(A)
> >>>
> >>> Certain operations must interrupt guest vcpu-0 (see trace below).
> >>>
> >>> To fix this issue, only change guest vcpu-0 to FIFO priority
> >>> on spinlock critical sections (see patch).
> >>>
> >>> Hang trace
> >>> ==========
> >>>
> >>> Without FIFO priority:
> >>>
> >>> qemu-kvm-6705  [002] ....1.. 767785.648964: kvm_exit: reason IO_INSTRUCTION rip 0xe8fe info 1f00039 0
> >>> qemu-kvm-6705  [002] ....1.. 767785.648965: kvm_exit: reason IO_INSTRUCTION rip 0xe911 info 3f60008 0
> >>> qemu-kvm-6705  [002] ....1.. 767785.648968: kvm_exit: reason IO_INSTRUCTION rip 0x8984 info 608000b 0
> >>> qemu-kvm-6705  [002] ....1.. 767785.648971: kvm_exit: reason IO_INSTRUCTION rip 0xb313 info 1f70008 0
> >>> qemu-kvm-6705  [002] ....1.. 767785.648974: kvm_exit: reason IO_INSTRUCTION rip 0xb514 info 3f60000 0
> >>> qemu-kvm-6705  [002] ....1.. 767785.648977: kvm_exit: reason PENDING_INTERRUPT rip 0x8052 info 0 0
> >>> qemu-kvm-6705  [002] ....1.. 767785.648980: kvm_exit: reason IO_INSTRUCTION rip 0xeee6 info 200040 0
> >>> qemu-kvm-6705  [002] ....1.. 767785.648999: kvm_exit: reason EPT_MISCONFIG rip 0x2120 info 0 0
> >>>
> >>> With FIFO priority:
> >>>
> >>> qemu-kvm-7636  [002] ....1.. 768218.205065: kvm_exit: reason IO_INSTRUCTION rip 0xb313 info 1f70008 0
> >>> qemu-kvm-7636  [002] ....1.. 768218.205068: kvm_exit: reason IO_INSTRUCTION rip 0x8984 info 608000b 0
> >>> qemu-kvm-7636  [002] ....1.. 768218.205071: kvm_exit: reason IO_INSTRUCTION rip 0xb313 info 1f70008 0
> >>> qemu-kvm-7636  [002] ....1.. 768218.205074: kvm_exit: reason IO_INSTRUCTION rip 0x8984 info 608000b 0
> >>> qemu-kvm-7636  [002] ....1.. 768218.205077: kvm_exit: reason IO_INSTRUCTION rip 0xb313 info 1f70008 0
> >>> ..
> >>>
> >>> Performance numbers (kernel compilation with make -j2)
> >>> ======================================================
> >>>
> >>> With hypercall: 4:40.  (make -j2)
> >>> Without hypercall: 3:38.  (make -j2)
> >>>
> >>> Note for NFV workloads spinlock performance is not relevant
> >>> since DPDK should not enter the kernel (and housekeeping vcpu
> >>> performance is far from a key factor).
> >>>
> >>> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> >>>
> >>
> >> That sounds familiar, though not yet the same: :)
> >>
> >> http://git.kiszka.org/?p=linux-kvm.git;a=shortlog;h=refs/heads/queues/paravirt-sched
> >> (paper: http://lwn.net/images/conf/rtlws11/papers/proc/p18.pdf)
> >>
> >> I suppose your goal is not to enable the host to follow the guest
> >> scheduler priority completely but only to have prio-ceiling for such
> >> short critical sections. Maybe still useful to think ahead about future
> >> extensions when actually introducing such an interface.
> > 
> > Hi Jan!
> > 
> > Hum... I'll take a look at your interface/paper and get back to you.
> > 
> >> But shouldn't there be some limits on the maximum prio the guest can select?
> > 
> > The SCHED_FIFO prio is fixed, selectable when QEMU starts. Do you
> > envision any other use case than a fixed priority value selectable
> > at QEMU initialization?
> 
> Oh, indeed, this is a pure prio-ceiling variant with host-defined
> ceiling value.
> 
> But it's very inefficient to use a hypercall for entering and leaving
> each and every sections. I would strongly recommend using a lazy scheme
> where the guest writes the desired state into a shared memory page, and
> the host only evaluates that prior to taking a scheduling decision, or
> at least only on real vmexits. We're using such scheme successfully to
> accelerate the fast path of prio-ceiling for pthread mutexes in the
> Xenomai real-time extension.

Yes, a faster scheme was envisioned, but not developed.
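
For the record, the guest side of such a lazy scheme would look roughly
like the sketch below (entirely hypothetical, nothing like it exists in
this series; the shared page is assumed to have been registered with the
host at boot, e.g. via an MSR, and the host would sample the flag on
vmexit instead of taking a hypercall per section):

  #include <linux/types.h>
  #include <linux/compiler.h>

  struct rt_prio_shared {
          u32 in_critical_section;        /* guest writes, host reads */
  };

  static struct rt_prio_shared *rt_prio_page;  /* assumed registered at boot */

  static inline void rt_critical_enter_lazy(void)
  {
          WRITE_ONCE(rt_prio_page->in_critical_section, 1);
  }

  static inline void rt_critical_exit_lazy(void)
  {
          WRITE_ONCE(rt_prio_page->in_critical_section, 0);
  }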

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall\
  2017-09-26 22:49                             ` [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall\ Marcelo Tosatti
@ 2017-09-27  9:37                               ` Paolo Bonzini
  2017-09-28  0:44                                 ` Marcelo Tosatti
  0 siblings, 1 reply; 52+ messages in thread
From: Paolo Bonzini @ 2017-09-27  9:37 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Peter Zijlstra, Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel,
	Thomas Gleixner

On 27/09/2017 00:49, Marcelo Tosatti wrote:
> On Mon, Sep 25, 2017 at 05:12:42PM +0200, Paolo Bonzini wrote:
>> On 25/09/2017 11:13, Peter Zijlstra wrote:
>>> On Sun, Sep 24, 2017 at 11:57:53PM -0300, Marcelo Tosatti wrote:
>>>> I think you are missing the following point:
>>>>
>>>> "vcpu0 can be interrupted when its not in a spinlock protected section, 
>>>> otherwise it can't."
>>
>> Who says that?  Certainly a driver can dedicate a single VCPU to
>> periodic polling of the device, in such a way that the polling does not
>> require a spinlock.
> 
> This sequence:
> 
> 
> VCPU-0					VCPU-1 (running realtime workload)
> 
> takes spinlock A
> scheduled out				
> 					spinlock(A) (busy spins until
> 						     VCPU-0 is scheduled
> 						     back in)
> scheduled in
> finishes execution of 
> code under protected section
> releases spinlock(A)			
> 
> 					takes spinlock(A)

Yes, but then you have

                                        busy waits for flag to be set
  polls device
  scheduled out
  device receives data
                                        ...
  scheduled in
  set flag

or

  check for work to do
  scheduled out
                                        submit work
                                        busy waits for work to complete
  scheduled in
  do work

None of which have anything to do with spinlocks.  They're just
different ways to get priority inversion, and this...

>>>> So this point of "everything needs to be RT and the priorities must be
>>>> designed carefully", is this: 
>>>>
>>>> 	WHEN in spinlock protected section (more specifically, when 
>>>> 	spinlock protected section _shared with realtime vcpus_),
>>>>
>>>> 	priority of vcpu0 > priority of emulator thread
>>>>
>>>> 	OTHERWISE
>>>>
>>>> 	priority of vcpu0 < priority of emulator thread.
>>
>> This is _not_ designed carefully, this is messy.
> 
> This is very precise to me. What is "messy" about it? (its clearly
> defined).

... it's not how you design RT systems to avoid priority inversion.
It's just solving an instance of the issue.

>> The emulator thread can interrupt the VCPU thread, so it has to be at
>> higher RT priority (+ priority inheritance of mutexes).  
> 
> It can only do that _when_ the VCPU thread is not running a critical
> section which a higher priority task depends on.

All VCPUs must have the same priority.  There's no such thing as a
housekeeping VCPU.

There could be one sacrificial VCPU that you place on the same physical
CPU as the emulator thread, but that's it.  The whole system must run
smoothly even if you place those on the same physical CPU, without any
PV hacks.

> So when you say "The emulator thread can interrupt the VCPU thread", 
> you're saying that it has to be modified to interrupt for a maximum
> amount of time (say 15us).
> 
> Is that what you are suggesting?

This is correct.  But I think you are missing a fundamental point that
is not specific to virt.  If thread 1 can trigger an event in thread 2,
and thread 2 runs at priority N, thread 1 must be placed at priority >N.

In this case, the emulator thread can signal I/O completion to the VCPU:
it doesn't matter if the VCPU is polling or using interrupts, the
emulator thread must be placed at higher priority than the VCPU.  This
is the root cause of the SeaBIOS issue that you were seeing.  Yes, it
was me who suggested moving VCPU0 and the emulator thread to
SCHED_NORMAL, but that was just a bandaid until the real fix was done
(which is to set up SCHED_FIFO priorities correctly and use PI mutexes).
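
For completeness, the "PI mutexes" part is plain pthread API; a minimal
sketch (illustrative only, not actual QEMU code):

  #include <pthread.h>

  static pthread_mutex_t big_lock;

  /* With PTHREAD_PRIO_INHERIT, a FIFO:2 emulator thread blocked on the
   * mutex while a FIFO:1 VCPU thread holds it temporarily boosts the
   * VCPU thread to FIFO:2, so the inversion window stays bounded. */
  static void init_pi_mutex(void)
  {
          pthread_mutexattr_t attr;

          pthread_mutexattr_init(&attr);
          pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
          pthread_mutex_init(&big_lock, &attr);
          pthread_mutexattr_destroy(&attr);
  }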

In fact, this is not specific to the emulator thread.  It applies just
as well to vhost threads, to QEMU iothreads, even to the KVM PIT
emulation thread if the guest were using it.

> Paolo, you don't control how many interruptions of the emulator thread
> happen per second.

Indeed, these non-VCPU threads should run rarely and only for a small
amount of time, to achieve bounded latencies in the VCPUs.  But if
they don't, those are simply bugs and we fix them.  In fact, sources of
frequent interruptions have all been fixed or moved outside the main
thread; for example, disks can use separate iothreads.  Configuration
problems are also bugs, just in a different place.

For a very basic VM that I've just tried (whatever QEMU does by default,
only tweak was not using the QEMU GUI), there is exactly one
interruption per second when the VM is idle.  That interruption in turn
is caused by udev periodically probing the guest CDROM (!).  So that's a
configuration problem: remove the CDROM, and it does one interruption
every 10-30 seconds, again all of them guest triggered.

Again: if you have many interruptions, it's not a flaw in KVM or QEMU's
design, it's just that someone is doing something stupid.  It could be
the guest (e.g. unnecessary devices or daemons as in the example above),
QEMU (e.g. the RTC emulation used to trigger QEMU timers twice a second
just to increment the clock), or the management (e.g. polling "is the VM
running" 50 times per second).  But it can and must be fixed.

The design should be the basic one: attribute the right priority to each
thread and things just work.

Paolo

> So if you let the emulator thread interrupt the
> emulator thread at all times, without some kind of bounding 
> of these interruptions per time unit, you have a similar
> problem as (*) (where the realtime task is scheduled).
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall\
  2017-09-27  9:37                               ` Paolo Bonzini
@ 2017-09-28  0:44                                 ` Marcelo Tosatti
  2017-09-28  7:22                                   ` Paolo Bonzini
  0 siblings, 1 reply; 52+ messages in thread
From: Marcelo Tosatti @ 2017-09-28  0:44 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Peter Zijlstra, Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel,
	Thomas Gleixner

On Wed, Sep 27, 2017 at 11:37:48AM +0200, Paolo Bonzini wrote:
> On 27/09/2017 00:49, Marcelo Tosatti wrote:
> > On Mon, Sep 25, 2017 at 05:12:42PM +0200, Paolo Bonzini wrote:
> >> On 25/09/2017 11:13, Peter Zijlstra wrote:
> >>> On Sun, Sep 24, 2017 at 11:57:53PM -0300, Marcelo Tosatti wrote:
> >>>> I think you are missing the following point:
> >>>>
> >>>> "vcpu0 can be interrupted when its not in a spinlock protected section, 
> >>>> otherwise it can't."
> >>
> >> Who says that?  Certainly a driver can dedicate a single VCPU to
> >> periodic polling of the device, in such a way that the polling does not
> >> require a spinlock.
> > 
> > This sequence:
> > 
> > 
> > VCPU-0					VCPU-1 (running realtime workload)
> > 
> > takes spinlock A
> > scheduled out				
> > 					spinlock(A) (busy spins until
> > 						     VCPU-0 is scheduled
> > 						     back in)
> > scheduled in
> > finishes execution of 
> > code under protected section
> > releases spinlock(A)			
> > 
> > 					takes spinlock(A)
> 
> Yes, but then you have
> 
>                                         busy waits for flag to be set
>   polls device
>   scheduled out
>   device receives data
>                                         ...
>   scheduled in
>   set flag
> 
> or
> 
>   check for work to do
>   scheduled out
>                                         submit work
>                                         busy waits for work to complete
>   scheduled in
>   do work
> 
> None of which have anything to do with spinlocks.  They're just
> different ways to get priority inversion, and this...
> 
> >>>> So this point of "everything needs to be RT and the priorities must be
> >>>> designed carefully", is this: 
> >>>>
> >>>> 	WHEN in spinlock protected section (more specifically, when 
> >>>> 	spinlock protected section _shared with realtime vcpus_),
> >>>>
> >>>> 	priority of vcpu0 > priority of emulator thread
> >>>>
> >>>> 	OTHERWISE
> >>>>
> >>>> 	priority of vcpu0 < priority of emulator thread.
> >>
> >> This is _not_ designed carefully, this is messy.
> > 
> > This is very precise to me. What is "messy" about it? (its clearly
> > defined).
> 
> ... it's not how you design RT systems to avoid priority inversion.
> It's just solving an instance of the issue.
> 
> >> The emulator thread can interrupt the VCPU thread, so it has to be at
> >> higher RT priority (+ priority inheritance of mutexes).  
> > 
> > It can only do that _when_ the VCPU thread is not running a critical
> > section which a higher priority task depends on.
> 
> All VCPUs must have the same priority.  There's no such thing as a
> housekeeping VCPU.
> 
> There could be one sacrificial VCPU that you place on the same physical
> CPU as the emulator thread, but that's it.  The whole system must run
> smoothly even if you place those on the same physical CPU, without any
> PV hacks.
> 
> > So when you say "The emulator thread can interrupt the VCPU thread", 
> > you're saying that it has to be modified to interrupt for a maximum
> > amount of time (say 15us).
> > 
> > Is that what you are suggesting?
> 
> This is correct.  But I think you are missing a fundamental point, that
> is not specific to virt.  If a thread 1 can trigger an event in thread
> 2, and thread 2 runs at priority N, thread 1 must be placed at priority >N.
> 
> In this case, the emulator thread can signal I/O completion to the VCPU:
> it doesn't matter if the VCPU is polling or using interrupts, the
> emulator thread must be placed at higher priority than the VCPU.  This
> is the root cause of the SeaBIOS issue that you were seeing.  Yes, it
> was me who suggested moving VCPU0 and the emulator thread to
> SCHED_NORMAL, but that was just a bandaid until the real fix was done
> (which is to set up SCHED_FIFO priorities correctly and use PI mutexes).
> 
> In fact, this is not specific to the emulator thread.  It applies just
> as well to vhost threads, to QEMU iothreads, even to the KVM PIT
> emulation thread if the guest were using it.
> 
> > Paolo, you don't control how many interruptions of the emulator thread
> > happen per second.
> 
> Indeed these non-VCPU threads should indeed run rarely and for a small
> amount of time only to achieve bounded latencies in the VCPUs.  But if
> they don't, those are simply bugs and we fix them.  In fact, sources of
> frequent interruptions have all been fixed or moved outside the main
> thread; for example, disks can use separate iothreads.  Configuration
> problems are also bugs, just in a different place.


> 
> For a very basic VM that I've just tried (whatever QEMU does by default,
> only tweak was not using the QEMU GUI), there is exactly one
> interruption per second when the VM is idle.  That interruption in turn
> is caused by udev periodically probing the guest CDROM (!).  So that's a
> configuration problem: remove the CDROM, and it does one interruption
> every 10-30 seconds, again all of them guest triggered.
> 
> Again: if you have many interruptions, it's not a flaw in KVM or QEMU's
> design, it's just that someone is doing something stupid.  It could be
> the guest (e.g. unnecessary devices or daemons as in the example above),
> QEMU (e.g. the RTC emulation used to trigger QEMU timers twice a second
> just to increment the clock), or the management (e.g. polling "is the VM
> running" 50 times per second).  But it can and must be fixed.

No, I mean you can run anything on VCPU-0 (it is valid to do that).
And that "anything" can generate 1, 1,000 or 10,000 interrupts per
second, all of which are valid things to do.

"I can't run a kernel compilation on VCPU-0 because that will impact
latency on the realtime VCPU-1" is not acceptable.

> The design should be the basic one: attribute the right priority to each
> thread and things just work.
> 
> Paolo
> 
> > So if you let the emulator thread interrupt the
> > emulator thread at all times, without some kind of bounding 
> > of these interruptions per time unit, you have a similar
> > problem as (*) (where the realtime task is scheduled).
> > 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall\
  2017-09-28  0:44                                 ` Marcelo Tosatti
@ 2017-09-28  7:22                                   ` Paolo Bonzini
  2017-09-28 21:35                                     ` Marcelo Tosatti
  0 siblings, 1 reply; 52+ messages in thread
From: Paolo Bonzini @ 2017-09-28  7:22 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Peter Zijlstra, Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel,
	Thomas Gleixner

On 28/09/2017 02:44, Marcelo Tosatti wrote:
>> Again: if you have many interruptions, it's not a flaw in KVM or QEMU's
>> design, it's just that someone is doing something stupid.  It could be
>> the guest (e.g. unnecessary devices or daemons as in the example above),
>> QEMU (e.g. the RTC emulation used to trigger QEMU timers twice a second
>> just to increment the clock), or the management (e.g. polling "is the VM
>> running" 50 times per second).  But it can and must be fixed.
>
> No, i mean you can run anything in VCPU-0 (it is valid to do that).
> And that "anything" can generate 1 interrupt per second, 1000 or 10.000
> interrupts per second. Which are all valid things to be done.
> 
> "I can't run a kernel compilation on VCPU-0 because that will impact
> latency on the realtime VCPU-1" is not acceptable.

That shouldn't happen.  Sources of frequent interruptions have all been
fixed or moved outside the main thread.

If there are more left, report the bug and we'll see how to fix it in
userspace.

Paolo

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall\
  2017-09-28  7:22                                   ` Paolo Bonzini
@ 2017-09-28 21:35                                     ` Marcelo Tosatti
  2017-09-28 21:41                                       ` Marcelo Tosatti
  2017-09-29  8:18                                       ` Paolo Bonzini
  0 siblings, 2 replies; 52+ messages in thread
From: Marcelo Tosatti @ 2017-09-28 21:35 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Peter Zijlstra, Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel,
	Thomas Gleixner

On Thu, Sep 28, 2017 at 09:22:02AM +0200, Paolo Bonzini wrote:
> On 28/09/2017 02:44, Marcelo Tosatti wrote:
> >> Again: if you have many interruptions, it's not a flaw in KVM or QEMU's
> >> design, it's just that someone is doing something stupid.  It could be
> >> the guest (e.g. unnecessary devices or daemons as in the example above),
> >> QEMU (e.g. the RTC emulation used to trigger QEMU timers twice a second
> >> just to increment the clock), or the management (e.g. polling "is the VM
> >> running" 50 times per second).  But it can and must be fixed.
> >
> > No, i mean you can run anything in VCPU-0 (it is valid to do that).
> > And that "anything" can generate 1 interrupt per second, 1000 or 10.000
> > interrupts per second. Which are all valid things to be done.
> > 
> > "I can't run a kernel compilation on VCPU-0 because that will impact
> > latency on the realtime VCPU-1" is not acceptable.
> 
> That shouldn't happen.  Sources of frequent interruptions have all been
> fixed or moved outside the main thread.
> 
> If there are more left, report the bug and we'll see how to fix it in
> userspace.
> 
> Paolo

What should not happen? The generation of 10,000 interrupts per second
(say, disk IO completions) on a given workload?

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall\
  2017-09-28 21:35                                     ` Marcelo Tosatti
@ 2017-09-28 21:41                                       ` Marcelo Tosatti
  2017-09-29  8:18                                       ` Paolo Bonzini
  1 sibling, 0 replies; 52+ messages in thread
From: Marcelo Tosatti @ 2017-09-28 21:41 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Peter Zijlstra, Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel,
	Thomas Gleixner

On Thu, Sep 28, 2017 at 06:35:08PM -0300, Marcelo Tosatti wrote:
> On Thu, Sep 28, 2017 at 09:22:02AM +0200, Paolo Bonzini wrote:
> > On 28/09/2017 02:44, Marcelo Tosatti wrote:
> > >> Again: if you have many interruptions, it's not a flaw in KVM or QEMU's
> > >> design, it's just that someone is doing something stupid.  It could be
> > >> the guest (e.g. unnecessary devices or daemons as in the example above),
> > >> QEMU (e.g. the RTC emulation used to trigger QEMU timers twice a second
> > >> just to increment the clock), or the management (e.g. polling "is the VM
> > >> running" 50 times per second).  But it can and must be fixed.
> > >
> > > No, i mean you can run anything in VCPU-0 (it is valid to do that).
> > > And that "anything" can generate 1 interrupt per second, 1000 or 10.000
> > > interrupts per second. Which are all valid things to be done.
> > > 
> > > "I can't run a kernel compilation on VCPU-0 because that will impact
> > > latency on the realtime VCPU-1" is not acceptable.
> > 
> > That shouldn't happen.  Sources of frequent interruptions have all been
> > fixed or moved outside the main thread.
> > 
> > If there are more left, report the bug and we'll see how to fix it in
> > userspace.
> > 
> > Paolo
> 
> What should not happen? The generation of 10.000 interrupts per second
> (say disk IO completion) on a given workload ?

Are you suggesting that workloads on vcpu-0 should be limited in the number
of interrupts (and the duration of each interruption), so that realtime vcpu-1's
latency requirement is met?

I don't see how that suggestion can work because even if you make each
exit small, the frequency of them will cause a latency violation on
vcpu-1.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall\
  2017-09-28 21:35                                     ` Marcelo Tosatti
  2017-09-28 21:41                                       ` Marcelo Tosatti
@ 2017-09-29  8:18                                       ` Paolo Bonzini
  2017-09-29 16:40                                         ` Marcelo Tosatti
  1 sibling, 1 reply; 52+ messages in thread
From: Paolo Bonzini @ 2017-09-29  8:18 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Peter Zijlstra, Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel,
	Thomas Gleixner

On 28/09/2017 23:35, Marcelo Tosatti wrote:
> On Thu, Sep 28, 2017 at 09:22:02AM +0200, Paolo Bonzini wrote:
>> On 28/09/2017 02:44, Marcelo Tosatti wrote:
>>>> Again: if you have many interruptions, it's not a flaw in KVM or QEMU's
>>>> design, it's just that someone is doing something stupid.  It could be
>>>> the guest (e.g. unnecessary devices or daemons as in the example above),
>>>> QEMU (e.g. the RTC emulation used to trigger QEMU timers twice a second
>>>> just to increment the clock), or the management (e.g. polling "is the VM
>>>> running" 50 times per second).  But it can and must be fixed.
>>>
>>> No, i mean you can run anything in VCPU-0 (it is valid to do that).
>>> And that "anything" can generate 1 interrupt per second, 1000 or 10.000
>>> interrupts per second. Which are all valid things to be done.
>>>
>>> "I can't run a kernel compilation on VCPU-0 because that will impact
>>> latency on the realtime VCPU-1" is not acceptable.
>>
>> That shouldn't happen.  Sources of frequent interruptions have all been
>> fixed or moved outside the main thread.
>>
>> If there are more left, report the bug and we'll see how to fix it in
>> userspace.
> 
> What should not happen? The generation of 10.000 interrupts per second
> (say disk IO completion) on a given workload ?

If you know you have this kind of disk workload, you must use virtio-blk or
virtio-scsi with iothreads and place the iothreads on their own physical
CPUs.
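
For instance, an illustrative fragment of such a setup (file and id names
made up):

  -object iothread,id=iothread0 \
  -drive file=disk.img,if=none,id=drive0,format=raw \
  -device virtio-blk-pci,drive=drive0,iothread=iothread0

and then pin the iothread's host thread to a dedicated physical CPU,
e.g. with taskset on its TID or libvirt's <iothreadpin> element.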

Among "run arbitrary workloads", "run real-time workloads", "pack stuff
into as few physical CPUs as possible", you can only pick two.

Paolo

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall\
  2017-09-29  8:18                                       ` Paolo Bonzini
@ 2017-09-29 16:40                                         ` Marcelo Tosatti
  2017-09-29 17:05                                           ` Paolo Bonzini
  0 siblings, 1 reply; 52+ messages in thread
From: Marcelo Tosatti @ 2017-09-29 16:40 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Peter Zijlstra, Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel,
	Thomas Gleixner

On Fri, Sep 29, 2017 at 10:18:25AM +0200, Paolo Bonzini wrote:
> On 28/09/2017 23:35, Marcelo Tosatti wrote:
> > On Thu, Sep 28, 2017 at 09:22:02AM +0200, Paolo Bonzini wrote:
> >> On 28/09/2017 02:44, Marcelo Tosatti wrote:
> >>>> Again: if you have many interruptions, it's not a flaw in KVM or QEMU's
> >>>> design, it's just that someone is doing something stupid.  It could be
> >>>> the guest (e.g. unnecessary devices or daemons as in the example above),
> >>>> QEMU (e.g. the RTC emulation used to trigger QEMU timers twice a second
> >>>> just to increment the clock), or the management (e.g. polling "is the VM
> >>>> running" 50 times per second).  But it can and must be fixed.
> >>>
> >>> No, i mean you can run anything in VCPU-0 (it is valid to do that).
> >>> And that "anything" can generate 1 interrupt per second, 1000 or 10.000
> >>> interrupts per second. Which are all valid things to be done.
> >>>
> >>> "I can't run a kernel compilation on VCPU-0 because that will impact
> >>> latency on the realtime VCPU-1" is not acceptable.
> >>
> >> That shouldn't happen.  Sources of frequent interruptions have all been
> >> fixed or moved outside the main thread.
> >>
> >> If there are more left, report the bug and we'll see how to fix it in
> >> userspace.
> > 
> > What should not happen? The generation of 10.000 interrupts per second
> > (say disk IO completion) on a given workload ?
> 
> If you know you have this kind disk workload, you must use virtio-blk or
> virtio-scsi with iothreads and place the iothreads on their own physical
> CPUs.
> 
> Among "run arbitrary workloads", "run real-time workloads", "pack stuff
> into as few physical CPUs as possible", you can only pick two.
> 
> Paolo

That's not the state of things (userspace on vcpu-0 is not specially tailored
to avoid violating latencies on vcpu-1): that is, not all user-triggered
actions can be verified.

Think "updatedb", and so on...

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall\
  2017-09-29 16:40                                         ` Marcelo Tosatti
@ 2017-09-29 17:05                                           ` Paolo Bonzini
  2017-09-29 20:17                                             ` Marcelo Tosatti
  0 siblings, 1 reply; 52+ messages in thread
From: Paolo Bonzini @ 2017-09-29 17:05 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Peter Zijlstra, Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel,
	Thomas Gleixner

On 29/09/2017 18:40, Marcelo Tosatti wrote:
>> If you know you have this kind disk workload, you must use virtio-blk or
>> virtio-scsi with iothreads and place the iothreads on their own physical
>> CPUs.
>>
>> Among "run arbitrary workloads", "run real-time workloads", "pack stuff
>> into as few physical CPUs as possible", you can only pick two.
>
> Thats not the state of things (userspace in vcpu-0 is not specially tailored
> to not violate latencies in vcpu-1): that is not all user triggered
> actions can be verified.
> 
> Think "updatedb", and so on...

_Which_ spinlock is it that can cause unwanted latency while running
updatedb on VCPU0 and a real-time workload on VCPU1, and only so on virt
because of the emulator thread?  Is this still broken if you set up
priorities for the emulator thread correctly and use PI mutexes in QEMU?
 And if so, what is the cause of interruptions in the emulator thread
and how are these interruptions causing the jitter?

Priorities and priority inheritance (or the lack of them) are a _known_
issue.  Jan was doing his KVM-RT things in 2009 and he was talking about
priorities[1] back then.  The effect of correct priorities is to _lower_
jitter, not to make it worse, and anyway certainly not worse than a
SCHED_NORMAL I/O thread.  Once that's fixed, we can look at other problems.

Paolo

[1] http://static.lwn.net/images/conf/rtlws11/papers/proc/p18.pdf which
also mentions pv scheduling

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall\
  2017-09-29 17:05                                           ` Paolo Bonzini
@ 2017-09-29 20:17                                             ` Marcelo Tosatti
  2017-10-02 12:30                                               ` Paolo Bonzini
  0 siblings, 1 reply; 52+ messages in thread
From: Marcelo Tosatti @ 2017-09-29 20:17 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Peter Zijlstra, Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel,
	Thomas Gleixner

On Fri, Sep 29, 2017 at 07:05:41PM +0200, Paolo Bonzini wrote:
> On 29/09/2017 18:40, Marcelo Tosatti wrote:
> >> If you know you have this kind disk workload, you must use virtio-blk or
> >> virtio-scsi with iothreads and place the iothreads on their own physical
> >> CPUs.
> >>
> >> Among "run arbitrary workloads", "run real-time workloads", "pack stuff
> >> into as few physical CPUs as possible", you can only pick two.
> >
> > Thats not the state of things (userspace in vcpu-0 is not specially tailored
> > to not violate latencies in vcpu-1): that is not all user triggered
> > actions can be verified.
> > 
> > Think "updatedb", and so on...
> 
> _Which_ spinlock is it that can cause unwanted latency while running
> updatedb on VCPU0 and a real-time workload on VCPU1, and only so on virt
> because of the emulator thread?

Hundreds of them (the one being hit is in timer_interrupt), but I went
to check and there are hundreds of raw spinlocks shared between the
kernel threads that run on isolated CPUs and vcpu-0.

>  Is this still broken if you set up
> priorities for the emulator thread correctly and use PI mutexes in QEMU?

I don't see why it would not, if you have to schedule the emulator
thread to process and inject I/O interrupts for example.

>  And if so, what is the cause of interruptions in the emulator thread
> and how are these interruptions causing the jitter?

Interrupt injections.

> Priorities and priority inheritance (or lack of them) is a _known_
> issue.  Jan was doing his KVM-RT things in 2009 and he was talking about
> priorities[1] back then.  The effect of correct priorities is to _lower_
> jitter, not to make it worse, and anyway certainly not worse than
> SCHED_NORMAL I/O thread.  Once that's fixed, we can look at other problems.
> 
> Paolo
> 
> [1] http://static.lwn.net/images/conf/rtlws11/papers/proc/p18.pdf which
> also mentions pv scheduling

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall\
  2017-09-29 20:17                                             ` Marcelo Tosatti
@ 2017-10-02 12:30                                               ` Paolo Bonzini
  2017-10-02 12:48                                                 ` Peter Zijlstra
  0 siblings, 1 reply; 52+ messages in thread
From: Paolo Bonzini @ 2017-10-02 12:30 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Peter Zijlstra, Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel,
	Thomas Gleixner

On 29/09/2017 22:17, Marcelo Tosatti wrote:
> On Fri, Sep 29, 2017 at 07:05:41PM +0200, Paolo Bonzini wrote:
>> On 29/09/2017 18:40, Marcelo Tosatti wrote:
>>> Thats not the state of things (userspace in vcpu-0 is not specially tailored
>>> to not violate latencies in vcpu-1): that is not all user triggered
>>> actions can be verified.
>>>
>>> Think "updatedb", and so on...
>>
>> _Which_ spinlock is it that can cause unwanted latency while running
>> updatedb on VCPU0 and a real-time workload on VCPU1, and only so on virt
>> because of the emulator thread?
> 
> Hundreds of them (the one being hit is in timer_interrupt), but i went 
> to check and there are hundreds of raw spinlocks shared between the
> kernel threads that run on isolated CPUs and vcpu-0.
> 
>>  Is this still broken if you set up
>> priorities for the emulator thread correctly and use PI mutexes in QEMU?
> 
> I don't see why it would not, if you have to schedule the emulator
> thread to process and inject I/O interrupts for example.

Yes, you're right if it's interrupt injections.  If it's unexpected disk
accesses, you can just add a QEMU I/O thread on a different physical
CPU.  The same physical CPU can host I/O threads for different guests if
you expect them to do little.

I don't understand why it is correct to delay interrupt injection just
because VCPU0 is running in a spinlock-protected region.  I just cannot
see why it's safe and not a recipe for priority inversions.

Paolo

>>  And if so, what is the cause of interruptions in the emulator thread
>> and how are these interruptions causing the jitter?
> 
> Interrupt injections.
> 
>> Priorities and priority inheritance (or lack of them) is a _known_
>> issue.  Jan was doing his KVM-RT things in 2009 and he was talking about
>> priorities[1] back then.  The effect of correct priorities is to _lower_
>> jitter, not to make it worse, and anyway certainly not worse than
>> SCHED_NORMAL I/O thread.  Once that's fixed, we can look at other problems.
>>
>> Paolo
>>
>> [1] http://static.lwn.net/images/conf/rtlws11/papers/proc/p18.pdf which
>> also mentions pv scheduling

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall\
  2017-10-02 12:30                                               ` Paolo Bonzini
@ 2017-10-02 12:48                                                 ` Peter Zijlstra
  0 siblings, 0 replies; 52+ messages in thread
From: Peter Zijlstra @ 2017-10-02 12:48 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marcelo Tosatti, Konrad Rzeszutek Wilk, mingo, kvm, linux-kernel,
	Thomas Gleixner

On Mon, Oct 02, 2017 at 02:30:33PM +0200, Paolo Bonzini wrote:
> I don't understand why is it correct to delay interrupt injection just
> because VCPU0 is running in a spinlock-protected region?  I just cannot
> see the reason why it's safe and not a recipe for priority inversions.

It is indeed not right. Something like:

	raw_spin_lock(&some_lock);

	/* do crud */

	raw_spin_unlock(&some_lock);

Should not hold off the interrupt that tells you your finger is in
imminent danger of becoming detached. Only when we do
local_irq_disable() (ie. raw_spin_lock_irq*() and the like) should we
avoid interrupt delivery.
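
For contrast, the irq-disabling form, where holding off delivery until
the unlock is the intended behaviour, is:

	raw_spin_lock_irq(&some_lock);

	/* do crud with interrupts masked; delivery is deferred until unlock */

	raw_spin_unlock_irq(&some_lock);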

This whole fixation on spinlock regions is misguided and must stop; it's
wrong on all levels.

^ permalink raw reply	[flat|nested] 52+ messages in thread


Thread overview: 52+ messages
2017-09-21 11:38 [patch 0/3] KVM KVM_HC_RT_PRIO hypercall support Marcelo Tosatti
2017-09-21 11:38 ` [patch 1/3] KVM: x86: add per-vcpu option to set guest vcpu -RT priority Marcelo Tosatti
2017-09-21 11:38 ` [patch 2/3] KVM: x86: KVM_HC_RT_PRIO hypercall (host-side) Marcelo Tosatti
2017-09-21 13:32   ` Konrad Rzeszutek Wilk
2017-09-21 13:49     ` Paolo Bonzini
2017-09-22  1:08       ` Marcelo Tosatti
2017-09-22  7:23         ` Paolo Bonzini
2017-09-22 12:24           ` Marcelo Tosatti
2017-09-21 11:38 ` [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall Marcelo Tosatti
2017-09-21 13:36   ` Konrad Rzeszutek Wilk
2017-09-21 14:06     ` Peter Zijlstra
2017-09-22  1:10       ` Marcelo Tosatti
2017-09-22 10:00         ` Peter Zijlstra
2017-09-22 10:56           ` Peter Zijlstra
2017-09-22 12:33             ` Marcelo Tosatti
2017-09-22 12:55               ` Peter Zijlstra
2017-09-23 10:56                 ` Paolo Bonzini
2017-09-23 13:41                   ` Peter Zijlstra
2017-09-24 13:05                     ` Paolo Bonzini
2017-09-25  2:57                       ` Marcelo Tosatti
2017-09-25  9:13                         ` Peter Zijlstra
2017-09-25 15:12                           ` Paolo Bonzini
2017-09-26 22:49                             ` [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall\ Marcelo Tosatti
2017-09-27  9:37                               ` Paolo Bonzini
2017-09-28  0:44                                 ` Marcelo Tosatti
2017-09-28  7:22                                   ` Paolo Bonzini
2017-09-28 21:35                                     ` Marcelo Tosatti
2017-09-28 21:41                                       ` Marcelo Tosatti
2017-09-29  8:18                                       ` Paolo Bonzini
2017-09-29 16:40                                         ` Marcelo Tosatti
2017-09-29 17:05                                           ` Paolo Bonzini
2017-09-29 20:17                                             ` Marcelo Tosatti
2017-10-02 12:30                                               ` Paolo Bonzini
2017-10-02 12:48                                                 ` Peter Zijlstra
2017-09-26 23:22                           ` [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall Marcelo Tosatti
2017-09-25 16:20                         ` Konrad Rzeszutek Wilk
2017-09-22 12:16           ` Marcelo Tosatti
2017-09-22 12:31             ` Peter Zijlstra
2017-09-22 12:36               ` Marcelo Tosatti
2017-09-22 12:59                 ` Peter Zijlstra
2017-09-25  1:52                   ` Marcelo Tosatti
2017-09-25  8:35                     ` Peter Zijlstra
2017-09-22 12:40               ` [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall\ Marcelo Tosatti
2017-09-22 13:01                 ` Peter Zijlstra
2017-09-25  2:22                   ` Marcelo Tosatti
2017-09-25  8:58                     ` Peter Zijlstra
2017-09-25 10:41                     ` Thomas Gleixner
2017-09-25 18:28                       ` Jan Kiszka
2017-09-21 17:45 ` [patch 0/3] KVM KVM_HC_RT_PRIO hypercall support Jan Kiszka
2017-09-22  1:19   ` Marcelo Tosatti
2017-09-22  6:23     ` Jan Kiszka
2017-09-26 23:59       ` Marcelo Tosatti
