linux-kernel.vger.kernel.org archive mirror
* [patch 0/3] KVM CPU frequency change hypercalls
@ 2017-02-02 17:47 Marcelo Tosatti
  2017-02-02 17:47 ` [patch 1/3] cpufreq: implement min/max/up/down functions Marcelo Tosatti
                   ` (4 more replies)
  0 siblings, 5 replies; 28+ messages in thread
From: Marcelo Tosatti @ 2017-02-02 17:47 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: Paolo Bonzini, Radim Krcmar, Rafael J. Wysocki, Viresh Kumar

Implement KVM hypercalls for the guest
to issue frequency changes.

Current situation with DPDK and frequency changes is as follows:
An algorithm in the guest decides when to increase/decrease
frequency based on the queue length of the device.

On the host, a power manager daemon is used to listen for
frequency change requests (on another core) and issue these
requests.

However frequency changes are performance sensitive events because:
On a change from low load condition to max load condition,
the frequency should be raised as soon as possible.
Sending a virtio-serial notification to another pCPU,
waiting for that pCPU to initiate an IPI to the requestor pCPU
to change frequency, is slower and more cache costly than
a direct hypercall to host to switch the frequency.

If the pCPU where the power manager daemon is running
is not busy spinning on requests from the isolated DPDK vcpus,
there is also the cost of HLT wakeup for that pCPU.

Moreover, the daemon serves multiple VMs, meaning that
the scheme is subject to additional delays from
queueing of power change requests from VMs.

A direct hypercall from userspace is the fastest and most direct
method for the guest to change frequency and does not suffer
from the issues above.

The usage scenario for these hypercalls is pinned vCPUs <-> pCPUs.
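
As an illustration (not part of the patches), the guest-side call could
look roughly like this on an Intel CPU, using the hypercall number added
in patch 3.  The helper mirrors the guest kernel's kvm_hypercall0(); AMD
guests would use vmmcall instead of vmcall:

#define KVM_HC_FREQ_UP	10	/* from patch 3 */

static inline long kvm_hypercall0(unsigned long nr)
{
	long ret;

	/* hypercall number goes in RAX, the result comes back in RAX */
	asm volatile("vmcall" : "=a" (ret) : "a" (nr) : "memory");
	return ret;
}

/* called by the guest's queue-length algorithm when load goes up */
static long pvfreq_up(void)
{
	return kvm_hypercall0(KVM_HC_FREQ_UP);
}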

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [patch 1/3] cpufreq: implement min/max/up/down functions
  2017-02-02 17:47 [patch 0/3] KVM CPU frequency change hypercalls Marcelo Tosatti
@ 2017-02-02 17:47 ` Marcelo Tosatti
  2017-02-03  4:09   ` Viresh Kumar
  2017-02-02 17:47 ` [patch 2/3] KVM: x86: introduce ioctl to allow frequency hypercalls Marcelo Tosatti
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 28+ messages in thread
From: Marcelo Tosatti @ 2017-02-02 17:47 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: Paolo Bonzini, Radim Krcmar, Rafael J. Wysocki, Viresh Kumar,
	Marcelo Tosatti

[-- Attachment #1: cpufreq-userspace --]
[-- Type: text/plain, Size: 6522 bytes --]

Implement functions in the cpufreq userspace governor code to:
	* Change the current frequency to the {max,min,up,down} frequencies,

up/down being relative to the current one.

These will be used to implement KVM hypercalls for the guest
to issue frequency changes.

Current situation with DPDK and frequency changes is as follows:
An algorithm in the guest decides when to increase/decrease
frequency based on the queue length of the device. 

On the host, a power manager daemon is used to listen for 
frequency change requests (on another core) and issue these 
requests.

However frequency changes are performance sensitive events because:
On a change from low load condition to max load condition,
the frequency should be raised as soon as possible.
Sending a virtio-serial notification to another pCPU, 
waiting for that pCPU to initiate an IPI to the requestor pCPU
to change frequency, is slower and more cache costly than 
a direct hypercall to host to switch the frequency.

Moreover, if the pCPU where the power manager daemon is running 
is not busy spinning on requests from the isolated DPDK vcpus,
there is also the cost of HLT wakeup for that pCPU.

Setup instructions:
Disable the intel_pstate driver (intel_pstate=disable host kernel
command line option), and set the cpufreq userspace governor for
the isolated pCPU.
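
For example, assuming pCPU 3 is the isolated pCPU:

# echo userspace > /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor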

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---
 drivers/cpufreq/cpufreq_userspace.c |  172 ++++++++++++++++++++++++++++++++++++
 include/linux/cpufreq.h             |    7 +
 2 files changed, 179 insertions(+)

Index: kvm-pvfreq/drivers/cpufreq/cpufreq_userspace.c
===================================================================
--- kvm-pvfreq.orig/drivers/cpufreq/cpufreq_userspace.c	2017-01-31 10:41:54.102575877 -0200
+++ kvm-pvfreq/drivers/cpufreq/cpufreq_userspace.c	2017-02-02 15:32:53.456262640 -0200
@@ -118,6 +118,178 @@
 	mutex_unlock(&userspace_mutex);
 }
 
+static int cpufreq_is_userspace_governor(int cpu)
+{
+	int ret;
+
+	mutex_lock(&userspace_mutex);
+	ret = per_cpu(cpu_is_managed, cpu);
+	mutex_unlock(&userspace_mutex);
+
+	return ret;
+}
+
+int cpufreq_userspace_freq_up(int cpu)
+{
+	unsigned int curfreq, nextminfreq;
+	unsigned int ret = 0;
+	struct cpufreq_frequency_table *pos, *table;
+	struct cpufreq_policy *policy = cpufreq_cpu_get(cpu);
+
+	if (!policy)
+		return -EINVAL;
+
+	if (!cpufreq_is_userspace_governor(cpu)) {
+		cpufreq_cpu_put(policy);
+		return -EINVAL;
+	}
+
+	cpufreq_cpu_put(policy);
+
+	mutex_lock(&userspace_mutex);
+	table = policy->freq_table;
+	if (!table) {
+		mutex_unlock(&userspace_mutex);
+		return -ENODEV;
+	}
+	nextminfreq = cpufreq_quick_get_max(cpu);
+	curfreq = policy->cur;
+
+	cpufreq_for_each_valid_entry(pos, table) {
+		if (pos->frequency > curfreq &&
+		    pos->frequency < nextminfreq)
+			nextminfreq = pos->frequency;
+	}
+
+	if (nextminfreq != curfreq) {
+		unsigned int *setspeed = policy->governor_data;
+
+		*setspeed = nextminfreq;
+		ret = __cpufreq_driver_target(policy, nextminfreq,
+					      CPUFREQ_RELATION_L);
+	} else
+		ret = 1;
+	mutex_unlock(&userspace_mutex);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(cpufreq_userspace_freq_up);
+
+int cpufreq_userspace_freq_down(int cpu)
+{
+	unsigned int curfreq, prevmaxfreq;
+	unsigned int ret = 0;
+	struct cpufreq_frequency_table *pos, *table;
+	struct cpufreq_policy *policy = cpufreq_cpu_get(cpu);
+
+	if (!policy)
+		return -EINVAL;
+
+	if (!cpufreq_is_userspace_governor(cpu)) {
+		cpufreq_cpu_put(policy);
+		return -EINVAL;
+	}
+
+	cpufreq_cpu_put(policy);
+
+	mutex_lock(&userspace_mutex);
+	table = policy->freq_table;
+	if (!table) {
+		mutex_unlock(&userspace_mutex);
+		return -ENODEV;
+	}
+	prevmaxfreq = policy->min;
+	curfreq = policy->cur;
+
+	cpufreq_for_each_valid_entry(pos, table) {
+		if (pos->frequency < curfreq &&
+		    pos->frequency > prevmaxfreq)
+			prevmaxfreq = pos->frequency;
+	}
+
+	if (prevmaxfreq != curfreq) {
+		unsigned int *setspeed = policy->governor_data;
+
+		*setspeed = prevmaxfreq;
+		ret = __cpufreq_driver_target(policy, prevmaxfreq,
+					      CPUFREQ_RELATION_L);
+	} else
+		ret = 1;
+	mutex_unlock(&userspace_mutex);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(cpufreq_userspace_freq_down);
+
+int cpufreq_userspace_freq_max(int cpu)
+{
+	unsigned int maxfreq;
+	unsigned int ret = 0;
+	struct cpufreq_frequency_table *table;
+	struct cpufreq_policy *policy = cpufreq_cpu_get(cpu);
+	unsigned int *setspeed = policy->governor_data;
+
+
+	if (!policy)
+		return -EINVAL;
+
+	if (!cpufreq_is_userspace_governor(cpu)) {
+		cpufreq_cpu_put(policy);
+		return -EINVAL;
+	}
+
+	cpufreq_cpu_put(policy);
+
+	mutex_lock(&userspace_mutex);
+	table = policy->freq_table;
+	if (!table) {
+		mutex_unlock(&userspace_mutex);
+		return -ENODEV;
+	}
+	maxfreq = cpufreq_quick_get_max(cpu);
+
+	*setspeed = maxfreq;
+	ret = __cpufreq_driver_target(policy, maxfreq, CPUFREQ_RELATION_L);
+	mutex_unlock(&userspace_mutex);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(cpufreq_userspace_freq_max);
+
+int cpufreq_userspace_freq_min(int cpu)
+{
+	unsigned int minfreq;
+	unsigned int ret = 0;
+	struct cpufreq_frequency_table *table;
+	struct cpufreq_policy *policy = cpufreq_cpu_get(cpu);
+	unsigned int *setspeed = policy->governor_data;
+
+	if (!policy)
+		return -EINVAL;
+
+	if (!cpufreq_is_userspace_governor(cpu)) {
+		cpufreq_cpu_put(policy);
+		return -EINVAL;
+	}
+	minfreq = policy->min;
+
+	cpufreq_cpu_put(policy);
+
+	mutex_lock(&userspace_mutex);
+	table = policy->freq_table;
+	if (!table) {
+		mutex_unlock(&userspace_mutex);
+		return -ENODEV;
+	}
+
+	*setspeed = minfreq;
+	ret = __cpufreq_driver_target(policy, minfreq, CPUFREQ_RELATION_L);
+	mutex_unlock(&userspace_mutex);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(cpufreq_userspace_freq_min);
+
 static struct cpufreq_governor cpufreq_gov_userspace = {
 	.name		= "userspace",
 	.init		= cpufreq_userspace_policy_init,
Index: kvm-pvfreq/include/linux/cpufreq.h
===================================================================
--- kvm-pvfreq.orig/include/linux/cpufreq.h	2017-01-31 10:41:54.102575877 -0200
+++ kvm-pvfreq/include/linux/cpufreq.h	2017-01-31 14:20:00.508613672 -0200
@@ -890,4 +890,11 @@
 int cpufreq_generic_init(struct cpufreq_policy *policy,
 		struct cpufreq_frequency_table *table,
 		unsigned int transition_latency);
+#ifdef CONFIG_CPU_FREQ
+int cpufreq_userspace_freq_down(int cpu);
+int cpufreq_userspace_freq_up(int cpu);
+int cpufreq_userspace_freq_max(int cpu);
+int cpufreq_userspace_freq_min(int cpu);
+#else
+#endif
 #endif /* _LINUX_CPUFREQ_H */

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [patch 2/3] KVM: x86: introduce ioctl to allow frequency hypercalls
  2017-02-02 17:47 [patch 0/3] KVM CPU frequency change hypercalls Marcelo Tosatti
  2017-02-02 17:47 ` [patch 1/3] cpufreq: implement min/max/up/down functions Marcelo Tosatti
@ 2017-02-02 17:47 ` Marcelo Tosatti
  2017-02-03 17:03   ` Radim Krcmar
  2017-02-02 17:47 ` [patch 3/3] KVM: x86: frequency change hypercalls Marcelo Tosatti
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 28+ messages in thread
From: Marcelo Tosatti @ 2017-02-02 17:47 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: Paolo Bonzini, Radim Krcmar, Rafael J. Wysocki, Viresh Kumar,
	Marcelo Tosatti

[-- Attachment #1: allow-per-vcpu-freq --]
[-- Type: text/plain, Size: 3480 bytes --]

For most VMs, modifying the host frequency is an undesired 
operation. Introduce ioctl to enable the guest to 
modify host CPU frequency.
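
As a sketch of the expected VMM-side usage (the ioctl, capability and
struct are the ones added by this patch; fd setup and error handling
are illustrative):

#include <sys/ioctl.h>
#include <linux/kvm.h>

/* enable the frequency hypercalls for one pinned (DPDK) vCPU */
static int allow_freq_hc(int vm_fd, int vcpu_fd)
{
	struct kvm_vcpu_allow_freq freq = { .enable = 1 };

	if (ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_ALLOW_FREQ_HC) <= 0)
		return -1;

	return ioctl(vcpu_fd, KVM_SET_VCPU_ALLOW_FREQ_HC, &freq);
}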


Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
---
 arch/x86/include/asm/kvm_host.h |    2 ++
 arch/x86/include/uapi/asm/kvm.h |    5 +++++
 arch/x86/kvm/x86.c              |   20 ++++++++++++++++++++
 include/uapi/linux/kvm.h        |    3 +++
 virt/kvm/kvm_main.c             |    2 ++
 5 files changed, 32 insertions(+)

Index: kvm-pvfreq/arch/x86/kvm/x86.c
===================================================================
--- kvm-pvfreq.orig/arch/x86/kvm/x86.c	2017-01-31 10:32:33.023378783 -0200
+++ kvm-pvfreq/arch/x86/kvm/x86.c	2017-01-31 10:34:25.443618639 -0200
@@ -3665,6 +3665,26 @@
 		r = kvm_vcpu_ioctl_enable_cap(vcpu, &cap);
 		break;
 	}
+	case KVM_SET_VCPU_ALLOW_FREQ_HC: {
+		struct kvm_vcpu_allow_freq freq;
+
+		r = -EFAULT;
+		if (copy_from_user(&freq, argp, sizeof(freq)))
+			goto out;
+		vcpu->arch.allow_freq_hypercall = freq.enable;
+		r = 0;
+		break;
+	}
+	case KVM_GET_VCPU_ALLOW_FREQ_HC: {
+		struct kvm_vcpu_allow_freq freq;
+
+		memset(&freq, 0, sizeof(struct kvm_vcpu_allow_freq));
+		r = -EFAULT;
+		if (copy_to_user(&freq, argp, sizeof(freq)))
+			break;
+		r = 0;
+		break;
+	}
 	default:
 		r = -EINVAL;
 	}
Index: kvm-pvfreq/include/uapi/linux/kvm.h
===================================================================
--- kvm-pvfreq.orig/include/uapi/linux/kvm.h	2017-01-31 10:32:33.023378783 -0200
+++ kvm-pvfreq/include/uapi/linux/kvm.h	2017-01-31 10:32:38.000389402 -0200
@@ -871,6 +871,7 @@
 #define KVM_CAP_S390_USER_INSTR0 130
 #define KVM_CAP_MSI_DEVID 131
 #define KVM_CAP_PPC_HTM 132
+#define KVM_CAP_ALLOW_FREQ_HC 133
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -1281,6 +1282,8 @@
 #define KVM_S390_GET_IRQ_STATE	  _IOW(KVMIO, 0xb6, struct kvm_s390_irq_state)
 /* Available with KVM_CAP_X86_SMM */
 #define KVM_SMI                   _IO(KVMIO,   0xb7)
+#define KVM_SET_VCPU_ALLOW_FREQ_HC   _IO(KVMIO,   0xb8)
+#define KVM_GET_VCPU_ALLOW_FREQ_HC   _IO(KVMIO,   0xb9)
 
 #define KVM_DEV_ASSIGN_ENABLE_IOMMU	(1 << 0)
 #define KVM_DEV_ASSIGN_PCI_2_3		(1 << 1)
Index: kvm-pvfreq/arch/x86/include/uapi/asm/kvm.h
===================================================================
--- kvm-pvfreq.orig/arch/x86/include/uapi/asm/kvm.h	2017-01-31 10:32:33.023378783 -0200
+++ kvm-pvfreq/arch/x86/include/uapi/asm/kvm.h	2017-01-31 10:32:38.000389402 -0200
@@ -357,4 +357,9 @@
 #define KVM_X86_QUIRK_LINT0_REENABLED	(1 << 0)
 #define KVM_X86_QUIRK_CD_NW_CLEARED	(1 << 1)
 
+struct kvm_vcpu_allow_freq {
+	__u16 enable;
+	__u16 pad[7];
+};
+
 #endif /* _ASM_X86_KVM_H */
Index: kvm-pvfreq/virt/kvm/kvm_main.c
===================================================================
--- kvm-pvfreq.orig/virt/kvm/kvm_main.c	2017-01-31 10:32:33.023378783 -0200
+++ kvm-pvfreq/virt/kvm/kvm_main.c	2017-01-31 10:32:38.001389404 -0200
@@ -2938,6 +2938,8 @@
 #endif
 	case KVM_CAP_MAX_VCPU_ID:
 		return KVM_MAX_VCPU_ID;
+	case KVM_CAP_ALLOW_FREQ_HC:
+		return 1;
 	default:
 		break;
 	}
Index: kvm-pvfreq/arch/x86/include/asm/kvm_host.h
===================================================================
--- kvm-pvfreq.orig/arch/x86/include/asm/kvm_host.h	2017-01-31 10:32:33.023378783 -0200
+++ kvm-pvfreq/arch/x86/include/asm/kvm_host.h	2017-01-31 10:32:38.001389404 -0200
@@ -678,6 +678,8 @@
 
 	/* GPA available (AMD only) */
 	bool gpa_available;
+
+	bool allow_freq_hypercall;
 };
 
 struct kvm_lpage_info {

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [patch 3/3] KVM: x86: frequency change hypercalls
  2017-02-02 17:47 [patch 0/3] KVM CPU frequency change hypercalls Marcelo Tosatti
  2017-02-02 17:47 ` [patch 1/3] cpufreq: implement min/max/up/down functions Marcelo Tosatti
  2017-02-02 17:47 ` [patch 2/3] KVM: x86: introduce ioctl to allow frequency hypercalls Marcelo Tosatti
@ 2017-02-02 17:47 ` Marcelo Tosatti
  2017-02-02 18:01   ` Marcelo Tosatti
  2017-02-03 17:40   ` Radim Krcmar
  2017-02-03 12:50 ` [patch 0/3] KVM CPU " Rafael J. Wysocki
  2017-02-03 16:43 ` Radim Krcmar
  4 siblings, 2 replies; 28+ messages in thread
From: Marcelo Tosatti @ 2017-02-02 17:47 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: Paolo Bonzini, Radim Krcmar, Rafael J. Wysocki, Viresh Kumar,
	Marcelo Tosatti

[-- Attachment #1: kvm-cpufreq-api --]
[-- Type: text/plain, Size: 4840 bytes --]

Implement min/max/up/down frequency change 
KVM hypercalls. To be used by DPDK implementation.

Also allow such hypercalls from guest userspace.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---
 Documentation/virtual/kvm/hypercalls.txt |   45 +++++++++++++++++++
 arch/x86/kvm/x86.c                       |   71 ++++++++++++++++++++++++++++++-
 include/uapi/linux/kvm_para.h            |    5 ++
 3 files changed, 120 insertions(+), 1 deletion(-)

Index: kvm-pvfreq/arch/x86/kvm/x86.c
===================================================================
--- kvm-pvfreq.orig/arch/x86/kvm/x86.c	2017-02-02 11:17:17.063756725 -0200
+++ kvm-pvfreq/arch/x86/kvm/x86.c	2017-02-02 11:17:17.822752510 -0200
@@ -6219,10 +6219,58 @@
 	kvm_x86_ops->refresh_apicv_exec_ctrl(vcpu);
 }
 
+#ifdef CONFIG_CPU_FREQ_GOV_USERSPACE
+/* call into cpufreq-userspace governor */
+static int kvm_pvfreq_up(struct kvm_vcpu *vcpu)
+{
+	int ret;
+	int cpu = get_cpu();
+
+	ret = cpufreq_userspace_freq_up(cpu);
+	put_cpu();
+
+	return ret;
+}
+
+static int kvm_pvfreq_down(struct kvm_vcpu *vcpu)
+{
+	int ret;
+	int cpu = get_cpu();
+
+	ret = cpufreq_userspace_freq_down(cpu);
+	put_cpu();
+
+	return ret;
+}
+
+static int kvm_pvfreq_max(struct kvm_vcpu *vcpu)
+{
+	int ret;
+	int cpu = get_cpu();
+
+	ret = cpufreq_userspace_freq_max(cpu);
+	put_cpu();
+
+	return ret;
+}
+
+static int kvm_pvfreq_min(struct kvm_vcpu *vcpu)
+{
+	int ret;
+	int cpu = get_cpu();
+
+	ret = cpufreq_userspace_freq_min(cpu);
+	put_cpu();
+
+	return ret;
+}
+#endif
+
 int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
 {
 	unsigned long nr, a0, a1, a2, a3, ret;
 	int op_64_bit, r;
+	bool cpl_check;
 
 	r = kvm_skip_emulated_instruction(vcpu);
 
@@ -6246,7 +6294,13 @@
 		a3 &= 0xFFFFFFFF;
 	}
 
-	if (kvm_x86_ops->get_cpl(vcpu) != 0) {
+	cpl_check = true;
+	if (nr == KVM_HC_FREQ_UP || nr == KVM_HC_FREQ_DOWN ||
+	    nr == KVM_HC_FREQ_MIN || nr == KVM_HC_FREQ_MAX)
+		if (vcpu->arch.allow_freq_hypercall == true)
+			cpl_check = false;
+
+	if (cpl_check == true && kvm_x86_ops->get_cpl(vcpu) != 0) {
 		ret = -KVM_EPERM;
 		goto out;
 	}
@@ -6262,6 +6316,21 @@
 	case KVM_HC_CLOCK_PAIRING:
 		ret = kvm_pv_clock_pairing(vcpu, a0, a1);
 		break;
+#ifdef CONFIG_CPU_FREQ_GOV_USERSPACE
+	case KVM_HC_FREQ_UP:
+		ret = kvm_pvfreq_up(vcpu);
+		break;
+	case KVM_HC_FREQ_DOWN:
+		ret = kvm_pvfreq_down(vcpu);
+		break;
+	case KVM_HC_FREQ_MAX:
+		ret = kvm_pvfreq_max(vcpu);
+		break;
+	case KVM_HC_FREQ_MIN:
+		ret = kvm_pvfreq_min(vcpu);
+		break;
+#endif
+
 	default:
 		ret = -KVM_ENOSYS;
 		break;
Index: kvm-pvfreq/include/uapi/linux/kvm_para.h
===================================================================
--- kvm-pvfreq.orig/include/uapi/linux/kvm_para.h	2017-02-02 10:51:53.741217306 -0200
+++ kvm-pvfreq/include/uapi/linux/kvm_para.h	2017-02-02 11:17:17.824752499 -0200
@@ -25,6 +25,11 @@
 #define KVM_HC_MIPS_EXIT_VM		7
 #define KVM_HC_MIPS_CONSOLE_OUTPUT	8
 #define KVM_HC_CLOCK_PAIRING		9
+#define KVM_HC_FREQ_UP			10
+#define KVM_HC_FREQ_DOWN		11
+#define KVM_HC_FREQ_MAX			12
+#define KVM_HC_FREQ_MIN			13
+
 
 /*
  * hypercalls use architecture specific
Index: kvm-pvfreq/Documentation/virtual/kvm/hypercalls.txt
===================================================================
--- kvm-pvfreq.orig/Documentation/virtual/kvm/hypercalls.txt	2017-02-02 10:51:53.741217306 -0200
+++ kvm-pvfreq/Documentation/virtual/kvm/hypercalls.txt	2017-02-02 15:29:24.401692793 -0200
@@ -116,3 +116,48 @@
 
 Returns KVM_EOPNOTSUPP if the host does not use TSC clocksource,
 or if clock type is different than KVM_CLOCK_PAIRING_WALLCLOCK.
+
+7. KVM_HC_FREQ_UP
+-----------------
+
+Architecture: x86
+Status: active
+Purpose: Hypercall used to increase frequency to the next
+higher frequency.
+Usage example: DPDK power aware applications, that run on
+isolated CPUs. No input argument, returns 0 if success,
+1 if already at highest frequency, error otherwise.
+
+8. KVM_HC_FREQ_DOWN
+---------------------
+
+Architecture: x86
+Status: active
+Purpose: Hypercall used to decrease frequency to the next
+lower frequency.
+Usage example: DPDK power aware applications, that run on
+isolated CPUs. No input argument, returns 0 if success,
+1 if already at lowest frequency, negative error otherwise.
+
+9. KVM_HC_FREQ_MIN
+-------------------
+
+Architecture: x86
+Status: active
+Purpose: Hypercall used to decrease frequency to the
+minimum frequency.
+Usage example: DPDK power aware applications, that run
+on isolated CPUs. No input argument, returns 0 if success,
+error otherwise.
+
+10. KVM_HC_FREQ_MAX
+-------------------
+
+Architecture: x86
+Status: active
+Purpose: Hypercall used to increase frequency to the
+maximum frequency.
+Usage example: DPDK power aware applications, that run
+on isolated CPUs. No input argument, returns 0 if success,
+error otherwise.
+

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 3/3] KVM: x86: frequency change hypercalls
  2017-02-02 17:47 ` [patch 3/3] KVM: x86: frequency change hypercalls Marcelo Tosatti
@ 2017-02-02 18:01   ` Marcelo Tosatti
  2017-02-03 17:40   ` Radim Krcmar
  1 sibling, 0 replies; 28+ messages in thread
From: Marcelo Tosatti @ 2017-02-02 18:01 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: Paolo Bonzini, Radim Krcmar, Rafael J. Wysocki, Viresh Kumar

On Thu, Feb 02, 2017 at 03:47:58PM -0200, Marcelo Tosatti wrote:
> Implement min/max/up/down frequency change 
> KVM hypercalls. To be used by DPDK implementation.
> 
> Also allow such hypercalls from guest userspace.
> 
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> 
> ---
>  Documentation/virtual/kvm/hypercalls.txt |   45 +++++++++++++++++++
>  arch/x86/kvm/x86.c                       |   71 ++++++++++++++++++++++++++++++-
>  include/uapi/linux/kvm_para.h            |    5 ++
>  3 files changed, 120 insertions(+), 1 deletion(-)
> 
> Index: kvm-pvfreq/arch/x86/kvm/x86.c
> ===================================================================
> --- kvm-pvfreq.orig/arch/x86/kvm/x86.c	2017-02-02 11:17:17.063756725 -0200
> +++ kvm-pvfreq/arch/x86/kvm/x86.c	2017-02-02 11:17:17.822752510 -0200
> @@ -6219,10 +6219,58 @@
>  	kvm_x86_ops->refresh_apicv_exec_ctrl(vcpu);
>  }
>  
> +#ifdef CONFIG_CPU_FREQ_GOV_USERSPACE
> +/* call into cpufreq-userspace governor */
> +static int kvm_pvfreq_up(struct kvm_vcpu *vcpu)
> +{
> +	int ret;
> +	int cpu = get_cpu();
> +
> +	ret = cpufreq_userspace_freq_up(cpu);
> +	put_cpu();
> +
> +	return ret;
> +}
> +
> +static int kvm_pvfreq_down(struct kvm_vcpu *vcpu)
> +{
> +	int ret;
> +	int cpu = get_cpu();
> +
> +	ret = cpufreq_userspace_freq_down(cpu);
> +	put_cpu();
> +
> +	return ret;
> +}
> +
> +static int kvm_pvfreq_max(struct kvm_vcpu *vcpu)
> +{
> +	int ret;
> +	int cpu = get_cpu();
> +
> +	ret = cpufreq_userspace_freq_max(cpu);
> +	put_cpu();
> +
> +	return ret;
> +}
> +
> +static int kvm_pvfreq_min(struct kvm_vcpu *vcpu)
> +{
> +	int ret;
> +	int cpu = get_cpu();
> +
> +	ret = cpufreq_userspace_freq_min(cpu);
> +	put_cpu();
> +
> +	return ret;
> +}
> +#endif
> +
>  int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
>  {
>  	unsigned long nr, a0, a1, a2, a3, ret;
>  	int op_64_bit, r;
> +	bool cpl_check;
>  
>  	r = kvm_skip_emulated_instruction(vcpu);
>  
> @@ -6246,7 +6294,13 @@
>  		a3 &= 0xFFFFFFFF;
>  	}
>  
> -	if (kvm_x86_ops->get_cpl(vcpu) != 0) {
> +	cpl_check = true;
> +	if (nr == KVM_HC_FREQ_UP || nr == KVM_HC_FREQ_DOWN ||
> +	    nr == KVM_HC_FREQ_MIN || nr == KVM_HC_FREQ_MAX)
> +		if (vcpu->arch.allow_freq_hypercall == true)
> +			cpl_check = false;
> +
> +	if (cpl_check == true && kvm_x86_ops->get_cpl(vcpu) != 0) {
>  		ret = -KVM_EPERM;
>  		goto out;

This should fail with EPERM if vcpu->arch.allow_freq_hypercall ==
false, independently of CPL level.

Will resend with that (and other comments) in v2.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 1/3] cpufreq: implement min/max/up/down functions
  2017-02-02 17:47 ` [patch 1/3] cpufreq: implement min/max/up/down functions Marcelo Tosatti
@ 2017-02-03  4:09   ` Viresh Kumar
  0 siblings, 0 replies; 28+ messages in thread
From: Viresh Kumar @ 2017-02-03  4:09 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: kvm, linux-kernel, Paolo Bonzini, Radim Krcmar, Rafael J. Wysocki

On 02-02-17, 15:47, Marcelo Tosatti wrote:
> +++ kvm-pvfreq/drivers/cpufreq/cpufreq_userspace.c	2017-02-02 15:32:53.456262640 -0200
> @@ -118,6 +118,178 @@
>  	mutex_unlock(&userspace_mutex);
>  }
>  
> +static int cpufreq_is_userspace_governor(int cpu)
> +{
> +	int ret;
> +
> +	mutex_lock(&userspace_mutex);
> +	ret = per_cpu(cpu_is_managed, cpu);

The userspace governor is buggy in the sense that cpu_is_managed is only updated
for the policy->cpu and not any other CPU in that policy. But then it was never
used with anything other than policy->cpu, so it was fine.

But now that you are allowing any CPU number here, you need to do one of these:
- Either set cpu_is_managed for all the CPUs from a policy
- Or get the policy first and pass policy->cpu here.

> +	mutex_unlock(&userspace_mutex);
> +
> +	return ret;
> +}

All 4 routines defined below have too much in common and it would be very easy
to write a common routine cpufreq_userspace_freq_change(), which can be called
in all the four cases. You can pass a function pointer to that, which can give
min, max, up, or down frequencies. That will make it more robust and less error
prone.
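
Roughly (an untested sketch; the callback type and selector names are
made up, and it keeps the lock held and uses policy->cpu as per the
other comments in this mail):

typedef unsigned int (*freq_sel_fn)(struct cpufreq_policy *policy);

static int cpufreq_userspace_freq_change(int cpu, freq_sel_fn select)
{
	struct cpufreq_policy *policy;
	unsigned int *setspeed, target;
	int ret = 0;

	policy = cpufreq_cpu_get(cpu);
	if (!policy)
		return -EINVAL;

	mutex_lock(&userspace_mutex);
	if (!per_cpu(cpu_is_managed, policy->cpu) || !policy->freq_table) {
		ret = -EINVAL;
		goto out;
	}

	target = select(policy);
	if (target == policy->cur) {
		ret = 1;	/* already at the requested frequency */
		goto out;
	}

	setspeed = policy->governor_data;
	*setspeed = target;
	ret = __cpufreq_driver_target(policy, target, CPUFREQ_RELATION_L);
out:
	mutex_unlock(&userspace_mutex);
	cpufreq_cpu_put(policy);
	return ret;
}

static unsigned int freq_sel_max(struct cpufreq_policy *policy)
{
	return policy->max;
}

int cpufreq_userspace_freq_max(int cpu)
{
	return cpufreq_userspace_freq_change(cpu, freq_sel_max);
}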

> +int cpufreq_userspace_freq_up(int cpu)
> +{
> +	unsigned int curfreq, nextminfreq;
> +	unsigned int ret = 0;
> +	struct cpufreq_frequency_table *pos, *table;
> +	struct cpufreq_policy *policy = cpufreq_cpu_get(cpu);
> +
> +	if (!policy)
> +		return -EINVAL;
> +
> +	if (!cpufreq_is_userspace_governor(cpu)) {
> +		cpufreq_cpu_put(policy);
> +		return -EINVAL;
> +	}

Because the userspace_mutex is dropped after that routine returns, there is no
guarantee that 'cpu' is still managed by this governor. And so you need to make
sure that you drop the lock only at the end.

> +
> +	cpufreq_cpu_put(policy);

This must be called only after you are done using the policy, to make sure that
the policy doesn't get freed while you are using it.

> +	mutex_lock(&userspace_mutex);
> +	table = policy->freq_table;
> +	if (!table) {
> +		mutex_unlock(&userspace_mutex);
> +		return -ENODEV;
> +	}
> +	nextminfreq = cpufreq_quick_get_max(cpu);

Just use policy->max here, why waste time ?

> +	curfreq = policy->cur;
> +
> +	cpufreq_for_each_valid_entry(pos, table) {
> +		if (pos->frequency > curfreq &&
> +		    pos->frequency < nextminfreq)
> +			nextminfreq = pos->frequency;
> +	}

The above part can be a routine of its own, whose pointer will be passed to
cpufreq_userspace_freq_change().

> +
> +	if (nextminfreq != curfreq) {

You are missing similar checks in the last two routines, any special reason for
that ?

> +		unsigned int *setspeed = policy->governor_data;
> +
> +		*setspeed = nextminfreq;
> +		ret = __cpufreq_driver_target(policy, nextminfreq,
> +					      CPUFREQ_RELATION_L);
> +	} else
> +		ret = 1;

Why ret 1? What are the callers expected to do on seeing this value? Maybe
return 0 as the desired freq is set by the governor ?

And always use {} even for single line code if the 'if' block has them.

> +	mutex_unlock(&userspace_mutex);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(cpufreq_userspace_freq_up);
> +++ kvm-pvfreq/include/linux/cpufreq.h	2017-01-31 14:20:00.508613672 -0200
> @@ -890,4 +890,11 @@
>  int cpufreq_generic_init(struct cpufreq_policy *policy,
>  		struct cpufreq_frequency_table *table,
>  		unsigned int transition_latency);
> +#ifdef CONFIG_CPU_FREQ
> +int cpufreq_userspace_freq_down(int cpu);
> +int cpufreq_userspace_freq_up(int cpu);
> +int cpufreq_userspace_freq_max(int cpu);
> +int cpufreq_userspace_freq_min(int cpu);
> +#else

Don't want to put dummy routines here? Then why the blank #else part ?
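
I.e. something like the usual static inline stubs (sketched here with
-EINVAL, any sensible error code would do):

#else
static inline int cpufreq_userspace_freq_down(int cpu) { return -EINVAL; }
static inline int cpufreq_userspace_freq_up(int cpu) { return -EINVAL; }
static inline int cpufreq_userspace_freq_max(int cpu) { return -EINVAL; }
static inline int cpufreq_userspace_freq_min(int cpu) { return -EINVAL; }
#endif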

> +#endif
>  #endif /* _LINUX_CPUFREQ_H */
> 

-- 
viresh

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 0/3] KVM CPU frequency change hypercalls
  2017-02-02 17:47 [patch 0/3] KVM CPU frequency change hypercalls Marcelo Tosatti
                   ` (2 preceding siblings ...)
  2017-02-02 17:47 ` [patch 3/3] KVM: x86: frequency change hypercalls Marcelo Tosatti
@ 2017-02-03 12:50 ` Rafael J. Wysocki
  2017-02-03 16:43 ` Radim Krcmar
  4 siblings, 0 replies; 28+ messages in thread
From: Rafael J. Wysocki @ 2017-02-03 12:50 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: kvm, linux-kernel, Paolo Bonzini, Radim Krcmar, Viresh Kumar

On Thursday, February 02, 2017 03:47:55 PM Marcelo Tosatti wrote:
> Implement KVM hypercalls for the guest
> to issue frequency changes.
> 
> Current situation with DPDK and frequency changes is as follows:
> An algorithm in the guest decides when to increase/decrease
> frequency based on the queue length of the device.
> 
> On the host, a power manager daemon is used to listen for
> frequency change requests (on another core) and issue these
> requests.
> 
> However frequency changes are performance sensitive events because:
> On a change from low load condition to max load condition,
> the frequency should be raised as soon as possible.
> Sending a virtio-serial notification to another pCPU,
> waiting for that pCPU to initiate an IPI to the requestor pCPU
> to change frequency, is slower and more cache costly than
> a direct hypercall to host to switch the frequency.
> 
> If the pCPU where the power manager daemon is running
> is not busy spinning on requests from the isolated DPDK vcpus,
> there is also the cost of HLT wakeup for that pCPU.
> 
> Moreover, the daemon serves multiple VMs, meaning that
> the scheme is subject to additional delays from
> queueing of power change requests from VMs.
> 
> A direct hypercall from userspace is the fastest and most direct
> method for the guest to change frequency and does not suffer
> from the issues above.
> 
> The usage scenario for these hypercalls is pinned vCPUs <-> pCPUs.

Any chance to CC this to linux-pm in the future?  That would help the review
quite a bit.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 0/3] KVM CPU frequency change hypercalls
  2017-02-02 17:47 [patch 0/3] KVM CPU frequency change hypercalls Marcelo Tosatti
                   ` (3 preceding siblings ...)
  2017-02-03 12:50 ` [patch 0/3] KVM CPU " Rafael J. Wysocki
@ 2017-02-03 16:43 ` Radim Krcmar
  2017-02-03 18:14   ` Marcelo Tosatti
  4 siblings, 1 reply; 28+ messages in thread
From: Radim Krcmar @ 2017-02-03 16:43 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: kvm, linux-kernel, Paolo Bonzini, Rafael J. Wysocki, Viresh Kumar

2017-02-02 15:47-0200, Marcelo Tosatti:
> Implement KVM hypercalls for the guest
> to issue frequency changes.
> 
> Current situation with DPDK and frequency changes is as follows:
> An algorithm in the guest decides when to increase/decrease
> frequency based on the queue length of the device.

Does the algorithm compute with the magnitude of frequency steps?

(e.g. if CPU can step with 200 MHz granularity, does the algorithm ever
 do 400 MHz at once, because it assumes that frequency would be enough
 to handle the load?)

> On the host, a power manager daemon is used to listen for
> frequency change requests (on another core) and issue these
> requests.
> 
> However frequency changes are performance sensitive events because:
> On a change from low load condition to max load condition,
> the frequency should be raised as soon as possible.
> Sending a virtio-serial notification to another pCPU,
> waiting for that pCPU to initiate an IPI to the requestor pCPU
> to change frequency, is slower and more cache costly than
> a direct hypercall to host to switch the frequency.
> 
> If the pCPU where the power manager daemon is running
> is not busy spinning on requests from the isolated DPDK vcpus,
> there is also the cost of HLT wakeup for that pCPU.
> 
> Moreover, the daemon serves multiple VMs, meaning that
> the scheme is subject to additional delays from
> queueing of power change requests from VMs.

(Wow, this must be bringing humanity to its doom faster than the heat it
 helps to eliminate.)

> A direct hypercall from userspace is the fastest and most direct
> method for the guest to change frequency and does not suffer
> from the issues above.

Right, userspace on bare-metal cannot change frequency directly.

> The usage scenario for these hypercalls is pinned vCPUs <-> pCPUs.

And pinned tasks <-> vCPUs, because the guest kernel has no idea what
frequency is being used or desired on its virtualware, so the kernel
cannot even change frequency without introducing a bug ...

I'm not happy about this hole through layers of isolations.

The domain of valid users is very small and a problem is that any
program with access to /dev/kvm gains the ability to change host CPU
frequency if the host happens to use the userspace governor.

We should at least enable this feature only if /dev/kvm is root-only.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 2/3] KVM: x86: introduce ioctl to allow frequency hypercalls
  2017-02-02 17:47 ` [patch 2/3] KVM: x86: introduce ioctl to allow frequency hypercalls Marcelo Tosatti
@ 2017-02-03 17:03   ` Radim Krcmar
  2017-02-22 21:18     ` Marcelo Tosatti
  0 siblings, 1 reply; 28+ messages in thread
From: Radim Krcmar @ 2017-02-03 17:03 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: kvm, linux-kernel, Paolo Bonzini, Rafael J. Wysocki, Viresh Kumar

2017-02-02 15:47-0200, Marcelo Tosatti:
> For most VMs, modifying the host frequency is an undesired 
> operation. Introduce ioctl to enable the guest to 
> modify host CPU frequency.
> 
> 
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> ---
>  arch/x86/include/asm/kvm_host.h |    2 ++
>  arch/x86/include/uapi/asm/kvm.h |    5 +++++
>  arch/x86/kvm/x86.c              |   20 ++++++++++++++++++++
>  include/uapi/linux/kvm.h        |    3 +++
>  virt/kvm/kvm_main.c             |    2 ++
>  5 files changed, 32 insertions(+)
> 
> Index: kvm-pvfreq/arch/x86/kvm/x86.c
> ===================================================================
> --- kvm-pvfreq.orig/arch/x86/kvm/x86.c	2017-01-31 10:32:33.023378783 -0200
> +++ kvm-pvfreq/arch/x86/kvm/x86.c	2017-01-31 10:34:25.443618639 -0200
> @@ -3665,6 +3665,26 @@
>  		r = kvm_vcpu_ioctl_enable_cap(vcpu, &cap);
>  		break;
>  	}
> +	case KVM_SET_VCPU_ALLOW_FREQ_HC: {

Just enable the frequency hypercalls with KVM_ENABLE_CAP ioctl and get
rid of this ioctl.
(I don't think that we want to allow disabling this capability.)
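
I.e. the userspace side would just be something like (sketch; the cap
define is the one this series adds, and the vcpu-level KVM_ENABLE_CAP
path already exists on x86):

#include <sys/ioctl.h>
#include <linux/kvm.h>

static int enable_freq_hc(int vcpu_fd)
{
	struct kvm_enable_cap cap = { .cap = KVM_CAP_ALLOW_FREQ_HC };

	return ioctl(vcpu_fd, KVM_ENABLE_CAP, &cap);
}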

> +		struct kvm_vcpu_allow_freq freq;
> +
> +		r = -EFAULT;
> +		if (copy_from_user(&freq, argp, sizeof(freq)))
> +			goto out;
> +		vcpu->arch.allow_freq_hypercall = freq.enable;

Enabling the capability should also set a bit in KVM_CPUID_FEATURES,
a well-behaved guest won't use it otherwise.

> +		r = 0;
> +		break;
> +	}
> +	case KVM_GET_VCPU_ALLOW_FREQ_HC: {

And this ioctl has no use, so it should be omitted.

(Userspace doesn't learn anything it doesn't already know.
 The feature is unsafe, so the default must be 0.)

> +		struct kvm_vcpu_allow_freq freq;
> +
> +		memset(&freq, 0, sizeof(struct kvm_vcpu_allow_freq));
> +		r = -EFAULT;
> +		if (copy_to_user(&freq, argp, sizeof(freq)))
> +			break;
> +		r = 0;
> +		break;
> +	}
>  	default:
>  		r = -EINVAL;
>  	}
> Index: kvm-pvfreq/include/uapi/linux/kvm.h
> ===================================================================
> --- kvm-pvfreq.orig/include/uapi/linux/kvm.h	2017-01-31 10:32:33.023378783 -0200
> +++ kvm-pvfreq/include/uapi/linux/kvm.h	2017-01-31 10:32:38.000389402 -0200
> @@ -871,6 +871,7 @@
>  #define KVM_CAP_S390_USER_INSTR0 130
>  #define KVM_CAP_MSI_DEVID 131
>  #define KVM_CAP_PPC_HTM 132
> +#define KVM_CAP_ALLOW_FREQ_HC 133
>  
>  #ifdef KVM_CAP_IRQ_ROUTING
>  
> @@ -1281,6 +1282,8 @@
>  #define KVM_S390_GET_IRQ_STATE	  _IOW(KVMIO, 0xb6, struct kvm_s390_irq_state)
>  /* Available with KVM_CAP_X86_SMM */
>  #define KVM_SMI                   _IO(KVMIO,   0xb7)
> +#define KVM_SET_VCPU_ALLOW_FREQ_HC   _IO(KVMIO,   0xb8)
> +#define KVM_GET_VCPU_ALLOW_FREQ_HC   _IO(KVMIO,   0xb9)
>  
>  #define KVM_DEV_ASSIGN_ENABLE_IOMMU	(1 << 0)
>  #define KVM_DEV_ASSIGN_PCI_2_3		(1 << 1)
> Index: kvm-pvfreq/arch/x86/include/uapi/asm/kvm.h
> ===================================================================
> --- kvm-pvfreq.orig/arch/x86/include/uapi/asm/kvm.h	2017-01-31 10:32:33.023378783 -0200
> +++ kvm-pvfreq/arch/x86/include/uapi/asm/kvm.h	2017-01-31 10:32:38.000389402 -0200
> @@ -357,4 +357,9 @@
>  #define KVM_X86_QUIRK_LINT0_REENABLED	(1 << 0)
>  #define KVM_X86_QUIRK_CD_NW_CLEARED	(1 << 1)
>  
> +struct kvm_vcpu_allow_freq {
> +	__u16 enable;
> +	__u16 pad[7];
> +};
> +
>  #endif /* _ASM_X86_KVM_H */
> Index: kvm-pvfreq/virt/kvm/kvm_main.c
> ===================================================================
> --- kvm-pvfreq.orig/virt/kvm/kvm_main.c	2017-01-31 10:32:33.023378783 -0200
> +++ kvm-pvfreq/virt/kvm/kvm_main.c	2017-01-31 10:32:38.001389404 -0200
> @@ -2938,6 +2938,8 @@
>  #endif
>  	case KVM_CAP_MAX_VCPU_ID:
>  		return KVM_MAX_VCPU_ID;
> +	case KVM_CAP_ALLOW_FREQ_HC:

This can share an existing return 1 and would be elsewhere with
KVM_ENABLE_CAP.

> +		return 1;
>  	default:
>  		break;
>  	}
> Index: kvm-pvfreq/arch/x86/include/asm/kvm_host.h
> ===================================================================
> --- kvm-pvfreq.orig/arch/x86/include/asm/kvm_host.h	2017-01-31 10:32:33.023378783 -0200
> +++ kvm-pvfreq/arch/x86/include/asm/kvm_host.h	2017-01-31 10:32:38.001389404 -0200
> @@ -678,6 +678,8 @@
>  
>  	/* GPA available (AMD only) */
>  	bool gpa_available;
> +
> +	bool allow_freq_hypercall;
>  };
>  
>  struct kvm_lpage_info {
> 
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 3/3] KVM: x86: frequency change hypercalls
  2017-02-02 17:47 ` [patch 3/3] KVM: x86: frequency change hypercalls Marcelo Tosatti
  2017-02-02 18:01   ` Marcelo Tosatti
@ 2017-02-03 17:40   ` Radim Krcmar
  2017-02-03 18:24     ` Marcelo Tosatti
  1 sibling, 1 reply; 28+ messages in thread
From: Radim Krcmar @ 2017-02-03 17:40 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: kvm, linux-kernel, Paolo Bonzini, Rafael J. Wysocki, Viresh Kumar

2017-02-02 15:47-0200, Marcelo Tosatti:
> Implement min/max/up/down frequency change 
> KVM hypercalls. To be used by DPDK implementation.
> 
> Also allow such hypercalls from guest userspace.
> 
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> 
> ---
> Index: kvm-pvfreq/arch/x86/kvm/x86.c
> ===================================================================
> --- kvm-pvfreq.orig/arch/x86/kvm/x86.c	2017-02-02 11:17:17.063756725 -0200
> +++ kvm-pvfreq/arch/x86/kvm/x86.c	2017-02-02 11:17:17.822752510 -0200
> @@ -6219,10 +6219,58 @@

[Here lived copy-paste.]

>  int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
>  {
>  	unsigned long nr, a0, a1, a2, a3, ret;
>  	int op_64_bit, r;
> +	bool cpl_check;
>  
>  	r = kvm_skip_emulated_instruction(vcpu);
>  
> @@ -6246,7 +6294,13 @@
>  		a3 &= 0xFFFFFFFF;
>  	}
>  
> -	if (kvm_x86_ops->get_cpl(vcpu) != 0) {
> +	cpl_check = true;
> +	if (nr == KVM_HC_FREQ_UP || nr == KVM_HC_FREQ_DOWN ||
> +	    nr == KVM_HC_FREQ_MIN || nr == KVM_HC_FREQ_MAX)
> +		if (vcpu->arch.allow_freq_hypercall == true)
> +			cpl_check = false;
> +
> +	if (cpl_check == true && kvm_x86_ops->get_cpl(vcpu) != 0) {
>  		ret = -KVM_EPERM;
>  		goto out;
>  	}
> @@ -6262,6 +6316,21 @@
>  	case KVM_HC_CLOCK_PAIRING:
>  		ret = kvm_pv_clock_pairing(vcpu, a0, a1);
>  		break;
> +#ifdef CONFIG_CPU_FREQ_GOV_USERSPACE

CONFIG_CPU_FREQ_GOV_USERSPACE should be checked when enabling the
capability.

> +	case KVM_HC_FREQ_UP:
> +		ret = kvm_pvfreq_up(vcpu);
> +		break;
> +	case KVM_HC_FREQ_DOWN:
> +		ret = kvm_pvfreq_down(vcpu);
> +		break;
> +	case KVM_HC_FREQ_MAX:
> +		ret = kvm_pvfreq_max(vcpu);
> +		break;
> +	case KVM_HC_FREQ_MIN:
> +		ret = kvm_pvfreq_min(vcpu);
> +		break;

Having 4 hypercalls for this is an overkill.
You can make it one hypercall with an argument.

And the argument doesn't have to be enum {UP, DOWN, MAX, MIN}, but an
int, which would also allow you to do -2 steps.
A number over the capabilities of stepping would just map to MAX/MIN.

Avoiding an absolute scale for interface simplifies migration, where the
guest cannot really depend much on this.  Except that calling it with
MIN (INT_MIN) will get the minimum and MAX (INT_MAX) the maximum
frequency.
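
From the guest side that could look roughly like this (KVM_HC_FREQ_CHANGE
and its number are made up for illustration, kvm_hypercall1() is the
existing guest helper from asm/kvm_para.h):

#define KVM_HC_FREQ_CHANGE	10	/* illustrative */

/* steps < 0 lowers the frequency; INT_MIN/INT_MAX always reach min/max */
static long pvfreq_change(int steps)
{
	return kvm_hypercall1(KVM_HC_FREQ_CHANGE, (long)steps);
}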

Please explicitly say in the documentation that things like the number of
steps, which the guest can learn by doing MAX and then -1 until the
hypercall fails, is undefined and should not be depended upon.

Userspace might still want to know the number of steps to avoid a useless
hypercall -- I think we should return a different value when the limit
is reached, not just after the guest wants to go past it.

> +#endif
> +
>  	default:
>  		ret = -KVM_ENOSYS;
>  		break;

And thinking more about migration, userspace cannot learn the current
frequency (at least MIN/MAX), so the new host will just pick at random,
which will break userspace's expectations that it cannot increase or
decrease the frequency.  Is migration left for the future, because DPDK
doesn't migrate anyway?

Thanks.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 0/3] KVM CPU frequency change hypercalls
  2017-02-03 16:43 ` Radim Krcmar
@ 2017-02-03 18:14   ` Marcelo Tosatti
  2017-02-03 19:09     ` Radim Krcmar
  0 siblings, 1 reply; 28+ messages in thread
From: Marcelo Tosatti @ 2017-02-03 18:14 UTC (permalink / raw)
  To: Radim Krcmar
  Cc: kvm, linux-kernel, Paolo Bonzini, Rafael J. Wysocki, Viresh Kumar

On Fri, Feb 03, 2017 at 05:43:50PM +0100, Radim Krcmar wrote:
> 2017-02-02 15:47-0200, Marcelo Tosatti:
> > Implement KVM hypercalls for the guest
> > to issue frequency changes.
> > 
> > Current situation with DPDK and frequency changes is as follows:
> > An algorithm in the guest decides when to increase/decrease
> > frequency based on the queue length of the device.
> 
> Does the algorithm compute with the magnitude of frequency steps?
> 
> (e.g. if CPU can step with 200 MHz granularity, does the algorithm ever
>  do 400 MHz at once, because it assumes that frequency would be enough
>  to handle the load?)

No, it does not know the frequency directly. It only "knows" the
frequency indirectly by the size of the network queue (that is, if the
network queue is above a threshold, then frequency is "too low" and
should be increased).

> > On the host, a power manager daemon is used to listen for
> > frequency change requests (on another core) and issue these
> > requests.
> > 
> > However frequency changes are performance sensitive events because:
> > On a change from low load condition to max load condition,
> > the frequency should be raised as soon as possible.
> > Sending a virtio-serial notification to another pCPU,
> > waiting for that pCPU to initiate an IPI to the requestor pCPU
> > to change frequency, is slower and more cache costly than
> > a direct hypercall to host to switch the frequency.
> > 
> > If the pCPU where the power manager daemon is running
> > is not busy spinning on requests from the isolated DPDK vcpus,
> > there is also the cost of HLT wakeup for that pCPU.
> > 
> > Moreover, the daemon serves multiple VMs, meaning that
> > the scheme is subject to additional delays from
> > queueing of power change requests from VMs.
> 
> (Wow, this must be bringing humanity to its doom faster than the heat it
>  helps to eliminate.)
> > A direct hypercall from userspace is the fastest most direct
> > method for the guest to change frequency and does not suffer
> > from the issues above.
> 
> Right, userspace on bare-metal cannot change frequency directly.

Yes it can: write to sysfs (not sure what you meant).

> > The usage scenario for this hypercalls is for pinned vCPUs <-> pCPUs.
> 
> And pinned tasks <-> vCPUs, because the guest kernel has no idea what
> frequency is being used or desired on its virtualware, 

And it does not have to know...

> so the kernel
> cannot even change frequency without introducing a bug ...

Not sure what you are thinking, please be more verbose.

> I'm not happy about this hole through layers of isolations.
> 
> The domain of valid users is very small and a problem is that any
> program with access to /dev/kvm gains the ability to change host CPU
> frequency if the host happens to use the userspace governor.

Yes.

> We should at least enable this feature only if /dev/kvm is root-only.

Fine, can change that, will fix in -v2. Maybe there is a capability 
to change frequency... should require that capability (or root 
if there is none).

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 3/3] KVM: x86: frequency change hypercalls
  2017-02-03 17:40   ` Radim Krcmar
@ 2017-02-03 18:24     ` Marcelo Tosatti
  2017-02-03 19:28       ` Radim Krcmar
  0 siblings, 1 reply; 28+ messages in thread
From: Marcelo Tosatti @ 2017-02-03 18:24 UTC (permalink / raw)
  To: Radim Krcmar
  Cc: kvm, linux-kernel, Paolo Bonzini, Rafael J. Wysocki, Viresh Kumar

On Fri, Feb 03, 2017 at 06:40:34PM +0100, Radim Krcmar wrote:
> 2017-02-02 15:47-0200, Marcelo Tosatti:
> > Implement min/max/up/down frequency change 
> > KVM hypercalls. To be used by DPDK implementation.
> > 
> > Also allow such hypercalls from guest userspace.
> > 
> > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> > 
> > ---
> > Index: kvm-pvfreq/arch/x86/kvm/x86.c
> > ===================================================================
> > --- kvm-pvfreq.orig/arch/x86/kvm/x86.c	2017-02-02 11:17:17.063756725 -0200
> > +++ kvm-pvfreq/arch/x86/kvm/x86.c	2017-02-02 11:17:17.822752510 -0200
> > @@ -6219,10 +6219,58 @@
> 
> [Here lived copy-paste.]
> 
> >  int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
> >  {
> >  	unsigned long nr, a0, a1, a2, a3, ret;
> >  	int op_64_bit, r;
> > +	bool cpl_check;
> >  
> >  	r = kvm_skip_emulated_instruction(vcpu);
> >  
> > @@ -6246,7 +6294,13 @@
> >  		a3 &= 0xFFFFFFFF;
> >  	}
> >  
> > -	if (kvm_x86_ops->get_cpl(vcpu) != 0) {
> > +	cpl_check = true;
> > +	if (nr == KVM_HC_FREQ_UP || nr == KVM_HC_FREQ_DOWN ||
> > +	    nr == KVM_HC_FREQ_MIN || nr == KVM_HC_FREQ_MAX)
> > +		if (vcpu->arch.allow_freq_hypercall == true)
> > +			cpl_check = false;
> > +
> > +	if (cpl_check == true && kvm_x86_ops->get_cpl(vcpu) != 0) {
> >  		ret = -KVM_EPERM;
> >  		goto out;
> >  	}
> > @@ -6262,6 +6316,21 @@
> >  	case KVM_HC_CLOCK_PAIRING:
> >  		ret = kvm_pv_clock_pairing(vcpu, a0, a1);
> >  		break;
> > +#ifdef CONFIG_CPU_FREQ_GOV_USERSPACE
> 
> CONFIG_CPU_FREQ_GOV_USERSPACE should be checked when enabling the
> capability.
> 
> > +	case KVM_HC_FREQ_UP:
> > +		ret = kvm_pvfreq_up(vcpu);
> > +		break;
> > +	case KVM_HC_FREQ_DOWN:
> > +		ret = kvm_pvfreq_down(vcpu);
> > +		break;
> > +	case KVM_HC_FREQ_MAX:
> > +		ret = kvm_pvfreq_max(vcpu);
> > +		break;
> > +	case KVM_HC_FREQ_MIN:
> > +		ret = kvm_pvfreq_min(vcpu);
> > +		break;
> 
> Having 4 hypercalls for this is an overkill.
> You can make it one hypercall with an argument.

Fine.

> And the argument doesn't have to be enum {UP, DOWN, MAX, MIN}, but an
> int, which would also allow you to do -2 steps.

Are you suggesting to have an integer to signify the number of steps up
or down?

> A number over the capabilities of stepping would just map to MAX/MIN.

Then MAX == any positive value above the number of steps
     MIN == any negative value below the negative of number of steps

Sure.

> Avoiding an absolute scale for interface simplifies migration, where the
> guest cannot really depend much on this.  Except that calling it with
> MIN (INT_MIN) will get the minimum and MAX (INT_MAX) the maximum
> frequency.

Are you suggesting for the hypercall to return the maximum/minimum
frequency if called with the highest integer and lowest negative integer 
respectively? (That same hypercall).

Sure.

>> Please explicitly say in the documentation that things like the number of
> steps, which the guest can learn by doing MAX and then -1 until the
> hypercall fails, is undefined and should not be depended upon.

Sure, because it fails over migration.

> Userspace might still want to know the number of steps to avoid a useless
> hypercall -- I think we should return a different value when the limit
> is reached, not just after the guest wants to go past it.

Are you suggesting to return a different value when going from 

max-1 -> max  
and
min+1 -> min

frequencies?

Fine.

> > +#endif
> > +
> >  	default:
> >  		ret = -KVM_ENOSYS;
> >  		break;
> 
> And thinking more about migration, userspace cannot learn the current
> frequency (at least MIN/MAX), so the new host will just pick at random,
> which will break userspace's expectations that it cannot increase or
> decrease the frequency.  Is migration left for the future, because DPDK
> doesn't migrate anyway?
> 
> Thanks.

The new host should start with the highest frequency always. Then
the frequency tuning algorithm can reduce frequency afterwards.

Migration is a desired feature for DPDK, so it should be supported
(that's one reason why virtio-net drivers are used in the guest BTW).

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 0/3] KVM CPU frequency change hypercalls
  2017-02-03 18:14   ` Marcelo Tosatti
@ 2017-02-03 19:09     ` Radim Krcmar
  2017-02-23 17:35       ` Paolo Bonzini
  0 siblings, 1 reply; 28+ messages in thread
From: Radim Krcmar @ 2017-02-03 19:09 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: kvm, linux-kernel, Paolo Bonzini, Rafael J. Wysocki, Viresh Kumar

2017-02-03 16:14-0200, Marcelo Tosatti:
> On Fri, Feb 03, 2017 at 05:43:50PM +0100, Radim Krcmar wrote:
>> 2017-02-02 15:47-0200, Marcelo Tosatti:
>> > Implement KVM hypercalls for the guest
>> > to issue frequency changes.
>> > 
>> > Current situation with DPDK and frequency changes is as follows:
>> > An algorithm in the guest decides when to increase/decrease
>> > frequency based on the queue length of the device.
>> 
>> Does the algorithm compute with the magnitude of frequency steps?
>> 
>> (e.g. if CPU can step with 200 MHz granularity, does the algorithm ever
>>  do 400 MHz at once, because it assumes that frequency would be enough
>>  to handle the load?)
> 
> No, it does not know the frequency directly. It only "knows" the
> frequency indirectly by the size of the network queue (that is, if the
> network queue is above a threshold, then frequency is "too low" and
> should be increased).

I see, thanks.  You added MAX to the interface ... so DPDK has two
thresholds and forces MAX frequency after reaching the second one?

>> > A direct hypercall from userspace is the fastest and most direct
>> > method for the guest to change frequency and does not suffer
>> > from the issues above.
>> 
>> Right, userspace on bare-metal cannot change frequency directly.
> 
> Yes it can: write to sysfs (not sure what you meant).

On x86, the frequency can only be changed from CPL 0, but userspace runs
at CPL 3.  sysfs is used because the userspace cannot change frequency
directly (behind the kernel's back).

(KVM could avoid trapping guest's access to MSRs that control frequency,
 which would allow us to do it behind host's back, but still not directly
 from guest userspace, because MSRs only work at CPL 0.)

>> > The usage scenario for these hypercalls is pinned vCPUs <-> pCPUs.
>> 
>> And pinned tasks <-> vCPUs, because the guest kernel has no idea what
>> frequency is being used or desired on its virtualware, 
> 
> And it does not have to know...

Probably not in DPDK setups, but it has to know in general.

>> so the kernel
>> cannot even change frequency without introducing a bug ...
> 
> Not sure what you are thinking, please be more verbose.

One reason why we have a kernel/userspace split is to allow sharing of
CPU time.  Each application then has its state that the kernel keeps track
of and saves/restores while time-multiplexing.

Our frequency scaling interface goes against the idea -- guest kernel
cannot schedule multiple userspaces on the same vCPU, because they could
conflict by overriding frequency.

i.e. our feature implies userspace tasks pinned to isolated vCPUs.

>> I'm not happy about this hole through layers of isolations.
>> 
>> The domain of valid users is very small and a problem is that any
>> program with access to /dev/kvm gains the ability to change host CPU
>> frequency if the host happens to use the userspace governor.
> 
> Yes.
> 
>> We should at least enable this feature only if /dev/kvm is root-only.
> 
> Fine, can change that, will fix in -v2. Maybe there is a capability 
> to change frequency... should require that capability (or root 
> if there is none).

Capability sounds good too.

Thanks.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 3/3] KVM: x86: frequency change hypercalls
  2017-02-03 18:24     ` Marcelo Tosatti
@ 2017-02-03 19:28       ` Radim Krcmar
  0 siblings, 0 replies; 28+ messages in thread
From: Radim Krcmar @ 2017-02-03 19:28 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: kvm, linux-kernel, Paolo Bonzini, Rafael J. Wysocki, Viresh Kumar

2017-02-03 16:24-0200, Marcelo Tosatti:
> On Fri, Feb 03, 2017 at 06:40:34PM +0100, Radim Krcmar wrote:
>> You can make it one hypercall with an argument.
> 
> Fine.
> 
>> And the argument doesn't have to be enum {UP, DOWN, MAX, MIN}, but an
>> int, which would also allow you to do -2 steps.
> 
> Are you suggesting to have an integer to signify the number of steps up
> or down?

Yes.

>> A number over the capabilities of stepping would just map to MAX/MIN.
> 
> Then MAX == any positive value above the number of steps
>      MIN == any negative value below the negative of number of steps
> 
> Sure.
> 
>> Avoiding an absolute scale for interface simplifies migration, where the
>> guest cannot really depend much on this.  Except that calling it with
>> MIN (INT_MIN) will get the minimum and MAX (INT_MAX) the maximum
>> frequency.
> 
> Are you suggesting for the hypercall to return the maximum/minimum
> frequency if called with the highest integer and lowest negative integer 
> respectively? (That same hypercall).

No, I meant that we will guarantee that the guest will always get (the
CPU will be in) the minimal frequency when hypercall parameter is
INT_MIN and the maximal with INT_MAX -- just so the guest wouldn't lose
the ability which you provided by MIN and MAX hypercalls.

(We could also make a stronger assertion that there is never going to be
 more than INT_MAX steps, CPUs that run KVM will probably never have
 that fine frequency control.)

>> Please explicitly say in the documentation that things like the number of
>> steps, which the guest can learn by doing MAX and then -1 until the
>> hypercall fails, is undefined and should not be depended upon.
> 
> Sure, because it fails over migration.
> 
>> Userspace might still want to know the number of steps to avoid a useless
>> hypercall -- I think we should return a different value when the limit
>> is reached, not just after the guest wants to go past it.
> 
> Are you suggesting to return a different value when going from 
> 
> max-1 -> max  
> and
> min+1 -> min
> 
> frequencies?

Yes.  Like you do now when going "up" from "max".
It saves one call of the hypercall.

> Fine.
> 
>> > +#endif
>> > +
>> >  	default:
>> >  		ret = -KVM_ENOSYS;
>> >  		break;
>> 
>> And thinking more about migration, userspace cannot learn the current
>> frequency (at least MIN/MAX), so the new host will just pick at random,
>> which will break userspace's expectations that it cannot increase or
>> decrease the frequency.  Is migration left for the future, because DPDK
>> doesn't migrate anyway?
>> 
>> Thanks.
> 
> The new host should start with the highest frequency always. Then
> the frequency tuning algorithm can reduce frequency afterwards.

That is not going to work on migration.

Suppose we do that and the CPU is in minimal frequency before the
migration.  This means that queue is below the threshold and userspace
knows that it is in minimum frequency (because we provide that
information when going down), so it doesn't trigger useless hypercalls.

After migration, the host would set frequency to maximum, but userspace
would still think that it is minimal, so it would not decrease it.

The only reason for this series -- power saving -- is lost.

> Migration is a desired feature for DPDK, so it should be supported
> (that's one reason why virtio-net drivers are used in the guest BTW).

Oh, nice,

thanks.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 2/3] KVM: x86: introduce ioctl to allow frequency hypercalls
  2017-02-03 17:03   ` Radim Krcmar
@ 2017-02-22 21:18     ` Marcelo Tosatti
  2017-02-23 16:48       ` Radim Krcmar
  0 siblings, 1 reply; 28+ messages in thread
From: Marcelo Tosatti @ 2017-02-22 21:18 UTC (permalink / raw)
  To: Radim Krcmar
  Cc: kvm, linux-kernel, Paolo Bonzini, Rafael J. Wysocki, Viresh Kumar

On Fri, Feb 03, 2017 at 06:03:37PM +0100, Radim Krcmar wrote:
> 2017-02-02 15:47-0200, Marcelo Tosatti:
> > For most VMs, modifying the host frequency is an undesired 
> > operation. Introduce ioctl to enable the guest to 
> > modify host CPU frequency.
> > 
> > 
> > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> > ---
> >  arch/x86/include/asm/kvm_host.h |    2 ++
> >  arch/x86/include/uapi/asm/kvm.h |    5 +++++
> >  arch/x86/kvm/x86.c              |   20 ++++++++++++++++++++
> >  include/uapi/linux/kvm.h        |    3 +++
> >  virt/kvm/kvm_main.c             |    2 ++
> >  5 files changed, 32 insertions(+)
> > 
> > Index: kvm-pvfreq/arch/x86/kvm/x86.c
> > ===================================================================
> > --- kvm-pvfreq.orig/arch/x86/kvm/x86.c	2017-01-31 10:32:33.023378783 -0200
> > +++ kvm-pvfreq/arch/x86/kvm/x86.c	2017-01-31 10:34:25.443618639 -0200
> > @@ -3665,6 +3665,26 @@
> >  		r = kvm_vcpu_ioctl_enable_cap(vcpu, &cap);
> >  		break;
> >  	}
> > +	case KVM_SET_VCPU_ALLOW_FREQ_HC: {
> 
> Just enable the frequency hypercalls with KVM_ENABLE_CAP ioctl and get
> rid of this ioctl.
> (I don't think that we want to allow disabling this capability.)

Not sure. What if you change the role of vcpus and now want 
to change vcpu-1 from a realtime vcpu (one where only DPDK runs) 
to a multi-user vcpu without a reboot?

> > +		struct kvm_vcpu_allow_freq freq;
> > +
> > +		r = -EFAULT;
> > +		if (copy_from_user(&freq, argp, sizeof(freq)))
> > +			goto out;
> > +		vcpu->arch.allow_freq_hypercall = freq.enable;
> 
> Enabling the capability should also set a bit in KVM_CPUID_FEATURES,
> a well-behaved guest won't use it otherwise.

Fixed.

> > +		r = 0;
> > +		break;
> > +	}
> > +	case KVM_GET_VCPU_ALLOW_FREQ_HC: {
> 
> And this ioctl has no use, so it should be omitted.
> 
> (Userspace doesn't learn anything it doesn't already know.
>  The feature is unsafe, so the default must be 0.)

Fixed.

> > +		struct kvm_vcpu_allow_freq freq;
> > +
> > +		memset(&freq, 0, sizeof(struct kvm_vcpu_allow_freq));
> > +		r = -EFAULT;
> > +		if (copy_to_user(&freq, argp, sizeof(freq)))
> > +			break;
> > +		r = 0;
> > +		break;
> > +	}
> >  	default:
> >  		r = -EINVAL;
> >  	}
> > Index: kvm-pvfreq/include/uapi/linux/kvm.h
> > ===================================================================
> > --- kvm-pvfreq.orig/include/uapi/linux/kvm.h	2017-01-31 10:32:33.023378783 -0200
> > +++ kvm-pvfreq/include/uapi/linux/kvm.h	2017-01-31 10:32:38.000389402 -0200
> > @@ -871,6 +871,7 @@
> >  #define KVM_CAP_S390_USER_INSTR0 130
> >  #define KVM_CAP_MSI_DEVID 131
> >  #define KVM_CAP_PPC_HTM 132
> > +#define KVM_CAP_ALLOW_FREQ_HC 133
> >  
> >  #ifdef KVM_CAP_IRQ_ROUTING
> >  
> > @@ -1281,6 +1282,8 @@
> >  #define KVM_S390_GET_IRQ_STATE	  _IOW(KVMIO, 0xb6, struct kvm_s390_irq_state)
> >  /* Available with KVM_CAP_X86_SMM */
> >  #define KVM_SMI                   _IO(KVMIO,   0xb7)
> > +#define KVM_SET_VCPU_ALLOW_FREQ_HC   _IO(KVMIO,   0xb8)
> > +#define KVM_GET_VCPU_ALLOW_FREQ_HC   _IO(KVMIO,   0xb9)
> >  
> >  #define KVM_DEV_ASSIGN_ENABLE_IOMMU	(1 << 0)
> >  #define KVM_DEV_ASSIGN_PCI_2_3		(1 << 1)
> > Index: kvm-pvfreq/arch/x86/include/uapi/asm/kvm.h
> > ===================================================================
> > --- kvm-pvfreq.orig/arch/x86/include/uapi/asm/kvm.h	2017-01-31 10:32:33.023378783 -0200
> > +++ kvm-pvfreq/arch/x86/include/uapi/asm/kvm.h	2017-01-31 10:32:38.000389402 -0200
> > @@ -357,4 +357,9 @@
> >  #define KVM_X86_QUIRK_LINT0_REENABLED	(1 << 0)
> >  #define KVM_X86_QUIRK_CD_NW_CLEARED	(1 << 1)
> >  
> > +struct kvm_vcpu_allow_freq {
> > +	__u16 enable;
> > +	__u16 pad[7];
> > +};
> > +
> >  #endif /* _ASM_X86_KVM_H */
> > Index: kvm-pvfreq/virt/kvm/kvm_main.c
> > ===================================================================
> > --- kvm-pvfreq.orig/virt/kvm/kvm_main.c	2017-01-31 10:32:33.023378783 -0200
> > +++ kvm-pvfreq/virt/kvm/kvm_main.c	2017-01-31 10:32:38.001389404 -0200
> > @@ -2938,6 +2938,8 @@
> >  #endif
> >  	case KVM_CAP_MAX_VCPU_ID:
> >  		return KVM_MAX_VCPU_ID;
> > +	case KVM_CAP_ALLOW_FREQ_HC:
> 
> This can share an existing return 1 and would be elsewhere with
> KVM_ENABLE_CAP.

Fixed.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 2/3] KVM: x86: introduce ioctl to allow frequency hypercalls
  2017-02-22 21:18     ` Marcelo Tosatti
@ 2017-02-23 16:48       ` Radim Krcmar
  2017-02-23 17:31         ` Paolo Bonzini
  0 siblings, 1 reply; 28+ messages in thread
From: Radim Krcmar @ 2017-02-23 16:48 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: kvm, linux-kernel, Paolo Bonzini, Rafael J. Wysocki, Viresh Kumar

2017-02-22 18:18-0300, Marcelo Tosatti:
> On Fri, Feb 03, 2017 at 06:03:37PM +0100, Radim Krcmar wrote:
>> 2017-02-02 15:47-0200, Marcelo Tosatti:
>> > For most VMs, modifying the host frequency is an undesired 
>> > operation. Introduce ioctl to enable the guest to 
>> > modify host CPU frequency.
>> > 
>> > 
>> > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
>> > ---
>> >  arch/x86/include/asm/kvm_host.h |    2 ++
>> >  arch/x86/include/uapi/asm/kvm.h |    5 +++++
>> >  arch/x86/kvm/x86.c              |   20 ++++++++++++++++++++
>> >  include/uapi/linux/kvm.h        |    3 +++
>> >  virt/kvm/kvm_main.c             |    2 ++
>> >  5 files changed, 32 insertions(+)
>> > 
>> > Index: kvm-pvfreq/arch/x86/kvm/x86.c
>> > ===================================================================
>> > --- kvm-pvfreq.orig/arch/x86/kvm/x86.c	2017-01-31 10:32:33.023378783 -0200
>> > +++ kvm-pvfreq/arch/x86/kvm/x86.c	2017-01-31 10:34:25.443618639 -0200
>> > @@ -3665,6 +3665,26 @@
>> >  		r = kvm_vcpu_ioctl_enable_cap(vcpu, &cap);
>> >  		break;
>> >  	}
>> > +	case KVM_SET_VCPU_ALLOW_FREQ_HC: {
>> 
>> Just enable the frequency hypercalls with KVM_ENABLE_CAP ioctl and get
>> rid of this ioctl.
>> (I don't think that we want to allow disabling this capability.)
> 
> Not sure. What if you change the role of vcpus and now want 
> to change vcpu-1 from a realtime vcpu (one where only DPDK runs) 
> to a multi-user vcpu without a reboot?

If we want to be dynamic, I'd rather have it as a hypercall that toggles
the permissions for frequency change under CPL > 0.  This would require
guest kernel changes, though.
As a benefit, we could enable the capability by default for all VCPUs
and just let the guest kernel control which task can change frequency
without exits into userspace.
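
For illustration (invented names, assuming a hypercall like the one
described above existed), the guest kernel side could be as small as:

#define HC_FREQ_ALLOW_USER      51      /* placeholder hypercall number */

/* Called e.g. from a context-switch hook on the isolated CPU: grant the
 * permission when the DPDK task is scheduled in, revoke it when it is
 * scheduled out. */
static void pvfreq_allow_userspace(bool allow)
{
        kvm_hypercall1(HC_FREQ_ALLOW_USER, allow);
}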

Having QEMU toggle the whole feature would require some QEMU emulated
device, unless we really want to do it manually on both sides.

Doing it manually doesn't sound useful outside of testing ...
is DPDK actually being used "dynamically"?
(I thought that the setup is decided when the host boots.)

Thanks.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 2/3] KVM: x86: introduce ioctl to allow frequency hypercalls
  2017-02-23 16:48       ` Radim Krcmar
@ 2017-02-23 17:31         ` Paolo Bonzini
  0 siblings, 0 replies; 28+ messages in thread
From: Paolo Bonzini @ 2017-02-23 17:31 UTC (permalink / raw)
  To: Radim Krcmar, Marcelo Tosatti
  Cc: kvm, linux-kernel, Rafael J. Wysocki, Viresh Kumar



On 23/02/2017 17:48, Radim Krcmar wrote:
> Doing it manually doesn't sound useful outside of testing ...
> is DPDK actually being used "dynamically"?
> (I thought that the setup is decided when the host boots.)

Or at the very least when the guest boots.  I agree that KVM_ENABLE_CAP
is enough.  However, changing KVM_CPUID_FEATURES should be handled by
userspace (KVM doesn't know about 0x400000xx leaves at all).
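
As a rough sketch of the userspace side (error handling omitted;
KVM_CAP_ALLOW_FREQ_HC is the value proposed in patch 2, everything else
is the standard KVM API), enabling it per vCPU would look like:

#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int enable_freq_hc(int vcpu_fd)
{
        struct kvm_enable_cap cap;

        memset(&cap, 0, sizeof(cap));
        cap.cap = KVM_CAP_ALLOW_FREQ_HC;
        /* The matching KVM_CPUID_FEATURES bit (leaf 0x40000001) would
         * then be set by userspace itself via KVM_SET_CPUID2. */
        return ioctl(vcpu_fd, KVM_ENABLE_CAP, &cap);
}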

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 0/3] KVM CPU frequency change hypercalls
  2017-02-03 19:09     ` Radim Krcmar
@ 2017-02-23 17:35       ` Paolo Bonzini
  2017-02-23 23:19         ` Marcelo Tosatti
  0 siblings, 1 reply; 28+ messages in thread
From: Paolo Bonzini @ 2017-02-23 17:35 UTC (permalink / raw)
  To: Radim Krcmar, Marcelo Tosatti
  Cc: kvm, linux-kernel, Rafael J. Wysocki, Viresh Kumar



On 03/02/2017 20:09, Radim Krcmar wrote:
> One reason why we have a kernel/userspace split is to allow sharing of
> CPU time.  Each application then has its state that the kernel keeps
> track of and saves/restores while time-multiplexing.
> 
> Our frequency scaling interface goes against the idea -- guest kernel
> cannot schedule multiple userspaces on the same vCPU, because they could
> conflict by overriding frequency.
> 
> i.e. our feature implies userspace tasks pinned to isolated vCPUs.

That's bad.  This feature is broken by design unless it does proper
save/restore across preemption.

You don't need a hypercall.  Add a cpufreq driver in DPDK that doesn't
use sysfs, and connect it to a daemon in the host through virtio-serial
or vsock.
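
As a rough sketch of the guest side of that idea over vsock (the wire
format and the port number are made up for illustration, and the host
daemon side is not shown):

#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>
#include <linux/vm_sockets.h>

static int request_freq(unsigned int cpu, unsigned int khz)
{
        struct sockaddr_vm addr = {
                .svm_family = AF_VSOCK,
                .svm_cid = VMADDR_CID_HOST,
                .svm_port = 12345,      /* arbitrary example port */
        };
        char msg[64];
        ssize_t ret;
        int fd;

        fd = socket(AF_VSOCK, SOCK_STREAM, 0);
        if (fd < 0)
                return -1;
        if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
                close(fd);
                return -1;
        }
        snprintf(msg, sizeof(msg), "setspeed %u %u\n", cpu, khz);
        ret = write(fd, msg, strlen(msg));
        close(fd);
        return ret < 0 ? -1 : 0;
}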

Paolo

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 0/3] KVM CPU frequency change hypercalls
  2017-02-23 17:35       ` Paolo Bonzini
@ 2017-02-23 23:19         ` Marcelo Tosatti
  2017-02-24  9:18           ` Paolo Bonzini
  0 siblings, 1 reply; 28+ messages in thread
From: Marcelo Tosatti @ 2017-02-23 23:19 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Radim Krcmar, kvm, linux-kernel, Rafael J. Wysocki, Viresh Kumar

On Thu, Feb 23, 2017 at 06:35:24PM +0100, Paolo Bonzini wrote:
> 
> 
> On 03/02/2017 20:09, Radim Krcmar wrote:
> > One reason why we have a kernel/userspace split is to allow sharing of
> > CPU time.  Each application then has its state that the kernel keeps
> > track of and saves/restores while time-multiplexing.
> > 
> > Our frequency scaling interface goes against the idea -- guest kernel
> > cannot schedule multiple userspaces on the same vCPU, because they could
> > conflict by overriding frequency.
> > 
> > i.e. our feature implies userspace tasks pinned to isolated vCPUs.

This is how cpufreq-userspace works:

2.2 Governor
------------

On all other cpufreq implementations, these boundaries still need to
be set. Then, a "governor" must be selected. Such a "governor" decides
what speed the processor shall run within the boundaries. One such
"governor" is the "userspace" governor. This one allows the user - or
a yet-to-implement userspace program - to decide what specific speed
the processor shall run at.
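
For illustration, such a per-CPU userspace agent drives the frequency
through the standard cpufreq sysfs files; minimal sketch, error handling
omitted:

#include <stdio.h>

static void set_cpu_khz(int cpu, unsigned int khz)
{
        char path[128];
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_setspeed",
                 cpu);
        f = fopen(path, "w");
        if (!f)
                return;
        /* Only accepted while the "userspace" governor is selected. */
        fprintf(f, "%u\n", khz);
        fclose(f);
}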

> That's bad.  This feature is broken by design unless it does proper
> save/restore across preemption.

What's the current use case, or foreseeable future use case, for
save/restore across preemption? (That would validate the "broken by
design" claim.)

> You don't need a hypercall.  Add a cpufreq driver in DPDK that doesn't
> use sysfs, and connect it to a daemon in the host through virtio-serial
> or vsock.
> 
> Paolo

Hypercalls overcome the problems mentioned in the first email
of the thread; I think you missed them:

"[patch 0/3] KVM CPU frequency change hypercalls"

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 0/3] KVM CPU frequency change hypercalls
  2017-02-23 23:19         ` Marcelo Tosatti
@ 2017-02-24  9:18           ` Paolo Bonzini
  2017-02-24 11:50             ` Marcelo Tosatti
  0 siblings, 1 reply; 28+ messages in thread
From: Paolo Bonzini @ 2017-02-24  9:18 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Radim Krcmar, kvm, linux-kernel, Rafael J. Wysocki, Viresh Kumar



On 24/02/2017 00:19, Marcelo Tosatti wrote:
>>> i.e. our feature implies userspace tasks pinned to isolated vCPUs.
> This is how cpufreq-userspace works:
> 
> 2.2 Governor
> ------------
> 
> On all other cpufreq implementations, these boundaries still need to
> be set. Then, a "governor" must be selected. Such a "governor" decides
> what speed the processor shall run within the boundaries. One such
> "governor" is the "userspace" governor. This one allows the user - or
> a yet-to-implement userspace program - to decide what specific speed
> the processor shall run at.

The userspace program sets a policy for the whole system.

>> That's bad.  This feature is broken by design unless it does proper
>> save/restore across preemption.
> 
> Whats the current usecase, or forseeable future usecase, for save/restore
> across preemption again? (which would validate the broken by design
> claim).

Stop a guest that is using cpufreq, start a guest that is not using it.
The second guest's performance now depends on the state that the first
guest left in cpufreq.

I think this is abusing the userspace governor.  Unfortunately cpufreq
governors cannot be stacked.

Paolo

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 0/3] KVM CPU frequency change hypercalls
  2017-02-24  9:18           ` Paolo Bonzini
@ 2017-02-24 11:50             ` Marcelo Tosatti
  2017-02-24 12:17               ` Paolo Bonzini
  0 siblings, 1 reply; 28+ messages in thread
From: Marcelo Tosatti @ 2017-02-24 11:50 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Radim Krcmar, kvm, linux-kernel, Rafael J. Wysocki, Viresh Kumar

On Fri, Feb 24, 2017 at 10:18:59AM +0100, Paolo Bonzini wrote:
> 
> 
> On 24/02/2017 00:19, Marcelo Tosatti wrote:
> >>> i.e. our feature implies userspace tasks pinned to isolated vCPUs.
> > This is how cpufreq-userspace works:
> > 
> > 2.2 Governor
> > ------------
> > 
> > On all other cpufreq implementations, these boundaries still need to
> > be set. Then, a "governor" must be selected. Such a "governor" decides
> > what speed the processor shall run within the boundaries. One such
> > "governor" is the "userspace" governor. This one allows the user - or
> > a yet-to-implement userspace program - to decide what specific speed
> > the processor shall run at.
> 
> The userspace program sets a policy for the whole system.

No, it's per CPU.

> >> That's bad.  This feature is broken by design unless it does proper
> >> save/restore across preemption.
> > 
> > Whats the current usecase, or forseeable future usecase, for save/restore
> > across preemption again? (which would validate the broken by design
> > claim).
> 
> Stop a guest that is using cpufreq, start a guest that is not using it.
> The second guest's performance now depends on the state that the first
> guest left in cpufreq.

Nothing forbids the host from implementing switching with the
current hypercall interface: all you need is a scheduler
hook.
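
Roughly like the sketch below -- the pvfreq_*() helpers and the *_khz
fields are illustrative only, not part of the posted patches; only
vcpu->arch.allow_freq_hypercall comes from the series:

static void pvfreq_sched_in(struct kvm_vcpu *vcpu, int cpu)
{
        if (!vcpu->arch.allow_freq_hypercall)
                return;
        /* remember what the pCPU was running at, apply the vCPU's request */
        vcpu->arch.saved_khz = pvfreq_get_khz(cpu);
        pvfreq_set_khz(cpu, vcpu->arch.requested_khz);
}

static void pvfreq_sched_out(struct kvm_vcpu *vcpu, int cpu)
{
        if (!vcpu->arch.allow_freq_hypercall)
                return;
        /* restore whatever setting was in effect before the vCPU ran */
        pvfreq_set_khz(cpu, vcpu->arch.saved_khz);
}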

> I think this is abusing the userspace governor.  Unfortunately cpufreq
> governors cannot be stacked.
> 
> Paolo

This is a special use case where only the app in the guest knows
what the most appropriate frequency is at a given time.
This is what cpufreq-userspace is supposed to allow userspace to do,
but in this case "userspace" is the guest, so I don't
see this as an abuse at all.

Timeshared setups are by definition not deterministic: 
your task A could be interrupted by another task B 
with results similar to a lower frequency being set.

So saying that

"Our frequency scaling interface goes against the idea -- guest kernel
 cannot schedule multiple userspaces on the same vCPU, because they
 could conflict by overriding frequency."

assumes that, in a timeshared system, an application is guaranteed a
particular frequency. But that does not make sense: it's a timeshared
system in the first place, so there is no determinism regarding
execution time.

Moreover, there is no notion of "per-task CPU frequency" in Linux
(there could be; this whole governor business, with the user
being responsible for setting up the governor, is pretty sucky
IMO).

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 0/3] KVM CPU frequency change hypercalls
  2017-02-24 11:50             ` Marcelo Tosatti
@ 2017-02-24 12:17               ` Paolo Bonzini
  2017-02-24 13:04                 ` Marcelo Tosatti
  0 siblings, 1 reply; 28+ messages in thread
From: Paolo Bonzini @ 2017-02-24 12:17 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Radim Krcmar, kvm, linux-kernel, Rafael J. Wysocki, Viresh Kumar



On 24/02/2017 12:50, Marcelo Tosatti wrote:
>>>
>>> On all other cpufreq implementations, these boundaries still need to
>>> be set. Then, a "governor" must be selected. Such a "governor" decides
>>> what speed the processor shall run within the boundaries. One such
>>> "governor" is the "userspace" governor. This one allows the user - or
>>> a yet-to-implement userspace program - to decide what specific speed
>>> the processor shall run at.
>> The userspace program sets a policy for the whole system.
> No, its per cpu.

Yeah, what I mean is that the userspace program can be per-CPU, but it
looks at all the processes running on that CPU ("the whole system").
This is very different from a guest, which is isolated.

>>>> That's bad.  This feature is broken by design unless it does proper
>>>> save/restore across preemption.
>>> Whats the current usecase, or forseeable future usecase, for save/restore
>>> across preemption again? (which would validate the broken by design
>>> claim).
>> Stop a guest that is using cpufreq, start a guest that is not using it.
>> The second guest's performance now depends on the state that the first
>> guest left in cpufreq.
>
> Nothing forbids the host to implement switching with the
> current hypercall interface: all you need is a scheduler
> hook.

Can it be done in vcpu_load/vcpu_put?  But you still would have two
components (KVM and sysfs) potentially fighting over the frequency, and
that's still a bit ugly.

Paolo

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 0/3] KVM CPU frequency change hypercalls
  2017-02-24 12:17               ` Paolo Bonzini
@ 2017-02-24 13:04                 ` Marcelo Tosatti
  2017-02-24 15:34                   ` Paolo Bonzini
  0 siblings, 1 reply; 28+ messages in thread
From: Marcelo Tosatti @ 2017-02-24 13:04 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Radim Krcmar, kvm, linux-kernel, Rafael J. Wysocki, Viresh Kumar

On Fri, Feb 24, 2017 at 01:17:07PM +0100, Paolo Bonzini wrote:
> 
> 
> On 24/02/2017 12:50, Marcelo Tosatti wrote:
> >>>
> >>> On all other cpufreq implementations, these boundaries still need to
> >>> be set. Then, a "governor" must be selected. Such a "governor" decides
> >>> what speed the processor shall run within the boundaries. One such
> >>> "governor" is the "userspace" governor. This one allows the user - or
> >>> a yet-to-implement userspace program - to decide what specific speed
> >>> the processor shall run at.
> >> The userspace program sets a policy for the whole system.
> > No, its per cpu.
> 
> Yeah, what I mean is that userspace program can be per-CPU, but it looks
> at all the processes running on that CPU ("the whole system").  This is
> very different from a guest, which is isolated.
> 
> >>>> That's bad.  This feature is broken by design unless it does proper
> >>>> save/restore across preemption.
> >>> Whats the current usecase, or forseeable future usecase, for save/restore
> >>> across preemption again? (which would validate the broken by design
> >>> claim).
> >> Stop a guest that is using cpufreq, start a guest that is not using it.
> >> The second guest's performance now depends on the state that the first
> >> guest left in cpufreq.
> >
> > Nothing forbids the host to implement switching with the
> > current hypercall interface: all you need is a scheduler
> > hook.
> 
> Can it be done in vcpu_load/vcpu_put?  But you still would have two
> components (KVM and sysfs) potentially fighting over the frequency, and
> that's still a bit ugly.
> 
> Paolo

Change the frequency at vcpu_load/vcpu_put? Yes: call into
cpufreq-userspace. But there is no notion of "per-task frequency" in the
Linux kernel (which was the starting point of this subthread).
 
But if you configure all CPUs in the system as cpufreq-userspace,
then some other agent (a userspace program) has to decide the frequency
for the other CPUs.

Which agent would do that and why? That's why I initially said "what's
the use case".

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 0/3] KVM CPU frequency change hypercalls
  2017-02-24 13:04                 ` Marcelo Tosatti
@ 2017-02-24 15:34                   ` Paolo Bonzini
  2017-02-24 16:54                     ` Rafael J. Wysocki
  2017-02-28  2:45                     ` Marcelo Tosatti
  0 siblings, 2 replies; 28+ messages in thread
From: Paolo Bonzini @ 2017-02-24 15:34 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Radim Krcmar, kvm, linux-kernel, Rafael J. Wysocki, Viresh Kumar



On 24/02/2017 14:04, Marcelo Tosatti wrote:
>>>>> Whats the current usecase, or forseeable future usecase, for save/restore
>>>>> across preemption again? (which would validate the broken by design
>>>>> claim).
>>>> Stop a guest that is using cpufreq, start a guest that is not using it.
>>>> The second guest's performance now depends on the state that the first
>>>> guest left in cpufreq.
>>> Nothing forbids the host to implement switching with the
>>> current hypercall interface: all you need is a scheduler
>>> hook.
>> Can it be done in vcpu_load/vcpu_put?  But you still would have two
>> components (KVM and sysfs) potentially fighting over the frequency, and
>> that's still a bit ugly.
>
> Change the frequency at vcpu_load/vcpu_put? Yes: call into
> cpufreq-userspace. But there is no notion of "per-task frequency" on the
> Linux kernel (which was the starting point of this subthread).

There isn't, but this patchset is providing a direct path from a task to
cpufreq-userspace.  This is as close as you can get to a per-task frequency.

> But if you configure all CPUs in the system as cpufreq-userspace,
> then some other (userspace program) has to decide the frequency
> for the other CPUs.
> 
> Which agent would do that and why? Thats why i initially said "whats the
> usecase".

You could just pin them at the highest non-TurboBoost frequency until a
guest runs.  That's assuming that they are idle and, because of
isolcpus/nohz_full, they would be almost always in deep C state anyway.

Paolo

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 0/3] KVM CPU frequency change hypercalls
  2017-02-24 15:34                   ` Paolo Bonzini
@ 2017-02-24 16:54                     ` Rafael J. Wysocki
  2017-02-28  2:45                     ` Marcelo Tosatti
  1 sibling, 0 replies; 28+ messages in thread
From: Rafael J. Wysocki @ 2017-02-24 16:54 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marcelo Tosatti, Radim Krcmar, kvm, linux-kernel, Viresh Kumar

On Friday, February 24, 2017 04:34:52 PM Paolo Bonzini wrote:
> 
> On 24/02/2017 14:04, Marcelo Tosatti wrote:
> >>>>> Whats the current usecase, or forseeable future usecase, for save/restore
> >>>>> across preemption again? (which would validate the broken by design
> >>>>> claim).
> >>>> Stop a guest that is using cpufreq, start a guest that is not using it.
> >>>> The second guest's performance now depends on the state that the first
> >>>> guest left in cpufreq.
> >>> Nothing forbids the host to implement switching with the
> >>> current hypercall interface: all you need is a scheduler
> >>> hook.
> >> Can it be done in vcpu_load/vcpu_put?  But you still would have two
> >> components (KVM and sysfs) potentially fighting over the frequency, and
> >> that's still a bit ugly.
> >
> > Change the frequency at vcpu_load/vcpu_put? Yes: call into
> > cpufreq-userspace. But there is no notion of "per-task frequency" on the
> > Linux kernel (which was the starting point of this subthread).
> 
> There isn't, but this patchset is providing a direct path from a task to
> cpufreq-userspace.  This is as close as you can get to a per-task frequency.
> 
> > But if you configure all CPUs in the system as cpufreq-userspace,
> > then some other (userspace program) has to decide the frequency
> > for the other CPUs.
> > 
> > Which agent would do that and why? Thats why i initially said "whats the
> > usecase".
> 
> You could just pin them at the highest non-TurboBoost frequency until a
> guest runs.  That's assuming that they are idle and, because of
> isol_cpus/nohz_full, they would be almost always in deep C state anyway.

Good discussion so far, but it should be happening on the linux-pm list.

Would it be possible to repost the patches with a CC to linux-pm?

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 0/3] KVM CPU frequency change hypercalls
  2017-02-24 15:34                   ` Paolo Bonzini
  2017-02-24 16:54                     ` Rafael J. Wysocki
@ 2017-02-28  2:45                     ` Marcelo Tosatti
  2017-03-01 14:21                       ` Paolo Bonzini
  1 sibling, 1 reply; 28+ messages in thread
From: Marcelo Tosatti @ 2017-02-28  2:45 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Radim Krcmar, kvm, linux-kernel, Rafael J. Wysocki, Viresh Kumar

On Fri, Feb 24, 2017 at 04:34:52PM +0100, Paolo Bonzini wrote:
> 
> 
> On 24/02/2017 14:04, Marcelo Tosatti wrote:
> >>>>> Whats the current usecase, or forseeable future usecase, for save/restore
> >>>>> across preemption again? (which would validate the broken by design
> >>>>> claim).
> >>>> Stop a guest that is using cpufreq, start a guest that is not using it.
> >>>> The second guest's performance now depends on the state that the first
> >>>> guest left in cpufreq.
> >>> Nothing forbids the host to implement switching with the
> >>> current hypercall interface: all you need is a scheduler
> >>> hook.
> >> Can it be done in vcpu_load/vcpu_put?  But you still would have two
> >> components (KVM and sysfs) potentially fighting over the frequency, and
> >> that's still a bit ugly.
> >
> > Change the frequency at vcpu_load/vcpu_put? Yes: call into
> > cpufreq-userspace. But there is no notion of "per-task frequency" on the
> > Linux kernel (which was the starting point of this subthread).
> 
> There isn't, but this patchset is providing a direct path from a task to
> cpufreq-userspace.  This is as close as you can get to a per-task frequency.

Cpufreq-userspace is supposed to be used by tasks in userspace.
That's why it's called "userspace".

> > But if you configure all CPUs in the system as cpufreq-userspace,
> > then some other (userspace program) has to decide the frequency
> > for the other CPUs.
> > 
> > Which agent would do that and why? Thats why i initially said "whats the
> > usecase".
> 
> You could just pin them at the highest non-TurboBoost frequency until a
> guest runs.  That's assuming that they are idle and, because of
> isol_cpus/nohz_full, they would be almost always in deep C state anyway.
> 
> Paolo

The original claim of the thread was: "this feature (frequency
hypercalls) works for the pinned vcpu<->pcpu case, with the pcpu
dedicated exclusively to the vcpu; let's try to extend this to other
cases".

Which is a valid and useful direction to go.

However, there is currently no user for multiple vcpus on the same pcpu.

If there were multiple vcpus, each of them requesting a given
frequency, it would be necessary to either:

        1) Keep the pcpu at the highest of the requested
           frequencies,

                OR

        2) Accept that, since switching frequencies can take up
           to 70us (*) (depending on the processor), it's generally
           not worthwhile to switch frequencies on task switches.

So it's a dead end...

*: http://www.ena-hpc.org/2013/pdf/04.pdf

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 0/3] KVM CPU frequency change hypercalls
  2017-02-28  2:45                     ` Marcelo Tosatti
@ 2017-03-01 14:21                       ` Paolo Bonzini
  2017-03-01 15:11                         ` Marcelo Tosatti
  0 siblings, 1 reply; 28+ messages in thread
From: Paolo Bonzini @ 2017-03-01 14:21 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Radim Krcmar, kvm, linux-kernel, Rafael J. Wysocki, Viresh Kumar



On 28/02/2017 03:45, Marcelo Tosatti wrote:
> On Fri, Feb 24, 2017 at 04:34:52PM +0100, Paolo Bonzini wrote:
>>
>>
>> On 24/02/2017 14:04, Marcelo Tosatti wrote:
>>>>>>> Whats the current usecase, or forseeable future usecase, for save/restore
>>>>>>> across preemption again? (which would validate the broken by design
>>>>>>> claim).
>>>>>> Stop a guest that is using cpufreq, start a guest that is not using it.
>>>>>> The second guest's performance now depends on the state that the first
>>>>>> guest left in cpufreq.
>>>>> Nothing forbids the host to implement switching with the
>>>>> current hypercall interface: all you need is a scheduler
>>>>> hook.
>>>> Can it be done in vcpu_load/vcpu_put?  But you still would have two
>>>> components (KVM and sysfs) potentially fighting over the frequency, and
>>>> that's still a bit ugly.
>>>
>>> Change the frequency at vcpu_load/vcpu_put? Yes: call into
>>> cpufreq-userspace. But there is no notion of "per-task frequency" on the
>>> Linux kernel (which was the starting point of this subthread).
>>
>> There isn't, but this patchset is providing a direct path from a task to
>> cpufreq-userspace.  This is as close as you can get to a per-task frequency.
> 
> Cpufreq-userspace is supposed to be used by tasks in userspace.
> Thats why its called "userspace".

I think the intended use case is to have a daemon handling a system-wide
policy.  Examples are the historical (and now obsolete) users such as
cpufreqd, cpudyn, powernowd, or cpuspeed.  Alternatively, the user can
play the role of the daemon by writing to sysfs.

I've never seen userspace tasks talking to cpufreq-userspace to set
their own running frequency.  If DPDK does it, that's nasty in my
opinion, and we should find an interface that works best for both DPDK
and KVM.  That should be done on linux-pm, as Rafael suggested.

>>> But if you configure all CPUs in the system as cpufreq-userspace,
>>> then some other (userspace program) has to decide the frequency
>>> for the other CPUs.
>>>
>>> Which agent would do that and why? Thats why i initially said "whats the
>>> usecase".
>>
>> You could just pin them at the highest non-TurboBoost frequency until a
>> guest runs.  That's assuming that they are idle and, because of
>> isol_cpus/nohz_full, they would be almost always in deep C state anyway.
> 
> The original claim of the thread  was: "this feature (frequency
> hypercalls) works for pinned vcpu<->pcpu, pcpu dedicated exclusively
> to vcpu case, lets try to extend this to other cases".
> 
> Which is a valid and useful direction to go.
> 
> However there is no user for multiple vcpus in the same pcpu now.

You are still ignoring the case of one guest started after another, or
of another program started on a CPU that formerly was used by KVM.  They
don't have to be multiple users at the same time.

> If there were multiple vcpus, all of them requesting a given
> frequency, it would be necessary to:
> 
> 	1) Maintain frequency of the pcpu to the highest 
> 	   frequencies.
> 
> 		OR
> 
> 	2) Since switching frequencies can take up to 70us (*)
> 	   (depends on processor), its generally not worthwhile
> 	   to switch frequencies between task switches.

Is latency that important, or is it rather the overhead that deserves
attention?  The slides you linked
(http://www.ena-hpc.org/2013/pdf/04.pdf) suggest on page 17 that it's
around 10us.

One possibility is to do (1) if you have multiple tasks on the run queue
(or fall back to what is specified in sysfs) and (2) if you only have
one task.
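
Sketching that decision (everything below is hypothetical --
nr_running_on(), the pvfreq_*() helpers and the requested_khz field are
made-up names, not anything in the posted series):

static void pvfreq_apply(struct kvm_vcpu *vcpu, int cpu)
{
        if (nr_running_on(cpu) > 1)
                /* (1): several tasks share the pCPU, pin it high
                 * (or fall back to the sysfs setting) */
                pvfreq_set_khz(cpu, pvfreq_max_khz(cpu));
        else
                /* (2): the vCPU is alone, honor its request and avoid
                 * switching the frequency on every task switch */
                pvfreq_set_khz(cpu, vcpu->arch.requested_khz);
}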

Anyway, please repost with Cc to linux-pm so that we can restart the
discussion there.

Paolo

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 0/3] KVM CPU frequency change hypercalls
  2017-03-01 14:21                       ` Paolo Bonzini
@ 2017-03-01 15:11                         ` Marcelo Tosatti
  0 siblings, 0 replies; 28+ messages in thread
From: Marcelo Tosatti @ 2017-03-01 15:11 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Radim Krcmar, kvm, linux-kernel, Rafael J. Wysocki, Viresh Kumar

On Wed, Mar 01, 2017 at 03:21:32PM +0100, Paolo Bonzini wrote:
> 
> 
> On 28/02/2017 03:45, Marcelo Tosatti wrote:
> > On Fri, Feb 24, 2017 at 04:34:52PM +0100, Paolo Bonzini wrote:
> >>
> >>
> >> On 24/02/2017 14:04, Marcelo Tosatti wrote:
> >>>>>>> Whats the current usecase, or forseeable future usecase, for save/restore
> >>>>>>> across preemption again? (which would validate the broken by design
> >>>>>>> claim).
> >>>>>> Stop a guest that is using cpufreq, start a guest that is not using it.
> >>>>>> The second guest's performance now depends on the state that the first
> >>>>>> guest left in cpufreq.
> >>>>> Nothing forbids the host to implement switching with the
> >>>>> current hypercall interface: all you need is a scheduler
> >>>>> hook.
> >>>> Can it be done in vcpu_load/vcpu_put?  But you still would have two
> >>>> components (KVM and sysfs) potentially fighting over the frequency, and
> >>>> that's still a bit ugly.
> >>>
> >>> Change the frequency at vcpu_load/vcpu_put? Yes: call into
> >>> cpufreq-userspace. But there is no notion of "per-task frequency" on the
> >>> Linux kernel (which was the starting point of this subthread).
> >>
> >> There isn't, but this patchset is providing a direct path from a task to
> >> cpufreq-userspace.  This is as close as you can get to a per-task frequency.
> > 
> > Cpufreq-userspace is supposed to be used by tasks in userspace.
> > Thats why its called "userspace".
> 
> I think the intended usecase is to have a daemon handling a systemwide
> policy.  Examples are the historical (and now obsolete) users such as
> cpufreqd, cpudyn, powernowd, or cpuspeed.  The user alternatively can
> play the role of the daemon by writing to sysfs.
> 
> I've never seen userspace tasks talking to cpufreq-userspace to set
> their own running frequency.  If DPDK does it, that's nasty in my
> opinion

Please expand on what "nasty" means in detail. I really don't understand
why it's nasty.

>  and we should find an interface that works best for both DPDK
> and KVM.  Which should be done on linux-pm like Rafael suggested.
> 
> >>> But if you configure all CPUs in the system as cpufreq-userspace,
> >>> then some other (userspace program) has to decide the frequency
> >>> for the other CPUs.
> >>>
> >>> Which agent would do that and why? Thats why i initially said "whats the
> >>> usecase".
> >>
> >> You could just pin them at the highest non-TurboBoost frequency until a
> >> guest runs.  That's assuming that they are idle and, because of
> >> isol_cpus/nohz_full, they would be almost always in deep C state anyway.
> > 
> > The original claim of the thread  was: "this feature (frequency
> > hypercalls) works for pinned vcpu<->pcpu, pcpu dedicated exclusively
> > to vcpu case, lets try to extend this to other cases".
> > 
> > Which is a valid and useful direction to go.
> > 
> > However there is no user for multiple vcpus in the same pcpu now.
> 
> You are still ignoring the case of one guest started after another, or
> of another program started on a CPU that formerly was used by KVM.  They
> don't have to be multiple users at the same time.

Just have the cpufreq-userspace policy be instantiated while the 
isolated vcpu owns the pcpu. Before/after that, the previous policy 
is in place. 
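
Purely for illustration, "instantiated while the isolated vcpu owns the
pcpu" could be as simple as a management agent (not KVM) flipping the
governor around the vcpu's tenure on the core; the helper below just
writes the standard cpufreq sysfs file:

#include <stdio.h>

static void set_governor(int cpu, const char *gov)
{
        char path[128];
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_governor",
                 cpu);
        f = fopen(path, "w");
        if (!f)
                return;
        fprintf(f, "%s\n", gov);
        fclose(f);
}

/* e.g. set_governor(cpu, "userspace") when the isolated vcpu is placed
 * on the pcpu, and restore the previous governor once it goes away. */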

> > If there were multiple vcpus, all of them requesting a given
> > frequency, it would be necessary to:
> > 
> > 	1) Maintain frequency of the pcpu to the highest 
> > 	   frequencies.
> > 
> > 		OR
> > 
> > 	2) Since switching frequencies can take up to 70us (*)
> > 	   (depends on processor), its generally not worthwhile
> > 	   to switch frequencies between task switches.
> 
> Is latency that important, or is rather overhead the one to pay
> attention to?  The slides you linked
> (http://www.ena-hpc.org/2013/pdf/04.pdf) at page 17 suggest it's around
> 10us.

OK, say it's 10us. A 10us overhead on every task context switch is not
acceptable.

> One possibility is to do (1) if you have multiple tasks on the run queue
> (or fallback to what is specified in sysfs) and (2) if you only have one
> task.

Sure, that is fine. But the use case at hand does not involve
multiple tasks on the pcpu.

> Anyway, please repost with Cc to linux-pm so that we can restart the
> discussion there.
> 
> Paolo

Done. Can you please reply with a concise summary of what you object to? 

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread

Thread overview: 28+ messages
2017-02-02 17:47 [patch 0/3] KVM CPU frequency change hypercalls Marcelo Tosatti
2017-02-02 17:47 ` [patch 1/3] cpufreq: implement min/max/up/down functions Marcelo Tosatti
2017-02-03  4:09   ` Viresh Kumar
2017-02-02 17:47 ` [patch 2/3] KVM: x86: introduce ioctl to allow frequency hypercalls Marcelo Tosatti
2017-02-03 17:03   ` Radim Krcmar
2017-02-22 21:18     ` Marcelo Tosatti
2017-02-23 16:48       ` Radim Krcmar
2017-02-23 17:31         ` Paolo Bonzini
2017-02-02 17:47 ` [patch 3/3] KVM: x86: frequency change hypercalls Marcelo Tosatti
2017-02-02 18:01   ` Marcelo Tosatti
2017-02-03 17:40   ` Radim Krcmar
2017-02-03 18:24     ` Marcelo Tosatti
2017-02-03 19:28       ` Radim Krcmar
2017-02-03 12:50 ` [patch 0/3] KVM CPU " Rafael J. Wysocki
2017-02-03 16:43 ` Radim Krcmar
2017-02-03 18:14   ` Marcelo Tosatti
2017-02-03 19:09     ` Radim Krcmar
2017-02-23 17:35       ` Paolo Bonzini
2017-02-23 23:19         ` Marcelo Tosatti
2017-02-24  9:18           ` Paolo Bonzini
2017-02-24 11:50             ` Marcelo Tosatti
2017-02-24 12:17               ` Paolo Bonzini
2017-02-24 13:04                 ` Marcelo Tosatti
2017-02-24 15:34                   ` Paolo Bonzini
2017-02-24 16:54                     ` Rafael J. Wysocki
2017-02-28  2:45                     ` Marcelo Tosatti
2017-03-01 14:21                       ` Paolo Bonzini
2017-03-01 15:11                         ` Marcelo Tosatti
