* [PATCH 00/16] New Proposal for steal time in KVM
@ 2011-01-24 18:06 Glauber Costa
  2011-01-24 18:06 ` [PATCH 01/16] KVM-HDR: register KVM basic header infrastructure Glauber Costa
                   ` (15 more replies)
  0 siblings, 16 replies; 57+ messages in thread
From: Glauber Costa @ 2011-01-24 18:06 UTC (permalink / raw)
  To: kvm; +Cc: linux-kernel, aliguori

Hello people

This is the new version of the steal time series, this time on steroids.
The steal time per se is not much different from the last time I posted, so
I'll highlight what's around it.

Since one of the main fights was around how to register the shared memory area,
which would end up requiring a new MSR, I decided to provide an MSR to end all MSR
allocations. This patchset contains a generic guest-kernel-to-hypervisor memory
registration mechanism that can be further used and abused without too much hassle,
even by letting userspace handle what the hypervisor won't - think of feature
emulation on older hosts, or anything else.

To demonstrate that, and how it works, I ported kvmclock to this infrastructure.
After that, I used it to implement steal time information exchange.
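
For reference, the guest side of a registration boils down to something like
the sketch below (using the struct / MSR names introduced later in the series;
this is an illustration, not the literal patch code, and error handling is
elided):

static int register_area(void *addr, u32 size, u32 type)
{
	struct kvm_memory_area desc = {
		.base = __pa(addr),	/* guest physical address of the area */
		.size = size,
		.type = type,		/* e.g. KVM_AREA_STEAL_TIME */
	};
	u64 gpa = __pa(&desc);

	/* hand the descriptor's gpa to the hypervisor (or to userspace) */
	native_write_msr(MSR_KVM_REGISTER_MEM_AREA, (u32)gpa, (u32)(gpa >> 32));

	/* the other end writes the outcome back into the descriptor */
	return desc.result;	/* 0 means the area was accepted */
}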

Even after our last discussions, I am keeping the steal time accounting in sched.c.
I still see value in that, as opposed to a lower level, because it gives us
easier access to the scheduler variables, such as the cpu runqueue.

The last patch in the series uses this to decrease CPU power according to the
current steal time information, leading to possibly smarter scheduling.
This, in particular, is very early work, and advice on it - including "Stop now,
you idiot!" - is very welcome.

TODO:
 * Handle unregister over reboots
 * Grab a list of current registrations, for migration
 * Write documentation
 * Check sizes in hv registrations, to prevent out-of-bounds
   exploits


Glauber Costa (16):
  KVM-HDR: register KVM basic header infrastructure
  KVM-HV: KVM - KVM Virtual Memory hypervisor implementation
  KVM-HDR: KVM Userspace registering ioctl
  KVM-HV: KVM Userspace registering ioctl
  KVM-HDR: Implement wallclock over KVM - KVM Virtual Memory
  KVM-HV: Implement wallclock over KVM - KVM Virtual Memory
  KVM-GST: Implement wallclock over KVM - KVM Virtual Memory
  KVM-HDR: Implement kvmclock systemtime over KVM - KVM Virtual Memory
  KVM-HV: Implement kvmclock systemtime over KVM - KVM Virtual Memory
  KVM-GST: Implement kvmclock systemtime over KVM - KVM Virtual Memory
  KVM-HDR: KVM Steal time implementation
  KVM-HV: KVM Steal time implementation
  KVM-HV: KVM Steal time calculation
  KVM-GST: KVM Steal time registration
  KVM-GST: KVM Steal time accounting
  KVM-GST: adjust scheduler cpu power

 arch/x86/include/asm/kvm_host.h |    3 +
 arch/x86/include/asm/kvm_para.h |   19 +++++++
 arch/x86/kernel/kvmclock.c      |  104 +++++++++++++++++++++++++++++++++-----
 arch/x86/kvm/svm.c              |    4 ++
 arch/x86/kvm/vmx.c              |    4 ++
 arch/x86/kvm/x86.c              |   99 +++++++++++++++++++++++++++++--------
 include/linux/kvm.h             |   14 +++++-
 include/linux/kvm_host.h        |    1 +
 include/linux/sched.h           |    1 +
 kernel/sched.c                  |   44 ++++++++++++++++
 kernel/sched_fair.c             |   10 ++++
 11 files changed, 268 insertions(+), 35 deletions(-)

-- 
1.7.2.3



* [PATCH 01/16] KVM-HDR: register KVM basic header infrastructure
  2011-01-24 18:06 [PATCH 00/16] New Proposal for steal time in KVM Glauber Costa
@ 2011-01-24 18:06 ` Glauber Costa
  2011-01-26 11:06   ` Avi Kivity
  2011-01-24 18:06 ` [PATCH 02/16] KVM-HV: KVM - KVM Virtual Memory hypervisor implementation Glauber Costa
                   ` (14 subsequent siblings)
  15 siblings, 1 reply; 57+ messages in thread
From: Glauber Costa @ 2011-01-24 18:06 UTC (permalink / raw)
  To: kvm; +Cc: linux-kernel, aliguori, Avi Kivity

KVM, which stands for KVM Virtual Memory (I wanted to call it KVM Virtual Mojito),
is a piece of shared memory that is visible to both the hypervisor and the guest
kernel - but not the guest userspace.

The basic idea is that the guest can tell the hypervisor about a specific
piece of memory, and what it expects to find in there. This is a generic
abstraction that goes to userspace (qemu) if KVM (the hypervisor) can't
handle a specific request, thus giving us flexibility in some features
in the future.

KVM (The hypervisor) can change the contents of this piece of memory at
will. This works well with paravirtual information, and hopefully
normal guest memory - like last update time for the watchdog, for
instance.

These are the basic KVM registration headers. I am keeping the headers
separate to facilitate backports for people who want to backport
the kernel part but not the hypervisor, or the other way around.

Signed-off-by: Glauber Costa <glommer@redhat.com>
CC: Avi Kivity <avi@redhat.com>
---
 arch/x86/include/asm/kvm_para.h |   11 +++++++++++
 1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index a427bf7..b0b0ee0 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -21,6 +21,7 @@
  */
 #define KVM_FEATURE_CLOCKSOURCE2        3
 #define KVM_FEATURE_ASYNC_PF		4
+#define KVM_FEATURE_MEMORY_AREA		5
 
 /* The last 8 bits are used to indicate how to interpret the flags field
  * in pvclock structure. If no bits are set, all flags are ignored.
@@ -35,6 +36,16 @@
 #define MSR_KVM_SYSTEM_TIME_NEW 0x4b564d01
 #define MSR_KVM_ASYNC_PF_EN 0x4b564d02
 
+#define MSR_KVM_REGISTER_MEM_AREA 0x4b564d03
+
+struct kvm_memory_area {
+	__u64 base;
+	__u32 size;
+	__u32 type;
+	__u8  result;
+	__u8  pad[3];
+};
+
 #define KVM_MAX_MMU_OP_BATCH           32
 
 #define KVM_ASYNC_PF_ENABLED			(1 << 0)
-- 
1.7.2.3



* [PATCH 02/16] KVM-HV: KVM - KVM Virtual Memory hypervisor implementation
  2011-01-24 18:06 [PATCH 00/16] New Proposal for steal time in KVM Glauber Costa
  2011-01-24 18:06 ` [PATCH 01/16] KVM-HDR: register KVM basic header infrastructure Glauber Costa
@ 2011-01-24 18:06 ` Glauber Costa
  2011-01-24 18:06 ` [PATCH 03/16] KVM-HDR: KVM Userspace registering ioctl Glauber Costa
                   ` (13 subsequent siblings)
  15 siblings, 0 replies; 57+ messages in thread
From: Glauber Costa @ 2011-01-24 18:06 UTC (permalink / raw)
  To: kvm; +Cc: linux-kernel, aliguori, Avi Kivity

KVM, which stands for KVM Virtual Memory (I wanted to call it KVM Virtual Mojito),
is a piece of shared memory that is visible to both the hypervisor and the guest
kernel - but not the guest userspace.

The basic idea is that the guest can tell the hypervisor about a specific
piece of memory, and what it expects to find in there. This is a generic
abstraction that goes to userspace (qemu) if KVM (the hypervisor) can't
handle a specific request, thus giving us flexibility in some features
in the future.

KVM (The hypervisor) can change the contents of this piece of memory at
will. This works well with paravirtual information, and hopefully
normal guest memory - like last update time for the watchdog, for
instance.

This is the basic KVM registration implementation, which just returns an error
code for every operation. I am keeping it separate from the headers
to facilitate backports for people who want to backport
the kernel part but not the hypervisor, or the other way around.

Signed-off-by: Glauber Costa <glommer@redhat.com>
CC: Avi Kivity <avi@redhat.com>
---
 arch/x86/kvm/x86.c |    9 +++++++++
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index bcc0efc..6206fd3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1538,6 +1538,14 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 		if (kvm_pv_enable_async_pf(vcpu, data))
 			return 1;
 		break;
+	case MSR_KVM_REGISTER_MEM_AREA: {
+		struct kvm_memory_area area_desc;
+
+		kvm_read_guest(vcpu->kvm, data, &area_desc, sizeof(area_desc));
+		area_desc.result = 0xF;
+		kvm_write_guest(vcpu->kvm, data, &area_desc, sizeof(area_desc));
+		break;
+	}
 	case MSR_IA32_MCG_CTL:
 	case MSR_IA32_MCG_STATUS:
 	case MSR_IA32_MC0_CTL ... MSR_IA32_MC0_CTL + 4 * KVM_MAX_MCE_BANKS - 1:
@@ -2391,6 +2399,7 @@ static void do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
 		entry->eax = (1 << KVM_FEATURE_CLOCKSOURCE) |
 			     (1 << KVM_FEATURE_NOP_IO_DELAY) |
 			     (1 << KVM_FEATURE_CLOCKSOURCE2) |
+			     (1 << KVM_FEATURE_MEMORY_AREA) |
 			     (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT);
 		entry->ebx = 0;
 		entry->ecx = 0;
-- 
1.7.2.3



* [PATCH 03/16] KVM-HDR: KVM Userspace registering ioctl
  2011-01-24 18:06 [PATCH 00/16] New Proposal for steal time in KVM Glauber Costa
  2011-01-24 18:06 ` [PATCH 01/16] KVM-HDR: register KVM basic header infrastructure Glauber Costa
  2011-01-24 18:06 ` [PATCH 02/16] KVM-HV: KVM - KVM Virtual Memory hypervisor implementation Glauber Costa
@ 2011-01-24 18:06 ` Glauber Costa
  2011-01-26 11:12   ` Avi Kivity
  2011-01-24 18:06 ` [PATCH 04/16] KVM-HV: " Glauber Costa
                   ` (12 subsequent siblings)
  15 siblings, 1 reply; 57+ messages in thread
From: Glauber Costa @ 2011-01-24 18:06 UTC (permalink / raw)
  To: kvm; +Cc: linux-kernel, aliguori, Avi Kivity

KVM, which stands for KVM Virtual Memory (I wanted to call it KVM Virtual Mojito),
is a piece of shared memory that is visible to both the hypervisor and the guest
kernel - but not the guest userspace.

The basic idea is that the guest can tell the hypervisor about a specific
piece of memory, and what it expects to find in there. This is a generic
abstraction that goes to userspace (qemu) if KVM (the hypervisor) can't
handle a specific request, thus giving us flexibility in some features
in the future.

KVM (The hypervisor) can change the contents of this piece of memory at
will. This works well with paravirtual information, and hopefully
normal guest memory - like last update time for the watchdog, for
instance.

This patch contains the header part of the userspace communication implementation.
Userspace can query the presence/absence of this feature in the normal way.
It also tells the hypervisor that it is capable of handling - in whatever
way it chooses - registrations that the hypervisor does not know how to handle.
On x86, the only user so far, this mechanism is implemented as a generic userspace
msr exit, which could theoretically be used to implement msr handling in
userspace.
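
To give an idea of the intended usage (this is not part of the series), the
qemu side could look roughly like the sketch below; guest_phys_read/write and
handle_area() are hypothetical helpers standing in for qemu's real guest
memory accessors and registration logic:

/* once per VM: opt in to having unknown registrations forwarded (sketch) */
static void enable_area_forwarding(int vm_fd)
{
	struct kvm_area_info info = { .enabled = 1 };

	ioctl(vm_fd, KVM_USERSPACE_REGISTER_MEM_AREA, &info);
}

/* called from the vcpu run loop when run->exit_reason == KVM_EXIT_X86_MSR_OP */
static void handle_msr_op(struct kvm_run *run)
{
	struct kvm_memory_area desc;
	uint64_t gpa = run->msr.msr_data;	/* gpa of the guest's descriptor */

	guest_phys_read(gpa, &desc, sizeof(desc));	/* hypothetical accessor */
	desc.result = handle_area(&desc);		/* hypothetical handler, 0 on success */
	guest_phys_write(gpa, &desc, sizeof(desc));
}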

I am keeping the headers separate to facilitate backports for people
who want to backport the kernel part but not the hypervisor, or the other way around.

Signed-off-by: Glauber Costa <glommer@redhat.com>
CC: Avi Kivity <avi@redhat.com>
---
 include/linux/kvm.h      |   14 +++++++++++++-
 include/linux/kvm_host.h |    1 +
 2 files changed, 14 insertions(+), 1 deletions(-)

diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index ea2dc1a..5cc4fe8 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -161,6 +161,7 @@ struct kvm_pit_config {
 #define KVM_EXIT_NMI              16
 #define KVM_EXIT_INTERNAL_ERROR   17
 #define KVM_EXIT_OSI              18
+#define KVM_EXIT_X86_MSR_OP	  19
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 #define KVM_INTERNAL_ERROR_EMULATION 1
@@ -264,6 +265,10 @@ struct kvm_run {
 		struct {
 			__u64 gprs[32];
 		} osi;
+		/* KVM_EXIT_X86_MSR_OP */
+		struct {
+			__u64 msr_data;
+		} msr;
 		/* Fix the size of the union. */
 		char padding[256];
 	};
@@ -422,6 +427,11 @@ struct kvm_ppc_pvinfo {
 	__u8  pad[108];
 };
 
+struct kvm_area_info {
+	__u8  enabled;
+	__u8  pad[3];
+};
+
 #define KVMIO 0xAE
 
 /*
@@ -541,6 +551,7 @@ struct kvm_ppc_pvinfo {
 #define KVM_CAP_PPC_GET_PVINFO 57
 #define KVM_CAP_PPC_IRQ_LEVEL 58
 #define KVM_CAP_ASYNC_PF 59
+#define KVM_CAP_REGISTER_MEM_AREA 60
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -677,7 +688,8 @@ struct kvm_clock_data {
 #define KVM_SET_PIT2              _IOW(KVMIO,  0xa0, struct kvm_pit_state2)
 /* Available with KVM_CAP_PPC_GET_PVINFO */
 #define KVM_PPC_GET_PVINFO	  _IOW(KVMIO,  0xa1, struct kvm_ppc_pvinfo)
-
+#define KVM_USERSPACE_REGISTER_MEM_AREA \
+				  _IOW(KVMIO,  0xa8, struct kvm_area_info)
 /*
  * ioctls for vcpu fds
  */
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index b5021db..b7b361f 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -258,6 +258,7 @@ struct kvm {
 	long mmu_notifier_count;
 #endif
 	long tlbs_dirty;
+	int register_mem_area_uspace;
 };
 
 /* The guest did something we don't support. */
-- 
1.7.2.3



* [PATCH 04/16] KVM-HV: KVM Userspace registering ioctl
  2011-01-24 18:06 [PATCH 00/16] New Proposal for steal time in KVM Glauber Costa
                   ` (2 preceding siblings ...)
  2011-01-24 18:06 ` [PATCH 03/16] KVM-HDR: KVM Userspace registering ioctl Glauber Costa
@ 2011-01-24 18:06 ` Glauber Costa
  2011-01-24 18:06 ` [PATCH 05/16] KVM-HDR: Implement wallclock over KVM - KVM Virtual Memory Glauber Costa
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 57+ messages in thread
From: Glauber Costa @ 2011-01-24 18:06 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, aliguori, Rik van Riel, Jeremy Fitzhardinge,
	Peter Zijlstra, Avi Kivity

KVM, which stands for KVM Virtual Memory (I wanted to call it KVM Virtual Mojito),
is a piece of shared memory that is visible to both the hypervisor and the guest
kernel - but not the guest userspace.

The basic idea is that the guest can tell the hypervisor about a specific
piece of memory, and what it expects to find in there. This is a generic
abstraction that goes to userspace (qemu) if KVM (the hypervisor) can't
handle a specific request, thus giving us flexibility in some features
in the future.

KVM (The hypervisor) can change the contents of this piece of memory at
will. This works well with paravirtual information, and hopefully
normal guest memory - like last update time for the watchdog, for
instance.

This patch contains the basic implementation of the userspace communication.
Userspace can query the presence/absence of this feature in the normal way.
It also tells the hypervisor that it is capable of handling - in whatever
way it chooses - registrations that the hypervisor does not know how to handle.
On x86, the only user so far, this mechanism is implemented as a generic userspace
msr exit, which could theoretically be used to implement msr handling in
userspace.

I am keeping it separate from the headers to facilitate backports for people
who want to backport the kernel part but not the hypervisor, or the other way around.

Signed-off-by: Glauber Costa <glommer@redhat.com>
CC: Rik van Riel <riel@redhat.com>
CC: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Avi Kivity <avi@redhat.com>
---
 arch/x86/kvm/svm.c |    4 ++++
 arch/x86/kvm/vmx.c |    4 ++++
 arch/x86/kvm/x86.c |   11 +++++++++++
 3 files changed, 19 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 73a8f1d..214e740 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -2990,6 +2990,10 @@ static int wrmsr_interception(struct vcpu_svm *svm)
 	svm->next_rip = kvm_rip_read(&svm->vcpu) + 2;
 	if (svm_set_msr(&svm->vcpu, ecx, data)) {
 		trace_kvm_msr_write_ex(ecx, data);
+		if (svm->vcpu.run->exit_reason == KVM_EXIT_X86_MSR_OP) {
+			skip_emulated_instruction(&svm->vcpu);
+			return 0;
+		}
 		kvm_inject_gp(&svm->vcpu, 0);
 	} else {
 		trace_kvm_msr_write(ecx, data);
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index e2c4e32..f5c585f 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3385,6 +3385,10 @@ static int handle_wrmsr(struct kvm_vcpu *vcpu)
 
 	if (vmx_set_msr(vcpu, ecx, data) != 0) {
 		trace_kvm_msr_write_ex(ecx, data);
+		if (vcpu->run->exit_reason == KVM_EXIT_X86_MSR_OP) {
+			skip_emulated_instruction(vcpu);
+			return 0;
+		}
 		kvm_inject_gp(vcpu, 0);
 		return 1;
 	}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6206fd3..4ee9c87 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1543,6 +1543,13 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 
 		kvm_read_guest(vcpu->kvm, data, &area_desc, sizeof(area_desc));
 		area_desc.result = 0xF;
+
+		if (vcpu->kvm->register_mem_area_uspace) {
+			vcpu->run->exit_reason = KVM_EXIT_X86_MSR_OP;
+			vcpu->run->msr.msr_data = data;
+			return 1;
+		}
+rma_out:
 		kvm_write_guest(vcpu->kvm, data, &area_desc, sizeof(area_desc));
 		break;
 	}
@@ -1974,6 +1981,7 @@ int kvm_dev_ioctl_check_extension(long ext)
 	case KVM_CAP_X86_ROBUST_SINGLESTEP:
 	case KVM_CAP_XSAVE:
 	case KVM_CAP_ASYNC_PF:
+	case KVM_CAP_REGISTER_MEM_AREA:
 		r = 1;
 		break;
 	case KVM_CAP_COALESCED_MMIO:
@@ -3555,6 +3563,9 @@ long kvm_arch_vm_ioctl(struct file *filp,
 		r = 0;
 		break;
 	}
+	case KVM_USERSPACE_REGISTER_MEM_AREA:
+		kvm->register_mem_area_uspace = 1;
+		break;
 
 	default:
 		;
-- 
1.7.2.3



* [PATCH 05/16] KVM-HDR: Implement wallclock over KVM - KVM Virtual Memory
  2011-01-24 18:06 [PATCH 00/16] New Proposal for steal time in KVM Glauber Costa
                   ` (3 preceding siblings ...)
  2011-01-24 18:06 ` [PATCH 04/16] KVM-HV: " Glauber Costa
@ 2011-01-24 18:06 ` Glauber Costa
  2011-01-26 11:13   ` Avi Kivity
  2011-01-24 18:06 ` [PATCH 06/16] " Glauber Costa
                   ` (10 subsequent siblings)
  15 siblings, 1 reply; 57+ messages in thread
From: Glauber Costa @ 2011-01-24 18:06 UTC (permalink / raw)
  To: kvm; +Cc: linux-kernel, aliguori, Avi Kivity

As a proof of concept for KVM - KVM Virtual Memory, this patch
implements wallclock grabbing on top of it. At first, it may seem
like a waste of work to just redo it, since it is working well. But over
time, other MSRs were added - think ASYNC_PF - and more will probably come.
After this patch, we won't need to ever add another virtual MSR to KVM.

If the hypervisor fails to register the memory area, we switch back to legacy
behavior on things that were already present - like kvm clock.

This patch contains the headers for it. I am keeping it separate to facilitate
backports for people who want to backport the kernel part but not the
hypervisor, or the other way around.

Signed-off-by: Glauber Costa <glommer@redhat.com>
CC: Avi Kivity <avi@redhat.com>
---
 arch/x86/include/asm/kvm_para.h |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index b0b0ee0..de088c8 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -46,6 +46,8 @@ struct kvm_memory_area {
 	__u8  pad[3];
 };
 
+#define KVM_AREA_WALLCLOCK		1
+
 #define KVM_MAX_MMU_OP_BATCH           32
 
 #define KVM_ASYNC_PF_ENABLED			(1 << 0)
-- 
1.7.2.3



* [PATCH 06/16] KVM-HV: Implement wallclock over KVM - KVM Virtual Memory
  2011-01-24 18:06 [PATCH 00/16] New Proposal for steal time in KVM Glauber Costa
                   ` (4 preceding siblings ...)
  2011-01-24 18:06 ` [PATCH 05/16] KVM-HDR: Implement wallclock over KVM - KVM Virtual Memory Glauber Costa
@ 2011-01-24 18:06 ` Glauber Costa
  2011-01-24 18:06 ` [PATCH 07/16] KVM-GST: " Glauber Costa
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 57+ messages in thread
From: Glauber Costa @ 2011-01-24 18:06 UTC (permalink / raw)
  To: kvm; +Cc: linux-kernel, aliguori, Avi Kivity

As a proof of concept for KVM - KVM Virtual Memory, this patch
implements wallclock grabbing on top of it. At first, it may seem
like a waste of work to just redo it, since it is working well. But over
time, other MSRs were added - think ASYNC_PF - and more will probably come.
After this patch, we won't need to ever add another virtual MSR to KVM.

If the hypervisor fails to register the memory area, we switch back to legacy
behavior on things that were already present - like kvm clock.

This patch contains the hypervisor implementation for it. I am keeping it
separate from the headers to facilitate backports for people who want to backport
the kernel part but not the hypervisor, or the other way around.

Signed-off-by: Glauber Costa <glommer@redhat.com>
CC: Avi Kivity <avi@redhat.com>
---
 arch/x86/kvm/x86.c |   10 ++++++++++
 1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4ee9c87..a232a36 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1540,16 +1540,26 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 		break;
 	case MSR_KVM_REGISTER_MEM_AREA: {
 		struct kvm_memory_area area_desc;
+		u64 wall_data;
 
 		kvm_read_guest(vcpu->kvm, data, &area_desc, sizeof(area_desc));
 		area_desc.result = 0xF;
 
+		if (area_desc.type == KVM_AREA_WALLCLOCK) {
+			kvm_read_guest(vcpu->kvm, area_desc.base,
+				       &wall_data, area_desc.size);
+			vcpu->kvm->arch.wall_clock = wall_data;
+			kvm_write_wall_clock(vcpu->kvm, wall_data);
+			goto rma_out;
+		}
+
 		if (vcpu->kvm->register_mem_area_uspace) {
 			vcpu->run->exit_reason = KVM_EXIT_X86_MSR_OP;
 			vcpu->run->msr.msr_data = data;
 			return 1;
 		}
 rma_out:
+		area_desc.result = 0;
 		kvm_write_guest(vcpu->kvm, data, &area_desc, sizeof(area_desc));
 		break;
 	}
-- 
1.7.2.3



* [PATCH 07/16] KVM-GST: Implement wallclock over KVM - KVM Virtual Memory
  2011-01-24 18:06 [PATCH 00/16] New Proposal for steal time in KVM Glauber Costa
                   ` (5 preceding siblings ...)
  2011-01-24 18:06 ` [PATCH 06/16] " Glauber Costa
@ 2011-01-24 18:06 ` Glauber Costa
  2011-01-24 18:06 ` [PATCH 08/16] KVM-HDR: Implement kvmclock systemtime " Glauber Costa
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 57+ messages in thread
From: Glauber Costa @ 2011-01-24 18:06 UTC (permalink / raw)
  To: kvm; +Cc: linux-kernel, aliguori, Avi Kivity

As a proof of concept for KVM - KVM Virtual Memory, this patch
implements wallclock grabbing on top of it. At first, it may seem
like a waste of work to just redo it, since it is working well. But over
time, other MSRs were added - think ASYNC_PF - and more will probably come.
After this patch, we won't need to ever add another virtual MSR to KVM.

If the hypervisor fails to register the memory area, we switch back to legacy
behavior on things that were already present - like kvm clock.

This patch contains the guest implementation for it. I am keeping it
separate to facilitate backports for people who want to backport
the kernel part but not the hypervisor, or the other way around.

Signed-off-by: Glauber Costa <glommer@redhat.com>
CC: Avi Kivity <avi@redhat.com>
---
 arch/x86/kernel/kvmclock.c |   41 ++++++++++++++++++++++++++++++++++++-----
 1 files changed, 36 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index f98d3ea..b8809f0 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -31,6 +31,7 @@
 static int kvmclock = 1;
 static int msr_kvm_system_time = MSR_KVM_SYSTEM_TIME;
 static int msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK;
+static int kvm_memory_area_available = 0;
 
 static int parse_no_kvmclock(char *arg)
 {
@@ -43,6 +44,27 @@ early_param("no-kvmclock", parse_no_kvmclock);
 static DEFINE_PER_CPU_SHARED_ALIGNED(struct pvclock_vcpu_time_info, hv_clock);
 static struct pvclock_wall_clock wall_clock;
 
+static int kvm_register_mem_area(u64 base, int type, int size)
+{
+	int low, high;
+
+	struct kvm_memory_area mem;
+
+	if (!kvm_memory_area_available)
+		return 1;
+
+	mem.base = base;
+	mem.size = size;
+	mem.type = type;
+
+	low = (int)__pa_symbol(&mem);
+	high = ((u64)__pa_symbol(&mem) >> 32);
+
+	native_write_msr(MSR_KVM_REGISTER_MEM_AREA, low, high);
+	return mem.result;
+}
+
+
 /*
  * The wallclock is the time of day when we booted. Since then, some time may
  * have elapsed since the hypervisor wrote the data. So we try to account for
@@ -53,11 +75,17 @@ static unsigned long kvm_get_wallclock(void)
 	struct pvclock_vcpu_time_info *vcpu_time;
 	struct timespec ts;
 	int low, high;
-
-	low = (int)__pa_symbol(&wall_clock);
-	high = ((u64)__pa_symbol(&wall_clock) >> 32);
-
-	native_write_msr(msr_kvm_wall_clock, low, high);
+	u64 addr = __pa_symbol(&wall_clock);
+	int ret;
+	
+	ret = kvm_register_mem_area(addr, KVM_AREA_WALLCLOCK,
+				    sizeof(wall_clock));
+	if (ret != 0) {
+		low = (int)addr;
+		high = ((u64)addr >> 32);
+
+		native_write_msr(msr_kvm_wall_clock, low, high);
+	}
 
 	vcpu_time = &get_cpu_var(hv_clock);
 	pvclock_read_wallclock(&wall_clock, vcpu_time, &ts);
@@ -179,6 +207,9 @@ void __init kvmclock_init(void)
 	if (!kvm_para_available())
 		return;
 
+	if (kvm_para_has_feature(KVM_FEATURE_MEMORY_AREA))
+		kvm_memory_area_available = 1;
+
 	if (kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE2)) {
 		msr_kvm_system_time = MSR_KVM_SYSTEM_TIME_NEW;
 		msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK_NEW;
-- 
1.7.2.3



* [PATCH 08/16] KVM-HDR: Implement kvmclock systemtime over KVM - KVM Virtual Memory
  2011-01-24 18:06 [PATCH 00/16] New Proposal for steal time in KVM Glauber Costa
                   ` (6 preceding siblings ...)
  2011-01-24 18:06 ` [PATCH 07/16] KVM-GST: " Glauber Costa
@ 2011-01-24 18:06 ` Glauber Costa
  2011-01-24 18:06 ` [PATCH 09/16] KVM-HV: " Glauber Costa
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 57+ messages in thread
From: Glauber Costa @ 2011-01-24 18:06 UTC (permalink / raw)
  To: kvm; +Cc: linux-kernel, aliguori, Avi Kivity

As a proof of concept for KVM - KVM Virtual Memory, this patch
implements kvmclock per-vcpu systime grabbing on top of it. At first, it
may seem like a waste of work to just redo it, since it is working well. But over
time, other MSRs were added - think ASYNC_PF - and more will probably come.
After this patch, we won't need to ever add another virtual MSR to KVM.

If the hypervisor fails to register the memory area, we switch back to legacy
behavior on things that were already present - like kvm clock.

This patch contains the headers for it. I am keeping it separate to facilitate
backports for people who want to backport the kernel part but not the
hypervisor, or the other way around.

Signed-off-by: Glauber Costa <glommer@redhat.com>
CC: Avi Kivity <avi@redhat.com>
---
 arch/x86/include/asm/kvm_para.h |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index de088c8..4df93cf 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -47,6 +47,7 @@ struct kvm_memory_area {
 };
 
 #define KVM_AREA_WALLCLOCK		1
+#define KVM_AREA_SYSTIME		2
 
 #define KVM_MAX_MMU_OP_BATCH           32
 
-- 
1.7.2.3



* [PATCH 09/16] KVM-HV: Implement kvmclock systemtime over KVM - KVM Virtual Memory
  2011-01-24 18:06 [PATCH 00/16] New Proposal for steal time in KVM Glauber Costa
                   ` (7 preceding siblings ...)
  2011-01-24 18:06 ` [PATCH 08/16] KVM-HDR: Implement kvmclock systemtime " Glauber Costa
@ 2011-01-24 18:06 ` Glauber Costa
  2011-01-24 18:06 ` [PATCH 10/16] KVM-GST: " Glauber Costa
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 57+ messages in thread
From: Glauber Costa @ 2011-01-24 18:06 UTC (permalink / raw)
  To: kvm; +Cc: linux-kernel, aliguori, Avi Kivity

As a proof of concept for KVM - KVM Virtual Memory, this patch
implements kvmclock per-vcpu systime grabbing on top of it. At first, it
may seem like a waste of work to just redo it, since it is working well. But over
time, other MSRs were added - think ASYNC_PF - and more will probably come.
After this patch, we won't need to ever add another virtual MSR to KVM.

If the hypervisor fails to register the memory area, we switch back to legacy
behavior on things that were already present - like kvm clock.

This patch contains the hypervisor part for it. I am keeping it separate from
the headers to facilitate backports for people who want to backport the kernel part
but not the hypervisor, or the other way around.

Signed-off-by: Glauber Costa <glommer@redhat.com>
CC: Avi Kivity <avi@redhat.com>
---
 arch/x86/kvm/x86.c |   55 +++++++++++++++++++++++++++++++--------------------
 1 files changed, 33 insertions(+), 22 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a232a36..81b2f34 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1453,6 +1453,33 @@ static int kvm_pv_enable_async_pf(struct kvm_vcpu *vcpu, u64 data)
 	return 0;
 }
 
+static void kvmclock_register_page(struct kvm_vcpu *vcpu, u64 data)
+{
+	if (vcpu->arch.time_page) {
+		kvm_release_page_dirty(vcpu->arch.time_page);
+		vcpu->arch.time_page = NULL;
+	}
+
+	vcpu->arch.time = data;
+	kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
+
+	/* we verify if the enable bit is set... */
+	if (!(data & 1))
+		return;
+
+	/* ...but clean it before doing the actual write */
+	vcpu->arch.time_offset = data & ~(PAGE_MASK | 1);
+
+	vcpu->arch.time_page =
+			gfn_to_page(vcpu->kvm, data >> PAGE_SHIFT);
+
+	if (is_error_page(vcpu->arch.time_page)) {
+		kvm_release_page_clean(vcpu->arch.time_page);
+		vcpu->arch.time_page = NULL;
+	}
+}
+
+
 int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 {
 	switch (msr) {
@@ -1510,28 +1537,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 		break;
 	case MSR_KVM_SYSTEM_TIME_NEW:
 	case MSR_KVM_SYSTEM_TIME: {
-		if (vcpu->arch.time_page) {
-			kvm_release_page_dirty(vcpu->arch.time_page);
-			vcpu->arch.time_page = NULL;
-		}
-
-		vcpu->arch.time = data;
-		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
-
-		/* we verify if the enable bit is set... */
-		if (!(data & 1))
-			break;
-
-		/* ...but clean it before doing the actual write */
-		vcpu->arch.time_offset = data & ~(PAGE_MASK | 1);
-
-		vcpu->arch.time_page =
-				gfn_to_page(vcpu->kvm, data >> PAGE_SHIFT);
-
-		if (is_error_page(vcpu->arch.time_page)) {
-			kvm_release_page_clean(vcpu->arch.time_page);
-			vcpu->arch.time_page = NULL;
-		}
+		kvmclock_register_page(vcpu, data);
 		break;
 	}
 	case MSR_KVM_ASYNC_PF_EN:
@@ -1553,6 +1559,11 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 			goto rma_out;
 		}
 
+		if (area_desc.type == KVM_AREA_SYSTIME) {
+			kvmclock_register_page(vcpu, area_desc.base | 1);
+			goto rma_out;
+		}
+
 		if (vcpu->kvm->register_mem_area_uspace) {
 			vcpu->run->exit_reason = KVM_EXIT_X86_MSR_OP;
 			vcpu->run->msr.msr_data = data;
-- 
1.7.2.3



* [PATCH 10/16] KVM-GST: Implement kvmclock systemtime over KVM - KVM Virtual Memory
  2011-01-24 18:06 [PATCH 00/16] New Proposal for steal time in KVM Glauber Costa
                   ` (8 preceding siblings ...)
  2011-01-24 18:06 ` [PATCH 09/16] KVM-HV: " Glauber Costa
@ 2011-01-24 18:06 ` Glauber Costa
  2011-01-24 18:06 ` [PATCH 11/16] KVM-HDR: KVM Steal time implementation Glauber Costa
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 57+ messages in thread
From: Glauber Costa @ 2011-01-24 18:06 UTC (permalink / raw)
  To: kvm; +Cc: linux-kernel, aliguori, Avi Kivity

As a proof of concept for KVM - KVM Virtual Memory, this patch
implements kvmclock per-vcpu systime grabbing on top of it. At first, it
may seem like a waste of work to just redo it, since it is working well. But over
time, other MSRs were added - think ASYNC_PF - and more will probably come.
After this patch, we won't need to ever add another virtual MSR to KVM.

If the hypervisor fails to register the memory area, we switch back to legacy
behavior on things that were already present - like kvm clock.

This patch contains the guest part for it. I am keeping it separate to facilitate
backports for people who want to backport the kernel part but not the
hypervisor, or the other way around.

Signed-off-by: Glauber Costa <glommer@redhat.com>
CC: Avi Kivity <avi@redhat.com>
---
 arch/x86/kernel/kvmclock.c |   31 ++++++++++++++++++++++---------
 1 files changed, 22 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index b8809f0..c304fdb 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -157,12 +157,28 @@ int kvm_register_clock(char *txt)
 {
 	int cpu = smp_processor_id();
 	int low, high, ret;
-
-	low = (int)__pa(&per_cpu(hv_clock, cpu)) | 1;
-	high = ((u64)__pa(&per_cpu(hv_clock, cpu)) >> 32);
-	ret = native_write_msr_safe(msr_kvm_system_time, low, high);
-	printk(KERN_INFO "kvm-clock: cpu %d, msr %x:%x, %s\n",
-	       cpu, high, low, txt);
+	struct pvclock_vcpu_time_info *vcpu_time;
+	static int warned;
+
+	vcpu_time = &per_cpu(hv_clock, cpu);
+
+	ret = kvm_register_mem_area(__pa(vcpu_time), KVM_AREA_SYSTIME,
+				    sizeof(*vcpu_time));
+	if (ret == 0) {
+		printk(KERN_INFO "kvm-clock: cpu %d, mem_area %lx %s\n",
+		       cpu, __pa(vcpu_time), txt);
+	} else {
+		low = (int)__pa(vcpu_time) | 1;
+		high = ((u64)__pa(vcpu_time) >> 32);
+		ret = native_write_msr_safe(msr_kvm_system_time, low, high);
+
+		if (!warned++)
+			printk(KERN_INFO "kvm-clock: Using msrs %x and %x",
+				msr_kvm_system_time, msr_kvm_wall_clock);
+
+		printk(KERN_INFO "kvm-clock: cpu %d, msr %x:%x, %s\n",
+		       cpu, high, low, txt);
+	}
 
 	return ret;
 }
@@ -216,9 +232,6 @@ void __init kvmclock_init(void)
 	} else if (!(kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)))
 		return;
 
-	printk(KERN_INFO "kvm-clock: Using msrs %x and %x",
-		msr_kvm_system_time, msr_kvm_wall_clock);
-
 	if (kvm_register_clock("boot clock"))
 		return;
 	pv_time_ops.sched_clock = kvm_clock_read;
-- 
1.7.2.3



* [PATCH 11/16] KVM-HDR: KVM Steal time implementation
  2011-01-24 18:06 [PATCH 00/16] New Proposal for steal time in KVM Glauber Costa
                   ` (9 preceding siblings ...)
  2011-01-24 18:06 ` [PATCH 10/16] KVM-GST: " Glauber Costa
@ 2011-01-24 18:06 ` Glauber Costa
  2011-01-24 23:06   ` Rik van Riel
  2011-01-24 18:06 ` [PATCH 12/16] KVM-HV: " Glauber Costa
                   ` (4 subsequent siblings)
  15 siblings, 1 reply; 57+ messages in thread
From: Glauber Costa @ 2011-01-24 18:06 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, aliguori, Rik van Riel, Jeremy Fitzhardinge,
	Peter Zijlstra, Avi Kivity

To implement steal time, we need the hypervisor to pass the guest information
about how much time was spent running other processes outside the VM.
This is per-vcpu, and using the kvmclock structure for that is an abuse
we decided not to make.

This patch contains the headers for it. I am keeping it separate to facilitate
backports for people who want to backport the kernel part but not the
hypervisor, or the other way around.

Signed-off-by: Glauber Costa <glommer@redhat.com>
CC: Rik van Riel <riel@redhat.com>
CC: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Avi Kivity <avi@redhat.com>
---
 arch/x86/include/asm/kvm_para.h |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 4df93cf..1ca49d8 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -46,8 +46,13 @@ struct kvm_memory_area {
 	__u8  pad[3];
 };
 
+struct kvm_steal_time {
+	__u64 steal;
+};
+
 #define KVM_AREA_WALLCLOCK		1
 #define KVM_AREA_SYSTIME		2
+#define KVM_AREA_STEAL_TIME		3
 
 #define KVM_MAX_MMU_OP_BATCH           32
 
-- 
1.7.2.3



* [PATCH 12/16] KVM-HV: KVM Steal time implementation
  2011-01-24 18:06 [PATCH 00/16] New Proposal for steal time in KVM Glauber Costa
                   ` (10 preceding siblings ...)
  2011-01-24 18:06 ` [PATCH 11/16] KVM-HDR: KVM Steal time implementation Glauber Costa
@ 2011-01-24 18:06 ` Glauber Costa
  2011-01-24 23:15   ` Rik van Riel
  2011-01-24 18:06 ` [PATCH 13/16] KVM-HV: KVM Steal time calculation Glauber Costa
                   ` (3 subsequent siblings)
  15 siblings, 1 reply; 57+ messages in thread
From: Glauber Costa @ 2011-01-24 18:06 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, aliguori, Rik van Riel, Jeremy Fitzhardinge,
	Peter Zijlstra, Avi Kivity

To implement steal time, we need the hypervisor to pass the guest information
about how much time was spent running other processes outside the VM.
This is per-vcpu, and using the kvmclock structure for that is an abuse
we decided not to make.

This patch contains the hypervisor part for it. I am keeping it separate from
the headers to facilitate backports for people who want to backport the kernel
part but not the hypervisor, or the other way around.

Signed-off-by: Glauber Costa <glommer@redhat.com>
CC: Rik van Riel <riel@redhat.com>
CC: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Avi Kivity <avi@redhat.com>
---
 arch/x86/kvm/x86.c |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 81b2f34..f129ea1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1564,6 +1564,11 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 			goto rma_out;
 		}
 
+		if (area_desc.type == KVM_AREA_STEAL_TIME) {
+			vcpu->arch.stime = area_desc.base;
+			goto rma_out;
+		}
+
 		if (vcpu->kvm->register_mem_area_uspace) {
 			vcpu->run->exit_reason = KVM_EXIT_X86_MSR_OP;
 			vcpu->run->msr.msr_data = data;
-- 
1.7.2.3



* [PATCH 13/16] KVM-HV: KVM Steal time calculation
  2011-01-24 18:06 [PATCH 00/16] New Proposal for steal time in KVM Glauber Costa
                   ` (11 preceding siblings ...)
  2011-01-24 18:06 ` [PATCH 12/16] KVM-HV: " Glauber Costa
@ 2011-01-24 18:06 ` Glauber Costa
  2011-01-24 23:20   ` Rik van Riel
  2011-01-24 18:06 ` [PATCH 14/16] KVM-GST: KVM Steal time registration Glauber Costa
                   ` (2 subsequent siblings)
  15 siblings, 1 reply; 57+ messages in thread
From: Glauber Costa @ 2011-01-24 18:06 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, aliguori, Rik van Riel, Jeremy Fitzhardinge,
	Peter Zijlstra, Avi Kivity

To implement steal time, we need the hypervisor to pass the guest information
about how much time was spent running other processes outside the VM.
We consider time to be potentially stolen every time we schedule out the vcpu,
until we schedule it in again. Whether or not this is accounted as steal time
for the guest is the guest's decision.
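
For example (illustrative numbers): if a vcpu is scheduled out at t = 100ms and
scheduled back in at t = 130ms, those 30ms are added to the accumulated value the
hypervisor writes into the guest's steal time area; whether the guest then reports
them as steal is up to the guest.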

Signed-off-by: Glauber Costa <glommer@redhat.com>
CC: Rik van Riel <riel@redhat.com>
CC: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Avi Kivity <avi@redhat.com>
---
 arch/x86/include/asm/kvm_host.h |    3 +++
 arch/x86/kvm/x86.c              |   13 +++++++++++++
 2 files changed, 16 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index ffd7f8d..038679b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -377,6 +377,9 @@ struct kvm_vcpu_arch {
 	unsigned int hw_tsc_khz;
 	unsigned int time_offset;
 	struct page *time_page;
+	gpa_t stime;
+	u64 time_out;
+	u64 this_time_out;
 	u64 last_host_tsc;
 	u64 last_guest_tsc;
 	u64 last_kernel_ns;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index f129ea1..33af936 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2123,6 +2123,9 @@ static bool need_emulate_wbinvd(struct kvm_vcpu *vcpu)
 
 void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 {
+	struct kvm_steal_time *st;
+	st = (struct kvm_steal_time *)vcpu->arch.stime;
+
 	/* Address WBINVD may be executed by guest */
 	if (need_emulate_wbinvd(vcpu)) {
 		if (kvm_x86_ops->has_wbinvd_exit())
@@ -2148,6 +2151,14 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 			kvm_migrate_timers(vcpu);
 		vcpu->cpu = cpu;
 	}
+	if (vcpu->arch.this_time_out) {
+		vcpu->arch.time_out +=
+				(get_kernel_ns() - vcpu->arch.this_time_out);
+		kvm_write_guest(vcpu->kvm, (gpa_t)&st->steal,
+				&vcpu->arch.time_out, sizeof(st->steal));
+		/* is it possible to have 2 loads in sequence? */
+		vcpu->arch.this_time_out = 0;
+	}
 }
 
 void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
@@ -5333,6 +5344,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 
 	kvm_lapic_sync_from_vapic(vcpu);
 
+	vcpu->arch.this_time_out = get_kernel_ns();
+
 	r = kvm_x86_ops->handle_exit(vcpu);
 out:
 	return r;
-- 
1.7.2.3



* [PATCH 14/16] KVM-GST: KVM Steal time registration
  2011-01-24 18:06 [PATCH 00/16] New Proposal for steal time in KVM Glauber Costa
                   ` (12 preceding siblings ...)
  2011-01-24 18:06 ` [PATCH 13/16] KVM-HV: KVM Steal time calculation Glauber Costa
@ 2011-01-24 18:06 ` Glauber Costa
  2011-01-24 23:27   ` Rik van Riel
  2011-01-24 23:31   ` Rik van Riel
  2011-01-24 18:06 ` [PATCH 15/16] KVM-GST: KVM Steal time accounting Glauber Costa
  2011-01-24 18:06 ` [PATCH 16/16] KVM-GST: adjust scheduler cpu power Glauber Costa
  15 siblings, 2 replies; 57+ messages in thread
From: Glauber Costa @ 2011-01-24 18:06 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, aliguori, Rik van Riel, Jeremy Fitzhardinge,
	Peter Zijlstra, Avi Kivity

Register steal time within KVM. Every time we sample the steal time
information, we update a local variable that records the last value
read. We then account the difference.

Signed-off-by: Glauber Costa <glommer@redhat.com>
CC: Rik van Riel <riel@redhat.com>
CC: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Avi Kivity <avi@redhat.com>
---
 arch/x86/kernel/kvmclock.c |   32 ++++++++++++++++++++++++++++++++
 1 files changed, 32 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index c304fdb..c0b0522 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -22,6 +22,7 @@
 #include <asm/msr.h>
 #include <asm/apic.h>
 #include <linux/percpu.h>
+#include <linux/sched.h>
 
 #include <asm/x86_init.h>
 #include <asm/reboot.h>
@@ -42,7 +43,9 @@ early_param("no-kvmclock", parse_no_kvmclock);
 
 /* The hypervisor will put information about time periodically here */
 static DEFINE_PER_CPU_SHARED_ALIGNED(struct pvclock_vcpu_time_info, hv_clock);
+static DEFINE_PER_CPU_SHARED_ALIGNED(struct kvm_steal_time, steal_time);
 static struct pvclock_wall_clock wall_clock;
+static DEFINE_PER_CPU(u64, steal_info);
 
 static int kvm_register_mem_area(u64 base, int type, int size)
 {
@@ -153,14 +156,43 @@ static struct clocksource kvm_clock = {
 	.flags = CLOCK_SOURCE_IS_CONTINUOUS,
 };
 
+static cputime_t kvm_account_steal_time(void)
+{
+	u64 delta = 0;
+	u64 *last_steal_info, this_steal_info;
+	struct kvm_steal_time *src;
+	cputime_t cpu;
+
+	src = &get_cpu_var(steal_time);
+	this_steal_info = src->steal;
+	put_cpu_var(steal_time);
+
+	last_steal_info = &get_cpu_var(steal_info);
+
+	delta = this_steal_info - *last_steal_info;
+
+	*last_steal_info = this_steal_info;
+	put_cpu_var(steal_info);
+
+	cpu = usecs_to_cputime(delta / 1000);
+
+	return cpu;
+}
+
 int kvm_register_clock(char *txt)
 {
 	int cpu = smp_processor_id();
 	int low, high, ret;
 	struct pvclock_vcpu_time_info *vcpu_time;
+	struct kvm_steal_time *stime;
 	static int warned;
 
 	vcpu_time = &per_cpu(hv_clock, cpu);
+	stime = &per_cpu(steal_time, cpu);
+
+	ret = kvm_register_mem_area(__pa(stime), KVM_AREA_STEAL_TIME, sizeof(*stime));
+	if (ret == 0)
+		hypervisor_steal_time = kvm_account_steal_time;
 
 	ret = kvm_register_mem_area(__pa(vcpu_time), KVM_AREA_SYSTIME,
 				    sizeof(*vcpu_time));
-- 
1.7.2.3



* [PATCH 15/16] KVM-GST: KVM Steal time accounting
  2011-01-24 18:06 [PATCH 00/16] New Proposal for steal time in KVM Glauber Costa
                   ` (13 preceding siblings ...)
  2011-01-24 18:06 ` [PATCH 14/16] KVM-GST: KVM Steal time registration Glauber Costa
@ 2011-01-24 18:06 ` Glauber Costa
  2011-01-24 23:33   ` Rik van Riel
  2011-01-24 18:06 ` [PATCH 16/16] KVM-GST: adjust scheduler cpu power Glauber Costa
  15 siblings, 1 reply; 57+ messages in thread
From: Glauber Costa @ 2011-01-24 18:06 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, aliguori, Rik van Riel, Jeremy Fitzhardinge,
	Peter Zijlstra, Avi Kivity

This patch accounts steal time in kernel/sched.
I kept it from the last proposal because I still see advantages
in it: doing it here gives us easier access to scheduler
variables such as the cpu rq. The next patch shows an example of
its usage.

Since functions like account_idle_time() can be called from
multiple places, not only account_process_tick(), steal time
grabbing is repeated in each accounting function separately.

Signed-off-by: Glauber Costa <glommer@redhat.com>
CC: Rik van Riel <riel@redhat.com>
CC: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Avi Kivity <avi@redhat.com>
---
 kernel/sched.c |   40 ++++++++++++++++++++++++++++++++++++++++
 1 files changed, 40 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index ea3e5ef..3b3e88d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3508,6 +3508,38 @@ unsigned long long thread_group_sched_runtime(struct task_struct *p)
 	return ns;
 }
 
+cputime_t (*hypervisor_steal_time)(void) = NULL;
+
+/*
+ * We have to flush steal time information every time something else
+ * is accounted. Since the accounting functions are all visible to the rest
+ * of the kernel, it gets tricky to do them in one place. This helper function
+ * helps us.
+ *
+ * When the system is idle, the concept of steal time does not apply. We just
+ * tell the underlying hypervisor that we grabbed the data, but skip steal time
+ * accounting
+ */
+static int touch_steal_time(int is_idle)
+{
+	u64 steal;
+	struct rq *rq = this_rq();
+
+	if (!hypervisor_steal_time)
+		return 0;
+
+	steal = hypervisor_steal_time();
+
+	if (is_idle)
+		return 0;
+
+	if (steal) {
+		account_steal_time(steal);
+		return 1;
+	}
+	return 0;
+}
+
 /*
  * Account user cpu time to a process.
  * @p: the process that the cpu time gets accounted to
@@ -3520,6 +3552,9 @@ void account_user_time(struct task_struct *p, cputime_t cputime,
 	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
 	cputime64_t tmp;
 
+	if (touch_steal_time(0))
+		return;
+
 	/* Add user time to process. */
 	p->utime = cputime_add(p->utime, cputime);
 	p->utimescaled = cputime_add(p->utimescaled, cputime_scaled);
@@ -3580,6 +3615,9 @@ void account_system_time(struct task_struct *p, int hardirq_offset,
 	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
 	cputime64_t tmp;
 
+	if (touch_steal_time(0))
+		return;
+
 	if ((p->flags & PF_VCPU) && (irq_count() - hardirq_offset == 0)) {
 		account_guest_time(p, cputime, cputime_scaled);
 		return;
@@ -3627,6 +3665,8 @@ void account_idle_time(cputime_t cputime)
 	cputime64_t cputime64 = cputime_to_cputime64(cputime);
 	struct rq *rq = this_rq();
 
+	touch_steal_time(1);
+
 	if (atomic_read(&rq->nr_iowait) > 0)
 		cpustat->iowait = cputime64_add(cpustat->iowait, cputime64);
 	else
-- 
1.7.2.3



* [PATCH 16/16] KVM-GST: adjust scheduler cpu power
  2011-01-24 18:06 [PATCH 00/16] New Proposal for steal time in KVM Glauber Costa
                   ` (14 preceding siblings ...)
  2011-01-24 18:06 ` [PATCH 15/16] KVM-GST: KVM Steal time accounting Glauber Costa
@ 2011-01-24 18:06 ` Glauber Costa
  2011-01-24 18:32   ` Peter Zijlstra
  15 siblings, 1 reply; 57+ messages in thread
From: Glauber Costa @ 2011-01-24 18:06 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, aliguori, Rik van Riel, Jeremy Fitzhardinge,
	Peter Zijlstra, Avi Kivity

This is a first proposal for using steal time information
to influence the scheduler. There are a lot of optimizations
and fine grained adjustments to be done, but it is working reasonably
so far for me (mostly)

With this patch (and some host pinnings to demonstrate the situation),
two vcpus with very different steal time (Say 80 % vs 1 %) will not get
an even distribution of processes. This is a situation that can naturally
arise, especially in overcommitted scenarios. Previously, the guest scheduler
would wrongly think that all cpus have the same ability to run processes,
lowering the overall throughput.
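
As a rough illustration (made-up numbers): if, over an update period, a runqueue
saw 100 accounted ticks and 80 of them observed pending steal time, then
real_ticks/total_ticks is 20/100, so a baseline cpu_power of 1024 is scaled down
to roughly 204, and the load balancer sends correspondingly less work to that vcpu.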

Signed-off-by: Glauber Costa <glommer@redhat.com>
CC: Rik van Riel <riel@redhat.com>
CC: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Avi Kivity <avi@redhat.com>
---
 include/linux/sched.h |    1 +
 kernel/sched.c        |    4 ++++
 kernel/sched_fair.c   |   10 ++++++++++
 3 files changed, 15 insertions(+), 0 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d747f94..beab72d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -302,6 +302,7 @@ long io_schedule_timeout(long timeout);
 extern void cpu_init (void);
 extern void trap_init(void);
 extern void update_process_times(int user);
+extern cputime_t (*hypervisor_steal_time)(void);
 extern void scheduler_tick(void);
 
 extern void sched_show_task(struct task_struct *p);
diff --git a/kernel/sched.c b/kernel/sched.c
index 3b3e88d..c460f0d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -501,6 +501,8 @@ struct rq {
 	struct sched_domain *sd;
 
 	unsigned long cpu_power;
+	unsigned long real_ticks;
+	unsigned long total_ticks;
 
 	unsigned char idle_at_tick;
 	/* For active balancing */
@@ -3533,10 +3535,12 @@ static int touch_steal_time(int is_idle)
 	if (is_idle)
 		return 0;
 
+	rq->total_ticks++;
 	if (steal) {
 		account_steal_time(steal);
 		return 1;
 	}
+	rq->real_ticks++;
 	return 0;
 }
 
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index c62ebae..1364c28 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -2509,6 +2509,16 @@ static void update_cpu_power(struct sched_domain *sd, int cpu)
 		power >>= SCHED_LOAD_SHIFT;
 	}
 
+	WARN_ON(cpu_rq(cpu)->real_ticks > cpu_rq(cpu)->total_ticks);
+
+	if (cpu_rq(cpu)->total_ticks) {
+		power *= cpu_rq(cpu)->real_ticks;
+		power /= cpu_rq(cpu)->total_ticks;
+	}
+
+	cpu_rq(cpu)->total_ticks = 0;
+	cpu_rq(cpu)->real_ticks = 0;
+
 	sdg->cpu_power_orig = power;
 
 	if (sched_feat(ARCH_POWER))
-- 
1.7.2.3



* Re: [PATCH 16/16] KVM-GST: adjust scheduler cpu power
  2011-01-24 18:06 ` [PATCH 16/16] KVM-GST: adjust scheduler cpu power Glauber Costa
@ 2011-01-24 18:32   ` Peter Zijlstra
  2011-01-24 18:51     ` Glauber Costa
  0 siblings, 1 reply; 57+ messages in thread
From: Peter Zijlstra @ 2011-01-24 18:32 UTC (permalink / raw)
  To: Glauber Costa
  Cc: kvm, linux-kernel, aliguori, Rik van Riel, Jeremy Fitzhardinge,
	Avi Kivity

On Mon, 2011-01-24 at 13:06 -0500, Glauber Costa wrote:
> This is a first proposal for using steal time information
> to influence the scheduler. There are a lot of optimizations
> and fine grained adjustments to be done, but it is working reasonably
> so far for me (mostly)
> 
> With this patch (and some host pinnings to demonstrate the situation),
> two vcpus with very different steal time (Say 80 % vs 1 %) will not get
> an even distribution of processes. This is a situation that can naturally
> arise, especially in overcommitted scenarios. Previously, the guest scheduler
> would wrongly think that all cpus have the same ability to run processes,
> lowering the overall throughput.
> 
> Signed-off-by: Glauber Costa <glommer@redhat.com>
> CC: Rik van Riel <riel@redhat.com>
> CC: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
> CC: Peter Zijlstra <peterz@infradead.org>
> CC: Avi Kivity <avi@redhat.com>
> ---
>  include/linux/sched.h |    1 +
>  kernel/sched.c        |    4 ++++
>  kernel/sched_fair.c   |   10 ++++++++++
>  3 files changed, 15 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index d747f94..beab72d 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -302,6 +302,7 @@ long io_schedule_timeout(long timeout);
>  extern void cpu_init (void);
>  extern void trap_init(void);
>  extern void update_process_times(int user);
> +extern cputime_t (*hypervisor_steal_time)(void);
>  extern void scheduler_tick(void);
>  
>  extern void sched_show_task(struct task_struct *p);
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 3b3e88d..c460f0d 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -501,6 +501,8 @@ struct rq {
>  	struct sched_domain *sd;
>  
>  	unsigned long cpu_power;
> +	unsigned long real_ticks;
> +	unsigned long total_ticks;
>  
>  	unsigned char idle_at_tick;
>  	/* For active balancing */
> @@ -3533,10 +3535,12 @@ static int touch_steal_time(int is_idle)
>  	if (is_idle)
>  		return 0;
>  
> +	rq->total_ticks++;
>  	if (steal) {
>  		account_steal_time(steal);
>  		return 1;
>  	}
> +	rq->real_ticks++;
>  	return 0;
>  }

yuck!! is ticks really the best you can do?

I thought kvm had a ns resolution steal-time clock?

> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> index c62ebae..1364c28 100644
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -2509,6 +2509,16 @@ static void update_cpu_power(struct sched_domain *sd, int cpu)
>  		power >>= SCHED_LOAD_SHIFT;
>  	}
>  
> +	WARN_ON(cpu_rq(cpu)->real_ticks > cpu_rq(cpu)->total_ticks);
> +
> +	if (cpu_rq(cpu)->total_ticks) {
> +		power *= cpu_rq(cpu)->real_ticks;
> +		power /= cpu_rq(cpu)->total_ticks;
> +	}
> +
> +	cpu_rq(cpu)->total_ticks = 0;
> +	cpu_rq(cpu)->real_ticks = 0;
> +
>  	sdg->cpu_power_orig = power;
>  
>  	if (sched_feat(ARCH_POWER))

I would really much rather see you change update_rq_clock_task() and
subtract your ns resolution steal time from our wall-time,
update_rq_clock_task() already updates the cpu_power relative to the
remaining time available.
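
Very roughly, something along these lines (sketch only; steal_clock() and
prev_steal_time_rq are made-up names here, and the existing irq time handling
in update_rq_clock_task() is left out):

static void update_rq_clock_task(struct rq *rq, s64 delta)
{
	s64 steal;

	/* ns-resolution steal accumulated since the last update */
	steal = steal_clock(cpu_of(rq)) - rq->prev_steal_time_rq;
	if (steal > delta)
		steal = delta;
	rq->prev_steal_time_rq += steal;

	/* clock_task only advances by the time the vcpu actually got to run */
	rq->clock_task += delta - steal;
}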


* Re: [PATCH 16/16] KVM-GST: adjust scheduler cpu power
  2011-01-24 18:32   ` Peter Zijlstra
@ 2011-01-24 18:51     ` Glauber Costa
  2011-01-24 19:51       ` Peter Zijlstra
  2011-01-24 19:53       ` Peter Zijlstra
  0 siblings, 2 replies; 57+ messages in thread
From: Glauber Costa @ 2011-01-24 18:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: kvm, linux-kernel, aliguori, Rik van Riel, Jeremy Fitzhardinge,
	Avi Kivity

On Mon, 2011-01-24 at 19:32 +0100, Peter Zijlstra wrote:
> On Mon, 2011-01-24 at 13:06 -0500, Glauber Costa wrote:
> > This is a first proposal for using steal time information
> > to influence the scheduler. There are a lot of optimizations
> > and fine grained adjustments to be done, but it is working reasonably
> > so far for me (mostly)
> > 
> > With this patch (and some host pinnings to demonstrate the situation),
> > two vcpus with very different steal time (Say 80 % vs 1 %) will not get
> > an even distribution of processes. This is a situation that can naturally
> > arise, especially in overcommitted scenarios. Previously, the guest scheduler
> > would wrongly think that all cpus have the same ability to run processes,
> > lowering the overall throughput.
> > 
> > Signed-off-by: Glauber Costa <glommer@redhat.com>
> > CC: Rik van Riel <riel@redhat.com>
> > CC: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
> > CC: Peter Zijlstra <peterz@infradead.org>
> > CC: Avi Kivity <avi@redhat.com>
> > ---
> >  include/linux/sched.h |    1 +
> >  kernel/sched.c        |    4 ++++
> >  kernel/sched_fair.c   |   10 ++++++++++
> >  3 files changed, 15 insertions(+), 0 deletions(-)
> > 
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index d747f94..beab72d 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -302,6 +302,7 @@ long io_schedule_timeout(long timeout);
> >  extern void cpu_init (void);
> >  extern void trap_init(void);
> >  extern void update_process_times(int user);
> > +extern cputime_t (*hypervisor_steal_time)(void);
> >  extern void scheduler_tick(void);
> >  
> >  extern void sched_show_task(struct task_struct *p);
> > diff --git a/kernel/sched.c b/kernel/sched.c
> > index 3b3e88d..c460f0d 100644
> > --- a/kernel/sched.c
> > +++ b/kernel/sched.c
> > @@ -501,6 +501,8 @@ struct rq {
> >  	struct sched_domain *sd;
> >  
> >  	unsigned long cpu_power;
> > +	unsigned long real_ticks;
> > +	unsigned long total_ticks;
> >  
> >  	unsigned char idle_at_tick;
> >  	/* For active balancing */
> > @@ -3533,10 +3535,12 @@ static int touch_steal_time(int is_idle)
> >  	if (is_idle)
> >  		return 0;
> >  
> > +	rq->total_ticks++;
> >  	if (steal) {
> >  		account_steal_time(steal);
> >  		return 1;
> >  	}
> > +	rq->real_ticks++;
> >  	return 0;
> >  }
> 
> yuck!! is ticks really the best you can do?
No, but it is simple enough for a first try. With those variables, we're
not accounting anything, but rather, getting a rough estimate of the %
of steal time compared to the useful (non-idle) time.

> I thought kvm had a ns resolution steal-time clock?
Yes, the one I introduced earlier in this series is nsec. However, user
and system will be accounted in usec at most, so there is no point in
using nsec here.

> 
> > diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> > index c62ebae..1364c28 100644
> > --- a/kernel/sched_fair.c
> > +++ b/kernel/sched_fair.c
> > @@ -2509,6 +2509,16 @@ static void update_cpu_power(struct sched_domain *sd, int cpu)
> >  		power >>= SCHED_LOAD_SHIFT;
> >  	}
> >  
> > +	WARN_ON(cpu_rq(cpu)->real_ticks > cpu_rq(cpu)->total_ticks);
> > +
> > +	if (cpu_rq(cpu)->total_ticks) {
> > +		power *= cpu_rq(cpu)->real_ticks;
> > +		power /= cpu_rq(cpu)->total_ticks;
> > +	}
> > +
> > +	cpu_rq(cpu)->total_ticks = 0;
> > +	cpu_rq(cpu)->real_ticks = 0;
> > +
> >  	sdg->cpu_power_orig = power;
> >  
> >  	if (sched_feat(ARCH_POWER))
> 
> I would really much rather see you change update_rq_clock_task() and
> subtract your ns resolution steal time from our wall-time,
> update_rq_clock_task() already updates the cpu_power relative to the
> remaining time available.

But then we get into the problems we already discussed in previous
submissions, which is, we don't really want to alter any notion of
wallclock time. Could you list any more concrete advantages of the
specific way you proposed?

I found that updating cpu power directly is pretty simple, and seems
to work well enough, without disturbing any notion of real time.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 16/16] KVM-GST: adjust scheduler cpu power
  2011-01-24 18:51     ` Glauber Costa
@ 2011-01-24 19:51       ` Peter Zijlstra
  2011-01-24 19:57         ` Glauber Costa
  2011-01-25 20:02         ` Glauber Costa
  2011-01-24 19:53       ` Peter Zijlstra
  1 sibling, 2 replies; 57+ messages in thread
From: Peter Zijlstra @ 2011-01-24 19:51 UTC (permalink / raw)
  To: Glauber Costa
  Cc: kvm, linux-kernel, aliguori, Rik van Riel, Jeremy Fitzhardinge,
	Avi Kivity

On Mon, 2011-01-24 at 16:51 -0200, Glauber Costa wrote:
> > I would really much rather see you change update_rq_clock_task() and
> > subtract your ns resolution steal time from our wall-time,
> > update_rq_clock_task() already updates the cpu_power relative to the
> > remaining time available.
> 
> But then we get into the problems we already discussed in previous
> submissions, which is, we don't really want to alter any notion of
> wallclock time. Could you list any more concrete advantages of the
> specific way you proposed? 

clock_task is the time spent on the task; by not taking steal time into
account, all steal time is accounted as service to whatever task was
current when the vcpu wasn't running.

It doesn't change wall-time in the sense of gtod, only the service time
to tasks.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 16/16] KVM-GST: adjust scheduler cpu power
  2011-01-24 18:51     ` Glauber Costa
  2011-01-24 19:51       ` Peter Zijlstra
@ 2011-01-24 19:53       ` Peter Zijlstra
  1 sibling, 0 replies; 57+ messages in thread
From: Peter Zijlstra @ 2011-01-24 19:53 UTC (permalink / raw)
  To: Glauber Costa
  Cc: kvm, linux-kernel, aliguori, Rik van Riel, Jeremy Fitzhardinge,
	Avi Kivity

On Mon, 2011-01-24 at 16:51 -0200, Glauber Costa wrote:
> 
> > I thought kvm had a ns resolution steal-time clock?
> Yes, the one I introduced earlier in this series is nsec. However, user
> and system will be accounted in usec at most, so there is no point in
> using nsec here.

Well, the scheduler accounts most things in ns these days -- some archs
even do user/sys time in ns -- using ticks is really very crude, esp
when you've got better information.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 16/16] KVM-GST: adjust scheduler cpu power
  2011-01-24 19:51       ` Peter Zijlstra
@ 2011-01-24 19:57         ` Glauber Costa
  2011-01-25 20:02         ` Glauber Costa
  1 sibling, 0 replies; 57+ messages in thread
From: Glauber Costa @ 2011-01-24 19:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: kvm, linux-kernel, aliguori, Rik van Riel, Jeremy Fitzhardinge,
	Avi Kivity

On Mon, 2011-01-24 at 20:51 +0100, Peter Zijlstra wrote:
> On Mon, 2011-01-24 at 16:51 -0200, Glauber Costa wrote:
> > > I would really much rather see you change update_rq_clock_task() and
> > > subtract your ns resolution steal time from our wall-time,
> > > update_rq_clock_task() already updates the cpu_power relative to the
> > > remaining time available.
> > 
> > But then we get into the problems we already discussed in previous
> > submissions, which is, we don't really want to alter any notion of
> > wallclock time. Could you list any more concrete advantages of the
> > specific way you proposed? 
> 
> clock_task is the time spend on the task, by not taking steal time into
> account all steal time is accounted as service to whatever task was
> current when the vcpu wasn't running.
> 
> It doesn't change wall-time in the sense of gtod, only the service time
> to tasks.
Ok, I'll experiment with that and see how it goes.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 11/16] KVM-HDR: KVM Steal time implementation
  2011-01-24 18:06 ` [PATCH 11/16] KVM-HDR: KVM Steal time implementation Glauber Costa
@ 2011-01-24 23:06   ` Rik van Riel
  0 siblings, 0 replies; 57+ messages in thread
From: Rik van Riel @ 2011-01-24 23:06 UTC (permalink / raw)
  To: Glauber Costa
  Cc: kvm, linux-kernel, aliguori, Jeremy Fitzhardinge, Peter Zijlstra,
	Avi Kivity

On 01/24/2011 01:06 PM, Glauber Costa wrote:
> To implement steal time, we need the hypervisor to pass the guest information
> about how much time was spent running other processes outside the VM.
> This is per-vcpu, and using the kvmclock structure for that is an abuse
> we decided not to make.
>
> This patch contains the headers for it. I am keeping it separate to facilitate
> backports to people who wants to backport the kernel part but not the
> hypervisor, or the other way around.
>
> Signed-off-by: Glauber Costa<glommer@redhat.com>
> CC: Rik van Riel<riel@redhat.com>
> CC: Jeremy Fitzhardinge<jeremy.fitzhardinge@citrix.com>
> CC: Peter Zijlstra<peterz@infradead.org>
> CC: Avi Kivity<avi@redhat.com>

Acked-by: Rik van Riel <riel@redhat.com>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 12/16] KVM-HV: KVM Steal time implementation
  2011-01-24 18:06 ` [PATCH 12/16] KVM-HV: " Glauber Costa
@ 2011-01-24 23:15   ` Rik van Riel
  0 siblings, 0 replies; 57+ messages in thread
From: Rik van Riel @ 2011-01-24 23:15 UTC (permalink / raw)
  To: Glauber Costa
  Cc: kvm, linux-kernel, aliguori, Jeremy Fitzhardinge, Peter Zijlstra,
	Avi Kivity

On 01/24/2011 01:06 PM, Glauber Costa wrote:
> To implement steal time, we need the hypervisor to pass the guest information
> about how much time was spent running other processes outside the VM.
> This is per-vcpu, and using the kvmclock structure for that is an abuse
> we decided not to make.
>
> This patch contains the hypervisor part for it. I am keeping it separate from
> the headers to facilitate backports to people who wants to backport the kernel
> part but not the hypervisor, or the other way around.
>
> Signed-off-by: Glauber Costa<glommer@redhat.com>
> CC: Rik van Riel<riel@redhat.com>
> CC: Jeremy Fitzhardinge<jeremy.fitzhardinge@citrix.com>
> CC: Peter Zijlstra<peterz@infradead.org>
> CC: Avi Kivity<avi@redhat.com>

Acked-by: Rik van Riel <riel@redhat.com>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 13/16] KVM-HV: KVM Steal time calculation
  2011-01-24 18:06 ` [PATCH 13/16] KVM-HV: KVM Steal time calculation Glauber Costa
@ 2011-01-24 23:20   ` Rik van Riel
  0 siblings, 0 replies; 57+ messages in thread
From: Rik van Riel @ 2011-01-24 23:20 UTC (permalink / raw)
  To: Glauber Costa
  Cc: kvm, linux-kernel, aliguori, Jeremy Fitzhardinge, Peter Zijlstra,
	Avi Kivity

On 01/24/2011 01:06 PM, Glauber Costa wrote:
> To implement steal time, we need the hypervisor to pass the guest information
> about how much time was spent running other processes outside the VM.
> We consider time to be potentially stolen every time we schedule out the vcpu,
> until we schedule it in again. Whether or not this is accounted
> as steal time for the guest is the guest's decision.
>
> Signed-off-by: Glauber Costa<glommer@redhat.com>
> CC: Rik van Riel<riel@redhat.com>
> CC: Jeremy Fitzhardinge<jeremy.fitzhardinge@citrix.com>
> CC: Peter Zijlstra<peterz@infradead.org>
> CC: Avi Kivity<avi@redhat.com>

Reviewed-by: Rik van Riel <riel@redhat.com>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 14/16] KVM-GST: KVM Steal time registration
  2011-01-24 18:06 ` [PATCH 14/16] KVM-GST: KVM Steal time registration Glauber Costa
@ 2011-01-24 23:27   ` Rik van Riel
  2011-01-24 23:31   ` Rik van Riel
  1 sibling, 0 replies; 57+ messages in thread
From: Rik van Riel @ 2011-01-24 23:27 UTC (permalink / raw)
  To: Glauber Costa
  Cc: kvm, linux-kernel, aliguori, Jeremy Fitzhardinge, Peter Zijlstra,
	Avi Kivity

On 01/24/2011 01:06 PM, Glauber Costa wrote:
> Register steal time within KVM. Everytime we sample the steal time
> information, we update a local variable that tells what was the
> last time read. We then account the difference.
>
> Signed-off-by: Glauber Costa<glommer@redhat.com>
> CC: Rik van Riel<riel@redhat.com>
> CC: Jeremy Fitzhardinge<jeremy.fitzhardinge@citrix.com>
> CC: Peter Zijlstra<peterz@infradead.org>
> CC: Avi Kivity<avi@redhat.com>

Reviewed-by: Rik van Riel <riel@redhat.com>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 14/16] KVM-GST: KVM Steal time registration
  2011-01-24 18:06 ` [PATCH 14/16] KVM-GST: KVM Steal time registration Glauber Costa
  2011-01-24 23:27   ` Rik van Riel
@ 2011-01-24 23:31   ` Rik van Riel
  2011-01-25  1:25     ` Glauber Costa
  1 sibling, 1 reply; 57+ messages in thread
From: Rik van Riel @ 2011-01-24 23:31 UTC (permalink / raw)
  To: Glauber Costa
  Cc: kvm, linux-kernel, aliguori, Jeremy Fitzhardinge, Peter Zijlstra,
	Avi Kivity

On 01/24/2011 01:06 PM, Glauber Costa wrote:
> Register steal time within KVM. Everytime we sample the steal time
> information, we update a local variable that tells what was the
> last time read. We then account the difference.
>
> Signed-off-by: Glauber Costa<glommer@redhat.com>
> CC: Rik van Riel<riel@redhat.com>
> CC: Jeremy Fitzhardinge<jeremy.fitzhardinge@citrix.com>
> CC: Peter Zijlstra<peterz@infradead.org>
> CC: Avi Kivity<avi@redhat.com>

On second thought - how does this deal with cpu hotplug and
hot unplug?

Do you allocate a new one of these structs every time a cpu
is hot unplugged and then hotplugged, leaking the old one?

Will leaving the old value around confuse the steal time
calculation?

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 15/16] KVM-GST: KVM Steal time accounting
  2011-01-24 18:06 ` [PATCH 15/16] KVM-GST: KVM Steal time accounting Glauber Costa
@ 2011-01-24 23:33   ` Rik van Riel
  0 siblings, 0 replies; 57+ messages in thread
From: Rik van Riel @ 2011-01-24 23:33 UTC (permalink / raw)
  To: Glauber Costa
  Cc: kvm, linux-kernel, aliguori, Jeremy Fitzhardinge, Peter Zijlstra,
	Avi Kivity

On 01/24/2011 01:06 PM, Glauber Costa wrote:
> This patch accounts steal time in kernel/sched.
> I kept it from last proposal, because I still see advantages
> in it: Doing it here will give us easier access from scheduler
> variables such as the cpu rq. The next patch shows an example of
> usage for it.
>
> Since functions like account_idle_time() can be called from
> multiple places, not only account_process_tick(), steal time
> grabbing is repeated in each account function separately.
>
> Signed-off-by: Glauber Costa<glommer@redhat.com>
> CC: Rik van Riel<riel@redhat.com>
> CC: Jeremy Fitzhardinge<jeremy.fitzhardinge@citrix.com>
> CC: Peter Zijlstra<peterz@infradead.org>
> CC: Avi Kivity<avi@redhat.com>

Reviewed-by: Rik van Riel <riel@redhat.com>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 14/16] KVM-GST: KVM Steal time registration
  2011-01-24 23:31   ` Rik van Riel
@ 2011-01-25  1:25     ` Glauber Costa
  2011-01-25  1:26       ` Rik van Riel
  0 siblings, 1 reply; 57+ messages in thread
From: Glauber Costa @ 2011-01-25  1:25 UTC (permalink / raw)
  To: Rik van Riel
  Cc: kvm, linux-kernel, aliguori, Jeremy Fitzhardinge, Peter Zijlstra,
	Avi Kivity

On Mon, 2011-01-24 at 18:31 -0500, Rik van Riel wrote:
> On 01/24/2011 01:06 PM, Glauber Costa wrote:
> > Register steal time within KVM. Everytime we sample the steal time
> > information, we update a local variable that tells what was the
> > last time read. We then account the difference.
> >
> > Signed-off-by: Glauber Costa<glommer@redhat.com>
> > CC: Rik van Riel<riel@redhat.com>
> > CC: Jeremy Fitzhardinge<jeremy.fitzhardinge@citrix.com>
> > CC: Peter Zijlstra<peterz@infradead.org>
> > CC: Avi Kivity<avi@redhat.com>
> 
> On second thought - how does this deal with cpu hotplug and
> hot unplug?
> 
> Do you allocate a new one of these structs every time a cpu
> is hot unplugged and then hotplugged, leaking the old one?
> 
> Will leaving the old value around confuse the steal time
> calculation?

If you look closely, there are no allocations happening at all,
it's all static.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 14/16] KVM-GST: KVM Steal time registration
  2011-01-25  1:25     ` Glauber Costa
@ 2011-01-25  1:26       ` Rik van Riel
  2011-01-25  1:28         ` Glauber Costa
  0 siblings, 1 reply; 57+ messages in thread
From: Rik van Riel @ 2011-01-25  1:26 UTC (permalink / raw)
  To: Glauber Costa
  Cc: kvm, linux-kernel, aliguori, Jeremy Fitzhardinge, Peter Zijlstra,
	Avi Kivity

On 01/24/2011 08:25 PM, Glauber Costa wrote:
> On Mon, 2011-01-24 at 18:31 -0500, Rik van Riel wrote:
>> On 01/24/2011 01:06 PM, Glauber Costa wrote:
>>> Register steal time within KVM. Everytime we sample the steal time
>>> information, we update a local variable that tells what was the
>>> last time read. We then account the difference.
>>>
>>> Signed-off-by: Glauber Costa<glommer@redhat.com>
>>> CC: Rik van Riel<riel@redhat.com>
>>> CC: Jeremy Fitzhardinge<jeremy.fitzhardinge@citrix.com>
>>> CC: Peter Zijlstra<peterz@infradead.org>
>>> CC: Avi Kivity<avi@redhat.com>
>>
>> On second thought - how does this deal with cpu hotplug and
>> hot unplug?
>>
>> Do you allocate a new one of these structs every time a cpu
>> is hot unplugged and then hotplugged, leaking the old one?
>>
>> Will leaving the old value around confuse the steal time
>> calculation?
>
> If you look closely, there are no allocations happening at all,
> it's all static.

In that case, does the per-cpu steal area need to be
reinitialized at hotplug time?

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 14/16] KVM-GST: KVM Steal time registration
  2011-01-25  1:26       ` Rik van Riel
@ 2011-01-25  1:28         ` Glauber Costa
  0 siblings, 0 replies; 57+ messages in thread
From: Glauber Costa @ 2011-01-25  1:28 UTC (permalink / raw)
  To: Rik van Riel
  Cc: kvm, linux-kernel, aliguori, Jeremy Fitzhardinge, Peter Zijlstra,
	Avi Kivity

On Mon, 2011-01-24 at 20:26 -0500, Rik van Riel wrote:
> On 01/24/2011 08:25 PM, Glauber Costa wrote:
> > On Mon, 2011-01-24 at 18:31 -0500, Rik van Riel wrote:
> >> On 01/24/2011 01:06 PM, Glauber Costa wrote:
> >>> Register steal time within KVM. Everytime we sample the steal time
> >>> information, we update a local variable that tells what was the
> >>> last time read. We then account the difference.
> >>>
> >>> Signed-off-by: Glauber Costa<glommer@redhat.com>
> >>> CC: Rik van Riel<riel@redhat.com>
> >>> CC: Jeremy Fitzhardinge<jeremy.fitzhardinge@citrix.com>
> >>> CC: Peter Zijlstra<peterz@infradead.org>
> >>> CC: Avi Kivity<avi@redhat.com>
> >>
> >> On second thought - how does this deal with cpu hotplug and
> >> hot unplug?
> >>
> >> Do you allocate a new one of these structs every time a cpu
> >> is hot unplugged and then hotplugged, leaking the old one?
> >>
> >> Will leaving the old value around confuse the steal time
> >> calculation?
> >
> > If you look closely, there are no allocations happening at all,
> > it's all static.
> 
> In that case, does the per-cpu steal area need to be
> reinitialized at hotplug time?
Probably.

I have to look closely at all unregistration scenarios, like
reboot.
It's part of my todo list


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 16/16] KVM-GST: adjust scheduler cpu power
  2011-01-24 19:51       ` Peter Zijlstra
  2011-01-24 19:57         ` Glauber Costa
@ 2011-01-25 20:02         ` Glauber Costa
  2011-01-25 20:13           ` Peter Zijlstra
  1 sibling, 1 reply; 57+ messages in thread
From: Glauber Costa @ 2011-01-25 20:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: kvm, linux-kernel, aliguori, Rik van Riel, Jeremy Fitzhardinge,
	Avi Kivity

On Mon, 2011-01-24 at 20:51 +0100, Peter Zijlstra wrote:
> On Mon, 2011-01-24 at 16:51 -0200, Glauber Costa wrote:
> > > I would really much rather see you change update_rq_clock_task() and
> > > subtract your ns resolution steal time from our wall-time,
> > > update_rq_clock_task() already updates the cpu_power relative to the
> > > remaining time available.
> > 
> > But then we get into the problems we already discussed in previous
> > submissions, which is, we don't really want to alter any notion of
> > wallclock time. Could you list any more concrete advantages of the
> > specific way you proposed? 
> 
> clock_task is the time spend on the task, by not taking steal time into
> account all steal time is accounted as service to whatever task was
> current when the vcpu wasn't running.
> 
> It doesn't change wall-time in the sense of gtod, only the service time
> to tasks.
I fail to see how clock_task influences cpu power.
If we also have to touch clock_task for better accounting of other
stuff, that is a separate story.
But for cpu_power, I really don't see it. Please enlighten me.

I did get slightly better results accounting the whole steal time
period away from total_"ticks"(*) than just working in whole-tick units.

* Please note again that ticks here is just a (bad, I admit) name.
This is not tick accounting; it is just an estimate of how much useful work
your cpu did.
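
To make the estimate concrete: if the host steals ~80% of this vcpu over a
balancing window, we end up with something like total_ticks = 100 and
real_ticks = 20, so update_cpu_power() scales a nominal power of 1024 down
to ~204, and the load balancer places roughly a fifth of the load there.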



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 16/16] KVM-GST: adjust scheduler cpu power
  2011-01-25 20:02         ` Glauber Costa
@ 2011-01-25 20:13           ` Peter Zijlstra
  2011-01-25 20:47             ` Glauber Costa
  0 siblings, 1 reply; 57+ messages in thread
From: Peter Zijlstra @ 2011-01-25 20:13 UTC (permalink / raw)
  To: Glauber Costa
  Cc: kvm, linux-kernel, aliguori, Rik van Riel, Jeremy Fitzhardinge,
	Avi Kivity

On Tue, 2011-01-25 at 18:02 -0200, Glauber Costa wrote:

> I fail to see how does clock_task influence cpu power.
> If we also have to touch clock_task for better accounting of other
> stuff, it is a separate story.
> But for cpu_power, I really fail. Please enlighten me.

static void update_rq_clock_task(struct rq *rq, s64 delta)
{
        s64 irq_delta;

        irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;

        if (irq_delta > delta)
                irq_delta = delta;

        rq->prev_irq_time += irq_delta;
        delta -= irq_delta;
        rq->clock_task += delta;

        if (irq_delta && sched_feat(NONIRQ_POWER))
                sched_rt_avg_update(rq, irq_delta);
}

it's done through that sched_rt_avg_update() (should probably rename
that); it computes a floating average of time not spent on fair tasks.
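
The scaling itself then happens on the update_cpu_power() -> scale_rt_power()
path; conceptually (simplified, not the literal code) it is something like:

        total     = sched_avg_period() + (rq->clock - rq->age_stamp);
        available = total - rq->rt_avg;   /* rt_avg fed by sched_rt_avg_update() */

        /* cpu_power scaled by the fraction of time left for fair tasks */
        power = power * available / total;

so whatever you feed into sched_rt_avg_update() automatically shows up as
reduced cpu_power, without needing extra fields in the rq.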



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 16/16] KVM-GST: adjust scheduler cpu power
  2011-01-25 20:13           ` Peter Zijlstra
@ 2011-01-25 20:47             ` Glauber Costa
  2011-01-25 21:07               ` Peter Zijlstra
  0 siblings, 1 reply; 57+ messages in thread
From: Glauber Costa @ 2011-01-25 20:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: kvm, linux-kernel, aliguori, Rik van Riel, Jeremy Fitzhardinge,
	Avi Kivity

On Tue, 2011-01-25 at 21:13 +0100, Peter Zijlstra wrote:
> On Tue, 2011-01-25 at 18:02 -0200, Glauber Costa wrote:
> 
> > I fail to see how does clock_task influence cpu power.
> > If we also have to touch clock_task for better accounting of other
> > stuff, it is a separate story.
> > But for cpu_power, I really fail. Please enlighten me.
> 
> static void update_rq_clock_task(struct rq *rq, s64 delta)
> {
>         s64 irq_delta;
> 
>         irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;
> 
>         if (irq_delta > delta)
>                 irq_delta = delta;
> 
>         rq->prev_irq_time += irq_delta;
>         delta -= irq_delta;
>         rq->clock_task += delta;
> 
>         if (irq_delta && sched_feat(NONIRQ_POWER))
>                 sched_rt_avg_update(rq, irq_delta);
> }
> 
> its done through that sched_rt_avg_update() (should probably rename
> that), it computes a floating average of time not spend on fair tasks.
> 
It creates a dependency on CONFIG_IRQ_TIME_ACCOUNTING, though.
This piece of code is simply compiled out if this option is disabled.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 16/16] KVM-GST: adjust scheduler cpu power
  2011-01-25 20:47             ` Glauber Costa
@ 2011-01-25 21:07               ` Peter Zijlstra
  2011-01-25 21:27                 ` Glauber Costa
  0 siblings, 1 reply; 57+ messages in thread
From: Peter Zijlstra @ 2011-01-25 21:07 UTC (permalink / raw)
  To: Glauber Costa
  Cc: kvm, linux-kernel, aliguori, Rik van Riel, Jeremy Fitzhardinge,
	Avi Kivity

On Tue, 2011-01-25 at 18:47 -0200, Glauber Costa wrote:
> On Tue, 2011-01-25 at 21:13 +0100, Peter Zijlstra wrote:
> > On Tue, 2011-01-25 at 18:02 -0200, Glauber Costa wrote:
> > 
> > > I fail to see how does clock_task influence cpu power.
> > > If we also have to touch clock_task for better accounting of other
> > > stuff, it is a separate story.
> > > But for cpu_power, I really fail. Please enlighten me.
> > 
> > static void update_rq_clock_task(struct rq *rq, s64 delta)
> > {
> >         s64 irq_delta;
> > 
> >         irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;
> > 
> >         if (irq_delta > delta)
> >                 irq_delta = delta;
> > 
> >         rq->prev_irq_time += irq_delta;
> >         delta -= irq_delta;
> >         rq->clock_task += delta;
> > 
> >         if (irq_delta && sched_feat(NONIRQ_POWER))
> >                 sched_rt_avg_update(rq, irq_delta);
> > }
> > 
> > its done through that sched_rt_avg_update() (should probably rename
> > that), it computes a floating average of time not spend on fair tasks.
> > 
> It creates a dependency on CONFIG_IRQ_TIME_ACCOUNTING, though.
> This piece of code is simply compiled out if this option is disabled.

We can pull this bit out and make the common bit also available for
paravirt.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 16/16] KVM-GST: adjust scheduler cpu power
  2011-01-25 21:07               ` Peter Zijlstra
@ 2011-01-25 21:27                 ` Glauber Costa
  2011-01-26  9:57                   ` Peter Zijlstra
  0 siblings, 1 reply; 57+ messages in thread
From: Glauber Costa @ 2011-01-25 21:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: kvm, linux-kernel, aliguori, Rik van Riel, Jeremy Fitzhardinge,
	Avi Kivity

On Tue, 2011-01-25 at 22:07 +0100, Peter Zijlstra wrote:
> On Tue, 2011-01-25 at 18:47 -0200, Glauber Costa wrote:
> > On Tue, 2011-01-25 at 21:13 +0100, Peter Zijlstra wrote:
> > > On Tue, 2011-01-25 at 18:02 -0200, Glauber Costa wrote:
> > > 
> > > > I fail to see how does clock_task influence cpu power.
> > > > If we also have to touch clock_task for better accounting of other
> > > > stuff, it is a separate story.
> > > > But for cpu_power, I really fail. Please enlighten me.
> > > 
> > > static void update_rq_clock_task(struct rq *rq, s64 delta)
> > > {
> > >         s64 irq_delta;
> > > 
> > >         irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;
> > > 
> > >         if (irq_delta > delta)
> > >                 irq_delta = delta;
> > > 
> > >         rq->prev_irq_time += irq_delta;
> > >         delta -= irq_delta;
> > >         rq->clock_task += delta;
> > > 
> > >         if (irq_delta && sched_feat(NONIRQ_POWER))
> > >                 sched_rt_avg_update(rq, irq_delta);
> > > }
> > > 
> > > its done through that sched_rt_avg_update() (should probably rename
> > > that), it computes a floating average of time not spend on fair tasks.
> > > 
> > It creates a dependency on CONFIG_IRQ_TIME_ACCOUNTING, though.
> > This piece of code is simply compiled out if this option is disabled.
> 
> We can pull this bit out and make the common bit also available for
> paravirt.

scale_rt_power() seems to do the right thing, but all the paths leading
to it seem to work on rq->clock, rather than rq->clock_task.

Although I can experiment with that as well, could you please
elaborate on your reasons to prefer this over variations
of the method I proposed?





^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 16/16] KVM-GST: adjust scheduler cpu power
  2011-01-25 21:27                 ` Glauber Costa
@ 2011-01-26  9:57                   ` Peter Zijlstra
  2011-01-26 15:43                     ` Glauber Costa
  0 siblings, 1 reply; 57+ messages in thread
From: Peter Zijlstra @ 2011-01-26  9:57 UTC (permalink / raw)
  To: Glauber Costa
  Cc: kvm, linux-kernel, aliguori, Rik van Riel, Jeremy Fitzhardinge,
	Avi Kivity

On Tue, 2011-01-25 at 19:27 -0200, Glauber Costa wrote:
> On Tue, 2011-01-25 at 22:07 +0100, Peter Zijlstra wrote:
> > On Tue, 2011-01-25 at 18:47 -0200, Glauber Costa wrote:
> > > On Tue, 2011-01-25 at 21:13 +0100, Peter Zijlstra wrote:
> > > > On Tue, 2011-01-25 at 18:02 -0200, Glauber Costa wrote:
> > > > 
> > > > > I fail to see how does clock_task influence cpu power.
> > > > > If we also have to touch clock_task for better accounting of other
> > > > > stuff, it is a separate story.
> > > > > But for cpu_power, I really fail. Please enlighten me.
> > > > 
> > > > static void update_rq_clock_task(struct rq *rq, s64 delta)
> > > > {
> > > >         s64 irq_delta;
> > > > 
> > > >         irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;
> > > > 
> > > >         if (irq_delta > delta)
> > > >                 irq_delta = delta;
> > > > 
> > > >         rq->prev_irq_time += irq_delta;
> > > >         delta -= irq_delta;
> > > >         rq->clock_task += delta;
> > > > 
> > > >         if (irq_delta && sched_feat(NONIRQ_POWER))
> > > >                 sched_rt_avg_update(rq, irq_delta);
> > > > }
> > > > 
> > > > its done through that sched_rt_avg_update() (should probably rename
> > > > that), it computes a floating average of time not spend on fair tasks.
> > > > 
> > > It creates a dependency on CONFIG_IRQ_TIME_ACCOUNTING, though.
> > > This piece of code is simply compiled out if this option is disabled.
> > 
> > We can pull this bit out and make the common bit also available for
> > paravirt.
> 
> scale_rt_power() seems to do the right thing, but all the path leading
> to it seem to work on rq->clock, rather than rq->clock_task.

Not quite; see how rq->clock_task is irq_delta less than the increment
to rq->clock? You want it to be your steal-time delta less too.

> Although I do can experiment with that as well, could you please
> elaborate on what are your reasons to prefer this over than variations
> of the method I proposed?

Because I want rq->clock_task to not include steal-time.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 01/16] KVM-HDR: register KVM basic header infrastructure
  2011-01-24 18:06 ` [PATCH 01/16] KVM-HDR: register KVM basic header infrastructure Glauber Costa
@ 2011-01-26 11:06   ` Avi Kivity
  2011-01-26 12:13     ` Glauber Costa
  0 siblings, 1 reply; 57+ messages in thread
From: Avi Kivity @ 2011-01-26 11:06 UTC (permalink / raw)
  To: Glauber Costa; +Cc: kvm, linux-kernel, aliguori

On 01/24/2011 08:06 PM, Glauber Costa wrote:
> KVM, which stands for KVM Virtual Memory (I wanted to call it KVM Virtual Mojito),
> is a piece of shared memory that is visible to both the hypervisor and the guest
> kernel - but not the guest userspace.
>
> The basic idea is that the guest can tell the hypervisor about a specific
> piece of memory, and what it expects to find in there. This is a generic
> abstraction, that goes to userspace (qemu) if KVM (the hypervisor) can't
> handle a specific request, thus giving us flexibility in some features
> in the future.
>
> KVM (The hypervisor) can change the contents of this piece of memory at
> will. This works well with paravirtual information, and hopefully
> normal guest memory - like last update time for the watchdog, for
> instance.
>
> This is basic KVM registration headers. I am keeping headers
> separate to facilitate backports to people who wants to backport
> the kernel part but not the hypervisor, or the other way around.
>
> Signed-off-by: Glauber Costa<glommer@redhat.com>
> CC: Avi Kivity<avi@redhat.com>
> ---
>   arch/x86/include/asm/kvm_para.h |   11 +++++++++++
>   1 files changed, 11 insertions(+), 0 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> index a427bf7..b0b0ee0 100644
> --- a/arch/x86/include/asm/kvm_para.h
> +++ b/arch/x86/include/asm/kvm_para.h
> @@ -21,6 +21,7 @@
>    */
>   #define KVM_FEATURE_CLOCKSOURCE2        3
>   #define KVM_FEATURE_ASYNC_PF		4
> +#define KVM_FEATURE_MEMORY_AREA		5
>
>   /* The last 8 bits are used to indicate how to interpret the flags field
>    * in pvclock structure. If no bits are set, all flags are ignored.
> @@ -35,6 +36,16 @@
>   #define MSR_KVM_SYSTEM_TIME_NEW 0x4b564d01
>   #define MSR_KVM_ASYNC_PF_EN 0x4b564d02
>
> +#define MSR_KVM_REGISTER_MEM_AREA 0x4b564d03
> +
> +struct kvm_memory_area {
> +	__u64 base;
> +	__u32 size;
> +	__u32 type;
> +	__u8  result;
> +	__u8  pad[3];
> +};
> +
>   #define KVM_MAX_MMU_OP_BATCH           32

I'm guessing the protocol here is:

- guest fills in ->base/size/type
- issues wrmsr
- host registers the memory and updates ->result
- guest examines ->result
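
i.e., on the guest side something like this (untested guess; shared_page and
KVM_AREA_STEAL_TIME are placeholders, the msr and kvm_memory_area come from
this patch):

	/* shared_page: some per-cpu structure the guest wants the host to update */
	struct kvm_memory_area area = {
		.base = __pa(shared_page),
		.size = sizeof(*shared_page),
		.type = KVM_AREA_STEAL_TIME,	/* made-up type id */
	};

	wrmsrl(MSR_KVM_REGISTER_MEM_AREA, __pa(&area));
	if (area.result)
		/* the host (or userspace) could not handle this registration */
		pr_warn("kvm: memory area registration failed\n");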

there are two issues with this approach:

- it doesn't lend itself well to live migration.  Extra state must be 
maintained in the hypervisor.
- it isn't how normal hardware operates

what's wrong with extending the normal approach of one msr per feature?

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 03/16] KVM-HDR: KVM Userspace registering ioctl
  2011-01-24 18:06 ` [PATCH 03/16] KVM-HDR: KVM Userspace registering ioctl Glauber Costa
@ 2011-01-26 11:12   ` Avi Kivity
  2011-01-26 12:14     ` Glauber Costa
  0 siblings, 1 reply; 57+ messages in thread
From: Avi Kivity @ 2011-01-26 11:12 UTC (permalink / raw)
  To: Glauber Costa; +Cc: kvm, linux-kernel, aliguori

On 01/24/2011 08:06 PM, Glauber Costa wrote:
> KVM, which stands for KVM Virtual Memory (I wanted to call it KVM Virtual Mojito),
> is a piece of shared memory that is visible to both the hypervisor and the guest
> kernel - but not the guest userspace.
>
> The basic idea is that the guest can tell the hypervisor about a specific
> piece of memory, and what it expects to find in there. This is a generic
> abstraction, that goes to userspace (qemu) if KVM (the hypervisor) can't
> handle a specific request, thus giving us flexibility in some features
> in the future.
>
> KVM (The hypervisor) can change the contents of this piece of memory at
> will. This works well with paravirtual information, and hopefully
> normal guest memory - like last update time for the watchdog, for
> instance.
>
> This patch contains the header part of the userspace communication implementation.
> Userspace can query the presence/absence of this feature in the normal way.
> It also tells the hypervisor that it is capable of handling - in whatever
> way it chooses, registrations that the hypervisor does not know how to.
> In x86, only user so far, this mechanism is implemented as generic userspace
> msr exit, that could theorectically be used to implement msr-handling in
> userspace.
>
> I am keeping the headers separate to facilitate backports to people
> who wants to backport the kernel part but not the hypervisor, or the other way around.
>

Again the protocol is not specified.  How about starting from 
Documentation/kvm/api.txt so we don't have to guess?

> diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> index ea2dc1a..5cc4fe8 100644
> --- a/include/linux/kvm.h
> +++ b/include/linux/kvm.h
> @@ -161,6 +161,7 @@ struct kvm_pit_config {
>   #define KVM_EXIT_NMI              16
>   #define KVM_EXIT_INTERNAL_ERROR   17
>   #define KVM_EXIT_OSI              18
> +#define KVM_EXIT_X86_MSR_OP	  19

>
>   /*
> @@ -541,6 +551,7 @@ struct kvm_ppc_pvinfo {
>   #define KVM_CAP_PPC_GET_PVINFO 57
>   #define KVM_CAP_PPC_IRQ_LEVEL 58
>   #define KVM_CAP_ASYNC_PF 59
> +#define KVM_CAP_REGISTER_MEM_AREA 60

These two seem completely unrelated.


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 05/16] KVM-HDR: Implement wallclock over KVM - KVM Virtual Memory
  2011-01-24 18:06 ` [PATCH 05/16] KVM-HDR: Implement wallclock over KVM - KVM Virtual Memory Glauber Costa
@ 2011-01-26 11:13   ` Avi Kivity
  2011-01-26 12:20     ` Glauber Costa
  0 siblings, 1 reply; 57+ messages in thread
From: Avi Kivity @ 2011-01-26 11:13 UTC (permalink / raw)
  To: Glauber Costa; +Cc: kvm, linux-kernel, aliguori

On 01/24/2011 08:06 PM, Glauber Costa wrote:
> As a proof of concept to KVM - Kernel Virtual Memory, this patch
> implements wallclock grabbing on top of it. At first, it may seem
> as a waste of work to just redo it, since it is working well. But over the
> time, other MSRs were added - think ASYNC_PF - and more will probably come.
> After this patch, we won't need to ever add another virtual MSR to KVM.
>

So instead of adding MSRs, we're adding area identifiers.  What did we gain?

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 01/16] KVM-HDR: register KVM basic header infrastructure
  2011-01-26 11:06   ` Avi Kivity
@ 2011-01-26 12:13     ` Glauber Costa
  2011-01-26 15:12       ` Avi Kivity
  0 siblings, 1 reply; 57+ messages in thread
From: Glauber Costa @ 2011-01-26 12:13 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm, linux-kernel, aliguori

On Wed, 2011-01-26 at 13:06 +0200, Avi Kivity wrote:
> On 01/24/2011 08:06 PM, Glauber Costa wrote:
> > KVM, which stands for KVM Virtual Memory (I wanted to call it KVM Virtual Mojito),
> > is a piece of shared memory that is visible to both the hypervisor and the guest
> > kernel - but not the guest userspace.
> >
> > The basic idea is that the guest can tell the hypervisor about a specific
> > piece of memory, and what it expects to find in there. This is a generic
> > abstraction, that goes to userspace (qemu) if KVM (the hypervisor) can't
> > handle a specific request, thus giving us flexibility in some features
> > in the future.
> >
> > KVM (The hypervisor) can change the contents of this piece of memory at
> > will. This works well with paravirtual information, and hopefully
> > normal guest memory - like last update time for the watchdog, for
> > instance.
> >
> > This is basic KVM registration headers. I am keeping headers
> > separate to facilitate backports to people who wants to backport
> > the kernel part but not the hypervisor, or the other way around.
> >
> > Signed-off-by: Glauber Costa<glommer@redhat.com>
> > CC: Avi Kivity<avi@redhat.com>
> > ---
> >   arch/x86/include/asm/kvm_para.h |   11 +++++++++++
> >   1 files changed, 11 insertions(+), 0 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> > index a427bf7..b0b0ee0 100644
> > --- a/arch/x86/include/asm/kvm_para.h
> > +++ b/arch/x86/include/asm/kvm_para.h
> > @@ -21,6 +21,7 @@
> >    */
> >   #define KVM_FEATURE_CLOCKSOURCE2        3
> >   #define KVM_FEATURE_ASYNC_PF		4
> > +#define KVM_FEATURE_MEMORY_AREA		5
> >
> >   /* The last 8 bits are used to indicate how to interpret the flags field
> >    * in pvclock structure. If no bits are set, all flags are ignored.
> > @@ -35,6 +36,16 @@
> >   #define MSR_KVM_SYSTEM_TIME_NEW 0x4b564d01
> >   #define MSR_KVM_ASYNC_PF_EN 0x4b564d02
> >
> > +#define MSR_KVM_REGISTER_MEM_AREA 0x4b564d03
> > +
> > +struct kvm_memory_area {
> > +	__u64 base;
> > +	__u32 size;
> > +	__u32 type;
> > +	__u8  result;
> > +	__u8  pad[3];
> > +};
> > +
> >   #define KVM_MAX_MMU_OP_BATCH           32
> 
> I'm guessing the protocol here is:
> 
> - guest fills in ->base/size/type
> - issues wrmsr
> - host registers the memory and updates ->result
> - guest examines ->result
> 
> there are two issues with this approach:
> 
> - it doesn't lend itself will to live migration.  Extra state must be 
> maintained in the hypervisor.
Yes, but it can be queried at any time as well. I don't do it in this
patch, but it is explicitly mentioned in my TODO.

> - it isn't how normal hardware operates
Since we're trying to go for guest cooperation here, I don't really see
a need to stay close to hardware.

> 
> what's wrong with extending the normal approach of one msr per feature?

* It's harder to do discovery with MSRs. You can't just rely on getting
an error before the idts are properly set up. The way I am proposing
allows us to just try to register a memory area, and get a failure if we
can't handle it, at any time.
* To overcome the above, we have usually relied on cpuids. This requires
qemu/userspace cooperation for feature enablement.
* This mechanism just bumps us out to userspace if we can't handle a
request. As such, it allows for pure guest kernel -> userspace
communication, which can be used, for instance, to emulate new features
in older hypervisors one does not want to change. BTW, maybe there is
value in exiting to userspace even if we stick to the
one-msr-per-feature approach?




^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 03/16] KVM-HDR: KVM Userspace registering ioctl
  2011-01-26 11:12   ` Avi Kivity
@ 2011-01-26 12:14     ` Glauber Costa
  2011-01-26 15:14       ` Avi Kivity
  0 siblings, 1 reply; 57+ messages in thread
From: Glauber Costa @ 2011-01-26 12:14 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm, linux-kernel, aliguori

On Wed, 2011-01-26 at 13:12 +0200, Avi Kivity wrote:
> On 01/24/2011 08:06 PM, Glauber Costa wrote:
> > KVM, which stands for KVM Virtual Memory (I wanted to call it KVM Virtual Mojito),
> > is a piece of shared memory that is visible to both the hypervisor and the guest
> > kernel - but not the guest userspace.
> >
> > The basic idea is that the guest can tell the hypervisor about a specific
> > piece of memory, and what it expects to find in there. This is a generic
> > abstraction, that goes to userspace (qemu) if KVM (the hypervisor) can't
> > handle a specific request, thus giving us flexibility in some features
> > in the future.
> >
> > KVM (The hypervisor) can change the contents of this piece of memory at
> > will. This works well with paravirtual information, and hopefully
> > normal guest memory - like last update time for the watchdog, for
> > instance.
> >
> > This patch contains the header part of the userspace communication implementation.
> > Userspace can query the presence/absence of this feature in the normal way.
> > It also tells the hypervisor that it is capable of handling - in whatever
> > way it chooses, registrations that the hypervisor does not know how to.
> > In x86, only user so far, this mechanism is implemented as generic userspace
> > msr exit, that could theorectically be used to implement msr-handling in
> > userspace.
> >
> > I am keeping the headers separate to facilitate backports to people
> > who wants to backport the kernel part but not the hypervisor, or the other way around.
> >
> 
> Again the protocol is not specified.  How about starting from 
> Documentation/kvm/api.txt so we don't have to guess?
I will do that in the next version, if the idea is not shoot up
completely.

> 
> > diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> > index ea2dc1a..5cc4fe8 100644
> > --- a/include/linux/kvm.h
> > +++ b/include/linux/kvm.h
> > @@ -161,6 +161,7 @@ struct kvm_pit_config {
> >   #define KVM_EXIT_NMI              16
> >   #define KVM_EXIT_INTERNAL_ERROR   17
> >   #define KVM_EXIT_OSI              18
> > +#define KVM_EXIT_X86_MSR_OP	  19
> 
> >
> >   /*
> > @@ -541,6 +551,7 @@ struct kvm_ppc_pvinfo {
> >   #define KVM_CAP_PPC_GET_PVINFO 57
> >   #define KVM_CAP_PPC_IRQ_LEVEL 58
> >   #define KVM_CAP_ASYNC_PF 59
> > +#define KVM_CAP_REGISTER_MEM_AREA 60
> 
> These two seem completely unrelated.

Thanks, will put them in the right place.



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 05/16] KVM-HDR: Implement wallclock over KVM - KVM Virtual Memory
  2011-01-26 11:13   ` Avi Kivity
@ 2011-01-26 12:20     ` Glauber Costa
  2011-01-26 15:17       ` Avi Kivity
  0 siblings, 1 reply; 57+ messages in thread
From: Glauber Costa @ 2011-01-26 12:20 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm, linux-kernel, aliguori

On Wed, 2011-01-26 at 13:13 +0200, Avi Kivity wrote:
> On 01/24/2011 08:06 PM, Glauber Costa wrote:
> > As a proof of concept to KVM - Kernel Virtual Memory, this patch
> > implements wallclock grabbing on top of it. At first, it may seem
> > as a waste of work to just redo it, since it is working well. But over the
> > time, other MSRs were added - think ASYNC_PF - and more will probably come.
> > After this patch, we won't need to ever add another virtual MSR to KVM.
> >
> 
> So instead of adding MSRs, we're adding area identifiers.  What did we gain?

* No risk of namespace clashes of any kind,
* less need for userspace coordination for feature enablement,
* more robust mechanism that can do discovery even on early boot,
* more informative result values can be passed on to guest kernel, 
* more flexibility, since we go back to userspace if we can't handle
some request. Also, some requests are better handled by userspace
anyway. But again, maybe this is a separate issue here...
* size information goes together with base, allowing for extending
structures (well, maybe I should add versioning explicitly?)




^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 01/16] KVM-HDR: register KVM basic header infrastructure
  2011-01-26 12:13     ` Glauber Costa
@ 2011-01-26 15:12       ` Avi Kivity
  2011-01-26 15:36         ` Glauber Costa
  0 siblings, 1 reply; 57+ messages in thread
From: Avi Kivity @ 2011-01-26 15:12 UTC (permalink / raw)
  To: Glauber Costa; +Cc: kvm, linux-kernel, aliguori

On 01/26/2011 02:13 PM, Glauber Costa wrote:
> >
> >  - it doesn't lend itself will to live migration.  Extra state must be
> >  maintained in the hypervisor.
> Yes, but can be queried at any time as well. I don't do it in this
> patch, but this is explicitly mentioned in my TODO.

Using the existing method (MSRs) takes care of this, which reduces churn.

> >  - it isn't how normal hardware operates
> Since we're trying to go for guest cooperation here, I don't really see
> a need to stay close to hardware here.

For Linux there is not much difference, since we can easily adapt it.  
But we don't know the impact on other guests, and we can't refactor 
them.  Staying close to precedent means it will be easier for other 
guests to work with a kvm host, if they choose.

> >
> >  what's wrong with extending the normal approach of one msr per feature?
>
> * It's harder to do discovery with MSRs. You can't just rely on getting
> an error before the idts are properly setups. The way I am proposing
> allow us to just try to register a memory area, and get a failure if we
> can't handle it, at any time

Use cpuid to ensure that you won't get a #GP.
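
i.e. the usual pattern, using the feature bit this series already adds
(area being the guest's kvm_memory_area):

	if (kvm_para_has_feature(KVM_FEATURE_MEMORY_AREA))
		wrmsrl(MSR_KVM_REGISTER_MEM_AREA, __pa(&area));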

> * To overcome the above, we had usually relied on cpuids. This requires
> qemu/userspace cooperation for feature enablement

We need that anyway.  The kernel cannot enable features on its own since 
that breaks live migration.

> * This mechanism just bumps us out to userspace if we can't handle a
> request. As such, it allows for pure guest kernel ->  userspace
> communication, that can be used, for instance, to emulate new features
> in older hypervisors one does not want to change. BTW, maybe there is
> value in exiting to userspace even if we stick to the
> one-msr-per-feature approach?

Yes.

I'm not 100% happy with emulating MSRs in userspace, but we can think 
about a mechanism that allows userspace to designate certain MSRs as 
handled by userspace.

Before we do that I'd like to see what fraction of MSRs can be usefully 
emulated in userspace (beyond those that just store a value and ignore it).

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 03/16] KVM-HDR: KVM Userspace registering ioctl
  2011-01-26 12:14     ` Glauber Costa
@ 2011-01-26 15:14       ` Avi Kivity
  2011-01-26 15:23         ` Glauber Costa
  0 siblings, 1 reply; 57+ messages in thread
From: Avi Kivity @ 2011-01-26 15:14 UTC (permalink / raw)
  To: Glauber Costa; +Cc: kvm, linux-kernel, aliguori

On 01/26/2011 02:14 PM, Glauber Costa wrote:
> On Wed, 2011-01-26 at 13:12 +0200, Avi Kivity wrote:
> >  On 01/24/2011 08:06 PM, Glauber Costa wrote:
> >  >  KVM, which stands for KVM Virtual Memory (I wanted to call it KVM Virtual Mojito),
> >  >  is a piece of shared memory that is visible to both the hypervisor and the guest
> >  >  kernel - but not the guest userspace.
> >  >
> >  >  The basic idea is that the guest can tell the hypervisor about a specific
> >  >  piece of memory, and what it expects to find in there. This is a generic
> >  >  abstraction, that goes to userspace (qemu) if KVM (the hypervisor) can't
> >  >  handle a specific request, thus giving us flexibility in some features
> >  >  in the future.
> >  >
> >  >  KVM (The hypervisor) can change the contents of this piece of memory at
> >  >  will. This works well with paravirtual information, and hopefully
> >  >  normal guest memory - like last update time for the watchdog, for
> >  >  instance.
> >  >
> >  >  This patch contains the header part of the userspace communication implementation.
> >  >  Userspace can query the presence/absence of this feature in the normal way.
> >  >  It also tells the hypervisor that it is capable of handling - in whatever
> >  >  way it chooses, registrations that the hypervisor does not know how to.
> >  >  In x86, only user so far, this mechanism is implemented as generic userspace
> >  >  msr exit, that could theorectically be used to implement msr-handling in
> >  >  userspace.
> >  >
> >  >  I am keeping the headers separate to facilitate backports to people
> >  >  who wants to backport the kernel part but not the hypervisor, or the other way around.
> >  >
> >
> >  Again the protocol is not specified.  How about starting from
> >  Documentation/kvm/api.txt so we don't have to guess?
I will do that in the next version, if the idea is not shot down
completely.

For some reason people write the code first and the documentation as an 
afterthought.  If you switch the order, you can get a high level review 
first, followed by a low-level code review once the high level details 
are agreed.  Of course it's hard to do major changes this way, since the 
API often evolves while writing the code.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 05/16] KVM-HDR: Implement wallclock over KVM - KVM Virtual Memory
  2011-01-26 12:20     ` Glauber Costa
@ 2011-01-26 15:17       ` Avi Kivity
  2011-01-26 15:45         ` Glauber Costa
  0 siblings, 1 reply; 57+ messages in thread
From: Avi Kivity @ 2011-01-26 15:17 UTC (permalink / raw)
  To: Glauber Costa; +Cc: kvm, linux-kernel, aliguori

On 01/26/2011 02:20 PM, Glauber Costa wrote:
> On Wed, 2011-01-26 at 13:13 +0200, Avi Kivity wrote:
> >  On 01/24/2011 08:06 PM, Glauber Costa wrote:
> >  >  As a proof of concept to KVM - Kernel Virtual Memory, this patch
> >  >  implements wallclock grabbing on top of it. At first, it may seem
> >  >  as a waste of work to just redo it, since it is working well. But over the
> >  >  time, other MSRs were added - think ASYNC_PF - and more will probably come.
> >  >  After this patch, we won't need to ever add another virtual MSR to KVM.
> >  >
> >
> >  So instead of adding MSRs, we're adding area identifiers.  What did we gain?
>
> * No risk of namespace clashes of any kind,
> * less need for userspace coordination for feature enablement,

That's a bug, not a feature.

> * more robust mechanism that can do discovery even on early boot,

cpuid/wrmsr should be robust enough.

> * more informative result values can be passed on to guest kernel,

True.

> * more flexibility, since we go back to userspace if we can't handle
> some request. Also, some requests are better handled by userspace
> anyway. But again, maybe this is a separate issue here...

Yes.

> * size information goes together with base, allowing for extending
> structures (well, maybe I should add versioning explicitly?)
>

We could do that as well with wrmsr, by having the size as the first 
field of the structure.  Usually the size isn't really interesting, 
though, since you need to discover/enable the new features independently.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 03/16] KVM-HDR: KVM Userspace registering ioctl
  2011-01-26 15:14       ` Avi Kivity
@ 2011-01-26 15:23         ` Glauber Costa
  0 siblings, 0 replies; 57+ messages in thread
From: Glauber Costa @ 2011-01-26 15:23 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm, linux-kernel, aliguori

On Wed, 2011-01-26 at 17:14 +0200, Avi Kivity wrote:
> On 01/26/2011 02:14 PM, Glauber Costa wrote:
> > On Wed, 2011-01-26 at 13:12 +0200, Avi Kivity wrote:
> > >  On 01/24/2011 08:06 PM, Glauber Costa wrote:
> > >  >  KVM, which stands for KVM Virtual Memory (I wanted to call it KVM Virtual Mojito),
> > >  >  is a piece of shared memory that is visible to both the hypervisor and the guest
> > >  >  kernel - but not the guest userspace.
> > >  >
> > >  >  The basic idea is that the guest can tell the hypervisor about a specific
> > >  >  piece of memory, and what it expects to find in there. This is a generic
> > >  >  abstraction, that goes to userspace (qemu) if KVM (the hypervisor) can't
> > >  >  handle a specific request, thus giving us flexibility in some features
> > >  >  in the future.
> > >  >
> > >  >  KVM (The hypervisor) can change the contents of this piece of memory at
> > >  >  will. This works well with paravirtual information, and hopefully
> > >  >  normal guest memory - like last update time for the watchdog, for
> > >  >  instance.
> > >  >
> > >  >  This patch contains the header part of the userspace communication implementation.
> > >  >  Userspace can query the presence/absence of this feature in the normal way.
> > >  >  It also tells the hypervisor that it is capable of handling - in whatever
> > >  >  way it chooses, registrations that the hypervisor does not know how to.
> > >  >  In x86, only user so far, this mechanism is implemented as generic userspace
> > >  >  msr exit, that could theorectically be used to implement msr-handling in
> > >  >  userspace.
> > >  >
> > >  >  I am keeping the headers separate to facilitate backports to people
> > >  >  who wants to backport the kernel part but not the hypervisor, or the other way around.
> > >  >
> > >
> > >  Again the protocol is not specified.  How about starting from
> > >  Documentation/kvm/api.txt so we don't have to guess?
> > I will do that in the next version, if the idea is not shoot up
> > completely.
> 
> For some reason people write the code first and the documentation as an 
> afterthought.  If you switch the order, you can get a high level review 
> first, followed by a low-level code review once the high level details 
> are agreed.  Of course it's hard to do major changes this way, since the 
> API often evolves while writing the code.

Thing is, the processor runs code, not English. So I guess people write code
first, at least to know what works vs what doesn't, before writing a
potentially beautiful-but-only-theoretical novel.

Rest assured, I don't plan to push for merges of any feature before
providing docs.



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 01/16] KVM-HDR: register KVM basic header infrastructure
  2011-01-26 15:12       ` Avi Kivity
@ 2011-01-26 15:36         ` Glauber Costa
  2011-01-26 17:22           ` Anthony Liguori
  0 siblings, 1 reply; 57+ messages in thread
From: Glauber Costa @ 2011-01-26 15:36 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm, linux-kernel, aliguori

On Wed, 2011-01-26 at 17:12 +0200, Avi Kivity wrote:
> On 01/26/2011 02:13 PM, Glauber Costa wrote:
> > >
> > >  - it doesn't lend itself will to live migration.  Extra state must be
> > >  maintained in the hypervisor.
> > Yes, but can be queried at any time as well. I don't do it in this
> > patch, but this is explicitly mentioned in my TODO.
> 
> Using the existing method (MSRs) takes care of this, which reduces churn.

No, it doesn't.

First, we have to explicitly list some msrs for save/restore in
userspace anyway. But also, the MSRs only hold values. For the case I'm
trying to hit here, namely msrs being used to register something, like
kvmclock, there is usually accompanying code as well.


> > >  - it isn't how normal hardware operates
> > Since we're trying to go for guest cooperation here, I don't really see
> > a need to stay close to hardware here.
> 
> For Linux there is not much difference, since we can easily adapt it.
> But we don't know the impact on other guests, and we can't refactor 
> them.  Staying close to precedent means it will be easier for other 
> guests to work with a kvm host, if they choose.

I honestly don't see the difference. I am not proposing anything
terribly different; in the end, for the sake of this specific point of
guest supportability, it's all 1 msr+cpuid vs n msrs+cpuid.

> 
> > >
> > >  what's wrong with extending the normal approach of one msr per feature?
> >
> > * It's harder to do discovery with MSRs. You can't just rely on getting
> > an error before the idts are properly setups. The way I am proposing
> > allow us to just try to register a memory area, and get a failure if we
> > can't handle it, at any time
> 
> Use cpuid to ensure that you won't get a #GP.
Again, that increases confusion, IMHO. Your hypervisor may have a
feature, userspace may lack it, and then you end up figuring out why
something does not work.

> 
> > * To overcome the above, we had usually relied on cpuids. This requires
> > qemu/userspace cooperation for feature enablement
> 
> We need that anyway.  The kernel cannot enable features on its own since 
> that breaks live migration.

That is true. But easy to overcome as well.

> > * This mechanism just bumps us out to userspace if we can't handle a
> > request. As such, it allows for pure guest kernel ->  userspace
> > communication, that can be used, for instance, to emulate new features
> > in older hypervisors one does not want to change. BTW, maybe there is
> > value in exiting to userspace even if we stick to the
> > one-msr-per-feature approach?
> 
> Yes.
> 
> I'm not 100% happy with emulating MSRs in userspace, but we can think 
> about a mechanism that allows userspace to designate certain MSRs as 
> handled by userspace.
> 
> Before we do that I'd like to see what fraction of MSRs can be usefully 
> emulated in userspace (beyond those that just store a value and ignore it).

None of the existing ones. But for instance, I was discussing this issue with
Anthony a while ago, and he thinks that in order to completely avoid
bogus softlockups, qemu/userspace, which is the entity here that knows
when it has stopped (think ctrl+Z or stop + cont, save/restore, etc.),
could notify the guest kernel of this directly through a shared variable
like this.

See, this is not about "new features", but rather about sharing pieces
of memory. So what I'm doing in the end is just generalizing "an MSR for
shared memory", instead of one new MSR for each piece of data.

Maybe I was unfortunate to mention async_pf in the description to begin
with.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 16/16] KVM-GST: adjust scheduler cpu power
  2011-01-26  9:57                   ` Peter Zijlstra
@ 2011-01-26 15:43                     ` Glauber Costa
  2011-01-26 16:46                       ` Peter Zijlstra
  0 siblings, 1 reply; 57+ messages in thread
From: Glauber Costa @ 2011-01-26 15:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: kvm, linux-kernel, aliguori, Rik van Riel, Jeremy Fitzhardinge,
	Avi Kivity

On Wed, 2011-01-26 at 10:57 +0100, Peter Zijlstra wrote:
> On Tue, 2011-01-25 at 19:27 -0200, Glauber Costa wrote:
> > On Tue, 2011-01-25 at 22:07 +0100, Peter Zijlstra wrote:
> > > On Tue, 2011-01-25 at 18:47 -0200, Glauber Costa wrote:
> > > > On Tue, 2011-01-25 at 21:13 +0100, Peter Zijlstra wrote:
> > > > > On Tue, 2011-01-25 at 18:02 -0200, Glauber Costa wrote:
> > > > > 
> > > > > > I fail to see how does clock_task influence cpu power.
> > > > > > If we also have to touch clock_task for better accounting of other
> > > > > > stuff, it is a separate story.
> > > > > > But for cpu_power, I really fail. Please enlighten me.
> > > > > 
> > > > > static void update_rq_clock_task(struct rq *rq, s64 delta)
> > > > > {
> > > > >         s64 irq_delta;
> > > > > 
> > > > >         irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;
> > > > > 
> > > > >         if (irq_delta > delta)
> > > > >                 irq_delta = delta;
> > > > > 
> > > > >         rq->prev_irq_time += irq_delta;
> > > > >         delta -= irq_delta;
> > > > >         rq->clock_task += delta;
> > > > > 
> > > > >         if (irq_delta && sched_feat(NONIRQ_POWER))
> > > > >                 sched_rt_avg_update(rq, irq_delta);
> > > > > }
> > > > > 
> > > > > it's done through that sched_rt_avg_update() (should probably rename
> > > > > that); it computes a floating average of time not spent on fair tasks.
> > > > > 
> > > > It creates a dependency on CONFIG_IRQ_TIME_ACCOUNTING, though.
> > > > This piece of code is simply compiled out if this option is disabled.
> > > 
> > > We can pull this bit out and make the common bit also available for
> > > paravirt.
> > 
> > scale_rt_power() seems to do the right thing, but all the paths leading
> > to it seem to work on rq->clock, rather than rq->clock_task.
> 
> Not quite, see how rq->clock_task is irq_delta less than the increment
> to rq->clock? You want it to be your steal-time delta less too.
yes, but once this delta is subtracted from rq->clock_task, this value is not
used to dictate power, unless I am mistaken.

power is adjusted according to scale_rt_power(), which does it using the
values of rq->rt_avg, rq->age_stamp, and rq->clock.

So whatever I store into rq->clock_task, but not rq->clock (which
correct me if I'm wrong, is expected to be walltime), will not be used
to adjust cpu power, which is what I'm trying to achieve.
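
For reference, this is roughly what scale_rt_power() looks like today
(simplified; note that it is rq->clock, not rq->clock_task, that enters
the calculation):

static unsigned long scale_rt_power(int cpu)
{
	struct rq *rq = cpu_rq(cpu);
	u64 total, available;

	/* window over which rt_avg was accumulated, based on rq->clock */
	total = sched_avg_period() + (rq->clock - rq->age_stamp);
	available = total - rq->rt_avg;

	if (unlikely((s64)total < SCHED_LOAD_SCALE))
		total = SCHED_LOAD_SCALE;

	total >>= SCHED_LOAD_SHIFT;

	/* fraction of the window left for fair tasks, scaled to cpu_power */
	return div_u64(available * SCHED_LOAD_SCALE, total);
}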

> > Although I can experiment with that as well, could you please
> > elaborate on what are your reasons to prefer this over than variations
> > of the method I proposed?
> 
> Because I want rq->clock_task to not include steal-time.
Sure, fair deal. But at this point, those demands seem orthogonal to me.




^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 05/16] KVM-HDR: Implement wallclock over KVM - KVM Virtual Memory
  2011-01-26 15:17       ` Avi Kivity
@ 2011-01-26 15:45         ` Glauber Costa
  2011-01-27 12:17           ` Avi Kivity
  0 siblings, 1 reply; 57+ messages in thread
From: Glauber Costa @ 2011-01-26 15:45 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm, linux-kernel, aliguori

On Wed, 2011-01-26 at 17:17 +0200, Avi Kivity wrote:
> On 01/26/2011 02:20 PM, Glauber Costa wrote:
> > On Wed, 2011-01-26 at 13:13 +0200, Avi Kivity wrote:
> > >  On 01/24/2011 08:06 PM, Glauber Costa wrote:
> > >  >  As a proof of concept to KVM - Kernel Virtual Memory, this patch
> > >  >  implements wallclock grabbing on top of it. At first, it may seem
> > >  >  as a waste of work to just redo it, since it is working well. But over the
> > >  >  time, other MSRs were added - think ASYNC_PF - and more will probably come.
> > >  >  After this patch, we won't need to ever add another virtual MSR to KVM.
> > >  >
> > >
> > >  So instead of adding MSRs, we're adding area identifiers.  What did we gain?
> >
> > * No risk of namespace clashes of any kind,
> > * less need for userspace coordination for feature enablement,
> 
> That's a bug, not a feature.

I don't see why.
It's about feature enablement, not feature discovery.

> > * more robust mechanism that can do discovery even on early boot,
> 
> cpuid/wrmsr should be robust enough.
> 
> > * more informative result values can be passed on to guest kernel,
> 
> True.
> 
> > * more flexibility, since we go back to userspace if we can't handle
> > some request. Also, some requests are better handled by userspace
> > anyway. But again, maybe this is a separate issue here...
> 
> Yes.
> 
> > * size information goes together with base, allowing for extending
> > structures (well, maybe I should add versioning explicitly?)
> >
> 
> We could do that as well with wrmsr, by having the size as the first 
> field of the structure.  Usually the size isn't really interesting, 
> though, since you need to discover/enable the new features independently.

Which structure? For msrs, we're usually going for just a u64, but of
course we could change that when needed.




^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 16/16] KVM-GST: adjust scheduler cpu power
  2011-01-26 15:43                     ` Glauber Costa
@ 2011-01-26 16:46                       ` Peter Zijlstra
  2011-01-26 16:53                         ` Peter Zijlstra
  2011-01-26 18:11                         ` Glauber Costa
  0 siblings, 2 replies; 57+ messages in thread
From: Peter Zijlstra @ 2011-01-26 16:46 UTC (permalink / raw)
  To: Glauber Costa
  Cc: kvm, linux-kernel, aliguori, Rik van Riel, Jeremy Fitzhardinge,
	Avi Kivity

On Wed, 2011-01-26 at 13:43 -0200, Glauber Costa wrote:

> yes, but once this delta is subtracted from rq->clock_task, this value is not
> used to dictate power, unless I am mistaken.
> 
> power is adjusted according to scale_rt_power(), which does it using the
> values of rq->rt_avg, rq->age_stamp, and rq->clock.
> 
> So whatever I store into rq->clock_task, but not rq->clock (which
> correct me if I'm wrong, is expected to be walltime), will not be used
> to adjust cpu power, which is what I'm trying to achieve.

No, see the below, it uses a per-cpu virt_steal_time() clock which is
expected to return steal-time in ns.

All time not accounted to ->clock_task is accumulated in lost, and
passed into sched_rt_avg_update() and thus affects the cpu_power.

If it finds that 50% of the (recent) time is steal time, its cpu_power
will be 50%.

---
 kernel/sched.c          |   44 ++++++++++++++++++++++++++++----------------
 kernel/sched_features.h |    2 +-
 2 files changed, 29 insertions(+), 17 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 18d38e4..c71384c 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -523,6 +523,9 @@ struct rq {
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
 	u64 prev_irq_time;
 #endif
+#ifdef CONFIG_SCHED_PARAVIRT
+	u64 prev_steal_time;
+#endif
 
 	/* calc_load related fields */
 	unsigned long calc_load_update;
@@ -1888,11 +1891,15 @@ void account_system_vtime(struct task_struct *curr)
 }
 EXPORT_SYMBOL_GPL(account_system_vtime);
 
+#endif /* CONFIG_IRQ_TIME_ACCOUNTING */
+
 static void update_rq_clock_task(struct rq *rq, s64 delta)
 {
-	s64 irq_delta;
+	s64 lost_delta __maybe_unused;
+	s64 lost = 0;
 
-	irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+	lost_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;
 
 	/*
 	 * Since irq_time is only updated on {soft,}irq_exit, we might run into
@@ -1909,26 +1916,31 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
 	 * the current rq->clock timestamp, except that would require using
 	 * atomic ops.
 	 */
-	if (irq_delta > delta)
-		irq_delta = delta;
+	if (lost_delta > delta)
+		lost_delta = delta;
 
-	rq->prev_irq_time += irq_delta;
-	delta -= irq_delta;
-	rq->clock_task += delta;
+	rq->prev_irq_time += lost_delta;
+	lost += lost_delta;
+#endif
+#ifdef CONFIG_SCHED_PARAVIRT
+	lost_delta = virt_steal_time(cpu_of(rq)) - rq->prev_steal_time;
+	
+	/*
+	 * unlikely, unless steal_time accounting is iffy
+	 */
+	if (lost + lost_delta > delta)
+		lost_delta = delta - lost;
 
-	if (irq_delta && sched_feat(NONIRQ_POWER))
-		sched_rt_avg_update(rq, irq_delta);
-}
+	rq->prev_steal_time += lost_delta;
+	lost += lost_delta;
+#endif
 
-#else /* CONFIG_IRQ_TIME_ACCOUNTING */
+	rq->clock_task += delta - lost;
 
-static void update_rq_clock_task(struct rq *rq, s64 delta)
-{
-	rq->clock_task += delta;
+	if (lost && sched_feat(NONTASK_POWER))
+		sched_rt_avg_update(rq, lost);
 }
 
-#endif /* CONFIG_IRQ_TIME_ACCOUNTING */
-
 #include "sched_idletask.c"
 #include "sched_fair.c"
 #include "sched_rt.c"
diff --git a/kernel/sched_features.h b/kernel/sched_features.h
index 68e69ac..b334a2d 100644
--- a/kernel/sched_features.h
+++ b/kernel/sched_features.h
@@ -63,4 +63,4 @@ SCHED_FEAT(OWNER_SPIN, 1)
 /*
  * Decrement CPU power based on irq activity
  */
-SCHED_FEAT(NONIRQ_POWER, 1)
+SCHED_FEAT(NONTASK_POWER, 1)


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [PATCH 16/16] KVM-GST: adjust scheduler cpu power
  2011-01-26 16:46                       ` Peter Zijlstra
@ 2011-01-26 16:53                         ` Peter Zijlstra
  2011-01-26 18:11                         ` Glauber Costa
  1 sibling, 0 replies; 57+ messages in thread
From: Peter Zijlstra @ 2011-01-26 16:53 UTC (permalink / raw)
  To: Glauber Costa
  Cc: kvm, linux-kernel, aliguori, Rik van Riel, Jeremy Fitzhardinge,
	Avi Kivity

On Wed, 2011-01-26 at 17:46 +0100, Peter Zijlstra wrote:
> it uses a per-cpu virt_steal_time() clock which is
> expected to return steal-time in ns. 

This clock should return a u64, wrap on u64, and be provided when
CONFIG_SCHED_PARAVIRT is set.
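
For illustration, on the KVM guest side it could be as simple as the sketch
below; the hook name and the shared structure layout are placeholders for
whatever this series ends up defining.

#ifdef CONFIG_SCHED_PARAVIRT
DEFINE_PER_CPU(struct kvm_steal_time, steal_time);

u64 virt_steal_time(int cpu)
{
	/* nanoseconds of steal time accumulated by the host for this vcpu;
	 * monotonically increasing and allowed to wrap on u64 */
	return per_cpu(steal_time, cpu).steal;
}
#endif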

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 01/16] KVM-HDR: register KVM basic header infrastructure
  2011-01-26 15:36         ` Glauber Costa
@ 2011-01-26 17:22           ` Anthony Liguori
  2011-01-26 17:49             ` Glauber Costa
  0 siblings, 1 reply; 57+ messages in thread
From: Anthony Liguori @ 2011-01-26 17:22 UTC (permalink / raw)
  To: Glauber Costa; +Cc: Avi Kivity, kvm, linux-kernel, aliguori

On 01/26/2011 09:36 AM, Glauber Costa wrote:
> On Wed, 2011-01-26 at 17:12 +0200, Avi Kivity wrote:
>    
>> On 01/26/2011 02:13 PM, Glauber Costa wrote:
>>      
>>>>   - it doesn't lend itself well to live migration.  Extra state must be
>>>>   maintained in the hypervisor.
>>>>          
>>> Yes, but can be queried at any time as well. I don't do it in this
>>> patch, but this is explicitly mentioned in my TODO.
>>>        
>> Using the existing method (MSRs) takes care of this, which reduces churn.
>>      
> No, it doesn't.
>
> First, we have to explicitly list some MSRs for save/restore in
> userspace anyway. But also, the MSRs only hold values. For the case I'm
> trying to hit here, namely MSRs being used to register something like
> kvmclock, there is usually accompanying code as well.
>
>
>    
>>>>   - it isn't how normal hardware operates
>>>>          
>>> Since we're trying to go for guest cooperation here, I don't really see
>>> a need to stay close to hardware here.
>>>        
>> For Linux there is not much difference, since we can easily adapt it.
>> But we don't know the impact on other guests, and we can't refactor
>> them.  Staying close to precedent means it will be easier for other
>> guests to work with a kvm host, if they choose.
>>      
> I honestly don't see the difference. I am not proposing anything
> terribly different in the end; for the sake of this specific point of
> guest supportability, it's all 1 msr+cpuid vs n msrs+cpuid.
>    

If type becomes implied based on the MSR number, you'd get the best of 
both worlds, no?

I do think advertising features in CPUID is nicer than writing to an MSR 
and then checking for an ack in the memory region.
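
That is roughly how kvmclock already does it today (guest-side sketch,
error handling and the per-cpu details left out):

#include <asm/kvm_para.h>
#include <asm/pvclock-abi.h>

static struct pvclock_vcpu_time_info hv_clock;

static int kvmclock_register(void)
{
	/* discovery: the host advertises the feature through the KVM cpuid leaf */
	if (!kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE))
		return -ENODEV;

	/* registration: physical address of the shared structure, low bit = enable */
	wrmsrl(MSR_KVM_SYSTEM_TIME, __pa(&hv_clock) | 1);
	return 0;
}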

>>> * This mechanism just bumps us out to userspace if we can't handle a
>>> request. As such, it allows for pure guest kernel ->   userspace
>>> communication, that can be used, for instance, to emulate new features
>>> in older hypervisors one does not want to change. BTW, maybe there is
>>> value in exiting to userspace even if we stick to the
>>> one-msr-per-feature approach?
>>>        
>> Yes.
>>
>> I'm not 100% happy with emulating MSRs in userspace, but we can think
>> about a mechanism that allows userspace to designate certain MSRs as
>> handled by userspace.
>>
>> Before we do that I'd like to see what fraction of MSRs can be usefully
>> emulated in userspace (beyond those that just store a value and ignore it).
>>      
> None of the existing ones. But for instance, I was discussing this issue with
> Anthony a while ago, and he thinks that in order to completely avoid
> bogus softlockups, qemu/userspace, which is the entity here that knows
> when it has stopped (think ctrl+Z or stop + cont, save/restore, etc.),
> could notify the guest kernel of this directly through a shared variable
> like this.
>
> See, this is not about "new features", but rather about sharing pieces
> of memory. So what I'm doing in the end is just generalizing "an MSR for
> shared memory", instead of one new MSR for each piece of data.
>    

I do think having a standard mechanism for small regions of shared 
memory between the hypervisor and guest is a reasonable thing to do.

Regards,

Anthony Liguori

> Maybe I was unfortunate to mention async_pf in the description to begin
> with.
>


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 01/16] KVM-HDR: register KVM basic header infrastructure
  2011-01-26 17:22           ` Anthony Liguori
@ 2011-01-26 17:49             ` Glauber Costa
  2011-01-27 12:31               ` Avi Kivity
  0 siblings, 1 reply; 57+ messages in thread
From: Glauber Costa @ 2011-01-26 17:49 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Avi Kivity, kvm, linux-kernel, aliguori

On Wed, 2011-01-26 at 11:22 -0600, Anthony Liguori wrote:
> On 01/26/2011 09:36 AM, Glauber Costa wrote:
> > On Wed, 2011-01-26 at 17:12 +0200, Avi Kivity wrote:
> >    
> >> On 01/26/2011 02:13 PM, Glauber Costa wrote:
> >>      
> >>>>   - it doesn't lend itself well to live migration.  Extra state must be
> >>>>   maintained in the hypervisor.
> >>>>          
> >>> Yes, but can be queried at any time as well. I don't do it in this
> >>> patch, but this is explicitly mentioned in my TODO.
> >>>        
> >> Using the existing method (MSRs) takes care of this, which reduces churn.
> >>      
> > No, it doesn't.
> >
> > First, we have to explicitly list some MSRs for save/restore in
> > userspace anyway. But also, the MSRs only hold values. For the case I'm
> > trying to hit here, namely MSRs being used to register something like
> > kvmclock, there is usually accompanying code as well.
> >
> >
> >    
> >>>>   - it isn't how normal hardware operates
> >>>>          
> >>> Since we're trying to go for guest cooperation here, I don't really see
> >>> a need to stay close to hardware here.
> >>>        
> >> For Linux there is not much difference, since we can easily adapt it.
> >> But we don't know the impact on other guests, and we can't refactor
> >> them.  Staying close to precedent means it will be easier for other
> >> guests to work with a kvm host, if they choose.
> >>      
> > I honestly don't see the difference. I am not proposing anything
> > terribly different in the end; for the sake of this specific point of
> > guest supportability, it's all 1 msr+cpuid vs n msrs+cpuid.
> >    
> 
> If type becomes implied based on the MSR number, you'd get the best of 
> both worlds, no?
> 
> I do think advertising features in CPUID is nicer than writing to an MSR 
> and then checking for an ack in the memory region.
Fine. But back to the point, I think the reasoning here is that I see
all those areas as just a single feature, shared data.

> 
> >>> * This mechanism just bumps us out to userspace if we can't handle a
> >>> request. As such, it allows for pure guest kernel ->   userspace
> >>> communication, that can be used, for instance, to emulate new features
> >>> in older hypervisors one does not want to change. BTW, maybe there is
> >>> value in exiting to userspace even if we stick to the
> >>> one-msr-per-feature approach?
> >>>        
> >> Yes.
> >>
> >> I'm not 100% happy with emulating MSRs in userspace, but we can think
> >> about a mechanism that allows userspace to designate certain MSRs as
> >> handled by userspace.
> >>
> >> Before we do that I'd like to see what fraction of MSRs can be usefully
> >> emulated in userspace (beyond those that just store a value and ignore it).
> >>      
> > None of the existing ones. But for instance, I was discussing this issue with
> > Anthony a while ago, and he thinks that in order to completely avoid
> > bogus softlockups, qemu/userspace, which is the entity here that knows
> > when it has stopped (think ctrl+Z or stop + cont, save/restore, etc.),
> > could notify the guest kernel of this directly through a shared variable
> > like this.
> >
> > See, this is not about "new features", but rather about sharing pieces
> > of memory. So what I'm doing in the end is just generalizing "an MSR for
> > shared memory", instead of one new MSR for each piece of data.
> >    
> 
> I do think having a standard mechanism for small regions of shared 
> memory between the hypervisor and guest is a reasonable thing to do.

Through what I am proposing, or through something else? (including
slight variations)



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 16/16] KVM-GST: adjust scheduler cpu power
  2011-01-26 16:46                       ` Peter Zijlstra
  2011-01-26 16:53                         ` Peter Zijlstra
@ 2011-01-26 18:11                         ` Glauber Costa
  1 sibling, 0 replies; 57+ messages in thread
From: Glauber Costa @ 2011-01-26 18:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: kvm, linux-kernel, aliguori, Rik van Riel, Jeremy Fitzhardinge,
	Avi Kivity

On Wed, 2011-01-26 at 17:46 +0100, Peter Zijlstra wrote:
> On Wed, 2011-01-26 at 13:43 -0200, Glauber Costa wrote:
> 
> > yes, but once this delta is subtracted from rq->clock_task, this value is not
> > used to dictate power, unless I am mistaken.
> > 
> > power is adjusted according to scale_rt_power(), which does it using the
> > values of rq->rt_avg, rq->age_stamp, and rq->clock.
> > 
> > So whatever I store into rq->clock_task, but not rq->clock (which
> > correct me if I'm wrong, is expected to be walltime), will not be used
> > to adjust cpu power, which is what I'm trying to achieve.
> 
> No, see the below, it uses a per-cpu virt_steal_time() clock which is
> expected to return steal-time in ns.
> 
> All time not accounted to ->clock_task is accumulated in lost, and
> passed into sched_rt_avg_update() and thus affects the cpu_power.
> 
> If it finds that 50% of the (recent) time is steal time, its cpu_power
> will be 50%.
ok, now I get your proposal. It does make sense.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 05/16] KVM-HDR: Implement wallclock over KVM - KVM Virtual Memory
  2011-01-26 15:45         ` Glauber Costa
@ 2011-01-27 12:17           ` Avi Kivity
  0 siblings, 0 replies; 57+ messages in thread
From: Avi Kivity @ 2011-01-27 12:17 UTC (permalink / raw)
  To: Glauber Costa; +Cc: kvm, linux-kernel, aliguori

On 01/26/2011 05:45 PM, Glauber Costa wrote:
> On Wed, 2011-01-26 at 17:17 +0200, Avi Kivity wrote:
> >  On 01/26/2011 02:20 PM, Glauber Costa wrote:
> >  >  On Wed, 2011-01-26 at 13:13 +0200, Avi Kivity wrote:
> >  >  >   On 01/24/2011 08:06 PM, Glauber Costa wrote:
> >  >  >   >   As a proof of concept to KVM - Kernel Virtual Memory, this patch
> >  >  >   >   implements wallclock grabbing on top of it. At first, it may seem
> >  >  >   >   as a waste of work to just redo it, since it is working well. But over the
> >  >  >   >   time, other MSRs were added - think ASYNC_PF - and more will probably come.
> >  >  >   >   After this patch, we won't need to ever add another virtual MSR to KVM.
> >  >  >   >
> >  >  >
> >  >  >   So instead of adding MSRs, we're adding area identifiers.  What did we gain?
> >  >
> >  >  * No risk of namespace clashes of any kind,
> >  >  * less need for userspace coordination for feature enablement,
> >
> >  That's a bug, not a feature.
>
> I don't see why.
> It's about feature enablement, not feature discovery.

Well, "zero userspace coordination" would be a bug, since it would 
remove userspace-controlled discovery.  Since the userspace patches for 
these types of features are usually very small, "less coordination" 
doesn't buy us much.
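
Userspace keeps that control today by filtering the feature bits itself,
roughly like the snippet below (illustrative; find_cpuid_entry() is a
made-up helper, the ioctls and constants are the real ones):

#include <sys/ioctl.h>
#include <linux/kvm.h>
#include <asm/kvm_para.h>

static void limit_pv_features(int kvm_fd, int vcpu_fd, struct kvm_cpuid2 *cpuid)
{
	struct kvm_cpuid_entry2 *e;

	/* ask the kernel which paravirt features it could expose;
	 * error handling and nent setup omitted */
	ioctl(kvm_fd, KVM_GET_SUPPORTED_CPUID, cpuid);

	/* drop whatever this userspace version does not want the guest to see */
	e = find_cpuid_entry(cpuid, KVM_CPUID_FEATURES);
	e->eax &= ~(1 << KVM_FEATURE_ASYNC_PF);

	/* the guest can only discover what survives this filtering */
	ioctl(vcpu_fd, KVM_SET_CPUID2, cpuid);
}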

> >
> >  >  * size information goes together with base, allowing for extending
> >  >  structures (well, maybe I should add versioning explicitly?)
> >  >
> >
> >  We could do that as well with wrmsr, by having the size as the first
> >  field of the structure.  Usually the size isn't really interesting,
> >  though, since you need to discover/enable the new features independently.
>
> Which structure? For msrs, we're usually going for just a u64, but of
> course we could change that when needed.

It's usually a physical address of a structure (together with an enable 
bit).

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 01/16] KVM-HDR: register KVM basic header infrastructure
  2011-01-26 17:49             ` Glauber Costa
@ 2011-01-27 12:31               ` Avi Kivity
  0 siblings, 0 replies; 57+ messages in thread
From: Avi Kivity @ 2011-01-27 12:31 UTC (permalink / raw)
  To: Glauber Costa; +Cc: Anthony Liguori, kvm, linux-kernel, aliguori

On 01/26/2011 07:49 PM, Glauber Costa wrote:
> >
> >  If type becomes implied based on the MSR number, you'd get the best of
> >  both worlds, no?
> >
> >  I do think advertising features in CPUID is nicer than writing to an MSR
> >  and then checking for an ack in the memory region.
> Fine. But back to the point, I think the reasoning here is that I see
> all those areas as just a single feature, shared data.

That's not a feature.  kvmclock and apf are features.

> >
> >  I do think having a standard mechanism for small regions of shared
> >  memory between the hypervisor and guest is a reasonable thing to do.
>
> Through what I am proposing, or through something else? (including
> slight variations)
>

I'd like to keep the current way of doing things.  Helpers in the guest 
and host to consolidate code are fine, but there's no need to impact the 
ABI.

e.g.

     kvm_register_feature_msr(u32 msr, u64 alignment, struct cpuid_bit *feature,
                              int (*callback)(struct kvm_vcpu *vcpu, u64 value))

will install handlers for wrmsr and rdmsr, declare the msr for
save/restore, tell the wrmsr handler which cpuid bit allows exposing the
msr, and register a callback for when the msr is written by the guest.
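
For illustration only, kvmclock's host-side hookup could then shrink to
something like this; neither the helper nor these exact names exist, they
are just what such consolidation might look like:

static int kvmclock_msr_write(struct kvm_vcpu *vcpu, u64 data)
{
	/* 'data' is the guest physical address plus the enable bit, exactly
	 * as the guest writes it today; the existing wrmsr-handling body of
	 * kvmclock would move here unchanged */
	return 0;
}

void kvm_arch_register_pv_msrs(void)		/* hypothetical init hook */
{
	static struct cpuid_bit kvmclock_bit = {
		.function = KVM_CPUID_FEATURES,
		.bit	  = KVM_FEATURE_CLOCKSOURCE,
	};

	kvm_register_feature_msr(MSR_KVM_SYSTEM_TIME, 4, &kvmclock_bit,
				 kvmclock_msr_write);
}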

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 57+ messages in thread

end of thread, other threads:[~2011-01-27 12:31 UTC | newest]

Thread overview: 57+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-01-24 18:06 [PATCH 00/16] New Proposal for steal time in KVM Glauber Costa
2011-01-24 18:06 ` [PATCH 01/16] KVM-HDR: register KVM basic header infrastructure Glauber Costa
2011-01-26 11:06   ` Avi Kivity
2011-01-26 12:13     ` Glauber Costa
2011-01-26 15:12       ` Avi Kivity
2011-01-26 15:36         ` Glauber Costa
2011-01-26 17:22           ` Anthony Liguori
2011-01-26 17:49             ` Glauber Costa
2011-01-27 12:31               ` Avi Kivity
2011-01-24 18:06 ` [PATCH 02/16] KVM-HV: KVM - KVM Virtual Memory hypervisor implementation Glauber Costa
2011-01-24 18:06 ` [PATCH 03/16] KVM-HDR: KVM Userspace registering ioctl Glauber Costa
2011-01-26 11:12   ` Avi Kivity
2011-01-26 12:14     ` Glauber Costa
2011-01-26 15:14       ` Avi Kivity
2011-01-26 15:23         ` Glauber Costa
2011-01-24 18:06 ` [PATCH 04/16] KVM-HV: " Glauber Costa
2011-01-24 18:06 ` [PATCH 05/16] KVM-HDR: Implement wallclock over KVM - KVM Virtual Memory Glauber Costa
2011-01-26 11:13   ` Avi Kivity
2011-01-26 12:20     ` Glauber Costa
2011-01-26 15:17       ` Avi Kivity
2011-01-26 15:45         ` Glauber Costa
2011-01-27 12:17           ` Avi Kivity
2011-01-24 18:06 ` [PATCH 06/16] " Glauber Costa
2011-01-24 18:06 ` [PATCH 07/16] KVM-GST: " Glauber Costa
2011-01-24 18:06 ` [PATCH 08/16] KVM-HDR: Implement kvmclock systemtime " Glauber Costa
2011-01-24 18:06 ` [PATCH 09/16] KVM-HV: " Glauber Costa
2011-01-24 18:06 ` [PATCH 10/16] KVM-GST: " Glauber Costa
2011-01-24 18:06 ` [PATCH 11/16] KVM-HDR: KVM Steal time implementation Glauber Costa
2011-01-24 23:06   ` Rik van Riel
2011-01-24 18:06 ` [PATCH 12/16] KVM-HV: " Glauber Costa
2011-01-24 23:15   ` Rik van Riel
2011-01-24 18:06 ` [PATCH 13/16] KVM-HV: KVM Steal time calculation Glauber Costa
2011-01-24 23:20   ` Rik van Riel
2011-01-24 18:06 ` [PATCH 14/16] KVM-GST: KVM Steal time registration Glauber Costa
2011-01-24 23:27   ` Rik van Riel
2011-01-24 23:31   ` Rik van Riel
2011-01-25  1:25     ` Glauber Costa
2011-01-25  1:26       ` Rik van Riel
2011-01-25  1:28         ` Glauber Costa
2011-01-24 18:06 ` [PATCH 15/16] KVM-GST: KVM Steal time accounting Glauber Costa
2011-01-24 23:33   ` Rik van Riel
2011-01-24 18:06 ` [PATCH 16/16] KVM-GST: adjust scheduler cpu power Glauber Costa
2011-01-24 18:32   ` Peter Zijlstra
2011-01-24 18:51     ` Glauber Costa
2011-01-24 19:51       ` Peter Zijlstra
2011-01-24 19:57         ` Glauber Costa
2011-01-25 20:02         ` Glauber Costa
2011-01-25 20:13           ` Peter Zijlstra
2011-01-25 20:47             ` Glauber Costa
2011-01-25 21:07               ` Peter Zijlstra
2011-01-25 21:27                 ` Glauber Costa
2011-01-26  9:57                   ` Peter Zijlstra
2011-01-26 15:43                     ` Glauber Costa
2011-01-26 16:46                       ` Peter Zijlstra
2011-01-26 16:53                         ` Peter Zijlstra
2011-01-26 18:11                         ` Glauber Costa
2011-01-24 19:53       ` Peter Zijlstra
