linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/5] Alter steal time reporting in KVM
@ 2012-11-26 20:36 Michael Wolf
  2012-11-26 20:36 ` [PATCH 1/5] Alter the amount of steal time reported by the guest Michael Wolf
                   ` (6 more replies)
  0 siblings, 7 replies; 21+ messages in thread
From: Michael Wolf @ 2012-11-26 20:36 UTC (permalink / raw)
  To: linux-kernel; +Cc: riel, kvm, peterz, mtosatti, glommer, mingo

In the case of where you have a system that is running in a
capped or overcommitted environment the user may see steal time
being reported in accounting tools such as top or vmstat.  This can
cause confusion for the end user.  To ease the confusion this patch set
adds the idea of consigned (expected steal) time.  The host will separate
the consigned time from the steal time.  The consignment limit passed to the
host will be the amount of steal time expected within a fixed period of
time.  Any other steal time accruing during that period will show as the
traditional steal time.

---

Michael Wolf (5):
      Alter the amount of steal time reported by the guest.
      Expand the steal time msr to also contain the consigned time.
      Add the code to send the consigned time from the host to the guest
      Add a timer to allow the separation of consigned from steal time.
      Add an ioctl to communicate the consign limit to the host.


 arch/x86/include/asm/kvm_host.h       |   11 +++++++
 arch/x86/include/asm/kvm_para.h       |    3 +-
 arch/x86/include/asm/paravirt.h       |    4 +--
 arch/x86/include/asm/paravirt_types.h |    2 +
 arch/x86/kernel/kvm.c                 |    8 ++---
 arch/x86/kernel/paravirt.c            |    4 +--
 arch/x86/kvm/x86.c                    |   50 ++++++++++++++++++++++++++++++++-
 fs/proc/stat.c                        |    9 +++++-
 include/linux/kernel_stat.h           |    2 +
 include/linux/kvm_host.h              |    2 +
 include/uapi/linux/kvm.h              |    2 +
 kernel/sched/core.c                   |   10 ++++++-
 kernel/sched/cputime.c                |   21 +++++++++++++-
 kernel/sched/sched.h                  |    2 +
 virt/kvm/kvm_main.c                   |    7 +++++
 15 files changed, 120 insertions(+), 17 deletions(-)

-- 
Signature


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 1/5] Alter the amount of steal time reported by the guest.
  2012-11-26 20:36 [PATCH 0/5] Alter steal time reporting in KVM Michael Wolf
@ 2012-11-26 20:36 ` Michael Wolf
  2012-11-26 20:36 ` [PATCH 2/5] Expand the steal time msr to also contain the consigned time Michael Wolf
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 21+ messages in thread
From: Michael Wolf @ 2012-11-26 20:36 UTC (permalink / raw)
  To: linux-kernel; +Cc: riel, kvm, peterz, mtosatti, glommer, mingo

Modify the amount of stealtime that the kernel reports via the /proc interface.
Steal time will now be broken down into steal_time and consigned_time.
Consigned_time will represent the amount of time that is expected to be lost
due to overcommitment of the physical cpu or by using cpu capping.  The amount
consigned_time will be passed in using an ioctl.  The time will be expressed in
the number of nanoseconds to be lost in during the fixed period.  The fixed period
is currently 1/10th of a second.

Signed-off-by: Michael Wolf <mjw@linux.vnet.ibm.com>
---
 fs/proc/stat.c              |    9 +++++++--
 include/linux/kernel_stat.h |    1 +
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index e296572..cb7fe80 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -82,7 +82,7 @@ static int show_stat(struct seq_file *p, void *v)
 	int i, j;
 	unsigned long jif;
 	u64 user, nice, system, idle, iowait, irq, softirq, steal;
-	u64 guest, guest_nice;
+	u64 guest, guest_nice, consign;
 	u64 sum = 0;
 	u64 sum_softirq = 0;
 	unsigned int per_softirq_sums[NR_SOFTIRQS] = {0};
@@ -90,10 +90,11 @@ static int show_stat(struct seq_file *p, void *v)
 
 	user = nice = system = idle = iowait =
 		irq = softirq = steal = 0;
-	guest = guest_nice = 0;
+	guest = guest_nice = consign = 0;
 	getboottime(&boottime);
 	jif = boottime.tv_sec;
 
+
 	for_each_possible_cpu(i) {
 		user += kcpustat_cpu(i).cpustat[CPUTIME_USER];
 		nice += kcpustat_cpu(i).cpustat[CPUTIME_NICE];
@@ -105,6 +106,7 @@ static int show_stat(struct seq_file *p, void *v)
 		steal += kcpustat_cpu(i).cpustat[CPUTIME_STEAL];
 		guest += kcpustat_cpu(i).cpustat[CPUTIME_GUEST];
 		guest_nice += kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE];
+		consign += kcpustat_cpu(i).cpustat[CPUTIME_CONSIGN];
 		sum += kstat_cpu_irqs_sum(i);
 		sum += arch_irq_stat_cpu(i);
 
@@ -128,6 +130,7 @@ static int show_stat(struct seq_file *p, void *v)
 	seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(steal));
 	seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest));
 	seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest_nice));
+	seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(consign));
 	seq_putc(p, '\n');
 
 	for_each_online_cpu(i) {
@@ -142,6 +145,7 @@ static int show_stat(struct seq_file *p, void *v)
 		steal = kcpustat_cpu(i).cpustat[CPUTIME_STEAL];
 		guest = kcpustat_cpu(i).cpustat[CPUTIME_GUEST];
 		guest_nice = kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE];
+		consign = kcpustat_cpu(i).cpustat[CPUTIME_CONSIGN];
 		seq_printf(p, "cpu%d", i);
 		seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(user));
 		seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(nice));
@@ -153,6 +157,7 @@ static int show_stat(struct seq_file *p, void *v)
 		seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(steal));
 		seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest));
 		seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest_nice));
+		seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(consign));
 		seq_putc(p, '\n');
 	}
 	seq_printf(p, "intr %llu", (unsigned long long)sum);
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 1865b1f..e5978b0 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -28,6 +28,7 @@ enum cpu_usage_stat {
 	CPUTIME_STEAL,
 	CPUTIME_GUEST,
 	CPUTIME_GUEST_NICE,
+	CPUTIME_CONSIGN,
 	NR_STATS,
 };
 


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 2/5] Expand the steal time msr to also contain the consigned time.
  2012-11-26 20:36 [PATCH 0/5] Alter steal time reporting in KVM Michael Wolf
  2012-11-26 20:36 ` [PATCH 1/5] Alter the amount of steal time reported by the guest Michael Wolf
@ 2012-11-26 20:36 ` Michael Wolf
  2012-11-27 21:03   ` Konrad Rzeszutek Wilk
  2012-11-26 20:36 ` [PATCH 3/5] Add the code to send the consigned time from the host to the guest Michael Wolf
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 21+ messages in thread
From: Michael Wolf @ 2012-11-26 20:36 UTC (permalink / raw)
  To: linux-kernel; +Cc: riel, kvm, peterz, mtosatti, glommer, mingo

Add a consigned field.  This field will hold the time lost due to capping or overcommit.
The rest of the time will still show up in the steal-time field.

Signed-off-by: Michael Wolf <mjw@linux.vnet.ibm.com>
---
 arch/x86/include/asm/paravirt.h       |    4 ++--
 arch/x86/include/asm/paravirt_types.h |    2 +-
 arch/x86/kernel/kvm.c                 |    7 ++-----
 kernel/sched/core.c                   |   10 +++++++++-
 kernel/sched/cputime.c                |    2 +-
 5 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index a0facf3..a5f9f30 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -196,9 +196,9 @@ struct static_key;
 extern struct static_key paravirt_steal_enabled;
 extern struct static_key paravirt_steal_rq_enabled;
 
-static inline u64 paravirt_steal_clock(int cpu)
+static inline u64 paravirt_steal_clock(int cpu, u64 *steal)
 {
-	return PVOP_CALL1(u64, pv_time_ops.steal_clock, cpu);
+	PVOP_VCALL2(pv_time_ops.steal_clock, cpu, steal);
 }
 
 static inline unsigned long long paravirt_read_pmc(int counter)
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 142236e..5d4fc8b 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -95,7 +95,7 @@ struct pv_lazy_ops {
 
 struct pv_time_ops {
 	unsigned long long (*sched_clock)(void);
-	unsigned long long (*steal_clock)(int cpu);
+	void (*steal_clock)(int cpu, unsigned long long *steal);
 	unsigned long (*get_tsc_khz)(void);
 };
 
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 4180a87..ac357b3 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -372,9 +372,8 @@ static struct notifier_block kvm_pv_reboot_nb = {
 	.notifier_call = kvm_pv_reboot_notify,
 };
 
-static u64 kvm_steal_clock(int cpu)
+static void kvm_steal_clock(int cpu, u64 *steal)
 {
-	u64 steal;
 	struct kvm_steal_time *src;
 	int version;
 
@@ -382,11 +381,9 @@ static u64 kvm_steal_clock(int cpu)
 	do {
 		version = src->version;
 		rmb();
-		steal = src->steal;
+		*steal = src->steal;
 		rmb();
 	} while ((version & 1) || (version != src->version));
-
-	return steal;
 }
 
 void kvm_disable_steal_time(void)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c2e077c..b21d92d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -748,6 +748,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
  */
 #if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
 	s64 steal = 0, irq_delta = 0;
+	u64 consigned = 0;
 #endif
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
 	irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;
@@ -776,8 +777,15 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
 #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
 	if (static_key_false((&paravirt_steal_rq_enabled))) {
 		u64 st;
+		u64 cs;
 
-		steal = paravirt_steal_clock(cpu_of(rq));
+		paravirt_steal_clock(cpu_of(rq), &steal, &consigned);
+		/*
+		 * since we are not assigning the steal time to cpustats
+		 * here, just combine the steal and consigned times to
+		 * do the rest of the calculations.
+		 */
+		steal += consigned;
 		steal -= rq->prev_steal_time_rq;
 
 		if (unlikely(steal > delta))
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 8d859da..593b647 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -275,7 +275,7 @@ static __always_inline bool steal_account_process_tick(void)
 	if (static_key_false(&paravirt_steal_enabled)) {
 		u64 steal, st = 0;
 
-		steal = paravirt_steal_clock(smp_processor_id());
+		paravirt_steal_clock(smp_processor_id(), &steal);
 		steal -= this_rq()->prev_steal_time;
 
 		st = steal_ticks(steal);


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 3/5] Add the code to send the consigned time from the host to the guest
  2012-11-26 20:36 [PATCH 0/5] Alter steal time reporting in KVM Michael Wolf
  2012-11-26 20:36 ` [PATCH 1/5] Alter the amount of steal time reported by the guest Michael Wolf
  2012-11-26 20:36 ` [PATCH 2/5] Expand the steal time msr to also contain the consigned time Michael Wolf
@ 2012-11-26 20:36 ` Michael Wolf
  2012-11-26 20:37 ` [PATCH 4/5] Add a timer to allow the separation of consigned from steal time Michael Wolf
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 21+ messages in thread
From: Michael Wolf @ 2012-11-26 20:36 UTC (permalink / raw)
  To: linux-kernel; +Cc: riel, kvm, peterz, mtosatti, glommer, mingo

Add the code to send the consigned time from the host to the guest.

Signed-off-by: Michael Wolf <mjw@linux.vnet.ibm.com>
---
 arch/x86/include/asm/kvm_host.h |    1 +
 arch/x86/include/asm/kvm_para.h |    3 ++-
 arch/x86/include/asm/paravirt.h |    4 ++--
 arch/x86/kernel/kvm.c           |    3 ++-
 arch/x86/kernel/paravirt.c      |    4 ++--
 arch/x86/kvm/x86.c              |    2 ++
 include/linux/kernel_stat.h     |    1 +
 kernel/sched/cputime.c          |   21 +++++++++++++++++++--
 kernel/sched/sched.h            |    2 ++
 9 files changed, 33 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index b2e11f4..434d378 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -426,6 +426,7 @@ struct kvm_vcpu_arch {
 		u64 msr_val;
 		u64 last_steal;
 		u64 accum_steal;
+		u64 accum_consigned;
 		struct gfn_to_hva_cache stime;
 		struct kvm_steal_time steal;
 	} st;
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index eb3e9d8..1763369 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -42,9 +42,10 @@
 
 struct kvm_steal_time {
 	__u64 steal;
+	__u64 consigned;
 	__u32 version;
 	__u32 flags;
-	__u32 pad[12];
+	__u32 pad[10];
 };
 
 #define KVM_STEAL_ALIGNMENT_BITS 5
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index a5f9f30..d39e8d0 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -196,9 +196,9 @@ struct static_key;
 extern struct static_key paravirt_steal_enabled;
 extern struct static_key paravirt_steal_rq_enabled;
 
-static inline u64 paravirt_steal_clock(int cpu, u64 *steal)
+static inline u64 paravirt_steal_clock(int cpu, u64 *steal, u64 *consigned)
 {
-	PVOP_VCALL2(pv_time_ops.steal_clock, cpu, steal);
+	PVOP_VCALL3(pv_time_ops.steal_clock, cpu, steal, consigned);
 }
 
 static inline unsigned long long paravirt_read_pmc(int counter)
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index ac357b3..4439a5c 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -372,7 +372,7 @@ static struct notifier_block kvm_pv_reboot_nb = {
 	.notifier_call = kvm_pv_reboot_notify,
 };
 
-static void kvm_steal_clock(int cpu, u64 *steal)
+static void kvm_steal_clock(int cpu, u64 *steal, u64 *consigned)
 {
 	struct kvm_steal_time *src;
 	int version;
@@ -382,6 +382,7 @@ static void kvm_steal_clock(int cpu, u64 *steal)
 		version = src->version;
 		rmb();
 		*steal = src->steal;
+		*consigned = src->consigned;
 		rmb();
 	} while ((version & 1) || (version != src->version));
 }
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 17fff18..3797683 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -207,9 +207,9 @@ static void native_flush_tlb_single(unsigned long addr)
 struct static_key paravirt_steal_enabled;
 struct static_key paravirt_steal_rq_enabled;
 
-static u64 native_steal_clock(int cpu)
+static void native_steal_clock(int cpu, u64 *steal, u64 *consigned)
 {
-	return 0;
+	*steal = *consigned = 0;
 }
 
 /* These are in entry.S */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1eefebe..683b531 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1565,8 +1565,10 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
 		return;
 
 	vcpu->arch.st.steal.steal += vcpu->arch.st.accum_steal;
+	vcpu->arch.st.steal.consigned += vcpu->arch.st.accum_consigned;
 	vcpu->arch.st.steal.version += 2;
 	vcpu->arch.st.accum_steal = 0;
+	vcpu->arch.st.accum_consigned = 0;
 
 	kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
 		&vcpu->arch.st.steal, sizeof(struct kvm_steal_time));
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index e5978b0..91afaa3 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -126,6 +126,7 @@ extern unsigned long long task_delta_exec(struct task_struct *);
 extern void account_user_time(struct task_struct *, cputime_t, cputime_t);
 extern void account_system_time(struct task_struct *, int, cputime_t, cputime_t);
 extern void account_steal_time(cputime_t);
+extern void account_consigned_time(cputime_t);
 extern void account_idle_time(cputime_t);
 
 extern void account_process_tick(struct task_struct *, int user);
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 593b647..53bd0be 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -244,6 +244,18 @@ void account_system_time(struct task_struct *p, int hardirq_offset,
 }
 
 /*
+ * This accounts for the time that is split out of steal time.
+ * Consigned time represents the amount of time that the cpu was
+ * expected to be somewhere else.
+ */
+void account_consigned_time(cputime_t cputime)
+{
+	u64 *cpustat =  kcpustat_this_cpu->cpustat;
+
+	cpustat[CPUTIME_CONSIGN] += (__force u64) cputime;
+}
+
+/*
  * Account for involuntary wait time.
  * @cputime: the cpu time spent in involuntary wait
  */
@@ -274,15 +286,20 @@ static __always_inline bool steal_account_process_tick(void)
 #ifdef CONFIG_PARAVIRT
 	if (static_key_false(&paravirt_steal_enabled)) {
 		u64 steal, st = 0;
+		u64 consigned, cs = 0;
 
-		paravirt_steal_clock(smp_processor_id(), &steal);
+		paravirt_steal_clock(smp_processor_id(), &steal, &consigned);
 		steal -= this_rq()->prev_steal_time;
+		consigned -= this_rq()->prev_consigned_time;
 
 		st = steal_ticks(steal);
+		cs = steal_ticks(consigned);
 		this_rq()->prev_steal_time += st * TICK_NSEC;
+		this_rq()->prev_consigned_time += cs * TICK_NSEC;
 
 		account_steal_time(st);
-		return st;
+		account_consigned_time(cs);
+		return st || cs;
 	}
 #endif
 	return false;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 508e77e..83fe976 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -450,9 +450,11 @@ struct rq {
 #endif
 #ifdef CONFIG_PARAVIRT
 	u64 prev_steal_time;
+	u64 prev_consigned_time;
 #endif
 #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
 	u64 prev_steal_time_rq;
+	u64 prev_consigned_time_rq;
 #endif
 
 	/* calc_load related fields */


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 4/5] Add a timer to allow the separation of consigned from steal time.
  2012-11-26 20:36 [PATCH 0/5] Alter steal time reporting in KVM Michael Wolf
                   ` (2 preceding siblings ...)
  2012-11-26 20:36 ` [PATCH 3/5] Add the code to send the consigned time from the host to the guest Michael Wolf
@ 2012-11-26 20:37 ` Michael Wolf
  2012-11-26 20:37 ` [PATCH 5/5] Add an ioctl to communicate the consign limit to the host Michael Wolf
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 21+ messages in thread
From: Michael Wolf @ 2012-11-26 20:37 UTC (permalink / raw)
  To: linux-kernel; +Cc: riel, kvm, peterz, mtosatti, glommer, mingo

Add a timer to the host.  This will define the period.  During a period
the first n ticks will go into the consigned bucket.  Any other ticks that
occur within the period will be placed in the stealtime bucket.

Signed-off-by: Michael Wolf <mjw@linux.vnet.ibm.com>
---
 arch/x86/include/asm/kvm_host.h |   10 +++++++++
 arch/x86/include/asm/paravirt.h |    2 +-
 arch/x86/kvm/x86.c              |   42 ++++++++++++++++++++++++++++++++++++++-
 3 files changed, 52 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 434d378..4794c95 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -41,6 +41,8 @@
 #define KVM_PIO_PAGE_OFFSET 1
 #define KVM_COALESCED_MMIO_PAGE_OFFSET 2
 
+#define KVM_STEAL_TIMER_DELAY 100000000UL
+
 #define CR0_RESERVED_BITS                                               \
 	(~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
 			  | X86_CR0_ET | X86_CR0_NE | X86_CR0_WP | X86_CR0_AM \
@@ -353,6 +355,14 @@ struct kvm_vcpu_arch {
 	bool tpr_access_reporting;
 
 	/*
+	 * timer used to determine if the time should be counted as
+	 * steal time or consigned time.
+	 */
+	struct hrtimer steal_timer;
+	u64 current_consigned;
+	u64 consigned_limit;
+
+	/*
 	 * Paging state of the vcpu
 	 *
 	 * If the vcpu runs in guest mode with two level paging this still saves
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index d39e8d0..6db79f9 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -196,7 +196,7 @@ struct static_key;
 extern struct static_key paravirt_steal_enabled;
 extern struct static_key paravirt_steal_rq_enabled;
 
-static inline u64 paravirt_steal_clock(int cpu, u64 *steal, u64 *consigned)
+static inline void paravirt_steal_clock(int cpu, u64 *steal, u64 *consigned)
 {
 	PVOP_VCALL3(pv_time_ops.steal_clock, cpu, steal, consigned);
 }
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 683b531..c91f4c9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1546,13 +1546,32 @@ static void kvmclock_reset(struct kvm_vcpu *vcpu)
 static void accumulate_steal_time(struct kvm_vcpu *vcpu)
 {
 	u64 delta;
+	u64 steal_delta;
+	u64 consigned_delta;
 
 	if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
 		return;
 
 	delta = current->sched_info.run_delay - vcpu->arch.st.last_steal;
 	vcpu->arch.st.last_steal = current->sched_info.run_delay;
-	vcpu->arch.st.accum_steal = delta;
+
+	/* split the delta into steal and consigned */
+	if (vcpu->arch.current_consigned < vcpu->arch.consigned_limit) {
+		vcpu->arch.current_consigned += delta;
+		if (vcpu->arch.current_consigned > vcpu->arch.consigned_limit) {
+			steal_delta = vcpu->arch.current_consigned
+						-  vcpu->arch.consigned_limit;
+			consigned_delta = delta - steal_delta;
+		} else {
+			consigned_delta = delta;
+			steal_delta = 0;
+		}
+	} else {
+		consigned_delta = 0;
+		steal_delta = delta;
+	}
+	vcpu->arch.st.accum_steal = steal_delta;
+	vcpu->arch.st.accum_consigned = consigned_delta;
 }
 
 static void record_steal_time(struct kvm_vcpu *vcpu)
@@ -6203,11 +6222,25 @@ bool kvm_vcpu_compatible(struct kvm_vcpu *vcpu)
 
 struct static_key kvm_no_apic_vcpu __read_mostly;
 
+enum hrtimer_restart steal_timer_fn(struct hrtimer *data)
+{
+	struct kvm_vcpu *vcpu;
+	ktime_t now;
+
+	vcpu = container_of(data, struct kvm_vcpu, arch.steal_timer);
+	vcpu->arch.current_consigned = 0;
+	now = ktime_get();
+	hrtimer_forward(&vcpu->arch.steal_timer, now,
+			ktime_set(0, KVM_STEAL_TIMER_DELAY));
+	return HRTIMER_RESTART;
+}
+
 int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
 {
 	struct page *page;
 	struct kvm *kvm;
 	int r;
+	ktime_t ktime;
 
 	BUG_ON(vcpu->kvm == NULL);
 	kvm = vcpu->kvm;
@@ -6251,6 +6284,12 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
 
 	kvm_async_pf_hash_reset(vcpu);
 	kvm_pmu_init(vcpu);
+	/* Initialize and start a timer to capture steal and consigned time */
+	hrtimer_init(&vcpu->arch.steal_timer, CLOCK_MONOTONIC,
+			HRTIMER_MODE_REL);
+	vcpu->arch.steal_timer.function = &steal_timer_fn;
+	ktime = ktime_set(0, KVM_STEAL_TIMER_DELAY);
+	hrtimer_start(&vcpu->arch.steal_timer, ktime, HRTIMER_MODE_REL);
 
 	return 0;
 fail_free_mce_banks:
@@ -6269,6 +6308,7 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
 {
 	int idx;
 
+	hrtimer_cancel(&vcpu->arch.steal_timer);
 	kvm_pmu_destroy(vcpu);
 	kfree(vcpu->arch.mce_banks);
 	kvm_free_lapic(vcpu);


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 5/5] Add an ioctl to communicate the consign limit to the host.
  2012-11-26 20:36 [PATCH 0/5] Alter steal time reporting in KVM Michael Wolf
                   ` (3 preceding siblings ...)
  2012-11-26 20:37 ` [PATCH 4/5] Add a timer to allow the separation of consigned from steal time Michael Wolf
@ 2012-11-26 20:37 ` Michael Wolf
  2012-11-27  8:48 ` [PATCH 0/5] Alter steal time reporting in KVM Glauber Costa
  2012-11-27 23:24 ` Marcelo Tosatti
  6 siblings, 0 replies; 21+ messages in thread
From: Michael Wolf @ 2012-11-26 20:37 UTC (permalink / raw)
  To: linux-kernel; +Cc: riel, kvm, peterz, mtosatti, glommer, mingo

Add an ioctl to communicate the consign limit to the host.

Signed-off-by: Michael Wolf <mjw@linux.vnet.ibm.com>
---
 arch/x86/kvm/x86.c       |    6 ++++++
 include/linux/kvm_host.h |    2 ++
 include/uapi/linux/kvm.h |    2 ++
 virt/kvm/kvm_main.c      |    7 +++++++
 4 files changed, 17 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c91f4c9..5d57469 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5938,6 +5938,12 @@ int kvm_arch_vcpu_ioctl_translate(struct kvm_vcpu *vcpu,
 	return 0;
 }
 
+int kvm_arch_vcpu_ioctl_set_entitlement(struct kvm_vcpu *vcpu, long entitlement)
+{
+	vcpu->arch.consigned_limit = entitlement;
+	return 0;
+}
+
 int kvm_arch_vcpu_ioctl_get_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu)
 {
 	struct i387_fxsave_struct *fxsave =
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 0e2212f..de13648 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -590,6 +590,8 @@ void kvm_arch_hardware_unsetup(void);
 void kvm_arch_check_processor_compat(void *rtn);
 int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu);
 int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu);
+int kvm_arch_vcpu_ioctl_set_entitlement(struct kvm_vcpu *vcpu,
+					long entitlement);
 
 void kvm_free_physmem(struct kvm *kvm);
 
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 0a6d6ba..86f24bb 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -921,6 +921,8 @@ struct kvm_s390_ucas_mapping {
 #define KVM_SET_ONE_REG		  _IOW(KVMIO,  0xac, struct kvm_one_reg)
 /* VM is being stopped by host */
 #define KVM_KVMCLOCK_CTRL	  _IO(KVMIO,   0xad)
+/* Set the consignment limit which will be used to separete steal time */
+#define KVM_SET_ENTITLEMENT	  _IOW(KVMIO, 0xae, unsigned long)
 
 #define KVM_DEV_ASSIGN_ENABLE_IOMMU	(1 << 0)
 #define KVM_DEV_ASSIGN_PCI_2_3		(1 << 1)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index be70035..c712fe5 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2062,6 +2062,13 @@ out_free2:
 		r = 0;
 		break;
 	}
+	case KVM_SET_ENTITLEMENT: {
+		r = kvm_arch_vcpu_ioctl_set_entitlement(vcpu, arg);
+		if (r)
+			goto out;
+		r = 0;
+		break;
+	}
 	default:
 		r = kvm_arch_vcpu_ioctl(filp, ioctl, arg);
 	}


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/5] Alter steal time reporting in KVM
  2012-11-26 20:36 [PATCH 0/5] Alter steal time reporting in KVM Michael Wolf
                   ` (4 preceding siblings ...)
  2012-11-26 20:37 ` [PATCH 5/5] Add an ioctl to communicate the consign limit to the host Michael Wolf
@ 2012-11-27  8:48 ` Glauber Costa
  2012-11-27 15:10   ` Michael Wolf
  2012-11-28 19:16   ` Anthony Liguori
  2012-11-27 23:24 ` Marcelo Tosatti
  6 siblings, 2 replies; 21+ messages in thread
From: Glauber Costa @ 2012-11-27  8:48 UTC (permalink / raw)
  To: Michael Wolf; +Cc: linux-kernel, riel, kvm, peterz, mtosatti, mingo

Hi,

On 11/27/2012 12:36 AM, Michael Wolf wrote:
> In the case of where you have a system that is running in a
> capped or overcommitted environment the user may see steal time
> being reported in accounting tools such as top or vmstat.  This can
> cause confusion for the end user.  To ease the confusion this patch set
> adds the idea of consigned (expected steal) time.  The host will separate
> the consigned time from the steal time.  The consignment limit passed to the
> host will be the amount of steal time expected within a fixed period of
> time.  Any other steal time accruing during that period will show as the
> traditional steal time.

If you submit this again, please include a version number in your series.

It would also be helpful to include a small changelog about what changed
between last version and this version, so we could focus on that.

As for the rest, I answered your previous two submissions saying I don't
agree with the concept. If you hadn't changed anything, resending it
won't change my mind.

I could of course, be mistaken or misguided. But I had also not seen any
wave of support in favor of this previously, so basically I have no new
data to make me believe I should see it any differently.

Let's try this again:

* Rik asked you in your last submission how does ppc handle this. You
said, and I quote: "In the case of lpar on POWER systems they simply
report steal time and do not alter it in any way.
They do however report how much processor is assigned to the partition
and that information is in /proc/ppc64/lparcfg."

Now, that is a *way* more sensible thing to do. Much more. "Confusing
users" is something extremely subjective. This is specially true about
concepts that are know for quite some time, like steal time. If you out
of a sudden change the meaning of this, it is sure to confuse a lot more
users than it would clarify.





> 
> ---
> 
> Michael Wolf (5):
>       Alter the amount of steal time reported by the guest.
>       Expand the steal time msr to also contain the consigned time.
>       Add the code to send the consigned time from the host to the guest
>       Add a timer to allow the separation of consigned from steal time.
>       Add an ioctl to communicate the consign limit to the host.
> 
> 
>  arch/x86/include/asm/kvm_host.h       |   11 +++++++
>  arch/x86/include/asm/kvm_para.h       |    3 +-
>  arch/x86/include/asm/paravirt.h       |    4 +--
>  arch/x86/include/asm/paravirt_types.h |    2 +
>  arch/x86/kernel/kvm.c                 |    8 ++---
>  arch/x86/kernel/paravirt.c            |    4 +--
>  arch/x86/kvm/x86.c                    |   50 ++++++++++++++++++++++++++++++++-
>  fs/proc/stat.c                        |    9 +++++-
>  include/linux/kernel_stat.h           |    2 +
>  include/linux/kvm_host.h              |    2 +
>  include/uapi/linux/kvm.h              |    2 +
>  kernel/sched/core.c                   |   10 ++++++-
>  kernel/sched/cputime.c                |   21 +++++++++++++-
>  kernel/sched/sched.h                  |    2 +
>  virt/kvm/kvm_main.c                   |    7 +++++
>  15 files changed, 120 insertions(+), 17 deletions(-)
> 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/5] Alter steal time reporting in KVM
  2012-11-27  8:48 ` [PATCH 0/5] Alter steal time reporting in KVM Glauber Costa
@ 2012-11-27 15:10   ` Michael Wolf
  2012-11-28  8:45     ` Glauber Costa
  2012-11-28 19:16   ` Anthony Liguori
  1 sibling, 1 reply; 21+ messages in thread
From: Michael Wolf @ 2012-11-27 15:10 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, riel, kvm, peterz, mtosatti, mingo, anthony,
	gleb@redhat.com >> Gleb Natapov

On 11/27/2012 02:48 AM, Glauber Costa wrote:
> Hi,
>
> On 11/27/2012 12:36 AM, Michael Wolf wrote:
>> In the case of where you have a system that is running in a
>> capped or overcommitted environment the user may see steal time
>> being reported in accounting tools such as top or vmstat.  This can
>> cause confusion for the end user.  To ease the confusion this patch set
>> adds the idea of consigned (expected steal) time.  The host will separate
>> the consigned time from the steal time.  The consignment limit passed to the
>> host will be the amount of steal time expected within a fixed period of
>> time.  Any other steal time accruing during that period will show as the
>> traditional steal time.
> If you submit this again, please include a version number in your series.
Will do.  The patchset was sent twice yesterday by mistake.  Got an 
error the first time and didn't
think the patches went out.  This has been corrected.
>
> It would also be helpful to include a small changelog about what changed
> between last version and this version, so we could focus on that.
yes, will do that.  When I took the RFC off the patches I was looking at 
it as a new patchset which was
a mistake.  I will make sure to add a changelog when I submit again.
>
> As for the rest, I answered your previous two submissions saying I don't
> agree with the concept. If you hadn't changed anything, resending it
> won't change my mind.
>
> I could of course, be mistaken or misguided. But I had also not seen any
> wave of support in favor of this previously, so basically I have no new
> data to make me believe I should see it any differently.
>
> Let's try this again:
>
> * Rik asked you in your last submission how does ppc handle this. You
> said, and I quote: "In the case of lpar on POWER systems they simply
> report steal time and do not alter it in any way.
> They do however report how much processor is assigned to the partition
> and that information is in /proc/ppc64/lparcfg."
Yes, but we still get questions from users asking what is steal time? 
why am I seeing this?
>
> Now, that is a *way* more sensible thing to do. Much more. "Confusing
> users" is something extremely subjective. This is specially true about
> concepts that are know for quite some time, like steal time. If you out
> of a sudden change the meaning of this, it is sure to confuse a lot more
> users than it would clarify.
Something like this could certainly be done.  But when I was submitting 
the patch set as
an RFC then qemu was passing a cpu percentage that would be used by the 
guest kernel
to adjust the steal time. This percentage was being stored on the guest 
as a sysctl value.
Avi stated he didn't like that kind of coupling, and that the value
could get out of sync.  Anthony stated "The guest shouldn't need to know 
it's entitlement. Or at least, it's up to a management tool to report 
that in a way that's meaningful for the guest."

So perhaps I misunderstood what they were suggesting, but I took it to 
mean that they did not
want the guest to know what the entitlement was.  That the host should 
take care of it and just
report the already adjusted data to the guest.  So in this version of 
the code the host would use a set
period for a timer and be passed essentially a number of ticks of 
expected steal time.  The host
would then use the timer to break out the steal time into consigned and 
steal buckets which would be
reported to the guest.

Both the consigned and the steal would be reported via /proc/stat. So 
anyone needing to see total
time away could add the two fields together.  The user, however, when 
using tools like top or vmstat
would see the usage based on what the guest is entitled to.

Do you have suggestions for how I can build consensus around one of the 
two approaches?


>
>
>
>
>> ---
>>
>> Michael Wolf (5):
>>        Alter the amount of steal time reported by the guest.
>>        Expand the steal time msr to also contain the consigned time.
>>        Add the code to send the consigned time from the host to the guest
>>        Add a timer to allow the separation of consigned from steal time.
>>        Add an ioctl to communicate the consign limit to the host.
>>
>>
>>   arch/x86/include/asm/kvm_host.h       |   11 +++++++
>>   arch/x86/include/asm/kvm_para.h       |    3 +-
>>   arch/x86/include/asm/paravirt.h       |    4 +--
>>   arch/x86/include/asm/paravirt_types.h |    2 +
>>   arch/x86/kernel/kvm.c                 |    8 ++---
>>   arch/x86/kernel/paravirt.c            |    4 +--
>>   arch/x86/kvm/x86.c                    |   50 ++++++++++++++++++++++++++++++++-
>>   fs/proc/stat.c                        |    9 +++++-
>>   include/linux/kernel_stat.h           |    2 +
>>   include/linux/kvm_host.h              |    2 +
>>   include/uapi/linux/kvm.h              |    2 +
>>   kernel/sched/core.c                   |   10 ++++++-
>>   kernel/sched/cputime.c                |   21 +++++++++++++-
>>   kernel/sched/sched.h                  |    2 +
>>   virt/kvm/kvm_main.c                   |    7 +++++
>>   15 files changed, 120 insertions(+), 17 deletions(-)
>>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/5] Expand the steal time msr to also contain the consigned time.
  2012-11-26 20:36 ` [PATCH 2/5] Expand the steal time msr to also contain the consigned time Michael Wolf
@ 2012-11-27 21:03   ` Konrad Rzeszutek Wilk
  2012-11-28 15:23     ` Michael Wolf
  0 siblings, 1 reply; 21+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-11-27 21:03 UTC (permalink / raw)
  To: Michael Wolf; +Cc: linux-kernel, riel, kvm, peterz, mtosatti, glommer, mingo

On Mon, Nov 26, 2012 at 02:36:45PM -0600, Michael Wolf wrote:
> Add a consigned field.  This field will hold the time lost due to capping or overcommit.
> The rest of the time will still show up in the steal-time field.
> 
> Signed-off-by: Michael Wolf <mjw@linux.vnet.ibm.com>
> ---
>  arch/x86/include/asm/paravirt.h       |    4 ++--
>  arch/x86/include/asm/paravirt_types.h |    2 +-
>  arch/x86/kernel/kvm.c                 |    7 ++-----
>  kernel/sched/core.c                   |   10 +++++++++-
>  kernel/sched/cputime.c                |    2 +-
>  5 files changed, 15 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
> index a0facf3..a5f9f30 100644
> --- a/arch/x86/include/asm/paravirt.h
> +++ b/arch/x86/include/asm/paravirt.h
> @@ -196,9 +196,9 @@ struct static_key;
>  extern struct static_key paravirt_steal_enabled;
>  extern struct static_key paravirt_steal_rq_enabled;
>  
> -static inline u64 paravirt_steal_clock(int cpu)
> +static inline u64 paravirt_steal_clock(int cpu, u64 *steal)

So its u64 here.
>  {
> -	return PVOP_CALL1(u64, pv_time_ops.steal_clock, cpu);
> +	PVOP_VCALL2(pv_time_ops.steal_clock, cpu, steal);
>  }
>  
>  static inline unsigned long long paravirt_read_pmc(int counter)
> diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
> index 142236e..5d4fc8b 100644
> --- a/arch/x86/include/asm/paravirt_types.h
> +++ b/arch/x86/include/asm/paravirt_types.h
> @@ -95,7 +95,7 @@ struct pv_lazy_ops {
>  
>  struct pv_time_ops {
>  	unsigned long long (*sched_clock)(void);
> -	unsigned long long (*steal_clock)(int cpu);
> +	void (*steal_clock)(int cpu, unsigned long long *steal);

But not u64 here? Any particular reason?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/5] Alter steal time reporting in KVM
  2012-11-26 20:36 [PATCH 0/5] Alter steal time reporting in KVM Michael Wolf
                   ` (5 preceding siblings ...)
  2012-11-27  8:48 ` [PATCH 0/5] Alter steal time reporting in KVM Glauber Costa
@ 2012-11-27 23:24 ` Marcelo Tosatti
  2012-11-28  0:32   ` Marcelo Tosatti
  2012-11-28 18:43   ` Michael Wolf
  6 siblings, 2 replies; 21+ messages in thread
From: Marcelo Tosatti @ 2012-11-27 23:24 UTC (permalink / raw)
  To: Michael Wolf; +Cc: linux-kernel, riel, kvm, peterz, glommer, mingo

On Mon, Nov 26, 2012 at 02:36:24PM -0600, Michael Wolf wrote:
> In the case of where you have a system that is running in a
> capped or overcommitted environment the user may see steal time
> being reported in accounting tools such as top or vmstat.

The definition of stolen time is 'time during which the virtual CPU is
runnable to not running'. Overcommit is the main scenario which steal
time helps to detect.

Can you describe the 'capped' case?

>  This can
> cause confusion for the end user.  To ease the confusion this patch set
> adds the idea of consigned (expected steal) time.  The host will separate
> the consigned time from the steal time.  The consignment limit passed to the
> host will be the amount of steal time expected within a fixed period of
> time.  Any other steal time accruing during that period will show as the
> traditional steal time.
> 
> ---
> 
> Michael Wolf (5):
>       Alter the amount of steal time reported by the guest.
>       Expand the steal time msr to also contain the consigned time.
>       Add the code to send the consigned time from the host to the guest
>       Add a timer to allow the separation of consigned from steal time.
>       Add an ioctl to communicate the consign limit to the host.
> 
> 
>  arch/x86/include/asm/kvm_host.h       |   11 +++++++
>  arch/x86/include/asm/kvm_para.h       |    3 +-
>  arch/x86/include/asm/paravirt.h       |    4 +--
>  arch/x86/include/asm/paravirt_types.h |    2 +
>  arch/x86/kernel/kvm.c                 |    8 ++---
>  arch/x86/kernel/paravirt.c            |    4 +--
>  arch/x86/kvm/x86.c                    |   50 ++++++++++++++++++++++++++++++++-
>  fs/proc/stat.c                        |    9 +++++-
>  include/linux/kernel_stat.h           |    2 +
>  include/linux/kvm_host.h              |    2 +
>  include/uapi/linux/kvm.h              |    2 +
>  kernel/sched/core.c                   |   10 ++++++-
>  kernel/sched/cputime.c                |   21 +++++++++++++-
>  kernel/sched/sched.h                  |    2 +
>  virt/kvm/kvm_main.c                   |    7 +++++
>  15 files changed, 120 insertions(+), 17 deletions(-)
> 
> -- 
> Signature
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/5] Alter steal time reporting in KVM
  2012-11-27 23:24 ` Marcelo Tosatti
@ 2012-11-28  0:32   ` Marcelo Tosatti
  2012-11-28 18:43   ` Michael Wolf
  1 sibling, 0 replies; 21+ messages in thread
From: Marcelo Tosatti @ 2012-11-28  0:32 UTC (permalink / raw)
  To: Michael Wolf; +Cc: linux-kernel, riel, kvm, peterz, glommer, mingo

On Tue, Nov 27, 2012 at 09:24:42PM -0200, Marcelo Tosatti wrote:
> On Mon, Nov 26, 2012 at 02:36:24PM -0600, Michael Wolf wrote:
> > In the case of where you have a system that is running in a
> > capped or overcommitted environment the user may see steal time
> > being reported in accounting tools such as top or vmstat.
> 
> The definition of stolen time is 'time during which the virtual CPU is
> runnable to not running'. Overcommit is the main scenario which steal
> time helps to detect.

Meant 'runnable but not running'.


> Can you describe the 'capped' case?


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/5] Alter steal time reporting in KVM
  2012-11-27 15:10   ` Michael Wolf
@ 2012-11-28  8:45     ` Glauber Costa
  2012-11-28 18:44       ` Michael Wolf
  0 siblings, 1 reply; 21+ messages in thread
From: Glauber Costa @ 2012-11-28  8:45 UTC (permalink / raw)
  To: Michael Wolf
  Cc: linux-kernel, riel, kvm, peterz, mtosatti, mingo, anthony,
	gleb@redhat.com >> Gleb Natapov

On 11/27/2012 07:10 PM, Michael Wolf wrote:
> On 11/27/2012 02:48 AM, Glauber Costa wrote:
>> Hi,
>>
>> On 11/27/2012 12:36 AM, Michael Wolf wrote:
>>> In the case of where you have a system that is running in a
>>> capped or overcommitted environment the user may see steal time
>>> being reported in accounting tools such as top or vmstat.  This can
>>> cause confusion for the end user.  To ease the confusion this patch set
>>> adds the idea of consigned (expected steal) time.  The host will
>>> separate
>>> the consigned time from the steal time.  The consignment limit passed
>>> to the
>>> host will be the amount of steal time expected within a fixed period of
>>> time.  Any other steal time accruing during that period will show as the
>>> traditional steal time.
>> If you submit this again, please include a version number in your series.
> Will do.  The patchset was sent twice yesterday by mistake.  Got an
> error the first time and didn't
> think the patches went out.  This has been corrected.
>>
>> It would also be helpful to include a small changelog about what changed
>> between last version and this version, so we could focus on that.
> yes, will do that.  When I took the RFC off the patches I was looking at
> it as a new patchset which was
> a mistake.  I will make sure to add a changelog when I submit again.
>>
>> As for the rest, I answered your previous two submissions saying I don't
>> agree with the concept. If you hadn't changed anything, resending it
>> won't change my mind.
>>
>> I could of course, be mistaken or misguided. But I had also not seen any
>> wave of support in favor of this previously, so basically I have no new
>> data to make me believe I should see it any differently.
>>
>> Let's try this again:
>>
>> * Rik asked you in your last submission how does ppc handle this. You
>> said, and I quote: "In the case of lpar on POWER systems they simply
>> report steal time and do not alter it in any way.
>> They do however report how much processor is assigned to the partition
>> and that information is in /proc/ppc64/lparcfg."
> Yes, but we still get questions from users asking what is steal time?
> why am I seeing this?
>>
>> Now, that is a *way* more sensible thing to do. Much more. "Confusing
>> users" is something extremely subjective. This is specially true about
>> concepts that are know for quite some time, like steal time. If you out
>> of a sudden change the meaning of this, it is sure to confuse a lot more
>> users than it would clarify.
> Something like this could certainly be done.  But when I was submitting
> the patch set as
> an RFC then qemu was passing a cpu percentage that would be used by the
> guest kernel
> to adjust the steal time. This percentage was being stored on the guest
> as a sysctl value.
> Avi stated he didn't like that kind of coupling, and that the value
> could get out of sync.  Anthony stated "The guest shouldn't need to know
> it's entitlement. Or at least, it's up to a management tool to report
> that in a way that's meaningful for the guest."
> 
> So perhaps I misunderstood what they were suggesting, but I took it to
> mean that they did not
> want the guest to know what the entitlement was.  That the host should
> take care of it and just
> report the already adjusted data to the guest.  So in this version of
> the code the host would use a set
> period for a timer and be passed essentially a number of ticks of
> expected steal time.  The host
> would then use the timer to break out the steal time into consigned and
> steal buckets which would be
> reported to the guest.
> 
> Both the consigned and the steal would be reported via /proc/stat. So
> anyone needing to see total
> time away could add the two fields together.  The user, however, when
> using tools like top or vmstat
> would see the usage based on what the guest is entitled to.
> 
> Do you have suggestions for how I can build consensus around one of the
> two approaches?
> 

Before I answer this, can you please detail which mechanism are you
using to enforce the entitlement? Is it the cgroup cpu controller, or
something else?


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/5] Expand the steal time msr to also contain the consigned time.
  2012-11-27 21:03   ` Konrad Rzeszutek Wilk
@ 2012-11-28 15:23     ` Michael Wolf
  0 siblings, 0 replies; 21+ messages in thread
From: Michael Wolf @ 2012-11-28 15:23 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: linux-kernel, riel, kvm, peterz, mtosatti, glommer, mingo

On 11/27/2012 03:03 PM, Konrad Rzeszutek Wilk wrote:
> On Mon, Nov 26, 2012 at 02:36:45PM -0600, Michael Wolf wrote:
>> Add a consigned field.  This field will hold the time lost due to capping or overcommit.
>> The rest of the time will still show up in the steal-time field.
>>
>> Signed-off-by: Michael Wolf <mjw@linux.vnet.ibm.com>
>> ---
>>   arch/x86/include/asm/paravirt.h       |    4 ++--
>>   arch/x86/include/asm/paravirt_types.h |    2 +-
>>   arch/x86/kernel/kvm.c                 |    7 ++-----
>>   kernel/sched/core.c                   |   10 +++++++++-
>>   kernel/sched/cputime.c                |    2 +-
>>   5 files changed, 15 insertions(+), 10 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
>> index a0facf3..a5f9f30 100644
>> --- a/arch/x86/include/asm/paravirt.h
>> +++ b/arch/x86/include/asm/paravirt.h
>> @@ -196,9 +196,9 @@ struct static_key;
>>   extern struct static_key paravirt_steal_enabled;
>>   extern struct static_key paravirt_steal_rq_enabled;
>>   
>> -static inline u64 paravirt_steal_clock(int cpu)
>> +static inline u64 paravirt_steal_clock(int cpu, u64 *steal)
> So its u64 here.
>>   {
>> -	return PVOP_CALL1(u64, pv_time_ops.steal_clock, cpu);
>> +	PVOP_VCALL2(pv_time_ops.steal_clock, cpu, steal);
>>   }
>>   
>>   static inline unsigned long long paravirt_read_pmc(int counter)
>> diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
>> index 142236e..5d4fc8b 100644
>> --- a/arch/x86/include/asm/paravirt_types.h
>> +++ b/arch/x86/include/asm/paravirt_types.h
>> @@ -95,7 +95,7 @@ struct pv_lazy_ops {
>>   
>>   struct pv_time_ops {
>>   	unsigned long long (*sched_clock)(void);
>> -	unsigned long long (*steal_clock)(int cpu);
>> +	void (*steal_clock)(int cpu, unsigned long long *steal);
> But not u64 here? Any particular reason?
>
It should be void everywhere, thanks for catching that I will change the 
code.



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/5] Alter steal time reporting in KVM
  2012-11-27 23:24 ` Marcelo Tosatti
  2012-11-28  0:32   ` Marcelo Tosatti
@ 2012-11-28 18:43   ` Michael Wolf
  2012-11-28 20:55     ` Glauber Costa
  1 sibling, 1 reply; 21+ messages in thread
From: Michael Wolf @ 2012-11-28 18:43 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: linux-kernel, riel, kvm, peterz, glommer, mingo

On 11/27/2012 05:24 PM, Marcelo Tosatti wrote:
> On Mon, Nov 26, 2012 at 02:36:24PM -0600, Michael Wolf wrote:
>> In the case of where you have a system that is running in a
>> capped or overcommitted environment the user may see steal time
>> being reported in accounting tools such as top or vmstat.
> The definition of stolen time is 'time during which the virtual CPU is
> runnable to not running'. Overcommit is the main scenario which steal
> time helps to detect.
>
> Can you describe the 'capped' case?
In the capped case, the time that the guest spends waiting due to it 
having used its full allottment of time shows up as steal time.  The way 
my patchset currently stands is that you would set up the
bandwidth control and you would have to pass it a  matching value from 
qemu.  In the future, it would
be possible to have something parse the bandwidth setting and 
automatically adjust the setting in the
host used for steal time reporting.
>
>>   This can
>> cause confusion for the end user.  To ease the confusion this patch set
>> adds the idea of consigned (expected steal) time.  The host will separate
>> the consigned time from the steal time.  The consignment limit passed to the
>> host will be the amount of steal time expected within a fixed period of
>> time.  Any other steal time accruing during that period will show as the
>> traditional steal time.
>>
>> ---
>>
>> Michael Wolf (5):
>>        Alter the amount of steal time reported by the guest.
>>        Expand the steal time msr to also contain the consigned time.
>>        Add the code to send the consigned time from the host to the guest
>>        Add a timer to allow the separation of consigned from steal time.
>>        Add an ioctl to communicate the consign limit to the host.
>>
>>
>>   arch/x86/include/asm/kvm_host.h       |   11 +++++++
>>   arch/x86/include/asm/kvm_para.h       |    3 +-
>>   arch/x86/include/asm/paravirt.h       |    4 +--
>>   arch/x86/include/asm/paravirt_types.h |    2 +
>>   arch/x86/kernel/kvm.c                 |    8 ++---
>>   arch/x86/kernel/paravirt.c            |    4 +--
>>   arch/x86/kvm/x86.c                    |   50 ++++++++++++++++++++++++++++++++-
>>   fs/proc/stat.c                        |    9 +++++-
>>   include/linux/kernel_stat.h           |    2 +
>>   include/linux/kvm_host.h              |    2 +
>>   include/uapi/linux/kvm.h              |    2 +
>>   kernel/sched/core.c                   |   10 ++++++-
>>   kernel/sched/cputime.c                |   21 +++++++++++++-
>>   kernel/sched/sched.h                  |    2 +
>>   virt/kvm/kvm_main.c                   |    7 +++++
>>   15 files changed, 120 insertions(+), 17 deletions(-)
>>
>> -- 
>> Signature
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe kvm" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/5] Alter steal time reporting in KVM
  2012-11-28  8:45     ` Glauber Costa
@ 2012-11-28 18:44       ` Michael Wolf
  0 siblings, 0 replies; 21+ messages in thread
From: Michael Wolf @ 2012-11-28 18:44 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, riel, kvm, peterz, mtosatti, mingo, anthony,
	gleb@redhat.com >> Gleb Natapov

On 11/28/2012 02:45 AM, Glauber Costa wrote:
> On 11/27/2012 07:10 PM, Michael Wolf wrote:
>> On 11/27/2012 02:48 AM, Glauber Costa wrote:
>>> Hi,
>>>
>>> On 11/27/2012 12:36 AM, Michael Wolf wrote:
>>>> In the case of where you have a system that is running in a
>>>> capped or overcommitted environment the user may see steal time
>>>> being reported in accounting tools such as top or vmstat.  This can
>>>> cause confusion for the end user.  To ease the confusion this patch set
>>>> adds the idea of consigned (expected steal) time.  The host will
>>>> separate
>>>> the consigned time from the steal time.  The consignment limit passed
>>>> to the
>>>> host will be the amount of steal time expected within a fixed period of
>>>> time.  Any other steal time accruing during that period will show as the
>>>> traditional steal time.
>>> If you submit this again, please include a version number in your series.
>> Will do.  The patchset was sent twice yesterday by mistake.  Got an
>> error the first time and didn't
>> think the patches went out.  This has been corrected.
>>> It would also be helpful to include a small changelog about what changed
>>> between last version and this version, so we could focus on that.
>> yes, will do that.  When I took the RFC off the patches I was looking at
>> it as a new patchset which was
>> a mistake.  I will make sure to add a changelog when I submit again.
>>> As for the rest, I answered your previous two submissions saying I don't
>>> agree with the concept. If you hadn't changed anything, resending it
>>> won't change my mind.
>>>
>>> I could of course, be mistaken or misguided. But I had also not seen any
>>> wave of support in favor of this previously, so basically I have no new
>>> data to make me believe I should see it any differently.
>>>
>>> Let's try this again:
>>>
>>> * Rik asked you in your last submission how does ppc handle this. You
>>> said, and I quote: "In the case of lpar on POWER systems they simply
>>> report steal time and do not alter it in any way.
>>> They do however report how much processor is assigned to the partition
>>> and that information is in /proc/ppc64/lparcfg."
>> Yes, but we still get questions from users asking what is steal time?
>> why am I seeing this?
>>> Now, that is a *way* more sensible thing to do. Much more. "Confusing
>>> users" is something extremely subjective. This is specially true about
>>> concepts that are know for quite some time, like steal time. If you out
>>> of a sudden change the meaning of this, it is sure to confuse a lot more
>>> users than it would clarify.
>> Something like this could certainly be done.  But when I was submitting
>> the patch set as
>> an RFC then qemu was passing a cpu percentage that would be used by the
>> guest kernel
>> to adjust the steal time. This percentage was being stored on the guest
>> as a sysctl value.
>> Avi stated he didn't like that kind of coupling, and that the value
>> could get out of sync.  Anthony stated "The guest shouldn't need to know
>> it's entitlement. Or at least, it's up to a management tool to report
>> that in a way that's meaningful for the guest."
>>
>> So perhaps I misunderstood what they were suggesting, but I took it to
>> mean that they did not
>> want the guest to know what the entitlement was.  That the host should
>> take care of it and just
>> report the already adjusted data to the guest.  So in this version of
>> the code the host would use a set
>> period for a timer and be passed essentially a number of ticks of
>> expected steal time.  The host
>> would then use the timer to break out the steal time into consigned and
>> steal buckets which would be
>> reported to the guest.
>>
>> Both the consigned and the steal would be reported via /proc/stat. So
>> anyone needing to see total
>> time away could add the two fields together.  The user, however, when
>> using tools like top or vmstat
>> would see the usage based on what the guest is entitled to.
>>
>> Do you have suggestions for how I can build consensus around one of the
>> two approaches?
>>
> Before I answer this, can you please detail which mechanism are you
> using to enforce the entitlement? Is it the cgroup cpu controller, or
> something else?
It is setup using cpu overcommit.  But the request was for something 
that would work in both
the overcommit environment as well as when hard capping is being used.

>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/5] Alter steal time reporting in KVM
  2012-11-27  8:48 ` [PATCH 0/5] Alter steal time reporting in KVM Glauber Costa
  2012-11-27 15:10   ` Michael Wolf
@ 2012-11-28 19:16   ` Anthony Liguori
  1 sibling, 0 replies; 21+ messages in thread
From: Anthony Liguori @ 2012-11-28 19:16 UTC (permalink / raw)
  To: Glauber Costa, Michael Wolf
  Cc: linux-kernel, riel, kvm, peterz, mtosatti, mingo

Glauber Costa <glommer@parallels.com> writes:

> Hi,
>
> On 11/27/2012 12:36 AM, Michael Wolf wrote:
>> In the case of where you have a system that is running in a
>> capped or overcommitted environment the user may see steal time
>> being reported in accounting tools such as top or vmstat.  This can
>> cause confusion for the end user.  To ease the confusion this patch set
>> adds the idea of consigned (expected steal) time.  The host will separate
>> the consigned time from the steal time.  The consignment limit passed to the
>> host will be the amount of steal time expected within a fixed period of
>> time.  Any other steal time accruing during that period will show as the
>> traditional steal time.
>
> If you submit this again, please include a version number in your series.
>
> It would also be helpful to include a small changelog about what changed
> between last version and this version, so we could focus on that.
>
> As for the rest, I answered your previous two submissions saying I don't
> agree with the concept. If you hadn't changed anything, resending it
> won't change my mind.
>
> I could of course, be mistaken or misguided. But I had also not seen any
> wave of support in favor of this previously, so basically I have no new
> data to make me believe I should see it any differently.
>
> Let's try this again:
>
> * Rik asked you in your last submission how does ppc handle this. You
> said, and I quote: "In the case of lpar on POWER systems they simply
> report steal time and do not alter it in any way.
> They do however report how much processor is assigned to the partition
> and that information is in /proc/ppc64/lparcfg."

This only is helpful for static entitlements.

But if we allow dynamic entitlements--which is a very useful feature,
think buying an online "upgrade" in a cloud environment--then you need
to account for entitlement loss at the same place where you do the rest
of the accounting: in /proc/stat.

> Now, that is a *way* more sensible thing to do. Much more. "Confusing
> users" is something extremely subjective. This is specially true about
> concepts that are know for quite some time, like steal time. If you out
> of a sudden change the meaning of this, it is sure to confuse a lot more
> users than it would clarify.

I'll bring you a nice bottle of scotch at the next KVM Forum if you can
find me one user that can accurately describe what steal time is.

The semantics are so incredibly subtle that I have a hard time believing
anyone actually understands what it means today.

Regards,

Anthony Liguori
>
>
>
>
>
>> 
>> ---
>> 
>> Michael Wolf (5):
>>       Alter the amount of steal time reported by the guest.
>>       Expand the steal time msr to also contain the consigned time.
>>       Add the code to send the consigned time from the host to the guest
>>       Add a timer to allow the separation of consigned from steal time.
>>       Add an ioctl to communicate the consign limit to the host.
>> 
>> 
>>  arch/x86/include/asm/kvm_host.h       |   11 +++++++
>>  arch/x86/include/asm/kvm_para.h       |    3 +-
>>  arch/x86/include/asm/paravirt.h       |    4 +--
>>  arch/x86/include/asm/paravirt_types.h |    2 +
>>  arch/x86/kernel/kvm.c                 |    8 ++---
>>  arch/x86/kernel/paravirt.c            |    4 +--
>>  arch/x86/kvm/x86.c                    |   50 ++++++++++++++++++++++++++++++++-
>>  fs/proc/stat.c                        |    9 +++++-
>>  include/linux/kernel_stat.h           |    2 +
>>  include/linux/kvm_host.h              |    2 +
>>  include/uapi/linux/kvm.h              |    2 +
>>  kernel/sched/core.c                   |   10 ++++++-
>>  kernel/sched/cputime.c                |   21 +++++++++++++-
>>  kernel/sched/sched.h                  |    2 +
>>  virt/kvm/kvm_main.c                   |    7 +++++
>>  15 files changed, 120 insertions(+), 17 deletions(-)
>> 
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/5] Alter steal time reporting in KVM
  2012-11-28 18:43   ` Michael Wolf
@ 2012-11-28 20:55     ` Glauber Costa
  2012-11-29 17:43       ` Michael Wolf
  0 siblings, 1 reply; 21+ messages in thread
From: Glauber Costa @ 2012-11-28 20:55 UTC (permalink / raw)
  To: Michael Wolf; +Cc: Marcelo Tosatti, linux-kernel, riel, kvm, peterz, mingo

On 11/28/2012 10:43 PM, Michael Wolf wrote:
> On 11/27/2012 05:24 PM, Marcelo Tosatti wrote:
>> On Mon, Nov 26, 2012 at 02:36:24PM -0600, Michael Wolf wrote:
>>> In the case of where you have a system that is running in a
>>> capped or overcommitted environment the user may see steal time
>>> being reported in accounting tools such as top or vmstat.
>> The definition of stolen time is 'time during which the virtual CPU is
>> runnable to not running'. Overcommit is the main scenario which steal
>> time helps to detect.
>>
>> Can you describe the 'capped' case?
> In the capped case, the time that the guest spends waiting due to it
> having used its full allottment of time shows up as steal time.  The way
> my patchset currently stands is that you would set up the
> bandwidth control and you would have to pass it a  matching value from
> qemu.  In the future, it would
> be possible to have something parse the bandwidth setting and
> automatically adjust the setting in the
> host used for steal time reporting.

Ok, so correct me if I am wrong, but I believe you would be using
something like the bandwidth capper in the cpu cgroup to set those
entitlements, right?

Some time has passed since I last looked into it, but IIRC, after you
get are out of your quota, you should be out of the runqueue. In the
lovely world of KVM, we approximate steal time as runqueue time:

arch/x86/kvm/x86.c:
delta = current->sched_info.run_delay - vcpu->arch.st.last_steal;
vcpu->arch.st.last_steal = current->sched_info.run_delay;
vcpu->arch.st.accum_steal = delta;

include/linux/sched.h:
unsigned long long run_delay; /* time spent waiting on a runqueue */

So if you are out of the runqueue, you won't get steal time accounted,
and then I truly fail to understand what you are doing.

In case I am wrong, and run_delay also includes the time you can't run
because you are out of capacity, then maybe what we should do, is to
just subtract it from run_delay in kvm/x86.c before we pass it on. In
summary:


>>>        Alter the amount of steal time reported by the guest.
Maybe this should go away.

>>>        Expand the steal time msr to also contain the consigned time.
Maybe this should go away

>>>        Add the code to send the consigned time from the host to the
>>> guest
This definitely should be heavily modified

>>>        Add a timer to allow the separation of consigned from steal time.
Maybe this should go away

>>>        Add an ioctl to communicate the consign limit to the host.
This definitely should go away.

More specifically, *whatever* way we use to cap the processor, the host
system will have all the information at all times.




^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/5] Alter steal time reporting in KVM
  2012-11-28 20:55     ` Glauber Costa
@ 2012-11-29 17:43       ` Michael Wolf
  2012-12-05 12:46         ` Glauber Costa
  0 siblings, 1 reply; 21+ messages in thread
From: Michael Wolf @ 2012-11-29 17:43 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Marcelo Tosatti, linux-kernel, riel, kvm, peterz, mingo, anthony

On 11/28/2012 02:55 PM, Glauber Costa wrote:
> On 11/28/2012 10:43 PM, Michael Wolf wrote:
>> On 11/27/2012 05:24 PM, Marcelo Tosatti wrote:
>>> On Mon, Nov 26, 2012 at 02:36:24PM -0600, Michael Wolf wrote:
>>>> In the case of where you have a system that is running in a
>>>> capped or overcommitted environment the user may see steal time
>>>> being reported in accounting tools such as top or vmstat.
>>> The definition of stolen time is 'time during which the virtual CPU is
>>> runnable to not running'. Overcommit is the main scenario which steal
>>> time helps to detect.
>>>
>>> Can you describe the 'capped' case?
>> In the capped case, the time that the guest spends waiting due to it
>> having used its full allottment of time shows up as steal time.  The way
>> my patchset currently stands is that you would set up the
>> bandwidth control and you would have to pass it a  matching value from
>> qemu.  In the future, it would
>> be possible to have something parse the bandwidth setting and
>> automatically adjust the setting in the
>> host used for steal time reporting.
> Ok, so correct me if I am wrong, but I believe you would be using
> something like the bandwidth capper in the cpu cgroup to set those
> entitlements, right?
Yes, in the context above I'm referring to the cfs bandwidth control.
>
> Some time has passed since I last looked into it, but IIRC, after you
> get are out of your quota, you should be out of the runqueue. In the
> lovely world of KVM, we approximate steal time as runqueue time:
>
> arch/x86/kvm/x86.c:
> delta = current->sched_info.run_delay - vcpu->arch.st.last_steal;
> vcpu->arch.st.last_steal = current->sched_info.run_delay;
> vcpu->arch.st.accum_steal = delta;
>
> include/linux/sched.h:
> unsigned long long run_delay; /* time spent waiting on a runqueue */
>
> So if you are out of the runqueue, you won't get steal time accounted,
> and then I truly fail to understand what you are doing.
So I looked at something like this in the past.  To make sure things 
haven't changed
I set up a cgroup on my test server running a kernel built from the 
latest tip tree.

[root]# cat cpu.cfs_quota_us
50000
[root]# cat cpu.cfs_period_us
100000
[root]# cat cpuset.cpus
1
[root]# cat cpuset.mems
0

Next I put the PID from the cpu thread into tasks.  When I start a 
script that will hog the cpu I see the
following in top on the guest
Cpu(s):  1.9%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa, 48.3%hi, 0.0%si, 
49.8%st

So the steal time here is in line with the bandwidth control settings.
>
> In case I am wrong, and run_delay also includes the time you can't run
> because you are out of capacity, then maybe what we should do, is to
> just subtract it from run_delay in kvm/x86.c before we pass it on. In
> summary:
About a year ago I was playing with this patch.  It is out of date now 
but will give you
an idea of what I was looking at.

  kernel/sched_fair.c  |    4 ++--
  kernel/sched_stats.h |    7 ++++++-
  2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 5c9e679..a837e4e 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -707,7 +707,7 @@ account_entity_dequeue(struct cfs_rq *cfs_rq,
struct sched_entity *se)

  #ifdef CONFIG_FAIR_GROUP_SCHED
  /* we need this in update_cfs_load and load-balance functions below */
-static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
+inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
  # ifdef CONFIG_SMP
  static void update_cfs_rq_load_contribution(struct cfs_rq *cfs_rq,
                          int global_update)
@@ -1420,7 +1420,7 @@ static inline int cfs_rq_throttled(struct
cfs_rq *cfs_rq)
  }

  /* check whether cfs_rq, or any parent, is throttled */
-static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
+inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
  {
      return cfs_rq->throttle_count;
  }
diff --git a/kernel/sched_stats.h b/kernel/sched_stats.h
index 87f9e36..e30ff26 100644
--- a/kernel/sched_stats.h
+++ b/kernel/sched_stats.h
@@ -213,14 +213,19 @@ static inline void sched_info_queued(struct
task_struct *t)
   * sched_info_queued() to mark that it has now again started waiting on
   * the runqueue.
   */
+extern inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
  static inline void sched_info_depart(struct task_struct *t)
  {
+    struct task_group *tg = task_group(t);
+    struct cfs_rq *cfs_rq;
      unsigned long long delta = task_rq(t)->clock -
                      t->sched_info.last_arrival;

+    cfs_rq = tg->cfs_rq[smp_processor_id()];
      rq_sched_info_depart(task_rq(t), delta);

-    if (t->state == TASK_RUNNING)
+
+    if (t->state == TASK_RUNNING && !throttled_hierarchy(cfs_rq))
          sched_info_queued(t);
  }


So then the steal time did not show on the guest.  You have no value 
that needs to be passed
around.  What I did not like about this approach was
* only works for cfs bandwidth control.  If another type of hard limit 
was added to the kernel
    the code would potentially need to change.
* This approach doesn't help if the limits are set by overcommitting the 
cpus.  It is my understanding
    that this is a common approach.

>
>>>>         Alter the amount of steal time reported by the guest.
> Maybe this should go away.
>
>>>>         Expand the steal time msr to also contain the consigned time.
> Maybe this should go away
>
>>>>         Add the code to send the consigned time from the host to the
>>>> guest
> This definitely should be heavily modified
>
>>>>         Add a timer to allow the separation of consigned from steal time.
> Maybe this should go away
>
>>>>         Add an ioctl to communicate the consign limit to the host.
> This definitely should go away.
>
> More specifically, *whatever* way we use to cap the processor, the host
> system will have all the information at all times.
I'm not understanding that comment.  If you are capping by simply 
controlling the amount of
overcommit on the host then wouldn't you still need some value to 
indicate the desired amount.
If another guest was inadvertently started or something else on the host 
started taking more
processor than expected, you would need to report the steal time.


>
>
>


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/5] Alter steal time reporting in KVM
  2012-11-29 17:43       ` Michael Wolf
@ 2012-12-05 12:46         ` Glauber Costa
  2012-12-07 15:50           ` Michael Wolf
  0 siblings, 1 reply; 21+ messages in thread
From: Glauber Costa @ 2012-12-05 12:46 UTC (permalink / raw)
  To: Michael Wolf
  Cc: Marcelo Tosatti, linux-kernel, riel, kvm, peterz, mingo, anthony

I am deeply sorry.

I was busy first time I read this, so I postponed answering and ended up
forgetting.

Sorry
>>
>> include/linux/sched.h:
>> unsigned long long run_delay; /* time spent waiting on a runqueue */
>>
>> So if you are out of the runqueue, you won't get steal time accounted,
>> and then I truly fail to understand what you are doing.
> So I looked at something like this in the past.  To make sure things
> haven't changed
> I set up a cgroup on my test server running a kernel built from the
> latest tip tree.
> 
> [root]# cat cpu.cfs_quota_us
> 50000
> [root]# cat cpu.cfs_period_us
> 100000
> [root]# cat cpuset.cpus
> 1
> [root]# cat cpuset.mems
> 0
> 
> Next I put the PID from the cpu thread into tasks.  When I start a
> script that will hog the cpu I see the
> following in top on the guest
> Cpu(s):  1.9%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa, 48.3%hi, 0.0%si,
> 49.8%st
> 
> So the steal time here is in line with the bandwidth control settings.

Ok. So I was wrong in my hunch that it would be outside the runqueue,
therefore work automatically. Still, the host kernel has all the
information in cgroups.

> So then the steal time did not show on the guest.  You have no value
> that needs to be passed
> around.  What I did not like about this approach was
> * only works for cfs bandwidth control.  If another type of hard limit
> was added to the kernel
>    the code would potentially need to change.

This is true for almost everything we have in the kernel!
It is *very* unlikely for other bandwidth control mechanism to ever
appear. If it ever does, it's *their* burden to make sure it works for
steal time (provided it is merged). Code in tree gets precedence.

> * This approach doesn't help if the limits are set by overcommitting the
> cpus.  It is my understanding
>    that this is a common approach.
> 

I can't say anything about commonality, but common or not, it is a
*crazy* approach.

When you simply overcommit, you have no way to differentiate between
intended steal time and non-intended steal time. Moreover, when you
overcommit, your cpu usage will vary over time. If two guests use the
cpu to their full power, you will have 50 % each. But if one of them
slows down, the other gets more. What is your entitlement value? How do
you define this?

And then after you define it, you end up using more than this, what is
your cpu usage? 130 %?


The only sane way to do it, is to communicate this value to the kernel
somehow. The bandwidth controller is the interface we have for that. So
everybody that wants to *intentionally* overcommit needs to communicate
this to the controller. IOW: Any sane configuration should be explicit
about your capping.

>>>>>         Add an ioctl to communicate the consign limit to the host.
>> This definitely should go away.
>>
>> More specifically, *whatever* way we use to cap the processor, the host
>> system will have all the information at all times.
> I'm not understanding that comment.  If you are capping by simply
> controlling the amount of
> overcommit on the host then wouldn't you still need some value to
> indicate the desired amount.
No, that is just crazy, and I don't like it a single bit.

So in the light of it: Whatever capping mechanism we have, we need to be
explicit about the expected entitlement. At this point, the kernel
already knows what it is, and needs no extra ioctls or anything like that.




^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/5] Alter steal time reporting in KVM
  2012-12-05 12:46         ` Glauber Costa
@ 2012-12-07 15:50           ` Michael Wolf
  0 siblings, 0 replies; 21+ messages in thread
From: Michael Wolf @ 2012-12-07 15:50 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Marcelo Tosatti, linux-kernel, riel, kvm, peterz, mingo, anthony

On 12/05/2012 06:46 AM, Glauber Costa wrote:
> I am deeply sorry.
>
> I was busy first time I read this, so I postponed answering and ended up
> forgetting.
>
> Sorry
>>> include/linux/sched.h:
>>> unsigned long long run_delay; /* time spent waiting on a runqueue */
>>>
>>> So if you are out of the runqueue, you won't get steal time accounted,
>>> and then I truly fail to understand what you are doing.
>> So I looked at something like this in the past.  To make sure things
>> haven't changed
>> I set up a cgroup on my test server running a kernel built from the
>> latest tip tree.
>>
>> [root]# cat cpu.cfs_quota_us
>> 50000
>> [root]# cat cpu.cfs_period_us
>> 100000
>> [root]# cat cpuset.cpus
>> 1
>> [root]# cat cpuset.mems
>> 0
>>
>> Next I put the PID from the cpu thread into tasks.  When I start a
>> script that will hog the cpu I see the
>> following in top on the guest
>> Cpu(s):  1.9%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa, 48.3%hi, 0.0%si,
>> 49.8%st
>>
>> So the steal time here is in line with the bandwidth control settings.
> Ok. So I was wrong in my hunch that it would be outside the runqueue,
> therefore work automatically. Still, the host kernel has all the
> information in cgroups.
>
>> So then the steal time did not show on the guest.  You have no value
>> that needs to be passed
>> around.  What I did not like about this approach was
>> * only works for cfs bandwidth control.  If another type of hard limit
>> was added to the kernel
>>     the code would potentially need to change.
> This is true for almost everything we have in the kernel!
> It is *very* unlikely for other bandwidth control mechanism to ever
> appear. If it ever does, it's *their* burden to make sure it works for
> steal time (provided it is merged). Code in tree gets precedence.

Ok,  I will work on a patch that uses the cgroup information for 
bandwidth control
to separate out the time.

>
>> * This approach doesn't help if the limits are set by overcommitting the
>> cpus.  It is my understanding
>>     that this is a common approach.
>>
> I can't say anything about commonality, but common or not, it is a
> *crazy* approach.
>
> When you simply overcommit, you have no way to differentiate between
> intended steal time and non-intended steal time. Moreover, when you
> overcommit, your cpu usage will vary over time. If two guests use the
> cpu to their full power, you will have 50 % each. But if one of them
> slows down, the other gets more. What is your entitlement value? How do
> you define this?
>
> And then after you define it, you end up using more than this, what is
> your cpu usage? 130 %?

yes exactly you would ideally show a boosted amount of cpu.  However to 
do that
you would need to either create a new tool or modify the current 
accounting tools
such as top.

My understanding is that you are not capping in this case as much as you 
are
guaranteeing a minimum level of performance.

>
>
> The only sane way to do it, is to communicate this value to the kernel
> somehow. The bandwidth controller is the interface we have for that. So
> everybody that wants to *intentionally* overcommit needs to communicate
> this to the controller. IOW: Any sane configuration should be explicit
> about your capping.
>
>>>>>>          Add an ioctl to communicate the consign limit to the host.
>>> This definitely should go away.
>>>
>>> More specifically, *whatever* way we use to cap the processor, the host
>>> system will have all the information at all times.
>> I'm not understanding that comment.  If you are capping by simply
>> controlling the amount of
>> overcommit on the host then wouldn't you still need some value to
>> indicate the desired amount.
> No, that is just crazy, and I don't like it a single bit.
>
> So in the light of it: Whatever capping mechanism we have, we need to be
> explicit about the expected entitlement. At this point, the kernel
> already knows what it is, and needs no extra ioctls or anything like that.
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 2/5] Expand the steal time msr to also contain the consigned time.
  2012-11-26 21:05 [PATCH 0/5] Alter stealtime " Michael Wolf
@ 2012-11-26 21:05 ` Michael Wolf
  0 siblings, 0 replies; 21+ messages in thread
From: Michael Wolf @ 2012-11-26 21:05 UTC (permalink / raw)
  To: linux-kernel; +Cc: riel, gleb, kvm, peterz, mtosatti, glommer, mingo

Add a consigned field.  This field will hold the time lost due to capping or overcommit.
The rest of the time will still show up in the steal-time field.

Signed-off-by: Michael Wolf <mjw@linux.vnet.ibm.com>
---
 arch/x86/include/asm/paravirt.h       |    4 ++--
 arch/x86/include/asm/paravirt_types.h |    2 +-
 arch/x86/kernel/kvm.c                 |    7 ++-----
 kernel/sched/core.c                   |   10 +++++++++-
 kernel/sched/cputime.c                |    2 +-
 5 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index a0facf3..a5f9f30 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -196,9 +196,9 @@ struct static_key;
 extern struct static_key paravirt_steal_enabled;
 extern struct static_key paravirt_steal_rq_enabled;
 
-static inline u64 paravirt_steal_clock(int cpu)
+static inline u64 paravirt_steal_clock(int cpu, u64 *steal)
 {
-	return PVOP_CALL1(u64, pv_time_ops.steal_clock, cpu);
+	PVOP_VCALL2(pv_time_ops.steal_clock, cpu, steal);
 }
 
 static inline unsigned long long paravirt_read_pmc(int counter)
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 142236e..5d4fc8b 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -95,7 +95,7 @@ struct pv_lazy_ops {
 
 struct pv_time_ops {
 	unsigned long long (*sched_clock)(void);
-	unsigned long long (*steal_clock)(int cpu);
+	void (*steal_clock)(int cpu, unsigned long long *steal);
 	unsigned long (*get_tsc_khz)(void);
 };
 
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 4180a87..ac357b3 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -372,9 +372,8 @@ static struct notifier_block kvm_pv_reboot_nb = {
 	.notifier_call = kvm_pv_reboot_notify,
 };
 
-static u64 kvm_steal_clock(int cpu)
+static void kvm_steal_clock(int cpu, u64 *steal)
 {
-	u64 steal;
 	struct kvm_steal_time *src;
 	int version;
 
@@ -382,11 +381,9 @@ static u64 kvm_steal_clock(int cpu)
 	do {
 		version = src->version;
 		rmb();
-		steal = src->steal;
+		*steal = src->steal;
 		rmb();
 	} while ((version & 1) || (version != src->version));
-
-	return steal;
 }
 
 void kvm_disable_steal_time(void)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c2e077c..b21d92d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -748,6 +748,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
  */
 #if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
 	s64 steal = 0, irq_delta = 0;
+	u64 consigned = 0;
 #endif
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
 	irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;
@@ -776,8 +777,15 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
 #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
 	if (static_key_false((&paravirt_steal_rq_enabled))) {
 		u64 st;
+		u64 cs;
 
-		steal = paravirt_steal_clock(cpu_of(rq));
+		paravirt_steal_clock(cpu_of(rq), &steal, &consigned);
+		/*
+		 * since we are not assigning the steal time to cpustats
+		 * here, just combine the steal and consigned times to
+		 * do the rest of the calculations.
+		 */
+		steal += consigned;
 		steal -= rq->prev_steal_time_rq;
 
 		if (unlikely(steal > delta))
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 8d859da..593b647 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -275,7 +275,7 @@ static __always_inline bool steal_account_process_tick(void)
 	if (static_key_false(&paravirt_steal_enabled)) {
 		u64 steal, st = 0;
 
-		steal = paravirt_steal_clock(smp_processor_id());
+		paravirt_steal_clock(smp_processor_id(), &steal);
 		steal -= this_rq()->prev_steal_time;
 
 		st = steal_ticks(steal);


^ permalink raw reply related	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2012-12-07 15:51 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-11-26 20:36 [PATCH 0/5] Alter steal time reporting in KVM Michael Wolf
2012-11-26 20:36 ` [PATCH 1/5] Alter the amount of steal time reported by the guest Michael Wolf
2012-11-26 20:36 ` [PATCH 2/5] Expand the steal time msr to also contain the consigned time Michael Wolf
2012-11-27 21:03   ` Konrad Rzeszutek Wilk
2012-11-28 15:23     ` Michael Wolf
2012-11-26 20:36 ` [PATCH 3/5] Add the code to send the consigned time from the host to the guest Michael Wolf
2012-11-26 20:37 ` [PATCH 4/5] Add a timer to allow the separation of consigned from steal time Michael Wolf
2012-11-26 20:37 ` [PATCH 5/5] Add an ioctl to communicate the consign limit to the host Michael Wolf
2012-11-27  8:48 ` [PATCH 0/5] Alter steal time reporting in KVM Glauber Costa
2012-11-27 15:10   ` Michael Wolf
2012-11-28  8:45     ` Glauber Costa
2012-11-28 18:44       ` Michael Wolf
2012-11-28 19:16   ` Anthony Liguori
2012-11-27 23:24 ` Marcelo Tosatti
2012-11-28  0:32   ` Marcelo Tosatti
2012-11-28 18:43   ` Michael Wolf
2012-11-28 20:55     ` Glauber Costa
2012-11-29 17:43       ` Michael Wolf
2012-12-05 12:46         ` Glauber Costa
2012-12-07 15:50           ` Michael Wolf
2012-11-26 21:05 [PATCH 0/5] Alter stealtime " Michael Wolf
2012-11-26 21:05 ` [PATCH 2/5] Expand the steal time msr to also contain the consigned time Michael Wolf

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).