* [RFC v2 0/7] kvm stael time implementation @ 2010-08-30 16:06 Glauber Costa 2010-08-30 16:06 ` [RFC v2 1/7] change headers preparing for steal time Glauber Costa ` (2 more replies) 0 siblings, 3 replies; 38+ messages in thread From: Glauber Costa @ 2010-08-30 16:06 UTC (permalink / raw) To: kvm; +Cc: avi, zamsden, mtosatti, riel, peterz, mingo, jeremy Hi, So, this is basically the same as v1, with three major differences: 1) I am posting to lkml for wider audience 2) patch 2/7 fixes one problem I mentined would happen in smp systems, which is, we only update kvmclock when we changes pcup 3) softlockup algorithm is changed. Again, as marcelo pointed out, this is open to discussion, and I am not dropping it so more people can step in. I have some other patches under local test for a slightly modified guest part accounting, and I do somehow support extending the interface, and changing to nsecs (maybe not 100 %, but...). But I am posting in this state so we can have lkml people to step in earlier. Reminder of the previous cover-letter: There are two parts of it: the guest and host part. The proposal for the guest part, is to just change the common time accounting, and try to identify at that spot, wether or not we should account any steal time. I considered this idea less cumbersome that trying to cook a clockevents implementation ourselves, since I see little value in it. I am, however, pretty open to suggestions. For the host<->guest communications, I am using a shared page, in the same way as pvclock. Because of that, I am just hijacking pvclock structure anyway. There is a 32-bit field floating by, that gives us enough room for 8 years of steal time (we use msec resolution). The main idea is to timestamp our exit and entry through sched notifiers, and export the value at pvclock updates. This obviously have some disadvantages: by doing this we are giving up futures ideas about only updating this structure once, and even right now, won't work on pinned-smp (since we don't update pvclock if we haven't changed cpus.) But again, it is just an RFC, and I'd like to feel the reception of the idea as a whole. Have a nice review. Glauber Costa (7): change headers preparing for steal time always call kvm_write_guest measure time out of guest change kernel accounting to include steal time kvm steal time implementation touch softlockup watchdog tell guest about steal time feature arch/x86/include/asm/kvm_host.h | 2 + arch/x86/include/asm/kvm_para.h | 1 + arch/x86/include/asm/pvclock-abi.h | 4 ++- arch/x86/kernel/kvmclock.c | 40 ++++++++++++++++++++++++++++++++++++ arch/x86/kvm/x86.c | 26 ++++++++++++++++++---- include/linux/sched.h | 1 + kernel/sched.c | 29 ++++++++++++++++++++++++++ 7 files changed, 97 insertions(+), 6 deletions(-) ^ permalink raw reply [flat|nested] 38+ messages in thread
* [RFC v2 1/7] change headers preparing for steal time 2010-08-30 16:06 [RFC v2 0/7] kvm stael time implementation Glauber Costa @ 2010-08-30 16:06 ` Glauber Costa 2010-08-30 16:06 ` [RFC 1/8] Implement getnsboottime kernel API Glauber Costa 2010-08-30 16:37 ` [RFC v2 0/7] kvm stael time implementation Peter Zijlstra 2010-08-30 17:20 ` Jeremy Fitzhardinge 2 siblings, 1 reply; 38+ messages in thread From: Glauber Costa @ 2010-08-30 16:06 UTC (permalink / raw) To: kvm; +Cc: avi, zamsden, mtosatti, riel, peterz, mingo, jeremy This guest/host common patch prepares infrastructure for the steal time implementation. Some constants are added, and a name change happens in pvclock vcpu structure. Signed-off-by: Glauber Costa <glommer@redhat.com> --- arch/x86/include/asm/kvm_para.h | 1 + arch/x86/include/asm/pvclock-abi.h | 4 +++- 2 files changed, 4 insertions(+), 1 deletions(-) diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h index 05eba5e..1759c81 100644 --- a/arch/x86/include/asm/kvm_para.h +++ b/arch/x86/include/asm/kvm_para.h @@ -25,6 +25,7 @@ * in pvclock structure. If no bits are set, all flags are ignored. */ #define KVM_FEATURE_CLOCKSOURCE_STABLE_BIT 24 +#define KVM_FEATURE_CLOCKSOURCE_STEAL_BIT 25 #define MSR_KVM_WALL_CLOCK 0x11 #define MSR_KVM_SYSTEM_TIME 0x12 diff --git a/arch/x86/include/asm/pvclock-abi.h b/arch/x86/include/asm/pvclock-abi.h index 35f2d19..417061b 100644 --- a/arch/x86/include/asm/pvclock-abi.h +++ b/arch/x86/include/asm/pvclock-abi.h @@ -24,7 +24,7 @@ struct pvclock_vcpu_time_info { u32 version; - u32 pad0; + u32 steal_time; u64 tsc_timestamp; u64 system_time; u32 tsc_to_system_mul; @@ -40,5 +40,7 @@ struct pvclock_wall_clock { } __attribute__((__packed__)); #define PVCLOCK_TSC_STABLE_BIT (1 << 0) +#define PVCLOCK_STEAL_BIT (2 << 0) + #endif /* __ASSEMBLY__ */ #endif /* _ASM_X86_PVCLOCK_ABI_H */ -- 1.6.2.2 ^ permalink raw reply related [flat|nested] 38+ messages in thread
* [RFC 1/8] Implement getnsboottime kernel API 2010-08-30 16:06 ` [RFC v2 1/7] change headers preparing for steal time Glauber Costa @ 2010-08-30 16:06 ` Glauber Costa 2010-08-30 16:06 ` [RFC v2 2/7] always call kvm_write_guest Glauber Costa 2010-09-01 23:56 ` [RFC 1/8] Implement getnsboottime kernel API Zachary Amsden 0 siblings, 2 replies; 38+ messages in thread From: Glauber Costa @ 2010-08-30 16:06 UTC (permalink / raw) To: kvm; +Cc: avi, zamsden, mtosatti, riel, peterz, mingo, jeremy From: Zachary Amsden <zamsden@redhat.com> Add a kernel call to get the number of nanoseconds since boot. This is generally useful enough to make it a generic call. Signed-off-by: Zachary Amsden <zamsden@redhat.com> --- include/linux/time.h | 1 + kernel/time/timekeeping.c | 27 +++++++++++++++++++++++++++ 2 files changed, 28 insertions(+), 0 deletions(-) diff --git a/include/linux/time.h b/include/linux/time.h index ea3559f..5d04108 100644 --- a/include/linux/time.h +++ b/include/linux/time.h @@ -145,6 +145,7 @@ extern void getnstimeofday(struct timespec *tv); extern void getrawmonotonic(struct timespec *ts); extern void getboottime(struct timespec *ts); extern void monotonic_to_bootbased(struct timespec *ts); +extern s64 getnsboottime(void); extern struct timespec timespec_trunc(struct timespec t, unsigned gran); extern int timekeeping_valid_for_hres(void); diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c index caf8d4d..d250f0a 100644 --- a/kernel/time/timekeeping.c +++ b/kernel/time/timekeeping.c @@ -285,6 +285,33 @@ void ktime_get_ts(struct timespec *ts) } EXPORT_SYMBOL_GPL(ktime_get_ts); + +/** + * getnsboottime - get the bootbased clock in nsec format + * + * The function calculates the bootbased clock from the realtime + * clock and the wall_to_monotonic offset and stores the result + * in normalized timespec format in the variable pointed to by @ts. + */ +s64 getnsboottime(void) +{ + unsigned int seq; + s64 secs, nsecs; + + WARN_ON(timekeeping_suspended); + + do { + seq = read_seqbegin(&xtime_lock); + secs = xtime.tv_sec + wall_to_monotonic.tv_sec; + secs += total_sleep_time.tv_sec; + nsecs = xtime.tv_nsec + wall_to_monotonic.tv_nsec; + nsecs += total_sleep_time.tv_nsec + timekeeping_get_ns(); + + } while (read_seqretry(&xtime_lock, seq)); + return nsecs + (secs * NSEC_PER_SEC); +} +EXPORT_SYMBOL_GPL(getnsboottime); + /** * do_gettimeofday - Returns the time of day in a timeval * @tv: pointer to the timeval to be set -- 1.6.2.2 ^ permalink raw reply related [flat|nested] 38+ messages in thread
* [RFC v2 2/7] always call kvm_write_guest 2010-08-30 16:06 ` [RFC 1/8] Implement getnsboottime kernel API Glauber Costa @ 2010-08-30 16:06 ` Glauber Costa 2010-08-30 16:06 ` [RFC 2/8] change headers preparing for steal time Glauber Costa 2010-09-01 23:56 ` [RFC 1/8] Implement getnsboottime kernel API Zachary Amsden 1 sibling, 1 reply; 38+ messages in thread From: Glauber Costa @ 2010-08-30 16:06 UTC (permalink / raw) To: kvm; +Cc: avi, zamsden, mtosatti, riel, peterz, mingo, jeremy This patch makes sure that kvm_write_guest() is called at every guest enter. We do not, however, want to update all the structure at all times, so we add a flag that basically tells it whether or not to do the whole operation Signed-off-by: Glauber Costa <glommer@redhat.com> --- arch/x86/kvm/x86.c | 10 ++++++++-- 1 files changed, 8 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index e7e3b50..f4b77ea 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -895,7 +895,7 @@ static void kvm_set_time_scale(uint32_t tsc_khz, struct pvclock_vcpu_time_info * static DEFINE_PER_CPU(unsigned long, cpu_tsc_khz); -static void kvm_write_guest_time(struct kvm_vcpu *v) +static void kvm_write_guest_time(struct kvm_vcpu *v, bool update_time) { struct timespec ts; unsigned long flags; @@ -906,6 +906,9 @@ static void kvm_write_guest_time(struct kvm_vcpu *v) if ((!vcpu->time_page)) return; + if (!update_time) + return; + this_tsc_khz = get_cpu_var(cpu_tsc_khz); if (unlikely(vcpu->hv_clock_tsc_khz != this_tsc_khz)) { kvm_set_time_scale(this_tsc_khz, &vcpu->hv_clock); @@ -4671,6 +4674,7 @@ static void kvm_put_guest_xcr0(struct kvm_vcpu *vcpu) static int vcpu_enter_guest(struct kvm_vcpu *vcpu) { int r; + bool update_kvm_time = 0; bool req_int_win = !irqchip_in_kernel(vcpu->kvm) && vcpu->run->request_interrupt_window; @@ -4680,7 +4684,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) if (kvm_check_request(KVM_REQ_MIGRATE_TIMER, vcpu)) __kvm_migrate_timers(vcpu); if (kvm_check_request(KVM_REQ_KVMCLOCK_UPDATE, vcpu)) - kvm_write_guest_time(vcpu); + update_kvm_time = 1; if (kvm_check_request(KVM_REQ_MMU_SYNC, vcpu)) kvm_mmu_sync_roots(vcpu); if (kvm_check_request(KVM_REQ_TLB_FLUSH, vcpu)) @@ -4701,6 +4705,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) } } + kvm_write_guest_time(vcpu, update_kvm_time); + r = kvm_mmu_reload(vcpu); if (unlikely(r)) goto out; -- 1.6.2.2 ^ permalink raw reply related [flat|nested] 38+ messages in thread
* [RFC 2/8] change headers preparing for steal time 2010-08-30 16:06 ` [RFC v2 2/7] always call kvm_write_guest Glauber Costa @ 2010-08-30 16:06 ` Glauber Costa 2010-08-30 16:06 ` [RFC 3/8] always call kvm_write_guest Glauber Costa 0 siblings, 1 reply; 38+ messages in thread From: Glauber Costa @ 2010-08-30 16:06 UTC (permalink / raw) To: kvm; +Cc: avi, zamsden, mtosatti, riel, peterz, mingo, jeremy This guest/host common patch prepares infrastructure for the steal time implementation. Some constants are added, and a name change happens in pvclock vcpu structure. Signed-off-by: Glauber Costa <glommer@redhat.com> --- arch/x86/include/asm/kvm_para.h | 1 + arch/x86/include/asm/pvclock-abi.h | 4 +++- 2 files changed, 4 insertions(+), 1 deletions(-) diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h index 05eba5e..1759c81 100644 --- a/arch/x86/include/asm/kvm_para.h +++ b/arch/x86/include/asm/kvm_para.h @@ -25,6 +25,7 @@ * in pvclock structure. If no bits are set, all flags are ignored. */ #define KVM_FEATURE_CLOCKSOURCE_STABLE_BIT 24 +#define KVM_FEATURE_CLOCKSOURCE_STEAL_BIT 25 #define MSR_KVM_WALL_CLOCK 0x11 #define MSR_KVM_SYSTEM_TIME 0x12 diff --git a/arch/x86/include/asm/pvclock-abi.h b/arch/x86/include/asm/pvclock-abi.h index 35f2d19..417061b 100644 --- a/arch/x86/include/asm/pvclock-abi.h +++ b/arch/x86/include/asm/pvclock-abi.h @@ -24,7 +24,7 @@ struct pvclock_vcpu_time_info { u32 version; - u32 pad0; + u32 steal_time; u64 tsc_timestamp; u64 system_time; u32 tsc_to_system_mul; @@ -40,5 +40,7 @@ struct pvclock_wall_clock { } __attribute__((__packed__)); #define PVCLOCK_TSC_STABLE_BIT (1 << 0) +#define PVCLOCK_STEAL_BIT (2 << 0) + #endif /* __ASSEMBLY__ */ #endif /* _ASM_X86_PVCLOCK_ABI_H */ -- 1.6.2.2 ^ permalink raw reply related [flat|nested] 38+ messages in thread
* [RFC 3/8] always call kvm_write_guest 2010-08-30 16:06 ` [RFC 2/8] change headers preparing for steal time Glauber Costa @ 2010-08-30 16:06 ` Glauber Costa 2010-08-30 16:06 ` [RFC v2 3/7] measure time out of guest Glauber Costa 0 siblings, 1 reply; 38+ messages in thread From: Glauber Costa @ 2010-08-30 16:06 UTC (permalink / raw) To: kvm; +Cc: avi, zamsden, mtosatti, riel, peterz, mingo, jeremy This patch makes sure that kvm_write_guest() is called at every guest enter. We do not, however, want to update all the structure at all times, so we add a flag that basically tells it whether or not to do the whole operation Signed-off-by: Glauber Costa <glommer@redhat.com> --- arch/x86/kvm/x86.c | 10 ++++++++-- 1 files changed, 8 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index e7e3b50..f4b77ea 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -895,7 +895,7 @@ static void kvm_set_time_scale(uint32_t tsc_khz, struct pvclock_vcpu_time_info * static DEFINE_PER_CPU(unsigned long, cpu_tsc_khz); -static void kvm_write_guest_time(struct kvm_vcpu *v) +static void kvm_write_guest_time(struct kvm_vcpu *v, bool update_time) { struct timespec ts; unsigned long flags; @@ -906,6 +906,9 @@ static void kvm_write_guest_time(struct kvm_vcpu *v) if ((!vcpu->time_page)) return; + if (!update_time) + return; + this_tsc_khz = get_cpu_var(cpu_tsc_khz); if (unlikely(vcpu->hv_clock_tsc_khz != this_tsc_khz)) { kvm_set_time_scale(this_tsc_khz, &vcpu->hv_clock); @@ -4671,6 +4674,7 @@ static void kvm_put_guest_xcr0(struct kvm_vcpu *vcpu) static int vcpu_enter_guest(struct kvm_vcpu *vcpu) { int r; + bool update_kvm_time = 0; bool req_int_win = !irqchip_in_kernel(vcpu->kvm) && vcpu->run->request_interrupt_window; @@ -4680,7 +4684,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) if (kvm_check_request(KVM_REQ_MIGRATE_TIMER, vcpu)) __kvm_migrate_timers(vcpu); if (kvm_check_request(KVM_REQ_KVMCLOCK_UPDATE, vcpu)) - kvm_write_guest_time(vcpu); + update_kvm_time = 1; if (kvm_check_request(KVM_REQ_MMU_SYNC, vcpu)) kvm_mmu_sync_roots(vcpu); if (kvm_check_request(KVM_REQ_TLB_FLUSH, vcpu)) @@ -4701,6 +4705,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) } } + kvm_write_guest_time(vcpu, update_kvm_time); + r = kvm_mmu_reload(vcpu); if (unlikely(r)) goto out; -- 1.6.2.2 ^ permalink raw reply related [flat|nested] 38+ messages in thread
* [RFC v2 3/7] measure time out of guest 2010-08-30 16:06 ` [RFC 3/8] always call kvm_write_guest Glauber Costa @ 2010-08-30 16:06 ` Glauber Costa 2010-08-30 16:06 ` [RFC v2 4/7] change kernel accounting to include steal time Glauber Costa 0 siblings, 1 reply; 38+ messages in thread From: Glauber Costa @ 2010-08-30 16:06 UTC (permalink / raw) To: kvm; +Cc: avi, zamsden, mtosatti, riel, peterz, mingo, jeremy By measuring time between a vcpu_put and a vcpu_load, we can estimate how much time did the guest stay out of the cpu. This is exported to the guest at every clock update. Signed-off-by: Glauber Costa <glommer@redhat.com> --- arch/x86/include/asm/kvm_host.h | 2 ++ arch/x86/kvm/x86.c | 13 +++++++++++-- 2 files changed, 13 insertions(+), 2 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 502e53f..bc28aff 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -364,6 +364,8 @@ struct kvm_vcpu_arch { u64 hv_vapic; cpumask_var_t wbinvd_dirty_mask; + u64 time_out; + u64 last_time_out; }; struct kvm_arch { diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index f4b77ea..9d08032 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -906,6 +906,7 @@ static void kvm_write_guest_time(struct kvm_vcpu *v, bool update_time) if ((!vcpu->time_page)) return; + vcpu->hv_clock.steal_time = vcpu->time_out / 1000000; if (!update_time) return; @@ -928,8 +929,7 @@ static void kvm_write_guest_time(struct kvm_vcpu *v, bool update_time) vcpu->hv_clock.system_time = ts.tv_nsec + (NSEC_PER_SEC * (u64)ts.tv_sec) + v->kvm->arch.kvmclock_offset; - vcpu->hv_clock.flags = 0; - + vcpu->hv_clock.flags = PVCLOCK_STEAL_BIT; /* * The interface expects us to write an even number signaling that the * update is finished. Since the guest won't see the intermediate @@ -1801,6 +1801,8 @@ static bool need_emulate_wbinvd(struct kvm_vcpu *vcpu) void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu) { + s64 now; + /* Address WBINVD may be executed by guest */ if (need_emulate_wbinvd(vcpu)) { if (kvm_x86_ops->has_wbinvd_exit()) @@ -1818,12 +1820,19 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu) per_cpu(cpu_tsc_khz, cpu) = khz; } kvm_request_guest_time_update(vcpu); + + now = getnsboottime(); + + if (vcpu->arch.last_time_out != 0) + vcpu->arch.time_out += now - vcpu->arch.last_time_out; } void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu) { kvm_x86_ops->vcpu_put(vcpu); kvm_put_guest_fpu(vcpu); + + vcpu->arch.last_time_out = getnsboottime(); } static int is_efer_nx(void) -- 1.6.2.2 ^ permalink raw reply related [flat|nested] 38+ messages in thread
* [RFC v2 4/7] change kernel accounting to include steal time 2010-08-30 16:06 ` [RFC v2 3/7] measure time out of guest Glauber Costa @ 2010-08-30 16:06 ` Glauber Costa 2010-08-30 16:06 ` [RFC 4/8] measure time out of guest Glauber Costa 2010-08-30 17:30 ` [RFC v2 4/7] " Jeremy Fitzhardinge 0 siblings, 2 replies; 38+ messages in thread From: Glauber Costa @ 2010-08-30 16:06 UTC (permalink / raw) To: kvm; +Cc: avi, zamsden, mtosatti, riel, peterz, mingo, jeremy This patch proposes a common steal time implementation. When no steal time is accounted, we just add a branch to the current accounting code, that shouldn't add much overhead. When we do want to register steal time, we proceed as following: - if we would account user or system time in this tick, and there is out-of-cpu time registered, we skip it altogether, and account steal time only. - if we would account user or system time in this tick, and we got the cpu for the whole slice, we proceed normaly. - if we are idle in this tick, we flush out-of-cpu time to give it the chance to update whatever last-measure internal variable it may have. This approach is simple, but proved to work well for my test scenarios. in a UP guest on UP host, with a cpu-hog in both guest and host shows ~ 50 % steal time. steal time is also accounted proportionally, if nice values are given to the host cpu-hog. A cpu-hog in the host with no load in the guest, produces 0 % steal time, with 100 % idle, as one would expect. Signed-off-by: Glauber Costa <glommer@redhat.com> --- include/linux/sched.h | 1 + kernel/sched.c | 29 +++++++++++++++++++++++++++++ 2 files changed, 30 insertions(+), 0 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 0478888..e571ddd 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -312,6 +312,7 @@ long io_schedule_timeout(long timeout); extern void cpu_init (void); extern void trap_init(void); extern void update_process_times(int user); +extern cputime_t (*hypervisor_steal_time)(void); extern void scheduler_tick(void); extern void sched_show_task(struct task_struct *p); diff --git a/kernel/sched.c b/kernel/sched.c index f52a880..9695c92 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -3157,6 +3157,16 @@ unsigned long long thread_group_sched_runtime(struct task_struct *p) return ns; } +cputime_t (*hypervisor_steal_time)(void) = NULL; + +static inline cputime_t get_steal_time_from_hypervisor(void) +{ + if (!hypervisor_steal_time) + return 0; + return hypervisor_steal_time(); +} + + /* * Account user cpu time to a process. * @p: the process that the cpu time gets accounted to @@ -3169,6 +3179,12 @@ void account_user_time(struct task_struct *p, cputime_t cputime, struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat; cputime64_t tmp; + tmp = get_steal_time_from_hypervisor(); + if (tmp) { + account_steal_time(tmp); + return; + } + /* Add user time to process. */ p->utime = cputime_add(p->utime, cputime); p->utimescaled = cputime_add(p->utimescaled, cputime_scaled); @@ -3234,6 +3250,12 @@ void account_system_time(struct task_struct *p, int hardirq_offset, return; } + tmp = get_steal_time_from_hypervisor(); + if (tmp) { + account_steal_time(tmp); + return; + } + /* Add system time to process. */ p->stime = cputime_add(p->stime, cputime); p->stimescaled = cputime_add(p->stimescaled, cputime_scaled); @@ -3276,6 +3298,13 @@ void account_idle_time(cputime_t cputime) cputime64_t cputime64 = cputime_to_cputime64(cputime); struct rq *rq = this_rq(); + /* + * if we're idle, we don't account it as steal time, since we did + * not want to run anyway. We do call the steal function, however, to + * give the guest the chance to flush its internal buffers + */ + get_steal_time_from_hypervisor(); + if (atomic_read(&rq->nr_iowait) > 0) cpustat->iowait = cputime64_add(cpustat->iowait, cputime64); else -- 1.6.2.2 ^ permalink raw reply related [flat|nested] 38+ messages in thread
* [RFC 4/8] measure time out of guest 2010-08-30 16:06 ` [RFC v2 4/7] change kernel accounting to include steal time Glauber Costa @ 2010-08-30 16:06 ` Glauber Costa 2010-08-30 16:06 ` [RFC 5/8] change kernel accounting to include steal time Glauber Costa 2010-08-30 17:30 ` [RFC v2 4/7] " Jeremy Fitzhardinge 1 sibling, 1 reply; 38+ messages in thread From: Glauber Costa @ 2010-08-30 16:06 UTC (permalink / raw) To: kvm; +Cc: avi, zamsden, mtosatti, riel, peterz, mingo, jeremy By measuring time between a vcpu_put and a vcpu_load, we can estimate how much time did the guest stay out of the cpu. This is exported to the guest at every clock update. Signed-off-by: Glauber Costa <glommer@redhat.com> --- arch/x86/include/asm/kvm_host.h | 2 ++ arch/x86/kvm/x86.c | 13 +++++++++++-- 2 files changed, 13 insertions(+), 2 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 502e53f..bc28aff 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -364,6 +364,8 @@ struct kvm_vcpu_arch { u64 hv_vapic; cpumask_var_t wbinvd_dirty_mask; + u64 time_out; + u64 last_time_out; }; struct kvm_arch { diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index f4b77ea..9d08032 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -906,6 +906,7 @@ static void kvm_write_guest_time(struct kvm_vcpu *v, bool update_time) if ((!vcpu->time_page)) return; + vcpu->hv_clock.steal_time = vcpu->time_out / 1000000; if (!update_time) return; @@ -928,8 +929,7 @@ static void kvm_write_guest_time(struct kvm_vcpu *v, bool update_time) vcpu->hv_clock.system_time = ts.tv_nsec + (NSEC_PER_SEC * (u64)ts.tv_sec) + v->kvm->arch.kvmclock_offset; - vcpu->hv_clock.flags = 0; - + vcpu->hv_clock.flags = PVCLOCK_STEAL_BIT; /* * The interface expects us to write an even number signaling that the * update is finished. Since the guest won't see the intermediate @@ -1801,6 +1801,8 @@ static bool need_emulate_wbinvd(struct kvm_vcpu *vcpu) void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu) { + s64 now; + /* Address WBINVD may be executed by guest */ if (need_emulate_wbinvd(vcpu)) { if (kvm_x86_ops->has_wbinvd_exit()) @@ -1818,12 +1820,19 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu) per_cpu(cpu_tsc_khz, cpu) = khz; } kvm_request_guest_time_update(vcpu); + + now = getnsboottime(); + + if (vcpu->arch.last_time_out != 0) + vcpu->arch.time_out += now - vcpu->arch.last_time_out; } void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu) { kvm_x86_ops->vcpu_put(vcpu); kvm_put_guest_fpu(vcpu); + + vcpu->arch.last_time_out = getnsboottime(); } static int is_efer_nx(void) -- 1.6.2.2 ^ permalink raw reply related [flat|nested] 38+ messages in thread
* [RFC 5/8] change kernel accounting to include steal time 2010-08-30 16:06 ` [RFC 4/8] measure time out of guest Glauber Costa @ 2010-08-30 16:06 ` Glauber Costa 2010-08-30 16:06 ` [RFC v2 5/7] kvm steal time implementation Glauber Costa 2010-08-30 16:46 ` [RFC 5/8] change kernel accounting to include steal time Peter Zijlstra 0 siblings, 2 replies; 38+ messages in thread From: Glauber Costa @ 2010-08-30 16:06 UTC (permalink / raw) To: kvm; +Cc: avi, zamsden, mtosatti, riel, peterz, mingo, jeremy This patch proposes a common steal time implementation. When no steal time is accounted, we just add a branch to the current accounting code, that shouldn't add much overhead. When we do want to register steal time, we proceed as following: - if we would account user or system time in this tick, and there is out-of-cpu time registered, we skip it altogether, and account steal time only. - if we would account user or system time in this tick, and we got the cpu for the whole slice, we proceed normaly. - if we are idle in this tick, we flush out-of-cpu time to give it the chance to update whatever last-measure internal variable it may have. This approach is simple, but proved to work well for my test scenarios. in a UP guest on UP host, with a cpu-hog in both guest and host shows ~ 50 % steal time. steal time is also accounted proportionally, if nice values are given to the host cpu-hog. A cpu-hog in the host with no load in the guest, produces 0 % steal time, with 100 % idle, as one would expect. Signed-off-by: Glauber Costa <glommer@redhat.com> --- include/linux/sched.h | 1 + kernel/sched.c | 29 +++++++++++++++++++++++++++++ 2 files changed, 30 insertions(+), 0 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 0478888..e571ddd 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -312,6 +312,7 @@ long io_schedule_timeout(long timeout); extern void cpu_init (void); extern void trap_init(void); extern void update_process_times(int user); +extern cputime_t (*hypervisor_steal_time)(void); extern void scheduler_tick(void); extern void sched_show_task(struct task_struct *p); diff --git a/kernel/sched.c b/kernel/sched.c index f52a880..9695c92 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -3157,6 +3157,16 @@ unsigned long long thread_group_sched_runtime(struct task_struct *p) return ns; } +cputime_t (*hypervisor_steal_time)(void) = NULL; + +static inline cputime_t get_steal_time_from_hypervisor(void) +{ + if (!hypervisor_steal_time) + return 0; + return hypervisor_steal_time(); +} + + /* * Account user cpu time to a process. * @p: the process that the cpu time gets accounted to @@ -3169,6 +3179,12 @@ void account_user_time(struct task_struct *p, cputime_t cputime, struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat; cputime64_t tmp; + tmp = get_steal_time_from_hypervisor(); + if (tmp) { + account_steal_time(tmp); + return; + } + /* Add user time to process. */ p->utime = cputime_add(p->utime, cputime); p->utimescaled = cputime_add(p->utimescaled, cputime_scaled); @@ -3234,6 +3250,12 @@ void account_system_time(struct task_struct *p, int hardirq_offset, return; } + tmp = get_steal_time_from_hypervisor(); + if (tmp) { + account_steal_time(tmp); + return; + } + /* Add system time to process. */ p->stime = cputime_add(p->stime, cputime); p->stimescaled = cputime_add(p->stimescaled, cputime_scaled); @@ -3276,6 +3298,13 @@ void account_idle_time(cputime_t cputime) cputime64_t cputime64 = cputime_to_cputime64(cputime); struct rq *rq = this_rq(); + /* + * if we're idle, we don't account it as steal time, since we did + * not want to run anyway. We do call the steal function, however, to + * give the guest the chance to flush its internal buffers + */ + get_steal_time_from_hypervisor(); + if (atomic_read(&rq->nr_iowait) > 0) cpustat->iowait = cputime64_add(cpustat->iowait, cputime64); else -- 1.6.2.2 ^ permalink raw reply related [flat|nested] 38+ messages in thread
* [RFC v2 5/7] kvm steal time implementation 2010-08-30 16:06 ` [RFC 5/8] change kernel accounting to include steal time Glauber Costa @ 2010-08-30 16:06 ` Glauber Costa 2010-08-30 16:06 ` [RFC 6/8] " Glauber Costa 2010-08-30 16:46 ` [RFC 5/8] change kernel accounting to include steal time Peter Zijlstra 1 sibling, 1 reply; 38+ messages in thread From: Glauber Costa @ 2010-08-30 16:06 UTC (permalink / raw) To: kvm; +Cc: avi, zamsden, mtosatti, riel, peterz, mingo, jeremy This is the proposed kvm-side steal time implementation. It is migration safe, as it checks flags at every read. Signed-off-by: Glauber Costa <glommer@redhat.com> --- arch/x86/kernel/kvmclock.c | 35 +++++++++++++++++++++++++++++++++++ 1 files changed, 35 insertions(+), 0 deletions(-) diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c index eb9b76c..a1f4852 100644 --- a/arch/x86/kernel/kvmclock.c +++ b/arch/x86/kernel/kvmclock.c @@ -18,6 +18,8 @@ #include <linux/clocksource.h> #include <linux/kvm_para.h> +#include <linux/kernel_stat.h> +#include <linux/sched.h> #include <asm/pvclock.h> #include <asm/msr.h> #include <asm/apic.h> @@ -41,6 +43,7 @@ early_param("no-kvmclock", parse_no_kvmclock); /* The hypervisor will put information about time periodically here */ static DEFINE_PER_CPU_SHARED_ALIGNED(struct pvclock_vcpu_time_info, hv_clock); +static DEFINE_PER_CPU(u64, steal_info); static struct pvclock_wall_clock wall_clock; /* @@ -82,6 +85,32 @@ static cycle_t kvm_clock_read(void) return ret; } +static DEFINE_PER_CPU(u64, steal_info); + +cputime_t kvm_get_steal_time(void) +{ + u64 delta = 0; + u64 *last_steal_info, this_steal_info; + struct pvclock_vcpu_time_info *src; + + src = &get_cpu_var(hv_clock); + if (!(src->flags & PVCLOCK_STEAL_BIT)) + goto out; + + this_steal_info = src->steal_time; + put_cpu_var(hv_clock); + + last_steal_info = &get_cpu_var(steal_info); + + delta = this_steal_info - *last_steal_info; + + *last_steal_info = this_steal_info; + put_cpu_var(steal_info); + +out: + return msecs_to_cputime(delta); +} + static cycle_t kvm_clock_get_cycles(struct clocksource *cs) { return kvm_clock_read(); @@ -134,6 +163,8 @@ static int kvm_register_clock(char *txt) printk(KERN_INFO "kvm-clock: cpu %d, msr %x:%x, %s\n", cpu, high, low, txt); + per_cpu(steal_info, cpu) = 0; + return native_write_msr_safe(msr_kvm_system_time, low, high); } @@ -218,4 +249,8 @@ void __init kvmclock_init(void) if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT)) pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT); + + + if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STEAL_BIT)) + hypervisor_steal_time = kvm_get_steal_time; } -- 1.6.2.2 ^ permalink raw reply related [flat|nested] 38+ messages in thread
* [RFC 6/8] kvm steal time implementation 2010-08-30 16:06 ` [RFC v2 5/7] kvm steal time implementation Glauber Costa @ 2010-08-30 16:06 ` Glauber Costa 2010-08-30 16:06 ` [RFC v2 6/7] touch softlockup watchdog Glauber Costa 0 siblings, 1 reply; 38+ messages in thread From: Glauber Costa @ 2010-08-30 16:06 UTC (permalink / raw) To: kvm; +Cc: avi, zamsden, mtosatti, riel, peterz, mingo, jeremy This is the proposed kvm-side steal time implementation. It is migration safe, as it checks flags at every read. Signed-off-by: Glauber Costa <glommer@redhat.com> --- arch/x86/kernel/kvmclock.c | 35 +++++++++++++++++++++++++++++++++++ 1 files changed, 35 insertions(+), 0 deletions(-) diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c index eb9b76c..a1f4852 100644 --- a/arch/x86/kernel/kvmclock.c +++ b/arch/x86/kernel/kvmclock.c @@ -18,6 +18,8 @@ #include <linux/clocksource.h> #include <linux/kvm_para.h> +#include <linux/kernel_stat.h> +#include <linux/sched.h> #include <asm/pvclock.h> #include <asm/msr.h> #include <asm/apic.h> @@ -41,6 +43,7 @@ early_param("no-kvmclock", parse_no_kvmclock); /* The hypervisor will put information about time periodically here */ static DEFINE_PER_CPU_SHARED_ALIGNED(struct pvclock_vcpu_time_info, hv_clock); +static DEFINE_PER_CPU(u64, steal_info); static struct pvclock_wall_clock wall_clock; /* @@ -82,6 +85,32 @@ static cycle_t kvm_clock_read(void) return ret; } +static DEFINE_PER_CPU(u64, steal_info); + +cputime_t kvm_get_steal_time(void) +{ + u64 delta = 0; + u64 *last_steal_info, this_steal_info; + struct pvclock_vcpu_time_info *src; + + src = &get_cpu_var(hv_clock); + if (!(src->flags & PVCLOCK_STEAL_BIT)) + goto out; + + this_steal_info = src->steal_time; + put_cpu_var(hv_clock); + + last_steal_info = &get_cpu_var(steal_info); + + delta = this_steal_info - *last_steal_info; + + *last_steal_info = this_steal_info; + put_cpu_var(steal_info); + +out: + return msecs_to_cputime(delta); +} + static cycle_t kvm_clock_get_cycles(struct clocksource *cs) { return kvm_clock_read(); @@ -134,6 +163,8 @@ static int kvm_register_clock(char *txt) printk(KERN_INFO "kvm-clock: cpu %d, msr %x:%x, %s\n", cpu, high, low, txt); + per_cpu(steal_info, cpu) = 0; + return native_write_msr_safe(msr_kvm_system_time, low, high); } @@ -218,4 +249,8 @@ void __init kvmclock_init(void) if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT)) pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT); + + + if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STEAL_BIT)) + hypervisor_steal_time = kvm_get_steal_time; } -- 1.6.2.2 ^ permalink raw reply related [flat|nested] 38+ messages in thread
* [RFC v2 6/7] touch softlockup watchdog 2010-08-30 16:06 ` [RFC 6/8] " Glauber Costa @ 2010-08-30 16:06 ` Glauber Costa 2010-08-30 16:06 ` [RFC v2 7/7] tell guest about steal time feature Glauber Costa 2010-08-30 17:33 ` [RFC v2 6/7] touch softlockup watchdog Jeremy Fitzhardinge 0 siblings, 2 replies; 38+ messages in thread From: Glauber Costa @ 2010-08-30 16:06 UTC (permalink / raw) To: kvm; +Cc: avi, zamsden, mtosatti, riel, peterz, mingo, jeremy With a reliable steal time mechanism, we can tell if we're out of the cpu for very long, differentiating from the case that we simply got a real softlockup. In the case we were out of cpu, the watchdog is fed, making bogus softlockups disappear. Signed-off-by: Glauber Costa <glommer@redhat.com> --- arch/x86/kernel/kvmclock.c | 5 +++++ 1 files changed, 5 insertions(+), 0 deletions(-) diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c index a1f4852..d217475 100644 --- a/arch/x86/kernel/kvmclock.c +++ b/arch/x86/kernel/kvmclock.c @@ -91,6 +91,7 @@ cputime_t kvm_get_steal_time(void) { u64 delta = 0; u64 *last_steal_info, this_steal_info; + int touch_wd; struct pvclock_vcpu_time_info *src; src = &get_cpu_var(hv_clock); @@ -104,6 +105,10 @@ cputime_t kvm_get_steal_time(void) delta = this_steal_info - *last_steal_info; + touch_wd = softlockup_thresh * 1000UL; + if ((touch_wd > 0) && (delta > touch_wd)) + touch_softlockup_watchdog(); + *last_steal_info = this_steal_info; put_cpu_var(steal_info); -- 1.6.2.2 ^ permalink raw reply related [flat|nested] 38+ messages in thread
* [RFC v2 7/7] tell guest about steal time feature 2010-08-30 16:06 ` [RFC v2 6/7] touch softlockup watchdog Glauber Costa @ 2010-08-30 16:06 ` Glauber Costa 2010-08-30 16:06 ` [RFC 7/8] touch softlockup watchdog Glauber Costa 2010-08-30 17:33 ` [RFC v2 6/7] touch softlockup watchdog Jeremy Fitzhardinge 1 sibling, 1 reply; 38+ messages in thread From: Glauber Costa @ 2010-08-30 16:06 UTC (permalink / raw) To: kvm; +Cc: avi, zamsden, mtosatti, riel, peterz, mingo, jeremy Guest kernel will only activate steal time if the host exports it. Warn the guest about it. --- arch/x86/kvm/x86.c | 3 ++- 1 files changed, 2 insertions(+), 1 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 9d08032..f6a2d74 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -2103,7 +2103,8 @@ static void do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function, entry->eax = (1 << KVM_FEATURE_CLOCKSOURCE) | (1 << KVM_FEATURE_NOP_IO_DELAY) | (1 << KVM_FEATURE_CLOCKSOURCE2) | - (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT); + (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT) | + (1 << KVM_FEATURE_CLOCKSOURCE_STEAL_BIT); entry->ebx = 0; entry->ecx = 0; entry->edx = 0; -- 1.6.2.2 ^ permalink raw reply related [flat|nested] 38+ messages in thread
* [RFC 7/8] touch softlockup watchdog 2010-08-30 16:06 ` [RFC v2 7/7] tell guest about steal time feature Glauber Costa @ 2010-08-30 16:06 ` Glauber Costa 2010-08-30 16:06 ` [RFC 8/8] tell guest about steal time feature Glauber Costa 0 siblings, 1 reply; 38+ messages in thread From: Glauber Costa @ 2010-08-30 16:06 UTC (permalink / raw) To: kvm; +Cc: avi, zamsden, mtosatti, riel, peterz, mingo, jeremy With a reliable steal time mechanism, we can tell if we're out of the cpu for very long, differentiating from the case that we simply got a real softlockup. In the case we were out of cpu, the watchdog is fed, making bogus softlockups disappear. Signed-off-by: Glauber Costa <glommer@redhat.com> --- arch/x86/kernel/kvmclock.c | 5 +++++ 1 files changed, 5 insertions(+), 0 deletions(-) diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c index a1f4852..d217475 100644 --- a/arch/x86/kernel/kvmclock.c +++ b/arch/x86/kernel/kvmclock.c @@ -91,6 +91,7 @@ cputime_t kvm_get_steal_time(void) { u64 delta = 0; u64 *last_steal_info, this_steal_info; + int touch_wd; struct pvclock_vcpu_time_info *src; src = &get_cpu_var(hv_clock); @@ -104,6 +105,10 @@ cputime_t kvm_get_steal_time(void) delta = this_steal_info - *last_steal_info; + touch_wd = softlockup_thresh * 1000UL; + if ((touch_wd > 0) && (delta > touch_wd)) + touch_softlockup_watchdog(); + *last_steal_info = this_steal_info; put_cpu_var(steal_info); -- 1.6.2.2 ^ permalink raw reply related [flat|nested] 38+ messages in thread
* [RFC 8/8] tell guest about steal time feature 2010-08-30 16:06 ` [RFC 7/8] touch softlockup watchdog Glauber Costa @ 2010-08-30 16:06 ` Glauber Costa 0 siblings, 0 replies; 38+ messages in thread From: Glauber Costa @ 2010-08-30 16:06 UTC (permalink / raw) To: kvm; +Cc: avi, zamsden, mtosatti, riel, peterz, mingo, jeremy Guest kernel will only activate steal time if the host exports it. Warn the guest about it. --- arch/x86/kvm/x86.c | 3 ++- 1 files changed, 2 insertions(+), 1 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 9d08032..f6a2d74 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -2103,7 +2103,8 @@ static void do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function, entry->eax = (1 << KVM_FEATURE_CLOCKSOURCE) | (1 << KVM_FEATURE_NOP_IO_DELAY) | (1 << KVM_FEATURE_CLOCKSOURCE2) | - (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT); + (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT) | + (1 << KVM_FEATURE_CLOCKSOURCE_STEAL_BIT); entry->ebx = 0; entry->ecx = 0; entry->edx = 0; -- 1.6.2.2 ^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [RFC v2 6/7] touch softlockup watchdog 2010-08-30 16:06 ` [RFC v2 6/7] touch softlockup watchdog Glauber Costa 2010-08-30 16:06 ` [RFC v2 7/7] tell guest about steal time feature Glauber Costa @ 2010-08-30 17:33 ` Jeremy Fitzhardinge 2010-08-30 18:07 ` Glauber Costa 1 sibling, 1 reply; 38+ messages in thread From: Jeremy Fitzhardinge @ 2010-08-30 17:33 UTC (permalink / raw) To: Glauber Costa; +Cc: kvm, avi, zamsden, mtosatti, riel, peterz, mingo On 08/30/2010 09:06 AM, Glauber Costa wrote: > With a reliable steal time mechanism, we can tell if we're > out of the cpu for very long, differentiating from the case > that we simply got a real softlockup. > > In the case we were out of cpu, the watchdog is fed, making > bogus softlockups disappear. Why not change the softlockup to measure unstolen time rather than explicitly poking it from a hypervisor-specific function? How is touching it in kvm_get_steal_time() correct anyway? J > Signed-off-by: Glauber Costa <glommer@redhat.com> > --- > arch/x86/kernel/kvmclock.c | 5 +++++ > 1 files changed, 5 insertions(+), 0 deletions(-) > > diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c > index a1f4852..d217475 100644 > --- a/arch/x86/kernel/kvmclock.c > +++ b/arch/x86/kernel/kvmclock.c > @@ -91,6 +91,7 @@ cputime_t kvm_get_steal_time(void) > { > u64 delta = 0; > u64 *last_steal_info, this_steal_info; > + int touch_wd; > struct pvclock_vcpu_time_info *src; > > src = &get_cpu_var(hv_clock); > @@ -104,6 +105,10 @@ cputime_t kvm_get_steal_time(void) > > delta = this_steal_info - *last_steal_info; > > + touch_wd = softlockup_thresh * 1000UL; > + if ((touch_wd > 0) && (delta > touch_wd)) > + touch_softlockup_watchdog(); > + > *last_steal_info = this_steal_info; > put_cpu_var(steal_info); > ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [RFC v2 6/7] touch softlockup watchdog 2010-08-30 17:33 ` [RFC v2 6/7] touch softlockup watchdog Jeremy Fitzhardinge @ 2010-08-30 18:07 ` Glauber Costa 0 siblings, 0 replies; 38+ messages in thread From: Glauber Costa @ 2010-08-30 18:07 UTC (permalink / raw) To: Jeremy Fitzhardinge; +Cc: kvm, avi, zamsden, mtosatti, riel, peterz, mingo On Mon, Aug 30, 2010 at 10:33:26AM -0700, Jeremy Fitzhardinge wrote: > On 08/30/2010 09:06 AM, Glauber Costa wrote: > > With a reliable steal time mechanism, we can tell if we're > > out of the cpu for very long, differentiating from the case > > that we simply got a real softlockup. > > > > In the case we were out of cpu, the watchdog is fed, making > > bogus softlockups disappear. > > Why not change the softlockup to measure unstolen time rather than > explicitly poking it from a hypervisor-specific function? > Ok, I like your idea much better. Thanks! ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [RFC 5/8] change kernel accounting to include steal time 2010-08-30 16:06 ` [RFC 5/8] change kernel accounting to include steal time Glauber Costa 2010-08-30 16:06 ` [RFC v2 5/7] kvm steal time implementation Glauber Costa @ 2010-08-30 16:46 ` Peter Zijlstra 2010-08-30 17:26 ` Glauber Costa 1 sibling, 1 reply; 38+ messages in thread From: Peter Zijlstra @ 2010-08-30 16:46 UTC (permalink / raw) To: Glauber Costa; +Cc: kvm, avi, zamsden, mtosatti, riel, mingo, jeremy On Mon, 2010-08-30 at 12:06 -0400, Glauber Costa wrote: > This patch proposes a common steal time implementation. When no > steal time is accounted, we just add a branch to the current > accounting code, that shouldn't add much overhead. But why not use alternative code and avoid that conditional all-together? > * Account user cpu time to a process. > * @p: the process that the cpu time gets accounted to > @@ -3169,6 +3179,12 @@ void account_user_time(struct task_struct *p, cputime_t cputime, > struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat; > cputime64_t tmp; > > + tmp = get_steal_time_from_hypervisor(); > + if (tmp) { > + account_steal_time(tmp); > + return; > + } > + > /* Add user time to process. */ > p->utime = cputime_add(p->utime, cputime); > p->utimescaled = cputime_add(p->utimescaled, cputime_scaled); > @@ -3234,6 +3250,12 @@ void account_system_time(struct task_struct *p, int hardirq_offset, > return; > } > > + tmp = get_steal_time_from_hypervisor(); > + if (tmp) { > + account_steal_time(tmp); > + return; > + } > + > /* Add system time to process. */ > p->stime = cputime_add(p->stime, cputime); > p->stimescaled = cputime_add(p->stimescaled, cputime_scaled); Why replicate that code and not restructure account_process_tick()? ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [RFC 5/8] change kernel accounting to include steal time 2010-08-30 16:46 ` [RFC 5/8] change kernel accounting to include steal time Peter Zijlstra @ 2010-08-30 17:26 ` Glauber Costa 0 siblings, 0 replies; 38+ messages in thread From: Glauber Costa @ 2010-08-30 17:26 UTC (permalink / raw) To: Peter Zijlstra; +Cc: kvm, avi, zamsden, mtosatti, riel, mingo, jeremy On Mon, Aug 30, 2010 at 06:46:25PM +0200, Peter Zijlstra wrote: > On Mon, 2010-08-30 at 12:06 -0400, Glauber Costa wrote: > > This patch proposes a common steal time implementation. When no > > steal time is accounted, we just add a branch to the current > > accounting code, that shouldn't add much overhead. > > But why not use alternative code and avoid that conditional > all-together? > > > * Account user cpu time to a process. > > * @p: the process that the cpu time gets accounted to > > @@ -3169,6 +3179,12 @@ void account_user_time(struct task_struct *p, cputime_t cputime, > > struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat; > > cputime64_t tmp; > > > > + tmp = get_steal_time_from_hypervisor(); > > + if (tmp) { > > + account_steal_time(tmp); > > + return; > > + } > > + > > /* Add user time to process. */ > > p->utime = cputime_add(p->utime, cputime); > > p->utimescaled = cputime_add(p->utimescaled, cputime_scaled); > > @@ -3234,6 +3250,12 @@ void account_system_time(struct task_struct *p, int hardirq_offset, > > return; > > } > > > > + tmp = get_steal_time_from_hypervisor(); > > + if (tmp) { > > + account_steal_time(tmp); > > + return; > > + } > > + > > /* Add system time to process. */ > > p->stime = cputime_add(p->stime, cputime); > > p->stimescaled = cputime_add(p->stimescaled, cputime_scaled); > > Why replicate that code and not restructure account_process_tick()? For both: because this way is easier, and this is just an rfc, so I am mainly interested in getting convergence on the design, or leaving this behind altogether. I do agree with both your comments, btw ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [RFC v2 4/7] change kernel accounting to include steal time 2010-08-30 16:06 ` [RFC v2 4/7] change kernel accounting to include steal time Glauber Costa 2010-08-30 16:06 ` [RFC 4/8] measure time out of guest Glauber Costa @ 2010-08-30 17:30 ` Jeremy Fitzhardinge 2010-08-30 18:39 ` Rik van Riel 1 sibling, 1 reply; 38+ messages in thread From: Jeremy Fitzhardinge @ 2010-08-30 17:30 UTC (permalink / raw) To: Glauber Costa; +Cc: kvm, avi, zamsden, mtosatti, riel, peterz, mingo On 08/30/2010 09:06 AM, Glauber Costa wrote: > This patch proposes a common steal time implementation. When no > steal time is accounted, we just add a branch to the current > accounting code, that shouldn't add much overhead. How is stolen time logically any different from a CPU running slowly due to HT or power management? Is it worth trying to handle them in the same way? (I'm mostly picking on the "_from_hypervisor" part, since that seems over-specific.) Why not have a get_unstolen_time() function which just returns sched_clock() in the normal case, but can return less? > When we do want to register steal time, we proceed as following: > - if we would account user or system time in this tick, and there is > out-of-cpu time registered, we skip it altogether, and account steal > time only. > - if we would account user or system time in this tick, and we got the > cpu for the whole slice, we proceed normaly. > - if we are idle in this tick, we flush out-of-cpu time to give it the > chance to update whatever last-measure internal variable it may have. > > This approach is simple, but proved to work well for my test scenarios. > in a UP guest on UP host, with a cpu-hog in both guest and host shows > ~ 50 % steal time. steal time is also accounted proportionally, if > nice values are given to the host cpu-hog. > > A cpu-hog in the host with no load in the guest, produces 0 % steal time, > with 100 % idle, as one would expect. > > Signed-off-by: Glauber Costa <glommer@redhat.com> > --- > include/linux/sched.h | 1 + > kernel/sched.c | 29 +++++++++++++++++++++++++++++ > 2 files changed, 30 insertions(+), 0 deletions(-) > > diff --git a/include/linux/sched.h b/include/linux/sched.h > index 0478888..e571ddd 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -312,6 +312,7 @@ long io_schedule_timeout(long timeout); > extern void cpu_init (void); > extern void trap_init(void); > extern void update_process_times(int user); > +extern cputime_t (*hypervisor_steal_time)(void); > extern void scheduler_tick(void); > > extern void sched_show_task(struct task_struct *p); > diff --git a/kernel/sched.c b/kernel/sched.c > index f52a880..9695c92 100644 > --- a/kernel/sched.c > +++ b/kernel/sched.c > @@ -3157,6 +3157,16 @@ unsigned long long thread_group_sched_runtime(struct task_struct *p) > return ns; > } > > +cputime_t (*hypervisor_steal_time)(void) = NULL; > + > +static inline cputime_t get_steal_time_from_hypervisor(void) > +{ > + if (!hypervisor_steal_time) > + return 0; > + return hypervisor_steal_time(); > +} > + > + > /* > * Account user cpu time to a process. > * @p: the process that the cpu time gets accounted to > @@ -3169,6 +3179,12 @@ void account_user_time(struct task_struct *p, cputime_t cputime, > struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat; > cputime64_t tmp; > > + tmp = get_steal_time_from_hypervisor(); > + if (tmp) { > + account_steal_time(tmp); > + return; > + } Is that all? Does the scheduler use account_steal_time() to adjust its scheduling decisions, or is it just something that gets shown to users? I thought just the latter. But if all you're doing is calling account_steal_time(), why bother with all this get_steal_time_from_hypervisor() stuff? The hypervisor-specific code can just call account_steal_time() directly. > + > /* Add user time to process. */ > p->utime = cputime_add(p->utime, cputime); > p->utimescaled = cputime_add(p->utimescaled, cputime_scaled); > @@ -3234,6 +3250,12 @@ void account_system_time(struct task_struct *p, int hardirq_offset, > return; > } > > + tmp = get_steal_time_from_hypervisor(); > + if (tmp) { > + account_steal_time(tmp); > + return; > + } > + > /* Add system time to process. */ > p->stime = cputime_add(p->stime, cputime); > p->stimescaled = cputime_add(p->stimescaled, cputime_scaled); > @@ -3276,6 +3298,13 @@ void account_idle_time(cputime_t cputime) > cputime64_t cputime64 = cputime_to_cputime64(cputime); > struct rq *rq = this_rq(); > > + /* > + * if we're idle, we don't account it as steal time, since we did > + * not want to run anyway. We do call the steal function, however, to > + * give the guest the chance to flush its internal buffers > + */ > + get_steal_time_from_hypervisor(); Eh? This doesn't make much sense. What side-effects is get_steal_time_from_hypervisor() expected to have? If there's some hypervisor-specific implementation detail, why not wrap that up in a specific function rather than overloading this one? J ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [RFC v2 4/7] change kernel accounting to include steal time 2010-08-30 17:30 ` [RFC v2 4/7] " Jeremy Fitzhardinge @ 2010-08-30 18:39 ` Rik van Riel 2010-08-30 19:07 ` Jeremy Fitzhardinge 0 siblings, 1 reply; 38+ messages in thread From: Rik van Riel @ 2010-08-30 18:39 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Glauber Costa, kvm, avi, zamsden, mtosatti, peterz, mingo On 08/30/2010 01:30 PM, Jeremy Fitzhardinge wrote: > On 08/30/2010 09:06 AM, Glauber Costa wrote: >> This patch proposes a common steal time implementation. When no >> steal time is accounted, we just add a branch to the current >> accounting code, that shouldn't add much overhead. > > How is stolen time logically any different from a CPU running slowly due > to HT or power management? Is it worth trying to handle them in the > same way? (I'm mostly picking on the "_from_hypervisor" part, since > that seems over-specific.) > > Why not have a get_unstolen_time() function which just returns > sched_clock() in the normal case, but can return less? Steal time gets you information you can act on, when your program is running slowly. The steal time statistic allows you to see whether the slowdown was due to the CPU just not being fast enough, or due to something else contending for the CPU. This can be useful information. Apparently, it has been useful enough that it has been implemented on s390, PPC and Xen (pre pvops Xen). I suppose HT can have similar results, but that is also something that system administrators can see. -- All rights reversed ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [RFC v2 4/7] change kernel accounting to include steal time 2010-08-30 18:39 ` Rik van Riel @ 2010-08-30 19:07 ` Jeremy Fitzhardinge 2010-08-30 19:14 ` Peter Zijlstra 2010-08-30 19:17 ` Rik van Riel 0 siblings, 2 replies; 38+ messages in thread From: Jeremy Fitzhardinge @ 2010-08-30 19:07 UTC (permalink / raw) To: Rik van Riel; +Cc: Glauber Costa, kvm, avi, zamsden, mtosatti, peterz, mingo On 08/30/2010 11:39 AM, Rik van Riel wrote: > On 08/30/2010 01:30 PM, Jeremy Fitzhardinge wrote: >> On 08/30/2010 09:06 AM, Glauber Costa wrote: >>> This patch proposes a common steal time implementation. When no >>> steal time is accounted, we just add a branch to the current >>> accounting code, that shouldn't add much overhead. >> >> How is stolen time logically any different from a CPU running slowly due >> to HT or power management? Is it worth trying to handle them in the >> same way? (I'm mostly picking on the "_from_hypervisor" part, since >> that seems over-specific.) >> >> Why not have a get_unstolen_time() function which just returns >> sched_clock() in the normal case, but can return less? > > Steal time gets you information you can act on, when > your program is running slowly. > > The steal time statistic allows you to see whether the > slowdown was due to the CPU just not being fast enough, > or due to something else contending for the CPU. > > This can be useful information. Apparently, it has been > useful enough that it has been implemented on s390, PPC > and Xen (pre pvops Xen). Sure - pvops Xen has had stolen time accounting from the start (or a very long time, at least). I think Glauber's changes are functionally identical to Xen's existing stolen time support, but more intrusive. > I suppose HT can have similar results, but that is > also something that system administrators can see. My comments were mostly because I though that this patch was going to actually make the scheduler change behaviour to take stolen time into account in scheduling decisions - and those decisions would be equally meaningful if the slowdown was from HT, PM or stolen time. But AFAICT this patch does nothing to that end. J ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [RFC v2 4/7] change kernel accounting to include steal time 2010-08-30 19:07 ` Jeremy Fitzhardinge @ 2010-08-30 19:14 ` Peter Zijlstra 2010-08-30 19:17 ` Rik van Riel 1 sibling, 0 replies; 38+ messages in thread From: Peter Zijlstra @ 2010-08-30 19:14 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Rik van Riel, Glauber Costa, kvm, avi, zamsden, mtosatti, mingo On Mon, 2010-08-30 at 12:07 -0700, Jeremy Fitzhardinge wrote: > My comments were mostly because I though that this patch was going to > actually make the scheduler change behaviour to take stolen time into > account in scheduling decisions - and those decisions would be equally > meaningful if the slowdown was from HT, PM or stolen time. But AFAICT > this patch does nothing to that end. If you want to influence the load-balancer (the only thing that really cares about HT) the best would probably be to add a term to kernel/sched_fair.c:update_cpu_power(). Decrease the power relative to the amount of steal-time, just like freq and time spend on RT tasks. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [RFC v2 4/7] change kernel accounting to include steal time 2010-08-30 19:07 ` Jeremy Fitzhardinge 2010-08-30 19:14 ` Peter Zijlstra @ 2010-08-30 19:17 ` Rik van Riel 2010-08-30 19:20 ` Peter Zijlstra 1 sibling, 1 reply; 38+ messages in thread From: Rik van Riel @ 2010-08-30 19:17 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Glauber Costa, kvm, avi, zamsden, mtosatti, peterz, mingo On 08/30/2010 03:07 PM, Jeremy Fitzhardinge wrote: > On 08/30/2010 11:39 AM, Rik van Riel wrote: >> On 08/30/2010 01:30 PM, Jeremy Fitzhardinge wrote: >>> On 08/30/2010 09:06 AM, Glauber Costa wrote: >>>> This patch proposes a common steal time implementation. When no >>>> steal time is accounted, we just add a branch to the current >>>> accounting code, that shouldn't add much overhead. >>> >>> How is stolen time logically any different from a CPU running slowly due >>> to HT or power management? Is it worth trying to handle them in the >>> same way? (I'm mostly picking on the "_from_hypervisor" part, since >>> that seems over-specific.) >>> >>> Why not have a get_unstolen_time() function which just returns >>> sched_clock() in the normal case, but can return less? >> >> Steal time gets you information you can act on, when >> your program is running slowly. >> >> The steal time statistic allows you to see whether the >> slowdown was due to the CPU just not being fast enough, >> or due to something else contending for the CPU. >> >> This can be useful information. Apparently, it has been >> useful enough that it has been implemented on s390, PPC >> and Xen (pre pvops Xen). > > Sure - pvops Xen has had stolen time accounting from the start (or a > very long time, at least). I think Glauber's changes are functionally > identical to Xen's existing stolen time support, but more intrusive. > > >> I suppose HT can have similar results, but that is >> also something that system administrators can see. > > My comments were mostly because I though that this patch was going to > actually make the scheduler change behaviour to take stolen time into > account in scheduling decisions I believe that happens automatically. When time is accounted as steal time, it is NOT accounted as to the current process user/system/..., which in turn should help it in the scheduler. Am I overlooking something? -- All rights reversed ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [RFC v2 4/7] change kernel accounting to include steal time 2010-08-30 19:17 ` Rik van Riel @ 2010-08-30 19:20 ` Peter Zijlstra 2010-08-30 19:45 ` Rik van Riel 0 siblings, 1 reply; 38+ messages in thread From: Peter Zijlstra @ 2010-08-30 19:20 UTC (permalink / raw) To: Rik van Riel Cc: Jeremy Fitzhardinge, Glauber Costa, kvm, avi, zamsden, mtosatti, mingo On Mon, 2010-08-30 at 15:17 -0400, Rik van Riel wrote: > > When time is accounted as steal time, it is NOT accounted as > to the current process user/system/..., which in turn should > help it in the scheduler. > > Am I overlooking something? Yeah, the scheduler doesn't care about the user/system time accounting at all... :-) ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [RFC v2 4/7] change kernel accounting to include steal time 2010-08-30 19:20 ` Peter Zijlstra @ 2010-08-30 19:45 ` Rik van Riel 2010-08-30 22:56 ` Jeremy Fitzhardinge 0 siblings, 1 reply; 38+ messages in thread From: Rik van Riel @ 2010-08-30 19:45 UTC (permalink / raw) To: Peter Zijlstra Cc: Jeremy Fitzhardinge, Glauber Costa, kvm, avi, zamsden, mtosatti, mingo On 08/30/2010 03:20 PM, Peter Zijlstra wrote: > On Mon, 2010-08-30 at 15:17 -0400, Rik van Riel wrote: >> >> When time is accounted as steal time, it is NOT accounted as >> to the current process user/system/..., which in turn should >> help it in the scheduler. >> >> Am I overlooking something? > > Yeah, the scheduler doesn't care about the user/system time accounting > at all... :-) Uh oh. This would seem like something we'll want to fix in an architecture independent way, so s390, etc. also benefit from it. I can see this being a real problem when the host and guest OS have the same time slice - which is quite possible since they may both be the same version of Linux. Guest 1, alternating between processes A and B, may end up with process A getting a lot of actual CPU time, and process B being scheduled in when the VCPU itself is not running... -- All rights reversed ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [RFC v2 4/7] change kernel accounting to include steal time 2010-08-30 19:45 ` Rik van Riel @ 2010-08-30 22:56 ` Jeremy Fitzhardinge 2010-08-30 23:03 ` Rik van Riel 0 siblings, 1 reply; 38+ messages in thread From: Jeremy Fitzhardinge @ 2010-08-30 22:56 UTC (permalink / raw) To: Rik van Riel Cc: Peter Zijlstra, Glauber Costa, kvm, avi, zamsden, mtosatti, mingo On 08/30/2010 12:45 PM, Rik van Riel wrote: > On 08/30/2010 03:20 PM, Peter Zijlstra wrote: >> On Mon, 2010-08-30 at 15:17 -0400, Rik van Riel wrote: >>> >>> When time is accounted as steal time, it is NOT accounted as >>> to the current process user/system/..., which in turn should >>> help it in the scheduler. >>> >>> Am I overlooking something? >> >> Yeah, the scheduler doesn't care about the user/system time accounting >> at all... :-) > > Uh oh. This would seem like something we'll want to fix > in an architecture independent way, so s390, etc. also > benefit from it. > > I can see this being a real problem when the host and guest > OS have the same time slice - which is quite possible since > they may both be the same version of Linux. > > Guest 1, alternating between processes A and B, may end up > with process A getting a lot of actual CPU time, and process > B being scheduled in when the VCPU itself is not running... Yep. I'd been trying to do that with sched_clock games, but that never worked out. I think it basically comes down to adding "sched_clock_unstolen()" which the scheduler can use to measure time a process spends running, and sched_clock() for measuring sleep times. In the normal case, sched_clock_unstolen() would be the same as sched_clock(). J ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [RFC v2 4/7] change kernel accounting to include steal time 2010-08-30 22:56 ` Jeremy Fitzhardinge @ 2010-08-30 23:03 ` Rik van Riel 2010-08-31 8:11 ` Peter Zijlstra 0 siblings, 1 reply; 38+ messages in thread From: Rik van Riel @ 2010-08-30 23:03 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Peter Zijlstra, Glauber Costa, kvm, avi, zamsden, mtosatti, mingo On 08/30/2010 06:56 PM, Jeremy Fitzhardinge wrote: > On 08/30/2010 12:45 PM, Rik van Riel wrote: >> On 08/30/2010 03:20 PM, Peter Zijlstra wrote: >>> On Mon, 2010-08-30 at 15:17 -0400, Rik van Riel wrote: >>>> >>>> When time is accounted as steal time, it is NOT accounted as >>>> to the current process user/system/..., which in turn should >>>> help it in the scheduler. >>>> >>>> Am I overlooking something? >>> >>> Yeah, the scheduler doesn't care about the user/system time accounting >>> at all... :-) >> >> Uh oh. This would seem like something we'll want to fix >> in an architecture independent way, so s390, etc. also >> benefit from it. >> >> I can see this being a real problem when the host and guest >> OS have the same time slice - which is quite possible since >> they may both be the same version of Linux. >> >> Guest 1, alternating between processes A and B, may end up >> with process A getting a lot of actual CPU time, and process >> B being scheduled in when the VCPU itself is not running... > > Yep. I'd been trying to do that with sched_clock games, but that never > worked out. > > I think it basically comes down to adding "sched_clock_unstolen()" which > the scheduler can use to measure time a process spends running, and > sched_clock() for measuring sleep times. In the normal case, > sched_clock_unstolen() would be the same as sched_clock(). That requires the host to export (any time the guest is scheduled in), the amount of CPU time the VCPU thread has used, and the time the VCPU was scheduled in. Since the VCPU must be running when it is examining these variables, it can calculate the additional time (since it was last scheduled) to account to the task, and remember the currently calculated time in its own per-vcpu variable, so next time it can get a delta again. -- All rights reversed ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [RFC v2 4/7] change kernel accounting to include steal time 2010-08-30 23:03 ` Rik van Riel @ 2010-08-31 8:11 ` Peter Zijlstra 2010-09-02 18:19 ` Glauber Costa 0 siblings, 1 reply; 38+ messages in thread From: Peter Zijlstra @ 2010-08-31 8:11 UTC (permalink / raw) To: Rik van Riel Cc: Jeremy Fitzhardinge, Glauber Costa, kvm, avi, zamsden, mtosatti, mingo On Mon, 2010-08-30 at 19:03 -0400, Rik van Riel wrote: > > > I think it basically comes down to adding "sched_clock_unstolen()" which > > the scheduler can use to measure time a process spends running, and > > sched_clock() for measuring sleep times. In the normal case, > > sched_clock_unstolen() would be the same as sched_clock(). > > That requires the host to export (any time the guest is scheduled > in), the amount of CPU time the VCPU thread has used, and the time > the VCPU was scheduled in. > > Since the VCPU must be running when it is examining these variables, > it can calculate the additional time (since it was last scheduled) > to account to the task, and remember the currently calculated time > in its own per-vcpu variable, so next time it can get a delta again. I think its easier (and sufficient) for the host to tell the guest how long it was _not_ running. That can simply be passed in when you start the vcpu again and doesn't need a fancy communication channel. The guests sched_clock() will measure wall time, the guests sched_clock_stolen() will report the accumulation of these stolen times. Then you can make sched_clock_unstolen() be sched_clock() - sched_clock_stolen(). And like Jeremy said, if you make the sched_fair stuff use sched_clock_unstolen() things should more or less work. The problem with all that is that you'll start to schedule on unstolen time instead of wall-time, which might not give the best results for things like latencies etc.. but I guess that's one of the prices you pay for using virt. Also, like said yesterday, you need some factor in update_cpu_power(), a quick hack might be to add all stolen time to sched_rt_avg_update(). ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [RFC v2 4/7] change kernel accounting to include steal time 2010-08-31 8:11 ` Peter Zijlstra @ 2010-09-02 18:19 ` Glauber Costa 2010-09-03 3:24 ` Jeremy Fitzhardinge 2010-09-03 7:18 ` Peter Zijlstra 0 siblings, 2 replies; 38+ messages in thread From: Glauber Costa @ 2010-09-02 18:19 UTC (permalink / raw) To: Peter Zijlstra Cc: Rik van Riel, Jeremy Fitzhardinge, kvm, avi, zamsden, mtosatti, mingo On Tue, Aug 31, 2010 at 10:11:49AM +0200, Peter Zijlstra wrote: > On Mon, 2010-08-30 at 19:03 -0400, Rik van Riel wrote: > > > > > I think it basically comes down to adding "sched_clock_unstolen()" which > > > the scheduler can use to measure time a process spends running, and > > > sched_clock() for measuring sleep times. In the normal case, > > > sched_clock_unstolen() would be the same as sched_clock(). > > > > That requires the host to export (any time the guest is scheduled > > in), the amount of CPU time the VCPU thread has used, and the time > > the VCPU was scheduled in. > > > > Since the VCPU must be running when it is examining these variables, > > it can calculate the additional time (since it was last scheduled) > > to account to the task, and remember the currently calculated time > > in its own per-vcpu variable, so next time it can get a delta again. > > I think its easier (and sufficient) for the host to tell the guest how > long it was _not_ running. That can simply be passed in when you start > the vcpu again and doesn't need a fancy communication channel. > > The guests sched_clock() will measure wall time, the guests > sched_clock_stolen() will report the accumulation of these stolen times. > > Then you can make sched_clock_unstolen() be sched_clock() - > sched_clock_stolen(). And like Jeremy said, if you make the sched_fair > stuff use sched_clock_unstolen() things should more or less work. So what's the big drawback of just making sched_clock return sched_clock_unstolen? When there is no steal time involved, they will just be equal anyway. And this way, everybody that relies on sched_clock for whatever reason, will probably work. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [RFC v2 4/7] change kernel accounting to include steal time 2010-09-02 18:19 ` Glauber Costa @ 2010-09-03 3:24 ` Jeremy Fitzhardinge 2010-09-03 7:18 ` Peter Zijlstra 1 sibling, 0 replies; 38+ messages in thread From: Jeremy Fitzhardinge @ 2010-09-03 3:24 UTC (permalink / raw) To: Glauber Costa Cc: Peter Zijlstra, Rik van Riel, kvm, avi, zamsden, mtosatti, mingo On 09/02/2010 11:19 AM, Glauber Costa wrote: > On Tue, Aug 31, 2010 at 10:11:49AM +0200, Peter Zijlstra wrote: >> I think its easier (and sufficient) for the host to tell the guest how >> long it was _not_ running. That can simply be passed in when you start >> the vcpu again and doesn't need a fancy communication channel. >> >> The guests sched_clock() will measure wall time, the guests >> sched_clock_stolen() will report the accumulation of these stolen times. >> >> Then you can make sched_clock_unstolen() be sched_clock() - >> sched_clock_stolen(). And like Jeremy said, if you make the sched_fair >> stuff use sched_clock_unstolen() things should more or less work. > So what's the big drawback of just making sched_clock return sched_clock_unstolen? > When there is no steal time involved, they will just be equal anyway. > And this way, everybody that relies on sched_clock for whatever reason, > will probably work. I can say from experience that it definitely does not work. That's what I tried up until recently when I reverted it all. It doesn't work for two semi-related reasons: * Using sched_clock_unstolen() to measure idle/sleep times is completely meaningless (nobody cares if the cpu was "stolen" while you were sleeping). * The notion of unstolen time is inherently per-cpu, so the unstolen clocks are completely unsynchronized between cpus, whereas sched_clock is expected to be moderately well synchronized. A corollary to this is that if a task sleeps on one cpu and wakes on another, then "sleep_time = sleep_end - sleep_start" is completely, utterly, meaningless if you use per-cpu unstolen time as your timebase sleep_start/end. (The surprising thing is that it actually works surprisingly well, in that the system runs and there's only occasional oddities like "idle time" being completely misreported, and sometimes processes go to sleep for really long times.) J ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [RFC v2 4/7] change kernel accounting to include steal time 2010-09-02 18:19 ` Glauber Costa 2010-09-03 3:24 ` Jeremy Fitzhardinge @ 2010-09-03 7:18 ` Peter Zijlstra 1 sibling, 0 replies; 38+ messages in thread From: Peter Zijlstra @ 2010-09-03 7:18 UTC (permalink / raw) To: Glauber Costa Cc: Rik van Riel, Jeremy Fitzhardinge, kvm, avi, zamsden, mtosatti, mingo On Thu, 2010-09-02 at 15:19 -0300, Glauber Costa wrote: > So what's the big drawback of just making sched_clock return sched_clock_unstolen? > When there is no steal time involved, they will just be equal anyway. > And this way, everybody that relies on sched_clock for whatever reason, > will probably work. Because there's a number of things wanting sched_clock() and the likes to be synchronized across cpus, its impossible to make _unstolen() synchronized because different vcpus can get different service levels. Also, I think you want to keep measuring scheduling latency and other such things in actual wall-time. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [RFC 1/8] Implement getnsboottime kernel API 2010-08-30 16:06 ` [RFC 1/8] Implement getnsboottime kernel API Glauber Costa 2010-08-30 16:06 ` [RFC v2 2/7] always call kvm_write_guest Glauber Costa @ 2010-09-01 23:56 ` Zachary Amsden 1 sibling, 0 replies; 38+ messages in thread From: Zachary Amsden @ 2010-09-01 23:56 UTC (permalink / raw) To: Glauber Costa; +Cc: kvm, avi, mtosatti, riel, peterz, mingo, jeremy On 08/30/2010 06:06 AM, Glauber Costa wrote: > From: Zachary Amsden<zamsden@redhat.com> > > Add a kernel call to get the number of nanoseconds since boot. This > is generally useful enough to make it a generic call. > > Signed-off-by: Zachary Amsden<zamsden@redhat.com> > --- > include/linux/time.h | 1 + > kernel/time/timekeeping.c | 27 +++++++++++++++++++++++++++ > 2 files changed, 28 insertions(+), 0 deletions(-) > > diff --git a/include/linux/time.h b/include/linux/time.h > index ea3559f..5d04108 100644 > --- a/include/linux/time.h > +++ b/include/linux/time.h > @@ -145,6 +145,7 @@ extern void getnstimeofday(struct timespec *tv); > extern void getrawmonotonic(struct timespec *ts); > extern void getboottime(struct timespec *ts); > extern void monotonic_to_bootbased(struct timespec *ts); > +extern s64 getnsboottime(void); > > extern struct timespec timespec_trunc(struct timespec t, unsigned gran); > extern int timekeeping_valid_for_hres(void); > diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c > index caf8d4d..d250f0a 100644 > --- a/kernel/time/timekeeping.c > +++ b/kernel/time/timekeeping.c > @@ -285,6 +285,33 @@ void ktime_get_ts(struct timespec *ts) > } > EXPORT_SYMBOL_GPL(ktime_get_ts); > > + > +/** > + * getnsboottime - get the bootbased clock in nsec format > + * > + * The function calculates the bootbased clock from the realtime > + * clock and the wall_to_monotonic offset and stores the result > + * in normalized timespec format in the variable pointed to by @ts. > + */ > +s64 getnsboottime(void) > +{ > + unsigned int seq; > + s64 secs, nsecs; > + > + WARN_ON(timekeeping_suspended); > + > + do { > + seq = read_seqbegin(&xtime_lock); > + secs = xtime.tv_sec + wall_to_monotonic.tv_sec; > + secs += total_sleep_time.tv_sec; > + nsecs = xtime.tv_nsec + wall_to_monotonic.tv_nsec; > + nsecs += total_sleep_time.tv_nsec + timekeeping_get_ns(); > + > + } while (read_seqretry(&xtime_lock, seq)); > + return nsecs + (secs * NSEC_PER_SEC); > +} > +EXPORT_SYMBOL_GPL(getnsboottime); > + > /** > * do_gettimeofday - Returns the time of day in a timeval > * @tv: pointer to the timeval to be set > FWIW, I sent a patch to drop this yesterday, in favor of this approach: +static inline u64 get_kernel_ns(void) +{ + struct timespec ts; + + WARN_ON(preemptible()); + ktime_get_ts(&ts); + monotonic_to_bootbased(&ts); + return timespec_to_ns(&ts); +} + ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [RFC v2 0/7] kvm stael time implementation 2010-08-30 16:06 [RFC v2 0/7] kvm stael time implementation Glauber Costa 2010-08-30 16:06 ` [RFC v2 1/7] change headers preparing for steal time Glauber Costa @ 2010-08-30 16:37 ` Peter Zijlstra 2010-08-30 16:45 ` Jeremy Fitzhardinge 2010-08-30 17:20 ` Jeremy Fitzhardinge 2 siblings, 1 reply; 38+ messages in thread From: Peter Zijlstra @ 2010-08-30 16:37 UTC (permalink / raw) To: Glauber Costa; +Cc: kvm, avi, zamsden, mtosatti, riel, mingo, jeremy On Mon, 2010-08-30 at 12:06 -0400, Glauber Costa wrote: > > Have a nice review. > Glauber Costa (7): > change headers preparing for steal time > always call kvm_write_guest > measure time out of guest > change kernel accounting to include steal time > kvm steal time implementation > touch softlockup watchdog > tell guest about steal time feature Something's gone wrong, I've got at least 15 emails in this thread. Also, please try again but now with --no-chain-reply-to, threads nested this deep are useless for reviewing. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [RFC v2 0/7] kvm stael time implementation 2010-08-30 16:37 ` [RFC v2 0/7] kvm stael time implementation Peter Zijlstra @ 2010-08-30 16:45 ` Jeremy Fitzhardinge 2010-08-30 17:21 ` Glauber Costa 0 siblings, 1 reply; 38+ messages in thread From: Jeremy Fitzhardinge @ 2010-08-30 16:45 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Glauber Costa, kvm, avi, zamsden, mtosatti, riel, mingo On 08/30/2010 09:37 AM, Peter Zijlstra wrote: > Something's gone wrong, I've got at least 15 emails in this thread. > > Also, please try again but now with --no-chain-reply-to, threads nested > this deep are useless for reviewing. Looks like there are two versions of the series mushed together. J ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [RFC v2 0/7] kvm stael time implementation 2010-08-30 16:45 ` Jeremy Fitzhardinge @ 2010-08-30 17:21 ` Glauber Costa 0 siblings, 0 replies; 38+ messages in thread From: Glauber Costa @ 2010-08-30 17:21 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Peter Zijlstra, kvm, avi, zamsden, mtosatti, riel, mingo On Mon, Aug 30, 2010 at 09:45:41AM -0700, Jeremy Fitzhardinge wrote: > On 08/30/2010 09:37 AM, Peter Zijlstra wrote: > > Something's gone wrong, I've got at least 15 emails in this thread. > > > > Also, please try again but now with --no-chain-reply-to, threads nested > > this deep are useless for reviewing. > > Looks like there are two versions of the series mushed together. yeah, my bad. I did git-format-patch twice by accident, into the same directory. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [RFC v2 0/7] kvm stael time implementation 2010-08-30 16:06 [RFC v2 0/7] kvm stael time implementation Glauber Costa 2010-08-30 16:06 ` [RFC v2 1/7] change headers preparing for steal time Glauber Costa 2010-08-30 16:37 ` [RFC v2 0/7] kvm stael time implementation Peter Zijlstra @ 2010-08-30 17:20 ` Jeremy Fitzhardinge 2 siblings, 0 replies; 38+ messages in thread From: Jeremy Fitzhardinge @ 2010-08-30 17:20 UTC (permalink / raw) To: Glauber Costa; +Cc: kvm, avi, zamsden, mtosatti, riel, peterz, mingo On 08/30/2010 09:06 AM, Glauber Costa wrote: > Hi, > > So, this is basically the same as v1, with three major > differences: > 1) I am posting to lkml for wider audience > 2) patch 2/7 fixes one problem I mentined would happen in > smp systems, which is, we only update kvmclock when we > changes pcup > 3) softlockup algorithm is changed. Again, as marcelo pointed > out, this is open to discussion, and I am not dropping it > so more people can step in. > > I have some other patches under local test for a slightly modified > guest part accounting, and I do somehow support extending > the interface, and changing to nsecs (maybe not 100 %, but...). But > I am posting in this state so we can have lkml people to step > in earlier. > > Reminder of the previous cover-letter: > > There are two parts of it: the guest and host part. > > The proposal for the guest part, is to just change the > common time accounting, and try to identify at that spot, > wether or not we should account any steal time. I considered > this idea less cumbersome that trying to cook a clockevents > implementation ourselves, since I see little value in it. > I am, however, pretty open to suggestions. What's the relationship between clockevents and stolen time? Are you thinking some form of timer that counts unstolen time or something? > For the host<->guest communications, I am using a shared > page, in the same way as pvclock. Because of that, I am just > hijacking pvclock structure anyway. There is a 32-bit field > floating by, that gives us enough room for 8 years of steal > time (we use msec resolution). Please don't. The pvclock structure is already getting a bit packed with stuff, and stolen time isn't really part of its job. In Xen we have a separate runstate structure which allows the guest to get a detailed breakdown of the time each vcpu spends in each state (which are guaranteed to sum to the system time). We can use that to compute how much time has been stolen (time spent in RUNNABLE state). You might consider a similar ABI for KVM, even if you can't (yet) fill out all the time values. > The main idea is to timestamp our exit and entry through > sched notifiers, and export the value at pvclock updates. > This obviously have some disadvantages: by doing this we > are giving up futures ideas about only updating > this structure once, and even right now, won't work > on pinned-smp (since we don't update pvclock if we > haven't changed cpus.) > > But again, it is just an RFC, and I'd like to feel the > reception of the idea as a whole. The Xen code has always accounted for stolen time, so it appears in top, vmstat, etc, and gives a user/admin some idea about how much their domain is suffering from competition. This doesn't require any kernel changes aside from some calls to account_steal_ticks(); we do this every timer interrupt, accumulating whole ticks as we get them. But I've not successfully managed to make the scheduler work well with stolen time. I did experiment with making sched_clock return unstolen time, on the grounds that it would give the scheduler more information about how long things actually executed for. But its meaningless for measuring sleep/idle times, and it causes the different cpus' timebases to drift severely, which causes other things in the kernel to get upset. So I think any work in the scheduler area is interesting, and probably worth posting separately unless they're too entangled in the stolen time accounting area. J > Have a nice review. > Glauber Costa (7): > change headers preparing for steal time > always call kvm_write_guest > measure time out of guest > change kernel accounting to include steal time > kvm steal time implementation > touch softlockup watchdog > tell guest about steal time feature > > arch/x86/include/asm/kvm_host.h | 2 + > arch/x86/include/asm/kvm_para.h | 1 + > arch/x86/include/asm/pvclock-abi.h | 4 ++- > arch/x86/kernel/kvmclock.c | 40 ++++++++++++++++++++++++++++++++++++ > arch/x86/kvm/x86.c | 26 ++++++++++++++++++---- > include/linux/sched.h | 1 + > kernel/sched.c | 29 ++++++++++++++++++++++++++ > 7 files changed, 97 insertions(+), 6 deletions(-) > ^ permalink raw reply [flat|nested] 38+ messages in thread
end of thread, other threads:[~2010-09-03 7:19 UTC | newest] Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2010-08-30 16:06 [RFC v2 0/7] kvm stael time implementation Glauber Costa 2010-08-30 16:06 ` [RFC v2 1/7] change headers preparing for steal time Glauber Costa 2010-08-30 16:06 ` [RFC 1/8] Implement getnsboottime kernel API Glauber Costa 2010-08-30 16:06 ` [RFC v2 2/7] always call kvm_write_guest Glauber Costa 2010-08-30 16:06 ` [RFC 2/8] change headers preparing for steal time Glauber Costa 2010-08-30 16:06 ` [RFC 3/8] always call kvm_write_guest Glauber Costa 2010-08-30 16:06 ` [RFC v2 3/7] measure time out of guest Glauber Costa 2010-08-30 16:06 ` [RFC v2 4/7] change kernel accounting to include steal time Glauber Costa 2010-08-30 16:06 ` [RFC 4/8] measure time out of guest Glauber Costa 2010-08-30 16:06 ` [RFC 5/8] change kernel accounting to include steal time Glauber Costa 2010-08-30 16:06 ` [RFC v2 5/7] kvm steal time implementation Glauber Costa 2010-08-30 16:06 ` [RFC 6/8] " Glauber Costa 2010-08-30 16:06 ` [RFC v2 6/7] touch softlockup watchdog Glauber Costa 2010-08-30 16:06 ` [RFC v2 7/7] tell guest about steal time feature Glauber Costa 2010-08-30 16:06 ` [RFC 7/8] touch softlockup watchdog Glauber Costa 2010-08-30 16:06 ` [RFC 8/8] tell guest about steal time feature Glauber Costa 2010-08-30 17:33 ` [RFC v2 6/7] touch softlockup watchdog Jeremy Fitzhardinge 2010-08-30 18:07 ` Glauber Costa 2010-08-30 16:46 ` [RFC 5/8] change kernel accounting to include steal time Peter Zijlstra 2010-08-30 17:26 ` Glauber Costa 2010-08-30 17:30 ` [RFC v2 4/7] " Jeremy Fitzhardinge 2010-08-30 18:39 ` Rik van Riel 2010-08-30 19:07 ` Jeremy Fitzhardinge 2010-08-30 19:14 ` Peter Zijlstra 2010-08-30 19:17 ` Rik van Riel 2010-08-30 19:20 ` Peter Zijlstra 2010-08-30 19:45 ` Rik van Riel 2010-08-30 22:56 ` Jeremy Fitzhardinge 2010-08-30 23:03 ` Rik van Riel 2010-08-31 8:11 ` Peter Zijlstra 2010-09-02 18:19 ` Glauber Costa 2010-09-03 3:24 ` Jeremy Fitzhardinge 2010-09-03 7:18 ` Peter Zijlstra 2010-09-01 23:56 ` [RFC 1/8] Implement getnsboottime kernel API Zachary Amsden 2010-08-30 16:37 ` [RFC v2 0/7] kvm stael time implementation Peter Zijlstra 2010-08-30 16:45 ` Jeremy Fitzhardinge 2010-08-30 17:21 ` Glauber Costa 2010-08-30 17:20 ` Jeremy Fitzhardinge
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).