Hello,

there have been serveral reports on kvm-devel about problems with kvm-clock. I 
can reproduce this 100%.
1. doesn't depend on guest kernel eing 2.6.32  or 3.0.1
2. qemu-0.14.1 is broken, qemu-0.15.0 works.
3. host-kernel 2.6.32.40 is broken, 3.0.0 works.
So currently I have found two solutions: either upgrade qemu-kvm or the 
host-kernel, which (for me) both are no options.
I tracked it down to this patch: If I revert it, the VM starts fine.
My current procedure to reproduce this problem is like this:
1. boot into guest kernel
2. reboot from within guest
3. on the second boot, the VM crawls very slowly. The kernel-time printed are 
roughly the same numbers as on the first boot, but they don't match 
wall-clock: The kernel-time claims to be 10 s from uptime, but real-time is 
more like 42 s. If the boot process makes it as far as a command line, 
runnign a "sleep 1" takes much more than one second.

Putting in some prink(..., max_kernel_ns, kernel_ns) I've ovserved that during 
the first boot, I get 10k calles to that function a second, on the seond boot 
its down to 10-20 a second.
Also max_kernel_ns is way larger than kernel_ns:
18:27:17 [23755.005941] 6148360456312778392 23755005912468

This patch was back-ported to 2.6.32.40 but it looks like some other 
infrastructure might have changed, so it doen't work as expected.

And idee on how to proceed?

On Friday 20 August 2010 10:07:30 Zachary Amsden wrote:
> Kernel time, which advances in discrete steps may progress much slower
> than TSC.  As a result, when kvmclock is adjusted to a new base, the
> apparent time to the guest, which runs at a much higher, nsec scaled
> rate based on the current TSC, may have already been observed to have
> a larger value (kernel_ns + scaled tsc) than the value to which we are
> setting it (kernel_ns + 0).
>
> We must instead compute the clock as potentially observed by the guest
> for kernel_ns to make sure it does not go backwards.
>
> Signed-off-by: Zachary Amsden <zamsden@redhat.com>
> ---
>  arch/x86/include/asm/kvm_host.h |    2 +
>  arch/x86/kvm/x86.c              |   43
> +++++++++++++++++++++++++++++++++++++- 2 files changed, 43 insertions(+), 2
> deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h
> b/arch/x86/include/asm/kvm_host.h index 324e892..871800d 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -339,6 +339,8 @@ struct kvm_vcpu_arch {
>  	unsigned int time_offset;
>  	struct page *time_page;
>  	u64 last_host_tsc;
> +	u64 last_guest_tsc;
> +	u64 last_kernel_ns;
>
>  	bool nmi_pending;
>  	bool nmi_injected;
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 1948c36..fe74b42 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -976,14 +976,15 @@ static int kvm_write_guest_time(struct kvm_vcpu *v)
>  	struct kvm_vcpu_arch *vcpu = &v->arch;
>  	void *shared_kaddr;
>  	unsigned long this_tsc_khz;
> -	s64 kernel_ns;
> +	s64 kernel_ns, max_kernel_ns;
> +	u64 tsc_timestamp;
>
>  	if ((!vcpu->time_page))
>  		return 0;
>
>  	/* Keep irq disabled to prevent changes to the clock */
>  	local_irq_save(flags);
> -	kvm_get_msr(v, MSR_IA32_TSC, &vcpu->hv_clock.tsc_timestamp);
> +	kvm_get_msr(v, MSR_IA32_TSC, &tsc_timestamp);
>  	kernel_ns = get_kernel_ns();
>  	this_tsc_khz = __get_cpu_var(cpu_tsc_khz);
>  	local_irq_restore(flags);
> @@ -993,13 +994,49 @@ static int kvm_write_guest_time(struct kvm_vcpu *v)
>  		return 1;
>  	}
>
> +	/*
> +	 * Time as measured by the TSC may go backwards when resetting the base
> +	 * tsc_timestamp.  The reason for this is that the TSC resolution is
> +	 * higher than the resolution of the other clock scales.  Thus, many
> +	 * possible measurments of the TSC correspond to one measurement of any
> +	 * other clock, and so a spread of values is possible.  This is not a
> +	 * problem for the computation of the nanosecond clock; with TSC rates
> +	 * around 1GHZ, there can only be a few cycles which correspond to one
> +	 * nanosecond value, and any path through this code will inevitably
> +	 * take longer than that.  However, with the kernel_ns value itself,
> +	 * the precision may be much lower, down to HZ granularity.  If the
> +	 * first sampling of TSC against kernel_ns ends in the low part of the
> +	 * range, and the second in the high end of the range, we can get:
> +	 *
> +	 * (TSC - offset_low) * S + kns_old > (TSC - offset_high) * S + kns_new
> +	 *
> +	 * As the sampling errors potentially range in the thousands of cycles,
> +	 * it is possible such a time value has already been observed by the
> +	 * guest.  To protect against this, we must compute the system time as
> +	 * observed by the guest and ensure the new system time is greater.
> + 	 */
> +	max_kernel_ns = 0;
> +	if (vcpu->hv_clock.tsc_timestamp && vcpu->last_guest_tsc) {
> +		max_kernel_ns = vcpu->last_guest_tsc -
> +				vcpu->hv_clock.tsc_timestamp;
> +		max_kernel_ns = pvclock_scale_delta(max_kernel_ns,
> +				    vcpu->hv_clock.tsc_to_system_mul,
> +				    vcpu->hv_clock.tsc_shift);
> +		max_kernel_ns += vcpu->last_kernel_ns;
> +	}
> +
>  	if (unlikely(vcpu->hw_tsc_khz != this_tsc_khz)) {
>  		kvm_set_time_scale(this_tsc_khz, &vcpu->hv_clock);
>  		vcpu->hw_tsc_khz = this_tsc_khz;
>  	}
>
> +	if (max_kernel_ns > kernel_ns)
> +		kernel_ns = max_kernel_ns;
> +
>  	/* With all the info we got, fill in the values */
> +	vcpu->hv_clock.tsc_timestamp = tsc_timestamp;
>  	vcpu->hv_clock.system_time = kernel_ns + v->kvm->arch.kvmclock_offset;
> +	vcpu->last_kernel_ns = kernel_ns;
>  	vcpu->hv_clock.flags = 0;
>
>  	/*
> @@ -4931,6 +4968,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>  	if (hw_breakpoint_active())
>  		hw_breakpoint_restore();
>
> +	kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc);
> +
>  	atomic_set(&vcpu->guest_mode, 0);
>  	smp_wmb();
>  	local_irq_enable();

Sincerely
Philipp
-- 
Philipp Hahn           Open Source Software Engineer      hahn@univention.de
Univention GmbH        Linux for Your Business        fon: +49 421 22 232- 0
Mary-Somerville-Str.1  D-28359 Bremen                 fax: +49 421 22 232-99
                                                   http://www.univention.de/
----------------------------------------------------------------------------
Treffen Sie Univention auf der IT&Business vom 20. bis 22. September 2011
auf dem Gemeinschaftsstand der Open Source Business Alliance in Stuttgart in
Halle 3 Stand 3D27-7.