From mboxrd@z Thu Jan 1 00:00:00 1970 From: Philipp Hahn Subject: Re: [KVM timekeeping 16/35] Fix a possible backwards warp of kvmclock Date: Fri, 2 Sep 2011 20:34:11 +0200 Message-ID: <201109022034.16517.hahn@univention.de> References: <1282291669-25709-1-git-send-email-zamsden@redhat.com> <1282291669-25709-17-git-send-email-zamsden@redhat.com> Mime-Version: 1.0 Content-Type: multipart/signed; boundary="nextPart5723555.RdIKnqArT4"; protocol="application/pgp-signature"; micalg=pgp-sha1 Content-Transfer-Encoding: 7bit Cc: kvm@vger.kernel.org, Avi Kivity , Marcelo Tosatti , Glauber Costa , Thomas Gleixner , John Stultz To: Zachary Amsden Return-path: Received: from mail.univention.de ([82.198.197.8]:1158 "EHLO mail.univention.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754165Ab1IBSea (ORCPT ); Fri, 2 Sep 2011 14:34:30 -0400 In-Reply-To: <1282291669-25709-17-git-send-email-zamsden@redhat.com> Sender: kvm-owner@vger.kernel.org List-ID: --nextPart5723555.RdIKnqArT4 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Content-Disposition: inline Hello, there have been serveral reports on kvm-devel about problems with kvm-clock= =2E I=20 can reproduce this 100%. 1. doesn't depend on guest kernel eing 2.6.32 or 3.0.1 2. qemu-0.14.1 is broken, qemu-0.15.0 works. 3. host-kernel 2.6.32.40 is broken, 3.0.0 works. So currently I have found two solutions: either upgrade qemu-kvm or the=20 host-kernel, which (for me) both are no options. I tracked it down to this patch: If I revert it, the VM starts fine. My current procedure to reproduce this problem is like this: 1. boot into guest kernel 2. reboot from within guest 3. on the second boot, the VM crawls very slowly. The kernel-time printed a= re=20 roughly the same numbers as on the first boot, but they don't match=20 wall-clock: The kernel-time claims to be 10 s from uptime, but real-time is= =20 more like 42 s. If the boot process makes it as far as a command line,=20 runnign a "sleep 1" takes much more than one second. Putting in some prink(..., max_kernel_ns, kernel_ns) I've ovserved that dur= ing=20 the first boot, I get 10k calles to that function a second, on the seond bo= ot=20 its down to 10-20 a second. Also max_kernel_ns is way larger than kernel_ns: 18:27:17 [23755.005941] 6148360456312778392 23755005912468 This patch was back-ported to 2.6.32.40 but it looks like some other=20 infrastructure might have changed, so it doen't work as expected. And idee on how to proceed? On Friday 20 August 2010 10:07:30 Zachary Amsden wrote: > Kernel time, which advances in discrete steps may progress much slower > than TSC. As a result, when kvmclock is adjusted to a new base, the > apparent time to the guest, which runs at a much higher, nsec scaled > rate based on the current TSC, may have already been observed to have > a larger value (kernel_ns + scaled tsc) than the value to which we are > setting it (kernel_ns + 0). > > We must instead compute the clock as potentially observed by the guest > for kernel_ns to make sure it does not go backwards. > > Signed-off-by: Zachary Amsden > --- > arch/x86/include/asm/kvm_host.h | 2 + > arch/x86/kvm/x86.c | 43 > +++++++++++++++++++++++++++++++++++++- 2 files changed, 43 insertions(+),= 2 > deletions(-) > > diff --git a/arch/x86/include/asm/kvm_host.h > b/arch/x86/include/asm/kvm_host.h index 324e892..871800d 100644 > --- a/arch/x86/include/asm/kvm_host.h > +++ b/arch/x86/include/asm/kvm_host.h > @@ -339,6 +339,8 @@ struct kvm_vcpu_arch { > unsigned int time_offset; > struct page *time_page; > u64 last_host_tsc; > + u64 last_guest_tsc; > + u64 last_kernel_ns; > > bool nmi_pending; > bool nmi_injected; > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index 1948c36..fe74b42 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -976,14 +976,15 @@ static int kvm_write_guest_time(struct kvm_vcpu *v) > struct kvm_vcpu_arch *vcpu =3D &v->arch; > void *shared_kaddr; > unsigned long this_tsc_khz; > - s64 kernel_ns; > + s64 kernel_ns, max_kernel_ns; > + u64 tsc_timestamp; > > if ((!vcpu->time_page)) > return 0; > > /* Keep irq disabled to prevent changes to the clock */ > local_irq_save(flags); > - kvm_get_msr(v, MSR_IA32_TSC, &vcpu->hv_clock.tsc_timestamp); > + kvm_get_msr(v, MSR_IA32_TSC, &tsc_timestamp); > kernel_ns =3D get_kernel_ns(); > this_tsc_khz =3D __get_cpu_var(cpu_tsc_khz); > local_irq_restore(flags); > @@ -993,13 +994,49 @@ static int kvm_write_guest_time(struct kvm_vcpu *v) > return 1; > } > > + /* > + * Time as measured by the TSC may go backwards when resetting the base > + * tsc_timestamp. The reason for this is that the TSC resolution is > + * higher than the resolution of the other clock scales. Thus, many > + * possible measurments of the TSC correspond to one measurement of any > + * other clock, and so a spread of values is possible. This is not a > + * problem for the computation of the nanosecond clock; with TSC rates > + * around 1GHZ, there can only be a few cycles which correspond to one > + * nanosecond value, and any path through this code will inevitably > + * take longer than that. However, with the kernel_ns value itself, > + * the precision may be much lower, down to HZ granularity. If the > + * first sampling of TSC against kernel_ns ends in the low part of the > + * range, and the second in the high end of the range, we can get: > + * > + * (TSC - offset_low) * S + kns_old > (TSC - offset_high) * S + kns_new > + * > + * As the sampling errors potentially range in the thousands of cycles, > + * it is possible such a time value has already been observed by the > + * guest. To protect against this, we must compute the system time as > + * observed by the guest and ensure the new system time is greater. > + */ > + max_kernel_ns =3D 0; > + if (vcpu->hv_clock.tsc_timestamp && vcpu->last_guest_tsc) { > + max_kernel_ns =3D vcpu->last_guest_tsc - > + vcpu->hv_clock.tsc_timestamp; > + max_kernel_ns =3D pvclock_scale_delta(max_kernel_ns, > + vcpu->hv_clock.tsc_to_system_mul, > + vcpu->hv_clock.tsc_shift); > + max_kernel_ns +=3D vcpu->last_kernel_ns; > + } > + > if (unlikely(vcpu->hw_tsc_khz !=3D this_tsc_khz)) { > kvm_set_time_scale(this_tsc_khz, &vcpu->hv_clock); > vcpu->hw_tsc_khz =3D this_tsc_khz; > } > > + if (max_kernel_ns > kernel_ns) > + kernel_ns =3D max_kernel_ns; > + > /* With all the info we got, fill in the values */ > + vcpu->hv_clock.tsc_timestamp =3D tsc_timestamp; > vcpu->hv_clock.system_time =3D kernel_ns + v->kvm->arch.kvmclock_offset; > + vcpu->last_kernel_ns =3D kernel_ns; > vcpu->hv_clock.flags =3D 0; > > /* > @@ -4931,6 +4968,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) > if (hw_breakpoint_active()) > hw_breakpoint_restore(); > > + kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc); > + > atomic_set(&vcpu->guest_mode, 0); > smp_wmb(); > local_irq_enable(); Sincerely Philipp =2D-=20 Philipp Hahn Open Source Software Engineer hahn@univention.de Univention GmbH Linux for Your Business fon: +49 421 22 232- 0 Mary-Somerville-Str.1 D-28359 Bremen fax: +49 421 22 232-99 http://www.univention.de/ =2D------------------------------------------------------------------------= =2D-- Treffen Sie Univention auf der IT&Business vom 20. bis 22. September 2011 auf dem Gemeinschaftsstand der Open Source Business Alliance in Stuttgart in Halle 3 Stand 3D27-7. --nextPart5723555.RdIKnqArT4 Content-Type: application/pgp-signature; name=signature.asc Content-Description: This is a digitally signed message part. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.9 (GNU/Linux) iEYEABECAAYFAk5hIaQACgkQYPlgoZpUDjklEQCdGwfYvZSS0+2vc1Nw6su/4zeZ SXQAoIoBWBEmO7oZZGzJfQPSzXHfhDkN =aet6 -----END PGP SIGNATURE----- --nextPart5723555.RdIKnqArT4--