Re: [PATCH kvm-unit-tests] KVM: x86: add hyperv clock test case

From: Roman Kagan <rkagan@virtuozzo.com>
To: Paolo Bonzini <pbonzini@redhat.com>
Cc: <kvm@vger.kernel.org>, Andrey Smetanin <asmetanin@virtuozzo.com>,
	"Denis V. Lunev" <den@openvz.org>,
	Marcelo Tosatti <mtosatti@redhat.com>
Subject: Re: [PATCH kvm-unit-tests] KVM: x86: add hyperv clock test case
Date: Thu, 21 Apr 2016 20:01:58 +0300	[thread overview]
Message-ID: <20160421170157.GA16360@rkaganb.sw.ru> (raw)
In-Reply-To: <56B22CB4.9090404@redhat.com>

On Wed, Feb 03, 2016 at 05:37:08PM +0100, Paolo Bonzini wrote:
> On 28/01/2016 17:22, Roman Kagan wrote:
> > On Thu, Jan 28, 2016 at 03:04:58PM +0100, Paolo Bonzini wrote:
> >> The test checks the relative precision of the reference TSC page
> >> and the time reference counter.
> >>
> >> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> >> ---
> >> 	Andrey, the test has a relative error of approximately 3 parts
> >> 	per million on my machine.  In other words it drifts by 3
> >> 	microseconds every second, which I don't think is acceptable.
> >> 	The problem is that virtual_tsc_khz is 2593993 while the actual
> >> 	frequency is more like 2594001 kHz.
> > 
> > Hmm, how come?  I thought virtual_tsc_khz could only diverge from
> > tsc_khz on migration.
> > 
> > [Or maybe I just misunderstand what you mean by "the actual frequency".]
> 
> Talking to Marcelo helped me understanding it. :)
> 
> The issue is that the TSC kHz is not the correct value to use for the  
> the TSC page.  Instead, we want to pass to the guest a scale value 
> corresponding to the kernel's timekeeping multiplier.  This is what 
> both kvmclock and the time reference counter use.
> 
> The bit that I was missing last week was how the guest could perceive
> updates to the TSC page parameters as atomic.  We actually _can_
> emulate seqlock-like behavior in Hyper-V by writing 0 to seq during
> an update.  Instead of looping like seqlocks do, the guest will simply
> use the time reference counter and take a hypervisor exit.  The result
> however is still valid, because we want the time reference counter to
> be perfectly synchronized with the Hyper-V clock and lets you handle
> any kvmclock update scenario safely.
> 
> Therefore the algorithm would be like patch 2/2 I sent, but with
> important differences.  You'd still
> 
> 	if (hv->hv_tsc_page & HV_X64_MSR_TSC_REFERENCE_ENABLE)
> 		kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);
> 
> on MSR writes, but then hook into kvm_guest_time_update rather than
> kvm_gen_update_masterclock, like:
> 
> 	if (v == kvm_get_vcpu(v->kvm, 0))
> 	        kvm_write_hyperv_tsc_page(v->kvm, &vcpu->hv_clock);
> 
> and in kvm_write_hyperv_tsc_page the calculation would be based on
> the kvmclock parameters:
> 
>         {
>                 write 0 to seq;
>                 if (!(hv_clock->flags & PVCLOCK_TSC_STABLE_BIT))
>                         return;
> 
>                 compute scale and offset from hv_clock mul and shift;
>                 write scale and offset;
> 		write sequence
>         }
> 
> all of scale, offset and sequence can be computed from kvmclock parameters.
> 
> For sequence we have to convert "even values between 0 and 0xFFFFFFFE"
> into "values between "1 and 0xFFFFFFFE".  For example:
> 
> 	hyper-v sequence = (kvmclock sequence >> 1) + 1
> 
> will do it.  Computing the scale and offset is something like:
> 
>         // KVM formula:
> 	//    nsec = (ticks - tsc_timestamp) * tsc_to_system_mul << tsc_shift + system_time
> 	//
> 	// hyper-v formula:
> 	//    nsec/100 = ticks * scale / 2^32 + offset
> 	//
> 	// when tsc_timestamp, offset is zero so remove them both:
>         //    ticks * tsc_to_system_mul << tsc_shift / 100 = ticks * scale / 2^32
> 	//
> 	// multiply both sides by 2^32 / ticks and you get scale:
> 	//    scale = tsc_to_system_mul << (32 + tsc_shift) / 100
> 	//
> 	// check if it would overflow, and then use the time ref counter
> 	//    tsc_to_system_mul << (32 + tsc_shift) / 100 >= 2^32
> 	//    tsc_to_system_mul << 32 >= 100 * 2^32 << -tsc_shift
> 	//    tsc_to_system_mul >= 100 << -tsc_shift
> 
> 	if (shift < 0)
> 		rhs = 100 << -shift;
> 	else
> 		rhs = 100 >> shift;
> 
> 	if (tsc_to_system_mul >= rhs)
> 		return;
> 
>         scale = mul_u64_u32_div(1ULL << (32 + tsc_shift), tsc_to_system_mul, 100);
> 
>         // now expand the kvmclock formula:
>         //    nsec = ticks * tsc_to_system_mul << tsc_shift - (tsc_timestamp * tsc_to_system_mul << tsc_shift) + system_time 
> 	// divide by 100:
> 	//    nsec/100 = ticks * tsc_to_system_mul << tsc_shift /100 - (tsc_timestamp * tsc_to_system_mul << tsc_shift) /100 + system_time /100
> 	// replace tsc_to_system_mul << tsc_shift /100 by scale / 2^32:
>         //    nsec/100 = ticks * scale / 2^32 - (tsc_timestamp * scale / 2^32) + system_time / 100
> 	// comparing with the Hyper-V formula:
> 	//    offset = system_time / 100 - (tsc_timestamp * scale / 2^32)
> 
> 	offset = system_time;
> 	do_div(offset, 100);
> 
> 	timestamp_offset = tsc_timestamp;
> 	do_div32_shl32(timestamp_offset, scale);
>         offset = offset - timestamp_offset;
> 
> The remaining part is _further_ adjusting the offset to
> Would you like to finish implementing this and test it with the unit test?
> kvm/queue already has the patch to introduce do_div32_shl32.

At last I've got some time to tackle this.  I cooked up a patch to do
approximately this (fixing the math and the logic in a few places), and
adjusted the test to build, and now I'm about to submit both of them.

As to the drift problem, I reliably reproduce it, too, but it's a bug
somewhere in kvm_clock, and the new hyperv ref tsc page now demonstrates
it by virtue of relying on kvm_clock's values.  On my machine I observe
as much as +14 ppm and I'll try to chase it down, but this is
independent of the hyperv ref tsc page patchset.

Roman.