Re: [RFC PATCH 0/2] kvmclock: fix ABI breakage from PVCLOCK_COUNTS_FROM_ZERO.

From: Marcelo Tosatti <mtosatti@redhat.com>
To: "Radim Krčmář" <rkrcmar@redhat.com>
Cc: linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
	Paolo Bonzini <pbonzini@redhat.com>,
	Luiz Capitulino <lcapitulino@redhat.com>
Subject: Re: [RFC PATCH 0/2] kvmclock: fix ABI breakage from PVCLOCK_COUNTS_FROM_ZERO.
Date: Mon, 21 Sep 2015 19:37:14 -0300	[thread overview]
Message-ID: <20150921223713.GA21139@amt.cnet> (raw)
In-Reply-To: <20150921220039.GC2734@potion.brq.redhat.com>

On Tue, Sep 22, 2015 at 12:00:39AM +0200, Radim Krčmář wrote:
> 2015-09-21 17:53-0300, Marcelo Tosatti:
> > On Mon, Sep 21, 2015 at 10:00:27PM +0200, Radim Krčmář wrote:
> >> 2015-09-21 12:52-0300, Marcelo Tosatti:
> >> > On Mon, Sep 21, 2015 at 05:12:10PM +0200, Radim Krčmář wrote:
> >> >> 2015-09-20 19:57-0300, Marcelo Tosatti:
> >> >>> Is it counting from zero that breaks SLES10?
> >> >> 
> >> >> Not by itself, treating MSR_KVM_SYSTEM_TIME as one-shot initializer did.
> >> >> The guest wants to write to MSR_KVM_SYSTEM_TIME as much as it likes to,
> >> >> while still keeping system time;  we used to allow that, which means an
> >> >> ABI breakage.  (And we can't even say that guest's behaviour is against
> >> >> the spec ...)
> >> > 
> >> > Because this behaviour was not defined.
> >> 
> >> It is defined by implementation.
> >> 
> >> > Can't you just condition PVCLOCK_COUNTS_FROM_ZERO behaviour on
> >> > boot_vcpu_runs_old_kvmclock == false? 
> >> > The patch would be much simpler.
> >> 
> >> If you mean the hunk in cover letter, I don't like it because we presume
> >> that no other guests were broken.
> > 
> > Yes patch 1.
> 
> (I'd call them: a hunk in cover letter [0/2], patch 1 = [1/2], and
>  patch 2 = [2/2].)
> 
> > Do you have an example of another guest that is broken? 
> 
> Yes, the hot-plug in any relatively recent Linux guest.
> 
> > Really, assuming things about the value of the msr read when you
> > write to the MSR is very awkward. That SuSE code is incorrect, it fails
> > on other occasions as well (see
> > 54750f2cf042c42b4223d67b1bd20138464bde0e).
> 
> Yeah, I even read the SUSE implementation, sad times.
> 
> >> I really don't like it 
> > 
> > Because it changes behaviour that was previously unspecified?
> 
> No, because it adds another exemption to already ugly code.
> (Talking about the hunk in [0/2].)
> 
> >> so I thought about other problems with
> >> PVCLOCK_COUNTS_FROM_ZERO ... have you tried to hot-replug VCPU 0?
> > 
> > Can't unplug VCPU 0 i think.
> 
> You can.
> 
> >> It doesn't work well ;)
> > 
> > Can you hot-unplug vcpu 0? 
> 
> I can, I did, hot-plug bugged.

Due to PVCLOCK_COUNTS_FROM_ZERO patch?  

> >> > The problem is, "selecting one read as the initial point" is inherently
> >> > racy: that delta is relative to one moment (kvmclock read) at one vcpu,
> >> > but must be applied to all vcpus.
> >> 
> >> I don't think that is a problem.
> >> 
> >> Kvmclock has a notion of a global system_time in nanoseconds (one value
> >> that defines the time, assigned with VM ioctl KVM_SET_CLOCK) and tries
> >> to propagate system_time into guest VCPUs as precisely as possible with
> >> the help of TSC.
> >
> >Different pairs of values (system_time, tsc reads) are fundamentally
> >broken. This is why 
> >
> >commit 489fb490dbf8dab0249ad82b56688ae3842a79e8
> >    x86, paravirt: Add a global synchronization point for pvclock
> >
> >Exists.
> >
> >Then to guarantee monotonicity you need to stop those updates
> >(or perform the pair update in lock step):
> >
> >commit d828199e84447795c6669ff0e6c6d55eb9beeff6
> >    KVM: x86: implement PVCLOCK_TSC_STABLE_BIT pvclock flag
> 
> (Thanks for the commits, split values are the core design that avoids
>  modifying rdtsc, 

Which is broken (thats the point).

> so I'll be grad to read its details.)
> 
> >> > 	2) You rely on monotonicity across vcpus to perform 
> >> > 	   the 'minus delta that was read on vcpu0' calculation, but 
> >> > 	   monotonicity across vcpus can fail during runtime
> >> >            (say host clocksource goes tsc->hpet due to tsc instability).
> >> 
> >> That could be a problem, but I'm adding a VCPU independent constant to
> >> all reads -- does the new code rely on monoticity in places where the
> >> old one didn't?
> > 
> > The problem is overflow:
> > kvmclock non-monotonic across vcpus AND delta subtraction:
> > 
> > ptime | vcpu0kvmclock time | vcpu0sched_clock | vcpu1kvmclock time | vcpu1sched_clock
> > 1          1                      1                         
> > 2          2                      2                        1                
> 
> KVM sched clock not used before this point (I even moved the code to
> make sure), so there is no problem with monotonicity.
> 
> > 3          3                      0                        2            -1
>                                                                           ^^^
> -1 is the overflow.  Very unlikely to ever happen in Linux, as we now
> boot other VCPUs much later than the KVM clock configuration and it can
> only happen if VCPU1 is running as the reconfiguration takes place, but
> a simple
> 
>   if (vcpu[x].sched_clock <= global_sched_clock_offset)
>   	return 0;
> 
> would take care of it.  The time would stand still for a while, which is
> not a huge problem for boot-only scenario.  

Look at 

commit 489fb490dbf8dab0249ad82b56688ae3842a79e8
    x86, paravirt: Add a global synchronization point for pvclock

And think of what an overflow does.

Note: i tried your solution before (to add offset) and saw this issue
in practice.

> Other VCPUs couldn't be
> reading KVM sched before it was configured, so only operations that
> happen before vcpu1sched_clock goes positive are affected.
> (We have other problems if the window can be long.)

The point where vcpu1sched_clock is negative is after kvmclock is
initialized.

> 
> > 4          4                      1                        3             0
> > 5          5                      2                        4             1
> > 6          6                      3                        5             2
> > 7          7                      4                        6             3
> > 
> > ptime is "physical time".
> > 
> > I prefer that the host counts from zero (there are less problems to
> > handle).
> 
> The main advantage of a hypervisor solution for me is one saved
> subtraction and comparison on every sched_clock().

To me are such complications as handling overflows.

> > Again, that SuSE patch is incorrect and a huge hack.
> 
> I'm not disputing that.  (Which doesn't justify the breakage.)
> 
> > An alternative is to enable PVCLOCK_COUNTS_FROM_ZERO _if_ the guest
> > requests the feature.
> 
> Yes, and the alternative needs new paravirt interface.

So either:

Proceed with guest solution:
-> Make sure the overflow can't happen (and write down why not in the
code). Don't assume a small delta between kvmclock values of vcpus.
-> Handle stable -> non-stable kvmclock transition.
-> kvmclock counts from zero should not depend on stable kvmclock
(because nohz_full should work on hpet host systems).

Enable counts-from-zero on MSR_KVM_SYSTEM_TIME_NEW:
-> Figure out whats wrong with different kvmclock values on hotplug, 
and fix it.