All of lore.kernel.org
 help / color / mirror / Atom feed
From: Marcelo Tosatti <mtosatti@redhat.com>
To: "Radim Krčmář" <rkrcmar@redhat.com>
Cc: linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
	Paolo Bonzini <pbonzini@redhat.com>,
	Luiz Capitulino <lcapitulino@redhat.com>
Subject: Re: [RFC PATCH 0/2] kvmclock: fix ABI breakage from PVCLOCK_COUNTS_FROM_ZERO.
Date: Mon, 21 Sep 2015 19:37:14 -0300	[thread overview]
Message-ID: <20150921223713.GA21139@amt.cnet> (raw)
In-Reply-To: <20150921220039.GC2734@potion.brq.redhat.com>

On Tue, Sep 22, 2015 at 12:00:39AM +0200, Radim Krčmář wrote:
> 2015-09-21 17:53-0300, Marcelo Tosatti:
> > On Mon, Sep 21, 2015 at 10:00:27PM +0200, Radim Krčmář wrote:
> >> 2015-09-21 12:52-0300, Marcelo Tosatti:
> >> > On Mon, Sep 21, 2015 at 05:12:10PM +0200, Radim Krčmář wrote:
> >> >> 2015-09-20 19:57-0300, Marcelo Tosatti:
> >> >>> Is it counting from zero that breaks SLES10?
> >> >> 
> >> >> Not by itself, treating MSR_KVM_SYSTEM_TIME as one-shot initializer did.
> >> >> The guest wants to write to MSR_KVM_SYSTEM_TIME as much as it likes to,
> >> >> while still keeping system time;  we used to allow that, which means an
> >> >> ABI breakage.  (And we can't even say that guest's behaviour is against
> >> >> the spec ...)
> >> > 
> >> > Because this behaviour was not defined.
> >> 
> >> It is defined by implementation.
> >> 
> >> > Can't you just condition PVCLOCK_COUNTS_FROM_ZERO behaviour on
> >> > boot_vcpu_runs_old_kvmclock == false? 
> >> > The patch would be much simpler.
> >> 
> >> If you mean the hunk in cover letter, I don't like it because we presume
> >> that no other guests were broken.
> > 
> > Yes patch 1.
> 
> (I'd call them: a hunk in cover letter [0/2], patch 1 = [1/2], and
>  patch 2 = [2/2].)
> 
> > Do you have an example of another guest that is broken? 
> 
> Yes, the hot-plug in any relatively recent Linux guest.
> 
> > Really, assuming things about the value of the msr read when you
> > write to the MSR is very awkward. That SuSE code is incorrect, it fails
> > on other occasions as well (see
> > 54750f2cf042c42b4223d67b1bd20138464bde0e).
> 
> Yeah, I even read the SUSE implementation, sad times.
> 
> >> I really don't like it 
> > 
> > Because it changes behaviour that was previously unspecified?
> 
> No, because it adds another exemption to already ugly code.
> (Talking about the hunk in [0/2].)
> 
> >> so I thought about other problems with
> >> PVCLOCK_COUNTS_FROM_ZERO ... have you tried to hot-replug VCPU 0?
> > 
> > Can't unplug VCPU 0 i think.
> 
> You can.
> 
> >> It doesn't work well ;)
> > 
> > Can you hot-unplug vcpu 0? 
> 
> I can, I did, hot-plug bugged.

Due to PVCLOCK_COUNTS_FROM_ZERO patch?  

> >> > The problem is, "selecting one read as the initial point" is inherently
> >> > racy: that delta is relative to one moment (kvmclock read) at one vcpu,
> >> > but must be applied to all vcpus.
> >> 
> >> I don't think that is a problem.
> >> 
> >> Kvmclock has a notion of a global system_time in nanoseconds (one value
> >> that defines the time, assigned with VM ioctl KVM_SET_CLOCK) and tries
> >> to propagate system_time into guest VCPUs as precisely as possible with
> >> the help of TSC.
> >
> >Different pairs of values (system_time, tsc reads) are fundamentally
> >broken. This is why 
> >
> >commit 489fb490dbf8dab0249ad82b56688ae3842a79e8
> >    x86, paravirt: Add a global synchronization point for pvclock
> >
> >Exists.
> >
> >Then to guarantee monotonicity you need to stop those updates
> >(or perform the pair update in lock step):
> >
> >commit d828199e84447795c6669ff0e6c6d55eb9beeff6
> >    KVM: x86: implement PVCLOCK_TSC_STABLE_BIT pvclock flag
> 
> (Thanks for the commits, split values are the core design that avoids
>  modifying rdtsc, 

Which is broken (thats the point).

> so I'll be grad to read its details.)
> 
> >> > 	2) You rely on monotonicity across vcpus to perform 
> >> > 	   the 'minus delta that was read on vcpu0' calculation, but 
> >> > 	   monotonicity across vcpus can fail during runtime
> >> >            (say host clocksource goes tsc->hpet due to tsc instability).
> >> 
> >> That could be a problem, but I'm adding a VCPU independent constant to
> >> all reads -- does the new code rely on monoticity in places where the
> >> old one didn't?
> > 
> > The problem is overflow:
> > kvmclock non-monotonic across vcpus AND delta subtraction:
> > 
> > ptime | vcpu0kvmclock time | vcpu0sched_clock | vcpu1kvmclock time | vcpu1sched_clock
> > 1          1                      1                         
> > 2          2                      2                        1                
> 
> KVM sched clock not used before this point (I even moved the code to
> make sure), so there is no problem with monotonicity.
> 
> > 3          3                      0                        2            -1
>                                                                           ^^^
> -1 is the overflow.  Very unlikely to ever happen in Linux, as we now
> boot other VCPUs much later than the KVM clock configuration and it can
> only happen if VCPU1 is running as the reconfiguration takes place, but
> a simple
> 
>   if (vcpu[x].sched_clock <= global_sched_clock_offset)
>   	return 0;
> 
> would take care of it.  The time would stand still for a while, which is
> not a huge problem for boot-only scenario.  

Look at 

commit 489fb490dbf8dab0249ad82b56688ae3842a79e8
    x86, paravirt: Add a global synchronization point for pvclock

And think of what an overflow does.

Note: i tried your solution before (to add offset) and saw this issue
in practice.

> Other VCPUs couldn't be
> reading KVM sched before it was configured, so only operations that
> happen before vcpu1sched_clock goes positive are affected.
> (We have other problems if the window can be long.)

The point where vcpu1sched_clock is negative is after kvmclock is
initialized.

> 
> > 4          4                      1                        3             0
> > 5          5                      2                        4             1
> > 6          6                      3                        5             2
> > 7          7                      4                        6             3
> > 
> > ptime is "physical time".
> > 
> > I prefer that the host counts from zero (there are less problems to
> > handle).
> 
> The main advantage of a hypervisor solution for me is one saved
> subtraction and comparison on every sched_clock().

To me are such complications as handling overflows.

> > Again, that SuSE patch is incorrect and a huge hack.
> 
> I'm not disputing that.  (Which doesn't justify the breakage.)
> 
> > An alternative is to enable PVCLOCK_COUNTS_FROM_ZERO _if_ the guest
> > requests the feature.
> 
> Yes, and the alternative needs new paravirt interface.

So either:

Proceed with guest solution:
-> Make sure the overflow can't happen (and write down why not in the
code). Don't assume a small delta between kvmclock values of vcpus.
-> Handle stable -> non-stable kvmclock transition.
-> kvmclock counts from zero should not depend on stable kvmclock
(because nohz_full should work on hpet host systems).

Enable counts-from-zero on MSR_KVM_SYSTEM_TIME_NEW:
-> Figure out whats wrong with different kvmclock values on hotplug, 
and fix it.



  reply	other threads:[~2015-09-21 22:37 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-09-18 15:54 [RFC PATCH 0/2] kvmclock: fix ABI breakage from PVCLOCK_COUNTS_FROM_ZERO Radim Krčmář
2015-09-18 15:54 ` [PATCH 1/2] x86: kvmclock: abolish PVCLOCK_COUNTS_FROM_ZERO Radim Krčmář
2015-09-22 19:01   ` Marcelo Tosatti
2015-09-28 14:10   ` Paolo Bonzini
2015-09-18 15:54 ` [PATCH 2/2] Revert "KVM: x86: zero kvmclock_offset when vcpu0 initializes kvmclock system MSR" Radim Krčmář
2015-09-22 19:01   ` Marcelo Tosatti
2015-09-22 19:52     ` Paolo Bonzini
2015-09-22 20:23       ` Marcelo Tosatti
2015-09-20 22:57 ` [RFC PATCH 0/2] kvmclock: fix ABI breakage from PVCLOCK_COUNTS_FROM_ZERO Marcelo Tosatti
2015-09-21 15:12   ` Radim Krčmář
2015-09-21 15:43     ` Radim Krčmář
2015-09-21 15:52     ` Marcelo Tosatti
2015-09-21 20:00       ` Radim Krčmář
2015-09-21 20:53         ` Marcelo Tosatti
2015-09-21 22:00           ` Radim Krčmář
2015-09-21 22:37             ` Marcelo Tosatti [this message]
2015-09-22  0:40               ` Marcelo Tosatti
2015-09-22 14:33               ` Radim Krčmář
2015-09-22 14:46               ` Radim Krčmář
2015-09-28 11:05 ` Paolo Bonzini

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20150921223713.GA21139@amt.cnet \
    --to=mtosatti@redhat.com \
    --cc=kvm@vger.kernel.org \
    --cc=lcapitulino@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=pbonzini@redhat.com \
    --cc=rkrcmar@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.