Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm

From: Oliver Upton <oupton@google.com>
To: David Woodhouse <dwmw2@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>,
	kvm@vger.kernel.org, Sean Christopherson <seanjc@google.com>,
	Vitaly Kuznetsov <vkuznets@redhat.com>,
	Wanpeng Li <wanpengli@tencent.com>,
	Jim Mattson <jmattson@google.com>, Joerg Roedel <joro@8bytes.org>
Subject: Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
Date: Mon, 21 Mar 2022 00:38:21 +0000	[thread overview]
Message-ID: <YjfI/Sl3lFEFOIWc@google.com> (raw)
In-Reply-To: <1680281fee4384d27bd97dba117f391a.squirrel@twosheds.infradead.org>

On Sun, Mar 20, 2022 at 09:46:35AM -0000, David Woodhouse wrote:
> 
> > The offset interface completely punts the decision around guest clocks
> > to userspace. We (KVM) have absolutely no idea what userspace is about
> > to do with the guest. The guest could be paused for 5 seconds or 5
> > years. Encouraging host userspace to just read/write a { TOD, TSC } pair
> > and let KVM do the heavy lifting could completely wreck the guest's
> > monotonic clock.
> >
> > Additionally, it is impossible for userspace to enforce policy/limits on
> > how much to time travel a guest with a value-based interface. Any event
> > could sneak in between the time userspace checks the value and KVM sets
> > the L1 offset. Offsets are idempotent and will still uphold userspace's
> > intentions even if an inordinate amount of time elapses until KVM
> > processes it.
> 
> Thanks for the detailed explanation. One part which confuses me here...
> Why can't userspace impose those same limits using a (TOD, value) tuple?
> Userspace can still look at that TOD from before the brownout period
> started, and declare that is too far in the past.
> 
> If the event happens *after* userspace has decided that the migration was
> quick enough, but before the vCPUs are actually running again, even the
> offset based interface doesn't protect against that.

Right, if nastiness happens after the offset was calculated, the guest
is still going to feel it. The benefit is that we've at least stopped
the bleeding on the monotonic clock. Guests of KVM are probably happier
having a messed up realtime clock instead of a panicked kernel that
threw a fit over monotonic stepping too far.

> > Apologies for grandstanding, but clocks has been a real source of pain
> > during migration. I do agree that the documented algorithm is a mess at
> > the moment, given that there's no good way for userspace to transform
> > host_tsc -> guest_tsc. Poking the host TSC frequency out in sysfs is
> > nice to have, but probably not ABI to hang this whole thing off of.
> >
> > What do you folks think about having a new R/O vCPU attribute that
> > returns a { TOD, guest_tsc } pair? I believe that would immediately
> > satisfy the needs of upstream to implement clock-advancing live
> > migration.
> 
> Hm, I need to do some more thinking here. I poked at this because for TSC
> scaling even before we think about clock jumps it was just utterly hosed —
> userspace naively just creates a bunch of vCPUs and sets their TSC
> frequency + value, and they all end up with unsynced TSC values.

And thank you for addressing that. I think there is a broader
generalization that can be made about guest timekeeping that you've
started poking at with that patch set, which is that we should only
really care about these at a VM scope. There's nothing we can do
to cover up broken hardware, so if the host's TSCs are out of whack then
oh well.

To that end, we could have a single host<->guest offset that is used
across all vCPUs. Guest fiddling with TSC_ADJUST or even worse direct
writes to the TSC can then be treated as solely a part of guest state,
and get added on top of the host<->guest offset.

> But coincidentally since then I have started having conversations with
> people who really want the guest to have an immediate knowledge of the
> adjtimex maxerror etc. on the new host immediately after the migration.
> Maybe the "if the migration isn't fast enough then let the guest know it's
> now unsynced" is OK, but I'll need to work out what "immediately" means
> when we have a guest userspace component involved in it.

This has also been an area of interest to me. I think we've all seen the
many ways in which doing migrations behind the guest's can put software
in an extremely undesirable state on the other end. If those
conversations are taking place on the mailing lists, could you please CC
me?

Our (Google) TSC adjustment clamping and userspace notification mechanism
was a halfway kludge to keep things happy on the other end. And it
generally has worked well, but misses a fundamental point.

The hypervisor should tell the guest kernel about time travel and let it
cascade that information throughout the guest system. Regardless of what
we do to the TSC, we invariably destroy one of the two guest clocks along
the way. If we told the guest "you time traveled X seconds", it could
fold that into its own idea of real time. Guest kernel can then fire off
events to inform software that wants to keep up with clock changes, and
even a new event to let NTP know its probably running on different
hardware.

Time sucks :-)

--
Thanks,
Oliver