Re: [PATCH v2 1/3] KVM: x86: implement KVM_{GET|SET}_TSC_STATE

From: Thomas Gleixner <tglx@linutronix.de>
To: Andy Lutomirski <luto@amacapital.net>,
	Maxim Levitsky <mlevitsk@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>,
	Marcelo Tosatti <mtosatti@redhat.com>,
	kvm@vger.kernel.org, "H. Peter Anvin" <hpa@zytor.com>,
	Jonathan Corbet <corbet@lwn.net>,
	Jim Mattson <jmattson@google.com>,
	Wanpeng Li <wanpengli@tencent.com>,
	"open list\:KERNEL SELFTEST FRAMEWORK" 
	<linux-kselftest@vger.kernel.org>,
	Vitaly Kuznetsov <vkuznets@redhat.com>,
	Sean Christopherson <sean.j.christopherson@intel.com>,
	open list <linux-kernel@vger.kernel.org>,
	Ingo Molnar <mingo@redhat.com>,
	"maintainer\:X86 ARCHITECTURE \(32-BIT AND 64-BIT\)"
	<x86@kernel.org>, Joerg Roedel <joro@8bytes.org>,
	Borislav Petkov <bp@alien8.de>, Shuah Khan <shuah@kernel.org>,
	Andrew Jones <drjones@redhat.com>,
	Oliver Upton <oupton@google.com>,
	"open list\:DOCUMENTATION" <linux-doc@vger.kernel.org>
Subject: Re: [PATCH v2 1/3] KVM: x86: implement KVM_{GET|SET}_TSC_STATE
Date: Thu, 10 Dec 2020 22:25:09 +0100	[thread overview]
Message-ID: <87360djqve.fsf@nanos.tec.linutronix.de> (raw)
In-Reply-To: <E4F263BE-6CAA-4152-8818-187D34D8D0FD@amacapital.net>

Andy,

On Thu, Dec 10 2020 at 07:16, Andy Lutomirski wrote:
>> On Dec 10, 2020, at 6:52 AM, Maxim Levitsky <mlevitsk@redhat.com> wrote:
>> On Thu, 2020-12-10 at 12:48 +0100, Paolo Bonzini wrote:
>>>> On 08/12/20 22:20, Thomas Gleixner wrote:
>>>> So now life migration comes a long time after timekeeping had set the
>>>> limits and just because it's virt it expects that everything works and it
>>>> just can ignore these limits.
>>>> 
>>>> TBH. That's not any different than SMM or hard/firmware taking the
>>>> machine out for lunch. It's exactly the same: It's broken.
>>> 
>>> I agree.  If *live* migration stops the VM for 200 seconds, it's broken.

I'm glad we are on the same page here.

>>> Sure, there's the case of snapshotting the VM over the weekend.  My 
>>> favorite solution would be to just put it in S3 before doing that.  *Do 
>>> what bare metal does* and you can't go that wrong.

:)

>> Note though that qemu has a couple of issues with s3, and it is disabled 
>> by default in libvirt. 
>> I would be very happy to work on improving this if there is a need for that.
>
> There’s also the case where someone has a VM running on a laptop and
> someone closes the lid. The host QEMU might not have a chance to
> convince the guest to enter S3.

But the host kernel can do something sensible before going off into lala
land. It knows that it is about to do that and it knows that there are
guests running.

>> I still think though that we should have a discussion on feasibility
>> of making the kernel time code deal with large *forward* tsc jumps 
>> without crashing.

I'm not opposed against that as I said before.

>> If that is indeed hard to do, or will cause performance issues,
>> then I agree that we might indeed inform the guest of time jumps instead.
>> 
>
> Tglx, even without fancy shared host/guest timekeeping, count the
> guest kernel manage to update its timekeeping if the host sent the
> guest an interrupt or NMI on all CPUs synchronously on resume?

Tell it before it takes a nap is simpler and does not require an NMI
which is horrible anyway because we can't do much in the NMI and
scheduling irq_work from NMI does not help either because the guest
could be in the middle of ... timekeeping. See below.

> Alternatively, if we had the explicit “max TSC value that makes sense
> right now” in the timekeeping data, the guest would reliably notice
> the large jump and could at least do something intelligent about it
> instead of overflowing its internal calculation.

Yes. We can do that and we should do that for robustness sake.

But, there is more than the robustness problem on the reader side, which
is trivial as we discussed in the other part of this thread already.

There is also the problem on the timekeeping core in various aspects
which need a very close look.

So there are various things to solve:

   1) Readerside delta limit

      Trivial to provide and trivial to implement for the VDSO because
      the VDSO just can loop forever.

      Not so trivial for kernel side usage due to the fact that being
      caught in a read can prevent timekeeping from being updated.

      Hint: NOHZ entry path. That would simply livelock and never reach
      the timekeeping code. Also any interrupt disabled region can cause
      that.

      I looked into using 128 bit math as well, but that only works for
      wide clock sources like TSC and needs to be conditional on 64bit
      as 32bit would really suffer badly in the hotpath and even on
      64bit it's measurable.

      So we could keep 64bit math, use the limit and if the delta is
      larger than the limit take a slowpath which does wider math.

      But that still needs thoughts about clocksources with smaller
      counterwidth and therefore a fast wraparound time.

      There is another issue with larger deltas. The extrapolation might
      be off and then cause other side effects like observable time
      going backwards etc.

      That has to be analyzed together with the writer side because
      that's where we need to ensure the continuity/monotonicity etc.

   2) Writer side issues

      The core timekeeping code is pretty robust against large deltas
      already, but there are some limitations nevertheless and it was
      obviously not designed to be taken out in the middle of the
      updates. Haven't wrapped my head around that yet.

      But there is more than the timekeeper update, there are other
      things like NTP, PTP and consumers of the more raw timekeeping
      internals which might get surprised by large deltas and run into
      similar problems because they were not designed to deal with that.

So yes, it can and should be done, but it's not a project for friday
afternoon and it's going to be hard to backport. I know that distro
people do not care because another 500 patches on their frankenkernels
are just noise, but it leaves everybody else out in the dark and I have
zero interest to proliferate that.

I'm still convinced that a notification about 'we take a nap' will be
more robust, less complex and more trivial to backport.

Both life migration and suspend on the host know it upfront which means
that not using this knowledge and instead of that trying to cure the
symptom is violating the basic engineering principles and TBH outright
stupid.

As I said before we have most of the bits and pieces in place and I'm
sure we can come up with an even simpler solution as the one I outlined
before and once that is solved (or in parallel) make the time keeping
more robust. 

Thanks,

        tglx