From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <kvm-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 8331CC433F5
	for <kvm@archiver.kernel.org>; Fri,  1 Oct 2021 21:03:53 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by mail.kernel.org (Postfix) with ESMTP id 6489561ABF
	for <kvm@archiver.kernel.org>; Fri,  1 Oct 2021 21:03:53 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S230052AbhJAVFh (ORCPT <rfc822;kvm@archiver.kernel.org>);
        Fri, 1 Oct 2021 17:05:37 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:33084 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S229568AbhJAVFf (ORCPT <rfc822;kvm@vger.kernel.org>);
        Fri, 1 Oct 2021 17:05:35 -0400
Received: from mail-lf1-x134.google.com (mail-lf1-x134.google.com [IPv6:2a00:1450:4864:20::134])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id C98DAC061775
        for <kvm@vger.kernel.org>; Fri,  1 Oct 2021 14:03:50 -0700 (PDT)
Received: by mail-lf1-x134.google.com with SMTP id i4so43952888lfv.4
        for <kvm@vger.kernel.org>; Fri, 01 Oct 2021 14:03:50 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=google.com; s=20210112;
        h=mime-version:references:in-reply-to:from:date:message-id:subject:to
         :cc;
        bh=US7THjs9q8+dC6lBnfN2W9f1a9lLVsXtHL76JHwld78=;
        b=LEHHXQQrqHueraWXN4QJAj9D834jvqNGLG7RicSBdHd9m2si5c+mnQn40cjGjWM3t4
         ZpFeMi4i5i3rpTiH3hSI/M/2Zm3I08Dm6clnPWH1yBvFjvLfu132ub7eTqs4YpM10V7a
         lTTc9ffz0ryonTNmL/nAJg8f1EBpmFlhmTGHuFMlm7picAw5bafsvoB1TQzM/Y7zt0jW
         wCpZYgm8Nt739gBMaAX8Yb06Sp1bUjR5lGxvvUFf9cGYRh6FYwCRxNfmUdDVpwvG6z7a
         8KT60fu7tEv8nCbjPx44+D+6j2dj0EzazJ3JRD49too+EAFLxHSIar0SaJvMdKCbvS39
         uoLg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20210112;
        h=x-gm-message-state:mime-version:references:in-reply-to:from:date
         :message-id:subject:to:cc;
        bh=US7THjs9q8+dC6lBnfN2W9f1a9lLVsXtHL76JHwld78=;
        b=igHjKQE1iVxDhDCzMAwqVUmjCiZnzFgiRMpbacFs43mruEVMa37BScXO+3bELJ3eYD
         rhIZt1UhHrCX6NwXRpM0T9B8fOJUAUOdciPxLOtvcuGrhrBM/yfT9YlWc0JBnHWC+3je
         RHQnWhZZlvhc8dKpyXOGtHBTpC8DBOpLYJqibPwX3yN8rAavY30UUmFGgKgKgq2J8Cjh
         YH3XtbAQW+6uE7tXNO0aB0ShFGwHW5bRAoDY+Z2qdDqCEGVc43R/nPLe9Wrxfxm6UVSN
         grSl4X3a+M9xFU9PIU08VFrFHz65+XdMGLaRN0RYENCR4ZGo+T4gk/7jy/F/P+Ek5LfL
         p52g==
X-Gm-Message-State: AOAM531wNPD/blADsIFy3wfEA3OkQuvd7SubbLxjfmlJzdXA4Ea4A1Xl
        2dIqKA6W4bkUdjDPstSgEGK3yjKMvMeKYmGek4/6/LtZLYT4qNLH
X-Google-Smtp-Source: ABdhPJznCcQb+ccwGsDLyERyb3W3qjFR7Oa5Aa/Z4+L5spmUTbANL6wSIlWbkIDv6JUkIXl+ke4fitW2VkDiMn4VEeU=
X-Received: by 2002:a2e:5c88:: with SMTP id q130mr189882ljb.170.1633122228750;
 Fri, 01 Oct 2021 14:03:48 -0700 (PDT)
MIME-Version: 1.0
References: <20210916181538.968978-1-oupton@google.com> <20210916181538.968978-5-oupton@google.com>
 <20210929185629.GA10933@fuller.cnet> <20210930192107.GB19068@fuller.cnet>
 <871r557jls.ffs@tglx> <20211001120527.GA43086@fuller.cnet> <87ilyg4iu9.ffs@tglx>
In-Reply-To: <87ilyg4iu9.ffs@tglx>
From:   Oliver Upton <oupton@google.com>
Date:   Fri, 1 Oct 2021 14:03:37 -0700
Message-ID: <CAOQ_Qsgq9t3syQkQaorszBvg07XOhOmWJeMF=bQN8HqMPo4ADQ@mail.gmail.com>
Subject: Re: [PATCH v8 4/7] KVM: x86: Report host tsc and realtime values in KVM_GET_CLOCK
To:     Thomas Gleixner <tglx@linutronix.de>
Cc:     Marcelo Tosatti <mtosatti@redhat.com>, kvm@vger.kernel.org,
        kvmarm@lists.cs.columbia.edu, Paolo Bonzini <pbonzini@redhat.com>,
        Sean Christopherson <seanjc@google.com>,
        Marc Zyngier <maz@kernel.org>, Peter Shier <pshier@google.com>,
        Jim Mattson <jmattson@google.com>,
        David Matlack <dmatlack@google.com>,
        Ricardo Koller <ricarkol@google.com>,
        Jing Zhang <jingzhangos@google.com>,
        Raghavendra Rao Anata <rananta@google.com>,
        James Morse <james.morse@arm.com>,
        Alexandru Elisei <Alexandru.Elisei@arm.com>,
        Suzuki K Poulose <suzuki.poulose@arm.com>,
        linux-arm-kernel@lists.infradead.org,
        Andrew Jones <drjones@redhat.com>,
        Will Deacon <will@kernel.org>,
        Catalin Marinas <catalin.marinas@arm.com>
Content-Type: text/plain; charset="UTF-8"
Precedence: bulk
List-ID: <kvm.vger.kernel.org>
X-Mailing-List: kvm@vger.kernel.org

Hi Thomas,

On Fri, Oct 1, 2021 at 12:59 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Marcelo,
>
> On Fri, Oct 01 2021 at 09:05, Marcelo Tosatti wrote:
> > On Fri, Oct 01, 2021 at 01:02:23AM +0200, Thomas Gleixner wrote:
> >> But even if that would be the case, then what prevents the stale time
> >> stamps to be visible? Nothing:
> >>
> >> T0:    t = now();
> >>          -> pause
> >>          -> resume
> >>          -> magic "fixup"
> >> T1:    dostuff(t);
> >
> > Yes.
> >
> > BTW, you could have a userspace notification (then applications
> > could handle this if desired).
>
> Well, we have that via timerfd with TFD_TIMER_CANCEL_ON_SET for
> CLOCK_REALTIME. That's what applications which are sensitive to clock
> REALTIME jumps use today.
>
> >>   Now the proposed change is creating exactly the same problem:
> >>
> >>   >> > +     if (data.flags & KVM_CLOCK_REALTIME) {
> >>   >> > +             u64 now_real_ns = ktime_get_real_ns();
> >>   >> > +
> >>   >> > +             /*
> >>   >> > +              * Avoid stepping the kvmclock backwards.
> >>   >> > +              */
> >>   >> > +             if (now_real_ns > data.realtime)
> >>   >> > +                     data.clock += now_real_ns - data.realtime;
> >>   >> > +     }
> >>
> >>   IOW, it takes the time between pause and resume into account and
> >>   forwards the underlying base clock which makes CLOCK_MONOTONIC
> >>   jump forward by exactly that amount of time.
> >
> > Well, it is assuming that the
> >
> >  T0:    t = now();
> >  T1:    pause vm()
> >  T2:  finish vm migration()
> >  T3:    dostuff(t);
> >
> > Interval between T1 and T2 is small (and that the guest
> > clocks are synchronized up to a given boundary).
>
> Yes, I understand that, but it's an assumption and there is no boundary
> for the time jump in the proposed patches, which rings my alarm bells :)
>
> > But i suppose adding a limit to the forward clock advance
> > (in the migration case) is useful:
> >
> >       1) If migration (well actually, only the final steps
> >          to finish migration, the time between when guest is paused
> >          on source and is resumed on destination) takes too long,
> >          then too bad: fix it to be shorter if you want the clocks
> >          to have close to zero change to realtime on migration.
> >
> >       2) Avoid the other bugs in case of large forward advance.
> >
> > Maybe having it configurable, with a say, 1 minute maximum by default
> > is a good choice?
>
> Don't know what 1 minute does in terms of applications etc. You have to
> do some experiments on that.

I debated quite a bit on what the absolute limit should be for
advancing the KVM clock, and settled on doing no checks in the kernel
besides the monotonicity invariant. End of the day, userspace can
ignore all of the rules that KVM will try to enforce on the kvm
clock/TSC and jump it as it sees fit (both are already directly
writable). But I agree that there has to be some reason around what is
acceptable. We have an absolute limit on how far forward we will yank
the KVM clock and TSC in our userspace, but of course it has a TOCTOU
problem for whatever madness can come in between userspace and the
time the kernel actually services the ioctl.

--
Thanks,
Oliver


> > An alternative would be to advance only the guests REALTIME clock, from
> > data about how long steps T1-T2 took.
>
> Yes, that's what would happen in the cooperative S2IDLE or S3 case when
> the guest resumes.
>
> >> So now virt came along and created a hard to solve circular dependency
> >> problem:
> >>
> >>    - If CLOCK_MONOTONIC stops for too long then NTP/PTP gets out of
> >>      sync, but everything else is happy.
> >>
> >>    - If CLOCK_MONOTONIC jumps too far forward, then all hell breaks
> >>      lose, but NTP/PTP is happy.
> >
> > One must handle the
> >
> >  T0:    t = now();
> >           -> pause
> >           -> resume
> >           -> magic "fixup"
> >  T1:    dostuff(t);
> >
> > fact if one is going to use savevm/restorevm anyway, so...
> > (it is kind of unfixable, unless you modify your application
> > to accept notifications to redo any computation based on t, isnt it?).
>
> Well yes, but what applications can deal with is CLOCK_REALTIME jumping
> because that's a property of it. Not so much the CLOCK_MONOTONIC part.
>
> >> If you decide that correctness is overrated, then please document it
> >> clearly instead of trying to pretend being correct.
> >
> > Based on the above, advancing only CLOCK_REALTIME (and not CLOCK_MONOTONIC)
> > would be correct, right? And its probably not very hard to do.
>
> Time _is_ hard to get right.
>
> So you might experiment with something like this as a stop gap:
>
>   Provide the guest something like this:
>
>           u64              migration_seq;
>           u64              realtime_delta_ns;
>
>   in the shared clock page
>
>   Do not forward jump clock MONOTONIC.
>
>   On resume kick an IPI where the guest handler does:
>
>          if (clock_data->migration_seq == migration_seq)
>                 return;
>
>          migration_seq = clock_data->migration_seq;
>
>          ts64 = { 0, 0 };
>          timespec64_add_ns(&ts64, clock_data->realtime_delta_ns);
>          timekeeping_inject_sleeptime64(&ts64);
>
>   Make sure that the IPI completes before you migrate the guest another
>   time or implement it slightly smarter, but you get the idea :)
>
> That's what we use for suspend time injection, but it should just work
> without frozen tasks as well. It will forward clock REALTIME by the
> amount of time spent during migration. It'll also modify the BOOTTIME
> offset by the same amount, but that's not a tragedy.
>
> The function will also reset NTP state so the NTP/PTP daemon knows that
> there was a kernel initiated time jump and it can work out easily what
> to do like it does on resume from an actual suspend. It will also
> invoke clock_was_set() which makes all the other time related updates
> trigger and wakeup tasks which have a timerfd with
> TFD_TIMER_CANCEL_ON_SET armed.
>
> This will obviously not work when the guest is in S2IDLE or S3, but for
> initial experimentation you can ignore that and just avoid to do that in
> the guest. :)
>
> That still is worse than a cooperative S2IDLE/S3, but it's way more
> sensible than the other two evils you have today.
>
> > Thanks very much for the detailed information! Its a good basis
> > for the document you ask.
>
> I volunteer to review that documentation once it materializes :)
>
> Thanks,
>
>         tglx

From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=fGD+=OV=lists.cs.columbia.edu=kvmarm-bounces@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 88505C433FE
	for <kvmarm@archiver.kernel.org>; Fri,  1 Oct 2021 21:03:54 +0000 (UTC)
Received: from mm01.cs.columbia.edu (mm01.cs.columbia.edu [128.59.11.253])
	by mail.kernel.org (Postfix) with ESMTP id 061AE61ABF
	for <kvmarm@archiver.kernel.org>; Fri,  1 Oct 2021 21:03:53 +0000 (UTC)
DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org 061AE61ABF
Authentication-Results: mail.kernel.org; dmarc=fail (p=reject dis=none) header.from=google.com
Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=lists.cs.columbia.edu
Received: from localhost (localhost [127.0.0.1])
	by mm01.cs.columbia.edu (Postfix) with ESMTP id 5D6C14B129;
	Fri,  1 Oct 2021 17:03:53 -0400 (EDT)
X-Virus-Scanned: at lists.cs.columbia.edu
Authentication-Results: mm01.cs.columbia.edu (amavisd-new); dkim=softfail
	(fail, message has been altered) header.i=@google.com
Received: from mm01.cs.columbia.edu ([127.0.0.1])
	by localhost (mm01.cs.columbia.edu [127.0.0.1]) (amavisd-new, port 10024)
	with ESMTP id nYAwWzXiJfkO; Fri,  1 Oct 2021 17:03:52 -0400 (EDT)
Received: from mm01.cs.columbia.edu (localhost [127.0.0.1])
	by mm01.cs.columbia.edu (Postfix) with ESMTP id 0B0A54B0D7;
	Fri,  1 Oct 2021 17:03:52 -0400 (EDT)
Received: from localhost (localhost [127.0.0.1])
 by mm01.cs.columbia.edu (Postfix) with ESMTP id A57CB4B0C3
 for <kvmarm@lists.cs.columbia.edu>; Fri,  1 Oct 2021 17:03:51 -0400 (EDT)
X-Virus-Scanned: at lists.cs.columbia.edu
Received: from mm01.cs.columbia.edu ([127.0.0.1])
 by localhost (mm01.cs.columbia.edu [127.0.0.1]) (amavisd-new, port 10024)
 with ESMTP id oUgSSWjYyqUb for <kvmarm@lists.cs.columbia.edu>;
 Fri,  1 Oct 2021 17:03:50 -0400 (EDT)
Received: from mail-lf1-f54.google.com (mail-lf1-f54.google.com
 [209.85.167.54])
 by mm01.cs.columbia.edu (Postfix) with ESMTPS id 5770B4048B
 for <kvmarm@lists.cs.columbia.edu>; Fri,  1 Oct 2021 17:03:50 -0400 (EDT)
Received: by mail-lf1-f54.google.com with SMTP id b20so43767874lfv.3
 for <kvmarm@lists.cs.columbia.edu>; Fri, 01 Oct 2021 14:03:50 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112;
 h=mime-version:references:in-reply-to:from:date:message-id:subject:to
 :cc; bh=US7THjs9q8+dC6lBnfN2W9f1a9lLVsXtHL76JHwld78=;
 b=LEHHXQQrqHueraWXN4QJAj9D834jvqNGLG7RicSBdHd9m2si5c+mnQn40cjGjWM3t4
 ZpFeMi4i5i3rpTiH3hSI/M/2Zm3I08Dm6clnPWH1yBvFjvLfu132ub7eTqs4YpM10V7a
 lTTc9ffz0ryonTNmL/nAJg8f1EBpmFlhmTGHuFMlm7picAw5bafsvoB1TQzM/Y7zt0jW
 wCpZYgm8Nt739gBMaAX8Yb06Sp1bUjR5lGxvvUFf9cGYRh6FYwCRxNfmUdDVpwvG6z7a
 8KT60fu7tEv8nCbjPx44+D+6j2dj0EzazJ3JRD49too+EAFLxHSIar0SaJvMdKCbvS39
 uoLg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20210112;
 h=x-gm-message-state:mime-version:references:in-reply-to:from:date
 :message-id:subject:to:cc;
 bh=US7THjs9q8+dC6lBnfN2W9f1a9lLVsXtHL76JHwld78=;
 b=YFgk9C5rphTS4BP/1HdHBa6vvxdAMToInu0mfEP1Zqf4pS15RTzhsB3dUpepFARciV
 C6FiamBnJtODO19IUchJVS4wiH50vCascHWXaXPMFEkaIbRhisxFqdjoqDb7umMSfgcO
 1JRv4oZI/GFqg+VtE9Bzlw3jxn8GwCuoAP9sHrZ2ZNN++N/23TUAWoiRJI6exTMbTcoz
 +3juxi8IMBns6fXDiYNfO516jpF17bimDqNLXruoJM065RVNw6wWyMsFge7f3ZNYtvRV
 mBdtdoVW1r5N1+qq3y/Ii0YiAZrQVT148VR3xX6g61Qu64XnN+gBf0t2p0sV0VHFyxIf
 AMFw==
X-Gm-Message-State: AOAM530WU3LLq7zxTGqEae8S3zroyOCNkhkSMSH5TdoruXwRLUkHC/+Z
 LbCQdIXTxtolkqq1NdgEgMtmkfINoCGB+mxYgwimZw==
X-Google-Smtp-Source: ABdhPJznCcQb+ccwGsDLyERyb3W3qjFR7Oa5Aa/Z4+L5spmUTbANL6wSIlWbkIDv6JUkIXl+ke4fitW2VkDiMn4VEeU=
X-Received: by 2002:a2e:5c88:: with SMTP id q130mr189882ljb.170.1633122228750; 
 Fri, 01 Oct 2021 14:03:48 -0700 (PDT)
MIME-Version: 1.0
References: <20210916181538.968978-1-oupton@google.com>
 <20210916181538.968978-5-oupton@google.com>
 <20210929185629.GA10933@fuller.cnet> <20210930192107.GB19068@fuller.cnet>
 <871r557jls.ffs@tglx> <20211001120527.GA43086@fuller.cnet>
 <87ilyg4iu9.ffs@tglx>
In-Reply-To: <87ilyg4iu9.ffs@tglx>
From: Oliver Upton <oupton@google.com>
Date: Fri, 1 Oct 2021 14:03:37 -0700
Message-ID: <CAOQ_Qsgq9t3syQkQaorszBvg07XOhOmWJeMF=bQN8HqMPo4ADQ@mail.gmail.com>
Subject: Re: [PATCH v8 4/7] KVM: x86: Report host tsc and realtime values in
 KVM_GET_CLOCK
To: Thomas Gleixner <tglx@linutronix.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>, kvm@vger.kernel.org,
 Will Deacon <will@kernel.org>, Marc Zyngier <maz@kernel.org>,
 Marcelo Tosatti <mtosatti@redhat.com>, Peter Shier <pshier@google.com>,
 David Matlack <dmatlack@google.com>, Paolo Bonzini <pbonzini@redhat.com>,
 kvmarm@lists.cs.columbia.edu, linux-arm-kernel@lists.infradead.org,
 Jim Mattson <jmattson@google.com>
X-BeenThere: kvmarm@lists.cs.columbia.edu
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Where KVM/ARM decisions are made <kvmarm.lists.cs.columbia.edu>
List-Unsubscribe: <https://lists.cs.columbia.edu/mailman/options/kvmarm>,
 <mailto:kvmarm-request@lists.cs.columbia.edu?subject=unsubscribe>
List-Archive: <https://lists.cs.columbia.edu/pipermail/kvmarm>
List-Post: <mailto:kvmarm@lists.cs.columbia.edu>
List-Help: <mailto:kvmarm-request@lists.cs.columbia.edu?subject=help>
List-Subscribe: <https://lists.cs.columbia.edu/mailman/listinfo/kvmarm>,
 <mailto:kvmarm-request@lists.cs.columbia.edu?subject=subscribe>
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Errors-To: kvmarm-bounces@lists.cs.columbia.edu
Sender: kvmarm-bounces@lists.cs.columbia.edu

Hi Thomas,

On Fri, Oct 1, 2021 at 12:59 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Marcelo,
>
> On Fri, Oct 01 2021 at 09:05, Marcelo Tosatti wrote:
> > On Fri, Oct 01, 2021 at 01:02:23AM +0200, Thomas Gleixner wrote:
> >> But even if that would be the case, then what prevents the stale time
> >> stamps to be visible? Nothing:
> >>
> >> T0:    t = now();
> >>          -> pause
> >>          -> resume
> >>          -> magic "fixup"
> >> T1:    dostuff(t);
> >
> > Yes.
> >
> > BTW, you could have a userspace notification (then applications
> > could handle this if desired).
>
> Well, we have that via timerfd with TFD_TIMER_CANCEL_ON_SET for
> CLOCK_REALTIME. That's what applications which are sensitive to clock
> REALTIME jumps use today.
>
> >>   Now the proposed change is creating exactly the same problem:
> >>
> >>   >> > +     if (data.flags & KVM_CLOCK_REALTIME) {
> >>   >> > +             u64 now_real_ns = ktime_get_real_ns();
> >>   >> > +
> >>   >> > +             /*
> >>   >> > +              * Avoid stepping the kvmclock backwards.
> >>   >> > +              */
> >>   >> > +             if (now_real_ns > data.realtime)
> >>   >> > +                     data.clock += now_real_ns - data.realtime;
> >>   >> > +     }
> >>
> >>   IOW, it takes the time between pause and resume into account and
> >>   forwards the underlying base clock which makes CLOCK_MONOTONIC
> >>   jump forward by exactly that amount of time.
> >
> > Well, it is assuming that the
> >
> >  T0:    t = now();
> >  T1:    pause vm()
> >  T2:  finish vm migration()
> >  T3:    dostuff(t);
> >
> > Interval between T1 and T2 is small (and that the guest
> > clocks are synchronized up to a given boundary).
>
> Yes, I understand that, but it's an assumption and there is no boundary
> for the time jump in the proposed patches, which rings my alarm bells :)
>
> > But i suppose adding a limit to the forward clock advance
> > (in the migration case) is useful:
> >
> >       1) If migration (well actually, only the final steps
> >          to finish migration, the time between when guest is paused
> >          on source and is resumed on destination) takes too long,
> >          then too bad: fix it to be shorter if you want the clocks
> >          to have close to zero change to realtime on migration.
> >
> >       2) Avoid the other bugs in case of large forward advance.
> >
> > Maybe having it configurable, with a say, 1 minute maximum by default
> > is a good choice?
>
> Don't know what 1 minute does in terms of applications etc. You have to
> do some experiments on that.

I debated quite a bit on what the absolute limit should be for
advancing the KVM clock, and settled on doing no checks in the kernel
besides the monotonicity invariant. End of the day, userspace can
ignore all of the rules that KVM will try to enforce on the kvm
clock/TSC and jump it as it sees fit (both are already directly
writable). But I agree that there has to be some reason around what is
acceptable. We have an absolute limit on how far forward we will yank
the KVM clock and TSC in our userspace, but of course it has a TOCTOU
problem for whatever madness can come in between userspace and the
time the kernel actually services the ioctl.

--
Thanks,
Oliver


> > An alternative would be to advance only the guests REALTIME clock, from
> > data about how long steps T1-T2 took.
>
> Yes, that's what would happen in the cooperative S2IDLE or S3 case when
> the guest resumes.
>
> >> So now virt came along and created a hard to solve circular dependency
> >> problem:
> >>
> >>    - If CLOCK_MONOTONIC stops for too long then NTP/PTP gets out of
> >>      sync, but everything else is happy.
> >>
> >>    - If CLOCK_MONOTONIC jumps too far forward, then all hell breaks
> >>      lose, but NTP/PTP is happy.
> >
> > One must handle the
> >
> >  T0:    t = now();
> >           -> pause
> >           -> resume
> >           -> magic "fixup"
> >  T1:    dostuff(t);
> >
> > fact if one is going to use savevm/restorevm anyway, so...
> > (it is kind of unfixable, unless you modify your application
> > to accept notifications to redo any computation based on t, isnt it?).
>
> Well yes, but what applications can deal with is CLOCK_REALTIME jumping
> because that's a property of it. Not so much the CLOCK_MONOTONIC part.
>
> >> If you decide that correctness is overrated, then please document it
> >> clearly instead of trying to pretend being correct.
> >
> > Based on the above, advancing only CLOCK_REALTIME (and not CLOCK_MONOTONIC)
> > would be correct, right? And its probably not very hard to do.
>
> Time _is_ hard to get right.
>
> So you might experiment with something like this as a stop gap:
>
>   Provide the guest something like this:
>
>           u64              migration_seq;
>           u64              realtime_delta_ns;
>
>   in the shared clock page
>
>   Do not forward jump clock MONOTONIC.
>
>   On resume kick an IPI where the guest handler does:
>
>          if (clock_data->migration_seq == migration_seq)
>                 return;
>
>          migration_seq = clock_data->migration_seq;
>
>          ts64 = { 0, 0 };
>          timespec64_add_ns(&ts64, clock_data->realtime_delta_ns);
>          timekeeping_inject_sleeptime64(&ts64);
>
>   Make sure that the IPI completes before you migrate the guest another
>   time or implement it slightly smarter, but you get the idea :)
>
> That's what we use for suspend time injection, but it should just work
> without frozen tasks as well. It will forward clock REALTIME by the
> amount of time spent during migration. It'll also modify the BOOTTIME
> offset by the same amount, but that's not a tragedy.
>
> The function will also reset NTP state so the NTP/PTP daemon knows that
> there was a kernel initiated time jump and it can work out easily what
> to do like it does on resume from an actual suspend. It will also
> invoke clock_was_set() which makes all the other time related updates
> trigger and wakeup tasks which have a timerfd with
> TFD_TIMER_CANCEL_ON_SET armed.
>
> This will obviously not work when the guest is in S2IDLE or S3, but for
> initial experimentation you can ignore that and just avoid to do that in
> the guest. :)
>
> That still is worse than a cooperative S2IDLE/S3, but it's way more
> sensible than the other two evils you have today.
>
> > Thanks very much for the detailed information! Its a good basis
> > for the document you ask.
>
> I volunteer to review that documentation once it materializes :)
>
> Thanks,
>
>         tglx
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=stTo=OV=lists.infradead.org=linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 3A302C433EF
	for <linux-arm-kernel@archiver.kernel.org>; Fri,  1 Oct 2021 21:06:43 +0000 (UTC)
Received: from bombadil.infradead.org (bombadil.infradead.org [198.137.202.133])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by mail.kernel.org (Postfix) with ESMTPS id E6E8E619E7
	for <linux-arm-kernel@archiver.kernel.org>; Fri,  1 Oct 2021 21:06:42 +0000 (UTC)
DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org E6E8E619E7
Authentication-Results: mail.kernel.org; dmarc=fail (p=reject dis=none) header.from=google.com
Authentication-Results: mail.kernel.org; spf=none smtp.mailfrom=lists.infradead.org
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed;
	d=lists.infradead.org; s=bombadil.20210309; h=Sender:
	Content-Transfer-Encoding:Content-Type:List-Subscribe:List-Help:List-Post:
	List-Archive:List-Unsubscribe:List-Id:Cc:To:Subject:Message-ID:Date:From:
	In-Reply-To:References:MIME-Version:Reply-To:Content-ID:Content-Description:
	Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID:
	List-Owner; bh=s4vumxqUe1qLZ9KfOAS2zMjK1juvZhEmTePAx2TP8ec=; b=QGCHMFg0QsQdvt
	Ul2jAcQkPptt+mWl/hvqSH9n/pgZ7CWQcWSwoP1hOyq8yntTW8F4MrgY2VbN1zqjloMdXyWj2l97J
	sbdZEm/Nn/jOAuHr1sYX24F8JSnmkEf0f6MTeTQI4nLT3uBWfFLk8d0MxeCu/SS98lX5mHislZOnZ
	6Xc2sa3SmF1d8UoQA0Q4gM3y75ZJFMV482824f+Drla4RfIR2HMwR7uykwXwkl3pcPnpJ0ueMbh7X
	q0yYrDHhvzfnqq9XX/s7L93jdcqlItH8jvKNTf9QSDz3piPPzGZIdisJmyhSE12Z/reYj/MvO+LbE
	pJ6qYx/2EaHZhqdbXX7Q==;
Received: from localhost ([::1] helo=bombadil.infradead.org)
	by bombadil.infradead.org with esmtp (Exim 4.94.2 #2 (Red Hat Linux))
	id 1mWPhD-001JJP-Cy; Fri, 01 Oct 2021 21:03:55 +0000
Received: from mail-lf1-x134.google.com ([2a00:1450:4864:20::134])
 by bombadil.infradead.org with esmtps (Exim 4.94.2 #2 (Red Hat Linux))
 id 1mWPh9-001JIz-Ng
 for linux-arm-kernel@lists.infradead.org; Fri, 01 Oct 2021 21:03:53 +0000
Received: by mail-lf1-x134.google.com with SMTP id e15so43505496lfr.10
 for <linux-arm-kernel@lists.infradead.org>;
 Fri, 01 Oct 2021 14:03:50 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112;
 h=mime-version:references:in-reply-to:from:date:message-id:subject:to
 :cc; bh=US7THjs9q8+dC6lBnfN2W9f1a9lLVsXtHL76JHwld78=;
 b=LEHHXQQrqHueraWXN4QJAj9D834jvqNGLG7RicSBdHd9m2si5c+mnQn40cjGjWM3t4
 ZpFeMi4i5i3rpTiH3hSI/M/2Zm3I08Dm6clnPWH1yBvFjvLfu132ub7eTqs4YpM10V7a
 lTTc9ffz0ryonTNmL/nAJg8f1EBpmFlhmTGHuFMlm7picAw5bafsvoB1TQzM/Y7zt0jW
 wCpZYgm8Nt739gBMaAX8Yb06Sp1bUjR5lGxvvUFf9cGYRh6FYwCRxNfmUdDVpwvG6z7a
 8KT60fu7tEv8nCbjPx44+D+6j2dj0EzazJ3JRD49too+EAFLxHSIar0SaJvMdKCbvS39
 uoLg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20210112;
 h=x-gm-message-state:mime-version:references:in-reply-to:from:date
 :message-id:subject:to:cc;
 bh=US7THjs9q8+dC6lBnfN2W9f1a9lLVsXtHL76JHwld78=;
 b=agcrKf+C5C7QaFUVOPxV/IcouZ6rxVzwcrB7HhiIb7uKq6Z9QETUDZ/s5wtIUhZQei
 0hZkFY6EYfWml80aILramlBJg0xG+ukNht2jaCp5I/nnLJaeCn6avof7VBYyGTqgJytv
 Aept6tPmnG2jR50Ro7+J87IIq2GB4C4rCIEGIK1VTPhYvosr+IuPD3kRjJ+hxtTQcIH4
 Kctf/Oo+WVBnE09SPqHv1R/8kNMtzCZpULW8qS6JaoqlQ/vFTQ2pWcxkQxvctDDM7DT+
 O4Fx8ns9ol9oBzkDjwA7VGZjgoQHcxYVZH7FmKUl1kJupE7mH1PQIIWrAs+T0cP8FC+t
 nnXw==
X-Gm-Message-State: AOAM5327Tcar0tOGe36y02kKyQeAxRosc2NZDD7AHe+YcaPLSfUaENJP
 o/ZCHpyiCybjmlShv86YWuhYR21Tb6/AiAmIZxKqkA==
X-Google-Smtp-Source: ABdhPJznCcQb+ccwGsDLyERyb3W3qjFR7Oa5Aa/Z4+L5spmUTbANL6wSIlWbkIDv6JUkIXl+ke4fitW2VkDiMn4VEeU=
X-Received: by 2002:a2e:5c88:: with SMTP id q130mr189882ljb.170.1633122228750; 
 Fri, 01 Oct 2021 14:03:48 -0700 (PDT)
MIME-Version: 1.0
References: <20210916181538.968978-1-oupton@google.com>
 <20210916181538.968978-5-oupton@google.com>
 <20210929185629.GA10933@fuller.cnet> <20210930192107.GB19068@fuller.cnet>
 <871r557jls.ffs@tglx> <20211001120527.GA43086@fuller.cnet>
 <87ilyg4iu9.ffs@tglx>
In-Reply-To: <87ilyg4iu9.ffs@tglx>
From: Oliver Upton <oupton@google.com>
Date: Fri, 1 Oct 2021 14:03:37 -0700
Message-ID: <CAOQ_Qsgq9t3syQkQaorszBvg07XOhOmWJeMF=bQN8HqMPo4ADQ@mail.gmail.com>
Subject: Re: [PATCH v8 4/7] KVM: x86: Report host tsc and realtime values in
 KVM_GET_CLOCK
To: Thomas Gleixner <tglx@linutronix.de>
Cc: Marcelo Tosatti <mtosatti@redhat.com>, kvm@vger.kernel.org,
 kvmarm@lists.cs.columbia.edu, 
 Paolo Bonzini <pbonzini@redhat.com>, Sean Christopherson <seanjc@google.com>,
 Marc Zyngier <maz@kernel.org>, 
 Peter Shier <pshier@google.com>, Jim Mattson <jmattson@google.com>, 
 David Matlack <dmatlack@google.com>, Ricardo Koller <ricarkol@google.com>, 
 Jing Zhang <jingzhangos@google.com>, Raghavendra Rao Anata <rananta@google.com>,
 James Morse <james.morse@arm.com>, Alexandru Elisei <Alexandru.Elisei@arm.com>,
 Suzuki K Poulose <suzuki.poulose@arm.com>, linux-arm-kernel@lists.infradead.org,
 Andrew Jones <drjones@redhat.com>, Will Deacon <will@kernel.org>, 
 Catalin Marinas <catalin.marinas@arm.com>
X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 
X-CRM114-CacheID: sfid-20211001_140351_799066_BF3D27C1 
X-CRM114-Status: GOOD (  62.05  )
X-BeenThere: linux-arm-kernel@lists.infradead.org
X-Mailman-Version: 2.1.34
Precedence: list
List-Id: <linux-arm-kernel.lists.infradead.org>
List-Unsubscribe: <http://lists.infradead.org/mailman/options/linux-arm-kernel>, 
 <mailto:linux-arm-kernel-request@lists.infradead.org?subject=unsubscribe>
List-Archive: <http://lists.infradead.org/pipermail/linux-arm-kernel/>
List-Post: <mailto:linux-arm-kernel@lists.infradead.org>
List-Help: <mailto:linux-arm-kernel-request@lists.infradead.org?subject=help>
List-Subscribe: <http://lists.infradead.org/mailman/listinfo/linux-arm-kernel>, 
 <mailto:linux-arm-kernel-request@lists.infradead.org?subject=subscribe>
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Sender: "linux-arm-kernel" <linux-arm-kernel-bounces@lists.infradead.org>
Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org

Hi Thomas,

On Fri, Oct 1, 2021 at 12:59 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Marcelo,
>
> On Fri, Oct 01 2021 at 09:05, Marcelo Tosatti wrote:
> > On Fri, Oct 01, 2021 at 01:02:23AM +0200, Thomas Gleixner wrote:
> >> But even if that would be the case, then what prevents the stale time
> >> stamps to be visible? Nothing:
> >>
> >> T0:    t = now();
> >>          -> pause
> >>          -> resume
> >>          -> magic "fixup"
> >> T1:    dostuff(t);
> >
> > Yes.
> >
> > BTW, you could have a userspace notification (then applications
> > could handle this if desired).
>
> Well, we have that via timerfd with TFD_TIMER_CANCEL_ON_SET for
> CLOCK_REALTIME. That's what applications which are sensitive to clock
> REALTIME jumps use today.
>
> >>   Now the proposed change is creating exactly the same problem:
> >>
> >>   >> > +     if (data.flags & KVM_CLOCK_REALTIME) {
> >>   >> > +             u64 now_real_ns = ktime_get_real_ns();
> >>   >> > +
> >>   >> > +             /*
> >>   >> > +              * Avoid stepping the kvmclock backwards.
> >>   >> > +              */
> >>   >> > +             if (now_real_ns > data.realtime)
> >>   >> > +                     data.clock += now_real_ns - data.realtime;
> >>   >> > +     }
> >>
> >>   IOW, it takes the time between pause and resume into account and
> >>   forwards the underlying base clock which makes CLOCK_MONOTONIC
> >>   jump forward by exactly that amount of time.
> >
> > Well, it is assuming that the
> >
> >  T0:    t = now();
> >  T1:    pause vm()
> >  T2:  finish vm migration()
> >  T3:    dostuff(t);
> >
> > Interval between T1 and T2 is small (and that the guest
> > clocks are synchronized up to a given boundary).
>
> Yes, I understand that, but it's an assumption and there is no boundary
> for the time jump in the proposed patches, which rings my alarm bells :)
>
> > But i suppose adding a limit to the forward clock advance
> > (in the migration case) is useful:
> >
> >       1) If migration (well actually, only the final steps
> >          to finish migration, the time between when guest is paused
> >          on source and is resumed on destination) takes too long,
> >          then too bad: fix it to be shorter if you want the clocks
> >          to have close to zero change to realtime on migration.
> >
> >       2) Avoid the other bugs in case of large forward advance.
> >
> > Maybe having it configurable, with a say, 1 minute maximum by default
> > is a good choice?
>
> Don't know what 1 minute does in terms of applications etc. You have to
> do some experiments on that.

I debated quite a bit on what the absolute limit should be for
advancing the KVM clock, and settled on doing no checks in the kernel
besides the monotonicity invariant. End of the day, userspace can
ignore all of the rules that KVM will try to enforce on the kvm
clock/TSC and jump it as it sees fit (both are already directly
writable). But I agree that there has to be some reason around what is
acceptable. We have an absolute limit on how far forward we will yank
the KVM clock and TSC in our userspace, but of course it has a TOCTOU
problem for whatever madness can come in between userspace and the
time the kernel actually services the ioctl.

--
Thanks,
Oliver


> > An alternative would be to advance only the guests REALTIME clock, from
> > data about how long steps T1-T2 took.
>
> Yes, that's what would happen in the cooperative S2IDLE or S3 case when
> the guest resumes.
>
> >> So now virt came along and created a hard to solve circular dependency
> >> problem:
> >>
> >>    - If CLOCK_MONOTONIC stops for too long then NTP/PTP gets out of
> >>      sync, but everything else is happy.
> >>
> >>    - If CLOCK_MONOTONIC jumps too far forward, then all hell breaks
> >>      lose, but NTP/PTP is happy.
> >
> > One must handle the
> >
> >  T0:    t = now();
> >           -> pause
> >           -> resume
> >           -> magic "fixup"
> >  T1:    dostuff(t);
> >
> > fact if one is going to use savevm/restorevm anyway, so...
> > (it is kind of unfixable, unless you modify your application
> > to accept notifications to redo any computation based on t, isnt it?).
>
> Well yes, but what applications can deal with is CLOCK_REALTIME jumping
> because that's a property of it. Not so much the CLOCK_MONOTONIC part.
>
> >> If you decide that correctness is overrated, then please document it
> >> clearly instead of trying to pretend being correct.
> >
> > Based on the above, advancing only CLOCK_REALTIME (and not CLOCK_MONOTONIC)
> > would be correct, right? And its probably not very hard to do.
>
> Time _is_ hard to get right.
>
> So you might experiment with something like this as a stop gap:
>
>   Provide the guest something like this:
>
>           u64              migration_seq;
>           u64              realtime_delta_ns;
>
>   in the shared clock page
>
>   Do not forward jump clock MONOTONIC.
>
>   On resume kick an IPI where the guest handler does:
>
>          if (clock_data->migration_seq == migration_seq)
>                 return;
>
>          migration_seq = clock_data->migration_seq;
>
>          ts64 = { 0, 0 };
>          timespec64_add_ns(&ts64, clock_data->realtime_delta_ns);
>          timekeeping_inject_sleeptime64(&ts64);
>
>   Make sure that the IPI completes before you migrate the guest another
>   time or implement it slightly smarter, but you get the idea :)
>
> That's what we use for suspend time injection, but it should just work
> without frozen tasks as well. It will forward clock REALTIME by the
> amount of time spent during migration. It'll also modify the BOOTTIME
> offset by the same amount, but that's not a tragedy.
>
> The function will also reset NTP state so the NTP/PTP daemon knows that
> there was a kernel initiated time jump and it can work out easily what
> to do like it does on resume from an actual suspend. It will also
> invoke clock_was_set() which makes all the other time related updates
> trigger and wakeup tasks which have a timerfd with
> TFD_TIMER_CANCEL_ON_SET armed.
>
> This will obviously not work when the guest is in S2IDLE or S3, but for
> initial experimentation you can ignore that and just avoid to do that in
> the guest. :)
>
> That still is worse than a cooperative S2IDLE/S3, but it's way more
> sensible than the other two evils you have today.
>
> > Thanks very much for the detailed information! Its a good basis
> > for the document you ask.
>
> I volunteer to review that documentation once it materializes :)
>
> Thanks,
>
>         tglx

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel