From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id B6069C433F5 for ; Fri, 1 Oct 2021 12:11:54 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 9EF4B61A05 for ; Fri, 1 Oct 2021 12:11:54 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1354322AbhJAMNh (ORCPT ); Fri, 1 Oct 2021 08:13:37 -0400 Received: from us-smtp-delivery-124.mimecast.com ([216.205.24.124]:52353 "EHLO us-smtp-delivery-124.mimecast.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S237143AbhJAMNg (ORCPT ); Fri, 1 Oct 2021 08:13:36 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1633090312; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=1WMoHDuRMv8a05PMxxecz9ri+FED1EzVSseeb3CdzN8=; b=CMIVTzH2PejuIdbUla4sFhNDkNG8FdzHrRF/aACqBdc+kvxbwVgttKVK9HeC47+Be6JAFe fc1KLdwW2q+0K7HJDzrQXMk8ysxLEid4NY3TmEMggntoTc7B0dJEimqSdVWb2WGFT9WBrB NBhUNmbPli6jTzYMt9t0NffPJAqirw0= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-587-5p5UB6B6NGWPnU9lOVTNuA-1; Fri, 01 Oct 2021 08:11:51 -0400 X-MC-Unique: 5p5UB6B6NGWPnU9lOVTNuA-1 Received: from smtp.corp.redhat.com (int-mx02.intmail.prod.int.phx2.redhat.com [10.5.11.12]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id 27D6519253C4; Fri, 1 Oct 2021 12:11:49 +0000 (UTC) Received: from fuller.cnet (ovpn-112-2.gru2.redhat.com [10.97.112.2]) by smtp.corp.redhat.com (Postfix) with ESMTPS id 7882160BF1; Fri, 1 Oct 2021 12:11:48 +0000 (UTC) Received: by fuller.cnet (Postfix, from userid 1000) id 400A3416CE5D; Fri, 1 Oct 2021 09:10:41 -0300 (-03) Date: Fri, 1 Oct 2021 09:10:41 -0300 From: Marcelo Tosatti To: Thomas Gleixner Cc: Oliver Upton , kvm@vger.kernel.org, kvmarm@lists.cs.columbia.edu, Paolo Bonzini , Sean Christopherson , Marc Zyngier , Peter Shier , Jim Mattson , David Matlack , Ricardo Koller , Jing Zhang , Raghavendra Rao Anata , James Morse , Alexandru Elisei , Suzuki K Poulose , linux-arm-kernel@lists.infradead.org, Andrew Jones , Will Deacon , Catalin Marinas Subject: Re: [PATCH v8 4/7] KVM: x86: Report host tsc and realtime values in KVM_GET_CLOCK Message-ID: <20211001121041.GA45897@fuller.cnet> References: <20210916181538.968978-1-oupton@google.com> <20210916181538.968978-5-oupton@google.com> <20210929185629.GA10933@fuller.cnet> <20210930192107.GB19068@fuller.cnet> <871r557jls.ffs@tglx> <20211001120527.GA43086@fuller.cnet> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20211001120527.GA43086@fuller.cnet> User-Agent: Mutt/1.10.1 (2018-07-13) X-Scanned-By: MIMEDefang 2.79 on 10.5.11.12 Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org On Fri, Oct 01, 2021 at 09:05:27AM -0300, Marcelo Tosatti wrote: > On Fri, Oct 01, 2021 at 01:02:23AM +0200, Thomas Gleixner wrote: > > Marcelo, > > > > On Thu, Sep 30 2021 at 16:21, Marcelo Tosatti wrote: > > > On Wed, Sep 29, 2021 at 03:56:29PM -0300, Marcelo Tosatti wrote: > > >> On Thu, Sep 16, 2021 at 06:15:35PM +0000, Oliver Upton wrote: > > >> > > >> Thomas, CC'ed, has deeper understanding of problems with > > >> forward time jumps than I do. Thomas, any comments? > > > > > > Based on the earlier discussion about the problems of synchronizing > > > the guests clock via a notification to the NTP/Chrony daemon > > > (where there is a window where applications can read the stale > > > value of the clock), a possible solution would be triggering > > > an NMI on the destination (so that it runs ASAP, with higher > > > priority than application/kernel). > > > > > > What would this NMI do, exactly? > > > > Nothing. You cannot do anything time related in an NMI. > > > > You might queue irq work which handles that, but that would still not > > prevent user space or kernel space from observing the stale time stamp > > depending on the execution state from where it resumes. > > Yes. > > > >> As a note: this makes it not OK to use KVM_CLOCK_REALTIME flag > > >> for either vm pause / vm resume (well, if paused for long periods of time) > > >> or savevm / restorevm. > > > > > > Maybe with the NMI above, it would be possible to use > > > the realtime clock as a way to know time elapsed between > > > events and advance guest clock without the current > > > problematic window. > > > > As much duct tape you throw at the problem, it cannot be solved ever > > because it's fundamentally broken. All you can do is to make the > > observation windows smaller, but that's just curing the symptom. > > Yes. > > > The problem is that the guest is paused/resumed without getting any > > information about that and the execution of the guest is stopped at an > > arbitrary instruction boundary from which it resumes after migration or > > restore. So there is no way to guarantee that after resume all vCPUs are > > executing in a state which can handle that. > > > > But even if that would be the case, then what prevents the stale time > > stamps to be visible? Nothing: > > > > T0: t = now(); > > -> pause > > -> resume > > -> magic "fixup" > > T1: dostuff(t); > > Yes. > > BTW, you could have a userspace notification (then applications > could handle this if desired). > > > But that's not a fundamental problem because every preemptible or > > interruptible code has the same issue: > > > > T0: t = now(); > > -> preemption or interrupt > > T1: dostuff(t); > > > > Which is usually not a problem, but It becomes a problem when T1 - T0 is > > greater than the usual expectations which can obviously be trivially > > achieved by guest migration or a savevm, restorevm cycle. > > > > But let's go a step back and look at the clocks and their expectations: > > > > CLOCK_MONOTONIC: > > > > Monotonically increasing clock which counts unless the system > > is in suspend. On resume it continues counting without jumping > > forward. > > > > That's the reference clock for everything else and therefore it > > is important that it does _not_ jump around. > > > > The reasons why CLOCK_MONOTONIC stops during suspend is > > historical and any attempt to change that breaks the world and > > some more because making it jump forward will trigger all sorts > > of timeouts, watchdogs and whatever. The last attempt to make > > CLOCK_MONOTONIC behave like CLOCK_BOOTTIME was reverted within 3 > > weeks. It's not going to be attempted again. See a3ed0e4393d6 > > ("Revert: Unify CLOCK_MONOTONIC and CLOCK_BOOTTIME") for > > details. > > > > Now the proposed change is creating exactly the same problem: > > > > >> > + if (data.flags & KVM_CLOCK_REALTIME) { > > >> > + u64 now_real_ns = ktime_get_real_ns(); > > >> > + > > >> > + /* > > >> > + * Avoid stepping the kvmclock backwards. > > >> > + */ > > >> > + if (now_real_ns > data.realtime) > > >> > + data.clock += now_real_ns - data.realtime; > > >> > + } > > > > IOW, it takes the time between pause and resume into account and > > forwards the underlying base clock which makes CLOCK_MONOTONIC > > jump forward by exactly that amount of time. > > Well, it is assuming that the > > T0: t = now(); > T1: pause vm() > T2: finish vm migration() > T3: dostuff(t); > > Interval between T1 and T2 is small (and that the guest > clocks are synchronized up to a given boundary). > > But i suppose adding a limit to the forward clock advance > (in the migration case) is useful: > > 1) If migration (well actually, only the final steps > to finish migration, the time between when guest is paused > on source and is resumed on destination) takes too long, > then too bad: fix it to be shorter if you want the clocks > to have close to zero change to realtime on migration. > > 2) Avoid the other bugs in case of large forward advance. > > Maybe having it configurable, with a say, 1 minute maximum by default > is a good choice? > > An alternative would be to advance only the guests REALTIME clock, from > data about how long steps T1-T2 took. > > > So depending on the size of the delta you are running into exactly the > > same problem as the final failed attempt to unify CLOCK_MONOTONIC and > > CLOCK_BOOTTIME which btw. would have been a magic cure for virt. > > > > Too bad, not going to happen ever unless you fix all affected user > > space and kernel side issues. > > > > > > CLOCK_BOOTTIME: > > > > CLOCK_MONOTONIC + time spent in suspend > > > > > > CLOCK_REALTIME/TAI: > > > > CLOCK_MONOTONIC + offset > > > > The offset is established by reading RTC at boot time and can be > > changed by clock_settime(2) and adjtimex(2). The latter is used > > by NTP/PTP. > > > > Any user of CLOCK_REALTIME has to be prepared for it to jump in > > both directions, but of course NTP/PTP daemons have expectations > > vs. such time jumps. > > > > They rightfully assume on a properly configured and administrated > > system that there are only two things which can make CLOCK_REALTIME > > jump: > > > > 1) NTP/PTP daemon controlled > > 2) Suspend/resume related updates by the kernel > > > > > > Just for the record, these assumptions predate virtualization. > > > > So now virt came along and created a hard to solve circular dependency > > problem: > > > > - If CLOCK_MONOTONIC stops for too long then NTP/PTP gets out of > > sync, but everything else is happy. > > > > - If CLOCK_MONOTONIC jumps too far forward, then all hell breaks > > lose, but NTP/PTP is happy. > > One must handle the > > T0: t = now(); > -> pause > -> resume > -> magic "fixup" > T1: dostuff(t); > > fact if one is going to use savevm/restorevm anyway, so... > (it is kind of unfixable, unless you modify your application > to accept notifications to redo any computation based on t, isnt it?). https://lore.kernel.org/lkml/1289503802-22444-2-git-send-email-virtuoso@slind.org/ From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id EDB5AC43219 for ; Fri, 1 Oct 2021 12:11:56 +0000 (UTC) Received: from mm01.cs.columbia.edu (mm01.cs.columbia.edu [128.59.11.253]) by mail.kernel.org (Postfix) with ESMTP id 61E7461A56 for ; Fri, 1 Oct 2021 12:11:56 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org 61E7461A56 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=redhat.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=lists.cs.columbia.edu Received: from localhost (localhost [127.0.0.1]) by mm01.cs.columbia.edu (Postfix) with ESMTP id C6E5B4B0B4; Fri, 1 Oct 2021 08:11:55 -0400 (EDT) X-Virus-Scanned: at lists.cs.columbia.edu Authentication-Results: mm01.cs.columbia.edu (amavisd-new); dkim=softfail (fail, message has been altered) header.i=@redhat.com Received: from mm01.cs.columbia.edu ([127.0.0.1]) by localhost (mm01.cs.columbia.edu [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id RB6ZmO+zFp5g; Fri, 1 Oct 2021 08:11:54 -0400 (EDT) Received: from mm01.cs.columbia.edu (localhost [127.0.0.1]) by mm01.cs.columbia.edu (Postfix) with ESMTP id 83B4C4B0BD; Fri, 1 Oct 2021 08:11:54 -0400 (EDT) Received: from localhost (localhost [127.0.0.1]) by mm01.cs.columbia.edu (Postfix) with ESMTP id 01F344B0AC for ; Fri, 1 Oct 2021 08:11:54 -0400 (EDT) X-Virus-Scanned: at lists.cs.columbia.edu Received: from mm01.cs.columbia.edu ([127.0.0.1]) by localhost (mm01.cs.columbia.edu [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id v4skH0ClCZLo for ; Fri, 1 Oct 2021 08:11:52 -0400 (EDT) Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [216.205.24.124]) by mm01.cs.columbia.edu (Postfix) with ESMTP id CD8514B08A for ; Fri, 1 Oct 2021 08:11:52 -0400 (EDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1633090312; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=1WMoHDuRMv8a05PMxxecz9ri+FED1EzVSseeb3CdzN8=; b=CMIVTzH2PejuIdbUla4sFhNDkNG8FdzHrRF/aACqBdc+kvxbwVgttKVK9HeC47+Be6JAFe fc1KLdwW2q+0K7HJDzrQXMk8ysxLEid4NY3TmEMggntoTc7B0dJEimqSdVWb2WGFT9WBrB NBhUNmbPli6jTzYMt9t0NffPJAqirw0= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-587-5p5UB6B6NGWPnU9lOVTNuA-1; Fri, 01 Oct 2021 08:11:51 -0400 X-MC-Unique: 5p5UB6B6NGWPnU9lOVTNuA-1 Received: from smtp.corp.redhat.com (int-mx02.intmail.prod.int.phx2.redhat.com [10.5.11.12]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id 27D6519253C4; Fri, 1 Oct 2021 12:11:49 +0000 (UTC) Received: from fuller.cnet (ovpn-112-2.gru2.redhat.com [10.97.112.2]) by smtp.corp.redhat.com (Postfix) with ESMTPS id 7882160BF1; Fri, 1 Oct 2021 12:11:48 +0000 (UTC) Received: by fuller.cnet (Postfix, from userid 1000) id 400A3416CE5D; Fri, 1 Oct 2021 09:10:41 -0300 (-03) Date: Fri, 1 Oct 2021 09:10:41 -0300 From: Marcelo Tosatti To: Thomas Gleixner Subject: Re: [PATCH v8 4/7] KVM: x86: Report host tsc and realtime values in KVM_GET_CLOCK Message-ID: <20211001121041.GA45897@fuller.cnet> References: <20210916181538.968978-1-oupton@google.com> <20210916181538.968978-5-oupton@google.com> <20210929185629.GA10933@fuller.cnet> <20210930192107.GB19068@fuller.cnet> <871r557jls.ffs@tglx> <20211001120527.GA43086@fuller.cnet> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <20211001120527.GA43086@fuller.cnet> User-Agent: Mutt/1.10.1 (2018-07-13) X-Scanned-By: MIMEDefang 2.79 on 10.5.11.12 Cc: Catalin Marinas , kvm@vger.kernel.org, Marc Zyngier , Peter Shier , David Matlack , Paolo Bonzini , Will Deacon , kvmarm@lists.cs.columbia.edu, linux-arm-kernel@lists.infradead.org, Jim Mattson X-BeenThere: kvmarm@lists.cs.columbia.edu X-Mailman-Version: 2.1.14 Precedence: list List-Id: Where KVM/ARM decisions are made List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: kvmarm-bounces@lists.cs.columbia.edu Sender: kvmarm-bounces@lists.cs.columbia.edu On Fri, Oct 01, 2021 at 09:05:27AM -0300, Marcelo Tosatti wrote: > On Fri, Oct 01, 2021 at 01:02:23AM +0200, Thomas Gleixner wrote: > > Marcelo, > > > > On Thu, Sep 30 2021 at 16:21, Marcelo Tosatti wrote: > > > On Wed, Sep 29, 2021 at 03:56:29PM -0300, Marcelo Tosatti wrote: > > >> On Thu, Sep 16, 2021 at 06:15:35PM +0000, Oliver Upton wrote: > > >> > > >> Thomas, CC'ed, has deeper understanding of problems with > > >> forward time jumps than I do. Thomas, any comments? > > > > > > Based on the earlier discussion about the problems of synchronizing > > > the guests clock via a notification to the NTP/Chrony daemon > > > (where there is a window where applications can read the stale > > > value of the clock), a possible solution would be triggering > > > an NMI on the destination (so that it runs ASAP, with higher > > > priority than application/kernel). > > > > > > What would this NMI do, exactly? > > > > Nothing. You cannot do anything time related in an NMI. > > > > You might queue irq work which handles that, but that would still not > > prevent user space or kernel space from observing the stale time stamp > > depending on the execution state from where it resumes. > > Yes. > > > >> As a note: this makes it not OK to use KVM_CLOCK_REALTIME flag > > >> for either vm pause / vm resume (well, if paused for long periods of time) > > >> or savevm / restorevm. > > > > > > Maybe with the NMI above, it would be possible to use > > > the realtime clock as a way to know time elapsed between > > > events and advance guest clock without the current > > > problematic window. > > > > As much duct tape you throw at the problem, it cannot be solved ever > > because it's fundamentally broken. All you can do is to make the > > observation windows smaller, but that's just curing the symptom. > > Yes. > > > The problem is that the guest is paused/resumed without getting any > > information about that and the execution of the guest is stopped at an > > arbitrary instruction boundary from which it resumes after migration or > > restore. So there is no way to guarantee that after resume all vCPUs are > > executing in a state which can handle that. > > > > But even if that would be the case, then what prevents the stale time > > stamps to be visible? Nothing: > > > > T0: t = now(); > > -> pause > > -> resume > > -> magic "fixup" > > T1: dostuff(t); > > Yes. > > BTW, you could have a userspace notification (then applications > could handle this if desired). > > > But that's not a fundamental problem because every preemptible or > > interruptible code has the same issue: > > > > T0: t = now(); > > -> preemption or interrupt > > T1: dostuff(t); > > > > Which is usually not a problem, but It becomes a problem when T1 - T0 is > > greater than the usual expectations which can obviously be trivially > > achieved by guest migration or a savevm, restorevm cycle. > > > > But let's go a step back and look at the clocks and their expectations: > > > > CLOCK_MONOTONIC: > > > > Monotonically increasing clock which counts unless the system > > is in suspend. On resume it continues counting without jumping > > forward. > > > > That's the reference clock for everything else and therefore it > > is important that it does _not_ jump around. > > > > The reasons why CLOCK_MONOTONIC stops during suspend is > > historical and any attempt to change that breaks the world and > > some more because making it jump forward will trigger all sorts > > of timeouts, watchdogs and whatever. The last attempt to make > > CLOCK_MONOTONIC behave like CLOCK_BOOTTIME was reverted within 3 > > weeks. It's not going to be attempted again. See a3ed0e4393d6 > > ("Revert: Unify CLOCK_MONOTONIC and CLOCK_BOOTTIME") for > > details. > > > > Now the proposed change is creating exactly the same problem: > > > > >> > + if (data.flags & KVM_CLOCK_REALTIME) { > > >> > + u64 now_real_ns = ktime_get_real_ns(); > > >> > + > > >> > + /* > > >> > + * Avoid stepping the kvmclock backwards. > > >> > + */ > > >> > + if (now_real_ns > data.realtime) > > >> > + data.clock += now_real_ns - data.realtime; > > >> > + } > > > > IOW, it takes the time between pause and resume into account and > > forwards the underlying base clock which makes CLOCK_MONOTONIC > > jump forward by exactly that amount of time. > > Well, it is assuming that the > > T0: t = now(); > T1: pause vm() > T2: finish vm migration() > T3: dostuff(t); > > Interval between T1 and T2 is small (and that the guest > clocks are synchronized up to a given boundary). > > But i suppose adding a limit to the forward clock advance > (in the migration case) is useful: > > 1) If migration (well actually, only the final steps > to finish migration, the time between when guest is paused > on source and is resumed on destination) takes too long, > then too bad: fix it to be shorter if you want the clocks > to have close to zero change to realtime on migration. > > 2) Avoid the other bugs in case of large forward advance. > > Maybe having it configurable, with a say, 1 minute maximum by default > is a good choice? > > An alternative would be to advance only the guests REALTIME clock, from > data about how long steps T1-T2 took. > > > So depending on the size of the delta you are running into exactly the > > same problem as the final failed attempt to unify CLOCK_MONOTONIC and > > CLOCK_BOOTTIME which btw. would have been a magic cure for virt. > > > > Too bad, not going to happen ever unless you fix all affected user > > space and kernel side issues. > > > > > > CLOCK_BOOTTIME: > > > > CLOCK_MONOTONIC + time spent in suspend > > > > > > CLOCK_REALTIME/TAI: > > > > CLOCK_MONOTONIC + offset > > > > The offset is established by reading RTC at boot time and can be > > changed by clock_settime(2) and adjtimex(2). The latter is used > > by NTP/PTP. > > > > Any user of CLOCK_REALTIME has to be prepared for it to jump in > > both directions, but of course NTP/PTP daemons have expectations > > vs. such time jumps. > > > > They rightfully assume on a properly configured and administrated > > system that there are only two things which can make CLOCK_REALTIME > > jump: > > > > 1) NTP/PTP daemon controlled > > 2) Suspend/resume related updates by the kernel > > > > > > Just for the record, these assumptions predate virtualization. > > > > So now virt came along and created a hard to solve circular dependency > > problem: > > > > - If CLOCK_MONOTONIC stops for too long then NTP/PTP gets out of > > sync, but everything else is happy. > > > > - If CLOCK_MONOTONIC jumps too far forward, then all hell breaks > > lose, but NTP/PTP is happy. > > One must handle the > > T0: t = now(); > -> pause > -> resume > -> magic "fixup" > T1: dostuff(t); > > fact if one is going to use savevm/restorevm anyway, so... > (it is kind of unfixable, unless you modify your application > to accept notifications to redo any computation based on t, isnt it?). https://lore.kernel.org/lkml/1289503802-22444-2-git-send-email-virtuoso@slind.org/ _______________________________________________ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 01FAEC433EF for ; Fri, 1 Oct 2021 12:18:52 +0000 (UTC) Received: from bombadil.infradead.org (bombadil.infradead.org [198.137.202.133]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPS id C330A61A55 for ; Fri, 1 Oct 2021 12:18:51 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org C330A61A55 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=redhat.com Authentication-Results: mail.kernel.org; spf=none smtp.mailfrom=lists.infradead.org DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=lists.infradead.org; s=bombadil.20210309; h=Sender: Content-Transfer-Encoding:Content-Type:List-Subscribe:List-Help:List-Post: List-Archive:List-Unsubscribe:List-Id:In-Reply-To:MIME-Version:References: Message-ID:Subject:Cc:To:From:Date:Reply-To:Content-ID:Content-Description: Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID: List-Owner; bh=pbsbEPRgLxsOYnFiU0fy9StdmIRdjr7zNJH7ybT0aL4=; b=Mzq3iCykl/vj6e MBvYkeKDJgj3c4U00E2/TardsXiLYkVsN7Kp8pHF+WYOCXrbv3qUq0YPqIweVHHOFcP8cWBWTMHr7 cTyGqD7CapRSjY8tZJoGuSYFA5zzGMpL1jMKkEm//z/IES3pVvl75QilV41230OZztmaSoWh2RJA+ yx+tOq+1iKZ4HZXzNJJMzSjT5hsqXT2k5BpFq3l6cvLRu54BESmWJNtYq40jAn+1cqFcOES1MV2mf caIIfp0Ma7cVNfgAlAyhy6Nujd/x2fOGxxszeu4NM6/bivFhfRAx/B9u/T33MvDM4sIg9GeU1bGna 9/yeTdYCGKfzCVPTxElg==; Received: from localhost ([::1] helo=bombadil.infradead.org) by bombadil.infradead.org with esmtp (Exim 4.94.2 #2 (Red Hat Linux)) id 1mWHSh-000G3s-7u; Fri, 01 Oct 2021 12:16:23 +0000 Received: from us-smtp-delivery-124.mimecast.com ([216.205.24.124]) by bombadil.infradead.org with esmtps (Exim 4.94.2 #2 (Red Hat Linux)) id 1mWHON-000D1X-M9 for linux-arm-kernel@lists.infradead.org; Fri, 01 Oct 2021 12:11:58 +0000 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1633090314; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=1WMoHDuRMv8a05PMxxecz9ri+FED1EzVSseeb3CdzN8=; b=LImC8ywT1vvEPbB1VOd31M0h6N6+5oNOWFujGld+oidaQqhmndeSIT/AXYs9fhnbPExcHr cAOfYUBvwlWzRCrjivlxasE6b3+kpO/YLcTyU0CY1H+kbYCrotjJeOBKgGwJfF+51peCOe VQZygQxb3CheiIbADl7FSMqPa+5NRgQ= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-587-5p5UB6B6NGWPnU9lOVTNuA-1; Fri, 01 Oct 2021 08:11:51 -0400 X-MC-Unique: 5p5UB6B6NGWPnU9lOVTNuA-1 Received: from smtp.corp.redhat.com (int-mx02.intmail.prod.int.phx2.redhat.com [10.5.11.12]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id 27D6519253C4; Fri, 1 Oct 2021 12:11:49 +0000 (UTC) Received: from fuller.cnet (ovpn-112-2.gru2.redhat.com [10.97.112.2]) by smtp.corp.redhat.com (Postfix) with ESMTPS id 7882160BF1; Fri, 1 Oct 2021 12:11:48 +0000 (UTC) Received: by fuller.cnet (Postfix, from userid 1000) id 400A3416CE5D; Fri, 1 Oct 2021 09:10:41 -0300 (-03) Date: Fri, 1 Oct 2021 09:10:41 -0300 From: Marcelo Tosatti To: Thomas Gleixner Cc: Oliver Upton , kvm@vger.kernel.org, kvmarm@lists.cs.columbia.edu, Paolo Bonzini , Sean Christopherson , Marc Zyngier , Peter Shier , Jim Mattson , David Matlack , Ricardo Koller , Jing Zhang , Raghavendra Rao Anata , James Morse , Alexandru Elisei , Suzuki K Poulose , linux-arm-kernel@lists.infradead.org, Andrew Jones , Will Deacon , Catalin Marinas Subject: Re: [PATCH v8 4/7] KVM: x86: Report host tsc and realtime values in KVM_GET_CLOCK Message-ID: <20211001121041.GA45897@fuller.cnet> References: <20210916181538.968978-1-oupton@google.com> <20210916181538.968978-5-oupton@google.com> <20210929185629.GA10933@fuller.cnet> <20210930192107.GB19068@fuller.cnet> <871r557jls.ffs@tglx> <20211001120527.GA43086@fuller.cnet> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <20211001120527.GA43086@fuller.cnet> User-Agent: Mutt/1.10.1 (2018-07-13) X-Scanned-By: MIMEDefang 2.79 on 10.5.11.12 X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 X-CRM114-CacheID: sfid-20211001_051155_862183_197574B9 X-CRM114-Status: GOOD ( 65.99 ) X-BeenThere: linux-arm-kernel@lists.infradead.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: "linux-arm-kernel" Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org On Fri, Oct 01, 2021 at 09:05:27AM -0300, Marcelo Tosatti wrote: > On Fri, Oct 01, 2021 at 01:02:23AM +0200, Thomas Gleixner wrote: > > Marcelo, > > > > On Thu, Sep 30 2021 at 16:21, Marcelo Tosatti wrote: > > > On Wed, Sep 29, 2021 at 03:56:29PM -0300, Marcelo Tosatti wrote: > > >> On Thu, Sep 16, 2021 at 06:15:35PM +0000, Oliver Upton wrote: > > >> > > >> Thomas, CC'ed, has deeper understanding of problems with > > >> forward time jumps than I do. Thomas, any comments? > > > > > > Based on the earlier discussion about the problems of synchronizing > > > the guests clock via a notification to the NTP/Chrony daemon > > > (where there is a window where applications can read the stale > > > value of the clock), a possible solution would be triggering > > > an NMI on the destination (so that it runs ASAP, with higher > > > priority than application/kernel). > > > > > > What would this NMI do, exactly? > > > > Nothing. You cannot do anything time related in an NMI. > > > > You might queue irq work which handles that, but that would still not > > prevent user space or kernel space from observing the stale time stamp > > depending on the execution state from where it resumes. > > Yes. > > > >> As a note: this makes it not OK to use KVM_CLOCK_REALTIME flag > > >> for either vm pause / vm resume (well, if paused for long periods of time) > > >> or savevm / restorevm. > > > > > > Maybe with the NMI above, it would be possible to use > > > the realtime clock as a way to know time elapsed between > > > events and advance guest clock without the current > > > problematic window. > > > > As much duct tape you throw at the problem, it cannot be solved ever > > because it's fundamentally broken. All you can do is to make the > > observation windows smaller, but that's just curing the symptom. > > Yes. > > > The problem is that the guest is paused/resumed without getting any > > information about that and the execution of the guest is stopped at an > > arbitrary instruction boundary from which it resumes after migration or > > restore. So there is no way to guarantee that after resume all vCPUs are > > executing in a state which can handle that. > > > > But even if that would be the case, then what prevents the stale time > > stamps to be visible? Nothing: > > > > T0: t = now(); > > -> pause > > -> resume > > -> magic "fixup" > > T1: dostuff(t); > > Yes. > > BTW, you could have a userspace notification (then applications > could handle this if desired). > > > But that's not a fundamental problem because every preemptible or > > interruptible code has the same issue: > > > > T0: t = now(); > > -> preemption or interrupt > > T1: dostuff(t); > > > > Which is usually not a problem, but It becomes a problem when T1 - T0 is > > greater than the usual expectations which can obviously be trivially > > achieved by guest migration or a savevm, restorevm cycle. > > > > But let's go a step back and look at the clocks and their expectations: > > > > CLOCK_MONOTONIC: > > > > Monotonically increasing clock which counts unless the system > > is in suspend. On resume it continues counting without jumping > > forward. > > > > That's the reference clock for everything else and therefore it > > is important that it does _not_ jump around. > > > > The reasons why CLOCK_MONOTONIC stops during suspend is > > historical and any attempt to change that breaks the world and > > some more because making it jump forward will trigger all sorts > > of timeouts, watchdogs and whatever. The last attempt to make > > CLOCK_MONOTONIC behave like CLOCK_BOOTTIME was reverted within 3 > > weeks. It's not going to be attempted again. See a3ed0e4393d6 > > ("Revert: Unify CLOCK_MONOTONIC and CLOCK_BOOTTIME") for > > details. > > > > Now the proposed change is creating exactly the same problem: > > > > >> > + if (data.flags & KVM_CLOCK_REALTIME) { > > >> > + u64 now_real_ns = ktime_get_real_ns(); > > >> > + > > >> > + /* > > >> > + * Avoid stepping the kvmclock backwards. > > >> > + */ > > >> > + if (now_real_ns > data.realtime) > > >> > + data.clock += now_real_ns - data.realtime; > > >> > + } > > > > IOW, it takes the time between pause and resume into account and > > forwards the underlying base clock which makes CLOCK_MONOTONIC > > jump forward by exactly that amount of time. > > Well, it is assuming that the > > T0: t = now(); > T1: pause vm() > T2: finish vm migration() > T3: dostuff(t); > > Interval between T1 and T2 is small (and that the guest > clocks are synchronized up to a given boundary). > > But i suppose adding a limit to the forward clock advance > (in the migration case) is useful: > > 1) If migration (well actually, only the final steps > to finish migration, the time between when guest is paused > on source and is resumed on destination) takes too long, > then too bad: fix it to be shorter if you want the clocks > to have close to zero change to realtime on migration. > > 2) Avoid the other bugs in case of large forward advance. > > Maybe having it configurable, with a say, 1 minute maximum by default > is a good choice? > > An alternative would be to advance only the guests REALTIME clock, from > data about how long steps T1-T2 took. > > > So depending on the size of the delta you are running into exactly the > > same problem as the final failed attempt to unify CLOCK_MONOTONIC and > > CLOCK_BOOTTIME which btw. would have been a magic cure for virt. > > > > Too bad, not going to happen ever unless you fix all affected user > > space and kernel side issues. > > > > > > CLOCK_BOOTTIME: > > > > CLOCK_MONOTONIC + time spent in suspend > > > > > > CLOCK_REALTIME/TAI: > > > > CLOCK_MONOTONIC + offset > > > > The offset is established by reading RTC at boot time and can be > > changed by clock_settime(2) and adjtimex(2). The latter is used > > by NTP/PTP. > > > > Any user of CLOCK_REALTIME has to be prepared for it to jump in > > both directions, but of course NTP/PTP daemons have expectations > > vs. such time jumps. > > > > They rightfully assume on a properly configured and administrated > > system that there are only two things which can make CLOCK_REALTIME > > jump: > > > > 1) NTP/PTP daemon controlled > > 2) Suspend/resume related updates by the kernel > > > > > > Just for the record, these assumptions predate virtualization. > > > > So now virt came along and created a hard to solve circular dependency > > problem: > > > > - If CLOCK_MONOTONIC stops for too long then NTP/PTP gets out of > > sync, but everything else is happy. > > > > - If CLOCK_MONOTONIC jumps too far forward, then all hell breaks > > lose, but NTP/PTP is happy. > > One must handle the > > T0: t = now(); > -> pause > -> resume > -> magic "fixup" > T1: dostuff(t); > > fact if one is going to use savevm/restorevm anyway, so... > (it is kind of unfixable, unless you modify your application > to accept notifications to redo any computation based on t, isnt it?). https://lore.kernel.org/lkml/1289503802-22444-2-git-send-email-virtuoso@slind.org/ _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel