From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Mon, 4 Oct 2021 11:30:11 -0300
From: Marcelo Tosatti
To: Oliver Upton
Cc: Paolo Bonzini, kvm@vger.kernel.org, kvmarm@lists.cs.columbia.edu,
 Sean Christopherson, Marc Zyngier, Peter Shier, Jim Mattson,
 David Matlack, Ricardo Koller, Jing Zhang, Raghavendra Rao Anata,
 James Morse, Alexandru Elisei, Suzuki K Poulose,
 linux-arm-kernel@lists.infradead.org, Andrew Jones, Will Deacon,
 Catalin Marinas
Subject: Re: [PATCH v8 7/7] KVM: x86: Expose TSC offset controls to userspace
Message-ID: <20211004143011.GA72593@fuller.cnet>
References: <20210916181538.968978-1-oupton@google.com>
 <20210916181538.968978-8-oupton@google.com>
 <20210930191416.GA19068@fuller.cnet>
 <48151d08-ee29-2b98-b6e1-f3c8a1ff26bc@redhat.com>
 <20211001103200.GA39746@fuller.cnet>
 <7901cb84-052d-92b6-1e6a-028396c2c691@redhat.com>
 <20211001191117.GA69579@fuller.cnet>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii

On Fri, Oct 01, 2021 at 12:33:28PM -0700, Oliver Upton wrote:
> Marcelo,
>
> On Fri, Oct 1, 2021 at 12:11 PM Marcelo Tosatti wrote:
> >
> > On Fri, Oct 01, 2021 at 05:12:20PM +0200, Paolo Bonzini wrote:
> > > On 01/10/21 12:32, Marcelo Tosatti wrote:
> > > > > +1. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (t_0),
> > > > > +   kvmclock nanoseconds (k_0), and realtime nanoseconds (r_0).
> > > > > +
> > > > > [...]
> > > > > +4. Invoke the KVM_SET_CLOCK ioctl, providing the kvmclock nanoseconds
> > > > > +   (k_0) and realtime nanoseconds (r_0) in their respective fields.
> > > > > +   Ensure that the KVM_CLOCK_REALTIME flag is set in the provided
> > > > > +   structure. KVM will advance the VM's kvmclock to account for elapsed
> > > > > +   time since recording the clock values.
> > > >
> > > > You can't advance both kvmclock (kvmclock_offset variable) and the
> > > > TSCs, which would be double counting.
> > > >
> > > > So you have to either add the elapsed realtime (1) between
> > > > KVM_GET_CLOCK to kvmclock (which this patch is doing), or to the
> > > > TSCs. If you do both, there is double counting. Am I missing
> > > > something?
> > >
> > > Probably one of these two (but it's worth pointing out both of them):
> > >
> > > 1) the attribute that's introduced here *replaces*
> > > KVM_SET_MSR(MSR_IA32_TSC), so the TSC is not added.
> > >
> > > 2) the adjustment formula later in the algorithm does not care about how
> > > much time passed between step 1 and step 4. It just takes two well
> > > known (TSC, kvmclock) pairs, and uses them to ensure the guest TSC is
> > > the same on the destination as if the guest was still running on the
> > > source. It is irrelevant that one of them is before migration and one
> > > is after, all that matters is that one is on the source and one is on
> > > the destination.
> >
> > OK, so it still relies on the NTP daemon to fix the CLOCK_REALTIME delay
> > which is introduced during migration (which is what I would guess is
> > the lower hanging fruit) (for guests using TSC).
>
> The series gives userspace the ability to modify the guest's
> perception of the TSC in whatever way it sees fit. The algorithm in
> the documentation provides a suggestion to userspace on how to do
> exactly that. I kept that advancement logic out of the kernel because
> IMO it is an implementation detail: users have differing opinions on
> how clocks should behave across a migration and KVM shouldn't have any
> baked-in rules around it.

Ok, was just trying to visualize how this would work with QEMU Linux guests.

>
> At the same time, userspace can choose to _not_ jump the TSC and use
> the available interfaces to just migrate the existing state of the
> TSCs.
>
> When I had initially proposed this series upstream, Paolo astutely
> pointed out that there was no good way to get a (CLOCK_REALTIME, TSC)
> pairing, which is critical for the TSC advancement algorithm in the
> documentation. Google's best way to get (CLOCK_REALTIME, TSC) exists
> in userspace [1], hence the missing kvm clock changes. So, in all, the
> spirit of the KVM clock changes is to provide missing UAPI around the
> clock/TSC, with the side effect of changing the guest-visible value.
>
> [1] https://cloud.google.com/spanner/docs/true-time-external-consistency
>
> > My point was that, by advancing the _TSC value_ by:
> >
> > T0. stop guest vcpus (source)
> > T1. KVM_GET_CLOCK (source)
> > T2. KVM_SET_CLOCK (destination)
> > T3. Write guest TSCs (destination)
> > T4. resume guest (destination)
> >
> > new_off_n = t_0 + off_n + (k_1 - k_0) * freq - t_1
> >
> > t_0: host TSC at KVM_GET_CLOCK time.
> > off_n: TSC offset at vcpu-n (as long as no guest TSC writes are performed,
> > TSC offset is fixed).
> > ...
> >
> > +4. Invoke the KVM_SET_CLOCK ioctl, providing the kvmclock nanoseconds
> > +   (k_0) and realtime nanoseconds (r_0) in their respective fields.
> > +   Ensure that the KVM_CLOCK_REALTIME flag is set in the provided
> > +   structure. KVM will advance the VM's kvmclock to account for elapsed
> > +   time since recording the clock values.
> >
> > Only kvmclock is advanced (by passing r_0). But a guest might not use kvmclock
> > (hopefully modern guests on modern hosts will use TSC clocksource,
> > whose clock_gettime is faster... some people are using that already).
> >
>
> Hopefully the above explanation made it clearer how the TSCs are
> supposed to get advanced, and why it isn't done in the kernel.
>
> > At some point QEMU should enable the invariant TSC flag by default?
> >
> > That said, the point is: why not advance the _TSC_ values
> > (instead of kvmclock nanoseconds), as doing so would reduce
> > the "CLOCK_REALTIME delay which is introduced during migration"
> > for both kvmclock users and modern tsc clocksource users.
> >
> > So yes, I also like this patchset, but would like it even more
> > if it fixed the case above as well (and not sure whether adding
> > the migration delta to KVMCLOCK makes it harder to fix TSC case
> > later).
> >
> > > Perhaps we can add to step 6 something like:
> > >
> > > > +6. Adjust the guest TSC offsets for every vCPU to account for (1) time
> > > > +   elapsed since recording state and (2) difference in TSCs between the
> > > > +   source and destination machine:
> > > > +
> > > > +   new_off_n = t_0 + off_n + (k_1 - k_0) * freq - t_1
> > > > +
> > >
> > > "off + t - k * freq" is the guest TSC value corresponding to a time of 0
> > > in kvmclock. The above formula ensures that it is the same on the
> > > destination as it was on the source.
> > >
> > > Also, the names are a bit hard to follow. Perhaps
> > >
> > > t_0        tsc_src
> > > t_1        tsc_dest
> > > k_0        guest_src
> > > k_1        guest_dest
> > > r_0        host_src
> > > off_n      ofs_src[i]
> > > new_off_n  ofs_dest[i]
> > >
> > > Paolo
> > >
>
> Yeah, sounds good to me. Shall I respin the whole series from what you
> have in kvm/queue, or just send you the bits and pieces that ought to
> be applied?
>
> --
> Thanks,
> Oliver
>
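For illustration only, the step 6 adjustment quoted above (new_off_n = t_0 + off_n + (k_1 - k_0) * freq - t_1) could be sketched in userspace roughly as below, using the variable names Paolo proposed. This is a sketch under stated assumptions, not code from the series: the function names are hypothetical, the guest TSC frequency is assumed to be given in kHz, TSC values and offsets are assumed to wrap at 64 bits, and the sample numbers are made up.

```python
# Hypothetical sketch of the step 6 formula from the thread:
#   ofs_dest = tsc_src + ofs_src + (guest_dest - guest_src) * freq - tsc_dest
# Names follow Paolo's suggested renaming; none of these helpers exist in KVM.

MASK64 = (1 << 64) - 1  # assumption: TSC values and offsets wrap at 64 bits


def ns_to_tsc_ticks(delta_ns: int, tsc_khz: int) -> int:
    """Convert kvmclock nanoseconds to TSC ticks (kHz = ticks per millisecond)."""
    return delta_ns * tsc_khz // 1_000_000


def guest_tsc(host_tsc: int, offset: int) -> int:
    """Guest-visible TSC is the host TSC plus the vCPU's offset, modulo 2**64."""
    return (host_tsc + offset) & MASK64


def dest_tsc_offset(tsc_src: int, ofs_src: int, guest_src: int,
                    tsc_dest: int, guest_dest: int, tsc_khz: int) -> int:
    """Offset to program on the destination for one vCPU."""
    guest_tsc_at_src = tsc_src + ofs_src  # guest TSC when state was recorded
    elapsed_ticks = ns_to_tsc_ticks(guest_dest - guest_src, tsc_khz)
    return (guest_tsc_at_src + elapsed_ticks - tsc_dest) & MASK64


# Made-up sample values: 2 GHz guest TSC, 250 ms of kvmclock time elapsed.
TSC_KHZ = 2_000_000
tsc_src, ofs_src, guest_src = 1_000_000_000, -400_000_000, 5_000_000_000
tsc_dest, guest_dest = 70_000_000_000, 5_250_000_000

ofs_dest = dest_tsc_offset(tsc_src, ofs_src, guest_src,
                           tsc_dest, guest_dest, TSC_KHZ)
print(guest_tsc(tsc_src, ofs_src))    # guest TSC on the source
print(guest_tsc(tsc_dest, ofs_dest))  # guest TSC on the destination
```

With these numbers the guest TSC moves forward by exactly the ticks that correspond to the kvmclock delta, regardless of how different the two host TSCs are. That matches Paolo's observation that "off + t - k * freq" stays the same on both sides.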