From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-3.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,MIME_QP_LONG_LINE, SPF_HELO_NONE,SPF_PASS autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id C7490C1B0D9 for ; Tue, 8 Dec 2020 20:33:54 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 8A69B23AA8 for ; Tue, 8 Dec 2020 20:33:54 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1730295AbgLHUdy (ORCPT ); Tue, 8 Dec 2020 15:33:54 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:39934 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726338AbgLHUdx (ORCPT ); Tue, 8 Dec 2020 15:33:53 -0500 Received: from mail-pg1-x542.google.com (mail-pg1-x542.google.com [IPv6:2607:f8b0:4864:20::542]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 3C5D1C0617A6 for ; Tue, 8 Dec 2020 12:32:51 -0800 (PST) Received: by mail-pg1-x542.google.com with SMTP id g18so13261388pgk.1 for ; Tue, 08 Dec 2020 12:32:51 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amacapital-net.20150623.gappssmtp.com; s=20150623; h=content-transfer-encoding:from:mime-version:subject:date:message-id :references:cc:in-reply-to:to; bh=X3jaTUl9GBo4a8qnhQEMvznUX5ksekBuFBQfWILQ+U8=; b=MrliCpl8hL4cJC8PERAnNSwTdw6bXwTTXhAZ/M5/h7klwCksqQ6Ag35pmFF1pxUlTT XQJ2QeciTt9iZFcybbCPAOWut3Mp8EWvFsTGWzCkPREPRSEXjTcA9xchQcRSUroN+CzI STLPmTE4Xc+a36i7ap+UAK8ndr2qoM5aj8iRgtNJTv+tP+4ZPmBW2xu6Te336JZ4a2j2 R/JIVw0N8rxTDFu4L1TXxBTxtO3OKCOlmkdkzthu3wt3BV2FKt3XlanBn5ef8C49Pfir HsVPP3dqRjfjkctYO0dzupgCnDP7/I4WqvNmoHHCqeUQrfK/uOodQ4oUU7C243b1Grn9 2naA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:content-transfer-encoding:from:mime-version :subject:date:message-id:references:cc:in-reply-to:to; bh=X3jaTUl9GBo4a8qnhQEMvznUX5ksekBuFBQfWILQ+U8=; b=Zq8pfMhOHo7CmgVcETFuhrFi+RBEGzyabWOu5i/uK3RZsvoPvCvN91bBOkWgVzHmoG 0xlBhC1dyZdqJ74XvO5yhYxFH6qS6kWAl1vgJkYO8D3/a8f3g6Jko17dkChWtrmbNIIU 3Hv4TEUdzOIvMmhfYCW9r2geocmx7Fp/DCHb90E5mdYD4ChaUT9Lnk312f2AfplrHg/S RdC46jeP9Z+DbqHvV0UcS16hyXFQNan1+ViY2Q0hBWO4hAUqFpSsMVOHp1AQMH6S1VkC ipr5VXXXYKX7CNC2tB1SRB4sBgtr1WC8S0kCbwNz1glsA/V+L52BICqEwaHq7jRMc+xP QnGQ== X-Gm-Message-State: AOAM531hBJWH28edzqPHSgA6glkSU8swX58rJjT1PF4SEx4vQXIcOmpK L7KDGWx/tnrF9BFNVm/j3OOU9A== X-Google-Smtp-Source: ABdhPJxYJCnYuntUM3RTj5G4uZ8JcqXRsmHjDtNdlCetqy79aa/LEq/fJzELfZLm7dPA5RZVrqzYYQ== X-Received: by 2002:a17:90b:a04:: with SMTP id gg4mr6017057pjb.8.1607459570651; Tue, 08 Dec 2020 12:32:50 -0800 (PST) Received: from ?IPv6:2600:1010:b020:7dd:51f8:22c8:74de:21e0? ([2600:1010:b020:7dd:51f8:22c8:74de:21e0]) by smtp.gmail.com with ESMTPSA id a17sm15902499pga.56.2020.12.08.12.32.49 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Tue, 08 Dec 2020 12:32:49 -0800 (PST) Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable From: Andy Lutomirski Mime-Version: 1.0 (1.0) Subject: Re: [PATCH v2 1/3] KVM: x86: implement KVM_{GET|SET}_TSC_STATE Date: Tue, 8 Dec 2020 12:32:48 -0800 Message-Id: <301491B7-DEB6-41ED-B8FD-657B864696CF@amacapital.net> References: <87h7ow2j91.fsf@nanos.tec.linutronix.de> Cc: Andy Lutomirski , Marcelo Tosatti , Maxim Levitsky , kvm list , "H. Peter Anvin" , Paolo Bonzini , Jonathan Corbet , Jim Mattson , Wanpeng Li , "open list:KERNEL SELFTEST FRAMEWORK" , Vitaly Kuznetsov , Sean Christopherson , open list , Ingo Molnar , "maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT)" , Joerg Roedel , Borislav Petkov , Shuah Khan , Andrew Jones , Oliver Upton , "open list:DOCUMENTATION" In-Reply-To: <87h7ow2j91.fsf@nanos.tec.linutronix.de> To: Thomas Gleixner X-Mailer: iPhone Mail (18B121) Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org > On Dec 8, 2020, at 11:25 AM, Thomas Gleixner wrote: >=20 > =EF=BB=BFOn Tue, Dec 08 2020 at 09:43, Andy Lutomirski wrote: >> On Tue, Dec 8, 2020 at 6:23 AM Marcelo Tosatti wrot= e: >> It looks like it tries to accomplish the right goal, but in a rather >> roundabout way. The host knows how to convert from TSC to >> CLOCK_REALTIME, and ptp_kvm.c exposes this data to the guest. But, >> rather than just making the guest use the same CLOCK_REALTIME data as >> the host, ptp_kvm.c seems to expose information to usermode that a >> user daemon could use to attempt (with some degree of error?) to use >> to make the guest kernel track CLOCK_REALTIME. This seems inefficient >> and dubiously accurate. >>=20 >> My feature request is for this to be fully automatic and completely >> coherent. I would like for a host user program and a guest user >> program to be able to share memory, run concurrently, and use the >> shared memory to exchange CLOCK_REALTIME values without ever observing >> the clock going backwards. This ought to be doable. Ideally the >> result should even be usable for Spanner-style synchronization >> assuming the host clock is good enough. Also, this whole thing should >> work without needing to periodically wake the guest to remain >> synchronized. If the guest sleeps for two minutes (full nohz-idle, no >> guest activity at all), the host makes a small REALTIME frequency >> adjustment, and then the guest runs user code that reads >> CLOCK_REALTIME, the guest clock should still be fully synchronized >> with the host. I don't think that ptp_kvm.c-style synchronization can >> do this. >=20 > One issue here is that guests might want to run their own NTP/PTP. One > reason to do that is that some people prefer the leap second smearing > NTP servers.=20 I would hope that using this part would be optional on the guest=E2=80=99s p= art. Guests should be able to use just the CLOCK_MONOTONIC_RAW part or fanci= er stuff at their option. (Hmm, it would, in principle, be possible for a guest to use the host=E2=80=99= s TAI but still smear leap seconds. Even without virt, smearing could be a p= er-timens option.) >=20 >> tglx etc, I think that doing this really really nicely might involve >> promoting something like the current vDSO data structures to ABI -- a >> straightforward-ish implementation would be for the KVM host to export >> its vvar clock data to the guest and for the guest to use it, possibly >> with an offset applied. The offset could work a lot like timens works >> today. >=20 > Works nicely if the guest TSC is not scaled. But that means that on > migration the raw TSC usage in the guest is borked because the new host > might have a different TSC frequency. >=20 > If you use TSC scaling then the conversion needs to take TSC scaling > into account which needs some thought. And the guest would need to read > the host conversion from 'vdso data' and the scaling from the next page > (per guest) and then still has to support timens. Doable but adds extra > overhead on every time read operation. Is the issue that scaling would result in a different guest vs host frequenc= y? Perhaps we could limit each physical machine to exactly two modes: unsca= led (use TSC ticks, convert in software) and scaled to nanoseconds (CLOCK_MO= NOTONIC_RAW is RDTSC + possible offset). Then the host could produce its da= ta structures in exactly those two formats and export them as appropriate.=20= >=20 > If you want to avoid that you are back to the point where you need to > chase all guest data when the host NTP/PTP adjusts the host side. > Chasing and updating all this stuff in the tick was the reason why I was > fighting the idea of clock realtime in namespaces. I think that, if we can arrange for a small, bounded number of pages generat= ed by the host, then this problem isn=E2=80=99t so bad. Hmm, leap second smearing is just a different linear mapping. I=E2=80=99m no= t sure how leap second smearing should interact with timens, but it seems to= be that the host should be able to produce four data pages (scaled vs unsca= led and smeared vs unsmeared) and one per-guest/timens offset page (where of= fset applies to MONOTONIC and MONOTONIC_RAW only) and cover all bases. Or d= o people actually want to offset their TAI and/or REALTIME, and what would t= hat even mean if the offset crosses a leap second? (I haven=E2=80=99t though about the interaction of any of this with ART.)