From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-6.0 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,MAILING_LIST_MULTI,SPF_HELO_NONE, SPF_PASS autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id B64BDC4361B for ; Thu, 10 Dec 2020 23:44:39 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 8F73723117 for ; Thu, 10 Dec 2020 23:44:39 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2394402AbgLJXnw (ORCPT ); Thu, 10 Dec 2020 18:43:52 -0500 Received: from mail.kernel.org ([198.145.29.99]:59814 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2394390AbgLJXnZ (ORCPT ); Thu, 10 Dec 2020 18:43:25 -0500 X-Gm-Message-State: AOAM531yxr1vSKF5fk9UKmZzFH1rGduJxUru4SzCaeNqM/vPmv5zdkgR R85Z8zYKukmdEeaZCIcgAoAjowhg2swK92ELNzs7oA== DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1607643764; bh=aDUgcrOjJHp0mQjbB5wZvqYkGsM2qYU6k/VAqT3R+WA=; h=References:In-Reply-To:From:Date:Subject:To:Cc:From; b=FGp9IaKzuRbglOgmHOQ9u10r1PAKxq7XnvEyrgno+yD7EMzhvrpFCTTmqmM6zr3hL 70swxxWYHlGhnNeazzooH5aPD5LM2wnNvBmfe3kH0T614BKdOg1Zuf19En9CdY4+Ka HHWaSxgfsfQiqh5Kq1ctb+Hd4NJffdgAONG0PJeTxLObSF/iyTMBzodFCJLUk0XYrf DeXaHqpaS9NPaIW3KBOzcqy1BuNSo4wTzp6UJIyt6a6FM/P6CVM6ygw82v8VUKYpGm /LAg8+uC1r3pgbNrnHTEID8KGqqUph82lhnwwdocKEIn7FJr+YZIY84jIgW6EQ5x5y ukAH2EJvj2UIg== X-Google-Smtp-Source: ABdhPJziwLtAmmpO/LbhskFlxkb1P0wMIysjlNojkf8wrbc3mNp641j73CHEkz/5cnjDHvWhRXkgCyZELNAG2ldvgYw= X-Received: by 2002:a1c:1d85:: with SMTP id d127mr10897421wmd.49.1607643762611; Thu, 10 Dec 2020 15:42:42 -0800 (PST) MIME-Version: 1.0 References: <87h7ow2j91.fsf@nanos.tec.linutronix.de> <301491B7-DEB6-41ED-B8FD-657B864696CF@amacapital.net> <87v9db25me.fsf@nanos.tec.linutronix.de> <87lfe71e1z.fsf@nanos.tec.linutronix.de> In-Reply-To: <87lfe71e1z.fsf@nanos.tec.linutronix.de> From: Andy Lutomirski Date: Thu, 10 Dec 2020 15:42:30 -0800 X-Gmail-Original-Message-ID: Message-ID: Subject: Re: [PATCH v2 1/3] KVM: x86: implement KVM_{GET|SET}_TSC_STATE To: Thomas Gleixner Cc: Andy Lutomirski , Marcelo Tosatti , Maxim Levitsky , kvm list , "H. Peter Anvin" , Paolo Bonzini , Jonathan Corbet , Jim Mattson , Wanpeng Li , "open list:KERNEL SELFTEST FRAMEWORK" , Vitaly Kuznetsov , Sean Christopherson , open list , Ingo Molnar , "maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT)" , Joerg Roedel , Borislav Petkov , Shuah Khan , Andrew Jones , Oliver Upton , "open list:DOCUMENTATION" Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org > On Dec 9, 2020, at 2:14 AM, Thomas Gleixner wrote: > > But what's more problematic is the basic requirement that time all over > the place has to be consistent. > > On machines which use DISTORTED_REALTIME everything _IS_ consistent > within the distorted universe they created. It's still inconsistent > vs. the outside, but that's unsolvable and none of our problems. > > TLDR: Do not even think about opening pandoras box. This could be a per-machine/per-vm setting that nonetheless uses the same underlying implementation. There are, sadly, lots of people using smeared time, and there will probably be VM hosts that simultaneously have both styles of guest. Supporting full PV time for both would be nice. Obviously this gets a bit screwy if they are using a paravirt fs, but it=E2=80=99s also a problem with NFS, etc. So maybe the nasty corne= rs could be narrow enough to just say =E2=80=9Cdon=E2=80=99t do that=E2=80=9D. > >> If we want to get fancy, we can make a change that I've contemplated >> for a while -- we could make t_end explicit and have two copies of all >> these data structures. The reader would use one copy if t < t_change >> and a different copy if t >=3D t_change. > > See below. > >> This would allow NTP-like code in usermode to schedule a frequency >> shift to start at a specific time. > > That's an orthogonal problem and can be done without changing the > reader side. Really? Right now, the switch happens whenever the kernel takes the seqlock, which it doesn=E2=80=99t have exact control over. But I meant something a little different: > >> With some care, it would also allow the timekeeping code to update the >> data structures without causing clock_gettime() to block while the >> timekeeping code is running on a different CPU. > > It still has to block, i.e. retry, because the data set becomes invalid > when t_end is reached. So the whole thing would be: > > do { > seq =3D read_seqcount_latch(); > data =3D select_data(seq); > delta =3D read_clocksource() - data->base; > if (delta >=3D data->max_delta) > continue; > .... > } while (read_seqcount_latch_retry()); > > TBH, I like the idea for exactly one reason: It increases robustness. I like the max_delta for robustness, too. What do you have in mind for select_data()? Is the idea that select_data() returns something like &data[seq & 1]? But I actually meant something a little bit different: you would use delta >=3D data->max_delta as an indication that you need to look at the other copy. Perhaps the lower three bits of the seqcount would work like: 00: both copies are valid, but start with the first copy. 10: only the first copy is valid. 01: both copies are valid, but start with the second copy. 11: only the second copy is valid You'd read it like this (with all the bugs that I surely have fixed); do { seq =3D read_seqcount(); data =3D &data_array[seq & 1]; clock =3D read_clocksource(); delta =3D clock - data->base; if (delta->data->max_delta) { if (seq & 2) continue; data =3D &data_array[(seq + 1) & 1]; // <-- the other copy delta =3D clock - data->base; if (delta >=3D data->max_delta) continue; } ...; } while (seq =3D=3D read_seqcount()); This has two main benefits. First, it allows the timekeeping code to run concurrently with readers, which is nice for tail latency -- readers would only ever need to spin if the timekeeper falls behind, intentionally disables both copies, or somehow manages to run one iteration for each reader attempt and livelocks the reader. The latter is very unlikely.) Second, it allows the timekeeping code to literally schedule an update to occur at a precise clocksource tick, which seems to be like it could make the timekeeping code simpler and more robust. (If the timekeeper wants to simultaneously disable both copies, it sets one copy's max_delta to zero and uses seq to disable the other copy.) --Andy > > For X86 we already have the comparison for dealing with TSC < base > which would be covered by > > if (delta >=3D data->max_delta) > continue; > > automatically. Non X86 gains this extra conditional, but I think it's > worth to pay that price. > > It won't solve the VM migration problem on it's own though. You still > have to be careful about the inner workings of everything related to > timekeeping itself. > >> One other thing that might be worth noting: there's another thread >> about "vmgenid". It's plausible that it's worth considering stopping >> the guest or perhaps interrupting all vCPUs to allow it to take some >> careful actions on migration for reasons that have nothing to do with >> timekeeping. > > How surprising. Who could have thought about that? > > OMG, virtualization seems to have gone off into a virtual reality long > ago. > > Thanks, > > tglx >