From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754889AbcIOVC4 (ORCPT ); Thu, 15 Sep 2016 17:02:56 -0400 Received: from mx1.redhat.com ([209.132.183.28]:54002 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751337AbcIOVCt (ORCPT ); Thu, 15 Sep 2016 17:02:49 -0400 Subject: Re: [PATCH 2/2] x86, kvm: use kvmclock to compute TSC deadline value To: =?UTF-8?B?UmFkaW0gS3LEjW3DocWZ?= References: <1473200999-123004-1-git-send-email-pbonzini@redhat.com> <1473200999-123004-3-git-send-email-pbonzini@redhat.com> <20160915150851.GA15815@potion> <20160915195949.GA17095@potion> Cc: linux-kernel@vger.kernel.org, kvm@vger.kernel.org, dmatlack@google.com, luto@kernel.org, peterhornyack@google.com, x86@kernel.org From: Paolo Bonzini Message-ID: Date: Thu, 15 Sep 2016 23:02:43 +0200 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.2.0 MIME-Version: 1.0 In-Reply-To: <20160915195949.GA17095@potion> Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.26]); Thu, 15 Sep 2016 21:02:48 +0000 (UTC) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 15/09/2016 21:59, Radim Krčmář wrote: > 2016-09-15 18:00+0200, Paolo Bonzini: >> So virtual_tsc_khz does change on migration, at least if your host >> doesn't have TSC scaling (which is pretty much all Intel hosts in >> existence). > > Ugh, I'd consider exposing constant TSC to the guest (which depends > solely on CPUID -- modern models have it), allowing migration, and not > preserving the TSC rate as a userspace bug. Technically it is, but without hardware help I guess it's justified... > Invariant TSC seems to prevent migration and the only thing that changed > between constant and invariant TSC is that invariant TSC doesn't stop in > guest-controlled deep C states. > > Hm, and Linux uses TSC_DEADLINE_TIMER without invariant TSC: setting a > timer and halting could mean that the timer never fires ... > maybe real hardware always has both, so it is an implicit condition? Yeah, I guess so. And the kvmclock-based clockevent would not be able to fix this. So we also need to blacklist the TSC deadline timer if kvmclock doesn't set PVCLOCK_TSC_STABLE_BIT. >> Safety against TSC rate changes is pretty much the reason >> why kvmclock exists: it used to be that TSC rate changed on pCPU >> migration, now it's only host migration but the idea is the same. :) > > Yep. I think that TSC shouldn't have been allowed outside of kvmclock. Luckily TSC is only used by kvmclock (which is okay) and... the TSC deadline timer. Also by nested KVM's kvmclock, but that's the last of our worries I guess. So we're close. >>> When we are already going the paravirtual route, we could add an >>> interface that accepts the deadline in kvmclock nanoseconds. >>> It would be much more maintanable than adding a fragile paravirtual >>> layer on top of random interfaces. >> >> Good idea. > > I'll prepare a prototype. So how would this work? A single MSR, used after setting TSC deadline mode in LVTT? Could you write it and read TSC deadline or vice versa? My idea would be "yes" for writing nsec deadline and reading TSC deadline, but "no" for writing TSC deadline and reading nsec deadline. In the latter case, reading nsec deadline might return an impossible value such as -1; this lets userspace decide whether to set a nsec-based deadline or a TSC-based deadline after migration. If you have other ideas, I'm all ears! >> This still wouldn't handle old hosts of course. > > The question is whether we want to carry around 150 LOC because of old > hosts. I'd just fix Linux to avoid deadline TSC without invariant TSC. > :) Yes, that would automatically blacklist it on KVM. You'd also need to update the recent optimization to the TSC deadline timer, to also work on other APIC timer modes or at least in your new PV mode. >>> Don't we completely replace the native apic timer, so it can't even be >>> registered? The comment doesn't explain what we expect from the call, >>> so it's a bit confusing. >>> >>> I think the call expects that per_cpu(lapic_events, cpu).event_handler >>> == NULL, so we do it to print the warning and disable the LAPIC timer. >> >> This pvop is always called instead of native_local_apic_timer_interrupt. >> We need to defer to the native implementation if the kvmclock >> clocksource is not in use, for example if there's no TSC deadline timer >> in the guest. > > No, the pvop is for that. If there is no TSC deadline timer, then the > pvop will call native_local_apic_timer_interrupt(), because we don't > overwrite it: > >>>> + if (boot_cpu_has(X86_FEATURE_TSC_DEADLINE_TIMER) && >>>> + !disable_apic && !disable_apic_timer) { >>>> + pv_time_ops.local_apic_timer_interrupt = kvmclock_local_apic_timer_interrupt; >>>> + x86_init.timers.setup_percpu_clockev = setup_kvmclock_timer; >>>> + x86_cpuinit.setup_percpu_clockev = setup_kvmclock_timer; >>>> + } >> >> So I should do s/hasn't been setup yet/isn't in use/. > > Worse that no comment, IMO. ;) > > I still think it is only to call this block in > native_local_apic_timer_interrupt(): > > if (!evt->event_handler) { > pr_warning("Spurious LAPIC timer interrupt on cpu %d\n", cpu); > /* Switch it off */ > lapic_timer_shutdown(evt); > return; > } No, it was needed for something else. :) I just don't recall for what anymore, since your observation on pv_time_ops.local_apic_timer_interrupt is obviously correct. Paolo