From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1754889AbcIOVC4 (ORCPT <rfc822;w@1wt.eu>);
        Thu, 15 Sep 2016 17:02:56 -0400
Received: from mx1.redhat.com ([209.132.183.28]:54002 "EHLO mx1.redhat.com"
        rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
        id S1751337AbcIOVCt (ORCPT <rfc822;linux-kernel@vger.kernel.org>);
        Thu, 15 Sep 2016 17:02:49 -0400
Subject: Re: [PATCH 2/2] x86, kvm: use kvmclock to compute TSC deadline value
To: =?UTF-8?B?UmFkaW0gS3LEjW3DocWZ?= <rkrcmar@redhat.com>
References: <1473200999-123004-1-git-send-email-pbonzini@redhat.com>
 <1473200999-123004-3-git-send-email-pbonzini@redhat.com>
 <20160915150851.GA15815@potion>
 <f737af67-927c-933e-d7b9-3c5320dd0b25@redhat.com>
 <20160915195949.GA17095@potion>
Cc: linux-kernel@vger.kernel.org, kvm@vger.kernel.org, dmatlack@google.com,
        luto@kernel.org, peterhornyack@google.com, x86@kernel.org
From: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <c61d107c-5f08-af75-1183-b6bc5a4b7651@redhat.com>
Date: Thu, 15 Sep 2016 23:02:43 +0200
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101
 Thunderbird/45.2.0
MIME-Version: 1.0
In-Reply-To: <20160915195949.GA17095@potion>
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.26]); Thu, 15 Sep 2016 21:02:48 +0000 (UTC)
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org


On 15/09/2016 21:59, Radim Krčmář wrote:
> 2016-09-15 18:00+0200, Paolo Bonzini:
>> So virtual_tsc_khz does change on migration, at least if your host
>> doesn't have TSC scaling (which is pretty much all Intel hosts in
>> existence).
> 
> Ugh, I'd consider exposing constant TSC to the guest (which depends
> solely on CPUID -- modern models have it), allowing migration, and not
> preserving the TSC rate as a userspace bug.

Technically it is, but without hardware help I guess it's justified...

> Invariant TSC seems to prevent migration and the only thing that changed
> between constant and invariant TSC is that invariant TSC doesn't stop in
> guest-controlled deep C states.
> 
> Hm, and Linux uses TSC_DEADLINE_TIMER without invariant TSC: setting a
> timer and halting could mean that the timer never fires ...
> maybe real hardware always has both, so it is an implicit condition?

Yeah, I guess so.  And the kvmclock-based clockevent would not be able
to fix this.  So we also need to blacklist the TSC deadline timer if
kvmclock doesn't set PVCLOCK_TSC_STABLE_BIT.

>>              Safety against TSC rate changes is pretty much the reason
>> why kvmclock exists: it used to be that TSC rate changed on pCPU
>> migration, now it's only host migration but the idea is the same. :)
> 
> Yep.  I think that TSC shouldn't have been allowed outside of kvmclock.

Luckily TSC is only used by kvmclock (which is okay) and... the TSC
deadline timer.  Also by nested KVM's kvmclock, but that's the last of
our worries I guess.  So we're close.

>>> When we are already going the paravirtual route, we could add an
>>> interface that accepts the deadline in kvmclock nanoseconds.
>>> It would be much more maintanable than adding a fragile paravirtual
>>> layer on top of random interfaces.
>>
>> Good idea.
> 
> I'll prepare a prototype.

So how would this work?  A single MSR, used after setting TSC deadline
mode in LVTT?  Could you write it and read TSC deadline or vice versa?

My idea would be "yes" for writing nsec deadline and reading TSC
deadline, but "no" for writing TSC deadline and reading nsec deadline.
In the latter case, reading nsec deadline might return an impossible
value such as -1; this lets userspace decide whether to set a nsec-based
deadline or a TSC-based deadline after migration.  If you have other
ideas, I'm all ears!

>>             This still wouldn't handle old hosts of course.
> 
> The question is whether we want to carry around 150 LOC because of old
> hosts.  I'd just fix Linux to avoid deadline TSC without invariant TSC.
> :)

Yes, that would automatically blacklist it on KVM.  You'd also need to
update the recent optimization to the TSC deadline timer, to also work
on other APIC timer modes or at least in your new PV mode.

>>> Don't we completely replace the native apic timer, so it can't even be
>>> registered?  The comment doesn't explain what we expect from the call,
>>> so it's a bit confusing.
>>>
>>> I think the call expects that per_cpu(lapic_events, cpu).event_handler
>>> == NULL, so we do it to print the warning and disable the LAPIC timer.
>>
>> This pvop is always called instead of native_local_apic_timer_interrupt.
>>  We need to defer to the native implementation if the kvmclock
>> clocksource is not in use, for example if there's no TSC deadline timer
>> in the guest.
> 
> No, the pvop is for that.  If there is no TSC deadline timer, then the
> pvop will call native_local_apic_timer_interrupt(), because we don't
> overwrite it:
> 
>>>> +	if (boot_cpu_has(X86_FEATURE_TSC_DEADLINE_TIMER) &&
>>>> +	    !disable_apic && !disable_apic_timer) {
>>>> +		pv_time_ops.local_apic_timer_interrupt = kvmclock_local_apic_timer_interrupt;
>>>> +		x86_init.timers.setup_percpu_clockev = setup_kvmclock_timer;
>>>> +		x86_cpuinit.setup_percpu_clockev = setup_kvmclock_timer;
>>>> +	}
>>
>> So I should do s/hasn't been setup yet/isn't in use/.
> 
> Worse that no comment, IMO. ;)
> 
> I still think it is only to call this block in
> native_local_apic_timer_interrupt():
> 
>  	if (!evt->event_handler) {
>  		pr_warning("Spurious LAPIC timer interrupt on cpu %d\n", cpu);
>  		/* Switch it off */
>  		lapic_timer_shutdown(evt);
>  		return;
>  	}

No, it was needed for something else. :)  I just don't recall for what
anymore, since your observation on
pv_time_ops.local_apic_timer_interrupt is obviously correct.

Paolo