Re: kvmclock doesn't work, help?

From: Andy Lutomirski <luto@amacapital.net>
To: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>,
	kvm list <kvm@vger.kernel.org>, Radim Krcmar <rkrcmar@redhat.com>,
	X86 ML <x86@kernel.org>, John Stultz <john.stultz@linaro.org>
Subject: Re: kvmclock doesn't work, help?
Date: Mon, 21 Dec 2015 14:49:25 -0800
Message-ID: <CALCETrWrSQ4JZ74Qt4f2QZUCxFP2HcXkPEZWD6vws307Yxj_YQ@mail.gmail.com>
In-Reply-To: <20151218214928.GA32177@amt.cnet>

On Fri, Dec 18, 2015 at 1:49 PM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> On Fri, Dec 18, 2015 at 12:25:11PM -0800, Andy Lutomirski wrote:
>> [cc: John Stultz -- maybe you have ideas on how this should best
>> integrate with the core code]
>>
>> On Fri, Dec 18, 2015 at 11:45 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:

>> > Can you write an actual proposal (with details) that accommodates the
>> > issue described at "Assuming a stable TSC across physical CPUS, and a
>> > stable TSC" ?
>> >
>> > Yes it would be nicer, the IPIs (to stop the vcpus) are problematic for
>> > realtime guests.
>>
>> This shouldn't require many details, and I don't think there's an ABI
>> change.  The rules are:
>>
>> When the overall system timebase changes (e.g. when the selected
>> clocksource changes or when update_pvclock_gtod is called), the KVM
>> host would:
>>
>> optionally: preempt_disable();  /* for performance */
>>
>> for all vms {
>>
>>   for all registered pvti structures {
>>     pvti->version++;  /* should be odd now */
>>   }
>
> pvti is userspace data, so you have to pin it before?

Yes.

Fortunately, I think most systems have only one page of pvti
structures (unless there are a ton of vcpus), so the performance
impact should be negligible.

>
>>   /* Note: right now, any vcpu that tries to access pvti will start
>> infinite looping.  We should add cpu_relax() to the guests. */
>>
>>   for all registered pvti structures {
>>     update everything except pvti->version;
>>   }
>>
>>   for all registered pvti structures {
>>     pvti->version++;  /* should be even now */
>>   }
>>
>>   cond_resched();
>> }
>>
>> Is this enough detail?  This should work with all existing guests,
>> too, unless there's a buggy guest out there that actually fails to
>> double-check version.
>
> What is the advantage of this over the brute force method, given
> that guests will busy spin?
>
> (busy spin is equally problematic as IPI for realtime guests).

I disagree.  It's never been safe to call clock_gettime from an RT
task and expect a guarantee of real-time performance.  We could fix
that, but it's not even safe on non-KVM.

Sending an IPI *always* stalls the task.  Taking a lock (which is
effectively what this is doing) only stalls the tasks that contend for
the lock, which, most of the time, means that nothing stalls.

Also, if the host disables preemption or otherwise boosts its priority
while version is odd, then the actual stall will be very short, in
contrast to an IPI-induced stall, which will be much, much longer.
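
For concreteness, here is a minimal user-space sketch of the version
protocol described above, written with C11 atomics.  Every name in it
(demo_pvti, demo_update_all, demo_pvti_try_read, and so on) is made up
for illustration; this is not the KVM host code or the guest pvclock
reader, and it leaves out page pinning, preemption control, and
cpu_relax().  It only shows why a reader stalls solely when it races
with an update:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for one pvti structure. */
struct demo_pvti {
    _Atomic uint32_t version;        /* odd while an update is in flight */
    _Atomic uint64_t tsc_timestamp;
    _Atomic uint64_t system_time;
};

/* Writer, phase 1: make version odd so readers know to retry. */
static void demo_pvti_begin_update(struct demo_pvti *p)
{
    uint32_t v = atomic_fetch_add(&p->version, 1) + 1;
    assert(v & 1);                   /* should be odd now */
    (void)v;
}

/* Writer, phase 3: make version even again; the data is consistent. */
static void demo_pvti_end_update(struct demo_pvti *p)
{
    uint32_t v = atomic_fetch_add(&p->version, 1) + 1;
    assert(!(v & 1));                /* should be even now */
    (void)v;
}

/* Update every registered structure in three passes, as in the proposal. */
static void demo_update_all(struct demo_pvti *pvtis, size_t n,
                            uint64_t tsc, uint64_t ns)
{
    for (size_t i = 0; i < n; i++)
        demo_pvti_begin_update(&pvtis[i]);

    for (size_t i = 0; i < n; i++) {
        /* update everything except version */
        atomic_store(&pvtis[i].tsc_timestamp, tsc);
        atomic_store(&pvtis[i].system_time, ns);
    }

    for (size_t i = 0; i < n; i++)
        demo_pvti_end_update(&pvtis[i]);
}

/*
 * Reader: one attempt.  Fails if an update is in progress (odd version)
 * or if the version changed while the fields were being read.
 */
static bool demo_pvti_try_read(struct demo_pvti *p,
                               uint64_t *tsc, uint64_t *ns)
{
    uint32_t v = atomic_load(&p->version);
    if (v & 1)
        return false;

    uint64_t t = atomic_load(&p->tsc_timestamp);
    uint64_t s = atomic_load(&p->system_time);

    if (atomic_load(&p->version) != v)   /* the double-check of version */
        return false;

    *tsc = t;
    *ns = s;
    return true;
}

/* Reader: spin until a consistent snapshot is obtained. */
static void demo_pvti_read(struct demo_pvti *p, uint64_t *tsc, uint64_t *ns)
{
    while (!demo_pvti_try_read(p, tsc, ns))
        ;   /* a real guest would want cpu_relax() here */
}
```

The sketch uses the default sequentially consistent atomics to keep
the ordering argument simple; real code would presumably use weaker,
architecture-specific barriers.  The point is that demo_pvti_read()
only spins during the window between the first and third passes of
demo_update_all(), which is the window the host can keep short.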

--Andy