From mboxrd@z Thu Jan  1 00:00:00 1970
From: Radim Krčmář
Subject: Re: vtime accounting
Date: Wed, 15 Mar 2017 16:57:24 +0100
Message-ID: <20170315155723.GA14081@potion>
References: <20170308105700.GA109453@lvm>
 <20170313162259.GE18298@potion>
 <20170314082601.GC1277@cbox>
 <20170314165858.GA5435@potion>
 <20170314183913.GF1277@cbox>
 <20170314202721.GD5432@potion>
 <20170315084303.GJ1277@cbox>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Cc: Christoffer Dall, Paolo Bonzini, kvm@vger.kernel.org, Marc Zyngier,
 Rik van Riel
To: Christoffer Dall
Content-Disposition: inline
In-Reply-To: <20170315084303.GJ1277@cbox>
Sender: kvm-owner@vger.kernel.org

2017-03-15 09:43+0100, Christoffer Dall:
> On Tue, Mar 14, 2017 at 09:27:22PM +0100, Radim Krčmář wrote:
>> 2017-03-14 19:39+0100, Christoffer Dall:
>> > On Tue, Mar 14, 2017 at 05:58:59PM +0100, Radim Krčmář wrote:
>> >> 2017-03-14 09:26+0100, Christoffer Dall:
>> >> > On Mon, Mar 13, 2017 at 06:28:16PM +0100, Radim Krčmář wrote:
>> >> >> 2017-03-08 02:57-0800, Christoffer Dall:
>> >> >> > Hi Paolo,
>> >> >> >
>> >> >> > I'm looking at improving KVM/ARM a bit by calling guest_exit_irqoff
>> >> >> > before enabling interrupts when coming back from the guest.
>> >> >> >
>> >> >> > Unfortunately, this appears to mess up my view of CPU usage using
>> >> >> > something like htop on the host, because it appears all time is spent
>> >> >> > inside the kernel.
>> >> >> >
>> >> >> > From my analysis, I think this is because we never handle any interrupts
>> >> >> > before enabling interrupts, where the x86 code does its
>> >> >> > handle_external_intr, and the result on ARM is that we never increment
>> >> >> > jiffies before doing the vtime accounting.
>> >> >>
>> >> >> (Hm, the counting might be broken on nohz_full then.)
>> >> >>
>> >> >
>> >> > Don't you still have a scheduler tick even with nohz_full and something
>> >> > that will eventually update jiffies then?
>> >>
>> >> Probably, I don't understand jiffies accounting too well and didn't see
>> >> anything that would bump the jiffies in or before guest_exit_irqoff().
>> >
>> > As far as I understand, from my very very short time of looking at the
>> > timer code, jiffies are updated on every tick, which can be caused by a
>> > number of events, including *any* interrupt handler (coming from idle
>> > state), soft timers, timer interrupts, and possibly other things.
>>
>> Yes, I was thinking that entering/exiting user mode should trigger it as
>> well, in order to correctly account for time spent there, but couldn't
>> find it ...
>
> Isn't it based on taking the timer interrupt when running in userspace
> as opposed to when running in the kernel?

The timer interrupt usually accounts it, but there is no timer interrupt
when tickless, so there has to be accounting somewhere on the border ...
__context_tracking_enter/exit(CONTEXT_USER) calls vtime_user_enter/exit()
to do the accounting in that case, but it expects that jiffies were
updated by someone else.
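Something like this, if I read kernel/context_tracking.c and
kernel/sched/cputime.c right (a heavily simplified sketch, not the exact
code -- the real thing also carves out irq and steal time):

void __context_tracking_enter(enum ctx_state state)
{
        if (state == CONTEXT_USER)
                vtime_user_enter(current);      /* close the system-time slice */
        /* ... mark the CPU as being in CONTEXT_USER ... */
}

/* the delta that vtime_user_enter/exit() account is derived from jiffies,
 * so it stays 0 unless something else (the tick, or
 * tick_nohz_update_jiffies() from another interrupt) bumped jiffies: */
static u64 get_vtime_delta(struct task_struct *tsk)
{
        unsigned long now = READ_ONCE(jiffies); /* no clocksource read here */
        u64 delta = jiffies_to_nsecs(now - tsk->vtime_snap);

        tsk->vtime_snap = now;
        return delta;
}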
>> The case I was wondering about is if the kernel spent e.g. 10 jiffies in
>> guest mode and then exited on mmio -- no interrupt in the host, and
>> guest_exit_irqoff() would flip accounting over to system time.
>> Can those 10 jiffies get accounted to system (not guest) time?
>
> Yes, I think the key is whether you end up taking a timer interrupt
> before or after switching PF_VCPU.  So you can spend X jiffies in the
> guest, come back, change PF_VCPU (timer still hasn't expired), and then
> the timer expires immediately afterwards, and the whole block of jiffies
> that is incremented as a result of the timer gets accounted as kernel
> time.

Wouldn't that be a result of a bug?  (The timer period is 1 jiffy, so the
VM would get kicked before getting to 2 jiffies in guest mode, and
spending so much time in between seems like a bug.  10 jiffies could be
spent in a guest while under nohz_full, but there is still at least one
CPU with a HZ timer that updates the global jiffies, so I assume that
guest_exit() should see at least 9 jiffies.)

> (Note that jiffies on clocksourced architectures is an arbitrary number,
> which somehow scales according to the clocksource frequency, depending
> on the CONFIG_HZ, so a single timer interrupt can increment jiffies by
> more than 1.)
>
>> Accounting 1 jiffy wrong is normal as we expect that the distribution of
>> guest/kernel time in the jiffy is going to be approximated over a longer
>> sampling period, but if we could account multiple jiffies wrong, this
>> expectation is very hard to defend.
>>
>
> I think it's a best effort, and we expect inaccuracies over much more
> than a single jiffy, see above.
>
>> >> >> > So my current idea is to increment jiffies according to the clocksource
>> >> >> > before calling guest_exit_irqoff, but this would require some changes
>> >> >> > to the main clocksource infrastructure.
>> >> >>
>> >> >> This seems similar to calling the function from the timer interrupt.
>> >> >> The timer interrupt would be delivered after that and would only waste
>> >> >> time, so it might actually be slower than just delivering it before ...
>> >> >
>> >> > That's assuming that the timer interrupt hits at every exit.  I don't
>> >> > think that's the case, but I should measure it.
>> >>
>> >> There cannot be fewer VM exits, and I think there are far more VM exits,
>> >> but if there was no interrupt, then jiffies shouldn't rise and we would
>> >> get the same result as with plain guest_exit_irqoff().
>> >>
>> >
>> > That's true if you're guaranteed to take the timer interrupts that
>> > happen while running the guest before hitting guest_exit_irqoff(), so
>> > that you eventually count *some* time for the guest.  In the arm64 case,
>> > if we just do guest_exit_irqoff(), we *never* account any time to the
>> > guest.
>>
>> Yes.  I assumed that if jiffies should be incremented, then you also get
>> a timer tick,
>
> I think jiffies can be incremented in a number of cases, and as I
> understand it, when the kernel folks talk about "the tick" that doesn't
> necessarily mean that your scheduler timer fires an interrupt, but I
> could be wrong.

Good point.  do_timer() (the only place to update jiffies, I hope) is
called from:

  tick_periodic()
  xtime_update()
  tick_do_update_jiffies64()

The first two seem to coincide with a specific timer interrupt.
The last one is called from:

  tick_sched_do_timer()       [timer interrupt]
  tick_nohz_update_jiffies()  [any interrupt]
  do_idle()                   [not an interrupt, but also never hit after a VM exit]

Looks like jiffies are mostly timer-interrupt based, but if you have
nohz, then there are also other interrupts and switches from idle.

>> so checking whether jiffies should be incremented inside
>> guest_exit_irqoff() was only slowing down the common case, where jiffies
>> remained the same.
>
> My original suggestion was to not use jiffies at all, but use ktime, on
> architectures that don't do handle_external_intr() type things, which
> would result in a clocksource read on every exit, which on ARM I think
> is faster than our enable/disable interrupt dance we do at the moment.

I didn't catch that, sorry.

> But of course, I'd have to measure that.

The accounting would need to change to ktime even for the userspace
switches to avoid double-accounting time, and I think that ktime was
abandoned there because it was too slow ... definitely a lot of
benchmarking.

>> (Checking if jiffies should be incremented is quite slow by itself.)
>>
>
> Really?  I think that depends on the architecture.  Reading the counter
> on ARM is supposed to be cheap, I think.

True, and it also depends on what slow means. :)
IIRC, it takes around 50 cycles to read TSC-based ktime on x86, which
would be roughly 2% of a current VM exit/entry loop.
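For clarity, I think the ktime variant would look roughly like this
(just a sketch, not existing code; guest_enter_ns and
account_guest_time_ns() are made up for illustration, and the
userspace-switch side would need matching changes to avoid double
accounting -- with plain tick-based accounting, guest time is only
charged from the tick, when account_system_time() sees PF_VCPU):

static void guest_exit_account_ktime(struct kvm_vcpu *vcpu)
{
        u64 now = ktime_get_ns();               /* clocksource read on every exit */
        u64 delta = now - vcpu->guest_enter_ns; /* snapshotted just before entry */

        vcpu->guest_enter_ns = now;
        account_guest_time_ns(current, delta);
}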
>> >> > I assume there's a good reason why we call guest_enter() and
>> >> > guest_exit() in the hot path on every KVM architecture?
>> >>
>> >> I consider myself biased when it comes to jiffies, so no judgement. :)
>> >>
>> >> From what I see, the mode switch is used only for statistics.
>> >> The original series is
>> >>
>> >> 5e84cfde51cf303d368fcb48f22059f37b3872de~1..d172fcd3ae1ca7ac27ec8904242fd61e0e11d332
>> >>
>> >> It didn't introduce the overhead with the interrupt window and it didn't
>> >> count host kernel irq time as user time, so it was better at that time.
>> >
>> > Yes, but it was based on cputime_to... functions, which I understand use
>> > ktime, which on systems running KVM will most often read the clocksource
>> > directly from the hardware, and that was then optimized later to just
>> > use jiffies to avoid the clocksource read, because jiffies is already in
>> > memory and adjusted to the granularity we need, so in some sense an
>> > improvement, only it doesn't work if you don't update jiffies when
>> > needed.
>>
>> True.  The kvm_guest_enter/exit still didn't trigger accounting, but the
>> granularity was better if there were other sources of accounting than
>> just timer ticks.
>
> I think the granularity was better, yes, but I wonder if that improved
> accuracy was ever visible to the user, given that the CPU time may be
> counted in jiffies?

I think the time for statistics used to be in cycles before it switched
to nanoseconds, so the user could have seen some difference.

>> And I noticed another funny feature: the original intention was that if
>> the guest uses hardware acceleration, then the whole timeslice gets
>> accounted to guest/user time, because kvm_guest_exit() was not supposed
>> to clear PF_VCPU: https://lkml.org/lkml/2007/10/15/105
>
> I don't see how this works on a current kernel though, because I can't
> find any path beyond the kvm_guest_exit() which clears the flag at the
> moment.

Right, the flag used to be cleared in account_system_time().

>> This somehow got mangled when merging and later there was a fix that
>> introduced the current behavior, which might be slightly more accurate:
>> 0552f73b9a81 ("KVM: Move kvm_guest_exit() after local_irq_enable()")
>
> So Laurent was confused as well?

It looks like Laurent understood that you need the timer tick before
clearing the flag.  No idea how it really happened.

> It seems he needed to add that before you guys added
> handle_external_intr()?

Yes.
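To sum up the ordering that this thread keeps coming back to (a very
rough sketch, not the actual code; the arm64 part is the proposed
guest_exit_irqoff() variant as I understand it):

        /* x86 today, simplified: a tick that fired while the guest ran is
         * injected by handle_external_intr() before we leave guest context,
         * so the elapsed jiffies get accounted as guest time. */
        guest_enter_irqoff();
        kvm_x86_ops->run(vcpu);
        kvm_x86_ops->handle_external_intr(vcpu);
        guest_exit_irqoff();
        local_irq_enable();

        /* proposed arm64, simplified: no host interrupt can run between guest
         * entry and guest_exit_irqoff(), so jiffies has not moved since entry
         * and the whole slice later shows up as system time. */
        guest_enter_irqoff();
        kvm_call_hyp(__kvm_vcpu_run, vcpu);
        guest_exit_irqoff();
        local_irq_enable();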