All of lore.kernel.org
 help / color / mirror / Atom feed
* vtime accounting
@ 2017-03-08 10:57 Christoffer Dall
  2017-03-09  8:16 ` Paolo Bonzini
  2017-03-13 17:28 ` Radim Krčmář
  0 siblings, 2 replies; 26+ messages in thread
From: Christoffer Dall @ 2017-03-08 10:57 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: kvm, Marc Zyngier

Hi Paolo,

I'm looking at improving KVM/ARM a bit by calling guest_exit_irqoff
before enabling interrupts when coming back from the guest.

Unfortunately, this appears to mess up my view of CPU usage using
something like htop on the host, because it appears all time is spent
inside the kernel.

From my analysis, I think this is because we never handle any interrupts
before enabling interrupts, where the x86 code does its
handle_external_intr, and the result on ARM is that we never increment
jiffies before doing the vtime accounting.

So my current idea is to increment jiffies according to the clocksource
before calling guest_exit_irqoff, but this would require some main
clocksource infrastructure changes.
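To illustrate the failure mode, here is a toy userspace model (all names
are stand-ins, not the real kernel implementation): vtime is accounted at
jiffies granularity, so if nothing advances jiffies between guest entry
and guest_exit_irqoff(), the computed guest delta is always zero:

```c
#include <assert.h>

/* Toy model: jiffies only advances when a timer interrupt is handled. */
static unsigned long jiffies;
static unsigned long vtime_snap;   /* jiffies snapshot at guest entry */

static void timer_interrupt(void) { jiffies++; }

/* ARM-style exit: accounting runs before any interrupt is handled,
 * so the tick that fired during the guest is charged to the kernel. */
static unsigned long arm_exit_path(void)
{
    jiffies = 0;
    vtime_snap = jiffies;                        /* guest_enter_irqoff() */
    /* guest runs for a full tick; the timer IRQ stays pending */
    unsigned long guest_time = jiffies - vtime_snap; /* guest_exit_irqoff() */
    timer_interrupt();                           /* handled too late */
    return guest_time;
}

/* x86-style exit: handle_external_intr() delivers the pending timer
 * interrupt while the entry snapshot is still live, so the tick is
 * credited to the guest. */
static unsigned long x86_exit_path(void)
{
    jiffies = 0;
    vtime_snap = jiffies;                        /* guest_enter_irqoff() */
    timer_interrupt();                           /* handle_external_intr() */
    return jiffies - vtime_snap;                 /* guest_exit_irqoff() */
}
```

In the first ordering the guest is never credited any time at all, which
matches the "all time is spent inside the kernel" symptom in htop.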

My question is: how important is the vtime accounting on the host from
your point of view?  Worth poking the timekeeping folks about or even
trying to convince ourselves that the handle_external_intr thing is
worth it?

Thanks,
-Christoffer

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: vtime accounting
  2017-03-08 10:57 vtime accounting Christoffer Dall
@ 2017-03-09  8:16 ` Paolo Bonzini
  2017-03-13 17:28 ` Radim Krčmář
  1 sibling, 0 replies; 26+ messages in thread
From: Paolo Bonzini @ 2017-03-09  8:16 UTC (permalink / raw)
  To: Christoffer Dall; +Cc: kvm, Marc Zyngier, Rik van Riel



On 08/03/2017 11:57, Christoffer Dall wrote:
> Hi Paolo,
> 
> I'm looking at improving KVM/ARM a bit by calling guest_exit_irqoff
> before enabling interrupts when coming back from the guest.
> 
> Unfortunately, this appears to mess up my view of CPU usage using
> something like htop on the host, because it appears all time is spent
> inside the kernel.
> 
> From my analysis, I think this is because we never handle any interrupts
> before enabling interrupts, where the x86 code does its
> handle_external_intr, and the result on ARM is that we never increment
> jiffies before doing the vtime accounting.
> 
> So my current idea is to increment jiffies according to the clocksource
> before calling guest_exit_irqoff, but this would require some main
> clocksource infrastructure changes.
> 
> My question is: how important is the vtime accounting on the host from
> your point of view?  Worth poking the timekeeping folks about or even
> trying to convince ourselves that the handle_external_intr thing is
> worth it?

Not really my area, so let's ask Rik. :)

Paolo

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: vtime accounting
  2017-03-08 10:57 vtime accounting Christoffer Dall
  2017-03-09  8:16 ` Paolo Bonzini
@ 2017-03-13 17:28 ` Radim Krčmář
  2017-03-14  8:26   ` Christoffer Dall
  1 sibling, 1 reply; 26+ messages in thread
From: Radim Krčmář @ 2017-03-13 17:28 UTC (permalink / raw)
  To: Christoffer Dall; +Cc: Paolo Bonzini, kvm, Marc Zyngier, Rik van Riel

2017-03-08 02:57-0800, Christoffer Dall:
> Hi Paolo,
> 
> I'm looking at improving KVM/ARM a bit by calling guest_exit_irqoff
> before enabling interrupts when coming back from the guest.
> 
> Unfortunately, this appears to mess up my view of CPU usage using
> something like htop on the host, because it appears all time is spent
> inside the kernel.
> 
> From my analysis, I think this is because we never handle any interrupts
> before enabling interrupts, where the x86 code does its
> handle_external_intr, and the result on ARM is that we never increment
> jiffies before doing the vtime accounting.

(Hm, the counting might be broken on nohz_full then.)

> So my current idea is to increment jiffies according to the clocksource
> before calling guest_exit_irqoff, but this would require some main
> clocksource infrastructure changes.

This seems similar to calling the function from the timer interrupt.
The timer interrupt would be delivered after that and would only waste
time, so it might actually be slower than just delivering it first ...

How expensive is the interrupt enable/disable cycle that this
optimization saves?

> My question is: how important is the vtime accounting on the host from
> your point of view?

No idea.  I'd keep the same behavior on all architectures, though.

The precision of accounting is in jiffies (millions of cycles), so we
could maybe move it from the hot path to vcpu_load/put(?) without
affecting the count in usual cases ...

>                      Worth poking the timekeeping folks about or even
> trying to convince ourselves that the handle_external_intr thing is
> worth it?

handle_external_intr() needs some hardware support in order to be more
than a worthless 'local_irq_enable(); local_irq_disable()' ...

e.g. VMX doesn't queue the interrupt that caused a VM exit in the
interrupt controller.  (VMX's handle_external_intr() doesn't deliver any
interrupts other than those that might have become pending after the
exit, but this race is not very important due to accounting granularity
...)
Alternatively, the interrupt controller would need to support dequeuing
without delivery (but delivering in software might still be slower than
cycling interrupts on and off).
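The difference can be sketched with a toy model (hypothetical names, not
VMX or GIC code): with acknowledge-on-exit style hardware the pending
vector is handed to the hypervisor for direct software dispatch, while
without it the only way to drain the interrupt is to open the interrupt
window:

```c
/* Toy model of an interrupt controller with one pending vector. */
static int pending_vector = -1;
static int irq_window_opened;      /* times we had to enable IRQs */
static int handled_vector = -1;

static void dispatch(int vec) { handled_vector = vec; }

/* VMX-style "acknowledge interrupt on exit": hardware latches the
 * vector in the exit info, and software calls the handler directly
 * with interrupts still disabled. */
static void exit_with_ack_on_exit(void)
{
    int vec = pending_vector;      /* latched by hardware at VM exit */
    pending_vector = -1;
    dispatch(vec);                 /* software dispatch, IRQs stay off */
}

/* Fallback without hardware support: briefly enable interrupts so the
 * CPU itself delivers the vector through the vector table. */
static void exit_without_ack_on_exit(void)
{
    irq_window_opened++;           /* local_irq_enable() */
    dispatch(pending_vector);      /* CPU takes the interrupt */
    pending_vector = -1;
    /* local_irq_disable() */
}
```

Both paths end up in the same handler; the difference is whether the
interrupt window had to be cycled to get there.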

Thanks.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: vtime accounting
  2017-03-13 17:28 ` Radim Krčmář
@ 2017-03-14  8:26   ` Christoffer Dall
  2017-03-14  8:55     ` Marc Zyngier
                       ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: Christoffer Dall @ 2017-03-14  8:26 UTC (permalink / raw)
  To: Radim Krčmář
  Cc: Christoffer Dall, Paolo Bonzini, kvm, Marc Zyngier, Rik van Riel

On Mon, Mar 13, 2017 at 06:28:16PM +0100, Radim Krčmář wrote:
> 2017-03-08 02:57-0800, Christoffer Dall:
> > Hi Paolo,
> > 
> > I'm looking at improving KVM/ARM a bit by calling guest_exit_irqoff
> > before enabling interrupts when coming back from the guest.
> > 
> > Unfortunately, this appears to mess up my view of CPU usage using
> > something like htop on the host, because it appears all time is spent
> > inside the kernel.
> > 
> > From my analysis, I think this is because we never handle any interrupts
> > before enabling interrupts, where the x86 code does its
> > handle_external_intr, and the result on ARM is that we never increment
> > jiffies before doing the vtime accounting.
> 
> (Hm, the counting might be broken on nohz_full then.)
> 

Don't you still have a scheduler tick even with nohz_full and something
that will eventually update jiffies then?

> > So my current idea is to increment jiffies according to the clocksource
> > before calling guest_exit_irqoff, but this would require some main
> > clocksource infrastructure changes.
> 
> This seems similar to calling the function from the timer interrupt.
> The timer interrupt would be delivered after that and only wasted time,
> so it might actually be slower than just delivering it before ...

That's assuming that the timer interrupt hits at every exit.  I don't
think that's the case, but I should measure it.

> 
> How expensive is the interrupt enable/disable cycle that this
> optimization saves?

I'll have to go back and measure this bit specifically again, but I
recall it being a couple of hundred cycles.  Not alarming, but worth
looking into.

> 
> > My question is: how important is the vtime accounting on the host from
> > your point of view?
> 
> No idea.  I'd keep the same behavior on all architectures, though.
> 
> The precision of accounting is in jiffies (millions of cycles), so we
> could maybe move it from the hot path to vcpu_load/put(?) without
> affecting the count in usual cases ...
> 

So since sending my original e-mail I found out that the vtime
accounting logic was changed from ktime to jiffies, which is partly why
we're having problems on arm.  See:

ff9a9b4c4334b53b52ee9279f30bd5dd92ea9bdd
sched, time: Switch VIRT_CPU_ACCOUNTING_GEN to jiffy granularity

Moving to load/put depends on the semantics of this vtime thing.  Is
this counting cycles spent in the VM as opposed to in the host kernel
and IRQ handling, and is that useful for system profiling or scheduling
decisions, in which case moving to vcpu_load/put doesn't work...

I assume there's a good reason why we call guest_enter() and
guest_exit() in the hot path on every KVM architecture?

> >                      Worth poking the timekeeping folks about or even
> > trying to convince ourselves that the handle_external_intr thing is
> > worth it?
> 
> handle_external_intr() needs some hardware support in order to be more
> than a worthless 'local_irq_enable(); local_irq_disable()' ...
> 
> e.g. VMX doesn't queue the interrupt that caused a VM exit in the
> interrupt controller.  (VMX's handle_external_intr() doesn't deliver any
> interrupts other than those that might have become pending after the
> exit, but this race is not very important due to accounting granularity
> ...)
> Alternatively, the interrupt controller would need to support dequeuing
> without delivery (but delivering in software might still be slower than
> cycling interrupts on and off).
> 

On ARM, I think the main benefits of implementing something like
handle_external_intr would come from two things: (1) You'd avoid the
context synchronization and associated cost of taking an exception on
the CPU, and (2) you'd also (potentially) avoid the additional
save/restore of all the GP registers from the kernel exception entry path
to create a usable gp_regs.

I have to look more carefully at whether or not (2) is possible, because
it would mean we'd have to store the guest register state on a pt_regs
structure in the first place and pass that directly to arch_handle_irq.

Additionally, if we had something like handle_external_intr, the
guest_exit thing would be kinda moot, since we'd do our ticks like
x86...

Thanks,
-Christoffer

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: vtime accounting
  2017-03-14  8:26   ` Christoffer Dall
@ 2017-03-14  8:55     ` Marc Zyngier
  2017-03-14 11:12       ` Christoffer Dall
  2017-03-14 16:58     ` Radim Krčmář
  2017-03-24 14:55     ` Rik van Riel
  2 siblings, 1 reply; 26+ messages in thread
From: Marc Zyngier @ 2017-03-14  8:55 UTC (permalink / raw)
  To: Christoffer Dall
  Cc: Radim Krčmář,
	Christoffer Dall, Paolo Bonzini, kvm, Rik van Riel

On Tue, Mar 14 2017 at  8:26:01 am GMT, Christoffer Dall <cdall@linaro.org> wrote:
> On ARM, I think the main benefits of implementing something like
> handle_external_intr would come from two things: (1) You'd avoid the
> context synchronization and associated cost of taking an exception on
> the CPU, and (2) you'd also (potentially) avoid the additional
> save/restore of all the GP registers from the kernel exception entry path
> to create a usable gp_regs.
>
> I have to look more carefully at whether or not (2) is possible, because
> it would mean we'd have to store the guest register state on a pt_regs
> structure in the first place and pass that directly to
> arch_handle_irq.

For that to be useful on pre-VHE HW and avoid the extra exception
handling, we'd also have to perform that arch_handle_irq call as part of
the exception return to EL1. That's not going to be very pretty, as we
need to build two return contexts (EL2->EL1 IRQ on the host, followed by
EL1->EL1).

This is all perfectly doable, but as always, we need numbers.

I had similar stuff for VHE a long while ago (back when I wrote WSINC),
but faking the pt_regs was absolutely horrible. But maybe it is time to
exhume this again...

> Additionally, if we had something like handle_external_intr, the
> guest_exit thing would be kinda moot, since we'd do our ticks like
> x86...

Indeed. Which brings the obvious question: are we the only ones not
taking the interrupt early?

Thanks,

	M.
-- 
Jazz is not dead, it just smell funny.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: vtime accounting
  2017-03-14  8:55     ` Marc Zyngier
@ 2017-03-14 11:12       ` Christoffer Dall
  2017-03-14 11:46         ` Marc Zyngier
  0 siblings, 1 reply; 26+ messages in thread
From: Christoffer Dall @ 2017-03-14 11:12 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Radim Krčmář,
	Christoffer Dall, Paolo Bonzini, kvm, Rik van Riel

On Tue, Mar 14, 2017 at 08:55:09AM +0000, Marc Zyngier wrote:
> On Tue, Mar 14 2017 at  8:26:01 am GMT, Christoffer Dall <cdall@linaro.org> wrote:
> > On ARM, I think the main benefits of implementing something like
> > handle_external_intr would come from two things: (1) You'd avoid the
> > context synchronization and associated cost of taking an exception on
> > the CPU, and (2) you'd also (potentially) avoid the additional
> > save/restore of all the GP registers from the kernel exception entry path
> > to create a usable gp_regs.
> >
> > I have to look more carefully at whether or not (2) is possible, because
> > it would mean we'd have to store the guest register state on a pt_regs
> > structure in the first place and pass that directly to
> > arch_handle_irq.
> 
> For that to be useful on pre-VHE HW and avoid the extra exception
> handling, we'd also have to perform that arch_handle_irq call as part of
> the exception return to EL1. That's not going to be very pretty, as we
> need to build two return contexts (EL2->EL1 IRQ on the host, followed by
> EL1->EL1).

Why do we need something like that?

Why can't we simply call arch_handle_irq from kvm_arch_vcpu_ioctl_run
being in either EL1 or EL2, depending if we have VHE, respectively,
assuming we have a pt_regs struct we can use with the expected semantics
for arch_handle_irq ?

In the first case, we would have completed the EL2->EL1 transition
before attempting to handle IRQs and in the second case we would always
be in EL2, but we would logically be in the same place in the overall
execution flow.

Note, that I'm assuming that you don't need to be directly calling
arch_handle_irq as a result of a minimal set of assembly code directly
following the exception vector, but that arch_handle_irq can be called
as a function from a later time with interrupts disabled as long as the
pt_regs reasonably express the state of the CPU at the time the
exception occurred, which in this case would (most likely) be the
guest's state.

> 
> This is all perfectly doable, but as always, we need numbers.

Agreed.

> 
> I had similar stuff for VHE a long while ago (back when I wrote WSINC),
> but faking the pt_regs was absolutely horrible. But maybe it is time to
> exhume this again...
> 

Correct me if I'm wrong, but I think you're referring to the work where
you used the host's exception vectors, not calling arch_handle_irq from
the run loop, or did I misunderstand?

> > Additionally, if we had something like handle_external_intr, the
> > guest_exit thing would be kinda moot, since we'd do our ticks like
> > x86...
> 
> Indeed. Which brings the obvious question: are we the only ones not
> taking the interrupt early?
> 

Hmm, I tried taking a look, but my mips/powerpc/s390 knowledge fails me slightly.
However, from what I can gather, s390 enables interrupts after returning
from the guest, then disables them again and calls guest_exit_irqoff, so
kinda what we're doing from a performance perspective, only different.
I'm lost in all the versions of the powerpc code that calls
guest_exit{_irqoff}, and it looks to me like mips handles physical
interrupts from the exit path before reaching the guest_exit_irqoff
point.

But I could be completely wrong :-/

Thanks,
-Christoffer

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: vtime accounting
  2017-03-14 11:12       ` Christoffer Dall
@ 2017-03-14 11:46         ` Marc Zyngier
  0 siblings, 0 replies; 26+ messages in thread
From: Marc Zyngier @ 2017-03-14 11:46 UTC (permalink / raw)
  To: Christoffer Dall
  Cc: Radim Krčmář,
	Christoffer Dall, Paolo Bonzini, kvm, Rik van Riel

On 14/03/17 11:12, Christoffer Dall wrote:
> On Tue, Mar 14, 2017 at 08:55:09AM +0000, Marc Zyngier wrote:
>> On Tue, Mar 14 2017 at  8:26:01 am GMT, Christoffer Dall <cdall@linaro.org> wrote:
>>> On ARM, I think the main benefits of implementing something like
>>> handle_external_intr would come from two things: (1) You'd avoid the
>>> context synchronization and associated cost of taking an exception on
>>> the CPU, and (2) you'd also (potentially) avoid the additional
>>> save/restore of all the GP registers from the kernel exception entry path
>>> to create a usable gp_regs.
>>>
>>> I have to look more carefully at whether or not (2) is possible, because
>>> it would mean we'd have to store the guest register state on a pt_regs
>>> structure in the first place and pass that directly to
>>> arch_handle_irq.
>>
>> For that to be useful on pre-VHE HW and avoid the extra exception
>> handling, we'd also have to perform that arch_handle_irq call as part of
>> the exception return to EL1. That's not going to be very pretty, as we
>> need to build two return contexts (EL2->EL1 IRQ on the host, followed by
>> EL1->EL1).
> 
> Why do we need something like that?
> 
> Why can't we simply call arch_handle_irq from kvm_arch_vcpu_ioctl_run
> being in either EL1 or EL2, depending if we have VHE, respectively,
> assuming we have a pt_regs struct we can use with the expected semantics
> for arch_handle_irq ?

Ah, that's where you want to perform the call. I was thinking of doing
it a bit earlier and save another few instructions by performing the
eret directly into the interrupt handler. I'm probably trying to
optimize too early, as usual!

> 
> In the first case, we would have completed the EL2->EL1 transition
> before attempting to handle IRQs and in the second case we would always
> be in EL2, but we would logically be in the same place in the overall
> execution flow.
> 
> Note, that I'm assuming that you don't need to be directly calling
> arch_handle_irq as a result of a minimal set of assembly code directly
> following the exception vector, but that arch_handle_irq can be called
> as a function from a later time with interrupts disabled as long as the
> pt_regs reasonably express the state of the CPU at the time the
> exception occurred, which in this case would (most likely) be the
> guest's state.
> 
>>
>> This is all perfectly doable, but as always, we need numbers.
> 
> Agreed.
> 
>>
>> I had similar stuff for VHE a long while ago (back when I wrote WSINC),
>> but faking the pt_regs was absolutely horrible. But maybe it is time to
>> exhume this again...
>>
> 
> Correct me if I'm wrong, but I think you're referring to the work where
> you used the host's exception vectors, not calling arch_handle_irq from
> the run loop, or did I misunderstand?

Yes, that's what I had in mind, but you're obviously thinking of
performing the call differently, so please ignore me until I come to my
senses...

>>> Additionally, if we had something like handle_external_intr, the
>>> guest_exit thing would be kinda moot, since we'd do our ticks like
>>> x86...
>>
>> Indeed. Which brings the obvious question: are we the only ones not
>> taking the interrupt early?
>>
> 
> Hmm, I tried taking a look, but my mips/powerpc/s390 knowledge fails me slightly.
> However, from what I can gather, s390 enables interrupts after returning
> from the guest, then disables them again and calls guest_exit_irqoff, so
> kinda what we're doing from a performance perspective, only different.
> I'm lost in all the versions of the powerpc code that calls
> guest_exit{_irqoff}, and it looks to me like mips handles physical
> interrupts from the exit path before reaching the guest_exit_irqoff
> point.
> 
> But I could be completely wrong :-/

Well, it at least shows that there is a variety of ways to handle
interrupts on exit, and that the patch you quoted earlier in the thread
may have affected more than just ARM.

Thanks,

	M.
-- 
Jazz is not dead. It just smells funny...

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: vtime accounting
  2017-03-14  8:26   ` Christoffer Dall
  2017-03-14  8:55     ` Marc Zyngier
@ 2017-03-14 16:58     ` Radim Krčmář
  2017-03-14 17:09       ` Paolo Bonzini
  2017-03-14 18:39       ` Christoffer Dall
  2017-03-24 14:55     ` Rik van Riel
  2 siblings, 2 replies; 26+ messages in thread
From: Radim Krčmář @ 2017-03-14 16:58 UTC (permalink / raw)
  To: Christoffer Dall
  Cc: Christoffer Dall, Paolo Bonzini, kvm, Marc Zyngier, Rik van Riel

2017-03-14 09:26+0100, Christoffer Dall:
> On Mon, Mar 13, 2017 at 06:28:16PM +0100, Radim Krčmář wrote:
>> 2017-03-08 02:57-0800, Christoffer Dall:
>> > Hi Paolo,
>> > 
>> > I'm looking at improving KVM/ARM a bit by calling guest_exit_irqoff
>> > before enabling interrupts when coming back from the guest.
>> > 
>> > Unfortunately, this appears to mess up my view of CPU usage using
>> > something like htop on the host, because it appears all time is spent
>> > inside the kernel.
>> > 
>> > From my analysis, I think this is because we never handle any interrupts
>> > before enabling interrupts, where the x86 code does its
>> > handle_external_intr, and the result on ARM is that we never increment
>> > jiffies before doing the vtime accounting.
>> 
>> (Hm, the counting might be broken on nohz_full then.)
>> 
> 
> Don't you still have a scheduler tick even with nohz_full and something
> that will eventually update jiffies then?

Probably, I don't understand jiffies accounting too well and didn't see
anything that would bump the jiffies in or before guest_exit_irqoff().

>> > So my current idea is to increment jiffies according to the clocksource
>> > before calling guest_exit_irqoff, but this would require some main
>> > clocksource infrastructure changes.
>> 
>> This seems similar to calling the function from the timer interrupt.
>> The timer interrupt would be delivered after that and would only waste
>> time, so it might actually be slower than just delivering it first ...
> 
> That's assuming that the timer interrupt hits at every exit.  I don't
> think that's the case, but I should measure it.

There cannot be fewer VM exits than timer interrupts, and I think there
are far more VM exits, but if there was no interrupt, then jiffies
shouldn't advance and we would get the same result as with plain
guest_exit_irqoff().

>> How expensive is the interrupt enable/disable cycle that this
>> optimization saves?
> 
> I'll have to go back and measure this bit specifically again, but I
> recall it being a couple of hundred cycles.  Not alarming, but worth
> looking into.

Yeah, sounds good.

>> > My question is: how important is the vtime accounting on the host from
>> > your point of view?
>> 
>> No idea.  I'd keep the same behavior on all architectures, though.
>> 
>> The precision of accounting is in jiffies (millions of cycles), so we
>> could maybe move it from the hot path to vcpu_load/put(?) without
>> affecting the count in usual cases ...
>> 
> 
> So since sending my original e-mail I found out that the vtime
> accounting logic was changed from ktime to jiffies, which is partly why
> we're having problems on arm.  See:
> 
> ff9a9b4c4334b53b52ee9279f30bd5dd92ea9bdd
> sched, time: Switch VIRT_CPU_ACCOUNTING_GEN to jiffy granularity
> 
> Moving to load/put depends on the semantics of this vtime thing.  Is
> this counting cycles spent in the VM as opposed to in the host kernel
> and IRQ handling, and is that useful for system profiling or scheduling
> decisions, in which case moving to vcpu_load/put doesn't work...

Right.

> I assume there's a good reason why we call guest_enter() and
> guest_exit() in the hot path on every KVM architecture?

I consider myself biased when it comes to jiffies, so no judgement. :)

From what I see, the mode switch is used only for statistics.
The original series is

5e84cfde51cf303d368fcb48f22059f37b3872de~1..d172fcd3ae1ca7ac27ec8904242fd61e0e11d332

It didn't introduce the interrupt-window overhead and it didn't count
host kernel IRQ time as user time, so it was better at that time.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: vtime accounting
  2017-03-14 16:58     ` Radim Krčmář
@ 2017-03-14 17:09       ` Paolo Bonzini
  2017-03-14 18:41         ` Christoffer Dall
  2017-03-14 18:39       ` Christoffer Dall
  1 sibling, 1 reply; 26+ messages in thread
From: Paolo Bonzini @ 2017-03-14 17:09 UTC (permalink / raw)
  To: Radim Krčmář, Christoffer Dall
  Cc: Christoffer Dall, kvm, Marc Zyngier, Rik van Riel



On 14/03/2017 17:58, Radim Krčmář wrote:
>> I assume there's a good reason why we call guest_enter() and
>> guest_exit() in the hot path on every KVM architecture?
> I consider myself biased when it comes to jiffies, so no judgement. :)
> 
> From what I see, the mode switch is used only for statistics.

vtime is only for statistics, but guest_enter/exit are important because
they enter an RCU extended quiescent state.  This means that (physical)
CPUs running a guest are effectively "off" from the point of view of the
RCU accounting machinery.  Not having to perform any RCU work is very
good for jitter.
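As a toy model of that effect (hypothetical, nothing like the real RCU
implementation): a grace period only needs quiescent-state reports from
CPUs that are *not* in an extended quiescent state, so CPUs running a
guest never need to be disturbed:

```c
#include <stdbool.h>

enum { NCPUS = 4 };
static bool in_eqs[NCPUS];         /* true while the CPU runs a guest */

static void guest_enter(int cpu) { in_eqs[cpu] = true;  }
static void guest_exit(int cpu)  { in_eqs[cpu] = false; }

/* CPUs the grace-period machinery must still interrupt or poll;
 * CPUs in an EQS are "off" as far as RCU is concerned. */
static int cpus_blocking_grace_period(void)
{
    int n = 0;
    for (int cpu = 0; cpu < NCPUS; cpu++)
        if (!in_eqs[cpu])
            n++;
    return n;
}
```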

Paolo

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: vtime accounting
  2017-03-14 16:58     ` Radim Krčmář
  2017-03-14 17:09       ` Paolo Bonzini
@ 2017-03-14 18:39       ` Christoffer Dall
  2017-03-14 20:27         ` Radim Krčmář
  1 sibling, 1 reply; 26+ messages in thread
From: Christoffer Dall @ 2017-03-14 18:39 UTC (permalink / raw)
  To: Radim Krčmář
  Cc: Christoffer Dall, Paolo Bonzini, kvm, Marc Zyngier, Rik van Riel

On Tue, Mar 14, 2017 at 05:58:59PM +0100, Radim Krčmář wrote:
> 2017-03-14 09:26+0100, Christoffer Dall:
> > On Mon, Mar 13, 2017 at 06:28:16PM +0100, Radim Krčmář wrote:
> >> 2017-03-08 02:57-0800, Christoffer Dall:
> >> > Hi Paolo,
> >> > 
> >> > I'm looking at improving KVM/ARM a bit by calling guest_exit_irqoff
> >> > before enabling interrupts when coming back from the guest.
> >> > 
> >> > Unfortunately, this appears to mess up my view of CPU usage using
> >> > something like htop on the host, because it appears all time is spent
> >> > inside the kernel.
> >> > 
> >> > From my analysis, I think this is because we never handle any interrupts
> >> > before enabling interrupts, where the x86 code does its
> >> > handle_external_intr, and the result on ARM is that we never increment
> >> > jiffies before doing the vtime accounting.
> >> 
> >> (Hm, the counting might be broken on nohz_full then.)
> >> 
> > 
> > Don't you still have a scheduler tick even with nohz_full and something
> > that will eventually update jiffies then?
> 
> Probably, I don't understand jiffies accounting too well and didn't see
> anything that would bump the jiffies in or before guest_exit_irqoff().
> 

As far as I understand, from my very very short time of looking at the
timer code, jiffies are updated on every tick, which can be caused by a
number of events, including *any* interrupt handler (coming from idle
state), soft timers, timer interrupts, and possibly other things.

> >> > So my current idea is to increment jiffies according to the clocksource
> >> > before calling guest_exit_irqoff, but this would require some main
> >> > clocksource infrastructure changes.
> >> 
> >> This seems similar to calling the function from the timer interrupt.
> >> The timer interrupt would be delivered after that and would only waste
> >> time, so it might actually be slower than just delivering it first ...
> > 
> > That's assuming that the timer interrupt hits at every exit.  I don't
> > think that's the case, but I should measure it.
> 
> There cannot be fewer VM exits than timer interrupts, and I think there
> are far more VM exits, but if there was no interrupt, then jiffies
> shouldn't advance and we would get the same result as with plain
> guest_exit_irqoff().
> 

That's true if you're guaranteed to take the timer interrupts that
happen while running the guest before hitting guest_exit_irqoff(), so
that you eventually count *some* time for the guest.  In the arm64 case,
if we just do guest_exit_irqoff(), we *never* account any time to the
guest.

> >> How expensive is the interrupt enable/disable cycle that this
> >> optimization saves?
> > 
> > I'll have to go back and measure this bit specifically again, but I
> > recall it being a couple of hundred cycles.  Not alarming, but worth
> > looking into.
> 
> Yeah, sounds good.
> 
> >> > My question is: how important is the vtime accounting on the host from
> >> > your point of view?
> >> 
> >> No idea.  I'd keep the same behavior on all architectures, though.
> >> 
> >> The precision of accounting is in jiffies (millions of cycles), so we
> >> could maybe move it from the hot path to vcpu_load/put(?) without
> >> affecting the count in usual cases ...
> >> 
> > 
> > So since sending my original e-mail I found out that the vtime
> > accounting logic was changed from ktime to jiffies, which is partly why
> > we're having problems on arm.  See:
> > 
> > ff9a9b4c4334b53b52ee9279f30bd5dd92ea9bdd
> > sched, time: Switch VIRT_CPU_ACCOUNTING_GEN to jiffy granularity
> > 
> > Moving to load/put depends on the semantics of this vtime thing.  Is
> > this counting cycles spent in the VM as opposed to in the host kernel
> > and IRQ handling, and is that useful for system profiling or scheduling
> > decisions, in which case moving to vcpu_load/put doesn't work...
> 
> Right.
> 
> > I assume there's a good reason why we call guest_enter() and
> > guest_exit() in the hot path on every KVM architecture?
> 
> I consider myself biased when it comes to jiffies, so no judgement. :)
> 
> From what I see, the mode switch is used only for statistics.
> The original series is
> 
> 5e84cfde51cf303d368fcb48f22059f37b3872de~1..d172fcd3ae1ca7ac27ec8904242fd61e0e11d332
> 
> It didn't introduce the overhead with interrupt window and it didn't
> count host kernel irq time as user time, so it was better at that time.

Yes, but it was based on the cputime_to... functions, which I understand
use ktime, which on systems running KVM will most often read the
clocksource directly from the hardware.  That was later optimized to use
jiffies instead, avoiding the clocksource read because jiffies is already
in memory and adjusted to the granularity we need.  In some sense that is
an improvement, only it doesn't work if jiffies isn't updated when
needed.

Thanks,
-Christoffer

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: vtime accounting
  2017-03-14 17:09       ` Paolo Bonzini
@ 2017-03-14 18:41         ` Christoffer Dall
  2017-03-14 19:32           ` Radim Krčmář
  2017-03-15  8:05           ` Paolo Bonzini
  0 siblings, 2 replies; 26+ messages in thread
From: Christoffer Dall @ 2017-03-14 18:41 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Radim Krčmář,
	Christoffer Dall, kvm, Marc Zyngier, Rik van Riel

On Tue, Mar 14, 2017 at 06:09:45PM +0100, Paolo Bonzini wrote:
> 
> 
> On 14/03/2017 17:58, Radim Krčmář wrote:
> >> I assume there's a good reason why we call guest_enter() and
> >> guest_exit() in the hot path on every KVM architecture?
> > I consider myself biased when it comes to jiffies, so no judgement. :)
> > 
> > From what I see, the mode switch is used only for statistics.
> 
> vtime is only for statistics, but guest_enter/exit are important because
> they enter an RCU extended quiescent state.  This means that (physical)
> CPUs running a guest are effectively "off" from the point of view of the
> RCU accounting machinery.  Not having to perform any RCU work is very
> good for jitter.
> 

So would it be worth considering factoring out vtime accounting from
guest_enter/exit, such that we could do the vtime accounting from vcpu
load/put and mark the RCU extended quiescent state in the run loop?

Disclaimer: I haven't completely convinced myself that vtime accounting
from load/put works as it should.  For example, when servicing a VM from
KVM, should we really be accounting this as kernel time, or as guest
time?  I think we do the former now, but if the latter is the right
thing, would changing the behavior constitute an ABI change to
userspace?

Thanks,
-Christoffer


* Re: vtime accounting
  2017-03-14 18:41         ` Christoffer Dall
@ 2017-03-14 19:32           ` Radim Krčmář
  2017-03-14 20:01             ` Christoffer Dall
  2017-03-15  8:05           ` Paolo Bonzini
  1 sibling, 1 reply; 26+ messages in thread
From: Radim Krčmář @ 2017-03-14 19:32 UTC (permalink / raw)
  To: Christoffer Dall
  Cc: Paolo Bonzini, Christoffer Dall, kvm, Marc Zyngier, Rik van Riel

2017-03-14 19:41+0100, Christoffer Dall:
> On Tue, Mar 14, 2017 at 06:09:45PM +0100, Paolo Bonzini wrote:
>> On 14/03/2017 17:58, Radim Krčmář wrote:
>> >> I assume there's a good reason why we call guest_enter() and
>> >> guest_exit() in the hot path on every KVM architecture?
>> > I consider myself biased when it comes to jiffies, so no judgement. :)
>> > 
>> > From what I see, the mode switch is used only for statistics.
>> 
>> vtime is only for statistics, but guest_enter/exit are important because
>> they enter an RCU extended quiescent state.  This means that (physical)
>> CPUs running a guest are effectively "off" from the point of view of the
>> RCU accounting machinery.  Not having to perform any RCU work is very
>> good for jitter.

Ah, good point.

> So would it be worth considering factoring out vtime accounting from
> guest_enter/exit, such that we could do the vtime accounting from vcpu
> load/put and mark the RCU extended quiescent state in the run loop?

RCU is the reason why guest_exit() needs disabled interrupts, so if we
split them, we could do rcu_virt_note_context_switch() before enabling
interrupts, and guest_exit() right after.
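The proposed ordering would look roughly like this (a sketch only, not actual kernel code):

```c
	/* world switch returns; interrupts still disabled */
	rcu_virt_note_context_switch(cpu); /* leave the RCU quiescent state */
	local_irq_enable();                /* pending ticks can now update jiffies */
	guest_exit();                      /* vtime accounting sees fresh jiffies */
```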

> Disclaimer: I haven't completely convinced myself that vtime accounting
> from load/put works as it should.  For example, when servicing a VM from
> KVM, should we really be accounting this as kernel time, or as guest
> time?  I think we do the former now, but if the latter is the right
> thing, would changing the behavior constitute an ABI change to
> userspace?

Not considering that option would be best. :)


* Re: vtime accounting
  2017-03-14 19:32           ` Radim Krčmář
@ 2017-03-14 20:01             ` Christoffer Dall
  2017-03-14 21:52               ` Radim Krčmář
  0 siblings, 1 reply; 26+ messages in thread
From: Christoffer Dall @ 2017-03-14 20:01 UTC (permalink / raw)
  To: Radim Krčmář
  Cc: Paolo Bonzini, Christoffer Dall, kvm, Marc Zyngier, Rik van Riel

On Tue, Mar 14, 2017 at 08:32:02PM +0100, Radim Krčmář wrote:
> 2017-03-14 19:41+0100, Christoffer Dall:
> > On Tue, Mar 14, 2017 at 06:09:45PM +0100, Paolo Bonzini wrote:
> >> On 14/03/2017 17:58, Radim Krčmář wrote:
> >> >> I assume there's a good reason why we call guest_enter() and
> >> >> guest_exit() in the hot path on every KVM architecture?
> >> > I consider myself biased when it comes to jiffies, so no judgement. :)
> >> > 
> >> > From what I see, the mode switch is used only for statistics.
> >> 
> >> vtime is only for statistics, but guest_enter/exit are important because
> >> they enter an RCU extended quiescent state.  This means that (physical)
> >> CPUs running a guest are effectively "off" from the point of view of the
> >> RCU accounting machinery.  Not having to perform any RCU work is very
> >> good for jitter.
> 
> Ah, good point.
> 
> > So would it be worth considering factoring out vtime accounting from
> > guest_enter/exit, such that we could do the vtime accounting from vcpu
> > load/put and mark the RCU extended quiescent state in the run loop?
> 
> RCU is the reason why guest_exit() needs disabled interrupts, so if we
> split them, we could do rcu_virt_note_context_switch() before enabling
> interrupts, and guest_exit() right after.
> 

I'm not convinced that what you're saying is true ;)

I think we only fiddle with RCU during guest_enter, and further, a trace
of guest_exit reveals:

guest_exit_irqoff
 -> vtime_guest_exit
    -> __vtime_account_system
       -> get_vtime_delta
          -> account_other_time
             -> WARN_ON_ONCE(!irqs_disabled());

So I think we do need interrupts disabled when messing with vtime?

> > Disclaimer: I haven't completely convinced myself that vtime accounting
> > from load/put works as it should.  For example, when servicing a VM from
> > KVM, should we really be accounting this as kernel time, or as guest
> > time?  I think we do the former now, but if the latter is the right
> > thing, would changing the behavior constitute an ABI change to
> > userspace?
> 
> Not considering that option would be best. :)

If my statement above about needing interrupts disabled when dealing
with vtime is true, then considering this begins to sound interesting, also
given that the vtime thing is not entirely free and we're dealing with
the hot path of receiving IPIs here, for example.

Thanks,
-Christoffer


* Re: vtime accounting
  2017-03-14 18:39       ` Christoffer Dall
@ 2017-03-14 20:27         ` Radim Krčmář
  2017-03-14 21:53           ` Radim Krčmář
  2017-03-15  8:43           ` Christoffer Dall
  0 siblings, 2 replies; 26+ messages in thread
From: Radim Krčmář @ 2017-03-14 20:27 UTC (permalink / raw)
  To: Christoffer Dall
  Cc: Christoffer Dall, Paolo Bonzini, kvm, Marc Zyngier, Rik van Riel

2017-03-14 19:39+0100, Christoffer Dall:
> On Tue, Mar 14, 2017 at 05:58:59PM +0100, Radim Krčmář wrote:
>> 2017-03-14 09:26+0100, Christoffer Dall:
>> > On Mon, Mar 13, 2017 at 06:28:16PM +0100, Radim Krčmář wrote:
>> >> 2017-03-08 02:57-0800, Christoffer Dall:
>> >> > Hi Paolo,
>> >> > 
>> >> > I'm looking at improving KVM/ARM a bit by calling guest_exit_irqoff
>> >> > before enabling interrupts when coming back from the guest.
>> >> > 
>> >> > Unfortunately, this appears to mess up my view of CPU usage using
>> >> > something like htop on the host, because it appears all time is spent
>> >> > inside the kernel.
>> >> > 
>> >> > From my analysis, I think this is because we never handle any interrupts
>> >> > before enabling interrupts, where the x86 code does its
>> >> > handle_external_intr, and the result on ARM is that we never increment
>> >> > jiffies before doing the vtime accounting.
>> >> 
>> >> (Hm, the counting might be broken on nohz_full then.)
>> >> 
>> > 
>> > Don't you still have a scheduler tick even with nohz_full and something
>> > that will eventually update jiffies then?
>> 
>> Probably, I don't understand jiffies accounting too well and didn't see
>> anything that would bump the jiffies in or before guest_exit_irqoff().
> 
> As far as I understand, from my very very short time of looking at the
> timer code, jiffies are updated on every tick, which can be caused by a
> number of events, including *any* interrupt handler (coming from idle
> state), soft timers, timer interrupts, and possibly other things.

Yes, I was thinking that entering/exiting user mode should trigger it as
well, in order to correctly account for time spent there, but couldn't
find it ...

The case I was wondering about is if the kernel spent e.g. 10 jiffies in
guest mode and then exited on mmio -- no interrupt in the host, and
guest_exit_irqoff() would flip accounting over to system time.
Can those 10 jiffies get accounted to system (not guest) time?

Accounting 1 jiffy wrong is normal as we expect that the distribution of
guest/kernel time in the jiffy is going to be approximated over a longer
sampling period, but if we could account multiple jiffies wrong, this
expectation is very hard to defend.

>> >> > So my current idea is to increment jiffies according to the clocksource
>> >> > before calling guest_exit_irqoff, but this would require some main
>> >> > clocksource infrastructure changes.
>> >> 
>> >> This seems similar to calling the function from the timer interrupt.
>> >> The timer interrupt would be delivered after that and only wasted time,
>> >> so it might actually be slower than just delivering it before ...
>> > 
>> > That's assuming that the timer interrupt hits at every exit.  I don't
>> > think that's the case, but I should measure it.
>> 
>> There cannot be less vm exits and I think there are far more vm exits,
>> but if there was no interrupt, then jiffies shouldn't raise and we would
>> get the same result as with plain guest_exit_irqoff().
>> 
> 
> That's true if you're guaranteed to take the timer interrupts that
> happen while running the guest before hitting guest_exit_irqoff(), so
> that you eventually count *some* time for the guest.  In the arm64 case,
> if we just do guest_exit_irqoff(), we *never* account any time to the
> guest.

Yes.  I assumed that if jiffies should be incremented, then you also get
a timer tick, so checking whether jiffies should be incremented inside
guest_exit_irqoff() was only slowing down the common case, where jiffies
remained the same.

(Checking if jiffies should be incremented is quite slow by itself.)

>> > I assume there's a good reason why we call guest_enter() and
>> > guest_exit() in the hot path on every KVM architecture?
>> 
>> I consider myself biased when it comes to jiffies, so no judgement. :)
>> 
>> From what I see, the mode switch is used only for statistics.
>> The original series is
>> 
>> 5e84cfde51cf303d368fcb48f22059f37b3872de~1..d172fcd3ae1ca7ac27ec8904242fd61e0e11d332
>> 
>> It didn't introduce the overhead with interrupt window and it didn't
>> count host kernel irq time as user time, so it was better at that time.
> 
> Yes, but it was based on cputime_to... functions, which I understand use
> ktime, which on systems running KVM will most often read the clocksource
> directly from the hardware, and that was then optimized later to just
> use jiffies to avoid the clocksource read because jiffies is already in
> memory and adjusted to the granularity we need, so in some sense an
> improvement, only it doesn't work if you don't update jiffies when
> needed.

True.  The kvm_guest_enter/exit still didn't trigger accounting, but the
granularity was better if there were other sources of accounting than
just timer ticks.

And I noticed another funny feature: the original intention was that if
the guest uses hardware acceleration, then the whole timeslice gets
accounted to guest/user time, because kvm_guest_exit() was not supposed
to clear PF_VCPU: https://lkml.org/lkml/2007/10/15/105
This somehow got mangled when merging and later there was a fix that
introduced the current behavior, which might be slightly more accurate:
0552f73b9a81 ("KVM: Move kvm_guest_exit() after local_irq_enable()")


* Re: vtime accounting
  2017-03-14 20:01             ` Christoffer Dall
@ 2017-03-14 21:52               ` Radim Krčmář
  2017-03-15  8:09                 ` Paolo Bonzini
  0 siblings, 1 reply; 26+ messages in thread
From: Radim Krčmář @ 2017-03-14 21:52 UTC (permalink / raw)
  To: Christoffer Dall
  Cc: Paolo Bonzini, Christoffer Dall, kvm, Marc Zyngier, Rik van Riel

2017-03-14 21:01+0100, Christoffer Dall:
> On Tue, Mar 14, 2017 at 08:32:02PM +0100, Radim Krčmář wrote:
>> 2017-03-14 19:41+0100, Christoffer Dall:
>> > On Tue, Mar 14, 2017 at 06:09:45PM +0100, Paolo Bonzini wrote:
>> >> On 14/03/2017 17:58, Radim Krčmář wrote:
>> >> >> I assume there's a good reason why we call guest_enter() and
>> >> >> guest_exit() in the hot path on every KVM architecture?
>> >> > I consider myself biased when it comes to jiffies, so no judgement. :)
>> >> > 
>> >> > From what I see, the mode switch is used only for statistics.
>> >> 
>> >> vtime is only for statistics, but guest_enter/exit are important because
>> >> they enter an RCU extended quiescent state.  This means that (physical)
>> >> CPUs running a guest are effectively "off" from the point of view of the
>> >> RCU accounting machinery.  Not having to perform any RCU work is very
>> >> good for jitter.
>> 
>> Ah, good point.
>> 
>> > So would it be worth considering factoring out vtime accounting from
>> > guest_enter/exit, such that we could do the vtime accounting from vcpu
>> > load/put and mark the RCU extended quiescent state in the run loop?
>> 
>> RCU is the reason why guest_exit() needs disabled interrupts, so if we
>> split them, we could do rcu_virt_note_context_switch() before enabling
>> interrupts, and guest_exit() right after.
>> 
> 
> I'm not convinced that what you're saying is true ;)

I agree.

> I think we only fiddle with RCU during guest_enter, and further, a trace
> of guest_exit reveals:
> 
> guest_exit_irqoff
>  -> vtime_guest_exit
>     -> __vtime_account_system
>        -> get_vtime_delta
>           -> account_other_time
>              -> WARN_ON_ONCE(!irqs_disabled());
> 
> So I think we do need interrupts disabled when messing with vtime?

Seem like it.

>> > Disclaimer: I haven't completely convinced myself that vtime accounting
>> > from load/put works as it should.  For example, when servicing a VM from
>> > KVM, should we really be accounting this as kernel time, or as guest
>> > time?  I think we do the former now, but if the latter is the right
>> > thing, would changing the behavior constitute an ABI change to
>> > userspace?
>> 
>> Not considering that option would be best. :)
> 
> If my statement above about needing interrupts disabled when dealing
> with vtime is true, then considering this begins to sound interesting, also
> given that the vtime thing is not entirely free and we're dealing with
> the hot path of receiving IPIs here, for example.

I'm liking it less and less the more I read. :)
CONTEXT_USER vtime is coupled with context tracking and going out of
CONTEXT_KERNEL means that RCU cannot be used in between.  Using
CONTEXT_GUEST from load/put would change the meaning of contexts ...


* Re: vtime accounting
  2017-03-14 20:27         ` Radim Krčmář
@ 2017-03-14 21:53           ` Radim Krčmář
  2017-03-15  8:43           ` Christoffer Dall
  1 sibling, 0 replies; 26+ messages in thread
From: Radim Krčmář @ 2017-03-14 21:53 UTC (permalink / raw)
  To: Christoffer Dall
  Cc: Christoffer Dall, Paolo Bonzini, kvm, Marc Zyngier, Rik van Riel

2017-03-14 21:27+0100, Radim Krčmář:
> 2017-03-14 19:39+0100, Christoffer Dall:
>> On Tue, Mar 14, 2017 at 05:58:59PM +0100, Radim Krčmář wrote:
>>> 2017-03-14 09:26+0100, Christoffer Dall:
>>> > On Mon, Mar 13, 2017 at 06:28:16PM +0100, Radim Krčmář wrote:
>>> >> 2017-03-08 02:57-0800, Christoffer Dall:
>>> >> > Hi Paolo,
>>> >> > 
>>> >> > I'm looking at improving KVM/ARM a bit by calling guest_exit_irqoff
>>> >> > before enabling interrupts when coming back from the guest.
>>> >> > 
>>> >> > Unfortunately, this appears to mess up my view of CPU usage using
>>> >> > something like htop on the host, because it appears all time is spent
>>> >> > inside the kernel.
>>> >> > 
>>> >> > From my analysis, I think this is because we never handle any interrupts
>>> >> > before enabling interrupts, where the x86 code does its
>>> >> > handle_external_intr, and the result on ARM is that we never increment
>>> >> > jiffies before doing the vtime accounting.
>>> >> 
>>> >> (Hm, the counting might be broken on nohz_full then.)
>>> >> 
>>> > 
>>> > Don't you still have a scheduler tick even with nohz_full and something
>>> > that will eventually update jiffies then?
>>> 
>>> Probably, I don't understand jiffies accounting too well and didn't see
>>> anything that would bump the jiffies in or before guest_exit_irqoff().
>> 
>> As far as I understand, from my very very short time of looking at the
>> timer code, jiffies are updated on every tick, which can be caused by a
>> number of events, including *any* interrupt handler (coming from idle
>> state), soft timers, timer interrupts, and possibly other things.
> 
> Yes, I was thinking that entering/exiting user mode should trigger it as
> well, in order to correctly account for time spent there, but couldn't
> find it ...
> 
> The case I was wondering about is if the kernel spent e.g. 10 jiffies in
> guest mode and then exited on mmio -- no interrupt in the host, and
> guest_exit_irqoff() would flip accounting over to system time.
> Can those 10 jiffies get accounted to system (not guest) time?
> 
> Accounting 1 jiffy wrong is normal as we expect that the distribution of
> guest/kernel time in the jiffy is going to be approximated over a longer
> sampling period, but if we could account multiple jiffies wrong, this
> expectation is very hard to defend.

IIUC, another CPU will bump jiffies for running tickless VCPUs, so this
works fine.


* Re: vtime accounting
  2017-03-14 18:41         ` Christoffer Dall
  2017-03-14 19:32           ` Radim Krčmář
@ 2017-03-15  8:05           ` Paolo Bonzini
  2017-03-15  8:30             ` Christoffer Dall
  1 sibling, 1 reply; 26+ messages in thread
From: Paolo Bonzini @ 2017-03-15  8:05 UTC (permalink / raw)
  To: Christoffer Dall
  Cc: Radim Krčmář,
	Christoffer Dall, kvm, Marc Zyngier, Rik van Riel



On 14/03/2017 19:41, Christoffer Dall wrote:
> So would it be worth considering factoring out vtime accounting from
> guest_enter/exit, such that we could do the vtime accounting from vcpu
> load/put and mark the RCU extended quiescent state in the run loop?
> 
> Disclaimer: I haven't completely convinced myself that vtime accounting
> from load/put works as it should.

Me neither, but if it works, it's worth at least benchmarking it to make
an informed decision.

> For example, when servicing a VM from
> KVM, should we really be accounting this as kernel time, or as guest
> time?

Servicing the request should be fast enough that it doesn't matter.  For
the most common x86 vmexits, 50% of the time is spent executing the x86
world switch microcode, and that is already being accounted as guest
time already.  On microbenchmarks of course it would make a difference:

$ time ./x86/run x86/vmexit.flat -append cpuid -smp 4
[...]
real	0m1.327s
user	0m1.877s        << this is really guest time
sys	0m1.844s        << this is spent mostly in KVM_RUN


Paolo

> I think we do the former now, but if the latter is the right
> thing, would changing the behavior constitute an ABI change to
> userspace?


* Re: vtime accounting
  2017-03-14 21:52               ` Radim Krčmář
@ 2017-03-15  8:09                 ` Paolo Bonzini
  0 siblings, 0 replies; 26+ messages in thread
From: Paolo Bonzini @ 2017-03-15  8:09 UTC (permalink / raw)
  To: Radim Krčmář, Christoffer Dall
  Cc: Christoffer Dall, kvm, Marc Zyngier, Rik van Riel



On 14/03/2017 22:52, Radim Krčmář wrote:
>> If my statement above about needing interrupts disabled when dealing
>> with vtime is true, then considering this begins to sound interesting, also
>> given that the vtime thing is not entirely free and we're dealing with
>> the hot path of receiving IPIs here, for example.
> I'm liking it less and less the more I read. :)
> CONTEXT_USER vtime is coupled with context tracking and going out of
> CONTEXT_KERNEL means that RCU cannot be used in between.  Using
> CONTEXT_GUEST from load/put would change the meaning of contexts ...

CONTEXT_GUEST is an RCU extended quiescent state, so doing it from load to
put makes no sense.  PF_VCPU from load to put on the other hand might
make some sense.
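A sketch of what the PF_VCPU variant might look like (placement hypothetical; vcpu_load/vcpu_put as in virt/kvm/kvm_main.c):

```c
	void vcpu_load(struct kvm_vcpu *vcpu)
	{
		/* ... existing preempt/arch load work ... */
		current->flags |= PF_VCPU;  /* ticks sampled here count as guest time */
	}

	void vcpu_put(struct kvm_vcpu *vcpu)
	{
		current->flags &= ~PF_VCPU;
		/* ... existing arch put work ... */
	}
```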

Paolo


* Re: vtime accounting
  2017-03-15  8:05           ` Paolo Bonzini
@ 2017-03-15  8:30             ` Christoffer Dall
  0 siblings, 0 replies; 26+ messages in thread
From: Christoffer Dall @ 2017-03-15  8:30 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Radim Krčmář,
	Christoffer Dall, kvm, Marc Zyngier, Rik van Riel

On Wed, Mar 15, 2017 at 09:05:28AM +0100, Paolo Bonzini wrote:
> 
> 
> On 14/03/2017 19:41, Christoffer Dall wrote:
> > So would it be worth considering factoring out vtime accounting from
> > guest_enter/exit, such that we could do the vtime accounting from vcpu
> > load/put and mark the RCU extended quiescent state in the run loop?
> > 
> > Disclaimer: I haven't completely convinced myself that vtime accounting
> > from load/put works as it should.
> 
> Me neither, but if it works, it's worth at least benchmarking it to make
> an informed decision.
> 

I'll try to prototype it and see how it looks.

> > For example, when servicing a VM from
> > KVM, should we really be accounting this as kernel time, or as guest
> > time?
> 
> Servicing the request should be fast enough that it doesn't matter.  For
> the most common x86 vmexits, 50% of the time is spent executing the x86
> world switch microcode, and that is already being accounted as guest
> time already.  On microbenchmarks of course it would make a difference:
> 
> $ time ./x86/run x86/vmexit.flat -append cpuid -smp 4
> [...]
> real	0m1.327s
> user	0m1.877s        << this is really guest time
> sys	0m1.844s        << this is spent mostly in KVM_RUN
> 
> 
Good point, and if it's for statistics, this slight change in behavior
can hardly constitute any form of user breakage.

Thanks,
-Christoffer


* Re: vtime accounting
  2017-03-14 20:27         ` Radim Krčmář
  2017-03-14 21:53           ` Radim Krčmář
@ 2017-03-15  8:43           ` Christoffer Dall
  2017-03-15 15:57             ` Radim Krčmář
  2017-03-24 15:04             ` Rik van Riel
  1 sibling, 2 replies; 26+ messages in thread
From: Christoffer Dall @ 2017-03-15  8:43 UTC (permalink / raw)
  To: Radim Krčmář
  Cc: Christoffer Dall, Paolo Bonzini, kvm, Marc Zyngier, Rik van Riel

On Tue, Mar 14, 2017 at 09:27:22PM +0100, Radim Krčmář wrote:
> 2017-03-14 19:39+0100, Christoffer Dall:
> > On Tue, Mar 14, 2017 at 05:58:59PM +0100, Radim Krčmář wrote:
> >> 2017-03-14 09:26+0100, Christoffer Dall:
> >> > On Mon, Mar 13, 2017 at 06:28:16PM +0100, Radim Krčmář wrote:
> >> >> 2017-03-08 02:57-0800, Christoffer Dall:
> >> >> > Hi Paolo,
> >> >> > 
> >> >> > I'm looking at improving KVM/ARM a bit by calling guest_exit_irqoff
> >> >> > before enabling interrupts when coming back from the guest.
> >> >> > 
> >> >> > Unfortunately, this appears to mess up my view of CPU usage using
> >> >> > something like htop on the host, because it appears all time is spent
> >> >> > inside the kernel.
> >> >> > 
> >> >> > From my analysis, I think this is because we never handle any interrupts
> >> >> > before enabling interrupts, where the x86 code does its
> >> >> > handle_external_intr, and the result on ARM is that we never increment
> >> >> > jiffies before doing the vtime accounting.
> >> >> 
> >> >> (Hm, the counting might be broken on nohz_full then.)
> >> >> 
> >> > 
> >> > Don't you still have a scheduler tick even with nohz_full and something
> >> > that will eventually update jiffies then?
> >> 
> >> Probably, I don't understand jiffies accounting too well and didn't see
> >> anything that would bump the jiffies in or before guest_exit_irqoff().
> > 
> > As far as I understand, from my very very short time of looking at the
> > timer code, jiffies are updated on every tick, which can be caused by a
> > number of events, including *any* interrupt handler (coming from idle
> > state), soft timers, timer interrupts, and possibly other things.
> 
> Yes, I was thinking that entering/exiting user mode should trigger it as
> well, in order to correctly account for time spent there, but couldn't
> find it ...

Isn't it based on taking the timer interrupt when running in userspace
as opposed to when running in the kernel?

> 
> The case I was wondering about is if the kernel spent e.g. 10 jiffies in
> guest mode and then exited on mmio -- no interrupt in the host, and
> guest_exit_irqoff() would flip accounting over to system time.
> Can those 10 jiffies get accounted to system (not guest) time?

Yes, I think the key is whether you end up taking a timer interrupt
before or after switching PF_VCPU.  So you can spend X jiffies in the
guest, come back, change PF_VCPU (timer still hasn't expired), and then
the timer expires immediately afterwards, and the whole block of jiffies
that are incremented as a result of the timer gets accounted as kernel
time.

(Note that jiffies on clocksourced architectures is an arbitrary number,
which somehow scales according to the clocksource frequency, depending
on the CONFIG_HZ, so a single timer interrupt can increment jiffies by
more than 1.)

> 
> Accounting 1 jiffy wrong is normal as we expect that the distribution of
> guest/kernel time in the jiffy is going to be approximated over a longer
> sampling period, but if we could account multiple jiffies wrong, this
> expectation is very hard to defend.
> 

I think it's a best effort, and we expect inaccuracies over much more
than a single jiffy, see above.

> >> >> > So my current idea is to increment jiffies according to the clocksource
> >> >> > before calling guest_exit_irqoff, but this would require some main
> >> >> > clocksource infrastructure changes.
> >> >> 
> >> >> This seems similar to calling the function from the timer interrupt.
> >> >> The timer interrupt would be delivered after that and only wasted time,
> >> >> so it might actually be slower than just delivering it before ...
> >> > 
> >> > That's assuming that the timer interrupt hits at every exit.  I don't
> >> > think that's the case, but I should measure it.
> >> 
> >> There cannot be less vm exits and I think there are far more vm exits,
> >> but if there was no interrupt, then jiffies shouldn't raise and we would
> >> get the same result as with plain guest_exit_irqoff().
> >> 
> > 
> > That's true if you're guaranteed to take the timer interrupts that
> > happen while running the guest before hitting guest_exit_irqoff(), so
> > that you eventually count *some* time for the guest.  In the arm64 case,
> > if we just do guest_exit_irqoff(), we *never* account any time to the
> > guest.
> 
> Yes.  I assumed that if jiffies should be incremented, then you also get
> a timer tick,

I think jiffies can be incremented in a number of cases, and as I
understand it, when the kernel folks talk about "the tick" that doesn't
necessarily mean that your scheduler timer fires an interrupt, but I
could be wrong.



> so checking whether jiffies should be incremented inside
> guest_exit_irqoff() was only slowing down the common case, where jiffies
> remained the same.

My original suggestion was to not use jiffies at all, but use ktime, on
architectures that don't do handle_external_intr() type things, which
would result in a clocksource read on every exit, which on ARM I think
is faster than our enable/disable interrupt dance we do at the moment.

But of course, I'd have to measure that.
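Such a ktime-based variant might look roughly like this (function and field names approximate, not an actual kernel API):

```c
	/* hypothetical: account guest time from the clocksource, no jiffies */
	void guest_exit_irqoff(void)
	{
		ktime_t now = ktime_get();  /* one clocksource read per exit */

		account_guest_time(current, ktime_sub(now, current->vtime_snap));
		current->vtime_snap = now;
		current->flags &= ~PF_VCPU;
	}
```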

> 
> (Checking if jiffies should be incremented is quite slow by itself.)
> 

Really?  I think that depends on the architecture.  Reading the counter
on ARM is supposed to be cheap, I think.

> >> > I assume there's a good reason why we call guest_enter() and
> >> > guest_exit() in the hot path on every KVM architecture?
> >> 
> >> I consider myself biased when it comes to jiffies, so no judgement. :)
> >> 
> >> From what I see, the mode switch is used only for statistics.
> >> The original series is
> >> 
> >> 5e84cfde51cf303d368fcb48f22059f37b3872de~1..d172fcd3ae1ca7ac27ec8904242fd61e0e11d332
> >> 
> >> It didn't introduce the overhead with interrupt window and it didn't
> >> count host kernel irq time as user time, so it was better at that time.
> > 
> > Yes, but it was based on cputime_to... functions, which I understand use
> > ktime, which on systems running KVM will most often read the clocksource
> > directly from the hardware, and that was then optimized later to just
> > use jiffies to avoid the clocksource read because jiffies is already in
> > memory and adjusted to the granularity we need, so in some sense an
> > improvement, only it doesn't work if you don't update jiffies when
> > needed.
> 
> True.  The kvm_guest_enter/exit still didn't trigger accounting, but the
> granularity was better if there were other sources of accounting than
> just timer ticks.

I think the granularity was better, yes, but I wonder if that improved
accuracy was ever visible to the user, given that the CPU time may be
counted in jiffies?

> 
> And I noticed another funny feature: the original intention was that if
> the guest uses hardware acceleration, then the whole timeslice gets
> accounted to guest/user time, because kvm_guest_exit() was not supposed
> to clear PF_VCPU: https://lkml.org/lkml/2007/10/15/105

I don't see how this works on a current kernel though, because I can't
find any path beyond the kvm_guest_exit() which clears the flag at the
moment.

> This somehow got mangled when merging and later there was a fix that
> introduced the current behavior, which might be slightly more accurate:
> 0552f73b9a81 ("KVM: Move kvm_guest_exit() after local_irq_enable()")

So Laurent was confused as well?

It seems he needed to add that before you guys added
handle_external_intr()?

Thanks,
-Christoffer


* Re: vtime accounting
  2017-03-15  8:43           ` Christoffer Dall
@ 2017-03-15 15:57             ` Radim Krčmář
  2017-03-15 16:48               ` Christoffer Dall
  2017-03-24 15:04             ` Rik van Riel
  1 sibling, 1 reply; 26+ messages in thread
From: Radim Krčmář @ 2017-03-15 15:57 UTC (permalink / raw)
  To: Christoffer Dall
  Cc: Christoffer Dall, Paolo Bonzini, kvm, Marc Zyngier, Rik van Riel

2017-03-15 09:43+0100, Christoffer Dall:
> On Tue, Mar 14, 2017 at 09:27:22PM +0100, Radim Krčmář wrote:
>> 2017-03-14 19:39+0100, Christoffer Dall:
>> > On Tue, Mar 14, 2017 at 05:58:59PM +0100, Radim Krčmář wrote:
>> >> 2017-03-14 09:26+0100, Christoffer Dall:
>> >> > On Mon, Mar 13, 2017 at 06:28:16PM +0100, Radim Krčmář wrote:
>> >> >> 2017-03-08 02:57-0800, Christoffer Dall:
>> >> >> > Hi Paolo,
>> >> >> > 
>> >> >> > I'm looking at improving KVM/ARM a bit by calling guest_exit_irqoff
>> >> >> > before enabling interrupts when coming back from the guest.
>> >> >> > 
>> >> >> > Unfortunately, this appears to mess up my view of CPU usage using
>> >> >> > something like htop on the host, because it appears all time is spent
>> >> >> > inside the kernel.
>> >> >> > 
>> >> >> > From my analysis, I think this is because we never handle any interrupts
>> >> >> > before enabling interrupts, where the x86 code does its
>> >> >> > handle_external_intr, and the result on ARM is that we never increment
>> >> >> > jiffies before doing the vtime accounting.
>> >> >> 
>> >> >> (Hm, the counting might be broken on nohz_full then.)
>> >> >> 
>> >> > 
>> >> > Don't you still have a scheduler tick even with nohz_full and something
>> >> > that will eventually update jiffies then?
>> >> 
>> >> Probably, I don't understand jiffies accounting too well and didn't see
>> >> anything that would bump the jiffies in or before guest_exit_irqoff().
>> > 
>> > As far as I understand, from my very very short time of looking at the
>> > timer code, jiffies are updated on every tick, which can be caused by a
>> > number of events, including *any* interrupt handler (coming from idle
>> > state), soft timers, timer interrupts, and possibly other things.
>> 
>> Yes, I was thinking that entering/exiting user mode should trigger it as
>> well, in order to correctly account for time spent there, but couldn't
>> find it ...
> 
> Isn't it based on taking the timer interrupt when running in userspace
> as opposed to when running in the kernel?

The timer interrupt usually accounts it, but there is no timer interrupt
when tickless, so there has to be accounting somewhere on the border ...

__context_tracking_enter/exit(CONTEXT_USER) calls
vtime_user_enter/exit() to do accounting in that case, but expects that
jiffies were updated by someone else.

>> The case I was wondering about is if the kernel spent e.g. 10 jiffies in
>> guest mode and then exited on mmio -- no interrupt in the host, and
>> guest_exit_irqoff() would flip accounting over to system time.
>> Can those 10 jiffies get accounted to system (not guest) time?
> 
> Yes, I think the key is whether you end up taking a timer interrupt
> before or after switching PF_VCPU.  So you can spend X jiffies in the
> guest, come back, change PF_VCPU (timer still hasn't expired), and then
> the timer expires immediately afterwards, and the whole block of jiffies
> that are incremented as a result of the timer gets accounted as kernel
> time.

Wouldn't that be the result of a bug?

(The timer period is 1 jiffy, so the VM would get kicked before getting
 to 2 jiffies in guest mode and spending so much time in between seems
 like a bug.
 10 jiffies could be spent in a guest while under nohz_full, but there is
 still at least one CPU with HZ timer that updates the global jiffies, so
 I assume that guest_exit() should see at least 9 jiffies.)

> (Note that jiffies on clocksourced architectures is an arbitrary number,
> which somehow scales according to the clocksource frequency, depending
> on the CONFIG_HZ, so a single timer interrupt can increment jiffies by
> more than 1.
> 
>> Accounting 1 jiffy wrong is normal as we expect that the distribution of
>> guest/kernel time in the jiffy is going to be approximated over a longer
>> sampling period, but if we could account multiple jiffies wrong, this
>> expectation is very hard to defend.
>> 
> 
> I think it's a best effort, and we expect inaccuracies over much more
> than a single jiffy, see above.
> 
>> >> >> > So my current idea is to increment jiffies according to the clocksource
>> >> >> > before calling guest_exit_irqoff, but this would require some main
>> >> >> > clocksource infrastructure changes.
>> >> >> 
>> >> >> This seems similar to calling the function from the timer interrupt.
>> >> >> The timer interrupt would be delivered after that and only wasted time,
>> >> >> so it might actually be slower than just delivering it before ...
>> >> > 
>> >> > That's assuming that the timer interrupt hits at every exit.  I don't
>> >> > think that's the case, but I should measure it.
>> >> 
>> >> There cannot be fewer VM exits and I think there are far more VM exits,
>> >> but if there was no interrupt, then jiffies shouldn't rise and we would
>> >> get the same result as with plain guest_exit_irqoff().
>> >> 
>> > 
>> > That's true if you're guaranteed to take the timer interrupts that
>> > happen while running the guest before hitting guest_exit_irqoff(), so
>> > that you eventually count *some* time for the guest.  In the arm64 case,
>> > if we just do guest_exit_irqoff(), we *never* account any time to the
>> > guest.
>> 
>> Yes.  I assumed that if jiffies should be incremented, then you also get
>> a timer tick,
> 
> I think jiffies can be incremented in a number of cases, and as I
> understand it, when the kernel folks talk about "the tick" that doesn't
> necessarily mean that your scheduler timer fires an interrupt, but I
> could be wrong.

Good point.  do_timer() (the only place to update jiffies, I hope) is
called from:

  tick_periodic()
  xtime_update()
  tick_do_update_jiffies64()

The first two seem to coincide with a specific timer interrupt.  The
last one is called from:

  tick_sched_do_timer() [timer interrupt]
  tick_nohz_update_jiffies() [any interrupt]
  do_idle() [not interrupt, but also never hit after VM exit]

Looks like jiffies are mostly timer interrupt based, but if you have
nohz, then there are also other interrupts and switches from idle.

>> so checking whether jiffies should be incremented inside
>> guest_exit_irqoff() was only slowing down the common case, where jiffies
>> remained the same.
> 
> My original suggestion was to not use jiffies at all, but use ktime, on
> architectures that don't do handle_external_intr() type things, which
> would result in a clocksource read on every exit, which on ARM I think
> is faster than our enable/disable interrupt dance we do at the moment.

I didn't catch that, sorry.

> But of course, I'd have to measure that.

The accounting would need to change to ktime even for userspace switches
to avoid double-accounting time and I think that ktime was abolished
because it was too slow ... definitely a lot of benchmarking.

>> (Checking if jiffies should be incremented is quite slow by itself.)
>> 
> 
> Really?  I think that depends on the architecture.  Reading the counter
> on ARM is supposed to be cheap, I think.

True, and also depends on what slow means. :)

IIRC, it takes around 50 cycles to read TSC-based ktime on x86, which
would be roughly 2% of a current VM exit/entry loop.

>> >> > I assume there's a good reason why we call guest_enter() and
>> >> > guest_exit() in the hot path on every KVM architecture?
>> >> 
>> >> I consider myself biased when it comes to jiffies, so no judgement. :)
>> >> 
>> >> From what I see, the mode switch is used only for statistics.
>> >> The original series is
>> >> 
>> >> 5e84cfde51cf303d368fcb48f22059f37b3872de~1..d172fcd3ae1ca7ac27ec8904242fd61e0e11d332
>> >> 
>> >> It didn't introduce the overhead with interrupt window and it didn't
>> >> count host kernel irq time as user time, so it was better at that time.
>> > 
>> > Yes, but it was based on cputime_to... functions, which I understand use
>> > ktime, which on systems running KVM will most often read the clocksource
>> > directly from the hardware, and that was then optimized later to just
>> > use jiffies to avoid the clocksource read because jiffies is already in
>> > memory and adjusted to the granularity we need, so in some sense an
>> > improvement, only it doesn't work if you don't update jiffies when
>> > needed.
>> 
>> True.  The kvm_guest_enter/exit still didn't trigger accounting, but the
>> granularity was better if there were other sources of accounting than
>> just timer ticks.
> 
> I think the granularity was better, yes, but I wonder if that improved
> accuracy was ever visible to the user, given that the CPU time may be
> counted in jiffies?

I think the time for statistics used to be in cycles before it switched
to nanoseconds, so users could have seen some difference.

>> And I noticed another funny feature: the original intention was that if
>> the guest uses hardware acceleration, then the whole timeslice gets
>> accounted to guest/user time, because kvm_guest_exit() was not supposed
>> to clear PF_VCPU: https://lkml.org/lkml/2007/10/15/105
> 
> I don't see how this works on a current kernel though, because I can't
> find any path beyond the kvm_guest_exit() which clears the flag at the
> moment.

Right, the flag used to be cleared in account_system_time().

>> This somehow got mangled when merging and later there was a fix that
>> introduced the current behavior, which might be slightly more accurate:
>> 0552f73b9a81 ("KVM: Move kvm_guest_exit() after local_irq_enable()")
> 
> So Laurent was confused as well?

It looks like Laurent understood that you need the timer tick before
clearing the flag.  No idea how it really happened.

> It seems he needed to add that before you guys added
> handle_external_intr()?

Yes.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: vtime accounting
  2017-03-15 15:57             ` Radim Krčmář
@ 2017-03-15 16:48               ` Christoffer Dall
  2017-03-15 17:09                 ` Radim Krčmář
  0 siblings, 1 reply; 26+ messages in thread
From: Christoffer Dall @ 2017-03-15 16:48 UTC (permalink / raw)
  To: Radim Krčmář
  Cc: Christoffer Dall, Paolo Bonzini, kvm, Marc Zyngier, Rik van Riel,
	stephen.boyd

Hi Radim,

[Stephen, I am cc'ing you because we are managing to confuse ourselves
here and some of my ramblings are basically misrepresented bits of
information I got from you].

On Wed, Mar 15, 2017 at 04:57:24PM +0100, Radim Krčmář wrote:
> 2017-03-15 09:43+0100, Christoffer Dall:
> > On Tue, Mar 14, 2017 at 09:27:22PM +0100, Radim Krčmář wrote:
> >> 2017-03-14 19:39+0100, Christoffer Dall:
> >> > On Tue, Mar 14, 2017 at 05:58:59PM +0100, Radim Krčmář wrote:
> >> >> 2017-03-14 09:26+0100, Christoffer Dall:
> >> >> > On Mon, Mar 13, 2017 at 06:28:16PM +0100, Radim Krčmář wrote:
> >> >> >> 2017-03-08 02:57-0800, Christoffer Dall:
> >> >> >> > Hi Paolo,
> >> >> >> > 
> >> >> >> > I'm looking at improving KVM/ARM a bit by calling guest_exit_irqoff
> >> >> >> > before enabling interrupts when coming back from the guest.
> >> >> >> > 
> >> >> >> > Unfortunately, this appears to mess up my view of CPU usage using
> >> >> >> > something like htop on the host, because it appears all time is spent
> >> >> >> > inside the kernel.
> >> >> >> > 
> >> >> >> > From my analysis, I think this is because we never handle any interrupts
> >> >> >> > before enabling interrupts, where the x86 code does its
> >> >> >> > handle_external_intr, and the result on ARM is that we never increment
> >> >> >> > jiffies before doing the vtime accounting.
> >> >> >> 
> >> >> >> (Hm, the counting might be broken on nohz_full then.)
> >> >> >> 
> >> >> > 
> >> >> > Don't you still have a scheduler tick even with nohz_full and something
> >> >> > that will eventually update jiffies then?
> >> >> 
> >> >> Probably, I don't understand jiffies accounting too well and didn't see
> >> >> anything that would bump the jiffies in or before guest_exit_irqoff().
> >> > 
> >> > As far as I understand, from my very very short time of looking at the
> >> > timer code, jiffies are updated on every tick, which can be caused by a
> >> > number of events, including *any* interrupt handler (coming from idle
> >> > state), soft timers, timer interrupts, and possibly other things.
> >> 
> >> Yes, I was thinking that entering/exiting user mode should trigger it as
> >> well, in order to correctly account for time spent there, but couldn't
> >> find it ...
> > 
> > Isn't it based on taking the timer interrupt when running in userspace
> > as opposed to when running in the kernel?
> 
> The timer interrupt usually accounts it, but there is no timer interrupt
> when tickless, so there has to be accounting somewhere on the border ...

I think there are, only not at a regular interval: a timer interrupt
fires only when a timing is needed, such as when the scheduler decides
that a time slice has expired.

> 
> __context_tracking_enter/exit(CONTEXT_USER) calls
> vtime_user_enter/exit() to do accounting in that case, but expects that
> jiffies were updated by someone else.
> 

I'm assuming user/kernel accounting works today with tickless, so surely
this is handled somewhere.  There are a bunch of calling paths to
tick_do_update_jiffies64() in kernel/time/tick-sched.c related to nohz,
so my best guess is that it's updated at the sched_clock granularity and
that is sufficient for profiling, because you basically care about where
a particular time slice was spent, I suppose.

> >> The case I was wondering about is if the kernel spent e.g. 10 jiffies in
> >> guest mode and then exited on mmio -- no interrupt in the host, and
> >> guest_exit_irqoff() would flip accounting over to system time.
> >> Can those 10 jiffies get accounted to system (not guest) time?
> > 
> > Yes, I think the key is whether you end up taking a timer interrupt
> > before or after switching PF_VCPU.  So you can spend X jiffies in the
> > guest, come back, change PF_VCPU (timer still hasn't expired), and then
> > the timer expires immediately afterwards, and the whole block of jiffies
> > that are incremented as a result of the timer gets accounted as kernel
> > time.
> 
> Wouldn't that be the result of a bug?
> 

I don't think anyone really cares about this level of granularity; the
case I mentioned above is possible, albeit unlikely.

> (The timer period is 1 jiffy, so the VM would get kicked before getting
>  to 2 jiffies in guest mode and spending so much time in between seems
>  like a bug.
>  10 jiffies could be spent in a guest while under nohz_full, but there is
>  still at least one CPU with HZ timer that updates the global jiffies, so
>  I assume that guest_exit() should see at least 9 jiffies.)
> 
> > (Note that jiffies on clocksourced architectures is an arbitrary number,
> > which somehow scales according to the clocksource frequency, depending
> > on the CONFIG_HZ, so a single timer interrupt can increment jiffies by
> > more than 1.
> > 
> >> Accounting 1 jiffy wrong is normal as we expect that the distribution of
> >> guest/kernel time in the jiffy is going to be approximated over a longer
> >> sampling period, but if we could account multiple jiffies wrong, this
> >> expectation is very hard to defend.
> >> 
> > 
> > I think it's a best effort, and we expect inaccuracies over much more
> > than a single jiffy, see above.
> > 
> >> >> >> > So my current idea is to increment jiffies according to the clocksource
> >> >> >> > before calling guest_exit_irqoff, but this would require some main
> >> >> >> > clocksource infrastructure changes.
> >> >> >> 
> >> >> >> This seems similar to calling the function from the timer interrupt.
> >> >> >> The timer interrupt would be delivered after that and only wasted time,
> >> >> >> so it might actually be slower than just delivering it before ...
> >> >> > 
> >> >> > That's assuming that the timer interrupt hits at every exit.  I don't
> >> >> > think that's the case, but I should measure it.
> >> >> 
> >> >> There cannot be fewer VM exits and I think there are far more VM exits,
> >> >> but if there was no interrupt, then jiffies shouldn't rise and we would
> >> >> get the same result as with plain guest_exit_irqoff().
> >> >> 
> >> > 
> >> > That's true if you're guaranteed to take the timer interrupts that
> >> > happen while running the guest before hitting guest_exit_irqoff(), so
> >> > that you eventually count *some* time for the guest.  In the arm64 case,
> >> > if we just do guest_exit_irqoff(), we *never* account any time to the
> >> > guest.
> >> 
> >> Yes.  I assumed that if jiffies should be incremented, then you also get
> >> a timer tick,
> > 
> > I think jiffies can be incremented in a number of cases, and as I
> > understand it, when the kernel folks talk about "the tick" that doesn't
> > necessarily mean that your scheduler timer fires an interrupt, but I
> > could be wrong.
> 
> Good point.  do_timer() (the only place to update jiffies, I hope) is

As I understand it, the whole jiffies thing is circular, so the update
can happen in paths other than do_timer().

> called from:
> 
>   tick_periodic()
>   xtime_update()
>   tick_do_update_jiffies64()
> 
> The first two seem to coincide with a specific timer interrupt.  The
> last one is called from:
> 
>   tick_sched_do_timer() [timer interrupt]
>   tick_nohz_update_jiffies() [any interrupt]
>   do_idle() [not interrupt, but also never hit after VM exit]
> 
> Looks like jiffies are mostly timer interrupt based, but if you have
> nohz, then there are also other interrupts and switches from idle.
> 

Yes, exactly; clearly none of us are timekeeping experts...

> >> so checking whether jiffies should be incremented inside
> >> guest_exit_irqoff() was only slowing down the common case, where jiffies
> >> remained the same.
> > 
> > My original suggestion was to not use jiffies at all, but use ktime, on
> > architectures that don't do handle_external_intr() type things, which
> > would result in a clocksource read on every exit, which on ARM I think
> > is faster than our enable/disable interrupt dance we do at the moment.
> 
> I didn't catch that, sorry.
> 
> > But of course, I'd have to measure that.
> 
> The accounting would need to change to ktime even for userspace switches

good point.

> to avoid double-accounting time and I think that ktime was abolished
> because it was too slow ... definitely a lot of benchmarking.
> 
> >> (Checking if jiffies should be incremented is quite slow by itself.)
> >> 
> > 
> > Really?  I think that depends on the architecture.  Reading the counter
> > on ARM is supposed to be cheap, I think.
> 
> True, and also depends on what slow means. :)
> 
> IIRC, it takes around 50 cycles to read TSC-based ktime on x86, which
> would be roughly 2% of a current VM exit/entry loop.
> 
> >> >> > I assume there's a good reason why we call guest_enter() and
> >> >> > guest_exit() in the hot path on every KVM architecture?
> >> >> 
> >> >> I consider myself biased when it comes to jiffies, so no judgement. :)
> >> >> 
> >> >> From what I see, the mode switch is used only for statistics.
> >> >> The original series is
> >> >> 
> >> >> 5e84cfde51cf303d368fcb48f22059f37b3872de~1..d172fcd3ae1ca7ac27ec8904242fd61e0e11d332
> >> >> 
> >> >> It didn't introduce the overhead with interrupt window and it didn't
> >> >> count host kernel irq time as user time, so it was better at that time.
> >> > 
> >> > Yes, but it was based on cputime_to... functions, which I understand use
> >> > ktime, which on systems running KVM will most often read the clocksource
> >> > directly from the hardware, and that was then optimized later to just
> >> > use jiffies to avoid the clocksource read because jiffies is already in
> >> > memory and adjusted to the granularity we need, so in some sense an
> >> > improvement, only it doesn't work if you don't update jiffies when
> >> > needed.
> >> 
> >> True.  The kvm_guest_enter/exit still didn't trigger accounting, but the
> >> granularity was better if there were other sources of accounting than
> >> just timer ticks.
> > 
> > I think the granularity was better, yes, but I wonder if that improved
> > accuracy was ever visible to the user, given that the CPU time may be
> > counted in jiffies?
> 
> I think the time for statistics used to be in cycles before it switched
> to nanoseconds, so users could have seen some difference.
> 
> >> And I noticed another funny feature: the original intention was that if
> >> the guest uses hardware acceleration, then the whole timeslice gets
> >> accounted to guest/user time, because kvm_guest_exit() was not supposed
> >> to clear PF_VCPU: https://lkml.org/lkml/2007/10/15/105
> > 
> > I don't see how this works on a current kernel though, because I can't
> > find any path beyond the kvm_guest_exit() which clears the flag at the
> > moment.
> 
> Right, the flag used to be cleared in account_system_time().
> 
> >> This somehow got mangled when merging and later there was a fix that
> >> introduced the current behavior, which might be slightly more accurate:
> >> 0552f73b9a81 ("KVM: Move kvm_guest_exit() after local_irq_enable()")
> > 
> > So Laurent was confused as well?
> 
> It looks like Laurent understood that you need the timer tick before
> clearing the flag.  No idea how it really happened.
> 
> > It seems he needed to add that before you guys added
> > handle_external_intr()?
> 
> Yes.

In any case, it would be good for us to get some answers to the stuff
above, but I'll also try to prototype the load/put accounting of time.

Thanks,
-Christoffer

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: vtime accounting
  2017-03-15 16:48               ` Christoffer Dall
@ 2017-03-15 17:09                 ` Radim Krčmář
  0 siblings, 0 replies; 26+ messages in thread
From: Radim Krčmář @ 2017-03-15 17:09 UTC (permalink / raw)
  To: Christoffer Dall
  Cc: Christoffer Dall, Paolo Bonzini, kvm, Marc Zyngier, Rik van Riel,
	stephen.boyd

2017-03-15 17:48+0100, Christoffer Dall:
> In any case, it would be good for us to get some answers to the stuff
> above, but I'll also try to prototype the load/put accounting of time.

I like this plan, thanks.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: vtime accounting
  2017-03-14  8:26   ` Christoffer Dall
  2017-03-14  8:55     ` Marc Zyngier
  2017-03-14 16:58     ` Radim Krčmář
@ 2017-03-24 14:55     ` Rik van Riel
  2 siblings, 0 replies; 26+ messages in thread
From: Rik van Riel @ 2017-03-24 14:55 UTC (permalink / raw)
  To: Christoffer Dall, Radim Krčmář
  Cc: Christoffer Dall, Paolo Bonzini, kvm, Marc Zyngier

On Tue, 2017-03-14 at 09:26 +0100, Christoffer Dall wrote:
> On Mon, Mar 13, 2017 at 06:28:16PM +0100, Radim Krčmář wrote:
> > 2017-03-08 02:57-0800, Christoffer Dall:
> > > Hi Paolo,
> > > 
> > > I'm looking at improving KVM/ARM a bit by calling
> > > guest_exit_irqoff
> > > before enabling interrupts when coming back from the guest.
> > > 
> > > Unfortunately, this appears to mess up my view of CPU usage using
> > > something like htop on the host, because it appears all time is
> > > spent
> > > inside the kernel.
> > > 
> > > From my analysis, I think this is because we never handle any
> > > interrupts
> > > before enabling interrupts, where the x86 code does its
> > > handle_external_intr, and the result on ARM is that we never
> > > increment
> > > jiffies before doing the vtime accounting.
> > 
> > (Hm, the counting might be broken on nohz_full then.)
> > 
> 
> Don't you still have a scheduler tick even with nohz_full and
> something
> that will eventually update jiffies then?
> 

With nohz_full, the housekeeping CPU will get a
timer interrupt every single jiffy, and the
jiffies variable will get updated HZ times a
second.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: vtime accounting
  2017-03-15  8:43           ` Christoffer Dall
  2017-03-15 15:57             ` Radim Krčmář
@ 2017-03-24 15:04             ` Rik van Riel
  2017-03-27 12:29               ` Wanpeng Li
  1 sibling, 1 reply; 26+ messages in thread
From: Rik van Riel @ 2017-03-24 15:04 UTC (permalink / raw)
  To: Christoffer Dall, Radim Krčmář
  Cc: Christoffer Dall, Paolo Bonzini, kvm, Marc Zyngier, fweisbec

On Wed, 2017-03-15 at 09:43 +0100, Christoffer Dall wrote:

> Yes, I think the key is whether you end up taking a timer interrupt
> before or after switching PF_VCPU.  So you can spend X jiffies in the
> guest, come back, change PF_VCPU (timer still hasn't expired), and
> then
> the timer expires immediately afterwards, and the whole block of
> jiffies
> that are incremented as a result of the timer gets accounted as
> kernel
> time.
> 
> (Note that jiffies on clocksourced architectures is an arbitrary
> number,
> which somehow scales according to the clocksource frequency,
> depending
> on the CONFIG_HZ, so a single timer interrupt can increment jiffies
> by
> more than 1.
> 

That really should not happen with nohz_full.

The housekeeping CPU should get a timer interrupt
every jiffy, unless something changed recently that
I am not aware of.

Frederic, am I totally off base?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: vtime accounting
  2017-03-24 15:04             ` Rik van Riel
@ 2017-03-27 12:29               ` Wanpeng Li
  0 siblings, 0 replies; 26+ messages in thread
From: Wanpeng Li @ 2017-03-27 12:29 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Christoffer Dall, Radim Krčmář,
	Christoffer Dall, Paolo Bonzini, kvm, Marc Zyngier, fweisbec

2017-03-24 23:04 GMT+08:00 Rik van Riel <riel@redhat.com>:
> On Wed, 2017-03-15 at 09:43 +0100, Christoffer Dall wrote:
>
>> Yes, I think the key is whether you end up taking a timer interrupt
>> before or after switching PF_VCPU.  So you can spend X jiffies in the
>> guest, come back, change PF_VCPU (timer still hasn't expired), and
>> then
>> the timer expires immediately afterwards, and the whole block of
>> jiffies
>> that are incremented as a result of the timer gets accounted as
>> kernel
>> time.
>>
>> (Note that jiffies on clocksourced architectures is an arbitrary
>> number,
>> which somehow scales according to the clocksource frequency,
>> depending
>> on the CONFIG_HZ, so a single timer interrupt can increment jiffies
>> by
>> more than 1.
>>
>
> That really should not happen with nohz_full.
>
> The housekeeping CPU should get a timer interrupt
> every jiffy, unless something changed recently that
> I am not aware of.

Agreed, the timer interrupt of the housekeeping CPU will not stop w/ nohz_full.

__tick_nohz_idle_enter()
    -> can_stop_idle_tick()
        -> if (tick_nohz_full_enabled())
                if (tick_do_timer_cpu == cpu)
                    return false;

Regards,
Wanpeng Li

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2017-03-27 12:29 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-03-08 10:57 vtime accounting Christoffer Dall
2017-03-09  8:16 ` Paolo Bonzini
2017-03-13 17:28 ` Radim Krčmář
2017-03-14  8:26   ` Christoffer Dall
2017-03-14  8:55     ` Marc Zyngier
2017-03-14 11:12       ` Christoffer Dall
2017-03-14 11:46         ` Marc Zyngier
2017-03-14 16:58     ` Radim Krčmář
2017-03-14 17:09       ` Paolo Bonzini
2017-03-14 18:41         ` Christoffer Dall
2017-03-14 19:32           ` Radim Krčmář
2017-03-14 20:01             ` Christoffer Dall
2017-03-14 21:52               ` Radim Krčmář
2017-03-15  8:09                 ` Paolo Bonzini
2017-03-15  8:05           ` Paolo Bonzini
2017-03-15  8:30             ` Christoffer Dall
2017-03-14 18:39       ` Christoffer Dall
2017-03-14 20:27         ` Radim Krčmář
2017-03-14 21:53           ` Radim Krčmář
2017-03-15  8:43           ` Christoffer Dall
2017-03-15 15:57             ` Radim Krčmář
2017-03-15 16:48               ` Christoffer Dall
2017-03-15 17:09                 ` Radim Krčmář
2017-03-24 15:04             ` Rik van Riel
2017-03-27 12:29               ` Wanpeng Li
2017-03-24 14:55     ` Rik van Riel
