* vtime accounting
@ 2017-03-08 10:57 Christoffer Dall
2017-03-09 8:16 ` Paolo Bonzini
2017-03-13 17:28 ` Radim Krčmář
0 siblings, 2 replies; 26+ messages in thread
From: Christoffer Dall @ 2017-03-08 10:57 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: kvm, Marc Zyngier
Hi Paolo,
I'm looking at improving KVM/ARM a bit by calling guest_exit_irqoff
before enabling interrupts when coming back from the guest.
Unfortunately, this appears to mess up my view of CPU usage using
something like htop on the host, because it appears all time is spent
inside the kernel.
From my analysis, I think this is because we never handle any interrupts
before enabling interrupts, where the x86 code does its
handle_external_intr, and the result on ARM is that we never increment
jiffies before doing the vtime accounting.
So my current idea is to increment jiffies according to the clocksource
before calling guest_exit_irqoff, but this would require some main
clocksource infrastructure changes.
My question is: how important is the vtime accounting on the host from
your point of view? Worth poking the timekeeping folks about or even
trying to convince ourselves that the handle_external_intr thing is
worth it?
Thanks,
-Christoffer
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: vtime accounting
From: Paolo Bonzini @ 2017-03-09 8:16 UTC (permalink / raw)
To: Christoffer Dall; +Cc: kvm, Marc Zyngier, Rik van Riel

On 08/03/2017 11:57, Christoffer Dall wrote:
> Hi Paolo,
>
> I'm looking at improving KVM/ARM a bit by calling guest_exit_irqoff
> before enabling interrupts when coming back from the guest.
>
> Unfortunately, this appears to mess up my view of CPU usage using
> something like htop on the host, because it appears all time is spent
> inside the kernel.
>
> From my analysis, I think this is because we never handle any interrupts
> before enabling interrupts, where the x86 code does its
> handle_external_intr, and the result on ARM is that we never increment
> jiffies before doing the vtime accounting.
>
> So my current idea is to increment jiffies according to the clocksource
> before calling guest_exit_irqoff, but this would require some main
> clocksource infrastructure changes.
>
> My question is: how important is the vtime accounting on the host from
> your point of view? Worth poking the timekeeping folks about or even
> trying to convince ourselves that the handle_external_intr thing is
> worth it?

Not really my area, so let's ask Rik. :)

Paolo
* Re: vtime accounting
From: Radim Krčmář @ 2017-03-13 17:28 UTC (permalink / raw)
To: Christoffer Dall; +Cc: Paolo Bonzini, kvm, Marc Zyngier, Rik van Riel

2017-03-08 02:57-0800, Christoffer Dall:
> Hi Paolo,
>
> I'm looking at improving KVM/ARM a bit by calling guest_exit_irqoff
> before enabling interrupts when coming back from the guest.
>
> Unfortunately, this appears to mess up my view of CPU usage using
> something like htop on the host, because it appears all time is spent
> inside the kernel.
>
> From my analysis, I think this is because we never handle any interrupts
> before enabling interrupts, where the x86 code does its
> handle_external_intr, and the result on ARM is that we never increment
> jiffies before doing the vtime accounting.

(Hm, the counting might be broken on nohz_full then.)

> So my current idea is to increment jiffies according to the clocksource
> before calling guest_exit_irqoff, but this would require some main
> clocksource infrastructure changes.

This seems similar to calling the function from the timer interrupt.
The timer interrupt would be delivered after that and would only waste
time, so it might actually be slower than just delivering it before ...

How expensive is the interrupt enable/disable cycle that this
optimization saves?

> My question is: how important is the vtime accounting on the host from
> your point of view?

No idea.  I'd keep the same behavior on all architectures, though.

The precision of accounting is in jiffies (millions of cycles), so we
could maybe move it from the hot path to vcpu_load/put(?) without
affecting the count in usual cases ...

> Worth poking the timekeeping folks about or even
> trying to convince ourselves that the handle_external_intr thing is
> worth it?

handle_external_intr() needs some hardware support in order to be more
than a worthless 'local_irq_enable(); local_irq_disable()' ...

e.g. VMX doesn't queue the interrupt that caused a VM exit in the
interrupt controller.  (VMX's handle_external_intr() doesn't deliver
any interrupts other than those that might have become pending after
the exit, but this race is not very important due to accounting
granularity ...)
Alternatively, the interrupt controller would need to support dequeuing
without delivery (but delivering in software might still be slower than
cycling interrupts on and off).

Thanks.
* Re: vtime accounting
From: Christoffer Dall @ 2017-03-14 8:26 UTC (permalink / raw)
To: Radim Krčmář
Cc: Christoffer Dall, Paolo Bonzini, kvm, Marc Zyngier, Rik van Riel

On Mon, Mar 13, 2017 at 06:28:16PM +0100, Radim Krčmář wrote:
> 2017-03-08 02:57-0800, Christoffer Dall:
> > Hi Paolo,
> >
> > I'm looking at improving KVM/ARM a bit by calling guest_exit_irqoff
> > before enabling interrupts when coming back from the guest.
> >
> > Unfortunately, this appears to mess up my view of CPU usage using
> > something like htop on the host, because it appears all time is spent
> > inside the kernel.
> >
> > From my analysis, I think this is because we never handle any interrupts
> > before enabling interrupts, where the x86 code does its
> > handle_external_intr, and the result on ARM is that we never increment
> > jiffies before doing the vtime accounting.
>
> (Hm, the counting might be broken on nohz_full then.)

Don't you still have a scheduler tick even with nohz_full and something
that will eventually update jiffies then?

> > So my current idea is to increment jiffies according to the clocksource
> > before calling guest_exit_irqoff, but this would require some main
> > clocksource infrastructure changes.
>
> This seems similar to calling the function from the timer interrupt.
> The timer interrupt would be delivered after that and only wasted time,
> so it might actually be slower than just delivering it before ...

That's assuming that the timer interrupt hits at every exit.  I don't
think that's the case, but I should measure it.

> How expensive is the interrupt enable/disable cycle that this
> optimization saves?

I'll have to go back and measure this bit specifically again, but I
recall it being a couple of hundred cycles.  Not alarming, but
worthwhile looking into.

> > My question is: how important is the vtime accounting on the host from
> > your point of view?
>
> No idea.  I'd keep the same behavior on all architectures, though.
>
> The precision of accounting is in jiffies (millions of cycles), so we
> could maybe move it from the hot path to vcpu_load/put(?) without
> affecting the count in usual cases ...

So since sending my original e-mail I found out that the vtime
accounting logic was changed from ktime to jiffies, which is partly why
we're having problems on arm.  See:

  ff9a9b4c4334b53b52ee9279f30bd5dd92ea9bdd
  sched, time: Switch VIRT_CPU_ACCOUNTING_GEN to jiffy granularity

Moving to load/put depends on the semantics of this vtime thing.  Is
this counting cycles spent in the VM as opposed to in the host kernel
and IRQ handling, and is that useful for system profiling or scheduling
decisions, in which case moving to vcpu_load/put doesn't work...

I assume there's a good reason why we call guest_enter() and
guest_exit() in the hot path on every KVM architecture?

> > Worth poking the timekeeping folks about or even
> > trying to convince ourselves that the handle_external_intr thing is
> > worth it?
>
> handle_external_intr() needs some hardware support in order to be more
> than a worthless 'local_irq_enable(); local_irq_disable()' ...
>
> e.g. VMX doesn't queue the interrupt that caused a VM exit in the
> interrupt controller.  (VMX's handle_external_intr() doesn't deliver
> any interrupts other than those that might have become pending after
> the exit, but this race is not very important due to accounting
> granularity ...)
> Alternatively, the interrupt controller would need to support dequeuing
> without delivery (but delivering in software might still be slower than
> cycling interrupts on and off).

On ARM, I think the main benefits of implementing something like
handle_external_intr would come from two things: (1) you'd avoid the
context synchronization and associated cost of taking an exception on
the CPU, and (2) you'd also (potentially) avoid the additional
save/restore of all the GP registers from the kernel exception entry
path needed to create a usable gp_regs.

I have to look more carefully at whether or not (2) is possible, because
it would mean we'd have to store the guest register state in a pt_regs
structure in the first place and pass that directly to arch_handle_irq.

Additionally, if we had something like handle_external_intr, the
guest_exit thing would be kinda moot, since we'd do our ticks like
x86...

Thanks,
-Christoffer
* Re: vtime accounting
From: Marc Zyngier @ 2017-03-14 8:55 UTC (permalink / raw)
To: Christoffer Dall
Cc: Radim Krčmář, Christoffer Dall, Paolo Bonzini, kvm, Rik van Riel

On Tue, Mar 14 2017 at 8:26:01 am GMT, Christoffer Dall <cdall@linaro.org> wrote:
> On ARM, I think the main benefits of implementing something like
> handle_external_intr would come from two things: (1) you'd avoid the
> context synchronization and associated cost of taking an exception on
> the CPU, and (2) you'd also (potentially) avoid the additional
> save/restore of all the GP registers from the kernel exception entry
> path needed to create a usable gp_regs.
>
> I have to look more carefully at whether or not (2) is possible, because
> it would mean we'd have to store the guest register state in a pt_regs
> structure in the first place and pass that directly to arch_handle_irq.

For that to be useful on pre-VHE HW and avoid the extra exception
handling, we'd also have to perform that arch_handle_irq call as part of
the exception return to EL1.  That's not going to be very pretty, as we
need to build two return contexts (EL2->EL1 IRQ on the host, followed by
EL1->EL1).

This is all perfectly doable, but as always, we need numbers.

I had similar stuff for VHE a long while ago (back when I wrote WSINC),
but faking the pt_regs was absolutely horrible.  But maybe it is time to
exhume this again...

> Additionally, if we had something like handle_external_intr, the
> guest_exit thing would be kinda moot, since we'd do our ticks like
> x86...

Indeed.  Which brings the obvious question: are we the only ones not
taking the interrupt early?

Thanks,

	M.
-- 
Jazz is not dead, it just smells funny.
* Re: vtime accounting
From: Christoffer Dall @ 2017-03-14 11:12 UTC (permalink / raw)
To: Marc Zyngier
Cc: Radim Krčmář, Christoffer Dall, Paolo Bonzini, kvm, Rik van Riel

On Tue, Mar 14, 2017 at 08:55:09AM +0000, Marc Zyngier wrote:
> On Tue, Mar 14 2017 at 8:26:01 am GMT, Christoffer Dall <cdall@linaro.org> wrote:
> > On ARM, I think the main benefits of implementing something like
> > handle_external_intr would come from two things: (1) you'd avoid the
> > context synchronization and associated cost of taking an exception on
> > the CPU, and (2) you'd also (potentially) avoid the additional
> > save/restore of all the GP registers from the kernel exception entry
> > path needed to create a usable gp_regs.
> >
> > I have to look more carefully at whether or not (2) is possible, because
> > it would mean we'd have to store the guest register state in a pt_regs
> > structure in the first place and pass that directly to
> > arch_handle_irq.
>
> For that to be useful on pre-VHE HW and avoid the extra exception
> handling, we'd also have to perform that arch_handle_irq call as part of
> the exception return to EL1.  That's not going to be very pretty, as we
> need to build two return contexts (EL2->EL1 IRQ on the host, followed by
> EL1->EL1).

Why do we need something like that?

Why can't we simply call arch_handle_irq from kvm_arch_vcpu_ioctl_run,
in either EL1 or EL2 depending on whether we have VHE, assuming we have
a pt_regs struct we can use with the expected semantics for
arch_handle_irq?

In the first case, we would have completed the EL2->EL1 transition
before attempting to handle IRQs, and in the second case we would always
be in EL2, but we would logically be in the same place in the overall
execution flow.

Note that I'm assuming you don't need to call arch_handle_irq directly
from a minimal set of assembly code following the exception vector, but
that arch_handle_irq can be called as a function at a later time with
interrupts disabled, as long as the pt_regs reasonably express the state
of the CPU at the time the exception occurred, which in this case would
(most likely) be the guest's state.

> This is all perfectly doable, but as always, we need numbers.

Agreed.

> I had similar stuff for VHE a long while ago (back when I wrote WSINC),
> but faking the pt_regs was absolutely horrible.  But maybe it is time to
> exhume this again...

Correct me if I'm wrong, but I think you're referring to the work where
you used the host's exception vectors, not calling arch_handle_irq from
the run loop, or did I misunderstand?

> > Additionally, if we had something like handle_external_intr, the
> > guest_exit thing would be kinda moot, since we'd do our ticks like
> > x86...
>
> Indeed.  Which brings the obvious question: are we the only ones not
> taking the interrupt early?

Hmm, I tried taking a look, but my mips/powerpc/s390 fail me slightly.
However, from what I can gather, s390 enables interrupts after returning
from the guest, then disables them again and calls guest_exit_irqoff, so
kinda what we're doing from a performance perspective, only different.

I'm lost in all the versions of the powerpc code that call
guest_exit{_irqoff}, and it looks to me like mips handles physical
interrupts from the exit path before reaching the guest_exit_irqoff
point.

But I could be completely wrong :-/

Thanks,
-Christoffer
* Re: vtime accounting
From: Marc Zyngier @ 2017-03-14 11:46 UTC (permalink / raw)
To: Christoffer Dall
Cc: Radim Krčmář, Christoffer Dall, Paolo Bonzini, kvm, Rik van Riel

On 14/03/17 11:12, Christoffer Dall wrote:
> On Tue, Mar 14, 2017 at 08:55:09AM +0000, Marc Zyngier wrote:
>> On Tue, Mar 14 2017 at 8:26:01 am GMT, Christoffer Dall <cdall@linaro.org> wrote:
>>> On ARM, I think the main benefits of implementing something like
>>> handle_external_intr would come from two things: (1) you'd avoid the
>>> context synchronization and associated cost of taking an exception on
>>> the CPU, and (2) you'd also (potentially) avoid the additional
>>> save/restore of all the GP registers from the kernel exception entry
>>> path needed to create a usable gp_regs.
>>>
>>> I have to look more carefully at whether or not (2) is possible, because
>>> it would mean we'd have to store the guest register state in a pt_regs
>>> structure in the first place and pass that directly to
>>> arch_handle_irq.
>>
>> For that to be useful on pre-VHE HW and avoid the extra exception
>> handling, we'd also have to perform that arch_handle_irq call as part of
>> the exception return to EL1.  That's not going to be very pretty, as we
>> need to build two return contexts (EL2->EL1 IRQ on the host, followed by
>> EL1->EL1).
>
> Why do we need something like that?
>
> Why can't we simply call arch_handle_irq from kvm_arch_vcpu_ioctl_run,
> in either EL1 or EL2 depending on whether we have VHE, assuming we have
> a pt_regs struct we can use with the expected semantics for
> arch_handle_irq?

Ah, that's where you want to perform the call.  I was thinking of doing
it a bit earlier and saving another few instructions by performing the
eret directly into the interrupt handler.  I'm probably trying to
optimize too early, as usual!

> In the first case, we would have completed the EL2->EL1 transition
> before attempting to handle IRQs, and in the second case we would always
> be in EL2, but we would logically be in the same place in the overall
> execution flow.
>
> Note that I'm assuming you don't need to call arch_handle_irq directly
> from a minimal set of assembly code following the exception vector, but
> that arch_handle_irq can be called as a function at a later time with
> interrupts disabled, as long as the pt_regs reasonably express the state
> of the CPU at the time the exception occurred, which in this case would
> (most likely) be the guest's state.
>
>> This is all perfectly doable, but as always, we need numbers.
>
> Agreed.
>
>> I had similar stuff for VHE a long while ago (back when I wrote WSINC),
>> but faking the pt_regs was absolutely horrible.  But maybe it is time to
>> exhume this again...
>
> Correct me if I'm wrong, but I think you're referring to the work where
> you used the host's exception vectors, not calling arch_handle_irq from
> the run loop, or did I misunderstand?

Yes, that's what I had in mind, but you're obviously thinking of
performing the call differently, so please ignore me until I come to my
senses...

>>> Additionally, if we had something like handle_external_intr, the
>>> guest_exit thing would be kinda moot, since we'd do our ticks like
>>> x86...
>>
>> Indeed.  Which brings the obvious question: are we the only ones not
>> taking the interrupt early?
>
> Hmm, I tried taking a look, but my mips/powerpc/s390 fail me slightly.
> However, from what I can gather, s390 enables interrupts after returning
> from the guest, then disables them again and calls guest_exit_irqoff, so
> kinda what we're doing from a performance perspective, only different.
>
> I'm lost in all the versions of the powerpc code that call
> guest_exit{_irqoff}, and it looks to me like mips handles physical
> interrupts from the exit path before reaching the guest_exit_irqoff
> point.
>
> But I could be completely wrong :-/

Well, it at least shows that there is a variety of ways to handle
interrupts on exit, and that the patch you quoted earlier in the thread
may have affected more than just ARM.

Thanks,

	M.
-- 
Jazz is not dead. It just smells funny...
* Re: vtime accounting
From: Radim Krčmář @ 2017-03-14 16:58 UTC (permalink / raw)
To: Christoffer Dall
Cc: Christoffer Dall, Paolo Bonzini, kvm, Marc Zyngier, Rik van Riel

2017-03-14 09:26+0100, Christoffer Dall:
> On Mon, Mar 13, 2017 at 06:28:16PM +0100, Radim Krčmář wrote:
>> 2017-03-08 02:57-0800, Christoffer Dall:
>> > Hi Paolo,
>> >
>> > I'm looking at improving KVM/ARM a bit by calling guest_exit_irqoff
>> > before enabling interrupts when coming back from the guest.
>> >
>> > Unfortunately, this appears to mess up my view of CPU usage using
>> > something like htop on the host, because it appears all time is spent
>> > inside the kernel.
>> >
>> > From my analysis, I think this is because we never handle any interrupts
>> > before enabling interrupts, where the x86 code does its
>> > handle_external_intr, and the result on ARM is that we never increment
>> > jiffies before doing the vtime accounting.
>>
>> (Hm, the counting might be broken on nohz_full then.)
>
> Don't you still have a scheduler tick even with nohz_full and something
> that will eventually update jiffies then?

Probably, I don't understand jiffies accounting too well and didn't see
anything that would bump the jiffies in or before guest_exit_irqoff().

>> > So my current idea is to increment jiffies according to the clocksource
>> > before calling guest_exit_irqoff, but this would require some main
>> > clocksource infrastructure changes.
>>
>> This seems similar to calling the function from the timer interrupt.
>> The timer interrupt would be delivered after that and only wasted time,
>> so it might actually be slower than just delivering it before ...
>
> That's assuming that the timer interrupt hits at every exit.  I don't
> think that's the case, but I should measure it.

There cannot be fewer VM exits, and I think there are far more VM exits,
but if there was no interrupt, then jiffies shouldn't rise and we would
get the same result as with plain guest_exit_irqoff().

>> How expensive is the interrupt enable/disable cycle that this
>> optimization saves?
>
> I'll have to go back and measure this bit specifically again, but I
> recall it being a couple of hundred cycles.  Not alarming, but
> worthwhile looking into.

Yeah, sounds good.

>> > My question is: how important is the vtime accounting on the host from
>> > your point of view?
>>
>> No idea.  I'd keep the same behavior on all architectures, though.
>>
>> The precision of accounting is in jiffies (millions of cycles), so we
>> could maybe move it from the hot path to vcpu_load/put(?) without
>> affecting the count in usual cases ...
>
> So since sending my original e-mail I found out that the vtime
> accounting logic was changed from ktime to jiffies, which is partly why
> we're having problems on arm.  See:
>
>   ff9a9b4c4334b53b52ee9279f30bd5dd92ea9bdd
>   sched, time: Switch VIRT_CPU_ACCOUNTING_GEN to jiffy granularity
>
> Moving to load/put depends on the semantics of this vtime thing.  Is
> this counting cycles spent in the VM as opposed to in the host kernel
> and IRQ handling, and is that useful for system profiling or scheduling
> decisions, in which case moving to vcpu_load/put doesn't work...

Right.

> I assume there's a good reason why we call guest_enter() and
> guest_exit() in the hot path on every KVM architecture?

I consider myself biased when it comes to jiffies, so no judgement. :)

From what I see, the mode switch is used only for statistics.

The original series is
5e84cfde51cf303d368fcb48f22059f37b3872de~1..d172fcd3ae1ca7ac27ec8904242fd61e0e11d332

It didn't introduce the overhead with the interrupt window and it didn't
count host kernel irq time as user time, so it was better at that time.
* Re: vtime accounting
From: Paolo Bonzini @ 2017-03-14 17:09 UTC (permalink / raw)
To: Radim Krčmář, Christoffer Dall
Cc: Christoffer Dall, kvm, Marc Zyngier, Rik van Riel

On 14/03/2017 17:58, Radim Krčmář wrote:
>> I assume there's a good reason why we call guest_enter() and
>> guest_exit() in the hot path on every KVM architecture?
> I consider myself biased when it comes to jiffies, so no judgement. :)
>
> From what I see, the mode switch is used only for statistics.

vtime is only for statistics, but guest_enter/exit are important because
they enter an RCU extended quiescent state.  This means that (physical)
CPUs running a guest are effectively "off" from the point of view of the
RCU accounting machinery.  Not having to perform any RCU work is very
good for jitter.

Paolo
* Re: vtime accounting
From: Christoffer Dall @ 2017-03-14 18:41 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Radim Krčmář, Christoffer Dall, kvm, Marc Zyngier, Rik van Riel

On Tue, Mar 14, 2017 at 06:09:45PM +0100, Paolo Bonzini wrote:
> On 14/03/2017 17:58, Radim Krčmář wrote:
> >> I assume there's a good reason why we call guest_enter() and
> >> guest_exit() in the hot path on every KVM architecture?
> > I consider myself biased when it comes to jiffies, so no judgement. :)
> >
> > From what I see, the mode switch is used only for statistics.
>
> vtime is only for statistics, but guest_enter/exit are important because
> they enter an RCU extended quiescent state.  This means that (physical)
> CPUs running a guest are effectively "off" from the point of view of the
> RCU accounting machinery.  Not having to perform any RCU work is very
> good for jitter.

So would it be worth considering factoring out vtime accounting from
guest_enter/exit, such that we could do the vtime accounting from vcpu
load/put and mark the RCU extended quiescent state in the run loop?

Disclaimer: I haven't completely convinced myself that vtime accounting
from load/put works as it should.  For example, when servicing a VM from
KVM, should we really be accounting this as kernel time, or as guest
time?  I think we do the former now, but if the latter is the right
thing, would changing the behavior constitute an ABI change to
userspace?

Thanks,
-Christoffer
* Re: vtime accounting
From: Radim Krčmář @ 2017-03-14 19:32 UTC (permalink / raw)
To: Christoffer Dall
Cc: Paolo Bonzini, Christoffer Dall, kvm, Marc Zyngier, Rik van Riel

2017-03-14 19:41+0100, Christoffer Dall:
> On Tue, Mar 14, 2017 at 06:09:45PM +0100, Paolo Bonzini wrote:
>> On 14/03/2017 17:58, Radim Krčmář wrote:
>> >> I assume there's a good reason why we call guest_enter() and
>> >> guest_exit() in the hot path on every KVM architecture?
>> > I consider myself biased when it comes to jiffies, so no judgement. :)
>> >
>> > From what I see, the mode switch is used only for statistics.
>>
>> vtime is only for statistics, but guest_enter/exit are important because
>> they enter an RCU extended quiescent state.  This means that (physical)
>> CPUs running a guest are effectively "off" from the point of view of the
>> RCU accounting machinery.  Not having to perform any RCU work is very
>> good for jitter.

Ah, good point.

> So would it be worth considering factoring out vtime accounting from
> guest_enter/exit, such that we could do the vtime accounting from vcpu
> load/put and mark the RCU extended quiescent state in the run loop?

RCU is the reason why guest_exit() needs disabled interrupts, so if we
split them, we could do rcu_virt_note_context_switch() before enabling
interrupts, and guest_exit() right after.

> Disclaimer: I haven't completely convinced myself that vtime accounting
> from load/put works as it should.  For example, when servicing a VM from
> KVM, should we really be accounting this as kernel time, or as guest
> time?  I think we do the former now, but if the latter is the right
> thing, would changing the behavior constitute an ABI change to
> userspace?

Not considering that option would be best. :)
* Re: vtime accounting
From: Christoffer Dall @ 2017-03-14 20:01 UTC (permalink / raw)
To: Radim Krčmář
Cc: Paolo Bonzini, Christoffer Dall, kvm, Marc Zyngier, Rik van Riel

On Tue, Mar 14, 2017 at 08:32:02PM +0100, Radim Krčmář wrote:
> 2017-03-14 19:41+0100, Christoffer Dall:
> > On Tue, Mar 14, 2017 at 06:09:45PM +0100, Paolo Bonzini wrote:
> >> On 14/03/2017 17:58, Radim Krčmář wrote:
> >> >> I assume there's a good reason why we call guest_enter() and
> >> >> guest_exit() in the hot path on every KVM architecture?
> >> > I consider myself biased when it comes to jiffies, so no judgement. :)
> >> >
> >> > From what I see, the mode switch is used only for statistics.
> >>
> >> vtime is only for statistics, but guest_enter/exit are important because
> >> they enter an RCU extended quiescent state.  This means that (physical)
> >> CPUs running a guest are effectively "off" from the point of view of the
> >> RCU accounting machinery.  Not having to perform any RCU work is very
> >> good for jitter.
>
> Ah, good point.
>
> > So would it be worth considering factoring out vtime accounting from
> > guest_enter/exit, such that we could do the vtime accounting from vcpu
> > load/put and mark the RCU extended quiescent state in the run loop?
>
> RCU is the reason why guest_exit() needs disabled interrupts, so if we
> split them, we could do rcu_virt_note_context_switch() before enabling
> interrupts, and guest_exit() right after.

I'm not convinced that what you're saying is true ;)

I think we only fiddle with RCU during guest_enter, and further, a trace
of guest_exit reveals:

  guest_exit_irqoff
  -> vtime_guest_exit
     -> __vtime_account_system
        -> get_vtime_delta
           -> account_other_time
              -> WARN_ON_ONCE(!irqs_disabled());

So I think we do need interrupts disabled when messing with vtime?

> > Disclaimer: I haven't completely convinced myself that vtime accounting
> > from load/put works as it should.  For example, when servicing a VM from
> > KVM, should we really be accounting this as kernel time, or as guest
> > time?  I think we do the former now, but if the latter is the right
> > thing, would changing the behavior constitute an ABI change to
> > userspace?
>
> Not considering that option would be best. :)

If my statement above about needing interrupts disabled when dealing
with vtime is correct, then considering this begins to sound
interesting, also given that the vtime thing is not entirely free and
we're dealing with the hot path of receiving IPIs here, for example.

Thanks,
-Christoffer
* Re: vtime accounting 2017-03-14 20:01 ` Christoffer Dall @ 2017-03-14 21:52 ` Radim Krčmář 2017-03-15 8:09 ` Paolo Bonzini 0 siblings, 1 reply; 26+ messages in thread From: Radim Krčmář @ 2017-03-14 21:52 UTC (permalink / raw) To: Christoffer Dall Cc: Paolo Bonzini, Christoffer Dall, kvm, Marc Zyngier, Rik van Riel 2017-03-14 21:01+0100, Christoffer Dall: > On Tue, Mar 14, 2017 at 08:32:02PM +0100, Radim Krčmář wrote: >> 2017-03-14 19:41+0100, Christoffer Dall: >> > On Tue, Mar 14, 2017 at 06:09:45PM +0100, Paolo Bonzini wrote: >> >> On 14/03/2017 17:58, Radim Krčmář wrote: >> >> >> I assume there's a good reason why we call guest_enter() and >> >> >> guest_exit() in the hot path on every KVM architecture? >> >> > I consider myself biased when it comes to jiffies, so no judgement. :) >> >> > >> >> > From what I see, the mode switch is used only for statistics. >> >> >> >> vtime is only for statistics, but guest_enter/exit are important because >> >> they enter an RCU extended quiescent state. This means that (physical) >> >> CPUs running a guest are effectively "off" from the point of view of the >> >> RCU accounting machinery. Not having to perform any RCU work is very >> >> good for jitter. >> >> Ah, good point. >> >> > So would it be worth considering factoring out vtime accounting from >> > guest_enter/exit, such that we could do the vtime accounting from vcpu >> > load/put and mark the RCU extended quiescent state in the run loop? >> >> RCU is the reason why guest_exit() needs disabled interrupts, so if we >> split them, we could do rcu_virt_note_context_switch() before enabling >> interrupts, and guest_exit() right after. >> > > I'm not convinced that what you're saying is true ;) I agree. 
> I think we only fiddle with RCU during guest_enter, and further, a trace > of guest_exit reveals:
>
> guest_exit_irqoff
> -> vtime_guest_exit
> -> __vtime_account_system
> -> get_vtime_delta
> -> account_other_time
> -> WARN_ON_ONCE(!irqs_disabled());
>
> So I think we do need interrupts disabled when messing with vtime? Seems like it. >> > Disclaimer: I haven't completely convinced myself that vtime accounting >> > from load/put works as it should. For example, when servicing a VM from >> > KVM, should we really be accounting this as kernel time, or as guest >> > time? I think we do the former now, but if the latter is the right >> > thing, would changing the behavior constitute an ABI change to >> > userspace? >> >> Not considering that option would be best. :) > > If my statement above about needing interrupts disabled when dealing > with vtime is true, then considering this begins to sound interesting, also > given that the vtime thing is not entirely free and we're dealing with > the hot path of receiving IPIs here, for example. I'm liking it less and less the more I read. :) CONTEXT_USER vtime is coupled with context tracking and going out of CONTEXT_KERNEL means that RCU cannot be used in between. Using CONTEXT_GUEST from load/put would change the meaning of contexts ... ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: vtime accounting 2017-03-14 21:52 ` Radim Krčmář @ 2017-03-15 8:09 ` Paolo Bonzini 0 siblings, 0 replies; 26+ messages in thread From: Paolo Bonzini @ 2017-03-15 8:09 UTC (permalink / raw) To: Radim Krčmář, Christoffer Dall Cc: Christoffer Dall, kvm, Marc Zyngier, Rik van Riel On 14/03/2017 22:52, Radim Krčmář wrote: >> If my statement above about needing interrupts disabled when dealing >> with vtime is true, then considering this begins to sound interesting, also >> given that the vtime thing is not entirely free and we're dealing with >> the hot path of receiving IPIs here, for example. > I'm liking it less and less the more I read. :) > CONTEXT_USER vtime is coupled with context tracking and going out of > CONTEXT_KERNEL means that RCU cannot be used in between. Using > CONTEXT_GUEST from load/put would change the meaning of contexts ... CONTEXT_GUEST is an RCU extended quiescent state; doing it from load to put makes no sense. PF_VCPU from load to put on the other hand might make some sense. Paolo ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: vtime accounting 2017-03-14 18:41 ` Christoffer Dall 2017-03-14 19:32 ` Radim Krčmář @ 2017-03-15 8:05 ` Paolo Bonzini 2017-03-15 8:30 ` Christoffer Dall 1 sibling, 1 reply; 26+ messages in thread From: Paolo Bonzini @ 2017-03-15 8:05 UTC (permalink / raw) To: Christoffer Dall Cc: Radim Krčmář, Christoffer Dall, kvm, Marc Zyngier, Rik van Riel On 14/03/2017 19:41, Christoffer Dall wrote: > So would it be worth considering factoring out vtime accounting from > guest_enter/exit, such that we could do the vtime accounting from vcpu > load/put and mark the RCU extended quiescent state in the run loop? > > Disclaimer: I haven't completely convinced myself that vtime accounting > from load/put works as it should. Me neither, but if it works, it's worth at least benchmarking it to make an informed decision. > For example, when servicing a VM from > KVM, should we really be accounting this as kernel time, or as guest > time? Servicing the request should be fast enough that it doesn't matter. For the most common x86 vmexits, 50% of the time is spent executing the x86 world switch microcode, and that is already being accounted as guest time. On microbenchmarks of course it would make a difference:
$ time ./x86/run x86/vmexit.flat -append cpuid -smp 4
[...]
real 0m1.327s
user 0m1.877s << this is really guest time
sys 0m1.844s << this is spent mostly in KVM_RUN
Paolo > I think we do the former now, but if the latter is the right > thing, would changing the behavior constitute an ABI change to > userspace? ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: vtime accounting 2017-03-15 8:05 ` Paolo Bonzini @ 2017-03-15 8:30 ` Christoffer Dall 0 siblings, 0 replies; 26+ messages in thread From: Christoffer Dall @ 2017-03-15 8:30 UTC (permalink / raw) To: Paolo Bonzini Cc: Radim Krčmář, Christoffer Dall, kvm, Marc Zyngier, Rik van Riel On Wed, Mar 15, 2017 at 09:05:28AM +0100, Paolo Bonzini wrote: > > > On 14/03/2017 19:41, Christoffer Dall wrote: > > So would it be worth considering factoring out vtime accounting from > > guest_enter/exit, such that we could do the vtime accounting from vcpu > > load/put and mark the RCU extended quiescent state in the run loop? > > > > Disclaimer: I haven't completely convinced myself that vtime accounting > > from load/put works as it should. > > Me neither, but if it works, it's worth at least benchmarking it to make > an informed decision. > I'll try to prototype it and see how it looks. > > For example, when servicing a VM from > > KVM, should we really be accounting this as kernel time, or as guest > > time? > > Servicing the request should be fast enough that it doesn't matter. For > the most common x86 vmexits, 50% of the time is spent executing the x86 > world switch microcode, and that is already being accounted as guest > time. On microbenchmarks of course it would make a difference: > >
> $ time ./x86/run x86/vmexit.flat -append cpuid -smp 4
> [...]
> real 0m1.327s
> user 0m1.877s << this is really guest time
> sys 0m1.844s << this is spent mostly in KVM_RUN
> > Good point, and if it's for statistics, this slight change in behavior can hardly constitute any form of user breakage. Thanks, -Christoffer ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: vtime accounting 2017-03-14 16:58 ` Radim Krčmář 2017-03-14 17:09 ` Paolo Bonzini @ 2017-03-14 18:39 ` Christoffer Dall 2017-03-14 20:27 ` Radim Krčmář 1 sibling, 1 reply; 26+ messages in thread From: Christoffer Dall @ 2017-03-14 18:39 UTC (permalink / raw) To: Radim Krčmář Cc: Christoffer Dall, Paolo Bonzini, kvm, Marc Zyngier, Rik van Riel On Tue, Mar 14, 2017 at 05:58:59PM +0100, Radim Krčmář wrote: > 2017-03-14 09:26+0100, Christoffer Dall: > > On Mon, Mar 13, 2017 at 06:28:16PM +0100, Radim Krčmář wrote: > >> 2017-03-08 02:57-0800, Christoffer Dall: > >> > Hi Paolo, > >> > > >> > I'm looking at improving KVM/ARM a bit by calling guest_exit_irqoff > >> > before enabling interrupts when coming back from the guest. > >> > > >> > Unfortunately, this appears to mess up my view of CPU usage using > >> > something like htop on the host, because it appears all time is spent > >> > inside the kernel. > >> > > >> > From my analysis, I think this is because we never handle any interrupts > >> > before enabling interrupts, where the x86 code does its > >> > handle_external_intr, and the result on ARM is that we never increment > >> > jiffies before doing the vtime accounting. > >> > >> (Hm, the counting might be broken on nohz_full then.) > >> > > > > Don't you still have a scheduler tick even with nohz_full and something > > that will eventually update jiffies then? > > Probably, I don't understand jiffies accounting too well and didn't see > anything that would bump the jiffies in or before guest_exit_irqoff(). > As far as I understand, from my very very short time of looking at the timer code, jiffies are updated on every tick, which can be caused by a number of events, including *any* interrupt handler (coming from idle state), soft timers, timer interrupts, and possibly other things.
> >> > So my current idea is to increment jiffies according to the clocksource > >> > before calling guest_exit_irqoff, but this would require some main > >> > clocksource infrastructure changes. > >> > >> This seems similar to calling the function from the timer interrupt. > >> The timer interrupt would be delivered after that and only wasted time, > >> so it might actually be slower than just delivering it before ... > > > > That's assuming that the timer interrupt hits at every exit. I don't > > think that's the case, but I should measure it. > > There cannot be less vm exits and I think there are far more vm exits, > but if there was no interrupt, then jiffies shouldn't rise and we would > get the same result as with plain guest_exit_irqoff(). > That's true if you're guaranteed to take the timer interrupts that happen while running the guest before hitting guest_exit_irqoff(), so that you eventually count *some* time for the guest. In the arm64 case, if we just do guest_exit_irqoff(), we *never* account any time to the guest. > >> How expensive is the interrupt enable/disable cycle that this > >> optimization saves? > > > > I'll have to go back and measure this bit specifically again, but I > > recall it being a couple of hundred cycles. Not alarming, but > > worthwhile looking into. > > Yeah, sounds good. > > >> > My question is: how important is the vtime accounting on the host from > >> > your point of view? > >> > >> No idea. I'd keep the same behavior on all architectures, though. > >> > >> The precision of accounting is in jiffies (millions of cycles), so we > >> could maybe move it from the hot path to vcpu_load/put(?) without > >> affecting the count in usual cases ... > >> > > > > So since sending my original e-mail I found out that the vtime > > accounting logic was changed from ktime to jiffies, which is partly why > > we're having problems on arm.
See: > > > > ff9a9b4c4334b53b52ee9279f30bd5dd92ea9bdd > > sched, time: Switch VIRT_CPU_ACCOUNTING_GEN to jiffy granularity > > > > Moving to load/put depends on the semantics of this vtime thing. Is > > this counting cycles spent in the VM as opposed to in the host kernel > > and IRQ handling, and is that useful for system profiling or scheduling > > decisions, in which case moving to vcpu_load/put doesn't work... > > Right. > > > I assume there's a good reason why we call guest_enter() and > > guest_exit() in the hot path on every KVM architecture? > > I consider myself biased when it comes to jiffies, so no judgement. :) > > From what I see, the mode switch is used only for statistics. > The original series is > > 5e84cfde51cf303d368fcb48f22059f37b3872de~1..d172fcd3ae1ca7ac27ec8904242fd61e0e11d332 > > It didn't introduce the overhead with interrupt window and it didn't > count host kernel irq time as user time, so it was better at that time. Yes, but it was based on cputime_to... functions, which I understand use ktime, which on systems running KVM will most often read the clocksource directly from the hardware, and that was then optimized later to just use jiffies to avoid the clocksource read because jiffies is already in memory and adjusted to the granularity we need, so in some sense an improvement, only it doesn't work if you don't update jiffies when needed. Thanks, -Christoffer ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: vtime accounting 2017-03-14 18:39 ` Christoffer Dall @ 2017-03-14 20:27 ` Radim Krčmář 2017-03-14 21:53 ` Radim Krčmář 2017-03-15 8:43 ` Christoffer Dall 0 siblings, 2 replies; 26+ messages in thread From: Radim Krčmář @ 2017-03-14 20:27 UTC (permalink / raw) To: Christoffer Dall Cc: Christoffer Dall, Paolo Bonzini, kvm, Marc Zyngier, Rik van Riel 2017-03-14 19:39+0100, Christoffer Dall: > On Tue, Mar 14, 2017 at 05:58:59PM +0100, Radim Krčmář wrote: >> 2017-03-14 09:26+0100, Christoffer Dall: >> > On Mon, Mar 13, 2017 at 06:28:16PM +0100, Radim Krčmář wrote: >> >> 2017-03-08 02:57-0800, Christoffer Dall: >> >> > Hi Paolo, >> >> > >> >> > I'm looking at improving KVM/ARM a bit by calling guest_exit_irqoff >> >> > before enabling interrupts when coming back from the guest. >> >> > >> >> > Unfortunately, this appears to mess up my view of CPU usage using >> >> > something like htop on the host, because it appears all time is spent >> >> > inside the kernel. >> >> > >> >> > From my analysis, I think this is because we never handle any interrupts >> >> > before enabling interrupts, where the x86 code does its >> >> > handle_external_intr, and the result on ARM is that we never increment >> >> > jiffies before doing the vtime accounting. >> >> >> >> (Hm, the counting might be broken on nohz_full then.) >> >> >> > >> > Don't you still have a scheduler tick even with nohz_full and something >> > that will eventually update jiffies then? >> >> Probably, I don't understand jiffies accounting too well and didn't see >> anything that would bump the jiffies in or before guest_exit_irqoff(). > > As far as I understand, from my very very short time of looking at the > timer code, jiffies are updated on every tick, which can be cause by a > number of events, including *any* interrupt handler (coming from idle > state), soft timers, timer interrupts, and possibly other things. 
Yes, I was thinking that entering/exiting user mode should trigger it as well, in order to correctly account for time spent there, but couldn't find it ... The case I was wondering about is if the kernel spent e.g. 10 jiffies in guest mode and then exited on mmio -- no interrupt in the host, and guest_exit_irqoff() would flip accounting over to system time. Can those 10 jiffies get accounted to system (not guest) time? Accounting 1 jiffy wrong is normal as we expect that the distribution of guest/kernel time in the jiffy is going to be approximated over a longer sampling period, but if we could account multiple jiffies wrong, this expectation is very hard to defend. >> >> > So my current idea is to increment jiffies according to the clocksource >> >> > before calling guest_exit_irqoff, but this would require some main >> >> > clocksource infrastructure changes. >> >> >> >> This seems similar to calling the function from the timer interrupt. >> >> The timer interrupt would be delivered after that and only wasted time, >> >> so it might actually be slower than just delivering it before ... >> > >> > That's assuming that the timer interrupt hits at every exit. I don't >> > think that's the case, but I should measure it. >> >> There cannot be less vm exits and I think there are far more vm exits, >> but if there was no interrupt, then jiffies shouldn't rise and we would >> get the same result as with plain guest_exit_irqoff(). >> > > That's true if you're guaranteed to take the timer interrupts that > happen while running the guest before hitting guest_exit_irqoff(), so > that you eventually count *some* time for the guest. In the arm64 case, > if we just do guest_exit_irqoff(), we *never* account any time to the > guest. Yes. I assumed that if jiffies should be incremented, then you also get a timer tick, so checking whether jiffies should be incremented inside guest_exit_irqoff() was only slowing down the common case, where jiffies remained the same.
(Checking if jiffies should be incremented is quite slow by itself.) >> > I assume there's a good reason why we call guest_enter() and >> > guest_exit() in the hot path on every KVM architecture? >> >> I consider myself biased when it comes to jiffies, so no judgement. :) >> >> From what I see, the mode switch is used only for statistics. >> The original series is >> >> 5e84cfde51cf303d368fcb48f22059f37b3872de~1..d172fcd3ae1ca7ac27ec8904242fd61e0e11d332 >> >> It didn't introduce the overhead with interrupt window and it didn't >> count host kernel irq time as user time, so it was better at that time. > > Yes, but it was based on cputime_to... functions, which I understand use > ktime, which on systems running KVM will most often read the clocksource > directly from the hardware, and that was then optimized later to just > use jiffies to avoid the clocksource read because jiffies is already in > memory and adjusted to the granularity we need, so in some sense an > improvement, only it doesn't work if you don't update jiffies when > needed. True. The kvm_guest_enter/exit still didn't trigger accounting, but the granularity was better if there were other sources of accounting than just timer ticks. And I noticed another funny feature: the original intention was that if the guest uses hardware acceleration, then the whole timeslice gets accounted to guest/user time, because kvm_guest_exit() was not supposed to clear PF_VCPU: https://lkml.org/lkml/2007/10/15/105 This somehow got mangled when merging and later there was a fix that introduced the current behavior, which might be slightly more accurate: 0552f73b9a81 ("KVM: Move kvm_guest_exit() after local_irq_enable()") ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: vtime accounting 2017-03-14 20:27 ` Radim Krčmář @ 2017-03-14 21:53 ` Radim Krčmář 2017-03-15 8:43 ` Christoffer Dall 1 sibling, 0 replies; 26+ messages in thread From: Radim Krčmář @ 2017-03-14 21:53 UTC (permalink / raw) To: Christoffer Dall Cc: Christoffer Dall, Paolo Bonzini, kvm, Marc Zyngier, Rik van Riel 2017-03-14 21:27+0100, Radim Krčmář: > 2017-03-14 19:39+0100, Christoffer Dall: >> On Tue, Mar 14, 2017 at 05:58:59PM +0100, Radim Krčmář wrote: >>> 2017-03-14 09:26+0100, Christoffer Dall: >>> > On Mon, Mar 13, 2017 at 06:28:16PM +0100, Radim Krčmář wrote: >>> >> 2017-03-08 02:57-0800, Christoffer Dall: >>> >> > Hi Paolo, >>> >> > >>> >> > I'm looking at improving KVM/ARM a bit by calling guest_exit_irqoff >>> >> > before enabling interrupts when coming back from the guest. >>> >> > >>> >> > Unfortunately, this appears to mess up my view of CPU usage using >>> >> > something like htop on the host, because it appears all time is spent >>> >> > inside the kernel. >>> >> > >>> >> > From my analysis, I think this is because we never handle any interrupts >>> >> > before enabling interrupts, where the x86 code does its >>> >> > handle_external_intr, and the result on ARM is that we never increment >>> >> > jiffies before doing the vtime accounting. >>> >> >>> >> (Hm, the counting might be broken on nohz_full then.) >>> >> >>> > >>> > Don't you still have a scheduler tick even with nohz_full and something >>> > that will eventually update jiffies then? >>> >>> Probably, I don't understand jiffies accounting too well and didn't see >>> anything that would bump the jiffies in or before guest_exit_irqoff(). >> >> As far as I understand, from my very very short time of looking at the >> timer code, jiffies are updated on every tick, which can be cause by a >> number of events, including *any* interrupt handler (coming from idle >> state), soft timers, timer interrupts, and possibly other things. 
> > Yes, I was thinking that entering/exiting user mode should trigger it as > well, in order to correctly account for time spent there, but couldn't > find it ... > > The case I was wondering about is if the kernel spent e.g. 10 jiffies in > guest mode and then exited on mmio -- no interrupt in the host, and > guest_exit_irqoff() would flip accounting over to system time. > Can those 10 jiffies get accounted to system (not guest) time? > > Accounting 1 jiffy wrong is normal as we expect that the distribution of > guest/kernel time in the jiffy is going to be approximated over a longer > sampling period, but if we could account multiple jiffies wrong, this > expectation is very hard to defend. IIUC, another CPU will bump jiffies for running tickless VCPUs, so this works fine. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: vtime accounting 2017-03-14 20:27 ` Radim Krčmář 2017-03-14 21:53 ` Radim Krčmář @ 2017-03-15 8:43 ` Christoffer Dall 2017-03-15 15:57 ` Radim Krčmář 2017-03-24 15:04 ` Rik van Riel 1 sibling, 2 replies; 26+ messages in thread From: Christoffer Dall @ 2017-03-15 8:43 UTC (permalink / raw) To: Radim Krčmář Cc: Christoffer Dall, Paolo Bonzini, kvm, Marc Zyngier, Rik van Riel On Tue, Mar 14, 2017 at 09:27:22PM +0100, Radim Krčmář wrote: > 2017-03-14 19:39+0100, Christoffer Dall: > > On Tue, Mar 14, 2017 at 05:58:59PM +0100, Radim Krčmář wrote: > >> 2017-03-14 09:26+0100, Christoffer Dall: > >> > On Mon, Mar 13, 2017 at 06:28:16PM +0100, Radim Krčmář wrote: > >> >> 2017-03-08 02:57-0800, Christoffer Dall: > >> >> > Hi Paolo, > >> >> > > >> >> > I'm looking at improving KVM/ARM a bit by calling guest_exit_irqoff > >> >> > before enabling interrupts when coming back from the guest. > >> >> > > >> >> > Unfortunately, this appears to mess up my view of CPU usage using > >> >> > something like htop on the host, because it appears all time is spent > >> >> > inside the kernel. > >> >> > > >> >> > From my analysis, I think this is because we never handle any interrupts > >> >> > before enabling interrupts, where the x86 code does its > >> >> > handle_external_intr, and the result on ARM is that we never increment > >> >> > jiffies before doing the vtime accounting. > >> >> > >> >> (Hm, the counting might be broken on nohz_full then.) > >> >> > >> > > >> > Don't you still have a scheduler tick even with nohz_full and something > >> > that will eventually update jiffies then? > >> > >> Probably, I don't understand jiffies accounting too well and didn't see > >> anything that would bump the jiffies in or before guest_exit_irqoff(). 
> > As far as I understand, from my very very short time of looking at the > timer code, jiffies are updated on every tick, which can be caused by a > number of events, including *any* interrupt handler (coming from idle > state), soft timers, timer interrupts, and possibly other things. > > Yes, I was thinking that entering/exiting user mode should trigger it as > well, in order to correctly account for time spent there, but couldn't > find it ... Isn't it based on taking the timer interrupt when running in userspace as opposed to when running in the kernel? > > The case I was wondering about is if the kernel spent e.g. 10 jiffies in > guest mode and then exited on mmio -- no interrupt in the host, and > guest_exit_irqoff() would flip accounting over to system time. > Can those 10 jiffies get accounted to system (not guest) time? Yes, I think the key is whether you end up taking a timer interrupt before or after switching PF_VCPU. So you can spend X jiffies in the guest, come back, change PF_VCPU (timer still hasn't expired), and then the timer expires immediately afterwards, and the whole block of jiffies that are incremented as a result of the timer gets accounted as kernel time. (Note that jiffies on clocksourced architectures is an arbitrary number, which somehow scales according to the clocksource frequency, depending on the CONFIG_HZ, so a single timer interrupt can increment jiffies by more than 1.)
> >> >> > So my current idea is to increment jiffies according to the clocksource > >> >> > before calling guest_exit_irqoff, but this would require some main > >> >> > clocksource infrastructure changes. > >> >> > >> >> This seems similar to calling the function from the timer interrupt. > >> >> The timer interrupt would be delivered after that and only wasted time, > >> >> so it might actually be slower than just delivering it before ... > >> > > >> > That's assuming that the timer interrupt hits at every exit. I don't > >> > think that's the case, but I should measure it. > >> > >> There cannot be less vm exits and I think there are far more vm exits, > >> but if there was no interrupt, then jiffies shouldn't raise and we would > >> get the same result as with plain guest_exit_irqoff(). > >> > > > > That's true if you're guaranteed to take the timer interrupts that > > happen while running the guest before hitting guest_exit_irqoff(), so > > that you eventually count *some* time for the guest. In the arm64 case, > > if we just do guest_exit_irqoff(), we *never* account any time to the > > guest. > > Yes. I assumed that if jiffies should be incremented, then you also get > a timer tick, I think jiffies can be incremented in a number of cases, and as I understand it, when the kernel folks talk about "the tick" that doesn't necessarily mean that your scheduler timer fires an interrupt, but I could be wrong. > so checking whether jiffies should be incremented inside > guest_exit_irqoff() was only slowing down the common case, where jiffies > remained the same. My original suggestion was to not use jiffies at all, but use ktime, on architectures that don't do handle_external_intr() type things, which would result in a clocksource read on every exit, which on ARM I think is faster than our enable/disable interrupt dance we do at the moment. But of course, I'd have to measure that. > > (Checking if jiffies should be incremented is quite slow by itself.) > Really? 
I think that depends on the architecture. Reading the counter on ARM is supposed to be cheap, I think. > >> > I assume there's a good reason why we call guest_enter() and > >> > guest_exit() in the hot path on every KVM architecture? > >> > >> I consider myself biased when it comes to jiffies, so no judgement. :) > >> > >> From what I see, the mode switch is used only for statistics. > >> The original series is > >> > >> 5e84cfde51cf303d368fcb48f22059f37b3872de~1..d172fcd3ae1ca7ac27ec8904242fd61e0e11d332 > >> > >> It didn't introduce the overhead with interrupt window and it didn't > >> count host kernel irq time as user time, so it was better at that time. > > > > Yes, but it was based on cputime_to... functions, which I understand use > > ktime, which on systems running KVM will most often read the clocksource > > directly from the hardware, and that was then optimized later to just > > use jiffies to avoid the clocksource read because jiffies is already in > > memory and adjusted to the granularity we need, so in some sense an > > improvement, only it doesn't work if you don't update jiffies when > > needed. > > True. The kvm_guest_enter/exit still didn't trigger accounting, but the > granularity was better if there were other sources of accounting than > just timer ticks. I think the granularity was better, yes, but I wonder if that improved accuracy was ever visible to the user, given that the CPU time may be counted in jiffies? > > And I noticed another funny feature: the original intention was that if > the guest uses hardware acceleration, then the whole timeslice gets > accounted to guest/user time, because kvm_guest_exit() was not supposed > to clear PF_VCPU: https://lkml.org/lkml/2007/10/15/105 I don't see how this works on a current kernel though, because I can't find any path beyond the kvm_guest_exit() which clears the flag at the moment. 
> This somehow got mangled when merging and later there was a fix that > introduced the current behavior, which might be slightly more accurate: > 0552f73b9a81 ("KVM: Move kvm_guest_exit() after local_irq_enable()") So Laurent was confused as well? It seems he needed to add that before you guys added handle_external_intr()? Thanks, -Christoffer ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: vtime accounting 2017-03-15 8:43 ` Christoffer Dall @ 2017-03-15 15:57 ` Radim Krčmář 2017-03-15 16:48 ` Christoffer Dall 2017-03-24 15:04 ` Rik van Riel 1 sibling, 1 reply; 26+ messages in thread From: Radim Krčmář @ 2017-03-15 15:57 UTC (permalink / raw) To: Christoffer Dall Cc: Christoffer Dall, Paolo Bonzini, kvm, Marc Zyngier, Rik van Riel 2017-03-15 09:43+0100, Christoffer Dall: > On Tue, Mar 14, 2017 at 09:27:22PM +0100, Radim Krčmář wrote: >> 2017-03-14 19:39+0100, Christoffer Dall: >> > On Tue, Mar 14, 2017 at 05:58:59PM +0100, Radim Krčmář wrote: >> >> 2017-03-14 09:26+0100, Christoffer Dall: >> >> > On Mon, Mar 13, 2017 at 06:28:16PM +0100, Radim Krčmář wrote: >> >> >> 2017-03-08 02:57-0800, Christoffer Dall: >> >> >> > Hi Paolo, >> >> >> > >> >> >> > I'm looking at improving KVM/ARM a bit by calling guest_exit_irqoff >> >> >> > before enabling interrupts when coming back from the guest. >> >> >> > >> >> >> > Unfortunately, this appears to mess up my view of CPU usage using >> >> >> > something like htop on the host, because it appears all time is spent >> >> >> > inside the kernel. >> >> >> > >> >> >> > From my analysis, I think this is because we never handle any interrupts >> >> >> > before enabling interrupts, where the x86 code does its >> >> >> > handle_external_intr, and the result on ARM is that we never increment >> >> >> > jiffies before doing the vtime accounting. >> >> >> >> >> >> (Hm, the counting might be broken on nohz_full then.) >> >> >> >> >> > >> >> > Don't you still have a scheduler tick even with nohz_full and something >> >> > that will eventually update jiffies then? >> >> >> >> Probably, I don't understand jiffies accounting too well and didn't see >> >> anything that would bump the jiffies in or before guest_exit_irqoff(). 
>> > >> > As far as I understand, from my very very short time of looking at the >> > timer code, jiffies are updated on every tick, which can be caused by a >> > number of events, including *any* interrupt handler (coming from idle >> > state), soft timers, timer interrupts, and possibly other things. >> >> Yes, I was thinking that entering/exiting user mode should trigger it as >> well, in order to correctly account for time spent there, but couldn't >> find it ... > > Isn't it based on taking the timer interrupt when running in userspace > as opposed to when running in the kernel? The timer interrupt usually accounts it, but there is no timer interrupt when tickless, so there has to be accounting somewhere on the border ... __context_tracking_enter/exit(CONTEXT_USER) calls vtime_user_enter/exit() to do accounting in that case, but expects that jiffies were updated by someone else. >> The case I was wondering about is if the kernel spent e.g. 10 jiffies in >> guest mode and then exited on mmio -- no interrupt in the host, and >> guest_exit_irqoff() would flip accounting over to system time. >> Can those 10 jiffies get accounted to system (not guest) time? > > Yes, I think the key is whether you end up taking a timer interrupt > before or after switching PF_VCPU. So you can spend X jiffies in the > guest, come back, change PF_VCPU (timer still hasn't expired), and then > the timer expires immediately afterwards, and the whole block of jiffies > that are incremented as a result of the timer gets accounted as kernel > time. Wouldn't that be the result of a bug? (The timer period is 1 jiffy, so the VM would get kicked before getting to 2 jiffies in guest mode, and spending so much time in between seems like a bug. 10 jiffies could be spent in a guest while under nohz_full, but there is still at least one CPU with an HZ timer that updates the global jiffies, so I assume that guest_exit() should see at least 9 jiffies.)
> (Note that jiffies on clocksourced architectures is an arbitrary number, > which somehow scales according to the clocksource frequency, depending > on the CONFIG_HZ, so a single timer interrupt can increment jiffies by > more than 1. > >> Accounting 1 jiffy wrong is normal as we expect that the distribution of >> guest/kernel time in the jiffy is going to be approximated over a longer >> sampling period, but if we could account multiple jiffies wrong, this >> expectation is very hard to defend. >> > > I think it's a best effort, and we expect inaccuracies over much more > than a single jiffy, see above. > >> >> >> > So my current idea is to increment jiffies according to the clocksource >> >> >> > before calling guest_exit_irqoff, but this would require some main >> >> >> > clocksource infrastructure changes. >> >> >> >> >> >> This seems similar to calling the function from the timer interrupt. >> >> >> The timer interrupt would be delivered after that and only wasted time, >> >> >> so it might actually be slower than just delivering it before ... >> >> > >> >> > That's assuming that the timer interrupt hits at every exit. I don't >> >> > think that's the case, but I should measure it. >> >> >> >> There cannot be less vm exits and I think there are far more vm exits, >> >> but if there was no interrupt, then jiffies shouldn't raise and we would >> >> get the same result as with plain guest_exit_irqoff(). >> >> >> > >> > That's true if you're guaranteed to take the timer interrupts that >> > happen while running the guest before hitting guest_exit_irqoff(), so >> > that you eventually count *some* time for the guest. In the arm64 case, >> > if we just do guest_exit_irqoff(), we *never* account any time to the >> > guest. >> >> Yes. 
>> I assumed that if jiffies should be incremented, then you also get
>> a timer tick,
>
> I think jiffies can be incremented in a number of cases, and as I
> understand it, when the kernel folks talk about "the tick" that doesn't
> necessarily mean that your scheduler timer fires an interrupt, but I
> could be wrong.

Good point.  do_timer() (the only place to update jiffies, I hope) is
called from:

  tick_periodic()
  xtime_update()
  tick_do_update_jiffies64()

The first two seem to coincide with a specific timer interrupt.  The
last one is called from:

  tick_sched_do_timer()      [timer interrupt]
  tick_nohz_update_jiffies() [any interrupt]
  do_idle()                  [not interrupt, but also never hit after VM exit]

Looks like jiffies are mostly timer interrupt based, but if you have
nohz, then there are also other interrupts and switches from idle.

>> so checking whether jiffies should be incremented inside
>> guest_exit_irqoff() was only slowing down the common case, where jiffies
>> remained the same.
>
> My original suggestion was to not use jiffies at all, but use ktime, on
> architectures that don't do handle_external_intr() type things, which
> would result in a clocksource read on every exit, which on ARM I think
> is faster than our enable/disable interrupt dance we do at the moment.

I didn't catch that, sorry.

> But of course, I'd have to measure that.

The accounting would need to change to ktime even for userspace switches
to avoid double-accounting time, and I think that ktime was abolished
because it was too slow ... definitely a lot of benchmarking.

>> (Checking if jiffies should be incremented is quite slow by itself.)
>>
>
> Really?  I think that depends on the architecture.  Reading the counter
> on ARM is supposed to be cheap, I think.

True, and also depends on what slow means. :)

IIRC, it takes around 50 cycles to read TSC-based ktime on x86, which
would be roughly 2% of a current VM exit/entry loop.
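The ktime-based alternative discussed above would read a nanosecond clock at guest entry and exit and charge the delta directly, independent of ticks. A minimal sketch with hypothetical names (the real kernel API differs):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical ktime-style accounting: instead of relying on a tick to
 * observe PF_VCPU, read a nanosecond clock once per entry and once per
 * exit and charge the delta to guest time.  Exact and tick-independent,
 * at the cost of a clocksource read on every entry/exit. */
struct vtime { uint64_t guest_ns; uint64_t enter_ns; };

static void guest_enter_ns(struct vtime *vt, uint64_t now_ns)
{
    vt->enter_ns = now_ns;                   /* one clock read per entry */
}

static void guest_exit_ns(struct vtime *vt, uint64_t now_ns)
{
    vt->guest_ns += now_ns - vt->enter_ns;   /* one clock read per exit */
}
```

Whether the two extra clock reads per exit beat the enable/disable-interrupt dance is exactly the benchmarking question left open in the thread.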
>> >> > I assume there's a good reason why we call guest_enter() and
>> >> > guest_exit() in the hot path on every KVM architecture?
>> >>
>> >> I consider myself biased when it comes to jiffies, so no judgement. :)
>> >>
>> >> From what I see, the mode switch is used only for statistics.
>> >> The original series is
>> >>
>> >>   5e84cfde51cf303d368fcb48f22059f37b3872de~1..d172fcd3ae1ca7ac27ec8904242fd61e0e11d332
>> >>
>> >> It didn't introduce the overhead with the interrupt window and it didn't
>> >> count host kernel irq time as user time, so it was better at that time.
>> >
>> > Yes, but it was based on cputime_to... functions, which I understand use
>> > ktime, which on systems running KVM will most often read the clocksource
>> > directly from the hardware, and that was then optimized later to just
>> > use jiffies to avoid the clocksource read because jiffies is already in
>> > memory and adjusted to the granularity we need, so in some sense an
>> > improvement, only it doesn't work if you don't update jiffies when
>> > needed.
>>
>> True.  The kvm_guest_enter/exit still didn't trigger accounting, but the
>> granularity was better if there were other sources of accounting than
>> just timer ticks.
>
> I think the granularity was better, yes, but I wonder if that improved
> accuracy was ever visible to the user, given that the CPU time may be
> counted in jiffies?

I think the time for statistics used to be in cycles before it switched
to nanoseconds, so users could have seen some difference.

>> And I noticed another funny feature: the original intention was that if
>> the guest uses hardware acceleration, then the whole timeslice gets
>> accounted to guest/user time, because kvm_guest_exit() was not supposed
>> to clear PF_VCPU: https://lkml.org/lkml/2007/10/15/105
>
> I don't see how this works on a current kernel though, because I can't
> find any path beyond the kvm_guest_exit() which clears the flag at the
> moment.
Right, the flag used to be cleared in account_system_time().

>> This somehow got mangled when merging and later there was a fix that
>> introduced the current behavior, which might be slightly more accurate:
>> 0552f73b9a81 ("KVM: Move kvm_guest_exit() after local_irq_enable()")
>
> So Laurent was confused as well?

It looks like Laurent understood that you need the timer tick before
clearing the flag.  No idea how it really happened.

> It seems he needed to add that before you guys added
> handle_external_intr()?

Yes.

^ permalink raw reply [flat|nested] 26+ messages in thread
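The original design described here, where the tick accounting code rather than kvm_guest_exit() consumed and cleared PF_VCPU, can be sketched as follows (simplified userspace model; not the actual account_system_time()):

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of the pre-0552f73b9a81 intent: the tick-time accounting
 * routine checks PF_VCPU, charges the tick to guest time if it is set,
 * and clears the flag itself -- kvm_guest_exit() did not clear it.
 * That way a hardware-accelerated guest gets the whole timeslice
 * accounted as guest time even if the tick lands after the exit. */
struct task { bool pf_vcpu; unsigned long guest; unsigned long system; };

static void account_tick(struct task *t)
{
    if (t->pf_vcpu) {
        t->guest++;
        t->pf_vcpu = false;   /* flag consumed by the accounting code */
    } else {
        t->system++;
    }
}
```

With this scheme, the first tick after a guest exit still credits the guest; only subsequent ticks go to system time.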
* Re: vtime accounting
  2017-03-15 15:57 ` Radim Krčmář
@ 2017-03-15 16:48 ` Christoffer Dall
  2017-03-15 17:09 ` Radim Krčmář
  0 siblings, 1 reply; 26+ messages in thread
From: Christoffer Dall @ 2017-03-15 16:48 UTC (permalink / raw)
To: Radim Krčmář
Cc: Christoffer Dall, Paolo Bonzini, kvm, Marc Zyngier, Rik van Riel,
    stephen.boyd

Hi Radim,

[Stephen, I am cc'ing you because we are managing to confuse ourselves
here and some of my ramblings are basically misrepresented bits of
information I got from you].

On Wed, Mar 15, 2017 at 04:57:24PM +0100, Radim Krčmář wrote:
> 2017-03-15 09:43+0100, Christoffer Dall:
> > On Tue, Mar 14, 2017 at 09:27:22PM +0100, Radim Krčmář wrote:
> >> 2017-03-14 19:39+0100, Christoffer Dall:
> >> > On Tue, Mar 14, 2017 at 05:58:59PM +0100, Radim Krčmář wrote:
> >> >> 2017-03-14 09:26+0100, Christoffer Dall:
> >> >> > On Mon, Mar 13, 2017 at 06:28:16PM +0100, Radim Krčmář wrote:
> >> >> >> 2017-03-08 02:57-0800, Christoffer Dall:
> >> >> >> > Hi Paolo,
> >> >> >> >
> >> >> >> > I'm looking at improving KVM/ARM a bit by calling guest_exit_irqoff
> >> >> >> > before enabling interrupts when coming back from the guest.
> >> >> >> >
> >> >> >> > Unfortunately, this appears to mess up my view of CPU usage using
> >> >> >> > something like htop on the host, because it appears all time is spent
> >> >> >> > inside the kernel.
> >> >> >> >
> >> >> >> > From my analysis, I think this is because we never handle any interrupts
> >> >> >> > before enabling interrupts, where the x86 code does its
> >> >> >> > handle_external_intr, and the result on ARM is that we never increment
> >> >> >> > jiffies before doing the vtime accounting.
> >> >> >>
> >> >> >> (Hm, the counting might be broken on nohz_full then.)
> >> >> >>
> >> >> >
> >> >> > Don't you still have a scheduler tick even with nohz_full and something
> >> >> > that will eventually update jiffies then?
> >> >>
> >> >> Probably, I don't understand jiffies accounting too well and didn't see
> >> >> anything that would bump the jiffies in or before guest_exit_irqoff().
> >> >
> >> > As far as I understand, from my very very short time of looking at the
> >> > timer code, jiffies are updated on every tick, which can be caused by a
> >> > number of events, including *any* interrupt handler (coming from idle
> >> > state), soft timers, timer interrupts, and possibly other things.
> >>
> >> Yes, I was thinking that entering/exiting user mode should trigger it as
> >> well, in order to correctly account for time spent there, but couldn't
> >> find it ...
> >
> > Isn't it based on taking the timer interrupt when running in userspace
> > as opposed to when running in the kernel?
>
> The timer interrupt usually accounts it, but there is no timer interrupt
> when tickless, so there has to be accounting somewhere on the border ...

I think there are; only it's not a regular interval but only when a
timing is needed, such as the time slice expiring as decided by the
scheduler.

>
> __context_tracking_enter/exit(CONTEXT_USER) calls
> vtime_user_enter/exit() to do accounting in that case, but expects that
> jiffies were updated by someone else.
>

I'm assuming user/kernel accounting works today with tickless, so surely
this is handled somewhere.  There are a bunch of calling paths to
tick_do_update_jiffies64() in kernel/time/tick-sched.c related to nohz,
so my best guess is that it's updated at the sched_clock granularity and
that is sufficient for profiling, because you basically care about where
a particular time slice was spent, I suppose.

> >> The case I was wondering about is if the kernel spent e.g. 10 jiffies in
> >> guest mode and then exited on mmio -- no interrupt in the host, and
> >> guest_exit_irqoff() would flip accounting over to system time.
> >> Can those 10 jiffies get accounted to system (not guest) time?
> >
> > Yes, I think the key is whether you end up taking a timer interrupt
> > before or after switching PF_VCPU.  So you can spend X jiffies in the
> > guest, come back, change PF_VCPU (timer still hasn't expired), and then
> > the timer expires immediately afterwards, and the whole block of jiffies
> > that are incremented as a result of the timer gets accounted as kernel
> > time.
>
> Wouldn't that be the result of a bug?
>

I don't think anyone cares about this level of granularity really; the
case I mentioned above is possible, albeit unlikely.

> (The timer period is 1 jiffy, so the VM would get kicked before getting
> to 2 jiffies in guest mode and spending so much time in between seems
> like a bug.
> 10 jiffies could be spent in a guest while under nohz_full, but there is
> still at least one CPU with HZ timer that updates the global jiffies, so
> I assume that guest_exit() should see at least 9 jiffies.)
>
> > (Note that jiffies on clocksourced architectures is an arbitrary number,
> > which somehow scales according to the clocksource frequency, depending
> > on CONFIG_HZ, so a single timer interrupt can increment jiffies by
> > more than 1.)
> >
> >> Accounting 1 jiffy wrong is normal as we expect that the distribution of
> >> guest/kernel time in the jiffy is going to be approximated over a longer
> >> sampling period, but if we could account multiple jiffies wrong, this
> >> expectation is very hard to defend.
> >>
> >
> > I think it's a best effort, and we expect inaccuracies over much more
> > than a single jiffy, see above.
> >
> >> >> >> > So my current idea is to increment jiffies according to the clocksource
> >> >> >> > before calling guest_exit_irqoff, but this would require some main
> >> >> >> > clocksource infrastructure changes.
> >> >> >>
> >> >> >> This seems similar to calling the function from the timer interrupt.
> >> >> >> The timer interrupt would be delivered after that and only wasted time,
> >> >> >> so it might actually be slower than just delivering it before ...
> >> >> >
> >> >> > That's assuming that the timer interrupt hits at every exit.  I don't
> >> >> > think that's the case, but I should measure it.
> >> >>
> >> >> There cannot be fewer VM exits and I think there are far more VM exits,
> >> >> but if there was no interrupt, then jiffies shouldn't rise and we would
> >> >> get the same result as with plain guest_exit_irqoff().
> >> >>
> >> >
> >> > That's true if you're guaranteed to take the timer interrupts that
> >> > happen while running the guest before hitting guest_exit_irqoff(), so
> >> > that you eventually count *some* time for the guest.  In the arm64 case,
> >> > if we just do guest_exit_irqoff(), we *never* account any time to the
> >> > guest.
> >>
> >> Yes.  I assumed that if jiffies should be incremented, then you also get
> >> a timer tick,
> >
> > I think jiffies can be incremented in a number of cases, and as I
> > understand it, when the kernel folks talk about "the tick" that doesn't
> > necessarily mean that your scheduler timer fires an interrupt, but I
> > could be wrong.
>
> Good point.  do_timer() (the only place to update jiffies, I hope) is

As I understand it, the whole jiffies thing is circular, so the update
can happen in other paths than do_timer.

> called from:
>
>   tick_periodic()
>   xtime_update()
>   tick_do_update_jiffies64()
>
> The first two seem to coincide with a specific timer interrupt.  The
> last one is called from:
>
>   tick_sched_do_timer()      [timer interrupt]
>   tick_nohz_update_jiffies() [any interrupt]
>   do_idle()                  [not interrupt, but also never hit after VM exit]
>
> Looks like jiffies are mostly timer interrupt based, but if you have
> nohz, then there are also other interrupts and switches from idle.
>

Yes, exactly; clearly, none of us are timekeeping experts...
> >> so checking whether jiffies should be incremented inside
> >> guest_exit_irqoff() was only slowing down the common case, where jiffies
> >> remained the same.
> >
> > My original suggestion was to not use jiffies at all, but use ktime, on
> > architectures that don't do handle_external_intr() type things, which
> > would result in a clocksource read on every exit, which on ARM I think
> > is faster than our enable/disable interrupt dance we do at the moment.
>
> I didn't catch that, sorry.
>
> > But of course, I'd have to measure that.
>
> The accounting would need to change to ktime even for userspace switches

Good point.

> to avoid double-accounting time and I think that ktime was abolished
> because it was too slow ... definitely a lot of benchmarking.
>
> >> (Checking if jiffies should be incremented is quite slow by itself.)
> >>
> >
> > Really?  I think that depends on the architecture.  Reading the counter
> > on ARM is supposed to be cheap, I think.
>
> True, and also depends on what slow means. :)
>
> IIRC, it takes around 50 cycles to read TSC-based ktime on x86, which
> would be roughly 2% of a current VM exit/entry loop.
>
> >> >> > I assume there's a good reason why we call guest_enter() and
> >> >> > guest_exit() in the hot path on every KVM architecture?
> >> >>
> >> >> I consider myself biased when it comes to jiffies, so no judgement. :)
> >> >>
> >> >> From what I see, the mode switch is used only for statistics.
> >> >> The original series is
> >> >>
> >> >>   5e84cfde51cf303d368fcb48f22059f37b3872de~1..d172fcd3ae1ca7ac27ec8904242fd61e0e11d332
> >> >>
> >> >> It didn't introduce the overhead with the interrupt window and it didn't
> >> >> count host kernel irq time as user time, so it was better at that time.
> >> >
> >> > Yes, but it was based on cputime_to...
> >> > functions, which I understand use
> >> > ktime, which on systems running KVM will most often read the clocksource
> >> > directly from the hardware, and that was then optimized later to just
> >> > use jiffies to avoid the clocksource read because jiffies is already in
> >> > memory and adjusted to the granularity we need, so in some sense an
> >> > improvement, only it doesn't work if you don't update jiffies when
> >> > needed.
> >>
> >> True.  The kvm_guest_enter/exit still didn't trigger accounting, but the
> >> granularity was better if there were other sources of accounting than
> >> just timer ticks.
> >
> > I think the granularity was better, yes, but I wonder if that improved
> > accuracy was ever visible to the user, given that the CPU time may be
> > counted in jiffies?
>
> I think the time for statistics used to be in cycles before it switched
> to nanoseconds, so users could have seen some difference.
>
> >> And I noticed another funny feature: the original intention was that if
> >> the guest uses hardware acceleration, then the whole timeslice gets
> >> accounted to guest/user time, because kvm_guest_exit() was not supposed
> >> to clear PF_VCPU: https://lkml.org/lkml/2007/10/15/105
> >
> > I don't see how this works on a current kernel though, because I can't
> > find any path beyond the kvm_guest_exit() which clears the flag at the
> > moment.
>
> Right, the flag used to be cleared in account_system_time().
>
> >> This somehow got mangled when merging and later there was a fix that
> >> introduced the current behavior, which might be slightly more accurate:
> >> 0552f73b9a81 ("KVM: Move kvm_guest_exit() after local_irq_enable()")
> >
> > So Laurent was confused as well?
>
> It looks like Laurent understood that you need the timer tick before
> clearing the flag.  No idea how it really happened.
>
> > It seems he needed to add that before you guys added
> > handle_external_intr()?
>
> Yes.
In any case, it would be good for us to get some answers to the stuff
above, but I'll also try to prototype the load/put accounting of time.

Thanks,
-Christoffer

^ permalink raw reply [flat|nested] 26+ messages in thread
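The load/put prototype mentioned here would presumably charge the whole vcpu_load()..vcpu_put() window as guest time, with only two clock reads per scheduling section instead of two per VM exit. A hedged sketch with invented names, not KVM's actual API:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical load/put-granularity accounting: the vcpu is "loaded"
 * when the thread gets the CPU and "put" when it gives it up, so the
 * exit/entry hot path touches no accounting state at all.  Time spent
 * handling exits in the kernel is (deliberately) folded into guest
 * time, trading accuracy for hot-path cost. */
struct vcpu_acct {
    uint64_t loaded_at_ns;
    uint64_t guest_ns;
};

static void vcpu_load_acct(struct vcpu_acct *a, uint64_t now_ns)
{
    a->loaded_at_ns = now_ns;                  /* clock read on load */
}

static void vcpu_put_acct(struct vcpu_acct *a, uint64_t now_ns)
{
    a->guest_ns += now_ns - a->loaded_at_ns;   /* clock read on put */
}
```

The trade-off is the one debated throughout the thread: coarser attribution (in-kernel exit handling counts as guest time) in exchange for removing accounting work from the exit path entirely.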
* Re: vtime accounting
  2017-03-15 16:48 ` Christoffer Dall
@ 2017-03-15 17:09 ` Radim Krčmář
  0 siblings, 0 replies; 26+ messages in thread
From: Radim Krčmář @ 2017-03-15 17:09 UTC (permalink / raw)
To: Christoffer Dall
Cc: Christoffer Dall, Paolo Bonzini, kvm, Marc Zyngier, Rik van Riel,
    stephen.boyd

2017-03-15 17:48+0100, Christoffer Dall:
> In any case, it would be good for us to get some answers to the stuff
> above, but I'll also try to prototype the load/put accounting of time.

I like this plan, thanks.

^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: vtime accounting
  2017-03-15  8:43 ` Christoffer Dall
  2017-03-15 15:57 ` Radim Krčmář
@ 2017-03-24 15:04 ` Rik van Riel
  2017-03-27 12:29 ` Wanpeng Li
  1 sibling, 1 reply; 26+ messages in thread
From: Rik van Riel @ 2017-03-24 15:04 UTC (permalink / raw)
To: Christoffer Dall, Radim Krčmář
Cc: Christoffer Dall, Paolo Bonzini, kvm, Marc Zyngier, fweisbec

On Wed, 2017-03-15 at 09:43 +0100, Christoffer Dall wrote:

> Yes, I think the key is whether you end up taking a timer interrupt
> before or after switching PF_VCPU.  So you can spend X jiffies in the
> guest, come back, change PF_VCPU (timer still hasn't expired), and then
> the timer expires immediately afterwards, and the whole block of jiffies
> that are incremented as a result of the timer gets accounted as kernel
> time.
>
> (Note that jiffies on clocksourced architectures is an arbitrary number,
> which somehow scales according to the clocksource frequency, depending
> on CONFIG_HZ, so a single timer interrupt can increment jiffies by
> more than 1.)

That really should not happen with nohz_full.

The housekeeping CPU should get a timer interrupt every jiffy, unless
something changed recently that I am not aware of.

Frederic, am I totally off base?

^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: vtime accounting
  2017-03-24 15:04 ` Rik van Riel
@ 2017-03-27 12:29 ` Wanpeng Li
  0 siblings, 0 replies; 26+ messages in thread
From: Wanpeng Li @ 2017-03-27 12:29 UTC (permalink / raw)
To: Rik van Riel
Cc: Christoffer Dall, Radim Krčmář, Christoffer Dall,
    Paolo Bonzini, kvm, Marc Zyngier, fweisbec

2017-03-24 23:04 GMT+08:00 Rik van Riel <riel@redhat.com>:
> On Wed, 2017-03-15 at 09:43 +0100, Christoffer Dall wrote:
>
>> Yes, I think the key is whether you end up taking a timer interrupt
>> before or after switching PF_VCPU.  So you can spend X jiffies in the
>> guest, come back, change PF_VCPU (timer still hasn't expired), and then
>> the timer expires immediately afterwards, and the whole block of jiffies
>> that are incremented as a result of the timer gets accounted as kernel
>> time.
>>
>> (Note that jiffies on clocksourced architectures is an arbitrary number,
>> which somehow scales according to the clocksource frequency, depending
>> on CONFIG_HZ, so a single timer interrupt can increment jiffies by
>> more than 1.)
>
> That really should not happen with nohz_full.
>
> The housekeeping CPU should get a timer interrupt every jiffy, unless
> something changed recently that I am not aware of.

Agreed, the timer interrupt of the housekeeping CPU will not stop
w/ nohz_full.

  __tick_nohz_idle_enter()
    -> can_stop_idle_tick()
         -> if (tick_nohz_full_enabled())
                if (tick_do_timer_cpu == cpu)
                    return false;

Regards,
Wanpeng Li

^ permalink raw reply [flat|nested] 26+ messages in thread
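The check quoted above can be modeled as a small predicate. This is an illustrative userspace reduction, not the kernel's can_stop_idle_tick():

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the nohz_full behavior discussed above: the CPU that
 * owns the do_timer duty (the housekeeping CPU) refuses to stop its
 * tick, so global jiffies keep being updated HZ times a second even
 * when every other CPU is fully tickless. */
static bool can_stop_tick(bool nohz_full_enabled, int tick_do_timer_cpu, int cpu)
{
    if (nohz_full_enabled && tick_do_timer_cpu == cpu)
        return false;   /* housekeeping CPU must keep ticking */
    return true;
}
```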
* Re: vtime accounting
  2017-03-14  8:26 ` Christoffer Dall
  2017-03-14  8:55 ` Marc Zyngier
  2017-03-14 16:58 ` Radim Krčmář
@ 2017-03-24 14:55 ` Rik van Riel
  2 siblings, 0 replies; 26+ messages in thread
From: Rik van Riel @ 2017-03-24 14:55 UTC (permalink / raw)
To: Christoffer Dall, Radim Krčmář
Cc: Christoffer Dall, Paolo Bonzini, kvm, Marc Zyngier

On Tue, 2017-03-14 at 09:26 +0100, Christoffer Dall wrote:
> On Mon, Mar 13, 2017 at 06:28:16PM +0100, Radim Krčmář wrote:
> > 2017-03-08 02:57-0800, Christoffer Dall:
> > > Hi Paolo,
> > >
> > > I'm looking at improving KVM/ARM a bit by calling guest_exit_irqoff
> > > before enabling interrupts when coming back from the guest.
> > >
> > > Unfortunately, this appears to mess up my view of CPU usage using
> > > something like htop on the host, because it appears all time is spent
> > > inside the kernel.
> > >
> > > From my analysis, I think this is because we never handle any interrupts
> > > before enabling interrupts, where the x86 code does its
> > > handle_external_intr, and the result on ARM is that we never increment
> > > jiffies before doing the vtime accounting.
> >
> > (Hm, the counting might be broken on nohz_full then.)
> >
>
> Don't you still have a scheduler tick even with nohz_full and something
> that will eventually update jiffies then?

With nohz_full, the housekeeping CPU will get a timer interrupt every
single jiffy, and the jiffies variable will get updated HZ times a
second.

^ permalink raw reply [flat|nested] 26+ messages in thread
end of thread, other threads:[~2017-03-27 12:29 UTC | newest]

Thread overview: 26+ messages
2017-03-08 10:57 vtime accounting Christoffer Dall
2017-03-09  8:16 ` Paolo Bonzini
2017-03-13 17:28 ` Radim Krčmář
2017-03-14  8:26 ` Christoffer Dall
2017-03-14  8:55 ` Marc Zyngier
2017-03-14 11:12 ` Christoffer Dall
2017-03-14 11:46 ` Marc Zyngier
2017-03-14 16:58 ` Radim Krčmář
2017-03-14 17:09 ` Paolo Bonzini
2017-03-14 18:41 ` Christoffer Dall
2017-03-14 19:32 ` Radim Krčmář
2017-03-14 20:01 ` Christoffer Dall
2017-03-14 21:52 ` Radim Krčmář
2017-03-15  8:09 ` Paolo Bonzini
2017-03-15  8:05 ` Paolo Bonzini
2017-03-15  8:30 ` Christoffer Dall
2017-03-14 18:39 ` Christoffer Dall
2017-03-14 20:27 ` Radim Krčmář
2017-03-14 21:53 ` Radim Krčmář
2017-03-15  8:43 ` Christoffer Dall
2017-03-15 15:57 ` Radim Krčmář
2017-03-15 16:48 ` Christoffer Dall
2017-03-15 17:09 ` Radim Krčmář
2017-03-24 15:04 ` Rik van Riel
2017-03-27 12:29 ` Wanpeng Li
2017-03-24 14:55 ` Rik van Riel