kvm.vger.kernel.org archive mirror
* cputime takes cstate into consideration
@ 2019-06-26  9:43 Wanpeng Li
  2019-06-26 10:13 ` Peter Zijlstra
  2019-06-26 10:33 ` Thomas Gleixner
  0 siblings, 2 replies; 22+ messages in thread
From: Wanpeng Li @ 2019-06-26  9:43 UTC (permalink / raw)
  To: Peter Zijlstra, Thomas Gleixner
  Cc: Paolo Bonzini, Radim Krcmar, Marcelo Tosatti, KarimAllah, LKML, kvm

Hi all,

After exposing mwait/monitor to a kvm guest, the guest can make the
physical cpu enter a deeper cstate through the mwait instruction. However,
the top command on the host still observes 100% cpu utilization, since the
qemu process is running even while a guest that has the power management
capability executes mwait. Actually, we can observe with powertop on the
host that the physical cpu has already entered a deeper cstate. Could we
take cstate into consideration when accounting cputime etc?

Regards,
Wanpeng Li

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: cputime takes cstate into consideration
  2019-06-26  9:43 cputime takes cstate into consideration Wanpeng Li
@ 2019-06-26 10:13 ` Peter Zijlstra
  2019-06-26 10:33 ` Thomas Gleixner
  1 sibling, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2019-06-26 10:13 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Thomas Gleixner, Paolo Bonzini, Radim Krcmar, Marcelo Tosatti,
	KarimAllah, LKML, kvm

On Wed, Jun 26, 2019 at 05:43:55PM +0800, Wanpeng Li wrote:
> Hi all,
> 
> After exposing mwait/monitor to a kvm guest, the guest can make the
> physical cpu enter a deeper cstate through the mwait instruction. However,
> the top command on the host still observes 100% cpu utilization, since the
> qemu process is running even while a guest that has the power management
> capability executes mwait. Actually, we can observe with powertop on the
> host that the physical cpu has already entered a deeper cstate. Could we
> take cstate into consideration when accounting cputime etc?

Either we account runtime on the CPU itself, in which case it will not
be in a C state due to actually running an interrupt that does
accounting, or we do it remote (NOHZ_FULL case) and there is no way to
know what C state, if any, that CPU is in.




* Re: cputime takes cstate into consideration
  2019-06-26  9:43 cputime takes cstate into consideration Wanpeng Li
  2019-06-26 10:13 ` Peter Zijlstra
@ 2019-06-26 10:33 ` Thomas Gleixner
  2019-06-26 14:54   ` Konrad Rzeszutek Wilk
  1 sibling, 1 reply; 22+ messages in thread
From: Thomas Gleixner @ 2019-06-26 10:33 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Peter Zijlstra, Paolo Bonzini, Radim Krcmar, Marcelo Tosatti,
	KarimAllah, LKML, kvm

On Wed, 26 Jun 2019, Wanpeng Li wrote:
> After exposing mwait/monitor to a kvm guest, the guest can make the
> physical cpu enter a deeper cstate through the mwait instruction. However,
> the top command on the host still observes 100% cpu utilization, since the
> qemu process is running even while a guest that has the power management
> capability executes mwait. Actually, we can observe with powertop on the
> host that the physical cpu has already entered a deeper cstate. Could we
> take cstate into consideration when accounting cputime etc?

If MWAIT can be used inside the guest then the host cannot distinguish
between execution and stuck in mwait.

It'd need to poll the power monitoring MSRs on every occasion where the
accounting happens.

This completely falls apart when you have zero exit guest. (think
NOHZ_FULL). Then you'd have to bring the guest out with an IPI to access
the per CPU MSRs.

I assume a lot of people will be happy about all that :)

Thanks,

	tglx



* Re: cputime takes cstate into consideration
  2019-06-26 10:33 ` Thomas Gleixner
@ 2019-06-26 14:54   ` Konrad Rzeszutek Wilk
  2019-06-26 16:16     ` Peter Zijlstra
  2019-06-26 18:58     ` Raslan, KarimAllah
  0 siblings, 2 replies; 22+ messages in thread
From: Konrad Rzeszutek Wilk @ 2019-06-26 14:54 UTC (permalink / raw)
  To: Thomas Gleixner, Boris Ostrovsky, Ankur Arora, Joao Martins
  Cc: Wanpeng Li, Peter Zijlstra, Paolo Bonzini, Radim Krcmar,
	Marcelo Tosatti, KarimAllah, LKML, kvm

On Wed, Jun 26, 2019 at 12:33:30PM +0200, Thomas Gleixner wrote:
> On Wed, 26 Jun 2019, Wanpeng Li wrote:
> > After exposing mwait/monitor to a kvm guest, the guest can make the
> > physical cpu enter a deeper cstate through the mwait instruction. However,
> > the top command on the host still observes 100% cpu utilization, since the
> > qemu process is running even while a guest that has the power management
> > capability executes mwait. Actually, we can observe with powertop on the
> > host that the physical cpu has already entered a deeper cstate. Could we
> > take cstate into consideration when accounting cputime etc?
> 
> If MWAIT can be used inside the guest then the host cannot distinguish
> between execution and stuck in mwait.
> 
> It'd need to poll the power monitoring MSRs on every occasion where the
> accounting happens.
> 
> This completely falls apart when you have zero exit guest. (think
> NOHZ_FULL). Then you'd have to bring the guest out with an IPI to access
> the per CPU MSRs.
> 
> I assume a lot of people will be happy about all that :)

There were some ideas that Ankur (CC-ed) mentioned to me of using the perf
counters (in the host) to sample the guest and construct a better
accounting idea of what the guest does. That way the dashboard
from the host would not show 100% CPU utilization.

But the patches that Marcelo posted ("cpuidle-haltpoll driver") in
"solve" the problem for Linux. That is, the guest wants awesome latency, and
one way was to expose MWAIT to the guest, or just tweak the guest to do the
idling a bit differently.

Marcelo's patches are all good for Linux, but Windows is still an issue.

Ankur, would you be OK sharing some of your ideas?
> 
> Thanks,
> 
> 	tglx
> 


* Re: cputime takes cstate into consideration
  2019-06-26 14:54   ` Konrad Rzeszutek Wilk
@ 2019-06-26 16:16     ` Peter Zijlstra
  2019-06-26 18:30       ` Konrad Rzeszutek Wilk
  2019-06-26 18:58     ` Raslan, KarimAllah
  1 sibling, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2019-06-26 16:16 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Thomas Gleixner, Boris Ostrovsky, Ankur Arora, Joao Martins,
	Wanpeng Li, Paolo Bonzini, Radim Krcmar, Marcelo Tosatti,
	KarimAllah, LKML, kvm

On Wed, Jun 26, 2019 at 10:54:13AM -0400, Konrad Rzeszutek Wilk wrote:
> On Wed, Jun 26, 2019 at 12:33:30PM +0200, Thomas Gleixner wrote:
> > On Wed, 26 Jun 2019, Wanpeng Li wrote:
> > > After exposing mwait/monitor to a kvm guest, the guest can make the
> > > physical cpu enter a deeper cstate through the mwait instruction. However,
> > > the top command on the host still observes 100% cpu utilization, since the
> > > qemu process is running even while a guest that has the power management
> > > capability executes mwait. Actually, we can observe with powertop on the
> > > host that the physical cpu has already entered a deeper cstate. Could we
> > > take cstate into consideration when accounting cputime etc?
> > 
> > If MWAIT can be used inside the guest then the host cannot distinguish
> > between execution and stuck in mwait.
> > 
> > It'd need to poll the power monitoring MSRs on every occasion where the
> > accounting happens.
> > 
> > This completely falls apart when you have zero exit guest. (think
> > NOHZ_FULL). Then you'd have to bring the guest out with an IPI to access
> > the per CPU MSRs.
> > 
> > I assume a lot of people will be happy about all that :)
> 
> There were some ideas that Ankur (CC-ed) mentioned to me of using the perf
> counters (in the host) to sample the guest and construct a better
> accounting idea of what the guest does. That way the dashboard
> from the host would not show 100% CPU utilization.

But then you generate extra noise and vmexits on those cpus, just to get
this accounting sorted, which sounds like a bad trade.


* Re: cputime takes cstate into consideration
  2019-06-26 16:16     ` Peter Zijlstra
@ 2019-06-26 18:30       ` Konrad Rzeszutek Wilk
  2019-06-26 18:41         ` Thomas Gleixner
  0 siblings, 1 reply; 22+ messages in thread
From: Konrad Rzeszutek Wilk @ 2019-06-26 18:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Boris Ostrovsky, Ankur Arora, Joao Martins,
	Wanpeng Li, Paolo Bonzini, Radim Krcmar, Marcelo Tosatti,
	KarimAllah, LKML, kvm

On Wed, Jun 26, 2019 at 06:16:08PM +0200, Peter Zijlstra wrote:
> On Wed, Jun 26, 2019 at 10:54:13AM -0400, Konrad Rzeszutek Wilk wrote:
> > On Wed, Jun 26, 2019 at 12:33:30PM +0200, Thomas Gleixner wrote:
> > > On Wed, 26 Jun 2019, Wanpeng Li wrote:
> > > > After exposing mwait/monitor to a kvm guest, the guest can make the
> > > > physical cpu enter a deeper cstate through the mwait instruction. However,
> > > > the top command on the host still observes 100% cpu utilization, since the
> > > > qemu process is running even while a guest that has the power management
> > > > capability executes mwait. Actually, we can observe with powertop on the
> > > > host that the physical cpu has already entered a deeper cstate. Could we
> > > > take cstate into consideration when accounting cputime etc?
> > > 
> > > If MWAIT can be used inside the guest then the host cannot distinguish
> > > between execution and stuck in mwait.
> > > 
> > > It'd need to poll the power monitoring MSRs on every occasion where the
> > > accounting happens.
> > > 
> > > This completely falls apart when you have zero exit guest. (think
> > > NOHZ_FULL). Then you'd have to bring the guest out with an IPI to access
> > > the per CPU MSRs.
> > > 
> > > I assume a lot of people will be happy about all that :)
> > 
> > There were some ideas that Ankur (CC-ed) mentioned to me of using the perf
> > counters (in the host) to sample the guest and construct a better
> > accounting idea of what the guest does. That way the dashboard
> > from the host would not show 100% CPU utilization.
> 
> But then you generate extra noise and vmexits on those cpus, just to get
> this accounting sorted, which sounds like a bad trade.

Considering that the CPUs aren't doing anything, if you send the IPIs
"only" 100 times a second, the cost would be small while giving you a big
benefit in properly accounting the guests.
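
To put a rough number on that, here is a back-of-the-envelope sketch in
Python. The per-exit cost is a made-up placeholder, not a measurement, and
exit_overhead_fraction is a hypothetical helper; it only shows how the
sampling rate and the cost of one forced exit combine into a share of CPU
time.

```python
def exit_overhead_fraction(exits_per_sec: int, cost_ns_per_exit: int) -> float:
    """Fraction of each second spent handling forced exits.

    Both arguments are assumptions supplied by the caller; the real
    cost of an IPI-induced vmexit depends on hardware and workload.
    """
    if exits_per_sec < 0 or cost_ns_per_exit < 0:
        raise ValueError("rate and cost must be non-negative")
    return exits_per_sec * cost_ns_per_exit / 1_000_000_000


# Placeholder numbers: 100 IPIs per second, 5000 ns assumed per exit.
fraction = exit_overhead_fraction(100, 5000)
print(f"{fraction:.4%} of one CPU")
```

This captures only the throughput cost; it says nothing about the latency
disturbance each exit causes in the guest.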

But perhaps there are other ways too to "snoop" if a guest is sitting on
an MWAIT?



* Re: cputime takes cstate into consideration
  2019-06-26 18:30       ` Konrad Rzeszutek Wilk
@ 2019-06-26 18:41         ` Thomas Gleixner
  2019-06-26 18:55           ` Raslan, KarimAllah
  0 siblings, 1 reply; 22+ messages in thread
From: Thomas Gleixner @ 2019-06-26 18:41 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Peter Zijlstra, Boris Ostrovsky, Ankur Arora, Joao Martins,
	Wanpeng Li, Paolo Bonzini, Radim Krcmar, Marcelo Tosatti,
	KarimAllah, LKML, kvm

On Wed, 26 Jun 2019, Konrad Rzeszutek Wilk wrote:
> On Wed, Jun 26, 2019 at 06:16:08PM +0200, Peter Zijlstra wrote:
> > On Wed, Jun 26, 2019 at 10:54:13AM -0400, Konrad Rzeszutek Wilk wrote:
> > > There were some ideas that Ankur (CC-ed) mentioned to me of using the perf
> > > counters (in the host) to sample the guest and construct a better
> > > accounting idea of what the guest does. That way the dashboard
> > > from the host would not show 100% CPU utilization.
> > 
> > But then you generate extra noise and vmexits on those cpus, just to get
> > this accounting sorted, which sounds like a bad trade.
> 
> Considering that the CPUs aren't doing anything and if you do say the 
> IPIs "only" 100/second - that would be so small but give you a big benefit
> in properly accounting the guests.

The host doesn't know what the guest CPUs are doing. And if you have a full
zero exit setup and the guest is computing stuff or doing that network
offloading thing then they will notice the 100/s vmexits and complain.

> But perhaps there are other ways too to "snoop" if a guest is sitting on
> an MWAIT?

No idea.

Thanks,

	tglx




* Re: cputime takes cstate into consideration
  2019-06-26 18:41         ` Thomas Gleixner
@ 2019-06-26 18:55           ` Raslan, KarimAllah
  2019-06-26 19:19             ` Thomas Gleixner
  2019-06-26 19:21             ` Peter Zijlstra
  0 siblings, 2 replies; 22+ messages in thread
From: Raslan, KarimAllah @ 2019-06-26 18:55 UTC (permalink / raw)
  To: tglx, konrad.wilk
  Cc: boris.ostrovsky, joao.m.martins, peterz, kvm, kernellwp,
	linux-kernel, mtosatti, pbonzini, ankur.a.arora, rkrcmar

On Wed, 2019-06-26 at 20:41 +0200, Thomas Gleixner wrote:
> On Wed, 26 Jun 2019, Konrad Rzeszutek Wilk wrote:
> > 
> > On Wed, Jun 26, 2019 at 06:16:08PM +0200, Peter Zijlstra wrote:
> > > 
> > > On Wed, Jun 26, 2019 at 10:54:13AM -0400, Konrad Rzeszutek Wilk wrote:
> > > > 
> > > > There were some ideas that Ankur (CC-ed) mentioned to me of using the perf
> > > > counters (in the host) to sample the guest and construct a better
> > > > accounting idea of what the guest does. That way the dashboard
> > > > from the host would not show 100% CPU utilization.
> > > 
> > > But then you generate extra noise and vmexits on those cpus, just to get
> > > this accounting sorted, which sounds like a bad trade.
> > 
> > Considering that the CPUs aren't doing anything and if you do say the 
> > IPIs "only" 100/second - that would be so small but give you a big benefit
> > in properly accounting the guests.
> 
> The host doesn't know what the guest CPUs are doing. And if you have a full
> zero exit setup and the guest is computing stuff or doing that network
> offloading thing then they will notice the 100/s vmexits and complain.

If the host is completely in nohz_full mode and the pCPU is dedicated to a
single vCPU/task (and the guest is 100% CPU bound and never exits), you would
still be ticking in the host once every second for housekeeping, right?
Wouldn't updating the mwait time once a second be enough here?

> 
> > 
> > But perhaps there are other ways too to "snoop" if a guest is sitting on
> > an MWAIT?
> 
> No idea.
> 
> Thanks,
> 
> 	tglx
> 
> 



Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Ralf Herbrich
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879




* Re: cputime takes cstate into consideration
  2019-06-26 14:54   ` Konrad Rzeszutek Wilk
  2019-06-26 16:16     ` Peter Zijlstra
@ 2019-06-26 18:58     ` Raslan, KarimAllah
  2019-06-26 19:23       ` Thomas Gleixner
  1 sibling, 1 reply; 22+ messages in thread
From: Raslan, KarimAllah @ 2019-06-26 18:58 UTC (permalink / raw)
  To: tglx, boris.ostrovsky, joao.m.martins, konrad.wilk, ankur.a.arora
  Cc: kvm, linux-kernel, peterz, rkrcmar, pbonzini, kernellwp, mtosatti

On Wed, 2019-06-26 at 10:54 -0400, Konrad Rzeszutek Wilk wrote:
> On Wed, Jun 26, 2019 at 12:33:30PM +0200, Thomas Gleixner wrote:
> > 
> > On Wed, 26 Jun 2019, Wanpeng Li wrote:
> > > 
> > > After exposing mwait/monitor to a kvm guest, the guest can make the
> > > physical cpu enter a deeper cstate through the mwait instruction. However,
> > > the top command on the host still observes 100% cpu utilization, since the
> > > qemu process is running even while a guest that has the power management
> > > capability executes mwait. Actually, we can observe with powertop on the
> > > host that the physical cpu has already entered a deeper cstate. Could we
> > > take cstate into consideration when accounting cputime etc?
> > 
> > If MWAIT can be used inside the guest then the host cannot distinguish
> > between execution and stuck in mwait.
> > 
> > It'd need to poll the power monitoring MSRs on every occasion where the
> > accounting happens.
> > 
> > This completely falls apart when you have zero exit guest. (think
> > NOHZ_FULL). Then you'd have to bring the guest out with an IPI to access
> > the per CPU MSRs.
> > 
> > I assume a lot of people will be happy about all that :)
> 
> There were some ideas that Ankur (CC-ed) mentioned to me of using the perf
> counters (in the host) to sample the guest and construct a better
> accounting idea of what the guest does. That way the dashboard
> from the host would not show 100% CPU utilization.

You can either use the UNHALTED cycles perf-counter or you can use MPERF/APERF 
MSRs for that. (sorry I got distracted and forgot to send the patch)
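
A rough sketch of the MPERF variant, in Python for brevity. It assumes
IA32_MPERF advances at the TSC rate only while the logical CPU is in C0, so
the ratio of the two deltas approximates non-idle time. The function name and
the sample values are hypothetical, and actually reading the MSRs on the
right CPU is left out.

```python
MASK64 = (1 << 64) - 1


def c0_fraction(mperf_start: int, mperf_end: int,
                tsc_start: int, tsc_end: int) -> float:
    """Approximate share of the interval the CPU spent in C0 (not idle)."""
    # Counters are 64-bit; masking keeps the delta right across a wrap.
    mperf_delta = (mperf_end - mperf_start) & MASK64
    tsc_delta = (tsc_end - tsc_start) & MASK64
    if tsc_delta == 0:
        raise ValueError("empty sampling interval")
    # Clamp: the two reads are not atomic, so the ratio can overshoot 1.
    return min(mperf_delta / tsc_delta, 1.0)


# Hypothetical samples: 250 MPERF ticks over 1000 TSC ticks.
print(c0_fraction(1_000, 1_250, 50_000, 51_000))
```

A host-side sampler would subtract this idle share from the guest time it
already accounts to the vCPU thread.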

> 
> But the patches that Marcelo posted ("cpuidle-haltpoll driver") in
> "solve" the problem for Linux. That is, the guest wants awesome latency, and
> one way was to expose MWAIT to the guest, or just tweak the guest to do the
> idling a bit differently.
> 
> Marcelo's patches are all good for Linux, but Windows is still an issue.
> 
> Ankur, would you be OK sharing some of your ideas?
> > 
> > 
> > Thanks,
> > 
> > 	tglx
> > 




* Re: cputime takes cstate into consideration
  2019-06-26 18:55           ` Raslan, KarimAllah
@ 2019-06-26 19:19             ` Thomas Gleixner
  2019-06-26 19:21             ` Peter Zijlstra
  1 sibling, 0 replies; 22+ messages in thread
From: Thomas Gleixner @ 2019-06-26 19:19 UTC (permalink / raw)
  To: Raslan, KarimAllah
  Cc: konrad.wilk, boris.ostrovsky, joao.m.martins, peterz, kvm,
	kernellwp, linux-kernel, mtosatti, pbonzini, ankur.a.arora,
	rkrcmar

On Wed, 26 Jun 2019, Raslan, KarimAllah wrote:
> On Wed, 2019-06-26 at 20:41 +0200, Thomas Gleixner wrote:
> > The host doesn't know what the guest CPUs are doing. And if you have a full
> > zero exit setup and the guest is computing stuff or doing that network
> > offloading thing then they will notice the 100/s vmexits and complain.
> 
> If the host is completely in nohz_full mode and the pCPU is dedicated to a
> single vCPU/task (and the guest is 100% CPU bound and never exits), you would 
> still be ticking in the host once every second for housekeeping, right? Would 
> not updating the mwait-time once a second be enough here?

It may be that it 'still' does that, but the goal is to fix that by doing
remote accounting. I think Frederic is pretty close to that.

Then your 'let's do accounting' on the housekeeping tick falls apart.

And even with that tick every second, the nohz full people take every
shortcut to go back into the guest ASAP. Doing a dozen MSR reads will
surely not find many enthusiastic supporters.

Thanks,

	tglx



* Re: cputime takes cstate into consideration
  2019-06-26 18:55           ` Raslan, KarimAllah
  2019-06-26 19:19             ` Thomas Gleixner
@ 2019-06-26 19:21             ` Peter Zijlstra
  2019-06-26 19:27               ` Raslan, KarimAllah
  2019-06-26 19:29               ` Thomas Gleixner
  1 sibling, 2 replies; 22+ messages in thread
From: Peter Zijlstra @ 2019-06-26 19:21 UTC (permalink / raw)
  To: Raslan, KarimAllah
  Cc: tglx, konrad.wilk, boris.ostrovsky, joao.m.martins, kvm,
	kernellwp, linux-kernel, mtosatti, pbonzini, ankur.a.arora,
	rkrcmar

On Wed, Jun 26, 2019 at 06:55:36PM +0000, Raslan, KarimAllah wrote:

> If the host is completely in nohz_full mode and the pCPU is dedicated to a
> single vCPU/task (and the guest is 100% CPU bound and never exits), you would 
> still be ticking in the host once every second for housekeeping, right? Would 
> not updating the mwait-time once a second be enough here?

People are trying very hard to get rid of that remnant tick. Let's not
add dependencies to it.

IMO this is a really stupid issue, 100% time is correct if the guest
does idle in pinned vcpu mode.


* Re: cputime takes cstate into consideration
  2019-06-26 18:58     ` Raslan, KarimAllah
@ 2019-06-26 19:23       ` Thomas Gleixner
  2019-07-09  2:00         ` Ankur Arora
  0 siblings, 1 reply; 22+ messages in thread
From: Thomas Gleixner @ 2019-06-26 19:23 UTC (permalink / raw)
  To: Raslan, KarimAllah
  Cc: boris.ostrovsky, joao.m.martins, konrad.wilk, ankur.a.arora, kvm,
	linux-kernel, peterz, rkrcmar, pbonzini, kernellwp, mtosatti

On Wed, 26 Jun 2019, Raslan, KarimAllah wrote:
> On Wed, 2019-06-26 at 10:54 -0400, Konrad Rzeszutek Wilk wrote:
> > There were some ideas that Ankur (CC-ed) mentioned to me of using the perf
> > counters (in the host) to sample the guest and construct a better
> > accounting idea of what the guest does. That way the dashboard
> > from the host would not show 100% CPU utilization.
> 
> You can either use the UNHALTED cycles perf-counter or you can use MPERF/APERF 
> MSRs for that. (sorry I got distracted and forgot to send the patch)

Sure, but then you conflict with the other people who fight tooth and nail
over every single performance counter.

Thanks,

	tglx


* Re: cputime takes cstate into consideration
  2019-06-26 19:21             ` Peter Zijlstra
@ 2019-06-26 19:27               ` Raslan, KarimAllah
  2019-06-26 19:32                 ` Thomas Gleixner
                                   ` (2 more replies)
  2019-06-26 19:29               ` Thomas Gleixner
  1 sibling, 3 replies; 22+ messages in thread
From: Raslan, KarimAllah @ 2019-06-26 19:27 UTC (permalink / raw)
  To: peterz
  Cc: boris.ostrovsky, kvm, kernellwp, joao.m.martins, linux-kernel,
	tglx, konrad.wilk, mtosatti, pbonzini, ankur.a.arora, rkrcmar

On Wed, 2019-06-26 at 21:21 +0200, Peter Zijlstra wrote:
> On Wed, Jun 26, 2019 at 06:55:36PM +0000, Raslan, KarimAllah wrote:
> 
> > 
> > If the host is completely in nohz_full mode and the pCPU is dedicated to a
> > single vCPU/task (and the guest is 100% CPU bound and never exits), you would 
> > still be ticking in the host once every second for housekeeping, right? Would 
> > not updating the mwait-time once a second be enough here?
> 
> People are trying very hard to get rid of that remnant tick. Lets not
> add dependencies to it.
> 
> IMO this is a really stupid issue, 100% time is correct if the guest
> does idle in pinned vcpu mode.

One use case for proper accounting (obviously for a slightly relaxed definition
of *proper*) is *external* monitoring of CPU utilization for scaling groups
(i.e. more VMs will be launched when you reach a certain CPU utilization).
These external monitoring tools need to account CPU utilization properly.




* Re: cputime takes cstate into consideration
  2019-06-26 19:21             ` Peter Zijlstra
  2019-06-26 19:27               ` Raslan, KarimAllah
@ 2019-06-26 19:29               ` Thomas Gleixner
  1 sibling, 0 replies; 22+ messages in thread
From: Thomas Gleixner @ 2019-06-26 19:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Raslan, KarimAllah, konrad.wilk, boris.ostrovsky, joao.m.martins,
	kvm, kernellwp, linux-kernel, mtosatti, pbonzini, ankur.a.arora,
	rkrcmar

On Wed, 26 Jun 2019, Peter Zijlstra wrote:

> On Wed, Jun 26, 2019 at 06:55:36PM +0000, Raslan, KarimAllah wrote:
> 
> > If the host is completely in nohz_full mode and the pCPU is dedicated to a
> > single vCPU/task (and the guest is 100% CPU bound and never exits), you would 
> > still be ticking in the host once every second for housekeeping, right? Would 
> > not updating the mwait-time once a second be enough here?
> 
> People are trying very hard to get rid of that remnant tick. Lets not
> add dependencies to it.
> 
> IMO this is a really stupid issue, 100% time is correct if the guest
> does idle in pinned vcpu mode.

Correct. We are going to see the same issue with UMWAIT/UMONITOR. If the
timeout is set long enough by the admin, then a task can stay in user mode
UMWAIT for a very long time. And we're going to account that as user time.

That's not any different with a guest.

You might go there and establish a shared page with the guest where the
guest drops his internal accounting information. For trusted guests that
might be a good approximation. For untrusted ones not so much, but then you
just have to say, you occupy the CPU 100% in guest mode. If you idle there,
none of my problems.
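
A minimal sketch of how a host-side tool might combine such guest-reported
numbers with its own accounting. Everything here is hypothetical: the record
layout (a sequence counter followed by cumulative idle nanoseconds), the
names, and the clamping policy are illustrative assumptions, not an existing
KVM interface. A real seqcount reader would also re-read the counter after
the payload; that step is omitted.

```python
import struct

# Hypothetical layout: u64 sequence counter, u64 cumulative guest idle ns.
RECORD = struct.Struct("<QQ")


def read_guest_idle_ns(page: bytes):
    """Return guest-reported idle ns, or None if an update is in flight."""
    seq, idle_ns = RECORD.unpack_from(page, 0)
    # Seqcount-style convention assumed here: odd means the guest is
    # in the middle of writing the record.
    if seq & 1:
        return None
    return idle_ns


def busy_ns(host_guest_ns: int, guest_idle_ns: int) -> int:
    """Guest time as seen by the host minus what the guest says it idled."""
    # Never trust the guest beyond what the host itself accounted.
    return host_guest_ns - min(guest_idle_ns, host_guest_ns)


page = RECORD.pack(2, 600_000_000)  # stable record, 0.6 s idle
idle = read_guest_idle_ns(page)
print(busy_ns(1_000_000_000, idle))
```

For an untrusted guest the reported value is only a hint, which is why the
sketch clamps it to the host's own figure.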

Thanks,

	tglx


* Re: cputime takes cstate into consideration
  2019-06-26 19:27               ` Raslan, KarimAllah
@ 2019-06-26 19:32                 ` Thomas Gleixner
  2019-06-26 20:01                 ` Peter Zijlstra
  2019-12-10  0:44                 ` Wanpeng Li
  2 siblings, 0 replies; 22+ messages in thread
From: Thomas Gleixner @ 2019-06-26 19:32 UTC (permalink / raw)
  To: Raslan, KarimAllah
  Cc: peterz, boris.ostrovsky, kvm, kernellwp, joao.m.martins,
	linux-kernel, konrad.wilk, mtosatti, pbonzini, ankur.a.arora,
	rkrcmar

On Wed, 26 Jun 2019, Raslan, KarimAllah wrote:
> On Wed, 2019-06-26 at 21:21 +0200, Peter Zijlstra wrote:
> > On Wed, Jun 26, 2019 at 06:55:36PM +0000, Raslan, KarimAllah wrote:
> > 
> > > 
> > > If the host is completely in nohz_full mode and the pCPU is dedicated to a
> > > single vCPU/task (and the guest is 100% CPU bound and never exits), you would 
> > > still be ticking in the host once every second for housekeeping, right? Would 
> > > not updating the mwait-time once a second be enough here?
> > 
> > People are trying very hard to get rid of that remnant tick. Lets not
> > add dependencies to it.
> > 
> > IMO this is a really stupid issue, 100% time is correct if the guest
> > does idle in pinned vcpu mode.
> 
> One use case for proper accounting (obviously for a slightly relaxed definition 
> or *proper*) is *external* monitoring of CPU utilization for scaling group
> (i.e. more VMs will be launched when you reach a certain CPU utilization).
> These external monitoring tools needs to account CPU utilization properly.

Then you need a trusted cooperative guest and that can give you the
information. If it doesn't, then either do not give him MWAIT or the scheme
does not work.

If you can afford to waste performance counters for that, you can do that
from user space.

There are lots of options, but the kernel won't choose one because it's
guaranteed to be the wrong choice for most scenarios.

Thanks,

	tglx


* Re: cputime takes cstate into consideration
  2019-06-26 19:27               ` Raslan, KarimAllah
  2019-06-26 19:32                 ` Thomas Gleixner
@ 2019-06-26 20:01                 ` Peter Zijlstra
  2019-06-26 20:09                   ` Thomas Gleixner
  2019-12-10  0:44                 ` Wanpeng Li
  2 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2019-06-26 20:01 UTC (permalink / raw)
  To: Raslan, KarimAllah
  Cc: boris.ostrovsky, kvm, kernellwp, joao.m.martins, linux-kernel,
	tglx, konrad.wilk, mtosatti, pbonzini, ankur.a.arora, rkrcmar

On Wed, Jun 26, 2019 at 07:27:35PM +0000, Raslan, KarimAllah wrote:
> On Wed, 2019-06-26 at 21:21 +0200, Peter Zijlstra wrote:
> > On Wed, Jun 26, 2019 at 06:55:36PM +0000, Raslan, KarimAllah wrote:
> > 
> > > 
> > > If the host is completely in nohz_full mode and the pCPU is dedicated to a
> > > single vCPU/task (and the guest is 100% CPU bound and never exits), you would 
> > > still be ticking in the host once every second for housekeeping, right? Would 
> > > not updating the mwait-time once a second be enough here?
> > 
> > People are trying very hard to get rid of that remnant tick. Lets not
> > add dependencies to it.
> > 
> > IMO this is a really stupid issue, 100% time is correct if the guest
> > does idle in pinned vcpu mode.
> 
> One use case for proper accounting (obviously for a slightly relaxed definition 
> or *proper*) is *external* monitoring of CPU utilization for scaling group
> (i.e. more VMs will be launched when you reach a certain CPU utilization).
> These external monitoring tools needs to account CPU utilization properly.

That's utter nonsense; what's the point of exposing mwait to guests if
you're not doing vcpu pinning? For overloaded guests mwait makes no
sense whatsoever.


* Re: cputime takes cstate into consideration
  2019-06-26 20:01                 ` Peter Zijlstra
@ 2019-06-26 20:09                   ` Thomas Gleixner
  0 siblings, 0 replies; 22+ messages in thread
From: Thomas Gleixner @ 2019-06-26 20:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Raslan, KarimAllah, boris.ostrovsky, kvm, kernellwp,
	joao.m.martins, linux-kernel, konrad.wilk, mtosatti, pbonzini,
	ankur.a.arora, rkrcmar

On Wed, 26 Jun 2019, Peter Zijlstra wrote:
> On Wed, Jun 26, 2019 at 07:27:35PM +0000, Raslan, KarimAllah wrote:
> > On Wed, 2019-06-26 at 21:21 +0200, Peter Zijlstra wrote:
> > > On Wed, Jun 26, 2019 at 06:55:36PM +0000, Raslan, KarimAllah wrote:
> > > 
> > > > 
> > > > If the host is completely in nohz_full mode and the pCPU is dedicated to a
> > > > single vCPU/task (and the guest is 100% CPU bound and never exits), you would 
> > > > still be ticking in the host once every second for housekeeping, right? Would 
> > > > not updating the mwait-time once a second be enough here?
> > > 
> > > People are trying very hard to get rid of that remnant tick. Lets not
> > > add dependencies to it.
> > > 
> > > IMO this is a really stupid issue, 100% time is correct if the guest
> > > does idle in pinned vcpu mode.
> > 
> > One use case for proper accounting (obviously for a slightly relaxed definition 
> > or *proper*) is *external* monitoring of CPU utilization for scaling group
> > (i.e. more VMs will be launched when you reach a certain CPU utilization).
> > These external monitoring tools needs to account CPU utilization properly.
> 
> That's utter nonsense; what's the point of exposing mwait to guests if
> you're not doing vcpu pinning. For overloaded guests mwait makes no
> sense what so ever.

I think you misunderstood. The guests are pinned. What they can do today is
monitor the guests' utilization time through mwait/vmexit. If that goes over
a certain threshold they can automatically launch more VMs to spread the
load.

With MWAIT in the guest this is gone...

Thanks,

	tglx


* Re: cputime takes cstate into consideration
  2019-06-26 19:23       ` Thomas Gleixner
@ 2019-07-09  2:00         ` Ankur Arora
  2019-07-09  2:06           ` Wanpeng Li
  2019-07-09 12:38           ` Peter Zijlstra
  0 siblings, 2 replies; 22+ messages in thread
From: Ankur Arora @ 2019-07-09  2:00 UTC (permalink / raw)
  To: Thomas Gleixner, Raslan, KarimAllah
  Cc: boris.ostrovsky, joao.m.martins, konrad.wilk, kvm, linux-kernel,
	peterz, rkrcmar, pbonzini, kernellwp, mtosatti

On 2019-06-26 12:23 p.m., Thomas Gleixner wrote:
> On Wed, 26 Jun 2019, Raslan, KarimAllah wrote:
>> On Wed, 2019-06-26 at 10:54 -0400, Konrad Rzeszutek Wilk wrote:
>>> There were some ideas that Ankur (CC-ed) mentioned to me of using the perf
>>> counters (in the host) to sample the guest and construct a better
>>> accounting idea of what the guest does. That way the dashboard
>>> from the host would not show 100% CPU utilization.
>>
>> You can either use the UNHALTED cycles perf-counter or you can use MPERF/APERF
>> MSRs for that. (sorry I got distracted and forgot to send the patch)
> 
> Sure, but then you conflict with the other people who fight tooth and nail
> over every single performance counter.
How about using Intel PT PwrEvt extensions? This should allow us to
precisely track idle residency via just MWAIT and TSC packets. Should
be pretty cheap too. It's post Cascade Lake though.

Ankur
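
For comparison with the MPERF/APERF suggestion quoted above, a busy fraction
can be estimated from counter deltas. The sketch below is an illustration,
not a patch from this thread: it assumes IA32_MPERF advances at the TSC rate
while the CPU is in C0 and stops otherwise, that the TSC is invariant, and
that the caller has already sampled both counters on the same CPU (reading
the MSRs is out of scope here). The function names are hypothetical.

```python
MASK64 = (1 << 64) - 1  # both counters are treated as 64-bit values


def delta64(start, end):
    """Difference between two 64-bit counter samples, tolerating wraparound."""
    return (end - start) & MASK64


def busy_fraction(mperf_start, mperf_end, tsc_start, tsc_end):
    """Estimate the fraction of the interval spent in C0.

    Assumes MPERF ticks at the TSC rate in C0 only, so the ratio of the
    two deltas approximates non-idle residency over the sample window.
    """
    tsc_delta = delta64(tsc_start, tsc_end)
    if tsc_delta == 0:
        raise ValueError("TSC did not advance between samples")
    return delta64(mperf_start, mperf_end) / tsc_delta


# Made-up sample values: MPERF advanced a quarter as far as the TSC.
print(busy_fraction(1000, 1250, 5000, 6000))
```

This needs no performance counter, but the values are per physical CPU, so
it only maps onto a guest when the vCPU is pinned.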

> 
> Thanks,
> 
> 	tglx
> 


* Re: cputime takes cstate into consideration
  2019-07-09  2:00         ` Ankur Arora
@ 2019-07-09  2:06           ` Wanpeng Li
  2019-07-09 12:38           ` Peter Zijlstra
  1 sibling, 0 replies; 22+ messages in thread
From: Wanpeng Li @ 2019-07-09  2:06 UTC (permalink / raw)
  To: Ankur Arora
  Cc: Thomas Gleixner, Raslan, KarimAllah, boris.ostrovsky,
	joao.m.martins, konrad.wilk, kvm, linux-kernel, peterz, rkrcmar,
	pbonzini, mtosatti, Frederic Weisbecker

also Cc Frederic,
On Tue, 9 Jul 2019 at 10:00, Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
> On 2019-06-26 12:23 p.m., Thomas Gleixner wrote:
> > On Wed, 26 Jun 2019, Raslan, KarimAllah wrote:
> >> On Wed, 2019-06-26 at 10:54 -0400, Konrad Rzeszutek Wilk wrote:
> >>> There were some ideas that Ankur (CC-ed) mentioned to me of using the perf
> >>> counters (in the host) to sample the guest and construct a better
> >>> accounting idea of what the guest does. That way the dashboard
> >>> from the host would not show 100% CPU utilization.
> >>
> >> You can either use the UNHALTED cycles perf-counter or you can use MPERF/APERF
> >> MSRs for that. (sorry I got distracted and forgot to send the patch)
> >
> > Sure, but then you conflict with the other people who fight tooth and nail
> > over every single performance counter.
> How about using Intel PT PwrEvt extensions? This should allow us to
> precisely track idle residency via just MWAIT and TSC packets. Should
> be pretty cheap too. It's post Cascade Lake though.
>
> Ankur
>
> >
> > Thanks,
> >
> >       tglx
> >
>

* Re: cputime takes cstate into consideration
  2019-07-09  2:00         ` Ankur Arora
  2019-07-09  2:06           ` Wanpeng Li
@ 2019-07-09 12:38           ` Peter Zijlstra
  2019-07-09 18:27             ` Ankur Arora
  1 sibling, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2019-07-09 12:38 UTC (permalink / raw)
  To: Ankur Arora
  Cc: Thomas Gleixner, Raslan, KarimAllah, boris.ostrovsky,
	joao.m.martins, konrad.wilk, kvm, linux-kernel, rkrcmar,
	pbonzini, kernellwp, mtosatti

On Mon, Jul 08, 2019 at 07:00:08PM -0700, Ankur Arora wrote:
> On 2019-06-26 12:23 p.m., Thomas Gleixner wrote:
> > On Wed, 26 Jun 2019, Raslan, KarimAllah wrote:
> > > On Wed, 2019-06-26 at 10:54 -0400, Konrad Rzeszutek Wilk wrote:
> > > > There were some ideas that Ankur (CC-ed) mentioned to me of using the perf
> > > > counters (in the host) to sample the guest and construct a better
> > > > accounting idea of what the guest does. That way the dashboard
> > > > from the host would not show 100% CPU utilization.
> > > 
> > > You can either use the UNHALTED cycles perf-counter or you can use MPERF/APERF
> > > MSRs for that. (sorry I got distracted and forgot to send the patch)
> > 
> > Sure, but then you conflict with the other people who fight tooth and nail
> > over every single performance counter.
> How about using Intel PT PwrEvt extensions? This should allow us to
> precisely track idle residency via just MWAIT and TSC packets. Should
> be pretty cheap too. It's post Cascade Lake though.

That would fully claim PT just for this stupid accounting thing and be
completely Intel specific.

Just stop this madness already.

* Re: cputime takes cstate into consideration
  2019-07-09 12:38           ` Peter Zijlstra
@ 2019-07-09 18:27             ` Ankur Arora
  0 siblings, 0 replies; 22+ messages in thread
From: Ankur Arora @ 2019-07-09 18:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Raslan, KarimAllah, boris.ostrovsky,
	joao.m.martins, konrad.wilk, kvm, linux-kernel, rkrcmar,
	pbonzini, kernellwp, mtosatti

On 7/9/19 5:38 AM, Peter Zijlstra wrote:
> On Mon, Jul 08, 2019 at 07:00:08PM -0700, Ankur Arora wrote:
>> On 2019-06-26 12:23 p.m., Thomas Gleixner wrote:
>>> On Wed, 26 Jun 2019, Raslan, KarimAllah wrote:
>>>> On Wed, 2019-06-26 at 10:54 -0400, Konrad Rzeszutek Wilk wrote:
>>>>> There were some ideas that Ankur (CC-ed) mentioned to me of using the perf
>>>>> counters (in the host) to sample the guest and construct a better
>>>>> accounting idea of what the guest does. That way the dashboard
>>>>> from the host would not show 100% CPU utilization.
>>>>
>>>> You can either use the UNHALTED cycles perf-counter or you can use MPERF/APERF
>>>> MSRs for that. (sorry I got distracted and forgot to send the patch)
>>>
>>> Sure, but then you conflict with the other people who fight tooth and nail
>>> over every single performance counter.
>> How about using Intel PT PwrEvt extensions? This should allow us to
>> precisely track idle residency via just MWAIT and TSC packets. Should
>> be pretty cheap too. It's post Cascade Lake though.
> 
> That would fully claim PT just for this stupid accounting thing and be
> completely Intel specific.
> 
> Just stop this madness already.
I see the point about just accruing guest time (in mwait or not) as
guest CPU time.
But, to take this madness a little further, I'm not sure I see why it
fully claims PT. AFAICS, we should be able to enable PwrEvt and whatever
else simultaneously.

Ankur

> 

* Re: cputime takes cstate into consideration
  2019-06-26 19:27               ` Raslan, KarimAllah
  2019-06-26 19:32                 ` Thomas Gleixner
  2019-06-26 20:01                 ` Peter Zijlstra
@ 2019-12-10  0:44                 ` Wanpeng Li
  2 siblings, 0 replies; 22+ messages in thread
From: Wanpeng Li @ 2019-12-10  0:44 UTC (permalink / raw)
  To: Raslan, KarimAllah
  Cc: peterz, boris.ostrovsky, kvm, joao.m.martins, linux-kernel, tglx,
	konrad.wilk, mtosatti, pbonzini, ankur.a.arora,
	Frederic Weisbecker

On Thu, 27 Jun 2019 at 03:27, Raslan, KarimAllah <karahmed@amazon.de> wrote:
>
> On Wed, 2019-06-26 at 21:21 +0200, Peter Zijlstra wrote:
> > On Wed, Jun 26, 2019 at 06:55:36PM +0000, Raslan, KarimAllah wrote:
> >
> > >
> > > If the host is completely in nohz_full mode and the pCPU is dedicated to a
> > > single vCPU/task (and the guest is 100% CPU bound and never exits), you would
> > > still be ticking in the host once every second for housekeeping, right? Would
> > > not updating the mwait-time once a second be enough here?
> >
> > People are trying very hard to get rid of that remnant tick. Let's not
> > add dependencies to it.
> >
> > IMO this is a really stupid issue, 100% time is correct if the guest
> > does idle in pinned vcpu mode.
>
> One use case for proper accounting (obviously for a slightly relaxed definition
> of *proper*) is *external* monitoring of CPU utilization for scaling group
> (i.e. more VMs will be launched when you reach a certain CPU utilization).
> These external monitoring tools need to account CPU utilization properly.

Besides cputime accounting, the other Gordian knot is that the QEMU main
loop, libvirt, kthreads, etc. can't be offloaded to other hardware such as
a SmartNIC, so they will contend with the vCPUs even while MWAIT/HLT
instructions are executing in the guest. The VMCS has an HLT activity
state, which indicates that the logical processor is inactive because it
executed the HLT instruction. However, SDM 24.4.2 notes that although
execution of the MWAIT instruction may put a logical processor into an
inactive state, this VMCS field never reflects that state.

    Wanpeng
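
The activity-state values referred to above can be written down as a small
Python sketch. The numeric values follow the guest activity state encoding
in the SDM (0 active, 1 HLT, 2 shutdown, 3 wait-for-SIPI); the class and
function names are hypothetical. The point is that a host-side check on
this field can recognise a vCPU halted via HLT, but has no value to test
for a vCPU idling in MWAIT.

```python
from enum import IntEnum


class GuestActivityState(IntEnum):
    """Values of the VMCS guest activity state field, per the SDM."""
    ACTIVE = 0
    HLT = 1
    SHUTDOWN = 2
    WAIT_FOR_SIPI = 3


def vmcs_reports_halted(raw_state):
    """True only when the field says the vCPU executed HLT.

    There is no MWAIT member to compare against: the field never
    reflects an inactive state entered through MWAIT.
    """
    return GuestActivityState(raw_state) is GuestActivityState.HLT


print(vmcs_reports_halted(1))
print(vmcs_reports_halted(0))
```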

Thread overview: 22+ messages
2019-06-26  9:43 cputime takes cstate into consideration Wanpeng Li
2019-06-26 10:13 ` Peter Zijlstra
2019-06-26 10:33 ` Thomas Gleixner
2019-06-26 14:54   ` Konrad Rzeszutek Wilk
2019-06-26 16:16     ` Peter Zijlstra
2019-06-26 18:30       ` Konrad Rzeszutek Wilk
2019-06-26 18:41         ` Thomas Gleixner
2019-06-26 18:55           ` Raslan, KarimAllah
2019-06-26 19:19             ` Thomas Gleixner
2019-06-26 19:21             ` Peter Zijlstra
2019-06-26 19:27               ` Raslan, KarimAllah
2019-06-26 19:32                 ` Thomas Gleixner
2019-06-26 20:01                 ` Peter Zijlstra
2019-06-26 20:09                   ` Thomas Gleixner
2019-12-10  0:44                 ` Wanpeng Li
2019-06-26 19:29               ` Thomas Gleixner
2019-06-26 18:58     ` Raslan, KarimAllah
2019-06-26 19:23       ` Thomas Gleixner
2019-07-09  2:00         ` Ankur Arora
2019-07-09  2:06           ` Wanpeng Li
2019-07-09 12:38           ` Peter Zijlstra
2019-07-09 18:27             ` Ankur Arora
