linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Stolen and degraded time and schedulers
@ 2007-03-13 16:31 Jeremy Fitzhardinge
  2007-03-13 20:12 ` john stultz
  2007-03-14 21:36 ` Con Kolivas
  0 siblings, 2 replies; 51+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-13 16:31 UTC (permalink / raw)
  To: Andi Kleen, Ingo Molnar, Thomas Gleixner, Con Kolivas,
	Rusty Russell, Zachary Amsden, James Morris, john stultz,
	Chris Wright
  Cc: Linux Kernel Mailing List, cpufreq, Virtualization Mailing List

The current Linux scheduler makes one big assumption: that 1ms of CPU
time is the same as any other 1ms of CPU time, and that therefore a
process makes the same amount of progress regardless of which particular
ms of time it gets.

This assumption is wrong now, and will become more wrong as
virtualization gets more widely used.

It's wrong now, because it fails to take into account of several kinds
of missing time:

   1. interrupts - time spent in an ISR is accounted to the current
      process, even though it gets no direct benefit
   2. SMM - time is completely lost from the kernel
   3. slow CPUs - 1ms of 600MHz CPU is less useful than 1ms of 2.4GHz CPU

The first two - time lost to interrupts - are a well known problem, and
are generally considered to be a non issue.  If you're losing a
significant amount of time to interrupts, you probably have bigger
problems.  (Or maybe not?)

The third is not something I've seen discussed before, but it seems like
it could be a significant problem today.  Certainly, I've noticed it
myself: an interactive program decides to do something CPU-intensive
(like start an animation), and it chugs until the conservative governor
brings the CPU up to speed.  Certainly some of this is because its just
plain CPU-starved, but I think another factor is that it gets penalized
for running on a slow CPU: 1ms is not 1ms.  And for power reasons you
want to encourage processes to run on slow CPUs rather than penalize them.

Virtualization just exacerbates this.  If you have a busy machine
running multiple virtual CPUs, then each VCPU may only get a small
proportion of the total amount of available CPU time.  If the kernel's
scheduler asserts that "you were just scheduled for 1ms, therefore you
made 1ms of progress", then many timeslices will effectively end up
being 1ms of 0Mhz CPU - because the VCPU wasn't scheduled and the real
CPU was doing something else.


So how to deal with this?  Basically we need a clock which measures "CPU
work units", and have the scheduler use this clock.

A "CPU work unit" clock has these properties:

    * inherently per-CPU (from the kernel's perspective, so it would be
      per-VCPU in a virtual machine)
    * monotonic - you can't do negative work
    * measured in "work units"

A "work unit" is probably most simply expressed in cycles - you assume a
cycle of CPU time is equivalent in terms of work done to any other
cycle.  This means that 1 cycle at 600MHz is equivalent to 1 cycle at
2.4GHz - but of course the 2.4GHz processor gets 4 times as many in any
real time interval.  (This is the instance where the worst kind of tsc -
varying speed which stops on idle - is actually exactly what you want.)

You could also measure "work units" in terms of normalized time units:
if the fastest CPU on the machine is 2.4GHz, then 1ms is 1ms a work unit
on that CPU, but 250us on the 600MHz CPU.

It doesn't really matter what the unit is, so long as it is used
consistently to measure how much progress all processes made.


So, how to implement this?

One quick hack would be to just make a new clocksource entrypoint, which
returns work units rather than real-time cycles.  That would be fairly
simple to implement, but it doesn't really take the per-cpu nature of
the clock into account (since its possible that different cpus on the
same machine might need their own methods).

Perhaps a better fit would be an entity which is equivalent to a
clocksource, but registered per-cpu like (some) clockevents.

I don't have a particular preference, but I wonder what the clock gurus
think.

    J

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: Stolen and degraded time and schedulers
  2007-03-13 16:31 Stolen and degraded time and schedulers Jeremy Fitzhardinge
@ 2007-03-13 20:12 ` john stultz
  2007-03-13 20:32   ` Jeremy Fitzhardinge
  2007-03-14 21:36 ` Con Kolivas
  1 sibling, 1 reply; 51+ messages in thread
From: john stultz @ 2007-03-13 20:12 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Andi Kleen, Ingo Molnar, Thomas Gleixner, Con Kolivas,
	Rusty Russell, Zachary Amsden, James Morris, Chris Wright,
	Linux Kernel Mailing List, cpufreq, Virtualization Mailing List,
	Daniel Walker

On Tue, 2007-03-13 at 09:31 -0700, Jeremy Fitzhardinge wrote:
> The current Linux scheduler makes one big assumption: that 1ms of CPU
> time is the same as any other 1ms of CPU time, and that therefore a
> process makes the same amount of progress regardless of which particular
> ms of time it gets.
> 
> This assumption is wrong now, and will become more wrong as
> virtualization gets more widely used.
> 
> It's wrong now, because it fails to take into account of several kinds
> of missing time:
> 
>    1. interrupts - time spent in an ISR is accounted to the current
>       process, even though it gets no direct benefit
>    2. SMM - time is completely lost from the kernel
>    3. slow CPUs - 1ms of 600MHz CPU is less useful than 1ms of 2.4GHz CPU
> 
[snip]
> So how to deal with this?  Basically we need a clock which measures "CPU
> work units", and have the scheduler use this clock.
> 
> A "CPU work unit" clock has these properties:
> 
>     * inherently per-CPU (from the kernel's perspective, so it would be
>       per-VCPU in a virtual machine)
>     * monotonic - you can't do negative work
>     * measured in "work units"
[snip]
> So, how to implement this?
> 
> One quick hack would be to just make a new clocksource entrypoint, which
> returns work units rather than real-time cycles.  That would be fairly
> simple to implement, but it doesn't really take the per-cpu nature of
> the clock into account (since its possible that different cpus on the
> same machine might need their own methods).
> 
> Perhaps a better fit would be an entity which is equivalent to a
> clocksource, but registered per-cpu like (some) clockevents.
> 
> I don't have a particular preference, but I wonder what the clock gurus
> think.

My gut reaction would be to avoid using clocksources for now. While
there is some thought going into how to expand clocksources for other
uses (Daniel is working on this, for example), the design for
clocksources has been very focused on its utility to timekeeping, so I'm
hesitant to try complicate the clocksources in order to multiplex
functionality until what is really needed is well understood.

I suspect the best approach would be see how the sched_clock interface
can be reworked/used for what you want, as it's design goals map closest
to the work-unit properties you list above.

Then we can look to see how clocksources can be best used to implement
the sched_clock interface.

-john










^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: Stolen and degraded time and schedulers
  2007-03-13 20:12 ` john stultz
@ 2007-03-13 20:32   ` Jeremy Fitzhardinge
  2007-03-13 21:27     ` Daniel Walker
  0 siblings, 1 reply; 51+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-13 20:32 UTC (permalink / raw)
  To: john stultz
  Cc: Andi Kleen, Ingo Molnar, Thomas Gleixner, Con Kolivas,
	Rusty Russell, Zachary Amsden, James Morris, Chris Wright,
	Linux Kernel Mailing List, cpufreq, Virtualization Mailing List,
	Daniel Walker

john stultz wrote:
> My gut reaction would be to avoid using clocksources for now. While
> there is some thought going into how to expand clocksources for other
> uses (Daniel is working on this, for example), the design for
> clocksources has been very focused on its utility to timekeeping, so I'm
> hesitant to try complicate the clocksources in order to multiplex
> functionality until what is really needed is well understood.
>   

Yes, you could imagine adding it as a clocksource variant, by allowing a
clocksource to set a particular timebase:

enum clocksource_timebase {
	CLOCK_TIMEBASE_REALTIME,
	CLOCK_TIMEBASE_CPU_WORK,
	...
};

struct clocksource {
	enum clocksource_timebase timebase;
	...
}

Most of the existing clocksource infrastructure would only operate on
CLOCK_TIMEBASE_REALTIME clocksources, so I'm not sure how much overlap
there would be here.  In the case of dealing with cpufreq, there's a
certain appeal to manipulating the shift/mult parameters to reflect the
fractional speed of a cpu as it changes.

> I suspect the best approach would be see how the sched_clock interface
> can be reworked/used for what you want, as it's design goals map closest
> to the work-unit properties you list above.
>   

sched_clock would definitely be the interface which exposes all this
stuff to the rest of the kernel.  After all, its basically a very simple
interface, though the backend implementation details may not be.

We currently have a sched_clock interface in paravirt_ops to deal with
the hypervisor aspect.  It only occurred to me this morning that cpufreq
presents exactly the same problem to the rest of the kernel, and so
there's room for a more general solution.

    J

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: Stolen and degraded time and schedulers
  2007-03-13 20:32   ` Jeremy Fitzhardinge
@ 2007-03-13 21:27     ` Daniel Walker
  2007-03-13 21:59       ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 51+ messages in thread
From: Daniel Walker @ 2007-03-13 21:27 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: john stultz, Andi Kleen, Ingo Molnar, Thomas Gleixner,
	Con Kolivas, Rusty Russell, Zachary Amsden, James Morris,
	Chris Wright, Linux Kernel Mailing List, cpufreq,
	Virtualization Mailing List

On Tue, 2007-03-13 at 13:32 -0700, Jeremy Fitzhardinge wrote:

> Most of the existing clocksource infrastructure would only operate on
> CLOCK_TIMEBASE_REALTIME clocksources, so I'm not sure how much overlap
> there would be here.  In the case of dealing with cpufreq, there's a
> certain appeal to manipulating the shift/mult parameters to reflect the
> fractional speed of a cpu as it changes.

The frequency tracking you mention is done to some extent inside the
timekeeping adjustment functions, but I'm not sure it's totally accurate
for non-timekeeping, and it also tracks things like interrupt latency.
Tracking frequency changes where it's important to get it right
shouldn't be done I think ..

If you want accurate time accounting, don't use the TSC .

> sched_clock would definitely be the interface which exposes all this
> stuff to the rest of the kernel.  After all, its basically a very simple
> interface, though the backend implementation details may not be.

The sched_clock interface is basically a stripped down clocksource..
I've implemented sched_clock as a clocksource in the past ..

> We currently have a sched_clock interface in paravirt_ops to deal with
> the hypervisor aspect.  It only occurred to me this morning that cpufreq
> presents exactly the same problem to the rest of the kernel, and so
> there's room for a more general solution.

Are there other architecture which have this per-cpu clock frequency
changing issue? I worked with several other architectures beyond just
x86 and haven't seen this issue ..

Daniel


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: Stolen and degraded time and schedulers
  2007-03-13 21:27     ` Daniel Walker
@ 2007-03-13 21:59       ` Jeremy Fitzhardinge
  2007-03-14  0:43         ` Dan Hecht
  2007-03-14  2:00         ` Daniel Walker
  0 siblings, 2 replies; 51+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-13 21:59 UTC (permalink / raw)
  To: dwalker
  Cc: john stultz, Andi Kleen, Ingo Molnar, Thomas Gleixner,
	Con Kolivas, Rusty Russell, Zachary Amsden, James Morris,
	Chris Wright, Linux Kernel Mailing List, cpufreq,
	Virtualization Mailing List

Daniel Walker wrote:
> The frequency tracking you mention is done to some extent inside the
> timekeeping adjustment functions, but I'm not sure it's totally accurate
> for non-timekeeping, and it also tracks things like interrupt latency.
> Tracking frequency changes where it's important to get it right
> shouldn't be done I think ..
>
> If you want accurate time accounting, don't use the TSC .
>   

I'm not sure I follow you here.  Clocksources have the means to adjust
the rate of time progression, mostly to warp the time for things like
ntp.  The stability or otherwise of the tsc is irrelevant.

If you had a clocksource which was explicitly using the rate at which a
CPU does work as a timebase, then using the same warping mechanism would
allow you to model CPU speed changes.

> The sched_clock interface is basically a stripped down clocksource..
> I've implemented sched_clock as a clocksource in the past ..
>   

Yes, that works.  But a clocksource is strictly about measuring the
progression of real time, and so doesn't generally measure how much work
a CPU has done.

>> We currently have a sched_clock interface in paravirt_ops to deal with
>> the hypervisor aspect.  It only occurred to me this morning that cpufreq
>> presents exactly the same problem to the rest of the kernel, and so
>> there's room for a more general solution.
>>     
>
> Are there other architecture which have this per-cpu clock frequency
> changing issue? I worked with several other architectures beyond just
> x86 and haven't seen this issue ..

Well, lots of cpus have dynamic frequencies.  Any scheduler which
maintains history will suffer the same problem, even on UP.  If
processes A and B are supposed to have the same priority and they both
execute for 1ms of real time, did they make the same amount of
progress?  Not if the cpu changed speed in between.

And any system which commonly runs virtualized (s390, power, etc) will
need to deal with the notion of stolen time.

    J

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: Stolen and degraded time and schedulers
  2007-03-13 21:59       ` Jeremy Fitzhardinge
@ 2007-03-14  0:43         ` Dan Hecht
  2007-03-14  4:37           ` Jeremy Fitzhardinge
  2007-03-14  2:00         ` Daniel Walker
  1 sibling, 1 reply; 51+ messages in thread
From: Dan Hecht @ 2007-03-14  0:43 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: dwalker, cpufreq, Linux Kernel Mailing List, Con Kolivas,
	Chris Wright, Virtualization Mailing List, john stultz,
	Ingo Molnar, Thomas Gleixner

On 03/13/2007 02:59 PM, Jeremy Fitzhardinge wrote:
> Daniel Walker wrote:
>> The frequency tracking you mention is done to some extent inside the
>> timekeeping adjustment functions, but I'm not sure it's totally accurate
>> for non-timekeeping, and it also tracks things like interrupt latency.
>> Tracking frequency changes where it's important to get it right
>> shouldn't be done I think ..
>>
>> If you want accurate time accounting, don't use the TSC .
>>   
> 
> I'm not sure I follow you here.  Clocksources have the means to adjust
> the rate of time progression, mostly to warp the time for things like
> ntp.  The stability or otherwise of the tsc is irrelevant.
> 
> If you had a clocksource which was explicitly using the rate at which a
> CPU does work as a timebase, then using the same warping mechanism would
> allow you to model CPU speed changes.
> 
>> The sched_clock interface is basically a stripped down clocksource..
>> I've implemented sched_clock as a clocksource in the past ..
>>   
> 
> Yes, that works.  But a clocksource is strictly about measuring the
> progression of real time, and so doesn't generally measure how much work
> a CPU has done.
> 
>>> We currently have a sched_clock interface in paravirt_ops to deal with
>>> the hypervisor aspect.  It only occurred to me this morning that cpufreq
>>> presents exactly the same problem to the rest of the kernel, and so
>>> there's room for a more general solution.
>>>     
>> Are there other architecture which have this per-cpu clock frequency
>> changing issue? I worked with several other architectures beyond just
>> x86 and haven't seen this issue ..
> 
> Well, lots of cpus have dynamic frequencies.  Any scheduler which
> maintains history will suffer the same problem, even on UP.  If
> processes A and B are supposed to have the same priority and they both
> execute for 1ms of real time, did they make the same amount of
> progress?  Not if the cpu changed speed in between.
> 
> And any system which commonly runs virtualized (s390, power, etc) will
> need to deal with the notion of stolen time.
> 

With your previous definition of work time, would it be that:

monotonic_time == work_time + stolen_time ??

i.e. would you be defining stolen_time to include the time lost to 
processes due to the cpu running at a lower frequency?  How does this 
play into the other potential users, besides sched_clock(), of stolen 
time?  We should make sure that the abstraction introduced here makes 
sense in those places too.

For example, the stuff that happens in update_process_times().  I think 
we'd want to account the stolen time to cpustat->steal.  Also we'd 
probably want account for stolen time with regards to 
task_running_tick().  (Though, in the latter case, maybe we first have 
to move the scheduler away from assuming HZ rate decrementing of 
p->time_slice to get this right. i.e. remove the tick based assumption 
from the scheduler, and then maybe stolen time falls in more naturally 
when accounting time slices).

I guess taking your cpufreq as an example of work_time progressing 
slower than monotonic_time (and assuming that the remaining time is what 
you would call stolen), then e.g. top would report 50% of your cpu 
stolen when you cpu is running at 1/2 max rate.  And p->time_slice would 
decrement at 1/2 the rate it normally did when running at 1/2 speed.  Is 
this the right thing to do?  If so, then I agree it makes sense to model 
hypervisor stolen time in terms of your "work time".  But, if not, then 
maybe the amount of work you can get done during a period of time that 
is not stolen and the stolen time itself are really two different 
notions, and shouldn't be confused.  I can see arguments both ways.

Dan

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: Stolen and degraded time and schedulers
  2007-03-13 21:59       ` Jeremy Fitzhardinge
  2007-03-14  0:43         ` Dan Hecht
@ 2007-03-14  2:00         ` Daniel Walker
  2007-03-14  6:52           ` Jeremy Fitzhardinge
  1 sibling, 1 reply; 51+ messages in thread
From: Daniel Walker @ 2007-03-14  2:00 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: john stultz, Andi Kleen, Ingo Molnar, Thomas Gleixner,
	Con Kolivas, Rusty Russell, Zachary Amsden, James Morris,
	Chris Wright, Linux Kernel Mailing List, cpufreq,
	Virtualization Mailing List

On Tue, 2007-03-13 at 14:59 -0700, Jeremy Fitzhardinge wrote:
> Daniel Walker wrote:
> > The frequency tracking you mention is done to some extent inside the
> > timekeeping adjustment functions, but I'm not sure it's totally accurate
> > for non-timekeeping, and it also tracks things like interrupt latency.
> > Tracking frequency changes where it's important to get it right
> > shouldn't be done I think ..
> >
> > If you want accurate time accounting, don't use the TSC .
> >   
> 
> I'm not sure I follow you here.  Clocksources have the means to adjust
> the rate of time progression, mostly to warp the time for things like
> ntp.  The stability or otherwise of the tsc is irrelevant.

The adjustments that I spoke of above are working regardless of ntp ..
The stability of the TSC directly effects the clock mult adjustments in
timekeeping, as does interrupt latency since the clock is essentially
validated against the timer interrupt.

> If you had a clocksource which was explicitly using the rate at which a
> CPU does work as a timebase, then using the same warping mechanism would
> allow you to model CPU speed changes.

like I said there are other factors so that's not going to exactly model
cpu speed changes. You could come up with another method, but that would
likely require another known constant clock.

> > The sched_clock interface is basically a stripped down clocksource..
> > I've implemented sched_clock as a clocksource in the past ..
> >   
> 
> Yes, that works.  But a clocksource is strictly about measuring the
> progression of real time, and so doesn't generally measure how much work
> a CPU has done.

sched_clock doesn't measure amounts of cpu work either, it's all about
timing. 

> >> We currently have a sched_clock interface in paravirt_ops to deal with
> >> the hypervisor aspect.  It only occurred to me this morning that cpufreq
> >> presents exactly the same problem to the rest of the kernel, and so
> >> there's room for a more general solution.
> >>     
> >
> > Are there other architecture which have this per-cpu clock frequency
> > changing issue? I worked with several other architectures beyond just
> > x86 and haven't seen this issue ..
> 
> Well, lots of cpus have dynamic frequencies.  Any scheduler which
> maintains history will suffer the same problem, even on UP.  If
> processes A and B are supposed to have the same priority and they both
> execute for 1ms of real time, did they make the same amount of
> progress?  Not if the cpu changed speed in between.

That's true, but given a constant clock (like what sched_clock should
have) then the accounting is similarly inaccurate. Any connection
between the scheduler and the TSC frequency changes aren't part of the
design AFAIK ..

> And any system which commonly runs virtualized (s390, power, etc) will
> need to deal with the notion of stolen time.

I haven't followed the "stolen time" discussion, but just a brief look
at your first email I'd say don't mess with the clocks .. The clocks
should always reflect the time accurately .. That's the point of the
clocks, and when the TSC, or any other clock, changes frequency it
sucks..

I haven't thought it through completely, but you might be able to solve
the issue by adding a value to each jiffie in the scheduler or altering
the scheduler to extend the number of jiffies a task gets pending on the
virtual speed of the cpu..

Daniel


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: Stolen and degraded time and schedulers
  2007-03-14  0:43         ` Dan Hecht
@ 2007-03-14  4:37           ` Jeremy Fitzhardinge
  2007-03-14 13:58             ` Lennart Sorensen
  2007-03-14 19:02             ` Dan Hecht
  0 siblings, 2 replies; 51+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-14  4:37 UTC (permalink / raw)
  To: Dan Hecht
  Cc: dwalker, cpufreq, Linux Kernel Mailing List, Con Kolivas,
	Chris Wright, Virtualization Mailing List, john stultz,
	Ingo Molnar, Thomas Gleixner

Dan Hecht wrote:
> With your previous definition of work time, would it be that:
>
> monotonic_time == work_time + stolen_time ??

(By monotonic time, I presume you mean monotonic real time.)  Yes, I
suppose you could, but I don't think that's terribly useful.   I think
work_time is probably most naturally measured in cpu clock cycles rather
than an actual time unit.  You could convert it to ns, but I don't see
the point.

I know its a term in general use, but I don't think the term "stolen
time" is all that useful, particularly when we're talking about a more
general notion of cpu work contributing to the progress of process
execution.  In the cpufreq case, time isn't "stolen" per se.

(I guess I don't like the term stolen time because you don't refer to
time spent on other processes as being stolen from your process: its
just processor time being distributed.)

> i.e. would you be defining stolen_time to include the time lost to
> processes due to the cpu running at a lower frequency?  How does this
> play into the other potential users, besides sched_clock(), of stolen
> time?  We should make sure that the abstraction introduced here makes
> sense in those places too.

Be specific.  What other uses are there?

> For example, the stuff that happens in update_process_times().  I
> think we'd want to account the stolen time to cpustat->steal.

I guess we could do something for that.  Would we account non-full-speed
cpus to it?  Maybe?

How is cpustat->steal used?  How does it get out to usermode?


>   Also we'd probably want account for stolen time with regards to
> task_running_tick().  (Though, in the latter case, maybe we first have
> to move the scheduler away from assuming HZ rate decrementing of
> p->time_slice to get this right. i.e. remove the tick based assumption
> from the scheduler, and then maybe stolen time falls in more naturally
> when accounting time slices).

I think the important part is that sched_clock() be used to actually
compute how much time each process gets.  The fact that a time quantum
gets stolen is less important.  Or do you mean something else?

> I guess taking your cpufreq as an example of work_time progressing
> slower than monotonic_time (and assuming that the remaining time is
> what you would call stolen), then e.g. top would report 50% of your
> cpu stolen when you cpu is running at 1/2 max rate.

Yes.  In the same way that clock modulation gates the cpu clock, the
hypervisor effectively gates the clock by giving time to other vcpus.

> And p->time_slice would decrement at 1/2 the rate it normally did when
> running at 1/2 speed.  Is this the right thing to do?  If so, then I
> agree it makes sense to model hypervisor stolen time in terms of your
> "work time".

Yes, that's my thought.

>   But, if not, then maybe the amount of work you can get done during a
> period of time that is not stolen and the stolen time itself are
> really two different notions, and shouldn't be confused.  I can see
> arguments both ways. 

It seems to me like a nice opportunity to solve two problems with one
mechanism.

    J

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: Stolen and degraded time and schedulers
  2007-03-14  2:00         ` Daniel Walker
@ 2007-03-14  6:52           ` Jeremy Fitzhardinge
  2007-03-14  8:20             ` Zan Lynx
  2007-03-14 16:11             ` Daniel Walker
  0 siblings, 2 replies; 51+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-14  6:52 UTC (permalink / raw)
  To: Daniel Walker
  Cc: john stultz, Andi Kleen, Ingo Molnar, Thomas Gleixner,
	Con Kolivas, Rusty Russell, Zachary Amsden, James Morris,
	Chris Wright, Linux Kernel Mailing List, cpufreq,
	Virtualization Mailing List

Daniel Walker wrote:
> The adjustments that I spoke of above are working regardless of ntp ..
> The stability of the TSC directly effects the clock mult adjustments in
> timekeeping, as does interrupt latency since the clock is essentially
> validated against the timer interrupt.
>   

Yep.  But the tsc is just an example of a clocksource, and doesn't have
any real bearing on what I'm saying.

> like I said there are other factors so that's not going to exactly model
> cpu speed changes. You could come up with another method, but that would
> likely require another known constant clock.
>   

Well, it doesn't need to be a constant clock if its modelling a changing
rate.  And it doesn't need to be an exact model; it just needs to be
better than the current situation.

> sched_clock doesn't measure amounts of cpu work either, it's all about
> timing. 
>   

Specifically, how much cpu time a process has used.  But if the CPU is
running at half speed (or 50% duty cycle), then claiming that the
process got the full amount of time is just an error.

>> Well, lots of cpus have dynamic frequencies.  Any scheduler which
>> maintains history will suffer the same problem, even on UP.  If
>> processes A and B are supposed to have the same priority and they both
>> execute for 1ms of real time, did they make the same amount of
>> progress?  Not if the cpu changed speed in between.
>>     
>
> That's true, but given a constant clock (like what sched_clock should
> have) then the accounting is similarly inaccurate. Any connection
> between the scheduler and the TSC frequency changes aren't part of the
> design AFAIK ..
>   

Well, my whole argument is that sched_clock /should not/ be a constant
clock.  And I'm not quite sure why you keep bringing up the tsc, because
it has no relevance.

    J

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: Stolen and degraded time and schedulers
  2007-03-14  6:52           ` Jeremy Fitzhardinge
@ 2007-03-14  8:20             ` Zan Lynx
  2007-03-14 16:11             ` Daniel Walker
  1 sibling, 0 replies; 51+ messages in thread
From: Zan Lynx @ 2007-03-14  8:20 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: Linux Kernel

[-- Attachment #1: Type: text/plain, Size: 849 bytes --]

On Tue, 2007-03-13 at 23:52 -0700, Jeremy Fitzhardinge wrote:
> Yep.  But the tsc is just an example of a clocksource, and doesn't have
> any real bearing on what I'm saying.
[cut/snip/slash]
> Well, it doesn't need to be a constant clock if its modelling a changing
> rate.  And it doesn't need to be an exact model; it just needs to be
> better than the current situation.

It's 2 AM so I don't know if I'm making sense, but I had an idea for the
sort of clock I think you're looking for.

Couldn't one of the CPU performance counters do this?  I think you can
set one to count cycles and trigger every 100,000, or 10,000 or 1,000,
or whatever.  Then when you get that interrupt hit the context switch.

Then every time slice would be in cycles and not wall-clock, which is
what I think you wanted.
-- 
Zan Lynx <zlynx@acm.org>

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: Stolen and degraded time and schedulers
  2007-03-14  4:37           ` Jeremy Fitzhardinge
@ 2007-03-14 13:58             ` Lennart Sorensen
  2007-03-14 15:08               ` Jeremy Fitzhardinge
  2007-03-14 19:02             ` Dan Hecht
  1 sibling, 1 reply; 51+ messages in thread
From: Lennart Sorensen @ 2007-03-14 13:58 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Dan Hecht, dwalker, cpufreq, Linux Kernel Mailing List,
	Con Kolivas, Chris Wright, Virtualization Mailing List,
	john stultz, Ingo Molnar, Thomas Gleixner

On Tue, Mar 13, 2007 at 09:37:59PM -0700, Jeremy Fitzhardinge wrote:
> (By monotonic time, I presume you mean monotonic real time.)  Yes, I
> suppose you could, but I don't think that's terribly useful.   I think
> work_time is probably most naturally measured in cpu clock cycles rather
> than an actual time unit.  You could convert it to ns, but I don't see
> the point.
> 
> I know its a term in general use, but I don't think the term "stolen
> time" is all that useful, particularly when we're talking about a more
> general notion of cpu work contributing to the progress of process
> execution.  In the cpufreq case, time isn't "stolen" per se.

How would you deal with something like a pentium 4 HT processor where
you may run slower just because you got scheduled on the sibling of a
cpu that happens to run something else needing the same execution units
you do, causing you to get delayed more, even though the cpu is running
full speed and nothing else is trying to use your "cpu"?  I don't think
there is any way to know what the real impact of two processes on a HT
cpu have on each other.

Interesting goal.  Not sure it can be done.

--
Len Sorensen

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: Stolen and degraded time and schedulers
  2007-03-14 13:58             ` Lennart Sorensen
@ 2007-03-14 15:08               ` Jeremy Fitzhardinge
  2007-03-14 15:12                 ` Lennart Sorensen
  0 siblings, 1 reply; 51+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-14 15:08 UTC (permalink / raw)
  To: Lennart Sorensen
  Cc: Dan Hecht, dwalker, cpufreq, Linux Kernel Mailing List,
	Con Kolivas, Chris Wright, Virtualization Mailing List,
	john stultz, Ingo Molnar, Thomas Gleixner

Lennart Sorensen wrote:
> How would you deal with something like a pentium 4 HT processor where
> you may run slower just because you got scheduled on the sibling of a
> cpu that happens to run something else needing the same execution units
> you do, causing you to get delayed more, even though the cpu is running
> full speed and nothing else is trying to use your "cpu"?  I don't think
> there is any way to know what the real impact of two processes on a HT
> cpu have on each other.
>
> Interesting goal.  Not sure it can be done.

You're right.  That's a very tough case.  I don't know if there's any
way to do a reasonable estimate of the slowdown.  You could handwave it
and say "if both threads are running a process, then apply an X scaling
factor to their rate of progress".  That might be enough.

    J

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: Stolen and degraded time and schedulers
  2007-03-14 15:08               ` Jeremy Fitzhardinge
@ 2007-03-14 15:12                 ` Lennart Sorensen
  0 siblings, 0 replies; 51+ messages in thread
From: Lennart Sorensen @ 2007-03-14 15:12 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Dan Hecht, dwalker, cpufreq, Linux Kernel Mailing List,
	Con Kolivas, Chris Wright, Virtualization Mailing List,
	john stultz, Ingo Molnar, Thomas Gleixner

On Wed, Mar 14, 2007 at 08:08:17AM -0700, Jeremy Fitzhardinge wrote:
> You're right.  That's a very tough case.  I don't know if there's any
> way to do a reasonable estimate of the slowdown.  You could handwave it
> and say "if both threads are running a process, then apply an X scaling
> factor to their rate of progress".  That might be enough.

I would think that's a bad idea.  I expect future processors to do HT
much better than the P4 did.  There will always be cases where two
processes don't share well, but the majority of cases they probably will
share well.  Of course if you don't have enough processes to keep all
the CPU cores busy, you might as well not schedule something on the
siblings on HT processors, but on the other hand maybe you should if you
can power down other cpus completely to save power and still get all the
work done at maximum speed.

--
Len Sorensen

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: Stolen and degraded time and schedulers
  2007-03-14  6:52           ` Jeremy Fitzhardinge
  2007-03-14  8:20             ` Zan Lynx
@ 2007-03-14 16:11             ` Daniel Walker
  2007-03-14 16:37               ` Jeremy Fitzhardinge
  1 sibling, 1 reply; 51+ messages in thread
From: Daniel Walker @ 2007-03-14 16:11 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: john stultz, Andi Kleen, Ingo Molnar, Thomas Gleixner,
	Con Kolivas, Rusty Russell, Zachary Amsden, James Morris,
	Chris Wright, Linux Kernel Mailing List, cpufreq,
	Virtualization Mailing List

On Tue, 2007-03-13 at 23:52 -0700, Jeremy Fitzhardinge wrote:

> >
> > That's true, but given a constant clock (like what sched_clock should
> > have) then the accounting is similarly inaccurate. Any connection
> > between the scheduler and the TSC frequency changes aren't part of the
> > design AFAIK ..
> >   
> 
> Well, my whole argument is that sched_clock /should not/ be a constant
> clock.  And I'm not quite sure why you keep bringing up the tsc, because
> it has no relevance.

Then your direction is wrong, sched_clock() should be constant ideally
(1millisecond should really be 1millisecond). Like I said in the last
email, change the scheduler to make it aware of the variable quantum
values.

Daniel


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: Stolen and degraded time and schedulers
  2007-03-14 16:11             ` Daniel Walker
@ 2007-03-14 16:37               ` Jeremy Fitzhardinge
  2007-03-14 16:59                 ` Daniel Walker
  0 siblings, 1 reply; 51+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-14 16:37 UTC (permalink / raw)
  To: Daniel Walker
  Cc: john stultz, Andi Kleen, Ingo Molnar, Thomas Gleixner,
	Con Kolivas, Rusty Russell, Zachary Amsden, James Morris,
	Chris Wright, Linux Kernel Mailing List, cpufreq,
	Virtualization Mailing List

Daniel Walker wrote:
> Then your direction is wrong, sched_clock() should be constant ideally
> (1millisecond should really be 1millisecond). 

Rather than repeating myself, I suggest you read my original post
again.  But my point is that "I was runnable on a cpu for 1ms of real
time" is a meaningless measurement: you want to measure "I ran for 1
cpu-ms", which is a unit which depends on how work a particular CPU does
in relationship to other CPUs on the system, or even itself at some
previous time.

> Like I said in the last
> email, change the scheduler to make it aware of the variable quantum
> values.

I suppose you could, but that seems more complex.  I think you could
encode the same information in the measurement of how much work a cpu
actually got done while a process was scheduled on it.

    J

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: Stolen and degraded time and schedulers
  2007-03-14 16:37               ` Jeremy Fitzhardinge
@ 2007-03-14 16:59                 ` Daniel Walker
  2007-03-14 17:08                   ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 51+ messages in thread
From: Daniel Walker @ 2007-03-14 16:59 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: john stultz, Andi Kleen, Ingo Molnar, Thomas Gleixner,
	Con Kolivas, Rusty Russell, Zachary Amsden, James Morris,
	Chris Wright, Linux Kernel Mailing List, cpufreq,
	Virtualization Mailing List

On Wed, 2007-03-14 at 09:37 -0700, Jeremy Fitzhardinge wrote:
> Daniel Walker wrote:
> > Then your direction is wrong, sched_clock() should be constant ideally
> > (1millisecond should really be 1millisecond). 
> 
> Rather than repeating myself, I suggest you read my original post
> again.  But my point is that "I was runnable on a cpu for 1ms of real
> time" is a meaningless measurement: you want to measure "I ran for 1
> cpu-ms", which is a unit which depends on how work a particular CPU does
> in relationship to other CPUs on the system, or even itself at some
> previous time.

I understood, I just don't agree that you suggested modification are the
correct ones to make.

> > Like I said in the last
> > email, change the scheduler to make it aware of the variable quantum
> > values.
> 
> I suppose you could, but that seems more complex.  I think you could
> encode the same information in the measurement of how much work a cpu
> actually got done while a process was scheduled on it.

I know it's more complex, but that seems more like the "right" thing to
do.

Daniel


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: Stolen and degraded time and schedulers
  2007-03-14 16:59                 ` Daniel Walker
@ 2007-03-14 17:08                   ` Jeremy Fitzhardinge
  2007-03-14 18:06                     ` Daniel Walker
  0 siblings, 1 reply; 51+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-14 17:08 UTC (permalink / raw)
  To: Daniel Walker
  Cc: john stultz, Andi Kleen, Ingo Molnar, Thomas Gleixner,
	Con Kolivas, Rusty Russell, Zachary Amsden, James Morris,
	Chris Wright, Linux Kernel Mailing List, cpufreq,
	Virtualization Mailing List

Daniel Walker wrote:
>> I suppose you could, but that seems more complex.  I think you could
>> encode the same information in the measurement of how much work a cpu
>> actually got done while a process was scheduled on it.
>>     
>
> I know it's more complex, but that seems more like the "right" thing to
> do.

Why's that?

I'm proposing that rather than using "time spent scheduled" as an
approximation of how much progress a process made on a particular CPU
during its timeslice, we should measure it directly. It seems to me that
this usefully encapsulates both the problems of variable-speed cpus and
hypervisors stealing time from guests.

The actual length of the timeslices is an orthogonal issue.  It may be
that you want to give processes more cpu time by making their quanta
longer to compensate for lost cpu time, but that would affect their
real-time characteristics.  Or you could keep the quanta small, and give
those processes more of them.

But all this is getting deep into scheduler design, which is not what I
want to get into; I'm just proposing a better metric for a scheduler to
use in whatever way it wants.

    J

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: Stolen and degraded time and schedulers
  2007-03-14 17:08                   ` Jeremy Fitzhardinge
@ 2007-03-14 18:06                     ` Daniel Walker
  2007-03-14 18:41                       ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 51+ messages in thread
From: Daniel Walker @ 2007-03-14 18:06 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: john stultz, Andi Kleen, Ingo Molnar, Thomas Gleixner,
	Con Kolivas, Rusty Russell, Zachary Amsden, James Morris,
	Chris Wright, Linux Kernel Mailing List, cpufreq,
	Virtualization Mailing List

On Wed, 2007-03-14 at 10:08 -0700, Jeremy Fitzhardinge wrote:

> The actual length of the timeslices is an orthogonal issue.  It may be
> that you want to give processes more cpu time by making their quanta
> longer to compensate for lost cpu time, but that would affect their
> real-time characteristics.  Or you could keep the quanta small, and give
> those processes more of them.
> 
> But all this is getting deep into scheduler design, which is not what I
> want to get into; I'm just proposing a better metric for a scheduler to
> use in whatever way it wants.

>From prior emails I think your suggesting that 1ms (or 5 or 10) of time
should actually be a variable X that is changed inside sched_clock().
That's not the purpose of that API call, sched_clock() measure real time
period.

After reading your emails it sounds like what you really want is similar
to accurate state accounting which is used for scheduling purposes. Part
of that has already been implemented at least twice that I know of.
Accounting real time against specific states was done in two version of
microstate accounting. Those are fine starting points for the changes
you are wanting.

Daniel


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: Stolen and degraded time and schedulers
  2007-03-14 18:06                     ` Daniel Walker
@ 2007-03-14 18:41                       ` Jeremy Fitzhardinge
  2007-03-14 19:00                         ` Daniel Walker
  0 siblings, 1 reply; 51+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-14 18:41 UTC (permalink / raw)
  To: Daniel Walker
  Cc: john stultz, Andi Kleen, Ingo Molnar, Thomas Gleixner,
	Con Kolivas, Rusty Russell, Zachary Amsden, James Morris,
	Chris Wright, Linux Kernel Mailing List, cpufreq,
	Virtualization Mailing List, Peter Chubb

Daniel Walker wrote:
> >From prior emails I think your suggesting that 1ms (or 5 or 10) of time
> should actually be a variable X that is changed inside sched_clock().
> That's not the purpose of that API call, sched_clock() measure real time
> period.
>   

To what purpose?  What is it really measuring?  My understanding is that
its for the scheduler to work out how much time a process actually ran
for.  Aside from its use in printk as a general monotonic timestamp,
this seems to be how it gets used everywhere.  If I change it to return
cpu-ns (ie, make it not count time that the cpu was stolen by the
hypervisor), then it will return what its callers actually want to know.

If I scale its result according to the cpu's current speed compared to
its maximum speed, it would also be producing results consistent with
what its callers want to know.

> After reading your emails it sounds like what you really want is similar
> to accurate state accounting which is used for scheduling purposes. Part
> of that has already been implemented at least twice that I know of.
> Accounting real time against specific states was done in two version of
> microstate accounting. Those are fine starting points for the changes
> you are wanting.

I haven't looked at the microstate accounting patches in any detail, but
I'm assuming that they take a timestamp at each CPU state transition and
use that to account time to the appropriate entities (tell me if I'm
missing something pertinent here).  There are two problems with that
approach in this case:

   1. If the cpu is stolen by the hypervisor, the kernel will get no
      state transition notification.  It can generally find out that
      some time was stolen after the fact, but there's no specific event
      at the time it happens.
   2. It doesn't map particularly well to a cpu changing speed.  In
      particular if a cpu has continuously varying execution speed
      (Transmeta?), then the best you can hope for is the integration of
      cpu work done over a time period rather than discrete cpu
      speed-change events.

    J

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: Stolen and degraded time and schedulers
  2007-03-14 18:41                       ` Jeremy Fitzhardinge
@ 2007-03-14 19:00                         ` Daniel Walker
  2007-03-14 19:44                           ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 51+ messages in thread
From: Daniel Walker @ 2007-03-14 19:00 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: john stultz, Andi Kleen, Ingo Molnar, Thomas Gleixner,
	Con Kolivas, Rusty Russell, Zachary Amsden, James Morris,
	Chris Wright, Linux Kernel Mailing List, cpufreq,
	Virtualization Mailing List, Peter Chubb

On Wed, 2007-03-14 at 11:41 -0700, Jeremy Fitzhardinge wrote:
> Daniel Walker wrote:
> > >From prior emails I think your suggesting that 1ms (or 5 or 10) of time
> > should actually be a variable X that is changed inside sched_clock().
> > That's not the purpose of that API call, sched_clock() measure real time
> > period.
> >   
> 
> To what purpose?  What is it really measuring?  My understanding is that
> its for the scheduler to work out how much time a process actually ran
> for.  Aside from its use in printk as a general monotonic timestamp,
> this seems to be how it gets used everywhere.  If I change it to return
> cpu-ns (ie, make it not count time that the cpu was stolen by the
> hypervisor), then it will return what its callers actually want to know.

sched_clock is used to bank real time against some specific states
inside the scheduler, and no it doesn't _just_ measure a processes
executing time.

> > After reading your emails it sounds like what you really want is similar
> > to accurate state accounting which is used for scheduling purposes. Part
> > of that has already been implemented at least twice that I know of.
> > Accounting real time against specific states was done in two version of
> > microstate accounting. Those are fine starting points for the changes
> > you are wanting.
> 
> I haven't looked at the microstate accounting patches in any detail, but
> I'm assuming that they take a timestamp at each CPU state transition and
> use that to account time to the appropriate entities (tell me if I'm
> missing something pertinent here).  There are two problems with that
> approach in this case:
> 
>    1. If the cpu is stolen by the hypervisor, the kernel will get no
>       state transition notification.  It can generally find out that
>       some time was stolen after the fact, but there's no specific event
>       at the time it happens.

The hypervisor would need to do it's own accounting I'd imagine then
provide that to the scheduler.

>    2. It doesn't map particularly well to a cpu changing speed.  In
>       particular if a cpu has continuously varying execution speed
>       (Transmeta?), then the best you can hope for is the integration of
>       cpu work done over a time period rather than discrete cpu
>       speed-change events.

True, but as I said in my original email it's not trivial to follow
physical cpu speed changes since the changes are free form and change
potentially per system. Your better off do it just with the hypervisor
since you can control it ..

Daniel


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: Stolen and degraded time and schedulers
  2007-03-14  4:37           ` Jeremy Fitzhardinge
  2007-03-14 13:58             ` Lennart Sorensen
@ 2007-03-14 19:02             ` Dan Hecht
  2007-03-14 19:34               ` Jeremy Fitzhardinge
  1 sibling, 1 reply; 51+ messages in thread
From: Dan Hecht @ 2007-03-14 19:02 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: dwalker, cpufreq, Linux Kernel Mailing List, Con Kolivas,
	Chris Wright, Virtualization Mailing List, john stultz,
	Ingo Molnar, Thomas Gleixner, paulus, schwidefsky, Dan Hecht,
	Rik van Riel

On 03/13/2007 09:37 PM, Jeremy Fitzhardinge wrote:
> Dan Hecht wrote:
>> With your previous definition of work time, would it be that:
>>
>> monotonic_time == work_time + stolen_time ??
> 
> (By monotonic time, I presume you mean monotonic real time.)  

Yes, I was just trying to use some consistent terminology, so I picked 
linux (hrtimer.c) terms: CLOCK_REALTIME == wallclock, CLOCK_MONOTONIC == 
"real" time counter.

> Yes, I
> suppose you could, but I don't think that's terribly useful.   I think
> work_time is probably most naturally measured in cpu clock cycles rather
> than an actual time unit.  You could convert it to ns, but I don't see
> the point.
> 

Even cpu clock cycles doesn't really tell you how much "work" a cpu was 
able to get done.  Different cpus have different throughputs per cycle 
per instruction sequence.


> I know its a term in general use, but I don't think the term "stolen
> time" is all that useful, particularly when we're talking about a more
> general notion of cpu work contributing to the progress of process
> execution.  In the cpufreq case, time isn't "stolen" per se.
> 

Right, and that's why I'm not sure I'm convinced the two should be 
confused.  In the case of cpufreq you are talking about the cpu not 
doing as much work due to a choice the kernel made.  In the case of 
stolen time, the choice wasn't made by the kernel, but instead the 
hypervisor.  I understand they are somewhat similar from the perspective 
of the scheduler, but a bit different.

Also, I'm not sure this is the right thing for cpufreq because:
1) when the load is high, which is when this all matters, presumably the 
kernel will ramp up the cpu to full speed.
2) in the case where there are two different machines with two different 
process speeds (well, really processor throughputs), today the scheduler 
doesn't care about trying to adjust the timing due to one machine being 
faster than the other.  I think this is might be design; i.e. the 
scheduler was intended to work in terms of real time units rather than 
work units.  I guess you are arguing that this is incorrect, and the 
scheduler should be scheduling based on how much work it was able to get 
done.  I'm not sure this makes sense because it was the kernel that 
decided slow the cpus and cause less work to be done.

In the case of stolen time, however, the kernel was even running at all 
on any pcpu, and it wasn't even up to the kernel to decide that.


> (I guess I don't like the term stolen time because you don't refer to
> time spent on other processes as being stolen from your process: its
> just processor time being distributed.)
> 

Okay. I think maybe it comes from the fact that most moder processes 
expect to time share the cpu, whereas most kernels do not expect this, 
and it has already been adopted by the kernel (cpustat->steal).  But, I 
don't really care what we call it.

>> i.e. would you be defining stolen_time to include the time lost to
>> processes due to the cpu running at a lower frequency?  How does this
>> play into the other potential users, besides sched_clock(), of stolen
>> time?  We should make sure that the abstraction introduced here makes
>> sense in those places too.
> 
> Be specific.  What other uses are there?
> 

I listed them below.  To summarize, there are (at least) three:

1) sched_clock, the main topic of this thread.
2) p->time_slice
3) cpustat->steal

>> For example, the stuff that happens in update_process_times().  I
>> think we'd want to account the stolen time to cpustat->steal.
> 
> I guess we could do something for that.  Would we account non-full-speed
> cpus to it?  Maybe?
> 
> How is cpustat->steal used?  How does it get out to usermode?
> 

Via /proc/stat, used by modern 'top', maybe other utilities.  It is 
useful to users who want to see where the time is really going from 
inside a guest when running on a (para)virtual machine.

I believe previous set of xen paravirt-ops patches already handled cases 
#2 and #3 (but no longer do since switching to clockevents), and the old 
vmitime code did also.  Obviously, we need revamp this stuff to make it 
fit in with the new clockevents/hrtimer way of doing things.

Also, s390 and powerpc arch's already account steal time, as another 
reference point.  (the old xen time code and vmi time code worked much 
in the same way as those).  We should bring the s390 and powerpc folks 
into the discussion.

> 
>>   Also we'd probably want account for stolen time with regards to
>> task_running_tick().  (Though, in the latter case, maybe we first have
>> to move the scheduler away from assuming HZ rate decrementing of
>> p->time_slice to get this right. i.e. remove the tick based assumption
>> from the scheduler, and then maybe stolen time falls in more naturally
>> when accounting time slices).
> 
> I think the important part is that sched_clock() be used to actually
> compute how much time each process gets.  The fact that a time quantum
> gets stolen is less important.  Or do you mean something else?
> 

How is time quantum getting stolen less important?  Time quantum getting 
stolen results directly in more unnecessary context switches since we 
might steal the entire timeslice before the process even ran.

I actually think #2 and #3 might be more important than #1 (at least 
they are as important).  And, the earlier Xen patches seem to agree with 
this, since they addressed 2 & 3 only.

>> I guess taking your cpufreq as an example of work_time progressing
>> slower than monotonic_time (and assuming that the remaining time is
>> what you would call stolen), then e.g. top would report 50% of your
>> cpu stolen when you cpu is running at 1/2 max rate.
> 
> Yes.  In the same way that clock modulation gates the cpu clock, the
> hypervisor effectively gates the clock by giving time to other vcpus.
> 

Yes, except in one case it was a choice of the kernel, and in the other 
it was up to the hypervisor.

>> And p->time_slice would decrement at 1/2 the rate it normally did when
>> running at 1/2 speed.  Is this the right thing to do?  If so, then I
>> agree it makes sense to model hypervisor stolen time in terms of your
>> "work time".
> 
> Yes, that's my thought.
> 

Since it was the kernel that decided to slow down the processor, I don't 
know if this makes sense.

Dan

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: Stolen and degraded time and schedulers
  2007-03-14 19:02             ` Dan Hecht
@ 2007-03-14 19:34               ` Jeremy Fitzhardinge
  2007-03-14 19:45                 ` Rik van Riel
                                   ` (3 more replies)
  0 siblings, 4 replies; 51+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-14 19:34 UTC (permalink / raw)
  To: Dan Hecht
  Cc: dwalker, cpufreq, Linux Kernel Mailing List, Con Kolivas,
	Chris Wright, Virtualization Mailing List, john stultz,
	Ingo Molnar, Thomas Gleixner, paulus, schwidefsky, Rik van Riel

Dan Hecht wrote:
> Yes, I was just trying to use some consistent terminology, so I picked
> linux (hrtimer.c) terms: CLOCK_REALTIME == wallclock, CLOCK_MONOTONIC
> == "real" time counter.

OK.  I had used "monotonic" in its more general sense earlier in the
thread, and I wanted to be sure.

> Even cpu clock cycles doesn't really tell you how much "work" a cpu
> was able to get done.  Different cpus have different throughputs per
> cycle per instruction sequence.

Sure.  But on a given machine, the CPUs are likely to be closely enough
matched that a cycle on one CPU is more or less equivalent to a cycle on
another CPU.  The fact that a cycle represents a different amount of
work on an i486 compared to Core2 doesn't matter much.  The important
part is that when the scheduler is doling out CPU time is comparing
everyone's usages with a common unit.

> Right, and that's why I'm not sure I'm convinced the two should be
> confused.  In the case of cpufreq you are talking about the cpu not
> doing as much work due to a choice the kernel made.  In the case of
> stolen time, the choice wasn't made by the kernel, but instead the
> hypervisor.  I understand they are somewhat similar from the
> perspective of the scheduler, but a bit different.
Yes, but the question is whether it matters all that much?  Does it
matter enough to make them two separate concepts, when one seems to
cover all the important points?

> Also, I'm not sure this is the right thing for cpufreq because:
> 1) when the load is high, which is when this all matters, presumably
> the kernel will ramp up the cpu to full speed.

Not at all.  You might have an unimportant but cpu-bound process which
doesn't merit increasing the cpu speed, but should also be scheduled
properly compared to other processes.  I often nice my kernel builds
(which cpufreq takes as a hint to not ramp up the cpu speed) on my
laptop so to save power.

> 2) in the case where there are two different machines with two
> different process speeds (well, really processor throughputs), today
> the scheduler doesn't care about trying to adjust the timing due to
> one machine being faster than the other.

It doesn't matter.  The scheduler is only important when there's
contention for the cpu, and if there is, that it compare process CPU
usage with the same unit.  What that unit isn't inherently very
important, so long as its consistent.

> I'm not sure this makes sense because it was the kernel that decided
> slow the cpus and cause less work to be done.

That's true.  But this is a case of the left brain not talking to the
right brain: cpufreq might decide to slow a cpu down, but the scheduler
doesn't take that into account.  Making the timebase of sched_clock
reflect the current cpu speed (or more specifically, the integral of the
cpu speed over a time interval) is a good way of communicating between
the two subsystems.

> In the case of stolen time, however, the kernel was even running at
> all on any pcpu, and it wasn't even up to the kernel to decide that.

As things stand now, there's not much difference from the scheduler's
perspective, since the scheduler takes no action in either case.

> I listed them below.  To summarize, there are (at least) three:
>
> 1) sched_clock, the main topic of this thread.
> 2) p->time_slice

So, this is the target process timeslice, in units of sched_clock's
timebase?

> 3) cpustat->steal
>
>>> For example, the stuff that happens in update_process_times().  I
>>> think we'd want to account the stolen time to cpustat->steal.
>>
>> I guess we could do something for that.  Would we account non-full-speed
>> cpus to it?  Maybe?
>>
>> How is cpustat->steal used?  How does it get out to usermode?
>>
>
> Via /proc/stat, used by modern 'top', maybe other utilities.  It is
> useful to users who want to see where the time is really going from
> inside a guest when running on a (para)virtual machine.
>
> I believe previous set of xen paravirt-ops patches already handled
> cases #2 and #3 (but no longer do since switching to clockevents), and
> the old vmitime code did also.  Obviously, we need revamp this stuff
> to make it fit in with the new clockevents/hrtimer way of doing things.

I added stolen time accounting to xen-pv_ops last night.  For Xen, at
least, it wasn't hard to fit into the clockevent infrastructure.  I just
update the stolen time accounting for each cpu when it gets a timer
tick; they seem to get a tick every couple of seconds even when idle.

Similarly, implementing sched_clock as "number of ns the vcpu spent in
running state" is simple and direct (though this makes it an explicitly
per-cpu clock; comparing raw sched_clock values between cpus will be
meaningless; but that's likely true when using the tsc as a timebase too).

>> I think the important part is that sched_clock() be used to actually
>> compute how much time each process gets.  The fact that a time quantum
>> gets stolen is less important.  Or do you mean something else?
>>
>
> How is time quantum getting stolen less important?  Time quantum
> getting stolen results directly in more unnecessary context switches
> since we might steal the entire timeslice before the process even ran.

It doesn't matter why you didn't get the time; the important part is
that you know that time went missing.  Its true that you may end up with
some spurious rescheds, but that seems like the kind of thing you'd want
to measure as being a problem before getting clever in fixing.

> I actually think #2 and #3 might be more important than #1 (at least
> they are as important).  And, the earlier Xen patches seem to agree
> with this, since they addressed 2 & 3 only.

I'd call that an oversight.  Xen has everything needed to implement
sched_clock in terms of non-stolen time.

>> Yes.  In the same way that clock modulation gates the cpu clock, the
>> hypervisor effectively gates the clock by giving time to other vcpus.
>>
>
> Yes, except in one case it was a choice of the kernel, and in the
> other it was up to the hypervisor.

Not necessarily.  The cpu might drop into thermal protection clock
modulation.

    J

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: Stolen and degraded time and schedulers
  2007-03-14 19:00                         ` Daniel Walker
@ 2007-03-14 19:44                           ` Jeremy Fitzhardinge
  2007-03-14 20:33                             ` Daniel Walker
  0 siblings, 1 reply; 51+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-14 19:44 UTC (permalink / raw)
  To: Daniel Walker
  Cc: john stultz, Andi Kleen, Ingo Molnar, Thomas Gleixner,
	Con Kolivas, Rusty Russell, Zachary Amsden, James Morris,
	Chris Wright, Linux Kernel Mailing List, cpufreq,
	Virtualization Mailing List, Peter Chubb

Daniel Walker wrote:
> sched_clock is used to bank real time against some specific states
> inside the scheduler, and no it doesn't _just_ measure a processes
> executing time.
>   

Could you point these places out?  All uses of sched_clock() that I
could see in kernel/sched.c seemed to be related to working out how long
something spent executing, either in the scheduler proper, or
benchmarking cache characteristics.

>>    1. If the cpu is stolen by the hypervisor, the kernel will get no
>>       state transition notification.  It can generally find out that
>>       some time was stolen after the fact, but there's no specific event
>>       at the time it happens.
>>     
>
> The hypervisor would need to do it's own accounting I'd imagine then
> provide that to the scheduler.
>   

Yes.  Xen, at least, provides nanosecond resolution information about
how long each vcpu spent in its various states.  But the question is how
this information should be exposed to the scheduler.  I could provide a
raw dump of the info, but in general the scheduler doesn't care and
other hypervisors might not be able to produce the same information. 
The essential information is "how long did process X actually run on a
real CPU"?  And that, as far as I can tell, is the question
sched_clock() is already designed to answer.

>>    2. It doesn't map particularly well to a cpu changing speed.  In
>>       particular if a cpu has continuously varying execution speed
>>       (Transmeta?), then the best you can hope for is the integration of
>>       cpu work done over a time period rather than discrete cpu
>>       speed-change events.
>>     
>
> True, but as I said in my original email it's not trivial to follow
> physical cpu speed changes since the changes are free form and change
> potentially per system. Your better off do it just with the hypervisor
> since you can control it ..
>   

No, I'm talking about cpu speed changes as a completely separate case,
which is primarily an issue while running a kernel on bare hardware. 
But it is, in some ways, more complex than running on a hypervisor. 
There are numerous mechanisms for cpu speed control, some kernel driven,
some autonomous, some stepwise, some continuous.  I'm arguing that its
the cpufreq subsystem's job to keep track of all that detail, but the
only information it needs to provide to the scheduler is, again, "how
much work did my process get done on the CPU"?


    J

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: Stolen and degraded time and schedulers
  2007-03-14 19:34               ` Jeremy Fitzhardinge
@ 2007-03-14 19:45                 ` Rik van Riel
  2007-03-14 19:47                   ` Jeremy Fitzhardinge
  2007-03-14 20:26                 ` Dan Hecht
                                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 51+ messages in thread
From: Rik van Riel @ 2007-03-14 19:45 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Dan Hecht, dwalker, cpufreq, Linux Kernel Mailing List,
	Con Kolivas, Chris Wright, Virtualization Mailing List,
	john stultz, Ingo Molnar, Thomas Gleixner, paulus, schwidefsky

Jeremy Fitzhardinge wrote:

>> How is time quantum getting stolen less important?  Time quantum
>> getting stolen results directly in more unnecessary context switches
>> since we might steal the entire timeslice before the process even ran.
> 
> It doesn't matter why you didn't get the time; 

Oh, but it does.

System administrators can use steal time the same way they
use iowait time: to spot bottlenecks on their systems.

If you have a lot of iowait time, you know you want either
faster IO or more memory.

If you have a lot of steal time, you know you need to spread
your virtual machines over more CPUs.

Steal time allows you to see the difference between a busy
system and an overloaded system.

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: Stolen and degraded time and schedulers
  2007-03-14 19:45                 ` Rik van Riel
@ 2007-03-14 19:47                   ` Jeremy Fitzhardinge
  2007-03-14 20:02                     ` Rik van Riel
  0 siblings, 1 reply; 51+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-14 19:47 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Dan Hecht, dwalker, cpufreq, Linux Kernel Mailing List,
	Con Kolivas, Chris Wright, Virtualization Mailing List,
	john stultz, Ingo Molnar, Thomas Gleixner, paulus, schwidefsky

Rik van Riel wrote:
> Jeremy Fitzhardinge wrote:
>
>> It doesn't matter why you didn't get the time; 
>
> Oh, but it does.

I meant specifically from a scheduling perspective.

> System administrators can use steal time the same way they
> use iowait time: to spot bottlenecks on their systems.
>
> If you have a lot of iowait time, you know you want either
> faster IO or more memory.
>
> If you have a lot of steal time, you know you need to spread
> your virtual machines over more CPUs.
>
> Steal time allows you to see the difference between a busy
> system and an overloaded system.

Sure, the various accounting tools can go into as much detail as you
want.  I just added stolen time accounting to the xen-pv_ops patchset
which is equivalent to the xen-unstable stolen time accounting.  Is that
sufficient for these purposes?

    J

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: Stolen and degraded time and schedulers
  2007-03-14 19:47                   ` Jeremy Fitzhardinge
@ 2007-03-14 20:02                     ` Rik van Riel
  0 siblings, 0 replies; 51+ messages in thread
From: Rik van Riel @ 2007-03-14 20:02 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Dan Hecht, dwalker, cpufreq, Linux Kernel Mailing List,
	Con Kolivas, Chris Wright, Virtualization Mailing List,
	john stultz, Ingo Molnar, Thomas Gleixner, paulus, schwidefsky

Jeremy Fitzhardinge wrote:
> Rik van Riel wrote:

>> Steal time allows you to see the difference between a busy
>> system and an overloaded system.
> 
> Sure, the various accounting tools can go into as much detail as you
> want.  I just added stolen time accounting to the xen-pv_ops patchset
> which is equivalent to the xen-unstable stolen time accounting.  Is that
> sufficient for these purposes?

Yes, that works.

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: Stolen and degraded time and schedulers
  2007-03-14 19:34               ` Jeremy Fitzhardinge
  2007-03-14 19:45                 ` Rik van Riel
@ 2007-03-14 20:26                 ` Dan Hecht
  2007-03-14 20:31                   ` Jeremy Fitzhardinge
  2007-03-14 20:38                 ` Ingo Molnar
  2007-03-15  5:23                 ` Paul Mackerras
  3 siblings, 1 reply; 51+ messages in thread
From: Dan Hecht @ 2007-03-14 20:26 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: dwalker, cpufreq, Linux Kernel Mailing List, Con Kolivas,
	Chris Wright, Virtualization Mailing List, john stultz,
	Ingo Molnar, Thomas Gleixner, paulus, schwidefsky, Rik van Riel


>>> How is cpustat->steal used?  How does it get out to usermode?
>>>
>> Via /proc/stat, used by modern 'top', maybe other utilities.  It is
>> useful to users who want to see where the time is really going from
>> inside a guest when running on a (para)virtual machine.
>>
>> I believe previous set of xen paravirt-ops patches already handled
>> cases #2 and #3 (but no longer do since switching to clockevents), and
>> the old vmitime code did also.  Obviously, we need revamp this stuff
>> to make it fit in with the new clockevents/hrtimer way of doing things.
> 
> I added stolen time accounting to xen-pv_ops last night.  For Xen, at
> least, it wasn't hard to fit into the clockevent infrastructure.  I just
> update the stolen time accounting for each cpu when it gets a timer
> tick; they seem to get a tick every couple of seconds even when idle.
> 

Sounds good.  I don't see this in your patchset you sent yesterday 
though; did you add it after sending out those patches?  if so, could 
you forward the new patch?  does it explicitly prevent stolen time from 
getting accounted as  user/system time or does it just rely on NO_HZ 
mode sort of happening to work that way (since the one shot timer is 
skipped ahead for missed ticks)?

thanks,
Dan

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: Stolen and degraded time and schedulers
  2007-03-14 20:26                 ` Dan Hecht
@ 2007-03-14 20:31                   ` Jeremy Fitzhardinge
  2007-03-14 20:46                     ` Dan Hecht
  0 siblings, 1 reply; 51+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-14 20:31 UTC (permalink / raw)
  To: Dan Hecht
  Cc: dwalker, cpufreq, Linux Kernel Mailing List, Con Kolivas,
	Chris Wright, Virtualization Mailing List, john stultz,
	Ingo Molnar, Thomas Gleixner, paulus, schwidefsky, Rik van Riel

[-- Attachment #1: Type: text/plain, Size: 704 bytes --]

Dan Hecht wrote:
> Sounds good.  I don't see this in your patchset you sent yesterday
> though; did you add it after sending out those patches?

Yes.

>   if so, could you forward the new patch?  does it explicitly prevent
> stolen time from getting accounted as  user/system time or does it
> just rely on NO_HZ mode sort of happening to work that way (since the
> one shot timer is skipped ahead for missed ticks)?

Hm, not sure.  It doesn't care how often it gets called; it just
accumulates results up to that point, but I'm not sure if the time would
get double accounted.  Perhaps it doesn't matter when using
xen_sched_clock().

Did the get_scheduled_time -> sched_clock make sense to you?

    J

[-- Attachment #2: xen-stolen-time.patch --]
[-- Type: text/x-patch, Size: 3983 bytes --]

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: john stultz <johnstul@us.ibm.com>

---
 arch/i386/xen/time.c |  101 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 101 insertions(+)

===================================================================
--- a/arch/i386/xen/time.c
+++ b/arch/i386/xen/time.c
@@ -2,6 +2,7 @@
 #include <linux/interrupt.h>
 #include <linux/clocksource.h>
 #include <linux/clockchips.h>
+#include <linux/kernel_stat.h>
 
 #include <asm/xen/hypervisor.h>
 #include <asm/xen/hypercall.h>
@@ -14,6 +15,7 @@
 
 #define XEN_SHIFT 22
 #define TIMER_SLOP	100000	/* Xen may fire a timer up to this many ns early */
+#define NS_PER_TICK	(1000000000ll / HZ)
 
 static DEFINE_PER_CPU(struct clock_event_device, xen_clock_events);
 
@@ -28,6 +30,99 @@ struct shadow_time_info {
 
 static DEFINE_PER_CPU(struct shadow_time_info, shadow_time);
 
+/* runstate info updated by Xen */
+static DEFINE_PER_CPU(struct vcpu_runstate_info, runstate);
+
+/* snapshots of runstate info */
+static DEFINE_PER_CPU(struct vcpu_runstate_info, runstate_snapshot);
+
+/* unused ns of stolen and blocked time */
+static DEFINE_PER_CPU(u64, residual_stolen);
+static DEFINE_PER_CPU(u64, residual_blocked);
+
+/*
+   Runstate accounting
+ */
+static void get_runstate_snapshot(struct vcpu_runstate_info *res)
+{
+	u64 state_time;
+	struct vcpu_runstate_info *state;
+
+	preempt_disable();
+
+	state = &__get_cpu_var(runstate);
+
+	do {
+		state_time = state->state_entry_time;
+		barrier();
+		*res = *state;
+		barrier();
+	} while(state->state_entry_time != state_time);
+
+	preempt_enable();
+}
+
+static void setup_runstate_info(void)
+{
+	struct vcpu_register_runstate_memory_area area;
+
+	area.addr.v = &__get_cpu_var(runstate);
+
+	if (HYPERVISOR_vcpu_op(VCPUOP_register_runstate_memory_area,
+			       smp_processor_id(), &area))
+		BUG();
+
+	get_runstate_snapshot(&__get_cpu_var(runstate_snapshot));
+}
+
+static void do_stolen_accounting(void)
+{
+	struct vcpu_runstate_info state;
+	struct vcpu_runstate_info *snap;
+	u64 blocked, runnable, offline, stolen;
+	cputime_t ticks;
+
+	get_runstate_snapshot(&state);
+
+	WARN_ON(state.state != RUNSTATE_running);
+
+	snap = &__get_cpu_var(runstate_snapshot);
+
+	/* work out how much time the VCPU has not been runn*ing*  */
+	blocked = state.time[RUNSTATE_blocked] - snap->time[RUNSTATE_blocked];
+	runnable = state.time[RUNSTATE_runnable] - snap->time[RUNSTATE_runnable];
+	offline = state.time[RUNSTATE_offline] - snap->time[RUNSTATE_offline];
+
+	*snap = state;
+
+	/* Add the appropriate number of ticks of stolen time,
+	   including any left-overs from last time.  Passing NULL to
+	   account_steal_time accounts the time as stolen. */
+	stolen = runnable + offline + __get_cpu_var(residual_stolen);
+	ticks = 0;
+	while(stolen >= NS_PER_TICK) {
+		ticks++;
+		stolen -= NS_PER_TICK;
+	}
+	__get_cpu_var(residual_stolen) = stolen;
+	account_steal_time(NULL, ticks);
+
+	/* Add the appropriate number of ticks of blocked time,
+	   including any left-overs from last time.  Passing idle to
+	   account_steal_time accounts the time as idle/wait. */
+	blocked += __get_cpu_var(residual_blocked);
+	ticks = 0;
+	while(blocked >= NS_PER_TICK) {
+		ticks++;
+		blocked -= NS_PER_TICK;
+	}
+	__get_cpu_var(residual_blocked) = blocked;
+	account_steal_time(idle_task(smp_processor_id()), ticks);
+}
+
+
+
+/* Get the CPU speed from Xen */
 unsigned long xen_cpu_khz(void)
 {
 	u64 cpu_khz = 1000000ULL << 32;
@@ -264,6 +359,8 @@ static irqreturn_t xen_timerop_timer_int
 		ret = IRQ_HANDLED;
 	}
 
+	do_stolen_accounting();
+
 	return ret;
 }
 
@@ -338,6 +435,8 @@ static irqreturn_t xen_vcpuop_timer_inte
 		ret = IRQ_HANDLED;
 	}
 
+	do_stolen_accounting();
+
 	return ret;
 }
 
@@ -380,6 +479,8 @@ static void xen_setup_timer(int cpu)
 	evt->cpumask = cpumask_of_cpu(cpu);
 	evt->irq = irq;
 	clockevents_register_device(evt);
+
+	setup_runstate_info();
 
 	put_cpu_var(xen_clock_events);
 }

[-- Attachment #3: xen-sched-clock.patch --]
[-- Type: text/x-patch, Size: 2222 bytes --]

Subject: Implement xen_sched_clock

Implement xen_sched_clock, which returns the number of ns the current
vcpu has been actually in the running state (vs blocked,
runnable-but-not-running, or offline) since boot.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: john stultz <johnstul@us.ibm.com>

---
 arch/i386/xen/enlighten.c |    2 +-
 arch/i386/xen/time.c      |   14 ++++++++++++++
 arch/i386/xen/xen-ops.h   |    1 +
 3 files changed, 16 insertions(+), 1 deletion(-)

===================================================================
--- a/arch/i386/xen/enlighten.c
+++ b/arch/i386/xen/enlighten.c
@@ -664,7 +664,7 @@ static const struct paravirt_ops xen_par
 	.set_wallclock = xen_set_wallclock,
 	.get_wallclock = xen_get_wallclock,
 	.get_cpu_khz = xen_cpu_khz,
-	.get_scheduled_cycles = native_read_tsc,
+	.sched_clock = xen_sched_clock,
 
 #ifdef CONFIG_X86_LOCAL_APIC
 	.apic_write = paravirt_nop,
===================================================================
--- a/arch/i386/xen/time.c
+++ b/arch/i386/xen/time.c
@@ -16,6 +16,8 @@
 #define XEN_SHIFT 22
 #define TIMER_SLOP	100000	/* Xen may fire a timer up to this many ns early */
 #define NS_PER_TICK	(1000000000ll / HZ)
+
+static cycle_t xen_clocksource_read(void);
 
 static DEFINE_PER_CPU(struct clock_event_device, xen_clock_events);
 
@@ -120,6 +122,18 @@ static void do_stolen_accounting(void)
 	account_steal_time(idle_task(smp_processor_id()), ticks);
 }
 
+/* Xen sched_clock implementation.  Returns the number of RUNNING ns */
+unsigned long long xen_sched_clock(void)
+{
+	struct vcpu_runstate_info state;
+	cycle_t now = xen_clocksource_read();
+
+	get_runstate_snapshot(&state);
+
+	WARN_ON(state.state != RUNSTATE_running);
+
+	return state.time[RUNSTATE_running] + (now - state.state_entry_time);
+}
 
 
 /* Get the CPU speed from Xen */
===================================================================
--- a/arch/i386/xen/xen-ops.h
+++ b/arch/i386/xen/xen-ops.h
@@ -14,6 +14,7 @@ void __init xen_time_init(void);
 void __init xen_time_init(void);
 unsigned long xen_get_wallclock(void);
 int xen_set_wallclock(unsigned long time);
+unsigned long long xen_sched_clock(void);
 
 void xen_mark_init_mm_pinned(void);
 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: Stolen and degraded time and schedulers
  2007-03-14 19:44                           ` Jeremy Fitzhardinge
@ 2007-03-14 20:33                             ` Daniel Walker
  2007-03-14 21:16                               ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 51+ messages in thread
From: Daniel Walker @ 2007-03-14 20:33 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: john stultz, Andi Kleen, Ingo Molnar, Thomas Gleixner,
	Con Kolivas, Rusty Russell, Zachary Amsden, James Morris,
	Chris Wright, Linux Kernel Mailing List, cpufreq,
	Virtualization Mailing List, Peter Chubb

On Wed, 2007-03-14 at 12:44 -0700, Jeremy Fitzhardinge wrote:
> Daniel Walker wrote:
> > sched_clock is used to bank real time against some specific states
> > inside the scheduler, and no it doesn't _just_ measure a processes
> > executing time.
> >   
> 
> Could you point these places out?  All uses of sched_clock() that I
> could see in kernel/sched.c seemed to be related to working out how
> long
> something spent executing, either in the scheduler proper, or
> benchmarking cache characteristics. 

For interactive tasks (basic scheduling) the execution time, and sleep
time need to be measured. It's also used for some posix cpu timers
(sched_ns) , and it used for migration thread initialization. I'm sure
it's used for a variety of out-of-tree random timing as well..

Daniel


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: Stolen and degraded time and schedulers
  2007-03-14 19:34               ` Jeremy Fitzhardinge
  2007-03-14 19:45                 ` Rik van Riel
  2007-03-14 20:26                 ` Dan Hecht
@ 2007-03-14 20:38                 ` Ingo Molnar
  2007-03-14 20:59                   ` Jeremy Fitzhardinge
  2007-03-15  5:23                 ` Paul Mackerras
  3 siblings, 1 reply; 51+ messages in thread
From: Ingo Molnar @ 2007-03-14 20:38 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Dan Hecht, dwalker, cpufreq, Linux Kernel Mailing List,
	Con Kolivas, Chris Wright, Virtualization Mailing List,
	john stultz, Thomas Gleixner, paulus, schwidefsky, Rik van Riel


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> I added stolen time accounting to xen-pv_ops last night.  For Xen, at 
> least, it wasn't hard to fit into the clockevent infrastructure.  I 
> just update the stolen time accounting for each cpu when it gets a 
> timer tick; they seem to get a tick every couple of seconds even when 
> idle.

touching the 'timer tick' is the wrong approach. 'stolen time' only 
matters to the /scheduler tick/. So extend the hypervisor interface to 
allow the injection of 'virtual' scheduler tick events: via the use of a 
special clockevents device - do not change clockevents itself.

	Ingo

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: Stolen and degraded time and schedulers
  2007-03-14 20:31                   ` Jeremy Fitzhardinge
@ 2007-03-14 20:46                     ` Dan Hecht
  2007-03-14 21:18                       ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 51+ messages in thread
From: Dan Hecht @ 2007-03-14 20:46 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: dwalker, cpufreq, Linux Kernel Mailing List, Con Kolivas,
	Chris Wright, Virtualization Mailing List, john stultz,
	Ingo Molnar, Thomas Gleixner, paulus, schwidefsky, Rik van Riel

On 03/14/2007 01:31 PM, Jeremy Fitzhardinge wrote:
> Dan Hecht wrote:
>> Sounds good.  I don't see this in your patchset you sent yesterday
>> though; did you add it after sending out those patches?
> 
> Yes.
> 
>>   if so, could you forward the new patch?  does it explicitly prevent
>> stolen time from getting accounted as  user/system time or does it
>> just rely on NO_HZ mode sort of happening to work that way (since the
>> one shot timer is skipped ahead for missed ticks)?
> 
> Hm, not sure.  It doesn't care how often it gets called; it just
> accumulates results up to that point, but I'm not sure if the time would
> get double accounted.  Perhaps it doesn't matter when using
> xen_sched_clock().
> 

I think you might be double counting time in some cases.  sched_clock() 
isn't really relevant to stolen time accounting (i.e. cpustat->steal).

I think what you want is to make sure that the sum of the cputime passed 
to all of:

account_user_time
account_system_time
account_steal_time

adds up to the total amount of time that has passed.  I think it is sort 
of working for you (i.e. doesn't always double count stolen ticks) since 
in NO_HZ mode, update_process_time (which calls account_user_time & 
account_system_time) happens to be skipped during periods of stolen time 
due to the hrtimer_forward()'ing of the one shot expiry.

> Did the get_scheduled_time -> sched_clock make sense to you?
> 

The get_scheduled_time change should work fine for vmi.

>     J
> 
> 
> ------------------------------------------------------------------------
> 
> Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
> Cc: john stultz <johnstul@us.ibm.com>
> 
> ---
>  arch/i386/xen/time.c |  101 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 101 insertions(+)
> 
> ===================================================================
> --- a/arch/i386/xen/time.c
> +++ b/arch/i386/xen/time.c
> @@ -2,6 +2,7 @@
>  #include <linux/interrupt.h>
>  #include <linux/clocksource.h>
>  #include <linux/clockchips.h>
> +#include <linux/kernel_stat.h>
>  
>  #include <asm/xen/hypervisor.h>
>  #include <asm/xen/hypercall.h>
> @@ -14,6 +15,7 @@
>  
>  #define XEN_SHIFT 22
>  #define TIMER_SLOP	100000	/* Xen may fire a timer up to this many ns early */
> +#define NS_PER_TICK	(1000000000ll / HZ)
>  
>  static DEFINE_PER_CPU(struct clock_event_device, xen_clock_events);
>  
> @@ -28,6 +30,99 @@ struct shadow_time_info {
>  
>  static DEFINE_PER_CPU(struct shadow_time_info, shadow_time);
>  
> +/* runstate info updated by Xen */
> +static DEFINE_PER_CPU(struct vcpu_runstate_info, runstate);
> +
> +/* snapshots of runstate info */
> +static DEFINE_PER_CPU(struct vcpu_runstate_info, runstate_snapshot);
> +
> +/* unused ns of stolen and blocked time */
> +static DEFINE_PER_CPU(u64, residual_stolen);
> +static DEFINE_PER_CPU(u64, residual_blocked);
> +
> +/*
> +   Runstate accounting
> + */
> +static void get_runstate_snapshot(struct vcpu_runstate_info *res)
> +{
> +	u64 state_time;
> +	struct vcpu_runstate_info *state;
> +
> +	preempt_disable();
> +
> +	state = &__get_cpu_var(runstate);
> +
> +	do {
> +		state_time = state->state_entry_time;
> +		barrier();
> +		*res = *state;
> +		barrier();
> +	} while(state->state_entry_time != state_time);
> +
> +	preempt_enable();
> +}
> +
> +static void setup_runstate_info(void)
> +{
> +	struct vcpu_register_runstate_memory_area area;
> +
> +	area.addr.v = &__get_cpu_var(runstate);
> +
> +	if (HYPERVISOR_vcpu_op(VCPUOP_register_runstate_memory_area,
> +			       smp_processor_id(), &area))
> +		BUG();
> +
> +	get_runstate_snapshot(&__get_cpu_var(runstate_snapshot));
> +}
> +
> +static void do_stolen_accounting(void)
> +{
> +	struct vcpu_runstate_info state;
> +	struct vcpu_runstate_info *snap;
> +	u64 blocked, runnable, offline, stolen;
> +	cputime_t ticks;
> +
> +	get_runstate_snapshot(&state);
> +
> +	WARN_ON(state.state != RUNSTATE_running);
> +
> +	snap = &__get_cpu_var(runstate_snapshot);
> +
> +	/* work out how much time the VCPU has not been runn*ing*  */
> +	blocked = state.time[RUNSTATE_blocked] - snap->time[RUNSTATE_blocked];
> +	runnable = state.time[RUNSTATE_runnable] - snap->time[RUNSTATE_runnable];
> +	offline = state.time[RUNSTATE_offline] - snap->time[RUNSTATE_offline];
> +
> +	*snap = state;
> +
> +	/* Add the appropriate number of ticks of stolen time,
> +	   including any left-overs from last time.  Passing NULL to
> +	   account_steal_time accounts the time as stolen. */
> +	stolen = runnable + offline + __get_cpu_var(residual_stolen);
> +	ticks = 0;
> +	while(stolen >= NS_PER_TICK) {
> +		ticks++;
> +		stolen -= NS_PER_TICK;
> +	}
> +	__get_cpu_var(residual_stolen) = stolen;
> +	account_steal_time(NULL, ticks);
> +
> +	/* Add the appropriate number of ticks of blocked time,
> +	   including any left-overs from last time.  Passing idle to
> +	   account_steal_time accounts the time as idle/wait. */
> +	blocked += __get_cpu_var(residual_blocked);
> +	ticks = 0;
> +	while(blocked >= NS_PER_TICK) {
> +		ticks++;
> +		blocked -= NS_PER_TICK;
> +	}
> +	__get_cpu_var(residual_blocked) = blocked;
> +	account_steal_time(idle_task(smp_processor_id()), ticks);
> +}
> +
> +
> +
> +/* Get the CPU speed from Xen */
>  unsigned long xen_cpu_khz(void)
>  {
>  	u64 cpu_khz = 1000000ULL << 32;
> @@ -264,6 +359,8 @@ static irqreturn_t xen_timerop_timer_int
>  		ret = IRQ_HANDLED;
>  	}
>  
> +	do_stolen_accounting();
> +
>  	return ret;
>  }
>  
> @@ -338,6 +435,8 @@ static irqreturn_t xen_vcpuop_timer_inte
>  		ret = IRQ_HANDLED;
>  	}
>  
> +	do_stolen_accounting();
> +
>  	return ret;
>  }
>  
> @@ -380,6 +479,8 @@ static void xen_setup_timer(int cpu)
>  	evt->cpumask = cpumask_of_cpu(cpu);
>  	evt->irq = irq;
>  	clockevents_register_device(evt);
> +
> +	setup_runstate_info();
>  
>  	put_cpu_var(xen_clock_events);
>  }
> 
> 
> ------------------------------------------------------------------------
> 
> Subject: Implement xen_sched_clock
> 
> Implement xen_sched_clock, which returns the number of ns the current
> vcpu has been actually in the running state (vs blocked,
> runnable-but-not-running, or offline) since boot.
> 
> Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
> Cc: john stultz <johnstul@us.ibm.com>
> 
> ---
>  arch/i386/xen/enlighten.c |    2 +-
>  arch/i386/xen/time.c      |   14 ++++++++++++++
>  arch/i386/xen/xen-ops.h   |    1 +
>  3 files changed, 16 insertions(+), 1 deletion(-)
> 
> ===================================================================
> --- a/arch/i386/xen/enlighten.c
> +++ b/arch/i386/xen/enlighten.c
> @@ -664,7 +664,7 @@ static const struct paravirt_ops xen_par
>  	.set_wallclock = xen_set_wallclock,
>  	.get_wallclock = xen_get_wallclock,
>  	.get_cpu_khz = xen_cpu_khz,
> -	.get_scheduled_cycles = native_read_tsc,
> +	.sched_clock = xen_sched_clock,
>  
>  #ifdef CONFIG_X86_LOCAL_APIC
>  	.apic_write = paravirt_nop,
> ===================================================================
> --- a/arch/i386/xen/time.c
> +++ b/arch/i386/xen/time.c
> @@ -16,6 +16,8 @@
>  #define XEN_SHIFT 22
>  #define TIMER_SLOP	100000	/* Xen may fire a timer up to this many ns early */
>  #define NS_PER_TICK	(1000000000ll / HZ)
> +
> +static cycle_t xen_clocksource_read(void);
>  
>  static DEFINE_PER_CPU(struct clock_event_device, xen_clock_events);
>  
> @@ -120,6 +122,18 @@ static void do_stolen_accounting(void)
>  	account_steal_time(idle_task(smp_processor_id()), ticks);
>  }
>  
> +/* Xen sched_clock implementation.  Returns the number of RUNNING ns */
> +unsigned long long xen_sched_clock(void)
> +{
> +	struct vcpu_runstate_info state;
> +	cycle_t now = xen_clocksource_read();
> +
> +	get_runstate_snapshot(&state);
> +
> +	WARN_ON(state.state != RUNSTATE_running);
> +
> +	return state.time[RUNSTATE_running] + (now - state.state_entry_time);
> +}
>  
>  
>  /* Get the CPU speed from Xen */
> ===================================================================
> --- a/arch/i386/xen/xen-ops.h
> +++ b/arch/i386/xen/xen-ops.h
> @@ -14,6 +14,7 @@ void __init xen_time_init(void);
>  void __init xen_time_init(void);
>  unsigned long xen_get_wallclock(void);
>  int xen_set_wallclock(unsigned long time);
> +unsigned long long xen_sched_clock(void);
>  
>  void xen_mark_init_mm_pinned(void);
>  


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: Stolen and degraded time and schedulers
  2007-03-14 20:38                 ` Ingo Molnar
@ 2007-03-14 20:59                   ` Jeremy Fitzhardinge
  2007-03-16  8:38                     ` Ingo Molnar
  0 siblings, 1 reply; 51+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-14 20:59 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Dan Hecht, dwalker, cpufreq, Linux Kernel Mailing List,
	Con Kolivas, Chris Wright, Virtualization Mailing List,
	john stultz, Thomas Gleixner, paulus, schwidefsky, Rik van Riel

Ingo Molnar wrote:
> touching the 'timer tick' is the wrong approach. 'stolen time' only 
> matters to the /scheduler tick/. So extend the hypervisor interface to 
> allow the injection of 'virtual' scheduler tick events: via the use of a 
> special clockevents device - do not change clockevents itself.

I didn't.  I was using sloppy terminology: I hang the stolen time
accounting off the Xen timer interrupt routine, just so that it gets run
every now and again.

I suppose I could explicitly hook stolen time accounting into the
scheduler, but its not obvious to me that it's necessary.

    J

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: Stolen and degraded time and schedulers
  2007-03-14 20:33                             ` Daniel Walker
@ 2007-03-14 21:16                               ` Jeremy Fitzhardinge
  2007-03-14 21:34                                 ` Daniel Walker
  0 siblings, 1 reply; 51+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-14 21:16 UTC (permalink / raw)
  To: Daniel Walker
  Cc: john stultz, Andi Kleen, Ingo Molnar, Thomas Gleixner,
	Con Kolivas, Rusty Russell, Zachary Amsden, James Morris,
	Chris Wright, Linux Kernel Mailing List, cpufreq,
	Virtualization Mailing List, Peter Chubb

Daniel Walker wrote:
> For interactive tasks (basic scheduling) the execution time, and sleep
> time need to be measured.

Sleep time is interesting.  It doesn't make much sense to talk about
time that was stolen while a process was sleeping (it was either stolen
from another running process, or the VCPU was just plain idle).  Also,
the definition of sched_clock I'm talking about is inherently per-cpu,
and sleeping has nothing to do with any cpu by definition.

So something other than sched_clock should be used to measure sleep
time, but it needs to produce interval measurements which are in the
same units as sched_clock.

>  It's also used for some posix cpu timers
> (sched_ns) , and it used for migration thread initialization.

sched_ns doesn't use it directly except for the case where the process
is currently running.  Anyway, it's compatible with what I'm talking about.

>  I'm sure
> it's used for a variety of out-of-tree random timing as well..
>   

Yeah, well...

    J

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: Stolen and degraded time and schedulers
  2007-03-14 20:46                     ` Dan Hecht
@ 2007-03-14 21:18                       ` Jeremy Fitzhardinge
  2007-03-15 19:09                         ` Dan Hecht
  0 siblings, 1 reply; 51+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-14 21:18 UTC (permalink / raw)
  To: Dan Hecht
  Cc: dwalker, cpufreq, Linux Kernel Mailing List, Con Kolivas,
	Chris Wright, Virtualization Mailing List, john stultz,
	Ingo Molnar, Thomas Gleixner, paulus, schwidefsky, Rik van Riel

Dan Hecht wrote:
> On 03/14/2007 01:31 PM, Jeremy Fitzhardinge wrote:
>> Dan Hecht wrote:
>>> Sounds good.  I don't see this in your patchset you sent yesterday
>>> though; did you add it after sending out those patches?
>>
>> Yes.
>>
>>>   if so, could you forward the new patch?  does it explicitly prevent
>>> stolen time from getting accounted as  user/system time or does it
>>> just rely on NO_HZ mode sort of happening to work that way (since the
>>> one shot timer is skipped ahead for missed ticks)?
>>
>> Hm, not sure.  It doesn't care how often it gets called; it just
>> accumulates results up to that point, but I'm not sure if the time would
>> get double accounted.  Perhaps it doesn't matter when using
>> xen_sched_clock().
>>
>
> I think you might be double counting time in some cases. 
> sched_clock() isn't really relevant to stolen time accounting (i.e.
> cpustat->steal).
>
> I think what you want is to make sure that the sum of the cputime
> passed to all of:
>
> account_user_time
> account_system_time
> account_steal_time
>
> adds up to the total amount of time that has passed.  I think it is
> sort of working for you (i.e. doesn't always double count stolen
> ticks) since in NO_HZ mode, update_process_time (which calls
> account_user_time & account_system_time) happens to be skipped during
> periods of stolen time due to the hrtimer_forward()'ing of the one
> shot expiry. 

OK, this will need a closer look.

BTW, what are the properties of the vmi read_available_cycles()?  Is
that a per-cpu timer?  If its used as the timebase for sched_clock, how
does recalc_task_prio not get a -ve sleep time?


    J

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: Stolen and degraded time and schedulers
  2007-03-14 21:16                               ` Jeremy Fitzhardinge
@ 2007-03-14 21:34                                 ` Daniel Walker
  2007-03-14 21:42                                   ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 51+ messages in thread
From: Daniel Walker @ 2007-03-14 21:34 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: john stultz, Andi Kleen, Ingo Molnar, Thomas Gleixner,
	Con Kolivas, Rusty Russell, Zachary Amsden, James Morris,
	Chris Wright, Linux Kernel Mailing List, cpufreq,
	Virtualization Mailing List, Peter Chubb

On Wed, 2007-03-14 at 14:16 -0700, Jeremy Fitzhardinge wrote:

> 
> >  It's also used for some posix cpu timers
> > (sched_ns) , and it used for migration thread initialization.
> 
> sched_ns doesn't use it directly except for the case where the process
> is currently running.  Anyway, it's compatible with what I'm talking about.

It's used for measuring execution time, but timers are triggered based
on that time, so it needs to be actual execution time. I don't know to
what extent this is already inaccurate on some system tho.

Daniel


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: Stolen and degraded time and schedulers
  2007-03-13 16:31 Stolen and degraded time and schedulers Jeremy Fitzhardinge
  2007-03-13 20:12 ` john stultz
@ 2007-03-14 21:36 ` Con Kolivas
  2007-03-14 21:38   ` Jeremy Fitzhardinge
  2007-03-14 21:40   ` Con Kolivas
  1 sibling, 2 replies; 51+ messages in thread
From: Con Kolivas @ 2007-03-14 21:36 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Andi Kleen, Ingo Molnar, Thomas Gleixner, Rusty Russell,
	Zachary Amsden, James Morris, john stultz, Chris Wright,
	Linux Kernel Mailing List, cpufreq, Virtualization Mailing List

On Wednesday 14 March 2007 03:31, Jeremy Fitzhardinge wrote:
> The current Linux scheduler makes one big assumption: that 1ms of CPU
> time is the same as any other 1ms of CPU time, and that therefore a
> process makes the same amount of progress regardless of which particular
> ms of time it gets.
>
> This assumption is wrong now, and will become more wrong as
> virtualization gets more widely used.
>
> It's wrong now, because it fails to take into account of several kinds
> of missing time:
>
>    1. interrupts - time spent in an ISR is accounted to the current
>       process, even though it gets no direct benefit
>    2. SMM - time is completely lost from the kernel
>    3. slow CPUs - 1ms of 600MHz CPU is less useful than 1ms of 2.4GHz CPU
>
> The first two - time lost to interrupts - are a well known problem, and
> are generally considered to be a non issue.  If you're losing a
> significant amount of time to interrupts, you probably have bigger
> problems.  (Or maybe not?)
>
> The third is not something I've seen discussed before, but it seems like
> it could be a significant problem today.  Certainly, I've noticed it
> myself: an interactive program decides to do something CPU-intensive
> (like start an animation), and it chugs until the conservative governor
> brings the CPU up to speed.  Certainly some of this is because its just
> plain CPU-starved, but I think another factor is that it gets penalized
> for running on a slow CPU: 1ms is not 1ms.  And for power reasons you
> want to encourage processes to run on slow CPUs rather than penalize them.
>
> Virtualization just exacerbates this.  If you have a busy machine
> running multiple virtual CPUs, then each VCPU may only get a small
> proportion of the total amount of available CPU time.  If the kernel's
> scheduler asserts that "you were just scheduled for 1ms, therefore you
> made 1ms of progress", then many timeslices will effectively end up
> being 1ms of 0Mhz CPU - because the VCPU wasn't scheduled and the real
> CPU was doing something else.
>
>
> So how to deal with this?  Basically we need a clock which measures "CPU
> work units", and have the scheduler use this clock.
>
> A "CPU work unit" clock has these properties:
>
>     * inherently per-CPU (from the kernel's perspective, so it would be
>       per-VCPU in a virtual machine)
>     * monotonic - you can't do negative work
>     * measured in "work units"
>
> A "work unit" is probably most simply expressed in cycles - you assume a
> cycle of CPU time is equivalent in terms of work done to any other
> cycle.  This means that 1 cycle at 600MHz is equivalent to 1 cycle at
> 2.4GHz - but of course the 2.4GHz processor gets 4 times as many in any
> real time interval.  (This is the instance where the worst kind of tsc -
> varying speed which stops on idle - is actually exactly what you want.)
>
> You could also measure "work units" in terms of normalized time units:
> if the fastest CPU on the machine is 2.4GHz, then 1ms is 1ms a work unit
> on that CPU, but 250us on the 600MHz CPU.
>
> It doesn't really matter what the unit is, so long as it is used
> consistently to measure how much progress all processes made.

I think you're looking for a complex solution to a problem that doesn't exist. 
The job of the process scheduler is to meter out the available cpu resources. 
It cannot make up cycles for a slow cpu or one that is throttled. If the 
problem is happening due to throttling it should be fixed by altering the 
throttle. The example you describe with the conservative governor is as easy 
to fix as changing to the ondemand governor. Differential power cpus on an 
SMP machine should be managed by SMP balancing choices based on power groups.

It would be fine to implement some other accounting of this definition of time 
for other purposes but not for process scheduler decisions per se.

Sorry to chime in late.  My physical condition prevents me spending any 
extended period of time at the computer so I've tried to be succinct with my 
comments and may not be able to reply again.

-- 
-ck

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: Stolen and degraded time and schedulers
  2007-03-14 21:36 ` Con Kolivas
@ 2007-03-14 21:38   ` Jeremy Fitzhardinge
  2007-03-14 21:40   ` Con Kolivas
  1 sibling, 0 replies; 51+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-14 21:38 UTC (permalink / raw)
  To: Con Kolivas
  Cc: Andi Kleen, Ingo Molnar, Thomas Gleixner, Rusty Russell,
	Zachary Amsden, James Morris, john stultz, Chris Wright,
	Linux Kernel Mailing List, cpufreq, Virtualization Mailing List

Con Kolivas wrote:
> I think you're looking for a complex solution to a problem that doesn't exist. 
>   

The problem is subtle, but I think the solution is actually fairly simple.

> The job of the process scheduler is to meter out the available cpu resources. 
> It cannot make up cycles for a slow cpu or one that is throttled.

No, it can't schedule more cpu time than exists.  But it should account
for the time the processes actually had available to them, rather than
assuming they got the full power of the cpu.

>  If the 
> problem is happening due to throttling it should be fixed by altering the 
> throttle. The example you describe with the conservative governor is as easy 
> to fix as changing to the ondemand governor.

That's one workaround, but sometimes its desirable to keep even
cpu-bound processes at a lower cpu performance level for power-saving
reasons.  Modern CPUs are designed to switch performance states very
quickly, so its conceivable you could change performance every context
switch, though no governor currently uses that fine granularity.

>  Differential power cpus on an 
> SMP machine should be managed by SMP balancing choices based on power groups.
>   

Do you mean compute power or energy power here?

> It would be fine to implement some other accounting of this definition of time 
> for other purposes but not for process scheduler decisions per se.
>   

I suppose, but it seems to me that they're pretty much the same thing.

> Sorry to chime in late.  My physical condition prevents me spending any 
> extended period of time at the computer so I've tried to be succinct with my 
> comments and may not be able to reply again.
>   

My sympathies; neck problems are bad news.

    J

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: Stolen and degraded time and schedulers
  2007-03-14 21:36 ` Con Kolivas
  2007-03-14 21:38   ` Jeremy Fitzhardinge
@ 2007-03-14 21:40   ` Con Kolivas
  1 sibling, 0 replies; 51+ messages in thread
From: Con Kolivas @ 2007-03-14 21:40 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Andi Kleen, Ingo Molnar, Thomas Gleixner, Rusty Russell,
	Zachary Amsden, James Morris, john stultz, Chris Wright,
	Linux Kernel Mailing List, cpufreq, Virtualization Mailing List

On Thursday 15 March 2007 08:36, Con Kolivas wrote:
> On Wednesday 14 March 2007 03:31, Jeremy Fitzhardinge wrote:
> > The current Linux scheduler makes one big assumption: that 1ms of CPU
> > time is the same as any other 1ms of CPU time, and that therefore a
> > process makes the same amount of progress regardless of which particular
> > ms of time it gets.
> >
> > This assumption is wrong now, and will become more wrong as
> > virtualization gets more widely used.
> >
> > It's wrong now, because it fails to take into account of several kinds
> > of missing time:
> >
> >    1. interrupts - time spent in an ISR is accounted to the current
> >       process, even though it gets no direct benefit
> >    2. SMM - time is completely lost from the kernel
> >    3. slow CPUs - 1ms of 600MHz CPU is less useful than 1ms of 2.4GHz CPU
> >
> > The first two - time lost to interrupts - are a well known problem, and
> > are generally considered to be a non issue.  If you're losing a
> > significant amount of time to interrupts, you probably have bigger
> > problems.  (Or maybe not?)
> >
> > The third is not something I've seen discussed before, but it seems like
> > it could be a significant problem today.  Certainly, I've noticed it
> > myself: an interactive program decides to do something CPU-intensive
> > (like start an animation), and it chugs until the conservative governor
> > brings the CPU up to speed.  Certainly some of this is because its just
> > plain CPU-starved, but I think another factor is that it gets penalized
> > for running on a slow CPU: 1ms is not 1ms.  And for power reasons you
> > want to encourage processes to run on slow CPUs rather than penalize
> > them.
> >
> > Virtualization just exacerbates this.  If you have a busy machine
> > running multiple virtual CPUs, then each VCPU may only get a small
> > proportion of the total amount of available CPU time.  If the kernel's
> > scheduler asserts that "you were just scheduled for 1ms, therefore you
> > made 1ms of progress", then many timeslices will effectively end up
> > being 1ms of 0Mhz CPU - because the VCPU wasn't scheduled and the real
> > CPU was doing something else.
> >
> >
> > So how to deal with this?  Basically we need a clock which measures "CPU
> > work units", and have the scheduler use this clock.
> >
> > A "CPU work unit" clock has these properties:
> >
> >     * inherently per-CPU (from the kernel's perspective, so it would be
> >       per-VCPU in a virtual machine)
> >     * monotonic - you can't do negative work
> >     * measured in "work units"
> >
> > A "work unit" is probably most simply expressed in cycles - you assume a
> > cycle of CPU time is equivalent in terms of work done to any other
> > cycle.  This means that 1 cycle at 600MHz is equivalent to 1 cycle at
> > 2.4GHz - but of course the 2.4GHz processor gets 4 times as many in any
> > real time interval.  (This is the instance where the worst kind of tsc -
> > varying speed which stops on idle - is actually exactly what you want.)
> >
> > You could also measure "work units" in terms of normalized time units:
> > if the fastest CPU on the machine is 2.4GHz, then 1ms is 1ms a work unit
> > on that CPU, but 250us on the 600MHz CPU.
> >
> > It doesn't really matter what the unit is, so long as it is used
> > consistently to measure how much progress all processes made.
>
> I think you're looking for a complex solution to a problem that doesn't
> exist. The job of the process scheduler is to meter out the available cpu
> resources. It cannot make up cycles for a slow cpu or one that is
> throttled. If the problem is happening due to throttling it should be fixed
> by altering the throttle. The example you describe with the conservative
> governor is as easy to fix as changing to the ondemand governor.
> Differential power cpus on an SMP machine should be managed by SMP
> balancing choices based on power groups.
>
> It would be fine to implement some other accounting of this definition of
> time for other purposes

I mean such as for virtualisation purposes.

> but not for process scheduler decisions per se. 

>
> Sorry to chime in late.  My physical condition prevents me spending any
> extended period of time at the computer so I've tried to be succinct with
> my comments and may not be able to reply again.

-- 
-ck

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: Stolen and degraded time and schedulers
  2007-03-14 21:34                                 ` Daniel Walker
@ 2007-03-14 21:42                                   ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 51+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-14 21:42 UTC (permalink / raw)
  To: Daniel Walker
  Cc: john stultz, Andi Kleen, Ingo Molnar, Thomas Gleixner,
	Con Kolivas, Rusty Russell, Zachary Amsden, James Morris,
	Chris Wright, Linux Kernel Mailing List, cpufreq,
	Virtualization Mailing List, Peter Chubb

Daniel Walker wrote:
> It's used for measuring execution time, but timers are triggered based
> on that time, so it needs to be actual execution time. I don't know to
> what extent this is already inaccurate on some system tho.
>   

Well, "actual execution time" is a bit ambiguous: should that be "time
actually spent executing", or "time we should have spent executing"?

It looks like cpu_clock_sample() will only return accurate results on
yourself; if you get the sched_ns on a thread on another cpu, it won't
include the time accumulated since the start of its timeslice.

    J


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: Stolen and degraded time and schedulers
  2007-03-14 19:34               ` Jeremy Fitzhardinge
                                   ` (2 preceding siblings ...)
  2007-03-14 20:38                 ` Ingo Molnar
@ 2007-03-15  5:23                 ` Paul Mackerras
  2007-03-15 19:33                   ` Jeremy Fitzhardinge
  3 siblings, 1 reply; 51+ messages in thread
From: Paul Mackerras @ 2007-03-15  5:23 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Dan Hecht, dwalker, cpufreq, Linux Kernel Mailing List,
	Con Kolivas, Chris Wright, Virtualization Mailing List,
	john stultz, Ingo Molnar, Thomas Gleixner, schwidefsky,
	Rik van Riel

Jeremy Fitzhardinge writes:

> Sure.  But on a given machine, the CPUs are likely to be closely enough
> matched that a cycle on one CPU is more or less equivalent to a cycle on
> another CPU.  The fact that a cycle represents a different amount of

A cycle on one thread of a machine with SMT/hyperthreading when the
other thread is idle *isn't* equivalent to a cycle when the other
thread is busy.  We run into this on POWER5, where we have hardware
that counts cycles when each of the two threads in each core gets to
dispatch instructions (on each cycle, one thread or the other gets to
dispatch).  That helps but still doesn't give a totally accurate
estimate of how much computation a given process has managed to do.

> Not at all.  You might have an unimportant but cpu-bound process which
> doesn't merit increasing the cpu speed, but should also be scheduled
> properly compared to other processes.  I often nice my kernel builds
> (which cpufreq takes as a hint to not ramp up the cpu speed) on my
> laptop so to save power.

Just as a side note - that's probably actually a bad strategy; you
almost certainly consume less total energy by running the cpu at full
speed until the build is done and then going to the deepest sleep mode
you can achieve.

> That's true.  But this is a case of the left brain not talking to the
> right brain: cpufreq might decide to slow a cpu down, but the scheduler
> doesn't take that into account.  Making the timebase of sched_clock
> reflect the current cpu speed (or more specifically, the integral of the
> cpu speed over a time interval) is a good way of communicating between
> the two subsystems.

What was the original proposal?  I came into this discussion late...

Paul.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: Stolen and degraded time and schedulers
  2007-03-14 21:18                       ` Jeremy Fitzhardinge
@ 2007-03-15 19:09                         ` Dan Hecht
  2007-03-15 19:18                           ` Jeremy Fitzhardinge
                                             ` (2 more replies)
  0 siblings, 3 replies; 51+ messages in thread
From: Dan Hecht @ 2007-03-15 19:09 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: dwalker, cpufreq, Linux Kernel Mailing List, Con Kolivas,
	Chris Wright, Virtualization Mailing List, john stultz,
	Ingo Molnar, Thomas Gleixner, paulus, schwidefsky, Rik van Riel,
	Zachary Amsden

On 03/14/2007 02:18 PM, Jeremy Fitzhardinge wrote:
> Dan Hecht wrote:
>> On 03/14/2007 01:31 PM, Jeremy Fitzhardinge wrote:
>>> Dan Hecht wrote:
>>>> Sounds good.  I don't see this in your patchset you sent yesterday
>>>> though; did you add it after sending out those patches?
>>> Yes.
>>>
>>>>   if so, could you forward the new patch?  does it explicitly prevent
>>>> stolen time from getting accounted as  user/system time or does it
>>>> just rely on NO_HZ mode sort of happening to work that way (since the
>>>> one shot timer is skipped ahead for missed ticks)?
>>> Hm, not sure.  It doesn't care how often it gets called; it just
>>> accumulates results up to that point, but I'm not sure if the time would
>>> get double accounted.  Perhaps it doesn't matter when using
>>> xen_sched_clock().
>>>
>> I think you might be double counting time in some cases. 
>> sched_clock() isn't really relevant to stolen time accounting (i.e.
>> cpustat->steal).
>>
>> I think what you want is to make sure that the sum of the cputime
>> passed to all of:
>>
>> account_user_time
>> account_system_time
>> account_steal_time
>>
>> adds up to the total amount of time that has passed.  I think it is
>> sort of working for you (i.e. doesn't always double count stolen
>> ticks) since in NO_HZ mode, update_process_time (which calls
>> account_user_time & account_system_time) happens to be skipped during
>> periods of stolen time due to the hrtimer_forward()'ing of the one
>> shot expiry. 
> 
> OK, this will need a closer look.
> 
> BTW, what are the properties of the vmi read_available_cycles()?  Is
> that a per-cpu timer?  If its used as the timebase for sched_clock, how
> does recalc_task_prio not get a -ve sleep time?
> 

Available time is defined to be (real_time - stolen_time).  i.e. time in 
which the vcpu is either running or not ready to run [because it is 
halted, and nothing is pending]).

So, yes, it is per-vcpu.  But, the sched_clock() samples are rebased 
when processes are migrated between runqueues; search sched.c for 
most_recent_timestamp.  It's not perfect since most_recent_timestamp 
between cpu0 and cpu1 do not correspond to the exact same instant, but 
does prevent negative sleep time and is fairly close.

Dan

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: Stolen and degraded time and schedulers
  2007-03-15 19:09                         ` Dan Hecht
@ 2007-03-15 19:18                           ` Jeremy Fitzhardinge
  2007-03-15 19:48                           ` Rik van Riel
  2007-03-15 19:53                           ` Jeremy Fitzhardinge
  2 siblings, 0 replies; 51+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-15 19:18 UTC (permalink / raw)
  To: Dan Hecht
  Cc: dwalker, cpufreq, Linux Kernel Mailing List, Con Kolivas,
	Chris Wright, Virtualization Mailing List, john stultz,
	Ingo Molnar, Thomas Gleixner, paulus, schwidefsky, Rik van Riel,
	Zachary Amsden

Dan Hecht wrote:
> So, yes, it is per-vcpu.  But, the sched_clock() samples are rebased
> when processes are migrated between runqueues; search sched.c for
> most_recent_timestamp.  It's not perfect since most_recent_timestamp
> between cpu0 and cpu1 do not correspond to the exact same instant, but
> does prevent negative sleep time and is fairly close.

Yes, I noticed that when I looked more carefully, but I wasn't sure
whether it would be sufficient to make it all work out.

    J

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: Stolen and degraded time and schedulers
  2007-03-15  5:23                 ` Paul Mackerras
@ 2007-03-15 19:33                   ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 51+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-15 19:33 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Dan Hecht, dwalker, cpufreq, Linux Kernel Mailing List,
	Con Kolivas, Chris Wright, Virtualization Mailing List,
	john stultz, Ingo Molnar, Thomas Gleixner, schwidefsky,
	Rik van Riel

Paul Mackerras wrote:
> A cycle on one thread of a machine with SMT/hyperthreading when the
> other thread is idle *isn't* equivalent to a cycle when the other
> thread is busy.  We run into this on POWER5, where we have hardware
> that counts cycles when each of the two threads in each core gets to
> dispatch instructions (on each cycle, one thread or the other gets to
> dispatch).  That helps but still doesn't give a totally accurate
> estimate of how much computation a given process has managed to do.
>   

Yes, but it doesn't need to be 100% accurate to be useful; it just needs
to better characterize the amount of work done.  You could get a better
approximation by using two scaling factors: work done with other thread
idle, and work done when other thread busy.

>> I often nice my kernel builds
>> (which cpufreq takes as a hint to not ramp up the cpu speed) on my
>> laptop so to save power.
>>     
>
> Just as a side note - that's probably actually a bad strategy; you
> almost certainly consume less total energy by running the cpu at full
> speed until the build is done and then going to the deepest sleep mode
> you can achieve.
>   

It seems to me that a 5min build at 1/4 power is better than running for
2.5min at full power - voltage scaling gives you n^2 power use scaling,
remember.  Not that I've measured it or anything.

> What was the original proposal?  I came into this discussion late...
>   

My core proposal is basically that sched_clock() should try to return a
time which scales with the amount of work done by a CPU rather than
measure real time.  This helps solve two problems:

    * it accounts for time stolen by a hypervisor, since a stolen CPU
      does no work
    * it accounts for cpus running a lower operating points, since they
      do less work per unit time

You could also use it to take into account time stolen by interrupts,
SMM, thermal limiting and so on.  The idea is that processes shouldn't
get penalized for CPU time that they had no opportunity to use.

This is almost completely compatible with how sched_clock() is currently
used, except that the scheduler also uses it to measure sleeping time. 
This doesn't make much sense in my proposal because sched_clock() is an
inherently per-CPU time measure, and sleeping doesn't involve any CPU by
definition.  It also doesn't make much sense to say that a process slept
for less time simply because it wasn't using the CPU while it was being
stolen/running slowly.

Despite this, it works better than expected because the current
scheduler adjusts the sched_clock-derived process timestamps as they
move between runqueues, so they never get too far out of whack.

I've implemented sched_clock to only count non-stolen CPU nanoseconds in
the Xen-paravirt_ops implementation; we'll see how it works out.

    J

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: Stolen and degraded time and schedulers
  2007-03-15 19:09                         ` Dan Hecht
  2007-03-15 19:18                           ` Jeremy Fitzhardinge
@ 2007-03-15 19:48                           ` Rik van Riel
  2007-03-15 19:53                           ` Jeremy Fitzhardinge
  2 siblings, 0 replies; 51+ messages in thread
From: Rik van Riel @ 2007-03-15 19:48 UTC (permalink / raw)
  To: Dan Hecht
  Cc: Jeremy Fitzhardinge, dwalker, cpufreq, Linux Kernel Mailing List,
	Con Kolivas, Chris Wright, Virtualization Mailing List,
	john stultz, Ingo Molnar, Thomas Gleixner, paulus, schwidefsky,
	Zachary Amsden

Dan Hecht wrote:

> Available time is defined to be (real_time - stolen_time).  i.e. time in 
> which the vcpu is either running or not ready to run [because it is 
> halted, and nothing is pending]).

 From the guest perspective, steal time is:

"Time I would have liked to run"

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: Stolen and degraded time and schedulers
  2007-03-15 19:09                         ` Dan Hecht
  2007-03-15 19:18                           ` Jeremy Fitzhardinge
  2007-03-15 19:48                           ` Rik van Riel
@ 2007-03-15 19:53                           ` Jeremy Fitzhardinge
  2007-03-15 20:07                             ` Dan Hecht
  2 siblings, 1 reply; 51+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-15 19:53 UTC (permalink / raw)
  To: Dan Hecht
  Cc: dwalker, cpufreq, Linux Kernel Mailing List, Con Kolivas,
	Chris Wright, Virtualization Mailing List, john stultz,
	Ingo Molnar, Thomas Gleixner, paulus, schwidefsky, Rik van Riel,
	Zachary Amsden

Dan Hecht wrote:
> Available time is defined to be (real_time - stolen_time).  i.e. time
> in which the vcpu is either running or not ready to run [because it is
> halted, and nothing is pending]).

Hm, the Xen definition of stolen time is "time VCPU spent in runnable
(vs running) or offline state".  If the VCPU was blocked anyway, then
its never considered to be stolen.  Offline means the VCPU was paused by
the administrator, or during suspend/resume.

    J

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: Stolen and degraded time and schedulers
  2007-03-15 19:53                           ` Jeremy Fitzhardinge
@ 2007-03-15 20:07                             ` Dan Hecht
  2007-03-15 20:14                               ` Rik van Riel
  0 siblings, 1 reply; 51+ messages in thread
From: Dan Hecht @ 2007-03-15 20:07 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: dwalker, cpufreq, Linux Kernel Mailing List, Con Kolivas,
	Chris Wright, Virtualization Mailing List, john stultz,
	Ingo Molnar, Thomas Gleixner, paulus, schwidefsky, Rik van Riel,
	Zachary Amsden

On 03/15/2007 12:53 PM, Jeremy Fitzhardinge wrote:
> Dan Hecht wrote:
>> Available time is defined to be (real_time - stolen_time).  i.e. time
>> in which the vcpu is either running or not ready to run [because it is
>> halted, and nothing is pending]).
> 
> Hm, the Xen definition of stolen time is "time VCPU spent in runnable
> (vs running) or offline state".  If the VCPU was blocked anyway, then
> its never considered to be stolen.  Offline means the VCPU was paused by
> the administrator, or during suspend/resume.
> 
>

Yes, the part in the "i.e." above is describing available time.  So, it 
is essentially is the same definition of stolen time VMI uses:

stolen time     == ready to run but not running
available time  == running or not ready to run

Basically, a vcpu starts off running (and this time is counted as 
available time).  Eventually, it will either be preempted by the 
hypervisor and descheduled (then this time becomes stolen since the vcpu 
is still ready), or it will decide to halt (then the time remains 
accounted towards available time).  Once the vcpu is halted, then 
eventually something happens and we should deliver a virtual interrupt 
to the vcpu.  The time between that something happening (e.g. host I/O 
completing, an alarm expiring) and the vcpu actually starting to run 
again is accounted as stolen.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: Stolen and degraded time and schedulers
  2007-03-15 20:07                             ` Dan Hecht
@ 2007-03-15 20:14                               ` Rik van Riel
  2007-03-15 20:35                                 ` Dan Hecht
  0 siblings, 1 reply; 51+ messages in thread
From: Rik van Riel @ 2007-03-15 20:14 UTC (permalink / raw)
  To: Dan Hecht
  Cc: Jeremy Fitzhardinge, dwalker, cpufreq, Linux Kernel Mailing List,
	Con Kolivas, Chris Wright, Virtualization Mailing List,
	john stultz, Ingo Molnar, Thomas Gleixner, paulus, schwidefsky,
	Zachary Amsden

Dan Hecht wrote:

> Yes, the part in the "i.e." above is describing available time.  So, it 
> is essentially is the same definition of stolen time VMI uses:

> stolen time     == ready to run but not running
> available time  == running or not ready to run

S390 too.  We were quite careful to make sure that steal time
means the same on the different platforms when the code was
introduced.

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: Stolen and degraded time and schedulers
  2007-03-15 20:14                               ` Rik van Riel
@ 2007-03-15 20:35                                 ` Dan Hecht
  2007-03-16  8:59                                   ` Martin Schwidefsky
  0 siblings, 1 reply; 51+ messages in thread
From: Dan Hecht @ 2007-03-15 20:35 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Jeremy Fitzhardinge, dwalker, cpufreq, Linux Kernel Mailing List,
	Con Kolivas, Chris Wright, Virtualization Mailing List,
	john stultz, Ingo Molnar, Thomas Gleixner, paulus, schwidefsky,
	Zachary Amsden

On 03/15/2007 01:14 PM, Rik van Riel wrote:
> Dan Hecht wrote:
> 
>> Yes, the part in the "i.e." above is describing available time.  So, 
>> it is essentially is the same definition of stolen time VMI uses:
> 
>> stolen time     == ready to run but not running
>> available time  == running or not ready to run
> 
> S390 too.  We were quite careful to make sure that steal time
> means the same on the different platforms when the code was
> introduced.
> 

The S390 folks should correct me if I'm mistaken, but I think S390 works 
a bit differently.  I don't think their "steal clock" will differentiate 
between idle time and stolen time (since it's implemented as a hardware 
clock that counts the time a particular vcpu context is executing on the 
pcpu).  So they need the kernel to differentiate between really stolen 
time and just idle time.  At least, I assume this is why 
account_steal_time() can then sometimes account steal time towards idle, 
and looking at arch/s390/kernel/vtime.c seems to indicate this.

In the Xen and VMI case, the hypervisor differentiates between stolen 
and idle time, which is why we use the hack to call into 
account_steal_time with NULL tsk (so that all of steal gets accounted to 
stolen, even if the idle task happened to be current).  This allows us 
to account stolen time that happened on the tail end of an idle period.




^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: Stolen and degraded time and schedulers
  2007-03-14 20:59                   ` Jeremy Fitzhardinge
@ 2007-03-16  8:38                     ` Ingo Molnar
  2007-03-16 16:53                       ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 51+ messages in thread
From: Ingo Molnar @ 2007-03-16  8:38 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Dan Hecht, dwalker, cpufreq, Linux Kernel Mailing List,
	Con Kolivas, Chris Wright, Virtualization Mailing List,
	john stultz, Thomas Gleixner, paulus, schwidefsky, Rik van Riel


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Ingo Molnar wrote:
> > touching the 'timer tick' is the wrong approach. 'stolen time' only 
> > matters to the /scheduler tick/. So extend the hypervisor interface to 
> > allow the injection of 'virtual' scheduler tick events: via the use of a 
> > special clockevents device - do not change clockevents itself.
> 
> I didn't.  I was using sloppy terminology: I hang the stolen time 
> accounting off the Xen timer interrupt routine, just so that it gets 
> run every now and again.

i dont understand: how are you separating 'stolen time' drifts from 
events generated for absolute timeouts?

> I suppose I could explicitly hook stolen time accounting into the 
> scheduler, but its not obvious to me that it's necessary.

right now i dont see any clean way to solve this problem without having 
two clockevents drivers: one for the scheduler, one for timer events.

	Ingo

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: Stolen and degraded time and schedulers
  2007-03-15 20:35                                 ` Dan Hecht
@ 2007-03-16  8:59                                   ` Martin Schwidefsky
  0 siblings, 0 replies; 51+ messages in thread
From: Martin Schwidefsky @ 2007-03-16  8:59 UTC (permalink / raw)
  To: Dan Hecht
  Cc: Rik van Riel, Jeremy Fitzhardinge, dwalker, cpufreq,
	Linux Kernel Mailing List, Con Kolivas, Chris Wright,
	Virtualization Mailing List, john stultz, Ingo Molnar,
	Thomas Gleixner, paulus, Zachary Amsden

On Thu, 2007-03-15 at 13:35 -0700, Dan Hecht wrote:
> >> Yes, the part in the "i.e." above is describing available time.  So, 
> >> it is essentially is the same definition of stolen time VMI uses:
> > 
> >> stolen time     == ready to run but not running
> >> available time  == running or not ready to run
> > 
> > S390 too.  We were quite careful to make sure that steal time
> > means the same on the different platforms when the code was
> > introduced.
> > 
> 
> The S390 folks should correct me if I'm mistaken, but I think S390 works 
> a bit differently.  I don't think their "steal clock" will differentiate 
> between idle time and stolen time (since it's implemented as a hardware 
> clock that counts the time a particular vcpu context is executing on the 
> pcpu).  So they need the kernel to differentiate between really stolen 
> time and just idle time.  At least, I assume this is why 
> account_steal_time() can then sometimes account steal time towards idle, 
> and looking at arch/s390/kernel/vtime.c seems to indicate this.
> idle period.

For s390 we have: stolen time == wanted to run but the hypervisor didn't
let us. The way this is implemented is by using the cpu timer. This is a
per-cpu register that is fully virtualized. It runs at the same rate as
the clock, but only if the virtual cpu is scheduled to run. If the real
cpu falls out of the guest context the guest cpu timer just stops. The
wall clock (TOD) keeps ticking. The calculation to find the amount of
stolen time is now simple: TOD clock - guest cpu timer.
For idle there is a little pitfall. If the guest cpu is a dedicated cpu
under LPAR loading a wait psw does not cause the guest cpu fall out of
the guest context. The guest cpu timer will continue ticking. In this
case the time spent in idle is accounted via system_time. If the guest
cpu is a shared cpu then loading a wait psw will cause the cpu to fall
out of guest context and the guest cpu timer will be stopped. In this
case the idle time will be accounted via steal_time. 

-- 
blue skies,              IBM Deutschland Entwicklung GmbH
   Martin                Vorsitzender des Aufsichtsrats: Johann Weihen
                         Geschäftsführung: Herbert Kircher
Martin Schwidefsky       Sitz der Gesellschaft: Böblingen
Linux on zSeries         Registergericht: Amtsgericht Stuttgart,
   Development           HRB 243294

"Reality continues to ruin my life." - Calvin.



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: Stolen and degraded time and schedulers
  2007-03-16  8:38                     ` Ingo Molnar
@ 2007-03-16 16:53                       ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 51+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-16 16:53 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Dan Hecht, dwalker, cpufreq, Linux Kernel Mailing List,
	Con Kolivas, Chris Wright, Virtualization Mailing List,
	john stultz, Thomas Gleixner, paulus, schwidefsky, Rik van Riel

Ingo Molnar wrote:
> i dont understand: how are you separating 'stolen time' drifts from 
> events generated for absolute timeouts?
>   

I'm not sure what you're asking; I think we're talking past each other.

I can extract from Xen how much time was stolen over some real-time
interval.  If I call do_stolen_accounting() at any two arbitrary points,
it will compute the amount of time stolen in the interval.  So that
means I can call it from time to time to update stolen time.  Of course,
this accounting will be out of date for a while, but it doesn't matter
too much.  I call it from the timer interrupt, since it will be called
occasionally on idle CPUs and often on busy CPUs, which is what we want.

Oh, its worth pointing out that stolen time is accounted against CPUs
rather than individual processes, so it doesn't need to be related to
process scheduling.

Also, none of this is in the patch set I posted; I've implemented it
since, and it will be in the next batch.

    J

^ permalink raw reply	[flat|nested] 51+ messages in thread

end of thread, other threads:[~2007-03-16 16:53 UTC | newest]

Thread overview: 51+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-03-13 16:31 Stolen and degraded time and schedulers Jeremy Fitzhardinge
2007-03-13 20:12 ` john stultz
2007-03-13 20:32   ` Jeremy Fitzhardinge
2007-03-13 21:27     ` Daniel Walker
2007-03-13 21:59       ` Jeremy Fitzhardinge
2007-03-14  0:43         ` Dan Hecht
2007-03-14  4:37           ` Jeremy Fitzhardinge
2007-03-14 13:58             ` Lennart Sorensen
2007-03-14 15:08               ` Jeremy Fitzhardinge
2007-03-14 15:12                 ` Lennart Sorensen
2007-03-14 19:02             ` Dan Hecht
2007-03-14 19:34               ` Jeremy Fitzhardinge
2007-03-14 19:45                 ` Rik van Riel
2007-03-14 19:47                   ` Jeremy Fitzhardinge
2007-03-14 20:02                     ` Rik van Riel
2007-03-14 20:26                 ` Dan Hecht
2007-03-14 20:31                   ` Jeremy Fitzhardinge
2007-03-14 20:46                     ` Dan Hecht
2007-03-14 21:18                       ` Jeremy Fitzhardinge
2007-03-15 19:09                         ` Dan Hecht
2007-03-15 19:18                           ` Jeremy Fitzhardinge
2007-03-15 19:48                           ` Rik van Riel
2007-03-15 19:53                           ` Jeremy Fitzhardinge
2007-03-15 20:07                             ` Dan Hecht
2007-03-15 20:14                               ` Rik van Riel
2007-03-15 20:35                                 ` Dan Hecht
2007-03-16  8:59                                   ` Martin Schwidefsky
2007-03-14 20:38                 ` Ingo Molnar
2007-03-14 20:59                   ` Jeremy Fitzhardinge
2007-03-16  8:38                     ` Ingo Molnar
2007-03-16 16:53                       ` Jeremy Fitzhardinge
2007-03-15  5:23                 ` Paul Mackerras
2007-03-15 19:33                   ` Jeremy Fitzhardinge
2007-03-14  2:00         ` Daniel Walker
2007-03-14  6:52           ` Jeremy Fitzhardinge
2007-03-14  8:20             ` Zan Lynx
2007-03-14 16:11             ` Daniel Walker
2007-03-14 16:37               ` Jeremy Fitzhardinge
2007-03-14 16:59                 ` Daniel Walker
2007-03-14 17:08                   ` Jeremy Fitzhardinge
2007-03-14 18:06                     ` Daniel Walker
2007-03-14 18:41                       ` Jeremy Fitzhardinge
2007-03-14 19:00                         ` Daniel Walker
2007-03-14 19:44                           ` Jeremy Fitzhardinge
2007-03-14 20:33                             ` Daniel Walker
2007-03-14 21:16                               ` Jeremy Fitzhardinge
2007-03-14 21:34                                 ` Daniel Walker
2007-03-14 21:42                                   ` Jeremy Fitzhardinge
2007-03-14 21:36 ` Con Kolivas
2007-03-14 21:38   ` Jeremy Fitzhardinge
2007-03-14 21:40   ` Con Kolivas

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).