* [RFC patch] CFS fix place entity spread issue (v2)
@ 2010-04-18 13:13 Mathieu Desnoyers
  2010-04-18 20:21 ` Linus Torvalds
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Mathieu Desnoyers @ 2010-04-18 13:13 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Mike Galbraith
  Cc: Andrew Morton, Linus Torvalds, Greg Kroah-Hartman,
	Steven Rostedt, Jarkko Nikula, Tony Lindgren, linux-kernel

CFS fix place entity spread issue (v2)

Huge CFS vruntime spread (18 minutes) has been observed with LTTng while simply
running Xorg on a uniprocessor machine. Detailed explanation in my ELC2010
presentation at:

http://www.efficios.com/elc2010

(includes slides, ad-hoc CFS instrumentation patches and wakeup latency test
program)

Changelog since v1:

* Keep max(se->vruntime, vruntime). It's needed to ensure that tasks already
  within the spread window do not get an extra unfair vruntime boost just
  because they slept.

I've torn the CFS scheduler apart in the past days to figure out what is causing
this weird behavior, and the culprit seems to be place_entity(). The problem
appears to be the cumulative effect of letting the min_vruntime go backward when
putting sleepers back on the runqueue. It lets the vruntime spread grow to
"entertaining" values (it is supposed to be in the 5ms range, not 18 minutes!).

In the original code, a max between the sched entity vruntime and the calculated
vruntime was supposed to ensure that a thread's vruntime never goes backward. But I
don't see why we even care about that. The key point is that the min_vruntime
of the runqueue should not go backward.

I propose to fix this by calculating the relative offset from
min_vruntime + sysctl_sched_latency rather than directly from min_vruntime. I
also ensure that the value never goes below min_vruntime.

Under the Xorg workload, moving a few windows around and starting firefox while
executing the wakeup-latency.c program (a program waking up every 10ms and
reporting wakeup latency), this patch brings the worst latency from 60ms down to
12ms. Even doing a kernel compilation at the same time, the worst latency stays
around 20ms now.

I'm submitting this patch ASAP, since it seems to fix CFS issues that many
people have been complaining about. I'm sending it as RFC because testing its
effect on more workloads would be welcome.

I can see that place_entity() has stayed more or less the same since 2.6.24 (and
maybe even before, as the code was merely reorganised between 2.6.23 and 2.6.24),
so we can expect this to be a problem people have been experiencing for a while.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Ingo Molnar <mingo@elte.hu>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Mike Galbraith <efault@gmx.de>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Greg Kroah-Hartman <greg@kroah.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Jarkko Nikula <jhnikula@gmail.com>
CC: Tony Lindgren <tony@atomide.com>
---
 kernel/sched_fair.c |   11 +++++++++++
 1 file changed, 11 insertions(+)

Index: linux-2.6-lttng.git/kernel/sched_fair.c
===================================================================
--- linux-2.6-lttng.git.orig/kernel/sched_fair.c	2010-04-18 01:48:04.000000000 -0400
+++ linux-2.6-lttng.git/kernel/sched_fair.c	2010-04-18 08:58:30.000000000 -0400
@@ -738,6 +738,14 @@
 		unsigned long thresh = sysctl_sched_latency;
 
 		/*
+		 * Place the woken up task relative to
+		 * min_vruntime + sysctl_sched_latency.
+		 * We must _never_ decrement min_vruntime, because the effect is
+		 * that spread increases progressively under the Xorg workload.
+		 */
+		vruntime += sysctl_sched_latency;
+
+		/*
 		 * Convert the sleeper threshold into virtual time.
 		 * SCHED_IDLE is a special sub-class.  We care about
 		 * fairness only relative to other SCHED_IDLE tasks,
@@ -755,6 +763,9 @@
 			thresh >>= 1;
 
 		vruntime -= thresh;
+
+		/* ensure min_vruntime never go backwards. */
+		vruntime = max_t(u64, vruntime, cfs_rq->min_vruntime);
 	}
 
 	/* ensure we never gain time by being placed backwards. */
-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC patch] CFS fix place entity spread issue (v2)
  2010-04-18 13:13 [RFC patch] CFS fix place entity spread issue (v2) Mathieu Desnoyers
@ 2010-04-18 20:21 ` Linus Torvalds
  2010-04-18 22:59   ` Mathieu Desnoyers
  2010-04-19  9:25 ` Peter Zijlstra
  2010-04-19 14:43 ` Peter Zijlstra
  2 siblings, 1 reply; 8+ messages in thread
From: Linus Torvalds @ 2010-04-18 20:21 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Ingo Molnar, Peter Zijlstra, Mike Galbraith, Andrew Morton,
	Greg Kroah-Hartman, Steven Rostedt, Jarkko Nikula, Tony Lindgren,
	linux-kernel



On Sun, 18 Apr 2010, Mathieu Desnoyers wrote:
>
> CFS fix place entity spread issue (v2)
> 
> Huge CFS vruntime spread (18 minutes) has been observed with LTTng while simply
> running Xorg on a uniprocessor machine. Detailed explanation in my ELC2010
> presentation at:

Hmm. I tested this patch with my favourite non-scientific desktop load 
test: do web browsing while doing a kernel compile with "make -j16" (after 
doing a "git clean -dqfx" and "ccache -C" to get rid of old object files 
and ccache).

This is on a dual-core (with SMT, so 4 threads) Core i5, so "make -j16" 
overcommits the CPU's quite a bit.

And quite frankly, I think your patch makes things much worse. I don't 
have numbers, but it felt much choppier and slower to do scrolling in 
firefox or moving windows around while the load average is 20+.

My testload is in no way objective, nor necessarily any good, but it's the 
one I happen to use, and while it's subjective, I think the difference was 
pretty clear and not good.

			Linus


* Re: [RFC patch] CFS fix place entity spread issue (v2)
  2010-04-18 20:21 ` Linus Torvalds
@ 2010-04-18 22:59   ` Mathieu Desnoyers
  2010-04-18 23:23     ` Linus Torvalds
  0 siblings, 1 reply; 8+ messages in thread
From: Mathieu Desnoyers @ 2010-04-18 22:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Peter Zijlstra, Mike Galbraith, Andrew Morton,
	Greg Kroah-Hartman, Steven Rostedt, Jarkko Nikula, Tony Lindgren,
	linux-kernel

* Linus Torvalds (torvalds@linux-foundation.org) wrote:
> 
> 
> On Sun, 18 Apr 2010, Mathieu Desnoyers wrote:
> >
> > CFS fix place entity spread issue (v2)
> > 
> > Huge CFS vruntime spread (18 minutes) has been observed with LTTng while simply
> > running Xorg on a uniprocessor machine. Detailed explanation in my ELC2010
> > presentation at:
> 
> Hmm. I tested this patch with my favourite non-scientific desktop load 
> test: do web browsing while doing a kernel compile with "make -j16" (after 
> doing a "git clean -dqfx" and "ccache -C" to get rid of old object files 
> and ccache).
> 
> This is on a dual-core (with SMT, so 4 threads) Core i5, so "make -j16" 
> overcommits the CPU's quite a bit.
> 
> And quite frankly, I think your patch makes things much worse. I don't 
> have numbers, but it felt much choppier and slower to do scrolling in 
> firefox or moving windows around while the load average is 20+. 
> 
> My testload is in no way objective, nor necessarily any good, but it's the 
> one I happen to use, and while it's subjective, I think the difference was 
> pretty clear and not good.
> 
> 			Linus

Degradation of this particular workload is expected, because the nature of the
large CFS spread issue favors Xorg: it gets literally minutes of available
vruntime, while most other threads (e.g. kernel build, audio playing, etc.)
suffer from this.

I'll try to play a bit with this (I got plenty of time to kill at the airport
today, missed my connecting flight back from San Francisco). Ideally, we might
want to try to find a way to ensure that place_entity() is kind to sleepers, but
without putting them before min_vruntime. I don't have any SMP machine with Xorg
handy, but I'll try to push my UP laptop a bit more heavily with a kernel build.

If we are lucky enough, we might just have to twist a knob or two to make Xorg
more responsive.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


* Re: [RFC patch] CFS fix place entity spread issue (v2)
  2010-04-18 22:59   ` Mathieu Desnoyers
@ 2010-04-18 23:23     ` Linus Torvalds
  0 siblings, 0 replies; 8+ messages in thread
From: Linus Torvalds @ 2010-04-18 23:23 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Ingo Molnar, Peter Zijlstra, Mike Galbraith, Andrew Morton,
	Greg Kroah-Hartman, Steven Rostedt, Jarkko Nikula, Tony Lindgren,
	linux-kernel



On Sun, 18 Apr 2010, Mathieu Desnoyers wrote:
> 
> Degradation of this particular workload is expected, because the nature of the
> large CFS spread issue favors Xorg: it gets literally minutes of available
> vruntime, while most other threads (e.g. kernel build, audio playing, etc.)
> suffer from this.

In that case, I think your patch is fundamentally incorrect.

I care a _whole_ lot more about "usable desktop" than I care about some 
inane "available vruntime". If the current behavior breaks your vruntime 
rules, then the current behaviour is clearly _correct_.

			Linus


* Re: [RFC patch] CFS fix place entity spread issue (v2)
  2010-04-18 13:13 [RFC patch] CFS fix place entity spread issue (v2) Mathieu Desnoyers
  2010-04-18 20:21 ` Linus Torvalds
@ 2010-04-19  9:25 ` Peter Zijlstra
  2010-04-19 14:06   ` Mathieu Desnoyers
  2010-04-19 14:43 ` Peter Zijlstra
  2 siblings, 1 reply; 8+ messages in thread
From: Peter Zijlstra @ 2010-04-19  9:25 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Ingo Molnar, Mike Galbraith, Andrew Morton, Linus Torvalds,
	Greg Kroah-Hartman, Steven Rostedt, Jarkko Nikula, Tony Lindgren,
	linux-kernel

On Sun, 2010-04-18 at 09:13 -0400, Mathieu Desnoyers wrote:

> Detailed explanation in my ELC2010 presentation at:
> 
> http://www.efficios.com/elc2010

So what happened to comprehensive changelogs?

> I've torn the CFS scheduler apart in the past days to figure out what is causing
> this weird behavior, and the culprit seems to be place_entity(). The problem
> appears to be the cumulative effect of letting the min_vruntime go backward when
> putting sleepers back on the runqueue.

Right, that should not happen, ever, min_vruntime should be monotonic.
Your explanation fails to mention how exactly this happens.

>  It lets the vruntime spread grow to
> "entertaining" values (it is supposed to be in the 5ms range, not 18 minutes!).
> 
> In the original code, a max between the sched entity vruntime and the calculated
> vruntime was supposed to ensure that a thread's vruntime never goes backward. But I
> don't see why we even care about that.

Fairness: a task's vruntime going backwards means it can obtain more
cputime than a while(1) loop would get, which is not a good thing.

>  The key point is that the min_vruntime
> of the runqueue should not go backward.

That is certainly important too.

> I propose to fix this by calculating the relative offset from
> min_vruntime + sysctl_sched_latency rather than directly from min_vruntime. I
> also ensure that the value never goes below min_vruntime.

I'm not sure that's a good idea, that'll end up being a 'fairer'
scheduler in the strict sense, but I suspect it will break the wakeup
preemption model we have.

> Under the Xorg workload, moving a few windows around and starting firefox while
> executing the wakeup-latency.c program (a program waking up every 10ms and
> reporting wakeup latency), this patch brings the worst latency from 60ms down to
> 12ms. Even doing a kernel compilation at the same time, the worst latency stays
> around 20ms now.

That seems nice enough. But really, this changelog doesn't explain the
cause at all.

/me goes try and figure out where this min_vruntime goes backwards.

So update_min_vruntime() is the sole place where we change min_vruntime;
it is called from update_curr(), where we add runtime to the task we're
currently servicing, and from dequeue_entity(), where we remove someone from
the tree.

These two sites are the only places where min_vruntime can change
because only adding service to the leftmost task, or removal thereof can
make the min_vruntime go forward.

So update_min_vruntime():

 - takes the smaller vruntime between the current task and the leftmost
tree entry (current is not in the tree, and we can pass beyond the
leftmost waiting task due to limited preemption and minimum service
windows).

 - set min_vruntime to the larger between the existing min_vruntime and
the smallest vruntime.

So colour me confused if I'm not seeing min_vruntime going backwards...


* Re: [RFC patch] CFS fix place entity spread issue (v2)
  2010-04-19  9:25 ` Peter Zijlstra
@ 2010-04-19 14:06   ` Mathieu Desnoyers
  2010-04-19 14:43     ` Peter Zijlstra
  0 siblings, 1 reply; 8+ messages in thread
From: Mathieu Desnoyers @ 2010-04-19 14:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Mike Galbraith, Andrew Morton, Linus Torvalds,
	Greg Kroah-Hartman, Steven Rostedt, Jarkko Nikula, Tony Lindgren,
	linux-kernel

* Peter Zijlstra (peterz@infradead.org) wrote:
> On Sun, 2010-04-18 at 09:13 -0400, Mathieu Desnoyers wrote:
> 
> > Detailed explanation in my ELC2010 presentation at:
> > 
> > http://www.efficios.com/elc2010
> 
> So what happened to comprehensive changelogs?
> 
> > I've torn the CFS scheduler apart in the past days to figure out what is causing
> > this weird behavior, and the culprit seems to be place_entity(). The problem
> > appears to be the cumulative effect of letting the min_vruntime go backward when
> > putting sleepers back on the runqueue.
> 
> Right, that should not happen, ever, min_vruntime should be monotonic.
> Your explanation fails to mention how exactly this happens.
> 
> >  It lets the vruntime spread grow to
> > "entertaining" values (it is supposed to be in the 5ms range, not 18 minutes!).
> > 
> > In the original code, a max between the sched entity vruntime and the calculated
> > vruntime was supposed to ensure that a thread's vruntime never goes backward. But I
> > don't see why we even care about that.
> 
> Fairness: a task's vruntime going backwards means it can obtain more
> cputime than a while(1) loop would get, which is not a good thing.
> 
> >  The key point is that the min_vruntime
> > of the runqueue should not go backward.
> 
> That is certainly important too.
> 
> > I propose to fix this by calculating the relative offset from
> > min_vruntime + sysctl_sched_latency rather than directly from min_vruntime. I
> > also ensure that the value never goes below min_vruntime.
> 
> I'm not sure that's a good idea, that'll end up being a 'fairer'
> scheduler in the strict sense, but I suspect it will break the wakeup
> preemption model we have.
> 
> > Under the Xorg workload, moving a few windows around and starting firefox while
> > executing the wakeup-latency.c program (a program waking up every 10ms and
> > reporting wakeup latency), this patch brings the worst latency from 60ms down to
> > 12ms. Even doing a kernel compilation at the same time, the worst latency stays
> > around 20ms now.
> 
> That seems nice enough. But really, this changelog doesn't explain the
> cause at all.
> 
> /me goes try and figure out where this min_vruntime goes backwards.
> 
> So update_min_vruntime() is the sole place where we change min_vruntime;
> it is called from update_curr(), where we add runtime to the task we're
> currently servicing, and from dequeue_entity(), where we remove someone from
> the tree.
> 
> These two sites are the only places where min_vruntime can change
> because only adding service to the leftmost task, or removal thereof can
> make the min_vruntime go forward.
> 
> So update_min_vruntime():
> 
>  - takes the smaller vruntime between the current task and the leftmost
> tree entry (current is not in the tree, and we can pass beyond the
> leftmost waiting task due to limited preemption and minimum service
> windows).
> 
>  - set min_vruntime to the larger between the existing min_vruntime and
> the smallest vruntime.
> 
> So colour me confused if I'm not seeing min_vruntime going backwards...

place_entity(), when placing a sleeper on the runqueue, decrements its vruntime
(derived from min_vruntime) by thresh (derived from a load calculation). So
if we end up with a woken task that does not consume enough vruntime to pass
the previous min_vruntime value, we effectively decrement min_vruntime.

The other thing I am currently looking into is that, without even considering
the question of going lower than min_vruntime, a woken-up sleeper gets an extra
share of vruntime (again because of place_entity), so it might always run
between -thresh and the previous min_vruntime, "stalling" min_vruntime at low
values while other threads push the maximum vruntime higher, thereby increasing
the spread. I'm currently doing some tests trying to cope with this by consuming
woken-up sleepers' vruntime more quickly for a vruntime duration of "thresh".

I am also testing alternative ways to deal with the min_vruntime-going-backward
problem, such as keeping a "floor_vruntime" that ensures place_entity() can
never decrement min_vruntime below the floor value of its previous decrement.

Sorry for the missing changelog info, I was between planes this weekend.
Hopefully my sleep-deprived explanation makes sense. ;)

Thanks,

Mathieu


-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


* Re: [RFC patch] CFS fix place entity spread issue (v2)
  2010-04-18 13:13 [RFC patch] CFS fix place entity spread issue (v2) Mathieu Desnoyers
  2010-04-18 20:21 ` Linus Torvalds
  2010-04-19  9:25 ` Peter Zijlstra
@ 2010-04-19 14:43 ` Peter Zijlstra
  2 siblings, 0 replies; 8+ messages in thread
From: Peter Zijlstra @ 2010-04-19 14:43 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Ingo Molnar, Mike Galbraith, Andrew Morton, Linus Torvalds,
	Greg Kroah-Hartman, Steven Rostedt, Jarkko Nikula, Tony Lindgren,
	linux-kernel

On Sun, 2010-04-18 at 09:13 -0400, Mathieu Desnoyers wrote:

OK, so looking purely at the patch:

> Index: linux-2.6-lttng.git/kernel/sched_fair.c
> ===================================================================
> --- linux-2.6-lttng.git.orig/kernel/sched_fair.c        2010-04-18 01:48:04.000000000 -0400
> +++ linux-2.6-lttng.git/kernel/sched_fair.c     2010-04-18 08:58:30.000000000 -0400
> @@ -738,6 +738,14 @@
>                 unsigned long thresh = sysctl_sched_latency;
>  
>                 /*
> +                * Place the woken up task relative to
> +                * min_vruntime + sysctl_sched_latency.
> +                * We must _never_ decrement min_vruntime, because the effect is

Nobody I could find decrements min_vruntime, and certainly
place_entity() doesn't change min_vruntime. So this is a totally
misguided comment.

> +                * that spread increases progressively under the Xorg workload.
> +                */
> +               vruntime += sysctl_sched_latency;

So in effect you change: 
  vruntime = max(vruntime, min_vruntime - thresh/2)
into
  vruntime = max(vruntime, min_vruntime + thresh/2)

in a non-obvious way and for an unclear reason.

> +               /*
>                  * Convert the sleeper threshold into virtual time.
>                  * SCHED_IDLE is a special sub-class.  We care about
>                  * fairness only relative to other SCHED_IDLE tasks,
> @@ -755,6 +763,9 @@
>                         thresh >>= 1;
>  
>                 vruntime -= thresh;
> +
> +               /* ensure min_vruntime never go backwards. */
> +               vruntime = max_t(u64, vruntime, cfs_rq->min_vruntime);

So the comment doesn't match the code, nor is it correct.

The code tries to implement clipping vruntime to min_vruntime, not
clipping min_vruntime, but then botches it by not taking wrap-around
into account.

Now, I know why your patch helps you (its in effect similar to what
START_DEBIT does for fork()), but getting the wakeup-preemption to do
something nice along with it is the hard part.

The whole perfectly-fair scheduling thing is more-or-less doable
(dealing with tasks dying with !0-lag gets interesting; you'd have to
start adjusting global-timeline-like things for that). But the thing is
that it makes for rather poor interactive behaviour.

Letting a task that sleeps long and runs short preempt heavier tasks
generally works well. Also, there are a number of apps that get a nice
boost from being preempted before they can actually block on a
(read-like) system call. That saves a whole scheduler round-trip on the
wakeup side, so ping-pong like tasks love this too.

And then there is the whole signal delivery muck..



* Re: [RFC patch] CFS fix place entity spread issue (v2)
  2010-04-19 14:06   ` Mathieu Desnoyers
@ 2010-04-19 14:43     ` Peter Zijlstra
  0 siblings, 0 replies; 8+ messages in thread
From: Peter Zijlstra @ 2010-04-19 14:43 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Ingo Molnar, Mike Galbraith, Andrew Morton, Linus Torvalds,
	Greg Kroah-Hartman, Steven Rostedt, Jarkko Nikula, Tony Lindgren,
	linux-kernel

On Mon, 2010-04-19 at 10:06 -0400, Mathieu Desnoyers wrote:

> place_entity(), when placing a sleeper on the runqueue, decrements its vruntime
> (derived from min_vruntime) by thresh (derived from a load calculation). So
> if we end up with a woken task that does not consume enough vruntime to pass
> the previous min_vruntime value, we effectively decrement min_vruntime.

No we don't, the next time it comes along it will at worst use the exact
same min_vruntime, but not a lower one.

> The other thing I am currently looking into is that, without even considering
> the question of going lower than min_vruntime, a woken-up sleeper gets an extra
> share of vruntime (again because of place_entity), so it might always run
> between -thresh and the previous min_vruntime, "stalling" min_vruntime at low
> values while other threads push the maximum vruntime higher, thereby increasing
> the spread. I'm currently doing some tests trying to cope with this by consuming
> woken-up sleepers' vruntime more quickly for a vruntime duration of "thresh".

But these things will break the wakeup-preemption, you cannot change
place_entity() without also considering
check_preempt_wakeup()->wakeup_preempt_entity().

> I am also testing alternatives ways to deal with the min_vruntime going backward

It doesn't go backwards!! Show me where min_vruntime actually decreases?
Sure, we can have entities that are left of min_vruntime, but that doesn't
mean min_vruntime decreases!

> problem, by keeping a "floor_vruntime", which ensures that place_entity() can
> never decrement min_vruntime below the floor value of its previous decrement.
> 
> Sorry for the missing changelog info, I was between planes this weekend.
> Hopefully my sleep-deprived explanation makes sense. ;)

None-what-so-ever ;-)


Now, theory says we should use the 0-lag point for task insertion, and I
have a patch that tracks that point, but it adds extra multiplications
to all the fast-paths, and I never really managed to make 0-lag
insertion work well with wakeup-preemption.


I've never really liked the sleeper-fairness stuff from a theoretical pov,
but it does provide a lot of the 'interactive' behaviour people like (otoh it
also causes some over-scheduling for some loads, and, as you've seen,
it rips the spread apart).


I also tried an avg_runtime based wakeup-preemption, where a task that
on average runs shorter than the currently running task will preempt
(within the wakeup_gran limits), but that also didn't work out --
although it did work for that one latency-tester app from Jens (IIRC)
that started all this.

---
Subject: sched: avg_vruntime
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Fri, 24 Oct 2008 11:06:17 +0200

Renicing requires scaling the lag. Therefore we need a way to compute it.
Lag is defined as the difference between the service time received from the
ideal model and the actual scheduler.

The defining property of a fair scheduler is that the sum of all lags is zero;
which can be seen is trivially true for the ideal case, as all lags are zero.

Therefore, the average of all virtual runtimes will be the point of zero lag.

We cannot prove fairness for CFS due to sleeper fairness (without it we can).
However, since we can observe that it does converge to fairness in stable
operation, we can say the zero-lag point converges to the average.

We can't just take the average of vruntime - as it will use the full range
of its u64 and will wrap around. Instead we'll use the average of
(vruntime - min_vruntime)

\Sum_{i=1}^{n} 1/n (v_{i} - v) = 1/n ((\Sum_{i=1}^{n} v_{i}) - n v)

By factoring out the 1/n (never storing that) we avoid rounding, which
would bring an accumulating error.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20081024091109.832347739@chello.nl>
---
 kernel/sched.c       |    4 +++
 kernel/sched_debug.c |    3 ++
 kernel/sched_fair.c  |   60 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 66 insertions(+), 1 deletion(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -349,6 +349,10 @@ struct cfs_rq {
 	struct load_weight load;
 	unsigned long nr_running;
 
+	long nr_queued;
+	long avg_load;
+	s64 avg_vruntime;
+
 	u64 exec_clock;
 	u64 min_vruntime;
 
Index: linux-2.6/kernel/sched_debug.c
===================================================================
--- linux-2.6.orig/kernel/sched_debug.c
+++ linux-2.6/kernel/sched_debug.c
@@ -202,6 +202,9 @@ void print_cfs_rq(struct seq_file *m, in
 			SPLIT_NS(spread0));
 	SEQ_printf(m, "  .%-30s: %ld\n", "nr_running", cfs_rq->nr_running);
 	SEQ_printf(m, "  .%-30s: %ld\n", "load", cfs_rq->load.weight);
+	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "avg_vruntime",
+			SPLIT_NS(avg_vruntime(cfs_rq)));
+
 
 	SEQ_printf(m, "  .%-30s: %d\n", "nr_spread_over",
 			cfs_rq->nr_spread_over);
Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -301,6 +301,60 @@ static inline s64 entity_key(struct cfs_
 	return se->vruntime - cfs_rq->min_vruntime;
 }
 
+static void
+avg_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	s64 key = entity_key(cfs_rq, se);
+	cfs_rq->avg_load += se->load.weight;
+	cfs_rq->avg_vruntime += key * se->load.weight;
+	cfs_rq->nr_queued++;
+}
+
+static void
+avg_vruntime_sub(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	s64 key = entity_key(cfs_rq, se);
+	cfs_rq->avg_load -= se->load.weight;
+	cfs_rq->avg_vruntime -= key * se->load.weight;
+	cfs_rq->nr_queued--;
+}
+
+static inline
+void avg_vruntime_update(struct cfs_rq *cfs_rq, s64 delta)
+{
+	cfs_rq->avg_vruntime -= cfs_rq->nr_queued * cfs_rq->avg_load * delta;
+}
+
+static u64 avg_vruntime(struct cfs_rq *cfs_rq)
+{
+	s64 avg = cfs_rq->avg_vruntime;
+	long nr_queued = cfs_rq->nr_queued;
+	long load = cfs_rq->avg_load;
+
+	if (cfs_rq->curr) {
+		nr_queued++;
+		avg += entity_key(cfs_rq, cfs_rq->curr);
+		load += cfs_rq->curr->load.weight;
+	}
+
+	if (nr_queued)
+		avg = div_s64(avg, nr_queued);
+
+	return cfs_rq->min_vruntime + avg;
+}
+
+static void __update_min_vruntime(struct cfs_rq *cfs_rq, u64 vruntime)
+{
+	/*
+	 * open coded max_vruntime() to allow updating avg_vruntime
+	 */
+	s64 delta = (s64)(vruntime - cfs_rq->min_vruntime);
+	if (delta > 0) {
+		avg_vruntime_update(cfs_rq, delta);
+		cfs_rq->min_vruntime = vruntime;
+	}
+}
+
 static void update_min_vruntime(struct cfs_rq *cfs_rq)
 {
 	u64 vruntime = cfs_rq->min_vruntime;
@@ -319,7 +373,7 @@ static void update_min_vruntime(struct c
 			vruntime = min_vruntime(vruntime, se->vruntime);
 	}
 
-	cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime, vruntime);
+	__update_min_vruntime(cfs_rq, vruntime);
 }
 
 /*
@@ -333,6 +387,8 @@ static void __enqueue_entity(struct cfs_
 	s64 key = entity_key(cfs_rq, se);
 	int leftmost = 1;
 
+	avg_vruntime_add(cfs_rq, se);
+
 	/*
 	 * Find the right place in the rbtree:
 	 */
@@ -372,6 +428,8 @@ static void __dequeue_entity(struct cfs_
 	}
 
 	rb_erase(&se->run_node, &cfs_rq->tasks_timeline);
+
+	avg_vruntime_sub(cfs_rq, se);
 }
 
 static struct sched_entity *__pick_next_entity(struct cfs_rq *cfs_rq)




