* [PATCH 0/8] scheduler patches
@ 2008-10-24  9:06 Peter Zijlstra
  2008-10-24  9:06 ` [PATCH 1/8] sched: fix a find_busiest_group buglet Peter Zijlstra
                   ` (8 more replies)
  0 siblings, 9 replies; 16+ messages in thread
From: Peter Zijlstra @ 2008-10-24  9:06 UTC (permalink / raw)
  To: linux-kernel; +Cc: mingo, efault, vatsa, Peter Zijlstra


1-3: definite .28 patches
4-5: wanna-be .28 patches
6-8: .29 work




* [PATCH 1/8] sched: fix a find_busiest_group buglet
  2008-10-24  9:06 [PATCH 0/8] scheduler patches Peter Zijlstra
@ 2008-10-24  9:06 ` Peter Zijlstra
  2008-10-24  9:06 ` [PATCH 2/8] sched: more accurate min_vruntime accounting Peter Zijlstra
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2008-10-24  9:06 UTC (permalink / raw)
  To: linux-kernel; +Cc: mingo, efault, vatsa, Peter Zijlstra

[-- Attachment #1: sched-fbg-buglet.patch --]
[-- Type: text/plain, Size: 1208 bytes --]

In one of the group load balancer patches:

	commit 408ed066b11cf9ee4536573b4269ee3613bd735e
	Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
	Date:   Fri Jun 27 13:41:28 2008 +0200
	Subject: sched: hierarchical load vs find_busiest_group

The following change:

-               if (max_load - this_load + SCHED_LOAD_SCALE_FUZZ >=
+               if (max_load - this_load + 2*busiest_load_per_task >=
                                        busiest_load_per_task * imbn) {

made the condition always true, because imbn is in [1, 2].
Therefore, remove the 2* and give it a fair chance.
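
For illustration, a minimal user-space sketch (not kernel code; the load
numbers are made up) showing that the old test cannot fail when
max_load >= this_load and imbn is at most 2, while the fixed test can:

#include <stdio.h>

/* illustrative only: the loads are arbitrary numbers, not real rq statistics */
static void check(unsigned long max_load, unsigned long this_load,
                  unsigned long busiest_load_per_task, unsigned long imbn)
{
        int old_test = max_load - this_load + 2*busiest_load_per_task >=
                                busiest_load_per_task * imbn;
        int new_test = max_load - this_load + busiest_load_per_task >=
                                busiest_load_per_task * imbn;

        printf("max=%lu this=%lu per_task=%lu imbn=%lu -> old=%d new=%d\n",
               max_load, this_load, busiest_load_per_task, imbn,
               old_test, new_test);
}

int main(void)
{
        /* old test: 2*per_task >= imbn*per_task for imbn in {1, 2}, so never false */
        check(3000, 2900, 1000, 2);     /* tiny imbalance: the new test now fails */
        check(5000, 2000, 1000, 2);     /* big imbalance: both tests still pass   */
        check(3000, 2900, 1000, 1);
        return 0;
}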

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -3343,7 +3343,7 @@ small_imbalance:
 		} else
 			this_load_per_task = cpu_avg_load_per_task(this_cpu);
 
-		if (max_load - this_load + 2*busiest_load_per_task >=
+		if (max_load - this_load + busiest_load_per_task >=
 					busiest_load_per_task * imbn) {
 			*imbalance = busiest_load_per_task;
 			return busiest;

-- 



* [PATCH 2/8] sched: more accurate min_vruntime accounting
  2008-10-24  9:06 [PATCH 0/8] scheduler patches Peter Zijlstra
  2008-10-24  9:06 ` [PATCH 1/8] sched: fix a find_busiest_group buglet Peter Zijlstra
@ 2008-10-24  9:06 ` Peter Zijlstra
  2008-10-24  9:06 ` [PATCH 3/8] sched: weaken sync hint Peter Zijlstra
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2008-10-24  9:06 UTC (permalink / raw)
  To: linux-kernel; +Cc: mingo, efault, vatsa, Peter Zijlstra

[-- Attachment #1: sched-min_vruntime-fiddle.patch --]
[-- Type: text/plain, Size: 3983 bytes --]

Mike noticed the current min_vruntime tracking can go wrong and skip the
current task. If the only remaining task in the tree is a nice 19 task
with huge vruntime, new tasks will also be inserted too far to the right,
causing some interactivity issues.

min_vruntime can only change due to the leftmost entry disappearing
(dequeue_entity()), or due to the leftmost entry being incremented past the
next entry, which elects a new leftmost (__update_curr()).

Due to the current entry not being part of the actual tree, we have to
compare the leftmost tree entry with the current entry, and take the
leftmost of these two.

So create an update_min_vruntime() function that computes the leftmost
vruntime in the system (either the tree's leftmost entry or the current
entity) and increases cfs_rq->min_vruntime if the computed value is larger
than the previously found min_vruntime. Call this from the two sites we've
identified that can change min_vruntime.
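
As an illustration, a minimal user-space sketch of the intended behaviour
(simplified types, no rbtree; have_curr/have_left and the vruntime values
stand in for cfs_rq state and are made up):

#include <stdio.h>
#include <stdint.h>

typedef uint64_t u64;
typedef int64_t s64;

/* signed comparisons so the tests survive u64 wrap-around */
static u64 max_vruntime(u64 a, u64 b) { return (s64)(a - b) > 0 ? a : b; }
static u64 min_vruntime(u64 a, u64 b) { return (s64)(a - b) < 0 ? a : b; }

/*
 * Mirror of the new update_min_vruntime() logic: take the leftmost of
 * {current, leftmost tree entry} and only ever move min_vruntime forward.
 * have_curr/have_left stand in for cfs_rq->curr and cfs_rq->rb_leftmost.
 */
static u64 update_min_vruntime(u64 min_vr, int have_curr, u64 curr_vr,
                               int have_left, u64 left_vr)
{
        u64 vruntime = min_vr;

        if (have_curr)
                vruntime = curr_vr;

        if (have_left) {
                if (vruntime == min_vr)
                        vruntime = left_vr;
                else
                        vruntime = min_vruntime(vruntime, left_vr);
        }

        return max_vruntime(min_vr, vruntime);  /* monotonic */
}

int main(void)
{
        /* current task at 100, leftmost queued entity at 150: min tracks 100 */
        printf("%llu\n", (unsigned long long)
               update_min_vruntime(50, 1, 100, 1, 150));
        /* only a far-right nice 19 task queued: min must not jump past current */
        printf("%llu\n", (unsigned long long)
               update_min_vruntime(100, 1, 120, 1, 900000));
        return 0;
}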

Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched_fair.c |   49 +++++++++++++++++++++++++------------------------
 1 file changed, 25 insertions(+), 24 deletions(-)

Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -271,6 +271,27 @@ static inline s64 entity_key(struct cfs_
 	return se->vruntime - cfs_rq->min_vruntime;
 }
 
+static void update_min_vruntime(struct cfs_rq *cfs_rq)
+{
+	u64 vruntime = cfs_rq->min_vruntime;
+
+	if (cfs_rq->curr)
+		vruntime = cfs_rq->curr->vruntime;
+
+	if (cfs_rq->rb_leftmost) {
+		struct sched_entity *se = rb_entry(cfs_rq->rb_leftmost,
+						   struct sched_entity,
+						   run_node);
+
+		if (vruntime == cfs_rq->min_vruntime)
+			vruntime = se->vruntime;
+		else
+			vruntime = min_vruntime(vruntime, se->vruntime);
+	}
+
+	cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime, vruntime);
+}
+
 /*
  * Enqueue an entity into the rb-tree:
  */
@@ -304,15 +325,8 @@ static void __enqueue_entity(struct cfs_
 	 * Maintain a cache of leftmost tree entries (it is frequently
 	 * used):
 	 */
-	if (leftmost) {
+	if (leftmost)
 		cfs_rq->rb_leftmost = &se->run_node;
-		/*
-		 * maintain cfs_rq->min_vruntime to be a monotonic increasing
-		 * value tracking the leftmost vruntime in the tree.
-		 */
-		cfs_rq->min_vruntime =
-			max_vruntime(cfs_rq->min_vruntime, se->vruntime);
-	}
 
 	rb_link_node(&se->run_node, parent, link);
 	rb_insert_color(&se->run_node, &cfs_rq->tasks_timeline);
@@ -322,18 +336,9 @@ static void __dequeue_entity(struct cfs_
 {
 	if (cfs_rq->rb_leftmost == &se->run_node) {
 		struct rb_node *next_node;
-		struct sched_entity *next;
 
 		next_node = rb_next(&se->run_node);
 		cfs_rq->rb_leftmost = next_node;
-
-		if (next_node) {
-			next = rb_entry(next_node,
-					struct sched_entity, run_node);
-			cfs_rq->min_vruntime =
-				max_vruntime(cfs_rq->min_vruntime,
-					     next->vruntime);
-		}
 	}
 
 	if (cfs_rq->next == se)
@@ -472,6 +477,7 @@ __update_curr(struct cfs_rq *cfs_rq, str
 	schedstat_add(cfs_rq, exec_clock, delta_exec);
 	delta_exec_weighted = calc_delta_fair(delta_exec, curr);
 	curr->vruntime += delta_exec_weighted;
+	update_min_vruntime(cfs_rq);
 }
 
 static void update_curr(struct cfs_rq *cfs_rq)
@@ -661,13 +667,7 @@ static void check_spread(struct cfs_rq *
 static void
 place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
 {
-	u64 vruntime;
-
-	if (first_fair(cfs_rq)) {
-		vruntime = min_vruntime(cfs_rq->min_vruntime,
-				__pick_next_entity(cfs_rq)->vruntime);
-	} else
-		vruntime = cfs_rq->min_vruntime;
+	u64 vruntime = cfs_rq->min_vruntime;
 
 	/*
 	 * The 'current' period is already promised to the current tasks,
@@ -744,6 +744,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
 	if (se != cfs_rq->curr)
 		__dequeue_entity(cfs_rq, se);
 	account_entity_dequeue(cfs_rq, se);
+	update_min_vruntime(cfs_rq);
 }
 
 /*

-- 



* [PATCH 3/8] sched: weaken sync hint
  2008-10-24  9:06 [PATCH 0/8] scheduler patches Peter Zijlstra
  2008-10-24  9:06 ` [PATCH 1/8] sched: fix a find_busiest_group buglet Peter Zijlstra
  2008-10-24  9:06 ` [PATCH 2/8] sched: more accurate min_vruntime accounting Peter Zijlstra
@ 2008-10-24  9:06 ` Peter Zijlstra
  2008-10-24  9:06 ` [PATCH 4/8] sched: re-instate vruntime based wakeup preemption Peter Zijlstra
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2008-10-24  9:06 UTC (permalink / raw)
  To: linux-kernel; +Cc: mingo, efault, vatsa, Peter Zijlstra

[-- Attachment #1: mike-weaken-sync.patch --]
[-- Type: text/plain, Size: 1116 bytes --]

From: Mike Galbraith <efault@gmx.de>

Mysql+oltp and pgsql+oltp peaks are still shifted right. The below puts
the peaks back to 1 client/server pair per core.

Use the avg_overlap information to weaken the sync hint.
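
For illustration, the new gate can be sketched in user space as follows
(the sysctl value and the avg_overlap numbers are made-up placeholders):

#include <stdio.h>

/* placeholder value standing in for the real sysctl, in nanoseconds */
static const unsigned long sysctl_sched_migration_cost = 500000UL;

/*
 * Sketch of the patched logic: instead of promoting !sync wakeups to sync,
 * demote a sync wakeup when either task's average overlap is too large,
 * i.e. when the pair does not really behave synchronously.
 */
static int effective_sync(int sync, unsigned long curr_avg_overlap,
                          unsigned long p_avg_overlap)
{
        if (sync && (curr_avg_overlap > sysctl_sched_migration_cost ||
                     p_avg_overlap > sysctl_sched_migration_cost))
                sync = 0;

        return sync;
}

int main(void)
{
        printf("%d\n", effective_sync(1, 100000, 100000));     /* stays sync   */
        printf("%d\n", effective_sync(1, 100000, 800000));     /* demoted to 0 */
        return 0;
}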

Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched_fair.c |    7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -1228,10 +1228,9 @@ wake_affine(struct sched_domain *this_sd
 	if (!(this_sd->flags & SD_WAKE_AFFINE) || !sched_feat(AFFINE_WAKEUPS))
 		return 0;
 
-	if (!sync && sched_feat(SYNC_WAKEUPS) &&
-	    curr->se.avg_overlap < sysctl_sched_migration_cost &&
-	    p->se.avg_overlap < sysctl_sched_migration_cost)
-		sync = 1;
+	if (sync && (curr->se.avg_overlap > sysctl_sched_migration_cost ||
+			p->se.avg_overlap > sysctl_sched_migration_cost))
+		sync = 0;
 
 	/*
 	 * If sync wakeup then subtract the (maximum possible)

-- 



* [PATCH 4/8] sched: re-instate vruntime based wakeup preemption
  2008-10-24  9:06 [PATCH 0/8] scheduler patches Peter Zijlstra
                   ` (2 preceding siblings ...)
  2008-10-24  9:06 ` [PATCH 3/8] sched: weaken sync hint Peter Zijlstra
@ 2008-10-24  9:06 ` Peter Zijlstra
  2008-10-24  9:06 ` [PATCH 5/8] sched: virtual time buddy preemption Peter Zijlstra
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2008-10-24  9:06 UTC (permalink / raw)
  To: linux-kernel; +Cc: mingo, efault, vatsa, Peter Zijlstra

[-- Attachment #1: sched-fix-wakeup-preempt.patch --]
[-- Type: text/plain, Size: 3737 bytes --]

The advantage is that vruntime based wakeup preemption has a better
conceptual model: wakeup_gran = 0 means preempt when 'fair'. wakeup_gran is
therefore the granularity of unfairness we allow in order to make progress.
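
A small user-space sketch of the resulting three-way test, matching the
diagram in the patch below (gran and the vruntime values are made up; in
the kernel gran comes from wakeup_gran()):

#include <stdio.h>
#include <stdint.h>

typedef int64_t s64;

/*
 * Mirror of wakeup_preempt_entity(): should 'se' preempt 'curr'?
 *  -1: se is right of curr                (no preemption)
 *   0: se is left of curr, within gran    (no preemption)
 *   1: se is more than gran left of curr  (preempt)
 */
static int wakeup_preempt(s64 curr_vruntime, s64 se_vruntime, s64 gran)
{
        s64 vdiff = curr_vruntime - se_vruntime;

        if (vdiff <= 0)
                return -1;
        if (vdiff > gran)
                return 1;
        return 0;
}

int main(void)
{
        s64 gran = 1000;

        printf("%d\n", wakeup_preempt(5000, 6000, gran));       /* -1: s1 in the diagram */
        printf("%d\n", wakeup_preempt(5000, 4500, gran));       /*  0: s2 */
        printf("%d\n", wakeup_preempt(5000, 3000, gran));       /*  1: s3 */
        return 0;
}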

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched_fair.c |   98 ++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 92 insertions(+), 6 deletions(-)

Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -143,6 +143,49 @@ static inline struct sched_entity *paren
 	return se->parent;
 }
 
+/* return depth at which a sched entity is present in the hierarchy */
+static inline int depth_se(struct sched_entity *se)
+{
+	int depth = 0;
+
+	for_each_sched_entity(se)
+		depth++;
+
+	return depth;
+}
+
+static void
+find_matching_se(struct sched_entity **se, struct sched_entity **pse)
+{
+	int se_depth, pse_depth;
+
+	/*
+	 * preemption test can be made between sibling entities who are in the
+	 * same cfs_rq i.e who have a common parent. Walk up the hierarchy of
+	 * both tasks until we find their ancestors who are siblings of common
+	 * parent.
+	 */
+
+	/* First walk up until both entities are at same depth */
+	se_depth = depth_se(*se);
+	pse_depth = depth_se(*pse);
+
+	while (se_depth > pse_depth) {
+		se_depth--;
+		*se = parent_entity(*se);
+	}
+
+	while (pse_depth > se_depth) {
+		pse_depth--;
+		*pse = parent_entity(*pse);
+	}
+
+	while (!is_same_group(*se, *pse)) {
+		*se = parent_entity(*se);
+		*pse = parent_entity(*pse);
+	}
+}
+
 #else	/* CONFIG_FAIR_GROUP_SCHED */
 
 static inline struct rq *rq_of(struct cfs_rq *cfs_rq)
@@ -193,6 +236,11 @@ static inline struct sched_entity *paren
 	return NULL;
 }
 
+static inline void
+find_matching_se(struct sched_entity **se, struct sched_entity **pse)
+{
+}
+
 #endif	/* CONFIG_FAIR_GROUP_SCHED */
 
 
@@ -1244,13 +1292,42 @@ static unsigned long wakeup_gran(struct 
 	 * More easily preempt - nice tasks, while not making it harder for
 	 * + nice tasks.
 	 */
-	if (sched_feat(ASYM_GRAN))
-		gran = calc_delta_mine(gran, NICE_0_LOAD, &se->load);
+	if (!sched_feat(ASYM_GRAN) || se->load.weight > NICE_0_LOAD)
+		gran = calc_delta_fair(sysctl_sched_wakeup_granularity, se);
 
 	return gran;
 }
 
 /*
+ * Should 'se' preempt 'curr'.
+ *
+ *             |s1
+ *        |s2
+ *   |s3
+ *         g
+ *      |<--->|c
+ *
+ *  w(c, s1) = -1
+ *  w(c, s2) =  0
+ *  w(c, s3) =  1
+ *
+ */
+static int
+wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
+{
+	s64 gran, vdiff = curr->vruntime - se->vruntime;
+
+	if (vdiff <= 0)
+		return -1;
+
+	gran = wakeup_gran(curr);
+	if (vdiff > gran)
+		return 1;
+
+	return 0;
+}
+
+/*
  * Preempt the current task with a newly woken task if needed:
  */
 static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int sync)
@@ -1258,7 +1335,6 @@ static void check_preempt_wakeup(struct 
 	struct task_struct *curr = rq->curr;
 	struct cfs_rq *cfs_rq = task_cfs_rq(curr);
 	struct sched_entity *se = &curr->se, *pse = &p->se;
-	s64 delta_exec;
 
 	if (unlikely(rt_prio(p->prio))) {
 		update_rq_clock(rq);
@@ -1296,9 +1372,19 @@ static void check_preempt_wakeup(struct 
 		return;
 	}
 
-	delta_exec = se->sum_exec_runtime - se->prev_sum_exec_runtime;
-	if (delta_exec > wakeup_gran(pse))
-		resched_task(curr);
+	find_matching_se(&se, &pse);
+
+	while (se) {
+		BUG_ON(!pse);
+
+		if (wakeup_preempt_entity(se, pse) == 1) {
+			resched_task(curr);
+			break;
+		}
+
+		se = parent_entity(se);
+		pse = parent_entity(pse);
+	}
 }
 
 static struct task_struct *pick_next_task_fair(struct rq *rq)

-- 



* [PATCH 5/8] sched: virtual time buddy preemption
  2008-10-24  9:06 [PATCH 0/8] scheduler patches Peter Zijlstra
                   ` (3 preceding siblings ...)
  2008-10-24  9:06 ` [PATCH 4/8] sched: re-instate vruntime based wakeup preemption Peter Zijlstra
@ 2008-10-24  9:06 ` Peter Zijlstra
  2008-10-24  9:06 ` [PATCH 6/8] sched: avg_vruntime Peter Zijlstra
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2008-10-24  9:06 UTC (permalink / raw)
  To: linux-kernel; +Cc: mingo, efault, vatsa, Peter Zijlstra

[-- Attachment #1: sched-buddy-vruntime.patch --]
[-- Type: text/plain, Size: 1784 bytes --]

Since we moved wakeup preemption back to virtual time, it makes sense to move
the buddy stuff back as well. The purpose of buddy scheduling is to allow a
rapidly switching pair of tasks to run away from the group only as far as a
regular busy task would be allowed to under wakeup preemption.

This has the advantage that the pair can ping-pong for a while, enjoying
cache-hotness. Without buddy scheduling, other tasks would interleave and
destroy the cache.

Also, it saves a word in cfs_rq.
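
Sketched in user space, the new pick is roughly the following (illustrative
values only; wakeup_preempt() mirrors wakeup_preempt_entity() from the
previous patch, with gran passed in directly):

#include <stdio.h>
#include <stdint.h>

typedef int64_t s64;

/* same three-way test as wakeup_preempt_entity(), gran passed in directly */
static int wakeup_preempt(s64 curr_vr, s64 se_vr, s64 gran)
{
        s64 vdiff = curr_vr - se_vr;

        if (vdiff <= 0)
                return -1;
        return vdiff > gran ? 1 : 0;
}

/*
 * Sketch of the new pick_next(): run the buddy (next) instead of the
 * leftmost entity (se), unless the leftmost would be allowed to preempt
 * the buddy, i.e. the buddy has run too far ahead in virtual time.
 */
static s64 pick_next(s64 leftmost_vr, int have_next, s64 next_vr, s64 gran)
{
        if (!have_next || wakeup_preempt(next_vr, leftmost_vr, gran) == 1)
                return leftmost_vr;

        return next_vr;
}

int main(void)
{
        s64 gran = 1000;

        /* buddy only slightly ahead of the leftmost: keep ping-ponging */
        printf("%lld\n", (long long)pick_next(5000, 1, 5500, gran));
        /* buddy ran away by more than gran: fall back to the leftmost */
        printf("%lld\n", (long long)pick_next(5000, 1, 7000, gran));
        return 0;
}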

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched.c      |    1 -
 kernel/sched_fair.c |   12 +++++-------
 2 files changed, 5 insertions(+), 8 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -385,7 +385,6 @@ struct cfs_rq {
 
 	u64 exec_clock;
 	u64 min_vruntime;
-	u64 pair_start;
 
 	struct rb_root tasks_timeline;
 	struct rb_node *rb_leftmost;
Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -790,16 +790,14 @@ set_next_entity(struct cfs_rq *cfs_rq, s
 	se->prev_sum_exec_runtime = se->sum_exec_runtime;
 }
 
+static int
+wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se);
+
 static struct sched_entity *
 pick_next(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	struct rq *rq = rq_of(cfs_rq);
-	u64 pair_slice = rq->clock - cfs_rq->pair_start;
-
-	if (!cfs_rq->next || pair_slice > sysctl_sched_min_granularity) {
-		cfs_rq->pair_start = rq->clock;
+	if (!cfs_rq->next || wakeup_preempt_entity(cfs_rq->next, se) == 1)
 		return se;
-	}
 
 	return cfs_rq->next;
 }

-- 



* [PATCH 6/8] sched: avg_vruntime
  2008-10-24  9:06 [PATCH 0/8] scheduler patches Peter Zijlstra
                   ` (4 preceding siblings ...)
  2008-10-24  9:06 ` [PATCH 5/8] sched: virtual time buddy preemption Peter Zijlstra
@ 2008-10-24  9:06 ` Peter Zijlstra
  2008-10-29 15:48   ` Peter Zijlstra
  2008-10-24  9:06 ` [PATCH 7/8] sched: non-zero lag renice Peter Zijlstra
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2008-10-24  9:06 UTC (permalink / raw)
  To: linux-kernel; +Cc: mingo, efault, vatsa, Peter Zijlstra

[-- Attachment #1: sched-avg-vruntime.patch --]
[-- Type: text/plain, Size: 4344 bytes --]

Renicing requires scaling the lag. Therefore we need a way to compute it.
Lag is defined as the difference between the service time received from the
ideal model and the actual scheduler.

The defining property of a fair scheduler is that the sum of all lags is zero;
which can be seen to be trivially true for the ideal case, as all lags are zero.

Therefore, the average of all virtual runtimes will be the point of zero lag.

We cannot prove fairness for CFS due to sleeper fairness (without it we can).
However since we can observe it does converge to fairness in stable operation,
we can say the zero lag point converges to the average.

We can't just take the average of vruntime - as it will use the full range
of its u64 and will wrap around. Instead we'll use the average of
(vruntime - min_vruntime)

\Sum_{i}^{n} 1/n (v_{i} - v) = 1/n (\Sum_{i}^{n} v_{i}) - v

By factoring out the 1/n (never storing that) we avoid rounding, which
would bring an accumulating error.
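
The bookkeeping this leads to can be sketched in user space as follows
(unweighted, matching this patch; the division by n happens only when the
average is actually read, the stored state is just the sum and the count):

#include <stdio.h>
#include <stdint.h>

typedef uint64_t u64;
typedef int64_t s64;

/* running \Sum (v_i - min_vruntime) plus the count; never store the 1/n */
struct avg {
        u64 min_vruntime;
        s64 sum_key;
        long nr_queued;
};

static void avg_add(struct avg *a, u64 vruntime)
{
        a->sum_key += (s64)(vruntime - a->min_vruntime);
        a->nr_queued++;
}

static void avg_sub(struct avg *a, u64 vruntime)
{
        a->sum_key -= (s64)(vruntime - a->min_vruntime);
        a->nr_queued--;
}

/* when min_vruntime advances by delta, every key shrinks by delta */
static void avg_update(struct avg *a, u64 new_min)
{
        s64 delta = (s64)(new_min - a->min_vruntime);

        if (delta > 0) {
                a->sum_key -= a->nr_queued * delta;
                a->min_vruntime = new_min;
        }
}

static u64 avg_vruntime(struct avg *a)
{
        s64 avg = a->sum_key;

        if (a->nr_queued)
                avg /= a->nr_queued;

        return a->min_vruntime + avg;
}

int main(void)
{
        struct avg a = { .min_vruntime = 1000 };

        avg_add(&a, 1000);
        avg_add(&a, 4000);
        printf("%llu\n", (unsigned long long)avg_vruntime(&a)); /* 2500 */

        avg_update(&a, 2000);                                   /* shift the base */
        printf("%llu\n", (unsigned long long)avg_vruntime(&a)); /* still 2500 */

        avg_sub(&a, 4000);                                      /* dequeue it */
        printf("%llu\n", (unsigned long long)avg_vruntime(&a)); /* 1000 */
        return 0;
}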

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched.c       |    3 ++
 kernel/sched_debug.c |    3 ++
 kernel/sched_fair.c  |   56 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 61 insertions(+), 1 deletion(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -383,6 +383,9 @@ struct cfs_rq {
 	struct load_weight load;
 	unsigned long nr_running;
 
+	long nr_queued;
+	s64 avg_vruntime;
+
 	u64 exec_clock;
 	u64 min_vruntime;
 
Index: linux-2.6/kernel/sched_debug.c
===================================================================
--- linux-2.6.orig/kernel/sched_debug.c
+++ linux-2.6/kernel/sched_debug.c
@@ -161,6 +161,9 @@ void print_cfs_rq(struct seq_file *m, in
 			SPLIT_NS(spread0));
 	SEQ_printf(m, "  .%-30s: %ld\n", "nr_running", cfs_rq->nr_running);
 	SEQ_printf(m, "  .%-30s: %ld\n", "load", cfs_rq->load.weight);
+	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "avg_vruntime",
+			SPLIT_NS(avg_vruntime(cfs_rq)));
+
 #ifdef CONFIG_SCHEDSTATS
 #define P(n) SEQ_printf(m, "  .%-30s: %d\n", #n, rq->n);
 
Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -271,6 +271,56 @@ static inline s64 entity_key(struct cfs_
 	return se->vruntime - cfs_rq->min_vruntime;
 }
 
+static void
+avg_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	s64 key = entity_key(cfs_rq, se);
+	cfs_rq->avg_vruntime += key;
+	cfs_rq->nr_queued++;
+}
+
+static void
+avg_vruntime_sub(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	s64 key = entity_key(cfs_rq, se);
+	cfs_rq->avg_vruntime -= key;
+	cfs_rq->nr_queued--;
+}
+
+static inline
+void avg_vruntime_update(struct cfs_rq *cfs_rq, s64 delta)
+{
+	cfs_rq->avg_vruntime -= cfs_rq->nr_queued * delta;
+}
+
+static u64 avg_vruntime(struct cfs_rq *cfs_rq)
+{
+	s64 avg = cfs_rq->avg_vruntime;
+	long nr_queued = cfs_rq->nr_queued;
+
+	if (cfs_rq->curr) {
+		nr_queued++;
+		avg += entity_key(cfs_rq, cfs_rq->curr);
+	}
+
+	if (nr_queued)
+		avg = div_s64(avg, nr_queued);
+
+	return cfs_rq->min_vruntime + avg;
+}
+
+static void __update_min_vruntime(struct cfs_rq *cfs_rq, u64 vruntime)
+{
+	/*
+	 * open coded max_vruntime() to allow updating avg_vruntime
+	 */
+	s64 delta = (s64)(vruntime - cfs_rq->min_vruntime);
+	if (delta > 0) {
+		avg_vruntime_update(cfs_rq, delta);
+		cfs_rq->min_vruntime = vruntime;
+	}
+}
+
 static void update_min_vruntime(struct cfs_rq *cfs_rq)
 {
 	u64 vruntime = cfs_rq->min_vruntime;
@@ -289,7 +339,7 @@ static void update_min_vruntime(struct c
 			vruntime = min_vruntime(vruntime, se->vruntime);
 	}
 
-	cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime, vruntime);
+	__update_min_vruntime(cfs_rq, vruntime);
 }
 
 /*
@@ -303,6 +353,8 @@ static void __enqueue_entity(struct cfs_
 	s64 key = entity_key(cfs_rq, se);
 	int leftmost = 1;
 
+	avg_vruntime_add(cfs_rq, se);
+
 	/*
 	 * Find the right place in the rbtree:
 	 */
@@ -345,6 +397,8 @@ static void __dequeue_entity(struct cfs_
 		cfs_rq->next = NULL;
 
 	rb_erase(&se->run_node, &cfs_rq->tasks_timeline);
+
+	avg_vruntime_sub(cfs_rq, se);
 }
 
 static inline struct rb_node *first_fair(struct cfs_rq *cfs_rq)

-- 



* [PATCH 7/8] sched: non-zero lag renice
  2008-10-24  9:06 [PATCH 0/8] scheduler patches Peter Zijlstra
                   ` (5 preceding siblings ...)
  2008-10-24  9:06 ` [PATCH 6/8] sched: avg_vruntime Peter Zijlstra
@ 2008-10-24  9:06 ` Peter Zijlstra
  2008-10-24 17:47   ` Chris Friesen
  2008-10-24  9:06 ` [PATCH 8/8] use avg_vruntime for task placement Peter Zijlstra
  2008-10-24 10:26 ` [PATCH 0/8] scheduler patches Ingo Molnar
  8 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2008-10-24  9:06 UTC (permalink / raw)
  To: linux-kernel; +Cc: mingo, efault, vatsa, Peter Zijlstra

[-- Attachment #1: sched-non-zero-lag-renice.patch --]
[-- Type: text/plain, Size: 3681 bytes --]

When renicing, especially when lowering the nice value (getting heavier), it's
possible to get into a starvation scenario. If you got too much runtime as a
very light task, you get shot way far to the right, which means you'll have to
wait for a long time in order to run again.

If during that wait you get reniced down, fairness would suggest you get run
earlier, because you deserve more time.

This can be solved by scaling the vruntime so that we keep the real-time
lag invariant.

So under transformation(w->w') keep dt == dt':

   dv = dt/w -> w dv = dt
   w dv = w' dv' -> dv' = w/w' dv
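
A minimal user-space sketch of the scaling step (the avg value is assumed
to come from avg_vruntime(); the weights are the nice 19 and nice -20
entries of the weight table):

#include <stdio.h>
#include <stdint.h>

typedef int64_t s64;
typedef uint64_t u64;

/*
 * Keep the real-time lag invariant across a weight change:
 *   dv' = (w / w') * dv
 * where dv is the virtual distance from the zero-lag point (avg).
 */
static u64 scale_vruntime(u64 vruntime, u64 avg,
                          unsigned long old_weight, unsigned long new_weight)
{
        s64 lag = (s64)(vruntime - avg);

        lag = lag * (s64)old_weight / (s64)new_weight;

        return avg + lag;
}

int main(void)
{
        /* nice 19 (weight 15) task sitting 1000000 to the right of avg ... */
        u64 v = scale_vruntime(6000000, 5000000, 15, 88761);

        /* ... after renicing to -20 (weight 88761) it ends up almost at avg */
        printf("%llu\n", (unsigned long long)v);        /* 5000168 */
        return 0;
}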

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched.c      |   14 +++++++-------
 kernel/sched_fair.c |   40 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 47 insertions(+), 7 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -4984,12 +4984,9 @@ void set_user_nice(struct task_struct *p
 
 	if (on_rq) {
 		enqueue_task(rq, p, 0);
-		/*
-		 * If the task increased its priority or is running and
-		 * lowered its priority, then reschedule its CPU:
-		 */
-		if (delta < 0 || (delta > 0 && task_running(rq, p)))
-			resched_task(rq->curr);
+
+		check_class_changed(rq, p, p->sched_class, old_prio,
+				    task_running(rq, p));
 	}
 out_unlock:
 	task_rq_unlock(rq, &flags);
@@ -8744,6 +8741,7 @@ void sched_move_task(struct task_struct 
 static void __set_se_shares(struct sched_entity *se, unsigned long shares)
 {
 	struct cfs_rq *cfs_rq = se->cfs_rq;
+	unsigned long old_weight = se->load.weight;
 	int on_rq;
 
 	on_rq = se->on_rq;
@@ -8753,8 +8751,10 @@ static void __set_se_shares(struct sched
 	se->load.weight = shares;
 	se->load.inv_weight = 0;
 
-	if (on_rq)
+	if (on_rq) {
 		enqueue_entity(cfs_rq, se, 0);
+		prio_changed_entity(cfs_rq, se, old_weight, shares);
+	}
 }
 
 static void set_se_shares(struct sched_entity *se, unsigned long shares)
Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -1671,6 +1671,41 @@ static void task_new_fair(struct rq *rq,
 	enqueue_task_fair(rq, p, 0);
 }
 
+static void prio_changed_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
+		unsigned long old_weight, unsigned long new_weight)
+{
+	u64 avg;
+	s64 lag;
+
+	if (old_weight == new_weight)
+		return;
+
+	/*
+	 * XXX very likely we just did a dequeue/enqueue cycle already
+	 * optimize this...
+	 */
+	update_curr(cfs_rq);
+	avg = avg_vruntime(cfs_rq);
+	if (se != cfs_rq->curr)
+		__dequeue_entity(cfs_rq, se);
+
+	/*
+	 * When changing weight, keep the lag invariant under real time.
+	 * So under transformation(w->w') keep dt == dt':
+	 *
+	 *   dv = dt/w -> w dv = dt
+	 *   w dv = w' dv' -> dv' = w/w' dv
+	 */
+	lag = (s64)(se->vruntime - avg);
+	lag *= old_weight;
+	lag = div_s64(lag, new_weight);
+
+	se->vruntime = avg + lag;
+	if (se != cfs_rq->curr)
+		__enqueue_entity(cfs_rq, se);
+	update_min_vruntime(cfs_rq);
+}
+
 /*
  * Priority of the task has changed. Check to see if we preempt
  * the current task.
@@ -1678,6 +1713,11 @@ static void task_new_fair(struct rq *rq,
 static void prio_changed_fair(struct rq *rq, struct task_struct *p,
 			      int oldprio, int running)
 {
+	if (!rt_prio(oldprio))
+		prio_changed_entity(&rq->cfs, &p->se,
+				prio_to_weight[USER_PRIO(oldprio)],
+				prio_to_weight[USER_PRIO(p->prio)]);
+
 	/*
 	 * Reschedule if we are currently running on this runqueue and
 	 * our priority decreased, or if we are not currently running on

-- 



* [PATCH 8/8] use avg_vruntime for task placement
  2008-10-24  9:06 [PATCH 0/8] scheduler patches Peter Zijlstra
                   ` (6 preceding siblings ...)
  2008-10-24  9:06 ` [PATCH 7/8] sched: non-zero lag renice Peter Zijlstra
@ 2008-10-24  9:06 ` Peter Zijlstra
  2008-10-24 10:26 ` [PATCH 0/8] scheduler patches Ingo Molnar
  8 siblings, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2008-10-24  9:06 UTC (permalink / raw)
  To: linux-kernel; +Cc: mingo, efault, vatsa, Peter Zijlstra

[-- Attachment #1: sched-placement.patch --]
[-- Type: text/plain, Size: 1919 bytes --]

Allow the use of avg_vruntime for task placement.

The pro: it's the 'fair' place to insert tasks.
The con: it's an extra division.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched.c          |    9 +++++++--
 kernel/sched_fair.c     |    7 ++++++-
 kernel/sched_features.h |    1 +
 3 files changed, 14 insertions(+), 3 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -1847,8 +1847,13 @@ void set_task_cpu(struct task_struct *p,
 			schedstat_inc(p, se.nr_forced2_migrations);
 	}
 #endif
-	p->se.vruntime -= old_cfsrq->min_vruntime -
-					 new_cfsrq->min_vruntime;
+	if (sched_feat(AVG_VRUNTIME)) {
+		p->se.vruntime -=
+			avg_vruntime(old_cfsrq) - avg_vruntime(new_cfsrq);
+	} else {
+		p->se.vruntime -=
+			old_cfsrq->min_vruntime - new_cfsrq->min_vruntime;
+	}
 
 	__set_task_cpu(p, new_cpu);
 }
Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -719,7 +719,12 @@ static void check_spread(struct cfs_rq *
 static void
 place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
 {
-	u64 vruntime = cfs_rq->min_vruntime;
+	u64 vruntime;
+
+	if (sched_feat(AVG_VRUNTIME))
+		vruntime = avg_vruntime(cfs_rq);
+	else
+		vruntime = cfs_rq->min_vruntime;
 
 	/*
 	 * The 'current' period is already promised to the current tasks,
Index: linux-2.6/kernel/sched_features.h
===================================================================
--- linux-2.6.orig/kernel/sched_features.h
+++ linux-2.6/kernel/sched_features.h
@@ -11,4 +11,5 @@ SCHED_FEAT(ASYM_GRAN, 1)
 SCHED_FEAT(LB_BIAS, 1)
 SCHED_FEAT(LB_WAKEUP_UPDATE, 1)
 SCHED_FEAT(ASYM_EFF_LOAD, 1)
+SCHED_FEAT(AVG_VRUNTIME, 0)
 SCHED_FEAT(WAKEUP_OVERLAP, 0)

-- 



* Re: [PATCH 0/8] scheduler patches
  2008-10-24  9:06 [PATCH 0/8] scheduler patches Peter Zijlstra
                   ` (7 preceding siblings ...)
  2008-10-24  9:06 ` [PATCH 8/8] use avg_vruntime for task placement Peter Zijlstra
@ 2008-10-24 10:26 ` Ingo Molnar
  2008-10-24 10:29   ` Peter Zijlstra
  8 siblings, 1 reply; 16+ messages in thread
From: Ingo Molnar @ 2008-10-24 10:26 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, efault, vatsa


* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> 1-3: definite .28 patches
> 4-5: wanna-be .28 patches

ok, applied them to tip/sched/urgent, thanks Peter!

> 6-8: .29 work

Could we have a #9 that uses a no-fastpath-overhead min() method instead 
of avg_vruntime? Maybe we can get away without that overhead?

	Ingo


* Re: [PATCH 0/8] scheduler patches
  2008-10-24 10:26 ` [PATCH 0/8] scheduler patches Ingo Molnar
@ 2008-10-24 10:29   ` Peter Zijlstra
  0 siblings, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2008-10-24 10:29 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, efault, vatsa

On Fri, 2008-10-24 at 12:26 +0200, Ingo Molnar wrote:
> * Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> > 6-8: .29 work
> 
> Could we have a #9 that uses a no-fastpath-overhead min() method instead 
> of avg_vruntime? Maybe we can get away without that overhead?

Looking at that suggestion, maybe I'll redo 7/8 so as to not depend on
6/8, using min from the get-go and fixing that XXX as well.

Will have to poke at it a little.



* Re: [PATCH 7/8] sched: non-zero lag renice
  2008-10-24  9:06 ` [PATCH 7/8] sched: non-zero lag renice Peter Zijlstra
@ 2008-10-24 17:47   ` Chris Friesen
  2008-10-24 20:28     ` Peter Zijlstra
  0 siblings, 1 reply; 16+ messages in thread
From: Chris Friesen @ 2008-10-24 17:47 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, mingo, efault, vatsa

Peter Zijlstra wrote:

 > When renicing, especially when lowering the nice value (getting heavier), it's
 > possible to get into a starvation scenario. If you got too much runtime as a
 > very light task, you get shot way far to the right, which means you'll have to
 > wait for a long time in order to run again.
 >
 > If during that wait you get reniced down, fairness would suggest you get run
 > earlier, because you deserve more time.
 >
 > This can be solved by scaling the vruntime so that we keep the real-time
 > lag invariant.

If we've already been shot way out to the right, presumably that would give us 
a large real-time lag.  If we renice to a lower nice level, wouldn't we want 
to reduce the real-time lag rather than make it constant?

Chris


* Re: [PATCH 7/8] sched: non-zero lag renice
  2008-10-24 17:47   ` Chris Friesen
@ 2008-10-24 20:28     ` Peter Zijlstra
  2008-10-24 21:13       ` Chris Friesen
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2008-10-24 20:28 UTC (permalink / raw)
  To: Chris Friesen; +Cc: linux-kernel, mingo, efault, vatsa

On Fri, 2008-10-24 at 11:47 -0600, Chris Friesen wrote:
> Peter Zijlstra wrote:
> 
>  > When renicing, especially when lowering the nice value (getting heavier), it's
>  > possible to get into a starvation scenario. If you got too much runtime as a
>  > very light task, you get shot way far to the right, which means you'll have to
>  > wait for a long time in order to run again.
>  >
>  > If during that wait you get reniced down, fairness would suggest you get run
>  > earlier, because you deserve more time.
>  >
>  > This can be solved by scaling the vruntime so that we keep the real-time
>  > lag invariant.
> 
> If we've already been shot way out to the right, presumably that would give us 
> a large real-time lag.  If we renice to a lower nice level, wouldn't we want 
> to reduce the real-time lag rather than make it constant?

Ah, see but a 1ms real-time lag might be gigantic on weight=15, but
nearly nothing on weight=88761.

1e6 * 1024/15 is massively larger than 1e6 * 1024/88761.

1000000*1024/15 = 68266666
1000000*1024/88761 = 11536
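
In code form, the same conversion from a real-time lag to a virtual runtime
skew (assuming NICE_0_LOAD = 1024, as in the numbers above):

#include <stdio.h>

#define NICE_0_LOAD 1024UL

/* virtual runtime skew caused by a real-time lag at a given weight */
static unsigned long long vlag(unsigned long long lag_ns, unsigned long weight)
{
        return lag_ns * NICE_0_LOAD / weight;
}

int main(void)
{
        printf("%llu\n", vlag(1000000, 15));    /* 68266666 */
        printf("%llu\n", vlag(1000000, 88761)); /* 11536    */
        return 0;
}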






* Re: [PATCH 7/8] sched: non-zero lag renice
  2008-10-24 20:28     ` Peter Zijlstra
@ 2008-10-24 21:13       ` Chris Friesen
  0 siblings, 0 replies; 16+ messages in thread
From: Chris Friesen @ 2008-10-24 21:13 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, mingo, efault, vatsa

Peter Zijlstra wrote:
> On Fri, 2008-10-24 at 11:47 -0600, Chris Friesen wrote:
>> Peter Zijlstra wrote:
>>
>>  > When renicing, especially when lowering the nice value (getting heavier), it's
>>  > possible to get into a starvation scenario. If you got too much runtime as a
>>  > very light task, you get shot way far to the right, which means you'll have to
>>  > wait for a long time in order to run again.
>>  >
>>  > If during that wait you get reniced down, fairness would suggest you get run
>>  > earlier, because you deserve more time.
>>  >
>>  > This can be solved by scaling the vruntime so that we keep the real-time
>>  > lag invariant.
>>
>> If we've already been shot way out to the right, presumably that would give us 
>> a large real-time lag.  If we renice to a lower nice level, wouldn't we want 
>> to reduce the real-time lag rather than make it constant?

Sorry, the patches arrived out of order and my comments were sent out before 
reading your definition of "lag" as "the difference between the service time 
received from the ideal model and the actual scheduler".

I was using a definition of lag as "amount of time before the process gets 
scheduled again".

Given your definition, I think the patch makes sense.

Chris


* Re: [PATCH 6/8] sched: avg_vruntime
  2008-10-24  9:06 ` [PATCH 6/8] sched: avg_vruntime Peter Zijlstra
@ 2008-10-29 15:48   ` Peter Zijlstra
  2008-11-01 18:13     ` Fabio Checconi
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2008-10-29 15:48 UTC (permalink / raw)
  To: linux-kernel; +Cc: mingo, efault, vatsa, fabio

On Fri, 2008-10-24 at 11:06 +0200, Peter Zijlstra wrote:
> plain text document attachment (sched-avg-vruntime.patch)
> Renicing requires scaling the lag. Therefore we need a way to compute it.
> Lag is defined as the difference between the service time received from the
> ideal model and the actual scheduler.
> 
> The defining property of a fair scheduler is that the sum of all lags is zero;
> which can be seen to be trivially true for the ideal case, as all lags are zero.
> 
> Therefore, the average of all virtual runtimes will be the point of zero lag.
> 
> We cannot prove fairness for CFS due to sleeper fairness (without it we can).
> However since we can observe it does converge to fairness in stable operation,
> we can say the zero lag point converges to the average.
> 
> We can't just take the average of vruntime - as it will use the full range
> of its u64 and will wrap around. Instead we'll use the average of
> (vruntime - min_vruntime)
> 
> \Sum_{i}^{n} 1/n (v_{i} - v) = 1/n (\Sum_{i}^{n} v_{i}) - v
> 
> By factoring out the 1/n (never storing that) we avoid rounding, which
> would bring an accumulating error.

Hi Fabio,

you were right, this is wrong.

How about this..

The fluid model would, for each task t_i, generate an execution time e_i

  de_i = w_i / w_sum * dt

However, any real scheduler will be imperfect and have an error eps_i

  dE_i = de_i + eps_i,

But due to only dt actual time having passed we can state that

  \Sum_i dE_i = dt, therefore  \Sum_i eps_i = 0.

This will be reflected in a virtual runtime skew of

  dv_i = eps_i / w_i

If we now wish to obtain the zero lag point, where all tasks would
be in the fluid model, we get

  eps_i = dv_i * w_i, which yields: \Sum dv_i * w_i = 0

IOW avg(v_i*w_i) = v_fluid

1/n \Sum_i v_i*w_i, [v_i -> v_i-x] ->
1/n \Sum_i (v_i-x)*w_i =
1/n \Sum_i v_i*w_i - 1/n \Sum_i x*w_i =
1/n \Sum_i v_i*w_i - x/n \Sum_i w_i

which in turn would yield a patch like below..

I'll also try and quantify the error and effect of using min_vruntime as
zero lag point as Ingo suggested.
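
For reference, a minimal user-space sketch of the weighted bookkeeping this
derivation suggests; it is illustrative only, divides by the summed weight
rather than taking the fixed-point shortcuts of the patch below, and uses
made-up vruntime/weight values:

#include <stdio.h>
#include <stdint.h>

typedef uint64_t u64;
typedef int64_t s64;

/* \Sum w_i * (v_i - min_vruntime) and \Sum w_i, kept incrementally */
struct wavg {
        u64 min_vruntime;
        s64 sum_wkey;
        long sum_weight;
};

static void wavg_add(struct wavg *a, u64 vruntime, long weight)
{
        a->sum_wkey += (s64)(vruntime - a->min_vruntime) * weight;
        a->sum_weight += weight;
}

static void wavg_sub(struct wavg *a, u64 vruntime, long weight)
{
        a->sum_wkey -= (s64)(vruntime - a->min_vruntime) * weight;
        a->sum_weight -= weight;
}

/* advancing min_vruntime by delta shrinks every key by delta */
static void wavg_update(struct wavg *a, u64 new_min)
{
        s64 delta = (s64)(new_min - a->min_vruntime);

        if (delta > 0) {
                a->sum_wkey -= a->sum_weight * delta;
                a->min_vruntime = new_min;
        }
}

/* zero-lag estimate: min_vruntime + (\Sum w_i*key_i) / (\Sum w_i) */
static u64 wavg_vruntime(struct wavg *a)
{
        s64 avg = a->sum_wkey;

        if (a->sum_weight)
                avg /= a->sum_weight;

        return a->min_vruntime + avg;
}

int main(void)
{
        struct wavg a = { .min_vruntime = 1000 };

        wavg_add(&a, 1000, 1024);       /* nice 0 task at the left edge  */
        wavg_add(&a, 9000, 15);         /* nice 19 task far to the right */
        printf("%llu\n", (unsigned long long)wavg_vruntime(&a)); /* 1115 */

        wavg_update(&a, 1500);          /* shift the base */
        printf("%llu\n", (unsigned long long)wavg_vruntime(&a)); /* 1116: same point, modulo rounding */

        wavg_sub(&a, 9000, 15);         /* dequeue the heavy-lag task */
        printf("%llu\n", (unsigned long long)wavg_vruntime(&a)); /* 1000 */
        return 0;
}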

---
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c	2008-10-29 16:43:16.000000000 +0100
+++ linux-2.6/kernel/sched.c	2008-10-29 16:43:27.000000000 +0100
@@ -384,6 +384,10 @@ struct cfs_rq {
 	struct load_weight load;
 	unsigned long nr_running;
 
+	long nr_queued;
+	long avg_load;
+	s64 avg_vruntime;
+
 	u64 exec_clock;
 	u64 min_vruntime;
 
Index: linux-2.6/kernel/sched_debug.c
===================================================================
--- linux-2.6.orig/kernel/sched_debug.c	2008-10-29 16:43:04.000000000 +0100
+++ linux-2.6/kernel/sched_debug.c	2008-10-29 16:43:37.000000000 +0100
@@ -161,6 +161,9 @@ void print_cfs_rq(struct seq_file *m, in
 			SPLIT_NS(spread0));
 	SEQ_printf(m, "  .%-30s: %ld\n", "nr_running", cfs_rq->nr_running);
 	SEQ_printf(m, "  .%-30s: %ld\n", "load", cfs_rq->load.weight);
+	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "avg_vruntime",
+			SPLIT_NS(avg_vruntime(cfs_rq)));
+
 #ifdef CONFIG_SCHEDSTATS
 #define P(n) SEQ_printf(m, "  .%-30s: %d\n", #n, rq->n);
 
Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c	2008-10-29 16:43:17.000000000 +0100
+++ linux-2.6/kernel/sched_fair.c	2008-10-29 16:46:41.000000000 +0100
@@ -271,6 +271,60 @@ static inline s64 entity_key(struct cfs_
 	return se->vruntime - cfs_rq->min_vruntime;
 }
 
+static void
+avg_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	s64 key = entity_key(cfs_rq, se);
+	cfs_rq->avg_load += se->load.weight;
+	cfs_rq->avg_vruntime += key * se->load.weight;
+	cfs_rq->nr_queued++;
+}
+
+static void
+avg_vruntime_sub(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	s64 key = entity_key(cfs_rq, se);
+	cfs_rq->avg_load -= se->load.weight;
+	cfs_rq->avg_vruntime -= key * se->load.weight;
+	cfs_rq->nr_queued--;
+}
+
+static inline
+void avg_vruntime_update(struct cfs_rq *cfs_rq, s64 delta)
+{
+	cfs_rq->avg_vruntime -= cfs_rq->nr_queued * cfs_rq->avg_load * delta;
+}
+
+static u64 avg_vruntime(struct cfs_rq *cfs_rq)
+{
+	s64 avg = cfs_rq->avg_vruntime;
+	long nr_queued = cfs_rq->nr_queued;
+
+	if (cfs_rq->curr) {
+		nr_queued++;
+		avg += entity_key(cfs_rq, cfs_rq->curr) * cfs_rq->curr->load.weight;
+	}
+
+	avg >>= NICE_0_SHIFT;
+
+	if (nr_queued)
+		avg = div_s64(avg, nr_queued);
+
+	return cfs_rq->min_vruntime + avg;
+}
+
+static void __update_min_vruntime(struct cfs_rq *cfs_rq, u64 vruntime)
+{
+	/*
+	 * open coded max_vruntime() to allow updating avg_vruntime
+	 */
+	s64 delta = (s64)(vruntime - cfs_rq->min_vruntime);
+	if (delta > 0) {
+		avg_vruntime_update(cfs_rq, delta);
+		cfs_rq->min_vruntime = vruntime;
+	}
+}
+
 static void update_min_vruntime(struct cfs_rq *cfs_rq)
 {
 	u64 vruntime = cfs_rq->min_vruntime;
@@ -289,7 +343,7 @@ static void update_min_vruntime(struct c
 			vruntime = min_vruntime(vruntime, se->vruntime);
 	}
 
-	cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime, vruntime);
+	__update_min_vruntime(cfs_rq, vruntime);
 }
 
 /*
@@ -303,6 +357,8 @@ static void __enqueue_entity(struct cfs_
 	s64 key = entity_key(cfs_rq, se);
 	int leftmost = 1;
 
+	avg_vruntime_add(cfs_rq, se);
+
 	/*
 	 * Find the right place in the rbtree:
 	 */
@@ -345,6 +401,7 @@ static void __dequeue_entity(struct cfs_
 		cfs_rq->next = NULL;
 
 	rb_erase(&se->run_node, &cfs_rq->tasks_timeline);
+	avg_vruntime_sub(cfs_rq, se);
 }
 
 static inline struct rb_node *first_fair(struct cfs_rq *cfs_rq)




* Re: [PATCH 6/8] sched: avg_vruntime
  2008-10-29 15:48   ` Peter Zijlstra
@ 2008-11-01 18:13     ` Fabio Checconi
  0 siblings, 0 replies; 16+ messages in thread
From: Fabio Checconi @ 2008-11-01 18:13 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, mingo, efault, vatsa

Hi,

> From: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Date: Wed, Oct 29, 2008 04:48:34PM +0100
>
> On Fri, 2008-10-24 at 11:06 +0200, Peter Zijlstra wrote:
...
> How about this..
> 
> The fluid model would, for each task t_i, generate an execution time e_i
> 
>   de_i = w_i / w_sum * dt
> 
> However, any real scheduler will be imperfect and have an error eps_i
> 
>   dE_i = de_i + eps_i,
> 

  ...according to this equation...


> But due to only dt actual time having passed we can state that
> 
>   \Sum_i dE_i = dt, therefore  \Sum_i eps_i = 0.
> 
> This will be reflected in a virtual runtime skew of
> 
>   dv_i = eps_i / w_i
> 

...and to this one, what you call ``virtual runtime skew'' is:

  dv_i = (dE_i - de_i) / w_i.


> If we now wish to obtain the zero lag point, where all tasks would
> be in the fluid model, we get
> 
>   eps_i = dv_i * w_i, which yields: \Sum dv_i * w_i = 0
> 
> IOW avg(v_i*w_i) = v_fluid
> 

Looking at the code, it seems that you use the vruntime values of the
entities when you do the average, which are different from what you
previously called ``virtual runtime skew.''  I don't understand the
connection between the previous dv_i and the v_i there.  Calling dVR_i
the vruntime increment for the i-th entity, dVR_i = dE_i / w_i, which
clearly differs from dv_i.

Moreover, v_fluid (considering all the flows backlogged from the beginning
and the set of active flows constant) is defined as:

  v_fluid = 1 / w_sum * \sum w_i * VR_i,

so, unless w_sum == N, this differs from your expression for v_fluid.
Am I missing something there?


> 1/n \Sum_i v_i*w_i, [v_i -> v_i-x] ->
> 1/n \Sum_i (v_i-x)*w_i =
> 1/n \Sum_i v_i*w_i - 1/n \Sum_i x*w_i =
> 1/n \Sum_i v_i*w_i - x/n \Sum_i w_i
> 
> which in turn would yield a patch like below..
> 
> I'll also try and quantify the error and effect of using min_vruntime as
> zero lag point as Ingo suggested.
> 

min_vruntime, given its definition, is very likely to be near to the
maximum lag point...


> ---
> Index: linux-2.6/kernel/sched.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched.c	2008-10-29 16:43:16.000000000 +0100
> +++ linux-2.6/kernel/sched.c	2008-10-29 16:43:27.000000000 +0100
> @@ -384,6 +384,10 @@ struct cfs_rq {
>  	struct load_weight load;
>  	unsigned long nr_running;
>  
> +	long nr_queued;
> +	long avg_load;
> +	s64 avg_vruntime;
> +
>  	u64 exec_clock;
>  	u64 min_vruntime;
>  
> Index: linux-2.6/kernel/sched_debug.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched_debug.c	2008-10-29 16:43:04.000000000 +0100
> +++ linux-2.6/kernel/sched_debug.c	2008-10-29 16:43:37.000000000 +0100
> @@ -161,6 +161,9 @@ void print_cfs_rq(struct seq_file *m, in
>  			SPLIT_NS(spread0));
>  	SEQ_printf(m, "  .%-30s: %ld\n", "nr_running", cfs_rq->nr_running);
>  	SEQ_printf(m, "  .%-30s: %ld\n", "load", cfs_rq->load.weight);
> +	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "avg_vruntime",
> +			SPLIT_NS(avg_vruntime(cfs_rq)));
> +
>  #ifdef CONFIG_SCHEDSTATS
>  #define P(n) SEQ_printf(m, "  .%-30s: %d\n", #n, rq->n);
>  
> Index: linux-2.6/kernel/sched_fair.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched_fair.c	2008-10-29 16:43:17.000000000 +0100
> +++ linux-2.6/kernel/sched_fair.c	2008-10-29 16:46:41.000000000 +0100
> @@ -271,6 +271,60 @@ static inline s64 entity_key(struct cfs_
>  	return se->vruntime - cfs_rq->min_vruntime;
>  }
>  
> +static void
> +avg_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se)
> +{
> +	s64 key = entity_key(cfs_rq, se);
> +	cfs_rq->avg_load += se->load.weight;
> +	cfs_rq->avg_vruntime += key * se->load.weight;
> +	cfs_rq->nr_queued++;
> +}
> +
> +static void
> +avg_vruntime_sub(struct cfs_rq *cfs_rq, struct sched_entity *se)
> +{
> +	s64 key = entity_key(cfs_rq, se);
> +	cfs_rq->avg_load -= se->load.weight;
> +	cfs_rq->avg_vruntime -= key * se->load.weight;
> +	cfs_rq->nr_queued--;
> +}
> +
> +static inline
> +void avg_vruntime_update(struct cfs_rq *cfs_rq, s64 delta)
> +{
> +	cfs_rq->avg_vruntime -= cfs_rq->nr_queued * cfs_rq->avg_load * delta;
> +}
> +
> +static u64 avg_vruntime(struct cfs_rq *cfs_rq)
> +{
> +	s64 avg = cfs_rq->avg_vruntime;
> +	long nr_queued = cfs_rq->nr_queued;
> +
> +	if (cfs_rq->curr) {
> +		nr_queued++;
> +		avg += entity_key(cfs_rq, cfs_rq->curr) * cfs_rq->curr->load.weight;
> +	}
> +
> +	avg >>= NICE_0_SHIFT;
> +
> +	if (nr_queued)
> +		avg = div_s64(avg, nr_queued);
> +
> +	return cfs_rq->min_vruntime + avg;
> +}
> +
> +static void __update_min_vruntime(struct cfs_rq *cfs_rq, u64 vruntime)
> +{
> +	/*
> +	 * open coded max_vruntime() to allow updating avg_vruntime
> +	 */
> +	s64 delta = (s64)(vruntime - cfs_rq->min_vruntime);
> +	if (delta > 0) {
> +		avg_vruntime_update(cfs_rq, delta);
> +		cfs_rq->min_vruntime = vruntime;
> +	}
> +}
> +
>  static void update_min_vruntime(struct cfs_rq *cfs_rq)
>  {
>  	u64 vruntime = cfs_rq->min_vruntime;
> @@ -289,7 +343,7 @@ static void update_min_vruntime(struct c
>  			vruntime = min_vruntime(vruntime, se->vruntime);
>  	}
>  
> -	cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime, vruntime);
> +	__update_min_vruntime(cfs_rq, vruntime);
>  }
>  
>  /*
> @@ -303,6 +357,8 @@ static void __enqueue_entity(struct cfs_
>  	s64 key = entity_key(cfs_rq, se);
>  	int leftmost = 1;
>  
> +	avg_vruntime_add(cfs_rq, se);
> +
>  	/*
>  	 * Find the right place in the rbtree:
>  	 */
> @@ -345,6 +401,7 @@ static void __dequeue_entity(struct cfs_
>  		cfs_rq->next = NULL;
>  
>  	rb_erase(&se->run_node, &cfs_rq->tasks_timeline);
> +	avg_vruntime_sub(cfs_rq, se);
>  }
>  
>  static inline struct rb_node *first_fair(struct cfs_rq *cfs_rq)
> 

