* [PATCH 0/9] Various sched patches
@ 2014-01-21 11:17 Peter Zijlstra
  2014-01-21 11:17 ` [PATCH 1/9] sched: Remove cpu parameter for idle_balance() Peter Zijlstra
                   ` (8 more replies)
  0 siblings, 9 replies; 23+ messages in thread
From: Peter Zijlstra @ 2014-01-21 11:17 UTC (permalink / raw)
  To: linux-kernel; +Cc: mingo, daniel.lezcano, pjt, bsegall, Peter Zijlstra

Hi,

A set of recent patches by Daniel reminded me of a series I did a while back.
So here's a combined set.

Paul, Ben, could you guys have a careful look at the one optimizing
pick_next_task_fair() (patch 8)? That patch is somewhat big and scary,
although it does appear to do what it says.

I've not yet tested the bandwidth aspect at all.



* [PATCH 1/9] sched: Remove cpu parameter for idle_balance()
  2014-01-21 11:17 [PATCH 0/9] Various sched patches Peter Zijlstra
@ 2014-01-21 11:17 ` Peter Zijlstra
  2014-01-21 11:17 ` [PATCH 2/9] sched: Fix race in idle_balance() Peter Zijlstra
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2014-01-21 11:17 UTC (permalink / raw)
  To: linux-kernel; +Cc: mingo, daniel.lezcano, pjt, bsegall, Peter Zijlstra

[-- Attachment #1: daniel_lezcano-1_sched-remove__cpu__parameter_for_idle_balance.patch --]
[-- Type: text/plain, Size: 1832 bytes --]

From: Daniel Lezcano <daniel.lezcano@linaro.org>

The cpu parameter passed to idle_balance() is not needed, as it can
be retrieved from the struct rq.
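
For reference, a minimal sketch of the relationship being relied on
(illustrative only, not part of the patch; rq_cpu() is a made-up name):
under CONFIG_SMP, where the real idle_balance() lives, struct rq already
records which cpu it belongs to.

	static inline int rq_cpu(struct rq *rq)
	{
		return rq->cpu;		/* present under CONFIG_SMP */
	}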

Cc: alex.shi@linaro.org
Cc: peterz@infradead.org
Cc: mingo@kernel.org
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1389949444-14821-1-git-send-email-daniel.lezcano@linaro.org
---
 kernel/sched/core.c  |    2 +-
 kernel/sched/fair.c  |    3 ++-
 kernel/sched/sched.h |    2 +-
 3 files changed, 4 insertions(+), 3 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2680,7 +2680,7 @@ static void __sched __schedule(void)
 	pre_schedule(rq, prev);
 
 	if (unlikely(!rq->nr_running))
-		idle_balance(cpu, rq);
+		idle_balance(rq);
 
 	put_prev_task(rq, prev);
 	next = pick_next_task(rq);
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6359,12 +6359,13 @@ static int load_balance(int this_cpu, st
  * idle_balance is called by schedule() if this_cpu is about to become
  * idle. Attempts to pull tasks from other CPUs.
  */
-void idle_balance(int this_cpu, struct rq *this_rq)
+void idle_balance(struct rq *this_rq)
 {
 	struct sched_domain *sd;
 	int pulled_task = 0;
 	unsigned long next_balance = jiffies + HZ;
 	u64 curr_cost = 0;
+	int this_cpu = this_rq->cpu;
 
 	this_rq->idle_stamp = rq_clock(this_rq);
 
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1176,7 +1176,7 @@ extern const struct sched_class idle_sch
 extern void update_group_power(struct sched_domain *sd, int cpu);
 
 extern void trigger_load_balance(struct rq *rq);
-extern void idle_balance(int this_cpu, struct rq *this_rq);
+extern void idle_balance(struct rq *this_rq);
 
 extern void idle_enter_fair(struct rq *this_rq);
 extern void idle_exit_fair(struct rq *this_rq);




* [PATCH 2/9] sched: Fix race in idle_balance()
  2014-01-21 11:17 [PATCH 0/9] Various sched patches Peter Zijlstra
  2014-01-21 11:17 ` [PATCH 1/9] sched: Remove cpu parameter for idle_balance() Peter Zijlstra
@ 2014-01-21 11:17 ` Peter Zijlstra
  2014-01-21 11:17 ` [PATCH 3/9] sched: Move idle_stamp up to the core Peter Zijlstra
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2014-01-21 11:17 UTC (permalink / raw)
  To: linux-kernel; +Cc: mingo, daniel.lezcano, pjt, bsegall, Peter Zijlstra

[-- Attachment #1: daniel_lezcano-2_sched-fix_race_in_idle_balance.patch --]
[-- Type: text/plain, Size: 1538 bytes --]

From: Daniel Lezcano <daniel.lezcano@linaro.org>

The scheduler's main function, schedule(), checks whether there are any
tasks left on the runqueue. It then calls idle_balance() to see whether
a task should be pulled onto the current runqueue, assuming the cpu will
otherwise go idle.

However, idle_balance() releases rq->lock in order to look up the sched
domains and re-takes it right afterwards. That opens a window in which
another cpu may enqueue a task on our runqueue, so we will not actually
go idle even though we have already set idle_stamp as if we would.

Close that window by checking, after re-taking the lock, whether the
runqueue gained a task without us pulling one; if so, return early so we
do not go to idle right after in the __schedule() function.
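
The window, schematically (illustration only, not part of the patch):

	__schedule() on this cpu                another cpu
	------------------------                -----------
	rq->nr_running == 0
	idle_balance(rq)
	  rq->idle_stamp = rq_clock(rq)
	  raw_spin_unlock(&rq->lock)
	                                        enqueues a task on rq
	  raw_spin_lock(&rq->lock)
	pick_next_task() finds the new task, so we do not go idle,
	yet idle_stamp was already set as if we would.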

Cc: alex.shi@linaro.org
Cc: peterz@infradead.org
Cc: mingo@kernel.org
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1389949444-14821-2-git-send-email-daniel.lezcano@linaro.org
---
 kernel/sched/fair.c |    7 +++++++
 1 file changed, 7 insertions(+)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6417,6 +6417,13 @@ void idle_balance(struct rq *this_rq)
 
 	raw_spin_lock(&this_rq->lock);
 
+	/*
+	 * While browsing the domains, we released the rq lock.
+	 * A task could have be enqueued in the meantime
+	 */
+	if (this_rq->nr_running && !pulled_task)
+		return;
+
 	if (pulled_task || time_after(jiffies, this_rq->next_balance)) {
 		/*
 		 * We are going idle. next_balance may be set based on




* [PATCH 3/9] sched: Move idle_stamp up to the core
  2014-01-21 11:17 [PATCH 0/9] Various sched patches Peter Zijlstra
  2014-01-21 11:17 ` [PATCH 1/9] sched: Remove cpu parameter for idle_balance() Peter Zijlstra
  2014-01-21 11:17 ` [PATCH 2/9] sched: Fix race in idle_balance() Peter Zijlstra
@ 2014-01-21 11:17 ` Peter Zijlstra
  2014-01-23 12:58   ` Peter Zijlstra
  2014-01-21 11:17 ` [PATCH 4/9] sched: Clean up idle task SMP logic Peter Zijlstra
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2014-01-21 11:17 UTC (permalink / raw)
  To: linux-kernel; +Cc: mingo, daniel.lezcano, pjt, bsegall, Peter Zijlstra

[-- Attachment #1: daniel_lezcano-3_sched-move_idle_stamp_up_to_the_core.patch --]
[-- Type: text/plain, Size: 3122 bytes --]

From: Daniel Lezcano <daniel.lezcano@linaro.org>

idle_balance() modifies the idle_stamp field of the rq, which means this
information is shared between core.c and fair.c. Since, with the
previous patch, we can tell whether the cpu is going idle or not,
encapsulate idle_stamp in core.c by moving its handling up to the
caller. idle_balance() now returns true when a balance occurred and the
cpu will not go idle, and false when no balance happened and the cpu is
going idle.

Cc: alex.shi@linaro.org
Cc: peterz@infradead.org
Cc: mingo@kernel.org
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1389949444-14821-3-git-send-email-daniel.lezcano@linaro.org
---
 kernel/sched/core.c  |    2 +-
 kernel/sched/fair.c  |   14 ++++++--------
 kernel/sched/sched.h |    2 +-
 3 files changed, 8 insertions(+), 10 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2680,7 +2680,7 @@ static void __sched __schedule(void)
 	pre_schedule(rq, prev);
 
 	if (unlikely(!rq->nr_running))
-		idle_balance(rq);
+		rq->idle_stamp = idle_balance(rq) ? 0 : rq_clock(rq);
 
 	put_prev_task(rq, prev);
 	next = pick_next_task(rq);
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6359,7 +6359,7 @@ static int load_balance(int this_cpu, st
  * idle_balance is called by schedule() if this_cpu is about to become
  * idle. Attempts to pull tasks from other CPUs.
  */
-void idle_balance(struct rq *this_rq)
+int idle_balance(struct rq *this_rq)
 {
 	struct sched_domain *sd;
 	int pulled_task = 0;
@@ -6367,10 +6367,8 @@ void idle_balance(struct rq *this_rq)
 	u64 curr_cost = 0;
 	int this_cpu = this_rq->cpu;
 
-	this_rq->idle_stamp = rq_clock(this_rq);
-
 	if (this_rq->avg_idle < sysctl_sched_migration_cost)
-		return;
+		return 0;
 
 	/*
 	 * Drop the rq->lock, but keep IRQ/preempt disabled.
@@ -6408,10 +6406,8 @@ void idle_balance(struct rq *this_rq)
 		interval = msecs_to_jiffies(sd->balance_interval);
 		if (time_after(next_balance, sd->last_balance + interval))
 			next_balance = sd->last_balance + interval;
-		if (pulled_task) {
-			this_rq->idle_stamp = 0;
+		if (pulled_task)
 			break;
-		}
 	}
 	rcu_read_unlock();
 
@@ -6422,7 +6418,7 @@ void idle_balance(struct rq *this_rq)
 	 * A task could have be enqueued in the meantime
 	 */
 	if (this_rq->nr_running && !pulled_task)
-		return;
+		return 1;
 
 	if (pulled_task || time_after(jiffies, this_rq->next_balance)) {
 		/*
@@ -6434,6 +6430,8 @@ void idle_balance(struct rq *this_rq)
 
 	if (curr_cost > this_rq->max_idle_balance_cost)
 		this_rq->max_idle_balance_cost = curr_cost;
+
+	return pulled_task;
 }
 
 /*
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1176,7 +1176,7 @@ extern const struct sched_class idle_sch
 extern void update_group_power(struct sched_domain *sd, int cpu);
 
 extern void trigger_load_balance(struct rq *rq);
-extern void idle_balance(struct rq *this_rq);
+extern int idle_balance(struct rq *this_rq);
 
 extern void idle_enter_fair(struct rq *this_rq);
 extern void idle_exit_fair(struct rq *this_rq);




* [PATCH 4/9] sched: Clean up idle task SMP logic
  2014-01-21 11:17 [PATCH 0/9] Various sched patches Peter Zijlstra
                   ` (2 preceding siblings ...)
  2014-01-21 11:17 ` [PATCH 3/9] sched: Move idle_stamp up to the core Peter Zijlstra
@ 2014-01-21 11:17 ` Peter Zijlstra
  2014-01-21 17:27   ` Vincent Guittot
  2014-01-21 11:17 ` [PATCH 5/9] sched/fair: Track cgroup depth Peter Zijlstra
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2014-01-21 11:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, daniel.lezcano, pjt, bsegall, Vincent Guittot, Peter Zijlstra

[-- Attachment #1: peter_zijlstra-sched-clean_up_idle_task_smp_logic.patch --]
[-- Type: text/plain, Size: 2469 bytes --]

The idle post_schedule hook is just a vile waste of time, fix it proper.

Cc: alex.shi@linaro.org
Cc: mingo@kernel.org
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140117140939.GH11314@laptop.programming.kicks-ass.net
---
 kernel/sched/fair.c      |    5 +++--
 kernel/sched/idle_task.c |   21 ++++++---------------
 2 files changed, 9 insertions(+), 17 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2416,7 +2416,8 @@ void idle_exit_fair(struct rq *this_rq)
 	update_rq_runnable_avg(this_rq, 0);
 }
 
-#else
+#else /* CONFIG_SMP */
+
 static inline void update_entity_load_avg(struct sched_entity *se,
 					  int update_cfs_rq) {}
 static inline void update_rq_runnable_avg(struct rq *rq, int runnable) {}
@@ -2428,7 +2429,7 @@ static inline void dequeue_entity_load_a
 					   int sleep) {}
 static inline void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq,
 					      int force_update) {}
-#endif
+#endif /* CONFIG_SMP */
 
 static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
--- a/kernel/sched/idle_task.c
+++ b/kernel/sched/idle_task.c
@@ -13,18 +13,8 @@ select_task_rq_idle(struct task_struct *
 {
 	return task_cpu(p); /* IDLE tasks as never migrated */
 }
-
-static void pre_schedule_idle(struct rq *rq, struct task_struct *prev)
-{
-	idle_exit_fair(rq);
-	rq_last_tick_reset(rq);
-}
-
-static void post_schedule_idle(struct rq *rq)
-{
-	idle_enter_fair(rq);
-}
 #endif /* CONFIG_SMP */
+
 /*
  * Idle tasks are unconditionally rescheduled:
  */
@@ -37,8 +27,7 @@ static struct task_struct *pick_next_tas
 {
 	schedstat_inc(rq, sched_goidle);
 #ifdef CONFIG_SMP
-	/* Trigger the post schedule to do an idle_enter for CFS */
-	rq->post_schedule = 1;
+	idle_enter_fair(rq);
 #endif
 	return rq->idle;
 }
@@ -58,6 +47,10 @@ dequeue_task_idle(struct rq *rq, struct
 
 static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
 {
+#ifdef CONFIG_SMP
+	idle_exit_fair(rq);
+	rq_last_tick_reset(rq);
+#endif
 }
 
 static void task_tick_idle(struct rq *rq, struct task_struct *curr, int queued)
@@ -101,8 +94,6 @@ const struct sched_class idle_sched_clas
 
 #ifdef CONFIG_SMP
 	.select_task_rq		= select_task_rq_idle,
-	.pre_schedule		= pre_schedule_idle,
-	.post_schedule		= post_schedule_idle,
 #endif
 
 	.set_curr_task          = set_curr_task_idle,




* [PATCH 5/9] sched/fair: Track cgroup depth
  2014-01-21 11:17 [PATCH 0/9] Various sched patches Peter Zijlstra
                   ` (3 preceding siblings ...)
  2014-01-21 11:17 ` [PATCH 4/9] sched: Clean up idle task SMP logic Peter Zijlstra
@ 2014-01-21 11:17 ` Peter Zijlstra
  2014-01-21 11:18 ` [PATCH 6/9] sched: Push put_prev_task() into pick_next_task() Peter Zijlstra
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2014-01-21 11:17 UTC (permalink / raw)
  To: linux-kernel; +Cc: mingo, daniel.lezcano, pjt, bsegall, Peter Zijlstra

[-- Attachment #1: peter_zijlstra-sched-optimize_cgroup_pick_next_task_fair_1.patch --]
[-- Type: text/plain, Size: 4300 bytes --]

Track each entity's depth in the cgroup tree; this is useful for things
like find_matching_se(), where you need to walk up to a common parent of
two sched entities.

Keeping the depth cached avoids having to calculate it on the spot,
which saves a number of possible cache misses.
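
For illustration, a minimal sketch of the lock-step climb this enables
(a and b are placeholder names for two sched entities; not part of the
patch, which does the equivalent in find_matching_se()):

	/* bring both entities up to the same depth ... */
	while (a->depth > b->depth)
		a = parent_entity(a);
	while (b->depth > a->depth)
		b = parent_entity(b);

	/* ... then walk them up together until they share a cfs_rq */
	while (!is_same_group(a, b)) {
		a = parent_entity(a);
		b = parent_entity(b);
	}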

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1328936700.2476.17.camel@laptop
---
 include/linux/sched.h |    1 +
 kernel/sched/fair.c   |   47 +++++++++++++++++++++--------------------------
 2 files changed, 22 insertions(+), 26 deletions(-)
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1064,6 +1064,7 @@ struct sched_entity {
 #endif
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
+	int			depth;
 	struct sched_entity	*parent;
 	/* rq on which this entity is (to be) queued: */
 	struct cfs_rq		*cfs_rq;
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -322,13 +322,13 @@ static inline void list_del_leaf_cfs_rq(
 	list_for_each_entry_rcu(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list)
 
 /* Do the two (enqueued) entities belong to the same group ? */
-static inline int
+static inline struct cfs_rq *
 is_same_group(struct sched_entity *se, struct sched_entity *pse)
 {
 	if (se->cfs_rq == pse->cfs_rq)
-		return 1;
+		return se->cfs_rq;
 
-	return 0;
+	return NULL;
 }
 
 static inline struct sched_entity *parent_entity(struct sched_entity *se)
@@ -336,17 +336,6 @@ static inline struct sched_entity *paren
 	return se->parent;
 }
 
-/* return depth at which a sched entity is present in the hierarchy */
-static inline int depth_se(struct sched_entity *se)
-{
-	int depth = 0;
-
-	for_each_sched_entity(se)
-		depth++;
-
-	return depth;
-}
-
 static void
 find_matching_se(struct sched_entity **se, struct sched_entity **pse)
 {
@@ -360,8 +349,8 @@ find_matching_se(struct sched_entity **s
 	 */
 
 	/* First walk up until both entities are at same depth */
-	se_depth = depth_se(*se);
-	pse_depth = depth_se(*pse);
+	se_depth = (*se)->depth;
+	pse_depth = (*pse)->depth;
 
 	while (se_depth > pse_depth) {
 		se_depth--;
@@ -426,10 +415,10 @@ static inline void list_del_leaf_cfs_rq(
 #define for_each_leaf_cfs_rq(rq, cfs_rq) \
 		for (cfs_rq = &rq->cfs; cfs_rq; cfs_rq = NULL)
 
-static inline int
+static inline struct cfs_rq *
 is_same_group(struct sched_entity *se, struct sched_entity *pse)
 {
-	return 1;
+	return cfs_rq_of(se); /* always the same rq */
 }
 
 static inline struct sched_entity *parent_entity(struct sched_entity *se)
@@ -7091,7 +7080,9 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
 #ifdef CONFIG_FAIR_GROUP_SCHED
 static void task_move_group_fair(struct task_struct *p, int on_rq)
 {
+	struct sched_entity *se = &p->se;
 	struct cfs_rq *cfs_rq;
+
 	/*
 	 * If the task was not on the rq at the time of this cgroup movement
 	 * it must have been asleep, sleeping tasks keep their ->vruntime
@@ -7117,23 +7108,24 @@ static void task_move_group_fair(struct
 	 * To prevent boost or penalty in the new cfs_rq caused by delta
 	 * min_vruntime between the two cfs_rqs, we skip vruntime adjustment.
 	 */
-	if (!on_rq && (!p->se.sum_exec_runtime || p->state == TASK_WAKING))
+	if (!on_rq && (!se->sum_exec_runtime || p->state == TASK_WAKING))
 		on_rq = 1;
 
 	if (!on_rq)
-		p->se.vruntime -= cfs_rq_of(&p->se)->min_vruntime;
+		se->vruntime -= cfs_rq_of(se)->min_vruntime;
 	set_task_rq(p, task_cpu(p));
+	se->depth = se->parent ? se->parent->depth + 1 : 0;
 	if (!on_rq) {
-		cfs_rq = cfs_rq_of(&p->se);
-		p->se.vruntime += cfs_rq->min_vruntime;
+		cfs_rq = cfs_rq_of(se);
+		se->vruntime += cfs_rq->min_vruntime;
 #ifdef CONFIG_SMP
 		/*
 		 * migrate_task_rq_fair() will have removed our previous
 		 * contribution, but we must synchronize for ongoing future
 		 * decay.
 		 */
-		p->se.avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
-		cfs_rq->blocked_load_avg += p->se.avg.load_avg_contrib;
+		se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
+		cfs_rq->blocked_load_avg += se->avg.load_avg_contrib;
 #endif
 	}
 }
@@ -7229,10 +7221,13 @@ void init_tg_cfs_entry(struct task_group
 	if (!se)
 		return;
 
-	if (!parent)
+	if (!parent) {
 		se->cfs_rq = &rq->cfs;
-	else
+		se->depth = 0;
+	} else {
 		se->cfs_rq = parent->my_q;
+		se->depth = parent->depth + 1;
+	}
 
 	se->my_q = cfs_rq;
 	/* guarantee group entities always have weight */




* [PATCH 6/9] sched: Push put_prev_task() into pick_next_task()
  2014-01-21 11:17 [PATCH 0/9] Various sched patches Peter Zijlstra
                   ` (4 preceding siblings ...)
  2014-01-21 11:17 ` [PATCH 5/9] sched/fair: Track cgroup depth Peter Zijlstra
@ 2014-01-21 11:18 ` Peter Zijlstra
  2014-01-21 21:46   ` bsegall
  2014-01-21 11:18 ` [PATCH 7/9] sched/fair: Clean up __clear_buddies_* Peter Zijlstra
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2014-01-21 11:18 UTC (permalink / raw)
  To: linux-kernel; +Cc: mingo, daniel.lezcano, pjt, bsegall, Peter Zijlstra

[-- Attachment #1: peter_zijlstra-sched-optimize_cgroup_pick_next_task_fair_2.patch --]
[-- Type: text/plain, Size: 5933 bytes --]

In order to avoid having to do put/set on a whole cgroup hierarchy
when we context switch, push the put into pick_next_task() so that
both operations are in the same function. Further changes then allow
us to possibly optimize away redundant work.
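
To make the new contract concrete, a minimal sketch of what a class's
pick_next_task() is now expected to do (pick_next_task_foo() and
foo_find_runnable() are made-up names; the real per-class changes are in
the hunks below):

	static struct task_struct *
	pick_next_task_foo(struct rq *rq, struct task_struct *prev)
	{
		struct task_struct *p = foo_find_runnable(rq);

		if (!p)
			return NULL;	/* leave @prev alone; a later class puts it */

		if (prev)		/* may be NULL, e.g. from migrate_tasks() */
			prev->sched_class->put_prev_task(rq, prev);

		return p;
	}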

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1328936700.2476.17.camel@laptop
---
 kernel/sched/core.c      |   21 ++++++++-------------
 kernel/sched/deadline.c  |    2 +-
 kernel/sched/fair.c      |    6 +++++-
 kernel/sched/idle_task.c |    6 +++++-
 kernel/sched/rt.c        |   27 ++++++++++++++++-----------
 kernel/sched/sched.h     |    8 +++++++-
 kernel/sched/stop_task.c |    5 ++++-
 7 files changed, 46 insertions(+), 29 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2554,18 +2554,11 @@ static inline void schedule_debug(struct
 	schedstat_inc(this_rq(), sched_count);
 }
 
-static void put_prev_task(struct rq *rq, struct task_struct *prev)
-{
-	if (prev->on_rq || rq->skip_clock_update < 0)
-		update_rq_clock(rq);
-	prev->sched_class->put_prev_task(rq, prev);
-}
-
 /*
  * Pick up the highest-prio task:
  */
 static inline struct task_struct *
-pick_next_task(struct rq *rq)
+pick_next_task(struct rq *rq, struct task_struct *prev)
 {
 	const struct sched_class *class;
 	struct task_struct *p;
@@ -2575,13 +2568,13 @@ pick_next_task(struct rq *rq)
 	 * the fair class we can call that function directly:
 	 */
 	if (likely(rq->nr_running == rq->cfs.h_nr_running)) {
-		p = fair_sched_class.pick_next_task(rq);
+		p = fair_sched_class.pick_next_task(rq, prev);
 		if (likely(p))
 			return p;
 	}
 
 	for_each_class(class) {
-		p = class->pick_next_task(rq);
+		p = class->pick_next_task(rq, prev);
 		if (p)
 			return p;
 	}
@@ -2682,8 +2675,10 @@ static void __sched __schedule(void)
 	if (unlikely(!rq->nr_running))
 		rq->idle_stamp = idle_balance(rq) ? 0 : rq_clock(rq);
 
-	put_prev_task(rq, prev);
-	next = pick_next_task(rq);
+	if (prev->on_rq || rq->skip_clock_update < 0)
+		update_rq_clock(rq);
+
+	next = pick_next_task(rq, prev);
 	clear_tsk_need_resched(prev);
 	clear_preempt_need_resched();
 	rq->skip_clock_update = 0;
@@ -4725,7 +4720,7 @@ static void migrate_tasks(unsigned int d
 		if (rq->nr_running == 1)
 			break;
 
-		next = pick_next_task(rq);
+		next = pick_next_task(rq, NULL);
 		BUG_ON(!next);
 		next->sched_class->put_prev_task(rq, next);
 
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -989,7 +989,7 @@ static struct sched_dl_entity *pick_next
 	return rb_entry(left, struct sched_dl_entity, rb_node);
 }
 
-struct task_struct *pick_next_task_dl(struct rq *rq)
+struct task_struct *pick_next_task_dl(struct rq *rq, struct task_struct *prev)
 {
 	struct sched_dl_entity *dl_se;
 	struct task_struct *p;
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4484,7 +4484,8 @@ static void check_preempt_wakeup(struct
 		set_last_buddy(se);
 }
 
-static struct task_struct *pick_next_task_fair(struct rq *rq)
+static struct task_struct *
+pick_next_task_fair(struct rq *rq, struct task_struct *prev)
 {
 	struct task_struct *p;
 	struct cfs_rq *cfs_rq = &rq->cfs;
@@ -4493,6 +4494,9 @@ static struct task_struct *pick_next_tas
 	if (!cfs_rq->nr_running)
 		return NULL;
 
+	if (prev)
+		prev->sched_class->put_prev_task(rq, prev);
+
 	do {
 		se = pick_next_entity(cfs_rq);
 		set_next_entity(cfs_rq, se);
--- a/kernel/sched/idle_task.c
+++ b/kernel/sched/idle_task.c
@@ -23,8 +23,12 @@ static void check_preempt_curr_idle(stru
 	resched_task(rq->idle);
 }
 
-static struct task_struct *pick_next_task_idle(struct rq *rq)
+static struct task_struct *
+pick_next_task_idle(struct rq *rq, struct task_struct *prev)
 {
+	if (prev)
+		prev->sched_class->put_prev_task(rq, prev);
+
 	schedstat_inc(rq, sched_goidle);
 #ifdef CONFIG_SMP
 	idle_enter_fair(rq);
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1310,15 +1310,7 @@ static struct task_struct *_pick_next_ta
 {
 	struct sched_rt_entity *rt_se;
 	struct task_struct *p;
-	struct rt_rq *rt_rq;
-
-	rt_rq = &rq->rt;
-
-	if (!rt_rq->rt_nr_running)
-		return NULL;
-
-	if (rt_rq_throttled(rt_rq))
-		return NULL;
+	struct rt_rq *rt_rq  = &rq->rt;
 
 	do {
 		rt_se = pick_next_rt_entity(rq, rt_rq);
@@ -1332,9 +1324,22 @@ static struct task_struct *_pick_next_ta
 	return p;
 }
 
-static struct task_struct *pick_next_task_rt(struct rq *rq)
+static struct task_struct *
+pick_next_task_rt(struct rq *rq, struct task_struct *prev)
 {
-	struct task_struct *p = _pick_next_task_rt(rq);
+	struct task_struct *p;
+	struct rt_rq *rt_rq = &rq->rt;
+
+	if (!rt_rq->rt_nr_running)
+		return NULL;
+
+	if (rt_rq_throttled(rt_rq))
+		return NULL;
+
+	if (prev)
+		prev->sched_class->put_prev_task(rq, prev);
+
+	p = _pick_next_task_rt(rq);
 
 	/* The running task is never eligible for pushing */
 	if (p)
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1123,7 +1123,13 @@ struct sched_class {
 
 	void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);
 
-	struct task_struct * (*pick_next_task) (struct rq *rq);
+	/*
+	 * It is the responsibility of the pick_next_task() method that will
+	 * return the next task to call put_prev_task() on the @prev task or
+	 * something equivalent.
+	 */
+	struct task_struct * (*pick_next_task) (struct rq *rq,
+						struct task_struct *prev);
 	void (*put_prev_task) (struct rq *rq, struct task_struct *p);
 
 #ifdef CONFIG_SMP
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -23,11 +23,14 @@ check_preempt_curr_stop(struct rq *rq, s
 	/* we're never preempted */
 }
 
-static struct task_struct *pick_next_task_stop(struct rq *rq)
+static struct task_struct *
+pick_next_task_stop(struct rq *rq, struct task_struct *prev)
 {
 	struct task_struct *stop = rq->stop;
 
 	if (stop && stop->on_rq) {
+		if (prev)
+			prev->sched_class->put_prev_task(rq, prev);
 		stop->se.exec_start = rq_clock_task(rq);
 		return stop;
 	}




* [PATCH 7/9] sched/fair: Clean up __clear_buddies_*
  2014-01-21 11:17 [PATCH 0/9] Various sched patches Peter Zijlstra
                   ` (5 preceding siblings ...)
  2014-01-21 11:18 ` [PATCH 6/9] sched: Push put_prev_task() into pick_next_task() Peter Zijlstra
@ 2014-01-21 11:18 ` Peter Zijlstra
  2014-01-21 11:18 ` [PATCH 8/9] sched/fair: Optimize cgroup pick_next_task_fair Peter Zijlstra
  2014-01-21 11:18 ` [PATCH 9/9] sched: Use idle task shortcut Peter Zijlstra
  8 siblings, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2014-01-21 11:18 UTC (permalink / raw)
  To: linux-kernel; +Cc: mingo, daniel.lezcano, pjt, bsegall, Peter Zijlstra

[-- Attachment #1: peter_zijlstra-sched-optimize_cgroup_pick_next_task_fair_3.patch --]
[-- Type: text/plain, Size: 1137 bytes --]

Slightly easier code flow; no functional changes.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1328936700.2476.17.camel@laptop
---
 kernel/sched/fair.c |   18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2567,10 +2567,10 @@ static void __clear_buddies_last(struct
 {
 	for_each_sched_entity(se) {
 		struct cfs_rq *cfs_rq = cfs_rq_of(se);
-		if (cfs_rq->last == se)
-			cfs_rq->last = NULL;
-		else
+		if (cfs_rq->last != se)
 			break;
+
+		cfs_rq->last = NULL;
 	}
 }
 
@@ -2578,10 +2578,10 @@ static void __clear_buddies_next(struct
 {
 	for_each_sched_entity(se) {
 		struct cfs_rq *cfs_rq = cfs_rq_of(se);
-		if (cfs_rq->next == se)
-			cfs_rq->next = NULL;
-		else
+		if (cfs_rq->next != se)
 			break;
+
+		cfs_rq->next = NULL;
 	}
 }
 
@@ -2589,10 +2589,10 @@ static void __clear_buddies_skip(struct
 {
 	for_each_sched_entity(se) {
 		struct cfs_rq *cfs_rq = cfs_rq_of(se);
-		if (cfs_rq->skip == se)
-			cfs_rq->skip = NULL;
-		else
+		if (cfs_rq->skip != se)
 			break;
+
+		cfs_rq->skip = NULL;
 	}
 }
 




* [PATCH 8/9] sched/fair: Optimize cgroup pick_next_task_fair
  2014-01-21 11:17 [PATCH 0/9] Various sched patches Peter Zijlstra
                   ` (6 preceding siblings ...)
  2014-01-21 11:18 ` [PATCH 7/9] sched/fair: Clean up __clear_buddies_* Peter Zijlstra
@ 2014-01-21 11:18 ` Peter Zijlstra
  2014-01-21 19:24   ` bsegall
  2014-01-21 11:18 ` [PATCH 9/9] sched: Use idle task shortcut Peter Zijlstra
  8 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2014-01-21 11:18 UTC (permalink / raw)
  To: linux-kernel; +Cc: mingo, daniel.lezcano, pjt, bsegall, Peter Zijlstra

[-- Attachment #1: peter_zijlstra-sched-optimize_cgroup_pick_next_task_fair_4.patch --]
[-- Type: text/plain, Size: 6152 bytes --]

Since commit 2f36825b1 ("sched: Next buddy hint on sleep and preempt
path") it is likely that we pick the next task from the same cgroup;
doing a put and then a set on all intermediate entities is a waste of
time, so try to avoid this.

XXX please review carefully; it's quite horrid.

That said, simple hackbench runs in a 3-deep cgroup hierarchy show a
small but consistent performance increase.
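
A rough picture of the intent (illustrative only, not the patch itself):
when @prev and the newly picked task first share a cfs_rq somewhere up
the hierarchy, only the entities from that common cfs_rq downwards need
a put_prev_entity()/set_next_entity() pair; the entities above it stay
current across the switch.

	          rq->cfs (root)         <- untouched
	                |
	          common cfs_rq          <- put/set meet here
	           /           \
	  prev's entities    next's entities
	  (put_prev_entity)  (set_next_entity)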

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1328936700.2476.17.camel@laptop
---
 kernel/sched/fair.c |  140 +++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 111 insertions(+), 29 deletions(-)
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2738,15 +2738,46 @@ wakeup_preempt_entity(struct sched_entit
  */
 static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
 {
-	struct sched_entity *se = __pick_first_entity(cfs_rq);
-	struct sched_entity *left = se;
+	struct sched_entity *left = __pick_first_entity(cfs_rq);
+	struct sched_entity *se, *curr = cfs_rq->curr;
+
+	/*
+	 * Since its possible we got here without doing put_prev_entity() we
+	 * also have to consider cfs_rq->curr. If it was set, and is still a
+	 * runnable entity, update_curr() will update its vruntime, otherwise
+	 * forget we've ever seen it.
+	 */
+	if (curr) {
+		if (curr->on_rq)
+			update_curr(cfs_rq);
+		else
+			curr = NULL;
+	}
+
+	/*
+	 * If curr is set we have to see if its left of the leftmost entity
+	 * still in the tree, provided there was anything in the tree at all.
+	 */
+	if (!left || (curr && entity_before(curr, left)))
+		left = curr;
+
+	se = left; /* ideally we run the leftmost entity */
 
 	/*
 	 * Avoid running the skip buddy, if running something else can
 	 * be done without getting too unfair.
 	 */
 	if (cfs_rq->skip == se) {
-		struct sched_entity *second = __pick_next_entity(se);
+		struct sched_entity *second;
+
+		if (se == curr) {
+			second = __pick_first_entity(cfs_rq);
+		} else {
+			second = __pick_next_entity(se);
+			if (!second || (curr && entity_before(curr, second)))
+				second = curr;
+		}
+
 		if (second && wakeup_preempt_entity(second, left) < 1)
 			se = second;
 	}
@@ -2768,7 +2799,7 @@ static struct sched_entity *pick_next_en
 	return se;
 }
 
-static void check_cfs_rq_runtime(struct cfs_rq *cfs_rq);
+static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq);
 
 static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
 {
@@ -3423,22 +3454,23 @@ static void check_enqueue_throttle(struc
 }
 
 /* conditionally throttle active cfs_rq's from put_prev_entity() */
-static void check_cfs_rq_runtime(struct cfs_rq *cfs_rq)
+static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
 	if (!cfs_bandwidth_used())
-		return;
+		return false;
 
 	if (likely(!cfs_rq->runtime_enabled || cfs_rq->runtime_remaining > 0))
-		return;
+		return false;
 
 	/*
 	 * it's possible for a throttled entity to be forced into a running
 	 * state (e.g. set_curr_task), in this case we're finished.
 	 */
 	if (cfs_rq_throttled(cfs_rq))
-		return;
+		return true;
 
 	throttle_cfs_rq(cfs_rq);
+	return true;
 }
 
 static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
@@ -3548,7 +3580,7 @@ static inline u64 cfs_rq_clock_task(stru
 }
 
 static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec) {}
-static void check_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
+static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq) { return false; }
 static void check_enqueue_throttle(struct cfs_rq *cfs_rq) {}
 static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
 
@@ -4484,44 +4516,94 @@ static void check_preempt_wakeup(struct
 		set_last_buddy(se);
 }
 
+/*
+ * Account for a descheduled task:
+ */
+static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
+{
+	struct sched_entity *se = &prev->se;
+	struct cfs_rq *cfs_rq;
+
+	for_each_sched_entity(se) {
+		cfs_rq = cfs_rq_of(se);
+		put_prev_entity(cfs_rq, se);
+	}
+}
+
 static struct task_struct *
 pick_next_task_fair(struct rq *rq, struct task_struct *prev)
 {
+	struct sched_entity *se, __maybe_unused *pse;
 	struct task_struct *p;
-	struct cfs_rq *cfs_rq = &rq->cfs;
-	struct sched_entity *se;
+	struct cfs_rq *cfs_rq;
+
+again: __maybe_unused
+	cfs_rq = &rq->cfs;
+
+	if (prev) {
+		if (!IS_ENABLED(CONFIG_FAIR_GROUP_SCHED) ||
+		    (prev->sched_class != &fair_sched_class)) {
+			prev->sched_class->put_prev_task(rq, prev);
+			prev = NULL;
+		}
+	}
 
 	if (!cfs_rq->nr_running)
 		return NULL;
 
-	if (prev)
-		prev->sched_class->put_prev_task(rq, prev);
-
 	do {
 		se = pick_next_entity(cfs_rq);
-		set_next_entity(cfs_rq, se);
+		if (!prev)
+			set_next_entity(cfs_rq, se);
 		cfs_rq = group_cfs_rq(se);
 	} while (cfs_rq);
 
 	p = task_of(se);
-	if (hrtick_enabled(rq))
-		hrtick_start_fair(rq, p);
 
-	return p;
-}
+#ifdef CONFIG_FAIR_GROUP_SCHED
+	/*
+	 * If we haven't yet done put_prev_entity and the selected task is
+	 * a different task than we started out with, try and touch the least
+	 * amount of cfs_rq trees.
+	 */
+	if (prev) {
+		if (prev != p) {
+			pse = &prev->se;
+
+			while (!(cfs_rq = is_same_group(se, pse))) {
+				int se_depth = se->depth;
+				int pse_depth = pse->depth;
+
+				if (se_depth <= pse_depth) {
+					put_prev_entity(cfs_rq_of(pse), pse);
+					pse = parent_entity(pse);
+				}
+				if (se_depth >= pse_depth) {
+					set_next_entity(cfs_rq_of(se), se);
+					se = parent_entity(se);
+				}
+			}
 
-/*
- * Account for a descheduled task:
- */
-static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
-{
-	struct sched_entity *se = &prev->se;
-	struct cfs_rq *cfs_rq;
+			put_prev_entity(cfs_rq, pse);
+			set_next_entity(cfs_rq, se);
+		}
 
-	for_each_sched_entity(se) {
-		cfs_rq = cfs_rq_of(se);
-		put_prev_entity(cfs_rq, se);
+		/*
+		 * In case the common cfs_rq got throttled, just give up and
+		 * put the stack and retry.
+		 */
+		if (unlikely(check_cfs_rq_runtime(cfs_rq))) {
+			put_prev_task_fair(rq, p);
+			prev = NULL;
+			goto again;
+		}
 	}
+#endif
+
+	if (hrtick_enabled(rq))
+		hrtick_start_fair(rq, p);
+
+	return p;
 }
 
 /*




* [PATCH 9/9] sched: Use idle task shortcut
  2014-01-21 11:17 [PATCH 0/9] Various sched patches Peter Zijlstra
                   ` (7 preceding siblings ...)
  2014-01-21 11:18 ` [PATCH 8/9] sched/fair: Optimize cgroup pick_next_task_fair Peter Zijlstra
@ 2014-01-21 11:18 ` Peter Zijlstra
  8 siblings, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2014-01-21 11:18 UTC (permalink / raw)
  To: linux-kernel; +Cc: mingo, daniel.lezcano, pjt, bsegall, Peter Zijlstra

[-- Attachment #1: peter_zijlstra-2_sched-use_idle_task_shortcut.patch --]
[-- Type: text/plain, Size: 2101 bytes --]

With the previous patches there is no ambiguity about whether we are
going idle, so we can return the idle task directly instead of iterating
over all the sched classes, which would in any case return the idle
task.

Cc: alex.shi@linaro.org
Cc: peterz@infradead.org
Cc: mingo@kernel.org
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1389977102-4420-2-git-send-email-daniel.lezcano@linaro.org
---
 kernel/sched/core.c |   39 +++++++++++++++++++++++----------------
 1 file changed, 23 insertions(+), 16 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2563,23 +2563,33 @@ pick_next_task(struct rq *rq, struct tas
 	const struct sched_class *class;
 	struct task_struct *p;
 
+again:
+	if (likely(rq->nr_running)) {
+		/*
+		 * Optimization: we know that if all tasks are in
+		 * the fair class we can call that function directly:
+		 */
+		if (likely(rq->nr_running == rq->cfs.h_nr_running))
+			return fair_sched_class.pick_next_task(rq, prev);
+
+		for_each_class(class) {
+			p = class->pick_next_task(rq, prev);
+			if (p)
+				return p;
+		}
+	}
+
 	/*
-	 * Optimization: we know that if all tasks are in
-	 * the fair class we can call that function directly:
+	 * If there is a task balanced on this cpu, pick the next task,
+	 * otherwise fall in the optimization by picking the idle task
+	 * directly.
 	 */
-	if (likely(rq->nr_running == rq->cfs.h_nr_running)) {
-		p = fair_sched_class.pick_next_task(rq, prev);
-		if (likely(p))
-			return p;
-	}
+	if (idle_balance(rq))
+		goto again;
 
-	for_each_class(class) {
-		p = class->pick_next_task(rq, prev);
-		if (p)
-			return p;
-	}
+	rq->idle_stamp = rq_clock(rq);
 
-	BUG(); /* the idle class will always have a runnable task */
+	return idle_sched_class.pick_next_task(rq, prev);
 }
 
 /*
@@ -2672,9 +2682,6 @@ static void __sched __schedule(void)
 
 	pre_schedule(rq, prev);
 
-	if (unlikely(!rq->nr_running))
-		rq->idle_stamp = idle_balance(rq) ? 0 : rq_clock(rq);
-
 	if (prev->on_rq || rq->skip_clock_update < 0)
 		update_rq_clock(rq);
 




* Re: [PATCH 4/9] sched: Clean up idle task SMP logic
  2014-01-21 11:17 ` [PATCH 4/9] sched: Clean up idle task SMP logic Peter Zijlstra
@ 2014-01-21 17:27   ` Vincent Guittot
  2014-01-23 11:37     ` Peter Zijlstra
  0 siblings, 1 reply; 23+ messages in thread
From: Vincent Guittot @ 2014-01-21 17:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Daniel Lezcano, Paul Turner, Benjamin Segall

On 21 January 2014 12:17, Peter Zijlstra <peterz@infradead.org> wrote:
> The idle post_schedule hook is just a vile waste of time, fix it proper.
>
> Cc: alex.shi@linaro.org
> Cc: mingo@kernel.org
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Tested-by: Daniel Lezcano <daniel.lezcano@linaro.org>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Link: http://lkml.kernel.org/r/20140117140939.GH11314@laptop.programming.kicks-ass.net
> ---
>  kernel/sched/fair.c      |    5 +++--
>  kernel/sched/idle_task.c |   21 ++++++---------------
>  2 files changed, 9 insertions(+), 17 deletions(-)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2416,7 +2416,8 @@ void idle_exit_fair(struct rq *this_rq)
>         update_rq_runnable_avg(this_rq, 0);
>  }
>
> -#else
> +#else /* CONFIG_SMP */
> +
>  static inline void update_entity_load_avg(struct sched_entity *se,
>                                           int update_cfs_rq) {}
>  static inline void update_rq_runnable_avg(struct rq *rq, int runnable) {}
> @@ -2428,7 +2429,7 @@ static inline void dequeue_entity_load_a
>                                            int sleep) {}
>  static inline void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq,
>                                               int force_update) {}
> -#endif
> +#endif /* CONFIG_SMP */
>
>  static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
> --- a/kernel/sched/idle_task.c
> +++ b/kernel/sched/idle_task.c
> @@ -13,18 +13,8 @@ select_task_rq_idle(struct task_struct *
>  {
>         return task_cpu(p); /* IDLE tasks as never migrated */
>  }
> -
> -static void pre_schedule_idle(struct rq *rq, struct task_struct *prev)
> -{
> -       idle_exit_fair(rq);
> -       rq_last_tick_reset(rq);
> -}
> -
> -static void post_schedule_idle(struct rq *rq)
> -{
> -       idle_enter_fair(rq);
> -}
>  #endif /* CONFIG_SMP */
> +
>  /*
>   * Idle tasks are unconditionally rescheduled:
>   */
> @@ -37,8 +27,7 @@ static struct task_struct *pick_next_tas
>  {
>         schedstat_inc(rq, sched_goidle);
>  #ifdef CONFIG_SMP
> -       /* Trigger the post schedule to do an idle_enter for CFS */
> -       rq->post_schedule = 1;
> +       idle_enter_fair(rq);

If I have correctly followed the new code path introduced by the
patchset, idle_enter_fair() is called after idle_balance(), whereas it
must be called before it in order to update the rq's
runnable_avg_sum/period before evaluating whether it is worth pulling a
cfs task.

Regards,
vincent

>  #endif
>         return rq->idle;
>  }
> @@ -58,6 +47,10 @@ dequeue_task_idle(struct rq *rq, struct
>
>  static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
>  {
> +#ifdef CONFIG_SMP
> +       idle_exit_fair(rq);
> +       rq_last_tick_reset(rq);
> +#endif
>  }
>
>  static void task_tick_idle(struct rq *rq, struct task_struct *curr, int queued)
> @@ -101,8 +94,6 @@ const struct sched_class idle_sched_clas
>
>  #ifdef CONFIG_SMP
>         .select_task_rq         = select_task_rq_idle,
> -       .pre_schedule           = pre_schedule_idle,
> -       .post_schedule          = post_schedule_idle,
>  #endif
>
>         .set_curr_task          = set_curr_task_idle,
>
>


* Re: [PATCH 8/9] sched/fair: Optimize cgroup pick_next_task_fair
  2014-01-21 11:18 ` [PATCH 8/9] sched/fair: Optimize cgroup pick_next_task_fair Peter Zijlstra
@ 2014-01-21 19:24   ` bsegall
  2014-01-21 19:37     ` Peter Zijlstra
  0 siblings, 1 reply; 23+ messages in thread
From: bsegall @ 2014-01-21 19:24 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, mingo, daniel.lezcano, pjt

Peter Zijlstra <peterz@infradead.org> writes:
>  static struct task_struct *
>  pick_next_task_fair(struct rq *rq, struct task_struct *prev)
>  {
> +	struct sched_entity *se, __maybe_unused *pse;
>  	struct task_struct *p;
> -	struct cfs_rq *cfs_rq = &rq->cfs;
> -	struct sched_entity *se;
> +	struct cfs_rq *cfs_rq;
> +
> +again: __maybe_unused
> +	cfs_rq = &rq->cfs;
> +
> +	if (prev) {
> +		if (!IS_ENABLED(CONFIG_FAIR_GROUP_SCHED) ||
> +		    (prev->sched_class != &fair_sched_class)) {
> +			prev->sched_class->put_prev_task(rq, prev);
> +			prev = NULL;
> +		}
> +	}
>  
>  	if (!cfs_rq->nr_running)
>  		return NULL;
>  
> -	if (prev)
> -		prev->sched_class->put_prev_task(rq, prev);
> -
>  	do {
>  		se = pick_next_entity(cfs_rq);
> -		set_next_entity(cfs_rq, se);
> +		if (!prev)
> +			set_next_entity(cfs_rq, se);
>  		cfs_rq = group_cfs_rq(se);
>  	} while (cfs_rq);
>  
>  	p = task_of(se);
> -	if (hrtick_enabled(rq))
> -		hrtick_start_fair(rq, p);
>  
> -	return p;
> -}
> +#ifdef CONFIG_FAIR_GROUP_SCHED
> +	/*
> +	 * If we haven't yet done put_prev_entity and the selected task is
> +	 * a different task than we started out with, try and touch the least
> +	 * amount of cfs_rq trees.
> +	 */
> +	if (prev) {
> +		if (prev != p) {
> +			pse = &prev->se;
> +
> +			while (!(cfs_rq = is_same_group(se, pse))) {
> +				int se_depth = se->depth;
> +				int pse_depth = pse->depth;
> +
> +				if (se_depth <= pse_depth) {
> +					put_prev_entity(cfs_rq_of(pse), pse);
> +					pse = parent_entity(pse);
> +				}
> +				if (se_depth >= pse_depth) {
> +					set_next_entity(cfs_rq_of(se), se);
> +					se = parent_entity(se);
> +				}
> +			}
>  
> -/*
> - * Account for a descheduled task:
> - */
> -static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
> -{
> -	struct sched_entity *se = &prev->se;
> -	struct cfs_rq *cfs_rq;
> +			put_prev_entity(cfs_rq, pse);
> +			set_next_entity(cfs_rq, se);
> +		}
>  
> -	for_each_sched_entity(se) {
> -		cfs_rq = cfs_rq_of(se);
> -		put_prev_entity(cfs_rq, se);
> +		/*
> +		 * In case the common cfs_rq got throttled, just give up and
> +		 * put the stack and retry.
> +		 */
> +		if (unlikely(check_cfs_rq_runtime(cfs_rq))) {
> +			put_prev_task_fair(rq, p);
> +			prev = NULL;
> +			goto again;
> +		}

This double-calls put_prev_entity on any non-common cfs_rqs and ses,
which means double __enqueue_entity, among other things. Just doing the
put_prev loop from se->parent should fix that.

However, any sort of abort means that we may have already done
set_next_entity on some children, which even with the changes to
pick_next_entity will cause problems, up to and including double
__dequeue_entity I think.

Also, this way we never do check_cfs_rq_runtime on any parents of the
common cfs_rq, which could even have been the reason for the resched to
begin with. I'm not sure if there would be any problem doing it on the
way down or not, I don't see any problems at a glance.



>  	}
> +#endif
> +
> +	if (hrtick_enabled(rq))
> +		hrtick_start_fair(rq, p);
> +
> +	return p;
>  }
>  
>  /*


* Re: [PATCH 8/9] sched/fair: Optimize cgroup pick_next_task_fair
  2014-01-21 19:24   ` bsegall
@ 2014-01-21 19:37     ` Peter Zijlstra
  2014-01-21 20:03       ` bsegall
  0 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2014-01-21 19:37 UTC (permalink / raw)
  To: bsegall; +Cc: linux-kernel, mingo, daniel.lezcano, pjt

On Tue, Jan 21, 2014 at 11:24:39AM -0800, bsegall@google.com wrote:
> Peter Zijlstra <peterz@infradead.org> writes:

> > +#ifdef CONFIG_FAIR_GROUP_SCHED
> > +	/*
> > +	 * If we haven't yet done put_prev_entity and the selected task is
> > +	 * a different task than we started out with, try and touch the least
> > +	 * amount of cfs_rq trees.
> > +	 */
> > +	if (prev) {
> > +		if (prev != p) {
> > +			pse = &prev->se;
> > +
> > +			while (!(cfs_rq = is_same_group(se, pse))) {
> > +				int se_depth = se->depth;
> > +				int pse_depth = pse->depth;
> > +
> > +				if (se_depth <= pse_depth) {
> > +					put_prev_entity(cfs_rq_of(pse), pse);
> > +					pse = parent_entity(pse);
> > +				}
> > +				if (se_depth >= pse_depth) {
> > +					set_next_entity(cfs_rq_of(se), se);
> > +					se = parent_entity(se);
> > +				}
> > +			}
> >  
> > +			put_prev_entity(cfs_rq, pse);
> > +			set_next_entity(cfs_rq, se);
> > +		}

(A)

> > +		/*
> > +		 * In case the common cfs_rq got throttled, just give up and
> > +		 * put the stack and retry.
> > +		 */
> > +		if (unlikely(check_cfs_rq_runtime(cfs_rq))) {
> > +			put_prev_task_fair(rq, p);
> > +			prev = NULL;
> > +			goto again;
> > +		}
> 
> This double-calls put_prev_entity on any non-common cfs_rqs and ses,
> which means double __enqueue_entity, among other things. Just doing the
> put_prev loop from se->parent should fix that.

I'm not seeing that, so at point (A) we've completely switched over from
@prev to @p, we've put all pse until the common parent and set all se
back to @p.

So if we then do: put_prev_task_fair(rq, p), we simply undo all the
set_next_entity(se) we just did, and continue from the common parent
upwards.

> However, any sort of abort means that we may have already done
> set_next_entity on some children, which even with the changes to
> pick_next_entity will cause problems, up to and including double
> __dequeue_entity I think.

But the abort is only done after we've completely set up @p as the
current task.

Yes, completely tearing it down again is probably a waste, but given
that bandwidth enforcement should be rare, I didn't want to complicate
things even further for rare cases.

> Also, this way we never do check_cfs_rq_runtime on any parents of the
> common cfs_rq, which could even have been the reason for the resched to
> begin with. I'm not sure if there would be any problem doing it on the
> way down or not, I don't see any problems at a glance.

Oh, so we allow a parent to have less runtime than the sum of all its
children?

Indeed, in that case we can miss something... we could try to call
check_cfs_rq_runtime() from the initial top-down selection loop? When
true, just put the entire stack and don't pretend to be smart?


* Re: [PATCH 8/9] sched/fair: Optimize cgroup pick_next_task_fair
  2014-01-21 19:37     ` Peter Zijlstra
@ 2014-01-21 20:03       ` bsegall
  2014-01-21 20:43         ` Peter Zijlstra
  0 siblings, 1 reply; 23+ messages in thread
From: bsegall @ 2014-01-21 20:03 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, mingo, daniel.lezcano, pjt

Peter Zijlstra <peterz@infradead.org> writes:

> On Tue, Jan 21, 2014 at 11:24:39AM -0800, bsegall@google.com wrote:
>> Peter Zijlstra <peterz@infradead.org> writes:
>
>> > +#ifdef CONFIG_FAIR_GROUP_SCHED
>> > +	/*
>> > +	 * If we haven't yet done put_prev_entity and the selected task is
>> > +	 * a different task than we started out with, try and touch the least
>> > +	 * amount of cfs_rq trees.
>> > +	 */
>> > +	if (prev) {
>> > +		if (prev != p) {
>> > +			pse = &prev->se;
>> > +
>> > +			while (!(cfs_rq = is_same_group(se, pse))) {
>> > +				int se_depth = se->depth;
>> > +				int pse_depth = pse->depth;
>> > +
>> > +				if (se_depth <= pse_depth) {
>> > +					put_prev_entity(cfs_rq_of(pse), pse);
>> > +					pse = parent_entity(pse);
>> > +				}
>> > +				if (se_depth >= pse_depth) {
>> > +					set_next_entity(cfs_rq_of(se), se);
>> > +					se = parent_entity(se);
>> > +				}
>> > +			}
>> >  
>> > +			put_prev_entity(cfs_rq, pse);
>> > +			set_next_entity(cfs_rq, se);
>> > +		}
>
> (A)
>
>> > +		/*
>> > +		 * In case the common cfs_rq got throttled, just give up and
>> > +		 * put the stack and retry.
>> > +		 */
>> > +		if (unlikely(check_cfs_rq_runtime(cfs_rq))) {
>> > +			put_prev_task_fair(rq, p);
>> > +			prev = NULL;
>> > +			goto again;
>> > +		}
>> 
>> This double-calls put_prev_entity on any non-common cfs_rqs and ses,
>> which means double __enqueue_entity, among other things. Just doing the
>> put_prev loop from se->parent should fix that.
>
> I'm not seeing that, so at point (A) we've completely switched over from
> @prev to @p, we've put all pse until the common parent and set all se
> back to @p.
>
> So if we then do: put_prev_task_fair(rq, p), we simply undo all the
> set_next_entity(se) we just did, and continue from the common parent
> upwards.
>
>> However, any sort of abort means that we may have already done
>> set_next_entity on some children, which even with the changes to
>> pick_next_entity will cause problems, up to and including double
>> __dequeue_entity I think.
>
> But the abort is only done after we've completely set up @p as the
> current task.
>
> Yes, completely tearing it down again is probably a waste, but given
> that bandwidth enforcement should be rare, I didn't want to complicate
> things even further for rare cases.


Ah, I missed that this was p, not prev. That makes a lot more sense, and
I agree that this should be fine.

>
>> Also, this way we never do check_cfs_rq_runtime on any parents of the
>> common cfs_rq, which could even have been the reason for the resched to
>> begin with. I'm not sure if there would be any problem doing it on the
>> way down or not, I don't see any problems at a glance.
>
> Oh, so we allow a parent to have less runtime than the sum of all its
> children?

Yeah, the check is currently max(children) <= parent, and unthrottled
children are also allowed.

>
> Indeed, in that case we can miss something... we could try to call
> check_cfs_rq_runtime() from the initial top-down selection loop? When
> true, just put the entire stack and don't pretend to be smart?

Yeah, I think that should work. I wasn't sure if there could be a
problem with doing throttle_cfs_rq(parent); throttle_cfs_rq(child);, but
thinking about it, that has to be ok, because schedule can do that if
deactivate throttled the parent, schedule calls update_rq_clock, and
then put_prev throttled the child.


* Re: [PATCH 8/9] sched/fair: Optimize cgroup pick_next_task_fair
  2014-01-21 20:03       ` bsegall
@ 2014-01-21 20:43         ` Peter Zijlstra
  2014-01-21 21:43           ` bsegall
  0 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2014-01-21 20:43 UTC (permalink / raw)
  To: bsegall; +Cc: linux-kernel, mingo, daniel.lezcano, pjt

On Tue, Jan 21, 2014 at 12:03:38PM -0800, bsegall@google.com wrote:

> > Indeed, in that case we can miss something... we could try to call
> > check_cfs_rq_runtime() from the initial top-down selection loop? When
> > true, just put the entire stack and don't pretend to be smart?
> 
> Yeah, I think that should work. I wasn't sure if there could be a
> problem of doing throttle_cfs_rq(parent); throttle_cfs_rq(child);, but
> thinking about, that has to be ok, because schedule can do that if
> deactivate throttled the parent, schedule calls update_rq_clock, and
> then put_prev throttled the child.

Maybe something like the below? Completely untested and everything.

There's the obvious XXX fail that was also in the previous version; not
sure how to deal with that yet: either we need to change the interface
to take struct task_struct **prev, or get smarter :-)

Any other obvious fails in here?

---
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2736,17 +2736,36 @@ wakeup_preempt_entity(struct sched_entit
  * 3) pick the "last" process, for cache locality
  * 4) do not run the "skip" process, if something else is available
  */
-static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
+static struct sched_entity *
+pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 {
-	struct sched_entity *se = __pick_first_entity(cfs_rq);
-	struct sched_entity *left = se;
+	struct sched_entity *left = __pick_first_entity(cfs_rq);
+	struct sched_entity *se;
+
+	/*
+	 * If curr is set we have to see if its left of the leftmost entity
+	 * still in the tree, provided there was anything in the tree at all.
+	 */
+	if (!left || (curr && entity_before(curr, left)))
+		left = curr;
+
+	se = left; /* ideally we run the leftmost entity */
 
 	/*
 	 * Avoid running the skip buddy, if running something else can
 	 * be done without getting too unfair.
 	 */
 	if (cfs_rq->skip == se) {
-		struct sched_entity *second = __pick_next_entity(se);
+		struct sched_entity *second;
+
+		if (se == curr) {
+			second = __pick_first_entity(cfs_rq);
+		} else {
+			second = __pick_next_entity(se);
+			if (!second || (curr && entity_before(curr, second)))
+				second = curr;
+		}
+
 		if (second && wakeup_preempt_entity(second, left) < 1)
 			se = second;
 	}
@@ -2768,7 +2787,7 @@ static struct sched_entity *pick_next_en
 	return se;
 }
 
-static void check_cfs_rq_runtime(struct cfs_rq *cfs_rq);
+static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq);
 
 static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
 {
@@ -3423,22 +3442,23 @@ static void check_enqueue_throttle(struc
 }
 
 /* conditionally throttle active cfs_rq's from put_prev_entity() */
-static void check_cfs_rq_runtime(struct cfs_rq *cfs_rq)
+static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
 	if (!cfs_bandwidth_used())
-		return;
+		return false;
 
 	if (likely(!cfs_rq->runtime_enabled || cfs_rq->runtime_remaining > 0))
-		return;
+		return false;
 
 	/*
 	 * it's possible for a throttled entity to be forced into a running
 	 * state (e.g. set_curr_task), in this case we're finished.
 	 */
 	if (cfs_rq_throttled(cfs_rq))
-		return;
+		return true;
 
 	throttle_cfs_rq(cfs_rq);
+	return true;
 }
 
 static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
@@ -3548,7 +3568,7 @@ static inline u64 cfs_rq_clock_task(stru
 }
 
 static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec) {}
-static void check_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
+static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq) { return false; }
 static void check_enqueue_throttle(struct cfs_rq *cfs_rq) {}
 static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
 
@@ -4484,26 +4504,109 @@ static void check_preempt_wakeup(struct
 		set_last_buddy(se);
 }
 
+/*
+ * Account for a descheduled task:
+ */
+static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
+{
+	struct sched_entity *se = &prev->se;
+	struct cfs_rq *cfs_rq;
+
+	for_each_sched_entity(se) {
+		cfs_rq = cfs_rq_of(se);
+		put_prev_entity(cfs_rq, se);
+	}
+}
+
 static struct task_struct *
 pick_next_task_fair(struct rq *rq, struct task_struct *prev)
 {
-	struct task_struct *p;
 	struct cfs_rq *cfs_rq = &rq->cfs;
 	struct sched_entity *se;
+	struct task_struct *p;
 
 	if (!cfs_rq->nr_running)
 		return NULL;
 
-	if (prev)
+	if (!IS_ENABLED(CONFIG_FAIR_GROUP_SCHED) ||
+	    (prev->sched_class != &fair_sched_class)) {
 		prev->sched_class->put_prev_task(rq, prev);
+		prev = NULL;
+	}
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+	if (!prev)
+		goto simple;
 
 	do {
-		se = pick_next_entity(cfs_rq);
+		struct sched_entity *curr = cfs_rq->curr;
+
+		/*
+		 * Since we got here without doing put_prev_entity() we also
+		 * have to consider cfs_rq->curr. If it is still a runnable
+		 * entity, update_curr() will update its vruntime, otherwise
+		 * forget we've ever seen it.
+		 */
+		if (curr->on_rq)
+			update_curr(cfs_rq);
+		else
+			curr = NULL;
+
+		if (unlikely(check_cfs_rq_runtime(cfs_rq))) {
+			put_prev_task_fair(rq, prev);
+			goto simple;
+		}
+
+		se = pick_next_entity(cfs_rq, curr);
+		cfs_rq = group_cfs_rq(se);
+	} while (cfs_rq);
+
+	p = task_of(se);
+
+	/*
+	 * Since we haven't yet done put_prev_entity and if the selected task
+	 * is a different task than we started out with, try and touch the
+	 * least amount of cfs_rqs.
+	 */
+	if (prev != p) {
+		struct sched_entity *pse = &prev->se;
+
+		while (!(cfs_rq = is_same_group(se, pse))) {
+			int se_depth = se->depth;
+			int pse_depth = pse->depth;
+
+			if (se_depth <= pse_depth) {
+				put_prev_entity(cfs_rq_of(pse), pse);
+				pse = parent_entity(pse);
+			}
+			if (se_depth >= pse_depth) {
+				set_next_entity(cfs_rq_of(se), se);
+				se = parent_entity(se);
+			}
+		}
+
+		put_prev_entity(cfs_rq, pse);
+		set_next_entity(cfs_rq, se);
+	}
+
+	return p;
+simple:
+#endif
+	cfs_rq = &rq->cfs;
+
+	if (!cfs_rq->nr_running) {
+		// XXX FAIL we should never return NULL after putting @prev
+		return NULL;
+	}
+
+	do {
+		se = pick_next_entity(cfs_rq, NULL);
 		set_next_entity(cfs_rq, se);
 		cfs_rq = group_cfs_rq(se);
 	} while (cfs_rq);
 
 	p = task_of(se);
+
 	if (hrtick_enabled(rq))
 		hrtick_start_fair(rq, p);
 
@@ -4511,20 +4614,6 @@ pick_next_task_fair(struct rq *rq, struc
 }
 
 /*
- * Account for a descheduled task:
- */
-static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
-{
-	struct sched_entity *se = &prev->se;
-	struct cfs_rq *cfs_rq;
-
-	for_each_sched_entity(se) {
-		cfs_rq = cfs_rq_of(se);
-		put_prev_entity(cfs_rq, se);
-	}
-}
-
-/*
  * sched_yield() is very simple
  *
  * The magic of dealing with the ->skip buddy is in pick_next_entity.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 8/9] sched/fair: Optimize cgroup pick_next_task_fair
  2014-01-21 20:43         ` Peter Zijlstra
@ 2014-01-21 21:43           ` bsegall
  2014-01-22 18:06             ` Peter Zijlstra
  0 siblings, 1 reply; 23+ messages in thread
From: bsegall @ 2014-01-21 21:43 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, mingo, daniel.lezcano, pjt

Peter Zijlstra <peterz@infradead.org> writes:

> On Tue, Jan 21, 2014 at 12:03:38PM -0800, bsegall@google.com wrote:
>
>> > Indeed, in that case we can miss something... we could try to call
>> > check_cfs_rq_runtime() from the initial top-down selection loop? When
>> > true, just put the entire stack and don't pretend to be smart?
>> 
>> Yeah, I think that should work. I wasn't sure if there could be a
>> problem of doing throttle_cfs_rq(parent); throttle_cfs_rq(child);, but
>> thinking about it, that has to be ok, because schedule can do that if
>> deactivate throttled the parent, schedule calls update_rq_clock, and
>> then put_prev throttled the child.
>
> Maybe something like the below? Completely untested and everything.
>
> There's the obvious XXX fail that was also in the previous version; not
> sure how to deal with that yet; either we need to change the interface
> to take struct task_struct **prev, or get smarter :-)
>
> Any other obvious fails in here?

prev can be NULL to start with, and hrtick should be handled in both paths.
How about this on top of your patch (equally untested) to fix those and
the XXX? The double-check on nr_running is annoying, but necessary when
prev slept.

---
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4528,14 +4528,8 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev)
        if (!cfs_rq->nr_running)
                return NULL;
 
-       if (!IS_ENABLED(CONFIG_FAIR_GROUP_SCHED) ||
-           (prev->sched_class != &fair_sched_class)) {
-               prev->sched_class->put_prev_task(rq, prev);
-               prev = NULL;
-       }
-
 #ifdef CONFIG_FAIR_GROUP_SCHED
-       if (!prev)
+       if (!prev || prev->sched_class != &fair_sched_class)
                goto simple;
 
        do {
@@ -4552,10 +4546,8 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev)
                else
                        curr = NULL;
 
-               if (unlikely(check_cfs_rq_runtime(cfs_rq))) {
-                       put_prev_task_fair(rq, prev);
+               if (unlikely(check_cfs_rq_runtime(cfs_rq)))
                        goto simple;
-               }
 
                se = pick_next_entity(cfs_rq, curr);
                cfs_rq = group_cfs_rq(se);
@@ -4589,15 +4581,19 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev)
                set_next_entity(cfs_rq, se);
        }
 
+       if (hrtick_enabled(rq))
+               hrtick_start_fair(rq, p);
+
        return p;
 simple:
 #endif
        cfs_rq = &rq->cfs;
 
-       if (!cfs_rq->nr_running) {
-               // XXX FAIL we should never return NULL after putting @prev
+       if (!cfs_rq->nr_running)
                return NULL;
-       }
+
+       if (prev)
+               prev->sched_class->put_prev_task(rq, prev);
 
        do {
                se = pick_next_entity(cfs_rq, NULL);

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 6/9] sched: Push put_prev_task() into pick_next_task()
  2014-01-21 11:18 ` [PATCH 6/9] sched: Push put_prev_task() into pick_next_task() Peter Zijlstra
@ 2014-01-21 21:46   ` bsegall
  0 siblings, 0 replies; 23+ messages in thread
From: bsegall @ 2014-01-21 21:46 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, mingo, daniel.lezcano, pjt

Peter Zijlstra <peterz@infradead.org> writes:

> In order to avoid having to do put/set on a whole cgroup hierarchy
> when we context switch, push the put into pick_next_task() so that
> both operations are in the same function. Further changes then allow
> us to possibly optimize away redundant work.
>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Link: http://lkml.kernel.org/r/1328936700.2476.17.camel@laptop
> --- a/kernel/sched/idle_task.c
> +++ b/kernel/sched/idle_task.c
> @@ -23,8 +23,12 @@ static void check_preempt_curr_idle(stru
>  	resched_task(rq->idle);
>  }
>  
> -static struct task_struct *pick_next_task_idle(struct rq *rq)
> +static struct task_struct *
> +pick_next_task_idle(struct rq *rq, struct task_struct *prev)
>  {
> +	if (prev)
> +		prev->sched_class->put_prev_task(rq, prev);
> +
>  	schedstat_inc(rq, sched_goidle);
>  #ifdef CONFIG_SMP
>  	idle_enter_fair(rq);
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1310,15 +1310,7 @@ static struct task_struct *_pick_next_ta
>  {
>  	struct sched_rt_entity *rt_se;
>  	struct task_struct *p;
> -	struct rt_rq *rt_rq;
> -
> -	rt_rq = &rq->rt;
> -
> -	if (!rt_rq->rt_nr_running)
> -		return NULL;
> -
> -	if (rt_rq_throttled(rt_rq))
> -		return NULL;
> +	struct rt_rq *rt_rq  = &rq->rt;
>  
>  	do {
>  		rt_se = pick_next_rt_entity(rq, rt_rq);
> @@ -1332,9 +1324,22 @@ static struct task_struct *_pick_next_ta
>  	return p;
>  }
>  
> -static struct task_struct *pick_next_task_rt(struct rq *rq)
> +static struct task_struct *
> +pick_next_task_rt(struct rq *rq, struct task_struct *prev)
>  {
> -	struct task_struct *p = _pick_next_task_rt(rq);
> +	struct task_struct *p;
> +	struct rt_rq *rt_rq = &rq->rt;
> +
> +	if (!rt_rq->rt_nr_running)
> +		return NULL;
> +
> +	if (rt_rq_throttled(rt_rq))
> +		return NULL;
> +
> +	if (prev)
> +		prev->sched_class->put_prev_task(rq, prev);
> +
> +	p = _pick_next_task_rt(rq);
>  
>  	/* The running task is never eligible for pushing */
>  	if (p)
p is now always non-NULL, so this branch can now go (and it is important
that we can't fail after doing put).
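
Roughly, assuming the trimmed body of that branch is the usual
dequeue_pushable_task(rq, p) call, the tail would then read:

	p = _pick_next_task_rt(rq);

	/* The running task is never eligible for pushing */
	dequeue_pushable_task(rq, p);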

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 8/9] sched/fair: Optimize cgroup pick_next_task_fair
  2014-01-21 21:43           ` bsegall
@ 2014-01-22 18:06             ` Peter Zijlstra
  0 siblings, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2014-01-22 18:06 UTC (permalink / raw)
  To: bsegall; +Cc: linux-kernel, mingo, daniel.lezcano, pjt

On Tue, Jan 21, 2014 at 01:43:32PM -0800, bsegall@google.com wrote:
> prev can be NULL to start with, and hrtick should be handled in both paths.
> How about this on top of your patch (equally untested) to fix those and
> the XXX? The double-check on nr_running is annoying, but necessary when
> prev slept.

Ah yes indeed. Let me build the lot and see if I can wreck it ;-)

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 4/9] sched: Clean up idle task SMP logic
  2014-01-21 17:27   ` Vincent Guittot
@ 2014-01-23 11:37     ` Peter Zijlstra
  2014-01-23 14:52       ` Vincent Guittot
  0 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2014-01-23 11:37 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, Ingo Molnar, Daniel Lezcano, Paul Turner,
	Benjamin Segall, Steven Rostedt

On Tue, Jan 21, 2014 at 06:27:11PM +0100, Vincent Guittot wrote:
> On 21 January 2014 12:17, Peter Zijlstra <peterz@infradead.org> wrote:

> If I have correctly followed the new function path introduced by the
> patchset, idle_enter_fair is called after idle_balance, whereas it must
> be called before it in order to update the runnable_avg_sum/period of
> the rq before evaluating the interest of pulling a cfs task.

It's idle_exit_fair that's moved from pre_schedule to put_prev_task and
thus indeed has crossed idle_balance.

Yeah, I can leave that pre_schedule thing in place; however, all that has
me thinking.

Ideally we'd do something like the below; but I must admit to still
being slightly confused about the idle_{enter,exit}_fair() calls; their
comment doesn't seem to clarify things.

Steve, I don't think I wrecked rt/deadline by pulling the
pre_schedule call into pick_next_task(), but then, who knows :-)

The only thing I really don't like is having that double conditional for
the direct fair_sched_class call, but I didn't see a way around that,
other than dropping that entirely. Then again, we cut out a conditional
and indirect call by getting rid of pre_schedule() -- so it might just
balance out.

---
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2146,13 +2146,6 @@ static void finish_task_switch(struct rq
 
 #ifdef CONFIG_SMP
 
-/* assumes rq->lock is held */
-static inline void pre_schedule(struct rq *rq, struct task_struct *prev)
-{
-	if (prev->sched_class->pre_schedule)
-		prev->sched_class->pre_schedule(rq, prev);
-}
-
 /* rq->lock is NOT held, but preemption is disabled */
 static inline void post_schedule(struct rq *rq)
 {
@@ -2170,10 +2163,6 @@ static inline void post_schedule(struct
 
 #else
 
-static inline void pre_schedule(struct rq *rq, struct task_struct *p)
-{
-}
-
 static inline void post_schedule(struct rq *rq)
 {
 }
@@ -2571,7 +2560,8 @@ pick_next_task(struct rq *rq, struct tas
 		 * Optimization: we know that if all tasks are in
 		 * the fair class we can call that function directly:
 		 */
-		if (likely(rq->nr_running == rq->cfs.h_nr_running))
+		if (likely(prev->sched_class == &fair_sched_class &&
+			   rq->nr_running == rq->cfs.h_nr_running))
 			return fair_sched_class.pick_next_task(rq, prev);
 
 		for_each_class(class) {
@@ -2581,14 +2571,6 @@ pick_next_task(struct rq *rq, struct tas
 		}
 	}
 
-	/*
-	 * If there is a task balanced on this cpu, pick the next task,
-	 * otherwise fall in the optimization by picking the idle task
-	 * directly.
-	 */
-	if (idle_balance(rq))
-		goto again;
-
 	rq->idle_stamp = rq_clock(rq);
 
 	return idle_sched_class.pick_next_task(rq, prev);
@@ -2682,8 +2664,6 @@ static void __sched __schedule(void)
 		switch_count = &prev->nvcsw;
 	}
 
-	pre_schedule(rq, prev);
-
 	if (prev->on_rq || rq->skip_clock_update < 0)
 		update_rq_clock(rq);
 
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -997,6 +997,9 @@ struct task_struct *pick_next_task_dl(st
 
 	dl_rq = &rq->dl;
 
+	if (dl_task(prev))
+		pull_dl_task(rq);
+
 	if (unlikely(!dl_rq->dl_nr_running))
 		return NULL;
 
@@ -1430,8 +1433,6 @@ static int pull_dl_task(struct rq *this_
 static void pre_schedule_dl(struct rq *rq, struct task_struct *prev)
 {
 	/* Try to pull other tasks here */
-	if (dl_task(prev))
-		pull_dl_task(rq);
 }
 
 static void post_schedule_dl(struct rq *rq)
@@ -1626,7 +1627,6 @@ const struct sched_class dl_sched_class
 	.set_cpus_allowed       = set_cpus_allowed_dl,
 	.rq_online              = rq_online_dl,
 	.rq_offline             = rq_offline_dl,
-	.pre_schedule		= pre_schedule_dl,
 	.post_schedule		= post_schedule_dl,
 	.task_woken		= task_woken_dl,
 #endif
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2581,7 +2581,8 @@ void idle_exit_fair(struct rq *this_rq)
 	update_rq_runnable_avg(this_rq, 0);
 }
 
-#else
+#else /* CONFIG_SMP */
+
 static inline void update_entity_load_avg(struct sched_entity *se,
 					  int update_cfs_rq) {}
 static inline void update_rq_runnable_avg(struct rq *rq, int runnable) {}
@@ -2593,7 +2594,7 @@ static inline void dequeue_entity_load_a
 					   int sleep) {}
 static inline void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq,
 					      int force_update) {}
-#endif
+#endif /* CONFIG_SMP */
 
 static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
@@ -4700,10 +4701,11 @@ pick_next_task_fair(struct rq *rq, struc
 	struct sched_entity *se;
 	struct task_struct *p;
 
+again:
+#ifdef CONFIG_FAIR_GROUP_SCHED
 	if (!cfs_rq->nr_running)
-		return NULL;
+		goto idle;
 
-#ifdef CONFIG_FAIR_GROUP_SCHED
 	if (prev->sched_class != &fair_sched_class)
 		goto simple;
 
@@ -4769,11 +4771,10 @@ pick_next_task_fair(struct rq *rq, struc
 
 	return p;
 simple:
-#endif
 	cfs_rq = &rq->cfs;
-
+#endif
 	if (!cfs_rq->nr_running)
-		return NULL;
+		goto idle;
 
 	prev->sched_class->put_prev_task(rq, prev);
 
@@ -4789,6 +4790,13 @@ pick_next_task_fair(struct rq *rq, struc
 		hrtick_start_fair(rq, p);
 
 	return p;
+
+idle:
+	idle_exit_fair(rq);
+	if (idle_balance(rq)) /* drops rq->lock */
+		goto again;
+
+	return NULL;
 }
 
 /*
--- a/kernel/sched/idle_task.c
+++ b/kernel/sched/idle_task.c
@@ -13,18 +13,8 @@ select_task_rq_idle(struct task_struct *
 {
 	return task_cpu(p); /* IDLE tasks as never migrated */
 }
-
-static void pre_schedule_idle(struct rq *rq, struct task_struct *prev)
-{
-	idle_exit_fair(rq);
-	rq_last_tick_reset(rq);
-}
-
-static void post_schedule_idle(struct rq *rq)
-{
-	idle_enter_fair(rq);
-}
 #endif /* CONFIG_SMP */
+
 /*
  * Idle tasks are unconditionally rescheduled:
  */
@@ -40,8 +30,7 @@ pick_next_task_idle(struct rq *rq, struc
 
 	schedstat_inc(rq, sched_goidle);
 #ifdef CONFIG_SMP
-	/* Trigger the post schedule to do an idle_enter for CFS */
-	rq->post_schedule = 1;
+	idle_enter_fair(rq);
 #endif
 	return rq->idle;
 }
@@ -61,6 +50,10 @@ dequeue_task_idle(struct rq *rq, struct
 
 static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
 {
+#ifdef CONFIG_SMP
+	idle_exit_fair(rq);
+	rq_last_tick_reset(rq);
+#endif
 }
 
 static void task_tick_idle(struct rq *rq, struct task_struct *curr, int queued)
@@ -104,8 +97,6 @@ const struct sched_class idle_sched_clas
 
 #ifdef CONFIG_SMP
 	.select_task_rq		= select_task_rq_idle,
-	.pre_schedule		= pre_schedule_idle,
-	.post_schedule		= post_schedule_idle,
 #endif
 
 	.set_curr_task          = set_curr_task_idle,
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1330,6 +1330,10 @@ pick_next_task_rt(struct rq *rq, struct
 	struct task_struct *p;
 	struct rt_rq *rt_rq = &rq->rt;
 
+	/* Try to pull RT tasks here if we lower this rq's prio */
+	if (rq->rt.highest_prio.curr > prev->prio)
+		pull_rt_task(rq);
+
 	if (!rt_rq->rt_nr_running)
 		return NULL;
 
@@ -1720,13 +1724,6 @@ static int pull_rt_task(struct rq *this_
 	return ret;
 }
 
-static void pre_schedule_rt(struct rq *rq, struct task_struct *prev)
-{
-	/* Try to pull RT tasks here if we lower this rq's prio */
-	if (rq->rt.highest_prio.curr > prev->prio)
-		pull_rt_task(rq);
-}
-
 static void post_schedule_rt(struct rq *rq)
 {
 	push_rt_tasks(rq);
@@ -2003,7 +2000,6 @@ const struct sched_class rt_sched_class
 	.set_cpus_allowed       = set_cpus_allowed_rt,
 	.rq_online              = rq_online_rt,
 	.rq_offline             = rq_offline_rt,
-	.pre_schedule		= pre_schedule_rt,
 	.post_schedule		= post_schedule_rt,
 	.task_woken		= task_woken_rt,
 	.switched_from		= switched_from_rt,
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1136,7 +1136,6 @@ struct sched_class {
 	int  (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
 	void (*migrate_task_rq)(struct task_struct *p, int next_cpu);
 
-	void (*pre_schedule) (struct rq *this_rq, struct task_struct *task);
 	void (*post_schedule) (struct rq *this_rq);
 	void (*task_waking) (struct task_struct *task);
 	void (*task_woken) (struct rq *this_rq, struct task_struct *task);

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 3/9] sched: Move idle_stamp up to the core
  2014-01-21 11:17 ` [PATCH 3/9] sched: Move idle_stamp up to the core Peter Zijlstra
@ 2014-01-23 12:58   ` Peter Zijlstra
  2014-01-23 14:39     ` Daniel Lezcano
  0 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2014-01-23 12:58 UTC (permalink / raw)
  To: linux-kernel; +Cc: mingo, daniel.lezcano, pjt, bsegall, Jason Low

On Tue, Jan 21, 2014 at 12:17:57PM +0100, Peter Zijlstra wrote:
> From: Daniel Lezcano <daniel.lezcano@linaro.org>
> 
> idle_balance() modifies the idle_stamp field of the rq, making this
> information shared across core.c and fair.c. Since the previous patch lets
> us know whether the cpu is going to idle or not, let's encapsulate the
> idle_stamp information in core.c by moving it up to the caller.
> idle_balance() returns true in case a balancing occurred and the cpu
> won't be idle, false if no balance happened and the cpu is going idle.
> 
> Cc: alex.shi@linaro.org
> Cc: peterz@infradead.org
> Cc: mingo@kernel.org
> Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Link: http://lkml.kernel.org/r/1389949444-14821-3-git-send-email-daniel.lezcano@linaro.org
> ---
>  kernel/sched/core.c  |    2 +-
>  kernel/sched/fair.c  |   14 ++++++--------
>  kernel/sched/sched.h |    2 +-
>  3 files changed, 8 insertions(+), 10 deletions(-)
> 
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2680,7 +2680,7 @@ static void __sched __schedule(void)
>  	pre_schedule(rq, prev);
>  
>  	if (unlikely(!rq->nr_running))
> -		idle_balance(rq);
> +		rq->idle_stamp = idle_balance(rq) ? 0 : rq_clock(rq);

OK, spotted a problem here..

So previously idle_stamp was set _before_ actually doing idle_balance(),
and that was RIGHT, because that way we include the cost of actually
doing idle_balance() into the idle time.

By not including the cost of idle_balance() you underestimate the 'idle'
time in that if idle_balance() filled the entire idle time we account 0
idle, even though we had 'plenty' of time to run the entire thing.

This leads to not running idle_balance() even though we have the time
for it.

So we very much want something like:


  if (!rq->nr_running)
  	rq->idle_stamp = rq_clock(rq);

  p = pick_next_task(rq, prev);

  if (!is_idle_task(p))
  	rq->idle_stamp = 0;



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 3/9] sched: Move idle_stamp up to the core
  2014-01-23 12:58   ` Peter Zijlstra
@ 2014-01-23 14:39     ` Daniel Lezcano
  2014-01-23 15:23       ` Peter Zijlstra
  0 siblings, 1 reply; 23+ messages in thread
From: Daniel Lezcano @ 2014-01-23 14:39 UTC (permalink / raw)
  To: Peter Zijlstra, linux-kernel; +Cc: mingo, pjt, bsegall, Jason Low

On 01/23/2014 01:58 PM, Peter Zijlstra wrote:
> On Tue, Jan 21, 2014 at 12:17:57PM +0100, Peter Zijlstra wrote:
>> From: Daniel Lezcano <daniel.lezcano@linaro.org>
>>
>> idle_balance() modifies the idle_stamp field of the rq, making this
>> information shared across core.c and fair.c. Since the previous patch lets
>> us know whether the cpu is going to idle or not, let's encapsulate the
>> idle_stamp information in core.c by moving it up to the caller.
>> idle_balance() returns true in case a balancing occurred and the cpu
>> won't be idle, false if no balance happened and the cpu is going idle.
>>
>> Cc: alex.shi@linaro.org
>> Cc: peterz@infradead.org
>> Cc: mingo@kernel.org
>> Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
>> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
>> Link: http://lkml.kernel.org/r/1389949444-14821-3-git-send-email-daniel.lezcano@linaro.org
>> ---
>>   kernel/sched/core.c  |    2 +-
>>   kernel/sched/fair.c  |   14 ++++++--------
>>   kernel/sched/sched.h |    2 +-
>>   3 files changed, 8 insertions(+), 10 deletions(-)
>>
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -2680,7 +2680,7 @@ static void __sched __schedule(void)
>>   	pre_schedule(rq, prev);
>>
>>   	if (unlikely(!rq->nr_running))
>> -		idle_balance(rq);
>> +		rq->idle_stamp = idle_balance(rq) ? 0 : rq_clock(rq);
>
> OK, spotted a problem here..
>
> So previously idle_stamp was set _before_ actually doing idle_balance(),
> and that was RIGHT, because that way we include the cost of actually
> doing idle_balance() into the idle time.
>
> By not including the cost of idle_balance() you underestimate the 'idle'
> time in that if idle_balance() filled the entire idle time we account 0
> idle, even though we had 'plenty' of time to run the entire thing.
>
> This leads to not running idle_balance() even though we have the time
> for it.

Good catch. How did you notice that?

> So we very much want something like:
>
>
>    if (!rq->nr_running)
>    	rq->idle_stamp = rq_clock(rq);
>
>    p = pick_next_task(rq, prev);
>
>    if (!is_idle_task(p))
>    	rq->idle_stamp = 0;

Is this code assuming idle_balance() is in pick_next_task?

For this specific patch 3/9, would the following be OK?

+       if (unlikely(!rq->nr_running)) {
+               rq->idle_stamp = rq_clock(rq);
+               if (idle_balance(rq))
+                       rq->idle_stamp = 0;
+       }

So patch 9/9 is also wrong, because it does not set idle_stamp
before idle_balance; the fix would be:

  kernel/sched/core.c |    8 +++++---
  1 file changed, 5 insertions(+), 3 deletions(-)

Index: cpuidle-next/kernel/sched/core.c
===================================================================
--- cpuidle-next.orig/kernel/sched/core.c
+++ cpuidle-next/kernel/sched/core.c
@@ -2579,15 +2579,17 @@ again:
  		}
  	}

+	rq->idle_stamp = rq_clock(rq);
+
  	/*
  	 * If there is a task balanced on this cpu, pick the next task,
  	 * otherwise fall in the optimization by picking the idle task
  	 * directly.
  	 */
-	if (idle_balance(rq))
+	if (idle_balance(rq)) {
+		rq->idle_stamp = 0;
  		goto again;
-
-	rq->idle_stamp = rq_clock(rq);
+	}

  	return idle_sched_class.pick_next_task(rq, prev);
  }




-- 
  <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 4/9] sched: Clean up idle task SMP logic
  2014-01-23 11:37     ` Peter Zijlstra
@ 2014-01-23 14:52       ` Vincent Guittot
  0 siblings, 0 replies; 23+ messages in thread
From: Vincent Guittot @ 2014-01-23 14:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Daniel Lezcano, Paul Turner,
	Benjamin Segall, Steven Rostedt

On 23 January 2014 12:37, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Jan 21, 2014 at 06:27:11PM +0100, Vincent Guittot wrote:
>> On 21 January 2014 12:17, Peter Zijlstra <peterz@infradead.org> wrote:
>
>> If I have correctly followed the new function path introduced by the
>> patchset, idle_enter_fair is called after idle_balance, whereas it must
>> be called before it in order to update the runnable_avg_sum/period of
>> the rq before evaluating the interest of pulling a cfs task.
>
> It's idle_exit_fair that's moved from pre_schedule to put_prev_task and
> thus indeed has crossed idle_balance.

OK, so I probably mixed up the changes introduced by your patches with a
potential issue that was already present before.

idle_enter/exit_fair are used to make sure the run time of a rq is
accounted correctly in its sched_avg field, so they must be called before
entering/leaving idle. Your patch ensures that they will be called
correctly. Now, idle_balance is used to check whether it is worth pulling
a task onto this newly idle CPU, and it could use the sched_avg field in
the near future given the discussion around an energy-aware scheduler. As
a result, we must update the sched_avg field before running idle_balance
(that's not the case even before your patches).

So one solution is probably to move idle_enter_fair into idle_balance itself.
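
A rough, untested sketch of that idea (hypothetical: the return value follows
patch 3, and whether a matching idle_exit_fair() is then needed once a task
does get pulled is left open here):

	int idle_balance(struct rq *this_rq)
	{
		int pulled_task = 0;

		/*
		 * Update the rq's runnable_avg_sum/period first, so the
		 * pull decision below is based on up-to-date numbers.
		 */
		idle_enter_fair(this_rq);

		/* ... existing newly-idle balance logic, unchanged ... */

		return pulled_task;
	}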

Regards,
Vincent

>
> Yeah, I can leave that pre_schedule thing in place; however, all that has
> me thinking.
>
> Ideally we'd do something like the below; but I must admit to still
> being slightly confused about the idle_{enter,exit}_fair() calls; their
> comment doesn't seem to clarify things.
>
> Steve, I don't think I wrecked rt/deadline by pulling the
> pre_schedule call into pick_next_task(), but then, who knows :-)
>
> The only thing I really don't like is having that double conditional for
> the direct fair_sched_class call, but I didn't see a way around that,
> other than dropping that entirely. Then again, we cut out a conditional
> and indirect call by getting rid of pre_schedule() -- so it might just
> balance out.
>

[snip]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 3/9] sched: Move idle_stamp up to the core
  2014-01-23 14:39     ` Daniel Lezcano
@ 2014-01-23 15:23       ` Peter Zijlstra
  0 siblings, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2014-01-23 15:23 UTC (permalink / raw)
  To: Daniel Lezcano; +Cc: linux-kernel, mingo, pjt, bsegall, Jason Low

On Thu, Jan 23, 2014 at 03:39:36PM +0100, Daniel Lezcano wrote:
> On 01/23/2014 01:58 PM, Peter Zijlstra wrote:
> >On Tue, Jan 21, 2014 at 12:17:57PM +0100, Peter Zijlstra wrote:
> >>From: Daniel Lezcano <daniel.lezcano@linaro.org>
> >>
> >>idle_balance() modifies the idle_stamp field of the rq, making this
> >>information shared across core.c and fair.c. Since the previous patch lets
> >>us know whether the cpu is going to idle or not, let's encapsulate the
> >>idle_stamp information in core.c by moving it up to the caller.
> >>idle_balance() returns true in case a balancing occurred and the cpu
> >>won't be idle, false if no balance happened and the cpu is going idle.
> >>
> >>Cc: alex.shi@linaro.org
> >>Cc: peterz@infradead.org
> >>Cc: mingo@kernel.org
> >>Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
> >>Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> >>Link: http://lkml.kernel.org/r/1389949444-14821-3-git-send-email-daniel.lezcano@linaro.org
> >>---
> >>  kernel/sched/core.c  |    2 +-
> >>  kernel/sched/fair.c  |   14 ++++++--------
> >>  kernel/sched/sched.h |    2 +-
> >>  3 files changed, 8 insertions(+), 10 deletions(-)
> >>
> >>--- a/kernel/sched/core.c
> >>+++ b/kernel/sched/core.c
> >>@@ -2680,7 +2680,7 @@ static void __sched __schedule(void)
> >>  	pre_schedule(rq, prev);
> >>
> >>  	if (unlikely(!rq->nr_running))
> >>-		idle_balance(rq);
> >>+		rq->idle_stamp = idle_balance(rq) ? 0 : rq_clock(rq);
> >
> >OK, spotted a problem here..
> >
> >So previously idle_stamp was set _before_ actually doing idle_balance(),
> >and that was RIGHT, because that way we include the cost of actually
> >doing idle_balance() into the idle time.
> >
> >By not including the cost of idle_balance() you underestimate the 'idle'
> >time in that if idle_balance() filled the entire idle time we account 0
> >idle, even though we had 'plenty' of time to run the entire thing.
> >
> >This leads to not running idle_balance() even though we have the time
> >for it.
> 
> Good catch. How did you notice that?

Staring at that code for too long :-)

> >So we very much want something like:
> >
> >
> >   if (!rq->nr_running)
> >   	rq->idle_stamp = rq_clock(rq);
> >
> >   p = pick_next_task(rq, prev);
> >
> >   if (!is_idle_task(p))
> >   	rq->idle_stamp = 0;
> 
> Is this code assuming idle_balance() is in pick_next_task?

Yeah, I'm trying to make that work.

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2014-01-23 15:23 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-01-21 11:17 [PATCH 0/9] Various sched patches Peter Zijlstra
2014-01-21 11:17 ` [PATCH 1/9] sched: Remove cpu parameter for idle_balance() Peter Zijlstra
2014-01-21 11:17 ` [PATCH 2/9] sched: Fix race in idle_balance() Peter Zijlstra
2014-01-21 11:17 ` [PATCH 3/9] sched: Move idle_stamp up to the core Peter Zijlstra
2014-01-23 12:58   ` Peter Zijlstra
2014-01-23 14:39     ` Daniel Lezcano
2014-01-23 15:23       ` Peter Zijlstra
2014-01-21 11:17 ` [PATCH 4/9] sched: Clean up idle task SMP logic Peter Zijlstra
2014-01-21 17:27   ` Vincent Guittot
2014-01-23 11:37     ` Peter Zijlstra
2014-01-23 14:52       ` Vincent Guittot
2014-01-21 11:17 ` [PATCH 5/9] sched/fair: Track cgroup depth Peter Zijlstra
2014-01-21 11:18 ` [PATCH 6/9] sched: Push put_prev_task() into pick_next_task() Peter Zijlstra
2014-01-21 21:46   ` bsegall
2014-01-21 11:18 ` [PATCH 7/9] sched/fair: Clean up __clear_buddies_* Peter Zijlstra
2014-01-21 11:18 ` [PATCH 8/9] sched/fair: Optimize cgroup pick_next_task_fair Peter Zijlstra
2014-01-21 19:24   ` bsegall
2014-01-21 19:37     ` Peter Zijlstra
2014-01-21 20:03       ` bsegall
2014-01-21 20:43         ` Peter Zijlstra
2014-01-21 21:43           ` bsegall
2014-01-22 18:06             ` Peter Zijlstra
2014-01-21 11:18 ` [PATCH 9/9] sched: Use idle task shortcut Peter Zijlstra
