linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/9] Various sched patches -v2
@ 2014-01-28 17:16 Peter Zijlstra
  2014-01-28 17:16 ` [PATCH 1/9] sched: Remove cpu parameter for idle_balance() Peter Zijlstra
                   ` (9 more replies)
  0 siblings, 10 replies; 19+ messages in thread
From: Peter Zijlstra @ 2014-01-28 17:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, daniel.lezcano, pjt, bsegall, Steven Rostedt,
	Vincent Guittot, Peter Zijlstra

Here's a set that seems to work properly and hopefully addresses all previously
raised issues.

Daniel, I'm sorry, the idle_stamp thing wandered back over to fair.c :-)

I did test the fair bits, including bandwidth (although not very
extensively).

I did not test the dl/rt bits; Steve, do you have a testcase for that?



* [PATCH 1/9] sched: Remove cpu parameter for idle_balance()
  2014-01-28 17:16 [PATCH 0/9] Various sched patches -v2 Peter Zijlstra
@ 2014-01-28 17:16 ` Peter Zijlstra
  2014-01-28 17:16 ` [PATCH 2/9] sched: Fix race in idle_balance() Peter Zijlstra
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2014-01-28 17:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, daniel.lezcano, pjt, bsegall, Steven Rostedt,
	Vincent Guittot, Peter Zijlstra

[-- Attachment #1: daniel_lezcano-1_sched-remove__cpu__parameter_for_idle_balance.patch --]
[-- Type: text/plain, Size: 1741 bytes --]

From: Daniel Lezcano <daniel.lezcano@linaro.org>

The cpu parameter passed to idle_balance() is not needed, as it can be
retrieved from the struct rq.

Cc: alex.shi@linaro.org
Cc: peterz@infradead.org
Cc: mingo@kernel.org
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 kernel/sched/core.c  |    2 +-
 kernel/sched/fair.c  |    3 ++-
 kernel/sched/sched.h |    2 +-
 3 files changed, 4 insertions(+), 3 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2680,7 +2680,7 @@ static void __sched __schedule(void)
 	pre_schedule(rq, prev);
 
 	if (unlikely(!rq->nr_running))
-		idle_balance(cpu, rq);
+		idle_balance(rq);
 
 	put_prev_task(rq, prev);
 	next = pick_next_task(rq);
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6359,12 +6359,13 @@ static int load_balance(int this_cpu, st
  * idle_balance is called by schedule() if this_cpu is about to become
  * idle. Attempts to pull tasks from other CPUs.
  */
-void idle_balance(int this_cpu, struct rq *this_rq)
+void idle_balance(struct rq *this_rq)
 {
 	struct sched_domain *sd;
 	int pulled_task = 0;
 	unsigned long next_balance = jiffies + HZ;
 	u64 curr_cost = 0;
+	int this_cpu = this_rq->cpu;
 
 	this_rq->idle_stamp = rq_clock(this_rq);
 
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1176,7 +1176,7 @@ extern const struct sched_class idle_sch
 extern void update_group_power(struct sched_domain *sd, int cpu);
 
 extern void trigger_load_balance(struct rq *rq);
-extern void idle_balance(int this_cpu, struct rq *this_rq);
+extern void idle_balance(struct rq *this_rq);
 
 extern void idle_enter_fair(struct rq *this_rq);
 extern void idle_exit_fair(struct rq *this_rq);




* [PATCH 2/9] sched: Fix race in idle_balance()
  2014-01-28 17:16 [PATCH 0/9] Various sched patches -v2 Peter Zijlstra
  2014-01-28 17:16 ` [PATCH 1/9] sched: Remove cpu parameter for idle_balance() Peter Zijlstra
@ 2014-01-28 17:16 ` Peter Zijlstra
  2014-01-28 17:16 ` [PATCH 3/9] sched: Move idle_stamp up to the core Peter Zijlstra
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2014-01-28 17:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, daniel.lezcano, pjt, bsegall, Steven Rostedt,
	Vincent Guittot, Peter Zijlstra

[-- Attachment #1: daniel_lezcano-2_sched-fix_race_in_idle_balance.patch --]
[-- Type: text/plain, Size: 1447 bytes --]

From: Daniel Lezcano <daniel.lezcano@linaro.org>

The scheduler's main function, schedule(), checks whether there are any
tasks left on the runqueue. If there are none, it calls idle_balance() to
see whether a task should be pulled onto the current runqueue, on the
assumption that the cpu will otherwise go idle.

But idle_balance() releases rq->lock in order to look up the sched
domains and takes the lock again right afterwards. That opens a window in
which another cpu may put a task on our runqueue, so we won't actually go
idle even though we have already filled in idle_stamp, thinking we would.

This patch closes the window by checking, after taking the lock again,
whether the runqueue has been modified without a task having been pulled;
in that case we return early, since we are not going to go idle in the
__schedule() function right after anyway.
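
For illustration, the interleaving looks roughly like this (a simplified
sketch, not the actual code; names as in the patch below):

  CPU0 (in idle_balance())                   CPU1
  ------------------------                   ----
  this_rq->idle_stamp = rq_clock(this_rq);
  raw_spin_unlock(&this_rq->lock);
                                             enqueue a task on CPU0's rq
  /* walk the sched domains, pull nothing */
  raw_spin_lock(&this_rq->lock);
  /* without the new check we carry on with
     idle_stamp set, although __schedule()
     will not actually go idle */

The new (this_rq->nr_running && !pulled_task) check bails out in exactly
that situation.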

Cc: alex.shi@linaro.org
Cc: peterz@infradead.org
Cc: mingo@kernel.org
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 kernel/sched/fair.c |    7 +++++++
 1 file changed, 7 insertions(+)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6417,6 +6417,13 @@ void idle_balance(struct rq *this_rq)
 
 	raw_spin_lock(&this_rq->lock);
 
+	/*
+	 * While browsing the domains, we released the rq lock.
+	 * A task could have be enqueued in the meantime
+	 */
+	if (this_rq->nr_running && !pulled_task)
+		return;
+
 	if (pulled_task || time_after(jiffies, this_rq->next_balance)) {
 		/*
 		 * We are going idle. next_balance may be set based on




* [PATCH 3/9] sched: Move idle_stamp up to the core
  2014-01-28 17:16 [PATCH 0/9] Various sched patches -v2 Peter Zijlstra
  2014-01-28 17:16 ` [PATCH 1/9] sched: Remove cpu parameter for idle_balance() Peter Zijlstra
  2014-01-28 17:16 ` [PATCH 2/9] sched: Fix race in idle_balance() Peter Zijlstra
@ 2014-01-28 17:16 ` Peter Zijlstra
  2014-01-28 17:16 ` [PATCH 4/9] sched/fair: Track cgroup depth Peter Zijlstra
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2014-01-28 17:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, daniel.lezcano, pjt, bsegall, Steven Rostedt,
	Vincent Guittot, Peter Zijlstra

[-- Attachment #1: daniel_lezcano-3_sched-move_idle_stamp_up_to_the_core.patch --]
[-- Type: text/plain, Size: 3256 bytes --]

From: Daniel Lezcano <daniel.lezcano@linaro.org>

idle_balance() modifies the idle_stamp field of the rq, which means this
information is shared across core.c and fair.c. Since the previous patch
lets us know whether the cpu is going to idle or not, encapsulate the
idle_stamp handling in core.c by moving it up to the caller. The
idle_balance() function now returns true if balancing occurred and the cpu
won't be idle, and false if no balancing happened and the cpu is going
idle.
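
For reference, the caller in __schedule() then becomes (sketch of the
core.c hunk below):

  if (unlikely(!rq->nr_running)) {
          /* set before the call so the balance time counts as idle time */
          rq->idle_stamp = rq_clock(rq);
          if (idle_balance(rq))   /* pulled something; we won't go idle */
                  rq->idle_stamp = 0;
  }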

Cc: mingo@kernel.org
Cc: alex.shi@linaro.org
Cc: peterz@infradead.org
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 kernel/sched/core.c  |   11 +++++++++--
 kernel/sched/fair.c  |   14 ++++++--------
 kernel/sched/sched.h |    2 +-
 3 files changed, 16 insertions(+), 11 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2681,8 +2681,15 @@ static void __sched __schedule(void)
 
 	pre_schedule(rq, prev);
 
-	if (unlikely(!rq->nr_running))
-		idle_balance(rq);
+	if (unlikely(!rq->nr_running)) {
+		/*
+		 * We must set idle_stamp _before_ calling idle_balance(), such
+		 * that we measure the duration of idle_balance() as idle time.
+		 */
+		rq->idle_stamp = rq_clock(rq);
+		if (idle_balance(rq))
+			rq->idle_stamp = 0;
+	}
 
 	put_prev_task(rq, prev);
 	next = pick_next_task(rq);
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6535,7 +6535,7 @@ static int load_balance(int this_cpu, st
  * idle_balance is called by schedule() if this_cpu is about to become
  * idle. Attempts to pull tasks from other CPUs.
  */
-void idle_balance(struct rq *this_rq)
+int idle_balance(struct rq *this_rq)
 {
 	struct sched_domain *sd;
 	int pulled_task = 0;
@@ -6543,10 +6543,8 @@ void idle_balance(struct rq *this_rq)
 	u64 curr_cost = 0;
 	int this_cpu = this_rq->cpu;
 
-	this_rq->idle_stamp = rq_clock(this_rq);
-
 	if (this_rq->avg_idle < sysctl_sched_migration_cost)
-		return;
+		return 0;
 
 	/*
 	 * Drop the rq->lock, but keep IRQ/preempt disabled.
@@ -6584,10 +6582,8 @@ void idle_balance(struct rq *this_rq)
 		interval = msecs_to_jiffies(sd->balance_interval);
 		if (time_after(next_balance, sd->last_balance + interval))
 			next_balance = sd->last_balance + interval;
-		if (pulled_task) {
-			this_rq->idle_stamp = 0;
+		if (pulled_task)
 			break;
-		}
 	}
 	rcu_read_unlock();
 
@@ -6598,7 +6594,7 @@ void idle_balance(struct rq *this_rq)
 	 * A task could have be enqueued in the meantime
 	 */
 	if (this_rq->nr_running && !pulled_task)
-		return;
+		return 1;
 
 	if (pulled_task || time_after(jiffies, this_rq->next_balance)) {
 		/*
@@ -6610,6 +6606,8 @@ void idle_balance(struct rq *this_rq)
 
 	if (curr_cost > this_rq->max_idle_balance_cost)
 		this_rq->max_idle_balance_cost = curr_cost;
+
+	return pulled_task;
 }
 
 /*
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1176,7 +1176,7 @@ extern const struct sched_class idle_sch
 extern void update_group_power(struct sched_domain *sd, int cpu);
 
 extern void trigger_load_balance(struct rq *rq);
-extern void idle_balance(struct rq *this_rq);
+extern int idle_balance(struct rq *this_rq);
 
 extern void idle_enter_fair(struct rq *this_rq);
 extern void idle_exit_fair(struct rq *this_rq);




* [PATCH 4/9] sched/fair: Track cgroup depth
  2014-01-28 17:16 [PATCH 0/9] Various sched patches -v2 Peter Zijlstra
                   ` (2 preceding siblings ...)
  2014-01-28 17:16 ` [PATCH 3/9] sched: Move idle_stamp up to the core Peter Zijlstra
@ 2014-01-28 17:16 ` Peter Zijlstra
  2014-01-28 17:16 ` [PATCH 5/9] sched: Push put_prev_task() into pick_next_task() Peter Zijlstra
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2014-01-28 17:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, daniel.lezcano, pjt, bsegall, Steven Rostedt,
	Vincent Guittot, Peter Zijlstra

[-- Attachment #1: peter_zijlstra-sched-optimize_cgroup_pick_next_task_fair_1.patch --]
[-- Type: text/plain, Size: 4237 bytes --]

Track the depth in the cgroup tree; this is useful for things like
find_matching_se() where you need to get to the common parent of two
sched entities.

Keeping the depth around avoids having to calculate it on the spot,
which saves a number of possible cache misses.
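
To illustrate how the cached depth gets used, find_matching_se() ends up
looking roughly like this (a simplified sketch of the resulting code, not
part of the patch itself):

  static void
  find_matching_se(struct sched_entity **se, struct sched_entity **pse)
  {
          int se_depth = (*se)->depth;    /* read, not recomputed */
          int pse_depth = (*pse)->depth;

          /* first walk up until both entities are at the same depth */
          while (se_depth > pse_depth) {
                  se_depth--;
                  *se = parent_entity(*se);
          }
          while (pse_depth > se_depth) {
                  pse_depth--;
                  *pse = parent_entity(*pse);
          }

          /* then walk up in lockstep until they share a cfs_rq */
          while (!is_same_group(*se, *pse)) {
                  *se = parent_entity(*se);
                  *pse = parent_entity(*pse);
          }
  }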

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 include/linux/sched.h |    1 +
 kernel/sched/fair.c   |   47 +++++++++++++++++++++--------------------------
 2 files changed, 22 insertions(+), 26 deletions(-)
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1064,6 +1064,7 @@ struct sched_entity {
 #endif
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
+	int			depth;
 	struct sched_entity	*parent;
 	/* rq on which this entity is (to be) queued: */
 	struct cfs_rq		*cfs_rq;
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -322,13 +322,13 @@ static inline void list_del_leaf_cfs_rq(
 	list_for_each_entry_rcu(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list)
 
 /* Do the two (enqueued) entities belong to the same group ? */
-static inline int
+static inline struct cfs_rq *
 is_same_group(struct sched_entity *se, struct sched_entity *pse)
 {
 	if (se->cfs_rq == pse->cfs_rq)
-		return 1;
+		return se->cfs_rq;
 
-	return 0;
+	return NULL;
 }
 
 static inline struct sched_entity *parent_entity(struct sched_entity *se)
@@ -336,17 +336,6 @@ static inline struct sched_entity *paren
 	return se->parent;
 }
 
-/* return depth at which a sched entity is present in the hierarchy */
-static inline int depth_se(struct sched_entity *se)
-{
-	int depth = 0;
-
-	for_each_sched_entity(se)
-		depth++;
-
-	return depth;
-}
-
 static void
 find_matching_se(struct sched_entity **se, struct sched_entity **pse)
 {
@@ -360,8 +349,8 @@ find_matching_se(struct sched_entity **s
 	 */
 
 	/* First walk up until both entities are at same depth */
-	se_depth = depth_se(*se);
-	pse_depth = depth_se(*pse);
+	se_depth = (*se)->depth;
+	pse_depth = (*pse)->depth;
 
 	while (se_depth > pse_depth) {
 		se_depth--;
@@ -426,10 +415,10 @@ static inline void list_del_leaf_cfs_rq(
 #define for_each_leaf_cfs_rq(rq, cfs_rq) \
 		for (cfs_rq = &rq->cfs; cfs_rq; cfs_rq = NULL)
 
-static inline int
+static inline struct cfs_rq *
 is_same_group(struct sched_entity *se, struct sched_entity *pse)
 {
-	return 1;
+	return cfs_rq_of(se); /* always the same rq */
 }
 
 static inline struct sched_entity *parent_entity(struct sched_entity *se)
@@ -7091,7 +7080,9 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
 #ifdef CONFIG_FAIR_GROUP_SCHED
 static void task_move_group_fair(struct task_struct *p, int on_rq)
 {
+	struct sched_entity *se = &p->se;
 	struct cfs_rq *cfs_rq;
+
 	/*
 	 * If the task was not on the rq at the time of this cgroup movement
 	 * it must have been asleep, sleeping tasks keep their ->vruntime
@@ -7117,23 +7108,24 @@ static void task_move_group_fair(struct
 	 * To prevent boost or penalty in the new cfs_rq caused by delta
 	 * min_vruntime between the two cfs_rqs, we skip vruntime adjustment.
 	 */
-	if (!on_rq && (!p->se.sum_exec_runtime || p->state == TASK_WAKING))
+	if (!on_rq && (!se->sum_exec_runtime || p->state == TASK_WAKING))
 		on_rq = 1;
 
 	if (!on_rq)
-		p->se.vruntime -= cfs_rq_of(&p->se)->min_vruntime;
+		se->vruntime -= cfs_rq_of(se)->min_vruntime;
 	set_task_rq(p, task_cpu(p));
+	se->depth = se->parent ? se->parent->depth + 1 : 0;
 	if (!on_rq) {
-		cfs_rq = cfs_rq_of(&p->se);
-		p->se.vruntime += cfs_rq->min_vruntime;
+		cfs_rq = cfs_rq_of(se);
+		se->vruntime += cfs_rq->min_vruntime;
 #ifdef CONFIG_SMP
 		/*
 		 * migrate_task_rq_fair() will have removed our previous
 		 * contribution, but we must synchronize for ongoing future
 		 * decay.
 		 */
-		p->se.avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
-		cfs_rq->blocked_load_avg += p->se.avg.load_avg_contrib;
+		se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
+		cfs_rq->blocked_load_avg += se->avg.load_avg_contrib;
 #endif
 	}
 }
@@ -7229,10 +7221,13 @@ void init_tg_cfs_entry(struct task_group
 	if (!se)
 		return;
 
-	if (!parent)
+	if (!parent) {
 		se->cfs_rq = &rq->cfs;
-	else
+		se->depth = 0;
+	} else {
 		se->cfs_rq = parent->my_q;
+		se->depth = parent->depth + 1;
+	}
 
 	se->my_q = cfs_rq;
 	/* guarantee group entities always have weight */




* [PATCH 5/9] sched: Push put_prev_task() into pick_next_task()
  2014-01-28 17:16 [PATCH 0/9] Various sched patches -v2 Peter Zijlstra
                   ` (3 preceding siblings ...)
  2014-01-28 17:16 ` [PATCH 4/9] sched/fair: Track cgroup depth Peter Zijlstra
@ 2014-01-28 17:16 ` Peter Zijlstra
  2014-01-28 18:46   ` bsegall
  2014-01-28 17:16 ` [PATCH 6/9] sched/fair: Clean up __clear_buddies_* Peter Zijlstra
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2014-01-28 17:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, daniel.lezcano, pjt, bsegall, Steven Rostedt,
	Vincent Guittot, Peter Zijlstra

[-- Attachment #1: peter_zijlstra-sched-optimize_cgroup_pick_next_task_fair_2.patch --]
[-- Type: text/plain, Size: 6171 bytes --]

In order to avoid having to do put/set on a whole cgroup hierarchy
when we context switch, push the put into pick_next_task() so that
both operations are in the same function. Further changes then allow
us to possibly optimize away redundant work.
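
The shape every class's pick_next_task() follows after this change is
(sketch using a hypothetical class 'foo'; the real implementations are in
the hunks below):

  static struct task_struct *
  pick_next_task_foo(struct rq *rq, struct task_struct *prev)
  {
          if (!foo_nr_running(rq))        /* hypothetical emptiness check */
                  return NULL;            /* must NOT put prev yet */

          prev->sched_class->put_prev_task(rq, prev);

          /* ... pick, set up and return a foo task ... */
  }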

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 kernel/sched/core.c      |   21 ++++++++-------------
 kernel/sched/deadline.c  |    4 +++-
 kernel/sched/fair.c      |    5 ++++-
 kernel/sched/idle_task.c |    5 ++++-
 kernel/sched/rt.c        |   26 +++++++++++++++-----------
 kernel/sched/sched.h     |    8 +++++++-
 kernel/sched/stop_task.c |   14 ++++++++------
 7 files changed, 49 insertions(+), 34 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2556,18 +2556,11 @@ static inline void schedule_debug(struct
 	schedstat_inc(this_rq(), sched_count);
 }
 
-static void put_prev_task(struct rq *rq, struct task_struct *prev)
-{
-	if (prev->on_rq || rq->skip_clock_update < 0)
-		update_rq_clock(rq);
-	prev->sched_class->put_prev_task(rq, prev);
-}
-
 /*
  * Pick up the highest-prio task:
  */
 static inline struct task_struct *
-pick_next_task(struct rq *rq)
+pick_next_task(struct rq *rq, struct task_struct *prev)
 {
 	const struct sched_class *class;
 	struct task_struct *p;
@@ -2577,13 +2570,13 @@ pick_next_task(struct rq *rq)
 	 * the fair class we can call that function directly:
 	 */
 	if (likely(rq->nr_running == rq->cfs.h_nr_running)) {
-		p = fair_sched_class.pick_next_task(rq);
+		p = fair_sched_class.pick_next_task(rq, prev);
 		if (likely(p))
 			return p;
 	}
 
 	for_each_class(class) {
-		p = class->pick_next_task(rq);
+		p = class->pick_next_task(rq, prev);
 		if (p)
 			return p;
 	}
@@ -2691,8 +2684,10 @@ static void __sched __schedule(void)
 			rq->idle_stamp = 0;
 	}
 
-	put_prev_task(rq, prev);
-	next = pick_next_task(rq);
+	if (prev->on_rq || rq->skip_clock_update < 0)
+		update_rq_clock(rq);
+
+	next = pick_next_task(rq, prev);
 	clear_tsk_need_resched(prev);
 	clear_preempt_need_resched();
 	rq->skip_clock_update = 0;
@@ -4734,7 +4729,7 @@ static void migrate_tasks(unsigned int d
 		if (rq->nr_running == 1)
 			break;
 
-		next = pick_next_task(rq);
+		next = pick_next_task(rq, NULL);
 		BUG_ON(!next);
 		next->sched_class->put_prev_task(rq, next);
 
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -989,7 +989,7 @@ static struct sched_dl_entity *pick_next
 	return rb_entry(left, struct sched_dl_entity, rb_node);
 }
 
-struct task_struct *pick_next_task_dl(struct rq *rq)
+struct task_struct *pick_next_task_dl(struct rq *rq, struct task_struct *prev)
 {
 	struct sched_dl_entity *dl_se;
 	struct task_struct *p;
@@ -1000,6 +1000,8 @@ struct task_struct *pick_next_task_dl(st
 	if (unlikely(!dl_rq->dl_nr_running))
 		return NULL;
 
+	prev->sched_class->put_prev_task(rq, prev);
+
 	dl_se = pick_next_dl_entity(rq, dl_rq);
 	BUG_ON(!dl_se);
 
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4659,7 +4659,8 @@ static void check_preempt_wakeup(struct
 		set_last_buddy(se);
 }
 
-static struct task_struct *pick_next_task_fair(struct rq *rq)
+static struct task_struct *
+pick_next_task_fair(struct rq *rq, struct task_struct *prev)
 {
 	struct task_struct *p;
 	struct cfs_rq *cfs_rq = &rq->cfs;
@@ -4668,6 +4669,8 @@ static struct task_struct *pick_next_tas
 	if (!cfs_rq->nr_running)
 		return NULL;
 
+	prev->sched_class->put_prev_task(rq, prev);
+
 	do {
 		se = pick_next_entity(cfs_rq);
 		set_next_entity(cfs_rq, se);
--- a/kernel/sched/idle_task.c
+++ b/kernel/sched/idle_task.c
@@ -33,8 +33,11 @@ static void check_preempt_curr_idle(stru
 	resched_task(rq->idle);
 }
 
-static struct task_struct *pick_next_task_idle(struct rq *rq)
+static struct task_struct *
+pick_next_task_idle(struct rq *rq, struct task_struct *prev)
 {
+	prev->sched_class->put_prev_task(rq, prev);
+
 	schedstat_inc(rq, sched_goidle);
 #ifdef CONFIG_SMP
 	/* Trigger the post schedule to do an idle_enter for CFS */
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1310,15 +1310,7 @@ static struct task_struct *_pick_next_ta
 {
 	struct sched_rt_entity *rt_se;
 	struct task_struct *p;
-	struct rt_rq *rt_rq;
-
-	rt_rq = &rq->rt;
-
-	if (!rt_rq->rt_nr_running)
-		return NULL;
-
-	if (rt_rq_throttled(rt_rq))
-		return NULL;
+	struct rt_rq *rt_rq  = &rq->rt;
 
 	do {
 		rt_se = pick_next_rt_entity(rq, rt_rq);
@@ -1332,9 +1324,21 @@ static struct task_struct *_pick_next_ta
 	return p;
 }
 
-static struct task_struct *pick_next_task_rt(struct rq *rq)
+static struct task_struct *
+pick_next_task_rt(struct rq *rq, struct task_struct *prev)
 {
-	struct task_struct *p = _pick_next_task_rt(rq);
+	struct task_struct *p;
+	struct rt_rq *rt_rq = &rq->rt;
+
+	if (!rt_rq->rt_nr_running)
+		return NULL;
+
+	if (rt_rq_throttled(rt_rq))
+		return NULL;
+
+	prev->sched_class->put_prev_task(rq, prev);
+
+	p = _pick_next_task_rt(rq);
 
 	/* The running task is never eligible for pushing */
 	if (p)
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1123,7 +1123,13 @@ struct sched_class {
 
 	void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);
 
-	struct task_struct * (*pick_next_task) (struct rq *rq);
+	/*
+	 * It is the responsibility of the pick_next_task() method that will
+	 * return the next task to call put_prev_task() on the @prev task or
+	 * something equivalent.
+	 */
+	struct task_struct * (*pick_next_task) (struct rq *rq,
+						struct task_struct *prev);
 	void (*put_prev_task) (struct rq *rq, struct task_struct *p);
 
 #ifdef CONFIG_SMP
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -23,16 +23,18 @@ check_preempt_curr_stop(struct rq *rq, s
 	/* we're never preempted */
 }
 
-static struct task_struct *pick_next_task_stop(struct rq *rq)
+static struct task_struct *
+pick_next_task_stop(struct rq *rq, struct task_struct *prev)
 {
 	struct task_struct *stop = rq->stop;
 
-	if (stop && stop->on_rq) {
-		stop->se.exec_start = rq_clock_task(rq);
-		return stop;
-	}
+	if (!stop || !stop->on_rq)
+		return NULL;
 
-	return NULL;
+	prev->sched_class->put_prev_task(rq, prev);
+	stop->se.exec_start = rq_clock_task(rq);
+
+	return stop;
 }
 
 static void




* [PATCH 6/9] sched/fair: Clean up __clear_buddies_*
  2014-01-28 17:16 [PATCH 0/9] Various sched patches -v2 Peter Zijlstra
                   ` (4 preceding siblings ...)
  2014-01-28 17:16 ` [PATCH 5/9] sched: Push put_prev_task() into pick_next_task() Peter Zijlstra
@ 2014-01-28 17:16 ` Peter Zijlstra
  2014-01-28 17:16 ` [PATCH 7/9] sched/fair: Optimize cgroup pick_next_task_fair Peter Zijlstra
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2014-01-28 17:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, daniel.lezcano, pjt, bsegall, Steven Rostedt,
	Vincent Guittot, Peter Zijlstra

[-- Attachment #1: peter_zijlstra-sched-optimize_cgroup_pick_next_task_fair_3.patch --]
[-- Type: text/plain, Size: 1074 bytes --]

Slightly easier code flow, no functional changes.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 kernel/sched/fair.c |   18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2567,10 +2567,10 @@ static void __clear_buddies_last(struct
 {
 	for_each_sched_entity(se) {
 		struct cfs_rq *cfs_rq = cfs_rq_of(se);
-		if (cfs_rq->last == se)
-			cfs_rq->last = NULL;
-		else
+		if (cfs_rq->last != se)
 			break;
+
+		cfs_rq->last = NULL;
 	}
 }
 
@@ -2578,10 +2578,10 @@ static void __clear_buddies_next(struct
 {
 	for_each_sched_entity(se) {
 		struct cfs_rq *cfs_rq = cfs_rq_of(se);
-		if (cfs_rq->next == se)
-			cfs_rq->next = NULL;
-		else
+		if (cfs_rq->next != se)
 			break;
+
+		cfs_rq->next = NULL;
 	}
 }
 
@@ -2589,10 +2589,10 @@ static void __clear_buddies_skip(struct
 {
 	for_each_sched_entity(se) {
 		struct cfs_rq *cfs_rq = cfs_rq_of(se);
-		if (cfs_rq->skip == se)
-			cfs_rq->skip = NULL;
-		else
+		if (cfs_rq->skip != se)
 			break;
+
+		cfs_rq->skip = NULL;
 	}
 }
 




* [PATCH 7/9] sched/fair: Optimize cgroup pick_next_task_fair
  2014-01-28 17:16 [PATCH 0/9] Various sched patches -v2 Peter Zijlstra
                   ` (5 preceding siblings ...)
  2014-01-28 17:16 ` [PATCH 6/9] sched/fair: Clean up __clear_buddies_* Peter Zijlstra
@ 2014-01-28 17:16 ` Peter Zijlstra
  2014-01-30 12:18   ` Vincent Guittot
  2014-01-28 17:16 ` [PATCH 8/9] sched: Clean up idle task SMP logic Peter Zijlstra
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2014-01-28 17:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, daniel.lezcano, pjt, bsegall, Steven Rostedt,
	Vincent Guittot, Mike Galbraith, Peter Zijlstra

[-- Attachment #1: peter_zijlstra-sched-optimize_cgroup_pick_next_task_fair_4.patch --]
[-- Type: text/plain, Size: 6220 bytes --]

Since commit 2f36825b1 ("sched: Next buddy hint on sleep and preempt
path") it is likely that we pick a new task from the same cgroup as the
current one; doing a put and then a set on all intermediate entities is
a waste of time, so try to avoid this.
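
As a concrete illustration, with the /a/b/c hierarchy used for the
measurement below and prev/next both living in cgroup c:

  prev != next:  before: put_prev_entity() on every entity up the
                         hierarchy, set_next_entity() on every entity
                         back down
                 after:  one put + one set, at the single level that
                         actually changes
  prev == next:  before: the same full put/set walk
                 after:  nothing is put or set at all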

Measured using:

  mount nodev /cgroup -t cgroup -o cpu
  cd /cgroup
  mkdir a; cd a
  mkdir b; cd b
  mkdir c; cd c
  echo $$ > tasks
  perf stat --repeat 10 -- taskset 1 perf bench sched pipe

PRE :      4.542422684 seconds time elapsed   ( +-  0.33% )
POST:      4.389409991 seconds time elapsed   ( +-  0.32% )

Which shows a significant improvement of ~3.5%

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 kernel/sched/fair.c |  122 ++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 110 insertions(+), 12 deletions(-)
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2911,17 +2911,36 @@ wakeup_preempt_entity(struct sched_entit
  * 3) pick the "last" process, for cache locality
  * 4) do not run the "skip" process, if something else is available
  */
-static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
+static struct sched_entity *
+pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 {
-	struct sched_entity *se = __pick_first_entity(cfs_rq);
-	struct sched_entity *left = se;
+	struct sched_entity *left = __pick_first_entity(cfs_rq);
+	struct sched_entity *se;
+
+	/*
+	 * If curr is set we have to see if its left of the leftmost entity
+	 * still in the tree, provided there was anything in the tree at all.
+	 */
+	if (!left || (curr && entity_before(curr, left)))
+		left = curr;
+
+	se = left; /* ideally we run the leftmost entity */
 
 	/*
 	 * Avoid running the skip buddy, if running something else can
 	 * be done without getting too unfair.
 	 */
 	if (cfs_rq->skip == se) {
-		struct sched_entity *second = __pick_next_entity(se);
+		struct sched_entity *second;
+
+		if (se == curr) {
+			second = __pick_first_entity(cfs_rq);
+		} else {
+			second = __pick_next_entity(se);
+			if (!second || (curr && entity_before(curr, second)))
+				second = curr;
+		}
+
 		if (second && wakeup_preempt_entity(second, left) < 1)
 			se = second;
 	}
@@ -2943,7 +2962,7 @@ static struct sched_entity *pick_next_en
 	return se;
 }
 
-static void check_cfs_rq_runtime(struct cfs_rq *cfs_rq);
+static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq);
 
 static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
 {
@@ -3598,22 +3617,23 @@ static void check_enqueue_throttle(struc
 }
 
 /* conditionally throttle active cfs_rq's from put_prev_entity() */
-static void check_cfs_rq_runtime(struct cfs_rq *cfs_rq)
+static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
 	if (!cfs_bandwidth_used())
-		return;
+		return false;
 
 	if (likely(!cfs_rq->runtime_enabled || cfs_rq->runtime_remaining > 0))
-		return;
+		return false;
 
 	/*
 	 * it's possible for a throttled entity to be forced into a running
 	 * state (e.g. set_curr_task), in this case we're finished.
 	 */
 	if (cfs_rq_throttled(cfs_rq))
-		return;
+		return true;
 
 	throttle_cfs_rq(cfs_rq);
+	return true;
 }
 
 static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
@@ -3723,7 +3743,7 @@ static inline u64 cfs_rq_clock_task(stru
 }
 
 static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec) {}
-static void check_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
+static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq) { return false; }
 static void check_enqueue_throttle(struct cfs_rq *cfs_rq) {}
 static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
 
@@ -4662,9 +4682,86 @@ static void check_preempt_wakeup(struct
 static struct task_struct *
 pick_next_task_fair(struct rq *rq, struct task_struct *prev)
 {
-	struct task_struct *p;
 	struct cfs_rq *cfs_rq = &rq->cfs;
 	struct sched_entity *se;
+	struct task_struct *p;
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+	if (!cfs_rq->nr_running)
+		return NULL;
+
+	if (prev->sched_class != &fair_sched_class)
+		goto simple;
+
+	/*
+	 * Because of the set_next_buddy() in dequeue_task_fair() it is rather
+	 * likely that a next task is from the same cgroup as the current.
+	 *
+	 * Therefore attempt to avoid putting and setting the entire cgroup
+	 * hierarchy, only change the part that actually changes.
+	 */
+
+	do {
+		struct sched_entity *curr = cfs_rq->curr;
+
+		/*
+		 * Since we got here without doing put_prev_entity() we also
+		 * have to consider cfs_rq->curr. If it is still a runnable
+		 * entity, update_curr() will update its vruntime, otherwise
+		 * forget we've ever seen it.
+		 */
+		if (curr && curr->on_rq)
+			update_curr(cfs_rq);
+		else
+			curr = NULL;
+
+		/*
+		 * This call to check_cfs_rq_runtime() will do the throttle and
+		 * dequeue its entity in the parent(s). Therefore the 'simple'
+		 * nr_running test will indeed be correct.
+		 */
+		if (unlikely(check_cfs_rq_runtime(cfs_rq)))
+			goto simple;
+
+		se = pick_next_entity(cfs_rq, curr);
+		cfs_rq = group_cfs_rq(se);
+	} while (cfs_rq);
+
+	p = task_of(se);
+
+	/*
+	 * Since we haven't yet done put_prev_entity and if the selected task
+	 * is a different task than we started out with, try and touch the
+	 * least amount of cfs_rqs.
+	 */
+	if (prev != p) {
+		struct sched_entity *pse = &prev->se;
+
+		while (!(cfs_rq = is_same_group(se, pse))) {
+			int se_depth = se->depth;
+			int pse_depth = pse->depth;
+
+			if (se_depth <= pse_depth) {
+				put_prev_entity(cfs_rq_of(pse), pse);
+				pse = parent_entity(pse);
+			}
+			if (se_depth >= pse_depth) {
+				set_next_entity(cfs_rq_of(se), se);
+				se = parent_entity(se);
+			}
+		}
+
+		put_prev_entity(cfs_rq, pse);
+		set_next_entity(cfs_rq, se);
+	}
+
+	if (hrtick_enabled(rq))
+		hrtick_start_fair(rq, p);
+
+	return p;
+simple:
+	cfs_rq = &rq->cfs;
+#endif
 
 	if (!cfs_rq->nr_running)
 		return NULL;
@@ -4672,12 +4769,13 @@ pick_next_task_fair(struct rq *rq, struc
 	prev->sched_class->put_prev_task(rq, prev);
 
 	do {
-		se = pick_next_entity(cfs_rq);
+		se = pick_next_entity(cfs_rq, NULL);
 		set_next_entity(cfs_rq, se);
 		cfs_rq = group_cfs_rq(se);
 	} while (cfs_rq);
 
 	p = task_of(se);
+
 	if (hrtick_enabled(rq))
 		hrtick_start_fair(rq, p);
 




* [PATCH 8/9] sched: Clean up idle task SMP logic
  2014-01-28 17:16 [PATCH 0/9] Various sched patches -v2 Peter Zijlstra
                   ` (6 preceding siblings ...)
  2014-01-28 17:16 ` [PATCH 7/9] sched/fair: Optimize cgroup pick_next_task_fair Peter Zijlstra
@ 2014-01-28 17:16 ` Peter Zijlstra
  2014-01-30 10:52   ` Vincent Guittot
  2014-01-28 17:16 ` [PATCH 9/9] sched: Push down pre_schedule() and idle_balance() Peter Zijlstra
  2014-01-28 18:07 ` [PATCH 0/9] Various sched patches -v2 Steven Rostedt
  9 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2014-01-28 17:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, daniel.lezcano, pjt, bsegall, Steven Rostedt,
	Vincent Guittot, Peter Zijlstra

[-- Attachment #1: peterz-sched-idle-1.patch --]
[-- Type: text/plain, Size: 1324 bytes --]

The idle post_schedule hook is just a vile waste of time; furthermore
it appears unneeded. Move the idle_enter_fair() call into
pick_next_task_idle().

Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: alex.shi@linaro.org
Cc: mingo@kernel.org
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 kernel/sched/idle_task.c |    9 +--------
 1 file changed, 1 insertion(+), 8 deletions(-)

--- a/kernel/sched/idle_task.c
+++ b/kernel/sched/idle_task.c
@@ -19,11 +19,6 @@ static void pre_schedule_idle(struct rq
 	idle_exit_fair(rq);
 	rq_last_tick_reset(rq);
 }
-
-static void post_schedule_idle(struct rq *rq)
-{
-	idle_enter_fair(rq);
-}
 #endif /* CONFIG_SMP */
 /*
  * Idle tasks are unconditionally rescheduled:
@@ -40,8 +35,7 @@ pick_next_task_idle(struct rq *rq, struc
 
 	schedstat_inc(rq, sched_goidle);
 #ifdef CONFIG_SMP
-	/* Trigger the post schedule to do an idle_enter for CFS */
-	rq->post_schedule = 1;
+	idle_enter_fair(rq);
 #endif
 	return rq->idle;
 }
@@ -105,7 +99,6 @@ const struct sched_class idle_sched_clas
 #ifdef CONFIG_SMP
 	.select_task_rq		= select_task_rq_idle,
 	.pre_schedule		= pre_schedule_idle,
-	.post_schedule		= post_schedule_idle,
 #endif
 
 	.set_curr_task          = set_curr_task_idle,




* [PATCH 9/9] sched: Push down pre_schedule() and idle_balance()
  2014-01-28 17:16 [PATCH 0/9] Various sched patches -v2 Peter Zijlstra
                   ` (7 preceding siblings ...)
  2014-01-28 17:16 ` [PATCH 8/9] sched: Clean up idle task SMP logic Peter Zijlstra
@ 2014-01-28 17:16 ` Peter Zijlstra
  2014-01-30 12:45   ` Vincent Guittot
  2014-01-28 18:07 ` [PATCH 0/9] Various sched patches -v2 Steven Rostedt
  9 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2014-01-28 17:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, daniel.lezcano, pjt, bsegall, Steven Rostedt,
	Vincent Guittot, Peter Zijlstra

[-- Attachment #1: peter_zijlstra-sched-clean_up_idle_task_smp_logic.patch --]
[-- Type: text/plain, Size: 8277 bytes --]

This patch merges idle_balance() and pre_schedule() and pushes both
of them into pick_next_task().

Conceptually pre_schedule() and idle_balance() are rather similar,
both are used to pull more work onto the current CPU.

We cannot, however, simply move idle_balance() into pre_schedule_fair(),
since there is no guarantee the last runnable task is a fair task, and
thus we would miss newidle balances.

Similarly, the dl and rt pre_schedule calls must be run before
idle_balance(), since their respective tasks have higher priority and
it would not do to delay their execution by searching for less important
tasks first.

However, by noticing that pick_next_task() already traverses the
sched_class hierarchy in the right order, we can get the right
behaviour and do away with both calls.

We must, however, change the special-case optimization to also require
that prev is of the fair_sched_class, otherwise we could miss doing a dl
or rt pull where we needed one.
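
In other words, the pulls now happen in class-priority order as a natural
consequence of the pick loop (illustrative summary of the hunks below):

  pick_next_task(rq, prev):
    pick_next_task_dl()    -> pull_dl_task()  if prev was a dl task
    pick_next_task_rt()    -> pull_rt_task()  if this rq's prio dropped
    pick_next_task_fair()  -> idle_balance()  when no fair task is left,
                              then retry the fair pick ("goto again")
    pick_next_task_idle()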

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 kernel/sched/core.c      |   26 ++------------------------
 kernel/sched/deadline.c  |   15 +++++++--------
 kernel/sched/fair.c      |   24 ++++++++++++++++++++----
 kernel/sched/idle_task.c |   12 +++++-------
 kernel/sched/rt.c        |   16 ++++++++--------
 kernel/sched/sched.h     |    1 -
 6 files changed, 42 insertions(+), 52 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2146,13 +2146,6 @@ static void finish_task_switch(struct rq
 
 #ifdef CONFIG_SMP
 
-/* assumes rq->lock is held */
-static inline void pre_schedule(struct rq *rq, struct task_struct *prev)
-{
-	if (prev->sched_class->pre_schedule)
-		prev->sched_class->pre_schedule(rq, prev);
-}
-
 /* rq->lock is NOT held, but preemption is disabled */
 static inline void post_schedule(struct rq *rq)
 {
@@ -2170,10 +2163,6 @@ static inline void post_schedule(struct
 
 #else
 
-static inline void pre_schedule(struct rq *rq, struct task_struct *p)
-{
-}
-
 static inline void post_schedule(struct rq *rq)
 {
 }
@@ -2569,7 +2558,8 @@ pick_next_task(struct rq *rq, struct tas
 	 * Optimization: we know that if all tasks are in
 	 * the fair class we can call that function directly:
 	 */
-	if (likely(rq->nr_running == rq->cfs.h_nr_running)) {
+	if (likely(prev->sched_class == &fair_sched_class &&
+		   rq->nr_running == rq->cfs.h_nr_running)) {
 		p = fair_sched_class.pick_next_task(rq, prev);
 		if (likely(p))
 			return p;
@@ -2672,18 +2662,6 @@ static void __sched __schedule(void)
 		switch_count = &prev->nvcsw;
 	}
 
-	pre_schedule(rq, prev);
-
-	if (unlikely(!rq->nr_running)) {
-		/*
-		 * We must set idle_stamp _before_ calling idle_balance(), such
-		 * that we measure the duration of idle_balance() as idle time.
-		 */
-		rq->idle_stamp = rq_clock(rq);
-		if (idle_balance(rq))
-			rq->idle_stamp = 0;
-	}
-
 	if (prev->on_rq || rq->skip_clock_update < 0)
 		update_rq_clock(rq);
 
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -989,6 +989,8 @@ static struct sched_dl_entity *pick_next
 	return rb_entry(left, struct sched_dl_entity, rb_node);
 }
 
+static int pull_dl_task(struct rq *this_rq);
+
 struct task_struct *pick_next_task_dl(struct rq *rq, struct task_struct *prev)
 {
 	struct sched_dl_entity *dl_se;
@@ -997,6 +999,11 @@ struct task_struct *pick_next_task_dl(st
 
 	dl_rq = &rq->dl;
 
+#ifdef CONFIG_SMP
+	if (dl_task(prev))
+		pull_dl_task(rq);
+#endif
+
 	if (unlikely(!dl_rq->dl_nr_running))
 		return NULL;
 
@@ -1427,13 +1434,6 @@ static int pull_dl_task(struct rq *this_
 	return ret;
 }
 
-static void pre_schedule_dl(struct rq *rq, struct task_struct *prev)
-{
-	/* Try to pull other tasks here */
-	if (dl_task(prev))
-		pull_dl_task(rq);
-}
-
 static void post_schedule_dl(struct rq *rq)
 {
 	push_dl_tasks(rq);
@@ -1626,7 +1626,6 @@ const struct sched_class dl_sched_class
 	.set_cpus_allowed       = set_cpus_allowed_dl,
 	.rq_online              = rq_online_dl,
 	.rq_offline             = rq_offline_dl,
-	.pre_schedule		= pre_schedule_dl,
 	.post_schedule		= post_schedule_dl,
 	.task_woken		= task_woken_dl,
 #endif
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2581,7 +2581,8 @@ void idle_exit_fair(struct rq *this_rq)
 	update_rq_runnable_avg(this_rq, 0);
 }
 
-#else
+#else /* CONFIG_SMP */
+
 static inline void update_entity_load_avg(struct sched_entity *se,
 					  int update_cfs_rq) {}
 static inline void update_rq_runnable_avg(struct rq *rq, int runnable) {}
@@ -2593,7 +2594,7 @@ static inline void dequeue_entity_load_a
 					   int sleep) {}
 static inline void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq,
 					      int force_update) {}
-#endif
+#endif /* CONFIG_SMP */
 
 static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
@@ -4686,9 +4687,10 @@ pick_next_task_fair(struct rq *rq, struc
 	struct sched_entity *se;
 	struct task_struct *p;
 
+again:
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	if (!cfs_rq->nr_running)
-		return NULL;
+		goto idle;
 
 	if (prev->sched_class != &fair_sched_class)
 		goto simple;
@@ -4764,7 +4766,7 @@ pick_next_task_fair(struct rq *rq, struc
 #endif
 
 	if (!cfs_rq->nr_running)
-		return NULL;
+		goto idle;
 
 	prev->sched_class->put_prev_task(rq, prev);
 
@@ -4780,6 +4782,20 @@ pick_next_task_fair(struct rq *rq, struc
 		hrtick_start_fair(rq, p);
 
 	return p;
+
+idle:
+	idle_exit_fair(rq);
+	/*
+	 * We must set idle_stamp _before_ calling idle_balance(), such that we
+	 * measure the duration of idle_balance() as idle time.
+	 */
+	rq->idle_stamp = rq_clock(rq);
+	if (idle_balance(rq)) { /* drops rq->lock */
+		rq->idle_stamp = 0;
+		goto again;
+	}
+
+	return NULL;
 }
 
 /*
--- a/kernel/sched/idle_task.c
+++ b/kernel/sched/idle_task.c
@@ -13,13 +13,8 @@ select_task_rq_idle(struct task_struct *
 {
 	return task_cpu(p); /* IDLE tasks as never migrated */
 }
-
-static void pre_schedule_idle(struct rq *rq, struct task_struct *prev)
-{
-	idle_exit_fair(rq);
-	rq_last_tick_reset(rq);
-}
 #endif /* CONFIG_SMP */
+
 /*
  * Idle tasks are unconditionally rescheduled:
  */
@@ -55,6 +50,10 @@ dequeue_task_idle(struct rq *rq, struct
 
 static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
 {
+#ifdef CONFIG_SMP
+	idle_exit_fair(rq);
+	rq_last_tick_reset(rq);
+#endif
 }
 
 static void task_tick_idle(struct rq *rq, struct task_struct *curr, int queued)
@@ -98,7 +97,6 @@ const struct sched_class idle_sched_clas
 
 #ifdef CONFIG_SMP
 	.select_task_rq		= select_task_rq_idle,
-	.pre_schedule		= pre_schedule_idle,
 #endif
 
 	.set_curr_task          = set_curr_task_idle,
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1324,12 +1324,20 @@ static struct task_struct *_pick_next_ta
 	return p;
 }
 
+static int pull_rt_task(struct rq *this_rq);
+
 static struct task_struct *
 pick_next_task_rt(struct rq *rq, struct task_struct *prev)
 {
 	struct task_struct *p;
 	struct rt_rq *rt_rq = &rq->rt;
 
+#ifdef CONFIG_SMP
+	/* Try to pull RT tasks here if we lower this rq's prio */
+	if (rq->rt.highest_prio.curr > prev->prio)
+		pull_rt_task(rq);
+#endif
+
 	if (!rt_rq->rt_nr_running)
 		return NULL;
 
@@ -1720,13 +1728,6 @@ static int pull_rt_task(struct rq *this_
 	return ret;
 }
 
-static void pre_schedule_rt(struct rq *rq, struct task_struct *prev)
-{
-	/* Try to pull RT tasks here if we lower this rq's prio */
-	if (rq->rt.highest_prio.curr > prev->prio)
-		pull_rt_task(rq);
-}
-
 static void post_schedule_rt(struct rq *rq)
 {
 	push_rt_tasks(rq);
@@ -2003,7 +2004,6 @@ const struct sched_class rt_sched_class
 	.set_cpus_allowed       = set_cpus_allowed_rt,
 	.rq_online              = rq_online_rt,
 	.rq_offline             = rq_offline_rt,
-	.pre_schedule		= pre_schedule_rt,
 	.post_schedule		= post_schedule_rt,
 	.task_woken		= task_woken_rt,
 	.switched_from		= switched_from_rt,
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1136,7 +1136,6 @@ struct sched_class {
 	int  (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
 	void (*migrate_task_rq)(struct task_struct *p, int next_cpu);
 
-	void (*pre_schedule) (struct rq *this_rq, struct task_struct *task);
 	void (*post_schedule) (struct rq *this_rq);
 	void (*task_waking) (struct task_struct *task);
 	void (*task_woken) (struct rq *this_rq, struct task_struct *task);




* Re: [PATCH 0/9] Various sched patches -v2
  2014-01-28 17:16 [PATCH 0/9] Various sched patches -v2 Peter Zijlstra
                   ` (8 preceding siblings ...)
  2014-01-28 17:16 ` [PATCH 9/9] sched: Push down pre_schedule() and idle_balance() Peter Zijlstra
@ 2014-01-28 18:07 ` Steven Rostedt
  9 siblings, 0 replies; 19+ messages in thread
From: Steven Rostedt @ 2014-01-28 18:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, daniel.lezcano, pjt, bsegall, Vincent Guittot

On Tue, 28 Jan 2014 18:16:34 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> Here's a set that seems to work properly and hopefully addresses all previously
> raised issues.
> 
> Daniel, I'm sorry, the idle_stamp thing wandered back over to fair.c :-)
> 
> I did test the fair bits, including bandwidth (although not very
> extensively).
> 
> I did not test the dl/rt bits; Steve, do you have a testcase for that?


My sched tests are usually done by running some scheduling tests while
recording sched switches and then analyzing them under kernelshark. I
don't have many other tests that do pass/fail. Hmm, I did have a couple
of rt migration tests, but that's it.

I'll take a look at these tomorrow.

-- Steve



* Re: [PATCH 5/9] sched: Push put_prev_task() into pick_next_task()
  2014-01-28 17:16 ` [PATCH 5/9] sched: Push put_prev_task() into pick_next_task() Peter Zijlstra
@ 2014-01-28 18:46   ` bsegall
  2014-01-28 19:14     ` Peter Zijlstra
  0 siblings, 1 reply; 19+ messages in thread
From: bsegall @ 2014-01-28 18:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, daniel.lezcano, pjt, Steven Rostedt,
	Vincent Guittot

Peter Zijlstra <peterz@infradead.org> writes:

> In order to avoid having to do put/set on a whole cgroup hierarchy
> when we context switch, push the put into pick_next_task() so that
> both operations are in the same function. Further changes then allow
> us to possibly optimize away redundant work.
>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> ---
>  kernel/sched/core.c      |   21 ++++++++-------------
>  kernel/sched/deadline.c  |    4 +++-
>  kernel/sched/fair.c      |    5 ++++-
>  kernel/sched/idle_task.c |    5 ++++-
>  kernel/sched/rt.c        |   26 +++++++++++++++-----------
>  kernel/sched/sched.h     |    8 +++++++-
>  kernel/sched/stop_task.c |   14 ++++++++------
>  7 files changed, 49 insertions(+), 34 deletions(-)
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2556,18 +2556,11 @@ static inline void schedule_debug(struct
>  	schedstat_inc(this_rq(), sched_count);
>  }
>  
> -static void put_prev_task(struct rq *rq, struct task_struct *prev)
> -{
> -	if (prev->on_rq || rq->skip_clock_update < 0)
> -		update_rq_clock(rq);
> -	prev->sched_class->put_prev_task(rq, prev);
> -}
> -
>  /*
>   * Pick up the highest-prio task:
>   */
>  static inline struct task_struct *
> -pick_next_task(struct rq *rq)
> +pick_next_task(struct rq *rq, struct task_struct *prev)
>  {
>  	const struct sched_class *class;
>  	struct task_struct *p;
> @@ -2577,13 +2570,13 @@ pick_next_task(struct rq *rq)
>  	 * the fair class we can call that function directly:
>  	 */
>  	if (likely(rq->nr_running == rq->cfs.h_nr_running)) {
> -		p = fair_sched_class.pick_next_task(rq);
> +		p = fair_sched_class.pick_next_task(rq, prev);
>  		if (likely(p))
>  			return p;
>  	}
>  
>  	for_each_class(class) {
> -		p = class->pick_next_task(rq);
> +		p = class->pick_next_task(rq, prev);
>  		if (p)
>  			return p;
>  	}
> @@ -2691,8 +2684,10 @@ static void __sched __schedule(void)
>  			rq->idle_stamp = 0;
>  	}
>  
> -	put_prev_task(rq, prev);
> -	next = pick_next_task(rq);
> +	if (prev->on_rq || rq->skip_clock_update < 0)
> +		update_rq_clock(rq);
> +
> +	next = pick_next_task(rq, prev);
>  	clear_tsk_need_resched(prev);
>  	clear_preempt_need_resched();
>  	rq->skip_clock_update = 0;
> @@ -4734,7 +4729,7 @@ static void migrate_tasks(unsigned int d
>  		if (rq->nr_running == 1)
>  			break;
>  
> -		next = pick_next_task(rq);
> +		next = pick_next_task(rq, NULL);

This seems to be incorrect without the if (prev) lines in
pick_next_task_foo, since foo_nr_running isn't enough to save us.

>  		BUG_ON(!next);
>  		next->sched_class->put_prev_task(rq, next);
>  
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -989,7 +989,7 @@ static struct sched_dl_entity *pick_next
>  	return rb_entry(left, struct sched_dl_entity, rb_node);
>  }
>  
> -struct task_struct *pick_next_task_dl(struct rq *rq)
> +struct task_struct *pick_next_task_dl(struct rq *rq, struct task_struct *prev)
>  {
>  	struct sched_dl_entity *dl_se;
>  	struct task_struct *p;
> @@ -1000,6 +1000,8 @@ struct task_struct *pick_next_task_dl(st
>  	if (unlikely(!dl_rq->dl_nr_running))
>  		return NULL;
>  
> +	prev->sched_class->put_prev_task(rq, prev);
> +
>  	dl_se = pick_next_dl_entity(rq, dl_rq);
>  	BUG_ON(!dl_se);
>  
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4659,7 +4659,8 @@ static void check_preempt_wakeup(struct
>  		set_last_buddy(se);
>  }
>  
> -static struct task_struct *pick_next_task_fair(struct rq *rq)
> +static struct task_struct *
> +pick_next_task_fair(struct rq *rq, struct task_struct *prev)
>  {
>  	struct task_struct *p;
>  	struct cfs_rq *cfs_rq = &rq->cfs;
> @@ -4668,6 +4669,8 @@ static struct task_struct *pick_next_tas
>  	if (!cfs_rq->nr_running)
>  		return NULL;
>  
> +	prev->sched_class->put_prev_task(rq, prev);
> +
>  	do {
>  		se = pick_next_entity(cfs_rq);
>  		set_next_entity(cfs_rq, se);
> --- a/kernel/sched/idle_task.c
> +++ b/kernel/sched/idle_task.c
> @@ -33,8 +33,11 @@ static void check_preempt_curr_idle(stru
>  	resched_task(rq->idle);
>  }
>  
> -static struct task_struct *pick_next_task_idle(struct rq *rq)
> +static struct task_struct *
> +pick_next_task_idle(struct rq *rq, struct task_struct *prev)
>  {
> +	prev->sched_class->put_prev_task(rq, prev);
> +
>  	schedstat_inc(rq, sched_goidle);
>  #ifdef CONFIG_SMP
>  	/* Trigger the post schedule to do an idle_enter for CFS */
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1310,15 +1310,7 @@ static struct task_struct *_pick_next_ta
>  {
>  	struct sched_rt_entity *rt_se;
>  	struct task_struct *p;
> -	struct rt_rq *rt_rq;
> -
> -	rt_rq = &rq->rt;
> -
> -	if (!rt_rq->rt_nr_running)
> -		return NULL;
> -
> -	if (rt_rq_throttled(rt_rq))
> -		return NULL;
> +	struct rt_rq *rt_rq  = &rq->rt;
>  
>  	do {
>  		rt_se = pick_next_rt_entity(rq, rt_rq);
> @@ -1332,9 +1324,21 @@ static struct task_struct *_pick_next_ta
>  	return p;
>  }
>  
> -static struct task_struct *pick_next_task_rt(struct rq *rq)
> +static struct task_struct *
> +pick_next_task_rt(struct rq *rq, struct task_struct *prev)
>  {
> -	struct task_struct *p = _pick_next_task_rt(rq);
> +	struct task_struct *p;
> +	struct rt_rq *rt_rq = &rq->rt;
> +
> +	if (!rt_rq->rt_nr_running)
> +		return NULL;
> +
> +	if (rt_rq_throttled(rt_rq))
> +		return NULL;
> +
> +	prev->sched_class->put_prev_task(rq, prev);
> +
> +	p = _pick_next_task_rt(rq);
>  
>  	/* The running task is never eligible for pushing */
>  	if (p)
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1123,7 +1123,13 @@ struct sched_class {
>  
>  	void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);
>  
> -	struct task_struct * (*pick_next_task) (struct rq *rq);
> +	/*
> +	 * It is the responsibility of the pick_next_task() method that will
> +	 * return the next task to call put_prev_task() on the @prev task or
> +	 * something equivalent.
> +	 */
> +	struct task_struct * (*pick_next_task) (struct rq *rq,
> +						struct task_struct *prev);
>  	void (*put_prev_task) (struct rq *rq, struct task_struct *p);
>  
>  #ifdef CONFIG_SMP
> --- a/kernel/sched/stop_task.c
> +++ b/kernel/sched/stop_task.c
> @@ -23,16 +23,18 @@ check_preempt_curr_stop(struct rq *rq, s
>  	/* we're never preempted */
>  }
>  
> -static struct task_struct *pick_next_task_stop(struct rq *rq)
> +static struct task_struct *
> +pick_next_task_stop(struct rq *rq, struct task_struct *prev)
>  {
>  	struct task_struct *stop = rq->stop;
>  
> -	if (stop && stop->on_rq) {
> -		stop->se.exec_start = rq_clock_task(rq);
> -		return stop;
> -	}
> +	if (!stop || !stop->on_rq)
> +		return NULL;
>  
> -	return NULL;
> +	prev->sched_class->put_prev_task(rq, prev);
> +	stop->se.exec_start = rq_clock_task(rq);
> +
> +	return stop;
>  }
>  
>  static void


* Re: [PATCH 5/9] sched: Push put_prev_task() into pick_next_task()
  2014-01-28 18:46   ` bsegall
@ 2014-01-28 19:14     ` Peter Zijlstra
  0 siblings, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2014-01-28 19:14 UTC (permalink / raw)
  To: bsegall
  Cc: linux-kernel, mingo, daniel.lezcano, pjt, Steven Rostedt,
	Vincent Guittot

On Tue, Jan 28, 2014 at 10:46:18AM -0800, bsegall@google.com wrote:
> > @@ -4734,7 +4729,7 @@ static void migrate_tasks(unsigned int d
> >  		if (rq->nr_running == 1)
> >  			break;
> >  
> > -		next = pick_next_task(rq);
> > +		next = pick_next_task(rq, NULL);
> 
> This seems to be incorrect without the if (prev) lines in
> pick_next_task_foo, since foo_nr_running isn't enough to save us.

ARgh, I knew there must've been a reason I had those :/


* Re: [PATCH 8/9] sched: Clean up idle task SMP logic
  2014-01-28 17:16 ` [PATCH 8/9] sched: Clean up idle task SMP logic Peter Zijlstra
@ 2014-01-30 10:52   ` Vincent Guittot
  0 siblings, 0 replies; 19+ messages in thread
From: Vincent Guittot @ 2014-01-30 10:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Daniel Lezcano, Paul Turner,
	Benjamin Segall, Steven Rostedt

Hi Peter,

Acked-by: Vincent Guittot <vincent.guittot@linaro.org>

Vincent

On 28 January 2014 18:16, Peter Zijlstra <peterz@infradead.org> wrote:
> The idle post_schedule hook is just a vile waste of time; furthermore
> it appears unneeded. Move the idle_enter_fair() call into
> pick_next_task_idle().
>
> Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: alex.shi@linaro.org
> Cc: mingo@kernel.org
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> ---
>  kernel/sched/idle_task.c |    9 +--------
>  1 file changed, 1 insertion(+), 8 deletions(-)
>
> --- a/kernel/sched/idle_task.c
> +++ b/kernel/sched/idle_task.c
> @@ -19,11 +19,6 @@ static void pre_schedule_idle(struct rq
>         idle_exit_fair(rq);
>         rq_last_tick_reset(rq);
>  }
> -
> -static void post_schedule_idle(struct rq *rq)
> -{
> -       idle_enter_fair(rq);
> -}
>  #endif /* CONFIG_SMP */
>  /*
>   * Idle tasks are unconditionally rescheduled:
> @@ -40,8 +35,7 @@ pick_next_task_idle(struct rq *rq, struc
>
>         schedstat_inc(rq, sched_goidle);
>  #ifdef CONFIG_SMP
> -       /* Trigger the post schedule to do an idle_enter for CFS */
> -       rq->post_schedule = 1;
> +       idle_enter_fair(rq);
>  #endif
>         return rq->idle;
>  }
> @@ -105,7 +99,6 @@ const struct sched_class idle_sched_clas
>  #ifdef CONFIG_SMP
>         .select_task_rq         = select_task_rq_idle,
>         .pre_schedule           = pre_schedule_idle,
> -       .post_schedule          = post_schedule_idle,
>  #endif
>
>         .set_curr_task          = set_curr_task_idle,
>
>


* Re: [PATCH 7/9] sched/fair: Optimize cgroup pick_next_task_fair
  2014-01-28 17:16 ` [PATCH 7/9] sched/fair: Optimize cgroup pick_next_task_fair Peter Zijlstra
@ 2014-01-30 12:18   ` Vincent Guittot
  2014-01-30 12:37     ` Peter Zijlstra
  0 siblings, 1 reply; 19+ messages in thread
From: Vincent Guittot @ 2014-01-30 12:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Daniel Lezcano, Paul Turner,
	Benjamin Segall, Steven Rostedt, Mike Galbraith

On 28 January 2014 18:16, Peter Zijlstra <peterz@infradead.org> wrote:

[snip]

>
> @@ -4662,9 +4682,86 @@ static void check_preempt_wakeup(struct
>  static struct task_struct *
>  pick_next_task_fair(struct rq *rq, struct task_struct *prev)
>  {
> -       struct task_struct *p;
>         struct cfs_rq *cfs_rq = &rq->cfs;
>         struct sched_entity *se;
> +       struct task_struct *p;
> +
> +#ifdef CONFIG_FAIR_GROUP_SCHED
> +       if (!cfs_rq->nr_running)
> +               return NULL;

Couldn't you move the test above out of the CONFIG_FAIR_GROUP_SCHED
block and remove the same test that is done after the simple label?

Vincent

> +
> +       if (prev->sched_class != &fair_sched_class)
> +               goto simple;
> +
> +       /*
> +        * Because of the set_next_buddy() in dequeue_task_fair() it is rather
> +        * likely that a next task is from the same cgroup as the current.
> +        *
> +        * Therefore attempt to avoid putting and setting the entire cgroup
> +        * hierarchy, only change the part that actually changes.
> +        */
> +
> +       do {
> +               struct sched_entity *curr = cfs_rq->curr;
> +
> +               /*
> +                * Since we got here without doing put_prev_entity() we also
> +                * have to consider cfs_rq->curr. If it is still a runnable
> +                * entity, update_curr() will update its vruntime, otherwise
> +                * forget we've ever seen it.
> +                */
> +               if (curr && curr->on_rq)
> +                       update_curr(cfs_rq);
> +               else
> +                       curr = NULL;
> +
> +               /*
> +                * This call to check_cfs_rq_runtime() will do the throttle and
> +                * dequeue its entity in the parent(s). Therefore the 'simple'
> +                * nr_running test will indeed be correct.
> +                */
> +               if (unlikely(check_cfs_rq_runtime(cfs_rq)))
> +                       goto simple;
> +
> +               se = pick_next_entity(cfs_rq, curr);
> +               cfs_rq = group_cfs_rq(se);
> +       } while (cfs_rq);
> +
> +       p = task_of(se);
> +
> +       /*
> +        * Since we haven't yet done put_prev_entity and if the selected task
> +        * is a different task than we started out with, try and touch the
> +        * least amount of cfs_rqs.
> +        */
> +       if (prev != p) {
> +               struct sched_entity *pse = &prev->se;
> +
> +               while (!(cfs_rq = is_same_group(se, pse))) {
> +                       int se_depth = se->depth;
> +                       int pse_depth = pse->depth;
> +
> +                       if (se_depth <= pse_depth) {
> +                               put_prev_entity(cfs_rq_of(pse), pse);
> +                               pse = parent_entity(pse);
> +                       }
> +                       if (se_depth >= pse_depth) {
> +                               set_next_entity(cfs_rq_of(se), se);
> +                               se = parent_entity(se);
> +                       }
> +               }
> +
> +               put_prev_entity(cfs_rq, pse);
> +               set_next_entity(cfs_rq, se);
> +       }
> +
> +       if (hrtick_enabled(rq))
> +               hrtick_start_fair(rq, p);
> +
> +       return p;
> +simple:
> +       cfs_rq = &rq->cfs;
> +#endif
>
>         if (!cfs_rq->nr_running)
>                 return NULL;
> @@ -4672,12 +4769,13 @@ pick_next_task_fair(struct rq *rq, struc
>         prev->sched_class->put_prev_task(rq, prev);
>
>         do {
> -               se = pick_next_entity(cfs_rq);
> +               se = pick_next_entity(cfs_rq, NULL);
>                 set_next_entity(cfs_rq, se);
>                 cfs_rq = group_cfs_rq(se);
>         } while (cfs_rq);
>
>         p = task_of(se);
> +
>         if (hrtick_enabled(rq))
>                 hrtick_start_fair(rq, p);
>
>
>
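
As an aside, the depth-matching walk in the hunk above (the loop around
is_same_group(se, pse)) can be modelled with a small stand-alone program;
every name below is invented for the sketch and the put/set side effects
are reduced to comments:

#include <stdio.h>

struct toy_se {
	const char *name;
	int depth;			/* 0 at the root, as in se->depth */
	struct toy_se *parent;		/* NULL for the root */
};

/* Walk both entities up, deeper one first, until they meet. */
static struct toy_se *toy_common_group(struct toy_se *se, struct toy_se *pse)
{
	while (se != pse) {
		int se_depth = se->depth;
		int pse_depth = pse->depth;

		if (se_depth <= pse_depth)
			pse = pse->parent;	/* put_prev_entity() side */
		if (se_depth >= pse_depth)
			se = se->parent;	/* set_next_entity() side */
	}
	return se;
}

int main(void)
{
	struct toy_se root  = { "root",  0, NULL  };
	struct toy_se grp_a = { "grp_a", 1, &root };
	struct toy_se grp_b = { "grp_b", 1, &root };
	struct toy_se prev  = { "prev",  2, &grp_a };
	struct toy_se next  = { "next",  2, &grp_b };

	printf("common group: %s\n", toy_common_group(&next, &prev)->name);
	return 0;
}

Run as-is it prints "common group: root".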

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 7/9] sched/fair: Optimize cgroup pick_next_task_fair
  2014-01-30 12:18   ` Vincent Guittot
@ 2014-01-30 12:37     ` Peter Zijlstra
  2014-01-30 12:56       ` Vincent Guittot
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2014-01-30 12:37 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, Ingo Molnar, Daniel Lezcano, Paul Turner,
	Benjamin Segall, Steven Rostedt, Mike Galbraith

On Thu, Jan 30, 2014 at 01:18:09PM +0100, Vincent Guittot wrote:
> On 28 January 2014 18:16, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> [snip]
> 
> >
> > @@ -4662,9 +4682,86 @@ static void check_preempt_wakeup(struct
> >  static struct task_struct *
> >  pick_next_task_fair(struct rq *rq, struct task_struct *prev)
> >  {
> > -       struct task_struct *p;
> >         struct cfs_rq *cfs_rq = &rq->cfs;
> >         struct sched_entity *se;
> > +       struct task_struct *p;
> > +
> > +#ifdef CONFIG_FAIR_GROUP_SCHED
> > +       if (!cfs_rq->nr_running)
> > +               return NULL;
> 
> Couldn't you move the test above out of the CONFIG_FAIR_GROUP_SCHED block
> and remove the duplicate test that is done after the 'simple' label?

No, we have to check it twice because..
> 
> > +
> > +       if (prev->sched_class != &fair_sched_class)
> > +               goto simple;
> > +
> > +       /*
> > +        * Because of the set_next_buddy() in dequeue_task_fair() it is rather
> > +        * likely that a next task is from the same cgroup as the current.
> > +        *
> > +        * Therefore attempt to avoid putting and setting the entire cgroup
> > +        * hierarchy, only change the part that actually changes.
> > +        */
> > +
> > +       do {
> > +               struct sched_entity *curr = cfs_rq->curr;
> > +
> > +               /*
> > +                * Since we got here without doing put_prev_entity() we also
> > +                * have to consider cfs_rq->curr. If it is still a runnable
> > +                * entity, update_curr() will update its vruntime, otherwise
> > +                * forget we've ever seen it.
> > +                */
> > +               if (curr && curr->on_rq)
> > +                       update_curr(cfs_rq);
> > +               else
> > +                       curr = NULL;
> > +
> > +               /*
> > +                * This call to check_cfs_rq_runtime() will do the throttle and
> > +                * dequeue its entity in the parent(s). Therefore the 'simple'
> > +                * nr_running test will indeed be correct.
> > +                */
> > +               if (unlikely(check_cfs_rq_runtime(cfs_rq)))
> > +                       goto simple;

... here: if you read the comment above, we could have modified
nr_running.

> > +               se = pick_next_entity(cfs_rq, curr);
> > +               cfs_rq = group_cfs_rq(se);
> > +       } while (cfs_rq);
> > +
> > +       p = task_of(se);
> > +
> > +       /*
> > +        * Since we haven't yet done put_prev_entity and if the selected task
> > +        * is a different task than we started out with, try and touch the
> > +        * least amount of cfs_rqs.
> > +        */
> > +       if (prev != p) {
> > +               struct sched_entity *pse = &prev->se;
> > +
> > +               while (!(cfs_rq = is_same_group(se, pse))) {
> > +                       int se_depth = se->depth;
> > +                       int pse_depth = pse->depth;
> > +
> > +                       if (se_depth <= pse_depth) {
> > +                               put_prev_entity(cfs_rq_of(pse), pse);
> > +                               pse = parent_entity(pse);
> > +                       }
> > +                       if (se_depth >= pse_depth) {
> > +                               set_next_entity(cfs_rq_of(se), se);
> > +                               se = parent_entity(se);
> > +                       }
> > +               }
> > +
> > +               put_prev_entity(cfs_rq, pse);
> > +               set_next_entity(cfs_rq, se);
> > +       }
> > +
> > +       if (hrtick_enabled(rq))
> > +               hrtick_start_fair(rq, p);
> > +
> > +       return p;
> > +simple:
> > +       cfs_rq = &rq->cfs;
> > +#endif
> >
> >         if (!cfs_rq->nr_running)
> >                 return NULL;

And therefore this test needs to stay.
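
The shape of the problem is easy to reproduce in a small stand-alone
model; the struct, the throttle rule and all names below are invented for
the sketch, only the control flow mirrors the hunk quoted above:

#include <stdio.h>
#include <stdbool.h>

struct toy_cfs_rq {
	int nr_running;
	bool out_of_runtime;	/* stands in for an expired bandwidth quota */
};

/*
 * Models check_cfs_rq_runtime(): when the quota is gone the group is
 * "throttled" and its tasks drop out of the run count as a side effect
 * of the check itself.
 */
static bool toy_check_runtime(struct toy_cfs_rq *cfs_rq)
{
	if (!cfs_rq->out_of_runtime)
		return false;
	cfs_rq->nr_running = 0;
	return true;
}

static const char *toy_pick(struct toy_cfs_rq *cfs_rq)
{
	if (!cfs_rq->nr_running)		/* first test */
		return NULL;

	if (toy_check_runtime(cfs_rq))
		goto simple;			/* nr_running may have changed */

	return "fast path task";

simple:
	if (!cfs_rq->nr_running)		/* second test: cannot be hoisted */
		return NULL;
	return "simple path task";
}

int main(void)
{
	struct toy_cfs_rq rq = { .nr_running = 1, .out_of_runtime = true };
	const char *p = toy_pick(&rq);

	printf("picked: %s\n", p ? p : "(nothing, group was throttled)");
	return 0;
}

With the first test hoisted out and the second one removed, the sketch
would happily return a task from a group that was just throttled, which
is the situation the double check guards against.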

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 9/9] sched: Push down pre_schedule() and idle_balance()
  2014-01-28 17:16 ` [PATCH 9/9] sched: Push down pre_schedule() and idle_balance() Peter Zijlstra
@ 2014-01-30 12:45   ` Vincent Guittot
  2014-01-30 15:22     ` Peter Zijlstra
  0 siblings, 1 reply; 19+ messages in thread
From: Vincent Guittot @ 2014-01-30 12:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Daniel Lezcano, Paul Turner,
	Benjamin Segall, Steven Rostedt

On 28 January 2014 18:16, Peter Zijlstra <peterz@infradead.org> wrote:
> This patch merges idle_balance() and pre_schedule(), and pushes both
> of them into pick_next_task().
>
> Conceptually pre_schedule() and idle_balance() are rather similar;
> both are used to pull more work onto the current CPU.
>
> We cannot however first move idle_balance() into pre_schedule_fair()
> since there is no guarantee the last runnable task is a fair task, and
> thus we would miss newidle balances.
>
> Similarly, the dl and rt pre_schedule calls must be run before
> idle_balance() since their respective tasks have higher priority and
> it would not do to delay their execution by searching for less
> important tasks first.
>
> However, by noticing that pick_next_task() already traverses the
> sched_class hierarchy in the right order, we can get the right
> behaviour and do away with both calls.
>
> We must however change the special case optimization to also require
> that prev is of the fair_sched_class, otherwise we can miss doing a dl
> or rt pull where we needed one.
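
For what it's worth, the ordering argument can be illustrated with a
stripped-down user-space model of a class-ordered pick loop; the class
names, the fake rq and the prints below are invented for the sketch and
only stand in for the real dl/rt/fair classes:

#include <stdio.h>
#include <stddef.h>

struct toy_rq { int dl_nr, rt_nr, fair_nr; };

/* Each "class" pulls its own kind of work first, then tries to pick. */
static const char *toy_pick_dl(struct toy_rq *rq)
{
	printf("dl: pull deadline work (highest priority, runs first)\n");
	return rq->dl_nr ? "dl task" : NULL;
}

static const char *toy_pick_rt(struct toy_rq *rq)
{
	printf("rt: pull RT work (only reached if dl found nothing)\n");
	return rq->rt_nr ? "rt task" : NULL;
}

static const char *toy_pick_fair(struct toy_rq *rq)
{
	if (!rq->fair_nr)
		printf("fair: rq empty, newidle balance runs last of all\n");
	return rq->fair_nr ? "fair task" : NULL;
}

int main(void)
{
	const char *(*classes[])(struct toy_rq *) = {
		toy_pick_dl, toy_pick_rt, toy_pick_fair,
	};
	struct toy_rq rq = { 0, 0, 0 };
	const char *p = NULL;

	/* Walking the classes in priority order gives the pulls the same
	 * order for free; no separate pre-pass is needed in this model. */
	for (size_t i = 0; i < sizeof(classes) / sizeof(classes[0]) && !p; i++)
		p = classes[i](&rq);

	printf("picked: %s\n", p ? p : "idle task");
	return 0;
}

Because the array is walked from the highest-priority class down, the dl
pull runs first, the rt pull only if dl had nothing, and the newidle
balance only when everything above came up empty, which is exactly the
ordering the changelog asks for.
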
>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> ---
>  kernel/sched/core.c      |   26 ++------------------------
>  kernel/sched/deadline.c  |   15 +++++++--------
>  kernel/sched/fair.c      |   24 ++++++++++++++++++++----
>  kernel/sched/idle_task.c |   12 +++++-------
>  kernel/sched/rt.c        |   16 ++++++++--------
>  kernel/sched/sched.h     |    1 -
>  6 files changed, 42 insertions(+), 52 deletions(-)
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2146,13 +2146,6 @@ static void finish_task_switch(struct rq
>
>  #ifdef CONFIG_SMP
>
> -/* assumes rq->lock is held */
> -static inline void pre_schedule(struct rq *rq, struct task_struct *prev)
> -{
> -       if (prev->sched_class->pre_schedule)
> -               prev->sched_class->pre_schedule(rq, prev);
> -}
> -
>  /* rq->lock is NOT held, but preemption is disabled */
>  static inline void post_schedule(struct rq *rq)
>  {
> @@ -2170,10 +2163,6 @@ static inline void post_schedule(struct
>
>  #else
>
> -static inline void pre_schedule(struct rq *rq, struct task_struct *p)
> -{
> -}
> -
>  static inline void post_schedule(struct rq *rq)
>  {
>  }
> @@ -2569,7 +2558,8 @@ pick_next_task(struct rq *rq, struct tas
>          * Optimization: we know that if all tasks are in
>          * the fair class we can call that function directly:
>          */
> -       if (likely(rq->nr_running == rq->cfs.h_nr_running)) {
> +       if (likely(prev->sched_class == &fair_sched_class &&
> +                  rq->nr_running == rq->cfs.h_nr_running)) {
>                 p = fair_sched_class.pick_next_task(rq, prev);
>                 if (likely(p))
>                         return p;
> @@ -2672,18 +2662,6 @@ static void __sched __schedule(void)
>                 switch_count = &prev->nvcsw;
>         }
>
> -       pre_schedule(rq, prev);
> -
> -       if (unlikely(!rq->nr_running)) {
> -               /*
> -                * We must set idle_stamp _before_ calling idle_balance(), such
> -                * that we measure the duration of idle_balance() as idle time.
> -                */
> -               rq->idle_stamp = rq_clock(rq);
> -               if (idle_balance(rq))
> -                       rq->idle_stamp = 0;
> -       }
> -
>         if (prev->on_rq || rq->skip_clock_update < 0)
>                 update_rq_clock(rq);
>
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -989,6 +989,8 @@ static struct sched_dl_entity *pick_next
>         return rb_entry(left, struct sched_dl_entity, rb_node);
>  }
>
> +static int pull_dl_task(struct rq *this_rq);
> +
>  struct task_struct *pick_next_task_dl(struct rq *rq, struct task_struct *prev)
>  {
>         struct sched_dl_entity *dl_se;
> @@ -997,6 +999,11 @@ struct task_struct *pick_next_task_dl(st
>
>         dl_rq = &rq->dl;
>
> +#ifdef CONFIG_SMP
> +       if (dl_task(prev))
> +               pull_dl_task(rq);
> +#endif
> +
>         if (unlikely(!dl_rq->dl_nr_running))
>                 return NULL;
>
> @@ -1427,13 +1434,6 @@ static int pull_dl_task(struct rq *this_
>         return ret;
>  }
>
> -static void pre_schedule_dl(struct rq *rq, struct task_struct *prev)
> -{
> -       /* Try to pull other tasks here */
> -       if (dl_task(prev))
> -               pull_dl_task(rq);
> -}
> -
>  static void post_schedule_dl(struct rq *rq)
>  {
>         push_dl_tasks(rq);
> @@ -1626,7 +1626,6 @@ const struct sched_class dl_sched_class
>         .set_cpus_allowed       = set_cpus_allowed_dl,
>         .rq_online              = rq_online_dl,
>         .rq_offline             = rq_offline_dl,
> -       .pre_schedule           = pre_schedule_dl,
>         .post_schedule          = post_schedule_dl,
>         .task_woken             = task_woken_dl,
>  #endif
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2581,7 +2581,8 @@ void idle_exit_fair(struct rq *this_rq)
>         update_rq_runnable_avg(this_rq, 0);
>  }
>
> -#else
> +#else /* CONFIG_SMP */
> +
>  static inline void update_entity_load_avg(struct sched_entity *se,
>                                           int update_cfs_rq) {}
>  static inline void update_rq_runnable_avg(struct rq *rq, int runnable) {}
> @@ -2593,7 +2594,7 @@ static inline void dequeue_entity_load_a
>                                            int sleep) {}
>  static inline void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq,
>                                               int force_update) {}
> -#endif
> +#endif /* CONFIG_SMP */
>
>  static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
> @@ -4686,9 +4687,10 @@ pick_next_task_fair(struct rq *rq, struc
>         struct sched_entity *se;
>         struct task_struct *p;
>
> +again:
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>         if (!cfs_rq->nr_running)
> -               return NULL;
> +               goto idle;
>
>         if (prev->sched_class != &fair_sched_class)
>                 goto simple;
> @@ -4764,7 +4766,7 @@ pick_next_task_fair(struct rq *rq, struc
>  #endif
>
>         if (!cfs_rq->nr_running)
> -               return NULL;
> +               goto idle;
>
>         prev->sched_class->put_prev_task(rq, prev);
>
> @@ -4780,6 +4782,20 @@ pick_next_task_fair(struct rq *rq, struc
>                 hrtick_start_fair(rq, p);
>
>         return p;
> +
> +idle:
> +       idle_exit_fair(rq);

It should be idle_enter_fair.

We want to update the statistics with the running time of classes other
than CFS.

The use case is:

exit idle
put_prev_task_idle
--> idle_exit_fair (account elapsed idle time)
pick_next_task other than fair tasks
switch between "other than fair" tasks
...
no more "other than fair" tasks to schedule
pick_next_task_fair
--> no fair task on the rq
--> jump to simple
--> idle_enter_fair (account elapsed running time of other class
before trying to pull fair task from other CPUs)
--> idle_balance()
...

Vincent
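
To make the pairing above concrete, here is a toy user-space model of
that accounting; the plain counters and the fake clock are stand-ins
invented for the sketch (the real hooks feed the rq's runnable average
rather than raw counters):

#include <stdio.h>

static unsigned long now;		/* fake clock, arbitrary ticks */
static unsigned long last_stamp;
static unsigned long idle_time, busy_time;

/* ~ idle_enter_fair(): the CPU is about to idle, close the busy window */
static void toy_idle_enter(void)
{
	busy_time += now - last_stamp;
	last_stamp = now;
}

/* ~ idle_exit_fair(): the CPU leaves idle, close the idle window */
static void toy_idle_exit(void)
{
	idle_time += now - last_stamp;
	last_stamp = now;
}

int main(void)
{
	now = 10; toy_idle_enter();	/* nothing runnable, going idle */
	now = 25; toy_idle_exit();	/* a task woke us up */
	now = 40; toy_idle_enter();	/* rq empty again, before idle_balance() */

	/* Calling the exit hook at t=40 instead would credit those 15 busy
	 * ticks of other-class running time to the idle side instead. */
	printf("idle=%lu busy=%lu\n", idle_time, busy_time);	/* idle=15 busy=25 */
	return 0;
}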

> +       /*
> +        * We must set idle_stamp _before_ calling idle_balance(), such that we
> +        * measure the duration of idle_balance() as idle time.
> +        */
> +       rq->idle_stamp = rq_clock(rq);
> +       if (idle_balance(rq)) { /* drops rq->lock */
> +               rq->idle_stamp = 0;
> +               goto again;
> +       }
> +
> +       return NULL;
>  }
>
>  /*
> --- a/kernel/sched/idle_task.c
> +++ b/kernel/sched/idle_task.c
> @@ -13,13 +13,8 @@ select_task_rq_idle(struct task_struct *
>  {
>         return task_cpu(p); /* IDLE tasks as never migrated */
>  }
> -
> -static void pre_schedule_idle(struct rq *rq, struct task_struct *prev)
> -{
> -       idle_exit_fair(rq);
> -       rq_last_tick_reset(rq);
> -}
>  #endif /* CONFIG_SMP */
> +
>  /*
>   * Idle tasks are unconditionally rescheduled:
>   */
> @@ -55,6 +50,10 @@ dequeue_task_idle(struct rq *rq, struct
>
>  static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
>  {
> +#ifdef CONFIG_SMP
> +       idle_exit_fair(rq);
> +       rq_last_tick_reset(rq);
> +#endif
>  }
>
>  static void task_tick_idle(struct rq *rq, struct task_struct *curr, int queued)
> @@ -98,7 +97,6 @@ const struct sched_class idle_sched_clas
>
>  #ifdef CONFIG_SMP
>         .select_task_rq         = select_task_rq_idle,
> -       .pre_schedule           = pre_schedule_idle,
>  #endif
>
>         .set_curr_task          = set_curr_task_idle,
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1324,12 +1324,20 @@ static struct task_struct *_pick_next_ta
>         return p;
>  }
>
> +static int pull_rt_task(struct rq *this_rq);
> +
>  static struct task_struct *
>  pick_next_task_rt(struct rq *rq, struct task_struct *prev)
>  {
>         struct task_struct *p;
>         struct rt_rq *rt_rq = &rq->rt;
>
> +#ifdef CONFIG_SMP
> +       /* Try to pull RT tasks here if we lower this rq's prio */
> +       if (rq->rt.highest_prio.curr > prev->prio)
> +               pull_rt_task(rq);
> +#endif
> +
>         if (!rt_rq->rt_nr_running)
>                 return NULL;
>
> @@ -1720,13 +1728,6 @@ static int pull_rt_task(struct rq *this_
>         return ret;
>  }
>
> -static void pre_schedule_rt(struct rq *rq, struct task_struct *prev)
> -{
> -       /* Try to pull RT tasks here if we lower this rq's prio */
> -       if (rq->rt.highest_prio.curr > prev->prio)
> -               pull_rt_task(rq);
> -}
> -
>  static void post_schedule_rt(struct rq *rq)
>  {
>         push_rt_tasks(rq);
> @@ -2003,7 +2004,6 @@ const struct sched_class rt_sched_class
>         .set_cpus_allowed       = set_cpus_allowed_rt,
>         .rq_online              = rq_online_rt,
>         .rq_offline             = rq_offline_rt,
> -       .pre_schedule           = pre_schedule_rt,
>         .post_schedule          = post_schedule_rt,
>         .task_woken             = task_woken_rt,
>         .switched_from          = switched_from_rt,
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1136,7 +1136,6 @@ struct sched_class {
>         int  (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
>         void (*migrate_task_rq)(struct task_struct *p, int next_cpu);
>
> -       void (*pre_schedule) (struct rq *this_rq, struct task_struct *task);
>         void (*post_schedule) (struct rq *this_rq);
>         void (*task_waking) (struct task_struct *task);
>         void (*task_woken) (struct rq *this_rq, struct task_struct *task);
>
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 7/9] sched/fair: Optimize cgroup pick_next_task_fair
  2014-01-30 12:37     ` Peter Zijlstra
@ 2014-01-30 12:56       ` Vincent Guittot
  0 siblings, 0 replies; 19+ messages in thread
From: Vincent Guittot @ 2014-01-30 12:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Daniel Lezcano, Paul Turner,
	Benjamin Segall, Steven Rostedt, Mike Galbraith

On 30 January 2014 13:37, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Jan 30, 2014 at 01:18:09PM +0100, Vincent Guittot wrote:
>> On 28 January 2014 18:16, Peter Zijlstra <peterz@infradead.org> wrote:
>>
>> [snip]
>>
>> >
>> > @@ -4662,9 +4682,86 @@ static void check_preempt_wakeup(struct
>> >  static struct task_struct *
>> >  pick_next_task_fair(struct rq *rq, struct task_struct *prev)
>> >  {
>> > -       struct task_struct *p;
>> >         struct cfs_rq *cfs_rq = &rq->cfs;
>> >         struct sched_entity *se;
>> > +       struct task_struct *p;
>> > +
>> > +#ifdef CONFIG_FAIR_GROUP_SCHED
>> > +       if (!cfs_rq->nr_running)
>> > +               return NULL;
>>
>> Couldn't you move the test above out of the CONFIG_FAIR_GROUP_SCHED block
>> and remove the duplicate test that is done after the 'simple' label?
>
> No, we have to check it twice because..
>>
>> > +
>> > +       if (prev->sched_class != &fair_sched_class)
>> > +               goto simple;
>> > +
>> > +       /*
>> > +        * Because of the set_next_buddy() in dequeue_task_fair() it is rather
>> > +        * likely that a next task is from the same cgroup as the current.
>> > +        *
>> > +        * Therefore attempt to avoid putting and setting the entire cgroup
>> > +        * hierarchy, only change the part that actually changes.
>> > +        */
>> > +
>> > +       do {
>> > +               struct sched_entity *curr = cfs_rq->curr;
>> > +
>> > +               /*
>> > +                * Since we got here without doing put_prev_entity() we also
>> > +                * have to consider cfs_rq->curr. If it is still a runnable
>> > +                * entity, update_curr() will update its vruntime, otherwise
>> > +                * forget we've ever seen it.
>> > +                */
>> > +               if (curr && curr->on_rq)
>> > +                       update_curr(cfs_rq);
>> > +               else
>> > +                       curr = NULL;
>> > +
>> > +               /*
>> > +                * This call to check_cfs_rq_runtime() will do the throttle and
>> > +                * dequeue its entity in the parent(s). Therefore the 'simple'
>> > +                * nr_running test will indeed be correct.
>> > +                */
>> > +               if (unlikely(check_cfs_rq_runtime(cfs_rq)))
>> > +                       goto simple;
>
> ... here: if you read the comment above, we could have modified
> nr_running.

Ok, I missed this point.

>
>> > +               se = pick_next_entity(cfs_rq, curr);
>> > +               cfs_rq = group_cfs_rq(se);
>> > +       } while (cfs_rq);
>> > +
>> > +       p = task_of(se);
>> > +
>> > +       /*
>> > +        * Since we haven't yet done put_prev_entity and if the selected task
>> > +        * is a different task than we started out with, try and touch the
>> > +        * least amount of cfs_rqs.
>> > +        */
>> > +       if (prev != p) {
>> > +               struct sched_entity *pse = &prev->se;
>> > +
>> > +               while (!(cfs_rq = is_same_group(se, pse))) {
>> > +                       int se_depth = se->depth;
>> > +                       int pse_depth = pse->depth;
>> > +
>> > +                       if (se_depth <= pse_depth) {
>> > +                               put_prev_entity(cfs_rq_of(pse), pse);
>> > +                               pse = parent_entity(pse);
>> > +                       }
>> > +                       if (se_depth >= pse_depth) {
>> > +                               set_next_entity(cfs_rq_of(se), se);
>> > +                               se = parent_entity(se);
>> > +                       }
>> > +               }
>> > +
>> > +               put_prev_entity(cfs_rq, pse);
>> > +               set_next_entity(cfs_rq, se);
>> > +       }
>> > +
>> > +       if (hrtick_enabled(rq))
>> > +               hrtick_start_fair(rq, p);
>> > +
>> > +       return p;
>> > +simple:
>> > +       cfs_rq = &rq->cfs;
>> > +#endif
>> >
>> >         if (!cfs_rq->nr_running)
>> >                 return NULL;
>
> And therefore this test needs to stay.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 9/9] sched: Push down pre_schedule() and idle_balance()
  2014-01-30 12:45   ` Vincent Guittot
@ 2014-01-30 15:22     ` Peter Zijlstra
  0 siblings, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2014-01-30 15:22 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, Ingo Molnar, Daniel Lezcano, Paul Turner,
	Benjamin Segall, Steven Rostedt

On Thu, Jan 30, 2014 at 01:45:07PM +0100, Vincent Guittot wrote:
> On 28 January 2014 18:16, Peter Zijlstra <peterz@infradead.org> wrote:
> > +idle:
> > +       idle_exit_fair(rq);
> 
> It should be idle_enter_fair.
> 
> We want to update the statistics with the running time of classes other
> than CFS.
> 
> The use case is:
> 
> exit idle
> put_prev_task_idle
> --> idle_exit_fair (account elapsed idle time)
> pick_next_task other than fair tasks
> switch between "other than fair" tasks
> ...
> no more "other than fair" tasks to schedule
> pick_next_task_fair
> --> no fair task on the rq
> --> jump to simple
> --> idle_enter_fair (account elapsed running time of other class
> before trying to pull fair task from other CPUs)
> --> idle_balance()
> ...

Right.. terminal confusion on this topic it seems. Fixed it up.

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2014-01-30 15:22 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-01-28 17:16 [PATCH 0/9] Various sched patches -v2 Peter Zijlstra
2014-01-28 17:16 ` [PATCH 1/9] sched: Remove cpu parameter for idle_balance() Peter Zijlstra
2014-01-28 17:16 ` [PATCH 2/9] sched: Fix race in idle_balance() Peter Zijlstra
2014-01-28 17:16 ` [PATCH 3/9] sched: Move idle_stamp up to the core Peter Zijlstra
2014-01-28 17:16 ` [PATCH 4/9] sched/fair: Track cgroup depth Peter Zijlstra
2014-01-28 17:16 ` [PATCH 5/9] sched: Push put_prev_task() into pick_next_task() Peter Zijlstra
2014-01-28 18:46   ` bsegall
2014-01-28 19:14     ` Peter Zijlstra
2014-01-28 17:16 ` [PATCH 6/9] sched/fair: Clean up __clear_buddies_* Peter Zijlstra
2014-01-28 17:16 ` [PATCH 7/9] sched/fair: Optimize cgroup pick_next_task_fair Peter Zijlstra
2014-01-30 12:18   ` Vincent Guittot
2014-01-30 12:37     ` Peter Zijlstra
2014-01-30 12:56       ` Vincent Guittot
2014-01-28 17:16 ` [PATCH 8/9] sched: Clean up idle task SMP logic Peter Zijlstra
2014-01-30 10:52   ` Vincent Guittot
2014-01-28 17:16 ` [PATCH 9/9] sched: Push down pre_schedule() and idle_balance() Peter Zijlstra
2014-01-30 12:45   ` Vincent Guittot
2014-01-30 15:22     ` Peter Zijlstra
2014-01-28 18:07 ` [PATCH 0/9] Various sched patches -v2 Steven Rostedt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).