* [PATCH] sched: Optimize pick_next_task for idle_sched_class too
@ 2017-01-19 15:17 Steven Rostedt
  2017-01-19 17:44 ` Peter Zijlstra
  0 siblings, 1 reply; 19+ messages in thread
From: Steven Rostedt @ 2017-01-19 15:17 UTC (permalink / raw)
  To: LKML; +Cc: Peter Zijlstra, Ingo Molnar, Andrew Morton

When running my likely/unlikely profiler, I noticed that the
SCHED_DEADLINE's pick_next_task_dl() unlikely case of
(!dl_rq->dl_nr_running) was always being hit. There are two cases where
this can happen.

First, there's an optimization in pick_next_task() for the likely case
that the only tasks running on the run queue are SCHED_OTHER tasks. In
a normal system, this is the case most of the time. When this is true,
only the pick_next_task() of the fair_sched_class is called. If an RT or
DEADLINE task is queued, then the pick_next_task()s of the other
sched classes are called in sequence.

The SCHED_DEADLINE pick_next_task() is called first, and that
unlikely() case is hit if there are no deadline tasks available. This
happens when an RT task is queued (the first case). But tracing
revealed that it also happens in another very common case: when the
system goes from idle to running any task, including SCHED_OTHER. This
is because the idle task has a different sched class than the
fair_sched_class.

The optimization has:

	if (prev->sched_class == fair_sched_class &&
	    rq->nr_running == rq->cfs.h_nr_running) {

When going from SCHED_OTHER to idle, this optimization is hit, because
the SCHED_OTHER task is of the fair_sched_class, and rq->nr_running and
rq->cfs.h_nr_running are both zero. But when we go from idle to
SCHED_OTHER, the first test fails: prev->sched_class is
idle_sched_class, which causes the pick_next_task() of both the
deadline and RT sched classes to be called unnecessarily.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 154fd68..e2c6d3b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3259,13 +3259,15 @@ static inline struct task_struct *
 pick_next_task(struct rq *rq, struct task_struct *prev, struct pin_cookie cookie)
 {
 	const struct sched_class *class = &fair_sched_class;
+	const struct sched_class *idle_class = &idle_sched_class;
 	struct task_struct *p;
 
 	/*
 	 * Optimization: we know that if all tasks are in
 	 * the fair class we can call that function directly:
 	 */
-	if (likely(prev->sched_class == class &&
+	if (likely((prev->sched_class == class ||
+		    prev->sched_class == idle_class) &&
 		   rq->nr_running == rq->cfs.h_nr_running)) {
 		p = fair_sched_class.pick_next_task(rq, prev, cookie);
 		if (unlikely(p == RETRY_TASK))


* Re: [PATCH] sched: Optimize pick_next_task for idle_sched_class too
  2017-01-19 15:17 [PATCH] sched: Optimize pick_next_task for idle_sched_class too Steven Rostedt
@ 2017-01-19 17:44 ` Peter Zijlstra
  2017-01-20 16:14   ` Steven Rostedt
                     ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Peter Zijlstra @ 2017-01-19 17:44 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: LKML, Ingo Molnar, Andrew Morton

On Thu, Jan 19, 2017 at 10:17:03AM -0500, Steven Rostedt wrote:
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 154fd68..e2c6d3b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3259,13 +3259,15 @@ static inline struct task_struct *
>  pick_next_task(struct rq *rq, struct task_struct *prev, struct pin_cookie cookie)
>  {
>  	const struct sched_class *class = &fair_sched_class;
> +	const struct sched_class *idle_class = &idle_sched_class;
>  	struct task_struct *p;
>  
>  	/*
>  	 * Optimization: we know that if all tasks are in
>  	 * the fair class we can call that function directly:
>  	 */
> -	if (likely(prev->sched_class == class &&
> +	if (likely((prev->sched_class == class ||
> +		    prev->sched_class == idle_class) &&
>  		   rq->nr_running == rq->cfs.h_nr_running)) {

OK, so I hate this patch because it makes the condition more complex,
and while staring at what it does for code generation I couldn't for the
life of me figure out why we care about prev->sched_class to begin with.

(we used to, but the current code not so much)

So I simply removed that entire clause, like below, and lo and behold,
the system booted...

Could you give it a spin to see if anything comes apart?


diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 49ce1cb..51ca21e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3321,15 +3321,14 @@ static inline void schedule_debug(struct task_struct *prev)
 static inline struct task_struct *
 pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
-	const struct sched_class *class = &fair_sched_class;
+	const struct sched_class *class;
 	struct task_struct *p;
 
 	/*
 	 * Optimization: we know that if all tasks are in
 	 * the fair class we can call that function directly:
 	 */
-	if (likely(prev->sched_class == class &&
-		   rq->nr_running == rq->cfs.h_nr_running)) {
+	if (likely(rq->nr_running == rq->cfs.h_nr_running)) {
 		p = fair_sched_class.pick_next_task(rq, prev, rf);
 		if (unlikely(p == RETRY_TASK))
 			goto again;


* Re: [PATCH] sched: Optimize pick_next_task for idle_sched_class too
  2017-01-19 17:44 ` Peter Zijlstra
@ 2017-01-20 16:14   ` Steven Rostedt
  2017-01-20 16:16     ` Steven Rostedt
  2017-01-20 16:52     ` Peter Zijlstra
  2017-01-30 11:54   ` [tip:sched/core] sched/core: Optimize pick_next_task() for idle_sched_class tip-bot for Peter Zijlstra
  2017-02-23 10:34   ` [PATCH] sched: Optimize pick_next_task for idle_sched_class too Pavan Kondeti
  2 siblings, 2 replies; 19+ messages in thread
From: Steven Rostedt @ 2017-01-20 16:14 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: LKML, Ingo Molnar, Andrew Morton

On Thu, 19 Jan 2017 18:44:08 +0100
Peter Zijlstra <peterz@infradead.org> wrote:
> > -	if (likely(prev->sched_class == class &&
> > +	if (likely((prev->sched_class == class ||
> > +		    prev->sched_class == idle_class) &&
> >  		   rq->nr_running == rq->cfs.h_nr_running)) {  
> 
> OK, so I hate this patch because it makes the condition more complex,
> and while staring at what it does for code generation I couldn't for the
> life of me figure out why we care about prev->sched_class to begin with.

I was thinking it would save on checking the rq at all, but rq is used
by pick_next_task_*() anyway, so I doubt it's much of a savings.

> 
> (we used to, but the current code not so much)
> 
> So I simply removed that entire clause, like below, and lo and behold,
> the system booted...

I thought about doing this too, but decided against it because I was
thinking that since class (and now idle_class) are constants, it would
help the non-CFS case: I thought checking prev->sched_class against a
constant would be quick. But yeah, I doubt it matters much in the grand
scheme of things.


> 
> Could you give it a spin to see if anything comes apart?

Yeah this works. You can add:

Reported-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Tested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>

Thanks,

-- Steve

> 
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 49ce1cb..51ca21e 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3321,15 +3321,14 @@ static inline void schedule_debug(struct task_struct *prev)
>  static inline struct task_struct *
>  pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  {
> -	const struct sched_class *class = &fair_sched_class;
> +	const struct sched_class *class;
>  	struct task_struct *p;
>  
>  	/*
>  	 * Optimization: we know that if all tasks are in
>  	 * the fair class we can call that function directly:
>  	 */
> -	if (likely(prev->sched_class == class &&
> -		   rq->nr_running == rq->cfs.h_nr_running)) {
> +	if (likely(rq->nr_running == rq->cfs.h_nr_running)) {
>  		p = fair_sched_class.pick_next_task(rq, prev, rf);
>  		if (unlikely(p == RETRY_TASK))
>  			goto again;


* Re: [PATCH] sched: Optimize pick_next_task for idle_sched_class too
  2017-01-20 16:14   ` Steven Rostedt
@ 2017-01-20 16:16     ` Steven Rostedt
  2017-01-20 16:52     ` Peter Zijlstra
  1 sibling, 0 replies; 19+ messages in thread
From: Steven Rostedt @ 2017-01-20 16:16 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: LKML, Ingo Molnar, Andrew Morton

On Fri, 20 Jan 2017 11:14:25 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > 
> > Could you give it a spin to see if anything comes apart?  
> 
> Yeah this works. You can add:
> 
> Reported-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
> Tested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
> 

You can also add that my likely profiler on a normal box showed this
likely case going from 78% correct to 98% correct.

-- Steve


* Re: [PATCH] sched: Optimize pick_next_task for idle_sched_class too
  2017-01-20 16:14   ` Steven Rostedt
  2017-01-20 16:16     ` Steven Rostedt
@ 2017-01-20 16:52     ` Peter Zijlstra
  1 sibling, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2017-01-20 16:52 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: LKML, Ingo Molnar, Andrew Morton

On Fri, Jan 20, 2017 at 11:14:25AM -0500, Steven Rostedt wrote:
> > OK, so I hate this patch because it makes the condition more complex,
> > and while staring at what it does for code generation I couldn't for the
> > life of me figure out why we care about prev->sched_class to begin with.
> 
> I was thinking it would save on checking the rq at all, but rq is used
> by pick_next_task_*() anyway, so I doubt it's much savings.

fwiw, code generation adds something like:

	CMP imm32, reg
	JE imm8

So the overhead is not immense, something like 7-8 bytes, but still,
less is more :-)
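
For anyone curious where that pair comes from: a user-space stand-in
(an illustrative sketch only, not kernel code) produces the same shape
with gcc -O2 when the symbol's address fits in a sign-extended imm32,
as it does with the kernel code model:

	struct sched_class { int dummy; };
	extern const struct sched_class fair_sched_class;

	int is_fair(const struct sched_class *prev_class)
	{
		/* compiles to roughly: cmpq $fair_sched_class, %rdi ; sete %al */
		return prev_class == &fair_sched_class;
	}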


* [tip:sched/core] sched/core: Optimize pick_next_task() for idle_sched_class
  2017-01-19 17:44 ` Peter Zijlstra
  2017-01-20 16:14   ` Steven Rostedt
@ 2017-01-30 11:54   ` tip-bot for Peter Zijlstra
  2017-02-23 10:34   ` [PATCH] sched: Optimize pick_next_task for idle_sched_class too Pavan Kondeti
  2 siblings, 0 replies; 19+ messages in thread
From: tip-bot for Peter Zijlstra @ 2017-01-30 11:54 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: mingo, rostedt, linux-kernel, efault, torvalds, peterz, akpm, tglx, hpa

Commit-ID:  49ee576809d837442624ac18804b07943267cd57
Gitweb:     http://git.kernel.org/tip/49ee576809d837442624ac18804b07943267cd57
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Thu, 19 Jan 2017 18:44:08 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 30 Jan 2017 11:46:34 +0100

sched/core: Optimize pick_next_task() for idle_sched_class

Steve noticed that when we switch from IDLE to SCHED_OTHER we fail to
take the shortcut, even though all runnable tasks are of the fair
class, because prev->sched_class != &fair_sched_class.

Since I reworked the put_prev_task() stuff, we don't really care about
prev->class here, so removing that condition will allow this case.

This increases the likely case from 78% to 98% correct for Steve's
workload.

Reported-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Tested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170119174408.GN6485@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/core.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 49ce1cb..51ca21e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3321,15 +3321,14 @@ static inline void schedule_debug(struct task_struct *prev)
 static inline struct task_struct *
 pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
-	const struct sched_class *class = &fair_sched_class;
+	const struct sched_class *class;
 	struct task_struct *p;
 
 	/*
 	 * Optimization: we know that if all tasks are in
 	 * the fair class we can call that function directly:
 	 */
-	if (likely(prev->sched_class == class &&
-		   rq->nr_running == rq->cfs.h_nr_running)) {
+	if (likely(rq->nr_running == rq->cfs.h_nr_running)) {
 		p = fair_sched_class.pick_next_task(rq, prev, rf);
 		if (unlikely(p == RETRY_TASK))
 			goto again;


* Re: [PATCH] sched: Optimize pick_next_task for idle_sched_class too
  2017-01-19 17:44 ` Peter Zijlstra
  2017-01-20 16:14   ` Steven Rostedt
  2017-01-30 11:54   ` [tip:sched/core] sched/core: Optimize pick_next_task() for idle_sched_class tip-bot for Peter Zijlstra
@ 2017-02-23 10:34   ` Pavan Kondeti
  2017-02-23 13:54     ` Peter Zijlstra
  2 siblings, 1 reply; 19+ messages in thread
From: Pavan Kondeti @ 2017-02-23 10:34 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Steven Rostedt, LKML, Ingo Molnar, Andrew Morton

Hi Peter,

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 49ce1cb..51ca21e 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3321,15 +3321,14 @@ static inline void schedule_debug(struct task_struct *prev)
>  static inline struct task_struct *
>  pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  {
> -       const struct sched_class *class = &fair_sched_class;
> +       const struct sched_class *class;
>         struct task_struct *p;
>
>         /*
>          * Optimization: we know that if all tasks are in
>          * the fair class we can call that function directly:
>          */
> -       if (likely(prev->sched_class == class &&
> -                  rq->nr_running == rq->cfs.h_nr_running)) {
> +       if (likely(rq->nr_running == rq->cfs.h_nr_running)) {
>                 p = fair_sched_class.pick_next_task(rq, prev, rf);
>                 if (unlikely(p == RETRY_TASK))
>                         goto again;

Would this delay pulling RT tasks from other CPUs? Let's say this CPU
has 2 fair tasks and 1 RT task. The RT task is sleeping now. Earlier,
we would attempt to pull RT tasks from other CPUs in pick_next_task_rt(),
which is not done anymore.
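
(For reference, the pull in question sits at the top of
pick_next_task_rt(); a simplified sketch of the code of that era, with
the actual task selection elided:

	static struct task_struct *
	pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
	{
		/* try to pull if we are about to lower this rq's prio */
		if (need_pull_rt_task(rq, prev)) {
			rq_unpin_lock(rq, rf);
			pull_rt_task(rq);	/* may drop/retake rq->lock */
			rq_repin_lock(rq, rf);
			/* a DL or stop task may have slipped in */
			if (unlikely((rq->stop && task_on_rq_queued(rq->stop)) ||
				     rq->dl.dl_nr_running))
				return RETRY_TASK;
		}
		...
	}

When the fair-only shortcut is taken, none of this runs.)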

Thanks,
Pavan
-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a
Linux Foundation Collaborative Project


* Re: [PATCH] sched: Optimize pick_next_task for idle_sched_class too
  2017-02-23 10:34   ` [PATCH] sched: Optimize pick_next_task for idle_sched_class too Pavan Kondeti
@ 2017-02-23 13:54     ` Peter Zijlstra
  2017-02-23 15:15       ` Pavan Kondeti
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2017-02-23 13:54 UTC (permalink / raw)
  To: Pavan Kondeti; +Cc: Steven Rostedt, LKML, Ingo Molnar, Andrew Morton

On Thu, Feb 23, 2017 at 04:04:22PM +0530, Pavan Kondeti wrote:
> Hi Peter,
> 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 49ce1cb..51ca21e 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3321,15 +3321,14 @@ static inline void schedule_debug(struct task_struct *prev)
> >  static inline struct task_struct *
> >  pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> >  {
> > -       const struct sched_class *class = &fair_sched_class;
> > +       const struct sched_class *class;
> >         struct task_struct *p;
> >
> >         /*
> >          * Optimization: we know that if all tasks are in
> >          * the fair class we can call that function directly:
> >          */
> > -       if (likely(prev->sched_class == class &&
> > -                  rq->nr_running == rq->cfs.h_nr_running)) {
> > +       if (likely(rq->nr_running == rq->cfs.h_nr_running)) {
> >                 p = fair_sched_class.pick_next_task(rq, prev, rf);
> >                 if (unlikely(p == RETRY_TASK))
> >                         goto again;
> 
> Would this delay pulling RT tasks from other CPUs? Let's say this CPU
> has 2 fair tasks and 1 RT task. The RT task is sleeping now. Earlier,
> we would attempt to pull RT tasks from other CPUs in pick_next_task_rt(),
> which is not done anymore.

It should not; the two places of interest are when we leave the RT
class to run anything lower (fair, idle), at which point we'll pull,
or when an RT task wakes up, at which point it'll push.


* Re: [PATCH] sched: Optimize pick_next_task for idle_sched_class too
  2017-02-23 13:54     ` Peter Zijlstra
@ 2017-02-23 15:15       ` Pavan Kondeti
  2017-02-23 15:25         ` Peter Zijlstra
  2017-03-01 15:53         ` Steven Rostedt
  0 siblings, 2 replies; 19+ messages in thread
From: Pavan Kondeti @ 2017-02-23 15:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Pavan Kondeti, Steven Rostedt, LKML, Ingo Molnar, Andrew Morton

Hi Peter,

On Thu, Feb 23, 2017 at 7:24 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Feb 23, 2017 at 04:04:22PM +0530, Pavan Kondeti wrote:
>> Hi Peter,
>>
>> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> > index 49ce1cb..51ca21e 100644
>> > --- a/kernel/sched/core.c
>> > +++ b/kernel/sched/core.c
>> > @@ -3321,15 +3321,14 @@ static inline void schedule_debug(struct task_struct *prev)
>> >  static inline struct task_struct *
>> >  pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>> >  {
>> > -       const struct sched_class *class = &fair_sched_class;
>> > +       const struct sched_class *class;
>> >         struct task_struct *p;
>> >
>> >         /*
>> >          * Optimization: we know that if all tasks are in
>> >          * the fair class we can call that function directly:
>> >          */
>> > -       if (likely(prev->sched_class == class &&
>> > -                  rq->nr_running == rq->cfs.h_nr_running)) {
>> > +       if (likely(rq->nr_running == rq->cfs.h_nr_running)) {
>> >                 p = fair_sched_class.pick_next_task(rq, prev, rf);
>> >                 if (unlikely(p == RETRY_TASK))
>> >                         goto again;
>>
>> Would this delay pulling RT tasks from other CPUs? Let's say this CPU
>> has 2 fair tasks and 1 RT task. The RT task is sleeping now. Earlier,
>> we would attempt to pull RT tasks from other CPUs in pick_next_task_rt(),
>> which is not done anymore.
>
> It should not; the two places of interest are when we leave the RT
> class to run anything lower (fair, idle), at which point we'll pull,
> or when an RT task wakes up, at which point it'll push.

Can you kindly show me where we are pulling when an RT task goes to
sleep? Apart from a class/prio change, I see the pull happening only
from pick_next_task_rt().

Thanks,
Pavan

-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a
Linux Foundation Collaborative Project


* Re: [PATCH] sched: Optimize pick_next_task for idle_sched_class too
  2017-02-23 15:15       ` Pavan Kondeti
@ 2017-02-23 15:25         ` Peter Zijlstra
  2017-02-23 16:37           ` Peter Zijlstra
  2017-03-01 15:53         ` Steven Rostedt
  1 sibling, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2017-02-23 15:25 UTC (permalink / raw)
  To: Pavan Kondeti; +Cc: Steven Rostedt, LKML, Ingo Molnar, Andrew Morton

On Thu, Feb 23, 2017 at 08:45:06PM +0530, Pavan Kondeti wrote:
> Hi Peter,
> 
> On Thu, Feb 23, 2017 at 7:24 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Thu, Feb 23, 2017 at 04:04:22PM +0530, Pavan Kondeti wrote:
> >> Hi Peter,
> >>
> >> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> >> > index 49ce1cb..51ca21e 100644
> >> > --- a/kernel/sched/core.c
> >> > +++ b/kernel/sched/core.c
> >> > @@ -3321,15 +3321,14 @@ static inline void schedule_debug(struct task_struct *prev)
> >> >  static inline struct task_struct *
> >> >  pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> >> >  {
> >> > -       const struct sched_class *class = &fair_sched_class;
> >> > +       const struct sched_class *class;
> >> >         struct task_struct *p;
> >> >
> >> >         /*
> >> >          * Optimization: we know that if all tasks are in
> >> >          * the fair class we can call that function directly:
> >> >          */
> >> > -       if (likely(prev->sched_class == class &&
> >> > -                  rq->nr_running == rq->cfs.h_nr_running)) {
> >> > +       if (likely(rq->nr_running == rq->cfs.h_nr_running)) {
> >> >                 p = fair_sched_class.pick_next_task(rq, prev, rf);
> >> >                 if (unlikely(p == RETRY_TASK))
> >> >                         goto again;
> >>
> >> Would this delay pulling RT tasks from other CPUs? Let's say this CPU
> >> has 2 fair tasks and 1 RT task. The RT task is sleeping now. Earlier,
> >> we would attempt to pull RT tasks from other CPUs in pick_next_task_rt(),
> >> which is not done anymore.
> >
> > It should not; the two places of interest are when we leave the RT
> > class to run anything lower (fair, idle), at which point we'll pull,
> > or when an RT task wakes up, at which point it'll push.
> 
> Can you kindly show me where we are pulling when an RT task goes to
> sleep? Apart from a class/prio change, I see the pull happening only
> from pick_next_task_rt().

Ah, I read your question wrong. Yes, I think you're right, we now lose
the pull when the last RT task goes away.

Hmm.. how to fix that nicely..


* Re: [PATCH] sched: Optimize pick_next_task for idle_sched_class too
  2017-02-23 15:25         ` Peter Zijlstra
@ 2017-02-23 16:37           ` Peter Zijlstra
  2017-02-23 17:29             ` Pavan Kondeti
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2017-02-23 16:37 UTC (permalink / raw)
  To: Pavan Kondeti; +Cc: Steven Rostedt, LKML, Ingo Molnar, Andrew Morton

On Thu, Feb 23, 2017 at 04:25:33PM +0100, Peter Zijlstra wrote:
> 
> Ah, I read your question wrong. Yes, I think you're right, we now lose
> the pull when the last RT task goes away.
> 
> Hmm.. how to fix that nicely..

Something like so perhaps? This would make a pull happen when the last
RT task on this CPU goes away.

Steve?

---
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 9f3e40226dec..283d591078b0 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1336,6 +1336,9 @@ static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags)
 	dequeue_rt_entity(rt_se, flags);
 
 	dequeue_pushable_task(rq, p);
+
+	if (!rq->rt.rt_nr_running)
+		queue_pull_task(rq);
 }
 
 /*


* Re: [PATCH] sched: Optimize pick_next_task for idle_sched_class too
  2017-02-23 16:37           ` Peter Zijlstra
@ 2017-02-23 17:29             ` Pavan Kondeti
  2017-02-23 17:45               ` Peter Zijlstra
  0 siblings, 1 reply; 19+ messages in thread
From: Pavan Kondeti @ 2017-02-23 17:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Pavan Kondeti, Steven Rostedt, LKML, Ingo Molnar, Andrew Morton

On Thu, Feb 23, 2017 at 10:07 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Feb 23, 2017 at 04:25:33PM +0100, Peter Zijlstra wrote:
>>
>> Ah, I read your question wrong. Yes, I think you're right, we now lose
>> the pull when the last RT task goes away.
>>
>> Hmm.. how to fix that nicely..
>
> Something like so perhaps? This would make a pull happen when the last
> RT task on this CPU goes away.
>
> Steve?
>
> ---
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 9f3e40226dec..283d591078b0 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1336,6 +1336,9 @@ static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags)
>         dequeue_rt_entity(rt_se, flags);
>
>         dequeue_pushable_task(rq, p);
> +
> +       if (!rq->rt.rt_nr_running)
> +               queue_pull_task(rq);
>  }
>
>  /*

The next balance_callback() is not called until the context switch is
completed. So we potentially pick a lower-class task before the pull
happens. Would it be wrong to call pull_rt_task() directly instead of
queuing the callback?
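
(For context, the deferral works roughly like this -- a paraphrased
sketch of the queueing helper, and of where __schedule() runs the
callbacks, details elided:

	static inline void
	queue_balance_callback(struct rq *rq, struct callback_head *head,
			       void (*func)(struct rq *rq))
	{
		lockdep_assert_held(&rq->lock);

		if (unlikely(head->next))	/* already queued */
			return;

		head->func = (void (*)(struct callback_head *))func;
		head->next = rq->balance_callback;
		rq->balance_callback = head;
	}

and, at the end of __schedule():

	rq = context_switch(rq, prev, next, &rf);	/* switch first... */
	balance_callback(rq);				/* ...pull only after */

hence the window where a lower-class task can run before the pull.)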


-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a
Linux Foundation Collaborative Project


* Re: [PATCH] sched: Optimize pick_next_task for idle_sched_class too
  2017-02-23 17:29             ` Pavan Kondeti
@ 2017-02-23 17:45               ` Peter Zijlstra
  2017-02-23 17:54                 ` Peter Zijlstra
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2017-02-23 17:45 UTC (permalink / raw)
  To: Pavan Kondeti; +Cc: Steven Rostedt, LKML, Ingo Molnar, Andrew Morton

On Thu, Feb 23, 2017 at 10:59:15PM +0530, Pavan Kondeti wrote:
> On Thu, Feb 23, 2017 at 10:07 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Thu, Feb 23, 2017 at 04:25:33PM +0100, Peter Zijlstra wrote:
> >>
> >> Ah, I read your question wrong. Yes, I think you're right, we now lose
> >> the pull when the last RT task goes away.
> >>
> >> Hmm.. how to fix that nicely..
> >
> > Something like so perhaps? This would make a pull happen when the last
> > RT task on this CPU goes away.
> >
> > Steve?
> >
> > ---
> > diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> > index 9f3e40226dec..283d591078b0 100644
> > --- a/kernel/sched/rt.c
> > +++ b/kernel/sched/rt.c
> > @@ -1336,6 +1336,9 @@ static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags)
> >         dequeue_rt_entity(rt_se, flags);
> >
> >         dequeue_pushable_task(rq, p);
> > +
> > +       if (!rq->rt.rt_nr_running)
> > +               queue_pull_task(rq);
> >  }
> >
> >  /*
> 
> The next balance_callback() is not called until the context switch is
> completed. So we potentially pick a lower-class task before the pull
> happens. Would it be wrong to call pull_rt_task() directly instead of
> queuing the callback?

deactivate_task()...->dequeue_task_rt() cannot drop the rq->lock which
would be required to pull.

Hurm.. maybe we should do what Steve initially suggested. The
alternative is link order trickery, and I'm not sure we want to do that.


* Re: [PATCH] sched: Optimize pick_next_task for idle_sched_class too
  2017-02-23 17:45               ` Peter Zijlstra
@ 2017-02-23 17:54                 ` Peter Zijlstra
  2017-02-27 17:54                   ` Steven Rostedt
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2017-02-23 17:54 UTC (permalink / raw)
  To: Pavan Kondeti; +Cc: Steven Rostedt, LKML, Ingo Molnar, Andrew Morton

On Thu, Feb 23, 2017 at 06:45:05PM +0100, Peter Zijlstra wrote:
> Hurm.. maybe we should do what Steve initially suggested. The
> alternative is link order trickery, and I'm not sure we want to do that.

That is, given:

kernel/sched/Makefile: obj-y += idle_task.o fair.o rt.o deadline.o stop_task.o

results in:

readelf -s defconfig-build/vmlinux | awk '/sched_class/ {print $2 " " $8}' | sort -n
00000000602c93c0 idle_sched_class
00000000602c9480 fair_sched_class
00000000602c9580 rt_sched_class
00000000602c96c0 dl_sched_class
00000000602c97c0 stop_sched_class

we can do this, but yuck!

---
 kernel/sched/core.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8f972df76eb2..eebe6729ceb7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3285,10 +3285,16 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	struct task_struct *p;
 
 	/*
-	 * Optimization: we know that if all tasks are in
-	 * the fair class we can call that function directly:
+	 * Optimization: we know that if all tasks are in the fair class we can
+	 * call that function directly, but only if the @prev task wasn't of a
+	 * higher scheduling class, because otherwise those lose the
+	 * opportunity to pull in more work from other CPUs.
+	 *
+	 * Depends on link order in kernel/sched/Makefile.
 	 */
-	if (likely(rq->nr_running == rq->cfs.h_nr_running)) {
+	if (likely(rq->nr_running == rq->cfs.h_nr_running &&
+		   prev->sched_class <= &fair_sched_class)) {
+
 		p = fair_sched_class.pick_next_task(rq, prev, rf);
 		if (unlikely(p == RETRY_TASK))
 			goto again;
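
(Ordered comparison of pointers into distinct objects is technically
undefined in ISO C, which is part of the yuck.) If we did rely on link
order, a boot-time assertion -- a sketch only; the exact condition and
placement, e.g. in sched_init(), are assumptions -- could at least
catch a Makefile reshuffle:

	BUG_ON(!(&idle_sched_class < &fair_sched_class &&
		 &fair_sched_class < &rt_sched_class &&
		 &rt_sched_class   < &dl_sched_class &&
		 &dl_sched_class   < &stop_sched_class));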


* Re: [PATCH] sched: Optimize pick_next_task for idle_sched_class too
  2017-02-23 17:54                 ` Peter Zijlstra
@ 2017-02-27 17:54                   ` Steven Rostedt
  0 siblings, 0 replies; 19+ messages in thread
From: Steven Rostedt @ 2017-02-27 17:54 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Pavan Kondeti, LKML, Ingo Molnar, Andrew Morton


Sorry for the late reply. Just got back from traveling.

On Thu, 23 Feb 2017 18:54:38 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, Feb 23, 2017 at 06:45:05PM +0100, Peter Zijlstra wrote:
> > Hurm.. maybe we should do what Steve initially suggested. The
> > alternative is link order trickery, and I'm not sure we want to do that.  
> 
> That is, given:
> 
> kernel/sched/Makefile: obj-y += idle_task.o fair.o rt.o deadline.o stop_task.o
> 
> results in:
> 
> readelf -s defconfig-build/vmlinux | awk '/sched_class/ {print $2 " " $8}' | sort -n
> 00000000602c93c0 idle_sched_class
> 00000000602c9480 fair_sched_class
> 00000000602c9580 rt_sched_class
> 00000000602c96c0 dl_sched_class
> 00000000602c97c0 stop_sched_class
> 
> we can do this, but yuck!
> 
> ---
>  kernel/sched/core.c | 12 +++++++++---
>  1 file changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 8f972df76eb2..eebe6729ceb7 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3285,10 +3285,16 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  	struct task_struct *p;
>  
>  	/*
> -	 * Optimization: we know that if all tasks are in
> -	 * the fair class we can call that function directly:
> +	 * Optimization: we know that if all tasks are in the fair class we can
> +	 * call that function directly, but only if the @prev task wasn't of a
> +	 * higher scheduling class, because otherwise those lose the
> +	 * opportunity to pull in more work from other CPUs.
> +	 *
> +	 * Depends on link order in kernel/sched/Makefile.
>  	 */
> -	if (likely(rq->nr_running == rq->cfs.h_nr_running)) {
> +	if (likely(rq->nr_running == rq->cfs.h_nr_running &&
> +		   prev->sched_class <= &fair_sched_class)) {

If we go this route, I would suggest that we hardcode the classes in
vmlinux.lds.h.
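
Something like this could pin the order explicitly (a rough sketch with
made-up section names; each sched_class definition would then be placed
into its named section via __attribute__((section(...)))):

	/* include/asm-generic/vmlinux.lds.h -- sketch, names are assumptions */
	#define SCHED_DATA				\
		__begin_sched_classes = .;		\
		*(__idle_sched_class)			\
		*(__fair_sched_class)			\
		*(__rt_sched_class)			\
		*(__dl_sched_class)			\
		*(__stop_sched_class)			\
		__end_sched_classes = .;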

-- Steve

> +
>  		p = fair_sched_class.pick_next_task(rq, prev, rf);
>  		if (unlikely(p == RETRY_TASK))
>  			goto again;


* Re: [PATCH] sched: Optimize pick_next_task for idle_sched_class too
  2017-02-23 15:15       ` Pavan Kondeti
  2017-02-23 15:25         ` Peter Zijlstra
@ 2017-03-01 15:53         ` Steven Rostedt
  2017-03-01 16:03           ` Peter Zijlstra
  1 sibling, 1 reply; 19+ messages in thread
From: Steven Rostedt @ 2017-03-01 15:53 UTC (permalink / raw)
  To: Pavan Kondeti; +Cc: Peter Zijlstra, LKML, Ingo Molnar, Andrew Morton

On Thu, 23 Feb 2017 20:45:06 +0530
Pavan Kondeti <pkondeti@codeaurora.org> wrote:

> Hi Peter,
> 
> On Thu, Feb 23, 2017 at 7:24 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Thu, Feb 23, 2017 at 04:04:22PM +0530, Pavan Kondeti wrote:  
> >> Hi Peter,
> >>  
> >> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> >> > index 49ce1cb..51ca21e 100644
> >> > --- a/kernel/sched/core.c
> >> > +++ b/kernel/sched/core.c
> >> > @@ -3321,15 +3321,14 @@ static inline void schedule_debug(struct task_struct *prev)
> >> >  static inline struct task_struct *
> >> >  pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> >> >  {
> >> > -       const struct sched_class *class = &fair_sched_class;
> >> > +       const struct sched_class *class;
> >> >         struct task_struct *p;
> >> >
> >> >         /*
> >> >          * Optimization: we know that if all tasks are in
> >> >          * the fair class we can call that function directly:
> >> >          */
> >> > -       if (likely(prev->sched_class == class &&
> >> > -                  rq->nr_running == rq->cfs.h_nr_running)) {
> >> > +       if (likely(rq->nr_running == rq->cfs.h_nr_running)) {
> >> >                 p = fair_sched_class.pick_next_task(rq, prev, rf);
> >> >                 if (unlikely(p == RETRY_TASK))
> >> >                         goto again;  
> >>
> >> Would this delay pulling RT tasks from other CPUs? Let's say this CPU
> >> has 2 fair tasks and 1 RT task. The RT task is sleeping now. Earlier,
> >> we would attempt to pull RT tasks from other CPUs in pick_next_task_rt(),
> >> which is not done anymore.
> >
> > It should not; the two places of interest are when we leave the RT
> > class to run anything lower (fair, idle), at which point we'll pull,
> > or when an RT task wakes up, at which point it'll push.
> 
> Can you kindly show me where we are pulling when an RT task goes to
> sleep? Apart from a class/prio change, I see the pull happening only
> from pick_next_task_rt().

Thanks for pointing this out. I was just doing some tests with my
migrate program and it was failing dramatically. Looking at why, it
appeared to be missing pushes. Putting my old patch back in fixed
it up.

Peter, do we have a solution for this yet? Are you going to add the one
with the linker magic?

-- Steve


* Re: [PATCH] sched: Optimize pick_next_task for idle_sched_class too
  2017-03-01 15:53         ` Steven Rostedt
@ 2017-03-01 16:03           ` Peter Zijlstra
  2017-03-01 16:19             ` Steven Rostedt
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2017-03-01 16:03 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: Pavan Kondeti, LKML, Ingo Molnar, Andrew Morton

On Wed, Mar 01, 2017 at 10:53:03AM -0500, Steven Rostedt wrote:
> Peter, do we have a solution for this yet? Are you going to add the one
> with the linker magic?

I queued the below earlier today.

---
Subject: sched: Fix pick_next_task() for RT,DL
From: Peter Zijlstra <peterz@infradead.org>
Date: Wed Mar  1 10:51:47 CET 2017

Pavan noticed that commit 49ee576809d8 ("sched/core: Optimize
pick_next_task() for idle_sched_class") broke RT,DL balancing by
robbing them of the opportunity to do new-'idle' balancing when their
last runnable task (on that runqueue) goes away.

Cc: Steven Rostedt <rostedt@goodmis.org>
Reported-by: Pavan Kondeti <pkondeti@codeaurora.org>
Fixes: 49ee576809d8 ("sched/core: Optimize pick_next_task() for idle_sched_class")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3273,10 +3273,15 @@ pick_next_task(struct rq *rq, struct tas
 	struct task_struct *p;
 
 	/*
-	 * Optimization: we know that if all tasks are in
-	 * the fair class we can call that function directly:
+	 * Optimization: we know that if all tasks are in the fair class we can
+	 * call that function directly, but only if the @prev task wasn't of a
+	 * higher scheduling class, because otherwise those lose the
+	 * opportunity to pull in more work from other CPUs.
 	 */
-	if (likely(rq->nr_running == rq->cfs.h_nr_running)) {
+	if (likely((prev->sched_class == &idle_sched_class ||
+		    prev->sched_class == &fair_sched_class) &&
+		   rq->nr_running == rq->cfs.h_nr_running)) {
+
 		p = fair_sched_class.pick_next_task(rq, prev, rf);
 		if (unlikely(p == RETRY_TASK))
 			goto again;


* Re: [PATCH] sched: Optimize pick_next_task for idle_sched_class too
  2017-03-01 16:03           ` Peter Zijlstra
@ 2017-03-01 16:19             ` Steven Rostedt
  2017-03-01 16:22               ` Peter Zijlstra
  0 siblings, 1 reply; 19+ messages in thread
From: Steven Rostedt @ 2017-03-01 16:19 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Pavan Kondeti, LKML, Ingo Molnar, Andrew Morton

On Wed, 1 Mar 2017 17:03:52 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, Mar 01, 2017 at 10:53:03AM -0500, Steven Rostedt wrote:
> > Peter, do we have a solution for this yet? Are you going to add the one
> > with the linker magic?  
> 
> I queued the below earlier today.

Isn't this pretty much identical to the patch I sent you a month ago?

  http://lkml.kernel.org/r/20170119101703.2abeaeb6@gandalf.local.home

-- Steve

> 
> ---
> Subject: sched: Fix pick_next_task() for RT,DL
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Wed Mar  1 10:51:47 CET 2017
> 
> Pavan noticed that commit 49ee576809d8 ("sched/core: Optimize
> pick_next_task() for idle_sched_class") broke RT,DL balancing by
> robbing them of the opportunity to do new-'idle' balancing when their
> last runnable task (on that runqueue) goes away.
> 
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Reported-by: Pavan Kondeti <pkondeti@codeaurora.org>
> Fixes: 49ee576809d8 ("sched/core: Optimize pick_next_task() for idle_sched_class")
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3273,10 +3273,15 @@ pick_next_task(struct rq *rq, struct tas
>  	struct task_struct *p;
>  
>  	/*
> -	 * Optimization: we know that if all tasks are in
> -	 * the fair class we can call that function directly:
> +	 * Optimization: we know that if all tasks are in the fair class we can
> +	 * call that function directly, but only if the @prev task wasn't of a
> +	 * higher scheduling class, because otherwise those lose the
> +	 * opportunity to pull in more work from other CPUs.
>  	 */
> -	if (likely(rq->nr_running == rq->cfs.h_nr_running)) {
> +	if (likely((prev->sched_class == &idle_sched_class ||
> +		    prev->sched_class == &fair_sched_class) &&
> +		   rq->nr_running == rq->cfs.h_nr_running)) {
> +
>  		p = fair_sched_class.pick_next_task(rq, prev, rf);
>  		if (unlikely(p == RETRY_TASK))
>  			goto again;


* Re: [PATCH] sched: Optimize pick_next_task for idle_sched_class too
  2017-03-01 16:19             ` Steven Rostedt
@ 2017-03-01 16:22               ` Peter Zijlstra
  0 siblings, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2017-03-01 16:22 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: Pavan Kondeti, LKML, Ingo Molnar, Andrew Morton

On Wed, Mar 01, 2017 at 11:19:22AM -0500, Steven Rostedt wrote:
> On Wed, 1 Mar 2017 17:03:52 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Wed, Mar 01, 2017 at 10:53:03AM -0500, Steven Rostedt wrote:
> > > Peter, do we have a solution for this yet? Are you going to add the one
> > > with the linker magic?  
> > 
> > I queued the below earlier today.
> 
> Isn't this pretty much identical to the patch I sent you a month ago?
> 
>   http://lkml.kernel.org/r/20170119101703.2abeaeb6@gandalf.local.home

Yes, except this one has a comment explaining why we care about
prev->sched_class.

