linux-kernel.vger.kernel.org archive mirror
* [PATCH v4] sched: fix wrong rq's runnable_avg update with rt tasks
@ 2013-04-04 14:15 Vincent Guittot
  2013-04-09  8:50 ` Peter Zijlstra
  2013-04-09  8:55 ` Peter Zijlstra
  0 siblings, 2 replies; 9+ messages in thread
From: Vincent Guittot @ 2013-04-04 14:15 UTC (permalink / raw)
  To: linux-kernel, linaro-kernel, peterz, mingo, pjt, rostedt,
	fweisbec, efault
  Cc: Vincent Guittot

The current update of the rq's load can be erroneous when RT tasks are
involved.

The update of the load of a rq that becomes idle is done only if the avg_idle
is less than sysctl_sched_migration_cost. If RT tasks and short idle durations
alternate, the runnable_avg is not updated correctly and the time is accounted
as idle time when a CFS task wakes up.

A new idle_enter function is called when the next task is the idle task,
so the elapsed time is accounted as run time in the load of the rq,
whatever the average idle time is. The call to update_rq_runnable_avg is
removed from idle_balance.

When an RT task is scheduled on an idle CPU, the update of the rq's load is
not done when the rq exits the idle state, because the CFS functions are not
called. Then idle_balance, which is called just before entering the idle
task, updates the rq's load on the assumption that the elapsed time since
the last update was only running time.

As a consequence, the rq's load of a CPU that only runs a periodic RT task
stays close to LOAD_AVG_MAX, whatever the actual running duration of the RT
task is.

A new idle_exit function is called when the prev task is the idle task, so
the elapsed time is accounted as idle time in the rq's load.

Changes since V3:
- Remove dependency on CONFIG_FAIR_GROUP_SCHED
- Add a new idle_enter function and create a post_schedule callback for
  the idle class
- Remove the update_rq_runnable_avg call from idle_balance

Changes since V2:
- remove useless definition for UP platform
- rebased on top of Steven Rostedt's patches:
https://lkml.org/lkml/2013/2/12/558

Changes since V1:
- move code out of schedule function and create a pre_schedule callback for
  idle class instead.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c      |   23 +++++++++++++++++++++--
 kernel/sched/idle_task.c |   10 ++++++++++
 kernel/sched/sched.h     |   12 ++++++++++++
 3 files changed, 43 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0fcdbff..1851ca8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1562,6 +1562,27 @@ static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
 		se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
 	} /* migrations, e.g. sleep=0 leave decay_count == 0 */
 }
+
+/*
+ * Update the rq's load with the elapsed running time before entering
+ * idle. if the last scheduled task is not a CFS task, idle_enter will
+ * be the only way to update the runnable statistic.
+ */
+void idle_enter(struct rq *this_rq)
+{
+	update_rq_runnable_avg(this_rq, 1);
+}
+
+/*
+ * Update the rq's load with the elapsed idle time before a task is
+ * scheduled. if the newly scheduled task is not a CFS task, idle_exit will
+ * be the only way to update the runnable statistic.
+ */
+void idle_exit(struct rq *this_rq)
+{
+	update_rq_runnable_avg(this_rq, 0);
+}
+
 #else
 static inline void update_entity_load_avg(struct sched_entity *se,
 					  int update_cfs_rq) {}
@@ -5219,8 +5240,6 @@ void idle_balance(int this_cpu, struct rq *this_rq)
 	if (this_rq->avg_idle < sysctl_sched_migration_cost)
 		return;
 
-	update_rq_runnable_avg(this_rq, 1);
-
 	/*
 	 * Drop the rq->lock, but keep preempt disabled.
 	 */
diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
index 66b5220..0775261 100644
--- a/kernel/sched/idle_task.c
+++ b/kernel/sched/idle_task.c
@@ -14,8 +14,17 @@ select_task_rq_idle(struct task_struct *p, int sd_flag, int flags)
 	return task_cpu(p); /* IDLE tasks as never migrated */
 }
 
+static void pre_schedule_idle(struct rq *rq, struct task_struct *prev)
+{
+	/* Update rq's load with elapsed idle time */
+	idle_exit(rq);
+}
+
 static void post_schedule_idle(struct rq *rq)
 {
+	/* Update rq's load with elapsed running time */
+	idle_enter(rq);
+
 	idle_balance(smp_processor_id(), rq);
 }
 #endif /* CONFIG_SMP */
@@ -95,6 +104,7 @@ const struct sched_class idle_sched_class = {
 
 #ifdef CONFIG_SMP
 	.select_task_rq		= select_task_rq_idle,
+	.pre_schedule		= pre_schedule_idle,
 	.post_schedule		= post_schedule_idle,
 #endif
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index fc88644..ff4b029 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -878,6 +878,18 @@ extern const struct sched_class idle_sched_class;
 extern void trigger_load_balance(struct rq *rq, int cpu);
 extern void idle_balance(int this_cpu, struct rq *this_rq);
 
+/*
+ * Only depends on SMP, FAIR_GROUP_SCHED may be removed when runnable_avg
+ * becomes useful in lb
+ */
+#if defined(CONFIG_FAIR_GROUP_SCHED)
+extern void idle_enter(struct rq *this_rq);
+extern void idle_exit(struct rq *this_rq);
+#else
+static inline void idle_enter(struct rq *this_rq) {}
+static inline void idle_exit(struct rq *this_rq) {}
+#endif
+
 #else	/* CONFIG_SMP */
 
 static inline void idle_balance(int cpu, struct rq *rq)
-- 
1.7.9.5



* Re: [PATCH v4] sched: fix wrong rq's runnable_avg update with rt tasks
  2013-04-04 14:15 [PATCH v4] sched: fix wrong rq's runnable_avg update with rt tasks Vincent Guittot
@ 2013-04-09  8:50 ` Peter Zijlstra
  2013-04-09  9:06   ` Vincent Guittot
  2013-04-09  8:55 ` Peter Zijlstra
  1 sibling, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2013-04-09  8:50 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, linaro-kernel, mingo, pjt, rostedt, fweisbec, efault

On Thu, 2013-04-04 at 16:15 +0200, Vincent Guittot wrote:
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>  kernel/sched/fair.c      |   23 +++++++++++++++++++++--
>  kernel/sched/idle_task.c |   10 ++++++++++
>  kernel/sched/sched.h     |   12 ++++++++++++
>  3 files changed, 43 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0fcdbff..1851ca8 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1562,6 +1562,27 @@ static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
>                 se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
>         } /* migrations, e.g. sleep=0 leave decay_count == 0 */
>  }
> +
> +/*
> + * Update the rq's load with the elapsed running time before entering
> + * idle. if the last scheduled task is not a CFS task, idle_enter will
> + * be the only way to update the runnable statistic.
> + */
> +void idle_enter(struct rq *this_rq)
> +{
> +       update_rq_runnable_avg(this_rq, 1);
> +}
> +
> +/*
> + * Update the rq's load with the elapsed idle time before a task is
> + * scheduled. if the newly scheduled task is not a CFS task, idle_exit will
> + * be the only way to update the runnable statistic.
> + */
> +void idle_exit(struct rq *this_rq)
> +{
> +       update_rq_runnable_avg(this_rq, 0);
> +}

These seem like fairly unfortunate names to expose to the global
namespace, why not expose update_rq_runnable_avg() instead?




* Re: [PATCH v4] sched: fix wrong rq's runnable_avg update with rt tasks
  2013-04-04 14:15 [PATCH v4] sched: fix wrong rq's runnable_avg update with rt tasks Vincent Guittot
  2013-04-09  8:50 ` Peter Zijlstra
@ 2013-04-09  8:55 ` Peter Zijlstra
  2013-04-09 12:18   ` Vincent Guittot
  1 sibling, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2013-04-09  8:55 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, linaro-kernel, mingo, pjt, rostedt, fweisbec, efault

On Thu, 2013-04-04 at 16:15 +0200, Vincent Guittot wrote:
> Changes since V2:
> - remove useless definition for UP platform
> - rebased on top of Steven Rostedt's patches :
> https://lkml.org/lkml/2013/2/12/558

So what's the status of those patches? I still worry about the extra
context switch overhead for the high-frequency idle scenario.



* Re: [PATCH v4] sched: fix wrong rq's runnable_avg update with rt tasks
  2013-04-09  8:50 ` Peter Zijlstra
@ 2013-04-09  9:06   ` Vincent Guittot
  2013-04-10  7:26     ` Peter Zijlstra
  0 siblings, 1 reply; 9+ messages in thread
From: Vincent Guittot @ 2013-04-09  9:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linaro-kernel, Ingo Molnar, Paul Turner,
	Steven Rostedt, Frédéric Weisbecker, Mike Galbraith

On 9 April 2013 10:50, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, 2013-04-04 at 16:15 +0200, Vincent Guittot wrote:
>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>> ---
>>  kernel/sched/fair.c      |   23 +++++++++++++++++++++--
>>  kernel/sched/idle_task.c |   10 ++++++++++
>>  kernel/sched/sched.h     |   12 ++++++++++++
>>  3 files changed, 43 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 0fcdbff..1851ca8 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -1562,6 +1562,27 @@ static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
>>                 se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
>>         } /* migrations, e.g. sleep=0 leave decay_count == 0 */
>>  }
>> +
>> +/*
>> + * Update the rq's load with the elapsed running time before entering
>> + * idle. if the last scheduled task is not a CFS task, idle_enter will
>> + * be the only way to update the runnable statistic.
>> + */
>> +void idle_enter(struct rq *this_rq)
>> +{
>> +       update_rq_runnable_avg(this_rq, 1);
>> +}
>> +
>> +/*
>> + * Update the rq's load with the elapsed idle time before a task is
>> + * scheduled. if the newly scheduled task is not a CFS task, idle_exit will
>> + * be the only way to update the runnable statistic.
>> + */
>> +void idle_exit(struct rq *this_rq)
>> +{
>> +       update_rq_runnable_avg(this_rq, 0);
>> +}
>
> These seem like fairly unfortunate names to expose to the global
> namespace, why not expose update_rq_runnable_avg() instead?

Just to gather in one place all the CFS actions that should be done when
exiting idle, even if we only have update_rq_runnable_avg right now. I
have kept it distinct from idle_balance because this sequence can't
generate an extra context switch the way idle_balance can, and in the end
they would not be called at the same time.

>
>


* Re: [PATCH v4] sched: fix wrong rq's runnable_avg update with rt tasks
  2013-04-09  8:55 ` Peter Zijlstra
@ 2013-04-09 12:18   ` Vincent Guittot
  2013-04-09 13:16     ` Steven Rostedt
  0 siblings, 1 reply; 9+ messages in thread
From: Vincent Guittot @ 2013-04-09 12:18 UTC (permalink / raw)
  To: Peter Zijlstra, Steven Rostedt
  Cc: linux-kernel, linaro-kernel, Ingo Molnar, Paul Turner,
	Frédéric Weisbecker, Mike Galbraith

On 9 April 2013 10:55, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, 2013-04-04 at 16:15 +0200, Vincent Guittot wrote:
>> Changes since V2:
>> - remove useless definition for UP platform
>> - rebased on top of Steven Rostedt's patches :
>> https://lkml.org/lkml/2013/2/12/558
>
> So what's the status of those patches? I still worry about the extra
> context switch overhead for the high-frequency idle scenario.

I don't know. I have seen a reply from Ingo saying they were pulled, but
I can't find the commits in the tip tree.

Steve, have you got more info about the status of your patches?

Vincent

>


* Re: [PATCH v4] sched: fix wrong rq's runnable_avg update with rt tasks
  2013-04-09 12:18   ` Vincent Guittot
@ 2013-04-09 13:16     ` Steven Rostedt
  2013-04-09 14:05       ` Vincent Guittot
  0 siblings, 1 reply; 9+ messages in thread
From: Steven Rostedt @ 2013-04-09 13:16 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, linux-kernel, linaro-kernel, Ingo Molnar,
	Paul Turner, Frédéric Weisbecker, Mike Galbraith

On Tue, 2013-04-09 at 14:18 +0200, Vincent Guittot wrote:
> On 9 April 2013 10:55, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Thu, 2013-04-04 at 16:15 +0200, Vincent Guittot wrote:
> >> Changes since V2:
> >> - remove useless definition for UP platform
> >> - rebased on top of Steven Rostedt's patches :
> >> https://lkml.org/lkml/2013/2/12/558
> >
> > So what's the status of those patches? I still worry about the extra
> > context switch overhead for the high-frequency idle scenario.
> 
> I don't know. I have seen a pulled answer from Ingo but can't find the
> commits in the tip tree.
> 
> Steve, have you got more info about the status of your patches ?
> 

Yeah, I asked Ingo to revert it due to Peter's concerns. I was able to
get the latencies I needed without that patch set. That made it not so
urgent.

Can you rebase your patches doing something similar? That is, still use
the pre/post_schedule_idle() calls, but don't base it off of my patch
set.

Thanks,

-- Steve




* Re: [PATCH v4] sched: fix wrong rq's runnable_avg update with rt tasks
  2013-04-09 13:16     ` Steven Rostedt
@ 2013-04-09 14:05       ` Vincent Guittot
  0 siblings, 0 replies; 9+ messages in thread
From: Vincent Guittot @ 2013-04-09 14:05 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, linux-kernel, linaro-kernel, Ingo Molnar,
	Paul Turner, Frédéric Weisbecker, Mike Galbraith

On 9 April 2013 15:16, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Tue, 2013-04-09 at 14:18 +0200, Vincent Guittot wrote:
>> On 9 April 2013 10:55, Peter Zijlstra <peterz@infradead.org> wrote:
>> > On Thu, 2013-04-04 at 16:15 +0200, Vincent Guittot wrote:
>> >> Changes since V2:
>> >> - remove useless definition for UP platform
>> >> - rebased on top of Steven Rostedt's patches :
>> >> https://lkml.org/lkml/2013/2/12/558
>> >
>> > So what's the status of those patches? I still worry about the extra
>> > context switch overhead for the high-frequency idle scenario.
>>
>> I don't know. I have seen a pulled answer from Ingo but can't find the
>> commits in the tip tree.
>>
>> Steve, have you got more info about the status of your patches ?
>>
>
> Yeah, I asked Ingo to revert it due to Peter's concerns. I was able to
> get the latencies I needed without that patch set. That made it not so
> urgent.
>
> Can you rebase your patches doing something similar? That is, still use
> the pre/post_schedule_idle() calls, but don't base it off of my patch
> set.

Yes. I'm going to rebase my patches and add the declaration of
post_schedule_idle in my patch instead of using your patch set.

Thanks,
Vincent

>
> Thanks,
>
> -- Steve
>
>


* Re: [PATCH v4] sched: fix wrong rq's runnable_avg update with rt tasks
  2013-04-09  9:06   ` Vincent Guittot
@ 2013-04-10  7:26     ` Peter Zijlstra
  2013-04-10  7:59       ` Vincent Guittot
  0 siblings, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2013-04-10  7:26 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, linaro-kernel, Ingo Molnar, Paul Turner,
	Steven Rostedt, Frédéric Weisbecker, Mike Galbraith

On Tue, 2013-04-09 at 11:06 +0200, Vincent Guittot wrote:
> >> +void idle_enter(struct rq *this_rq)
> >> +{
> >> +       update_rq_runnable_avg(this_rq, 1);
> >> +}

> >> +void idle_exit(struct rq *this_rq)
> >> +{
> >> +       update_rq_runnable_avg(this_rq, 0);
> >> +}
> >
> > These seem like fairly unfortunate names to expose to the global
> > namespace, why not expose update_rq_runnable_avg() instead?
> 
> Just to gather in one place all cfs actions that should be done when
> exiting idle even if we only have update_rq_runnable_avg right now. I
> have distinguished that from idle_balance because this sequence can't
> generate extra context switch like idle_balance and they would finally
> not be called in the same time

OK, but could we then please give them more scheduler-specific names?
It just seems to me that idle_enter()/idle_exit() are very obvious
function names for unrelated things.

How about calling them idle_{enter,exit}_fair, so that once other classes
grow hooks we can do something like:

static void pre_schedule_idle(struct rq *rq, struct task_struct *p)
{
	struct sched_class *class;

	for_each_class(class) {
		if (class->idle_enter)
			class->idle_enter(rq);
	}
}

or whatnot..



* Re: [PATCH v4] sched: fix wrong rq's runnable_avg update with rt tasks
  2013-04-10  7:26     ` Peter Zijlstra
@ 2013-04-10  7:59       ` Vincent Guittot
  0 siblings, 0 replies; 9+ messages in thread
From: Vincent Guittot @ 2013-04-10  7:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linaro-kernel, Ingo Molnar, Paul Turner,
	Steven Rostedt, Frédéric Weisbecker, Mike Galbraith

On 10 April 2013 09:26, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, 2013-04-09 at 11:06 +0200, Vincent Guittot wrote:
>> >> +void idle_enter(struct rq *this_rq)
>> >> +{
>> >> +       update_rq_runnable_avg(this_rq, 1);
>> >> +}
>
>> >> +void idle_exit(struct rq *this_rq)
>> >> +{
>> >> +       update_rq_runnable_avg(this_rq, 0);
>> >> +}
>> >
>> > These seem like fairly unfortunate names to expose to the global
>> > namespace, why not expose update_rq_runnable_avg() instead?
>>
>> Just to gather in one place all cfs actions that should be done when
>> exiting idle even if we only have update_rq_runnable_avg right now. I
>> have distinguished that from idle_balance because this sequence can't
>> generate extra context switch like idle_balance and they would finally
>> not be called in the same time
>
> OK, but could we then please give then more scheduler specific names?
> It just seems to me that idle_enter/idle_exit() are very obvious
> function names for unrelated things.
>
> How about calling it idle_{enter,exit}_fair; so that once other classes
> grow hooks we can do something like:

My primary goal was to align with the idle_balance name, but
idle_{enter,exit}_fair is better.

In the same way, should we change idle_balance to idle_balance_fair?

And since we don't have Steve's irq constraint anymore, we could move
idle_balance to the beginning of pick_next_task_fair. We would not get
spurious context switches and we would remove a fair-class call from the
__schedule function.

Vincent
>
> static void pre_schedule_idle(struct rq *rq, struct task_struct *p)
> {
>         struct sched_class *class;
>
>         for_each_class(class) {
>                 if (class->idle_enter)
>                         class->idle_enter(rq);
>         }
> }
>
> or whatnot..
>


