* [PATCH Resend v6] sched: fix wrong rq's runnable_avg update with rt tasks
@ 2013-04-18 16:34 Vincent Guittot
  2013-04-19  4:30 ` Mike Galbraith
                   ` (3 more replies)
  0 siblings, 4 replies; 11+ messages in thread
From: Vincent Guittot @ 2013-04-18 16:34 UTC (permalink / raw)
  To: linux-kernel, linaro-kernel, peterz, mingo, pjt, rostedt,
	fweisbec, efault
  Cc: Vincent Guittot

The current update of the rq's load can be erroneous when RT tasks are
involved.

The update of the load of a rq that becomes idle is done only if the avg_idle
is greater than sysctl_sched_migration_cost. If RT tasks and short idle
durations alternate, the runnable_avg will not be updated correctly and the
time will be accounted as idle time when a CFS task wakes up.
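
A minimal sketch of the failing sequence (illustrative only; the avg_idle
check is the one visible in the idle_balance hunk below):

	/*
	 * t0: RT task runs        (CFS functions are never called: no update)
	 * t1: rq becomes idle     (avg_idle < sysctl_sched_migration_cost, so
	 *                          idle_balance returns before it reaches
	 *                          update_rq_runnable_avg)
	 * t2: RT task runs again  (still no update)
	 * t3: a CFS task wakes up (the whole t0..t3 window is then decayed as
	 *                          if it had been idle time)
	 */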

A new idle_enter function is called when the next task is the idle function,
so the elapsed time will be accounted as run time in the load of the rq,
whatever the average idle time is. The function update_rq_runnable_avg is
removed from idle_balance.

When a RT task is scheduled on an idle CPU, the update of the rq's load is
not done when the rq exits the idle state, because CFS's functions are not
called. Then idle_balance, which is called just before entering the idle
function, updates the rq's load and assumes that the elapsed time since the
last update was only running time.

As a consequence, the rq's load of a CPU that only runs a periodic RT task is
close to LOAD_AVG_MAX, whatever the actual running duration of the RT task is.
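
The saturation follows from the load-tracking decay: each 1 ms period
contributes up to 1024 to runnable_avg_sum, and past contributions decay
by y per period with y^32 = 1/2. A minimal user-space sketch of that
arithmetic (illustrative only; the constants mirror the per-entity
load-tracking code in kernel/sched/fair.c):

	#include <stdio.h>
	#include <math.h>

	int main(void)
	{
		double y = pow(0.5, 1.0 / 32.0); /* decay per 1 ms period */
		double sum = 0.0;
		int ms;

		/* every 1 ms period wrongly accounted as fully runnable */
		for (ms = 0; ms < 1000; ms++)
			sum = sum * y + 1024;

		/* converges to 1024 / (1 - y), i.e. roughly LOAD_AVG_MAX (47742) */
		printf("runnable_avg_sum ~= %.0f\n", sum);
		return 0;
	}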

A new idle_exit function is called when the prev task is the idle function,
so the elapsed time will be accounted as idle time in the rq's load.
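
With both hooks in place, the intended ordering around __schedule() is,
as a simplified sketch of the hunks below:

	/*
	 * prev == idle, next == any task:
	 *     pre_schedule(rq, prev) -> pre_schedule_idle -> idle_exit_fair
	 *         (elapsed time accounted as idle time)
	 *
	 * next == idle:
	 *     pick_next_task_idle sets rq->post_schedule = 1
	 *     post_schedule(rq)      -> post_schedule_idle -> idle_enter_fair
	 *         (elapsed time accounted as running time)
	 */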

Changes since V5:
- Rename idle_enter/exit function to idle_enter/exit_fair

Changes since V4:
- Rebase on v3.9-rc6 instead of Steven Rostedt's patches
- Create the post_schedule_idle function that was previously created by Steven's patches

Changes since V3:
- Remove dependency on CONFIG_FAIR_GROUP_SCHED
- Add a new idle_enter function and create a post_schedule callback for
 idle class
- Remove the update_runnable_avg from idle_balance

Changes since V2:
- remove useless definition for UP platform
- rebased on top of Steven Rostedt's patches:
https://lkml.org/lkml/2013/2/12/558

Changes since V1:
- move code out of schedule function and create a pre_schedule callback for
  idle class instead.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c      |   23 +++++++++++++++++++++--
 kernel/sched/idle_task.c |   16 ++++++++++++++++
 kernel/sched/sched.h     |   12 ++++++++++++
 3 files changed, 49 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7a33e59..1de3df0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1562,6 +1562,27 @@ static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
 		se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
 	} /* migrations, e.g. sleep=0 leave decay_count == 0 */
 }
+
+/*
+ * Update the rq's load with the elapsed running time before entering
+ * idle. If the last scheduled task is not a CFS task, idle_enter will
+ * be the only way to update the runnable statistic.
+ */
+void idle_enter_fair(struct rq *this_rq)
+{
+	update_rq_runnable_avg(this_rq, 1);
+}
+
+/*
+ * Update the rq's load with the elapsed idle time before a task is
+ * scheduled. If the newly scheduled task is not a CFS task, idle_exit will
+ * be the only way to update the runnable statistic.
+ */
+void idle_exit_fair(struct rq *this_rq)
+{
+	update_rq_runnable_avg(this_rq, 0);
+}
+
 #else
 static inline void update_entity_load_avg(struct sched_entity *se,
 					  int update_cfs_rq) {}
@@ -5219,8 +5240,6 @@ void idle_balance(int this_cpu, struct rq *this_rq)
 	if (this_rq->avg_idle < sysctl_sched_migration_cost)
 		return;
 
-	update_rq_runnable_avg(this_rq, 1);
-
 	/*
 	 * Drop the rq->lock, but keep IRQ/preempt disabled.
 	 */
diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
index b6baf37..b8ce773 100644
--- a/kernel/sched/idle_task.c
+++ b/kernel/sched/idle_task.c
@@ -13,6 +13,16 @@ select_task_rq_idle(struct task_struct *p, int sd_flag, int flags)
 {
 	return task_cpu(p); /* IDLE tasks as never migrated */
 }
+
+static void pre_schedule_idle(struct rq *rq, struct task_struct *prev)
+{
+	idle_exit_fair(rq);
+}
+
+static void post_schedule_idle(struct rq *rq)
+{
+	idle_enter_fair(rq);
+}
 #endif /* CONFIG_SMP */
 /*
  * Idle tasks are unconditionally rescheduled:
@@ -25,6 +35,10 @@ static void check_preempt_curr_idle(struct rq *rq, struct task_struct *p, int fl
 static struct task_struct *pick_next_task_idle(struct rq *rq)
 {
 	schedstat_inc(rq, sched_goidle);
+#ifdef CONFIG_SMP
+	/* Trigger the post schedule to do an idle_enter for CFS */
+	rq->post_schedule = 1;
+#endif
 	return rq->idle;
 }
 
@@ -86,6 +100,8 @@ const struct sched_class idle_sched_class = {
 
 #ifdef CONFIG_SMP
 	.select_task_rq		= select_task_rq_idle,
+	.pre_schedule		= pre_schedule_idle,
+	.post_schedule		= post_schedule_idle,
 #endif
 
 	.set_curr_task          = set_curr_task_idle,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index cc03cfd..8f1d80e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -880,6 +880,18 @@ extern const struct sched_class idle_sched_class;
 extern void trigger_load_balance(struct rq *rq, int cpu);
 extern void idle_balance(int this_cpu, struct rq *this_rq);
 
+/*
+ * Only depends on SMP, FAIR_GROUP_SCHED may be removed when runnable_avg
+ * becomes useful in lb
+ */
+#if defined(CONFIG_FAIR_GROUP_SCHED)
+extern void idle_enter_fair(struct rq *this_rq);
+extern void idle_exit_fair(struct rq *this_rq);
+#else
+static inline void idle_enter_fair(struct rq *this_rq) {}
+static inline void idle_exit_fair(struct rq *this_rq) {}
+#endif
+
 #else	/* CONFIG_SMP */
 
 static inline void idle_balance(int cpu, struct rq *rq)
-- 
1.7.9.5



* Re: [PATCH Resend v6] sched: fix wrong rq's runnable_avg update with rt tasks
  2013-04-18 16:34 [PATCH Resend v6] sched: fix wrong rq's runnable_avg update with rt tasks Vincent Guittot
@ 2013-04-19  4:30 ` Mike Galbraith
  2013-04-19  7:49   ` Vincent Guittot
  2013-04-19 11:47 ` Peter Zijlstra
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 11+ messages in thread
From: Mike Galbraith @ 2013-04-19  4:30 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, linaro-kernel, peterz, mingo, pjt, rostedt, fweisbec

On Thu, 2013-04-18 at 18:34 +0200, Vincent Guittot wrote: 
> [...]
> As a consequence, the rq's load of a CPU that only runs a periodic RT task is
> close to LOAD_AVG_MAX, whatever the actual running duration of the RT task is.

Why do we care what rq's load says, if the only thing running is a
periodic RT task?  I _think_ I recall that stuff being put under the
throttle specifically to not waste cycles doing that on every
microscopic idle.

Seems to me when scheduling an rt task, you want to do as little other
than switching to/from the rt task as possible.  I don't let rt tasks do
idle balancing either, their job isn't to balance fair class on the way
out the door, it's to get off/onto the cpu ASAP, and do rt work.

-Mike



* Re: [PATCH Resend v6] sched: fix wrong rq's runnable_avg update with rt tasks
  2013-04-19  4:30 ` Mike Galbraith
@ 2013-04-19  7:49   ` Vincent Guittot
  2013-04-19  8:14     ` Mike Galbraith
  0 siblings, 1 reply; 11+ messages in thread
From: Vincent Guittot @ 2013-04-19  7:49 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: linux-kernel, linaro-kernel, Peter Zijlstra, Ingo Molnar,
	Paul Turner, Steven Rostedt, Frédéric Weisbecker

On 19 April 2013 06:30, Mike Galbraith <efault@gmx.de> wrote:
> On Thu, 2013-04-18 at 18:34 +0200, Vincent Guittot wrote:
>> [...]
>
> Why do we care what rq's load says, if the only thing running is a
> periodic RT task?  I _think_ I recall that stuff being put under the

The CFS scheduler will use a wrong rq load the next time it wants to
schedule a task.

> throttle specifically to not waste cycles doing that on every
> microscopic idle.

Yes, but this leads to a wrong computation of runnable_avg_sum. To be
more precise, we only need to call __update_entity_runnable_avg;
__update_tg_runnable_avg is not mandatory in this step.

>
> Seems to me when scheduling an rt task, you want to do as little other
> than switching to/from the rt task as possible.  I don't let rt tasks do
> idle balancing either, their job isn't to balance fair class on the way
> out the door, it's to get off/onto the cpu ASAP, and do rt work.

I agree, but the patch is not about balancing fair tasks but about
keeping the runnable value coherent.

Vincent
>
> -Mike
>


* Re: [PATCH Resend v6] sched: fix wrong rq's runnable_avg update with rt tasks
  2013-04-19  7:49   ` Vincent Guittot
@ 2013-04-19  8:14     ` Mike Galbraith
  2013-04-19  8:50       ` Vincent Guittot
  0 siblings, 1 reply; 11+ messages in thread
From: Mike Galbraith @ 2013-04-19  8:14 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, linaro-kernel, Peter Zijlstra, Ingo Molnar,
	Paul Turner, Steven Rostedt, Frédéric Weisbecker

On Fri, 2013-04-19 at 09:49 +0200, Vincent Guittot wrote: 
> On 19 April 2013 06:30, Mike Galbraith <efault@gmx.de> wrote:
> > [...]
> >
> > Why do we care what rq's load says, if the only thing running is a
> > periodic RT task?  I _think_ I recall that stuff being put under the
> 
> The CFS scheduler will use a wrong rq load the next time it wants to
> schedule a task.
> 
> > throttle specifically to not waste cycles doing that on every
> > microscopic idle.
> 
> Yes, but this leads to a wrong computation of runnable_avg_sum. To be
> more precise, we only need to call __update_entity_runnable_avg;
> __update_tg_runnable_avg is not mandatory in this step.

If it only scares fair class tasks away from the periodic rt load, that
seems like a benefit to me, not a liability.  If we really really need
perfect load numbers, fine, we have to eat some cycles, but when I look
at it, it looks like one of those "Perfect is the enemy of good" things.

-Mike



* Re: [PATCH Resend v6] sched: fix wrong rq's runnable_avg update with rt tasks
  2013-04-19  8:14     ` Mike Galbraith
@ 2013-04-19  8:50       ` Vincent Guittot
  2013-04-19  9:21         ` Mike Galbraith
  0 siblings, 1 reply; 11+ messages in thread
From: Vincent Guittot @ 2013-04-19  8:50 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: linux-kernel, linaro-kernel, Peter Zijlstra, Ingo Molnar,
	Paul Turner, Steven Rostedt, Frédéric Weisbecker

On 19 April 2013 10:14, Mike Galbraith <efault@gmx.de> wrote:
> On Fri, 2013-04-19 at 09:49 +0200, Vincent Guittot wrote:
>> [...]
>
> If it only scares fair class tasks away from the periodic rt load, that
> seems like a benefit to me, not a liability.  If we really really need

I'm not sure that such behavior, which is only based on an erroneous
value, is a good one.

> perfect load numbers, fine, we have to eat some cycles, but when I look
> at it, it looks like one of those "Perfect is the enemy of good" things.

The target is not a perfect number but one good enough to be usable. The
sysctl_sched_migration_cost threshold is good for idle balancing but can
generate a wrong load value.

Vincent
>
> -Mike
>


* Re: [PATCH Resend v6] sched: fix wrong rq's runnable_avg update with rt tasks
  2013-04-19  8:50       ` Vincent Guittot
@ 2013-04-19  9:21         ` Mike Galbraith
  2013-04-19  9:37           ` Mike Galbraith
  2013-04-19 11:11           ` Vincent Guittot
  0 siblings, 2 replies; 11+ messages in thread
From: Mike Galbraith @ 2013-04-19  9:21 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, linaro-kernel, Peter Zijlstra, Ingo Molnar,
	Paul Turner, Steven Rostedt, Frédéric Weisbecker

On Fri, 2013-04-19 at 10:50 +0200, Vincent Guittot wrote: 
> On 19 April 2013 10:14, Mike Galbraith <efault@gmx.de> wrote:
> > [...]
> >
> > If it only scares fair class tasks away from the periodic rt load, that
> > seems like a benefit to me, not a liability.  If we really really need
> 
> I'm not sure that such behavior, which is only based on an erroneous
> value, is a good one.
> 
> > perfect load numbers, fine, we have to eat some cycles, but when I look
> > at it, it looks like one of those "Perfect is the enemy of good" things.
> 
> The target is not a perfect number but one good enough to be usable. The
> sysctl_sched_migration_cost threshold is good for idle balancing but can
> generate a wrong load value.

But again, why do we care?  To be able to mix rt and fair loads and
still make pretty mixed load utilization numbers?  Paying a general case
fast path price to make strange (to me) load utilization numbers pretty
is not very attractive.  If you muck about with rt classes, you need to
have a good reason for doing that.  If you do have a good reason, you
also allocated all resources, including CPU, so don't need the kernel to
balance the load for you.  Paying any fast path price to make the kernel
balance a mixed rt/fair load just seems fundamentally wrong to me.

-Mike



* Re: [PATCH Resend v6] sched: fix wrong rq's runnable_avg update with rt tasks
  2013-04-19  9:21         ` Mike Galbraith
@ 2013-04-19  9:37           ` Mike Galbraith
  2013-04-19 11:11           ` Vincent Guittot
  1 sibling, 0 replies; 11+ messages in thread
From: Mike Galbraith @ 2013-04-19  9:37 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, linaro-kernel, Peter Zijlstra, Ingo Molnar,
	Paul Turner, Steven Rostedt, Frédéric Weisbecker

On Fri, 2013-04-19 at 11:21 +0200, Mike Galbraith wrote: 
> [...]
> 
> But again, why do we care?  To be able to mix rt and fair loads and
> still make pretty mixed load utilization numbers?  Paying a general case
> fast path price to make strange (to me) load utilization numbers pretty
> is not very attractive.

So I'm not convinced this is a good thing to do, but it's not my call,
that's Peter and Ingo's job, so having expressed my opinion, I'll shut up
and let them do their thing ;-)

-Mike



* Re: [PATCH Resend v6] sched: fix wrong rq's runnable_avg update with rt tasks
  2013-04-19  9:21         ` Mike Galbraith
  2013-04-19  9:37           ` Mike Galbraith
@ 2013-04-19 11:11           ` Vincent Guittot
  1 sibling, 0 replies; 11+ messages in thread
From: Vincent Guittot @ 2013-04-19 11:11 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: linux-kernel, linaro-kernel, Peter Zijlstra, Ingo Molnar,
	Paul Turner, Steven Rostedt, Frédéric Weisbecker

On 19 April 2013 11:21, Mike Galbraith <efault@gmx.de> wrote:
> [...]
>
> But again, why do we care?  To be able to mix rt and fair loads and
> still make pretty mixed load utilization numbers?  Paying a general case

If runnable_avg_sum can be wrong, it becomes unusable and all the
stuff around becomes useless.

> fast path price to make strange (to me) load utilization numbers pretty
> is not very attractive.  If you muck about with rt classes, you need to
> have a good reason for doing that.  If you do have a good reason, you
> also allocated all resources, including CPU, so don't need the kernel to

Some tasks have responsiveness constraints, so they use the rt class, but
they also live with cfs tasks.

Vincent

> balance the load for you.  Paying any fast path price to make the kernel
> balance a mixed rt/fair load just seems fundamentally wrong to me.
>
> -Mike
>


* Re: [PATCH Resend v6] sched: fix wrong rq's runnable_avg update with rt tasks
  2013-04-18 16:34 [PATCH Resend v6] sched: fix wrong rq's runnable_avg update with rt tasks Vincent Guittot
  2013-04-19  4:30 ` Mike Galbraith
@ 2013-04-19 11:47 ` Peter Zijlstra
  2013-04-19 21:28 ` Steven Rostedt
  2013-04-21 12:52 ` [tip:sched/core] sched: Fix wrong rq's " tip-bot for Vincent Guittot
  3 siblings, 0 replies; 11+ messages in thread
From: Peter Zijlstra @ 2013-04-19 11:47 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, linaro-kernel, mingo, pjt, rostedt, fweisbec, efault

On Thu, 2013-04-18 at 18:34 +0200, Vincent Guittot wrote:
> [...]

Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

Thanks Vince!



* Re: [PATCH Resend v6] sched: fix wrong rq's runnable_avg update with rt tasks
  2013-04-18 16:34 [PATCH Resend v6] sched: fix wrong rq's runnable_avg update with rt tasks Vincent Guittot
  2013-04-19  4:30 ` Mike Galbraith
  2013-04-19 11:47 ` Peter Zijlstra
@ 2013-04-19 21:28 ` Steven Rostedt
  2013-04-21 12:52 ` [tip:sched/core] sched: Fix wrong rq' s " tip-bot for Vincent Guittot
  3 siblings, 0 replies; 11+ messages in thread
From: Steven Rostedt @ 2013-04-19 21:28 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, linaro-kernel, peterz, mingo, pjt, fweisbec, efault

On Thu, 2013-04-18 at 18:34 +0200, Vincent Guittot wrote:
> [...]
> 
> Changes since V5:
> - Rename idle_enter/exit function to idle_enter/exit_fair
> 
> Changes since V4:
> - Rebase on v3.9-rc6 instead of Steven Rostedt's patches

Acked-by: Steven Rostedt <rostedt@goodmis.org>

-- Steve





* [tip:sched/core] sched: Fix wrong rq's runnable_avg update with rt tasks
  2013-04-18 16:34 [PATCH Resend v6] sched: fix wrong rq's runnable_avg update with rt tasks Vincent Guittot
                   ` (2 preceding siblings ...)
  2013-04-19 21:28 ` Steven Rostedt
@ 2013-04-21 12:52 ` tip-bot for Vincent Guittot
  3 siblings, 0 replies; 11+ messages in thread
From: tip-bot for Vincent Guittot @ 2013-04-21 12:52 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, rostedt, a.p.zijlstra, vincent.guittot, tglx

Commit-ID:  642dbc39ab1ea00f47e0fee1b8e8a27da036d940
Gitweb:     http://git.kernel.org/tip/642dbc39ab1ea00f47e0fee1b8e8a27da036d940
Author:     Vincent Guittot <vincent.guittot@linaro.org>
AuthorDate: Thu, 18 Apr 2013 18:34:26 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Sun, 21 Apr 2013 11:22:52 +0200

sched: Fix wrong rq's runnable_avg update with rt tasks

The current update of the rq's load can be erroneous when RT
tasks are involved.

The update of the load of a rq that becomes idle is done only
if the avg_idle is greater than sysctl_sched_migration_cost. If
RT tasks and short idle durations alternate, the runnable_avg
will not be updated correctly and the time will be accounted as
idle time when a CFS task wakes up.

A new idle_enter function is called when the next task is the
idle function, so the elapsed time will be accounted as run time
in the load of the rq, whatever the average idle time is. The
function update_rq_runnable_avg is removed from idle_balance.

When a RT task is scheduled on an idle CPU, the update of the
rq's load is not done when the rq exits the idle state, because
CFS's functions are not called. Then idle_balance, which is
called just before entering the idle function, updates the rq's
load and assumes that the elapsed time since the last update was
only running time.

As a consequence, the rq's load of a CPU that only runs a
periodic RT task is close to LOAD_AVG_MAX, whatever the actual
running duration of the RT task is.

A new idle_exit function is called when the prev task is the
idle function, so the elapsed time will be accounted as idle
time in the rq's load.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: linaro-kernel@lists.linaro.org
Cc: peterz@infradead.org
Cc: pjt@google.com
Cc: fweisbec@gmail.com
Cc: efault@gmx.de
Link: http://lkml.kernel.org/r/1366302867-5055-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c      | 23 +++++++++++++++++++++--
 kernel/sched/idle_task.c | 16 ++++++++++++++++
 kernel/sched/sched.h     | 12 ++++++++++++
 3 files changed, 49 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 155783b..1c97735 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1563,6 +1563,27 @@ static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
 		se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
 	} /* migrations, e.g. sleep=0 leave decay_count == 0 */
 }
+
+/*
+ * Update the rq's load with the elapsed running time before entering
+ * idle. If the last scheduled task is not a CFS task, idle_enter will
+ * be the only way to update the runnable statistic.
+ */
+void idle_enter_fair(struct rq *this_rq)
+{
+	update_rq_runnable_avg(this_rq, 1);
+}
+
+/*
+ * Update the rq's load with the elapsed idle time before a task is
+ * scheduled. If the newly scheduled task is not a CFS task, idle_exit will
+ * be the only way to update the runnable statistic.
+ */
+void idle_exit_fair(struct rq *this_rq)
+{
+	update_rq_runnable_avg(this_rq, 0);
+}
+
 #else
 static inline void update_entity_load_avg(struct sched_entity *se,
 					  int update_cfs_rq) {}
@@ -5217,8 +5238,6 @@ void idle_balance(int this_cpu, struct rq *this_rq)
 	if (this_rq->avg_idle < sysctl_sched_migration_cost)
 		return;
 
-	update_rq_runnable_avg(this_rq, 1);
-
 	/*
 	 * Drop the rq->lock, but keep IRQ/preempt disabled.
 	 */
diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
index b6baf37..b8ce773 100644
--- a/kernel/sched/idle_task.c
+++ b/kernel/sched/idle_task.c
@@ -13,6 +13,16 @@ select_task_rq_idle(struct task_struct *p, int sd_flag, int flags)
 {
 	return task_cpu(p); /* IDLE tasks as never migrated */
 }
+
+static void pre_schedule_idle(struct rq *rq, struct task_struct *prev)
+{
+	idle_exit_fair(rq);
+}
+
+static void post_schedule_idle(struct rq *rq)
+{
+	idle_enter_fair(rq);
+}
 #endif /* CONFIG_SMP */
 /*
  * Idle tasks are unconditionally rescheduled:
@@ -25,6 +35,10 @@ static void check_preempt_curr_idle(struct rq *rq, struct task_struct *p, int fl
 static struct task_struct *pick_next_task_idle(struct rq *rq)
 {
 	schedstat_inc(rq, sched_goidle);
+#ifdef CONFIG_SMP
+	/* Trigger the post schedule to do an idle_enter for CFS */
+	rq->post_schedule = 1;
+#endif
 	return rq->idle;
 }
 
@@ -86,6 +100,8 @@ const struct sched_class idle_sched_class = {
 
 #ifdef CONFIG_SMP
 	.select_task_rq		= select_task_rq_idle,
+	.pre_schedule		= pre_schedule_idle,
+	.post_schedule		= post_schedule_idle,
 #endif
 
 	.set_curr_task          = set_curr_task_idle,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 8116cf8..605426a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1024,6 +1024,18 @@ extern void update_group_power(struct sched_domain *sd, int cpu);
 extern void trigger_load_balance(struct rq *rq, int cpu);
 extern void idle_balance(int this_cpu, struct rq *this_rq);
 
+/*
+ * Only depends on SMP, FAIR_GROUP_SCHED may be removed when runnable_avg
+ * becomes useful in lb
+ */
+#if defined(CONFIG_FAIR_GROUP_SCHED)
+extern void idle_enter_fair(struct rq *this_rq);
+extern void idle_exit_fair(struct rq *this_rq);
+#else
+static inline void idle_enter_fair(struct rq *this_rq) {}
+static inline void idle_exit_fair(struct rq *this_rq) {}
+#endif
+
 #else	/* CONFIG_SMP */
 
 static inline void idle_balance(int cpu, struct rq *rq)

