* [patch 0/3] sched: fixes related to scaling down cpu power with RT tasks
@ 2010-08-13 19:45 Suresh Siddha
  2010-08-13 19:45 ` [patch 1/3] sched: init rt_avg stat whenever rq comes online Suresh Siddha
                   ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Suresh Siddha @ 2010-08-13 19:45 UTC (permalink / raw)
  To: a.p.zijlstra, mingo, linux-kernel
  Cc: chris, debian00, hpa, jonathan.protzenko, mans, psastudio, rjw,
	stephan.eicher, sxxe, thomas, venki, wonghow

The first patch in the series fixes the main problem reported in
https://bugzilla.kernel.org/show_bug.cgi?id=15559
(Bad performance after suspend on Intel i7 and i5).

The next two patches in the series address issues identified while root
causing the above.

Some reporters of Bug 15559 still report unresolved performance issues
after resume (all of the reporters are, however, happy with the first
patch, which fixes the load-balancing issue after resume). Further root
causing might result in more patches (in the scheduler or elsewhere).

For now I am sending these patches, as the first patch in the series is
critical.

Thanks.



* [patch 1/3] sched: init rt_avg stat whenever rq comes online
  2010-08-13 19:45 [patch 0/3] sched: fixes related to scaling down cpu power with RT tasks Suresh Siddha
@ 2010-08-13 19:45 ` Suresh Siddha
  2010-08-16  7:47   ` Peter Zijlstra
  2010-08-13 19:45 ` [patch 2/3] sched: fix minimum power returned by update_cpu_power() Suresh Siddha
  2010-08-13 19:45 ` [patch 3/3] sched: move sched_avg_update() to update_cpu_load() Suresh Siddha
  2 siblings, 1 reply; 23+ messages in thread
From: Suresh Siddha @ 2010-08-13 19:45 UTC (permalink / raw)
  To: a.p.zijlstra, mingo, linux-kernel
  Cc: chris, debian00, hpa, jonathan.protzenko, mans, psastudio, rjw,
	stephan.eicher, sxxe, thomas, venki, wonghow, Suresh Siddha,
	stable

[-- Attachment #1: sched_reset_rt_avg_stat_online.patch --]
[-- Type: text/plain, Size: 1246 bytes --]

TSCs get reset after suspend/resume, and this leads to a scenario where
rq->clock (sched_clock_cpu()) is less than rq->age_stamp. That results
in a big value returned by scale_rt_power(), and the resulting big group
power set by update_group_power() causes improper load balancing between
busy and idle CPUs after suspend/resume.

This made multi-threaded workloads (like kernel compilation) go slower
after a suspend/resume cycle on Core i5 laptops.

Realign rq->age_stamp to rq->clock and reset rq->rt_avg in rq_online_rt().

This addresses the primary issue reported in
https://bugzilla.kernel.org/show_bug.cgi?id=15559
(Bad performance after suspend on Intel i7 and i5)

Reported-by: Florian Pritz <flo@xssn.at>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: stable@kernel.org	[2.6.32+]
---

diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index d10c80e..a00af73 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -1550,6 +1550,12 @@ static void rq_online_rt(struct rq *rq)
 	__enable_runtime(rq);
 
 	cpupri_set(&rq->rd->cpupri, rq->cpu, rq->rt.highest_prio.curr);
+
+	/*
+ 	 * Reset RT average stat and its time stamp.
+ 	 */
+	rq->age_stamp = rq->clock;
+	rq->rt_avg = 0;
 }
 
 /* Assumes rq->lock is held */
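
For illustration, here is a minimal user-space sketch of the arithmetic that
goes wrong in scale_rt_power() when rq->clock falls behind rq->age_stamp.
It is not kernel code: the function and variable names are made up, the
(s64) clamp on 'total' is my reading of the part of the function not shown
in the hunk above, and SCHED_LOAD_SCALE/SCHED_LOAD_SHIFT are taken as
1024/10.

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

#define LOAD_SCALE	1024ULL
#define LOAD_SHIFT	10

/* Roughly 1024 * available/total in the sane case. */
static uint64_t scale_rt_power_sim(uint64_t clock, uint64_t age_stamp,
				   uint64_t period, uint64_t rt_avg)
{
	uint64_t total, available;

	total = period + (clock - age_stamp);	/* wraps if clock < age_stamp */
	available = total - rt_avg;

	if ((int64_t)total < (int64_t)LOAD_SCALE)
		total = LOAD_SCALE;		/* clamps total, but not available */
	total >>= LOAD_SHIFT;

	return available / total;
}

int main(void)
{
	/* Normal: clock slightly ahead of age_stamp, result bounded by ~1024. */
	printf("normal:      %" PRIu64 "\n",
	       scale_rt_power_sim(1200000000ULL, 1000000000ULL, 500000000ULL, 100000000ULL));
	/* After a TSC reset: clock restarted near 0, age_stamp still large, so
	 * 'total' wraps to a huge u64 and the result explodes. */
	printf("post-resume: %" PRIu64 "\n",
	       scale_rt_power_sim(5000ULL, 1000000000ULL, 500000000ULL, 0ULL));
	return 0;
}

The huge return value is what then inflates the group power computed by
update_group_power() and skews the load balancer, as described above.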




* [patch 2/3] sched: fix minimum power returned by update_cpu_power()
  2010-08-13 19:45 [patch 0/3] sched: fixes related to scaling down cpu power with RT tasks Suresh Siddha
  2010-08-13 19:45 ` [patch 1/3] sched: init rt_avg stat whenever rq comes online Suresh Siddha
@ 2010-08-13 19:45 ` Suresh Siddha
  2010-08-16  7:50   ` Peter Zijlstra
  2010-08-13 19:45 ` [patch 3/3] sched: move sched_avg_update() to update_cpu_load() Suresh Siddha
  2 siblings, 1 reply; 23+ messages in thread
From: Suresh Siddha @ 2010-08-13 19:45 UTC (permalink / raw)
  To: a.p.zijlstra, mingo, linux-kernel
  Cc: chris, debian00, hpa, jonathan.protzenko, mans, psastudio, rjw,
	stephan.eicher, sxxe, thomas, venki, wonghow, Suresh Siddha

[-- Attachment #1: fix_update_cpu_power.patch --]
[-- Type: text/plain, Size: 601 bytes --]

The default cpu_power needs to be a multiple of SCHED_LOAD_SCALE, not '1'.
Fix it.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
---
 kernel/sched_fair.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: tree/kernel/sched_fair.c
===================================================================
--- tree.orig/kernel/sched_fair.c
+++ tree/kernel/sched_fair.c
@@ -2309,7 +2309,7 @@ static void update_cpu_power(struct sche
 	power >>= SCHED_LOAD_SHIFT;
 
 	if (!power)
-		power = 1;
+		power = SCHED_LOAD_SCALE;
 
 	cpu_rq(cpu)->cpu_power = power;
 	sdg->cpu_power = power;




* [patch 3/3] sched: move sched_avg_update() to update_cpu_load()
  2010-08-13 19:45 [patch 0/3] sched: fixes related to scaling down cpu power with RT tasks Suresh Siddha
  2010-08-13 19:45 ` [patch 1/3] sched: init rt_avg stat whenever rq comes online Suresh Siddha
  2010-08-13 19:45 ` [patch 2/3] sched: fix minimum power returned by update_cpu_power() Suresh Siddha
@ 2010-08-13 19:45 ` Suresh Siddha
  2010-08-16  8:00   ` Peter Zijlstra
  2 siblings, 1 reply; 23+ messages in thread
From: Suresh Siddha @ 2010-08-13 19:45 UTC (permalink / raw)
  To: a.p.zijlstra, mingo, linux-kernel
  Cc: chris, debian00, hpa, jonathan.protzenko, mans, psastudio, rjw,
	stephan.eicher, sxxe, thomas, venki, wonghow, Suresh Siddha

[-- Attachment #1: move_sched_avg_update.patch --]
[-- Type: text/plain, Size: 1451 bytes --]

Currently sched_avg_update() (which updates the rt_avg stats in the rq) gets
called from scale_rt_power() (in the load-balance context), which doesn't take
rq->lock.

Fix it by moving sched_avg_update() to the more appropriate update_cpu_load(),
where the CFS load gets updated as well.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
---
 kernel/sched.c      |    3 ++-
 kernel/sched_fair.c |    2 --
 2 files changed, 2 insertions(+), 3 deletions(-)

Index: tree/kernel/sched_fair.c
===================================================================
--- tree.orig/kernel/sched_fair.c
+++ tree/kernel/sched_fair.c
@@ -2268,8 +2268,6 @@ unsigned long scale_rt_power(int cpu)
 	struct rq *rq = cpu_rq(cpu);
 	u64 total, available;
 
-	sched_avg_update(rq);
-
 	total = sched_avg_period() + (rq->clock - rq->age_stamp);
 	available = total - rq->rt_avg;
 
Index: tree/kernel/sched.c
===================================================================
--- tree.orig/kernel/sched.c
+++ tree/kernel/sched.c
@@ -1281,7 +1281,6 @@ static void sched_avg_update(struct rq *
 static void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
 {
 	rq->rt_avg += rt_delta;
-	sched_avg_update(rq);
 }
 
 #else /* !CONFIG_SMP */
@@ -3182,6 +3181,8 @@ static void update_cpu_load(struct rq *t
 
 		this_rq->cpu_load[i] = (old_load * (scale - 1) + new_load) >> i;
 	}
+
+	sched_avg_update(this_rq);
 }
 
 static void update_cpu_load_active(struct rq *this_rq)
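
For readers wondering what taking rq->lock buys here, below is a tiny
standalone illustration (not kernel code; names are invented, compile with
-pthread) of the lost-update hazard when two contexts do an unlocked
read-modify-write on the same field, which is the kind of interference the
later discussion about the nohz idle balancer is about.

#include <pthread.h>
#include <stdio.h>

/* Stands in for a per-rq field such as rt_avg; volatile only so the demo
 * keeps one load/add/store per iteration even when optimized. */
static volatile unsigned long long shared_avg;

static void *hammer(void *arg)
{
	int i;

	(void)arg;
	for (i = 0; i < 1000000; i++)
		shared_avg += 1;	/* unlocked read-modify-write */
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, hammer, NULL);
	pthread_create(&b, NULL, hammer, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);

	/* Expected 2000000; usually prints less because updates get lost. */
	printf("shared_avg = %llu\n", shared_avg);
	return 0;
}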




* Re: [patch 1/3] sched: init rt_avg stat whenever rq comes online
  2010-08-13 19:45 ` [patch 1/3] sched: init rt_avg stat whenever rq comes online Suresh Siddha
@ 2010-08-16  7:47   ` Peter Zijlstra
  2010-08-16 17:36     ` Suresh Siddha
  0 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2010-08-16  7:47 UTC (permalink / raw)
  To: Suresh Siddha
  Cc: mingo, linux-kernel, chris, debian00, hpa, jonathan.protzenko,
	mans, psastudio, rjw, stephan.eicher, sxxe, thomas, venki,
	wonghow, stable

On Fri, 2010-08-13 at 12:45 -0700, Suresh Siddha wrote:
> plain text document attachment (sched_reset_rt_avg_stat_online.patch)
> TSC's get reset after suspend/resume and this leads to a scenario of
> rq->clock (sched_clock_cpu()) less than rq->age_stamp. This leads
> to a big value returned by scale_rt_power() and the resulting big group
> power set by the update_group_power() is causing improper load balancing
> between busy and idle cpu's after suspend/resume.

ARGH, so i[357] westmere mobile stops TSC on some power state?

Why don't we sync it back to the other CPUs instead?

Or does it simply mark the TSCs unstable and leave it at that?

In any case, this needs to be fixed at the clock level, not like this.




* Re: [patch 2/3] sched: fix minimum power returned by update_cpu_power()
  2010-08-13 19:45 ` [patch 2/3] sched: fix minimum power returned by update_cpu_power() Suresh Siddha
@ 2010-08-16  7:50   ` Peter Zijlstra
  2010-08-16 17:37     ` Suresh Siddha
  0 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2010-08-16  7:50 UTC (permalink / raw)
  To: Suresh Siddha
  Cc: mingo, linux-kernel, chris, debian00, hpa, jonathan.protzenko,
	mans, psastudio, rjw, stephan.eicher, sxxe, thomas, venki,
	wonghow

On Fri, 2010-08-13 at 12:45 -0700, Suresh Siddha wrote:
> plain text document attachment (fix_update_cpu_power.patch)
> Default cpu_power needs to be multiples of SCHED_LOAD_SCALE and not '1'.
> Fix it.
> 
> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
> ---
>  kernel/sched_fair.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> Index: tree/kernel/sched_fair.c
> ===================================================================
> --- tree.orig/kernel/sched_fair.c
> +++ tree/kernel/sched_fair.c
> @@ -2309,7 +2309,7 @@ static void update_cpu_power(struct sche
>  	power >>= SCHED_LOAD_SHIFT;
>  
>  	if (!power)
> -		power = 1;
> +		power = SCHED_LOAD_SCALE;
>  

                            smt_power   freq_power    rt_power
power = SCHED_LOAD_SCALE * ---------- * ---------- * ----------
                           LOAD_SCALE   LOAD_SCALE   LOAD_SCALE

Which, in the above case ends up being 0, so how does resetting it back
to LOAD_SCALE make sense?
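
For concreteness, a small fixed-point sketch of the cascaded scaling in the
formula above (illustrative only: the function and factor names are made up
and do not mirror update_cpu_power() exactly; LOAD_SCALE/LOAD_SHIFT stand
for SCHED_LOAD_SCALE/SCHED_LOAD_SHIFT):

#include <stdio.h>

#define LOAD_SCALE	1024UL
#define LOAD_SHIFT	10

static unsigned long cascade(unsigned long smt, unsigned long freq,
			     unsigned long rt)
{
	unsigned long power = LOAD_SCALE;

	power *= smt;	power >>= LOAD_SHIFT;	/* SMT scaling */
	power *= freq;	power >>= LOAD_SHIFT;	/* frequency scaling */
	power *= rt;	power >>= LOAD_SHIFT;	/* fraction left after RT tasks */

	return power;
}

int main(void)
{
	/* Healthy CPU: all factors near full scale, power close to 1024. */
	printf("normal:       %lu\n", cascade(1024, 1024, 1000));
	/* CPU almost entirely consumed by RT tasks: the rt factor is ~0, so
	 * the product collapses to 0, which is the case the clamp has to
	 * handle. */
	printf("rt-saturated: %lu\n", cascade(1024, 1024, 0));
	return 0;
}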


* Re: [patch 3/3] sched: move sched_avg_update() to update_cpu_load()
  2010-08-13 19:45 ` [patch 3/3] sched: move sched_avg_update() to update_cpu_load() Suresh Siddha
@ 2010-08-16  8:00   ` Peter Zijlstra
  2010-08-16 17:46     ` Suresh Siddha
  0 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2010-08-16  8:00 UTC (permalink / raw)
  To: Suresh Siddha
  Cc: mingo, linux-kernel, chris, debian00, hpa, jonathan.protzenko,
	mans, psastudio, rjw, stephan.eicher, sxxe, thomas, venki,
	wonghow

On Fri, 2010-08-13 at 12:45 -0700, Suresh Siddha wrote:
> plain text document attachment (move_sched_avg_update.patch)
> Currently sched_avg_update() (which updates rt_avg stats in the rq) is getting
> called from scale_rt_power() (in the load balance context) which doesn't take
> rq->lock.

Right, but I think it doesn't need to, as it's never accessed cross-CPU
(except maybe for /proc/sched_debug, and who cares about that).

The (other) odd case is NO_HZ idle balancing, but there the actual CPU
won't be updating the fields, so the remote CPU doing NO_HZ idle
balancing should be good to touch it.

> Fix it by moving the sched_avg_update() to more appropriate update_cpu_load()
> where the CFS load gets updated aswell.

Right, except it breaks things a bit, at the very least you really need
that update right before reading it, otherwise you can end up with >100%
fractions, which are odd indeed ;-)




* Re: [patch 1/3] sched: init rt_avg stat whenever rq comes online
  2010-08-16  7:47   ` Peter Zijlstra
@ 2010-08-16 17:36     ` Suresh Siddha
  2010-08-16 19:25       ` Peter Zijlstra
  0 siblings, 1 reply; 23+ messages in thread
From: Suresh Siddha @ 2010-08-16 17:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, linux-kernel, chris, debian00, hpa, jonathan.protzenko,
	mans, psastudio, rjw, stephan.eicher, sxxe, thomas, venki,
	wonghow, stable

On Mon, 2010-08-16 at 00:47 -0700, Peter Zijlstra wrote:
> On Fri, 2010-08-13 at 12:45 -0700, Suresh Siddha wrote:
> > plain text document attachment (sched_reset_rt_avg_stat_online.patch)
> > TSC's get reset after suspend/resume and this leads to a scenario of
> > rq->clock (sched_clock_cpu()) less than rq->age_stamp. This leads
> > to a big value returned by scale_rt_power() and the resulting big group
> > power set by the update_group_power() is causing improper load balancing
> > between busy and idle cpu's after suspend/resume.
> 
> ARGH, so i[357] westmere mobile stops TSC on some power state?

WSM has working TSC with constant rate across P/C/T-states. This issue
is about suspend/resume (S-states).

> Why don't we sync it back to the other CPUs instead?

All the CPUs enter the suspend state, and during resume the TSC gets
reset on all of them.

> 
> Or does it simply mark TSCs unstable and leaves it at that?

TSCs are stable and in sync after resume as well. If we want to do a SW
sync, we need to keep track of the time we spent in the suspend state
and do a SW sync (during resume), which can potentially disturb the HW
sync.

> 
> In any case, this needs to be fixed at the clock level, not like this.

If we have more such dependencies on the TSC then we may need to address
the issue at the clock level as well. Nevertheless, across cpu
online/offline the current scheduler code expects the TSC (sched_clock)
to be going forward, and I am not sure why we need to carry the rt_avg
history across online/offline.

thanks,
suresh



* Re: [patch 2/3] sched: fix minimum power returned by update_cpu_power()
  2010-08-16  7:50   ` Peter Zijlstra
@ 2010-08-16 17:37     ` Suresh Siddha
  0 siblings, 0 replies; 23+ messages in thread
From: Suresh Siddha @ 2010-08-16 17:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, linux-kernel, chris, debian00, hpa, jonathan.protzenko,
	mans, psastudio, rjw, stephan.eicher, sxxe, thomas, venki,
	wonghow

On Mon, 2010-08-16 at 00:50 -0700, Peter Zijlstra wrote:
> On Fri, 2010-08-13 at 12:45 -0700, Suresh Siddha wrote:
> > plain text document attachment (fix_update_cpu_power.patch)
> > Default cpu_power needs to be multiples of SCHED_LOAD_SCALE and not '1'.
> > Fix it.
> > 
> > Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
> > ---
> >  kernel/sched_fair.c |    2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > Index: tree/kernel/sched_fair.c
> > ===================================================================
> > --- tree.orig/kernel/sched_fair.c
> > +++ tree/kernel/sched_fair.c
> > @@ -2309,7 +2309,7 @@ static void update_cpu_power(struct sche
> >  	power >>= SCHED_LOAD_SHIFT;
> >  
> >  	if (!power)
> > -		power = 1;
> > +		power = SCHED_LOAD_SCALE;
> >  
> 
>                             smt_power   freq_power    rt_power
> power = SCHED_LOAD_SCALE * ---------- * ---------- * ----------
>                            LOAD_SCALE   LOAD_SCALE   LOAD_SCALE
> 
> Which, in the above case ends up being 0, so how does resetting it back
> to LOAD_SCALE make sense?

Hmm, true, but I thought I saw some load-balancing code which was
depending on the SCHED_LOAD_SCALE value. Ignore this for now. I will get
back to you if this indeed is a problem.

thanks,
suresh



* Re: [patch 3/3] sched: move sched_avg_update() to update_cpu_load()
  2010-08-16  8:00   ` Peter Zijlstra
@ 2010-08-16 17:46     ` Suresh Siddha
  2010-08-16 19:31       ` Peter Zijlstra
  0 siblings, 1 reply; 23+ messages in thread
From: Suresh Siddha @ 2010-08-16 17:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, linux-kernel, chris, debian00, hpa, jonathan.protzenko,
	mans, psastudio, rjw, stephan.eicher, sxxe, thomas, venki,
	wonghow

On Mon, 2010-08-16 at 01:00 -0700, Peter Zijlstra wrote:
> On Fri, 2010-08-13 at 12:45 -0700, Suresh Siddha wrote:
> > plain text document attachment (move_sched_avg_update.patch)
> > Currently sched_avg_update() (which updates rt_avg stats in the rq) is getting
> > called from scale_rt_power() (in the load balance context) which doesn't take
> > rq->lock.
> 
> Right, but I think it doesn't need to, as its never accessed cross CPU
> (except maybe for /proc/sched_debug, and who cares about that).
> 
> The (other) odd case is NO_HZ idle balancing, but there the actual CPU
> won't be updating the fields, so the remote CPU doing NO_HZ idle
> balancing should be good to touch it.

There is no guarantee that the original CPU won't be doing this in
parallel with the nohz idle load-balancing CPU.

> 
> > Fix it by moving the sched_avg_update() to more appropriate update_cpu_load()
> > where the CFS load gets updated aswell.
> 
> Right, except it breaks things a bit, at the very least you really need
> that update right before reading it, otherwise you can end up with >100%
> fractions, which are odd indeed ;-)

With the patch, the update always happens before the read, doesn't it?

The update now happens during the scheduler tick (or during the nohz
load-balancing tick), and the load balancer gets triggered by the tick.
So the update (at the tick) should happen before the read (used by the
load balancing triggered by that tick). Am I missing something?

thanks,
suresh




* Re: [patch 1/3] sched: init rt_avg stat whenever rq comes online
  2010-08-16 17:36     ` Suresh Siddha
@ 2010-08-16 19:25       ` Peter Zijlstra
  2010-08-17  8:51         ` Peter Zijlstra
  0 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2010-08-16 19:25 UTC (permalink / raw)
  To: Suresh Siddha
  Cc: mingo, linux-kernel, chris, debian00, hpa, jonathan.protzenko,
	mans, psastudio, rjw, stephan.eicher, sxxe, thomas, venki,
	wonghow, stable, tglx

On Mon, 2010-08-16 at 10:36 -0700, Suresh Siddha wrote:
> On Mon, 2010-08-16 at 00:47 -0700, Peter Zijlstra wrote:
> > On Fri, 2010-08-13 at 12:45 -0700, Suresh Siddha wrote:
> > > plain text document attachment (sched_reset_rt_avg_stat_online.patch)
> > > TSC's get reset after suspend/resume and this leads to a scenario of
> > > rq->clock (sched_clock_cpu()) less than rq->age_stamp. This leads
> > > to a big value returned by scale_rt_power() and the resulting big group
> > > power set by the update_group_power() is causing improper load balancing
> > > between busy and idle cpu's after suspend/resume.
> > 
> > ARGH, so i[357] westmere mobile stops TSC on some power state?
> 
> WSM has working TSC with constant rate across P/C/T-states. This issue
> is about suspend/resume (S-states).

Hurm..

> > Why don't we sync it back to the other CPUs instead?
> 
> All the cpu's entered suspend state and during resume it gets reset for
> all the CPU's.

Bloody lovely..

> > Or does it simply mark TSCs unstable and leaves it at that?
> 
> TSCs are stable and in sync after resume aswell. If we want to do SW
> sync, we need to keep track of the time we spent in the suspend state
> and do a SW sync (during resume) that can potentially disturb the HW
> sync.

Nah, no need to track the time spent in S-states; simply not going
backwards would be enough: save before entering S, restore after coming
out.

You can use something like:

suspend:
 __get_cpu_var(cyc2ns_suspend) = sched_clock();

resume:
 for_each_possible_cpu(i)
   per_cpu(cyc2ns_offset, i) += per_cpu(cyc2ns_suspend);

or something like that to keep sched_clock() stable, which is exactly
what most (all?) of its users expect when we report that the TSC is usable.

Not sure how to arrange the suspend bit to run on all cpus though, as I
think we offline them all first or something.

> > In any case, this needs to be fixed at the clock level, not like this.
> 
> If we have more such dependencies on TSC then we may need to address the
> issue at clock level aswell. Nevertheless, across cpu online/offline,
> current scheduler code is expecting TSC (sched_clock) to be going
> forward and not sure why we need to carry the rt_avg history across
> online/offline.

We assume sched_clock_cpu() _never_ goes backwards; when
sched_clock_stable is set, sched_clock_cpu() == sched_clock() (we could,
and probably should, do better on clock continuity when we flip
sched_clock_stable).

We carry rt_avg over suspend much like we carry pretty much all other
state over suspend, including load_avg etc.; no reason to special-case
it at all.




* Re: [patch 3/3] sched: move sched_avg_update() to update_cpu_load()
  2010-08-16 17:46     ` Suresh Siddha
@ 2010-08-16 19:31       ` Peter Zijlstra
  2010-08-20  0:51         ` Suresh Siddha
  0 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2010-08-16 19:31 UTC (permalink / raw)
  To: Suresh Siddha
  Cc: mingo, linux-kernel, chris, debian00, hpa, jonathan.protzenko,
	mans, psastudio, rjw, stephan.eicher, sxxe, thomas, venki,
	wonghow

On Mon, 2010-08-16 at 10:46 -0700, Suresh Siddha wrote:

> There is no guarantee that the original cpu won't be doing this in
> parallel with nohz idle load balancing cpu.

Hmm, true.. bugger.

> > > Fix it by moving the sched_avg_update() to more appropriate update_cpu_load()
> > > where the CFS load gets updated aswell.
> > 
> > Right, except it breaks things a bit, at the very least you really need
> > that update right before reading it, otherwise you can end up with >100%
> > fractions, which are odd indeed ;-)
> 
> with the patch, the update always happens before reading it. isn't it?
> 
> update now happens during the scheduler tick (or during nohz load
> balancing tick). And the load balancer gets triggered with the tick.
> So the update (at the tick) should happen before reading it (used by
> load balancing triggered by the tick). Am I missing something?

We run the load balancer in softirq context (on -rt that's a task), and
we could have run other (more important) RT tasks between the hardirq
and the softirq running, which would increase rt_avg and could thus
result in >100%.

But I think we can simply retain the sched_avg_update(rq) in
sched_rt_avg_update(); that runs with rq->lock held and should be
enough to avoid that case.

We can retain the other bit of your patch, moving sched_avg_update() from
scale_rt_power() to update_cpu_load(), since that is only concerned with
lowering the average when there is no actual activity.
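
As a concrete picture of the "lowering" being discussed, here is a small
standalone sketch of the sched_avg_update() decay as I read the sched.c of
this era (illustrative values and names, not the kernel function itself):
rt_avg is halved once for every full averaging period that has elapsed
since age_stamp.

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

int main(void)
{
	int64_t period = 500000000;		/* ~sched_avg_period(), 0.5 s in ns */
	uint64_t clock = 3000000000ULL;		/* "now" (rq->clock) */
	uint64_t age_stamp = 1000000000ULL;	/* last decay point (rq->age_stamp) */
	uint64_t rt_avg = 400000000ULL;		/* accumulated RT runtime (rq->rt_avg) */

	while ((int64_t)(clock - age_stamp) > period) {
		age_stamp += period;
		rt_avg /= 2;			/* decay one period's worth */
	}

	printf("rt_avg after decay: %" PRIu64 ", age_stamp: %" PRIu64 "\n",
	       rt_avg, age_stamp);
	return 0;
}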




* Re: [patch 1/3] sched: init rt_avg stat whenever rq comes online
  2010-08-16 19:25       ` Peter Zijlstra
@ 2010-08-17  8:51         ` Peter Zijlstra
  2010-08-19  0:20           ` Suresh Siddha
  0 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2010-08-17  8:51 UTC (permalink / raw)
  To: Suresh Siddha
  Cc: mingo, linux-kernel, chris, debian00, hpa, jonathan.protzenko,
	mans, psastudio, rjw, stephan.eicher, sxxe, thomas, venki,
	wonghow, stable, tglx

On Mon, 2010-08-16 at 21:25 +0200, Peter Zijlstra wrote:
> You can use something like:
> 
> suspend:
>  __get_cpu_var(cyc2ns_suspend) = sched_clock();
> 
> resume:
>  for_each_possible_cpu(i)
>    per_cpu(cyc2ns_offset, i) += per_cpu(cyc2ns_suspend);
> 
> or something like that to keep sched_clock() stable, which is exactly
> what most (all?) its users expect when we report the TSC is usable. 

That's actually broken, you only want a single offset, otherwise we
de-sync the TSC, which is bad.

So simply store the sched_clock() value at suspend time on the single
CPU that is still running, then on resume make sure sched_clock()
continues there by adding that stamp to all CPU offsets.




* Re: [patch 1/3] sched: init rt_avg stat whenever rq comes online
  2010-08-17  8:51         ` Peter Zijlstra
@ 2010-08-19  0:20           ` Suresh Siddha
  2010-08-19  8:53             ` Peter Zijlstra
  0 siblings, 1 reply; 23+ messages in thread
From: Suresh Siddha @ 2010-08-19  0:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, linux-kernel, chris, debian00, hpa, jonathan.protzenko,
	mans, psastudio, rjw, stephan.eicher, sxxe, thomas, venki,
	wonghow, stable, tglx

On Tue, 2010-08-17 at 01:51 -0700, Peter Zijlstra wrote:
> On Mon, 2010-08-16 at 21:25 +0200, Peter Zijlstra wrote:
> > You can use something like:
> > 
> > suspend:
> >  __get_cpu_var(cyc2ns_suspend) = sched_clock();
> > 
> > resume:
> >  for_each_possible_cpu(i)
> >    per_cpu(cyc2ns_offset, i) += per_cpu(cyc2ns_suspend);
> > 
> > or something like that to keep sched_clock() stable, which is exactly
> > what most (all?) its users expect when we report the TSC is usable. 
> 
> That's actually broken, you only want a single offset, otherwise we
> de-sync the TSC, which is bad.
> 
> So simply store the sched_clock() value at suspend time on the single
> CPU that is still running, then on resume make sure sched_clock()
> continues there by adding that stamp to all CPU offsets.


Peter, that might not be enough. I should add that on my Lenovo T410
(which has a 2-core WSM CPU), the TSCs are somehow set to a strange big
value (for example 0xfffffffebc22f02e) after resume from S3. It looks
like the BIOS might be writing the TSC during resume. I am not sure if
this is the case for other OEM laptops as well. I am checking.

So such large TSC values (leading to a very big difference between
rq->clock and rq->age_stamp) won't be correctly handled by
scale_rt_power() either.

thanks,
suresh



* Re: [patch 1/3] sched: init rt_avg stat whenever rq comes online
  2010-08-19  0:20           ` Suresh Siddha
@ 2010-08-19  8:53             ` Peter Zijlstra
  2010-08-20  0:03               ` Suresh Siddha
  0 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2010-08-19  8:53 UTC (permalink / raw)
  To: Suresh Siddha
  Cc: mingo, linux-kernel, chris, debian00, hpa, jonathan.protzenko,
	mans, psastudio, rjw, stephan.eicher, sxxe, thomas, venki,
	wonghow, stable, tglx

On Wed, 2010-08-18 at 17:20 -0700, Suresh Siddha wrote:
> On Tue, 2010-08-17 at 01:51 -0700, Peter Zijlstra wrote:
> > On Mon, 2010-08-16 at 21:25 +0200, Peter Zijlstra wrote:
> > > You can use something like:
> > > 
> > > suspend:
> > >  __get_cpu_var(cyc2ns_suspend) = sched_clock();
> > > 
> > > resume:
> > >  for_each_possible_cpu(i)
> > >    per_cpu(cyc2ns_offset, i) += per_cpu(cyc2ns_suspend);
> > > 
> > > or something like that to keep sched_clock() stable, which is exactly
> > > what most (all?) its users expect when we report the TSC is usable. 
> > 
> > That's actually broken, you only want a single offset, otherwise we
> > de-sync the TSC, which is bad.
> > 
> > So simply store the sched_clock() value at suspend time on the single
> > CPU that is still running, then on resume make sure sched_clock()
> > continues there by adding that stamp to all CPU offsets.
> 
> 
> Peter, That might not be enough. I should add that in my Lenovo T410
> (having 2 core wsm cpu), TSC's are somehow set to a strange big value
> (for example 0xfffffffebc22f02e) after resume from S3. It looks like
> bios might be writing TSC during resume. I am not sure if this is the
> case for other OEM laptops aswell. I am checking.

ARGH, please kill all SMM support for future CPUs ;-)

Are the TSCs still sync'ed though? If so, we can still compute an offset
and continue with things, although it requires something like:

  local_irq_save(flags);
  __get_cpu_var(cyc2ns_offset) = 0;
  offset = cyc2ns_suspend - sched_clock();
  local_irq_restore(flags);

  for_each_possible_cpu(i)
    per_cpu(cyc2ns_offset, i) = offset;

Which would take the funny offset into account and make it resume where
we left off.

If they got out of sync, we need to flip sched_clock_stable and work on
getting the sched_clock.c code to be monotonic over such a flip.

> So such large values of TSC (leading to a very big difference between
> rq->clock and rq->age_stamp) wont be correctly handled by
> scale_rt_power() either.

Still, we need to fix the clock, not fudge the users.


* Re: [patch 1/3] sched: init rt_avg stat whenever rq comes online
  2010-08-19  8:53             ` Peter Zijlstra
@ 2010-08-20  0:03               ` Suresh Siddha
  2010-08-20  8:54                 ` Peter Zijlstra
  2010-08-20 14:16                 ` [tip:sched/urgent] x86, tsc, sched: Recompute cyc2ns_offset's during resume from sleep states tip-bot for Suresh Siddha
  0 siblings, 2 replies; 23+ messages in thread
From: Suresh Siddha @ 2010-08-20  0:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, linux-kernel, chris, debian00, hpa, jonathan.protzenko,
	mans, psastudio, rjw, stephan.eicher, sxxe, thomas, venki,
	wonghow, stable, tglx

On Thu, 2010-08-19 at 01:53 -0700, Peter Zijlstra wrote:
> ARGH, please kill all SMM support for future CPUs ;-)
> 
> Are the TSCs still sync'ed though? 

Yes.

> If so, we can still compute a offset
> and continue with things, albeit it requires something like:
> 
>   local_irq_save(flags);
>   __get_cpu_var(cyc2ns_offset) = 0;
>   offset = cyc2ns_suspend - sched_clock();
>   local_irq_restore(flags);
> 
>   for_each_possible_cpu(i)
>     per_cpu(cyc2ns_offset, i) = offset;
> 
> Which would take the funny offset into account and make it resume at
> where we left off.
> 
> If they got out of sync, we need to flip sched_clock_stable and work on
> getting the sched_clock.c code to be monotonic over such a flip.
> 
> > So such large values of TSC (leading to a very big difference between
> > rq->clock and rq->age_stamp) wont be correctly handled by
> > scale_rt_power() either.
> 
> Still, we need to fix the clock, not fudge the users.

Ok, I have appended a patch doing this. It seems to fix the scheduler
performance issue triggered by suspend/resume. Can you please Ack it?

Thomas/Peter/Ingo: can you please pick this up if you have no other
objections? Thanks.
---

From: Suresh Siddha <suresh.b.siddha@intel.com>
Subject: x86, tsc: recompute cyc2ns_offset's during resume from sleep states

TSCs get reset after suspend/resume (even on CPUs with an invariant TSC,
which runs at a constant rate across ACPI P-, C- and T-states). And on some
systems the BIOS seems to reinit the TSC to an arbitrary large value (still
sync'd across CPUs) during resume.

This leads to a scenario where the scheduler rq->clock (sched_clock_cpu()) is
less than rq->age_stamp (introduced in 2.6.32). That results in a big value
returned by scale_rt_power(), and the resulting big group power set by
update_group_power() causes improper load balancing between busy and idle
CPUs after suspend/resume.

This made multi-threaded workloads (like kernel compilation) go slower after
a suspend/resume cycle on Core i5 laptops.

Fix this by recomputing the cyc2ns_offset's during resume, so that
sched_clock() continues from the point where it left off during suspend.

Reported-by: Florian Pritz <flo@xssn.at>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: stable@kernel.org [2.6.32+]
---

diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
index c042729..1ca132f 100644
--- a/arch/x86/include/asm/tsc.h
+++ b/arch/x86/include/asm/tsc.h
@@ -59,5 +59,7 @@ extern void check_tsc_sync_source(int cpu);
 extern void check_tsc_sync_target(void);
 
 extern int notsc_setup(char *);
+extern void save_sched_clock_state(void);
+extern void restore_sched_clock_state(void);
 
 #endif /* _ASM_X86_TSC_H */
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index ce8e502..d632934 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -626,6 +626,44 @@ static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu)
 	local_irq_restore(flags);
 }
 
+static unsigned long long cyc2ns_suspend;
+
+void save_sched_clock_state(void)
+{
+	if (!sched_clock_stable)
+		return;
+
+	cyc2ns_suspend = sched_clock();
+}
+
+/*
+ * Even on processors with an invariant TSC, the TSC gets reset in some of
+ * the ACPI system sleep states. And on some systems the BIOS seems to reinit
+ * the TSC to an arbitrary (still sync'd across cpu's) value during resume
+ * from such sleep states. To cope with this, recompute the cyc2ns_offset for
+ * each cpu so that sched_clock() continues from the point where it left off
+ * during suspend.
+ */
+void restore_sched_clock_state(void)
+{
+	unsigned long long offset;
+	unsigned long flags;
+	int cpu;
+
+	if (!sched_clock_stable)
+		return;
+
+	local_irq_save(flags);
+
+	get_cpu_var(cyc2ns_offset) = 0;
+	offset = cyc2ns_suspend - sched_clock();
+
+	for_each_possible_cpu(cpu)
+		per_cpu(cyc2ns_offset, cpu) = offset;
+
+	local_irq_restore(flags);
+}
+
 #ifdef CONFIG_CPU_FREQ
 
 /* Frequency scaling support. Adjust the TSC based timer when the cpu frequency
diff --git a/arch/x86/power/cpu.c b/arch/x86/power/cpu.c
index e7e8c5f..87bb35e 100644
--- a/arch/x86/power/cpu.c
+++ b/arch/x86/power/cpu.c
@@ -113,6 +113,7 @@ static void __save_processor_state(struct saved_context *ctxt)
 void save_processor_state(void)
 {
 	__save_processor_state(&saved_context);
+	save_sched_clock_state();
 }
 #ifdef CONFIG_X86_32
 EXPORT_SYMBOL(save_processor_state);
@@ -229,6 +230,7 @@ static void __restore_processor_state(struct saved_context *ctxt)
 void restore_processor_state(void)
 {
 	__restore_processor_state(&saved_context);
+	restore_sched_clock_state();
 }
 #ifdef CONFIG_X86_32
 EXPORT_SYMBOL(restore_processor_state);
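
To see why a single offset is enough even with the odd BIOS-written TSC
value mentioned earlier in the thread, here is a standalone sketch of the
arithmetic (illustrative numbers and names; sched_clock() is modelled as
cyc2ns(TSC) + cyc2ns_offset, with cyc2ns(TSC) folded into a single "raw"
nanosecond value):

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

static uint64_t resume_clock(uint64_t saved, uint64_t raw_after_resume)
{
	uint64_t offset = saved - raw_after_resume;	/* may wrap; that's fine */

	return raw_after_resume + offset;		/* == saved (mod 2^64) */
}

int main(void)
{
	uint64_t saved = 123400000000ULL;	/* cyc2ns_suspend: clock at suspend, ns */

	/* TSC restarted near zero after resume. */
	printf("reset to ~0:   %" PRIu64 "\n", resume_clock(saved, 98765ULL));
	/* BIOS wrote a huge value into the TSC (as seen on the T410). */
	printf("huge BIOS TSC: %" PRIu64 "\n",
	       resume_clock(saved, 0xfffffffebc22f02eULL));
	return 0;
}

Either way sched_clock() picks up exactly where it left off, so rq->clock
never goes backwards relative to rq->age_stamp.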




* Re: [patch 3/3] sched: move sched_avg_update() to update_cpu_load()
  2010-08-16 19:31       ` Peter Zijlstra
@ 2010-08-20  0:51         ` Suresh Siddha
  2010-08-20  9:25           ` Peter Zijlstra
  2010-08-20 13:47           ` Peter Zijlstra
  0 siblings, 2 replies; 23+ messages in thread
From: Suresh Siddha @ 2010-08-20  0:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, linux-kernel, chris, debian00, hpa, jonathan.protzenko,
	mans, psastudio, rjw, stephan.eicher, sxxe, thomas, venki,
	wonghow

On Mon, 2010-08-16 at 12:31 -0700, Peter Zijlstra wrote:
> On Mon, 2010-08-16 at 10:46 -0700, Suresh Siddha wrote:
> 
> > There is no guarantee that the original cpu won't be doing this in
> > parallel with nohz idle load balancing cpu.
> 
> Hmm, true.. bugger.
> 
> > > > Fix it by moving the sched_avg_update() to more appropriate update_cpu_load()
> > > > where the CFS load gets updated aswell.
> > > 
> > > Right, except it breaks things a bit, at the very least you really need
> > > that update right before reading it, otherwise you can end up with >100%
> > > fractions, which are odd indeed ;-)
> > 
> > with the patch, the update always happens before reading it. isn't it?
> > 
> > update now happens during the scheduler tick (or during nohz load
> > balancing tick). And the load balancer gets triggered with the tick.
> > So the update (at the tick) should happen before reading it (used by
> > load balancing triggered by the tick). Am I missing something?
> 
> We run the load-balancer in softirq context, on -rt that's a task, and
> we could have ran other (more important) RT tasks between the hardirq
> and the softirq running, which would increase the rt_avg and could thus
> result in >100%.
> 
> But I think we can simply retain the sched_avg_update(rq) in
> sched_rt_avg_update(), that is ran with rq->lock held and should be
> enough to avoid that case.
> 
> We can retain the other bit of you patch, moving sched_avg_update() from
> scale_rt_power() to update_cpu_load(), since that is only concerned with
> lowering the average when there is no actual activity.

Ok. Updated patch appended. Thanks.
---

From: Suresh Siddha <suresh.b.siddha@intel.com>
Subject: sched: move sched_avg_update() to update_cpu_load()

Currently sched_avg_update() (which updates the rt_avg stats in the rq) gets
called from scale_rt_power() (in the load-balance context), which doesn't take
rq->lock.

Fix it by moving sched_avg_update() to the more appropriate update_cpu_load(),
where the CFS load gets updated as well.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
---
 kernel/sched.c      |    3 ++-
 kernel/sched_fair.c |    2 --
 2 files changed, 2 insertions(+), 3 deletions(-)

Index: tree/kernel/sched_fair.c
===================================================================
--- tree.orig/kernel/sched_fair.c
+++ tree/kernel/sched_fair.c
@@ -2268,8 +2268,6 @@ unsigned long scale_rt_power(int cpu)
 	struct rq *rq = cpu_rq(cpu);
 	u64 total, available;
 
-	sched_avg_update(rq);
-
 	total = sched_avg_period() + (rq->clock - rq->age_stamp);
 	available = total - rq->rt_avg;
 
Index: tree/kernel/sched.c
===================================================================
--- tree.orig/kernel/sched.c
+++ tree/kernel/sched.c
@@ -3182,6 +3182,8 @@ static void update_cpu_load(struct rq *t
 
 		this_rq->cpu_load[i] = (old_load * (scale - 1) + new_load) >> i;
 	}
+
+	sched_avg_update(this_rq);
 }
 
 static void update_cpu_load_active(struct rq *this_rq)




* Re: [patch 1/3] sched: init rt_avg stat whenever rq comes online
  2010-08-20  0:03               ` Suresh Siddha
@ 2010-08-20  8:54                 ` Peter Zijlstra
  2010-08-20 14:16                 ` [tip:sched/urgent] x86, tsc, sched: Recompute cyc2ns_offset's during resume from sleep states tip-bot for Suresh Siddha
  1 sibling, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2010-08-20  8:54 UTC (permalink / raw)
  To: Suresh Siddha
  Cc: mingo, linux-kernel, chris, debian00, hpa, jonathan.protzenko,
	mans, psastudio, rjw, stephan.eicher, sxxe, thomas, venki,
	wonghow, stable, tglx

On Thu, 2010-08-19 at 17:03 -0700, Suresh Siddha wrote:
> Thomas/Peter/Ingo: can you please pick this up if you have no other
> objections. Thanks. 

Looks good, will queue. Thanks Suresh!


* Re: [patch 3/3] sched: move sched_avg_update() to update_cpu_load()
  2010-08-20  0:51         ` Suresh Siddha
@ 2010-08-20  9:25           ` Peter Zijlstra
  2010-08-20 13:47           ` Peter Zijlstra
  1 sibling, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2010-08-20  9:25 UTC (permalink / raw)
  To: Suresh Siddha
  Cc: mingo, linux-kernel, chris, debian00, hpa, jonathan.protzenko,
	mans, psastudio, rjw, stephan.eicher, sxxe, thomas, venki,
	wonghow

On Thu, 2010-08-19 at 17:51 -0700, Suresh Siddha wrote:
> Ok. Updated patch appended. 

Applied, Thanks Suresh!


* Re: [patch 3/3] sched: move sched_avg_update() to update_cpu_load()
  2010-08-20  0:51         ` Suresh Siddha
  2010-08-20  9:25           ` Peter Zijlstra
@ 2010-08-20 13:47           ` Peter Zijlstra
  2010-08-23 20:42             ` Suresh Siddha
  1 sibling, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2010-08-20 13:47 UTC (permalink / raw)
  To: Suresh Siddha
  Cc: mingo, linux-kernel, chris, debian00, hpa, jonathan.protzenko,
	mans, psastudio, rjw, stephan.eicher, sxxe, thomas, venki,
	wonghow

On Thu, 2010-08-19 at 17:51 -0700, Suresh Siddha wrote:
> 
> From: Suresh Siddha <suresh.b.siddha@intel.com>
> Subject: sched: move sched_avg_update() to update_cpu_load()
> 
> Currently sched_avg_update() (which updates rt_avg stats in the rq) is getting
> called from scale_rt_power() (in the load balance context) which doesn't take
> rq->lock.
> 
> Fix it by moving the sched_avg_update() to more appropriate update_cpu_load()
> where the CFS load gets updated aswell.
> 
> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
> ---
>  kernel/sched.c      |    3 ++-
>  kernel/sched_fair.c |    2 --
>  2 files changed, 2 insertions(+), 3 deletions(-)
> 
> Index: tree/kernel/sched_fair.c
> ===================================================================
> --- tree.orig/kernel/sched_fair.c
> +++ tree/kernel/sched_fair.c
> @@ -2268,8 +2268,6 @@ unsigned long scale_rt_power(int cpu)
>         struct rq *rq = cpu_rq(cpu);
>         u64 total, available;
>  
> -       sched_avg_update(rq);
> -
>         total = sched_avg_period() + (rq->clock - rq->age_stamp);
>         available = total - rq->rt_avg;
>  
> Index: tree/kernel/sched.c
> ===================================================================
> --- tree.orig/kernel/sched.c
> +++ tree/kernel/sched.c
> @@ -3182,6 +3182,8 @@ static void update_cpu_load(struct rq *t
>  
>                 this_rq->cpu_load[i] = (old_load * (scale - 1) + new_load) >> i;
>         }
> +
> +       sched_avg_update(this_rq);
>  }
>  
>  static void update_cpu_load_active(struct rq *this_rq) 

breaks CONFIG_SMP=n builds..

  CC      kernel/sched.o
/usr/src/linux-2.6/kernel/sched.c: In function ‘update_cpu_load’:
/usr/src/linux-2.6/kernel/sched.c:3186: error: implicit declaration of function ‘sched_avg_update’




* [tip:sched/urgent] x86, tsc, sched: Recompute cyc2ns_offset's during resume from sleep states
  2010-08-20  0:03               ` Suresh Siddha
  2010-08-20  8:54                 ` Peter Zijlstra
@ 2010-08-20 14:16                 ` tip-bot for Suresh Siddha
  1 sibling, 0 replies; 23+ messages in thread
From: tip-bot for Suresh Siddha @ 2010-08-20 14:16 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, flo, hpa, mingo, a.p.zijlstra, suresh.b.siddha,
	tglx, mingo

Commit-ID:  cd7240c0b900eb6d690ccee088a6c9b46dae815a
Gitweb:     http://git.kernel.org/tip/cd7240c0b900eb6d690ccee088a6c9b46dae815a
Author:     Suresh Siddha <suresh.b.siddha@intel.com>
AuthorDate: Thu, 19 Aug 2010 17:03:38 -0700
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Fri, 20 Aug 2010 14:59:02 +0200

x86, tsc, sched: Recompute cyc2ns_offset's during resume from sleep states

TSCs get reset after suspend/resume (even on CPUs with an invariant TSC
which runs at a constant rate across ACPI P-, C- and T-states). And on
some systems the BIOS seems to reinit the TSC to an arbitrary large value
(still sync'd across CPUs) during resume.

This leads to a scenario where the scheduler rq->clock (sched_clock_cpu())
is less than rq->age_stamp (introduced in 2.6.32). That results in a big
value returned by scale_rt_power(), and the resulting big group power set
by update_group_power() causes improper load balancing between busy and
idle CPUs after suspend/resume.

This made multi-threaded workloads (like kernel compilation) go slower
after a suspend/resume cycle on Core i5 laptops.

Fix this by recomputing the cyc2ns_offset's during resume, so that
sched_clock() continues from the point where it left off during suspend.

Reported-by: Florian Pritz <flo@xssn.at>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: <stable@kernel.org> # [v2.6.32+]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1282262618.2675.24.camel@sbsiddha-MOBL3.sc.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 arch/x86/include/asm/tsc.h |    2 ++
 arch/x86/kernel/tsc.c      |   38 ++++++++++++++++++++++++++++++++++++++
 arch/x86/power/cpu.c       |    2 ++
 3 files changed, 42 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
index c042729..1ca132f 100644
--- a/arch/x86/include/asm/tsc.h
+++ b/arch/x86/include/asm/tsc.h
@@ -59,5 +59,7 @@ extern void check_tsc_sync_source(int cpu);
 extern void check_tsc_sync_target(void);
 
 extern int notsc_setup(char *);
+extern void save_sched_clock_state(void);
+extern void restore_sched_clock_state(void);
 
 #endif /* _ASM_X86_TSC_H */
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index ce8e502..d632934 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -626,6 +626,44 @@ static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu)
 	local_irq_restore(flags);
 }
 
+static unsigned long long cyc2ns_suspend;
+
+void save_sched_clock_state(void)
+{
+	if (!sched_clock_stable)
+		return;
+
+	cyc2ns_suspend = sched_clock();
+}
+
+/*
+ * Even on processors with an invariant TSC, the TSC gets reset in some of
+ * the ACPI system sleep states. And on some systems the BIOS seems to reinit
+ * the TSC to an arbitrary (still sync'd across cpu's) value during resume
+ * from such sleep states. To cope with this, recompute the cyc2ns_offset for
+ * each cpu so that sched_clock() continues from the point where it left off
+ * during suspend.
+ */
+void restore_sched_clock_state(void)
+{
+	unsigned long long offset;
+	unsigned long flags;
+	int cpu;
+
+	if (!sched_clock_stable)
+		return;
+
+	local_irq_save(flags);
+
+	get_cpu_var(cyc2ns_offset) = 0;
+	offset = cyc2ns_suspend - sched_clock();
+
+	for_each_possible_cpu(cpu)
+		per_cpu(cyc2ns_offset, cpu) = offset;
+
+	local_irq_restore(flags);
+}
+
 #ifdef CONFIG_CPU_FREQ
 
 /* Frequency scaling support. Adjust the TSC based timer when the cpu frequency
diff --git a/arch/x86/power/cpu.c b/arch/x86/power/cpu.c
index e7e8c5f..87bb35e 100644
--- a/arch/x86/power/cpu.c
+++ b/arch/x86/power/cpu.c
@@ -113,6 +113,7 @@ static void __save_processor_state(struct saved_context *ctxt)
 void save_processor_state(void)
 {
 	__save_processor_state(&saved_context);
+	save_sched_clock_state();
 }
 #ifdef CONFIG_X86_32
 EXPORT_SYMBOL(save_processor_state);
@@ -229,6 +230,7 @@ static void __restore_processor_state(struct saved_context *ctxt)
 void restore_processor_state(void)
 {
 	__restore_processor_state(&saved_context);
+	restore_sched_clock_state();
 }
 #ifdef CONFIG_X86_32
 EXPORT_SYMBOL(restore_processor_state);


* Re: [patch 3/3] sched: move sched_avg_update() to update_cpu_load()
  2010-08-20 13:47           ` Peter Zijlstra
@ 2010-08-23 20:42             ` Suresh Siddha
  2010-09-09 19:45               ` [tip:sched/urgent] sched: Move " tip-bot for Suresh Siddha
  0 siblings, 1 reply; 23+ messages in thread
From: Suresh Siddha @ 2010-08-23 20:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, linux-kernel, chris, debian00, hpa, jonathan.protzenko,
	mans, psastudio, rjw, stephan.eicher, sxxe, thomas, venki,
	wonghow

On Fri, 2010-08-20 at 06:47 -0700, Peter Zijlstra wrote:
> breaks CONFIG_SMP=n builds..
> 
>   CC      kernel/sched.o
> /usr/src/linux-2.6/kernel/sched.c: In function ‘update_cpu_load’:
> /usr/src/linux-2.6/kernel/sched.c:3186: error: implicit declaration of function ‘sched_avg_update’

Updated patch appended. Thanks.
---

From: Suresh Siddha <suresh.b.siddha@intel.com>
Subject: sched: move sched_avg_update() to update_cpu_load()

Currently sched_avg_update() (which updates the rt_avg stats in the rq) gets
called from scale_rt_power() (in the load-balance context), which doesn't take
rq->lock.

Fix it by moving sched_avg_update() to the more appropriate update_cpu_load(),
where the CFS load gets updated as well.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
---
 kernel/sched.c      |    6 ++++++
 kernel/sched_fair.c |    2 --
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 41541d7..f95bbbe 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1294,6 +1294,10 @@ static void resched_task(struct task_struct *p)
 static void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
 {
 }
+
+static void sched_avg_update(struct rq *rq)
+{
+}
 #endif /* CONFIG_SMP */
 
 #if BITS_PER_LONG == 32
@@ -3182,6 +3186,8 @@ static void update_cpu_load(struct rq *this_rq)
 
 		this_rq->cpu_load[i] = (old_load * (scale - 1) + new_load) >> i;
 	}
+
+	sched_avg_update(this_rq);
 }
 
 static void update_cpu_load_active(struct rq *this_rq)
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 806d1b2..a2bf7bd 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -2268,8 +2268,6 @@ unsigned long scale_rt_power(int cpu)
 	struct rq *rq = cpu_rq(cpu);
 	u64 total, available;
 
-	sched_avg_update(rq);
-
 	total = sched_avg_period() + (rq->clock - rq->age_stamp);
 	available = total - rq->rt_avg;
 




* [tip:sched/urgent] sched: Move sched_avg_update() to update_cpu_load()
  2010-08-23 20:42             ` Suresh Siddha
@ 2010-09-09 19:45               ` tip-bot for Suresh Siddha
  0 siblings, 0 replies; 23+ messages in thread
From: tip-bot for Suresh Siddha @ 2010-09-09 19:45 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, suresh.b.siddha, tglx, mingo

Commit-ID:  da2b71edd8a7db44fe1746261410a981f3e03632
Gitweb:     http://git.kernel.org/tip/da2b71edd8a7db44fe1746261410a981f3e03632
Author:     Suresh Siddha <suresh.b.siddha@intel.com>
AuthorDate: Mon, 23 Aug 2010 13:42:51 -0700
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Thu, 9 Sep 2010 20:39:33 +0200

sched: Move sched_avg_update() to update_cpu_load()

Currently sched_avg_update() (which updates rt_avg stats in the rq)
is getting called from scale_rt_power() (in the load balance context)
which doesn't take rq->lock.

Fix it by moving the sched_avg_update() to more appropriate
update_cpu_load() where the CFS load gets updated as well.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1282596171.2694.3.camel@sbsiddha-MOBL3>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched.c      |    6 ++++++
 kernel/sched_fair.c |    2 --
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 09b574e..ed09d4f 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1294,6 +1294,10 @@ static void resched_task(struct task_struct *p)
 static void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
 {
 }
+
+static void sched_avg_update(struct rq *rq)
+{
+}
 #endif /* CONFIG_SMP */
 
 #if BITS_PER_LONG == 32
@@ -3182,6 +3186,8 @@ static void update_cpu_load(struct rq *this_rq)
 
 		this_rq->cpu_load[i] = (old_load * (scale - 1) + new_load) >> i;
 	}
+
+	sched_avg_update(this_rq);
 }
 
 static void update_cpu_load_active(struct rq *this_rq)
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index ab661eb..f53ec75 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -2268,8 +2268,6 @@ unsigned long scale_rt_power(int cpu)
 	struct rq *rq = cpu_rq(cpu);
 	u64 total, available;
 
-	sched_avg_update(rq);
-
 	total = sched_avg_period() + (rq->clock - rq->age_stamp);
 	available = total - rq->rt_avg;
 


Thread overview: 23+ messages
2010-08-13 19:45 [patch 0/3] sched: fixes related to scaling down cpu power with RT tasks Suresh Siddha
2010-08-13 19:45 ` [patch 1/3] sched: init rt_avg stat whenever rq comes online Suresh Siddha
2010-08-16  7:47   ` Peter Zijlstra
2010-08-16 17:36     ` Suresh Siddha
2010-08-16 19:25       ` Peter Zijlstra
2010-08-17  8:51         ` Peter Zijlstra
2010-08-19  0:20           ` Suresh Siddha
2010-08-19  8:53             ` Peter Zijlstra
2010-08-20  0:03               ` Suresh Siddha
2010-08-20  8:54                 ` Peter Zijlstra
2010-08-20 14:16                 ` [tip:sched/urgent] x86, tsc, sched: Recompute cyc2ns_offset's during resume from sleep states tip-bot for Suresh Siddha
2010-08-13 19:45 ` [patch 2/3] sched: fix minimum power returned by update_cpu_power() Suresh Siddha
2010-08-16  7:50   ` Peter Zijlstra
2010-08-16 17:37     ` Suresh Siddha
2010-08-13 19:45 ` [patch 3/3] sched: move sched_avg_update() to update_cpu_load() Suresh Siddha
2010-08-16  8:00   ` Peter Zijlstra
2010-08-16 17:46     ` Suresh Siddha
2010-08-16 19:31       ` Peter Zijlstra
2010-08-20  0:51         ` Suresh Siddha
2010-08-20  9:25           ` Peter Zijlstra
2010-08-20 13:47           ` Peter Zijlstra
2010-08-23 20:42             ` Suresh Siddha
2010-09-09 19:45               ` [tip:sched/urgent] sched: Move " tip-bot for Suresh Siddha
