* [REGRESSION 2.6.30][PATCH 0/1] sched: defer idle accounting till after load update period
@ 2010-03-29 13:41 Chase Douglas
  2010-03-29 13:41 ` [REGRESSION 2.6.30][PATCH 1/1] " Chase Douglas
  0 siblings, 1 reply; 17+ messages in thread
From: Chase Douglas @ 2010-03-29 13:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Peter Zijlstra, Thomas Gleixner, Andrew Morton,
	Rafael J. Wysocki

The following patch fixes a load avg calculation bug. Details are included with
the patch. Essentially, a task that often runs for less than 10 ticks at a time
is likely to be left out of the load avg calculation. A test case is provided
below. If you run the test case on a near zero-load system, you will find top
reporting 90% cpu usage while the load avg stays at or near 0.00. With the patch,
the load avg is calculated correctly to be at least 0.90.

This is a regression since 2.6.30. The offending commit is:
dce48a84adf1806676319f6f480e30a6daa012f9.

--

#include <asm/param.h>
#include <sys/time.h>
#include <time.h>

int main() {
	struct timespec ts;
	ts.tv_sec = 0;
	ts.tv_nsec = 1000000000 / HZ;

	/*
	 * Busy-loop until the tick counter hits a multiple of 10
	 * (roughly 9 ticks of spinning), then sleep for 1 tick, so the
	 * task sleeps at the same point in every 10-tick cycle.
	 */
	while (1) {
		struct timeval tv;

		do {
			gettimeofday(&tv, NULL);
		} while ((tv.tv_usec * HZ / 1000000) % 10 != 0);

		nanosleep(&ts, NULL);
	}
	return 0;
}

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [REGRESSION 2.6.30][PATCH 1/1] sched: defer idle accounting till after load update period
  2010-03-29 13:41 [REGRESSION 2.6.30][PATCH 0/1] sched: defer idle accounting till after load update period Chase Douglas
@ 2010-03-29 13:41 ` Chase Douglas
  2010-03-29 14:41   ` Peter Zijlstra
  2010-04-01 19:27   ` Andrew Morton
  0 siblings, 2 replies; 17+ messages in thread
From: Chase Douglas @ 2010-03-29 13:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Peter Zijlstra, Thomas Gleixner, Andrew Morton,
	Rafael J. Wysocki

There's a period of 10 ticks where calc_load_tasks is updated by all the
cpus for the load avg. Usually all the cpus do this during the first
tick. If any cpus go idle, calc_load_tasks is decremented accordingly.
However, if they wake up calc_load_tasks is not incremented. Thus, if
cpus go idle during the 10 tick period, calc_load_tasks may be
decremented to a non-representative value. This issue can lead to
systems having a load avg of exactly 0, even though the real load avg
could theoretically be up to NR_CPUS.

This change defers calc_load_tasks accounting after each cpu updates the
count until after the 10 tick period.

BugLink: http://bugs.launchpad.net/bugs/513848

Signed-off-by: Chase Douglas <chase.douglas@canonical.com>
---
 kernel/sched.c |   16 ++++++++++++++--
 1 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 9ab3cd7..c0aedac 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3064,7 +3064,8 @@ void calc_global_load(void)
  */
 static void calc_load_account_active(struct rq *this_rq)
 {
-	long nr_active, delta;
+	static atomic_long_t deferred;
+	long nr_active, delta, deferred_delta;
 
 	nr_active = this_rq->nr_running;
 	nr_active += (long) this_rq->nr_uninterruptible;
@@ -3072,6 +3073,17 @@ static void calc_load_account_active(struct rq *this_rq)
 	if (nr_active != this_rq->calc_load_active) {
 		delta = nr_active - this_rq->calc_load_active;
 		this_rq->calc_load_active = nr_active;
+
+		/* Need to defer idle accounting during load update period: */
+		if (unlikely(time_before(jiffies, this_rq->calc_load_update) &&
+			     time_after_eq(jiffies, calc_load_update))) {
+			atomic_long_add(delta, &deferred);
+			return;
+		}
+
+		deferred_delta = atomic_long_xchg(&deferred, 0);
+		delta += deferred_delta;
+
 		atomic_long_add(delta, &calc_load_tasks);
 	}
 }
@@ -3106,8 +3118,8 @@ static void update_cpu_load(struct rq *this_rq)
 	}
 
 	if (time_after_eq(jiffies, this_rq->calc_load_update)) {
-		this_rq->calc_load_update += LOAD_FREQ;
 		calc_load_account_active(this_rq);
+		this_rq->calc_load_update += LOAD_FREQ;
 	}
 }
 
-- 
1.6.3.3



* Re: [REGRESSION 2.6.30][PATCH 1/1] sched: defer idle accounting till after load update period
  2010-03-29 13:41 ` [REGRESSION 2.6.30][PATCH 1/1] " Chase Douglas
@ 2010-03-29 14:41   ` Peter Zijlstra
  2010-03-29 17:20     ` Chase Douglas
  2010-04-01 19:27   ` Andrew Morton
  1 sibling, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2010-03-29 14:41 UTC (permalink / raw)
  To: Chase Douglas
  Cc: linux-kernel, Ingo Molnar, Thomas Gleixner, Andrew Morton,
	Rafael J. Wysocki

On Mon, 2010-03-29 at 09:41 -0400, Chase Douglas wrote:
> There's a period of 10 ticks where calc_load_tasks is updated by all the
> cpus for the load avg. Usually all the cpus do this during the first
> tick. If any cpus go idle, calc_load_tasks is decremented accordingly.
> However, if they wake up calc_load_tasks is not incremented. Thus, if
> cpus go idle during the 10 tick period, calc_load_tasks may be
> decremented to a non-representative value. This issue can lead to
> systems having a load avg of exactly 0, even though the real load avg
> could theoretically be up to NR_CPUS.
> 
> This change defers calc_load_tasks accounting after each cpu updates the
> count until after the 10 tick period.

From reading the above changelog it seems to me there should be a
callback from leaving nohz mode, your proposed patch has no such thing.



* Re: [REGRESSION 2.6.30][PATCH 1/1] sched: defer idle accounting till  after load update period
  2010-03-29 14:41   ` Peter Zijlstra
@ 2010-03-29 17:20     ` Chase Douglas
  0 siblings, 0 replies; 17+ messages in thread
From: Chase Douglas @ 2010-03-29 17:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Thomas Gleixner, Andrew Morton,
	Rafael J. Wysocki

On Mon, Mar 29, 2010 at 10:41 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, 2010-03-29 at 09:41 -0400, Chase Douglas wrote:
>> There's a period of 10 ticks where calc_load_tasks is updated by all the
>> cpus for the load avg. Usually all the cpus do this during the first
>> tick. If any cpus go idle, calc_load_tasks is decremented accordingly.
>> However, if they wake up calc_load_tasks is not incremented. Thus, if
>> cpus go idle during the 10 tick period, calc_load_tasks may be
>> decremented to a non-representative value. This issue can lead to
>> systems having a load avg of exactly 0, even though the real load avg
>> could theoretically be up to NR_CPUS.
>>
>> This change defers calc_load_tasks accounting after each cpu updates the
>> count until after the 10 tick period.
>
> From reading the above changelog it seems to me there should be a
> callback from leaving nohz mode, your proposed patch has no such thing.

I believe what you're implying is that there should be a corresponding
call for when a cpu is awakened to counter the accounting when a cpu
goes to sleep during the 10 tick window. I don't think this is the
correct approach because the load avg should be a snapshot in time. If
there were 10 runnable tasks at the beginning of the load calculation
window, then we should account for 10 tasks. If they all get serviced
and the cpu goes to sleep, then wakes back up with one runnable task,
we would account for only one task instead of the original 10. What if
9 more tasks then get added to the cpu's rq during the 10 tick update
period? None of them would be accounted for either.

-- Chase


* Re: [REGRESSION 2.6.30][PATCH 1/1] sched: defer idle accounting till after load update period
  2010-03-29 13:41 ` [REGRESSION 2.6.30][PATCH 1/1] " Chase Douglas
  2010-03-29 14:41   ` Peter Zijlstra
@ 2010-04-01 19:27   ` Andrew Morton
  2010-04-01 19:37     ` Thomas Gleixner
  2010-04-01 19:56     ` Chase Douglas
  1 sibling, 2 replies; 17+ messages in thread
From: Andrew Morton @ 2010-04-01 19:27 UTC (permalink / raw)
  To: Chase Douglas
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Rafael J. Wysocki

On Mon, 29 Mar 2010 09:41:12 -0400
Chase Douglas <chase.douglas@canonical.com> wrote:

> There's a period of 10 ticks where calc_load_tasks is updated by all the
> cpus for the load avg. Usually all the cpus do this during the first
> tick. If any cpus go idle, calc_load_tasks is decremented accordingly.
> However, if they wake up calc_load_tasks is not incremented. Thus, if
> cpus go idle during the 10 tick period, calc_load_tasks may be
> decremented to a non-representative value. This issue can lead to
> systems having a load avg of exactly 0, even though the real load avg
> could theoretically be up to NR_CPUS.
> 
> This change defers calc_load_tasks accounting after each cpu updates the
> count until after the 10 tick period.
> 
> BugLink: http://bugs.launchpad.net/bugs/513848
> 

There was useful information in the [patch 0/1] email, such as the
offending commit ID.  If possible, it's best to avoid the [patch 0/n]
thing altogether - that information either has to be moved into the
[patch 1/n] changelog by someone (ie: me), or it just gets omitted and
lost.


> ---
>  kernel/sched.c |   16 ++++++++++++++--
>  1 files changed, 14 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 9ab3cd7..c0aedac 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -3064,7 +3064,8 @@ void calc_global_load(void)
>   */
>  static void calc_load_account_active(struct rq *this_rq)
>  {
> -	long nr_active, delta;
> +	static atomic_long_t deferred;
> +	long nr_active, delta, deferred_delta;
>  
>  	nr_active = this_rq->nr_running;
>  	nr_active += (long) this_rq->nr_uninterruptible;
> @@ -3072,6 +3073,17 @@ static void calc_load_account_active(struct rq *this_rq)
>  	if (nr_active != this_rq->calc_load_active) {
>  		delta = nr_active - this_rq->calc_load_active;
>  		this_rq->calc_load_active = nr_active;
> +
> +		/* Need to defer idle accounting during load update period: */
> +		if (unlikely(time_before(jiffies, this_rq->calc_load_update) &&
> +			     time_after_eq(jiffies, calc_load_update))) {
> +			atomic_long_add(delta, &deferred);
> +			return;
> +		}

That seems a sensible way to avoid the gross-looking "10 ticks" thing.

What was the reason for "update the avenrun load estimates 10 ticks
after the CPUs have updated calc_load_tasks"?  Can we do something
smarter there to fix this?

> +		deferred_delta = atomic_long_xchg(&deferred, 0);
> +		delta += deferred_delta;
> +
>  		atomic_long_add(delta, &calc_load_tasks);
>  	}
>  }

The global `deferred' is unfortunate from a design and possibly
scalability POV.  Can it be moved into the `struct rq'?  That way it
can become a plain old `unsigned long', too.



* Re: [REGRESSION 2.6.30][PATCH 1/1] sched: defer idle accounting till after load update period
  2010-04-01 19:27   ` Andrew Morton
@ 2010-04-01 19:37     ` Thomas Gleixner
  2010-04-01 20:00       ` Chase Douglas
  2010-04-01 19:56     ` Chase Douglas
  1 sibling, 1 reply; 17+ messages in thread
From: Thomas Gleixner @ 2010-04-01 19:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Chase Douglas, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Rafael J. Wysocki

On Thu, 1 Apr 2010, Andrew Morton wrote:
> On Mon, 29 Mar 2010 09:41:12 -0400
> Chase Douglas <chase.douglas@canonical.com> wrote:
> 
> > There's a period of 10 ticks where calc_load_tasks is updated by all the
> > cpus for the load avg. Usually all the cpus do this during the first
> > tick. If any cpus go idle, calc_load_tasks is decremented accordingly.
> > However, if they wake up calc_load_tasks is not incremented. Thus, if
> > cpus go idle during the 10 tick period, calc_load_tasks may be
> > decremented to a non-representative value. This issue can lead to
> > systems having a load avg of exactly 0, even though the real load avg
> > could theoretically be up to NR_CPUS.
> > 
> > This change defers calc_load_tasks accounting after each cpu updates the
> > count until after the 10 tick period.
> > 
> > BugLink: http://bugs.launchpad.net/bugs/513848
> > 
> 
> There was useful information in the [patch 0/1] email, such as the
> offending commit ID.  If possible, it's best to avoid the [patch 0/n]
> thing altogether - that information either has to be moved into the
> [patch 1/n] changelog by someone (ie: me), or it just gets omitted and
> lost.
> 
> 
> > ---
> >  kernel/sched.c |   16 ++++++++++++++--
> >  1 files changed, 14 insertions(+), 2 deletions(-)
> > 
> > diff --git a/kernel/sched.c b/kernel/sched.c
> > index 9ab3cd7..c0aedac 100644
> > --- a/kernel/sched.c
> > +++ b/kernel/sched.c
> > @@ -3064,7 +3064,8 @@ void calc_global_load(void)
> >   */
> >  static void calc_load_account_active(struct rq *this_rq)
> >  {
> > -	long nr_active, delta;
> > +	static atomic_long_t deferred;
> > +	long nr_active, delta, deferred_delta;
> >  
> >  	nr_active = this_rq->nr_running;
> >  	nr_active += (long) this_rq->nr_uninterruptible;
> > @@ -3072,6 +3073,17 @@ static void calc_load_account_active(struct rq *this_rq)
> >  	if (nr_active != this_rq->calc_load_active) {
> >  		delta = nr_active - this_rq->calc_load_active;
> >  		this_rq->calc_load_active = nr_active;
> > +
> > +		/* Need to defer idle accounting during load update period: */
> > +		if (unlikely(time_before(jiffies, this_rq->calc_load_update) &&
> > +			     time_after_eq(jiffies, calc_load_update))) {
> > +			atomic_long_add(delta, &deferred);
> > +			return;
> > +		}
> 
> That seems a sensible way to avoid the gross-looking "10 ticks" thing.
> 
> What was the reason for "update the avenrun load estimates 10 ticks
> after the CPUs have updated calc_load_tasks"?  Can we do something
> smarter there to fix this?

The reason was to make sure that all cpus have updated their stuff
halfway before we start to calculate, and we spread out the work among
cpus to avoid contention on those global atomic variables.

Yes, we can do something smarter. First thing is to fix the
nr_uninterruptible accounting which can become negative. Peter looked
into that already and we really need to fix this first before doing
anything smart about that load avg stuff.

Thanks,

	tglx



* Re: [REGRESSION 2.6.30][PATCH 1/1] sched: defer idle accounting till  after load update period
  2010-04-01 19:27   ` Andrew Morton
  2010-04-01 19:37     ` Thomas Gleixner
@ 2010-04-01 19:56     ` Chase Douglas
  2010-04-01 20:01       ` Thomas Gleixner
  1 sibling, 1 reply; 17+ messages in thread
From: Chase Douglas @ 2010-04-01 19:56 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Rafael J. Wysocki, kernel-team

On Thu, Apr 1, 2010 at 3:27 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
> On Mon, 29 Mar 2010 09:41:12 -0400
> Chase Douglas <chase.douglas@canonical.com> wrote:
>
>> There's a period of 10 ticks where calc_load_tasks is updated by all the
>> cpus for the load avg. Usually all the cpus do this during the first
>> tick. If any cpus go idle, calc_load_tasks is decremented accordingly.
>> However, if they wake up calc_load_tasks is not incremented. Thus, if
>> cpus go idle during the 10 tick period, calc_load_tasks may be
>> decremented to a non-representative value. This issue can lead to
>> systems having a load avg of exactly 0, even though the real load avg
>> could theoretically be up to NR_CPUS.
>>
>> This change defers calc_load_tasks accounting after each cpu updates the
>> count until after the 10 tick period.
>>
>> BugLink: http://bugs.launchpad.net/bugs/513848
>>
>
> There was useful information in the [patch 0/1] email, such as the
> offending commit ID.  If possible, it's best to avoid the [patch 0/n]
> thing altogether - that information either has to be moved into the
> [patch 1/n] changelog by someone (ie: me), or it just gets omitted and
> lost.

I'll be sure to fix that up in any future versions.

>
>> ---
>>  kernel/sched.c |   16 ++++++++++++++--
>>  1 files changed, 14 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/sched.c b/kernel/sched.c
>> index 9ab3cd7..c0aedac 100644
>> --- a/kernel/sched.c
>> +++ b/kernel/sched.c
>> @@ -3064,7 +3064,8 @@ void calc_global_load(void)
>>   */
>>  static void calc_load_account_active(struct rq *this_rq)
>>  {
>> -     long nr_active, delta;
>> +     static atomic_long_t deferred;
>> +     long nr_active, delta, deferred_delta;
>>
>>       nr_active = this_rq->nr_running;
>>       nr_active += (long) this_rq->nr_uninterruptible;
>> @@ -3072,6 +3073,17 @@ static void calc_load_account_active(struct rq *this_rq)
>>       if (nr_active != this_rq->calc_load_active) {
>>               delta = nr_active - this_rq->calc_load_active;
>>               this_rq->calc_load_active = nr_active;
>> +
>> +             /* Need to defer idle accounting during load update period: */
>> +             if (unlikely(time_before(jiffies, this_rq->calc_load_update) &&
>> +                          time_after_eq(jiffies, calc_load_update))) {
>> +                     atomic_long_add(delta, &deferred);
>> +                     return;
>> +             }
>
> That seems a sensible way to avoid the gross-looking "10 ticks" thing.
>
> What was the reason for "update the avenrun load estimates 10 ticks
> after the CPUs have updated calc_load_tasks"?  Can we do something
> smarter there to fix this?
>
>> +             deferred_delta = atomic_long_xchg(&deferred, 0);
>> +             delta += deferred_delta;
>> +
>>               atomic_long_add(delta, &calc_load_tasks);
>>       }
>>  }
>
> The global `deferred' is unfortunate from a design and possibly
> scalability POV.  Can it be moved into the `struct rq'?  That way it
> can become a plain old `unsigned long', too.

The reason it's global is that any given cpu may go NOHZ idle. If it
sleeps through the next load accounting, its deferred value won't get
accounted for. By keeping the count global and having it checked by
every cpu, we ensure that even in the worst case, where only one cpu is
awake to do the accounting, the deferred values from all cpus are
taken into account.

After a review by Andy Whitcroft on the Ubuntu kernel team, we decided
to do a few things to improve performance:

1. Check whether the deferred value is non-zero before touching it
atomically (load avg accounting is imprecise anyway; we just need to
make sure we don't lose any tasks)
2. Rename the deferred variable to calc_load_tasks_deferred and move
it right after calc_load_tasks to hopefully catch it in the same cache
line

Thomas Gleixner seems to think this isn't the best approach (later
email in this thread), so I'm deferring sending it unless someone is
still interested in this approach.

-- Chase


* Re: [REGRESSION 2.6.30][PATCH 1/1] sched: defer idle accounting till  after load update period
  2010-04-01 19:37     ` Thomas Gleixner
@ 2010-04-01 20:00       ` Chase Douglas
  2010-04-01 20:18         ` Thomas Gleixner
  0 siblings, 1 reply; 17+ messages in thread
From: Chase Douglas @ 2010-04-01 20:00 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andrew Morton, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Rafael J. Wysocki, kernel-team

On Thu, Apr 1, 2010 at 3:37 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Thu, 1 Apr 2010, Andrew Morton wrote:
>> On Mon, 29 Mar 2010 09:41:12 -0400
>> Chase Douglas <chase.douglas@canonical.com> wrote:
>>
>> > There's a period of 10 ticks where calc_load_tasks is updated by all the
>> > cpus for the load avg. Usually all the cpus do this during the first
>> > tick. If any cpus go idle, calc_load_tasks is decremented accordingly.
>> > However, if they wake up calc_load_tasks is not incremented. Thus, if
>> > cpus go idle during the 10 tick period, calc_load_tasks may be
>> > decremented to a non-representative value. This issue can lead to
>> > systems having a load avg of exactly 0, even though the real load avg
>> > could theoretically be up to NR_CPUS.
>> >
>> > This change defers calc_load_tasks accounting after each cpu updates the
>> > count until after the 10 tick period.
>> >
>> > BugLink: http://bugs.launchpad.net/bugs/513848
>> >
>>
>> There was useful information in the [patch 0/1] email, such as the
>> offending commit ID.  If possible, it's best to avoid the [patch 0/n]
>> thing altogether - that information either has to be moved into the
>> [patch 1/n] changelog by someone (ie: me), or it just gets omitted and
>> lost.
>>
>>
>> > ---
>> >  kernel/sched.c |   16 ++++++++++++++--
>> >  1 files changed, 14 insertions(+), 2 deletions(-)
>> >
>> > diff --git a/kernel/sched.c b/kernel/sched.c
>> > index 9ab3cd7..c0aedac 100644
>> > --- a/kernel/sched.c
>> > +++ b/kernel/sched.c
>> > @@ -3064,7 +3064,8 @@ void calc_global_load(void)
>> >   */
>> >  static void calc_load_account_active(struct rq *this_rq)
>> >  {
>> > -   long nr_active, delta;
>> > +   static atomic_long_t deferred;
>> > +   long nr_active, delta, deferred_delta;
>> >
>> >     nr_active = this_rq->nr_running;
>> >     nr_active += (long) this_rq->nr_uninterruptible;
>> > @@ -3072,6 +3073,17 @@ static void calc_load_account_active(struct rq *this_rq)
>> >     if (nr_active != this_rq->calc_load_active) {
>> >             delta = nr_active - this_rq->calc_load_active;
>> >             this_rq->calc_load_active = nr_active;
>> > +
>> > +           /* Need to defer idle accounting during load update period: */
>> > +           if (unlikely(time_before(jiffies, this_rq->calc_load_update) &&
>> > +                        time_after_eq(jiffies, calc_load_update))) {
>> > +                   atomic_long_add(delta, &deferred);
>> > +                   return;
>> > +           }
>>
>> That seems a sensible way to avoid the gross-looking "10 ticks" thing.
>>
>> What was the reason for "update the avenrun load estimates 10 ticks
>> after the CPUs have updated calc_load_tasks"?  Can we do something
>> smarter there to fix this?
>
> The reason was to make sure that all cpus have updated their stuff
> halfways before we start to calculate and we spread out stuff among
> cpus to avoid contention on those global atomic variables.
>
> Yes, we can do something smarter. First thing is to fix the
> nr_uninterruptible accounting which can become negative. Peter looked
> into that already and we really need to fix this first before doing
> anything smart about that load avg stuff.

I noticed this too, and figured it was some clever accounting :). If
this is a real bug, I'd be happy to take a look at it as well.

What do you have in mind as a smarter way of fixing this issue, and
why is it dependent on fixing the absolute values of each runqueue's
nr_uninterruptible count?

-- Chase


* Re: [REGRESSION 2.6.30][PATCH 1/1] sched: defer idle accounting till after load update period
  2010-04-01 19:56     ` Chase Douglas
@ 2010-04-01 20:01       ` Thomas Gleixner
  2010-04-13 20:39         ` Andrew Morton
  0 siblings, 1 reply; 17+ messages in thread
From: Thomas Gleixner @ 2010-04-01 20:01 UTC (permalink / raw)
  To: Chase Douglas
  Cc: Andrew Morton, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Rafael J. Wysocki, kernel-team

On Thu, 1 Apr 2010, Chase Douglas wrote:
> On Thu, Apr 1, 2010 at 3:27 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> Thomas Gleixner seems to think this isn't the best approach (later
> email in this thread), so I'm deferring sending it unless someone is
> still interested in this approach.

Well, it's a good workaround for now I think, I just answered Andrews
question whether we can't do any better :)

Thanks,

	tglx


* Re: [REGRESSION 2.6.30][PATCH 1/1] sched: defer idle accounting till after load update period
  2010-04-01 20:00       ` Chase Douglas
@ 2010-04-01 20:18         ` Thomas Gleixner
  2010-04-01 20:32           ` Chase Douglas
  2010-04-02  7:59           ` Peter Zijlstra
  0 siblings, 2 replies; 17+ messages in thread
From: Thomas Gleixner @ 2010-04-01 20:18 UTC (permalink / raw)
  To: Chase Douglas
  Cc: Andrew Morton, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Rafael J. Wysocki, kernel-team

On Thu, 1 Apr 2010, Chase Douglas wrote:
> On Thu, Apr 1, 2010 at 3:37 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > Yes, we can do something smarter. First thing is to fix the
> > nr_uninterruptible accounting which can become negative. Peter looked
> > into that already and we really need to fix this first before doing
> > anything smart about that load avg stuff.
> 
> I noticed this too, and figured it was some clever accounting :). If
> this is a real bug, I'd be happy to take a look at it as well.
> 
> What do you have in mind as a smarter way of fixing this issue, and
> why is it dependent on fixing the absolute values of each runqueue's
> nr_uninterruptible count?

Well, the uninterruptible count imbalance is preventing us from calculating the
load avg per cpu, which is something the power saving folks are
interested in.

If that is fixed we probably have a good chance to collect the per cpu
stuff and build a combined one - have not looked into the gory details
of the math yet.

Also we need to think about a more clever way than just accounting the
number of running and uninterruptible tasks. What we really want is to
use the numbers which the scheduler has handy, i.e how many tasks did
we run since we did the last loadavg calculation. It was always
possible (even before the big loadavg changes) to create a task which
consumes 50% of a CPU and never shows up in the loadavg calculations
at all.

Thanks,

	tglx


* Re: [REGRESSION 2.6.30][PATCH 1/1] sched: defer idle accounting till  after load update period
  2010-04-01 20:18         ` Thomas Gleixner
@ 2010-04-01 20:32           ` Chase Douglas
  2010-04-01 20:37             ` Thomas Gleixner
  2010-04-02  7:59           ` Peter Zijlstra
  1 sibling, 1 reply; 17+ messages in thread
From: Chase Douglas @ 2010-04-01 20:32 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andrew Morton, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Rafael J. Wysocki, kernel-team

On Thu, Apr 1, 2010 at 4:18 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> Also we need to think about a more clever way than just accounting the
> number of running and uninterruptible tasks. What we really want is to
> use the numbers which the scheduler has handy, i.e how many tasks did
> we run since we did the last loadavg calculation. It was always
> possible (even before the big loadavg changes) to create a task which
> consumes 50% of a CPU and never shows up in the loadavg calculations
> at all.

I'm not sure I follow how knowing how many tasks have been run since
the last LOAD_FREQ expiration will help, or is that just an example of
the kind of data the scheduler has available that may be useful?

-- Chase


* Re: [REGRESSION 2.6.30][PATCH 1/1] sched: defer idle accounting till after load update period
  2010-04-01 20:32           ` Chase Douglas
@ 2010-04-01 20:37             ` Thomas Gleixner
  0 siblings, 0 replies; 17+ messages in thread
From: Thomas Gleixner @ 2010-04-01 20:37 UTC (permalink / raw)
  To: Chase Douglas
  Cc: Andrew Morton, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Rafael J. Wysocki, kernel-team

On Thu, 1 Apr 2010, Chase Douglas wrote:

> On Thu, Apr 1, 2010 at 4:18 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > Also we need to think about a more clever way than just accounting the
> > number of running and uninterruptible tasks. What we really want is to
> > use the numbers which the scheduler has handy, i.e how many tasks did
> > we run since we did the last loadavg calculation. It was always
> > possible (even before the big loadavg changes) to create a task which
> > consumes 50% of a CPU and never shows up in the loadavg calculations
> > at all.
> 
> I'm not sure I follow how knowing how many tasks have been run since
> the last LOAD_FREQ expiration will help, or is that just an example of
> the kind of data the scheduler has available that may be useful?

Yes, it's just an example. Look at the following (UP) scenario:

-> tick	calcs loadavg

-> task is woken and runs for 50% of a tick
-> task goes to sleep

-> tick	calcs loadavg

So loadavg will say 0 forever, while we in reality know that there was
significant work on that CPU - at least when we have a sched_clock
which has a better granularity than the tick itself.

Thanks,

	tglx


* Re: [REGRESSION 2.6.30][PATCH 1/1] sched: defer idle accounting till after load update period
  2010-04-01 20:18         ` Thomas Gleixner
  2010-04-01 20:32           ` Chase Douglas
@ 2010-04-02  7:59           ` Peter Zijlstra
  2010-04-05 14:44             ` Chase Douglas
  1 sibling, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2010-04-02  7:59 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Chase Douglas, Andrew Morton, linux-kernel, Ingo Molnar,
	Rafael J. Wysocki, kernel-team

On Thu, 2010-04-01 at 22:18 +0200, Thomas Gleixner wrote:
> Well, the uninterruptible count imbalance is preventing us to calc the
> load avg per cpu, which is something the power saving folks are
> interested in.
> 
> If that is fixed we probably have a good chance to collect the per cpu
> stuff and build a combined one - have not looked into the gory details
> of the math yet.

The problem with all this per-cpu loadavg is that it would require a
steady tick on each cpu to age this loadavg, which is quite at odds
with power saving, as those folks rather like the nohz feature.

No idea yet on what to do about that.


* Re: [REGRESSION 2.6.30][PATCH 1/1] sched: defer idle accounting till  after load update period
  2010-04-02  7:59           ` Peter Zijlstra
@ 2010-04-05 14:44             ` Chase Douglas
  0 siblings, 0 replies; 17+ messages in thread
From: Chase Douglas @ 2010-04-05 14:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Andrew Morton, linux-kernel, Ingo Molnar,
	Rafael J. Wysocki, kernel-team

On Fri, Apr 2, 2010 at 3:59 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, 2010-04-01 at 22:18 +0200, Thomas Gleixner wrote:
>> Well, the uninterruptible count imbalance is preventing us to calc the
>> load avg per cpu, which is something the power saving folks are
>> interested in.
>>
>> If that is fixed we probably have a good chance to collect the per cpu
>> stuff and build a combined one - have not looked into the gory details
>> of the math yet.
>
> The problem with all this per-cpu loadavg is that it would require a
> steady tick on each cpu to age this loadavg, which is quite in
> contradiction with power savings as they rather like the nohz feature.
>
> No idea yet on what to do about that.

What if we stored the time at which a cpu goes idle in its rq? Then,
when the time comes around to calculate the load avg one cpu should be
able to scan the run queues of all the cpus and use their load avg if
not idle, or use their load avg + idle timestamp math if idle.

-- Chase


* Re: [REGRESSION 2.6.30][PATCH 1/1] sched: defer idle accounting till after load update period
  2010-04-01 20:01       ` Thomas Gleixner
@ 2010-04-13 20:39         ` Andrew Morton
  2010-04-13 21:02           ` Thomas Gleixner
  0 siblings, 1 reply; 17+ messages in thread
From: Andrew Morton @ 2010-04-13 20:39 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Chase Douglas, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Rafael J. Wysocki, kernel-team

On Thu, 1 Apr 2010 22:01:59 +0200 (CEST)
Thomas Gleixner <tglx@linutronix.de> wrote:

> On Thu, 1 Apr 2010, Chase Douglas wrote:
> > On Thu, Apr 1, 2010 at 3:27 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
> > 
> > Thomas Gleixner seems to think this isn't the best approach (later
> > email in this thread), so I'm deferring sending it unless someone is
> > still interested in this approach.
> 
> Well, it's a good workaround for now I think, I just answered Andrew's
> question whether we can't do any better :)
> 

So... should we merge Chase's patch?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [REGRESSION 2.6.30][PATCH 1/1] sched: defer idle accounting till after load update period
  2010-04-13 20:39         ` Andrew Morton
@ 2010-04-13 21:02           ` Thomas Gleixner
  2010-04-13 21:06             ` Chase Douglas
  0 siblings, 1 reply; 17+ messages in thread
From: Thomas Gleixner @ 2010-04-13 21:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Chase Douglas, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Rafael J. Wysocki, kernel-team

On Tue, 13 Apr 2010, Andrew Morton wrote:

> On Thu, 1 Apr 2010 22:01:59 +0200 (CEST)
> Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> > On Thu, 1 Apr 2010, Chase Douglas wrote:
> > > On Thu, Apr 1, 2010 at 3:27 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
> > > 
> > > Thomas Gleixner seems to think this isn't the best approach (later
> > > email in this thread), so I'm deferring sending it unless someone is
> > > still interested in this approach.
> > 
> > Well, it's a good workaround for now I think, I just answered Andrew's
> > question whether we can't do any better :)
> > 
> 
> So... should we merge Chase's patch?

I have no better solution right now. Peter ?

Thanks,

	tglx
 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [REGRESSION 2.6.30][PATCH 1/1] sched: defer idle accounting till  after load update period
  2010-04-13 21:02           ` Thomas Gleixner
@ 2010-04-13 21:06             ` Chase Douglas
  0 siblings, 0 replies; 17+ messages in thread
From: Chase Douglas @ 2010-04-13 21:06 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andrew Morton, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Rafael J. Wysocki, kernel-team

On Tue, Apr 13, 2010 at 2:02 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Tue, 13 Apr 2010, Andrew Morton wrote:
>
>> On Thu, 1 Apr 2010 22:01:59 +0200 (CEST)
>> Thomas Gleixner <tglx@linutronix.de> wrote:
>>
>> > On Thu, 1 Apr 2010, Chase Douglas wrote:
>> > > On Thu, Apr 1, 2010 at 3:27 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
>> > >
>> > > Thomas Gleixner seems to think this isn't the best approach (later
>> > > email in this thread), so I'm deferring sending it unless someone is
>> > > still interested in this approach.
>> >
>> > Well, it's a good workaround for now I think, I just answered Andrew's
>> > question whether we can't do any better :)
>> >
>>
>> So... should we merge Chase's patch?
>
> I have no better solution right now. Peter ?

I've made one small change to it. Checking the atomic .counter field
directly probably isn't the most portable approach, while
atomic_long_read() should be portable and add no overhead on sane
platforms, so I swapped the direct .counter check for an
atomic_long_read(). I'll resend it shortly.

-- Chase

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2010-04-13 21:06 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-03-29 13:41 [REGRESSION 2.6.30][PATCH 0/1] sched: defer idle accounting till after load update period Chase Douglas
2010-03-29 13:41 ` [REGRESSION 2.6.30][PATCH 1/1] " Chase Douglas
2010-03-29 14:41   ` Peter Zijlstra
2010-03-29 17:20     ` Chase Douglas
2010-04-01 19:27   ` Andrew Morton
2010-04-01 19:37     ` Thomas Gleixner
2010-04-01 20:00       ` Chase Douglas
2010-04-01 20:18         ` Thomas Gleixner
2010-04-01 20:32           ` Chase Douglas
2010-04-01 20:37             ` Thomas Gleixner
2010-04-02  7:59           ` Peter Zijlstra
2010-04-05 14:44             ` Chase Douglas
2010-04-01 19:56     ` Chase Douglas
2010-04-01 20:01       ` Thomas Gleixner
2010-04-13 20:39         ` Andrew Morton
2010-04-13 21:02           ` Thomas Gleixner
2010-04-13 21:06             ` Chase Douglas
