* scheduler oddity [bug?]
@ 2009-03-07 17:47 Balazs Scheidler
  2009-03-07 18:47 ` Balazs Scheidler
  2009-03-08  9:42 ` Mike Galbraith
  0 siblings, 2 replies; 42+ messages in thread
From: Balazs Scheidler @ 2009-03-07 17:47 UTC (permalink / raw)
  To: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2917 bytes --]

Hi,

I'm experiencing an odd behaviour from the Linux scheduler. I have an
application that feeds data to another process using a pipe. Both
processes use a fair amount of CPU time apart from writing to/reading
from this pipe.

The machine I'm running on is an AMD Opteron quad-core CPU:
model name	: Quad-Core AMD Opteron(tm) Processor 2347 HE
stepping	: 3

What I see is that only one of the cores is used; the other three are
idling without doing any work. If I explicitly set the CPU affinity of
the processes to distinct CPUs, performance goes up
significantly (i.e. the other cores get used and the load
scales linearly).

I've tried to reproduce the problem by writing a small test program,
which you can find attached. The program creates two processes; one
feeds the other through a pipe, and each does a series of memset() calls
to simulate CPU load. I've also added the ability for the program to set
its own CPU affinity. The results (higher is better):

Without enabling CPU affinity:
$ ./a.out
Check: 0 loops/sec, sum: 1 
Check: 12 loops/sec, sum: 13 
Check: 41 loops/sec, sum: 54 
Check: 41 loops/sec, sum: 95 
Check: 41 loops/sec, sum: 136 
Check: 41 loops/sec, sum: 177 
Check: 41 loops/sec, sum: 218 
Check: 40 loops/sec, sum: 258 
Check: 41 loops/sec, sum: 299 
Check: 41 loops/sec, sum: 340 
Check: 41 loops/sec, sum: 381 
Check: 41 loops/sec, sum: 422 
Check: 41 loops/sec, sum: 463 
Check: 41 loops/sec, sum: 504 
Check: 41 loops/sec, sum: 545 
Check: 40 loops/sec, sum: 585 
Check: 41 loops/sec, sum: 626 
Check: 41 loops/sec, sum: 667 
Check: 41 loops/sec, sum: 708 
Check: 41 loops/sec, sum: 749 
Check: 41 loops/sec, sum: 790 
Check: 41 loops/sec, sum: 831 
Final: 39 loops/sec, sum: 831


With CPU affinity:
# ./a.out 1
Check: 0 loops/sec, sum: 1 
Check: 41 loops/sec, sum: 42 
Check: 49 loops/sec, sum: 91 
Check: 49 loops/sec, sum: 140 
Check: 49 loops/sec, sum: 189 
Check: 49 loops/sec, sum: 238 
Check: 49 loops/sec, sum: 287 
Check: 50 loops/sec, sum: 337 
Check: 49 loops/sec, sum: 386 
Check: 49 loops/sec, sum: 435 
Check: 49 loops/sec, sum: 484 
Check: 49 loops/sec, sum: 533 
Check: 49 loops/sec, sum: 582 
Check: 49 loops/sec, sum: 631 
Check: 49 loops/sec, sum: 680 
Check: 49 loops/sec, sum: 729 
Check: 49 loops/sec, sum: 778 
Check: 49 loops/sec, sum: 827 
Check: 49 loops/sec, sum: 876 
Check: 49 loops/sec, sum: 925 
Check: 50 loops/sec, sum: 975 
Check: 49 loops/sec, sum: 1024 
Final: 48 loops/sec, sum: 1024

The difference is about 20%, which matches the share of work done by
the slave process: its 20000 memsets per iteration are one fifth of the
writer's 100000. If the two processes compete for the same CPU, this
20% of performance is lost.

I've tested this on 3 computers and each showed the same symptoms:
 * quad core Opteron, running Ubuntu kernel 2.6.27-13.29
 * Core 2 Duo, running Ubuntu kernel 2.6.27-11.27
 * Dual Core Opteron, Debian backports.org kernel 2.6.26-13~bpo40+1
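For anyone reproducing this, per-core utilization can be watched while
the test runs; a sketch, assuming sysstat's mpstat is installed:

```shell
./a.out &             # no affinity: expect a single busy core
mpstat -P ALL 1 5     # per-CPU utilization, 1-second intervals, 5 samples
kill %1
```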

Is this a bug, or a feature?

-- 
Bazsi

[-- Attachment #2: pipetest.c --]
[-- Type: text/x-csrc, Size: 2262 bytes --]

/*
 * This is a test program to reproduce a scheduling oddity I have found.
 *
 * (c) Balazs Scheidler
 *
 * Pass any argument to the program to set the CPU affinity.
 */
#define _GNU_SOURCE

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/time.h>
#include <time.h>
#include <sched.h>
              
/* diff in millisecs */
long
tv_diff(struct timeval *t1, struct timeval *t2)
{
  /* convert the whole difference to microseconds, then to milliseconds */
  long long diff = (t2->tv_sec - t1->tv_sec) * 1000000LL + (t2->tv_usec - t1->tv_usec);

  return diff / 1000;
}

int 
reader(int fd)
{
  char buf[4096];
  int i;
  
  while (read(fd, buf, sizeof(buf)) > 0)
    {
      for (i = 0; i < 20000; i++)
        memset(buf, 'A'+i, sizeof(buf));
    }
  return 0;
}

int 
writer(int fd)
{
  char buf[4096];
  int i;
  int counter, prev_counter;
  struct timeval start, end, prev, now;
  long diff;
  
  memset(buf, 'A', sizeof(buf));
  
  counter = 0;
  prev_counter = 0;
  gettimeofday(&start, NULL);
  prev = start;    /* avoid reading an uninitialized prev on the first pass */
  
  /* feed the other process with data while doing something that spins the CPU */
  while (write(fd, buf, sizeof(buf)) > 0)
    {
      for (i = 0; i < 100000; i++)
        memset(buf, 'A'+i, sizeof(buf));
        
      /* the rest of the loop is only to measure performance */
      counter++;
      gettimeofday(&now, NULL);
      if (now.tv_sec != prev.tv_sec)
        {
          diff = tv_diff(&prev, &now);
          printf("Check: %ld loops/sec, sum: %d \n", ((counter - prev_counter) * 1000) / diff, counter);
          prev_counter = counter;
        }
      if (now.tv_sec - start.tv_sec > 20)
        break;
      prev = now;
    }
  gettimeofday(&end, NULL);
  diff = tv_diff(&start, &end);
  printf("Final: %ld loops/sec, sum: %d\n", (counter*1000) / diff, counter);
  return 0;
}

int 
main(int argc, char **argv)
{
  int fds[2];
  cpu_set_t s;
  int set_affinity = 0;

  CPU_ZERO(&s);
  
  if (argc > 1)
    set_affinity = 1;
  
  pipe(fds);
  
  if (fork() == 0)
    {
      if (set_affinity)
        {
          CPU_SET(0, &s);
          sched_setaffinity(getpid(), sizeof(s), &s);
        }
      close(fds[1]);
      reader(fds[0]);
      return 0;
    }
  if (set_affinity)
    {
      CPU_SET(1, &s);
      sched_setaffinity(getpid(), sizeof(s), &s);
    }
  close(fds[0]);
  writer(fds[1]);
  return 0;
}
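The attachment builds with no special flags; assuming gcc, the runs
shown above correspond to:

```shell
gcc -Wall pipetest.c    # produces a.out, as used in the runs above
./a.out                 # free scheduling
./a.out 1               # pinned: reader on CPU 0, writer on CPU 1
```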


* Re: scheduler oddity [bug?]
  2009-03-07 17:47 scheduler oddity [bug?] Balazs Scheidler
@ 2009-03-07 18:47 ` Balazs Scheidler
  2009-03-08 19:45   ` Balazs Scheidler
  2009-03-08  9:42 ` Mike Galbraith
  1 sibling, 1 reply; 42+ messages in thread
From: Balazs Scheidler @ 2009-03-07 18:47 UTC (permalink / raw)
  To: linux-kernel

On Sat, 2009-03-07 at 18:47 +0100, Balazs Scheidler wrote:
> Hi,
> 
> I'm experiencing an odd behaviour from the Linux scheduler. I have an
> application that feeds data to another process using a pipe. Both
> processes use a fair amount of CPU time apart from writing to/reading
> from this pipe.
> 
> The machine I'm running on  is an Opteron Quad-Core CPU:
> model name	: Quad-Core AMD Opteron(tm) Processor 2347 HE
> stepping	: 3
> 
> What I see is that only one of the cores is used, the other three are
> idling without doing any work. If I explicitly set the CPU affinity of
> the processes to use distinct CPUs the performance goes up
> significantly. (e.g. it starts to use the other cores and the load
> scales linearly).
> 
> I've tried to reproduce the problem by writing a small test program,
> which you can find attached. The program creates two processes, one
> feeds the other using a pipe and each does a series of memset() calls to
> simulate CPU load. I've also added capability to the program to set its
> own CPU affinity. The results (the more the better):
> 
> Without enabling CPU affinity:
> $ ./a.out
> [...]
> Final: 39 loops/sec, sum: 831
> 
> 
> With CPU affinity:
> # ./a.out 1
> [...]
> Final: 48 loops/sec, sum: 1024
> 
> The difference is about 20%, which is about the same work performed by
> the slave process. If the two processes race for the same CPU this 20%
> of performance is lost.
> 
> I've tested this on 3 computers and each showed the same symptoms:
>  * quad core Opteron, running Ubuntu kernel 2.6.27-13.29
>  * Core 2 Duo, running Ubuntu kernel 2.6.27-11.27
>  * Dual Core Opteron, Debian backports.org kernel 2.6.26-13~bpo40+1
> 
> Is this a bug, or a feature?
> 

One interesting new data point: I've retested with a 2.6.22-based
kernel, and it still works there; setting the CPU affinity does not
change the performance of the test program, and mpstat nicely shows that
2 cores are working, not just one.

Maybe this is CFS-related? CFS was merged in 2.6.23, IIRC.

Also, I tried changing various scheduler knobs
in /proc/sys/kernel/sched_*, but they didn't help. These are the ones I
changed:

 * sched_migration_cost: changed from the default 500000 to 100000 and
then to 10000, but neither helped.
 * sched_nr_migrate: increased it to 64, but again nothing changed.
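In shell terms, the changes were (a sketch; values as listed above):

```shell
cat /proc/sys/kernel/sched_migration_cost    # default: 500000 (ns)
echo 100000 > /proc/sys/kernel/sched_migration_cost
echo 10000 > /proc/sys/kernel/sched_migration_cost
echo 64 > /proc/sys/kernel/sched_nr_migrate
```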

I'm starting to think that this is a regression that may or may not be
related to CFS.

I don't have a box I could bisect on, but the test program makes
the problem quite obvious.


-- 
Bazsi




* Re: scheduler oddity [bug?]
  2009-03-07 17:47 scheduler oddity [bug?] Balazs Scheidler
  2009-03-07 18:47 ` Balazs Scheidler
@ 2009-03-08  9:42 ` Mike Galbraith
  2009-03-08  9:58   ` Mike Galbraith
  1 sibling, 1 reply; 42+ messages in thread
From: Mike Galbraith @ 2009-03-08  9:42 UTC (permalink / raw)
  To: Balazs Scheidler; +Cc: linux-kernel

On Sat, 2009-03-07 at 18:47 +0100, Balazs Scheidler wrote:
> Hi,
> 
> I'm experiencing an odd behaviour from the Linux scheduler. I have an
> application that feeds data to another process using a pipe. Both
> processes use a fair amount of CPU time apart from writing to/reading
> from this pipe.
> 
> The machine I'm running on  is an Opteron Quad-Core CPU:
> model name	: Quad-Core AMD Opteron(tm) Processor 2347 HE
> stepping	: 3
> 
> What I see is that only one of the cores is used, the other three are
> idling without doing any work. If I explicitly set the CPU affinity of
> the processes to use distinct CPUs the performance goes up
> significantly. (e.g. it starts to use the other cores and the load
> scales linearly).
> 
> I've tried to reproduce the problem by writing a small test program,
> which you can find attached. The program creates two processes, one
> feeds the other using a pipe and each does a series of memset() calls to
> simulate CPU load. I've also added capability to the program to set its
> own CPU affinity. The results (the more the better):
> 
> Without enabling CPU affinity:
> $ ./a.out
> [...]
> Final: 39 loops/sec, sum: 831
> 
> 
> With CPU affinity:
> # ./a.out 1
> [...]
> Final: 48 loops/sec, sum: 1024
> 
> The difference is about 20%, which is about the same work performed by
> the slave process. If the two processes race for the same CPU this 20%
> of performance is lost.
> 
> I've tested this on 3 computers and each showed the same symptoms:
>  * quad core Opteron, running Ubuntu kernel 2.6.27-13.29
>  * Core 2 Duo, running Ubuntu kernel 2.6.27-11.27
>  * Dual Core Opteron, Debian backports.org kernel 2.6.26-13~bpo40+1
> 
> Is this a bug, or a feature?

Both.  Affine wakeups are cache friendly, and generally a feature, but
they can lead to underutilized CPUs in some cases, thus turning the
feature into a bug, as your testcase demonstrates.  The metric we use
for the affinity hint works well in general, but clearly wants some
refinement.

You can turn this scheduler hint off via:
	echo NO_SYNC_WAKEUPS > /sys/kernel/debug/sched_features
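Assuming debugfs is mounted at /sys/kernel/debug, the current flags can
be inspected and the default restored later:

```shell
mount -t debugfs none /sys/kernel/debug 2>/dev/null  # mount debugfs if needed
cat /sys/kernel/debug/sched_features                 # show current feature flags
echo SYNC_WAKEUPS > /sys/kernel/debug/sched_features # restore the default
```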

	-Mike



* Re: scheduler oddity [bug?]
  2009-03-08  9:42 ` Mike Galbraith
@ 2009-03-08  9:58   ` Mike Galbraith
  2009-03-08 10:02     ` Mike Galbraith
                       ` (2 more replies)
  0 siblings, 3 replies; 42+ messages in thread
From: Mike Galbraith @ 2009-03-08  9:58 UTC (permalink / raw)
  To: Balazs Scheidler; +Cc: linux-kernel, Ingo Molnar, Peter Zijlstra

On Sun, 2009-03-08 at 10:42 +0100, Mike Galbraith wrote:
> On Sat, 2009-03-07 at 18:47 +0100, Balazs Scheidler wrote:
> > Hi,
> > 
> > I'm experiencing an odd behaviour from the Linux scheduler. I have an
> > application that feeds data to another process using a pipe. Both
> > processes use a fair amount of CPU time apart from writing to/reading
> > from this pipe.
> > 
> > The machine I'm running on  is an Opteron Quad-Core CPU:
> > model name	: Quad-Core AMD Opteron(tm) Processor 2347 HE
> > stepping	: 3
> > 
> > What I see is that only one of the cores is used, the other three are
> > idling without doing any work. If I explicitly set the CPU affinity of
> > the processes to use distinct CPUs the performance goes up
> > significantly. (e.g. it starts to use the other cores and the load
> > scales linearly).
> > 
> > I've tried to reproduce the problem by writing a small test program,
> > which you can find attached. The program creates two processes, one
> > feeds the other using a pipe and each does a series of memset() calls to
> > simulate CPU load. I've also added capability to the program to set its
> > own CPU affinity. The results (the more the better):
> > 
> > Without enabling CPU affinity:
> > $ ./a.out
> > [...]
> > Final: 39 loops/sec, sum: 831
> > 
> > 
> > With CPU affinity:
> > # ./a.out 1
> > [...]
> > Final: 48 loops/sec, sum: 1024
> > 
> > The difference is about 20%, which is about the same work performed by
> > the slave process. If the two processes race for the same CPU this 20%
> > of performance is lost.
> > 
> > I've tested this on 3 computers and each showed the same symptoms:
> >  * quad core Opteron, running Ubuntu kernel 2.6.27-13.29
> >  * Core 2 Duo, running Ubuntu kernel 2.6.27-11.27
> >  * Dual Core Opteron, Debian backports.org kernel 2.6.26-13~bpo40+1
> > 
> > Is this a bug, or a feature?
> 
> Both.  Affine wakeups are cache friendly, and generally a feature, but
> can lead to underutilized CPUs in some cases, thus turning feature into
> bug as your testcase demonstrates.  The metric we use for the affinity hint
> works well, but clearly wants some refinement.
> 
> You can turn this scheduler hint off via:
> 	echo NO_SYNC_WAKEUPS > /sys/kernel/debug/sched_features
> 

The problem with your particular testcase is that while one half has an
avg_overlap (what we use as the affinity hint for synchronous wakeups)
which triggers the affinity hint, the other half has an avg_overlap of
zero, which it was born with, so despite significant execution overlap,
the scheduler treats them as if they were truly synchronous tasks.

The below cures it, but is only a demo hack.

diff --git a/kernel/sched.c b/kernel/sched.c
index 8e2558c..85f9ced 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1712,11 +1712,15 @@ static void enqueue_task(struct rq *rq, struct task_struct *p, int wakeup)
 
 static void dequeue_task(struct rq *rq, struct task_struct *p, int sleep)
 {
+	u64 limit = sysctl_sched_migration_cost;
+	u64 runtime = p->se.sum_exec_runtime - p->se.prev_sum_exec_runtime;
+
 	if (sleep && p->se.last_wakeup) {
 		update_avg(&p->se.avg_overlap,
 			   p->se.sum_exec_runtime - p->se.last_wakeup);
 		p->se.last_wakeup = 0;
-	}
+	} else if (p->se.avg_overlap < limit && runtime >= limit)
+		update_avg(&p->se.avg_overlap, runtime);
 
 	sched_info_dequeued(p);
 	p->sched_class->dequeue_task(rq, p, sleep);

pipetest (6701, #threads: 1)
---------------------------------------------------------
se.exec_start                      :       5607096.896687
se.vruntime                        :        274158.274352
se.sum_exec_runtime                :        139434.783417
se.avg_overlap                     :             6.477067 <== was zero
nr_switches                        :                 2246
nr_voluntary_switches              :                    1
nr_involuntary_switches            :                 2245
se.load.weight                     :                 1024
policy                             :                    0
prio                               :                  120
clock-delta                        :                  102

pipetest (6702, #threads: 1)
---------------------------------------------------------
se.exec_start                      :       5607096.896687
se.vruntime                        :        274098.273516
se.sum_exec_runtime                :         32987.899515
se.avg_overlap                     :             0.502174 <== was always < migration cost
nr_switches                        :                13631
nr_voluntary_switches              :                11639
nr_involuntary_switches            :                 1992
se.load.weight                     :                 1024
policy                             :                    0
prio                               :                  120
clock-delta                        :                  117





* Re: scheduler oddity [bug?]
  2009-03-08  9:58   ` Mike Galbraith
@ 2009-03-08 10:02     ` Mike Galbraith
  2009-03-08 10:19     ` Peter Zijlstra
  2009-03-08 15:39     ` Ingo Molnar
  2 siblings, 0 replies; 42+ messages in thread
From: Mike Galbraith @ 2009-03-08 10:02 UTC (permalink / raw)
  To: Balazs Scheidler; +Cc: linux-kernel, Ingo Molnar, Peter Zijlstra

On Sun, 2009-03-08 at 10:58 +0100, Mike Galbraith wrote:
> On Sun, 2009-03-08 at 10:42 +0100, Mike Galbraith wrote:
> > On Sat, 2009-03-07 at 18:47 +0100, Balazs Scheidler wrote:
> > > Hi,
> > > 
> > > I'm experiencing an odd behaviour from the Linux scheduler. I have an
> > > application that feeds data to another process using a pipe. Both
> > > processes use a fair amount of CPU time apart from writing to/reading
> > > from this pipe.
> > > 
> > > The machine I'm running on  is an Opteron Quad-Core CPU:
> > > model name	: Quad-Core AMD Opteron(tm) Processor 2347 HE
> > > stepping	: 3
> > > 
> > > What I see is that only one of the cores is used, the other three are
> > > idling without doing any work. If I explicitly set the CPU affinity of
> > > the processes to use distinct CPUs the performance goes up
> > > significantly. (e.g. it starts to use the other cores and the load
> > > scales linearly).
> > > 
> > > I've tried to reproduce the problem by writing a small test program,
> > > which you can find attached. The program creates two processes, one
> > > feeds the other using a pipe and each does a series of memset() calls to
> > > simulate CPU load. I've also added capability to the program to set its
> > > own CPU affinity. The results (the more the better):
> > > 
> > > Without enabling CPU affinity:
> > > $ ./a.out
> > > [...]
> > > Final: 39 loops/sec, sum: 831
> > > 
> > > 
> > > With CPU affinity:
> > > # ./a.out 1
> > > [...]
> > > Final: 48 loops/sec, sum: 1024
> > > 
> > > The difference is about 20%, which is about the same work performed by
> > > the slave process. If the two processes race for the same CPU this 20%
> > > of performance is lost.
> > > 
> > > I've tested this on 3 computers and each showed the same symptoms:
> > >  * quad core Opteron, running Ubuntu kernel 2.6.27-13.29
> > >  * Core 2 Duo, running Ubuntu kernel 2.6.27-11.27
> > >  * Dual Core Opteron, Debian backports.org kernel 2.6.26-13~bpo40+1
> > > 
> > > Is this a bug, or a feature?
> > 
> > Both.  Affine wakeups are cache friendly, and generally a feature, but
> > can lead to underutilized CPUs in some cases, thus turning feature into
> > bug as your testcase demonstrates.  The metric we use for the affinity hint
> > works well, but clearly wants some refinement.
> > 
> > You can turn this scheduler hint off via:
> > 	echo NO_SYNC_WAKEUPS > /sys/kernel/debug/sched_features
> > 
> 

(reply got munged)

> The problem with your particular testcase is that while one half has an
> avg_overlap (what we use as affinity hint for synchronous wakeups) which
> triggers the affinity hint, the other half has avg_overlap of zero, which
> it was born with, so despite significant execution overlap, the
> scheduler treats them as if they were truly synchronous tasks.
> 
> The below cures it, but is only a demo hack.

diff --git a/kernel/sched.c b/kernel/sched.c
index 8e2558c..85f9ced 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1712,11 +1712,15 @@ static void enqueue_task(struct rq *rq, struct task_struct *p, int wakeup)
 
 static void dequeue_task(struct rq *rq, struct task_struct *p, int sleep)
 {
+	u64 limit = sysctl_sched_migration_cost;
+	u64 runtime = p->se.sum_exec_runtime - p->se.prev_sum_exec_runtime;
+
 	if (sleep && p->se.last_wakeup) {
 		update_avg(&p->se.avg_overlap,
 			   p->se.sum_exec_runtime - p->se.last_wakeup);
 		p->se.last_wakeup = 0;
-	}
+	} else if (p->se.avg_overlap < limit && runtime >= limit)
+		update_avg(&p->se.avg_overlap, runtime);
 
 	sched_info_dequeued(p);
 	p->sched_class->dequeue_task(rq, p, sleep);

pipetest (6701, #threads: 1)
---------------------------------------------------------
se.exec_start                      :       5607096.896687
se.vruntime                        :        274158.274352
se.sum_exec_runtime                :        139434.783417
se.avg_overlap                     :             6.477067 <== was zero
nr_switches                        :                 2246
nr_voluntary_switches              :                    1
nr_involuntary_switches            :                 2245
se.load.weight                     :                 1024
policy                             :                    0
prio                               :                  120
clock-delta                        :                  102

pipetest (6702, #threads: 1)
---------------------------------------------------------
se.exec_start                      :       5607096.896687
se.vruntime                        :        274098.273516
se.sum_exec_runtime                :         32987.899515
se.avg_overlap                     :             0.502174 <== was always < migration cost
nr_switches                        :                13631
nr_voluntary_switches              :                11639
nr_involuntary_switches            :                 1992
se.load.weight                     :                 1024
policy                             :                    0
prio                               :                  120
clock-delta                        :                  117




* Re: scheduler oddity [bug?]
  2009-03-08  9:58   ` Mike Galbraith
  2009-03-08 10:02     ` Mike Galbraith
@ 2009-03-08 10:19     ` Peter Zijlstra
  2009-03-08 13:35       ` Mike Galbraith
  2009-03-08 15:39     ` Ingo Molnar
  2 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2009-03-08 10:19 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Balazs Scheidler, linux-kernel, Ingo Molnar

On Sun, 2009-03-08 at 10:58 +0100, Mike Galbraith wrote:
> 
> The below cures it, but is only a demo hack.
> 
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 8e2558c..85f9ced 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -1712,11 +1712,15 @@ static void enqueue_task(struct rq *rq, struct task_struct *p, int wakeup)
>  
>  static void dequeue_task(struct rq *rq, struct task_struct *p, int sleep)
>  {
> +       u64 limit = sysctl_sched_migration_cost;
> +       u64 runtime = p->se.sum_exec_runtime - p->se.prev_sum_exec_runtime;
> +
>         if (sleep && p->se.last_wakeup) {
>                 update_avg(&p->se.avg_overlap,
>                            p->se.sum_exec_runtime - p->se.last_wakeup);
>                 p->se.last_wakeup = 0;
> -       }


Wouldn't something like the below be more suited:

diff --git a/kernel/sched.c b/kernel/sched.c
index 89e2ca0..94c8b02 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2501,7 +2501,7 @@ static void __sched_fork(struct task_struct *p)
 	p->se.prev_sum_exec_runtime	= 0;
 	p->se.nr_migrations		= 0;
 	p->se.last_wakeup		= 0;
-	p->se.avg_overlap		= 0;
+	p->se.avg_overlap		= sysctl_sched_migration_cost;
 	p->se.start_runtime		= 0;
 	p->se.avg_wakeup		= sysctl_sched_wakeup_granularity;
 




* Re: scheduler oddity [bug?]
  2009-03-08 10:19     ` Peter Zijlstra
@ 2009-03-08 13:35       ` Mike Galbraith
  0 siblings, 0 replies; 42+ messages in thread
From: Mike Galbraith @ 2009-03-08 13:35 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Balazs Scheidler, linux-kernel, Ingo Molnar

On Sun, 2009-03-08 at 11:19 +0100, Peter Zijlstra wrote:

> Wouldn't something like the below be more suited:
> 
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 89e2ca0..94c8b02 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -2501,7 +2501,7 @@ static void __sched_fork(struct task_struct *p)
>  	p->se.prev_sum_exec_runtime	= 0;
>  	p->se.nr_migrations		= 0;
>  	p->se.last_wakeup		= 0;
> -	p->se.avg_overlap		= 0;
> +	p->se.avg_overlap		= sysctl_sched_migration_cost;
>  	p->se.start_runtime		= 0;
>  	p->se.avg_wakeup		= sysctl_sched_wakeup_granularity;
>  

Dunno exactly what we need to do here.  My hacklet certainly ain't it,
but I suspect this isn't going to work out either.  Having tasks begin
life at zero or at sysctl_sched_migration_cost won't matter much, methinks.

	-Mike



* Re: scheduler oddity [bug?]
  2009-03-08  9:58   ` Mike Galbraith
  2009-03-08 10:02     ` Mike Galbraith
  2009-03-08 10:19     ` Peter Zijlstra
@ 2009-03-08 15:39     ` Ingo Molnar
  2009-03-08 16:20       ` Mike Galbraith
  2 siblings, 1 reply; 42+ messages in thread
From: Ingo Molnar @ 2009-03-08 15:39 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Balazs Scheidler, linux-kernel, Peter Zijlstra


* Mike Galbraith <efault@gmx.de> wrote:

> The problem with your particular testcase is that while one 
> half has an avg_overlap (what we use as affinity hint for 
> synchronous wakeups) which triggers the affinity hint, the 
> other half has avg_overlap of zero, which it was born with, so
> despite significant execution overlap, the scheduler treats 
> them as if they were truly synchronous tasks.

hm, why does it stay at zero?

>  static void dequeue_task(struct rq *rq, struct task_struct *p, int sleep)
>  {
> +	u64 limit = sysctl_sched_migration_cost;
> +	u64 runtime = p->se.sum_exec_runtime - p->se.prev_sum_exec_runtime;
> +
>  	if (sleep && p->se.last_wakeup) {
>  		update_avg(&p->se.avg_overlap,
>  			   p->se.sum_exec_runtime - p->se.last_wakeup);
>  		p->se.last_wakeup = 0;
> -	}
> +	} else if (p->se.avg_overlap < limit && runtime >= limit)
> +		update_avg(&p->se.avg_overlap, runtime);
>  
>  	sched_info_dequeued(p);
>  	p->sched_class->dequeue_task(rq, p, sleep);

hm, that's weird. We want to limit avg_overlap maintenance to 
true sleeps only.

And this patch only makes a difference in the !sleep case - 
which shouldn't be that common in this workload.

What am I missing? A trace of a few millisecs of runtime of this 
workload would be nice, with a few more trace_printk()'s added to 
show how avg_overlap develops.

Maybe it's some task migration artifact? There we do a 
dequeue/enqueue with sleep=0.

	Ingo


* Re: scheduler oddity [bug?]
  2009-03-08 15:39     ` Ingo Molnar
@ 2009-03-08 16:20       ` Mike Galbraith
  2009-03-08 17:52         ` Ingo Molnar
  0 siblings, 1 reply; 42+ messages in thread
From: Mike Galbraith @ 2009-03-08 16:20 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Balazs Scheidler, linux-kernel, Peter Zijlstra

On Sun, 2009-03-08 at 16:39 +0100, Ingo Molnar wrote:
> * Mike Galbraith <efault@gmx.de> wrote:
> 
> > The problem with your particular testcase is that while one 
> > half has an avg_overlap (what we use as affinity hint for 
> > synchronous wakeups) which triggers the affinity hint, the 
> > other half has avg_overlap of zero, what it was born with, so 
> > despite significant execution overlap, the scheduler treats 
> > them as if they were truly synchronous tasks.
> 
> hm, why does it stay on zero?

Wakeup preemption.  Presuming here: heavy task wakes light task, is
preempted, light task stuffs data into pipe, heavy task doesn't block,
so no avg_overlap is ever computed.  The heavy task uses 100% CPU.

Running as SCHED_BATCH (virgin source), it becomes sane.

pipetest (6836, #threads: 1)
---------------------------------------------------------
se.exec_start                      :        266073.001296
se.vruntime                        :        173620.953443
se.sum_exec_runtime                :         11324.486321
se.avg_overlap                     :             1.306762
nr_switches                        :                  381
nr_voluntary_switches              :                    2
nr_involuntary_switches            :                  379
se.load.weight                     :                 1024
policy                             :                    3
prio                               :                  120
clock-delta                        :                  109

pipetest (6837, #threads: 1)
---------------------------------------------------------
se.exec_start                      :        266066.098182
se.vruntime                        :         51893.050177
se.sum_exec_runtime                :          2367.077751
se.avg_overlap                     :             0.077492
nr_switches                        :                  897
nr_voluntary_switches              :                  828
nr_involuntary_switches            :                   69
se.load.weight                     :                 1024
policy                             :                    3
prio                               :                  120
clock-delta                        :                  109

> >  static void dequeue_task(struct rq *rq, struct task_struct *p, int sleep)
> >  {
> > +	u64 limit = sysctl_sched_migration_cost;
> > +	u64 runtime = p->se.sum_exec_runtime - p->se.prev_sum_exec_runtime;
> > +
> >  	if (sleep && p->se.last_wakeup) {
> >  		update_avg(&p->se.avg_overlap,
> >  			   p->se.sum_exec_runtime - p->se.last_wakeup);
> >  		p->se.last_wakeup = 0;
> > -	}
> > +	} else if (p->se.avg_overlap < limit && runtime >= limit)
> > +		update_avg(&p->se.avg_overlap, runtime);
> >  
> >  	sched_info_dequeued(p);
> >  	p->sched_class->dequeue_task(rq, p, sleep);
> 
> hm, that's weird. We want to limit avg_overlap maintenance to 
> true sleeps only.

Except that when we stop sleeping, we're left with a stale avg_overlap.

> And this patch only makes a difference in the !sleep case - 
> which shouldn't be that common in this workload.

Hack was only to kill the stale zero.  Let's forget hack ;-)

	-Mike



* Re: scheduler oddity [bug?]
  2009-03-08 16:20       ` Mike Galbraith
@ 2009-03-08 17:52         ` Ingo Molnar
  2009-03-08 18:39           ` Mike Galbraith
  2009-03-09  8:02           ` [patch] " Mike Galbraith
  0 siblings, 2 replies; 42+ messages in thread
From: Ingo Molnar @ 2009-03-08 17:52 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Balazs Scheidler, linux-kernel, Peter Zijlstra


* Mike Galbraith <efault@gmx.de> wrote:

> On Sun, 2009-03-08 at 16:39 +0100, Ingo Molnar wrote:
> > * Mike Galbraith <efault@gmx.de> wrote:
> > 
> > > The problem with your particular testcase is that while one 
> > > half has an avg_overlap (what we use as affinity hint for 
> > > synchronous wakeups) which triggers the affinity hint, the 
> > > other half has avg_overlap of zero, what it was born with, so 
> > > despite significant execution overlap, the scheduler treats 
> > > them as if they were truly synchronous tasks.
> > 
> > hm, why does it stay on zero?
> 
> Wakeup preemption.  Presuming here: heavy task wakes light 
> task, is preempted, light task stuffs data into pipe, heavy 
> task doesn't block, so no avg_overlap is ever computed.  The 
> heavy task uses 100% CPU.
> 
> Running as SCHED_BATCH (virgin source), it becomes sane.

ah.

I'd argue then that time spent on the rq preempted _should_ 
count in avg_overlap statistics. I.e. couldn't we do something 
like ... your patch? :)

> >     if (sleep && p->se.last_wakeup) {
> >             update_avg(&p->se.avg_overlap,
> >                        p->se.sum_exec_runtime - p->se.last_wakeup);
> >             p->se.last_wakeup = 0;
> > -   }
> > +   } else if (p->se.avg_overlap < limit && runtime >= limit)
> > +           update_avg(&p->se.avg_overlap, runtime);

Just done unconditionally, i.e. something like:

	if (sleep) {
		runtime = p->se.sum_exec_runtime - p->se.last_wakeup;
		p->se.last_wakeup = 0;
	} else {
		runtime = p->se.sum_exec_runtime - p->se.prev_sum_exec_runtime;
	}

	update_avg(&p->se.avg_overlap, runtime);

?

	Ingo


* Re: scheduler oddity [bug?]
  2009-03-08 17:52         ` Ingo Molnar
@ 2009-03-08 18:39           ` Mike Galbraith
  2009-03-08 18:55             ` Ingo Molnar
  2009-03-09  8:02           ` [patch] " Mike Galbraith
  1 sibling, 1 reply; 42+ messages in thread
From: Mike Galbraith @ 2009-03-08 18:39 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Balazs Scheidler, linux-kernel, Peter Zijlstra

On Sun, 2009-03-08 at 18:52 +0100, Ingo Molnar wrote:
> * Mike Galbraith <efault@gmx.de> wrote:
> 
> > On Sun, 2009-03-08 at 16:39 +0100, Ingo Molnar wrote:
> > > * Mike Galbraith <efault@gmx.de> wrote:
> > > 
> > > > The problem with your particular testcase is that while one 
> > > > half has an avg_overlap (what we use as affinity hint for 
> > > > synchronous wakeups) which triggers the affinity hint, the 
> > > > other half has avg_overlap of zero, what it was born with, so 
> > > > despite significant execution overlap, the scheduler treats 
> > > > them as if they were truly synchronous tasks.
> > > 
> > > hm, why does it stay on zero?
> > 
> > Wakeup preemption.  Presuming here: heavy task wakes light 
> > task, is preempted, light task stuffs data into pipe, heavy 
> > task doesn't block, so no avg_overlap is ever computed.  The 
> > heavy task uses 100% CPU.
> > 
> > Running as SCHED_BATCH (virgin source), it becomes sane.
> 
> ah.
> 
> I'd argue then that time spent on the rq preempted _should_ 
> count in avg_overlap statistics. I.e. couldn't we do something 
> like ... your patch? :)
> 
> > >     if (sleep && p->se.last_wakeup) {
> > >             update_avg(&p->se.avg_overlap,
> > >                        p->se.sum_exec_runtime - p->se.last_wakeup);
> > >             p->se.last_wakeup = 0;
> > > -   }
> > > +   } else if (p->se.avg_overlap < limit && runtime >= limit)
> > > +           update_avg(&p->se.avg_overlap, runtime);
> 
> Just done unconditionally, i.e. something like:
> 
> 	if (sleep) {
> 		runtime = p->se.sum_exec_runtime - p->se.last_wakeup;
> 		p->se.last_wakeup = 0;
> 	} else {
> 		runtime = p->se.sum_exec_runtime - p->se.prev_sum_exec_runtime;
> 	}
> 
> 	update_avg(&p->se.avg_overlap, runtime);
> 
> ?

That'll do it for this load.  I'll resume in the a.m., give that some
testing, and try to remember all the things I was paranoid about.
(getting interrupted a _lot_.. i give up on today;)

	-Mike



* Re: scheduler oddity [bug?]
  2009-03-08 18:39           ` Mike Galbraith
@ 2009-03-08 18:55             ` Ingo Molnar
  2009-03-09  4:10               ` Mike Galbraith
  0 siblings, 1 reply; 42+ messages in thread
From: Ingo Molnar @ 2009-03-08 18:55 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Balazs Scheidler, linux-kernel, Peter Zijlstra


* Mike Galbraith <efault@gmx.de> wrote:

> On Sun, 2009-03-08 at 18:52 +0100, Ingo Molnar wrote:
> > * Mike Galbraith <efault@gmx.de> wrote:
> > 
> > > On Sun, 2009-03-08 at 16:39 +0100, Ingo Molnar wrote:
> > > > * Mike Galbraith <efault@gmx.de> wrote:
> > > > 
> > > > > The problem with your particular testcase is that while one 
> > > > > half has an avg_overlap (what we use as affinity hint for 
> > > > > synchronous wakeups) which triggers the affinity hint, the 
> > > > > other half has avg_overlap of zero, what it was born with, so 
> > > > > despite significant execution overlap, the scheduler treats 
> > > > > them as if they were truly synchronous tasks.
> > > > 
> > > > hm, why does it stay on zero?
> > > 
> > > Wakeup preemption.  Presuming here: heavy task wakes light 
> > > task, is preempted, light task stuffs data into pipe, heavy 
> > > task doesn't block, so no avg_overlap is ever computed.  The 
> > > heavy task uses 100% CPU.
> > > 
> > > Running as SCHED_BATCH (virgin source), it becomes sane.
> > 
> > ah.
> > 
> > I'd argue then that time spent on the rq preempted _should_ 
> > count in avg_overlap statistics. I.e. couldn't we do something 
> > like ... your patch? :)
> > 
> > > >     if (sleep && p->se.last_wakeup) {
> > > >             update_avg(&p->se.avg_overlap,
> > > >                        p->se.sum_exec_runtime - p->se.last_wakeup);
> > > >             p->se.last_wakeup = 0;
> > > > -   }
> > > > +   } else if (p->se.avg_overlap < limit && runtime >= limit)
> > > > +           update_avg(&p->se.avg_overlap, runtime);
> > 
> > Just done unconditionally, i.e. something like:
> > 
> > 	if (sleep) {
> > 		runtime = p->se.sum_exec_runtime - p->se.last_wakeup;
> > 		p->se.last_wakeup = 0;
> > 	} else {
> > 		runtime = p->se.sum_exec_runtime - p->se.prev_sum_exec_runtime;
> > 	}
> > 
> > 	update_avg(&p->se.avg_overlap, runtime);
> > 
> > ?
> 
> That'll do it for this load.  I'll resume in the a.m., give 
> that some testing, and try to remember all the things I was 
> paranoid about.

btw., there's room for a cleanup + micro-optimization here too: 
it would be nice to change se.last_wakeup and 
se.prev_sum_exec_runtime to be the same variable, 
se.prev_timestamp or so.

That way we can do a simple:

 	update_avg(&p->se.avg_overlap,
		    p->se.sum_exec_runtime - p->se.prev_timestamp);
	p->se.prev_timestamp = 0;

the latter is needed as we rely on the zeroing here:

	kernel/sched.c: if (sleep && p->se.last_wakeup) {


	Ingo


* Re: scheduler oddity [bug?]
  2009-03-07 18:47 ` Balazs Scheidler
@ 2009-03-08 19:45   ` Balazs Scheidler
  2009-03-08 22:03     ` Willy Tarreau
  2009-03-09 11:19     ` David Newall
  0 siblings, 2 replies; 42+ messages in thread
From: Balazs Scheidler @ 2009-03-08 19:45 UTC (permalink / raw)
  To: linux-kernel

On Sat, 2009-03-07 at 19:47 +0100, Balazs Scheidler wrote:
> On Sat, 2009-03-07 at 18:47 +0100, Balazs Scheidler wrote:
> > Hi,
> > 
> > I've tested this on 3 computers and each showed the same symptoms:
> >  * quad core Opteron, running Ubuntu kernel 2.6.27-13.29
> >  * Core 2 Duo, running Ubuntu kernel 2.6.27-11.27
> >  * Dual Core Opteron, Debian backports.org kernel 2.6.26-13~bpo40+1
> > 
> > Is this a bug, or a feature?
> > 
> 
> One new interesting information: I've retested with a 2.6.22 based
> kernel, and it still works there, setting the CPU affinity does not
> change the performance of the test program and mpstat nicely shows that
> 2 cores are working, not just one.
> 
> Maybe this is CFS related? That was merged for 2.6.23 IIRC.
> 
> Also, I tried changing various scheduler knobs
> in /proc/sys/kernel/sched_* but they didn't help. I've tried to change
> these:
> 
>  * sched_migration_cost: changed from the default 500000 to 100000 and
> then 10000 but neither helped.
>  * sched_nr_migrate: increased it to 64, but again nothing
> 
> I'm starting to think that this is a regression that may or may not be
> related to CFS. 
> 
> I don't have a box where I could bisect on, but the test program makes
> the problem quite obvious.

Some more test results:

Latest tree from Linus seems to work, at least the program runs on both
cores as it should. I bisected the patch that changed behaviour, and
I've found this:

commit 38736f475071b80b66be28af7b44c854073699cc
Author: Gautham R Shenoy <ego@in.ibm.com>
Date:   Sat Sep 6 14:50:23 2008 +0530

    sched: fix __load_balance_iterator() for cfq with only one task
    
    The __load_balance_iterator() returns a NULL when there's only one
    sched_entity which is a task. It is caused by the following code-path.
    
    	/* Skip over entities that are not tasks */
    	do {
    		se = list_entry(next, struct sched_entity, group_node);
    		next = next->next;
    	} while (next != &cfs_rq->tasks && !entity_is_task(se));
    
    	if (next == &cfs_rq->tasks)
    		return NULL;
    	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
          This will return NULL even when se is a task.
    
    As a side-effect, there was a regression in sched_mc behavior since 2.6.25,
    since iter_move_one_task() when it calls load_balance_start_fair(),
    would not get any tasks to move!
    
    Fix this by checking if the last entity was a task or not.
    
    Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
    Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Signed-off-by: Ingo Molnar <mingo@elte.hu>


This patch was integrated for 2.6.28. With the above patch, my test program uses 
two cores as it should. I could only test this in a virtual machine so I don't 
know exact performance metrics, but I'll test 2.6.27 plus this patch on a real 
box tomorrow to see if this was the culprit.

I'm not sure if this is related to the avg_overlap discussion (which I honestly 
don't really understand :)


-- 
Bazsi




* Re: scheduler oddity [bug?]
  2009-03-08 19:45   ` Balazs Scheidler
@ 2009-03-08 22:03     ` Willy Tarreau
  2009-03-09  3:35       ` Mike Galbraith
  2009-03-09 11:19     ` David Newall
  1 sibling, 1 reply; 42+ messages in thread
From: Willy Tarreau @ 2009-03-08 22:03 UTC (permalink / raw)
  To: Balazs Scheidler; +Cc: linux-kernel

Hi Balazs,

On Sun, Mar 08, 2009 at 08:45:24PM +0100, Balazs Scheidler wrote:
> On Sat, 2009-03-07 at 19:47 +0100, Balazs Scheidler wrote:
> > On Sat, 2009-03-07 at 18:47 +0100, Balazs Scheidler wrote:
> > > Hi,
> > > 
> > > I've tested this on 3 computers and each showed the same symptoms:
> > >  * quad core Opteron, running Ubuntu kernel 2.6.27-13.29
> > >  * Core 2 Duo, running Ubuntu kernel 2.6.27-11.27
> > >  * Dual Core Opteron, Debian backports.org kernel 2.6.26-13~bpo40+1
> > > 
> > > Is this a bug, or a feature?
> > > 
> > 
> > One new interesting information: I've retested with a 2.6.22 based
> > kernel, and it still works there, setting the CPU affinity does not
> > change the performance of the test program and mpstat nicely shows that
> > 2 cores are working, not just one.
> > 
> > Maybe this is CFS related? That was merged for 2.6.23 IIRC.
> > 
> > Also, I tried changing various scheduler knobs
> > in /proc/sys/kernel/sched_* but they didn't help. I've tried to change
> > these:
> > 
> >  * sched_migration_cost: changed from the default 500000 to 100000 and
> > then 10000 but neither helped.
> >  * sched_nr_migrate: increased it to 64, but again nothing
> > 
> > I'm starting to think that this is a regression that may or may not be
> > related to CFS. 
> > 
> > I don't have a box where I could bisect on, but the test program makes
> > the problem quite obvious.
> 
> Some more test results:
> 
> Latest tree from Linus seems to work, at least the program runs on both
> cores as it should. I bisected the patch that changed behaviour, and
> I've found this:
> 
> commit 38736f475071b80b66be28af7b44c854073699cc
> Author: Gautham R Shenoy <ego@in.ibm.com>
> Date:   Sat Sep 6 14:50:23 2008 +0530
> 
>     sched: fix __load_balance_iterator() for cfq with only one task
>     
>     The __load_balance_iterator() returns a NULL when there's only one
>     sched_entity which is a task. It is caused by the following code-path.
>     
>     	/* Skip over entities that are not tasks */
>     	do {
>     		se = list_entry(next, struct sched_entity, group_node);
>     		next = next->next;
>     	} while (next != &cfs_rq->tasks && !entity_is_task(se));
>     
>     	if (next == &cfs_rq->tasks)
>     		return NULL;
>     	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>           This will return NULL even when se is a task.
>     
>     As a side-effect, there was a regression in sched_mc behavior since 2.6.25,
>     since iter_move_one_task() when it calls load_balance_start_fair(),
>     would not get any tasks to move!
>     
>     Fix this by checking if the last entity was a task or not.
>     
>     Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
>     Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
>     Signed-off-by: Ingo Molnar <mingo@elte.hu>
> 
> 
> This patch was integrated for 2.6.28. With the above patch, my test program uses 
> two cores as it should. I could only test this in a virtual machine so I don't 
> know exact performance metrics, but I'll test 2.6.27 plus this patch on a real 
> box tomorrow to see if this was the culprit.

Just tested right here and I can confirm it is the culprit. I can reliably
reproduce the issue here on my core2 duo, and this patch fixes it. With your
memset() loop at 20k iterations, I saw exactly 50% CPU usage, and a final
sum of 794. With the patch, I see 53% CPU and 909. Changing the loop to 80k
iterations shows 53% CPU usage and 541 loops without the patch, versus
639 loops and 63% CPU usage with the patch.

So there's clearly a big win.

On a related note, I've often noticed that my kernel builds with -j 2
only use one CPU. I'm wondering whether this could be related to the same
issue. Just testing, I don't notice this with the patch. I'll have to retry
without later.

> I'm not sure if this is related to the avg_overlap discussion (which I honestly 
> don't really understand :)

neither do I :-)

Regards,
Willy



* Re: scheduler oddity [bug?]
  2009-03-08 22:03     ` Willy Tarreau
@ 2009-03-09  3:35       ` Mike Galbraith
  0 siblings, 0 replies; 42+ messages in thread
From: Mike Galbraith @ 2009-03-09  3:35 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: Balazs Scheidler, linux-kernel

On Sun, 2009-03-08 at 23:03 +0100, Willy Tarreau wrote:
> Hi Balazs,
> 
> On Sun, Mar 08, 2009 at 08:45:24PM +0100, Balazs Scheidler wrote:
> > On Sat, 2009-03-07 at 19:47 +0100, Balazs Scheidler wrote:
> > > On Sat, 2009-03-07 at 18:47 +0100, Balazs Scheidler wrote:
> > > > Hi,
> > > > 
> > > > I've tested this on 3 computers and each showed the same symptoms:
> > > >  * quad core Opteron, running Ubuntu kernel 2.6.27-13.29
> > > >  * Core 2 Duo, running Ubuntu kernel 2.6.27-11.27
> > > >  * Dual Core Opteron, Debian backports.org kernel 2.6.26-13~bpo40+1
> > > > 
> > > > Is this a bug, or a feature?
> > > > 
> > > 
> > > One new interesting information: I've retested with a 2.6.22 based
> > > kernel, and it still works there, setting the CPU affinity does not
> > > change the performance of the test program and mpstat nicely shows that
> > > 2 cores are working, not just one.
> > > 
> > > Maybe this is CFS related? That was merged for 2.6.23 IIRC.
> > > 
> > > Also, I tried changing various scheduler knobs
> > > in /proc/sys/kernel/sched_* but they didn't help. I've tried to change
> > > these:
> > > 
> > >  * sched_migration_cost: changed from the default 500000 to 100000 and
> > > then 10000 but neither helped.
> > >  * sched_nr_migrate: increased it to 64, but again nothing
> > > 
> > > I'm starting to think that this is a regression that may or may not be
> > > related to CFS. 
> > > 
> > > I don't have a box where I could bisect on, but the test program makes
> > > the problem quite obvious.
> > 
> > Some more test results:
> > 
> > Latest tree from Linus seems to work, at least the program runs on both
> > cores as it should. I bisected the patch that changed behaviour, and
> > I've found this:
> > 
> > commit 38736f475071b80b66be28af7b44c854073699cc
> > Author: Gautham R Shenoy <ego@in.ibm.com>
> > Date:   Sat Sep 6 14:50:23 2008 +0530
> > 
> >     sched: fix __load_balance_iterator() for cfq with only one task
> >     
> >     The __load_balance_iterator() returns a NULL when there's only one
> >     sched_entity which is a task. It is caused by the following code-path.
> >     
> >     	/* Skip over entities that are not tasks */
> >     	do {
> >     		se = list_entry(next, struct sched_entity, group_node);
> >     		next = next->next;
> >     	} while (next != &cfs_rq->tasks && !entity_is_task(se));
> >     
> >     	if (next == &cfs_rq->tasks)
> >     		return NULL;
> >     	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >           This will return NULL even when se is a task.
> >     
> >     As a side-effect, there was a regression in sched_mc behavior since 2.6.25,
> >     since iter_move_one_task() when it calls load_balance_start_fair(),
> >     would not get any tasks to move!
> >     
> >     Fix this by checking if the last entity was a task or not.
> >     
> >     Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
> >     Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> >     Signed-off-by: Ingo Molnar <mingo@elte.hu>
> > 
> > 
> > This patch was integrated for 2.6.28. With the above patch, my test program uses 
> > two cores as it should. I could only test this in a virtual machine so I don't 
> > know exact performance metrics, but I'll test 2.6.27 plus this patch on a real 
> > box tomorrow to see if this was the culprit.
> 
> Just tested right here and I can confirm it is the culprit. I can reliably
> reproduce the issue here on my core2 duo, and this patch fixes it. With your
> memset() loop at 20k iterations, I saw exactly 50% CPU usage, and a final
> sum of 794. With the patch, I see 53% CPU and 909. Changing the loop to 80k
> iterations shows 53% CPU usage and 541 loops without the patch, versus
> 639 loops and 63% CPU usage with the patch.

Interesting.  I'm testing in .git (Q6600), and it's only using one CPU
unless I actively intervene.  Doing whatever to pry the pair apart takes
loops/sec from 70 to 84.

	-Mike



* Re: scheduler oddity [bug?]
  2009-03-08 18:55             ` Ingo Molnar
@ 2009-03-09  4:10               ` Mike Galbraith
  2009-03-09  6:52                 ` Ingo Molnar
  0 siblings, 1 reply; 42+ messages in thread
From: Mike Galbraith @ 2009-03-09  4:10 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Balazs Scheidler, linux-kernel, Peter Zijlstra

On Sun, 2009-03-08 at 19:55 +0100, Ingo Molnar wrote:

> btw., there's room for a cleanup + micro-optimization here too: 
> it would be nice to change se.last_wakeup and 
> se.prev_sum_exec_runtime to be the same variable, 
> se.prev_timestamp or so.

I don't see how they can meld without breaking check_preempt_tick() all
to pieces.

	-Mike



* Re: scheduler oddity [bug?]
  2009-03-09  4:10               ` Mike Galbraith
@ 2009-03-09  6:52                 ` Ingo Molnar
  0 siblings, 0 replies; 42+ messages in thread
From: Ingo Molnar @ 2009-03-09  6:52 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Balazs Scheidler, linux-kernel, Peter Zijlstra


* Mike Galbraith <efault@gmx.de> wrote:

> On Sun, 2009-03-08 at 19:55 +0100, Ingo Molnar wrote:
> 
> > btw., there's room for a cleanup + micro-optimization here too: 
> > it would be nice to change se.last_wakeup and 
> > se.prev_sum_exec_runtime to be the same variable, 
> > se.prev_timestamp or so.
> 
> I don't see how they can meld without breaking 
> check_preempt_tick() all to pieces.

hm, indeed.

	Ingo


* [patch] Re: scheduler oddity [bug?]
  2009-03-08 17:52         ` Ingo Molnar
  2009-03-08 18:39           ` Mike Galbraith
@ 2009-03-09  8:02           ` Mike Galbraith
  2009-03-09  8:07             ` Ingo Molnar
  2009-03-09 15:57             ` Balazs Scheidler
  1 sibling, 2 replies; 42+ messages in thread
From: Mike Galbraith @ 2009-03-09  8:02 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Balazs Scheidler, linux-kernel, Peter Zijlstra, Willy Tarreau

On Sun, 2009-03-08 at 18:52 +0100, Ingo Molnar wrote:
> * Mike Galbraith <efault@gmx.de> wrote:
> 
> > On Sun, 2009-03-08 at 16:39 +0100, Ingo Molnar wrote:
> > > * Mike Galbraith <efault@gmx.de> wrote:
> > > 
> > > > The problem with your particular testcase is that while one 
> > > > half has an avg_overlap (what we use as affinity hint for 
> > > > synchronous wakeups) which triggers the affinity hint, the 
> > > > other half has avg_overlap of zero, what it was born with, so 
> > > > despite significant execution overlap, the scheduler treats 
> > > > them as if they were truly synchronous tasks.
> > > 
> > > hm, why does it stay on zero?
> > 
> > Wakeup preemption.  Presuming here: heavy task wakes light 
> > task, is preempted, light task stuffs data into pipe, heavy 
> > task doesn't block, so no avg_overlap is ever computed.  The 
> > heavy task uses 100% CPU.
> > 
> > Running as SCHED_BATCH (virgin source), it becomes sane.
> 
> ah.
> 
> I'd argue then that time spent on the rq preempted _should_ 
> count in avg_overlap statistics. I.e. couldn't we do something 
> like ... your patch? :)
> 
> > >     if (sleep && p->se.last_wakeup) {
> > >             update_avg(&p->se.avg_overlap,
> > >                        p->se.sum_exec_runtime - p->se.last_wakeup);
> > >             p->se.last_wakeup = 0;
> > > -   }
> > > +   } else if (p->se.avg_overlap < limit && runtime >= limit)
> > > +           update_avg(&p->se.avg_overlap, runtime);
> 
> Just done unconditionally, i.e. something like:
> 
> 	if (sleep) {
> 		runtime = p->se.sum_exec_runtime - p->se.last_wakeup;
> 		p->se.last_wakeup = 0;
> 	} else {
> 		runtime = p->se.sum_exec_runtime - p->se.prev_sum_exec_runtime;
> 	}
> 
> 	update_avg(&p->se.avg_overlap, runtime);
> 
> ?

OK, I've not seen any problem indications yet, so find patchlet below.

However! Balazs has stated that this problem is _not_ present in .git,
and that..

	commit 38736f475071b80b66be28af7b44c854073699cc
	Author: Gautham R Shenoy <ego@in.ibm.com>
	Date:   Sat Sep 6 14:50:23 2008 +0530

..is what fixed it.  Willy Tarreau verified this as being the case on
his HW as well.  It is present in .git with my HW.

I see it as a problem, but it's your call.  Dunno if I'd apply it or
hold back, given these conflicting reports.

Anyway...

Given a task pair communicating via pipe, if one partner fills/drains such
that the other does not block for extended periods, avg_overlap can be long
stale, and trigger affine wakeups despite heavy CPU demand.  This can, and
does, lead to throughput loss in the testcase posted by the reporter.

Fix this by unconditionally updating avg_overlap at dequeue time instead
of only updating when a task sleeps.

See http://lkml.org/lkml/2009/3/7/79 for details/testcase.

Reported-by: Balazs Scheidler <bazsi@balabit.hu>
Signed-off-by: Mike Galbraith <efault@gmx.de>

 kernel/sched.c |    9 +++++++--
 1 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 8e2558c..c670050 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1712,12 +1712,17 @@ static void enqueue_task(struct rq *rq, struct task_struct *p, int wakeup)
 
 static void dequeue_task(struct rq *rq, struct task_struct *p, int sleep)
 {
+	u64 runtime;
+
 	if (sleep && p->se.last_wakeup) {
-		update_avg(&p->se.avg_overlap,
-			   p->se.sum_exec_runtime - p->se.last_wakeup);
+		runtime = p->se.sum_exec_runtime - p->se.last_wakeup;
 		p->se.last_wakeup = 0;
+	} else {
+		runtime = p->se.sum_exec_runtime - p->se.prev_sum_exec_runtime;
 	}
 
+	update_avg(&p->se.avg_overlap, runtime);
+
 	sched_info_dequeued(p);
 	p->sched_class->dequeue_task(rq, p, sleep);
 	p->se.on_rq = 0;





* Re: [patch] Re: scheduler oddity [bug?]
  2009-03-09  8:02           ` [patch] " Mike Galbraith
@ 2009-03-09  8:07             ` Ingo Molnar
  2009-03-09 10:16               ` David Newall
  2009-03-09 11:04               ` Peter Zijlstra
  2009-03-09 15:57             ` Balazs Scheidler
  1 sibling, 2 replies; 42+ messages in thread
From: Ingo Molnar @ 2009-03-09  8:07 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Balazs Scheidler, linux-kernel, Peter Zijlstra, Willy Tarreau


* Mike Galbraith <efault@gmx.de> wrote:

> On Sun, 2009-03-08 at 18:52 +0100, Ingo Molnar wrote:
> > * Mike Galbraith <efault@gmx.de> wrote:
> > 
> > > On Sun, 2009-03-08 at 16:39 +0100, Ingo Molnar wrote:
> > > > * Mike Galbraith <efault@gmx.de> wrote:
> > > > 
> > > > > The problem with your particular testcase is that while one 
> > > > > half has an avg_overlap (what we use as affinity hint for 
> > > > > synchronous wakeups) which triggers the affinity hint, the 
> > > > > other half has avg_overlap of zero, what it was born with, so 
> > > > > despite significant execution overlap, the scheduler treats 
> > > > > them as if they were truly synchronous tasks.
> > > > 
> > > > hm, why does it stay on zero?
> > > 
> > > Wakeup preemption.  Presuming here: heavy task wakes light 
> > > task, is preempted, light task stuffs data into pipe, heavy 
> > > task doesn't block, so no avg_overlap is ever computed.  The 
> > > heavy task uses 100% CPU.
> > > 
> > > Running as SCHED_BATCH (virgin source), it becomes sane.
> > 
> > ah.
> > 
> > I'd argue then that time spent on the rq preempted _should_ 
> > count in avg_overlap statistics. I.e. couldn't we do something 
> > like ... your patch? :)
> > 
> > > >     if (sleep && p->se.last_wakeup) {
> > > >             update_avg(&p->se.avg_overlap,
> > > >                        p->se.sum_exec_runtime - p->se.last_wakeup);
> > > >             p->se.last_wakeup = 0;
> > > > -   }
> > > > +   } else if (p->se.avg_overlap < limit && runtime >= limit)
> > > > +           update_avg(&p->se.avg_overlap, runtime);
> > 
> > Just done unconditionally, i.e. something like:
> > 
> > 	if (sleep) {
> > 		runtime = p->se.sum_exec_runtime - p->se.last_wakeup;
> > 		p->se.last_wakeup = 0;
> > 	} else {
> > 		runtime = p->se.sum_exec_runtime - p->se.prev_sum_exec_runtime;
> > 	}
> > 
> > 	update_avg(&p->se.avg_overlap, runtime);
> > 
> > ?
> 
> OK, I've not seen any problem indications yet, so find patchlet below.
> 
> However! Balazs has stated that this problem is _not_ present in .git,
> and that..
> 
> 	commit 38736f475071b80b66be28af7b44c854073699cc
> 	Author: Gautham R Shenoy <ego@in.ibm.com>
> 	Date:   Sat Sep 6 14:50:23 2008 +0530
> 
> ..is what fixed it.  Willy Tarreau verified this as being the case on
> his HW as well.  It is present in .git with my HW.
> 
> I see it as a problem, but it's your call.  Dunno if I'd apply it or
> hold back, given these conflicting reports.

I think we still want it - as the purpose of the overlap metric 
is to measure reality. If preemption causes overlap in execution 
we should not ignore that.

The fact that your hw triggers it currently is enough of a 
justification. Gautham's change to load-balancing might have 
shifted the preemption and migration characteristics on his box 
just enough to not trigger this - but it does not 'fix' the 
problem per se.

Peter, what do you think?

	Ingo

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch] Re: scheduler oddity [bug?]
  2009-03-09  8:07             ` Ingo Molnar
@ 2009-03-09 10:16               ` David Newall
  2009-03-09 11:04               ` Peter Zijlstra
  1 sibling, 0 replies; 42+ messages in thread
From: David Newall @ 2009-03-09 10:16 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mike Galbraith, Balazs Scheidler, linux-kernel, Peter Zijlstra,
	Willy Tarreau

Ingo Molnar wrote:
> * Mike Galbraith <efault@gmx.de> wrote:
>   


...
>> OK, I've not seen any problem indications yet, so find patchlet below.
>>
>> However! Balazs has stated that this problem is _not_ present in .git,
>> and that..
>>
>> 	commit 38736f475071b80b66be28af7b44c854073699cc
>> 	Author: Gautham R Shenoy <ego@in.ibm.com>
>> 	Date:   Sat Sep 6 14:50:23 2008 +0530
>>
>> ..is what fixed it.  Willy Tarreau verified this as being the case on
>> his HW as well.  It is present in .git with my HW.
>>
>> I see it as a problem, but it's your call.  Dunno if I'd apply it or
>> hold back, given these conflicting reports.
>>     
>
> I think we still want it - as the purpose of the overlap metric 
> is to measure reality. If preemption causes overlap in execution 
> we should not ignore that.
>   

I'm sure it's wrong. The only call to dequeue_task with a non-zero sleep
value is in deactivate_task; all the rest pass zero sleep. The section
of code identified by Mike in his patchlet should be moved for the sake
of clarity. It also highlights the symmetry between enqueue_task and
dequeue_task:

--- sched.c     2009-02-21 09:09:34.000000000 +1030
+++ sched.c.dn  2009-03-09 20:13:51.000000000 +1030
@@ -1647,12 +1647,6 @@
 
 static void dequeue_task(struct rq *rq, struct task_struct *p, int sleep)
 {
-       if (sleep && p->se.last_wakeup) {
-               update_avg(&p->se.avg_overlap,
-                          p->se.sum_exec_runtime - p->se.last_wakeup);
-               p->se.last_wakeup = 0;
-       }
-
        sched_info_dequeued(p);
        p->sched_class->dequeue_task(rq, p, sleep);
        p->se.on_rq = 0;
@@ -1724,6 +1718,12 @@
        if (task_contributes_to_load(p))
                rq->nr_uninterruptible++;
 
+       if (sleep && p->se.last_wakeup) {
+               update_avg(&p->se.avg_overlap,
+                          p->se.sum_exec_runtime - p->se.last_wakeup);
+               p->se.last_wakeup = 0;
+       }
+
        dequeue_task(rq, p, sleep);
        dec_nr_running(rq);
 }



Having done that, it makes sense to entirely remove dequeue_task's
sleep parameter, and replicate all three lines in deactivate_task:

--- sched.c.dn  2009-03-09 20:41:13.000000000 +1030
+++ sched.c.dn2 2009-03-09 20:41:30.000000000 +1030
@@ -1645,10 +1645,10 @@
        p->se.on_rq = 1;
 }
 
-static void dequeue_task(struct rq *rq, struct task_struct *p, int sleep)
+static void dequeue_task(struct rq *rq, struct task_struct *p)
 {
        sched_info_dequeued(p);
-       p->sched_class->dequeue_task(rq, p, sleep);
+       p->sched_class->dequeue_task(rq, p, 0);
        p->se.on_rq = 0;
 }
 
@@ -1724,7 +1724,11 @@
                p->se.last_wakeup = 0;
        }
 
-       dequeue_task(rq, p, sleep);
+       /*dequeue_task(rq, p, sleep);*/
+       sched_info_dequeued(p);
+       p->sched_class->dequeue_task(rq, p, sleep);
+       p->se.on_rq = 0;
+
        dec_nr_running(rq);
 }
 
@@ -4848,7 +4852,7 @@
        on_rq = p->se.on_rq;
        running = task_current(rq, p);
        if (on_rq)
-               dequeue_task(rq, p, 0);
+               dequeue_task(rq, p);
        if (running)
                p->sched_class->put_prev_task(rq, p);
 
@@ -4897,7 +4901,7 @@
        }
        on_rq = p->se.on_rq;
        if (on_rq)
-               dequeue_task(rq, p, 0);
+               dequeue_task(rq, p);
 
        p->static_prio = NICE_TO_PRIO(nice);
        set_load_weight(p);
@@ -8637,7 +8641,7 @@
        on_rq = tsk->se.on_rq;
 
        if (on_rq)
-               dequeue_task(rq, tsk, 0);
+               dequeue_task(rq, tsk);
        if (unlikely(running))
                tsk->sched_class->put_prev_task(rq, tsk);
 




* Re: [patch] Re: scheduler oddity [bug?]
  2009-03-09  8:07             ` Ingo Molnar
  2009-03-09 10:16               ` David Newall
@ 2009-03-09 11:04               ` Peter Zijlstra
  2009-03-09 13:16                 ` Mike Galbraith
  1 sibling, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2009-03-09 11:04 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Mike Galbraith, Balazs Scheidler, linux-kernel, Willy Tarreau

On Mon, 2009-03-09 at 09:07 +0100, Ingo Molnar wrote:
> * Mike Galbraith <efault@gmx.de> wrote:

> > I see it as a problem, but it's your call.  Dunno if I'd apply it or
> > hold back, given these conflicting reports.
> 
> I think we still want it - as the purpose of the overlap metric 
> is to measure reality. If preemption causes overlap in execution 
> we should not ignore that.
> 
> The fact that your hw triggers it currently is enough of a 
> justification. Gautham's change to load-balancing might have 
> shifted the preemption and migration characteristics on his box 
> just enough to not trigger this - but it does not 'fix' the 
> problem per se.
> 
> Peter, what do you think?

Mostly confusion... trying to reverse-engineer what the patch does, and
why, as the changelog is somewhat silent on the issue, nor are there
comments added to clarify things.

Having something of a cold doesn't really help either..

OK, so staring at this:

---
diff --git a/kernel/sched.c b/kernel/sched.c
index 8e2558c..c670050 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1712,12 +1712,17 @@ static void enqueue_task(struct rq *rq, struct task_struct *p, int wakeup)
 
 static void dequeue_task(struct rq *rq, struct task_struct *p, int sleep)
 {
+       u64 runtime;
+
        if (sleep && p->se.last_wakeup) {
-               update_avg(&p->se.avg_overlap,
-                          p->se.sum_exec_runtime - p->se.last_wakeup);
+               runtime = p->se.sum_exec_runtime - p->se.last_wakeup;
                p->se.last_wakeup = 0;
+       } else {
+               runtime = p->se.sum_exec_runtime - p->se.prev_sum_exec_runtime;
        }
 
+       update_avg(&p->se.avg_overlap, runtime);
+
        sched_info_dequeued(p);
        p->sched_class->dequeue_task(rq, p, sleep);
        p->se.on_rq = 0;
---

The idea of avg_overlap is to measure the time between waking someone
and going to sleep yourself. If this overlap time is short for both
tasks, we infer a mutual relation and try to keep these tasks on the same
cpu.

The above patch changes this definition by adding the full run-time
on !sleep dequeues.

We reset prev_sum_exec_runtime in set_next_entity(), iow every time we
start running a task.

Now !sleep dequeues happen mostly with preemption, but also with things
like migration, nice, etc..

Take migration: that would simply add the last full runtime again, even
though it hasn't run -- that seems most odd.

OK, talked a bit with Ingo; the reason you're doing this is that
avg_overlap can easily grow stale.. I can see that happen indeed.

So the 'perfect' thing would be a task-runtime decay; barring that, the
preemption thing seems a sane enough heart-beat of a task.

How does the below look to you?

---
 kernel/sched.c |   15 ++++++++++++++-
 1 files changed, 14 insertions(+), 1 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 4414926..ec7ffdc 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4692,6 +4692,19 @@ static inline void schedule_debug(struct task_struct *prev)
 #endif
 }
 
+static void put_prev_task(struct rq *rq, struct task_struct *prev)
+{
+	if (prev->state == TASK_RUNNING) {
+		/*
+		 * In order to avoid avg_overlap growing stale when we are
+		 * indeed overlapping and hence not getting put to sleep, grow
+		 * the avg_overlap on preemption.
+		 */
+		update_avg(&prev->se.avg_overlap, sysctl_sched_migration_cost);
+	}
+	prev->sched_class->put_prev_task(rq, prev);
+}
+
 /*
  * Pick up the highest-prio task:
  */
@@ -4768,7 +4781,7 @@ need_resched_nonpreemptible:
 	if (unlikely(!rq->nr_running))
 		idle_balance(cpu, rq);
 
-	prev->sched_class->put_prev_task(rq, prev);
+	put_prev_task(rq, prev);
 	next = pick_next_task(rq);
 
 	if (likely(prev != next)) {




* Re: scheduler oddity [bug?]
  2009-03-08 19:45   ` Balazs Scheidler
  2009-03-08 22:03     ` Willy Tarreau
@ 2009-03-09 11:19     ` David Newall
  1 sibling, 0 replies; 42+ messages in thread
From: David Newall @ 2009-03-09 11:19 UTC (permalink / raw)
  To: Balazs Scheidler; +Cc: linux-kernel

Balazs Scheidler wrote:
> Some more test results:
>
> Latest tree from Linus seems to work, at least the program runs on both
> cores as it should. I bisected the patch that changed behaviour, and
> I've found this:
>
> commit 38736f475071b80b66be28af7b44c854073699cc
> Author: Gautham R Shenoy <ego@in.ibm.com>
> Date:   Sat Sep 6 14:50:23 2008 +0530
>
>     sched: fix __load_balance_iterator() for cfq with only one task
>     
>     The __load_balance_iterator() returns a NULL when there's only one
>     sched_entity which is a task. It is caused by the following code-path.
>     
>     	/* Skip over entities that are not tasks */
>     	do {
>     		se = list_entry(next, struct sched_entity, group_node);
>     		next = next->next;
>     	} while (next != &cfs_rq->tasks && !entity_is_task(se));
>     
>     	if (next == &cfs_rq->tasks)
>     		return NULL;
>     	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>           This will return NULL even when se is a task.
>     
>     As a side-effect, there was a regression in sched_mc behavior since 2.6.25,
>     since iter_move_one_task() when it calls load_balance_start_fair(),
>     would not get any tasks to move!
>     
>     Fix this by checking if the last entity was a task or not.
>     
>     Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
>     Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
>     Signed-off-by: Ingo Molnar <mingo@elte.hu>

Woops!  That fails when the task is the last entry on the list.  This
fixes that:

--- sched_fair.c        2009-02-21 09:09:34.000000000 +1030
+++ sched_fair.c.dn     2009-03-09 20:48:36.000000000 +1030
@@ -1440,7 +1440,7 @@
 __load_balance_iterator(struct cfs_rq *cfs_rq, struct list_head *next)
 {
        struct task_struct *p = NULL;
-       struct sched_entity *se;
+       struct sched_entity *se = NULL;
 
        if (next == &cfs_rq->tasks)
                return NULL;
@@ -1451,7 +1451,7 @@
                next = next->next;
        } while (next != &cfs_rq->tasks && !entity_is_task(se));
 
-       if (next == &cfs_rq->tasks)
+       if (se == NULL || !entity_is_task(se))
                return NULL;
 
        cfs_rq->balance_iterator = next;


Really, though, the function could stand a spring-cleaning, for example
either of the following, depending on how much you hate returning from
within a loop:

static struct task_struct *
__load_balance_iterator(struct cfs_rq *cfs_rq, struct list_head *next)
{
        if (next == &cfs_rq->tasks)
                return NULL;

        do {
                struct sched_entity *se = list_entry(next, struct sched_entity, group_node);
                next = next->next;
                if (entity_is_task(se))
                {
                        cfs_rq->balance_iterator = next;
                        return task_of(se);
                }
        } while (next != &cfs_rq->tasks);
        return NULL;
}


static struct task_struct *
__load_balance_iterator(struct cfs_rq *cfs_rq, struct list_head *next)
{
        struct sched_entity *se = NULL;

        for ( ; next != &cfs_rq->tasks; next = next->next)
        {
                se = list_entry(next, struct sched_entity, group_node);
                if (entity_is_task(se))
                        break;
        }

        if (next == &cfs_rq->tasks)
                return NULL;

        cfs_rq->balance_iterator = next->next;
        return task_of(se);
}


I wonder if it was intended to set balance_iterator to the task's list
entry instead of the following one.



* Re: [patch] Re: scheduler oddity [bug?]
  2009-03-09 11:04               ` Peter Zijlstra
@ 2009-03-09 13:16                 ` Mike Galbraith
  2009-03-09 13:27                   ` Peter Zijlstra
  2009-03-09 13:37                   ` Mike Galbraith
  0 siblings, 2 replies; 42+ messages in thread
From: Mike Galbraith @ 2009-03-09 13:16 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, Balazs Scheidler, linux-kernel, Willy Tarreau

On Mon, 2009-03-09 at 12:04 +0100, Peter Zijlstra wrote:

> OK, talked a bit with Ingo; the reason you're doing this is that
> avg_overlap can easily grow stale.. I can see that happen indeed.
> 
> So the 'perfect' thing would be a task-runtime decay; barring that, the
> preemption thing seems a sane enough heart-beat of a task.
> 
> How does the below look to you?

Other than the fact that the test for sync reject is currently
avg_overlap > sysctl_sched_migration_cost, looks fine to me.  Having it
capped at the boundary is probably the better way to go.

	-Mike



* Re: [patch] Re: scheduler oddity [bug?]
  2009-03-09 13:16                 ` Mike Galbraith
@ 2009-03-09 13:27                   ` Peter Zijlstra
  2009-03-09 13:51                     ` Mike Galbraith
  2009-03-09 14:00                     ` David Newall
  2009-03-09 13:37                   ` Mike Galbraith
  1 sibling, 2 replies; 42+ messages in thread
From: Peter Zijlstra @ 2009-03-09 13:27 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Ingo Molnar, Balazs Scheidler, linux-kernel, Willy Tarreau

On Mon, 2009-03-09 at 14:16 +0100, Mike Galbraith wrote:
> On Mon, 2009-03-09 at 12:04 +0100, Peter Zijlstra wrote:
> 
> > OK, talked a bit with Ingo; the reason you're doing this is that
> > avg_overlap can easily grow stale.. I can see that happen indeed.
> > 
> > So the 'perfect' thing would be a task-runtime decay; barring that, the
> > preemption thing seems a sane enough heart-beat of a task.
> > 
> > How does the below look to you?
> 
> Other than the fact that the test for sync reject is currently
> avg_overlap > sysctl_sched_migration_cost, looks fine to me.  Having it
> capped at the boundary is probably the better way to go.

Ah, yes, and looking at update_avg() we'll also discard the lower 3
bits, so we'll never actually reach it.

So I guess it should read something like:

  update_avg(&prev->se.avg_overlap, 2*sysctl_sched_migration_cost);

or somesuch.

Does it actually solve the reported problem? I've only thought about the
issue so far :-)



* Re: [patch] Re: scheduler oddity [bug?]
  2009-03-09 13:16                 ` Mike Galbraith
  2009-03-09 13:27                   ` Peter Zijlstra
@ 2009-03-09 13:37                   ` Mike Galbraith
  2009-03-09 13:46                     ` Peter Zijlstra
  1 sibling, 1 reply; 42+ messages in thread
From: Mike Galbraith @ 2009-03-09 13:37 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, Balazs Scheidler, linux-kernel, Willy Tarreau

On Mon, 2009-03-09 at 14:16 +0100, Mike Galbraith wrote:
> On Mon, 2009-03-09 at 12:04 +0100, Peter Zijlstra wrote:
> 
> > OK, talked a bit with Ingo; the reason you're doing this is that
> > avg_overlap can easily grow stale.. I can see that happen indeed.
> > 
> > So the 'perfect' thing would be a task-runtime decay; barring that, the
> > preemption thing seems a sane enough heart-beat of a task.
> > 
> > How does the below look to you?
> 
> Other than the fact that the test for sync reject is currently
> avg_overlap > sysctl_sched_migration_cost, looks fine to me.  Having it
> capped at the boundary is probably the better way to go.

Heh, doesn't _quite_ work though.  The little bugger now hovers just
under :-/

pipetest (5976, #threads: 1)
---------------------------------------------------------
se.exec_start                      :        150672.502691
se.vruntime                        :         94882.186606
se.sum_exec_runtime                :         34875.797932
se.avg_overlap                     :             0.499993
nr_switches                        :                 3680
nr_voluntary_switches              :                    0
nr_involuntary_switches            :                 3680
se.load.weight                     :                 1024
policy                             :                    0
prio                               :                  120
clock-delta                        :                  112

pipetest (5977, #threads: 1)
---------------------------------------------------------
se.exec_start                      :        150665.016951
se.vruntime                        :         94817.157909
se.sum_exec_runtime                :          7069.323019
se.avg_overlap                     :             0.012718
nr_switches                        :                 2931
nr_voluntary_switches              :                 2930
nr_involuntary_switches            :                    1
se.load.weight                     :                 1024
policy                             :                    0
prio                               :                  120
clock-delta                        :                  117




* Re: [patch] Re: scheduler oddity [bug?]
  2009-03-09 13:37                   ` Mike Galbraith
@ 2009-03-09 13:46                     ` Peter Zijlstra
  2009-03-09 13:58                       ` Mike Galbraith
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2009-03-09 13:46 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Ingo Molnar, Balazs Scheidler, linux-kernel, Willy Tarreau

On Mon, 2009-03-09 at 14:37 +0100, Mike Galbraith wrote:
> On Mon, 2009-03-09 at 14:16 +0100, Mike Galbraith wrote:
> > On Mon, 2009-03-09 at 12:04 +0100, Peter Zijlstra wrote:
> > 
> > > OK, talked a bit with Ingo; the reason you're doing this is that
> > > avg_overlap can easily grow stale.. I can see that happen indeed.
> > > 
> > > So the 'perfect' thing would be a task-runtime decay; barring that, the
> > > preemption thing seems a sane enough heart-beat of a task.
> > > 
> > > How does the below look to you?
> > 
> > Other than the fact that the test for sync reject is currently
> > avg_overlap > sysctl_sched_migration_cost, looks fine to me.  Having it
> > capped at the boundary is probably the better way to go.
> 
> Heh, doesn't _quite_ work though.  The little bugger now hovers just
> under :-/

> se.avg_overlap                     :             0.499993

Right, update_avg()'s >>3 and the off-by-one you spotted.

I reckon stuff works better with a 2* added? After that I guess it's
praying sysbench still works.. :-)



* Re: [patch] Re: scheduler oddity [bug?]
  2009-03-09 13:27                   ` Peter Zijlstra
@ 2009-03-09 13:51                     ` Mike Galbraith
  2009-03-09 14:00                     ` David Newall
  1 sibling, 0 replies; 42+ messages in thread
From: Mike Galbraith @ 2009-03-09 13:51 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, Balazs Scheidler, linux-kernel, Willy Tarreau

On Mon, 2009-03-09 at 14:27 +0100, Peter Zijlstra wrote:
> On Mon, 2009-03-09 at 14:16 +0100, Mike Galbraith wrote:
> > On Mon, 2009-03-09 at 12:04 +0100, Peter Zijlstra wrote:
> > 
> > > OK, talked a bit with Ingo; the reason you're doing this is that
> > > avg_overlap can easily grow stale.. I can see that happen indeed.
> > > 
> > > So the 'perfect' thing would be a task-runtime decay; barring that, the
> > > preemption thing seems a sane enough heart-beat of a task.
> > > 
> > > How does the below look to you?
> > 
> > Other than the fact that the test for sync reject is currently
> > avg_overlap > sysctl_sched_migration_cost, looks fine to me.  Having it
> > capped at the boundary is probably the better way to go.
> 
> Ah, yes, and looking at update_avg() we'll also discard the lower 3
> bits, so we'll never actually reach it.
> 
> So I guess it should read something like:
> 
>   update_avg(&prev->se.avg_overlap, 2*sysctl_sched_migration_cost);
> 
> or somesuch.
> 
> Does it actually solve the reported problem? I've only thought about the
> issue so far :-)

 5977 root      20   0  3672  440  352 R  100  0.0   0:28.53 2 pipetest
 5978 root      20   0  3668  180   96 S   29  0.0   0:08.27 0 pipetest

Yup, works for me.  Ship it :)

	-Mike



* Re: [patch] Re: scheduler oddity [bug?]
  2009-03-09 13:46                     ` Peter Zijlstra
@ 2009-03-09 13:58                       ` Mike Galbraith
  2009-03-09 14:11                         ` Mike Galbraith
  0 siblings, 1 reply; 42+ messages in thread
From: Mike Galbraith @ 2009-03-09 13:58 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, Balazs Scheidler, linux-kernel, Willy Tarreau

On Mon, 2009-03-09 at 14:46 +0100, Peter Zijlstra wrote:
> On Mon, 2009-03-09 at 14:37 +0100, Mike Galbraith wrote:
> > On Mon, 2009-03-09 at 14:16 +0100, Mike Galbraith wrote:
> > > On Mon, 2009-03-09 at 12:04 +0100, Peter Zijlstra wrote:
> > > 
> > > > OK, talked a bit with Ingo; the reason you're doing this is that
> > > > avg_overlap can easily grow stale.. I can see that happen indeed.
> > > > 
> > > > So the 'perfect' thing would be a task-runtime decay; barring that, the
> > > > preemption thing seems a sane enough heart-beat of a task.
> > > > 
> > > > How does the below look to you?
> > > 
> > > Other than the fact that the test for sync reject is currently
> > > avg_overlap > sysctl_sched_migration_cost, looks fine to me.  Having it
> > > capped at the boundary is probably the better way to go.
> > 
> > Heh, doesn't _quite_ work though.  The little bugger now hovers just
> > under :-/
> 
> > se.avg_overlap                     :             0.499993
> 
> Right, update_avg()'s >>3 and the off-by-one you spotted.

(that was with the off-by-one fixed, but..)

> I reckon stuff works better with a 2* added? After that I guess it's
> praying sysbench still works.. :-)

Yes 2* worked fine.  Mysql+oltp was my worry spot, being a very affinity
sensitive little <bleep>, but my patchlet didn't cause any trouble, so
this one shouldn't either.  I'll do some re-test in any case, and squeak
should anything turn up.

	-Mike



* Re: [patch] Re: scheduler oddity [bug?]
  2009-03-09 13:27                   ` Peter Zijlstra
  2009-03-09 13:51                     ` Mike Galbraith
@ 2009-03-09 14:00                     ` David Newall
  2009-03-09 14:19                       ` Peter Zijlstra
  1 sibling, 1 reply; 42+ messages in thread
From: David Newall @ 2009-03-09 14:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mike Galbraith, Ingo Molnar, Balazs Scheidler, linux-kernel,
	Willy Tarreau

Peter Zijlstra wrote:
> Does it actually solve the reported problem? I've only thought about the
> issue so far

Wasn't it reported to be caused by an off-by-one bug in commit 
38736f475071b80b66be28af7b44c854073699cc?


* Re: [patch] Re: scheduler oddity [bug?]
  2009-03-09 13:58                       ` Mike Galbraith
@ 2009-03-09 14:11                         ` Mike Galbraith
  2009-03-09 14:41                           ` Peter Zijlstra
  0 siblings, 1 reply; 42+ messages in thread
From: Mike Galbraith @ 2009-03-09 14:11 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, Balazs Scheidler, linux-kernel, Willy Tarreau

On Mon, 2009-03-09 at 14:58 +0100, Mike Galbraith wrote:
> On Mon, 2009-03-09 at 14:46 +0100, Peter Zijlstra wrote:
> > On Mon, 2009-03-09 at 14:37 +0100, Mike Galbraith wrote:
> > > On Mon, 2009-03-09 at 14:16 +0100, Mike Galbraith wrote:
> > > > On Mon, 2009-03-09 at 12:04 +0100, Peter Zijlstra wrote:
> > > > 
> > > > > OK, talked a bit with Ingo; the reason you're doing this is that
> > > > > avg_overlap can easily grow stale.. I can see that happen indeed.
> > > > > 
> > > > > So the 'perfect' thing would be a task-runtime decay; barring that, the
> > > > > preemption thing seems a sane enough heart-beat of a task.
> > > > > 
> > > > > How does the below look to you?
> > > > 
> > > > Other than the fact that the test for sync reject is currently
> > > > avg_overlap > sysctl_sched_migration_cost, looks fine to me.  Having it
> > > > capped at the boundary is probably the better way to go.
> > > 
> > > Heh, doesn't _quite_ work though.  The little bugger now hovers just
> > > under :-/
> > 
> > > se.avg_overlap                     :             0.499993
> > 
> > Right, update_avg()'s >>3 and the off-by-one you spotted.
> 
> (that was with the off-by-one fixed, but..)
> 
> > I reckon stuff works better with a 2* added? After that I guess it's
> > praying sysbench still works.. :-)
> 
> Yes 2* worked fine.  Mysql+oltp was my worry spot, being a very affinity
> sensitive little <bleep>, but my patchlet didn't cause any trouble, so
> this one shouldn't either.  I'll do some re-test in any case, and squeak
> should anything turn up.

Squeak!  Didn't even get to mysql+oltp.

marge:..local/tmp # netperf -t UDP_STREAM -l 60 -H 127.0.0.1  -- -P 15888,12384 -s 32768 -S 32768 -m 4096
UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 15888 AF_INET to 127.0.0.1 (127.0.0.1) port 12384 AF_INET
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

 65536    4096   60.00     5161103      0    2818.65
 65536           60.00     5149666           2812.40

 6188 root      20   0  1040  544  324 R  100  0.0   0:31.49 0 netperf
 6189 root      20   0  1044  260  164 R   48  0.0   0:15.35 3 netserver

Hurt, pain, ouch, vs...

marge:..local/tmp # netperf -t UDP_STREAM -l 60 -H 127.0.0.1 -T 0,0 -- -P 15888,12384 -s 32768 -S 32768 -m 4096
UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 15888 AF_INET to 127.0.0.1 (127.0.0.1) port 12384 AF_INET : cpu bind
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

 65536    4096   60.00     8452028      0    4615.93
 65536           60.00     8442945           4610.97

Drat.

	-Mike



* Re: [patch] Re: scheduler oddity [bug?]
  2009-03-09 14:00                     ` David Newall
@ 2009-03-09 14:19                       ` Peter Zijlstra
  2009-03-10  0:20                         ` David Newall
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2009-03-09 14:19 UTC (permalink / raw)
  To: David Newall
  Cc: Mike Galbraith, Ingo Molnar, Balazs Scheidler, linux-kernel,
	Willy Tarreau

On Tue, 2009-03-10 at 00:30 +1030, David Newall wrote:
> Peter Zijlstra wrote:
> > Does it actually solve the reported problem? I've only thought about the
> > issue so far
> 
> Wasn't it reported to be caused by an off-by-one bug in commit 
> 38736f475071b80b66be28af7b44c854073699cc?

Completely different issue. Unfortunate interference of effects though.




* Re: [patch] Re: scheduler oddity [bug?]
  2009-03-09 14:11                         ` Mike Galbraith
@ 2009-03-09 14:41                           ` Peter Zijlstra
  2009-03-09 15:30                             ` Mike Galbraith
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2009-03-09 14:41 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Ingo Molnar, Balazs Scheidler, linux-kernel, Willy Tarreau

On Mon, 2009-03-09 at 15:11 +0100, Mike Galbraith wrote:

> > Yes 2* worked fine.  Mysql+oltp was my worry spot, being a very affinity
> > sensitive little <bleep>, but my patchlet didn't cause any trouble, so
> > this one shouldn't either.  I'll do some re-test in any case, and squeak
> > should anything turn up.
> 
> Squeak!  Didn't even get to mysql+oltp.
> 
> marge:..local/tmp # netperf -t UDP_STREAM -l 60 -H 127.0.0.1  -- -P 15888,12384 -s 32768 -S 32768 -m 4096
> UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 15888 AF_INET to 127.0.0.1 (127.0.0.1) port 12384 AF_INET
> Socket  Message  Elapsed      Messages
> Size    Size     Time         Okay Errors   Throughput
> bytes   bytes    secs            #      #   10^6bits/sec
> 
>  65536    4096   60.00     5161103      0    2818.65
>  65536           60.00     5149666           2812.40
> 
>  6188 root      20   0  1040  544  324 R  100  0.0   0:31.49 0 netperf
>  6189 root      20   0  1044  260  164 R   48  0.0   0:15.35 3 netserver
> 
> Hurt, pain, ouch, vs...
> 
> marge:..local/tmp # netperf -t UDP_STREAM -l 60 -H 127.0.0.1 -T 0,0 -- -P 15888,12384 -s 32768 -S 32768 -m 4096
> UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 15888 AF_INET to 127.0.0.1 (127.0.0.1) port 12384 AF_INET : cpu bind
> Socket  Message  Elapsed      Messages
> Size    Size     Time         Okay Errors   Throughput
> bytes   bytes    secs            #      #   10^6bits/sec
> 
>  65536    4096   60.00     8452028      0    4615.93
>  65536           60.00     8442945           4610.97
> 
> Drat.

Bugger, so back to the drawing board it is...



* Re: [patch] Re: scheduler oddity [bug?]
  2009-03-09 14:41                           ` Peter Zijlstra
@ 2009-03-09 15:30                             ` Mike Galbraith
  2009-03-09 16:12                               ` Peter Zijlstra
  0 siblings, 1 reply; 42+ messages in thread
From: Mike Galbraith @ 2009-03-09 15:30 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, Balazs Scheidler, linux-kernel, Willy Tarreau

On Mon, 2009-03-09 at 15:41 +0100, Peter Zijlstra wrote:
> On Mon, 2009-03-09 at 15:11 +0100, Mike Galbraith wrote:
> 
> > > Yes 2* worked fine.  Mysql+oltp was my worry spot, being a very affinity
> > > sensitive little <bleep>, but my patchlet didn't cause any trouble, so
> > > this one shouldn't either.  I'll do some re-test in any case, and squeak
> > > should anything turn up.
> > 
> > Squeak!  Didn't even get to mysql+oltp.
> > 
> > marge:..local/tmp # netperf -t UDP_STREAM -l 60 -H 127.0.0.1  -- -P 15888,12384 -s 32768 -S 32768 -m 4096
> > UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 15888 AF_INET to 127.0.0.1 (127.0.0.1) port 12384 AF_INET
> > Socket  Message  Elapsed      Messages
> > Size    Size     Time         Okay Errors   Throughput
> > bytes   bytes    secs            #      #   10^6bits/sec
> > 
> >  65536    4096   60.00     5161103      0    2818.65
> >  65536           60.00     5149666           2812.40
> > 
> >  6188 root      20   0  1040  544  324 R  100  0.0   0:31.49 0 netperf
> >  6189 root      20   0  1044  260  164 R   48  0.0   0:15.35 3 netserver
> > 
> > Hurt, pain, ouch, vs...
> > 
> > marge:..local/tmp # netperf -t UDP_STREAM -l 60 -H 127.0.0.1 -T 0,0 -- -P 15888,12384 -s 32768 -S 32768 -m 4096
> > UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 15888 AF_INET to 127.0.0.1 (127.0.0.1) port 12384 AF_INET : cpu bind
> > Socket  Message  Elapsed      Messages
> > Size    Size     Time         Okay Errors   Throughput
> > bytes   bytes    secs            #      #   10^6bits/sec
> > 
> >  65536    4096   60.00     8452028      0    4615.93
> >  65536           60.00     8442945           4610.97
> > 
> > Drat.
> 
> Bugger, so back to the drawing board it is...

Hm.

CPU utilization wise, this test is similar to pipetest.  The major
difference is chunk size.  Netperf is waking and being preempted (if on
the same CPU) at a very high rate, so the hog component gets cpu in tiny
chunks, vs hefty chunks for pipetest.

Simply doing the below (will look very familiar) made both netperf and
pipetest happy again, because of that preemption rate.  Both start life
wanting to be affine, and due to the switch rate, pipetest becomes
non-affine, but netperf remains affine.

Maybe we should factor in wakeup rate, and whether we're waking many vs
one.  Wakeup is tied to data, so there is correlation to potential
cache-miss pain, no?

There is also evidence that your patch did in fact make the right
decision, but that we really REALLY should try to punt to a CPU that
shares a cache if available.  Check out the numbers when the netperf
test runs on two CPUs that share cache.

marge:..local/tmp # netperf -t UDP_STREAM -l 60 -H 127.0.0.1 -T 0,1 -- -P 15888,12384 -s 32768 -S 32768 -m 4096
UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 15888 AF_INET to 127.0.0.1 (127.0.0.1) port 12384 AF_INET : cpu bind
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

 65536    4096   60.00     15325632      0    8369.84
 65536           60.00     15321176           8367.40

(You can skip the below, nothing new there.  Just for completeness;)

diff --git a/kernel/sched.c b/kernel/sched.c
index 8e2558c..0f67b2a 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4508,6 +4508,24 @@ static inline void schedule_debug(struct task_struct *prev)
 #endif
 }
 
+static void put_prev_task(struct rq *rq, struct task_struct *prev)
+{
+	if (prev->state == TASK_RUNNING) {
+		u64 runtime = prev->se.sum_exec_runtime;
+
+		runtime -= prev->se.prev_sum_exec_runtime;
+		runtime = min_t(u64, runtime, 2*sysctl_sched_migration_cost);
+
+		/*
+		 * In order to avoid avg_overlap growing stale when we are
+		 * indeed overlapping and hence not getting put to sleep, grow
+		 * the avg_overlap on preemption.
+		 */
+		update_avg(&prev->se.avg_overlap, runtime);
+	}
+	prev->sched_class->put_prev_task(rq, prev);
+}
+
 /*
  * Pick up the highest-prio task:
  */
@@ -4586,7 +4604,7 @@ need_resched_nonpreemptible:
 	if (unlikely(!rq->nr_running))
 		idle_balance(cpu, rq);
 
-	prev->sched_class->put_prev_task(rq, prev);
+	put_prev_task(rq, prev);
 	next = pick_next_task(rq, prev);
 
 	if (likely(prev != next)) {



^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [patch] Re: scheduler oddity [bug?]
  2009-03-09  8:02           ` [patch] " Mike Galbraith
  2009-03-09  8:07             ` Ingo Molnar
@ 2009-03-09 15:57             ` Balazs Scheidler
  2009-03-10  3:16               ` Mike Galbraith
  1 sibling, 1 reply; 42+ messages in thread
From: Balazs Scheidler @ 2009-03-09 15:57 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Ingo Molnar, linux-kernel, Peter Zijlstra, Willy Tarreau

Hi,

Just an interesting sidenote:

I've ported the quoted patch and
38736f475071b80b66be28af7b44c854073699cc  (the one I've found via
bisect) to 2.6.27, but these didn't resolve my scheduling problem: both
my test program and my application still use only one CPU. So probably
the rest of the scheduling patches between 2.6.27..2.6.28 have some
effect too.

On Mon, 2009-03-09 at 09:02 +0100, Mike Galbraith wrote:
> On Sun, 2009-03-08 at 18:52 +0100, Ingo Molnar wrote:
> > * Mike Galbraith <efault@gmx.de> wrote:
> > 
> > > On Sun, 2009-03-08 at 16:39 +0100, Ingo Molnar wrote:
> > > > * Mike Galbraith <efault@gmx.de> wrote:
> > > > 
> > > > > The problem with your particular testcase is that while one 
> > > > > half has an avg_overlap (what we use as affinity hint for 
> > > > > synchronous wakeups) which triggers the affinity hint, the 
> > > > > other half has avg_overlap of zero, what it was born with, so 
> > > > > despite significant execution overlap, the scheduler treats 
> > > > > them as if they were truly synchronous tasks.
> > > > 
> > > > hm, why does it stay on zero?
> > > 
> > > Wakeup preemption.  Presuming here: heavy task wakes light 
> > > task, is preempted, light task stuffs data into pipe, heavy 
> > > task doesn't block, so no avg_overlap is ever computed.  The 
> > > heavy task uses 100% CPU.
> > > 
> > > Running as SCHED_BATCH (virgin source), it becomes sane.
> > 
> > ah.
> > 
> > I'd argue then that time spent on the rq preempted _should_ 
> > count in avg_overlap statistics. I.e. couldnt we do something 
> > like ... your patch? :)
> > 
> > > >     if (sleep && p->se.last_wakeup) {
> > > >             update_avg(&p->se.avg_overlap,
> > > >                        p->se.sum_exec_runtime - p->se.last_wakeup);
> > > >             p->se.last_wakeup = 0;
> > > > -   }
> > > > +   } else if (p->se.avg_overlap < limit && runtime >= limit)
> > > > +           update_avg(&p->se.avg_overlap, runtime);
> > 
> > Just done unconditionally, i.e. something like:
> > 
> > 	if (sleep) {
> > 		runtime = p->se.sum_exec_runtime - p->se.last_wakeup;
> > 		p->se.last_wakeup = 0;
> > 	} else {
> > 		runtime = p->se.sum_exec_runtime - p->se.prev_sum_exec_runtime;
> > 	}
> > 
> > 	update_avg(&p->se.avg_overlap, runtime);
> > 
> > ?
> 
> OK, I've not seen any problem indications yet, so find patchlet below.
> 
> However! Balazs has stated that this problem is _not_ present in .git,
> and that..
> 
> 	commit 38736f475071b80b66be28af7b44c854073699cc
> 	Author: Gautham R Shenoy <ego@in.ibm.com>
> 	Date:   Sat Sep 6 14:50:23 2008 +0530
> 
> ..is what fixed it.  Willy Tarreau verified this as being the case on
> his HW as well.  It is present in .git with my HW.
> 
> I see it as a problem, but it's your call.  Dunno if I'd apply it or
> hold back, given these conflicting reports.
> 
> Anyway...
> 
> Given a task pair communicating via pipe, if one partner fills/drains such
> that the other does not block for extended periods, avg_overlap can be long
> stale, and trigger affine wakeups despite heavy CPU demand.  This can, and
> does, lead to throughput loss in the testcase posted by the reporter.
> 
> Fix this by unconditionally updating avg_overlap at dequeue time instead
> of only updating when a task sleeps.
> 
> See http://lkml.org/lkml/2009/3/7/79 for details/testcase.
> 
> Reported-by: Balazs Scheidler <bazsi@balabit.hu>
> Signed-off-by: Mike Galbraith <efault@gmx.de>
> 
>  kernel/sched.c |    9 +++++++--
>  1 files changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 8e2558c..c670050 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -1712,12 +1712,17 @@ static void enqueue_task(struct rq *rq, struct task_struct *p, int wakeup)
>  
>  static void dequeue_task(struct rq *rq, struct task_struct *p, int sleep)
>  {
> +	u64 runtime;
> +
>  	if (sleep && p->se.last_wakeup) {
> -		update_avg(&p->se.avg_overlap,
> -			   p->se.sum_exec_runtime - p->se.last_wakeup);
> +		runtime = p->se.sum_exec_runtime - p->se.last_wakeup;
>  		p->se.last_wakeup = 0;
> +	} else {
> +		runtime = p->se.sum_exec_runtime - p->se.prev_sum_exec_runtime;
>  	}
>  
> +	update_avg(&p->se.avg_overlap, runtime);
> +
>  	sched_info_dequeued(p);
>  	p->sched_class->dequeue_task(rq, p, sleep);
>  	p->se.on_rq = 0;
> 
> 
> 
> 
-- 
Bazsi



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch] Re: scheduler oddity [bug?]
  2009-03-09 15:30                             ` Mike Galbraith
@ 2009-03-09 16:12                               ` Peter Zijlstra
  2009-03-09 17:28                                 ` Mike Galbraith
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2009-03-09 16:12 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Ingo Molnar, Balazs Scheidler, linux-kernel, Willy Tarreau

On Mon, 2009-03-09 at 16:30 +0100, Mike Galbraith wrote:
> +static void put_prev_task(struct rq *rq, struct task_struct *prev)
> +{
> +       if (prev->state == TASK_RUNNING) {
> +               u64 runtime = prev->se.sum_exec_runtime;
> +
> +               runtime -= prev->se.prev_sum_exec_runtime;
> +               runtime = min_t(u64, runtime, 2*sysctl_sched_migration_cost);
> +
> +               /*
> +                * In order to avoid avg_overlap growing stale when we are
> +                * indeed overlapping and hence not getting put to sleep, grow
> +                * the avg_overlap on preemption.
> +                */
> +               update_avg(&prev->se.avg_overlap, runtime);
> +       }
> +       prev->sched_class->put_prev_task(rq, prev);
> +}

Right, so we both found it worked quite well; I'm still slightly puzzled
by it.

If something gets preempted a lot and will therefore have short runtimes
it will be seen as sync even though it might not at all be.

Then again, if it's preempted that much, it won't be likely to obtain a
large cache footprint either...

hohumm


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch] Re: scheduler oddity [bug?]
  2009-03-09 16:12                               ` Peter Zijlstra
@ 2009-03-09 17:28                                 ` Mike Galbraith
  2009-03-15 13:53                                   ` Balazs Scheidler
  0 siblings, 1 reply; 42+ messages in thread
From: Mike Galbraith @ 2009-03-09 17:28 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, Balazs Scheidler, linux-kernel, Willy Tarreau

On Mon, 2009-03-09 at 17:12 +0100, Peter Zijlstra wrote:
> On Mon, 2009-03-09 at 16:30 +0100, Mike Galbraith wrote:
> > +static void put_prev_task(struct rq *rq, struct task_struct *prev)
> > +{
> > +       if (prev->state == TASK_RUNNING) {
> > +               u64 runtime = prev->se.sum_exec_runtime;
> > +
> > +               runtime -= prev->se.prev_sum_exec_runtime;
> > +               runtime = min_t(u64, runtime, 2*sysctl_sched_migration_cost);
> > +
> > +               /*
> > +                * In order to avoid avg_overlap growing stale when we are
> > +                * indeed overlapping and hence not getting put to sleep, grow
> > +                * the avg_overlap on preemption.
> > +                */
> > +               update_avg(&prev->se.avg_overlap, runtime);
> > +       }
> > +       prev->sched_class->put_prev_task(rq, prev);
> > +}
> 
> Right, so we both found it worked quite well; I'm still slightly puzzled
> by it.
> 
> If something gets preempted a lot and will therefore have short runtimes
> it will be seen as sync even though it might not at all be.

Yes, and the netperf on 2 CPUs with shared cache numbers show that's
happening.  It just so happens that in the non-shared case, netperf's
cache pain far outweighs the benefit of having more CPU available :-/

	-Mike


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch] Re: scheduler oddity [bug?]
  2009-03-09 14:19                       ` Peter Zijlstra
@ 2009-03-10  0:20                         ` David Newall
  0 siblings, 0 replies; 42+ messages in thread
From: David Newall @ 2009-03-10  0:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mike Galbraith, Ingo Molnar, Balazs Scheidler, linux-kernel,
	Willy Tarreau

Peter Zijlstra wrote:
> On Tue, 2009-03-10 at 00:30 +1030, David Newall wrote:
>   
>> Wasn't it reported to be caused by an off-by-one bug in commit 
>> 38736f475071b80b66be28af7b44c854073699cc?
>>     
>
> Completely different issue. Unfortunate interference of effects though.


And yet, there *is* a bug in that commit: as I explained, it won't
return a task if it's the last entry in the list, which seems quite
likely immediately after a fork.  Has my patch been tried?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch] Re: scheduler oddity [bug?]
  2009-03-09 15:57             ` Balazs Scheidler
@ 2009-03-10  3:16               ` Mike Galbraith
  0 siblings, 0 replies; 42+ messages in thread
From: Mike Galbraith @ 2009-03-10  3:16 UTC (permalink / raw)
  To: Balazs Scheidler; +Cc: Ingo Molnar, linux-kernel, Peter Zijlstra, Willy Tarreau

On Mon, 2009-03-09 at 16:57 +0100, Balazs Scheidler wrote:
> Hi,
> 
> Just an interesting sidenote:
> 
> I've ported the quoted patch and
> 38736f475071b80b66be28af7b44c854073699cc  (the one I've found via
> bisect) to 2.6.27, but these didn't resolve my scheduling problem: both
> my test program and my application still use only one CPU. So probably
> the rest of the scheduling patches between 2.6.27..2.6.28 have some
> effect too.

It's probably the changes to wake_affine().  27 doesn't have the
"disable sync if avg_overlap is too large" tests.

	-Mike


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch] Re: scheduler oddity [bug?]
  2009-03-09 17:28                                 ` Mike Galbraith
@ 2009-03-15 13:53                                   ` Balazs Scheidler
  2009-03-15 17:16                                     ` Mike Galbraith
  0 siblings, 1 reply; 42+ messages in thread
From: Balazs Scheidler @ 2009-03-15 13:53 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Willy Tarreau

On Mon, 2009-03-09 at 18:28 +0100, Mike Galbraith wrote:
> On Mon, 2009-03-09 at 17:12 +0100, Peter Zijlstra wrote:
> > On Mon, 2009-03-09 at 16:30 +0100, Mike Galbraith wrote:
> > > +static void put_prev_task(struct rq *rq, struct task_struct *prev)
> > > +{
> > > +       if (prev->state == TASK_RUNNING) {
> > > +               u64 runtime = prev->se.sum_exec_runtime;
> > > +
> > > +               runtime -= prev->se.prev_sum_exec_runtime;
> > > +               runtime = min_t(u64, runtime, 2*sysctl_sched_migration_cost);
> > > +
> > > +               /*
> > > +                * In order to avoid avg_overlap growing stale when we are
> > > +                * indeed overlapping and hence not getting put to sleep, grow
> > > +                * the avg_overlap on preemption.
> > > +                */
> > > +               update_avg(&prev->se.avg_overlap, runtime);
> > > +       }
> > > +       prev->sched_class->put_prev_task(rq, prev);
> > > +}
> > 
> > Right, so we both found it worked quite well; I'm still slightly puzzled
> > by it.
> > 
> > If something gets preempted a lot and will therefore have short runtimes
> > it will be seen as sync even though it might not at all be.
> 
> Yes, and the netperf on 2 CPUs with shared cache numbers show that's
> happening.  It just so happens that in the non-shared case, netperf's
> cache pain far outweighs the benefit of having more CPU available :-/

Any news on this? I haven't seen a patch that was actually integrated,
or did I just miss something?

-- 
Bazsi



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch] Re: scheduler oddity [bug?]
  2009-03-15 13:53                                   ` Balazs Scheidler
@ 2009-03-15 17:16                                     ` Mike Galbraith
  2009-03-15 18:57                                       ` Ingo Molnar
  0 siblings, 1 reply; 42+ messages in thread
From: Mike Galbraith @ 2009-03-15 17:16 UTC (permalink / raw)
  To: Balazs Scheidler; +Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Willy Tarreau

On Sun, 2009-03-15 at 14:53 +0100, Balazs Scheidler wrote:
> On Mon, 2009-03-09 at 18:28 +0100, Mike Galbraith wrote:
> > On Mon, 2009-03-09 at 17:12 +0100, Peter Zijlstra wrote:
> > > On Mon, 2009-03-09 at 16:30 +0100, Mike Galbraith wrote:
> > > > +static void put_prev_task(struct rq *rq, struct task_struct *prev)
> > > > +{
> > > > +       if (prev->state == TASK_RUNNING) {
> > > > +               u64 runtime = prev->se.sum_exec_runtime;
> > > > +
> > > > +               runtime -= prev->se.prev_sum_exec_runtime;
> > > > +               runtime = min_t(u64, runtime, 2*sysctl_sched_migration_cost);
> > > > +
> > > > +               /*
> > > > +                * In order to avoid avg_overlap growing stale when we are
> > > > +                * indeed overlapping and hence not getting put to sleep, grow
> > > > +                * the avg_overlap on preemption.
> > > > +                */
> > > > +               update_avg(&prev->se.avg_overlap, runtime);
> > > > +       }
> > > > +       prev->sched_class->put_prev_task(rq, prev);
> > > > +}
> > > 
> > > Right, so we both found it worked quite well; I'm still slightly puzzled
> > > by it.
> > > 
> > > If something gets preempted a lot and will therefore have short runtimes
> > > it will be seen as sync even though it might not at all be.
> > 
> > Yes, and the netperf on 2 CPUs with shared cache numbers show that's
> > happening.  It just so happens that in the non-shared case, netperf's
> > cache pain far outweighs the benefit of having more CPU available :-/
> 
> Any news on this? I haven't seen a patch that was actually integrated,
> or did I just miss something?

It's in tip, as df1c99d416500da8d26a4d78777467c53ee7689e.

	-Mike


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch] Re: scheduler oddity [bug?]
  2009-03-15 17:16                                     ` Mike Galbraith
@ 2009-03-15 18:57                                       ` Ingo Molnar
  2009-03-16 11:55                                         ` Balazs Scheidler
  0 siblings, 1 reply; 42+ messages in thread
From: Ingo Molnar @ 2009-03-15 18:57 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Balazs Scheidler, Peter Zijlstra, linux-kernel, Willy Tarreau


* Mike Galbraith <efault@gmx.de> wrote:

> On Sun, 2009-03-15 at 14:53 +0100, Balazs Scheidler wrote:
> > On Mon, 2009-03-09 at 18:28 +0100, Mike Galbraith wrote:
> > > On Mon, 2009-03-09 at 17:12 +0100, Peter Zijlstra wrote:
> > > > On Mon, 2009-03-09 at 16:30 +0100, Mike Galbraith wrote:
> > > > > +static void put_prev_task(struct rq *rq, struct task_struct *prev)
> > > > > +{
> > > > > +       if (prev->state == TASK_RUNNING) {
> > > > > +               u64 runtime = prev->se.sum_exec_runtime;
> > > > > +
> > > > > +               runtime -= prev->se.prev_sum_exec_runtime;
> > > > > +               runtime = min_t(u64, runtime, 2*sysctl_sched_migration_cost);
> > > > > +
> > > > > +               /*
> > > > > +                * In order to avoid avg_overlap growing stale when we are
> > > > > +                * indeed overlapping and hence not getting put to sleep, grow
> > > > > +                * the avg_overlap on preemption.
> > > > > +                */
> > > > > +               update_avg(&prev->se.avg_overlap, runtime);
> > > > > +       }
> > > > > +       prev->sched_class->put_prev_task(rq, prev);
> > > > > +}
> > > > 
> > > > Right, so we both found it worked quite well; I'm still slightly puzzled
> > > > by it.
> > > > 
> > > > If something gets preempted a lot and will therefore have short runtimes
> > > > it will be seen as sync even though it might not at all be.
> > > 
> > > Yes, and the netperf on 2 CPUs with shared cache numbers show that's
> > > happening.  It just so happens that in the non-shared case, netperf's
> > > cache pain far outweighs the benefit of having more CPU available :-/
> > 
> > Any news on this? I haven't seen a patch that was actually integrated,
> > or did I just miss something?
> 
> It's in tip, as df1c99d416500da8d26a4d78777467c53ee7689e.

testable via:

  http://people.redhat.com/mingo/tip.git/README

Balazs, could you please try it?

	Ingo

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch] Re: scheduler oddity [bug?]
  2009-03-15 18:57                                       ` Ingo Molnar
@ 2009-03-16 11:55                                         ` Balazs Scheidler
  0 siblings, 0 replies; 42+ messages in thread
From: Balazs Scheidler @ 2009-03-16 11:55 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Mike Galbraith, Peter Zijlstra, linux-kernel, Willy Tarreau

On Sun, 2009-03-15 at 19:57 +0100, Ingo Molnar wrote:
> * Mike Galbraith <efault@gmx.de> wrote:
> 
> > On Sun, 2009-03-15 at 14:53 +0100, Balazs Scheidler wrote:
> > > On Mon, 2009-03-09 at 18:28 +0100, Mike Galbraith wrote:
> > > > On Mon, 2009-03-09 at 17:12 +0100, Peter Zijlstra wrote:
> > > > > On Mon, 2009-03-09 at 16:30 +0100, Mike Galbraith wrote:
> > > > > > +static void put_prev_task(struct rq *rq, struct task_struct *prev)
> > > > > > +{
> > > > > > +       if (prev->state == TASK_RUNNING) {
> > > > > > +               u64 runtime = prev->se.sum_exec_runtime;
> > > > > > +
> > > > > > +               runtime -= prev->se.prev_sum_exec_runtime;
> > > > > > +               runtime = min_t(u64, runtime, 2*sysctl_sched_migration_cost);
> > > > > > +
> > > > > > +               /*
> > > > > > +                * In order to avoid avg_overlap growing stale when we are
> > > > > > +                * indeed overlapping and hence not getting put to sleep, grow
> > > > > > +                * the avg_overlap on preemption.
> > > > > > +                */
> > > > > > +               update_avg(&prev->se.avg_overlap, runtime);
> > > > > > +       }
> > > > > > +       prev->sched_class->put_prev_task(rq, prev);
> > > > > > +}
> > > > > 
> > > > > Right, so we both found it worked quite well; I'm still slightly puzzled
> > > > > by it.
> > > > > 
> > > > > If something gets preempted a lot and will therefore have short runtimes
> > > > > it will be seen as sync even though it might not at all be.
> > > > 
> > > > Yes, and the netperf on 2 CPUs with shared cache numbers show that's
> > > > happening.  It just so happens that in the non-shared case, netperf's
> > > > cache pain far outweighs the benefit of having more CPU available :-/
> > > 
> > > Any news on this? I haven't seen a patch that was actually integrated,
> > > or did I just miss something?
> > 
> > It's in tip, as df1c99d416500da8d26a4d78777467c53ee7689e.
> 
> testable via:
> 
>   http://people.redhat.com/mingo/tip.git/README
> 
> Balazs, could you please try it?

Sure, I'll try, although not today as my first son was born yesterday
and I have a lot of things to do, but hopefully I can give it a spin
tomorrow.


-- 
Bazsi



^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2009-03-16 11:55 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-03-07 17:47 scheduler oddity [bug?] Balazs Scheidler
2009-03-07 18:47 ` Balazs Scheidler
2009-03-08 19:45   ` Balazs Scheidler
2009-03-08 22:03     ` Willy Tarreau
2009-03-09  3:35       ` Mike Galbraith
2009-03-09 11:19     ` David Newall
2009-03-08  9:42 ` Mike Galbraith
2009-03-08  9:58   ` Mike Galbraith
2009-03-08 10:02     ` Mike Galbraith
2009-03-08 10:19     ` Peter Zijlstra
2009-03-08 13:35       ` Mike Galbraith
2009-03-08 15:39     ` Ingo Molnar
2009-03-08 16:20       ` Mike Galbraith
2009-03-08 17:52         ` Ingo Molnar
2009-03-08 18:39           ` Mike Galbraith
2009-03-08 18:55             ` Ingo Molnar
2009-03-09  4:10               ` Mike Galbraith
2009-03-09  6:52                 ` Ingo Molnar
2009-03-09  8:02           ` [patch] " Mike Galbraith
2009-03-09  8:07             ` Ingo Molnar
2009-03-09 10:16               ` David Newall
2009-03-09 11:04               ` Peter Zijlstra
2009-03-09 13:16                 ` Mike Galbraith
2009-03-09 13:27                   ` Peter Zijlstra
2009-03-09 13:51                     ` Mike Galbraith
2009-03-09 14:00                     ` David Newall
2009-03-09 14:19                       ` Peter Zijlstra
2009-03-10  0:20                         ` David Newall
2009-03-09 13:37                   ` Mike Galbraith
2009-03-09 13:46                     ` Peter Zijlstra
2009-03-09 13:58                       ` Mike Galbraith
2009-03-09 14:11                         ` Mike Galbraith
2009-03-09 14:41                           ` Peter Zijlstra
2009-03-09 15:30                             ` Mike Galbraith
2009-03-09 16:12                               ` Peter Zijlstra
2009-03-09 17:28                                 ` Mike Galbraith
2009-03-15 13:53                                   ` Balazs Scheidler
2009-03-15 17:16                                     ` Mike Galbraith
2009-03-15 18:57                                       ` Ingo Molnar
2009-03-16 11:55                                         ` Balazs Scheidler
2009-03-09 15:57             ` Balazs Scheidler
2009-03-10  3:16               ` Mike Galbraith
