* Scheduler(?) regression from 2.6.22 to 2.6.24 for short-lived threads
@ 2008-02-09  0:04 Olof Johansson
  2008-02-09  0:08 ` Ingo Molnar
  2008-02-09  7:58 ` Mike Galbraith
  0 siblings, 2 replies; 23+ messages in thread
From: Olof Johansson @ 2008-02-09  0:04 UTC (permalink / raw)
  To: linux-kernel; +Cc: Peter Zijlstra, Ingo Molnar

Hi,

I ended up with a customer benchmark in my lap this week that doesn't
do well on recent kernels. :(

After cutting it down to a simple testcase/microbenchmark, it seems like
recent kernels don't do as well with short-lived threads competing
with the thread they were cloned off of. The CFS scheduler changes come to
mind, but I suppose it could be caused by something else as well.

The pared-down testcase is included below. Reported runtime for the
testcase has increased almost 3x between 2.6.22 and 2.6.24:

2.6.22: 3332 ms
2.6.23: 4397 ms
2.6.24: 8953 ms
2.6.24-git19: 8986 ms

While running, it'll fork off a bunch of threads, each doing just a little
work, then busy-waiting on the original thread to finish as well. Yes,
it's incredibly stupidly coded but that's not my point here.

During a run (runtime 10s on my 1.5GHz Core2 Duo laptop), vmstat 2 shows:

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  0      0 115196 364748 2248396    0    0     0     0  163   89  0  0 100  0
 2  0      0 115172 364748 2248396    0    0     0     0  270  178 24  0 76  0
 2  0      0 115172 364748 2248396    0    0     0     0  402  283 52  0 48  0
 2  0      0 115180 364748 2248396    0    0     0     0  402  281 50  0 50  0
 2  0      0 115180 364764 2248396    0    0     0    22  403  295 51  0 48  1
 2  0      0 115056 364764 2248396    0    0     0     0  399  280 52  0 48  0
 0  0      0 115196 364764 2248396    0    0     0     0  241  141 17  0 83  0
 0  0      0 115196 364768 2248396    0    0     0     2  155   67  0  0 100  0
 0  0      0 115196 364768 2248396    0    0     0     0  148   62  0  0 100  0

I.e. the run queue length is 2, but only one CPU is busy. However, this still
seems to be the case on the kernel that runs the testcase in more reasonable time.

Also, 'time' reports real and user time roughly the same on all kernels,
so it's not that the older kernels are better at spreading out the load
between the two cores (either that or it doesn't account for stuff right).

I've included the config files, runtime output and vmstat output at
http://lixom.net/~olof/threadtest/. I see similar behaviour on PPC as
well as x86, so it's not architecture-specific.

Testcase below. Yes, I know, there's a bunch of stuff that could be done
differently and better, but it still doesn't explain why there's a 3x
slowdown between kernel versions...


-Olof



#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <pthread.h>

#ifdef __PPC__
static void atomic_inc(volatile long *a)
{
	long result;

	asm volatile ("1:\n\
			lwarx  %0,0,%1\n\
			addic  %0,%0,1\n\
			stwcx. %0,0,%1\n\
			bne-  1b" : "=&r" (result) : "r"(a));
}
#else
static void atomic_inc(volatile long *a)
{
	asm volatile ("lock; incl %0" : "+m" (*a));
}
#endif

volatile long stopped;

void *thread_func(void *arg)
{
	int cpus = (int)(long)arg;
	int j;

	/* A little work... */
	for (j = 0; j < 10000; j++)
		;

	atomic_inc(&stopped);

	/* ...then busy-wait for the sibling thread to finish */
	while (stopped < cpus)
		j++;

	return NULL;
}

long usecs(void)
{
	struct timeval tv;
	gettimeofday(&tv, NULL);
	return tv.tv_sec * 1000000 + tv.tv_usec;
}

int main(int argc, char **argv)
{
	pthread_t thread;
	int i;
	long t1, t2;

	t1 = usecs();
	for (i = 0; i < 500; i++) {
		stopped = 0;

		pthread_create(&thread, NULL, thread_func, (void *)2);
		thread_func((void *)2);
		pthread_join(thread, NULL);
	}
	t2 = usecs();

	printf("time %ld ms\n", (t2-t1) / 1000);

	return 0;
}


* Re: Scheduler(?) regression from 2.6.22 to 2.6.24 for short-lived threads
  2008-02-09  0:04 Scheduler(?) regression from 2.6.22 to 2.6.24 for short-lived threads Olof Johansson
@ 2008-02-09  0:08 ` Ingo Molnar
  2008-02-09  0:32   ` Olof Johansson
  2008-02-09  7:58 ` Mike Galbraith
  1 sibling, 1 reply; 23+ messages in thread
From: Ingo Molnar @ 2008-02-09  0:08 UTC (permalink / raw)
  To: Olof Johansson; +Cc: linux-kernel, Peter Zijlstra


* Olof Johansson <olof@lixom.net> wrote:

> 2.6.22: 3332 ms
> 2.6.23: 4397 ms
> 2.6.24: 8953 ms
> 2.6.24-git19: 8986 ms

if you enable SCHED_DEBUG, and subtract 4 from the value of 
/proc/sys/kernel/sched_features, does it get any better?

if not, does writing 0 into /proc/sys/kernel/sched_features have any 
impact?

	Ingo
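
For reference, a minimal sketch of the tweak suggested here, assuming a kernel
built with CONFIG_SCHED_DEBUG and root access; which scheduler feature the
"4" bit selects depends on the kernel version, so treat this as an
illustration rather than code from the thread:

#include <stdio.h>

int main(void)
{
	const char *path = "/proc/sys/kernel/sched_features";
	FILE *f;
	long val;

	/* Read the current feature mask */
	f = fopen(path, "r");
	if (!f || fscanf(f, "%ld", &val) != 1) {
		perror(path);
		return 1;
	}
	fclose(f);

	/* Write back (value - 4), i.e. clear that feature bit */
	f = fopen(path, "w");
	if (!f) {
		perror(path);
		return 1;
	}
	fprintf(f, "%ld\n", val - 4);
	fclose(f);

	printf("sched_features: %ld -> %ld\n", val, val - 4);
	return 0;
}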


* Re: Scheduler(?) regression from 2.6.22 to 2.6.24 for short-lived threads
  2008-02-09  0:08 ` Ingo Molnar
@ 2008-02-09  0:32   ` Olof Johansson
  0 siblings, 0 replies; 23+ messages in thread
From: Olof Johansson @ 2008-02-09  0:32 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, Peter Zijlstra

On Sat, Feb 09, 2008 at 01:08:30AM +0100, Ingo Molnar wrote:
> 
> * Olof Johansson <olof@lixom.net> wrote:
> 
> > 2.6.22: 3332 ms
> > 2.6.23: 4397 ms
> > 2.6.24: 8953 ms
> > 2.6.24-git19: 8986 ms
> 
> if you enable SCHED_DEBUG, and subtract 4 from the value of 
> /proc/sys/kernel/sched_features, does it get any better?
> 
> if not, does writing 0 into /proc/sys/kernel/sched_features have any 
> impact?

Doesn't seem to make a difference, either of them. Unfortunately I can't
retest on x86 at the moment, only powerpc. But it made no noticeable
difference in runtime w/ pasemi_defconfig and current mainline.


-Olof


* Re: Scheduler(?) regression from 2.6.22 to 2.6.24 for short-lived threads
  2008-02-09  0:04 Scheduler(?) regression from 2.6.22 to 2.6.24 for short-lived threads Olof Johansson
  2008-02-09  0:08 ` Ingo Molnar
@ 2008-02-09  7:58 ` Mike Galbraith
  2008-02-09  8:03   ` Willy Tarreau
  1 sibling, 1 reply; 23+ messages in thread
From: Mike Galbraith @ 2008-02-09  7:58 UTC (permalink / raw)
  To: Olof Johansson; +Cc: linux-kernel, Peter Zijlstra, Ingo Molnar


On Fri, 2008-02-08 at 18:04 -0600, Olof Johansson wrote:
> Hi,
> 
> I ended up with a customer benchmark in my lap this week that doesn't
> do well on recent kernels. :(
> 
> After cutting it down to a simple testcase/microbenchmark, it seems like
> recent kernels don't do as well with short-lived threads competing
> with the thread it's cloned off of. The CFS scheduler changes come to
> mind, but I suppose it could be caused by something else as well.
> 
> The pared-down testcase is included below. Reported runtime for the
> testcase has increased almost 3x between 2.6.22 and 2.6.24:
> 
> 2.6.22: 3332 ms
> 2.6.23: 4397 ms
> 2.6.24: 8953 ms
> 2.6.24-git19: 8986 ms

My 3GHz P4 shows disjointed results.

2.6.22.17-smp
time 798 ms
time 780 ms
time 702 ms

2.6.22.17-cfs-v24.1-smp
time 562 ms
time 551 ms
time 551 ms

2.6.23.15-smp
time 254 ms
time 254 ms
time 293 ms

2.6.23.15-cfs-v24-smp
time 764 ms
time 791 ms
time 780 ms

2.6.24.1-smp
time 815 ms
time 820 ms
time 771 ms

2.6.25-smp (git today)
time 29 ms
time 61 ms
time 72 ms




* Re: Scheduler(?) regression from 2.6.22 to 2.6.24 for short-lived threads
  2008-02-09  7:58 ` Mike Galbraith
@ 2008-02-09  8:03   ` Willy Tarreau
  2008-02-09 10:58     ` Mike Galbraith
  0 siblings, 1 reply; 23+ messages in thread
From: Willy Tarreau @ 2008-02-09  8:03 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Olof Johansson, linux-kernel, Peter Zijlstra, Ingo Molnar

Hi Mike,

On Sat, Feb 09, 2008 at 08:58:39AM +0100, Mike Galbraith wrote:
> 
> On Fri, 2008-02-08 at 18:04 -0600, Olof Johansson wrote:
> > Hi,
> > 
> > I ended up with a customer benchmark in my lap this week that doesn't
> > do well on recent kernels. :(
> > 
> > After cutting it down to a simple testcase/microbenchmark, it seems like
> > recent kernels don't do as well with short-lived threads competing
> > with the thread it's cloned off of. The CFS scheduler changes come to
> > mind, but I suppose it could be caused by something else as well.
> > 
> > The pared-down testcase is included below. Reported runtime for the
> > testcase has increased almost 3x between 2.6.22 and 2.6.24:
> > 
> > 2.6.22: 3332 ms
> > 2.6.23: 4397 ms
> > 2.6.24: 8953 ms
> > 2.6.24-git19: 8986 ms
> 
> My 3GHz P4 shows disjointed results.
> 
> 2.6.22.17-smp
> time 798 ms
> time 780 ms
> time 702 ms
> 
> 2.6.22.17-cfs-v24.1-smp
> time 562 ms
> time 551 ms
> time 551 ms
> 
> 2.6.23.15-smp
> time 254 ms
> time 254 ms
> time 293 ms
> 
> 2.6.23.15-cfs-v24-smp
> time 764 ms
> time 791 ms
> time 780 ms
> 
> 2.6.24.1-smp
> time 815 ms
> time 820 ms
> time 771 ms

How many CPUs do you have ? It's impressive to see such large variations.
I would guess from the numbers that you sometimes have 1 or 2 CPUs doing
nothing. I've already observed strange CPU usage patterns while building
kernels (bound to 50% usage on a dual-athlon), but I can't say for sure
that it was only on 2.6, so it may be related to make dependencies instead
(at least that's what I've been suspecting till now).

> 2.6.25-smp (git today)
> time 29 ms
> time 61 ms
> time 72 ms

These ones look rather strange. What type of workload is it ? Can you
publish the program for others to test it ?

Regards,
Willy



* Re: Scheduler(?) regression from 2.6.22 to 2.6.24 for short-lived threads
  2008-02-09  8:03   ` Willy Tarreau
@ 2008-02-09 10:58     ` Mike Galbraith
  2008-02-09 11:40       ` Willy Tarreau
  0 siblings, 1 reply; 23+ messages in thread
From: Mike Galbraith @ 2008-02-09 10:58 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: Olof Johansson, linux-kernel, Peter Zijlstra, Ingo Molnar


On Sat, 2008-02-09 at 09:03 +0100, Willy Tarreau wrote:

> How many CPUs do you have ?

It's a P4/HT, so 1 plus $CHUMP_CHANGE_MAYBE

> > 2.6.25-smp (git today)
> > time 29 ms
> > time 61 ms
> > time 72 ms
> 
> These ones look rather strange. What type of workload is it ? Can you
> publish the program for others to test it ?

It's the proglet posted in this thread.

	-Mike



* Re: Scheduler(?) regression from 2.6.22 to 2.6.24 for short-lived threads
  2008-02-09 10:58     ` Mike Galbraith
@ 2008-02-09 11:40       ` Willy Tarreau
  2008-02-09 13:37         ` Mike Galbraith
  0 siblings, 1 reply; 23+ messages in thread
From: Willy Tarreau @ 2008-02-09 11:40 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Olof Johansson, linux-kernel, Peter Zijlstra, Ingo Molnar

On Sat, Feb 09, 2008 at 11:58:25AM +0100, Mike Galbraith wrote:
> 
> On Sat, 2008-02-09 at 09:03 +0100, Willy Tarreau wrote:
> 
> > How many CPUs do you have ?
> 
> It's a P4/HT, so 1 plus $CHUMP_CHANGE_MAYBE
> 
> > > 2.6.25-smp (git today)
> > > time 29 ms
> > > time 61 ms
> > > time 72 ms
> > 
> > These ones look rather strange. What type of workload is it ? Can you
> > publish the program for others to test it ?
> 
> It's the proglet posted in this thread.

OK sorry, I did not notice it when I first read the report.

Regards,
Willy



* Re: Scheduler(?) regression from 2.6.22 to 2.6.24 for short-lived threads
  2008-02-09 11:40       ` Willy Tarreau
@ 2008-02-09 13:37         ` Mike Galbraith
  2008-02-09 16:19           ` Willy Tarreau
  0 siblings, 1 reply; 23+ messages in thread
From: Mike Galbraith @ 2008-02-09 13:37 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: Olof Johansson, linux-kernel, Peter Zijlstra, Ingo Molnar


On Sat, 2008-02-09 at 12:40 +0100, Willy Tarreau wrote: 
> On Sat, Feb 09, 2008 at 11:58:25AM +0100, Mike Galbraith wrote:
> > 
> > On Sat, 2008-02-09 at 09:03 +0100, Willy Tarreau wrote:
> > 
> > > How many CPUs do you have ?
> > 
> > It's a P4/HT, so 1 plus $CHUMP_CHANGE_MAYBE
> > 
> > > > 2.6.25-smp (git today)
> > > > time 29 ms
> > > > time 61 ms
> > > > time 72 ms
> > > 
> > > These ones look rather strange. What type of workload is it ? Can you
> > > publish the program for others to test it ?
> > 
> > It's the proglet posted in this thread.
> 
> OK sorry, I did not notice it when I first read the report.

Hm.  The 2.6.25-smp kernel is the only one that looks like it's doing
what proggy wants to do, massive context switching.  Bump threads to a
larger number so you can watch: the supposedly good kernel (22) is doing
everything on one CPU.  Everybody else sucks differently (idleness), and
the clear throughput winner, via mad over-schedule (!?!), is git today.

	-Mike



* Re: Scheduler(?) regression from 2.6.22 to 2.6.24 for short-lived threads
  2008-02-09 13:37         ` Mike Galbraith
@ 2008-02-09 16:19           ` Willy Tarreau
  2008-02-09 17:33             ` Mike Galbraith
  2008-02-10  5:29             ` Olof Johansson
  0 siblings, 2 replies; 23+ messages in thread
From: Willy Tarreau @ 2008-02-09 16:19 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Olof Johansson, linux-kernel, Peter Zijlstra, Ingo Molnar

On Sat, Feb 09, 2008 at 02:37:39PM +0100, Mike Galbraith wrote:
> 
> On Sat, 2008-02-09 at 12:40 +0100, Willy Tarreau wrote: 
> > On Sat, Feb 09, 2008 at 11:58:25AM +0100, Mike Galbraith wrote:
> > > 
> > > On Sat, 2008-02-09 at 09:03 +0100, Willy Tarreau wrote:
> > > 
> > > > How many CPUs do you have ?
> > > 
> > > It's a P4/HT, so 1 plus $CHUMP_CHANGE_MAYBE
> > > 
> > > > > 2.6.25-smp (git today)
> > > > > time 29 ms
> > > > > time 61 ms
> > > > > time 72 ms
> > > > 
> > > > These ones look rather strange. What type of workload is it ? Can you
> > > > publish the program for others to test it ?
> > > 
> > > It's the proglet posted in this thread.
> > 
> > OK sorry, I did not notice it when I first read the report.
> 
> Hm.  The 2.6.25-smp kernel is the only one that looks like it's doing
> what proggy wants to do, massive context switching.  Bump threads to
> larger number so you can watch: the supposedly good kernel (22) is doing
> everything on one CPU.  Everybody else sucks differently (idleness), and
> the clear throughput winner, via mad over-schedule (!?!), is git today.

For me, 2.6.25-smp gives pretty irregular results :

time 6548 ms
time 7272 ms
time 1188 ms
time 3772 ms

The CPU usage is quite irregular too and never goes beyond 50% (this is a
dual-athlon). If I start two of these processes, 100% of the CPU is used,
the context switch rate is more regular (about 700/s) and the total time
is more regular too (between 14.8 and 18.5 seconds).

Increasing the parallel run time of the two threads by changing the upper
limit of the for(j) loop correctly saturates both processors. I think that
this program simply does not have enough work to do for each thread to run
for a full timeslice, thus showing a random behaviour.

However, I fail to understand the goal of the reproducer. Granted it shows
irregularities in the scheduler under such conditions, but what *real*
workload would spend its time sequentially creating then immediately killing
threads, never using more than 2 at a time ?

If this could be turned into a DoS, I could understand, but here it looks
a bit pointless :-/

Regards,
Willy



* Re: Scheduler(?) regression from 2.6.22 to 2.6.24 for short-lived threads
  2008-02-09 16:19           ` Willy Tarreau
@ 2008-02-09 17:33             ` Mike Galbraith
  2008-02-10  5:29             ` Olof Johansson
  1 sibling, 0 replies; 23+ messages in thread
From: Mike Galbraith @ 2008-02-09 17:33 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: Olof Johansson, linux-kernel, Peter Zijlstra, Ingo Molnar


On Sat, 2008-02-09 at 17:19 +0100, Willy Tarreau wrote:

> However, I fail to understand the goal of the reproducer.

(me too, I was trying to figure out what could be expected)



* Re: Scheduler(?) regression from 2.6.22 to 2.6.24 for short-lived threads
  2008-02-09 16:19           ` Willy Tarreau
  2008-02-09 17:33             ` Mike Galbraith
@ 2008-02-10  5:29             ` Olof Johansson
  2008-02-10  6:15               ` Willy Tarreau
  2008-02-11 21:45               ` Bill Davidsen
  1 sibling, 2 replies; 23+ messages in thread
From: Olof Johansson @ 2008-02-10  5:29 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: Mike Galbraith, linux-kernel, Peter Zijlstra, Ingo Molnar

On Sat, Feb 09, 2008 at 05:19:57PM +0100, Willy Tarreau wrote:
> On Sat, Feb 09, 2008 at 02:37:39PM +0100, Mike Galbraith wrote:
> > 
> > On Sat, 2008-02-09 at 12:40 +0100, Willy Tarreau wrote: 
> > > On Sat, Feb 09, 2008 at 11:58:25AM +0100, Mike Galbraith wrote:
> > > > 
> > > > On Sat, 2008-02-09 at 09:03 +0100, Willy Tarreau wrote:
> > > > 
> > > > > How many CPUs do you have ?
> > > > 
> > > > It's a P4/HT, so 1 plus $CHUMP_CHANGE_MAYBE
> > > > 
> > > > > > 2.6.25-smp (git today)
> > > > > > time 29 ms
> > > > > > time 61 ms
> > > > > > time 72 ms
> > > > > 
> > > > > These ones look rather strange. What type of workload is it ? Can you
> > > > > publish the program for others to test it ?
> > > > 
> > > > It's the proglet posted in this thread.
> > > 
> > > OK sorry, I did not notice it when I first read the report.
> > 
> > Hm.  The 2.6.25-smp kernel is the only one that looks like it's doing
> > what proggy wants to do, massive context switching.  Bump threads to
> > larger number so you can watch: the supposedly good kernel (22) is doing
> > everything on one CPU.  Everybody else sucks differently (idleness), and
> > the clear throughput winner, via mad over-schedule (!?!), is git today.
> 
> For me, 2.6.25-smp gives pretty irregular results :
> 
> time 6548 ms
> time 7272 ms
> time 1188 ms
> time 3772 ms
> 
> The CPU usage is quite irregular too and never goes beyond 50% (this is a
> dual-athlon). If I start two of these processes, 100% of the CPU is used,
> the context switch rate is more regular (about 700/s) and the total time
> is more regular too (between 14.8 and 18.5 seconds).
> 
> Increasing the parallel run time of the two threads by changing the upper
> limit of the for(j) loop correctly saturates both processors. I think that
> this program simply does not have enough work to do for each thread to run
> for a full timeslice, thus showing a random behaviour.

Right. I should have tinkered a bit more with it before I posted it: the
version posted had too little going on in the first loop and thus got
hung up on the second busy-wait loop instead.

I did a bunch of runs with various loop sizes. Basically, what seems to
happen is that the older kernels are quicker at rebalancing a new thread
over to the other cpu, while newer kernels let them share the same cpu
longer (which increases wall-clock runtime).

All of these are built with gcc without optimization, with a larger loop
size, and with an added sched_yield() in the busy-wait loop at the end to
take that out as a factor. As you've seen yourself, runtimes can be quite
noisy, but the trends are quite clear anyway. All of these numbers were
collected with default scheduler runtime options, on the same kernels and
configs as previously posted.
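
A minimal sketch of the modified thread body described above (the loop bound
shown is the 1M case below; the exact bounds, the placement of the yield, and
the use of a gcc builtin in place of the original inline asm are assumptions,
not the exact code these numbers were collected with). It drops into the
testcase from the start of the thread in place of thread_func():

#include <sched.h>

static volatile long stopped;	/* completion counter, as in the testcase */

static void atomic_inc(volatile long *a)
{
	__sync_fetch_and_add(a, 1);
}

static void *thread_body(void *arg)
{
	int cpus = (int)(long)arg;
	volatile long j;

	/* The per-thread work: a much larger loop than in the first posting */
	for (j = 0; j < 1000000; j++)
		;

	atomic_inc(&stopped);

	/* Busy-wait for the sibling, yielding the CPU on every pass */
	while (stopped < cpus)
		sched_yield();

	return NULL;
}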

Loop to 1M:
2.6.22		time 4015 ms
2.6.23		time 4581 ms
2.6.24		time 10765 ms
2.6.24-git19	time 8286 ms

2M:
2.6.22		time 7574 ms
2.6.23		time 9031 ms
2.6.24		time 12844 ms
2.6.24-git19	time 10959 ms

3M:
2.6.22		time 8015 ms
2.6.23		time 13053 ms
2.6.24		time 16204 ms
2.6.24-git19	time 14984 ms

4M:
2.6.22		time 10045 ms
2.6.23		time 16642 ms
2.6.24		time 16910 ms
2.6.24-git19	time 16468 ms

5M:
2.6.22		time 12055 ms
2.6.23		time 21024 ms
<missed 2.6.24 here>
2.6.24-git19	time 16040 ms

10M:
2.6.22		time 24030 ms
2.6.23		time 33082 ms
2.6.24		time 34139 ms
2.6.24-git19	time 33724 ms

20M:
2.6.22		time 50015 ms
2.6.23		time 63963 ms
2.6.24		time 65100 ms
2.6.24-git19	time 63092 ms

40M:
2.6.22		time 94315 ms
2.6.23		time 107930 ms
2.6.24		time 113291 ms
2.6.24-git19	time 110360 ms

So with more work per thread, the differences become smaller, but they're
still there. At the 40M loop, with 500 threads it's quite a bit of
runtime per thread.

> However, I fail to understand the goal of the reproducer. Granted it shows
> irregularities in the scheduler under such conditions, but what *real*
> workload would spend its time sequentially creating then immediately killing
> threads, never using more than 2 at a time ?
>
> If this could be turned into a DoS, I could understand, but here it looks
> a bit pointless :-/

It seems generally unfortunate that it takes longer for a new thread to
move over to the second cpu even when the first is busy with the original
thread. I can certainly see cases where this causes suboptimal overall
system behaviour.

I agree that the testcase is highly artificial. Unfortunately, it's
not uncommon to see this kind of weird testcase from customers trying
to evaluate new hardware. :( They tend to be pared-down versions of
whatever their real workload is (the real workload is doing things more
appropriately, but the smaller version is used for testing). I was lucky
enough to get source snippets to base a standalone reproduction case on
for this; normally we wouldn't even get copies of their binaries.


-Olof


* Re: Scheduler(?) regression from 2.6.22 to 2.6.24 for short-lived threads
  2008-02-10  5:29             ` Olof Johansson
@ 2008-02-10  6:15               ` Willy Tarreau
  2008-02-10  7:00                 ` Olof Johansson
  2008-02-11 21:45               ` Bill Davidsen
  1 sibling, 1 reply; 23+ messages in thread
From: Willy Tarreau @ 2008-02-10  6:15 UTC (permalink / raw)
  To: Olof Johansson; +Cc: Mike Galbraith, linux-kernel, Peter Zijlstra, Ingo Molnar

On Sat, Feb 09, 2008 at 11:29:41PM -0600, Olof Johansson wrote:
> On Sat, Feb 09, 2008 at 05:19:57PM +0100, Willy Tarreau wrote:
> > On Sat, Feb 09, 2008 at 02:37:39PM +0100, Mike Galbraith wrote:
> > > 
> > > On Sat, 2008-02-09 at 12:40 +0100, Willy Tarreau wrote: 
> > > > On Sat, Feb 09, 2008 at 11:58:25AM +0100, Mike Galbraith wrote:
> > > > > 
> > > > > On Sat, 2008-02-09 at 09:03 +0100, Willy Tarreau wrote:
> > > > > 
> > > > > > How many CPUs do you have ?
> > > > > 
> > > > > It's a P4/HT, so 1 plus $CHUMP_CHANGE_MAYBE
> > > > > 
> > > > > > > 2.6.25-smp (git today)
> > > > > > > time 29 ms
> > > > > > > time 61 ms
> > > > > > > time 72 ms
> > > > > > 
> > > > > > These ones look rather strange. What type of workload is it ? Can you
> > > > > > publish the program for others to test it ?
> > > > > 
> > > > > It's the proglet posted in this thread.
> > > > 
> > > > OK sorry, I did not notice it when I first read the report.
> > > 
> > > Hm.  The 2.6.25-smp kernel is the only one that looks like it's doing
> > > what proggy wants to do, massive context switching.  Bump threads to
> > > larger number so you can watch: the supposedly good kernel (22) is doing
> > > everything on one CPU.  Everybody else sucks differently (idleness), and
> > > the clear throughput winner, via mad over-schedule (!?!), is git today.
> > 
> > For me, 2.6.25-smp gives pretty irregular results :
> > 
> > time 6548 ms
> > time 7272 ms
> > time 1188 ms
> > time 3772 ms
> > 
> > The CPU usage is quite irregular too and never goes beyond 50% (this is a
> > dual-athlon). If I start two of these processes, 100% of the CPU is used,
> > the context switch rate is more regular (about 700/s) and the total time
> > is more regular too (between 14.8 and 18.5 seconds).
> > 
> > Increasing the parallel run time of the two threads by changing the upper
> > limit of the for(j) loop correctly saturates both processors. I think that
> > this program simply does not have enough work to do for each thread to run
> > for a full timeslice, thus showing a random behaviour.
> 
> Right. I should have tinkered a bit more with it before I posted it, the
> version posted had too little going on in the first loop and thus got
> hung up on the second busywait loop instead.
> 
> I did a bunch of runs with various loop sizes. Basically, what seems to
> happen is that the older kernels are quicker at rebalancing a new thread
> over to the other cpu, while newer kernels let them share the same cpu
> longer (and thus increases wall clock runtime).
> 
> All of these are built with gcc without optimization, larger loop size
> and an added sched_yield() in the busy-wait loop at the end to take that
> out as a factor. As you've seen yourself, runtimes can be quite noisy
> but the trends are quite clear anyway. All of these numbers were
> collected with default scheduler runtime options, same kernels and
> configs as previously posted.
> 
> Loop to 1M:
> 2.6.22		time 4015 ms
> 2.6.23		time 4581 ms
> 2.6.24		time 10765 ms
> 2.6.24-git19	time 8286 ms
> 
> 2M:
> 2.6.22		time 7574 ms
> 2.6.23		time 9031 ms
> 2.6.24		time 12844 ms
> 2.6.24-git19	time 10959 ms
> 
> 3M:
> 2.6.22		time 8015 ms
> 2.6.23		time 13053 ms
> 2.6.24		time 16204 ms
> 2.6.24-git19	time 14984 ms
> 
> 4M:
> 2.6.22		time 10045 ms
> 2.6.23		time 16642 ms
> 2.6.24		time 16910 ms
> 2.6.24-git19	time 16468 ms
> 
> 5M:
> 2.6.22		time 12055 ms
> 2.6.23		time 21024 ms
> <missed 2.6.24 here>
> 2.6.24-git19	time 16040 ms
> 
> 10M:
> 2.6.22		time 24030 ms
> 2.6.23		time 33082 ms
> 2.6.24		time 34139 ms
> 2.6.24-git19	time 33724 ms
> 
> 20M:
> 2.6.22		time 50015 ms
> 2.6.23		time 63963 ms
> 2.6.24		time 65100 ms
> 2.6.24-git19	time 63092 ms
> 
> 40M:
> 2.6.22		time 94315 ms
> 2.6.23		time 107930 ms
> 2.6.24		time 113291 ms
> 2.6.24-git19	time 110360 ms
> 
> So with more work per thread, the differences become less but they're
> still there. At the 40M loop, with 500 threads it's quite a bit of
> runtime per thread.

No, it's really nothing. I had to push the loop to 1 billion to make the load
noticeable. You don't have 500 threads, you have 2 threads and that load is
repeated 500 times. And if we look at the numbers, let's take the worst one :
> 40M:
> 2.6.24                time 113291 ms
113291/500 = 227 microseconds/loop. This is still very low compared to the
smallest timeslice you would have (1 ms at HZ=1000).

So your threads are still completing *before* the scheduler has to preempt
them.

> > However, I fail to understand the goal of the reproducer. Granted it shows
> > irregularities in the scheduler under such conditions, but what *real*
> > workload would spend its time sequentially creating then immediately killing
> > threads, never using more than 2 at a time ?
> >
> > If this could be turned into a DoS, I could understand, but here it looks
> > a bit pointless :-/
> 
> It seems generally unfortunate that it takes longer for a new thread to
> move over to the second cpu even when the first is busy with the original
> thread. I can certainly see cases where this causes suboptimal overall
> system behaviour.

In fact, I don't think it takes longer; I think it does not do it at thread
creation, but will do it immediately after the first slice is consumed. This
would explain the large differences here. I don't know how we could ensure
that the new thread is created on the second CPU from the start, though.

I tried inserting a sched_yield() at the top of the busy loop (1M loops).
By default, it did not change a thing. Then I simply set sched_compat_yield
to 1, and the two threads then ran simultaneously with a stable low time
(2700 ms instead of 10-12 seconds).

Doing so with 10k loops (initial test) shows times in the range 240-300 ms
only instead of 2200-6500 ms.

Ingo, would it be possible (and wise) to ensure that a new thread being
created gets immediately rebalanced in order to emulate what is done here
with sched_compat_yield=1 and sched_yield() in both threads just after the
thread creation ? I don't expect any performance difference doing this,
but maybe some shell scripts relying on short-lived pipes would get faster
on SMP.
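
A rough, self-contained sketch of that experiment (build with -lpthread), as
an illustration of the idea rather than the exact code run or a proposed
patch: set sched_compat_yield, then have both the parent and the freshly
created thread call sched_yield() right away so the balancer can spread them
across both CPUs; the trivial loop stands in for the testcase's work:

#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *worker(void *unused)
{
	volatile long j;

	sched_yield();			/* child side of the experiment */
	for (j = 0; j < 1000000; j++)	/* stand-in for the real work */
		;
	return NULL;
}

int main(void)
{
	FILE *f = fopen("/proc/sys/kernel/sched_compat_yield", "w");
	pthread_t t;

	if (f) {			/* needs root; skip silently otherwise */
		fprintf(f, "1\n");
		fclose(f);
	}

	pthread_create(&t, NULL, worker, NULL);
	sched_yield();			/* parent side of the experiment */
	pthread_join(t, NULL);
	return 0;
}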

In fact, right now the following command uses only 1 CPU :

$ dd if=/dev/urandom of=rnd bs=1M count=1
$ (time bzip2 -9|bzip2 -9|bzip2 -9|bzip2 -9|bzip2 -9|bzip2 -9)<rnd >/dev/null

real    0m6.977s
user    0m6.908s
sys     0m0.072s

Maybe this is due to the block size in bzip2 though, because gzip does not
exhibit exactly the same behaviour :

$ (time gzip -1|gzip -1|gzip -1|gzip -1|gzip -1|gzip -1)<rnd >/dev/null

real    0m5.300s
user    0m7.392s
sys     0m0.392s

> I agree that the testcase is highly artificial. Unfortunately, it's
> not uncommon to see these kind of weird testcases from customers tring
> to evaluate new hardware. :( They tend to be pared-down versions of
> whatever their real workload is (the real workload is doing things more
> appropriately, but the smaller version is used for testing). I was lucky
> enough to get source snippets to base a standalone reproduction case on
> for this, normally we wouldn't even get copies of their binaries.

I'm well aware of that. What's important is to be able to explain what is
causing the difference and why the test case does not represent anything
related to performance. Maybe the code author wanted to get 500 parallel
threads and got his code wrong ?

Regards,
willy



* Re: Scheduler(?) regression from 2.6.22 to 2.6.24 for short-lived threads
  2008-02-10  6:15               ` Willy Tarreau
@ 2008-02-10  7:00                 ` Olof Johansson
  2008-02-10  7:58                   ` Willy Tarreau
  2008-02-11  8:15                   ` Mike Galbraith
  0 siblings, 2 replies; 23+ messages in thread
From: Olof Johansson @ 2008-02-10  7:00 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: Mike Galbraith, linux-kernel, Peter Zijlstra, Ingo Molnar

On Sun, Feb 10, 2008 at 07:15:58AM +0100, Willy Tarreau wrote:
> On Sat, Feb 09, 2008 at 11:29:41PM -0600, Olof Johansson wrote:
> > 40M:
> > 2.6.22		time 94315 ms
> > 2.6.23		time 107930 ms
> > 2.6.24		time 113291 ms
> > 2.6.24-git19	time 110360 ms
> > 
> > So with more work per thread, the differences become less but they're
> > still there. At the 40M loop, with 500 threads it's quite a bit of
> > runtime per thread.
> 
> No, it's really nothing. I had to push the loop to 1 billion to make the load
> noticeable. You don't have 500 threads, you have 2 threads and that load is
> repeated 500 times. And if we look at the numbers, let's take the worst one :
> > 40M:
> > 2.6.24                time 113291 ms
> 113291/500 = 227 microseconds/loop. This is still very low compared to the
> smallest timeslice you would have (1 ms at HZ=1000).
> 
> So your threads are still completing *before* the scheduler has to preempt
> them.

Hmm? I get that to be 227ms per loop, which is way more than a full
timeslice. Running the program took in the range of 2 minutes, so it's
110000 milliseconds, not microseconds.

> > It seems generally unfortunate that it takes longer for a new thread to
> > move over to the second cpu even when the first is busy with the original
> > thread. I can certainly see cases where this causes suboptimal overall
> > system behaviour.
> 
> In fact, I don't think it takes longer, I think it does not do it at their
> creation, but will do it immediately after the first slice is consumed. This
> would explain the important differences here. I don't know how we could ensure
> that the new thread is created on the second CPU from the start, though.

The math doesn't add up for me. Even if it rebalanced at the end of
the first slice (i.e. after 1ms), that would be a 1ms penalty per
iteration. With 500 threads that'd be a total penalty of 500ms.

> I tried inserting a sched_yield() at the top of the busy loop (1M loops).
> By default, it did not change a thing. Then I simply set sched_compat_yield
> to 1, and the two threads then ran simultaneously with a stable low time
> (2700 ms instead of 10-12 seconds).
> 
> Doing so with 10k loops (initial test) shows times in the range 240-300 ms
> only instead of 2200-6500 ms.

Right, likely because the long-running cases got stuck in the busy loop
at the end, which would abort quicker if the other thread got scheduled
for just a bit. It was a mistake to post that variant of the testcase;
it's not as relevant, and it doesn't mimic the original workload as well
as a version with a larger first loop would.

> Ingo, would it be possible (and wise) to ensure that a new thread being
> created gets immediately rebalanced in order to emulate what is done here
> with sched_compat_yield=1 and sched_yield() in both threads just after the
> thread creation ? I don't expect any performance difference doing this,
> but maybe some shell scripts reliying on short-lived pipes would get faster
> on SMP.

There's always the tradeoff of losing cache warmth whenever a thread is
moved, so I'm not sure if it's a good idea to always migrate it at
creation time. It's not a simple problem, really.

> > I agree that the testcase is highly artificial. Unfortunately, it's
> > not uncommon to see these kind of weird testcases from customers tring
> > to evaluate new hardware. :( They tend to be pared-down versions of
> > whatever their real workload is (the real workload is doing things more
> > appropriately, but the smaller version is used for testing). I was lucky
> > enough to get source snippets to base a standalone reproduction case on
> > for this, normally we wouldn't even get copies of their binaries.
> 
> I'm well aware of that. What's important is to be able to explain what is
> causing the difference and why the test case does not represent anything
> related to performance. Maybe the code author wanted to get 500 parallel
> threads and got his code wrong ?

I believe it started out as a simple attempt to parallelize a workload
that sliced the problem too low, instead of slicing it in larger chunks
and having each thread do more work at a time. It did well on 2.6.22 with
almost a 2x speedup, but did worse than the single-threaded testcase on a
2.6.24 kernel.

So yes, it can clearly be handled through explanations, education and
fixing the broken testcase, but I'm still not sure the new behaviour
is desired.


-Olof


* Re: Scheduler(?) regression from 2.6.22 to 2.6.24 for short-lived threads
  2008-02-10  7:00                 ` Olof Johansson
@ 2008-02-10  7:58                   ` Willy Tarreau
  2008-02-11  8:15                   ` Mike Galbraith
  1 sibling, 0 replies; 23+ messages in thread
From: Willy Tarreau @ 2008-02-10  7:58 UTC (permalink / raw)
  To: Olof Johansson; +Cc: Mike Galbraith, linux-kernel, Peter Zijlstra, Ingo Molnar

On Sun, Feb 10, 2008 at 01:00:56AM -0600, Olof Johansson wrote:
> On Sun, Feb 10, 2008 at 07:15:58AM +0100, Willy Tarreau wrote:
> > On Sat, Feb 09, 2008 at 11:29:41PM -0600, Olof Johansson wrote:
> > > 40M:
> > > 2.6.22		time 94315 ms
> > > 2.6.23		time 107930 ms
> > > 2.6.24		time 113291 ms
> > > 2.6.24-git19	time 110360 ms
> > > 
> > > So with more work per thread, the differences become less but they're
> > > still there. At the 40M loop, with 500 threads it's quite a bit of
> > > runtime per thread.
> > 
> > No, it's really nothing. I had to push the loop to 1 billion to make the load
> > noticeable. You don't have 500 threads, you have 2 threads and that load is
> > repeated 500 times. And if we look at the numbers, let's take the worst one :
> > > 40M:
> > > 2.6.24                time 113291 ms
> > 113291/500 = 227 microseconds/loop. This is still very low compared to the
> > smallest timeslice you would have (1 ms at HZ=1000).
> > 
> > So your threads are still completing *before* the scheduler has to preempt
> > them.
> 
> Hmm? I get that to be 227ms per loop, which is way more than a full
> timeslice. Running the program took in the range of 2 minutes, so it's
> 110000 milliseconds, not microseconds.

Damn you're right! I don't know why I assumed that the reported time was
in microseconds. Nevermind.

> > > It seems generally unfortunate that it takes longer for a new thread to
> > > move over to the second cpu even when the first is busy with the original
> > > thread. I can certainly see cases where this causes suboptimal overall
> > > system behaviour.
> > 
> > In fact, I don't think it takes longer, I think it does not do it at their
> > creation, but will do it immediately after the first slice is consumed. This
> > would explain the important differences here. I don't know how we could ensure
> > that the new thread is created on the second CPU from the start, though.
> 
> The math doesn't add up for me. Even if it rebalanced at the end of
> the first slice (i.e. after 1ms), that would be a 1ms penalty per
> iteration. With 500 threads that'd be a total penalty of 500ms.

yes you're right.

> > I tried inserting a sched_yield() at the top of the busy loop (1M loops).
> > By default, it did not change a thing. Then I simply set sched_compat_yield
> > to 1, and the two threads then ran simultaneously with a stable low time
> > (2700 ms instead of 10-12 seconds).
> > 
> > Doing so with 10k loops (initial test) shows times in the range 240-300 ms
> > only instead of 2200-6500 ms.
> 
> Right, likely because the long-running cases got stuck at the busy loop
> at the end, which would end up aborting quicker if the other thread got
> scheduled for just a bit. It was a mistake to post that variant of the
> testcase, it's not as relevant and doesn't mimic the original workload I
> was trying to mimic as well as if the first loop was made larger.

agreed, but what's important is not to change the workload, but to see
what changes induce a different behaviour.

> > Ingo, would it be possible (and wise) to ensure that a new thread being
> > created gets immediately rebalanced in order to emulate what is done here
> > with sched_compat_yield=1 and sched_yield() in both threads just after the
> > thread creation ? I don't expect any performance difference doing this,
> > but maybe some shell scripts reliying on short-lived pipes would get faster
> > on SMP.
> 
> There's always the tradeoff of losing cache warmth whenever a thread is
> moved, so I'm not sure if it's a good idea to always migrate it at
> creation time. It's not a simple problem, really.

yes I know. That should not prevent us from experimenting though. If
thread-CPU affinity is too strong and causes the second CPU to be
rarely used, there's something wrong waiting for a fix.

> > > I agree that the testcase is highly artificial. Unfortunately, it's
> > > not uncommon to see these kind of weird testcases from customers tring
> > > to evaluate new hardware. :( They tend to be pared-down versions of
> > > whatever their real workload is (the real workload is doing things more
> > > appropriately, but the smaller version is used for testing). I was lucky
> > > enough to get source snippets to base a standalone reproduction case on
> > > for this, normally we wouldn't even get copies of their binaries.
> > 
> > I'm well aware of that. What's important is to be able to explain what is
> > causing the difference and why the test case does not represent anything
> > related to performance. Maybe the code author wanted to get 500 parallel
> > threads and got his code wrong ?
> 
> I believe it started out as a simple attempt to parallelize a workload
> that sliced the problem too low, instead of slicing it in larger chunks
> and have each thread do more work at a time. It did well on 2.6.22 with
> almost a 2x speedup, but did worse than the single-treaded testcase on a
> 2.6.24 kernel.
> 
> So yes, it can clearly be handled through explanations and education
> and fixen the broken testcase, but I'm still not sure the new behaviour
> is desired.

While I could accept the first slice of the second thread being consumed 
on the same CPU as the first one, it definitely must migrate quickly if
both threads are competing for CPU.

regards,
willy



* Re: Scheduler(?) regression from 2.6.22 to 2.6.24 for short-lived threads
  2008-02-10  7:00                 ` Olof Johansson
  2008-02-10  7:58                   ` Willy Tarreau
@ 2008-02-11  8:15                   ` Mike Galbraith
  2008-02-11 17:26                     ` Olof Johansson
  1 sibling, 1 reply; 23+ messages in thread
From: Mike Galbraith @ 2008-02-11  8:15 UTC (permalink / raw)
  To: Olof Johansson; +Cc: Willy Tarreau, linux-kernel, Peter Zijlstra, Ingo Molnar


On Sun, 2008-02-10 at 01:00 -0600, Olof Johansson wrote:
> On Sun, Feb 10, 2008 at 07:15:58AM +0100, Willy Tarreau wrote:
> 
> > > I agree that the testcase is highly artificial. Unfortunately, it's
> > > not uncommon to see these kind of weird testcases from customers tring
> > > to evaluate new hardware. :( They tend to be pared-down versions of
> > > whatever their real workload is (the real workload is doing things more
> > > appropriately, but the smaller version is used for testing). I was lucky
> > > enough to get source snippets to base a standalone reproduction case on
> > > for this, normally we wouldn't even get copies of their binaries.
> > 
> > I'm well aware of that. What's important is to be able to explain what is
> > causing the difference and why the test case does not represent anything
> > related to performance. Maybe the code author wanted to get 500 parallel
> > threads and got his code wrong ?
> 
> I believe it started out as a simple attempt to parallelize a workload
> that sliced the problem too low, instead of slicing it in larger chunks
> and have each thread do more work at a time. It did well on 2.6.22 with
> almost a 2x speedup, but did worse than the single-treaded testcase on a
> 2.6.24 kernel.
> 
> So yes, it can clearly be handled through explanations and education
> and fixen the broken testcase, but I'm still not sure the new behaviour
> is desired.

Piddling around with your testcase, it still looks to me like things
improved considerably in latest greatest git.  Hopefully that means
happiness is in the pipe for the real workload... synthetic load is
definitely happier here as burst is shortened.

#!/bin/sh

uname -r >> results
./threadtest 10 1000000 >> results
./threadtest 100 100000 >> results
./threadtest 1000 10000 >> results
./threadtest 10000 1000 >> results
./threadtest 100000 100 >> results
echo >> results

(threadtest <iterations> <burn_time>)

results:
2.6.22.17-smp
time: 10525370 (us)  work: 20000000  wait: 181000  idx: 1.90
time: 13514232 (us)  work: 20000001  wait: 2666366  idx: 1.48
time: 36280953 (us)  work: 20000008  wait: 21156077  idx: 0.55
time: 196374337 (us)  work: 20000058  wait: 177141620  idx: 0.10
time: 128721968 (us)  work: 20000099  wait: 115052705  idx: 0.16

2.6.22.17-cfs-v24.1-smp
time: 10579591 (us)  work: 20000000  wait: 203659  idx: 1.89
time: 11170784 (us)  work: 20000000  wait: 471961  idx: 1.79
time: 11650138 (us)  work: 20000000  wait: 1406224  idx: 1.72
time: 21447616 (us)  work: 20000007  wait: 10689242  idx: 0.93
time: 106792791 (us)  work: 20000060  wait: 92098132  idx: 0.19

2.6.23.15-smp
time: 10507122 (us)  work: 20000000  wait: 159809  idx: 1.90
time: 10545417 (us)  work: 20000000  wait: 263833  idx: 1.90
time: 11337770 (us)  work: 20000012  wait: 1069588  idx: 1.76
time: 15969860 (us)  work: 20000000  wait: 5577750  idx: 1.25
time: 54029726 (us)  work: 20000027  wait: 41734789  idx: 0.37

2.6.23.15-cfs-v24-smp
time: 10528972 (us)  work: 20000000  wait: 217060  idx: 1.90
time: 10697159 (us)  work: 20000000  wait: 447224  idx: 1.87
time: 12242250 (us)  work: 20000000  wait: 1930175  idx: 1.63
time: 26364658 (us)  work: 20000011  wait: 15468447  idx: 0.76
time: 158338977 (us)  work: 20000084  wait: 144048265  idx: 0.13

2.6.24.1-smp
time: 10570521 (us)  work: 20000000  wait: 208947  idx: 1.89
time: 10699224 (us)  work: 20000000  wait: 404644  idx: 1.87
time: 12280164 (us)  work: 20000005  wait: 1969263  idx: 1.63
time: 26424580 (us)  work: 20000004  wait: 15725967  idx: 0.76
time: 159012417 (us)  work: 20000055  wait: 144906212  idx: 0.13

2.6.25-smp (.git)
time: 10707278 (us)  work: 20000000  wait: 376063  idx: 1.87
time: 10696886 (us)  work: 20000000  wait: 455909  idx: 1.87
time: 11068510 (us)  work: 19990003  wait: 820104  idx: 1.81
time: 11493803 (us)  work: 19995076  wait: 1160150  idx: 1.74
time: 21311673 (us)  work: 20001848  wait: 9399490  idx: 0.94


#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <pthread.h>

#ifdef __PPC__
static void atomic_inc(volatile long *a)
{
	long result;

	asm volatile ("1:\n\
			lwarx  %0,0,%1\n\
			addic  %0,%0,1\n\
			stwcx. %0,0,%1\n\
			bne-  1b" : "=&r" (result) : "r"(a));
}
#else
static void atomic_inc(volatile long *a)
{
	asm volatile ("lock; incl %0" : "+m" (*a));
}
#endif

long usecs(void)
{
	struct timeval tv;
	gettimeofday(&tv, NULL);
	return tv.tv_sec * 1000000 + tv.tv_usec;
}

void burn(long *burnt)
{
	long then, now, delta, tolerance = 10;

	then = now = usecs();
	while (now == then)
		now = usecs();
	delta = now - then;
	if (delta < tolerance)
		*burnt += delta;
}

volatile long stopped;
long burn_usecs = 1000, tot_work, tot_wait;

void *thread_func(void *cpus)
{
	long work = 0, wait = 0;

	while (work < burn_usecs)
		burn(&work);
	tot_work += work;

	atomic_inc(&stopped);

	/* Busy-wait */
	while (stopped < *(int *)cpus)
		burn(&wait);
	tot_wait += wait;

	return NULL;
}

int main(int argc, char **argv)
{
	pthread_t thread;
	int iter = 500, cpus = 2;
	long t1, t2;

	if (argc > 1)
		iter = atoi(argv[1]);

	if (argc > 2)
		burn_usecs = atoi(argv[2]);

	t1 = usecs();
	while(iter--) {
		stopped = 0;

		pthread_create(&thread, NULL, &thread_func, &cpus);
		thread_func(&cpus);
		pthread_join(thread, NULL);
	}
	t2 = usecs();

	printf("time: %ld (us)  work: %ld  wait: %ld  idx: %2.2f\n",
		t2-t1, tot_work, tot_wait, (double)tot_work/(t2-t1));

	return 0;
}





* Re: Scheduler(?) regression from 2.6.22 to 2.6.24 for short-lived threads
  2008-02-11  8:15                   ` Mike Galbraith
@ 2008-02-11 17:26                     ` Olof Johansson
  2008-02-11 19:58                       ` Mike Galbraith
  0 siblings, 1 reply; 23+ messages in thread
From: Olof Johansson @ 2008-02-11 17:26 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Willy Tarreau, linux-kernel, Peter Zijlstra, Ingo Molnar

On Mon, Feb 11, 2008 at 09:15:55AM +0100, Mike Galbraith wrote:
> Piddling around with your testcase, it still looks to me like things
> improved considerably in latest greatest git.  Hopefully that means
> happiness is in the pipe for the real workload... synthetic load is
> definitely happier here as burst is shortened.

The real workload doesn't see much of an improvement. The changes I made
while tinkering yesterday seem to be better at modelling just what's
going on with that one.

I've added the new testcase I'm using for the numbers I posted last
night at http://lixom.net/~olof/threadtest/new/, including numbers for
the various kernels. I also included the binaries I built (with "gcc
-DLOOPS=<size> testcase.c -lpthread").

I tried graphing the numbers as well; it looks like 2.6.22 has a fixed
benefit over newer kernels for larger workloads. I.e. it seems quicker
at rebalancing, but once things balance out they're obviously doing ok
independent of kernel version.

Graph at http://lixom.net/olof/threadtest/new/schedgraph.pdf. I couldn't
figure out how to make the X axis linear, so it obviously looks odd with
the current powers-of-two at the end instead of linear, but the
differences can still be seen clearly.


-Olof



* Re: Scheduler(?) regression from 2.6.22 to 2.6.24 for short-lived threads
  2008-02-11 17:26                     ` Olof Johansson
@ 2008-02-11 19:58                       ` Mike Galbraith
  2008-02-11 20:31                         ` Olof Johansson
  0 siblings, 1 reply; 23+ messages in thread
From: Mike Galbraith @ 2008-02-11 19:58 UTC (permalink / raw)
  To: Olof Johansson; +Cc: Willy Tarreau, linux-kernel, Peter Zijlstra, Ingo Molnar


On Mon, 2008-02-11 at 11:26 -0600, Olof Johansson wrote: 
> On Mon, Feb 11, 2008 at 09:15:55AM +0100, Mike Galbraith wrote:
> > Piddling around with your testcase, it still looks to me like things
> > improved considerably in latest greatest git.  Hopefully that means
> > happiness is in the pipe for the real workload... synthetic load is
> > definitely happier here as burst is shortened.
> 
> The real workload doesn't see much of an improvement. The changes I did
> when tinkering yesterday seem like they're better at modelling just
> what's going on with that one.

So the real application is trying to yield?  If so, you could try
prodding /proc/sys/kernel/sched_compat_yield.

It shouldn't really matter whether you yield or not. Yielding should
reduce the number of non-work spin cycles wasted awaiting preemption as
the threads execute in series (the problem), and should improve your
performance numbers, but not beyond the single-threaded case.

If I plugged a yield into the busy wait, I would expect to see a large
behavioral difference due to yield implementation changes, but that
would only be a symptom in this case, no?  Yield should be a noop.

	-Mike



* Re: Scheduler(?) regression from 2.6.22 to 2.6.24 for short-lived threads
  2008-02-11 19:58                       ` Mike Galbraith
@ 2008-02-11 20:31                         ` Olof Johansson
  2008-02-12  9:23                           ` Mike Galbraith
  0 siblings, 1 reply; 23+ messages in thread
From: Olof Johansson @ 2008-02-11 20:31 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Willy Tarreau, linux-kernel, Peter Zijlstra, Ingo Molnar

On Mon, Feb 11, 2008 at 08:58:46PM +0100, Mike Galbraith wrote:
> 
> On Mon, 2008-02-11 at 11:26 -0600, Olof Johansson wrote: 
> > On Mon, Feb 11, 2008 at 09:15:55AM +0100, Mike Galbraith wrote:
> > > Piddling around with your testcase, it still looks to me like things
> > > improved considerably in latest greatest git.  Hopefully that means
> > > happiness is in the pipe for the real workload... synthetic load is
> > > definitely happier here as burst is shortened.
> > 
> > The real workload doesn't see much of an improvement. The changes I did
> > when tinkering yesterday seem like they're better at modelling just
> > what's going on with that one.
> 
> So the real application is trying to yield?  If so, you could try
> prodding /proc/sys/kernel/sched_compat_yield.

The real application wasn't using yield, but it also didn't get hung up
on the busy-wait loop; that was a mistake on my part on Friday. Practically
all of that application's time was spent in the actual workload loop. I
tried adding a yield in the bottom loop, but it didn't make a significant
difference.

> It shouldn't matter if you yield or not really, that should reduce the
> number of non-work spin cycles wasted awaiting preemption as threads
> execute in series (the problem), and should improve your performance
> numbers, but not beyond single threaded.
> 
> If I plugged a yield into the busy wait, I would expect to see a large
> behavioral difference due to yield implementation changes, but that
> would only be a symptom in this case, no?  Yield should be a noop.

Exactly. It made a big impact on the first testcase from Friday, where
the spin-off thread spent the bulk of the time in the busy-wait loop,
with a very small initial workload loop. Thus the yield passed the cpu
over to the other thread, which got a chance to run the small workload,
followed by a quick finish by both of them. The better model spends the
bulk of the time in the first workload loop, so yielding doesn't gain
nearly as much.

I still added it to rule out that it was a factor in the time
differences.


-Olof


* Re: Scheduler(?) regression from 2.6.22 to 2.6.24 for short-lived threads
  2008-02-10  5:29             ` Olof Johansson
  2008-02-10  6:15               ` Willy Tarreau
@ 2008-02-11 21:45               ` Bill Davidsen
  2008-02-12  4:30                 ` Mike Galbraith
  1 sibling, 1 reply; 23+ messages in thread
From: Bill Davidsen @ 2008-02-11 21:45 UTC (permalink / raw)
  To: Olof Johansson
  Cc: Willy Tarreau, Mike Galbraith, linux-kernel, Peter Zijlstra, Ingo Molnar

Olof Johansson wrote:

>> However, I fail to understand the goal of the reproducer. Granted it shows
>> irregularities in the scheduler under such conditions, but what *real*
>> workload would spend its time sequentially creating then immediately killing
>> threads, never using more than 2 at a time ?
>>
>> If this could be turned into a DoS, I could understand, but here it looks
>> a bit pointless :-/
> 
> It seems generally unfortunate that it takes longer for a new thread to
> move over to the second cpu even when the first is busy with the original
> thread. I can certainly see cases where this causes suboptimal overall
> system behaviour.
> 
I think the cost of moving to another CPU really depends on the CPU type.
On a P4+HT the caches are shared, and moving costs almost nothing for
cache hits, while on CPUs with other cache layouts the migration cost is
higher. Obviously multi-core should be cheaper than multi-socket, by
avoiding the system memory bus, but it can still get ugly.

I have an IPC test around which showed that: it ran like hell on HT, and
progressively worse as the cache became less shared. I wonder why the
latest git works so much better?

-- 
Bill Davidsen <davidsen@tmr.com>
   "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot


* Re: Scheduler(?) regression from 2.6.22 to 2.6.24 for short-lived threads
  2008-02-11 21:45               ` Bill Davidsen
@ 2008-02-12  4:30                 ` Mike Galbraith
  0 siblings, 0 replies; 23+ messages in thread
From: Mike Galbraith @ 2008-02-12  4:30 UTC (permalink / raw)
  To: Bill Davidsen
  Cc: Olof Johansson, Willy Tarreau, linux-kernel, Peter Zijlstra, Ingo Molnar


On Mon, 2008-02-11 at 16:45 -0500, Bill Davidsen wrote:

> I think the moving to another CPU gets really dependent on the CPU type. 
> On a P4+HT the caches are shared, and moving costs almost nothing for 
> cache hits, while on CPUs which have other cache layouts the migration 
> cost is higher. Obviously multi-core should be cheaper than 
> multi-socket, by avoiding using the system memory bus, but it still can 
> get ugly.
> 
> I have an IPC test around which showed that, it ran like hell on HT, and 
> progressively worse as cache because less shared. I wonder why the 
> latest git works so much better?

Yes, I'm wondering the same.  With latest git, ~400 usec work units
suffice to achieve overlap (on my P4/HT), whereas all other kernels
tested require several milliseconds.

	-Mike



* Re: Scheduler(?) regression from 2.6.22 to 2.6.24 for short-lived threads
  2008-02-11 20:31                         ` Olof Johansson
@ 2008-02-12  9:23                           ` Mike Galbraith
  2008-02-13  5:49                             ` Mike Galbraith
  0 siblings, 1 reply; 23+ messages in thread
From: Mike Galbraith @ 2008-02-12  9:23 UTC (permalink / raw)
  To: Olof Johansson; +Cc: Willy Tarreau, linux-kernel, Peter Zijlstra, Ingo Molnar

[-- Attachment #1: Type: text/plain, Size: 1579 bytes --]


On Mon, 2008-02-11 at 14:31 -0600, Olof Johansson wrote:
> On Mon, Feb 11, 2008 at 08:58:46PM +0100, Mike Galbraith wrote:

> > It shouldn't matter if you yield or not really, that should reduce the
> > number of non-work spin cycles wasted awaiting preemption as threads
> > execute in series (the problem), and should improve your performance
> > numbers, but not beyond single threaded.
> > 
> > If I plugged a yield into the busy wait, I would expect to see a large
> > behavioral difference due to yield implementation changes, but that
> > would only be a symptom in this case, no?  Yield should be a noop.
> 
> Exactly. It made a big impact on the first testcase from Friday, where
> the spin-off thread spent the bulk of the time in the busy-wait loop,
> with a very small initial workload loop. Thus the yield passed the cpu
> over to the other thread who got a chance to run the small workload,
> followed by a quick finish by both of them. The better model spends the
> bulk of the time in the first workload loop, so yielding doesn't gain
> at all the same amount.

There is a strong dependency on execution order in this testcase.

Between cpu affinity and giving the child a little head start to reduce
the chance (100% if child wakes on same CPU and doesn't preempt parent)
of busy wait, the modified testcase behaves.  I don't think I should need
the CPU affinity, but I do.

If you plunk a usleep(1) in prior to calling thread_func(), does your
testcase performance change radically?  If so, I wonder if the real
application has the same kind of dependency.

	-Mike

[-- Attachment #2: threadtest.c --]
[-- Type: text/x-csrc, Size: 2029 bytes --]

#define _GNU_SOURCE

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>
#include <sched.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/syscall.h>

#ifdef __PPC__
static void atomic_inc(volatile long *a)
{
	long result;

	/* lwarx/stwcx. reservation loop: retry until the increment sticks */
	asm volatile ("1:\n\
			lwarx  %0,0,%1\n\
			addic  %0,%0,1\n\
			stwcx. %0,0,%1\n\
			bne-  1b" : "=&r" (result) : "r" (a) : "cc", "memory");
}
#else
static void atomic_inc(volatile long *a)
{
	asm volatile ("lock; incl %0" : "+m" (*a));
}
#endif

long usecs(void)
{
	struct timeval tv;
	gettimeofday(&tv, NULL);
	return tv.tv_sec * 1000000 + tv.tv_usec;
}

/* Spin until the microsecond clock ticks over; credit the elapsed time as
 * burnt only if the step was small, i.e. we likely weren't preempted. */
void burn(long *burnt)
{
	long then, now, delta, tolerance = 10;

	then = now = usecs();
	while (now == then)
		now = usecs();
	delta = now - then;
	if (delta < tolerance)
		*burnt += delta;
}

volatile long stopped;
long burn_usecs = 1000, tot_work, tot_wait;

pid_t parent;
#define gettid() syscall(SYS_gettid)

void *thread_func(void *cpus)
{
	long work = 0, wait = 0;
	cpu_set_t cpuset;
	pid_t whoami = gettid();

	/* The child pins itself to CPU 1 (the parent is pinned to CPU 0 in
	 * main) and naps briefly so the parent can start its work unit. */
	if (whoami != parent) {
		CPU_ZERO(&cpuset);
		CPU_SET(1, &cpuset);
		sched_setaffinity(whoami, sizeof(cpuset), &cpuset);
		usleep(1);
	}

	while (work < burn_usecs)
		burn(&work);
	tot_work += work;

	atomic_inc(&stopped);

	/* Busy-wait */
	while (stopped < *(int *)cpus)
		burn(&wait);
	tot_wait += wait;

	return NULL;
}

int main(int argc, char **argv)
{
	pthread_t thread;
	int iter = 500, cpus = 2;
	long t1, t2;
	cpu_set_t cpuset;

	if (argc > 1)
		iter = atoi(argv[1]);

	if (argc > 2)
		burn_usecs = atoi(argv[2]);

	parent = gettid();
	CPU_ZERO(&cpuset);
	CPU_SET(0, &cpuset);
	sched_setaffinity(parent, sizeof(cpuset), &cpuset);

	t1 = usecs();
	while(iter--) {
		stopped = 0;

		pthread_create(&thread, NULL, &thread_func, &cpus);
		/* child needs a head-start guarantee to avoid busy wait */
		usleep(1);
		thread_func(&cpus);
		pthread_join(thread, NULL);
	}
	t2 = usecs();

	printf("time: %ld (us)  work: %ld  wait: %ld  idx: %2.2f\n",
		t2-t1, tot_work, tot_wait, (double)tot_work/(t2-t1));

	return 0;
}
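
Build/run sketch for the attached file, assuming gcc/NPTL on a two-CPU box
(iteration count and work-unit size come from argv[1] and argv[2] above):

  gcc -O2 -o threadtest threadtest.c -lpthread
  ./threadtest 500 1000    # 500 iterations, ~1000 us work units each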

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Scheduler(?) regression from 2.6.22 to 2.6.24 for short-lived threads
  2008-02-12  9:23                           ` Mike Galbraith
@ 2008-02-13  5:49                             ` Mike Galbraith
  0 siblings, 0 replies; 23+ messages in thread
From: Mike Galbraith @ 2008-02-13  5:49 UTC (permalink / raw)
  To: Olof Johansson; +Cc: Willy Tarreau, linux-kernel, Peter Zijlstra, Ingo Molnar


On Tue, 2008-02-12 at 10:23 +0100, Mike Galbraith wrote:

> If you plunk a usleep(1) in prior to calling thread_func(), does your
> testcase performance change radically?  If so, I wonder if the real
> application has the same kind of dependency.

The answer is yes for 2.6.22, and no for 2.6.24, which surprised me.

2.6.22.17-smp
homer:..local/tmp # ./threadtest2
10000000 loops time 16215 ms
10000000 loops time 16268 ms
10000000 loops time 16246 ms

homer:..local/tmp # ./threadtest3 (with usleep(1))
10000000 loops time 13938 ms
10000000 loops time 13921 ms
10000000 loops time 13898 ms

2.6.24.1-smp
homer:..local/tmp # ./threadtest2
10000000 loops time 14663 ms
10000000 loops time 14523 ms
10000000 loops time 14466 ms

homer:..local/tmp # ./threadtest3
10000000 loops time 14513 ms
10000000 loops time 14500 ms
10000000 loops time 14464 ms

echo 0 >  /proc/sys/kernel/sched_child_runs_first

homer:..local/tmp # ./threadtest2
10000000 loops time 14157 ms
10000000 loops time 14097 ms
10000000 loops time 14153 ms

homer:..local/tmp # ./threadtest3
10000000 loops time 14065 ms
10000000 loops time 14075 ms
10000000 loops time 14018 ms




^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Scheduler(?) regression from 2.6.22 to 2.6.24 for short-lived threads
       [not found] <fa.6N2dhyJ1cmBqiuFKgCaYfwduM+0@ifi.uio.no>
@ 2008-02-09  1:49 ` Robert Hancock
  0 siblings, 0 replies; 23+ messages in thread
From: Robert Hancock @ 2008-02-09  1:49 UTC (permalink / raw)
  To: Olof Johansson; +Cc: linux-kernel, Peter Zijlstra, Ingo Molnar

Olof Johansson wrote:
> Hi,
> 
> I ended up with a customer benchmark in my lap this week that doesn't
> do well on recent kernels. :(
> 
> After cutting it down to a simple testcase/microbenchmark, it seems like
> recent kernels don't do as well with short-lived threads competing
> with the thread it's cloned off of. The CFS scheduler changes come to
> mind, but I suppose it could be caused by something else as well.
> 
> The pared-down testcase is included below. Reported runtime for the
> testcase has increased almost 3x between 2.6.22 and 2.6.24:
> 
> 2.6.22: 3332 ms
> 2.6.23: 4397 ms
> 2.6.24: 8953 ms
> 2.6.24-git19: 8986 ms
> 
> While running, it'll fork off a bunch of threads, each doing just a little
> work, then busy-waiting on the original thread to finish as well. Yes,
> it's incredibly stupidly coded but that's not my point here.
> 
> During run, (runtime 10s on my 1.5GHz Core2 Duo laptop), vmstat 2  shows:
> 
>  0  0      0 115196 364748 2248396    0    0     0     0  163   89  0  0 100  0
>  2  0      0 115172 364748 2248396    0    0     0     0  270  178 24  0 76  0
>  2  0      0 115172 364748 2248396    0    0     0     0  402  283 52  0 48  0
>  2  0      0 115180 364748 2248396    0    0     0     0  402  281 50  0 50  0
>  2  0      0 115180 364764 2248396    0    0     0    22  403  295 51  0 48  1
>  2  0      0 115056 364764 2248396    0    0     0     0  399  280 52  0 48  0
>  0  0      0 115196 364764 2248396    0    0     0     0  241  141 17  0 83  0
>  0  0      0 115196 364768 2248396    0    0     0     2  155   67  0  0 100  0
>  0  0      0 115196 364768 2248396    0    0     0     0  148   62  0  0 100  0
> 
> I.e. runqueue is 2, but only one cpu is busy. However, this still seems
> true on the kernel that runs the testcase in more reasonable time.
> 
> Also, 'time' reports real and user time roughly the same on all kernels,
> so it's not that the older kernels are better at spreading out the load
> between the two cores (either that or it doesn't account for stuff right).
> 
> I've included the config files, runtime output and vmstat output at
> http://lixom.net/~olof/threadtest/. I see similar behaviour on PPC as
> well as x86, so it's not architecture-specific.
> 
> Testcase below. Yes, I know, there's a bunch of stuff that could be done
> differently and better, but it still doesn't motivate why there's a 3x
> slowdown between kernel versions...

I would say that something coded this bizarrely is really an application 
bug and not something that one could call a kernel regression. Any 
change in how the parent and child threads get scheduled will have a 
huge impact on this test. I bet if you replace that busy wait with a 
pthread_cond_wait or something similar, this problem goes away.
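
For illustration, a rendezvous built on a condition variable along those 
lines might look like the minimal sketch below; the rendezvous()/rv_* 
names are invented here and are not part of the testcase:

#include <pthread.h>

static pthread_mutex_t rv_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  rv_cond = PTHREAD_COND_INITIALIZER;
static int rv_arrived;

/* Each thread calls this after finishing its work unit; the last arrival
 * wakes the others instead of having them spin on a shared counter. */
static void rendezvous(int nthreads)
{
	pthread_mutex_lock(&rv_lock);
	if (++rv_arrived == nthreads)
		pthread_cond_broadcast(&rv_cond);
	else
		while (rv_arrived < nthreads)
			pthread_cond_wait(&rv_cond, &rv_lock);
	pthread_mutex_unlock(&rv_lock);
}

(rv_arrived would have to be reset between iterations, or a
pthread_barrier_t used instead.)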

Hopefully it doesn't have to be pointed out that spawning off threads to 
do so little work before terminating is inefficient; a thread pool or 
even just a single thread would likely do a much better job.

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2008-02-13  5:49 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-02-09  0:04 Scheduler(?) regression from 2.6.22 to 2.6.24 for short-lived threads Olof Johansson
2008-02-09  0:08 ` Ingo Molnar
2008-02-09  0:32   ` Olof Johansson
2008-02-09  7:58 ` Mike Galbraith
2008-02-09  8:03   ` Willy Tarreau
2008-02-09 10:58     ` Mike Galbraith
2008-02-09 11:40       ` Willy Tarreau
2008-02-09 13:37         ` Mike Galbraith
2008-02-09 16:19           ` Willy Tarreau
2008-02-09 17:33             ` Mike Galbraith
2008-02-10  5:29             ` Olof Johansson
2008-02-10  6:15               ` Willy Tarreau
2008-02-10  7:00                 ` Olof Johansson
2008-02-10  7:58                   ` Willy Tarreau
2008-02-11  8:15                   ` Mike Galbraith
2008-02-11 17:26                     ` Olof Johansson
2008-02-11 19:58                       ` Mike Galbraith
2008-02-11 20:31                         ` Olof Johansson
2008-02-12  9:23                           ` Mike Galbraith
2008-02-13  5:49                             ` Mike Galbraith
2008-02-11 21:45               ` Bill Davidsen
2008-02-12  4:30                 ` Mike Galbraith
     [not found] <fa.6N2dhyJ1cmBqiuFKgCaYfwduM+0@ifi.uio.no>
2008-02-09  1:49 ` Robert Hancock

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.