* 2.6.0-test2-mm3 osdl-aim-7 regression
@ 2003-08-04 16:07 Cliff White
  2003-08-06  5:23 ` Andrew Morton
  0 siblings, 1 reply; 10+ messages in thread
From: Cliff White @ 2003-08-04 16:07 UTC (permalink / raw)
  To: linux-kernel


I see 2.6.0-test2-mm4 is already in our queue, so this may be
Old News. ( serves me right for taking a weekend off )
Performance of -mm3 falls off on the 4-cpu machines. 

2-cpu systems
Kernel 			JPM 
2.6.0-test2-mm3		1313.53
linux-2.6.0-test2	1320.68 (0.54 % +)

4-cpu systems
2.6.0-test2-mm3		4824.96
linux-2.6.0-test2	5381.20 ( 11.53 % + )
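
The percentage column is the JPM delta taken relative to the -mm3 figure,
so "+" means stock -test2 is faster. A quick sketch of the arithmetic:

	#include <stdio.h>

	/* Reproduce the "% +" column above: delta of stock -test2 vs -mm3,
	 * taken relative to the -mm3 JPM figure. */
	int main(void)
	{
		double mm3_2cpu = 1313.53, stock_2cpu = 1320.68;
		double mm3_4cpu = 4824.96, stock_4cpu = 5381.20;

		printf("2-cpu: %.2f%%\n", (stock_2cpu - mm3_2cpu) / mm3_2cpu * 100);  /* 0.54  */
		printf("4-cpu: %.2f%%\n", (stock_4cpu - mm3_4cpu) / mm3_4cpu * 100);  /* 11.53 */
		return 0;
	}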

Full details at
http://developer.osdl.org/cliffw/reaim/index.html
code at 
bk://developer.osdl.org/osdl-aim-7

cliffw




* Re: 2.6.0-test2-mm3 osdl-aim-7 regression
  2003-08-04 16:07 2.6.0-test2-mm3 osdl-aim-7 regression Cliff White
@ 2003-08-06  5:23 ` Andrew Morton
  2003-08-06 19:10   ` Cliff White
  0 siblings, 1 reply; 10+ messages in thread
From: Andrew Morton @ 2003-08-06  5:23 UTC (permalink / raw)
  To: Cliff White; +Cc: linux-kernel, Ingo Molnar

Cliff White <cliffw@osdl.org> wrote:
>
> 
> I see 2.6.0-test2-mm4 is already in our queue, so this may be
> Old News. ( serves me right for taking a weekend off )
> Performance of -mm3 falls off on the 4-cpu machines. 
> 
> 2-cpu systems
> Kernel 			JPM 
> 2.6.0-test2-mm3		1313.53
> linux-2.6.0-test2	1320.68 (0.54 % +)
> 
> 4-cpu systems
> 2.6.0-test2-mm3		4824.96
> linux-2.6.0-test2	5381.20 ( 11.53 % + )
> 
> Full details at
> http://developer.osdl.org/cliffw/reaim/index.html
> code at 
> bk://developer.osdl.org/osdl-aim-7
> 

OK, I can reproduce this on 4way.

Binary searching (insert gratuitous rant about benchmarks that take more
than two minutes to complete) reveals that the slowdown is due to
sched-2.6.0-test2-mm2-A3.

So mm4 with everything up to but not including sched-2.6.0-test2-mm2-A3:

	Max Jobs per Minute 1467.06
	Max Jobs per Minute 1478.82
	Max Jobs per Minute 1473.36

	3853.55s user 264.31s system 370% cpu 18:31.95 total

After adding sched-2.6.0-test2-mm2-A3:

        Max Jobs per Minute 1375.63
        Max Jobs per Minute 1278.40
        Max Jobs per Minute 1293.11

        4416.70s user 275.61s system 374% cpu 20:53.58 total

A 10% regression there, mainly user time.
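
Spelling that out (a quick sketch using the means of the three runs and
the user times above):

	#include <stdio.h>

	/* Deltas between the two runs quoted above: three-run JPM means
	 * and total user CPU time. */
	int main(void)
	{
		double jpm_before = (1467.06 + 1478.82 + 1473.36) / 3;  /* without A3 */
		double jpm_after  = (1375.63 + 1278.40 + 1293.11) / 3;  /* with A3    */
		double usr_before = 3853.55, usr_after = 4416.70;       /* seconds    */

		/* prints roughly -10.7% and +14.6% */
		printf("JPM:  %.1f -> %.1f (%+.1f%%)\n", jpm_before, jpm_after,
		       (jpm_after - jpm_before) / jpm_before * 100);
		printf("user: %.0fs -> %.0fs (%+.1f%%)\n", usr_before, usr_after,
		       (usr_after - usr_before) / usr_before * 100);
		return 0;
	}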


The test is:

- build bk://developer.osdl.org/osdl-aim-7

- cd src

- time ./reaim -s4 -q -t -i4 -f./workfile.new_dbase -r3 -b -l./reaim.config





* Re: 2.6.0-test2-mm3 osdl-aim-7 regression
  2003-08-06  5:23 ` Andrew Morton
@ 2003-08-06 19:10   ` Cliff White
  2003-08-07  2:40     ` Con Kolivas
  0 siblings, 1 reply; 10+ messages in thread
From: Cliff White @ 2003-08-06 19:10 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Cliff White, linux-kernel, Ingo Molnar

> Cliff White <cliffw@osdl.org> wrote:
> >
> > 
> > I see 2.6.0-test2-mm4 is already in our queue, so this may be
> > Old News. ( serves me right for taking a weekend off )
> > Performance of -mm3 falls off on the 4-cpu machines. 
> > 
> > 2-cpu systems
> > Kernel 			JPM 
> > 2.6.0-test2-mm3		1313.53
> > linux-2.6.0-test2	1320.68 (0.54 % +)
> > 
> > 4-cpu systems
> > 2.6.0-test2-mm3		4824.96
> > linux-2.6.0-test2	5381.20 ( 11.53 % + )
> > 
> > Full details at
> > http://developer.osdl.org/cliffw/reaim/index.html
> > code at 
> > bk://developer.osdl.org/osdl-aim-7
> > 
> 
> OK, I can reproduce this on 4way.
> 
> Binary searching (insert gratuitous rant about benchmarks that take more
> than two minutes to complete) reveals that the slowdown is due to
> sched-2.6.0-test2-mm2-A3.

[Rant response]
For a short test run, you can run a small number of iterations like this:
./reaim -s2 -e10 -i2 -f./workfile.new_dbase

( 2->10 users, increment by 2)

That takes about 5 minutes on our 4-way.

Or, run one iteration with a large user count:

./reaim -s25 -e25 -f ./workfile.foo

cliffw


> 
> So mm4 with everything up to but not including sched-2.6.0-test2-mm2-A3:
> 
> 	Max Jobs per Minute 1467.06
> 	Max Jobs per Minute 1478.82
> 	Max Jobs per Minute 1473.36
> 
> 	3853.55s user 264.31s system 370% cpu 18:31.95 total
> 
> After adding sched-2.6.0-test2-mm2-A3:
> 
>         Max Jobs per Minute 1375.63
>         Max Jobs per Minute 1278.40
>         Max Jobs per Minute 1293.11
> 
>         4416.70s user 275.61s system 374% cpu 20:53.58 total
> 
> A 10% regression there, mainly user time.
> 
> 
> The test is:
> 
> - build bk://developer.osdl.org/osdl-aim-7
> 
> - cd src
> 
> - time ./reaim -s4 -q -t -i4 -f./workfile.new_dbase -r3 -b -l./reaim.config
> 
> 
> 




* Re: 2.6.0-test2-mm3 osdl-aim-7 regression
  2003-08-06 19:10   ` Cliff White
@ 2003-08-07  2:40     ` Con Kolivas
  2003-08-07  5:11       ` Nick Piggin
  2003-08-08 20:58       ` Cliff White
  0 siblings, 2 replies; 10+ messages in thread
From: Con Kolivas @ 2003-08-07  2:40 UTC (permalink / raw)
  To: Cliff White, Andrew Morton; +Cc: Cliff White, linux-kernel, Ingo Molnar

On Thu, 7 Aug 2003 05:10, Cliff White wrote:
> > Binary searching (insert gratuitous rant about benchmarks that take more
> > than two minutes to complete) reveals that the slowdown is due to
> > sched-2.6.0-test2-mm2-A3.

This is most likely the round robinning of tasks every 25ms. I doubt the
extra overhead of nanosecond timing could make a difference of that size
(but I could be wrong). There is some tweaking of this round robinning in my
code which may help, but I don't believe it will bring performance all the
way back. Two things to try: first, add my patches up to O12.3int to see how
much (if at all!) they help; second, change TIMESLICE_GRANULARITY in sched.c
to (MAX_TIMESLICE), which basically disables it completely. If there is
still a drop in performance after that, the remainder is the extra
locking/overhead of nanosecond timing.
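
In sched.c terms the second experiment is a one-line change, roughly like
this (a sketch only -- the stock definition in the -mm2-A3 patch may look
different):

	/*
	 * Sketch only: make the requeue granularity equal to the whole
	 * timeslice, so the ~25ms round robin never triggers and any
	 * remaining slowdown is the nanosecond-timing locking/overhead.
	 */
	#define TIMESLICE_GRANULARITY	(MAX_TIMESLICE)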

Con



* Re: 2.6.0-test2-mm3 osdl-aim-7 regression
  2003-08-07  2:40     ` Con Kolivas
@ 2003-08-07  5:11       ` Nick Piggin
  2003-08-07  5:41         ` Con Kolivas
  2003-08-08 20:58       ` Cliff White
  1 sibling, 1 reply; 10+ messages in thread
From: Nick Piggin @ 2003-08-07  5:11 UTC (permalink / raw)
  To: Con Kolivas; +Cc: Cliff White, Andrew Morton, linux-kernel, Ingo Molnar

Con Kolivas wrote:

>On Thu, 7 Aug 2003 05:10, Cliff White wrote:
>
>>>Binary searching (insert gratuitous rant about benchmarks that take more
>>>than two minutes to complete) reveals that the slowdown is due to
>>>sched-2.6.0-test2-mm2-A3.
>>>
>
>This is most likely the round robinning of tasks every 25ms. I doubt the
>extra overhead of nanosecond timing could make a difference of that size
>(but I could be wrong). There is some tweaking of this round robinning in my
>code which may help, but I don't believe it will bring performance all the
>way back. Two things to try: first, add my patches up to O12.3int to see how
>much (if at all!) they help; second, change TIMESLICE_GRANULARITY in sched.c
>to (MAX_TIMESLICE), which basically disables it completely. If there is
>still a drop in performance after that, the remainder is the extra
>locking/overhead of nanosecond timing.
>
>
What is the need for this round robining? Don't processes get a calculated
timeslice anyway?




* Re: 2.6.0-test2-mm3 osdl-aim-7 regression
  2003-08-07  5:11       ` Nick Piggin
@ 2003-08-07  5:41         ` Con Kolivas
  2003-08-07  8:25           ` Nick Piggin
  0 siblings, 1 reply; 10+ messages in thread
From: Con Kolivas @ 2003-08-07  5:41 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-kernel

On Thu, 7 Aug 2003 15:11, Nick Piggin wrote:
> What is the need for this round robining? Don't processes get a calculated
> timeslice anyway?

Nice to see you taking an unhealthy interest in the scheduler tweaks, Nick. 
This issue has been discussed before but it never hurts to review things. 
I've uncc'ed the rest of the people in case we get carried away again. First 
let me show you Ingo's comment in the relevant code section:

		 * Prevent a too long timeslice allowing a task to monopolize
		 * the CPU. We do this by splitting up the timeslice into
		 * smaller pieces.
		 *
		 * Note: this does not mean the task's timeslices expire or
		 * get lost in any way, they just might be preempted by
		 * another task of equal priority. (one with higher
		 * priority would have preempted this task already.) We
		 * requeue this task to the end of the list on this priority
		 * level, which is in essence a round-robin of tasks with
		 * equal priority.

I was gonna say second blah blah but I think the first paragraph explains the 
issue. 

Must we do this? No. 

Should we? Probably. 

How frequently should we do it? Once again I'll quote Ingo who said it's a 
difficult question to answer. 

The more frequently you round robin the lower the scheduler latency between 
SCHED_OTHER tasks of the same priority. However, the longer the timeslice the 
more benefit you get from cpu cache. Where is the sweet spot? Depends on the 
hardware and your usage requirements of course, but Ingo has empirically 
chosen 25ms, after 50ms seemed too long. Basically cache thrashing becomes a 
real problem with timeslices below ~7ms on modern hardware in my limited 
testing. A minor quirk in Ingo's original code means _occasionally_ a task 
will be requeued with <3ms to go. It will be interesting to see if fixing 
this (which O12.2+ does) makes a big difference or whether we need to 
reconsider how frequently (if at all) we round robin tasks.  
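
For reference, that comment sits on top of a requeue check roughly like
the following (a simplified sketch, not the exact A3 code):

	/* every TIMESLICE_GRANULARITY of used-up timeslice, move the task to
	 * the back of its priority list so equal-priority tasks get a turn */
	if (!((task_timeslice(p) - p->time_slice) % TIMESLICE_GRANULARITY) &&
			p->array == rq->active) {
		requeue_task(p, rq->active);
		set_tsk_need_resched(p);
	}
	/* the quirk: nothing above looks at how much of p->time_slice is
	 * left, so occasionally a task is requeued with <3ms to go; O12.2+
	 * adds that check */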

Con



* Re: 2.6.0-test2-mm3 osdl-aim-7 regression
  2003-08-07  5:41         ` Con Kolivas
@ 2003-08-07  8:25           ` Nick Piggin
  2003-08-07 10:01             ` Con Kolivas
  0 siblings, 1 reply; 10+ messages in thread
From: Nick Piggin @ 2003-08-07  8:25 UTC (permalink / raw)
  To: Con Kolivas; +Cc: linux-kernel



Con Kolivas wrote:

>On Thu, 7 Aug 2003 15:11, Nick Piggin wrote:
>
>>What is the need for this round robining? Don't processes get a calculated
>>timeslice anyway?
>>
>
>Nice to see you taking an unhealthy interest in the scheduler tweaks, Nick. 
>This issue has been discussed before but it never hurts to review things. 
>I've uncc'ed the rest of the people in case we get carried away again. First 
>let me show you Ingo's comment in the relevant code section:
>
>		 * Prevent a too long timeslice allowing a task to monopolize
>		 * the CPU. We do this by splitting up the timeslice into
>		 * smaller pieces.
>		 *
>		 * Note: this does not mean the task's timeslices expire or
>		 * get lost in any way, they just might be preempted by
>		 * another task of equal priority. (one with higher
>		 * priority would have preempted this task already.) We
>		 * requeue this task to the end of the list on this priority
>		 * level, which is in essence a round-robin of tasks with
>		 * equal priority.
>
>I was gonna say second blah blah but I think the first paragraph explains the 
>issue. 
>
>Must we do this? No. 
>
>Should we? Probably. 
>
>How frequently should we do it? Once again I'll quote Ingo who said it's a 
>difficult question to answer. 
>

OK, I was just thinking it should get done automatically by virtue
of the regular timeslice allocation, dynamic priorities, etc.

It just sounds like another workaround for the scheduler's inability to
properly manage priorities and (the large range of lengths of) timeslices.

>
>
>The more frequently you round robin the lower the scheduler latency between 
>SCHED_OTHER tasks of the same priority. However, the longer the timeslice the 
>more benefit you get from cpu cache. Where is the sweet spot? Depends on the 
>hardware and your usage requirements of course, but Ingo has empirically 
>chosen 25ms, after 50ms seemed too long. Basically cache thrashing becomes a 
>real problem with timeslices below ~7ms on modern hardware in my limited 
>testing. A minor quirk in Ingo's original code means _occasionally_ a task 
>will be requeued with <3ms to go. It will be interesting to see if fixing 
>this (which O12.2+ does) makes a big difference or whether we need to 
>reconsider how frequently (if at all) we round robin tasks.  
>

Why not have it dynamic? CPU hogs get longer timeslices (but of course
can be preempted by higher priorities).
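
Something along these lines, say (purely an illustration of the idea, not
code from any of the patches discussed here; task_cpu_boundness() is a
made-up helper):

	#define MIN_GRANULARITY		(10 * HZ / 1000)	/* 10ms  */
	#define MAX_GRANULARITY		(100 * HZ / 1000)	/* 100ms */

	/* Illustrative only: CPU hogs get a longer stretch between requeues,
	 * interactive tasks keep the short round-robin interval.
	 * task_cpu_boundness() is hypothetical and returns 0..100. */
	static unsigned int task_granularity(struct task_struct *p)
	{
		unsigned int hog = task_cpu_boundness(p);

		return MIN_GRANULARITY +
			(MAX_GRANULARITY - MIN_GRANULARITY) * hog / 100;
	}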




* Re: 2.6.0-test2-mm3 osdl-aim-7 regression
  2003-08-07  8:25           ` Nick Piggin
@ 2003-08-07 10:01             ` Con Kolivas
  2003-08-07 10:05               ` Nick Piggin
  0 siblings, 1 reply; 10+ messages in thread
From: Con Kolivas @ 2003-08-07 10:01 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-kernel

On Thu, 7 Aug 2003 18:25, Nick Piggin wrote:
> >The more frequently you round robin the lower the scheduler latency
> > between SCHED_OTHER tasks of the same priority. However, the longer the
> > timeslice the more benefit you get from cpu cache. Where is the sweet
> > spot? Depends on the hardware and your usage requirements of course, but
> > Ingo has empirically chosen 25ms, after 50ms seemed too long. Basically
> > cache thrashing becomes a real problem with timeslices below ~7ms on
> > modern hardware in my limited testing. A minor quirk in Ingo's original
> > code means _occasionally_ a task will be requeued with <3ms to go. It
> > will be interesting to see if fixing this (which O12.2+ does) makes a big
> > difference or whether we need to reconsider how frequently (if at all) we
> > round robin tasks.
>
> Why not have it dynamic? CPU hogs get longer timeslices (but of course
> can be preempted by higher priorities).

Funny you should say that. Before Ingo merged his A3 changes, that's what my 
version of them did.

Con



* Re: 2.6.0-test2-mm3 osdl-aim-7 regression
  2003-08-07 10:01             ` Con Kolivas
@ 2003-08-07 10:05               ` Nick Piggin
  0 siblings, 0 replies; 10+ messages in thread
From: Nick Piggin @ 2003-08-07 10:05 UTC (permalink / raw)
  To: Con Kolivas; +Cc: linux-kernel



Con Kolivas wrote:

>On Thu, 7 Aug 2003 18:25, Nick Piggin wrote:
>
>>>The more frequently you round robin the lower the scheduler latency
>>>between SCHED_OTHER tasks of the same priority. However, the longer the
>>>timeslice the more benefit you get from cpu cache. Where is the sweet
>>>spot? Depends on the hardware and your usage requirements of course, but
>>>Ingo has empirically chosen 25ms, after 50ms seemed too long. Basically
>>>cache thrashing becomes a real problem with timeslices below ~7ms on
>>>modern hardware in my limited testing. A minor quirk in Ingo's original
>>>code means _occasionally_ a task will be requeued with <3ms to go. It
>>>will be interesting to see if fixing this (which O12.2+ does) makes a big
>>>difference or whether we need to reconsider how frequently (if at all) we
>>>round robin tasks.
>>>
>>Why not have it dynamic? CPU hogs get longer timeslices (but of course
>>can be preempted by higher priorities).
>>
>
>Funny you should say that. Before Ingo merged his A3 changes, that's what my 
>version of them did.
>
>

Between you and me, I think this would be the right way to go if it
could be done right. I don't think wli, mjb and the rest of their
clique appreciate the 25ms reschedule!




* Re: 2.6.0-test2-mm3 osdl-aim-7 regression
  2003-08-07  2:40     ` Con Kolivas
  2003-08-07  5:11       ` Nick Piggin
@ 2003-08-08 20:58       ` Cliff White
  1 sibling, 0 replies; 10+ messages in thread
From: Cliff White @ 2003-08-08 20:58 UTC (permalink / raw)
  To: Con Kolivas; +Cc: linux-kernel

> On Thu, 7 Aug 2003 05:10, Cliff White wrote:
> > > Binary searching (insert gratuitous rant about benchmarks that take more
> > > than two minutes to complete) reveals that the slowdown is due to
> > > sched-2.6.0-test2-mm2-A3.
> 
> This is most likely the round robinning of tasks every 25ms. I doubt the
> extra overhead of nanosecond timing could make a difference of that size
> (but I could be wrong). There is some tweaking of this round robinning in my
> code which may help, but I don't believe it will bring performance all the
> way back. Two things to try: first, add my patches up to O12.3int to see how
> much (if at all!) they help; second, change TIMESLICE_GRANULARITY in sched.c
> to (MAX_TIMESLICE), which basically disables it completely. If there is
> still a drop in performance after that, the remainder is the extra
> locking/overhead of nanosecond timing.
> 
> Con
> 
Added your patches to PLM from your web site. We've had other issues slowing
down the 4-cpu queue, but the two-CPU tests ran. On these smaller platforms
we're not seeing a big difference between the patches.

STP id  PLM#  Kernel Name          Workfile   MaxJPM   MaxUser  Host      %Change
277231  2042  CK-O13-O13.1int-1    new_dbase  1333.60  22       stp2-002   0.00
277230  2041  CK-O12.3-O13int-1    new_dbase  1344.23  24       stp2-003   0.80
277228  2040  CK-O12.2-O12.3int-1  new_dbase  1328.86  22       stp2-002  -0.36

All are a bit better than stock:
276572  2020  linux-2.6.0-test2    new_dbase  1320.68  22       stp2-000  -0.96

(%Change appears to be relative to the CK-O13-O13.1int-1 run.)
---- 
Code location:
bk://developer.osdl.org/osdl-aim-7
More results:
http://developer.osdl.org/cliffw/reaim/index.html

Run parameters: 

./reaim -s2 -x -t -i2 -f workfile.new_dbase -r3 -b -l./stp.config

cliffw



