linux-kernel.vger.kernel.org archive mirror
* Default cache_hot_time value back to 10ms
@ 2004-10-06  0:42 Chen, Kenneth W
  2004-10-06  0:47 ` Con Kolivas
                   ` (4 more replies)
  0 siblings, 5 replies; 52+ messages in thread
From: Chen, Kenneth W @ 2004-10-06  0:42 UTC (permalink / raw)
  To: 'Ingo Molnar'
  Cc: linux-kernel, 'Andrew Morton', 'Nick Piggin'

Chen, Kenneth W wrote on Tuesday, October 05, 2004 10:31 AM
> We have experimented with a similar thing, by bumping sd->cache_hot_time up to
> a very large number, like 1 sec.  What we measured was an equally low throughput.
> But that was because there was not enough load balancing.

Since we are talking about load balancing, we decided to measure various
values for the cache_hot_time variable to see how it affects app performance.
We first established a baseline number with the vanilla base kernel (default
at 2.5ms), then swept that variable up to 1000ms.  All of the experiments
were done with Ingo's patch posted earlier.  Here are the results (test
environment: 4-way SMP machine, 32 GB memory, 500 disks, running an
industry-standard db transaction processing workload):

cache_hot_time  | workload throughput
--------------------------------------
         2.5ms  - 100.0   (0% idle)
         5ms    - 106.0   (0% idle)
         10ms   - 112.5   (1% idle)
         15ms   - 111.6   (3% idle)
         25ms   - 111.1   (5% idle)
         250ms  - 105.6   (7% idle)
         1000ms - 105.4   (7% idle)

Clearly the default value for SMP has the worst application throughput (12%
below peak performance).  When set too low, the kernel is too aggressive on
load balancing and we still see cache thrashing despite the perf fix.
However, if set too high, the kernel gets too conservative and does not do
enough load balancing.

This value defaulted to 10ms before the domain scheduler; why did the
domain scheduler need to change it to 2.5ms, and on what basis was that
decision made?  We propose changing that number back to 10ms.

Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>

--- linux-2.6.9-rc3/kernel/sched.c.orig	2004-10-05 17:37:21.000000000 -0700
+++ linux-2.6.9-rc3/kernel/sched.c	2004-10-05 17:38:02.000000000 -0700
@@ -387,7 +387,7 @@ struct sched_domain {
 	.max_interval		= 4,			\
 	.busy_factor		= 64,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (5*1000000/2),	\
+	.cache_hot_time		= (10*1000000),		\
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_BALANCE_NEWIDLE	\




* Re: Default cache_hot_time value back to 10ms
  2004-10-06  0:42 Default cache_hot_time value back to 10ms Chen, Kenneth W
@ 2004-10-06  0:47 ` Con Kolivas
  2004-10-06  1:02   ` Nick Piggin
  2004-10-06  0:58 ` Nick Piggin
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 52+ messages in thread
From: Con Kolivas @ 2004-10-06  0:47 UTC (permalink / raw)
  To: Chen, Kenneth W
  Cc: 'Ingo Molnar', linux-kernel, 'Andrew Morton',
	'Nick Piggin'

Chen, Kenneth W writes:

> Chen, Kenneth W wrote on Tuesday, October 05, 2004 10:31 AM
>> We have experimented with a similar thing, by bumping sd->cache_hot_time up to
>> a very large number, like 1 sec.  What we measured was an equally low throughput.
>> But that was because there was not enough load balancing.
> 
> Since we are talking about load balancing, we decided to measure various
> values for the cache_hot_time variable to see how it affects app performance.
> We first established a baseline number with the vanilla base kernel (default
> at 2.5ms), then swept that variable up to 1000ms.  All of the experiments
> were done with Ingo's patch posted earlier.  Here are the results (test
> environment: 4-way SMP machine, 32 GB memory, 500 disks, running an
> industry-standard db transaction processing workload):
> 
> cache_hot_time  | workload throughput
> --------------------------------------
>          2.5ms  - 100.0   (0% idle)
>          5ms    - 106.0   (0% idle)
>          10ms   - 112.5   (1% idle)
>          15ms   - 111.6   (3% idle)
>          25ms   - 111.1   (5% idle)
>          250ms  - 105.6   (7% idle)
>          1000ms - 105.4   (7% idle)
> 
> Clearly the default value for SMP has the worst application throughput (12%
> below peak performance).  When set too low, the kernel is too aggressive on
> load balancing and we still see cache thrashing despite the perf fix.
> However, if set too high, the kernel gets too conservative and does not do
> enough load balancing.
> 
> This value defaulted to 10ms before the domain scheduler; why did the
> domain scheduler need to change it to 2.5ms, and on what basis was that
> decision made?  We propose changing that number back to 10ms.

Should it not be based on the cache flush time? We already measure that to
set cache_decay_ticks, and could base it on that. What is the cache_decay_ticks
value reported in the dmesg on your hardware?

Cheers,
Con



* Re: Default cache_hot_time value back to 10ms
  2004-10-06  0:42 Default cache_hot_time value back to 10ms Chen, Kenneth W
  2004-10-06  0:47 ` Con Kolivas
@ 2004-10-06  0:58 ` Nick Piggin
  2004-10-06  3:55 ` Andrew Morton
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 52+ messages in thread
From: Nick Piggin @ 2004-10-06  0:58 UTC (permalink / raw)
  To: Chen, Kenneth W
  Cc: 'Ingo Molnar', linux-kernel, 'Andrew Morton'

Chen, Kenneth W wrote:
> Chen, Kenneth W wrote on Tuesday, October 05, 2004 10:31 AM
> 
>>We have experimented with a similar thing, by bumping sd->cache_hot_time up to
>>a very large number, like 1 sec.  What we measured was an equally low throughput.
>>But that was because there was not enough load balancing.
> 
> 
> Since we are talking about load balancing, we decided to measure various
> values for the cache_hot_time variable to see how it affects app performance.
> We first established a baseline number with the vanilla base kernel (default
> at 2.5ms), then swept that variable up to 1000ms.  All of the experiments
> were done with Ingo's patch posted earlier.  Here are the results (test
> environment: 4-way SMP machine, 32 GB memory, 500 disks, running an
> industry-standard db transaction processing workload):
> 
> cache_hot_time  | workload throughput
> --------------------------------------
>          2.5ms  - 100.0   (0% idle)
>          5ms    - 106.0   (0% idle)
>          10ms   - 112.5   (1% idle)
>          15ms   - 111.6   (3% idle)
>          25ms   - 111.1   (5% idle)
>          250ms  - 105.6   (7% idle)
>          1000ms - 105.4   (7% idle)
> 
> Clearly the default value for SMP has the worst application throughput (12%
> below peak performance).  When set too low, the kernel is too aggressive on
> load balancing and we still see cache thrashing despite the perf fix.
> However, if set too high, the kernel gets too conservative and does not do
> enough load balancing.
> 

Great testing, thanks.

> This value defaulted to 10ms before the domain scheduler; why did the
> domain scheduler need to change it to 2.5ms, and on what basis was that
> decision made?  We propose changing that number back to 10ms.
> 

IIRC Ingo wanted it lower, to more closely match previous values (correct
me if I'm wrong).

I think your patch would be fine though: when timeslicing tasks on
the same CPU, I've typically seen large regressions when going below
a 10ms timeslice, even on a CPU with a small (512K) cache.


* Re: Default cache_hot_time value back to 10ms
  2004-10-06  0:47 ` Con Kolivas
@ 2004-10-06  1:02   ` Nick Piggin
  0 siblings, 0 replies; 52+ messages in thread
From: Nick Piggin @ 2004-10-06  1:02 UTC (permalink / raw)
  To: Con Kolivas
  Cc: Chen, Kenneth W, 'Ingo Molnar',
	linux-kernel, 'Andrew Morton'

Con Kolivas wrote:

> Should it not be based on the cache flush time? We already measure that to
> set cache_decay_ticks, and could base it on that. What is the
> cache_decay_ticks value reported in the dmesg on your hardware?
> 

It should be, but the cache_decay_ticks calculation is so crude that I
preferred to use a fixed value to reduce the variation between different
setups.

I once experimented with attempting to figure out memory bandwidth based
on reading an uncached page. That might be the way to go.


* Re: Default cache_hot_time value back to 10ms
  2004-10-06  0:42 Default cache_hot_time value back to 10ms Chen, Kenneth W
  2004-10-06  0:47 ` Con Kolivas
  2004-10-06  0:58 ` Nick Piggin
@ 2004-10-06  3:55 ` Andrew Morton
  2004-10-06  4:30   ` Nick Piggin
  2004-10-06  7:48 ` Ingo Molnar
  2004-10-06 13:29 ` [patch] sched: auto-tuning task-migration Ingo Molnar
  4 siblings, 1 reply; 52+ messages in thread
From: Andrew Morton @ 2004-10-06  3:55 UTC (permalink / raw)
  To: Chen, Kenneth W; +Cc: mingo, linux-kernel, nickpiggin

"Chen, Kenneth W" <kenneth.w.chen@intel.com> wrote:
>
> This value defaulted to 10ms before the domain scheduler; why did the
>  domain scheduler need to change it to 2.5ms, and on what basis was that
>  decision made?  We propose changing that number back to 10ms.

It sounds like this needs to be runtime tunable?


* Re: Default cache_hot_time value back to 10ms
  2004-10-06  3:55 ` Andrew Morton
@ 2004-10-06  4:30   ` Nick Piggin
  2004-10-06  4:51     ` Andrew Morton
  0 siblings, 1 reply; 52+ messages in thread
From: Nick Piggin @ 2004-10-06  4:30 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Chen, Kenneth W, mingo, linux-kernel

Andrew Morton wrote:
> "Chen, Kenneth W" <kenneth.w.chen@intel.com> wrote:
> 
>>This value defaulted to 10ms before the domain scheduler; why did the
>> domain scheduler need to change it to 2.5ms, and on what basis was that
>> decision made?  We propose changing that number back to 10ms.
> 
> 
> It sounds like this needs to be runtime tunable?
> 

I'd say it is probably too low level to be a useful tunable (although
for testing I guess so... but then you could have *lots* of parameters
tunable).

I don't think there was a really good reason why this value is 2.5ms.


* Re: Default cache_hot_time value back to 10ms
  2004-10-06  4:30   ` Nick Piggin
@ 2004-10-06  4:51     ` Andrew Morton
  2004-10-06  5:00       ` Nick Piggin
                         ` (2 more replies)
  0 siblings, 3 replies; 52+ messages in thread
From: Andrew Morton @ 2004-10-06  4:51 UTC (permalink / raw)
  To: Nick Piggin; +Cc: kenneth.w.chen, mingo, linux-kernel

Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
>  Andrew Morton wrote:
>  > "Chen, Kenneth W" <kenneth.w.chen@intel.com> wrote:
>  > 
>  >>This value defaulted to 10ms before the domain scheduler; why did the
>  >> domain scheduler need to change it to 2.5ms, and on what basis was that
>  >> decision made?  We propose changing that number back to 10ms.
>  > 
>  > 
>  > It sounds like this needs to be runtime tunable?
>  > 
> 
>  I'd say it is probably too low level to be a useful tunable (although
>  for testing I guess so... but then you could have *lots* of parameters
>  tunable).

This tunable caused an 11% performance difference in (I assume) TPCx. 
That's a big deal, and people will want to diddle it.

If one number works optimally for all machines and workloads then fine.

But yes, avoiding a tunable would be nice, but we need a tunable to work
out whether we can avoid making it tunable ;)

Not that I'm soliciting patches or anything.  I'll duck this one for now.


* Re: Default cache_hot_time value back to 10ms
  2004-10-06  4:51     ` Andrew Morton
@ 2004-10-06  5:00       ` Nick Piggin
  2004-10-06  5:09         ` Andrew Morton
  2004-10-06  5:52       ` Default cache_hot_time value back to 10ms Chen, Kenneth W
  2004-10-06 19:27       ` Chen, Kenneth W
  2 siblings, 1 reply; 52+ messages in thread
From: Nick Piggin @ 2004-10-06  5:00 UTC (permalink / raw)
  To: Andrew Morton; +Cc: kenneth.w.chen, mingo, linux-kernel, Judith Lebzelter

Andrew Morton wrote:

> This tunable caused an 11% performance difference in (I assume) TPCx. 
> That's a big deal, and people will want to diddle it.
> 

True. But 2.5ms I think really is too low (for anyone, except maybe a
CPU with no L2 cache, or only a tiny one).

> If one number works optimally for all machines and workloads then fine.
> 

Yeah.. 10ms may bring up idle times a bit on other workloads. Judith
had some database tests that were very sensitive to this - if 10ms is
OK there, then I'd say it would be OK for most things.

> But yes, avoiding a tunable would be nice, but we need a tunable to work
> out whether we can avoid making it tunable ;)
> 

Heh. I think it would be good to have an automatic thingy to tune it.
A smarter cache_decay_ticks calculation would suit.

> Not that I'm soliciting patches or anything.  I'll duck this one for now.
> 

OK. Any idea when 2.6.9 will be coming out? :)


* Re: Default cache_hot_time value back to 10ms
  2004-10-06  5:00       ` Nick Piggin
@ 2004-10-06  5:09         ` Andrew Morton
  2004-10-06  5:21           ` Nick Piggin
  0 siblings, 1 reply; 52+ messages in thread
From: Andrew Morton @ 2004-10-06  5:09 UTC (permalink / raw)
  To: Nick Piggin; +Cc: kenneth.w.chen, mingo, linux-kernel, judith

Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
> Any idea when 2.6.9 will be coming out?

Before -mm hits 1000 patches, I hope.

2.6.8 wasn't really super-stable, and our main tool for getting the quality
up is to stretch the release times, giving us time to shake things out.  The
release time is largely driven by perceptions of current stability, bug
report rates, etc.

A current guess would be -rc4 later this week, 2.6.9 late next week.  We'll
see.

One way of advancing that is to get down and work on bugs in current -linus
tree, yes?

If this still doesn't seem to be working out and if 2.6.9 isn't as good as
we'd like I'll consider shutting down -mm completely once we hit -rc2 so
people have nothing else to do apart from fix bugs in, and test -linus. 
We'll see.


* Re: Default cache_hot_time value back to 10ms
  2004-10-06  5:09         ` Andrew Morton
@ 2004-10-06  5:21           ` Nick Piggin
  2004-10-06  5:33             ` Andrew Morton
  0 siblings, 1 reply; 52+ messages in thread
From: Nick Piggin @ 2004-10-06  5:21 UTC (permalink / raw)
  To: Andrew Morton; +Cc: kenneth.w.chen, mingo, linux-kernel, judith

Andrew Morton wrote:
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> 
>>Any idea when 2.6.9 will be coming out?
> 
> 
> Before -mm hits 1000 patches, I hope.
> 
> 2.6.8 wasn't really super-stable, and our main tool for getting the quality
> up is to stretch the release times, giving us time to shake things out.  The
> release time is largely driven by perceptions of current stability, bug
> report rates, etc.
> 
> A current guess would be -rc4 later this week, 2.6.9 late next week.  We'll
> see.
> 
> One way of advancing that is to get down and work on bugs in current -linus
> tree, yes?
> 
> If this still doesn't seem to be working out and if 2.6.9 isn't as good as
> we'd like I'll consider shutting down -mm completely once we hit -rc2 so
> people have nothing else to do apart from fix bugs in, and test -linus. 
> We'll see.
> 

OK thanks for the explanation.

Any thoughts about making -rc's into -pre's, and doing real -rc's?
It would have caught the NFS bug that made 2.6.8.1, and probably
the cd burning problems... Or is Linus' patching finger just too
itchy?


* Re: Default cache_hot_time value back to 10ms
  2004-10-06  5:21           ` Nick Piggin
@ 2004-10-06  5:33             ` Andrew Morton
  2004-10-06  5:46               ` Nick Piggin
  2004-10-06  6:19               ` new dev model (was Re: Default cache_hot_time value back to 10ms) Jeff Garzik
  0 siblings, 2 replies; 52+ messages in thread
From: Andrew Morton @ 2004-10-06  5:33 UTC (permalink / raw)
  To: Nick Piggin; +Cc: kenneth.w.chen, mingo, linux-kernel, judith

Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
> Any thoughts about making -rc's into -pre's, and doing real -rc's?

I think what we have is OK.  The idea is that once 2.6.9 is released we
merge up all the well-tested code which is sitting in various trees and has
been under test for a few weeks.  As soon as all that well-tested code is
merged, we go into -rc.  So we're pipelining the development of 2.6.10 code
with the stabilisation of 2.6.9.

If someone goes and develops *new* code after the release of, say, 2.6.9
then tough tittie, it's too late for 2.6.9: we don't want new code - we
want old-n-tested code.  So your typed-in-after-2.6.9 code goes into
2.6.11.

That's the theory anyway.  If it means that it takes a long time to get
code into the kernel.org tree, well, that's a cost.  That latency may be
high but the bandwidth is pretty good.

There are exceptions of course.  Completely new
drivers/filesystems/architectures can go in any old time because they won't
break existing setups.  Although I do tend to hold back on even these in
the (probably overoptimistic) hope that people will then concentrate on
mainline bug fixing and testing.

>  It would have caught the NFS bug that made 2.6.8.1, and probably
>  the cd burning problems... Or is Linus' patching finger just too
>  itchy?

uh, let's say that incident was "proof by counter example".


* Re: Default cache_hot_time value back to 10ms
  2004-10-06  5:33             ` Andrew Morton
@ 2004-10-06  5:46               ` Nick Piggin
  2004-10-06  6:19               ` new dev model (was Re: Default cache_hot_time value back to 10ms) Jeff Garzik
  1 sibling, 0 replies; 52+ messages in thread
From: Nick Piggin @ 2004-10-06  5:46 UTC (permalink / raw)
  To: Andrew Morton; +Cc: kenneth.w.chen, mingo, linux-kernel, judith

Andrew Morton wrote:
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> 
>>Any thoughts about making -rc's into -pre's, and doing real -rc's?
> 
> 
> I think what we have is OK.  The idea is that once 2.6.9 is released we
> merge up all the well-tested code which is sitting in various trees and has
> been under test for a few weeks.  As soon as all that well-tested code is
> merged, we go into -rc.  So we're pipelining the development of 2.6.10 code
> with the stabilisation of 2.6.9.
> 
> If someone goes and develops *new* code after the release of, say, 2.6.9
> then tough tittie, it's too late for 2.6.9: we don't want new code - we
> want old-n-tested code.  So your typed-in-after-2.6.9 code goes into
> 2.6.11.
> 
> That's the theory anyway.  If it means that it takes a long time to get
> code into the kernel.org tree, well, that's a cost.  That latency may be
> high but the bandwidth is pretty good.
> 
> There are exceptions of course.  Completely new
> drivers/filesystems/architectures can go in any old time because they won't
> break existing setups.  Although I do tend to hold back on even these in
> the (probably overoptimistic) hope that people will then concentrate on
> mainline bug fixing and testing.
> 
> 
>> It would have caught the NFS bug that made 2.6.8.1, and probably
>> the cd burning problems... Or is Linus' patching finger just too
>> itchy?
> 
> 
> uh, let's say that incident was "proof by counter example".
> 

Heh :)

OK I agree on all these points. And yeah it has worked quite well...

But by real -rc's, I mean 2.6.9 is simply 2.6.9-rcX from a week earlier,
minus the extraversion string; nothing more.

The main point (for me, at least) is that if -rc1 comes out, and I'm
still working on some bug or having something else tested then I can
hurry up and/or send you and Linus a polite email saying don't release
yet.

It would probably be a help for people running automated testing and
regression tests, etc., and just generally increase the userbase a
little bit.

Catching the odd paper bag bug would be a fringe benefit.


* RE: Default cache_hot_time value back to 10ms
  2004-10-06  4:51     ` Andrew Morton
  2004-10-06  5:00       ` Nick Piggin
@ 2004-10-06  5:52       ` Chen, Kenneth W
  2004-10-06 19:27       ` Chen, Kenneth W
  2 siblings, 0 replies; 52+ messages in thread
From: Chen, Kenneth W @ 2004-10-06  5:52 UTC (permalink / raw)
  To: 'Andrew Morton', Nick Piggin; +Cc: mingo, linux-kernel

Andrew Morton wrote on Tuesday, October 05, 2004 9:51 PM
> >  > It sounds like this needs to be runtime tunable?
> >  >
> >
> >  I'd say it is probably too low level to be a useful tunable (although
> >  for testing I guess so... but then you could have *lots* of parameters
> >  tunable).
>
> This tunable caused an 11% performance difference in (I assume) TPCx.
> That's a big deal, and people will want to diddle it.
>
> If one number works optimally for all machines and workloads then fine.
>
> But yes, avoiding a tunable would be nice, but we need a tunable to work
> out whether we can avoid making it tunable ;)

Just to throw in some more benchmark numbers, we measured that specjbb
throughput went up by about 0.3% with cache_hot_time set to 10ms compared
to the default 2.5ms.  No measurable speedup/regression on volanmark (we
just tried 10 and 2.5ms).

- Ken




* new dev model (was Re: Default cache_hot_time value back to 10ms)
  2004-10-06  5:33             ` Andrew Morton
  2004-10-06  5:46               ` Nick Piggin
@ 2004-10-06  6:19               ` Jeff Garzik
  2004-10-06  6:39                 ` Andrew Morton
  2004-10-06  9:23                 ` Ingo Molnar
  1 sibling, 2 replies; 52+ messages in thread
From: Jeff Garzik @ 2004-10-06  6:19 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Nick Piggin, kenneth.w.chen, mingo, linux-kernel, judith

Andrew Morton wrote:
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> 
>>Any thoughts about making -rc's into -pre's, and doing real -rc's?
> 
> 
> I think what we have is OK.  The idea is that once 2.6.9 is released we
> merge up all the well-tested code which is sitting in various trees and has
> been under test for a few weeks.  As soon as all that well-tested code is
> merged, we go into -rc.  So we're pipelining the development of 2.6.10 code
> with the stabilisation of 2.6.9.
> 
> If someone goes and develops *new* code after the release of, say, 2.6.9
> then tough tittie, it's too late for 2.6.9: we don't want new code - we
> want old-n-tested code.  So your typed-in-after-2.6.9 code goes into
> 2.6.11.
> 
> That's the theory anyway.  If it means that it takes a long time to get

This is damned frustrating :(  Reality is _far_ divorced from what you 
just described.

Major developers such as David and Al don't have trees that see wide 
testing, their code only sees wide testing once it hits mainline.  See 
this message from David, 
http://marc.theaimsgroup.com/?l=linux-netdev&m=109648930728731&w=2

In particular, I think David's point about -mm being perceived as overly 
experimental is fair.

Recent experience seems to directly counter the assertion that only 
well-tested code is landing in mainline, and it's not hard to pick 
through the -rc changelogs to find non-trivial, non-bugfix modifications 
to existing code.  My own experience with netdev-2.6 bears this out as 
well:  I have several personal examples of bugs sitting in netdev (and 
thus -mm) for quite a while, only being noticed when the code hits mainline.

Linus's assertion that "calling it -rc means developers should calm 
down" (implying we should start concentrating on bug fixing rather than 
more-fun stuff) is equally fanciful.

Why is it so hard to say "only bugfixes"?

The _reality_ is that there is _no_ point in time where you and Linus
allow for stabilization of the main tree prior to release.  The release
criteria have devolved to the point where we call it done when the stack of
pancakes gets too high.

Ground control to Major Tom?

	Jeff




* Re: new dev model (was Re: Default cache_hot_time value back to 10ms)
  2004-10-06  6:19               ` new dev model (was Re: Default cache_hot_time value back to 10ms) Jeff Garzik
@ 2004-10-06  6:39                 ` Andrew Morton
  2004-10-06  8:56                   ` Paolo Ciarrocchi
                                     ` (2 more replies)
  2004-10-06  9:23                 ` Ingo Molnar
  1 sibling, 3 replies; 52+ messages in thread
From: Andrew Morton @ 2004-10-06  6:39 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: nickpiggin, kenneth.w.chen, mingo, linux-kernel, judith

Jeff Garzik <jgarzik@pobox.com> wrote:
>
> Andrew Morton wrote:
> > Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> > 
> >>Any thoughts about making -rc's into -pre's, and doing real -rc's?
> > 
> > 
> > I think what we have is OK.  The idea is that once 2.6.9 is released we
> > merge up all the well-tested code which is sitting in various trees and has
> > been under test for a few weeks.  As soon as all that well-tested code is
> > merged, we go into -rc.  So we're pipelining the development of 2.6.10 code
> > with the stabilisation of 2.6.9.
> > 
> > If someone goes and develops *new* code after the release of, say, 2.6.9
> > then tough tittie, it's too late for 2.6.9: we don't want new code - we
> > want old-n-tested code.  So your typed-in-after-2.6.9 code goes into
> > 2.6.11.
> > 
> > That's the theory anyway.  If it means that it takes a long time to get
> 
> This is damned frustrating :(  Reality is _far_ divorced from what you 
> just described.

s/far/a bit/

> Major developers such as David and Al don't have trees that see wide 
> testing, their code only sees wide testing once it hits mainline.  See 
> this message from David, 
> http://marc.theaimsgroup.com/?l=linux-netdev&m=109648930728731&w=2
> 

Yes, networking has been an exception.  I think this has been acceptable
thus far because historically networking has tended to work better than
other parts of the kernel.  Although the fib_hash stuff was a bit of a
fiasco.

> In particular, I think David's point about -mm being perceived as overly 
> experimental is fair.

I agree - -mm breaks too often.  You wouldn't believe the crap people throw
at me :(.   But a lot of problems get fixed this way too.

> Recent experience seems to directly counter the assertion that only 
> well-tested code is landing in mainline, and it's not hard to pick 
> through the -rc changelogs to find non-trivial, non-bugfix modifications 
> to existing code.

Once we hit -rc2 we shouldn't be doing that.

>  My own experience with netdev-2.6 bears this out as 
> well:  I have several personal examples of bugs sitting in netdev (and 
> thus -mm) for quite a while, only being noticed when the code hits mainline.

yes, I've had a couple of those.  Not too many, fortunately.  But having
bugs leak in mainline is OK - we expect that.  As long as it wasn't late in
the cycle.  If it was late in the cycle then, well,
bad-call-won't-do-that-again.

> Linus's assertion that "calling it -rc means developers should calm 
> down" (implying we should start concentrating on bug fixing rather than 
> more-fun stuff) is equally fanciful.
> 
> Why is it so hard to say "only bugfixes"?

(It's not "only bugfixes".  It's "only bugfixes, completely new stuff and
documentation/comment fixes".)

But yes.  When you see this please name names and thwap people.

> The _reality_ is that there is _no_ point in time where you and Linus
> allow for stabilization of the main tree prior to release.  The release
> criteria have devolved to the point where we call it done when the stack of
> pancakes gets too high.

That's simply wrong.

For instance, 2.6.8-rc1-mm1-series had 252 patches.  I'm now sitting on 726
patches.  That's 500 patches which are either non-bugfixes or minor
bugfixes which are held back.  The various bk tree maintainers do the same
thing.



* Re: Default cache_hot_time value back to 10ms
  2004-10-06  0:42 Default cache_hot_time value back to 10ms Chen, Kenneth W
                   ` (2 preceding siblings ...)
  2004-10-06  3:55 ` Andrew Morton
@ 2004-10-06  7:48 ` Ingo Molnar
  2004-10-06 17:18   ` Chen, Kenneth W
  2004-10-06 13:29 ` [patch] sched: auto-tuning task-migration Ingo Molnar
  4 siblings, 1 reply; 52+ messages in thread
From: Ingo Molnar @ 2004-10-06  7:48 UTC (permalink / raw)
  To: Chen, Kenneth W
  Cc: 'Ingo Molnar', linux-kernel, 'Andrew Morton',
	'Nick Piggin'


* Chen, Kenneth W <kenneth.w.chen@intel.com> wrote:

> Chen, Kenneth W wrote on Tuesday, October 05, 2004 10:31 AM
> > We have experimented with a similar thing, by bumping sd->cache_hot_time up to
> > a very large number, like 1 sec.  What we measured was an equally low throughput.
> > But that was because there was not enough load balancing.
> 
> Since we are talking about load balancing, we decided to measure various
> values for the cache_hot_time variable to see how it affects app performance.
> We first established a baseline number with the vanilla base kernel (default
> at 2.5ms), then swept that variable up to 1000ms.  All of the experiments
> were done with Ingo's patch posted earlier.  Here are the results (test
> environment: 4-way SMP machine, 32 GB memory, 500 disks, running an
> industry-standard db transaction processing workload):
> 
> cache_hot_time  | workload throughput
> --------------------------------------
>          2.5ms  - 100.0   (0% idle)
>          5ms    - 106.0   (0% idle)
>          10ms   - 112.5   (1% idle)
>          15ms   - 111.6   (3% idle)
>          25ms   - 111.1   (5% idle)
>          250ms  - 105.6   (7% idle)
>          1000ms - 105.4   (7% idle)
> 
> Clearly the default value for SMP has the worst application throughput (12%
> below peak performance).  When set too low, the kernel is too aggressive on
> load balancing and we still see cache thrashing despite the perf fix.
> However, if set too high, the kernel gets too conservative and does not do
> enough load balancing.

could you please try the test in 1 msec increments around 10 msec? It
would be very nice to find a good formula and the 5 msec steps are too
coarse. I think it would be nice to test 7,9,11,13 msecs first, and then
the remaining 1 msec slots around the new maximum. (assuming the
workload measurement is stable.)

> This value defaulted to 10ms before the domain scheduler; why did the
> domain scheduler need to change it to 2.5ms, and on what basis was that
> decision made?  We propose changing that number back to 10ms.

agreed. What value does cache_decay_ticks have on your box?

> 
> Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>

Signed-off-by: Ingo Molnar <mingo@elte.hu>

	Ingo


* Re: new dev model (was Re: Default cache_hot_time value back to 10ms)
  2004-10-06  6:39                 ` Andrew Morton
@ 2004-10-06  8:56                   ` Paolo Ciarrocchi
  2004-10-06  9:44                   ` bert hubert
  2004-10-06 19:40                   ` Jeff Garzik
  2 siblings, 0 replies; 52+ messages in thread
From: Paolo Ciarrocchi @ 2004-10-06  8:56 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jeff Garzik, nickpiggin, kenneth.w.chen, mingo, linux-kernel, judith

On Tue, 5 Oct 2004 23:39:58 -0700, Andrew Morton <akpm@osdl.org> wrote:
> Jeff Garzik <jgarzik@pobox.com> wrote:
> >
> > Andrew Morton wrote:
> > > Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> > >
> > >>Any thoughts about making -rc's into -pre's, and doing real -rc's?
> > >
> > >
> > > I think what we have is OK.  The idea is that once 2.6.9 is released we
> > > merge up all the well-tested code which is sitting in various trees and has
> > > been under test for a few weeks.  As soon as all that well-tested code is
> > > merged, we go into -rc.  So we're pipelining the development of 2.6.10 code
> > > with the stabilisation of 2.6.9.
> > >
> > > If someone goes and develops *new* code after the release of, say, 2.6.9
> > > then tough tittie, it's too late for 2.6.9: we don't want new code - we
> > > want old-n-tested code.  So your typed-in-after-2.6.9 code goes into
> > > 2.6.11.
> > >
> > > That's the theory anyway.  If it means that it takes a long time to get
> >
> > This is damned frustrating :(  Reality is _far_ divorced from what you
> > just described.
> 
> s/far/a bit/

True, just a bit. But the -pre/-rc thing is pretty confusing.
 
> > Major developers such as David and Al don't have trees that see wide
> > testing, their code only sees wide testing once it hits mainline.  See
> > this message from David,
> > http://marc.theaimsgroup.com/?l=linux-netdev&m=109648930728731&w=2
> >
> 
> Yes, networking has been an exception.  I think this has been acceptable
> thus far because historically networking has tended to work better than
> other parts of the kernel.  Although the fib_hash stuff was a bit of a
> fiasco.
> 
> > In particular, I think David's point about -mm being perceived as overly
> > experimental is fair.
> 
> I agree - -mm breaks too often.  You wouldn't believe the crap people throw
> at me :(.   But a lot of problems get fixed this way too.

Again, true.
But it's hard to understand why we have 'exceptions' to the dev model.
I still think that the dev model should be made official and that all the
developers should follow its rules.

> > Recent experience seems to directly counter the assertion that only
> > well-tested code is landing in mainline, and it's not hard to pick
> > through the -rc changelogs to find non-trivial, non-bugfix modifications
> > to existing code.
> 
> Once we hit -rc2 we shouldn't be doing that.
> 
> >  My own experience with netdev-2.6 bears this out as
> > well:  I have several personal examples of bugs sitting in netdev (and
> > thus -mm) for quite a while, only being noticed when the code hits mainline.
> 
> yes, I've had a couple of those.  Not too many, fortunately.  But having
> bugs leak in mainline is OK - we expect that.  As long as it wasn't late in
> the cycle.  If it was late in the cycle then, well,
> bad-call-won't-do-that-again.
> 
> > Linus's assertion that "calling it -rc means developers should calm
> > down" (implying we should start concentrating on bug fixing rather than
> > more-fun stuff) is equally fanciful.
> >
> > Why is it so hard to say "only bugfixes"?
> 
> (It's not "only bugfixes".  It's "only bugfixes, completely new stuff and
> documentation/comment fixes".)
> 
> But yes.  When you see this please name names and thwap people.
> 
> > The _reality_ is that there is _no_ point in time where you and Linus
> > allow for stabilization of the main tree prior to release.  The release
> > criteria have devolved to a point where we call it done when the stack of
> > pancakes gets too high.
> 
> That's simply wrong.
> 
> For instance, 2.6.8-rc1-mm1-series had 252 patches.  I'm now sitting on 726
> patches.  That's 500 patches which are either non-bugfixes or minor
> bugfixes which are held back.  The various bk tree maintainers do the same
> thing.

I really think that:
- Linus should start making -pre releases and then one (or a couple,
if needed) -rc candidates
- all the patches should go into -mm before landing in -pre
- maybe, try to meet a few quality "goals"?



-- 
Paolo
Personal home page: www.ciarrocchi.tk
See my photos: http://paolociarrocchi.fotopic.net/
Buy cool stuff here: http://www.cafepress.com/paoloc


* Re: new dev model (was Re: Default cache_hot_time value back to 10ms)
  2004-10-06  6:19               ` new dev model (was Re: Default cache_hot_time value back to 10ms) Jeff Garzik
  2004-10-06  6:39                 ` Andrew Morton
@ 2004-10-06  9:23                 ` Ingo Molnar
  2004-10-06  9:57                   ` Paolo Ciarrocchi
  2004-10-06 19:33                   ` Jeff Garzik
  1 sibling, 2 replies; 52+ messages in thread
From: Ingo Molnar @ 2004-10-06  9:23 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Andrew Morton, Nick Piggin, kenneth.w.chen, linux-kernel, judith


On Wed, 6 Oct 2004, Jeff Garzik wrote:

> The _reality_ is that there is _no_ point in time where you and Linus
> allow for stabilization of the main tree prior to release. [...]

I don't think this is fair to Andrew - there are hundreds of patches in his
tree that are scheduled for 2.6.10, not 2.6.9.

You are right that -mm is experimental, but the latency of bugfixes is the
lowest I've ever seen in any Linux tree, which is quite amazing
considering the hundreds of patches.

It is also correct that the pile of patches in the -mm tree masks the QA
effects of testing done on -mm, so testing -BK separately is just as
important at this stage.

Maybe it would help perception and awareness-of-release a bit if at this
stage Andrew switched the -mm tree to the -BK tree and truly only kept
those patches that are destined for BK for 2.6.9. [i.e. if the current
patch series were cut off at patch #3 or so, but the numbering of
-rc3-mm3 were kept.] This can only be done if the changes from now to
2.6.9-real are small enough that they don't impact those 700 patches too
much.

This switching would immediately expose all -mm users to the current state
of affairs of the -BK tree. (Yes, people could try the -BK tree just as
easily, but it seems -mm is used by developers quite often and it would help
if the two trees were largely equivalent so close to the release.)

	Ingo


* Re: new dev model (was Re: Default cache_hot_time value back to 10ms)
  2004-10-06  6:39                 ` Andrew Morton
  2004-10-06  8:56                   ` Paolo Ciarrocchi
@ 2004-10-06  9:44                   ` bert hubert
  2004-10-06 14:00                     ` Andries Brouwer
  2004-10-06 19:40                   ` Jeff Garzik
  2 siblings, 1 reply; 52+ messages in thread
From: bert hubert @ 2004-10-06  9:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jeff Garzik, nickpiggin, kenneth.w.chen, mingo, linux-kernel, judith

On Tue, Oct 05, 2004 at 11:39:58PM -0700, Andrew Morton wrote:

> I agree - -mm breaks too often.  You wouldn't believe the crap people throw
> at me :(.   But a lot of problems get fixed this way too.

Mainline is suffering too - lots of people I know running 2.6 on production
systems have noted a marked increase in problems, crashes and odd things.

I'd bet you get a lot of people who'd vote for a timeout right now to figure
out what's going wrong.

There is a distinct impression that we are going downhill in this series.
My personal feeling is that this trend started almost immediately after OLS.

I can try to gather the general reports I hear from people - it might well
be that we are not reporting the bugs properly. I'm sitting on a couple of
odd things myself that need to be written up.

Thanks.

-- 
http://www.PowerDNS.com      Open source, database driven DNS Software 
http://lartc.org           Linux Advanced Routing & Traffic Control HOWTO


* Re: new dev model (was Re: Default cache_hot_time value back to 10ms)
  2004-10-06  9:23                 ` Ingo Molnar
@ 2004-10-06  9:57                   ` Paolo Ciarrocchi
  2004-10-06 19:33                   ` Jeff Garzik
  1 sibling, 0 replies; 52+ messages in thread
From: Paolo Ciarrocchi @ 2004-10-06  9:57 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jeff Garzik, Andrew Morton, Nick Piggin, kenneth.w.chen,
	linux-kernel, judith

On Wed, 6 Oct 2004 05:23:29 -0400 (EDT), Ingo Molnar <mingo@redhat.com> wrote:
> 
> On Wed, 6 Oct 2004, Jeff Garzik wrote:
> 
> > The _reality_ is that there is _no_ point in time where you and Linus
> > allow for stabilization of the main tree prior to release. [...]
> 
> I don't think this is fair to Andrew - there are hundreds of patches in his
> tree that are scheduled for 2.6.10, not 2.6.9.

Andrew is doing an amazing job. He's really an impressive hacker.
 
> You are right that -mm is experimental, but the latency of bugfixes is the
> lowest I've ever seen in any Linux tree, which is quite amazing
> considering the hundreds of patches.

Just my humble opinion:
I think that's because Andrew and Linus are working very well together;
I'm not sure it's because of the new dev model.
It seems to me that there is room for improvement.

> It is also correct that the pile of patches in the -mm tree masks the QA
> effects of testing done on -mm, so testing -BK separately is just as
> important at this stage.
> 
> Maybe it would help perception and awareness-of-release a bit if at this
> stage Andrew switched the -mm tree to the -BK tree and truly only kept
> those patches that are destined for BK for 2.6.9. [i.e. if the current
> patch series were cut off at patch #3 or so, but the numbering of
> -rc3-mm3 were kept.] This can only be done if the changes from now to
> 2.6.9-real are small enough that they don't impact those 700 patches too
> much.
> 
> This switching would immediately expose all -mm users to the current state
> of affairs of the -BK tree. (Yes, people could try the -BK tree just as
> easily, but it seems -mm is used by developers quite often and it would help
> if the two trees were largely equivalent so close to the release.)

Good idea.

-- 
Paolo
Personal home page: www.ciarrocchi.tk
See my photos: http://paolociarrocchi.fotopic.net/
Buy cool stuff here: http://www.cafepress.com/paoloc


* [patch] sched: auto-tuning task-migration
  2004-10-06  0:42 Default cache_hot_time value back to 10ms Chen, Kenneth W
                   ` (3 preceding siblings ...)
  2004-10-06  7:48 ` Ingo Molnar
@ 2004-10-06 13:29 ` Ingo Molnar
  2004-10-06 13:44   ` Nick Piggin
                     ` (2 more replies)
  4 siblings, 3 replies; 52+ messages in thread
From: Ingo Molnar @ 2004-10-06 13:29 UTC (permalink / raw)
  To: Chen, Kenneth W
  Cc: linux-kernel, 'Andrew Morton', 'Nick Piggin'


* Chen, Kenneth W <kenneth.w.chen@intel.com> wrote:

> Since we are talking about load balancing, we decided to measure
> various values for the cache_hot_time variable to see how it affects app
> performance. We first established a baseline number with the vanilla base
> kernel (default at 2.5ms), then swept that variable up to 1000ms.  All
> of the experiments were done with Ingo's patch posted earlier.  Here
> are the results (test environment: 4-way SMP machine, 32 GB memory,
> 500 disks running an industry-standard db transaction processing
> workload):
> 
> cache_hot_time  | workload throughput
> --------------------------------------
>          2.5ms  - 100.0   (0% idle)
>          5ms    - 106.0   (0% idle)
>          10ms   - 112.5   (1% idle)
>          15ms   - 111.6   (3% idle)
>          25ms   - 111.1   (5% idle)
>          250ms  - 105.6   (7% idle)
>          1000ms - 105.4   (7% idle)

the following patch adds a new feature to the scheduler: during bootup
it measures migration costs and sets up cache_hot value accordingly.

The measurement is point-to-point, i.e. it can be used to measure the
migration costs in cache hierarchies - e.g. by NUMA setup code. The
patch prints out a matrix of migration costs between CPUs. 
(self-migration means pure cache dirtying cost)

Here are a couple of matrices from test systems:

A 2-way Celeron/128K box:

 arch cache_decay_nsec: 1000000
 migration cost matrix (cache_size: 131072, cpu: 467 MHz):
         [00]  [01]
 [00]:    9.6  12.0
 [01]:   12.2   9.8
 min_delta: 12586890
 using cache_decay nsec: 12586890 (12 msec)

a 2-way/4-way P4/512K HT box:

 arch cache_decay_nsec: 2000000
 migration cost matrix (cache_size: 524288, cpu: 2379 MHz):
         [00]  [01]  [02]  [03]
 [00]:    6.1   6.1   5.7   6.1
 [01]:    6.7   6.2   6.7   6.2
 [02]:    5.9   5.9   6.1   5.0
 [03]:    6.7   6.2   6.7   6.2
 min_delta: 6053016
 using cache_decay nsec: 6053016 (5 msec)

an 8-way P3/2MB Xeon box:

 arch cache_decay_nsec: 6000000
 migration cost matrix (cache_size: 2097152, cpu: 700 MHz):
         [00]  [01]  [02]  [03]  [04]  [05]  [06]  [07]
 [00]:   92.1 184.8 184.8 184.8 184.9  90.7  90.6  90.7
 [01]:  181.3  92.7  88.5  88.6  88.5 181.5 181.3 181.4
 [02]:  181.4  88.4  92.5  88.4  88.5 181.4 181.3 181.4
 [03]:  181.4  88.4  88.5  92.5  88.4 181.5 181.2 181.4
 [04]:  181.4  88.5  88.4  88.4  92.5 181.5 181.3 181.5
 [05]:   87.2 181.5 181.4 181.5 181.4  90.0  87.0  87.1
 [06]:   87.2 181.5 181.4 181.5 181.4  87.9  90.0  87.1
 [07]:   87.2 181.5 181.4 181.5 181.4  87.9  87.0  90.0
 min_delta: 91815564
 using cache_decay nsec: 91815564 (87 msec)

(btw., this matrix nicely shows the 0,5,6,7/1,2,3,4 grouping of quads in
this semi-NUMA 8-way box.)

could you try this patch on your testbox and send me the bootlog? How
close does this method get us to the 10 msec value you measured to be
close to the best value? The patch is against 2.6.9-rc3 + the last
cache_hot fixpatch you tried.

the patch contains comments that explain how migration costs are
measured.

(NOTE: sched_cache_size is only filled in for x86 at the moment, so if
you have another architecture then please add those two lines to that
architecture's smpboot.c.)

this is only the first release of the patch - obviously we cannot print
such a matrix for 1024 CPUs. But this should be good enough for testing
purposes.

	Ingo

--- linux/kernel/sched.c.orig
+++ linux/kernel/sched.c
@@ -388,7 +388,7 @@ struct sched_domain {
 	.max_interval		= 4,			\
 	.busy_factor		= 64,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (5*1000000/2),	\
+	.cache_hot_time		= cache_decay_nsec,	\
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_BALANCE_NEWIDLE	\
@@ -410,7 +410,7 @@ struct sched_domain {
 	.max_interval		= 32,			\
 	.busy_factor		= 32,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (10*1000000),		\
+	.cache_hot_time		= cache_decay_nsec,	\
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_BALANCE_EXEC	\
@@ -4420,11 +4420,233 @@ __init static void init_sched_build_grou
 	last->next = first;
 }
 
-__init static void arch_init_sched_domains(void)
+/*
+ * Task migration cost measurement between source and target CPUs.
+ *
+ * This is done by measuring the worst-case cost. Here are the
+ * steps that are taken:
+ *
+ * 1) the source CPU dirties its L2 cache with a shared buffer
+ * 2) the target CPU dirties its L2 cache with a local buffer
+ * 3) the target CPU dirties the shared buffer
+ *
+ * We measure the time step #3 takes - this is the cost of migrating
+ * a cache-hot task that has a large, dirty dataset in the L2 cache,
+ * to another CPU.
+ */
+
+
+/*
+ * Dirty a big buffer in a hard-to-predict (for the L2 cache) way. This
+ * is the operation that is timed, so we try to generate unpredictable
+ * cachemisses that still end up filling the L2 cache:
+ */
+static void fill_cache(void *__cache, unsigned long __size)
 {
+	unsigned long size = __size/sizeof(long);
+	unsigned long *cache = __cache;
+	unsigned long data = 0xdeadbeef;
 	int i;
+
+	for (i = 0; i < size/4; i++) {
+		if ((i & 3) == 0)
+			cache[i] = data;
+		if ((i & 3) == 1)
+			cache[size-1-i] = data;
+		if ((i & 3) == 2)
+			cache[size/2-i] = data;
+		if ((i & 3) == 3)
+			cache[size/2+i] = data;
+	}
+}
+
+struct flush_data {
+	unsigned long source, target;
+	void (*fn)(void *, unsigned long);
+	void *cache;
+	void *local_cache;
+	unsigned long size;
+	unsigned long long delta;
+};
+
+/*
+ * Dirty L2 on the source CPU:
+ */
+static void source_handler(void *__data)
+{
+	struct flush_data *data = __data;
+
+	if (smp_processor_id() != data->source)
+		return;
+
+	memset(data->cache, 0, data->size);
+}
+
+/*
+ * Dirty the L2 cache on this CPU and then access the shared
+ * buffer. (which represents the working set of the migrated task.)
+ */
+static void target_handler(void *__data)
+{
+	struct flush_data *data = __data;
+	unsigned long long t0, t1;
+	unsigned long flags;
+
+	if (smp_processor_id() != data->target)
+		return;
+
+	memset(data->local_cache, 0, data->size);
+	local_irq_save(flags);
+	t0 = sched_clock();
+	fill_cache(data->cache, data->size);
+	t1 = sched_clock();
+	local_irq_restore(flags);
+
+	data->delta = t1 - t0;
+}
+
+/*
+ * Measure the cache-cost of one task migration:
+ */
+static unsigned long long measure_one(void *cache, unsigned long size,
+				      int source, int target)
+{
+	struct flush_data data;
+	unsigned long flags;
+	void *local_cache;
+
+	local_cache = vmalloc(size);
+	if (!local_cache) {
+		printk("couldnt allocate local cache ...\n");
+		return 0;
+	}
+	memset(local_cache, 0, size);
+
+	local_irq_save(flags);
+	local_irq_enable();
+
+	data.source = source;
+	data.target = target;
+	data.size = size;
+	data.cache = cache;
+	data.local_cache = local_cache;
+
+	if (on_each_cpu(source_handler, &data, 1, 1) != 0) {
+		printk("measure_one: timed out waiting for other CPUs\n");
+		local_irq_restore(flags);
+		return -1;
+	}
+	if (on_each_cpu(target_handler, &data, 1, 1) != 0) {
+		printk("measure_one: timed out waiting for other CPUs\n");
+		local_irq_restore(flags);
+		return -1;
+	}
+
+	vfree(local_cache);
+
+	return data.delta;
+}
+
+unsigned long sched_cache_size;
+
+/*
+ * Measure a series of task migrations and return the maximum
+ * result - the worst-case. Since this code runs early during
+ * bootup the system is 'undisturbed' and the maximum latency
+ * makes sense.
+ *
+ * As the working set we use 1.66 times the L2 cache size, this is
+ * chosen in such a nonsymmetric way so that fill_cache() doesnt
+ * iterate at power-of-2 boundaries (which might hit cache mapping
+ * artifacts and pessimise the results).
+ */
+static __init unsigned long long measure_cacheflush_time(int cpu1, int cpu2)
+{
+	unsigned long size = sched_cache_size*5/3;
+	unsigned long long delta, max = 0;
+	void *cache;
+	int i;
+
+	if (!size) {
+		printk("arch has not set cachesize - using default.\n");
+		return 0;
+	}
+	if (!cpu_online(cpu1) || !cpu_online(cpu2)) {
+		printk("cpu %d and %d not both online!\n", cpu1, cpu2);
+		return 0;
+	}
+	cache = vmalloc(size);
+	if (!cache) {
+		printk("could not vmalloc %ld bytes for cache!\n", size);
+		return 0;
+	}
+	memset(cache, 0, size);
+	for (i = 0; i < 20; i++) {
+		delta = measure_one(cache, size, cpu1, cpu2);
+		if (delta > max)
+			max = delta;
+	}
+
+	vfree(cache);
+
+	/*
+	 * A task is considered 'cache cold' if at least 10 times
+	 * the cost of migration has passed. I.e. in the rare and
+	 * absolutely worst-case we should see a 10% degradation
+	 * due to migration. (this limit is only listened to if the
+	 * load-balancing situation is 'nice' - if there is a large
+	 * imbalance we ignore it for the sake of CPU utilization and
+	 * processing fairness.)
+	 *
+	 * (We use 5/3 times the L2 cachesize in our measurement,
+	 *  hence factor 6 here: 10 == 6*5/3.)
+	 */
+	return max * 6;
+}
+
+static unsigned long long cache_decay_nsec;
+
+__init static void arch_init_sched_domains(void)
+{
+	int i, cpu1 = -1, cpu2 = -1;
+	unsigned long long min_delta = -1ULL;
+
 	cpumask_t cpu_default_map;
 
+	printk("arch cache_decay_nsec: %ld\n", cache_decay_ticks*1000000);
+	printk("migration cost matrix (cache_size: %ld, cpu: %ld MHz):\n",
+		sched_cache_size, cpu_khz/1000);
+	printk("      ");
+	for (cpu1 = 0; cpu1 < NR_CPUS; cpu1++) {
+		if (!cpu_online(cpu1))
+			continue;
+		printk("  [%02d]", cpu1);
+	}
+	printk("\n");
+	for (cpu1 = 0; cpu1 < NR_CPUS; cpu1++) {
+		if (!cpu_online(cpu1))
+			continue;
+		printk("[%02d]: ", cpu1);
+		for (cpu2 = 0; cpu2 < NR_CPUS; cpu2++) {
+			unsigned long long delta;
+
+			if (!cpu_online(cpu2))
+				continue;
+			delta = measure_cacheflush_time(cpu1, cpu2);
+			
+			printk(" %3Ld.%ld", delta >> 20,
+				(((long)delta >> 10) / 102) % 10);
+			if ((cpu1 != cpu2) && (delta < min_delta))
+				min_delta = delta;
+		}
+		printk("\n");
+	}
+	printk("min_delta: %Ld\n", min_delta);
+	if (min_delta != -1ULL)
+		cache_decay_nsec = min_delta;
+	printk("using cache_decay nsec: %Ld (%Ld msec)\n",
+		cache_decay_nsec, cache_decay_nsec >> 20);
+
 	/*
 	 * Setup mask for cpus without special case scheduling requirements.
 	 * For now this just excludes isolated cpus, but could be used to
--- linux/arch/i386/kernel/smpboot.c.orig
+++ linux/arch/i386/kernel/smpboot.c
@@ -849,6 +849,8 @@ static int __init do_boot_cpu(int apicid
 cycles_t cacheflush_time;
 unsigned long cache_decay_ticks;
 
+extern unsigned long sched_cache_size;
+
 static void smp_tune_scheduling (void)
 {
 	unsigned long cachesize;       /* kB   */
@@ -879,6 +881,7 @@ static void smp_tune_scheduling (void)
 		}
 
 		cacheflush_time = (cpu_khz>>10) * (cachesize<<10) / bandwidth;
+		sched_cache_size = cachesize * 1024;
 	}
 
 	cache_decay_ticks = (long)cacheflush_time/cpu_khz + 1;


* Re: [patch] sched: auto-tuning task-migration
  2004-10-06 13:29 ` [patch] sched: auto-tuning task-migration Ingo Molnar
@ 2004-10-06 13:44   ` Nick Piggin
  2004-10-06 17:49   ` Chen, Kenneth W
  2005-02-21  5:08   ` Paul Jackson
  2 siblings, 0 replies; 52+ messages in thread
From: Nick Piggin @ 2004-10-06 13:44 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Chen, Kenneth W, linux-kernel, 'Andrew Morton'

Ingo Molnar wrote:
> * Chen, Kenneth W <kenneth.w.chen@intel.com> wrote:
> 
> 
>>Since we are talking about load balancing, we decided to measure
>>various values for the cache_hot_time variable to see how it affects app
>>performance. We first established a baseline number with the vanilla base
>>kernel (default at 2.5ms), then swept that variable up to 1000ms.  All
>>of the experiments were done with Ingo's patch posted earlier.  Here
>>are the results (test environment: 4-way SMP machine, 32 GB memory,
>>500 disks running an industry-standard db transaction processing
>>workload):
>>
>>cache_hot_time  | workload throughput
>>--------------------------------------
>>         2.5ms  - 100.0   (0% idle)
>>         5ms    - 106.0   (0% idle)
>>         10ms   - 112.5   (1% idle)
>>         15ms   - 111.6   (3% idle)
>>         25ms   - 111.1   (5% idle)
>>         250ms  - 105.6   (7% idle)
>>         1000ms - 105.4   (7% idle)
> 
> 
> the following patch adds a new feature to the scheduler: during bootup
> it measures migration costs and sets up cache_hot value accordingly.
> 
> The measurement is point-to-point, i.e. it can be used to measure the
> migration costs in cache hierarchies - e.g. by NUMA setup code. The
> patch prints out a matrix of migration costs between CPUs. 
> (self-migration means pure cache dirtying cost)
> 
> Here are a couple of matrices from test systems:
> 
> A 2-way Celeron/128K box:
> 
>  arch cache_decay_nsec: 1000000
>  migration cost matrix (cache_size: 131072, cpu: 467 MHz):
>          [00]  [01]
>  [00]:    9.6  12.0
>  [01]:   12.2   9.8
>  min_delta: 12586890
>  using cache_decay nsec: 12586890 (12 msec)
> 
> a 2-way/4-way P4/512K HT box:
> 
>  arch cache_decay_nsec: 2000000
>  migration cost matrix (cache_size: 524288, cpu: 2379 MHz):
>          [00]  [01]  [02]  [03]
>  [00]:    6.1   6.1   5.7   6.1
>  [01]:    6.7   6.2   6.7   6.2
>  [02]:    5.9   5.9   6.1   5.0
>  [03]:    6.7   6.2   6.7   6.2
>  min_delta: 6053016
>  using cache_decay nsec: 6053016 (5 msec)
> 
> an 8-way P3/2MB Xeon box:
> 
>  arch cache_decay_nsec: 6000000
>  migration cost matrix (cache_size: 2097152, cpu: 700 MHz):
>          [00]  [01]  [02]  [03]  [04]  [05]  [06]  [07]
>  [00]:   92.1 184.8 184.8 184.8 184.9  90.7  90.6  90.7
>  [01]:  181.3  92.7  88.5  88.6  88.5 181.5 181.3 181.4
>  [02]:  181.4  88.4  92.5  88.4  88.5 181.4 181.3 181.4
>  [03]:  181.4  88.4  88.5  92.5  88.4 181.5 181.2 181.4
>  [04]:  181.4  88.5  88.4  88.4  92.5 181.5 181.3 181.5
>  [05]:   87.2 181.5 181.4 181.5 181.4  90.0  87.0  87.1
>  [06]:   87.2 181.5 181.4 181.5 181.4  87.9  90.0  87.1
>  [07]:   87.2 181.5 181.4 181.5 181.4  87.9  87.0  90.0
>  min_delta: 91815564
>  using cache_decay nsec: 91815564 (87 msec)
> 

Very cool. I reckon you may want to make the final number
nonlinear if possible, because a 2MB cache probably doesn't
need double the cache decay time of a 1MB cache.

And possibly things need to be tuned a bit, e.g. 12ms for the
128K Celeron may be a bit large (even though it does have a
slow bus).

But this is a nice starting point.


* Re: new dev model (was Re: Default cache_hot_time value back to 10ms)
  2004-10-06  9:44                   ` bert hubert
@ 2004-10-06 14:00                     ` Andries Brouwer
  0 siblings, 0 replies; 52+ messages in thread
From: Andries Brouwer @ 2004-10-06 14:00 UTC (permalink / raw)
  To: bert hubert, Andrew Morton, Jeff Garzik, nickpiggin,
	kenneth.w.chen, mingo, linux-kernel, judith

On Wed, Oct 06, 2004 at 11:44:37AM +0200, bert hubert wrote:

> Mainline is suffering too - lots of people I know running 2.6 on production
> systems have noted a marked increase in problems, crashes, odd things. 
> 
> I'd bet you get a lot of people who'd vote for a timeout right now to figure
> out what's going wrong.
> 
> There is the distinct impression that we are going down hill in this series.
> My personal feeling is that this trend started almost immediately after OLS.

Well, suppose we eliminate 5% of all bugs each week.
Then after a year only 7% of the original bugs are left.

In a stable series that is a fairly good result.
In a series that is simultaneously "stable" and "development",
new random bugs are being introduced continually.
One never reaches a state with only a few bugs.


* RE: Default cache_hot_time value back to 10ms
  2004-10-06  7:48 ` Ingo Molnar
@ 2004-10-06 17:18   ` Chen, Kenneth W
  2004-10-06 19:55     ` Ingo Molnar
  2004-10-06 22:46     ` Peter Williams
  0 siblings, 2 replies; 52+ messages in thread
From: Chen, Kenneth W @ 2004-10-06 17:18 UTC (permalink / raw)
  To: 'Ingo Molnar'
  Cc: 'Ingo Molnar', linux-kernel, 'Andrew Morton',
	'Nick Piggin'

> Chen, Kenneth W wrote on Tuesday, October 05, 2004 10:31 AM
> > We have experimented with a similar thing, via bumping up sd->cache_hot_time to
> > a very large number, like 1 sec.  What we measured was an equally low throughput.
> > But that was because of not enough load balancing.
>
> Since we are talking about load balancing, we decided to measure various
> values for the cache_hot_time variable to see how it affects app performance. We
> first established a baseline number with the vanilla base kernel (default at 2.5ms),
> then swept that variable up to 1000ms.  All of the experiments were done with
> Ingo's patch posted earlier.  Here are the results (test environment: 4-way
> SMP machine, 32 GB memory, 500 disks running an industry-standard db transaction
> processing workload):
>
> cache_hot_time  | workload throughput
> --------------------------------------
>          2.5ms  - 100.0   (0% idle)
>          5ms    - 106.0   (0% idle)
>          10ms   - 112.5   (1% idle)
>          15ms   - 111.6   (3% idle)
>          25ms   - 111.1   (5% idle)
>          250ms  - 105.6   (7% idle)
>          1000ms - 105.4   (7% idle)
>
> Clearly the default value for SMP has the worst application throughput (12%
> below peak performance).  When set too low, the kernel is too aggressive on load
> balancing and we still see cache thrashing despite the perf fix.
> However, if set too high, the kernel gets too conservative and does not do enough
> load balancing.

Ingo Molnar wrote on Wednesday, October 06, 2004 12:48 AM
> could you please try the test in 1 msec increments around 10 msec? It
> would be very nice to find a good formula and the 5 msec steps are too
> coarse. I think it would be nice to test 7,9,11,13 msecs first, and then
> the remaining 1 msec slots around the new maximum. (assuming the
> workload measurement is stable.)

I should've posted the whole thing yesterday; we had measurements at 7.5 and
12.5 ms.  Here are the results (repeating 5, 10, 15 for easy reading).

 5   ms 106.0
 7.5 ms 110.3
10   ms 112.5
12.5 ms 112.0
15   ms 111.6


> > This value defaulted to 10ms before the domain scheduler; why did the domain
> > scheduler need to change it to 2.5ms? And on what basis was that decision
> > made?  We are proposing changing that number back to 10ms.
>
> agreed. What value does cache_decay_ticks have on your box?


I see all the fancy calculations with cache_decay_ticks on x86, but nobody
actually uses it in the domain scheduler.  Anyway, my box has that value
hard-coded to 10ms (ia64).

- Ken




* RE: [patch] sched: auto-tuning task-migration
  2004-10-06 13:29 ` [patch] sched: auto-tuning task-migration Ingo Molnar
  2004-10-06 13:44   ` Nick Piggin
@ 2004-10-06 17:49   ` Chen, Kenneth W
  2004-10-06 20:04     ` Ingo Molnar
  2005-02-21  5:08   ` Paul Jackson
  2 siblings, 1 reply; 52+ messages in thread
From: Chen, Kenneth W @ 2004-10-06 17:49 UTC (permalink / raw)
  To: 'Ingo Molnar'
  Cc: linux-kernel, 'Andrew Morton', 'Nick Piggin'

Ingo Molnar wrote on Wednesday, October 06, 2004 6:30 AM
> the following patch adds a new feature to the scheduler: during bootup
> it measures migration costs and sets up cache_hot value accordingly.
>
> could you try this patch on your testbox and send me the bootlog? How
> close does this method get us to the 10 msec value you measured to be
> close to the best value? The patch is against 2.6.9-rc3 + the last
> cache_hot fixpatch you tried.

Ran it on a similar system.  Below is the output.  I haven't tried to get a
real benchmark run with 42 ms cache_hot_time; I don't think it will reach
peak throughput, as we already start tapering off at 12.5 ms.

task migration cache decay timeout: 10 msecs.
CPU 1: base freq=199.458MHz, ITC ratio=15/2, ITC freq=1495.941MHz+/--1ppm
CPU 2: base freq=199.458MHz, ITC ratio=15/2, ITC freq=1495.941MHz+/--1ppm
CPU 3: base freq=199.458MHz, ITC ratio=15/2, ITC freq=1495.941MHz+/--1ppm
Calibrating delay loop... 2232.84 BogoMIPS (lpj=1089536)
Brought up 4 CPUs
Total of 4 processors activated (8939.60 BogoMIPS).
arch cache_decay_nsec: 10000000
migration cost matrix (cache_size: 9437184, cpu: 1500 MHz):
        [00]  [01]  [02]  [03]
[00]:   50.2  42.8  42.9  42.8
[01]:   42.9  50.2  42.1  42.9
[02]:   42.9  42.9  50.2  42.8
[03]:   42.9  42.9  42.9  50.2
min_delta: 44785782
using cache_decay nsec: 44785782 (42 msec)




* RE: Default cache_hot_time value back to 10ms
  2004-10-06  4:51     ` Andrew Morton
  2004-10-06  5:00       ` Nick Piggin
  2004-10-06  5:52       ` Default cache_hot_time value back to 10ms Chen, Kenneth W
@ 2004-10-06 19:27       ` Chen, Kenneth W
  2004-10-06 19:39         ` Andrew Morton
  2 siblings, 1 reply; 52+ messages in thread
From: Chen, Kenneth W @ 2004-10-06 19:27 UTC (permalink / raw)
  To: 'Andrew Morton', Nick Piggin; +Cc: mingo, linux-kernel

Andrew Morton wrote on Tuesday, October 05, 2004 9:51 PM
> > Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> >  I'd say it is probably too low level to be a useful tunable (although
> >  for testing I guess so... but then you could have *lots* of parameters
> >  tunable).
>
> This tunable caused an 11% performance difference in (I assume) TPCx.
> That's a big deal, and people will want to diddle it.
>
> If one number works optimally for all machines and workloads then fine.
>
> But yes, avoiding a tunable would be nice, but we need a tunable to work
> out whether we can avoid making it tunable ;)
>
> Not that I'm soliciting patches or anything.  I'll duck this one for now.

Andrew, can I safely interpret this response as you are OK with having
cache_hot_time set to 10 ms for now?  And you will merge this change for
2.6.9?  I think Ingo and Nick are both OK with that change as well. Thanks.

- Ken



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: new dev model (was Re: Default cache_hot_time value back to 10ms)
  2004-10-06  9:23                 ` Ingo Molnar
  2004-10-06  9:57                   ` Paolo Ciarrocchi
@ 2004-10-06 19:33                   ` Jeff Garzik
  2004-10-06 22:23                     ` Martin J. Bligh
  1 sibling, 1 reply; 52+ messages in thread
From: Jeff Garzik @ 2004-10-06 19:33 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, Nick Piggin, kenneth.w.chen, linux-kernel, judith

Ingo Molnar wrote:
> On Wed, 6 Oct 2004, Jeff Garzik wrote:
> 
> 
>>The _reality_ is that there is _no_ point in time where you and Linus
>>allow for stabilization of the main tree prior to release. [...]
> 
> 
> i dont think this is fair to Andrew - there's hundreds of patches in his
> tree that are scheduled for 2.6.10 not 2.6.9.
> 
> you are right that -mm is experimental, but the latency of bugfixes is the
> lowest i've ever seen in any Linux tree, which is quite amazing
> considering the hundreds of patches.

I said "stabilization of the main tree" for a reason :)  Like a 
"mini-Andrew", I have over 100 net driver csets waiting for 2.6.10 as well.

The crucial point is establishing a psychology where maintainers only 
submit (and only apply) bug fixes in -rc series.  As long as random 
stuff (like fasync in 2.6.8 release) is getting applied at the last 
minute, we are

* destroying the validity of testing done in -rc prior to release,
* reducing the value of user testing, and
* discouraging users from treating -rc as anything but a 'devel' release 
(as opposed to a 'stable' release)



> it is also correct that the pile of patches in the -mm tree mask the QA
> effects of testing done on -mm, so testing -BK separately is just as
> important at this stage.

The simple fact is that -mm doesn't receive _nearly_ the amount of 
testing that a 2.6.x -BK snapshot does, which in turn doesn't receive 
_nearly_ the amount of testing that a 2.6.x-rc release gets.

The amount of testing, and the amount of feedback I get for my stuff, 
increases by a very large margin going from -mm/-bk to -rc/release.  For 
this reason, one cannot hold up testing in -mm as having nearly the value 
of testing in -rc.

But with the diminished signal/noise ratio of current -rc...

	Jeff



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Default cache_hot_time value back to 10ms
  2004-10-06 19:27       ` Chen, Kenneth W
@ 2004-10-06 19:39         ` Andrew Morton
  2004-10-06 20:38           ` Chen, Kenneth W
  0 siblings, 1 reply; 52+ messages in thread
From: Andrew Morton @ 2004-10-06 19:39 UTC (permalink / raw)
  To: Chen, Kenneth W; +Cc: nickpiggin, mingo, linux-kernel

"Chen, Kenneth W" <kenneth.w.chen@intel.com> wrote:
>
> Andrew Morton wrote on Tuesday, October 05, 2004 9:51 PM
>  > > Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>  > >  I'd say it is probably too low level to be a useful tunable (although
>  > >  for testing I guess so... but then you could have *lots* of parameters
>  > >  tunable).
>  >
>  > This tunable caused an 11% performance difference in (I assume) TPCx.
>  > That's a big deal, and people will want to diddle it.
>  >
>  > If one number works optimally for all machines and workloads then fine.
>  >
>  > But yes, avoiding a tunable would be nice, but we need a tunable to work
>  > out whether we can avoid making it tunable ;)
>  >
>  > Not that I'm soliciting patches or anything.  I'll duck this one for now.
> 
>  Andrew, can I safely interpret this response as you are OK with having
>  cache_hot_time set to 10 ms for now?

I have a lot of scheduler changes queued up and I view this change as being
not very high priority.  If someone sends a patch to update -mm then we can
run with that, however Ingo's auto-tuning seems a far preferable approach.

>  And you will merge this change for 2.6.9?

I was not planning on doing so, but could be persuaded, I guess.

It's very, very late for this and subtle CPU scheduler regressions tend to
take a long time (weeks or months) to be identified.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: new dev model (was Re: Default cache_hot_time value back to 10ms)
  2004-10-06  6:39                 ` Andrew Morton
  2004-10-06  8:56                   ` Paolo Ciarrocchi
  2004-10-06  9:44                   ` bert hubert
@ 2004-10-06 19:40                   ` Jeff Garzik
  2004-10-06 19:48                     ` Jeff Garzik
  2 siblings, 1 reply; 52+ messages in thread
From: Jeff Garzik @ 2004-10-06 19:40 UTC (permalink / raw)
  To: Andrew Morton; +Cc: nickpiggin, kenneth.w.chen, mingo, linux-kernel, judith

Andrew Morton wrote:
> Jeff Garzik <jgarzik@pobox.com> wrote:
> 
>>Andrew Morton wrote:
>>
>>>Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>>>
>>>
>>>>Any thoughts about making -rc's into -pre's, and doing real -rc's?
>>>
>>>
>>>I think what we have is OK.  The idea is that once 2.6.9 is released we
>>>merge up all the well-tested code which is sitting in various trees and has
>>>been under test for a few weeks.  As soon as all that well-tested code is
>>>merged, we go into -rc.  So we're pipelining the development of 2.6.10 code
>>>with the stabilisation of 2.6.9.
>>>
>>>If someone goes and develops *new* code after the release of, say, 2.6.9
>>>then tough tittie, it's too late for 2.6.9: we don't want new code - we
>>>want old-n-tested code.  So your typed-in-after-2.6.9 code goes into
>>>2.6.11.
>>>
>>>That's the theory anyway.  If it means that it takes a long time to get
>>
>>This is damned frustrating :(  Reality is _far_ divorced from what you 
>>just described.
> 
> 
> s/far/a bit/
> 
> 
>>Major developers such as David and Al don't have trees that see wide 
>>testing, their code only sees wide testing once it hits mainline.  See 
>>this message from David, 
>>http://marc.theaimsgroup.com/?l=linux-netdev&m=109648930728731&w=2
>>
> 
> 
> Yes, networking has been an exception.  I think this has been acceptable
> thus far because historically networking has tended to work better than
> other parts of the kernel.  Although the fib_hash stuff was a bit of a
> fiasco.

That's a prime example, yes...


>>In particular, I think David's point about -mm being perceived as overly 
>>experimental is fair.
> 
> 
> I agree - -mm breaks too often.  You wouldn't believe the crap people throw
> at me :(.   But a lot of problems get fixed this way too.
> 
> 
>>Recent experience seems to directly counter the assertion that only 
>>well-tested code is landing in mainline, and it's not hard to pick 
>>through the -rc changelogs to find non-trivial, non-bugfix modifications 
>>to existing code.
> 
> 
> Once we hit -rc2 we shouldn't be doing that.

Why does -rc2 have to be a magic number?  Does that really make sense to 
the users we want to be testing our stuff?

"We picked a magic number, after which, we hope it becomes more stable 
even if it doesn't work out like that in practice"


>> My own experience with netdev-2.6 bears this out as 
>>well:  I have several personal examples of bugs sitting in netdev (and 
>>thus -mm) for quite a while, only being noticed when the code hits mainline.
> 
> 
> yes, I've had a couple of those.  Not too many, fortunately.  But having
> bugs leak in mainline is OK - we expect that.  As long as it wasn't late in
> the cycle.  If it was late in the cycle then, well,
> bad-call-won't-do-that-again.
> 
> 
>>Linus's assertion that "calling it -rc means developers should calm 
>>down" (implying we should start concentrating on bug fixing rather than 
>>more-fun stuff) is equally fanciful.
>>
>>Why is it so hard to say "only bugfixes"?
> 
> 
> (It's not "only bugfixes".  It's "only bugfixes, completely new stuff and
> documentation/comment fixes).
> 
> But yes.  When you see this please name names and thwap people.

I thought I just did ;-)


>>The _reality_ is that there is _no_ point in time where you and Linus 
>>allow for stabilization of the main tree prior to release.  The release 
>>criteria have devolved to a point where we call it done when the stack of 
>>pancakes gets too high.
> 
> 
> That's simply wrong.
> 
> For instance, 2.6.8-rc1-mm1-series had 252 patches.  I'm now sitting on 726
> patches.  That's 500 patches which are either non-bugfixes or minor
> bugfixes which are held back.  The various bk tree maintainers do the same
> thing.

Sure, I'm sitting on over 100 net driver csets myself.  I'm glad, but the 
overall point is still that "-rc" -- which stands for Release Candidate 
-- is nowhere near release-candidate status when -rc1 hits, and fluff 
like sparse notations and changes like the fasync API change in 2.6.8 
always seem to sneak in at the last minute, further belying the 
supposed Release Candidate status.

No matter the effort of maintainers to hold back patches, every 
violation of the Release Candidate Bugfixes Only policy serves to 
undermine user confidence and invalidate previous Q/A work.

	Jeff



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: new dev model (was Re: Default cache_hot_time value back to 10ms)
  2004-10-06 19:40                   ` Jeff Garzik
@ 2004-10-06 19:48                     ` Jeff Garzik
  2004-10-06 19:58                       ` Jeff Garzik
  2004-10-07  0:02                       ` Matt Mackall
  0 siblings, 2 replies; 52+ messages in thread
From: Jeff Garzik @ 2004-10-06 19:48 UTC (permalink / raw)
  To: Andrew Morton, mingo; +Cc: nickpiggin, kenneth.w.chen, linux-kernel, judith


So my own suggestions for increasing 2.6.x stability are:

1) Create a release numbering system that is __clear to users__, not 
just developers.  This is a human, not technical problem.  Telling users 
"oh, -rc1 doesn't really mean Release Candidate, we start getting 
serious around -rc2 or so but some stuff slips in and..." is hardly clear.

2) Really (underscore underscore) only accept bugfixes after the chosen 
line of demarcation.  No API changes.  No new stuff (new stuff may not 
break anything, but it's a distraction).  Chill out on all the sparse 
notations.  _Just_ _bug_ _fixes_.  The fluff (comments/sparse/new 
features) just serves to make reviewing the changes more difficult, as 
it vastly increases the noise-to-signal ratio.

With all the noise of comment fixes, new features, etc. during Release 
Candidate, you kill the value of reviewing the -rc patch.  Developers 
and users have to either (a) boot and pray or (b) wade through tons of 
non-bugfix changes in an attempt to review the changes.

I know it's an antiquated idea, _reading_ and reviewing a -rc patch, but 
still...

	Jeff




^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Default cache_hot_time value back to 10ms
  2004-10-06 17:18   ` Chen, Kenneth W
@ 2004-10-06 19:55     ` Ingo Molnar
  2004-10-06 22:46     ` Peter Williams
  1 sibling, 0 replies; 52+ messages in thread
From: Ingo Molnar @ 2004-10-06 19:55 UTC (permalink / raw)
  To: Chen, Kenneth W
  Cc: 'Ingo Molnar', linux-kernel, 'Andrew Morton',
	'Nick Piggin'


* Chen, Kenneth W <kenneth.w.chen@intel.com> wrote:

>  5   ms 106.0
>  7.5 ms 110.3
> 10   ms 112.5
> 12.5 ms 112.0
> 15   ms 111.6

ok, great. 9ms and 11ms would still be interesting. My guess would be
that the maximum is at 9. (albeit the numbers, when plotted, indicate
that the measurement might be a bit noisy.)

	Ingo

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: new dev model (was Re: Default cache_hot_time value back to 10ms)
  2004-10-06 19:48                     ` Jeff Garzik
@ 2004-10-06 19:58                       ` Jeff Garzik
  2004-10-06 20:37                         ` Geert Uytterhoeven
  2004-10-07  0:02                       ` Matt Mackall
  1 sibling, 1 reply; 52+ messages in thread
From: Jeff Garzik @ 2004-10-06 19:58 UTC (permalink / raw)
  To: Andrew Morton, mingo; +Cc: nickpiggin, kenneth.w.chen, linux-kernel, judith

Jeff Garzik wrote:
> 
> So my own suggestions for increasing 2.6.x stability are:

And one more, that I meant to include in the last email,

3) Release early, release often (official -rc releases, not just snapshots)


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch] sched: auto-tuning task-migration
  2004-10-06 17:49   ` Chen, Kenneth W
@ 2004-10-06 20:04     ` Ingo Molnar
  2004-10-06 21:18       ` Chen, Kenneth W
  0 siblings, 1 reply; 52+ messages in thread
From: Ingo Molnar @ 2004-10-06 20:04 UTC (permalink / raw)
  To: Chen, Kenneth W
  Cc: linux-kernel, 'Andrew Morton', 'Nick Piggin'


* Chen, Kenneth W <kenneth.w.chen@intel.com> wrote:

> Ran it on a similar system.  Below is the output.  Haven't tried to
> get a real benchmark run with 42 ms cache_hot_time.  I don't think it
> will get peak throughput as we already start tapering off at 12.5 ms.

> arch cache_decay_nsec: 10000000
> migration cost matrix (cache_size: 9437184, cpu: 1500 MHz):
>         [00]  [01]  [02]  [03]
> [00]:   50.2  42.8  42.9  42.8
> [01]:   42.9  50.2  42.1  42.9
> [02]:   42.9  42.9  50.2  42.8
> [03]:   42.9  42.9  42.9  50.2
> min_delta: 44785782
> using cache_decay nsec: 44785782 (42 msec)

could you try the replacement patch below - what results does it give?

	Ingo

--- linux/kernel/sched.c.orig
+++ linux/kernel/sched.c
@@ -388,7 +388,7 @@ struct sched_domain {
 	.max_interval		= 4,			\
 	.busy_factor		= 64,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (5*1000000/2),	\
+	.cache_hot_time		= cache_decay_nsec,	\
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_BALANCE_NEWIDLE	\
@@ -410,7 +410,7 @@ struct sched_domain {
 	.max_interval		= 32,			\
 	.busy_factor		= 32,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (10*1000000),		\
+	.cache_hot_time		= cache_decay_nsec,	\
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_BALANCE_EXEC	\
@@ -4420,11 +4420,232 @@ __init static void init_sched_build_grou
 	last->next = first;
 }
 
-__init static void arch_init_sched_domains(void)
+/*
+ * Task migration cost measurement between source and target CPUs.
+ *
+ * This is done by measuring the worst-case cost. Here are the
+ * steps that are taken:
+ *
+ * 1) the source CPU dirties its L2 cache with a shared buffer
+ * 2) the target CPU dirties its L2 cache with a local buffer
+ * 3) the target CPU dirties the shared buffer
+ *
+ * We measure the time step #3 takes - this is the cost of migrating
+ * a cache-hot task that has a large, dirty dataset in the L2 cache,
+ * to another CPU.
+ */
+
+
+/*
+ * Dirty a big buffer in a hard-to-predict (for the L2 cache) way. This
+ * is the operation that is timed, so we try to generate unpredictable
+ * cachemisses that still end up filling the L2 cache:
+ */
+__init static void fill_cache(void *__cache, unsigned long __size)
 {
+	unsigned long size = __size/sizeof(long);
+	unsigned long *cache = __cache;
+	unsigned long data = 0xdeadbeef;
 	int i;
+
+	for (i = 0; i < size/4; i++) {
+		if ((i & 3) == 0)
+			cache[i] = data;
+		if ((i & 3) == 1)
+			cache[size-1-i] = data;
+		if ((i & 3) == 2)
+			cache[size/2-i] = data;
+		if ((i & 3) == 3)
+			cache[size/2+i] = data;
+	}
+}
+
+struct flush_data {
+	unsigned long source, target;
+	void (*fn)(void *, unsigned long);
+	void *cache;
+	void *local_cache;
+	unsigned long size;
+	unsigned long long delta;
+};
+
+/*
+ * Dirty L2 on the source CPU:
+ */
+__init static void source_handler(void *__data)
+{
+	struct flush_data *data = __data;
+
+	if (smp_processor_id() != data->source)
+		return;
+
+	memset(data->cache, 0, data->size);
+}
+
+/*
+ * Dirty the L2 cache on this CPU and then access the shared
+ * buffer. (which represents the working set of the migrated task.)
+ */
+__init static void target_handler(void *__data)
+{
+	struct flush_data *data = __data;
+	unsigned long long t0, t1;
+	unsigned long flags;
+
+	if (smp_processor_id() != data->target)
+		return;
+
+	memset(data->local_cache, 0, data->size);
+	local_irq_save(flags);
+	t0 = sched_clock();
+	fill_cache(data->cache, data->size);
+	t1 = sched_clock();
+	local_irq_restore(flags);
+
+	data->delta = t1 - t0;
+}
+
+/*
+ * Measure the cache-cost of one task migration:
+ */
+__init static unsigned long long measure_one(void *cache, unsigned long size,
+					     int source, int target)
+{
+	struct flush_data data;
+	unsigned long flags;
+	void *local_cache;
+
+	local_cache = vmalloc(size);
+	if (!local_cache) {
+		printk("couldnt allocate local cache ...\n");
+		return 0;
+	}
+	memset(local_cache, 0, size);
+
+	local_irq_save(flags);
+	local_irq_enable();
+
+	data.source = source;
+	data.target = target;
+	data.size = size;
+	data.cache = cache;
+	data.local_cache = local_cache;
+
+	if (on_each_cpu(source_handler, &data, 1, 1) != 0) {
+		printk("measure_one: timed out waiting for other CPUs\n");
+		local_irq_restore(flags);
+		return -1;
+	}
+	if (on_each_cpu(target_handler, &data, 1, 1) != 0) {
+		printk("measure_one: timed out waiting for other CPUs\n");
+		local_irq_restore(flags);
+		return -1;
+	}
+
+	vfree(local_cache);
+
+	return data.delta;
+}
+
+__initdata unsigned long sched_cache_size;
+
+/*
+ * Measure a series of task migrations and return the maximum
+ * result - the worst-case. Since this code runs early during
+ * bootup the system is 'undisturbed' and the maximum latency
+ * makes sense.
+ *
+ * As the working set we use 2.1 times the L2 cache size, this is
+ * chosen in such a nonsymmetric way so that fill_cache() doesnt
+ * iterate at power-of-2 boundaries (which might hit cache mapping
+ * artifacts and pessimise the results).
+ */
+__init static unsigned long long measure_cacheflush_time(int cpu1, int cpu2)
+{
+	unsigned long size = sched_cache_size*21/10;
+	unsigned long long delta, max = 0;
+	void *cache;
+	int i;
+
+	if (!size) {
+		printk("arch has not set cachesize - using default.\n");
+		return 0;
+	}
+	if (!cpu_online(cpu1) || !cpu_online(cpu2)) {
+		printk("cpu %d and %d not both online!\n", cpu1, cpu2);
+		return 0;
+	}
+	cache = vmalloc(size);
+	if (!cache) {
+		printk("could not vmalloc %ld bytes for cache!\n", size);
+		return 0;
+	}
+	memset(cache, 0, size);
+	for (i = 0; i < 20; i++) {
+		delta = measure_one(cache, size, cpu1, cpu2);
+		if (delta > max)
+			max = delta;
+	}
+
+	vfree(cache);
+
+	/*
+	 * A task is considered 'cache cold' if at least 2 times
+	 * the worst-case cost of migration has passed.
+	 * (this limit is only listened to if the load-balancing
+	 * situation is 'nice' - if there is a large imbalance we
+	 * ignore it for the sake of CPU utilization and
+	 * processing fairness.)
+	 *
+	 * (We use 2.1 times the L2 cachesize in our measurement,
+	 *  we keep this factor when returning.)
+	 */
+	return max;
+}
+
+__initdata static unsigned long long cache_decay_nsec;
+
+__init static void arch_init_sched_domains(void)
+{
+	int i, cpu1 = -1, cpu2 = -1;
+	unsigned long long min_delta = -1ULL;
+
 	cpumask_t cpu_default_map;
 
+	printk("arch cache_decay_nsec: %ld\n", cache_decay_ticks*1000000);
+	printk("migration cost matrix (cache_size: %ld, cpu: %ld MHz):\n",
+		sched_cache_size, cpu_khz/1000);
+	printk("      ");
+	for (cpu1 = 0; cpu1 < NR_CPUS; cpu1++) {
+		if (!cpu_online(cpu1))
+			continue;
+		printk("  [%02d]", cpu1);
+	}
+	printk("\n");
+	for (cpu1 = 0; cpu1 < NR_CPUS; cpu1++) {
+		if (!cpu_online(cpu1))
+			continue;
+		printk("[%02d]: ", cpu1);
+		for (cpu2 = 0; cpu2 < NR_CPUS; cpu2++) {
+			unsigned long long delta;
+
+			if (!cpu_online(cpu2))
+				continue;
+			delta = measure_cacheflush_time(cpu1, cpu2);
+			
+			printk(" %3Ld.%ld", delta >> 20,
+				(((long)delta >> 10) / 102) % 10);
+			if ((cpu1 != cpu2) && (delta < min_delta))
+				min_delta = delta;
+		}
+		printk("\n");
+	}
+	printk("min_delta: %Ld\n", min_delta);
+	if (min_delta != -1ULL)
+		cache_decay_nsec = min_delta;
+	printk("using cache_decay nsec: %Ld (%Ld msec)\n",
+		cache_decay_nsec, cache_decay_nsec >> 20);
+
 	/*
 	 * Setup mask for cpus without special case scheduling requirements.
 	 * For now this just excludes isolated cpus, but could be used to
--- linux/arch/i386/kernel/smpboot.c.orig
+++ linux/arch/i386/kernel/smpboot.c
@@ -849,6 +849,8 @@ static int __init do_boot_cpu(int apicid
 cycles_t cacheflush_time;
 unsigned long cache_decay_ticks;
 
+extern unsigned long sched_cache_size;
+
 static void smp_tune_scheduling (void)
 {
 	unsigned long cachesize;       /* kB   */
@@ -879,6 +881,7 @@ static void smp_tune_scheduling (void)
 		}
 
 		cacheflush_time = (cpu_khz>>10) * (cachesize<<10) / bandwidth;
+		sched_cache_size = cachesize * 1024;
 	}
 
 	cache_decay_ticks = (long)cacheflush_time/cpu_khz + 1;

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: new dev model (was Re: Default cache_hot_time value back to 10ms)
  2004-10-06 19:58                       ` Jeff Garzik
@ 2004-10-06 20:37                         ` Geert Uytterhoeven
  2004-10-07  1:08                           ` Jeff Garzik
  0 siblings, 1 reply; 52+ messages in thread
From: Geert Uytterhoeven @ 2004-10-06 20:37 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Andrew Morton, mingo, nickpiggin, kenneth.w.chen,
	Linux Kernel Development, judith

On Wed, 6 Oct 2004, Jeff Garzik wrote:
> Jeff Garzik wrote:
> > So my own suggestions for increasing 2.6.x stability are:
> 
> And one more, that I meant to include in the last email,
> 
> 3) Release early, release often (official -rc releases, not just snapshots)

I guess you mean official -pre releases as well?

Gr{oetje,eeting}s,

						Geert

P.S. I only track `real' (-pre and -rc) releases. I don't have the manpower
     (what's in a word) to track daily snapshots (I do `read' bk-commits). If
     m68k stuff gets broken in -rc, usually it means it won't get fixed before
     2 full releases later.  Anyway, things shouldn't become broken in -rc,
     IMHO that's what we (should) have -pre for...
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
							    -- Linus Torvalds

^ permalink raw reply	[flat|nested] 52+ messages in thread

* RE: Default cache_hot_time value back to 10ms
  2004-10-06 19:39         ` Andrew Morton
@ 2004-10-06 20:38           ` Chen, Kenneth W
  2004-10-06 20:43             ` Andrew Morton
  2004-10-06 20:50             ` Ingo Molnar
  0 siblings, 2 replies; 52+ messages in thread
From: Chen, Kenneth W @ 2004-10-06 20:38 UTC (permalink / raw)
  To: 'Andrew Morton'; +Cc: nickpiggin, mingo, linux-kernel

Andrew Morton wrote on Wednesday, October 06, 2004 12:40 PM
> "Chen, Kenneth W" <kenneth.w.chen@intel.com> wrote:
> >  Andrew, can I safely interpret this response as you are OK with having
> >  cache_hot_time set to 10 ms for now?
>
> I have a lot of scheduler changes queued up and I view this change as being
> not very high priority.  If someone sends a patch to update -mm then we can
> run with that, however Ingo's auto-tuning seems a far preferable approach.
>
> >  And you will merge this change for 2.6.9?
>
> I was not planning on doing so, but could be persuaded, I guess.
>
> It's very, very late for this and subtle CPU scheduler regressions tend to
> take a long time (weeks or months) to be identified.


Let me try to persuade ;-).  First, it is hard to accept the fact that we are
leaving 11% of performance on the table just due to a poorly chosen parameter.
This much of a percentage difference on a db workload is a huge deal.  It
basically and unfairly handicaps the 2.6 kernel behind the competition, and
even handicaps us compared to the 2.4 kernel.  We have established from
various workloads, from db to Java, that 10 ms works the best.  What more data
can we provide to swing you in that direction?

Secondly, let me ask the question again from the first mail thread:  this value
*WAS* 10 ms for a long time, before the domain scheduler.  What's so special
about the domain scheduler that all of a sudden this parameter got changed to
2.5 ms?  I'd like to see some justification/prior measurement for such a change
when the domain scheduler kicks in.

- Ken



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Default cache_hot_time value back to 10ms
  2004-10-06 20:38           ` Chen, Kenneth W
@ 2004-10-06 20:43             ` Andrew Morton
  2004-10-06 23:14               ` Chen, Kenneth W
  2004-10-06 20:50             ` Ingo Molnar
  1 sibling, 1 reply; 52+ messages in thread
From: Andrew Morton @ 2004-10-06 20:43 UTC (permalink / raw)
  To: Chen, Kenneth W; +Cc: nickpiggin, mingo, linux-kernel

"Chen, Kenneth W" <kenneth.w.chen@intel.com> wrote:
>
>  Secondly, let me ask the question again from the first mail thread:  this value
>  *WAS* 10 ms for a long time, before the domain scheduler.  What's so special
>  about the domain scheduler that all of a sudden this parameter got changed to 2.5?

So why on earth was it switched from 10 to 2.5 in the first place?

Please resend the final patch.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* RE: Default cache_hot_time value back to 10ms
  2004-10-06 20:38           ` Chen, Kenneth W
  2004-10-06 20:43             ` Andrew Morton
@ 2004-10-06 20:50             ` Ingo Molnar
  2004-10-06 21:03               ` Chen, Kenneth W
  1 sibling, 1 reply; 52+ messages in thread
From: Ingo Molnar @ 2004-10-06 20:50 UTC (permalink / raw)
  To: Chen, Kenneth W; +Cc: 'Andrew Morton', nickpiggin, linux-kernel


On Wed, 6 Oct 2004, Chen, Kenneth W wrote:

> Let me try to persuade ;-).  First, it is hard to accept the fact that we
> are leaving 11% of performance on the table just due to a poorly chosen
> parameter. This much of a percentage difference on a db workload is a huge
> deal.  It basically and unfairly handicaps the 2.6 kernel behind the
> competition, and even handicaps us compared to the 2.4 kernel.  We have
> established from various workloads, from db to Java, that 10 ms works the
> best.  What more data can we provide to swing you in that direction?

the problem is that 10 msec might be fine for a 9MB L2 cache CPU running a
DB benchmark, but it will surely be too much of a migration cutoff for other
boxes. And too much of a migration cutoff means increased idle time -
resulting in CPU under-utilization and worse performance.

so i'd prefer not to touch it for 2.6.9 (consider that tree closed from a
scheduler POV), and we can do the auto-tuning in 2.6.10 just fine. It will
need the same weeks-long test cycle that all scheduler balancing patches
need. There are so many different types of workloads ...

> Secondly, let me ask the question again from the first mail thread:
> this value *WAS* 10 ms for a long time, before the domain scheduler.
> What's so special about the domain scheduler that all of a sudden this
> parameter got changed to 2.5? I'd like to see some justification/prior
> measurement for such a change when the domain scheduler kicks in.

iirc it was tweaked as a result of the other bug that you fixed. But high
sensitivity to this tunable was never truly established, and a 9 MB L2
cache CPU is certainly not typical - and it is certainly the one that
hurts most from migration effects.

anyway, we were running based on cache_decay_ticks for a long time - is
that what was 10 msec on your box? The cache_decay_ticks calculation was
pretty fine too, it scaled up with cachesize.

	Ingo

^ permalink raw reply	[flat|nested] 52+ messages in thread

* RE: Default cache_hot_time value back to 10ms
  2004-10-06 20:50             ` Ingo Molnar
@ 2004-10-06 21:03               ` Chen, Kenneth W
  0 siblings, 0 replies; 52+ messages in thread
From: Chen, Kenneth W @ 2004-10-06 21:03 UTC (permalink / raw)
  To: 'Ingo Molnar'; +Cc: 'Andrew Morton', nickpiggin, linux-kernel

Ingo Molnar wrote on Wednesday, October 06, 2004 1:51 PM
> On Wed, 6 Oct 2004, Chen, Kenneth W wrote:
> > Let me try to persuade ;-).  First, it is hard to accept the fact that we
> > are leaving 11% of performance on the table just due to a poorly chosen
> > parameter. This much of a percentage difference on a db workload is a huge
> > deal.  It basically and unfairly handicaps the 2.6 kernel behind the
> > competition, and even handicaps us compared to the 2.4 kernel.  We have
> > established from various workloads, from db to Java, that 10 ms works the
> > best.  What more data can we provide to swing you in that direction?
>
> the problem is that 10 msec might be fine for a 9MB L2 cache CPU running a
> DB benchmark, but it will surely be too much of a migration cutoff for other
> boxes. And too much of a migration cutoff means increased idle time -
> resulting in CPU under-utilization and worse performance.
>
> so i'd prefer not to touch it for 2.6.9 (consider that tree closed from a
> scheduler POV), and we can do the auto-tuning in 2.6.10 just fine. It will
> need the same weeks-long test cycle that all scheduler balancing patches
> need. There are so many different types of workloads ...

I would argue that the testing should be the other way around: having people
argue/provide data why 2.5 is better than 10.  Is there any prior measurement
or mailing list posting out there?


> > Secondly, let me ask the question again from the first mail thread:
> > this value *WAS* 10 ms for a long time, before the domain scheduler.
> > What's so special about the domain scheduler that all of a sudden this
> > parameter got changed to 2.5? I'd like to see some justification/prior
> > measurement for such a change when the domain scheduler kicks in.
>
> iirc it was tweaked as a result of the other bug that you fixed.

Is it possible that whatever tweaking was done before was running with the
broken load-balancing logic, and thus invalidated the 2.5 ms result?


> anyway, we were running based on cache_decay_ticks for a long time - is
> that what was 10 msec on your box? The cache_decay_ticks calculation was
> pretty fine too, it scaled up with cachesize.

Yes, cache_decay_ticks is what I was referring to.  I guess I was too focused
on ia64; there it was hard-coded to 10 ms regardless of cache size.

cache_decay_ticks isn't used anywhere in 2.6.9-rc3; maybe that should be the
value feeding cache_hot_time.

- Ken



^ permalink raw reply	[flat|nested] 52+ messages in thread

* RE: [patch] sched: auto-tuning task-migration
  2004-10-06 20:04     ` Ingo Molnar
@ 2004-10-06 21:18       ` Chen, Kenneth W
  2004-10-07  6:10         ` Ingo Molnar
  0 siblings, 1 reply; 52+ messages in thread
From: Chen, Kenneth W @ 2004-10-06 21:18 UTC (permalink / raw)
  To: 'Ingo Molnar'
  Cc: linux-kernel, 'Andrew Morton', 'Nick Piggin'

Ingo Molnar wrote on Wednesday, October 06, 2004 1:05 PM
> could you try the replacement patch below - what results does it give?

By the way, I wonder why you chose to round down rather than up.


arch cache_decay_nsec: 10000000
migration cost matrix (cache_size: 9437184, cpu: 1500 MHz):
        [00]  [01]  [02]  [03]
[00]:    9.1   8.5   8.5   8.5
[01]:    8.5   9.1   8.5   8.5
[02]:    8.5   8.5   9.1   8.5
[03]:    8.5   8.5   8.5   9.1
min_delta: 8909202
using cache_decay nsec: 8909202 (8 msec)



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: new dev model (was Re: Default cache_hot_time value back to 10ms)
  2004-10-06 19:33                   ` Jeff Garzik
@ 2004-10-06 22:23                     ` Martin J. Bligh
  0 siblings, 0 replies; 52+ messages in thread
From: Martin J. Bligh @ 2004-10-06 22:23 UTC (permalink / raw)
  To: Jeff Garzik, Ingo Molnar
  Cc: Andrew Morton, Nick Piggin, kenneth.w.chen, linux-kernel, judith

>> it is also correct that the pile of patches in the -mm tree mask the QA
>> effects of testing done on -mm, so testing -BK separately is just as
>> important at this stage.
> 
> The simple fact is that -mm doesn't receive _nearly_ the amount of testing that a 2.6.x -BK snapshot does, which in turn doesn't receive _nearly_ the amount of testing that a 2.6.x-rc release gets.

Not sure that's true. Personally I test all -mm releases, and not -bk 
snapshots ... I've heard similar from other people.

If everyone pushed their stuff through -mm, and it sat there for a few
days before going upstream, we'd get a much better opportunity to test.

M.



* Re: Default cache_hot_time value back to 10ms
  2004-10-06 17:18   ` Chen, Kenneth W
  2004-10-06 19:55     ` Ingo Molnar
@ 2004-10-06 22:46     ` Peter Williams
  1 sibling, 0 replies; 52+ messages in thread
From: Peter Williams @ 2004-10-06 22:46 UTC (permalink / raw)
  To: Chen, Kenneth W
  Cc: 'Ingo Molnar', 'Ingo Molnar',
	linux-kernel, 'Andrew Morton', 'Nick Piggin'

Chen, Kenneth W wrote:
>>Chen, Kenneth W wrote on Tuesday, October 05, 2004 10:31 AM
>>
>>>We have experimented with similar thing, via bumping up sd->cache_hot_time to
>>>a very large number, like 1 sec.  What we measured was a equally low throughput.
>>>But that was because of not enough load balancing.
>>
>>Since we are talking about load balancing, we decided to measure various
>>value for cache_hot_time variable to see how it affects app performance. We
>>first establish baseline number with vanilla base kernel (default at 2.5ms),
>>then sweep that variable up to 1000ms.  All of the experiments are done with
>>Ingo's patch posted earlier.  Here are the result (test environment is 4-way
>>SMP machine, 32 GB memory, 500 disks running industry standard db transaction
>>processing workload):
>>
>>cache_hot_time  | workload throughput
>>--------------------------------------
>>         2.5ms  - 100.0   (0% idle)
>>         5ms    - 106.0   (0% idle)
>>         10ms   - 112.5   (1% idle)
>>         15ms   - 111.6   (3% idle)
>>         25ms   - 111.1   (5% idle)
>>         250ms  - 105.6   (7% idle)
>>         1000ms - 105.4   (7% idle)
>>
>>Clearly the default value for SMP has the worst application throughput (12%
>>below peak performance).  When set too low, kernel is too aggressive on load
>>balancing and we are still seeing cache thrashing despite the perf fix.
>>However, If set too high, kernel gets too conservative and not doing enough
>>load balance.
> 
> 
> Ingo Molnar wrote on Wednesday, October 06, 2004 12:48 AM
> 
>>could you please try the test in 1 msec increments around 10 msec? It
>>would be very nice to find a good formula and the 5 msec steps are too
>>coarse. I think it would be nice to test 7,9,11,13 msecs first, and then
>>the remaining 1 msec slots around the new maximum. (assuming the
>>workload measurement is stable.)
> 
> 
> I should've posted the whole thing yesterday; we had measurements at 7.5 and
> 12.5 ms.  Here is the result (repeating 5, 10, 15 for easy reading).
> 
>  5   ms 106.0
>  7.5 ms 110.3
> 10   ms 112.5
> 12.5 ms 112.0
> 15   ms 111.6
> 
> 
> 
>>>This value was default to 10ms before domain scheduler, why does domain
>>>scheduler need to change it to 2.5ms? And on what bases does that decision
>>>take place?  We are proposing change that number back to 10ms.
>>
>>agreed. What value does cache_decay_ticks have on your box?
> 
> 
> 
> I see all the fancy calculation with cache_decay_ticks on x86, but nobody
> actually uses it in the domain scheduler.  Anyway, my box has that value
> hard coded to 10ms (ia64).
> 

If you fit a quadratic equation to this data, take the first derivative 
and then solve for zero it will give the cache_hot_time that maximizes 
the throughput.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* RE: Default cache_hot_time value back to 10ms
  2004-10-06 20:43             ` Andrew Morton
@ 2004-10-06 23:14               ` Chen, Kenneth W
  2004-10-07  2:26                 ` Nick Piggin
  2004-10-07  6:29                 ` Ingo Molnar
  0 siblings, 2 replies; 52+ messages in thread
From: Chen, Kenneth W @ 2004-10-06 23:14 UTC (permalink / raw)
  To: 'Andrew Morton'; +Cc: nickpiggin, mingo, linux-kernel

Andrew Morton wrote on Wednesday, October 06, 2004 1:43 PM
> "Chen, Kenneth W" <kenneth.w.chen@intel.com> wrote:
> >
> >  Secondly, let me ask the question again from the first mail thread:  this value
> >  *WAS* 10 ms for a long time, before the domain scheduler.  What's so special
> >  about domain scheduler that all the sudden this parameter get changed to 2.5?
>
> So why on earth was it switched from 10 to 2.5 in the first place?
>
> Please resend the final patch.


Here is a patch that reverts the default cache_hot_time value back to its
pre-domain-scheduler equivalent, which determined a task's cache affinity
via the architecture-defined variable cache_decay_ticks.

This is purely a request that we go back to what *was* before, *NOT* a new
scheduler tweak (whatever tweak was done for the domain scheduler broke a
traditional, industry-recognized workload).

As a side note, I'd like to get involved in future scheduler tuning experiments;
we have a fair number of benchmark environments where we can validate things
across various kinds of workloads, i.e., db, java, cpu, etc.  Thanks.

Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>

patch against 2.6.9-rc3:

--- linux-2.6.9-rc3/kernel/sched.c.orig	2004-10-06 15:10:56.000000000 -0700
+++ linux-2.6.9-rc3/kernel/sched.c	2004-10-06 15:18:51.000000000 -0700
@@ -387,7 +387,7 @@ struct sched_domain {
 	.max_interval		= 4,			\
 	.busy_factor		= 64,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (5*1000000/2),	\
+	.cache_hot_time		= cache_decay_ticks*1000000,\
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_BALANCE_NEWIDLE	\


patch against 2.6.9-rc3-mm2:

--- linux-2.6.9-rc3/include/linux/topology.h.orig	2004-10-06 15:32:48.000000000 -0700
+++ linux-2.6.9-rc3/include/linux/topology.h	2004-10-06 15:33:25.000000000 -0700
@@ -113,7 +113,7 @@ static inline int __next_node_with_cpus(
 	.max_interval		= 4,			\
 	.busy_factor		= 64,			\
 	.imbalance_pct		= 125,			\
-	.cache_hot_time		= (5*1000/2),		\
+	.cache_hot_time		= (cache_decay_ticks*1000),\
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_LOAD_BALANCE	\





* Re: new dev model (was Re: Default cache_hot_time value back to 10ms)
  2004-10-06 19:48                     ` Jeff Garzik
  2004-10-06 19:58                       ` Jeff Garzik
@ 2004-10-07  0:02                       ` Matt Mackall
  1 sibling, 0 replies; 52+ messages in thread
From: Matt Mackall @ 2004-10-07  0:02 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Andrew Morton, mingo, nickpiggin, kenneth.w.chen, linux-kernel, judith

On Wed, Oct 06, 2004 at 03:48:01PM -0400, Jeff Garzik wrote:
> 
> So my own suggestions for increasing 2.6.x stability are:
> 
> 1) Create a release numbering system that is __clear to users__, not 
> just developers.  This is a human, not technical problem.  Telling users 
> "oh, -rc1 doesn't really mean Release Candidate, we start getting 
> serious around -rc2 or so but some stuff slips in and..." is hardly clear.

The 2.4 system Marcelo used did this nicely: a couple of -preX releases to
shove in new stuff, then a couple of -rcX to iron out the bugs. 2.6.x-rc[12]
seem to be similar in content to 2.4.x-pre - little expectation that
they're actually candidates for release.
 
> 2) Really (underscore underscore) only accept bugfixes after the chosen 
> line of demarcation.  No API changes.  No new stuff (new stuff may not 
> break anything, but it's a distraction).  Chill out on all the sparse 
> notations.  _Just_ _bug_ _fixes_.  The fluff (comments/sparse/new 
> features) just serves to make reviewing the changes more difficult, as 
> it vastly increases the noise-to-signal ratio.

Also, please simply rename the last -rcX for the release as Marcelo
does with 2.4. Slipping in new stuff between the candidate and the
release invalidates the testing done on the candidate so someone can't
look at 2.6.9 and say "this looks solid from 2 weeks as a release
candidate, I can run with it today".

-- 
Mathematics is the supreme nostalgia of our time.


* Re: new dev model (was Re: Default cache_hot_time value back to 10ms)
  2004-10-06 20:37                         ` Geert Uytterhoeven
@ 2004-10-07  1:08                           ` Jeff Garzik
  0 siblings, 0 replies; 52+ messages in thread
From: Jeff Garzik @ 2004-10-07  1:08 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Andrew Morton, mingo, nickpiggin, kenneth.w.chen,
	Linux Kernel Development, judith

Geert Uytterhoeven wrote:
> P.S. I only track `real' (-pre and -rc) releases. I don't have the manpower
>      (what's in a word) to track daily snapshots (I do `read' bk-commits). If
>      m68k stuff gets broken in -rc, usually it means it won't get fixed before
>      2 full releases later.  Anyway, things shouldn't become broken in -rc,
>      IMHO that's what we (should) have -pre for...


I agree completely, but -pre is apparently a dirty word (dirty suffix?:))

	Jeff




* Re: Default cache_hot_time value back to 10ms
  2004-10-06 23:14               ` Chen, Kenneth W
@ 2004-10-07  2:26                 ` Nick Piggin
  2004-10-07  6:29                 ` Ingo Molnar
  1 sibling, 0 replies; 52+ messages in thread
From: Nick Piggin @ 2004-10-07  2:26 UTC (permalink / raw)
  To: Chen, Kenneth W; +Cc: 'Andrew Morton', mingo, linux-kernel

Chen, Kenneth W wrote:
> Andrew Morton wrote on Wednesday, October 06, 2004 1:43 PM
> 
>>"Chen, Kenneth W" <kenneth.w.chen@intel.com> wrote:
>>
>>> Secondly, let me ask the question again from the first mail thread:  this value
>>> *WAS* 10 ms for a long time, before the domain scheduler.  What's so special
>>> about domain scheduler that all the sudden this parameter get changed to 2.5?
>>
>>So why on earth was it switched from 10 to 2.5 in the first place?
>>
>>Please resend the final patch.
> 
> 
> 
> Here is a patch that reverts the default cache_hot_time value back to its
> pre-domain-scheduler equivalent, which determined a task's cache affinity
> via the architecture-defined variable cache_decay_ticks.
> 
> This is purely a request that we go back to what *was* before, *NOT* a new
> scheduler tweak (whatever tweak was done for the domain scheduler broke a
> traditional, industry-recognized workload).
> 

OK... Well, Andrew, as I said, I'd be happy for this to go in. I'd be *extra*
happy if Judith ran a few of those dbt thingy tests which have been sensitive
to idle time. Can you ask her about that, or should I?

> As a side note, I'd like to get involved on future scheduler tuning experiments,
> we have fair amount of benchmark environments where we can validate things across
> various kind of workload, i.e., db, java, cpu, etc.  Thanks.
> 

That would be very welcome indeed. We have a big backlog of scheduler things
to go in after 2.6.9 is released (although not many of them change the runtime
behaviour IIRC). After that, I have some experimental performance work that
could use wider testing. After *that*, the multiprocessor scheduler will be in a
state where 2.6 shouldn't need much more work, so we can concentrate on just
tuning the dials.


* Re: [patch] sched: auto-tuning task-migration
  2004-10-06 21:18       ` Chen, Kenneth W
@ 2004-10-07  6:10         ` Ingo Molnar
  0 siblings, 0 replies; 52+ messages in thread
From: Ingo Molnar @ 2004-10-07  6:10 UTC (permalink / raw)
  To: Chen, Kenneth W
  Cc: linux-kernel, 'Andrew Morton', 'Nick Piggin'


* Chen, Kenneth W <kenneth.w.chen@intel.com> wrote:

> > could you try the replacement patch below - what results does it give?
> 
> By the way, I wonder why you chose to round down, but not up.

what do you mean - the minimum search within the matrix?

	Ingo


* Re: Default cache_hot_time value back to 10ms
  2004-10-06 23:14               ` Chen, Kenneth W
  2004-10-07  2:26                 ` Nick Piggin
@ 2004-10-07  6:29                 ` Ingo Molnar
  2004-10-07  7:08                   ` Jeff Garzik
  1 sibling, 1 reply; 52+ messages in thread
From: Ingo Molnar @ 2004-10-07  6:29 UTC (permalink / raw)
  To: Chen, Kenneth W; +Cc: 'Andrew Morton', nickpiggin, linux-kernel


* Chen, Kenneth W <kenneth.w.chen@intel.com> wrote:

> Here is a patch that reverts the default cache_hot_time value back to its
> pre-domain-scheduler equivalent, which determined a task's cache
> affinity via the architecture-defined variable cache_decay_ticks.

i could agree with this one-liner patch for 2.6.9; it only affects the
SMP balancer, and for the most common boxes it likely results in a
migration cutoff similar to the 2.5 msec we currently have. Here are the
changes that occur on a couple of x86 boxes:

 2-way celeron, 128K cache:         2.5 msec -> 1.0 msec 
 2-way/4-way P4 Xeon 1MB cache:     2.5 msec -> 2.0 msec
 8-way P3 Xeon 2MB cache:           2.5 msec -> 6.0 msec

each of these changes is sane and not too drastic.

(on ia64 there is no auto-tuning of cache_decay_ticks, there you've got
a decay=<x> boot parameter anyway, to fix things up.)

there was one particular DB test that was quite sensitive to idle time
introduced by too large migration cutoff: dbt2-pgsql. Could you try that
one too and compare -rc3 performance to -rc3+migration-patches?

> This is a mere request that we go back to what *was* before, *NOT* as
> a new scheduler tweak (Whatever tweak done for domain scheduler broke
> traditional/ industry recognized workload).
> 
> As a side note, I'd like to get involved on future scheduler tuning
> experiments, we have fair amount of benchmark environments where we
> can validate things across various kind of workload, i.e., db, java,
> cpu, etc.  Thanks.

yeah, it would be nice to test the following 3 kernels:

 2.6.9-rc3 vanilla,
 2.6.9-rc3 + cache_hot_fix + use-cache_decay_ticks
 2.6.9-rc3 + cache_hot_fixes + autotune-patch

using as many different CPU types (and # of CPUs) as possible.

The most important factor in these measurements is the statistical stability
of the result - if the noise is too high then it's hard to judge. (The
numbers you posted in previous mails are quite stable, but not all
benchmarks are like that.)

> Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>

Signed-off-by: Ingo Molnar <mingo@elte.hu>

	Ingo


* Re: Default cache_hot_time value back to 10ms
  2004-10-07  6:29                 ` Ingo Molnar
@ 2004-10-07  7:08                   ` Jeff Garzik
  2004-10-07  7:26                     ` Ingo Molnar
  0 siblings, 1 reply; 52+ messages in thread
From: Jeff Garzik @ 2004-10-07  7:08 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Chen, Kenneth W, 'Andrew Morton', nickpiggin, linux-kernel

Ingo Molnar wrote:
> * Chen, Kenneth W <kenneth.w.chen@intel.com> wrote:
> >>Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
> 
> 
> Signed-off-by: Ingo Molnar <mingo@elte.hu>


[tangent]  FWIW Andrew has recently been using "Acked-by" as well, 
presumably for patches created by person X but reviewed by a wholly 
independent person Y (since Signed-off-by indicates you have some amount 
of legal standing to actually sign off on the patch)

	Jeff




* Re: Default cache_hot_time value back to 10ms
  2004-10-07  7:08                   ` Jeff Garzik
@ 2004-10-07  7:26                     ` Ingo Molnar
  0 siblings, 0 replies; 52+ messages in thread
From: Ingo Molnar @ 2004-10-07  7:26 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Chen, Kenneth W, 'Andrew Morton', nickpiggin, linux-kernel


* Jeff Garzik <jgarzik@pobox.com> wrote:

> Ingo Molnar wrote:
> >* Chen, Kenneth W <kenneth.w.chen@intel.com> wrote:
> >>>Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
> >
> >
> >Signed-off-by: Ingo Molnar <mingo@elte.hu>
> 
> 
> [tangent] FWIW Andrew has recently been using "Acked-by" as well,
> presumably for patches created by person X but reviewed by a wholly
> independent person Y (since Signed-off-by indicates you have some
> amount of legal standing to actually sign off on the patch)

[even more tangential] even if this weren't a one-liner, i might have some
amount of legal standing: i wrote the original cache_decay_ticks code
that this patch reverts to ;)  But yeah, Acked-by would be more
informative here.

	Ingo


* Re: [patch] sched: auto-tuning task-migration
  2004-10-06 13:29 ` [patch] sched: auto-tuning task-migration Ingo Molnar
  2004-10-06 13:44   ` Nick Piggin
  2004-10-06 17:49   ` Chen, Kenneth W
@ 2005-02-21  5:08   ` Paul Jackson
  2 siblings, 0 replies; 52+ messages in thread
From: Paul Jackson @ 2005-02-21  5:08 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: kenneth.w.chen, linux-kernel, akpm, nickpiggin

A long long time ago (Oct 2004) Ingo wrote:
> the following patch adds a new feature to the scheduler: during bootup
> it measures migration costs and sets up cache_hot value accordingly.

Ingo - what became of this patch?

I made a quick search for it in Linus's bk tree and Andrew's *-mm
patches, but didn't find it.  Perhaps I didn't know what to look for.

The metric it exposes looks like something I might want to expose to
userland, so the performance guys can begin optimizing the placement of
tasks on CPUs, depending on whether they would benefit from, or be
harmed by, sharing cache.  Would the two halves of a hyper threaded core
show up as particularly close on this metric?  I presume so.

It seems to me to be a good complement to the current cpu-memory
distance we have now in node_distance() exposing the ACPI 2.0 SLIT
table distances.

I view the two key numbers here as (1) how fast can a cpu get stuff out
of a memory node (an amalgam of bandwidth and latency), and (2) how much
cache and buses and such do two cpus share (which can be good or bad,
depending on whether the two tasks on those two cpus share much of their
cache footprint).

The SLIT table provides (1) just fine.  Your patch seems to compute a
sensible estimate of (2).

I had one worry - was there a potential O(N**2) cost in computing this
at boottime, where N is the number of nodes?  Us SGI folks are usually
amongst the first to notice such details, when they blow up on us ;).

I never actually saw the original patch -- perhaps if I had, some of
my questions above would have obvious answers.

Thanks.  (and thanks for cpus_allowed -- I've practically made a
profession out of building stuff on top of that one ... ;).

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.650.933.1373, 1.925.600.0401


* Re: [patch] sched: auto-tuning task-migration
       [not found] <200411110851.30819.habanero@us.ibm.com>
@ 2004-11-11 15:04 ` Andrew Theurer
  0 siblings, 0 replies; 52+ messages in thread
From: Andrew Theurer @ 2004-11-11 15:04 UTC (permalink / raw)
  To: linux-kernel; +Cc: kenneth.w.chen, Nick Piggin, mingo

> Ingo Molnar wrote on Wednesday, October 06, 2004 1:05 PM
>
> > could you try the replacement patch below - what results does it give?
>
> By the way, I wonder why you chose to round down, but not up.
>
>
> arch cache_decay_nsec: 10000000
> migration cost matrix (cache_size: 9437184, cpu: 1500 MHz):
>         [00]  [01]  [02]  [03]
> [00]:    9.1   8.5   8.5   8.5
> [01]:    8.5   9.1   8.5   8.5
> [02]:    8.5   8.5   9.1   8.5
> [03]:    8.5   8.5   8.5   9.1
> min_delta: 8909202
> using cache_decay nsec: 8909202 (8 msec)

I tried this patch on power5.  This is a 2-node system: 2 chips (1 per node), 
2 cores per chip, 2 siblings per core.  Cores share an L2 & L3 cache.

Hard coding 1920KB for cache size (L2) I get:

migration cost matrix (cache_size: 1966080, cpu: 1656 MHz):
        [00]  [01]  [02]  [03]  [04]  [05]  [06]  [07]
[00]:    1.3   1.3   1.3   1.3   2.3   1.4   1.4   1.4
[01]:    1.3   1.4   1.3   1.3   1.4   1.4   1.4   1.4
[02]:    1.3   1.4   1.3   1.3   1.4   1.4   1.4   1.4
[03]:    1.4   1.4   1.4   1.3   1.4   1.4   1.4   1.4
[04]:    1.3   1.3   1.3   1.3   1.3   1.3   1.3   1.3
[05]:    1.3   1.4   1.3   1.3   1.3   1.4   1.4   1.3
[06]:    1.3   1.4   1.4   1.3   1.3   1.3   1.3   1.3
[07]:    1.3   1.3   1.3   1.3   1.3   1.4   1.3   1.3
min_delta: 1422824
using cache_decay nsec: 1422824 (1 msec)

I ran again for L3, but could not vmalloc the whole amount (the cache is 36MB).
I tried 19200KB and got:

migration cost matrix (cache_size: 19660800, cpu: 1656 MHz):
        [00]  [01]  [02]  [03]  [04]  [05]  [06]  [07]
[00]:   16.9  16.8  16.0  16.0  16.7  16.9  16.7  16.9
[01]:   16.0  17.1  16.0  16.0  16.8  16.9  16.7  16.9
[02]:   17.0  17.1  17.0  16.0  16.7  16.9  16.7  16.9
[03]:   17.0  17.1  16.0  16.0  16.7  16.9  16.8  16.9
[04]:   16.0  16.0  16.0  16.9  17.2  17.1  17.2  17.2
[05]:   16.0  16.0  16.0  16.9  17.2  17.2  17.2  17.2
[06]:   16.0  16.0  16.0  16.9  17.2  17.1  17.2  17.2
[07]:   16.0  16.0  16.0  16.9  17.2  17.1  17.2  17.1
min_delta: 17492688
using cache_decay nsec: 17492688 (16 msec)

First, I am going to assume this test is not designed to show the effects of 
shared cache.  For power5, since cores on the same chip share L2 & L3, I would 
conclude that cache_hot_time for the level 0 (siblings in a core) and level 1 
(cores in a chip) domains should probably be "0".  For level 2 domains (all 
chips in a system), I guess it needs to be somewhere above 16ms.

We had someone run that online transaction DB workload with a 10ms 
cache_hot_time on both the level 1 & 2 domains, and performance regressed.  If 
we get a chance to run again, I will probably try level 0: 0ms, level 1: 0ms, 
level 2: 10-20ms.

-Andrew Theurer


* Re: [patch] sched: auto-tuning task-migration
@ 2004-10-06 14:22 emmanuel.fuste
  0 siblings, 0 replies; 52+ messages in thread
From: emmanuel.fuste @ 2004-10-06 14:22 UTC (permalink / raw)
  To: linux-kernel


> the following patch adds a new feature to the scheduler: during bootup
> it measures migration costs and sets up cache_hot value accordingly.
>
> The measurement is point-to-point, i.e. it can be used to measure the
> migration costs in cache hierarchies - e.g. by NUMA setup code. The
> patch prints out a matrix of migration costs between CPUs.
> (self-migration means pure cache dirtying cost)

Hi Ingo,

Is your auto-tuning patch supposed to work on a shared-L2-cache
arch like my i586 SMP system?  Just curious.

Thanks.

E.F.







end of thread, other threads:[~2005-02-21  5:10 UTC | newest]

Thread overview: 52+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2004-10-06  0:42 Default cache_hot_time value back to 10ms Chen, Kenneth W
2004-10-06  0:47 ` Con Kolivas
2004-10-06  1:02   ` Nick Piggin
2004-10-06  0:58 ` Nick Piggin
2004-10-06  3:55 ` Andrew Morton
2004-10-06  4:30   ` Nick Piggin
2004-10-06  4:51     ` Andrew Morton
2004-10-06  5:00       ` Nick Piggin
2004-10-06  5:09         ` Andrew Morton
2004-10-06  5:21           ` Nick Piggin
2004-10-06  5:33             ` Andrew Morton
2004-10-06  5:46               ` Nick Piggin
2004-10-06  6:19               ` new dev model (was Re: Default cache_hot_time value back to 10ms) Jeff Garzik
2004-10-06  6:39                 ` Andrew Morton
2004-10-06  8:56                   ` Paolo Ciarrocchi
2004-10-06  9:44                   ` bert hubert
2004-10-06 14:00                     ` Andries Brouwer
2004-10-06 19:40                   ` Jeff Garzik
2004-10-06 19:48                     ` Jeff Garzik
2004-10-06 19:58                       ` Jeff Garzik
2004-10-06 20:37                         ` Geert Uytterhoeven
2004-10-07  1:08                           ` Jeff Garzik
2004-10-07  0:02                       ` Matt Mackall
2004-10-06  9:23                 ` Ingo Molnar
2004-10-06  9:57                   ` Paolo Ciarrocchi
2004-10-06 19:33                   ` Jeff Garzik
2004-10-06 22:23                     ` Martin J. Bligh
2004-10-06  5:52       ` Default cache_hot_time value back to 10ms Chen, Kenneth W
2004-10-06 19:27       ` Chen, Kenneth W
2004-10-06 19:39         ` Andrew Morton
2004-10-06 20:38           ` Chen, Kenneth W
2004-10-06 20:43             ` Andrew Morton
2004-10-06 23:14               ` Chen, Kenneth W
2004-10-07  2:26                 ` Nick Piggin
2004-10-07  6:29                 ` Ingo Molnar
2004-10-07  7:08                   ` Jeff Garzik
2004-10-07  7:26                     ` Ingo Molnar
2004-10-06 20:50             ` Ingo Molnar
2004-10-06 21:03               ` Chen, Kenneth W
2004-10-06  7:48 ` Ingo Molnar
2004-10-06 17:18   ` Chen, Kenneth W
2004-10-06 19:55     ` Ingo Molnar
2004-10-06 22:46     ` Peter Williams
2004-10-06 13:29 ` [patch] sched: auto-tuning task-migration Ingo Molnar
2004-10-06 13:44   ` Nick Piggin
2004-10-06 17:49   ` Chen, Kenneth W
2004-10-06 20:04     ` Ingo Molnar
2004-10-06 21:18       ` Chen, Kenneth W
2004-10-07  6:10         ` Ingo Molnar
2005-02-21  5:08   ` Paul Jackson
2004-10-06 14:22 emmanuel.fuste
     [not found] <200411110851.30819.habanero@us.ibm.com>
2004-11-11 15:04 ` Andrew Theurer
