* Scaling noise
From: Larry McVoy @ 2003-09-03  4:03 UTC
To: linux-kernel

I've frequently tried to make the point that all the scaling for lots
of processors is nonsense.  Mr Dell says it better:

    "Eight-way (servers) are less than 1 percent of the market and
    shrinking pretty dramatically," Dell said.  "If our competitors want
    to claim they're No. 1 in eight-ways, that's fine.  We want to lead
    the market with two-way and four-way (processor machines)."

Tell me again that it is a good idea to screw up uniprocessor
performance for 64 way machines.  Great idea, that.  Go Dinosaurs!
--
---
Larry McVoy              lm at bitmover.com          http://www.bitmover.com/lm
* Re: Scaling noise
From: Roland Dreier @ 2003-09-03  4:12 UTC
Cc: linux-kernel

        +--------------+
        |  Don't feed  |
        |  the trolls  |
        |              |
        |  thank you   |
        +--------------+
            |      |
            |      |
            |      |
            |      |
        ....\      /....
* Re: Scaling noise
From: Larry McVoy @ 2003-09-03  4:20 UTC
To: Roland Dreier; Cc: linux-kernel

And here I thought that real data was interesting.  My mistake.

On Tue, Sep 02, 2003 at 09:12:36PM -0700, Roland Dreier wrote:
>         +--------------+
>         |  Don't feed  |
>         |  the trolls  |
--
---
Larry McVoy              lm at bitmover.com          http://www.bitmover.com/lm
* Re: Scaling noise
From: Martin J. Bligh @ 2003-09-03 15:12 UTC
To: Roland Dreier; Cc: linux-kernel

--Roland Dreier <roland@topspin.com> wrote (on Tuesday, September 02,
2003 21:12:36 -0700):

>         +--------------+
>         |  Don't feed  |
>         |  the trolls  |
>         |              |
>         |  thank you   |
>         +--------------+
>             |      |
>             |      |
>             |      |
>             |      |
>         ....\      /....

Agreed.  Please refer to the last flamefest a few months ago, when this
was covered in detail.

M.
* Re: Scaling noise
From: Anton Blanchard @ 2003-09-03  4:18 UTC
To: Larry McVoy, linux-kernel

> I've frequently tried to make the point that all the scaling for lots
> of processors is nonsense.  Mr Dell says it better:
>
>     "Eight-way (servers) are less than 1 percent of the market and
>     shrinking pretty dramatically," Dell said.  "If our competitors want
>     to claim they're No. 1 in eight-ways, that's fine.  We want to lead
>     the market with two-way and four-way (processor machines)."
>
> Tell me again that it is a good idea to screw up uniprocessor
> performance for 64 way machines.  Great idea, that.  Go Dinosaurs!

And does your 4 way have hyperthreading?

Anton
* Re: Scaling noise
From: Larry McVoy @ 2003-09-03  4:29 UTC
To: Anton Blanchard; Cc: Larry McVoy, linux-kernel

On Wed, Sep 03, 2003 at 02:18:51PM +1000, Anton Blanchard wrote:
> > I've frequently tried to make the point that all the scaling for lots
> > of processors is nonsense.  Mr Dell says it better:
> >
> >     "Eight-way (servers) are less than 1 percent of the market and
> >     shrinking pretty dramatically," Dell said.  "If our competitors want
> >     to claim they're No. 1 in eight-ways, that's fine.  We want to lead
> >     the market with two-way and four-way (processor machines)."
> >
> > Tell me again that it is a good idea to screw up uniprocessor
> > performance for 64 way machines.  Great idea, that.  Go Dinosaurs!
>
> And does your 4 way have hyperthreading?

What part of "shrinking pretty dramatically" did you not understand?  Maybe
you know more than Mike Dell.  Could you share that insight?
--
---
Larry McVoy              lm at bitmover.com          http://www.bitmover.com/lm
* Re: Scaling noise
From: CaT @ 2003-09-03  4:33 UTC
To: Larry McVoy, Anton Blanchard, linux-kernel

On Tue, Sep 02, 2003 at 09:29:53PM -0700, Larry McVoy wrote:
> On Wed, Sep 03, 2003 at 02:18:51PM +1000, Anton Blanchard wrote:
> > > Tell me again that it is a good idea to screw up uniprocessor
> > > performance for 64 way machines.  Great idea, that.  Go Dinosaurs!
> >
> > And does your 4 way have hyperthreading?
>
> What part of "shrinking pretty dramatically" did you not understand?  Maybe
> you know more than Mike Dell.  Could you share that insight?

I think Anton is referring to the fact that on a 4-way cpu machine with
HT enabled you basically have an 8-way smp box (with special conditions)
and so if 4-way machines are becoming more popular, making sure that
8-way smp works well is a good idea.

At least that's how I took it.
--
"How can I not love the Americans? They helped me with a flat tire the
other day," he said.
	- http://tinyurl.com/h6fo
* Re: Scaling noise
From: Larry McVoy @ 2003-09-03  5:08 UTC
To: CaT; Cc: Larry McVoy, Anton Blanchard, linux-kernel

On Wed, Sep 03, 2003 at 02:33:56PM +1000, CaT wrote:
> I think Anton is referring to the fact that on a 4-way cpu machine with
> HT enabled you basically have an 8-way smp box (with special conditions)
> and so if 4-way machines are becoming more popular, making sure that
> 8-way smp works well is a good idea.

Maybe this is a better way to get my point across.  Think about more CPUs
on the same memory subsystem.  I've been trying to make this scaling point
ever since I discovered how much cache misses hurt.  That was about 1995
or so.  At that point, memory latency was about 200 ns and processor
speeds were about 200MHz, or 5 ns per cycle.  Today, memory latency is
about 130 ns and processor cycle times are about .3 ns.  Processors are
15 times faster and memory is less than 2 times faster.  SMP makes that
ratio worse.

It's called asymptotic behavior.  After a while you can look at the graph
and see that more CPUs on the same memory doesn't make sense.  It hasn't
made sense for a decade, what makes anyone think that is changing?
--
---
Larry McVoy              lm at bitmover.com          http://www.bitmover.com/lm
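Those figures are starker in cycles than in nanoseconds.  A
back-of-the-envelope sketch in C, using only the approximate latencies
quoted in the message above (the exact numbers are illustrative):

    #include <stdio.h>

    int main(void)
    {
        /* Approximate figures from the message above. */
        double mem_1995 = 200.0, cycle_1995 = 5.0;   /* ns; 200 MHz CPU  */
        double mem_2003 = 130.0, cycle_2003 = 0.3;   /* ns; ~3.3 GHz CPU */

        /* One cache miss costs (memory latency / cycle time) CPU cycles. */
        printf("1995: ~%.0f cycles per miss\n", mem_1995 / cycle_1995);
        printf("2003: ~%.0f cycles per miss\n", mem_2003 / cycle_2003);
        return 0;
    }

It prints roughly 40 cycles for 1995 and 433 for 2003: the price of a
miss grew by an order of magnitude, and every additional CPU contending
for the same memory pushes it higher still.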
* Re: Scaling noise
From: Mikael Abrahamsson @ 2003-09-03  5:44 UTC
To: linux-kernel

On Tue, 2 Sep 2003, Larry McVoy wrote:

> It's called asymptotic behavior.  After a while you can look at the graph
> and see that more CPUs on the same memory doesn't make sense.  It hasn't
> made sense for a decade, what makes anyone think that is changing?

It didn't make sense two decades ago either: the VAX 8300 could be made
to go 6-way, and it stopped going faster around the third processor
added.  (My memory is a bit rusty, but I believe this is what we came up
with when we got donated a few of those in the mid '90s.  And yes,
they're not from '83 but perhaps from '86-'87, so not two decades ago
either.)

--
Mikael Abrahamsson    email: swmike@swm.pp.se
* Re: Scaling noise
From: Bernd Eckenfels @ 2003-09-03  6:12 UTC
To: linux-kernel

In article <20030903050859.GD10257@work.bitmover.com> you wrote:
> It's called asymptotic behavior.  After a while you can look at the graph
> and see that more CPUs on the same memory doesn't make sense.  It hasn't
> made sense for a decade, what makes anyone think that is changing?

That's why NUMA gets so popular.

Larry, don't forget that Linux is growing in the University Labs, where
those big NUMA and Multi-Node Clusters are most popular for Number
Crunching.

Greetings
Bernd
--
eckes privat - http://www.eckes.org/
Project Freefire - http://www.freefire.org/
* Re: Scaling noise
From: Alan Cox @ 2003-09-03 12:09 UTC
To: Bernd Eckenfels; Cc: Linux Kernel Mailing List

On Mer, 2003-09-03 at 07:12, Bernd Eckenfels wrote:
> That's why NUMA gets so popular.

NUMA doesn't help you much.

> Larry, don't forget that Linux is growing in the University Labs, where
> those big NUMA and Multi-Node Clusters are most popular for Number
> Crunching.

Multi-node yes, NUMA not much, and where NUMA-like systems are being
used they are being used for message passing, not as a fake big PC.

NUMA is valuable because
 - It makes some things go faster without having to rewrite them
 - It lets you partition a large box into several effective small ones,
   cutting maintenance
 - It lets you partition a large box into several effective small ones,
   so you can avoid buying two software licenses for expensive toys

If you actually care enough about performance to write the code to do
the job then its value is rather questionable.  There are exceptions,
as with anything else.
* Re: Scaling noise
From: Martin J. Bligh @ 2003-09-03 15:10 UTC
To: Alan Cox, Bernd Eckenfels; Cc: Linux Kernel Mailing List

> Multi-node yes, NUMA not much, and where NUMA-like systems are being
> used they are being used for message passing, not as a fake big PC.
>
> NUMA is valuable because
>  - It makes some things go faster without having to rewrite them
>  - It lets you partition a large box into several effective small ones,
>    cutting maintenance
>  - It lets you partition a large box into several effective small ones,
>    so you can avoid buying two software licenses for expensive toys
>
> If you actually care enough about performance to write the code to do
> the job then its value is rather questionable.  There are exceptions,
> as with anything else.

The real core use of NUMA is to run one really big app on one machine,
where it's hard to split it across a cluster.  You just can't build an
SMP box big enough for some of these things.

M.
* Re: Scaling noise
From: Jörn Engel @ 2003-09-03 16:01 UTC
To: Martin J. Bligh; Cc: Alan Cox, Bernd Eckenfels, Linux Kernel Mailing List

On Wed, 3 September 2003 08:10:33 -0700, Martin J. Bligh wrote:
>
> > Multi-node yes, NUMA not much, and where NUMA-like systems are being
> > used they are being used for message passing, not as a fake big PC.
> >
> > NUMA is valuable because
> >  - It makes some things go faster without having to rewrite them
> >  - It lets you partition a large box into several effective small ones,
> >    cutting maintenance
> >  - It lets you partition a large box into several effective small ones,
> >    so you can avoid buying two software licenses for expensive toys
> >
> > If you actually care enough about performance to write the code to do
> > the job then its value is rather questionable.  There are exceptions,
> > as with anything else.
>
> The real core use of NUMA is to run one really big app on one machine,
> where it's hard to split it across a cluster.  You just can't build an
> SMP box big enough for some of these things.

This "hard to split" is usually caused by memory use instead of cpu
use, right?

I don't see a big problem scaling number crunchers over a cluster, but
a process with a working set >64GB cannot be split between 4GB
machines easily.

Jörn

--
Good warriors cause others to come to them and do not go to others.
-- Sun Tzu
* Re: Scaling noise
From: Martin J. Bligh @ 2003-09-03 16:21 UTC
To: Jörn Engel; Cc: Alan Cox, Bernd Eckenfels, Linux Kernel Mailing List

>> The real core use of NUMA is to run one really big app on one machine,
>> where it's hard to split it across a cluster.  You just can't build an
>> SMP box big enough for some of these things.
>
> This "hard to split" is usually caused by memory use instead of cpu
> use, right?

Heavy process intercommunication I guess, often but not always through
shared mem.

> I don't see a big problem scaling number crunchers over a cluster, but
> a process with a working set >64GB cannot be split between 4GB
> machines easily.

Right - some problems split nicely, and should get run on clusters
because it's a shitload cheaper.  Preferably an SSI cluster so you get
to manage things easily, but either way.  As you say, some things just
don't split that way, and that's why people pay for big iron (which
ends up being NUMA).

I've seen people use big machines for clusterable things, which I think
is a waste of money, but the cost of the machine compared to the cost
of admin (vs multiple machines) may have come down to the point where
it's worth it now.  You get implicit "cluster" load balancing done in a
transparent way by the OS on NUMA boxes.

M.
* Re: Scaling noise
From: Mike Fedyk @ 2003-09-03 19:41 UTC
To: Martin J. Bligh; Cc: Jörn Engel, Alan Cox, Bernd Eckenfels, Linux Kernel Mailing List

On Wed, Sep 03, 2003 at 09:21:47AM -0700, Martin J. Bligh wrote:
> I've seen people use big machines for clusterable things, which I think
> is a waste of money, but the cost of the machine compared to the cost
> of admin (vs multiple machines) may have come down to the point where
> it's worth it now.  You get implicit "cluster" load balancing done in a
> transparent way by the OS on NUMA boxes.

Doesn't SSI clustering do something similar (without the efficiency of
the interconnections, though)?
* Re: Scaling noise
From: Martin J. Bligh @ 2003-09-03 20:11 UTC
To: Mike Fedyk; Cc: Jörn Engel, Alan Cox, Bernd Eckenfels, Linux Kernel Mailing List

> On Wed, Sep 03, 2003 at 09:21:47AM -0700, Martin J. Bligh wrote:
>> I've seen people use big machines for clusterable things, which I think
>> is a waste of money, but the cost of the machine compared to the cost
>> of admin (vs multiple machines) may have come down to the point where
>> it's worth it now.  You get implicit "cluster" load balancing done in a
>> transparent way by the OS on NUMA boxes.
>
> Doesn't SSI clustering do something similar (without the efficiency of
> the interconnections, though)?

Yes ... *if* someone had an implementation that worked well and was
maintainable ;-)

M.
* Re: Scaling noise
From: Rik van Riel @ 2003-09-04 20:36 UTC
To: Martin J. Bligh; Cc: Alan Cox, Bernd Eckenfels, Linux Kernel Mailing List

On Wed, 3 Sep 2003, Martin J. Bligh wrote:

> The real core use of NUMA is to run one really big app on one machine,
> where it's hard to split it across a cluster.  You just can't build an
> SMP box big enough for some of these things.

That only works when the NUMA factor is low enough that you can
effectively treat the box as an SMP system.

It doesn't work when you have a NUMA factor of 15 (like some
unspecified box you are very familiar with) and half of your database
index is always on the "other half" of the two-node NUMA system.

You'll end up with half your accesses being 15 times as slow, meaning
that your average memory access time is 8 times as high!  Good way to
REDUCE performance, but most people won't like that...

If the NUMA factor is low enough that applications can treat it like
SMP, then the kernel NUMA support won't have to be very high either...

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
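The averaging behind Rik's "8 times" figure, written out (assuming a
local access costs 1 time unit, a remote access costs the full NUMA
factor of 15, and accesses split evenly between the two nodes):

    avg access time = 0.5 * 1 + 0.5 * 15 = 8

More generally, with a fraction r of remote accesses and NUMA factor f,
the average is (1 - r) + r * f, so even a modest remote fraction
dominates once f is large.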
* Re: Scaling noise
From: Martin J. Bligh @ 2003-09-04 20:47 UTC
To: Rik van Riel; Cc: Alan Cox, Bernd Eckenfels, Linux Kernel Mailing List

> On Wed, 3 Sep 2003, Martin J. Bligh wrote:
>
>> The real core use of NUMA is to run one really big app on one machine,
>> where it's hard to split it across a cluster.  You just can't build an
>> SMP box big enough for some of these things.
>
> That only works when the NUMA factor is low enough that you can
> effectively treat the box as an SMP system.
>
> It doesn't work when you have a NUMA factor of 15 (like some
> unspecified box you are very familiar with) and half of your database
> index is always on the "other half" of the two-node NUMA system.
>
> You'll end up with half your accesses being 15 times as slow, meaning
> that your average memory access time is 8 times as high!  Good way to
> REDUCE performance, but most people won't like that...
>
> If the NUMA factor is low enough that applications can treat it like
> SMP, then the kernel NUMA support won't have to be very high either...

I think there are a few too many assumptions in that - are you thinking
of a big r/w shmem application?  There are lots of other application
programming models that wouldn't suffer nearly so much ... but maybe
they're more splittable ... there are lots of things we can do to
ensure at least better than average node-locality for most of the
memory.

M.
* Re: Scaling noise
From: William Lee Irwin III @ 2003-09-04 21:30 UTC
To: Rik van Riel; Cc: Martin J. Bligh, Alan Cox, Bernd Eckenfels, Linux Kernel Mailing List

On Thu, Sep 04, 2003 at 04:36:56PM -0400, Rik van Riel wrote:
> You'll end up with half your accesses being 15 times as slow, meaning
> that your average memory access time is 8 times as high!  Good way to
> REDUCE performance, but most people won't like that...
> If the NUMA factor is low enough that applications can treat it like
> SMP, then the kernel NUMA support won't have to be very high either...

This does not hold.  The data set is not necessarily where the
communication occurs.

-- wli
* Re: Scaling noise
From: Giuliano Pochini @ 2003-09-03  8:11 UTC
To: Larry McVoy; Cc: linux-kernel

On 03-Sep-2003 Larry McVoy wrote:
> That was about 1995 or so.  At that point, memory latency was about
> 200 ns and processor speeds were about 200MHz, or 5 ns per cycle.
> Today, memory latency is about 130 ns and processor cycle times are
> about .3 ns.  Processors are 15 times faster and memory is less than
> 2 times faster.  SMP makes that ratio worse.

Latency is not bandwidth.  BTW you are right; that's why caches are
growing, too.  It's likely that in the future there will be only UP
(HT'd?) and NUMA machines.

Bye.
Giuliano.
* Re: Scaling noise
From: Steven Cole @ 2003-09-03 14:25 UTC
To: Larry McVoy; Cc: CaT, Anton Blanchard, linux-kernel

On Tue, 2003-09-02 at 23:08, Larry McVoy wrote:
> On Wed, Sep 03, 2003 at 02:33:56PM +1000, CaT wrote:
> > I think Anton is referring to the fact that on a 4-way cpu machine with
> > HT enabled you basically have an 8-way smp box (with special conditions)
> > and so if 4-way machines are becoming more popular, making sure that
> > 8-way smp works well is a good idea.
>
> Maybe this is a better way to get my point across.  Think about more CPUs
> on the same memory subsystem.  I've been trying to make this scaling point
> ever since I discovered how much cache misses hurt.  That was about 1995
> or so.  At that point, memory latency was about 200 ns and processor
> speeds were about 200MHz, or 5 ns per cycle.  Today, memory latency is
> about 130 ns and processor cycle times are about .3 ns.  Processors are
> 15 times faster and memory is less than 2 times faster.  SMP makes that
> ratio worse.
>
> It's called asymptotic behavior.  After a while you can look at the graph
> and see that more CPUs on the same memory doesn't make sense.  It hasn't
> made sense for a decade, what makes anyone think that is changing?

You're right about the asymptotic behavior, and you'll just get more
right as time goes on, but other forces are at work.

What is changing is that the number of cores per 'processor' is
increasing.  The Intel Montecito will increase this to two, and rumor
has it that the Intel Tanglewood may have as many as sixteen.  The IBM
Power6 will likely be similarly capable.

The Tanglewood is not some far-off flight of fancy; it may be available
as soon as the 2.8.x stable series, so planning to accommodate it
should be happening now.

With companies like SGI building Altix systems with 64 and 128 CPUs
using the current single-core Madison, just think of what will be
possible using the future hardware.

In four years, Michael Dell will still be saying the same thing, but
he'll just fudge his answer by a factor of four.

The question which will continue to be important in the next kernel
series is: how to best accommodate the future many-CPU machines without
sacrificing performance on the low-end?  The change is that the 'many'
in the above may start to double every few years.

Some candidate answers to this have been discussed before, such as
cache-coherent clusters.  I just hope this gets worked out before the
hardware ships.

Steven
* Re: Scaling noise
From: Antonio Vargas @ 2003-09-03 12:47 UTC
To: Steven Cole; Cc: Larry McVoy, CaT, Anton Blanchard, linux-kernel

On Wed, Sep 03, 2003 at 08:25:36AM -0600, Steven Cole wrote:
> On Tue, 2003-09-02 at 23:08, Larry McVoy wrote:
> > On Wed, Sep 03, 2003 at 02:33:56PM +1000, CaT wrote:
> [snip]
>
> The question which will continue to be important in the next kernel
> series is: how to best accommodate the future many-CPU machines without
> sacrificing performance on the low-end?  The change is that the 'many'
> in the above may start to double every few years.
>
> Some candidate answers to this have been discussed before, such as
> cache-coherent clusters.  I just hope this gets worked out before the
> hardware ships.

As you may probably know, CC-clusters were heavily advocated by the
same Larry McVoy who started this thread.

Greets, Antonio.
* Re: Scaling noise
From: Steven Cole @ 2003-09-03 15:31 UTC
To: Antonio Vargas; Cc: Larry McVoy, CaT, Anton Blanchard, linux-kernel

On Wed, 2003-09-03 at 06:47, Antonio Vargas wrote:
> On Wed, Sep 03, 2003 at 08:25:36AM -0600, Steven Cole wrote:
> > On Tue, 2003-09-02 at 23:08, Larry McVoy wrote:
> > [snip]
> >
> > The question which will continue to be important in the next kernel
> > series is: how to best accommodate the future many-CPU machines without
> > sacrificing performance on the low-end?  The change is that the 'many'
> > in the above may start to double every few years.
> >
> > Some candidate answers to this have been discussed before, such as
> > cache-coherent clusters.  I just hope this gets worked out before the
> > hardware ships.
>
> As you may probably know, CC-clusters were heavily advocated by the
> same Larry McVoy who started this thread.

Yes, thanks.  I'm well aware of that.  I would like to get a discussion
going again on CC-clusters, since that seems to be a way out of the
scaling spiral.  Here is an interesting link:
http://www.opersys.com/adeos/practical-smp-clusters/

Steven
* Re: Scaling noise
From: Daniel Phillips @ 2003-09-04  1:50 UTC
To: Steven Cole, Antonio Vargas; Cc: Larry McVoy, CaT, Anton Blanchard, linux-kernel

On Wednesday 03 September 2003 17:31, Steven Cole wrote:
> On Wed, 2003-09-03 at 06:47, Antonio Vargas wrote:
> > As you may probably know, CC-clusters were heavily advocated by the
> > same Larry McVoy who started this thread.
>
> Yes, thanks.  I'm well aware of that.  I would like to get a discussion
> going again on CC-clusters, since that seems to be a way out of the
> scaling spiral.  Here is an interesting link:
> http://www.opersys.com/adeos/practical-smp-clusters/

As you know, the argument is that locking overhead grows by some factor
worse than linear as the size of an SMP cluster increases, so that the
locking overhead explodes at some point, and thus it would be more
efficient to eliminate the SMP overhead entirely and run a cluster of
UP kernels, communicating through the high-bandwidth channel provided
by shared memory.

There are other arguments, such as how complex locking is, and how it
will never work correctly, but those are noise: it's pretty much done
now, the complexity is still manageable, and Linux has never been more
stable.

There was a time when SMP locking overhead actually cost something in
the high single digits on Linux, on certain loads.  Today, you'd have
to work at it to find a real load where the 2.5/6 kernel spends more
than 1% of its time in locking overhead, even on a large SMP machine
(sample size of one: I asked Bill Irwin how his 32 node Numa cluster is
running these days).  This blows the ccCluster idea out of the water,
sorry.  The only way ccCluster gets to live is if SMP locking is
pathetic, and it's not.

As for Karim's work, it's a quintessentially flashy trick to make two
UP kernels run on a dual processor.  It's worth doing, but not because
it blazes the way forward for ccClusters.  It can be the basis for hot
kernel swap: migrate all the processes to one of the two CPUs, load and
start a new kernel on the other one, migrate all processes to it, and
let the new kernel restart the first processor, which is now idle.

Regards,

Daniel
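Anyone who wants to poke at the "worse than linear" premise from
userspace can do so in a few lines of pthreads.  This is a hedged
sketch of the usual microbenchmark, a single contended mutex around a
shared counter; in-kernel spinlocks differ in detail, but the
contention curve has the same character:

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define ITERS 1000000

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static long counter;

    static void *worker(void *arg)
    {
        int i;

        for (i = 0; i < ITERS; i++) {
            pthread_mutex_lock(&lock);
            counter++;              /* the shared, serialized work */
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(int argc, char **argv)
    {
        int n = argc > 1 ? atoi(argv[1]) : 2;
        pthread_t *t = malloc(n * sizeof(*t));
        int i;

        for (i = 0; i < n; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (i = 0; i < n; i++)
            pthread_join(t[i], NULL);

        printf("%d threads, counter=%ld\n", n, counter);
        return 0;
    }

Build with "gcc -O2 -o lockbench lockbench.c -lpthread" and time runs
with 1, 2, 4 and 8 threads: once the lock's cache line starts bouncing
between CPUs, wall-clock time grows faster than the added work alone
would explain.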
* Re: Scaling noise
From: Larry McVoy @ 2003-09-04  1:52 UTC
To: Daniel Phillips; Cc: Steven Cole, Antonio Vargas, Larry McVoy, CaT, Anton Blanchard, linux-kernel

On Thu, Sep 04, 2003 at 03:50:31AM +0200, Daniel Phillips wrote:
> There are other arguments, such as how complex locking is, and how it
> will never work correctly, but those are noise: it's pretty much done
> now, the complexity is still manageable, and Linux has never been more
> stable.

Yeah, right.  I'm not sure what you are smoking but I'll avoid your
dealer.  Your politics are showing, Daniel.  Try staying focussed on
the technical merits and we can have a discussion.  Otherwise you just
get ignored.
--
---
Larry McVoy              lm at bitmover.com          http://www.bitmover.com/lm
* Re: Scaling noise
From: David S. Miller @ 2003-09-04  4:42 UTC
To: Larry McVoy; Cc: phillips, elenstev, wind, lm, cat, anton, linux-kernel

On Wed, 3 Sep 2003 18:52:49 -0700, Larry McVoy <lm@bitmover.com> wrote:

> On Thu, Sep 04, 2003 at 03:50:31AM +0200, Daniel Phillips wrote:
> > There are other arguments, such as how complex locking is, and how it
> > will never work correctly, but those are noise: it's pretty much done
> > now, the complexity is still manageable, and Linux has never been more
> > stable.
>
> Yeah, right.  I'm not sure what you are smoking but I'll avoid your
> dealer.

I hate to enter these threads but...

The amount of locking bugs found in the core networking, ipv4, and
ipv6 for a year or two in 2.4.x has been nearly nil.

If you're going to try and argue against supporting huge SMP to me,
don't make locking complexity one of the arguments.  :-)
* Re: Scaling noise
From: bill davidsen @ 2003-09-08 19:40 UTC
To: linux-kernel

In article <20030903214233.24d3c902.davem@redhat.com>,
David S. Miller <davem@redhat.com> wrote:

| The amount of locking bugs found in the core networking, ipv4, and
| ipv6 for a year or two in 2.4.x has been nearly nil.
|
| If you're going to try and argue against supporting huge SMP to me,
| don't make locking complexity one of the arguments.  :-)

If you count only "bugs" which cause a hang or oops, sure.  But just
because something works doesn't make it simple (or non-complex if you
prefer).  Look at all the "lockless" changes and such in 2.4, and I
think you will agree that there have been a number of them and that it
is complex.  I don't think stable and complex are mutually exclusive in
this case.

--
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.
* Re: Scaling noise
From: William Lee Irwin III @ 2003-09-04  2:18 UTC
To: Daniel Phillips; Cc: Steven Cole, Antonio Vargas, Larry McVoy, CaT, Anton Blanchard, linux-kernel

On Thu, Sep 04, 2003 at 03:50:31AM +0200, Daniel Phillips wrote:
> Bill Irwin how his 32 node Numa cluster is running these days).  This blows

Sorry for any misunderstanding: the model only goes to 16 nodes/64x,
and the box mentioned was 32 cpus.  It's also SMP (SSI, shared memory,
mach-numaq), not a cluster.  I also only have half of it full-time.

-- wli
* Re: Scaling noise
From: Steven Cole @ 2003-09-04  2:19 UTC
To: Daniel Phillips; Cc: Antonio Vargas, Larry McVoy, CaT, Anton Blanchard, linux-kernel

On Wed, 2003-09-03 at 19:50, Daniel Phillips wrote:
> As you know, the argument is that locking overhead grows by some factor
> worse than linear as the size of an SMP cluster increases, so that the
> locking overhead explodes at some point, and thus it would be more
> efficient to eliminate the SMP overhead entirely and run a cluster of
> UP kernels, communicating through the high-bandwidth channel provided
> by shared memory.
>
> There are other arguments, such as how complex locking is, and how it
> will never work correctly, but those are noise: it's pretty much done
> now, the complexity is still manageable, and Linux has never been more
> stable.
>
> There was a time when SMP locking overhead actually cost something in
> the high single digits on Linux, on certain loads.  Today, you'd have
> to work at it to find a real load where the 2.5/6 kernel spends more
> than 1% of its time in locking overhead, even on a large SMP machine
> (sample size of one: I asked Bill Irwin how his 32 node Numa cluster is
> running these days).  This blows the ccCluster idea out of the water,
> sorry.  The only way ccCluster gets to live is if SMP locking is
> pathetic, and it's not.

I would never call the SMP locking pathetic, but it could be improved.
Looking at Figure 6 (Star-CD, 1-64 processors on Altix) and Figure 7
(Gaussian, 1-32 processors on Altix) on page 13 of "Linux Scalability
for Large NUMA Systems", available for download here:
http://archive.linuxsymposium.org/ols2003/Proceedings/
it appears that for those applications, the curves begin to flatten
rather alarmingly.  This may have little to do with locking overhead.

One possible benefit of using ccClusters would be to stay on that lower
part of the curve for the nodes, using perhaps 16 CPUs in a node.  That
way, a 256 CPU (e.g. Altix 3000) system might perform better than if a
single kernel were to be used.  I say might.  It's likely that only
empirical data will tell the tale for sure.

> As for Karim's work, it's a quintessentially flashy trick to make two
> UP kernels run on a dual processor.  It's worth doing, but not because
> it blazes the way forward for ccClusters.  It can be the basis for hot
> kernel swap: migrate all the processes to one of the two CPUs, load and
> start a new kernel on the other one, migrate all processes to it, and
> let the new kernel restart the first processor, which is now idle.

Thank you for that very succinct summary of my rather long-winded
exposition on that subject, which I posted here:
http://marc.theaimsgroup.com/?l=linux-kernel&m=105214105131450&w=2

Quite a bit of the complexity which I mentioned, if it were necessary
at all, could go into user-space helper processes which get spawned for
the kernel going away, and before init for the on-coming kernel.  Also,
my comment about not being able to shoe-horn two kernels in at once for
32-bit arches may have been addressed by Ingo's 4G/4G split.

Steven
* Re: Scaling noise
From: William Lee Irwin III @ 2003-09-04  2:35 UTC
To: Steven Cole; Cc: Daniel Phillips, Antonio Vargas, Larry McVoy, CaT, Anton Blanchard, linux-kernel

On Wed, Sep 03, 2003 at 08:19:26PM -0600, Steven Cole wrote:
> I would never call the SMP locking pathetic, but it could be improved.
> Looking at Figure 6 (Star-CD, 1-64 processors on Altix) and Figure 7
> (Gaussian, 1-32 processors on Altix) on page 13 of "Linux Scalability
> for Large NUMA Systems", available for download here:
> http://archive.linuxsymposium.org/ols2003/Proceedings/
> it appears that for those applications, the curves begin to flatten
> rather alarmingly.  This may have little to do with locking overhead.

Those numbers are 2.4.x

-- wli
* Re: Scaling noise
From: Steven Cole @ 2003-09-04  2:40 UTC
To: William Lee Irwin III; Cc: Daniel Phillips, Antonio Vargas, Larry McVoy, CaT, Anton Blanchard, linux-kernel

On Wed, 2003-09-03 at 20:35, William Lee Irwin III wrote:
> On Wed, Sep 03, 2003 at 08:19:26PM -0600, Steven Cole wrote:
> > I would never call the SMP locking pathetic, but it could be improved.
> > Looking at Figure 6 (Star-CD, 1-64 processors on Altix) and Figure 7
> > (Gaussian, 1-32 processors on Altix) on page 13 of "Linux Scalability
> > for Large NUMA Systems", available for download here:
> > http://archive.linuxsymposium.org/ols2003/Proceedings/
> > it appears that for those applications, the curves begin to flatten
> > rather alarmingly.  This may have little to do with locking overhead.
>
> Those numbers are 2.4.x

Yes, I saw that.  It would be interesting to see results for recent
2.6.0-testX kernels.  Judging from other recent numbers out of OSDL,
the results for 2.6 should be quite a bit better.  But won't the curves
still begin to flatten, just at a higher CPU count?  Or has the miracle
goodness of RCU pushed those limits to insanely high numbers?

Steven
* Re: Scaling noise
From: Nick Piggin @ 2003-09-04  3:20 UTC
To: Steven Cole; Cc: William Lee Irwin III, Daniel Phillips, Antonio Vargas, Larry McVoy, CaT, Anton Blanchard, linux-kernel

Steven Cole wrote:
> On Wed, 2003-09-03 at 20:35, William Lee Irwin III wrote:
> > Those numbers are 2.4.x
>
> Yes, I saw that.  It would be interesting to see results for recent
> 2.6.0-testX kernels.  Judging from other recent numbers out of OSDL,
> the results for 2.6 should be quite a bit better.  But won't the curves
> still begin to flatten, just at a higher CPU count?  Or has the miracle
> goodness of RCU pushed those limits to insanely high numbers?

They fixed some big 2.4 scalability problems, so it wouldn't be as
impressive as plain 2.4 -> 2.6.  However, there are obviously hardware
scalability limits as well as software ones.  So a more interesting
comparison would of course be 2.6 vs LM's SSI clusters.
* Re: Scaling noise
From: Daniel Phillips @ 2003-09-04  3:07 UTC
To: Steven Cole; Cc: Antonio Vargas, Larry McVoy, CaT, Anton Blanchard, linux-kernel

On Thursday 04 September 2003 04:19, Steven Cole wrote:
> On Wed, 2003-09-03 at 19:50, Daniel Phillips wrote:
> > There was a time when SMP locking overhead actually cost something in
> > the high single digits on Linux, on certain loads.  Today, you'd have
> > to work at it to find a real load where the 2.5/6 kernel spends more
> > than 1% of its time in locking overhead, even on a large SMP machine
> > (sample size of one: I asked Bill Irwin how his 32 node Numa cluster
> > is running these days).  This blows the ccCluster idea out of the
> > water, sorry.  The only way ccCluster gets to live is if SMP locking
> > is pathetic, and it's not.
>
> I would never call the SMP locking pathetic, but it could be improved.
> Looking at Figure 6 (Star-CD, 1-64 processors on Altix) and Figure 7
> (Gaussian, 1-32 processors on Altix) on page 13 of "Linux Scalability
> for Large NUMA Systems", available for download here:
> http://archive.linuxsymposium.org/ols2003/Proceedings/
> it appears that for those applications, the curves begin to flatten
> rather alarmingly.  This may have little to do with locking overhead.

2.4.17 is getting a little old, don't you think?  This is the thing
that changed most in 2.4 -> 2.6, and indeed, much of the work was in
locking.

> One possible benefit of using ccClusters would be to stay on that lower
> part of the curve for the nodes, using perhaps 16 CPUs in a node.  That
> way, a 256 CPU (e.g. Altix 3000) system might perform better than if a
> single kernel were to be used.  I say might.  It's likely that only
> empirical data will tell the tale for sure.

Right, and we do not see SGI contributing patches for partitioning
their 256 CPU boxes.  That's all the empirical data I need at this
point.  They surely do partition them, but not at the Linux OS level.

> Thank you for that very succinct summary of my rather long-winded
> exposition on that subject, which I posted here:
> http://marc.theaimsgroup.com/?l=linux-kernel&m=105214105131450&w=2

I swear I made the above up on the spot, just now :-)

> Quite a bit of the complexity which I mentioned, if it were necessary
> at all, could go into user-space helper processes which get spawned for
> the kernel going away, and before init for the on-coming kernel.  Also,
> my comment about not being able to shoe-horn two kernels in at once for
> 32-bit arches may have been addressed by Ingo's 4G/4G split.

I don't see what you're worried about, they are separate kernels and
you get two instances of whatever split you want.

Regards,

Daniel
* Re: Scaling noise
From: bill davidsen @ 2003-09-08 19:27 UTC
To: linux-kernel

In article <200309040350.31949.phillips@arcor.de>,
Daniel Phillips <phillips@arcor.de> wrote:

| As for Karim's work, it's a quintessentially flashy trick to make two
| UP kernels run on a dual processor.  It's worth doing, but not because
| it blazes the way forward for ccClusters.  It can be the basis for hot
| kernel swap: migrate all the processes to one of the two CPUs, load and
| start a new kernel on the other one, migrate all processes to it, and
| let the new kernel restart the first processor, which is now idle.

UML running on a sibling, anyone?  Interesting concept, not necessarily
useful.

--
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.
* Re: Scaling noise
From: bill davidsen @ 2003-09-08 19:12 UTC
To: linux-kernel

In article <1062599136.1724.84.camel@spc9.esa.lanl.gov>,
Steven Cole <elenstev@mesatop.com> wrote:

| You're right about the asymptotic behavior, and you'll just get more
| right as time goes on, but other forces are at work.
|
| What is changing is that the number of cores per 'processor' is
| increasing.  The Intel Montecito will increase this to two, and rumor
| has it that the Intel Tanglewood may have as many as sixteen.  The IBM
| Power6 will likely be similarly capable.
|
| The Tanglewood is not some far-off flight of fancy; it may be available
| as soon as the 2.8.x stable series, so planning to accommodate it
| should be happening now.
|
| With companies like SGI building Altix systems with 64 and 128 CPUs
| using the current single-core Madison, just think of what will be
| possible using the future hardware.
|
| In four years, Michael Dell will still be saying the same thing, but
| he'll just fudge his answer by a factor of four.

The mass market will still be in small machines, because the CPUs keep
on getting faster.  And at least for most small servers running Linux,
like news, mail, DNS, and web, the disk, memory and network are more of
a problem than the CPU.  Some database and CGI loads are CPU intensive,
but I don't see that the nature of loads will change; most aren't CPU
intensive.

| The question which will continue to be important in the next kernel
| series is: how to best accommodate the future many-CPU machines without
| sacrificing performance on the low-end?  The change is that the 'many'
| in the above may start to double every few years.

Since you can still get a decent research grant or graduate thesis out
of ways to use a lot of CPUs, there will not be a lack of thought on
the topic.  I think Larry is just worried that some of these solutions
may really work poorly on smaller systems.

| Some candidate answers to this have been discussed before, such as
| cache-coherent clusters.  I just hope this gets worked out before the
| hardware ships.

Honestly, I would expect a good solution to scale better at the "more"
end of the range than the "less."  A good 16-way approach will probably
not need major work for 256, while it may be pretty grim for the uni or
2-way (counting HT) machines.  With all the work people are doing on
scheduler changes for responsiveness, and the number of people trying
them, I would assume there is a need for improvement on small machines,
and for response over throughput.

--
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.
* Re: Scaling noise
From: Kurt Wall @ 2003-09-03 16:37 UTC
To: linux-kernel

Quoth Larry McVoy:

[SMP hits memory latency wall]

> It's called asymptotic behavior.  After a while you can look at the graph
> and see that more CPUs on the same memory doesn't make sense.  It hasn't
> made sense for a decade, what makes anyone think that is changing?

Isn't this what NUMA is for, then?

Kurt
--
"There was a boy called Eustace Clarence Scrubb, and he almost
deserved it."
	-- C. S. Lewis, The Chronicles of Narnia
* Re: Scaling noise
From: Pavel Machek @ 2003-09-06 15:08 UTC
To: Larry McVoy, CaT, Anton Blanchard, linux-kernel

Hi!

> Maybe this is a better way to get my point across.  Think about more CPUs
> on the same memory subsystem.  I've been trying to make this scaling point

The point of hyperthreading is that more virtual CPUs on the same
memory subsystem can actually help stuff.

--
Pavel
Written on sharp zaurus, because my Velo1 broke.  If you have Velo you
don't need...
* Re: Scaling noise
From: Alan Cox @ 2003-09-08 13:38 UTC
To: Pavel Machek; Cc: Larry McVoy, CaT, Anton Blanchard, Linux Kernel Mailing List

On Sad, 2003-09-06 at 16:08, Pavel Machek wrote:
> Hi!
>
> > Maybe this is a better way to get my point across.  Think about more
> > CPUs on the same memory subsystem.  I've been trying to make this
> > scaling point
>
> The point of hyperthreading is that more virtual CPUs on the same
> memory subsystem can actually help stuff.

It's a way of exposing asynchronicity while keeping the old instruction
set.  It's trying to make better use of the available bandwidth by
having something else to schedule into stalls.  That's why HT is really
good for code which is full of polling I/O and badly coded memory
accesses, but is worthless on perfectly tuned hand-coded stuff which
doesn't stall.

Its great feature is that HT gets *more*, not less, useful as the CPU
gets faster.
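The kind of stall Alan means is easy to construct.  What follows is a
hypothetical microbenchmark, not from the thread: a dependent-load walk
over a randomly linked list, where nearly every step waits out the full
memory latency, exactly the window a sibling hyperthread can be
scheduled into:

    #include <stdio.h>
    #include <stdlib.h>

    #define N (1 << 18)     /* 256K nodes: well past any cache of the era */

    struct node {
        struct node *next;
        long pad[15];       /* pad each node out to its own cache line(s) */
    };

    int main(void)
    {
        struct node *nodes = calloc(N, sizeof(*nodes));
        long *order = malloc(N * sizeof(*order));
        long i, j, tmp, sum = 0;
        struct node *p;

        /* Link the nodes in random order so hardware prefetching can't
         * help: each p->next load stalls for the full memory latency. */
        for (i = 0; i < N; i++)
            order[i] = i;
        for (i = N - 1; i > 0; i--) {
            j = rand() % (i + 1);
            tmp = order[i]; order[i] = order[j]; order[j] = tmp;
        }
        for (i = 0; i < N - 1; i++)
            nodes[order[i]].next = &nodes[order[i + 1]];
        nodes[order[N - 1]].next = &nodes[order[0]];

        /* The walk itself: one dependent load per step, mostly stall. */
        p = &nodes[0];
        for (i = 0; i < 10L * N; i++) {
            p = p->next;
            sum += p->pad[0];
        }
        printf("sum=%ld\n", sum);   /* defeat dead-code elimination */
        return 0;
    }

Run two copies pinned to the two siblings of one HT core and combined
throughput should rise noticeably, since each thread executes during
the other's stalls; run two copies of a tuned, cache-resident
arithmetic loop instead and HT should buy almost nothing, which is
Alan's point.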
* Re: Scaling noise
From: Rob Landley @ 2003-09-09  6:11 UTC
To: Alan Cox, Pavel Machek; Cc: CaT, Larry McVoy, Anton Blanchard, Linux Kernel Mailing List

On Monday 08 September 2003 09:38, Alan Cox wrote:
> It's a way of exposing asynchronicity while keeping the old instruction
> set.  It's trying to make better use of the available bandwidth by
> having something else to schedule into stalls.  That's why HT is really
> good for code which is full of polling I/O and badly coded memory
> accesses, but is worthless on perfectly tuned hand-coded stuff which
> doesn't stall.

<rant>

I wouldn't call it worthless.  "Proof of concept", maybe.

Modern processors (Athlon and P4 both, I believe) have three execution
cores, and so are trying to dispatch three instructions per clock.
With speculation, lookahead, branch prediction, register renaming,
instruction reordering, magic pixie dust, happy thoughts, a tailwind,
and 8 zillion other related things, they can just about do it too, but
not even close to 100% of the time.  Extracting three parallel
instructions from one instruction stream is doable, but not fun, and
not consistent.  The third core is unavoidably idle some of the time.
Trying to keep four cores busy would be a nightmare.

(All the VLIW guys keep trying to unload this on the compiler.  Don't
ask me how a compiler is supposed to do branch prediction and
speculative execution.  I suppose having to recompile your binaries for
more cores isn't TOO big a problem these days, but the boxed mainstream
desktop apps people wouldn't like it at all.)

Transistor budgets keep going up as manufacturing die sizes shrink, and
the engineers keep wanting to throw transistors at the problem.  The
first really easy way to turn transistors into performance is a bigger
L1 cache, but somewhere between 256k and one megabyte per running
process you hit some serious diminishing returns, since your working
set is in cache and your far accesses to big datasets (or streaming
data) just aren't going to be helped by more L1 cache.  The other
obvious way to turn transistors into performance is to build execution
cores out of them.  (Yeah, you can also pipeline yourself to death to
do less per clock for marketing reasons, but there's serious
diminishing returns there too.)  With more execution cores, you can
(theoretically) execute more instructions per clock.  Except that
keeping 3 cores busy out of one instruction stream is really hard, and
4 would be a nightmare...

Hyperthreading is just a neat hack to keep multiple cores busy.  Having
another point of execution to schedule instructions from means you're
guaranteed to keep 1 core busy all the time for each point of execution
(barring memory access latency on "branch to mars" conditions), and
with 3 cores and 2 points of execution they can fight over the middle
core, which should just about never be idle when the system is loaded.

With hyperthreading (SMT, whatever you wanna call it), the move to 4
execution cores becomes a no-brainer (keeping 2 cores busy from one
instruction stream is relatively trivial), and even 5 (since keeping 3
cores busy is a solved problem; it's not busy all the time, but the two
threads can fight for the extra core when they actually have something
for it to do...).

And THAT is where SMT starts showing real performance benefits, when
you get to 4 or 5 cores.  It's cheaper than SMP on a die because they
can share all sorts of hardware (not the least of which being L1 cache,
and you can even expand L1 cache a bit because you now have the working
sets of 2 processes to stick in it)...

Intel's been desperate for a way to make use of its transistor budget
for a while; manufacturing is what it does better than AMD, not clever
processor design.  The original Itanic, case in point, had more than 3
instruction execution cores in each chip: 3 VLIW, an HP PA-RISC, and a
brain-damaged Pentium (which itself had a couple execution cores)...
The long list of reasons Itanic sucked started with the fact that it
had 3 different modes, and whichever one you were in, circuitry for the
other 2 wouldn't contribute a darn thing to your performance (although
it did not stop there, and in fact didn't even slow down...)

Of course since power is now the third variable along with
price/performance, sooner or later you'll see chips that individually
power down cores as they go dormant.  Possibly even a banked L1 cache;
who knows?  (It's another alternative to clocking down the whole chip:
power down individual functional units of the chip.  Dunno who might
actually do that, or when, but it's nice to have options...)

</rant>

In brief: hyper-threading is cool.

> Its great feature is that HT gets *more*, not less, useful as the CPU
> gets faster.

Execution point 1 stalls waiting for memory, so execution point 2 gets
the extra cores.  The classic tale of overlapping processing and I/O,
only this time with the memory bus being the slow device you have to
wait for...

Rob
* Re: Scaling noise 2003-09-09 6:11 ` Rob Landley @ 2003-09-09 16:07 ` Ricardo Bugalho 2003-09-10 5:14 ` Rob Landley 0 siblings, 1 reply; 62+ messages in thread From: Ricardo Bugalho @ 2003-09-09 16:07 UTC (permalink / raw) To: linux-kernel On Tue, 09 Sep 2003 02:11:15 -0400, Rob Landley wrote: > Modern processors (Athlon and P4 both, I believe) have three execution > cores, and so are trying to dispatch three instructions per clock. With Neither of these CPUs is multi-core. They're just superscalar cores, that is, they can dispatch multiple instructions in parallel. An example of a multi-core CPU is the POWER4: there are two complete cores on the same silicon die, sharing some cache levels and the memory bus. BTW, Pentium [Pro,II,III] and Athlon are three-way in the sense that they have three-way decoders that decode up to three x86 instructions into µOPs. Pentium4 has a one-way decoder and a trace cache that stores decoded µOPs. As a curiosity, AMD's K5 and K6 were 4-way. > four cores busy would be a nightmare. (All the VLIW guys keep trying to > unload this on the compiler. Don't ask me how a compiler is supposed to > do branch prediction and speculative execution. I suppose having to > recompile your binaries for more cores isn't TOO big a problem these > days, but the boxed mainstream desktop apps people wouldn't like it at > all.) In normal instruction sets, whatever CPUs do, from the software perspective, it MUST look like the CPU is executing one instruction at a time. In VLIW, some forms of parallelism are exposed. For example, before executing two instructions in parallel, non-VLIW CPUs have to check for data dependencies. If they exist, those two instructions can't be executed in parallel. VLIW instruction sets just define that instructions MUST be grouped in sets of N instructions that can be executed in parallel, and that if they aren't, the CPU will yield an exception or undefined behaviour. In a similar manner, there is the issue of available execution units and exceptions. The net result is that in-order VLIW CPUs are simpler to design than in-order superscalar RISC CPUs, but I think it won't make much of a difference for out-of-order CPUs. I've never seen a VLIW out-of-order implementation. VLIW ISAs are no different from others regarding branch prediction -- which is a problem for ALL pipelined implementations, superscalar or not. Speculative execution is a feature of out-of-order implementations. > Transistor budgets keep going up as manufacturing feature sizes shrink, and > the engineers keep wanting to throw transistors at the problem. The > first really easy way to turn transistors into performance is a bigger > L1 cache, but somewhere between 256k and one megabyte per running > process you hit some serious diminishing returns since your working set > is in cache and your far accesses to big datasets (or streaming data) > just aren't going to be helped by more L1 cache. L1 caches are kept small so they can be fast. > Hyperthreading is just a neat hack to keep multiple cores busy. Having SMT (Simultaneous Multi-Threading, aka Hyperthreading in Intel's marketing speak) is a neat hack to keep execution units within the same core busy. And it's a cheap hack when the CPUs are already out-of-order. CMP (Chip Multi-Processing) is a neat hack to keep expensive resources like big L2/L3 caches and memory interfaces busy by placing multiple cores on the same die. CMP is simpler, but is only useful for multi-thread performance.
With SMT, it makes sense to add more execution units than now, so it can also help single-thread performance. > Intel's been desperate for a way to make use of its transistor budget > for a while; manufacturing is what it does better than AMD, not clever > processor design. The original Itanic, case in point, had more than 3 > instruction execution cores in each chip: 3 VLIW, an HP PA-RISC, and a > brain-damaged Pentium (which itself had a couple execution cores)... The > long list of reasons Itanic sucked started with the fact that it had 3 > different modes, and whichever one you were in, circuitry for the other 2 > wouldn't contribute a darn thing to your performance (although it did > not stop there, and in fact didn't even slow down...) Itanium doesn't have hardware support for PA-RISC emulation. The IA-64 ISA has some similarities with PA-RISC to ease dynamic translation though. But you're right: the IA-32 hardware emulation layer is not a Good Thing™. -- Ricardo ^ permalink raw reply [flat|nested] 62+ messages in thread
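A minimal sketch (invented for illustration, not Ricardo's code) of the dependency check he describes upthread: before dual-issuing two decoded instructions, a superscalar front end must prove they are independent. A toy in-order model with no register renaming:

    struct insn {
            int dst, src1, src2;    /* register numbers */
    };

    /* May insn b issue in the same cycle as insn a? */
    static int can_dual_issue(const struct insn *a, const struct insn *b)
    {
            if (b->src1 == a->dst || b->src2 == a->dst)
                    return 0;       /* RAW: b reads a's result */
            if (b->dst == a->dst)
                    return 0;       /* WAW: both write the same register */
            if (b->dst == a->src1 || b->dst == a->src2)
                    return 0;       /* WAR: b overwrites an input a still needs */
            return 1;
    }

A VLIW ISA shifts exactly this check to the compiler: bundles are defined to be independent, and, per Ricardo's description, the hardware either traps or misbehaves if they aren't.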
* Re: Scaling noise 2003-09-09 16:07 ` Ricardo Bugalho @ 2003-09-10 5:14 ` Rob Landley 2003-09-10 5:45 ` David Mosberger 2003-09-10 10:10 ` Ricardo Bugalho 0 siblings, 2 replies; 62+ messages in thread From: Rob Landley @ 2003-09-10 5:14 UTC (permalink / raw) To: Ricardo Bugalho, linux-kernel On Tuesday 09 September 2003 12:07, Ricardo Bugalho wrote: > On Tue, 09 Sep 2003 02:11:15 -0400, Rob Landley wrote: > > Modern processors (Athlon and P4 both, I believe) have three execution > > cores, and so are trying to dispatch three instructions per clock. With > > Neither of these CPUs is multi-core. They're just superscalar cores, that > is, they can dispatch multiple instructions in parallel. An example of a > multi-core CPU is the POWER4: there are two complete cores on the same > silicon die, sharing some cache levels and the memory bus. Sorry, wrong terminology. (I'm a software dude.) "Instruction execution thingy". (Well, you didn't give it a name either. :) > BTW, Pentium [Pro,II,III] and Athlon are three-way in the sense that they have > three-way decoders that decode up to three x86 instructions into µOPs. > Pentium4 has a one-way decoder and a trace cache that stores decoded > µOPs. > As a curiosity, AMD's K5 and K6 were 4-way. I hadn't known that. (I had known that the AMD guys I talked to around Austin had proven to themselves that 4-way was not a good idea in the real world, but I didn't know it had actually made it outside of the labs...) > > four cores busy would be a nightmare. (All the VLIW guys keep trying to > > unload this on the compiler. Don't ask me how a compiler is supposed to > > do branch prediction and speculative execution. I suppose having to > > recompile your binaries for more cores isn't TOO big a problem these > > days, but the boxed mainstream desktop apps people wouldn't like it at > > all.) > > In normal instruction sets, whatever CPUs do, from the software > perspective, it MUST look like the CPU is executing one instruction at a > time. Yup. > In VLIW, some forms of parallelism are exposed. I tend to think of it as "unloaded upon the compiler"... > For example, before > executing two instructions in parallel, non-VLIW CPUs have to check for > data dependencies. If they exist, those two instructions can't be executed > in parallel. VLIW instruction sets just define that instructions MUST be > grouped in sets of N instructions that can be executed in parallel, and > that if they aren't, the CPU will yield an exception or undefined > behaviour. Presumably this is the compiler's job, and the CPU can just have "undefined behavior" if fed impossible instruction mixes. But yeah, throwing an exception would be the conscientious thing to do. :) > In a similar manner, there is the issue of available execution units and > exceptions. > The net result is that in-order VLIW CPUs are simpler to design than > in-order superscalar RISC CPUs, but I think it won't make much of a > difference for out-of-order CPUs. I've never seen a VLIW out-of-order > implementation. I'm not sure what the point of out-of-order VLIW would be. You just put extra pressure on the memory bus by tagging your instructions with grouping info, just to give you even LESS leeway about shuffling the groups at run-time... > VLIW ISAs are no different from others regarding branch prediction -- > which is a problem for ALL pipelined implementations, superscalar or not. > Speculative execution is a feature of out-of-order implementations. Ah yes, predication.
Rather than having instruction execution thingies be idle, have them follow both branches and do work with a 100% chance of being thrown away. And you wonder why the chips have heat problems... :) > > Transistor budgets keep going up as manufacturing feature sizes shrink, and > > the engineers keep wanting to throw transistors at the problem. The > > first really easy way to turn transistors into performance is a bigger > > L1 cache, but somewhere between 256k and one megabyte per running > > process you hit some serious diminishing returns since your working set > > is in cache and your far accesses to big datasets (or streaming data) > > just aren't going to be helped by more L1 cache. > > L1 caches are kept small so they can be fast. Sorry, I still refer to on-die L2 caches as L1. Bad habit. (As I said, I get the names wrong...) "On-die cache." Right. The point was, you can spend your transistor budget on big caches on the die, but there are diminishing returns. > > Intel's been desperate for a way to make use of its transistor budget > > for a while; manufacturing is what it does better than AMD, not clever > > processor design. The original Itanic, case in point, had more than 3 > > instruction execution cores in each chip: 3 VLIW, an HP PA-RISC, and a > > brain-damaged Pentium (which itself had a couple execution cores)... The > > long list of reasons Itanic sucked started with the fact that it had 3 > > different modes, and whichever one you were in, circuitry for the other 2 > > wouldn't contribute a darn thing to your performance (although it did > > not stop there, and in fact didn't even slow down...) > > Itanium doesn't have hardware support for PA-RISC emulation. I'm under the impression it used to be part of the design, circa 1997. But I must admit that when discussing Itanium I'm not really prepared (I stopped paying too much attention a year or so after the sucker had taped out but still had no silicon to play with, especially after HP and SGI revived their own chip designs due to the delay...) I only actually got to play with the original Itanium hardware once, and never got it out of the darn monitor that substituted for a BIOS. The people who did benchmarked it at about Pentium III 300 MHz levels, and it became a doorstop. (These days, I've got a friend who's got an Itanium II evaluation system, but it's another doorstop and I'm not going to make him hook it up again just so I can go "yeah, I agree with you, it sucks"...) > The IA-64 ISA > has some similarities with PA-RISC to ease dynamic translation though. > But you're right: the IA-32 hardware emulation layer is not a Good Thing™. It's apparently going away. http://news.com.com/2100-1006-997936.html?tag=nl Rob ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: Scaling noise 2003-09-10 5:14 ` Rob Landley @ 2003-09-10 5:45 ` David Mosberger 2003-09-10 10:10 ` Ricardo Bugalho 1 sibling, 0 replies; 62+ messages in thread From: David Mosberger @ 2003-09-10 5:45 UTC (permalink / raw) To: rob; +Cc: Ricardo Bugalho, linux-kernel >>>>> On Wed, 10 Sep 2003 01:14:37 -0400, Rob Landley <rob@landley.net> said: Rob> (These days, I've got a friend who's got an Itanium II Rob> evaluation system, but it's another doorstop and I'm not going Rob> to make him hook it up again just so I can go "yeah, I agree Rob> with you, it sucks"...) I'm sorry to hear that. If you really do want to try out an Itanium 2 system, an easy way to go about it is to get an account at http://testdrive.hp.com/ . It's a quick and painless process and a single account will give you access to all test-drive machines, including various Linux Itanium machines (up to 4x 1.4GHz), as shown here: http://testdrive.hp.com/current.shtml --david ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: Scaling noise 2003-09-10 5:14 ` Rob Landley 2003-09-10 5:45 ` David Mosberger @ 2003-09-10 10:10 ` Ricardo Bugalho 1 sibling, 0 replies; 62+ messages in thread From: Ricardo Bugalho @ 2003-09-10 10:10 UTC (permalink / raw) To: rob; +Cc: linux-kernel On Wed, 2003-09-10 at 06:14, Rob Landley wrote: > I'm not sure what the point of out-of-order VLIW would be. You just put extra > pressure on the memory bus by tagging your instructions with grouping info, > just to give you even LESS leeway about shuffling the groups at run-time... The point is: simpler in-order implementations. In-order CPUs don't reorder instructions at run-time, as the name suggests. > > VLIW ISAs are no different from others regarding branch prediction -- > > which is a problem for ALL pipelined implementations, superscalar or not. > > Speculative execution is a feature of out-of-order implementations. > > Ah yes, predication. Rather than having instruction execution thingies be > idle, have them follow both branches and do work with a 100% chance of being > thrown away. And you wonder why the chips have heat problems... :) You're confusing branch prediction with instruction predication. Branch prediction is a design feature, needed for most pipelined CPUs. Because they're pipelined, the CPU may not know whether to take the branch or not when it's time to fetch the next instructions. So, instead of stalling, it guesses. If it's wrong, it has to roll back. Instruction predication is another form of conditional execution: each instruction has a predicate (a register) and is only executed if the predicate is true. The bad thing is that these instructions take their slot in the pipeline, even if the CPU knows, at the moment it fetches them, that they'll never be executed. The good sides are: a) Unlike branches, it doesn't have a constant mispredict penalty. So, it's good for replacing "small" and unpredictable branches. b) Instead of a control dependency (branches), predication is a data dependency. So, it gives compilers more freedom in scheduling. > The point was, you can spend your transistor budget on big caches on the > die, but there are diminishing returns. Depends on the workload. -- Ricardo ^ permalink raw reply [flat|nested] 62+ messages in thread
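A small C illustration of point (a), under the assumption that the compiler turns the second form into a conditional move or predicated instruction (gcc generally does on IA-64, and via cmov on x86); both functions are invented for the example and may well compile identically:

    /* Branchy form: nearly free when the branch predicts well,
     * but each mispredict costs a pipeline flush. */
    int clamp_branch(int x, int limit)
    {
            if (x > limit)
                    return limit;
            return x;
    }

    /* Select form: both values are computed and the choice is a
     * data dependency, so there is nothing to mispredict. */
    int clamp_select(int x, int limit)
    {
            return (x > limit) ? limit : x;
    }

On a predictable branch the first form wins; on a 50/50 data-dependent branch the second usually does, which matches Ricardo's rule about small, unpredictable branches.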
* Re: Scaling noise 2003-09-03 4:29 ` Larry McVoy 2003-09-03 4:33 ` CaT @ 2003-09-03 6:28 ` Anton Blanchard 2003-09-03 6:55 ` Nick Piggin 1 sibling, 1 reply; 62+ messages in thread From: Anton Blanchard @ 2003-09-03 6:28 UTC (permalink / raw) To: Larry McVoy, Larry McVoy, linux-kernel > > > I've frequently tried to make the point that all the scaling for > > > lots of processors is nonsense. Mr Dell says it better: > > > > > > "Eight-way (servers) are less than 1 percent of the market and > > > shrinking pretty dramatically," Dell said. "If our competitors > > > want to claim they're No. 1 in eight-ways, that's fine. We > > > want to lead the market with two-way and four-way (processor > > > machines)." > > > > > > Tell me again that it is a good idea to screw up uniprocessor > > > performance for 64 way machines. Great idea, that. Go Dinosaurs! > > > > And does your 4 way have hyperthreading? > > What part of "shrinking pretty dramatically" did you not understand? > Maybe you know more than Mike Dell. Could you share that insight? Ok. But only because you asked nicely. Mike Dell wants to sell 2 and 4 processor boxes and Intel wants to sell processors with hyperthreading on them. Scaling to 4 or 8 threads is just like scaling to 4 or 8 processors, only worse. However, let's not end up in yet another 64-way scalability argument here. The thing we should be worrying about is the UP -> 2-way SMP scalability issue. If every chip in the future has hyperthreading then all of a sudden everyone is running an SMP kernel. And what hurts us? atomic ops memory barriers I've always worried about those atomic ops that only appear in an SMP kernel, but Rusty recently reminded me it's the same story for most of the memory barriers. Things like RCU can do a lot for this UP -> 2-way SMP issue. The fact it also helps the big end of town is just a bonus. Anton ^ permalink raw reply [flat|nested] 62+ messages in thread
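The cost Anton is pointing at is visible in the per-architecture kernel headers. Roughly, as a simplified sketch of the usual pattern (not the literal source of any one architecture):

    #ifdef CONFIG_SMP
    #define smp_mb()        mb()            /* real hardware barrier */
    #define smp_rmb()       rmb()
    #define smp_wmb()       wmb()
    #else
    #define smp_mb()        barrier()       /* compiler barrier: no runtime cost */
    #define smp_rmb()       barrier()
    #define smp_wmb()       barrier()
    #endif

On a UP build the right-hand column compiles away to nothing; once hyperthreading makes every kernel an SMP kernel, everybody pays for the real barriers and locked operations, which is exactly the UP -> 2-way worry.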
* Re: Scaling noise 2003-09-03 6:28 ` Anton Blanchard @ 2003-09-03 6:55 ` Nick Piggin 2003-09-03 15:23 ` Martin J. Bligh 2003-09-03 15:51 ` UP Regression (was) " Cliff White 0 siblings, 2 replies; 62+ messages in thread From: Nick Piggin @ 2003-09-03 6:55 UTC (permalink / raw) To: Anton Blanchard; +Cc: Larry McVoy, Larry McVoy, linux-kernel Anton Blanchard wrote: >>>>I've frequently tried to make the point that all the scaling for >>>>lots of processors is nonsense. Mr Dell says it better: >>>> >>>> "Eight-way (servers) are less than 1 percent of the market and >>>> shrinking pretty dramatically," Dell said. "If our competitors >>>> want to claim they're No. 1 in eight-ways, that's fine. We >>>> want to lead the market with two-way and four-way (processor >>>> machines)." >>>> >>>>Tell me again that it is a good idea to screw up uniprocessor >>>>performance for 64 way machines. Great idea, that. Go Dinosaurs! >>>> >>>And does your 4 way have hyperthreading? >>> >>What part of "shrinking pretty dramatically" did you not understand? >>Maybe you know more than Mike Dell. Could you share that insight? >> > >Ok. But only because you asked nicely. > >Mike Dell wants to sell 2 and 4 processor boxes and Intel wants to sell >processors with hyperthreading on them. Scaling to 4 or 8 threads is just >like scaling to 4 or 8 processors, only worse. > >However, let's not end up in yet another 64-way scalability argument here. > >The thing we should be worrying about is the UP -> 2-way SMP scalability >issue. If every chip in the future has hyperthreading then all of a sudden >everyone is running an SMP kernel. And what hurts us? > >atomic ops >memory barriers > >I've always worried about those atomic ops that only appear in an SMP >kernel, but Rusty recently reminded me it's the same story for most of the >memory barriers. > >Things like RCU can do a lot for this UP -> 2-way SMP issue. The fact it >also helps the big end of town is just a bonus. > I think LM advocates aiming single image scalability at or before the knee of the CPU vs performance curve. Say that's 4 way, it means you should get good performance on 8 ways while keeping top performance on 1 and 2 and 4 ways. (Sorry if I mis-represent your position). I don't think anyone advocates sacrificing UP performance for 32 ways, but as he says it can happen .1% at a time. But it looks like 2.6 will scale well to 16 way and higher. I wonder if there are many regressions from 2.4 or 2.2 on small systems. ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: Scaling noise 2003-09-03 6:55 ` Nick Piggin @ 2003-09-03 15:23 ` Martin J. Bligh 2003-09-03 15:39 ` Larry McVoy 2003-09-03 17:16 ` William Lee Irwin III 1 sibling, 2 replies; 62+ messages in thread From: Martin J. Bligh @ 2003-09-03 15:23 UTC (permalink / raw) To: Nick Piggin, Anton Blanchard; +Cc: Larry McVoy, linux-kernel > I think LM advocates aiming single image scalability at or before the knee > of the CPU vs performance curve. Say that's 4 way, it means you should get > good performance on 8 ways while keeping top performance on 1 and 2 and 4 > ways. (Sorry if I mis-represent your position). Splitting big machines into a cluster is not a solution. However, oddly enough I actually agree with Larry, with one major caveat ... you have to make it an SSI cluster (single system image) - that way it's transparent to users. Unfortunately that's hard to do, but since we still have a system that's single memory image coherent, it shouldn't actually be nearly as hard as doing it across machines, as you can still fudge in the odd global piece if you need it. Without SSI, it's pretty useless, you're just turning an expensive box into a cheap cluster, and burning a lot of cash. > I don't think anyone advocates sacrificing UP performance for 32 ways, but > as he says it can happen .1% at a time. > > But it looks like 2.6 will scale well to 16 way and higher. I wonder if > there are many regressions from 2.4 or 2.2 on small systems. You want real data instead of FUD? How *dare* you? ;-) Would be real interesting to see this ... there are actually plenty of real degradations there, none of which (that I've seen) come from any scalability changes. Things like RMAP on fork times (for which there are other legitimate reasons) are more responsible (for which the "scalability" people have offered a solution). Numbers would be cool ... particularly if people can refrain from the "it's worse, therefore it must be some scalability change that's at fault" insta-moron-leap-of-logic. M. ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: Scaling noise 2003-09-03 15:23 ` Martin J. Bligh @ 2003-09-03 15:39 ` Larry McVoy 2003-09-03 15:50 ` Martin J. Bligh ` (2 more replies) 2003-09-03 17:16 ` William Lee Irwin III 1 sibling, 3 replies; 62+ messages in thread From: Larry McVoy @ 2003-09-03 15:39 UTC (permalink / raw) To: Martin J. Bligh; +Cc: Nick Piggin, Anton Blanchard, Larry McVoy, linux-kernel On Wed, Sep 03, 2003 at 08:23:44AM -0700, Martin J. Bligh wrote: > > I think LM advocates aiming single image scalability at or before the knee > > of the CPU vs performance curve. Say that's 4 way, it means you should get > > good performance on 8 ways while keeping top performance on 1 and 2 and 4 > > ways. (Sorry if I mis-represent your position). > > Splitting big machines into a cluster is not a solution. However, oddly > enough I actually agree with Larry, with one major caveat ... you have to > make it an SSI cluster (single system image) - that way it's transparent > to users. Err, when did I ever say it wasn't SSI? If you look at what I said it's clearly SSI. Unified process, device, file, and memory namespaces. I'm pretty sure people were so eager to argue with my lovely personality that they never bothered to understand the architecture. It's _always_ been SSI. I have slides going back at least 4 years that state this: http://www.bitmover.com/talks/smp-clusters http://www.bitmover.com/talks/cliq > Numbers would be cool ... particularly if people can refrain from the > "it's worse, therefore it must be some scalability change that's at fault" > insta-moron-leap-of-logic. It's really easy to claim that scalability isn't the problem. Scaling changes in general cause very minute differences, it's just that there are a lot of them. There is constant pressure to scale further and people think it's cool. You can argue all you want that scaling done right isn't a problem but nobody has ever managed to do it right. I know it's politically incorrect to say this group won't either but there is no evidence that they will. Instead of doggedly following the footsteps down a path that hasn't worked before, why not do something cool? The CC stuff is a fun place to work, it's the last paradigm shift that will ever happen in OS design, it's a chance for Linux to actually do something new. I harp all the time that open source is a copying mechanism and you are playing right into my hands. Make me wrong. Do something new. Don't like this design? OK, then come up with a better design. -- --- Larry McVoy lm at bitmover.com http://www.bitmover.com/lm ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: Scaling noise 2003-09-03 15:39 ` Larry McVoy @ 2003-09-03 15:50 ` Martin J. Bligh 2003-09-04 0:49 ` Larry McVoy 0 siblings, 1 reply; 62+ messages in thread From: Martin J. Bligh @ 2003-09-03 15:50 UTC (permalink / raw) To: Larry McVoy; +Cc: Nick Piggin, Anton Blanchard, linux-kernel > Err, when did I ever say it wasn't SSI? If you look at what I said it's > clearly SSI. Unified process, device, file, and memory namespaces. I think it was the bit when you suggested using bitkeeper to sync multiple /etc/passwd files when I really switched off ... perhaps you were just joking ;-) Perhaps we just had a massive communication disconnect. > I'm pretty sure people were so eager to argue with my lovely personality > that they never bothered to understand the architecture. It's _always_ > been SSI. I have slides going back at least 4 years that state this: > > http://www.bitmover.com/talks/smp-clusters > http://www.bitmover.com/talks/cliq I can go back and re-read them; if I misread them last time then I apologise. I've also shifted perspectives on SSI clusters somewhat over the last year. Yes, if it's SSI, I'd agree for the most part ... once it's implemented ;-) I'd rather start with everything separate (one OS instance per node), and bind things back together, than split everything up. However, I'm really not sure how feasible it is until we actually have something that works. I have a rough plan of how to go about it mapped out, in small steps that might be useful by themselves. It's a lot of fairly complex hard work ;-) >> Numbers would be cool ... particularly if people can refrain from the >> "it's worse, therefore it must be some scalability change that's at fault" >> insta-moron-leap-of-logic. > > It's really easy to claim that scalability isn't the problem. Scaling > changes in general cause very minute differences, it's just that there > are a lot of them. There is constant pressure to scale further and people > think it's cool. You can argue all you want that scaling done right > isn't a problem but nobody has ever managed to do it right. I know it's > politically incorrect to say this group won't either but there is no > evidence that they will. Let's not go into that one again, we've both dragged that over the coals already. Time to agree to disagree. All the significant degradations I looked at, that people screamed were scalability changes, turned out to be something else completely. > Instead of doggedly following the footsteps down a path that hasn't worked > before, why not do something cool? The CC stuff is a fun place to work, > it's the last paradigm shift that will ever happen in OS design, it's a chance > for Linux to actually do something new. I harp all the time that open > source is a copying mechanism and you are playing right into my hands. > Make me wrong. Do something new. Don't like this design? OK, then come > up with a better design. I'm cool with doing SSI clusters over NUMA on a per-node basis. But it's still vapourware ... yes, I'd love to work on that full time to try and change that if I can get funding to do so. M. ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: Scaling noise 2003-09-03 15:50 ` Martin J. Bligh @ 2003-09-04 0:49 ` Larry McVoy 2003-09-04 2:21 ` Daniel Phillips 0 siblings, 1 reply; 62+ messages in thread From: Larry McVoy @ 2003-09-04 0:49 UTC (permalink / raw) To: Martin J. Bligh; +Cc: Larry McVoy, Nick Piggin, Anton Blanchard, linux-kernel On Wed, Sep 03, 2003 at 08:50:46AM -0700, Martin J. Bligh wrote: > > Err, when did I ever say it wasn't SSI? If you look at what I said it's > > clearly SSI. Unified process, device, file, and memory namespaces. > > I think it was the bit when you suggested using bitkeeper to sync multiple > /etc/passwd files when I really switched off ... perhaps you were just > joking ;-) I wasn't joking, but that has nothing to do with clusters. The BK license has a "single user is free" mode because I wanted very much to allow distros to use BK to control their /etc files. It would be amazingly useful if you could do an upgrade and merge your config changes with their config changes. Instead we're still in the '80s in terms of config files. By the way, I couldn't care less if it were BK, CVS, SVN, SCCS, RCS, whatever. The config files need to be under version control and you need to be able to merge in your changes. BK is what I'd like because I understand it and know it would work, but it's not a BK thing at all, I'd happily do work on RCS or whatever to make this happen. It's just amazingly painful that these files aren't under version control, it's stupid, there is an obviously better answer and the distros aren't seeing it. Bummer. But this has nothing to do with clusters. > > I'm pretty sure people were so eager to argue with my lovely personality > > that they never bothered to understand the architecture. It's _always_ > > been SSI. I have slides going back at least 4 years that state this: > > > > http://www.bitmover.com/talks/smp-clusters > > http://www.bitmover.com/talks/cliq > > I can go back and re-read them; if I misread them last time then I apologise. > I've also shifted perspectives on SSI clusters somewhat over the last year. > Yes, if it's SSI, I'd agree for the most part ... once it's implemented ;-) Cool! > I'd rather start with everything separate (one OS instance per node), and > bind things back together, than split everything up. However, I'm really > not sure how feasible it is until we actually have something that works. I'm in 100% agreement. It's much better to have a bunch of OS's and pull them together than have one and try and pry it apart. > I have a rough plan of how to go about it mapped out, in small steps that > might be useful by themselves. It's a lot of fairly complex hard work ;-) I've spent quite a bit of time thinking about this and if it started going anywhere it would be easy for you to tell me to put up or shut up. I'd be happy to do some real work on this. Maybe it would just be doing the architecture stuff but I strongly suspect there are few people out there masochistic enough to make controlling tty semantics work properly in this environment. I don't want to do it, I'd love someone else to do it, but if no one steps up to the plate I will. I did all the POSIX crud in SunOS, I understand the issues, I can do it here and it is part of the least fun work so if I'm pushing the model I should be willing to put some work into the non-fun part. The VM work is a lot more fun, I'd like to play there but I suspect that if we got rolling there are far more talented people who would push me aside.
That's cool, the best people should do the work. -- --- Larry McVoy lm at bitmover.com http://www.bitmover.com/lm ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: Scaling noise 2003-09-04 0:49 ` Larry McVoy @ 2003-09-04 2:21 ` Daniel Phillips 2003-09-04 2:35 ` Martin J. Bligh 2003-09-04 2:46 ` Larry McVoy 0 siblings, 2 replies; 62+ messages in thread From: Daniel Phillips @ 2003-09-04 2:21 UTC (permalink / raw) To: Larry McVoy, Martin J. Bligh Cc: Larry McVoy, Nick Piggin, Anton Blanchard, linux-kernel On Thursday 04 September 2003 02:49, Larry McVoy wrote: > It's much better to have a bunch of OS's and pull > them together than have one and try and pry it apart. This is bogus. The numbers clearly don't work if the ccCluster is made of uniprocessors, so obviously the SMP locking has to be implemented anyway, to get each node up to the size just below the supposed knee in the scaling curve. This eliminates the argument about saving complexity and/or work. The way Linux scales now, the locking stays out of the range where SSI could compete up to, what? 128 processors? More? Maybe we'd better ask SGI about that, but we already know what the answer is for 32: boring old SMP wins hands down. Where is the machine that has the knee in the wrong part of the curve? Oh, maybe we should all just stop whatever work we're doing and wait ten years for one to show up. But far be it from me to suggest that reality should interfere with your fun. Regards, Daniel ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: Scaling noise 2003-09-04 2:21 ` Daniel Phillips @ 2003-09-04 2:35 ` Martin J. Bligh 2003-09-04 2:46 ` Larry McVoy 0 siblings, 0 replies; 62+ messages in thread From: Martin J. Bligh @ 2003-09-04 2:35 UTC (permalink / raw) To: Daniel Phillips, Larry McVoy; +Cc: Nick Piggin, Anton Blanchard, linux-kernel > On Thursday 04 September 2003 02:49, Larry McVoy wrote: >> It's much better to have a bunch of OS's and pull >> them together than have one and try and pry it apart. > > This is bogus. The numbers clearly don't work if the ccCluster is made of > uniprocessors, so obviously the SMP locking has to be implemented anyway, to > get each node up to the size just below the supposed knee in the scaling > curve. This eliminates the argument about saving complexity and/or work. > > The way Linux scales now, the locking stays out of the range where SSI could > compete up to, what? 128 processors? More? Maybe we'd better ask SGI about > that, but we already know what the answer is for 32: boring old SMP wins > hands down. Where is the machine that has the knee in the wrong part of the > curve? Oh, maybe we should all just stop whatever work we're doing and wait > ten years for one to show up. > > But far be it from me to suggest that reality should interfere with your fun. Yes, you need locking, but only for the bits where you glue stuff back together. Plenty of bits can operate independently per node, or at least ... I'm hoping they can in my vapourware world ;-) M. ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: Scaling noise 2003-09-04 2:21 ` Daniel Phillips 2003-09-04 2:35 ` Martin J. Bligh @ 2003-09-04 2:46 ` Larry McVoy 2003-09-04 4:58 ` David S. Miller 1 sibling, 1 reply; 62+ messages in thread From: Larry McVoy @ 2003-09-04 2:46 UTC (permalink / raw) To: Daniel Phillips Cc: Larry McVoy, Martin J. Bligh, Nick Piggin, Anton Blanchard, linux-kernel On Thu, Sep 04, 2003 at 04:21:16AM +0200, Daniel Phillips wrote: > On Thursday 04 September 2003 02:49, Larry McVoy wrote: > > It's much better to have a bunch of OS's and pull > > them together than have one and try and pry it apart. > > This is bogus. The numbers clearly don't work if the ccCluster is made of > uniprocessors, so obviously the SMP locking has to be implemented anyway, to > get each node up to the size just below the supposed knee in the scaling > curve. This eliminates the argument about saving complexity and/or work. If you thought before you spoke you'd realize how wrong you are. How many locks are there in the IRIX/Solaris/Linux I/O path? How many are needed for 2-4 way scaling? Here's the litmus test: list all the locks in the kernel and the locking hierarchy. If you, a self claimed genius, can't do it, how can the rest of us mortals possibly do it? Quick. You have 30 seconds, I want a list. A complete list with the locking hierarchy, no silly awk scripts. You have to show which locks can deadlock, from memory. No list? Cool, you just proved my point. -- --- Larry McVoy lm at bitmover.com http://www.bitmover.com/lm ^ permalink raw reply [flat|nested] 62+ messages in thread
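For readers who haven't done Larry's exercise: the failure mode his quiz is probing for is the classic AB-BA inversion, where two code paths take the same pair of locks in opposite order. A minimal sketch with hypothetical locks (a_lock and b_lock are invented; the spinlock calls are the ordinary 2.4/2.6-era kernel API):

    #include <linux/spinlock.h>

    static spinlock_t a_lock = SPIN_LOCK_UNLOCKED;  /* hypothetical */
    static spinlock_t b_lock = SPIN_LOCK_UNLOCKED;  /* hypothetical */

    void path_one(void)
    {
            spin_lock(&a_lock);
            spin_lock(&b_lock);     /* order: A then B */
            /* ... */
            spin_unlock(&b_lock);
            spin_unlock(&a_lock);
    }

    void path_two(void)
    {
            spin_lock(&b_lock);
            spin_lock(&a_lock);     /* order: B then A: can deadlock against path_one */
            /* ... */
            spin_unlock(&a_lock);
            spin_unlock(&b_lock);
    }

The only defense is a documented global lock ordering, which is why the quiz asks for the hierarchy rather than just the list.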
* Re: Scaling noise 2003-09-04 2:46 ` Larry McVoy @ 2003-09-04 4:58 ` David S. Miller 2003-09-10 15:47 ` Lock EVERYTHING (for testing) [was: Re: Scaling noise] Timothy Miller 0 siblings, 1 reply; 62+ messages in thread From: David S. Miller @ 2003-09-04 4:58 UTC (permalink / raw) To: Larry McVoy; +Cc: phillips, lm, mbligh, piggin, anton, linux-kernel On Wed, 3 Sep 2003 19:46:08 -0700 Larry McVoy <lm@bitmover.com> wrote: > Here's the litmus test: list all the locks in the kernel and the locking > hierarchy. If you, a self claimed genius, can't do it, how can the rest > of us mortals possibly do it? Quick. You have 30 seconds, I want a list. > A complete list with the locking hierarchy, no silly awk scripts. You have > to show which locks can deadlock, from memory. > > No list? Cool, you just proved my point. No point Larry, asking the same question about how the I/O path works sans the locks will give you the same blank stare. I absolutely do not accept the complexity argument. We have a fully scalable kernel now. Do you know why? It's not because we have some weird genius trolls writing the code, it's because of our insanely huge testing base. People give a lot of credit to the people writing the code in the Linux kernel which actually belongs to the people running the code. :-) That's where the other systems failed, all the in-house stress testing in the world is not going to find the bugs we do find in Linux. That's why Solaris goes out buggy and with all kinds of SMP deadlocks, their tester base is just too small to hit all the important bugs. FWIW, I actually can list all the locks taken for the primary paths in the networking, and that's about as finely locked as we can make it. As can Alexey Kuznetsov... So again, if you're going to argue against huge SMP (at least to me), don't use the locking complexity argument. Not only have we basically conquered it, we've along the way found some amazing ways to find locking bugs both at runtime and at compile time. You can even debug them on uniprocessor systems. And this doesn't even count the potential things we can do with Linus's sparse tool. ^ permalink raw reply [flat|nested] 62+ messages in thread
* Lock EVERYTHING (for testing) [was: Re: Scaling noise] 2003-09-04 4:58 ` David S. Miller @ 2003-09-10 15:47 ` Timothy Miller 0 siblings, 0 replies; 62+ messages in thread From: Timothy Miller @ 2003-09-10 15:47 UTC (permalink / raw) To: David S. Miller Cc: Larry McVoy, phillips, mbligh, piggin, anton, linux-kernel David S. Miller wrote: > > So again, if you're going to argue against huge SMP (at least to me), > don't use the locking complexity argument. Not only have we basically > conquered it, we've along the way found some amazing ways to find > locking bugs both at runtime and at compile time. You can even debug > them on uniprocessor systems. And this doesn't even count the > potential things we can do with Linus's sparse tool. Pardon me for suggesting another idea for which I have no code written, but I was just wondering... Is there a way we could get gcc to wrap EVERY memory access with some kind of debug lock? Actually, I do have code, but for another application. I designed a graphics drawing engine which has a FIFO for commands. Before sending commands, you have to be sure there is enough free space in the FIFO, so there is a macro we use which tries to do this in an efficient way. Anyhow, there have been instances where we didn't check for enough space or didn't check for space at all, etc., and those bugs have sometimes been hard to find. Two macros involved are CHECK_FIFO and WRITE_WORD. Normally, CHECK_FIFO just checks for space, and WRITE_WORD just writes a word (it's more complicated than that, but never mind). However, we have a second set of macros which check to make sure we're doing everything right. The "check checker" macros have CHECK_FIFO set a counter and WRITE_WORD decrement that. (Again, a bit more complex than that.) If the counter ever goes below zero, we know we screwed up and exactly where. Another thing we have is a way to indicate that we know we're doing something that looks like it may violate the normal way of things but really doesn't (for instance, sometimes, we write fewer words than we check for, and that is something we still print warnings about, but not in the cases where it's intentional). The analogy for Linux is this: At a machine level, we add a check to EVERY access. The check is there to ensure that every memory access is properly locked. So, if some access is made where there isn't a proper lock applied, then we can print a warning with the line number or drop out into kdb or something of that sort. I'm betting there's another solution to this; otherwise, I wouldn't suggest such an idea, because of the relative amount of work versus benefit. But it may require massive modifications to GCC to add this code in at the machine level. Perhaps an even better solution would be to run an emulator. Anyone know of a 686 emulator I can compile for Intel? The emulator could be modified to track locks and determine if any accesses are made without proper locks. And here's another option that I could REALLY sink my teeth into. If there was a 686 implementation in Verilog that I could run on an FPGA, it would be an order of magnitude slower than a real CPU, but still faster than an emulator. One idea is to have something which can run the 686 ISA that fits in a Virtex 1000 and runs at maybe 66MHz. We put that with some adaptor board into an old dual processor PC that expects a Pentium Pro with a 66MHz FSB. That's probably overly ambitious, although I do do chip design for a living, so it's not entirely beyond the realm of possibility.
One problem is that we need to have metadata about memory accesses so we can track the difference between accesses which are to memory private to a CPU (no lock required) and accesses which are to shared memory (lock required) so we can determine what is a violation. The FPGA daughter board would have to have its own RAM on it to track that. And that leads me to another idea: Reprogramming Transmeta processors to do all that. :) ^ permalink raw reply [flat|nested] 62+ messages in thread
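A guess at the shape of the "check checker" macros Timothy describes: CHECK_FIFO and WRITE_WORD are his names, but the bodies below are invented for illustration, and wait_for_fifo_space()/fifo_write_hw() are hypothetical stand-ins for his driver's real helpers.

    #include <stdio.h>

    extern void wait_for_fifo_space(int nwords);    /* hypothetical */
    extern void fifo_write_hw(unsigned int word);   /* hypothetical */

    #ifdef DEBUG_FIFO
    static int fifo_budget;     /* words we have verified space for */

    #define CHECK_FIFO(n) do {                                      \
            wait_for_fifo_space(n);     /* the normal space check */\
            fifo_budget = (n);                                      \
    } while (0)

    #define WRITE_WORD(w) do {                                      \
            if (--fifo_budget < 0)                                  \
                    fprintf(stderr, "unchecked FIFO write at %s:%d\n", \
                            __FILE__, __LINE__);                    \
            fifo_write_hw(w);                                       \
    } while (0)
    #endif

The counter turns a "forgot to check for space" bug from a rare hardware hang into a deterministic warning with a file and line number, which is the same idea as wrapping every kernel memory access with a lock check.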
* Re: Scaling noise 2003-09-03 15:39 ` Larry McVoy 2003-09-03 15:50 ` Martin J. Bligh @ 2003-09-04 4:49 ` David S. Miller 2003-09-08 19:50 ` bill davidsen 2 siblings, 0 replies; 62+ messages in thread From: David S. Miller @ 2003-09-04 4:49 UTC (permalink / raw) To: Larry McVoy; +Cc: mbligh, piggin, anton, lm, linux-kernel On Wed, 3 Sep 2003 08:39:01 -0700 Larry McVoy <lm@bitmover.com> wrote: > It's really easy to claim that scalability isn't the problem. Scaling > changes in general cause very minute differences, it's just that there > are a lot of them. There is constant pressure to scale further and people > think it's cool. So why are people still going down this path? I'll tell you why: as SMP starts to make its way into mainstream boxes, people are going to find clever solutions to most of the memory sharing issues that cause all the "lock overhead". Things like RCU are just the tip of the iceberg. And think, Larry, we didn't have stuff like RCU back when you were directly working and watching people work on huge SMP systems. I think it's instructive to look at hyperthreading from another angle in this argument: the cpu people invested billions of dollars in work to turn memory latency into free cpu cycles. Put that in your pipe and smoke it :-) ^ permalink raw reply [flat|nested] 62+ messages in thread
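To make the RCU point concrete: the read side takes no lock and issues no atomic operations, and its only barrier is a no-op on everything except Alpha, which is exactly the UP -> 2-way overhead Anton flagged earlier in the thread. A sketch using the 2.6-era API; struct foo, global_ptr, and use() are invented stand-ins:

    #include <linux/rcupdate.h>

    struct foo { int data; };
    extern struct foo *global_ptr;          /* hypothetical RCU-protected pointer */
    extern void use(struct foo *);          /* hypothetical */

    void reader(void)
    {
            struct foo *p;

            rcu_read_lock();                /* on most configs just preempt_disable() */
            p = global_ptr;
            smp_read_barrier_depends();     /* no-op on everything but Alpha */
            if (p)
                    use(p);
            rcu_read_unlock();
    }

Writers do the expensive part (copy, update, and defer the free until all readers are done), so the common read path sheds the atomic ops and barriers from Anton's list.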
* Re: Scaling noise 2003-09-03 15:39 ` Larry McVoy 2003-09-03 15:50 ` Martin J. Bligh 2003-09-04 4:49 ` Scaling noise David S. Miller @ 2003-09-08 19:50 ` bill davidsen 2003-09-08 23:39 ` Peter Chubb 2 siblings, 1 reply; 62+ messages in thread From: bill davidsen @ 2003-09-08 19:50 UTC (permalink / raw) To: linux-kernel In article <20030903153901.GB5769@work.bitmover.com>, Larry McVoy <lm@bitmover.com> wrote: | It's really easy to claim that scalability isn't the problem. Scaling | changes in general cause very minute differences, it's just that there | are a lot of them. There is constant pressure to scale further and people | think it's cool. You can argue all you want that scaling done right | isn't a problem but nobody has ever managed to do it right. I know it's | politically incorrect to say this group won't either but there is no | evidence that they will. I think that if the problem of a single scheduler which is "best" at everything proves out of reach, perhaps in 2.7 a modular scheduler will appear, which will allow the user to select the Nick+Con+Ingo responsiveness, or the default pretty good at everything, or the 4kbit affinity mask NUMA on steroids solution. I have faith that Linux will solve this one, one way or the other, probably both. -- bill davidsen <davidsen@tmr.com> CTO, TMR Associates, Inc Doing interesting things with little computers since 1979. ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: Scaling noise 2003-09-08 19:50 ` bill davidsen @ 2003-09-08 23:39 ` Peter Chubb 0 siblings, 0 replies; 62+ messages in thread From: Peter Chubb @ 2003-09-08 23:39 UTC (permalink / raw) To: bill davidsen; +Cc: linux-kernel >>>>> "bill" == bill davidsen <davidsen@tmr.com> writes: > In article <20030903153901.GB5769@work.bitmover.com>, Larry > McVoy <lm@bitmover.com> wrote: Larry> It's really easy to claim that scalability isn't the problem. Larry> Scaling changes in general cause very minute differences, it's Larry> just that there are a lot of them. There is constant pressure Larry> to scale further and people think it's cool. You can argue Larry> all you want that scaling done right isn't a problem but Larry> nobody has ever managed to do it right. I know it's Larry> politically incorrect to say this group won't either but there Larry> is no evidence that they will. bill> I think that if the problem of a single scheduler which is bill> "best" at everything proves out of reach, perhaps in 2.7 a bill> modular scheduler will appear, which will allow the user to bill> select the Nick+Con+Ingo responsiveness, or the default pretty bill> good at everything, or the 4kbit affinity mask NUMA on steroids bill> solution. Well, as I see it, it's not processor but memory scalability that's the problem right now. Memories are getting larger (and for NUMA systems, sparser), and the current Linux solutions don't scale particularly well --- particularly when, for architectures like PPC or IA64, you need two copies in different formats, one for the hardware to look up, and one for the OS. I *do* think that pluggable schedulers are a good idea --- I'd like to introduce something like the scheduler class mechanism that SVr4 has (except that I've seen that code, and don't want to get sued by SCO) to allow different processes to be in different classes in a cleaner manner than the current FIFO or RR vs OTHER classes. We should be able to introduce isochronous, gang, lottery or fairshare schedulers (etc) at runtime, and then tie processes severally and individually to those schedulers, with a well-defined idea of what happens when scheduler priorities overlap, and well-defined APIs to adjust scheduler parameters. However, this will require more major infrastructure changes, and a better separation of dispatcher from scheduler than in the current one-size-fits-all scheduler. -- Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au You are lost in a maze of BitKeeper repositories, all slightly different. ^ permalink raw reply [flat|nested] 62+ messages in thread
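A hedged sketch of the kind of interface Peter is describing. None of these names exist in the kernel; this is just the shape an SVr4-style scheduler class table might take. (As it happens, Linux later grew a similar but not runtime-pluggable struct sched_class, merged with CFS in 2.6.23.)

    /* One ops table per scheduling class; the dispatcher stays generic
     * and only ever talks to classes through this interface. */
    struct sched_class_ops {
            const char *name;
            void (*enqueue)(struct runqueue *rq, struct task_struct *p);
            void (*dequeue)(struct runqueue *rq, struct task_struct *p);
            struct task_struct *(*pick_next)(struct runqueue *rq);
            void (*tick)(struct runqueue *rq, struct task_struct *p);
            int (*setparam)(struct task_struct *p, const void *param);
    };

    /* Classes consulted in strict priority order (say isochronous before
     * fairshare before batch); registration could happen at runtime. */
    extern struct sched_class_ops *sched_classes[];

This gives the well-defined overlap semantics Peter asks for (strict ordering between classes) while leaving each class free to define its own per-task parameters via setparam().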
* Re: Scaling noise 2003-09-03 15:23 ` Martin J. Bligh 2003-09-03 15:39 ` Larry McVoy @ 2003-09-03 17:16 ` William Lee Irwin III 1 sibling, 0 replies; 62+ messages in thread From: William Lee Irwin III @ 2003-09-03 17:16 UTC (permalink / raw) To: Martin J. Bligh; +Cc: Nick Piggin, Anton Blanchard, Larry McVoy, linux-kernel On Wed, Sep 03, 2003 at 08:23:44AM -0700, Martin J. Bligh wrote: > Would be real interesting to see this ... there are actually plenty of > real degredations there, none of which (that I've seen) come from any > scalability changes. Things like RMAP on fork times (for which there are > other legitimite reasons) are more responsible (for which the "scalability" > people have offered a solution). How'd that get capitalized? It's not an acronym. At any rate, fork()'s relevance to performance is not being measured in any context remotely resembling real usage cases, e.g. forking servers. There are other problems with kernel compiles, for instance, internally limited parallelism, and a relatively highly constrained userspace component which is impossible to increase the concurrency of. -- wli ^ permalink raw reply [flat|nested] 62+ messages in thread
* UP Regression (was) Re: Scaling noise 2003-09-03 6:55 ` Nick Piggin 2003-09-03 15:23 ` Martin J. Bligh @ 2003-09-03 15:51 ` Cliff White 2003-09-03 17:21 ` William Lee Irwin III 2003-09-04 0:54 ` Nick Piggin 1 sibling, 2 replies; 62+ messages in thread From: Cliff White @ 2003-09-03 15:51 UTC (permalink / raw) To: Nick Piggin; +Cc: linux-kernel, cliffw [snip] > > I don't think anyone advocates sacrificing UP performance for 32 ways, but > as he says it can happen .1% at a time. > > But it looks like 2.6 will scale well to 16 way and higher. I wonder if > there are many regressions from 2.4 or 2.2 on small systems. > > On the Scalable Test Platform, running osdl-aim-7, for the UP case, 2.4 is a bit better than 2.6; this is consistent across many runs. For SMP, 2.6 is better, but the delta is rather small, until we get to 8 CPUs. We have a lot of un-parsed data from other tests - might be some trends there also. See http://developer.osdl.org/cliffw/reaim/index.html 2.4 kernels are at the bottom of the page.
Run #    PLM #  Kernel             workload   Max JPM  max lusers  host
1-way
278671   2083   patch-2.4.23-pre2  new_dbase  1066.75  18          stp1-003
278835   2087   2.6.0-test4-mm5    new_dbase   995.74  17          stp1-003
2-way
278690   2083   patch-2.4.23-pre2  new_dbase  1300.01  22          stp2-000
278854   2087   2.6.0-test4-mm5    new_dbase  1340.96  22          stp2-000
4-way
278437   2075   patch-2.4.23-pre1  new_dbase  5268.41  80          stp4-000
278805   2084   2.6.0-test4-mm4    new_dbase  5355.73  88          stp4-000
8-way
278651   2083   patch-2.4.23-pre2  new_dbase  6790.01  112         stp8-002
278722   2084   2.6.0-test4-mm4    new_dbase  8189.51  136         stp8-001
cliffw ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: UP Regression (was) Re: Scaling noise 2003-09-03 15:51 ` UP Regression (was) " Cliff White @ 2003-09-03 17:21 ` William Lee Irwin III 2003-09-03 18:53 ` Cliff White 2003-09-04 0:54 ` Nick Piggin 1 sibling, 1 reply; 62+ messages in thread From: William Lee Irwin III @ 2003-09-03 17:21 UTC (permalink / raw) To: Cliff White; +Cc: Nick Piggin, linux-kernel On Wed, Sep 03, 2003 at 08:51:56AM -0700, Cliff White wrote: > On the Scalable Test Platform, running osdl-aim-7, for the > UP case, 2.4 is a bit better than 2.6, this is consistent across > many runs. For SMP, 2.6 is better, but the delta is rather > small, until we get to 8 CPUS. We have a lot of un-parsed data from other > tests - might be some trends there also. > See http://developer.osdl.org/cliffw/reaim/index.html > 2.4 kernels are at the bottom of the page. Do you have profile data for these runs? Also, that webpage doesn't have 2.4.x results. -- wli ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: UP Regression (was) Re: Scaling noise 2003-09-03 17:21 ` William Lee Irwin III @ 2003-09-03 18:53 ` Cliff White 0 siblings, 0 replies; 62+ messages in thread From: Cliff White @ 2003-09-03 18:53 UTC (permalink / raw) To: William Lee Irwin III; +Cc: Nick Piggin, linux-kernel > On Wed, Sep 03, 2003 at 08:51:56AM -0700, Cliff White wrote: > > On the Scalable Test Platform, running osdl-aim-7, for the > > UP case, 2.4 is a bit better than 2.6; this is consistent across > > many runs. For SMP, 2.6 is better, but the delta is rather > > small, until we get to 8 CPUs. We have a lot of un-parsed data from other > > tests - might be some trends there also. > > See http://developer.osdl.org/cliffw/reaim/index.html > > 2.4 kernels are at the bottom of the page. > > Do you have profile data for these runs? For most of them, yes. The link to the profile data is at the top of the report. The report is sorted by load right now. > Also, that webpage doesn't have 2.4.x results. >> 2.4 kernels are at the bottom of the page. Scroll all the way down and look for the 'Other Kernels' header. There are results for linux-2.4.22 and 2.4.23-pre1 + pre2, for both the new_dbase and compute workloads. Here's a link to 2.4.23-pre2 on an 8-way, if you don't see it: http://khack.osdl.org/stp/278651/ cliffw > -- wli ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: UP Regression (was) Re: Scaling noise 2003-09-03 15:51 ` UP Regression (was) " Cliff White 2003-09-03 17:21 ` William Lee Irwin III @ 2003-09-04 0:54 ` Nick Piggin 1 sibling, 0 replies; 62+ messages in thread From: Nick Piggin @ 2003-09-04 0:54 UTC (permalink / raw) To: Cliff White; +Cc: linux-kernel Cliff White wrote: >[snip] >. > >>I don't think anyone advocates sacrificing UP performance for 32 ways, but >>as he says it can happen .1% at a time. >> >>But it looks like 2.6 will scale well to 16 way and higher. I wonder if >>there are many regressions from 2.4 or 2.2 on small systems. >> >> >> >On the Scalable Test Platform, running osdl-aim-7, for the >UP case, 2.4 is a bit better than 2.6, this is consistent across >many runs. For SMP, 2.6 is better, but the delta is rather >small, until we get to 8 CPUS. We have a lot of un-parsed data from other >tests - might be some trends there also. >See http://developer.osdl.org/cliffw/reaim/index.html >2.4 kernels are at the bottom of the page. > Forgive my ignorance of your benchmarks, but this might very well be HZ == 1000? ^ permalink raw reply [flat|nested] 62+ messages in thread
end of thread, other threads:[~2003-09-10 15:26 UTC | newest] Thread overview: 62+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2003-09-03 4:03 Scaling noise Larry McVoy 2003-09-03 4:12 ` Roland Dreier 2003-09-03 4:20 ` Larry McVoy 2003-09-03 15:12 ` Martin J. Bligh 2003-09-03 4:18 ` Anton Blanchard 2003-09-03 4:29 ` Larry McVoy 2003-09-03 4:33 ` CaT 2003-09-03 5:08 ` Larry McVoy 2003-09-03 5:44 ` Mikael Abrahamsson 2003-09-03 6:12 ` Bernd Eckenfels 2003-09-03 12:09 ` Alan Cox 2003-09-03 15:10 ` Martin J. Bligh 2003-09-03 16:01 ` Jörn Engel 2003-09-03 16:21 ` Martin J. Bligh 2003-09-03 19:41 ` Mike Fedyk 2003-09-03 20:11 ` Martin J. Bligh 2003-09-04 20:36 ` Rik van Riel 2003-09-04 20:47 ` Martin J. Bligh 2003-09-04 21:30 ` William Lee Irwin III 2003-09-03 8:11 ` Giuliano Pochini 2003-09-03 14:25 ` Steven Cole 2003-09-03 12:47 ` Antonio Vargas 2003-09-03 15:31 ` Steven Cole 2003-09-04 1:50 ` Daniel Phillips 2003-09-04 1:52 ` Larry McVoy 2003-09-04 4:42 ` David S. Miller 2003-09-08 19:40 ` bill davidsen 2003-09-04 2:18 ` William Lee Irwin III 2003-09-04 2:19 ` Steven Cole 2003-09-04 2:35 ` William Lee Irwin III 2003-09-04 2:40 ` Steven Cole 2003-09-04 3:20 ` Nick Piggin 2003-09-04 3:07 ` Daniel Phillips 2003-09-08 19:27 ` bill davidsen 2003-09-08 19:12 ` bill davidsen 2003-09-03 16:37 ` Kurt Wall 2003-09-06 15:08 ` Pavel Machek 2003-09-08 13:38 ` Alan Cox 2003-09-09 6:11 ` Rob Landley 2003-09-09 16:07 ` Ricardo Bugalho 2003-09-10 5:14 ` Rob Landley 2003-09-10 5:45 ` David Mosberger 2003-09-10 10:10 ` Ricardo Bugalho 2003-09-03 6:28 ` Anton Blanchard 2003-09-03 6:55 ` Nick Piggin 2003-09-03 15:23 ` Martin J. Bligh 2003-09-03 15:39 ` Larry McVoy 2003-09-03 15:50 ` Martin J. Bligh 2003-09-04 0:49 ` Larry McVoy 2003-09-04 2:21 ` Daniel Phillips 2003-09-04 2:35 ` Martin J. Bligh 2003-09-04 2:46 ` Larry McVoy 2003-09-04 4:58 ` David S. Miller 2003-09-10 15:47 ` Lock EVERYTHING (for testing) [was: Re: Scaling noise] Timothy Miller 2003-09-04 4:49 ` Scaling noise David S. Miller 2003-09-08 19:50 ` bill davidsen 2003-09-08 23:39 ` Peter Chubb 2003-09-03 17:16 ` William Lee Irwin III 2003-09-03 15:51 ` UP Regression (was) " Cliff White 2003-09-03 17:21 ` William Lee Irwin III 2003-09-03 18:53 ` Cliff White 2003-09-04 0:54 ` Nick Piggin
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).