* RE: Scaling noise
From: Brown, Len @ 2003-09-03 9:41 UTC
To: Giuliano Pochini, Larry McVoy; +Cc: linux-kernel
> Latency is not bandwidth.
Bingo.
The way to address memory latency is by increasing bandwidth and
increasing parallelism to use it -- thus amortizing the latency. HT is
one of many ways to do this. If systems are to grow faster at a rate
better than memory speeds, then plan on more parallelism, not less.
-Len
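As a back-of-the-envelope illustration of the amortization argument above (the latency and line-size figures below are assumed round numbers, not measurements of any particular system): with a fixed memory latency, sustained bandwidth grows with the number of misses kept in flight, which is what the extra parallelism buys.

/* Toy model of latency amortization: with L ns of load latency and N
 * misses kept in flight, sustained bandwidth is roughly N * line / L.
 * Latency and cache line size here are illustrative assumptions only. */
#include <stdio.h>

int main(void)
{
        const double latency_ns = 100.0;   /* assumed load-to-use latency */
        const double line_bytes = 64.0;    /* assumed cache line size */
        int inflight;

        for (inflight = 1; inflight <= 16; inflight *= 2) {
                /* bytes per nanosecond is numerically equal to GB/s */
                double gbps = inflight * line_bytes / latency_ns;
                printf("%2d outstanding miss(es): ~%.2f GB/s\n",
                       inflight, gbps);
        }
        return 0;
}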
* RE: Scaling noise
From: Geert Uytterhoeven @ 2003-09-03 11:02 UTC
To: Brown, Len; +Cc: Giuliano Pochini, Larry McVoy, Linux Kernel Development

On Wed, 3 Sep 2003, Brown, Len wrote:
> > Latency is not bandwidth.
>
> Bingo.
>
> The way to address memory latency is by increasing bandwidth and
> increasing parallelism to use it -- thus amortizing the latency. HT is
> one of many ways to do this. If systems are to grow faster at a rate
> better than memory speeds, then plan on more parallelism, not less.

More parallelism usually means more data to process, hence more
bandwidth is needed => back to where we started.

Gr{oetje,eeting}s,

						Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker.
But when I'm talking to journalists I just say "programmer" or something
like that.                                              -- Linus Torvalds
* Re: Scaling noise
From: Larry McVoy @ 2003-09-03 11:19 UTC
To: Brown, Len; +Cc: Giuliano Pochini, Larry McVoy, linux-kernel

On Wed, Sep 03, 2003 at 05:41:39AM -0400, Brown, Len wrote:
> > Latency is not bandwidth.
>
> Bingo.
>
> The way to address memory latency is by increasing bandwidth and
> increasing parallelism to use it -- thus amortizing the latency.

And if the app is a pointer chasing app, as many apps are, that doesn't
help at all.

It's pretty much analogous to file systems.  If bandwidth was the answer
then we'd all be seeing data moving at 60MB/sec off the disk.  Instead
we see about 4 or 5MB/sec.

Expecting more bandwidth to help your app is like expecting more platter
speed to help your file system.  It's not the platter speed, it's the
seeks which are the problem.  Same thing in memory: it's not the
bcopy speed, it's the cache misses that are the problem.  More bandwidth
doesn't do much for that.
--
---
Larry McVoy              lm at bitmover.com          http://www.bitmover.com/lm
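A minimal sketch of the two access patterns contrasted above: dependent pointer chasing, where each load must complete before the next address is even known, versus independent sequential streaming that prefetchers and extra bandwidth can actually feed. The structure and function names are illustrative, not from any existing code.

/* Pointer chasing vs. streaming: the chase serializes on memory latency
 * because each load produces the next address; the stream exposes many
 * independent accesses that extra bandwidth can overlap and service. */
#include <stddef.h>

struct node {
        struct node *next;
        long payload;
};

/* Latency-bound: one cache miss resolved at a time. */
long chase(struct node *n)
{
        long sum = 0;

        while (n) {
                sum += n->payload;
                n = n->next;    /* next address unknown until this load returns */
        }
        return sum;
}

/* Bandwidth-bound: addresses are known up front, accesses can overlap. */
long stream(const long *a, size_t len)
{
        long sum = 0;
        size_t i;

        for (i = 0; i < len; i++)
                sum += a[i];
        return sum;
}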
* Re: Scaling noise
From: Matthias Andree @ 2003-09-03 11:47 UTC
To: linux-kernel; +Cc: Larry McVoy, Brown, Len, Giuliano Pochini, Larry McVoy

On Wed, 03 Sep 2003, Larry McVoy wrote:

> Expecting more bandwidth to help your app is like expecting more platter
> speed to help your file system.  It's not the platter speed, it's the
> seeks which are the problem.  Same thing in memory: it's not the
> bcopy speed, it's the cache misses that are the problem.  More bandwidth
> doesn't do much for that.

Platter speed IS a problem for random access involving seeks, because
higher platter speed reduces the rotational latency. Whether it takes
7.1 ms on average for a block to rotate past the heads in your average
notebook 4,200/min drive or 2 ms in your 15,000/min drive does make a
difference.

Even if the drive knows where the sectors are and folds rotational
latency into positioning latency to the maximum possible extent, for
short seeks (track-to-track) it's not going to help. Unless you're
going to add more heads or use other media than spinning disc, that is.

However, head positioning times, being a tradeoff between noise and
speed, aren't that good particularly with many of the quieter drives,
so the marketing people use the enormous sequential data rate on outer
tracks for advertising.

Head positioning time hasn't improved to the extent throughput has, but
that doesn't mean higher rotational frequency is useless for random
access delays.
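The rotational-latency figures quoted above follow from simple arithmetic: on average a request waits half a revolution, i.e. 60000 / rpm / 2 milliseconds. A quick sketch that reproduces those numbers (the list of spindle speeds is just a set of common examples):

/* Average rotational latency = half a revolution = 60000 / rpm / 2 ms. */
#include <stdio.h>

int main(void)
{
        const int rpm[] = { 4200, 5400, 7200, 10000, 15000 };
        int i;

        for (i = 0; i < 5; i++)
                printf("%5d rpm: %.1f ms average rotational latency\n",
                       rpm[i], 60000.0 / rpm[i] / 2.0);
        return 0;
}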
* Re: Scaling noise
From: William Lee Irwin III @ 2003-09-03 18:00 UTC
To: Larry McVoy, Brown, Len, Giuliano Pochini, Larry McVoy, linux-kernel

On Wed, Sep 03, 2003 at 05:41:39AM -0400, Brown, Len wrote:
>> The way to address memory latency is by increasing bandwidth and
>> increasing parallelism to use it -- thus amortizing the latency.

On Wed, Sep 03, 2003 at 04:19:34AM -0700, Larry McVoy wrote:
> And if the app is a pointer chasing app, as many apps are, that doesn't
> help at all.
> It's pretty much analogous to file systems.  If bandwidth was the answer
> then we'd all be seeing data moving at 60MB/sec off the disk.  Instead
> we see about 4 or 5MB/sec.

RAM is not operationally analogous to disk. For one, it supports
efficient random access, where disk does not.

On Wed, Sep 03, 2003 at 04:19:34AM -0700, Larry McVoy wrote:
> Expecting more bandwidth to help your app is like expecting more platter
> speed to help your file system.  It's not the platter speed, it's the
> seeks which are the problem.  Same thing in memory: it's not the
> bcopy speed, it's the cache misses that are the problem.  More bandwidth
> doesn't do much for that.

Obviously, since the technique is merely increasing concurrency, it
doesn't help any individual application, but rather utilizes cpu
resources while one is stalled to execute another. Cache misses are no
mystery; N times the number of threads of execution is N times the
cache footprint (assuming all threads equal, which is never true but
useful to assume), so it doesn't pay to cachestrate. But it never did
anyway.

The lines of reasoning presented against tightly coupled systems are
grossly flawed. Attacking the communication bottlenecks by increasing
the penalty for communication is highly ineffective, which is why these
cookie cutter clusters for everything strategies don't work even on
paper.

First, communication requirements originate from the applications, not
the operating system, hence so long as there are applications with such
requirements, the requirements for such kernels will exist.

Second, the proposal is ignoring numerous environmental constraints,
for instance, the system administration, colocation, and other costs of
the massive duplication of perfectly shareable resources implied by the
clustering.

Third, the communication penalties are turned from memory access to
I/O, which is tremendously slower by several orders of magnitude.

Fourth, the kernel design problem is actually made harder, since no one
has ever been able to produce a working design for these cache coherent
clusters yet that I know of, and what descriptions of this proposal
I've seen that are extant (you wrote some paper on it, IIRC) are too
vague to be operationally useful.

So as best as I can tell the proposal consists of using an orders-of-
magnitude slower communication method to implement an underspecified
solution to some research problem that to all appearances will be more
expensive to maintain and keep running than the now extant designs.

I like distributed systems and clusters, and they're great to use for
what they're good for. They're not substitutes in any way for tightly
coupled systems, nor do they render large specimens thereof unnecessary.

-- wli
* Re: Scaling noise
From: Larry McVoy @ 2003-09-03 18:05 UTC
To: William Lee Irwin III, Brown, Len, Giuliano Pochini, Larry McVoy, linux-kernel

> The lines of reasoning presented against tightly coupled systems are
> grossly flawed.

[etc].

Only problem with your statements is that IBM has already implemented all
of the required features in VM.  And multiple Linux instances are running
on it today, with shared disks underneath so they don't replicate all the
stuff that doesn't need to be replicated, and they have shared memory
across instances.
--
---
Larry McVoy              lm at bitmover.com          http://www.bitmover.com/lm
* Re: Scaling noise
From: William Lee Irwin III @ 2003-09-03 18:15 UTC
To: Larry McVoy, Brown, Len, Giuliano Pochini, Larry McVoy, linux-kernel

At some point in the past, I wrote:
>> The lines of reasoning presented against tightly coupled systems are
>> grossly flawed.

On Wed, Sep 03, 2003 at 11:05:47AM -0700, Larry McVoy wrote:
> [etc].
> Only problem with your statements is that IBM has already implemented all
> of the required features in VM.  And multiple Linux instances are running
> on it today, with shared disks underneath so they don't replicate all the
> stuff that doesn't need to be replicated, and they have shared memory
> across instances.

Independent operating system instances running under a hypervisor don't
qualify as a cache-coherent cluster that I can tell; it's merely dynamic
partitioning, which is great, but nothing to do with clustering or SMP.

-- wli
* Re: Scaling noise
From: Larry McVoy @ 2003-09-03 18:15 UTC
To: William Lee Irwin III, Brown, Len, Giuliano Pochini, Larry McVoy, linux-kernel

On Wed, Sep 03, 2003 at 11:15:50AM -0700, William Lee Irwin III wrote:
> Independent operating system instances running under a hypervisor don't
> qualify as a cache-coherent cluster that I can tell; it's merely dynamic
> partitioning, which is great, but nothing to do with clustering or SMP.

they can map memory between instances
--
---
Larry McVoy              lm at bitmover.com          http://www.bitmover.com/lm
* Re: Scaling noise
From: William Lee Irwin III @ 2003-09-03 18:26 UTC
To: Larry McVoy, Brown, Len, Giuliano Pochini, Larry McVoy, linux-kernel

On Wed, Sep 03, 2003 at 11:15:50AM -0700, William Lee Irwin III wrote:
>> Independent operating system instances running under a hypervisor don't
>> qualify as a cache-coherent cluster that I can tell; it's merely dynamic
>> partitioning, which is great, but nothing to do with clustering or SMP.

On Wed, Sep 03, 2003 at 11:15:52AM -0700, Larry McVoy wrote:
> they can map memory between instances

That's just enough of a hypervisor API for the kernel to do the rest,
which it is very explicitly not doing.  It also has other uses.

-- wli
* Re: Scaling noise
From: Alan Cox @ 2003-09-03 18:32 UTC
To: William Lee Irwin III
Cc: Larry McVoy, Brown, Len, Giuliano Pochini, Larry McVoy, Linux Kernel Mailing List

On Mer, 2003-09-03 at 19:15, William Lee Irwin III wrote:
> Independent operating system instances running under a hypervisor don't
> qualify as a cache-coherent cluster that I can tell; it's merely dynamic
> partitioning, which is great, but nothing to do with clustering or SMP.

Now add a clusterfs and tell me the difference, other than there being a
lot less sharing going on...
* Re: Scaling noise
From: William Lee Irwin III @ 2003-09-03 19:46 UTC
To: Alan Cox
Cc: Larry McVoy, Brown, Len, Giuliano Pochini, Larry McVoy, Linux Kernel Mailing List

On Mer, 2003-09-03 at 19:15, William Lee Irwin III wrote:
>> Independent operating system instances running under a hypervisor don't
>> qualify as a cache-coherent cluster that I can tell; it's merely dynamic
>> partitioning, which is great, but nothing to do with clustering or SMP.

On Wed, Sep 03, 2003 at 07:32:12PM +0100, Alan Cox wrote:
> Now add a clusterfs and tell me the difference, other than there being a
> lot less sharing going on...

The sharing matters; e.g. libc and other massively shared bits are
replicated in memory once per instance, which increases memory and
cache footprint(s).  A number of other consequences of the sharing loss:

The number of systems to manage proliferates.

Pagecache access suddenly involves cross-instance communication instead
of swift memory access and function calls, with potentially enormous
invalidation latencies.

Userspace IPC goes from shared memory and pipes and sockets inside
a single instance (which are just memory copies) to cross-instance
data traffic, which involves slinging memory around through the
hypervisor's interface, which is slower.

The limited size of a single instance bounds the size of individual
applications, which at various times would like to have larger memory
footprints or consume more cpu time than fits in a single instance,
i.e. something resembling external fragmentation of system resources.

Process migration is confined to within a single instance without
some very ugly bits; things such as forking servers and dynamic task
creation algorithms like thread pools fall apart here.

There's suddenly competition for and a need for dynamic shifting around
of resources not shared across instances, like private disk space and
devices, shares of cpu, IP numbers and other system identifiers, and
even such things as RAM and virtual cpus.

AFAICT this raises more issues than it addresses.  Not that the issues
aren't worth addressing, but there's a lot more to do than Larry
saying "I think this is a good idea" before expecting anyone to even
think it's worth thinking about.

-- wli
* Re: Scaling noise
From: Alan Cox @ 2003-09-03 20:13 UTC
To: William Lee Irwin III
Cc: Larry McVoy, Brown, Len, Giuliano Pochini, Larry McVoy, Linux Kernel Mailing List

On Mer, 2003-09-03 at 20:46, William Lee Irwin III wrote:
> The sharing matters; e.g. libc and other massively shared bits are
> replicated in memory once per instance, which increases memory and
> cache footprint(s).  A number of other consequences of the sharing loss:

Memory is cheap.  NUMA people already replicate pages on big systems,
even the entire kernel.  Last time I looked libc cost me under $1 a
system.

> Pagecache access suddenly involves cross-instance communication instead
> of swift memory access and function calls, with potentially enormous
> invalidation latencies.

Your cross-instance communication in some LPAR-like setup is tiny, it
doesn't have to bounce over ethernet in that kind of setup that Larry
talks about - in many cases it's probably doable as atomic ops in a
shared space.

> a single instance (which are just memory copies) to cross-instance
> data traffic, which involves slinging memory around through the
> hypervisor's interface, which is slower.

Why?  If I want to explicitly allocate shared space I can allocate it
shared in a setup which is LPAR-like.  If it's across a LAN then yes
that's a different kettle of fish.

> Process migration is confined to within a single instance without
> some very ugly bits; things such as forking servers and dynamic task
> creation algorithms like thread pools fall apart here.

I'd be surprised if that is an issue because large systems either run
lots of stuff so you can do the occasional move at fork time (which is
expensive) or customised setups.  Most NUMA setups already mess around
with CPU binding to make the box fast.

> AFAICT this raises more issues than it addresses.  Not that the issues
> aren't worth addressing, but there's a lot more to do than Larry
> saying "I think this is a good idea" before expecting anyone to even
> think it's worth thinking about.

Agreed
* Re: Scaling noise
From: William Lee Irwin III @ 2003-09-03 20:31 UTC
To: Alan Cox
Cc: Larry McVoy, Brown, Len, Giuliano Pochini, Larry McVoy, Linux Kernel Mailing List

On Mer, 2003-09-03 at 20:46, William Lee Irwin III wrote:
>> Pagecache access suddenly involves cross-instance communication instead
>> of swift memory access and function calls, with potentially enormous
>> invalidation latencies.

On Wed, Sep 03, 2003 at 09:13:05PM +0100, Alan Cox wrote:
> Your cross-instance communication in some LPAR-like setup is tiny, it
> doesn't have to bounce over ethernet in that kind of setup that Larry
> talks about - in many cases it's probably doable as atomic ops in a
> shared space.

I was thinking of things like truncate(), which is already a single
system latency problem.

On Mer, 2003-09-03 at 20:46, William Lee Irwin III wrote:
>> a single instance (which are just memory copies) to cross-instance
>> data traffic, which involves slinging memory around through the
>> hypervisor's interface, which is slower.

On Wed, Sep 03, 2003 at 09:13:05PM +0100, Alan Cox wrote:
> Why?  If I want to explicitly allocate shared space I can allocate it
> shared in a setup which is LPAR-like.  If it's across a LAN then yes
> that's a different kettle of fish.

It'll probably deteriorate by an additional copy plus trap costs for
hcalls for things like sockets (and pipes are precluded unless far more
cross-system integration than I've heard of is planned).  Userspace
API's for distributed shared memory are hard to program, but userspace
could exploit them to cut down on the amount of copying.

On Mer, 2003-09-03 at 20:46, William Lee Irwin III wrote:
>> Process migration is confined to within a single instance without
>> some very ugly bits; things such as forking servers and dynamic task
>> creation algorithms like thread pools fall apart here.

On Wed, Sep 03, 2003 at 09:13:05PM +0100, Alan Cox wrote:
> I'd be surprised if that is an issue because large systems either run
> lots of stuff so you can do the occasional move at fork time (which is
> expensive) or customised setups.  Most NUMA setups already mess around
> with CPU binding to make the box fast.

A better way of phrasing this is "the load balancing problem is harder".

-- wli
* Re: Scaling noise
From: Martin J. Bligh @ 2003-09-03 20:48 UTC
To: William Lee Irwin III, Alan Cox
Cc: Brown, Len, Giuliano Pochini, Larry McVoy, Linux Kernel Mailing List

> On Wed, Sep 03, 2003 at 07:32:12PM +0100, Alan Cox wrote:
>> Now add a clusterfs and tell me the difference, other than there being a
>> lot less sharing going on...
>
> The sharing matters; e.g. libc and other massively shared bits are
> replicated in memory once per instance, which increases memory and
> cache footprint(s).  A number of other consequences of the sharing loss:

Explain the cache footprint argument - if you're only using a single
copy from any given cpu, it shouldn't affect the cpu cache.  More
importantly, it'll massively reduce the footprint on the NUMA
interconnect cache, which is the whole point of doing text replication.

> The number of systems to manage proliferates.

Not if you have an SSI cluster, that's the point.

> Pagecache access suddenly involves cross-instance communication instead
> of swift memory access and function calls, with potentially enormous
> invalidation latencies.

No, each node in an SSI cluster has its own pagecache, that's mostly
independent.

> Userspace IPC goes from shared memory and pipes and sockets inside
> a single instance (which are just memory copies) to cross-instance
> data traffic, which involves slinging memory around through the
> hypervisor's interface, which is slower.

Indeed, unless the hypervisor-type layer sets up an efficient cross
communication mechanism that doesn't involve it for every transaction.
Yes, there's some cost here.  If the workload is fairly "independent"
(between processes), it's easy; if it does a lot of cross-process
traffic with pipes and shit, it's going to hurt to some extent, but it
*may* be fairly small, depending on the implementation.

> The limited size of a single instance bounds the size of individual
> applications, which at various times would like to have larger memory
> footprints or consume more cpu time than fits in a single instance,
> i.e. something resembling external fragmentation of system resources.

True.  Depends on how the processes / threads in that app communicate
as to how big the impact would be.  There's nothing saying that two
processes of the same app in an SSI cluster can't run on different
nodes ... we present a single system image to userspace, across nodes.
Some of the glue layer (eg for ps, to give a simple example), like
for_each_task, is where the hard work in doing this is.

> Process migration is confined to within a single instance without
> some very ugly bits; things such as forking servers and dynamic task
> creation algorithms like thread pools fall apart here.

You *need* to be able to migrate processes across nodes.  Yes, it's
hard.  Doing it at exec time is easier, but still far from trivial, and
not sufficient anyway.

> There's suddenly competition for and a need for dynamic shifting around
> of resources not shared across instances, like private disk space and
> devices, shares of cpu, IP numbers and other system identifiers, and
> even such things as RAM and virtual cpus.
>
> AFAICT this raises more issues than it addresses.  Not that the issues
> aren't worth addressing, but there's a lot more to do than Larry
> saying "I think this is a good idea" before expecting anyone to even
> think it's worth thinking about.

It raises a lot of hard issues.  It addresses a lot of hard issues.
IMHO, it's a fascinating concept that deserves some attention, and I'd
love to work on it.  However, I'm far from sure it'd work out, and until
it's proven to do so, it's unreasonable to expect people to give up
working on the existing methods in favour of an unproven (but rather
cool) pipe-dream.

What we're doing now is mostly just small incremental changes, and
unlike Larry, I don't believe it's harmful (I'm not delving back into
that debate again - see the mail archives of this list).  I'd love to
see how the radical SSI cluster approach compares, when it's done.  If I
can get funding for it, I'll help it get done.

M.
* Re: Scaling noise
From: William Lee Irwin III @ 2003-09-03 21:21 UTC
To: Martin J. Bligh
Cc: Alan Cox, Brown, Len, Giuliano Pochini, Larry McVoy, Linux Kernel Mailing List

At some point in the past, I wrote:
>> The sharing matters; e.g. libc and other massively shared bits are
>> replicated in memory once per instance, which increases memory and
>> cache footprint(s).  A number of other consequences of the sharing loss:

On Wed, Sep 03, 2003 at 01:48:59PM -0700, Martin J. Bligh wrote:
> Explain the cache footprint argument - if you're only using a single
> copy from any given cpu, it shouldn't affect the cpu cache.  More
> importantly, it'll massively reduce the footprint on the NUMA
> interconnect cache, which is the whole point of doing text replication.

The single copy from any given cpu assumption was not explicitly made.
Some of this depends on how the administrator/whoever wants to arrange
OS instances so that when one becomes blocked on io or otherwise idled
others can make progress, or other forms of overcommitment.

At some point in the past, I wrote:
>> The number of systems to manage proliferates.

On Wed, Sep 03, 2003 at 01:48:59PM -0700, Martin J. Bligh wrote:
> Not if you have an SSI cluster, that's the point.

The scenario described above wasn't SSI but independent instances with
a shared distributed fs.  SSI clusters have most of the same problems,
really.  Managing the systems just becomes "managing the nodes" because
they're not called systems, and you have to go through some (possibly
automated, though not likely) hassle to figure out the right way to
spread things across nodes, which virtualizes pieces to hand to which
nodes running which loads, etc.

At some point in the past, I wrote:
>> Pagecache access suddenly involves cross-instance communication instead
>> of swift memory access and function calls, with potentially enormous
>> invalidation latencies.

On Wed, Sep 03, 2003 at 01:48:59PM -0700, Martin J. Bligh wrote:
> No, each node in an SSI cluster has its own pagecache, that's mostly
> independent.

But not totally.  truncate() etc. need handling, i.e. cross-instance
pagecache invalidations.  And write() too. =)

At some point in the past, I wrote:
>> The limited size of a single instance bounds the size of individual
>> applications, which at various times would like to have larger memory
>> footprints or consume more cpu time than fits in a single instance,
>> i.e. something resembling external fragmentation of system resources.

On Wed, Sep 03, 2003 at 01:48:59PM -0700, Martin J. Bligh wrote:
> True.  Depends on how the processes / threads in that app communicate
> as to how big the impact would be.  There's nothing saying that two
> processes of the same app in an SSI cluster can't run on different
> nodes ... we present a single system image to userspace, across nodes.
> Some of the glue layer (eg for ps, to give a simple example), like
> for_each_task, is where the hard work in doing this is.

Well, let's try the word "process" then, e.g. 4GB nodes and a process
that suddenly wants to inflate to 8GB due to some ephemeral load
imbalance.

-- wli
* Re: Scaling noise
From: Martin J. Bligh @ 2003-09-03 21:29 UTC
To: William Lee Irwin III
Cc: Alan Cox, Brown, Len, Giuliano Pochini, Larry McVoy, Linux Kernel Mailing List

> On Wed, Sep 03, 2003 at 01:48:59PM -0700, Martin J. Bligh wrote:
>> Not if you have an SSI cluster, that's the point.
>
> The scenario described above wasn't SSI but independent instances with
> a shared distributed fs.

OK, sorry - crosstalk confusion between subthreads.

> SSI clusters have most of the same problems, really.  Managing the
> systems just becomes "managing the nodes" because they're not called
> systems, and you have to go through some (possibly automated, though
> not likely) hassle to figure out the right way to spread things across
> nodes, which virtualizes pieces to hand to which nodes running which
> loads, etc.

That's where I disagree - it's much easier for the USER because an SSI
cluster works out all the load balancing shit for itself, instead of
pushing the problem out to userspace.  It's much harder for the KERNEL
programmer, sure ... but we're smart ;-)  And I'd rather solve it once,
properly, in the right place where all the right data is about all
the apps running on the system, and the data about the machine hardware.

> On Wed, Sep 03, 2003 at 01:48:59PM -0700, Martin J. Bligh wrote:
>> No, each node in an SSI cluster has its own pagecache, that's mostly
>> independent.
>
> But not totally.  truncate() etc. need handling, i.e. cross-instance
> pagecache invalidations.  And write() too. =)

Same problem as any clustered fs, but yes, truncate will suck harder
than it does now.  Not sure I care though ;-)  Invalidations on write,
etc will be more expensive when we're sharing files across nodes, but
independent operations will be cheaper due to the locality.  It's a
tradeoff - whether it pays off or not depends on the workload.

>> True.  Depends on how the processes / threads in that app communicate
>> as to how big the impact would be.  There's nothing saying that two
>> processes of the same app in an SSI cluster can't run on different
>> nodes ... we present a single system image to userspace, across nodes.
>> Some of the glue layer (eg for ps, to give a simple example), like
>> for_each_task, is where the hard work in doing this is.
>
> Well, let's try the word "process" then, e.g. 4GB nodes and a process
> that suddenly wants to inflate to 8GB due to some ephemeral load
> imbalance.

Well, if you mean "task" in the linux sense (ie not a multi-threaded
process), that reduces us from worrying about tasks to memory.  On an
SSI cluster that's on a NUMA machine, we could loan memory across nodes
or something, but yes, that's definitely a problem area.  It ain't no
panacea ;-)

M.
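For illustration only: a hypothetical message such a glue layer might pass between nodes when a shared file is written or truncated, so that other nodes drop the affected pages from their local pagecaches. None of these names or fields exist in any real kernel or SSI implementation; this is purely a sketch of the kind of cross-node invalidation traffic being discussed.

/* Hypothetical cross-node pagecache invalidation message (illustrative
 * assumption only, not an existing structure).  A node that writes or
 * truncates a shared file would send one of these, and receiving nodes
 * would drop the named pages from their local pagecache. */
#include <stdint.h>

enum ssi_pc_op {
        SSI_PC_INVALIDATE,      /* drop cached pages in the given range */
        SSI_PC_TRUNCATE,        /* drop everything from first_page onward */
};

struct ssi_pc_inval {
        uint32_t src_node;      /* node that modified the file */
        uint32_t op;            /* enum ssi_pc_op */
        uint64_t fsid;          /* which cluster filesystem */
        uint64_t inode;         /* which file */
        uint64_t first_page;    /* start of the affected range */
        uint64_t nr_pages;      /* length of the affected range */
};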
* Re: Scaling noise
From: William Lee Irwin III @ 2003-09-03 21:51 UTC
To: Martin J. Bligh
Cc: Alan Cox, Brown, Len, Giuliano Pochini, Larry McVoy, Linux Kernel Mailing List

At some point in the past, I wrote:
>> SSI clusters have most of the same problems, really.  Managing the
>> systems just becomes "managing the nodes" because they're not called
>> systems, and you have to go through some (possibly automated, though
>> not likely) hassle to figure out the right way to spread things across
>> nodes, which virtualizes pieces to hand to which nodes running which
>> loads, etc.

On Wed, Sep 03, 2003 at 02:29:01PM -0700, Martin J. Bligh wrote:
> That's where I disagree - it's much easier for the USER because an SSI
> cluster works out all the load balancing shit for itself, instead of
> pushing the problem out to userspace.  It's much harder for the KERNEL
> programmer, sure ... but we're smart ;-)  And I'd rather solve it once,
> properly, in the right place where all the right data is about all
> the apps running on the system, and the data about the machine hardware.

This is only truly feasible when the nodes are homogeneous.  They will
not be, as there will be physical locality (esp. bits like device
proximity) concerns.  It's vaguely possible some kind of punting out
of the kernel of the solutions to these concerns is possible, but upon
the assumption it will appear, we descend further toward science fiction.

Some of these proposals also beg the question of "who's going to write
the rest of the hypervisor supporting this stuff?", which is ominous.

-- wli
* Re: Scaling noise
From: Martin J. Bligh @ 2003-09-03 21:46 UTC
To: William Lee Irwin III
Cc: Alan Cox, Brown, Len, Giuliano Pochini, Larry McVoy, Linux Kernel Mailing List

>> That's where I disagree - it's much easier for the USER because an SSI
>> cluster works out all the load balancing shit for itself, instead of
>> pushing the problem out to userspace.  It's much harder for the KERNEL
>> programmer, sure ... but we're smart ;-)  And I'd rather solve it once,
>> properly, in the right place where all the right data is about all
>> the apps running on the system, and the data about the machine hardware.
>
> This is only truly feasible when the nodes are homogeneous.  They will
> not be, as there will be physical locality (esp. bits like device
> proximity) concerns.

Same problem as a traditional setup of a NUMA system - the scheduler
needs to try to move the process closer to the resources it's using.

> It's vaguely possible some kind of punting out
> of the kernel of the solutions to these concerns is possible, but upon
> the assumption it will appear, we descend further toward science fiction.

Nah, punting to userspace is crap - they have no more ability to solve
this than we do on any sort of dynamic workload, and in most cases,
much worse - they don't have the information that the kernel has
available, at least not on a timely basis.  The scheduler belongs in the
kernel, where it can balance decisions across all of userspace, and we
have all the info we need rapidly and efficiently available.

> Some of these proposals also beg the question of "who's going to write
> the rest of the hypervisor supporting this stuff?", which is ominous.

Yeah, it needs lots of hard work by bright people.  It's not easy.

M.
* Re: Scaling noise
From: Mike Fedyk @ 2003-09-04 0:07 UTC
To: Martin J. Bligh
Cc: William Lee Irwin III, Alan Cox, Brown, Len, Giuliano Pochini, Larry McVoy, Linux Kernel Mailing List

On Wed, Sep 03, 2003 at 02:46:06PM -0700, Martin J. Bligh wrote:
>> Some of these proposals also beg the question of "who's going to write
>> the rest of the hypervisor supporting this stuff?", which is ominous.
>
> Yeah, it needs lots of hard work by bright people.  It's not easy.

Has OpenMosix done anything to help in this regard, or is it
unmaintainable?

ISTR that much of its code is asm, and going from 2.4.18 to 19 took a
long time to stabilize (that was the last time I used OM).
* Re: Scaling noise
From: Larry McVoy @ 2003-09-04 1:06 UTC
To: Martin J. Bligh
Cc: William Lee Irwin III, Alan Cox, Brown, Len, Giuliano Pochini, Larry McVoy, Linux Kernel Mailing List

Here's a thought.  Maybe the next kernel summit needs to have a CC cluster
BOF or whatever.  I'd be happy to show up, describe what it is that I see
and have you all try and poke holes in it.  If the net result was that you
walked away with the same picture in your head that I have that would be
cool.  Heck, I'll sponser it and buy beer and food if you like.

On Wed, Sep 03, 2003 at 02:46:06PM -0700, Martin J. Bligh wrote:
> >> That's where I disagree - it's much easier for the USER because an SSI
> >> cluster works out all the load balancing shit for itself, instead of
> >> pushing the problem out to userspace.  It's much harder for the KERNEL
> >> programmer, sure ... but we're smart ;-)  And I'd rather solve it once,
> >> properly, in the right place where all the right data is about all
> >> the apps running on the system, and the data about the machine hardware.
> >
> > This is only truly feasible when the nodes are homogeneous.  They will
> > not be, as there will be physical locality (esp. bits like device
> > proximity) concerns.
>
> Same problem as a traditional setup of a NUMA system - the scheduler
> needs to try to move the process closer to the resources it's using.
>
> > It's vaguely possible some kind of punting out
> > of the kernel of the solutions to these concerns is possible, but upon
> > the assumption it will appear, we descend further toward science fiction.
>
> Nah, punting to userspace is crap - they have no more ability to solve
> this than we do on any sort of dynamic workload, and in most cases,
> much worse - they don't have the information that the kernel has
> available, at least not on a timely basis.  The scheduler belongs in the
> kernel, where it can balance decisions across all of userspace, and we
> have all the info we need rapidly and efficiently available.
>
> > Some of these proposals also beg the question of "who's going to write
> > the rest of the hypervisor supporting this stuff?", which is ominous.
>
> Yeah, it needs lots of hard work by bright people.  It's not easy.
>
> M.
--
---
Larry McVoy              lm at bitmover.com          http://www.bitmover.com/lm
* Re: Scaling noise
From: Larry McVoy @ 2003-09-04 1:10 UTC
To: Martin J. Bligh, William Lee Irwin III, Alan Cox, Brown, Len, Giuliano Pochini, Larry McVoy, Linux Kernel Mailing List

On Wed, Sep 03, 2003 at 06:06:53PM -0700, Larry McVoy wrote:
> Here's a thought.  Maybe the next kernel summit needs to have a CC cluster
> BOF or whatever.  I'd be happy to show up, describe what it is that I see
> and have you all try and poke holes in it.  If the net result was that you
> walked away with the same picture in your head that I have that would be
> cool.  Heck, I'll sponser it and buy beer and food if you like.

Oops.  s/sponser/sponsor/.  Long day, sorry.
--
---
Larry McVoy              lm at bitmover.com          http://www.bitmover.com/lm
* Re: Scaling noise
From: William Lee Irwin III @ 2003-09-04 1:32 UTC
To: Larry McVoy, Martin J. Bligh, Alan Cox, Brown, Len, Giuliano Pochini, Larry McVoy, Linux Kernel Mailing List

On Wed, Sep 03, 2003 at 06:06:53PM -0700, Larry McVoy wrote:
> Here's a thought.  Maybe the next kernel summit needs to have a CC cluster
> BOF or whatever.  I'd be happy to show up, describe what it is that I see
> and have you all try and poke holes in it.  If the net result was that you
> walked away with the same picture in your head that I have that would be
> cool.  Heck, I'll sponser it and buy beer and food if you like.

It'd be nice if there were a prototype or something around to at least
get a feel for whether it's worthwhile and how it behaves.

Most of the individual mechanisms have other uses ranging from playing
the good citizen under a hypervisor to just plain old filesharing, so
it should be vaguely possible to get a couple kernels talking and
farting around without much more than 1-2 P-Y's for bootstrapping bits
and some unspecified amount of pain for missing pieces of the above.

Unfortunately, this means
(a) the box needs a hypervisor (or equivalent in native nomenclature)
(b) substantial outlay of kernel hacking time (who's doing this?)

I'm vaguely attached to the idea of there being _something_ to assess,
otherwise it's difficult to ground the discussions in evidence, though
worse comes to worse, we can break down to plotting and scheming again.

-- wli
* Re: Scaling noise
From: David Lang @ 2003-09-04 1:46 UTC
To: William Lee Irwin III
Cc: Larry McVoy, Martin J. Bligh, Alan Cox, Brown, Len, Giuliano Pochini, Larry McVoy, Linux Kernel Mailing List

On Wed, 3 Sep 2003, William Lee Irwin III wrote:
> On Wed, Sep 03, 2003 at 06:06:53PM -0700, Larry McVoy wrote:
> > Here's a thought.  Maybe the next kernel summit needs to have a CC cluster
> > BOF or whatever.  I'd be happy to show up, describe what it is that I see
> > and have you all try and poke holes in it.  If the net result was that you
> > walked away with the same picture in your head that I have that would be
> > cool.  Heck, I'll sponser it and buy beer and food if you like.
>
> It'd be nice if there were a prototype or something around to at least
> get a feel for whether it's worthwhile and how it behaves.
>
> Most of the individual mechanisms have other uses ranging from playing
> the good citizen under a hypervisor to just plain old filesharing, so
> it should be vaguely possible to get a couple kernels talking and
> farting around without much more than 1-2 P-Y's for bootstrapping bits
> and some unspecified amount of pain for missing pieces of the above.
>
> Unfortunately, this means
> (a) the box needs a hypervisor (or equivalent in native nomenclature)

how much of this need could be met with a native linux master and kernels
running user-mode kernels?  (your resource sharing would obviously not be
that clean, but you could develop the tools to work across the kernel
images this way)

David Lang

> (b) substantial outlay of kernel hacking time (who's doing this?)
>
> I'm vaguely attached to the idea of there being _something_ to assess,
> otherwise it's difficult to ground the discussions in evidence, though
> worse comes to worse, we can break down to plotting and scheming again.
>
> -- wli
* Re: Scaling noise
From: William Lee Irwin III @ 2003-09-04 1:51 UTC
To: David Lang
Cc: Larry McVoy, Martin J. Bligh, Alan Cox, Brown, Len, Giuliano Pochini, Larry McVoy, Linux Kernel Mailing List

On Wed, Sep 03, 2003 at 06:46:02PM -0700, David Lang wrote:
> how much of this need could be met with a native linux master and kernels
> running user-mode kernels?  (your resource sharing would obviously not be
> that clean, but you could develop the tools to work across the kernel
> images this way)

Probably a fair amount.

-- wli
* SSI clusters on NUMA (was Re: Scaling noise)
From: Martin J. Bligh @ 2003-09-04 2:33 UTC
To: David Lang, William Lee Irwin III
Cc: Larry McVoy, Alan Cox, Brown, Len, Giuliano Pochini, Larry McVoy, Linux Kernel Mailing List

> how much of this need could be met with a native linux master and kernels
> running user-mode kernels?  (your resource sharing would obviously not be
> that clean, but you could develop the tools to work across the kernel
> images this way)

I talked to Jeff and Andrea about this at KS & OLS this year ... the
feeling was that UML was too much overhead, but there were various ways
to reduce that, especially if the underlying OS had UML support (doesn't
require it right now).

I'd really like to see the performance proved to be better before basing
a design on UML, though that was my first instinct of how to do it ...

M.
* Re: SSI clusters on NUMA (was Re: Scaling noise)
From: David Lang @ 2003-09-04 3:02 UTC
To: Martin J. Bligh
Cc: William Lee Irwin III, Larry McVoy, Alan Cox, Brown, Len, Giuliano Pochini, Larry McVoy, Linux Kernel Mailing List

On Wed, 3 Sep 2003, Martin J. Bligh wrote:
>> how much of this need could be met with a native linux master and kernels
>> running user-mode kernels?  (your resource sharing would obviously not be
>> that clean, but you could develop the tools to work across the kernel
>> images this way)
>
> I talked to Jeff and Andrea about this at KS & OLS this year ... the
> feeling was that UML was too much overhead, but there were various ways
> to reduce that, especially if the underlying OS had UML support (doesn't
> require it right now).
>
> I'd really like to see the performance proved to be better before basing
> a design on UML, though that was my first instinct of how to do it ...

I agree that UML won't be able to show the performance advantages (the
fact that the UML kernel can't control the cache footprint on the CPUs
because it gets swapped from one to another at the host OS's convenience
is just one issue here).

However, with UML you should be able to develop the tools and features
to start to weld the two different kernels into a single logical image.
Once people have a handle on how these tools work you can then try them
on some hardware that has a lower-level partitioning setup (i.e. the IBM
mainframes) and do real speed comparisons between one kernel that's
given X CPUs and Y memory and two kernels that are each given X/2 CPUs
and Y/2 memory.

The fact that common hardware doesn't nicely support the partitioning
shouldn't stop people from solving the other problems.

David Lang
* Re: SSI clusters on NUMA (was Re: Scaling noise)
From: Martin J. Bligh @ 2003-09-04 4:44 UTC
To: David Lang
Cc: William Lee Irwin III, Larry McVoy, Alan Cox, Brown, Len, Giuliano Pochini, Larry McVoy, Linux Kernel Mailing List

>>> how much of this need could be met with a native linux master and kernels
>>> running user-mode kernels?  (your resource sharing would obviously not be
>>> that clean, but you could develop the tools to work across the kernel
>>> images this way)
>>
>> I talked to Jeff and Andrea about this at KS & OLS this year ... the
>> feeling was that UML was too much overhead, but there were various ways
>> to reduce that, especially if the underlying OS had UML support (doesn't
>> require it right now).
>>
>> I'd really like to see the performance proved to be better before basing
>> a design on UML, though that was my first instinct of how to do it ...
>
> I agree that UML won't be able to show the performance advantages (the
> fact that the UML kernel can't control the cache footprint on the CPUs
> because it gets swapped from one to another at the host OS's convenience
> is just one issue here).
>
> However, with UML you should be able to develop the tools and features
> to start to weld the two different kernels into a single logical image.
> Once people have a handle on how these tools work you can then try them
> on some hardware that has a lower-level partitioning setup (i.e. the IBM
> mainframes) and do real speed comparisons between one kernel that's
> given X CPUs and Y memory and two kernels that are each given X/2 CPUs
> and Y/2 memory.
>
> The fact that common hardware doesn't nicely support the partitioning
> shouldn't stop people from solving the other problems.

Yeah, it's definitely an interesting development environment at least.
FYI, most of the discussions in Ottawa centered around system call
overhead (4 TLB flushes per, IIRC), but the cache footprint is
interesting too ... with the O(1) sched in the underlying OS, it
shouldn't flip-flop around too easily, but interesting, nonetheless.

M.
* Re: Scaling noise
From: Martin J. Bligh @ 2003-09-04 2:31 UTC
To: William Lee Irwin III, Larry McVoy, Alan Cox, Brown, Len, Giuliano Pochini, Larry McVoy, Linux Kernel Mailing List

> On Wed, Sep 03, 2003 at 06:06:53PM -0700, Larry McVoy wrote:
>> Here's a thought.  Maybe the next kernel summit needs to have a CC cluster
>> BOF or whatever.  I'd be happy to show up, describe what it is that I see
>> and have you all try and poke holes in it.  If the net result was that you
>> walked away with the same picture in your head that I have that would be
>> cool.  Heck, I'll sponser it and buy beer and food if you like.
>
> It'd be nice if there were a prototype or something around to at least
> get a feel for whether it's worthwhile and how it behaves.
>
> Most of the individual mechanisms have other uses ranging from playing
> the good citizen under a hypervisor to just plain old filesharing, so
> it should be vaguely possible to get a couple kernels talking and
> farting around without much more than 1-2 P-Y's for bootstrapping bits
> and some unspecified amount of pain for missing pieces of the above.
>
> Unfortunately, this means
> (a) the box needs a hypervisor (or equivalent in native nomenclature)
> (b) substantial outlay of kernel hacking time (who's doing this?)
>
> I'm vaguely attached to the idea of there being _something_ to assess,
> otherwise it's difficult to ground the discussions in evidence, though
> worse comes to worse, we can break down to plotting and scheming again.

I don't think the initial development baby-steps are *too* bad, and don't
even have to be done on a NUMA box - a pair of PCs connected by 100baseT
would work.  Personally, I think the first step is to do task migration -
migrate a process without it realising from one linux instance to another.
Start without the more complex bits like shared filehandles, etc.
Something that just writes 1,2,3,4 to a file (see the sketch after this
message).  It could even just use shared root NFS, I think that works
already.

Basically swap it out on one node, and in on another, though obviously
there's more state to take across than just RAM.  I was talking to Tridge
the other day, and he said someone had hacked up something in userspace
which kinda worked ... I'll get some details.

I view UP -> SMP -> NUMA -> SSI on NUMA -> SSI on many PCs -> beowulf
cluster as a continuum ... the SSI problems are easier on NUMA, because
you can wimp out on things like shmem much easier, but it's all similar.

M.
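A sketch of the sort of trivial test process described above: it just appends a counter to a file, one number per second, so it is obvious whether the sequence survives being frozen on one instance and resumed on another. The filename and timing are arbitrary assumptions.

/* Trivial migration guinea pig: appends 1, 2, 3, 4, ... to a file, one
 * number per second.  If migration works, the sequence keeps counting
 * with no gaps or repeats after the process moves between instances. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        int i;

        for (i = 1; ; i++) {
                FILE *f = fopen("/tmp/migrate-test.log", "a");

                if (!f)
                        return 1;
                fprintf(f, "%d\n", i);
                fclose(f);      /* reopen each pass so the fd isn't long-lived state */
                sleep(1);
        }
}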
* Re: Scaling noise
From: Mike Fedyk @ 2003-09-04 2:40 UTC
To: Martin J. Bligh
Cc: William Lee Irwin III, Larry McVoy, Alan Cox, Brown, Len, Giuliano Pochini, Larry McVoy, Linux Kernel Mailing List

On Wed, Sep 03, 2003 at 07:31:13PM -0700, Martin J. Bligh wrote:
> > Unfortunately, this means
> > (a) the box needs a hypervisor (or equivalent in native nomenclature)
> > (b) substantial outlay of kernel hacking time (who's doing this?)
> >
> > I'm vaguely attached to the idea of there being _something_ to assess,
> > otherwise it's difficult to ground the discussions in evidence, though
> > worse comes to worse, we can break down to plotting and scheming again.
>
> I don't think the initial development baby-steps are *too* bad, and don't
> even have to be done on a NUMA box - a pair of PCs connected by 100baseT
> would work.  Personally, I think the first step is to do task migration -
> migrate a process without it realising from one linux instance to another.
> Start without the more complex bits like shared filehandles, etc.
> Something that just writes 1,2,3,4 to a file.  It could even just use
> shared root NFS, I think that works already.
>
> Basically swap it out on one node, and in on another, though obviously
> there's more state to take across than just RAM.  I was talking to Tridge
> the other day, and he said someone had hacked up something in userspace
> which kinda worked ... I'll get some details.
>
> I view UP -> SMP -> NUMA -> SSI on NUMA -> SSI on many PCs -> beowulf
> cluster as a continuum ... the SSI problems are easier on NUMA, because
> you can wimp out on things like shmem much easier, but it's all similar.

Am I missing something, but why hasn't openmosix been brought into this
discussion?  It looks like the perfect base for something like this.  All
that it needs is some cleanup.
* Re: Scaling noise
From: Martin J. Bligh @ 2003-09-04 2:50 UTC
To: Mike Fedyk
Cc: William Lee Irwin III, Larry McVoy, Alan Cox, Brown, Len, Giuliano Pochini, Larry McVoy, Linux Kernel Mailing List

> Am I missing something, but why hasn't openmosix been brought into this
> discussion?  It looks like the perfect base for something like this.  All
> that it needs is some cleanup.

From what I've seen, it needs a redesign.  I don't think it's maintainable
or mergeable as is ... nor would I want to work with their design.  Just
an initial gut reaction, I haven't spent a lot of time looking at it, but
from what I saw, I didn't bother looking further.

From all accounts, OpenSSI sounds more promising, but I need to spend some
more time looking at it.

M.
* Re: Scaling noise
From: Mike Fedyk @ 2003-09-04 3:49 UTC
To: Martin J. Bligh
Cc: William Lee Irwin III, Larry McVoy, Alan Cox, Brown, Len, Giuliano Pochini, Larry McVoy, Linux Kernel Mailing List

On Wed, Sep 03, 2003 at 07:50:07PM -0700, Martin J. Bligh wrote:
> From all accounts, OpenSSI sounds more promising, but I need to spend some
> more time looking at it.

No kidding.  Taking a look at the web site it does look pretty impressive.
And it is using other code that was integrated recently (linux virtual
server), as well as lustre, opengfs, etc.

This looks like they're making a lot of progress, and doing it in a
generic way.

I hope the code is as good as their documentation, and marketing...
* Re: Scaling noise
From: Steven Cole @ 2003-09-04 2:48 UTC
To: Martin J. Bligh
Cc: William Lee Irwin III, Larry McVoy, Alan Cox, Brown, Len, Giuliano Pochini, Larry McVoy, Linux Kernel Mailing List

On Wed, 2003-09-03 at 20:31, Martin J. Bligh wrote:
> I don't think the initial development baby-steps are *too* bad, and don't
> even have to be done on a NUMA box - a pair of PCs connected by 100baseT
> would work.  Personally, I think the first step is to do task migration -
> migrate a process without it realising from one linux instance to another.
> Start without the more complex bits like shared filehandles, etc.
> Something that just writes 1,2,3,4 to a file.  It could even just use
> shared root NFS, I think that works already.
>
> Basically swap it out on one node, and in on another, though obviously
> there's more state to take across than just RAM.  I was talking to Tridge
> the other day, and he said someone had hacked up something in userspace
> which kinda worked ... I'll get some details.

This project may be applicable:
http://bproc.sourceforge.net/

BProc is used here:
http://www.lanl.gov/projects/pink/

Steven
* Re: Scaling noise
2003-09-04 2:31 ` Scaling noise Martin J. Bligh
2003-09-04 2:40 ` Mike Fedyk
2003-09-04 2:48 ` Steven Cole
@ 2003-09-04 17:05 ` Daniel Phillips
2 siblings, 0 replies; 49+ messages in thread
From: Daniel Phillips @ 2003-09-04 17:05 UTC (permalink / raw)
To: Martin J. Bligh, William Lee Irwin III, Larry McVoy, Alan Cox, Brown, Len, Giuliano Pochini, Larry McVoy, Linux Kernel Mailing List

On Thursday 04 September 2003 04:31, Martin J. Bligh wrote:
> I view UP -> SMP -> NUMA -> SSI on NUMA -> SSI on many PCs -> beowulf
> cluster as a continuum ...

Nicely put. But the last step, ->beowulf, doesn't fit with the others,
because all the other steps are successive levels of virtualization that
try to preserve the appearance of a single system, whereas Beowulf drops
the pretence and lets applications worry about the boundaries, i.e., it
lacks essential SSI features. Also, the hardware changes on each of the
first four arrows and stays the same on the last one.

Regards,

Daniel

^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Scaling noise
2003-09-04 1:06 ` Larry McVoy
2003-09-04 1:10 ` Larry McVoy
2003-09-04 1:32 ` William Lee Irwin III
@ 2003-09-07 21:18 ` Eric W. Biederman
2003-09-07 23:07 ` Larry McVoy
2 siblings, 1 reply; 49+ messages in thread
From: Eric W. Biederman @ 2003-09-07 21:18 UTC (permalink / raw)
To: Larry McVoy
Cc: Martin J. Bligh, William Lee Irwin III, Alan Cox, Brown, Len, Giuliano Pochini, Linux Kernel Mailing List

Larry McVoy <lm@bitmover.com> writes:

> Here's a thought. Maybe the next kernel summit needs to have a CC cluster
> BOF or whatever. I'd be happy to show up, describe what it is that I see
> and have you all try and poke holes in it. If the net result was that you
> walked away with the same picture in your head that I have that would be
> cool. Heck, I'll sponsor it and buy beer and food if you like.

Larry, CC clusters are an idiotic development target. The development
target should be non-coherent clusters.

1) NUMA machines are smaller, more expensive, and less available than
   their non-cache-coherent counterparts.

2) If you can solve the communications problems for a non-cache-coherent
   counterpart, the solution will also work on a NUMA machine.

3) People on a NUMA machine can always punt and over-share. On a
   non-cache-coherent cluster, when people punt they don't share.
   Not sharing increases scalability and usually performance.

4) Small start-up companies can do non-coherent clusters, and can scale
   up. You have to be a substantial company to build a NUMA machine.

5) NUMA machines are slow. There is not a single NUMA machine in the
   top 10 of the top500 supercomputers list. Likely this has more to do
   with the system sizes supported by the manufacturer than with any
   inherent inferiority, but it makes a difference.

SSI is good and it helps. But that is not the primary management problem
on a large system. The larger you get, the more the imperfection of your
materials tends to become the dominant factor in management problems.
For example, I routinely reproduce cases where the BIOS does not work
around hardware bugs in a single boot that the motherboard vendors cannot
even reproduce. Another example is Google, who have given up entirely on
machines always working, and have built the software to be robust about
error detection and recovery.

And the SSI solutions are evolving. But the problems are hard. How do
you build a distributed filesystem that scales? How do you do process
migration across machines? How do you checkpoint a distributed job? How
do you properly build a cluster job scheduler? How do you handle
simultaneous similar actions by a group of nodes? How do you usefully
predict, detect, and isolate hardware failures so as not to cripple the
cluster? etc.

Eric

^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Scaling noise
2003-09-07 21:18 ` Eric W. Biederman
@ 2003-09-07 23:07 ` Larry McVoy
2003-09-07 23:47 ` Eric W. Biederman
0 siblings, 1 reply; 49+ messages in thread
From: Larry McVoy @ 2003-09-07 23:07 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Larry McVoy, Martin J. Bligh, William Lee Irwin III, Alan Cox, Brown, Len, Giuliano Pochini, Linux Kernel Mailing List

On Sun, Sep 07, 2003 at 03:18:19PM -0600, Eric W. Biederman wrote:
> Larry McVoy <lm@bitmover.com> writes:
>
> > Here's a thought. Maybe the next kernel summit needs to have a CC cluster
> > BOF or whatever. I'd be happy to show up, describe what it is that I see
> > and have you all try and poke holes in it. If the net result was that you
> > walked away with the same picture in your head that I have that would be
> > cool. Heck, I'll sponsor it and buy beer and food if you like.
>
> Larry, CC clusters are an idiotic development target.

What a nice way to start a technical conversation.

*PLONK* on two counts: you're wrong and you're rude. Next contestant please.
--
---
Larry McVoy              lm at bitmover.com          http://www.bitmover.com/lm

^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Scaling noise
2003-09-07 23:07 ` Larry McVoy
@ 2003-09-07 23:47 ` Eric W. Biederman
2003-09-08 0:57 ` Larry McVoy
0 siblings, 1 reply; 49+ messages in thread
From: Eric W. Biederman @ 2003-09-07 23:47 UTC (permalink / raw)
To: Larry McVoy
Cc: Martin J. Bligh, William Lee Irwin III, Alan Cox, Brown, Len, Giuliano Pochini, Linux Kernel Mailing List

Larry McVoy <lm@bitmover.com> writes:

> On Sun, Sep 07, 2003 at 03:18:19PM -0600, Eric W. Biederman wrote:
> > Larry McVoy <lm@bitmover.com> writes:
> >
> > > Here's a thought. Maybe the next kernel summit needs to have a CC cluster
> > > BOF or whatever. I'd be happy to show up, describe what it is that I see
> > > and have you all try and poke holes in it. If the net result was that you
> > > walked away with the same picture in your head that I have that would be
> > > cool. Heck, I'll sponsor it and buy beer and food if you like.
> >
> > Larry, CC clusters are an idiotic development target.
>
> What a nice way to start a technical conversation.
>
> *PLONK* on two counts: you're wrong and you're rude. Next contestant please.

Ok. I will keep building clusters and the code that makes them work, and
you can dream.

I backed up my assertion, and can do even better. I have already built a
2304 cpu machine and am working on a 2900+ cpu machine.

The software stack and that part of the idea are reasonable, but your
target hardware is just plain rare, and expensive. If you don't get the
commodity OS on commodity hardware thing, I'm sorry.

The thing is, for all of your talk of Dell, Dell doesn't make the hardware
you need for a CC cluster. And because the cc NUMA interface requires a
manufacturer to make chips and boards, I have a hard time seeing cc NUMA
hardware being a commodity any time soon.

Eric

^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Scaling noise
2003-09-07 23:47 ` Eric W. Biederman
@ 2003-09-08 0:57 ` Larry McVoy
2003-09-08 3:55 ` Eric W. Biederman
2003-09-08 4:47 ` Stephen Satchell
0 siblings, 2 replies; 49+ messages in thread
From: Larry McVoy @ 2003-09-08 0:57 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Larry McVoy, Martin J. Bligh, William Lee Irwin III, Alan Cox, Brown, Len, Giuliano Pochini, Linux Kernel Mailing List

On Sun, Sep 07, 2003 at 05:47:04PM -0600, Eric W. Biederman wrote:
> I have already built a 2304 cpu machine and am working on a 2900+ cpu
> machine.

That's not "a machine", that's ~1150 machines on a network. This business
of describing a bunch of boxes on a network as "a machine" is nonsense.

Don't get me wrong, I love clusters, in fact, I think what you are doing
is great. It doesn't screw up the OS, it forces the OS to stay lean and
mean. Goodness.

All the CC cluster stuff is about making sure that the SMP fanatics don't
screw up the OS for you. We're on the same side. Try not to be so rude
and have a bit more vision.
--
---
Larry McVoy              lm at bitmover.com          http://www.bitmover.com/lm

^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Scaling noise
2003-09-08 0:57 ` Larry McVoy
@ 2003-09-08 3:55 ` Eric W. Biederman
0 siblings, 0 replies; 49+ messages in thread
From: Eric W. Biederman @ 2003-09-08 3:55 UTC (permalink / raw)
To: Larry McVoy
Cc: Martin J. Bligh, William Lee Irwin III, Alan Cox, Brown, Len, Giuliano Pochini, Linux Kernel Mailing List

Larry McVoy <lm@bitmover.com> writes:

> On Sun, Sep 07, 2003 at 05:47:04PM -0600, Eric W. Biederman wrote:
> > I have already built a 2304 cpu machine and am working on a 2900+ cpu
> > machine.
>
> That's not "a machine", that's ~1150 machines on a network. This business
> of describing a bunch of boxes on a network as "a machine" is nonsense.

Every bit as much as describing a scalable NUMA box with replaceable nodes
as a single machine is nonsense. When things are built and run as a single
machine, it is a single machine. The fact that you standardized parts used
many times does not change that. The only real difference is cache
coherency, and the price.

I won't argue that at the lowest end the vendor delivers you a pile of
boxes and walks away, at which point you must do everything yourself and
it is a real maintenance pain. But that is the lowest end and certainly
not what I sell. The systems are built and tested as a single machine
before delivery.

> Don't get me wrong, I love clusters, in fact, I think what you are doing
> is great. It doesn't screw up the OS, it forces the OS to stay lean and
> mean. Goodness.
>
> All the CC cluster stuff is about making sure that the SMP fanatics don't
> screw up the OS for you. We're on the same side. Try not to be so rude
> and have a bit more vision.

And I agree, except on some small details. Although I have yet to see the
large-way SMP folks causing problems.

But as far as doing the work goes, there are two different ends the work
can be started from:
a) SMP and make the locks finer grained.
b) Cluster and add the few necessary locks.

Both solutions run fine on a NUMA machine. And both eventually lead to
good SSI solutions. But except for some magic piece that only works on
cc NUMA nodes, you can develop all of the SSI software on an ordinary
cluster. On an ordinary cluster that is the only option, and so the people
with clusters are going to do the work. The only reason you don't see more
SSI work out of the cluster guys is that they are willing to sacrifice
some coherency for scalability. But mostly it is because clusters are only
slowly catching on.

So, assuming the non-coherent cluster guys do their part, you get SSI
software that works out of the box and does everything except optimize the
page cache for the shared physical hardware. And the software will scale
awesomely because each generation of cluster hardware is larger than the
last. The only piece that is unique is CCFS, which builds a shared page
cache. And even then the non-coherent cluster guys may come up with a
better solution.

So my argument is that if you are going to do it right, start with an
ordinary non-coherent cluster. Build the SSI support. Then build CCFS, the
global shared page cache, as an optimization. I fail to see how starting
with CCFS will help, or how assuming CCFS will be there will help. Unless
you think the R&D budgets of all of the non-coherent cluster guys are
insubstantial, and somehow not up to the task.

Eric

^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Scaling noise
2003-09-08 0:57 ` Larry McVoy
2003-09-08 3:55 ` Eric W. Biederman
@ 2003-09-08 4:47 ` Stephen Satchell
2003-09-08 5:25 ` Larry McVoy
1 sibling, 1 reply; 49+ messages in thread
From: Stephen Satchell @ 2003-09-08 4:47 UTC (permalink / raw)
To: Larry McVoy, Eric W. Biederman
Cc: Larry McVoy, Martin J. Bligh, William Lee Irwin III, Alan Cox, Brown, Len, Giuliano Pochini, Linux Kernel Mailing List

At 05:57 PM 9/7/2003 -0700, Larry McVoy wrote:
>That's not "a machine", that's ~1150 machines on a network. This business
>of describing a bunch of boxes on a network as "a machine" is nonsense.

Then you haven't been keeping up with Open-source projects, or the
literature. Virtual servers composed of clusters of Linux boxes on a
private network appear to be a single machine to the outside world.
Indeed, a highly scaled Web site using such a cluster is indistinguishable
from one using a mainframe-class computer (which for the past 30 years has
been a network of specialized processors working together). The difference
is that the bulk of the nodes are on a private network, not on a public
one.

Actually, the machines I have seen have been on a weave of networks, so
that as data traverses the nodes you don't get a bottleneck effect. It's a
lot different than the Illiac IV I grew up with...

Satch
--
"People who seem to have had a new idea have often just stopped having an
old idea."  -- Dr. Edwin H. Land

^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Scaling noise
2003-09-08 4:47 ` Stephen Satchell
@ 2003-09-08 5:25 ` Larry McVoy
2003-09-08 8:32 ` Eric W. Biederman
0 siblings, 1 reply; 49+ messages in thread
From: Larry McVoy @ 2003-09-08 5:25 UTC (permalink / raw)
To: Stephen Satchell
Cc: Larry McVoy, Eric W. Biederman, Martin J. Bligh, William Lee Irwin III, Alan Cox, Brown, Len, Giuliano Pochini, Linux Kernel Mailing List

On Sun, Sep 07, 2003 at 09:47:58PM -0700, Stephen Satchell wrote:
> At 05:57 PM 9/7/2003 -0700, Larry McVoy wrote:
> >That's not "a machine", that's ~1150 machines on a network. This business
> >of describing a bunch of boxes on a network as "a machine" is nonsense.
>
> Then you haven't been keeping up with Open-source projects, or the
> literature.

Err, I'm in that literature, dig a little, you'll find me. I'm quite
familiar with clustering technology. While it is great that people are
wiring up lots of machines and running MPI or whatever on them, they've
been doing that for decades. It's only a recent thing that they started
calling that "a machine". That's marketing, and it's fine marketing,
but a bunch of machines, a network, and a library does not a machine make.

Not to me it doesn't. I want to be able to exec a process and have it land
anywhere on the "machine", any CPU, I want controlling tty semantics,
if I have 2300 processes in one process group then when I hit ^Z they
had all better stop. Etc.

A collection of machines that work together is called a network of
machines, it's not one machine, it's a bunch of them. There's nothing
wrong with getting a lot of use out of a pile of networked machines,
it's a great thing. But it's no more a machine than the internet is
a machine.
--
---
Larry McVoy              lm at bitmover.com          http://www.bitmover.com/lm

^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Scaling noise
2003-09-08 5:25 ` Larry McVoy
@ 2003-09-08 8:32 ` Eric W. Biederman
0 siblings, 0 replies; 49+ messages in thread
From: Eric W. Biederman @ 2003-09-08 8:32 UTC (permalink / raw)
To: Larry McVoy
Cc: Stephen Satchell, Martin J. Bligh, William Lee Irwin III, Alan Cox, Brown, Len, Giuliano Pochini, Linux Kernel Mailing List

Larry McVoy <lm@bitmover.com> writes:

> On Sun, Sep 07, 2003 at 09:47:58PM -0700, Stephen Satchell wrote:
> > At 05:57 PM 9/7/2003 -0700, Larry McVoy wrote:
> > >That's not "a machine", that's ~1150 machines on a network. This business
> > >of describing a bunch of boxes on a network as "a machine" is nonsense.
> >
> > Then you haven't been keeping up with Open-source projects, or the
> > literature.
>
> Err, I'm in that literature, dig a little, you'll find me. I'm quite
> familiar with clustering technology. While it is great that people are
> wiring up lots of machines and running MPI or whatever on them, they've
> been doing that for decades. It's only a recent thing that they started
> calling that "a machine". That's marketing, and it's fine marketing,
> but a bunch of machines, a network, and a library does not a machine make.

Oh, so you need cache coherency to make it a machine. That being the only
difference between that and a NUMA box. Although I will state that there
is a lot more that goes into such a system than a network, and a library.
At least there is a lot more that goes into the manageable version of one.

> Not to me it doesn't. I want to be able to exec a process and have it land
> anywhere on the "machine", any CPU, I want controlling tty semantics,
> if I have 2300 processes in one process group then when I hit ^Z they
> had all better stop. Etc.

Oh wait, none of that comes with cache coherency. So the difference cannot
be cache coherency.

> A collection of machines that work together is called a network of
> machines, it's not one machine, it's a bunch of them. There's nothing
> wrong with getting a lot of use out of a pile of networked machines,
> it's a great thing. But it's no more a machine than the internet is
> a machine.

Cool, so the SGI Altix is not a machine. Nor is the SMP box over in my
lab. They are separate machines wired together with a network, and so I
better start calling them a network of machines.

As far as I can tell, which pile of hardware to call a machine is a
difference that makes no difference. Marketing, as you put it. The only
practical difference would seem to be what kind of problems you think are
worth solving for a collection of hardware.

By calling it a single machine I am saying I think it is worth solving the
single system image problem. By refusing to call it a machine you seem to
think it is a class of hardware which is not worth paying attention to.

I do think it is a class of hardware that is worth solving the hard
problems for. And I will continue to call that pile of hardware a machine
until I give up on that. I admit the hard problems have not yet been
solved but the solutions are coming.

Eric

^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Scaling noise
2003-09-03 21:51 ` William Lee Irwin III
2003-09-03 21:46 ` Martin J. Bligh
@ 2003-09-04 0:58 ` Larry McVoy
2003-09-04 1:12 ` William Lee Irwin III
1 sibling, 1 reply; 49+ messages in thread
From: Larry McVoy @ 2003-09-04 0:58 UTC (permalink / raw)
To: William Lee Irwin III, Martin J. Bligh, Alan Cox, Brown, Len, Giuliano Pochini, Larry McVoy, Linux Kernel Mailing List

On Wed, Sep 03, 2003 at 02:51:35PM -0700, William Lee Irwin III wrote:
> At some point in the past, I wrote:
> >> SSI clusters have most of the same problems,
> >> really. Managing the systems just becomes "managing the nodes" because
> >> they're not called systems, and you have to go through some (possibly
> >> automated, though not likely) hassle to figure out the right way to
> >> spread things across nodes, which virtualizes pieces to hand to which
> >> nodes running which loads, etc.
>
> On Wed, Sep 03, 2003 at 02:29:01PM -0700, Martin J. Bligh wrote:
> > That's where I disagree - it's much easier for the USER because an SSI
> > cluster works out all the load balancing shit for itself, instead of
> > pushing the problem out to userspace. It's much harder for the KERNEL
> > programmer, sure ... but we're smart ;-) And I'd rather solve it once,
> > properly, in the right place where all the right data is about all
> > the apps running on the system, and the data about the machine hardware.
>
> This is only truly feasible when the nodes are homogeneous. They will
> not be as there will be physical locality (esp. bits like device
> proximity) concerns.

Huh? The nodes are homogeneous. Devices are either local or proxied.
--
---
Larry McVoy              lm at bitmover.com          http://www.bitmover.com/lm

^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Scaling noise
2003-09-04 0:58 ` Larry McVoy
@ 2003-09-04 1:12 ` William Lee Irwin III
2003-09-04 2:49 ` Larry McVoy
0 siblings, 1 reply; 49+ messages in thread
From: William Lee Irwin III @ 2003-09-04 1:12 UTC (permalink / raw)
To: Larry McVoy, Martin J. Bligh, Alan Cox, Brown, Len, Giuliano Pochini, Larry McVoy, Linux Kernel Mailing List

On Wed, Sep 03, 2003 at 02:51:35PM -0700, William Lee Irwin III wrote:
>> This is only truly feasible when the nodes are homogeneous. They will
>> not be as there will be physical locality (esp. bits like device
>> proximity) concerns.

On Wed, Sep 03, 2003 at 05:58:22PM -0700, Larry McVoy wrote:
> Huh? The nodes are homogeneous. Devices are either local or proxied.

Virtualized devices are backed by real devices at some level, so the
distance from the node's physical location to the device's then matters.

-- wli

^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Scaling noise
2003-09-04 1:12 ` William Lee Irwin III
@ 2003-09-04 2:49 ` Larry McVoy
2003-09-04 3:15 ` William Lee Irwin III
2003-09-04 3:38 ` Nick Piggin
0 siblings, 2 replies; 49+ messages in thread
From: Larry McVoy @ 2003-09-04 2:49 UTC (permalink / raw)
To: William Lee Irwin III, Martin J. Bligh, Alan Cox, Brown, Len, Giuliano Pochini, Larry McVoy, Linux Kernel Mailing List

On Wed, Sep 03, 2003 at 06:12:53PM -0700, William Lee Irwin III wrote:
> On Wed, Sep 03, 2003 at 02:51:35PM -0700, William Lee Irwin III wrote:
> >> This is only truly feasible when the nodes are homogeneous. They will
> >> not be as there will be physical locality (esp. bits like device
> >> proximity) concerns.
>
> On Wed, Sep 03, 2003 at 05:58:22PM -0700, Larry McVoy wrote:
> > Huh? The nodes are homogeneous. Devices are either local or proxied.
>
> Virtualized devices are backed by real devices at some level, so the
> distance from the node's physical location to the device's then matters.

Go read what I've written about this. There is no sharing, devices are
local or remote. You share in the page cache only, if you want fast access
to a device you ask it to put the data in memory and you map it. It's
absolutely as fast as an SMP. With no locking.
--
---
Larry McVoy              lm at bitmover.com          http://www.bitmover.com/lm

^ permalink raw reply [flat|nested] 49+ messages in thread
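The sharing model described above can be sketched in userspace terms:
everything goes through a file-backed mapping rather than through direct
device access or message passing. The path, size, and names below are made
up for illustration and are not taken from any real CCFS implementation:

/*
 * Toy sketch: a writer fills a file-backed shared mapping and a reader
 * maps the same file, so both see the same pages through the page cache
 * with no further copying.  Run as "./demo writer" first, then "./demo".
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHARED_PATH "/tmp/ccfs-demo"   /* hypothetical shared file */
#define SHARED_SIZE 4096

int main(int argc, char **argv)
{
        int writer = (argc > 1 && strcmp(argv[1], "writer") == 0);
        int fd = open(SHARED_PATH, O_RDWR | O_CREAT, 0644);

        if (fd < 0 || ftruncate(fd, SHARED_SIZE) < 0) {
                perror("open/ftruncate");
                return 1;
        }

        char *buf = mmap(NULL, SHARED_SIZE, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        if (buf == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        if (writer)
                snprintf(buf, SHARED_SIZE, "hello from pid %d", (int)getpid());
        else
                printf("reader sees: %s\n", buf);

        munmap(buf, SHARED_SIZE);
        close(fd);
        return 0;
}

Run one copy as "writer" and another as a plain reader; both see the same
pages through the page cache, which is the piece a CC cluster would have
to extend across OS instances.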
* Re: Scaling noise
2003-09-04 2:49 ` Larry McVoy
@ 2003-09-04 3:15 ` William Lee Irwin III
0 siblings, 0 replies; 49+ messages in thread
From: William Lee Irwin III @ 2003-09-04 3:15 UTC (permalink / raw)
To: Larry McVoy, Martin J. Bligh, Alan Cox, Brown, Len, Giuliano Pochini, Larry McVoy, Linux Kernel Mailing List

On Wed, Sep 03, 2003 at 06:12:53PM -0700, William Lee Irwin III wrote:
>> Virtualized devices are backed by real devices at some level, so the
>> distance from the node's physical location to the device's then matters.

On Wed, Sep 03, 2003 at 07:49:04PM -0700, Larry McVoy wrote:
> Go read what I've written about this. There is no sharing, devices are
> local or remote. You share in the page cache only, if you want fast access
> to a device you ask it to put the data in memory and you map it. It's
> absolutely as fast as an SMP. With no locking.

Given the lack of an implementation I'm going to have to take this claim
as my opportunity to bow out of tonight's discussion. I'd love to hear
more about it when there's something more substantial to examine.

-- wli

^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Scaling noise
2003-09-04 2:49 ` Larry McVoy
2003-09-04 3:15 ` William Lee Irwin III
@ 2003-09-04 3:38 ` Nick Piggin
1 sibling, 0 replies; 49+ messages in thread
From: Nick Piggin @ 2003-09-04 3:38 UTC (permalink / raw)
To: Larry McVoy
Cc: William Lee Irwin III, Martin J. Bligh, Alan Cox, Brown, Len, Giuliano Pochini, Linux Kernel Mailing List

Larry McVoy wrote:

>On Wed, Sep 03, 2003 at 06:12:53PM -0700, William Lee Irwin III wrote:
>
>>On Wed, Sep 03, 2003 at 02:51:35PM -0700, William Lee Irwin III wrote:
>>
>>>>This is only truly feasible when the nodes are homogeneous. They will
>>>>not be as there will be physical locality (esp. bits like device
>>>>proximity) concerns.
>>>>
>>On Wed, Sep 03, 2003 at 05:58:22PM -0700, Larry McVoy wrote:
>>
>>>Huh? The nodes are homogeneous. Devices are either local or proxied.
>>>
>>Virtualized devices are backed by real devices at some level, so the
>>distance from the node's physical location to the device's then matters.
>>
>
>Go read what I've written about this. There is no sharing, devices are
>local or remote. You share in the page cache only, if you want fast access
>to a device you ask it to put the data in memory and you map it. It's
>absolutely as fast as an SMP. With no locking.
>
There is probably more to it - I'm just an interested bystander - but how
much locking does this case incur with a single kernel system? And what
happens if more than one node wants to access the device? Through a
filesystem?

^ permalink raw reply [flat|nested] 49+ messages in thread
* RE: Scaling noise
2003-09-03 18:15 ` William Lee Irwin III
2003-09-03 18:15 ` Larry McVoy
2003-09-03 18:32 ` Alan Cox
@ 2003-09-05 1:34 ` Robert White
2 siblings, 0 replies; 49+ messages in thread
From: Robert White @ 2003-09-05 1:34 UTC (permalink / raw)
To: 'William Lee Irwin III', 'Larry McVoy', 'Brown, Len', 'Giuliano Pochini', 'Larry McVoy', linux-kernel

Not to throw a flag on the play but...

Larry asks why penalize low-end systems by making the kernel
many-cpu-friendly. The implicit postulate in his question is that the
current design path is "unfair" to the single and very-small-N N-way
systems in favor of the larger-N and very-large-N niche user base. Lots of
discussions then ensue about high-end scalability and the performance
penalties of doing memory barriers and atomic actions for large numbers of
CPUs.

I'm not sure I get it. The large-end dynamics don't seem to apply to the
question, in support or rebuttal. Is there any concrete evidence to support
the inference? What really is the impact of high-end scalability issues on
the low-end machines?

It *seems* that the high-end diminishing returns due to having any one
CPU's cache invalidated by N companions is clearly decoupled from the
uni-processor model, because there are no "other" CPUs to cause the bulk
of such invalidations when there is only one processor.

It *seems* that in a single-kernel architecture, as soon as you reach
"more than one CPU", what "must be shared" ...er... must be shared.

It *seems* that what is unique to each CPU (the instance's CPU-private
data structure) is wholly unlikely to be faulted out because of the
actions of other CPUs. (That *is* part of why they are in separate spaces,
right?)

It *seems* that the high-end degradation of cache performance as N
increases is coupled singularly to the increase in N and the follow-on
multi-state competition for resources. (That is, it is the real presence
of the 64 CPUs, and not the code to accommodate them, that is the cause of
the invalidations. If, on my 2-way box, I set MAX_CPU_COUNT to 8 or 64,
the only difference at runtime is the unused memory for the 6 or 62
per-cpu data structures. The cache-invalidation and memory-barrier cost is
bounded by the two real CPUs and not the 62 empty slots.)

It *seems* that any attempt to make a large-N system cache-friendly would
involve preventing as many invalidations as possible.

And finally, it *seems* that any large-N design, e.g. one that keeps cache
invalidation to a minimum, would, by definition, directly *benefit* the
small-N systems because they would naturally also have less invalidation.
That is, "change a pointer, flush a cache" is true for all N greater than
1, yes? So if I share between 2 or 1000, it is the actions of the 2 or
1000 at runtime that exact the cost. Any design to better accommodate "up
to 1000" will naturally tend to better accommodate "up to 4".

So... What/where, if any, are the examples of something being written (or
some technique being propounded) in the current kernel design that
"penalizes" a 2-by or 4-by box in the name of making a 64-by or 128-by
machine perform better?

Clearly keeping more private data private is "better" and "worse" for its
respective reasons in a memory-footprint for cache-separation tradeoff,
but that is true for all N >= 2. Is there some concrete example(s) of SMP
code that is wrought overlarge?
I mean, all values of N between 1 and 255 fit in one byte (duh 8-) but
cache invalidations happen in more than two bytes, so having some
MAX_CPU_COUNT bounded at 65535 is only one byte more expansive and no
bytes more expensive in the cache-consistency-conflict space. At runtime,
any loop that iterates across the number of active CPUs will be clamped
at the actual number, and not the theoretical max.

I suspect that the original question is specious and generally presumes
facts not in evidence.

Of course, just because *I* cannot immediately conceive of any useful
optimization for a 128-way machine that is inherently detrimental to a
2-way or 4-way box doesn't mean that no such optimization exists.

Someone enlighten me please.

Rob White

-----Original Message-----
From: linux-kernel-owner@vger.kernel.org
[mailto:linux-kernel-owner@vger.kernel.org] On Behalf Of William Lee Irwin III
Sent: Wednesday, September 03, 2003 11:16 AM
To: Larry McVoy; Brown, Len; Giuliano Pochini; Larry McVoy; linux-kernel@vger.kernel.org
Subject: Re: Scaling noise

At some point in the past, I wrote:
>> The lines of reasoning presented against tightly coupled systems are
>> grossly flawed.

On Wed, Sep 03, 2003 at 11:05:47AM -0700, Larry McVoy wrote:
> [etc].
> Only problem with your statements is that IBM has already implemented all
> of the required features in VM. And multiple Linux instances are running
> on it today, with shared disks underneath so they don't replicate all the
> stuff that doesn't need to be replicated, and they have shared memory
> across instances.

Independent operating system instances running under a hypervisor don't
qualify as a cache-coherent cluster that I can tell; it's merely dynamic
partitioning, which is great, but nothing to do with clustering or SMP.

-- wli
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply [flat|nested] 49+ messages in thread
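To make the MAX_CPU_COUNT point above concrete, here is a small non-kernel
sketch (names made up): the per-CPU slots are sized for a generous
compile-time ceiling and padded onto separate cache lines, but the hot
loop is bounded by the number of CPUs actually detected, so the unused
slots cost static memory only, not cache traffic.

/*
 * Illustration only: a per-CPU counter array sized for a large maximum.
 * Only the slots for CPUs that actually exist are ever touched, so the
 * compile-time ceiling adds memory, not runtime invalidations.
 */
#include <stdio.h>

#define MAX_CPU_COUNT 64                /* generous compile-time ceiling */

struct percpu_stats {
        unsigned long events;
        char pad[64 - sizeof(unsigned long)];   /* one cache line per slot */
};

static struct percpu_stats stats[MAX_CPU_COUNT];
static int nr_online_cpus = 2;          /* would be detected at boot */

static unsigned long total_events(void)
{
        unsigned long sum = 0;
        int cpu;

        /* bounded by the real CPU count, not by MAX_CPU_COUNT */
        for (cpu = 0; cpu < nr_online_cpus; cpu++)
                sum += stats[cpu].events;
        return sum;
}

int main(void)
{
        stats[0].events = 3;
        stats[1].events = 5;
        printf("total: %lu\n", total_events());
        return 0;
}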
* Re: Scaling noise
2003-09-03 18:00 ` William Lee Irwin III
2003-09-03 18:05 ` Larry McVoy
@ 2003-09-03 19:11 ` Steven Cole
2003-09-03 19:36 ` William Lee Irwin III
1 sibling, 1 reply; 49+ messages in thread
From: Steven Cole @ 2003-09-03 19:11 UTC (permalink / raw)
To: William Lee Irwin III
Cc: Larry McVoy, Brown, Len, Giuliano Pochini, Larry McVoy, linux-kernel, karim

On Wed, 2003-09-03 at 12:00, William Lee Irwin III wrote:
>
> First, communication requirements originate from the applications, not
> the operating system, hence so long as there are applications with such
> requirements, the requirements for such kernels will exist. Second, the
> proposal is ignoring numerous environmental constraints, for instance,
> the system administration, colocation, and other costs of the massive
> duplication of perfectly shareable resources implied by the clustering.
> Third, the communication penalties are turned from memory access to I/O,
> which is tremendously slower by several orders of magnitude. Fourth, the
> kernel design problem is actually made harder, since no one has ever
> been able to produce a working design for these cache coherent clusters
> yet that I know of, and what descriptions of this proposal I've seen that
> are extant (you wrote some paper on it, IIRC) are too vague to be
> operationally useful.
>
> So as best as I can tell the proposal consists of using an orders-of-
> magnitude slower communication method to implement an underspecified
> solution to some research problem that to all appearances will be more
> expensive to maintain and keep running than the now extant designs.
>

You and Larry are either talking past each other, or perhaps it is I who
don't understand the not-yet-existing CC-clusters. My understanding is
that communication between nodes of a CC-cluster would be through a
shared-memory mechanism, not through much slower I/O such as a network
(even a very fast network).

From Karim Yaghmour's paper here:
http://www.opersys.com/adeos/practical-smp-clusters/
"That being said, clustering packages may make assumptions that do not
hold in the current architecture. Primarily, by having nodes so close
together, physical network latencies and problems disappear."

> I like distributed systems and clusters, and they're great to use for
> what they're good for. They're not substitutes in any way for tightly
> coupled systems, nor do they render large specimens thereof unnecessary.
>

My point is this: Currently at least one vendor (SGI) wants to scale the
kernel to 128 CPUs. As far as I know, the SGI Altix systems can be
configured up to 512 CPUs. If the Intel Tanglewood really will have 16
cores per chip, very much larger systems will be possible. Will you be
able to scale the kernel to 2048 CPUs and beyond? This may happen
during the lifetime of 2.8.x, so planning should be happening either now
or soon.

Steven

^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Scaling noise
2003-09-03 19:11 ` Steven Cole
@ 2003-09-03 19:36 ` William Lee Irwin III
0 siblings, 0 replies; 49+ messages in thread
From: William Lee Irwin III @ 2003-09-03 19:36 UTC (permalink / raw)
To: Steven Cole
Cc: Larry McVoy, Brown, Len, Giuliano Pochini, Larry McVoy, linux-kernel, karim

On Wed, 2003-09-03 at 12:00, William Lee Irwin III wrote:
>> So as best as I can tell the proposal consists of using an orders-of-
>> magnitude slower communication method to implement an underspecified
>> solution to some research problem that to all appearances will be more
>> expensive to maintain and keep running than the now extant designs.

On Wed, Sep 03, 2003 at 01:11:55PM -0600, Steven Cole wrote:
> You and Larry are either talking past each other, or perhaps it is I who
> don't understand the not-yet-existing CC-clusters. My understanding is
> that communication between nodes of a CC-cluster would be through a
> shared-memory mechanism, not through much slower I/O such as a network
> (even a very fast network).
> From Karim Yaghmour's paper here:
> http://www.opersys.com/adeos/practical-smp-clusters/
> "That being said, clustering packages may make assumptions that do not
> hold in the current architecture. Primarily, by having nodes so close
> together, physical network latencies and problems disappear."

The communication latencies will get better that way, sure.

On Wed, Sep 03, 2003 at 01:11:55PM -0600, Steven Cole wrote:
>> I like distributed systems and clusters, and they're great to use for
>> what they're good for. They're not substitutes in any way for tightly
>> coupled systems, nor do they render large specimens thereof unnecessary.

On Wed, Sep 03, 2003 at 01:11:55PM -0600, Steven Cole wrote:
> My point is this: Currently at least one vendor (SGI) wants to scale the
> kernel to 128 CPUs. As far as I know, the SGI Altix systems can be
> configured up to 512 CPUs. If the Intel Tanglewood really will have 16
> cores per chip, very much larger systems will be possible. Will you be
> able to scale the kernel to 2048 CPUs and beyond? This may happen
> during the lifetime of 2.8.x, so planning should be happening either now
> or soon.

This is not particularly exciting (or truthfully remotely interesting)
news. google for "BBN Butterfly" to see what was around ca. 1988.

-- wli

^ permalink raw reply [flat|nested] 49+ messages in thread
end of thread, other threads: [~2003-09-08 8:32 UTC | newest]

Thread overview: 49+ messages
2003-09-03  9:41 Scaling noise Brown, Len
2003-09-03 11:02 ` Geert Uytterhoeven
2003-09-03 11:19 ` Larry McVoy
2003-09-03 11:47 ` Matthias Andree
2003-09-03 18:00 ` William Lee Irwin III
2003-09-03 18:05 ` Larry McVoy
2003-09-03 18:15 ` William Lee Irwin III
2003-09-03 18:15 ` Larry McVoy
2003-09-03 18:26 ` William Lee Irwin III
2003-09-03 18:32 ` Alan Cox
2003-09-03 19:46 ` William Lee Irwin III
2003-09-03 20:13 ` Alan Cox
2003-09-03 20:31 ` William Lee Irwin III
2003-09-03 20:48 ` Martin J. Bligh
2003-09-03 21:21 ` William Lee Irwin III
2003-09-03 21:29 ` Martin J. Bligh
2003-09-03 21:51 ` William Lee Irwin III
2003-09-03 21:46 ` Martin J. Bligh
2003-09-04  0:07 ` Mike Fedyk
2003-09-04  1:06 ` Larry McVoy
2003-09-04  1:10 ` Larry McVoy
2003-09-04  1:32 ` William Lee Irwin III
2003-09-04  1:46 ` David Lang
2003-09-04  1:51 ` William Lee Irwin III
2003-09-04  2:33 ` SSI clusters on NUMA (was Re: Scaling noise) Martin J. Bligh
2003-09-04  3:02 ` David Lang
2003-09-04  4:44 ` Martin J. Bligh
2003-09-04  2:31 ` Scaling noise Martin J. Bligh
2003-09-04  2:40 ` Mike Fedyk
2003-09-04  2:50 ` Martin J. Bligh
2003-09-04  3:49 ` Mike Fedyk
2003-09-04  2:48 ` Steven Cole
2003-09-04 17:05 ` Daniel Phillips
2003-09-07 21:18 ` Eric W. Biederman
2003-09-07 23:07 ` Larry McVoy
2003-09-07 23:47 ` Eric W. Biederman
2003-09-08  0:57 ` Larry McVoy
2003-09-08  3:55 ` Eric W. Biederman
2003-09-08  4:47 ` Stephen Satchell
2003-09-08  5:25 ` Larry McVoy
2003-09-08  8:32 ` Eric W. Biederman
2003-09-04  0:58 ` Larry McVoy
2003-09-04  1:12 ` William Lee Irwin III
2003-09-04  2:49 ` Larry McVoy
2003-09-04  3:15 ` William Lee Irwin III
2003-09-04  3:38 ` Nick Piggin
2003-09-05  1:34 ` Robert White
2003-09-03 19:11 ` Steven Cole
2003-09-03 19:36 ` William Lee Irwin III