* Efficient IPC mechanism on Linux @ 2003-09-09 17:30 Luca Veraldi 2003-09-09 21:17 ` Alan Cox ` (3 more replies) 0 siblings, 4 replies; 74+ messages in thread From: Luca Veraldi @ 2003-09-09 17:30 UTC (permalink / raw) To: linux-kernel Hi all. At the web page http://web.tiscali.it/lucavera/www/root/ecbm/index.htm you can find the results of my attempt at modifying the Linux kernel sources to implement a new Inter Process Communication mechanism. It is called ECBM, for Efficient Capability-Based Messaging. There you can also find a comparison of ECBM against some other commonly used Linux IPC primitives (such as read/write on pipes or the SYS V tools). The results are quite clear. Enjoy. Luca Veraldi ---------------------------------------- Luca Veraldi Graduate Student of Computer Science at the University of Pisa veraldi@cli.di.unipi.it luca.veraldi@katamail.com ICQ# 115368178 ---------------------------------------- ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: Efficient IPC mechanism on Linux 2003-09-09 17:30 Efficient IPC mechanism on Linux Luca Veraldi @ 2003-09-09 21:17 ` Alan Cox 2003-09-09 21:57 ` Luca Veraldi [not found] ` <20030909175821.GL16080@Synopsys.COM> ` (2 subsequent siblings) 3 siblings, 1 reply; 74+ messages in thread From: Alan Cox @ 2003-09-09 21:17 UTC (permalink / raw) To: Luca Veraldi; +Cc: Linux Kernel Mailing List On Maw, 2003-09-09 at 18:30, Luca Veraldi wrote: > Hi all. > At the web page > http://web.tiscali.it/lucavera/www/root/ecbm/index.htm > you can find the results of my attempt at modifying the Linux kernel sources > to implement a new Inter Process Communication mechanism. > > It is called ECBM, for Efficient Capability-Based Messaging. > > There you can also find a comparison of ECBM > against some other commonly used Linux IPC primitives > (such as read/write on pipes or the SYS V tools). The text is a little hard to follow for an English speaker. I can follow a fair bit and there are a few things I'd take issue with (forgive me if I'm misunderstanding your material) 1. Almost all IPC is < 512 bytes long. There are exceptions, including X11 image passing, which is an important performance one 2. One of the constraints on a message system is that if it uses shared memory it must protect the memory queues from abuse. For example, if previous/next pointers are kept in the shared memory, one app may be able to trick another into modifying the wrong thing. Sometimes this matters. 3. An interesting question exists as to whether you can create the same effect with current 2.6 kernel primitives. We have posix shared memory objects, we have read only mappings and we have extremely fast task switch and also locks (futex locks) Also, from the benchmarking point of view, it is always instructive to run a set of benchmarks that involve the sender writing all the data bytes, sending it and the receiver reading them all. Without this, zero copy mechanisms (especially mmu based ones) look artificially good. ^ permalink raw reply [flat|nested] 74+ messages in thread
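A minimal sketch of the hazard in Alan's point 2. Everything here is hypothetical (no real implementation is being quoted); it only shows why linkage kept in writable shared memory is abusable:

    /* Hypothetical queue node living entirely in shared memory.
     * Both processes can write every field, including ->next. */
    #include <stddef.h>

    struct msg_node {
        struct msg_node *next;   /* linkage in shared memory: attackable */
        size_t           len;
        char             data[256];
    };

    /* Receiver-side dequeue that trusts the shared pointer. */
    struct msg_node *dequeue(struct msg_node **head)
    {
        struct msg_node *n = *head;
        if (n)
            *head = n->next;     /* a malicious peer can store a bogus
                                  * ->next here, making this process treat
                                  * arbitrary memory as a queue node and
                                  * corrupt it on later list updates */
        return n;
    }

This is why kernel-mediated IPC keeps queue metadata out of the reach of the communicating processes.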
* Re: Efficient IPC mechanism on Linux 2003-09-09 21:17 ` Alan Cox @ 2003-09-09 21:57 ` Luca Veraldi 2003-09-09 23:11 ` Alan Cox 0 siblings, 1 reply; 74+ messages in thread From: Luca Veraldi @ 2003-09-09 21:57 UTC (permalink / raw) To: Alan Cox, linux-kernel > The text is a little hard to follow for an English speaker. I can follow > a fair bit and there are a few things I'd take issue with (forgive me if > I'm misunderstanding your material) I know. I'm very sorry for this. I'm rewriting the text in English, so don't bother with the Italian. The new page will be up in a moment. > 1. Almost all IPC is < 512 bytes long. There are exceptions, including > X11 image passing, which is an important performance one Sure, for system needs. But what about programming? You can write applications of any complexity. Why do I have to be limited by the kernel to sending 4056 bytes (for example, if I use SYS V message queues) instead of an arbitrarily sized message? > 2. One of the constraints on a message system is that if it uses > shared memory it must protect the memory queues from abuse. For example, > if previous/next pointers are kept in the shared memory, one app may be > able to trick another into modifying the wrong thing. Sometimes this > matters. Please, think in terms of address spaces. If the piece of information you want to access is not in your own private address space, the only thing that will occur is a general protection fault, and your process will be killed. Remember that, with my primitives, you're working with logical addresses that are relocated in hardware. The MMU performs memory protection at the same time, too. > 3. An interesting question exists as to whether you can create the > same effect with current 2.6 kernel primitives. We have posix shared > memory objects, we have read only mappings and we have extremely fast > task switch and also locks (futex locks) I too need to use locks in my primitives. The fact that the new kernel has faster locks also reduces the completion time of my primitives. Also: the central point is not to have 10 instead of 50 assembler lines in the primitives. The central point is to implement communication primitives that do not require physical copying of the messages being sent and received. Faster locking probably reduces the time to write to or read from a pipe, but the order of magnitude remains the same. Task switching is relevant in all communication primitives. Fast task switch means proportionally fast primitives. Mine too. > Also, from the benchmarking point of view, it is always instructive to run a > set of benchmarks that involve the sender writing all the data bytes, > sending it and the receiver reading them all. Without this, zero copy > mechanisms (especially mmu based ones) look artificially good. The benchmark you are talking about is exactly the one I implemented to measure the IPC overhead. We have two processes communicating over a channel in a pipeline fashion. A writes some information in a buffer and sends it to B. B receives and reads. This, 1000 times. The times reported are averages. Bye, Luca ^ permalink raw reply [flat|nested] 74+ messages in thread
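A minimal sketch of that benchmark, for the pipe case only (the buffer size, iteration count and the one-byte acknowledgement are illustrative choices, not details taken from the thread; note the receiver also touches every byte, which matters for Alan's objection below):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/time.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define MSG_SIZE 512
    #define ITERS    1000

    int main(void)
    {
        int p2c[2], c2p[2];
        char buf[MSG_SIZE];
        struct timeval t0, t1;
        int i;

        if (pipe(p2c) || pipe(c2p)) { perror("pipe"); exit(1); }

        if (fork() == 0) {                 /* B: receive, touch, ack */
            unsigned sum = 0;
            for (i = 0; i < ITERS; i++) {
                int n, off = 0;
                while (off < MSG_SIZE &&
                       (n = read(p2c[0], buf + off, MSG_SIZE - off)) > 0)
                    off += n;
                for (n = 0; n < MSG_SIZE; n++)  /* actually read the data */
                    sum += buf[n];
                write(c2p[1], &sum, 1);         /* ack keeps it lock-step */
            }
            _exit(0);
        }

        gettimeofday(&t0, NULL);
        for (i = 0; i < ITERS; i++) {           /* A: write, send, wait */
            memset(buf, i & 0xff, MSG_SIZE);    /* touch every byte */
            write(p2c[1], buf, MSG_SIZE);
            read(c2p[0], buf, 1);
        }
        gettimeofday(&t1, NULL);
        wait(NULL);

        printf("avg per round trip: %.1f us\n",
               ((t1.tv_sec - t0.tv_sec) * 1e6 +
                (t1.tv_usec - t0.tv_usec)) / ITERS);
        return 0;
    }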
* Re: Efficient IPC mechanism on Linux 2003-09-09 21:57 ` Luca Veraldi @ 2003-09-09 23:11 ` Alan Cox 2003-09-10 9:04 ` Luca Veraldi 0 siblings, 1 reply; 74+ messages in thread From: Alan Cox @ 2003-09-09 23:11 UTC (permalink / raw) To: Luca Veraldi; +Cc: Linux Kernel Mailing List On Maw, 2003-09-09 at 22:57, Luca Veraldi wrote: > Also: the central point is not to have 10 instead of 50 assembler lines > in the primitives. The central point is to implement communication > primitives that do not require physical copying of the messages being > sent and received. The question (for smaller messages) is whether this is a win or not. While the data fits in L1 cache the copy is close to free compared with context switching and TLB overhead > We have two processes communicating over a channel in a pipeline fashion. > A writes some information in a buffer and sends it to B. > B receives and reads. > This, 1000 times. > The times reported are averages. Ok - from your example I couldn't tell if B then touches each byte of data as well as having it appear in its address space. ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: Efficient IPC mechanism on Linux 2003-09-09 23:11 ` Alan Cox @ 2003-09-10 9:04 ` Luca Veraldi 2003-09-10 12:56 ` Alan Cox 0 siblings, 1 reply; 74+ messages in thread From: Luca Veraldi @ 2003-09-10 9:04 UTC (permalink / raw) To: Alan Cox; +Cc: linux-kernel > The question (for smaller messages) is whether this is a win or not. > While the data fits in L1 cache the copy is close to free compared > with context switching and TLB overhead I can answer for messages of 512 bytes (which I consider small). This is the smallest case considered in the work. Completion times for this case are reported in the graphs: you have the time of write() on a pipe, of SYS V shmget(), and of the new primitives. It's easy to compare the numbers and say whether it is "a win or not". Surely, transferring bytes from one cache line to another is at zero cost (that is, a single clock cycle of the cache unit). But here, how can you mathematically guarantee that, using a traditional IPC mechanism, you'll find the information you want directly in cache? This is quite far from being realistic. In the real world, what happens is that copying a page of memory raises many cache misses... at least, more than copying a capability structure (16 bytes long, as reported in the article). Of course, if you want to send a single int, you probably have no reason to use a capability. That's sure. > Ok - from your example I couldn't tell if B then touches each byte of > data as well as having it appear in its address space. It is irrelevant. The times reported in the work are the times needed to complete the IPC primitives. After a send or a receive over the channel, the two processes may do everything they want. It is not interesting for us. Bye, Luca ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: Efficient IPC mechanism on Linux 2003-09-10 9:04 ` Luca Veraldi @ 2003-09-10 12:56 ` Alan Cox 0 siblings, 0 replies; 74+ messages in thread From: Alan Cox @ 2003-09-10 12:56 UTC (permalink / raw) To: Luca Veraldi; +Cc: Linux Kernel Mailing List On Mer, 2003-09-10 at 10:04, Luca Veraldi wrote: > Surely, transferring bytes from one cache line to another is at zero cost > (that is, a single clock cycle of the cache unit). More complicated - when you move stuff you end up evicting another chunk from the cache. You can't armwave that out of existence, because you yourself have created new eviction overhead for someone further down the line. > But here, how can you mathematically guarantee that, > using a traditional IPC mechanism, > you'll find the information you want directly in cache? Because programs normally pass data they've touched to other apps. Your A/B example touches the data before sending, even, but your example on receive does not, which is very atypical. > > Ok - from your example I couldn't tell if B then touches each byte of > > data as well as having it appear in its address space. > > It is irrelevant. The times reported in the work are the times needed > to complete the IPC primitives. > After a send or a receive over the channel, > the two processes may do everything they want. > It is not interesting for us. It's vital to include it in the measurement of all three. Without it your measurements are meaningless. Look at it as an engineer: IPC involves writing to memory, making the data appear in the other task, and then reading it. You also want to measure things like the throughput of your IPC tasks running parallel to other processes doing stuff like pure memory bandwidth tests, so you can measure the impact you have on the system, although in this case that's probably less relevant by far ^ permalink raw reply [flat|nested] 74+ messages in thread
[parent not found: <20030909175821.GL16080@Synopsys.COM>]
[parent not found: <001d01c37703$8edc10e0$36af7450@wssupremo>]
[parent not found: <20030910064508.GA25795@Synopsys.COM>]
* Re: Efficient IPC mechanism on Linux [not found] ` <20030910064508.GA25795@Synopsys.COM> @ 2003-09-10 9:18 ` Luca Veraldi 2003-09-10 9:23 ` Arjan van de Ven 2003-09-10 11:52 ` Alex Riesen 0 siblings, 2 replies; 74+ messages in thread From: Luca Veraldi @ 2003-09-10 9:18 UTC (permalink / raw) To: alexander.riesen; +Cc: linux-kernel > I mean for benchmarks. You compared your implementation to a very old > linux' one. There were big changes in these areas. The overhead implied by a memcpy() is the same, in the order of magnitude, ***whatever*** kernel version you develop. If you have to send, let's say, 3 memory pages from A to B, using pipes you have to ***physically copy*** 2 * 3 * PAGE_SIZE bytes (first sending, then receiving them: so two times). Which evaluates to 8192 * 3. Using capabilities, the only thing you have to copy is an amount of bytes that depends linearly on the number of pages: so, proportional to 3. If you want the (quite) exact amount, it is 3 * sizeof(unsigned long) + sizeof(capability), which evaluates to 12 + 16 = 28. The message size X itself does not appear in the equation. I compared pipes and SYS V IPC on Linux 2.2.4 with the new mechanism, which was also developed over kernel 2.2.4! The kernel-support inefficiency you are talking about affects all the primitives examined in the article. Mine, too. So the relative results will be much the same. > you surely know, that it is just an implementation. The mechanisms have > always been there, evolved from UNIX. > That is mainly the reason for them to exists: support the applications > which use them. > > Will it be possible to base existing facilities on your approach? > SVR5 messages (msg{get,snd,rcv}), for example? > (I have to ask, because I cannot understand the article, sorry) Ah, ok. So let's continue to do inefficient things only because it has always been so! Compatibility is not a problem. Simply rewrite the write() and read() for pipes in order to make them do the same thing done by zc_send() and zc_receive(). Or, if you are not referring to pipes, rewrite the support level of your ancient IPC primitives in order to make them do the same thing done by zc_send() and zc_receive(). Bye, Luca. ^ permalink raw reply [flat|nested] 74+ messages in thread
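Luca's arithmetic made concrete. The field layout below is a guess, chosen only to be consistent with the sizes quoted in the thread (a 16-byte capability and a 4-byte unsigned long on 32-bit x86); the real ECBM structure is not shown anywhere in the discussion:

    /* Hypothetical shape of a 16-byte capability; the actual ECBM
     * definition is not given in the thread. */
    struct capability {
        unsigned long base;      /* 4 bytes: start of the region    */
        unsigned long npages;    /* 4 bytes: number of pages        */
        unsigned long rights;    /* 4 bytes: access rights          */
        unsigned long owner;     /* 4 bytes: owning process/channel */
    };

    /* Sending 3 pages through a pipe copies the payload twice:
     *   2 * 3 * PAGE_SIZE = 2 * 3 * 4096 = 24576 bytes.
     * Sending the same pages by capability copies only metadata:
     *   3 * sizeof(unsigned long) + sizeof(struct capability)
     *   = 12 + 16 = 28 bytes, independent of the payload size. */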
* Re: Efficient IPC mechanism on Linux 2003-09-10 9:18 ` Luca Veraldi @ 2003-09-10 9:23 ` Arjan van de Ven 2003-09-10 9:40 ` Luca Veraldi 2003-09-10 9:52 ` Jamie Lokier 1 sibling, 2 replies; 74+ messages in thread From: Arjan van de Ven @ 2003-09-10 9:23 UTC (permalink / raw) To: Luca Veraldi; +Cc: alexander.riesen, linux-kernel On Wed, 2003-09-10 at 11:18, Luca Veraldi wrote: > The overhead implied by a memcpy() is the same, in the order of magnitude, > ***whatever*** kernel version you develop. yes a copy of a page is about 3000 to 4000 cycles on an x86 box in the uncached case. A pagetable operation (like the cpu setting the accessed or dirty bit) is in that same order I suspect (maybe half this, but not a lot less). Changing pagetable content is even more because all the tlb's and internal cpu state will need to be flushed... which is also a microcode operation for the cpu. And it's deadly in an SMP environment. ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: Efficient IPC mechanism on Linux 2003-09-10 9:23 ` Arjan van de Ven @ 2003-09-10 9:40 ` Luca Veraldi 2003-09-10 9:44 ` Arjan van de Ven 2003-09-10 12:47 ` Efficient IPC mechanism on Linux Alan Cox 1 sibling, 2 replies; 74+ messages in thread From: Luca Veraldi @ 2003-09-10 9:40 UTC (permalink / raw) To: arjanv; +Cc: linux-kernel > yes a copy of a page is about 3000 to 4000 cycles on an x86 box in the > uncached case. A pagetable operation (like the cpu setting the accessed > or dirty bit) is in that same order I suspect (maybe half this, but not > a lot less). Probably you don't know what you're talking about. I don't know where you studied computer architectures, but... Let's answer. To set the accessed or dirty bit you use

    __asm__ __volatile__( LOCK_PREFIX
            "btsl %1,%0"
            :"=m" (ADDR)
            :"Ir" (nr));

which is a ***SINGLE CLOCK CYCLE*** of the CPU. I really don't think that on any machine firmware a btsl will require 4000 cycles. Not even on Intel x86. > Changing pagetable content is even more because all the > tlb's and internal cpu state will need to be flushed... which is also a > microcode operation for the cpu. Good. The same overhead you will find accessing a message after a read from a pipe. There will occur many TLB faults. And the same applies when copying the message to the pipe. Many, many TLB faults. > And it's deadly in an SMP environment. You say "tlb's and internal cpu state will need to be flushed". The other CPUs in an SMP environment can continue to work independently. TLBs and CPU state registers are ***PER-CPU*** resources. Probably, the case of copying a memory page is worse, because you have to hold some global lock all the time. This is deadly in an SMP environment, because the critical section lasts for thousands of cycles, instead of just a few. ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: Efficient IPC mechanism on Linux 2003-09-10 9:40 ` Luca Veraldi @ 2003-09-10 9:44 ` Arjan van de Ven 2003-09-10 10:09 ` Luca Veraldi 0 siblings, 1 reply; 74+ messages in thread From: Arjan van de Ven @ 2003-09-10 9:44 UTC (permalink / raw) To: Luca Veraldi; +Cc: arjanv, linux-kernel On Wed, Sep 10, 2003 at 11:40:33AM +0200, Luca Veraldi wrote: > To set the accessed or dirty bit you use
>
>     __asm__ __volatile__( LOCK_PREFIX
>             "btsl %1,%0"
>             :"=m" (ADDR)
>             :"Ir" (nr));
>
> which is a ***SINGLE CLOCK CYCLE*** of the CPU. I really don't think > that on any machine firmware a btsl will require 4000 cycles. Not even > on Intel x86. For fun do the measurement on a pIV cpu. You'll be surprised. The microcode "mark dirty" (which is NOT a btsl, it gets done when you do a write memory access to the page content) result will be in the 2000 to 4000 range I predict. There are things like SMP synchronisation to do, but also if the cpu marks a page dirty in the page table, that means the page table changes, which means the pagetable needs to be marked in the PMD. Which means the PMD changes, which means the PGD needs the PMD marked dirty. Etc etc. It's microcode. It'll take several 1000 cycles. > Good. The same overhead you will find accessing a message after a read > from a pipe. There will occur many TLB faults. And the same applies > when copying the message to the pipe. Many, many TLB faults. A TLB fault in the normal case is about 7 cycles. But that's for a TLB entry not being present. Writing to a page whose TLB entry IS present means going to microcode. > You say "tlb's and internal cpu state will need to be flushed". > The other CPUs in an SMP environment can continue to work independently. > TLBs and CPU state registers are ***PER-CPU*** resources. if you change a page table, you need to flush the TLB on all other cpus that have that same page table mapped, like a thread app running on all cpu's at once with the same pagetables. > Probably, the case of copying a memory page is worse, because you have > to hold some global lock all the time. why would you need a global lock for copying memory ? ^ permalink raw reply [flat|nested] 74+ messages in thread
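A crude way to put numbers on this exchange (a sketch: x86-only, no rdtsc serialization, no warm-up or averaging, so treat the output as a rough upper bound): time the first write to a freshly mapped anonymous page, which pays for the fault, the page installation and the dirty-bit bookkeeping, against a second write to the same page.

    #include <stdio.h>
    #include <sys/mman.h>

    static inline unsigned long long rdtsc(void)
    {
        unsigned int lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((unsigned long long)hi << 32) | lo;
    }

    int main(void)
    {
        volatile char *p;
        unsigned long long t0, t1;

        p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if ((void *)p == MAP_FAILED)
            return 1;

        t0 = rdtsc();
        p[0] = 1;    /* first write: fault + install page + mark dirty */
        t1 = rdtsc();
        printf("first write to fresh page: %llu cycles\n", t1 - t0);

        t0 = rdtsc();
        p[1] = 1;    /* second write: page already present and dirty */
        t1 = rdtsc();
        printf("second write, same page:   %llu cycles\n", t1 - t0);

        munmap((void *)p, 4096);
        return 0;
    }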
* Re: Efficient IPC mechanism on Linux 2003-09-10 9:44 ` Arjan van de Ven @ 2003-09-10 10:09 ` Luca Veraldi 2003-09-10 10:14 ` Arjan van de Ven ` (3 more replies) 0 siblings, 4 replies; 74+ messages in thread From: Luca Veraldi @ 2003-09-10 10:09 UTC (permalink / raw) To: Arjan van de Ven; +Cc: linux-kernel > For fun do the measurement on a pIV cpu. You'll be surprised. > The microcode "mark dirty" (which is NOT a btsl, it gets done when you do > a write memory access to the page content) result will be in the 2000 to > 4000 range I predict. I'm not responsible for microarchitecture designer stupidity. If a simple STORE assembler instruction will eat up 4000 clock cycles, as you say here, well, I think all we Computer Scientists can go home and give it up now. > There are things like SMP synchronisation to do, but also > if the cpu marks a page dirty in the page table, that means the page table > changes, which means the pagetable needs to be marked in the > PMD. Which means the PMD changes, which means the PGD needs the PMD marked > dirty. Etc etc. It's microcode. It'll take several 1000 cycles. Please, let's return to the facts. Modifying the page contents is done by the part of the benchmark application we are not measuring. That is, the code ***BEFORE*** the zc_send() (write() on a pipe, or whatever you choose) and ***AFTER*** a zc_receive() (or read() from a pipe, or whatever else). This is nice, thanks, but outside our interest. We are only reading the relocation tables or inserting new entries into them. Not modifying page contents. > if you change a page table, you need to flush the TLB on all other cpus > that have that same page table mapped, like a thread app running > on all cpu's at once with the same pagetables. Ok. Simply, this is not the case in my experiment. This does not apply. We have no threads, only detached process address spaces. Threads are a bit different from processes. > why would you need a global lock for copying memory ? System call sys_write() calls locks_verify_area(), which calls locks_mandatory_area(), which calls lock_kernel(). Oops... global spin_lock locking... SMP is crying... But not so much, if the lock doesn't last long, right? THIS IS NOT WHAT WE CALL A SHORT LOCK SECTION:

    repeat:
        /* Search the lock list for this inode for locks that conflict with
         * the proposed read/write.
         */
        for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
            if (!(fl->fl_flags & FL_POSIX))
                continue;
            if (fl->fl_start > new_fl->fl_end)
                break;
            if (posix_locks_conflict(new_fl, fl)) {
                error = -EAGAIN;
                if (filp && (filp->f_flags & O_NONBLOCK))
                    break;
                error = -EDEADLK;
                if (posix_locks_deadlock(new_fl, fl))
                    break;

                error = locks_block_on(fl, new_fl);
                if (error != 0)
                    break;

                /*
                 * If we've been sleeping someone might have
                 * changed the permissions behind our back.
                 */
                if ((inode->i_mode & (S_ISGID | S_IXGRP)) != S_ISGID)
                    break;
                goto repeat;
            }
        }

From the source code of your Operating System. ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: Efficient IPC mechanism on Linux 2003-09-10 10:09 ` Luca Veraldi @ 2003-09-10 10:14 ` Arjan van de Ven 2003-09-10 10:25 ` Luca Veraldi 0 siblings, 1 reply; 74+ messages in thread From: Arjan van de Ven @ 2003-09-10 10:14 UTC (permalink / raw) To: Luca Veraldi; +Cc: Arjan van de Ven, linux-kernel On Wed, Sep 10, 2003 at 12:09:21PM +0200, Luca Veraldi wrote: > > For fun do the measurement on a pIV cpu. You'll be surprised. > > The microcode "mark dirty" (which is NOT a btsl, it gets done when you do > > a write memory access to the page content) result will be in the 2000 to > > 4000 range I predict. > > I'm not responsible for microarchitecture designer stupidity. > If a simple STORE assembler instruction will eat up 4000 clock cycles, > as you say here, well, I think all we Computer Scientists can go home and > give it up now. I'm saying it can. I don't want to go too deep into an argument about microarchitectural details, but my point was that a memory copy of a page is NOT super expensive relative to several other effects that have to do with pagetable manipulations. > > if you change a page table, you need to flush the TLB on all other cpus > > that have that same page table mapped, like a thread app running > > on all cpu's at once with the same pagetables. > > Ok. Simply, this is not the case in my experiment. > This does not apply. > We have no threads, only detached process address spaces. > Threads are a bit different from processes. but the pipe code cannot know this so it has to do a cross cpu invalidate. > > why would you need a global lock for copying memory ? > > System call sys_write() calls > locks_verify_area(), which calls > locks_mandatory_area(), which calls > lock_kernel(). and... which is also releasing it before the copy ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: Efficient IPC mechanism on Linux 2003-09-10 10:14 ` Arjan van de Ven @ 2003-09-10 10:25 ` Luca Veraldi 2003-09-12 18:41 ` Timothy Miller 0 siblings, 1 reply; 74+ messages in thread From: Luca Veraldi @ 2003-09-10 10:25 UTC (permalink / raw) To: Arjan van de Ven; +Cc: linux-kernel > I'm saying it can. I don't want to go too deep into an argument about > microarchitectural details, but my point was that a memory copy of a page > is NOT super expensive relative to several other effects that have to do > with pagetable manipulations. Sorry, but I cannot believe it. Reading a page table entry and storing it into a struct capability is not comparable at all with the "for" loop needed to move bytes all around memory. > but the pipe code cannot know this so it has to do a cross cpu invalidate. Sorry for you. Not knowing does not justify it. It's inefficient. > and... which is also releasing it before the copy Oh, yes. After wasting thousands of cycles, sure. Bye, Luca. ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: Efficient IPC mechanism on Linux 2003-09-10 10:25 ` Luca Veraldi @ 2003-09-12 18:41 ` Timothy Miller 2003-09-12 19:05 ` Luca Veraldi 0 siblings, 1 reply; 74+ messages in thread From: Timothy Miller @ 2003-09-12 18:41 UTC (permalink / raw) To: Luca Veraldi; +Cc: Arjan van de Ven, linux-kernel Luca Veraldi wrote: > > Sorry, but I cannot believe it. > Reading a page tagle entry and storing in into a struct capability is not > comparable at all with the "for" needed to move bytes all around memory. Pardon my ignorance here, but the impression I get is that changing a page table entry is not as simple as just writing to a bit somewhere. I suppose it is if the page descriptor is not loaded into the TLB, but if it is, then you have to ensure that the TLB entry matches the page table; this may not be a quick operation. I can think of a lot of other possible complications to this. ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: Efficient IPC mechanism on Linux 2003-09-12 18:41 ` Timothy Miller @ 2003-09-12 19:05 ` Luca Veraldi 2003-09-12 22:37 ` Alan Cox 0 siblings, 1 reply; 74+ messages in thread From: Luca Veraldi @ 2003-09-12 19:05 UTC (permalink / raw) To: Timothy Miller; +Cc: Arjan van de Ven, linux-kernel > Pardon my ignorance here, but the impression I get is that changing a > page table entry is not as simple as just writing to a bit somewhere. I > suppose it is if the page descriptor is not loaded into the TLB, but if > it is, then you have to ensure that the TLB entry matches the page > table; this may not be a quick operation. > > I can think of a lot of other possible complications to this. I stopped writing about this question a while ago. However, here we are. If you modify the page tables as in my example (and then only B's page table), you can be sure the entries you're modifying are not yet present in the TLBs. Because those page table entries refer to a logical address interval you've just allocated in B's address space. Bye, Luca ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: Efficient IPC mechanism on Linux 2003-09-12 19:05 ` Luca Veraldi @ 2003-09-12 22:37 ` Alan Cox 0 siblings, 0 replies; 74+ messages in thread From: Alan Cox @ 2003-09-12 22:37 UTC (permalink / raw) To: Luca Veraldi; +Cc: Timothy Miller, Arjan van de Ven, Linux Kernel Mailing List On Gwe, 2003-09-12 at 20:05, Luca Veraldi wrote: > If you modify the page tables as in my example (and then only B's page > table), you can be sure the entries you're modifying are not yet present > in the TLBs. > > Because those page table entries refer to a logical address interval > you've just allocated in B's address space. They may be present in some form, but you are right, the task switch will flush them in the non-SMP case on the x86 platform. In the SMP case it won't work. Alan ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: Efficient IPC mechanism on Linux 2003-09-10 10:09 ` Luca Veraldi 2003-09-10 10:14 ` Arjan van de Ven @ 2003-09-10 12:50 ` Alan Cox 2003-09-10 19:16 ` Shawn 2003-09-10 20:05 ` Rik van Riel 3 siblings, 0 replies; 74+ messages in thread From: Alan Cox @ 2003-09-10 12:50 UTC (permalink / raw) To: Luca Veraldi; +Cc: Arjan van de Ven, Linux Kernel Mailing List On Mer, 2003-09-10 at 11:09, Luca Veraldi wrote: > SMP is crying... > But not so much, if the lock doesn't last long, right? > > THIS IS NOT WHAT WE CALL A SHORT LOCK SECTION: This is the difference between science and engineering. It's a path that isn't normally taken, and when taken it normally runs almost no loop iterations. ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: Efficient IPC mechanism on Linux 2003-09-10 10:09 ` Luca Veraldi 2003-09-10 10:14 ` Arjan van de Ven 2003-09-10 12:50 ` Alan Cox @ 2003-09-10 19:16 ` Shawn 2003-09-10 20:05 ` Rik van Riel 3 siblings, 0 replies; 74+ messages in thread From: Shawn @ 2003-09-10 19:16 UTC (permalink / raw) To: Luca Veraldi; +Cc: Arjan van de Ven, linux-kernel On Wed, 2003-09-10 at 05:09, Luca Veraldi wrote: > > For fun do the measurement on a pIV cpu. You'll be surprised. > > The microcode "mark dirty" (which is NOT a btsl, it gets done when you do > > a write memory access to the page content) result will be in the 2000 to > > 4000 range I predict. > > I'm not responsible for microarchitecture designer stupidity. > If a simple STORE assembler instruction will eat up 4000 clock cycles, > as you say here, well, I think all we Computer Scientists can go home and > give it up now. Unfortunately you are responsible for working with the constructs reality has given you. Giving up is childish. Where there is a loss, there's a chance it was a tradeoff for a win elsewhere. ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: Efficient IPC mechanism on Linux 2003-09-10 10:09 ` Luca Veraldi ` (2 preceding siblings ...) 2003-09-10 19:16 ` Shawn @ 2003-09-10 20:05 ` Rik van Riel 2003-09-17 9:52 ` Rik's list of CS challenges Terje Eggestad 3 siblings, 1 reply; 74+ messages in thread From: Rik van Riel @ 2003-09-10 20:05 UTC (permalink / raw) To: Luca Veraldi; +Cc: Arjan van de Ven, linux-kernel On Wed, 10 Sep 2003, Luca Veraldi wrote: > I'm not responsible for microarchitecture designer stupidity. > If a simple STORE assembler instruction will eat up 4000 clock cycles, > as you say here, well, If current trends continue, a L2 cache miss will be taking 5000 cycles in 15 to 20 years. > I think all we Computer Scientists can go home and give it up now. While I have seen some evidence of computer scientists going home and ignoring the problems presented to them by current hardware constraints, I'd really prefer it if they took up the challenge and did the research on how we should deal with hardware in the future. In fact, I've made up a little (incomplete) list of things that I suspect are in need of serious CS research, because otherwise both OS theory and practice will be unable to deal with the hardware of the next decade. http://surriel.com/research_wanted/ If you have any suggestions for the list, please let me know. -- "Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." - Brian W. Kernighan ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: Rik's list of CS challenges 2003-09-10 20:05 ` Rik van Riel @ 2003-09-17 9:52 ` Terje Eggestad 2003-09-17 13:40 ` Alan Cox 0 siblings, 1 reply; 74+ messages in thread From: Terje Eggestad @ 2003-09-17 9:52 UTC (permalink / raw) To: Rik van Riel; +Cc: linux-kernel I got one: I've been paying attention to MRAM http://www.research.ibm.com/resources/news/20030610_mram.shtml (and the other NV RAMs coming up, most notably the optical/polymer http://www.pcw.co.uk/news/1143633 techs). What interests me is that, with multiple candidates in the pipeline, there is a high probability that we'll have a non-volatile DRAM replacement hitting the mainstream computer market within 5-10 years. You have the obvious PDA/laptop application, which I'll not dwell on. You have the HA server application where you want fast reboot on power failure etc. I'd expect that there will be a demand for never rebooting, only seamlessly continuing. Booting is after all a "hack" to get around:

    - OS bugs
    - an easy way of detecting new HW and configuring the OS according to the HW
    - an easy way to upgrade the kernel

Hotplug demands are already rendering point 2 obsolete. Upgrading the kernel while running would be a cool feature; one of the promises of the microkernel archs in the early 90's. Is it possible to "reboot" the kernel without affecting the running apps on the machine? What becomes more interesting is that while you may have NV RAM, it's not likely that MRAM is viable on the processor chip. The manufacturing process may be too expensive, or outright impossible (polymers on chips that hold 80 degrees C are not likely), leaving you with volatile registers and cache but NV main RAM.

    - should the OS now schedule "paging" between cache and RAM?
    - atomic syncing/updates of regs/cache to RAM? (checkpointing)

A merge of FS and RAM? (didn't the AS/400 have mmap'ed disks?) And all the applications I haven't even thought of. TJ On Wed, 2003-09-10 at 22:05, Rik van Riel wrote: > On Wed, 10 Sep 2003, Luca Veraldi wrote: > > > I'm not responsible for microarchitecture designer stupidity. > > If a simple STORE assembler instruction will eat up 4000 clock cycles, > > as you say here, well, > > If current trends continue, a L2 cache miss will be > taking 5000 cycles in 15 to 20 years. > > > I think all we Computer Scientists can go home and give it up now. > > While I have seen some evidence of computer scientists > going home and ignoring the problems presented to them > by current hardware constraints, I'd really prefer it > if they took up the challenge and did the research on > how we should deal with hardware in the future. > > In fact, I've made up a little (incomplete) list of > things that I suspect are in need of serious CS research, > because otherwise both OS theory and practice will be > unable to deal with the hardware of the next decade. > > http://surriel.com/research_wanted/ > > If you have any suggestions for the list, please let > me know. -- Terje Eggestad Senior Software Engineer dir. +47 22 62 89 61 mob. +47 975 31 57 fax. +47 22 62 89 51 terje.eggestad@scali.com Scali - www.scali.com High Performance Clustering ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: Rik's list of CS challenges 2003-09-17 9:52 ` Rik's list of CS challenges Terje Eggestad @ 2003-09-17 13:40 ` Alan Cox 2003-09-18 8:26 ` Helge Hafting 0 siblings, 1 reply; 74+ messages in thread From: Alan Cox @ 2003-09-17 13:40 UTC (permalink / raw) To: Terje Eggestad; +Cc: Rik van Riel, Linux Kernel Mailing List On Mer, 2003-09-17 at 10:52, Terje Eggestad wrote: > What becomes more interesting is that while you may have NV RAM, it's not > likely that MRAM is viable on the processor chip. The manufacturing > process may be too expensive, or outright impossible (polymers on chips > that hold 80 degrees C are not likely), leaving you with volatile > registers and cache but NV main RAM. We effectively handle that case now with the suspend-to-ram feature. > A merge of FS and RAM? (didn't the AS/400 have mmap'ed disks?) Persistent storage systems. These tend to look very unlike Linux because they throw out the idea of a file system as such. The issues with debugging them if they break, and with backups, make my head hurt, but other folk seem to think they are solved problems ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: Rik's list of CS challenges 2003-09-17 13:40 ` Alan Cox @ 2003-09-18 8:26 ` Helge Hafting 0 siblings, 0 replies; 74+ messages in thread From: Helge Hafting @ 2003-09-18 8:26 UTC (permalink / raw) To: Alan Cox; +Cc: Terje Eggestad, Rik van Riel, Linux Kernel Mailing List Alan Cox wrote: > On Mer, 2003-09-17 at 10:52, Terje Eggestad wrote: > >>What becomes more interesting is that while you may have NV RAM, it's not >>likely that MRAM is viable on the processor chip. The manufacturing >>process may be too expensive, or outright impossible (polymers on chips >>that hold 80 degrees C are not likely), leaving you with volatile >>registers and cache but NV main RAM. > > We effectively handle that case now with the suspend-to-ram feature. > >>A merge of FS and RAM? (didn't the AS/400 have mmap'ed disks?) > > Persistent storage systems. These tend to look very unlike Linux because > they throw out the idea of a file system as such. The issues with > debugging them if they break, and with backups, make my head hurt, but > other folk seem to think they are solved problems I see no reason to get rid of file systems - they provide a nice and established way of organizing data. Persistent RAM offers other advantages though - no need for disk caching, mmap simply gives access to the memory, execute without loading first. Helge Hafting ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: Efficient IPC mechanism on Linux 2003-09-10 9:40 ` Luca Veraldi 2003-09-10 9:44 ` Arjan van de Ven @ 2003-09-10 12:47 ` Alan Cox 2003-09-10 13:56 ` Luca Veraldi 1 sibling, 1 reply; 74+ messages in thread From: Alan Cox @ 2003-09-10 12:47 UTC (permalink / raw) To: Luca Veraldi; +Cc: arjanv, Linux Kernel Mailing List On Mer, 2003-09-10 at 10:40, Luca Veraldi wrote: > To set the accessed or dirty bit you use
>
>     __asm__ __volatile__( LOCK_PREFIX
>             "btsl %1,%0"
>             :"=m" (ADDR)
>             :"Ir" (nr));
>
> which is a ***SINGLE CLOCK CYCLE*** of the CPU. It's a _lot_ more than that in the normal case. Upwards of 60 clocks on a PIV. You then need to reload cr3 if you touched permissions, which means every cached TLB in the system is lost, and you may need to do a cross CPU IPI on SMP (which takes a long time) > You say "tlb's and internal cpu state will need to be flushed". > The other CPUs in an SMP environment can continue to work independently. > TLBs and CPU state registers are ***PER-CPU*** resources. Think of a threaded app passing a message to another app. You have to do the cross CPU flush in order to prevent races where another thread can scribble on data it doesn't own. Assuming a reasonable TLB reuse rate that's 120-plus TLB reloads. The newer CPUs cache TLBs in L1/L2 so that's not too bad, but it all adds up. On SMP it's a real pain > Probably, the case of copying a memory page is worse, because you have > to hold some global lock all the time. > This is deadly in an SMP environment, You don't need a global lock to copy memory. One thing I do agree with you on is the API aspect - and that is problematic. The current API leaves data also owned by the source. If I write "fred" down a pipe I still own the "fred" bits. The method you propose was added long ago to Solaris (see "doors") and it's not exactly the most used interface even there. ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: Efficient IPC mechanism on Linux 2003-09-10 12:47 ` Efficient IPC mechanism on Linux Alan Cox @ 2003-09-10 13:56 ` Luca Veraldi 2003-09-10 15:59 ` Alan Cox 0 siblings, 1 reply; 74+ messages in thread From: Luca Veraldi @ 2003-09-10 13:56 UTC (permalink / raw) To: Alan Cox; +Cc: linux-kernel > > Probably, the case of copying a memory page is worse, because you have > > to hold some global lock all the time. > > This is deadly in an SMP environment, > > You don't need a global lock to copy memory. See the sources. You need to lock_kernel() for a large number of instructions, like in many other kernel functions. There are kernel locks all around the sources. So don't come here and talk about locking inefficiency. Because the poor scalability of Linux on multiprocessors is a reality nobody can hide. With or without my primitives. > One thing I do agree with you on is the API aspect - and that is > problematic. The current API leaves data also owned by the source. > If I write "fred" down a pipe I still own the "fred" bits. Even with my API, data doesn't disappear magically from the sender's logical address space. Also, with minor changes, you can use my primitives to implement a one-copy channel, if shared memory among processes sounds to you like blasphemy. Always better than the two-phase copying of the current pipe implementation. > The > method you propose was added long ago to Solaris (see "doors") and > it's not exactly the most used interface even there. Sincerely, I COMPLETELY DON'T CARE. Whether it is used or not is completely irrelevant when evaluating performance. Good bye. Luca ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: Efficient IPC mechanism on Linux 2003-09-10 13:56 ` Luca Veraldi @ 2003-09-10 15:59 ` Alan Cox 0 siblings, 0 replies; 74+ messages in thread From: Alan Cox @ 2003-09-10 15:59 UTC (permalink / raw) To: Luca Veraldi; +Cc: Linux Kernel Mailing List On Mer, 2003-09-10 at 14:56, Luca Veraldi wrote: > There are kernel locks all around the sources. > So don't come here and talk about locking inefficiency. > Because the poor scalability of Linux on multiprocessors > is a reality nobody can hide. Well, if you will read the 2.2 code, sure. If you want to argue scalability I suggest you look at the current code and the SGI Altix boxes ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: Efficient IPC mechanism on Linux 2003-09-10 9:23 ` Arjan van de Ven 2003-09-10 9:40 ` Luca Veraldi @ 2003-09-10 9:52 ` Jamie Lokier 2003-09-10 10:07 ` Arjan van de Ven ` (2 more replies) 1 sibling, 3 replies; 74+ messages in thread From: Jamie Lokier @ 2003-09-10 9:52 UTC (permalink / raw) To: Arjan van de Ven; +Cc: Luca Veraldi, alexander.riesen, linux-kernel Arjan van de Ven wrote: > > The overhead implied by a memcpy() is the same, in the order of magnitude, > > ***whatever*** kernel version you develop. > > yes a copy of a page is about 3000 to 4000 cycles on an x86 box in the > uncached case. A pagetable operation (like the cpu setting the accessed > or dirty bit) is in that same order I suspect (maybe half this, but not > a lot less). Changing pagetable content is even more because all the > tlb's and internal cpu state will need to be flushed... which is also a > microcode operation for the cpu. And it's deadly in an SMP environment. I have just done a measurement on a 366MHz PII Celeron. The test is quite crude, but these numbers are upper bounds on the timings:

mean 813 cycles to:

    page fault
    install zero page
    remove zero page

(= read every page from MAP_ANON region and unmap, repeatedly)

mean 11561 cycles to:

    page fault
    copy zero page
    install copy
    remove copy

(= write every page from MAP_ANON region and unmap, repeatedly)

I think that makes a clear case for avoiding page copies. -- Jamie ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: Efficient IPC mechanism on Linux 2003-09-10 9:52 ` Jamie Lokier @ 2003-09-10 10:07 ` Arjan van de Ven 2003-09-10 10:17 ` Luca Veraldi 2003-09-10 10:37 ` Jamie Lokier 2 siblings, 2 replies; 74+ messages in thread From: Arjan van de Ven @ 2003-09-10 10:07 UTC (permalink / raw) To: Jamie Lokier Cc: Arjan van de Ven, Luca Veraldi, alexander.riesen, linux-kernel On Wed, Sep 10, 2003 at 10:52:55AM +0100, Jamie Lokier wrote: > Arjan van de Ven wrote: > > > The overhead implied by a memcpy() is the same, in the order of magnitude, > > > ***whatever*** kernel version you develop. > > > > yes a copy of a page is about 3000 to 4000 cycles on an x86 box in the > > uncached case. A pagetable operation (like the cpu setting the accessed > > or dirty bit) is in that same order I suspect (maybe half this, but not > > a lot less). Changing pagetable content is even more because all the > > tlb's and internal cpu state will need to be flushed... which is also a > > microcode operation for the cpu. And it's deadly in an SMP environment. > > I have just done a measurement on a 366MHz PII Celeron This test is sort of the worst case against my argument:

    1) It's a cpu with low memory bandwidth
    2) It's a 1 CPU system
    3) It's a pII not pIV; the pII is way more efficient cycle wise
       for pagetable operations

^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: Efficient IPC mechanism on Linux 2003-09-10 10:07 ` Arjan van de Ven @ 2003-09-10 10:17 ` Luca Veraldi 0 siblings, 0 replies; 74+ messages in thread From: Luca Veraldi @ 2003-09-10 10:17 UTC (permalink / raw) To: Arjan van de Ven; +Cc: linux-kernel > This test is sort of the worst case against my argument: > 1) It's a cpu with low memory bandwidth > 2) It's a 1 CPU system > 3) It's a pII not pIV; the pII is way more efficient cycle wise > for pagetable operations I'm a Computer Science graduate. And there is at least one single thing I've learned in my studies. More efficient firmware support (such as an extremely wide memory bandwidth, or tens of CPUs in an SMP/NUMA, or efficient cache line transfer support) DOES NOT ALLOW YOU TO WASTE CYCLES DOING USELESS THINGS, such as TWO physical copies of a message when a process wants to cooperate with another one. We are not engineers who simply have to make things work. That's all. Luca. ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: Efficient IPC mechanism on Linux 2003-09-10 10:07 ` Arjan van de Ven 2003-09-10 10:17 ` Luca Veraldi @ 2003-09-10 10:37 ` Jamie Lokier 2003-09-10 10:41 ` Arjan van de Ven 1 sibling, 1 reply; 74+ messages in thread From: Jamie Lokier @ 2003-09-10 10:37 UTC (permalink / raw) To: Arjan van de Ven; +Cc: Luca Veraldi, alexander.riesen, linux-kernel Arjan van de Ven wrote: > > I have just done a measurement on a 366MHz PII Celeron > > This test is sort of the worst case against my argument: > 1) It's a cpu with low memory bandwidth > 2) It's a 1 CPU system > 3) It's a pII not pIV; the pII is way more efficient cycle wise > for pagetable operations I thought that later generation CPUs were supposed to have lower memory bandwidth relative to the CPU core, so CPU operations are better than copying. Hence all the fancy special memory instructions, to work around that. Not that it matters. I think the 366 Celeron is typical of a lot of computers being used today. I still use it every day, after all. -- Jamie ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: Efficient IPC mechanism on Linux 2003-09-10 10:37 ` Jamie Lokier @ 2003-09-10 10:41 ` Arjan van de Ven 2003-09-10 10:54 ` Luca Veraldi 0 siblings, 1 reply; 74+ messages in thread From: Arjan van de Ven @ 2003-09-10 10:41 UTC (permalink / raw) To: Jamie Lokier Cc: Arjan van de Ven, Luca Veraldi, alexander.riesen, linux-kernel On Wed, Sep 10, 2003 at 11:37:52AM +0100, Jamie Lokier wrote: > I thought that later generation CPUs were supposed to have lower > memory bandwidth relative to the CPU core this is far more true for random-access latency than for streaming bandwidth. Memory is sort of starting to be like disk IO in this regard. ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: Efficient IPC mechanism on Linux 2003-09-10 10:41 ` Arjan van de Ven @ 2003-09-10 10:54 ` Luca Veraldi 2003-09-10 10:54 ` Arjan van de Ven 2003-09-10 11:16 ` Nick Piggin 0 siblings, 2 replies; 74+ messages in thread From: Luca Veraldi @ 2003-09-10 10:54 UTC (permalink / raw) To: Arjan van de Ven; +Cc: linux-kernel > Memory is sort of starting to be like disk IO in this regard. Good. So the less you copy memory all around, the better you perform. ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: Efficient IPC mechanism on Linux 2003-09-10 10:54 ` Luca Veraldi @ 2003-09-10 10:54 ` Arjan van de Ven 0 siblings, 0 replies; 74+ messages in thread From: Arjan van de Ven @ 2003-09-10 10:54 UTC (permalink / raw) To: Luca Veraldi; +Cc: Arjan van de Ven, linux-kernel On Wed, Sep 10, 2003 at 12:54:32PM +0200, Luca Veraldi wrote: > > Memory is sort of starting to be like disk IO in this regard. > > Good. So the less you copy memory all around, the better you perform. Actually it's "the less discontiguous memory you touch, the better you perform". Just like a 128Kb read is about the same cost as a 4Kb disk read, the same is starting to become true for memory copies. So in the extreme case the memory copy is one operation only. ^ permalink raw reply [flat|nested] 74+ messages in thread
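A sketch of Arjan's contiguity point (sizes and the page-permutation trick are illustrative; the measured gap varies a lot by machine): copy the same number of bytes once as a single stream, then again as page-sized pieces in scattered order.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    #define TOTAL  (64 * 1024 * 1024)
    #define PAGE   4096
    #define NPAGES (TOTAL / PAGE)

    int main(void)
    {
        char *src = malloc(TOTAL), *dst = malloc(TOTAL);
        size_t i;
        clock_t t;

        if (!src || !dst)
            return 1;
        memset(src, 1, TOTAL);    /* fault all pages in up front */
        memset(dst, 1, TOTAL);

        t = clock();
        memcpy(dst, src, TOTAL);  /* one streaming copy */
        printf("contiguous: %.3fs\n", (double)(clock() - t) / CLOCKS_PER_SEC);

        t = clock();
        for (i = 0; i < NPAGES; i++) {
            /* multiplying by an odd constant mod a power of two
             * yields a permutation of the page indices */
            size_t k = (i * 2654435761u) % NPAGES;
            memcpy(dst + k * PAGE, src + k * PAGE, PAGE);
        }
        printf("scattered:  %.3fs\n", (double)(clock() - t) / CLOCKS_PER_SEC);

        free(src); free(dst);
        return 0;
    }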
* Re: Efficient IPC mechanism on Linux 2003-09-10 10:54 ` Luca Veraldi 2003-09-10 10:54 ` Arjan van de Ven @ 2003-09-10 11:16 ` Nick Piggin 2003-09-10 11:30 ` Luca Veraldi 2003-09-10 12:42 ` Alan Cox 1 sibling, 2 replies; 74+ messages in thread From: Nick Piggin @ 2003-09-10 11:16 UTC (permalink / raw) To: Luca Veraldi; +Cc: Arjan van de Ven, linux-kernel Luca Veraldi wrote: >>Memory is sort of starting to be like disk IO in this regard. >> > >Good. So the less you copy memory all around, the better you perform. > Hi Luca, There was a zero-copy pipe implementation floating around a while ago I think. Did you have a look at that? IIRC it had advantages and disadvantages over regular pipes in performance. Nick ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: Efficient IPC mechanism on Linux 2003-09-10 11:16 ` Nick Piggin @ 2003-09-10 11:30 ` Luca Veraldi 2003-09-10 11:44 ` Nick Piggin 0 siblings, 1 reply; 74+ messages in thread From: Luca Veraldi @ 2003-09-10 11:30 UTC (permalink / raw) To: Nick Piggin; +Cc: linux-kernel > Hi Luca, > There was a zero-copy pipe implementation floating around a while ago > I think. Did you have a look at that? IIRC it had advantages and > disadvantages over regular pipes in performance. Sorry, but I subscribed to this mailing list only a day ago. Advantages and disadvantages depend on what you actually implement and on how you do it. I can only say that capabilities are a disadvantage only with very, very short messages (that is, a few bytes). And this disadvantage is theoretically demonstrable. But let's say also that such elementary messages are meaningful only in the kernel and for kernel purposes. User processes are another story. Bye, Luca ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: Efficient IPC mechanism on Linux 2003-09-10 11:30 ` Luca Veraldi @ 2003-09-10 11:44 ` Nick Piggin 2003-09-10 12:14 ` Luca Veraldi 0 siblings, 1 reply; 74+ messages in thread From: Nick Piggin @ 2003-09-10 11:44 UTC (permalink / raw) To: Luca Veraldi; +Cc: linux-kernel Luca Veraldi wrote: >>Hi Luca, >>There was a zero-copy pipe implementation floating around a while ago >>I think. Did you have a look at that? IIRC it had advantages and >>disadvantages over regular pipes in performance. >> > >Sorry, but I subscribed to this mailing list only a day ago. >Advantages and disadvantages depend on what you actually implement >and on how you do it. > Yes you're right. Just search for it. It would be interesting to compare them with your implementation. > >I can only say that capabilities are a disadvantage only with very, very >short messages (that is, a few bytes). And this disadvantage is >theoretically demonstrable. > >But let's say also that such elementary messages are meaningful only in >the kernel and for kernel purposes. > >User processes are another story. > I'm not too sure, I was just pointing you to zero copy pipes because they did have some disadvantages, which is why they weren't included in the kernel. Quite possibly your mechanism doesn't suffer from these. What would really help convince everyone is, of course, "real world" benchmarks. I like your ideas though ;) ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: Efficient IPC mechanism on Linux 2003-09-10 11:44 ` Nick Piggin @ 2003-09-10 12:14 ` Luca Veraldi 0 siblings, 0 replies; 74+ messages in thread From: Luca Veraldi @ 2003-09-10 12:14 UTC (permalink / raw) To: Nick Piggin; +Cc: linux-kernel > Yes you're right. Just search for it. It would be interesting to compare > them with your implementation. > I'm not too sure, I was just pointing you to zero copy pipes because > they did have some disadvantages which is why they weren't included in > the kernel. Quite possibly your mechanism doesn't suffer from these. Ok. I'll search for them. > What would really help convince everyone is, of course, "real world" > benchmarks. I like your ideas though ;) Thanks. Luca ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: Efficient IPC mechanism on Linux 2003-09-10 11:16 ` Nick Piggin 2003-09-10 11:30 ` Luca Veraldi @ 2003-09-10 12:42 ` Alan Cox 1 sibling, 0 replies; 74+ messages in thread From: Alan Cox @ 2003-09-10 12:42 UTC (permalink / raw) To: Nick Piggin; +Cc: Luca Veraldi, Arjan van de Ven, Linux Kernel Mailing List On Mer, 2003-09-10 at 12:16, Nick Piggin wrote: > There was a zero-copy pipe implementation floating around a while ago > I think. Did you have a look at that? IIRC it had advantages and > disadvantages over regular pipes in performance. FreeBSD had one; it was lightning fast until you used the data at the other end of the pipe. I'm not sure what happened to it in the long term. ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: Efficient IPC mechanism on Linux 2003-09-10 9:52 ` Jamie Lokier 2003-09-10 10:07 ` Arjan van de Ven @ 2003-09-10 10:11 ` Luca Veraldi 2003-09-10 19:24 ` Pavel Machek 2 siblings, 0 replies; 74+ messages in thread From: Luca Veraldi @ 2003-09-10 10:11 UTC (permalink / raw) To: Jamie Lokier; +Cc: linux-kernel > mean 11561 cycles to: > > page fault > copy zero page > install copy > remove copy > > (= write every page from MAP_ANON region and unmap, repeatedly) > > I think that makes a clear case for avoiding page copies. Thanks. ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: Efficient IPC mechanism on Linux 2003-09-10 9:52 ` Jamie Lokier 2003-09-10 10:07 ` Arjan van de Ven 2003-09-10 10:11 ` Luca Veraldi @ 2003-09-10 19:24 ` Pavel Machek 2003-09-10 19:40 ` Jamie Lokier 2 siblings, 1 reply; 74+ messages in thread From: Pavel Machek @ 2003-09-10 19:24 UTC (permalink / raw) To: Jamie Lokier Cc: Arjan van de Ven, Luca Veraldi, alexander.riesen, linux-kernel Hi! > > > The overhead implied by a memcpy() is the same, in the oder of magnitude, > > > ***whatever*** kernel version you can develop. > > > > yes a copy of a page is about 3000 to 4000 cycles on an x86 box in the > > uncached case. A pagetable operation (like the cpu setting the accessed > > or dirty bit) is in that same order I suspect (maybe half this, but not > > a lot less). Changing pagetable content is even more because all the > > tlb's and internal cpu state will need to be flushed... which is also a > > microcode operation for the cpu. And it's deadly in an SMP environment. > > I have just done a measurement on a 366MHz PII Celeron. The test is > quite crude, but these numbers are upper bounds on the timings: > > mean 813 cycles to: > > page fault > install zero page > remove zero page > > (= read every page from MAP_ANON region and unmap, repeatedly) > > mean 11561 cycles to: > > page fault > copy zero page > install copy > remove copy > > (= write every page from MAP_ANON region and unmap, repeatedly) > > I think that makes a clear case for avoiding page copies. Can you make it available so we can test on, say, 900MHz athlon? Or you can have it tested on 1800MHz athlon64, that's about as high end as it can get. Pavel -- When do you have a heart between your knees? [Johanka's followup: and *two* hearts?] ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: Efficient IPC mechanism on Linux 2003-09-10 19:24 ` Pavel Machek @ 2003-09-10 19:40 ` Jamie Lokier 2003-09-10 21:35 ` Pavel Machek 0 siblings, 1 reply; 74+ messages in thread From: Jamie Lokier @ 2003-09-10 19:40 UTC (permalink / raw) To: Pavel Machek Cc: Arjan van de Ven, Luca Veraldi, alexander.riesen, linux-kernel Pavel Machek wrote: > Can you make it available so we can test on, say, 900MHz athlon? Or > you can have it tested on 1800MHz athlon64, that's about as high end > as it can get. I just deleted the program, so here's a rewrite :)

    #include <sys/mman.h>

    int main()
    {
        int i, j;
        for (j = 0; j < 64; j++) {
            volatile char * ptr =
                mmap (0, 4096 * 4096, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANON, -1, 0);
            for (i = 0; i < 4096; i++) {
    #if 1
                *(ptr + 4096 * i) = 0;        /* Write */
    #else
                (void) *(ptr + 4096 * i);     /* Read */
    #endif
            }
            munmap ((void *) ptr, 4096 * 4096);
        }
        return 0;
    }

Smallest results, from "gcc -o test test.c -O2; time ./test" on a 1500MHz dual Athlon 1800 MP:

Write:

    real 0m1.316s
    user 0m0.059s
    sys 0m1.256s

==> 7531 cycles per page

Read:

    real 0m0.199s
    user 0m0.053s
    sys 0m0.146s

==> 1139 cycles per page

As I said, it's a crude upper bound. -- Jamie ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: Efficient IPC mechanism on Linux 2003-09-10 19:40 ` Jamie Lokier @ 2003-09-10 21:35 ` Pavel Machek 2003-09-10 22:06 ` Jamie Lokier 0 siblings, 1 reply; 74+ messages in thread From: Pavel Machek @ 2003-09-10 21:35 UTC (permalink / raw) To: Jamie Lokier Cc: Arjan van de Ven, Luca Veraldi, alexander.riesen, linux-kernel Hi!

> > Can you make it available so we can test on, say, 900MHz athlon? Or
> > you can have it tested on 1800MHz athlon64, that's about as high end
> > as it can get.
>
> I just deleted the program, so here's a rewrite :)
>
> #include <sys/mman.h>
>
> int main()
> {
> 	int i, j;
> 	for (j = 0; j < 64; j++) {
> 		volatile char * ptr =
> 			mmap (0, 4096 * 4096, PROT_READ | PROT_WRITE,
> 			      MAP_PRIVATE | MAP_ANON, -1, 0);
> 		for (i = 0; i < 4096; i++) {
> #if 1
> 			*(ptr + 4096 * i) = 0;		/* Write */
> #else
> 			(void) *(ptr + 4096 * i);	/* Read */
> #endif
> 		}
> 		munmap ((void *) ptr, 4096 * 4096);
> 	}
> 	return 0;
> }
>
> Smallest results, from "gcc -o test test.c -O2; time ./test" on a
> 1500MHz dual Athlon 1800 MP:
>
> Write:
> real    0m1.316s
> user    0m0.059s
> sys     0m1.256s
>
> ==> 7531 cycles per page
>
> Read:
> real    0m0.199s
> user    0m0.053s
> sys     0m0.146s
>
> ==> 1139 cycles per page
>
> As I said, it's a crude upper bound.

But you should also time the memory remapping stuff on the same cpu, no?

Pavel
--
When do you have a heart between your knees? [Johanka's followup: and *two* hearts?] ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: Efficient IPC mechanism on Linux 2003-09-10 21:35 ` Pavel Machek @ 2003-09-10 22:06 ` Jamie Lokier 0 siblings, 0 replies; 74+ messages in thread From: Jamie Lokier @ 2003-09-10 22:06 UTC (permalink / raw) To: Pavel Machek Cc: Arjan van de Ven, Luca Veraldi, alexander.riesen, linux-kernel Pavel Machek wrote: > But you should also time the memory remapping stuff on the same cpu, no? Yes, but can I be bothered ;) Like I said, it's a crude upper bound. We know the remapping time will be about the same in both tests, because all it does is the same vma operations and the same TLB invalidations. -- Jamie ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: Efficient IPC mechanism on Linux 2003-09-10 9:18 ` Luca Veraldi 2003-09-10 9:23 ` Arjan van de Ven @ 2003-09-10 11:52 ` Alex Riesen 2003-09-10 12:14 ` Luca Veraldi 1 sibling, 1 reply; 74+ messages in thread From: Alex Riesen @ 2003-09-10 11:52 UTC (permalink / raw) To: Luca Veraldi; +Cc: linux-kernel Luca Veraldi, Wed, Sep 10, 2003 11:18:48 +0200: > > Will it be possible to base existing facilities on your approach? > > SVR5 messages (msg{get,snd,rcv}), for example? > > Ah, ok. So let's continue to do inefficient things > only because it has always been so! It is because the interface is perfectly adequate. You still can implement zero-copy local transfers for pipes for read and write calls. And for small amounts, where zero-copy is impossible, it does not bring noticeable advantages (as was already mentioned by Alan). > Compatibility is not a problem. Simply rewrite the write() and read() > for pipes in order to make them do the same thing done by zc_send() > and zc_receive(). Or, if you are not referring to pipes, rewrite the > support level of your ancient IPC primitives in order to make them do > the same thing done by zc_send() and zc_receive(). If it is possible, why a new user-side interface? -alex ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: Efficient IPC mechanism on Linux 2003-09-10 11:52 ` Alex Riesen @ 2003-09-10 12:14 ` Luca Veraldi 2003-09-10 12:11 ` Alex Riesen 0 siblings, 1 reply; 74+ messages in thread From: Luca Veraldi @ 2003-09-10 12:14 UTC (permalink / raw) To: alexander.riesen; +Cc: linux-kernel > > Compatibility is not a problem. Simply rewrite the write() and read() > > for pipes in order to make them do the same thing done by zc_send() > > and zc_receive(). Or, if you are not referring to pipes, rewrite the > > support level of your ancient IPC primitives in order to make them do > > the same thing done by zc_send() and zc_receive(). > > If it is possible, why a new user-side interface? Because my programming model is clear and easy. Linux's is far from being so, in my opinion. Bye, Luca ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: Efficient IPC mechanism on Linux 2003-09-10 12:14 ` Luca Veraldi @ 2003-09-10 12:11 ` Alex Riesen 2003-09-10 12:29 ` Luca Veraldi 0 siblings, 1 reply; 74+ messages in thread From: Alex Riesen @ 2003-09-10 12:11 UTC (permalink / raw) To: Luca Veraldi; +Cc: linux-kernel Luca Veraldi, Wed, Sep 10, 2003 14:14:00 +0200: > > > Compatibility is not a problem. Simply rewrite the write() and read() > > > for pipes in order to make them do the same thing done by zc_send() > > > and zc_receive(). Or, if you are not referring to pipes, rewrite the > > > support level of your ancient IPC primitives in order to make them do > > > the same thing done by zc_send() and zc_receive(). > > > > If it is possible, why a new user-side interface? > > Because my programming model is clear and easy. does anyone besides you say so? > Linux's is far from being so, in my opinion. it is the widely accepted one. ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: Efficient IPC mechanism on Linux 2003-09-10 12:11 ` Alex Riesen @ 2003-09-10 12:29 ` Luca Veraldi 2003-09-10 12:28 ` Alex Riesen 0 siblings, 1 reply; 74+ messages in thread From: Luca Veraldi @ 2003-09-10 12:29 UTC (permalink / raw) To: alexander.riesen; +Cc: linux-kernel > it is the widely accepted one. Widely accepted does not necessarily imply clear. Sorry if I consider this more readable:

ch = acquire(CHANNEL_ID);

than this:

while ((fd = open("fifo", O_WRONLY)) < 0)
	;

But this is mostly an aesthetic matter. If you like "open" and "write", open Emacs and do a global replace of "acquire" with "open" and "zc_send" with "write". A few other minor details will make it a compatible interface. ------------------------------ Fashion is a variable but style is a constant ------------------------------ Larry Wall Bye, Luca ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: Efficient IPC mechanism on Linux 2003-09-10 12:29 ` Luca Veraldi @ 2003-09-10 12:28 ` Alex Riesen 2003-09-10 12:36 ` Luca Veraldi 0 siblings, 1 reply; 74+ messages in thread From: Alex Riesen @ 2003-09-10 12:28 UTC (permalink / raw) To: Luca Veraldi; +Cc: linux-kernel Luca Veraldi, Wed, Sep 10, 2003 14:29:37 +0200: > > it is the widely accepted one. > > Widely accepted does not necessarily imply clear. > but it does imply "tested". And widely used. And well-known. ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: Efficient IPC mechanism on Linux 2003-09-10 12:28 ` Alex Riesen @ 2003-09-10 12:36 ` Luca Veraldi 2003-09-10 12:36 ` Alex Riesen 2003-09-10 13:33 ` Gábor Lénárt 0 siblings, 2 replies; 74+ messages in thread From: Luca Veraldi @ 2003-09-10 12:36 UTC (permalink / raw) To: alexander.riesen; +Cc: linux-kernel > but it does imply "tested". And widely used. And well-known. And inefficient, too. Sorry, but I use theory, not popular convention, to judge. Bye, Luca ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: Efficient IPC mechanism on Linux 2003-09-10 12:36 ` Luca Veraldi @ 2003-09-10 12:36 ` Alex Riesen 0 siblings, 0 replies; 74+ messages in thread From: Alex Riesen @ 2003-09-10 12:36 UTC (permalink / raw) To: Luca Veraldi; +Cc: linux-kernel Luca Veraldi, Wed, Sep 10, 2003 14:36:19 +0200: > > but it does imply "tested". And widely used. And well-known. > > And inefficient, too. yes, probably. Until proven. What exactly is wrong with the API? > Sorry, but I use theory, not popular convention, to judge. I personally do not see deficiencies in the mentioned APIs. ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: Efficient IPC mechanism on Linux 2003-09-10 12:36 ` Luca Veraldi 2003-09-10 12:36 ` Alex Riesen @ 2003-09-10 13:33 ` Gábor Lénárt 2003-09-10 14:04 ` Luca Veraldi 1 sibling, 1 reply; 74+ messages in thread From: Gábor Lénárt @ 2003-09-10 13:33 UTC (permalink / raw) To: Luca Veraldi; +Cc: linux-kernel On Wed, Sep 10, 2003 at 02:36:19PM +0200, Luca Veraldi wrote: > > but it does imply "tested". And widely used. And well-known. > > And inefficient, too. > Sorry, but I use theory, not popular convention, to judge. But theory is often far from practice ... Theory is something which uses some abstraction to "model" reality. That often means oversimplification of the problem, or simply that a real case has many more (possibly hidden) parameters affecting the problem. Maybe these parameters are very hard to even detect at first sight. That's why most software development starts from theory, _BUT_ some "REAL WORLD" benchmark/test should confirm the truth of that theory. Or even think about the theory of relativity by Einstein: the theory is good, but it's even better to prove it in real life too: the perturbation in the orbit of Mercury, or the effect on the visible position of stars during an eclipse. That was the point when Einstein's theory became widely used by science. So I think it's no use starting a flame thread here. Implement that IPC solution, and test it. If it becomes a good one, everybody will be happy with your work. If not, at least you've tried, and maybe you will also learn from it. Since Linux is open source, you have nothing to lose with your implementation. - Gábor (larta'H) ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: Efficient IPC mechanism on Linux 2003-09-10 13:33 ` Gábor Lénárt @ 2003-09-10 14:04 ` Luca Veraldi 0 siblings, 0 replies; 74+ messages in thread From: Luca Veraldi @ 2003-09-10 14:04 UTC (permalink / raw) To: lgb; +Cc: linux-kernel > So I think it's no use starting a flame thread here. Implement that IPC > solution, and test it. If it becomes a good one, everybody will be happy > with your work. If not, at least you've tried, and maybe you will also > learn from it. Since Linux is open source, you have nothing to lose with > your implementation. That is exactly what I've done. But did you read the page? I've implemented the IPC mechanism... And tested it, too. The numbers reported in the tables weren't the fruit of my imagination, dear. Bye, Luca ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: Efficient IPC mechanism on Linux 2003-09-09 17:30 Efficient IPC mechanism on Linux Luca Veraldi 2003-09-09 21:17 ` Alan Cox [not found] ` <20030909175821.GL16080@Synopsys.COM> @ 2003-09-10 14:21 ` Stewart Smith 2003-09-10 14:39 ` Luca Veraldi 2003-09-10 16:59 ` Andrea Arcangeli 3 siblings, 1 reply; 74+ messages in thread From: Stewart Smith @ 2003-09-10 14:21 UTC (permalink / raw) To: Luca Veraldi; +Cc: linux-kernel hrm, you may find things related to the Password-Capability system and the Walnut kernel of interest - these systems take this kind of IPC to the extreme :) (ahhh... research OS hw & sw - except you *do not* want to see the Walnut source - it makes ppl want to curl up and cry). http://www.csse.monash.edu.au/~rdp/fetch/castro-thesis.ps and check the Readme.txt at http://www.csse.monash.edu.au/courseware/cse4333/rdp-material/ for stuff on Multi and password-capabilities. interesting stuff, the Castro thesis does do some comparisons to FreeBSD (1.1 amazingly enough) - although the number of real-world applications on these systems is minimal (and in the current state impossible - nobody can remember how to get userspace going on Walnut, we may have broken it) and so real-world comparisons just don't really happen these days. Maybe after a rewrite (removing some brain-damage of the original design). This is all related to my honors work, http://www.flamingspork.com/honors/ although my site needs an update. I'm working on the design and simulation of an improved storage system. On Wed, 2003-09-10 at 03:30, Luca Veraldi wrote: > Hi all. > At the web page > http://web.tiscali.it/lucavera/www/root/ecbm/index.htm > You can find the results of my attempt in modifing the linux kernel sources > to implement a new Inter Process Communication mechanism. > > It is called ECBM for Efficient Capability-Based Messaging. > > In the reading You can also find the comparison of ECBM > against some other commonly-used Linux IPC primitives > (such as read/write on pipes or SYS V tools). > > The results are quite clear. > > Enjoy. > Luca Veraldi > > > ---------------------------------------- > Luca Veraldi > > Graduate Student of Computer Science > at the University of Pisa > > veraldi@cli.di.unipi.it > luca.veraldi@katamail.com > ICQ# 115368178 > ---------------------------------------- > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: Efficient IPC mechanism on Linux 2003-09-10 14:21 ` Stewart Smith @ 2003-09-10 14:39 ` Luca Veraldi 0 siblings, 0 replies; 74+ messages in thread From: Luca Veraldi @ 2003-09-10 14:39 UTC (permalink / raw) To: Stewart Smith; +Cc: linux-kernel > hrm, you may find things related to the Password-Capability system and > the Walnut kernel of interest - these systems take this kind of IPC to > the extreme :) (ahhh... research OS hw & sw - except you *do not* want > to see the walnut source - it makes ppl want to crawl up and cry). > > http://www.csse.monash.edu.au/~rdp/fetch/castro-thesis.ps > > and check the Readme.txt at > http://www.csse.monash.edu.au/courseware/cse4333/rdp-ma terial/ > for stuff on Multi and password-capabilities. > > interesting stuff, the Castro thesis does do some comparisons to FreeBSD > (1.1 amazingly enough) - although the number of real world applications > on these systems is minimal (and in the current state impossible - > nobody can remember how to get userspace going on Walnut, we may have > broken it) and so real-world comparisons just don't really happen these > days. Maybe after a rewrite (removing some brain-damage of the original > design). Thanks. It's really very interesting... Bye, Luca ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: Efficient IPC mechanism on Linux 2003-09-09 17:30 Efficient IPC mechanism on Linux Luca Veraldi ` (2 preceding siblings ...) 2003-09-10 14:21 ` Stewart Smith @ 2003-09-10 16:59 ` Andrea Arcangeli 2003-09-10 17:05 ` Andrea Arcangeli ` (2 more replies) 3 siblings, 3 replies; 74+ messages in thread From: Andrea Arcangeli @ 2003-09-10 16:59 UTC (permalink / raw) To: Luca Veraldi; +Cc: linux-kernel Ciao Luca, On Tue, Sep 09, 2003 at 07:30:58PM +0200, Luca Veraldi wrote: > Hi all. > At the web page > http://web.tiscali.it/lucavera/www/root/ecbm/index.htm > You can find the results of my attempt in modifing the linux kernel sources > to implement a new Inter Process Communication mechanism. > > It is called ECBM for Efficient Capability-Based Messaging. > > In the reading You can also find the comparison of ECBM > against some other commonly-used Linux IPC primitives > (such as read/write on pipes or SYS V tools). > > The results are quite clear.

in terms of design as far as I can tell the most efficient way to do message passing is not to pass the data through the kernel at all (no matter if you intend to copy it or not), and to simply use futex on top of shm to synchronize/wakeup the access. If we want to make an API widespread, that should be simply a userspace library only.

It's very inefficient to mangle pagetables and flush the tlb in a flood like you're doing (or better like you should do), when you can keep the memory mapped in *both* tasks at the same time *always* and there's no need of any kernel modification at all for that much more efficient design that I'm suggesting. Obviously lots of apps are already using this design and there's no userspace API simply because that's not needed. The only thing we need from the kernel is the wakeup mechanism and that's already provided by the futex (in the past userspace apps using this design used sched_yield, and that was very bad).

About the implementation - the locking looks very wrong: you miss the page_table_lock in all the pte walking, you take a totally worthless lock_kernel() all over the place for no good reason, and the unconditional set_bit(PG_locked) clear_bit(PG_locked) on random pieces of ram almost guarantees that you'll corrupt ram quickly (the PG_locked is reserved for I/O serialization, the same ram that you're working on can be sent to disk or to swap by the kernel at the same time and it can be already locked, you can't clear_bit unless you're sure you're the guy that owns the lock, and you aren't sure because you didn't test if you're the owner, so that smells like a huge bug that will randomly corrupt ram, like the pte walking race).

there's also an obvious DoS that is trivial to generate by locking in ram some 64G of ram with ecbm_create_capability(); see the for(count=0; count<pages; ++count) atomic_inc (btw, you should use get_page, and all the operations like LockPage to play with pages). I also don't see where you flush the tlb after the set_pte, and where you release the ram pointed to by the pte (it seems you're leaking plenty of memory that way). I didn't check the credential checks at all (I didn't run into them while reading the code, but I assume I overlooked them). (do you rely on a random number? that's probably statistically secure, but we can guarantee security on a local box, we must not work by luck whenever possible)

this was a very quick review, hope this helps,

Andrea

/*
 * If you refuse to depend on closed software for a critical
 * part of your business, these links may be useful:
 *
 * rsync.kernel.org::pub/scm/linux/kernel/bkcvs/linux-2.5/
 * rsync.kernel.org::pub/scm/linux/kernel/bkcvs/linux-2.4/
 * http://www.cobite.com/cvsps/
 *
 * svn://svn.kernel.org/linux-2.6/trunk
 * svn://svn.kernel.org/linux-2.4/trunk
 */

^ permalink raw reply [flat|nested] 74+ messages in thread
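(A minimal sketch of the shm+futex design Andrea describes, assuming a futex-capable kernel, i.e. 2.5/2.6-era or later. The layout, the names and the missing error handling are illustrative only, not his actual code; note the payload is written in place in the shared mapping and the kernel is entered solely to sleep and wake:)

#define _GNU_SOURCE
#include <linux/futex.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

static long futex(int *addr, int op, int val)
{
	return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

int main(void)
{
	/* one shared page: a futex word followed by the payload */
	char *shm = mmap(0, 4096, PROT_READ | PROT_WRITE,
			 MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	int *ready = (int *) shm;
	char *msg = shm + sizeof(int);

	if (fork() == 0) {			/* reader */
		while (*ready == 0)		/* sleeps only while nothing is there yet */
			futex(ready, FUTEX_WAIT, 0);
		printf("reader got: %s\n", msg);
		return 0;
	}
	strcpy(msg, "hello");			/* written in place, never copied by the kernel */
	*ready = 1;
	futex(ready, FUTEX_WAKE, 1);		/* kernel entered only to wake the sleeper */
	wait(NULL);
	return 0;
}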
* Re: Efficient IPC mechanism on Linux 2003-09-10 16:59 ` Andrea Arcangeli @ 2003-09-10 17:05 ` Andrea Arcangeli 2003-09-10 17:21 ` Luca Veraldi 0 siblings, 1 reply; 74+ messages in thread From: Andrea Arcangeli @ 2003-09-10 17:05 UTC (permalink / raw) To: Luca Veraldi; +Cc: linux-kernel On Wed, Sep 10, 2003 at 06:59:44PM +0200, Andrea Arcangeli wrote: > About the implementation - the locking looks very wrong: you miss the > page_table_lock in all the pte walking, you take a totally worthless > lock_kernel() all over the place for no good reason, and the sorry for the self-followup, but I just read another message where you mention 2.2; if that was for 2.2 the locking was ok. all other problems remain though.

Andrea

/*
 * If you refuse to depend on closed software for a critical
 * part of your business, these links may be useful:
 *
 * rsync.kernel.org::pub/scm/linux/kernel/bkcvs/linux-2.5/
 * rsync.kernel.org::pub/scm/linux/kernel/bkcvs/linux-2.4/
 * http://www.cobite.com/cvsps/
 *
 * svn://svn.kernel.org/linux-2.6/trunk
 * svn://svn.kernel.org/linux-2.4/trunk
 */

^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: Efficient IPC mechanism on Linux 2003-09-10 17:05 ` Andrea Arcangeli @ 2003-09-10 17:21 ` Luca Veraldi 2003-09-10 17:41 ` Andrea Arcangeli 0 siblings, 1 reply; 74+ messages in thread From: Luca Veraldi @ 2003-09-10 17:21 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: linux-kernel > sorry for the self-followup, but I just read another message where you > mention 2.2; if that was for 2.2 the locking was ok. Oh, good. I was already going to put my head under the sand. :-) I'm no expert on the kernel. I've just read [ORLL] and some bits of the kernel sources. So, errors in the code are not so strange. But, it's better now that you know we are talking about version 2.2... I'm glad to hear that the locking is ok. You say: > in terms of design as far as I can tell the most efficient way to do > message passing is not to pass the data through the kernel at all (no > matter if you intend to copy it or not), and to simply use futex on top > of shm to synchronize/wakeup the access. If we want to make an API > widespread, that should be simply a userspace library only. > > It's very inefficient to mangle pagetables and flush the tlb in a flood > like you're doing (or better like you should do), when you can keep the I guess futexes are some kind of semaphore flavour under linux 2.4/2.6. However, you need to use SYS V shared memory in any case. Tests for SYS V shared memory are included in the web page (even though using SYS V semaphores). Reading the numbers, I don't think that managing pagetables "is very inefficient". I think the really inefficient things are SYS V semaphores or the double-copying channel you call a pipe. > there's also an obvious DoS that is trivial to generate by locking in > ram some 64G of ram with ecbm_create_capability(); see the for(count=0; > count<pages; ++count) atomic_inc (btw, you should use get_page, and all > the operations like LockPage to play with pages). As I say in the web page, having all the pages locked in memory is not a necessary condition for the applicability of communication mechanisms based on capabilities. Simply, it makes it easier to write the code and keeps me from going crazy with the Linux swapping system. Bye bye, Luca ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: Efficient IPC mechanism on Linux 2003-09-10 17:21 ` Luca Veraldi @ 2003-09-10 17:41 ` Andrea Arcangeli 0 siblings, 0 replies; 74+ messages in thread From: Andrea Arcangeli @ 2003-09-10 17:41 UTC (permalink / raw) To: Luca Veraldi; +Cc: linux-kernel On Wed, Sep 10, 2003 at 07:21:09PM +0200, Luca Veraldi wrote: > > sorry for the self-followup, but I just read another message where you > > mention 2.2; if that was for 2.2 the locking was ok. > > Oh, good. I was already going to put my head under the sand. :-) ;) > I'm no expert on the kernel. I've just read [ORLL] > and some bits of the kernel sources. > So, errors in the code are not so strange. Ok no problem then. > But, it's better now that you know we are talking about version 2.2... > > I'm glad to hear that the locking is ok. > > You say: > > > in terms of design as far as I can tell the most efficient way to do > > message passing is not to pass the data through the kernel at all (no > > matter if you intend to copy it or not), and to simply use futex on top > > of shm to synchronize/wakeup the access. If we want to make an API > > widespread, that should be simply a userspace library only. > > > > It's very inefficient to mangle pagetables and flush the tlb in a flood > > like you're doing (or better like you should do), when you can keep the > > I guess futexes are some kind of semaphore flavour under linux 2.4/2.6. > However, you need to use SYS V shared memory in any case. I need shared memory yes, but I can generate it with /dev/shm or clone(), it doesn't really make any difference. I would never use the ugly SYSV IPC API personally ;). > Tests for SYS V shared memory are included in the web page > (even though using SYS V semaphores). IPC semaphores are a totally different thing. Basically the whole ipc/ directory is an inefficient obsolete piece of code that nobody should use. The real thing is (/dev/shm || clone()) + futex. > Reading the numbers, I don't think that managing pagetables "is very > inefficient". Yep, it can certainly be more efficient than IPC semaphores or whatever IPC thing in ipc. However the speed of the _shm_ API doesn't matter, you map the shm only once and you serialize the messaging with the futex. So even if it takes a bit longer to set up the shm with IPC shm, it won't matter; even using IPC shm + futex would be fine (though I find /dev/shm a much more friendly API). > I think the really inefficient things are SYS V semaphores or the > double-copying channel you call a pipe. you're right in terms of bandwidth if the only thing the reader and the writer do is make the data transfer. However if the app is computing stuff besides listening to the pipe (for example if it's nonblocking) the double copying allows the writer to return way before the reader has started reading the info. So the intermediate ram increases the parallelism a bit. One could play tricks with COWs as well, though it would be way more complex than the current double copy ;). (as somebody pointed out) you may want to compare the pipe code with the new zerocopy-pipe one (the one that IMHO has a chance to decrease parallelism) > having all the pages locked in memory is not a necessary condition > for the applicability of communication mechanisms based on capabilities. > Simply, it makes it easier to write the code and keeps me from going crazy > with the Linux swapping system. ok ;) Overall my point is that the best is to keep the ram mapped in both tasks at the same time, to use the kernel only for synchronization (i.e. the wakeup/schedule calls in your code) and to remove the pinning/pte mangling entirely from the design; the futex provides a more efficient synchronization since it doesn't even enter kernel space if the writer completes before the reader starts and the reader completes before the writer starts again (so it has a chance to be better than any other kernel-based scheme that is always forced to enter the kernel even in the very best case).

Andrea

/*
 * If you refuse to depend on closed software for a critical
 * part of your business, these links may be useful:
 *
 * rsync.kernel.org::pub/scm/linux/kernel/bkcvs/linux-2.5/
 * rsync.kernel.org::pub/scm/linux/kernel/bkcvs/linux-2.4/
 * http://www.cobite.com/cvsps/
 *
 * svn://svn.kernel.org/linux-2.6/trunk
 * svn://svn.kernel.org/linux-2.4/trunk
 */

^ permalink raw reply [flat|nested] 74+ messages in thread
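(And a sketch of the ring-buffer variant mentioned above: a single-producer/single-consumer queue living in the shared mapping. The struct layout is an assumption; indexes wrap with unsigned arithmetic, so the buffer size must divide 2^32, and the memory barriers a real SMP implementation needs, plus the futex-wait on the full/empty cases, are left out:)

/* lives in memory mapped by both tasks */
struct ring {
	volatile unsigned int head;	/* advanced only by the writer */
	volatile unsigned int tail;	/* advanced only by the reader */
	char buf[4096];
};

/* returns 0 when full; the caller would futex-wait instead of spinning */
static int ring_put(struct ring *r, char c)
{
	if (r->head - r->tail == sizeof(r->buf))
		return 0;
	r->buf[r->head % sizeof(r->buf)] = c;
	r->head++;			/* publish the byte only after it is in place */
	return 1;
}

/* returns 0 when empty */
static int ring_get(struct ring *r, char *c)
{
	if (r->tail == r->head)
		return 0;
	*c = r->buf[r->tail % sizeof(r->buf)];
	r->tail++;
	return 1;
}

In the best case neither side ever enters the kernel: the two indexes alone are enough to keep them synchronized.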
* Re: Efficient IPC mechanism on Linux 2003-09-10 16:59 ` Andrea Arcangeli 2003-09-10 17:05 ` Andrea Arcangeli @ 2003-09-10 17:39 ` Martin Konold 2003-09-10 18:01 ` Andrea Arcangeli 2003-09-10 18:08 ` Inappropriate signatures Larry McVoy 2 siblings, 1 reply; 74+ messages in thread From: Martin Konold @ 2003-09-10 17:39 UTC (permalink / raw) To: Andrea Arcangeli, Luca Veraldi; +Cc: linux-kernel Am Wednesday 10 September 2003 06:59 pm schrieb Andrea Arcangeli: Hi, > design that I'm suggesting. Obviously lots of apps are already using > this design and there's no userspace API simply because that's not > needed. HPC people have investigated high-performance IPC many times; basically it boils down to:

1. Userspace is much more efficient than kernel space. So efficient implementations avoid kernel space even for message transfers over networks (e.g. DMA directly to userspace).

2. The optimal protocol to use and the number of copies to do depend on the message size.

Small messages are most efficiently handled with a single/dual copy (short protocol / eager protocol) and large messages are more efficient with single/zero copy techniques (get protocol), depending on whether a network is involved or not. Traditionally in a networked environment single copy means PIO and zero copy means DMA when using network hardware. The idea is that while DMA has much higher bandwidth than PIO, DMA is more expensive to initiate. So DMA is only useful for large messages. In the local SMP case there do exist userspace APIs like MPI which can do single-copy message passing at memory speed in pure userspace, and have for many years. The following PDF documents a measurement where the communication between two processes running on different CPUs in an SMP system is exactly the memcpy bandwidth for large messages using a single copy get protocol. http://ipdps.eece.unm.edu/1999/pc-now/takahash.pdf

Numbers from a Dual P-II-333, Intel 440LX (100MB/s memcpy):

                             eager      get
min. latency (µs)             8.62     9.98
max bandwidth (MB/s)         48.03   100.02
half-bandwidth point (KB)      0.3      2.5

You can easily see that the eager protocol has better latency for very short messages and that the get protocol has a max bandwidth equivalent to a memcpy (single copy). True zero copy has unlimited (sigh!) bandwidth within an SMP and does not really make sense in contrast to a network. Regards, -- martin Dipl.-Phys. Martin Konold e r f r a k o n Erlewein, Frank, Konold & Partner - Beratende Ingenieure und Physiker Nobelstrasse 15, 70569 Stuttgart, Germany fon: 0711 67400963, fax: 0711 67400959 email: martin.konold@erfrakon.de ^ permalink raw reply [flat|nested] 74+ messages in thread
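(A sketch of how a messaging library might choose between the two protocols Martin describes; the threshold value and the slot layout are assumptions invented for illustration:)

#include <string.h>

#define EAGER_MAX 512	/* assumed crossover point; tuned per machine in practice */

struct msg_slot {
	size_t len;
	int eager;
	const void *src;		/* get protocol: receiver copies straight from the sender's buffer */
	char inline_data[EAGER_MAX];	/* eager protocol: payload travels with the header */
};

static void post_message(struct msg_slot *s, const void *buf, size_t len)
{
	s->len = len;
	s->eager = (len <= EAGER_MAX);
	if (s->eager)
		memcpy(s->inline_data, buf, len);	/* pay a small copy now, win on latency */
	else
		s->src = buf;	/* defer: the receiver does the one big copy (get) */
}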
* Re: Efficient IPC mechanism on Linux 2003-09-10 17:39 ` Martin Konold @ 2003-09-10 18:01 ` Andrea Arcangeli 2003-09-10 18:05 ` Martin Konold 0 siblings, 1 reply; 74+ messages in thread From: Andrea Arcangeli @ 2003-09-10 18:01 UTC (permalink / raw) To: Martin Konold; +Cc: Luca Veraldi, linux-kernel On Wed, Sep 10, 2003 at 07:39:17PM +0200, Martin Konold wrote: > Am Wednesday 10 September 2003 06:59 pm schrieb Andrea Arcangeli: > > Hi, > > > design that I'm suggesting. Obviously lots of apps are already using > > this design and there's no userspace API simply because that's not > > needed. > > HPC people have investigated high-performance IPC many times; basically it > boils down to: > > 1. Userspace is much more efficient than kernel space. So efficient > implementations avoid kernel space even for message transfers over networks > (e.g. DMA directly to userspace). > > 2. The optimal protocol to use and the number of copies to do depend on > the message size. > > Small messages are most efficiently handled with a single/dual copy (short > protocol / eager protocol) and large messages are more efficient with > single/zero copy techniques (get protocol), depending on whether a network > is involved or not. > > Traditionally in a networked environment single copy means PIO and zero copy > means DMA when using network hardware. > > The idea is that while DMA has much higher bandwidth than PIO, DMA is more > expensive to initiate. So DMA is only useful for large messages. agreed. > In the local SMP case there do exist userspace APIs like MPI which can do btw, so far we were only discussing IPC in a local box (UP or SMP or NUMA) w/o networking involved. Luca's current implementation as well was only working locally. > single-copy message passing at memory speed in pure userspace, and have for > many years. > > The following PDF documents a measurement where the communication between two > processes running on different CPUs in an SMP system is exactly the memcpy > bandwidth for large messages using a single copy get protocol. > > http://ipdps.eece.unm.edu/1999/pc-now/takahash.pdf
>
> Numbers from a Dual P-II-333, Intel 440LX (100MB/s memcpy):
>
>                              eager      get
> min. latency (µs)             8.62     9.98
> max bandwidth (MB/s)         48.03   100.02
> half-bandwidth point (KB)      0.3      2.5
>
> You can easily see that the eager protocol has better latency for very short
> messages and that the get protocol has a max bandwidth equivalent to a memcpy
> (single copy).
>
> True zero copy has unlimited (sigh!) bandwidth within an SMP and does not
> really make sense in contrast to a network.

if you can avoid entering the kernel, you'd better do that, because entering the kernel will take much more time than the copy itself. with the shm/futex approach you can also have a ring buffer to handle parallelism better while it's at the same time zerocopy and entirely userspace based in the best case (though that's not the common case). thanks,

Andrea

/*
 * If you refuse to depend on closed software for a critical
 * part of your business, these links may be useful:
 *
 * rsync.kernel.org::pub/scm/linux/kernel/bkcvs/linux-2.5/
 * rsync.kernel.org::pub/scm/linux/kernel/bkcvs/linux-2.4/
 * http://www.cobite.com/cvsps/
 *
 * svn://svn.kernel.org/linux-2.6/trunk
 * svn://svn.kernel.org/linux-2.4/trunk
 */

^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: Efficient IPC mechanism on Linux 2003-09-10 18:01 ` Andrea Arcangeli @ 2003-09-10 18:05 ` Martin Konold 2003-09-10 18:31 ` Chris Friesen 0 siblings, 1 reply; 74+ messages in thread From: Martin Konold @ 2003-09-10 18:05 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Luca Veraldi, linux-kernel Am Wednesday 10 September 2003 08:01 pm schrieb Andrea Arcangeli: Hi Andrea, > > The idea is that while DMA has much higher bandwidth than PIO, DMA is more > > expensive to initiate. So DMA is only useful for large messages. > > agreed. > > > In the local SMP case there do exist userspace APIs like MPI which can do > > btw, so far we were only discussing IPC in a local box (UP or SMP or > NUMA) w/o networking involved. Luca's current implementation as well was > only working locally. Yes, and I claim that the best you can get for large messages is a plain single copy userspace implementation as already implemented by some people using the MPI API. > > True zero copy has unlimited (sigh!) bandwidth within an SMP and does not > > really make sense in contrast to a network. > > if you can avoid entering the kernel, you'd better do that, because entering > the kernel will take much more time than the copy itself. Yes, doing HPC the kernel may only be used to initialize the initial communication channels (e.g. handling permissions etc.). The kernel must be avoided for the actual communication by any means. > with the shm/futex approach you can also have a ring buffer to handle > parallelism better while it's at the same time zerocopy How fast will you get? I think you will get the bandwidth of a memcpy for large chunks?! This is imho not really zerocopy. The data has to travel over the memory bus involving the CPU so I would call this single copy ;-) > and entirely > userspace based in the best case (though that's not the common case). Yes, a userspace library using the standard MPI API is the proven best approach and freely downloadable from http://www.pccluster.org/score/dist/pub/score-5.4.0/source/score-5.4.0.mpi.tar.gz Regards, -- martin Dipl.-Phys. Martin Konold e r f r a k o n Erlewein, Frank, Konold & Partner - Beratende Ingenieure und Physiker Nobelstrasse 15, 70569 Stuttgart, Germany fon: 0711 67400963, fax: 0711 67400959 email: martin.konold@erfrakon.de ^ permalink raw reply [flat|nested] 74+ messages in thread
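(For reference, the MPI API Martin refers to is tiny at its core. A minimal two-rank ping, compilable with mpicc against any MPI implementation, e.g. the SCore MPICH linked above; which protocol, eager or get, the data takes is decided inside the library:)

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
	int rank;
	char buf[16] = "hello";

	MPI_Init(&argc, &argv);
	MPI_Comm_rank(MPI_COMM_WORLD, &rank);
	if (rank == 0)
		MPI_Send(buf, sizeof(buf), MPI_CHAR, 1, 0, MPI_COMM_WORLD);
	else if (rank == 1) {
		MPI_Recv(buf, sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD,
			 MPI_STATUS_IGNORE);
		printf("rank 1 got: %s\n", buf);
	}
	MPI_Finalize();
	return 0;
}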
* Re: Efficient IPC mechanism on Linux 2003-09-10 18:05 ` Martin Konold @ 2003-09-10 18:31 ` Chris Friesen 0 siblings, 0 replies; 74+ messages in thread From: Chris Friesen @ 2003-09-10 18:31 UTC (permalink / raw) To: Martin Konold; +Cc: Andrea Arcangeli, Luca Veraldi, linux-kernel Martin Konold wrote: > Am Wednesday 10 September 2003 08:01 pm schrieb Andrea Arcangeli: >>with the shm/futex approach you can also have a ring buffer to handle >>parallelism better while it's at the same time zerocopy >> > How fast will you get? I think you will get the bandwidth of a memcpy for > large chunks?! > > This is imho not really zerocopy. The data has to travel over the memory bus > involving the CPU so I would call this single copy ;-) Even with zerocopy, you have to build the message somehow. If process A builds a message and passes it to process B, then isn't that as efficient as you can get with message passing? How does MPI do any better? However, if both processes directly map the same memory and use it directly (without "messages" as such), then I would see *that* as zerocopy. Chris -- Chris Friesen | MailStop: 043/33/F10 Nortel Networks | work: (613) 765-0557 3500 Carling Avenue | fax: (613) 765-2986 Nepean, ON K2H 8E9 Canada | email: cfriesen@nortelnetworks.com ^ permalink raw reply [flat|nested] 74+ messages in thread
* Inappropriate signatures 2003-09-10 16:59 ` Andrea Arcangeli 2003-09-10 17:05 ` Andrea Arcangeli 2003-09-10 17:39 ` Martin Konold @ 2003-09-10 18:08 ` Larry McVoy 2003-09-10 18:52 ` Jamie Lokier ` (2 more replies) 2 siblings, 3 replies; 74+ messages in thread From: Larry McVoy @ 2003-09-10 18:08 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: linux-kernel Dear Andrea, We're providing the service which enables all of the below and without our good will that service is at risk. You're publicly slamming the providers of that service. If you want to do that, that's your right, but that leads us to ask:

- wasn't the deal that we do the gateway and you stop whining?
- why should we provide this gateway service at all if there is whining?
- why shouldn't we firewall you off since you are whining?
- or if firewalling fails, put in a day or two delay in the gateway?

> /*
> * If you refuse to depend on closed software for a critical
> * part of your business, these links may be useful:
> *
> * rsync.kernel.org::pub/scm/linux/kernel/bkcvs/linux-2.5/
> * rsync.kernel.org::pub/scm/linux/kernel/bkcvs/linux-2.4/
> * http://www.cobite.com/cvsps/
> *
> * svn://svn.kernel.org/linux-2.6/trunk
> * svn://svn.kernel.org/linux-2.4/trunk
> */

-- --- Larry McVoy lm at bitmover.com http://www.bitmover.com/lm ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: Inappropriate signatures 2003-09-10 18:08 ` Inappropriate signatures Larry McVoy @ 2003-09-10 18:52 ` Jamie Lokier 2003-09-10 19:54 ` rsync head? [was inappropriate signatures] Joe Perches 2003-09-13 15:39 ` Inappropriate signatures Pavel Machek 2 siblings, 0 replies; 74+ messages in thread From: Jamie Lokier @ 2003-09-10 18:52 UTC (permalink / raw) To: Larry McVoy, Andrea Arcangeli, linux-kernel > > * rsync.kernel.org::pub/scm/linux/kernel/bkcvs/linux-2.5/ > > * rsync.kernel.org::pub/scm/linux/kernel/bkcvs/linux-2.4/ > > * http://www.cobite.com/cvsps/ Oh, thanks! I didn't know about those links and have been using the daily snapshots from www.kernel.org's front page. Those URLs (*and* BitKeeper ones) really should appear on or near the www.kernel.org front page, near the list of snapshots etc. -- Jamie ^ permalink raw reply [flat|nested] 74+ messages in thread
* rsync head? [was inappropriate signatures] 2003-09-10 18:08 ` Inappropriate signatures Larry McVoy 2003-09-10 18:52 ` Jamie Lokier @ 2003-09-10 19:54 ` Joe Perches 2003-09-13 15:39 ` Inappropriate signatures Pavel Machek 2 siblings, 0 replies; 74+ messages in thread From: Joe Perches @ 2003-09-10 19:54 UTC (permalink / raw) To: linux-kernel, webmaster; +Cc: Larry McVoy On Wed, 2003-09-10 at 11:08, Larry McVoy wrote: > > * rsync.kernel.org::pub/scm/linux/kernel/bkcvs/linux-2.5/ cvs/bk/svn/bz2? How about just plain text instead? How about an rsync to the latest uncompressed source tree? rsync.kernel.org::pub/linux/kernel/head? ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: Inappropriate signatures 2003-09-10 18:08 ` Inappropriate signatures Larry McVoy 2003-09-10 18:52 ` Jamie Lokier 2003-09-10 19:54 ` rsync head? [was inappropriate signatures] Joe Perches @ 2003-09-13 15:39 ` Pavel Machek 2003-09-13 16:49 ` Larry McVoy 2 siblings, 1 reply; 74+ messages in thread From: Pavel Machek @ 2003-09-13 15:39 UTC (permalink / raw) To: Larry McVoy, Andrea Arcangeli, linux-kernel Hi! > We're providing the service which enables all of the below and without > our good will that service is at risk. Eh? You are providing a service and he provides you advertising. I can't see anything about slamming you. I guess that Andrea does not think about kernel development as a business, so it is not really targeted at you. But anyway it's *his* signature. Pavel

> > /*
> > * If you refuse to depend on closed software for a critical
> > * part of your business, these links may be useful:
> > *
> > * rsync.kernel.org::pub/scm/linux/kernel/bkcvs/linux-2.5/
> > * rsync.kernel.org::pub/scm/linux/kernel/bkcvs/linux-2.4/
> > * http://www.cobite.com/cvsps/
> > *
> > * svn://svn.kernel.org/linux-2.6/trunk
> > * svn://svn.kernel.org/linux-2.4/trunk
> > */

-- When do you have a heart between your knees? [Johanka's followup: and *two* hearts?] ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: Inappropriate signatures 2003-09-13 15:39 ` Inappropriate signatures Pavel Machek @ 2003-09-13 16:49 ` Larry McVoy 0 siblings, 0 replies; 74+ messages in thread From: Larry McVoy @ 2003-09-13 16:49 UTC (permalink / raw) To: Pavel Machek; +Cc: Larry McVoy, Andrea Arcangeli, linux-kernel Other people may not agree with your view on this one Pavel. Most people read that signature and saw it as a negative comment about BitKeeper. People have told me they believe that if BitMover isn't getting benefit from the free use of BK, free BK will go away. People don't want to have to depend on my goodwill, they want BitMover to derive benefit so that my goodwill doesn't matter, it's just smart business to give BK away. If that is really what people here think then that means there has to be some benefit for BitMover. One of the benefits is that we get to say that the kernel team uses it and get some marketing advantage out of that. But that benefit diminishes if what people are saying is negative. It makes sense that people are uneasy about depending on BK given the amount of work it takes to provide it and the amount of grief we take for providing it. I don't know why I didn't see that earlier, it's an unstable situation. I know there are some people who will never be happy until everything is GPLed, I can't help those people other than provide the gateway. In return, those people need to stop whining, the gateway has to be enough. For the rest of the people, I'm looking for suggestions on how to make this situation more stable. It took me a while but I can see why you are nervous, I'd be nervous in your position. I'm nervous about doing any real marketing of the kernel's use of BK because I figured it would lead to more flame wars. I'm starting to think that if we were doing that it might actually lead to fewer flames, based on the theory that we would then need you, so you'd continue to get BK for free. If you have an opinion on that I'd like to know it. On Sat, Sep 13, 2003 at 05:39:56PM +0200, Pavel Machek wrote: > Hi! > > > We're providing the service which enables all of the below and without > > our good will that service is at risk. > > Eh? You are providing a service and he provides you advertising. I > can't see anything about slamming you. I guess that Andrea does not > think about kernel development as a business, so it is not really > targeted at you. > > But anyway it's *his* signature. > Pavel

> > > /*
> > > * If you refuse to depend on closed software for a critical
> > > * part of your business, these links may be useful:
> > > *
> > > * rsync.kernel.org::pub/scm/linux/kernel/bkcvs/linux-2.5/
> > > * rsync.kernel.org::pub/scm/linux/kernel/bkcvs/linux-2.4/
> > > * http://www.cobite.com/cvsps/
> > > *
> > > * svn://svn.kernel.org/linux-2.6/trunk
> > > * svn://svn.kernel.org/linux-2.4/trunk
> > > */

> -- > When do you have a heart between your knees? > [Johanka's followup: and *two* hearts?] -- --- Larry McVoy lm at bitmover.com http://www.bitmover.com/lm ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: Efficient IPC mechanism on Linux @ 2003-09-09 18:59 Luca Veraldi 0 siblings, 0 replies; 74+ messages in thread From: Luca Veraldi @ 2003-09-09 18:59 UTC (permalink / raw) To: linux-kernel You're right. When I have a bit of time, I'll translate the page. The version of the kernel is not so important. The code is highly portable and depends little on the internals of the kernel. It only needs some functions that continue to survive in newer versions of Linux. I used 2.2.4 only because it was ready to use on my machine. The main purpose of the article was not to present a professional mechanism that you can download and use in your own applications. It is only the proof of a fact: Linux IPC mechanisms are ***structurally*** inefficient. Bye. Luca ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: Efficient IPC mechanism on Linux @ 2003-09-09 22:15 Luca Veraldi 0 siblings, 0 replies; 74+ messages in thread From: Luca Veraldi @ 2003-09-09 22:15 UTC (permalink / raw) To: linux-kernel Ok. The English version of the document is online now. I've translated the main part of it. The address is the same. For English speakers: http://web.tiscali.it/lucavera/www/root/ecbm/index.htm If You want it in Italian: http://web.tiscali.it/lucavera/www/root/ecbm/indexIT.htm Sorry if You find some typos. Here in Italy it's midnight... ^ permalink raw reply [flat|nested] 74+ messages in thread
[parent not found: <F71B37536F3B3D4FA521FEC7FCA17933164A@twinsrv.twinox.se>]
* Re: Efficient IPC mechanism on Linux [not found] <F71B37536F3B3D4FA521FEC7FCA17933164A@twinsrv.twinox.se> @ 2003-09-10 10:36 ` Luca Veraldi 0 siblings, 0 replies; 74+ messages in thread From: Luca Veraldi @ 2003-09-10 10:36 UTC (permalink / raw) To: Lars Hammarstrand; +Cc: linux-kernel > Don't bother about people nagging about bits and pieces because they are jealous. > Stand up for the basic idea since the rest is just details that always can be fixed. The incredible thing is that it has been what, 10, 20 years, perhaps even more, that at my University in Pisa we have been talking about such communication primitives. It's not so new! We have merely actually implemented it. Bye, Luca. ^ permalink raw reply [flat|nested] 74+ messages in thread
[parent not found: <E19x3el-0002Fc-Rj@phoenix.hadiko.de>]
* Re: Efficient IPC mechanism on Linux [not found] <E19x3el-0002Fc-Rj@phoenix.hadiko.de> @ 2003-09-10 12:16 ` Luca Veraldi 2003-09-10 14:53 ` Larry McVoy 0 siblings, 1 reply; 74+ messages in thread From: Luca Veraldi @ 2003-09-10 12:16 UTC (permalink / raw) To: uek32z; +Cc: linux-kernel > I've read your posting on the lkml and also the answers > concerning IPC mechanisms on Linux. > You speak English very well - why don't you translate your > page into english, I think many people would be very interested > in it... at least I am ;) Unfortunately not many kernel hackers > are able to understand Italian, I think... The page has been in English since last night (Italian time). Please refresh your browser. http://web.tiscali.it/lucavera/www/root/ecbm/index.htm for English users and http://web.tiscali.it/lucavera/www/root/ecbm/indexIT.htm for Italian ones. Bye, Luca. ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: Efficient IPC mechanism on Linux 2003-09-10 12:16 ` Luca Veraldi @ 2003-09-10 14:53 ` Larry McVoy 0 siblings, 0 replies; 74+ messages in thread From: Larry McVoy @ 2003-09-10 14:53 UTC (permalink / raw) To: Luca Veraldi; +Cc: uek32z, linux-kernel On Wed, Sep 10, 2003 at 02:16:40PM +0200, Luca Veraldi wrote: > > I've read your posting on the lkml and also the answers > > concerning IPC mechanisms on Linux. > > You speak English very well - why don't you translate your > > page into english, I think many people would be very interested > > in it... at least I am ;) Unfortunately not many kernel hackers > > are able to understand Italian, I think... > > The page has been in English since last night (Italian time). > Please refresh your browser. > > http://web.tiscali.it/lucavera/www/root/ecbm/index.htm > for English users and I read it and I must be missing something, which is possible, I need more coffee. I question the measurement methodology. Why didn't you grab the sources to LMbench and use them to measure this? It may well be that you disagree with how it measures things, which may be fine, but I think you'd benefit from understanding it and thinking about it. It's also trivial to add another test to the system, you can do it in a few lines of code. I also question the results. I modified lat_pipe.c from LMbench to measure a range of sizes, code is included below. The results don't match yours at all. This is on a 466MHz Celeron running RedHat 7.1, Linux 2.4.2. The time reported is the time to send and receive the data between two processes, i.e.,

Process A                  Process B
write
        <context switch>
                           read
                           write
        <context switch>
read

In other words, the time printed is for a round trip. Your numbers appear to be off by a factor of two, they look pretty similar to mine but as far as I can tell you are saying that it costs 4 usecs for a send and 4 for a recv and that's not true, the real numbers are 2 sends and 2 receives and 2 context switches.

     1 bytes: Pipe latency:    8.0272 microseconds
     8 bytes: Pipe latency:    7.8736 microseconds
    64 bytes: Pipe latency:    8.0279 microseconds
   512 bytes: Pipe latency:   10.0920 microseconds
  4096 bytes: Pipe latency:   19.6434 microseconds
 40960 bytes: Pipe latency:  313.3328 microseconds
 81920 bytes: Pipe latency: 1267.7174 microseconds
163840 bytes: Pipe latency: 3052.1020 microseconds

I want to stick some other numbers in here from LMbench, the signal handler cost and the select cost. On this machine it is about 5 usecs to handle the signal and about 4 for select on 10 file descriptors. If I were faced with the problem of moving data between processes at very low cost the path I'd choose would depend on whether it was a lot of data or just an event notification. It would also depend on whether the receiving process is doing anything else. Let's walk a couple of those paths: If all I want to do is let another process know that something has happened then a signal is darn close to as cheap as I can get. That's what SIGUSR1 and SIGUSR2 are for. If I wanted to move large quantities of data I'd combine signals with mmap and mutexes. This is easier than it sounds. Map a scratch file, truncate it up to the size you need, start writing into it and when you have enough, signal the other side. It's a lot like the I/O loop in many device drivers. So I guess I'm not seeing why there needs to be a new interface here. This looks to me like you combine cheap messaging (signals, select, or even pipes) with shared data (mmap). I don't know why you didn't measure that, it's the obvious thing to measure and you are going to be running at memory speeds. The only justification I can see for a different mechanism is if the signaling really hurt, but it doesn't. What am I missing? -- --- Larry McVoy lm at bitmover.com http://www.bitmover.com/lm ^ permalink raw reply [flat|nested] 74+ messages in thread
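(A sketch of the scratch-file scheme Larry outlines: map, truncate, fill, signal. The file name is made up, error handling is omitted, and the receiver is assumed to have the same file mapped and a SIGUSR1 handler installed:)

#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static void send_chunk(pid_t peer, const void *data, size_t len)
{
	int fd = open("/tmp/scratch", O_RDWR | O_CREAT, 0600);
	char *p;

	ftruncate(fd, len);	/* grow the file to the size we need */
	p = mmap(0, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	memcpy(p, data, len);	/* the data moves at memory speed */
	kill(peer, SIGUSR1);	/* the ~5 usec notification measured above */
	munmap(p, len);
	close(fd);
}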
[parent not found: <fa.h06p421.1s00ojt@ifi.uio.no>]
[parent not found: <fa.gc37hsp.34id89@ifi.uio.no>]
[parent not found: <E19x47V-0002JG-J8@phoenix.hadiko.de>]
* Re: Efficient IPC mechanism on Linux [not found] ` <E19x47V-0002JG-J8@phoenix.hadiko.de> @ 2003-09-10 12:45 ` Luca Veraldi 0 siblings, 0 replies; 74+ messages in thread From: Luca Veraldi @ 2003-09-10 12:45 UTC (permalink / raw) To: uek32z; +Cc: linux-kernel > What occurred to me: > Perhaps you're interested in the L4KA microkernel project > at my University, Karlsruhe, Germany. > The URL is http://l4ka.org/ Sure, I know it. Great work. But it is a completely new Operating System, and it is also a microkernel one. Here, they seem not to understand what performance is; figure out what will happen if I start talking about microkernels. However, that is the future, not the old and dirty monolithic ones. And ***THEORY*** supports it. Bye, Luca ^ permalink raw reply [flat|nested] 74+ messages in thread
[parent not found: <u9j3.1VB.27@gated-at.bofh.it>]
[parent not found: <u9j3.1VB.29@gated-at.bofh.it>]
[parent not found: <u9j3.1VB.31@gated-at.bofh.it>]
[parent not found: <u9j3.1VB.25@gated-at.bofh.it>]
[parent not found: <ubNY.5Ma.19@gated-at.bofh.it>]
[parent not found: <uc79.6lg.13@gated-at.bofh.it>]
[parent not found: <uc7d.6lg.23@gated-at.bofh.it>]
[parent not found: <uch0.6zx.17@gated-at.bofh.it>]
[parent not found: <ucqs.6NC.3@gated-at.bofh.it>]
[parent not found: <ucqy.6NC.19@gated-at.bofh.it>]
[parent not found: <udmB.8eZ.15@gated-at.bofh.it>]
[parent not found: <udPF.BD.11@gated-at.bofh.it>]
[parent not found: <3F5F37CD.6060808@softhome.net>]
* Re: Efficient IPC mechanism on Linux [not found] ` <3F5F37CD.6060808@softhome.net> @ 2003-09-10 15:28 ` Luca Veraldi 0 siblings, 0 replies; 74+ messages in thread From: Luca Veraldi @ 2003-09-10 15:28 UTC (permalink / raw) To: Ihar 'Philips' Filipau; +Cc: linux-kernel > Can you forward-port your work to the 2.6 kernel? > Can you benchmark it against the same primitives in the 2.6 kernel? I could do it only if there were an actual reason to do it. And there isn't. They have just sent me a message with the pipe latency under kernel 2.4. And it is exactly the same (apart from some minor variations in measurements, which are natural enough). > You have just started your work - are you going to finish it? Or was it > just-another-academical-study? I consider my work finished. I studied an efficient IPC mechanism and I tried to implement it on an existing Operating System. I tried some other IPC primitives in benchmark tests and reported the comparisons between completion times. I achieved my goal. > If you can try to develop new semantics for the old syntax (shm* & etc) it > can be welcomed too. > And if it would be poll()able - it would be great. Applications which > block on read()/getmsg() are, in real life, not that common, and as I've > understood - this is the case for your message passing structure. I'm not a kernel developer. Ask Linus Torvalds. Bye, Luca ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: Efficient IPC mechanism on Linux
@ 2003-09-10 18:41 Manfred Spraul
0 siblings, 0 replies; 74+ messages in thread
From: Manfred Spraul @ 2003-09-10 18:41 UTC (permalink / raw)
To: Luca Veraldi; +Cc: linux-kernel, Nick Piggin
>
>
>Hi Luca,
>There was a zero-copy pipe implementation floating around a while ago
>I think. Did you have a look at that? IIRC it had advantages and
>disadvantages over regular pipes in performance.
>
It doesn't have any performance disadvantages over regular pipes.
The problem is that it only helps for lmbench. Real world apps use
buffered io, which uses PAGE_SIZEd buffers, which mandate certain
atomicity requirements, which limit the gains that can be achieved.
If someone wants to try it again I can search for my patches.
Actually there are two independent improvements that are possible for
the pipe code:
- zero-copy. The name is misleading. In reality this does a single copy:
the sender creates a list of the pages with the data instead of copying
the data to kernel buffers. The receiver copies from the pages directly
to user space. The main advantage is that this implementation avoids one
copy without requiring any tlb flushes. It works with arbitrarily aligned
user space buffers.
- use larger kernel buffers. Linux uses a 4 kB internal buffer. The
result is a very simple implementation that causes an incredible number
of context switches. Larger buffers reduce that. The problem is
accounting, and avoiding using too much memory for the pipe buffers. A
patch without accounting should be somewhere. The BSD unices have larger
pipe buffers, and a few pipes with huge buffers [perfect for lmbench
bragging].
Luca: An algorithm that uses page flipping, or requires tlb flushes,
probably performs badly in real-world scenarios. First you have the cost
of the tlb flush, then the inter-processor-interrupt to flush all cpus
on SMP. And if the target app doesn't expect that you use page flipping,
then you get a page fault [with another tlb flush], to break the COW
share. OTOH if you redesign the app for page flipping, then it could
also use sysv shm, and send a short message with a pointer to the shm
segment.
--
Manfred
^ permalink raw reply [flat|nested] 74+ messages in thread
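(A sketch of Manfred's last suggestion: keep the bulk data in a SysV shm segment and send only a short descriptor down the pipe. The struct and names are invented for illustration, and error handling is omitted; the receiver would read the descriptor and shmat() the segment:)

#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <unistd.h>

struct desc { int shmid; size_t len; };

/* sender: one copy into the segment, then a few bytes of control data */
static void send_big(int pipefd, const void *data, size_t len)
{
	struct desc d;
	char *p;

	d.shmid = shmget(IPC_PRIVATE, len, IPC_CREAT | 0600);
	d.len = len;
	p = shmat(d.shmid, NULL, 0);
	memcpy(p, data, len);		/* the only copy of the payload */
	shmdt(p);
	write(pipefd, &d, sizeof(d));	/* tiny control message, cheap to copy twice */
}

/* receiver: read(pipefd, &d, sizeof(d)); p = shmat(d.shmid, NULL, SHM_RDONLY); ... */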
end of thread, other threads:[~2003-09-18 8:18 UTC | newest] Thread overview: 74+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2003-09-09 17:30 Efficient IPC mechanism on Linux Luca Veraldi 2003-09-09 21:17 ` Alan Cox 2003-09-09 21:57 ` Luca Veraldi 2003-09-09 23:11 ` Alan Cox 2003-09-10 9:04 ` Luca Veraldi 2003-09-10 12:56 ` Alan Cox [not found] ` <20030909175821.GL16080@Synopsys.COM> [not found] ` <001d01c37703$8edc10e0$36af7450@wssupremo> [not found] ` <20030910064508.GA25795@Synopsys.COM> 2003-09-10 9:18 ` Luca Veraldi 2003-09-10 9:23 ` Arjan van de Ven 2003-09-10 9:40 ` Luca Veraldi 2003-09-10 9:44 ` Arjan van de Ven 2003-09-10 10:09 ` Luca Veraldi 2003-09-10 10:14 ` Arjan van de Ven 2003-09-10 10:25 ` Luca Veraldi 2003-09-12 18:41 ` Timothy Miller 2003-09-12 19:05 ` Luca Veraldi 2003-09-12 22:37 ` Alan Cox 2003-09-10 12:50 ` Alan Cox 2003-09-10 19:16 ` Shawn 2003-09-10 20:05 ` Rik van Riel 2003-09-17 9:52 ` Rik's list of CS challenges Terje Eggestad 2003-09-17 13:40 ` Alan Cox 2003-09-18 8:26 ` Helge Hafting 2003-09-10 12:47 ` Efficient IPC mechanism on Linux Alan Cox 2003-09-10 13:56 ` Luca Veraldi 2003-09-10 15:59 ` Alan Cox 2003-09-10 9:52 ` Jamie Lokier 2003-09-10 10:07 ` Arjan van de Ven 2003-09-10 10:17 ` Luca Veraldi 2003-09-10 10:37 ` Jamie Lokier 2003-09-10 10:41 ` Arjan van de Ven 2003-09-10 10:54 ` Luca Veraldi 2003-09-10 10:54 ` Arjan van de Ven 2003-09-10 11:16 ` Nick Piggin 2003-09-10 11:30 ` Luca Veraldi 2003-09-10 11:44 ` Nick Piggin 2003-09-10 12:14 ` Luca Veraldi 2003-09-10 12:42 ` Alan Cox 2003-09-10 10:11 ` Luca Veraldi 2003-09-10 19:24 ` Pavel Machek 2003-09-10 19:40 ` Jamie Lokier 2003-09-10 21:35 ` Pavel Machek 2003-09-10 22:06 ` Jamie Lokier 2003-09-10 11:52 ` Alex Riesen 2003-09-10 12:14 ` Luca Veraldi 2003-09-10 12:11 ` Alex Riesen 2003-09-10 12:29 ` Luca Veraldi 2003-09-10 12:28 ` Alex Riesen 2003-09-10 12:36 ` Luca Veraldi 2003-09-10 12:36 ` Alex Riesen 2003-09-10 13:33 ` Gábor Lénárt 2003-09-10 14:04 ` Luca Veraldi 2003-09-10 14:21 ` Stewart Smith 2003-09-10 14:39 ` Luca Veraldi 2003-09-10 16:59 ` Andrea Arcangeli 2003-09-10 17:05 ` Andrea Arcangeli 2003-09-10 17:21 ` Luca Veraldi 2003-09-10 17:41 ` Andrea Arcangeli 2003-09-10 17:39 ` Martin Konold 2003-09-10 18:01 ` Andrea Arcangeli 2003-09-10 18:05 ` Martin Konold 2003-09-10 18:31 ` Chris Friesen 2003-09-10 18:08 ` Inappropriate signatures Larry McVoy 2003-09-10 18:52 ` Jamie Lokier 2003-09-10 19:54 ` rsync head? 
[was inappropriate signatures] Joe Perches 2003-09-13 15:39 ` Inappropriate signatures Pavel Machek 2003-09-13 16:49 ` Larry McVoy 2003-09-09 18:59 Efficient IPC mechanism on Linux Luca Veraldi 2003-09-09 22:15 Luca Veraldi [not found] <F71B37536F3B3D4FA521FEC7FCA17933164A@twinsrv.twinox.se> 2003-09-10 10:36 ` Luca Veraldi [not found] <E19x3el-0002Fc-Rj@phoenix.hadiko.de> 2003-09-10 12:16 ` Luca Veraldi 2003-09-10 14:53 ` Larry McVoy [not found] <fa.h06p421.1s00ojt@ifi.uio.no> [not found] ` <fa.gc37hsp.34id89@ifi.uio.no> [not found] ` <E19x47V-0002JG-J8@phoenix.hadiko.de> 2003-09-10 12:45 ` Luca Veraldi [not found] <u9j3.1VB.27@gated-at.bofh.it> [not found] ` <u9j3.1VB.29@gated-at.bofh.it> [not found] ` <u9j3.1VB.31@gated-at.bofh.it> [not found] ` <u9j3.1VB.25@gated-at.bofh.it> [not found] ` <ubNY.5Ma.19@gated-at.bofh.it> [not found] ` <uc79.6lg.13@gated-at.bofh.it> [not found] ` <uc7d.6lg.23@gated-at.bofh.it> [not found] ` <uch0.6zx.17@gated-at.bofh.it> [not found] ` <ucqs.6NC.3@gated-at.bofh.it> [not found] ` <ucqy.6NC.19@gated-at.bofh.it> [not found] ` <udmB.8eZ.15@gated-at.bofh.it> [not found] ` <udPF.BD.11@gated-at.bofh.it> [not found] ` <3F5F37CD.6060808@softhome.net> 2003-09-10 15:28 ` Luca Veraldi 2003-09-10 18:41 Manfred Spraul