linux-kernel.vger.kernel.org archive mirror
* Re: Efficient IPC mechanism on Linux
@ 2003-09-10 18:41 Manfred Spraul
  0 siblings, 0 replies; 66+ messages in thread
From: Manfred Spraul @ 2003-09-10 18:41 UTC (permalink / raw)
  To: Luca Veraldi; +Cc: linux-kernel, Nick Piggin

>
>
>Hi Luca,
>There was a zero-copy pipe implementation floating around a while ago
>I think. Did you have a look at that? IIRC it had advantages and
>disadvantages over regular pipes in performance.
>
It doesn't have any performance disadvantages over regular pipes.
The problem is that it only helps for lmbench. Real-world apps use 
buffered io, which uses PAGE_SIZEd buffers, which mandate certain 
atomicity requirements, which limit the gains that can be achieved.
If someone wants to try it again I can search for my patches.
Actually there are two independent improvements that are possible for 
the pipe code:
- zero-copy. The name is misleading. In reality this does a single copy: 
the sender creates a list of the pages containing the data instead of 
copying the data into kernel buffers. The receiver copies from those pages 
directly to user space. The main advantage is that this implementation 
avoids one copy without requiring any tlb flushes. It works with 
arbitrarily aligned user space buffers.
- use larger kernel buffers. Linux uses a 4 kB internal buffer. The 
result is a very simple implementation that causes an incredible number 
of context switches. Larger buffers reduce that. The problem is 
accounting, and avoiding using too much memory for the pipe buffers. A 
patch without accounting should be somewhere. The BSD unices have larger 
pipe buffers, and a few pipes with huge buffers [perfect for lmbench 
bragging].

Luca: An algorithm that uses page flipping, or requires tlb flushes, 
probably performs badly in real-world scenarios. First you have the cost 
of the tlb flush, then the inter-processor interrupt to flush all cpus 
on SMP. And if the target app doesn't expect that you use page flipping, 
then you get a page fault [with another tlb flush] to break the COW 
share. OTOH if you redesign the app for page flipping, then it could 
also use sysv shm, and send a short message with a pointer to the shm 
segment.
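
A minimal sketch of that shm-plus-short-message alternative (the struct 
shm_msg layout and names are invented for illustration, error handling is 
omitted): the bulk data lives in a SysV shm segment and only a small 
descriptor ever crosses the kernel.

	#include <sys/ipc.h>
	#include <sys/shm.h>
	#include <unistd.h>
	#include <string.h>

	struct shm_msg {		/* hypothetical short message */
		size_t offset;		/* where the payload starts in the segment */
		size_t len;		/* payload length */
	};

	int main(void)
	{
		int fds[2];
		int shmid = shmget(IPC_PRIVATE, 1 << 20, IPC_CREAT | 0600);
		char *base = shmat(shmid, NULL, 0);

		pipe(fds);
		if (fork() == 0) {			/* receiver */
			struct shm_msg m;
			read(fds[0], &m, sizeof(m));	/* short message only */
			/* the payload is already mapped; no bulk copy through the pipe */
			write(1, base + m.offset, m.len);
			_exit(0);
		}
		/* sender: build the payload in place, then send the descriptor */
		struct shm_msg m = { .offset = 0, .len = 5 };
		memcpy(base, "hello", m.len);
		write(fds[1], &m, sizeof(m));
		return 0;
	}

Only sizeof(struct shm_msg) bytes per message go through the kernel; the 
payload itself never does.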

--
    Manfred


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Efficient IPC mechanism on Linux
  2003-09-12 19:05                       ` Luca Veraldi
@ 2003-09-12 22:37                         ` Alan Cox
  0 siblings, 0 replies; 66+ messages in thread
From: Alan Cox @ 2003-09-12 22:37 UTC (permalink / raw)
  To: Luca Veraldi; +Cc: Timothy Miller, Arjan van de Ven, Linux Kernel Mailing List

On Gwe, 2003-09-12 at 20:05, Luca Veraldi wrote:
> If you modify the page tables as in my example (and if you do so only
> for B's pagetable), you can be sure the entries you're modifying were not
> yet present in the hardware TLBs.
> 
> Because those pagetable entries refer to a logical address interval
> you've just allocated in B address space.

They may be present in some form, but you are right that the task switch
will flush them in the non-SMP case on the x86 platform. In the SMP case it
won't work.

Alan


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Efficient IPC mechanism on Linux
  2003-09-12 18:41                     ` Timothy Miller
@ 2003-09-12 19:05                       ` Luca Veraldi
  2003-09-12 22:37                         ` Alan Cox
  0 siblings, 1 reply; 66+ messages in thread
From: Luca Veraldi @ 2003-09-12 19:05 UTC (permalink / raw)
  To: Timothy Miller; +Cc: Arjan van de Ven, linux-kernel

> Pardon my ignorance here, but the impression I get is that changing a
> page table entry is not as simple as just writing to a bit somewhere.  I
> suppose it is if the page descriptor is not loaded into the TLB, but if
> it is, then you have to ensure that the TLB entry matches the page
> table; this may not be a quick operation.
>
> I can think of a lot of other possible complications to this.

I stopped writing about this question a long time ago.
However, here we are.

If you modify the page tables as in my example (and if you do so only
for B's pagetable), you can be sure the entries you're modifying were not
yet present in the hardware TLBs.

Because those pagetable entries refer to a logical address interval
you've just allocated in B's address space.

Bye,
Luca


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Efficient IPC mechanism on Linux
  2003-09-10 10:25                   ` Luca Veraldi
@ 2003-09-12 18:41                     ` Timothy Miller
  2003-09-12 19:05                       ` Luca Veraldi
  0 siblings, 1 reply; 66+ messages in thread
From: Timothy Miller @ 2003-09-12 18:41 UTC (permalink / raw)
  To: Luca Veraldi; +Cc: Arjan van de Ven, linux-kernel



Luca Veraldi wrote:

> 
> Sorry, but I cannot believe it.
> Reading a page tagle entry and storing in into a struct capability is not
> comparable at all with the "for" needed to move bytes all around memory.


Pardon my ignorance here, but the impression I get is that changing a 
page table entry is not as simple as just writing to a bit somewhere.  I 
suppose it is if the page descriptor is not loaded into the TLB, but if 
it is, then you have to ensure that the TLB entry matches the page 
table; this may not be a quick operation.

I can think of a lot of other possible complications to this.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Efficient IPC mechanism on Linux
  2003-09-10 21:35                 ` Pavel Machek
@ 2003-09-10 22:06                   ` Jamie Lokier
  0 siblings, 0 replies; 66+ messages in thread
From: Jamie Lokier @ 2003-09-10 22:06 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Arjan van de Ven, Luca Veraldi, alexander.riesen, linux-kernel

Pavel Machek wrote:
> But you should also time memory remapping stuff on same cpu, no?

Yes, but can I be bothered ;)
Like I said, it's a crude upper bound.

We know the remapping time will be about the same in both tests,
because all it does is the same vma operations and the same TLB
invalidations.

-- Jamie

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Efficient IPC mechanism on Linux
  2003-09-10 19:40               ` Jamie Lokier
@ 2003-09-10 21:35                 ` Pavel Machek
  2003-09-10 22:06                   ` Jamie Lokier
  0 siblings, 1 reply; 66+ messages in thread
From: Pavel Machek @ 2003-09-10 21:35 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Arjan van de Ven, Luca Veraldi, alexander.riesen, linux-kernel

Hi!

> > Can you make it available so we can test on, say, 900MHz athlon? Or
> > you can have it tested on 1800MHz athlon64, that's about as high end
> > as it can get.
> 
> I just deleted the program, so here's a rewrite :)
> 
> 	#include <sys/mman.h>
> 
> 	int main()
> 	{
> 		int i, j;
> 		for (j = 0; j < 64; j++) {
> 			volatile char * ptr =
> 				mmap (0, 4096 * 4096, PROT_READ | PROT_WRITE,
> 				      MAP_PRIVATE | MAP_ANON, -1, 0);
> 			for (i = 0; i < 4096; i++) {
> 	#if 1
> 				*(ptr + 4096 * i) = 0; /* Write */
> 	#else
> 				(void) *(ptr + 4096 * i); /* Read */
> 	#endif
> 			}
> 			munmap ((void *) ptr, 4096 * 4096);
> 		}
> 		return 0;
> 	}
> 
> Smallest results, from "gcc -o test test.c -O2; time ./test" on a
> 1500MHz dual Athlon 1800 MP:
> 
> Write:
> 	real	0m1.316s
> 	user	0m0.059s
> 	sys	0m1.256s
> 
> 	==> 7531 cycles per page
> 
> Read:
> 	real	0m0.199s
> 	user	0m0.053s
> 	sys	0m0.146s
> 
> 	==> 1139 cycles per page
> 
> As I said, it's a crude upper bound.

But you should also time memory remapping stuff on same cpu, no?

							Pavel
-- 
When do you have a heart between your knees?
[Johanka's followup: and *two* hearts?]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Efficient IPC mechanism on Linux
  2003-09-10 10:09               ` Luca Veraldi
                                   ` (2 preceding siblings ...)
  2003-09-10 19:16                 ` Shawn
@ 2003-09-10 20:05                 ` Rik van Riel
  3 siblings, 0 replies; 66+ messages in thread
From: Rik van Riel @ 2003-09-10 20:05 UTC (permalink / raw)
  To: Luca Veraldi; +Cc: Arjan van de Ven, linux-kernel

On Wed, 10 Sep 2003, Luca Veraldi wrote:

> I'm not responsible for microarchitecture designer stupidity.
> If a simple STORE assembler instruction will eat up 4000 clock cycles,
> as you say here, well,

If current trends continue, an L2 cache miss will be
taking 5000 cycles in 15 to 20 years.

> I think all we Computer Scientists can go home and give it up now.

While I have seen some evidence of computer scientists
going home and ignoring the problems presented to them
by current hardware constraints, I'd really prefer it
if they took up the challenge and did the research on
how we should deal with hardware in the future.

In fact, I've made up a little (incomplete) list of
things that I suspect are in need of serious CS research,
because otherwise both OS theory and practice will be
unable to deal with the hardware of the next decade.

	http://surriel.com/research_wanted/

If you have any suggestions for the list, please let
me know.

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Efficient IPC mechanism on Linux
  2003-09-10 19:24             ` Pavel Machek
@ 2003-09-10 19:40               ` Jamie Lokier
  2003-09-10 21:35                 ` Pavel Machek
  0 siblings, 1 reply; 66+ messages in thread
From: Jamie Lokier @ 2003-09-10 19:40 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Arjan van de Ven, Luca Veraldi, alexander.riesen, linux-kernel

Pavel Machek wrote:
> Can you make it available so we can test on, say, 900MHz athlon? Or
> you can have it tested on 1800MHz athlon64, that's about as high end
> as it can get.

I just deleted the program, so here's a rewrite :)

	#include <sys/mman.h>

	int main()
	{
		int i, j;
		for (j = 0; j < 64; j++) {
			volatile char * ptr =
				mmap (0, 4096 * 4096, PROT_READ | PROT_WRITE,
				      MAP_PRIVATE | MAP_ANON, -1, 0);
			for (i = 0; i < 4096; i++) {
	#if 1
				*(ptr + 4096 * i) = 0; /* Write */
	#else
				(void) *(ptr + 4096 * i); /* Read */
	#endif
			}
			munmap ((void *) ptr, 4096 * 4096);
		}
		return 0;
	}

Smallest results, from "gcc -o test test.c -O2; time ./test" on a
1500MHz dual Athlon 1800 MP:

Write:
	real	0m1.316s
	user	0m0.059s
	sys	0m1.256s

	==> 7531 cycles per page

Read:
	real	0m0.199s
	user	0m0.053s
	sys	0m0.146s

	==> 1139 cycles per page

As I said, it's a crude upper bound.

-- Jamie

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Efficient IPC mechanism on Linux
  2003-09-10  9:52           ` Jamie Lokier
  2003-09-10 10:07             ` Arjan van de Ven
  2003-09-10 10:11             ` Luca Veraldi
@ 2003-09-10 19:24             ` Pavel Machek
  2003-09-10 19:40               ` Jamie Lokier
  2 siblings, 1 reply; 66+ messages in thread
From: Pavel Machek @ 2003-09-10 19:24 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Arjan van de Ven, Luca Veraldi, alexander.riesen, linux-kernel

Hi!

> > > The overhead implied by a memcpy() is the same, in the oder of magnitude,
> > > ***whatever*** kernel version you can develop.
> > 
> > yes a copy of a page is about 3000 to 4000 cycles on an x86 box in the
> > uncached case. A pagetable operation (like the cpu setting the accessed
> > or dirty bit) is in that same order I suspect (maybe half this, but not
> > a lot less). Changing pagetable content is even more because all the
> > tlb's and internal cpu state will need to be flushed... which is also a
> > microcode operation for the cpu. And it's deadly in an SMP environment.
> 
> I have just done a measurement on a 366MHz PII Celeron.  The test is
> quite crude, but these numbers are upper bounds on the timings:
> 
> 	mean 813 cycles to:
> 
> 		page fault
> 		install zero page
> 		remove zero page
> 
> 		(= read every page from MAP_ANON region and unmap, repeatedly)
> 
> 	mean 11561 cycles to:
> 
> 		page fault
> 		copy zero page
> 		install copy
> 		remove copy
> 
> 		(= write every page from MAP_ANON region and unmap, repeatedly)
> 
> I think that makes a clear case for avoiding page copies.

Can you make it available so we can test on, say, 900MHz athlon? Or
you can have it tested on 1800MHz athlon64, that's about as high end
as it can get.

								Pavel

-- 
When do you have a heart between your knees?
[Johanka's followup: and *two* hearts?]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Efficient IPC mechanism on Linux
  2003-09-10 10:09               ` Luca Veraldi
  2003-09-10 10:14                 ` Arjan van de Ven
  2003-09-10 12:50                 ` Alan Cox
@ 2003-09-10 19:16                 ` Shawn
  2003-09-10 20:05                 ` Rik van Riel
  3 siblings, 0 replies; 66+ messages in thread
From: Shawn @ 2003-09-10 19:16 UTC (permalink / raw)
  To: Luca Veraldi; +Cc: Arjan van de Ven, linux-kernel

On Wed, 2003-09-10 at 05:09, Luca Veraldi wrote:
> > For fun do the measurement on a pIV cpu. You'll be surprised.
> > The microcode "mark dirty" (which is NOT a btsl, it gets done when you do
> a write
> > memory access to the page content) result will be in the 2000 to 4000
> range I
> > predict.
> 
> I'm not responsible for microarchitecture designer stupidity.
> If a simple STORE assembler instruction will eat up 4000 clock cycles,
> as you say here, well, I think all we Computer Scientists can go home and
> give it up now.
Unfortunately you are responsible for working with the constructs reality
has given you. Giving up is childish. Where there is a loss, there's a
chance it was a tradeoff for a win elsewhere.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Efficient IPC mechanism on Linux
  2003-09-10 18:05       ` Martin Konold
@ 2003-09-10 18:31         ` Chris Friesen
  0 siblings, 0 replies; 66+ messages in thread
From: Chris Friesen @ 2003-09-10 18:31 UTC (permalink / raw)
  To: Martin Konold; +Cc: Andrea Arcangeli, Luca Veraldi, linux-kernel

Martin Konold wrote:
> Am Wednesday 10 September 2003 08:01 pm schrieb Andrea Arcangeli:

>>with the shm/futex approch you can also have a ring buffer to handle
>>parallelism better while it's at the same time zerocopy
>>
> 
> How fast will you get? I think you will get the bandwidth of a memcpy for 
> large chunks?!
> 
> This is imho not really zerocopy. The data has to travel over the memory bus 
> involving the CPU so I would call this single copy ;-)

Even with zerocopy, you have to build the message somehow.  If process A 
builds a message and passes it to process B, then isn't that as 
efficient as you can get with message passing?  How does MPI do any better?

However, if both processes directly map the same memory and use it 
directly (without "messages" as such), then I would see *that* as zerocopy.

Chris

-- 
Chris Friesen                    | MailStop: 043/33/F10
Nortel Networks                  | work: (613) 765-0557
3500 Carling Avenue              | fax:  (613) 765-2986
Nepean, ON K2H 8E9 Canada        | email: cfriesen@nortelnetworks.com


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Efficient IPC mechanism on Linux
  2003-09-10 18:01     ` Andrea Arcangeli
@ 2003-09-10 18:05       ` Martin Konold
  2003-09-10 18:31         ` Chris Friesen
  0 siblings, 1 reply; 66+ messages in thread
From: Martin Konold @ 2003-09-10 18:05 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Luca Veraldi, linux-kernel

Am Wednesday 10 September 2003 08:01 pm schrieb Andrea Arcangeli:

Hi Andreas,

> > The idea is while DMA has much higher bandwidth than PIO DMA is more
> > expensive to initiate than PIO. So DMA is only useful for large messages.
>
> agreed.
>
> > In the local SMP case there do exist userspace APIs like MPI which can do
>
> btw, so far we were only discussing IPC in a local box (UP or SMP or
> NUMA) w/o networking involved. Luca's currnet implementation as well was
> only working locally.

Yes, and I claim that the best you can get for large messages is a plain 
single copy userspace implementation as already implemented by some people 
using the MPI API.

> > True zero copy has unlimited (sigh!) bandwidth within an SMP and does not
> > really make sense in contrast to a network.
>
> if you can avoid to enter kernel, you'd better do that, because entering
> kernel will take much more time than the copy itself.

Yes, doing HPC the kernel may only be used to initialize the initial 
communication channels (e.g. handling permissions etc.). The kernel must be 
avoided for the actual communication by any means.

> with the shm/futex approch you can also have a ring buffer to handle
> parallelism better while it's at the same time zerocopy

How fast will you get? I think you will get the bandwidth of a memcpy for 
large chunks?!

This is imho not really zerocopy. The data has to travel over the memory bus 
involving the CPU so I would call this single copy ;-)

> and enterely
> userspace based in the best case (thought that's not the common case).

Yes a userspace library using the standard MPI API is the proven best approach 
and freely downloadable from 

http://www.pccluster.org/score/dist/pub/score-5.4.0/source/score-5.4.0.mpi.tar.gz

Regards,
-- martin

Dipl.-Phys. Martin Konold
e r f r a k o n
Erlewein, Frank, Konold & Partner - Beratende Ingenieure und Physiker
Nobelstrasse 15, 70569 Stuttgart, Germany
fon: 0711 67400963, fax: 0711 67400959
email: martin.konold@erfrakon.de

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Efficient IPC mechanism on Linux
  2003-09-10 17:39   ` Martin Konold
@ 2003-09-10 18:01     ` Andrea Arcangeli
  2003-09-10 18:05       ` Martin Konold
  0 siblings, 1 reply; 66+ messages in thread
From: Andrea Arcangeli @ 2003-09-10 18:01 UTC (permalink / raw)
  To: Martin Konold; +Cc: Luca Veraldi, linux-kernel

On Wed, Sep 10, 2003 at 07:39:17PM +0200, Martin Konold wrote:
> Am Wednesday 10 September 2003 06:59 pm schrieb Andrea Arcangeli:
> 
> Hi,
> 
> > design that I'm suggesting. Obviously lots of apps are already using
> > this design and there's no userspace API simply because that's not
> > needed.
> 
> HPC people have investigated High performance IPC many times basically it 
> boils down to:
> 
> 1. Userspace is much more efficient than kernel space. So efficient 
> implementions avoid kernel space even for message transfers over networks 
> (e.g. DMA directly to userspace). 
> 
> 2. The optimal protocol to use and the number of copies to do is depending on 
> the message size.
> 
> Small messages are most efficiently handled with a single/dual copy (short 
> protocol / eager protocol) and large messages are more efficient with 
> single/zero copy techniques (get protocol) depending if a network is involved 
> or not.
> 
> Traditionally in a networked environment single copy means PIO and zero copy 
> means DMA when using network hardware.
> 
> The idea is while DMA has much higher bandwidth than PIO DMA is more expensive 
> to initiate than PIO. So DMA is only useful for large messages.

agreed.

> 
> In the local SMP case there do exist userspace APIs like MPI which can do 

btw, so far we were only discussing IPC in a local box (UP or SMP or
NUMA) w/o networking involved. Luca's current implementation likewise
only works locally.

> single copy Message passing at memory speed in pure userspace since many 
> years.
> 
> The following PDF documents a measurement where the communication between two 
> processes running on different CPUs in an SMP system is exactly the memcpy 
> bandwidth for large messages using a single copy get protocol.
> 
> 	http://ipdps.eece.unm.edu/1999/pc-now/takahash.pdf
> 
> Numbers from a Dual P-II-333, Intel 440LX (100MB/s memcpy)
> 
> 					eager 	get
> min. Latency µs		8.62		9.98
> max Bandwidth MB/s	48.03	100.02
> half bandwith point KB	0.3		2.5
> 
> You can easily see that the eager has better latency for very short messages 
> and that the get has a max bandwidth beeing equivalent of a memcpy (single 
> copy).
> 
> True zero copy has unlimited (sigh!) bandwidth within an SMP and does not 
> really make sense in contrast to a network.

if you can avoid entering the kernel, you'd better do that, because entering
the kernel will take much more time than the copy itself.

with the shm/futex approach you can also have a ring buffer to handle
parallelism better, while it's at the same time zerocopy and entirely
userspace based in the best case (though that's not the common case).
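
A rough sketch of such a ring buffer, assuming a single producer and a single
consumer sharing an anonymous mapping (no error handling; a real version
would only FUTEX_WAKE when the peer is actually sleeping, here it is done
unconditionally to keep the sketch short):

	#define _GNU_SOURCE
	#include <sys/mman.h>
	#include <linux/futex.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	#define RING_SIZE 4096			/* power of two, illustrative only */

	struct ring {
		volatile unsigned int head;	/* advanced by the producer */
		volatile unsigned int tail;	/* advanced by the consumer */
		char buf[RING_SIZE];
	};

	static int futex(volatile unsigned int *addr, int op, unsigned int val)
	{
		return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
	}

	/* producer: one copy into the ring, sleep only if the ring is full */
	static void ring_put(struct ring *r, const char *msg, unsigned int len)
	{
		while (RING_SIZE - (r->head - r->tail) < len)
			futex(&r->tail, FUTEX_WAIT, r->tail);
		for (unsigned int i = 0; i < len; i++)
			r->buf[(r->head + i) % RING_SIZE] = msg[i];
		__sync_synchronize();
		r->head += len;
		futex(&r->head, FUTEX_WAKE, 1);
	}

	/* consumer: one copy out of the ring, sleep only if the ring is empty */
	static void ring_get(struct ring *r, char *out, unsigned int len)
	{
		while (r->head - r->tail < len)
			futex(&r->head, FUTEX_WAIT, r->head);
		for (unsigned int i = 0; i < len; i++)
			out[i] = r->buf[(r->tail + i) % RING_SIZE];
		__sync_synchronize();
		r->tail += len;
		futex(&r->tail, FUTEX_WAKE, 1);
	}

	int main(void)
	{
		/* MAP_SHARED|MAP_ANONYMOUS so the fork()ed child sees the same ring */
		struct ring *r = mmap(NULL, sizeof(*r), PROT_READ | PROT_WRITE,
				      MAP_SHARED | MAP_ANONYMOUS, -1, 0);
		if (fork() == 0) {
			char out[5];
			ring_get(r, out, 5);
			write(1, out, 5);
			_exit(0);
		}
		ring_put(r, "hello", 5);
		return 0;
	}

The data is copied once into the shared ring and once out of it; the
FUTEX_WAIT calls are reached only when one side actually has to sleep.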

thanks,

Andrea

/*
 * If you refuse to depend on closed software for a critical
 * part of your business, these links may be useful:
 *
 * rsync.kernel.org::pub/scm/linux/kernel/bkcvs/linux-2.5/
 * rsync.kernel.org::pub/scm/linux/kernel/bkcvs/linux-2.4/
 * http://www.cobite.com/cvsps/
 *
 * svn://svn.kernel.org/linux-2.6/trunk
 * svn://svn.kernel.org/linux-2.4/trunk
 */

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Efficient IPC mechanism on Linux
  2003-09-10 17:21     ` Luca Veraldi
@ 2003-09-10 17:41       ` Andrea Arcangeli
  0 siblings, 0 replies; 66+ messages in thread
From: Andrea Arcangeli @ 2003-09-10 17:41 UTC (permalink / raw)
  To: Luca Veraldi; +Cc: linux-kernel

On Wed, Sep 10, 2003 at 07:21:09PM +0200, Luca Veraldi wrote:
> > sorry for the self followup, but I just read another message where you
> > mention 2.2, if that was for 2.2 the locking was the ok.
> 
> Oh, good. I was already going to put my head under sand. :-)

;)

> 
> I'm not so expert about the kernel. I've just read [ORLL]
> and some bits of the kernel sources.
> So, error in the codes are not so strange.

Ok no problem then.

> But, it's better now that you know we are talking about version 2.2...
> 
> I'm glad to hear that locking are ok.
> 
> You say:
> 
> > in terms of design as far as I can tell the most efficient way to do
> > message passing is not pass the data through the kernel at all (no
> > matter if you intend to copy it or not), and to simply use futex on top
> > of shm to synchronize/wakeup the access.  If we want to make an API
> > widespread, that should be simply an userspace library only.
> >
> > It's very inefficient to mangle pagetables and flush the tlb in a flood
> > like you're doing (or better like you should do), when you can keep the
> 
> I guess futex are some kind of semaphore flavour under linux 2.4/2.6.
> However, you need to use SYS V shared memory in any case.

I need shared memory yes, but I can generate it with /dev/shm or clone(),
it doesn't really make any difference. I would never use the ugly SYSV
IPC API personally ;).

> Tests for SYS V shared memory are included in the web page
> (even though using SYS V semaphores).

IPC semaphores are a totally different thing.

Basically the whole ipc/ directory is an inefficient obsolete piece of
code that nobody should use.

The real thing is (/dev/shm || clone()) + futex.

> I don't think, reading the numbers, that managing pagetables "is very
> inefficient".

Yep, it can certainly be more efficient than IPC semaphores or whatever
other IPC thing is in ipc/.

However the speed of the _shm_ API doesn't matter: you map the shm only
once and you serialize the messaging with the futex. So even if it takes
a bit longer to set up the shm with IPC shm, it won't matter; even using
IPC shm + futex would be fine (though I find /dev/shm a much more friendly
API).

> I think very inefficient are SYS V semaphore orethe double-copying channel
> you call a pipe.

you're right in terms of bandwidth if the only thing the reader and the
writer do is the data transfer.

However, if the app is computing stuff besides listening to the pipe (for
example if it's in nonblocking mode), the double copying allows the writer
to return well before the reader has started reading the info. So the
intermediate ram increases parallelism a bit. One could play tricks
with COW as well, though it would be way more complex than the current
double copy ;).

(as somebody pointed out) you may want to compare the pipe code with the
new zerocopy-pipe one (the one that IMHO has a chance to decrease
parallelism)

>  having all the pages locked in memory is not a necessary condition
> for the applicability of communication mechanisms based on capabilities.
> Simply, it make it easier to write the code and does not make me crazy
> with the Linux swapping system.

ok ;)

Overall my point is that the best approach is to keep the ram mapped in both
tasks at the same time and to use the kernel only for synchronization
(i.e. the wakeup/schedule calls in your code), removing the pinning/pte
mangling entirely from the design. The futex provides more efficient
synchronization since it doesn't even enter kernel space if the writer
completes before the reader starts and the reader completes before the
writer starts again (so it has a chance to be better than any other
kernel-based scheme that is always forced to enter the kernel even in
the very best case).
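
That fast path can be sketched like this (a single writer and a single
reader, GCC atomic builtins, invented names; just an illustration, not ECBM
code or any existing library API): the flag lives in shared memory, and
futex() is only called when one side really has to sleep or has announced
that it sleeps.

	#define _GNU_SOURCE
	#include <sys/mman.h>
	#include <linux/futex.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	/* flag values: 0 = no message, 1 = message posted, 2 = reader sleeping */

	static int futex_syscall(int *addr, int op, int val)
	{
		return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
	}

	static void writer_post(int *flag)
	{
		/* mark a message as posted; wake only if the reader said it sleeps */
		if (__atomic_exchange_n(flag, 1, __ATOMIC_SEQ_CST) == 2)
			futex_syscall(flag, FUTEX_WAKE, 1);
	}

	static void reader_wait(int *flag)
	{
		for (;;) {
			/* fast path: the writer already posted, no kernel entry at all */
			if (__atomic_exchange_n(flag, 0, __ATOMIC_SEQ_CST) == 1)
				return;
			/* slow path: announce that we will sleep, then wait in the kernel */
			int expected = 0;
			if (__atomic_compare_exchange_n(flag, &expected, 2, 0,
							__ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST))
				futex_syscall(flag, FUTEX_WAIT, 2);
		}
	}

	int main(void)
	{
		int *flag = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
				 MAP_SHARED | MAP_ANONYMOUS, -1, 0);
		if (fork() == 0) {
			reader_wait(flag);
			write(1, "woken\n", 6);
			_exit(0);
		}
		writer_post(flag);
		return 0;
	}

If the writer posts before the reader asks, reader_wait() returns without a
single syscall, which is exactly the case where a pipe would still have paid
two kernel entries.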

Andrea

/*
 * If you refuse to depend on closed software for a critical
 * part of your business, these links may be useful:
 *
 * rsync.kernel.org::pub/scm/linux/kernel/bkcvs/linux-2.5/
 * rsync.kernel.org::pub/scm/linux/kernel/bkcvs/linux-2.4/
 * http://www.cobite.com/cvsps/
 *
 * svn://svn.kernel.org/linux-2.6/trunk
 * svn://svn.kernel.org/linux-2.4/trunk
 */

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Efficient IPC mechanism on Linux
  2003-09-10 16:59 ` Andrea Arcangeli
  2003-09-10 17:05   ` Andrea Arcangeli
@ 2003-09-10 17:39   ` Martin Konold
  2003-09-10 18:01     ` Andrea Arcangeli
  1 sibling, 1 reply; 66+ messages in thread
From: Martin Konold @ 2003-09-10 17:39 UTC (permalink / raw)
  To: Andrea Arcangeli, Luca Veraldi; +Cc: linux-kernel

Am Wednesday 10 September 2003 06:59 pm schrieb Andrea Arcangeli:

Hi,

> design that I'm suggesting. Obviously lots of apps are already using
> this design and there's no userspace API simply because that's not
> needed.

HPC people have investigated high-performance IPC many times; basically it 
boils down to:

1. Userspace is much more efficient than kernel space. So efficient 
implementations avoid kernel space even for message transfers over networks 
(e.g. DMA directly to userspace). 

2. The optimal protocol to use and the number of copies to do depend on 
the message size.

Small messages are most efficiently handled with a single/dual copy (short 
protocol / eager protocol) and large messages are more efficient with 
single/zero copy techniques (get protocol), depending on whether a network 
is involved or not.

Traditionally in a networked environment single copy means PIO and zero copy 
means DMA when using network hardware.

The idea is that while DMA has much higher bandwidth than PIO, DMA is more 
expensive to initiate. So DMA is only useful for large messages.

In the local SMP case there exist userspace APIs like MPI which have been 
able to do single-copy message passing at memory speed in pure userspace 
for many years.

The following PDF documents a measurement where the communication between two 
processes running on different CPUs in an SMP system is exactly the memcpy 
bandwidth for large messages using a single copy get protocol.

	http://ipdps.eece.unm.edu/1999/pc-now/takahash.pdf

Numbers from a Dual P-II-333, Intel 440LX (100MB/s memcpy)

                           eager      get
min. latency (µs)           8.62     9.98
max. bandwidth (MB/s)      48.03   100.02
half-bandwidth point (KB)    0.3      2.5

You can easily see that eager has better latency for very short messages 
and that get has a max bandwidth equivalent to a memcpy (single 
copy).

True zero copy has unlimited (sigh!) bandwidth within an SMP and does not 
really make sense in contrast to a network.

Regards,
-- martin

Dipl.-Phys. Martin Konold
e r f r a k o n
Erlewein, Frank, Konold & Partner - Beratende Ingenieure und Physiker
Nobelstrasse 15, 70569 Stuttgart, Germany
fon: 0711 67400963, fax: 0711 67400959
email: martin.konold@erfrakon.de

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Efficient IPC mechanism on Linux
  2003-09-10 17:05   ` Andrea Arcangeli
@ 2003-09-10 17:21     ` Luca Veraldi
  2003-09-10 17:41       ` Andrea Arcangeli
  0 siblings, 1 reply; 66+ messages in thread
From: Luca Veraldi @ 2003-09-10 17:21 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-kernel

> sorry for the self followup, but I just read another message where you
> mention 2.2, if that was for 2.2 the locking was the ok.

Oh, good. I was already going to put my head under sand. :-)

I'm not so expert about the kernel. I've just read [ORLL]
and some bits of the kernel sources.
So, errors in the code are not so strange.

But, it's better now that you know we are talking about version 2.2...

I'm glad to hear that the locking is ok.

You say:

> in terms of design as far as I can tell the most efficient way to do
> message passing is not pass the data through the kernel at all (no
> matter if you intend to copy it or not), and to simply use futex on top
> of shm to synchronize/wakeup the access.  If we want to make an API
> widespread, that should be simply an userspace library only.
>
> It's very inefficient to mangle pagetables and flush the tlb in a flood
> like you're doing (or better like you should do), when you can keep the

I guess futexes are some kind of semaphore flavour under linux 2.4/2.6.
However, you need to use SYS V shared memory in any case.
Tests for SYS V shared memory are included in the web page
(even though using SYS V semaphores).

I don't think, reading the numbers, that managing pagetables "is very
inefficient".
What I think is very inefficient are the SYS V semaphores or the
double-copying channel you call a pipe.

> there's also an obvious DoS that is trivial to generate by locking in
> ram some 64G of ram with ecbm_create_capability() see the for(count=0;
> count<pages; ++count) atomic_inc (btw, you should use get_page, and all
> the operations like LockPage to play with pages).

As I say in the web page,

 having all the pages locked in memory is not a necessary condition
for the applicability of communication mechanisms based on capabilities.
Simply, it makes it easier to write the code and keeps me from going crazy
with the Linux swapping system.

Bye bye,
Luca



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Efficient IPC mechanism on Linux
  2003-09-10 16:59 ` Andrea Arcangeli
@ 2003-09-10 17:05   ` Andrea Arcangeli
  2003-09-10 17:21     ` Luca Veraldi
  2003-09-10 17:39   ` Martin Konold
  1 sibling, 1 reply; 66+ messages in thread
From: Andrea Arcangeli @ 2003-09-10 17:05 UTC (permalink / raw)
  To: Luca Veraldi; +Cc: linux-kernel

On Wed, Sep 10, 2003 at 06:59:44PM +0200, Andrea Arcangeli wrote:
> About the implementation - the locking looks very wrong: you miss the
> page_table_lock in all the pte walking, you take a totally worthless
> lock_kernel() all over the place for no good reason, and the

sorry for the self followup, but I just read another message where you
mention 2.2, if that was for 2.2 the locking was the ok.

all other problems remains though.

Andrea

/*
 * If you refuse to depend on closed software for a critical
 * part of your business, these links may be useful:
 *
 * rsync.kernel.org::pub/scm/linux/kernel/bkcvs/linux-2.5/
 * rsync.kernel.org::pub/scm/linux/kernel/bkcvs/linux-2.4/
 * http://www.cobite.com/cvsps/
 *
 * svn://svn.kernel.org/linux-2.6/trunk
 * svn://svn.kernel.org/linux-2.4/trunk
 */

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Efficient IPC mechanism on Linux
  2003-09-09 17:30 Luca Veraldi
                   ` (2 preceding siblings ...)
  2003-09-10 14:21 ` Stewart Smith
@ 2003-09-10 16:59 ` Andrea Arcangeli
  2003-09-10 17:05   ` Andrea Arcangeli
  2003-09-10 17:39   ` Martin Konold
  3 siblings, 2 replies; 66+ messages in thread
From: Andrea Arcangeli @ 2003-09-10 16:59 UTC (permalink / raw)
  To: Luca Veraldi; +Cc: linux-kernel

Ciao Luca,

On Tue, Sep 09, 2003 at 07:30:58PM +0200, Luca Veraldi wrote:
> Hi all.
> At the web page
> http://web.tiscali.it/lucavera/www/root/ecbm/index.htm
> You can find the results of my attempt in modifing the linux kernel sources
> to implement a new Inter Process Communication mechanism.
> 
> It is called ECBM for Efficient Capability-Based Messaging.
> 
> In the reading You can also find the comparison of ECBM 
> against some other commonly-used Linux IPC primitives 
> (such as read/write on pipes or SYS V tools).
> 
> The results are quite clear.

in terms of design as far as I can tell the most efficient way to do
message passing is not to pass the data through the kernel at all (no
matter if you intend to copy it or not), and to simply use futex on top
of shm to synchronize/wakeup the access.  If we want to make an API
widespread, that should simply be a userspace library only.

It's very inefficient to mangle pagetables and flush the tlb in a flood
like you're doing (or better, like you should do), when you can keep the
memory mapped in *both* tasks at the same time *always*, and there's no
need of any kernel modification at all for that much more efficient
design that I'm suggesting. Obviously lots of apps are already using
this design and there's no userspace API simply because that's not
needed. The only thing we need from the kernel is the wakeup mechanism
and that's already provided by the futex (in the past userspace apps
using this design used sched_yield, and that was very bad).
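
A minimal sketch of that setup (hypothetical /dev/shm object name, no error
handling; the futex wakeup itself is left out here, and older glibc needs
-lrt for shm_open):

	#include <sys/mman.h>
	#include <sys/stat.h>
	#include <fcntl.h>
	#include <unistd.h>
	#include <string.h>

	#define SHM_NAME  "/ipc-demo"		/* hypothetical object name */
	#define SHM_SIZE  (1 << 20)

	int main(int argc, char **argv)
	{
		/* both sides run the same code; the first one creates the object */
		int fd = shm_open(SHM_NAME, O_RDWR | O_CREAT, 0600);
		char *buf;

		ftruncate(fd, SHM_SIZE);
		buf = mmap(NULL, SHM_SIZE, PROT_READ | PROT_WRITE,
			   MAP_SHARED, fd, 0);

		if (argc > 1 && !strcmp(argv[1], "writer"))
			strcpy(buf, "message built in place, visible to the peer\n");
		else
			write(1, buf, strlen(buf));	/* reads the peer's data directly */

		munmap(buf, SHM_SIZE);
		close(fd);
		return 0;
	}

Once both processes have done the mmap() there is no per-message kernel
involvement left except the wakeup, which is what the futex covers.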

About the implementation - the locking looks very wrong: you miss the
page_table_lock in all the pte walking, you take a totally worthless
lock_kernel() all over the place for no good reason, and the
unconditional set_bit(PG_locked) clear_bit(PG_locked) on random pieces
of ram almost guarantees that you'll corrupt ram quickly (the PG_locked
is reserved for I/O serialization, the same ram that you're working on
can be sent to disk or to swap by the kernel at the same time and it can
be already locked, you can't clear_bit unless you're sure you're the guy
that owns the lock, and you aren't sure because you didn't test if
you're the owner, so that smells like a huge bug that will randomly
corrupt ram, like the pte walking race).

there's also an obvious DoS that is trivial to generate by locking in
ram some 64G of ram with ecbm_create_capability() see the for(count=0;
count<pages; ++count) atomic_inc (btw, you should use get_page, and all
the operations like LockPage to play with pages).

I also don't see where you flush the tlb after the set_pte, and where
you release the ram pointed by the pte (it seems you're leaking plenty
of memory that way).

I didn't check the credential checks at all (I didn't run into them while
reading the code, so I assume I overlooked them). (Do you rely on a
random number? That's probably statistically secure, but we can
guarantee security on a local box, so we must not work by luck whenever
possible.)

this was a very quick review, hope this helps,

Andrea

/*
 * If you refuse to depend on closed software for a critical
 * part of your business, these links may be useful:
 *
 * rsync.kernel.org::pub/scm/linux/kernel/bkcvs/linux-2.5/
 * rsync.kernel.org::pub/scm/linux/kernel/bkcvs/linux-2.4/
 * http://www.cobite.com/cvsps/
 *
 * svn://svn.kernel.org/linux-2.6/trunk
 * svn://svn.kernel.org/linux-2.4/trunk
 */

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Efficient IPC mechanism on Linux
  2003-09-10 13:56               ` Luca Veraldi
@ 2003-09-10 15:59                 ` Alan Cox
  0 siblings, 0 replies; 66+ messages in thread
From: Alan Cox @ 2003-09-10 15:59 UTC (permalink / raw)
  To: Luca Veraldi; +Cc: Linux Kernel Mailing List

On Mer, 2003-09-10 at 14:56, Luca Veraldi wrote:
> There are kernel locks all around in sources.
> So don't come here and talk about locking ineffiency.
> Because scarse scalability of Linux on multiprocessor
> is a reality nobody can hide.

Well, if you will read the 2.2 code, sure. If you want to argue
scalability I suggest you look at the current code and the SGI Altix
boxes.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Efficient IPC mechanism on Linux
       [not found]                       ` <3F5F37CD.6060808@softhome.net>
@ 2003-09-10 15:28                         ` Luca Veraldi
  0 siblings, 0 replies; 66+ messages in thread
From: Luca Veraldi @ 2003-09-10 15:28 UTC (permalink / raw)
  To: Ihar 'Philips' Filipau; +Cc: linux-kernel

>    Can you forward-port your work to 2.6 kernel?
>    Can you benchmarkt it against the same primitives in 2.6 kernel?

I would do it only if there were a real reason to do it.
And there isn't.

They have just posted me a message with pipe latency under kernel 2.4.
And it is exactly the same (apart from some minor variations in measurements
that are natural enough).

>    You have just started your work - are you going to finish it? Or it 
> was just-another-academical-study?

I consider my work finished.
I studied an efficient IPC mechanism and I tried to implement it 
on an existing Operating System.

I tried some other IPC primitives in benchmark tests 
and reported the comparisons between completion times.

I got my goal.

>    If you can try to develop new sematics for old syntax (shm* & etc) it 
> can be welcomed too.
>    And if it would be poll()able - it would be great. Applications which 
> do block on read()/getmsg() in real-life not that common, and as I've 
> understood - this is the case for your message passing structure.

I'm not a kernel developer. Ask Linus Torvalds.

Bye,
Luca

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Efficient IPC mechanism on Linux
  2003-09-10 12:16 ` Luca Veraldi
@ 2003-09-10 14:53   ` Larry McVoy
  0 siblings, 0 replies; 66+ messages in thread
From: Larry McVoy @ 2003-09-10 14:53 UTC (permalink / raw)
  To: Luca Veraldi; +Cc: uek32z, linux-kernel

On Wed, Sep 10, 2003 at 02:16:40PM +0200, Luca Veraldi wrote:
> > I've read your posting on the lkml and also the answers
> > concerning IPC mechanisms on Linux.
> > You speak English very well - why don't you translate your
> > page into english, I think many people would be very interested
> > in it... at least I am ;) Unfortunately not many kernel hackers
> > are able to understand Italian, I think...
> 
> Page is now in English since last night (Italian time).
> Please, refresh your browser.
> 
> http://web.tiscali.it/lucavera/www/root/ecbm/index.htm
> for English users and

I read it and I must be missing something, which is possible, I need more
coffee.

I question the measurement methodology.  Why didn't you grab the sources
to LMbench and use them to measure this?  It may well be that you disagree
with how it measures things, which may be fine, but I think you'd benefit
from understanding it and thinking about it.  It's also trivial to add
another test to the system, you can do it in a few lines of code.

I also question the results.  I modified lat_pipe.c from LMbench to measure
a range of sizes, code is included below.  The results don't match yours at
all.  This is on a 466Mhz Celeron running RedHat 7.1, Linux 2.4.2.  The 
time reported is the time to send and receive the data between two processes,
i.e., 

	Process A		Process B
	write
		 <context switch>
				read
				write
		 <context switch>
	read

In other words, the time printed is for a round trip.  Your numbers
appear to be off by a factor of two, they look pretty similar to mine
but as far as I can tell you are saying that it costs 4 usecs for a send
and 4 for a recv and that's not true, the real numbers are 2 sends and
2 receives and 2 context switches.

     1 bytes: Pipe latency: 8.0272 microseconds
     8 bytes: Pipe latency: 7.8736 microseconds
    64 bytes: Pipe latency: 8.0279 microseconds
   512 bytes: Pipe latency: 10.0920 microseconds
  4096 bytes: Pipe latency: 19.6434 microseconds
 40960 bytes: Pipe latency: 313.3328 microseconds
 81920 bytes: Pipe latency: 1267.7174 microseconds
163840 bytes: Pipe latency: 3052.1020 microseconds
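
(For reference, a minimal round-trip test in the same spirit could look like
the sketch below. This is only an illustration with an arbitrary iteration
count and no error handling, not the modified lat_pipe.c, which is not
reproduced here.)

	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <sys/time.h>
	#include <unistd.h>

	#define ITERATIONS 10000

	int main(int argc, char **argv)
	{
		size_t size = argc > 1 ? (size_t)atol(argv[1]) : 1;
		char *buf = malloc(size);
		int up[2], down[2];
		struct timeval t0, t1;
		int i;

		memset(buf, 0, size);
		pipe(up);
		pipe(down);

		if (fork() == 0) {		/* process B: echo everything back */
			for (i = 0; i < ITERATIONS; i++) {
				for (size_t got = 0; got < size; )
					got += read(up[0], buf + got, size - got);
				write(down[1], buf, size);
			}
			_exit(0);
		}

		gettimeofday(&t0, NULL);	/* process A: write, then read back */
		for (i = 0; i < ITERATIONS; i++) {
			write(up[1], buf, size);
			for (size_t got = 0; got < size; )
				got += read(down[0], buf + got, size - got);
		}
		gettimeofday(&t1, NULL);

		double usec = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
		printf("%zu bytes: round-trip latency %.4f microseconds\n",
		       size, usec / ITERATIONS);
		return 0;
	}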

I want to stick some other numbers in here from LMbench, the signal
handler cost and the select cost.  On this machine it is about 5 usecs
to handle the signal and about 4 for select on 10 file descriptors.

If I were faced with the problem of moving data between processes at very
low cost the path I choose would depend on whether it was a lot of data
or just an event notification.  It would also depend on whether the 
receiving process is doing anything else.  Let's walk a couple of those
paths:

If all I want to do is let another process know that something has happened
then a signal is darn close to as cheap as I can get.  That's what SIGUSR1
and SIGUSR2 are for.  

If I wanted to move large quantities of data I'd combine signals with 
mmap and mutexes.  This is easier than it sounds.  Map a scratch file,
truncate it up to the size you need, start writing into it and when you
have enough signal the other side.  It's a lot like the I/O loop in 
many device drivers.  
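
A rough sketch of that approach (scratch file name invented, no error
handling; SIGUSR1 is just the "data is ready" doorbell):

	#include <sys/types.h>
	#include <sys/mman.h>
	#include <fcntl.h>
	#include <signal.h>
	#include <string.h>
	#include <unistd.h>

	#define SCRATCH	"/tmp/ipc-scratch"	/* hypothetical scratch file */
	#define MAPLEN	(1 << 20)

	int main(void)
	{
		int fd = open(SCRATCH, O_RDWR | O_CREAT, 0600);
		char *buf;
		sigset_t set;
		pid_t child;
		int sig;

		ftruncate(fd, MAPLEN);		/* truncate it up to the size you need */
		buf = mmap(NULL, MAPLEN, PROT_READ | PROT_WRITE,
			   MAP_SHARED, fd, 0);

		/* block SIGUSR1 so the receiver can wait for it synchronously */
		sigemptyset(&set);
		sigaddset(&set, SIGUSR1);
		sigprocmask(SIG_BLOCK, &set, NULL);

		child = fork();
		if (child == 0) {		/* receiver: wait for the doorbell */
			sigwait(&set, &sig);
			write(1, buf, strlen(buf));
			_exit(0);
		}

		/* sender: write the data in place, then notify the other side */
		strcpy(buf, "bulk data written straight into the mapping\n");
		kill(child, SIGUSR1);
		return 0;
	}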

So I guess I'm not seeing why there needs to be a new interface here.
This looks to me like you combine cheap messaging (signals, select,
or even pipes) with shared data (mmap).  I don't know why you didn't
measure that, it's the obvious thing to measure and you are going to be
running at memory speeds.

The only justification I can see for a different mechanism is if the 
signaling really hurt but it doesn't.  What am I missing?
-- 
---
Larry McVoy              lm at bitmover.com          http://www.bitmover.com/lm

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Efficient IPC mechanism on Linux
  2003-09-10 14:21 ` Stewart Smith
@ 2003-09-10 14:39   ` Luca Veraldi
  0 siblings, 0 replies; 66+ messages in thread
From: Luca Veraldi @ 2003-09-10 14:39 UTC (permalink / raw)
  To: Stewart Smith; +Cc: linux-kernel

> hrm, you may find things related to the Password-Capability system and
> the Walnut kernel of interest - these systems take this kind of IPC to
> the extreme :) (ahhh... research OS hw & sw - except you *do not* want
> to see the walnut source - it makes ppl want to crawl up and cry).
> 
> http://www.csse.monash.edu.au/~rdp/fetch/castro-thesis.ps
> 
> and check the Readme.txt at
> http://www.csse.monash.edu.au/courseware/cse4333/rdp-material/
> for stuff on Multi and password-capabilities.
> 
> interesting stuff, the Castro thesis does do some comparisons to FreeBSD
> (1.1 amazingly enough) - although the number of real world applications
> on these systems is minimal (and in the current state impossible -
> nobody can remember how to get userspace going on Walnut, we may have
> broken it) and so real-world comparisons just don't really happen these
> days. Maybe after a rewrite (removing some brain-damage of the original
> design).

Thanks. It's really very interesting...

Bye,
Luca

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Efficient IPC mechanism on Linux
  2003-09-09 17:30 Luca Veraldi
  2003-09-09 21:17 ` Alan Cox
       [not found] ` <20030909175821.GL16080@Synopsys.COM>
@ 2003-09-10 14:21 ` Stewart Smith
  2003-09-10 14:39   ` Luca Veraldi
  2003-09-10 16:59 ` Andrea Arcangeli
  3 siblings, 1 reply; 66+ messages in thread
From: Stewart Smith @ 2003-09-10 14:21 UTC (permalink / raw)
  To: Luca Veraldi; +Cc: linux-kernel

hrm, you may find things related to the Password-Capability system and
the Walnut kernel of interest - these systems take this kind of IPC to
the extreme :) (ahhh... research OS hw & sw - except you *do not* want
to see the walnut source - it makes ppl want to crawl up and cry).

http://www.csse.monash.edu.au/~rdp/fetch/castro-thesis.ps

and check the Readme.txt at
http://www.csse.monash.edu.au/courseware/cse4333/rdp-material/
for stuff on Multi and password-capabilities.

interesting stuff, the Castro thesis does do some comparisons to FreeBSD
(1.1 amazingly enough) - although the number of real world applications
on these systems is minimal (and in the current state impossible -
nobody can remember how to get userspace going on Walnut, we may have
broken it) and so real-world comparisons just don't really happen these
days. Maybe after a rewrite (removing some brain-damage of the original
design).

This is all related to my honors work,
http://www.flamingspork.com/honors/
although my site needs an update.
I'm working on the design and simulation of an improved storage system.


On Wed, 2003-09-10 at 03:30, Luca Veraldi wrote:
> Hi all.
> At the web page
> http://web.tiscali.it/lucavera/www/root/ecbm/index.htm
> You can find the results of my attempt in modifing the linux kernel sources
> to implement a new Inter Process Communication mechanism.
> 
> It is called ECBM for Efficient Capability-Based Messaging.
> 
> In the reading You can also find the comparison of ECBM 
> against some other commonly-used Linux IPC primitives 
> (such as read/write on pipes or SYS V tools).
> 
> The results are quite clear.
> 
> Enjoy.
> Luca Veraldi
> 
> 
> ----------------------------------------
> Luca Veraldi
> 
> Graduate Student of Computer Science
> at the University of Pisa
> 
> veraldi@cli.di.unipi.it
> luca.veraldi@katamail.com
> ICQ# 115368178
> ----------------------------------------
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Efficient IPC mechanism on Linux
  2003-09-10 13:33                     ` Gábor Lénárt
@ 2003-09-10 14:04                       ` Luca Veraldi
  0 siblings, 0 replies; 66+ messages in thread
From: Luca Veraldi @ 2003-09-10 14:04 UTC (permalink / raw)
  To: lgb; +Cc: linux-kernel

> So I thing it's no use to start a flame thread here. Implement that IPC
> solution, and test it. If it becomes a good one, everybody will be happy
> with your work. If not, at least you've tried, and maybe you will also
> learn from it. Since Linux is open source, you have nothing to lose with
> your implementation.

That is exactly what I've done. But did you read the page?
I've implemented the IPC mechanism... and tested it, too.
The numbers reported in the tables weren't the fruit of my imagination, dear.

Bye,
Luca


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Efficient IPC mechanism on Linux
  2003-09-10 12:47             ` Alan Cox
@ 2003-09-10 13:56               ` Luca Veraldi
  2003-09-10 15:59                 ` Alan Cox
  0 siblings, 1 reply; 66+ messages in thread
From: Luca Veraldi @ 2003-09-10 13:56 UTC (permalink / raw)
  To: Alan Cox; +Cc: linux-kernel

> > Probably, it is worse the case of copying a memory page,
> > because you have to hold some global lock all the time.
> > This is deadly in an SMP environment,
>
> You don't need a global lock to copy memory.

See the sources. You need to lock_kernel() around a large number of
instructions, like any other kernel function.

There are kernel locks all around in the sources.
So don't come here and talk about locking inefficiency.
Because the poor scalability of Linux on multiprocessors
is a reality nobody can hide.

With or without my primitives.

> One thing I do agree with you on is the API aspect - and that is
> problematic. The current API leaves data also owned by the source.
> If I write "fred" down a pipe I still own the "fred" bits.

Even with my API, data doesn't magically disappear
from the sender's logical address space.

Also, with minor changes, you can use my primitives
to implement a one-copy channel, if shared memory among processes
sounds like blasphemy to you.

Still better than the current two-copy pipe implementation.

> The
> method you propose was added long ago go to Solaris (see "doors") and
> its not exactly the most used interface even there.

Sincerely, I COMPLETELY DON'T CARE.
Whether it is used or not is completely irrelevant when evaluating performance.

Good bye.
Luca


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Efficient IPC mechanism on Linux
  2003-09-10 12:36                   ` Luca Veraldi
  2003-09-10 12:36                     ` Alex Riesen
@ 2003-09-10 13:33                     ` Gábor Lénárt
  2003-09-10 14:04                       ` Luca Veraldi
  1 sibling, 1 reply; 66+ messages in thread
From: Gábor Lénárt @ 2003-09-10 13:33 UTC (permalink / raw)
  To: Luca Veraldi; +Cc: linux-kernel

On Wed, Sep 10, 2003 at 02:36:19PM +0200, Luca Veraldi wrote:
> > but it does imply "tested". And widely used. And well-known.
> 
> And inefficient, too.
> Sorry, but I use theory not popular convention to judge.

But theory is often far from practice ...

Theory is something which uses some abstraction to "model" reality, which
often means oversimplification of the problem, or simply the fact that a real
case has many more (possibly hidden) parameters affecting the problem.
Maybe these parameters are very hard to even detect at first sight.
That's why most software development starts from theory, _BUT_ some "REAL
WORLD" benchmark/test should confirm the truth of that theory. Or think
about Einstein's theory of relativity: the theory is good, but it's even
better to prove it in real life too: the perturbation in the orbit of
Mercury, or the effect on the apparent position of stars during an eclipse.
That was the point when Einstein's theory became widely accepted by science.

So I think it's no use starting a flame thread here. Implement that IPC
solution, and test it. If it becomes a good one, everybody will be happy
with your work. If not, at least you've tried, and maybe you will also
learn from it. Since Linux is open source, you have nothing to lose with
your implementation.

- Gábor (larta'H)

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Efficient IPC mechanism on Linux
  2003-09-10  9:04       ` Luca Veraldi
@ 2003-09-10 12:56         ` Alan Cox
  0 siblings, 0 replies; 66+ messages in thread
From: Alan Cox @ 2003-09-10 12:56 UTC (permalink / raw)
  To: Luca Veraldi; +Cc: Linux Kernel Mailing List

On Mer, 2003-09-10 at 10:04, Luca Veraldi wrote:
> Surely, transferring bytes from a cache line to another is at zero cost
> (that is, a single clock cycle of the cache unit).

More complicated - when you move stuff you end up evicting another chunk
from the cache; you can't armwave that out of existence because you
yourself have created new eviction overhead for someone further down the
line.

> But here, how can you matematically grant that,
> using traditional IPC mechanism,
> you'll find the info you want directly in cache?

Because programs normally pass data they've touched to other apps.
Your A/B example touches the data before sending even, but your
example on receive does not which is very atypical.

> > Ok - from your example I couldn't tell if B then touches each byte of
> > data as well as having it appear in its address space.
> 
> It is irrilevant. The time reported in the work are the time needed
> to complete the IPC primitives.
> After a send or a receive over the channel,
> the two processes may do everything they want.
> It is not interesting for us.

It's vital to include it in the measurement of all three. Without it
your measurements are meaningless. Look at it as an engineer: IPC 
involves writing to memory, making the data appear in the other task and
then reading it. You also want to measure things like the throughput of your
IPC tasks running in parallel with other processes doing stuff like pure
memory bandwidth tests, so you can measure the impact you have on the
system, although in this case that's probably less relevant by far.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Efficient IPC mechanism on Linux
  2003-09-10 10:09               ` Luca Veraldi
  2003-09-10 10:14                 ` Arjan van de Ven
@ 2003-09-10 12:50                 ` Alan Cox
  2003-09-10 19:16                 ` Shawn
  2003-09-10 20:05                 ` Rik van Riel
  3 siblings, 0 replies; 66+ messages in thread
From: Alan Cox @ 2003-09-10 12:50 UTC (permalink / raw)
  To: Luca Veraldi; +Cc: Arjan van de Ven, Linux Kernel Mailing List

On Mer, 2003-09-10 at 11:09, Luca Veraldi wrote:
> SMP is crying...
> But not so much, if the lock is not lasting much, is it right?
> 
> THIS IS NOT WHAT WE CALL A SHORT LOCK SECTION:

This is the difference between science and engineering. It's a path that
isn't normally taken and, if taken, normally takes almost no loops.



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Efficient IPC mechanism on Linux
  2003-09-10  9:40           ` Luca Veraldi
  2003-09-10  9:44             ` Arjan van de Ven
@ 2003-09-10 12:47             ` Alan Cox
  2003-09-10 13:56               ` Luca Veraldi
  1 sibling, 1 reply; 66+ messages in thread
From: Alan Cox @ 2003-09-10 12:47 UTC (permalink / raw)
  To: Luca Veraldi; +Cc: arjanv, Linux Kernel Mailing List

On Mer, 2003-09-10 at 10:40, Luca Veraldi wrote:
> To set the accessed or dirty bit you use
> 
> 38         __asm__ __volatile__( LOCK_PREFIX
> 39                 "btsl %1,%0"
> 40                 :"=m" (ADDR)
> 41                 :"Ir" (nr));
> 
> which is a ***SINGLE CLOCK CYCLE*** of cpu.

It's a _lot_ more than that in the normal case: upwards of 60 clocks on 
a PIV. You then need to reload cr3 if you touched permissions, which
means every cached TLB entry is lost, and you may need to do
a cross-CPU IPI on SMP (which takes a long time).

> You say "tlb's and internal cpu state will need to be flushed".
> The other cpus in an SMP environment can continue to work, indipendently.
> TLBs and cpu state registers are ***PER-CPU*** resorces.

Think of a threaded app passing a message to another app. You have to do
the cross-CPU flush in order to prevent races where another thread can
scribble on data it doesn't own. Assuming a reasonable TLB reuse rate
that's 120-plus TLB reloads. Newer CPUs cache TLB entries in L1/L2 so
that's not too bad, but it all adds up. On SMP it's a real pain.

> Probably, it is worse the case of copying a memory page,
> because you have to hold some global lock all the time.
> This is deadly in an SMP environment, 

You don't need a global lock to copy memory. 

One thing I do agree with you on is the API aspect - and that is
problematic. The current API leaves data also owned by the source.
If I write "fred" down a pipe I still own the "fred" bits. The 
method you propose was added long ago to Solaris (see "doors") and
it's not exactly the most used interface even there.



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Efficient IPC mechanism on Linux
       [not found]   ` <E19x47V-0002JG-J8@phoenix.hadiko.de>
@ 2003-09-10 12:45     ` Luca Veraldi
  0 siblings, 0 replies; 66+ messages in thread
From: Luca Veraldi @ 2003-09-10 12:45 UTC (permalink / raw)
  To: uek32z; +Cc: linux-kernel

> What occured to me:
> Perhaps you're interested in the L4KA microkernel project
> at my University, Karlsruhe, Germany.
> The URL is http://l4ka.org/

Sure, I know. Great work.
But it is a completely new Operating System, 
and it is also a microkernel one.

Here, they seem not to understand what performance is; 
figure out what will happen if I start talking about microkernels.

However, that is the future, not old and dirty monolithic ones.
And ***THEORY*** supports it.

Bye,
Luca

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Efficient IPC mechanism on Linux
  2003-09-10 11:16                     ` Nick Piggin
  2003-09-10 11:30                       ` Luca Veraldi
@ 2003-09-10 12:42                       ` Alan Cox
  1 sibling, 0 replies; 66+ messages in thread
From: Alan Cox @ 2003-09-10 12:42 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Luca Veraldi, Arjan van de Ven, Linux Kernel Mailing List

On Mer, 2003-09-10 at 12:16, Nick Piggin wrote:
> There was a zero-copy pipe implementation floating around a while ago
> I think. Did you have a look at that? IIRC it had advantages and
> disadvantages over regular pipes in performance.

FreeBSD had one; it was lightning fast until you used the data at the other
end of the pipe. I'm not sure what happened to it in the long term.



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Efficient IPC mechanism on Linux
  2003-09-10 12:36                   ` Luca Veraldi
@ 2003-09-10 12:36                     ` Alex Riesen
  2003-09-10 13:33                     ` Gábor Lénárt
  1 sibling, 0 replies; 66+ messages in thread
From: Alex Riesen @ 2003-09-10 12:36 UTC (permalink / raw)
  To: Luca Veraldi; +Cc: linux-kernel

Luca Veraldi, Wed, Sep 10, 2003 14:36:19 +0200:
> > but it does imply "tested". And widely used. And well-known.
> 
> And inefficient, too.

Yes, probably. Until proven. What exactly is wrong with the API?

> Sorry, but I use theory not popular convention to judge.

I personally do not see any deficiencies in the mentioned APIs.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Efficient IPC mechanism on Linux
  2003-09-10 12:28                 ` Alex Riesen
@ 2003-09-10 12:36                   ` Luca Veraldi
  2003-09-10 12:36                     ` Alex Riesen
  2003-09-10 13:33                     ` Gábor Lénárt
  0 siblings, 2 replies; 66+ messages in thread
From: Luca Veraldi @ 2003-09-10 12:36 UTC (permalink / raw)
  To: alexander.riesen; +Cc: linux-kernel

> but it does imply "tested". And widely used. And well-known.

And inefficient, too.
Sorry, but I use theory, not popular convention, to judge.

Bye,
Luca

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Efficient IPC mechanism on Linux
  2003-09-10 12:11             ` Alex Riesen
@ 2003-09-10 12:29               ` Luca Veraldi
  2003-09-10 12:28                 ` Alex Riesen
  0 siblings, 1 reply; 66+ messages in thread
From: Luca Veraldi @ 2003-09-10 12:29 UTC (permalink / raw)
  To: alexander.riesen; +Cc: linux-kernel

> it is widely accepted one.

Widely accepted does not necessarily imply clear.

Sorry if I consider this more readable:
ch=acquire(CHANNEL_ID);

instead of this:
while((fd = open("fifo", O_WRONLY)) < 0) ;

But this is mostly an aesthetic matter.

If you like "open" and "write", open Emacs and do a global replace of
"acquire" with "open"
and
"zc_send" with "write".

A few other minor details will make it a compatible interface.
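
A rough sketch of that renaming written as a compatibility shim; the acquire()
and zc_send() prototypes below are guesses based only on how the names are used
in this thread, not the actual ECBM declarations.

#include <fcntl.h>
#include <unistd.h>

/* Hypothetical shim: map the names used in this thread onto the classic
 * fd interface.  The zc_* prototypes are assumptions, not the real ones. */
static inline int acquire(const char *channel)
{
        int fd;
        while ((fd = open(channel, O_WRONLY)) < 0)
                ;                       /* retry, as in the example above */
        return fd;
}

static inline ssize_t zc_send(int ch, const void *buf, size_t len)
{
        return write(ch, buf, len);     /* copying fallback, not zero-copy */
}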

------------------------------
Fashion is a variable
but style is a constant
------------------------------ Larry Wall

Bye,
Luca

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Efficient IPC mechanism on Linux
  2003-09-10 12:29               ` Luca Veraldi
@ 2003-09-10 12:28                 ` Alex Riesen
  2003-09-10 12:36                   ` Luca Veraldi
  0 siblings, 1 reply; 66+ messages in thread
From: Alex Riesen @ 2003-09-10 12:28 UTC (permalink / raw)
  To: Luca Veraldi; +Cc: linux-kernel

Luca Veraldi, Wed, Sep 10, 2003 14:29:37 +0200:
> > it is widely accepted one.
> 
> Widely accepted does not necessarily imply clear.
> 

but it does imply "tested". And widely used. And well-known.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Efficient IPC mechanism on Linux
       [not found] <E19x3el-0002Fc-Rj@phoenix.hadiko.de>
@ 2003-09-10 12:16 ` Luca Veraldi
  2003-09-10 14:53   ` Larry McVoy
  0 siblings, 1 reply; 66+ messages in thread
From: Luca Veraldi @ 2003-09-10 12:16 UTC (permalink / raw)
  To: uek32z; +Cc: linux-kernel

> I've read your posting on the lkml and also the answers
> concerning IPC mechanisms on Linux.
> You speak English very well - why don't you translate your
> page into english, I think many people would be very interested
> in it... at least I am ;) Unfortunately not many kernel hackers
> are able to understand Italian, I think...

The page has been in English since last night (Italian time).
Please refresh your browser.

http://web.tiscali.it/lucavera/www/root/ecbm/index.htm
for English users and

http://web.tiscali.it/lucavera/www/root/ecbm/indexIT.htm
for Italian ones.

Bye,
Luca.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Efficient IPC mechanism on Linux
  2003-09-10 11:44                         ` Nick Piggin
@ 2003-09-10 12:14                           ` Luca Veraldi
  0 siblings, 0 replies; 66+ messages in thread
From: Luca Veraldi @ 2003-09-10 12:14 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-kernel

> Yes you're right. Just search for it. It would be interesting to compare
> them with your implementation.
> I'm not too sure, I was just pointing you to zero copy pipes because
> they did have some disadvantages which is why they weren't included in
> the kernel. Quite possibly your mechanism doesn't suffer from these.

Ok. I'll search for them.

> What would really help convince everyone is, of course, "real world"
> benchmarks. I like your ideas though ;)

Thanks.

Luca

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Efficient IPC mechanism on Linux
  2003-09-10 11:52         ` Alex Riesen
@ 2003-09-10 12:14           ` Luca Veraldi
  2003-09-10 12:11             ` Alex Riesen
  0 siblings, 1 reply; 66+ messages in thread
From: Luca Veraldi @ 2003-09-10 12:14 UTC (permalink / raw)
  To: alexander.riesen; +Cc: linux-kernel


> > Compatibility is not a problem. Simply rewrite the write() and read()
> > for pipes in order to make them do the same thing done by zc_send()
> > and zc_receive().  Or, if you are not referring to pipes, rewrite the
> > support level of your ancient IPC primitives in order to make them do
> > the same thing done by zc_send() and zc_receive().
> 
> If it is possible, why new user-side interface?

Because my programming model is clear and easy.
Linux's is far from being so, in my opinion.

Bye,
Luca

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Efficient IPC mechanism on Linux
  2003-09-10 12:14           ` Luca Veraldi
@ 2003-09-10 12:11             ` Alex Riesen
  2003-09-10 12:29               ` Luca Veraldi
  0 siblings, 1 reply; 66+ messages in thread
From: Alex Riesen @ 2003-09-10 12:11 UTC (permalink / raw)
  To: Luca Veraldi; +Cc: linux-kernel

Luca Veraldi, Wed, Sep 10, 2003 14:14:00 +0200:
> > > Compatibility is not a problem. Simply rewrite the write() and read()
> > > for pipes in order to make them do the same thing done by zc_send()
> > > and zc_receive().  Or, if you are not referring to pipes, rewrite the
> > > support level of your ancient IPC primitives in order to make them do
> > > the same thing done by zc_send() and zc_receive().
> > 
> > If it is possible, why new user-side interface?
> 
> Because my programming model is clear and easy.

Does anyone besides you say so?

> Linux's is far from being so, in my opinion.

It is the widely accepted one.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Efficient IPC mechanism on Linux
  2003-09-10  9:18       ` Luca Veraldi
  2003-09-10  9:23         ` Arjan van de Ven
@ 2003-09-10 11:52         ` Alex Riesen
  2003-09-10 12:14           ` Luca Veraldi
  1 sibling, 1 reply; 66+ messages in thread
From: Alex Riesen @ 2003-09-10 11:52 UTC (permalink / raw)
  To: Luca Veraldi; +Cc: linux-kernel

Luca Veraldi, Wed, Sep 10, 2003 11:18:48 +0200:
> > Will it be possible to base existing facilities on your approach?
> > SVR5 messages (msg{get,snd,rcv}), for example?
> 
> Ah, ok. So let's continue to do ineffient things
> only because it has always been so!

It is because the interface is perfectly adequate. You can still
implement zero-copy local transfers for pipes in the read and write
calls.
And for small amounts, where zero-copy is possible at all, it does not
bring noticeable advantages (as Alan already mentioned).

> Compatibility is not a problem. Simply rewrite the write() and read()
> for pipes in order to make them do the same thing done by zc_send()
> and zc_receive().  Or, if you are not referring to pipes, rewrite the
> support level of your ancient IPC primitives in order to make them do
> the same thing done by zc_send() and zc_receive().

If it is possible, why new user-side interface?

-alex

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Efficient IPC mechanism on Linux
  2003-09-10 11:30                       ` Luca Veraldi
@ 2003-09-10 11:44                         ` Nick Piggin
  2003-09-10 12:14                           ` Luca Veraldi
  0 siblings, 1 reply; 66+ messages in thread
From: Nick Piggin @ 2003-09-10 11:44 UTC (permalink / raw)
  To: Luca Veraldi; +Cc: linux-kernel



Luca Veraldi wrote:

>>Hi Luca,
>>There was a zero-copy pipe implementation floating around a while ago
>>I think. Did you have a look at that? IIRC it had advantages and
>>disadvantages over regular pipes in performance.
>>
>
>Sorry, but I subscribed to this mailing list only one day ago.
>Advantages and disadvantages depend on what you actually implement
>and on how you do it.
>

Yes you're right. Just search for it. It would be interesting to compare
them with your implementation.

>
>I can only say that capabilities are a disadvantage only with very very
>short messages
>(that is, a few bytes). And this disadvantage is theoretically demonstrable.
>
>But, let's say also that such elementary messages are meaningful only in the
>kernel
>and for kernel purposes.
>
>User processes are another story.
>

I'm not too sure, I was just pointing you to zero copy pipes because
they did have some disadvantages which is why they weren't included in
the kernel. Quite possibly your mechanism doesn't suffer from these.

What would really help convince everyone is, of course, "real world"
benchmarks. I like your ideas though ;)




^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Efficient IPC mechanism on Linux
  2003-09-10 11:16                     ` Nick Piggin
@ 2003-09-10 11:30                       ` Luca Veraldi
  2003-09-10 11:44                         ` Nick Piggin
  2003-09-10 12:42                       ` Alan Cox
  1 sibling, 1 reply; 66+ messages in thread
From: Luca Veraldi @ 2003-09-10 11:30 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-kernel

> Hi Luca,
> There was a zero-copy pipe implementation floating around a while ago
> I think. Did you have a look at that? IIRC it had advantages and
> disadvantages over regular pipes in performance.

Sorry, but I subscribed to this mailing list only one day ago.
Advantages and disadvantages depend on what you actually implement
and on how you do it.

I can only say that capabilities are a disadvantage only with very, very
short messages
(that is, a few bytes). And this disadvantage is theoretically demonstrable.

But let's also say that such elementary messages are meaningful only in the
kernel
and for kernel purposes.

User processes are another story.

Bye,
Luca


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Efficient IPC mechanism on Linux
  2003-09-10 10:54                   ` Luca Veraldi
  2003-09-10 10:54                     ` Arjan van de Ven
@ 2003-09-10 11:16                     ` Nick Piggin
  2003-09-10 11:30                       ` Luca Veraldi
  2003-09-10 12:42                       ` Alan Cox
  1 sibling, 2 replies; 66+ messages in thread
From: Nick Piggin @ 2003-09-10 11:16 UTC (permalink / raw)
  To: Luca Veraldi; +Cc: Arjan van de Ven, linux-kernel



Luca Veraldi wrote:

>>Memory is sort of starting to be like disk IO in this regard.
>>
>
>Good. So the less you copy memory all around, the better you perform.
>

Hi Luca,
There was a zero-copy pipe implementation floating around a while ago
I think. Did you have a look at that? IIRC it had advantages and
disadvantages over regular pipes in performance.

Nick



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Efficient IPC mechanism on Linux
  2003-09-10 10:41                 ` Arjan van de Ven
@ 2003-09-10 10:54                   ` Luca Veraldi
  2003-09-10 10:54                     ` Arjan van de Ven
  2003-09-10 11:16                     ` Nick Piggin
  0 siblings, 2 replies; 66+ messages in thread
From: Luca Veraldi @ 2003-09-10 10:54 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: linux-kernel

> Memory is sort of starting to be like disk IO in this regard.

Good. So the less you copy memory all around, the better you perform.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Efficient IPC mechanism on Linux
  2003-09-10 10:54                   ` Luca Veraldi
@ 2003-09-10 10:54                     ` Arjan van de Ven
  2003-09-10 11:16                     ` Nick Piggin
  1 sibling, 0 replies; 66+ messages in thread
From: Arjan van de Ven @ 2003-09-10 10:54 UTC (permalink / raw)
  To: Luca Veraldi; +Cc: Arjan van de Ven, linux-kernel

On Wed, Sep 10, 2003 at 12:54:32PM +0200, Luca Veraldi wrote:
> > Memory is sort of starting to be like disk IO in this regard.
> 
> Good. So the less you copy memory all around, the better you perform.

Actually it's "the less discontiguous memory you touch the better you
perform". Just like a 128 KB read is about the same cost as a 4 KB
disk read, the same is starting to become true for memory copies. So
in the extreme the memory copy is one operation only.
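
For illustration, a crude sketch of that contiguous-versus-scattered effect:
one streaming 128 KB copy against touching the same number of cache lines
spread across a much larger area. The sizes, strides and rdtsc timing are
arbitrary choices for the sketch, not anything from this thread.

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

static inline uint64_t rdtsc(void)
{
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
}

#define CHUNK  (128 * 1024)             /* one "big read": 128 KB            */
#define SPREAD (64 * 1024 * 1024)       /* area the scattered touches span   */
#define LINE   64                       /* assumed cache line size           */

int main(void)
{
        char *src = malloc(SPREAD), *dst = malloc(CHUNK);
        volatile char sink = 0;
        uint64_t t0, t1, t2;
        long i, lines = CHUNK / LINE;

        if (!src || !dst)
                return 1;
        memset(src, 1, SPREAD);         /* fault pages in; evicts src from cache */

        t0 = rdtsc();
        memcpy(dst, src, CHUNK);        /* contiguous, prefetch-friendly copy */
        t1 = rdtsc();
        for (i = 0; i < lines; i++)     /* same number of lines, scattered    */
                sink += src[i * (SPREAD / lines)];
        t2 = rdtsc();

        printf("streaming 128 KB copy:  %llu cycles\n",
               (unsigned long long)(t1 - t0));
        printf("scattered line touches: %llu cycles\n",
               (unsigned long long)(t2 - t1));
        return 0;
}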

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Efficient IPC mechanism on Linux
  2003-09-10 10:37               ` Jamie Lokier
@ 2003-09-10 10:41                 ` Arjan van de Ven
  2003-09-10 10:54                   ` Luca Veraldi
  0 siblings, 1 reply; 66+ messages in thread
From: Arjan van de Ven @ 2003-09-10 10:41 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Arjan van de Ven, Luca Veraldi, alexander.riesen, linux-kernel

On Wed, Sep 10, 2003 at 11:37:52AM +0100, Jamie Lokier wrote:
> I thought that later generation CPUs were supposed to have lower
> memory bandwidth relative to the CPU core

this is far more true for "random-access" latency than for streaming bandwidth.
Memory is sort of starting to be like disk IO in this regard. 

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Efficient IPC mechanism on Linux
  2003-09-10 10:07             ` Arjan van de Ven
  2003-09-10 10:17               ` Luca Veraldi
@ 2003-09-10 10:37               ` Jamie Lokier
  2003-09-10 10:41                 ` Arjan van de Ven
  1 sibling, 1 reply; 66+ messages in thread
From: Jamie Lokier @ 2003-09-10 10:37 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: Luca Veraldi, alexander.riesen, linux-kernel

Arjan van de Ven wrote:
> > I have just done a measurement on a 366MHz PII Celeron
> 
> This test is sort of the worst case against my argument:
> 1) It's a cpu with low memory bandwidth
> 2) It's a 1 CPU system
> 3) It's a pII not pIV; the pII is way more efficient cycle wise
>    for pagetable operations

I thought that later generation CPUs were supposed to have lower
memory bandwidth relative to the CPU core, so CPU operations are
better than copying.  Hence all the fancy special memory instructions,
to work around that.

Not that it matters.  I think the 366 Celeron is typical of a lot of
computers being used today.  I still use it every day, after all.

-- Jamie


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Efficient IPC mechanism on Linux
       [not found] <F71B37536F3B3D4FA521FEC7FCA17933164A@twinsrv.twinox.se>
@ 2003-09-10 10:36 ` Luca Veraldi
  0 siblings, 0 replies; 66+ messages in thread
From: Luca Veraldi @ 2003-09-10 10:36 UTC (permalink / raw)
  To: Lars Hammarstrand; +Cc: linux-kernel

> Don't bother about people nagging about bits and pieces because they are
> jealous.
> Stand up for the basic idea since the rest is just details that always can
> be fixed.

The incredible thing is that it has been, what, 10 or 20 years, perhaps even
more,
that at my University in Pisa we have been talking about such communication
primitives.

It's not so new!
We have only actually implemented it.

Bye,
Luca.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Efficient IPC mechanism on Linux
  2003-09-10 10:14                 ` Arjan van de Ven
@ 2003-09-10 10:25                   ` Luca Veraldi
  2003-09-12 18:41                     ` Timothy Miller
  0 siblings, 1 reply; 66+ messages in thread
From: Luca Veraldi @ 2003-09-10 10:25 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: linux-kernel

> I'm saying it can. I don't want to go too deep into an argument about
> microarchitectural details, but my point was that a memory copy of a page
> is NOT super expensive relative to several other effects that have to do
> with pagetable manipulations.

Sorry, but I cannot believe it.
Reading a page table entry and storing it into a struct capability is not
comparable at all with the "for" loop needed to move bytes all around memory.

> but the pipe code cannot know this so it has to do a cross cpu invalidate.

Sorry for you. Not knowing does not justify it.
It's inefficient.

> and... which is also releasing it before the copy

Oh, yes. After wasting thousands of cycles, sure.

Bye,
Luca.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Efficient IPC mechanism on Linux
  2003-09-10 10:07             ` Arjan van de Ven
@ 2003-09-10 10:17               ` Luca Veraldi
  2003-09-10 10:37               ` Jamie Lokier
  1 sibling, 0 replies; 66+ messages in thread
From: Luca Veraldi @ 2003-09-10 10:17 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: linux-kernel

> This test is sort of the worst case against my argument:
> 1) It's a cpu with low memory bandwidth
> 2) It's a 1 CPU system
> 3) It's a pII not pIV; the pII is way more efficient cycle wise
>    for pagetable operations

I'm a Computer Science graduate. And there is at least one thing I've learned
in my studies.
More efficient firmware support
(such as an extremely wide memory bandwidth
or tens of CPUs in an SMP/NUMA system
or efficient cache line transfer support)

DOES NOT ALLOW YOU TO WASTE CYCLES IN DOING USELESS THINGS,
such as TWO physical copies of a message if a process wants to cooperate with
another one.

We are not engineers who simply have to make things work.

That's all.
Luca.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Efficient IPC mechanism on Linux
  2003-09-10 10:09               ` Luca Veraldi
@ 2003-09-10 10:14                 ` Arjan van de Ven
  2003-09-10 10:25                   ` Luca Veraldi
  2003-09-10 12:50                 ` Alan Cox
                                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 66+ messages in thread
From: Arjan van de Ven @ 2003-09-10 10:14 UTC (permalink / raw)
  To: Luca Veraldi; +Cc: Arjan van de Ven, linux-kernel

On Wed, Sep 10, 2003 at 12:09:21PM +0200, Luca Veraldi wrote:
> > For fun do the measurement on a pIV cpu. You'll be surprised.
> > The microcode "mark dirty" (which is NOT a btsl, it gets done when you do
> > a write memory access to the page content) result will be in the 2000 to
> > 4000 range I predict.
> 
> I'm not responsible for microarchitecture designer stupidity.
> If a simple STORE assembler instruction will eat up 4000 clock cycles,
> as you say here, well, I think all we Computer Scientists can go home and
> give it up now.

I'm saying it can. I don't want to go too deep into an argument about
microarchitectural details, but my point was that a memory copy of a page
is NOT super expensive relative to several other effects that have to do
with pagetable manipulations. 

> > if you change a page table, you need to flush the TLB on all other cpus
> > that have that same page table mapped, like a thread app running
> > on all cpu's at once with the same pagetables.
> 
> Ok. Simply, this is not the case in my experiment.
> This does not apply.
> We have no threads. But only detached process address spaces.
> Threads are a bit different from processes.

But the pipe code cannot know this, so it has to do a cross-CPU invalidate.

> > why would you need a global lock for copying memory ?
> 
> System call sys_write() calls
> locks_verify_area() which calls
> locks_mandatory_area() which calls
> lock_kernel()

and... which is also releasing it before the copy


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Efficient IPC mechanism on Linux
  2003-09-10  9:52           ` Jamie Lokier
  2003-09-10 10:07             ` Arjan van de Ven
@ 2003-09-10 10:11             ` Luca Veraldi
  2003-09-10 19:24             ` Pavel Machek
  2 siblings, 0 replies; 66+ messages in thread
From: Luca Veraldi @ 2003-09-10 10:11 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: linux-kernel

> mean 11561 cycles to:
> 
> page fault
> copy zero page
> install copy
> remove copy
> 
> (= write every page from MAP_ANON region and unmap, repeatedly)
> 
> I think that makes a clear case for avoiding page copies.

Thanks.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Efficient IPC mechanism on Linux
  2003-09-10  9:44             ` Arjan van de Ven
@ 2003-09-10 10:09               ` Luca Veraldi
  2003-09-10 10:14                 ` Arjan van de Ven
                                   ` (3 more replies)
  0 siblings, 4 replies; 66+ messages in thread
From: Luca Veraldi @ 2003-09-10 10:09 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: linux-kernel

> For fun do the measurement on a pIV cpu. You'll be surprised.
> The microcode "mark dirty" (which is NOT a btsl, it gets done when you do
> a write memory access to the page content) result will be in the 2000 to
> 4000 range I predict.

I'm not responsible for microarchitecture designer stupidity.
If a simple STORE assembler instruction will eat up 4000 clock cycles,
as you say here, well, I think all we Computer Scientists can go home and
give it up now.

> There are things like SMP synchronisation to do, but also
> if the cpu marks a page dirty in the page table, that means the page table
> changes which means the pagetable needs to be marked in the
> PMD. Which means the PMD changes, which means the PGD needs the PMD marked
> dirty. Etc Etc. It's microcode. It'll take several 1000 cycles.

Please, let's return to the facts.
Modifying the page contents is done by the part of the benchmark application
we are not measuring.
That is, the code ***BEFORE*** the zc_send() (write() on a pipe or whatever
you choose)
and ***AFTER*** a zc_receive() (or read() from a pipe or whatever else).
This is nice, thanks, but outside our interest.

We are only reading the relocation tables or inserting new entries into them.
Not modifying page contents.

> if you change a page table, you need to flush the TLB on all other cpus
> that have that same page table mapped, like a thread app running
> on all cpu's at once with the same pagetables.

Ok. Simply, this is not the case in my experiment.
This does not apply.
We have no threads, only detached process address spaces.
Threads are a bit different from processes.

> why would you need a global lock for copying memory ?

System call sys_write() calls
locks_verify_area() which calls
locks_mandatory_area() which calls
lock_kernel()

oops...
global spin_lock locking...

SMP is crying...
But not so much if the lock does not last long, right?

THIS IS NOT WHAT WE CALL A SHORT LOCK SECTION:

repeat:
        /* Search the lock list for this inode for locks that conflict with
         * the proposed read/write.
         */
        for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
                if (!(fl->fl_flags & FL_POSIX))
                        continue;
                if (fl->fl_start > new_fl->fl_end)
                        break;
                if (posix_locks_conflict(new_fl, fl)) {
                        error = -EAGAIN;
                        if (filp && (filp->f_flags & O_NONBLOCK))
                                break;
                        error = -EDEADLK;
                        if (posix_locks_deadlock(new_fl, fl))
                                break;

                        error = locks_block_on(fl, new_fl);
                        if (error != 0)
                                break;

                        /*
                         * If we've been sleeping someone might have
                         * changed the permissions behind our back.
                         */
                        if ((inode->i_mode & (S_ISGID | S_IXGRP)) != S_ISGID)
                                break;
                        goto repeat;
                }
        }

From the source code of your Operating System.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Efficient IPC mechanism on Linux
  2003-09-10  9:52           ` Jamie Lokier
@ 2003-09-10 10:07             ` Arjan van de Ven
  2003-09-10 10:17               ` Luca Veraldi
  2003-09-10 10:37               ` Jamie Lokier
  2003-09-10 10:11             ` Luca Veraldi
  2003-09-10 19:24             ` Pavel Machek
  2 siblings, 2 replies; 66+ messages in thread
From: Arjan van de Ven @ 2003-09-10 10:07 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Arjan van de Ven, Luca Veraldi, alexander.riesen, linux-kernel

On Wed, Sep 10, 2003 at 10:52:55AM +0100, Jamie Lokier wrote:
> Arjan van de Ven wrote:
> > > The overhead implied by a memcpy() is the same, in the order of magnitude,
> > > ***whatever*** kernel version you can develop.
> > 
> > yes a copy of a page is about 3000 to 4000 cycles on an x86 box in the
> > uncached case. A pagetable operation (like the cpu setting the accessed
> > or dirty bit) is in that same order I suspect (maybe half this, but not
> > a lot less). Changing pagetable content is even more because all the
> > tlb's and internal cpu state will need to be flushed... which is also a
> > microcode operation for the cpu. And it's deadly in an SMP environment.
> 
> I have just done a measurement on a 366MHz PII Celeron

This test is sort of the worst case against my argument:
1) It's a cpu with low memory bandwidth
2) It's a 1 CPU system
3) It's a pII not pIV; the pII is way more efficient cycle wise
   for pagetable operations


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Efficient IPC mechanism on Linux
  2003-09-10  9:23         ` Arjan van de Ven
  2003-09-10  9:40           ` Luca Veraldi
@ 2003-09-10  9:52           ` Jamie Lokier
  2003-09-10 10:07             ` Arjan van de Ven
                               ` (2 more replies)
  1 sibling, 3 replies; 66+ messages in thread
From: Jamie Lokier @ 2003-09-10  9:52 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: Luca Veraldi, alexander.riesen, linux-kernel

Arjan van de Ven wrote:
> > The overhead implied by a memcpy() is the same, in the order of magnitude,
> > ***whatever*** kernel version you can develop.
> 
> yes a copy of a page is about 3000 to 4000 cycles on an x86 box in the
> uncached case. A pagetable operation (like the cpu setting the accessed
> or dirty bit) is in that same order I suspect (maybe half this, but not
> a lot less). Changing pagetable content is even more because all the
> tlb's and internal cpu state will need to be flushed... which is also a
> microcode operation for the cpu. And it's deadly in an SMP environment.

I have just done a measurement on a 366MHz PII Celeron.  The test is
quite crude, but these numbers are upper bounds on the timings:

	mean 813 cycles to:

		page fault
		install zero page
		remove zero page

		(= read every page from MAP_ANON region and unmap, repeatedly)

	mean 11561 cycles to:

		page fault
		copy zero page
		install copy
		remove copy

		(= write every page from MAP_ANON region and unmap, repeatedly)

I think that makes a clear case for avoiding page copies.
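
A rough reconstruction of the measurement described above; the region size,
iteration count and rdtsc timing are guesses, and the munmap cost falls outside
the timed loops here, so treat the shape rather than the exact numbers.

#include <stdio.h>
#include <stdint.h>
#include <sys/mman.h>

static inline uint64_t rdtsc(void)
{
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
}

#define NPAGES 1024
#define PAGE   4096
#define ITERS  50

int main(void)
{
        volatile char sink = 0;
        uint64_t rd = 0, wr = 0, t0, t1;
        int it, i;

        for (it = 0; it < ITERS; it++) {
                /* read every page of a fresh MAP_ANON region, then unmap:
                 * fault + install zero page + remove it                   */
                char *p = mmap(NULL, NPAGES * PAGE, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANON, -1, 0);
                t0 = rdtsc();
                for (i = 0; i < NPAGES; i++)
                        sink += p[i * PAGE];
                t1 = rdtsc();
                rd += t1 - t0;
                munmap(p, NPAGES * PAGE);

                /* write every page of a fresh region, then unmap:
                 * fault + copy/clear a page + install it + remove it      */
                p = mmap(NULL, NPAGES * PAGE, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANON, -1, 0);
                t0 = rdtsc();
                for (i = 0; i < NPAGES; i++)
                        p[i * PAGE] = 1;
                t1 = rdtsc();
                wr += t1 - t0;
                munmap(p, NPAGES * PAGE);
        }
        printf("read-fault path:  %llu cycles/page\n",
               (unsigned long long)(rd / ((uint64_t)ITERS * NPAGES)));
        printf("write-fault path: %llu cycles/page\n",
               (unsigned long long)(wr / ((uint64_t)ITERS * NPAGES)));
        return 0;
}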

-- Jamie

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Efficient IPC mechanism on Linux
  2003-09-10  9:40           ` Luca Veraldi
@ 2003-09-10  9:44             ` Arjan van de Ven
  2003-09-10 10:09               ` Luca Veraldi
  2003-09-10 12:47             ` Alan Cox
  1 sibling, 1 reply; 66+ messages in thread
From: Arjan van de Ven @ 2003-09-10  9:44 UTC (permalink / raw)
  To: Luca Veraldi; +Cc: arjanv, linux-kernel

On Wed, Sep 10, 2003 at 11:40:33AM +0200, Luca Veraldi wrote:
> To set the accessed or dirty bit you use
> 
> 38         __asm__ __volatile__( LOCK_PREFIX
> 39                 "btsl %1,%0"
> 40                 :"=m" (ADDR)
> 41                 :"Ir" (nr));
> 
> which is a ***SINGLE CLOCK CYCLE*** of cpu.
> I don't think really that on any machine Firmware 
> a btsl will require 4000 cycles.
> Neither on Intel x86.

For fun do the measurement on a pIV cpu. You'll be surprised.
The microcode "mark dirty" (which is NOT a btsl, it gets done when you do a write
memory access to the page content) result will be in the 2000 to 4000 range I
predict. There are things like SMP synchronisation to do, but also
if the cpu marks a page dirty in the page table, that means the page table
changes which means the pagetable needs to be marked in the
PMD. Which means the PMD changes, which means the PGD needs the PMD marked
dirty. Etc Etc. It's microcode. It'll take several 1000 cycles.



> > Changing pagetable content is even more because all the
> > tlb's and internal cpu state will need to be flushed... which is also a
> > microcode operation for the cpu. 
> 
> Good. The same overhead you will find accessing a message 
> after a read from a pipe. There will be many TLB faults.
> And the same applies when copying the message to the pipe.
> Many many TLB faults.

A TLB fault in the normal case is about 7 cycles. But that's for a TLB entry
not being present. For a TLB entry that IS present, being written to means
going to microcode.

> 
> > And it's deadly in an SMP environment.
> 
> You say "tlb's and internal cpu state will need to be flushed".
> The other cpus in an SMP environment can continue to work independently.
> TLBs and CPU state registers are ***PER-CPU*** resources.

if you change a page table, you need to flush the TLB on all other cpus
that have that same page table mapped, like a thread app running
on all cpu's at once with the same pagetables.

> Probably, it is worse the case of copying a memory page,
> because you have to hold some global lock all the time.

why would you need a global lock for copying memory ?


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Efficient IPC mechanism on Linux
  2003-09-10  9:23         ` Arjan van de Ven
@ 2003-09-10  9:40           ` Luca Veraldi
  2003-09-10  9:44             ` Arjan van de Ven
  2003-09-10 12:47             ` Alan Cox
  2003-09-10  9:52           ` Jamie Lokier
  1 sibling, 2 replies; 66+ messages in thread
From: Luca Veraldi @ 2003-09-10  9:40 UTC (permalink / raw)
  To: arjanv; +Cc: linux-kernel

> yes a copy of a page is about 3000 to 4000 cycles on an x86 box in the
> uncached case. A pagetable operation (like the cpu setting the accessed
> or dirty bit) is in that same order I suspect (maybe half this, but not
> a lot less). 

Probably you don't know what you're talking about.
I don't know where you studied computer architectures, but...
Let's answer.

To set the accessed or dirty bit you use

38         __asm__ __volatile__( LOCK_PREFIX
39                 "btsl %1,%0"
40                 :"=m" (ADDR)
41                 :"Ir" (nr));

which is a ***SINGLE CLOCK CYCLE*** of CPU.
I really don't think that on any machine's firmware 
a btsl will require 4000 cycles.
Not even on Intel x86.

> Changing pagetable content is even more because all the
> tlb's and internal cpu state will need to be flushed... which is also a
> microcode operation for the cpu. 

Good. The same overhead you will find accessing a message 
after a read from a pipe. There will be many TLB faults.
And the same applies when copying the message to the pipe.
Many many TLB faults.

> And it's deadly in an SMP environment.

You say "tlb's and internal cpu state will need to be flushed".
The other cpus in an SMP environment can continue to work, indipendently.
TLBs and cpu state registers are ***PER-CPU*** resorces.

Probably, the case of copying a memory page is worse,
because you have to hold some global lock all the time.
This is deadly in an SMP environment, 
because the critical section lasts for thousands of cycles, 
instead of just a few.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Efficient IPC mechanism on Linux
  2003-09-10  9:18       ` Luca Veraldi
@ 2003-09-10  9:23         ` Arjan van de Ven
  2003-09-10  9:40           ` Luca Veraldi
  2003-09-10  9:52           ` Jamie Lokier
  2003-09-10 11:52         ` Alex Riesen
  1 sibling, 2 replies; 66+ messages in thread
From: Arjan van de Ven @ 2003-09-10  9:23 UTC (permalink / raw)
  To: Luca Veraldi; +Cc: alexander.riesen, linux-kernel


On Wed, 2003-09-10 at 11:18, Luca Veraldi wrote:

> The overhead implied by a memcpy() is the same, in the order of magnitude,
> ***whatever*** kernel version you can develop.


yes a copy of a page is about 3000 to 4000 cycles on an x86 box in the
uncached case. A pagetable operation (like the cpu setting the accessed
or dirty bit) is in that same order I suspect (maybe half this, but not
a lot less). Changing pagetable content is even more because all the
tlb's and internal cpu state will need to be flushed... which is also a
microcode operation for the cpu. And it's deadly in an SMP environment.



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Efficient IPC mechanism on Linux
       [not found]     ` <20030910064508.GA25795@Synopsys.COM>
@ 2003-09-10  9:18       ` Luca Veraldi
  2003-09-10  9:23         ` Arjan van de Ven
  2003-09-10 11:52         ` Alex Riesen
  0 siblings, 2 replies; 66+ messages in thread
From: Luca Veraldi @ 2003-09-10  9:18 UTC (permalink / raw)
  To: alexander.riesen; +Cc: linux-kernel

> I mean for benchmarks. You compared your implementation to a very old
> linux' one. There were big changes in these areas.

The overhead implied by a memcpy() is the same, in the order of magnitude,
***whatever*** kernel version you can develop.

If you have to send, let's say, 3 memory pages from A to B,
using pipes you have to ***physically copy***
2 * 3 * PAGE_SIZE bytes
(first sending them, then receiving them. So two times).
Which evaluates to 8192 * 3.

Using capabilities, the only thing you have to do
is copy an amount of bytes that is linearly dependent
on the number of pages: so, proportional to 3.

If you want the (quite) exact amount, it is
3 * sizeof(unsigned long) + sizeof(capability)
which evaluates to 12 + 16 = 28.
X is not present in the equation.
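
Putting those figures into a small throwaway program (the 4096-byte page,
4-byte unsigned long and 16-byte capability are the numbers used in this mail;
nothing here is taken from the ECBM code itself):

#include <stdio.h>

/* Toy cost model for the figures quoted above: bytes physically copied
 * to move `npages` pages of payload. */
#define PAGE_SIZE       4096UL
#define CAPABILITY_SIZE 16UL
#define ULONG_SIZE      4UL

static unsigned long pipe_copy_bytes(unsigned long npages)
{
        return 2 * npages * PAGE_SIZE;                  /* copy in, then copy out */
}

static unsigned long ecbm_copy_bytes(unsigned long npages)
{
        return npages * ULONG_SIZE + CAPABILITY_SIZE;   /* page list + capability */
}

int main(void)
{
        printf("3 pages: pipe %lu bytes, capability %lu bytes\n",
               pipe_copy_bytes(3), ecbm_copy_bytes(3)); /* 24576 vs. 28 */
        return 0;
}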

I compared pipes and SYS V IPC on Linux 2.2.4 with the new mechanism
also developed over kernel 2.2.4!
The same inefficiency of the kernel support you are talking about
affects all the primitives being examined in the article. Mine, too.

So the relative results will be quite the same.

> you surely know, that it is just an implementation. The mechanisms have
> always been there, evolved from UNIX.
> That is mainly the reason for them to exists: support the applications
> which use them.
>
> Will it be possible to base existing facilities on your approach?
> SVR5 messages (msg{get,snd,rcv}), for example?
> (I have to ask, because I cannot understand the article, sorry)

Ah, ok. So let's continue to do inefficient things
only because it has always been so!

Compatibility is not a problem. Simply rewrite the write() and read() for
pipes
in order to make them do the same thing done by zc_send() and zc_receive().
Or, if you are not referring to pipes, rewrite the support level of your
ancient IPC primitives
in order to make them do the same thing done by zc_send() and zc_receive().

Bye,
Luca.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Efficient IPC mechanism on Linux
  2003-09-09 23:11     ` Alan Cox
@ 2003-09-10  9:04       ` Luca Veraldi
  2003-09-10 12:56         ` Alan Cox
  0 siblings, 1 reply; 66+ messages in thread
From: Luca Veraldi @ 2003-09-10  9:04 UTC (permalink / raw)
  To: Alan Cox; +Cc: linux-kernel

> The question (for smaller messages) is whether this is a win or not.
> While the data fits in L1 cache the copy is close to free compared
> with context switching and TLB overhead

I can answer for messages of 512 bytes (that I consider small).
This is the smallest case considered in the work.

Completion times for this case are reported in the graphs:
you have the time of write() on a pipe,
of SYS V shmget() and of the new primitives.
It's easy to compare the numbers and say whether it is "a win or not".

Surely, transferring bytes from one cache line to another is at zero cost
(that is, a single clock cycle of the cache unit).
But here, how can you mathematically guarantee that,
using traditional IPC mechanisms,
you'll find the info you want directly in cache?
This is quite far from being realistic.

In the real world, what happens is that copying a memory page
causes many cache misses... at least, more than copying a capability
structure
(16 bytes long, as reported in the article).
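
Purely as an illustration of scale (the real layout is described in the article
and is not reproduced here; the field names below are invented), a 16-byte
descriptor on ia32 has room for about four words, for example:

/* Hypothetical illustration only, not the actual ECBM structure. */
struct capability {
        unsigned long channel;  /* channel / object identifier       */
        unsigned long base;     /* first page frame of the message   */
        unsigned long npages;   /* number of pages the message spans */
        unsigned long rights;   /* access rights granted to receiver */
};                              /* 4 * 4 bytes = 16 bytes on ia32    */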

Of course, if you want to send a single int,
you probably don't have any reason to use a capability.
That's for sure.

> Ok - from your example I couldn't tell if B then touches each byte of
> data as well as having it appear in its address space.

It is irrelevant. The times reported in the work are the times needed
to complete the IPC primitives.
After a send or a receive over the channel,
the two processes may do whatever they want.
That is not interesting for us.

Bye,
Luca


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Efficient IPC mechanism on Linux
  2003-09-09 21:57   ` Luca Veraldi
@ 2003-09-09 23:11     ` Alan Cox
  2003-09-10  9:04       ` Luca Veraldi
  0 siblings, 1 reply; 66+ messages in thread
From: Alan Cox @ 2003-09-09 23:11 UTC (permalink / raw)
  To: Luca Veraldi; +Cc: Linux Kernel Mailing List

On Maw, 2003-09-09 at 22:57, Luca Veraldi wrote:
> Also. The central point is not to have 10 instead of 50 assembler lines
> in the primitives. The central point is to implement communication
> primitives
> that do not require physical copying of the messages being sent and
> received.

The question (for smaller messages) is whether this is a win or not.
While the data fits in L1 cache the copy is close to free compared
with context switching and TLB overhead

> We have 2 processes communicating over a channel in a pipeline fashion.
> A writes some information in a buffer and sends it to B.
> B receives and reads.
> This for 1000 times.
> Times reported are average time.

Ok - from your example I couldn't tell if B then touches each byte of
data as well as having it appear in its address space.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Efficient IPC mechanism on Linux
@ 2003-09-09 22:15 Luca Veraldi
  0 siblings, 0 replies; 66+ messages in thread
From: Luca Veraldi @ 2003-09-09 22:15 UTC (permalink / raw)
  To: linux-kernel

Ok. The English version of the document is online, now.
I've translated the main part of it.
The address is the same.

For English speakers:
http://web.tiscali.it/lucavera/www/root/ecbm/index.htm 

If you want it in Italian:
http://web.tiscali.it/lucavera/www/root/ecbm/indexIT.htm 

Sorry if you find some typos. Here in Italy it's midnight...


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Efficient IPC mechanism on Linux
  2003-09-09 21:17 ` Alan Cox
@ 2003-09-09 21:57   ` Luca Veraldi
  2003-09-09 23:11     ` Alan Cox
  0 siblings, 1 reply; 66+ messages in thread
From: Luca Veraldi @ 2003-09-09 21:57 UTC (permalink / raw)
  To: Alan Cox, linux-kernel

> The text is a little hard to follow for an English speaker. I can follow
> a fair bit and there are a few things I'd take issue with (forgive me if
> I'm misunderstanding your material)

I know. I'm very sorry for this. I'm rewriting the text in English,
so don't bother with the Italian. The new page will be up in a moment.

> 1. Almost all IPC is < 512 bytes long. There are exceptions including
> X11 image passing which is an important performance one

Sure, for system needs. What about programming?
You can write applications of any complexity.
Why do I have to be limited by the kernel to sending 4056 bytes
(for example, if I use SYS V message queues)
instead of an arbitrarily sized message?

> 2. One of the constraints that a message system has is that if it uses
> shared memory it must protect the memory queues from abuse. For example
> if previous/next pointers are kept in the shared memory one app may be
> able to trick another into modifying the wrong thing. Sometimes this
> matters.

Please, think in terms of address spaces.
If the piece of information you want to access is not
in your own private address space, the only thing that will occur
is a general protection fault, and your process will be killed.
Remember that, with my primitives, you're working with logical addresses
that need relocation in hardware. The MMU performs
memory protection at the same time, too.

> 3. An interesting question exists as to how whether you can create the
> same effect with current 2.6 kernel primitives. We have posix shared
> memory objects, we have read only mappings and we have extremely fast
> task switch and also locks (futex locks)

I need to use locks in my primitives too. The fact that the new kernel has
faster locks also reduces the completion time of my primitives.

Also, the central point is not to have 10 instead of 50 assembler lines
in the primitives. The central point is to implement communication
primitives
that do not require physical copying of the messages being sent and
received.
Faster locks probably reduce the time to write to or read from a pipe.
But the order of magnitude remains the same.

Task switching is relevant in all communication primitives.
A fast task switch means proportionally fast primitives. Mine too.

> Also from the benching point of view it is always instructive to run a
> set of benchmarks that involve the sender writing all the data bytes,
> sending it and the receiver reading them all. Without this zero copy
> mechanisms (especially mmu based ones) look artificially good.

The benchmark you are talking about is exactly the one I implemented to measure
the IPC overhead.
We have 2 processes communicating over a channel in a pipeline fashion.
A writes some information into a buffer and sends it to B.
B receives and reads it.
This is repeated 1000 times.
The times reported are averages.
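
For the pipe case, a minimal sketch of such a pipeline benchmark; the 4096-byte
message size, the gettimeofday() timing and the rest of this harness are
assumptions, not the code used for the published figures.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <unistd.h>

#define MSG_SIZE 4096
#define ROUNDS   1000

int main(void)
{
        int a_to_b[2];
        char buf[MSG_SIZE];
        struct timeval t0, t1;
        long sum = 0;
        int i, j;

        if (pipe(a_to_b) < 0)
                return 1;

        if (fork() == 0) {                        /* B: receive and read */
                for (i = 0; i < ROUNDS; i++) {
                        ssize_t got = 0, n;
                        while (got < MSG_SIZE) {
                                n = read(a_to_b[0], buf + got, MSG_SIZE - got);
                                if (n <= 0)
                                        exit(1);
                                got += n;
                        }
                        for (j = 0; j < MSG_SIZE; j++)
                                sum += buf[j];    /* actually touch the data */
                }
                exit(sum & 1);
        }

        gettimeofday(&t0, NULL);
        for (i = 0; i < ROUNDS; i++) {            /* A: write and send */
                memset(buf, i, MSG_SIZE);
                write(a_to_b[1], buf, MSG_SIZE);
        }
        wait(NULL);                               /* B has read everything */
        gettimeofday(&t1, NULL);

        printf("avg per round: %.1f us\n",
               ((t1.tv_sec - t0.tv_sec) * 1e6 +
                (t1.tv_usec - t0.tv_usec)) / ROUNDS);
        return 0;
}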

Bye,
Luca


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Efficient IPC mechanism on Linux
  2003-09-09 17:30 Luca Veraldi
@ 2003-09-09 21:17 ` Alan Cox
  2003-09-09 21:57   ` Luca Veraldi
       [not found] ` <20030909175821.GL16080@Synopsys.COM>
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 66+ messages in thread
From: Alan Cox @ 2003-09-09 21:17 UTC (permalink / raw)
  To: Luca Veraldi; +Cc: Linux Kernel Mailing List

On Maw, 2003-09-09 at 18:30, Luca Veraldi wrote:
> Hi all.
> At the web page
> http://web.tiscali.it/lucavera/www/root/ecbm/index.htm
> You can find the results of my attempt at modifying the Linux kernel sources
> to implement a new Inter Process Communication mechanism.
> 
> It is called ECBM for Efficient Capability-Based Messaging.
> 
> In the reading You can also find the comparison of ECBM 
> against some other commonly-used Linux IPC primitives 
> (such as read/write on pipes or SYS V tools).

The text is a little hard to follow for an English speaker. I can follow
a fair bit and there are a few things I'd take issue with (forgive me if
I'm misunderstanding your material)


1. Almost all IPC is < 512 bytes long. There are exceptions including
X11 image passing which is an important performance one

2. One of the constraints that a message system has is that if it uses
shared memory it must protect the memory queues from abuse. For example
if previous/next pointers are kept in the shared memory one app may be
able to trick another into modifying the wrong thing. Sometimes this
matters.

3. An interesting question exists as to whether you can create the
same effect with current 2.6 kernel primitives. We have posix shared
memory objects, we have read only mappings and we have extremely fast
task switch and also locks (futex locks)

Also, from the benchmarking point of view it is always instructive to run a
set of benchmarks that involve the sender writing all the data bytes,
sending them and the receiver reading them all. Without this, zero copy
mechanisms (especially MMU based ones) look artificially good. 



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Efficient IPC mechanism on Linux
@ 2003-09-09 18:59 Luca Veraldi
  0 siblings, 0 replies; 66+ messages in thread
From: Luca Veraldi @ 2003-09-09 18:59 UTC (permalink / raw)
  To: linux-kernel

You're right. When I have a bit of time, I'll translate the page.

The version of the kernel is not so important.
The code is highly portable and depends little on the internals of the kernel.
It only needs some functions that continue to survive in newer versions of
Linux.
I used 2.2.4 only because it was ready-to-use on my machine.

The main purpose of the article was not to present a professional
mechanism that you can download and use in your own applications.

It is only proof of a fact: Linux IPC mechanisms
are ***structurally*** inefficient.

Bye.
Luca



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Efficient IPC mechanism on Linux
@ 2003-09-09 17:30 Luca Veraldi
  2003-09-09 21:17 ` Alan Cox
                   ` (3 more replies)
  0 siblings, 4 replies; 66+ messages in thread
From: Luca Veraldi @ 2003-09-09 17:30 UTC (permalink / raw)
  To: linux-kernel

Hi all.
At the web page
http://web.tiscali.it/lucavera/www/root/ecbm/index.htm
You can find the results of my attempt at modifying the Linux kernel sources
to implement a new Inter Process Communication mechanism.

It is called ECBM for Efficient Capability-Based Messaging.

In it you can also find the comparison of ECBM 
against some other commonly-used Linux IPC primitives 
(such as read/write on pipes or SYS V tools).

The results are quite clear.

Enjoy.
Luca Veraldi


----------------------------------------
Luca Veraldi

Graduate Student of Computer Science
at the University of Pisa

veraldi@cli.di.unipi.it
luca.veraldi@katamail.com
ICQ# 115368178
----------------------------------------

^ permalink raw reply	[flat|nested] 66+ messages in thread

end of thread, other threads:[~2003-09-12 22:39 UTC | newest]

Thread overview: 66+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2003-09-10 18:41 Efficient IPC mechanism on Linux Manfred Spraul
     [not found] <u9j3.1VB.27@gated-at.bofh.it>
     [not found] ` <u9j3.1VB.29@gated-at.bofh.it>
     [not found]   ` <u9j3.1VB.31@gated-at.bofh.it>
     [not found]     ` <u9j3.1VB.25@gated-at.bofh.it>
     [not found]       ` <ubNY.5Ma.19@gated-at.bofh.it>
     [not found]         ` <uc79.6lg.13@gated-at.bofh.it>
     [not found]           ` <uc7d.6lg.23@gated-at.bofh.it>
     [not found]             ` <uch0.6zx.17@gated-at.bofh.it>
     [not found]               ` <ucqs.6NC.3@gated-at.bofh.it>
     [not found]                 ` <ucqy.6NC.19@gated-at.bofh.it>
     [not found]                   ` <udmB.8eZ.15@gated-at.bofh.it>
     [not found]                     ` <udPF.BD.11@gated-at.bofh.it>
     [not found]                       ` <3F5F37CD.6060808@softhome.net>
2003-09-10 15:28                         ` Luca Veraldi
     [not found] <fa.h06p421.1s00ojt@ifi.uio.no>
     [not found] ` <fa.gc37hsp.34id89@ifi.uio.no>
     [not found]   ` <E19x47V-0002JG-J8@phoenix.hadiko.de>
2003-09-10 12:45     ` Luca Veraldi
     [not found] <E19x3el-0002Fc-Rj@phoenix.hadiko.de>
2003-09-10 12:16 ` Luca Veraldi
2003-09-10 14:53   ` Larry McVoy
     [not found] <F71B37536F3B3D4FA521FEC7FCA17933164A@twinsrv.twinox.se>
2003-09-10 10:36 ` Luca Veraldi
  -- strict thread matches above, loose matches on Subject: below --
2003-09-09 22:15 Luca Veraldi
2003-09-09 18:59 Luca Veraldi
2003-09-09 17:30 Luca Veraldi
2003-09-09 21:17 ` Alan Cox
2003-09-09 21:57   ` Luca Veraldi
2003-09-09 23:11     ` Alan Cox
2003-09-10  9:04       ` Luca Veraldi
2003-09-10 12:56         ` Alan Cox
     [not found] ` <20030909175821.GL16080@Synopsys.COM>
     [not found]   ` <001d01c37703$8edc10e0$36af7450@wssupremo>
     [not found]     ` <20030910064508.GA25795@Synopsys.COM>
2003-09-10  9:18       ` Luca Veraldi
2003-09-10  9:23         ` Arjan van de Ven
2003-09-10  9:40           ` Luca Veraldi
2003-09-10  9:44             ` Arjan van de Ven
2003-09-10 10:09               ` Luca Veraldi
2003-09-10 10:14                 ` Arjan van de Ven
2003-09-10 10:25                   ` Luca Veraldi
2003-09-12 18:41                     ` Timothy Miller
2003-09-12 19:05                       ` Luca Veraldi
2003-09-12 22:37                         ` Alan Cox
2003-09-10 12:50                 ` Alan Cox
2003-09-10 19:16                 ` Shawn
2003-09-10 20:05                 ` Rik van Riel
2003-09-10 12:47             ` Alan Cox
2003-09-10 13:56               ` Luca Veraldi
2003-09-10 15:59                 ` Alan Cox
2003-09-10  9:52           ` Jamie Lokier
2003-09-10 10:07             ` Arjan van de Ven
2003-09-10 10:17               ` Luca Veraldi
2003-09-10 10:37               ` Jamie Lokier
2003-09-10 10:41                 ` Arjan van de Ven
2003-09-10 10:54                   ` Luca Veraldi
2003-09-10 10:54                     ` Arjan van de Ven
2003-09-10 11:16                     ` Nick Piggin
2003-09-10 11:30                       ` Luca Veraldi
2003-09-10 11:44                         ` Nick Piggin
2003-09-10 12:14                           ` Luca Veraldi
2003-09-10 12:42                       ` Alan Cox
2003-09-10 10:11             ` Luca Veraldi
2003-09-10 19:24             ` Pavel Machek
2003-09-10 19:40               ` Jamie Lokier
2003-09-10 21:35                 ` Pavel Machek
2003-09-10 22:06                   ` Jamie Lokier
2003-09-10 11:52         ` Alex Riesen
2003-09-10 12:14           ` Luca Veraldi
2003-09-10 12:11             ` Alex Riesen
2003-09-10 12:29               ` Luca Veraldi
2003-09-10 12:28                 ` Alex Riesen
2003-09-10 12:36                   ` Luca Veraldi
2003-09-10 12:36                     ` Alex Riesen
2003-09-10 13:33                     ` Gábor Lénárt
2003-09-10 14:04                       ` Luca Veraldi
2003-09-10 14:21 ` Stewart Smith
2003-09-10 14:39   ` Luca Veraldi
2003-09-10 16:59 ` Andrea Arcangeli
2003-09-10 17:05   ` Andrea Arcangeli
2003-09-10 17:21     ` Luca Veraldi
2003-09-10 17:41       ` Andrea Arcangeli
2003-09-10 17:39   ` Martin Konold
2003-09-10 18:01     ` Andrea Arcangeli
2003-09-10 18:05       ` Martin Konold
2003-09-10 18:31         ` Chris Friesen
