linux-kernel.vger.kernel.org archive mirror
* Re: Make pipe data structure be a circular list of pages, rather than
@ 2005-01-07 14:30 Oleg Nesterov
  2005-01-07 15:45 ` Alan Cox
  2005-01-07 16:17 ` Linus Torvalds
  0 siblings, 2 replies; 38+ messages in thread
From: Oleg Nesterov @ 2005-01-07 14:30 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: William Lee Irwin III, linux-kernel

Hello.

pipe_writev:
> +		if (bufs < PIPE_BUFFERS) {
> +			ssize_t chars;
> +			int newbuf = (info->curbuf + bufs) & (PIPE_BUFFERS-1);

If I understand this patch correctly, then this code

	for (;;)
		write(pipe_fd, &byte, 1);

will block after writing PIPE_BUFFERS == 16 characters, no?
And pipe_inode_info will use 64K to hold 16 bytes!

Is it ok?

Maybe it makes sense to add data to the last allocated page
until buf->len > PAGE_SIZE?
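
A minimal sketch of that idea, reusing names from the patch excerpt above
(the "info->bufs[]" array and the "offset"/"len" fields are my guesses,
not necessarily what the patch really uses):

	if (bufs) {	/* pipe not empty: look at the newest buffer */
		int lastbuf = (info->curbuf + bufs - 1) & (PIPE_BUFFERS-1);
		struct pipe_buffer *buf = info->bufs + lastbuf;
		size_t room = PAGE_SIZE - buf->offset - buf->len;

		if (room) {
			/* append min(room, chars) bytes from user space
			   into the tail page instead of using a new slot */
		}
	}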

Oleg.


* Re: Make pipe data structure be a circular list of pages, rather than
  2005-01-07 14:30 Make pipe data structure be a circular list of pages, rather than Oleg Nesterov
@ 2005-01-07 15:45 ` Alan Cox
  2005-01-07 17:23   ` Linus Torvalds
  2005-01-07 16:17 ` Linus Torvalds
  1 sibling, 1 reply; 38+ messages in thread
From: Alan Cox @ 2005-01-07 15:45 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, William Lee Irwin III, Linux Kernel Mailing List

On Gwe, 2005-01-07 at 14:30, Oleg Nesterov wrote:
> will block after writing PIPE_BUFFERS == 16 characters, no?
> And pipe_inode_info will use 64K to hold 16 bytes!
> 
> Is it ok?

That would break stuff, but holding the last page until it fills 4K
would work, or just basic sg coalescing when possible. The pipe
behaviour - particularly buffer size and the size of atomic writes - is
defined by SuS, and there are people who use pipes two ways between apps
and rely on the guarantees to avoid deadlocks.



* Re: Make pipe data structure be a circular list of pages, rather than
  2005-01-07 16:17 ` Linus Torvalds
@ 2005-01-07 16:06   ` Alan Cox
  2005-01-07 17:33     ` Linus Torvalds
  2005-01-07 17:45     ` Chris Friesen
  2005-01-07 16:39   ` Davide Libenzi
  2005-08-18  6:07   ` Coywolf Qi Hunt
  2 siblings, 2 replies; 38+ messages in thread
From: Alan Cox @ 2005-01-07 16:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, William Lee Irwin III, Linux Kernel Mailing List

On Gwe, 2005-01-07 at 16:17, Linus Torvalds wrote:
> it a "don't do that then", and I'll wait to see if people do. I can't
> think of anything that cares about performance that does that anyway:  
> because system calls are reasonably expensive regardless, anybody who
> cares at all about performance will have buffered things up in user space.

Actually I found a load of apps that do this but they don't care about
performance. Lots of people have signal handlers that just do

	write(pipe_fd, &signr, sizeof(signr))

so they can drop signalled events into their event loops.
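
A sketch of the pattern (the "self-pipe trick"); pipe_fd is assumed to have
been set up with pipe() at startup, and the read end sits in the program's
select()/poll() set:

	#include <signal.h>
	#include <unistd.h>

	static int pipe_fd[2];		/* filled in by pipe() at startup */

	static void handler(int signr)
	{
		/* write() is async-signal-safe; one small write per signal
		   wakes whoever is polling the read end of the pipe */
		write(pipe_fd[1], &signr, sizeof(signr));
	}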


> > Maybe it makes sense to add data to the last allocated page
> > until buf->len > PAGE_SIZE?
> 
> The reason I don't want to coalesce is that I don't ever want to modify a
> page that is on a pipe buffer (well, at least not through the pipe buffer

If I can't write 4096 bytes down it one at a time without blocking from
an empty pipe then it's not a pipe in the eyes of the Unix world and the
standards.

> With this organization, a pipe ends up being able to act as a "conduit"  
> for pretty much any data, including some high-bandwidth things like video
> streams, where you really _really_ don't want to copy the data. So the 
> next stage is:

The data copying impact isn't very high even if it is just done for the
pipe() case for standards behaviour. You end up with one page that is
written to and then sent and then freed rather than many.

Possibly what should be done is to not use pipe() for this at all but to
create a more explicit object so we don't confuse existing code and code
that reuses buffers heavily (plenty of that around). Add "sewer()"[1]
or "conduit()" and we don't have to change the behaviour of the Unix
pipe world and risk screwing up apps, and you don't have to write
grungy corner cases that ruin the plan.



* Re: Make pipe data structure be a circular list of pages, rather than
  2005-01-07 14:30 Make pipe data structure be a circular list of pages, rather than Oleg Nesterov
  2005-01-07 15:45 ` Alan Cox
@ 2005-01-07 16:17 ` Linus Torvalds
  2005-01-07 16:06   ` Alan Cox
                     ` (2 more replies)
  1 sibling, 3 replies; 38+ messages in thread
From: Linus Torvalds @ 2005-01-07 16:17 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: William Lee Irwin III, linux-kernel



On Fri, 7 Jan 2005, Oleg Nesterov wrote:
> 
> If I understand this patch correctly, then this code
> 
> 	for (;;)
> 		write(pipe_fd, &byte, 1);
> 
> will block after writing PIPE_BUFFERS == 16 characters, no?
> And pipe_inode_info will use 64K to hold 16 bytes!

Yes. 

> Is it ok?

If you want throughput, don't do single-byte writes. Obviously we _could_
do coalescing, but there's a reason I'd prefer to avoid it. So I consider
it a "don't do that then", and I'll wait to see if people do. I can't
think of anything that cares about performance that does that anyway:  
because system calls are reasonably expensive regardless, anybody who
cares at all about performance will have buffered things up in user space.

> Maybe it makes sense to add data to the last allocated page
> until buf->len > PAGE_SIZE?

The reason I don't want to coalesce is that I don't ever want to modify a
page that is on a pipe buffer (well, at least not through the pipe buffer
- it might get modified some other way). Why? Because the long-term plan
for pipe-buffers is to allow the data to come from _other_ sources than
just a user space copy. For example, it might be a page directly from the
page cache, or a partial page that contains the data part of an skb that
just came in off the network.

With this organization, a pipe ends up being able to act as a "conduit"  
for pretty much any data, including some high-bandwidth things like video
streams, where you really _really_ don't want to copy the data. So the 
next stage is:

 - allow the buffer size to be set dynamically per-pipe (probably only
   increased by root, due to obvious issues, although a per-user limit is 
   not out of the question - it's just a "mlock" in kernel buffer space, 
   after all)
 - add per-"struct pipe_buffer" ops pointer to a structure with 
   operation function pointers: "release()", "wait_for_ready()", "poll()"  
   (and possibly "merge()", if we want to coalesce things, although I
   really hope we won't need to)
 - add a "splice(fd, fd)" system call that copies pages (by incrementing 
   their reference count, not by copying the data!) from an input source 
   to the pipe, or from a pipe to an output.
 - add a "tee(in, out1, out2)" system call that duplicates the pages 
   (again, incrementing their reference count, not copying the data) from 
   one pipe to two other pipes.

All of the above is basically a few lines of code (the "splice()" thing
requires some help from drivers/networking/pagecache etc, but it's not
complex help, and not everybody needs to do it - I'll start off with
_just_ a generic page cache helper to get the thing rolling, that's easy).
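
Roughly, as a sketch (the names and exact fields are just illustration, not
a final interface):

	/* per-buffer state: one (possibly partial) page plus its ops */
	struct pipe_buffer {
		struct page *page;	/* user copy, page cache page, skb frag, ... */
		unsigned int offset, len;
		struct pipe_buf_operations *ops;
	};

	struct pipe_buf_operations {
		void (*release)(struct pipe_inode_info *, struct pipe_buffer *);
		int (*wait_for_ready)(struct pipe_inode_info *, struct pipe_buffer *);
		int (*poll)(struct pipe_inode_info *, struct pipe_buffer *);
		/* possibly a "merge()" here, if we end up needing coalescing */
	};

	/* and the new system calls, roughly */
	long splice(int fd_in, int fd_out, size_t maxlen, unsigned int flags);
	long tee(int fd_in, int fd_out1, int fd_out2);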

Now, imagine using the above in a media server, for example. Let's say 
that a year or two has passed, so that the video drivers have been updated 
to be able to do the splice thing, and what can you do? You can:

 - splice from the (mpeg or whatever - let's just assume that the video
   input is either digital or does the encoding on its own - like they
   pretty much all do) video input into a pipe (remember: no copies - the
   video input will just DMA directly into memory, and splice will just 
   set up the pages in the pipe buffer)
 - tee that pipe to split it up
 - splice one end to a file (ie "save the compressed stream to disk")
 - splice the other end to a real-time video decoder window for your 
   real-time viewing pleasure.

That's the plan, at least. I think it makes sense, and the thing that 
convinced me about it was (a) how simple all of this seems to be 
implementation-wise (modulo details - but there are no "conceptually 
complex" parts: no horrid asynchronous interfaces, no questions about 
how to buffer things, no direct user access to pages that might 
partially contain protected data etc etc) and (b) it's so UNIXy. If 
there's something that says "the UNIX way", it's pipes between entities 
that act on the data.

For example, let's say that you wanted to serve a file from disk (or any 
other source) with a header to another program (or to a TCP connection, or 
to whatever - it's just a file descriptor). You'd do

	fd = create_pipe_to_destination();

	input = open("filename", O_RDONLY);
	write(fd, "header goes here", length_of_header);
	for (;;) {
		ssize_t err;
		err = splice(input, fd,
			~0 /* maxlen */,
			0 /* msg flags - think "sendmsg" */);
		if (err > 0)
			continue;
		if (!err) /* EOF */
			break;
		.. handle input errors here ..
	}

(obviously, if this is a real server, this would likely all be in a
select/epoll loop, but that just gets too hard to describe concisely, so
I'm showing the single-threaded simple version).

Further, this also ends up giving a nice pollable interface to regular
files too: just splice from the file (at any offset) into a pipe, and poll
on the result.  The "splice()" will just do the page cache operations and
start the IO if necessary, the "poll()" will wait for the first page to be
actually available. All _trivially_ done with the "struct pipe_buffer"
operations.

So the above kind of "send a file to another destination" should
automatically work very naturally in any poll loop: instead of filling a
writable pipe with a "write()", you just fill it with "splice()" instead
(and you can read it with a 'read()' or you just splice it to somewhere
else, or you tee() it to two destinations....).

I think the media server kind of environment is the one most easily
explained, where you have potentially tons of data that the server process
really never actually wants to _see_ - it just wants to push it on to
another process or connection or save it to a log-file or something. But 
as with regular pipes, it's not a specialized interface: it really is just 
a channel of communication.

The difference being that a historical UNIX pipe is always a channel 
between two process spaces (ie you can only fill it and empty it into the 
process address space), and the _only_ thing I'm trying to do is to have 
it be able to be a channel between two different file descriptors too. You 
still need the process to "control" the channel, but the data doesn't have 
to touch the address space any more.

Think of all the servers or other processes that really don't care about 
the data. Think of something as simple as encrypting a file, for example. 
Imagine that you have hardware encryption support that does DMA from the 
source, and writes the results using DMA. I think it's pretty obvious how 
you'd connect this up using pipes and two splices (one for the input, one 
for the output).

And notice how _flexible_ it is (both the input and the output can be any
kind of fd you want - the pipes end up doing both the "conversion"  into a
common format of "list of (possibly partial) pages" and the buffering, 
which is why the different "engines" don't need to care where the data 
comes from, or where it goes). So while you can use it to encrypt a file 
into another file, you could equally easily use it for something like

	tar cvf - my_source_tree | hw_engine_encrypt | splice_to_network

and the whole pipeline would not have a _single_ actual data copy: the 
pipes are channels.

Of course, since it's a pipe, the nice thing is that people don't have to
use "splice()" to access it - the above pipeline has a perfectly regular
"tar" process that probably just does regular writes. You can have a
process that does "splice()" to fill the pipe, and the other end is a
normal thing that just uses regular "read()" and doesn't even _know_ that
the pipe is using new-fangled technology to be filled.

I'm clearly enamoured with this concept. I think it's one of those few 
"RightThing(tm)" that doesn't come along all that often. I don't know of 
anybody else doing this, and I think it's both useful and clever. If you 
now prove me wrong, I'll hate you forever ;)

		Linus


* Re: Make pipe data structure be a circular list of pages, rather than
  2005-01-07 16:17 ` Linus Torvalds
  2005-01-07 16:06   ` Alan Cox
@ 2005-01-07 16:39   ` Davide Libenzi
  2005-01-07 17:09     ` Linus Torvalds
  2005-08-18  6:07   ` Coywolf Qi Hunt
  2 siblings, 1 reply; 38+ messages in thread
From: Davide Libenzi @ 2005-01-07 16:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, William Lee Irwin III, Linux Kernel Mailing List

On Fri, 7 Jan 2005, Linus Torvalds wrote:

> I'm clearly enamoured with this concept. I think it's one of those few 
> "RightThing(tm)" that doesn't come along all that often. I don't know of 
> anybody else doing this, and I think it's both useful and clever. If you 
> now prove me wrong, I'll hate you forever ;)

I don't know which kind of poison Larry put in your glass when you two 
met, but it clearly worked :=)



- Davide



* Re: Make pipe data structure be a circular list of pages, rather than
  2005-01-07 16:39   ` Davide Libenzi
@ 2005-01-07 17:09     ` Linus Torvalds
  0 siblings, 0 replies; 38+ messages in thread
From: Linus Torvalds @ 2005-01-07 17:09 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Oleg Nesterov, William Lee Irwin III, Linux Kernel Mailing List



On Fri, 7 Jan 2005, Davide Libenzi wrote:
> 
> I don't know which kind of poison Larry put in your glass when you two 
> met, but it clearly worked :=)

Oh, Larry introduced the notion of "splice()" to me at least five years
ago, so whatever he put in that glass was quite slow-acting. At the time I
told him to forget it - I thought it intriguing, but couldn't see any
reasonable way to do it.

And to be honest, my pipes aren't really what Larry's "splice()" was: his 
notion was more of a direct "fd->fd" thing, with nothing in between. That 
was what I still consider unworkable. 

The problem with splicing two file descriptors together is that it needs
for them to agree on the interchange format, and these things tend to be
very expensive from an interface angle (ie the format has to _exactly_
match the "impedance" at both ends, since any conversion would make the
whole exercise pointless - you'd be better off just doing a read and a
write).

And note that the "format" is not just the actual data transfer thing, but
the rules about what to do with partial reads, partial writes, file
offsets, streams _without_ file offsets etc etc. And that's really what
the pipe generalization does: it decouples all the format things into an
intermediate thing that is very unambiguous, and has none of those issues.

The pipe approach also decouples things in another way: a true "splice()"  
needs all entities to participate in the new world order for it to be
useful. It shares that brokenness with "sendfile()", which to some degree
exemplifies the problems with splice() (sendfile ends up being a very very
specialized uni-directional form of Larry's splice()). But realizing that 
pipes _are_ the conduit and that they already exist and work with very old 
interfaces suddenly means that the new splice() can be incrementally 
useful.

So it took me five years of thinking about other things before an
approach that would actually work and make sense from an implementation
standpoint came together.

What made me think about this particular thing is that I really thought a
lot of things want to just mmap the data, but do not actually want to go
to the expense of finding virtual addresses, handling the virtual->
physical translation at each stage. A lot of things just want to access
the buffers, _without_ the user-visible mapping. User-visibility is not
only expensive, it ends up being an impossible interface for many things
(memory mapping has page granularity, something that simply isn't true for
a lot of devices).

So in many ways, think of the new pipes as a "buffer mapping". Nothing
else. It's a way to carry arbitrary buffers along in a secure manner.

			Linus


* Re: Make pipe data structure be a circular list of pages, rather than
  2005-01-07 15:45 ` Alan Cox
@ 2005-01-07 17:23   ` Linus Torvalds
  2005-01-08 18:25     ` Hugh Dickins
  0 siblings, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2005-01-07 17:23 UTC (permalink / raw)
  To: Alan Cox; +Cc: Oleg Nesterov, William Lee Irwin III, Linux Kernel Mailing List



On Fri, 7 Jan 2005, Alan Cox wrote:
> 
> That would break stuff, but holding the last page until it fills 4K
> would work, or just basic sg coalescing when possible. The pipe
> behaviour - particularly buffer size and the size of atomic writes - is
> defined by SuS, and there are people who use pipes two ways between apps
> and rely on the guarantees to avoid deadlocks.

Hmm.. Are there really any other guarantees than the atomicity guarantee 
for a PIPE_BUF transfer? I don't see any in SuS..

That said, existing practice obviously always trumps paper standards, and 
yes, coalescing is possible.

		Linus


* Re: Make pipe data structure be a circular list of pages, rather than
  2005-01-07 16:06   ` Alan Cox
@ 2005-01-07 17:33     ` Linus Torvalds
  2005-01-07 17:48       ` Linus Torvalds
  2005-01-07 17:45     ` Chris Friesen
  1 sibling, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2005-01-07 17:33 UTC (permalink / raw)
  To: Alan Cox; +Cc: Oleg Nesterov, William Lee Irwin III, Linux Kernel Mailing List



On Fri, 7 Jan 2005, Alan Cox wrote:
>
> > The reason I don't want to coalesce is that I don't ever want to modify a
> > page that is on a pipe buffer (well, at least not through the pipe buffer
> 
> If I can't write 4096 bytes down it one at a time without blocking from
> an empty pipe then it's not a pipe in the eyes of the Unix world and the
> standards.

Absolutely. In fact, with the new implementation, you can often write
_several_ packets of 4096 bytes without blocking (but only writes less
than PIPE_BUF are guaranteed to be done all-or-nothing). I'm very aware of
the atomicity guarantees, I'm just saying that if you try to write 4096 
bytes by doing it one byte at a time, that has changed.

> > With this organization, a pipe ends up being able to act as a "conduit"  
> > for pretty much any data, including some high-bandwidth things like video
> > streams, where you really _really_ don't want to copy the data. So the 
> > next stage is:
> 
> The data copying impact isn't very high even if it is just done for the
> pipe() case for standards behaviour. You end up with one page that is
> written to and then sent and then freed rather than many.

I absolutely agree. A regular read()/write() still copies the data, and 
that's because I'm a firm believer that copying even a few kB of data is 
likely to be cheaper than trying to play MM games (not just the lookup of 
the physical address - all the locking, COW, etc crud that VM games 
require).

So this shares some of the issues with the zero-copy pipes of yore, but
doesn't actually do any of that for regular pipe read/writes. And it
never will, as far as I'm concerned. I just don't think user zero-copy is
interesting at that level: if the user wants to access something without
copying, he uses "mmap()".

So only when the data is _not_ in user VM space, that's when increasing a
reference count is cheaper than copying. Pretty much by definition, you
already have a "struct page *" at that point, along with which part of the
page contains the data.

So the "standard behaviour" (aka just plain read/write on the pipe) is all
the same copies that it used to be. The "just move pages around" issue
only happens when you want to duplicate the stream, or if you splice
around stuff that is already in kernel buffers (or needs a kernel buffer
anyway).

		Linus


* Re: Make pipe data structure be a circular list of pages, rather than
  2005-01-07 16:06   ` Alan Cox
  2005-01-07 17:33     ` Linus Torvalds
@ 2005-01-07 17:45     ` Chris Friesen
  1 sibling, 0 replies; 38+ messages in thread
From: Chris Friesen @ 2005-01-07 17:45 UTC (permalink / raw)
  To: Alan Cox
  Cc: Linus Torvalds, Oleg Nesterov, William Lee Irwin III,
	Linux Kernel Mailing List

Alan Cox wrote:
> On Gwe, 2005-01-07 at 16:17, Linus Torvalds wrote:
> 
>>it a "don't do that then", and I'll wait to see if people do. I can't
>>think of anything that cares about performance that does that anyway:  
>>because system calls are reasonably expensive regardless, anybody who
>>cares at all about performance will have buffered things up in user space.
> 
> 
> Actually I found a load of apps that do this but they don't care about
> performance. Lots of people have signal handlers that just do
> 
> 	write(pipe_fd, &signr, sizeof(signr))
> 
> so they can drop signalled events into their event loops.

I would be one of those people, although I do pass the signal 
information as well.

Chris



* Re: Make pipe data structure be a circular list of pages, rather than
  2005-01-07 17:33     ` Linus Torvalds
@ 2005-01-07 17:48       ` Linus Torvalds
  2005-01-07 20:59         ` Mike Waychison
                           ` (2 more replies)
  0 siblings, 3 replies; 38+ messages in thread
From: Linus Torvalds @ 2005-01-07 17:48 UTC (permalink / raw)
  To: Alan Cox; +Cc: Oleg Nesterov, William Lee Irwin III, Linux Kernel Mailing List



On Fri, 7 Jan 2005, Linus Torvalds wrote:
> 
> So the "standard behaviour" (aka just plain read/write on the pipe) is all
> the same copies that it used to be.

I want to highlight this again. The increase in throughput did _not_ come
from avoiding a copy. It came purely from the fact that we have multiple
buffers, and thus a throughput-intensive load gets to do several bigger
chunks before it needs to wait for the recipient. So the increase in
throughput comes from fewer synchronization points (scheduling and
locking), not from anything else.

Another way of saying the same thing: pipes actually used to have clearly
_lower_ bandwidth than UNIX domain sockets, even though clearly pipes are
simpler and should thus be able to be faster. The reason? UNIX domain
sockets allow multiple packets in flight, and pipes only used to have a
single buffer. With the new setup, pipes get roughly comparable
performance to UNIX domain sockets for me.

Sockets can still outperform pipes, truth be told - there's been more
work on socket locking than on pipe locking. For example, the new pipe
code should conceptually really allow one CPU to empty one pipe buffer
while another CPU fills another pipe buffer, but because I kept the
locking structure the same as it used to be, right now read/write
serialize over the whole pipe rather than anything else.

This is the main reason why I want to avoid coalescing if possible: if you
never coalesce, then each "pipe_buffer" is complete in itself, and that
simplifies locking enormously.
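
Purely as an illustration of why (this is not what the current patch does,
and the field names are invented): with fully self-contained buffers the
ring could in principle look like

	struct pipe_ring {
		struct pipe_buffer bufs[PIPE_BUFFERS];
		unsigned int head;		/* next slot the writer fills */
		unsigned int tail;		/* next slot the reader drains */
		spinlock_t head_lock;		/* writers only */
		spinlock_t tail_lock;		/* readers only */
	};

and the reader would never need to touch a buffer the writer is still
appending to, so the two locks would never have to nest.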

(The other reason to potentially avoid coalescing is that I think it might
be useful to allow the "sendmsg()/recvmsg()" interfaces that honour packet
boundaries. The new pipe code _is_ internally "packetized" after all).

				Linus


* Re: Make pipe data structure be a circular list of pages, rather than
  2005-01-07 17:48       ` Linus Torvalds
@ 2005-01-07 20:59         ` Mike Waychison
  2005-01-07 23:46           ` Chris Friesen
  2005-01-07 21:59         ` Linus Torvalds
  2005-01-10 23:23         ` Robert White
  2 siblings, 1 reply; 38+ messages in thread
From: Mike Waychison @ 2005-01-07 20:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Cox, Oleg Nesterov, William Lee Irwin III,
	Linux Kernel Mailing List


Linus Torvalds wrote:
> 
> On Fri, 7 Jan 2005, Linus Torvalds wrote:
> 
>>So the "standard behaviour" (aka just plain read/write on the pipe) is all
>>the same copies that it used to be.
> 
> 
> I want to highlight this again. The increase in throughput did _not_ come
> from avoiding a copy. It came purely from the fact that we have multiple
> buffers, and thus a throughput-intensive load gets to do several bigger
> chunks before it needs to wait for the recipient. So the increase in
> throughput comes from fewer synchronization points (scheduling and
> locking), not from anything else.
> 
> Another way of saying the same thing: pipes actually used to have clearly
> _lower_ bandwidth than UNIX domain sockets, even though clearly pipes are
> simpler and should thus be able to be faster. The reason? UNIX domain
> sockets allow multiple packets in flight, and pipes only used to have a
> single buffer. With the new setup, pipes get roughly comparable
> performance to UNIX domain sockets for me.
> 
> Sockets can still outperform pipes, truth be told - there's been more
> work on socket locking than on pipe locking. For example, the new pipe
> code should conceptually really allow one CPU to empty one pipe buffer
> while another CPU fills another pipe buffer, but because I kept the
> locking structure the same as it used to be, right now read/write
> serialize over the whole pipe rather than anything else.
> 
> This is the main reason why I want to avoid coalescing if possible: if you
> never coalesce, then each "pipe_buffer" is complete in itself, and that
> simplifies locking enormously.

This got me to thinking about how you can heuristically optimize away
coalescing support and still allow PAGE_SIZE bytes minimum in the
effective buffer.


Method 1)

Simply keep track of how much data is in the effective buffer.  If the
count is < PAGE_SIZE, then coalesce, otherwise, don't.  Very simple, but
means a high traffic pipe might see 'stuttering' between coalescing mode
and fastpath.

Method 2)

It would be interesting to see the performance of a simple heuristic
whereby writes smaller than a given threshold (PAGE_SIZE >> 3?) are
coalesced with the previous write iff the previous write was also less
than the threshold and there is room for the new write in the given page.

With this heuristic, the minimum size of the effective buffer is reached
only if the sequence consists of alternating writes of threshold (T) bytes
and 1 byte.

If 'number of pages in circular buffer' (C) is even, this produces a
minimum of (T*C + C)/2 bytes in the effective buffer.  We can figure out
a minimum T value for a given page count C by isolating T from:

(T*C + C)/2 = PAGE_SIZE

which is:

T = (2 * PAGE_SIZE / C) - 1

So with the assumption that we only coalesce if

- the previous write count was < T
- the current write count is < T
- there is room for the current write count in the current page

and if we set C = 16, we can safely let T = (PAGE_SIZE >> 3) - 1 and the
minimum effective buffer is guaranteed to be at least PAGE_SIZE bytes.
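
In write-path terms the check would look something like this (field names
follow the patch excerpt at the top of the thread where possible; "nrbufs"
and the "bufs[]" array are my guesses):

	#define COALESCE_THRESHOLD	((PAGE_SIZE >> 3) - 1)	/* T = 511 for C = 16 */

	/* may a write of 'chars' bytes be merged into the newest buffer? */
	static int pipe_may_coalesce(struct pipe_inode_info *info, size_t chars)
	{
		struct pipe_buffer *buf;

		if (!info->nrbufs)
			return 0;	/* empty pipe: nothing to append to */
		buf = info->bufs +
			((info->curbuf + info->nrbufs - 1) & (PIPE_BUFFERS-1));
		return chars < COALESCE_THRESHOLD &&	/* current write is small */
		       buf->len < COALESCE_THRESHOLD &&	/* previous data is small */
		       buf->offset + buf->len + chars <= PAGE_SIZE; /* page has room */
	}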

This should reduce the amount of coalescing work required and thus
reduce the amount of locking required for the case where performance
counts (on x86, the writer would have to write 511 bytes to use the
non-coalescing fast path).

This method is probably better as many writers will be buffering their
writes, and the size also nicely fits with block layer's blocksize.

Just my two cents.

> 
> (The other reason to potentially avoid coalescing is that I think it might
> be useful to allow the "sendmsg()/recvmsg()" interfaces that honour packet
> boundaries. The new pipe code _is_ internally "packetized" after all).
> 
> 				Linus

--
Mike Waychison
Sun Microsystems, Inc.
1 (650) 352-5299 voice
1 (416) 202-8336 voice

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NOTICE:  The opinions expressed in this email are held by me,
and may not represent the views of Sun Microsystems, Inc.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


* Re: Make pipe data structure be a circular list of pages, rather than
  2005-01-07 17:48       ` Linus Torvalds
  2005-01-07 20:59         ` Mike Waychison
@ 2005-01-07 21:59         ` Linus Torvalds
  2005-01-07 22:53           ` Diego Calleja
  2005-01-10 23:23         ` Robert White
  2 siblings, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2005-01-07 21:59 UTC (permalink / raw)
  To: Alan Cox; +Cc: Oleg Nesterov, William Lee Irwin III, Linux Kernel Mailing List



On Fri, 7 Jan 2005, Linus Torvalds wrote:
> 
> I want to highlight this again. The increase in throughput did _not_ come
> from avoiding a copy. It came purely from the fact that we have multiple
> buffers, and thus a throughput-intensive load gets to do several bigger
> chunks before it needs to wait for the recipient. So the increase in
> throughput comes from fewer synchronization points (scheduling and
> locking), not from anything else.

Btw, from limited testing this effect seems to be much more pronounced on
SMP.

I've done some more benchmarking, and on a single-CPU Banias laptop,
bandwidth didn't really change much, and latency goes up measurably
(roughly 2.5us -> 3.0us). In contrast, on a P4 with HT, the latency change
was not noticeable (looks like 10.4us -> 10.7us), since there the locking and
system call costs dominate, and throughput goes up from 370MB/s to
900+MB/s (ie > 100% improvement).

So it's not always a win, but at the same time sometimes it's quite a big
win. We might tune the number of buffers down for the UP case, or
something.

Side note: the UNIX domain socket lmbench numbers are higher than the
memory copy numbers, implying that at least the UNIX domain socket tests
have the data set fit in the L2. That may be the reason why lmbench says 
that sockets outperform pipes.

		Linus


* Re: Make pipe data structure be a circular list of pages, rather than
  2005-01-07 21:59         ` Linus Torvalds
@ 2005-01-07 22:53           ` Diego Calleja
  2005-01-07 23:15             ` Linus Torvalds
  0 siblings, 1 reply; 38+ messages in thread
From: Diego Calleja @ 2005-01-07 22:53 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: alan, oleg, wli, linux-kernel

On Fri, 7 Jan 2005 13:59:53 -0800 (PST), Linus Torvalds <torvalds@osdl.org> wrote:

> Btw, from limited testing this effect seems to be much more pronounced on
> SMP.

I've tried it on a 2xPIII, 512 MB of RAM. I've done kernel compiles... yeah
yeah, I know kernel compiles are not a good benchmark, but since Linux uses
-pipe for gcc... I thought I'd try it to see if it makes a difference
(suggestions about better methods to test this are accepted :). The results
could very well be statistical noise (it looks like it only takes a second less
on average), but I thought it might be interesting.


without the patch:
952.84user 98.64system 8:54.64elapsed 196%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+7318007minor)pagefaults 0swaps
951.78user 98.25system 8:53.71elapsed 196%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+7318009minor)pagefaults 0swaps
954.53user 99.12system 8:53.02elapsed 197%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+7318041minor)pagefaults 0swaps

with it:
951.67user 97.59system 8:53.12elapsed 196%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+7318010minor)pagefaults 0swaps
949.04user 97.68system 8:52.04elapsed 196%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+7318018minor)pagefaults 0swaps
948.40user 97.37system 8:51.48elapsed 196%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+7318011minor)pagefaults 0swaps


* Re: Make pipe data structure be a circular list of pages, rather than
  2005-01-07 22:53           ` Diego Calleja
@ 2005-01-07 23:15             ` Linus Torvalds
  0 siblings, 0 replies; 38+ messages in thread
From: Linus Torvalds @ 2005-01-07 23:15 UTC (permalink / raw)
  To: Diego Calleja; +Cc: alan, oleg, wli, linux-kernel



On Fri, 7 Jan 2005, Diego Calleja wrote:
> 
> I've tried it on a 2xPIII, 512 MB of RAM. I've done kernel compiles... yeah
> yeah, I know kernel compiles are not a good benchmark, but since Linux uses
> -pipe for gcc... I thought I'd try it to see if it makes a difference
> (suggestions about better methods to test this are accepted :). The results
> could very well be statistical noise (it looks like it only takes a second less
> on average), but I thought it might be interesting.

Hmm.. I'm gratified, but a bit suspicious. I'd not have expected it to be
noticeable on a kernel compile (that's a 1%+ change in total system time,
which considering all the file ops etc that a compile does means that it
would have to have made quite a big difference for the pipe case).

So your data is intriguing, but it looks a bit _too_ good to be true ;)

Random things like page cache placement (and thus cache effects) might 
also account for it. But hey, if true, that's quite an improvement.

(Of course, anything the kernel does tends to be totally drowned out by 
all the actual user-level work gcc does, so in the _big_ picture the 1% 
change in system time means almost nothing. So even "good news" doesn't 
actually matter in this case).

		Linus


* Re: Make pipe data structure be a circular list of pages, rather than
  2005-01-07 20:59         ` Mike Waychison
@ 2005-01-07 23:46           ` Chris Friesen
  2005-01-08 21:38             ` Lee Revell
  2005-01-14 10:15             ` Peter Chubb
  0 siblings, 2 replies; 38+ messages in thread
From: Chris Friesen @ 2005-01-07 23:46 UTC (permalink / raw)
  To: Mike Waychison
  Cc: Linus Torvalds, Alan Cox, Oleg Nesterov, William Lee Irwin III,
	Linux Kernel Mailing List

Mike Waychison wrote:

> This got me to thinking about how you can heuristically optimize away
> coalescing support and still allow PAGE_SIZE bytes minimum in the
> effective buffer.

While coalescing may be a win in some cases, there should also be some 
way to tell the kernel to NOT coalesce, to handle the case where you 
want minimum latency at the cost of some throughput.

Chris


* Re: Make pipe data structure be a circular list of pages, rather than
  2005-01-07 17:23   ` Linus Torvalds
@ 2005-01-08 18:25     ` Hugh Dickins
  2005-01-08 18:54       ` Linus Torvalds
  0 siblings, 1 reply; 38+ messages in thread
From: Hugh Dickins @ 2005-01-08 18:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Cox, Oleg Nesterov, William Lee Irwin III,
	Linux Kernel Mailing List

On Fri, 7 Jan 2005, Linus Torvalds wrote:
> 
> Hmm.. Are there really any other guarantees than the atomicity guarantee 
> for a PIPE_BUF transfer? I don't see any in SuS..
> 
> That said, existing practice obviously always trumps paper standards, and 
> yes, coalescing is possible.

Would a kernel build with "make -j18" count as existing practice?
Immediately hangs for me...

Hugh



* Re: Make pipe data structure be a circular list of pages, rather than
  2005-01-08 18:25     ` Hugh Dickins
@ 2005-01-08 18:54       ` Linus Torvalds
  0 siblings, 0 replies; 38+ messages in thread
From: Linus Torvalds @ 2005-01-08 18:54 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Alan Cox, Oleg Nesterov, William Lee Irwin III,
	Linux Kernel Mailing List



On Sat, 8 Jan 2005, Hugh Dickins wrote:
> 
> Would a kernel build with "make -j18" count as existing practice?

Absolutely.

Looks like it does a single-byte write of '+' for each process started 
(and I assume it will do a '-' for each process that exits ;)

Oh, well. So much for my hope to avoid coalescing.

		Linus


* Re: Make pipe data structure be a circular list of pages, rather than
  2005-01-07 23:46           ` Chris Friesen
@ 2005-01-08 21:38             ` Lee Revell
  2005-01-08 21:51               ` Linus Torvalds
  2005-01-14 10:15             ` Peter Chubb
  1 sibling, 1 reply; 38+ messages in thread
From: Lee Revell @ 2005-01-08 21:38 UTC (permalink / raw)
  To: Chris Friesen
  Cc: Mike Waychison, Linus Torvalds, Alan Cox, Oleg Nesterov,
	William Lee Irwin III, Linux Kernel Mailing List

On Fri, 2005-01-07 at 17:46 -0600, Chris Friesen wrote:
> Mike Waychison wrote:
> 
> > This got me to thinking about how you can heuristically optimize away
> > coalescing support and still allow PAGE_SIZE bytes minimum in the
> > effective buffer.
> 
> While coalescing may be a win in some cases, there should also be some 
> way to tell the kernel to NOT coalesce, to handle the case where you 
> want minimum latency at the cost of some throughput.

Many latency critical apps use (tmpfs mounted) FIFO's for IPC; the Linux
FIFO being one of the fastest known IPC mechanisms.  Each client in the
JACK (http://jackit.sf.net) graph wakes the next one by writing a single
byte to a FIFO.  Ardour's GUI, control, and audio threads interact via a
similar mechanism.  How would you expect this change to impact the inter
thread wakeup latency?  It's confusing when people say "performance",
meaning "increased throughput albeit with more latency".  For many
people that's a regression.

These apps *certainly* care about performance, they just don't define it
in terms of throughput.

And yes we do know futexes are the right tool for this but they weren't
available at the time and aren't available on 2.4.

Lee



* Re: Make pipe data structure be a circular list of pages, rather than
  2005-01-08 21:38             ` Lee Revell
@ 2005-01-08 21:51               ` Linus Torvalds
  2005-01-08 22:02                 ` Lee Revell
                                   ` (2 more replies)
  0 siblings, 3 replies; 38+ messages in thread
From: Linus Torvalds @ 2005-01-08 21:51 UTC (permalink / raw)
  To: Lee Revell
  Cc: Chris Friesen, Mike Waychison, Alan Cox, Oleg Nesterov,
	William Lee Irwin III, Linux Kernel Mailing List



On Sat, 8 Jan 2005, Lee Revell wrote:
> 
> Many latency critical apps use (tmpfs mounted) FIFO's for IPC; the Linux
> FIFO being one of the fastest known IPC mechanisms.  Each client in the
> JACK (http://jackit.sf.net) graph wakes the next one by writing a single
> byte to a FIFO.  Ardour's GUI, control, and audio threads interact via a
> similar mechanism.  How would you expect this change to impact the inter
> thread wakeup latency?  It's confusing when people say "performance",
> meaning "increased throughput albeit with more latency".  For many
> people that's a regression.

I posted the performance numbers in the thread already, and with every
single throughput number I also talked about what the latency difference
was. So quite frankly, if you were confused, I suspect it was because you
didn't read them. Tssk, tssk.

Short and sweet: the latency changes are in the noise for SMP, but can be
seen on UP. I'll look at it a bit more:  since I had to add the coalescing
code anyway, I might also decide to re-use a buffer page rather than free
it immediately, since that may help latency for small writes.

		Linus


* Re: Make pipe data structure be a circular list of pages, rather than
  2005-01-08 21:51               ` Linus Torvalds
@ 2005-01-08 22:02                 ` Lee Revell
  2005-01-08 22:29                 ` Davide Libenzi
  2005-01-09  4:07                 ` Linus Torvalds
  2 siblings, 0 replies; 38+ messages in thread
From: Lee Revell @ 2005-01-08 22:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Chris Friesen, Mike Waychison, Alan Cox, Oleg Nesterov,
	William Lee Irwin III, Linux Kernel Mailing List

On Sat, 2005-01-08 at 13:51 -0800, Linus Torvalds wrote:
> 
> On Sat, 8 Jan 2005, Lee Revell wrote:
> > 
> > Many latency critical apps use (tmpfs mounted) FIFO's for IPC; the Linux
> > FIFO being one of the fastest known IPC mechanisms.  Each client in the
> > JACK (http://jackit.sf.net) graph wakes the next one by writing a single
> > byte to a FIFO.  Ardour's GUI, control, and audio threads interact via a
> > similar mechanism.  How would you expect this change to impact the inter
> > thread wakeup latency?  It's confusing when people say "performance",
> > meaning "increased throughput albeit with more latency".  For many
> > people that's a regression.
> 
> I posted the performance numbers in the thread already, and with every
> single throughput number I also talked about what the latency difference
> was. So quite frankly, if you were confused, I suspect it was because you
> didn't read them. Tssk, tssk.
> 

How embarrassing, I found that message immediately after hitting send.
Looks like a big win.  I'll try the JACK stress test on it just to make
sure.

Lee




* Re: Make pipe data structure be a circular list of pages, rather than
  2005-01-08 21:51               ` Linus Torvalds
  2005-01-08 22:02                 ` Lee Revell
@ 2005-01-08 22:29                 ` Davide Libenzi
  2005-01-09  4:07                 ` Linus Torvalds
  2 siblings, 0 replies; 38+ messages in thread
From: Davide Libenzi @ 2005-01-08 22:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Lee Revell, Chris Friesen, Mike Waychison, Alan Cox,
	Oleg Nesterov, William Lee Irwin III, Linux Kernel Mailing List

On Sat, 8 Jan 2005, Linus Torvalds wrote:

> Short and sweet: the latency changes are in the noise for SMP, but can be
> seen on UP. I'll look at it a bit more:  since I had to add the coalescing
> code anyway, I might also decide to re-use a buffer page rather than free
> it immediately, since that may help latency for small writes.

Yeah, I noticed you clean the page's magazine down to the bottom. Maybe 
keeping one (or more) page in there might help something, due to cache 
improvements and alloc/free page savings.


- Davide



* Re: Make pipe data structure be a circular list of pages, rather than
  2005-01-08 21:51               ` Linus Torvalds
  2005-01-08 22:02                 ` Lee Revell
  2005-01-08 22:29                 ` Davide Libenzi
@ 2005-01-09  4:07                 ` Linus Torvalds
  2005-01-09 23:19                   ` Davide Libenzi
  2 siblings, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2005-01-09  4:07 UTC (permalink / raw)
  To: Lee Revell
  Cc: Chris Friesen, Mike Waychison, Alan Cox, Oleg Nesterov,
	William Lee Irwin III, Linux Kernel Mailing List



On Sat, 8 Jan 2005, Linus Torvalds wrote:
> 
> Short and sweet: the latency changes are in the noise for SMP, but can be
> seen on UP. I'll look at it a bit more:  since I had to add the coalescing
> code anyway, I might also decide to re-use a buffer page rather than free
> it immediately, since that may help latency for small writes.

Side note: the SMP numbers seem to fluctuate wildly making any
benchmarking a bit suspect. They probably depend very much on just how
things get scheduled, and on inter-CPU cache behaviour.

In contrast, the UP numbers are a lot more reliable, so I'll try to make
sure that everything looks good on UP - which definitely means that I'll
work to make sure that latency looks good too.

		Linus


* Re: Make pipe data structure be a circular list of pages, rather than
  2005-01-09  4:07                 ` Linus Torvalds
@ 2005-01-09 23:19                   ` Davide Libenzi
  0 siblings, 0 replies; 38+ messages in thread
From: Davide Libenzi @ 2005-01-09 23:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Lee Revell, Chris Friesen, Mike Waychison, Alan Cox,
	Oleg Nesterov, William Lee Irwin III, Linux Kernel Mailing List

On Sat, 8 Jan 2005, Linus Torvalds wrote:

> Side note: the SMP numbers seem to fluctuate wildly making any
> benchmarking a bit suspect. They probably depend very much on just how
> things get scheduled, and on inter-CPU cache behaviour.

Lmbench results on SMP are bogus. OTOH you can easily make a more 
deterministic SMP pipe bench by making the two tasks affine to two 
different CPUs.
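
For instance (Linux-specific, just a sketch): pin the producer to CPU 0 and
the consumer to CPU 1 before starting the benchmark loop:

	#define _GNU_SOURCE
	#include <sched.h>

	static int pin_to_cpu(int cpu)
	{
		cpu_set_t set;

		CPU_ZERO(&set);
		CPU_SET(cpu, &set);
		/* pid 0 means "the calling task" */
		return sched_setaffinity(0, sizeof(set), &set);
	}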


- Davide



* RE: Make pipe data structure be a circular list of pages, rather than
  2005-01-07 17:48       ` Linus Torvalds
  2005-01-07 20:59         ` Mike Waychison
  2005-01-07 21:59         ` Linus Torvalds
@ 2005-01-10 23:23         ` Robert White
  2 siblings, 0 replies; 38+ messages in thread
From: Robert White @ 2005-01-10 23:23 UTC (permalink / raw)
  To: 'Linus Torvalds', 'Alan Cox'
  Cc: 'Oleg Nesterov', 'William Lee Irwin III',
	'Linux Kernel Mailing List'

Howdy,

This entire discussion of the pipe and splice() is very much the same (issue,
problem, approach) as the STREAMS interface from UNIX SVR4, which had the pipe device
as its trivial example device.  [This is all from vague recall, so forgive the
sketchiness 8-)]

[Note: check out this STREAMS programmers guide
http://docs-pdf.sun.com/802-1933/802-1933.pdf from Sun's documentation site.]

A Message was a fixed-size Message structure with a linked list of Buffers attached
to it. The Buffers were a structure-and-payload arrangement, roughly as sketched
below.  The whole arrangement could be zero-copy or self-condensing as needed.
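
Roughly (simplified from memory of the SVR4 DDI/DKI, so don't trust the exact
field names):

	struct datab {			/* the shared payload descriptor */
		unsigned char	*db_base;	/* start of the data buffer */
		unsigned char	*db_lim;	/* one past the end */
		unsigned char	 db_ref;	/* refcount - what made zero-copy
						   sharing possible */
		unsigned char	 db_type;	/* M_DATA, M_PROTO, M_IOCTL, ... */
	};

	struct msgb {			/* one message block */
		struct msgb	*b_next;	/* next message on the queue */
		struct msgb	*b_cont;	/* next block of this message */
		unsigned char	*b_rptr;	/* first unread byte */
		unsigned char	*b_wptr;	/* first free byte */
		struct datab	*b_datap;	/* the payload it points into */
	};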
  
In Linux parlance each file descriptor was basically a reference to a pair of linked
list headers (one for messages available for reading and one to accept data from
writes), a pointer to the driver below the head, and an optional tasklet.  The driver
itself was virtually the same data structure.

STREAMS was a little overdone and could have used some of the Linuxisms like
tasklets (instead of "the STREAMS Scheduler"), and the bulk of the nonsense for
getting Message and Buffer and STREAMS Head structures just screams Slab Cache, but
it has a lot of interesting bits.

The reason I bring it up is four-fold.

1) They had push and pop ioctl operations on the streams so that you could actually
interpose translators and multiplexors in the streams, so you could use the design to
do the splice() operation directly by building something that basically looked like:

fd0a   fd0b
    mux
 pipe-driver
    fd1

basically by pushing the mux into the stream head and then connecting the second
and subsequent fds to the mux object.
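
In user-space terms that plumbing looked roughly like this (the device names
are invented for the example):

	#include <stropts.h>
	#include <fcntl.h>
	#include <unistd.h>

	int build_mux(void)
	{
		int upper = open("/dev/mymux", O_RDWR);	/* stream head + mux driver */
		int lower = open("/dev/pipedrv", O_RDWR);	/* stream to feed the mux */

		if (upper < 0 || lower < 0)
			return -1;
		/* I_LINK hooks the lower stream in underneath the mux driver */
		return ioctl(upper, I_LINK, lower);
	}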

2) The Messages had type fields so you could do interesting things like pass an open
file descriptor to another process across a pipe because the special "this is an FD"
message could arrive unmolested and trigger the necessary file table manipulations
at the receiving end.
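
That was the I_SENDFD/I_RECVFD pair, roughly (again from memory, and the
pipe_fd/fd_to_pass names are just for illustration):

	/* sender */
	ioctl(pipe_fd, I_SENDFD, fd_to_pass);

	/* receiver */
	struct strrecvfd recvfd;
	ioctl(pipe_fd, I_RECVFD, &recvfd);
	/* recvfd.fd is now a new descriptor for the same open file */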

3) The optional tasklet (equivalent) at the various points in the STREAM could do
things like buffer consolidation, which in the Linux model could compete with actual
data consumption fairly easily/cheaply.

4) Some of the things that I have run into dealing with line disciplines and
streaming data through character devices (and USB) could gain a lot from the
Message/Buffer/STREAM paradigm.  [e.g. the fixed-size flip buffers in n_tty would be
rendered moot by having the linked lists above and below the line discipline.]

Anyway, thick as it is, I think the pdf referenced above might be interesting reading
vis-à-vis the pipe() discussion.

Rob White,
Casabyte, Inc.




* Re: Make pipe data structure be a circular list of pages, rather than
  2005-01-07 23:46           ` Chris Friesen
  2005-01-08 21:38             ` Lee Revell
@ 2005-01-14 10:15             ` Peter Chubb
  1 sibling, 0 replies; 38+ messages in thread
From: Peter Chubb @ 2005-01-14 10:15 UTC (permalink / raw)
  To: Chris Friesen
  Cc: Mike Waychison, Linus Torvalds, Alan Cox, Oleg Nesterov,
	William Lee Irwin III, Linux Kernel Mailing List

>>>>> "Chris" == Chris Friesen <cfriesen@nortelnetworks.com> writes:

Chris> Mike Waychison wrote:
>> This got me to thinking about how you can heuristically optimize
>> away coalescing support and still allow PAGE_SIZE bytes minimum in
>> the effective buffer.

Chris> While coalescing may be a win in some cases, there should also
Chris> be some way to tell the kernel to NOT coalesce, to handle the
Chris> case where you want minimum latency at the cost of some
Chris> throughput.

SysVr4 pipes used to have two modes: a `legacy' mode that coalesced
data, and a `message' mode that preserved message boundaries.  Seems
to me that we could be about to have the latter in Linux...

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*




* Re: Make pipe data structure be a circular list of pages, rather than
  2005-01-07 16:17 ` Linus Torvalds
  2005-01-07 16:06   ` Alan Cox
  2005-01-07 16:39   ` Davide Libenzi
@ 2005-08-18  6:07   ` Coywolf Qi Hunt
  2 siblings, 0 replies; 38+ messages in thread
From: Coywolf Qi Hunt @ 2005-08-18  6:07 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Oleg Nesterov, William Lee Irwin III, linux-kernel

On 1/8/05, Linus Torvalds <torvalds@osdl.org> wrote:
> 
> 
> On Fri, 7 Jan 2005, Oleg Nesterov wrote:
> >
> > If i understand this patch correctly, then this code
> >
> >       for (;;)
> >               write(pipe_fd, &byte, 1);
> >
> > will block after writing PIPE_BUFFERS == 16 characters, no?
> > And pipe_inode_info will use 64K to hold 16 bytes!
> 
> Yes.
> 
> > Is it ok?
> 
> If you want throughput, don't do single-byte writes. Obviously we _could_
> do coalescing, but there's a reason I'd prefer to avoid it. So I consider
> it a "don't do that then", and I'll wait to see if people do. I can't
> think of anything that cares about performance that does that anyway:
> becuase system calls are reasonably expensive regardless, anybody who
> cares at all about performance will have buffered things up in user space.
> 
> > May be it make sense to add data to the last allocated page
> > until buf->len > PAGE_SIZE ?
> 
> The reason I don't want to coalesce is that I don't ever want to modify a
> page that is on a pipe buffer (well, at least not through the pipe buffer
> - it might get modified some other way). Why? Because the long-term plan
> for pipe-buffers is to allow the data to come from _other_ sources than
> just a user space copy. For example, it might be a page directly from the
> page cache, or a partial page that contains the data part of an skb that
> just came in off the network.
> 
> With this organization, a pipe ends up being able to act as a "conduit"
> for pretty much any data, including some high-bandwidth things like video
> streams, where you really _really_ don't want to copy the data. So the
> next stage is:
> 
>  - allow the buffer size to be set dynamically per-pipe (probably only
>    increased by root, due to obvious issues, although a per-user limit is
>    not out of the question - it's just a "mlock" in kernel buffer space,
>    after all)
>  - add per-"struct pipe_buffer" ops pointer to a structure with
>    operation function pointers: "release()", "wait_for_ready()", "poll()"
>    (and possibly "merge()", if we want to coalesce things, although I
>    really hope we won't need to)
>  - add a "splice(fd, fd)" system call that copies pages (by incrementing
>    their reference count, not by copying the data!) from an input source
>    to the pipe, or from a pipe to an output.
>  - add a "tee(in, out1, out2)" system call that duplicates the pages
>    (again, incrementing their reference count, not copying the data) from
>    one pipe to two other pipes.
> 
> All of the above is basically a few lines of code (the "splice()" thing
> requires some help from drivers/networking/pagecache etc, but it's not
> complex help, and not everybody needs to do it - I'll start off with
> _just_ a generic page cache helper to get the thing rolling, that's easy).
> 
> Now, imagine using the above in a media server, for example. Let's say
> that a year or two has passed, so that the video drivers have been updated
> to be able to do the splice thing, and what can you do? You can:
> 
>  - splice from the (mpeg or whatever - let's just assume that the video
>    input is either digital or does the encoding on its own - like they
>    pretty much all do) video input into a pipe (remember: no copies - the
>    video input will just DMA directly into memory, and splice will just
>    set up the pages in the pipe buffer)
>  - tee that pipe to split it up
>  - splice one end to a file (ie "save the compressed stream to disk")
>  - splice the other end to a real-time video decoder window for your
>    real-time viewing pleasure.
> 
> That's the plan, at least. I think it makes sense, and the thing that
> convinced me about it was (a) how simple all of this seems to be
> implementation-wise (modulo details - but there are no "conceptually
> complex" parts: no horrid asynchronous interfaces, no questions about
> hotw to buffer things, no direct user access to pages that might
> partially contain protected data etc etc) and (b) it's so UNIXy. If
> there's something that says "the UNIX way", it's pipes between entities
> that act on the data.
> 
> For example, let's say that you wanted to serve a file from disk (or any
> other source) with a header to another program (or to a TCP connection, or
> to whatever - it's just a file descriptor). You'd do
> 
>         fd = create_pipe_to_destination();
> 
>         input = open("filename", O_RDONLY);
>         write(fd, "header goes here", length_of_header);
>         for (;;) {
>                 ssize_t err;
>                 err = splice(input, fd,
>                         ~0 /* maxlen */,
>                         0 /* msg flags - think "sendmgsg" */);
>                 if (err > 0)
>                         continue;
>                 if (!err) /* EOF */
>                         break;
>                 .. handle input errors here ..
>         }
> 
> (obviously, if this is a real server, this would likely all be in a
> select/epoll loop, but that just gets too hard to describe consicely, so
> I'm showing the single-threaded simple version).
> 
> Further, this also ends up giving a nice pollable interface to regular
> files too: just splice from the file (at any offset) into a pipe, and poll
> on the result.  The "splice()" will just do the page cache operations and
> start the IO if necessary, the "poll()" will wait for the first page to be
> actually available. All _trivially_ done with the "struct pipe_buffer"
> operations.
> 
> So the above kind of "send a file to another destination" should
> automatically work very naturally in any poll loop: instead of filling a
> writable pipe with a "write()", you just fill it with "splice()" instead
> (and you can read it with a 'read()' or you just splice it to somewhere
> else, or you tee() it to two destinations....).
> 
> I think the media server kind of environment is the one most easily
> explained, where you have potentially tons of data that the server process
> really never actually wants to _see_ - it just wants to push it on to
> another process or connection or save it to a log-file or something. But
> as with regular pipes, it's not a specialized interface: it really is just
> a channel of communication.
> 
> The difference being that a historical UNIX pipe is always a channel
> between two process spaces (ie you can only fill it and empty it into the
> process address space), and the _only_ thing I'm trying to do is to have
> it be able to be a channel between two different file descriptors too. You
> still need the process to "control" the channel, but the data doesn't have
> to touch the address space any more.
> 
> Think of all the servers or other processes that really don't care about
> the data. Think of something as simple as encrypting a file, for example.
> Imagine that you have hardware encryption support that does DMA from the
> source, and writes the results using DMA. I think it's pretty obvious how
> you'd connect this up using pipes and two splices (one for the input, one
> for the output).
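
A rough userspace sketch of that encryption setup, for concreteness.  This is
only an illustration of the shape being described: the device name, and the
assumption that the engine (driver or helper process) consumes one pipe and
fills the other by DMA, are invented here; splice() is the four-argument form
used earlier in this mail, not an existing syscall.

	int in_pipe[2], out_pipe[2];
	int src = open("cleartext",  O_RDONLY);
	int dst = open("ciphertext", O_WRONLY | O_CREAT, 0600);

	pipe(in_pipe);
	pipe(out_pipe);
	/* hand in_pipe[0] and out_pipe[1] to the hypothetical engine here */

	for (;;) {
		/* input splice: file pages go into the pipe, no copy */
		if (splice(src, in_pipe[1], ~0, 0) <= 0)
			break;
		/* output splice: whatever the engine produced goes to disk */
		splice(out_pipe[0], dst, ~0, 0);
	}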
> 
> And notice how _flexible_ it is (both the input and the output can be any
> kind of fd you want - the pipes end up doing both the "conversion"  into a
> common format of "list of (possibly partial) pages" and the buffering,
> which is why the different "engines" don't need to care where the data
> comes from, or where it goes. So while you can use it to encrypt a file
> into another file, you could equally easily use it for something like
> 
>         tar cvf - my_source_tree | hw_engine_encrypt | splice_to_network
> 
> and the whole pipeline would not have a _single_ actual data copy: the
> pipes are channels.
> 
> Of course, since it's a pipe, the nice thing is that people don't have to
> use "splice()" to access it - the above pipeline has a perfectly regular
> "tar" process that probably just does regular writes. You can have a
> process that does "splice()" to fill the pipe, and the other end is a
> normal thing that just uses regular "read()" and doesn't even _know_ that
> the pipe is using new-fangled technology to be filled.
> 
> I'm clearly enamoured with this concept. I think it's one of those few
> "RightThing(tm)" that doesn't come along all that often. I don't know of
> anybody else doing this, and I think it's both useful and clever. If you
> now prove me wrong, I'll hate you forever ;)
> 
>                 Linus


Actually, this is where L4 differs from a traditional micro-kernel, IMHO. Isn't it?

I have a long-term plan to add some messaging subsystem based on this.

-- 
Coywolf Qi Hunt
http://ahbl.org/~coywolf/

^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: Make pipe data structure be a circular list of pages, rather than
@ 2005-01-20  2:14 Robert White
  0 siblings, 0 replies; 38+ messages in thread
From: Robert White @ 2005-01-20  2:14 UTC (permalink / raw)
  To: 'Robert White', linux, linux-kernel; +Cc: lm, torvalds



P.S.  Not to reply to myself... 8-)  I took some liberties with that description.
STREAMS doesn't, to the best of my knowledge, have the cleanup hook stuff.  That was
me folding your issues (direct PCI device buffers etc) from this thread on top of the
basic skeleton of STREAMS to broaden the "impedance matching" possibilities.

(No really, back to lurking... 8-)

Rob White,
Casabyte, Inc.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: Make pipe data structure be a circular list of pages, rather than
  2005-01-19 21:12 ` Make pipe data structure be a circular list of pages, rather than linux
@ 2005-01-20  2:06   ` Robert White
  0 siblings, 0 replies; 38+ messages in thread
From: Robert White @ 2005-01-20  2:06 UTC (permalink / raw)
  To: linux, linux-kernel; +Cc: lm, torvalds

Howdy,

The below (sorry about the Outlook style includes, work _requires_ outlook 8-() is
why I previously mentioned the STREAMS implementation from SVR4 (and so Solaris etc).

Every interface point has a pointer to its peer, a pair of queues and a put()
routine.  Every message is in two or more pieces, the message header and the message
buffer/buffers (read real data pages).

So the "participating interface" requirement is really that the device be able to
expose a data structure consisting of four pointers that its peer can point to.

The "message types" for the messages passing around are all contained in the message
header structures.  These headers have all the reference counting and
disposal/cleanup hook procedures (pointers) along with the references to the actual
data.  This means that you could have a stock pool (from the slab cache) of these
headers for moving plain data around, and "magical devices" could use specialty
message headers with the necessary disposal elements or necessary "extra data" in
them.

All the devices in any kind of chain/mux/pipe/whatever arrangement pass messages to
their peers by invoking peer->put(message) [in OOD parlance; really something like
(*peer->put)(peer, message) in straight C].  And modules can be inserted into the
chain/mux/pipe/whatever at runtime. The target put() procedure decides to copy or
reference-increment the message and put it on its receive queue or process it
immediately.  At no time does any member of a STREAMS structure need to know any
details about the workings of its peers.

In the abstract, most communications would have message headers referring to blocks
of normal memory.  Some devices could send out messages that reference their PCI
mapped memory or whatever, and all _that_ device would have to do to mesh with the
system is to replace the cleanup function pointer with one appropriate to its need.
When the message has gone as far as it is going to go, its reference count drops to
zero and it gets cleaned up.

Unget can be done with some queue insertion/manipulation.  There is no pull() as
every target is either hot-processed or queued.  Flow control takes place when a
put() procedure returns a no-space-for-message indication.

If you absolutely _must_ know if the data was actually consumed by the target (and I
could argue that this is virtually never the case as things get lost in error-induced
pipe tear-down all the time) it would only be necessary to add a "cursor" member to
the message header and, at destruction time, see if the cursor had been advanced to
"use up" the data.

One could even have message headers that know they are dealing with

Now the message header is Linus' "impedance matcher" interface; transport is really
just about knowing your peer, calling put(), and doing reference counting.
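
To make that concrete, here is a rough C sketch of the kind of message header and
endpoint being described.  All names and fields are invented for illustration -
this is the shape of the idea, not SVR4's actual mblk_t/queue_t declarations:

	/* Invented names; a sketch of the idea only. */
	struct msg_buf {
		void	*data;				/* real data: page, PCI window, ... */
		size_t	len;
		int	refcnt;				/* shared by all headers pointing here */
		void	(*cleanup)(struct msg_buf *);	/* device-specific disposal hook */
	};

	struct msg_hdr {
		struct msg_hdr	*next;			/* queue linkage */
		int		type;			/* plain data, control, ... */
		size_t		cursor;			/* how far the consumer got */
		struct msg_buf	*buf;
	};

	struct endpoint {
		struct endpoint	*peer;
		struct msg_hdr	*q_head, *q_tail;	/* receive queue */
		/* returns 0, or a no-space indication for flow control */
		int (*put)(struct endpoint *self, struct msg_hdr *m);
	};

	/* A sender never looks inside its peer: it just hands the message over. */
	static int pass_along(struct endpoint *ep, struct msg_hdr *m)
	{
		return ep->peer->put(ep->peer, m);
	}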

There were a lot of (IMHO) mistakes in the STREAMS design, but the programming
document (previously referenced in this thread) covers a lot of the same message
typing and management questions that are being raised here.

While I can see that the fit between Linux and SVR4 STREAMS wouldn't be identical, a
good bit of the pipe-fitting (to pun 8-) and operation definitions would be.

(I'll go back to lurking now... 8-)

Rob White,
Casabyte, Inc.


-----Original Message-----
From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel-owner@vger.kernel.org]
On Behalf Of linux@horizon.com
lm@bitmover.com wrote:

The main infrastructure hassle you need to support this *universally*
is the unget() on "character devices" like pipes and network sockets.
Ideally, it would be some generic buffer front end that could be used
by the device for normal data as well as the special case.
 
Ooh.  Need to think.  If there's a -EIO problem on one of the file
descriptors, how does the caller know which one?  That's an argument for
separate pull and push (although the splice() library routine still
has the problem).  Any suggestions?  Does userland need to fall
back on read()/write() for a byte?




^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Make pipe data structure be a circular list of pages, rather than
  2005-01-19 19:01           ` Larry McVoy
@ 2005-01-20  0:01             ` Linus Torvalds
  0 siblings, 0 replies; 38+ messages in thread
From: Linus Torvalds @ 2005-01-20  0:01 UTC (permalink / raw)
  To: Larry McVoy; +Cc: William Lee Irwin III, Linux Kernel Mailing List



On Wed, 19 Jan 2005, Larry McVoy wrote:
> 		
> See how more generic that is?  Pipes are just one source/sink but 
> everything else needs to play as well.  How are you going to 
> implement a socket sending data to a file without the VM nonsense
> and the extra copies?

Heh. This is exactly why I originally told you it is undoable - because 
your model implies an O(M x N) complexity, where everybody has to agree on 
how to push things to everybody else. 

The thing that made me so excited about the pipes is exactly that the 
pipe is what finally made me see how to actually _implement_ kind of what 
you wanted. Instead of an "everything to everything" thing that has O(n²) 
complexity, it's an "everything to a pipe", where the pipe ends up being 
the common medium for things to talk to each other.

So you have an extra level of indirection, but it also means that now the
interface complexity is purely linear - everybody needs to know how to
accept data from (and push data to) a pipe, but they do _not_ need to know
how to do it to anything else.

I agree that it's not what you described, and one of my first emails 
describing it even said so, but I now notice that you weren't cc'd on 
that one (I tried to cc you on some others, but not the whole discussion):

   "...

    And to be honest, my pipes aren't really what Larry's "splice()" was: his 
    notion was more of a direct "fd->fd" thing, with nothing in between. That 
    was what I still consider unworkable. 

    The problem with splicing two file descriptors together is that it needs
    for them to agree on the interchange format, and these things tend to be
    very expensive from an interface angle (ie the format has to _exactly_
    match the "impedance" at both ends, since any conversion would make the
    whole exercise pointless - you'd be better off just doing a read and a
    write).

    ..."

so what makes the pipes special is that they are the "impedance matcher". 

For example, absolutely _none_ of the regular file IO stuff wants to work 
with page fragments: it's all very much about whole-page stuff in the page 
cache, and that is how it has to be, considering the semantics of a 
coherent mmap etc. 

So your notion of having all the subsystems natively use a common format 
just never struck me as workable, because they very fundamentally do _not_ 
want work with the same kind of data structures, and forcing them to just 
didn't seem valid.  So I ignored splice as unworkable.

So think of my "splice" as giving you credit through the name, but at the
same time it definitely is something different (and in my opinion a lot
more workable). If you object to the name, I can call it "wire()" instead,
which was somebody else's suggestion, but I thought you might appreciate 
the name even if it isn't _really_ what you want ;)

		Linus

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Make pipe data structure be a circular list of pages, rather than
  2005-01-16  2:59 Make pipe data structure be a circular list of pages, rather Linus Torvalds
@ 2005-01-19 21:12 ` linux
  2005-01-20  2:06   ` Robert White
  0 siblings, 1 reply; 38+ messages in thread
From: linux @ 2005-01-19 21:12 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux, lm, torvalds

lm@bitmover.com wrote:
> You seem to have misunderstood the original proposal, it had little to do
> with file descriptors.  The idea was that different subsystems in the OS
> export pull() and push() interfaces and you use them.  The file decriptors
> are only involved if you provide them with those interfaces(which you
> would, it makes sense).  You are hung up on the pipe idea, the idea I
> see in my head is far more generic.  Anything can play and you don't
> need a pipe at all, you need 

I was fantasizing about more generality as well.  In particular, my
original fantasy allowed data to, in theory and with compatible devices,
be read from one PCI device, passed through a series of pipes, and
written to another without ever hitting main memory - only one PCI-PCI
DMA operation performed.

A slightly more common case would be zero-copy, where data gets DMAed
from the source into memory and from memory to the destination.
That's roughly Larry's pull/push model.

The direct DMA case requires buffer memory on one of the two cards.
(And would possibly be a fruitful source of hardware bugs, since I
suspect that Windows Doesn't Do That.)

Larry has the "full-page gift" optimization, which could in theory allow
data to be "renamed" straight into the page cache.  However, the page also
has to be properly aligned and not in some awkward highmem address space.
I'm not currently convinced that this would happen often enough to be
worth the extra implementation hair, but feel free to argue otherwise.

(And Larry, what's the "loan" bit for?   When is loan != !gift ?)


The big gotcha, as Larry's original paper properly points out, is handling
write errors.  We need some sort of "unpull" operation to put data back
if the destination can't accept it.  Otherwise, what do you return from
splice()?  If the source is seekable, that's easy, and a pipe isn't much
harder, but for a general character device, we need a bit of help.

The way I handle this in user-level software, to connect modules that
provide data buffering, is to split "pull" into two operations.
"Show me some buffered data" and "Consume some buffered data".
The first returns a buffer pointer (to a const buffer) and length.
(The length must be non-zero except at EOF, but may be 1 byte.)

The second advances the buffer pointer.  The advance distance must be
no more than the length returned previously, but may be less.

In typical single-threaded code, I allow not calling the advance function
or calling it multiple times, but they're typically called 1:1, and
requiring that would give you a good place to do locking.  A character
device, network stream, or the like, would acquire an exclusive lock.
A block device or file would not need to (or could make it a shared lock
or refcount).


The same technique can be used when writing data to a module that does
buffering: "Give me some buffer space" and "Okay, I filled some part of
it in."  In some devices, the latter call can fail, and the writer has
to be able to cope with that.
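
For what it's worth, a bare-bones C sketch of that two-call style.  The names
(peek/consume on the read side, reserve/commit on the write side) are mine, not
from any existing driver, and memcpy stands in for whatever DMA or page-stealing
a real implementation would do:

	/* Read side: "show me some buffered data", then "consume some of it". */
	struct src_ops {
		/* Expose a pointer to contiguous buffered data and its length.
		 * *len is 0 only at EOF; otherwise at least one byte is shown. */
		const void *(*peek)(void *src, size_t *len);
		/* Advance past n bytes; n may be less than the peeked length. */
		void (*consume)(void *src, size_t n);
	};

	/* Write side: "give me some buffer space", then "I filled part of it". */
	struct dst_ops {
		void *(*reserve)(void *dst, size_t *len);
		int   (*commit)(void *dst, size_t n);	/* may fail */
	};

	/* One transfer step built only from those hooks. */
	static ssize_t xfer_step(void *src, const struct src_ops *s,
				 void *dst, const struct dst_ops *d)
	{
		size_t slen, dlen, n;
		const void *in = s->peek(src, &slen);
		void *out;

		if (!slen)
			return 0;			/* EOF */
		out = d->reserve(dst, &dlen);		/* assume >= 1 byte, or it blocks */
		n = slen < dlen ? slen : dlen;
		memcpy(out, in, n);
		if (d->commit(dst, n) < 0)
			return -1;			/* destination refused it */
		s->consume(src, n);			/* only now is the data "gone" */
		return n;
	}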


By allowing both of those (and, knowing that PCI writes are more efficient
than PCI reads, giving the latter preference if both are available),
you can do direct device-to-device copies on splice().


The problem with Larry's separate pull() and push() calls is that you
then need a user-visible abstraction for "pulled but not yet pushed"
data, which seems like an unnecessary abstraction violation.


The main infrastructure hassle you need to support this *universally*
is the unget() on "character devices" like pipes and network sockets.
Ideally, it would be some generic buffer front end that could be used
by the device for normal data as well as the special case.

Ooh.  Need to think.  If there's a -EIO problem on one of the file
descriptors, how does the caller know which one?  That's an argument for
separate pull and push (although the splice() library routine still
has the problem).  Any suggestions?  Does userland need to fall
back on read()/write() for a byte?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Make pipe data structure be a circular list of pages, rather than
  2005-01-19 17:14         ` Linus Torvalds
@ 2005-01-19 19:01           ` Larry McVoy
  2005-01-20  0:01             ` Linus Torvalds
  0 siblings, 1 reply; 38+ messages in thread
From: Larry McVoy @ 2005-01-19 19:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Larry McVoy, William Lee Irwin III, Linux Kernel Mailing List

On Wed, Jan 19, 2005 at 09:14:33AM -0800, Linus Torvalds wrote:
> I think you are perhaps confused about the fact that what makes this all
> possible in the first place really is the _pipe_. You probably think of
> "splice()" as going from one file descriptor to another. It's not. 

You seem to have misunderstood the original proposal, it had little to do
with file descriptors.  The idea was that different subsystems in the OS
export pull() and push() interfaces and you use them.  The file descriptors
are only involved if you provide them with those interfaces (which you
would, it makes sense).  You are hung up on the pipe idea, the idea I
see in my head is far more generic.  Anything can play and you don't
need a pipe at all, you need 

	struct pagev {
		pageno	page;	// or whatever the type is to get the page
		u16	offset;	// could be u32 if you have large pages
		u16	len;	// ditto
	};
	struct splice {
		u8	gift:1;
		u8	loan:1;
		struct pagev	pages[];	// flexible array member, so it goes last
	};

You can feed these into pipes for sure, but you can source and sink
them from/to anything which exports a pull()/push() entry point.
It's as generic as read()/write() in the driver object.

I never proposed restricting this to file descriptors, that makes
no sense.  File descriptors are nothing more than something which
gets you to the underlying object (socket/file/pipe/whatever) and
lets you call into that object to move some data.
		
See how more generic that is?  Pipes are just one source/sink but 
everything else needs to play as well.  How are you going to 
implement a socket sending data to a file without the VM nonsense
and the extra copies?
-- 
---
Larry McVoy                lm at bitmover.com           http://www.bitkeeper.com

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Make pipe data structure be a circular list of pages, rather than
  2005-01-19 16:29       ` Larry McVoy
@ 2005-01-19 17:14         ` Linus Torvalds
  2005-01-19 19:01           ` Larry McVoy
  0 siblings, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2005-01-19 17:14 UTC (permalink / raw)
  To: Larry McVoy; +Cc: William Lee Irwin III, Linux Kernel Mailing List



On Wed, 19 Jan 2005, Larry McVoy wrote:
>
> I think you are going to regret making splice() a system call, it shouldn't
> be, you'll find cases where it won't work.  Make it a library call built
> out of pull() and push()

Actually, if you look at my current (ugly) test-patch, the one part of it
that isn't ugly is the actual "sys_splice()" + "do_splice()" part, and the
only thing they actually do is (a) do all the required "fd -> struct file"  
setup (including locking, testing readability vs writability etc), and
then (b) they check which side is a pipe, and split it up into "pull" vs 
"push" at that point.

Of course, the "pull" is actually called "do_splice_from()", and the 
"push" is called "do_splice_to()", but I can rename them if you want to ;)

So yes, internally it is actually a push/pull thing, with separate actions 
for both. I absolutely agree that you cannot have a "splice()" action, you 
need to have a _direction_. But that's actually exactly what the pipe 
gives you: since all data has to originate or end in a pipe, the direction 
is implicit by the file descriptors involved.

(The exception is a pipe->pipe transfer, but then the direction doesn't
matter, and you can choose whichever just ends up being easier. We happen
to consider it a "do_splice_from()" from the source pipe right now, but
that's purely a matter of "which way do you test first").
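
Roughly, the idea is a dispatch of this shape (a hand-waved sketch only -
is_a_pipe() and the exact argument lists are stand-ins, not what the
test-patch actually has):

	static long do_splice(struct file *in, struct file *out, size_t len)
	{
		if (is_a_pipe(out))
			/* generic fd (or pipe) -> pipe: the "pull" side */
			return do_splice_from(in, out, len);
		if (is_a_pipe(in))
			/* pipe -> generic fd: the "push" side */
			return do_splice_to(in, out, len);
		return -EINVAL;		/* no pipe on either end */
	}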

Could we show it as two system calls? Sure. It wouldn't give you anything,
though, since you need to do all the tests anyway - including the test
for whether the source or destination is a pipe which currently is the 
thing that splits up the "splice()" into two cases. So it's not even like 
you could optimize out the direction test - it would just become an 
error case test instead.

And having just one system call means that the user doesn't need to care
(other than making sure that one end _is_ a pipe), and also allows us to
share all the code that _is_ shared (namely the "struct file *" setups and
testing - not a whole lot, but it's cleaner that way).

I think you are perhaps confused about the fact that what makes this all
possible in the first place really is the _pipe_. You probably think of
"splice()" as going from one file descriptor to another. It's not. It goes
from one (generic) file descriptor into a _pipe_, or from one pipe into
another (generic) file descriptor. So the "push"/"pull" part really is 
there, but it's there in another form: if you want to copy from one file 
to another, you have to

	1) open a pipe - it's the splice equivalent of allocating a
	   temporary buffer.
   repeat:
	   2) "pull" from the fd into the pipe: splice(in, pipe[1], size)
	   3) "push" from the pipe into the fd: splice(pipe[0], out, size)

so it's all there. The above is 100% conceptually equivalent to

	 1) buf = malloc()
   repeat:
	   2) read(in, buf, size);
	   3) write(out, buf, size);

See? I think you'll find that the "push" and "pull" you look for really is 
there.

The advantage here is that if either the input or the output _already_ is
a pipe, you can optimize it to just do a loop of a single splice(), with
no new temporary buffer needed. So while the above three steps are the 
generic case, depending on how you do things the _common_ case may be just 
a simple

    repeat:
	  1) splice(in, out, maxsize);

which is why you do not want to _force_ the split. The determination of 
how to do this efficiently at run-time is also very easy:

	int copy_fd(int in, int out)
	{
		char *buf;
		int fds[2];

		for (;;) {
			int count = splice(in, out, ~0UL);
			if (count < 0) {
				if (errno == EAGAIN)
					continue;
				/* Old kernel without "splice()"? */
				if (errno == ENOSYS)
					goto read_write_case;
				/* No pipe involved? */
				if (errno == EINVAL)
					goto pipe_buffer_case;
				/* We found a pipe, but the other end cannot accept this splice */
				if (errno == ECANNOTSPLICE)
					goto read_write_case;
				return -1;
			}
			if (!count)
				return 0;
		}

	pipe_buffer_case:
		if (pipe(fds) < 0)
			goto read_write_case;
		for (;;) {
			int count = splice(in, fds[1], ~0UL);
			if (count < 0) {
				if (errno == EAGAIN)
					continue;
				return -1;
			}
			if (!count)
				return 0;
			while (count > 0) {
				int n = splice(fds[0], out, count);
				if (n < 0) {
					if (errno == EAGAIN)
						continue;
					return -1;
				}
				if (!n) {
					errno = ENOSPC;
					return -1;
				}
				count -= n;
			}
		}

	read_write_case:
		buf = malloc(BUFSIZE);
		for (;;) {
			int count = read(in, buf, BUFSIZE);
			...

See? THAT is the kind of library routine you want to have (btw, there's no 
"ECANNOTSPLICE" error code right now, my current example code incorrectly 
returns EINVAL for both the "no pipe" case and the "found pipe but not 
splicable" case).

			Linus

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Make pipe data structure be a circular list of pages, rather than
  2005-01-07  6:37     ` Linus Torvalds
@ 2005-01-19 16:29       ` Larry McVoy
  2005-01-19 17:14         ` Linus Torvalds
  0 siblings, 1 reply; 38+ messages in thread
From: Larry McVoy @ 2005-01-19 16:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: William Lee Irwin III, Linux Kernel Mailing List, Larry McVoy

I think you are going to regret making splice() a system call, it shouldn't
be, you'll find cases where it won't work.  Make it a library call built
out of pull() and push(), see my original postings on this and you'll
follow the logic.  Making splice a system call is like putting ftp 
into the kernel, not a good idea.
-- 
---
Larry McVoy                lm at bitmover.com           http://www.bitkeeper.com

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Make pipe data structure be a circular list of pages, rather than
  2005-01-12 19:50     ` Davide Libenzi
@ 2005-01-12 20:10       ` Linus Torvalds
  0 siblings, 0 replies; 38+ messages in thread
From: Linus Torvalds @ 2005-01-12 20:10 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Linux Kernel Mailing List, Hugh Dickins



On Wed, 12 Jan 2005, Davide Libenzi wrote:

> On Sun, 9 Jan 2005, Linus Torvalds wrote:
> 
> > On Sun, 9 Jan 2005, Linus Torvalds wrote:
> > > 
> > > Since you guys stupidly showed interest, here's a very first-order
> > > approximation of filling the pipe from some other source.
> > 
> > Here's a somewhat fixed and tested version, which actually does something 
> > on x86.
> 
> Question. How do you plan to splice() skb pages, or any other non-page-based
> format?

That's why there's an "offset/length" pair. Right now the only limitation 
is actually
 - that at least the start of the area can be described as a "struct page
   *" (this means that things like a PCI MMIO mapping cannot be spliced -
   but that's true for other reasons anyway, since a "memcpy()" doesn't
   even work on such things on most architectures)
 - that it be mappable with kmap() (this means that if it's a multi-page 
   thing, it needs to be in low memory currently).

The second thing would actually go away in a full implementation
anyway, to be replaced by "map this page in" and "unmap this page" buffer
operations. We need those "map" operations in order to handle other
"prepare for copy" situations, notably waiting for disk IO to be finished
(ie I want to be able to splice pages to the pipe without having to wait
for them - you'd wait for the contents only when the data itself is
needed).

My example patch (which I didn't post publicly because it's a
work-in-progress) already made offset/length be "unsigned int", exactly
because I foresee the "page" possibly being part of a hugetlb page, or at
least a bigger multi-page allocation.

[ Side note: even the first issue - PCI MMIO or other descriptor that 
  doesn't normally have a "struct page *" associated with it - could 
  in theory be handled by just creating a fake "struct page", and hiding 
  the information into it. Then the map/unmap functions would just have to 
  return the virtual address that is usable for a memcpy - since on _some_ 
  architectures you can actually do a memcpy on IO space too. That might 
  be useful for things like AGP or similar that depend on architecture- 
  specific issues anyway ]
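
Concretely, the shape being described is something like the sketch below.  The
field and hook names are illustrative (the work-in-progress patch may differ),
but it shows why an skb fragment, a hugetlb sub-page or a fake "struct page"
all fit the same descriptor:

	/* Illustrative only - not the exact work-in-progress declarations. */
	struct pipe_buffer;

	struct pipe_buf_operations {
		/* make the data addressable: kmap(), wait for IO, ... */
		void *(*map)(struct pipe_buffer *buf);
		void  (*unmap)(struct pipe_buffer *buf, void *addr);
		/* drop the reference: page cache page, skb frag, fake page, ... */
		void  (*release)(struct pipe_buffer *buf);
	};

	struct pipe_buffer {
		struct page			 *page;
		unsigned int			 offset, len;	/* byte range */
		const struct pipe_buf_operations *ops;
		void				 *private;	/* owner data, e.g. the skb */
	};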

		Linus

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Make pipe data structure be a circular list of pages, rather than
       [not found]   ` <Pine.LNX.4.58.0501091830120.2373@ppc970.osdl.org>
@ 2005-01-12 19:50     ` Davide Libenzi
  2005-01-12 20:10       ` Linus Torvalds
  0 siblings, 1 reply; 38+ messages in thread
From: Davide Libenzi @ 2005-01-12 19:50 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Linux Kernel Mailing List, Hugh Dickins

On Sun, 9 Jan 2005, Linus Torvalds wrote:

> On Sun, 9 Jan 2005, Linus Torvalds wrote:
> > 
> > Since you guys stupidly showed interest, here's a very first-order
> > approximation of filling the pipe from some other source.
> 
> Here's a somewhat fixed and tested version, which actually does something 
> on x86.

Question. How do you plan to splice() skb pages, or any other non-page-based
format?



- Davide


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Make pipe data structure be a circular list of pages, rather than
  2005-01-07  6:35   ` Linus Torvalds
@ 2005-01-07  6:37     ` Linus Torvalds
  2005-01-19 16:29       ` Larry McVoy
  0 siblings, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2005-01-07  6:37 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: Linux Kernel Mailing List, Larry McVoy



On Thu, 6 Jan 2005, Linus Torvalds wrote:
> 
> I worried a bit about latency, but my test-run of four before/after was in 
> the noise. It's bound to be worse because of the extra allocation, but 
> it's not clearly visible.

Btw, this is not to say that it isn't measurable: I'm sure it is. It's
just not as _obvious_ as the throughput improvement. If somebody were to
do some more controlled latency tests... Hint, hint.

		Linus

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Make pipe data structure be a circular list of pages, rather than
  2005-01-07  3:41 ` William Lee Irwin III
@ 2005-01-07  6:35   ` Linus Torvalds
  2005-01-07  6:37     ` Linus Torvalds
  0 siblings, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2005-01-07  6:35 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: Linux Kernel Mailing List, Larry McVoy



On Thu, 6 Jan 2005, William Lee Irwin III wrote:

> On Fri, Jan 07, 2005 at 12:29:13AM +0000, Linux Kernel Mailing List wrote:
> > ChangeSet 1.2229.1.1, 2005/01/06 16:29:13-08:00, torvalds@ppc970.osdl.org
> > 	Make pipe data structure be a circular list of pages, rather than
> > 	a circular list of one page.
> > 	This improves pipe throughput, and allows us to (eventually)
> > 	use these lists of page buffers for moving data around efficiently.
> 
> Interesting; how big a gain did you see?

On my ppc970 (which Alan correctly points out is not necessarily the best 
platform to test), lmbench throughput looks like this:

	*Local* Communication bandwidths in MB/s - bigger is better
	-----------------------------------------------------------
	Host                OS  Pipe AF    TCP  File   Mmap  Bcopy  Bcopy  Mem   
	Mem                          UNIX      reread reread (libc) (hand) read write
	--------- ------------- ---- ---- ---- ------ ------ ------ ------ ---- -----
before:
	ppc970.os  Linux 2.6.10 753. 1305 550. 1110.3 2051.3  932.3  926.4 2043 1291.
	ppc970.os  Linux 2.6.10 738. 1596 558. 1099.4 2038.8  936.6  925.0 2033 1288.
	ppc970.os  Linux 2.6.10 752. 1535 557. 1107.2 2043.5  937.9  932.1 2037 1292.
	ppc970.os  Linux 2.6.10 750. 1524 556. 1107.5 2046.0  937.2  930.4 2037 1289.
after:
	ppc970.os  Linux 2.6.10 976. 1594 561. 1110.6 2045.8  934.9  923.6 2039 1288.
	ppc970.os  Linux 2.6.10 1195 1595 549. 1102.0 2049.0  909.5  930.5 2044 1292.
	ppc970.os  Linux 2.6.10 1669 1418 507. 1122.9 2051.3  943.4  916.4 2035 1280.
	ppc970.os  Linux 2.6.10 1472 1612 557. 1092.4 2032.9  932.1  935.1 2030 1281.


ie that's a clear improvement (30-90% higher bandwidth). That's from
having <n> (currently 16) pages of buffers in flight instead of just one.

Of course, the full 16 you'll seldom see in real life - stdio etc tends to
buffer up at 8kB or similar, but it does mean that such an 8kB write won't
need to block for the reader to empty the pipe.

I worried a bit about latency, but my test-run of four before/after was in 
the noise. It's bound to be worse because of the extra allocation, but 
it's not clearly visible.

However, I've got a long-term cunning plan, which was the _real_ reason
for the pipe changes. I'm finally going to implement Larry McVoy's
"splice()"  operation, and it turns out that a pipe (with the new
organization) ends up being the exact conduit between two files that you
need to "splice" an input and an output together - without any copying (the
new pipe buffering is just pointers to pages and byte ranges within those
pages).

Together with a "tee()" system call (duplicate it in two), you can
basically generate pipelines and forks of data, and you'll only need to
update a reference count (and populate a new "struct pipe_buffer" thing).
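
As a very rough picture of what that would let user space write - with a guessed
tee(in, out, len) signature and the proposed three-argument splice(), neither of
which exists yet, and with video_in/disk_fd/net_fd standing in for whatever
descriptors you have open - a "save and forward" loop could look like:

	int a[2], b[2];

	pipe(a);
	pipe(b);
	for (;;) {
		int n = splice(video_in, a[1], ~0UL);	/* source -> pipe a */
		if (n <= 0)
			break;
		tee(a[0], b[1], n);			/* duplicate a -> b: refcounts only */
		splice(a[0], disk_fd, n);		/* one copy of the stream to disk */
		splice(b[0], net_fd, n);		/* the other to the network */
	}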

But making sure that every step along the way is an improvement is part of 
the philosophy.

			Linus

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Make pipe data structure be a circular list of pages, rather than
       [not found] <200501070313.j073DCaQ009641@hera.kernel.org>
@ 2005-01-07  3:41 ` William Lee Irwin III
  2005-01-07  6:35   ` Linus Torvalds
  0 siblings, 1 reply; 38+ messages in thread
From: William Lee Irwin III @ 2005-01-07  3:41 UTC (permalink / raw)
  To: torvalds; +Cc: Linux Kernel Mailing List

On Fri, Jan 07, 2005 at 12:29:13AM +0000, Linux Kernel Mailing List wrote:
> ChangeSet 1.2229.1.1, 2005/01/06 16:29:13-08:00, torvalds@ppc970.osdl.org
> 	Make pipe data structure be a circular list of pages, rather than
> 	a circular list of one page.
> 	This improves pipe throughput, and allows us to (eventually)
> 	use these lists of page buffers for moving data around efficiently.

Interesting; how big a gain did you see?


-- wli

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2005-08-18  6:07 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-01-07 14:30 Make pipe data structure be a circular list of pages, rather than Oleg Nesterov
2005-01-07 15:45 ` Alan Cox
2005-01-07 17:23   ` Linus Torvalds
2005-01-08 18:25     ` Hugh Dickins
2005-01-08 18:54       ` Linus Torvalds
2005-01-07 16:17 ` Linus Torvalds
2005-01-07 16:06   ` Alan Cox
2005-01-07 17:33     ` Linus Torvalds
2005-01-07 17:48       ` Linus Torvalds
2005-01-07 20:59         ` Mike Waychison
2005-01-07 23:46           ` Chris Friesen
2005-01-08 21:38             ` Lee Revell
2005-01-08 21:51               ` Linus Torvalds
2005-01-08 22:02                 ` Lee Revell
2005-01-08 22:29                 ` Davide Libenzi
2005-01-09  4:07                 ` Linus Torvalds
2005-01-09 23:19                   ` Davide Libenzi
2005-01-14 10:15             ` Peter Chubb
2005-01-07 21:59         ` Linus Torvalds
2005-01-07 22:53           ` Diego Calleja
2005-01-07 23:15             ` Linus Torvalds
2005-01-10 23:23         ` Robert White
2005-01-07 17:45     ` Chris Friesen
2005-01-07 16:39   ` Davide Libenzi
2005-01-07 17:09     ` Linus Torvalds
2005-08-18  6:07   ` Coywolf Qi Hunt
  -- strict thread matches above, loose matches on Subject: below --
2005-01-20  2:14 Robert White
2005-01-16  2:59 Make pipe data structure be a circular list of pages, rather Linus Torvalds
2005-01-19 21:12 ` Make pipe data structure be a circular list of pages, rather than linux
2005-01-20  2:06   ` Robert White
     [not found] <Pine.LNX.4.44.0501091946020.3620-100000@localhost.localdomain>
     [not found] ` <Pine.LNX.4.58.0501091713300.2373@ppc970.osdl.org>
     [not found]   ` <Pine.LNX.4.58.0501091830120.2373@ppc970.osdl.org>
2005-01-12 19:50     ` Davide Libenzi
2005-01-12 20:10       ` Linus Torvalds
     [not found] <200501070313.j073DCaQ009641@hera.kernel.org>
2005-01-07  3:41 ` William Lee Irwin III
2005-01-07  6:35   ` Linus Torvalds
2005-01-07  6:37     ` Linus Torvalds
2005-01-19 16:29       ` Larry McVoy
2005-01-19 17:14         ` Linus Torvalds
2005-01-19 19:01           ` Larry McVoy
2005-01-20  0:01             ` Linus Torvalds

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).