From: Linus Torvalds <torvalds@osdl.org>
To: Oleg Nesterov <oleg@tv-sign.ru>
Cc: William Lee Irwin III <wli@holomorphy.com>, linux-kernel@vger.kernel.org
Subject: Re: Make pipe data structure be a circular list of pages, rather than
Date: Fri, 7 Jan 2005 08:17:42 -0800 (PST)
Message-ID: <Pine.LNX.4.58.0501070735000.2272@ppc970.osdl.org>
In-Reply-To: <41DE9D10.B33ED5E4@tv-sign.ru>



On Fri, 7 Jan 2005, Oleg Nesterov wrote:
> 
> If i understand this patch correctly, then this code
> 
> 	for (;;)
> 		write(pipe_fd, &byte, 1);
> 
> will block after writing PIPE_BUFFERS == 16 characters, no?
> And pipe_inode_info will use 64K to hold 16 bytes!

Yes. 

> Is it ok?

If you want throughput, don't do single-byte writes. Obviously we _could_
do coalescing, but there's a reason I'd prefer to avoid it. So I consider
it a "don't do that, then", and I'll wait and see whether anybody actually
does. I can't think of anything that cares about performance and does that
anyway: because system calls are reasonably expensive regardless, anybody
who cares at all about performance will have buffered things up in user
space.
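
To make "buffered up in user space" concrete, here's a rough sketch of what
I mean (illustration only - the emit_*() helper names are made up):

	#include <unistd.h>

	/* Coalesce tiny writes in user space so the pipe sees page-sized
	 * chunks instead of one system call per byte. */
	static char outbuf[4096];
	static size_t outlen;

	static int emit_flush(int pipe_fd)
	{
		size_t off = 0;

		while (off < outlen) {
			ssize_t n = write(pipe_fd, outbuf + off, outlen - off);
			if (n < 0)
				return -1;	/* caller checks errno */
			off += n;
		}
		outlen = 0;
		return 0;
	}

	static int emit_byte(int pipe_fd, char byte)
	{
		outbuf[outlen++] = byte;
		if (outlen == sizeof(outbuf))
			return emit_flush(pipe_fd);
		return 0;
	}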

> Maybe it makes sense to add data to the last allocated page
> until buf->len > PAGE_SIZE?

The reason I don't want to coalesce is that I don't ever want to modify a
page that is on a pipe buffer (well, at least not through the pipe buffer
- it might get modified some other way). Why? Because the long-term plan
for pipe-buffers is to allow the data to come from _other_ sources than
just a user space copy. For example, it might be a page directly from the
page cache, or a partial page that contains the data part of an skb that
just came in off the network.

With this organization, a pipe ends up being able to act as a "conduit"  
for pretty much any data, including some high-bandwidth things like video
streams, where you really _really_ don't want to copy the data. So the 
next stage is:

 - allow the buffer size to be set dynamically per-pipe (probably only
   increased by root, due to obvious issues, although a per-user limit is 
   not out of the question - it's just a "mlock" in kernel buffer space, 
   after all)
 - add a per-"struct pipe_buffer" ops pointer to a structure with 
   operation function pointers: "release()", "wait_for_ready()", "poll()"  
   (and possibly "merge()", if we want to coalesce things, although I
   really hope we won't need to) - a rough sketch follows this list
 - add a "splice(fd, fd)" system call that copies pages (by incrementing 
   their reference count, not by copying the data!) from an input source 
   to the pipe, or from a pipe to an output.
 - add a "tee(in, out1, out2)" system call that duplicates the pages 
   (again, incrementing their reference count, not copying the data) from 
   one pipe to two other pipes.
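
To give an idea of what that per-buffer ops structure could look like, here
is a rough sketch (the names and fields are illustrative only, not from any
existing tree):

	#include <stddef.h>	/* size_t - user-space stand-in for the sketch */

	struct page;		/* the underlying (possibly partial) page */
	struct pipe_buffer;

	struct pipe_buf_operations {
		/* drop our reference to the page (page cache, skb data, ...) */
		void (*release)(struct pipe_buffer *buf);
		/* sleep until the data in this buffer is actually usable */
		int (*wait_for_ready)(struct pipe_buffer *buf);
		/* non-blocking readiness check, to back poll() */
		int (*poll)(struct pipe_buffer *buf);
		/* optional: append more data to a partially filled buffer */
		int (*merge)(struct pipe_buffer *buf, const void *data, size_t len);
	};

	struct pipe_buffer {
		struct page *page;		/* page holding the data */
		unsigned int offset, len;	/* valid data within that page */
		const struct pipe_buf_operations *ops;
	};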

All of the above is basically a few lines of code (the "splice()" thing
requires some help from drivers/networking/pagecache etc, but it's not
complex help, and not everybody needs to do it - I'll start off with
_just_ a generic page cache helper to get the thing rolling, that's easy).

Now, imagine using the above in a media server, for example. Let's say 
that a year or two has passed, so that the video drivers have been updated 
to be able to do the splice thing, and what can you do? You can:

 - splice from the (mpeg or whatever - let's just assume that the video
   input is either digital or does the encoding on its own - like they
   pretty much all do) video input into a pipe (remember: no copies - the
   video input will just DMA directly into memory, and splice will just 
   set up the pages in the pipe buffer)
 - tee that pipe to split it up
 - splice one end to a file (ie "save the compressed stream to disk")
 - splice the other end to a real-time video decoder window for your 
   real-time viewing pleasure (the call sequence is sketched below).
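
In terms of the calls sketched above, that pipeline would be roughly (using
the splice(in, out, maxlen, flags) and tee(in, out1, out2) forms proposed in
this mail - not any existing system call - and assuming video_fd, disk_fd
and decoder_fd are already-open file descriptors):

	int src_pipe[2], disk_pipe[2], view_pipe[2];

	pipe(src_pipe);
	pipe(disk_pipe);
	pipe(view_pipe);

	for (;;) {
		/* DMA'd video pages become pipe buffers - no data copy */
		if (splice(video_fd, src_pipe[1], ~0, 0) <= 0)
			break;

		/* duplicate the pages into two pipes (refcount bump only) */
		tee(src_pipe[0], disk_pipe[1], view_pipe[1]);

		/* push one copy to disk, the other to the decoder */
		splice(disk_pipe[0], disk_fd, ~0, 0);
		splice(view_pipe[0], decoder_fd, ~0, 0);
	}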

That's the plan, at least. I think it makes sense, and the things that 
convinced me about it were (a) how simple all of this seems to be 
implementation-wise (modulo details - but there are no "conceptually 
complex" parts: no horrid asynchronous interfaces, no questions about 
how to buffer things, no direct user access to pages that might 
partially contain protected data, etc.) and (b) it's so UNIXy. If 
there's something that says "the UNIX way", it's pipes between entities 
that act on the data.

For example, let's say that you wanted to serve a file from disk (or any 
other source) with a header to another program (or to a TCP connection, or 
to whatever - it's just a file descriptor). You'd do

	fd = create_pipe_to_destination();

	input = open("filename", O_RDONLY);
	write(fd, "header goes here", length_of_header);
	for (;;) {
		ssize_t err;
		err = splice(input, fd,
			~0 /* maxlen */,
			0 /* msg flags - think "sendmsg" */);
		if (err > 0)
			continue;
		if (!err) /* EOF */
			break;
		.. handle input errors here ..
	}

(obviously, if this is a real server, this would likely all be in a
select/epoll loop, but that just gets too hard to describe concisely, so
I'm showing the single-threaded simple version).

Further, this also ends up giving a nice pollable interface to regular
files too: just splice from the file (at any offset) into a pipe, and poll
on the result.  The "splice()" will just do the page cache operations and
start the IO if necessary, the "poll()" will wait for the first page to be
actually available. All _trivially_ done with the "struct pipe_buffer"
operations.
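
In other words, something like this (again using the proposed two-fd
splice() form, and assuming file_fd is an already-open regular file):

	#include <poll.h>

	int pfd[2];
	pipe(pfd);

	/* queue up page cache pages from the file, starting IO if needed */
	splice(file_fd, pfd[1], ~0, 0);

	/* wakes up as soon as the first page is actually available */
	struct pollfd pw = { .fd = pfd[0], .events = POLLIN };
	poll(&pw, 1, -1);

	/* now read() the data, or splice() it onwards to somewhere else */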

So the above kind of "send a file to another destination" should
automatically work very naturally in any poll loop: instead of filling a
writable pipe with a "write()", you just fill it with "splice()" instead
(and you can read it with a 'read()' or you just splice it to somewhere
else, or you tee() it to two destinations....).

I think the media server kind of environment is the one most easily
explained, where you have potentially tons of data that the server process
really never actually wants to _see_ - it just wants to push it on to
another process or connection or save it to a log-file or something. But 
as with regular pipes, it's not a specialized interface: it really is just 
a channel of communication.

The difference being that a historical UNIX pipe is always a channel 
between two process spaces (ie you can only fill it and empty it into the 
process address space), and the _only_ thing I'm trying to do is to have 
it be able to be a channel between two different file descriptors too. You 
still need the process to "control" the channel, but the data doesn't have 
to touch the address space any more.

Think of all the servers or other processes that really don't care about 
the data. Think of something as simple as encrypting a file, for example. 
Imagine that you have hardware encryption support that does DMA from the 
source, and writes the results using DMA. I think it's pretty obvious how 
you'd connect this up using pipes and two splices (one for the input, one 
for the output).
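
Spelled out (with a made-up /dev/hwcrypto device node, and the same proposed
splice() form as above), that hookup would look something like:

	#include <fcntl.h>
	#include <unistd.h>

	int in_pipe[2], out_pipe[2];
	int src = open("plaintext", O_RDONLY);
	int engine = open("/dev/hwcrypto", O_RDWR);	/* made-up device */
	int dst = open("ciphertext", O_WRONLY | O_CREAT, 0600);

	pipe(in_pipe);
	pipe(out_pipe);

	for (;;) {
		/* feed the engine: file pages -> pipe -> engine (DMA in) */
		if (splice(src, in_pipe[1], ~0, 0) <= 0)
			break;
		splice(in_pipe[0], engine, ~0, 0);

		/* drain the engine: engine -> pipe -> output file (DMA out) */
		splice(engine, out_pipe[1], ~0, 0);
		splice(out_pipe[0], dst, ~0, 0);
	}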

And notice how _flexible_ it is: both the input and the output can be any
kind of fd you want. The pipes end up doing both the "conversion" into a
common format of "list of (possibly partial) pages" and the buffering, 
which is why the different "engines" don't need to care where the data 
comes from, or where it goes. So while you can use it to encrypt a file 
into another file, you could equally easily use it for something like

	tar cvf - my_source_tree | hw_engine_encrypt | splice_to_network

and the whole pipeline would not have a _single_ actual data copy: the 
pipes are channels.

Of course, since it's a pipe, the nice thing is that people don't have to
use "splice()" to access it - the above pipeline has a perfectly regular
"tar" process that probably just does regular writes. You can have a
process that does "splice()" to fill the pipe, and the other end is a
normal thing that just uses regular "read()" and doesn't even _know_ that
the pipe is using new-fangled technology to be filled.

I'm clearly enamoured with this concept. I think it's one of those few 
"RightThing(tm)" ideas that don't come along all that often. I don't know of 
anybody else doing this, and I think it's both useful and clever. If you 
now prove me wrong, I'll hate you forever ;)

		Linus
