From: Coywolf Qi Hunt <coywolf@gmail.com>
To: Linus Torvalds <torvalds@osdl.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>,
William Lee Irwin III <wli@holomorphy.com>,
linux-kernel@vger.kernel.org
Subject: Re: Make pipe data structure be a circular list of pages, rather than
Date: Thu, 18 Aug 2005 14:07:02 +0800 [thread overview]
Message-ID: <2cd57c90050817230735895530@mail.gmail.com> (raw)
In-Reply-To: <Pine.LNX.4.58.0501070735000.2272@ppc970.osdl.org>
On 1/8/05, Linus Torvalds <torvalds@osdl.org> wrote:
>
>
> On Fri, 7 Jan 2005, Oleg Nesterov wrote:
> >
> > If I understand this patch correctly, then this code
> >
> > for (;;)
> > write(pipe_fd, &byte, 1);
> >
> > will block after writing PIPE_BUFFERS == 16 characters, no?
> > And pipe_inode_info will use 64K to hold 16 bytes!
>
> Yes.
>
> > Is it ok?
>
> If you want throughput, don't do single-byte writes. Obviously we _could_
> do coalescing, but there's a reason I'd prefer to avoid it. So I consider
> it a "don't do that then", and I'll wait to see if people do. I can't
> think of anything that cares about performance that does that anyway:
> because system calls are reasonably expensive regardless, anybody who
> cares at all about performance will have buffered things up in user space.
>
> > Maybe it makes sense to add data to the last allocated page
> > until buf->len > PAGE_SIZE?
>
> The reason I don't want to coalesce is that I don't ever want to modify a
> page that is on a pipe buffer (well, at least not through the pipe buffer
> - it might get modified some other way). Why? Because the long-term plan
> for pipe-buffers is to allow the data to come from _other_ sources than
> just a user space copy. For example, it might be a page directly from the
> page cache, or a partial page that contains the data part of an skb that
> just came in off the network.
>
> With this organization, a pipe ends up being able to act as a "conduit"
> for pretty much any data, including some high-bandwidth things like video
> streams, where you really _really_ don't want to copy the data. So the
> next stage is:
>
> - allow the buffer size to be set dynamically per-pipe (probably only
> increased by root, due to obvious issues, although a per-user limit is
> not out of the question - it's just a "mlock" in kernel buffer space,
> after all)
> - add per-"struct pipe_buffer" ops pointer to a structure with
> operation function pointers: "release()", "wait_for_ready()", "poll()"
> (and possibly "merge()", if we want to coalesce things, although I
> really hope we won't need to)
> - add a "splice(fd, fd)" system call that copies pages (by incrementing
> their reference count, not by copying the data!) from an input source
> to the pipe, or from a pipe to an output.
> - add a "tee(in, out1, out2)" system call that duplicates the pages
> (again, incrementing their reference count, not copying the data) from
> one pipe to two other pipes.
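These calls did land in mainline later (Linux 2.6.17), though with slightly different shapes than proposed here: the merged tee() duplicates pages between one input pipe and one output pipe, tee(fd_in, fd_out, len, flags), rather than fanning one pipe out to two. A minimal sketch against the merged Linux API (assuming a Linux system; the function name is illustrative):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Prototypes as eventually merged (see <fcntl.h> on Linux >= 2.6.17):
 *   ssize_t splice(int fd_in, off_t *off_in, int fd_out,
 *                  off_t *off_out, size_t len, unsigned int flags);
 *   ssize_t tee(int fd_in, int fd_out, size_t len, unsigned int flags);
 * Note the merged tee() duplicates between exactly two pipes;
 * splitting to two destinations takes two calls. */
int tee_demo(void)
{
    int a[2], b[2];
    char buf[8];

    if (pipe(a) || pipe(b))
        return -1;
    if (write(a[1], "hello", 5) != 5)
        return -1;

    /* Duplicate pipe a's page references into pipe b: no data copy,
     * and pipe a's contents are left unconsumed. */
    if (tee(a[0], b[1], 5, 0) != 5)
        return -1;

    /* The same bytes are now readable from both pipes. */
    if (read(a[0], buf, 5) != 5 || memcmp(buf, "hello", 5) != 0)
        return -1;
    if (read(b[0], buf, 5) != 5 || memcmp(buf, "hello", 5) != 0)
        return -1;
    return 0;
}
```

The tee() call here moves only struct pipe_buffer references around, which is exactly the increment-the-refcount scheme described in the list above.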
>
> All of the above is basically a few lines of code (the "splice()" thing
> requires some help from drivers/networking/pagecache etc, but it's not
> complex help, and not everybody needs to do it - I'll start off with
> _just_ a generic page cache helper to get the thing rolling, that's easy).
>
> Now, imagine using the above in a media server, for example. Let's say
> that a year or two has passed, so that the video drivers have been updated
> to be able to do the splice thing, and what can you do? You can:
>
> - splice from the (mpeg or whatever - let's just assume that the video
> input is either digital or does the encoding on its own - like they
> pretty much all do) video input into a pipe (remember: no copies - the
> video input will just DMA directly into memory, and splice will just
> set up the pages in the pipe buffer)
> - tee that pipe to split it up
> - splice one end to a file (ie "save the compressed stream to disk")
> - splice the other end to a real-time video decoder window for your
> real-time viewing pleasure.
>
> That's the plan, at least. I think it makes sense, and the thing that
> convinced me about it was (a) how simple all of this seems to be
> implementation-wise (modulo details - but there are no "conceptually
> complex" parts: no horrid asynchronous interfaces, no questions about
> how to buffer things, no direct user access to pages that might
> partially contain protected data etc etc) and (b) it's so UNIXy. If
> there's something that says "the UNIX way", it's pipes between entities
> that act on the data.
>
> For example, let's say that you wanted to serve a file from disk (or any
> other source) with a header to another program (or to a TCP connection, or
> to whatever - it's just a file descriptor). You'd do
>
>	fd = create_pipe_to_destination();
>
>	input = open("filename", O_RDONLY);
>	write(fd, "header goes here", length_of_header);
>	for (;;) {
>		ssize_t err;
>		err = splice(input, fd,
>			~0 /* maxlen */,
>			0 /* msg flags - think "sendmsg" */);
>		if (err > 0)
>			continue;
>		if (!err) /* EOF */
>			break;
>		.. handle input errors here ..
>	}
>
> (obviously, if this is a real server, this would likely all be in a
> select/epoll loop, but that just gets too hard to describe concisely, so
> I'm showing the single-threaded simple version).
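As merged, splice() ended up requiring at least one of the two descriptors to be a pipe, and taking explicit offsets, a bounded length, and flags rather than the two-fd/~0-maxlen form sketched above. A hedged sketch of the same drain loop against the merged API (the function name and the 64 KiB chunk size are illustrative choices, not anything from the original proposal):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Drain all of 'in_fd' (a regular file) into 'pipe_wfd' (the write end
 * of a pipe) using splice() as it was eventually merged. A NULL offset
 * means "use and advance the file's own offset". Pages are moved by
 * reference, not copied. Returns 0 on EOF, -1 on error. */
int splice_file_to_pipe(int in_fd, int pipe_wfd)
{
    for (;;) {
        ssize_t n = splice(in_fd, NULL, pipe_wfd, NULL,
                           65536, SPLICE_F_MORE);
        if (n > 0)
            continue;   /* more data moved, keep going */
        if (n == 0)
            return 0;   /* EOF on the input file */
        return -1;      /* real error handling goes here */
    }
}
```

In a real server the pipe's read end would be spliced onward to a socket (the "destination" above); here it is simply a pipe, since splice() needs a pipe on one side.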
>
> Further, this also ends up giving a nice pollable interface to regular
> files too: just splice from the file (at any offset) into a pipe, and poll
> on the result. The "splice()" will just do the page cache operations and
> start the IO if necessary, the "poll()" will wait for the first page to be
> actually available. All _trivially_ done with the "struct pipe_buffer"
> operations.
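This file-to-pipe-to-poll trick works essentially verbatim with the interfaces as later merged: splice pages from a regular file into a pipe, then poll() the pipe's read end, which reports POLLIN once data is actually available. A small sketch (names and the one-second timeout are illustrative):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <poll.h>
#include <stdlib.h>
#include <unistd.h>

/* Splice up to 'len' bytes from 'file_fd' into the pipe via 'pipe_wfd',
 * then poll the pipe's read end 'pipe_rfd'. This gives a pollable view
 * of a regular file at whatever offset the file descriptor carries.
 * Returns 0 if POLLIN fired, -1 otherwise. */
int splice_then_poll(int file_fd, int pipe_wfd, int pipe_rfd, size_t len)
{
    if (splice(file_fd, NULL, pipe_wfd, NULL, len, 0) <= 0)
        return -1;

    struct pollfd pfd = { .fd = pipe_rfd, .events = POLLIN };
    if (poll(&pfd, 1, 1000) != 1)   /* wait up to 1s for readiness */
        return -1;
    return (pfd.revents & POLLIN) ? 0 : -1;
}
```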
>
> So the above kind of "send a file to another destination" should
> automatically work very naturally in any poll loop: instead of filling a
> writable pipe with a "write()", you just fill it with "splice()" instead
> (and you can read it with a 'read()' or you just splice it to somewhere
> else, or you tee() it to two destinations....).
>
> I think the media server kind of environment is the one most easily
> explained, where you have potentially tons of data that the server process
> really never actually wants to _see_ - it just wants to push it on to
> another process or connection or save it to a log-file or something. But
> as with regular pipes, it's not a specialized interface: it really is just
> a channel of communication.
>
> The difference being that a historical UNIX pipe is always a channel
> between two process spaces (ie you can only fill it and empty it into the
> process address space), and the _only_ thing I'm trying to do is to have
> it be able to be a channel between two different file descriptors too. You
> still need the process to "control" the channel, but the data doesn't have
> to touch the address space any more.
>
> Think of all the servers or other processes that really don't care about
> the data. Think of something as simple as encrypting a file, for example.
> Imagine that you have hardware encryption support that does DMA from the
> source, and writes the results using DMA. I think it's pretty obvious how
> you'd connect this up using pipes and two splices (one for the input, one
> for the output).
>
> And notice how _flexible_ it is (both the input and the output can be any
> kind of fd you want - the pipes end up doing both the "conversion" into a
> common format of "list of (possibly partial) pages" and the buffering,
> which is why the different "engines" don't need to care where the data
> comes from, or where it goes). So while you can use it to encrypt a file
> into another file, you could equally easily use it for something like
>
> tar cvf - my_source_tree | hw_engine_encrypt | splice_to_network
>
> and the whole pipeline would not have a _single_ actual data copy: the
> pipes are channels.
>
> Of course, since it's a pipe, the nice thing is that people don't have to
> use "splice()" to access it - the above pipeline has a perfectly regular
> "tar" process that probably just does regular writes. You can have a
> process that does "splice()" to fill the pipe, and the other end is a
> normal thing that just uses regular "read()" and doesn't even _know_ that
> the pipe is using new-fangled technology to be filled.
>
> I'm clearly enamoured with this concept. I think it's one of those few
> "RightThing(tm)" that doesn't come along all that often. I don't know of
> anybody else doing this, and I think it's both useful and clever. If you
> now prove me wrong, I'll hate you forever ;)
>
> Linus
Actually, this is how L4 differs from a traditional microkernel, IMHO. Isn't it?
I have a long-term plan to build a messaging subsystem on top of this.
--
Coywolf Qi Hunt
http://ahbl.org/~coywolf/