From: Coywolf Qi Hunt
To: Linus Torvalds
Cc: Oleg Nesterov, William Lee Irwin III, linux-kernel@vger.kernel.org
Subject: Re: Make pipe data structure be a circular list of pages, rather than
Date: Thu, 18 Aug 2005 14:07:02 +0800

On 1/8/05, Linus Torvalds wrote:
>
>
> On Fri, 7 Jan 2005, Oleg Nesterov wrote:
> >
> > If I understand this patch correctly, then this code
> >
> > 	for (;;)
> > 		write(pipe_fd, &byte, 1);
> >
> > will block after writing PIPE_BUFFERS == 16 characters, no?
> > And pipe_inode_info will use 64K to hold 16 bytes!
>
> Yes.
>
> > Is it ok?
>
> If you want throughput, don't do single-byte writes. Obviously we _could_
> do coalescing, but there's a reason I'd prefer to avoid it. So I consider
> it a "don't do that then", and I'll wait to see if people do. I can't
> think of anything that cares about performance that does that anyway:
> because system calls are reasonably expensive regardless, anybody who
> cares at all about performance will have buffered things up in user
> space.
>
> > Maybe it makes sense to add data to the last allocated page
> > until buf->len > PAGE_SIZE ?
>
> The reason I don't want to coalesce is that I don't ever want to modify a
> page that is on a pipe buffer (well, at least not through the pipe buffer
> - it might get modified some other way). Why? Because the long-term plan
> for pipe-buffers is to allow the data to come from _other_ sources than
> just a user space copy. For example, it might be a page directly from the
> page cache, or a partial page that contains the data part of an skb that
> just came in off the network.
>
> With this organization, a pipe ends up being able to act as a "conduit"
> for pretty much any data, including some high-bandwidth things like video
> streams, where you really _really_ don't want to copy the data.
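A quick way to see the behaviour Oleg describes is to fill a non-blocking
pipe one byte at a time and count. A minimal user-space sketch, assuming a
kernel that, like the patch under discussion, gives every small write its
own pipe-buffer slot; kernels that later grew write coalescing accept far
more (typically a full 64K):

	/* Count how many single-byte writes a fresh pipe accepts before
	 * it would block.  Without coalescing each write burns one of the
	 * PIPE_BUFFERS (16) slots; with coalescing you see ~65536. */
	#include <errno.h>
	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		int fds[2];
		long count = 0;
		char byte = 'x';

		if (pipe(fds) < 0)
			return 1;
		/* don't block; fail with EAGAIN when the pipe is full */
		fcntl(fds[1], F_SETFL, O_NONBLOCK);

		while (write(fds[1], &byte, 1) == 1)
			count++;
		printf("pipe accepted %ld single-byte writes (errno %d)\n",
		       count, errno);
		return 0;
	}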
> So the next stage is:
>
>  - allow the buffer size to be set dynamically per-pipe (probably only
>    increased by root, due to obvious issues, although a per-user limit is
>    not out of the question - it's just a "mlock" in kernel buffer space,
>    after all)
>  - add per-"struct pipe_buffer" ops pointer to a structure with
>    operation function pointers: "release()", "wait_for_ready()", "poll()"
>    (and possibly "merge()", if we want to coalesce things, although I
>    really hope we won't need to)
>  - add a "splice(fd, fd)" system call that copies pages (by incrementing
>    their reference count, not by copying the data!) from an input source
>    to the pipe, or from a pipe to an output.
>  - add a "tee(in, out1, out2)" system call that duplicates the pages
>    (again, incrementing their reference count, not copying the data) from
>    one pipe to two other pipes.
>
> All of the above is basically a few lines of code (the "splice()" thing
> requires some help from drivers/networking/pagecache etc, but it's not
> complex help, and not everybody needs to do it - I'll start off with
> _just_ a generic page cache helper to get the thing rolling, that's easy).
>
> Now, imagine using the above in a media server, for example. Let's say
> that a year or two has passed, so that the video drivers have been updated
> to be able to do the splice thing, and what can you do? You can:
>
>  - splice from the (mpeg or whatever - let's just assume that the video
>    input is either digital or does the encoding on its own - like they
>    pretty much all do) video input into a pipe (remember: no copies - the
>    video input will just DMA directly into memory, and splice will just
>    set up the pages in the pipe buffer)
>  - tee that pipe to split it up
>  - splice one end to a file (ie "save the compressed stream to disk")
>  - splice the other end to a real-time video decoder window for your
>    real-time viewing pleasure.
>
> That's the plan, at least. I think it makes sense, and the thing that
> convinced me about it was (a) how simple all of this seems to be
> implementation-wise (modulo details - but there are no "conceptually
> complex" parts: no horrid asynchronous interfaces, no questions about
> how to buffer things, no direct user access to pages that might
> partially contain protected data etc etc) and (b) it's so UNIXy. If
> there's something that says "the UNIX way", it's pipes between entities
> that act on the data.
>
> For example, let's say that you wanted to serve a file from disk (or any
> other source) with a header to another program (or to a TCP connection, or
> to whatever - it's just a file descriptor). You'd do
>
> 	fd = create_pipe_to_destination();
>
> 	input = open("filename", O_RDONLY);
> 	write(fd, "header goes here", length_of_header);
> 	for (;;) {
> 		ssize_t err;
> 		err = splice(input, fd,
> 			~0 /* maxlen */,
> 			0 /* msg flags - think "sendmsg" */);
> 		if (err > 0)
> 			continue;
> 		if (!err) /* EOF */
> 			break;
> 		.. handle input errors here ..
> 	}
>
> (obviously, if this is a real server, this would likely all be in a
> select/epoll loop, but that just gets too hard to describe concisely, so
> I'm showing the single-threaded simple version).
>
> Further, this also ends up giving a nice pollable interface to regular
> files too: just splice from the file (at any offset) into a pipe, and poll
> on the result. The "splice()" will just do the page cache operations and
> start the IO if necessary, the "poll()" will wait for the first page to be
> actually available. All _trivially_ done with the "struct pipe_buffer"
> operations.
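To make the per-buffer ops idea above concrete, here is a hypothetical
sketch of the indirection. The operation names are taken from Linus's
bullet list; the fields and signatures are guesses for illustration only,
not the interface that was eventually merged (mainline later gained a
similar table under the name struct pipe_buf_operations):

	#include <stddef.h>

	struct page;            /* kernel page descriptor, defined elsewhere */
	struct pipe_buffer;

	/* Hypothetical per-buffer ops, named after Linus's list above. */
	struct pipe_buf_operations {
		/* drop the buffer's reference on its page once consumed */
		void (*release)(struct pipe_buffer *buf);
		/* sleep until the page contents are valid (e.g. IO done) */
		int (*wait_for_ready)(struct pipe_buffer *buf);
		/* non-blocking readiness check backing poll()/select() */
		int (*poll)(struct pipe_buffer *buf);
		/* optional: append data to a partially filled page */
		int (*merge)(struct pipe_buffer *buf, const void *data,
			     size_t len);
	};

	/* Each slot of the circular buffer carries its own ops pointer, so
	 * a page-cache page, an skb fragment and a plain user-space copy
	 * can sit in the same pipe and still be released correctly. */
	struct pipe_buffer {
		struct page *page;        /* the (possibly partial) page */
		unsigned int offset, len; /* valid bytes within that page */
		const struct pipe_buf_operations *ops;
	};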
> So the above kind of "send a file to another destination" should
> automatically work very naturally in any poll loop: instead of filling a
> writable pipe with a "write()", you just fill it with "splice()" instead
> (and you can read it with a 'read()' or you just splice it to somewhere
> else, or you tee() it to two destinations....).
>
> I think the media server kind of environment is the one most easily
> explained, where you have potentially tons of data that the server process
> really never actually wants to _see_ - it just wants to push it on to
> another process or connection or save it to a log-file or something. But
> as with regular pipes, it's not a specialized interface: it really is just
> a channel of communication.
>
> The difference being that a historical UNIX pipe is always a channel
> between two process spaces (ie you can only fill it and empty it into the
> process address space), and the _only_ thing I'm trying to do is to have
> it be able to be a channel between two different file descriptors too. You
> still need the process to "control" the channel, but the data doesn't have
> to touch the address space any more.
>
> Think of all the servers or other processes that really don't care about
> the data. Think of something as simple as encrypting a file, for example.
> Imagine that you have hardware encryption support that does DMA from the
> source, and writes the results using DMA. I think it's pretty obvious how
> you'd connect this up using pipes and two splices (one for the input, one
> for the output).
>
> And notice how _flexible_ it is (both the input and the output can be any
> kind of fd you want - the pipes end up doing both the "conversion" into a
> common format of "list of (possibly partial) pages" and the buffering,
> which is why the different "engines" don't need to care where the data
> comes from, or where it goes. So while you can use it to encrypt a file
> into another file, you could equally easily use it for something like
>
> 	tar cvf - my_source_tree | hw_engine_encrypt | splice_to_network
>
> and the whole pipeline would not have a _single_ actual data copy: the
> pipes are channels.
>
> Of course, since it's a pipe, the nice thing is that people don't have to
> use "splice()" to access it - the above pipeline has a perfectly regular
> "tar" process that probably just does regular writes. You can have a
> process that does "splice()" to fill the pipe, and the other end is a
> normal thing that just uses regular "read()" and doesn't even _know_ that
> the pipe is using new-fangled technology to be filled.
>
> I'm clearly enamoured with this concept. I think it's one of those few
> "RightThing(tm)" that doesn't come along all that often. I don't know of
> anybody else doing this, and I think it's both useful and clever. If you
> now prove me wrong, I'll hate you forever ;)
>
> 		Linus

Actually, this is where L4 differs from a traditional micro-kernel, IMHO.
Isn't it? I have a long-term plan to add a messaging subsystem based on
this.
-- 
Coywolf Qi Hunt
http://ahbl.org/~coywolf/
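For readers following the thread after the fact: a splice() along these
lines was merged in Linux 2.6.17, though with a different signature than
Linus's sketch above - ssize_t splice(int fd_in, loff_t *off_in, int
fd_out, loff_t *off_out, size_t len, unsigned int flags) - and it requires
one side of each call to be a pipe, which is exactly the "pipes are
channels" design described here. A minimal relay in the spirit of the
server loop above, using the merged API:

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <unistd.h>

	/* Move everything from 'in' to 'out' through a pipe without the
	 * data ever touching a user-space buffer: one splice per
	 * direction, as in the encryption example above.  Sketch only -
	 * real code would also handle EINTR and partial progress. */
	static int relay(int in, int out)
	{
		int p[2];
		ssize_t n;

		if (pipe(p) < 0)
			return -1;
		for (;;) {
			/* source fd -> pipe: pages referenced, not copied */
			n = splice(in, NULL, p[1], NULL, 65536,
				   SPLICE_F_MOVE);
			if (n <= 0)
				break;	/* 0 means EOF, < 0 an error */
			/* pipe -> destination fd */
			while (n > 0) {
				ssize_t m = splice(p[0], NULL, out, NULL, n,
						   SPLICE_F_MOVE |
						   SPLICE_F_MORE);
				if (m <= 0) {
					n = -1;
					break;
				}
				n -= m;
			}
			if (n < 0)
				break;
		}
		close(p[0]);
		close(p[1]);
		return n < 0 ? -1 : 0;
	}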