* Re: Make pipe data structure be a circular list of pages, rather
From: linux @ 2005-01-08  8:25 UTC
To: linux-kernel, torvalds

> - add a "tee(in, out1, out2)" system call that duplicates the pages
>   (again, incrementing their reference count, not copying the data) from
>   one pipe to two other pipes.

H'm... the version that seemed natural to me was an asymmetrical one-way
tee, such as "tee(in, out, len, flags)", where the next <len> bytes are
*both* readable on fd "in" *and* copied to fd "out".  You can make it a
two-way tee with an additional "splice(in, out2, len)" call, so you haven't
lost expressiveness, and it makes three-way and higher tees easier to
construct.

But then I realized that I might be thinking about a completely different
implementation than you... I was thinking asynchronous, while perhaps you
were thinking synchronous.  A simple example of the difference:

        int main(void)
        {
                int dest = open("/dev/null", O_WRONLY);
                FILE *src = popen("/usr/bin/yes", "r");

                splice(fileno(src), dest, SPLICE_INFINITY, 0);
                return 0;
        }

Will this process exit without being killed?  I was imagining yes, it would
exit immediately, but perhaps "no" makes more sense.

Ding!  Oh, of course, it couldn't exit, or cleaning up after the following
mess would be far more difficult:

        int main(void)
        {
                int fd[2];

                pipe(fd);
                write(fd[1], "Hello, world!\n", 14);
                splice(fd[0], fd[1], SPLICE_INFINITY, 0);
                return 0;
        }

With the synchronous model, the two-output tee() call makes more sense,
too.  Still, it would be nice to be able to produce n identical output
streams without needing n-1 processes to do it.  Any ideas?  Perhaps

        int tee(int infd, int const *outfds, unsigned noutfds,
                loff_t size, unsigned flags);

As for the issue of coalescing:

> This is the main reason why I want to avoid coalescing if possible: if you
> never coalesce, then each "pipe_buffer" is complete in itself, and that
> simplifies locking enormously.
>
> (The other reason to potentially avoid coalescing is that I think it might
> be useful to allow the "sendmsg()/recvmsg()" interfaces that honour packet
> boundaries. The new pipe code _is_ internally "packetized" after all).

It is somewhat offensive that the minimum overhead for a 1-byte write is a
struct pipe_buffer plus a full page.  But yes, keeping a pipe_buffer simple
is a BIG win.  So how about the following coalescing strategy, which
complicates the reader not at all (sketched in code below):

- Each pipe-writing fd holds a reference to a page and an offset within
  that page.
- When writing some data, see if the data will fit in the remaining portion
  of the page.
- If it will not, then dump the page, allocate a fresh one, and set the
  offset to 0.
  - Possible optimization: if the page's refcount is 1 (nobody else still
    has a reference to the page, and we would free it if we dropped it),
    then just recycle it directly.
- Copy the written data (up to a maximum of 1 page) to the current write
  page.
- Bump the page's reference count (to account for the pipe_buffer pointer)
  and queue an appropriate pipe_buffer.
- Increment the offset by the amount of data written to the page.
- Decrement the amount of data remaining to be written and repeat if
  necessary.

This allocates one struct pipe_buffer (8 bytes) per write, but not a whole
page.  And it does so by exploiting the exact "we don't care what the rest
of the page is used for" semantics that make the abstraction useful.
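A user-space toy model of that write path might look like the following.
TOY_PAGE_SIZE, the struct names and the refcounting here are purely
illustrative stand-ins for the kernel structures, not kernel API:

        #include <stdlib.h>
        #include <string.h>

        #define TOY_PAGE_SIZE 4096

        struct wpage {                  /* stands in for a kernel page */
                int refcount;
                char data[TOY_PAGE_SIZE];
        };

        struct writer {                 /* per-writing-fd state */
                struct wpage *page;     /* current partially filled page */
                size_t offset;          /* next free byte in that page */
        };

        struct toy_buf {                /* stands in for struct pipe_buffer */
                struct wpage *page;
                size_t offset, len;
        };

        /* Queue one chunk: coalesce into the current page if it fits,
         * otherwise recycle or replace the page and start at offset 0. */
        static struct toy_buf write_chunk(struct writer *w,
                                          const char *src, size_t len)
        {
                struct toy_buf buf;

                if (len > TOY_PAGE_SIZE)
                        len = TOY_PAGE_SIZE;

                if (!w->page || TOY_PAGE_SIZE - w->offset < len) {
                        if (w->page && w->page->refcount == 1) {
                                /* Sole owner: recycle the page in place. */
                                w->offset = 0;
                        } else {
                                if (w->page && --w->page->refcount == 0)
                                        free(w->page);
                                w->page = calloc(1, sizeof(*w->page));
                                if (!w->page)
                                        abort();
                                w->page->refcount = 1;  /* writer's own reference */
                                w->offset = 0;
                        }
                }

                memcpy(w->page->data + w->offset, src, len);
                w->page->refcount++;            /* reference for the queued buffer */
                buf.page = w->page;
                buf.offset = w->offset;
                buf.len = len;
                w->offset += len;
                return buf;
        }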
The only waste is that, as written, every pipe-writing fd keeps a page
allocated even if the pipe is empty.  Perhaps the VM could be taught to
reclaim those references if necessary?

It's also worth documenting atomicity guarantees and poll/select semantics.
The above algorithm is careful to always queue the first PAGE_SIZE bytes of
any write atomically, which I believe is what is historically expected.  It
would be possible to have writes larger than PIPE_BUF fill in the trailing
end of any partial page as well.
* Re: Make pipe data structure be a circular list of pages, rather
From: Linus Torvalds @ 2005-01-08 18:41 UTC
To: linux; +Cc: Kernel Mailing List

On Sat, 8 Jan 2005 linux@horizon.com wrote:
>
> H'm... the version that seemed natural to me was an asymmetrical one-way
> tee, such as "tee(in, out, len, flags)", where the next <len> bytes are
> *both* readable on fd "in" *and* copied to fd "out".

Yes, the implementation may end up being something like that, I haven't
decided.

That would be more of a "peek()" kind of thing, not a "tee()", but it has
the advantage of getting rid of one unnecessary complexity of "tee()" -
dependence on three different fd's (in particular, if one of the output
fd's isn't writable, we can't do anything at all).

However, "peek()" has the problem that while it allows you to just run "n"
peek() calls for the "n" outputs, you now need to keep track of how much
each output has consumed (since they can be filled at different rates), and
then appropriately "eat" the proper amount of input when it has been
consumed by everybody.  That's a very error-prone interface.

In contrast, with a plain "tee()", you'd just create a tree "log(n)" deep
of "tee()" actors, and you'd have "n" outputs from one input, and each
individual point would be very simple.

NOTE!  I don't think either "tee()" or "peek()" really works for very high
fan-out.  I don't believe that you'd ever really have a "balanced" enough
output across many things, and since they all end up being synchronized to
the same input (you can make the buffers deeper, which will help a bit,
but..) I don't think it ends up being _that_ useful.

Where I foresee "tee()" is when there is some rate-limited input that you
can easily handle in two or three streams.  That's what the example of a
digital TV input signal was kind of meant to be - the input is
rate-limited, which means that the fact that the outputs are likely
consumed at different paces (saving to disk might be fast, showing it on
the screen would be real-time) doesn't matter.

But such a use ends up breaking down when the load is too high - you're
likely going to be constrained in the number of streams by pure load
issues.  (And if you are _not_ constrained by load issues, then you don't
need tee() at all - you could just have copied the things by hand ;)

> But then I realized that I might be thinking about a completely different
> implementation than you... I was thinking asynchronous, while perhaps
> you were thinking synchronous.

I _am_ thinking entirely synchronous.  It would copy exactly what was in
the pipe, and nothing more.  It would sleep if either end was full/empty
(modulo O_NDELAY, of course), but it would in no way be a long-term "set up
this copy mechanism" operation.

None of the pipes would do anything on their own - they aren't "actors".
They'd be purely buffers.  So if you want to have an asynchronous
long-running "tee()/copy()/peek()", you'd have a thread that does that for
you.

> With the synchronous model, the two-output tee() call makes more sense,
> too.  Still, it would be nice to be able to produce n identical output
> streams without needing n-1 processes to do it.  Any ideas?

I agree, although I don't know how common it would be for "n" to be very
big.
And you don't need n-1 processes (or threads), you can certainly do it with
the normal select-loop approach too.

> As for the issue of coalescing:
> > This is the main reason why I want to avoid coalescing if possible: if you
> > never coalesce, then each "pipe_buffer" is complete in itself, and that
> > simplifies locking enormously.
> >
> > (The other reason to potentially avoid coalescing is that I think it might
> > be useful to allow the "sendmsg()/recvmsg()" interfaces that honour packet
> > boundaries. The new pipe code _is_ internally "packetized" after all).
>
> It is somewhat offensive that the minimum overhead for a 1-byte write
> is a struct pipe_buffer plus a full page.

Note that there is no reason why we couldn't use just "kmalloc()" too.

For the different input cases we'll need per-pipe_buffer operation
functions (very much including a "release()" function) _anyway_, which
means that while the current implementation always just does "alloc_page()
+ __free_page()" for its buffer allocation, there is absolutely zero that
says that you couldn't allocate small buffers with "kmalloc()" and free
them with "kfree()".

And since we need to support per-buffer ops anyway (for people mixing
"write()" and "splice()" to set up the pipe contents), it works very
naturally for mixed-size writes too (ie you can have a small write that is
buffered in a small kmalloc'ed buffer, followed by a big write that uses
the page allocator, and nobody will care - you just decide at buffer fill
time which one you need).

So yes, the current setup is wasteful, but coalescing isn't the only way to
fight that - you can just use smaller allocations too.

So in my mind, the only reason for doing coalescing is if some silly
program deadlocks if it writes many small writes, and nobody is reading it.
Some people use pipes as buffers _within_ a program, ie use them to send
signals etc (common for signal-safe select handling - you make the signal
handler do a write() to a pipe, and the select loop has the pipe in its
FD-set).

For all of those that are valid (certainly the signal case), you need to
put the pipe into nonblocking mode anyway, so I don't think coalescing is
really needed.  SuS certainly does _not_ seem to guarantee that you can do
many nonblocking writes (small _or_ big).  Let's see if somebody reports
any trouble with the new world order.

                Linus
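For reference, the signal-safe select() pattern mentioned above (often
called the "self-pipe trick") looks roughly like the following in user
space; this is a generic sketch, not code from any particular program:

        #include <fcntl.h>
        #include <signal.h>
        #include <sys/select.h>
        #include <unistd.h>

        static int sigpipe_fd[2];

        static void handler(int sig)
        {
                char c = (char)sig;

                /* Nonblocking: if the pipe is full, the wakeup is simply dropped. */
                write(sigpipe_fd[1], &c, 1);
        }

        int main(void)
        {
                pipe(sigpipe_fd);
                fcntl(sigpipe_fd[1], F_SETFL, O_NONBLOCK);
                signal(SIGUSR1, handler);

                for (;;) {
                        fd_set rfds;
                        char c;

                        FD_ZERO(&rfds);
                        FD_SET(sigpipe_fd[0], &rfds);
                        if (select(sigpipe_fd[0] + 1, &rfds, NULL, NULL, NULL) > 0 &&
                            FD_ISSET(sigpipe_fd[0], &rfds)) {
                                read(sigpipe_fd[0], &c, 1);
                                /* handle the signal number in c here */
                        }
                }
        }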
* Re: Make pipe data structure be a circular list of pages, rather
From: Alan Cox @ 2005-01-08 21:47 UTC
To: Linus Torvalds; +Cc: linux, Linux Kernel Mailing List

On Sad, 2005-01-08 at 18:41, Linus Torvalds wrote:
> For all of those that are valid (certainly the signal case), you need to
> put the pipe into nonblocking mode anyway, so I don't think coalescing is
> really needed.  SuS certainly does _not_ seem to guarantee that you can
> do many nonblocking writes (small _or_ big).  Let's see if somebody
> reports any trouble with the new world order.

It breaks netscape 3 if I simulate it (but that's picking an app I know
makes silly assumptions).

From the rules pasted below I think the one-byte-at-a-time 4K write is
guaranteed, or at least strongly implied.  The single-write atomicity is
guaranteed, but that is if anything made easier by the changes.

(From pathconf)

    fpathconf(_PC_PIPE_BUF) returns the size of the pipe buffer, where
    filedes must refer to a pipe or FIFO and path must refer to a FIFO.
    The corresponding macro is _POSIX_PIPE_BUF.

(From pwrite)

    Write requests of {PIPE_BUF} bytes or less shall not be interleaved
    with data from other processes doing writes on the same pipe.  Writes
    of greater than {PIPE_BUF} bytes may have data interleaved, on
    arbitrary boundaries, with writes by other processes, whether or not
    the O_NONBLOCK flag of the file status flags is set.

[of O_NDELAY for pipe]

    A write request for {PIPE_BUF} or fewer bytes shall have the following
    effect: if there is sufficient space available in the pipe, write()
    shall transfer all the data and return the number of bytes requested.
    Otherwise, write() shall transfer no data and return -1 with errno set
    to [EAGAIN].
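For reference, the limit quoted above can be queried at run time with
fpathconf(); a small stand-alone sketch:

        #include <limits.h>
        #include <stdio.h>
        #include <unistd.h>

        int main(void)
        {
                int fd[2];
                long n;

                if (pipe(fd) < 0) {
                        perror("pipe");
                        return 1;
                }
                /* Largest write guaranteed to be atomic on this pipe. */
                n = fpathconf(fd[0], _PC_PIPE_BUF);
                printf("PIPE_BUF for this pipe: %ld (POSIX minimum %d)\n",
                       n, _POSIX_PIPE_BUF);
                return 0;
        }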
* Re: Make pipe data structure be a circular list of pages, rather
From: Ingo Oeser @ 2005-01-13 21:46 UTC
To: Linus Torvalds; +Cc: linux, Kernel Mailing List

Hi,

Linus Torvalds wrote:
> None of the pipes would do anything on their own - they aren't "actors".
> They'd be purely buffers.  So if you want to have an asynchronous
> long-running "tee()/copy()/peek()", you'd have a thread that does that
> for you.

Hmm, that's a pity, because it makes hardware support more difficult.

I thought you might consider a system call which "wires up" fds.

Imagine a device fd, which gets lots of measuring data, wired through a
DSP pipe, spliced to a realtime display fd and a file storage fd.  How
many buffers would you like to use here?  How many unnecessary copies are
made by the CPU?

Any realtime mass data processing needing hardware support could benefit
from "active" pipes, where either end could be drivers "knowing each
other" and deciding to route the IO in hardware instead, avoiding the trip
through the overworked memory bus and CPU.

This kind of "wire me up to my next chip" is needed quite often in the
embedded world, where power matters and special-purpose chips or even DSPs
rule.  Every chipset vendor currently ships its own libraries for this,
but a more generic approach might be better.

What do you think?

Regards

Ingo Oeser
* Re: Make pipe data structure be a circular list of pages, rather
From: Linus Torvalds @ 2005-01-13 22:32 UTC
To: Ingo Oeser; +Cc: linux, Kernel Mailing List

On Thu, 13 Jan 2005, Ingo Oeser wrote:
>
> Hmm, that's a pity, because it makes hardware support more difficult.
>
> I thought you might consider a system call which "wires up" fds.
>
> Imagine a device fd, which gets lots of measuring data, wired through a
> DSP pipe, spliced to a realtime display fd and a file storage fd.

I think that the solution to that is to make the pipe _be_ the driver
interface.

Remember: a pipe is just a set of buffers.  If you have a hardware device
that could use the buffers, then there is nothing to say that the driver
couldn't be the "actor" that fills the buffers.

So doing an "open()" on the device would just create a pipe that gets
filled by the hardware.

But that doesn't make the pipe an "actor".  The pipe just remains a
standard way to encapsulate the notion of "set of buffers".  It needs an
external actor to do something, but that actor can be a device driver
filling it up, or a system call that reads it or moves it to another
destination.

                Linus
* Re: Make pipe data structure be a circular list of pages, rather
From: Ingo Oeser @ 2005-01-14 21:03 UTC
To: Linus Torvalds; +Cc: linux, Kernel Mailing List

Hi Linus,

Linus Torvalds wrote:
> On Thu, 13 Jan 2005, Ingo Oeser wrote:
> > Hmm, that's a pity, because it makes hardware support more difficult.
> > I thought you might consider a system call which "wires up" fds.
> I think that the solution to that is to make the pipe _be_ the driver
> interface.

But we currently have no suitable interface for this.

We have socketpair(), we have pipe(), but we don't have dev_pipe()
creating a pipe with the driver sitting in the middle.

Usage: "processing devices" like crypto processors, DSPs,
MPEG-encoding/-decoding hardware, smart cards and the like.

Synopsis: int dev_pipe(const char *dev_path, int fd_vec[2]);

We also don't have wire_fds(), which would wire up two fds by connecting
the underlying file pointers with each other and closing the fds.  That
would be quite useful, because user space will not see the data anymore
anyway, if it just travels between some chips.

Usage: chips from a chipset, which usually know each other quite well and
can talk to each other without CPU intervention.

Synopsis: int wire(int in_fd, int out_fd);

> But that doesn't make the pipe an "actor".  The pipe just remains a
> standard way to encapsulate the notion of "set of buffers".  It needs an
> external actor to do something, but that actor can be a device driver
> filling it up, or a system call that reads it or moves it to another
> destination.

I implemented this some years ago for the "Linux & DSP" project, but I had
to impose ordering restrictions on open() and close() on user space, and
that is not really elegant.  Many hardware vendors have implemented it
somehow anyway.

So what about starting to do it seriously and implementing infrastructure
for both of these?  I could also do a simple example driver using both
features, if you give me some evenings of time and are interested in the
idea.

Are you interested?

Regards

Ingo Oeser, still dreaming of sys_wire() and sys_dev_pipe()

PS: Syscall names are under discussion of course ;-)
* Re: Make pipe data structure be a circular list of pages, rather
From: Linus Torvalds @ 2005-01-14 21:29 UTC
To: Ingo Oeser; +Cc: linux, Kernel Mailing List

On Fri, 14 Jan 2005, Ingo Oeser wrote:
>
> But we currently have no suitable interface for this.
>
> We have socketpair(), we have pipe(), but we don't have dev_pipe()
> creating a pipe with the driver sitting in the middle.

Not exactly that, no.

But realize that what you are asking for is actually a _much_ simpler
thing than even "fifo_open()".  What you ask for (if somebody really
starts doing this, and I'd love to see it) is not much more difficult than

        int device_open(struct inode *inode, struct file *filp)
        {
                if (!pipe_new(inode))
                        return -ENOMEM;

                /* The hardware is a writer */
                PIPE_WRITERS(*inode)++;

                /* The new fd is a reader of the data the hardware generates */
                filp->f_op = &read_fifo_fops;

                /*
                 * This starts the hardware capture - although in real
                 * life there might be a "start it" ioctl or something
                 * that actually does that together with the parameters
                 * for capture..
                 */
                start_capture_process(inode);

                return 0;
        }

ie we're definitely not talking rocket science here.

(Yes, I'm sure there are some details missing in the above, but you get
the idea: the infrastructure really _is_ pretty much there, and any lack
comes from the fact that nobody has actually ever _used_ it).

> Usage: "processing devices" like crypto processors, DSPs,
> MPEG-encoding/-decoding hardware, smart cards and the like.
>
> Synopsis: int dev_pipe(const char *dev_path, int fd_vec[2]);

No, I do NOT think that's what you'd want.  What you want is more of a

        fd = open("/dev/mpeginput", O_RDONLY);

and then that device driver open just does the thing outlined above.

> We also don't have wire_fds(), which would wire up two fds by connecting
> the underlying file pointers with each other and closing the fds.

But that is _exactly_ what splice() would do.

So you could have the above open of "/dev/mpeginput", and then you just
sit and splice the result into a file or whatever.

See?

(Or maybe it's me who doesn't see what it is you want).

                Linus
* Re: Make pipe data structure be a circular list of pages, rather
From: Ingo Oeser @ 2005-01-14 22:12 UTC
To: Linus Torvalds; +Cc: linux, Kernel Mailing List

Hi Linus,

Linus Torvalds wrote:
> On Fri, 14 Jan 2005, Ingo Oeser wrote:
> But realize that what you are asking for is actually a _much_ simpler
> thing than even "fifo_open()".  What you ask for (if somebody really
> starts doing this, and I'd love to see it) is not much more difficult
> than
>
>         int device_open(struct inode *inode, struct file *filp)
>         {
>                 if (!pipe_new(inode))
>                         return -ENOMEM;
>
>                 /* The hardware is a writer */
>                 PIPE_WRITERS(*inode)++;
>
>                 /* The new fd is a reader of the data the hardware generates */
>                 filp->f_op = &read_fifo_fops;
>
>                 /*
>                  * This starts the hardware capture - although in real
>                  * life there might be a "start it" ioctl or something
>                  * that actually does that together with the parameters
>                  * for capture..
>                  */
>                 start_capture_process(inode);
>
>                 return 0;
>         }
>
> ie we're definitely not talking rocket science here.

Data sink/source is simple, indeed.  You just implemented buffering behind
a driver's output filp, if I understand you correctly.

Now take both directions together and it gets a bit more difficult.
Driver writers basically reimplement fs/pipe.c over and over again.

        encoded stream -> write() -> decoder hardware -> read() -> decoded stream

I'm trying to model the decoder as a device here.  The same goes for
encoding.

> (Yes, I'm sure there are some details missing in the above, but you get
> the idea: the infrastructure really _is_ pretty much there, and any lack
> comes from the fact that nobody has actually ever _used_ it).

No, we have all the hardware actually doing this hidden behind other
interfaces, e.g. the DVB core, my LinuxDSP core, the upcoming hardware
crypto core and so on.  Always the same basic operations, needing the same
structure to handle them.

Maybe the solution here is just using ONE fd and read/write to it.  But
for identifying producer/consumer, you need two fds.  And now we come to
wire() (or maybe we discover that splice() is really doing the same).

> > We also don't have wire_fds(), which would wire up two fds by
> > connecting the underlying file pointers with each other and closing
> > the fds.
>
> But that is _exactly_ what splice() would do.
>
> So you could have the above open of "/dev/mpeginput", and then you just
> sit and splice the result into a file or whatever.
>
> See?

But you are still pushing everything through the page cache.  I would like
to see "fop->connect(producer, consumer)" instead of

        while (producer->not_killed && consumer->not_killed) {
                producer->read(buffer);
                consumer->write(buffer);
        }

That's what the wire() syscall is about.  If the splice() syscall is the
same and enables drivers to avoid the copying through the CPU, I'm
perfectly happy.

> (Or maybe it's me who doesn't see what it is you want).

Imagine 3 chips.  One is an encoder/decoder chip with 8 channels, the
other 2 chips are a video DAC/ADC and an audio DAC/ADC with 4 channels
each.

These chips can be wired directly by programming a wire matrix (much like
a dumb routing table).  But you can also receive/send via all of these
chips to/from the harddisk for recording/playback.

So you have to implement the drivers for these chips to provide one filp
per channel and one minor per chip.
If I can do this with splice(), you've got me, and I'm really looking
forward to your first commits/patches, since this itch has been bothering
me for a long time ;-)

Thanks and Regards

Ingo Oeser
* Re: Make pipe data structure be a circular list of pages, rather
From: Linus Torvalds @ 2005-01-14 22:44 UTC
To: Ingo Oeser; +Cc: linux, Kernel Mailing List

On Fri, 14 Jan 2005, Ingo Oeser wrote:
>
> Data sink/source is simple, indeed.  You just implemented buffering
> behind a driver's output filp, if I understand you correctly.

Yes and no.  It is indeed just buffering.

The thing I find most intriguing about it, and which is why I think it's
important, is that while it's "just" buffering, it is _standard_
buffering.

That's where it gets interesting.  Everybody needs buffers, and they are
generally so easy to implement that there's no point in having much of a
"buffer library".  But if the standard way allows you to do interesting
stuff _between_ two things, then that's where it gets interesting.  The
combinations it allows.

> Now take both directions together and it gets a bit more difficult.
> Driver writers basically reimplement fs/pipe.c over and over again.

But bidirectional is just two of those things.

> Always the same basic operations, needing the same structure to handle
> them.

Yes.

> Maybe the solution here is just using ONE fd and read/write to it.

No.  Use two totally separate fd's, and make it _cheap_ to move between
them.  That's what "splice()" gives you - basically a very low-cost way to
move between two uni-directional things.  No "memcpy()", because memcpy is
expensive for large streams of data.

> > > We also don't have wire_fds(), which would wire up two fds by
> > > connecting the underlying file pointers with each other and closing
> > > the fds.
> >
> > But that is _exactly_ what splice() would do.
> >
> > So you could have the above open of "/dev/mpeginput", and then you
> > just sit and splice the result into a file or whatever.
> >
> > See?
>
> But you are still pushing everything through the page cache.
> I would like to see "fop->connect(producer, consumer)"

No no.  There's no buffer cache involved.  There is just "buffers".

If you end up reading from a regular file (or writing to one), then yes,
the buffers end up being picked up from the buffer cache.  But that's by
no means required.  The buffers can be just anonymous pages (like the ones
a regular "write()" to a pipe generates), or they could be DMA pages
allocated for the data by a device driver.  Or they could be the page that
contains an "skb" from a networking device.

I really think that splice() is what you want, and what you call "wire()".

> Imagine 3 chips.  One is an encoder/decoder chip with 8 channels, the
> other 2 chips are a video DAC/ADC and an audio DAC/ADC with 4 channels
> each.
>
> These chips can be wired directly by programming a wire matrix (much
> like a dumb routing table).  But you can also receive/send via all of
> these chips to/from the harddisk for recording/playback.
>
> So you have to implement the drivers for these chips to provide one filp
> per channel and one minor per chip.
>
> If I can do this with splice(), you've got me, and I'm really looking
> forward to your first commits/patches, since this itch has been
> bothering me for a long time ;-)

Yes, I believe that we're talking about the same thing.  What you can do
in my vision is:

 - create a pipe for feeding the audio decoder chip.
   This is just the sound driver interface to a pipe, and it's the
   "device_open()" code I gave as an example in my previous email, except
   going the other way (ie the writer is the 'fd', and the driver is the
   "reader" of the data).

   You'd do this with a simple

        fd = open("/dev/xxxx", O_WRONLY);

   together with some ioctl (if necessary) to set up the actual
   _parameters_ for the piped device.

 - do a "splice()" from a file to the pipe.  A splice from a regular file
   is really nothing but a page cache lookup, and moving those page cache
   pages to the pipe.

 - tell the receiving fd to start processing it (again, this might be an
   ioctl - some devices may need directions on how to interpret the data,
   or it might be implicit in the fact that the pipe got woken up by being
   filled)

Going the other way (receiving data from a hardware device) is all the
same thing - and there "tee()" may be useful, since it would allow the
received data to be dup'ed to two different sinks.

                Linus
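Put together in user space, the three steps above might look roughly like
the sketch below.  The "/dev/audio-decoder" path, the parameter ioctl and
the syscall number 289 (the i386 number from the demo patch later in this
thread) are assumptions for illustration, not an existing driver API:

        #include <fcntl.h>
        #include <stdio.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        /* Assumed wrapper for the proposed 4-argument splice syscall. */
        static long sys_splice(int fdin, int fdout, size_t len, unsigned long flags)
        {
                return syscall(289, fdin, fdout, len, flags);
        }

        int main(void)
        {
                int src = open("clip.mpg", O_RDONLY);            /* regular file: page cache */
                int sink = open("/dev/audio-decoder", O_WRONLY); /* driver-backed pipe */
                long n;

                if (src < 0 || sink < 0) {
                        perror("open");
                        return 1;
                }
                /* ioctl(sink, DECODER_SET_PARAMS, &params);  hypothetical setup step */

                /* Move page-cache pages straight into the device's pipe. */
                while ((n = sys_splice(src, sink, 65536, 0)) > 0)
                        ;
                return n < 0 ? 1 : 0;
        }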
* Re: Make pipe data structure be a circular list of pages, rather
From: Ingo Oeser @ 2005-01-14 23:34 UTC
To: Linus Torvalds; +Cc: linux, Kernel Mailing List

Hi Linus,

Linus Torvalds wrote:
> That's where it gets interesting.  Everybody needs buffers, and they are
> generally so easy to implement that there's no point in having much of a
> "buffer library".

But my hardware has the buffers on chip, if connected directly to each
other.  So no system-level buffering is needed.

> No.  Use two totally separate fd's, and make it _cheap_ to move between
> them.  That's what "splice()" gives you - basically a very low-cost way
> to move between two uni-directional things.  No "memcpy()", because
> memcpy is expensive for large streams of data.

Here we agree already, so let's talk about the rest.

> If you end up reading from a regular file (or writing to one), then yes,
> the buffers end up being picked up from the buffer cache.  But that's by
> no means required.  The buffers can be just anonymous pages (like the
> ones a regular "write()" to a pipe generates), or they could be DMA
> pages allocated for the data by a device driver.  Or they could be the
> page that contains an "skb" from a networking device.
>
> I really think that splice() is what you want, and what you call
> "wire()".

That sounds really close to what I want.

Now imagine this (ASCII art in monospace font):

        [ chip A ] ------(1)------ [ chip B ]    [ CPU ]   [memory]   [disk]
             |                         |            |          |         |
             +----------(2)------------+----(2)-----+---(2)----+---------+

(1) is a direct path between these chips.
(2) is the system bus (e.g. PCI).

(1) works without any CPU intervention.
(2) needs the buffering scheme you described.

I'd like to use (1) for obvious reasons, but still allow (2).  User space
should decide which one, since it's a policy decision.  I need to open the
chips defaulting to (2), because the driver doesn't know in advance that
it will be connected to a chip it can talk to directly.  Everything else
would be quite ugly.

Please note that this is a simplified case; we actually have many chips of
type A/B, and (1) is more like a software-programmable "patch field" you
might know from a server rack.  I want user space to operate this "patch
field".  Maybe "IO router" is a descriptive term.  No?

> Yes, I believe that we're talking about the same thing.  What you can do
> in my vision is:
>
>  - create a pipe for feeding the audio decoder chip.  This is just the
>    sound driver interface to a pipe, and it's the "device_open()" code I
>    gave as an example in my previous email, except going the other way
>    (ie the writer is the 'fd', and the driver is the "reader" of the
>    data).
>
>    You'd do this with a simple fd = open("/dev/xxxx", O_WRONLY),
>    together with some ioctl (if necessary) to set up the actual
>    _parameters_ for the piped device.
>
>  - do a "splice()" from a file to the pipe.  A splice from a regular
>    file is really nothing but a page cache lookup, and moving those page
>    cache pages to the pipe.
>
>  - tell the receiving fd to start processing it (again, this might be an
>    ioctl - some devices may need directions on how to interpret the
>    data, or it might be implicit in the fact that the pipe got woken up
>    by being filled)

Yes, I understand and support your vision.  Now I would like to use path
(1), the direct one, for this.  Possible?
Maybe by making drivers optionally aware of splice() and defaulting to a
regular pipe if they don't care?

Thanks and Regards

Ingo Oeser
* Re: Make pipe data structure be a circular list of pages, rather
From: Linus Torvalds @ 2005-01-15  0:16 UTC
To: Ingo Oeser; +Cc: linux, Kernel Mailing List

On Sat, 15 Jan 2005, Ingo Oeser wrote:
>
> Now imagine this (ASCII art in monospace font):
>
>         [ chip A ] ------(1)------ [ chip B ]    [ CPU ]   [memory]   [disk]
>              |                         |            |          |         |
>              +----------(2)------------+----(2)-----+---(2)----+---------+
>
> Yes, I understand and support your vision.  Now I would like to use path
> (1), the direct one, for this.  Possible?

Yes, if I understand what you're thinking right.

I'm just assuming (for the sake of simplicity) that "A" generates all
data, and "B" is the one that uses it.  If both A and B can generate/use
data, it still works, but you'd need two pipes per endpoint, just because
pipes are fundamentally uni-directional.

Also, for this particular example, I'll also assume that whatever buffers
chips A/B use are "mappable" to the CPU too: they could be either regular
memory pages (that A/B DMA to/from), or they can be IO buffers on the card
itself that can be mapped into the CPU address space and memcpy'ed.

That second simplification is something that at least my current example
code (which I don't think I've sent to you) kind of assumes.  It's not
really _forced_ onto the design, but it avoids the need for a separate
"accessor" function to copy things around.

[ If you cannot map them into CPU space, then each "struct pipe_buffer" op
  structure would need operations to copy them to/from kernel memory and
  to copy them to/from user memory, so the second simplification basically
  avoids having to have four extra "accessor" functions. ]

What you'd do to get your topology above is:

 - create a "pipe" for both A and B (let's call them "fdA" and "fdB"
   respectively).

 - the driver is responsible for creating the "struct pipe_buffer" for any
   data generated by chip A.  If it's regular memory, it's a page
   allocation and the appropriate DMA setup, and if it's IO memory on the
   card, you'd need to generate fake "struct page *" and a mapping
   function for them.

 - if you want to send the data that A generates to both fdB _and_ write
   it to disk (file X), you'd also need one anonymous pipe (fdC), and open
   an fdX for writing to the file, and then just in your process
   effectively do:

        for (;;) {
                n = tee(fdA, fdB, fdC);
                splice(fdC, fdX, n);
        }

   and you'd get what I think you are after.

The reason you need the "fdC" is simply because the "tee()" operation
would only work on pipes: you cannot duplicate any other data structure
with "tee()" than a pipe buffer, so you couldn't do a direct "tee(fdA,
fdB, fdX)" in my world.

(The above example is _extremely_ simplified: in real life, you'd have to
take care of partial results from "splice()", error conditions, etc., of
course).

NOTE!  My first cut will _assume_ that all buffers are in RAM, so it
wouldn't support the notion of on-card memory.  On-card memory does
complicate things - not just the accessor functions, but also the notion
of how to "splice()" (or "tee()" from one to the other).

If you assume buffers-in-RAM, then all pointers are the same, and a
tee/splice really just moves a pointer around.  But if the pointer can
have magic meaning that depends on the source it came from, then splicing
such a pointer to another device must inevitably imply at least the
possibility of some kind of memory copy.
So I don't think you'll get _exactly_ what you want for a while, since
you'd have to go through in-memory buffers.  But there's no huge
conceptual problem (just enough implementation issues to make it
inconvenient) with the concept of actually keeping the buffer on an
external controller.

                Linus
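Spelled out with the partial-result and error handling that the simplified
loop above leaves out, the fdA -> {fdB, fdX} pump might look like the
sketch below.  The tee() and splice() calls here are the hypothetical
forms discussed in this thread (tee(in, out1, out2) returning the number
of bytes duplicated, plus the 4-argument splice()), assumed to be
available as wrappers; this is an illustration, not a tested program:

        #include <stddef.h>

        /* Assumed wrappers for the proposed syscalls. */
        extern long tee(int in, int out1, int out2);
        extern long splice(int in, int out, size_t len, unsigned long flags);

        /* Pump everything chip A produces to chip B and to file X. */
        static int pump(int fdA, int fdB, int fdC, int fdX)
        {
                for (;;) {
                        long n = tee(fdA, fdB, fdC);    /* duplicate one batch of buffers */

                        if (n == 0)
                                return 0;               /* producer closed: done */
                        if (n < 0)
                                return -1;              /* error on the tee side */

                        /* splice() may move less than asked for; keep going
                         * until this batch has been pushed out to the file. */
                        while (n > 0) {
                                long done = splice(fdC, fdX, n, 0);

                                if (done <= 0)
                                        return -1;
                                n -= done;
                        }
                }
        }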
* Re: Make pipe data structure be a circular list of pages, rather
From: Linus Torvalds @ 2005-01-16  2:59 UTC
To: Ingo Oeser; +Cc: linux, Kernel Mailing List

On Fri, 14 Jan 2005, Linus Torvalds wrote:
>
> So I don't think you'll get _exactly_ what you want for a while, since
> you'd have to go through in-memory buffers.  But there's no huge
> conceptual problem (just enough implementation issues to make it
> inconvenient) with the concept of actually keeping the buffer on an
> external controller.

Here is, if you are interested, a slightly updated version of my
"work-in-progress" example patch.  It:

 - depends on a fairly recent -BK (since it uses the pipe buffer ops
   interfaces that I already checked in)
 - is very ugly: the VFS interfaces to splice things really are very
   wrong indeed, and they _should_ be about passing "struct pipe_buffer"
   arrays around instead.
 - only does one specific case (splicing from the page cache directly
   into a pipe)
 - is only set up for ext3 right now (though the "generic_file_splice"
   thing should work for any page-cache user)
 - and even that specific case it does wrong, since it doesn't update the
   "f_pos" field right now
 - error handling is broken - if an IO error happens on the splice, the
   "map()" operation will return NULL, and the current fs/pipe.c can't
   handle that and will oops ;)

so it really is _purely_ a demonstration vehicle, but it at least shows
the kind of operation that I'm thinking of.

An example program to show its use is a simple:

        #include <fcntl.h>
        #include <stdlib.h>
        #include <unistd.h>
        #include <linux/unistd.h>

        #define __NR_sys_splice 289

        inline _syscall4(int, sys_splice, int, a, int, b, size_t, len, unsigned long, flags);

        int main(int argc, char **argv)
        {
                int fd = open(argv[1], O_RDONLY);

                if (fd < 0) {
                        perror("open");
                        exit(1);
                }
                write(1, "hello: ", 7);
                sys_splice(fd, 1, 1000, 0);
        }

which demonstrates three things:

 - it works on a perfectly regular pipe, with no special setup, so to use
   it you can just do

        ./test-splice my-file | less

   and the normal pipe that the shell created for this will just work.
   No new magic fd's required.

 - you can combine different sources to the pipe (mixing a traditional
   static write with a "splice from a file").

 - we start the IO when we do the splice, but the waiting for the result
   happens separately, when (or if) the data is used by the other end.

It will print out "hello: " followed by the first 1000 bytes of the file.
It's obviously not very useful for anything but demonstration, since
splicing the other way (from a pipe to another fd) isn't implemented by
this example, but it gives a more concrete example of what I'm thinking
of.  And it shows that the splicing itself is pretty simple
("move_to_pipe()" does all the moving of buffers to the pipe) - the rest
is just regular setup and finding of the pages in the page cache.

                Linus

----
# This is a BitKeeper generated diff -Nru style patch.
#
# ChangeSet
#   2005/01/15 18:36:19-08:00 torvalds@evo.osdl.org
#   sys_splice: take 3
#
#   Still horribly bad VFS interface, still doing the pipe
#   filling inside the per-fd callback instead of doing it
#   in "generic code" in fs/splice.c.
#
#   Need to pass in (and out) arrays of "struct pipe_buffer"
#   on a VFS level instead.
#
# fs/splice.c
#   2005/01/15 18:36:07-08:00 torvalds@evo.osdl.org +229 -0
#
# include/linux/fs.h
#   2005/01/15 18:36:07-08:00 torvalds@evo.osdl.org +2 -0
#   sys_splice: take 3
#
#   Still horribly bad VFS interface, still doing the pipe
#   filling inside the per-fd callback instead of doing it
#   in "generic code" in fs/splice.c.
#
#   Need to pass in (and out) arrays of "struct pipe_buffer"
#   on a VFS level instead.
#
# include/asm-i386/unistd.h
#   2005/01/15 18:36:07-08:00 torvalds@evo.osdl.org +2 -1
#   sys_splice: take 3
#
#   Still horribly bad VFS interface, still doing the pipe
#   filling inside the per-fd callback instead of doing it
#   in "generic code" in fs/splice.c.
#
#   Need to pass in (and out) arrays of "struct pipe_buffer"
#   on a VFS level instead.
#
# fs/splice.c
#   2005/01/15 18:36:07-08:00 torvalds@evo.osdl.org +0 -0
#   BitKeeper file /home/torvalds/v2.6/linux/fs/splice.c
#
# fs/ext3/file.c
#   2005/01/15 18:36:07-08:00 torvalds@evo.osdl.org +3 -0
#   sys_splice: take 3
#
#   Still horribly bad VFS interface, still doing the pipe
#   filling inside the per-fd callback instead of doing it
#   in "generic code" in fs/splice.c.
#
#   Need to pass in (and out) arrays of "struct pipe_buffer"
#   on a VFS level instead.
#
# fs/Makefile
#   2005/01/15 18:36:07-08:00 torvalds@evo.osdl.org +1 -0
#   sys_splice: take 3
#
#   Still horribly bad VFS interface, still doing the pipe
#   filling inside the per-fd callback instead of doing it
#   in "generic code" in fs/splice.c.
#
#   Need to pass in (and out) arrays of "struct pipe_buffer"
#   on a VFS level instead.
#
# arch/i386/kernel/entry.S
#   2005/01/15 18:36:07-08:00 torvalds@evo.osdl.org +1 -0
#   sys_splice: take 3
#
#   Still horribly bad VFS interface, still doing the pipe
#   filling inside the per-fd callback instead of doing it
#   in "generic code" in fs/splice.c.
#
#   Need to pass in (and out) arrays of "struct pipe_buffer"
#   on a VFS level instead.
#
diff -Nru a/arch/i386/kernel/entry.S b/arch/i386/kernel/entry.S
--- a/arch/i386/kernel/entry.S  2005-01-15 18:36:40 -08:00
+++ b/arch/i386/kernel/entry.S  2005-01-15 18:36:40 -08:00
@@ -864,5 +864,6 @@
        .long sys_add_key
        .long sys_request_key
        .long sys_keyctl
+       .long sys_splice

 syscall_table_size=(.-sys_call_table)
diff -Nru a/fs/Makefile b/fs/Makefile
--- a/fs/Makefile       2005-01-15 18:36:40 -08:00
+++ b/fs/Makefile       2005-01-15 18:36:40 -08:00
@@ -10,6 +10,7 @@
        ioctl.o readdir.o select.o fifo.o locks.o dcache.o inode.o \
        attr.o bad_inode.o file.o filesystems.o namespace.o aio.o \
        seq_file.o xattr.o libfs.o fs-writeback.o mpage.o direct-io.o \
+       splice.o

 obj-$(CONFIG_EPOLL)            += eventpoll.o
 obj-$(CONFIG_COMPAT)           += compat.o
diff -Nru a/fs/ext3/file.c b/fs/ext3/file.c
--- a/fs/ext3/file.c    2005-01-15 18:36:40 -08:00
+++ b/fs/ext3/file.c    2005-01-15 18:36:40 -08:00
@@ -101,6 +101,8 @@
        return ret;
 }

+extern ssize_t generic_file_splice_read(struct file *in, struct inode *pipe, size_t len, unsigned long flags);
+
 struct file_operations ext3_file_operations = {
        .llseek         = generic_file_llseek,
        .read           = do_sync_read,
@@ -115,6 +117,7 @@
        .release        = ext3_release_file,
        .fsync          = ext3_sync_file,
        .sendfile       = generic_file_sendfile,
+       .splice_read    = generic_file_splice_read,
 };

 struct inode_operations ext3_file_inode_operations = {
diff -Nru a/fs/splice.c b/fs/splice.c
--- /dev/null   Wed Dec 31 16:00:00 1969
+++ b/fs/splice.c       2005-01-15 18:36:40 -08:00
@@ -0,0 +1,229 @@
+/*
+ * linux/fs/splice.c
+ *
+ * "splice": joining two ropes together by interweaving their strands.
+ *
+ * This is the "extended pipe" functionality, where a pipe is used as
+ * an arbitrary in-memory buffer. Think of a pipe as a small kernel
+ * buffer that you can use to transfer data from one end to the other.
+ *
+ * The traditional unix read/write is extended with a "splice()" operation
+ * that transfers data buffers to or from a pipe buffer.
+ *
+ * Named by Larry McVoy.
+ */
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/pagemap.h>
+#include <linux/pipe_fs_i.h>
+
+static void page_cache_pipe_buf_release(struct pipe_inode_info *info, struct pipe_buffer *buf)
+{
+       page_cache_release(buf->page);
+}
+
+static void *page_cache_pipe_buf_map(struct file *file, struct pipe_inode_info *info, struct pipe_buffer *buf)
+{
+       struct page *page = buf->page;
+
+       if (!PageUptodate(page)) {
+               wait_on_page_locked(page);
+               if (!PageUptodate(page))
+                       return NULL;
+       }
+       return kmap(page);
+}
+
+static void page_cache_pipe_buf_unmap(struct pipe_inode_info *info, struct pipe_buffer *buf)
+{
+       kunmap(buf->page);
+}
+
+static struct pipe_buf_operations page_cache_pipe_buf_ops = {
+       .can_merge = 0,
+       .map = page_cache_pipe_buf_map,
+       .unmap = page_cache_pipe_buf_unmap,
+       .release = page_cache_pipe_buf_release,
+};
+
+static ssize_t move_to_pipe(struct inode *inode, struct page **array, int pages, unsigned long offset, unsigned long len)
+{
+       struct pipe_inode_info *info;
+       int ret, do_wakeup;
+
+       down(PIPE_SEM(*inode));
+       info = inode->i_pipe;
+       ret = 0;
+       do_wakeup = 0;
+       for (;;) {
+               int bufs;
+
+               if (!PIPE_READERS(*inode)) {
+                       send_sig(SIGPIPE, current, 0);
+                       if (!ret) ret = -EPIPE;
+                       break;
+               }
+               bufs = info->nrbufs;
+               if (bufs < PIPE_BUFFERS) {
+                       int newbuf = (info->curbuf + bufs) & (PIPE_BUFFERS-1);
+                       struct pipe_buffer *buf = info->bufs + newbuf;
+                       struct page *page = *array++;
+                       unsigned long this_len;
+
+                       this_len = PAGE_SIZE - offset;
+                       if (this_len > len)
+                               this_len = len;
+
+                       buf->page = page;
+                       buf->offset = offset;
+                       buf->len = this_len;
+                       buf->ops = &page_cache_pipe_buf_ops;
+                       info->nrbufs = ++bufs;
+                       do_wakeup = 1;
+
+                       ret += this_len;
+                       len -= this_len;
+                       offset = 0;
+                       if (!--pages)
+                               break;
+                       if (!len)
+                               break;
+                       if (bufs < PIPE_BUFFERS)
+                               continue;
+                       break;
+               }
+               if (signal_pending(current)) {
+                       if (!ret) ret = -ERESTARTSYS;
+                       break;
+               }
+               if (do_wakeup) {
+                       wake_up_interruptible_sync(PIPE_WAIT(*inode));
+                       kill_fasync(PIPE_FASYNC_READERS(*inode), SIGIO, POLL_IN);
+                       do_wakeup = 0;
+               }
+               PIPE_WAITING_WRITERS(*inode)++;
+               pipe_wait(inode);
+               PIPE_WAITING_WRITERS(*inode)--;
+       }
+       up(PIPE_SEM(*inode));
+       if (do_wakeup) {
+               wake_up_interruptible(PIPE_WAIT(*inode));
+               kill_fasync(PIPE_FASYNC_READERS(*inode), SIGIO, POLL_IN);
+       }
+       while (--pages >= 0)
+               page_cache_release(*++array);
+       return ret;
+}
+
+ssize_t generic_file_splice_read(struct file *in, struct inode *pipe, size_t len, unsigned long flags)
+{
+       struct address_space *mapping = in->f_mapping;
+       unsigned long long pos, size, left;
+       unsigned long index, offset, pages;
+       struct page *array[PIPE_BUFFERS];
+       int i;
+
+       pos = in->f_pos;
+       size = i_size_read(mapping->host);
+       if (pos >= size)
+               return 0;
+       left = size - pos;
+       if (left < len)
+               len = left;
+       index = pos >> PAGE_CACHE_SHIFT;
+       offset = pos & ~PAGE_CACHE_MASK;
+       pages = (len + offset + PAGE_SIZE - 1) >> PAGE_CACHE_SHIFT;
+
+       if (pages > PIPE_BUFFERS)
+               pages = PIPE_BUFFERS;
+
+       /*
+        * Get as many pages from the page cache as possible..
+        * Start IO on the page cache entries we create (we
+        * can assume that any pre-existing ones we find have
+        * already had IO started on them).
+        */
+       i = find_get_pages(mapping, index, pages, array);
+
+       /*
+        * If not all pages were in the page-cache, we'll
+        * just assume that the rest haven't been read in,
+        * so we'll get the rest locked and start IO on
+        * them if we can..
+        */
+       while (i < pages) {
+               struct page *page = find_or_create_page(mapping, index + i, GFP_USER);
+               int error;
+               if (!page)
+                       break;
+               if (PageUptodate(page)) {
+                       unlock_page(page);
+               } else {
+                       error = mapping->a_ops->readpage(in, page);
+                       if (unlikely(error)) {
+                               page_cache_release(page);
+                               break;
+                       }
+               }
+               array[i++] = page;
+       }
+
+       if (!i)
+               return 0;
+
+       /*
+        * Now we splice them into the pipe..
+        */
+       return move_to_pipe(pipe, array, i, offset, len);
+}
+
+static long do_splice_from(struct inode *pipe, struct file *out, size_t len, unsigned long flags)
+{
+       if (out->f_op && out->f_op->splice_write)
+               return out->f_op->splice_write(pipe, out, len, flags);
+       return -EINVAL;
+}
+
+static long do_splice_to(struct file *in, struct inode *pipe, size_t len, unsigned long flags)
+{
+       if (in->f_op && in->f_op->splice_read)
+               return in->f_op->splice_read(in, pipe, len, flags);
+       return -EINVAL;
+}
+
+static long do_splice(struct file *in, struct file *out, size_t len, unsigned long flags)
+{
+       struct inode *pipe;
+
+       if (!len)
+               return 0;
+       pipe = in->f_dentry->d_inode;
+       if (pipe->i_pipe)
+               return do_splice_from(pipe, out, len, flags);
+       pipe = out->f_dentry->d_inode;
+       if (pipe->i_pipe)
+               return do_splice_to(in, pipe, len, flags);
+       return -EINVAL;
+}
+
+asmlinkage long sys_splice(int fdin, int fdout, size_t len, unsigned long flags)
+{
+       long error;
+       struct file *in, *out;
+       int fput_in, fput_out;
+
+       error = -EBADF;
+       in = fget_light(fdin, &fput_in);
+       if (in) {
+               if (in->f_mode & FMODE_READ) {
+                       out = fget_light(fdout, &fput_out);
+                       if (out) {
+                               if (out->f_mode & FMODE_WRITE)
+                                       error = do_splice(in, out, len, flags);
+                               fput_light(out, fput_out);
+                       }
+               }
+               fput_light(in, fput_in);
+       }
+       return error;
+}
diff -Nru a/include/asm-i386/unistd.h b/include/asm-i386/unistd.h
--- a/include/asm-i386/unistd.h 2005-01-15 18:36:40 -08:00
+++ b/include/asm-i386/unistd.h 2005-01-15 18:36:40 -08:00
@@ -294,8 +294,9 @@
 #define __NR_add_key           286
 #define __NR_request_key       287
 #define __NR_keyctl            288
+#define __NR_splice            289

-#define NR_syscalls 289
+#define NR_syscalls 290

 /*
  * user-visible error numbers are in the range -1 - -128: see
diff -Nru a/include/linux/fs.h b/include/linux/fs.h
--- a/include/linux/fs.h        2005-01-15 18:36:40 -08:00
+++ b/include/linux/fs.h        2005-01-15 18:36:40 -08:00
@@ -944,6 +944,8 @@
        int (*check_flags)(int);
        int (*dir_notify)(struct file *filp, unsigned long arg);
        int (*flock) (struct file *, int, struct file_lock *);
+       ssize_t (*splice_write)(struct inode *, struct file *, size_t, unsigned long);
+       ssize_t (*splice_read)(struct file *, struct inode *, size_t, unsigned long);
 };

 struct inode_operations {
* Re: Make pipe data structure be a circular list of pages, rather
From: Ingo Oeser @ 2005-01-17 16:03 UTC
To: Linus Torvalds; +Cc: linux, Kernel Mailing List

Hi Linus,

Linus Torvalds wrote:
> +static long do_splice_from(struct inode *pipe, struct file *out, size_t len, unsigned long flags)
> +static long do_splice_to(struct file *in, struct inode *pipe, size_t len, unsigned long flags)
> +static long do_splice(struct file *in, struct file *out, size_t len, unsigned long flags)
> +asmlinkage long sys_splice(int fdin, int fdout, size_t len, unsigned long flags)

That part looks quite perfect.  As long as they stay like this, I'm
totally happy.  I have no problem with limiting to a length either, since
I can use that to measure progress (e.g. a simple progress bar).

This way I also keep the process as an "actor", as linux@horizon.com
pointed out.  It has unnecessary scheduling overhead, but the ability to
stop/resume the transfer by killing the process doing it is worth it, I
agree.

So I would put a structure in the inode identifying the special device and
check whether the "in" and "out" parameters are from devices suitable for
a direct on-wire transfer.  If they are, I just set up some registers and
wait for the transfer to happen.  Then I get an interrupt/wakeup once the
requested amount is streamed, increment some user space pointers, switch
to user space, user space tells me to abort or stream more, and I follow.
Continue until aborted by the user or until streaming problems happen.

Just to give you an idea: I once debugged such a machine and had a hard
hanging kernel with interrupts disabled.  It still got data from a tuner,
through an MPEG decoder and an MPEG demultiplexer, and played it to the
audio card.  Not just a buffer like ALSA/OSS, but for as long as I liked,
and it's end-to-end without any CPU intervention.

That behavior would be perfect, but I could also live with a "pushing
process".

Regards

Ingo Oeser
* Re: Make pipe data structure be a circular list of pages, rather than
From: linux @ 2005-01-19 21:12 UTC
To: linux-kernel; +Cc: linux, lm, torvalds

lm@bitmover.com wrote:
> You seem to have misunderstood the original proposal, it had little to do
> with file descriptors.  The idea was that different subsystems in the OS
> export pull() and push() interfaces and you use them.  The file
> descriptors are only involved if you provide them with those interfaces
> (which you would, it makes sense).  You are hung up on the pipe idea, the
> idea I see in my head is far more generic.  Anything can play and you
> don't need a pipe at all, you need

I was fantasizing about more generality as well.  In particular, my
original fantasy allowed data to, in theory and with compatible devices,
be read from one PCI device, passed through a series of pipes, and written
to another without ever hitting main memory - only one PCI-PCI DMA
operation performed.

A slightly more common case would be zero-copy, where data gets DMAed from
the source into memory and from memory to the destination.  That's roughly
Larry's pull/push model.

The direct DMA case requires buffer memory on one of the two cards.  (And
it would possibly be a fruitful source of hardware bugs, since I suspect
that Windows Doesn't Do That.)

Larry has the "full-page gift" optimization, which could in theory allow
data to be "renamed" straight into the page cache.  However, the page also
has to be properly aligned and not in some awkward highmem address space.
I'm not currently convinced that this would happen often enough to be
worth the extra implementation hair, but feel free to argue otherwise.
(And Larry, what's the "loan" bit for?  When is loan != !gift ?)

The big gotcha, as Larry's original paper properly points out, is handling
write errors.  We need some sort of "unpull" operation to put data back if
the destination can't accept it.  Otherwise, what do you return from
splice()?  If the source is seekable, that's easy, and a pipe isn't much
harder, but for a general character device, we need a bit of help.

The way I handle this in user-level software, to connect modules that
provide data buffering, is to split "pull" into two operations: "show me
some buffered data" and "consume some buffered data".  The first returns a
buffer pointer (to a const buffer) and a length.  (The length must be
non-zero except at EOF, but may be 1 byte.)  The second advances the
buffer pointer.  The advance distance must be no more than the length
returned previously, but may be less.

In typical single-threaded code, I allow not calling the advance function
or calling it multiple times, but they're typically called 1:1, and
requiring that would give you a good place to do locking.  A character
device, network stream, or the like would acquire an exclusive lock.  A
block device or file would not need to (or could make it a shared lock or
refcount).

The same technique can be used when writing data to a module that does
buffering: "give me some buffer space" and "okay, I filled some part of it
in".  In some devices, the latter call can fail, and the writer has to be
able to cope with that.

By allowing both of those (and, knowing that PCI writes are more efficient
than PCI reads, giving the latter preference if both are available), you
can do direct device-to-device copies on splice().
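For concreteness, a user-space sketch of that two-call interface layered
over an ordinary fd might look like the following; the struct and function
names are made up purely for illustration:

        #include <stddef.h>
        #include <unistd.h>

        struct rdbuf {
                int fd;
                char buf[4096];
                size_t head, tail;      /* consumed .. filled */
        };

        /* "Show me some buffered data": return a pointer and (via *len)
         * at least 1 byte unless at EOF/error, refilling from the fd if
         * the buffer is empty. */
        static const char *rd_peek(struct rdbuf *b, size_t *len)
        {
                if (b->head == b->tail) {
                        ssize_t n = read(b->fd, b->buf, sizeof(b->buf));

                        if (n <= 0) {
                                *len = 0;       /* EOF or error */
                                return NULL;
                        }
                        b->head = 0;
                        b->tail = (size_t)n;
                }
                *len = b->tail - b->head;
                return b->buf + b->head;
        }

        /* "Consume some buffered data": advance by no more than what the
         * previous rd_peek() returned. */
        static void rd_consume(struct rdbuf *b, size_t n)
        {
                b->head += n;
        }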
The problem with Larry's separate pull() and push() calls is that you then
need a user-visible abstraction for "pulled but not yet pushed" data,
which seems like an unnecessary abstraction violation.

The main infrastructure hassle you need to support this *universally* is
the unget() on "character devices" like pipes and network sockets.
Ideally, it would be some generic buffer front end that could be used by
the device for normal data as well as the special case.  Ooh.  Need to
think.

If there's an -EIO problem on one of the file descriptors, how does the
caller know which one?  That's an argument for separate pull and push
(although the splice() library routine still has the problem).  Any
suggestions?  Does userland need to fall back on read()/write() for a
byte?
* RE: Make pipe data structure be a circular list of pages, rather than
From: Robert White @ 2005-01-20  2:06 UTC
To: linux, linux-kernel; +Cc: lm, torvalds

Howdy,

The below (sorry about the Outlook-style includes, work _requires_ Outlook
8-() is why I previously mentioned the STREAMS implementation from SVR4
(and so Solaris etc).

Every interface point has a pointer to its peer, a pair of queues and a
put() routine.  Every message is in two or more pieces, the message header
and the message buffer/buffers (read: real data pages).  So the
"participating interface" requirement is really that the device be able to
expose a data structure consisting of four pointers that their peer can
point to.

The "message types" for the messages passing around are all contained in
the message header structures.  These headers have all the reference
counting and disposal/cleanup hook procedures (pointers) along with the
references to the actual data.  This means that you could have a stock
pool (from the slab cache) of these headers for moving plain data around,
and "magical devices" could use specialty message headers with the
necessary disposal elements or necessary "extra data" in them.

All the devices in any kind of chain/mux/pipe/whatever arrangement pass
messages to their peers by invoking peer->put(message) [in OOD parlance;
really probably (*peer->put)(peer, message) in straight C].  And modules
can be inserted into the chain/mux/pipe/whatever at runtime.

The target put() procedure decides to copy or reference-increment the
message and put it on its receive queue, or to process it immediately.  At
no time does any member of a STREAMS structure need to know any details
about the workings of their peers.

In the abstract, most communications would have message headers referring
to blocks of normal memory.  Some devices could send out messages that
reference their PCI-mapped memory or whatever, and all _that_ device would
have to do to mesh with the system is to replace the cleanup function
pointer with one appropriate to its need.  When the message has gone as
far as it is going to go, its reference count drops to zero and it gets
cleaned up.

Unget can be done with some queue insertion/manipulation.  There is no
pull(), as every target is either hot-processed or queued.  Flow control
takes place when a put() procedure returns a no-space-for-message
indication.

If you absolutely _must_ know whether the data was actually consumed by
the target (and I could argue that this is virtually never the case, as
things get lost in error-induced pipe tear-down all the time), it would
only be necessary to add a "cursor" member to the message header and, at
destruction time, see if the cursor had been advanced to "use up" the
data.  One could even have message headers that know they are dealing with

Now the message header is Linus' "impedance matcher" interface; transport
is really just about knowing your peer, calling put(), and doing reference
counting.

There were a lot of (IMHO) mistakes in the STREAMS design, but the
programming document (previously referenced in this thread) covers a lot
of the same message typing and management questions that are being raised
here.  While I can see that the fit between Linux and SVR4 STREAMS
wouldn't be identical, a good bit of the pipe-fitting (to pun 8-) and
operation definitions would be.

(I'll go back to lurking now...
8-)

Rob White, Casabyte, Inc.

^ permalink raw reply	[flat|nested] 51+ messages in thread
* RE: Make pipe data structure be a circular list of pages, rather than @ 2005-01-20 2:14 Robert White 0 siblings, 0 replies; 51+ messages in thread From: Robert White @ 2005-01-20 2:14 UTC (permalink / raw) To: 'Robert White', linux, linux-kernel; +Cc: lm, torvalds P.S. Not to reply to myself... 8-) I took some liberties with that description. STREAMS doesn't, to the best of my knowledge, have the cleanup hook stuff. That was me folding your issues (direct PCI device buffers etc) from this thread on top of the basic skeleton of STREAMS to broaden the "impedance matching" possibilities. (No really, back to lurking... 8-) Rob White, Casabyte, Inc. ^ permalink raw reply [flat|nested] 51+ messages in thread
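For readers who have not seen STREAMS, a rough sketch of the message header Robert is describing may help. Every name below is invented for illustration - this is not the real SVR4 mblk_t/dblk_t layout, and the cleanup hook is the extension he mentions above rather than stock STREAMS:

    #include <stddef.h>

    struct msg_buf;                                 /* chain of real data buffers (pages) */

    struct msg_hdr {
            int             type;                   /* plain data, "this is an fd", control, ... */
            int             refcnt;                 /* shared by every queue holding a reference */
            struct msg_buf  *bufs;                  /* the actual payload */
            size_t          offset, len;            /* optional cursor into the payload */
            void            (*release)(struct msg_hdr *m); /* disposal hook: slab free, DMA unmap, ... */
            struct msg_hdr  *next;                  /* queue linkage */
    };

    struct stream_point {
            struct stream_point *peer;              /* who we hand messages to */
            struct msg_hdr      *read_q, *write_q;  /* the pair of queues */
            /* returns 0, or a no-space indication used for flow control */
            int (*put)(struct stream_point *self, struct msg_hdr *m);
    };

    /* Transport is then just: err = peer->put(peer, msg); the target either
     * copies or reference-increments the message, queues it, or processes it
     * on the spot, and the header is freed via its own release() hook when
     * the refcount drops to zero. */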
[parent not found: <Pine.LNX.4.44.0501091946020.3620-100000@localhost.localdomain>]
[parent not found: <Pine.LNX.4.58.0501091713300.2373@ppc970.osdl.org>]
[parent not found: <Pine.LNX.4.58.0501091830120.2373@ppc970.osdl.org>]
* Re: Make pipe data structure be a circular list of pages, rather than [not found] ` <Pine.LNX.4.58.0501091830120.2373@ppc970.osdl.org> @ 2005-01-12 19:50 ` Davide Libenzi 2005-01-12 20:10 ` Linus Torvalds 0 siblings, 1 reply; 51+ messages in thread From: Davide Libenzi @ 2005-01-12 19:50 UTC (permalink / raw) To: Linus Torvalds; +Cc: Linux Kernel Mailing List, Hugh Dickins On Sun, 9 Jan 2005, Linus Torvalds wrote: > On Sun, 9 Jan 2005, Linus Torvalds wrote: > > > > Since you guys stupidly showed interest, here's a very first-order > > approximation of filling the pipe from some other source. > > Here's a somewhat fixed and tested version, which actually does something > on x86. Question. How do you think to splice() skb pages, or any other non page-based format? - Davide ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: Make pipe data structure be a circular list of pages, rather than
  2005-01-12 19:50 ` Davide Libenzi
@ 2005-01-12 20:10 ` Linus Torvalds
  0 siblings, 0 replies; 51+ messages in thread
From: Linus Torvalds @ 2005-01-12 20:10 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Linux Kernel Mailing List, Hugh Dickins

On Wed, 12 Jan 2005, Davide Libenzi wrote:
> On Sun, 9 Jan 2005, Linus Torvalds wrote:
> >
> > On Sun, 9 Jan 2005, Linus Torvalds wrote:
> > >
> > > Since you guys stupidly showed interest, here's a very first-order
> > > approximation of filling the pipe from some other source.
> >
> > Here's a somewhat fixed and tested version, which actually does something
> > on x86.
>
> Question. How do you think to splice() skb pages, or any other non page-based
> format?

That's why there's an "offset/length" pair. Right now the only limitation is actually

 - that at least the start of the area can be described as a "struct page *"
   (this means that things like a PCI MMIO mapping cannot be spliced - but
   that's true for other reasons anyway, since a "memcpy()" doesn't even
   work on such things on most architectures)
 - that it be mappable with kmap() (this means that if it's a multi-page
   thing, it needs to be in low memory currently).

The second thing would actually go away in a full implementation anyway, to be replaced by "map this page in" and "unmap this page" buffer operations. We need those "map" operations in order to handle other "prepare for copy" situations, notably waiting for disk IO to be finished (ie I want to be able to splice pages to the pipe without having to wait for them - you'd wait for the contents only when the data itself is needed).

My example patch (which I didn't post publicly because it's a work-in-progress) already made offset/length be "unsigned int", exactly because I foresee the "page" possibly being part of a hugetlb page, or at least a bigger multi-page allocation.

[ Side note: even the first issue - PCI MMIO or other descriptor that doesn't normally have a "struct page *" associated with it - could in theory be handled by just creating a fake "struct page", and hiding the information in it. Then the map/unmap functions would just have to return the virtual address that is usable for a memcpy - since on _some_ architectures you can actually do a memcpy on IO space too. That might be useful for things like AGP or similar that depend on architecture-specific issues anyway ]

Linus

^ permalink raw reply	[flat|nested] 51+ messages in thread
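To make the offset/length and map/unmap points concrete, here is a sketch of the per-buffer structure and operations being described; the names and layout are illustrative kernel-style declarations only, and do not match the pipe_buf_operations that were eventually merged:

    struct pipe_buffer;

    struct pipe_buf_ops {
            /* make the data addressable: kmap() today, but for a page-cache
             * page this could also wait for the read I/O to complete */
            void *(*map)(struct pipe_buffer *buf);
            void  (*unmap)(struct pipe_buffer *buf, void *addr);
            /* drop the producer's reference: put_page(), free an skb fragment, ... */
            void  (*release)(struct pipe_buffer *buf);
    };

    struct pipe_buffer {
            struct page         *page;        /* may be one page of a larger allocation */
            unsigned int        offset, len;  /* byte range within that page */
            struct pipe_buf_ops *ops;
    };

A fake "struct page" wrapping a PCI MMIO range, as in the side note, would then simply come with map/unmap implementations that return the IO virtual address instead of calling kmap().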
* Re: Make pipe data structure be a circular list of pages, rather than
@ 2005-01-07 14:30 Oleg Nesterov
2005-01-07 15:45 ` Alan Cox
2005-01-07 16:17 ` Linus Torvalds
0 siblings, 2 replies; 51+ messages in thread
From: Oleg Nesterov @ 2005-01-07 14:30 UTC (permalink / raw)
To: Linus Torvalds; +Cc: William Lee Irwin III, linux-kernel
Hello.
pipe_writev:
> + if (bufs < PIPE_BUFFERS) {
> + ssize_t chars;
> + int newbuf = (info->curbuf + bufs) & (PIPE_BUFFERS-1);
If I understand this patch correctly, then this code
for (;;)
write(pipe_fd, &byte, 1);
will block after writing PIPE_BUFFERS == 16 characters, no?
And pipe_inode_info will use 64K to hold 16 bytes!
Is it ok?
Maybe it makes sense to add data to the last allocated page
until buf->len > PAGE_SIZE?
Oleg.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: Make pipe data structure be a circular list of pages, rather than 2005-01-07 14:30 Oleg Nesterov @ 2005-01-07 15:45 ` Alan Cox 2005-01-07 17:23 ` Linus Torvalds 2005-01-07 16:17 ` Linus Torvalds 1 sibling, 1 reply; 51+ messages in thread From: Alan Cox @ 2005-01-07 15:45 UTC (permalink / raw) To: Oleg Nesterov Cc: Linus Torvalds, William Lee Irwin III, Linux Kernel Mailing List On Gwe, 2005-01-07 at 14:30, Oleg Nesterov wrote: > will block after writing PIPE_BUFFERS == 16 characters, no? > And pipe_inode_info will use 64K to hold 16 bytes! > > Is it ok? That would break stuff, but holding the last page until it fills 4K would work, or just basic sg coalescing when possible. The pipe behaviour - particularly size and size of atomic writes is defined by SuS and there are people who use pipes two ways between apps and use the guarantees to avoid deadlocks ^ permalink raw reply [flat|nested] 51+ messages in thread
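The guarantee Alan is leaning on is that a write of at most PIPE_BUF bytes is all-or-nothing, which is what lets two applications run a pipe in each direction (or several writers share one pipe) without records interleaving or partially completing. A minimal illustration, with an invented record type:

    #include <limits.h>     /* PIPE_BUF: at least 512 by POSIX, 4096 on Linux */
    #include <unistd.h>

    struct record {         /* hypothetical fixed-size event record */
            int  pid;
            int  event;
            char payload[64];
    };

    static void log_record(int pipe_fd, const struct record *r)
    {
            /* sizeof(*r) is well under PIPE_BUF, so this write is atomic even
             * with many writers sharing the same pipe */
            (void)write(pipe_fd, r, sizeof(*r));
    }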
* Re: Make pipe data structure be a circular list of pages, rather than 2005-01-07 15:45 ` Alan Cox @ 2005-01-07 17:23 ` Linus Torvalds 2005-01-08 18:25 ` Hugh Dickins 0 siblings, 1 reply; 51+ messages in thread From: Linus Torvalds @ 2005-01-07 17:23 UTC (permalink / raw) To: Alan Cox; +Cc: Oleg Nesterov, William Lee Irwin III, Linux Kernel Mailing List On Fri, 7 Jan 2005, Alan Cox wrote: > > That would break stuff, but holding the last page until it fills 4K > would work, or just basic sg coalescing when possible. The pipe > behaviour - particularly size and size of atomic writes is defined by > SuS and there are people who use pipes two ways between apps and use the > guarantees to avoid deadlocks Hmm.. Are there really any other guarantees than the atomicity guarantee for a PIPE_BUF transfer? I don't see any in SuS.. That said, existing practice obviously always trumps paper standards, and yes, coalescing is possible. Linus ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: Make pipe data structure be a circular list of pages, rather than 2005-01-07 17:23 ` Linus Torvalds @ 2005-01-08 18:25 ` Hugh Dickins 2005-01-08 18:54 ` Linus Torvalds 0 siblings, 1 reply; 51+ messages in thread From: Hugh Dickins @ 2005-01-08 18:25 UTC (permalink / raw) To: Linus Torvalds Cc: Alan Cox, Oleg Nesterov, William Lee Irwin III, Linux Kernel Mailing List On Fri, 7 Jan 2005, Linus Torvalds wrote: > > Hmm.. Are there really any other guarantees than the atomicity guarantee > for a PIPE_BUF transfer? I don't see any in SuS.. > > That said, existing practice obviously always trumps paper standards, and > yes, coalescing is possible. Would a kernel build with "make -j18" count as existing practice? Immediately hangs for me... Hugh ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: Make pipe data structure be a circular list of pages, rather than 2005-01-08 18:25 ` Hugh Dickins @ 2005-01-08 18:54 ` Linus Torvalds 0 siblings, 0 replies; 51+ messages in thread From: Linus Torvalds @ 2005-01-08 18:54 UTC (permalink / raw) To: Hugh Dickins Cc: Alan Cox, Oleg Nesterov, William Lee Irwin III, Linux Kernel Mailing List On Sat, 8 Jan 2005, Hugh Dickins wrote: > > Would a kernel build with "make -j18" count as existing practice? Absolutely. Looks like it does a single-byte write of '+' for each process started (and I assume it will do a '-' for each process that exits ;) Oh, well. So much for my hope to avoid coalescing. Linus ^ permalink raw reply [flat|nested] 51+ messages in thread
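What make is doing there is its jobserver: job slots are single-byte tokens circulating in a pipe shared by every sub-make. A simplified sketch of the pattern (the real GNU make protocol has more detail around signals and how the initial slots are counted):

    #include <unistd.h>

    static int job_fds[2];                  /* one pipe(job_fds) shared by all sub-makes */

    static void seed_tokens(int jobs)       /* top-level make: one token per extra job slot */
    {
            for (int i = 0; i < jobs - 1; i++)
                    (void)write(job_fds[1], "+", 1);
    }

    static void acquire_slot(void)          /* sub-make: block until a slot is free */
    {
            char token;
            while (read(job_fds[0], &token, 1) != 1)
                    ;                       /* interrupted by a signal: retry */
    }

    static void release_slot(void)          /* sub-make: hand the slot back when a job exits */
    {
            (void)write(job_fds[1], "+", 1);
    }

Every token is a 1-byte write, which is exactly the pattern a strictly non-coalescing pipe penalizes.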
* Re: Make pipe data structure be a circular list of pages, rather than
  2005-01-07 14:30 Oleg Nesterov
  2005-01-07 15:45 ` Alan Cox
@ 2005-01-07 16:17 ` Linus Torvalds
  2005-01-07 16:06   ` Alan Cox
  ` (2 more replies)
  1 sibling, 3 replies; 51+ messages in thread
From: Linus Torvalds @ 2005-01-07 16:17 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: William Lee Irwin III, linux-kernel

On Fri, 7 Jan 2005, Oleg Nesterov wrote:
>
> If I understand this patch correctly, then this code
>
> 	for (;;)
> 		write(pipe_fd, &byte, 1);
>
> will block after writing PIPE_BUFFERS == 16 characters, no?
> And pipe_inode_info will use 64K to hold 16 bytes!

Yes.

> Is it ok?

If you want throughput, don't do single-byte writes. Obviously we _could_ do coalescing, but there's a reason I'd prefer to avoid it. So I consider it a "don't do that then", and I'll wait to see if people do. I can't think of anything that cares about performance that does that anyway: because system calls are reasonably expensive regardless, anybody who cares at all about performance will have buffered things up in user space.

> Maybe it makes sense to add data to the last allocated page
> until buf->len > PAGE_SIZE?

The reason I don't want to coalesce is that I don't ever want to modify a page that is on a pipe buffer (well, at least not through the pipe buffer - it might get modified some other way). Why? Because the long-term plan for pipe-buffers is to allow the data to come from _other_ sources than just a user space copy. For example, it might be a page directly from the page cache, or a partial page that contains the data part of an skb that just came in off the network.

With this organization, a pipe ends up being able to act as a "conduit" for pretty much any data, including some high-bandwidth things like video streams, where you really _really_ don't want to copy the data. So the next stage is:

 - allow the buffer size to be set dynamically per-pipe (probably only
   increased by root, due to obvious issues, although a per-user limit is
   not out of the question - it's just a "mlock" in kernel buffer space,
   after all)
 - add a per-"struct pipe_buffer" ops pointer to a structure with
   operation function pointers: "release()", "wait_for_ready()", "poll()"
   (and possibly "merge()", if we want to coalesce things, although I
   really hope we won't need to)
 - add a "splice(fd, fd)" system call that copies pages (by incrementing
   their reference count, not by copying the data!) from an input source
   to the pipe, or from a pipe to an output.
 - add a "tee(in, out1, out2)" system call that duplicates the pages
   (again, incrementing their reference count, not copying the data) from
   one pipe to two other pipes.

All of the above is basically a few lines of code (the "splice()" thing requires some help from drivers/networking/pagecache etc, but it's not complex help, and not everybody needs to do it - I'll start off with _just_ a generic page cache helper to get the thing rolling, that's easy).

Now, imagine using the above in a media server, for example. Let's say that a year or two has passed, so that the video drivers have been updated to be able to do the splice thing, and what can you do?
You can:

 - splice from the (mpeg or whatever - let's just assume that the video
   input is either digital or does the encoding on its own - like they
   pretty much all do) video input into a pipe (remember: no copies - the
   video input will just DMA directly into memory, and splice will just
   set up the pages in the pipe buffer)
 - tee that pipe to split it up
 - splice one end to a file (ie "save the compressed stream to disk")
 - splice the other end to a real-time video decoder window for your
   real-time viewing pleasure.

That's the plan, at least. I think it makes sense, and the thing that convinced me about it was (a) how simple all of this seems to be implementation-wise (modulo details - but there are no "conceptually complex" parts: no horrid asynchronous interfaces, no questions about how to buffer things, no direct user access to pages that might partially contain protected data etc etc) and (b) it's so UNIXy. If there's something that says "the UNIX way", it's pipes between entities that act on the data.

For example, let's say that you wanted to serve a file from disk (or any other source) with a header to another program (or to a TCP connection, or to whatever - it's just a file descriptor). You'd do

	fd = create_pipe_to_destination();
	input = open("filename", O_RDONLY);

	write(fd, "header goes here", length_of_header);
	for (;;) {
		ssize_t err;
		err = splice(input, fd,
			~0 /* maxlen */,
			0 /* msg flags - think "sendmsg" */);
		if (err > 0)
			continue;
		if (!err) /* EOF */
			break;
		.. handle input errors here ..
	}

(obviously, if this is a real server, this would likely all be in a select/epoll loop, but that just gets too hard to describe concisely, so I'm showing the single-threaded simple version).

Further, this also ends up giving a nice pollable interface to regular files too: just splice from the file (at any offset) into a pipe, and poll on the result. The "splice()" will just do the page cache operations and start the IO if necessary, the "poll()" will wait for the first page to be actually available. All _trivially_ done with the "struct pipe_buffer" operations.

So the above kind of "send a file to another destination" should automatically work very naturally in any poll loop: instead of filling a writable pipe with a "write()", you just fill it with "splice()" instead (and you can read it with a 'read()' or you just splice it to somewhere else, or you tee() it to two destinations....).

I think the media server kind of environment is the one most easily explained, where you have potentially tons of data that the server process really never actually wants to _see_ - it just wants to push it on to another process or connection or save it to a log-file or something. But as with regular pipes, it's not a specialized interface: it really is just a channel of communication.

The difference being that a historical UNIX pipe is always a channel between two process spaces (ie you can only fill it and empty it into the process address space), and the _only_ thing I'm trying to do is to have it be able to be a channel between two different file descriptors too. You still need the process to "control" the channel, but the data doesn't have to touch the address space any more.

Think of all the servers or other processes that really don't care about the data. Think of something as simple as encrypting a file, for example. Imagine that you have hardware encryption support that does DMA from the source, and writes the results using DMA.
I think it's pretty obvious how you'd connect this up using pipes and two splices (one for the input, one for the output).

And notice how _flexible_ it is (both the input and the output can be any kind of fd you want - the pipes end up doing both the "conversion" into a common format of "list of (possibly partial) pages" and the buffering, which is why the different "engines" don't need to care where the data comes from, or where it goes). So while you can use it to encrypt a file into another file, you could equally easily use it for something like

	tar cvf - my_source_tree | hw_engine_encrypt | splice_to_network

and the whole pipeline would not have a _single_ actual data copy: the pipes are channels.

Of course, since it's a pipe, the nice thing is that people don't have to use "splice()" to access it - the above pipeline has a perfectly regular "tar" process that probably just does regular writes. You can have a process that does "splice()" to fill the pipe, and the other end is a normal thing that just uses regular "read()" and doesn't even _know_ that the pipe is using new-fangled technology to be filled.

I'm clearly enamoured with this concept. I think it's one of those few "RightThing(tm)" that doesn't come along all that often. I don't know of anybody else doing this, and I think it's both useful and clever. If you now prove me wrong, I'll hate you forever ;)

Linus

^ permalink raw reply	[flat|nested] 51+ messages in thread
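For what it's worth, the splice() and tee() calls that eventually reached mainline (2.6.17) take pipe descriptors plus optional offsets and flags rather than the two-argument form sketched above, but the media-server split described earlier maps onto them directly. A rough sketch using those later prototypes, with error handling and short transfers glossed over:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    /* video_in must be a splice-capable source; disk_out and decoder_out are
     * whatever descriptors the saved stream and the viewer read from. */
    static void forward(int video_in, int disk_out, int decoder_out)
    {
            int to_disk[2], to_decoder[2];

            pipe(to_disk);
            pipe(to_decoder);

            for (;;) {
                    /* move input pages into the first pipe: reference counts, no copy */
                    ssize_t n = splice(video_in, NULL, to_disk[1], NULL,
                                       65536, SPLICE_F_MOVE);
                    if (n <= 0)
                            break;
                    /* duplicate those page references into the second pipe */
                    tee(to_disk[0], to_decoder[1], n, 0);
                    /* drain each pipe to its consumer */
                    splice(to_disk[0], NULL, disk_out, NULL, n, SPLICE_F_MOVE);
                    splice(to_decoder[0], NULL, decoder_out, NULL, n, SPLICE_F_MOVE);
            }
    }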
* Re: Make pipe data structure be a circular list of pages, rather than
  2005-01-07 16:17 ` Linus Torvalds
@ 2005-01-07 16:06 ` Alan Cox
  2005-01-07 17:33   ` Linus Torvalds
  2005-01-07 17:45   ` Chris Friesen
  2005-01-07 16:39 ` Davide Libenzi
  2005-08-18  6:07 ` Coywolf Qi Hunt
  2 siblings, 2 replies; 51+ messages in thread
From: Alan Cox @ 2005-01-07 16:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, William Lee Irwin III, Linux Kernel Mailing List

On Gwe, 2005-01-07 at 16:17, Linus Torvalds wrote:
> it a "don't do that then", and I'll wait to see if people do. I can't
> think of anything that cares about performance that does that anyway:
> because system calls are reasonably expensive regardless, anybody who
> cares at all about performance will have buffered things up in user space.

Actually I found a load of apps that do this but they don't care about performance. Lots of people have signal handlers that just do

	write(pipe_fd, &signr, sizeof(signr))

so they can drop signalled events into their event loops.

> > Maybe it makes sense to add data to the last allocated page
> > until buf->len > PAGE_SIZE?
>
> The reason I don't want to coalesce is that I don't ever want to modify a
> page that is on a pipe buffer (well, at least not through the pipe buffer

If I can't write 4096 bytes down it one at a time without blocking from an empty pipe then it's not a pipe in the eyes of the Unix world and the standards.

> With this organization, a pipe ends up being able to act as a "conduit"
> for pretty much any data, including some high-bandwidth things like video
> streams, where you really _really_ don't want to copy the data. So the
> next stage is:

The data copying impact isn't very high even if it is just done for the pipe() case for standards behaviour. You end up with one page that is written to and then sent and then freed rather than many.

Possibly what should be done is to not use pipe() for this at all but to create a more explicit object, so we don't confuse existing code and code that reuses buffers heavily (plenty of that around). Add "sewer()"[1] or "conduit()" and we don't have to change the behaviour of the Unix pipe world and risk screwing up apps, and you don't have to write grungy corner cases to ruin the plan.

^ permalink raw reply	[flat|nested] 51+ messages in thread
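The pattern Alan mentions is the classic "self-pipe trick": the handler's only job is one small async-signal-safe write(), and the event loop then treats the signal like any other readable descriptor. A minimal sketch:

    #include <signal.h>
    #include <unistd.h>

    static int sig_pipe[2];                 /* set up once with pipe(sig_pipe) */

    static void on_signal(int signr)
    {
            /* write() is async-signal-safe, and sizeof(int) is far below
             * PIPE_BUF, so the write is atomic and never interleaves */
            (void)write(sig_pipe[1], &signr, sizeof(signr));
    }

    /* installed with sigaction(); the main select()/poll() loop reads
     * sizeof(int) bytes from sig_pipe[0] whenever it becomes readable */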
* Re: Make pipe data structure be a circular list of pages, rather than
  2005-01-07 16:06 ` Alan Cox
@ 2005-01-07 17:33 ` Linus Torvalds
  2005-01-07 17:48   ` Linus Torvalds
  2005-01-07 17:45 ` Chris Friesen
  1 sibling, 1 reply; 51+ messages in thread
From: Linus Torvalds @ 2005-01-07 17:33 UTC (permalink / raw)
  To: Alan Cox; +Cc: Oleg Nesterov, William Lee Irwin III, Linux Kernel Mailing List

On Fri, 7 Jan 2005, Alan Cox wrote:
> >
> > The reason I don't want to coalesce is that I don't ever want to modify a
> > page that is on a pipe buffer (well, at least not through the pipe buffer
>
> If I can't write 4096 bytes down it one at a time without blocking from
> an empty pipe then it's not a pipe in the eyes of the Unix world and the
> standards.

Absolutely. In fact, with the new implementation, you can often write _several_ packets of 4096 bytes without blocking (but only writes less than PIPE_BUF are guaranteed to be done all-or-nothing).

I'm very aware of the atomicity guarantees, I'm just saying that if you try to write 4096 bytes by doing it one byte at a time, that has changed.

> > With this organization, a pipe ends up being able to act as a "conduit"
> > for pretty much any data, including some high-bandwidth things like video
> > streams, where you really _really_ don't want to copy the data. So the
> > next stage is:
>
> The data copying impact isn't very high even if it is just done for the
> pipe() case for standards behaviour. You end up with one page that is
> written to and then sent and then freed rather than many.

I absolutely agree. A regular read()/write() still copies the data, and that's because I'm a firm believer that copying even a few kB of data is likely to be cheaper than trying to play MM games (not just the lookup of the physical address - all the locking, COW, etc crud that VM games require).

So while this shares some of the issues with the zero-copy pipes of yore, it doesn't actually do any of that for regular pipe read/writes. And never will, as far as I'm concerned. I just don't think user zero-copy is interesting at that level: if the user wants to access something without copying, he uses "mmap()". So only when the data is _not_ in user VM space, that's when increasing a reference count is cheaper than copying. Pretty much by definition, you already have a "struct page *" at that point, along with which part of the page contains the data.

So the "standard behaviour" (aka just plain read/write on the pipe) is all the same copies that it used to be. The "just move pages around" issue only happens when you want to duplicate the stream, or if you splice around stuff that is already in kernel buffers (or needs a kernel buffer anyway).

Linus

^ permalink raw reply	[flat|nested] 51+ messages in thread
* Re: Make pipe data structure be a circular list of pages, rather than 2005-01-07 17:33 ` Linus Torvalds @ 2005-01-07 17:48 ` Linus Torvalds 2005-01-07 20:59 ` Mike Waychison ` (2 more replies) 0 siblings, 3 replies; 51+ messages in thread From: Linus Torvalds @ 2005-01-07 17:48 UTC (permalink / raw) To: Alan Cox; +Cc: Oleg Nesterov, William Lee Irwin III, Linux Kernel Mailing List On Fri, 7 Jan 2005, Linus Torvalds wrote: > > So the "standard behaviour" (aka just plain read/write on the pipe) is all > the same copies that it used to be. I want to highlight this again. The increase in throughput did _not_ come from avoiding a copy. It came purely from the fact that we have multiple buffers, and thus a throughput-intensive load gets to do several bigger chunks before it needs to wait for the recipient. So the increase in throughput comes from fewer synchronization points (scheduling and locking), not from anything else. Another way of saying the same thing: pipes actually used to have clearly _lower_ bandwidth than UNIX domain sockets, even though clearly pipes are simpler and should thus be able to be faster. The reason? UNIX domain sockets allow multiple packets in flight, and pipes only used to have a single buffer. With the new setup, pipes get roughly comparable performance to UNIX domain sockets for me. Sockets can still outperform pipes, truth be told - there's been more work on socket locking than on pipe locking. For example, the new pipe code should conceptually really allow one CPU to empty one pipe buffer while another CPU fills another pipe buffer, but because I kept the locking structure the same as it used to be, right now read/write serialize over the whole pipe rather than anything else. This is the main reason why I want to avoid coalescing if possible: if you never coalesce, then each "pipe_buffer" is complete in itself, and that simplifies locking enormously. (The other reason to potentially avoid coalescing is that I think it might be useful to allow the "sendmsg()/recvmsg()" interfaces that honour packet boundaries. The new pipe code _is_ internally "packetized" after all). Linus ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: Make pipe data structure be a circular list of pages, rather than
  2005-01-07 17:48 ` Linus Torvalds
@ 2005-01-07 20:59 ` Mike Waychison
  2005-01-07 23:46   ` Chris Friesen
  2005-01-07 21:59 ` Linus Torvalds
  2005-01-10 23:23 ` Robert White
  2 siblings, 1 reply; 51+ messages in thread
From: Mike Waychison @ 2005-01-07 20:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Cox, Oleg Nesterov, William Lee Irwin III, Linux Kernel Mailing List

Linus Torvalds wrote:
> On Fri, 7 Jan 2005, Linus Torvalds wrote:
>
>> So the "standard behaviour" (aka just plain read/write on the pipe) is all
>> the same copies that it used to be.
>
> I want to highlight this again. The increase in throughput did _not_ come
> from avoiding a copy. It came purely from the fact that we have multiple
> buffers, and thus a throughput-intensive load gets to do several bigger
> chunks before it needs to wait for the recipient. So the increase in
> throughput comes from fewer synchronization points (scheduling and
> locking), not from anything else.
>
> Another way of saying the same thing: pipes actually used to have clearly
> _lower_ bandwidth than UNIX domain sockets, even though clearly pipes are
> simpler and should thus be able to be faster. The reason? UNIX domain
> sockets allow multiple packets in flight, and pipes only used to have a
> single buffer. With the new setup, pipes get roughly comparable
> performance to UNIX domain sockets for me.
>
> Sockets can still outperform pipes, truth be told - there's been more
> work on socket locking than on pipe locking. For example, the new pipe
> code should conceptually really allow one CPU to empty one pipe buffer
> while another CPU fills another pipe buffer, but because I kept the
> locking structure the same as it used to be, right now read/write
> serialize over the whole pipe rather than anything else.
>
> This is the main reason why I want to avoid coalescing if possible: if you
> never coalesce, then each "pipe_buffer" is complete in itself, and that
> simplifies locking enormously.

This got me to thinking about how you can heuristically optimize away coalescing support and still allow PAGE_SIZE bytes minimum in the effective buffer.

Method 1) Simply keep track of how much data is in the effective buffer. If the count is < PAGE_SIZE, then coalesce, otherwise, don't. Very simple, but means a high traffic pipe might see 'stuttering' between coalescing mode and fastpath.

Method 2) It would be interesting to see the performance of a simple heuristic whereby writes smaller than a given threshold (PAGE_SIZE >> 3?) are coalesced with the previous iff the previous also was less than the threshold and there is room for the write in the given page.

With this heuristic, the minimum size of the effective buffer is only possible if the sequence consists of alternating threshold (T) bytes and 1 byte writes. If 'number of pages in circular buffer' (C) is even, this produces a minimum of (T*C + C)/2 bytes in the effective buffer.

We can figure out a minimum T value for a given page count C by isolating T from:

	(T*C + C)/2 = PAGE_SIZE

which is:

	T = (2 * PAGE_SIZE / C) - 1

So with the assumption that we only coalesce if

 - the previous write count was < T
 - the current write count is < T
 - there is room for the current write count in the current page

and if we set C = 16, we can safely let T = (PAGE_SIZE >> 3) - 1 and the minimum effective buffer is guaranteed to be at least PAGE_SIZE bytes.
This should reduce the amount of coalescing work required and thus reduce the amount of locking required for the case where performance counts (on x86, the writer would have to write 511 bytes to use the non-coalescing fast path).

This method is probably better as many writers will be buffering their writes, and the size also nicely fits with block layer's blocksize.

Just my two cents.

> (The other reason to potentially avoid coalescing is that I think it might
> be useful to allow the "sendmsg()/recvmsg()" interfaces that honour packet
> boundaries. The new pipe code _is_ internally "packetized" after all).
>
> Linus

--
Mike Waychison Sun Microsystems, Inc.
1 (650) 352-5299 voice
1 (416) 202-8336 voice

^ permalink raw reply	[flat|nested] 51+ messages in thread
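Stated as code, Method 2 boils down to one cheap test per write; the names and the userspace PAGE_SIZE fallback below are invented for illustration and are not from any posted patch:

    #include <stdbool.h>
    #include <stddef.h>

    #ifndef PAGE_SIZE
    #define PAGE_SIZE 4096                  /* for the sketch; really the kernel's constant */
    #endif

    /* with C = 16 ring slots, T = (PAGE_SIZE >> 3) - 1 = 511 on x86 */
    #define COALESCE_THRESHOLD ((PAGE_SIZE >> 3) - 1)

    static bool should_coalesce(size_t prev_write, size_t this_write, size_t room_in_page)
    {
            return prev_write < COALESCE_THRESHOLD &&
                   this_write < COALESCE_THRESHOLD &&
                   this_write <= room_in_page;
    }

Writes of 511 bytes or more therefore always take the non-coalescing fast path, and the alternating worst case still leaves at least PAGE_SIZE bytes of effective buffering.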
* Re: Make pipe data structure be a circular list of pages, rather than 2005-01-07 20:59 ` Mike Waychison @ 2005-01-07 23:46 ` Chris Friesen 2005-01-08 21:38 ` Lee Revell 2005-01-14 10:15 ` Peter Chubb 0 siblings, 2 replies; 51+ messages in thread From: Chris Friesen @ 2005-01-07 23:46 UTC (permalink / raw) To: Mike Waychison Cc: Linus Torvalds, Alan Cox, Oleg Nesterov, William Lee Irwin III, Linux Kernel Mailing List Mike Waychison wrote: > This got me to thinking about how you can heuristically optimize away > coalescing support and still allow PAGE_SIZE bytes minimum in the > effective buffer. While coalescing may be a win in some cases, there should also be some way to tell the kernel to NOT coalesce, to handle the case where you want minimum latency at the cost of some throughput. Chris ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: Make pipe data structure be a circular list of pages, rather than 2005-01-07 23:46 ` Chris Friesen @ 2005-01-08 21:38 ` Lee Revell 2005-01-08 21:51 ` Linus Torvalds 2005-01-14 10:15 ` Peter Chubb 1 sibling, 1 reply; 51+ messages in thread From: Lee Revell @ 2005-01-08 21:38 UTC (permalink / raw) To: Chris Friesen Cc: Mike Waychison, Linus Torvalds, Alan Cox, Oleg Nesterov, William Lee Irwin III, Linux Kernel Mailing List On Fri, 2005-01-07 at 17:46 -0600, Chris Friesen wrote: > Mike Waychison wrote: > > > This got me to thinking about how you can heuristically optimize away > > coalescing support and still allow PAGE_SIZE bytes minimum in the > > effective buffer. > > While coalescing may be a win in some cases, there should also be some > way to tell the kernel to NOT coalesce, to handle the case where you > want minimum latency at the cost of some throughput. Many latency critical apps use (tmpfs mounted) FIFO's for IPC; the Linux FIFO being one of the fastest known IPC mechanisms. Each client in the JACK (http://jackit.sf.net) graph wakes the next one by writing a single byte to a FIFO. Ardour's GUI, control, and audio threads interact via a similar mechanism. How would you expect this change to impact the inter thread wakeup latency? It's confusing when people say "performance", meaning "increased throughput albeit with more latency". For many people that's a regression. These apps *certainly* care about performance, they just don't define it in terms of throughput. And yes we do know futexes are the right tool for this but they weren't available at the time and aren't available on 2.4. Lee ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: Make pipe data structure be a circular list of pages, rather than
  2005-01-08 21:38 ` Lee Revell
@ 2005-01-08 21:51 ` Linus Torvalds
  2005-01-08 22:02   ` Lee Revell
  ` (2 more replies)
  0 siblings, 3 replies; 51+ messages in thread
From: Linus Torvalds @ 2005-01-08 21:51 UTC (permalink / raw)
  To: Lee Revell
  Cc: Chris Friesen, Mike Waychison, Alan Cox, Oleg Nesterov, William Lee Irwin III, Linux Kernel Mailing List

On Sat, 8 Jan 2005, Lee Revell wrote:
>
> Many latency critical apps use (tmpfs mounted) FIFO's for IPC; the Linux
> FIFO being one of the fastest known IPC mechanisms. Each client in the
> JACK (http://jackit.sf.net) graph wakes the next one by writing a single
> byte to a FIFO. Ardour's GUI, control, and audio threads interact via a
> similar mechanism. How would you expect this change to impact the inter
> thread wakeup latency? It's confusing when people say "performance",
> meaning "increased throughput albeit with more latency". For many
> people that's a regression.

I posted the performance numbers in the thread already, and with every single throughput number I also talked about what the latency difference was. So quite frankly, if you were confused, I suspect it was because you didn't read them. Tssk, tssk.

Short and sweet: the latency changes are in the noise for SMP, but can be seen on UP. I'll look at it a bit more: since I had to add the coalescing code anyway, I might also decide to re-use a buffer page rather than free it immediately, since that may help latency for small writes.

Linus

^ permalink raw reply	[flat|nested] 51+ messages in thread
* Re: Make pipe data structure be a circular list of pages, rather than
  2005-01-08 21:51 ` Linus Torvalds
@ 2005-01-08 22:02 ` Lee Revell
  0 siblings, 0 replies; 51+ messages in thread
From: Lee Revell @ 2005-01-08 22:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Chris Friesen, Mike Waychison, Alan Cox, Oleg Nesterov, William Lee Irwin III, Linux Kernel Mailing List

On Sat, 2005-01-08 at 13:51 -0800, Linus Torvalds wrote:
> On Sat, 8 Jan 2005, Lee Revell wrote:
> >
> > Many latency critical apps use (tmpfs mounted) FIFO's for IPC; the Linux
> > FIFO being one of the fastest known IPC mechanisms. Each client in the
> > JACK (http://jackit.sf.net) graph wakes the next one by writing a single
> > byte to a FIFO. Ardour's GUI, control, and audio threads interact via a
> > similar mechanism. How would you expect this change to impact the inter
> > thread wakeup latency? It's confusing when people say "performance",
> > meaning "increased throughput albeit with more latency". For many
> > people that's a regression.
>
> I posted the performance numbers in the thread already, and with every
> single throughput number I also talked about what the latency difference
> was. So quite frankly, if you were confused, I suspect it was because you
> didn't read them. Tssk, tssk.

How embarrassing, I found that message immediately after hitting send. Looks like a big win. I'll try the JACK stress test on it just to make sure.

Lee

^ permalink raw reply	[flat|nested] 51+ messages in thread
* Re: Make pipe data structure be a circular list of pages, rather than 2005-01-08 21:51 ` Linus Torvalds 2005-01-08 22:02 ` Lee Revell @ 2005-01-08 22:29 ` Davide Libenzi 2005-01-09 4:07 ` Linus Torvalds 2 siblings, 0 replies; 51+ messages in thread From: Davide Libenzi @ 2005-01-08 22:29 UTC (permalink / raw) To: Linus Torvalds Cc: Lee Revell, Chris Friesen, Mike Waychison, Alan Cox, Oleg Nesterov, William Lee Irwin III, Linux Kernel Mailing List On Sat, 8 Jan 2005, Linus Torvalds wrote: > Short and sweet: the latency changes are in the noise for SMP, but can be > seen on UP. I'll look at it a bit more: since I had to add the coalescing > code anyway, I might also decide to re-use a buffer page rather than free > it immediately, since that may help latency for small writes. Yeah, I noticed you clean the page's magazine down to the bottom. Maybe keeping one (or more) page in there might help something, due to cache improvements and alloc/free page savings. - Davide ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: Make pipe data structure be a circular list of pages, rather than 2005-01-08 21:51 ` Linus Torvalds 2005-01-08 22:02 ` Lee Revell 2005-01-08 22:29 ` Davide Libenzi @ 2005-01-09 4:07 ` Linus Torvalds 2005-01-09 23:19 ` Davide Libenzi 2 siblings, 1 reply; 51+ messages in thread From: Linus Torvalds @ 2005-01-09 4:07 UTC (permalink / raw) To: Lee Revell Cc: Chris Friesen, Mike Waychison, Alan Cox, Oleg Nesterov, William Lee Irwin III, Linux Kernel Mailing List On Sat, 8 Jan 2005, Linus Torvalds wrote: > > Short and sweet: the latency changes are in the noise for SMP, but can be > seen on UP. I'll look at it a bit more: since I had to add the coalescing > code anyway, I might also decide to re-use a buffer page rather than free > it immediately, since that may help latency for small writes. Side note: the SMP numbers seem to fluctuate wildly making any benchmarking a bit suspect. They probably depend very much on just how things get scheduled, and on inter-CPU cache behaviour. In contrast, the UP numbers are a lot more reliable, so I'll try to make sure that everything looks good on UP - which definitely means that I'll work to make sure that latency looks good too. Linus ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: Make pipe data structure be a circular list of pages, rather than 2005-01-09 4:07 ` Linus Torvalds @ 2005-01-09 23:19 ` Davide Libenzi 0 siblings, 0 replies; 51+ messages in thread From: Davide Libenzi @ 2005-01-09 23:19 UTC (permalink / raw) To: Linus Torvalds Cc: Lee Revell, Chris Friesen, Mike Waychison, Alan Cox, Oleg Nesterov, William Lee Irwin III, Linux Kernel Mailing List On Sat, 8 Jan 2005, Linus Torvalds wrote: > Side note: the SMP numbers seem to fluctuate wildly making any > benchmarking a bit suspect. They probably depend very much on just how > things get scheduled, and on inter-CPU cache behaviour. Lmbench results on SMP are bogus. OTOH you can easily make a more deterministic SMP pipe bench by making the two tasks affine to two different CPUs. - Davide ^ permalink raw reply [flat|nested] 51+ messages in thread
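Davide's suggestion is straightforward to do with sched_setaffinity(); a minimal sketch of pinning the producer and consumer to different CPUs so the SMP numbers become repeatable (error checking and the timing loop itself are omitted):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <unistd.h>

    static void pin_to_cpu(int cpu)
    {
            cpu_set_t set;

            CPU_ZERO(&set);
            CPU_SET(cpu, &set);
            sched_setaffinity(0, sizeof(set), &set);    /* pid 0 == the calling task */
    }

    /* after pipe(fd) and fork(): the parent calls pin_to_cpu(0) and writes,
     * the child calls pin_to_cpu(1) and reads, so both ends stay on fixed,
     * different CPUs for the whole benchmark run */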
* Re: Make pipe data structure be a circular list of pages, rather than 2005-01-07 23:46 ` Chris Friesen 2005-01-08 21:38 ` Lee Revell @ 2005-01-14 10:15 ` Peter Chubb 1 sibling, 0 replies; 51+ messages in thread From: Peter Chubb @ 2005-01-14 10:15 UTC (permalink / raw) To: Chris Friesen Cc: Mike Waychison, Linus Torvalds, Alan Cox, Oleg Nesterov, William Lee Irwin III, Linux Kernel Mailing List >>>>> "Chris" == Chris Friesen <cfriesen@nortelnetworks.com> writes: Chris> Mike Waychison wrote: >> This got me to thinking about how you can heuristically optimize >> away coalescing support and still allow PAGE_SIZE bytes minimum in >> the effective buffer. Chris> While coalescing may be a win in some cases, there should also Chris> be some way to tell the kernel to NOT coalesce, to handle the Chris> case where you want minimum latency at the cost of some Chris> throughput. SysVr4 pipes used to have two modes: a `legacy' mode that coalesced data, and a `message' mode that preserved message boundaries. Seems to me that we could be about to have the latter in Linux... -- Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au The technical we do immediately, the political takes *forever* ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: Make pipe data structure be a circular list of pages, rather than 2005-01-07 17:48 ` Linus Torvalds 2005-01-07 20:59 ` Mike Waychison @ 2005-01-07 21:59 ` Linus Torvalds 2005-01-07 22:53 ` Diego Calleja 2005-01-10 23:23 ` Robert White 2 siblings, 1 reply; 51+ messages in thread From: Linus Torvalds @ 2005-01-07 21:59 UTC (permalink / raw) To: Alan Cox; +Cc: Oleg Nesterov, William Lee Irwin III, Linux Kernel Mailing List On Fri, 7 Jan 2005, Linus Torvalds wrote: > > I want to highlight this again. The increase in throughput did _not_ come > from avoiding a copy. It came purely from the fact that we have multiple > buffers, and thus a throughput-intensive load gets to do several bigger > chunks before it needs to wait for the recipient. So the increase in > throughput comes from fewer synchronization points (scheduling and > locking), not from anything else. Btw, from limited testing this effect seems to be much more pronounced on SMP. I've done some more benchmarking, and on a single-CPU Banias laptop, bandwidth didn't really change much, and latency goes up measurably (roughly 2.5us -> 3.0us). In contrast, on a P4 with HT, latency was not noticeable (looks like 10.4us -> 10.7us), since there the locking and system call costs dominate, and throughput goes up from 370MB/s to 900+MB/s (ie > 100% improvement). So it's not always a win, but at the same time sometimes it's quite a big win. We might tune the number of buffers down for the UP case, or something. Side note: the UNIX domain socket lmbench numbers are higher than the memory copy numbers, implying that at least the UNIX domain socket tests have the data set fit in the L2. That may be the reason why lmbench says that sockets outperform pipes. Linus ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: Make pipe data structure be a circular list of pages, rather than
  2005-01-07 21:59 ` Linus Torvalds
@ 2005-01-07 22:53 ` Diego Calleja
  2005-01-07 23:15   ` Linus Torvalds
  0 siblings, 1 reply; 51+ messages in thread
From: Diego Calleja @ 2005-01-07 22:53 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: alan, oleg, wli, linux-kernel

On Fri, 7 Jan 2005 13:59:53 -0800 (PST), Linus Torvalds <torvalds@osdl.org> wrote:

> Btw, from limited testing this effect seems to be much more pronounced on
> SMP.

I've tried it in a 2xPIII, 512 MB of ram. I've done kernel compiles... yeah yeah, I know kernel compiles are not a good benchmark, but since linux uses -pipe for gcc I thought I might try it to see if it makes a difference (suggestions about better methods to test this are accepted :). The results could very well be statistical noise (it looks like it only takes a second less on average), but I thought it might be interesting.

without the patch:
952.84user 98.64system 8:54.64elapsed 196%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+7318007minor)pagefaults 0swaps
951.78user 98.25system 8:53.71elapsed 196%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+7318009minor)pagefaults 0swaps
954.53user 99.12system 8:53.02elapsed 197%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+7318041minor)pagefaults 0swaps

with it:
951.67user 97.59system 8:53.12elapsed 196%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+7318010minor)pagefaults 0swaps
949.04user 97.68system 8:52.04elapsed 196%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+7318018minor)pagefaults 0swaps
948.40user 97.37system 8:51.48elapsed 196%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+7318011minor)pagefaults 0swaps

^ permalink raw reply	[flat|nested] 51+ messages in thread
* Re: Make pipe data structure be a circular list of pages, rather than
  2005-01-07 22:53 ` Diego Calleja
@ 2005-01-07 23:15 ` Linus Torvalds
  0 siblings, 0 replies; 51+ messages in thread
From: Linus Torvalds @ 2005-01-07 23:15 UTC (permalink / raw)
  To: Diego Calleja; +Cc: alan, oleg, wli, linux-kernel

On Fri, 7 Jan 2005, Diego Calleja wrote:
>
> I've tried it in a 2xPIII, 512 MB of ram. I've done kernel compiles... yeah
> yeah, I know kernel compiles are not a good benchmark, but since linux uses
> -pipe for gcc I thought I might try it to see if it makes a difference
> (suggestions about better methods to test this are accepted :). The results
> could very well be statistical noise (it looks like it only takes a second
> less on average), but I thought it might be interesting.

Hmm.. I'm gratified, but a bit suspicious. I'd not have expected it to be noticeable on a kernel compile (that's a 1%+ change in total system time, which considering all the file ops etc that a compile does means that it would have to have made quite a big difference for the pipe case). So your data is intriguing, but it looks a bit _too_ good to be true ;)

Random things like page cache placement (and thus cache effects) might also account for it. But hey, if true, that's quite an improvement.

(Of course, anything the kernel does tends to be totally drowned out by all the actual user-level work gcc does, so in the _big_ picture the 1% change in system time means almost nothing. So even "good news" doesn't actually matter in this case).

Linus

^ permalink raw reply	[flat|nested] 51+ messages in thread
* RE: Make pipe data structure be a circular list of pages, rather than
  2005-01-07 17:48 ` Linus Torvalds
  2005-01-07 20:59 ` Mike Waychison
  2005-01-07 21:59 ` Linus Torvalds
@ 2005-01-10 23:23 ` Robert White
  2 siblings, 0 replies; 51+ messages in thread
From: Robert White @ 2005-01-10 23:23 UTC (permalink / raw)
  To: 'Linus Torvalds', 'Alan Cox'
  Cc: 'Oleg Nesterov', 'William Lee Irwin III', 'Linux Kernel Mailing List'

Howdy,

This entire discussion of the pipe and splice() is very much the same (issue, problem, approach) as the STREAMS interface from UNIX SVR4, which had the pipe device as its trivial example device. [This is all from vague recall, so forgive the sketchiness 8-)]

[Note: check out this STREAMS programmers guide http://docs-pdf.sun.com/802-1933/802-1933.pdf from Sun's documentation site.]

A Message was a fixed-size Message structure with a linked list of Buffers attached to it. The Buffers were a structure-and-payload arrangement as in (over-expressed here). The whole arrangement could be zero-copy or self-condensing as needed.

In Linux parlance each file descriptor was basically a reference to a pair of linked list headers (one for messages available for reading and one to accept data from writes), a pointer to the driver below the head, and an optional tasklet. The driver itself was virtually the same data structure.

STREAMS was a little overdone and could have used some of the Linuxisms like tasklets (instead of "the STREAMS Scheduler"), and the bulk of the nonsense for getting Message and Buffer and STREAMS Head structures just screams Slab Cache, but it has a lot of interesting bits.

The reason I bring it up is four-fold.

1) They had push and pop ioctl operations on the streams so that you could actually interpose translators and multiplexors in the streams, so you could use the design to do the splice() operation directly by building something that basically looked like:

	fd0a    fd0b
	     mux
	 pipe-driver
	     fd1

basically by pushing the mux onto the stream head and then connecting the second and subsequent fds to the mux object.

2) The Messages had type fields so you could do interesting things like pass an open file descriptor to another process across a pipe, because the special "this is an FD message" could arrive unmolested and trigger the necessary file table manipulations at the receiving end.

3) The optional tasklet (equivalent) at the various points in the STREAM could do things like buffer consolidation, which in the Linux model could compete with actual data consumption fairly easily/cheaply.

4) Some of the things that I have run into dealing with line disciplines and streaming data through character devices (and USB) could gain a lot from the Message/Buffer/STREAM paradigm. [e.g. the fixed-size flip buffers in n_tty would be rendered moot by having the linked lists above and below the line discipline.]

Anyway, thick as it is, I think the pdf referenced above might be interesting reading vis-à-vis the pipe() discussion.

Rob White, Casabyte, Inc.

^ permalink raw reply	[flat|nested] 51+ messages in thread
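For flavour: on systems that actually ship STREAMS (SVR4, Solaris), interposing a module in an open stream really is a one-line ioctl. This is not available on Linux and is shown only to illustrate the push/pop idea:

    #include <stropts.h>    /* I_PUSH, I_POP and ioctl() - SVR4/Solaris XSI, not Linux */

    static int push_module(int fd, const char *module)
    {
            return ioctl(fd, I_PUSH, module);   /* e.g. push_module(fd, "ptem") */
    }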
* Re: Make pipe data structure be a circular list of pages, rather than 2005-01-07 16:06 ` Alan Cox 2005-01-07 17:33 ` Linus Torvalds @ 2005-01-07 17:45 ` Chris Friesen 1 sibling, 0 replies; 51+ messages in thread From: Chris Friesen @ 2005-01-07 17:45 UTC (permalink / raw) To: Alan Cox Cc: Linus Torvalds, Oleg Nesterov, William Lee Irwin III, Linux Kernel Mailing List Alan Cox wrote: > On Gwe, 2005-01-07 at 16:17, Linus Torvalds wrote: > >>it a "don't do that then", and I'll wait to see if people do. I can't >>think of anything that cares about performance that does that anyway: >>becuase system calls are reasonably expensive regardless, anybody who >>cares at all about performance will have buffered things up in user space. > > > Actually I found a load of apps that do this but they don't care about > performance. Lots of people have signal handlers that just do > > write(pipe_fd, &signr, sizeof(signr)) > > so they can drop signalled events into their event loops I would be one of those people, although I do pass the signal information as well. Chris ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: Make pipe data structure be a circular list of pages, rather than 2005-01-07 16:17 ` Linus Torvalds 2005-01-07 16:06 ` Alan Cox @ 2005-01-07 16:39 ` Davide Libenzi 2005-01-07 17:09 ` Linus Torvalds 2005-08-18 6:07 ` Coywolf Qi Hunt 2 siblings, 1 reply; 51+ messages in thread From: Davide Libenzi @ 2005-01-07 16:39 UTC (permalink / raw) To: Linus Torvalds Cc: Oleg Nesterov, William Lee Irwin III, Linux Kernel Mailing List On Fri, 7 Jan 2005, Linus Torvalds wrote: > I'm clearly enamoured with this concept. I think it's one of those few > "RightThing(tm)" that doesn't come along all that often. I don't know of > anybody else doing this, and I think it's both useful and clever. If you > now prove me wrong, I'll hate you forever ;) I don't know which kind of poison Larry put in your glass when you two met, but it clearly worked :=) - Davide ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: Make pipe data structure be a circular list of pages, rather than 2005-01-07 16:39 ` Davide Libenzi @ 2005-01-07 17:09 ` Linus Torvalds 0 siblings, 0 replies; 51+ messages in thread From: Linus Torvalds @ 2005-01-07 17:09 UTC (permalink / raw) To: Davide Libenzi Cc: Oleg Nesterov, William Lee Irwin III, Linux Kernel Mailing List On Fri, 7 Jan 2005, Davide Libenzi wrote: > > I don't know which kind of poison Larry put in your glass when you two > met, but it clearly worked :=) Oh, Larry introduced the notion of "splice()" to me at least five years ago, so whatever he put in that glass was quite slow-acting. At the time I told him to forget it - I thought it intriguing, but couldn't see any reasonable way to do it. And to be honest, my pipes aren't really what Larry's "splice()" was: his notion was more of a direct "fd->fd" thing, with nothing in between. That was what I still consider unworkable. The problem with splicing two file descriptors together is that it needs for them to agree on the interchange format, and these things tend to be very expensive from an interface angle (ie the format has to _exactly_ match the "impedance" at both ends, since any conversion would make the whole exercise pointless - you'd be better off just doing a read and a write). And note that the "format" is not just the actual data transfer thing, but the rules about what to do with partial reads, partial writes, file offsets, streams _without_ file offsets etc etc). And that's really what the pipe generalization does: it decouples all the format things into an intermediate thing that is very unambiguous, and has none of those issues. The pipe approach also decouples things in another way: a true "splice()" needs all entities to participate in the new world order for it to be useful. It shares that brokenness with "sendfile()", which to some degree exemplifies the problems with splice() (sendfile ends up being a very very specialized uni-directional form of Larry's splice()). But realizing that pipes _are_ the conduit and that they already exist and work with very old interfaces suddenly means that the new splice() can be incrementally useful. So it took me five years to think about other things, until an actual approach that would work and make sense from an implementation standpoint worked. What made me think about this particular thing is that I really thought a lot of things want to just mmap the data, but do not actually want to go to the expense of finding virtual addresses, handling the virtual-> physical translation at each stage. A lot of things just want to access the buffers, _without_ the user-visible mapping. User-visibility is not only expensive, it ends up being an impossible interface for many things (memory mapping has page granularity, something that simply isn't true for a lot of devices). So in many ways, think of the new pipes as a "buffer mapping". Nothing else. It's a way to carry arbitrary buffers along in a secure manner. Linus ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: Make pipe data structure be a circular list of pages, rather than
  2005-01-07 16:17 ` Linus Torvalds
  2005-01-07 16:06 ` Alan Cox
  2005-01-07 16:39 ` Davide Libenzi
@ 2005-08-18  6:07 ` Coywolf Qi Hunt
  2 siblings, 0 replies; 51+ messages in thread
From: Coywolf Qi Hunt @ 2005-08-18 6:07 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Oleg Nesterov, William Lee Irwin III, linux-kernel

On 1/8/05, Linus Torvalds <torvalds@osdl.org> wrote:
> [...]
> Imagine that you have hardware encryption support that does DMA from the
> source, and writes the results using DMA. I think it's pretty obvious how
> you'd connect this up using pipes and two splices (one for the input, one
> for the output).
>
> And notice how _flexible_ it is (both the input and the output can be any
> kind of fd you want - the pipes end up doing both the "conversion" into a
> common format of "list of (possibly partial) pages" and the buffering,
> which is why the different "engines" don't need to care where the data
> comes from, or where it goes. So while you can use it to encrypt a file
> into another file, you could equally easily use it for something like
>
>         tar cvf - my_source_tree | hw_engine_encrypt | splice_to_network
>
> and the whole pipeline would not have a _single_ actual data copy: the
> pipes are channels.
>
> Of course, since it's a pipe, the nice thing is that people don't have to
> use "splice()" to access it - the above pipeline has a perfectly regular
> "tar" process that probably just does regular writes. You can have a
> process that does "splice()" to fill the pipe, and the other end is a
> normal thing that just uses regular "read()" and doesn't even _know_ that
> the pipe is using new-fangled technology to be filled.
>
> I'm clearly enamoured with this concept. I think it's one of those few
> "RightThing(tm)" that doesn't come along all that often. I don't know of
> anybody else doing this, and I think it's both useful and clever. If you
> now prove me wrong, I'll hate you forever ;)
>
> Linus

Actually, this is where L4 differs from traditional micro-kernels, IMHO. Is it?

I have a long-term plan to add some messaging subsystem based on this.
--
Coywolf Qi Hunt
http://ahbl.org/~coywolf/

^ permalink raw reply	[flat|nested] 51+ messages in thread
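The per-"struct pipe_buffer" operations table proposed in the quoted mail (release(), wait_for_ready(), poll(), possibly merge()) is never spelled out as code anywhere in the thread. A minimal sketch of the idea might look like the following; every name here is an assumption for illustration, not the structure that later went into the kernel:

    /* Illustrative sketch only: member names follow the quoted mail, and
     * the surrounding types are stubbed so the fragment stands alone. */
    #include <stddef.h>

    struct page;                /* some page of memory holding the data */
    struct pipe_inode_info;     /* the pipe the buffer is queued on */
    struct pipe_buffer;

    struct pipe_buf_operations {
        /* drop the buffer's reference to its page (page cache, skb, ...) */
        void (*release)(struct pipe_inode_info *pipe, struct pipe_buffer *buf);
        /* block until the data in the buffer is actually ready, e.g. until
         * page cache I/O has completed */
        int (*wait_for_ready)(struct pipe_inode_info *pipe, struct pipe_buffer *buf);
        /* non-blocking readiness check, for poll()/select() */
        int (*poll)(struct pipe_inode_info *pipe, struct pipe_buffer *buf);
        /* optional: try to append new data to this buffer (coalescing) */
        int (*merge)(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
                     const char *data, size_t len);
    };

Each buffer would then carry a pointer to the ops table of whatever subsystem supplied its page, which is what lets page-cache pages, skb fragments and plain copied data sit in the same pipe.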
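As a concrete reading of the hardware-encryption example, a user-space sketch might look like this. It assumes the splice(in, out, maxlen) form used elsewhere in this thread (not the system call that eventually shipped), assumes the engine is visible as a plain file descriptor, and for brevity feeds the whole input before draining the output; a real program would interleave the two in a poll loop, as the quoted mail notes for the file-serving case.

    /* Sketch under assumed interfaces: "splice" here is the three-argument
     * form discussed in this thread, declared locally, not splice(2). */
    #include <unistd.h>

    long splice(int in, int out, unsigned long maxlen);  /* hypothetical */

    int encrypt_fd(int src, int crypt, int dst)
    {
        int to_hw[2], from_hw[2];
        long n;

        if (pipe(to_hw) < 0 || pipe(from_hw) < 0)
            return -1;

        /* source pages -> pipe -> engine: only page references move */
        while ((n = splice(src, to_hw[1], ~0UL)) > 0)
            if (splice(to_hw[0], crypt, n) < 0)
                return -1;

        /* engine output -> pipe -> destination */
        while ((n = splice(crypt, from_hw[1], ~0UL)) > 0)
            if (splice(from_hw[0], dst, n) < 0)
                return -1;

        return n < 0 ? -1 : 0;
    }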
[parent not found: <200501070313.j073DCaQ009641@hera.kernel.org>]
* Re: Make pipe data structure be a circular list of pages, rather than [not found] <200501070313.j073DCaQ009641@hera.kernel.org> @ 2005-01-07 3:41 ` William Lee Irwin III 2005-01-07 6:35 ` Linus Torvalds 0 siblings, 1 reply; 51+ messages in thread From: William Lee Irwin III @ 2005-01-07 3:41 UTC (permalink / raw) To: torvalds; +Cc: Linux Kernel Mailing List On Fri, Jan 07, 2005 at 12:29:13AM +0000, Linux Kernel Mailing List wrote: > ChangeSet 1.2229.1.1, 2005/01/06 16:29:13-08:00, torvalds@ppc970.osdl.org > Make pipe data structure be a circular list of pages, rather than > a circular list of one page. > This improves pipe throughput, and allows us to (eventually) > use these lists of page buffers for moving data around efficiently. Interesting; how big a gain did you see? -- wli ^ permalink raw reply [flat|nested] 51+ messages in thread
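For readers who have not looked at the patch itself, the shape of the change is roughly this: instead of one page per pipe, the pipe carries a small circular ring of page buffers. A simplified sketch of that layout, with field and constant names only approximating the real 2.6.x code:

    /* Simplified sketch of the new pipe layout; not the actual kernel source. */
    #define PIPE_BUFFERS 16              /* pages that can be in flight per pipe */

    struct page;                         /* opaque here */

    struct pipe_buffer {
        struct page *page;               /* data lives in this page ...   */
        unsigned int offset;             /* ... starting at this offset   */
        unsigned int len;                /* ... for this many bytes       */
    };

    struct pipe_inode_info {
        unsigned int curbuf;             /* index of the first full buffer */
        unsigned int nrbufs;             /* number of full buffers */
        struct pipe_buffer bufs[PIPE_BUFFERS];   /* the circular list of pages */
        /* ... plus the usual wait queue and reader/writer bookkeeping ... */
    };

A writer fills slot (curbuf + nrbufs) % PIPE_BUFFERS and bumps nrbufs; a reader consumes from curbuf and advances it. That is what allows up to 16 pages to be in flight at once instead of one.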
* Re: Make pipe data structure be a circular list of pages, rather than
  2005-01-07  3:41 ` William Lee Irwin III
@ 2005-01-07  6:35 ` Linus Torvalds
  2005-01-07  6:37   ` Linus Torvalds
  0 siblings, 1 reply; 51+ messages in thread
From: Linus Torvalds @ 2005-01-07 6:35 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: Linux Kernel Mailing List, Larry McVoy

On Thu, 6 Jan 2005, William Lee Irwin III wrote:
> On Fri, Jan 07, 2005 at 12:29:13AM +0000, Linux Kernel Mailing List wrote:
> > ChangeSet 1.2229.1.1, 2005/01/06 16:29:13-08:00, torvalds@ppc970.osdl.org
> > 	Make pipe data structure be a circular list of pages, rather than
> > 	a circular list of one page.
> > 	This improves pipe throughput, and allows us to (eventually)
> > 	use these lists of page buffers for moving data around efficiently.
>
> Interesting; how big a gain did you see?

On my ppc970 (which Alan correctly points out is not necessarily the best
platform to test), lmbench throughput looks like this:

 *Local* Communication bandwidths in MB/s - bigger is better
 -----------------------------------------------------------
 Host                OS  Pipe AF    TCP  File   Mmap  Bcopy  Bcopy  Mem   Mem
                              UNIX      reread reread (libc) (hand) read write
 --------- ------------- ---- ---- ---- ------ ------ ------ ------ ---- -----
 before:
 ppc970.os Linux 2.6.10  753. 1305 550. 1110.3 2051.3  932.3  926.4 2043 1291.
 ppc970.os Linux 2.6.10  738. 1596 558. 1099.4 2038.8  936.6  925.0 2033 1288.
 ppc970.os Linux 2.6.10  752. 1535 557. 1107.2 2043.5  937.9  932.1 2037 1292.
 ppc970.os Linux 2.6.10  750. 1524 556. 1107.5 2046.0  937.2  930.4 2037 1289.
 after:
 ppc970.os Linux 2.6.10  976. 1594 561. 1110.6 2045.8  934.9  923.6 2039 1288.
 ppc970.os Linux 2.6.10  1195 1595 549. 1102.0 2049.0  909.5  930.5 2044 1292.
 ppc970.os Linux 2.6.10  1669 1418 507. 1122.9 2051.3  943.4  916.4 2035 1280.
 ppc970.os Linux 2.6.10  1472 1612 557. 1092.4 2032.9  932.1  935.1 2030 1281.

ie that's a clear improvement (30-90% higher bandwidth). That's from having
<n> (currently 16) pages of buffers in flight instead of just one.

Of course, the full 16 you'll seldom see in real life - stdio etc tends to
buffer up at 8kB or similar, but it does mean that such an 8kB write won't
need to block for the reader to empty the pipe.

I worried a bit about latency, but my test-run of four before/after was in
the noise. It's bound to be worse because of the extra allocation, but it's
not clearly visible.

However, I've got a long-term cunning plan, which was the _real_ reason for
the pipe changes. I'm finally going to implement Larry McVoy's "splice()"
operation, and it turns out that a pipe (with the new organization) ends up
being the exact conduit between two files that you need to "splice" an
input and an output together - without any copying (the new pipe buffering
is just pointers to pages and byte ranges within those pages).

Together with a "tee()" system call (duplicate it in two), you can
basically generate pipelines and forks of data, and you'll only need to
update a reference count (and populate a new "struct pipe_buffer" thing).

But making sure that every step along the way is an improvement is part of
the philosophy.

		Linus

^ permalink raw reply	[flat|nested] 51+ messages in thread
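Numbers of this kind can be reproduced with something as small as the following; this is a rough sketch rather than lmbench itself (one writer, one reader, fixed 64 kB chunks, wall-clock timing only):

    /* Minimal pipe bandwidth check in the spirit of the numbers above. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/time.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define CHUNK (64 * 1024)
    #define TOTAL (512UL * 1024 * 1024)      /* move 512 MB */

    int main(void)
    {
        static char buf[CHUNK];
        int fd[2];
        unsigned long moved = 0;
        struct timeval t0, t1;
        double secs;

        memset(buf, 0, sizeof(buf));
        pipe(fd);

        if (fork() == 0) {                   /* child: sink everything */
            close(fd[1]);
            while (read(fd[0], buf, sizeof(buf)) > 0)
                ;
            _exit(0);
        }

        close(fd[0]);
        gettimeofday(&t0, NULL);
        while (moved < TOTAL) {
            write(fd[1], buf, sizeof(buf));
            moved += sizeof(buf);
        }
        close(fd[1]);
        wait(NULL);
        gettimeofday(&t1, NULL);

        secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
        printf("%.1f MB/s\n", moved / secs / 1e6);
        return 0;
    }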
* Re: Make pipe data structure be a circular list of pages, rather than 2005-01-07 6:35 ` Linus Torvalds @ 2005-01-07 6:37 ` Linus Torvalds 2005-01-19 16:29 ` Larry McVoy 0 siblings, 1 reply; 51+ messages in thread From: Linus Torvalds @ 2005-01-07 6:37 UTC (permalink / raw) To: William Lee Irwin III; +Cc: Linux Kernel Mailing List, Larry McVoy On Thu, 6 Jan 2005, Linus Torvalds wrote: > > I worried a bit about latency, but my test-run of four before/after was in > the noise. It's bound to be worse because of the extra allocation, but > it's not clearly visible. Btw, this is not to say that it isn't measurable: I'm sure it is. It's just not as _obvious_ as the throughput improvement. If somebody were to do some more controlled latency tests... Hint, hint. Linus ^ permalink raw reply [flat|nested] 51+ messages in thread
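In the same spirit, a crude round-trip latency test looks like the sketch below. A controlled measurement of the kind asked for above would need CPU pinning, warm-up and many runs; this only shows the shape of it:

    /* Rough pipe round-trip latency: parent and child ping-pong one byte. */
    #include <stdio.h>
    #include <sys/time.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define ROUNDS 100000

    int main(void)
    {
        int ping[2], pong[2], i;
        char c = 'x';
        struct timeval t0, t1;
        double usec;

        pipe(ping);
        pipe(pong);

        if (fork() == 0) {                   /* child: echo one byte back */
            close(ping[1]);
            close(pong[0]);
            while (read(ping[0], &c, 1) == 1)
                write(pong[1], &c, 1);
            _exit(0);
        }

        close(ping[0]);
        close(pong[1]);

        gettimeofday(&t0, NULL);
        for (i = 0; i < ROUNDS; i++) {
            write(ping[1], &c, 1);
            read(pong[0], &c, 1);
        }
        gettimeofday(&t1, NULL);

        usec = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
        printf("%.2f usec per round trip\n", usec / ROUNDS);

        close(ping[1]);                      /* let the child see EOF and exit */
        wait(NULL);
        return 0;
    }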
* Re: Make pipe data structure be a circular list of pages, rather than 2005-01-07 6:37 ` Linus Torvalds @ 2005-01-19 16:29 ` Larry McVoy 2005-01-19 17:14 ` Linus Torvalds 0 siblings, 1 reply; 51+ messages in thread From: Larry McVoy @ 2005-01-19 16:29 UTC (permalink / raw) To: Linus Torvalds Cc: William Lee Irwin III, Linux Kernel Mailing List, Larry McVoy I think you are going to regret making splice() a system call, it shouldn't be, you'll find cases where it won't work. Make it a library call built out of pull() and push(), see my original postings on this and you'll follow the logic. Making splice a system call is like putting ftp into the kernel, not a good idea. -- --- Larry McVoy lm at bitmover.com http://www.bitkeeper.com ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: Make pipe data structure be a circular list of pages, rather than 2005-01-19 16:29 ` Larry McVoy @ 2005-01-19 17:14 ` Linus Torvalds 2005-01-19 19:01 ` Larry McVoy 0 siblings, 1 reply; 51+ messages in thread From: Linus Torvalds @ 2005-01-19 17:14 UTC (permalink / raw) To: Larry McVoy; +Cc: William Lee Irwin III, Linux Kernel Mailing List On Wed, 19 Jan 2005, Larry McVoy wrote: > > I think you are going to regret making splice() a system call, it shouldn't > be, you'll find cases where it won't work. Make it a library call built > out of pull() and push() Actually, if you look at my current (ugly) test-patch, the one part of it that isn't ugly is the actual "sys_splice()" + "do_splice()" part, and the only thing they actually do is (a) do all the required "fd -> struct file" setup (including locking, testing readability vs writability etc), and then (b) they check which side is a pipe, and split it up into "pull" vs "push" at that point. Of course, the "pull" is actually called "do_splice_from()", and the "push" is called "do_splice_to()", but I can rename them if you want to ;) So yes, internally it is actually a push/pull thing, with separate actions for both. I absolutely agree that you cannot have a "splice()" action, you need to have a _direction_. But that's actually exactly what the pipe gives you: since all data has to originate or end in a pipe, the direction is implicit by the file descriptors involved. (The exception is a pipe->pipe transfer, but then the direction doesn't matter, and you can choose whichever just ends up being easier. We happen to consider it a "do_splice_from()" from the source pipe right now, but that's purely a matter of "which way do you test first"). Could we show it as two system calls? Sure. It wouldn't give you anything, though, since you need to do all the tests anyway - including the test for whether the source or destination is a pipe which currently is the thing that splits up the "splice()" into two cases. So it's not even like you could optimize out the direction test - it would just become an error case test instead. And having just one system call means that the user doesn't need to care (other than making sure that one end _is_ a pipe), and also allows us to share all the code that _is_ shared (namely the "struct file *" setups and testing - not a whole lot, but it's cleaner that way). I think you are perhaps confused about the fact that what makes this all possible in the first place really is the _pipe_. You probably think of "splice()" as going from one file descriptor to another. It's not. It goes from one (generic) file descriptor into a _pipe_, or from one pipe into another (generic) file descriptor. So the "push"/"pull" part really is there, but it's there in another form: if you want to copy from one file to another, you have to 1) open a pipe - it's the splice equivalent of allocating a temporary buffer. repeat: 2) "pull" from the fd into the pipe: splice(in, pipe[1], size) 3) "push" from the pipe into the fd: splice(pipe[0], out, size) so it's all there. The above is 100% conceptually equivalent to 1) buf = malloc() repeat: 2) read(in, buf, size); 3) write(out, buf, size); See? I think you'll find that the "push" and "pull" you look for really is there. The advantage here is that if either the input or the output _already_ is a pipe, you can optimize it to just do a loop of a single splice(), with no new temporary buffer needed. 
So while the above three steps are the generic case, depending on how you
do things the _common_ case may be just a simple

	repeat:
	1) splice(in, out, maxsize);

which is why you do not want to _force_ the split. The determination of how
to do this efficiently at run-time is also very easy:

	int copy_fd(int in, int out)
	{
		char *buf;
		int fds[2];

		for (;;) {
			int count = splice(in, out, ~0UL);
			if (count < 0) {
				if (errno == EAGAIN)
					continue;
				/* Old kernel without "splice()"? */
				if (errno == ENOSYS)
					goto read_write_case;
				/* No pipe involved? */
				if (errno == EINVAL)
					goto pipe_buffer_case;
				/* We found a pipe, but the other end cannot accept this splice */
				if (errno == ECANNOTSPLICE)
					goto read_write_case;
				return -1;
			}
			if (!count)
				return 0;
		}

	pipe_buffer_case:
		if (pipe(fds) < 0)
			goto read_write_case;
		for (;;) {
			int count = splice(in, fds[1], ~0UL);
			if (count < 0) {
				if (errno == EAGAIN)
					continue;
				return -1;
			}
			if (!count)
				return 0;
			do {
				int n = splice(fds[0], out, count);
				if (n < 0) {
					if (errno == EAGAIN)
						continue;
					return -1;
				}
				if (!n) {
					errno = ENOSPC;
					return -1;
				}
				count -= n;
			} while (count > 0);
		}

	read_write_case:
		buf = malloc(BUFSIZE);
		for (;;) {
			int count = read(in, buf, BUFSIZE);
			...

See? THAT is the kind of library routine you want to have (btw, there's no
"ECANNOTSPLICE" error code right now, my current example code incorrectly
returns EINVAL for both the "no pipe" case and the "found pipe but not
splicable" case).

		Linus

^ permalink raw reply	[flat|nested] 51+ messages in thread
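Following the naming in the message above (the "pull" half is do_splice_from(), the "push" half is do_splice_to()), the top-level dispatch might be sketched as below. The helper declarations are placeholders, not the code from the test patch:

    /* Sketch of the sys_splice()/do_splice() split described above. */
    #include <errno.h>
    #include <stddef.h>

    struct file;

    /* placeholders standing in for the helpers named in the message above */
    extern int  is_pipe(struct file *f);
    extern long do_splice_from(struct file *in, struct file *pipe_out, size_t len);
    extern long do_splice_to(struct file *pipe_in, struct file *out, size_t len);

    long do_splice(struct file *in, struct file *out, size_t len)
    {
        /* destination is a pipe: "pull" data from in into it */
        if (is_pipe(out))
            return do_splice_from(in, out, len);

        /* source is a pipe: "push" its data out to the destination */
        if (is_pipe(in))
            return do_splice_to(in, out, len);

        /* neither end is a pipe: nothing to splice through */
        return -EINVAL;
    }

Testing the output side first also covers the pipe-to-pipe case as "a do_splice_from() from the source pipe", matching the "which way do you test first" remark above.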
* Re: Make pipe data structure be a circular list of pages, rather than
  2005-01-19 17:14 ` Linus Torvalds
@ 2005-01-19 19:01 ` Larry McVoy
  2005-01-20  0:01   ` Linus Torvalds
  0 siblings, 1 reply; 51+ messages in thread
From: Larry McVoy @ 2005-01-19 19:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Larry McVoy, William Lee Irwin III, Linux Kernel Mailing List

On Wed, Jan 19, 2005 at 09:14:33AM -0800, Linus Torvalds wrote:
> I think you are perhaps confused about the fact that what makes this all
> possible in the first place really is the _pipe_. You probably think of
> "splice()" as going from one file descriptor to another. It's not.

You seem to have misunderstood the original proposal; it had little to do
with file descriptors. The idea was that different subsystems in the OS
export pull() and push() interfaces and you use them. The file descriptors
are only involved if you provide them with those interfaces (which you
would, it makes sense).

You are hung up on the pipe idea; the idea I see in my head is far more
generic. Anything can play and you don't need a pipe at all, you need

	struct pagev {
		pageno	page;	// or whatever the type is to get the page
		u16	offset;	// could be u32 if you have large pages
		u16	len;	// ditto
	};

	struct splice {
		pagev	pages[];
		u8	gift:1;
		u8	loan:1;
	};

You can feed these into pipes for sure, but you can source and sink them
from/to anything which exports a pull()/push() entry point. It's as
generic as read()/write() in the driver object.

I never proposed restricting this to file descriptors, that makes no
sense. File descriptors are nothing more than something which gets you to
the underlying object (socket/file/pipe/whatever) and lets you call into
that object to move some data.

See how more generic that is? Pipes are just one source/sink but
everything else needs to play as well. How are you going to implement a
socket sending data to a file without the VM nonsense and the extra
copies?
--
---
Larry McVoy              lm at bitmover.com          http://www.bitkeeper.com

^ permalink raw reply	[flat|nested] 51+ messages in thread
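What "exporting pull() and push()" could look like as entry points is left implicit. One possible shape, reusing the structures sketched above (with a page count and stub typedefs added here purely so the fragment stands on its own), might be:

    /* Hypothetical reading of the pull()/push() proposal; not from any patch. */
    #include <stddef.h>

    typedef unsigned long  pageno;   /* stub: "whatever the type is to get the page" */
    typedef unsigned short u16;
    typedef unsigned char  u8;

    struct pagev {
        pageno page;
        u16    offset;
        u16    len;
    };

    struct splice {
        unsigned int npages;         /* count added for this illustration */
        u8           gift:1;
        u8           loan:1;
        struct pagev pages[];
    };

    /* Per-subsystem entry points: a source fills a splice vector with page
     * references, a sink accepts one; file descriptors only matter if a
     * subsystem chooses to expose these operations through them. */
    struct splice_source { int (*pull)(void *obj, struct splice *sp, size_t max); };
    struct splice_sink   { int (*push)(void *obj, struct splice *sp); };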
* Re: Make pipe data structure be a circular list of pages, rather than 2005-01-19 19:01 ` Larry McVoy @ 2005-01-20 0:01 ` Linus Torvalds 0 siblings, 0 replies; 51+ messages in thread From: Linus Torvalds @ 2005-01-20 0:01 UTC (permalink / raw) To: Larry McVoy; +Cc: William Lee Irwin III, Linux Kernel Mailing List On Wed, 19 Jan 2005, Larry McVoy wrote: > > See how more generic that is? Pipes are just one source/sink but > everything else needs to play as well. How are you going to > implement a socket sending data to a file without the VM nonsense > and the extra copies? Heh. This is exactly why I originally told you it is undoable - because your model implies a O(M x N) complexity, where everybody has to agree on how to push things to everybody else. The thing that made me so excited about the pipes is exactly that the pipes is what finally made me see how to actually _implement_ kind of what you wanted. Instead of a "everything to everything" thing that has O(n²) complexity, it's a "everything to a pipe", where the pipe ends up being the common medium for things to talk to each other. So you have an extra level of indirection, but it also means that now the interface complexity is purely linear - everybody needs to know how to accept data from (and push data to) a pipe, but they do _not_ need to know how to do it to anything else. I agree that it's not what you described, and one of my first emails describing it even said so, but I now notice that you weren't cc'd on that one (I tried to cc you on some others, but not the whole discussion): "... And to be honest, my pipes aren't really what Larry's "splice()" was: his notion was more of a direct "fd->fd" thing, with nothing in between. That was what I still consider unworkable. The problem with splicing two file descriptors together is that it needs for them to agree on the interchange format, and these things tend to be very expensive from an interface angle (ie the format has to _exactly_ match the "impedance" at both ends, since any conversion would make the whole exercise pointless - you'd be better off just doing a read and a write). ..." so what makes the pipes special is that they are the "impedance matcher". For example, absolutely _none_ of the regular file IO stuff wants to work with page fragments: it's all very much about whole-page stuff in the page cache, and that is how it has to be, considering the semantics of a coherent mmap etc. So your notion of having all the subsystems natively use a common format just never struck me as workable, because they very fundamentally do _not_ want work with the same kind of data structures, and forcing them to just didn't seem valid. So I ignored splice as unworkable. So think of my "splice" as giving you credit through the name, but at the same time it definitely is something different (and in my opinion a lot more workable). If you object to the name, I can call it "wire()" instead, which was somebody elses suggestion, but I thought you might appreciate the name even if it isn't _really_ what you want ;) Linus ^ permalink raw reply [flat|nested] 51+ messages in thread
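Put in numbers, the "impedance matcher" argument is that with N endpoint types (files, sockets, video drivers, ...), direct endpoint-to-endpoint splicing needs on the order of N*(N-1) source/sink pairings to agree on a data format, while the pipe-in-the-middle design needs only two methods per endpoint: one to fill a pipe and one to drain it, i.e. 2*N methods in total. A schematic of that per-endpoint interface (names are illustrative only, not from any patch in this thread):

    #include <stddef.h>

    struct pipe_inode_info;

    /* Each endpoint type only has to know how to talk to a pipe. */
    struct splice_endpoint_ops {
        /* move up to len bytes from this endpoint into the pipe */
        long (*fill_pipe)(void *endpoint, struct pipe_inode_info *pipe, size_t len);
        /* move up to len bytes from the pipe out to this endpoint */
        long (*drain_pipe)(struct pipe_inode_info *pipe, void *endpoint, size_t len);
    };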
end of thread, other threads:[~2005-08-18 6:07 UTC | newest] Thread overview: 51+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2005-01-08 8:25 Make pipe data structure be a circular list of pages, rather linux 2005-01-08 18:41 ` Linus Torvalds 2005-01-08 21:47 ` Alan Cox 2005-01-13 21:46 ` Ingo Oeser 2005-01-13 22:32 ` Linus Torvalds 2005-01-14 21:03 ` Ingo Oeser 2005-01-14 21:29 ` Linus Torvalds 2005-01-14 22:12 ` Ingo Oeser 2005-01-14 22:44 ` Linus Torvalds 2005-01-14 23:34 ` Ingo Oeser 2005-01-15 0:16 ` Linus Torvalds 2005-01-16 2:59 ` Linus Torvalds 2005-01-17 16:03 ` Ingo Oeser 2005-01-19 21:12 ` Make pipe data structure be a circular list of pages, rather than linux 2005-01-20 2:06 ` Robert White -- strict thread matches above, loose matches on Subject: below -- 2005-01-20 2:14 Robert White [not found] <Pine.LNX.4.44.0501091946020.3620-100000@localhost.localdomain> [not found] ` <Pine.LNX.4.58.0501091713300.2373@ppc970.osdl.org> [not found] ` <Pine.LNX.4.58.0501091830120.2373@ppc970.osdl.org> 2005-01-12 19:50 ` Davide Libenzi 2005-01-12 20:10 ` Linus Torvalds 2005-01-07 14:30 Oleg Nesterov 2005-01-07 15:45 ` Alan Cox 2005-01-07 17:23 ` Linus Torvalds 2005-01-08 18:25 ` Hugh Dickins 2005-01-08 18:54 ` Linus Torvalds 2005-01-07 16:17 ` Linus Torvalds 2005-01-07 16:06 ` Alan Cox 2005-01-07 17:33 ` Linus Torvalds 2005-01-07 17:48 ` Linus Torvalds 2005-01-07 20:59 ` Mike Waychison 2005-01-07 23:46 ` Chris Friesen 2005-01-08 21:38 ` Lee Revell 2005-01-08 21:51 ` Linus Torvalds 2005-01-08 22:02 ` Lee Revell 2005-01-08 22:29 ` Davide Libenzi 2005-01-09 4:07 ` Linus Torvalds 2005-01-09 23:19 ` Davide Libenzi 2005-01-14 10:15 ` Peter Chubb 2005-01-07 21:59 ` Linus Torvalds 2005-01-07 22:53 ` Diego Calleja 2005-01-07 23:15 ` Linus Torvalds 2005-01-10 23:23 ` Robert White 2005-01-07 17:45 ` Chris Friesen 2005-01-07 16:39 ` Davide Libenzi 2005-01-07 17:09 ` Linus Torvalds 2005-08-18 6:07 ` Coywolf Qi Hunt [not found] <200501070313.j073DCaQ009641@hera.kernel.org> 2005-01-07 3:41 ` William Lee Irwin III 2005-01-07 6:35 ` Linus Torvalds 2005-01-07 6:37 ` Linus Torvalds 2005-01-19 16:29 ` Larry McVoy 2005-01-19 17:14 ` Linus Torvalds 2005-01-19 19:01 ` Larry McVoy 2005-01-20 0:01 ` Linus Torvalds