* Re: Make pipe data structure be a circular list of pages, rather than
@ 2005-01-07 14:30 Oleg Nesterov
2005-01-07 15:45 ` Alan Cox
2005-01-07 16:17 ` Linus Torvalds
0 siblings, 2 replies; 38+ messages in thread
From: Oleg Nesterov @ 2005-01-07 14:30 UTC (permalink / raw)
To: Linus Torvalds; +Cc: William Lee Irwin III, linux-kernel
Hello.
pipe_writev:
> + if (bufs < PIPE_BUFFERS) {
> + ssize_t chars;
> + int newbuf = (info->curbuf + bufs) & (PIPE_BUFFERS-1);
If I understand this patch correctly, then this code

	for (;;)
		write(pipe_fd, &byte, 1);

will block after writing PIPE_BUFFERS == 16 characters, no?
And pipe_inode_info will use 64K to hold 16 bytes!
Is it ok?
Maybe it makes sense to add data to the last allocated page
until buf->len > PAGE_SIZE?
Oleg.
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Make pipe data structure be a circular list of pages, rather than
2005-01-07 14:30 Make pipe data structure be a circular list of pages, rather than Oleg Nesterov
@ 2005-01-07 15:45 ` Alan Cox
2005-01-07 17:23 ` Linus Torvalds
2005-01-07 16:17 ` Linus Torvalds
1 sibling, 1 reply; 38+ messages in thread
From: Alan Cox @ 2005-01-07 15:45 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Linus Torvalds, William Lee Irwin III, Linux Kernel Mailing List
On Gwe, 2005-01-07 at 14:30, Oleg Nesterov wrote:
> will block after writing PIPE_BUFFERS == 16 characters, no?
> And pipe_inode_info will use 64K to hold 16 bytes!
>
> Is it ok?
That would break stuff, but holding the last page until it fills 4K
would work, or just basic sg coalescing when possible. Pipe
behaviour - particularly buffer size and the size of atomic writes - is
defined by SuS, and there are people who use pipes both ways between
apps and rely on those guarantees to avoid deadlocks.
* Re: Make pipe data structure be a circular list of pages, rather than
2005-01-07 15:45 ` Alan Cox
@ 2005-01-07 17:23 ` Linus Torvalds
2005-01-08 18:25 ` Hugh Dickins
0 siblings, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2005-01-07 17:23 UTC (permalink / raw)
To: Alan Cox; +Cc: Oleg Nesterov, William Lee Irwin III, Linux Kernel Mailing List
On Fri, 7 Jan 2005, Alan Cox wrote:
>
> That would break stuff, but holding the last page until it fills 4K
> would work, or just basic sg coalescing when possible. Pipe
> behaviour - particularly buffer size and the size of atomic writes - is
> defined by SuS, and there are people who use pipes both ways between
> apps and rely on those guarantees to avoid deadlocks.
Hmm.. Are there really any other guarantees than the atomicity guarantee
for a PIPE_BUF transfer? I don't see any in SuS..
That said, existing practice obviously always trumps paper standards, and
yes, coalescing is possible.
Linus
* Re: Make pipe data structure be a circular list of pages, rather than
2005-01-07 17:23 ` Linus Torvalds
@ 2005-01-08 18:25 ` Hugh Dickins
2005-01-08 18:54 ` Linus Torvalds
0 siblings, 1 reply; 38+ messages in thread
From: Hugh Dickins @ 2005-01-08 18:25 UTC (permalink / raw)
To: Linus Torvalds
Cc: Alan Cox, Oleg Nesterov, William Lee Irwin III,
Linux Kernel Mailing List
On Fri, 7 Jan 2005, Linus Torvalds wrote:
>
> Hmm.. Are there really any other guarantees than the atomicity guarantee
> for a PIPE_BUF transfer? I don't see any in SuS..
>
> That said, existing practice obviously always trumps paper standards, and
> yes, coalescing is possible.
Would a kernel build with "make -j18" count as existing practice?
Immediately hangs for me...
Hugh
* Re: Make pipe data structure be a circular list of pages, rather than
2005-01-08 18:25 ` Hugh Dickins
@ 2005-01-08 18:54 ` Linus Torvalds
0 siblings, 0 replies; 38+ messages in thread
From: Linus Torvalds @ 2005-01-08 18:54 UTC (permalink / raw)
To: Hugh Dickins
Cc: Alan Cox, Oleg Nesterov, William Lee Irwin III,
Linux Kernel Mailing List
On Sat, 8 Jan 2005, Hugh Dickins wrote:
>
> Would a kernel build with "make -j18" count as existing practice?
Absolutely.
Looks like it does a single-byte write of '+' for each process started
(and I assume it will do a '-' for each process that exits ;)
Oh, well. So much for my hope to avoid coalescing.
Linus
* Re: Make pipe data structure be a circular list of pages, rather than
2005-01-07 14:30 Make pipe data structure be a circular list of pages, rather than Oleg Nesterov
2005-01-07 15:45 ` Alan Cox
@ 2005-01-07 16:17 ` Linus Torvalds
2005-01-07 16:06 ` Alan Cox
` (2 more replies)
1 sibling, 3 replies; 38+ messages in thread
From: Linus Torvalds @ 2005-01-07 16:17 UTC (permalink / raw)
To: Oleg Nesterov; +Cc: William Lee Irwin III, linux-kernel
On Fri, 7 Jan 2005, Oleg Nesterov wrote:
>
> If I understand this patch correctly, then this code
>
> 	for (;;)
> 		write(pipe_fd, &byte, 1);
>
> will block after writing PIPE_BUFFERS == 16 characters, no?
> And pipe_inode_info will use 64K to hold 16 bytes!
Yes.
> Is it ok?
If you want throughput, don't do single-byte writes. Obviously we _could_
do coalescing, but there's a reason I'd prefer to avoid it. So I consider
it a "don't do that then", and I'll wait to see if people do. I can't
think of anything that cares about performance that does that anyway:
because system calls are reasonably expensive regardless, anybody who
cares at all about performance will have buffered things up in user space.
> Maybe it makes sense to add data to the last allocated page
> until buf->len > PAGE_SIZE?
The reason I don't want to coalesce is that I don't ever want to modify a
page that is on a pipe buffer (well, at least not through the pipe buffer
- it might get modified some other way). Why? Because the long-term plan
for pipe-buffers is to allow the data to come from _other_ sources than
just a user space copy. For example, it might be a page directly from the
page cache, or a partial page that contains the data part of an skb that
just came in off the network.
With this organization, a pipe ends up being able to act as a "conduit"
for pretty much any data, including some high-bandwidth things like video
streams, where you really _really_ don't want to copy the data. So the
next stage is:
- allow the buffer size to be set dynamically per-pipe (probably only
increased by root, due to obvious issues, although a per-user limit is
not out of the question - it's just a "mlock" in kernel buffer space,
after all)
- add per-"struct pipe_buffer" ops pointer to a structure with
operation function pointers: "release()", "wait_for_ready()", "poll()"
(and possibly "merge()", if we want to coalesce things, although I
really hope we won't need to)
- add a "splice(fd, fd)" system call that copies pages (by incrementing
their reference count, not by copying the data!) from an input source
to the pipe, or from a pipe to an output.
- add a "tee(in, out1, out2)" system call that duplicates the pages
(again, incrementing their reference count, not copying the data) from
one pipe to two other pipes.
All of the above is basically a few lines of code (the "splice()" thing
requires some help from drivers/networking/pagecache etc, but it's not
complex help, and not everybody needs to do it - I'll start off with
_just_ a generic page cache helper to get the thing rolling, that's easy).
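The per-buffer operations described above might look roughly like this as a C sketch; the struct and member names follow the mail, not any actual kernel header, and `struct page` is left opaque so the fragment stands alone:

```c
struct page;  /* opaque stand-in for the kernel's struct page */

/* Hypothetical sketch of the per-"struct pipe_buffer" operations
 * described above; names follow the mail, not a real kernel API. */
struct pipe_buf_operations {
	void (*release)(struct page *page);
	int  (*wait_for_ready)(struct page *page);
	int  (*poll)(struct page *page);
	/* optional, only needed if coalescing turns out to be required */
	int  (*merge)(struct page *dst, struct page *src);
};

struct pipe_buffer {
	struct page *page;        /* data source: anon page, pagecache, skb frag... */
	unsigned int offset, len; /* valid region within the page */
	const struct pipe_buf_operations *ops;
};
```

The point of the ops pointer is that each buffer slot can carry data from a different source, with the source deciding how the page is released or waited on.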
Now, imagine using the above in a media server, for example. Let's say
that a year or two has passed, so that the video drivers have been updated
to be able to do the splice thing, and what can you do? You can:
- splice from the (mpeg or whatever - let's just assume that the video
input is either digital or does the encoding on its own - like they
pretty much all do) video input into a pipe (remember: no copies - the
video input will just DMA directly into memory, and splice will just
set up the pages in the pipe buffer)
- tee that pipe to split it up
- splice one end to a file (ie "save the compressed stream to disk")
- splice the other end to a real-time video decoder window for your
real-time viewing pleasure.
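One step of that media-server loop can be sketched against the splice()/tee() interfaces that eventually landed in Linux; note these differ from the three-fd tee() proposed here (the real tee() duplicates between exactly two pipes, so splitting a stream takes a tee() plus a splice()), and the helper name is made up:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Duplicate one chunk of src_pipe into copy_pipe, then drain the original
 * bytes to sink_fd.  No payload bytes are copied anywhere; tee() and
 * splice() only move page references, as described in the mail. */
static ssize_t split_once(int src_pipe, int copy_pipe, int sink_fd, size_t len)
{
	/* tee() duplicates up to len bytes without consuming them... */
	ssize_t dup = tee(src_pipe, copy_pipe, len, 0);
	if (dup < 0)
		return -1;
	/* ...then splice() consumes the originals into the sink. */
	return splice(src_pipe, NULL, sink_fd, NULL, len, 0);
}
```

With copy_pipe spliced to a file and sink_fd being the decoder, this is the "save to disk while watching" pipeline above.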
That's the plan, at least. I think it makes sense, and the thing that
convinced me about it was (a) how simple all of this seems to be
implementation-wise (modulo details - but there are no "conceptually
complex" parts: no horrid asynchronous interfaces, no questions about
how to buffer things, no direct user access to pages that might
partially contain protected data etc etc) and (b) it's so UNIXy. If
there's something that says "the UNIX way", it's pipes between entities
that act on the data.
For example, let's say that you wanted to serve a file from disk (or any
other source) with a header to another program (or to a TCP connection, or
to whatever - it's just a file descriptor). You'd do
	fd = create_pipe_to_destination();
	input = open("filename", O_RDONLY);

	write(fd, "header goes here", length_of_header);
	for (;;) {
		ssize_t err;

		err = splice(input, fd,
			~0 /* maxlen */,
			0 /* msg flags - think "sendmsg" */);
		if (err > 0)
			continue;
		if (!err)	/* EOF */
			break;
		.. handle input errors here ..
	}
(obviously, if this is a real server, this would likely all be in a
select/epoll loop, but that just gets too hard to describe concisely, so
I'm showing the single-threaded simple version).
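For comparison, the splice(2) call that eventually landed took explicit offset pointers, a length, and flags rather than the two-argument form sketched above. The same loop against that interface might look like this; the function name is illustrative, and the destination must be a pipe end since the real splice() requires one pipe:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* The server loop above, using the splice(2) signature that eventually
 * landed in Linux.  Works for any readable fd as the source; the
 * destination here is the write end of a pipe. */
static int send_file_through_pipe(int src_fd, int pipe_wr)
{
	for (;;) {
		ssize_t err = splice(src_fd, NULL, pipe_wr, NULL,
				     65536 /* max chunk */, 0);
		if (err > 0)
			continue;
		if (err == 0)	/* EOF */
			return 0;
		return -1;	/* handle input errors here */
	}
}
```

As in the mail's version, a real server would drive this from a poll loop and handle partial progress and EAGAIN rather than looping blindly.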
Further, this also ends up giving a nice pollable interface to regular
files too: just splice from the file (at any offset) into a pipe, and poll
on the result. The "splice()" will just do the page cache operations and
start the IO if necessary, the "poll()" will wait for the first page to be
actually available. All _trivially_ done with the "struct pipe_buffer"
operations.
So the above kind of "send a file to another destination" should
automatically work very naturally in any poll loop: instead of filling a
writable pipe with a "write()", you just fill it with "splice()" instead
(and you can read it with a 'read()' or you just splice it to somewhere
else, or you tee() it to two destinations....).
I think the media server kind of environment is the one most easily
explained, where you have potentially tons of data that the server process
really never actually wants to _see_ - it just wants to push it on to
another process or connection or save it to a log-file or something. But
as with regular pipes, it's not a specialized interface: it really is just
a channel of communication.
The difference being that a historical UNIX pipe is always a channel
between two process spaces (ie you can only fill it and empty it into the
process address space), and the _only_ thing I'm trying to do is to have
it be able to be a channel between two different file descriptors too. You
still need the process to "control" the channel, but the data doesn't have
to touch the address space any more.
Think of all the servers or other processes that really don't care about
the data. Think of something as simple as encrypting a file, for example.
Imagine that you have hardware encryption support that does DMA from the
source, and writes the results using DMA. I think it's pretty obvious how
you'd connect this up using pipes and two splices (one for the input, one
for the output).
And notice how _flexible_ it is (both the input and the output can be any
kind of fd you want - the pipes end up doing both the "conversion" into a
common format of "list of (possibly partial) pages" and the buffering),
which is why the different "engines" don't need to care where the data
comes from, or where it goes. So while you can use it to encrypt a file
into another file, you could equally easily use it for something like
	tar cvf - my_source_tree | hw_engine_encrypt | splice_to_network
and the whole pipeline would not have a _single_ actual data copy: the
pipes are channels.
Of course, since it's a pipe, the nice thing is that people don't have to
use "splice()" to access it - the above pipeline has a perfectly regular
"tar" process that probably just does regular writes. You can have a
process that does "splice()" to fill the pipe, and the other end is a
normal thing that just uses regular "read()" and doesn't even _know_ that
the pipe is using new-fangled technology to be filled.
I'm clearly enamoured with this concept. I think it's one of those few
"RightThing(tm)" that doesn't come along all that often. I don't know of
anybody else doing this, and I think it's both useful and clever. If you
now prove me wrong, I'll hate you forever ;)
Linus
* Re: Make pipe data structure be a circular list of pages, rather than
2005-01-07 16:17 ` Linus Torvalds
@ 2005-01-07 16:06 ` Alan Cox
2005-01-07 17:33 ` Linus Torvalds
2005-01-07 17:45 ` Chris Friesen
2005-01-07 16:39 ` Davide Libenzi
2005-08-18 6:07 ` Coywolf Qi Hunt
2 siblings, 2 replies; 38+ messages in thread
From: Alan Cox @ 2005-01-07 16:06 UTC (permalink / raw)
To: Linus Torvalds
Cc: Oleg Nesterov, William Lee Irwin III, Linux Kernel Mailing List
On Gwe, 2005-01-07 at 16:17, Linus Torvalds wrote:
> it a "don't do that then", and I'll wait to see if people do. I can't
> think of anything that cares about performance that does that anyway:
> becuase system calls are reasonably expensive regardless, anybody who
> cares at all about performance will have buffered things up in user space.
Actually I found a load of apps that do this but they don't care about
performance. Lots of people have signal handlers that just do
write(pipe_fd, &signr, sizeof(signr))
so they can drop signalled events into their event loops
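That pattern is the classic "self-pipe trick", which can be sketched as follows; the fd variable and function names are illustrative:

```c
#include <signal.h>
#include <unistd.h>

static int self_pipe_wr = -1;	/* write end, set up at program start */

/* Classic self-pipe trick: the handler does nothing but push the signal
 * number down a pipe.  write() is async-signal-safe, so this is legal in
 * a handler, and the event loop then picks the event up by including the
 * pipe's read end in its select()/poll() set. */
static void signal_to_pipe(int signr)
{
	write(self_pipe_wr, &signr, sizeof(signr));
}
```

These writes are tiny and latency-sensitive rather than throughput-sensitive, which is exactly why the one-page-per-write behaviour matters to them.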
> > May be it make sense to add data to the last allocated page
> > until buf->len > PAGE_SIZE ?
>
> The reason I don't want to coalesce is that I don't ever want to modify a
> page that is on a pipe buffer (well, at least not through the pipe buffer
If I can't write 4096 bytes down it one at a time without blocking from
an empty pipe then it's not a pipe in the eyes of the Unix world and the
standards.
> With this organization, a pipe ends up being able to act as a "conduit"
> for pretty much any data, including some high-bandwidth things like video
> streams, where you really _really_ don't want to copy the data. So the
> next stage is:
The data copying impact isn't very high even if it is just done for the
pipe() case for standards behaviour. You end up with one page that is
written to and then sent and then freed, rather than many.
Possibly what should be done is to not use pipe() for this at all, but to
create a more explicit object, so we don't confuse existing code and code
that refuses buffers heavily (plenty of that around). Add "sewer()"[1]
or "conduit()" and we don't have to change the behaviour of the Unix
pipe world and risk screwing up apps, and you don't have to write
grungy corner cases that ruin the plan.
* Re: Make pipe data structure be a circular list of pages, rather than
2005-01-07 16:06 ` Alan Cox
@ 2005-01-07 17:33 ` Linus Torvalds
2005-01-07 17:48 ` Linus Torvalds
2005-01-07 17:45 ` Chris Friesen
1 sibling, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2005-01-07 17:33 UTC (permalink / raw)
To: Alan Cox; +Cc: Oleg Nesterov, William Lee Irwin III, Linux Kernel Mailing List
On Fri, 7 Jan 2005, Alan Cox wrote:
>
> > The reason I don't want to coalesce is that I don't ever want to modify a
> > page that is on a pipe buffer (well, at least not through the pipe buffer
>
> If I can't write 4096 bytes down it one at a time without blocking from
> an empty pipe then it's not a pipe in the eyes of the Unix world and the
> standards.
Absolutely. In fact, with the new implementation, you can often write
_several_ packets of 4096 bytes without blocking (but only writes less
than PIPE_BUF are guaranteed to be done all-or-nothing). I'm very aware of
the atomicity guarantees, I'm just saying that if you try to write 4096
bytes by doing it one byte at a time, that has changed.
> > With this organization, a pipe ends up being able to act as a "conduit"
> > for pretty much any data, including some high-bandwidth things like video
> > streams, where you really _really_ don't want to copy the data. So the
> > next stage is:
>
> The data copying impact isn't very high even if it is just done for the
> pipe() case for standards behaviour. You end up with one page that is
> written to and then sent and then freed, rather than many.
I absolutely agree. A regular read()/write() still copies the data, and
that's because I'm a firm believer that copying even a few kB of data is
likely to be cheaper than trying to play MM games (not just the lookup of
the physical address - all the locking, COW, etc crud that VM games
require).
So this shares some of the issues with the zero-copy pipes of yore, but
it doesn't actually do any of that for regular pipe read/writes. And
never will, as far as I'm concerned. I just don't think user zero-copy is
interesting at that level: if the user wants to access something without
copying, he uses "mmap()".
So only when the data is _not_ in user VM space, that's when increasing a
reference count is cheaper than copying. Pretty much by definition, you
already have a "struct page *" at that point, along with which part of the
page contains the data.
So the "standard behaviour" (aka just plain read/write on the pipe) is all
the same copies that it used to be. The "just move pages around" issue
only happens when you want to duplicate the stream, or if you splice
around stuff that is already in kernel buffers (or needs a kernel buffer
anyway).
Linus
* Re: Make pipe data structure be a circular list of pages, rather than
2005-01-07 17:33 ` Linus Torvalds
@ 2005-01-07 17:48 ` Linus Torvalds
2005-01-07 20:59 ` Mike Waychison
` (2 more replies)
0 siblings, 3 replies; 38+ messages in thread
From: Linus Torvalds @ 2005-01-07 17:48 UTC (permalink / raw)
To: Alan Cox; +Cc: Oleg Nesterov, William Lee Irwin III, Linux Kernel Mailing List
On Fri, 7 Jan 2005, Linus Torvalds wrote:
>
> So the "standard behaviour" (aka just plain read/write on the pipe) is all
> the same copies that it used to be.
I want to highlight this again. The increase in throughput did _not_ come
from avoiding a copy. It came purely from the fact that we have multiple
buffers, and thus a throughput-intensive load gets to do several bigger
chunks before it needs to wait for the recipient. So the increase in
throughput comes from fewer synchronization points (scheduling and
locking), not from anything else.
Another way of saying the same thing: pipes actually used to have clearly
_lower_ bandwidth than UNIX domain sockets, even though clearly pipes are
simpler and should thus be able to be faster. The reason? UNIX domain
sockets allow multiple packets in flight, and pipes only used to have a
single buffer. With the new setup, pipes get roughly comparable
performance to UNIX domain sockets for me.
Sockets can still outperform pipes, truth be told - there's been more
work on socket locking than on pipe locking. For example, the new pipe
code should conceptually really allow one CPU to empty one pipe buffer
while another CPU fills another pipe buffer, but because I kept the
locking structure the same as it used to be, right now read/write
serialize over the whole pipe rather than anything else.
This is the main reason why I want to avoid coalescing if possible: if you
never coalesce, then each "pipe_buffer" is complete in itself, and that
simplifies locking enormously.
(The other reason to potentially avoid coalescing is that I think it might
be useful to allow the "sendmsg()/recvmsg()" interfaces that honour packet
boundaries. The new pipe code _is_ internally "packetized" after all).
Linus
* Re: Make pipe data structure be a circular list of pages, rather than
2005-01-07 17:48 ` Linus Torvalds
@ 2005-01-07 20:59 ` Mike Waychison
2005-01-07 23:46 ` Chris Friesen
2005-01-07 21:59 ` Linus Torvalds
2005-01-10 23:23 ` Robert White
2 siblings, 1 reply; 38+ messages in thread
From: Mike Waychison @ 2005-01-07 20:59 UTC (permalink / raw)
To: Linus Torvalds
Cc: Alan Cox, Oleg Nesterov, William Lee Irwin III,
Linux Kernel Mailing List
Linus Torvalds wrote:
>
> On Fri, 7 Jan 2005, Linus Torvalds wrote:
>
>>So the "standard behaviour" (aka just plain read/write on the pipe) is all
>>the same copies that it used to be.
>
>
> I want to highlight this again. The increase in throughput did _not_ come
> from avoiding a copy. It came purely from the fact that we have multiple
> buffers, and thus a throughput-intensive load gets to do several bigger
> chunks before it needs to wait for the recipient. So the increase in
> throughput comes from fewer synchronization points (scheduling and
> locking), not from anything else.
>
> Another way of saying the same thing: pipes actually used to have clearly
> _lower_ bandwidth than UNIX domain sockets, even though clearly pipes are
> simpler and should thus be able to be faster. The reason? UNIX domain
> sockets allow multiple packets in flight, and pipes only used to have a
> single buffer. With the new setup, pipes get roughly comparable
> performance to UNIX domain sockets for me.
>
> Sockets can still outperform pipes, truth be told - there's been more
> work on socket locking than on pipe locking. For example, the new pipe
> code should conceptually really allow one CPU to empty one pipe buffer
> while another CPU fills another pipe buffer, but because I kept the
> locking structure the same as it used to be, right now read/write
> serialize over the whole pipe rather than anything else.
>
> This is the main reason why I want to avoid coalescing if possible: if you
> never coalesce, then each "pipe_buffer" is complete in itself, and that
> simplifies locking enormously.
This got me to thinking about how you can heuristically optimize away
coalescing support and still allow PAGE_SIZE bytes minimum in the
effective buffer.
Method 1)
Simply keep track of how much data is in the effective buffer. If the
count is < PAGE_SIZE, then coalesce, otherwise, don't. Very simple, but
means a high traffic pipe might see 'stuttering' between coalescing mode
and fastpath.
Method 2)
It would be interesting to see the performance of a simple heuristic
whereby writes smaller than a given threshold (PAGE_SIZE >> 3?) are
coalesced with the previous iff the previous also was less than the
threshold and there is room for the write in the given page.
With this heuristic, the minimum size of the effective buffer is only
possible if the sequence consists of alternating threshold (T) bytes and
1 byte writes.
If 'number of pages in circular buffer' (C) is even, this produces a
minimum of (T*C + C)/2 bytes in the effective buffer. We can figure out
a minimum T value for a given page count C by isolating T from:
(T*C + C)/2 = PAGE_SIZE
which is:
T = (2 * PAGE_SIZE / C) - 1
So with the assumption that we only coalesce if
- the previous write count was < T
- the current write count is < T
- there is room for the current write count in the current page
and if we set C = 16, we can safely let T = (PAGE_SIZE >> 3) - 1 and the
minimum effective buffer is guaranteed to be at least PAGE_SIZE bytes.
This should reduce the amount of coalescing work required and thus
reduce the amount of locking required for the case where performance
counts (on x86, the writer would have to write 511 bytes to use the
non-coalescing fast path).
This method is probably better as many writers will be buffering their
writes, and the size also nicely fits with block layer's blocksize.
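Method 2 condenses into a small predicate; PAGE_SIZE and the ring size are taken as the constants assumed above (4K pages, C = 16), and the helper name is made up:

```c
#include <stddef.h>

#define MY_PAGE_SIZE 4096
#define RING_PAGES   16	/* C in the derivation above */

/* T = (2 * PAGE_SIZE / C) - 1, i.e. 511 with 4K pages and 16 slots. */
#define COALESCE_T   ((2 * MY_PAGE_SIZE / RING_PAGES) - 1)

/* Hypothetical helper implementing Method 2: coalesce only when both the
 * previous write and this one were below the threshold, and the tail page
 * still has room for the new bytes. */
static int should_coalesce(size_t prev_len, size_t cur_len, size_t tail_room)
{
	return prev_len < COALESCE_T &&
	       cur_len  < COALESCE_T &&
	       cur_len <= tail_room;
}
```

Any write of 511 bytes or more goes straight to the non-coalescing fast path, matching the x86 figure quoted above.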
Just my two cents.
>
> (The other reason to potentially avoid coalescing is that I think it might
> be useful to allow the "sendmsg()/recvmsg()" interfaces that honour packet
> boundaries. The new pipe code _is_ internally "packetized" after all).
>
> Linus
--
Mike Waychison
Sun Microsystems, Inc.
1 (650) 352-5299 voice
1 (416) 202-8336 voice
* Re: Make pipe data structure be a circular list of pages, rather than
2005-01-07 20:59 ` Mike Waychison
@ 2005-01-07 23:46 ` Chris Friesen
2005-01-08 21:38 ` Lee Revell
2005-01-14 10:15 ` Peter Chubb
0 siblings, 2 replies; 38+ messages in thread
From: Chris Friesen @ 2005-01-07 23:46 UTC (permalink / raw)
To: Mike Waychison
Cc: Linus Torvalds, Alan Cox, Oleg Nesterov, William Lee Irwin III,
Linux Kernel Mailing List
Mike Waychison wrote:
> This got me to thinking about how you can heuristically optimize away
> coalescing support and still allow PAGE_SIZE bytes minimum in the
> effective buffer.
While coalescing may be a win in some cases, there should also be some
way to tell the kernel to NOT coalesce, to handle the case where you
want minimum latency at the cost of some throughput.
Chris
* Re: Make pipe data structure be a circular list of pages, rather than
2005-01-07 23:46 ` Chris Friesen
@ 2005-01-08 21:38 ` Lee Revell
2005-01-08 21:51 ` Linus Torvalds
2005-01-14 10:15 ` Peter Chubb
1 sibling, 1 reply; 38+ messages in thread
From: Lee Revell @ 2005-01-08 21:38 UTC (permalink / raw)
To: Chris Friesen
Cc: Mike Waychison, Linus Torvalds, Alan Cox, Oleg Nesterov,
William Lee Irwin III, Linux Kernel Mailing List
On Fri, 2005-01-07 at 17:46 -0600, Chris Friesen wrote:
> Mike Waychison wrote:
>
> > This got me to thinking about how you can heuristically optimize away
> > coalescing support and still allow PAGE_SIZE bytes minimum in the
> > effective buffer.
>
> While coalescing may be a win in some cases, there should also be some
> way to tell the kernel to NOT coalesce, to handle the case where you
> want minimum latency at the cost of some throughput.
Many latency critical apps use (tmpfs mounted) FIFO's for IPC; the Linux
FIFO being one of the fastest known IPC mechanisms. Each client in the
JACK (http://jackit.sf.net) graph wakes the next one by writing a single
byte to a FIFO. Ardour's GUI, control, and audio threads interact via a
similar mechanism. How would you expect this change to impact the
inter-thread wakeup latency? It's confusing when people say "performance",
meaning "increased throughput albeit with more latency". For many
people that's a regression.
These apps *certainly* care about performance, they just don't define it
in terms of throughput.
And yes we do know futexes are the right tool for this but they weren't
available at the time and aren't available on 2.4.
Lee
* Re: Make pipe data structure be a circular list of pages, rather than
2005-01-08 21:38 ` Lee Revell
@ 2005-01-08 21:51 ` Linus Torvalds
2005-01-08 22:02 ` Lee Revell
` (2 more replies)
0 siblings, 3 replies; 38+ messages in thread
From: Linus Torvalds @ 2005-01-08 21:51 UTC (permalink / raw)
To: Lee Revell
Cc: Chris Friesen, Mike Waychison, Alan Cox, Oleg Nesterov,
William Lee Irwin III, Linux Kernel Mailing List
On Sat, 8 Jan 2005, Lee Revell wrote:
>
> Many latency critical apps use (tmpfs mounted) FIFO's for IPC; the Linux
> FIFO being one of the fastest known IPC mechanisms. Each client in the
> JACK (http://jackit.sf.net) graph wakes the next one by writing a single
> byte to a FIFO. Ardour's GUI, control, and audio threads interact via a
> similar mechanism. How would you expect this change to impact the
> inter-thread wakeup latency? It's confusing when people say "performance",
> meaning "increased throughput albeit with more latency". For many
> people that's a regression.
I posted the performance numbers in the thread already, and with every
single throughput number I also talked about what the latency difference
was. So quite frankly, if you were confused, I suspect it was because you
didn't read them. Tssk, tssk.
Short and sweet: the latency changes are in the noise for SMP, but can be
seen on UP. I'll look at it a bit more: since I had to add the coalescing
code anyway, I might also decide to re-use a buffer page rather than free
it immediately, since that may help latency for small writes.
Linus
* Re: Make pipe data structure be a circular list of pages, rather than
2005-01-08 21:51 ` Linus Torvalds
@ 2005-01-08 22:02 ` Lee Revell
2005-01-08 22:29 ` Davide Libenzi
2005-01-09 4:07 ` Linus Torvalds
2 siblings, 0 replies; 38+ messages in thread
From: Lee Revell @ 2005-01-08 22:02 UTC (permalink / raw)
To: Linus Torvalds
Cc: Chris Friesen, Mike Waychison, Alan Cox, Oleg Nesterov,
William Lee Irwin III, Linux Kernel Mailing List
On Sat, 2005-01-08 at 13:51 -0800, Linus Torvalds wrote:
>
> On Sat, 8 Jan 2005, Lee Revell wrote:
> >
> > Many latency critical apps use (tmpfs mounted) FIFO's for IPC; the Linux
> > FIFO being one of the fastest known IPC mechanisms. Each client in the
> > JACK (http://jackit.sf.net) graph wakes the next one by writing a single
> > byte to a FIFO. Ardour's GUI, control, and audio threads interact via a
> > similar mechanism. How would you expect this change to impact the
> > inter-thread wakeup latency? It's confusing when people say "performance",
> > meaning "increased throughput albeit with more latency". For many
> > people that's a regression.
>
> I posted the performance numbers in the thread already, and with every
> single throughput number I also talked about what the latency difference
> was. So quite frankly, if you were confused, I suspect it was because you
> didn't read them. Tssk, tssk.
>
How embarrassing, I found that message immediately after hitting send.
Looks like a big win. I'll try the JACK stress test on it just to make
sure.
Lee
* Re: Make pipe data structure be a circular list of pages, rather than
2005-01-08 21:51 ` Linus Torvalds
2005-01-08 22:02 ` Lee Revell
@ 2005-01-08 22:29 ` Davide Libenzi
2005-01-09 4:07 ` Linus Torvalds
2 siblings, 0 replies; 38+ messages in thread
From: Davide Libenzi @ 2005-01-08 22:29 UTC (permalink / raw)
To: Linus Torvalds
Cc: Lee Revell, Chris Friesen, Mike Waychison, Alan Cox,
Oleg Nesterov, William Lee Irwin III, Linux Kernel Mailing List
On Sat, 8 Jan 2005, Linus Torvalds wrote:
> Short and sweet: the latency changes are in the noise for SMP, but can be
> seen on UP. I'll look at it a bit more: since I had to add the coalescing
> code anyway, I might also decide to re-use a buffer page rather than free
> it immediately, since that may help latency for small writes.
Yeah, I noticed you clean the page's magazine down to the bottom. Maybe
keeping one (or more) page in there might help something, due to cache
improvements and alloc/free page savings.
- Davide
* Re: Make pipe data structure be a circular list of pages, rather than
2005-01-08 21:51 ` Linus Torvalds
2005-01-08 22:02 ` Lee Revell
2005-01-08 22:29 ` Davide Libenzi
@ 2005-01-09 4:07 ` Linus Torvalds
2005-01-09 23:19 ` Davide Libenzi
2 siblings, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2005-01-09 4:07 UTC (permalink / raw)
To: Lee Revell
Cc: Chris Friesen, Mike Waychison, Alan Cox, Oleg Nesterov,
William Lee Irwin III, Linux Kernel Mailing List
On Sat, 8 Jan 2005, Linus Torvalds wrote:
>
> Short and sweet: the latency changes are in the noise for SMP, but can be
> seen on UP. I'll look at it a bit more: since I had to add the coalescing
> code anyway, I might also decide to re-use a buffer page rather than free
> it immediately, since that may help latency for small writes.
Side note: the SMP numbers seem to fluctuate wildly making any
benchmarking a bit suspect. They probably depend very much on just how
things get scheduled, and on inter-CPU cache behaviour.
In contrast, the UP numbers are a lot more reliable, so I'll try to make
sure that everything looks good on UP - which definitely means that I'll
work to make sure that latency looks good too.
Linus
* Re: Make pipe data structure be a circular list of pages, rather than
2005-01-09 4:07 ` Linus Torvalds
@ 2005-01-09 23:19 ` Davide Libenzi
0 siblings, 0 replies; 38+ messages in thread
From: Davide Libenzi @ 2005-01-09 23:19 UTC (permalink / raw)
To: Linus Torvalds
Cc: Lee Revell, Chris Friesen, Mike Waychison, Alan Cox,
Oleg Nesterov, William Lee Irwin III, Linux Kernel Mailing List
On Sat, 8 Jan 2005, Linus Torvalds wrote:
> Side note: the SMP numbers seem to fluctuate wildly making any
> benchmarking a bit suspect. They probably depend very much on just how
> things get scheduled, and on inter-CPU cache behaviour.
Lmbench results on SMP are bogus. OTOH you can easily make a more
deterministic SMP pipe bench by making the two tasks affine to two
different CPUs.
- Davide
* Re: Make pipe data structure be a circular list of pages, rather than
2005-01-07 23:46 ` Chris Friesen
2005-01-08 21:38 ` Lee Revell
@ 2005-01-14 10:15 ` Peter Chubb
1 sibling, 0 replies; 38+ messages in thread
From: Peter Chubb @ 2005-01-14 10:15 UTC (permalink / raw)
To: Chris Friesen
Cc: Mike Waychison, Linus Torvalds, Alan Cox, Oleg Nesterov,
William Lee Irwin III, Linux Kernel Mailing List
>>>>> "Chris" == Chris Friesen <cfriesen@nortelnetworks.com> writes:
Chris> Mike Waychison wrote:
>> This got me to thinking about how you can heuristically optimize
>> away coalescing support and still allow PAGE_SIZE bytes minimum in
>> the effective buffer.
Chris> While coalescing may be a win in some cases, there should also
Chris> be some way to tell the kernel to NOT coalesce, to handle the
Chris> case where you want minimum latency at the cost of some
Chris> throughput.
SysVr4 pipes used to have two modes: a `legacy' mode that coalesced
data, and a `message' mode that preserved message boundaries. Seems
to me that we could be about to have the latter in Linux...
--
Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au
The technical we do immediately, the political takes *forever*
* Re: Make pipe data structure be a circular list of pages, rather than
2005-01-07 17:48 ` Linus Torvalds
2005-01-07 20:59 ` Mike Waychison
@ 2005-01-07 21:59 ` Linus Torvalds
2005-01-07 22:53 ` Diego Calleja
2005-01-10 23:23 ` Robert White
2 siblings, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2005-01-07 21:59 UTC (permalink / raw)
To: Alan Cox; +Cc: Oleg Nesterov, William Lee Irwin III, Linux Kernel Mailing List
On Fri, 7 Jan 2005, Linus Torvalds wrote:
>
> I want to highlight this again. The increase in throughput did _not_ come
> from avoiding a copy. It came purely from the fact that we have multiple
> buffers, and thus a throughput-intensive load gets to do several bigger
> chunks before it needs to wait for the recipient. So the increase in
> throughput comes from fewer synchronization points (scheduling and
> locking), not from anything else.
Btw, from limited testing this effect seems to be much more pronounced on
SMP.
I've done some more benchmarking, and on a single-CPU Banias laptop,
bandwidth didn't really change much, and latency went up measurably
(roughly 2.5us -> 3.0us). In contrast, on a P4 with HT, the latency
change was barely noticeable (looks like 10.4us -> 10.7us), since there
the locking and system call costs dominate, and throughput went up from
370MB/s to 900+MB/s (ie > 100% improvement).
So it's not always a win, but at the same time sometimes it's quite a big
win. We might tune the number of buffers down for the UP case, or
something.
Side note: the UNIX domain socket lmbench numbers are higher than the
memory copy numbers, implying that at least the UNIX domain socket tests
have the data set fit in the L2. That may be the reason why lmbench says
that sockets outperform pipes.
Linus
* Re: Make pipe data structure be a circular list of pages, rather than
2005-01-07 21:59 ` Linus Torvalds
@ 2005-01-07 22:53 ` Diego Calleja
2005-01-07 23:15 ` Linus Torvalds
0 siblings, 1 reply; 38+ messages in thread
From: Diego Calleja @ 2005-01-07 22:53 UTC (permalink / raw)
To: Linus Torvalds; +Cc: alan, oleg, wli, linux-kernel
El Fri, 7 Jan 2005 13:59:53 -0800 (PST) Linus Torvalds <torvalds@osdl.org> escribió:
> Btw, from limited testing this effect seems to be much more pronounced on
> SMP.
I've tried it in a 2xPIII, 512 MB of ram. I've done kernel compiles... yeah,
yeah, I know kernel compiles are not a good benchmark, but since linux uses
-pipe for gcc I thought I might try it to see if it makes a difference
(suggestions about better methods to test this are accepted :). The results
could very well be statistical noise (it looks like it only takes a second less
on average), but I thought it might be interesting.
without the patch:
952.84user 98.64system 8:54.64elapsed 196%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+7318007minor)pagefaults 0swaps
951.78user 98.25system 8:53.71elapsed 196%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+7318009minor)pagefaults 0swaps
954.53user 99.12system 8:53.02elapsed 197%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+7318041minor)pagefaults 0swaps
with it:
951.67user 97.59system 8:53.12elapsed 196%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+7318010minor)pagefaults 0swaps
949.04user 97.68system 8:52.04elapsed 196%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+7318018minor)pagefaults 0swaps
948.40user 97.37system 8:51.48elapsed 196%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+7318011minor)pagefaults 0swaps
* Re: Make pipe data structure be a circular list of pages, rather than
2005-01-07 22:53 ` Diego Calleja
@ 2005-01-07 23:15 ` Linus Torvalds
0 siblings, 0 replies; 38+ messages in thread
From: Linus Torvalds @ 2005-01-07 23:15 UTC (permalink / raw)
To: Diego Calleja; +Cc: alan, oleg, wli, linux-kernel
On Fri, 7 Jan 2005, Diego Calleja wrote:
>
> I've tried it in a 2xPIII, 512 MB of ram. I've done kernel compiles... yeah,
> yeah, I know kernel compiles are not a good benchmark, but since linux uses
> -pipe for gcc I thought I might try it to see if it makes a difference
> (suggestions about better methods to test this are accepted :). The results
> could very well be statistical noise (it looks like it only takes a second less
> on average), but I thought it might be interesting.
Hmm.. I'm gratified, but a bit suspicious. I'd not have expected it to be
noticeable on a kernel compile (that's a 1%+ change in total system time,
which considering all the file ops etc that a compile does means that it
would have to have made quite a big difference for the pipe case).
So your data is intriguing, but it looks a bit _too_ good to be true ;)
Random things like page cache placement (and thus cache effects) might
also account for it. But hey, if true, that's quite an improvement.
(Of course, anything the kernel does tends to be totally drowned out by
all the actual user-level work gcc does, so in the _big_ picture the 1%
change in system time means almost nothing. So even "good news" doesn't
actually matter in this case).
Linus
* RE: Make pipe data structure be a circular list of pages, rather than
2005-01-07 17:48 ` Linus Torvalds
2005-01-07 20:59 ` Mike Waychison
2005-01-07 21:59 ` Linus Torvalds
@ 2005-01-10 23:23 ` Robert White
2 siblings, 0 replies; 38+ messages in thread
From: Robert White @ 2005-01-10 23:23 UTC (permalink / raw)
To: 'Linus Torvalds', 'Alan Cox'
Cc: 'Oleg Nesterov', 'William Lee Irwin III',
'Linux Kernel Mailing List'
Howdy,
This entire discussion of the pipe and splice() is very much the same (issue,
problem, approach) as the STREAMS interface from UNIX SVR4, which had the pipe device
as its trivial example device. [This is all from vague recall, so forgive the
sketchiness 8-)]
[Note: check out this STREAMS programmers guide
http://docs-pdf.sun.com/802-1933/802-1933.pdf from Sun's documentation site.]
A Message was a fixed-size Message structure with a linked list of Buffers attached
to it. The Buffers were a structure-and-payload arrangement. The whole
arrangement could be zero-copy or self-condensing as needed.
In Linux parlance each file descriptor was basically a reference to a pair of linked
list headers (one for messages available for reading and one to accept data from
writes), a pointer to the driver below the head, and an optional tasklet. The driver
itself was virtually the same data structure.
STREAMS was a little overdone and could have used some of the Linuxisms like
tasklets (instead of "the STREAMS Scheduler"), and the bulk of the nonsense for
getting Message and Buffer and STREAMS Head structures just screams Slab Cache, but
it has a lot of interesting bits.
The reason I bring it up is four-fold.
1) They had push and pop ioctl operations on the streams so that you could actually
interpose translators and multiplexors in the streams, so you could use the design to
do the splice() operation directly by building something that basically looked like:

     fd0a   fd0b
        mux
    pipe-driver
        fd1

basically by pushing the mux onto the stream head and then connecting the second
and subsequent fds to the mux object.
2) The Messages had type fields so you could do interesting things like pass an open
file descriptor to another process across a pipe because the special "this is an FD
message" could arrive unmolested and trigger the necessary file table manipulations
at the receiving end.
3) The optional tasklet (equivalent) at the various points in the STREAM could do
things like buffer consolidation, which in the Linux model could compete with actual
data consumption fairly easily/cheaply.
4) Some of the things that I have run into dealing with line disciplines and
streaming data through character devices (and USB) could gain a lot from the
Message/Buffer/STREAM paradigm. [e.g. the fixed-size flip buffers in n_tty would be
rendered moot by having the linked lists above and below the line discipline.]
Anyway, thick as it is, I think the pdf referenced above might be interesting reading
vis-à-vis the pipe() discussion.
Rob White,
Casabyte, Inc.
* Re: Make pipe data structure be a circular list of pages, rather than
2005-01-07 16:06 ` Alan Cox
2005-01-07 17:33 ` Linus Torvalds
@ 2005-01-07 17:45 ` Chris Friesen
1 sibling, 0 replies; 38+ messages in thread
From: Chris Friesen @ 2005-01-07 17:45 UTC (permalink / raw)
To: Alan Cox
Cc: Linus Torvalds, Oleg Nesterov, William Lee Irwin III,
Linux Kernel Mailing List
Alan Cox wrote:
> On Gwe, 2005-01-07 at 16:17, Linus Torvalds wrote:
>
>>it a "don't do that then", and I'll wait to see if people do. I can't
>>think of anything that cares about performance that does that anyway:
>>because system calls are reasonably expensive regardless, anybody who
>>cares at all about performance will have buffered things up in user space.
>
>
> Actually I found a load of apps that do this but they don't care about
> performance. Lots of people have signal handlers that just do
>
> write(pipe_fd, &signr, sizeof(signr))
>
> so they can drop signalled events into their event loops
I would be one of those people, although I do pass the signal
information as well.
Chris
* Re: Make pipe data structure be a circular list of pages, rather than
2005-01-07 16:17 ` Linus Torvalds
2005-01-07 16:06 ` Alan Cox
@ 2005-01-07 16:39 ` Davide Libenzi
2005-01-07 17:09 ` Linus Torvalds
2005-08-18 6:07 ` Coywolf Qi Hunt
2 siblings, 1 reply; 38+ messages in thread
From: Davide Libenzi @ 2005-01-07 16:39 UTC (permalink / raw)
To: Linus Torvalds
Cc: Oleg Nesterov, William Lee Irwin III, Linux Kernel Mailing List
On Fri, 7 Jan 2005, Linus Torvalds wrote:
> I'm clearly enamoured with this concept. I think it's one of those few
> "RightThing(tm)" that doesn't come along all that often. I don't know of
> anybody else doing this, and I think it's both useful and clever. If you
> now prove me wrong, I'll hate you forever ;)
I don't know which kind of poison Larry put in your glass when you two
met, but it clearly worked :=)
- Davide
* Re: Make pipe data structure be a circular list of pages, rather than
2005-01-07 16:39 ` Davide Libenzi
@ 2005-01-07 17:09 ` Linus Torvalds
0 siblings, 0 replies; 38+ messages in thread
From: Linus Torvalds @ 2005-01-07 17:09 UTC (permalink / raw)
To: Davide Libenzi
Cc: Oleg Nesterov, William Lee Irwin III, Linux Kernel Mailing List
On Fri, 7 Jan 2005, Davide Libenzi wrote:
>
> I don't know which kind of poison Larry put in your glass when you two
> met, but it clearly worked :=)
Oh, Larry introduced the notion of "splice()" to me at least five years
ago, so whatever he put in that glass was quite slow-acting. At the time I
told him to forget it - I thought it intriguing, but couldn't see any
reasonable way to do it.
And to be honest, my pipes aren't really what Larry's "splice()" was: his
notion was more of a direct "fd->fd" thing, with nothing in between. That
was what I still consider unworkable.
The problem with splicing two file descriptors together is that it needs
for them to agree on the interchange format, and these things tend to be
very expensive from an interface angle (ie the format has to _exactly_
match the "impedance" at both ends, since any conversion would make the
whole exercise pointless - you'd be better off just doing a read and a
write).
And note that the "format" is not just the actual data transfer thing, but
the rules about what to do with partial reads, partial writes, file
offsets, streams _without_ file offsets etc etc). And that's really what
the pipe generalization does: it decouples all the format things into an
intermediate thing that is very unambiguous, and has none of those issues.
The pipe approach also decouples things in another way: a true "splice()"
needs all entities to participate in the new world order for it to be
useful. It shares that brokenness with "sendfile()", which to some degree
exemplifies the problems with splice() (sendfile ends up being a very very
specialized uni-directional form of Larry's splice()). But realizing that
pipes _are_ the conduit and that they already exist and work with very old
interfaces suddenly means that the new splice() can be incrementally
useful.
So it took me five years to think about other things, until an actual
approach that would work and make sense from an implementation standpoint
emerged.
What made me think about this particular thing is that I really thought a
lot of things want to just mmap the data, but do not actually want to go
to the expense of finding virtual addresses, handling the virtual->
physical translation at each stage. A lot of things just want to access
the buffers, _without_ the user-visible mapping. User-visibility is not
only expensive, it ends up being an impossible interface for many things
(memory mapping has page granularity, something that simply isn't true for
a lot of devices).
So in many ways, think of the new pipes as a "buffer mapping". Nothing
else. It's a way to carry arbitrary buffers along in a secure manner.
Linus
* Re: Make pipe data structure be a circular list of pages, rather than
2005-01-07 16:17 ` Linus Torvalds
2005-01-07 16:06 ` Alan Cox
2005-01-07 16:39 ` Davide Libenzi
@ 2005-08-18 6:07 ` Coywolf Qi Hunt
2 siblings, 0 replies; 38+ messages in thread
From: Coywolf Qi Hunt @ 2005-08-18 6:07 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Oleg Nesterov, William Lee Irwin III, linux-kernel
On 1/8/05, Linus Torvalds <torvalds@osdl.org> wrote:
>
>
> On Fri, 7 Jan 2005, Oleg Nesterov wrote:
> >
> > If i understand this patch correctly, then this code
> >
> > for (;;)
> > write(pipe_fd, &byte, 1);
> >
> > will block after writing PIPE_BUFFERS == 16 characters, no?
> > And pipe_inode_info will use 64K to hold 16 bytes!
>
> Yes.
>
> > Is it ok?
>
> If you want throughput, don't do single-byte writes. Obviously we _could_
> do coalescing, but there's a reason I'd prefer to avoid it. So I consider
> it a "don't do that then", and I'll wait to see if people do. I can't
> think of anything that cares about performance that does that anyway:
> because system calls are reasonably expensive regardless, anybody who
> cares at all about performance will have buffered things up in user space.
>
> > May be it make sense to add data to the last allocated page
> > until buf->len > PAGE_SIZE ?
>
> The reason I don't want to coalesce is that I don't ever want to modify a
> page that is on a pipe buffer (well, at least not through the pipe buffer
> - it might get modified some other way). Why? Because the long-term plan
> for pipe-buffers is to allow the data to come from _other_ sources than
> just a user space copy. For example, it might be a page directly from the
> page cache, or a partial page that contains the data part of an skb that
> just came in off the network.
>
> With this organization, a pipe ends up being able to act as a "conduit"
> for pretty much any data, including some high-bandwidth things like video
> streams, where you really _really_ don't want to copy the data. So the
> next stage is:
>
> - allow the buffer size to be set dynamically per-pipe (probably only
> increased by root, due to obvious issues, although a per-user limit is
> not out of the question - it's just a "mlock" in kernel buffer space,
> after all)
> - add per-"struct pipe_buffer" ops pointer to a structure with
> operation function pointers: "release()", "wait_for_ready()", "poll()"
> (and possibly "merge()", if we want to coalesce things, although I
> really hope we won't need to)
> - add a "splice(fd, fd)" system call that copies pages (by incrementing
> their reference count, not by copying the data!) from an input source
> to the pipe, or from a pipe to an output.
> - add a "tee(in, out1, out2)" system call that duplicates the pages
> (again, incrementing their reference count, not copying the data) from
> one pipe to two other pipes.
>
> All of the above is basically a few lines of code (the "splice()" thing
> requires some help from drivers/networking/pagecache etc, but it's not
> complex help, and not everybody needs to do it - I'll start off with
> _just_ a generic page cache helper to get the thing rolling, that's easy).
>
> Now, imagine using the above in a media server, for example. Let's say
> that a year or two has passed, so that the video drivers have been updated
> to be able to do the splice thing, and what can you do? You can:
>
> - splice from the (mpeg or whatever - let's just assume that the video
> input is either digital or does the encoding on its own - like they
> pretty much all do) video input into a pipe (remember: no copies - the
> video input will just DMA directly into memory, and splice will just
> set up the pages in the pipe buffer)
> - tee that pipe to split it up
> - splice one end to a file (ie "save the compressed stream to disk")
> - splice the other end to a real-time video decoder window for your
> real-time viewing pleasure.
>
> That's the plan, at least. I think it makes sense, and the thing that
> convinced me about it was (a) how simple all of this seems to be
> implementation-wise (modulo details - but there are no "conceptually
> complex" parts: no horrid asynchronous interfaces, no questions about
> how to buffer things, no direct user access to pages that might
> partially contain protected data etc etc) and (b) it's so UNIXy. If
> there's something that says "the UNIX way", it's pipes between entities
> that act on the data.
>
> For example, let's say that you wanted to serve a file from disk (or any
> other source) with a header to another program (or to a TCP connection, or
> to whatever - it's just a file descriptor). You'd do
>
> fd = create_pipe_to_destination();
>
> input = open("filename", O_RDONLY);
> write(fd, "header goes here", length_of_header);
> for (;;) {
> ssize_t err;
> err = splice(input, fd,
> ~0 /* maxlen */,
> 0 /* msg flags - think "sendmgsg" */);
> if (err > 0)
> continue;
> if (!err) /* EOF */
> break;
> .. handle input errors here ..
> }
>
> (obviously, if this is a real server, this would likely all be in a
> select/epoll loop, but that just gets too hard to describe concisely, so
> I'm showing the single-threaded simple version).
>
> Further, this also ends up giving a nice pollable interface to regular
> files too: just splice from the file (at any offset) into a pipe, and poll
> on the result. The "splice()" will just do the page cache operations and
> start the IO if necessary, the "poll()" will wait for the first page to be
> actually available. All _trivially_ done with the "struct pipe_buffer"
> operations.
>
> So the above kind of "send a file to another destination" should
> automatically work very naturally in any poll loop: instead of filling a
> writable pipe with a "write()", you just fill it with "splice()" instead
> (and you can read it with a 'read()' or you just splice it to somewhere
> else, or you tee() it to two destinations....).
>
> I think the media server kind of environment is the one most easily
> explained, where you have potentially tons of data that the server process
> really never actually wants to _see_ - it just wants to push it on to
> another process or connection or save it to a log-file or something. But
> as with regular pipes, it's not a specialized interface: it really is just
> a channel of communication.
>
> The difference being that a historical UNIX pipe is always a channel
> between two process spaces (ie you can only fill it and empty it into the
> process address space), and the _only_ thing I'm trying to do is to have
> it be able to be a channel between two different file descriptors too. You
> still need the process to "control" the channel, but the data doesn't have
> to touch the address space any more.
>
> Think of all the servers or other processes that really don't care about
> the data. Think of something as simple as encrypting a file, for example.
> Imagine that you have hardware encryption support that does DMA from the
> source, and writes the results using DMA. I think it's pretty obvious how
> you'd connect this up using pipes and two splices (one for the input, one
> for the output).
>
> And notice how _flexible_ it is (both the input and the output can be any
> kind of fd you want - the pipes end up doing both the "conversion" into a
> common format of "list of (possibly partial) pages" and the buffering,
> which is why the different "engines" don't need to care where the data
> comes from, or where it goes. So while you can use it to encrypt a file
> into another file, you could equally easily use it for something like
>
> tar cvf - my_source_tree | hw_engine_encrypt | splice_to_network
>
> and the whole pipeline would not have a _single_ actual data copy: the
> pipes are channels.
>
> Of course, since it's a pipe, the nice thing is that people don't have to
> use "splice()" to access it - the above pipeline has a perfectly regular
> "tar" process that probably just does regular writes. You can have a
> process that does "splice()" to fill the pipe, and the other end is a
> normal thing that just uses regular "read()" and doesn't even _know_ that
> the pipe is using new-fangled technology to be filled.
>
> I'm clearly enamoured with this concept. I think it's one of those few
> "RightThing(tm)" that doesn't come along all that often. I don't know of
> anybody else doing this, and I think it's both useful and clever. If you
> now prove me wrong, I'll hate you forever ;)
>
> Linus
Actually, this is where L4 differs from traditional micro-kernels, IMHO. Isn't it?
I have a long-term plan to add a messaging subsystem based on this.
--
Coywolf Qi Hunt
http://ahbl.org/~coywolf/
* RE: Make pipe data structure be a circular list of pages, rather than
@ 2005-01-20 2:14 Robert White
0 siblings, 0 replies; 38+ messages in thread
From: Robert White @ 2005-01-20 2:14 UTC (permalink / raw)
To: 'Robert White', linux, linux-kernel; +Cc: lm, torvalds
P.S. Not to reply to myself... 8-) I took some liberties with that description.
STREAMS doesn't, to the best of my knowledge, have the cleanup hook stuff. That was
me folding your issues (direct PCI device buffers etc) from this thread on top of the
basic skeleton of STREAMS to broaden the "impedance matching" possibilities.
(No really, back to lurking... 8-)
Rob White,
Casabyte, Inc.
* Re: Make pipe data structure be a circular list of pages, rather
@ 2005-01-16 2:59 Linus Torvalds
2005-01-19 21:12 ` Make pipe data structure be a circular list of pages, rather than linux
0 siblings, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2005-01-16 2:59 UTC (permalink / raw)
To: Ingo Oeser; +Cc: linux, Kernel Mailing List
On Fri, 14 Jan 2005, Linus Torvalds wrote:
>
> So I don't think you'll get _exactly_ what you want for a while, since
> you'd have to go through in-memory buffers. But there's no huge conceptual
> problem (just enough implementation issues to make it inconvenient) with
> the concept of actually keeping the buffer on an external controller.
Here is, if you are interested, a slightly updated version of my
"work-in-progress" example patch. It:
- depends on a fairly recent -BK (since it uses the pipe buffer ops
interfaces that I already checked in)
- is very ugly: the VFS interfaces to splice things really are very wrong
indeed, and they _should_ be about passing "struct pipe_buffer" arrays
around instead.
- it only does one specific case (splicing from the page cache directly
into a pipe)
- only set up for ext3 right now (though the "generic_file_splice" thing
should work for any page-cache user)
- and even that specific case it does wrong since it doesn't update the
"f_pos" field right now
- error handling is broken - if an IO error happens on the splice, the
"map()" operation will return NULL, and the current fs/pipe.c can't
handle that and will oops ;)
so it really is _purely_ a demonstration vehicle, but it at least shows
the kind of operation that I'm thinking of.
An example program to show its use is a simple:
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>
#include <linux/unistd.h>

#define __NR_sys_splice 289
inline _syscall4(int, sys_splice, int, a, int, b, size_t, len, unsigned long, flags);

int main(int argc, char **argv)
{
	int fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		exit(1);
	}
	write(1, "hello: ", 7);
	sys_splice(fd, 1, 1000, 0);
	return 0;
}
which demonstrates three things:
- it works on a perfectly regular pipe, with no special setup, so to use
it you can just do
./test-splice my-file | less
and the normal pipe that the shell created for this will just work. No
new magic fd's required.
- that you can combine different sources to the pipe (mixing a
traditional static write with a "splice from a file").
- we start the IO when we do the splice, but the waiting for the result
happens separately, when (or if) the data is used by the other end.
it will print out "hello: " followed by the first 1000 bytes of the file.
It's obviously not very useful for anything but demonstration, since
splicing the other way (from a pipe to another fd) isn't implemented by
this example, but it gives a more concrete example of what I'm thinking
of.
And shows that the splicing itself is pretty simple ("move_to_pipe()"
does all the moving of buffers to the pipe) - the rest is just regular
setup and finding of the pages in the page cache.
Linus
----
# This is a BitKeeper generated diff -Nru style patch.
#
# ChangeSet
# 2005/01/15 18:36:19-08:00 torvalds@evo.osdl.org
# sys_splice: take 3
#
# Still horribly bad VFS interface, still doing the pipe
# filling inside the per-fd callback instead of doing it
# in "generic code" in fs/splice.c.
#
# Need to pass in (and out) arrays of "struct pipe_buffer"
# on a VFS level instead.
#
# fs/splice.c
# 2005/01/15 18:36:07-08:00 torvalds@evo.osdl.org +229 -0
#
# include/linux/fs.h
# 2005/01/15 18:36:07-08:00 torvalds@evo.osdl.org +2 -0
# sys_splice: take 3
#
# Still horribly bad VFS interface, still doing the pipe
# filling inside the per-fd callback instead of doing it
# in "generic code" in fs/splice.c.
#
# Need to pass in (and out) arrays of "struct pipe_buffer"
# on a VFS level instead.
#
# include/asm-i386/unistd.h
# 2005/01/15 18:36:07-08:00 torvalds@evo.osdl.org +2 -1
# sys_splice: take 3
#
# Still horribly bad VFS interface, still doing the pipe
# filling inside the per-fd callback instead of doing it
# in "generic code" in fs/splice.c.
#
# Need to pass in (and out) arrays of "struct pipe_buffer"
# on a VFS level instead.
#
# fs/splice.c
# 2005/01/15 18:36:07-08:00 torvalds@evo.osdl.org +0 -0
# BitKeeper file /home/torvalds/v2.6/linux/fs/splice.c
#
# fs/ext3/file.c
# 2005/01/15 18:36:07-08:00 torvalds@evo.osdl.org +3 -0
# sys_splice: take 3
#
# Still horribly bad VFS interface, still doing the pipe
# filling inside the per-fd callback instead of doing it
# in "generic code" in fs/splice.c.
#
# Need to pass in (and out) arrays of "struct pipe_buffer"
# on a VFS level instead.
#
# fs/Makefile
# 2005/01/15 18:36:07-08:00 torvalds@evo.osdl.org +1 -0
# sys_splice: take 3
#
# Still horribly bad VFS interface, still doing the pipe
# filling inside the per-fd callback instead of doing it
# in "generic code" in fs/splice.c.
#
# Need to pass in (and out) arrays of "struct pipe_buffer"
# on a VFS level instead.
#
# arch/i386/kernel/entry.S
# 2005/01/15 18:36:07-08:00 torvalds@evo.osdl.org +1 -0
# sys_splice: take 3
#
# Still horribly bad VFS interface, still doing the pipe
# filling inside the per-fd callback instead of doing it
# in "generic code" in fs/splice.c.
#
# Need to pass in (and out) arrays of "struct pipe_buffer"
# on a VFS level instead.
#
diff -Nru a/arch/i386/kernel/entry.S b/arch/i386/kernel/entry.S
--- a/arch/i386/kernel/entry.S 2005-01-15 18:36:40 -08:00
+++ b/arch/i386/kernel/entry.S 2005-01-15 18:36:40 -08:00
@@ -864,5 +864,6 @@
.long sys_add_key
.long sys_request_key
.long sys_keyctl
+ .long sys_splice
syscall_table_size=(.-sys_call_table)
diff -Nru a/fs/Makefile b/fs/Makefile
--- a/fs/Makefile 2005-01-15 18:36:40 -08:00
+++ b/fs/Makefile 2005-01-15 18:36:40 -08:00
@@ -10,6 +10,7 @@
ioctl.o readdir.o select.o fifo.o locks.o dcache.o inode.o \
attr.o bad_inode.o file.o filesystems.o namespace.o aio.o \
seq_file.o xattr.o libfs.o fs-writeback.o mpage.o direct-io.o \
+ splice.o
obj-$(CONFIG_EPOLL) += eventpoll.o
obj-$(CONFIG_COMPAT) += compat.o
diff -Nru a/fs/ext3/file.c b/fs/ext3/file.c
--- a/fs/ext3/file.c 2005-01-15 18:36:40 -08:00
+++ b/fs/ext3/file.c 2005-01-15 18:36:40 -08:00
@@ -101,6 +101,8 @@
return ret;
}
+extern ssize_t generic_file_splice_read(struct file *in, struct inode *pipe, size_t len, unsigned long flags);
+
struct file_operations ext3_file_operations = {
.llseek = generic_file_llseek,
.read = do_sync_read,
@@ -115,6 +117,7 @@
.release = ext3_release_file,
.fsync = ext3_sync_file,
.sendfile = generic_file_sendfile,
+ .splice_read = generic_file_splice_read,
};
struct inode_operations ext3_file_inode_operations = {
diff -Nru a/fs/splice.c b/fs/splice.c
--- /dev/null Wed Dec 31 16:00:00 1969
+++ b/fs/splice.c 2005-01-15 18:36:40 -08:00
@@ -0,0 +1,229 @@
+/*
+ * linux/fs/splice.c
+ *
+ * "splice": joining two ropes together by interweaving their strands.
+ *
+ * This is the "extended pipe" functionality, where a pipe is used as
+ * an arbitrary in-memory buffer. Think of a pipe as a small kernel
+ * buffer that you can use to transfer data from one end to the other.
+ *
+ * The traditional unix read/write is extended with a "splice()" operation
+ * that transfers data buffers to or from a pipe buffer.
+ *
+ * Named by Larry McVoy.
+ */
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/pagemap.h>
+#include <linux/pipe_fs_i.h>
+
+static void page_cache_pipe_buf_release(struct pipe_inode_info *info, struct pipe_buffer *buf)
+{
+ page_cache_release(buf->page);
+}
+
+static void *page_cache_pipe_buf_map(struct file *file, struct pipe_inode_info *info, struct pipe_buffer *buf)
+{
+ struct page *page = buf->page;
+
+ if (!PageUptodate(page)) {
+ wait_on_page_locked(page);
+ if (!PageUptodate(page))
+ return NULL;
+ }
+ return kmap(page);
+}
+
+static void page_cache_pipe_buf_unmap(struct pipe_inode_info *info, struct pipe_buffer *buf)
+{
+ kunmap(buf->page);
+}
+
+static struct pipe_buf_operations page_cache_pipe_buf_ops = {
+ .can_merge = 0,
+ .map = page_cache_pipe_buf_map,
+ .unmap = page_cache_pipe_buf_unmap,
+ .release = page_cache_pipe_buf_release,
+};
+
+static ssize_t move_to_pipe(struct inode *inode, struct page **array, int pages, unsigned long offset, unsigned long len)
+{
+ struct pipe_inode_info *info;
+ int ret, do_wakeup;
+
+ down(PIPE_SEM(*inode));
+ info = inode->i_pipe;
+ ret = 0;
+ do_wakeup = 0;
+ for (;;) {
+ int bufs;
+
+ if (!PIPE_READERS(*inode)) {
+ send_sig(SIGPIPE, current, 0);
+ if (!ret) ret = -EPIPE;
+ break;
+ }
+ bufs = info->nrbufs;
+ if (bufs < PIPE_BUFFERS) {
+ int newbuf = (info->curbuf + bufs) & (PIPE_BUFFERS-1);
+ struct pipe_buffer *buf = info->bufs + newbuf;
+ struct page *page = *array++;
+ unsigned long this_len;
+
+ this_len = PAGE_SIZE - offset;
+ if (this_len > len)
+ this_len = len;
+
+ buf->page = page;
+ buf->offset = offset;
+ buf->len = this_len;
+ buf->ops = &page_cache_pipe_buf_ops;
+ info->nrbufs = ++bufs;
+ do_wakeup = 1;
+
+ ret += this_len;
+ len -= this_len;
+ offset = 0;
+ if (!--pages)
+ break;
+ if (!len)
+ break;
+ if (bufs < PIPE_BUFFERS)
+ continue;
+ break;
+ }
+ if (signal_pending(current)) {
+ if (!ret) ret = -ERESTARTSYS;
+ break;
+ }
+ if (do_wakeup) {
+ wake_up_interruptible_sync(PIPE_WAIT(*inode));
+ kill_fasync(PIPE_FASYNC_READERS(*inode), SIGIO, POLL_IN);
+ do_wakeup = 0;
+ }
+ PIPE_WAITING_WRITERS(*inode)++;
+ pipe_wait(inode);
+ PIPE_WAITING_WRITERS(*inode)--;
+ }
+ up(PIPE_SEM(*inode));
+ if (do_wakeup) {
+ wake_up_interruptible(PIPE_WAIT(*inode));
+ kill_fasync(PIPE_FASYNC_READERS(*inode), SIGIO, POLL_IN);
+ }
+ while (--pages >= 0)
+ page_cache_release(*++array);
+ return ret;
+}
+
+ssize_t generic_file_splice_read(struct file *in, struct inode *pipe, size_t len, unsigned long flags)
+{
+ struct address_space *mapping = in->f_mapping;
+ unsigned long long pos, size, left;
+ unsigned long index, offset, pages;
+ struct page *array[PIPE_BUFFERS];
+ int i;
+
+ pos = in->f_pos;
+ size = i_size_read(mapping->host);
+ if (pos >= size)
+ return 0;
+ left = size - pos;
+ if (left < len)
+ len = left;
+ index = pos >> PAGE_CACHE_SHIFT;
+ offset = pos & ~PAGE_CACHE_MASK;
+ pages = (len + offset + PAGE_SIZE - 1) >> PAGE_CACHE_SHIFT;
+
+ if (pages > PIPE_BUFFERS)
+ pages = PIPE_BUFFERS;
+
+ /*
+ * Get as many pages from the page cache as possible..
+ * Start IO on the page cache entries we create (we
+ * can assume that any pre-existing ones we find have
+ * already had IO started on them).
+ */
+ i = find_get_pages(mapping, index, pages, array);
+
+ /*
+ * If not all pages were in the page-cache, we'll
+ * just assume that the rest haven't been read in,
+ * so we'll get the rest locked and start IO on
+ * them if we can..
+ */
+ while (i < pages) {
+ struct page *page = find_or_create_page(mapping, index + i, GFP_USER);
+ int error;
+ if (!page)
+ break;
+ if (PageUptodate(page)) {
+ unlock_page(page);
+ } else {
+ error = mapping->a_ops->readpage(in, page);
+ if (unlikely(error)) {
+ page_cache_release(page);
+ break;
+ }
+ }
+ array[i++] = page;
+ }
+
+ if (!i)
+ return 0;
+
+ /*
+ * Now we splice them into the pipe..
+ */
+ return move_to_pipe(pipe, array, i, offset, len);
+}
+
+static long do_splice_from(struct inode *pipe, struct file *out, size_t len, unsigned long flags)
+{
+ if (out->f_op && out->f_op->splice_write)
+ return out->f_op->splice_write(pipe, out, len, flags);
+ return -EINVAL;
+}
+
+static long do_splice_to(struct file *in, struct inode *pipe, size_t len, unsigned long flags)
+{
+ if (in->f_op && in->f_op->splice_read)
+ return in->f_op->splice_read(in, pipe, len, flags);
+ return -EINVAL;
+}
+
+static long do_splice(struct file *in, struct file *out, size_t len, unsigned long flags)
+{
+ struct inode *pipe;
+
+ if (!len)
+ return 0;
+ pipe = in->f_dentry->d_inode;
+ if (pipe->i_pipe)
+ return do_splice_from(pipe, out, len, flags);
+ pipe = out->f_dentry->d_inode;
+ if (pipe->i_pipe)
+ return do_splice_to(in, pipe, len, flags);
+ return -EINVAL;
+}
+
+asmlinkage long sys_splice(int fdin, int fdout, size_t len, unsigned long flags)
+{
+ long error;
+ struct file *in, *out;
+ int fput_in, fput_out;
+
+ error = -EBADF;
+ in = fget_light(fdin, &fput_in);
+ if (in) {
+ if (in->f_mode & FMODE_READ) {
+ out = fget_light(fdout, &fput_out);
+ if (out) {
+ if (out->f_mode & FMODE_WRITE)
+ error = do_splice(in, out, len, flags);
+ fput_light(out, fput_out);
+ }
+ }
+ fput_light(in, fput_in);
+ }
+ return error;
+}
diff -Nru a/include/asm-i386/unistd.h b/include/asm-i386/unistd.h
--- a/include/asm-i386/unistd.h 2005-01-15 18:36:40 -08:00
+++ b/include/asm-i386/unistd.h 2005-01-15 18:36:40 -08:00
@@ -294,8 +294,9 @@
#define __NR_add_key 286
#define __NR_request_key 287
#define __NR_keyctl 288
+#define __NR_splice 289
-#define NR_syscalls 289
+#define NR_syscalls 290
/*
* user-visible error numbers are in the range -1 - -128: see
diff -Nru a/include/linux/fs.h b/include/linux/fs.h
--- a/include/linux/fs.h 2005-01-15 18:36:40 -08:00
+++ b/include/linux/fs.h 2005-01-15 18:36:40 -08:00
@@ -944,6 +944,8 @@
int (*check_flags)(int);
int (*dir_notify)(struct file *filp, unsigned long arg);
int (*flock) (struct file *, int, struct file_lock *);
+ ssize_t (*splice_write)(struct inode *, struct file *, size_t, unsigned long);
+ ssize_t (*splice_read)(struct file *, struct inode *, size_t, unsigned long);
};
struct inode_operations {
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Make pipe data structure be a circular list of pages, rather than
2005-01-16 2:59 Make pipe data structure be a circular list of pages, rather Linus Torvalds
@ 2005-01-19 21:12 ` linux
2005-01-20 2:06 ` Robert White
0 siblings, 1 reply; 38+ messages in thread
From: linux @ 2005-01-19 21:12 UTC (permalink / raw)
To: linux-kernel; +Cc: linux, lm, torvalds
lm@bitmover.com wrote:
> You seem to have misunderstood the original proposal, it had little to do
> with file descriptors. The idea was that different subsystems in the OS
> export pull() and push() interfaces and you use them. The file decriptors
> are only involved if you provide them with those interfaces(which you
> would, it makes sense). You are hung up on the pipe idea, the idea I
> see in my head is far more generic. Anything can play and you don't
> need a pipe at all, you need
I was fantasizing about more generality as well. In particular, my
original fantasy allowed data to, in theory and with compatible devices,
be read from one PCI device, passed through a series of pipes, and
written to another without ever hitting main memory - only one PCI-PCI
DMA operation performed.
A slightly more common case would be zero-copy, where data gets DMAed
from the source into memory and from memory to the destination.
That's roughly Larry's pull/push model.
The direct DMA case requires buffer memory on one of the two cards.
(And would possibly be a fruitful source of hardware bugs, since I
suspect that Windows Doesn't Do That.)
Larry has the "full-page gift" optimization, which could in theory allow
data to be "renamed" straight into the page cache. However, the page also
has to be properly aligned and not in some awkward highmem address space.
I'm not currently convinced that this would happen often enough to be
worth the extra implementation hair, but feel free to argue otherwise.
(And Larry, what's the "loan" bit for? When is loan != !gift ?)
The big gotcha, as Larry's original paper properly points out, is handling
write errors. We need some sort of "unpull" operation to put data back
if the destination can't accept it. Otherwise, what do you return from
splice()? If the source is seekable, that's easy, and a pipe isn't much
harder, but for a general character device, we need a bit of help.
The way I handle this in user-level software, to connect modules that
provide data buffering, is to split "pull" into two operations.
"Show me some buffered data" and "Consume some buffered data".
The first returns a buffer pointer (to a const buffer) and length.
(The length must be non-zero except at EOF, but may be 1 byte.)
The second advances the buffer pointer. The advance distance must be
no more than the length returned previously, but may be less.
In typical single-threaded code, I allow not calling the advance function
or calling it multiple times, but they're typically called 1:1, and
requiring that would give you a good place to do locking. A character
device, network stream, or the like, would acquire an exclusive lock.
A block device or file would not need to (or could make it a shared lock
or refcount).
The same technique can be used when writing data to a module that does
buffering: "Give me some buffer space" and "Okay, I filled some part of
it in." In some devices, the latter call can fail, and the writer has
to be able to cope with that.
By allowing both of those (and, knowing that PCI writes are more efficient
than PCI reads, giving the latter preference if both are available),
you can do direct device-to-device copies on splice().
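The two-operation "pull" described above — "show me some buffered data" followed by a separate "consume" that advances by at most what was shown — can be sketched as a user-level ring buffer. This is purely illustrative; all names (`buf_peek`, `buf_consume`, `buf_fill`) are invented for the sketch and come from no kernel patch:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define RING_SIZE 4096

struct ring {
	char data[RING_SIZE];
	size_t head;	/* next byte to consume */
	size_t len;	/* bytes currently buffered */
};

/* "Show me some buffered data": return a pointer to a contiguous run
 * of buffered bytes and its length (zero when nothing is buffered). */
static const char *buf_peek(struct ring *r, size_t *avail)
{
	size_t contig = RING_SIZE - r->head;

	*avail = r->len < contig ? r->len : contig;
	return r->data + r->head;
}

/* "Consume some buffered data": advance by no more than the length the
 * previous peek returned.  Requiring a strict 1:1 pairing with peek is
 * where a character device would take its exclusive lock. */
static void buf_consume(struct ring *r, size_t n)
{
	assert(n <= r->len);
	r->head = (r->head + n) % RING_SIZE;
	r->len -= n;
}

/* Writer side ("give me some buffer space" / "I filled part of it"),
 * collapsed into one call for brevity: append up to n bytes, returning
 * how many actually fit. */
static size_t buf_fill(struct ring *r, const char *src, size_t n)
{
	size_t copied = 0;

	while (copied < n && r->len < RING_SIZE) {
		size_t tail = (r->head + r->len) % RING_SIZE;

		r->data[tail] = src[copied++];
		r->len++;
	}
	return copied;
}
```

The point of the split is that a failed downstream write needs no "unpull": data that was peeked but not consumed is simply still there.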
The problem with Larry's separate pull() and push() calls is that you
then need a user-visible abstraction for "pulled but not yet pushed"
data, which seems like an unnecessary abstraction violation.
The main infrastructure hassle you need to support this *universally*
is the unget() on "character devices" like pipes and network sockets.
Ideally, it would be some generic buffer front end that could be used
by the device for normal data as well as the special case.
Ooh. Need to think. If there's a -EIO problem on one of the file
descriptors, how does the caller know which one? That's an argument for
separate pull and push (although the splice() library routine still
has the problem). Any suggestions? Does userland need to fall
back on read()/write() for a byte?
^ permalink raw reply [flat|nested] 38+ messages in thread
* RE: Make pipe data structure be a circular list of pages, rather than
2005-01-19 21:12 ` Make pipe data structure be a circular list of pages, rather than linux
@ 2005-01-20 2:06 ` Robert White
0 siblings, 0 replies; 38+ messages in thread
From: Robert White @ 2005-01-20 2:06 UTC (permalink / raw)
To: linux, linux-kernel; +Cc: lm, torvalds
Howdy,
The below (sorry about the Outlook style includes, work _requires_ outlook 8-() is
why I previously mentioned the STREAMS implementation from SVR4 (and so Solaris etc).
Every interface point has a pointer to its peer, a pair of queues and a put()
routine. Every message is in two or more pieces, the message header and the message
buffer/buffers (read real data pages).
So the "participating interface" requirement is really that the device be able to
expose a data structure consisting of four pointers that their peer can point to.
The "message types" for the messages passing around are all contained in the message
header structures. These headers have all the reference counting and
disposal/cleanup hook procedures (pointers) along with the references to the actual
data. This means that you could have a stock pool (from the slab cache) of these
headers for moving plain data around, and "magical devices" could use specialty
message headers with the necessary disposal elements or necessary "extra data" in
them.
All the devices in any kind of chain/mux/pipe/whatever arrangement pass messages to
their peers by invoking peer->put(message) [in OOD parlance, really probably
(*peer)(peer,message) in straight C]. And modules can be induced into the
chain/mux/pipe/whatever at runtime. The target put() procedure decides to copy or
reference-increment the message and put it on its receive queue or process it
immediately. At no time does any member of a STREAMS structure need to know any
details about the workings of their peers.
In the abstract, most communications would have message headers referring to blocks
of normal memory. Some devices could send out messages that reference their PCI
mapped memory or whatever, and all _that_ device would have to do to mesh with the
system is to replace the cleanup function pointer with one appropriate to its need.
When the message has gone as far as it is going to go, its reference count drops to
zero and it gets cleaned up.
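A bare-bones rendering of that message-header scheme — a refcounted header carrying a per-message disposal hook, so a device with PCI-mapped memory just swaps in its own cleanup function. Every name here is illustrative (this is not the SVR4 structure layout), and `released` exists only so the sketch can be observed:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct msg {
	void *data;
	size_t len;
	int refs;
	void (*release)(struct msg *m);	/* disposal/cleanup hook */
};

static int released;	/* instrumentation for the sketch only */

/* Default hook for plain-memory messages; a "magical device" would
 * install its own (unmap PCI memory, drop a page reference, ...). */
static void default_release(struct msg *m)
{
	free(m->data);
	released = 1;
}

static struct msg *msg_get(struct msg *m)
{
	m->refs++;
	return m;
}

static void msg_put(struct msg *m)
{
	if (--m->refs == 0) {
		m->release(m);
		free(m);
	}
}

/* A peer's put() routine: stands in for "copy or reference-increment
 * and queue it, or process it immediately".  Here it just consumes
 * the message and drops its reference. */
static void peer_put(struct msg *m)
{
	msg_put(m);
}
```

Because the cleanup travels with the message, no peer ever needs to know how the data behind it was allocated.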
Unget can be done with some queue insertion/manipulation. There is no pull() as
every target is either hot-processed or queued. Flow control takes place when a
put() procedure returns a no-space-for-message indication.
If you absolutely _must_ know if the data was actually consumed by the target (and I
could argue that this is virtually never the case as things get lost in error-induced
pipe tear-down all the time) it would only be necessary to add a "cursor" member to
the message header and, at destruction time, see if the cursor had been advanced to
"use up" the data.
One could even have message headers that know they are dealing with
Now the message header is Linus' "impedance matcher" interface, transport is really
just about knowing your peer, calling put(), and doing reference counting.
There were a lot of (IMHO) mistakes in the STREAMS design, but the programming
document (previously referenced in this thread) covers a lot of the same message
typing and management questions that are being raised here.
While I can see that the fit between Linux and SVR4 STREAMS wouldn't be identical, a
good bit of the pipe-fitting (to pun 8-) and operation definitions would be.
(I'll go back to lurking now... 8-)
Rob White,
Casabyte, Inc.
^ permalink raw reply [flat|nested] 38+ messages in thread
[parent not found: <Pine.LNX.4.44.0501091946020.3620-100000@localhost.localdomain>]
[parent not found: <200501070313.j073DCaQ009641@hera.kernel.org>]
* Re: Make pipe data structure be a circular list of pages, rather than
[not found] <200501070313.j073DCaQ009641@hera.kernel.org>
@ 2005-01-07 3:41 ` William Lee Irwin III
2005-01-07 6:35 ` Linus Torvalds
0 siblings, 1 reply; 38+ messages in thread
From: William Lee Irwin III @ 2005-01-07 3:41 UTC (permalink / raw)
To: torvalds; +Cc: Linux Kernel Mailing List
On Fri, Jan 07, 2005 at 12:29:13AM +0000, Linux Kernel Mailing List wrote:
> ChangeSet 1.2229.1.1, 2005/01/06 16:29:13-08:00, torvalds@ppc970.osdl.org
> Make pipe data structure be a circular list of pages, rather than
> a circular list of one page.
> This improves pipe throughput, and allows us to (eventually)
> use these lists of page buffers for moving data around efficiently.
Interesting; how big a gain did you see?
-- wli
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Make pipe data structure be a circular list of pages, rather than
2005-01-07 3:41 ` William Lee Irwin III
@ 2005-01-07 6:35 ` Linus Torvalds
2005-01-07 6:37 ` Linus Torvalds
0 siblings, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2005-01-07 6:35 UTC (permalink / raw)
To: William Lee Irwin III; +Cc: Linux Kernel Mailing List, Larry McVoy
On Thu, 6 Jan 2005, William Lee Irwin III wrote:
> On Fri, Jan 07, 2005 at 12:29:13AM +0000, Linux Kernel Mailing List wrote:
> > ChangeSet 1.2229.1.1, 2005/01/06 16:29:13-08:00, torvalds@ppc970.osdl.org
> > Make pipe data structure be a circular list of pages, rather than
> > a circular list of one page.
> > This improves pipe throughput, and allows us to (eventually)
> > use these lists of page buffers for moving data around efficiently.
>
> Interesting; how big a gain did you see?
On my ppc970 (which Alan correctly points out is not necessarily the best
platform to test), lmbench throughput looks like this:
*Local* Communication bandwidths in MB/s - bigger is better
-----------------------------------------------------------
Host OS Pipe AF TCP File Mmap Bcopy Bcopy Mem
Mem UNIX reread reread (libc) (hand) read write
--------- ------------- ---- ---- ---- ------ ------ ------ ------ ---- -----
before:
ppc970.os Linux 2.6.10 753. 1305 550. 1110.3 2051.3 932.3 926.4 2043 1291.
ppc970.os Linux 2.6.10 738. 1596 558. 1099.4 2038.8 936.6 925.0 2033 1288.
ppc970.os Linux 2.6.10 752. 1535 557. 1107.2 2043.5 937.9 932.1 2037 1292.
ppc970.os Linux 2.6.10 750. 1524 556. 1107.5 2046.0 937.2 930.4 2037 1289.
after:
ppc970.os Linux 2.6.10 976. 1594 561. 1110.6 2045.8 934.9 923.6 2039 1288.
ppc970.os Linux 2.6.10 1195 1595 549. 1102.0 2049.0 909.5 930.5 2044 1292.
ppc970.os Linux 2.6.10 1669 1418 507. 1122.9 2051.3 943.4 916.4 2035 1280.
ppc970.os Linux 2.6.10 1472 1612 557. 1092.4 2032.9 932.1 935.1 2030 1281.
ie that's a clear improvement (30-90% higher bandwidth). That's from
having <n> (currently 16) pages of buffers in flight instead of just one.
Of course, the full 16 you'll seldom see in real life - stdio etc tends to
buffer up at 8kB or similar, but it does mean that such a 8kB write won't
need to block for the reader to empty the pipe.
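Reduced to its arithmetic, the "pages in flight" bookkeeping from the patch earlier in the thread is just circular indexing into the buffer array: the next free slot is `(curbuf + nrbufs)` masked by the ring size, which only works because PIPE_BUFFERS is a power of two. A minimal sketch:

```c
#include <assert.h>

#define PIPE_BUFFERS 16

/* Index of the next free pipe_buffer slot, given the index of the
 * oldest occupied slot (curbuf) and the number in flight (nrbufs).
 * The mask requires PIPE_BUFFERS to be a power of two. */
static int next_slot(int curbuf, int nrbufs)
{
	return (curbuf + nrbufs) & (PIPE_BUFFERS - 1);
}
```

A reader consumes from `curbuf` and advances it with the same mask, so writer and reader chase each other around the 16-entry ring.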
I worried a bit about latency, but my test-run of four before/after was in
the noise. It's bound to be worse because of the extra allocation, but
it's not clearly visible.
However, I've got a long-term cunning plan, which was the _real_ reason
for the pipe changes. I'm finally going to implement Larry McVoy's
"splice()" operation, and it turns out that a pipe (with the new
organization) ends up being the exact conduit between two files that you
need to "splice" an input an an output together - without any copying (the
new pipe buffering is just pointers to pages and byte ranges within those
pages).
Together with a "tee()" system call (duplicate it in two), you can
basically generate pipelines and forks of data, and you'll only need to
update a reference count (and populate a new "struct pipe_buffer" thing).
But making sure that every step along the way is an improvement is part of
the philosophy.
Linus
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Make pipe data structure be a circular list of pages, rather than
2005-01-07 6:35 ` Linus Torvalds
@ 2005-01-07 6:37 ` Linus Torvalds
2005-01-19 16:29 ` Larry McVoy
0 siblings, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2005-01-07 6:37 UTC (permalink / raw)
To: William Lee Irwin III; +Cc: Linux Kernel Mailing List, Larry McVoy
On Thu, 6 Jan 2005, Linus Torvalds wrote:
>
> I worried a bit about latency, but my test-run of four before/after was in
> the noise. It's bound to be worse because of the extra allocation, but
> it's not clearly visible.
Btw, this is not to say that it isn't measurable: I'm sure it is. It's
just not as _obvious_ as the throughput improvement. If somebody were to
do some more controlled latency tests... Hint, hint.
Linus
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Make pipe data structure be a circular list of pages, rather than
2005-01-07 6:37 ` Linus Torvalds
@ 2005-01-19 16:29 ` Larry McVoy
2005-01-19 17:14 ` Linus Torvalds
0 siblings, 1 reply; 38+ messages in thread
From: Larry McVoy @ 2005-01-19 16:29 UTC (permalink / raw)
To: Linus Torvalds
Cc: William Lee Irwin III, Linux Kernel Mailing List, Larry McVoy
I think you are going to regret making splice() a system call, it shouldn't
be, you'll find cases where it won't work. Make it a library call built
out of pull() and push(), see my original postings on this and you'll
follow the logic. Making splice a system call is like putting ftp
into the kernel, not a good idea.
--
---
Larry McVoy lm at bitmover.com http://www.bitkeeper.com
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Make pipe data structure be a circular list of pages, rather than
2005-01-19 16:29 ` Larry McVoy
@ 2005-01-19 17:14 ` Linus Torvalds
2005-01-19 19:01 ` Larry McVoy
0 siblings, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2005-01-19 17:14 UTC (permalink / raw)
To: Larry McVoy; +Cc: William Lee Irwin III, Linux Kernel Mailing List
On Wed, 19 Jan 2005, Larry McVoy wrote:
>
> I think you are going to regret making splice() a system call, it shouldn't
> be, you'll find cases where it won't work. Make it a library call built
> out of pull() and push()
Actually, if you look at my current (ugly) test-patch, the one part of it
that isn't ugly is the actual "sys_splice()" + "do_splice()" part, and the
only thing they actually do is (a) do all the required "fd -> struct file"
setup (including locking, testing readability vs writability etc), and
then (b) they check which side is a pipe, and split it up into "pull" vs
"push" at that point.
Of course, the "pull" is actually called "do_splice_from()", and the
"push" is called "do_splice_to()", but I can rename them if you want to ;)
So yes, internally it is actually a push/pull thing, with separate actions
for both. I absolutely agree that you cannot have a "splice()" action, you
need to have a _direction_. But that's actually exactly what the pipe
gives you: since all data has to originate or end in a pipe, the direction
is implicit by the file descriptors involved.
(The exception is a pipe->pipe transfer, but then the direction doesn't
matter, and you can choose whichever just ends up being easier. We happen
to consider it a "do_splice_from()" from the source pipe right now, but
that's purely a matter of "which way do you test first").
Could we show it as two system calls? Sure. It wouldn't give you anything,
though, since you need to do all the tests anyway - including the test
for whether the source or destination is a pipe which currently is the
thing that splits up the "splice()" into two cases. So it's not even like
you could optimize out the direction test - it would just become an
error case test instead.
And having just one system call means that the user doesn't need to care
(other than making sure that one end _is_ a pipe), and also allows us to
share all the code that _is_ shared (namely the "struct file *" setups and
testing - not a whole lot, but it's cleaner that way).
I think you are perhaps confused about the fact that what makes this all
possible in the first place really is the _pipe_. You probably think of
"splice()" as going from one file descriptor to another. It's not. It goes
from one (generic) file descriptor into a _pipe_, or from one pipe into
another (generic) file descriptor. So the "push"/"pull" part really is
there, but it's there in another form: if you want to copy from one file
to another, you have to
1) open a pipe - it's the splice equivalent of allocating a
temporary buffer.
repeat:
2) "pull" from the fd into the pipe: splice(in, pipe[1], size)
3) "push" from the pipe into the fd: splice(pipe[0], out, size)
so it's all there. The above is 100% conceptually equivalent to
1) buf = malloc()
repeat:
2) read(in, buf, size);
3) write(out, buf, size);
See? I think you'll find that the "push" and "pull" you look for really is
there.
The advantage here is that if either the input or the output _already_ is
a pipe, you can optimize it to just do a loop of a single splice(), with
no new temporary buffer needed. So while the above three steps are the
generic case, depending on how you do things the _common_ case may be just
a simple
repeat:
1) splice(in, out, maxsize);
which is why you do not want to _force_ the split. The determination of
how to do this efficiently at run-time is also very easy:
int copy_fd(int in, int out)
{
	char *buf;
	int fds[2];

	for (;;) {
		int count = splice(in, out, ~0UL);
		if (count < 0) {
			if (errno == EAGAIN)
				continue;
			/* Old kernel without "splice()"? */
			if (errno == ENOSYS)
				goto read_write_case;
			/* No pipe involved? */
			if (errno == EINVAL)
				goto pipe_buffer_case;
			/* We found a pipe, but the other end cannot accept this splice */
			if (errno == ECANNOTSPLICE)
				goto read_write_case;
			return -1;
		}
		if (!count)
			return 0;
	}

pipe_buffer_case:
	if (pipe(fds) < 0)
		goto read_write_case;
	for (;;) {
		int count = splice(in, fds[1], ~0UL);
		if (count < 0) {
			if (errno == EAGAIN)
				continue;
			return -1;
		}
		if (!count)
			return 0;
		while (count > 0) {
			int n = splice(fds[0], out, count);
			if (n < 0) {
				if (errno == EAGAIN)
					continue;
				return -1;
			}
			if (!n) {
				errno = ENOSPC;
				return -1;
			}
			count -= n;
		}
	}

read_write_case:
	buf = malloc(BUFSIZE);
	for (;;) {
		int count = read(in, buf, BUFSIZE);
		...
See? THAT is the kind of library routine you want to have (btw, there's no
"ECANNOTSPLICE" error code right now, my current example code incorrectly
returns EINVAL for both the "no pipe" case and the "found pipe but not
splicable" case).
Linus
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Make pipe data structure be a circular list of pages, rather than
2005-01-19 17:14 ` Linus Torvalds
@ 2005-01-19 19:01 ` Larry McVoy
2005-01-20 0:01 ` Linus Torvalds
0 siblings, 1 reply; 38+ messages in thread
From: Larry McVoy @ 2005-01-19 19:01 UTC (permalink / raw)
To: Linus Torvalds
Cc: Larry McVoy, William Lee Irwin III, Linux Kernel Mailing List
On Wed, Jan 19, 2005 at 09:14:33AM -0800, Linus Torvalds wrote:
> I think you are perhaps confused about the fact that what makes this all
> possible in the first place really is the _pipe_. You probably think of
> "splice()" as going from one file descriptor to another. It's not.
You seem to have misunderstood the original proposal, it had little to do
with file descriptors. The idea was that different subsystems in the OS
export pull() and push() interfaces and you use them. The file descriptors
are only involved if you provide them with those interfaces (which you
would, it makes sense). You are hung up on the pipe idea, the idea I
see in my head is far more generic. Anything can play and you don't
need a pipe at all, you need
struct pagev {
pageno page; // or whatever the type is to get the page
u16 offset; // could be u32 if you have large pages
u16 len; // ditto
};
struct splice {
pagev pages[];
u8 gift:1;
u8 loan:1;
};
You can feed these into pipes for sure, but you can source and sink
them from/to anything which exports a pull()/push() entry point.
It's as generic as read()/write() in the driver object.
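One way to read Larry's proposal as C: each subsystem exports pull()/push() entry points that move page-vector descriptors, and splice() becomes a library routine over any two such objects, pipe or not. Everything below is conjectural — he never specified signatures, and the toy endpoints exist only to make the sketch complete:

```c
#include <assert.h>
#include <stddef.h>

struct pagev {
	unsigned long page;	/* stand-in for a page reference */
	unsigned short offset;
	unsigned short len;
};

struct splice_desc {
	struct pagev pages[16];
	int npages;
	unsigned char gift:1;
	unsigned char loan:1;
};

struct endpoint;
struct endpoint_ops {
	/* fill d with up to max bytes of source data, return byte count */
	int (*pull)(struct endpoint *e, struct splice_desc *d, size_t max);
	/* consume d, return bytes accepted */
	int (*push)(struct endpoint *e, struct splice_desc *d);
};

struct endpoint {
	const struct endpoint_ops *ops;
};

/* splice() as a library routine over the two entry points */
static int splice_copy(struct endpoint *in, struct endpoint *out, size_t max)
{
	struct splice_desc d = { .npages = 0 };
	int n = in->ops->pull(in, &d, max);

	if (n <= 0)
		return n;
	return out->ops->push(out, &d);
}

/* Toy endpoints: the source hands out one fixed page per pull, the
 * sink accepts everything it is given. */
static int toy_pull(struct endpoint *e, struct splice_desc *d, size_t max)
{
	(void)e;
	d->pages[0] = (struct pagev){ .page = 1, .offset = 0,
				      .len = max < 4096 ? max : 4096 };
	d->npages = 1;
	return d->pages[0].len;
}

static int toy_push(struct endpoint *e, struct splice_desc *d)
{
	int total = 0, i;

	(void)e;
	for (i = 0; i < d->npages; i++)
		total += d->pages[i].len;
	return total;
}
```

The contrast with the pipe design is visible in the types: here the "pulled but not yet pushed" state is the `splice_desc` the caller holds, which is exactly the user-visible intermediate that the pipe model hides inside the kernel.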
I never proposed restricting this to file descriptors, that makes
no sense. File descriptors are nothing more than something which
gets you to the underlying object (socket/file/pipe/whatever) and
lets you call into that object to move some data.
See how more generic that is? Pipes are just one source/sink but
everything else needs to play as well. How are you going to
implement a socket sending data to a file without the VM nonsense
and the extra copies?
--
---
Larry McVoy lm at bitmover.com http://www.bitkeeper.com
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Make pipe data structure be a circular list of pages, rather than
2005-01-19 19:01 ` Larry McVoy
@ 2005-01-20 0:01 ` Linus Torvalds
0 siblings, 0 replies; 38+ messages in thread
From: Linus Torvalds @ 2005-01-20 0:01 UTC (permalink / raw)
To: Larry McVoy; +Cc: William Lee Irwin III, Linux Kernel Mailing List
On Wed, 19 Jan 2005, Larry McVoy wrote:
>
> See how more generic that is? Pipes are just one source/sink but
> everything else needs to play as well. How are you going to
> implement a socket sending data to a file without the VM nonsense
> and the extra copies?
Heh. This is exactly why I originally told you it is undoable - because
your model implies an O(M x N) complexity, where everybody has to agree on
how to push things to everybody else.
The thing that made me so excited about the pipes is exactly that the
pipes is what finally made me see how to actually _implement_ kind of what
you wanted. Instead of a "everything to everything" thing that has O(n²)
complexity, it's a "everything to a pipe", where the pipe ends up being
the common medium for things to talk to each other.
So you have an extra level of indirection, but it also means that now the
interface complexity is purely linear - everybody needs to know how to
accept data from (and push data to) a pipe, but they do _not_ need to know
how to do it to anything else.
I agree that it's not what you described, and one of my first emails
describing it even said so, but I now notice that you weren't cc'd on
that one (I tried to cc you on some others, but not the whole discussion):
"...
And to be honest, my pipes aren't really what Larry's "splice()" was: his
notion was more of a direct "fd->fd" thing, with nothing in between. That
was what I still consider unworkable.
The problem with splicing two file descriptors together is that it needs
for them to agree on the interchange format, and these things tend to be
very expensive from an interface angle (ie the format has to _exactly_
match the "impedance" at both ends, since any conversion would make the
whole exercise pointless - you'd be better off just doing a read and a
write).
..."
so what makes the pipes special is that they are the "impedance matcher".
For example, absolutely _none_ of the regular file IO stuff wants to work
with page fragments: it's all very much about whole-page stuff in the page
cache, and that is how it has to be, considering the semantics of a
coherent mmap etc.
So your notion of having all the subsystems natively use a common format
just never struck me as workable, because they very fundamentally do _not_
want work with the same kind of data structures, and forcing them to just
didn't seem valid. So I ignored splice as unworkable.
So think of my "splice" as giving you credit through the name, but at the
same time it definitely is something different (and in my opinion a lot
more workable). If you object to the name, I can call it "wire()" instead,
which was somebody elses suggestion, but I thought you might appreciate
the name even if it isn't _really_ what you want ;)
Linus
^ permalink raw reply [flat|nested] 38+ messages in thread
end of thread, other threads:[~2005-08-18 6:07 UTC | newest]
Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-01-07 14:30 Make pipe data structure be a circular list of pages, rather than Oleg Nesterov
2005-01-07 15:45 ` Alan Cox
2005-01-07 17:23 ` Linus Torvalds
2005-01-08 18:25 ` Hugh Dickins
2005-01-08 18:54 ` Linus Torvalds
2005-01-07 16:17 ` Linus Torvalds
2005-01-07 16:06 ` Alan Cox
2005-01-07 17:33 ` Linus Torvalds
2005-01-07 17:48 ` Linus Torvalds
2005-01-07 20:59 ` Mike Waychison
2005-01-07 23:46 ` Chris Friesen
2005-01-08 21:38 ` Lee Revell
2005-01-08 21:51 ` Linus Torvalds
2005-01-08 22:02 ` Lee Revell
2005-01-08 22:29 ` Davide Libenzi
2005-01-09 4:07 ` Linus Torvalds
2005-01-09 23:19 ` Davide Libenzi
2005-01-14 10:15 ` Peter Chubb
2005-01-07 21:59 ` Linus Torvalds
2005-01-07 22:53 ` Diego Calleja
2005-01-07 23:15 ` Linus Torvalds
2005-01-10 23:23 ` Robert White
2005-01-07 17:45 ` Chris Friesen
2005-01-07 16:39 ` Davide Libenzi
2005-01-07 17:09 ` Linus Torvalds
2005-08-18 6:07 ` Coywolf Qi Hunt
-- strict thread matches above, loose matches on Subject: below --
2005-01-20 2:14 Robert White
2005-01-16 2:59 Make pipe data structure be a circular list of pages, rather Linus Torvalds
2005-01-19 21:12 ` Make pipe data structure be a circular list of pages, rather than linux
2005-01-20 2:06 ` Robert White
[not found] <Pine.LNX.4.44.0501091946020.3620-100000@localhost.localdomain>
[not found] ` <Pine.LNX.4.58.0501091713300.2373@ppc970.osdl.org>
[not found] ` <Pine.LNX.4.58.0501091830120.2373@ppc970.osdl.org>
2005-01-12 19:50 ` Davide Libenzi
2005-01-12 20:10 ` Linus Torvalds
[not found] <200501070313.j073DCaQ009641@hera.kernel.org>
2005-01-07 3:41 ` William Lee Irwin III
2005-01-07 6:35 ` Linus Torvalds
2005-01-07 6:37 ` Linus Torvalds
2005-01-19 16:29 ` Larry McVoy
2005-01-19 17:14 ` Linus Torvalds
2005-01-19 19:01 ` Larry McVoy
2005-01-20 0:01 ` Linus Torvalds