archive mirror
 help / color / mirror / Atom feed
From: Linus Torvalds <>
Cc: Kernel Mailing List <>
Subject: Re: Make pipe data structure be a circular list of pages, rather
Date: Sat, 8 Jan 2005 10:41:48 -0800 (PST)	[thread overview]
Message-ID: <> (raw)
In-Reply-To: <>

On Sat, 8 Jan 2005 wrote:
> H'm... the version that seemed natural to me was an asymmetrical one-way
> tee, such as "tee(in, out, len, flags)" might be better, where the next
> <len> bytes are *both* readable on fd "in" *and* copied to fd "out".

Yes, the implementation may end up being something like that, I haven't
decided. That would be more of a "peek()" kind of thing, not a "tee()",
but it has the advantage of getting rid of one unnecessary complexity of
"tee()" - dependence on three different fd's (in particular, if one of the
output fd's isn't writable, we can't do anything at all).

However, "peek()" has the problem that while it allows you to just run "n" 
peek() calls for the "n" outputs, you now need to keep track of how much 
each output has consumed (since they can be filled at different rates), 
and then appropriately "eat" the proper amount of input when it has been 
consumed by everybody. That's a very error-prone interface.

In contrast, with a plain "tee()", you'd just create a tree "log(n)" 
deep of "tee()" actors, and you'd have "n" outputs from one input, and 
each individual point would be very simple.

NOTE! I don't think either "tee()" or "peek()" really works for very high
fan-out. I don't believe that you'd ever really have a "balanced" enough
output across many things, and since they all end up being synchronized to
the same input (you can make the buffers deeper, which will help a bit,
but..) I don't think it ends up being _that_ useful.

Where I foresee "tee()" is when there is some rate-limited input that you 
can easily handle in two or three streams. That's what the example of a 
digital TV input signal was kind of meant to be - the input is 
rate-limited, which means that the fact that the outputs are 
likely consumed at different paces (saving to disk might be fast, showing 
it on the screen would be real-time) doesn't matter. But such a use ends 
up breaking down when the load is too high - you're likely going to be 
constrained in the number of streams by pure load issues.

(And if you are _not_ constrained by load issues, then you don't need 
tee() at all - you could just have copied the things by hand ;)

> But then I realized that I might be thinking about a completely different
> implementation than you... I was thinking asynchronous, while perhaps
> you were thinking synchronous.

I _am_ thinking entirely synchronous. It would copy exactly what was in 
the pipe, and nothing more. It would sleep if either end was full/empty 
(modulo N_DELAY, of course), but it would in no way be a long-term "set up 
this copy mechanism.

None of the pipes would do anything on their own - they aren't "actors".  
They'd be purely buffers. So if you want to have an asynchronous 
long-running "tee()/copy()/peek()", you'd have a thread that does that for 

> With the synchronous model, the two-output tee() call makes more sense, too.
> Still, it would be nice to be able to produce n identical output streams
> without needing n-1 processes to do it  Any ideas? 

I agree, although I don't know how common it would be for "n" to be very
big. And you don't need n-1 processes (or threads), you can certainly do
it with the normal select-loop approach too.

> As for the issue of coalescing:
> > This is the main reason why I want to avoid coalescing if possible: if you
> > never coalesce, then each "pipe_buffer" is complete in itself, and that
> > simplifies locking enormously.
> >
> > (The other reason to potentially avoid coalescing is that I think it might
> > be useful to allow the "sendmsg()/recvmsg()" interfaces that honour packet
> > boundaries. The new pipe code _is_ internally "packetized" after all).
> It is somewhat offensive that the minimum overhead for a 1-byte write
> is a struct pipe_buffer plus a full page.

Note that there is no reason why we couldn't use just "kmalloc()" too.

For the different input cases we'll need per-pipe_buffer operation
functions (very much including a "release()" function) _anyway_, which
means that while the current implementation always just does "alloc_page()
+ __free_page()" for it's buffer allocation, there is absolutely zero that 
says that you couldn't allocate small buffers with "kmalloc()" and free 
them with "kfree()".

And since we need to support per-buffer ops anyway (for people mixing
"write()" and "splice()" to set up the pipe contents), it works very
naturally for mixed size writes too (ie you can have a small write that is
buffered in a small kmalloc'ed buffer, followed by a big write that uses
the page allocator, and nobody will care - you just decide at buffer fill
time which one you need).

So yes, the current setup is wasteful, but coalescing isn't the only way 
to fight that - you can just use smaller allocations too.

So in my mind, the only reason for doign coalescing is if some silly
program deadlocks if it writes many small writes, and nobody is reading
it. Some people use pipes as buffers _within_ a program, ie use them to
send signals etc (common for signal-safe select handling - you make the
signal handler do a write() to a pipe, and the select loop has the pipe in
its FD-set).

For all of those that are valid (certainly the signal case), you need to
put the pipe into nonblocking mode anyway, so I don't think coalescing is
really needed. SuS certainly does _not_ seem guarantee that you can do
many nonblocking writes (small _or_ big). Let's see if somebody reports 
any trouble with the new world order.


  reply	other threads:[~2005-01-08 18:41 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2005-01-08  8:25 Make pipe data structure be a circular list of pages, rather linux
2005-01-08 18:41 ` Linus Torvalds [this message]
2005-01-08 21:47   ` Alan Cox
2005-01-13 21:46   ` Ingo Oeser
2005-01-13 22:32     ` Linus Torvalds
2005-01-14 21:03       ` Ingo Oeser
2005-01-14 21:29         ` Linus Torvalds
2005-01-14 22:12           ` Ingo Oeser
2005-01-14 22:44             ` Linus Torvalds
2005-01-14 23:34               ` Ingo Oeser
2005-01-15  0:16                 ` Linus Torvalds
2005-01-16  2:59                   ` Linus Torvalds
2005-01-17 16:03                     ` Ingo Oeser
2005-01-19 21:12                     ` Make pipe data structure be a circular list of pages, rather than linux
2005-01-20  2:06                       ` Robert White
2005-01-15 23:42 Make pipe data structure be a circular list of pages, rather linux
2005-01-15 22:55 ` Alan Cox
2005-01-16  0:12   ` Linus Torvalds
2005-01-16  2:02     ` Miquel van Smoorenburg
2005-01-16  2:06     ` Jeff Garzik

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \ \ \ \ \

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).