archive mirror
 help / color / mirror / Atom feed
Subject: Re: Make pipe data structure be a circular list of pages, rather
Date: 8 Jan 2005 08:25:35 -0000	[thread overview]
Message-ID: <> (raw)

>  - add a "tee(in, out1, out2)" system call that duplicates the pages 
>    (again, incrementing their reference count, not copying the data) from 
>    one pipe to two other pipes.

H'm... the version that seemed natural to me was an asymmetrical one-way
tee, such as "tee(in, out, len, flags)" might be better, where the next
<len> bytes are *both* readable on fd "in" *and* copied to fd "out".

You can make it a two-way tee with an additional "splice(in, out2, len)"
call, so you haven't lost expressiveness, and it makes three-way and
higher tees easier to construct.

But then I realized that I might be thinking about a completely different
implementation than you... I was thinking asynchronous, while perhaps
you were thinking synchronous.

A simple example of the difference:

	fd *dest = open("/dev/null", O_WRONLY);
	FILE *src = popen("/usr/bin/yes", "r");
	splice(fileno(src), dest, SPLICE_INFINITY, 0);
	return 0;

Will this process exit without being killed?  I was imagining yes,
it would exit immediately, but perhaps "no" makes more sense.

Ding!  Oh, of course, it couldn't exit, or cleaning up after the following
mess would be far more difficult:

	int fd[2];
	write(fd[1], "Hello, world!\n", 14);
	splice(fd[0], fd[1], SPLICE_INFINITY, 0);
	return 0;

With the synchronous model, the two-output tee() call makes more sense, too.
Still, it would be nice to be able to produce n identical output streams
without needing n-1 processes to do it  Any ideas?  Perhaps

tee(int infd, int const *outfds, unsigned noutfds, loff_t size, unsigned flags)

As for the issue of coalescing:
> This is the main reason why I want to avoid coalescing if possible: if you
> never coalesce, then each "pipe_buffer" is complete in itself, and that
> simplifies locking enormously.
> (The other reason to potentially avoid coalescing is that I think it might
> be useful to allow the "sendmsg()/recvmsg()" interfaces that honour packet
> boundaries. The new pipe code _is_ internally "packetized" after all).

It is somewhat offensive that the minimum overhead for a 1-byte write
is a struct pipe_buffer plus a full page.

But yes, keeping a pipe_buffer simple is a BIG win.  So how about the
following coalescing strategy, which complicates the reader not at all:

- Each pipe writing fd holds a reference to a page and an offset within
  that page.
- When writing some data, see if the data will fit in the remaining
  portion of the page.
  - If it will not, then dump the page, allocate a fresh one, and set
    the offset to 0.
    - Possible optimization: If the page's refcount is 1 (nobody else
      still has a reference to the page, and we would free it if we
      dropped it), then just recycle it directly.
- Copy the written data (up to a maximum of 1 page) to the current write page.
- Bump the page's reference count (to account for the pipe_buffer pointer) and
  queue an appropriate pipe_buffer.
- Increment the offset by the amount of data written to the page.
- Decrement the amount of data remaining to be written and repeat if necessary.

This allocates one struct pipe_buffer (8 bytes) per write, but not a whole
page.  And it does so by exploiting the exact "we don't care what the rest
of the page is used for" semantics that make the abstraction useful.

The only waste is that, as written, every pipe writing fd keeps a page
allocated even if the pipe is empty.  Perhaps the vm could be taught to
reclaim those references if necessary?

It's also worth documenting atomicity guarantees and poll/select
semantics.  The above algorithm is careful to always queue the first
PAGE_SIZE bytes of any write atomically, which I believe is what is
historically expected.  It would be possible to have writes larger than
PIPE_BUF fill in the trailing end of any partial page as well.

             reply	other threads:[~2005-01-08 10:45 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2005-01-08  8:25 linux [this message]
2005-01-08 18:41 ` Make pipe data structure be a circular list of pages, rather Linus Torvalds
2005-01-08 21:47   ` Alan Cox
2005-01-13 21:46   ` Ingo Oeser
2005-01-13 22:32     ` Linus Torvalds
2005-01-14 21:03       ` Ingo Oeser
2005-01-14 21:29         ` Linus Torvalds
2005-01-14 22:12           ` Ingo Oeser
2005-01-14 22:44             ` Linus Torvalds
2005-01-14 23:34               ` Ingo Oeser
2005-01-15  0:16                 ` Linus Torvalds
2005-01-16  2:59                   ` Linus Torvalds
2005-01-17 16:03                     ` Ingo Oeser
2005-01-19 21:12                     ` Make pipe data structure be a circular list of pages, rather than linux
2005-01-20  2:06                       ` Robert White
2005-01-15 23:42 Make pipe data structure be a circular list of pages, rather linux
2005-01-15 22:55 ` Alan Cox
2005-01-16  0:12   ` Linus Torvalds
2005-01-16  2:02     ` Miquel van Smoorenburg
2005-01-16  2:06     ` Jeff Garzik

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \ \ \ \ \

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).