linux-kernel.vger.kernel.org archive mirror
* Re: Make pipe data structure be a circular list of pages, rather
@ 2005-01-08  8:25 linux
  2005-01-08 18:41 ` Linus Torvalds
  0 siblings, 1 reply; 51+ messages in thread
From: linux @ 2005-01-08  8:25 UTC
  To: linux-kernel, torvalds

>  - add a "tee(in, out1, out2)" system call that duplicates the pages 
>    (again, incrementing their reference count, not copying the data) from 
>    one pipe to two other pipes.

H'm... the version that seemed natural to me was an asymmetrical one-way
tee, such as "tee(in, out, len, flags)", where the next <len> bytes are
*both* readable on fd "in" *and* copied to fd "out".

You can make it a two-way tee with an additional "splice(in, out2, len)"
call, so you haven't lost expressiveness, and it makes three-way and
higher tees easier to construct.
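
As a rough illustration of that composition, here is a minimal user-space
sketch.  It assumes a present-day Linux system, where tee(2) and splice(2)
exist with signatures that differ from the ones floated in this message.
The helper names two_way_tee() and demo_two_way() are made up for the
example, and a short transfer is simply treated as an error rather than
retried.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Feed `len` bytes from pipe `in` to two output pipes: tee() duplicates the
 * data into out1 without consuming it, then splice() moves it into out2.
 * Returns 0 on success, -1 on error or short transfer (sketch: no retry).
 */
static int two_way_tee(int in, int out1, int out2, size_t len)
{
	ssize_t n = tee(in, out1, len, 0);

	if (n < 0 || (size_t)n != len)
		return -1;
	n = splice(in, NULL, out2, NULL, len, 0);
	if (n < 0 || (size_t)n != len)
		return -1;
	return 0;
}

/*
 * Hypothetical demo helper: returns 1 if both outputs received an identical
 * copy of msg, 0 if not, -1 on error.
 */
static int demo_two_way(const char *msg)
{
	int in[2], a[2], b[2];
	char buf1[64], buf2[64];
	size_t len = strlen(msg);
	int same;

	assert(len < sizeof(buf1));
	if (pipe(in) || pipe(a) || pipe(b))
		return -1;
	if (write(in[1], msg, len) != (ssize_t)len)
		return -1;
	if (two_way_tee(in[0], a[1], b[1], len))
		return -1;
	if (read(a[0], buf1, len) != (ssize_t)len ||
	    read(b[0], buf2, len) != (ssize_t)len)
		return -1;
	same = memcmp(buf1, msg, len) == 0 && memcmp(buf2, msg, len) == 0;

	close(in[0]); close(in[1]);
	close(a[0]); close(a[1]);
	close(b[0]); close(b[1]);
	return same;
}
```

Calling demo_two_way("Hello, world!\n") pushes the string through one pipe
and reads it back from both outputs.  The sketch only covers data that fits
in the pipe buffers; anything larger would need a loop around both calls.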


But then I realized that I might be thinking about a completely different
implementation than you... I was thinking asynchronous, while perhaps
you were thinking synchronous.

A simple example of the difference:

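#include <fcntl.h>
#include <stdio.h>

/* splice() and SPLICE_INFINITY are the interface under discussion here. */
int
main(void)
{
	int dest = open("/dev/null", O_WRONLY);
	FILE *src = popen("/usr/bin/yes", "r");
	splice(fileno(src), dest, SPLICE_INFINITY, 0);
	return 0;
}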

Will this process exit without being killed?  I was imagining yes,
it would exit immediately, but perhaps "no" makes more sense.

Ding!  Oh, of course, it couldn't exit, or cleaning up after the following
mess would be far more difficult:

#include <unistd.h>

int
main(void)
{
	int fd[2];
	pipe(fd);
	write(fd[1], "Hello, world!\n", 14);
	/* Splice the pipe's read end back into its own write end. */
	splice(fd[0], fd[1], SPLICE_INFINITY, 0);
	return 0;
}

With the synchronous model, the two-output tee() call makes more sense, too.
Still, it would be nice to be able to produce n identical output streams
without needing n-1 processes to do it.  Any ideas?  Perhaps:

int
tee(int infd, int const *outfds, unsigned noutfds, loff_t size, unsigned flags);
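
To make the shape of such a call concrete, here is a hedged user-space
stand-in, not a kernel implementation.  It assumes present-day Linux
tee(2) and splice(2); tee_n(), demo_fanout() and MAX_OUT are hypothetical
names, the flags argument is dropped, and a short transfer is reported as
an error instead of being retried.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define MAX_OUT 8	/* arbitrary limit for the demo */

/*
 * Duplicate `size` bytes from pipe infd into every output pipe but the last
 * with tee(), then move them into the last one with splice(), which is the
 * step that consumes the input.  Returns 0 on success, -1 otherwise.
 */
static int tee_n(int infd, const int *outfds, unsigned noutfds, size_t size)
{
	ssize_t n;
	unsigned i;

	if (noutfds == 0)
		return -1;
	for (i = 0; i + 1 < noutfds; i++) {
		n = tee(infd, outfds[i], size, 0);
		if (n < 0 || (size_t)n != size)
			return -1;
	}
	n = splice(infd, NULL, outfds[noutfds - 1], NULL, size, 0);
	return (n < 0 || (size_t)n != size) ? -1 : 0;
}

/* Demo: returns how many of `nout` pipes received an identical copy of msg. */
static int demo_fanout(const char *msg, unsigned nout)
{
	int in[2], out[MAX_OUT][2], wfds[MAX_OUT];
	char buf[64];
	size_t len = strlen(msg);
	unsigned i;
	int same = 0;

	assert(len < sizeof(buf));
	if (nout == 0 || nout > MAX_OUT || pipe(in))
		return -1;
	for (i = 0; i < nout; i++) {
		if (pipe(out[i]))
			return -1;
		wfds[i] = out[i][1];
	}
	if (write(in[1], msg, len) != (ssize_t)len ||
	    tee_n(in[0], wfds, nout, len))
		return -1;
	for (i = 0; i < nout; i++) {
		if (read(out[i][0], buf, len) == (ssize_t)len &&
		    memcmp(buf, msg, len) == 0)
			same++;
		close(out[i][0]);
		close(out[i][1]);
	}
	close(in[0]);
	close(in[1]);
	return same;
}
```

All n copies are produced from a single process here, but only for a
bounded chunk that already sits in the input pipe.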

As for the issue of coalescing:
> This is the main reason why I want to avoid coalescing if possible: if you
> never coalesce, then each "pipe_buffer" is complete in itself, and that
> simplifies locking enormously.
>
> (The other reason to potentially avoid coalescing is that I think it might
> be useful to allow the "sendmsg()/recvmsg()" interfaces that honour packet
> boundaries. The new pipe code _is_ internally "packetized" after all).

It is somewhat offensive that the minimum overhead for a 1-byte write
is a struct pipe_buffer plus a full page.

But yes, keeping a pipe_buffer simple is a BIG win.  So how about the
following coalescing strategy, which complicates the reader not at all:

- Each pipe writing fd holds a reference to a page and an offset within
  that page.
- When writing some data, see if the data will fit in the remaining
  portion of the page.
  - If it will not, then dump the page, allocate a fresh one, and set
    the offset to 0.
    - Possible optimization: If the page's refcount is 1 (nobody else
      still has a reference to the page, and we would free it if we
      dropped it), then just recycle it directly.
- Copy the written data (up to a maximum of 1 page) to the current write page.
- Bump the page's reference count (to account for the pipe_buffer pointer) and
  queue an appropriate pipe_buffer.
- Increment the offset by the amount of data written to the page.
- Decrement the amount of data remaining to be written and repeat if necessary.
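
The steps above can be sketched as a small user-space simulation.  This
is only a model under stated assumptions: struct sim_page, struct sim_buf
and struct sim_pipe are made-up stand-ins (sim_buf plays the role of
pipe_buffer), pages are 4096 bytes, refcounts are plain ints with no
locking, and a full queue returns an error where a real pipe would block.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SIM_PAGE_SIZE	4096	/* assumption: 4K pages */
#define SIM_QUEUE_MAX	64	/* arbitrary queue depth for the model */

struct sim_page { int refcount; char data[SIM_PAGE_SIZE]; };
struct sim_buf { struct sim_page *page; size_t offset, len; };

struct sim_pipe {
	struct sim_page *page;		/* writer's current page (one ref) */
	size_t offset;			/* next free byte in that page */
	struct sim_buf queue[SIM_QUEUE_MAX];
	size_t head, nbufs;
	unsigned pages_allocated;	/* statistic for the demo */
};

static void page_put(struct sim_page *p)
{
	assert(p->refcount > 0);
	if (--p->refcount == 0)
		free(p);
}

/* Queue `len` bytes, one buffer per chunk of at most one page. */
static int sim_write(struct sim_pipe *w, const char *data, size_t len)
{
	while (len > 0) {
		size_t chunk = len < SIM_PAGE_SIZE ? len : SIM_PAGE_SIZE;
		struct sim_buf *b;

		if (w->nbufs == SIM_QUEUE_MAX)
			return -1;	/* a real pipe would block here */

		if (!w->page || chunk > SIM_PAGE_SIZE - w->offset) {
			/* Recycle only if we hold the sole reference. */
			if (!w->page || w->page->refcount != 1) {
				struct sim_page *fresh = malloc(sizeof(*fresh));

				if (!fresh)
					return -1;
				fresh->refcount = 1;
				if (w->page)
					page_put(w->page);
				w->page = fresh;
				w->pages_allocated++;
			}
			w->offset = 0;
		}
		memcpy(w->page->data + w->offset, data, chunk);
		w->page->refcount++;	/* reference held by the queued buffer */
		b = &w->queue[(w->head + w->nbufs) % SIM_QUEUE_MAX];
		b->page = w->page;
		b->offset = w->offset;
		b->len = chunk;
		w->nbufs++;
		w->offset += chunk;
		data += chunk;
		len -= chunk;
	}
	return 0;
}

/* Reader side: consume one buffer; it never looks at the writer's state. */
static size_t sim_read(struct sim_pipe *w, char *out)
{
	struct sim_buf *b = &w->queue[w->head];
	size_t len;

	if (w->nbufs == 0)
		return 0;
	len = b->len;
	memcpy(out, b->page->data + b->offset, len);
	page_put(b->page);
	w->head = (w->head + 1) % SIM_QUEUE_MAX;
	w->nbufs--;
	return len;
}
```

In this model sixteen 1-byte writes queue sixteen buffers that all point
into one page, and the reader still handles each buffer as a complete unit.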

This allocates one struct pipe_buffer (8 bytes) per write, but not a whole
page.  And it does so by exploiting the exact "we don't care what the rest
of the page is used for" semantics that make the abstraction useful.

The only waste is that, as written, every pipe writing fd keeps a page
allocated even if the pipe is empty.  Perhaps the vm could be taught to
reclaim those references if necessary?


It's also worth documenting atomicity guarantees and poll/select
semantics.  The above algorithm is careful to always queue the first
PAGE_SIZE bytes of any write atomically, which I believe is what is
historically expected.  It would be possible to have writes larger than
PIPE_BUF fill in the trailing end of any partial page as well.

* RE: Make pipe data structure be a circular list of pages, rather than
@ 2005-01-20  2:14 Robert White
  0 siblings, 0 replies; 51+ messages in thread
From: Robert White @ 2005-01-20  2:14 UTC
  To: 'Robert White', linux, linux-kernel; +Cc: lm, torvalds



P.S.  Not to reply to myself... 8-)  I took some liberties with that description.
STREAMS doesn't, to the best of my knowledge, have the cleanup hook stuff.  That was
me folding your issues (direct PCI device buffers etc) from this thread on top of the
basic skeleton of STREAMS to broaden the "impedance matching" possibilities.

(No really, back to lurking... 8-)

Rob White,
Casabyte, Inc.


[parent not found: <Pine.LNX.4.44.0501091946020.3620-100000@localhost.localdomain>]
* Re: Make pipe data structure be a circular list of pages, rather than
@ 2005-01-07 14:30 Oleg Nesterov
  2005-01-07 15:45 ` Alan Cox
  2005-01-07 16:17 ` Linus Torvalds
  0 siblings, 2 replies; 51+ messages in thread
From: Oleg Nesterov @ 2005-01-07 14:30 UTC
  To: Linus Torvalds; +Cc: William Lee Irwin III, linux-kernel

Hello.

pipe_writev:
> +		if (bufs < PIPE_BUFFERS) {
> +			ssize_t chars;
> +			int newbuf = (info->curbuf + bufs) & (PIPE_BUFFERS-1);

If I understand this patch correctly, then this code

	for (;;)
		write(pipe_fd, &byte, 1);

will block after writing PIPE_BUFFERS == 16 characters, no?
And pipe_inode_info will use 64K to hold 16 bytes!
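
The arithmetic can be checked with a small model of the quoted ring
indexing.  This is a sketch, not the patch itself: struct sim_ring and its
helpers are hypothetical names, and 4K pages are assumed (which is what the
64K figure implies for 16 buffers).

```c
#include <assert.h>
#include <stddef.h>

#define PIPE_BUFFERS	16	/* value quoted in the message above */
#define SIM_PAGE_SIZE	4096	/* assumption: 4K pages */

struct sim_ring {
	unsigned curbuf;		/* index of the oldest buffer */
	unsigned bufs;			/* buffers currently in use */
	size_t len[PIPE_BUFFERS];	/* payload bytes held by each slot */
};

/*
 * One write takes one buffer, and each buffer owns a whole page.
 * Returns 1 if the write was queued, 0 if it would block.
 */
static int ring_write(struct sim_ring *r, size_t len)
{
	unsigned newbuf;

	assert(len <= SIM_PAGE_SIZE);
	if (r->bufs >= PIPE_BUFFERS)
		return 0;
	/* Same masking trick as the quoted hunk; needs a power-of-two size. */
	newbuf = (r->curbuf + r->bufs) & (PIPE_BUFFERS - 1);
	r->len[newbuf] = len;
	r->bufs++;
	return 1;
}

/* Bytes of actual data sitting in the ring. */
static size_t ring_payload(const struct sim_ring *r)
{
	size_t total = 0;
	unsigned i;

	for (i = 0; i < r->bufs; i++)
		total += r->len[(r->curbuf + i) & (PIPE_BUFFERS - 1)];
	return total;
}

/* Bytes of page memory pinned by the ring, one page per buffer in use. */
static size_t ring_pinned(const struct sim_ring *r)
{
	return (size_t)r->bufs * SIM_PAGE_SIZE;
}
```

Filling a zeroed sim_ring with 1-byte writes until ring_write() refuses
lets the payload and pinned-memory totals be compared directly.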

Is that OK?

Maybe it makes sense to add data to the last allocated page
until buf->len would exceed PAGE_SIZE?

Oleg.

[parent not found: <200501070313.j073DCaQ009641@hera.kernel.org>]

Thread overview: 51+ messages
2005-01-08  8:25 Make pipe data structure be a circular list of pages, rather linux
2005-01-08 18:41 ` Linus Torvalds
2005-01-08 21:47   ` Alan Cox
2005-01-13 21:46   ` Ingo Oeser
2005-01-13 22:32     ` Linus Torvalds
2005-01-14 21:03       ` Ingo Oeser
2005-01-14 21:29         ` Linus Torvalds
2005-01-14 22:12           ` Ingo Oeser
2005-01-14 22:44             ` Linus Torvalds
2005-01-14 23:34               ` Ingo Oeser
2005-01-15  0:16                 ` Linus Torvalds
2005-01-16  2:59                   ` Linus Torvalds
2005-01-17 16:03                     ` Ingo Oeser
2005-01-19 21:12                     ` Make pipe data structure be a circular list of pages, rather than linux
2005-01-20  2:06                       ` Robert White
  -- strict thread matches above, loose matches on Subject: below --
2005-01-20  2:14 Robert White
     [not found] <Pine.LNX.4.44.0501091946020.3620-100000@localhost.localdomain>
     [not found] ` <Pine.LNX.4.58.0501091713300.2373@ppc970.osdl.org>
     [not found]   ` <Pine.LNX.4.58.0501091830120.2373@ppc970.osdl.org>
2005-01-12 19:50     ` Davide Libenzi
2005-01-12 20:10       ` Linus Torvalds
2005-01-07 14:30 Oleg Nesterov
2005-01-07 15:45 ` Alan Cox
2005-01-07 17:23   ` Linus Torvalds
2005-01-08 18:25     ` Hugh Dickins
2005-01-08 18:54       ` Linus Torvalds
2005-01-07 16:17 ` Linus Torvalds
2005-01-07 16:06   ` Alan Cox
2005-01-07 17:33     ` Linus Torvalds
2005-01-07 17:48       ` Linus Torvalds
2005-01-07 20:59         ` Mike Waychison
2005-01-07 23:46           ` Chris Friesen
2005-01-08 21:38             ` Lee Revell
2005-01-08 21:51               ` Linus Torvalds
2005-01-08 22:02                 ` Lee Revell
2005-01-08 22:29                 ` Davide Libenzi
2005-01-09  4:07                 ` Linus Torvalds
2005-01-09 23:19                   ` Davide Libenzi
2005-01-14 10:15             ` Peter Chubb
2005-01-07 21:59         ` Linus Torvalds
2005-01-07 22:53           ` Diego Calleja
2005-01-07 23:15             ` Linus Torvalds
2005-01-10 23:23         ` Robert White
2005-01-07 17:45     ` Chris Friesen
2005-01-07 16:39   ` Davide Libenzi
2005-01-07 17:09     ` Linus Torvalds
2005-08-18  6:07   ` Coywolf Qi Hunt
     [not found] <200501070313.j073DCaQ009641@hera.kernel.org>
2005-01-07  3:41 ` William Lee Irwin III
2005-01-07  6:35   ` Linus Torvalds
2005-01-07  6:37     ` Linus Torvalds
2005-01-19 16:29       ` Larry McVoy
2005-01-19 17:14         ` Linus Torvalds
2005-01-19 19:01           ` Larry McVoy
2005-01-20  0:01             ` Linus Torvalds
