linux-kernel.vger.kernel.org archive mirror
* Re: Make pipe data structure be a circular list of pages, rather
  2005-01-15 23:42 Make pipe data structure be a circular list of pages, rather linux
@ 2005-01-15 22:55 ` Alan Cox
  2005-01-16  0:12   ` Linus Torvalds
  0 siblings, 1 reply; 18+ messages in thread
From: Alan Cox @ 2005-01-15 22:55 UTC (permalink / raw)
  To: linux; +Cc: Linux Kernel Mailing List, ioe-lkml, torvalds

On Sad, 2005-01-15 at 23:42, linux@horizon.com wrote:
> The beauty of Linus's scheme is that direct hardware-to-hardware
> copying is possible, without inventing new kernel interfaces.

You appear to need some simple kernel interfaces, but not many, and the
low level is there. This looks much cleaner (although perhaps not as
clever yet on the invalidation/cache side) than what Matt Dillon and Alan
Cox (the other Alan Cox, not me) are doing on DragonFly BSD in the same
sort of direction.



* Re: Make pipe data structure be a circular list of pages, rather
@ 2005-01-15 23:42 linux
  2005-01-15 22:55 ` Alan Cox
  0 siblings, 1 reply; 18+ messages in thread
From: linux @ 2005-01-15 23:42 UTC (permalink / raw)
  To: linux-kernel; +Cc: ioe-lkml, torvalds

The beauty of Linus's scheme is that direct hardware-to-hardware
copying is possible, without inventing new kernel interfaces.

Take a step way back to general software design, and consider a
unidirectional communications channel between a data producer and a
data consumer.

The data producer can be passive (produce data in response to read()
operations), or active (produce data spontaneously and go looking for
a place to put it).

Likewise, the data consumer can be passive (consume data when provided
by a write() operation), or active (in which case underflow is a
possibility).

An active producer can just call a passive consumer, and vice versa.  This
is the easiest way to pass data between two chunks of code, particularly
in the same address space.  But deciding which part will be active and
which part will be passive is an important early design decision.

An active producer can be connected to an active consumer with a passive
buffer structure.  This is what a Unix pipe is.  To make this work reliably,
you need some way to block the consumer when the pipe is empty and
some way to block the producer if the pipe is too full.

Finally, a passive producer can be connected to a passive consumer via
an active "pump" process that reads from the producer and writes to the
consumer.

A pipeline with multiple "pump" processes and multiple pipe buffers
connecting them is somewhat wasteful, but not necessarily too bad.
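
Concretely, a pump is nothing more than a read/write loop.  A minimal
sketch in C (generic file descriptors, no error handling beyond returning -1):

	#include <unistd.h>

	/* Pump everything from a passive producer to a passive consumer. */
	static int pump(int src, int dst)
	{
		char buf[4096];
		ssize_t n;

		while ((n = read(src, buf, sizeof(buf))) > 0) {
			char *p = buf;
			while (n > 0) {
				ssize_t w = write(dst, p, n);
				if (w < 0)
					return -1;
				p += w;
				n -= w;
			}
		}
		return n;	/* 0 on EOF, -1 on read error */
	}

The point is just that the pump, not the endpoints, is the active party.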


Now, Unix tradition is that user processes are active and the kernel (and
all its files and devices) is passive.  Thus, the user process
makes read() and write() calls and the kernel does as directed.  Even though
some devices (particularly sound cards) might be better modelled as active,
changing that basic kernel assumption would be a massive change that would
have way too many consequences to consider right now.

(For an example of the complexity of not having a process involved, look at
network routing.)

This is fundamentally why the splice() call requires the process to actually
move the buffers around.  You could imagine it just creating some sort of
in-kernel "background pump" to do the transfer, which could be optimized
away in the case of pipe-to-pipe splices, but it's important that there
always be a process (with a killable pid) "in charge" of the copying.

Otherwise, consider a splice() between /dev/zero and /dev/null.  That call
is basically going to spin in the kernel forever.  You need a way to stop it.
Anything other than kill() would be sufficiently un-Unix-like to require
careful justification.


Thus, Linus's scheme always keeps the splice() process "in the loop".
"Ah," you say, "but doesn't that introduce extraneous data copies?"

No, it doesn't, and there's the clever bit.  Instead of passing the *bytes*
through the pipe, he's passing a description of *where to find the bytes*.

And in some cases (notably splice() and tee()), that description can
be forwarded directly from one device or pipe to another without ever
looking at the bytes themselves.  Then, when you get to a part where
the description isn't good enough and you need the bytes themselves,
you can find them and copy them.

Basically, it's lazy evaluation of the bytes.  Rather than passing around
the bytes, you pass around a promise to generate some bytes.  It's like
the difference between cash and a cheque.  As Linus noted, you can start
a disk I/O and pass around a reference to the target page before the I/O
is complete.  The wait for I/O can be deferred until the bytes are needed.


What you need is a generalization of a "struct page".  It need not refer
to system RAM.  It need not be kmap-able.  The only thing you absolutely
need is the ability to copy it to a low memory bounce buffer.

But, if you're lucky, the data consumer can consume the data directly
from the source, and we have zero-copy I/O.


The big requirement is a way to define various kinds of "pages" and define
methods for copying between them.  Normally, we do this with an ops vector,
but because "struct page" is so numerous, and the number of types is so
small, perhaps a (Lisp-style) few-bit "tag" could be used rather than a
(C++-style) operations vector pointer.  It could be as simple as a single
bit which allows the reinterpretation of some existing fields, or declares
the "struct page" to have additional fields which describe its I/O
features, non-standard size, or whatever.
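
Purely as an illustration (none of these names exist anywhere), such a
generalized buffer description might be spelled out as a separate
descriptor rather than as extra fields in "struct page" itself:

	/* Illustrative only: a generalized "where to find the bytes" descriptor. */
	struct gbuf;

	struct gbuf_ops {
		void *(*map)(struct gbuf *);	/* make the bytes CPU-addressable, or fail */
		void (*unmap)(struct gbuf *);
		void (*release)(struct gbuf *);	/* drop the reference / cancel the promise */
	};

	struct gbuf {
		unsigned int	tag;		/* few-bit type tag: RAM page, card memory, ... */
		unsigned int	offset;		/* offset of the valid data */
		unsigned int	len;		/* number of valid bytes */
		void		*private;	/* type-specific handle: struct page *, bus address, ... */
		struct gbuf_ops	*ops;		/* or implied entirely by 'tag', to save space */
	};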

Further, for each *pair* of page types, there are several ways to perform
the final copy from producer to consumer.

- There may be a special way for this particular (source, destination)
  pair.
- It may be possible to map the destination into some address space that
  the source can write directly (e.g. DMA).
- It may be possible to map the source into some address space that the
  destination can read directly (e.g. DMA).
- If all else fails, it may be necessary to allocate a bounce buffer
  accessible to both source and destination and copy the data.

Because of the advantages of PCI writes, I assume that having the source
initiate the DMA is preferable, so the above are listed in decreasing
order of preference.

The obvious thing to me is for there to be a way to extract the following
information from the source and destination:
"I am already mapped in the following address spaces"
"I can be mapped into the following address spaces"
"I can read from/write to the following address spaces"

Only the third list is required to be non-empty.

Note that various kinds of "low memory" count as separate "address spaces"
for this purpose.

Then some generic code can look for the cheapest way to get the data from
the source to the destination.  Mapping one and passing the address to
the other's read/write routine if possible, or allocating a bounce
buffer and using a read followed by a write.
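
A sketch of that generic code, reusing the made-up descriptor above and
collapsing the preference order into the simplest "map both sides and
memcpy" case:

	#include <errno.h>
	#include <string.h>

	/* Sketch: move 'len' bytes from one generalized buffer to another. */
	static long gbuf_copy(struct gbuf *src, struct gbuf *dst, unsigned int len)
	{
		void *s = src->ops->map(src);
		void *d = dst->ops->map(dst);
		long ret = -EIO;

		if (s && d) {
			memcpy((char *)d + dst->offset, (char *)s + src->offset, len);
			ret = len;
		}
		/* Otherwise: try a (source, destination) specific routine, or
		 * fall back to a bounce buffer accessible to both sides. */

		if (d)
			dst->ops->unmap(dst);
		if (s)
			src->ops->unmap(src);
		return ret;
	}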

Questions for those who know odd machines and big iron better:

- Are there any cases where a (source,dest) specific routine could
  do better than this "map one and access it from the other" scheme?
  I.e. do we need to support the n^2 case?
- How many different kinds of "address spaces" do we need to deal with?
  Low kernel memory (below 1G), below 4G (32-bit DMA), and high memory
  are the obvious ones.  And below 16M for legacy DMA, I assume.
  x86 machines have I/O space as well.
  But what about machines with multiple PCI buses?  Can you DMA from one
  to the other, or is each one an "address space" that can have intra-bus
  DMA but needs a main-memory bounce buffer to cross buses?
  Can this be handled by a "PCI bus number" that must match, or are there
  cases where there are more intricate hierarchies of mutual accessibility?

  (A similar case is a copy from a process' address space.  Normally,
  single-copy pipes require a kmap in one of the processes.  But if
  the copying code can notice that source and dest both have the same
  mm structure, a direct copy becomes possible.)


* Re: Make pipe data structure be a circular list of pages, rather
  2005-01-15 22:55 ` Alan Cox
@ 2005-01-16  0:12   ` Linus Torvalds
  2005-01-16  2:02     ` Miquel van Smoorenburg
  2005-01-16  2:06     ` Jeff Garzik
  0 siblings, 2 replies; 18+ messages in thread
From: Linus Torvalds @ 2005-01-16  0:12 UTC (permalink / raw)
  To: Alan Cox; +Cc: linux, Linux Kernel Mailing List, ioe-lkml



On Sat, 15 Jan 2005, Alan Cox wrote:
>
> Alan Cox (the other Alan Cox not me)

Oh no! You guys are multiplying! 

Run! Run for your lives!

		Linus


* Re: Make pipe data structure be a circular list of pages, rather
  2005-01-16  0:12   ` Linus Torvalds
@ 2005-01-16  2:02     ` Miquel van Smoorenburg
  2005-01-16  2:06     ` Jeff Garzik
  1 sibling, 0 replies; 18+ messages in thread
From: Miquel van Smoorenburg @ 2005-01-16  2:02 UTC (permalink / raw)
  To: linux-kernel

In article <Pine.LNX.4.58.0501151611540.8178@ppc970.osdl.org>,
Linus Torvalds  <torvalds@osdl.org> wrote:
>
>On Sat, 15 Jan 2005, Alan Cox wrote:
>>
>> Alan Cox (the other Alan Cox not me)
>
>Oh no! You guys are multiplying! 

http://www.imdb.com/name/nm0184893/

Mike.



* Re: Make pipe data structure be a circular list of pages, rather
  2005-01-16  0:12   ` Linus Torvalds
  2005-01-16  2:02     ` Miquel van Smoorenburg
@ 2005-01-16  2:06     ` Jeff Garzik
  1 sibling, 0 replies; 18+ messages in thread
From: Jeff Garzik @ 2005-01-16  2:06 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Alan Cox, linux, Linux Kernel Mailing List, ioe-lkml

On Sat, Jan 15, 2005 at 04:12:27PM -0800, Linus Torvalds wrote:
> 
> 
> On Sat, 15 Jan 2005, Alan Cox wrote:
> >
> > Alan Cox (the other Alan Cox not me)
> 
> Oh no! You guys are multiplying! 
> 
> Run! Run for your lives!

It's been said for a while now that Alan is really a supercomputing
cluster of 100+ gnomes.

	Jeff





* Re: Make pipe data structure be a circular list of pages, rather
  2005-01-16  2:59                   ` Linus Torvalds
@ 2005-01-17 16:03                     ` Ingo Oeser
  0 siblings, 0 replies; 18+ messages in thread
From: Ingo Oeser @ 2005-01-17 16:03 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux, Kernel Mailing List

Hi Linus,

Linus Torvalds wrote:
> +static long do_splice_from(struct inode *pipe, struct file *out, size_t len, unsigned long flags) 
> +static long do_splice_to(struct file *in, struct inode *pipe, size_t len, unsigned long flags) 
> +static long do_splice(struct file *in, struct file *out, size_t len, unsigned long flags) 
> +asmlinkage long sys_splice(int fdin, int fdout, size_t len, unsigned long flags) 

That part looks quite perfect. As long as they stay like this, I'm totally
happy. I even have no problem with limiting to a length, since I can use
that to measure progress (e.g. a simple progress bar).
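
For example (just a sketch, reusing the sys_splice() wrapper from your
test program, with a made-up chunk size and helper name):

	#include <stdio.h>

	#define CHUNK (1024 * 1024)

	/* Stream 'total' bytes in chunks, printing a trivial progress report. */
	static long stream(int fdin, int fdout, long long total)
	{
		long long done = 0;

		while (done < total) {
			long n = sys_splice(fdin, fdout, CHUNK, 0);
			if (n <= 0)
				return n;	/* error, or nothing left to stream */
			done += n;
			fprintf(stderr, "\r%lld / %lld bytes", done, total);
		}
		return 0;
	}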

This way I also keep the process as an "actor", as linux@horizon.com pointed out.

It has unnecessary scheduling overhead, but the ability to stop/resume
the transfer by killing the process doing it is worth it, I agree.

So I would put a structure in the inode identifying the special device
and check whether the "in" and "out" parameters are from devices suitable
for a direct on-wire transfer. If they are, I just set up some registers
and wait for the transfer to happen.

Then I get an interrupt/wakeup when the requested amount has been streamed,
increment some user space pointers, switch to user space, and user space
tells me to abort or to stream more, and I follow.

This continues until the user aborts or a streaming problem happens.

Just to give you an idea: I debugged such a machine and had a hard-hanging
kernel with interrupts disabled. It still got data from a tuner, through an
MPEG decoder and an MPEG demultiplexer, and played it to the audio card.
Not just a buffer's worth like ALSA/OSS, but for as long as I liked, end to
end, without any CPU intervention.

That behavior would be perfect, but I could also live with a "pushing process".


Regards

Ingo Oeser



* Re: Make pipe data structure be a circular list of pages, rather
  2005-01-15  0:16                 ` Linus Torvalds
@ 2005-01-16  2:59                   ` Linus Torvalds
  2005-01-17 16:03                     ` Ingo Oeser
  0 siblings, 1 reply; 18+ messages in thread
From: Linus Torvalds @ 2005-01-16  2:59 UTC (permalink / raw)
  To: Ingo Oeser; +Cc: linux, Kernel Mailing List



On Fri, 14 Jan 2005, Linus Torvalds wrote:
> 
> So I don't think you'll get _exactly_ what you want for a while, since
> you'd have to go through in-memory buffers. But there's no huge conceptual
> problem (just enough implementation issues to make it inconvenient) with
> the concept of actually keeping the buffer on an external controller.

Here is, if you are interested, a slightly updated version of my 
"work-in-progress" example patch. It:

 - depends on a fairly recent -BK (since it uses the pipe buffer ops
   interfaces that I already checked in)
 - is very ugly: the VFS interfaces to splice things really are very wrong 
   indeed, and they _should_ be about passing "struct pipe_buffer" arrays
   around instead.
 - it only does one specific case (splicing from the page cache directly
   into a pipe)
 - only set up for ext3 right now (though the "generic_file_splice" thing 
   should work for any page-cache user)
 - and even that specific case it does wrong since it doesn't update the
   "f_pos" field right now
 - error handling is broken - if an IO error happens on the splice, the 
   "map()" operation will return NULL, and the current fs/pipe.c can't 
   handle that and will oops ;)

so it really is _purely_ a demonstration vehicle, but it at least shows
the kind of operation that I'm thinking of.

An example program to show its use is a simple:

	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>
	#include <linux/unistd.h>
	
	#define __NR_sys_splice 289
	
	inline _syscall4(int, sys_splice, int, a, int, b, size_t, len, unsigned long, flags);
	
	int main(int argc, char **argv)
	{
		int fd = open(argv[1], O_RDONLY);
		if (fd < 0) {
			perror("open");
			exit(1);
		}
		write(1,"hello: ", 7);
		sys_splice(fd, 1, 1000, 0);
	}

which demonstrates three things: 

 - it works on a perfectly regular pipe, with no special setup, so to use 
   it you can just do

	./test-splice my-file | less

   and the normal pipe that the shell created for this will just work. No 
   new magic fd's required.

 - that you can combine different sources to the pipe (mixing a 
   traditional static write with a "splice from a file").

 - we start the IO when we do the splice, but the waiting for the result 
   happens separately, when (or if) the data is used by the other end.

it will print out "hello: " followed by the first 1000 bytes of the file.

It's obviously not very useful for anything but demonstration, since
splicing the other way (from a pipe to another fd) isn't implemented by
this example, but it gives a more concrete example of what I'm thinking
of.

And shows that the splicing itself is pretty simple ("move_to_pipe()"  
does all the moving of buffers to the pipe) - the rest is just regular
setup and finding of the pages in the page cache.

			Linus

----
# This is a BitKeeper generated diff -Nru style patch.
#
# ChangeSet
#   2005/01/15 18:36:19-08:00 torvalds@evo.osdl.org 
#   sys_splice: take 3
#   
#   Still horribly bad VFS interface, still doing the pipe
#   filling inside the per-fd callback instead of doing it
#   in "generic code" in fs/splice.c.
#   
#   Need to pass in (and out) arrays of "struct pipe_buffer"
#   on a VFS level instead.
# 
# fs/splice.c
#   2005/01/15 18:36:07-08:00 torvalds@evo.osdl.org +229 -0
# 
# include/linux/fs.h
#   2005/01/15 18:36:07-08:00 torvalds@evo.osdl.org +2 -0
#   sys_splice: take 3
#   
#   Still horribly bad VFS interface, still doing the pipe
#   filling inside the per-fd callback instead of doing it
#   in "generic code" in fs/splice.c.
#   
#   Need to pass in (and out) arrays of "struct pipe_buffer"
#   on a VFS level instead.
# 
# include/asm-i386/unistd.h
#   2005/01/15 18:36:07-08:00 torvalds@evo.osdl.org +2 -1
#   sys_splice: take 3
#   
#   Still horribly bad VFS interface, still doing the pipe
#   filling inside the per-fd callback instead of doing it
#   in "generic code" in fs/splice.c.
#   
#   Need to pass in (and out) arrays of "struct pipe_buffer"
#   on a VFS level instead.
# 
# fs/splice.c
#   2005/01/15 18:36:07-08:00 torvalds@evo.osdl.org +0 -0
#   BitKeeper file /home/torvalds/v2.6/linux/fs/splice.c
# 
# fs/ext3/file.c
#   2005/01/15 18:36:07-08:00 torvalds@evo.osdl.org +3 -0
#   sys_splice: take 3
#   
#   Still horribly bad VFS interface, still doing the pipe
#   filling inside the per-fd callback instead of doing it
#   in "generic code" in fs/splice.c.
#   
#   Need to pass in (and out) arrays of "struct pipe_buffer"
#   on a VFS level instead.
# 
# fs/Makefile
#   2005/01/15 18:36:07-08:00 torvalds@evo.osdl.org +1 -0
#   sys_splice: take 3
#   
#   Still horribly bad VFS interface, still doing the pipe
#   filling inside the per-fd callback instead of doing it
#   in "generic code" in fs/splice.c.
#   
#   Need to pass in (and out) arrays of "struct pipe_buffer"
#   on a VFS level instead.
# 
# arch/i386/kernel/entry.S
#   2005/01/15 18:36:07-08:00 torvalds@evo.osdl.org +1 -0
#   sys_splice: take 3
#   
#   Still horribly bad VFS interface, still doing the pipe
#   filling inside the per-fd callback instead of doing it
#   in "generic code" in fs/splice.c.
#   
#   Need to pass in (and out) arrays of "struct pipe_buffer"
#   on a VFS level instead.
# 
diff -Nru a/arch/i386/kernel/entry.S b/arch/i386/kernel/entry.S
--- a/arch/i386/kernel/entry.S	2005-01-15 18:36:40 -08:00
+++ b/arch/i386/kernel/entry.S	2005-01-15 18:36:40 -08:00
@@ -864,5 +864,6 @@
 	.long sys_add_key
 	.long sys_request_key
 	.long sys_keyctl
+	.long sys_splice
 
 syscall_table_size=(.-sys_call_table)
diff -Nru a/fs/Makefile b/fs/Makefile
--- a/fs/Makefile	2005-01-15 18:36:40 -08:00
+++ b/fs/Makefile	2005-01-15 18:36:40 -08:00
@@ -10,6 +10,7 @@
 		ioctl.o readdir.o select.o fifo.o locks.o dcache.o inode.o \
 		attr.o bad_inode.o file.o filesystems.o namespace.o aio.o \
 		seq_file.o xattr.o libfs.o fs-writeback.o mpage.o direct-io.o \
+		splice.o
 
 obj-$(CONFIG_EPOLL)		+= eventpoll.o
 obj-$(CONFIG_COMPAT)		+= compat.o
diff -Nru a/fs/ext3/file.c b/fs/ext3/file.c
--- a/fs/ext3/file.c	2005-01-15 18:36:40 -08:00
+++ b/fs/ext3/file.c	2005-01-15 18:36:40 -08:00
@@ -101,6 +101,8 @@
 	return ret;
 }
 
+extern ssize_t generic_file_splice_read(struct file *in, struct inode *pipe, size_t len, unsigned long flags);
+
 struct file_operations ext3_file_operations = {
 	.llseek		= generic_file_llseek,
 	.read		= do_sync_read,
@@ -115,6 +117,7 @@
 	.release	= ext3_release_file,
 	.fsync		= ext3_sync_file,
 	.sendfile	= generic_file_sendfile,
+	.splice_read	= generic_file_splice_read,
 };
 
 struct inode_operations ext3_file_inode_operations = {
diff -Nru a/fs/splice.c b/fs/splice.c
--- /dev/null	Wed Dec 31 16:00:00 196900
+++ b/fs/splice.c	2005-01-15 18:36:40 -08:00
@@ -0,0 +1,229 @@
+/*
+ * linux/fs/splice.c
+ *
+ * "splice": joining two ropes together by interweaving their strands.
+ *
+ * This is the "extended pipe" functionality, where a pipe is used as
+ * an arbitrary in-memory buffer. Think of a pipe as a small kernel
+ * buffer that you can use to transfer data from one end to the other.
+ *
+ * The traditional unix read/write is extended with a "splice()" operation
+ * that transfers data buffers to or from a pipe buffer.
+ *
+ * Named by Larry McVoy.
+ */
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/pagemap.h>
+#include <linux/pipe_fs_i.h>
+
+static void page_cache_pipe_buf_release(struct pipe_inode_info *info, struct pipe_buffer *buf)
+{
+	page_cache_release(buf->page);
+}
+
+static void *page_cache_pipe_buf_map(struct file *file, struct pipe_inode_info *info, struct pipe_buffer *buf)
+{
+	struct page *page = buf->page;
+
+	if (!PageUptodate(page)) {
+		wait_on_page_locked(page);
+		if (!PageUptodate(page))
+			return NULL;
+	}
+	return kmap(page);
+}
+
+static void page_cache_pipe_buf_unmap(struct pipe_inode_info *info, struct pipe_buffer *buf)
+{
+	kunmap(buf->page);
+}
+
+static struct pipe_buf_operations page_cache_pipe_buf_ops = {
+	.can_merge = 0,
+	.map = page_cache_pipe_buf_map,
+	.unmap = page_cache_pipe_buf_unmap,
+	.release = page_cache_pipe_buf_release,
+};
+
+static ssize_t move_to_pipe(struct inode *inode, struct page **array, int pages, unsigned long offset, unsigned long len)
+{
+	struct pipe_inode_info *info;
+	int ret, do_wakeup;
+
+	down(PIPE_SEM(*inode));
+	info = inode->i_pipe;
+	ret = 0;
+	do_wakeup = 0;
+	for (;;) {
+		int bufs;
+
+		if (!PIPE_READERS(*inode)) {
+			send_sig(SIGPIPE, current, 0);
+			if (!ret) ret = -EPIPE;
+			break;
+		}
+		bufs = info->nrbufs;
+		if (bufs < PIPE_BUFFERS) {
+			int newbuf = (info->curbuf + bufs) & (PIPE_BUFFERS-1);
+			struct pipe_buffer *buf = info->bufs + newbuf;
+			struct page *page = *array++;
+			unsigned long this_len;
+
+			this_len = PAGE_SIZE - offset;
+			if (this_len > len)
+				this_len = len;
+
+			buf->page = page;
+			buf->offset = offset;
+			buf->len = this_len;
+			buf->ops = &page_cache_pipe_buf_ops;
+			info->nrbufs = ++bufs;
+			do_wakeup = 1;
+
+			ret += this_len;
+			len -= this_len;
+			offset = 0;
+			if (!--pages)
+				break;
+			if (!len)
+				break;
+			if (bufs < PIPE_BUFFERS)
+				continue;
+			break;
+		}
+		if (signal_pending(current)) {
+			if (!ret) ret = -ERESTARTSYS;
+			break;
+		}
+		if (do_wakeup) {
+			wake_up_interruptible_sync(PIPE_WAIT(*inode));
+			kill_fasync(PIPE_FASYNC_READERS(*inode), SIGIO, POLL_IN);
+			do_wakeup = 0;
+		}
+		PIPE_WAITING_WRITERS(*inode)++;
+		pipe_wait(inode);
+		PIPE_WAITING_WRITERS(*inode)--;
+	}
+	up(PIPE_SEM(*inode));
+	if (do_wakeup) {
+		wake_up_interruptible(PIPE_WAIT(*inode));
+		kill_fasync(PIPE_FASYNC_READERS(*inode), SIGIO, POLL_IN);
+	}
+	while (--pages >= 0)
+		page_cache_release(*++array);
+	return ret;
+}
+
+ssize_t generic_file_splice_read(struct file *in, struct inode *pipe, size_t len, unsigned long flags)
+{
+	struct address_space *mapping = in->f_mapping;
+	unsigned long long pos, size, left;
+	unsigned long index, offset, pages;
+	struct page *array[PIPE_BUFFERS];
+	int i;
+
+	pos = in->f_pos;
+	size = i_size_read(mapping->host);
+	if (pos >= size)
+		return 0;
+	left = size - pos;
+	if (left < len)
+		len = left;
+	index = pos >> PAGE_CACHE_SHIFT;   
+	offset = pos & ~PAGE_CACHE_MASK;
+	pages = (len + offset + PAGE_SIZE - 1) >> PAGE_CACHE_SHIFT;
+
+	if (pages > PIPE_BUFFERS)
+		pages = PIPE_BUFFERS;
+
+	/*
+	 * Get as many pages from the page cache as possible..
+	 * Start IO on the page cache entries we create (we
+	 * can assume that any pre-existing ones we find have
+	 * already had IO started on them).
+	 */
+	i = find_get_pages(mapping, index, pages, array);
+
+	/*
+	 * If not all pages were in the page-cache, we'll
+	 * just assume that the rest haven't been read in,
+	 * so we'll get the rest locked and start IO on
+	 * them if we can..
+	 */
+	while (i < pages) {
+		struct page *page = find_or_create_page(mapping, index + i, GFP_USER);
+		int error;
+		if (!page)
+			break;
+		if (PageUptodate(page)) {
+			unlock_page(page);
+		} else {
+			error = mapping->a_ops->readpage(in, page);
+			if (unlikely(error)) {
+				page_cache_release(page);
+				break;
+			}
+		}
+		array[i++] = page;
+	}
+
+	if (!i)
+		return 0;
+
+	/*
+	 * Now we splice them into the pipe..
+	 */
+	return move_to_pipe(pipe, array, i, offset, len);
+}
+
+static long do_splice_from(struct inode *pipe, struct file *out, size_t len, unsigned long flags)
+{
+	if (out->f_op && out->f_op->splice_write)
+		return out->f_op->splice_write(pipe, out, len, flags);
+	return -EINVAL;
+}
+
+static long do_splice_to(struct file *in, struct inode *pipe, size_t len, unsigned long flags)
+{
+	if (in->f_op && in->f_op->splice_read)
+		return in->f_op->splice_read(in, pipe, len, flags);
+	return -EINVAL;
+}
+
+static long do_splice(struct file *in, struct file *out, size_t len, unsigned long flags)
+{
+	struct inode *pipe;
+
+	if (!len)
+		return 0;
+	pipe = in->f_dentry->d_inode;
+	if (pipe->i_pipe)
+		return do_splice_from(pipe, out, len, flags);
+	pipe = out->f_dentry->d_inode;
+	if (pipe->i_pipe)
+		return do_splice_to(in, pipe, len, flags);
+	return -EINVAL;
+}
+
+asmlinkage long sys_splice(int fdin, int fdout, size_t len, unsigned long flags)
+{
+	long error;
+	struct file *in, *out;
+	int fput_in, fput_out;
+
+	error = -EBADF;
+	in = fget_light(fdin, &fput_in);
+	if (in) {
+		if (in->f_mode & FMODE_READ) {
+			out = fget_light(fdout, &fput_out);
+			if (out) {
+				if (out->f_mode & FMODE_WRITE)
+					error = do_splice(in, out, len, flags);
+				fput_light(out, fput_out);
+			}
+		}
+		fput_light(in, fput_in);
+	}
+	return error;
+}
diff -Nru a/include/asm-i386/unistd.h b/include/asm-i386/unistd.h
--- a/include/asm-i386/unistd.h	2005-01-15 18:36:40 -08:00
+++ b/include/asm-i386/unistd.h	2005-01-15 18:36:40 -08:00
@@ -294,8 +294,9 @@
 #define __NR_add_key		286
 #define __NR_request_key	287
 #define __NR_keyctl		288
+#define __NR_splice		289
 
-#define NR_syscalls 289
+#define NR_syscalls 290
 
 /*
  * user-visible error numbers are in the range -1 - -128: see
diff -Nru a/include/linux/fs.h b/include/linux/fs.h
--- a/include/linux/fs.h	2005-01-15 18:36:40 -08:00
+++ b/include/linux/fs.h	2005-01-15 18:36:40 -08:00
@@ -944,6 +944,8 @@
 	int (*check_flags)(int);
 	int (*dir_notify)(struct file *filp, unsigned long arg);
 	int (*flock) (struct file *, int, struct file_lock *);
+	ssize_t (*splice_write)(struct inode *, struct file *, size_t, unsigned long);
+	ssize_t (*splice_read)(struct file *, struct inode *, size_t, unsigned long);
 };
 
 struct inode_operations {


* Re: Make pipe data structure be a circular list of pages, rather
  2005-01-14 23:34               ` Ingo Oeser
@ 2005-01-15  0:16                 ` Linus Torvalds
  2005-01-16  2:59                   ` Linus Torvalds
  0 siblings, 1 reply; 18+ messages in thread
From: Linus Torvalds @ 2005-01-15  0:16 UTC (permalink / raw)
  To: Ingo Oeser; +Cc: linux, Kernel Mailing List



On Sat, 15 Jan 2005, Ingo Oeser wrote:
> 
> Now imagine this (ACSCII art in monospace font):
> 
> [ chip A ] ------(1)------ [ chip B ]   [ CPU ]   [memory]  [disk]
>       |                        |           |         |        |
>       +----------(2)-----------+---(2)-----+----(2)--+--------+
> 
> Yes, I understand and support your vision. Now I would like to use path (1),
> the direct one, for this. Possible?

Yes, if I understand what you're thinking right. I'm just assuming (for 
the sake of simplicity) that "A" generates all data, and "B" is the one 
that uses it. If both A and B can generate/use data, it still works, but 
you'd need two pipes per endpoint, just because pipes are fundamentally 
uni-directional.

Also, for this particular example, I'll also assume that whatever buffers 
that chips A/B use are "mappable" to the CPU too: they could be either 
regular memory pages (that A/B DMA to/from), or they can be IO buffers on 
the card itself that can be mapped into the CPU address space and 
memcpy'ed.

That second simplification is something that at least my current example
code (which I don't think I've sent to you) kind of assumes. It's not
really _forced_ onto the design, but it avoids the need for a separate 
"accessor" function to copy things around.

[ If you cannot map them into CPU space, then each "struct pipe_buffer" op
  structure would need operations to copy them to/from kernel memory and
  to copy them to/from user memory, so the second simplification basically
  avoids having to have four extra "accessor" functions. ]

What you'd do to get your topology above is:
 - create a "pipe" for both A and B (let's call them "fdA" and "fdB" 
   respectively).
 - the driver is responsible for creating the "struct pipe_buffer" for any 
   data generated by chip A. If it's regular memory, it's a page 
   allocation and the appropriate DMA setup, and if it's IO memory on the 
   card, you'd need to generate fake "struct page *" and a mapping 
   function for them.
 - if you want to send the data that A generates to both fdB _and_ write 
   it to disk (file X), you'd also need one anonymous pipe (fdC), and open
   a fdX for writing to the file, and then just in your process
   effectively do:

	for (;;) {
		n = tee(fdA, fdB, fdC);
		splice(fdC, fdX, n);
	}

   and you'd get what I think you are after. The reason you need the "fdC" 
   is simply because the "tee()" operation would only work on pipes: you 
   can not duplicate any other data structure with "tee()" than a pipe
   buffer, so you couldn't do a direct "tee(fdA, fdB, fdX)" in my world.

(The above example is _extremely_ simplified: in real life, you'd have to
take care of partial results from "splice()" etc error conditions, of
course).

NOTE! My first cut will _assume_ that all buffers are in RAM, so it
wouldn't support the notion of on-card memory. On-card memory does
complicate things - not just the accessor functions, but also the notion
of how to "splice()" (or "tee()" from one to the other). If you assume
buffers-in-RAM, then all pointers are the same, and a tee/splice really
just moves a pointer around. But if the pointer can have magic meaning
that depends on the source it came from, then splicing such a pointer to
another device must inevitably imply at least the possibility of some kind
of memory copy.

So I don't think you'll get _exactly_ what you want for a while, since
you'd have to go through in-memory buffers. But there's no huge conceptual
problem (just enough implementation issues to make it inconvenient) with
the concept of actually keeping the buffer on an external controller.

		Linus


* Re: Make pipe data structure be a circular list of pages, rather
  2005-01-14 22:44             ` Linus Torvalds
@ 2005-01-14 23:34               ` Ingo Oeser
  2005-01-15  0:16                 ` Linus Torvalds
  0 siblings, 1 reply; 18+ messages in thread
From: Ingo Oeser @ 2005-01-14 23:34 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux, Kernel Mailing List

Hi Linus,

Linus Torvalds wrote:
> That's where it gets interesting. Everybody needs buffers, and they are
> generally so easy to implement that there's no point in having much of a
> "buffer library".

But my hardware has the buffers on chip, if the chips are connected directly
to each other, so no system-level buffering is needed.

> No. Use two totally separate fd's, and make it _cheap_ to move between
> them. That's what "splice()" gives you - basically a very low-cost way to
> move between two uni-directional things. No "memcpy()", because memcpy is
> expensive for large streams of data.

Here we agree already, so let's talk about the rest.

> If you end up reading from a regular file (or writing to one), then yes,
> the buffers end up being picked up from the buffer cache. But that's by no
> means required. The buffers can be just anonymous pages (like the ones a
> regular "write()" to a pipe generates), or they could be DMA pages
> allocated for the data by a device driver. Or they could be the page that
> contains a "skb" from a networking device.
>
> I really think that splice() is what you want, and you call "wire()".

That sounds really close to what I want.

Now imagine this (ASCII art in monospace font):

[ chip A ] ------(1)------ [ chip B ]   [ CPU ]   [memory]  [disk]
      |                        |           |         |        |
      +----------(2)-----------+---(2)-----+----(2)--+--------+

(1) Is a direct path between these chips.
(2) Is the system bus (e.g. PCI)

(1) works without any CPU intervention
(2) needs the buffering scheme you described.

I'd like to use (1) for obvious reasons, but still allow (2).
User space should decide which one, since it's a policy decision.

I need to open the chips defaulting to (2), because the driver
doesn't know in advance that it will be connected to a chip it can talk to
directly. Everything else would be quite ugly.

Please note that this is a simplified case: we actually have many chips of
type A/B, and (1) is more like a software-programmable "patch field" you
might know from a server rack. I want user space to operate this "patch
field".

Maybe "io router" is a descriptive term. No?

> Yes, I believe that we're talking about the same thing. What you can do in
> my vision is:
>
>  - create a pipe for feeding the audio decoder chip. This is just the
>    sound driver interface to a pipe, and it's the "device_open()" code I
>    gave as an example in my previous email, except going the other way (ie
>    the writer is the 'fd', and the driver is the "reader" of the data).
>
>    You'd do this with a simple 'fd = open("/dev/xxxx", O_WRONLY)",
>    together with some ioctl (if necessary) to set up the actual
>    _parameters_ for the piped device.
>
>  - do a "splice()" from a file to the pipe. A splice from a regular file
>    is really nothing but a page cache lookup, and moving those page cache
>    pages to the pipe.
>
>  - tell the receiving fd to start processing it (again, this might be an
>    ioctl - some devices may need directions on how to interpret the data,
>    or it might be implicit in the fact that the pipe got woken up by being
>    filled)

Yes, I understand and support your vision. Now I would like to use path (1),
the direct one, for this. Possible?

Maybe by making drivers optionally aware of splice() and defaulting to a 
regular pipe, if they don't care?


Thanks and Regards

Ingo Oeser



* Re: Make pipe data structure be a circular list of pages, rather
  2005-01-14 22:12           ` Ingo Oeser
@ 2005-01-14 22:44             ` Linus Torvalds
  2005-01-14 23:34               ` Ingo Oeser
  0 siblings, 1 reply; 18+ messages in thread
From: Linus Torvalds @ 2005-01-14 22:44 UTC (permalink / raw)
  To: Ingo Oeser; +Cc: linux, Kernel Mailing List



On Fri, 14 Jan 2005, Ingo Oeser wrote:
> 
> Data sink/source is simple, indeed. You just implemented buffering
> between a driver's output filp, if I understand you correctly.

Yes and no. It is indeed just buffering.

The thing I find most intriguing about it, and which is why I think it's
important is that while it's "just" buffering, it is _standard_ buffering.  
That's where it gets interesting. Everybody needs buffers, and they are
generally so easy to implement that there's no point in having much of a 
"buffer library". 

But if the standard way allows you to do interesting stuff _between_ two
things, then that's where it gets interesting. The combinations it allows.

> Now both directions together and it gets a bit more difficult.
> Driver writers basically reimplement fs/pipe.c over and over again.

But bidirectional is just two of those things. 

> Always the same basic operations, needing the same structure
> to handle them.

Yes.

> Maybe the solution here is just using ONE fd and read/write to it.

No. Use two totally separate fd's, and make it _cheap_ to move between 
them. That's what "splice()" gives you - basically a very low-cost way to 
move between two uni-directional things. No "memcpy()", because memcpy is 
expensive for large streams of data.

> > > We also don't have wire_fds(), which would wire up two fds by
> > > connecting the underlying file pointers with each other and closing the
> > > fds.
> >
> > But that is _exactly_ what splice() would do.
> >
> > So you could have the above open of "/dev/mpeginput", and then you just
> > sit and splice the result into a file or whatever.
> >
> > See?
> 
> But you are still pushing everything through the page cache.
> I would like to see "fop->connect(producer, consumer)"

No no. There's no buffer cache involved. There is just "buffers".

If you end up reading from a regular file (or writing to one), then yes,
the buffers end up being picked up from the buffer cache. But that's by no
means required. The buffers can be just anonymous pages (like the ones a
regular "write()" to a pipe generates), or they could be DMA pages
allocated for the data by a device driver. Or they could be the page that
contains a "skb" from a networking device. 

I really think that splice() is what you want, and you call "wire()". 

> Imagine 3 chips. One is an encoder/decoder chip with 8 channels, the other
> 2 chips are a video DAC/ADC and an audio DAC/ADC with 4 channels each.
> 
> These chips can be wired directly by programming a wire matrix (much like a 
> dumb routing table). But you can also receive/send via all of these chips
> to/from harddisk for recording/playback. 
> 
> So you have to implement the drivers for these chips to provide one filp per 
> channel and one minor per chip.
> 
> If I can do this with splice, you've got me and I'm really looking forward to
> your first commits/patches, since this itch has been bothering me for a long time ;-)

Yes, I believe that we're talking about the same thing. What you can do in 
my vision is:

 - create a pipe for feeding the audio decoder chip. This is just the 
   sound driver interface to a pipe, and it's the "device_open()" code I 
   gave as an example in my previous email, except going the other way (ie 
   the writer is the 'fd', and the driver is the "reader" of the data).

   You'd do this with a simple 'fd = open("/dev/xxxx", O_WRONLY)",
   together with some ioctl (if necessary) to set up the actual
   _parameters_ for the piped device.

 - do a "splice()" from a file to the pipe. A splice from a regular file 
   is really nothing but a page cache lookup, and moving those page cache 
   pages to the pipe.

 - tell the receiving fd to start processing it (again, this might be an 
   ioctl - some devices may need directions on how to interpret the data,
   or it might be implicit in the fact that the pipe got woken up by being 
   filled)

Going the other way (receiving data from a hardware device) is all the 
same thing - and there "tee()" may be useful, since it would allow the 
received data to be dup'ed to two different sinks.
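
In user space the whole thing would then look something like this (a very
rough sketch: no error handling, a made-up file name, and the same
three-argument splice() notation used elsewhere in this thread rather than
any final prototype):

	#include <fcntl.h>
	#include <unistd.h>

	int main(void)
	{
		int dev = open("/dev/xxxx", O_WRONLY);		/* the driver end of the pipe */
		int file = open("some-audio-file", O_RDONLY);	/* made-up source file */

		/* (set up the device parameters with an ioctl here, if needed) */

		for (;;) {
			/* move page cache pages from the file into the device pipe */
			long n = splice(file, dev, 128*1024);
			if (n <= 0)
				break;
		}
		return 0;
	}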

		Linus


* Re: Make pipe data structure be a circular list of pages, rather
  2005-01-14 21:29         ` Linus Torvalds
@ 2005-01-14 22:12           ` Ingo Oeser
  2005-01-14 22:44             ` Linus Torvalds
  0 siblings, 1 reply; 18+ messages in thread
From: Ingo Oeser @ 2005-01-14 22:12 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux, Kernel Mailing List

Hi Linus,

Linus Torvalds wrote:
> On Fri, 14 Jan 2005, Ingo Oeser wrote:
> But realize that what you are asking for is actually a _much_ simpler
> thing than even "fifo_open()". What you ask for (if somebody really starts
> doing this, and I'd love to see it) is not much more difficult than
>
>  int device_open(struct inode *inode, struct file *filp)
>  {
>   if (!pipe_new(inode))
>    return -ENOMEM;
>
>   /* The hardware is a writer */
>   PIPE_WRITERS(*inode)++;
>
>   /* The new fd is a reader of the data the hardware generates */
>   file->f_op = read_fifo_fops;
>
>   /*
>    * This starts the hardware capture - although in real
>    * life there might be a "start it" ioctl or something
>    * that actually does that together with the parameters
>    * for capture..
>    */
>   start_capture_process(inode);
>
>   return 0;
>  }
>
> ie we're definitely not talking rocket science here.

Data sink/source is simple, indeed. You just implemented buffering
between a driver's output filp, if I understand you correctly.

Now both directions together and it gets a bit more difficult.
Driver writers basically reimplement fs/pipe.c over and over again.

encoded stream -> write() -> decoder hardware -> read() -> decoded stream

I'm trying to model the decoder as a device here.
Same goes for encoding.

> (Yes, I'm sure there are some details missing in the above, but you get
> the idea: the infrastructure really _is_ pretty much there, and any lack
> comes from the fact that nobody has actually ever _used_ it).

No, we have all the hardware actually doing this hidden behind other
interfaces, e.g. the DVB core, my LinuxDSP core, the upcoming hardware
crypto core and so on.

Always the same basic operations, needing the same structure
to handle them.

Maybe the solution here is just using ONE fd and read/write to it.

But for identifying producer/consumer, you need two fds. And now we come to
wire() (or maybe we discover that splice() is really doing the same).
 
> > We also don't have wire_fds(), which would wire up two fds by
> > connecting the underlying file pointers with each other and closing the
> > fds.
>
> But that is _exactly_ what splice() would do.
>
> So you could have the above open of "/dev/mpeginput", and then you just
> sit and splice the result into a file or whatever.
>
> See?

But you are still pushing everything through the page cache.
I would like to see "fop->connect(producer, consumer)"

instead of

while (producer->not_killed && consumer->not_killed) {
 producer->read(buffer);
 consumer->write(buffer);
}

That's what the wire() syscall is about. If the splice() syscall is the same
and enables drivers to avoid the copying through the CPU, I'm perfectly happy.

> (Or maybe it's me who doesn't see what it is you want).

Imagine 3 chips. One is an encoder/decoder chip with 8 channels, the other
2 chips are a video DAC/ADC and an audio DAC/ADC with 4 channels each.

These chips can be wired directly by programming a wire matrix (much like a 
dumb routing table). But you can also receive/send via all of these chips
to/from harddisk for recording/playback. 

So you have to implement the drivers for these chips to provide one filp per 
channel and one minor per chip.

If I can do this with splice, you've got me and I'm really looking forward to
your first commits/patches, since this itch has been bothering me for a long time ;-)


Thanks and Regards

Ingo Oeser



* Re: Make pipe data structure be a circular list of pages, rather
  2005-01-14 21:03       ` Ingo Oeser
@ 2005-01-14 21:29         ` Linus Torvalds
  2005-01-14 22:12           ` Ingo Oeser
  0 siblings, 1 reply; 18+ messages in thread
From: Linus Torvalds @ 2005-01-14 21:29 UTC (permalink / raw)
  To: Ingo Oeser; +Cc: linux, Kernel Mailing List



On Fri, 14 Jan 2005, Ingo Oeser wrote:
> 
> But we currently have no suitable interface for this.
> 
> We have socketpair(), we have pipe(), but we don't have
> dev_pipe() creating a pipe with the driver sitting in the middle.

Not exactly that, no.

But realize that what you are asking for is actually a _much_ simpler 
thing than even "fifo_open()". What you ask for (if somebody really starts 
doing this, and I'd love to see it) is not much more difficult than

	int device_open(struct inode *inode, struct file *filp)
	{
		if (!pipe_new(inode))
			return -ENOMEM;

		/* The hardware is a writer */
		PIPE_WRITERS(*inode)++;

		/* The new fd is a reader of the data the hardware generates */
		file->f_op = read_fifo_fops;

		/*
		 * This starts the hardware capture - although in real
		 * life there might be a "start it" ioctl or something
		 * that actually does that together with the parameters
		 * for capture..
		 */
		start_capture_process(inode);

		return 0;
	}

ie we're definitely not talking rocket science here.

(Yes, I'm sure there are some details missing in the above, but you get 
the idea: the infrastructure really _is_ pretty much there, and any lack 
comes from the fact that nobody has actually ever _used_ it).

> Usage: "processing devices" like crypto processors, DSPs, 
>  MPEG-Encoding/-Decoding hardware, smart cards and the like.
> 
> Synopsis: int dev_pipe(const char *dev_path, int fd_vec[2]);

No, I do NOT think that's what you'd want. 

What you want is more of a

	fd = open("/dev/mpeginput", O_RDONLY);

and then that device driver open just does the thing outlined above.

> We also don't have wire_fds(), which would wire up two fds by 
> connecting the underlying file pointers with each other and closing the fds.

But that is _exactly_ what splice() would do.

So you could have the above open of "/dev/mpeginput", and then you just 
sit and splice the result into a file or whatever.

See?

(Or maybe it's me who doesn't see what it is you want).

			Linus


* Re: Make pipe data structure be a circular list of pages, rather
  2005-01-13 22:32     ` Linus Torvalds
@ 2005-01-14 21:03       ` Ingo Oeser
  2005-01-14 21:29         ` Linus Torvalds
  0 siblings, 1 reply; 18+ messages in thread
From: Ingo Oeser @ 2005-01-14 21:03 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux, Kernel Mailing List

Hi Linus,

Linus Torvalds wrote:
> On Thu, 13 Jan 2005, Ingo Oeser wrote:
> > Hmm, that's a pity, because it makes hardware support more difficult.
> > I thought you might consider an system call, which "wires up" fds.
> I think that the solution to that is to make the pipe _be_ the driver
> interface.

But we currently have no suitable interface for this.

We have socketpair(), we have pipe(), but we don't have
dev_pipe() creating a pipe with the driver sitting in the middle.

Usage: "processing devices" like crypto processors, DSPs, 
 MPEG-Encoding/-Decoding hardware, smart cards and the like.

Synopsis: int dev_pipe(const char *dev_path, int fd_vec[2]);

We also don't have wire_fds(), which would wire up two fds by 
connecting the underlying file pointers with each other and closing the fds.

That would be quite useful, because user space will not see the data anyway
if it travels just between some chips.

Usage: Chips from a chipset, which usually know each other quite well and
 can talk to each other without CPU intervention.

Synopsis: int wire(int in_fd, int out_fd);

> But that doesn't make the pipe an "actor". The pipe just remains a
> standard way to encapsulate the notion of "set of buffers". It needs an
> external actor to do something, but that actor can be a device driver
> filling it up, or a system call that reads it or moves it to another
> destination.

I implemented this some years ago for the "Linux & DSP" project,
but I had to impose ordering restrictions on open() and close() on user space,
and this is not really elegant.

Many hardware vendors have implemented it somehow anyway.

So what about starting to do it seriously and implementing infrastructure for
both of them?

I could also do a simple example driver using both features, if you give me
some evenings of time and are interested in the idea.

Are you interested?


Regards

Ingo Oeser, still dreaming of sys_wire() and sys_dev_pipe()

PS: Syscall names are under discussion of course ;-)



* Re: Make pipe data structure be a circular list of pages, rather
  2005-01-13 21:46   ` Ingo Oeser
@ 2005-01-13 22:32     ` Linus Torvalds
  2005-01-14 21:03       ` Ingo Oeser
  0 siblings, 1 reply; 18+ messages in thread
From: Linus Torvalds @ 2005-01-13 22:32 UTC (permalink / raw)
  To: Ingo Oeser; +Cc: linux, Kernel Mailing List



On Thu, 13 Jan 2005, Ingo Oeser wrote:
> 
> Hmm, that's a pity, because it makes hardware support more difficult.
> 
> I thought you might consider a system call, which "wires up" fds.
> 
> Imagine a device fd, which gets lots of measuring data, wired through a
> DSP pipe, spliced to a realtime display fd and a file storage fd.

I think that the solution to that is to make the pipe _be_ the driver 
interface.

Remember: a pipe is just a set of buffers. If you have a hardware device 
that could use the buffers, then there is nothing to say that the driver 
couldn't be the "actor" that fills the buffers. So doing an "open()" on 
the device would just create a pipe that gets filled by the hardware.

But that doesn't make the pipe an "actor". The pipe just remains a 
standard way to encapsulate the notion of "set of buffers". It needs an 
external actor to do something, but that actor can be a device driver 
filling it up, or a system call that reads it or moves it to another 
destination.

		Linus


* Re: Make pipe data structure be a circular list of pages, rather
  2005-01-08 18:41 ` Linus Torvalds
  2005-01-08 21:47   ` Alan Cox
@ 2005-01-13 21:46   ` Ingo Oeser
  2005-01-13 22:32     ` Linus Torvalds
  1 sibling, 1 reply; 18+ messages in thread
From: Ingo Oeser @ 2005-01-13 21:46 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux, Kernel Mailing List

Hi,

Linus Torvalds wrote:
> None of the pipes would do anything on their own - they aren't "actors".
> They'd be purely buffers. So if you want to have an asynchronous
> long-running "tee()/copy()/peek()", you'd have a thread that does that for
> you.

Hmm, that's a pity, because it makes hardware support more difficult.

I thought you might consider a system call, which "wires up" fds.

Imagine a device fd, which gets lots of measuring data, wired through a
DSP pipe, spliced to a realtime display fd and a file storage fd.

How many buffers would you like to use here? How many unnecessary copies
are made by the CPU?

Any realtime mass data processing needing hardware support could benefit
from "active" pipes, where either end could be drivers "knowing each other"
and deciding to route the IO in hardware instead, avoiding the trip through
the overworked memory bus and CPU.

This kind of "wire me up to my next chip" is needed quite often in the
embedded world, where power matters and special purpose chips or even
DSPs rule. Every chipset vendor ships its own libraries for this right now,
but a more generic approach might be better.

What do you think?

Regards

Ingo Oeser



* Re: Make pipe data structure be a circular list of pages, rather
  2005-01-08 18:41 ` Linus Torvalds
@ 2005-01-08 21:47   ` Alan Cox
  2005-01-13 21:46   ` Ingo Oeser
  1 sibling, 0 replies; 18+ messages in thread
From: Alan Cox @ 2005-01-08 21:47 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux, Linux Kernel Mailing List

On Sad, 2005-01-08 at 18:41, Linus Torvalds wrote:
> For all of those that are valid (certainly the signal case), you need to
> put the pipe into nonblocking mode anyway, so I don't think coalescing is
> really needed. SuS certainly does _not_ seem to guarantee that you can do
> many nonblocking writes (small _or_ big). Let's see if somebody reports 
> any trouble with the new world order.

It breaks Netscape 3 if I simulate it (but that's picking an app I know
makes silly assumptions).

From the rules pasted below I think the 1-byte-at-a-time 4K write is
guaranteed, or at least strongly implied. The single-write atomicity is
guaranteed, but that is, if anything, made easier by the changes.

(From pathconf)

	fpathconf(_PC_PIPE_BUF)

	      returns the size of the pipe buffer, where filedes must refer
	      to a pipe or FIFO and path must refer to a FIFO.  The
	      corresponding macro is _POSIX_PIPE_BUF.

(From pwrite)

Write requests of {PIPE_BUF} bytes or less shall not be interleaved with
data from other processes doing writes on the same pipe. Writes of
greater than {PIPE_BUF} bytes may have data interleaved, on arbitrary
boundaries, with writes by other processes, whether or not the
O_NONBLOCK flag of the file status flags is set.

[of O_NDELAY for pipe]
A write request for {PIPE_BUF} or fewer bytes shall have the following
effect: if there is sufficient space available in the pipe, write()
shall transfer all the data and return the number of bytes requested.
Otherwise, write() shall transfer no data and return -1 with errno set
to [EAGAIN].




* Re: Make pipe data structure be a circular list of pages, rather
  2005-01-08  8:25 linux
@ 2005-01-08 18:41 ` Linus Torvalds
  2005-01-08 21:47   ` Alan Cox
  2005-01-13 21:46   ` Ingo Oeser
  0 siblings, 2 replies; 18+ messages in thread
From: Linus Torvalds @ 2005-01-08 18:41 UTC (permalink / raw)
  To: linux; +Cc: Kernel Mailing List



On Sat, 8 Jan 2005 linux@horizon.com wrote:
> 
> H'm... the version that seemed natural to me was an asymmetrical one-way
> tee, such as "tee(in, out, len, flags)", where the next
> <len> bytes are *both* readable on fd "in" *and* copied to fd "out".

Yes, the implementation may end up being something like that, I haven't
decided. That would be more of a "peek()" kind of thing, not a "tee()",
but it has the advantage of getting rid of one unnecessary complexity of
"tee()" - dependence on three different fd's (in particular, if one of the
output fd's isn't writable, we can't do anything at all).

However, "peek()" has the problem that while it allows you to just run "n" 
peek() calls for the "n" outputs, you now need to keep track of how much 
each output has consumed (since they can be filled at different rates), 
and then appropriately "eat" the proper amount of input when it has been 
consumed by everybody. That's a very error-prone interface.

In contrast, with a plain "tee()", you'd just create a tree "log(n)" 
deep of "tee()" actors, and you'd have "n" outputs from one input, and 
each individual point would be very simple.

NOTE! I don't think either "tee()" or "peek()" really works for very high
fan-out. I don't believe that you'd ever really have a "balanced" enough
output across many things, and since they all end up being synchronized to
the same input (you can make the buffers deeper, which will help a bit,
but..) I don't think it ends up being _that_ useful.

Where I foresee "tee()" is when there is some rate-limited input that you 
can easily handle in two or three streams. That's what the example of a 
digital TV input signal was kind of meant to be - the input is 
rate-limited, which means that the fact that the outputs are 
likely consumed at different paces (saving to disk might be fast, showing 
it on the screen would be real-time) doesn't matter. But such a use ends 
up breaking down when the load is too high - you're likely going to be 
constrained in the number of streams by pure load issues.

(And if you are _not_ constrained by load issues, then you don't need 
tee() at all - you could just have copied the things by hand ;)

> But then I realized that I might be thinking about a completely different
> implementation than you... I was thinking asynchronous, while perhaps
> you were thinking synchronous.

I _am_ thinking entirely synchronous. It would copy exactly what was in
the pipe, and nothing more. It would sleep if either end was full/empty
(modulo O_NDELAY, of course), but it would in no way be a long-term "set up
this copy" mechanism.

None of the pipes would do anything on their own - they aren't "actors".  
They'd be purely buffers. So if you want to have an asynchronous 
long-running "tee()/copy()/peek()", you'd have a thread that does that for 
you.

> With the synchronous model, the two-output tee() call makes more sense, too.
> Still, it would be nice to be able to produce n identical output streams
> without needing n-1 processes to do it.  Any ideas?

I agree, although I don't know how common it would be for "n" to be very
big. And you don't need n-1 processes (or threads), you can certainly do
it with the normal select-loop approach too.

> As for the issue of coalescing:
> > This is the main reason why I want to avoid coalescing if possible: if you
> > never coalesce, then each "pipe_buffer" is complete in itself, and that
> > simplifies locking enormously.
> >
> > (The other reason to potentially avoid coalescing is that I think it might
> > be useful to allow the "sendmsg()/recvmsg()" interfaces that honour packet
> > boundaries. The new pipe code _is_ internally "packetized" after all).
> 
> It is somewhat offensive that the minimum overhead for a 1-byte write
> is a struct pipe_buffer plus a full page.

Note that there is no reason why we couldn't use just "kmalloc()" too.

For the different input cases we'll need per-pipe_buffer operation
functions (very much including a "release()" function) _anyway_, which
means that while the current implementation always just does "alloc_page()
+ __free_page()" for its buffer allocation, there is absolutely zero that
says that you couldn't allocate small buffers with "kmalloc()" and free 
them with "kfree()".

And since we need to support per-buffer ops anyway (for people mixing
"write()" and "splice()" to set up the pipe contents), it works very
naturally for mixed size writes too (ie you can have a small write that is
buffered in a small kmalloc'ed buffer, followed by a big write that uses
the page allocator, and nobody will care - you just decide at buffer fill
time which one you need).

So yes, the current setup is wasteful, but coalescing isn't the only way 
to fight that - you can just use smaller allocations too.

So in my mind, the only reason for doing coalescing is if some silly
program deadlocks if it writes many small writes, and nobody is reading
it. Some people use pipes as buffers _within_ a program, ie use them to
send signals etc (common for signal-safe select handling - you make the
signal handler do a write() to a pipe, and the select loop has the pipe in
its FD-set).

For all of those that are valid (certainly the signal case), you need to
put the pipe into nonblocking mode anyway, so I don't think coalescing is
really needed. SuS certainly does _not_ seem to guarantee that you can do
many nonblocking writes (small _or_ big). Let's see if somebody reports 
any trouble with the new world order.

			Linus


* Re: Make pipe data structure be a circular list of pages, rather
@ 2005-01-08  8:25 linux
  2005-01-08 18:41 ` Linus Torvalds
  0 siblings, 1 reply; 18+ messages in thread
From: linux @ 2005-01-08  8:25 UTC (permalink / raw)
  To: linux-kernel, torvalds

>  - add a "tee(in, out1, out2)" system call that duplicates the pages 
>    (again, incrementing their reference count, not copying the data) from 
>    one pipe to two other pipes.

H'm... the version that seemed natural to me was an asymmetrical one-way
tee, such as "tee(in, out, len, flags)", where the next
<len> bytes are *both* readable on fd "in" *and* copied to fd "out".

You can make it a two-way tee with an additional "splice(in, out2, len)"
call, so you haven't lost expressiveness, and it makes three-way and
higher tees easier to construct.


But then I realized that I might be thinking about a completely different
implementation than you... I was thinking asynchronous, while perhaps
you were thinking synchronous.

A simple example of the difference:

int
main(void)
{
	int dest = open("/dev/null", O_WRONLY);
	FILE *src = popen("/usr/bin/yes", "r");
	splice(fileno(src), dest, SPLICE_INFINITY, 0);
	return 0;
}

Will this process exit without being killed?  I was imagining yes,
it would exit immediately, but perhaps "no" makes more sense.

Ding!  Oh, of course, it couldn't exit, or cleaning up after the following
mess would be far more difficult:

int
main(void)
{
	int fd[2];
	pipe(fd);
	write(fd[1], "Hello, world!\n", 14);
	splice(fd[0], fd[1], SPLICE_INFINITY, 0);
	return 0;
}

With the synchronous model, the two-output tee() call makes more sense, too.
Still, it would be nice to be able to produce n identical output streams
without needing n-1 processes to do it.  Any ideas?  Perhaps

int
tee(int infd, int const *outfds, unsigned noutfds, loff_t size, unsigned flags)

As for the issue of coalescing:
> This is the main reason why I want to avoid coalescing if possible: if you
> never coalesce, then each "pipe_buffer" is complete in itself, and that
> simplifies locking enormously.
>
> (The other reason to potentially avoid coalescing is that I think it might
> be useful to allow the "sendmsg()/recvmsg()" interfaces that honour packet
> boundaries. The new pipe code _is_ internally "packetized" after all).

It is somewhat offensive that the minimum overhead for a 1-byte write
is a struct pipe_buffer plus a full page.

But yes, keeping a pipe_buffer simple is a BIG win.  So how about the
following coalescing strategy, which complicates the reader not at all:

- Each pipe writing fd holds a reference to a page and an offset within
  that page.
- When writing some data, see if the data will fit in the remaining
  portion of the page.
  - If it will not, then dump the page, allocate a fresh one, and set
    the offset to 0.
    - Possible optimization: If the page's refcount is 1 (nobody else
      still has a reference to the page, and we would free it if we
      dropped it), then just recycle it directly.
- Copy the written data (up to a maximum of 1 page) to the current write page.
- Bump the page's reference count (to account for the pipe_buffer pointer) and
  queue an appropriate pipe_buffer.
- Increment the offset by the amount of data written to the page.
- Decrement the amount of data remaining to be written and repeat if necessary.

This allocates one struct pipe_buffer (8 bytes) per write, but not a whole
page.  And it does so by exploiting the exact "we don't care what the rest
of the page is used for" semantics that make the abstraction useful.
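
In code, the write side would be roughly the following (a sketch with
made-up names, assuming the caller has already chopped the write into
chunks of at most PAGE_SIZE):

	#include <linux/mm.h>
	#include <linux/highmem.h>
	#include <linux/pagemap.h>

	/* Per-writing-fd coalescing state. */
	struct pipe_write_state {
		struct page	*page;		/* current partial page, or NULL */
		unsigned int	offset;		/* first free byte in that page */
	};

	static int queue_bytes(struct pipe_write_state *ws,
			       const char *data, unsigned int len)
	{
		char *kaddr;

		if (!ws->page || ws->offset + len > PAGE_SIZE) {
			/* Doesn't fit: dump the old page and start a fresh one.
			 * (Recycling it when we hold the only reference is the
			 * optimization mentioned above.) */
			if (ws->page)
				page_cache_release(ws->page);
			ws->page = alloc_page(GFP_HIGHUSER);
			if (!ws->page)
				return -ENOMEM;
			ws->offset = 0;
		}

		kaddr = kmap(ws->page);
		memcpy(kaddr + ws->offset, data, len);
		kunmap(ws->page);

		/* The queued pipe_buffer holds its own reference to the page. */
		get_page(ws->page);
		queue_pipe_buffer(ws->page, ws->offset, len);	/* hypothetical helper */

		ws->offset += len;
		return len;
	}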

The only waste is that, as written, every pipe writing fd keeps a page
allocated even if the pipe is empty.  Perhaps the vm could be taught to
reclaim those references if necessary?


It's also worth documenting atomicity guarantees and poll/select
semantics.  The above algorithm is careful to always queue the first
PAGE_SIZE bytes of any write atomically, which I believe is what is
historically expected.  It would be possible to have writes larger than
PIPE_BUF fill in the trailing end of any partial page as well.


Thread overview: 18+ messages
2005-01-15 23:42 Make pipe data structure be a circular list of pages, rather linux
2005-01-15 22:55 ` Alan Cox
2005-01-16  0:12   ` Linus Torvalds
2005-01-16  2:02     ` Miquel van Smoorenburg
2005-01-16  2:06     ` Jeff Garzik
  -- strict thread matches above, loose matches on Subject: below --
2005-01-08  8:25 linux
2005-01-08 18:41 ` Linus Torvalds
2005-01-08 21:47   ` Alan Cox
2005-01-13 21:46   ` Ingo Oeser
2005-01-13 22:32     ` Linus Torvalds
2005-01-14 21:03       ` Ingo Oeser
2005-01-14 21:29         ` Linus Torvalds
2005-01-14 22:12           ` Ingo Oeser
2005-01-14 22:44             ` Linus Torvalds
2005-01-14 23:34               ` Ingo Oeser
2005-01-15  0:16                 ` Linus Torvalds
2005-01-16  2:59                   ` Linus Torvalds
2005-01-17 16:03                     ` Ingo Oeser
