All of lore.kernel.org
 help / color / mirror / Atom feed
* [patch 0/3] make splice more generic
@ 2009-05-07 13:37 Miklos Szeredi
  2009-05-07 13:37 ` [patch 1/3] splice: implement pipe to pipe splicing Miklos Szeredi
                   ` (5 more replies)
  0 siblings, 6 replies; 19+ messages in thread
From: Miklos Szeredi @ 2009-05-07 13:37 UTC (permalink / raw)
  To: jens.axboe; +Cc: Max Kellermann, torvalds, linux-fsdevel, linux-kernel

This series makes splice(2) work in more cases:

  - pipe to pipe splicing (zero copy)
  - fallback splice_read which uses readv()
  - fallback splice_write which uses writev()

Just after cleaning up my patches after a vacation I found Max
Kellermann's patch on LKML implementing the first part.  I'll still
post mine, because it's slightly simpler (no ref + unref on the buffer
if not necessary).

One more generalization would be to allow splice to work on two
non-pipes, using an internal intermediate pipe, a-la do_splice_direct().

Comments?

Thanks,
Miklos
--

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [patch 1/3] splice: implement pipe to pipe splicing
  2009-05-07 13:37 [patch 0/3] make splice more generic Miklos Szeredi
@ 2009-05-07 13:37 ` Miklos Szeredi
  2009-05-07 13:37 ` [patch 2/3] splice: implement default splice_read method Miklos Szeredi
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 19+ messages in thread
From: Miklos Szeredi @ 2009-05-07 13:37 UTC (permalink / raw)
  To: jens.axboe; +Cc: Max Kellermann, torvalds, linux-fsdevel, linux-kernel

[-- Attachment #1: splice_pipe_to_pipe.patch --]
[-- Type: text/plain, Size: 7469 bytes --]

Allow splice(2) to work when both the input and the output is a pipe.

Based on the impementation of the tee(2) syscall, but instead of
duplicating the buffer references move the buffers from the input pipe
to the output pipe.

Moving the whole buffer only succeeds if the full length of the buffer
is spliced.  Otherwise duplicate the buffer, just like tee(2), set the
length of the output buffer and advance the offset on the input
buffer.

Since splice is operating on two pipes, special care needs to be taken
with locking to prevent AN ABBA deadlock.  Again this is done
similarly to the tee(2) syscall, first preparing the input and output
pipes so there's data to consume and space for that data, and then
doing the move operation while holding both locks.

If other processes are doing I/O on the same pipes parallel to the
splice, then by the time both inodes are locked there might be no
buffers left to move, or no space to move them to.  In this case retry
the whole operation, including the preparation phase.  This could lead
to starvation, but I'm not sure if that's serious enough to worry
about.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---
 fs/splice.c |  162 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 151 insertions(+), 11 deletions(-)

Index: linux-2.6/fs/splice.c
===================================================================
--- linux-2.6.orig/fs/splice.c	2009-05-07 14:40:01.000000000 +0200
+++ linux-2.6/fs/splice.c	2009-05-07 14:40:06.000000000 +0200
@@ -1112,6 +1112,9 @@ long do_splice_direct(struct file *in, l
 	return ret;
 }
 
+static int splice_pipe_to_pipe(struct pipe_inode_info *ipipe,
+			       struct pipe_inode_info *opipe,
+			       size_t len, unsigned int flags);
 /*
  * After the inode slimming patch, i_pipe/i_bdev/i_cdev share the same
  * location, so checking ->i_pipe is not enough to verify that this is a
@@ -1132,12 +1135,32 @@ static long do_splice(struct file *in, l
 		      struct file *out, loff_t __user *off_out,
 		      size_t len, unsigned int flags)
 {
-	struct pipe_inode_info *pipe;
+	struct pipe_inode_info *ipipe;
+	struct pipe_inode_info *opipe;
 	loff_t offset, *off;
 	long ret;
 
-	pipe = pipe_info(in->f_path.dentry->d_inode);
-	if (pipe) {
+	ipipe = pipe_info(in->f_path.dentry->d_inode);
+	opipe = pipe_info(out->f_path.dentry->d_inode);
+
+	if (ipipe && opipe) {
+		if (off_in || off_out)
+			return -ESPIPE;
+
+		if (!(in->f_mode & FMODE_READ))
+			return -EBADF;
+
+		if (!(out->f_mode & FMODE_WRITE))
+			return -EBADF;
+
+		/* Splicing to self would be fun, but... */
+		if (ipipe == opipe)
+			return -EINVAL;
+
+		return splice_pipe_to_pipe(ipipe, opipe, len, flags);
+	}
+
+	if (ipipe) {
 		if (off_in)
 			return -ESPIPE;
 		if (off_out) {
@@ -1149,7 +1172,7 @@ static long do_splice(struct file *in, l
 		} else
 			off = &out->f_pos;
 
-		ret = do_splice_from(pipe, out, off, len, flags);
+		ret = do_splice_from(ipipe, out, off, len, flags);
 
 		if (off_out && copy_to_user(off_out, off, sizeof(loff_t)))
 			ret = -EFAULT;
@@ -1157,8 +1180,7 @@ static long do_splice(struct file *in, l
 		return ret;
 	}
 
-	pipe = pipe_info(out->f_path.dentry->d_inode);
-	if (pipe) {
+	if (opipe) {
 		if (off_out)
 			return -ESPIPE;
 		if (off_in) {
@@ -1170,7 +1192,7 @@ static long do_splice(struct file *in, l
 		} else
 			off = &in->f_pos;
 
-		ret = do_splice_to(in, off, pipe, len, flags);
+		ret = do_splice_to(in, off, opipe, len, flags);
 
 		if (off_in && copy_to_user(off_in, off, sizeof(loff_t)))
 			ret = -EFAULT;
@@ -1511,7 +1533,7 @@ SYSCALL_DEFINE6(splice, int, fd_in, loff
  * Make sure there's data to read. Wait for input if we can, otherwise
  * return an appropriate error.
  */
-static int link_ipipe_prep(struct pipe_inode_info *pipe, unsigned int flags)
+static int ipipe_prep(struct pipe_inode_info *pipe, unsigned int flags)
 {
 	int ret;
 
@@ -1549,7 +1571,7 @@ static int link_ipipe_prep(struct pipe_i
  * Make sure there's writeable room. Wait for room if we can, otherwise
  * return an appropriate error.
  */
-static int link_opipe_prep(struct pipe_inode_info *pipe, unsigned int flags)
+static int opipe_prep(struct pipe_inode_info *pipe, unsigned int flags)
 {
 	int ret;
 
@@ -1587,6 +1609,124 @@ static int link_opipe_prep(struct pipe_i
 }
 
 /*
+ * Splice contents of ipipe to opipe.
+ */
+static int splice_pipe_to_pipe(struct pipe_inode_info *ipipe,
+			       struct pipe_inode_info *opipe,
+			       size_t len, unsigned int flags)
+{
+	struct pipe_buffer *ibuf, *obuf;
+	int ret = 0, nbuf;
+	bool input_wakeup = false;
+
+
+retry:
+	ret = ipipe_prep(ipipe, flags);
+	if (ret)
+		return ret;
+
+	ret = opipe_prep(opipe, flags);
+	if (ret)
+		return ret;
+
+	/*
+	 * Potential ABBA deadlock, work around it by ordering lock
+	 * grabbing by pipe info address. Otherwise two different processes
+	 * could deadlock (one doing tee from A -> B, the other from B -> A).
+	 */
+	pipe_double_lock(ipipe, opipe);
+
+	do {
+		if (!opipe->readers) {
+			send_sig(SIGPIPE, current, 0);
+			if (!ret)
+				ret = -EPIPE;
+			break;
+		}
+
+		if (!ipipe->nrbufs && !ipipe->writers)
+			break;
+
+		/*
+		 * Cannot make any progress, because either the input
+		 * pipe is empty or the output pipe is full.
+		 */
+		if (!ipipe->nrbufs || opipe->nrbufs >= PIPE_BUFFERS) {
+			/* Already processed some buffers, break */
+			if (ret)
+				break;
+
+			if (flags & SPLICE_F_NONBLOCK) {
+				ret = -EAGAIN;
+				break;
+			}
+
+			/*
+			 * We raced with another reader/writer and haven't
+			 * managed to process any buffers.  A zero return
+			 * value means EOF, so retry instead.
+			 */
+			pipe_unlock(ipipe);
+			pipe_unlock(opipe);
+			goto retry;
+		}
+
+		ibuf = ipipe->bufs + ipipe->curbuf;
+		nbuf = (opipe->curbuf + opipe->nrbufs) % PIPE_BUFFERS;
+		obuf = opipe->bufs + nbuf;
+
+		if (len >= ibuf->len) {
+			/*
+			 * Simply move the whole buffer from ipipe to opipe
+			 */
+			*obuf = *ibuf;
+			ibuf->ops = NULL;
+			opipe->nrbufs++;
+			ipipe->curbuf = (ipipe->curbuf + 1) % PIPE_BUFFERS;
+			ipipe->nrbufs--;
+			input_wakeup = true;
+		} else {
+			/*
+			 * Get a reference to this pipe buffer,
+			 * so we can copy the contents over.
+			 */
+			ibuf->ops->get(ipipe, ibuf);
+			*obuf = *ibuf;
+
+			/*
+			 * Don't inherit the gift flag, we need to
+			 * prevent multiple steals of this page.
+			 */
+			obuf->flags &= ~PIPE_BUF_FLAG_GIFT;
+
+			obuf->len = len;
+			opipe->nrbufs++;
+			ibuf->offset += obuf->len;
+			ibuf->len -= obuf->len;
+		}
+		ret += obuf->len;
+		len -= obuf->len;
+	} while (len);
+
+	pipe_unlock(ipipe);
+	pipe_unlock(opipe);
+
+	/*
+	 * If we put data in the output pipe, wakeup any potential readers.
+	 */
+	if (ret > 0) {
+		smp_mb();
+		if (waitqueue_active(&opipe->wait))
+			wake_up_interruptible(&opipe->wait);
+		kill_fasync(&opipe->fasync_readers, SIGIO, POLL_IN);
+	}
+	if (input_wakeup)
+		wakeup_pipe_writers(ipipe);
+
+	return ret;
+}
+
+/*
  * Link contents of ipipe to opipe.
  */
 static int link_pipe(struct pipe_inode_info *ipipe,
@@ -1690,9 +1830,9 @@ static long do_tee(struct file *in, stru
 		 * Keep going, unless we encounter an error. The ipipe/opipe
 		 * ordering doesn't really matter.
 		 */
-		ret = link_ipipe_prep(ipipe, flags);
+		ret = ipipe_prep(ipipe, flags);
 		if (!ret) {
-			ret = link_opipe_prep(opipe, flags);
+			ret = opipe_prep(opipe, flags);
 			if (!ret)
 				ret = link_pipe(ipipe, opipe, len, flags);
 		}

--

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [patch 2/3] splice: implement default splice_read method
  2009-05-07 13:37 [patch 0/3] make splice more generic Miklos Szeredi
  2009-05-07 13:37 ` [patch 1/3] splice: implement pipe to pipe splicing Miklos Szeredi
@ 2009-05-07 13:37 ` Miklos Szeredi
  2009-05-13  5:35   ` Andrew Morton
  2009-05-07 13:37 ` [patch 3/3] splice: implement default splice_write method Miklos Szeredi
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 19+ messages in thread
From: Miklos Szeredi @ 2009-05-07 13:37 UTC (permalink / raw)
  To: jens.axboe; +Cc: Max Kellermann, torvalds, linux-fsdevel, linux-kernel

[-- Attachment #1: splice_read_fallback.patch --]
[-- Type: text/plain, Size: 9728 bytes --]

If f_op->splice_read() is not implemented, fall back to a plain read.
Use vfs_readv() to read into previously allocated pages.

This will allow splice and functions using splice, such as the loop
device, to work on all filesystems.  This includes "direct_io" files
in fuse which bypass the page cache.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---
 drivers/block/loop.c      |   11 ----
 fs/coda/file.c            |    9 ++-
 fs/pipe.c                 |   14 +++++
 fs/read_write.c           |    7 --
 fs/splice.c               |  120 ++++++++++++++++++++++++++++++++++++++++++++--
 include/linux/fs.h        |    2 
 include/linux/pipe_fs_i.h |    1 
 7 files changed, 140 insertions(+), 24 deletions(-)

Index: linux-2.6/fs/pipe.c
===================================================================
--- linux-2.6.orig/fs/pipe.c	2009-05-07 14:40:01.000000000 +0200
+++ linux-2.6/fs/pipe.c	2009-05-07 14:40:06.000000000 +0200
@@ -302,6 +302,20 @@ int generic_pipe_buf_confirm(struct pipe
 	return 0;
 }
 
+/**
+ * generic_pipe_buf_release - put a reference to a &struct pipe_buffer
+ * @pipe:	the pipe that the buffer belongs to
+ * @buf:	the buffer to put a reference to
+ *
+ * Description:
+ *	This function releases a reference to @buf.
+ */
+void generic_pipe_buf_release(struct pipe_inode_info *pipe,
+			      struct pipe_buffer *buf)
+{
+	page_cache_release(buf->page);
+}
+
 static const struct pipe_buf_operations anon_pipe_buf_ops = {
 	.can_merge = 1,
 	.map = generic_pipe_buf_map,
Index: linux-2.6/fs/splice.c
===================================================================
--- linux-2.6.orig/fs/splice.c	2009-05-07 14:40:06.000000000 +0200
+++ linux-2.6/fs/splice.c	2009-05-07 14:40:06.000000000 +0200
@@ -507,9 +507,116 @@ ssize_t generic_file_splice_read(struct
 
 	return ret;
 }
-
 EXPORT_SYMBOL(generic_file_splice_read);
 
+static const struct pipe_buf_operations default_pipe_buf_ops = {
+	.can_merge = 0,
+	.map = generic_pipe_buf_map,
+	.unmap = generic_pipe_buf_unmap,
+	.confirm = generic_pipe_buf_confirm,
+	.release = generic_pipe_buf_release,
+	.steal = generic_pipe_buf_steal,
+	.get = generic_pipe_buf_get,
+};
+
+static ssize_t kernel_readv(struct file *file, const struct iovec *vec,
+			    unsigned long vlen, loff_t offset)
+{
+	mm_segment_t old_fs;
+	loff_t pos = offset;
+	ssize_t res;
+
+	old_fs = get_fs();
+	set_fs(get_ds());
+	/* The cast to a user pointer is valid due to the set_fs() */
+	res = vfs_readv(file, (const struct iovec __user *)vec, vlen, &pos);
+	set_fs(old_fs);
+
+	return res;
+}
+
+ssize_t default_file_splice_read(struct file *in, loff_t *ppos,
+				 struct pipe_inode_info *pipe, size_t len,
+				 unsigned int flags)
+{
+	unsigned int nr_pages;
+	unsigned int nr_freed;
+	size_t offset;
+	struct page *pages[PIPE_BUFFERS];
+	struct partial_page partial[PIPE_BUFFERS];
+	struct iovec vec[PIPE_BUFFERS];
+	pgoff_t index;
+	ssize_t res;
+	size_t this_len;
+	int error;
+	int i;
+	struct splice_pipe_desc spd = {
+		.pages = pages,
+		.partial = partial,
+		.flags = flags,
+		.ops = &default_pipe_buf_ops,
+		.spd_release = spd_release_page,
+	};
+
+	index = *ppos >> PAGE_CACHE_SHIFT;
+	offset = *ppos & ~PAGE_CACHE_MASK;
+	nr_pages = (len + offset + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
+
+	for (i = 0; i < nr_pages && i < PIPE_BUFFERS && len; i++) {
+		struct page *page;
+
+		page = alloc_page(GFP_HIGHUSER);
+		error = -ENOMEM;
+		if (!page)
+			goto err;
+
+		this_len = min_t(size_t, len, PAGE_CACHE_SIZE - offset);
+		vec[i].iov_base = (void __user *) kmap(page);
+		vec[i].iov_len = this_len;
+		pages[i] = page;
+		spd.nr_pages++;
+		len -= this_len;
+		offset = 0;
+	}
+
+	res = kernel_readv(in, vec, spd.nr_pages, *ppos);
+	if (res < 0)
+		goto err;
+
+	error = 0;
+	if (!res)
+		goto err;
+
+	nr_freed = 0;
+	for (i = 0; i < spd.nr_pages; i++) {
+		kunmap(pages[i]);
+		this_len = min_t(size_t, vec[i].iov_len, res);
+		partial[i].offset = 0;
+		partial[i].len = this_len;
+		if (!this_len) {
+			__free_page(pages[i]);
+			pages[i] = NULL;
+			nr_freed++;
+		}
+		res -= this_len;
+	}
+	spd.nr_pages -= nr_freed;
+
+	res = splice_to_pipe(pipe, &spd);
+	if (res > 0)
+		*ppos += res;
+
+	return res;
+
+err:
+	for (i = 0; i < spd.nr_pages; i++) {
+		kunmap(pages[i]);
+		__free_page(pages[i]);
+	}
+	return error;
+}
+EXPORT_SYMBOL(default_file_splice_read);
+
 /*
  * Send 'sd->len' bytes to socket from 'sd->file' at position 'sd->pos'
  * using sendpage(). Return the number of bytes sent.
@@ -933,11 +1040,10 @@ static long do_splice_to(struct file *in
 			 struct pipe_inode_info *pipe, size_t len,
 			 unsigned int flags)
 {
+	ssize_t (*splice_read)(struct file *, loff_t *,
+			       struct pipe_inode_info *, size_t, unsigned int);
 	int ret;
 
-	if (unlikely(!in->f_op || !in->f_op->splice_read))
-		return -EINVAL;
-
 	if (unlikely(!(in->f_mode & FMODE_READ)))
 		return -EBADF;
 
@@ -945,7 +1051,11 @@ static long do_splice_to(struct file *in
 	if (unlikely(ret < 0))
 		return ret;
 
-	return in->f_op->splice_read(in, ppos, pipe, len, flags);
+	splice_read = in->f_op->splice_read;
+	if (!splice_read)
+		splice_read = default_file_splice_read;
+
+	return splice_read(in, ppos, pipe, len, flags);
 }
 
 /**
Index: linux-2.6/include/linux/pipe_fs_i.h
===================================================================
--- linux-2.6.orig/include/linux/pipe_fs_i.h	2009-05-07 14:40:01.000000000 +0200
+++ linux-2.6/include/linux/pipe_fs_i.h	2009-05-07 14:40:06.000000000 +0200
@@ -152,5 +152,6 @@ void generic_pipe_buf_unmap(struct pipe_
 void generic_pipe_buf_get(struct pipe_inode_info *, struct pipe_buffer *);
 int generic_pipe_buf_confirm(struct pipe_inode_info *, struct pipe_buffer *);
 int generic_pipe_buf_steal(struct pipe_inode_info *, struct pipe_buffer *);
+void generic_pipe_buf_release(struct pipe_inode_info *, struct pipe_buffer *);
 
 #endif
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h	2009-05-07 14:40:01.000000000 +0200
+++ linux-2.6/include/linux/fs.h	2009-05-07 14:40:06.000000000 +0200
@@ -2204,6 +2204,8 @@ extern int generic_segment_checks(const
 /* fs/splice.c */
 extern ssize_t generic_file_splice_read(struct file *, loff_t *,
 		struct pipe_inode_info *, size_t, unsigned int);
+extern ssize_t default_file_splice_read(struct file *, loff_t *,
+		struct pipe_inode_info *, size_t, unsigned int);
 extern ssize_t generic_file_splice_write(struct pipe_inode_info *,
 		struct file *, loff_t *, size_t, unsigned int);
 extern ssize_t generic_splice_sendpage(struct pipe_inode_info *pipe,
Index: linux-2.6/drivers/block/loop.c
===================================================================
--- linux-2.6.orig/drivers/block/loop.c	2009-05-07 14:40:01.000000000 +0200
+++ linux-2.6/drivers/block/loop.c	2009-05-07 14:40:06.000000000 +0200
@@ -721,10 +721,6 @@ static int loop_change_fd(struct loop_de
 	if (!S_ISREG(inode->i_mode) && !S_ISBLK(inode->i_mode))
 		goto out_putf;
 
-	/* new backing store needs to support loop (eg splice_read) */
-	if (!inode->i_fop->splice_read)
-		goto out_putf;
-
 	/* size of the new backing store needs to be the same */
 	if (get_loop_size(lo, file) != get_loop_size(lo, old_file))
 		goto out_putf;
@@ -800,12 +796,7 @@ static int loop_set_fd(struct loop_devic
 	error = -EINVAL;
 	if (S_ISREG(inode->i_mode) || S_ISBLK(inode->i_mode)) {
 		const struct address_space_operations *aops = mapping->a_ops;
-		/*
-		 * If we can't read - sorry. If we only can't write - well,
-		 * it's going to be read-only.
-		 */
-		if (!file->f_op->splice_read)
-			goto out_putf;
+
 		if (aops->write_begin)
 			lo_flags |= LO_FLAGS_USE_AOPS;
 		if (!(lo_flags & LO_FLAGS_USE_AOPS) && !file->f_op->write)
Index: linux-2.6/fs/coda/file.c
===================================================================
--- linux-2.6.orig/fs/coda/file.c	2009-05-07 14:40:01.000000000 +0200
+++ linux-2.6/fs/coda/file.c	2009-05-07 14:40:06.000000000 +0200
@@ -47,6 +47,8 @@ coda_file_splice_read(struct file *coda_
 		      struct pipe_inode_info *pipe, size_t count,
 		      unsigned int flags)
 {
+	ssize_t (*splice_read)(struct file *, loff_t *,
+			       struct pipe_inode_info *, size_t, unsigned int);
 	struct coda_file_info *cfi;
 	struct file *host_file;
 
@@ -54,10 +56,11 @@ coda_file_splice_read(struct file *coda_
 	BUG_ON(!cfi || cfi->cfi_magic != CODA_MAGIC);
 	host_file = cfi->cfi_container;
 
-	if (!host_file->f_op || !host_file->f_op->splice_read)
-		return -EINVAL;
+	splice_read = host_file->f_op->splice_read;
+	if (!splice_read)
+		splice_read = default_file_splice_read;
 
-	return host_file->f_op->splice_read(host_file, ppos, pipe, count,flags);
+	return splice_read(host_file, ppos, pipe, count, flags);
 }
 
 static ssize_t
Index: linux-2.6/fs/read_write.c
===================================================================
--- linux-2.6.orig/fs/read_write.c	2009-05-07 14:40:01.000000000 +0200
+++ linux-2.6/fs/read_write.c	2009-05-07 14:40:06.000000000 +0200
@@ -805,12 +805,6 @@ static ssize_t do_sendfile(int out_fd, i
 		goto out;
 	if (!(in_file->f_mode & FMODE_READ))
 		goto fput_in;
-	retval = -EINVAL;
-	in_inode = in_file->f_path.dentry->d_inode;
-	if (!in_inode)
-		goto fput_in;
-	if (!in_file->f_op || !in_file->f_op->splice_read)
-		goto fput_in;
 	retval = -ESPIPE;
 	if (!ppos)
 		ppos = &in_file->f_pos;
@@ -834,6 +828,7 @@ static ssize_t do_sendfile(int out_fd, i
 	retval = -EINVAL;
 	if (!out_file->f_op || !out_file->f_op->sendpage)
 		goto fput_out;
+	in_inode = in_file->f_path.dentry->d_inode;
 	out_inode = out_file->f_path.dentry->d_inode;
 	retval = rw_verify_area(WRITE, out_file, &out_file->f_pos, count);
 	if (retval < 0)

--

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [patch 3/3] splice: implement default splice_write method
  2009-05-07 13:37 [patch 0/3] make splice more generic Miklos Szeredi
  2009-05-07 13:37 ` [patch 1/3] splice: implement pipe to pipe splicing Miklos Szeredi
  2009-05-07 13:37 ` [patch 2/3] splice: implement default splice_read method Miklos Szeredi
@ 2009-05-07 13:37 ` Miklos Szeredi
  2009-05-07 15:55 ` [patch 0/3] make splice more generic Linus Torvalds
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 19+ messages in thread
From: Miklos Szeredi @ 2009-05-07 13:37 UTC (permalink / raw)
  To: jens.axboe; +Cc: Max Kellermann, torvalds, linux-fsdevel, linux-kernel

[-- Attachment #1: splice_write_fallback.patch --]
[-- Type: text/plain, Size: 4649 bytes --]

If f_op->splice_write() is not implemented, fall back to a plain write.
Use vfs_writev() to write from the pipe buffers.

This will allow splice on all filesystems and file types.  This
includes "direct_io" files in fuse which bypass the page cache.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---
 fs/splice.c |  142 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 138 insertions(+), 4 deletions(-)

Index: linux-2.6/fs/splice.c
===================================================================
--- linux-2.6.orig/fs/splice.c	2009-05-07 14:40:06.000000000 +0200
+++ linux-2.6/fs/splice.c	2009-05-07 14:40:07.000000000 +0200
@@ -535,6 +535,21 @@ static ssize_t kernel_readv(struct file
 	return res;
 }
 
+static ssize_t kernel_writev(struct file *file, const struct iovec *vec,
+			    unsigned long vlen, loff_t *ppos)
+{
+	mm_segment_t old_fs;
+	ssize_t res;
+
+	old_fs = get_fs();
+	set_fs(get_ds());
+	/* The cast to a user pointer is valid due to the set_fs() */
+	res = vfs_writev(file, (const struct iovec __user *)vec, vlen, ppos);
+	set_fs(old_fs);
+
+	return res;
+}
+
 ssize_t default_file_splice_read(struct file *in, loff_t *ppos,
 				 struct pipe_inode_info *pipe, size_t len,
 				 unsigned int flags)
@@ -988,6 +1003,122 @@ generic_file_splice_write(struct pipe_in
 
 EXPORT_SYMBOL(generic_file_splice_write);
 
+static struct pipe_buffer *nth_pipe_buf(struct pipe_inode_info *pipe, int n)
+{
+	return &pipe->bufs[(pipe->curbuf + n) % PIPE_BUFFERS];
+}
+
+static ssize_t default_file_splice_write(struct pipe_inode_info *pipe,
+					 struct file *out, loff_t *ppos,
+					 size_t len, unsigned int flags)
+{
+	ssize_t ret = 0;
+	ssize_t total_len = 0;
+	int do_wakeup = 0;
+
+	pipe_lock(pipe);
+	while (len) {
+		struct pipe_buffer *buf;
+		void *data[PIPE_BUFFERS];
+		struct iovec vec[PIPE_BUFFERS];
+		unsigned int nr_pages = 0;
+		unsigned int write_len = 0;
+		unsigned int now_len = len;
+		unsigned int this_len;
+		int i;
+
+		BUG_ON(pipe->nrbufs > PIPE_BUFFERS);
+		for (i = 0; i < pipe->nrbufs && now_len; i++) {
+			buf = nth_pipe_buf(pipe, i);
+
+			ret = buf->ops->confirm(pipe, buf);
+			if (ret)
+				break;
+
+			data[i] = buf->ops->map(pipe, buf, 0);
+			this_len = min(buf->len, now_len);
+			vec[i].iov_base = (void __user *) data[i] + buf->offset;
+			vec[i].iov_len = this_len;
+			now_len -= this_len;
+			write_len += this_len;
+			nr_pages++;
+		}
+
+		if (nr_pages) {
+			ret = kernel_writev(out, vec, nr_pages, ppos);
+			if (ret == 0)
+				ret = -EIO;
+			if (ret > 0) {
+				len -= ret;
+				total_len += ret;
+			}
+		}
+
+		for (i = 0; i < nr_pages; i++) {
+			buf = nth_pipe_buf(pipe, i);
+			buf->ops->unmap(pipe, buf, data[i]);
+
+			if (ret > 0) {
+				this_len = min_t(unsigned, vec[i].iov_len, ret);
+				buf->offset += this_len;
+				buf->len -= this_len;
+				ret -= this_len;
+			}
+		}
+
+		if (ret < 0)
+			break;
+
+		while (pipe->nrbufs) {
+			const struct pipe_buf_operations *ops;
+
+			buf = nth_pipe_buf(pipe, 0);
+			if (buf->len)
+				break;
+
+			ops = buf->ops;
+			buf->ops = NULL;
+			ops->release(pipe, buf);
+			pipe->curbuf = (pipe->curbuf + 1) % PIPE_BUFFERS;
+			pipe->nrbufs--;
+			if (pipe->inode)
+				do_wakeup = 1;
+		}
+
+		if (pipe->nrbufs)
+			continue;
+		if (!pipe->writers)
+			break;
+		if (!pipe->waiting_writers) {
+			if (total_len)
+				break;
+		}
+
+		if (flags & SPLICE_F_NONBLOCK) {
+			ret = -EAGAIN;
+			break;
+		}
+
+		if (signal_pending(current)) {
+			ret = -ERESTARTSYS;
+			break;
+		}
+
+		if (do_wakeup) {
+			wakeup_pipe_writers(pipe);
+			do_wakeup = 0;
+		}
+
+		pipe_wait(pipe);
+	}
+	pipe_unlock(pipe);
+
+	if (do_wakeup)
+		wakeup_pipe_writers(pipe);
+
+	return total_len ? total_len : ret;
+}
+
 /**
  * generic_splice_sendpage - splice data from a pipe to a socket
  * @pipe:	pipe to splice from
@@ -1015,11 +1146,10 @@ EXPORT_SYMBOL(generic_splice_sendpage);
 static long do_splice_from(struct pipe_inode_info *pipe, struct file *out,
 			   loff_t *ppos, size_t len, unsigned int flags)
 {
+	ssize_t (*splice_write)(struct pipe_inode_info *, struct file *,
+				loff_t *, size_t, unsigned int);
 	int ret;
 
-	if (unlikely(!out->f_op || !out->f_op->splice_write))
-		return -EINVAL;
-
 	if (unlikely(!(out->f_mode & FMODE_WRITE)))
 		return -EBADF;
 
@@ -1030,7 +1160,11 @@ static long do_splice_from(struct pipe_i
 	if (unlikely(ret < 0))
 		return ret;
 
-	return out->f_op->splice_write(pipe, out, ppos, len, flags);
+	splice_write = out->f_op->splice_write;
+	if (!splice_write)
+		splice_write = default_file_splice_write;
+
+	return splice_write(pipe, out, ppos, len, flags);
 }
 
 /*

--

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [patch 0/3] make splice more generic
  2009-05-07 13:37 [patch 0/3] make splice more generic Miklos Szeredi
                   ` (2 preceding siblings ...)
  2009-05-07 13:37 ` [patch 3/3] splice: implement default splice_write method Miklos Szeredi
@ 2009-05-07 15:55 ` Linus Torvalds
  2009-05-11 15:17   ` Miklos Szeredi
  2009-05-09 11:36 ` Max Kellermann
  2009-05-11 12:12 ` Jens Axboe
  5 siblings, 1 reply; 19+ messages in thread
From: Linus Torvalds @ 2009-05-07 15:55 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: jens.axboe, Max Kellermann, linux-fsdevel, linux-kernel



On Thu, 7 May 2009, Miklos Szeredi wrote:
> 
> One more generalization would be to allow splice to work on two
> non-pipes, using an internal intermediate pipe, a-la do_splice_direct().

You can't do that without some painful issues.

Or rather, you can only do it trivially for the one case where we 
_already_ do that, namely "sendfile()". That's exactly what sendfile() is 
now.

What is so painful about it in general?

Reading from a source may _destroy_ that data, and you may not be able to 
push it back to the source. And what happens if the destination cannot 
take it?

Now, we could do a totally blocking version that simply refuses to return 
until the destination has taken all the splice data, and maybe it would be 
worth it as a simplified interface. But it does sound like a really ripe 
place for deadlocks etc (set up some trivial circular thing of everybody 
trying to pipe to each other, and all of them getting blocked on the 
receiver, and now they are unkillable).

Now, the reason it works for sendfile() is that when the source is known 
to be in the page cache, then if the receiver doesn't take the data, we 
know we can just drop it. But what if the source is some character device 
driver? We can't just drop the data on a signal.

So the reason splice does that pipe buffer is that the pipe itself then 
acts as a long-term buffer _across_ the kernel returning to user space, 
and thus allows the whole process to be interruptible.

That said, maybe we could allow it in a few more cases. Or maybe people 
think the simplification in user interfaces is worth making the IO be 
non-interruptible (but still killable, for example, at which point the 
buffered data really is dropped - but that's not different from having 
the buffers in user space, so at that point it's ok).

So I'm certainly willing to be convinced that it's a good idea after all, 
it just worries me, and I wanted to point out the painful issues that 
caused me to not allow it in general.

			Linus

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [patch 0/3] make splice more generic
  2009-05-07 13:37 [patch 0/3] make splice more generic Miklos Szeredi
                   ` (3 preceding siblings ...)
  2009-05-07 15:55 ` [patch 0/3] make splice more generic Linus Torvalds
@ 2009-05-09 11:36 ` Max Kellermann
  2009-05-11 12:12 ` Jens Axboe
  5 siblings, 0 replies; 19+ messages in thread
From: Max Kellermann @ 2009-05-09 11:36 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: jens.axboe, torvalds, linux-fsdevel, linux-kernel

On 2009/05/07 15:37, Miklos Szeredi <miklos@szeredi.hu> wrote:
> Just after cleaning up my patches after a vacation I found Max
> Kellermann's patch on LKML implementing the first part.  I'll still
> post mine, because it's slightly simpler (no ref + unref on the buffer
> if not necessary).

I didn't do that initially, but then I wasn't sure if that is actually
valid: the get() method requires a pointer to the pipe and the buffer,
not just the underlying page.  I chose the safe path, because I don't
know that API very well.

Max

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [patch 0/3] make splice more generic
  2009-05-07 13:37 [patch 0/3] make splice more generic Miklos Szeredi
                   ` (4 preceding siblings ...)
  2009-05-09 11:36 ` Max Kellermann
@ 2009-05-11 12:12 ` Jens Axboe
  2009-05-11 15:22   ` Miklos Szeredi
  2009-05-14 20:27   ` Jamie Lokier
  5 siblings, 2 replies; 19+ messages in thread
From: Jens Axboe @ 2009-05-11 12:12 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Max Kellermann, torvalds, linux-fsdevel, linux-kernel

On Thu, May 07 2009, Miklos Szeredi wrote:
> This series makes splice(2) work in more cases:
> 
>   - pipe to pipe splicing (zero copy)
>   - fallback splice_read which uses readv()
>   - fallback splice_write which uses writev()
> 
> Just after cleaning up my patches after a vacation I found Max
> Kellermann's patch on LKML implementing the first part.  I'll still
> post mine, because it's slightly simpler (no ref + unref on the buffer
> if not necessary).

I have applied all three, thanks! However, I think we should be able to
pass in whether or not this is a strict splice or not. Falling back to
readv/writev is a good thing as it may help get the interface adopted
more widely, but I can also easily imagine cases where you'd want to
make sure that splice actually works without copies. It may even just be
for users to retry with alternative code paths, instead of proceeding
with the splice. SPLICE_F_MOVE is a soft flag in that it will move if it
can, but not fail if it can't. Perhaps we should add a flag that does
pass back an error if we can't just move pages around, SPLICE_F_STRICT
or something like that.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [patch 0/3] make splice more generic
  2009-05-07 15:55 ` [patch 0/3] make splice more generic Linus Torvalds
@ 2009-05-11 15:17   ` Miklos Szeredi
  0 siblings, 0 replies; 19+ messages in thread
From: Miklos Szeredi @ 2009-05-11 15:17 UTC (permalink / raw)
  To: torvalds; +Cc: miklos, jens.axboe, max, linux-fsdevel, linux-kernel

On Thu, 7 May 2009, Linus Torvalds wrote:
> On Thu, 7 May 2009, Miklos Szeredi wrote:
> > 
> > One more generalization would be to allow splice to work on two
> > non-pipes, using an internal intermediate pipe, a-la do_splice_direct().
> 
> You can't do that without some painful issues.
> 
> Or rather, you can only do it trivially for the one case where we 
> _already_ do that, namely "sendfile()". That's exactly what sendfile() is 
> now.

Hmm yeah, almost exactly: sendfile doesn't have a "destination offset"
argument.

> That said, maybe we could allow it in a few more cases. Or maybe people 
> think the simplification in user interfaces is worth making the IO be 
> non-interruptible (but still killable, for example, at which point the 
> buffered data really is dropped - but that's not different from having 
> the buffers in user space, so at that point it's ok).

That's a hard decision.  I think it's better to leave it in its
current form.  That means more complexity in the apps, but less
unexpected nastiness from being non-interruptible.

One more interesting use of splice() would be for things like speeding
up "cp" on network filesystems and similarly for fs which can
refcount/COW data blocks.  Seems like BTRFS already has an ioctl for
this, wouldn't it be nice to do it with splice()?

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [patch 0/3] make splice more generic
  2009-05-11 12:12 ` Jens Axboe
@ 2009-05-11 15:22   ` Miklos Szeredi
  2009-05-14 20:27   ` Jamie Lokier
  1 sibling, 0 replies; 19+ messages in thread
From: Miklos Szeredi @ 2009-05-11 15:22 UTC (permalink / raw)
  To: jens.axboe; +Cc: miklos, max, torvalds, linux-fsdevel, linux-kernel

On Mon, 11 May 2009, Jens Axboe wrote:
> I have applied all three, thanks! However, I think we should be able to
> pass in whether or not this is a strict splice or not. Falling back to
> readv/writev is a good thing as it may help get the interface adopted
> more widely, but I can also easily imagine cases where you'd want to
> make sure that splice actually works without copies. It may even just be
> for users to retry with alternative code paths, instead of proceeding
> with the splice. SPLICE_F_MOVE is a soft flag in that it will move if it
> can, but not fail if it can't. Perhaps we should add a flag that does
> pass back an error if we can't just move pages around, SPLICE_F_STRICT
> or something like that.

Okay, seems pretty trivial.  I'll do a patch.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [patch 2/3] splice: implement default splice_read method
  2009-05-07 13:37 ` [patch 2/3] splice: implement default splice_read method Miklos Szeredi
@ 2009-05-13  5:35   ` Andrew Morton
  2009-05-13  6:37     ` Jens Axboe
  0 siblings, 1 reply; 19+ messages in thread
From: Andrew Morton @ 2009-05-13  5:35 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: jens.axboe, Max Kellermann, torvalds, linux-fsdevel, linux-kernel

On Thu, 07 May 2009 15:37:36 +0200 Miklos Szeredi <miklos@szeredi.hu> wrote:

> +	for (i = 0; i < spd.nr_pages; i++) {
> +		kunmap(pages[i]);

It is deadlockable if any thread of control holds more than a single
kmap at a time.

Because there are a finite number of kmaps available, and if one is
unavailable, kmap() waits for one to become free.  If the number of
waiting threads equals the number of available slots, nobody makes any
progress.




^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [patch 2/3] splice: implement default splice_read method
  2009-05-13  5:35   ` Andrew Morton
@ 2009-05-13  6:37     ` Jens Axboe
  2009-05-13  9:01       ` Miklos Szeredi
  0 siblings, 1 reply; 19+ messages in thread
From: Jens Axboe @ 2009-05-13  6:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Miklos Szeredi, Max Kellermann, torvalds, linux-fsdevel, linux-kernel

On Tue, May 12 2009, Andrew Morton wrote:
> On Thu, 07 May 2009 15:37:36 +0200 Miklos Szeredi <miklos@szeredi.hu> wrote:
> 
> > +	for (i = 0; i < spd.nr_pages; i++) {
> > +		kunmap(pages[i]);
> 
> It is deadlockable if any thread of control holds more than a single
> kmap at a time.
> 
> Because there are a finite number of kmaps available, and if one is
> unavailable, kmap() waits for one to become free.  If the number of
> waiting threads equals the number of available slots, nobody makes any
> progress.

Good catch, that will not work reliably. I've applied the below.

commit 4f23122858a27ba97444b9b37a066d83edebd4c8
Author: Jens Axboe <jens.axboe@oracle.com>
Date:   Wed May 13 08:35:35 2009 +0200

    splice: fix repeated kmap()'s in default_file_splice_read()
    
    We cannot reliably map more than one page at the time, or we risk
    deadlocking. Just allocate the pages from low mem instead.
    
    Reported-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

diff --git a/fs/splice.c b/fs/splice.c
index eefd96b..c5e3c79 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -580,13 +580,13 @@ ssize_t default_file_splice_read(struct file *in, loff_t *ppos,
 	for (i = 0; i < nr_pages && i < PIPE_BUFFERS && len; i++) {
 		struct page *page;
 
-		page = alloc_page(GFP_HIGHUSER);
+		page = alloc_page(GFP_USER);
 		error = -ENOMEM;
 		if (!page)
 			goto err;
 
 		this_len = min_t(size_t, len, PAGE_CACHE_SIZE - offset);
-		vec[i].iov_base = (void __user *) kmap(page);
+		vec[i].iov_base = (void __user *) page_address(page);
 		vec[i].iov_len = this_len;
 		pages[i] = page;
 		spd.nr_pages++;
@@ -604,7 +604,6 @@ ssize_t default_file_splice_read(struct file *in, loff_t *ppos,
 
 	nr_freed = 0;
 	for (i = 0; i < spd.nr_pages; i++) {
-		kunmap(pages[i]);
 		this_len = min_t(size_t, vec[i].iov_len, res);
 		partial[i].offset = 0;
 		partial[i].len = this_len;
@@ -624,10 +623,9 @@ ssize_t default_file_splice_read(struct file *in, loff_t *ppos,
 	return res;
 
 err:
-	for (i = 0; i < spd.nr_pages; i++) {
-		kunmap(pages[i]);
+	for (i = 0; i < spd.nr_pages; i++)
 		__free_page(pages[i]);
-	}
+
 	return error;
 }
 EXPORT_SYMBOL(default_file_splice_read);

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [patch 2/3] splice: implement default splice_read method
  2009-05-13  6:37     ` Jens Axboe
@ 2009-05-13  9:01       ` Miklos Szeredi
  2009-05-14 17:29         ` Jens Axboe
  0 siblings, 1 reply; 19+ messages in thread
From: Miklos Szeredi @ 2009-05-13  9:01 UTC (permalink / raw)
  To: jens.axboe; +Cc: akpm, miklos, max, torvalds, linux-fsdevel, linux-kernel

On Wed, 13 May 2009, Jens Axboe wrote:
> On Tue, May 12 2009, Andrew Morton wrote:
> > On Thu, 07 May 2009 15:37:36 +0200 Miklos Szeredi <miklos@szeredi.hu> wrote:
> > 
> > > +	for (i = 0; i < spd.nr_pages; i++) {
> > > +		kunmap(pages[i]);
> > 
> > It is deadlockable if any thread of control holds more than a single
> > kmap at a time.
> > 
> > Because there are a finite number of kmaps available, and if one is
> > unavailable, kmap() waits for one to become free.  If the number of
> > waiting threads equals the number of available slots, nobody makes any
> > progress.

Ouch.

> 
> Good catch, that will not work reliably. I've applied the below.

Thanks.

The bigger problem is that the default_file_splice_write()
implementation in the other patch does the same (it calls
buf->ops->map() on all buffers).

Hmm.  Simple solution would be to do a write() for each buffer.  But
this only affects HIGHMEM kernels, so it's a bit pointless to do that
on all archs.  Sigh...

Miklos

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [patch 2/3] splice: implement default splice_read method
  2009-05-13  9:01       ` Miklos Szeredi
@ 2009-05-14 17:29         ` Jens Axboe
  2009-05-14 17:54           ` Miklos Szeredi
  0 siblings, 1 reply; 19+ messages in thread
From: Jens Axboe @ 2009-05-14 17:29 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: akpm, max, torvalds, linux-fsdevel, linux-kernel

On Wed, May 13 2009, Miklos Szeredi wrote:
> On Wed, 13 May 2009, Jens Axboe wrote:
> > On Tue, May 12 2009, Andrew Morton wrote:
> > > On Thu, 07 May 2009 15:37:36 +0200 Miklos Szeredi <miklos@szeredi.hu> wrote:
> > > 
> > > > +	for (i = 0; i < spd.nr_pages; i++) {
> > > > +		kunmap(pages[i]);
> > > 
> > > It is deadlockable if any thread of control holds more than a single
> > > kmap at a time.
> > > 
> > > Because there are a finite number of kmaps available, and if one is
> > > unavailable, kmap() waits for one to become free.  If the number of
> > > waiting threads equals the number of available slots, nobody makes any
> > > progress.
> 
> Ouch.
> 
> > 
> > Good catch, that will not work reliably. I've applied the below.
> 
> Thanks.
> 
> The bigger problem is that the default_file_splice_write()
> implementation in the other patch does the same (it calls
> buf->ops->map() on all buffers).

Yep that's even worse, as that should go BUG() pretty much immediately
when the KM_USER0 slot is reused!

> Hmm.  Simple solution would be to do a write() for each buffer.  But
> this only affects HIGHMEM kernels, so it's a bit pointless to do that
> on all archs.  Sigh...

It is unfortunate, we are going to be stuck with that for some time
still...

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [patch 2/3] splice: implement default splice_read method
  2009-05-14 17:29         ` Jens Axboe
@ 2009-05-14 17:54           ` Miklos Szeredi
  2009-05-14 18:00             ` Jens Axboe
  0 siblings, 1 reply; 19+ messages in thread
From: Miklos Szeredi @ 2009-05-14 17:54 UTC (permalink / raw)
  To: jens.axboe; +Cc: miklos, akpm, max, torvalds, linux-fsdevel, linux-kernel

On Thu, 14 May 2009, Jens Axboe wrote:

> > The bigger problem is that the default_file_splice_write()
> > implementation in the other patch does the same (it calls
> > buf->ops->map() on all buffers).
> 
> Yep that's even worse, as that should go BUG() pretty much immediately
> when the KM_USER0 slot is reused!

No, ->map() calls plain kmap() if "atomic" argument is zero.
Nevertheless the deadlock due to multiple kmaps is still possible.

> > Hmm.  Simple solution would be to do a write() for each buffer.  But
> > this only affects HIGHMEM kernels, so it's a bit pointless to do that
> > on all archs.  Sigh...
> 
> It is unfortunate, we are going to be stuck with that for some time
> still...

I'm going to be offline till next monday, I'll post a fix after that.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [patch 2/3] splice: implement default splice_read method
  2009-05-14 17:54           ` Miklos Szeredi
@ 2009-05-14 18:00             ` Jens Axboe
  2009-05-18 12:36               ` Miklos Szeredi
  0 siblings, 1 reply; 19+ messages in thread
From: Jens Axboe @ 2009-05-14 18:00 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: akpm, max, torvalds, linux-fsdevel, linux-kernel

On Thu, May 14 2009, Miklos Szeredi wrote:
> On Thu, 14 May 2009, Jens Axboe wrote:
> 
> > > The bigger problem is that the default_file_splice_write()
> > > implementation in the other patch does the same (it calls
> > > buf->ops->map() on all buffers).
> > 
> > Yep that's even worse, as that should go BUG() pretty much immediately
> > when the KM_USER0 slot is reused!
> 
> No, ->map() calls plain kmap() if "atomic" argument is zero.
> Nevertheless the deadlock due to multiple kmaps is still possible.

It's buggy either way, didn't check whether you used atomic maps or not.

> > > Hmm.  Simple solution would be to do a write() for each buffer.  But
> > > this only affects HIGHMEM kernels, so it's a bit pointless to do that
> > > on all archs.  Sigh...
> > 
> > It is unfortunate, we are going to be stuck with that for some time
> > still...
> 
> I'm going to be offline till next monday, I'll post a fix after that.

Until monday, or the monday after? I'm tempted to revert the
readv/writev part until this is settled, we cannot ship it as-is.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [patch 0/3] make splice more generic
  2009-05-11 12:12 ` Jens Axboe
  2009-05-11 15:22   ` Miklos Szeredi
@ 2009-05-14 20:27   ` Jamie Lokier
  2009-05-15  7:32     ` Jens Axboe
  1 sibling, 1 reply; 19+ messages in thread
From: Jamie Lokier @ 2009-05-14 20:27 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Miklos Szeredi, Max Kellermann, torvalds, linux-fsdevel, linux-kernel

Jens Axboe wrote:
> SPLICE_F_MOVE is a soft flag in that it will move if it
> can, but not fail if it can't.

According to local man page, it's not even that.

  SPLICE_F_MOVE      [...] The initial implementation of this flag  was  buggy:
		     therefore  starting  in  Linux  2.6.21 it is a *no-op*
		     (but is still permitted in a splice() call); in  the
		     future, a correct implementation may be restored.

> Perhaps we should add a flag that does pass back an error if we
> can't just move pages around, SPLICE_F_STRICT or something like
> that.

What should it do in cases where the kernel would choose to copy data,
i.e. data that isn't a whole page?

-- Jamie

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [patch 0/3] make splice more generic
  2009-05-14 20:27   ` Jamie Lokier
@ 2009-05-15  7:32     ` Jens Axboe
  0 siblings, 0 replies; 19+ messages in thread
From: Jens Axboe @ 2009-05-15  7:32 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Miklos Szeredi, Max Kellermann, torvalds, linux-fsdevel,
	linux-kernel, npiggin

On Thu, May 14 2009, Jamie Lokier wrote:
> Jens Axboe wrote:
> > SPLICE_F_MOVE is a soft flag in that it will move if it
> > can, but not fail if it can't.
> 
> According to local man page, it's not even that.
> 
>   SPLICE_F_MOVE      [...] The initial implementation of this flag  was  buggy:
> 		     therefore  starting  in  Linux  2.6.21 it is a *no-op*
> 		     (but is still permitted in a splice() call); in  the
> 		     future, a correct implementation may be restored.

That's a temporary condition, and the statement isn't completely true
either.  What happened was that the stealing of fs pages was disabled.
Other users of splice could still implement it and take advantage of it.

Nick pulled the original code and did promise to work on a replacement,
but so far that hasn't happened.

> > Perhaps we should add a flag that does pass back an error if we
> > can't just move pages around, SPLICE_F_STRICT or something like
> > that.
> 
> What should it do in cases where the kernel would choose to copy data,
> i.e. data that isn't a whole page?

Probably -EINVAL, like O_DIRECT would.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [patch 2/3] splice: implement default splice_read method
  2009-05-14 18:00             ` Jens Axboe
@ 2009-05-18 12:36               ` Miklos Szeredi
  2009-05-19  9:38                 ` Jens Axboe
  0 siblings, 1 reply; 19+ messages in thread
From: Miklos Szeredi @ 2009-05-18 12:36 UTC (permalink / raw)
  To: jens.axboe; +Cc: miklos, akpm, max, torvalds, linux-fsdevel, linux-kernel

Hi Jens,

On Thu, 14 May 2009, Jens Axboe wrote:
> On Thu, May 14 2009, Miklos Szeredi wrote:
> > > > Hmm.  Simple solution would be to do a write() for each buffer.  But
> > > > this only affects HIGHMEM kernels, so it's a bit pointless to do that
> > > > on all archs.  Sigh...
> > > 
> > > It is unfortunate, we are going to be stuck with that for some time
> > > still...

Here's a patch that fixes the multiple kmap() issue.  It goes on top
of the previous splice patches.

Thanks,
Miklos

----
Subject: splice: fix kmaps in default_file_splice_write()
From: Miklos Szeredi <mszeredi@suse.cz>

Unfortunately multiple kmap() within a single thread are deadlockable,
so writing out multiple buffers with writev() isn't possible.

Change the implementation so that it does a separate write() for each
buffer.  This actually simplifies the code a lot since the
splice_from_pipe() helper can be used.

This limitation is caused by HIGHMEM pages, and so only affects a
subset of architectures and configurations.  In the future it may be
worth to implement default_file_splice_write() in a more efficient way
on configs that allow it.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---
 fs/splice.c |  130 ++++++++++--------------------------------------------------
 1 file changed, 22 insertions(+), 108 deletions(-)

Index: linux-2.6/fs/splice.c
===================================================================
--- linux-2.6.orig/fs/splice.c	2009-05-18 13:22:49.000000000 +0200
+++ linux-2.6/fs/splice.c	2009-05-18 14:18:22.000000000 +0200
@@ -535,8 +535,8 @@ static ssize_t kernel_readv(struct file
 	return res;
 }
 
-static ssize_t kernel_writev(struct file *file, const struct iovec *vec,
-			    unsigned long vlen, loff_t *ppos)
+static ssize_t kernel_write(struct file *file, const char *buf, size_t count,
+			    loff_t pos)
 {
 	mm_segment_t old_fs;
 	ssize_t res;
@@ -544,7 +544,7 @@ static ssize_t kernel_writev(struct file
 	old_fs = get_fs();
 	set_fs(get_ds());
 	/* The cast to a user pointer is valid due to the set_fs() */
-	res = vfs_writev(file, (const struct iovec __user *)vec, vlen, ppos);
+	res = vfs_write(file, (const char __user *)buf, count, &pos);
 	set_fs(old_fs);
 
 	return res;
@@ -1003,120 +1003,34 @@ generic_file_splice_write(struct pipe_in
 
 EXPORT_SYMBOL(generic_file_splice_write);
 
-static struct pipe_buffer *nth_pipe_buf(struct pipe_inode_info *pipe, int n)
+static int write_pipe_buf(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
+			  struct splice_desc *sd)
 {
-	return &pipe->bufs[(pipe->curbuf + n) % PIPE_BUFFERS];
+	int ret;
+	void *data;
+
+	ret = buf->ops->confirm(pipe, buf);
+	if (ret)
+		return ret;
+
+	data = buf->ops->map(pipe, buf, 0);
+	ret = kernel_write(sd->u.file, data + buf->offset, sd->len, sd->pos);
+	buf->ops->unmap(pipe, buf, data);
+
+	return ret;
 }
 
 static ssize_t default_file_splice_write(struct pipe_inode_info *pipe,
 					 struct file *out, loff_t *ppos,
 					 size_t len, unsigned int flags)
 {
-	ssize_t ret = 0;
-	ssize_t total_len = 0;
-	int do_wakeup = 0;
-
-	pipe_lock(pipe);
-	while (len) {
-		struct pipe_buffer *buf;
-		void *data[PIPE_BUFFERS];
-		struct iovec vec[PIPE_BUFFERS];
-		unsigned int nr_pages = 0;
-		unsigned int write_len = 0;
-		unsigned int now_len = len;
-		unsigned int this_len;
-		int i;
-
-		BUG_ON(pipe->nrbufs > PIPE_BUFFERS);
-		for (i = 0; i < pipe->nrbufs && now_len; i++) {
-			buf = nth_pipe_buf(pipe, i);
-
-			ret = buf->ops->confirm(pipe, buf);
-			if (ret)
-				break;
-
-			data[i] = buf->ops->map(pipe, buf, 0);
-			this_len = min(buf->len, now_len);
-			vec[i].iov_base = (void __user *) data[i] + buf->offset;
-			vec[i].iov_len = this_len;
-			now_len -= this_len;
-			write_len += this_len;
-			nr_pages++;
-		}
-
-		if (nr_pages) {
-			ret = kernel_writev(out, vec, nr_pages, ppos);
-			if (ret == 0)
-				ret = -EIO;
-			if (ret > 0) {
-				len -= ret;
-				total_len += ret;
-			}
-		}
-
-		for (i = 0; i < nr_pages; i++) {
-			buf = nth_pipe_buf(pipe, i);
-			buf->ops->unmap(pipe, buf, data[i]);
-
-			if (ret > 0) {
-				this_len = min_t(unsigned, vec[i].iov_len, ret);
-				buf->offset += this_len;
-				buf->len -= this_len;
-				ret -= this_len;
-			}
-		}
-
-		if (ret < 0)
-			break;
-
-		while (pipe->nrbufs) {
-			const struct pipe_buf_operations *ops;
-
-			buf = nth_pipe_buf(pipe, 0);
-			if (buf->len)
-				break;
-
-			ops = buf->ops;
-			buf->ops = NULL;
-			ops->release(pipe, buf);
-			pipe->curbuf = (pipe->curbuf + 1) % PIPE_BUFFERS;
-			pipe->nrbufs--;
-			if (pipe->inode)
-				do_wakeup = 1;
-		}
-
-		if (pipe->nrbufs)
-			continue;
-		if (!pipe->writers)
-			break;
-		if (!pipe->waiting_writers) {
-			if (total_len)
-				break;
-		}
-
-		if (flags & SPLICE_F_NONBLOCK) {
-			ret = -EAGAIN;
-			break;
-		}
-
-		if (signal_pending(current)) {
-			ret = -ERESTARTSYS;
-			break;
-		}
-
-		if (do_wakeup) {
-			wakeup_pipe_writers(pipe);
-			do_wakeup = 0;
-		}
-
-		pipe_wait(pipe);
-	}
-	pipe_unlock(pipe);
+	ssize_t ret;
 
-	if (do_wakeup)
-		wakeup_pipe_writers(pipe);
+	ret = splice_from_pipe(pipe, out, ppos, len, flags, write_pipe_buf);
+	if (ret > 0)
+		*ppos += ret;
 
-	return total_len ? total_len : ret;
+	return ret;
 }
 
 /**


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [patch 2/3] splice: implement default splice_read method
  2009-05-18 12:36               ` Miklos Szeredi
@ 2009-05-19  9:38                 ` Jens Axboe
  0 siblings, 0 replies; 19+ messages in thread
From: Jens Axboe @ 2009-05-19  9:38 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: akpm, max, torvalds, linux-fsdevel, linux-kernel

On Mon, May 18 2009, Miklos Szeredi wrote:
> Hi Jens,
> 
> On Thu, 14 May 2009, Jens Axboe wrote:
> > On Thu, May 14 2009, Miklos Szeredi wrote:
> > > > > Hmm.  Simple solution would be to do a write() for each buffer.  But
> > > > > this only affects HIGHMEM kernels, so it's a bit pointless to do that
> > > > > on all archs.  Sigh...
> > > > 
> > > > It is unfortunate, we are going to be stuck with that for some time
> > > > still...
> 
> Here's a patch that fixes the multiple kmap() issue.  It goes on top
> of the previous splice patches.

Thanks Miklos, applied.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2009-05-19  9:38 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-05-07 13:37 [patch 0/3] make splice more generic Miklos Szeredi
2009-05-07 13:37 ` [patch 1/3] splice: implement pipe to pipe splicing Miklos Szeredi
2009-05-07 13:37 ` [patch 2/3] splice: implement default splice_read method Miklos Szeredi
2009-05-13  5:35   ` Andrew Morton
2009-05-13  6:37     ` Jens Axboe
2009-05-13  9:01       ` Miklos Szeredi
2009-05-14 17:29         ` Jens Axboe
2009-05-14 17:54           ` Miklos Szeredi
2009-05-14 18:00             ` Jens Axboe
2009-05-18 12:36               ` Miklos Szeredi
2009-05-19  9:38                 ` Jens Axboe
2009-05-07 13:37 ` [patch 3/3] splice: implement default splice_write method Miklos Szeredi
2009-05-07 15:55 ` [patch 0/3] make splice more generic Linus Torvalds
2009-05-11 15:17   ` Miklos Szeredi
2009-05-09 11:36 ` Max Kellermann
2009-05-11 12:12 ` Jens Axboe
2009-05-11 15:22   ` Miklos Szeredi
2009-05-14 20:27   ` Jamie Lokier
2009-05-15  7:32     ` Jens Axboe

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.