linux-mm.kvack.org archive mirror
* [PATCH v2 0/2] vmsplice support for zero-copy gifting of pages
@ 2013-10-25 15:46 Robert Jennings
  2013-10-25 15:46 ` [PATCH v2 1/2] vmsplice: unmap gifted pages for recipient Robert Jennings
                   ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: Robert Jennings @ 2013-10-25 15:46 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-fsdevel, linux-mm, Alexander Viro, Rik van Riel,
	Andrea Arcangeli, Dave Hansen, Robert Jennings, Matt Helsley,
	Anthony Liguori, Michael Roth, Lei Li, Leonardo Garcia,
	Simon Jin, Vlastimil Babka

From: Robert C Jennings <rcj@linux.vnet.ibm.com>

This patch set would add the ability to move anonymous user pages from one
process to another through vmsplice without copying data.  Moving pages
rather than copying is implemented for a narrow case in this RFC to meet
the needs of QEMU's usage (below).

Among the restrictions: the source and destination addresses must be
page aligned, the size argument must be a multiple of the page size,
and by the time the reader calls vmsplice, the page must no longer be
mapped in the source.  If a move is not possible the code transparently
falls back to copying data.

This comes from work in QEMU[1] to migrate a VM from one QEMU instance
to another with minimal down-time for the VM.  This would allow for an
update of the QEMU executable under the VM.

New flag usage
This introduces use of the SPLICE_F_MOVE flag for vmsplice, previously
unused.  Proposed usage is as follows:

 Writer gifts pages to pipe, can not access original contents after gift:
    vmsplice(fd, iov, nr_segs, SPLICE_F_GIFT | SPLICE_F_MOVE);
 Reader asks kernel to move pages from pipe to memory described by iovec:
    vmsplice(fd, iov, nr_segs, SPLICE_F_MOVE);

Moving pages rather than copying is implemented for a narrow case in
this RFC to meet the needs of QEMU's usage.  If a move is not possible
the code transparently falls back to copying data.

On older kernels SPLICE_F_MOVE is ignored and a copy occurs instead.

[1] QEMU localhost live migration:
http://lists.gnu.org/archive/html/qemu-devel/2013-10/msg02787.html

Changes from V1:
 - Cleanup zap coalescing in splice_to_pipe for readability
 - Field added to struct partial_page in v1 was unnecessary, using
   private field instead.
 - Read-side code in pipe_to_user pulled out into a new function
 - Improved documentation of read-side flipping code
 - Fixed locking issue in read-side flipping code found by sparse
 - Updated vmsplice comments for vmsplice_to_user(),
   vmsplice_to_pipe, and vmsplice syscall
_______________________________________________________

  vmsplice: unmap gifted pages for recipient
  vmsplice: Add limited zero copy to vmsplice

 fs/splice.c | 159 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 150 insertions(+), 9 deletions(-)

-- 
1.8.1.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* [PATCH v2 1/2] vmsplice: unmap gifted pages for recipient
  2013-10-25 15:46 [PATCH v2 0/2] vmsplice support for zero-copy gifting of pages Robert Jennings
@ 2013-10-25 15:46 ` Robert Jennings
  2013-11-04 16:16   ` Vlastimil Babka
  2013-10-25 15:46 ` [PATCH v2 2/2] vmsplice: Add limited zero copy to vmsplice Robert Jennings
  2013-11-04 15:34 ` [PATCH v2 0/2] vmsplice support for zero-copy gifting of pages Vlastimil Babka
  2 siblings, 1 reply; 5+ messages in thread
From: Robert Jennings @ 2013-10-25 15:46 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-fsdevel, linux-mm, Alexander Viro, Rik van Riel,
	Andrea Arcangeli, Dave Hansen, Robert Jennings, Matt Helsley,
	Anthony Liguori, Michael Roth, Lei Li, Leonardo Garcia,
	Simon Jin, Vlastimil Babka

From: Robert C Jennings <rcj@linux.vnet.ibm.com>

Introduce use of the unused SPLICE_F_MOVE flag for vmsplice to zap
pages.

When vmsplice is called with flags (SPLICE_F_GIFT | SPLICE_F_MOVE), the
writer's gifted pages are zapped.  This patch supports further work to
move vmsplice'd pages rather than copying them.  That follow-on patch
has the restriction that a page must not be mapped by the source for
the move; otherwise it falls back to copying the page.

Signed-off-by: Matt Helsley <matt.helsley@gmail.com>
Signed-off-by: Robert C Jennings <rcj@linux.vnet.ibm.com>
---
Changes since v1:
 - Cleanup zap coalescing in splice_to_pipe for readability
 - Field added to struct partial_page in v1 was unnecessary, using 
   private field instead.
---
 fs/splice.c | 38 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

diff --git a/fs/splice.c b/fs/splice.c
index 3b7ee65..c14be6f 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -188,12 +188,18 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
 {
 	unsigned int spd_pages = spd->nr_pages;
 	int ret, do_wakeup, page_nr;
+	struct vm_area_struct *vma;
+	unsigned long user_start, user_end, addr;
 
 	ret = 0;
 	do_wakeup = 0;
 	page_nr = 0;
+	vma = NULL;
+	user_start = user_end = 0;
 
 	pipe_lock(pipe);
+	/* mmap_sem taken for zap_page_range with SPLICE_F_MOVE */
+	down_read(&current->mm->mmap_sem);
 
 	for (;;) {
 		if (!pipe->readers) {
@@ -215,6 +221,33 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
 			if (spd->flags & SPLICE_F_GIFT)
 				buf->flags |= PIPE_BUF_FLAG_GIFT;
 
+			/* Prepare to move page sized/aligned bufs.
+			 * Gather pages for a single zap_page_range()
+			 * call per VMA.
+			 */
+			if (spd->flags & (SPLICE_F_GIFT | SPLICE_F_MOVE) &&
+					!buf->offset &&
+					(buf->len == PAGE_SIZE)) {
+				addr = buf->private;
+
+				if (vma && (addr == user_end) &&
+					   (addr + PAGE_SIZE <= vma->vm_end)) {
+					/* Same vma, no holes */
+					user_end += PAGE_SIZE;
+				} else {
+					if (vma)
+						zap_page_range(vma, user_start,
+							(user_end - user_start),
+							NULL);
+					vma = find_vma(current->mm, addr);
+					if (!IS_ERR_OR_NULL(vma)) {
+						user_start = addr;
+						user_end = (addr + PAGE_SIZE);
+					} else
+						vma = NULL;
+				}
+			}
+
 			pipe->nrbufs++;
 			page_nr++;
 			ret += buf->len;
@@ -255,6 +288,10 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
 		pipe->waiting_writers--;
 	}
 
+	if (vma)
+		zap_page_range(vma, user_start, (user_end - user_start), NULL);
+
+	up_read(&current->mm->mmap_sem);
 	pipe_unlock(pipe);
 
 	if (do_wakeup)
@@ -1475,6 +1512,7 @@ static int get_iovec_page_array(const struct iovec __user *iov,
 
 			partial[buffers].offset = off;
 			partial[buffers].len = plen;
+			partial[buffers].private = (unsigned long)base;
 
 			off = 0;
 			len -= plen;
-- 
1.8.1.2



* [PATCH v2 2/2] vmsplice: Add limited zero copy to vmsplice
  2013-10-25 15:46 [PATCH v2 0/2] vmsplice support for zero-copy gifting of pages Robert Jennings
  2013-10-25 15:46 ` [PATCH v2 1/2] vmsplice: unmap gifted pages for recipient Robert Jennings
@ 2013-10-25 15:46 ` Robert Jennings
  2013-11-04 15:34 ` [PATCH v2 0/2] vmsplice support for zero-copy gifting of pages Vlastimil Babka
  2 siblings, 0 replies; 5+ messages in thread
From: Robert Jennings @ 2013-10-25 15:46 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-fsdevel, linux-mm, Alexander Viro, Rik van Riel,
	Andrea Arcangeli, Dave Hansen, Robert Jennings, Matt Helsley,
	Anthony Liguori, Michael Roth, Lei Li, Leonardo Garcia,
	Simon Jin, Vlastimil Babka

From: Robert C Jennings <rcj@linux.vnet.ibm.com>

It is sometimes useful to move anonymous pages over a pipe rather than
save/swap them. Check the SPLICE_F_GIFT and SPLICE_F_MOVE flags to see
if userspace would like to move such pages. This differs from plain
SPLICE_F_GIFT in that the memory written to the pipe will no longer
have the same contents as the original -- it effectively faults in new,
empty anonymous pages.

On the read side the page written to the pipe will be copied unless
SPLICE_F_MOVE is used. Otherwise page flipping will be performed and the
page will be reclaimed. Note that as long as there is a mapping to the
page, copies will be performed instead, because rmap will have raised
the map count for each anonymous mapping; this can happen due to fork(),
for example. This is necessary because moving the page will usually
change the anonymous page's nonlinear index, and that can only be done
while it is unmapped.

Signed-off-by: Matt Helsley <matt.helsley@gmail.com>
Signed-off-by: Robert C Jennings <rcj@linux.vnet.ibm.com>
---
Changes since v1:
 - Page flipping in pipe_to_user pulled out into a new function,
   __pipe_to_user_move
 - Improved documentation in code and patch description
 - Fixed locking issue in flipping code found by sparse
 - Updated vmsplice comments for vmsplice_to_user(), 
   vmsplice_to_pipe, and vmsplice syscall
---
 fs/splice.c | 121 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 112 insertions(+), 9 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index c14be6f..955afc0 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -32,6 +32,10 @@
 #include <linux/gfp.h>
 #include <linux/socket.h>
 #include <linux/compat.h>
+#include <linux/page-flags.h>
+#include <linux/hugetlb.h>
+#include <linux/ksm.h>
+#include <linux/swapops.h>
 #include "internal.h"
 
 /*
@@ -1545,12 +1549,108 @@ static int get_iovec_page_array(const struct iovec __user *iov,
 	return error;
 }
 
+/* __pipe_to_user_move - Attempt to move pages into user vma by flipping
+ *
+ * Description:
+ *	This function will try to flip pages in the pipe to the user rather
+ *	than copying.
+ */
+/* Returns:
+ *  Success, number of bytes flipped
+ *  Failure, negative error value
+ */
+static int __pipe_to_user_move(struct pipe_inode_info *pipe,
+			     struct pipe_buffer *buf, struct splice_desc *sd)
+{
+	int ret = -EFAULT;
+	struct page *page = buf->page;
+	struct mm_struct *mm;
+	struct vm_area_struct *vma;
+	spinlock_t *ptl;
+	pte_t *ptep, pte;
+	unsigned long useraddr;
+
+
+	if (!(buf->flags & PIPE_BUF_FLAG_GIFT) ||
+			!(sd->flags & SPLICE_F_MOVE) ||
+			(buf->offset) || (buf->len != PAGE_SIZE))
+		goto out;
+
+	/* Moving pages is done only for a subset of pages.
+	 * They must be anonymous and unmapped. The anon page's
+	 * nonlinear index will probably change which can only be
+	 * done if it is unmapped.
+	 */
+	if (!PageAnon(page))
+		goto out;
+	if (page_mapped(page))
+		goto out;
+
+	/* Huge pages must be copied as we are not tracking if
+	 * all of the PAGE_SIZE pipe_buffers which compose the
+	 * huge page are in the pipe.
+	 */
+	if (PageCompound(page))
+		goto out;
+	/* TODO: Add support for TransHuge pages */
+	if (PageHuge(page) || PageTransHuge(page))
+		goto out;
+
+	useraddr = (unsigned long)sd->u.userptr;
+	mm = current->mm;
+
+	down_read(&mm->mmap_sem);
+	vma = find_vma(mm, useraddr);
+	if (IS_ERR_OR_NULL(vma))
+		goto up_copy;
+	if (!vma->anon_vma) {
+		ret = anon_vma_prepare(vma);
+		if (ret)
+			goto up_copy;
+	}
+	zap_page_range(vma, useraddr, PAGE_SIZE, NULL);
+	ret = lock_page_killable(page);
+	if (ret)
+		goto up_copy;
+	ptep = get_locked_pte(mm, useraddr, &ptl);
+	if (!ptep)
+		goto page_unlock_up_copy;
+	pte = *ptep;
+	if (pte_present(pte))
+		goto pte_unlock_up_copy;
+	get_page(page);
+	page_add_anon_rmap(page, vma, useraddr);
+	pte = mk_pte(page, vma->vm_page_prot);
+	set_pte_at(mm, useraddr, ptep, pte);
+	update_mmu_cache(vma, useraddr, ptep);
+	ret = 0;
+pte_unlock_up_copy:
+	pte_unmap_unlock(ptep, ptl);
+page_unlock_up_copy:
+	unlock_page(page);
+up_copy:
+	up_read(&mm->mmap_sem);
+	if (!ret) {
+		ret = sd->len;
+		goto out;
+	}
+	/* else ret < 0 and we should fallback to copying */
+	VM_BUG_ON(ret > 0);
+out:
+	return ret;
+}
+
 static int pipe_to_user(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
 			struct splice_desc *sd)
 {
 	char *src;
 	int ret;
 
+	/* Attempt to move pages rather than copy */
+	ret = __pipe_to_user_move(pipe, buf, sd);
+	if (ret > 0)
+		goto out;
+
 	/*
 	 * See if we can use the atomic maps, by prefaulting in the
 	 * pages and doing an atomic copy
@@ -1583,8 +1683,11 @@ out:
 }
 
 /*
- * For lack of a better implementation, implement vmsplice() to userspace
- * as a simple copy of the pipes pages to the user iov.
+ * Implement vmsplice() to userspace as a simple copy of the pipe's pages
+ * to the user iov.
+ *
+ * The SPLICE_F_MOVE flag for vmsplice() will cause pipe_to_user() to attempt
+ * moving pages into the user iov when possible, replacing the current pages.
  */
 static long vmsplice_to_user(struct file *file, const struct iovec __user *iov,
 			     unsigned long nr_segs, unsigned int flags)
@@ -1707,16 +1810,16 @@ static long vmsplice_to_pipe(struct file *file, const struct iovec __user *iov,
  * to a pipe, not the other way around. Splicing from user memory is a simple
  * operation that can be supported without any funky alignment restrictions
  * or nasty vm tricks. We simply map in the user memory and fill them into
- * a pipe. The reverse isn't quite as easy, though. There are two possible
- * solutions for that:
+ * a pipe.  The reverse isn't quite as easy, though. There are two paths
+ * taken:
  *
  *	- memcpy() the data internally, at which point we might as well just
  *	  do a regular read() on the buffer anyway.
- *	- Lots of nasty vm tricks, that are neither fast nor flexible (it
- *	  has restriction limitations on both ends of the pipe).
- *
- * Currently we punt and implement it as a normal copy, see pipe_to_user().
- *
+ *	- Move pages from source to destination when the flags
+ *	  (SPLICE_F_GIFT | SPLICE_F_MOVE) are present.  Pages are zapped on
+ *	  the source then moved into the destination process.  This falls
+ *	  back to memcpy() when necessary. See pipe_to_user() for fall-back
+ *	  conditions.
  */
 SYSCALL_DEFINE4(vmsplice, int, fd, const struct iovec __user *, iov,
 		unsigned long, nr_segs, unsigned int, flags)
-- 
1.8.1.2



* Re: [PATCH v2 0/2] vmsplice support for zero-copy gifting of pages
  2013-10-25 15:46 [PATCH v2 0/2] vmsplice support for zero-copy gifting of pages Robert Jennings
  2013-10-25 15:46 ` [PATCH v2 1/2] vmsplice: unmap gifted pages for recipient Robert Jennings
  2013-10-25 15:46 ` [PATCH v2 2/2] vmsplice: Add limited zero copy to vmsplice Robert Jennings
@ 2013-11-04 15:34 ` [PATCH v2 0/2] vmsplice support for zero-copy gifting of pages Vlastimil Babka
  2 siblings, 0 replies; 5+ messages in thread
From: Vlastimil Babka @ 2013-11-04 15:34 UTC (permalink / raw)
  To: Robert Jennings, linux-kernel
  Cc: linux-fsdevel, linux-mm, Alexander Viro, Rik van Riel,
	Andrea Arcangeli, Dave Hansen, Matt Helsley, Anthony Liguori,
	Michael Roth, Lei Li, Leonardo Garcia, Simon Jin

On 10/25/2013 05:46 PM, Robert Jennings wrote:
> From: Robert C Jennings <rcj@linux.vnet.ibm.com>
> 
> This patch set would add the ability to move anonymous user pages from one
> process to another through vmsplice without copying data.  Moving pages
> rather than copying is implemented for a narrow case in this RFC to meet
> the needs of QEMU's usage (below).
> 
> Among the restrictions the source address and destination addresses must
> be page aligned, the size argument must be a multiple of page size,
> and by the time the reader calls vmsplice, the page must no longer be
> mapped in the source.  If a move is not possible the code transparently
> falls back to copying data.
> 
> This comes from work in QEMU[1] to migrate a VM from one QEMU instance
> to another with minimal down-time for the VM.  This would allow for an
> update of the QEMU executable under the VM.

Hello,

since this seems a somewhat narrow use case for a syscall change, it
would be helpful if you included a larger discussion of the existing
alternatives you considered, with benchmark results justifying the
changed syscall. E.g.:
- Cross Memory Attach comes to mind as one alternative to vmsplice.
Although it does perform a single copy, there are results suggesting
zero-copy doesn't necessarily add that much gain:
  http://marc.info/?l=linux-mm&m=130105930902915&w=2
- Would it be possible for QEMU to use shared memory to begin with,
since you are already restricting this to page-aligned regions?

Ideally the benchmark results would also include the THP support when
complete.

Thanks,
Vlastimil




* Re: [PATCH v2 1/2] vmsplice: unmap gifted pages for recipient
  2013-10-25 15:46 ` [PATCH v2 1/2] vmsplice: unmap gifted pages for recipient Robert Jennings
@ 2013-11-04 16:16   ` Vlastimil Babka
  0 siblings, 0 replies; 5+ messages in thread
From: Vlastimil Babka @ 2013-11-04 16:16 UTC (permalink / raw)
  To: Robert Jennings, linux-kernel
  Cc: linux-fsdevel, linux-mm, Alexander Viro, Rik van Riel,
	Andrea Arcangeli, Dave Hansen, Matt Helsley, Anthony Liguori,
	Michael Roth, Lei Li, Leonardo Garcia, Simon Jin

On 10/25/2013 05:46 PM, Robert Jennings wrote:
> From: Robert C Jennings <rcj@linux.vnet.ibm.com>
> 
> Introduce use of the unused SPLICE_F_MOVE flag for vmsplice to zap
> pages.
> 
> When vmsplice is called with flags (SPLICE_F_GIFT | SPLICE_F_MOVE) the
> writer's gift'ed pages would be zapped.  This patch supports further work
> to move vmsplice'd pages rather than copying them.  That patch has the
> restriction that the page must not be mapped by the source for the move,
> otherwise it will fall back to copying the page.
> 
> Signed-off-by: Matt Helsley <matt.helsley@gmail.com>
> Signed-off-by: Robert C Jennings <rcj@linux.vnet.ibm.com>
> ---
> Changes since v1:
>  - Cleanup zap coalescing in splice_to_pipe for readability
>  - Field added to struct partial_page in v1 was unnecessary, using 
>    private field instead.
> ---
>  fs/splice.c | 38 ++++++++++++++++++++++++++++++++++++++
>  1 file changed, 38 insertions(+)
> 
> diff --git a/fs/splice.c b/fs/splice.c
> index 3b7ee65..c14be6f 100644
> --- a/fs/splice.c
> +++ b/fs/splice.c
> @@ -188,12 +188,18 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
>  {
>  	unsigned int spd_pages = spd->nr_pages;
>  	int ret, do_wakeup, page_nr;
> +	struct vm_area_struct *vma;
> +	unsigned long user_start, user_end, addr;
>  
>  	ret = 0;
>  	do_wakeup = 0;
>  	page_nr = 0;
> +	vma = NULL;
> +	user_start = user_end = 0;
>  
>  	pipe_lock(pipe);
> +	/* mmap_sem taken for zap_page_range with SPLICE_F_MOVE */
> +	down_read(&current->mm->mmap_sem);

I have suggested taking the semaphore here only when the gift and move
flags are set. You said that taking it outside the loop and acquiring it
once already improved performance. This is OK, but my point was to not
take the semaphore at all for vmsplice calls without these flags, to
avoid unnecessary contention.

>  
>  	for (;;) {
>  		if (!pipe->readers) {
> @@ -215,6 +221,33 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
>  			if (spd->flags & SPLICE_F_GIFT)
>  				buf->flags |= PIPE_BUF_FLAG_GIFT;
>  
> +			/* Prepare to move page sized/aligned bufs.
> +			 * Gather pages for a single zap_page_range()
> +			 * call per VMA.
> +			 */
> +			if (spd->flags & (SPLICE_F_GIFT | SPLICE_F_MOVE) &&
> +					!buf->offset &&
> +					(buf->len == PAGE_SIZE)) {
> +				addr = buf->private;

Here you assume that buf->private (initialized from
spd->partial[page_nr].private) will contain a valid address whenever the
GIFT and MOVE flags are set. I think that's quite dangerous and could be
easily exploited. Briefly looking it seems to me that at least one
caller of splice_to_pipe(), __generic_file_splice_read() doesn't
initialize the on-stack-allocated private fields, and it can take flags
directly from the splice syscall.

> +
> +				if (vma && (addr == user_end) &&
> +					   (addr + PAGE_SIZE <= vma->vm_end)) {
> +					/* Same vma, no holes */
> +					user_end += PAGE_SIZE;
> +				} else {
> +					if (vma)
> +						zap_page_range(vma, user_start,
> +							(user_end - user_start),
> +							NULL);
> +					vma = find_vma(current->mm, addr);

It seems there is a good chance that, when crossing the previous vma's
vm_end, taking the next vma would suffice instead of calling find_vma().

> +					if (!IS_ERR_OR_NULL(vma)) {
> +						user_start = addr;
> +						user_end = (addr + PAGE_SIZE);
> +					} else
> +						vma = NULL;
> +				}
> +			}
> +
>  			pipe->nrbufs++;
>  			page_nr++;
>  			ret += buf->len;
> @@ -255,6 +288,10 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
>  		pipe->waiting_writers--;
>  	}
>  
> +	if (vma)
> +		zap_page_range(vma, user_start, (user_end - user_start), NULL);
> +
> +	up_read(&current->mm->mmap_sem);
>  	pipe_unlock(pipe);
>  
>  	if (do_wakeup)
> @@ -1475,6 +1512,7 @@ static int get_iovec_page_array(const struct iovec __user *iov,
>  
>  			partial[buffers].offset = off;
>  			partial[buffers].len = plen;
> +			partial[buffers].private = (unsigned long)base;
>  
>  			off = 0;
>  			len -= plen;
> 


