linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/2] vmsplice support for zero-copy gifting of pages
@ 2013-10-07 20:21 Robert C Jennings
  2013-10-07 20:21 ` [PATCH 1/2] vmsplice: unmap gifted pages for recipient Robert C Jennings
  2013-10-07 20:21 ` [PATCH 2/2] vmsplice: Add limited zero copy to vmsplice Robert C Jennings
  0 siblings, 2 replies; 15+ messages in thread
From: Robert C Jennings @ 2013-10-07 20:21 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-fsdevel, linux-mm, Alexander Viro, Rik van Riel,
	Andrea Arcangeli, Dave Hansen, Robert C Jennings, Matt Helsley,
	Anthony Liguori, Michael Roth, Lei Li, Leonardo Garcia,
	Vlastimil Babka

This patch set adds the ability to move anonymous user pages from one
process to another through vmsplice without copying data.  Moving pages
rather than copying is implemented for a narrow case in this RFC to meet
the needs of QEMU's usage (below).

Among the restrictions: the source and destination addresses must be
page aligned, the size argument must be a multiple of the page size,
and by the time the reader calls vmsplice the page must no longer be
mapped in the source.  If a move is not possible, the code transparently
falls back to copying data.

This comes from work in QEMU[1] to migrate a VM from one QEMU instance
to another with minimal downtime for the VM.  This would allow the QEMU
executable under a running VM to be updated.

New flag usage
This introduces use of the SPLICE_F_MOVE flag for vmsplice, previously
unused.  Proposed usage is as follows:

 Writer gifts pages to pipe, cannot access original contents after gift:
    vmsplice(fd, iov, nr_segs, SPLICE_F_GIFT | SPLICE_F_MOVE);
 Reader asks kernel to move pages from pipe to memory described by iovec:
    vmsplice(fd, iov, nr_segs, SPLICE_F_MOVE);
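
A minimal end-to-end sketch of the proposed pairing (an illustration,
not part of the series; on an unpatched kernel both vmsplice calls
still succeed but fall back to copying):

    /* sketch: gift and move one page into a pipe, then move it out */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/uio.h>
    #include <unistd.h>

    int main(void)
    {
        long pgsz = sysconf(_SC_PAGESIZE);
        struct iovec iov;
        void *wbuf, *rbuf;
        int pfd[2];

        if (pipe(pfd) || posix_memalign(&wbuf, pgsz, pgsz) ||
            posix_memalign(&rbuf, pgsz, pgsz))
            return 1;
        memset(wbuf, 0x5a, pgsz);

        /* writer: gift the page; the writer must not use it afterward */
        iov.iov_base = wbuf;
        iov.iov_len = pgsz;
        if (vmsplice(pfd[1], &iov, 1, SPLICE_F_GIFT | SPLICE_F_MOVE) != pgsz)
            return 1;

        /* reader: ask the kernel to move the page into rbuf */
        iov.iov_base = rbuf;
        iov.iov_len = pgsz;
        if (vmsplice(pfd[0], &iov, 1, SPLICE_F_MOVE) != pgsz)
            return 1;

        return ((char *)rbuf)[0] == 0x5a ? 0 : 1;
    }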

On older kernels the SPLICE_F_MOVE flag is ignored and a copy occurs.

[1] QEMU localhost live migration:
      http://lists.gnu.org/archive/html/qemu-devel/2013-06/msg02540.html
      http://lists.gnu.org/archive/html/qemu-devel/2013-06/msg02577.html
_______________________________________________________

  vmsplice: unmap gifted pages for recipient
  vmsplice: Add limited zero copy to vmsplice

 fs/splice.c            | 114 ++++++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/splice.h |   1 +
 2 files changed, 114 insertions(+), 1 deletion(-)

-- 
1.8.1.2



* [PATCH 1/2] vmsplice: unmap gifted pages for recipient
  2013-10-07 20:21 [PATCH 0/2] vmsplice support for zero-copy gifting of pages Robert C Jennings
@ 2013-10-07 20:21 ` Robert C Jennings
  2013-10-08 16:14   ` Dave Hansen
                     ` (2 more replies)
  2013-10-07 20:21 ` [PATCH 2/2] vmsplice: Add limited zero copy to vmsplice Robert C Jennings
  1 sibling, 3 replies; 15+ messages in thread
From: Robert C Jennings @ 2013-10-07 20:21 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-fsdevel, linux-mm, Alexander Viro, Rik van Riel,
	Andrea Arcangeli, Dave Hansen, Robert C Jennings, Matt Helsley,
	Anthony Liguori, Michael Roth, Lei Li, Leonardo Garcia,
	Vlastimil Babka

Introduce use of the unused SPLICE_F_MOVE flag for vmsplice to zap
pages.

When vmsplice is called with flags (SPLICE_F_GIFT | SPLICE_F_MOVE) the
writer's gifted pages are zapped.  This patch supports further work
to move vmsplice'd pages rather than copying them.  That patch has the
restriction that the page must not be mapped by the source for the move,
otherwise it will fall back to copying the page.

Signed-off-by: Matt Helsley <matt.helsley@gmail.com>
Signed-off-by: Robert C Jennings <rcj@linux.vnet.ibm.com>
---
Since the RFC went out I have coalesced the zap_page_range() call to
operate on VMAs rather than calling this for each page.  For a 256MB
vmsplice this reduced the write side 50% from the RFC.
---
 fs/splice.c            | 51 +++++++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/splice.h |  1 +
 2 files changed, 51 insertions(+), 1 deletion(-)

diff --git a/fs/splice.c b/fs/splice.c
index 3b7ee65..a62d61e 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -188,12 +188,17 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
 {
 	unsigned int spd_pages = spd->nr_pages;
 	int ret, do_wakeup, page_nr;
+	struct vm_area_struct *vma;
+	unsigned long user_start, user_end;
 
 	ret = 0;
 	do_wakeup = 0;
 	page_nr = 0;
+	vma = NULL;
+	user_start = user_end = 0;
 
 	pipe_lock(pipe);
+	down_read(&current->mm->mmap_sem);
 
 	for (;;) {
 		if (!pipe->readers) {
@@ -212,8 +217,44 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
 			buf->len = spd->partial[page_nr].len;
 			buf->private = spd->partial[page_nr].private;
 			buf->ops = spd->ops;
-			if (spd->flags & SPLICE_F_GIFT)
+			if (spd->flags & SPLICE_F_GIFT) {
+				unsigned long useraddr =
+						spd->partial[page_nr].useraddr;
+
+				if ((spd->flags & SPLICE_F_MOVE) &&
+						!buf->offset &&
+						(buf->len == PAGE_SIZE)) {
+					/* Can move page aligned buf, gather
+					 * requests to make a single
+					 * zap_page_range() call per VMA
+					 */
+					if (vma && (useraddr == user_end) &&
+						   ((useraddr + PAGE_SIZE) <=
+						    vma->vm_end)) {
+						/* same vma, no holes */
+						user_end += PAGE_SIZE;
+					} else {
+						if (vma)
+							zap_page_range(vma,
+								user_start,
+								(user_end -
+								 user_start),
+								NULL);
+						vma = find_vma_intersection(
+								current->mm,
+								useraddr,
+								(useraddr +
+								 PAGE_SIZE));
+						if (!IS_ERR_OR_NULL(vma)) {
+							user_start = useraddr;
+							user_end = (useraddr +
+								    PAGE_SIZE);
+						} else
+							vma = NULL;
+					}
+				}
 				buf->flags |= PIPE_BUF_FLAG_GIFT;
+			}
 
 			pipe->nrbufs++;
 			page_nr++;
@@ -255,6 +296,10 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
 		pipe->waiting_writers--;
 	}
 
+	if (vma)
+		zap_page_range(vma, user_start, (user_end - user_start), NULL);
+
+	up_read(&current->mm->mmap_sem);
 	pipe_unlock(pipe);
 
 	if (do_wakeup)
@@ -485,6 +530,7 @@ fill_it:
 
 		spd.partial[page_nr].offset = loff;
 		spd.partial[page_nr].len = this_len;
+		spd.partial[page_nr].useraddr = index << PAGE_CACHE_SHIFT;
 		len -= this_len;
 		loff = 0;
 		spd.nr_pages++;
@@ -656,6 +702,7 @@ ssize_t default_file_splice_read(struct file *in, loff_t *ppos,
 		this_len = min_t(size_t, vec[i].iov_len, res);
 		spd.partial[i].offset = 0;
 		spd.partial[i].len = this_len;
+		spd.partial[i].useraddr = (unsigned long)vec[i].iov_base;
 		if (!this_len) {
 			__free_page(spd.pages[i]);
 			spd.pages[i] = NULL;
@@ -1475,6 +1522,8 @@ static int get_iovec_page_array(const struct iovec __user *iov,
 
 			partial[buffers].offset = off;
 			partial[buffers].len = plen;
+			partial[buffers].useraddr = (unsigned long)base;
+			base = (void*)((unsigned long)base + PAGE_SIZE);
 
 			off = 0;
 			len -= plen;
diff --git a/include/linux/splice.h b/include/linux/splice.h
index 74575cb..56661e3 100644
--- a/include/linux/splice.h
+++ b/include/linux/splice.h
@@ -44,6 +44,7 @@ struct partial_page {
 	unsigned int offset;
 	unsigned int len;
 	unsigned long private;
+	unsigned long useraddr;
 };
 
 /*
-- 
1.8.1.2



* [PATCH 2/2] vmsplice: Add limited zero copy to vmsplice
  2013-10-07 20:21 [PATCH 0/2] vmsplice support for zero-copy gifting of pages Robert C Jennings
  2013-10-07 20:21 ` [PATCH 1/2] vmsplice: unmap gifted pages for recipient Robert C Jennings
@ 2013-10-07 20:21 ` Robert C Jennings
  2013-10-08 16:45   ` Dave Hansen
  2013-10-17 11:23   ` Vlastimil Babka
  1 sibling, 2 replies; 15+ messages in thread
From: Robert C Jennings @ 2013-10-07 20:21 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-fsdevel, linux-mm, Alexander Viro, Rik van Riel,
	Andrea Arcangeli, Dave Hansen, Robert C Jennings, Matt Helsley,
	Anthony Liguori, Michael Roth, Lei Li, Leonardo Garcia,
	Vlastimil Babka

From: Matt Helsley <matt.helsley@gmail.com>

It is sometimes useful to move anonymous pages over a pipe rather than
save/swap them. Check the SPLICE_F_GIFT and SPLICE_F_MOVE flags to see
if userspace would like to move such pages. This differs from plain
SPLICE_F_GIFT in that the memory written to the pipe will no longer
have the same contents as the original -- it effectively faults in new,
empty anonymous pages.

On the read side the page written to the pipe will be copied unless
SPLICE_F_MOVE is used, in which case the page is moved into the reader's
address space and the pipe buffer is reclaimed. Note that so long as
there is a mapping to the page, a copy will be done instead, because
rmap will have upped the map count for each anonymous mapping; this can
happen due to fork(), for example. This is necessary because moving the
page will usually change the anonymous page's index, and that can only
be done if the page is unmapped.

Signed-off-by: Matt Helsley <matt.helsley@gmail.com>
Signed-off-by: Robert C Jennings <rcj@linux.vnet.ibm.com>
---
 fs/splice.c | 63 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 63 insertions(+)

diff --git a/fs/splice.c b/fs/splice.c
index a62d61e..9d2ed128 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -32,6 +32,10 @@
 #include <linux/gfp.h>
 #include <linux/socket.h>
 #include <linux/compat.h>
+#include <linux/page-flags.h>
+#include <linux/hugetlb.h>
+#include <linux/ksm.h>
+#include <linux/swapops.h>
 #include "internal.h"
 
 /*
@@ -1562,6 +1566,65 @@ static int pipe_to_user(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
 	char *src;
 	int ret;
 
+	if (!buf->offset && (buf->len == PAGE_SIZE) &&
+	    (buf->flags & PIPE_BUF_FLAG_GIFT) && (sd->flags & SPLICE_F_MOVE)) {
+		struct page *page = buf->page;
+		struct mm_struct *mm;
+		struct vm_area_struct *vma;
+		spinlock_t *ptl;
+		pte_t *ptep, pte;
+		unsigned long useraddr;
+
+		if (!PageAnon(page))
+			goto copy;
+		if (PageCompound(page))
+			goto copy;
+		if (PageHuge(page) || PageTransHuge(page))
+			goto copy;
+		if (page_mapped(page))
+			goto copy;
+		useraddr = (unsigned long)sd->u.userptr;
+		mm = current->mm;
+
+		ret = -EAGAIN;
+		down_read(&mm->mmap_sem);
+		vma = find_vma_intersection(mm, useraddr, useraddr + PAGE_SIZE);
+		if (IS_ERR_OR_NULL(vma))
+			goto up_copy;
+		if (!vma->anon_vma) {
+			ret = anon_vma_prepare(vma);
+			if (ret)
+				goto up_copy;
+		}
+		zap_page_range(vma, useraddr, PAGE_SIZE, NULL);
+		ret = lock_page_killable(page);
+		if (ret)
+			goto up_copy;
+		ptep = get_locked_pte(mm, useraddr, &ptl);
+		if (!ptep)
+			goto unlock_up_copy;
+		pte = *ptep;
+		if (pte_present(pte))
+			goto unlock_up_copy;
+		get_page(page);
+		page_add_anon_rmap(page, vma, useraddr);
+		pte = mk_pte(page, vma->vm_page_prot);
+		set_pte_at(mm, useraddr, ptep, pte);
+		update_mmu_cache(vma, useraddr, ptep);
+		pte_unmap_unlock(ptep, ptl);
+		ret = 0;
+unlock_up_copy:
+		unlock_page(page);
+up_copy:
+		up_read(&mm->mmap_sem);
+		if (!ret) {
+			ret = sd->len;
+			goto out;
+		}
+		/* else ret < 0 and we should fallback to copying */
+		VM_BUG_ON(ret > 0);
+	}
+copy:
 	/*
 	 * See if we can use the atomic maps, by prefaulting in the
 	 * pages and doing an atomic copy
-- 
1.8.1.2



* Re: [PATCH 1/2] vmsplice: unmap gifted pages for recipient
  2013-10-07 20:21 ` [PATCH 1/2] vmsplice: unmap gifted pages for recipient Robert C Jennings
@ 2013-10-08 16:14   ` Dave Hansen
  2013-10-08 19:48     ` Robert Jennings
  2013-10-08 16:23   ` Dave Hansen
  2013-10-17 10:20   ` Vlastimil Babka
  2 siblings, 1 reply; 15+ messages in thread
From: Dave Hansen @ 2013-10-08 16:14 UTC (permalink / raw)
  To: Robert C Jennings, linux-kernel
  Cc: linux-fsdevel, linux-mm, Alexander Viro, Rik van Riel,
	Andrea Arcangeli, Matt Helsley, Anthony Liguori, Michael Roth,
	Lei Li, Leonardo Garcia, Vlastimil Babka

On 10/07/2013 01:21 PM, Robert C Jennings wrote:
> +					} else {
> +						if (vma)
> +							zap_page_range(vma,
> +								user_start,
> +								(user_end -
> +								 user_start),
> +								NULL);
> +						vma = find_vma_intersection(
> +								current->mm,
> +								useraddr,
> +								(useraddr +
> +								 PAGE_SIZE));
> +						if (!IS_ERR_OR_NULL(vma)) {
> +							user_start = useraddr;
> +							user_end = (useraddr +
> +								    PAGE_SIZE);
> +						} else
> +							vma = NULL;
> +					}

This is pretty unspeakably hideous.  Was there truly no better way to do
this?


* Re: [PATCH 1/2] vmsplice: unmap gifted pages for recipient
  2013-10-07 20:21 ` [PATCH 1/2] vmsplice: unmap gifted pages for recipient Robert C Jennings
  2013-10-08 16:14   ` Dave Hansen
@ 2013-10-08 16:23   ` Dave Hansen
  2013-10-17 13:54     ` Robert Jennings
  2013-10-17 10:20   ` Vlastimil Babka
  2 siblings, 1 reply; 15+ messages in thread
From: Dave Hansen @ 2013-10-08 16:23 UTC (permalink / raw)
  To: Robert C Jennings, linux-kernel
  Cc: linux-fsdevel, linux-mm, Alexander Viro, Rik van Riel,
	Andrea Arcangeli, Matt Helsley, Anthony Liguori, Michael Roth,
	Lei Li, Leonardo Garcia, Vlastimil Babka

On 10/07/2013 01:21 PM, Robert C Jennings wrote:
>  		spd.partial[page_nr].offset = loff;
>  		spd.partial[page_nr].len = this_len;
> +		spd.partial[page_nr].useraddr = index << PAGE_CACHE_SHIFT;
>  		len -= this_len;
>  		loff = 0;
>  		spd.nr_pages++;
> @@ -656,6 +702,7 @@ ssize_t default_file_splice_read(struct file *in, loff_t *ppos,
>  		this_len = min_t(size_t, vec[i].iov_len, res);
>  		spd.partial[i].offset = 0;
>  		spd.partial[i].len = this_len;
> +		spd.partial[i].useraddr = (unsigned long)vec[i].iov_base;
>  		if (!this_len) {
>  			__free_page(spd.pages[i]);
>  			spd.pages[i] = NULL;
> @@ -1475,6 +1522,8 @@ static int get_iovec_page_array(const struct iovec __user *iov,
>  
>  			partial[buffers].offset = off;
>  			partial[buffers].len = plen;
> +			partial[buffers].useraddr = (unsigned long)base;
> +			base = (void*)((unsigned long)base + PAGE_SIZE);
>  
>  			off = 0;
>  			len -= plen;
> diff --git a/include/linux/splice.h b/include/linux/splice.h
> index 74575cb..56661e3 100644
> --- a/include/linux/splice.h
> +++ b/include/linux/splice.h
> @@ -44,6 +44,7 @@ struct partial_page {
>  	unsigned int offset;
>  	unsigned int len;
>  	unsigned long private;
> +	unsigned long useraddr;
>  };

"useraddr" confuses me.  You make it an 'unsigned long', yet two of the
three assignments are from "void __user *".  The other assignment:

	spd.partial[page_nr].useraddr = index << PAGE_CACHE_SHIFT;

'index' looks to be the offset inside the file, not a user address, so
I'm confused what that is doing.

Could you elaborate a little more on why 'useraddr' is suddenly needed
in these patches?  How is "index << PAGE_CACHE_SHIFT" a virtual address?
 Also, are we losing any of the advantages of sparse checking since
'useraddr' is without the __user annotation?


* Re: [PATCH 2/2] vmsplice: Add limited zero copy to vmsplice
  2013-10-07 20:21 ` [PATCH 2/2] vmsplice: Add limited zero copy to vmsplice Robert C Jennings
@ 2013-10-08 16:45   ` Dave Hansen
  2013-10-08 17:35     ` Robert Jennings
  2013-10-17 11:23   ` Vlastimil Babka
  1 sibling, 1 reply; 15+ messages in thread
From: Dave Hansen @ 2013-10-08 16:45 UTC (permalink / raw)
  To: Robert C Jennings, linux-kernel
  Cc: linux-fsdevel, linux-mm, Alexander Viro, Rik van Riel,
	Andrea Arcangeli, Matt Helsley, Anthony Liguori, Michael Roth,
	Lei Li, Leonardo Garcia, Vlastimil Babka

On 10/07/2013 01:21 PM, Robert C Jennings wrote:
> +	if (!buf->offset && (buf->len == PAGE_SIZE) &&
> +	    (buf->flags & PIPE_BUF_FLAG_GIFT) && (sd->flags & SPLICE_F_MOVE)) {
> +		struct page *page = buf->page;
> +		struct mm_struct *mm;
> +		struct vm_area_struct *vma;
> +		spinlock_t *ptl;
> +		pte_t *ptep, pte;
> +		unsigned long useraddr;
> +
> +		if (!PageAnon(page))
> +			goto copy;
> +		if (PageCompound(page))
> +			goto copy;
> +		if (PageHuge(page) || PageTransHuge(page))
> +			goto copy;
> +		if (page_mapped(page))
> +			goto copy;

I'd really like to see some comments about those cases.  You touched on
page_mapped() above, but could you replicate some of that in a comment?
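
For instance, comments along these lines (my wording, just to
illustrate the kind of annotation requested):

	if (!PageAnon(page))	/* file-backed; the page cache owns it */
		goto copy;
	if (PageCompound(page))	/* part of a compound allocation */
		goto copy;
	if (PageHuge(page) || PageTransHuge(page))
		goto copy;	/* huge pages would need their own path */
	if (page_mapped(page))	/* rmap still holds mappings; moving would
				 * change the anon index under them */
		goto copy;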

Also, considering that this is being targeted at QEMU VMs, I would
imagine that you're going to want to support PageTransHuge() in here
pretty fast.  Do you anticipate that being very much trouble?  Have you
planned for it in here?

> +		useraddr = (unsigned long)sd->u.userptr;
> +		mm = current->mm;
> +
> +		ret = -EAGAIN;
> +		down_read(&mm->mmap_sem);
> +		vma = find_vma_intersection(mm, useraddr, useraddr + PAGE_SIZE);

If you are only doing these a page at a time, why bother with
find_vma_intersection()?  Why not a plain find_vma()?

Also, if we fail to find a VMA, won't this return -EAGAIN?  That seems
like a rather uninformative error code to get returned back out to
userspace, especially since retrying won't help.

> +		if (IS_ERR_OR_NULL(vma))
> +			goto up_copy;
> +		if (!vma->anon_vma) {
> +			ret = anon_vma_prepare(vma);
> +			if (ret)
> +				goto up_copy;
> +		}

The first thing anon_vma_prepare() does is check vma->anon_vma.  This
extra check seems unnecessary.
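
That is, the hunk could presumably collapse to just (sketch):

	ret = anon_vma_prepare(vma);	/* no-op if anon_vma already set */
	if (ret)
		goto up_copy;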

> +		zap_page_range(vma, useraddr, PAGE_SIZE, NULL);
> +		ret = lock_page_killable(page);
> +		if (ret)
> +			goto up_copy;
> +		ptep = get_locked_pte(mm, useraddr, &ptl);
> +		if (!ptep)
> +			goto unlock_up_copy;
> +		pte = *ptep;
> +		if (pte_present(pte))
> +			goto unlock_up_copy;
> +		get_page(page);
> +		page_add_anon_rmap(page, vma, useraddr);
> +		pte = mk_pte(page, vma->vm_page_prot);

'pte' is getting used for two different things here, which makes it a
bit confusing.  I'd probably just skip this first assignment and
directly do:

		if (pte_present(*ptep))
			goto unlock_up_copy;

> +		set_pte_at(mm, useraddr, ptep, pte);
> +		update_mmu_cache(vma, useraddr, ptep);
> +		pte_unmap_unlock(ptep, ptl);
> +		ret = 0;
> +unlock_up_copy:
> +		unlock_page(page);
> +up_copy:
> +		up_read(&mm->mmap_sem);
> +		if (!ret) {
> +			ret = sd->len;
> +			goto out;
> +		}
> +		/* else ret < 0 and we should fallback to copying */
> +		VM_BUG_ON(ret > 0);
> +	}

This also screams to be broken out into a helper function instead of
just being thrown in with the existing code.



* Re: [PATCH 2/2] vmsplice: Add limited zero copy to vmsplice
  2013-10-08 16:45   ` Dave Hansen
@ 2013-10-08 17:35     ` Robert Jennings
  0 siblings, 0 replies; 15+ messages in thread
From: Robert Jennings @ 2013-10-08 17:35 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, linux-fsdevel, linux-mm, Alexander Viro,
	Rik van Riel, Andrea Arcangeli, Matt Helsley, Anthony Liguori,
	Michael Roth, Lei Li, Leonardo Garcia, Vlastimil Babka

* Dave Hansen (dave@sr71.net) wrote:
> On 10/07/2013 01:21 PM, Robert C Jennings wrote:
> > +	if (!buf->offset && (buf->len == PAGE_SIZE) &&
> > +	    (buf->flags & PIPE_BUF_FLAG_GIFT) && (sd->flags & SPLICE_F_MOVE)) {
> > +		struct page *page = buf->page;
> > +		struct mm_struct *mm;
> > +		struct vm_area_struct *vma;
> > +		spinlock_t *ptl;
> > +		pte_t *ptep, pte;
> > +		unsigned long useraddr;
> > +
> > +		if (!PageAnon(page))
> > +			goto copy;
> > +		if (PageCompound(page))
> > +			goto copy;
> > +		if (PageHuge(page) || PageTransHuge(page))
> > +			goto copy;
> > +		if (page_mapped(page))
> > +			goto copy;
> 
> I'd really like to see some comments about those cases.  You touched on
> page_mapped() above, but could you replicate some of that in a comment?

Yes, I'll add comments in the code for these cases.

> Also, considering that this is being targeted at QEMU VMs, I would
> imagine that you're going to want to support PageTransHuge() in here
> pretty fast.  Do you anticipate that being very much trouble?  Have you
> planned for it in here?

My focus with this patch set was to get agreement on the vmsplice
syscall flag change in the first patch, which performs page flipping
rather than copying.

I am working on support for PageTransHuge() but it is not complete.
It reworks this function to coalesce PAGE_SIZE pipe buffers into
THP-sized units and operate on those.

> > +		useraddr = (unsigned long)sd->u.userptr;
> > +		mm = current->mm;
> > +
> > +		ret = -EAGAIN;
> > +		down_read(&mm->mmap_sem);
> > +		vma = find_vma_intersection(mm, useraddr, useraddr + PAGE_SIZE);
> 
> If you are only doing these a page at a time, why bother with
> find_vma_intersection()?  Why not a plain find_vma()?

You're correct, I can change this to use find_vma().

> Also, if we fail to find a VMA, won't this return -EAGAIN?  That seems
> like a rather uninformative error code to get returned back out to
> userspace, especially since retrying won't help.

Yes, -EAGAIN is not good for this case, I will use -EFAULT.

> > +		if (IS_ERR_OR_NULL(vma))
> > +			goto up_copy;
> > +		if (!vma->anon_vma) {
> > +			ret = anon_vma_prepare(vma);
> > +			if (ret)
> > +				goto up_copy;
> > +		}
> 
> The first thing anon_vma_prepare() does is check vma->anon_vma.  This
> extra check seems unnecessary.

I'll fix this, thanks.

> > +		zap_page_range(vma, useraddr, PAGE_SIZE, NULL);
> > +		ret = lock_page_killable(page);
> > +		if (ret)
> > +			goto up_copy;
> > +		ptep = get_locked_pte(mm, useraddr, &ptl);
> > +		if (!ptep)
> > +			goto unlock_up_copy;
> > +		pte = *ptep;
> > +		if (pte_present(pte))
> > +			goto unlock_up_copy;
> > +		get_page(page);
> > +		page_add_anon_rmap(page, vma, useraddr);
> > +		pte = mk_pte(page, vma->vm_page_prot);
> 
> 'pte' is getting used for two different things here, which makes it a
> bit confusing.  I'd probably just skip this first assignment and
> directly do:
> 
> 		if (pte_present(*ptep))
> 			goto unlock_up_copy;

I'll fix this, thanks.

> > +		set_pte_at(mm, useraddr, ptep, pte);
> > +		update_mmu_cache(vma, useraddr, ptep);
> > +		pte_unmap_unlock(ptep, ptl);
> > +		ret = 0;
> > +unlock_up_copy:
> > +		unlock_page(page);
> > +up_copy:
> > +		up_read(&mm->mmap_sem);
> > +		if (!ret) {
> > +			ret = sd->len;
> > +			goto out;
> > +		}
> > +		/* else ret < 0 and we should fallback to copying */
> > +		VM_BUG_ON(ret > 0);
> > +	}
> 
> This also screams to be broken out into a helper function instead of
> just being thrown in with the existing code.

You're right, it's very self-contained already.  I'll pull it out. 



* Re: [PATCH 1/2] vmsplice: unmap gifted pages for recipient
  2013-10-08 16:14   ` Dave Hansen
@ 2013-10-08 19:48     ` Robert Jennings
  2013-10-08 21:22       ` Dave Hansen
  0 siblings, 1 reply; 15+ messages in thread
From: Robert Jennings @ 2013-10-08 19:48 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, linux-fsdevel, linux-mm, Alexander Viro,
	Rik van Riel, Andrea Arcangeli, Matt Helsley, Anthony Liguori,
	Michael Roth, Lei Li, Leonardo Garcia, Vlastimil Babka

* Dave Hansen (dave@sr71.net) wrote:
> On 10/07/2013 01:21 PM, Robert C Jennings wrote:
> > +					} else {
> > +						if (vma)
> > +							zap_page_range(vma,
> > +								user_start,
> > +								(user_end -
> > +								 user_start),
> > +								NULL);
> > +						vma = find_vma_intersection(
> > +								current->mm,
> > +								useraddr,
> > +								(useraddr +
> > +								 PAGE_SIZE));
> > +						if (!IS_ERR_OR_NULL(vma)) {
> > +							user_start = useraddr;
> > +							user_end = (useraddr +
> > +								    PAGE_SIZE);
> > +						} else
> > +							vma = NULL;
> > +					}
> 
> This is pretty unspeakably hideous.  Was there truly no better way to do
> this?

I was hoping to find a better way to coalesce pipe buffers and zap
entire VMAs (this also needs better documentation, but your argument is
with the structure and I agree).  I would love suggestions for improving
this, though I have not abandoned it; I'm still looking for ways to
make it cleaner.

Doing find_vma() once per VMA rather than for each page, and then
zapping once, provides a 50% runtime reduction for the writer when
tested with a 256MB vmsplice operation.  Based on that result I felt
that coalescing was justified, but the implementation is ugly.



* Re: [PATCH 1/2] vmsplice: unmap gifted pages for recipient
  2013-10-08 19:48     ` Robert Jennings
@ 2013-10-08 21:22       ` Dave Hansen
  0 siblings, 0 replies; 15+ messages in thread
From: Dave Hansen @ 2013-10-08 21:22 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm, Alexander Viro,
	Rik van Riel, Andrea Arcangeli, Matt Helsley, Anthony Liguori,
	Michael Roth, Lei Li, Leonardo Garcia, Vlastimil Babka

On 10/08/2013 12:48 PM, Robert Jennings wrote:
> * Dave Hansen (dave@sr71.net) wrote:
>> On 10/07/2013 01:21 PM, Robert C Jennings wrote:
>>> +					} else {
>>> +						if (vma)
>>> +							zap_page_range(vma,
>>> +								user_start,
>>> +								(user_end -
>>> +								 user_start),
>>> +								NULL);
>>> +						vma = find_vma_intersection(
>>> +								current->mm,
>>> +								useraddr,
>>> +								(useraddr +
>>> +								 PAGE_SIZE));
>>> +						if (!IS_ERR_OR_NULL(vma)) {
>>> +							user_start = useraddr;
>>> +							user_end = (useraddr +
>>> +								    PAGE_SIZE);
>>> +						} else
>>> +							vma = NULL;
>>> +					}
>>
>> This is pretty unspeakably hideous.  Was there truly no better way to do
>> this?
> 
> I was hoping to find a better way to coalesce pipe buffers and zap
> entire VMAs (this also needs better documentation, but your argument is
> with the structure and I agree).  I would love suggestions for improving
> this, though I have not abandoned it; I'm still looking for ways to
> make it cleaner.

Doing the VMA search each and every time seems a bit silly.  Do one
find_vma(), then look at the _end_ virtual address of the VMA.  You can
continue to collect your set of zap_page_range() addresses as long as
you do not hit the end of that address range.

If and only if you hit the end of the vma, do the zap_page_range(), and
then look up the VMA again.
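
Something along those lines, reusing the patch's variable names (a
sketch, untested):

	if (vma && useraddr == user_end &&
	    user_end + PAGE_SIZE <= vma->vm_end) {
		user_end += PAGE_SIZE;	/* still inside the cached VMA */
	} else {
		if (vma)		/* flush the run collected so far */
			zap_page_range(vma, user_start,
				       user_end - user_start, NULL);
		vma = find_vma(current->mm, useraddr);
		if (vma && vma->vm_start <= useraddr) {
			user_start = useraddr;
			user_end = useraddr + PAGE_SIZE;
		} else {
			vma = NULL;	/* useraddr falls in a hole */
		}
	}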

Storing the .useraddr still seems odd to me, and you haven't fully
explained why you're doing it or how it is safe, or why you store both
virtual addresses and file locations in it.


* Re: [PATCH 1/2] vmsplice: unmap gifted pages for recipient
  2013-10-07 20:21 ` [PATCH 1/2] vmsplice: unmap gifted pages for recipient Robert C Jennings
  2013-10-08 16:14   ` Dave Hansen
  2013-10-08 16:23   ` Dave Hansen
@ 2013-10-17 10:20   ` Vlastimil Babka
  2013-10-17 13:48     ` Robert Jennings
  2 siblings, 1 reply; 15+ messages in thread
From: Vlastimil Babka @ 2013-10-17 10:20 UTC (permalink / raw)
  To: Robert C Jennings, linux-kernel
  Cc: linux-fsdevel, linux-mm, Alexander Viro, Rik van Riel,
	Andrea Arcangeli, Dave Hansen, Matt Helsley, Anthony Liguori,
	Michael Roth, Lei Li, Leonardo Garcia

On 10/07/2013 10:21 PM, Robert C Jennings wrote:
> Introduce use of the unused SPLICE_F_MOVE flag for vmsplice to zap
> pages.
> 
> When vmsplice is called with flags (SPLICE_F_GIFT | SPLICE_F_MOVE) the
> writer's gifted pages are zapped.  This patch supports further work
> to move vmsplice'd pages rather than copying them.  That patch has the
> restriction that the page must not be mapped by the source for the move,
> otherwise it will fall back to copying the page.
> 
> Signed-off-by: Matt Helsley <matt.helsley@gmail.com>
> Signed-off-by: Robert C Jennings <rcj@linux.vnet.ibm.com>
> ---
> Since the RFC went out I have coalesced the zap_page_range() call to
> operate on VMAs rather than calling this for each page.  For a 256MB
> vmsplice this reduced the write side 50% from the RFC.
> ---
>  fs/splice.c            | 51 +++++++++++++++++++++++++++++++++++++++++++++++++-
>  include/linux/splice.h |  1 +
>  2 files changed, 51 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/splice.c b/fs/splice.c
> index 3b7ee65..a62d61e 100644
> --- a/fs/splice.c
> +++ b/fs/splice.c
> @@ -188,12 +188,17 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
>  {
>  	unsigned int spd_pages = spd->nr_pages;
>  	int ret, do_wakeup, page_nr;
> +	struct vm_area_struct *vma;
> +	unsigned long user_start, user_end;
>  
>  	ret = 0;
>  	do_wakeup = 0;
>  	page_nr = 0;
> +	vma = NULL;
> +	user_start = user_end = 0;
>  
>  	pipe_lock(pipe);
> +	down_read(&current->mm->mmap_sem);

Seems like you could take the mmap_sem only when GIFT and MOVE are set.
Maybe it won't help that much for performance, but it would at least
document the reason the lock is needed.
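
Something like this, as a sketch:

	bool zap = (spd->flags & (SPLICE_F_GIFT | SPLICE_F_MOVE)) ==
			(SPLICE_F_GIFT | SPLICE_F_MOVE);

	pipe_lock(pipe);
	if (zap)	/* mmap_sem only guards the zap_page_range() path */
		down_read(&current->mm->mmap_sem);
	...
	if (zap)
		up_read(&current->mm->mmap_sem);
	pipe_unlock(pipe);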

Vlastimil

>  	for (;;) {
>  		if (!pipe->readers) {
> @@ -212,8 +217,44 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
>  			buf->len = spd->partial[page_nr].len;
>  			buf->private = spd->partial[page_nr].private;
>  			buf->ops = spd->ops;
> -			if (spd->flags & SPLICE_F_GIFT)
> +			if (spd->flags & SPLICE_F_GIFT) {
> +				unsigned long useraddr =
> +						spd->partial[page_nr].useraddr;
> +
> +				if ((spd->flags & SPLICE_F_MOVE) &&
> +						!buf->offset &&
> +						(buf->len == PAGE_SIZE)) {
> +					/* Can move page aligned buf, gather
> +					 * requests to make a single
> +					 * zap_page_range() call per VMA
> +					 */
> +					if (vma && (useraddr == user_end) &&
> +						   ((useraddr + PAGE_SIZE) <=
> +						    vma->vm_end)) {
> +						/* same vma, no holes */
> +						user_end += PAGE_SIZE;
> +					} else {
> +						if (vma)
> +							zap_page_range(vma,
> +								user_start,
> +								(user_end -
> +								 user_start),
> +								NULL);
> +						vma = find_vma_intersection(
> +								current->mm,
> +								useraddr,
> +								(useraddr +
> +								 PAGE_SIZE));
> +						if (!IS_ERR_OR_NULL(vma)) {
> +							user_start = useraddr;
> +							user_end = (useraddr +
> +								    PAGE_SIZE);
> +						} else
> +							vma = NULL;
> +					}
> +				}
>  				buf->flags |= PIPE_BUF_FLAG_GIFT;
> +			}
>  
>  			pipe->nrbufs++;
>  			page_nr++;
> @@ -255,6 +296,10 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
>  		pipe->waiting_writers--;
>  	}
>  
> +	if (vma)
> +		zap_page_range(vma, user_start, (user_end - user_start), NULL);
> +
> +	up_read(&current->mm->mmap_sem);
>  	pipe_unlock(pipe);
>  
>  	if (do_wakeup)
> @@ -485,6 +530,7 @@ fill_it:
>  
>  		spd.partial[page_nr].offset = loff;
>  		spd.partial[page_nr].len = this_len;
> +		spd.partial[page_nr].useraddr = index << PAGE_CACHE_SHIFT;
>  		len -= this_len;
>  		loff = 0;
>  		spd.nr_pages++;
> @@ -656,6 +702,7 @@ ssize_t default_file_splice_read(struct file *in, loff_t *ppos,
>  		this_len = min_t(size_t, vec[i].iov_len, res);
>  		spd.partial[i].offset = 0;
>  		spd.partial[i].len = this_len;
> +		spd.partial[i].useraddr = (unsigned long)vec[i].iov_base;
>  		if (!this_len) {
>  			__free_page(spd.pages[i]);
>  			spd.pages[i] = NULL;
> @@ -1475,6 +1522,8 @@ static int get_iovec_page_array(const struct iovec __user *iov,
>  
>  			partial[buffers].offset = off;
>  			partial[buffers].len = plen;
> +			partial[buffers].useraddr = (unsigned long)base;
> +			base = (void*)((unsigned long)base + PAGE_SIZE);
>  
>  			off = 0;
>  			len -= plen;
> diff --git a/include/linux/splice.h b/include/linux/splice.h
> index 74575cb..56661e3 100644
> --- a/include/linux/splice.h
> +++ b/include/linux/splice.h
> @@ -44,6 +44,7 @@ struct partial_page {
>  	unsigned int offset;
>  	unsigned int len;
>  	unsigned long private;
> +	unsigned long useraddr;
>  };
>  
>  /*
> 



* Re: [PATCH 2/2] vmsplice: Add limited zero copy to vmsplice
  2013-10-07 20:21 ` [PATCH 2/2] vmsplice: Add limited zero copy to vmsplice Robert C Jennings
  2013-10-08 16:45   ` Dave Hansen
@ 2013-10-17 11:23   ` Vlastimil Babka
  2013-10-17 13:44     ` Robert Jennings
  1 sibling, 1 reply; 15+ messages in thread
From: Vlastimil Babka @ 2013-10-17 11:23 UTC (permalink / raw)
  To: Robert C Jennings, linux-kernel
  Cc: linux-fsdevel, linux-mm, Alexander Viro, Rik van Riel,
	Andrea Arcangeli, Dave Hansen, Matt Helsley, Anthony Liguori,
	Michael Roth, Lei Li, Leonardo Garcia

On 10/07/2013 10:21 PM, Robert C Jennings wrote:
> From: Matt Helsley <matt.helsley@gmail.com>
> 
> It is sometimes useful to move anonymous pages over a pipe rather than
> save/swap them. Check the SPLICE_F_GIFT and SPLICE_F_MOVE flags to see
> if userspace would like to move such pages. This differs from plain
> SPLICE_F_GIFT in that the memory written to the pipe will no longer
> have the same contents as the original -- it effectively faults in new,
> empty anonymous pages.
> 
> On the read side the page written to the pipe will be copied unless
> SPLICE_F_MOVE is used, in which case the page is moved into the reader's
> address space and the pipe buffer is reclaimed. Note that so long as
> there is a mapping to the page, a copy will be done instead, because
> rmap will have upped the map count for each anonymous mapping; this can
> happen due to fork(), for example. This is necessary because moving the
> page will usually change the anonymous page's index, and that can only
> be done if the page is unmapped.

You might want to update comments of vmsplice_to_user() and
SYSCALL_DEFINE4(vmsplice) as they both explain how it's done only via
copying.

Vlastimil

> Signed-off-by: Matt Helsley <matt.helsley@gmail.com>
> Signed-off-by: Robert C Jennings <rcj@linux.vnet.ibm.com>
> ---
>  fs/splice.c | 63 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 63 insertions(+)
> 
> diff --git a/fs/splice.c b/fs/splice.c
> index a62d61e..9d2ed128 100644
> --- a/fs/splice.c
> +++ b/fs/splice.c
> @@ -32,6 +32,10 @@
>  #include <linux/gfp.h>
>  #include <linux/socket.h>
>  #include <linux/compat.h>
> +#include <linux/page-flags.h>
> +#include <linux/hugetlb.h>
> +#include <linux/ksm.h>
> +#include <linux/swapops.h>
>  #include "internal.h"
>  
>  /*
> @@ -1562,6 +1566,65 @@ static int pipe_to_user(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
>  	char *src;
>  	int ret;
>  
> +	if (!buf->offset && (buf->len == PAGE_SIZE) &&
> +	    (buf->flags & PIPE_BUF_FLAG_GIFT) && (sd->flags & SPLICE_F_MOVE)) {
> +		struct page *page = buf->page;
> +		struct mm_struct *mm;
> +		struct vm_area_struct *vma;
> +		spinlock_t *ptl;
> +		pte_t *ptep, pte;
> +		unsigned long useraddr;
> +
> +		if (!PageAnon(page))
> +			goto copy;
> +		if (PageCompound(page))
> +			goto copy;
> +		if (PageHuge(page) || PageTransHuge(page))
> +			goto copy;
> +		if (page_mapped(page))
> +			goto copy;
> +		useraddr = (unsigned long)sd->u.userptr;
> +		mm = current->mm;
> +
> +		ret = -EAGAIN;
> +		down_read(&mm->mmap_sem);
> +		vma = find_vma_intersection(mm, useraddr, useraddr + PAGE_SIZE);
> +		if (IS_ERR_OR_NULL(vma))
> +			goto up_copy;
> +		if (!vma->anon_vma) {
> +			ret = anon_vma_prepare(vma);
> +			if (ret)
> +				goto up_copy;
> +		}
> +		zap_page_range(vma, useraddr, PAGE_SIZE, NULL);
> +		ret = lock_page_killable(page);
> +		if (ret)
> +			goto up_copy;
> +		ptep = get_locked_pte(mm, useraddr, &ptl);
> +		if (!ptep)
> +			goto unlock_up_copy;
> +		pte = *ptep;
> +		if (pte_present(pte))
> +			goto unlock_up_copy;
> +		get_page(page);
> +		page_add_anon_rmap(page, vma, useraddr);
> +		pte = mk_pte(page, vma->vm_page_prot);
> +		set_pte_at(mm, useraddr, ptep, pte);
> +		update_mmu_cache(vma, useraddr, ptep);
> +		pte_unmap_unlock(ptep, ptl);
> +		ret = 0;
> +unlock_up_copy:
> +		unlock_page(page);
> +up_copy:
> +		up_read(&mm->mmap_sem);
> +		if (!ret) {
> +			ret = sd->len;
> +			goto out;
> +		}
> +		/* else ret < 0 and we should fallback to copying */
> +		VM_BUG_ON(ret > 0);
> +	}
> +copy:
>  	/*
>  	 * See if we can use the atomic maps, by prefaulting in the
>  	 * pages and doing an atomic copy
> 



* Re: [PATCH 2/2] vmsplice: Add limited zero copy to vmsplice
  2013-10-17 11:23   ` Vlastimil Babka
@ 2013-10-17 13:44     ` Robert Jennings
  0 siblings, 0 replies; 15+ messages in thread
From: Robert Jennings @ 2013-10-17 13:44 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-kernel, linux-fsdevel, linux-mm, Alexander Viro,
	Rik van Riel, Andrea Arcangeli, Dave Hansen, Matt Helsley,
	Anthony Liguori, Michael Roth, Lei Li, Leonardo Garcia

* Vlastimil Babka (vbabka@suse.cz) wrote:
> On 10/07/2013 10:21 PM, Robert C Jennings wrote:
> > From: Matt Helsley <matt.helsley@gmail.com>
> > 
> > It is sometimes useful to move anonymous pages over a pipe rather than
> > save/swap them. Check the SPLICE_F_GIFT and SPLICE_F_MOVE flags to see
> > if userspace would like to move such pages. This differs from plain
> > SPLICE_F_GIFT in that the memory written to the pipe will no longer
> > have the same contents as the original -- it effectively faults in new,
> > empty anonymous pages.
> > 
> > On the read side the page written to the pipe will be copied unless
> > SPLICE_F_MOVE is used, in which case the page is moved into the reader's
> > address space and the pipe buffer is reclaimed. Note that so long as
> > there is a mapping to the page, a copy will be done instead, because
> > rmap will have upped the map count for each anonymous mapping; this can
> > happen due to fork(), for example. This is necessary because moving the
> > page will usually change the anonymous page's index, and that can only
> > be done if the page is unmapped.
> 
> You might want to update comments of vmsplice_to_user() and
> SYSCALL_DEFINE4(vmsplice) as they both explain how it's done only via
> copying.
> 
> Vlastimil
> 

I will update those as well.  Thanks.

> > Signed-off-by: Matt Helsley <matt.helsley@gmail.com>
> > Signed-off-by: Robert C Jennings <rcj@linux.vnet.ibm.com>
> > ---
> >  fs/splice.c | 63 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 63 insertions(+)
> > 
> > diff --git a/fs/splice.c b/fs/splice.c
> > index a62d61e..9d2ed128 100644
> > --- a/fs/splice.c
> > +++ b/fs/splice.c
> > @@ -32,6 +32,10 @@
> >  #include <linux/gfp.h>
> >  #include <linux/socket.h>
> >  #include <linux/compat.h>
> > +#include <linux/page-flags.h>
> > +#include <linux/hugetlb.h>
> > +#include <linux/ksm.h>
> > +#include <linux/swapops.h>
> >  #include "internal.h"
> >  
> >  /*
> > @@ -1562,6 +1566,65 @@ static int pipe_to_user(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
> >  	char *src;
> >  	int ret;
> >  
> > +	if (!buf->offset && (buf->len == PAGE_SIZE) &&
> > +	    (buf->flags & PIPE_BUF_FLAG_GIFT) && (sd->flags & SPLICE_F_MOVE)) {
> > +		struct page *page = buf->page;
> > +		struct mm_struct *mm;
> > +		struct vm_area_struct *vma;
> > +		spinlock_t *ptl;
> > +		pte_t *ptep, pte;
> > +		unsigned long useraddr;
> > +
> > +		if (!PageAnon(page))
> > +			goto copy;
> > +		if (PageCompound(page))
> > +			goto copy;
> > +		if (PageHuge(page) || PageTransHuge(page))
> > +			goto copy;
> > +		if (page_mapped(page))
> > +			goto copy;
> > +		useraddr = (unsigned long)sd->u.userptr;
> > +		mm = current->mm;
> > +
> > +		ret = -EAGAIN;
> > +		down_read(&mm->mmap_sem);
> > +		vma = find_vma_intersection(mm, useraddr, useraddr + PAGE_SIZE);
> > +		if (IS_ERR_OR_NULL(vma))
> > +			goto up_copy;
> > +		if (!vma->anon_vma) {
> > +			ret = anon_vma_prepare(vma);
> > +			if (ret)
> > +				goto up_copy;
> > +		}
> > +		zap_page_range(vma, useraddr, PAGE_SIZE, NULL);
> > +		ret = lock_page_killable(page);
> > +		if (ret)
> > +			goto up_copy;
> > +		ptep = get_locked_pte(mm, useraddr, &ptl);
> > +		if (!ptep)
> > +			goto unlock_up_copy;
> > +		pte = *ptep;
> > +		if (pte_present(pte))
> > +			goto unlock_up_copy;
> > +		get_page(page);
> > +		page_add_anon_rmap(page, vma, useraddr);
> > +		pte = mk_pte(page, vma->vm_page_prot);
> > +		set_pte_at(mm, useraddr, ptep, pte);
> > +		update_mmu_cache(vma, useraddr, ptep);
> > +		pte_unmap_unlock(ptep, ptl);
> > +		ret = 0;
> > +unlock_up_copy:
> > +		unlock_page(page);
> > +up_copy:
> > +		up_read(&mm->mmap_sem);
> > +		if (!ret) {
> > +			ret = sd->len;
> > +			goto out;
> > +		}
> > +		/* else ret < 0 and we should fallback to copying */
> > +		VM_BUG_ON(ret > 0);
> > +	}
> > +copy:
> >  	/*
> >  	 * See if we can use the atomic maps, by prefaulting in the
> >  	 * pages and doing an atomic copy
> > 
> 



* Re: [PATCH 1/2] vmsplice: unmap gifted pages for recipient
  2013-10-17 10:20   ` Vlastimil Babka
@ 2013-10-17 13:48     ` Robert Jennings
  2013-10-18  8:10       ` Vlastimil Babka
  0 siblings, 1 reply; 15+ messages in thread
From: Robert Jennings @ 2013-10-17 13:48 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-kernel, linux-fsdevel, linux-mm, Alexander Viro,
	Rik van Riel, Andrea Arcangeli, Dave Hansen, Matt Helsley,
	Anthony Liguori, Michael Roth, Lei Li, Leonardo Garcia

* Vlastimil Babka (vbabka@suse.cz) wrote:
> On 10/07/2013 10:21 PM, Robert C Jennings wrote:
> > Introduce use of the unused SPLICE_F_MOVE flag for vmsplice to zap
> > pages.
> > 
> > When vmsplice is called with flags (SPLICE_F_GIFT | SPLICE_F_MOVE) the
> > writer's gifted pages are zapped.  This patch supports further work
> > to move vmsplice'd pages rather than copying them.  That patch has the
> > restriction that the page must not be mapped by the source for the move,
> > otherwise it will fall back to copying the page.
> > 
> > Signed-off-by: Matt Helsley <matt.helsley@gmail.com>
> > Signed-off-by: Robert C Jennings <rcj@linux.vnet.ibm.com>
> > ---
> > Since the RFC went out I have coalesced the zap_page_range() call to
> > operate on VMAs rather than calling this for each page.  For a 256MB
> > vmsplice this reduced the write side 50% from the RFC.
> > ---
> >  fs/splice.c            | 51 +++++++++++++++++++++++++++++++++++++++++++++++++-
> >  include/linux/splice.h |  1 +
> >  2 files changed, 51 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/splice.c b/fs/splice.c
> > index 3b7ee65..a62d61e 100644
> > --- a/fs/splice.c
> > +++ b/fs/splice.c
> > @@ -188,12 +188,17 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
> >  {
> >  	unsigned int spd_pages = spd->nr_pages;
> >  	int ret, do_wakeup, page_nr;
> > +	struct vm_area_struct *vma;
> > +	unsigned long user_start, user_end;
> >  
> >  	ret = 0;
> >  	do_wakeup = 0;
> >  	page_nr = 0;
> > +	vma = NULL;
> > +	user_start = user_end = 0;
> >  
> >  	pipe_lock(pipe);
> > +	down_read(&current->mm->mmap_sem);
> 
> Seems like you could take the mmap_sem only when GIFT and MOVE are set.
> Maybe it won't help that much for performance, but it would at least
> document the reason the lock is needed.
> 
> Vlastimil
> 

I had been doing that previously, but moving this outside the loop and
acquiring it once did improve performance.  I'll add a comment at the
down_read() explaining why the lock is taken, though.

-Rob

> >  	for (;;) {
> >  		if (!pipe->readers) {
> > @@ -212,8 +217,44 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
> >  			buf->len = spd->partial[page_nr].len;
> >  			buf->private = spd->partial[page_nr].private;
> >  			buf->ops = spd->ops;
> > -			if (spd->flags & SPLICE_F_GIFT)
> > +			if (spd->flags & SPLICE_F_GIFT) {
> > +				unsigned long useraddr =
> > +						spd->partial[page_nr].useraddr;
> > +
> > +				if ((spd->flags & SPLICE_F_MOVE) &&
> > +						!buf->offset &&
> > +						(buf->len == PAGE_SIZE)) {
> > +					/* Can move page aligned buf, gather
> > +					 * requests to make a single
> > +					 * zap_page_range() call per VMA
> > +					 */
> > +					if (vma && (useraddr == user_end) &&
> > +						   ((useraddr + PAGE_SIZE) <=
> > +						    vma->vm_end)) {
> > +						/* same vma, no holes */
> > +						user_end += PAGE_SIZE;
> > +					} else {
> > +						if (vma)
> > +							zap_page_range(vma,
> > +								user_start,
> > +								(user_end -
> > +								 user_start),
> > +								NULL);
> > +						vma = find_vma_intersection(
> > +								current->mm,
> > +								useraddr,
> > +								(useraddr +
> > +								 PAGE_SIZE));
> > +						if (!IS_ERR_OR_NULL(vma)) {
> > +							user_start = useraddr;
> > +							user_end = (useraddr +
> > +								    PAGE_SIZE);
> > +						} else
> > +							vma = NULL;
> > +					}
> > +				}
> >  				buf->flags |= PIPE_BUF_FLAG_GIFT;
> > +			}
> >  
> >  			pipe->nrbufs++;
> >  			page_nr++;
> > @@ -255,6 +296,10 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
> >  		pipe->waiting_writers--;
> >  	}
> >  
> > +	if (vma)
> > +		zap_page_range(vma, user_start, (user_end - user_start), NULL);
> > +
> > +	up_read(&current->mm->mmap_sem);
> >  	pipe_unlock(pipe);
> >  
> >  	if (do_wakeup)
> > @@ -485,6 +530,7 @@ fill_it:
> >  
> >  		spd.partial[page_nr].offset = loff;
> >  		spd.partial[page_nr].len = this_len;
> > +		spd.partial[page_nr].useraddr = index << PAGE_CACHE_SHIFT;
> >  		len -= this_len;
> >  		loff = 0;
> >  		spd.nr_pages++;
> > @@ -656,6 +702,7 @@ ssize_t default_file_splice_read(struct file *in, loff_t *ppos,
> >  		this_len = min_t(size_t, vec[i].iov_len, res);
> >  		spd.partial[i].offset = 0;
> >  		spd.partial[i].len = this_len;
> > +		spd.partial[i].useraddr = (unsigned long)vec[i].iov_base;
> >  		if (!this_len) {
> >  			__free_page(spd.pages[i]);
> >  			spd.pages[i] = NULL;
> > @@ -1475,6 +1522,8 @@ static int get_iovec_page_array(const struct iovec __user *iov,
> >  
> >  			partial[buffers].offset = off;
> >  			partial[buffers].len = plen;
> > +			partial[buffers].useraddr = (unsigned long)base;
> > +			base = (void*)((unsigned long)base + PAGE_SIZE);
> >  
> >  			off = 0;
> >  			len -= plen;
> > diff --git a/include/linux/splice.h b/include/linux/splice.h
> > index 74575cb..56661e3 100644
> > --- a/include/linux/splice.h
> > +++ b/include/linux/splice.h
> > @@ -44,6 +44,7 @@ struct partial_page {
> >  	unsigned int offset;
> >  	unsigned int len;
> >  	unsigned long private;
> > +	unsigned long useraddr;
> >  };
> >  
> >  /*
> > 
> 



* Re: [PATCH 1/2] vmsplice: unmap gifted pages for recipient
  2013-10-08 16:23   ` Dave Hansen
@ 2013-10-17 13:54     ` Robert Jennings
  0 siblings, 0 replies; 15+ messages in thread
From: Robert Jennings @ 2013-10-17 13:54 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, linux-fsdevel, linux-mm, Alexander Viro,
	Rik van Riel, Andrea Arcangeli, Matt Helsley, Anthony Liguori,
	Michael Roth, Lei Li, Leonardo Garcia, Vlastimil Babka

* Dave Hansen (dave@sr71.net) wrote:
> On 10/07/2013 01:21 PM, Robert C Jennings wrote:
> >  		spd.partial[page_nr].offset = loff;
> >  		spd.partial[page_nr].len = this_len;
> > +		spd.partial[page_nr].useraddr = index << PAGE_CACHE_SHIFT;
> >  		len -= this_len;
> >  		loff = 0;
> >  		spd.nr_pages++;
> > @@ -656,6 +702,7 @@ ssize_t default_file_splice_read(struct file *in, loff_t *ppos,
> >  		this_len = min_t(size_t, vec[i].iov_len, res);
> >  		spd.partial[i].offset = 0;
> >  		spd.partial[i].len = this_len;
> > +		spd.partial[i].useraddr = (unsigned long)vec[i].iov_base;
> >  		if (!this_len) {
> >  			__free_page(spd.pages[i]);
> >  			spd.pages[i] = NULL;
> > @@ -1475,6 +1522,8 @@ static int get_iovec_page_array(const struct iovec __user *iov,
> >  
> >  			partial[buffers].offset = off;
> >  			partial[buffers].len = plen;
> > +			partial[buffers].useraddr = (unsigned long)base;
> > +			base = (void*)((unsigned long)base + PAGE_SIZE);
> >  
> >  			off = 0;
> >  			len -= plen;
> > diff --git a/include/linux/splice.h b/include/linux/splice.h
> > index 74575cb..56661e3 100644
> > --- a/include/linux/splice.h
> > +++ b/include/linux/splice.h
> > @@ -44,6 +44,7 @@ struct partial_page {
> >  	unsigned int offset;
> >  	unsigned int len;
> >  	unsigned long private;
> > +	unsigned long useraddr;
> >  };
> 
> "useraddr" confuses me.  You make it an 'unsigned long', yet two of the
> three assignments are from "void __user *".  The other assignment:
> 
> 	spd.partial[page_nr].useraddr = index << PAGE_CACHE_SHIFT;
> 
> 'index' looks to be the offset inside the file, not a user address, so
> I'm confused what that is doing.
> 
> Could you elaborate a little more on why 'useraddr' is suddenly needed
> in these patches?  How is "index << PAGE_CACHE_SHIFT" a virtual address?
>  Also, are we losing any of the advantages of sparse checking since
> 'useraddr' is without the __user annotation?
> 

I'm working on cleaning this up.  I'm trying to remove useraddr
altogether by using the existing 'private' field just for the
splice_to_user/pipe_to_user flow, without upsetting other uses of the
private field.
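
For example (a sketch; whether 'private' is truly unused on this path
is exactly the open question):

	spd.partial[i].private = (unsigned long)vec[i].iov_base;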



* Re: [PATCH 1/2] vmsplice: unmap gifted pages for recipient
  2013-10-17 13:48     ` Robert Jennings
@ 2013-10-18  8:10       ` Vlastimil Babka
  0 siblings, 0 replies; 15+ messages in thread
From: Vlastimil Babka @ 2013-10-18  8:10 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm, Alexander Viro,
	Rik van Riel, Andrea Arcangeli, Dave Hansen, Matt Helsley,
	Anthony Liguori, Michael Roth, Lei Li, Leonardo Garcia

On 10/17/2013 03:48 PM, Robert Jennings wrote:
> * Vlastimil Babka (vbabka@suse.cz) wrote:
>> On 10/07/2013 10:21 PM, Robert C Jennings wrote:
>>> Introduce use of the unused SPLICE_F_MOVE flag for vmsplice to zap
>>> pages.
>>>
>>> When vmsplice is called with flags (SPLICE_F_GIFT | SPLICE_F_MOVE) the
>>> writer's gifted pages are zapped.  This patch supports further work
>>> to move vmsplice'd pages rather than copying them.  That patch has the
>>> restriction that the page must not be mapped by the source for the move,
>>> otherwise it will fall back to copying the page.
>>>
>>> Signed-off-by: Matt Helsley <matt.helsley@gmail.com>
>>> Signed-off-by: Robert C Jennings <rcj@linux.vnet.ibm.com>
>>> ---
>>> Since the RFC went out I have coalesced the zap_page_range() call to
>>> operate on VMAs rather than calling this for each page.  For a 256MB
>>> vmsplice this reduced the write side 50% from the RFC.
>>> ---
>>>  fs/splice.c            | 51 +++++++++++++++++++++++++++++++++++++++++++++++++-
>>>  include/linux/splice.h |  1 +
>>>  2 files changed, 51 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/fs/splice.c b/fs/splice.c
>>> index 3b7ee65..a62d61e 100644
>>> --- a/fs/splice.c
>>> +++ b/fs/splice.c
>>> @@ -188,12 +188,17 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
>>>  {
>>>  	unsigned int spd_pages = spd->nr_pages;
>>>  	int ret, do_wakeup, page_nr;
>>> +	struct vm_area_struct *vma;
>>> +	unsigned long user_start, user_end;
>>>  
>>>  	ret = 0;
>>>  	do_wakeup = 0;
>>>  	page_nr = 0;
>>> +	vma = NULL;
>>> +	user_start = user_end = 0;
>>>  
>>>  	pipe_lock(pipe);
>>> +	down_read(&current->mm->mmap_sem);
>>
>> Seems like you could take the mmap_sem only when GIFT and MOVE are set.
>> Maybe it won't help that much for performance, but it would at least
>> document the reason the lock is needed.
>>
>> Vlastimil
>>
> 
> I had been doing that previously, but moving this outside the loop and
> acquiring it once did improve performance.  I'll add a comment at the
> down_read() explaining why the lock is taken, though.
> 
> -Rob

Hm, perhaps in light of recent patches that reduce mmap_sem usage to
only the really critical regions, it shouldn't be taken here at all
when it is not needed.

Vlastimil

>>>  	for (;;) {
>>>  		if (!pipe->readers) {
>>> @@ -212,8 +217,44 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
>>>  			buf->len = spd->partial[page_nr].len;
>>>  			buf->private = spd->partial[page_nr].private;
>>>  			buf->ops = spd->ops;
>>> -			if (spd->flags & SPLICE_F_GIFT)
>>> +			if (spd->flags & SPLICE_F_GIFT) {
>>> +				unsigned long useraddr =
>>> +						spd->partial[page_nr].useraddr;
>>> +
>>> +				if ((spd->flags & SPLICE_F_MOVE) &&
>>> +						!buf->offset &&
>>> +						(buf->len == PAGE_SIZE)) {
>>> +					/* Can move page aligned buf, gather
>>> +					 * requests to make a single
>>> +					 * zap_page_range() call per VMA
>>> +					 */
>>> +					if (vma && (useraddr == user_end) &&
>>> +						   ((useraddr + PAGE_SIZE) <=
>>> +						    vma->vm_end)) {
>>> +						/* same vma, no holes */
>>> +						user_end += PAGE_SIZE;
>>> +					} else {
>>> +						if (vma)
>>> +							zap_page_range(vma,
>>> +								user_start,
>>> +								(user_end -
>>> +								 user_start),
>>> +								NULL);
>>> +						vma = find_vma_intersection(
>>> +								current->mm,
>>> +								useraddr,
>>> +								(useraddr +
>>> +								 PAGE_SIZE));
>>> +						if (!IS_ERR_OR_NULL(vma)) {
>>> +							user_start = useraddr;
>>> +							user_end = (useraddr +
>>> +								    PAGE_SIZE);
>>> +						} else
>>> +							vma = NULL;
>>> +					}
>>> +				}
>>>  				buf->flags |= PIPE_BUF_FLAG_GIFT;
>>> +			}
>>>  
>>>  			pipe->nrbufs++;
>>>  			page_nr++;
>>> @@ -255,6 +296,10 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
>>>  		pipe->waiting_writers--;
>>>  	}
>>>  
>>> +	if (vma)
>>> +		zap_page_range(vma, user_start, (user_end - user_start), NULL);
>>> +
>>> +	up_read(&current->mm->mmap_sem);
>>>  	pipe_unlock(pipe);
>>>  
>>>  	if (do_wakeup)
>>> @@ -485,6 +530,7 @@ fill_it:
>>>  
>>>  		spd.partial[page_nr].offset = loff;
>>>  		spd.partial[page_nr].len = this_len;
>>> +		spd.partial[page_nr].useraddr = index << PAGE_CACHE_SHIFT;
>>>  		len -= this_len;
>>>  		loff = 0;
>>>  		spd.nr_pages++;
>>> @@ -656,6 +702,7 @@ ssize_t default_file_splice_read(struct file *in, loff_t *ppos,
>>>  		this_len = min_t(size_t, vec[i].iov_len, res);
>>>  		spd.partial[i].offset = 0;
>>>  		spd.partial[i].len = this_len;
>>> +		spd.partial[i].useraddr = (unsigned long)vec[i].iov_base;
>>>  		if (!this_len) {
>>>  			__free_page(spd.pages[i]);
>>>  			spd.pages[i] = NULL;
>>> @@ -1475,6 +1522,8 @@ static int get_iovec_page_array(const struct iovec __user *iov,
>>>  
>>>  			partial[buffers].offset = off;
>>>  			partial[buffers].len = plen;
>>> +			partial[buffers].useraddr = (unsigned long)base;
>>> +			base = (void*)((unsigned long)base + PAGE_SIZE);
>>>  
>>>  			off = 0;
>>>  			len -= plen;
>>> diff --git a/include/linux/splice.h b/include/linux/splice.h
>>> index 74575cb..56661e3 100644
>>> --- a/include/linux/splice.h
>>> +++ b/include/linux/splice.h
>>> @@ -44,6 +44,7 @@ struct partial_page {
>>>  	unsigned int offset;
>>>  	unsigned int len;
>>>  	unsigned long private;
>>> +	unsigned long useraddr;
>>>  };
>>>  
>>>  /*
>>>
>>
> 



Thread overview: 15+ messages
2013-10-07 20:21 [PATCH 0/2] vmsplice support for zero-copy gifting of pages Robert C Jennings
2013-10-07 20:21 ` [PATCH 1/2] vmsplice: unmap gifted pages for recipient Robert C Jennings
2013-10-08 16:14   ` Dave Hansen
2013-10-08 19:48     ` Robert Jennings
2013-10-08 21:22       ` Dave Hansen
2013-10-08 16:23   ` Dave Hansen
2013-10-17 13:54     ` Robert Jennings
2013-10-17 10:20   ` Vlastimil Babka
2013-10-17 13:48     ` Robert Jennings
2013-10-18  8:10       ` Vlastimil Babka
2013-10-07 20:21 ` [PATCH 2/2] vmsplice: Add limited zero copy to vmsplice Robert C Jennings
2013-10-08 16:45   ` Dave Hansen
2013-10-08 17:35     ` Robert Jennings
2013-10-17 11:23   ` Vlastimil Babka
2013-10-17 13:44     ` Robert Jennings
