Subject: RFC: get_user_pages_locked|unlocked to leverage VM_FAULT_RETRY
From: Andrea Arcangeli @ 2014-09-26 17:25 UTC
  To: Andres Lagar-Cavilla
  Cc: Gleb Natapov, Radim Krcmar, Paolo Bonzini, Rik van Riel,
	Peter Zijlstra, Mel Gorman, Andy Lutomirski, Andrew Morton,
	Sasha Levin, Jianyu Zhan, Paul Cassella, Hugh Dickins,
	Peter Feiner, kvm, linux-kernel, linux-mm,
	Dr. David Alan Gilbert

On Thu, Sep 25, 2014 at 02:50:29PM -0700, Andres Lagar-Cavilla wrote:
> It's nearly impossible to name it right because 1) it indicates we can
> relinquish 2) it returns whether we still hold the mmap semaphore.
> 
> I'd prefer it'd be called mmap_sem_hold, which conveys immediately
> what this is about ("nonblocking" or "locked" could be about a whole
> lot of things)

To me FOLL_NOWAIT/FAULT_FLAG_RETRY_NOWAIT is nonblocking, while
"locked"/FAULT_FLAG_ALLOW_RETRY is still very much blocking, just
without the mmap_sem, so I called it "locked"... but I'm fine with
changing the name to mmap_sem_hold. It's just that
get_user_pages_mmap_sem_hold seems less friendly than
get_user_pages_locked(..., &locked). "locked", as you used it, reads
intuitively when you later do "if (locked) up_read".

Then I added an _unlocked variant, which is a drop-in replacement for
many places, just to clean them up.

get_user_pages_unlocked and get_user_pages_fast are equivalent in
semantics, so any call of the form get_user_pages_unlocked(current,
current->mm, ...) has no reason to exist and should be replaced with
get_user_pages_fast, unless "force = 1" is needed (gup_fast has no
force parameter, which only makes the argument lists a bit more
confusing across the various versions of gup).
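
For instance (a hypothetical caller, only to illustrate the
equivalence; the variable names are made up and not taken from the
patch below):

    /* with tsk == current, mm == current->mm and force == 0 ... */
    ret = get_user_pages_unlocked(current, current->mm, addr,
                                  nr_pages, 1 /* write */, 0 /* force */,
                                  pages);
    /* ... is better written as the lockless fast path: */
    ret = get_user_pages_fast(addr, nr_pages, 1 /* write */, pages);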

get_user_pages over time should be phased out and dropped.

> I can see that. My background for coming into this is very similar: in
> a previous life we had a file system shim that would kick up into
> userspace for servicing VM memory. KVM just wouldn't let the file
> system give up the mmap semaphore. We had /proc readers hanging up all
> over the place while userspace was servicing. Not happy.
> 
> With KVM (now) and the standard x86 fault giving you ALLOW_RETRY, what
> stands in your way? Methinks that gup_fast has no slowpath fallback
> that turns on ALLOW_RETRY. What would oppose that being the global
> behavior?

It should become the global behavior. It just doesn't need to become
the global behavior immediately for all kinds of gup (e.g. video4linux
drivers will never need to poke into KVM guest user memory, so it
doesn't matter if they don't use gup_locked immediately). Even then we
can still support get_user_pages_locked(...., locked=NULL) for
ptrace/coredump and other things that may not want to trigger the
userfaultfd protocol and would rather get an immediate VM_FAULT_SIGBUS.

If FAULT_FLAG_ALLOW_RETRY is not set, userfaults will just return
VM_FAULT_SIGBUS (translated to -EFAULT by all gup invocations) and not
invoke the userfaultfd protocol. So any gup_locked call with locked ==
NULL, or any gup() call (which has no locked parameter), will not
invoke the userfaultfd protocol.
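
A caller that must not block on the userfaultfd protocol (a
hypothetical ptrace/coredump-style example, not a call site from the
patch below) would then simply pass a NULL "locked":

    down_read(&mm->mmap_sem);
    /*
     * locked == NULL: FAULT_FLAG_ALLOW_RETRY is never set, so a
     * userfault returns VM_FAULT_SIGBUS and gup reports -EFAULT
     * instead of blocking on the userfaultfd protocol.
     */
    ret = get_user_pages_locked(tsk, mm, addr, 1, 0 /* write */,
                                0 /* force */, &page, NULL);
    up_read(&mm->mmap_sem);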

But I do need gup_fast to use FAULT_FLAG_ALLOW_RETRY, because core
paths like O_DIRECT use it.

The RFC patch below goes in this direction and should be enough, for a
start, to solve all my issues with holding the mmap_sem inside
handle_userfault(). Comments welcome.

=======
>From 41918f7d922d1e7fc70f117db713377e7e2af6e9 Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli <aarcange@redhat.com>
Date: Fri, 26 Sep 2014 18:36:53 +0200
Subject: [PATCH 1/2] mm: gup: add get_user_pages_locked and
 get_user_pages_unlocked

We can leverage the VM_FAULT_RETRY functionality in the page fault
paths better by using either get_user_pages_locked or
get_user_pages_unlocked.

The former allows converting get_user_pages invocations that need to
pass a "&locked" parameter to know whether the mmap_sem was dropped
during the call. Example, from:

    down_read(&mm->mmap_sem);
    do_something()
    get_user_pages(tsk, mm, ..., pages, NULL);
    up_read(&mm->mmap_sem);

to:

    int locked = 1;
    down_read(&mm->mmap_sem);
    do_something()
    get_user_pages_locked(tsk, mm, ..., pages, &locked);
    if (locked)
        up_read(&mm->mmap_sem);

The latter is suitable only as a drop-in replacement for calls of the
form:

    down_read(&mm->mmap_sem);
    get_user_pages(tsk, mm, ..., pages, NULL);
    up_read(&mm->mmap_sem);

into:

    get_user_pages_unlocked(tsk, mm, ..., pages);

where tsk, mm, the intermediate "..." parameters and "pages" keep the
same values as before. Only the last parameter of get_user_pages (vmas)
must be NULL for get_user_pages_locked|unlocked to be usable (the
original form of the latter wouldn't have been safe anyway if vmas
wasn't NULL; for the former we just make it explicit by dropping the
parameter).

If vmas is not NULL these two methods cannot be used.

This patch then applies the new forms in various places, in some cases
also converting the call to get_user_pages_fast whenever tsk and mm are
current and current->mm. get_user_pages_unlocked differs from
get_user_pages_fast only when mm is not current->mm (i.e. when
get_user_pages works on some other process's mm). Whenever tsk and mm
match current and current->mm, get_user_pages_fast must always be used
instead, to increase performance and get the pages locklessly (with
only irqs disabled).

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/mips/mm/gup.c                 |   8 +-
 arch/powerpc/mm/gup.c              |   6 +-
 arch/s390/kvm/kvm-s390.c           |   4 +-
 arch/s390/mm/gup.c                 |   6 +-
 arch/sh/mm/gup.c                   |   6 +-
 arch/sparc/mm/gup.c                |   6 +-
 arch/x86/mm/gup.c                  |   7 +-
 drivers/dma/iovlock.c              |  10 +--
 drivers/iommu/amd_iommu_v2.c       |   6 +-
 drivers/media/pci/ivtv/ivtv-udma.c |   6 +-
 drivers/misc/sgi-gru/grufault.c    |   3 +-
 drivers/scsi/st.c                  |  10 +--
 drivers/video/fbdev/pvr2fb.c       |   5 +-
 include/linux/mm.h                 |   7 ++
 mm/gup.c                           | 147 ++++++++++++++++++++++++++++++++++---
 mm/mempolicy.c                     |   2 +-
 mm/nommu.c                         |  23 ++++++
 mm/process_vm_access.c             |   7 +-
 mm/util.c                          |  10 +--
 net/ceph/pagevec.c                 |   9 +--
 20 files changed, 200 insertions(+), 88 deletions(-)

diff --git a/arch/mips/mm/gup.c b/arch/mips/mm/gup.c
index 06ce17c..20884f5 100644
--- a/arch/mips/mm/gup.c
+++ b/arch/mips/mm/gup.c
@@ -301,11 +301,9 @@ slow_irqon:
 	start += nr << PAGE_SHIFT;
 	pages += nr;
 
-	down_read(&mm->mmap_sem);
-	ret = get_user_pages(current, mm, start,
-				(end - start) >> PAGE_SHIFT,
-				write, 0, pages, NULL);
-	up_read(&mm->mmap_sem);
+	ret = get_user_pages_unlocked(current, mm, start,
+				      (end - start) >> PAGE_SHIFT,
+				      write, 0, pages);
 
 	/* Have to be a bit careful with return values */
 	if (nr > 0) {
diff --git a/arch/powerpc/mm/gup.c b/arch/powerpc/mm/gup.c
index d874668..b70c34a 100644
--- a/arch/powerpc/mm/gup.c
+++ b/arch/powerpc/mm/gup.c
@@ -215,10 +215,8 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 		start += nr << PAGE_SHIFT;
 		pages += nr;
 
-		down_read(&mm->mmap_sem);
-		ret = get_user_pages(current, mm, start,
-				     nr_pages - nr, write, 0, pages, NULL);
-		up_read(&mm->mmap_sem);
+		ret = get_user_pages_unlocked(current, mm, start,
+					      nr_pages - nr, write, 0, pages);
 
 		/* Have to be a bit careful with return values */
 		if (nr > 0) {
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 81b0e11..37ca29a 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -1092,9 +1092,7 @@ long kvm_arch_fault_in_page(struct kvm_vcpu *vcpu, gpa_t gpa, int writable)
 	hva = gmap_fault(gpa, vcpu->arch.gmap);
 	if (IS_ERR_VALUE(hva))
 		return (long)hva;
-	down_read(&mm->mmap_sem);
-	rc = get_user_pages(current, mm, hva, 1, writable, 0, NULL, NULL);
-	up_read(&mm->mmap_sem);
+	rc = get_user_pages_unlocked(current, mm, hva, 1, writable, 0, NULL);
 
 	return rc < 0 ? rc : 0;
 }
diff --git a/arch/s390/mm/gup.c b/arch/s390/mm/gup.c
index 639fce46..5c586c7 100644
--- a/arch/s390/mm/gup.c
+++ b/arch/s390/mm/gup.c
@@ -235,10 +235,8 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 	/* Try to get the remaining pages with get_user_pages */
 	start += nr << PAGE_SHIFT;
 	pages += nr;
-	down_read(&mm->mmap_sem);
-	ret = get_user_pages(current, mm, start,
-			     nr_pages - nr, write, 0, pages, NULL);
-	up_read(&mm->mmap_sem);
+	ret = get_user_pages_unlocked(current, mm, start,
+			     nr_pages - nr, write, 0, pages);
 	/* Have to be a bit careful with return values */
 	if (nr > 0)
 		ret = (ret < 0) ? nr : ret + nr;
diff --git a/arch/sh/mm/gup.c b/arch/sh/mm/gup.c
index 37458f3..e15f52a 100644
--- a/arch/sh/mm/gup.c
+++ b/arch/sh/mm/gup.c
@@ -257,10 +257,8 @@ slow_irqon:
 		start += nr << PAGE_SHIFT;
 		pages += nr;
 
-		down_read(&mm->mmap_sem);
-		ret = get_user_pages(current, mm, start,
-			(end - start) >> PAGE_SHIFT, write, 0, pages, NULL);
-		up_read(&mm->mmap_sem);
+		ret = get_user_pages_unlocked(current, mm, start,
+			(end - start) >> PAGE_SHIFT, write, 0, pages);
 
 		/* Have to be a bit careful with return values */
 		if (nr > 0) {
diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
index 1aed043..fa7de7d 100644
--- a/arch/sparc/mm/gup.c
+++ b/arch/sparc/mm/gup.c
@@ -219,10 +219,8 @@ slow:
 		start += nr << PAGE_SHIFT;
 		pages += nr;
 
-		down_read(&mm->mmap_sem);
-		ret = get_user_pages(current, mm, start,
-			(end - start) >> PAGE_SHIFT, write, 0, pages, NULL);
-		up_read(&mm->mmap_sem);
+		ret = get_user_pages_unlocked(current, mm, start,
+			(end - start) >> PAGE_SHIFT, write, 0, pages);
 
 		/* Have to be a bit careful with return values */
 		if (nr > 0) {
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index 207d9aef..2ab183b 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -388,10 +388,9 @@ slow_irqon:
 		start += nr << PAGE_SHIFT;
 		pages += nr;
 
-		down_read(&mm->mmap_sem);
-		ret = get_user_pages(current, mm, start,
-			(end - start) >> PAGE_SHIFT, write, 0, pages, NULL);
-		up_read(&mm->mmap_sem);
+		ret = get_user_pages_unlocked(current, mm, start,
+					      (end - start) >> PAGE_SHIFT,
+					      write, 0, pages);
 
 		/* Have to be a bit careful with return values */
 		if (nr > 0) {
diff --git a/drivers/dma/iovlock.c b/drivers/dma/iovlock.c
index bb48a57..12ea7c3 100644
--- a/drivers/dma/iovlock.c
+++ b/drivers/dma/iovlock.c
@@ -95,17 +95,11 @@ struct dma_pinned_list *dma_pin_iovec_pages(struct iovec *iov, size_t len)
 		pages += page_list->nr_pages;
 
 		/* pin pages down */
-		down_read(&current->mm->mmap_sem);
-		ret = get_user_pages(
-			current,
-			current->mm,
+		ret = get_user_pages_fast(
 			(unsigned long) iov[i].iov_base,
 			page_list->nr_pages,
 			1,	/* write */
-			0,	/* force */
-			page_list->pages,
-			NULL);
-		up_read(&current->mm->mmap_sem);
+			page_list->pages);
 
 		if (ret != page_list->nr_pages)
 			goto unpin;
diff --git a/drivers/iommu/amd_iommu_v2.c b/drivers/iommu/amd_iommu_v2.c
index 5f578e8..6963b73 100644
--- a/drivers/iommu/amd_iommu_v2.c
+++ b/drivers/iommu/amd_iommu_v2.c
@@ -519,10 +519,8 @@ static void do_fault(struct work_struct *work)
 
 	write = !!(fault->flags & PPR_FAULT_WRITE);
 
-	down_read(&fault->state->mm->mmap_sem);
-	npages = get_user_pages(NULL, fault->state->mm,
-				fault->address, 1, write, 0, &page, NULL);
-	up_read(&fault->state->mm->mmap_sem);
+	npages = get_user_pages_unlocked(NULL, fault->state->mm,
+					 fault->address, 1, write, 0, &page);
 
 	if (npages == 1) {
 		put_page(page);
diff --git a/drivers/media/pci/ivtv/ivtv-udma.c b/drivers/media/pci/ivtv/ivtv-udma.c
index 7338cb2..96d866b 100644
--- a/drivers/media/pci/ivtv/ivtv-udma.c
+++ b/drivers/media/pci/ivtv/ivtv-udma.c
@@ -124,10 +124,8 @@ int ivtv_udma_setup(struct ivtv *itv, unsigned long ivtv_dest_addr,
 	}
 
 	/* Get user pages for DMA Xfer */
-	down_read(&current->mm->mmap_sem);
-	err = get_user_pages(current, current->mm,
-			user_dma.uaddr, user_dma.page_count, 0, 1, dma->map, NULL);
-	up_read(&current->mm->mmap_sem);
+	err = get_user_pages_unlocked(current, current->mm,
+			user_dma.uaddr, user_dma.page_count, 0, 1, dma->map);
 
 	if (user_dma.page_count != err) {
 		IVTV_DEBUG_WARN("failed to map user pages, returned %d instead of %d\n",
diff --git a/drivers/misc/sgi-gru/grufault.c b/drivers/misc/sgi-gru/grufault.c
index f74fc0c..cd20669 100644
--- a/drivers/misc/sgi-gru/grufault.c
+++ b/drivers/misc/sgi-gru/grufault.c
@@ -198,8 +198,7 @@ static int non_atomic_pte_lookup(struct vm_area_struct *vma,
 #else
 	*pageshift = PAGE_SHIFT;
 #endif
-	if (get_user_pages
-	    (current, current->mm, vaddr, 1, write, 0, &page, NULL) <= 0)
+	if (get_user_pages_fast(vaddr, 1, write, &page) <= 0)
 		return -EFAULT;
 	*paddr = page_to_phys(page);
 	put_page(page);
diff --git a/drivers/scsi/st.c b/drivers/scsi/st.c
index aff9689..c89dcfa 100644
--- a/drivers/scsi/st.c
+++ b/drivers/scsi/st.c
@@ -4536,18 +4536,12 @@ static int sgl_map_user_pages(struct st_buffer *STbp,
 		return -ENOMEM;
 
         /* Try to fault in all of the necessary pages */
-	down_read(&current->mm->mmap_sem);
         /* rw==READ means read from drive, write into memory area */
-	res = get_user_pages(
-		current,
-		current->mm,
+	res = get_user_pages_fast(
 		uaddr,
 		nr_pages,
 		rw == READ,
-		0, /* don't force */
-		pages,
-		NULL);
-	up_read(&current->mm->mmap_sem);
+		pages);
 
 	/* Errors and no page mapped should return here */
 	if (res < nr_pages)
diff --git a/drivers/video/fbdev/pvr2fb.c b/drivers/video/fbdev/pvr2fb.c
index 167cfff..ff81f65 100644
--- a/drivers/video/fbdev/pvr2fb.c
+++ b/drivers/video/fbdev/pvr2fb.c
@@ -686,10 +686,7 @@ static ssize_t pvr2fb_write(struct fb_info *info, const char *buf,
 	if (!pages)
 		return -ENOMEM;
 
-	down_read(&current->mm->mmap_sem);
-	ret = get_user_pages(current, current->mm, (unsigned long)buf,
-			     nr_pages, WRITE, 0, pages, NULL);
-	up_read(&current->mm->mmap_sem);
+	ret = get_user_pages_fast((unsigned long)buf, nr_pages, WRITE, pages);
 
 	if (ret < nr_pages) {
 		nr_pages = ret;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 32ba786..69f692d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1197,6 +1197,13 @@ long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		    unsigned long start, unsigned long nr_pages,
 		    int write, int force, struct page **pages,
 		    struct vm_area_struct **vmas);
+long get_user_pages_locked(struct task_struct *tsk, struct mm_struct *mm,
+		    unsigned long start, unsigned long nr_pages,
+		    int write, int force, struct page **pages,
+		    int *locked);
+long get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm,
+		    unsigned long start, unsigned long nr_pages,
+		    int write, int force, struct page **pages);
 int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 			struct page **pages);
 struct kvec;
diff --git a/mm/gup.c b/mm/gup.c
index 91d044b..19e17ab 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -576,6 +576,134 @@ int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
 	return 0;
 }
 
+static inline long __get_user_pages_locked(struct task_struct *tsk,
+					   struct mm_struct *mm,
+					   unsigned long start,
+					   unsigned long nr_pages,
+					   int write, int force,
+					   struct page **pages,
+					   struct vm_area_struct **vmas,
+					   int *locked,
+					   bool immediate_unlock)
+{
+	int flags = FOLL_TOUCH;
+	long ret, pages_done;
+	bool lock_dropped;
+
+	if (locked) {
+		/* if VM_FAULT_RETRY can be returned, vmas become invalid */
+		BUG_ON(vmas);
+		/* check caller initialized locked */
+		BUG_ON(*locked != 1);
+	} else {
+		/*
+		 * Not really important, the value is irrelevant if
+		 * locked is NULL, but BUILD_BUG_ON costs nothing.
+		 */
+		BUILD_BUG_ON(immediate_unlock);
+	}
+
+	if (pages)
+		flags |= FOLL_GET;
+	if (write)
+		flags |= FOLL_WRITE;
+	if (force)
+		flags |= FOLL_FORCE;
+
+	pages_done = 0;
+	lock_dropped = false;
+	for (;;) {
+		ret = __get_user_pages(tsk, mm, start, nr_pages, flags, pages,
+				       vmas, locked);
+		if (!locked)
+			/* VM_FAULT_RETRY couldn't trigger, bypass */
+			return ret;
+
+		/* VM_FAULT_RETRY cannot return errors */
+		if (!*locked) {
+			BUG_ON(ret < 0);
+			BUG_ON(nr_pages == 1 && ret);
+		}
+
+		if (!pages)
+			/* If it's a prefault don't insist harder */
+			return ret;
+
+		if (ret > 0) {
+			nr_pages -= ret;
+			pages_done += ret;
+			if (!nr_pages)
+				break;
+		}
+		if (*locked) {
+			/* VM_FAULT_RETRY didn't trigger */
+			if (!pages_done)
+				pages_done = ret;
+			break;
+		}
+		/* VM_FAULT_RETRY triggered, so seek to the faulting offset */
+		pages += ret;
+		start += ret << PAGE_SHIFT;
+
+		/*
+		 * Repeat on the address that fired VM_FAULT_RETRY
+		 * without FAULT_FLAG_ALLOW_RETRY but with
+		 * FAULT_FLAG_TRIED.
+		 */
+		*locked = 1;
+		lock_dropped = true;
+		down_read(&mm->mmap_sem);
+		ret = __get_user_pages(tsk, mm, start, nr_pages, flags | FOLL_TRIED,
+				       pages, NULL, NULL);
+		if (ret != 1) {
+			BUG_ON(ret > 1);
+			if (!pages_done)
+				pages_done = ret;
+			break;
+		}
+		nr_pages--;
+		pages_done++;
+		if (!nr_pages)
+			break;
+		pages++;
+		start += PAGE_SIZE;
+	}
+	if (!immediate_unlock && lock_dropped && *locked) {
+		/*
+		 * We must let the caller know we temporarily dropped the lock
+		 * and so the critical section protected by it was lost.
+		 */
+		up_read(&mm->mmap_sem);
+		*locked = 0;
+	}
+	return pages_done;
+}
+
+long get_user_pages_locked(struct task_struct *tsk, struct mm_struct *mm,
+			   unsigned long start, unsigned long nr_pages,
+			   int write, int force, struct page **pages,
+			   int *locked)
+{
+	return __get_user_pages_locked(tsk, mm, start, nr_pages, write, force,
+				       pages, NULL, locked, false);
+}
+EXPORT_SYMBOL(get_user_pages_locked);
+
+long get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm,
+			     unsigned long start, unsigned long nr_pages,
+			     int write, int force, struct page **pages)
+{
+	long ret;
+	int locked = 1;
+	down_read(&mm->mmap_sem);
+	ret = __get_user_pages_locked(tsk, mm, start, nr_pages, write, force,
+				      pages, NULL, &locked, true);
+	if (locked)
+		up_read(&mm->mmap_sem);
+	return ret;
+}
+EXPORT_SYMBOL(get_user_pages_unlocked);
+
 /*
  * get_user_pages() - pin user pages in memory
  * @tsk:	the task_struct to use for page fault accounting, or
@@ -625,22 +753,19 @@ int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
  * use the correct cache flushing APIs.
  *
  * See also get_user_pages_fast, for performance critical applications.
+ *
+ * get_user_pages should be gradually obsoleted in favor of
+ * get_user_pages_locked|unlocked. Nothing should use get_user_pages
+ * because it cannot pass FAULT_FLAG_ALLOW_RETRY to handle_mm_fault in
+ * turn disabling the userfaultfd feature (after that "inline" can be
+ * cleaned up from get_user_pages_locked).
  */
 long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		unsigned long start, unsigned long nr_pages, int write,
 		int force, struct page **pages, struct vm_area_struct **vmas)
 {
-	int flags = FOLL_TOUCH;
-
-	if (pages)
-		flags |= FOLL_GET;
-	if (write)
-		flags |= FOLL_WRITE;
-	if (force)
-		flags |= FOLL_FORCE;
-
-	return __get_user_pages(tsk, mm, start, nr_pages, flags, pages, vmas,
-				NULL);
+	return __get_user_pages_locked(tsk, mm, start, nr_pages, write, force,
+				       pages, vmas, NULL, false);
 }
 EXPORT_SYMBOL(get_user_pages);
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 8f5330d..6606c10 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -881,7 +881,7 @@ static int lookup_node(struct mm_struct *mm, unsigned long addr)
 	struct page *p;
 	int err;
 
-	err = get_user_pages(current, mm, addr & PAGE_MASK, 1, 0, 0, &p, NULL);
+	err = get_user_pages_fast(addr & PAGE_MASK, 1, 0, &p);
 	if (err >= 0) {
 		err = page_to_nid(p);
 		put_page(p);
diff --git a/mm/nommu.c b/mm/nommu.c
index a881d96..8a06341 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -213,6 +213,29 @@ long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 }
 EXPORT_SYMBOL(get_user_pages);
 
+long get_user_pages_locked(struct task_struct *tsk, struct mm_struct *mm,
+			   unsigned long start, unsigned long nr_pages,
+			   int write, int force, struct page **pages,
+			   int *locked)
+{
+	return get_user_pages(tsk, mm, start, nr_pages, write, force,
+			      pages, NULL);
+}
+EXPORT_SYMBOL(get_user_pages_locked);
+
+long get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm,
+			     unsigned long start, unsigned long nr_pages,
+			     int write, int force, struct page **pages)
+{
+	long ret;
+	down_read(&mm->mmap_sem);
+	ret = get_user_pages(tsk, mm, start, nr_pages, write, force,
+			     pages, NULL);
+	up_read(&mm->mmap_sem);
+	return ret;
+}
+EXPORT_SYMBOL(get_user_pages_unlocked);
+
 /**
  * follow_pfn - look up PFN at a user virtual address
  * @vma: memory mapping
diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
index 5077afc..b159769 100644
--- a/mm/process_vm_access.c
+++ b/mm/process_vm_access.c
@@ -99,11 +99,8 @@ static int process_vm_rw_single_vec(unsigned long addr,
 		size_t bytes;
 
 		/* Get the pages we're interested in */
-		down_read(&mm->mmap_sem);
-		pages = get_user_pages(task, mm, pa, pages,
-				      vm_write, 0, process_pages, NULL);
-		up_read(&mm->mmap_sem);
-
+		pages = get_user_pages_unlocked(task, mm, pa, pages,
+						vm_write, 0, process_pages);
 		if (pages <= 0)
 			return -EFAULT;
 
diff --git a/mm/util.c b/mm/util.c
index 093c973..1b93f2d 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -247,14 +247,8 @@ int __weak get_user_pages_fast(unsigned long start,
 				int nr_pages, int write, struct page **pages)
 {
 	struct mm_struct *mm = current->mm;
-	int ret;
-
-	down_read(&mm->mmap_sem);
-	ret = get_user_pages(current, mm, start, nr_pages,
-					write, 0, pages, NULL);
-	up_read(&mm->mmap_sem);
-
-	return ret;
+	return get_user_pages_unlocked(current, mm, start, nr_pages,
+				       write, 0, pages);
 }
 EXPORT_SYMBOL_GPL(get_user_pages_fast);
 
diff --git a/net/ceph/pagevec.c b/net/ceph/pagevec.c
index 5550130..5504783 100644
--- a/net/ceph/pagevec.c
+++ b/net/ceph/pagevec.c
@@ -23,17 +23,16 @@ struct page **ceph_get_direct_page_vector(const void __user *data,
 	if (!pages)
 		return ERR_PTR(-ENOMEM);
 
-	down_read(&current->mm->mmap_sem);
 	while (got < num_pages) {
-		rc = get_user_pages(current, current->mm,
-		    (unsigned long)data + ((unsigned long)got * PAGE_SIZE),
-		    num_pages - got, write_page, 0, pages + got, NULL);
+		rc = get_user_pages_fast((unsigned long)data +
+					 ((unsigned long)got * PAGE_SIZE),
+					 num_pages - got,
+					 write_page, pages + got);
 		if (rc < 0)
 			break;
 		BUG_ON(rc == 0);
 		got += rc;
 	}
-	up_read(&current->mm->mmap_sem);
 	if (rc < 0)
 		goto fail;
 	return pages;


Then, as an example, your patch would have become:

===
>From 74d88763cde285354fb78806ffb332030d1f0739 Mon Sep 17 00:00:00 2001
From: Andres Lagar-Cavilla <andreslc@google.com>
Date: Fri, 26 Sep 2014 18:36:56 +0200
Subject: [PATCH 2/2] kvm: Faults which trigger IO release the mmap_sem
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

When KVM handles a tdp fault it uses FOLL_NOWAIT. If the guest memory
has been swapped out or is behind a filemap, this will trigger async
readahead and return immediately. The rationale is that KVM will kick
back the guest with an "async page fault" and allow for some other
guest process to take over.

If async PFs are enabled the fault is retried asap from an async
workqueue. If not, it's retried immediately in the same code path. In
either case the retry will not relinquish the mmap semaphore and will
block on the IO. This is a bad thing, as other mmap semaphore users
now stall as a function of swap or filemap latency.

This patch ensures both the regular and async PF path re-enter the
fault allowing for the mmap semaphore to be relinquished in the case
of IO wait.

Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Andres Lagar-Cavilla <andreslc@google.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/mm.h  | 1 +
 mm/gup.c            | 4 ++++
 virt/kvm/async_pf.c | 4 +---
 virt/kvm/kvm_main.c | 4 ++--
 4 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 69f692d..71dbe03 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1997,6 +1997,7 @@ static inline struct page *follow_page(struct vm_area_struct *vma,
 #define FOLL_HWPOISON	0x100	/* check page is hwpoisoned */
 #define FOLL_NUMA	0x200	/* force NUMA hinting page fault */
 #define FOLL_MIGRATION	0x400	/* wait for page to replace migration entry */
+#define FOLL_TRIED	0x800	/* a retry, previous pass started an IO */
 
 typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
 			void *data);
diff --git a/mm/gup.c b/mm/gup.c
index 19e17ab..369b3f6 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -281,6 +281,10 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
 		fault_flags |= FAULT_FLAG_ALLOW_RETRY;
 	if (*flags & FOLL_NOWAIT)
 		fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT;
+	if (*flags & FOLL_TRIED) {
+		VM_WARN_ON_ONCE(fault_flags & FAULT_FLAG_ALLOW_RETRY);
+		fault_flags |= FAULT_FLAG_TRIED;
+	}
 
 	ret = handle_mm_fault(mm, vma, address, fault_flags);
 	if (ret & VM_FAULT_ERROR) {
diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
index d6a3d09..44660ae 100644
--- a/virt/kvm/async_pf.c
+++ b/virt/kvm/async_pf.c
@@ -80,9 +80,7 @@ static void async_pf_execute(struct work_struct *work)
 
 	might_sleep();
 
-	down_read(&mm->mmap_sem);
-	get_user_pages(NULL, mm, addr, 1, 1, 0, NULL, NULL);
-	up_read(&mm->mmap_sem);
+	get_user_pages_unlocked(NULL, mm, addr, 1, 1, 0, NULL);
 	kvm_async_page_present_sync(vcpu, apf);
 
 	spin_lock(&vcpu->async_pf.lock);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 95519bc..921bce7 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1170,8 +1170,8 @@ static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
 					      addr, write_fault, page);
 		up_read(&current->mm->mmap_sem);
 	} else
-		npages = get_user_pages_fast(addr, 1, write_fault,
-					     page);
+		npages = get_user_pages_unlocked(current, current->mm, addr, 1,
+						 write_fault, 0, page);
 	if (npages != 1)
 		return npages;
 


This isn't bisectable in this order and it's untested anyway. It needs
to be split into more patches.

This is just an initial RFC to find out whether it's ok to go in this
direction.

If it's ok I'll do some testing and submit it more properly. If your
patches go in first that's fine, and I'll just replace the call in KVM
with get_user_pages_unlocked (or whatever we end up calling that
thing).

I'd need to get this (or an equivalent solution) merged before
re-submitting the userfaultfd patchset. I think the above benefits the
kernel as a whole in terms of mmap_sem hold times, regardless of
userfaultfd, so it should be a good change on its own.

> Well, IIUC every code path that has ALLOW_RETRY dives in the second
> time with FAULT_TRIED or similar. In the common case, you happily
> blaze through the second time, but if someone raced in while all locks
> were given up, one pays the price of the second time being a full
> fault hogging the mmap sem. At some point you need to not keep being
> polite otherwise the task starves. Presumably the risk of an extra
> retry drops steeply every new gup retry. Maybe just try three times is
> good enough. It makes for ugly logic though.

I was under the impression that if one looped forever with
VM_FAULT_RETRY it would eventually succeed, but that risks doing more
work. So I also stick to the "locked != NULL" pass first, then seek to
the first page that returned VM_FAULT_RETRY, issue an nr_pages=1 gup
with locked == NULL, and then continue with locked != NULL at the next
page, just like you did in the KVM slow path. And if the "pages" array
is NULL I bail out at the first VM_FAULT_RETRY failure without
insisting with further gup calls of the "&locked" kind; your patch had
just 1 page but you also bailed out.

What the code above basically does is generalize your KVM optimization
and make it global, and at the same time it saves me trouble in
handle_userfault().

While at it I also converted some obvious candidates to gup_fast that
had no reason to run slower (which I should split off into a separate
patch).

Thanks,
Andrea

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* RFC: get_user_pages_locked|unlocked to leverage VM_FAULT_RETRY
@ 2014-09-26 17:25 ` Andrea Arcangeli
  0 siblings, 0 replies; 18+ messages in thread
From: Andrea Arcangeli @ 2014-09-26 17:25 UTC (permalink / raw)
  To: Andres Lagar-Cavilla
  Cc: Gleb Natapov, Radim Krcmar, Paolo Bonzini, Rik van Riel,
	Peter Zijlstra, Mel Gorman, Andy Lutomirski, Andrew Morton,
	Sasha Levin, Jianyu Zhan, Paul Cassella, Hugh Dickins,
	Peter Feiner, kvm, linux-kernel, linux-mm,
	Dr. David Alan Gilbert

On Thu, Sep 25, 2014 at 02:50:29PM -0700, Andres Lagar-Cavilla wrote:
> It's nearly impossible to name it right because 1) it indicates we can
> relinquish 2) it returns whether we still hold the mmap semaphore.
> 
> I'd prefer it'd be called mmap_sem_hold, which conveys immediately
> what this is about ("nonblocking" or "locked" could be about a whole
> lot of things)

To me FOLL_NOWAIT/FAULT_FLAG_RETRY_NOWAIT is nonblocking,
"locked"/FAULT_FLAG_ALLOW_RETRY is still very much blocking, just
without the mmap_sem, so I called it "locked"... but I'm fine to
change the name to mmap_sem_hold. Just get_user_pages_mmap_sem_hold
seems less friendly than get_user_pages_locked(..., &locked). locked
as you used comes intuitive when you do later "if (locked) up_read".

Then I added an _unlocked kind which is a drop in replacement for many
places just to clean it up.

get_user_pages_unlocked and get_user_pages_fast are equivalent in
semantics, so any call of get_user_pages_unlocked(current,
current->mm, ...) has no reason to exist and should be replaced to
get_user_pages_fast unless "force = 1" (gup_fast has no force param
just to make the argument list a bit more confusing across the various
versions of gup).

get_user_pages over time should be phased out and dropped.

> I can see that. My background for coming into this is very similar: in
> a previous life we had a file system shim that would kick up into
> userspace for servicing VM memory. KVM just wouldn't let the file
> system give up the mmap semaphore. We had /proc readers hanging up all
> over the place while userspace was servicing. Not happy.
> 
> With KVM (now) and the standard x86 fault giving you ALLOW_RETRY, what
> stands in your way? Methinks that gup_fast has no slowpath fallback
> that turns on ALLOW_RETRY. What would oppose that being the global
> behavior?

It should become the global behavior. Just it doesn't need to become a
global behavior immediately for all kind of gups (i.e. video4linux
drivers will never need to poke into the KVM guest user memory so it
doesn't matter if they don't use gup_locked immediately). Even then we
can still support get_user_pages_locked(...., locked=NULL) for
ptrace/coredump and other things that may not want to trigger the
userfaultfd protocol and just get an immediate VM_FAULT_SIGBUS.

Userfaults will just VM_FAULT_SIGBUS (translated to -EFAULT by all gup
invocations) and not invoke the userfaultfd protocol, if
FAULT_FLAG_ALLOW_RETRY is not set. So any gup_locked with locked ==
NULL or or gup() (without locked parameter) will not invoke the
userfaultfd protocol.

But I need gup_fast to use FAULT_FLAG_ALLOW_RETRY because core places
like O_DIRECT uses it.

I tried to do a RFC patch below that goes into this direction and
should be enough for a start to solve all my issues with the mmap_sem
holding inside handle_userfault(), comments welcome.

=======
>From 41918f7d922d1e7fc70f117db713377e7e2af6e9 Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli <aarcange@redhat.com>
Date: Fri, 26 Sep 2014 18:36:53 +0200
Subject: [PATCH 1/2] mm: gup: add get_user_pages_locked and
 get_user_pages_unlocked

We can leverage the VM_FAULT_RETRY functionality in the page fault
paths better by using either get_user_pages_locked or
get_user_pages_unlocked.

The former allow conversion of get_user_pages invocations that will
have to pass a "&locked" parameter to know if the mmap_sem was dropped
during the call. Example from:

    down_read(&mm->mmap_sem);
    do_something()
    get_user_pages(tsk, mm, ..., pages, NULL);
    up_read(&mm->mmap_sem);

to:

    int locked = 1;
    down_read(&mm->mmap_sem);
    do_something()
    get_user_pages_locked(tsk, mm, ..., pages, &locked);
    if (locked)
        up_read(&mm->mmap_sem);

The latter is suitable only as a drop in replacement of the form:

    down_read(&mm->mmap_sem);
    get_user_pages(tsk, mm, ..., pages, NULL);
    up_read(&mm->mmap_sem);

into:

    get_user_pages_unlocked(tsk, mm, ..., pages);

Where tsk, mm, the intermediate "..." paramters and "pages" can be any
value as before. Just the last parameter of get_user_pages (vmas) must
be NULL for get_user_pages_locked|unlocked to be usable (the latter
original form wouldn't have been safe anyway if vmas wasn't null, for
the former we just make it explicit by dropping the parameter).

If vmas is not NULL these two methods cannot be used.

This patch then applies the new forms in various places, in some case
also replacing it with get_user_pages_fast whenever tsk and mm are
current and current->mm. get_user_pages_unlocked varies from
get_user_pages_fast only if mm is not current->mm (like when
get_user_pages works on some other process mm). Whenever tsk and mm
matches current and current->mm get_user_pages_fast must always be
used to increase performance and get the page lockless (only with irq
disabled).

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/mips/mm/gup.c                 |   8 +-
 arch/powerpc/mm/gup.c              |   6 +-
 arch/s390/kvm/kvm-s390.c           |   4 +-
 arch/s390/mm/gup.c                 |   6 +-
 arch/sh/mm/gup.c                   |   6 +-
 arch/sparc/mm/gup.c                |   6 +-
 arch/x86/mm/gup.c                  |   7 +-
 drivers/dma/iovlock.c              |  10 +--
 drivers/iommu/amd_iommu_v2.c       |   6 +-
 drivers/media/pci/ivtv/ivtv-udma.c |   6 +-
 drivers/misc/sgi-gru/grufault.c    |   3 +-
 drivers/scsi/st.c                  |  10 +--
 drivers/video/fbdev/pvr2fb.c       |   5 +-
 include/linux/mm.h                 |   7 ++
 mm/gup.c                           | 147 ++++++++++++++++++++++++++++++++++---
 mm/mempolicy.c                     |   2 +-
 mm/nommu.c                         |  23 ++++++
 mm/process_vm_access.c             |   7 +-
 mm/util.c                          |  10 +--
 net/ceph/pagevec.c                 |   9 +--
 20 files changed, 200 insertions(+), 88 deletions(-)

diff --git a/arch/mips/mm/gup.c b/arch/mips/mm/gup.c
index 06ce17c..20884f5 100644
--- a/arch/mips/mm/gup.c
+++ b/arch/mips/mm/gup.c
@@ -301,11 +301,9 @@ slow_irqon:
 	start += nr << PAGE_SHIFT;
 	pages += nr;
 
-	down_read(&mm->mmap_sem);
-	ret = get_user_pages(current, mm, start,
-				(end - start) >> PAGE_SHIFT,
-				write, 0, pages, NULL);
-	up_read(&mm->mmap_sem);
+	ret = get_user_pages_unlocked(current, mm, start,
+				      (end - start) >> PAGE_SHIFT,
+				      write, 0, pages);
 
 	/* Have to be a bit careful with return values */
 	if (nr > 0) {
diff --git a/arch/powerpc/mm/gup.c b/arch/powerpc/mm/gup.c
index d874668..b70c34a 100644
--- a/arch/powerpc/mm/gup.c
+++ b/arch/powerpc/mm/gup.c
@@ -215,10 +215,8 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 		start += nr << PAGE_SHIFT;
 		pages += nr;
 
-		down_read(&mm->mmap_sem);
-		ret = get_user_pages(current, mm, start,
-				     nr_pages - nr, write, 0, pages, NULL);
-		up_read(&mm->mmap_sem);
+		ret = get_user_pages_unlocked(current, mm, start,
+					      nr_pages - nr, write, 0, pages);
 
 		/* Have to be a bit careful with return values */
 		if (nr > 0) {
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 81b0e11..37ca29a 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -1092,9 +1092,7 @@ long kvm_arch_fault_in_page(struct kvm_vcpu *vcpu, gpa_t gpa, int writable)
 	hva = gmap_fault(gpa, vcpu->arch.gmap);
 	if (IS_ERR_VALUE(hva))
 		return (long)hva;
-	down_read(&mm->mmap_sem);
-	rc = get_user_pages(current, mm, hva, 1, writable, 0, NULL, NULL);
-	up_read(&mm->mmap_sem);
+	rc = get_user_pages_unlocked(current, mm, hva, 1, writable, 0, NULL);
 
 	return rc < 0 ? rc : 0;
 }
diff --git a/arch/s390/mm/gup.c b/arch/s390/mm/gup.c
index 639fce46..5c586c7 100644
--- a/arch/s390/mm/gup.c
+++ b/arch/s390/mm/gup.c
@@ -235,10 +235,8 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 	/* Try to get the remaining pages with get_user_pages */
 	start += nr << PAGE_SHIFT;
 	pages += nr;
-	down_read(&mm->mmap_sem);
-	ret = get_user_pages(current, mm, start,
-			     nr_pages - nr, write, 0, pages, NULL);
-	up_read(&mm->mmap_sem);
+	ret = get_user_pages_unlocked(current, mm, start,
+			     nr_pages - nr, write, 0, pages);
 	/* Have to be a bit careful with return values */
 	if (nr > 0)
 		ret = (ret < 0) ? nr : ret + nr;
diff --git a/arch/sh/mm/gup.c b/arch/sh/mm/gup.c
index 37458f3..e15f52a 100644
--- a/arch/sh/mm/gup.c
+++ b/arch/sh/mm/gup.c
@@ -257,10 +257,8 @@ slow_irqon:
 		start += nr << PAGE_SHIFT;
 		pages += nr;
 
-		down_read(&mm->mmap_sem);
-		ret = get_user_pages(current, mm, start,
-			(end - start) >> PAGE_SHIFT, write, 0, pages, NULL);
-		up_read(&mm->mmap_sem);
+		ret = get_user_pages_unlocked(current, mm, start,
+			(end - start) >> PAGE_SHIFT, write, 0, pages);
 
 		/* Have to be a bit careful with return values */
 		if (nr > 0) {
diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
index 1aed043..fa7de7d 100644
--- a/arch/sparc/mm/gup.c
+++ b/arch/sparc/mm/gup.c
@@ -219,10 +219,8 @@ slow:
 		start += nr << PAGE_SHIFT;
 		pages += nr;
 
-		down_read(&mm->mmap_sem);
-		ret = get_user_pages(current, mm, start,
-			(end - start) >> PAGE_SHIFT, write, 0, pages, NULL);
-		up_read(&mm->mmap_sem);
+		ret = get_user_pages_unlocked(current, mm, start,
+			(end - start) >> PAGE_SHIFT, write, 0, pages);
 
 		/* Have to be a bit careful with return values */
 		if (nr > 0) {
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index 207d9aef..2ab183b 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -388,10 +388,9 @@ slow_irqon:
 		start += nr << PAGE_SHIFT;
 		pages += nr;
 
-		down_read(&mm->mmap_sem);
-		ret = get_user_pages(current, mm, start,
-			(end - start) >> PAGE_SHIFT, write, 0, pages, NULL);
-		up_read(&mm->mmap_sem);
+		ret = get_user_pages_unlocked(current, mm, start,
+					      (end - start) >> PAGE_SHIFT,
+					      write, 0, pages);
 
 		/* Have to be a bit careful with return values */
 		if (nr > 0) {
diff --git a/drivers/dma/iovlock.c b/drivers/dma/iovlock.c
index bb48a57..12ea7c3 100644
--- a/drivers/dma/iovlock.c
+++ b/drivers/dma/iovlock.c
@@ -95,17 +95,11 @@ struct dma_pinned_list *dma_pin_iovec_pages(struct iovec *iov, size_t len)
 		pages += page_list->nr_pages;
 
 		/* pin pages down */
-		down_read(&current->mm->mmap_sem);
-		ret = get_user_pages(
-			current,
-			current->mm,
+		ret = get_user_pages_fast(
 			(unsigned long) iov[i].iov_base,
 			page_list->nr_pages,
 			1,	/* write */
-			0,	/* force */
-			page_list->pages,
-			NULL);
-		up_read(&current->mm->mmap_sem);
+			page_list->pages);
 
 		if (ret != page_list->nr_pages)
 			goto unpin;
diff --git a/drivers/iommu/amd_iommu_v2.c b/drivers/iommu/amd_iommu_v2.c
index 5f578e8..6963b73 100644
--- a/drivers/iommu/amd_iommu_v2.c
+++ b/drivers/iommu/amd_iommu_v2.c
@@ -519,10 +519,8 @@ static void do_fault(struct work_struct *work)
 
 	write = !!(fault->flags & PPR_FAULT_WRITE);
 
-	down_read(&fault->state->mm->mmap_sem);
-	npages = get_user_pages(NULL, fault->state->mm,
-				fault->address, 1, write, 0, &page, NULL);
-	up_read(&fault->state->mm->mmap_sem);
+	npages = get_user_pages_unlocked(NULL, fault->state->mm,
+					 fault->address, 1, write, 0, &page);
 
 	if (npages == 1) {
 		put_page(page);
diff --git a/drivers/media/pci/ivtv/ivtv-udma.c b/drivers/media/pci/ivtv/ivtv-udma.c
index 7338cb2..96d866b 100644
--- a/drivers/media/pci/ivtv/ivtv-udma.c
+++ b/drivers/media/pci/ivtv/ivtv-udma.c
@@ -124,10 +124,8 @@ int ivtv_udma_setup(struct ivtv *itv, unsigned long ivtv_dest_addr,
 	}
 
 	/* Get user pages for DMA Xfer */
-	down_read(&current->mm->mmap_sem);
-	err = get_user_pages(current, current->mm,
-			user_dma.uaddr, user_dma.page_count, 0, 1, dma->map, NULL);
-	up_read(&current->mm->mmap_sem);
+	err = get_user_pages_unlocked(current, current->mm,
+			user_dma.uaddr, user_dma.page_count, 0, 1, dma->map);
 
 	if (user_dma.page_count != err) {
 		IVTV_DEBUG_WARN("failed to map user pages, returned %d instead of %d\n",
diff --git a/drivers/misc/sgi-gru/grufault.c b/drivers/misc/sgi-gru/grufault.c
index f74fc0c..cd20669 100644
--- a/drivers/misc/sgi-gru/grufault.c
+++ b/drivers/misc/sgi-gru/grufault.c
@@ -198,8 +198,7 @@ static int non_atomic_pte_lookup(struct vm_area_struct *vma,
 #else
 	*pageshift = PAGE_SHIFT;
 #endif
-	if (get_user_pages
-	    (current, current->mm, vaddr, 1, write, 0, &page, NULL) <= 0)
+	if (get_user_pages_fast(vaddr, 1, write, &page) <= 0)
 		return -EFAULT;
 	*paddr = page_to_phys(page);
 	put_page(page);
diff --git a/drivers/scsi/st.c b/drivers/scsi/st.c
index aff9689..c89dcfa 100644
--- a/drivers/scsi/st.c
+++ b/drivers/scsi/st.c
@@ -4536,18 +4536,12 @@ static int sgl_map_user_pages(struct st_buffer *STbp,
 		return -ENOMEM;
 
         /* Try to fault in all of the necessary pages */
-	down_read(&current->mm->mmap_sem);
         /* rw==READ means read from drive, write into memory area */
-	res = get_user_pages(
-		current,
-		current->mm,
+	res = get_user_pages_fast(
 		uaddr,
 		nr_pages,
 		rw == READ,
-		0, /* don't force */
-		pages,
-		NULL);
-	up_read(&current->mm->mmap_sem);
+		pages);
 
 	/* Errors and no page mapped should return here */
 	if (res < nr_pages)
diff --git a/drivers/video/fbdev/pvr2fb.c b/drivers/video/fbdev/pvr2fb.c
index 167cfff..ff81f65 100644
--- a/drivers/video/fbdev/pvr2fb.c
+++ b/drivers/video/fbdev/pvr2fb.c
@@ -686,10 +686,7 @@ static ssize_t pvr2fb_write(struct fb_info *info, const char *buf,
 	if (!pages)
 		return -ENOMEM;
 
-	down_read(&current->mm->mmap_sem);
-	ret = get_user_pages(current, current->mm, (unsigned long)buf,
-			     nr_pages, WRITE, 0, pages, NULL);
-	up_read(&current->mm->mmap_sem);
+	ret = get_user_pages_fast((unsigned long)buf, nr_pages, WRITE, pages);
 
 	if (ret < nr_pages) {
 		nr_pages = ret;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 32ba786..69f692d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1197,6 +1197,13 @@ long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		    unsigned long start, unsigned long nr_pages,
 		    int write, int force, struct page **pages,
 		    struct vm_area_struct **vmas);
+long get_user_pages_locked(struct task_struct *tsk, struct mm_struct *mm,
+		    unsigned long start, unsigned long nr_pages,
+		    int write, int force, struct page **pages,
+		    int *locked);
+long get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm,
+		    unsigned long start, unsigned long nr_pages,
+		    int write, int force, struct page **pages);
 int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 			struct page **pages);
 struct kvec;
diff --git a/mm/gup.c b/mm/gup.c
index 91d044b..19e17ab 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -576,6 +576,134 @@ int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
 	return 0;
 }
 
+static inline long __get_user_pages_locked(struct task_struct *tsk,
+					   struct mm_struct *mm,
+					   unsigned long start,
+					   unsigned long nr_pages,
+					   int write, int force,
+					   struct page **pages,
+					   struct vm_area_struct **vmas,
+					   int *locked,
+					   bool immediate_unlock)
+{
+	int flags = FOLL_TOUCH;
+	long ret, pages_done;
+	bool lock_dropped;
+
+	if (locked) {
+		/* if VM_FAULT_RETRY can be returned, vmas become invalid */
+		BUG_ON(vmas);
+		/* check caller initialized locked */
+		BUG_ON(*locked != 1);
+	} else {
+		/*
+		 * Not really important, the value is irrelevant if
+		 * locked is NULL, but BUILD_BUG_ON costs nothing.
+		 */
+		BUILD_BUG_ON(immediate_unlock);
+	}
+
+	if (pages)
+		flags |= FOLL_GET;
+	if (write)
+		flags |= FOLL_WRITE;
+	if (force)
+		flags |= FOLL_FORCE;
+
+	pages_done = 0;
+	lock_dropped = false;
+	for (;;) {
+		ret = __get_user_pages(tsk, mm, start, nr_pages, flags, pages,
+				       vmas, locked);
+		if (!locked)
+			/* VM_FAULT_RETRY couldn't trigger, bypass */
+			return ret;
+
+		/* VM_FAULT_RETRY cannot return errors */
+		if (!*locked) {
+			BUG_ON(ret < 0);
+			BUG_ON(nr_pages == 1 && ret);
+		}
+
+		if (!pages)
+			/* If it's a prefault don't insist harder */
+			return ret;
+
+		if (ret > 0) {
+			nr_pages -= ret;
+			pages_done += ret;
+			if (!nr_pages)
+				break;
+		}
+		if (*locked) {
+			/* VM_FAULT_RETRY didn't trigger */
+			if (!pages_done)
+				pages_done = ret;
+			break;
+		}
+		/* VM_FAULT_RETRY triggered, so seek to the faulting offset */
+		pages += ret;
+		start += ret << PAGE_SHIFT;
+
+		/*
+		 * Repeat on the address that fired VM_FAULT_RETRY
+		 * without FAULT_FLAG_ALLOW_RETRY but with
+		 * FAULT_FLAG_TRIED.
+		 */
+		*locked = 1;
+		lock_dropped = true;
+		down_read(&mm->mmap_sem);
+		ret = __get_user_pages(tsk, mm, start, nr_pages, flags | FOLL_TRIED,
+				       pages, NULL, NULL);
+		if (ret != 1) {
+			BUG_ON(ret > 1);
+			if (!pages_done)
+				pages_done = ret;
+			break;
+		}
+		nr_pages--;
+		pages_done++;
+		if (!nr_pages)
+			break;
+		pages++;
+		start += PAGE_SIZE;
+	}
+	if (!immediate_unlock && lock_dropped && *locked) {
+		/*
+		 * We must let the caller know we temporarily dropped the lock
+		 * and so the critical section protected by it was lost.
+		 */
+		up_read(&mm->mmap_sem);
+		*locked = 0;
+	}
+	return pages_done;
+}
+
+long get_user_pages_locked(struct task_struct *tsk, struct mm_struct *mm,
+			   unsigned long start, unsigned long nr_pages,
+			   int write, int force, struct page **pages,
+			   int *locked)
+{
+	return __get_user_pages_locked(tsk, mm, start, nr_pages, write, force,
+				       pages, NULL, locked, false);
+}
+EXPORT_SYMBOL(get_user_pages_locked);
+
+long get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm,
+			     unsigned long start, unsigned long nr_pages,
+			     int write, int force, struct page **pages)
+{
+	long ret;
+	int locked = 1;
+	down_read(&mm->mmap_sem);
+	ret = __get_user_pages_locked(tsk, mm, start, nr_pages, write, force,
+				      pages, NULL, &locked, true);
+	if (locked)
+		up_read(&mm->mmap_sem);
+	return ret;
+}
+EXPORT_SYMBOL(get_user_pages_unlocked);
+
 /*
  * get_user_pages() - pin user pages in memory
  * @tsk:	the task_struct to use for page fault accounting, or
@@ -625,22 +753,19 @@ int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
  * use the correct cache flushing APIs.
  *
  * See also get_user_pages_fast, for performance critical applications.
+ *
+ * get_user_pages should be gradually obsoleted in favor of
+ * get_user_pages_locked|unlocked. Nothing should use get_user_pages
+ * because it cannot pass FAULT_FLAG_ALLOW_RETRY to handle_mm_fault in
+ * turn disabling the userfaultfd feature (after that "inline" can be
+ * cleaned up from get_user_pages_locked).
  */
 long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		unsigned long start, unsigned long nr_pages, int write,
 		int force, struct page **pages, struct vm_area_struct **vmas)
 {
-	int flags = FOLL_TOUCH;
-
-	if (pages)
-		flags |= FOLL_GET;
-	if (write)
-		flags |= FOLL_WRITE;
-	if (force)
-		flags |= FOLL_FORCE;
-
-	return __get_user_pages(tsk, mm, start, nr_pages, flags, pages, vmas,
-				NULL);
+	return __get_user_pages_locked(tsk, mm, start, nr_pages, write, force,
+				       pages, vmas, NULL, false);
 }
 EXPORT_SYMBOL(get_user_pages);
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 8f5330d..6606c10 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -881,7 +881,7 @@ static int lookup_node(struct mm_struct *mm, unsigned long addr)
 	struct page *p;
 	int err;
 
-	err = get_user_pages(current, mm, addr & PAGE_MASK, 1, 0, 0, &p, NULL);
+	err = get_user_pages_fast(addr & PAGE_MASK, 1, 0, &p);
 	if (err >= 0) {
 		err = page_to_nid(p);
 		put_page(p);
diff --git a/mm/nommu.c b/mm/nommu.c
index a881d96..8a06341 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -213,6 +213,29 @@ long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 }
 EXPORT_SYMBOL(get_user_pages);
 
+long get_user_pages_locked(struct task_struct *tsk, struct mm_struct *mm,
+			   unsigned long start, unsigned long nr_pages,
+			   int write, int force, struct page **pages,
+			   int *locked)
+{
+	return get_user_pages(tsk, mm, start, nr_pages, write, force,
+			      pages, NULL);
+}
+EXPORT_SYMBOL(get_user_pages_locked);
+
+long get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm,
+			     unsigned long start, unsigned long nr_pages,
+			     int write, int force, struct page **pages)
+{
+	long ret;
+	down_read(&mm->mmap_sem);
+	ret = get_user_pages(tsk, mm, start, nr_pages, write, force,
+			     pages, NULL);
+	up_read(&mm->mmap_sem);
+	return ret;
+}
+EXPORT_SYMBOL(get_user_pages_unlocked);
+
 /**
  * follow_pfn - look up PFN at a user virtual address
  * @vma: memory mapping
diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
index 5077afc..b159769 100644
--- a/mm/process_vm_access.c
+++ b/mm/process_vm_access.c
@@ -99,11 +99,8 @@ static int process_vm_rw_single_vec(unsigned long addr,
 		size_t bytes;
 
 		/* Get the pages we're interested in */
-		down_read(&mm->mmap_sem);
-		pages = get_user_pages(task, mm, pa, pages,
-				      vm_write, 0, process_pages, NULL);
-		up_read(&mm->mmap_sem);
-
+		pages = get_user_pages_unlocked(task, mm, pa, pages,
+						vm_write, 0, process_pages);
 		if (pages <= 0)
 			return -EFAULT;
 
diff --git a/mm/util.c b/mm/util.c
index 093c973..1b93f2d 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -247,14 +247,8 @@ int __weak get_user_pages_fast(unsigned long start,
 				int nr_pages, int write, struct page **pages)
 {
 	struct mm_struct *mm = current->mm;
-	int ret;
-
-	down_read(&mm->mmap_sem);
-	ret = get_user_pages(current, mm, start, nr_pages,
-					write, 0, pages, NULL);
-	up_read(&mm->mmap_sem);
-
-	return ret;
+	return get_user_pages_unlocked(current, mm, start, nr_pages,
+				       write, 0, pages);
 }
 EXPORT_SYMBOL_GPL(get_user_pages_fast);
 
diff --git a/net/ceph/pagevec.c b/net/ceph/pagevec.c
index 5550130..5504783 100644
--- a/net/ceph/pagevec.c
+++ b/net/ceph/pagevec.c
@@ -23,17 +23,16 @@ struct page **ceph_get_direct_page_vector(const void __user *data,
 	if (!pages)
 		return ERR_PTR(-ENOMEM);
 
-	down_read(&current->mm->mmap_sem);
 	while (got < num_pages) {
-		rc = get_user_pages(current, current->mm,
-		    (unsigned long)data + ((unsigned long)got * PAGE_SIZE),
-		    num_pages - got, write_page, 0, pages + got, NULL);
+		rc = get_user_pages_fast((unsigned long)data +
+					 ((unsigned long)got * PAGE_SIZE),
+					 num_pages - got,
+					 write_page, pages + got);
 		if (rc < 0)
 			break;
 		BUG_ON(rc == 0);
 		got += rc;
 	}
-	up_read(&current->mm->mmap_sem);
 	if (rc < 0)
 		goto fail;
 	return pages;


Then to make an example your patch would have become:

===
>From 74d88763cde285354fb78806ffb332030d1f0739 Mon Sep 17 00:00:00 2001
From: Andres Lagar-Cavilla <andreslc@google.com>
Date: Fri, 26 Sep 2014 18:36:56 +0200
Subject: [PATCH 2/2] kvm: Faults which trigger IO release the mmap_sem
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

When KVM handles a tdp fault it uses FOLL_NOWAIT. If the guest memory
has been swapped out or is behind a filemap, this will trigger async
readahead and return immediately. The rationale is that KVM will kick
back the guest with an "async page fault" and allow for some other
guest process to take over.

If async PFs are enabled the fault is retried asap from an async
workqueue. If not, it's retried immediately in the same code path. In
either case the retry will not relinquish the mmap semaphore and will
block on the IO. This is a bad thing, as other mmap semaphore users
now stall as a function of swap or filemap latency.

This patch ensures both the regular and async PF path re-enter the
fault allowing for the mmap semaphore to be relinquished in the case
of IO wait.

Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Andres Lagar-Cavilla <andreslc@google.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/mm.h  | 1 +
 mm/gup.c            | 4 ++++
 virt/kvm/async_pf.c | 4 +---
 virt/kvm/kvm_main.c | 4 ++--
 4 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 69f692d..71dbe03 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1997,6 +1997,7 @@ static inline struct page *follow_page(struct vm_area_struct *vma,
 #define FOLL_HWPOISON	0x100	/* check page is hwpoisoned */
 #define FOLL_NUMA	0x200	/* force NUMA hinting page fault */
 #define FOLL_MIGRATION	0x400	/* wait for page to replace migration entry */
+#define FOLL_TRIED	0x800	/* a retry, previous pass started an IO */
 
 typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
 			void *data);
diff --git a/mm/gup.c b/mm/gup.c
index 19e17ab..369b3f6 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -281,6 +281,10 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
 		fault_flags |= FAULT_FLAG_ALLOW_RETRY;
 	if (*flags & FOLL_NOWAIT)
 		fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT;
+	if (*flags & FOLL_TRIED) {
+		VM_WARN_ON_ONCE(fault_flags & FAULT_FLAG_ALLOW_RETRY);
+		fault_flags |= FAULT_FLAG_TRIED;
+	}
 
 	ret = handle_mm_fault(mm, vma, address, fault_flags);
 	if (ret & VM_FAULT_ERROR) {
diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
index d6a3d09..44660ae 100644
--- a/virt/kvm/async_pf.c
+++ b/virt/kvm/async_pf.c
@@ -80,9 +80,7 @@ static void async_pf_execute(struct work_struct *work)
 
 	might_sleep();
 
-	down_read(&mm->mmap_sem);
-	get_user_pages(NULL, mm, addr, 1, 1, 0, NULL, NULL);
-	up_read(&mm->mmap_sem);
+	get_user_pages_unlocked(NULL, mm, addr, 1, 1, 0, NULL);
 	kvm_async_page_present_sync(vcpu, apf);
 
 	spin_lock(&vcpu->async_pf.lock);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 95519bc..921bce7 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1170,8 +1170,8 @@ static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
 					      addr, write_fault, page);
 		up_read(&current->mm->mmap_sem);
 	} else
-		npages = get_user_pages_fast(addr, 1, write_fault,
-					     page);
+		npages = get_user_pages_unlocked(current, current->mm, addr, 1,
+						 write_fault, 0, page);
 	if (npages != 1)
 		return npages;
 


This isn't bisectable in this order and it's untested anyway. It needs
more patchsplits.

This is just an initial RFC to know if it's ok to go into this
direction.

If it's ok I'll do some testing and submit it more properly. If your
patches goes in first it's fine and I'll just replace the call in KVM
to get_user_pages_unlocked (or whatever we want to call that thing).

I'd need to get this (or equivalent solution) merged before
re-submitting the userfaultfd patchset. I think the above benefits the
kernel as a whole in terms of mmap_sem holdtimes regardless of
userfaultfd so it should be good.

> Well, IIUC every code path that has ALLOW_RETRY dives in the second
> time with FAULT_TRIED or similar. In the common case, you happily
> blaze through the second time, but if someone raced in while all locks
> were given up, one pays the price of the second time being a full
> fault hogging the mmap sem. At some point you need to not keep being
> polite otherwise the task starves. Presumably the risk of an extra
> retry drops steeply every new gup retry. Maybe just try three times is
> good enough. It makes for ugly logic though.

I was under the impression that if one looped forever with
VM_FAULT_RETRY it'd eventually succeed, but that risks doing more
work, so I'm also sticking to the "locked != NULL" pass first, then
seeking to the first page that returned VM_FAULT_RETRY and issuing a
nr_pages=1 gup with locked == NULL, and then continuing with locked !=
NULL at the next page. Just like you did in the KVM slow path. And if
the "pages" array is NULL I bail out at the first VM_FAULT_RETRY
failure without insisting with further gup calls of the "&locked"
kind; your patch had just 1 page but you also bailed out.
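
(In code form, a condensed sketch of that strategy; this is basically
__get_user_pages_locked() from the RFC patch above with the error
propagation and the pages == NULL prefault bail-out stripped, and the
wrapper name here is made up:)

static long gup_retry_sketch(struct task_struct *tsk, struct mm_struct *mm,
			     unsigned long start, unsigned long nr_pages,
			     int flags, struct page **pages, int *locked)
{
	long ret, pages_done = 0;

	for (;;) {
		/* first pass: FAULT_FLAG_ALLOW_RETRY, may drop mmap_sem */
		ret = __get_user_pages(tsk, mm, start, nr_pages, flags,
				       pages, NULL, locked);
		if (ret > 0) {
			nr_pages -= ret;
			pages_done += ret;
			pages += ret;
			start += ret << PAGE_SHIFT;
		}
		if (!nr_pages || *locked)
			break;	/* all done, or no VM_FAULT_RETRY fired */

		/*
		 * mmap_sem was dropped: retake it and retry only the
		 * page that fired VM_FAULT_RETRY, this time with
		 * FOLL_TRIED (FAULT_FLAG_TRIED) and without
		 * FAULT_FLAG_ALLOW_RETRY.
		 */
		*locked = 1;
		down_read(&mm->mmap_sem);
		ret = __get_user_pages(tsk, mm, start, 1,
				       flags | FOLL_TRIED,
				       pages, NULL, NULL);
		if (ret != 1)
			break;
		nr_pages--;
		pages_done++;
		pages++;
		start += PAGE_SIZE;
	}
	return pages_done;
}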

What the RFC patch above does is basically generalize your KVM
optimization, making it global, and at the same time it spares me
trouble in handle_userfault().

While at it I also converted some obvious candidates to gup_fast that
had no reason to run slower (which I should split off into a separate
patch).

Thanks,
Andrea

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: RFC: get_user_pages_locked|unlocked to leverage VM_FAULT_RETRY
  2014-09-26 17:25 ` Andrea Arcangeli
@ 2014-09-26 19:54   ` Andres Lagar-Cavilla
  -1 siblings, 0 replies; 18+ messages in thread
From: Andres Lagar-Cavilla @ 2014-09-26 19:54 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Gleb Natapov, Radim Krcmar, Paolo Bonzini, Rik van Riel,
	Peter Zijlstra, Mel Gorman, Andy Lutomirski, Andrew Morton,
	Sasha Levin, Jianyu Zhan, Paul Cassella, Hugh Dickins,
	Peter Feiner, kvm, linux-kernel, linux-mm,
	Dr. David Alan Gilbert

On Fri, Sep 26, 2014 at 10:25 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> On Thu, Sep 25, 2014 at 02:50:29PM -0700, Andres Lagar-Cavilla wrote:
>> It's nearly impossible to name it right because 1) it indicates we can
>> relinquish 2) it returns whether we still hold the mmap semaphore.
>>
>> I'd prefer it'd be called mmap_sem_hold, which conveys immediately
>> what this is about ("nonblocking" or "locked" could be about a whole
>> lot of things)
>
> To me FOLL_NOWAIT/FAULT_FLAG_RETRY_NOWAIT is nonblocking,
> "locked"/FAULT_FLAG_ALLOW_RETRY is still very much blocking, just
> without the mmap_sem, so I called it "locked"... but I'm fine to
> change the name to mmap_sem_hold. Just get_user_pages_mmap_sem_hold
> seems less friendly than get_user_pages_locked(..., &locked). locked
> as you used comes intuitive when you do later "if (locked) up_read".
>

Heh. I was previously referring to the int *locked param, not the
_(un)locked suffix. That param is all about the mmap semaphore, so why
not name it less ambiguously? It's essentially a tristate.
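
(A minimal caller-side sketch of that tristate contract, reusing the
signature from Andrea's patch; the wrapper name is made up and the
comments paraphrase the commit message quoted below:)

static long gup_locked_caller_sketch(struct task_struct *tsk,
				     struct mm_struct *mm,
				     unsigned long start,
				     unsigned long nr_pages,
				     int write, int force,
				     struct page **pages)
{
	long ret;
	int locked = 1;

	down_read(&mm->mmap_sem);
	/* may drop and retake mmap_sem internally on VM_FAULT_RETRY */
	ret = get_user_pages_locked(tsk, mm, start, nr_pages,
				    write, force, pages, &locked);
	/*
	 * locked == 1: mmap_sem was never dropped and is still held.
	 * locked == 0: it was dropped (and is no longer held), so the
	 *              critical section it protected was lost and
	 *              up_read() must be skipped.
	 * Passing NULL instead of &locked disables the retry behavior
	 * entirely (plain get_user_pages semantics).
	 */
	if (locked)
		up_read(&mm->mmap_sem);
	return ret;
}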

My suggestion is that you just make gup behave as your proposed
gup_locked, with no need to introduce another call. But I understand
if you want to phase this out politely.

> Then I added an _unlocked kind which is a drop in replacement for many
> places just to clean it up.
>
> get_user_pages_unlocked and get_user_pages_fast are equivalent in
> semantics, so any call of get_user_pages_unlocked(current,
> current->mm, ...) has no reason to exist and should be replaced to
> get_user_pages_fast unless "force = 1" (gup_fast has no force param
> just to make the argument list a bit more confusing across the various
> versions of gup).
>
> get_user_pages over time should be phased out and dropped.

Please. Too many variants. So the end goal is
* __gup_fast
* gup_fast == __gup_fast + gup_unlocked for fallback
* gup (or gup_locked)
* gup_unlocked
(and flat __gup remains buried in the impl)?
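
(For the second bullet, the generic fallback is then essentially the
mm/util.c hunk from Andrea's patch, sketched here with the arch
lockless fast path elided:)

/*
 * "gup_fast == __gup_fast + gup_unlocked for fallback": the generic
 * weak version just becomes a gup_unlocked call; arch implementations
 * would fall back the same way once their lockless walk gives up.
 */
int __weak get_user_pages_fast(unsigned long start, int nr_pages,
			       int write, struct page **pages)
{
	struct mm_struct *mm = current->mm;

	/* slow path: may drop mmap_sem on VM_FAULT_RETRY (e.g. IO wait) */
	return get_user_pages_unlocked(current, mm, start, nr_pages,
				       write, 0, pages);
}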

>
>> I can see that. My background for coming into this is very similar: in
>> a previous life we had a file system shim that would kick up into
>> userspace for servicing VM memory. KVM just wouldn't let the file
>> system give up the mmap semaphore. We had /proc readers hanging up all
>> over the place while userspace was servicing. Not happy.
>>
>> With KVM (now) and the standard x86 fault giving you ALLOW_RETRY, what
>> stands in your way? Methinks that gup_fast has no slowpath fallback
>> that turns on ALLOW_RETRY. What would oppose that being the global
>> behavior?
>
> It should become the global behavior. Just it doesn't need to become a
> global behavior immediately for all kind of gups (i.e. video4linux
> drivers will never need to poke into the KVM guest user memory so it
> doesn't matter if they don't use gup_locked immediately). Even then we
> can still support get_user_pages_locked(...., locked=NULL) for
> ptrace/coredump and other things that may not want to trigger the
> userfaultfd protocol and just get an immediate VM_FAULT_SIGBUS.
>
> Userfaults will just VM_FAULT_SIGBUS (translated to -EFAULT by all gup
> invocations) and not invoke the userfaultfd protocol, if
> FAULT_FLAG_ALLOW_RETRY is not set. So any gup_locked with locked ==
> NULL or or gup() (without locked parameter) will not invoke the
> userfaultfd protocol.
>
> But I need gup_fast to use FAULT_FLAG_ALLOW_RETRY because core places
> like O_DIRECT uses it.
>
> I tried to do a RFC patch below that goes into this direction and
> should be enough for a start to solve all my issues with the mmap_sem
> holding inside handle_userfault(), comments welcome.
>
> =======
> From 41918f7d922d1e7fc70f117db713377e7e2af6e9 Mon Sep 17 00:00:00 2001
> From: Andrea Arcangeli <aarcange@redhat.com>
> Date: Fri, 26 Sep 2014 18:36:53 +0200
> Subject: [PATCH 1/2] mm: gup: add get_user_pages_locked and
>  get_user_pages_unlocked
>
> We can leverage the VM_FAULT_RETRY functionality in the page fault
> paths better by using either get_user_pages_locked or
> get_user_pages_unlocked.
>
> The former allows conversion of get_user_pages invocations that will
> have to pass a "&locked" parameter to know if the mmap_sem was dropped
> during the call. Example from:
>
>     down_read(&mm->mmap_sem);
>     do_something()
>     get_user_pages(tsk, mm, ..., pages, NULL);
>     up_read(&mm->mmap_sem);
>
> to:
>
>     int locked = 1;
>     down_read(&mm->mmap_sem);
>     do_something()
>     get_user_pages_locked(tsk, mm, ..., pages, &locked);
>     if (locked)
>         up_read(&mm->mmap_sem);
>
> The latter is suitable only as a drop in replacement of the form:
>
>     down_read(&mm->mmap_sem);
>     get_user_pages(tsk, mm, ..., pages, NULL);
>     up_read(&mm->mmap_sem);
>
> into:
>
>     get_user_pages_unlocked(tsk, mm, ..., pages);
>
> Where tsk, mm, the intermediate "..." parameters and "pages" can be any
> value as before. Just the last parameter of get_user_pages (vmas) must
> be NULL for get_user_pages_locked|unlocked to be usable (the latter
> original form wouldn't have been safe anyway if vmas wasn't null, for
> the former we just make it explicit by dropping the parameter).
>
> If vmas is not NULL these two methods cannot be used.
>
> This patch then applies the new forms in various places, in some case
> also replacing it with get_user_pages_fast whenever tsk and mm are
> current and current->mm. get_user_pages_unlocked varies from
> get_user_pages_fast only if mm is not current->mm (like when
> get_user_pages works on some other process mm). Whenever tsk and mm
> matches current and current->mm get_user_pages_fast must always be
> used to increase performance and get the page lockless (only with irq
> disabled).

Basically all this discussion should go into the patch as comments.
Help people shortcut git blame.

>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  arch/mips/mm/gup.c                 |   8 +-
>  arch/powerpc/mm/gup.c              |   6 +-
>  arch/s390/kvm/kvm-s390.c           |   4 +-
>  arch/s390/mm/gup.c                 |   6 +-
>  arch/sh/mm/gup.c                   |   6 +-
>  arch/sparc/mm/gup.c                |   6 +-
>  arch/x86/mm/gup.c                  |   7 +-
>  drivers/dma/iovlock.c              |  10 +--
>  drivers/iommu/amd_iommu_v2.c       |   6 +-
>  drivers/media/pci/ivtv/ivtv-udma.c |   6 +-
>  drivers/misc/sgi-gru/grufault.c    |   3 +-
>  drivers/scsi/st.c                  |  10 +--
>  drivers/video/fbdev/pvr2fb.c       |   5 +-
>  include/linux/mm.h                 |   7 ++
>  mm/gup.c                           | 147 ++++++++++++++++++++++++++++++++++---
>  mm/mempolicy.c                     |   2 +-
>  mm/nommu.c                         |  23 ++++++
>  mm/process_vm_access.c             |   7 +-
>  mm/util.c                          |  10 +--
>  net/ceph/pagevec.c                 |   9 +--
>  20 files changed, 200 insertions(+), 88 deletions(-)
>
> diff --git a/arch/mips/mm/gup.c b/arch/mips/mm/gup.c
> index 06ce17c..20884f5 100644
> --- a/arch/mips/mm/gup.c
> +++ b/arch/mips/mm/gup.c
> @@ -301,11 +301,9 @@ slow_irqon:
>         start += nr << PAGE_SHIFT;
>         pages += nr;
>
> -       down_read(&mm->mmap_sem);
> -       ret = get_user_pages(current, mm, start,
> -                               (end - start) >> PAGE_SHIFT,
> -                               write, 0, pages, NULL);
> -       up_read(&mm->mmap_sem);
> +       ret = get_user_pages_unlocked(current, mm, start,
> +                                     (end - start) >> PAGE_SHIFT,
> +                                     write, 0, pages);
>
>         /* Have to be a bit careful with return values */
>         if (nr > 0) {
> diff --git a/arch/powerpc/mm/gup.c b/arch/powerpc/mm/gup.c
> index d874668..b70c34a 100644
> --- a/arch/powerpc/mm/gup.c
> +++ b/arch/powerpc/mm/gup.c
> @@ -215,10 +215,8 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
>                 start += nr << PAGE_SHIFT;
>                 pages += nr;
>
> -               down_read(&mm->mmap_sem);
> -               ret = get_user_pages(current, mm, start,
> -                                    nr_pages - nr, write, 0, pages, NULL);
> -               up_read(&mm->mmap_sem);
> +               ret = get_user_pages_unlocked(current, mm, start,
> +                                             nr_pages - nr, write, 0, pages);
>
>                 /* Have to be a bit careful with return values */
>                 if (nr > 0) {
> diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
> index 81b0e11..37ca29a 100644
> --- a/arch/s390/kvm/kvm-s390.c
> +++ b/arch/s390/kvm/kvm-s390.c
> @@ -1092,9 +1092,7 @@ long kvm_arch_fault_in_page(struct kvm_vcpu *vcpu, gpa_t gpa, int writable)
>         hva = gmap_fault(gpa, vcpu->arch.gmap);
>         if (IS_ERR_VALUE(hva))
>                 return (long)hva;
> -       down_read(&mm->mmap_sem);
> -       rc = get_user_pages(current, mm, hva, 1, writable, 0, NULL, NULL);
> -       up_read(&mm->mmap_sem);
> +       rc = get_user_pages_unlocked(current, mm, hva, 1, writable, 0, NULL);
>
>         return rc < 0 ? rc : 0;
>  }
> diff --git a/arch/s390/mm/gup.c b/arch/s390/mm/gup.c
> index 639fce46..5c586c7 100644
> --- a/arch/s390/mm/gup.c
> +++ b/arch/s390/mm/gup.c
> @@ -235,10 +235,8 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
>         /* Try to get the remaining pages with get_user_pages */
>         start += nr << PAGE_SHIFT;
>         pages += nr;
> -       down_read(&mm->mmap_sem);
> -       ret = get_user_pages(current, mm, start,
> -                            nr_pages - nr, write, 0, pages, NULL);
> -       up_read(&mm->mmap_sem);
> +       ret = get_user_pages_unlocked(current, mm, start,
> +                            nr_pages - nr, write, 0, pages);
>         /* Have to be a bit careful with return values */
>         if (nr > 0)
>                 ret = (ret < 0) ? nr : ret + nr;
> diff --git a/arch/sh/mm/gup.c b/arch/sh/mm/gup.c
> index 37458f3..e15f52a 100644
> --- a/arch/sh/mm/gup.c
> +++ b/arch/sh/mm/gup.c
> @@ -257,10 +257,8 @@ slow_irqon:
>                 start += nr << PAGE_SHIFT;
>                 pages += nr;
>
> -               down_read(&mm->mmap_sem);
> -               ret = get_user_pages(current, mm, start,
> -                       (end - start) >> PAGE_SHIFT, write, 0, pages, NULL);
> -               up_read(&mm->mmap_sem);
> +               ret = get_user_pages_unlocked(current, mm, start,
> +                       (end - start) >> PAGE_SHIFT, write, 0, pages);
>
>                 /* Have to be a bit careful with return values */
>                 if (nr > 0) {
> diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
> index 1aed043..fa7de7d 100644
> --- a/arch/sparc/mm/gup.c
> +++ b/arch/sparc/mm/gup.c
> @@ -219,10 +219,8 @@ slow:
>                 start += nr << PAGE_SHIFT;
>                 pages += nr;
>
> -               down_read(&mm->mmap_sem);
> -               ret = get_user_pages(current, mm, start,
> -                       (end - start) >> PAGE_SHIFT, write, 0, pages, NULL);
> -               up_read(&mm->mmap_sem);
> +               ret = get_user_pages_unlocked(current, mm, start,
> +                       (end - start) >> PAGE_SHIFT, write, 0, pages);
>
>                 /* Have to be a bit careful with return values */
>                 if (nr > 0) {
> diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
> index 207d9aef..2ab183b 100644
> --- a/arch/x86/mm/gup.c
> +++ b/arch/x86/mm/gup.c
> @@ -388,10 +388,9 @@ slow_irqon:
>                 start += nr << PAGE_SHIFT;
>                 pages += nr;
>
> -               down_read(&mm->mmap_sem);
> -               ret = get_user_pages(current, mm, start,
> -                       (end - start) >> PAGE_SHIFT, write, 0, pages, NULL);
> -               up_read(&mm->mmap_sem);
> +               ret = get_user_pages_unlocked(current, mm, start,
> +                                             (end - start) >> PAGE_SHIFT,
> +                                             write, 0, pages);
>
>                 /* Have to be a bit careful with return values */
>                 if (nr > 0) {
> diff --git a/drivers/dma/iovlock.c b/drivers/dma/iovlock.c
> index bb48a57..12ea7c3 100644
> --- a/drivers/dma/iovlock.c
> +++ b/drivers/dma/iovlock.c
> @@ -95,17 +95,11 @@ struct dma_pinned_list *dma_pin_iovec_pages(struct iovec *iov, size_t len)
>                 pages += page_list->nr_pages;
>
>                 /* pin pages down */
> -               down_read(&current->mm->mmap_sem);
> -               ret = get_user_pages(
> -                       current,
> -                       current->mm,
> +               ret = get_user_pages_fast(
>                         (unsigned long) iov[i].iov_base,
>                         page_list->nr_pages,
>                         1,      /* write */
> -                       0,      /* force */
> -                       page_list->pages,
> -                       NULL);
> -               up_read(&current->mm->mmap_sem);
> +                       page_list->pages);
>
>                 if (ret != page_list->nr_pages)
>                         goto unpin;
> diff --git a/drivers/iommu/amd_iommu_v2.c b/drivers/iommu/amd_iommu_v2.c
> index 5f578e8..6963b73 100644
> --- a/drivers/iommu/amd_iommu_v2.c
> +++ b/drivers/iommu/amd_iommu_v2.c
> @@ -519,10 +519,8 @@ static void do_fault(struct work_struct *work)
>
>         write = !!(fault->flags & PPR_FAULT_WRITE);
>
> -       down_read(&fault->state->mm->mmap_sem);
> -       npages = get_user_pages(NULL, fault->state->mm,
> -                               fault->address, 1, write, 0, &page, NULL);
> -       up_read(&fault->state->mm->mmap_sem);
> +       npages = get_user_pages_unlocked(NULL, fault->state->mm,
> +                                        fault->address, 1, write, 0, &page);
>
>         if (npages == 1) {
>                 put_page(page);
> diff --git a/drivers/media/pci/ivtv/ivtv-udma.c b/drivers/media/pci/ivtv/ivtv-udma.c
> index 7338cb2..96d866b 100644
> --- a/drivers/media/pci/ivtv/ivtv-udma.c
> +++ b/drivers/media/pci/ivtv/ivtv-udma.c
> @@ -124,10 +124,8 @@ int ivtv_udma_setup(struct ivtv *itv, unsigned long ivtv_dest_addr,
>         }
>
>         /* Get user pages for DMA Xfer */
> -       down_read(&current->mm->mmap_sem);
> -       err = get_user_pages(current, current->mm,
> -                       user_dma.uaddr, user_dma.page_count, 0, 1, dma->map, NULL);
> -       up_read(&current->mm->mmap_sem);
> +       err = get_user_pages_unlocked(current, current->mm,
> +                       user_dma.uaddr, user_dma.page_count, 0, 1, dma->map);
>
>         if (user_dma.page_count != err) {
>                 IVTV_DEBUG_WARN("failed to map user pages, returned %d instead of %d\n",
> diff --git a/drivers/misc/sgi-gru/grufault.c b/drivers/misc/sgi-gru/grufault.c
> index f74fc0c..cd20669 100644
> --- a/drivers/misc/sgi-gru/grufault.c
> +++ b/drivers/misc/sgi-gru/grufault.c
> @@ -198,8 +198,7 @@ static int non_atomic_pte_lookup(struct vm_area_struct *vma,
>  #else
>         *pageshift = PAGE_SHIFT;
>  #endif
> -       if (get_user_pages
> -           (current, current->mm, vaddr, 1, write, 0, &page, NULL) <= 0)
> +       if (get_user_pages_fast(vaddr, 1, write, &page) <= 0)
>                 return -EFAULT;
>         *paddr = page_to_phys(page);
>         put_page(page);
> diff --git a/drivers/scsi/st.c b/drivers/scsi/st.c
> index aff9689..c89dcfa 100644
> --- a/drivers/scsi/st.c
> +++ b/drivers/scsi/st.c
> @@ -4536,18 +4536,12 @@ static int sgl_map_user_pages(struct st_buffer *STbp,
>                 return -ENOMEM;
>
>          /* Try to fault in all of the necessary pages */
> -       down_read(&current->mm->mmap_sem);
>          /* rw==READ means read from drive, write into memory area */
> -       res = get_user_pages(
> -               current,
> -               current->mm,
> +       res = get_user_pages_fast(
>                 uaddr,
>                 nr_pages,
>                 rw == READ,
> -               0, /* don't force */
> -               pages,
> -               NULL);
> -       up_read(&current->mm->mmap_sem);
> +               pages);
>
>         /* Errors and no page mapped should return here */
>         if (res < nr_pages)
> diff --git a/drivers/video/fbdev/pvr2fb.c b/drivers/video/fbdev/pvr2fb.c
> index 167cfff..ff81f65 100644
> --- a/drivers/video/fbdev/pvr2fb.c
> +++ b/drivers/video/fbdev/pvr2fb.c
> @@ -686,10 +686,7 @@ static ssize_t pvr2fb_write(struct fb_info *info, const char *buf,
>         if (!pages)
>                 return -ENOMEM;
>
> -       down_read(&current->mm->mmap_sem);
> -       ret = get_user_pages(current, current->mm, (unsigned long)buf,
> -                            nr_pages, WRITE, 0, pages, NULL);
> -       up_read(&current->mm->mmap_sem);
> +       ret = get_user_pages_fast((unsigned long)buf, nr_pages, WRITE, pages);
>
>         if (ret < nr_pages) {
>                 nr_pages = ret;
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 32ba786..69f692d 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1197,6 +1197,13 @@ long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>                     unsigned long start, unsigned long nr_pages,
>                     int write, int force, struct page **pages,
>                     struct vm_area_struct **vmas);
> +long get_user_pages_locked(struct task_struct *tsk, struct mm_struct *mm,
> +                   unsigned long start, unsigned long nr_pages,
> +                   int write, int force, struct page **pages,
> +                   int *locked);
> +long get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm,
> +                   unsigned long start, unsigned long nr_pages,
> +                   int write, int force, struct page **pages);
>  int get_user_pages_fast(unsigned long start, int nr_pages, int write,
>                         struct page **pages);
>  struct kvec;
> diff --git a/mm/gup.c b/mm/gup.c
> index 91d044b..19e17ab 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -576,6 +576,134 @@ int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
>         return 0;
>  }
>
> +static inline long __get_user_pages_locked(struct task_struct *tsk,
> +                                          struct mm_struct *mm,
> +                                          unsigned long start,
> +                                          unsigned long nr_pages,
> +                                          int write, int force,
> +                                          struct page **pages,
> +                                          struct vm_area_struct **vmas,
> +                                          int *locked,
> +                                          bool immediate_unlock)
s/immediate_unlock/notify_drop/
> +{
> +       int flags = FOLL_TOUCH;
> +       long ret, pages_done;
> +       bool lock_dropped;
s/lock_dropped/sem_dropped/
> +
> +       if (locked) {
> +               /* if VM_FAULT_RETRY can be returned, vmas become invalid */
> +               BUG_ON(vmas);
> +               /* check caller initialized locked */
> +               BUG_ON(*locked != 1);
> +       } else {
> +               /*
> +                * Not really important, the value is irrelevant if
> +                * locked is NULL, but BUILD_BUG_ON costs nothing.
> +                */
> +               BUILD_BUG_ON(immediate_unlock);
> +       }
> +
> +       if (pages)
> +               flags |= FOLL_GET;
> +       if (write)
> +               flags |= FOLL_WRITE;
> +       if (force)
> +               flags |= FOLL_FORCE;
> +
> +       pages_done = 0;
> +       lock_dropped = false;
> +       for (;;) {
> +               ret = __get_user_pages(tsk, mm, start, nr_pages, flags, pages,
> +                                      vmas, locked);
> +               if (!locked)
> +                       /* VM_FAULT_RETRY couldn't trigger, bypass */
> +                       return ret;
> +
> +               /* VM_FAULT_RETRY cannot return errors */
> +               if (!*locked) {

Set lock_dropped = 1 here, in case we break out too soon (which we do
if nr_pages drops to zero a couple of lines below) and end up
reporting a stale value.

> +                       BUG_ON(ret < 0);
> +                       BUG_ON(nr_pages == 1 && ret);
> +               }
> +
> +               if (!pages)
> +                       /* If it's a prefault don't insist harder */
> +                       return ret;
> +
> +               if (ret > 0) {
> +                       nr_pages -= ret;
> +                       pages_done += ret;
> +                       if (!nr_pages)
> +                               break;
> +               }
> +               if (*locked) {
> +                       /* VM_FAULT_RETRY didn't trigger */
> +                       if (!pages_done)
> +                               pages_done = ret;

Replace the top two lines with:
if (ret > 0)
    pages_done += ret;

> +                       break;
> +               }
> +               /* VM_FAULT_RETRY triggered, so seek to the faulting offset */
> +               pages += ret;
> +               start += ret << PAGE_SHIFT;
> +
> +               /*
> +                * Repeat on the address that fired VM_FAULT_RETRY
> +                * without FAULT_FLAG_ALLOW_RETRY but with
> +                * FAULT_FLAG_TRIED.
> +                */
> +               *locked = 1;
> +               lock_dropped = true;

Not really needed if set where previously suggested.

> +               down_read(&mm->mmap_sem);
> +               ret = __get_user_pages(tsk, mm, start, nr_pages, flags | FOLL_TRIED,
> +                                      pages, NULL, NULL);

s/nr_pages/1/ otherwise we block on everything left ahead, not just
the one that fired RETRY.

> +               if (ret != 1) {
> +                       BUG_ON(ret > 1);

Can ret ever be zero here with count == 1? (ENOENT for a stack guard
page TTBOMK, but what the heck are we doing gup'ing stacks? Suggest
fixing that one case inside the __gup impl so count == 1 never returns
zero.)

> +                       if (!pages_done)
> +                               pages_done = ret;

Don't think so. ret is -errno at this point (maybe zero). So remove.

> +                       break;
> +               }
> +               nr_pages--;
> +               pages_done++;
> +               if (!nr_pages)
> +                       break;
> +               pages++;
> +               start += PAGE_SIZE;
> +       }
> +       if (!immediate_unlock && lock_dropped && *locked) {
> +               /*
> +                * We must let the caller know we temporarily dropped the lock
> +                * and so the critical section protected by it was lost.
> +                */
> +               up_read(&mm->mmap_sem);

With my suggestion of s/immediate_unlock/notify_drop/ this gets a lot
more understandable (IMHO).

> +               *locked = 0;
> +       }
> +       return pages_done;
> +}
> +
> +long get_user_pages_locked(struct task_struct *tsk, struct mm_struct *mm,
> +                          unsigned long start, unsigned long nr_pages,
> +                          int write, int force, struct page **pages,
> +                          int *locked)
> +{
> +       return __get_user_pages_locked(tsk, mm, start, nr_pages, write, force,
> +                                      pages, NULL, locked, false);
> +}
> +EXPORT_SYMBOL(get_user_pages_locked);
> +
> +long get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm,
> +                            unsigned long start, unsigned long nr_pages,
> +                            int write, int force, struct page **pages)
> +{
> +       long ret;
> +       int locked = 1;
> +       down_read(&mm->mmap_sem);
> +       ret = __get_user_pages_locked(tsk, mm, start, nr_pages, write, force,
> +                                     pages, NULL, &locked, true);
> +       if (locked)
> +               up_read(&mm->mmap_sem);
> +       return ret;
> +}
> +EXPORT_SYMBOL(get_user_pages_unlocked);
> +
>  /*
>   * get_user_pages() - pin user pages in memory
>   * @tsk:       the task_struct to use for page fault accounting, or
> @@ -625,22 +753,19 @@ int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
>   * use the correct cache flushing APIs.
>   *
>   * See also get_user_pages_fast, for performance critical applications.
> + *
> + * get_user_pages should be gradually obsoleted in favor of
> + * get_user_pages_locked|unlocked. Nothing should use get_user_pages
> + * because it cannot pass FAULT_FLAG_ALLOW_RETRY to handle_mm_fault in
> + * turn disabling the userfaultfd feature (after that "inline" can be
> + * cleaned up from get_user_pages_locked).
>   */
>  long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>                 unsigned long start, unsigned long nr_pages, int write,
>                 int force, struct page **pages, struct vm_area_struct **vmas)
>  {
> -       int flags = FOLL_TOUCH;
> -
> -       if (pages)
> -               flags |= FOLL_GET;
> -       if (write)
> -               flags |= FOLL_WRITE;
> -       if (force)
> -               flags |= FOLL_FORCE;
> -
> -       return __get_user_pages(tsk, mm, start, nr_pages, flags, pages, vmas,
> -                               NULL);
> +       return __get_user_pages_locked(tsk, mm, start, nr_pages, write, force,
> +                                      pages, vmas, NULL, false);
>  }
>  EXPORT_SYMBOL(get_user_pages);

*Or*, forget about gup_locked and just leave gup as proposed in this
patch. Then gup_unlocked (again IMHO) becomes more meaningful ... "Ah,
that's the one I call when I have no locks taken".

>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 8f5330d..6606c10 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -881,7 +881,7 @@ static int lookup_node(struct mm_struct *mm, unsigned long addr)
>         struct page *p;
>         int err;
>
> -       err = get_user_pages(current, mm, addr & PAGE_MASK, 1, 0, 0, &p, NULL);
> +       err = get_user_pages_fast(addr & PAGE_MASK, 1, 0, &p);
>         if (err >= 0) {
>                 err = page_to_nid(p);
>                 put_page(p);
> diff --git a/mm/nommu.c b/mm/nommu.c
> index a881d96..8a06341 100644
> --- a/mm/nommu.c
> +++ b/mm/nommu.c
> @@ -213,6 +213,29 @@ long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>  }
>  EXPORT_SYMBOL(get_user_pages);
>
> +long get_user_pages_locked(struct task_struct *tsk, struct mm_struct *mm,
> +                          unsigned long start, unsigned long nr_pages,
> +                          int write, int force, struct page **pages,
> +                          int *locked)
> +{
> +       return get_user_pages(tsk, mm, start, nr_pages, write, force,
> +                             pages, NULL);
> +}
> +EXPORT_SYMBOL(get_user_pages_locked);
> +
> +long get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm,
> +                            unsigned long start, unsigned long nr_pages,
> +                            int write, int force, struct page **pages)
> +{
> +       long ret;
> +       down_read(&mm->mmap_sem);
> +       ret = get_user_pages(tsk, mm, start, nr_pages, write, force,
> +                            pages, NULL);
> +       up_read(&mm->mmap_sem);
> +       return ret;
> +}
> +EXPORT_SYMBOL(get_user_pages_unlocked);
> +
>  /**
>   * follow_pfn - look up PFN at a user virtual address
>   * @vma: memory mapping
> diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
> index 5077afc..b159769 100644
> --- a/mm/process_vm_access.c
> +++ b/mm/process_vm_access.c
> @@ -99,11 +99,8 @@ static int process_vm_rw_single_vec(unsigned long addr,
>                 size_t bytes;
>
>                 /* Get the pages we're interested in */
> -               down_read(&mm->mmap_sem);
> -               pages = get_user_pages(task, mm, pa, pages,
> -                                     vm_write, 0, process_pages, NULL);
> -               up_read(&mm->mmap_sem);
> -
> +               pages = get_user_pages_unlocked(task, mm, pa, pages,
> +                                               vm_write, 0, process_pages);
>                 if (pages <= 0)
>                         return -EFAULT;
>
> diff --git a/mm/util.c b/mm/util.c
> index 093c973..1b93f2d 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -247,14 +247,8 @@ int __weak get_user_pages_fast(unsigned long start,
>                                 int nr_pages, int write, struct page **pages)
>  {
>         struct mm_struct *mm = current->mm;
> -       int ret;
> -
> -       down_read(&mm->mmap_sem);
> -       ret = get_user_pages(current, mm, start, nr_pages,
> -                                       write, 0, pages, NULL);
> -       up_read(&mm->mmap_sem);
> -
> -       return ret;
> +       return get_user_pages_unlocked(current, mm, start, nr_pages,
> +                                      write, 0, pages);
>  }
>  EXPORT_SYMBOL_GPL(get_user_pages_fast);
>
> diff --git a/net/ceph/pagevec.c b/net/ceph/pagevec.c
> index 5550130..5504783 100644
> --- a/net/ceph/pagevec.c
> +++ b/net/ceph/pagevec.c
> @@ -23,17 +23,16 @@ struct page **ceph_get_direct_page_vector(const void __user *data,
>         if (!pages)
>                 return ERR_PTR(-ENOMEM);
>
> -       down_read(&current->mm->mmap_sem);
>         while (got < num_pages) {
> -               rc = get_user_pages(current, current->mm,
> -                   (unsigned long)data + ((unsigned long)got * PAGE_SIZE),
> -                   num_pages - got, write_page, 0, pages + got, NULL);
> +               rc = get_user_pages_fast((unsigned long)data +
> +                                        ((unsigned long)got * PAGE_SIZE),
> +                                        num_pages - got,
> +                                        write_page, pages + got);
>                 if (rc < 0)
>                         break;
>                 BUG_ON(rc == 0);
>                 got += rc;
>         }
> -       up_read(&current->mm->mmap_sem);
>         if (rc < 0)
>                 goto fail;
>         return pages;
>
>
> Then to make an example your patch would have become:
>
> ===
> From 74d88763cde285354fb78806ffb332030d1f0739 Mon Sep 17 00:00:00 2001
> From: Andres Lagar-Cavilla <andreslc@google.com>
> Date: Fri, 26 Sep 2014 18:36:56 +0200
> Subject: [PATCH 2/2] kvm: Faults which trigger IO release the mmap_sem
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
>
> When KVM handles a tdp fault it uses FOLL_NOWAIT. If the guest memory
> has been swapped out or is behind a filemap, this will trigger async
> readahead and return immediately. The rationale is that KVM will kick
> back the guest with an "async page fault" and allow for some other
> guest process to take over.
>
> If async PFs are enabled the fault is retried asap from an async
> workqueue. If not, it's retried immediately in the same code path. In
> either case the retry will not relinquish the mmap semaphore and will
> block on the IO. This is a bad thing, as other mmap semaphore users
> now stall as a function of swap or filemap latency.
>
> This patch ensures both the regular and async PF path re-enter the
> fault allowing for the mmap semaphore to be relinquished in the case
> of IO wait.
>
> Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
> Signed-off-by: Andres Lagar-Cavilla <andreslc@google.com>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  include/linux/mm.h  | 1 +
>  mm/gup.c            | 4 ++++
>  virt/kvm/async_pf.c | 4 +---
>  virt/kvm/kvm_main.c | 4 ++--
>  4 files changed, 8 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 69f692d..71dbe03 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1997,6 +1997,7 @@ static inline struct page *follow_page(struct vm_area_struct *vma,
>  #define FOLL_HWPOISON  0x100   /* check page is hwpoisoned */
>  #define FOLL_NUMA      0x200   /* force NUMA hinting page fault */
>  #define FOLL_MIGRATION 0x400   /* wait for page to replace migration entry */
> +#define FOLL_TRIED     0x800   /* a retry, previous pass started an IO */
>
>  typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
>                         void *data);
> diff --git a/mm/gup.c b/mm/gup.c
> index 19e17ab..369b3f6 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -281,6 +281,10 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
>                 fault_flags |= FAULT_FLAG_ALLOW_RETRY;
>         if (*flags & FOLL_NOWAIT)
>                 fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT;
> +       if (*flags & FOLL_TRIED) {
> +               VM_WARN_ON_ONCE(fault_flags & FAULT_FLAG_ALLOW_RETRY);
> +               fault_flags |= FAULT_FLAG_TRIED;
> +       }
>
>         ret = handle_mm_fault(mm, vma, address, fault_flags);
>         if (ret & VM_FAULT_ERROR) {
> diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
> index d6a3d09..44660ae 100644
> --- a/virt/kvm/async_pf.c
> +++ b/virt/kvm/async_pf.c
> @@ -80,9 +80,7 @@ static void async_pf_execute(struct work_struct *work)
>
>         might_sleep();
>
> -       down_read(&mm->mmap_sem);
> -       get_user_pages(NULL, mm, addr, 1, 1, 0, NULL, NULL);
> -       up_read(&mm->mmap_sem);
> +       get_user_pages_unlocked(NULL, mm, addr, 1, 1, 0, NULL);
>         kvm_async_page_present_sync(vcpu, apf);
>
>         spin_lock(&vcpu->async_pf.lock);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 95519bc..921bce7 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1170,8 +1170,8 @@ static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
>                                               addr, write_fault, page);
>                 up_read(&current->mm->mmap_sem);
>         } else
> -               npages = get_user_pages_fast(addr, 1, write_fault,
> -                                            page);
> +               npages = get_user_pages_unlocked(current, current->mm, addr, 1,
> +                                                write_fault, 0, page);
>         if (npages != 1)
>                 return npages;

Acked, for the spirit. Likely my patch will go in and then you can
just throw this one on top, removing kvm_get_user_page_io in the
process.

>
>
>
> This isn't bisectable in this order and it's untested anyway. It needs
> more patchsplits.
>
> This is just an initial RFC to know if it's ok to go into this
> direction.
>
> If it's ok I'll do some testing and submit it more properly. If your
> patches goes in first it's fine and I'll just replace the call in KVM
> to get_user_pages_unlocked (or whatever we want to call that thing).
>
> I'd need to get this (or equivalent solution) merged before
> re-submitting the userfaultfd patchset. I think the above benefits the
> kernel as a whole in terms of mmap_sem holdtimes regardless of
> userfaultfd so it should be good.
>
>> Well, IIUC every code path that has ALLOW_RETRY dives in the second
>> time with FAULT_TRIED or similar. In the common case, you happily
>> blaze through the second time, but if someone raced in while all locks
>> were given up, one pays the price of the second time being a full
>> fault hogging the mmap sem. At some point you need to not keep being
>> polite otherwise the task starves. Presumably the risk of an extra
>> retry drops steeply every new gup retry. Maybe just try three times is
>> good enough. It makes for ugly logic though.
>
> I was under the idea that if one looped forever with VM_FAULT_RETRY
> it'd eventually succeed, but it risks doing more work, so I'm also
> sticking to the "locked != NULL" first, seek to the first page that
> returned VM_FAULT_RETRY and issue a nr_pages=1 gup with locked ==
> NULL, and then continue with locked != NULL at the next page. Just
> like you did in the KVM slow path. And if "pages" array is NULL I bail
> out at the first VM_FAULT_RETRY failure without insisting with further
> gup calls of the "&locked" kind, your patch had just 1 page but you
> also bailed out.
>
> What this code above does is basically to generalize your optimization
> to KVM and it makes it global and at the same time it avoids me
> trouble in handle_userfault().
>
> While at it I also converted some obvious candidate for gup_fast that
> had no point in running slower (which I should split off in a separate
> patch).

Yes to all.

The part that I'm missing is how MADV_USERFAULT would handle this.
Would it be buried in faultin_page: if no RETRY is possible, raise
SIGBUS; otherwise drop the mmap semaphore, signal userspace and sleep
on the userfaultfd?
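
(From Andrea's description earlier in the thread, the shape inside the
userfault path would presumably be roughly the sketch below; only the
ALLOW_RETRY/SIGBUS check is spelled out in this thread, the
notification/wait step is paraphrased and the function name is made
up:)

static int handle_userfault_sketch(struct vm_area_struct *vma,
				   unsigned long address,
				   unsigned int flags)
{
	if (!(flags & FAULT_FLAG_ALLOW_RETRY))
		/* gup without "locked", ptrace, coredump, ...: -EFAULT */
		return VM_FAULT_SIGBUS;

	/* allowed to drop mmap_sem: release it, notify userland, wait */
	up_read(&vma->vm_mm->mmap_sem);
	/* ... queue the fault on the userfaultfd and sleep here ... */
	return VM_FAULT_RETRY;
}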

Thanks,
Andres

>
> Thanks,
> Andrea



-- 
Andres Lagar-Cavilla | Google Kernel Team | andreslc@google.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

> index 91d044b..19e17ab 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -576,6 +576,134 @@ int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
>         return 0;
>  }
>
> +static inline long __get_user_pages_locked(struct task_struct *tsk,
> +                                          struct mm_struct *mm,
> +                                          unsigned long start,
> +                                          unsigned long nr_pages,
> +                                          int write, int force,
> +                                          struct page **pages,
> +                                          struct vm_area_struct **vmas,
> +                                          int *locked,
> +                                          bool immediate_unlock)
s/immediate_unlock/notify_drop/
> +{
> +       int flags = FOLL_TOUCH;
> +       long ret, pages_done;
> +       bool lock_dropped;
s/lock_dropped/sem_dropped/
> +
> +       if (locked) {
> +               /* if VM_FAULT_RETRY can be returned, vmas become invalid */
> +               BUG_ON(vmas);
> +               /* check caller initialized locked */
> +               BUG_ON(*locked != 1);
> +       } else {
> +               /*
> +                * Not really important, the value is irrelevant if
> +                * locked is NULL, but BUILD_BUG_ON costs nothing.
> +                */
> +               BUILD_BUG_ON(immediate_unlock);
> +       }
> +
> +       if (pages)
> +               flags |= FOLL_GET;
> +       if (write)
> +               flags |= FOLL_WRITE;
> +       if (force)
> +               flags |= FOLL_FORCE;
> +
> +       pages_done = 0;
> +       lock_dropped = false;
> +       for (;;) {
> +               ret = __get_user_pages(tsk, mm, start, nr_pages, flags, pages,
> +                                      vmas, locked);
> +               if (!locked)
> +                       /* VM_FAULT_RETRY couldn't trigger, bypass */
> +                       return ret;
> +
> +               /* VM_FAULT_RETRY cannot return errors */
> +               if (!*locked) {

Set lock_dropped = 1. In case we break out too soon (which we do if
nr_pages drops to zero a couple lines below) and report a stale value.

> +                       BUG_ON(ret < 0);
> +                       BUG_ON(nr_pages == 1 && ret);
> +               }
> +
> +               if (!pages)
> +                       /* If it's a prefault don't insist harder */
> +                       return ret;
> +
> +               if (ret > 0) {
> +                       nr_pages -= ret;
> +                       pages_done += ret;
> +                       if (!nr_pages)
> +                               break;
> +               }
> +               if (*locked) {
> +                       /* VM_FAULT_RETRY didn't trigger */
> +                       if (!pages_done)
> +                               pages_done = ret;

Replace top two lines with
if (ret >0)
    pages_done += ret;

> +                       break;
> +               }
> +               /* VM_FAULT_RETRY triggered, so seek to the faulting offset */
> +               pages += ret;
> +               start += ret << PAGE_SHIFT;
> +
> +               /*
> +                * Repeat on the address that fired VM_FAULT_RETRY
> +                * without FAULT_FLAG_ALLOW_RETRY but with
> +                * FAULT_FLAG_TRIED.
> +                */
> +               *locked = 1;
> +               lock_dropped = true;

Not really needed if set where previously suggested.

> +               down_read(&mm->mmap_sem);
> +               ret = __get_user_pages(tsk, mm, start, nr_pages, flags | FOLL_TRIED,
> +                                      pages, NULL, NULL);

s/nr_pages/1/ otherwise we block on everything left ahead, not just
the one that fired RETRY.

> +               if (ret != 1) {
> +                       BUG_ON(ret > 1);

Can ret ever be zero here with count == 1? (ENOENT for a stack guard
page TTBOMK, but what the heck are we doing gup'ing stacks. Suggest
fixing that one case inside __gup impl so count == 1 never returns
zero)

> +                       if (!pages_done)
> +                               pages_done = ret;

Don't think so. ret is --errno at this point (maybe zero). So remove.

> +                       break;
> +               }
> +               nr_pages--;
> +               pages_done++;
> +               if (!nr_pages)
> +                       break;
> +               pages++;
> +               start += PAGE_SIZE;
> +       }
> +       if (!immediate_unlock && lock_dropped && *locked) {
> +               /*
> +                * We must let the caller know we temporarily dropped the lock
> +                * and so the critical section protected by it was lost.
> +                */
> +               up_read(&mm->mmap_sem);

With my suggestion of s/immediate_unlock/notify_drop/ this gets a lot
more understandable (IMHO).

> +               *locked = 0;
> +       }
> +       return pages_done;
> +}
> +
> +long get_user_pages_locked(struct task_struct *tsk, struct mm_struct *mm,
> +                          unsigned long start, unsigned long nr_pages,
> +                          int write, int force, struct page **pages,
> +                          int *locked)
> +{
> +       return __get_user_pages_locked(tsk, mm, start, nr_pages, write, force,
> +                                      pages, NULL, locked, false);
> +}
> +EXPORT_SYMBOL(get_user_pages_locked);
> +
> +long get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm,
> +                            unsigned long start, unsigned long nr_pages,
> +                            int write, int force, struct page **pages)
> +{
> +       long ret;
> +       int locked = 1;
> +       down_read(&mm->mmap_sem);
> +       ret = __get_user_pages_locked(tsk, mm, start, nr_pages, write, force,
> +                                     pages, NULL, &locked, true);
> +       if (locked)
> +               up_read(&mm->mmap_sem);
> +       return ret;
> +}
> +EXPORT_SYMBOL(get_user_pages_unlocked);
> +
>  /*
>   * get_user_pages() - pin user pages in memory
>   * @tsk:       the task_struct to use for page fault accounting, or
> @@ -625,22 +753,19 @@ int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
>   * use the correct cache flushing APIs.
>   *
>   * See also get_user_pages_fast, for performance critical applications.
> + *
> + * get_user_pages should be gradually obsoleted in favor of
> + * get_user_pages_locked|unlocked. Nothing should use get_user_pages
> + * because it cannot pass FAULT_FLAG_ALLOW_RETRY to handle_mm_fault in
> + * turn disabling the userfaultfd feature (after that "inline" can be
> + * cleaned up from get_user_pages_locked).
>   */
>  long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>                 unsigned long start, unsigned long nr_pages, int write,
>                 int force, struct page **pages, struct vm_area_struct **vmas)
>  {
> -       int flags = FOLL_TOUCH;
> -
> -       if (pages)
> -               flags |= FOLL_GET;
> -       if (write)
> -               flags |= FOLL_WRITE;
> -       if (force)
> -               flags |= FOLL_FORCE;
> -
> -       return __get_user_pages(tsk, mm, start, nr_pages, flags, pages, vmas,
> -                               NULL);
> +       return __get_user_pages_locked(tsk, mm, start, nr_pages, write, force,
> +                                      pages, vmas, NULL, false);
>  }
>  EXPORT_SYMBOL(get_user_pages);

*Or*, forget about gup_locked and just leave gup as proposed in this
patch. Then gup_unlocked (again IMHO) becomes more meaningful ... "Ah,
that's the one I call when I have no locks taken".

>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 8f5330d..6606c10 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -881,7 +881,7 @@ static int lookup_node(struct mm_struct *mm, unsigned long addr)
>         struct page *p;
>         int err;
>
> -       err = get_user_pages(current, mm, addr & PAGE_MASK, 1, 0, 0, &p, NULL);
> +       err = get_user_pages_fast(addr & PAGE_MASK, 1, 0, &p);
>         if (err >= 0) {
>                 err = page_to_nid(p);
>                 put_page(p);
> diff --git a/mm/nommu.c b/mm/nommu.c
> index a881d96..8a06341 100644
> --- a/mm/nommu.c
> +++ b/mm/nommu.c
> @@ -213,6 +213,29 @@ long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>  }
>  EXPORT_SYMBOL(get_user_pages);
>
> +long get_user_pages_locked(struct task_struct *tsk, struct mm_struct *mm,
> +                          unsigned long start, unsigned long nr_pages,
> +                          int write, int force, struct page **pages,
> +                          int *locked)
> +{
> +       return get_user_pages(tsk, mm, start, nr_pages, write, force,
> +                             pages, NULL);
> +}
> +EXPORT_SYMBOL(get_user_pages_locked);
> +
> +long get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm,
> +                            unsigned long start, unsigned long nr_pages,
> +                            int write, int force, struct page **pages)
> +{
> +       long ret;
> +       down_read(&mm->mmap_sem);
> +       ret = get_user_pages(tsk, mm, start, nr_pages, write, force,
> +                            pages, NULL);
> +       up_read(&mm->mmap_sem);
> +       return ret;
> +}
> +EXPORT_SYMBOL(get_user_pages_unlocked);
> +
>  /**
>   * follow_pfn - look up PFN at a user virtual address
>   * @vma: memory mapping
> diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
> index 5077afc..b159769 100644
> --- a/mm/process_vm_access.c
> +++ b/mm/process_vm_access.c
> @@ -99,11 +99,8 @@ static int process_vm_rw_single_vec(unsigned long addr,
>                 size_t bytes;
>
>                 /* Get the pages we're interested in */
> -               down_read(&mm->mmap_sem);
> -               pages = get_user_pages(task, mm, pa, pages,
> -                                     vm_write, 0, process_pages, NULL);
> -               up_read(&mm->mmap_sem);
> -
> +               pages = get_user_pages_unlocked(task, mm, pa, pages,
> +                                               vm_write, 0, process_pages);
>                 if (pages <= 0)
>                         return -EFAULT;
>
> diff --git a/mm/util.c b/mm/util.c
> index 093c973..1b93f2d 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -247,14 +247,8 @@ int __weak get_user_pages_fast(unsigned long start,
>                                 int nr_pages, int write, struct page **pages)
>  {
>         struct mm_struct *mm = current->mm;
> -       int ret;
> -
> -       down_read(&mm->mmap_sem);
> -       ret = get_user_pages(current, mm, start, nr_pages,
> -                                       write, 0, pages, NULL);
> -       up_read(&mm->mmap_sem);
> -
> -       return ret;
> +       return get_user_pages_unlocked(current, mm, start, nr_pages,
> +                                      write, 0, pages);
>  }
>  EXPORT_SYMBOL_GPL(get_user_pages_fast);
>
> diff --git a/net/ceph/pagevec.c b/net/ceph/pagevec.c
> index 5550130..5504783 100644
> --- a/net/ceph/pagevec.c
> +++ b/net/ceph/pagevec.c
> @@ -23,17 +23,16 @@ struct page **ceph_get_direct_page_vector(const void __user *data,
>         if (!pages)
>                 return ERR_PTR(-ENOMEM);
>
> -       down_read(&current->mm->mmap_sem);
>         while (got < num_pages) {
> -               rc = get_user_pages(current, current->mm,
> -                   (unsigned long)data + ((unsigned long)got * PAGE_SIZE),
> -                   num_pages - got, write_page, 0, pages + got, NULL);
> +               rc = get_user_pages_fast((unsigned long)data +
> +                                        ((unsigned long)got * PAGE_SIZE),
> +                                        num_pages - got,
> +                                        write_page, pages + got);
>                 if (rc < 0)
>                         break;
>                 BUG_ON(rc == 0);
>                 got += rc;
>         }
> -       up_read(&current->mm->mmap_sem);
>         if (rc < 0)
>                 goto fail;
>         return pages;
>
>
> Then to make an example your patch would have become:
>
> ===
> From 74d88763cde285354fb78806ffb332030d1f0739 Mon Sep 17 00:00:00 2001
> From: Andres Lagar-Cavilla <andreslc@google.com>
> Date: Fri, 26 Sep 2014 18:36:56 +0200
> Subject: [PATCH 2/2] kvm: Faults which trigger IO release the mmap_sem
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
>
> When KVM handles a tdp fault it uses FOLL_NOWAIT. If the guest memory
> has been swapped out or is behind a filemap, this will trigger async
> readahead and return immediately. The rationale is that KVM will kick
> back the guest with an "async page fault" and allow for some other
> guest process to take over.
>
> If async PFs are enabled the fault is retried asap from an async
> workqueue. If not, it's retried immediately in the same code path. In
> either case the retry will not relinquish the mmap semaphore and will
> block on the IO. This is a bad thing, as other mmap semaphore users
> now stall as a function of swap or filemap latency.
>
> This patch ensures both the regular and async PF path re-enter the
> fault allowing for the mmap semaphore to be relinquished in the case
> of IO wait.
>
> Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
> Signed-off-by: Andres Lagar-Cavilla <andreslc@google.com>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  include/linux/mm.h  | 1 +
>  mm/gup.c            | 4 ++++
>  virt/kvm/async_pf.c | 4 +---
>  virt/kvm/kvm_main.c | 4 ++--
>  4 files changed, 8 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 69f692d..71dbe03 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1997,6 +1997,7 @@ static inline struct page *follow_page(struct vm_area_struct *vma,
>  #define FOLL_HWPOISON  0x100   /* check page is hwpoisoned */
>  #define FOLL_NUMA      0x200   /* force NUMA hinting page fault */
>  #define FOLL_MIGRATION 0x400   /* wait for page to replace migration entry */
> +#define FOLL_TRIED     0x800   /* a retry, previous pass started an IO */
>
>  typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
>                         void *data);
> diff --git a/mm/gup.c b/mm/gup.c
> index 19e17ab..369b3f6 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -281,6 +281,10 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
>                 fault_flags |= FAULT_FLAG_ALLOW_RETRY;
>         if (*flags & FOLL_NOWAIT)
>                 fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT;
> +       if (*flags & FOLL_TRIED) {
> +               VM_WARN_ON_ONCE(fault_flags & FAULT_FLAG_ALLOW_RETRY);
> +               fault_flags |= FAULT_FLAG_TRIED;
> +       }
>
>         ret = handle_mm_fault(mm, vma, address, fault_flags);
>         if (ret & VM_FAULT_ERROR) {
> diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
> index d6a3d09..44660ae 100644
> --- a/virt/kvm/async_pf.c
> +++ b/virt/kvm/async_pf.c
> @@ -80,9 +80,7 @@ static void async_pf_execute(struct work_struct *work)
>
>         might_sleep();
>
> -       down_read(&mm->mmap_sem);
> -       get_user_pages(NULL, mm, addr, 1, 1, 0, NULL, NULL);
> -       up_read(&mm->mmap_sem);
> +       get_user_pages_unlocked(NULL, mm, addr, 1, 1, 0, NULL);
>         kvm_async_page_present_sync(vcpu, apf);
>
>         spin_lock(&vcpu->async_pf.lock);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 95519bc..921bce7 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1170,8 +1170,8 @@ static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
>                                               addr, write_fault, page);
>                 up_read(&current->mm->mmap_sem);
>         } else
> -               npages = get_user_pages_fast(addr, 1, write_fault,
> -                                            page);
> +               npages = get_user_pages_unlocked(current, current->mm, addr, 1,
> +                                                write_fault, 0, page);
>         if (npages != 1)
>                 return npages;

Acked, for the spirit. Likely my patch will go in and then you can
just throw this one on top, removing kvm_get_user_page_io in the
process.

>
>
>
> This isn't bisectable in this order and it's untested anyway. It needs
> more patchsplits.
>
> This is just an initial RFC to know if it's ok to go into this
> direction.
>
> If it's ok I'll do some testing and submit it more properly. If your
> patches goes in first it's fine and I'll just replace the call in KVM
> to get_user_pages_unlocked (or whatever we want to call that thing).
>
> I'd need to get this (or equivalent solution) merged before
> re-submitting the userfaultfd patchset. I think the above benefits the
> kernel as a whole in terms of mmap_sem holdtimes regardless of
> userfaultfd so it should be good.
>
>> Well, IIUC every code path that has ALLOW_RETRY dives in the second
>> time with FAULT_TRIED or similar. In the common case, you happily
>> blaze through the second time, but if someone raced in while all locks
>> were given up, one pays the price of the second time being a full
>> fault hogging the mmap sem. At some point you need to not keep being
>> polite otherwise the task starves. Presumably the risk of an extra
>> retry drops steeply every new gup retry. Maybe just try three times is
>> good enough. It makes for ugly logic though.
>
> I was under the idea that if one looped forever with VM_FAULT_RETRY
> it'd eventually succeed, but it risks doing more work, so I'm also
> sticking to the "locked != NULL" first, seek to the first page that
> returned VM_FAULT_RETRY and issue a nr_pages=1 gup with locked ==
> NULL, and then continue with locked != NULL at the next page. Just
> like you did in the KVM slow path. And if "pages" array is NULL I bail
> out at the first VM_FAULT_RETRY failure without insisting with further
> gup calls of the "&locked" kind, your patch had just 1 page but you
> also bailed out.
>
> What this code above does is basically to generalize your optimization
> to KVM and it makes it global and at the same time it avoids me
> trouble in handle_userfault().
>
> While at it I also converted some obvious candidate for gup_fast that
> had no point in running slower (which I should split off in a separate
> patch).

Yes to all.

The part that I'm missing is how would MADV_USERFAULT handle this. It
would be buried in faultin_page, if no RETRY possible raise sigbus,
otherwise drop the mmap semaphore and signal and sleep on the
userfaultfd?

Thanks,
Andres

>
> Thanks,
> Andrea



-- 
Andres Lagar-Cavilla | Google Kernel Team | andreslc@google.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: RFC: get_user_pages_locked|unlocked to leverage VM_FAULT_RETRY
  2014-09-26 19:54   ` Andres Lagar-Cavilla
@ 2014-09-28 14:00     ` Andrea Arcangeli
  -1 siblings, 0 replies; 18+ messages in thread
From: Andrea Arcangeli @ 2014-09-28 14:00 UTC (permalink / raw)
  To: Andres Lagar-Cavilla
  Cc: Gleb Natapov, Radim Krcmar, Paolo Bonzini, Rik van Riel,
	Peter Zijlstra, Mel Gorman, Andy Lutomirski, Andrew Morton,
	Sasha Levin, Jianyu Zhan, Paul Cassella, Hugh Dickins,
	Peter Feiner, kvm, linux-kernel, linux-mm,
	Dr. David Alan Gilbert

On Fri, Sep 26, 2014 at 12:54:46PM -0700, Andres Lagar-Cavilla wrote:
> On Fri, Sep 26, 2014 at 10:25 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> > On Thu, Sep 25, 2014 at 02:50:29PM -0700, Andres Lagar-Cavilla wrote:
> >> It's nearly impossible to name it right because 1) it indicates we can
> >> relinquish 2) it returns whether we still hold the mmap semaphore.
> >>
> >> I'd prefer it'd be called mmap_sem_hold, which conveys immediately
> >> what this is about ("nonblocking" or "locked" could be about a whole
> >> lot of things)
> >
> > To me FOLL_NOWAIT/FAULT_FLAG_RETRY_NOWAIT is nonblocking,
> > "locked"/FAULT_FLAG_ALLOW_RETRY is still very much blocking, just
> > without the mmap_sem, so I called it "locked"... but I'm fine to
> > change the name to mmap_sem_hold. Just get_user_pages_mmap_sem_hold
> > seems less friendly than get_user_pages_locked(..., &locked). locked
> > as you used comes intuitive when you do later "if (locked) up_read".
> >
> 
> Heh. I was previously referring to the int *locked param , not the
> _(un)locked suffix. That param is all about the mmap semaphore, so why
> not name it less ambiguously. It's essentially a tristate.

I got that you were referring to the parameter name; the problem is I
didn't want to call the function get_user_pages_mmap_sem_hold(), and
if I call it get_user_pages_locked(), calling the parameter "*locked"
just like you did in your patch looked more intuitive.

Suggestions for a better name than gup_locked are welcome though!

> My suggestion is that you just make gup behave as your proposed
> gup_locked, and no need to introduce another call. But I understand if
> you want to phase this out politely.

Yes, replacing gup would be ideal, but various drivers make use of it
with a larger critical section and I didn't want to have to deal with
all that immediately. Not to mention anything that uses the "vmas"
parameter, which prevents releasing the mmap_sem by design and will
require larger modifications to get rid of.

So with this patch there's an optimal version of gup_locked|unlocked,
and a "nonscalable" one that allows for a large critical section
before and after gup with the vmas parameter too.
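
Just to show the kind of caller that needs the "vmas" parameter, this
is roughly the pattern (purely a sketch: tsk/mm/addr are whatever
context the driver already has, and the hugetlb check is just a
stand-in for whatever the caller inspects the vma for):

	struct vm_area_struct *vmas[1];
	struct page *page;
	int is_huge;
	long ret;

	down_read(&mm->mmap_sem);
	ret = get_user_pages(tsk, mm, addr, 1, 1, 0, &page, vmas);
	if (ret == 1) {
		/* vmas[0] is only valid while the mmap_sem is held */
		is_huge = is_vm_hugetlb_page(vmas[0]);
		put_page(page);
	}
	up_read(&mm->mmap_sem);

The vma pointer becomes stale the moment the rwsem is dropped, which
is why __get_user_pages_locked does BUG_ON(vmas) when a non-NULL
"locked" is passed.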

> > Then I added an _unlocked kind which is a drop in replacement for many
> > places just to clean it up.
> >
> > get_user_pages_unlocked and get_user_pages_fast are equivalent in
> > semantics, so any call of get_user_pages_unlocked(current,
> > current->mm, ...) has no reason to exist and should be replaced to
> > get_user_pages_fast unless "force = 1" (gup_fast has no force param
> > just to make the argument list a bit more confusing across the various
> > versions of gup).
> >
> > get_user_pages over time should be phased out and dropped.
> 
> Please. Too many variants. So the end goal is
> * __gup_fast
> * gup_fast == __gup_fast + gup_unlocked for fallback
> * gup (or gup_locked)
> * gup_unlocked
> (and flat __gup remains buried in the impl)?

That's exactly the end goal, yes.

> Basically all this discussion should go into the patch as comments.
> Help people shortcut git blame.

Sure, I added the comments from the commit header inline too.

> > +static inline long __get_user_pages_locked(struct task_struct *tsk,
> > +                                          struct mm_struct *mm,
> > +                                          unsigned long start,
> > +                                          unsigned long nr_pages,
> > +                                          int write, int force,
> > +                                          struct page **pages,
> > +                                          struct vm_area_struct **vmas,
> > +                                          int *locked,
> > +                                          bool immediate_unlock)
> s/immediate_unlock/notify_drop/

Applied.

> > +{
> > +       int flags = FOLL_TOUCH;
> > +       long ret, pages_done;
> > +       bool lock_dropped;
> s/lock_dropped/sem_dropped/

Well, this sounds a bit more confusing actually, unless we stop
calling the parameter "locked" first.

I mean, it's the very "locked" parameter that the "lock_dropped"
variable refers to. So I wouldn't bother changing it to "sem"; I'd
stick to the generic concept of locked/unlocked regardless of the
underlying implementation (the rwsem taken for reading).

> > +
> > +       if (locked) {
> > +               /* if VM_FAULT_RETRY can be returned, vmas become invalid */
> > +               BUG_ON(vmas);
> > +               /* check caller initialized locked */
> > +               BUG_ON(*locked != 1);
> > +       } else {
> > +               /*
> > +                * Not really important, the value is irrelevant if
> > +                * locked is NULL, but BUILD_BUG_ON costs nothing.
> > +                */
> > +               BUILD_BUG_ON(immediate_unlock);
> > +       }
> > +
> > +       if (pages)
> > +               flags |= FOLL_GET;
> > +       if (write)
> > +               flags |= FOLL_WRITE;
> > +       if (force)
> > +               flags |= FOLL_FORCE;
> > +
> > +       pages_done = 0;
> > +       lock_dropped = false;
> > +       for (;;) {
> > +               ret = __get_user_pages(tsk, mm, start, nr_pages, flags, pages,
> > +                                      vmas, locked);
> > +               if (!locked)
> > +                       /* VM_FAULT_RETRY couldn't trigger, bypass */
> > +                       return ret;
> > +
> > +               /* VM_FAULT_RETRY cannot return errors */
> > +               if (!*locked) {
> 
> Set lock_dropped = 1. In case we break out too soon (which we do if
> nr_pages drops to zero a couple lines below) and report a stale value.

This is purely a debug path (I suppose the compiler will nuke the
!*locked check too and the branch, if BUG_ON() is defined to a noop).

We don't need to set lock_dropped if the mmap_sem was released because
*locked is checked later:

	if (notify_drop && lock_dropped && *locked) {

So I set lock_dropped only at the point where I "reacquire" the
mmap_sem for reading and force *locked back to 1. As long as *locked
== 0, there's no point in setting lock_dropped.

> > +                       BUG_ON(ret < 0);
> > +                       BUG_ON(nr_pages == 1 && ret);
> > +               }
> > +
> > +               if (!pages)
> > +                       /* If it's a prefault don't insist harder */
> > +                       return ret;
> > +
> > +               if (ret > 0) {
> > +                       nr_pages -= ret;
> > +                       pages_done += ret;
> > +                       if (!nr_pages)
> > +                               break;
> > +               }
> > +               if (*locked) {
> > +                       /* VM_FAULT_RETRY didn't trigger */
> > +                       if (!pages_done)
> > +                               pages_done = ret;
> 
> Replace top two lines with
> if (ret >0)
>     pages_done += ret;

I don't get what to change above exactly. In general, every time
pages_done is increased, nr_pages shall be decreased, so just doing
pages_done += ret doesn't look right; it's not clear to me.

> > +                       break;
> > +               }
> > +               /* VM_FAULT_RETRY triggered, so seek to the faulting offset */
> > +               pages += ret;
> > +               start += ret << PAGE_SHIFT;
> > +
> > +               /*
> > +                * Repeat on the address that fired VM_FAULT_RETRY
> > +                * without FAULT_FLAG_ALLOW_RETRY but with
> > +                * FAULT_FLAG_TRIED.
> > +                */
> > +               *locked = 1;
> > +               lock_dropped = true;
> 
> Not really needed if set where previously suggested.

Yes, I just thought it's simpler to set it here, because lock_dropped
is only meant to cover the case where we "reacquire" the mmap_sem and
override *locked = 1 (to notify the caller we destroyed the critical
section if the caller gets locked == 1 and thinks it was never
released). Just in case the caller does something like:

      down_read(mmap_sem);
      vma = find_vma_somehow();
      ...
      locked = 1;
      gup_locked(&locked);
      if (!locked) {
      	 down_read(mmap_sem);
	 vma = find_vma_somehow();
      }
      use vma
      up_read(mmap_sem);

that's what notify_drop and lock_dropped are all about and it only
matters for gup_locked (only gup_locked will set notify_drop to true).

> > +               down_read(&mm->mmap_sem);
> > +               ret = __get_user_pages(tsk, mm, start, nr_pages, flags | FOLL_TRIED,
> > +                                      pages, NULL, NULL);
> 
> s/nr_pages/1/ otherwise we block on everything left ahead, not just
> the one that fired RETRY.

Applied. This just slipped. At least there would have been no risk of
this going unnoticed: it would BUG_ON at the first O_DIRECT just below
with ret > 1 :).

> > +               if (ret != 1) {
> > +                       BUG_ON(ret > 1);
> 
> Can ret ever be zero here with count == 1? (ENOENT for a stack guard
> page TTBOMK, but what the heck are we doing gup'ing stacks. Suggest
> fixing that one case inside __gup impl so count == 1 never returns
> zero)

I'm ok with that change but I'd leave it for later. I'd prefer to keep
the get_user_pages API identical for the legacy callers of gup().

> 
> > +                       if (!pages_done)
> > +                               pages_done = ret;
> 
> Don't think so. ret is --errno at this point (maybe zero). So remove.

Hmm I don't get what's wrong with the above if ret is --errno or zero.

If we get anything but 1 from the gup(locked == NULL) invocation on
the page that returned VM_FAULT_RETRY we must stop, and that is the
definitive retval of __gup_locked (we must forget that error and
return pages_done if any earlier pass of the loop succeeded).

Or if I drop it, we could leak all the pages pinned in the previous
loop passes that succeeded, or miss the error if we got one and this
was the first pass.

This is __gup behavior too:

		if (IS_ERR(page))
			return i ? i : PTR_ERR(page);

It's up to the caller to figure out that the page at address "start +
(pages_done << PAGE_SHIFT)" failed gup and should be retried from that
address, if the caller wants to get a meaningful -errno.
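
To spell out what that means for a caller of gup_locked, the pattern
would be roughly this (just a sketch: whether to retry from the
reported offset or simply bail out with -EFAULT is the caller's
choice, and the release loop assumes a non-NULL "pages" array):

	int locked = 1;
	long got;

	down_read(&mm->mmap_sem);
	got = get_user_pages_locked(tsk, mm, start, nr_pages,
				    write, 0, pages, &locked);
	if (locked)
		up_read(&mm->mmap_sem);
	if (got < 0)
		return got;
	if (got < nr_pages) {
		/*
		 * The page at start + (got << PAGE_SHIFT) is the one
		 * that failed: retry from there for the -errno, or
		 * release what was pinned and give up.
		 */
		while (got-- > 0)
			put_page(pages[got]);
		return -EFAULT;
	}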

> > +                       break;
> > +               }
> > +               nr_pages--;
> > +               pages_done++;
> > +               if (!nr_pages)
> > +                       break;
> > +               pages++;
> > +               start += PAGE_SIZE;
> > +       }
> > +       if (!immediate_unlock && lock_dropped && *locked) {
> > +               /*
> > +                * We must let the caller know we temporarily dropped the lock
> > +                * and so the critical section protected by it was lost.
> > +                */
> > +               up_read(&mm->mmap_sem);
> 
> With my suggestion of s/immediate_unlock/notify_drop/ this gets a lot
> more understandable (IMHO).

It becomes like this, yes:

	if (notify_drop && lock_dropped && *locked) {
		/*
		 * We must let the caller know we temporarily dropped the lock
		 * and so the critical section protected by it was lost.
		 */
		up_read(&mm->mmap_sem);
		*locked = 0;
	}

And only gup_locked passes it as true (unlocked drops the lock anyway
before returning if it was still held, so it doesn't need a notify).

[..]
long get_user_pages_locked(struct task_struct *tsk, struct mm_struct *mm,
[..]
	return __get_user_pages_locked(tsk, mm, start, nr_pages, write, force,
				       pages, NULL, locked, true);
[..]
long get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm,
[..]
	ret = __get_user_pages_locked(tsk, mm, start, nr_pages, write, force,
				      pages, NULL, &locked, false);
[..]
long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
[..]
	return __get_user_pages_locked(tsk, mm, start, nr_pages, write, force,
				       pages, vmas, NULL, false);
[..]


> *Or*, forget about gup_locked and just leave gup as proposed in this
> patch. Then gup_unlocked (again IMHO) becomes more meaningful ... "Ah,
> that's the one I call when I have no locks taken".

Yes, that's the longer term objective, but that will also require
getting rid of the vmas argument and more auditing.

> > -               npages = get_user_pages_fast(addr, 1, write_fault,
> > -                                            page);
> > +               npages = get_user_pages_unlocked(current, current->mm, addr, 1,
> > +                                                write_fault, 0, page);
> >         if (npages != 1)
> >                 return npages;
> 
> Acked, for the spirit. Likely my patch will go in and then you can
> just throw this one on top, removing kvm_get_user_page_io in the
> process.

Sure, that's fine. I'm very happy with your patch and it should go in
first as it's a first step in the right direction. Making this
incremental is trivial.

Also note that gup_fast would be doing the right thing with my patch
applied, so we could theoretically still call gup_fast in the kvm slow
path, but I figure from the review of your patch that __gup_fast was
already tried shortly earlier by the time we got there, so it should
be more efficient to skip the irq-disabled path and just take the
gup_locked() slow path. So replacing kvm_get_user_page_io with
gup_locked sounds better than going back to the original gup_fast.
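
Concretely, whatever is left of that helper on the KVM side should
boil down to something like this (just a sketch, I didn't check it
against the final version of your patch):

	int npages;
	int locked = 1;

	down_read(&current->mm->mmap_sem);
	npages = get_user_pages_locked(current, current->mm, addr, 1,
				       write_fault, 0, page, &locked);
	if (locked)
		up_read(&current->mm->mmap_sem);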

> > While at it I also converted some obvious candidate for gup_fast that
> > had no point in running slower (which I should split off in a separate
> > patch).
> 
> Yes to all.

Thanks for the review! 

> The part that I'm missing is how would MADV_USERFAULT handle this. It
> would be buried in faultin_page, if no RETRY possible raise sigbus,
> otherwise drop the mmap semaphore and signal and sleep on the
> userfaultfd?

Below are the userfaultfd entry points.

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index af61e57..26f59af 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -728,6 +732,20 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 		pte_free(mm, pgtable);
 	} else {
 		pmd_t entry;
+
+		/* Deliver the page fault to userland */
+		if (vma->vm_flags & VM_USERFAULT) {
+			int ret;
+
+			spin_unlock(ptl);
+			mem_cgroup_uncharge_page(page);
+			put_page(page);
+			pte_free(mm, pgtable);
+			ret = handle_userfault(vma, haddr, flags);
+			VM_BUG_ON(ret & VM_FAULT_FALLBACK);
+			return ret;
+		}
+
 		entry = mk_huge_pmd(page, vma);
 		page_add_new_anon_rmap(page, vma, haddr);
 		pgtable_trans_huge_deposit(mm, pmd, pgtable);
@@ -808,14 +825,27 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			return VM_FAULT_FALLBACK;
 		}
 		ptl = pmd_lock(mm, pmd);
-		set = set_huge_zero_page(pgtable, mm, vma, haddr, pmd,
-				zero_page);
-		spin_unlock(ptl);
+		ret = 0;
+		set = false;
+		if (pmd_none(*pmd)) {
+			if (vma->vm_flags & VM_USERFAULT) {
+				spin_unlock(ptl);
+				ret = handle_userfault(vma, haddr, flags);
+				VM_BUG_ON(ret & VM_FAULT_FALLBACK);
+			} else {
+				set_huge_zero_page(pgtable, mm, vma,
+						   haddr, pmd,
+						   zero_page);
+				spin_unlock(ptl);
+				set = true;
+			}
+		} else
+			spin_unlock(ptl);
 		if (!set) {
 			pte_free(mm, pgtable);
 			put_huge_zero_page();
 		}
-		return 0;
+		return ret;
 	}
 	page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
 			vma, haddr, numa_node_id(), 0);
diff --git a/mm/memory.c b/mm/memory.c
index 986ddb2..18b8dde 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3244,6 +3245,11 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
 		if (!pte_none(*page_table))
 			goto unlock;
+		/* Deliver the page fault to userland, check inside PT lock */
+		if (vma->vm_flags & VM_USERFAULT) {
+			pte_unmap_unlock(page_table, ptl);
+			return handle_userfault(vma, address, flags);
+		}
 		goto setpte;
 	}
 
@@ -3271,6 +3277,14 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (!pte_none(*page_table))
 		goto release;
 
+	/* Deliver the page fault to userland, check inside PT lock */
+	if (vma->vm_flags & VM_USERFAULT) {
+		pte_unmap_unlock(page_table, ptl);
+		mem_cgroup_uncharge_page(page);
+		page_cache_release(page);
+		return handle_userfault(vma, address, flags);
+	}
+
 	inc_mm_counter_fast(mm, MM_ANONPAGES);
 	page_add_new_anon_rmap(page, vma, address);
 setpte:

I hope the developers working on volatile pages could help add the
pagecache entry points for tmpfs too, so they can use MADV_USERFAULT
(and userfaultfd) on tmpfs backed memory in combination with volatile
pages (that way they don't need to use syscalls before touching the
volatile memory, and they can catch the fault in userland with SIGBUS
or the userfaultfd protocol). Then we have to decide how to resolve
the fault if they want to do it all with on-demand paging. For
anonymous memory we do it with a newly introduced remap_anon_pages,
which is like a strict and optimized version of mremap that only
touches two trans_huge_pmds/ptes and takes only the mmap_sem for
reading (and won't ever risk leading to silent memory corruption if
userland is buggy, because it triggers all sorts of errors if the src
pmd/pte is null or the dst pmd/pte is not null). But if there's a
fault they could also just drop the volatile range and rebuild the
area (similarly to what would happen with the syscall, it just avoids
running the syscall in the fast path). The slow path is probably not
performance critical because it's not as common as in the KVM case (in
the KVM case we know we'll fire a flood of userfaults, so the
userfault handler must be as efficient as possible, with
remap_anon_pages THP aware too; the volatile pages slow path shouldn't
ever run normally instead).

Anyway the first priority is to solve the above problem with
gup_locked. My last userfaultfd patch was rock solid in practice on
KVM, but I was too aggressive at dropping the mmap_sem even when I
shouldn't have, so it wasn't yet production quality, and I just can't
allow userland to indefinitely keep the mmap_sem held for reading
without special user privileges either (and no, I don't want
userfaultfd to require special permissions).

In the meantime I also implemented the range
registration/deregistration ops, so you can have an arbitrary number
of userfaultfds for each process and shared libs are free to use
userfaultfd independently of each other, as long as each one is
tracking its own ranges as vmas (supposedly each library will use
userfaultfd for its own memory without stepping on each other's toes;
two different libraries pretending to handle userfaults on the same
memory wouldn't make sense by design in the first place, so this
constraint looks fine).

I had to extend vma_merge though; it still looks cleaner to work with
native vmas than to build up a collision-free per-process range
registration layer inside userfaultfd. The vmas already solve 99% of
that problem, so it felt better to make a small extension to the vmas,
exactly like it was done for the vma_policy earlier.

Andrea

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: RFC: get_user_pages_locked|unlocked to leverage VM_FAULT_RETRY
@ 2014-09-28 14:00     ` Andrea Arcangeli
  0 siblings, 0 replies; 18+ messages in thread
From: Andrea Arcangeli @ 2014-09-28 14:00 UTC (permalink / raw)
  To: Andres Lagar-Cavilla
  Cc: Gleb Natapov, Radim Krcmar, Paolo Bonzini, Rik van Riel,
	Peter Zijlstra, Mel Gorman, Andy Lutomirski, Andrew Morton,
	Sasha Levin, Jianyu Zhan, Paul Cassella, Hugh Dickins,
	Peter Feiner, kvm, linux-kernel, linux-mm,
	Dr. David Alan Gilbert

On Fri, Sep 26, 2014 at 12:54:46PM -0700, Andres Lagar-Cavilla wrote:
> On Fri, Sep 26, 2014 at 10:25 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> > On Thu, Sep 25, 2014 at 02:50:29PM -0700, Andres Lagar-Cavilla wrote:
> >> It's nearly impossible to name it right because 1) it indicates we can
> >> relinquish 2) it returns whether we still hold the mmap semaphore.
> >>
> >> I'd prefer it'd be called mmap_sem_hold, which conveys immediately
> >> what this is about ("nonblocking" or "locked" could be about a whole
> >> lot of things)
> >
> > To me FOLL_NOWAIT/FAULT_FLAG_RETRY_NOWAIT is nonblocking,
> > "locked"/FAULT_FLAG_ALLOW_RETRY is still very much blocking, just
> > without the mmap_sem, so I called it "locked"... but I'm fine to
> > change the name to mmap_sem_hold. Just get_user_pages_mmap_sem_hold
> > seems less friendly than get_user_pages_locked(..., &locked). locked
> > as you used comes intuitive when you do later "if (locked) up_read".
> >
> 
> Heh. I was previously referring to the int *locked param , not the
> _(un)locked suffix. That param is all about the mmap semaphore, so why
> not name it less ambiguously. It's essentially a tristate.

I got you were referring to the parameter name, problem I didn't want
to call the function get_user_pages_mmap_sem_hold(), and if I call it
get_user_pages_locked() calling the parameter "*locked" just like you
did in your patch looked more intuitive.

Suggestions to a better than gup_locked are welcome though!

> My suggestion is that you just make gup behave as your proposed
> gup_locked, and no need to introduce another call. But I understand if
> you want to phase this out politely.

Yes replacing gup would be ideal but there are various drivers that
make use if and have a larger critical section and I didn't want to
have to deal with all that immediately. Not to tell if anything uses
the "vmas" parameter that prevent releasing the mmap_sem by design and
will require larger modifications to get rid of.

So with this patch there's an optimal version of gup_locked|unlocked,
and a "nonscalable" one that allows for a large critical section
before and after gup with the vmas parameter too.

> > Then I added an _unlocked kind which is a drop in replacement for many
> > places just to clean it up.
> >
> > get_user_pages_unlocked and get_user_pages_fast are equivalent in
> > semantics, so any call of get_user_pages_unlocked(current,
> > current->mm, ...) has no reason to exist and should be replaced to
> > get_user_pages_fast unless "force = 1" (gup_fast has no force param
> > just to make the argument list a bit more confusing across the various
> > versions of gup).
> >
> > get_user_pages over time should be phased out and dropped.
> 
> Please. Too many variants. So the end goal is
> * __gup_fast
> * gup_fast == __gup_fast + gup_unlocked for fallback
> * gup (or gup_locked)
> * gup_unlocked
> (and flat __gup remains buried in the impl)?

That's exactly the end goal, yes.

> Basically all this discussion should go into the patch as comments.
> Help people shortcut git blame.

Sure, I added the comments of the commit header inline too.

> > +static inline long __get_user_pages_locked(struct task_struct *tsk,
> > +                                          struct mm_struct *mm,
> > +                                          unsigned long start,
> > +                                          unsigned long nr_pages,
> > +                                          int write, int force,
> > +                                          struct page **pages,
> > +                                          struct vm_area_struct **vmas,
> > +                                          int *locked,
> > +                                          bool immediate_unlock)
> s/immediate_unlock/notify_drop/

Applied.

> > +{
> > +       int flags = FOLL_TOUCH;
> > +       long ret, pages_done;
> > +       bool lock_dropped;
> s/lock_dropped/sem_dropped/

Well, this sounds a bit more confusing actually, unless we stop
calling the parameter "locked" first.

I mean it's the very "locked" parameter the "lock_dropped" variable
refers to. So I wouldn't bother to change to "sem" and stick to the
generic concept of locked/unlocked regardless of the underlying
implementation (the rwsem for reading).

> > +
> > +       if (locked) {
> > +               /* if VM_FAULT_RETRY can be returned, vmas become invalid */
> > +               BUG_ON(vmas);
> > +               /* check caller initialized locked */
> > +               BUG_ON(*locked != 1);
> > +       } else {
> > +               /*
> > +                * Not really important, the value is irrelevant if
> > +                * locked is NULL, but BUILD_BUG_ON costs nothing.
> > +                */
> > +               BUILD_BUG_ON(immediate_unlock);
> > +       }
> > +
> > +       if (pages)
> > +               flags |= FOLL_GET;
> > +       if (write)
> > +               flags |= FOLL_WRITE;
> > +       if (force)
> > +               flags |= FOLL_FORCE;
> > +
> > +       pages_done = 0;
> > +       lock_dropped = false;
> > +       for (;;) {
> > +               ret = __get_user_pages(tsk, mm, start, nr_pages, flags, pages,
> > +                                      vmas, locked);
> > +               if (!locked)
> > +                       /* VM_FAULT_RETRY couldn't trigger, bypass */
> > +                       return ret;
> > +
> > +               /* VM_FAULT_RETRY cannot return errors */
> > +               if (!*locked) {
> 
> Set lock_dropped = 1. In case we break out too soon (which we do if
> nr_pages drops to zero a couple lines below) and report a stale value.

This is purely a debug path (I suppose the compiler will nuke the
!*locked check too and the branch, if BUG_ON() is defined to a noop).

We don't need to set lock_dropped if the mmap_sem was released because
*locked is checked later:

	if (notify_drop && lock_dropped && *locked) {

So I set lock_dropped only by the time I "reacquire" the mmap_sem for
reading and I force *locked back to 1. As long as *locked == 0,
there's no point to set lock_dropped.

> > +                       BUG_ON(ret < 0);
> > +                       BUG_ON(nr_pages == 1 && ret);
> > +               }
> > +
> > +               if (!pages)
> > +                       /* If it's a prefault don't insist harder */
> > +                       return ret;
> > +
> > +               if (ret > 0) {
> > +                       nr_pages -= ret;
> > +                       pages_done += ret;
> > +                       if (!nr_pages)
> > +                               break;
> > +               }
> > +               if (*locked) {
> > +                       /* VM_FAULT_RETRY didn't trigger */
> > +                       if (!pages_done)
> > +                               pages_done = ret;
> 
> Replace top two lines with
> if (ret >0)
>     pages_done += ret;

I don't get what to change above exactly. In general every time
pages_done is increased nr_pages shall be decreased so just doing
pages_done += ret doesn't look right so it's not clear.

> > +                       break;
> > +               }
> > +               /* VM_FAULT_RETRY triggered, so seek to the faulting offset */
> > +               pages += ret;
> > +               start += ret << PAGE_SHIFT;
> > +
> > +               /*
> > +                * Repeat on the address that fired VM_FAULT_RETRY
> > +                * without FAULT_FLAG_ALLOW_RETRY but with
> > +                * FAULT_FLAG_TRIED.
> > +                */
> > +               *locked = 1;
> > +               lock_dropped = true;
> 
> Not really needed if set where previously suggested.

Yes, I just thought it's simpler to set it here, because lock_dropped
only is meant to cover the case of when we "reacquire" the mmap_sem
and override *locked = 1 (to notify the caller we destroyed the
critical section if the caller gets locked == 1 and it thinks it was
never released). Just in case the caller does something like:

      down_read(mmap_sem);
      vma = find_vma_somehow();
      ...
      locked = 1;
      gup_locked(&locked);
      if (!locked) {
      	 down_read(mmap_sem);
	 vma = find_vma_somehow();
      }
      use vma
      up_read(mmap_sem);

that's what notify_drop and lock_dropped are all about and it only
matters for gup_locked (only gup_locked will set notify_drop to true).

> > +               down_read(&mm->mmap_sem);
> > +               ret = __get_user_pages(tsk, mm, start, nr_pages, flags | FOLL_TRIED,
> > +                                      pages, NULL, NULL);
> 
> s/nr_pages/1/ otherwise we block on everything left ahead, not just
> the one that fired RETRY.

Applied. This just slipped. At least there would have been no risk
this would go unnoticed, it would BUG_ON below at the first O_DIRECT
just below with ret > 1 :).

> > +               if (ret != 1) {
> > +                       BUG_ON(ret > 1);
> 
> Can ret ever be zero here with count == 1? (ENOENT for a stack guard
> page TTBOMK, but what the heck are we doing gup'ing stacks. Suggest
> fixing that one case inside __gup impl so count == 1 never returns
> zero)

I'm ok with that change but I'd leave for later. I'd rather prefer to
leave get_user_pages API indentical for the legacy callers of gup().

> 
> > +                       if (!pages_done)
> > +                               pages_done = ret;
> 
> Don't think so. ret is --errno at this point (maybe zero). So remove.

Hmm I don't get what's wrong with the above if ret is --errno or zero.

If we get anything but 1 int the gup(locked == NULL) invocation on the
page that return VM_FAULT_RETRY we must stop and that is definitive
retval of __gup_locked (we must forget that error and return the
pages_done if any earlier pass of the loop succeeded).

Or if I drop it, we could leak all pinned pages in the previous loops
that succeeded or miss the error if we got an error and this was the
first loop.

This is __gup behavior too:

		if (IS_ERR(page))
			return i ? i : PTR_ERR(page);

It's up to the caller to figure that the page at address "start +
(pages_done << PAGE_SHIFT)" failed gup, and should be retried from
that address, if the caller wants to get a meaningful -errno.

> > +                       break;
> > +               }
> > +               nr_pages--;
> > +               pages_done++;
> > +               if (!nr_pages)
> > +                       break;
> > +               pages++;
> > +               start += PAGE_SIZE;
> > +       }
> > +       if (!immediate_unlock && lock_dropped && *locked) {
> > +               /*
> > +                * We must let the caller know we temporarily dropped the lock
> > +                * and so the critical section protected by it was lost.
> > +                */
> > +               up_read(&mm->mmap_sem);
> 
> With my suggestion of s/immediate_unlock/notify_drop/ this gets a lot
> more understandable (IMHO).

It become like this yes:

	if (notify_drop && lock_dropped && *locked) {
		/*
		 * We must let the caller know we temporarily dropped the lock
		 * and so the critical section protected by it was lost.
		 */
		up_read(&mm->mmap_sem);
		*locked = 0;
	}

And only gup_locked pass it to true (unlocked would drop the lock
anyway if it was set before returning so it doesn't need a notify).

[..]
long get_user_pages_locked(struct task_struct *tsk, struct mm_struct *mm,
[..]
	return __get_user_pages_locked(tsk, mm, start, nr_pages, write, force,
				       pages, NULL, locked, true);
[..]
long get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm,
[..]
	ret = __get_user_pages_locked(tsk, mm, start, nr_pages, write, force,
				      pages, NULL, &locked, false);
[..]
long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
[..]
	return __get_user_pages_locked(tsk, mm, start, nr_pages, write, force,
				       pages, vmas, NULL, false);
[..]


> *Or*, forget about gup_locked and just leave gup as proposed in this
> patch. Then gup_unlocked (again IMHO) becomes more meaningful ... "Ah,
> that's the one I call when I have no locks taken".

Yes, that's the longer term objective but that will require also
getting rid of vmas argument and more auditing.

> > -               npages = get_user_pages_fast(addr, 1, write_fault,
> > -                                            page);
> > +               npages = get_user_pages_unlocked(current, current->mm, addr, 1,
> > +                                                write_fault, 0, page);
> >         if (npages != 1)
> >                 return npages;
> 
> Acked, for the spirit. Likely my patch will go in and then you can
> just throw this one on top, removing kvm_get_user_page_io in the
> process.

Sure, that's fine. I'm very happy with your patch and it should go in
first as it's a first step in the right direction. Making this
incremental is trivial.

Also note gup_fast would also be doing the right thing with my patch
applied so we could theoretically still call gup_fast in the kvm slow
path, but I figure from the review of your patch the __gup_fast was
already tried shortly earlier by the time we got there, so it should
be more efficient to skip the irq-disabled path and just take the
gup_locked() slow-path. So replacing kvm_get_user_page_io with
gup_locked sounds better than going back to the original gup_fast.

> > While at it I also converted some obvious candidate for gup_fast that
> > had no point in running slower (which I should split off in a separate
> > patch).
> 
> Yes to all.

Thanks for the review! 

> The part that I'm missing is how would MADV_USERFAULT handle this. It
> would be buried in faultin_page, if no RETRY possible raise sigbus,
> otherwise drop the mmap semaphore and signal and sleep on the
> userfaultfd?

Below are the userfaultfd entry points.

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index af61e57..26f59af 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -728,6 +732,20 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 		pte_free(mm, pgtable);
 	} else {
 		pmd_t entry;
+
+		/* Deliver the page fault to userland */
+		if (vma->vm_flags & VM_USERFAULT) {
+			int ret;
+
+			spin_unlock(ptl);
+			mem_cgroup_uncharge_page(page);
+			put_page(page);
+			pte_free(mm, pgtable);
+			ret = handle_userfault(vma, haddr, flags);
+			VM_BUG_ON(ret & VM_FAULT_FALLBACK);
+			return ret;
+		}
+
 		entry = mk_huge_pmd(page, vma);
 		page_add_new_anon_rmap(page, vma, haddr);
 		pgtable_trans_huge_deposit(mm, pmd, pgtable);
@@ -808,14 +825,27 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			return VM_FAULT_FALLBACK;
 		}
 		ptl = pmd_lock(mm, pmd);
-		set = set_huge_zero_page(pgtable, mm, vma, haddr, pmd,
-				zero_page);
-		spin_unlock(ptl);
+		ret = 0;
+		set = false;
+		if (pmd_none(*pmd)) {
+			if (vma->vm_flags & VM_USERFAULT) {
+				spin_unlock(ptl);
+				ret = handle_userfault(vma, haddr, flags);
+				VM_BUG_ON(ret & VM_FAULT_FALLBACK);
+			} else {
+				set_huge_zero_page(pgtable, mm, vma,
+						   haddr, pmd,
+						   zero_page);
+				spin_unlock(ptl);
+				set = true;
+			}
+		} else
+			spin_unlock(ptl);
 		if (!set) {
 			pte_free(mm, pgtable);
 			put_huge_zero_page();
 		}
-		return 0;
+		return ret;
 	}
 	page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
 			vma, haddr, numa_node_id(), 0);
diff --git a/mm/memory.c b/mm/memory.c
index 986ddb2..18b8dde 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3244,6 +3245,11 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
 		if (!pte_none(*page_table))
 			goto unlock;
+		/* Deliver the page fault to userland, check inside PT lock */
+		if (vma->vm_flags & VM_USERFAULT) {
+			pte_unmap_unlock(page_table, ptl);
+			return handle_userfault(vma, address, flags);
+		}
 		goto setpte;
 	}
 
@@ -3271,6 +3277,14 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (!pte_none(*page_table))
 		goto release;
 
+	/* Deliver the page fault to userland, check inside PT lock */
+	if (vma->vm_flags & VM_USERFAULT) {
+		pte_unmap_unlock(page_table, ptl);
+		mem_cgroup_uncharge_page(page);
+		page_cache_release(page);
+		return handle_userfault(vma, address, flags);
+	}
+
 	inc_mm_counter_fast(mm, MM_ANONPAGES);
 	page_add_new_anon_rmap(page, vma, address);
 setpte:
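
For reference, the handle_userfault() side is expected to behave roughly
like this for those call sites (only a sketch of the intent: the
queueing/wakeup helpers below are placeholders, not the real
implementation):

	int handle_userfault(struct vm_area_struct *vma, unsigned long address,
			     unsigned int flags)
	{
		struct mm_struct *mm = vma->vm_mm;

		/*
		 * Without FAULT_FLAG_ALLOW_RETRY the fault can't be handed
		 * to userland, so it's reported as an error right away.
		 */
		if (!(flags & FAULT_FLAG_ALLOW_RETRY))
			return VM_FAULT_SIGBUS;

		/* Placeholder: record the fault for the userfaultfd reader. */
		userfaultfd_queue_fault(vma, address);

		/* ALLOW_RETRY permits dropping mmap_sem before sleeping. */
		up_read(&mm->mmap_sem);

		/* Placeholder: wait until userland resolves the fault. */
		userfaultfd_wait_resolution(vma, address);

		return VM_FAULT_RETRY;
	}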

I hope the developers working on volatile pages can help add the
pagecache entry points for tmpfs too, so MADV_USERFAULT (and
userfaultfd) can be used on tmpfs-backed memory in combination with
volatile pages. They then wouldn't need to run syscalls before touching
the volatile memory and could catch the fault in userland with SIGBUS
or the userfaultfd protocol. We then have to decide how to resolve the
fault if they want to do fully on-demand paging. For anonymous memory
we do it with the newly introduced remap_anon_pages, which is like a
strict and optimized version of mremap: it only touches two
trans_huge_pmds/ptes, it takes the mmap_sem only for reading, and it
can never lead to silent memory corruption if userland is buggy,
because it raises errors if the src pmd/pte is null or the dst pmd/pte
is not null. On a fault they could also simply drop the volatile range
and rebuild the area, similar to what would happen with the syscall,
just avoiding the syscall in the fast path. That slow path is probably
not performance critical because it's not as common as in the KVM case:
with KVM we know we'll fire a flood of userfaults, so the userfault
handler must be as efficient as possible (with remap_anon_pages made
THP aware too), while the volatile pages slow path shouldn't normally
run at all.
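
Just to illustrate the intended on-demand paging flow from userland
(purely a sketch: the userfaultfd read format, the remap_anon_pages()
prototype and the fill_page() helper below are assumptions for the
example, not the final ABI):

	/* Hypothetical userland fault handler thread. */
	#include <stdint.h>
	#include <unistd.h>
	#include <sys/mman.h>

	#define UFFD_PAGE_SIZE 4096UL

	/* Assumed prototype of the new syscall wrapper. */
	long remap_anon_pages(void *dst, void *src, unsigned long len);
	/* Application specific: generate the missing page contents. */
	void fill_page(void *page, uint64_t fault_address);

	static void handle_userfaults(int ufd)
	{
		uint64_t addr;

		while (read(ufd, &addr, sizeof(addr)) == sizeof(addr)) {
			void *tmp = mmap(NULL, UFFD_PAGE_SIZE,
					 PROT_READ | PROT_WRITE,
					 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
			if (tmp == MAP_FAILED)
				break;
			fill_page(tmp, addr);
			/* Move the filled page over the faulting address. */
			remap_anon_pages((void *)(addr & ~(UFFD_PAGE_SIZE - 1)),
					 tmp, UFFD_PAGE_SIZE);
			munmap(tmp, UFFD_PAGE_SIZE);
		}
	}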

Anyway the first priority is to solve the above problem with
gup_locked. My last userfaultfd patch was rock solid in practice on
KVM, but I was too aggressive at dropping the mmap_sem even when I
should not have been, so it wasn't production quality yet. I also can't
allow userland to keep the mmap_sem held for reading indefinitely
without special user privileges (and no, I don't want userfaultfd to
require special permissions).

In the meantime I also implemented the range
registration/deregistration ops, so each process can have an arbitrary
number of userfaultfds and shared libs are free to use userfaultfd
independently of each other, as long as each one tracks its own ranges
as vmas. Presumably each library will use userfaultfd only for its own
memory and won't step on the others' toes; two different libraries
pretending to handle userfaults on the same memory wouldn't make sense
by design in the first place, so this constraint looks fine.

I had to extend vma_merge though; it still looks cleaner to work with
native vmas than to build a collision-free per-process range
registration layer inside userfaultfd. The vmas already solve 99% of
that problem, so it felt better to make a small extension to them,
exactly like it was done for vma_policy earlier.

Andrea

* Re: RFC: get_user_pages_locked|unlocked to leverage VM_FAULT_RETRY
  2014-09-26 17:25 ` Andrea Arcangeli
@ 2014-10-01 15:36   ` Peter Zijlstra
  -1 siblings, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2014-10-01 15:36 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andres Lagar-Cavilla, Gleb Natapov, Radim Krcmar, Paolo Bonzini,
	Rik van Riel, Mel Gorman, Andy Lutomirski, Andrew Morton,
	Sasha Levin, Jianyu Zhan, Paul Cassella, Hugh Dickins,
	Peter Feiner, kvm, linux-kernel, linux-mm,
	Dr. David Alan Gilbert

On Fri, Sep 26, 2014 at 07:25:35PM +0200, Andrea Arcangeli wrote:
> diff --git a/drivers/dma/iovlock.c b/drivers/dma/iovlock.c
> index bb48a57..12ea7c3 100644
> --- a/drivers/dma/iovlock.c
> +++ b/drivers/dma/iovlock.c
> @@ -95,17 +95,11 @@ struct dma_pinned_list *dma_pin_iovec_pages(struct iovec *iov, size_t len)
>  		pages += page_list->nr_pages;
>  
>  		/* pin pages down */
> -		down_read(&current->mm->mmap_sem);
> -		ret = get_user_pages(
> -			current,
> -			current->mm,
> +		ret = get_user_pages_fast(
>  			(unsigned long) iov[i].iov_base,
>  			page_list->nr_pages,
>  			1,	/* write */
> -			0,	/* force */
> -			page_list->pages,
> -			NULL);
> -		up_read(&current->mm->mmap_sem);
> +			page_list->pages);
>  
>  		if (ret != page_list->nr_pages)
>  			goto unpin;

> --- a/drivers/misc/sgi-gru/grufault.c
> +++ b/drivers/misc/sgi-gru/grufault.c
> @@ -198,8 +198,7 @@ static int non_atomic_pte_lookup(struct vm_area_struct *vma,
>  #else
>  	*pageshift = PAGE_SHIFT;
>  #endif
> -	if (get_user_pages
> -	    (current, current->mm, vaddr, 1, write, 0, &page, NULL) <= 0)
> +	if (get_user_pages_fast(vaddr, 1, write, &page) <= 0)
>  		return -EFAULT;
>  	*paddr = page_to_phys(page);
>  	put_page(page);

> diff --git a/drivers/scsi/st.c b/drivers/scsi/st.c
> index aff9689..c89dcfa 100644
> --- a/drivers/scsi/st.c
> +++ b/drivers/scsi/st.c
> @@ -4536,18 +4536,12 @@ static int sgl_map_user_pages(struct st_buffer *STbp,
>  		return -ENOMEM;
>  
>          /* Try to fault in all of the necessary pages */
> -	down_read(&current->mm->mmap_sem);
>          /* rw==READ means read from drive, write into memory area */
> -	res = get_user_pages(
> -		current,
> -		current->mm,
> +	res = get_user_pages_fast(
>  		uaddr,
>  		nr_pages,
>  		rw == READ,
> -		0, /* don't force */
> -		pages,
> -		NULL);
> -	up_read(&current->mm->mmap_sem);
> +		pages);
>  
>  	/* Errors and no page mapped should return here */
>  	if (res < nr_pages)


For all these and the other _fast() users, is there an actual limit to
the nr_pages passed in? Because we used to have the 64 pages limit from
DIO, but without that we get rather long IRQ-off latencies.

* Re: RFC: get_user_pages_locked|unlocked to leverage VM_FAULT_RETRY
  2014-10-01 15:36   ` Peter Zijlstra
@ 2014-10-02 12:31     ` Andrea Arcangeli
  -1 siblings, 0 replies; 18+ messages in thread
From: Andrea Arcangeli @ 2014-10-02 12:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andres Lagar-Cavilla, Gleb Natapov, Radim Krcmar, Paolo Bonzini,
	Rik van Riel, Mel Gorman, Andy Lutomirski, Andrew Morton,
	Sasha Levin, Jianyu Zhan, Paul Cassella, Hugh Dickins,
	Peter Feiner, kvm, linux-kernel, linux-mm,
	Dr. David Alan Gilbert

On Wed, Oct 01, 2014 at 05:36:11PM +0200, Peter Zijlstra wrote:
> For all these and the other _fast() users, is there an actual limit to
> the nr_pages passed in? Because we used to have the 64 pages limit from
> DIO, but without that we get rather long IRQ-off latencies.

Ok, I would tend to think this is an issue to solve in the gup_fast
implementation; I wouldn't blame or modify the callers for it.

I don't think there's anything that prevents gup_fast from re-enabling
irqs after a certain number of pages have been taken, even just
briefly, and then disabling the irqs again.

If the TLB flush runs in parallel with gup_fast the result is undefined
anyway, so there's no point in waiting for all pages to be taken before
letting the TLB flush go through. All that matters is that gup_fast
doesn't take pages that have been invalidated after the tlb_flush
returns on the other side. So I don't see a problem in re-enabling irqs
and being latency friendly inside the gup_fast fast path loop.

In fact gup_fast should also cond_resched() after re-enabling irqs;
it's not just an irq latency matter.
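
Something along these lines (only a sketch of the idea, not real code:
__gup_fast_batch() is a placeholder for the per-arch irq-off pagetable
walk):

	#define BATCH_PAGES	512

	int gup_fast_batched(unsigned long start, int nr_pages, int write,
			     struct page **pages)
	{
		int total = 0;

		while (nr_pages) {
			int batch = min(nr_pages, BATCH_PAGES);
			int nr;

			cond_resched();
			local_irq_disable();
			/* placeholder for the per-arch irq-off walk */
			nr = __gup_fast_batch(start, batch, write, pages);
			local_irq_enable();

			total += nr;
			if (nr < batch)
				break;	/* the caller falls back to slow gup */
			start += (unsigned long)batch << PAGE_SHIFT;
			pages += batch;
			nr_pages -= batch;
		}
		return total;
	}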

I could fix x86-64 for it in the same patchset, unless somebody sees a
problem in re-enabling irqs inside the gup_fast fast path loop.

__gup_fast is an entirely different beast whose callers need to be
fixed, but I didn't alter its callers.

* Re: RFC: get_user_pages_locked|unlocked to leverage VM_FAULT_RETRY
  2014-10-02 12:31     ` Andrea Arcangeli
@ 2014-10-02 12:50       ` Peter Zijlstra
  -1 siblings, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2014-10-02 12:50 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andres Lagar-Cavilla, Gleb Natapov, Radim Krcmar, Paolo Bonzini,
	Rik van Riel, Mel Gorman, Andy Lutomirski, Andrew Morton,
	Sasha Levin, Jianyu Zhan, Paul Cassella, Hugh Dickins,
	Peter Feiner, kvm, linux-kernel, linux-mm,
	Dr. David Alan Gilbert

On Thu, Oct 02, 2014 at 02:31:17PM +0200, Andrea Arcangeli wrote:
> On Wed, Oct 01, 2014 at 05:36:11PM +0200, Peter Zijlstra wrote:
> > For all these and the other _fast() users, is there an actual limit to
> > the nr_pages passed in? Because we used to have the 64 pages limit from
> > DIO, but without that we get rather long IRQ-off latencies.
> 
> Ok, I would tend to think this is an issue to solve in the gup_fast
> implementation; I wouldn't blame or modify the callers for it.
> 
> I don't think there's anything that prevents gup_fast from re-enabling
> irqs after a certain number of pages have been taken, even just
> briefly, and then disabling the irqs again.
> 

Agreed, I once upon a time had a patch set converting the 2 (x86 and
powerpc) gup_fast implementations at the time, but somehow that never
got anywhere.

Just saying we should probably do that before we add callers with
unlimited nr_pages.

* Re: RFC: get_user_pages_locked|unlocked to leverage VM_FAULT_RETRY
  2014-10-02 12:50       ` Peter Zijlstra
@ 2014-10-02 12:56         ` Peter Zijlstra
  -1 siblings, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2014-10-02 12:56 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andres Lagar-Cavilla, Gleb Natapov, Radim Krcmar, Paolo Bonzini,
	Rik van Riel, Mel Gorman, Andy Lutomirski, Andrew Morton,
	Sasha Levin, Jianyu Zhan, Paul Cassella, Hugh Dickins,
	Peter Feiner, kvm, linux-kernel, linux-mm,
	Dr. David Alan Gilbert

On Thu, Oct 02, 2014 at 02:50:52PM +0200, Peter Zijlstra wrote:
> On Thu, Oct 02, 2014 at 02:31:17PM +0200, Andrea Arcangeli wrote:
> > On Wed, Oct 01, 2014 at 05:36:11PM +0200, Peter Zijlstra wrote:
> > > For all these and the other _fast() users, is there an actual limit to
> > > the nr_pages passed in? Because we used to have the 64 pages limit from
> > > DIO, but without that we get rather long IRQ-off latencies.
> > 
> > Ok, I would tend to think this is an issue to solve in the gup_fast
> > implementation; I wouldn't blame or modify the callers for it.
> > 
> > I don't think there's anything that prevents gup_fast from re-enabling
> > irqs after a certain number of pages have been taken, even just
> > briefly, and then disabling the irqs again.
> > 
> 
> Agreed, I once upon a time had a patch set converting the 2 (x86 and
> powerpc) gup_fast implementations at the time, but somehow that never
> got anywhere.
> 
> Just saying we should probably do that before we add callers with
> unlimited nr_pages.

https://lkml.org/lkml/2009/6/24/457

Clearly there's more work these days. Many more archs grew a gup.c

* Re: RFC: get_user_pages_locked|unlocked to leverage VM_FAULT_RETRY
  2014-10-02 12:56         ` Peter Zijlstra
  (?)
@ 2014-10-02 15:53           ` Andrea Arcangeli
  -1 siblings, 0 replies; 18+ messages in thread
From: Andrea Arcangeli @ 2014-10-02 15:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andres Lagar-Cavilla, Gleb Natapov, Radim Krcmar, Paolo Bonzini,
	Rik van Riel, Mel Gorman, Andy Lutomirski, Andrew Morton,
	Sasha Levin, Jianyu Zhan, Paul Cassella, Hugh Dickins,
	Peter Feiner, kvm, linux-kernel, linux-mm,
	Dr. David Alan Gilbert

On Thu, Oct 02, 2014 at 02:56:38PM +0200, Peter Zijlstra wrote:
> On Thu, Oct 02, 2014 at 02:50:52PM +0200, Peter Zijlstra wrote:
> > On Thu, Oct 02, 2014 at 02:31:17PM +0200, Andrea Arcangeli wrote:
> > > On Wed, Oct 01, 2014 at 05:36:11PM +0200, Peter Zijlstra wrote:
> > > > For all these and the other _fast() users, is there an actual limit to
> > > > the nr_pages passed in? Because we used to have the 64 pages limit from
> > > > DIO, but without that we get rather long IRQ-off latencies.
> > > 
> > > Ok, I would tend to think this is an issue to solve in the gup_fast
> > > implementation; I wouldn't blame or modify the callers for it.
> > > 
> > > I don't think there's anything that prevents gup_fast from re-enabling
> > > irqs after a certain number of pages have been taken, even just
> > > briefly, and then disabling the irqs again.
> > > 
> > 
> > Agreed, I once upon a time had a patch set converting the 2 (x86 and
> > powerpc) gup_fast implementations at the time, but somehow that never
> > got anywhere.
> > 
> > Just saying we should probably do that before we add callers with
> > unlimited nr_pages.
> 
> https://lkml.org/lkml/2009/6/24/457
> 
> Clearly there's more work these days. Many more archs grew a gup.c

What about this? The alternative is that I do s/gup_fast/gup_unlocked/
to still retain the mmap_sem scalability benefit. It'd still be better
than the current plain gup() (and it would be equivalent from the
userfaultfd point of view).

Or, if the below is ok, should I modify all the other archs too, or are
the respective maintainers going to fix it themselves? For example the
arm* gup_fast is a moving target in development on linux-mm right now,
and I should only patch the gup_rcu version that hasn't hit upstream
yet. In fact after the gup_rcu merge, supposedly the powerpc and sparc
gup_fast can be dropped from arch/* entirely so they can use the
generic version (otherwise having the arm gup_fast in mm/ instead of
arch/ would be a mistake). Right now I wouldn't touch at least
arm/sparc/powerpc until the gup_rcu work hits upstream, as those
implementations are all about to disappear.

Thanks,
Andrea

>From 2f6079396a59e64a380ff06e6107276dfa67b3ba Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli <aarcange@redhat.com>
Date: Thu, 2 Oct 2014 16:58:00 +0200
Subject: [PATCH] mm: gup: make get_user_pages_fast and __get_user_pages_fast
 latency conscious

This teaches gup_fast and __gup_fast to re-enable irqs and
cond_resched() if possible every BATCH_PAGES.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/x86/mm/gup.c | 239 +++++++++++++++++++++++++++++++++++-------------------
 1 file changed, 154 insertions(+), 85 deletions(-)

diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index 2ab183b..e05d7b0 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -12,6 +12,12 @@
 
 #include <asm/pgtable.h>
 
+/*
+ * Keep irq disabled for no more than BATCH_PAGES pages.
+ * Matches PTRS_PER_PTE (or half in non-PAE kernels).
+ */
+#define BATCH_PAGES	512
+
 static inline pte_t gup_get_pte(pte_t *ptep)
 {
 #ifndef CONFIG_X86_PAE
@@ -250,6 +256,40 @@ static int gup_pud_range(pgd_t pgd, unsigned long addr, unsigned long end,
 	return 1;
 }
 
+static inline int __get_user_pages_fast_batch(unsigned long start,
+					      unsigned long end,
+					      int write, struct page **pages)
+{
+	struct mm_struct *mm = current->mm;
+	unsigned long next;
+	unsigned long flags;
+	pgd_t *pgdp;
+	int nr = 0;
+
+	/*
+	 * This doesn't prevent pagetable teardown, but does prevent
+	 * the pagetables and pages from being freed on x86.
+	 *
+	 * So long as we atomically load page table pointers versus teardown
+	 * (which we do on x86, with the above PAE exception), we can follow the
+	 * address down to the the page and take a ref on it.
+	 */
+	local_irq_save(flags);
+	pgdp = pgd_offset(mm, start);
+	do {
+		pgd_t pgd = *pgdp;
+
+		next = pgd_addr_end(start, end);
+		if (pgd_none(pgd))
+			break;
+		if (!gup_pud_range(pgd, start, next, write, pages, &nr))
+			break;
+	} while (pgdp++, start = next, start != end);
+	local_irq_restore(flags);
+
+	return nr;
+}
+
 /*
  * Like get_user_pages_fast() except its IRQ-safe in that it won't fall
  * back to the regular GUP.
@@ -257,31 +297,57 @@ static int gup_pud_range(pgd_t pgd, unsigned long addr, unsigned long end,
 int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
 			  struct page **pages)
 {
-	struct mm_struct *mm = current->mm;
-	unsigned long addr, len, end;
-	unsigned long next;
-	unsigned long flags;
-	pgd_t *pgdp;
-	int nr = 0;
+	unsigned long len, end, batch_pages;
+	int nr, ret;
 
 	start &= PAGE_MASK;
-	addr = start;
 	len = (unsigned long) nr_pages << PAGE_SHIFT;
 	end = start + len;
+	/*
+	 * get_user_pages() handles nr_pages == 0 gracefully, but
+	 * gup_fast starts walking the first pagetable in a do {}
+	 * while() fashion so it's not robust to handle nr_pages ==
+	 * 0. There's no point in being permissive about end < start
+	 * either. So this check verifies both nr_pages being non
+	 * zero, and that "end" didn't overflow.
+	 */
+	VM_BUG_ON(end <= start);
 	if (unlikely(!access_ok(write ? VERIFY_WRITE : VERIFY_READ,
 					(void __user *)start, len)))
 		return 0;
 
-	/*
-	 * XXX: batch / limit 'nr', to avoid large irq off latency
-	 * needs some instrumenting to determine the common sizes used by
-	 * important workloads (eg. DB2), and whether limiting the batch size
-	 * will decrease performance.
-	 *
-	 * It seems like we're in the clear for the moment. Direct-IO is
-	 * the main guy that batches up lots of get_user_pages, and even
-	 * they are limited to 64-at-a-time which is not so many.
-	 */
+	ret = 0;
+	for (;;) {
+		batch_pages = nr_pages;
+		if (batch_pages > BATCH_PAGES && !irqs_disabled())
+			batch_pages = BATCH_PAGES;
+		len = (unsigned long) batch_pages << PAGE_SHIFT;
+		end = start + len;
+		nr = __get_user_pages_fast_batch(start, end, write, pages);
+		VM_BUG_ON(nr > batch_pages);
+		nr_pages -= nr;
+		ret += nr;
+		if (!nr_pages || nr != batch_pages)
+			break;
+		start += len;
+		pages += batch_pages;
+	}
+
+	return ret;
+}
+
+static inline int get_user_pages_fast_batch(unsigned long start,
+					    unsigned long end,
+					    int write, struct page **pages)
+{
+	struct mm_struct *mm = current->mm;
+	unsigned long next;
+	pgd_t *pgdp;
+	int nr = 0;
+#ifdef CONFIG_DEBUG_VM
+	unsigned long orig_start = start;
+#endif
+
 	/*
 	 * This doesn't prevent pagetable teardown, but does prevent
 	 * the pagetables and pages from being freed on x86.
@@ -290,18 +356,22 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
 	 * (which we do on x86, with the above PAE exception), we can follow the
 	 * address down to the the page and take a ref on it.
 	 */
-	local_irq_save(flags);
-	pgdp = pgd_offset(mm, addr);
+	local_irq_disable();
+	pgdp = pgd_offset(mm, start);
 	do {
 		pgd_t pgd = *pgdp;
 
-		next = pgd_addr_end(addr, end);
-		if (pgd_none(pgd))
+		next = pgd_addr_end(start, end);
+		if (pgd_none(pgd)) {
+			VM_BUG_ON(nr >= (end-orig_start) >> PAGE_SHIFT);
 			break;
-		if (!gup_pud_range(pgd, addr, next, write, pages, &nr))
+		}
+		if (!gup_pud_range(pgd, start, next, write, pages, &nr)) {
+			VM_BUG_ON(nr >= (end-orig_start) >> PAGE_SHIFT);
 			break;
-	} while (pgdp++, addr = next, addr != end);
-	local_irq_restore(flags);
+		}
+	} while (pgdp++, start = next, start != end);
+	local_irq_enable();
 
 	return nr;
 }
@@ -326,80 +396,79 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 			struct page **pages)
 {
 	struct mm_struct *mm = current->mm;
-	unsigned long addr, len, end;
-	unsigned long next;
-	pgd_t *pgdp;
-	int nr = 0;
+	unsigned long len, end, batch_pages;
+	int nr, ret;
+#ifdef CONFIG_DEBUG_VM
+	unsigned long orig_start = start;
+#endif
 
 	start &= PAGE_MASK;
-	addr = start;
+#ifdef CONFIG_DEBUG_VM
+	orig_start = start;
+#endif
 	len = (unsigned long) nr_pages << PAGE_SHIFT;
 
 	end = start + len;
-	if (end < start)
-		goto slow_irqon;
+	/*
+	 * get_user_pages() handles nr_pages == 0 gracefully, but
+	 * gup_fast starts walking the first pagetable in a do {}
+	 * while() fashion so it's not robust to handle nr_pages ==
+	 * 0. There's no point in being permissive about end < start
+	 * either. So this check verifies both nr_pages being non
+	 * zero, and that "end" didn't overflow.
+	 */
+	VM_BUG_ON(end <= start);
 
+	nr = ret = 0;
 #ifdef CONFIG_X86_64
 	if (end >> __VIRTUAL_MASK_SHIFT)
 		goto slow_irqon;
 #endif
+	for (;;) {
+		cond_resched();
+		batch_pages = min(nr_pages, BATCH_PAGES);
+		len = (unsigned long) batch_pages << PAGE_SHIFT;
+		end = start + len;
+		nr = get_user_pages_fast_batch(start, end, write, pages);
+		VM_BUG_ON(nr > batch_pages);
+		nr_pages -= nr;
+		ret += nr;
+		if (!nr_pages)
+			break;
+		if (nr < batch_pages)
+			goto slow_irqon;
+		start += len;
+		pages += batch_pages;
+	}
 
-	/*
-	 * XXX: batch / limit 'nr', to avoid large irq off latency
-	 * needs some instrumenting to determine the common sizes used by
-	 * important workloads (eg. DB2), and whether limiting the batch size
-	 * will decrease performance.
-	 *
-	 * It seems like we're in the clear for the moment. Direct-IO is
-	 * the main guy that batches up lots of get_user_pages, and even
-	 * they are limited to 64-at-a-time which is not so many.
-	 */
-	/*
-	 * This doesn't prevent pagetable teardown, but does prevent
-	 * the pagetables and pages from being freed on x86.
-	 *
-	 * So long as we atomically load page table pointers versus teardown
-	 * (which we do on x86, with the above PAE exception), we can follow the
-	 * address down to the the page and take a ref on it.
-	 */
-	local_irq_disable();
-	pgdp = pgd_offset(mm, addr);
-	do {
-		pgd_t pgd = *pgdp;
-
-		next = pgd_addr_end(addr, end);
-		if (pgd_none(pgd))
-			goto slow;
-		if (!gup_pud_range(pgd, addr, next, write, pages, &nr))
-			goto slow;
-	} while (pgdp++, addr = next, addr != end);
-	local_irq_enable();
-
-	VM_BUG_ON(nr != (end - start) >> PAGE_SHIFT);
-	return nr;
-
-	{
-		int ret;
+	VM_BUG_ON(ret != (end - orig_start) >> PAGE_SHIFT);
+	return ret;
 
-slow:
-		local_irq_enable();
 slow_irqon:
-		/* Try to get the remaining pages with get_user_pages */
-		start += nr << PAGE_SHIFT;
-		pages += nr;
-
-		ret = get_user_pages_unlocked(current, mm, start,
-					      (end - start) >> PAGE_SHIFT,
-					      write, 0, pages);
-
-		/* Have to be a bit careful with return values */
-		if (nr > 0) {
-			if (ret < 0)
-				ret = nr;
-			else
-				ret += nr;
-		}
+	/* Try to get the remaining pages with get_user_pages */
+	start += nr << PAGE_SHIFT;
+	pages += nr;
 
-		return ret;
+	/*
+	 * "nr" was the get_user_pages_fast_batch last retval, "ret"
+	 * was the sum of all get_user_pages_fast_batch retvals, now
+	 * "nr" becomes the sum of all get_user_pages_fast_batch
+	 * retvals and "ret" will become the get_user_pages_unlocked
+	 * retval.
+	 */
+	nr = ret;
+
+	ret = get_user_pages_unlocked(current, mm, start,
+				      (end - start) >> PAGE_SHIFT,
+				      write, 0, pages);
+
+	/* Have to be a bit careful with return values */
+	if (nr > 0) {
+		if (ret < 0)
+			ret = nr;
+		else
+			ret += nr;
 	}
+
+	return ret;
 }
