linux-kernel.vger.kernel.org archive mirror
* Re: [rfc] lockless get_user_pages for dio (and more)
       [not found]     ` <200710152225.11433.nickpiggin@yahoo.com.au>
@ 2007-12-10 21:30       ` Dave Kleikamp
  2007-12-12  4:57         ` Nick Piggin
  0 siblings, 1 reply; 7+ messages in thread
From: Dave Kleikamp @ 2007-12-10 21:30 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Siddha, Suresh B, Ken Chen, Badari Pulavarty, linux-mm,
	tony.luck, Adam Litke, linux-kernel


On Mon, 2007-10-15 at 22:25 +1000, Nick Piggin wrote:
> On Monday 15 October 2007 04:19, Siddha, Suresh B wrote:
> > On Sun, Oct 14, 2007 at 11:01:02AM +1000, Nick Piggin wrote:

> > > This is just a really quick hack, untested ATM, but one that
> > > has at least a chance of working (on x86).
> >
> > When we fall back to slow mode, we should decrement the ref counts
> > on the pages we got so far in the fast mode.
> 
> Here is something that is actually tested and works (not
> tested with hugepages yet, though).
> 
> However it's not 100% secure at the moment. It's actually
> not completely trivial; I think we need to use an extra bit
> in the present pte in order to exclude "not normal" pages,
> if we want fast_gup to work on small page mappings too. I
> think this would be possible to do on most architectures, but
> I haven't done it here obviously.
> 
> Still, it should be enough to test the design. I've added
> fast_gup and fast_gup_slow to /proc/vmstat, which count the
> number of times fast_gup was called, and the number of times
> it dropped into the slowpath. It would be interesting to know
> how it performs compared to your granular hugepage ptl...

Nick,
I've played with the fast_gup patch a bit.  I was able to find a problem
in follow_hugetlb_page() that Adam Litke fixed.  I haven't been brave
enough to implement it on any other architectures, but I did add a
default that takes mmap_sem and calls the normal get_user_pages() if the
architecture doesn't define fast_gup().  I put it in linux/mm.h, for
lack of a better place, but it's a little kludgy since I didn't want
mm.h to have to include sched.h.  This patch is against 2.6.24-rc4.
It's not ready for inclusion yet, of course.

I haven't done much benchmarking.  The one test I was looking at didn't
show much of a change.

 ==============================================
Introduce a new "fast_gup" (for want of a better name right now) which
is basically a get_user_pages with a less general API that is more suited
to the common case.

- task and mm are always current and current->mm
- force is always 0
- pages is always non-NULL
- don't pass back vmas

This allows (at least on x86) an optimistic lockless pagetable walk,
without taking any page table locks or even mmap_sem. Page table existence
is guaranteed by turning interrupts off (combined with the fact that we're
always looking up the current mm, which would need an IPI before its
pagetables could be shot down from another CPU).

Many other architectures could do the same thing. Those that don't IPI
could potentially RCU free the page tables and do speculative references
on the pages (a la lockless pagecache) to achieve a lockless fast_gup.
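
For illustration only (not part of this patch): a minimal sketch of what such
a speculative reference might look like on an architecture that RCU-frees its
page tables.  The helper name and its call site are made up; only
get_page_unless_zero(), pte_same() and put_page() are existing primitives.

    /*
     * Sketch: speculative page reference, lockless-pagecache style.
     * The pte may be stale because nothing stops a concurrent unmap,
     * so only reference pages that are still live, then re-check.
     */
    static int gup_get_page_speculative(pte_t *ptep, pte_t pte,
                                        struct page *page)
    {
            /* Refuse to take a 0 -> 1 reference on a dying page. */
            if (unlikely(!get_page_unless_zero(page)))
                    return 0;

            /*
             * Re-read the pte after taking the reference.  If it changed
             * under us, we raced with teardown: drop the reference and
             * let the caller fall back to the locked slow path.
             */
            if (unlikely(!pte_same(*ptep, pte))) {
                    put_page(page);
                    return 0;
            }
            return 1;
    }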

Originally by Nick Piggin <nickpiggin@yahoo.com.au>
---
 arch/x86/lib/Makefile_64     |    2 
 arch/x86/lib/gup_64.c        |  188 +++++++++++++++++++++++++++++++++++++++++++
 fs/bio.c                     |    8 -
 fs/block_dev.c               |    5 -
 fs/direct-io.c               |   10 --
 fs/splice.c                  |   38 --------
 include/asm-x86/uaccess_64.h |    4 
 include/linux/mm.h           |   26 +++++
 include/linux/vmstat.h       |    1 
 mm/vmstat.c                  |    3 
 10 files changed, 231 insertions(+), 54 deletions(-)

diff -Nurp linux-2.6.24-rc4/arch/x86/lib/Makefile_64 linux/arch/x86/lib/Makefile_64
--- linux-2.6.24-rc4/arch/x86/lib/Makefile_64	2007-12-04 08:44:34.000000000 -0600
+++ linux/arch/x86/lib/Makefile_64	2007-12-10 15:01:17.000000000 -0600
@@ -10,4 +10,4 @@ obj-$(CONFIG_SMP)	+= msr-on-cpu.o
 lib-y := csum-partial_64.o csum-copy_64.o csum-wrappers_64.o delay_64.o \
 	usercopy_64.o getuser_64.o putuser_64.o  \
 	thunk_64.o clear_page_64.o copy_page_64.o bitstr_64.o bitops_64.o
-lib-y += memcpy_64.o memmove_64.o memset_64.o copy_user_64.o rwlock_64.o copy_user_nocache_64.o
+lib-y += memcpy_64.o memmove_64.o memset_64.o copy_user_64.o rwlock_64.o copy_user_nocache_64.o gup_64.o
diff -Nurp linux-2.6.24-rc4/arch/x86/lib/gup_64.c linux/arch/x86/lib/gup_64.c
--- linux-2.6.24-rc4/arch/x86/lib/gup_64.c	1969-12-31 18:00:00.000000000 -0600
+++ linux/arch/x86/lib/gup_64.c	2007-12-10 15:01:17.000000000 -0600
@@ -0,0 +1,188 @@
+/*
+ * Lockless fast_gup for x86
+ *
+ * Copyright (C) 2007 Nick Piggin
+ * Copyright (C) 2007 Novell Inc.
+ */
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/vmstat.h>
+#include <asm/pgtable.h>
+
+static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
+			 int write, struct page **pages, int *nr)
+{
+	pte_t *ptep;
+
+	/* XXX: this won't work for 32-bit (must map pte) */
+	ptep = (pte_t *)pmd_page_vaddr(pmd) + pte_index(addr);
+	do {
+		pte_t pte = *ptep;
+		unsigned long pfn;
+		struct page *page;
+
+		if ((pte_val(pte) & (_PAGE_PRESENT|_PAGE_USER)) !=
+		    (_PAGE_PRESENT|_PAGE_USER))
+			return 0;
+
+		if (write && !pte_write(pte))
+			return 0;
+
+		/* XXX: really need new bit in pte to denote normal page */
+		pfn = pte_pfn(pte);
+		if (unlikely(!pfn_valid(pfn)))
+			return 0;
+
+		page = pfn_to_page(pfn);
+		get_page(page);
+		pages[*nr] = page;
+		(*nr)++;
+
+	} while (ptep++, addr += PAGE_SIZE, addr != end);
+	pte_unmap(ptep - 1);
+
+	return 1;
+}
+
+static inline void get_head_page_multiple(struct page *page, int nr)
+{
+	VM_BUG_ON(page != compound_head(page));
+	VM_BUG_ON(page_count(page) == 0);
+	atomic_add(nr, &page->_count);
+}
+
+static int gup_huge_pmd(pmd_t pmd, unsigned long addr, unsigned long end,
+			int write, struct page **pages, int *nr)
+{
+	pte_t pte = *(pte_t *)&pmd;
+	struct page *head, *page;
+	int refs;
+
+	if ((pte_val(pte) & _PAGE_USER) != _PAGE_USER)
+		return 0;
+
+	BUG_ON(!pfn_valid(pte_pfn(pte)));
+
+	if (write && !pte_write(pte))
+		return 0;
+
+	refs = 0;
+	head = pte_page(pte);
+	page = head + ((addr & ~HPAGE_MASK) >> PAGE_SHIFT);
+	do {
+		pages[*nr] = page;
+		(*nr)++;
+		page++;
+		refs++;
+	} while (addr += PAGE_SIZE, addr != end);
+
+	get_head_page_multiple(head, refs);
+
+	return 1;
+}
+
+static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
+			 int write, struct page **pages, int *nr)
+{
+	static int count = 50;
+	unsigned long next;
+	pmd_t *pmdp;
+
+	pmdp = (pmd_t *)pud_page_vaddr(pud) + pmd_index(addr);
+	do {
+		pmd_t pmd = *pmdp;
+
+		next = pmd_addr_end(addr, end);
+		if (pmd_none(pmd))
+			return 0;
+		if (unlikely(pmd_large(pmd))) {
+			if (!gup_huge_pmd(pmd, addr, next, write, pages, nr)) {
+				if (count) {
+					printk(KERN_ERR
+						"pmd = 0x%lx, addr = 0x%lx\n",
+						pmd.pmd, addr);
+					count--;
+				}
+				return 0;
+			}
+		} else {
+			if (!gup_pte_range(pmd, addr, next, write, pages, nr))
+				return 0;
+		}
+	} while (pmdp++, addr = next, addr != end);
+
+	return 1;
+}
+
+static int gup_pud_range(pgd_t pgd, unsigned long addr, unsigned long end,
+			 int write, struct page **pages, int *nr)
+{
+	unsigned long next;
+	pud_t *pudp;
+
+	pudp = (pud_t *)pgd_page_vaddr(pgd) + pud_index(addr);
+	do {
+		pud_t pud = *pudp;
+
+		next = pud_addr_end(addr, end);
+		if (pud_none(pud))
+			return 0;
+		if (!gup_pmd_range(pud, addr, next, write, pages, nr))
+			return 0;
+	} while (pudp++, addr = next, addr != end);
+
+	return 1;
+}
+
+int fast_gup(unsigned long start, int nr_pages, int write, struct page **pages)
+{
+	struct mm_struct *mm = current->mm;
+	unsigned long end = start + (nr_pages << PAGE_SHIFT);
+	unsigned long addr = start;
+	unsigned long next;
+	pgd_t *pgdp;
+	int nr = 0;
+
+	/* XXX: batch / limit 'nr', to avoid huge latency */
+	/*
+	 * This doesn't prevent pagetable teardown, but does prevent
+	 * the pagetables from being freed on x86-64.
+	 *
+	 * So long as we atomically load page table pointers versus teardown
+	 * (which we do on x86-64), we can follow the address down to
+	 * the page.
+	 */
+	local_irq_disable();
+	__count_vm_event(FAST_GUP);
+	pgdp = pgd_offset(mm, addr);
+	do {
+		pgd_t pgd = *pgdp;
+
+		next = pgd_addr_end(addr, end);
+		if (pgd_none(pgd))
+			goto slow;
+		if (!gup_pud_range(pgd, addr, next, write, pages, &nr))
+			goto slow;
+	} while (pgdp++, addr = next, addr != end);
+	local_irq_enable();
+
+	BUG_ON(nr != (end - start) >> PAGE_SHIFT);
+	return nr;
+
+slow:
+	{
+		int i, ret;
+
+		__count_vm_event(FAST_GUP_SLOW);
+		local_irq_enable();
+		for (i = 0; i < nr; i++)
+			put_page(pages[i]);
+
+		down_read(&mm->mmap_sem);
+		ret = get_user_pages(current, mm, start,
+			(end - start) >> PAGE_SHIFT, write, 0, pages, NULL);
+		up_read(&mm->mmap_sem);
+
+		return ret;
+	}
+}
diff -Nurp linux-2.6.24-rc4/fs/bio.c linux/fs/bio.c
--- linux-2.6.24-rc4/fs/bio.c	2007-12-04 08:44:49.000000000 -0600
+++ linux/fs/bio.c	2007-12-10 15:01:17.000000000 -0600
@@ -636,13 +636,9 @@ static struct bio *__bio_map_user_iov(st
 		unsigned long start = uaddr >> PAGE_SHIFT;
 		const int local_nr_pages = end - start;
 		const int page_limit = cur_page + local_nr_pages;
-		
-		down_read(&current->mm->mmap_sem);
-		ret = get_user_pages(current, current->mm, uaddr,
-				     local_nr_pages,
-				     write_to_vm, 0, &pages[cur_page], NULL);
-		up_read(&current->mm->mmap_sem);
 
+		ret = fast_gup(uaddr, local_nr_pages, write_to_vm,
+			       &pages[cur_page]);
 		if (ret < local_nr_pages) {
 			ret = -EFAULT;
 			goto out_unmap;
diff -Nurp linux-2.6.24-rc4/fs/block_dev.c linux/fs/block_dev.c
--- linux-2.6.24-rc4/fs/block_dev.c	2007-12-04 08:44:49.000000000 -0600
+++ linux/fs/block_dev.c	2007-12-10 15:01:17.000000000 -0600
@@ -221,10 +221,7 @@ static struct page *blk_get_page(unsigne
 	if (pvec->idx == pvec->nr) {
 		nr_pages = PAGES_SPANNED(addr, count);
 		nr_pages = min(nr_pages, VEC_SIZE);
-		down_read(&current->mm->mmap_sem);
-		ret = get_user_pages(current, current->mm, addr, nr_pages,
-				     rw == READ, 0, pvec->page, NULL);
-		up_read(&current->mm->mmap_sem);
+		ret = fast_gup(addr, nr_pages, rw == READ, pvec->page);
 		if (ret < 0)
 			return ERR_PTR(ret);
 		pvec->nr = ret;
diff -Nurp linux-2.6.24-rc4/fs/direct-io.c linux/fs/direct-io.c
--- linux-2.6.24-rc4/fs/direct-io.c	2007-12-04 08:44:49.000000000 -0600
+++ linux/fs/direct-io.c	2007-12-10 15:01:17.000000000 -0600
@@ -150,17 +150,11 @@ static int dio_refill_pages(struct dio *
 	int nr_pages;
 
 	nr_pages = min(dio->total_pages - dio->curr_page, DIO_PAGES);
-	down_read(&current->mm->mmap_sem);
-	ret = get_user_pages(
-		current,			/* Task for fault acounting */
-		current->mm,			/* whose pages? */
+	ret = fast_gup(
 		dio->curr_user_address,		/* Where from? */
 		nr_pages,			/* How many pages? */
 		dio->rw == READ,		/* Write to memory? */
-		0,				/* force (?) */
-		&dio->pages[0],
-		NULL);				/* vmas */
-	up_read(&current->mm->mmap_sem);
+		&dio->pages[0]);		/* Put results here */
 
 	if (ret < 0 && dio->blocks_available && (dio->rw & WRITE)) {
 		struct page *page = ZERO_PAGE(0);
diff -Nurp linux-2.6.24-rc4/fs/splice.c linux/fs/splice.c
--- linux-2.6.24-rc4/fs/splice.c	2007-12-04 08:44:50.000000000 -0600
+++ linux/fs/splice.c	2007-12-10 15:01:17.000000000 -0600
@@ -1174,33 +1174,6 @@ static long do_splice(struct file *in, l
 }
 
 /*
- * Do a copy-from-user while holding the mmap_semaphore for reading, in a
- * manner safe from deadlocking with simultaneous mmap() (grabbing mmap_sem
- * for writing) and page faulting on the user memory pointed to by src.
- * This assumes that we will very rarely hit the partial != 0 path, or this
- * will not be a win.
- */
-static int copy_from_user_mmap_sem(void *dst, const void __user *src, size_t n)
-{
-	int partial;
-
-	pagefault_disable();
-	partial = __copy_from_user_inatomic(dst, src, n);
-	pagefault_enable();
-
-	/*
-	 * Didn't copy everything, drop the mmap_sem and do a faulting copy
-	 */
-	if (unlikely(partial)) {
-		up_read(&current->mm->mmap_sem);
-		partial = copy_from_user(dst, src, n);
-		down_read(&current->mm->mmap_sem);
-	}
-
-	return partial;
-}
-
-/*
  * Map an iov into an array of pages and offset/length tupples. With the
  * partial_page structure, we can map several non-contiguous ranges into
  * our ones pages[] map instead of splitting that operation into pieces.
@@ -1213,8 +1186,6 @@ static int get_iovec_page_array(const st
 {
 	int buffers = 0, error = 0;
 
-	down_read(&current->mm->mmap_sem);
-
 	while (nr_vecs) {
 		unsigned long off, npages;
 		struct iovec entry;
@@ -1223,7 +1194,7 @@ static int get_iovec_page_array(const st
 		int i;
 
 		error = -EFAULT;
-		if (copy_from_user_mmap_sem(&entry, iov, sizeof(entry)))
+		if (copy_from_user(&entry, iov, sizeof(entry)))
 			break;
 
 		base = entry.iov_base;
@@ -1257,9 +1228,8 @@ static int get_iovec_page_array(const st
 		if (npages > PIPE_BUFFERS - buffers)
 			npages = PIPE_BUFFERS - buffers;
 
-		error = get_user_pages(current, current->mm,
-				       (unsigned long) base, npages, 0, 0,
-				       &pages[buffers], NULL);
+		error = fast_gup((unsigned long)base, npages, 0,
+				 &pages[buffers]);
 
 		if (unlikely(error <= 0))
 			break;
@@ -1298,8 +1268,6 @@ static int get_iovec_page_array(const st
 		iov++;
 	}
 
-	up_read(&current->mm->mmap_sem);
-
 	if (buffers)
 		return buffers;
 
diff -Nurp linux-2.6.24-rc4/include/asm-x86/uaccess_64.h linux/include/asm-x86/uaccess_64.h
--- linux-2.6.24-rc4/include/asm-x86/uaccess_64.h	2007-12-04 08:44:54.000000000 -0600
+++ linux/include/asm-x86/uaccess_64.h	2007-12-10 15:01:17.000000000 -0600
@@ -381,4 +381,8 @@ static inline int __copy_from_user_inato
 	return __copy_user_nocache(dst, src, size, 0);
 }
 
+#define __HAVE_ARCH_FAST_GUP
+struct page;
+int fast_gup(unsigned long start, int nr_pages, int write, struct page **pages);
+
 #endif /* __X86_64_UACCESS_H */
diff -Nurp linux-2.6.24-rc4/include/linux/mm.h linux/include/linux/mm.h
--- linux-2.6.24-rc4/include/linux/mm.h	2007-12-04 08:44:56.000000000 -0600
+++ linux/include/linux/mm.h	2007-12-10 15:01:17.000000000 -0600
@@ -12,6 +12,7 @@
 #include <linux/prio_tree.h>
 #include <linux/debug_locks.h>
 #include <linux/mm_types.h>
+#include <linux/uaccess.h>	/* for __HAVE_ARCH_FAST_GUP */
 
 struct mempolicy;
 struct anon_vma;
@@ -750,6 +751,31 @@ extern int mprotect_fixup(struct vm_area
 			  unsigned long end, unsigned long newflags);
 
 /*
+ * Architecture may implement efficient get_user_pages to avoid having to
+ * take the mmap sem
+ */
+#ifndef __HAVE_ARCH_FAST_GUP
+static inline int __fast_gup(struct mm_struct *mm, unsigned long start,
+			     int nr_pages, int write, struct page **pages)
+{
+	int ret;
+
+	down_read(&mm->mmap_sem);
+	ret = get_user_pages(current, mm, start, nr_pages, write,
+			     0, pages, NULL);
+	up_read(&mm->mmap_sem);
+
+	return ret;
+}
+/*
+ * This macro avoids having to include sched.h in this header to
+ * dereference current->mm.
+ */
+#define fast_gup(start, nr_pages, write, pages) \
+	__fast_gup(current->mm, start, nr_pages, write, pages)
+#endif
+
+/*
  * A callback you can register to apply pressure to ageable caches.
  *
  * 'shrink' is passed a count 'nr_to_scan' and a 'gfpmask'.  It should
diff -Nurp linux-2.6.24-rc4/include/linux/vmstat.h linux/include/linux/vmstat.h
--- linux-2.6.24-rc4/include/linux/vmstat.h	2007-10-09 15:31:38.000000000 -0500
+++ linux/include/linux/vmstat.h	2007-12-10 15:01:17.000000000 -0600
@@ -37,6 +37,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PS
 		FOR_ALL_ZONES(PGSCAN_DIRECT),
 		PGINODESTEAL, SLABS_SCANNED, KSWAPD_STEAL, KSWAPD_INODESTEAL,
 		PAGEOUTRUN, ALLOCSTALL, PGROTATED,
+		FAST_GUP, FAST_GUP_SLOW,
 		NR_VM_EVENT_ITEMS
 };
 
diff -Nurp linux-2.6.24-rc4/mm/vmstat.c linux/mm/vmstat.c
--- linux-2.6.24-rc4/mm/vmstat.c	2007-12-04 08:45:01.000000000 -0600
+++ linux/mm/vmstat.c	2007-12-10 15:01:17.000000000 -0600
@@ -642,6 +642,9 @@ static const char * const vmstat_text[] 
 	"allocstall",
 
 	"pgrotated",
+	"fast_gup",
+	"fast_gup_slow",
+
 #endif
 };
 

-- 
David Kleikamp
IBM Linux Technology Center



* Re: [rfc] lockless get_user_pages for dio (and more)
  2007-12-10 21:30       ` [rfc] lockless get_user_pages for dio (and more) Dave Kleikamp
@ 2007-12-12  4:57         ` Nick Piggin
  2007-12-12  5:11           ` Dave Kleikamp
  0 siblings, 1 reply; 7+ messages in thread
From: Nick Piggin @ 2007-12-12  4:57 UTC (permalink / raw)
  To: Dave Kleikamp
  Cc: Siddha, Suresh B, Ken Chen, Badari Pulavarty, linux-mm,
	tony.luck, Adam Litke, linux-kernel

On Tuesday 11 December 2007 08:30, Dave Kleikamp wrote:
> On Mon, 2007-10-15 at 22:25 +1000, Nick Piggin wrote:
> > On Monday 15 October 2007 04:19, Siddha, Suresh B wrote:
> > > On Sun, Oct 14, 2007 at 11:01:02AM +1000, Nick Piggin wrote:
> > > > This is just a really quick hack, untested ATM, but one that
> > > > has at least a chance of working (on x86).
> > >
> > > When we fall back to slow mode, we should decrement the ref counts
> > > on the pages we got so far in the fast mode.
> >
> > Here is something that is actually tested and works (not
> > tested with hugepages yet, though).
> >
> > However it's not 100% secure at the moment. It's actually
> > not completely trivial; I think we need to use an extra bit
> > in the present pte in order to exclude "not normal" pages,
> > if we want fast_gup to work on small page mappings too. I
> > think this would be possible to do on most architectures, but
> > I haven't done it here obviously.
> >
> > Still, it should be enough to test the design. I've added
> > fast_gup and fast_gup_slow to /proc/vmstat, which count the
> > number of times fast_gup was called, and the number of times
> > it dropped into the slowpath. It would be interesting to know
> > how it performs compared to your granular hugepage ptl...
>
> Nick,
> I've played with the fast_gup patch a bit.  I was able to find a problem
> in follow_hugetlb_page() that Adam Litke fixed.  I'm haven't been brave
> enough to implement it on any other architectures, but I did add  a
> default that takes mmap_sem and calls the normal get_user_pages() if the
> architecture doesn't define fast_gup().  I put it in linux/mm.h, for
> lack of a better place, but it's a little kludgy since I didn't want
> mm.h to have to include sched.h.  This patch is against 2.6.24-rc4.
> It's not ready for inclusion yet, of course.

Hi Dave,

Thanks so much. This makes it a much more complete patch (although
still missing the "normal page" detection).

I think I missed -- or forgot -- what was the follow_hugetlb_page
problem?

Anyway, I am hoping that someone will one day test this and
find that it helps their workload, but on the other hand, if it doesn't
help anyone then we don't have to worry about adding it to the
kernel ;) I don't have any real setups that hammer DIO with threads.
I'm guessing DB2 and/or Oracle does?

Thanks,
Nick

> [rest of the quoted patch snipped]


* Re: [rfc] lockless get_user_pages for dio (and more)
  2007-12-12  4:57         ` Nick Piggin
@ 2007-12-12  5:11           ` Dave Kleikamp
  2007-12-12  5:40             ` Nick Piggin
  0 siblings, 1 reply; 7+ messages in thread
From: Dave Kleikamp @ 2007-12-12  5:11 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Siddha, Suresh B, Ken Chen, Badari Pulavarty, linux-mm,
	tony.luck, Adam Litke, linux-kernel


On Wed, 2007-12-12 at 15:57 +1100, Nick Piggin wrote:
> On Tuesday 11 December 2007 08:30, Dave Kleikamp wrote:

> > Nick,
> > I've played with the fast_gup patch a bit.  I was able to find a problem
> > in follow_hugetlb_page() that Adam Litke fixed.  I'm haven't been brave
> > enough to implement it on any other architectures, but I did add  a
> > default that takes mmap_sem and calls the normal get_user_pages() if the
> > architecture doesn't define fast_gup().  I put it in linux/mm.h, for
> > lack of a better place, but it's a little kludgy since I didn't want
> > mm.h to have to include sched.h.  This patch is against 2.6.24-rc4.
> > It's not ready for inclusion yet, of course.
> 
> Hi Dave,
> 
> Thanks so much. This makes it much more a complete patch (although
> still missing the "normal page" detection).
> 
> I think I missed -- or forgot -- what was the follow_hugetlb_page
> problem?

Badari found a problem running some tests and handed it off to me to
look at.  I didn't share it publicly.  Anyway, we were finding that
fastgup was taking the slow path almost all the time with huge pages.
The problem was that follow_hugetlb_page was failing to fault on a
non-writable page when it needed a writable one.  So we'd keep seeing a
non-writable page over and over.  This is fixed in 2.6.24-rc5.
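
To illustrate the gist of it (a from-memory sketch, not the verbatim
2.6.24-rc5 fix): the follow_hugetlb_page() loop only called hugetlb_fault()
when the huge pte was empty, so a read-only pte satisfied a write request
and never got write-faulted.  The fix adds the write check to the fault
condition, roughly:

            pte = huge_pte_offset(mm, vaddr & HPAGE_MASK);

            /* was: if (!pte || pte_none(*pte)) */
            if (!pte || pte_none(*pte) ||
                (write && !pte_write(*pte))) {
                    /* force a (write) fault instead of looping on the pte */
                    ret = hugetlb_fault(mm, vma, vaddr, write);
                    if (ret & VM_FAULT_ERROR)
                            break;
                    continue;
            }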

> Anyway, I am hoping that someone will one day and test if this and
> find it helps their workload, but on the other hand, if it doesn't
> help anyone then we don't have to worry about adding it to the
> kernel ;) I don't have any real setups that hammers DIO with threads.
> I'm guessing DB2 and/or Oracle does?

I'll try to get someone to run a DB2 benchmark and see what it looks
like.
-- 
David Kleikamp
IBM Linux Technology Center



* Re: [rfc] lockless get_user_pages for dio (and more)
  2007-12-12  5:11           ` Dave Kleikamp
@ 2007-12-12  5:40             ` Nick Piggin
  2008-01-16 19:58               ` Dave Kleikamp
  0 siblings, 1 reply; 7+ messages in thread
From: Nick Piggin @ 2007-12-12  5:40 UTC (permalink / raw)
  To: Dave Kleikamp
  Cc: Siddha, Suresh B, Ken Chen, Badari Pulavarty, linux-mm,
	tony.luck, Adam Litke, linux-kernel

On Wednesday 12 December 2007 16:11, Dave Kleikamp wrote:
> On Wed, 2007-12-12 at 15:57 +1100, Nick Piggin wrote:
> > On Tuesday 11 December 2007 08:30, Dave Kleikamp wrote:
> > > Nick,
> > > I've played with the fast_gup patch a bit.  I was able to find a
> > > problem in follow_hugetlb_page() that Adam Litke fixed.  I'm haven't
> > > been brave enough to implement it on any other architectures, but I did
> > > add  a default that takes mmap_sem and calls the normal
> > > get_user_pages() if the architecture doesn't define fast_gup().  I put
> > > it in linux/mm.h, for lack of a better place, but it's a little kludgy
> > > since I didn't want mm.h to have to include sched.h.  This patch is
> > > against 2.6.24-rc4. It's not ready for inclusion yet, of course.
> >
> > Hi Dave,
> >
> > Thanks so much. This makes it much more a complete patch (although
> > still missing the "normal page" detection).
> >
> > I think I missed -- or forgot -- what was the follow_hugetlb_page
> > problem?
>
> Badari found a problem running some tests and handed it off to me to
> look at.  I didn't share it publicly.  Anyway, we were finding that
> fastgup was taking the slow path almost all the time with huge pages.
> The problem was that follow_hugetlb_page was failing to fault on a
> non-writable page when it needed a writable one.  So we'd keep seeing a
> non-writable page over and over.  This is fixed in 2.6.24-rc5.

Ah yes, I just saw that fix in the changelog. So not a problem with my
patch as such, but good to get that fixed.


> > Anyway, I am hoping that someone will one day and test if this and
> > find it helps their workload, but on the other hand, if it doesn't
> > help anyone then we don't have to worry about adding it to the
> > kernel ;) I don't have any real setups that hammers DIO with threads.
> > I'm guessing DB2 and/or Oracle does?
>
> I'll try to get someone to run a DB2 benchmark and see what it looks
> like.

That would be great if you could.

Thanks,
Nick


* Re: [rfc] lockless get_user_pages for dio (and more)
  2007-12-12  5:40             ` Nick Piggin
@ 2008-01-16 19:58               ` Dave Kleikamp
  2008-01-17  6:34                 ` Nick Piggin
  2008-01-24  7:06                 ` Nick Piggin
  0 siblings, 2 replies; 7+ messages in thread
From: Dave Kleikamp @ 2008-01-16 19:58 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Siddha, Suresh B, Ken Chen, Badari Pulavarty, linux-mm,
	tony.luck, Adam Litke, linux-kernel


On Wed, 2007-12-12 at 16:40 +1100, Nick Piggin wrote:
> On Wednesday 12 December 2007 16:11, Dave Kleikamp wrote:
> > On Wed, 2007-12-12 at 15:57 +1100, Nick Piggin wrote:

> > > Anyway, I am hoping that someone will one day and test if this and
> > > find it helps their workload, but on the other hand, if it doesn't
> > > help anyone then we don't have to worry about adding it to the
> > > kernel ;) I don't have any real setups that hammers DIO with threads.
> > > I'm guessing DB2 and/or Oracle does?
> >
> > I'll try to get someone to run a DB2 benchmark and see what it looks
> > like.
> 
> That would be great if you could.

We weren't able to get in any runs before the holidays, but we finally
have some good news from our performance team:

"To test the effects of the patch, an OLTP workload was run on an IBM
x3850 M2 server with 2 processors (quad-core Intel Xeon processors at
2.93 GHz) using IBM DB2 v9.5 running Linux 2.6.24rc7 kernel. Comparing
runs with and without the patch resulted in an overall performance
benefit of ~9.8%. Correspondingly, oprofiles showed that samples from
__up_read and __down_read routines that are seen during thread contention
for system resources were reduced from 2.8% down to 0.05%. Monitoring
the /proc/vmstat output from the patched run showed that the counter for
fast_gup contained a very high number while the fast_gup_slow value was
zero."

Great work, Nick!

Thanks,
Shaggy
-- 
David Kleikamp
IBM Linux Technology Center



* Re: [rfc] lockless get_user_pages for dio (and more)
  2008-01-16 19:58               ` Dave Kleikamp
@ 2008-01-17  6:34                 ` Nick Piggin
  2008-01-24  7:06                 ` Nick Piggin
  1 sibling, 0 replies; 7+ messages in thread
From: Nick Piggin @ 2008-01-17  6:34 UTC (permalink / raw)
  To: Dave Kleikamp
  Cc: Siddha, Suresh B, Ken Chen, Badari Pulavarty, linux-mm,
	tony.luck, Adam Litke, linux-kernel, linux-arch

On Thursday 17 January 2008 06:58, Dave Kleikamp wrote:
> On Wed, 2007-12-12 at 16:40 +1100, Nick Piggin wrote:
> > On Wednesday 12 December 2007 16:11, Dave Kleikamp wrote:
> > > On Wed, 2007-12-12 at 15:57 +1100, Nick Piggin wrote:
> > > > Anyway, I am hoping that someone will one day and test if this and
> > > > find it helps their workload, but on the other hand, if it doesn't
> > > > help anyone then we don't have to worry about adding it to the
> > > > kernel ;) I don't have any real setups that hammers DIO with threads.
> > > > I'm guessing DB2 and/or Oracle does?
> > >
> > > I'll try to get someone to run a DB2 benchmark and see what it looks
> > > like.
> >
> > That would be great if you could.
>
> We weren't able to get in any runs before the holidays, but we finally
> have some good news from our performance team:
>
> "To test the effects of the patch, an OLTP workload was run on an IBM
> x3850 M2 server with 2 processors (quad-core Intel Xeon processors at
> 2.93 GHz) using IBM DB2 v9.5 running Linux 2.6.24rc7 kernel. Comparing
> runs with and without the patch resulted in an overall performance
> benefit of ~9.8%. Correspondingly, oprofiles showed that samples from
> __up_read and __down_read routines that is seen during thread contention
> for system resources was reduced from 2.8% down to .05%. Monitoring
> the /proc/vmstat output from the patched run showed that the counter for
> fast_gup contained a very high number while the fast_gup_slow value was
> zero."
>
> Great work, Nick!

Ah, excellent. Thanks for getting those numbers Dave. This will
be a great help towards getting the patch merged.

I'm just working on the final required piece for this thing (the
pte_special pte bit, required to distinguish whether or not we
can refcount a page without looking at the vma). It is strictly
just a correctness/security measure, which is why you were able
to run tests without it. And it won't add any significant cost to
the fastpaths, so the numbers remain valid.
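
Roughly, the pte-level fast walk will refuse anything marked special, since a
special mapping has no struct page that can be refcounted without consulting
the vma.  A sketch of the check, assuming the proposed pte_special() helper
(the x86 patch later in this thread folds the same test into a single mask
compare):

            pte_t pte = *ptep;

            if (!pte_present(pte) || (write && !pte_write(pte)))
                    return 0;       /* punt to the slow path */
            if (pte_special(pte))
                    return 0;       /* no vma in hand, can't take a ref */

            page = pte_page(pte);
            get_page(page);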

FWIW, I cc'ed linux-arch: the lockless get_user_pages patch has
architecture specific elements, so it will need some attention
there. If other architectures are interested (eg. powerpc or
ia64), then I will be happy to work with maintainers to help
try to devise a way of fitting it into their tlb flushing scheme.
Ping me if you'd like to take up the offer.

Thanks,
Nick



* Re: [rfc] lockless get_user_pages for dio (and more)
  2008-01-16 19:58               ` Dave Kleikamp
  2008-01-17  6:34                 ` Nick Piggin
@ 2008-01-24  7:06                 ` Nick Piggin
  1 sibling, 0 replies; 7+ messages in thread
From: Nick Piggin @ 2008-01-24  7:06 UTC (permalink / raw)
  To: Dave Kleikamp
  Cc: Siddha, Suresh B, Ken Chen, Badari Pulavarty, linux-mm,
	tony.luck, Adam Litke, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1491 bytes --]

On Thursday 17 January 2008 06:58, Dave Kleikamp wrote:

> We weren't able to get in any runs before the holidays, but we finally
> have some good news from our performance team:
>
> "To test the effects of the patch, an OLTP workload was run on an IBM
> x3850 M2 server with 2 processors (quad-core Intel Xeon processors at
> 2.93 GHz) using IBM DB2 v9.5 running Linux 2.6.24rc7 kernel. Comparing
> runs with and without the patch resulted in an overall performance
> benefit of ~9.8%. Correspondingly, oprofiles showed that samples from
> __up_read and __down_read routines that is seen during thread contention
> for system resources was reduced from 2.8% down to .05%. Monitoring
> the /proc/vmstat output from the patched run showed that the counter for
> fast_gup contained a very high number while the fast_gup_slow value was
> zero."

Just for reference, I've attached a more complete patch for x86,
which has to be applied on top of the pte_special patch posted in
another thread.

No need to test anything at this point... the generated code for
this version is actually slightly better than the last one despite
the extra condition being tested for. With a few tweaks I was
actually able to reduce the number of tests in the inner loop, and
adding noinline to the leaf functions helps keep them in registers.

I'm currently having a look at an initial powerpc 64 patch;
hopefully we'll see similar improvements there. Will post that when
I get further along with it.

Thanks,
Nick

[-- Attachment #2: mm-get_user_pages-fast.patch --]
[-- Type: text/x-diff, Size: 12491 bytes --]

Introduce a new "fast_gup" (for want of a better name right now) which
is basically a get_user_pages with a less general API that is more suited
to the common case.

- task and mm are always current and current->mm
- force is always 0
- pages is always non-NULL
- don't pass back vmas

This allows (at least on x86) an optimistic lockless pagetable walk,
without taking any page table locks or even mmap_sem. Page table existence
is guaranteed by turning interrupts off (combined with the fact that we're
always looking up the current mm, which would need an IPI before its
pagetables could be shot down from another CPU).

Many other architectures could do the same thing. Those that don't IPI
could potentially RCU free the page tables and do speculative references
on the pages (a la lockless pagecache) to achieve a lockless fast_gup.


---
Index: linux-2.6/arch/x86/lib/Makefile_64
===================================================================
--- linux-2.6.orig/arch/x86/lib/Makefile_64
+++ linux-2.6/arch/x86/lib/Makefile_64
@@ -10,4 +10,4 @@ obj-$(CONFIG_SMP)	+= msr-on-cpu.o
 lib-y := csum-partial_64.o csum-copy_64.o csum-wrappers_64.o delay_64.o \
 	usercopy_64.o getuser_64.o putuser_64.o  \
 	thunk_64.o clear_page_64.o copy_page_64.o bitstr_64.o bitops_64.o
-lib-y += memcpy_64.o memmove_64.o memset_64.o copy_user_64.o rwlock_64.o copy_user_nocache_64.o
+lib-y += memcpy_64.o memmove_64.o memset_64.o copy_user_64.o rwlock_64.o copy_user_nocache_64.o gup.o
Index: linux-2.6/arch/x86/lib/gup.c
===================================================================
--- /dev/null
+++ linux-2.6/arch/x86/lib/gup.c
@@ -0,0 +1,189 @@
+/*
+ * Lockless fast_gup for x86
+ *
+ * Copyright (C) 2007 Nick Piggin
+ * Copyright (C) 2007 Novell Inc.
+ */
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/vmstat.h>
+#include <asm/pgtable.h>
+
+/*
+ * The performance critical leaf functions are made noinline otherwise gcc
+ * inlines everything into a single function which results in too much
+ * register pressure.
+ */
+static noinline int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, int write, struct page **pages, int *nr)
+{
+	unsigned long mask, result;
+	pte_t *ptep;
+
+	result = _PAGE_PRESENT|_PAGE_USER;
+	if (write)
+		result |= _PAGE_RW;
+	mask = result | _PAGE_SPECIAL;
+
+	ptep = pte_offset_map(&pmd, addr);
+	do {
+		/*
+		 * XXX: careful. On 3-level 32-bit, the pte is 64 bits, and
+		 * we need to make sure we load the low word first, then the
+		 * high. This means _PAGE_PRESENT should be clear if the high
+		 * word was not valid. Currently, the C compiler can issue
+		 * the loads in any order, and I don't know of a wrapper
+		 * function that will do this properly, so it is broken on
+		 * 32-bit 3-level for the moment.
+		 */
+		pte_t pte = *ptep;
+		struct page *page;
+
+		if ((pte_val(pte) & mask) != result)
+			return 0;
+		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
+		page = pte_page(pte);
+		get_page(page);
+		pages[*nr] = page;
+		(*nr)++;
+
+	} while (ptep++, addr += PAGE_SIZE, addr != end);
+	pte_unmap(ptep - 1);
+
+	return 1;
+}
+
+static inline void get_head_page_multiple(struct page *page, int nr)
+{
+	VM_BUG_ON(page != compound_head(page));
+	VM_BUG_ON(page_count(page) == 0);
+	atomic_add(nr, &page->_count);
+}
+
+static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr, unsigned long end, int write, struct page **pages, int *nr)
+{
+	unsigned long mask;
+	pte_t pte = *(pte_t *)&pmd;
+	struct page *head, *page;
+	int refs;
+
+	mask = _PAGE_PRESENT|_PAGE_USER;
+	if (write)
+		mask |= _PAGE_RW;
+	if ((pte_val(pte) & mask) != mask)
+		return 0;
+	/* hugepages are never "special" */
+	VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
+
+	refs = 0;
+	head = pte_page(pte);
+	page = head + ((addr & ~HPAGE_MASK) >> PAGE_SHIFT);
+	do {
+		pages[*nr] = page;
+		(*nr)++;
+		page++;
+		refs++;
+	} while (addr += PAGE_SIZE, addr != end);
+	get_head_page_multiple(head, refs);
+
+	return 1;
+}
+
+static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end, int write, struct page **pages, int *nr)
+{
+	unsigned long next;
+	pmd_t *pmdp;
+
+	pmdp = (pmd_t *)pud_page_vaddr(pud) + pmd_index(addr);
+	do {
+		pmd_t pmd = *pmdp;
+
+		next = pmd_addr_end(addr, end);
+		if (pmd_none(pmd))
+			return 0;
+		if (unlikely(pmd_large(pmd))) {
+			if (!gup_huge_pmd(pmd, addr, next, write, pages, nr))
+				return 0;
+		} else {
+			if (!gup_pte_range(pmd, addr, next, write, pages, nr))
+				return 0;
+		}
+	} while (pmdp++, addr = next, addr != end);
+
+	return 1;
+}
+
+static int gup_pud_range(pgd_t pgd, unsigned long addr, unsigned long end, int write, struct page **pages, int *nr)
+{
+	unsigned long next;
+	pud_t *pudp;
+
+	pudp = (pud_t *)pgd_page_vaddr(pgd) + pud_index(addr);
+	do {
+		pud_t pud = *pudp;
+
+		next = pud_addr_end(addr, end);
+		if (pud_none(pud))
+			return 0;
+		if (!gup_pmd_range(pud, addr, next, write, pages, nr))
+			return 0;
+	} while (pudp++, addr = next, addr != end);
+
+	return 1;
+}
+
+int fast_gup(unsigned long start, int nr_pages, int write, struct page **pages)
+{
+	struct mm_struct *mm = current->mm;
+	unsigned long end = start + (nr_pages << PAGE_SHIFT);
+	unsigned long addr = start;
+	unsigned long next;
+	pgd_t *pgdp;
+	int nr = 0;
+
+	/*
+	 * XXX: batch / limit 'nr', to avoid huge latency
+	 * needs some instrumenting to determine the common sizes used by
+	 * important workloads (eg. DB2), and whether limiting the batch size
+	 * will decrease performance.
+	 */
+	/*
+	 * This doesn't prevent pagetable teardown, but does prevent
+	 * the pagetables and pages from being freed on x86.
+	 *
+	 * So long as we atomically load page table pointers versus teardown
+	 * (which we do on x86), we can follow the address down to the
+	 * page and take a ref on it.
+	 */
+	local_irq_disable();
+	pgdp = pgd_offset(mm, addr);
+	do {
+		pgd_t pgd = *pgdp;
+
+		next = pgd_addr_end(addr, end);
+		if (pgd_none(pgd))
+			goto slow;
+		if (!gup_pud_range(pgd, addr, next, write, pages, &nr))
+			goto slow;
+	} while (pgdp++, addr = next, addr != end);
+	local_irq_enable();
+
+	VM_BUG_ON(nr != (end - start) >> PAGE_SHIFT);
+	return nr;
+
+slow:
+	{
+		int i, ret;
+
+		local_irq_enable();
+		/* Could optimise this more by keeping what we've already got */
+		for (i = 0; i < nr; i++)
+			put_page(pages[i]);
+
+		down_read(&mm->mmap_sem);
+		ret = get_user_pages(current, mm, start,
+			(end - start) >> PAGE_SHIFT, write, 0, pages, NULL);
+		up_read(&mm->mmap_sem);
+
+		return ret;
+	}
+}
Index: linux-2.6/include/asm-x86/uaccess_64.h
===================================================================
--- linux-2.6.orig/include/asm-x86/uaccess_64.h
+++ linux-2.6/include/asm-x86/uaccess_64.h
@@ -381,4 +381,8 @@ static inline int __copy_from_user_inato
 	return __copy_user_nocache(dst, src, size, 0);
 }
 
+#define __HAVE_ARCH_FAST_GUP
+struct page;
+int fast_gup(unsigned long start, int nr_pages, int write, struct page **pages);
+
 #endif /* __X86_64_UACCESS_H */
Index: linux-2.6/fs/bio.c
===================================================================
--- linux-2.6.orig/fs/bio.c
+++ linux-2.6/fs/bio.c
@@ -637,12 +637,7 @@ static struct bio *__bio_map_user_iov(st
 		const int local_nr_pages = end - start;
 		const int page_limit = cur_page + local_nr_pages;
 		
-		down_read(&current->mm->mmap_sem);
-		ret = get_user_pages(current, current->mm, uaddr,
-				     local_nr_pages,
-				     write_to_vm, 0, &pages[cur_page], NULL);
-		up_read(&current->mm->mmap_sem);
-
+		ret = fast_gup(uaddr, local_nr_pages, write_to_vm, &pages[cur_page]);
 		if (ret < local_nr_pages) {
 			ret = -EFAULT;
 			goto out_unmap;
Index: linux-2.6/fs/block_dev.c
===================================================================
--- linux-2.6.orig/fs/block_dev.c
+++ linux-2.6/fs/block_dev.c
@@ -221,10 +221,7 @@ static struct page *blk_get_page(unsigne
 	if (pvec->idx == pvec->nr) {
 		nr_pages = PAGES_SPANNED(addr, count);
 		nr_pages = min(nr_pages, VEC_SIZE);
-		down_read(&current->mm->mmap_sem);
-		ret = get_user_pages(current, current->mm, addr, nr_pages,
-				     rw == READ, 0, pvec->page, NULL);
-		up_read(&current->mm->mmap_sem);
+		ret = fast_gup(addr, nr_pages, rw == READ, pvec->page);
 		if (ret < 0)
 			return ERR_PTR(ret);
 		pvec->nr = ret;
Index: linux-2.6/fs/direct-io.c
===================================================================
--- linux-2.6.orig/fs/direct-io.c
+++ linux-2.6/fs/direct-io.c
@@ -150,17 +150,11 @@ static int dio_refill_pages(struct dio *
 	int nr_pages;
 
 	nr_pages = min(dio->total_pages - dio->curr_page, DIO_PAGES);
-	down_read(&current->mm->mmap_sem);
-	ret = get_user_pages(
-		current,			/* Task for fault acounting */
-		current->mm,			/* whose pages? */
+	ret = fast_gup(
 		dio->curr_user_address,		/* Where from? */
 		nr_pages,			/* How many pages? */
 		dio->rw == READ,		/* Write to memory? */
-		0,				/* force (?) */
-		&dio->pages[0],
-		NULL);				/* vmas */
-	up_read(&current->mm->mmap_sem);
+		&dio->pages[0]);		/* Put results here */
 
 	if (ret < 0 && dio->blocks_available && (dio->rw & WRITE)) {
 		struct page *page = ZERO_PAGE(0);
Index: linux-2.6/fs/splice.c
===================================================================
--- linux-2.6.orig/fs/splice.c
+++ linux-2.6/fs/splice.c
@@ -1174,33 +1174,6 @@ static long do_splice(struct file *in, l
 }
 
 /*
- * Do a copy-from-user while holding the mmap_semaphore for reading, in a
- * manner safe from deadlocking with simultaneous mmap() (grabbing mmap_sem
- * for writing) and page faulting on the user memory pointed to by src.
- * This assumes that we will very rarely hit the partial != 0 path, or this
- * will not be a win.
- */
-static int copy_from_user_mmap_sem(void *dst, const void __user *src, size_t n)
-{
-	int partial;
-
-	pagefault_disable();
-	partial = __copy_from_user_inatomic(dst, src, n);
-	pagefault_enable();
-
-	/*
-	 * Didn't copy everything, drop the mmap_sem and do a faulting copy
-	 */
-	if (unlikely(partial)) {
-		up_read(&current->mm->mmap_sem);
-		partial = copy_from_user(dst, src, n);
-		down_read(&current->mm->mmap_sem);
-	}
-
-	return partial;
-}
-
-/*
  * Map an iov into an array of pages and offset/length tupples. With the
  * partial_page structure, we can map several non-contiguous ranges into
  * our ones pages[] map instead of splitting that operation into pieces.
@@ -1213,8 +1186,6 @@ static int get_iovec_page_array(const st
 {
 	int buffers = 0, error = 0;
 
-	down_read(&current->mm->mmap_sem);
-
 	while (nr_vecs) {
 		unsigned long off, npages;
 		struct iovec entry;
@@ -1223,7 +1194,7 @@ static int get_iovec_page_array(const st
 		int i;
 
 		error = -EFAULT;
-		if (copy_from_user_mmap_sem(&entry, iov, sizeof(entry)))
+		if (copy_from_user(&entry, iov, sizeof(entry)))
 			break;
 
 		base = entry.iov_base;
@@ -1257,9 +1228,7 @@ static int get_iovec_page_array(const st
 		if (npages > PIPE_BUFFERS - buffers)
 			npages = PIPE_BUFFERS - buffers;
 
-		error = get_user_pages(current, current->mm,
-				       (unsigned long) base, npages, 0, 0,
-				       &pages[buffers], NULL);
+		error = fast_gup((unsigned long)base, npages, 0, &pages[buffers]);
 
 		if (unlikely(error <= 0))
 			break;
@@ -1298,8 +1267,6 @@ static int get_iovec_page_array(const st
 		iov++;
 	}
 
-	up_read(&current->mm->mmap_sem);
-
 	if (buffers)
 		return buffers;
 
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -13,6 +13,7 @@
 #include <linux/debug_locks.h>
 #include <linux/mm_types.h>
 #include <linux/security.h>
+#include <linux/uaccess.h> /* for __HAVE_ARCH_FAST_GUP */
 
 struct mempolicy;
 struct anon_vma;
@@ -767,6 +768,24 @@ extern int mprotect_fixup(struct vm_area
 			  struct vm_area_struct **pprev, unsigned long start,
 			  unsigned long end, unsigned long newflags);
 
+#ifndef __HAVE_ARCH_FAST_GUP
+/* Should be moved to asm-generic, and architectures can include it if they
+ * don't implement their own fast_gup.
+ */
+#define fast_gup(start, nr_pages, write, pages)			\
+({								\
+	struct mm_struct *mm = current->mm;			\
+	int ret;						\
+								\
+	down_read(&mm->mmap_sem);				\
+	ret = get_user_pages(current, mm, start, nr_pages,	\
+					write, 0, pages, NULL);	\
+	up_read(&mm->mmap_sem);					\
+								\
+	ret;							\
+})
+#endif
+
 /*
  * A callback you can register to apply pressure to ageable caches.
  *


end of thread

Thread overview: 7+ messages
     [not found] <20071008225234.GC27824@linux-os.sc.intel.com>
     [not found] ` <200710141101.02649.nickpiggin@yahoo.com.au>
     [not found]   ` <20071014181929.GA19902@linux-os.sc.intel.com>
     [not found]     ` <200710152225.11433.nickpiggin@yahoo.com.au>
2007-12-10 21:30       ` [rfc] lockless get_user_pages for dio (and more) Dave Kleikamp
2007-12-12  4:57         ` Nick Piggin
2007-12-12  5:11           ` Dave Kleikamp
2007-12-12  5:40             ` Nick Piggin
2008-01-16 19:58               ` Dave Kleikamp
2008-01-17  6:34                 ` Nick Piggin
2008-01-24  7:06                 ` Nick Piggin
