* [PATCH 0/6] mlock: do not hold mmap_sem for extended periods of time
@ 2010-12-03  0:16 ` Michel Lespinasse
From: Michel Lespinasse @ 2010-12-03  0:16 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, Hugh Dickins, Rik van Riel,
	Peter Zijlstra, Nick Piggin, KOSAKI Motohiro
  Cc: linux-kernel, Linus Torvalds

Currently mlock() holds mmap_sem in exclusive mode while the pages get
faulted in. In the case of a large mlock, this can potentially take a
very long time, during which various commands such as 'ps auxw' will
block. This makes sysadmins unhappy:

real    14m36.232s
user    0m0.003s
sys     0m0.015s
(output from 'time ps auxw' while a 20GB file was being mlocked without
being previously preloaded into page cache)

I propose that mlock() could release mmap_sem after the VM_LOCKED bits
have been set in all appropriate VMAs. Then a second pass could be done
to actually mlock the pages, in small batches, releasing mmap_sem when
we block on disk access or when we detect some contention.
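
In code, the intended mlock() flow would look roughly like this
(a simplified sketch of what the series implements; error handling
and the VMA walk are omitted):

	/* Pass 1: mark the VMAs, with mmap_sem held for write. */
	down_write(&mm->mmap_sem);
	error = do_mlock(start, len, 1);	/* sets VM_LOCKED on the VMAs */
	up_write(&mm->mmap_sem);

	/* Pass 2: fault the pages in with mmap_sem held for read only,
	 * dropping and re-taking it when blocking or when contended. */
	if (!error)
		error = do_mlock_pages(start, len);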

Patches are against v2.6.37-rc4 plus my patches to avoid mlock dirtying
(presently queued in -mm).

Michel Lespinasse (6):
  mlock: only hold mmap_sem in shared mode when faulting in pages
  mm: add FOLL_MLOCK follow_page flag.
  mm: move VM_LOCKED check to __mlock_vma_pages_range()
  rwsem: implement rwsem_is_contended()
  mlock: do not hold mmap_sem for extended periods of time
  x86 rwsem: more precise rwsem_is_contended() implementation

 arch/alpha/include/asm/rwsem.h   |    5 +
 arch/ia64/include/asm/rwsem.h    |    5 +
 arch/powerpc/include/asm/rwsem.h |    5 +
 arch/s390/include/asm/rwsem.h    |    5 +
 arch/sh/include/asm/rwsem.h      |    5 +
 arch/sparc/include/asm/rwsem.h   |    5 +
 arch/x86/include/asm/rwsem.h     |   35 ++++++---
 arch/x86/lib/rwsem_64.S          |    4 +-
 arch/x86/lib/semaphore_32.S      |    4 +-
 arch/xtensa/include/asm/rwsem.h  |    5 +
 include/linux/mm.h               |    1 +
 include/linux/rwsem-spinlock.h   |    1 +
 lib/rwsem-spinlock.c             |   12 +++
 mm/internal.h                    |    3 +-
 mm/memory.c                      |   54 ++++++++++++--
 mm/mlock.c                       |  150 ++++++++++++++++++--------------------
 mm/nommu.c                       |    6 +-
 17 files changed, 201 insertions(+), 104 deletions(-)

-- 
1.7.3.1



* [PATCH 1/6] mlock: only hold mmap_sem in shared mode when faulting in pages
  2010-12-03  0:16 ` Michel Lespinasse
@ 2010-12-03  0:16   ` Michel Lespinasse
From: Michel Lespinasse @ 2010-12-03  0:16 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, Hugh Dickins, Rik van Riel,
	Peter Zijlstra, Nick Piggin, KOSAKI Motohiro
  Cc: linux-kernel, Linus Torvalds

Before this change, mlock() holds mmap_sem in exclusive mode while the
pages get faulted in. In the case of a large mlock, this can potentially
take a very long time. Various things will block while mmap_sem is held,
including 'ps auxw'. This can make sysadmins angry.

I propose that mlock() could release mmap_sem after the VM_LOCKED bits
have been set in all appropriate VMAs. Then a second pass could be done
to actually mlock the pages with mmap_sem held for reads only. We need
to recheck the vma flags after we re-acquire mmap_sem, but this is easy.
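
Concretely, the recheck amounts to something like this once mmap_sem is
held for read again (a simplified excerpt of the do_mlock_pages() loop
added below):

	down_read(&mm->mmap_sem);
	/* ... walk the VMAs covering [start, end) ... */
	if (vma->vm_flags & VM_LOCKED)
		ret = __mlock_vma_pages_range(vma, nstart, nend);
	else
		/* racing munlock() cleared VM_LOCKED; just fault in */
		make_pages_present(nstart, nend);
	up_read(&mm->mmap_sem);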

In the case where a vma has been munlocked before mlock completes,
pages that were already marked as PageMlocked() are handled by the
munlock() call, and mlock() is careful to not mark new page batches
as PageMlocked() after the munlock() call has cleared the VM_LOCKED
vma flags. So, the end result will be identical to what'd happen if
munlock() had executed after the mlock() call.

In a later change, I will allow the second pass to release mmap_sem when
blocking on disk accesses or when it is otherwise contended, so that
it won't be held for long periods of time even in shared mode.

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 mm/mlock.c |   75 ++++++++++++++++++++++++++++++++++++++++++++++-------------
 1 files changed, 58 insertions(+), 17 deletions(-)

diff --git a/mm/mlock.c b/mm/mlock.c
index 4f31864..8d6b702 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -377,18 +377,10 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
 	int ret = 0;
 	int lock = newflags & VM_LOCKED;
 
-	if (newflags == vma->vm_flags ||
-			(vma->vm_flags & (VM_IO | VM_PFNMAP)))
+	if (newflags == vma->vm_flags || (vma->vm_flags & VM_SPECIAL) ||
+	    is_vm_hugetlb_page(vma) || vma == get_gate_vma(current))
 		goto out;	/* don't set VM_LOCKED,  don't count */
 
-	if ((vma->vm_flags & (VM_DONTEXPAND | VM_RESERVED)) ||
-			is_vm_hugetlb_page(vma) ||
-			vma == get_gate_vma(current)) {
-		if (lock)
-			make_pages_present(start, end);
-		goto out;	/* don't set VM_LOCKED,  don't count */
-	}
-
 	pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
 	*prev = vma_merge(mm, *prev, start, end, newflags, vma->anon_vma,
 			  vma->vm_file, pgoff, vma_policy(vma));
@@ -424,14 +416,10 @@ success:
 	 * set VM_LOCKED, __mlock_vma_pages_range will bring it back.
 	 */
 
-	if (lock) {
+	if (lock)
 		vma->vm_flags = newflags;
-		ret = __mlock_vma_pages_range(vma, start, end);
-		if (ret < 0)
-			ret = __mlock_posix_error_return(ret);
-	} else {
+	else
 		munlock_vma_pages_range(vma, start, end);
-	}
 
 out:
 	*prev = vma;
@@ -444,7 +432,8 @@ static int do_mlock(unsigned long start, size_t len, int on)
 	struct vm_area_struct * vma, * prev;
 	int error;
 
-	len = PAGE_ALIGN(len);
+	VM_BUG_ON(start & ~PAGE_MASK);
+	VM_BUG_ON(len != PAGE_ALIGN(len));
 	end = start + len;
 	if (end < start)
 		return -EINVAL;
@@ -487,6 +476,54 @@ static int do_mlock(unsigned long start, size_t len, int on)
 	return error;
 }
 
+static int do_mlock_pages(unsigned long start, size_t len)
+{
+	struct mm_struct *mm = current->mm;
+	unsigned long end, nstart, nend;
+	struct vm_area_struct *vma = NULL;
+	int ret = 0;
+
+	VM_BUG_ON(start & ~PAGE_MASK);
+	VM_BUG_ON(len != PAGE_ALIGN(len));
+	end = start + len;
+
+	down_read(&mm->mmap_sem);
+	for (nstart = start; nstart < end; nstart = nend) {
+		/*
+		 * We want to fault in pages for [nstart; end) address range.
+		 * Find first corresponding VMA.
+		 */
+		if (!vma)
+			vma = find_vma(mm, nstart);
+		else
+			vma = vma->vm_next;
+		if (!vma || vma->vm_start >= end)
+			break;
+		/*
+		 * Set [nstart; nend) to intersection of desired address
+		 * range with the first VMA. Also, skip undesirable VMA types.
+		 */
+		nend = min(end, vma->vm_end);
+		if (vma->vm_flags & (VM_IO | VM_PFNMAP))
+			continue;
+		if (nstart < vma->vm_start)
+			nstart = vma->vm_start;
+		/*
+		 * Now fault in a range of pages within the first VMA.
+		 */
+		if (vma->vm_flags & VM_LOCKED) {
+			ret = __mlock_vma_pages_range(vma, nstart, nend);
+			if (ret) {
+				ret = __mlock_posix_error_return(ret);
+				break;
+			}
+		} else
+			make_pages_present(nstart, nend);
+	}
+	up_read(&mm->mmap_sem);
+	return ret;	/* 0 or negative error code */
+}
+
 SYSCALL_DEFINE2(mlock, unsigned long, start, size_t, len)
 {
 	unsigned long locked;
@@ -512,6 +549,8 @@ SYSCALL_DEFINE2(mlock, unsigned long, start, size_t, len)
 	if ((locked <= lock_limit) || capable(CAP_IPC_LOCK))
 		error = do_mlock(start, len, 1);
 	up_write(&current->mm->mmap_sem);
+	if (!error)
+		error = do_mlock_pages(start, len);
 	return error;
 }
 
@@ -576,6 +615,8 @@ SYSCALL_DEFINE1(mlockall, int, flags)
 	    capable(CAP_IPC_LOCK))
 		ret = do_mlockall(flags);
 	up_write(&current->mm->mmap_sem);
+	if (!ret && (flags & MCL_CURRENT))
+		ret = do_mlock_pages(0, TASK_SIZE);
 out:
 	return ret;
 }
-- 
1.7.3.1



* [PATCH 2/6] mm: add FOLL_MLOCK follow_page flag.
  2010-12-03  0:16 ` Michel Lespinasse
@ 2010-12-03  0:16   ` Michel Lespinasse
From: Michel Lespinasse @ 2010-12-03  0:16 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, Hugh Dickins, Rik van Riel,
	Peter Zijlstra, Nick Piggin, KOSAKI Motohiro
  Cc: linux-kernel, Linus Torvalds

Move the code to mlock pages from __mlock_vma_pages_range()
to follow_page().

This means __mlock_vma_pages_range() no longer has to break the work
down into 16-page batches.

An additional motivation for doing this within the present patch series
is that it'll make it easier for a later change to drop mmap_sem when
blocking on disk (we'd like to be able to resume at the page that was
read from disk instead of at the start of a 16-page batch).
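
After this change, the mlock path only has to set the new flag and let
get_user_pages / follow_page do the per-page work (excerpted from the
__mlock_vma_pages_range() hunk below; FOLL_GET is no longer needed
since follow_page() now drops its own page reference):

	gup_flags = FOLL_TOUCH | FOLL_MLOCK;
	if ((vma->vm_flags & (VM_WRITE | VM_SHARED)) == VM_WRITE)
		gup_flags |= FOLL_WRITE;
	ret = __get_user_pages(current, mm, addr, nr_pages, gup_flags,
			       NULL, NULL);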

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 include/linux/mm.h |    1 +
 mm/memory.c        |   27 ++++++++++++++++++++-
 mm/mlock.c         |   65 ++++------------------------------------------------
 3 files changed, 32 insertions(+), 61 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 721f451..cebbb0d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1415,6 +1415,7 @@ struct page *follow_page(struct vm_area_struct *, unsigned long address,
 #define FOLL_GET	0x04	/* do get_page on page */
 #define FOLL_DUMP	0x08	/* give error on hole if it would be zero */
 #define FOLL_FORCE	0x10	/* get_user_pages read/write w/o permission */
+#define FOLL_MLOCK	0x20	/* mark page as mlocked */
 
 typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
 			void *data);
diff --git a/mm/memory.c b/mm/memory.c
index b8f97b8..f3a9242 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1297,7 +1297,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 		page = pte_page(pte);
 	}
 
-	if (flags & FOLL_GET)
+	if (flags & (FOLL_GET | FOLL_MLOCK))
 		get_page(page);
 	if (flags & FOLL_TOUCH) {
 		if ((flags & FOLL_WRITE) &&
@@ -1310,6 +1310,31 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 		 */
 		mark_page_accessed(page);
 	}
+	if (flags & FOLL_MLOCK) {
+		pte_unmap_unlock(ptep, ptl);
+		lru_add_drain();  /* push cached pages to LRU */
+		if (page->mapping) {
+			/*
+			 * That preliminary check is mainly to avoid
+			 * the pointless overhead of lock_page on the
+			 * ZERO_PAGE: which might bounce very badly if
+			 * there is contention.  However, we're still
+			 * dirtying its cacheline with get/put_page.
+			 */
+			lock_page(page);
+			/*
+			 * Because we lock page here and migration is
+			 * blocked by the elevated reference, we need
+			 * only check for file-cache page truncation.
+			 */
+			if (page->mapping)
+				mlock_vma_page(page);
+			unlock_page(page);
+		}
+		VM_BUG_ON(flags & FOLL_GET);
+		put_page(page);
+		return page;
+	}
 unlock:
 	pte_unmap_unlock(ptep, ptl);
 out:
diff --git a/mm/mlock.c b/mm/mlock.c
index 8d6b702..0531173 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -159,10 +159,9 @@ static long __mlock_vma_pages_range(struct vm_area_struct *vma,
 {
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long addr = start;
-	struct page *pages[16]; /* 16 gives a reasonable batch */
 	int nr_pages = (end - start) / PAGE_SIZE;
-	int ret = 0;
 	int gup_flags;
+	int ret;
 
 	VM_BUG_ON(start & ~PAGE_MASK);
 	VM_BUG_ON(end   & ~PAGE_MASK);
@@ -170,7 +169,7 @@ static long __mlock_vma_pages_range(struct vm_area_struct *vma,
 	VM_BUG_ON(end   > vma->vm_end);
 	VM_BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
 
-	gup_flags = FOLL_TOUCH | FOLL_GET;
+	gup_flags = FOLL_TOUCH | FOLL_MLOCK;
 	/*
 	 * We want to touch writable mappings with a write fault in order
 	 * to break COW, except for shared mappings because these don't COW
@@ -185,63 +184,9 @@ static long __mlock_vma_pages_range(struct vm_area_struct *vma,
 		nr_pages--;
 	}
 
-	while (nr_pages > 0) {
-		int i;
-
-		cond_resched();
-
-		/*
-		 * get_user_pages makes pages present if we are
-		 * setting mlock. and this extra reference count will
-		 * disable migration of this page.  However, page may
-		 * still be truncated out from under us.
-		 */
-		ret = __get_user_pages(current, mm, addr,
-				min_t(int, nr_pages, ARRAY_SIZE(pages)),
-				gup_flags, pages, NULL);
-		/*
-		 * This can happen for, e.g., VM_NONLINEAR regions before
-		 * a page has been allocated and mapped at a given offset,
-		 * or for addresses that map beyond end of a file.
-		 * We'll mlock the pages if/when they get faulted in.
-		 */
-		if (ret < 0)
-			break;
-
-		lru_add_drain();	/* push cached pages to LRU */
-
-		for (i = 0; i < ret; i++) {
-			struct page *page = pages[i];
-
-			if (page->mapping) {
-				/*
-				 * That preliminary check is mainly to avoid
-				 * the pointless overhead of lock_page on the
-				 * ZERO_PAGE: which might bounce very badly if
-				 * there is contention.  However, we're still
-				 * dirtying its cacheline with get/put_page:
-				 * we'll add another __get_user_pages flag to
-				 * avoid it if that case turns out to matter.
-				 */
-				lock_page(page);
-				/*
-				 * Because we lock page here and migration is
-				 * blocked by the elevated reference, we need
-				 * only check for file-cache page truncation.
-				 */
-				if (page->mapping)
-					mlock_vma_page(page);
-				unlock_page(page);
-			}
-			put_page(page);	/* ref from get_user_pages() */
-		}
-
-		addr += ret * PAGE_SIZE;
-		nr_pages -= ret;
-		ret = 0;
-	}
-
-	return ret;	/* 0 or negative error code */
+	ret = __get_user_pages(current, mm, addr, nr_pages, gup_flags,
+			       NULL, NULL);
+	return max(ret, 0);	/* 0 or negative error code */
 }
 
 /*
-- 
1.7.3.1



* [PATCH 3/6] mm: move VM_LOCKED check to __mlock_vma_pages_range()
  2010-12-03  0:16 ` Michel Lespinasse
@ 2010-12-03  0:16   ` Michel Lespinasse
From: Michel Lespinasse @ 2010-12-03  0:16 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, Hugh Dickins, Rik van Riel,
	Peter Zijlstra, Nick Piggin, KOSAKI Motohiro
  Cc: linux-kernel, Linus Torvalds

Use a single code path for faulting in pages during mlock.

The reason to have it in this patch series is that I did not want to
update both code paths in a later change that releases mmap_sem when
blocking on disk.

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 mm/mlock.c |   18 +++++++++---------
 1 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/mm/mlock.c b/mm/mlock.c
index 0531173..241a5d2 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -169,7 +169,7 @@ static long __mlock_vma_pages_range(struct vm_area_struct *vma,
 	VM_BUG_ON(end   > vma->vm_end);
 	VM_BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
 
-	gup_flags = FOLL_TOUCH | FOLL_MLOCK;
+	gup_flags = FOLL_TOUCH;
 	/*
 	 * We want to touch writable mappings with a write fault in order
 	 * to break COW, except for shared mappings because these don't COW
@@ -178,6 +178,9 @@ static long __mlock_vma_pages_range(struct vm_area_struct *vma,
 	if ((vma->vm_flags & (VM_WRITE | VM_SHARED)) == VM_WRITE)
 		gup_flags |= FOLL_WRITE;
 
+	if (vma->vm_flags & VM_LOCKED)
+		gup_flags |= FOLL_MLOCK;
+
 	/* We don't try to access the guard page of a stack vma */
 	if (stack_guard_page(vma, start)) {
 		addr += PAGE_SIZE;
@@ -456,14 +459,11 @@ static int do_mlock_pages(unsigned long start, size_t len)
 		/*
 		 * Now fault in a range of pages within the first VMA.
 		 */
-		if (vma->vm_flags & VM_LOCKED) {
-			ret = __mlock_vma_pages_range(vma, nstart, nend);
-			if (ret) {
-				ret = __mlock_posix_error_return(ret);
-				break;
-			}
-		} else
-			make_pages_present(nstart, nend);
+		ret = __mlock_vma_pages_range(vma, nstart, nend);
+		if (ret) {
+			ret = __mlock_posix_error_return(ret);
+			break;
+		}
 	}
 	up_read(&mm->mmap_sem);
 	return ret;	/* 0 or negative error code */
-- 
1.7.3.1



* [PATCH 4/6] rwsem: implement rwsem_is_contended()
  2010-12-03  0:16 ` Michel Lespinasse
@ 2010-12-03  0:16   ` Michel Lespinasse
From: Michel Lespinasse @ 2010-12-03  0:16 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, Hugh Dickins, Rik van Riel,
	Peter Zijlstra, Nick Piggin, KOSAKI Motohiro
  Cc: linux-kernel, Linus Torvalds

Trivial implementations for rwsem_is_contended()
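
The intended usage pattern (roughly what the mlock patch later in this
series does with mmap_sem) looks like the sketch below; have_more_work()
and do_a_small_batch() are placeholders, and the check is only advisory:

	down_read(&sem);
	while (have_more_work()) {
		do_a_small_batch();
		if (rwsem_is_contended(&sem)) {
			/* a writer is waiting: back off and let it in */
			up_read(&sem);
			cond_resched();
			down_read(&sem);
			/* ... revalidate state guarded by the lock ... */
		}
	}
	up_read(&sem);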

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 arch/alpha/include/asm/rwsem.h   |    5 +++++
 arch/ia64/include/asm/rwsem.h    |    5 +++++
 arch/powerpc/include/asm/rwsem.h |    5 +++++
 arch/s390/include/asm/rwsem.h    |    5 +++++
 arch/sh/include/asm/rwsem.h      |    5 +++++
 arch/sparc/include/asm/rwsem.h   |    5 +++++
 arch/x86/include/asm/rwsem.h     |    5 +++++
 arch/xtensa/include/asm/rwsem.h  |    5 +++++
 include/linux/rwsem-spinlock.h   |    1 +
 lib/rwsem-spinlock.c             |   12 ++++++++++++
 10 files changed, 53 insertions(+), 0 deletions(-)

diff --git a/arch/alpha/include/asm/rwsem.h b/arch/alpha/include/asm/rwsem.h
index 1570c0b..6183eec 100644
--- a/arch/alpha/include/asm/rwsem.h
+++ b/arch/alpha/include/asm/rwsem.h
@@ -255,5 +255,10 @@ static inline int rwsem_is_locked(struct rw_semaphore *sem)
 	return (sem->count != 0);
 }
 
+static inline int rwsem_is_contended(struct rw_semaphore *sem)
+{
+        return (sem->count < 0);
+}
+
 #endif /* __KERNEL__ */
 #endif /* _ALPHA_RWSEM_H */
diff --git a/arch/ia64/include/asm/rwsem.h b/arch/ia64/include/asm/rwsem.h
index 215d545..e965b7a 100644
--- a/arch/ia64/include/asm/rwsem.h
+++ b/arch/ia64/include/asm/rwsem.h
@@ -179,4 +179,9 @@ static inline int rwsem_is_locked(struct rw_semaphore *sem)
 	return (sem->count != 0);
 }
 
+static inline int rwsem_is_contended(struct rw_semaphore *sem)
+{
+        return (sem->count < 0);
+}
+
 #endif /* _ASM_IA64_RWSEM_H */
diff --git a/arch/powerpc/include/asm/rwsem.h b/arch/powerpc/include/asm/rwsem.h
index 8447d89..69f5d13 100644
--- a/arch/powerpc/include/asm/rwsem.h
+++ b/arch/powerpc/include/asm/rwsem.h
@@ -179,5 +179,10 @@ static inline int rwsem_is_locked(struct rw_semaphore *sem)
 	return sem->count != 0;
 }
 
+static inline int rwsem_is_contended(struct rw_semaphore *sem)
+{
+        return sem->count < 0;
+}
+
 #endif	/* __KERNEL__ */
 #endif	/* _ASM_POWERPC_RWSEM_H */
diff --git a/arch/s390/include/asm/rwsem.h b/arch/s390/include/asm/rwsem.h
index 423fdda..7d36f68 100644
--- a/arch/s390/include/asm/rwsem.h
+++ b/arch/s390/include/asm/rwsem.h
@@ -382,5 +382,10 @@ static inline int rwsem_is_locked(struct rw_semaphore *sem)
 	return (sem->count != 0);
 }
 
+static inline int rwsem_is_contended(struct rw_semaphore *sem)
+{
+        return (sem->count < 0);
+}
+
 #endif /* __KERNEL__ */
 #endif /* _S390_RWSEM_H */
diff --git a/arch/sh/include/asm/rwsem.h b/arch/sh/include/asm/rwsem.h
index 06e2251..1f59516 100644
--- a/arch/sh/include/asm/rwsem.h
+++ b/arch/sh/include/asm/rwsem.h
@@ -184,5 +184,10 @@ static inline int rwsem_is_locked(struct rw_semaphore *sem)
 	return (sem->count != 0);
 }
 
+static inline int rwsem_is_contended(struct rw_semaphore *sem)
+{
+        return (sem->count < 0);
+}
+
 #endif /* __KERNEL__ */
 #endif /* _ASM_SH_RWSEM_H */
diff --git a/arch/sparc/include/asm/rwsem.h b/arch/sparc/include/asm/rwsem.h
index a2b4302..88242e6 100644
--- a/arch/sparc/include/asm/rwsem.h
+++ b/arch/sparc/include/asm/rwsem.h
@@ -165,6 +165,11 @@ static inline int rwsem_is_locked(struct rw_semaphore *sem)
 	return (sem->count != 0);
 }
 
+static inline int rwsem_is_contended(struct rw_semaphore *sem)
+{
+        return (sem->count < 0);
+}
+
 #endif /* __KERNEL__ */
 
 #endif /* _SPARC64_RWSEM_H */
diff --git a/arch/x86/include/asm/rwsem.h b/arch/x86/include/asm/rwsem.h
index d1e41b0..a35521e 100644
--- a/arch/x86/include/asm/rwsem.h
+++ b/arch/x86/include/asm/rwsem.h
@@ -275,5 +275,10 @@ static inline int rwsem_is_locked(struct rw_semaphore *sem)
 	return (sem->count != 0);
 }
 
+static inline int rwsem_is_contended(struct rw_semaphore *sem)
+{
+	return (sem->count < 0);
+}
+
 #endif /* __KERNEL__ */
 #endif /* _ASM_X86_RWSEM_H */
diff --git a/arch/xtensa/include/asm/rwsem.h b/arch/xtensa/include/asm/rwsem.h
index e39edf5..6c658cb 100644
--- a/arch/xtensa/include/asm/rwsem.h
+++ b/arch/xtensa/include/asm/rwsem.h
@@ -165,4 +165,9 @@ static inline int rwsem_is_locked(struct rw_semaphore *sem)
 	return (sem->count != 0);
 }
 
+static inline int rwsem_is_contended(struct rw_semaphore *sem)
+{
+        return (sem->count < 0);
+}
+
 #endif	/* _XTENSA_RWSEM_H */
diff --git a/include/linux/rwsem-spinlock.h b/include/linux/rwsem-spinlock.h
index bdfcc25..430de5b 100644
--- a/include/linux/rwsem-spinlock.h
+++ b/include/linux/rwsem-spinlock.h
@@ -69,6 +69,7 @@ extern void __up_read(struct rw_semaphore *sem);
 extern void __up_write(struct rw_semaphore *sem);
 extern void __downgrade_write(struct rw_semaphore *sem);
 extern int rwsem_is_locked(struct rw_semaphore *sem);
+extern int rwsem_is_contended(struct rw_semaphore *sem);
 
 #endif /* __KERNEL__ */
 #endif /* _LINUX_RWSEM_SPINLOCK_H */
diff --git a/lib/rwsem-spinlock.c b/lib/rwsem-spinlock.c
index ffc9fc7..783753d 100644
--- a/lib/rwsem-spinlock.c
+++ b/lib/rwsem-spinlock.c
@@ -30,6 +30,18 @@ int rwsem_is_locked(struct rw_semaphore *sem)
 }
 EXPORT_SYMBOL(rwsem_is_locked);
 
+int rwsem_is_contended(struct rw_semaphore *sem)
+{
+	int ret = 0;
+	unsigned long flags;
+
+	if (spin_trylock_irqsave(&sem->wait_lock, flags)) {
+		ret = !list_empty(&sem->wait_list);
+		spin_unlock_irqrestore(&sem->wait_lock, flags);
+	}
+	return ret;
+}
+
 /*
  * initialise the semaphore
  */
-- 
1.7.3.1



* [PATCH 5/6] mlock: do not hold mmap_sem for extended periods of time
  2010-12-03  0:16 ` Michel Lespinasse
@ 2010-12-03  0:16   ` Michel Lespinasse
From: Michel Lespinasse @ 2010-12-03  0:16 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, Hugh Dickins, Rik van Riel,
	Peter Zijlstra, Nick Piggin, KOSAKI Motohiro
  Cc: linux-kernel, Linus Torvalds

__get_user_pages gets a new 'nonblocking' parameter to signal that the
caller is prepared to re-acquire mmap_sem and retry the operation if needed.
This is used to break up long operations when they are about to block on
a disk transfer, or when we detect contention on mmap_sem.
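
For callers, the contract is roughly the following (sketch): pass a
pointer to an int initialized to 1; if __get_user_pages() returns with
that int cleared, it has already dropped mmap_sem and the caller must
re-take it (and revalidate the VMA) before touching the mm again:

	int locked = 1;

	down_read(&mm->mmap_sem);
	ret = __get_user_pages(current, mm, start, nr_pages, gup_flags,
			       NULL, NULL, &locked);
	if (!locked) {
		/* mmap_sem was released while blocking on disk or
		 * because a writer was waiting */
		down_read(&mm->mmap_sem);
		/* ... re-lookup the VMA before continuing ... */
	}
	up_read(&mm->mmap_sem);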

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 mm/internal.h |    3 ++-
 mm/memory.c   |   27 ++++++++++++++++++++++-----
 mm/mlock.c    |   34 ++++++++++++++++++++--------------
 mm/nommu.c    |    6 ++++--
 4 files changed, 48 insertions(+), 22 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index dedb0af..bd4f581 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -243,7 +243,8 @@ static inline void mminit_validate_memmodel_limits(unsigned long *start_pfn,
 
 int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		     unsigned long start, int len, unsigned int foll_flags,
-		     struct page **pages, struct vm_area_struct **vmas);
+		     struct page **pages, struct vm_area_struct **vmas,
+		     int *nonblocking);
 
 #define ZONE_RECLAIM_NOSCAN	-2
 #define ZONE_RECLAIM_FULL	-1
diff --git a/mm/memory.c b/mm/memory.c
index f3a9242..85e56dc 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1366,7 +1366,8 @@ no_page_table:
 
 int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		     unsigned long start, int nr_pages, unsigned int gup_flags,
-		     struct page **pages, struct vm_area_struct **vmas)
+		     struct page **pages, struct vm_area_struct **vmas,
+		     int *nonblocking)
 {
 	int i;
 	unsigned long vm_flags;
@@ -1465,11 +1466,15 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 
 			cond_resched();
 			while (!(page = follow_page(vma, start, foll_flags))) {
+				int fault_flags = 0;
 				int ret;
 
+				if (foll_flags & FOLL_WRITE)
+					fault_flags |= FAULT_FLAG_WRITE;
+				if (nonblocking)
+					fault_flags |= FAULT_FLAG_ALLOW_RETRY;
 				ret = handle_mm_fault(mm, vma, start,
-					(foll_flags & FOLL_WRITE) ?
-					FAULT_FLAG_WRITE : 0);
+						      fault_flags);
 
 				if (ret & VM_FAULT_ERROR) {
 					if (ret & VM_FAULT_OOM)
@@ -1485,6 +1490,11 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 				else
 					tsk->min_flt++;
 
+				if (ret & VM_FAULT_RETRY) {
+					*nonblocking = 0;
+					return i;
+				}
+
 				/*
 				 * The VM_FAULT_WRITE bit tells us that
 				 * do_wp_page has broken COW when necessary,
@@ -1516,6 +1526,11 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 			i++;
 			start += PAGE_SIZE;
 			nr_pages--;
+			if (nonblocking && rwsem_is_contended(&mm->mmap_sem)) {
+				up_read(&mm->mmap_sem);
+				*nonblocking = 0;
+				return i;
+			}
 		} while (nr_pages && start < vma->vm_end);
 	} while (nr_pages);
 	return i;
@@ -1584,7 +1599,8 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 	if (force)
 		flags |= FOLL_FORCE;
 
-	return __get_user_pages(tsk, mm, start, nr_pages, flags, pages, vmas);
+	return __get_user_pages(tsk, mm, start, nr_pages, flags, pages, vmas,
+				NULL);
 }
 EXPORT_SYMBOL(get_user_pages);
 
@@ -1609,7 +1625,8 @@ struct page *get_dump_page(unsigned long addr)
 	struct page *page;
 
 	if (__get_user_pages(current, current->mm, addr, 1,
-			FOLL_FORCE | FOLL_DUMP | FOLL_GET, &page, &vma) < 1)
+			     FOLL_FORCE | FOLL_DUMP | FOLL_GET, &page, &vma,
+			     NULL) < 1)
 		return NULL;
 	flush_cache_page(vma, addr, page_to_pfn(page));
 	return page;
diff --git a/mm/mlock.c b/mm/mlock.c
index 241a5d2..569ae6a 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -155,13 +155,13 @@ static inline int stack_guard_page(struct vm_area_struct *vma, unsigned long add
  * vma->vm_mm->mmap_sem must be held for at least read.
  */
 static long __mlock_vma_pages_range(struct vm_area_struct *vma,
-				    unsigned long start, unsigned long end)
+				    unsigned long start, unsigned long end,
+				    int *nonblocking)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long addr = start;
 	int nr_pages = (end - start) / PAGE_SIZE;
 	int gup_flags;
-	int ret;
 
 	VM_BUG_ON(start & ~PAGE_MASK);
 	VM_BUG_ON(end   & ~PAGE_MASK);
@@ -187,9 +187,8 @@ static long __mlock_vma_pages_range(struct vm_area_struct *vma,
 		nr_pages--;
 	}
 
-	ret = __get_user_pages(current, mm, addr, nr_pages, gup_flags,
-			       NULL, NULL);
-	return max(ret, 0);	/* 0 or negative error code */
+	return __get_user_pages(current, mm, addr, nr_pages, gup_flags,
+				NULL, NULL, nonblocking);
 }
 
 /*
@@ -233,7 +232,7 @@ long mlock_vma_pages_range(struct vm_area_struct *vma,
 			is_vm_hugetlb_page(vma) ||
 			vma == get_gate_vma(current))) {
 
-		__mlock_vma_pages_range(vma, start, end);
+		__mlock_vma_pages_range(vma, start, end, NULL);
 
 		/* Hide errors from mmap() and other callers */
 		return 0;
@@ -429,21 +428,23 @@ static int do_mlock_pages(unsigned long start, size_t len)
 	struct mm_struct *mm = current->mm;
 	unsigned long end, nstart, nend;
 	struct vm_area_struct *vma = NULL;
+	int locked = 0;
 	int ret = 0;
 
 	VM_BUG_ON(start & ~PAGE_MASK);
 	VM_BUG_ON(len != PAGE_ALIGN(len));
 	end = start + len;
 
-	down_read(&mm->mmap_sem);
 	for (nstart = start; nstart < end; nstart = nend) {
 		/*
 		 * We want to fault in pages for [nstart; end) address range.
 		 * Find first corresponding VMA.
 		 */
-		if (!vma)
+		if (!locked) {
+			locked = 1;
+			down_read(&mm->mmap_sem);
 			vma = find_vma(mm, nstart);
-		else
+		} else if (nstart >= vma->vm_end)
 			vma = vma->vm_next;
 		if (!vma || vma->vm_start >= end)
 			break;
@@ -456,16 +457,21 @@ static int do_mlock_pages(unsigned long start, size_t len)
 			continue;
 		if (nstart < vma->vm_start)
 			nstart = vma->vm_start;
 		/*
-		 * Now fault in a range of pages within the first VMA.
+		 * Now fault in a range of pages. __mlock_vma_pages_range()
+		 * double checks the vma flags, so that it won't mlock pages
+		 * if the vma was already munlocked.
 		 */
-		ret = __mlock_vma_pages_range(vma, nstart, nend);
-		if (ret) {
+		ret = __mlock_vma_pages_range(vma, nstart, nend, &locked);
+		if (ret < 0) {
 			ret = __mlock_posix_error_return(ret);
 			break;
 		}
+		nend = nstart + ret * PAGE_SIZE;
+		ret = 0;
 	}
-	up_read(&mm->mmap_sem);
+	if (locked)
+		up_read(&mm->mmap_sem);
 	return ret;	/* 0 or negative error code */
 }
 
diff --git a/mm/nommu.c b/mm/nommu.c
index 27a9ac5..c8b8a7e 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -127,7 +127,8 @@ unsigned int kobjsize(const void *objp)
 
 int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		     unsigned long start, int nr_pages, unsigned int foll_flags,
-		     struct page **pages, struct vm_area_struct **vmas)
+		     struct page **pages, struct vm_area_struct **vmas,
+		     int *retry)
 {
 	struct vm_area_struct *vma;
 	unsigned long vm_flags;
@@ -185,7 +186,8 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 	if (force)
 		flags |= FOLL_FORCE;
 
-	return __get_user_pages(tsk, mm, start, nr_pages, flags, pages, vmas);
+	return __get_user_pages(tsk, mm, start, nr_pages, flags, pages, vmas,
+				NULL);
 }
 EXPORT_SYMBOL(get_user_pages);
 
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 5/6] mlock: do not hold mmap_sem for extended periods of time
@ 2010-12-03  0:16   ` Michel Lespinasse
  0 siblings, 0 replies; 44+ messages in thread
From: Michel Lespinasse @ 2010-12-03  0:16 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, Hugh Dickins, Rik van Riel,
	Peter Zijlstra, Nick Piggin, KOSAKI Motohiro
  Cc: linux-kernel, Linus Torvalds

__get_user_pages gets a new 'nonblocking' parameter to signal that the
caller is prepared to re-acquire mmap_sem and retry the operation if needed.
This is used to split off long operations if they are going to block on
a disk transfer, or when we detect contention on the mmap_sem.

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 mm/internal.h |    3 ++-
 mm/memory.c   |   27 ++++++++++++++++++++++-----
 mm/mlock.c    |   34 ++++++++++++++++++++--------------
 mm/nommu.c    |    6 ++++--
 4 files changed, 48 insertions(+), 22 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index dedb0af..bd4f581 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -243,7 +243,8 @@ static inline void mminit_validate_memmodel_limits(unsigned long *start_pfn,
 
 int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		     unsigned long start, int len, unsigned int foll_flags,
-		     struct page **pages, struct vm_area_struct **vmas);
+		     struct page **pages, struct vm_area_struct **vmas,
+		     int *nonblocking);
 
 #define ZONE_RECLAIM_NOSCAN	-2
 #define ZONE_RECLAIM_FULL	-1
diff --git a/mm/memory.c b/mm/memory.c
index f3a9242..85e56dc 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1366,7 +1366,8 @@ no_page_table:
 
 int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		     unsigned long start, int nr_pages, unsigned int gup_flags,
-		     struct page **pages, struct vm_area_struct **vmas)
+		     struct page **pages, struct vm_area_struct **vmas,
+		     int *nonblocking)
 {
 	int i;
 	unsigned long vm_flags;
@@ -1465,11 +1466,15 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 
 			cond_resched();
 			while (!(page = follow_page(vma, start, foll_flags))) {
+				int fault_flags = 0;
 				int ret;
 
+				if (foll_flags & FOLL_WRITE)
+					fault_flags |= FAULT_FLAG_WRITE;
+				if (nonblocking)
+					fault_flags |= FAULT_FLAG_ALLOW_RETRY;
 				ret = handle_mm_fault(mm, vma, start,
-					(foll_flags & FOLL_WRITE) ?
-					FAULT_FLAG_WRITE : 0);
+						      fault_flags);
 
 				if (ret & VM_FAULT_ERROR) {
 					if (ret & VM_FAULT_OOM)
@@ -1485,6 +1490,11 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 				else
 					tsk->min_flt++;
 
+				if (ret & VM_FAULT_RETRY) {
+					*nonblocking = 0;
+					return i;
+				}
+
 				/*
 				 * The VM_FAULT_WRITE bit tells us that
 				 * do_wp_page has broken COW when necessary,
@@ -1516,6 +1526,11 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 			i++;
 			start += PAGE_SIZE;
 			nr_pages--;
+			if (nonblocking && rwsem_is_contended(&mm->mmap_sem)) {
+				up_read(&mm->mmap_sem);
+				*nonblocking = 0;
+				return i;
+			}
 		} while (nr_pages && start < vma->vm_end);
 	} while (nr_pages);
 	return i;
@@ -1584,7 +1599,8 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 	if (force)
 		flags |= FOLL_FORCE;
 
-	return __get_user_pages(tsk, mm, start, nr_pages, flags, pages, vmas);
+	return __get_user_pages(tsk, mm, start, nr_pages, flags, pages, vmas,
+				NULL);
 }
 EXPORT_SYMBOL(get_user_pages);
 
@@ -1609,7 +1625,8 @@ struct page *get_dump_page(unsigned long addr)
 	struct page *page;
 
 	if (__get_user_pages(current, current->mm, addr, 1,
-			FOLL_FORCE | FOLL_DUMP | FOLL_GET, &page, &vma) < 1)
+			     FOLL_FORCE | FOLL_DUMP | FOLL_GET, &page, &vma,
+			     NULL) < 1)
 		return NULL;
 	flush_cache_page(vma, addr, page_to_pfn(page));
 	return page;
diff --git a/mm/mlock.c b/mm/mlock.c
index 241a5d2..569ae6a 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -155,13 +155,13 @@ static inline int stack_guard_page(struct vm_area_struct *vma, unsigned long add
  * vma->vm_mm->mmap_sem must be held for at least read.
  */
 static long __mlock_vma_pages_range(struct vm_area_struct *vma,
-				    unsigned long start, unsigned long end)
+				    unsigned long start, unsigned long end,
+				    int *nonblocking)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long addr = start;
 	int nr_pages = (end - start) / PAGE_SIZE;
 	int gup_flags;
-	int ret;
 
 	VM_BUG_ON(start & ~PAGE_MASK);
 	VM_BUG_ON(end   & ~PAGE_MASK);
@@ -187,9 +187,8 @@ static long __mlock_vma_pages_range(struct vm_area_struct *vma,
 		nr_pages--;
 	}
 
-	ret = __get_user_pages(current, mm, addr, nr_pages, gup_flags,
-			       NULL, NULL);
-	return max(ret, 0);	/* 0 or negative error code */
+	return __get_user_pages(current, mm, addr, nr_pages, gup_flags,
+				NULL, NULL, nonblocking);
 }
 
 /*
@@ -233,7 +232,7 @@ long mlock_vma_pages_range(struct vm_area_struct *vma,
 			is_vm_hugetlb_page(vma) ||
 			vma == get_gate_vma(current))) {
 
-		__mlock_vma_pages_range(vma, start, end);
+		__mlock_vma_pages_range(vma, start, end, NULL);
 
 		/* Hide errors from mmap() and other callers */
 		return 0;
@@ -429,21 +428,23 @@ static int do_mlock_pages(unsigned long start, size_t len)
 	struct mm_struct *mm = current->mm;
 	unsigned long end, nstart, nend;
 	struct vm_area_struct *vma = NULL;
+	int locked = 0;
 	int ret = 0;
 
 	VM_BUG_ON(start & ~PAGE_MASK);
 	VM_BUG_ON(len != PAGE_ALIGN(len));
 	end = start + len;
 
-	down_read(&mm->mmap_sem);
 	for (nstart = start; nstart < end; nstart = nend) {
 		/*
 		 * We want to fault in pages for [nstart; end) address range.
 		 * Find first corresponding VMA.
 		 */
-		if (!vma)
+		if (!locked) {
+			locked = 1;
+			down_read(&mm->mmap_sem);
 			vma = find_vma(mm, nstart);
-		else
+		} else if (nstart >= vma->vm_end)
 			vma = vma->vm_next;
 		if (!vma || vma->vm_start >= end)
 			break;
@@ -456,16 +457,21 @@ static int do_mlock_pages(unsigned long start, size_t len)
 			continue;
 		if (nstart < vma->vm_start)
 			nstart = vma->vm_start;
 		/*
-		 * Now fault in a range of pages within the first VMA.
+		 * Now fault in a range of pages. __mlock_vma_pages_range()
+		 * double checks the vma flags, so that it won't mlock pages
+		 * if the vma was already munlocked.
 		 */
-		ret = __mlock_vma_pages_range(vma, nstart, nend);
-		if (ret) {
+		ret = __mlock_vma_pages_range(vma, nstart, nend, &locked);
+		if (ret < 0) {
 			ret = __mlock_posix_error_return(ret);
 			break;
 		}
+		nend = nstart + ret * PAGE_SIZE;
+		ret = 0;
 	}
-	up_read(&mm->mmap_sem);
+	if (locked)
+		up_read(&mm->mmap_sem);
 	return ret;	/* 0 or negative error code */
 }
 
diff --git a/mm/nommu.c b/mm/nommu.c
index 27a9ac5..c8b8a7e 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -127,7 +127,8 @@ unsigned int kobjsize(const void *objp)
 
 int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		     unsigned long start, int nr_pages, unsigned int foll_flags,
-		     struct page **pages, struct vm_area_struct **vmas)
+		     struct page **pages, struct vm_area_struct **vmas,
+		     int *retry)
 {
 	struct vm_area_struct *vma;
 	unsigned long vm_flags;
@@ -185,7 +186,8 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 	if (force)
 		flags |= FOLL_FORCE;
 
-	return __get_user_pages(tsk, mm, start, nr_pages, flags, pages, vmas);
+	return __get_user_pages(tsk, mm, start, nr_pages, flags, pages, vmas,
+				NULL);
 }
 EXPORT_SYMBOL(get_user_pages);
 
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 6/6] x86 rwsem: more precise rwsem_is_contended() implementation
  2010-12-03  0:16 ` Michel Lespinasse
@ 2010-12-03  0:16   ` Michel Lespinasse
  -1 siblings, 0 replies; 44+ messages in thread
From: Michel Lespinasse @ 2010-12-03  0:16 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, Hugh Dickins, Rik van Riel,
	Peter Zijlstra, Nick Piggin, KOSAKI Motohiro
  Cc: linux-kernel, Linus Torvalds

We would like rwsem_is_contended() to return true only once a contending
writer has had a chance to insert itself onto the rwsem wait queue.
To that end, we need to differentiate between active and queued writers.

A new property is introduced: RWSEM_ACTIVE_WRITE_BIAS is set to be
'more negative' than RWSEM_WAITING_BIAS. RWSEM_WAITING_MASK designates
a bit in the rwsem count that will be set only when RWSEM_WAITING_BIAS
is in effect.

The basic properties that have been true so far also still hold:
- RWSEM_ACTIVE_READ_BIAS  & RWSEM_ACTIVE_MASK == 1
- RWSEM_ACTIVE_WRITE_BIAS & RWSEM_ACTIVE_MASK == 1
- RWSEM_WAITING_BIAS      & RWSEM_ACTIVE_MASK == 0
- RWSEM_ACTIVE_WRITE_BIAS < 0 and RWSEM_WAITING_BIAS < 0

In addition, the rwsem count will be < RWSEM_WAITING_BIAS only if there
are any active writers (though we don't make use of this property so far).
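
For illustration only (not part of the patch), the properties above can
be spot-checked from user space against the 64-bit constants introduced
below; this assumes a 64-bit long, i.e. the CONFIG_X86_64 case:

#include <assert.h>
#include <stdio.h>

/* signed-value spellings of the CONFIG_X86_64 constants from this patch */
#define RWSEM_ACTIVE_MASK	0x7fffffffL
#define RWSEM_WAITING_MASK	0x80000000L
#define RWSEM_ACTIVE_READ_BIAS	0x00000001L
#define RWSEM_WAITING_BIAS	(-0x80000000L)	/* 0xffffffff80000000 */
#define RWSEM_ACTIVE_WRITE_BIAS	(-0xffffffffL)	/* 0xffffffff00000001 */

int main(void)
{
	long count;

	assert((RWSEM_ACTIVE_READ_BIAS  & RWSEM_ACTIVE_MASK) == 1);
	assert((RWSEM_ACTIVE_WRITE_BIAS & RWSEM_ACTIVE_MASK) == 1);
	assert((RWSEM_WAITING_BIAS      & RWSEM_ACTIVE_MASK) == 0);
	assert(RWSEM_ACTIVE_WRITE_BIAS < 0 && RWSEM_WAITING_BIAS < 0);
	/* 'more negative': the write bias sorts below the waiting bias */
	assert(RWSEM_ACTIVE_WRITE_BIAS < RWSEM_WAITING_BIAS);

	/* two active readers plus one queued waiter: waiting bit set */
	count = 2 * RWSEM_ACTIVE_READ_BIAS + RWSEM_WAITING_BIAS;
	assert((count & RWSEM_WAITING_MASK) != 0);

	/* two active readers, nobody queued: waiting bit clear */
	count = 2 * RWSEM_ACTIVE_READ_BIAS;
	assert((count & RWSEM_WAITING_MASK) == 0);

	printf("all bias properties hold\n");
	return 0;
}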

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 arch/x86/include/asm/rwsem.h |   32 +++++++++++++++++++-------------
 arch/x86/lib/rwsem_64.S      |    4 ++--
 arch/x86/lib/semaphore_32.S  |    4 ++--
 3 files changed, 23 insertions(+), 17 deletions(-)

diff --git a/arch/x86/include/asm/rwsem.h b/arch/x86/include/asm/rwsem.h
index a35521e..1ce7759 100644
--- a/arch/x86/include/asm/rwsem.h
+++ b/arch/x86/include/asm/rwsem.h
@@ -16,11 +16,10 @@
  * if there are writers (and maybe) readers waiting (in which case it goes to
  * sleep).
  *
- * The value of WAITING_BIAS supports up to 32766 waiting processes. This can
- * be extended to 65534 by manually checking the whole MSW rather than relying
- * on the S flag.
+ * The WRITE_BIAS value supports up to 32767 processes simultaneously
+ * trying to acquire a write lock.
  *
- * The value of ACTIVE_BIAS supports up to 65535 active processes.
+ * The value of ACTIVE_MASK supports up to 32767 active processes.
  *
  * This should be totally fair - if anything is waiting, a process that wants a
  * lock will go to the back of the queue. When the currently active lock is
@@ -62,17 +61,23 @@ extern asmregparm struct rw_semaphore *
  * for 64 bits.
  */
 
+
 #ifdef CONFIG_X86_64
-# define RWSEM_ACTIVE_MASK		0xffffffffL
+# define RWSEM_UNLOCKED_VALUE		0x0000000000000000L
+# define RWSEM_ACTIVE_MASK		0x000000007fffffffL
+# define RWSEM_ACTIVE_READ_BIAS		0x0000000000000001L
+# define RWSEM_ACTIVE_WRITE_BIAS	0xffffffff00000001L
+# define RWSEM_WAITING_BIAS		0xffffffff80000000L
+# define RWSEM_WAITING_MASK		0x0000000080000000L
 #else
-# define RWSEM_ACTIVE_MASK		0x0000ffffL
+# define RWSEM_UNLOCKED_VALUE		0x00000000L
+# define RWSEM_ACTIVE_MASK		0x00007fffL
+# define RWSEM_ACTIVE_READ_BIAS		0x00000001L
+# define RWSEM_ACTIVE_WRITE_BIAS	0xffff0001L
+# define RWSEM_WAITING_BIAS		0xffff8000L
+# define RWSEM_WAITING_MASK		0x00008000L
 #endif
 
-#define RWSEM_UNLOCKED_VALUE		0x00000000L
-#define RWSEM_ACTIVE_BIAS		0x00000001L
-#define RWSEM_WAITING_BIAS		(-RWSEM_ACTIVE_MASK-1)
-#define RWSEM_ACTIVE_READ_BIAS		RWSEM_ACTIVE_BIAS
-#define RWSEM_ACTIVE_WRITE_BIAS		(RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
 
 typedef signed long rwsem_count_t;
 
@@ -240,7 +245,8 @@ static inline void __downgrade_write(struct rw_semaphore *sem)
 		     "1:\n\t"
 		     "# ending __downgrade_write\n"
 		     : "+m" (sem->count)
-		     : "a" (sem), "er" (-RWSEM_WAITING_BIAS)
+		     : "a" (sem),
+		       "er" (RWSEM_ACTIVE_READ_BIAS - RWSEM_ACTIVE_WRITE_BIAS)
 		     : "memory", "cc");
 }
 
@@ -277,7 +283,7 @@ static inline int rwsem_is_locked(struct rw_semaphore *sem)
 
 static inline int rwsem_is_contended(struct rw_semaphore *sem)
 {
-	return (sem->count < 0);
+	return (sem->count & RWSEM_WAITING_MASK) != 0;
 }
 
 #endif /* __KERNEL__ */
diff --git a/arch/x86/lib/rwsem_64.S b/arch/x86/lib/rwsem_64.S
index 41fcf00..35b797e 100644
--- a/arch/x86/lib/rwsem_64.S
+++ b/arch/x86/lib/rwsem_64.S
@@ -60,8 +60,8 @@ ENTRY(call_rwsem_down_write_failed)
 	ENDPROC(call_rwsem_down_write_failed)
 
 ENTRY(call_rwsem_wake)
-	decl %edx	/* do nothing if still outstanding active readers */
-	jnz 1f
+	cmpl $0x80000001, %edx
+	jne 1f	/* do nothing unless there are waiters and no active threads */
 	save_common_regs
 	movq %rax,%rdi
 	call rwsem_wake
diff --git a/arch/x86/lib/semaphore_32.S b/arch/x86/lib/semaphore_32.S
index 648fe47..256fa7d 100644
--- a/arch/x86/lib/semaphore_32.S
+++ b/arch/x86/lib/semaphore_32.S
@@ -103,8 +103,8 @@ ENTRY(call_rwsem_down_write_failed)
 
 ENTRY(call_rwsem_wake)
 	CFI_STARTPROC
-	decw %dx    /* do nothing if still outstanding active readers */
-	jnz 1f
+	cmpw $0x8001, %dx
+	jne 1f	/* do nothing unless there are waiters and no active threads */
 	push %ecx
 	CFI_ADJUST_CFA_OFFSET 4
 	CFI_REL_OFFSET ecx,0
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH 6/6] x86 rwsem: more precise rwsem_is_contended() implementation
  2010-12-03  0:16   ` Michel Lespinasse
@ 2010-12-03 22:41     ` Peter Zijlstra
  -1 siblings, 0 replies; 44+ messages in thread
From: Peter Zijlstra @ 2010-12-03 22:41 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: linux-mm, Andrew Morton, Hugh Dickins, Rik van Riel, Nick Piggin,
	KOSAKI Motohiro, linux-kernel, Linus Torvalds

On Thu, 2010-12-02 at 16:16 -0800, Michel Lespinasse wrote:
> We would like rwsem_is_contended() to return true only once a contending
> writer has had a chance to insert itself onto the rwsem wait queue.
> To that end, we need to differentiate between active and queued writers.

So you're only considering writer-writer contention? Not writer-reader
and reader-writer contention?

I'd argue rwsem_is_contended() should return true if there is _any_
blocked task, be it a read or a writer.

If you want something else call it something else, like
rwsem_is_write_contended() (there's a pending writer), or
rwsem_is_read_contended() (there's a pending reader).
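
As a sketch only of what the write-side variant could look like -
written against the generic rwsem implementation, assuming its internal
rwsem_waiter bookkeeping (wait_list plus a waiting-for-write flag) were
visible here; no such helper exists in the posted series:

/*
 * Hypothetical rwsem_is_write_contended(), following the naming
 * suggested above.  Assumes struct rwsem_waiter and its
 * RWSEM_WAITING_FOR_WRITE flag (currently private to lib/rwsem.c)
 * are made visible to this code.
 */
static inline int rwsem_is_write_contended(struct rw_semaphore *sem)
{
	struct rwsem_waiter *waiter;
	unsigned long flags;
	int ret = 0;

	spin_lock_irqsave(&sem->wait_lock, flags);
	list_for_each_entry(waiter, &sem->wait_list, list) {
		if (waiter->flags & RWSEM_WAITING_FOR_WRITE) {
			ret = 1;
			break;
		}
	}
	spin_unlock_irqrestore(&sem->wait_lock, flags);

	return ret;
}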



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 6/6] x86 rwsem: more precise rwsem_is_contended() implementation
  2010-12-03 22:41     ` Peter Zijlstra
@ 2010-12-03 22:51       ` Michel Lespinasse
  -1 siblings, 0 replies; 44+ messages in thread
From: Michel Lespinasse @ 2010-12-03 22:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm, Andrew Morton, Hugh Dickins, Rik van Riel, Nick Piggin,
	KOSAKI Motohiro, linux-kernel, Linus Torvalds

On Fri, Dec 3, 2010 at 2:41 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, 2010-12-02 at 16:16 -0800, Michel Lespinasse wrote:
>> We would like rwsem_is_contended() to return true only once a contending
>> writer has had a chance to insert itself onto the rwsem wait queue.
>> To that end, we need to differentiate between active and queued writers.
>
> So you're only considering writer-writer contention? Not writer-reader
> and reader-writer contention?
>
> I'd argue rwsem_is_contended() should return true if there is _any_
> blocked task, be it a read or a writer.
>
> If you want something else call it something else, like
> rwsem_is_write_contended() (there's a pending writer), or
> rwsem_is_read_contended() (there's a pending reader).

I said 'a contending writer' because in my use case, the rwsem was
being held for read, and other readers would get admitted in (so they
don't count as contention).

The code as written will work in the more general case - it tells if
there is another thread queued waiting for the rwsem to be released.
It can't tell you, however, whether that thread is a reader or a writer (but
if mmap_sem is already held for read, then the contending thread must
be a writer).
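
For context, the intended caller pattern (roughly what the mlock
patches in this series do) is sketched below; more_work_to_do(),
do_one_batch() and revalidate_vmas() are placeholders, not real
functions:

	/*
	 * Sketch of the intended use: hold mmap_sem for read, and back
	 * off as soon as another thread has queued itself on it.
	 */
	down_read(&mm->mmap_sem);
	while (more_work_to_do()) {
		do_one_batch();		/* e.g. fault in a few pages */
		if (rwsem_is_contended(&mm->mmap_sem)) {
			up_read(&mm->mmap_sem);	/* let the waiter in */
			cond_resched();
			down_read(&mm->mmap_sem);
			revalidate_vmas();	/* things may have changed */
		}
	}
	up_read(&mm->mmap_sem);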

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 0/6] mlock: do not hold mmap_sem for extended periods of time
  2010-12-03  0:16 ` Michel Lespinasse
@ 2010-12-03 23:02   ` Michel Lespinasse
  -1 siblings, 0 replies; 44+ messages in thread
From: Michel Lespinasse @ 2010-12-03 23:02 UTC (permalink / raw)
  To: David Howells; +Cc: linux-mm, linux-kernel

Hi David,

I forgot to add you on the original submission, but I think you'd be
best qualified to look at the two patches implementing
rwsem_is_contended()...

On Thu, Dec 2, 2010 at 4:16 PM, Michel Lespinasse <walken@google.com> wrote:
> Currently mlock() holds mmap_sem in exclusive mode while the pages get
> faulted in. In the case of a large mlock, this can potentially take a
> very long time, during which various commands such as 'ps auxw' will
> block. This makes sysadmins unhappy:
>
> real    14m36.232s
> user    0m0.003s
> sys     0m0.015s
> (output from 'time ps auxw' while a 20GB file was being mlocked without
> being previously preloaded into page cache)
>
> I propose that mlock() could release mmap_sem after the VM_LOCKED bits
> have been set in all appropriate VMAs. Then a second pass could be done
> to actually mlock the pages, in small batches, releasing mmap_sem when
> we block on disk access or when we detect some contention.
>
> Patches are against v2.6.37-rc4 plus my patches to avoid mlock dirtying
> (presently queued in -mm).
>
> Michel Lespinasse (6):
>  mlock: only hold mmap_sem in shared mode when faulting in pages
>  mm: add FOLL_MLOCK follow_page flag.
>  mm: move VM_LOCKED check to __mlock_vma_pages_range()
>  rwsem: implement rwsem_is_contended()
>  mlock: do not hold mmap_sem for extended periods of time
>  x86 rwsem: more precise rwsem_is_contended() implementation

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/6] mm: add FOLL_MLOCK follow_page flag.
  2010-12-03  0:16   ` Michel Lespinasse
@ 2010-12-04  6:55     ` Michel Lespinasse
  -1 siblings, 0 replies; 44+ messages in thread
From: Michel Lespinasse @ 2010-12-04  6:55 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, Hugh Dickins, Rik van Riel,
	Peter Zijlstra, Nick Piggin, KOSAKI Motohiro
  Cc: linux-kernel, Linus Torvalds

I had not realized this previously, but it turns out to be possible to
manipulate the LRU lists under control of the pte lock, rather than having
to get/put the page. This allows the FOLL_MLOCK follow_page flag
implementation to look much nicer than it did in the initial proposal.

Please consider the following as a replacement for the initial
patch 2/6 proposal...

--------------------------------- 8< --------------------------------------

Move the code to mlock pages from __mlock_vma_pages_range()
to follow_page().

This allows __mlock_vma_pages_range() to not have to break down work
into 16-page batches.

An additional motivation for doing this within the present patch series
is that it'll make it easier for a later change to drop mmap_sem when
blocking on disk (we'd like to be able to resume at the page that was
read from disk instead of at the start of a 16-page batch).

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 include/linux/mm.h |    1 +
 mm/memory.c        |   22 +++++++++++++++++
 mm/mlock.c         |   65 ++++------------------------------------------------
 3 files changed, 28 insertions(+), 60 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 721f451..cebbb0d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1415,6 +1415,7 @@ struct page *follow_page(struct vm_area_struct *, unsigned long address,
 #define FOLL_GET	0x04	/* do get_page on page */
 #define FOLL_DUMP	0x08	/* give error on hole if it would be zero */
 #define FOLL_FORCE	0x10	/* get_user_pages read/write w/o permission */
+#define FOLL_MLOCK	0x20	/* mark page as mlocked */
 
 typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
 			void *data);
diff --git a/mm/memory.c b/mm/memory.c
index b8f97b8..15e1f19 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1310,6 +1310,28 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 		 */
 		mark_page_accessed(page);
 	}
+	if (flags & FOLL_MLOCK) {
+		/*
+		 * The preliminary mapping check is mainly to avoid the
+		 * pointless overhead of lock_page on the ZERO_PAGE
+		 * which might bounce very badly if there is contention.
+		 *
+		 * If the page is already locked, we don't need to
+		 * handle it now - vmscan will handle it later if and
+		 * when it attempts to reclaim the page.
+		 */
+		if (page->mapping && trylock_page(page)) {
+			lru_add_drain();  /* push cached pages to LRU */
+			/*
+			 * Because we lock page here and migration is
+			 * blocked by the pte's page reference, we need
+			 * only check for file-cache page truncation.
+			 */
+			if (page->mapping)
+				mlock_vma_page(page);
+			unlock_page(page);
+		}
+	}
 unlock:
 	pte_unmap_unlock(ptep, ptl);
 out:
diff --git a/mm/mlock.c b/mm/mlock.c
index 71db83e..573bd2c 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -159,10 +159,9 @@ static long __mlock_vma_pages_range(struct vm_area_struct *vma,
 {
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long addr = start;
-	struct page *pages[16]; /* 16 gives a reasonable batch */
 	int nr_pages = (end - start) / PAGE_SIZE;
-	int ret = 0;
 	int gup_flags;
+	int ret;
 
 	VM_BUG_ON(start & ~PAGE_MASK);
 	VM_BUG_ON(end   & ~PAGE_MASK);
@@ -170,7 +169,7 @@ static long __mlock_vma_pages_range(struct vm_area_struct *vma,
 	VM_BUG_ON(end   > vma->vm_end);
 	VM_BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
 
-	gup_flags = FOLL_TOUCH | FOLL_GET;
+	gup_flags = FOLL_TOUCH | FOLL_MLOCK;
 	/*
 	 * We want to touch writable mappings with a write fault in order
 	 * to break COW, except for shared mappings because these don't COW
@@ -185,63 +184,9 @@ static long __mlock_vma_pages_range(struct vm_area_struct *vma,
 		nr_pages--;
 	}
 
-	while (nr_pages > 0) {
-		int i;
-
-		cond_resched();
-
-		/*
-		 * get_user_pages makes pages present if we are
-		 * setting mlock. and this extra reference count will
-		 * disable migration of this page.  However, page may
-		 * still be truncated out from under us.
-		 */
-		ret = __get_user_pages(current, mm, addr,
-				min_t(int, nr_pages, ARRAY_SIZE(pages)),
-				gup_flags, pages, NULL);
-		/*
-		 * This can happen for, e.g., VM_NONLINEAR regions before
-		 * a page has been allocated and mapped at a given offset,
-		 * or for addresses that map beyond end of a file.
-		 * We'll mlock the pages if/when they get faulted in.
-		 */
-		if (ret < 0)
-			break;
-
-		lru_add_drain();	/* push cached pages to LRU */
-
-		for (i = 0; i < ret; i++) {
-			struct page *page = pages[i];
-
-			if (page->mapping) {
-				/*
-				 * That preliminary check is mainly to avoid
-				 * the pointless overhead of lock_page on the
-				 * ZERO_PAGE: which might bounce very badly if
-				 * there is contention.  However, we're still
-				 * dirtying its cacheline with get/put_page:
-				 * we'll add another __get_user_pages flag to
-				 * avoid it if that case turns out to matter.
-				 */
-				lock_page(page);
-				/*
-				 * Because we lock page here and migration is
-				 * blocked by the elevated reference, we need
-				 * only check for file-cache page truncation.
-				 */
-				if (page->mapping)
-					mlock_vma_page(page);
-				unlock_page(page);
-			}
-			put_page(page);	/* ref from get_user_pages() */
-		}
-
-		addr += ret * PAGE_SIZE;
-		nr_pages -= ret;
-		ret = 0;
-	}
-
-	return ret;	/* 0 or negative error code */
+	ret = __get_user_pages(current, mm, addr, nr_pages, gup_flags,
+			       NULL, NULL);
+	return max(ret, 0);	/* 0 or negative error code */
 }
 
 /*
-- 
1.7.3.1



-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH 1/6] mlock: only hold mmap_sem in shared mode when faulting in pages
  2010-12-03  0:16   ` Michel Lespinasse
@ 2010-12-08 23:27     ` Andrew Morton
  -1 siblings, 0 replies; 44+ messages in thread
From: Andrew Morton @ 2010-12-08 23:27 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: linux-mm, Hugh Dickins, Rik van Riel, Peter Zijlstra,
	Nick Piggin, KOSAKI Motohiro, linux-kernel, Linus Torvalds

> Currently mlock() holds mmap_sem in exclusive mode while the pages get
> faulted in. In the case of a large mlock, this can potentially take a
> very long time, during which various commands such as 'ps auxw' will
> block. This makes sysadmins unhappy:
>
> real    14m36.232s
> user    0m0.003s
> sys     0m0.015s
>(output from 'time ps auxw' while a 20GB file was being mlocked without
> being previously preloaded into page cache)

The kernel holds down_write(mmap_sem) for 14m36s?

geeze you guys are picky - that's less than a quarter hour!

On Thu,  2 Dec 2010 16:16:47 -0800
Michel Lespinasse <walken@google.com> wrote:

> Before this change, mlock() holds mmap_sem in exclusive mode while the
> pages get faulted in. In the case of a large mlock, this can potentially
> take a very long time. Various things will block while mmap_sem is held,
> including 'ps auxw'. This can make sysadmins angry.
> 
> I propose that mlock() could release mmap_sem after the VM_LOCKED bits
> have been set in all appropriate VMAs. Then a second pass could be done
> to actually mlock the pages with mmap_sem held for reads only. We need
> to recheck the vma flags after we re-acquire mmap_sem, but this is easy.
> 
> In the case where a vma has been munlocked before mlock completes,
> pages that were already marked as PageMlocked() are handled by the
> munlock() call, and mlock() is careful to not mark new page batches
> as PageMlocked() after the munlock() call has cleared the VM_LOCKED
> vma flags. So, the end result will be identical to what'd happen if
> munlock() had executed after the mlock() call.
> 
> In a later change, I will allow the second pass to release mmap_sem when
> blocking on disk accesses or when it is otherwise contended, so that
> it won't be held for long periods of time even in shared mode.
> 
> ...
>
> +static int do_mlock_pages(unsigned long start, size_t len)
> +{
> +	struct mm_struct *mm = current->mm;
> +	unsigned long end, nstart, nend;
> +	struct vm_area_struct *vma = NULL;
> +	int ret = 0;
> +
> +	VM_BUG_ON(start & ~PAGE_MASK);
> +	VM_BUG_ON(len != PAGE_ALIGN(len));
> +	end = start + len;
> +
> +	down_read(&mm->mmap_sem);
> +	for (nstart = start; nstart < end; nstart = nend) {
> +		/*
> +		 * We want to fault in pages for [nstart; end) address range.
> +		 * Find first corresponding VMA.
> +		 */
> +		if (!vma)
> +			vma = find_vma(mm, nstart);
> +		else
> +			vma = vma->vm_next;
> +		if (!vma || vma->vm_start >= end)
> +			break;
> +		/*
> +		 * Set [nstart; nend) to intersection of desired address
> +		 * range with the first VMA. Also, skip undesirable VMA types.
> +		 */
> +		nend = min(end, vma->vm_end);
> +		if (vma->vm_flags & (VM_IO | VM_PFNMAP))
> +			continue;
> +		if (nstart < vma->vm_start)
> +			nstart = vma->vm_start;
> +		/*
> +		 * Now fault in a range of pages within the first VMA.
> +		 */
> +		if (vma->vm_flags & VM_LOCKED) {
> +			ret = __mlock_vma_pages_range(vma, nstart, nend);
> +			if (ret) {
> +				ret = __mlock_posix_error_return(ret);
> +				break;
> +			}
> +		} else
> +			make_pages_present(nstart, nend);
> +	}
> +	up_read(&mm->mmap_sem);
> +	return ret;	/* 0 or negative error code */
> +}

Am I correct in believing that we'll still hold down_read(mmap_sem) for
a quarter hour?

If so, that's still pretty obnoxious behaviour - presumably there are
workloads which will hurt like hell from that.

We don't need to hold mmap_sem at all while faulting in those pages,
do we?  We could just do

	for (addr = start; addr < end; addr += PAGE_SIZE)
		get_user(x, addr);

and voila.  If the pages are in cache and the ptes are set up then that
will be *vastly* faster than the proposed code.  If the get_user()
takes a minor fault then it'll be slower.  If it's a major fault then
the difference probably doesn't matter much.
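
(For what it's worth, a purely user-space analogy of the same
prefaulting idea - illustrative only, since unlike the kernel-side loop
it cannot mark pages PageMlocked or skip VM_IO/VM_PFNMAP vmas:)

#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	size_t len = 64UL << 20;	/* 64MB demo mapping */
	size_t off;
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	for (off = 0; off < len; off += page)
		p[off] = 0;	/* fault the page in ahead of time */
	if (mlock(p, len))	/* cheap now: pages are already present */
		perror("mlock");
	return 0;
}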

But whatever.  Is this patchset a half-fix, and should we rather be
looking for a full-fix?


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/6] mlock: do not hold mmap_sem for extended periods of time
  2010-12-03  0:16   ` Michel Lespinasse
@ 2010-12-08 23:42     ` Andrew Morton
  -1 siblings, 0 replies; 44+ messages in thread
From: Andrew Morton @ 2010-12-08 23:42 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: linux-mm, Hugh Dickins, Rik van Riel, Peter Zijlstra,
	Nick Piggin, KOSAKI Motohiro, linux-kernel, Linus Torvalds,
	Gleb Natapov

On Thu,  2 Dec 2010 16:16:51 -0800
Michel Lespinasse <walken@google.com> wrote:

> __get_user_pages gets a new 'nonblocking' parameter to signal that the
> caller is prepared to re-acquire mmap_sem and retry the operation if needed.
> This is used to split off long operations if they are going to block on
> a disk transfer, or when we detect contention on the mmap_sem.

Doesn't apply to linux-next because the KVM guys snuck in a new
FAULT_FLAG_MINOR (who knew?).  With a bonus, undocumented,
exported-to-modules get_user_pages_noio().

I liked your code better so I munged __get_user_pages() together thusly:


			cond_resched();
			while (!(page = follow_page(vma, start, foll_flags))) {
				int ret;
				unsigned int fault_flags = 0;

				if (foll_flags & FOLL_WRITE)
					fault_flags |= FAULT_FLAG_WRITE;
				if (nonblocking)
					fault_flags |= FAULT_FLAG_ALLOW_RETRY;
				if (foll_flags & FOLL_MINOR)
					fault_flags |= FAULT_FLAG_MINOR;

				ret = handle_mm_fault(mm, vma, start,
							fault_flags);


please review the end result..

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 1/6] mlock: only hold mmap_sem in shared mode when faulting in pages
  2010-12-08 23:27     ` Andrew Morton
@ 2010-12-08 23:58       ` Michel Lespinasse
  -1 siblings, 0 replies; 44+ messages in thread
From: Michel Lespinasse @ 2010-12-08 23:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Hugh Dickins, Rik van Riel, Peter Zijlstra,
	Nick Piggin, KOSAKI Motohiro, linux-kernel, Linus Torvalds

On Wed, Dec 8, 2010 at 3:27 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
>> Currently mlock() holds mmap_sem in exclusive mode while the pages get
>> faulted in. In the case of a large mlock, this can potentially take a
>> very long time, during which various commands such as 'ps auxw' will
>> block. This makes sysadmins unhappy:
>>
>> real    14m36.232s
>> user    0m0.003s
>> sys     0m0.015s
>>(output from 'time ps auxw' while a 20GB file was being mlocked without
>> being previously preloaded into page cache)
>
> The kernel holds down_write(mmap_sem) for 14m36s?

Yes...

[... patch snipped off ...]

> Am I correct in believing that we'll still hold down_read(mmap_sem) for
> a quarter hour?

Yes, patch 1/6 changes the long hold time to be in read mode instead
of write mode, which is only a band-aid. But, this prepares for patch
5/6, which releases mmap_sem whenever there is contention on it or
when blocking on disk reads.

> We don't need to hold mmap_sem at all while faulting in those pages,
> do we?  We could just do
>
>        for (addr = start, addr < end; addr += PAGE_SIZE)
>                get_user(x, addr);
>
> and voila.  If the pages are in cache and the ptes are set up then that
> will be *vastly* faster than the proposed code.  If the get_user()
> takes a minor fault then it'll be slower.  If it's a major fault then
> the difference probably doesn't matter much.

get_user wouldn't suffice if the page is already mapped in, as we need
to mark it as PageMlocked. Also, we need to skip IO and PFNMAP
regions. I don't think you can make things much simpler than what I
ended up with.

> But whatever.  Is this patchset a half-fix, and should we rather be
> looking for a full-fix?

I think the series fully fixes the mlock() and mlockall() cases, which
has been the more pressing use case for us.

Even then, there are cases where we could still observe long
mmap_sem hold times - fundamentally, every place that calls
get_user_pages (or do_mmap, in the mlockall MCL_FUTURE case) with a
large page range may create such problems. From the looks of it, most
of these places wouldn't actually care if the mmap_sem got dropped in
the middle of the operation, but a general fix will have to involve
looking at all the call sites to be sure.

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 1/6] mlock: only hold mmap_sem in shared mode when faulting in pages
  2010-12-08 23:58       ` Michel Lespinasse
@ 2010-12-10  6:11         ` Linus Torvalds
  -1 siblings, 0 replies; 44+ messages in thread
From: Linus Torvalds @ 2010-12-10  6:11 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Andrew Morton, linux-mm, Hugh Dickins, Rik van Riel,
	Peter Zijlstra, Nick Piggin, KOSAKI Motohiro, linux-kernel,
	Linus Torvalds

On Wednesday, December 8, 2010, Michel Lespinasse <walken@google.com> wrote:
>
> Yes, patch 1/6 changes the long hold time to be in read mode instead
> of write mode, which is only a band-aid. But, this prepares for patch
> 5/6, which releases mmap_sem whenever there is contention on it or
> when blocking on disk reads.

I have to say that I'm not a huge fan of that horribly kludgy
contention check case.

The "move page-in to read-locked sequence" and the changes to
get_user_pages look fine, but the contention thing is just disgusting.
I'd really like to see some other approach if at all possible.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 1/6] mlock: only hold mmap_sem in shared mode when faulting in pages
  2010-12-10  6:11         ` Linus Torvalds
@ 2010-12-10  6:39           ` Michel Lespinasse
  -1 siblings, 0 replies; 44+ messages in thread
From: Michel Lespinasse @ 2010-12-10  6:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, linux-mm, Hugh Dickins, Rik van Riel,
	Peter Zijlstra, Nick Piggin, KOSAKI Motohiro, linux-kernel

On Thu, Dec 9, 2010 at 10:11 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Wednesday, December 8, 2010, Michel Lespinasse <walken@google.com> wrote:
>> Yes, patch 1/6 changes the long hold time to be in read mode instead
>> of write mode, which is only a band-aid. But, this prepares for patch
>> 5/6, which releases mmap_sem whenever there is contention on it or
>> when blocking on disk reads.
>
> I have to say that I'm not a huge fan of that horribly kludgy
> contention check case.
>
> The "move page-in to read-locked sequence" and the changes to
> get_user_pages look fine, but the contention thing is just disgusting.
> I'd really like to see some other approach if at all possible.

Are you OK with the part of patch 5/6 that drops mmap_sem when
blocking on disk?

This by itself brings mmap_sem hold time down to a few seconds. Plus,
I could add something to limit the interval passed to
__mlock_vma_pages_range to a thousand pages or so, so that the hold
time would then be bounded by that constant.

I think rwsem_is_contended() actually sounds better than fiddling with
constants, but OTOH maybe the mlock use case is not significant enough
to justify introducing that new API.
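
For reference, the proposed helper itself is tiny - roughly along
these lines for the spinlock-based fallback rwsem (a from-memory
sketch of the idea in patch 4/6, not the literal patch; the
atomic-count implementations check the count for waiters instead, and
patch 6/6 only makes the x86 variant more precise):

static inline int rwsem_is_contended(struct rw_semaphore *sem)
{
        /*
         * Racy by design: this is only a hint that another task is
         * queued on the semaphore, so the current holder can choose to
         * drop and re-take it.  A stale answer either way is harmless.
         */
        return !list_empty(&sem->wait_list);
}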

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 1/6] mlock: only hold mmap_sem in shared mode when faulting in pages
  2010-12-10  6:39           ` Michel Lespinasse
@ 2010-12-10 11:12             ` Peter Zijlstra
  -1 siblings, 0 replies; 44+ messages in thread
From: Peter Zijlstra @ 2010-12-10 11:12 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Linus Torvalds, Andrew Morton, linux-mm, Hugh Dickins,
	Rik van Riel, Nick Piggin, KOSAKI Motohiro, linux-kernel

On Thu, 2010-12-09 at 22:39 -0800, Michel Lespinasse wrote:
> I think rwsem_is_contended() actually sounds better than fiddling with
> constants, but OTOH maybe the mlock use case is not significant enough
> to justify introducing that new API. 

Right, so I don't see the problem with _is_contended() either. In fact,
I introduce mutex_is_contended() in the mmu_preempt series to convert
existing (spin) lock break tests.

If you want to do lock breaks like cond_resched_lock(), all you really
have is *_is_contended(); sleeping locks will schedule unconditionally.

int cond_break_mutex(struct mutex *mutex)
{
        int ret = 0;

        if (mutex_is_contended(mutex)) {
                /* Drop the lock so waiters can get in, then re-take it. */
                mutex_unlock(mutex);
                ret = 1;
                mutex_lock(mutex);
        }
        return ret;     /* 1 if the lock was dropped and re-acquired */
}
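
The mmap_sem read-side variant would look exactly the same
(illustrative only, on top of the rwsem_is_contended() proposed in
this series):

int cond_break_rwsem_read(struct rw_semaphore *sem)
{
        int ret = 0;

        if (rwsem_is_contended(sem)) {
                up_read(sem);
                ret = 1;        /* caller must revalidate anything cached */
                down_read(sem);
        }
        return ret;
}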

Or more exotic lock breaks, like the mmu-gather stuff, which falls out
of its nested page-table loops and restarts the whole affair.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 1/6] mlock: only hold mmap_sem in shared mode when faulting in pages
  2010-12-10  6:11         ` Linus Torvalds
@ 2010-12-14  0:51           ` Michel Lespinasse
  -1 siblings, 0 replies; 44+ messages in thread
From: Michel Lespinasse @ 2010-12-14  0:51 UTC (permalink / raw)
  To: Andrew Morton, Linus Torvalds
  Cc: linux-mm, Hugh Dickins, Rik van Riel, Peter Zijlstra,
	Nick Piggin, KOSAKI Motohiro, linux-kernel

On Thu, Dec 9, 2010 at 10:11 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Wednesday, December 8, 2010, Michel Lespinasse <walken@google.com> wrote:
>>
>> Yes, patch 1/6 changes the long hold time to be in read mode instead
>> of write mode, which is only a band-aid. But, this prepares for patch
>> 5/6, which releases mmap_sem whenever there is contention on it or
>> when blocking on disk reads.
>
> I have to say that I'm not a huge fan of that horribly kludgy
> contention check case.
>
> The "move page-in to read-locked sequence" and the changes to
> get_user_pages look fine, but the contention thing is just disgusting.
> I'd really like to see some other approach if at all possible.

Andrew, should I amend my patches to remove the rwsem_is_contended() code ?
This would involve:
- remove rwsem-implement-rwsem_is_contended.patch and
  x86-rwsem-more-precise-rwsem_is_contended-implementation.patch
- in mlock-do-not-hold-mmap_sem-for-extended-periods-of-time.patch,
  drop the one hunk making use of rwsem_is_contended (rest of the patch
  would still work without it)
- optionally, follow up patch to limit batch size to a constant
  in do_mlock_pages():

diff --git a/mm/mlock.c b/mm/mlock.c
index 569ae6a..a505a7e 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -457,15 +457,23 @@ static int do_mlock_pages(unsigned long start, size_t len)
 			continue;
 		if (nstart < vma->vm_start)
 			nstart = vma->vm_start;
+		/*
+		 * Constrain batch size to limit mmap_sem hold time.
+		 */
+		if (nend > nstart + 1024 * PAGE_SIZE)
+			nend = nstart + 1024 * PAGE_SIZE;
 		/*
 		 * Now fault in a range of pages. __mlock_vma_pages_range()
 		 * double checks the vma flags, so that it won't mlock pages
 		 * if the vma was already munlocked.
 		 */
 		ret = __mlock_vma_pages_range(vma, nstart, nend, &locked);
 		if (ret < 0) {
 			ret = __mlock_posix_error_return(ret);
 			break;
+		} else if (locked) {
+			locked = 0;
+			up_read(&mm->mmap_sem);
 		}
 		nend = nstart + ret * PAGE_SIZE;
 		ret = 0;


I don't really prefer using a constant, but I'm not sure how else to make
Linus happy :)

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH 1/6] mlock: only hold mmap_sem in shared mode when faulting in pages
  2010-12-14  0:51           ` Michel Lespinasse
@ 2010-12-14  1:05             ` Andrew Morton
  -1 siblings, 0 replies; 44+ messages in thread
From: Andrew Morton @ 2010-12-14  1:05 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Linus Torvalds, linux-mm, Hugh Dickins, Rik van Riel,
	Peter Zijlstra, Nick Piggin, KOSAKI Motohiro, linux-kernel

On Mon, 13 Dec 2010 16:51:40 -0800
Michel Lespinasse <walken@google.com> wrote:

> On Thu, Dec 9, 2010 at 10:11 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > On Wednesday, December 8, 2010, Michel Lespinasse <walken@google.com> wrote:
> >>
> >> Yes, patch 1/6 changes the long hold time to be in read mode instead
> >> of write mode, which is only a band-aid. But, this prepares for patch
> >> 5/6, which releases mmap_sem whenever there is contention on it or
> >> when blocking on disk reads.
> >
> > I have to say that I'm not a huge fan of that horribly kludgy
> > contention check case.
> >
> > The "move page-in to read-locked sequence" and the changes to
> > get_user_pages look fine, but the contention thing is just disgusting.
> > I'd really like to see some other approach if at all possible.
> 
> Andrew, should I amend my patches to remove the rwsem_is_contended() code ?
> This would involve:
> - remove rwsem-implement-rwsem_is_contended.patch and
>   x86-rwsem-more-precise-rwsem_is_contended-implementation.patch
> - in mlock-do-not-hold-mmap_sem-for-extended-periods-of-time.patch,
>   drop the one hunk making use of rwsem_is_contended (rest of the patch
>   would still work without it)

I think I fixed all that up.

> - optionally, follow up patch to limit batch size to a constant
>   in do_mlock_pages():
> 
> diff --git a/mm/mlock.c b/mm/mlock.c
> index 569ae6a..a505a7e 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -457,15 +457,23 @@ static int do_mlock_pages(unsigned long start, size_t len)
>  			continue;
>  		if (nstart < vma->vm_start)
>  			nstart = vma->vm_start;
> +		/*
> +		 * Constrain batch size to limit mmap_sem hold time.
> +		 */
> +		if (nend > nstart + 1024 * PAGE_SIZE)
> +			nend = nstart + 1024 * PAGE_SIZE;
>  		/*
>  		 * Now fault in a range of pages. __mlock_vma_pages_range()
>  		 * double checks the vma flags, so that it won't mlock pages
>  		 * if the vma was already munlocked.
>  		 */
>  		ret = __mlock_vma_pages_range(vma, nstart, nend, &locked);
>  		if (ret < 0) {
>  			ret = __mlock_posix_error_return(ret);
>  			break;
> +		} else if (locked) {
> +			locked = 0;
> +			up_read(&mm->mmap_sem);
>  		}
>  		nend = nstart + ret * PAGE_SIZE;
>  		ret = 0;
> 
> 
> I don't really prefer using a constant, but I'm not sure how else to make
> Linus happy :)

rwsem_is_contended() didn't seem so bad to me.

Reading 1024 pages can still take a long time.  I can't immediately
think of a better approach though.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 1/6] mlock: only hold mmap_sem in shared mode when faulting in pages
  2010-12-14  1:05             ` Andrew Morton
@ 2010-12-14  1:26               ` Michel Lespinasse
  -1 siblings, 0 replies; 44+ messages in thread
From: Michel Lespinasse @ 2010-12-14  1:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, linux-mm, Hugh Dickins, Rik van Riel,
	Peter Zijlstra, Nick Piggin, KOSAKI Motohiro, linux-kernel

On Mon, Dec 13, 2010 at 5:05 PM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Mon, 13 Dec 2010 16:51:40 -0800
> Michel Lespinasse <walken@google.com> wrote:
>> Andrew, should I amend my patches to remove the rwsem_is_contended() code ?
>> This would involve:
>> - remove rwsem-implement-rwsem_is_contended.patch and
>>   x86-rwsem-more-precise-rwsem_is_contended-implementation.patch
>> - in mlock-do-not-hold-mmap_sem-for-extended-periods-of-time.patch,
>>   drop the one hunk making use of rwsem_is_contended (rest of the patch
>>   would still work without it)
>
> I think I fixed all that up.

Thanks!

>> - optionally, follow up patch to limit batch size to a constant
>>   in do_mlock_pages():
[... diff snipped ...]
>> I don't really prefer using a constant, but I'm not sure how else to make
>> Linus happy :)
>
> rwsem_is_contended() didn't seem so bad to me.
>
> Reading 1024 pages can still take a long time.  I can't immediately
> think of a better approach though.

Note that we're only concerned with page cache hits here, as
__get_user_pages() with a non-NULL nonblocking argument will release
mmap_sem when blocking on disk. So the time per page is roughly
constant - we don't need to worry about disk seeks at least.
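
For context, the nonblocking handshake amounts to something like this
(a simplified sketch of the idea, not the literal mm/memory.c hunk;
the helper name is made up, error handling is omitted, and it relies
on the FAULT_FLAG_ALLOW_RETRY / VM_FAULT_RETRY machinery from the
related page fault retry patches):

static int faultin_page_nonblocking(struct mm_struct *mm,
                                    struct vm_area_struct *vma,
                                    unsigned long address,
                                    unsigned int gup_flags, int *nonblocking)
{
        unsigned int fault_flags = 0;
        int ret;

        if (gup_flags & FOLL_WRITE)
                fault_flags |= FAULT_FLAG_WRITE;
        if (nonblocking)
                fault_flags |= FAULT_FLAG_ALLOW_RETRY;

        ret = handle_mm_fault(mm, vma, address, fault_flags);
        if (ret & VM_FAULT_RETRY) {
                /* The fault handler dropped mmap_sem before sleeping on IO. */
                *nonblocking = 0;
                return -EAGAIN; /* tell the gup loop to bail out */
        }
        return 0;               /* the page should now be present */
}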

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 1/6] mlock: only hold mmap_sem in shared mode when faulting in pages
  2010-12-14  1:05             ` Andrew Morton
@ 2010-12-14 15:43               ` Linus Torvalds
  -1 siblings, 0 replies; 44+ messages in thread
From: Linus Torvalds @ 2010-12-14 15:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michel Lespinasse, linux-mm, Hugh Dickins, Rik van Riel,
	Peter Zijlstra, Nick Piggin, KOSAKI Motohiro, linux-kernel

On Mon, Dec 13, 2010 at 5:05 PM, Andrew Morton
<akpm@linux-foundation.org> wrote:
>
> Reading 1024 pages can still take a long time.  I can't immediately
> think of a better approach though.

I don't see the need for _any_ of this.

Guys, we used to hold the damn thing for writing the *WHOLE*DAMN*TIME*.

Without _any_ at all of the crappy "rwsem_contended()" or the stupid
constants, we hold it only for reading, _and_ we drop it for any
actual IO. So the semaphore is held only for actual CPU intensive
cases. We're talking a reduction from minutes to milliseconds.

So stop this insanity. Do neither the rwsem contention checking _nor_
the "do things in batches".

Really.

The thing is, after six months of doing the simple and straightforward
and _obvious_ parts, if people still think it's a real problem, at
that point I'm going to be interested in hearing about trying to be
clever. But when the semaphore hold times have gone down by four
orders of magnitude, I simply think it's fundamentally wrong to dick
around with some stupid detail. Certainly not in the same patch
series.

"Keep It Simple, Stupid".

So don't even _try_ to send me a series that does all of this. I'm not
going to take it. Do a series that fixes the _problem_. No more.

And btw, read the paper "Worse is better".

                                   Linus

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 1/6] mlock: only hold mmap_sem in shared mode when faulting in pages
  2010-12-14 15:43               ` Linus Torvalds
@ 2010-12-14 23:22                 ` Michel Lespinasse
  -1 siblings, 0 replies; 44+ messages in thread
From: Michel Lespinasse @ 2010-12-14 23:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, linux-mm, Hugh Dickins, Rik van Riel,
	Peter Zijlstra, Nick Piggin, KOSAKI Motohiro, linux-kernel

On Tue, Dec 14, 2010 at 7:43 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> Without _any_ at all of the crappy "rwsem_contended()" or the stupid
> constants, we hold it only for reading, _and_ we drop it for any
> actual IO. So the semaphore is held only for actual CPU intensive
> cases. We're talking a reduction from minutes to milliseconds.

It's actually still several seconds for a large enough mlock from page cache.

But yes, I agree it'll do fine for now :)

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread (newest message: 2010-12-14 23:22 UTC)

Thread overview: 44+ messages
2010-12-03  0:16 [PATCH 0/6] mlock: do not hold mmap_sem for extended periods of time Michel Lespinasse
2010-12-03  0:16 ` [PATCH 1/6] mlock: only hold mmap_sem in shared mode when faulting in pages Michel Lespinasse
2010-12-08 23:27   ` Andrew Morton
2010-12-08 23:58     ` Michel Lespinasse
2010-12-10  6:11       ` Linus Torvalds
2010-12-10  6:39         ` Michel Lespinasse
2010-12-10 11:12           ` Peter Zijlstra
2010-12-14  0:51         ` Michel Lespinasse
2010-12-14  1:05           ` Andrew Morton
2010-12-14  1:26             ` Michel Lespinasse
2010-12-14 15:43             ` Linus Torvalds
2010-12-14 23:22               ` Michel Lespinasse
2010-12-03  0:16 ` [PATCH 2/6] mm: add FOLL_MLOCK follow_page flag. Michel Lespinasse
2010-12-04  6:55   ` Michel Lespinasse
2010-12-03  0:16 ` [PATCH 3/6] mm: move VM_LOCKED check to __mlock_vma_pages_range() Michel Lespinasse
2010-12-03  0:16 ` [PATCH 4/6] rwsem: implement rwsem_is_contended() Michel Lespinasse
2010-12-03  0:16 ` [PATCH 5/6] mlock: do not hold mmap_sem for extended periods of time Michel Lespinasse
2010-12-08 23:42   ` Andrew Morton
2010-12-03  0:16 ` [PATCH 6/6] x86 rwsem: more precise rwsem_is_contended() implementation Michel Lespinasse
2010-12-03 22:41   ` Peter Zijlstra
2010-12-03 22:51     ` Michel Lespinasse
2010-12-03 23:02 ` [PATCH 0/6] mlock: do not hold mmap_sem for extended periods of time Michel Lespinasse
