* [PATCH 0/3] V2: Reduce mmap_sem hold times during file backed page faults
From: Michel Lespinasse @ 2010-10-05  7:53 UTC
  To: linux-mm, Linus Torvalds, Ying Han
  Cc: linux-kernel, Andrew Morton, Rik van Riel, Nick Piggin, Peter Zijlstra

This is the second iteration of our change dropping mmap_sem when a disk
access occurs during a page fault on a file-backed VMA.

Changes since V1:
- Cleaned up 'Retry page fault when blocking on disk transfer', applying
  Linus's suggestions
- Added 'access_error API cleanup'

Tests:

- microbenchmark: thread A mmaps a large file and does random read accesses
  to the mmapped area, achieving about 55 iterations/s. Thread B does
  mmap/munmap in a loop at a separate location; it achieves 55 iterations/s
  before the change and 15000 iterations/s after. (A userspace sketch of
  this benchmark follows the list.)
- We are seeing related effects in some in-house applications, which show
  significant performance regressions when running without this change.
- I am looking for a microbenchmark to expose the worst-case overhead of
  the page fault retry. Would FIO be a good match for that use?
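
For reference, here is a minimal userspace sketch of that microbenchmark
(an illustration only, not the code that produced the numbers above; the
file name, sizes, and the absence of error handling are assumptions made
for the example):

#define _GNU_SOURCE
/* bench.c: thread A random-reads a large file-backed mapping while
 * thread B does mmap/munmap in a loop. Without page fault retry,
 * thread B's munmap stalls on mmap_sem behind thread A's disk waits.
 * Build with: gcc -O2 bench.c -o bench -lpthread */
#include <fcntl.h>
#include <pthread.h>
#include <stdlib.h>
#include <sys/mman.h>

#define FILE_SIZE	(1UL << 30)	/* assumed much larger than RAM */
#define PAGE_SZ		4096UL

static volatile char sink;

static void *reader(void *arg)		/* thread A */
{
	char *map = arg;
	for (;;) {
		unsigned long pg = rand() % (FILE_SIZE / PAGE_SZ);
		sink = map[pg * PAGE_SZ];	/* may block on disk I/O */
	}
	return NULL;
}

int main(void)
{
	int fd = open("bigfile", O_RDONLY);	/* hypothetical test file */
	char *map = mmap(NULL, FILE_SIZE, PROT_READ, MAP_PRIVATE, fd, 0);
	pthread_t tid;

	pthread_create(&tid, NULL, reader, map);
	for (;;) {				/* thread B */
		void *p = mmap(NULL, PAGE_SZ, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		munmap(p, PAGE_SZ);
		/* count and report iterations/s here */
	}
}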

Michel Lespinasse (3):
  filemap_fault: unique path for locking page
  Retry page fault when blocking on disk transfer.
  access_error API cleanup

 arch/x86/mm/fault.c |   44 +++++++++++++++++++++++++++++---------------
 include/linux/mm.h  |    2 ++
 mm/filemap.c        |   41 ++++++++++++++++++++++++++++++++---------
 mm/memory.c         |    3 ++-
 4 files changed, 65 insertions(+), 25 deletions(-)

* [PATCH 1/3] filemap_fault: unique path for locking page
From: Michel Lespinasse @ 2010-10-05  7:53 UTC
  To: linux-mm, Linus Torvalds, Ying Han
  Cc: linux-kernel, Andrew Morton, Rik van Riel, Nick Piggin, Peter Zijlstra

This change introduces a single location where filemap_fault() locks
the desired page. There used to be two such places, depending on whether
the initial find_get_page() was successful.

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 mm/filemap.c |   20 +++++++++++---------
 1 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 3d4df44..8ed709a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1539,25 +1539,27 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 		 * waiting for the lock.
 		 */
 		do_async_mmap_readahead(vma, ra, file, page, offset);
-		lock_page(page);
-
-		/* Did it get truncated? */
-		if (unlikely(page->mapping != mapping)) {
-			unlock_page(page);
-			put_page(page);
-			goto no_cached_page;
-		}
 	} else {
 		/* No page in the page cache at all */
 		do_sync_mmap_readahead(vma, ra, file, offset);
 		count_vm_event(PGMAJFAULT);
 		ret = VM_FAULT_MAJOR;
 retry_find:
-		page = find_lock_page(mapping, offset);
+		page = find_get_page(mapping, offset);
 		if (!page)
 			goto no_cached_page;
 	}
 
+	lock_page(page);
+
+	/* Did it get truncated? */
+	if (unlikely(page->mapping != mapping)) {
+		unlock_page(page);
+		put_page(page);
+		goto retry_find;
+	}
+	VM_BUG_ON(page->index != offset);
+
 	/*
 	 * We have a locked page in the page cache, now we need to check
 	 * that it's up-to-date. If not, it is going to be due to an error.
-- 
1.7.1

* [PATCH 2/3] Retry page fault when blocking on disk transfer.
From: Michel Lespinasse @ 2010-10-05  7:53 UTC
  To: linux-mm, Linus Torvalds, Ying Han
  Cc: linux-kernel, Andrew Morton, Rik van Riel, Nick Piggin, Peter Zijlstra

This change reduces mmap_sem hold times that are caused by waiting for
disk transfers when accessing file-mapped VMAs. It introduces the
FAULT_FLAG_ALLOW_RETRY flag, which indicates that the call site wants
mmap_sem to be released if blocking on a pending disk transfer.
In that case, filemap_fault() returns the VM_FAULT_RETRY status bit
and do_page_fault() will then re-acquire mmap_sem and retry the page fault.
It is expected that the retry will hit the same page, which will now be
cached, and thus it will complete with a low mmap_sem hold time.
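
In condensed form, the handshake looks like this (a reading aid distilled
from the x86 part of the diff below, not an additional change):

	/* arch fault handler: opt in to the retry, and retry at most once */
	unsigned int flags = FAULT_FLAG_ALLOW_RETRY |
			     (write ? FAULT_FLAG_WRITE : 0);
retry:
	down_read(&mm->mmap_sem);
	/* ... vma lookup and access checks ... */
	fault = handle_mm_fault(mm, vma, address, flags);
	if (fault & VM_FAULT_RETRY) {
		/* ->fault has already released mmap_sem for us */
		flags &= ~FAULT_FLAG_ALLOW_RETRY;  /* avoid starvation */
		goto retry;
	}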

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 arch/x86/mm/fault.c |   38 ++++++++++++++++++++++++++------------
 include/linux/mm.h  |    2 ++
 mm/filemap.c        |   23 ++++++++++++++++++++++-
 mm/memory.c         |    3 ++-
 4 files changed, 52 insertions(+), 14 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 4c4508e..b355b92 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -952,8 +952,10 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code)
 	struct task_struct *tsk;
 	unsigned long address;
 	struct mm_struct *mm;
-	int write;
 	int fault;
+	int write = error_code & PF_WRITE;
+	unsigned int flags = FAULT_FLAG_ALLOW_RETRY |
+					(write ? FAULT_FLAG_WRITE : 0);
 
 	tsk = current;
 	mm = tsk->mm;
@@ -1064,6 +1066,7 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code)
 			bad_area_nosemaphore(regs, error_code, address);
 			return;
 		}
+retry:
 		down_read(&mm->mmap_sem);
 	} else {
 		/*
@@ -1107,8 +1110,6 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code)
 	 * we can handle it..
 	 */
 good_area:
-	write = error_code & PF_WRITE;
-
 	if (unlikely(access_error(error_code, write, vma))) {
 		bad_area_access_error(regs, error_code, address);
 		return;
@@ -1119,21 +1120,34 @@ good_area:
 	 * make sure we exit gracefully rather than endlessly redo
 	 * the fault:
 	 */
-	fault = handle_mm_fault(mm, vma, address, write ? FAULT_FLAG_WRITE : 0);
+	fault = handle_mm_fault(mm, vma, address, flags);
 
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		mm_fault_error(regs, error_code, address, fault);
 		return;
 	}
 
-	if (fault & VM_FAULT_MAJOR) {
-		tsk->maj_flt++;
-		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, 0,
-				     regs, address);
-	} else {
-		tsk->min_flt++;
-		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, 0,
-				     regs, address);
+	/*
+	 * Major/minor page fault accounting is only done on the
+	 * initial attempt. If we go through a retry, it is extremely
+	 * likely that the page will be found in page cache at that point.
+	 */
+	if (flags & FAULT_FLAG_ALLOW_RETRY) {
+		if (fault & VM_FAULT_MAJOR) {
+			tsk->maj_flt++;
+			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, 0,
+				      regs, address);
+		} else {
+			tsk->min_flt++;
+			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, 0,
+				      regs, address);
+		}
+		if (fault & VM_FAULT_RETRY) {
+			/* Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk
+			 * of starvation. */
+			flags &= ~FAULT_FLAG_ALLOW_RETRY;
+			goto retry;
+		}
 	}
 
 	check_v8086_mode(regs, address, tsk);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 74949fb..0b4f9b2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -144,6 +144,7 @@ extern pgprot_t protection_map[16];
 #define FAULT_FLAG_WRITE	0x01	/* Fault was a write access */
 #define FAULT_FLAG_NONLINEAR	0x02	/* Fault was via a nonlinear mapping */
 #define FAULT_FLAG_MKWRITE	0x04	/* Fault was mkwrite of existing pte */
+#define FAULT_FLAG_ALLOW_RETRY	0x08	/* Retry fault if blocking */
 
 /*
  * This interface is used by x86 PAT code to identify a pfn mapping that is
@@ -722,6 +723,7 @@ static inline int page_mapped(struct page *page)
 
 #define VM_FAULT_NOPAGE	0x0100	/* ->fault installed the pte, not return page */
 #define VM_FAULT_LOCKED	0x0200	/* ->fault locked the returned page */
+#define VM_FAULT_RETRY	0x0400	/* ->fault blocked, must retry */
 
 #define VM_FAULT_ERROR	(VM_FAULT_OOM | VM_FAULT_SIGBUS | VM_FAULT_HWPOISON)
 
diff --git a/mm/filemap.c b/mm/filemap.c
index 8ed709a..7cba9bf 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1501,6 +1501,26 @@ static void do_async_mmap_readahead(struct vm_area_struct *vma,
 					   page, offset, ra->ra_pages);
 }
 
+/*
+ * Lock the page, unless this would block and the caller indicated that it
+ * can handle a retry.
+ */
+static int lock_page_or_retry(struct page *page,
+			      struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	if (trylock_page(page))
+		return 1;
+	if (!(vmf->flags & FAULT_FLAG_ALLOW_RETRY)) {
+		__lock_page(page);
+		return 1;
+	}
+
+	up_read(&vma->vm_mm->mmap_sem);
+	wait_on_page_locked(page);
+	page_cache_release(page);
+	return 0;
+}
+
 /**
  * filemap_fault - read in file data for page fault handling
  * @vma:	vma in which the fault was taken
@@ -1550,7 +1570,8 @@ retry_find:
 			goto no_cached_page;
 	}
 
-	lock_page(page);
+	if (!lock_page_or_retry(page, vma, vmf))
+		return ret | VM_FAULT_RETRY;
 
 	/* Did it get truncated? */
 	if (unlikely(page->mapping != mapping)) {
diff --git a/mm/memory.c b/mm/memory.c
index 0e18b4d..b068c68 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2926,7 +2926,8 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	vmf.page = NULL;
 
 	ret = vma->vm_ops->fault(vma, &vmf);
-	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))
+	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE |
+			    VM_FAULT_RETRY)))
 		return ret;
 
 	if (unlikely(PageHWPoison(vmf.page))) {
-- 
1.7.1

* [PATCH 3/3] access_error API cleanup
From: Michel Lespinasse @ 2010-10-05  7:53 UTC
  To: linux-mm, Linus Torvalds, Ying Han
  Cc: linux-kernel, Andrew Morton, Rik van Riel, Nick Piggin, Peter Zijlstra

access_error() already takes error_code as an argument, so there is
no need for an additional write flag.

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 arch/x86/mm/fault.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index b355b92..844d46f 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -915,9 +915,9 @@ spurious_fault(unsigned long error_code, unsigned long address)
 int show_unhandled_signals = 1;
 
 static inline int
-access_error(unsigned long error_code, int write, struct vm_area_struct *vma)
+access_error(unsigned long error_code, struct vm_area_struct *vma)
 {
-	if (write) {
+	if (error_code & PF_WRITE) {
 		/* write, present and write, not present: */
 		if (unlikely(!(vma->vm_flags & VM_WRITE)))
 			return 1;
@@ -1110,7 +1110,7 @@ retry:
 	 * we can handle it..
 	 */
 good_area:
-	if (unlikely(access_error(error_code, write, vma))) {
+	if (unlikely(access_error(error_code, vma))) {
 		bad_area_access_error(regs, error_code, address);
 		return;
 	}
-- 
1.7.1

* Re: [PATCH 0/3] V2: Reduce mmap_sem hold times during file backed page faults
From: Rik van Riel @ 2010-10-05 14:55 UTC
  To: Michel Lespinasse
  Cc: linux-mm, Linus Torvalds, Ying Han, linux-kernel, Andrew Morton,
	Nick Piggin, Peter Zijlstra

On 10/05/2010 03:53 AM, Michel Lespinasse wrote:
> This is the second iteration of our change dropping mmap_sem when a disk
> access occurs during a page fault on a file-backed VMA.
>
> Changes since V1:
> - Cleaned up 'Retry page fault when blocking on disk transfer', applying
>    Linus's suggestions
> - Added 'access_error API cleanup'
>
> Tests:
>
> - microbenchmark: thread A mmaps a large file and does random read accesses
>    to the mmapped area, achieving about 55 iterations/s. Thread B does
>    mmap/munmap in a loop at a separate location; it achieves 55 iterations/s
>    before the change and 15000 iterations/s after.
> - We are seeing related effects in some in-house applications, which show
>    significant performance regressions when running without this change.
> - I am looking for a microbenchmark to expose the worst-case overhead of
>    the page fault retry. Would FIO be a good match for that use?

I imagine MySQL could show the problem, on a system with
so much memory pressure that part of MySQL itself gets
swapped out (a slightly too large innodb buffer?).

Without your patches, a new database thread can only be
created in-between page faults.  With your patches, a
new thread can be started even while other threads are
waiting on page faults.

More importantly, multiple threads can start pagein
IO simultaneously after memory pressure has let up.
This should allow the system to go back to normal
performance much faster after a load spike has passed.

* Re: [PATCH 1/3] filemap_fault: unique path for locking page
From: Rik van Riel @ 2010-10-05 17:07 UTC
  To: Michel Lespinasse
  Cc: linux-mm, Linus Torvalds, Ying Han, linux-kernel, Andrew Morton,
	Nick Piggin, Peter Zijlstra

On 10/05/2010 03:53 AM, Michel Lespinasse wrote:
> This change introduces a single location where filemap_fault() locks
> the desired page. There used to be two such places, depending on whether
> the initial find_get_page() was successful.
>
> Signed-off-by: Michel Lespinasse <walken@google.com>

Acked-by: Rik van Riel <riel@redhat.com>

* Re: [PATCH 2/3] Retry page fault when blocking on disk transfer.
From: Linus Torvalds @ 2010-10-05 17:33 UTC
  To: Michel Lespinasse
  Cc: linux-mm, Ying Han, linux-kernel, Andrew Morton, Rik van Riel,
	Nick Piggin, Peter Zijlstra

On Tue, Oct 5, 2010 at 12:53 AM, Michel Lespinasse <walken@google.com> wrote:
>
> This change reduces mmap_sem hold times that are caused by waiting for
> disk transfers when accessing file-mapped VMAs.

Ok, this series looks much better. The new mm/filemap.c diff looks
much better with the lock_page_or_retry() helper function, and on the
whole I think I can Ack this (although obviously not actually apply it
- but please do get it into linux-next).

The do_page_fault() part of the patch still looks ugly, but I don't
see any obvious way to improve on it.

                        Linus

* Re: [PATCH 2/3] Retry page fault when blocking on disk transfer.
From: Rik van Riel @ 2010-10-05 17:38 UTC
  To: Michel Lespinasse
  Cc: linux-mm, Linus Torvalds, Ying Han, linux-kernel, Andrew Morton,
	Nick Piggin, Peter Zijlstra

On 10/05/2010 03:53 AM, Michel Lespinasse wrote:
> This change reduces mmap_sem hold times that are caused by waiting for
> disk transfers when accessing file-mapped VMAs. It introduces the
> FAULT_FLAG_ALLOW_RETRY flag, which indicates that the call site wants
> mmap_sem to be released if blocking on a pending disk transfer.
> In that case, filemap_fault() returns the VM_FAULT_RETRY status bit
> and do_page_fault() will then re-acquire mmap_sem and retry the page fault.
> It is expected that the retry will hit the same page, which will now be
> cached, and thus it will complete with a low mmap_sem hold time.
>
> Signed-off-by: Michel Lespinasse <walken@google.com>

Acked-by: Rik van Riel <riel@redhat.com>

Looks like it should be relatively easy to do something
similar in do_swap_page also.

* Re: [PATCH 3/3] access_error API cleanup
From: Rik van Riel @ 2010-10-05 19:44 UTC
  To: Michel Lespinasse
  Cc: linux-mm, Linus Torvalds, Ying Han, linux-kernel, Andrew Morton,
	Nick Piggin, Peter Zijlstra

On 10/05/2010 03:53 AM, Michel Lespinasse wrote:
> access_error() already takes error_code as an argument, so there is
> no need for an additional write flag.
>
> Signed-off-by: Michel Lespinasse <walken@google.com>

Acked-by: Rik van Riel <riel@redhat.com>

* Re: [PATCH 2/3] Retry page fault when blocking on disk transfer.
From: Michel Lespinasse @ 2010-10-05 22:44 UTC
  To: Rik van Riel
  Cc: linux-mm, Linus Torvalds, Ying Han, linux-kernel, Andrew Morton,
	Nick Piggin, Peter Zijlstra

On Tue, Oct 5, 2010 at 10:38 AM, Rik van Riel <riel@redhat.com> wrote:
> On 10/05/2010 03:53 AM, Michel Lespinasse wrote:
>>
>> This change reduces mmap_sem hold times that are caused by waiting for
>> disk transfers when accessing file-mapped VMAs. It introduces the
>> FAULT_FLAG_ALLOW_RETRY flag, which indicates that the call site wants
>> mmap_sem to be released if blocking on a pending disk transfer.
>> In that case, filemap_fault() returns the VM_FAULT_RETRY status bit
>> and do_page_fault() will then re-acquire mmap_sem and retry the page
>> fault.
>> It is expected that the retry will hit the same page, which will now be
>> cached, and thus it will complete with a low mmap_sem hold time.
>>
>> Signed-off-by: Michel Lespinasse <walken@google.com>
>
> Acked-by: Rik van Riel <riel@redhat.com>
>
> Looks like it should be relatively easy to do something
> similar in do_swap_page also.

Good idea. We don't make much use of swap, which is probably why we
didn't have that in our kernel, but it seems worth doing just for
uniformity. I'll add it in a follow-on patch.

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

* Re: [PATCH 2/3] Retry page fault when blocking on disk transfer.
From: H. Peter Anvin @ 2010-10-06  4:02 UTC
  To: Michel Lespinasse
  Cc: linux-mm, Linus Torvalds, Ying Han, linux-kernel, Andrew Morton,
	Rik van Riel, Nick Piggin, Peter Zijlstra

On 10/05/2010 12:53 AM, Michel Lespinasse wrote:
> This change reduces mmap_sem hold times that are caused by waiting for
> disk transfers when accessing file-mapped VMAs. It introduces the
> FAULT_FLAG_ALLOW_RETRY flag, which indicates that the call site wants
> mmap_sem to be released if blocking on a pending disk transfer.
> In that case, filemap_fault() returns the VM_FAULT_RETRY status bit
> and do_page_fault() will then re-acquire mmap_sem and retry the page fault.
> It is expected that the retry will hit the same page, which will now be
> cached, and thus it will complete with a low mmap_sem hold time.
> 
> Signed-off-by: Michel Lespinasse <walken@google.com>
> ---
>  arch/x86/mm/fault.c |   38 ++++++++++++++++++++++++++------------
>  include/linux/mm.h  |    2 ++
>  mm/filemap.c        |   23 ++++++++++++++++++++++-
>  mm/memory.c         |    3 ++-
>  4 files changed, 52 insertions(+), 14 deletions(-)
> 

Acked-by: H. Peter Anvin <hpa@zytor.com>

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


* Re: [PATCH 3/3] access_error API cleanup
From: H. Peter Anvin @ 2010-10-06  4:02 UTC
  To: Michel Lespinasse
  Cc: linux-mm, Linus Torvalds, Ying Han, linux-kernel, Andrew Morton,
	Rik van Riel, Nick Piggin, Peter Zijlstra

On 10/05/2010 12:53 AM, Michel Lespinasse wrote:
> access_error() already takes error_code as an argument, so there is
> no need for an additional write flag.
> 
> Signed-off-by: Michel Lespinasse <walken@google.com>
> ---
>  arch/x86/mm/fault.c |    6 +++---
>  1 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index b355b92..844d46f 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -915,9 +915,9 @@ spurious_fault(unsigned long error_code, unsigned long address)
>  int show_unhandled_signals = 1;
>  
>  static inline int
> -access_error(unsigned long error_code, int write, struct vm_area_struct *vma)
> +access_error(unsigned long error_code, struct vm_area_struct *vma)
>  {
> -	if (write) {
> +	if (error_code & PF_WRITE) {
>  		/* write, present and write, not present: */
>  		if (unlikely(!(vma->vm_flags & VM_WRITE)))
>  			return 1;
> @@ -1110,7 +1110,7 @@ retry:
>  	 * we can handle it..
>  	 */
>  good_area:
> -	if (unlikely(access_error(error_code, write, vma))) {
> +	if (unlikely(access_error(error_code, vma))) {
>  		bad_area_access_error(regs, error_code, address);
>  		return;
>  	}

Acked-by: H. Peter Anvin <hpa@zytor.com>

I was going to put it into the x86 tree, but since it is part of a
larger series that gets messy.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


* Re: [PATCH 3/3] access_error API cleanup
From: Michel Lespinasse @ 2010-10-06  4:14 UTC
  To: H. Peter Anvin
  Cc: linux-mm, Linus Torvalds, Ying Han, linux-kernel, Andrew Morton,
	Rik van Riel, Nick Piggin, Peter Zijlstra

On Tue, Oct 5, 2010 at 9:02 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> I was going to put it into the x86 tree, but being part of a larger
> series it gets messy.

Yes. It's easy enough to reorder the patches, if we care about this
one making it into 2.6.36 ahead of the rest, but it does not seem like
a huge payoff either.

I figure, barring any surprises, the series is likely to merge from the
-mm tree to mainline in time for 2.6.37?

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

* Re: [PATCH 3/3] access_error API cleanup
From: Andrew Morton @ 2010-10-06  4:18 UTC
  To: Michel Lespinasse
  Cc: H. Peter Anvin, linux-mm, Linus Torvalds, Ying Han, linux-kernel,
	Rik van Riel, Nick Piggin, Peter Zijlstra

On Tue, 5 Oct 2010 21:14:12 -0700 Michel Lespinasse <walken@google.com> wrote:

> I figure barring any surprises, the series is likely to merge from -mm
> tree to mainline in time for 2.6.37 ?

yep.

* Re: [PATCH 3/3] access_error API cleanup
From: H. Peter Anvin @ 2010-10-06  4:20 UTC
  To: Andrew Morton
  Cc: Michel Lespinasse, linux-mm, Linus Torvalds, Ying Han,
	linux-kernel, Rik van Riel, Nick Piggin, Peter Zijlstra

On 10/05/2010 09:18 PM, Andrew Morton wrote:
> On Tue, 5 Oct 2010 21:14:12 -0700 Michel Lespinasse <walken@google.com> wrote:
> 
>> I figure, barring any surprises, the series is likely to merge from the
>> -mm tree to mainline in time for 2.6.37?
> 
> yep.

Sounds good to me.

Again, Acked-by: H. Peter Anvin <hpa@zytor.com>

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


* Re: [PATCH 2/3] Retry page fault when blocking on disk transfer.
From: Michel Lespinasse @ 2010-10-08  4:39 UTC
  To: Rik van Riel, Linus Torvalds
  Cc: linux-mm, Ying Han, linux-kernel, Andrew Morton, Nick Piggin,
	Peter Zijlstra

On Tue, Oct 05, 2010 at 03:44:22PM -0700, Michel Lespinasse wrote:
> On Tue, Oct 5, 2010 at 10:38 AM, Rik van Riel <riel@redhat.com> wrote:
> > Looks like it should be relatively easy to do something
> > similar in do_swap_page also.
> 
> Good idea. We don't make much use of swap, which is probably why we
> didn't have that in our kernel, but it seems worth doing just for
> uniformity. I'll add it in a follow-on patch.

So here's the patch. Sorry for the delay - it did not take long to write,
but I couldn't test it before today.

Please have a look - I'd like to add this to the series I sent earlier.

----------------------------------- 8< ---------------------------------

Retry page fault when blocking on swap in

This change is the cousin of 'Retry page fault when blocking
on disk transfer'. The idea here is to reduce mmap_sem hold times
that are caused by disk transfers when swapping in pages. We drop
mmap_sem while waiting for the page lock, and return the VM_FAULT_RETRY
flag. do_page_fault will then re-acquire mmap_sem and retry the
page fault. It is expected that upon retry the page will now be cached,
and thus the retry will complete with a low mmap_sem hold time.

Signed-off-by: Michel Lespinasse <walken@google.com>

diff --git a/mm/memory.c b/mm/memory.c
index b068c68..0ec70b4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2613,6 +2613,21 @@ int vmtruncate_range(struct inode *inode, loff_t offset, loff_t end)
 	return 0;
 }
 
+static inline int lock_page_or_retry(struct page *page, struct mm_struct *mm,
+				     unsigned int flags)
+{
+	if (trylock_page(page))
+		return 1;
+	if (!(flags & FAULT_FLAG_ALLOW_RETRY)) {
+		__lock_page(page);
+		return 1;
+	}
+
+	up_read(&mm->mmap_sem);
+	wait_on_page_locked(page);
+	return 0;
+}
+
 /*
  * We enter with non-exclusive mmap_sem (to exclude vma changes,
  * but allow concurrent faults), and pte mapped but not yet locked.
@@ -2626,6 +2641,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct page *page, *swapcache = NULL;
 	swp_entry_t entry;
 	pte_t pte;
+	int locked;
 	struct mem_cgroup *ptr = NULL;
 	int exclusive = 0;
 	int ret = 0;
@@ -2676,8 +2692,12 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		goto out_release;
 	}
 
-	lock_page(page);
+	locked = lock_page_or_retry(page, mm, flags);
 	delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
+	if (!locked) {
+		ret |= VM_FAULT_RETRY;
+		goto out_release;
+	}
 
 	/*
 	 * Make sure try_to_free_swap or reuse_swap_page or swapoff did not


-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

* Re: [PATCH 2/3] Retry page fault when blocking on disk transfer.
  2010-10-08  4:39         ` Michel Lespinasse
@ 2010-10-08 13:24           ` Rik van Riel
  -1 siblings, 0 replies; 46+ messages in thread
From: Rik van Riel @ 2010-10-08 13:24 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Linus Torvalds, linux-mm, Ying Han, linux-kernel, Andrew Morton,
	Nick Piggin, Peter Zijlstra

On 10/08/2010 12:39 AM, Michel Lespinasse wrote:
> On Tue, Oct 05, 2010 at 03:44:22PM -0700, Michel Lespinasse wrote:
> On Tue, Oct 5, 2010 at 10:38 AM, Rik van Riel <riel@redhat.com> wrote:
>>> Looks like it should be relatively easy to do something
>>> similar in do_swap_page also.
>>
>> Good idea. We don't make use of swap too much, which is probably why
>> we didn't have that in our kernel, but it seems like a good idea just
>> for uniformity. I'll add this in a follow-on patch.
>
> So here's the patch. Sorry for the delay - it did not take long to write,
> but I couldn't test it before today.
>
> Please have a look - I'd like to add this to the series I sent earlier.
>
> ----------------------------------- 8<  ---------------------------------
>
> Retry page fault when blocking on swap in
>
> This change is the cousin of 'Retry page fault when blocking
> on disk transfer'. The idea here is to reduce mmap_sem hold times
> that are caused by disk transfers when swapping in pages. We drop
> mmap_sem while waiting for the page lock, and return the VM_FAULT_RETRY
> flag. do_page_fault will then re-acquire mmap_sem and retry the
> page fault. It is expected that upon retry the page will now be cached,
> and thus the retry will complete with a low mmap_sem hold time.
>
> Signed-off-by: Michel Lespinasse <walken@google.com>
>
> diff --git a/mm/memory.c b/mm/memory.c
> index b068c68..0ec70b4 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2613,6 +2613,21 @@ int vmtruncate_range(struct inode *inode, loff_t offset, loff_t end)
>   	return 0;
>   }
>
> +static inline int lock_page_or_retry(struct page *page, struct mm_struct *mm,
> +				     unsigned int flags)
> +{
> +	if (trylock_page(page))
> +		return 1;
> +	if (!(flags & FAULT_FLAG_ALLOW_RETRY)) {
> +		__lock_page(page);
> +		return 1;
> +	}
> +
> +	up_read(&mm->mmap_sem);
> +	wait_on_page_locked(page);
> +	return 0;
> +}

Wait a moment.  Your other patch 2/3 also has a
lock_page_or_retry function.  That one is in
filemap.c and takes slightly different arguments,
to do essentially the same thing...

+/*
+ * Lock the page, unless this would block and the caller indicated that it
+ * can handle a retry.
+ */
+static int lock_page_or_retry(struct page *page,
+			      struct vm_area_struct *vma, struct vm_fault *vmf)
+{

Is there a way the two functions can be merged
into one?

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/3] Retry page fault when blocking on disk transfer.
  2010-10-08 13:24           ` Rik van Riel
@ 2010-10-08 20:06             ` Michel Lespinasse
  -1 siblings, 0 replies; 46+ messages in thread
From: Michel Lespinasse @ 2010-10-08 20:06 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Linus Torvalds, linux-mm, Ying Han, linux-kernel, Andrew Morton,
	Nick Piggin, Peter Zijlstra

On Fri, Oct 8, 2010 at 6:24 AM, Rik van Riel <riel@redhat.com> wrote:
>> +static inline int lock_page_or_retry(struct page *page, struct mm_struct *mm,
>> +                                    unsigned int flags)
>> +{
>> +       if (trylock_page(page))
>> +               return 1;
>> +       if (!(flags & FAULT_FLAG_ALLOW_RETRY)) {
>> +               __lock_page(page);
>> +               return 1;
>> +       }
>> +
>> +       up_read(&mm->mmap_sem);
>> +       wait_on_page_locked(page);
>> +       return 0;
>> +}
>
> Wait a moment.  Your other patch 2/3 also has a
> lock_page_or_retry function.  That one is in
> filemap.c and takes slightly different arguments,
> to do essentially the same thing...
>
> +/*
> + * Lock the page, unless this would block and the caller indicated that it
> + * can handle a retry.
> + */
> +static int lock_page_or_retry(struct page *page,
> +                             struct vm_area_struct *vma, struct vm_fault *vmf)
> +{
>
> Is there a way the two functions can be merged
> into one?

Yes, this would be easy to do.

The argument against it would be loss of inlining and, in the filemap
version, the need to dereference vma and vmf fields to find the mm and
flags values. We'd like to avoid that at least in the fast path where
trylock_page() succeeds - though, now that I think about it, both could
be avoided with an inline function in a header returning
trylock_page(page) || __lock_page_or_retry(page, mm, flags).

Hmmm, this is actually quite similar to how other functions in
pagemap.h / filemap.c are done...
I'll send an updated series using this suggestion.
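
Concretely, something like this (untested sketch - the updated series
will have the real version):

	/* include/linux/pagemap.h: inline fast path */
	static inline int lock_page_or_retry(struct page *page,
					     struct mm_struct *mm,
					     unsigned int flags)
	{
		return trylock_page(page) ||
		       __lock_page_or_retry(page, mm, flags);
	}

with __lock_page_or_retry() defined out of line in mm/filemap.c,
handling the __lock_page() vs drop-mmap_sem-and-wait slow paths.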

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/3] Retry page fault when blocking on disk transfer.
  2010-10-08 20:06             ` Michel Lespinasse
@ 2010-10-09  1:22               ` Michel Lespinasse
  -1 siblings, 0 replies; 46+ messages in thread
From: Michel Lespinasse @ 2010-10-09  1:22 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Linus Torvalds, linux-mm, Ying Han, linux-kernel, Andrew Morton,
	Nick Piggin, Peter Zijlstra


Second try on adding the VM_FAULT_RETRY functionality to the swap-in path.

This proposal would replace [patch 2/3] of this series (the initial
version of it, which was approved by linus / rik / hpa).

Changes since the approved version:

- split lock_page_or_retry() into an inline function in pagemap.h,
  handling the trylock_page() fast path, and __lock_page_or_retry() in
  filemap.c, handling the blocking path (with or without retry).

- make do_swap_page() call lock_page_or_retry() in place of lock_page(),
  and handle the retry case.

---------------------------------- 8< -----------------------------------

Retry page fault when blocking on disk transfer.
    
This change reduces mmap_sem hold times that are caused by waiting for
disk transfers when accessing file mapped VMAs or swap space.
It introduces the FAULT_FLAG_ALLOW_RETRY flag, which indicates that the
call site wants mmap_sem to be released if blocking on a pending
disk transfer. In that case, handle_mm_fault() returns the
VM_FAULT_RETRY status flag and do_page_fault() will then re-acquire
mmap_sem and retry the page fault. It is expected that the retry will
hit the same page, which will now be cached, and thus it will
complete with a low mmap_sem hold time.

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 arch/x86/mm/fault.c     |   38 ++++++++++++++++++++++++++------------
 include/linux/mm.h      |    2 ++
 include/linux/pagemap.h |   13 +++++++++++++
 mm/filemap.c            |   16 +++++++++++++++-
 mm/memory.c             |   10 ++++++++--
 5 files changed, 64 insertions(+), 15 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 4c4508e..b355b92 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -952,8 +952,10 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code)
 	struct task_struct *tsk;
 	unsigned long address;
 	struct mm_struct *mm;
-	int write;
 	int fault;
+	int write = error_code & PF_WRITE;
+	unsigned int flags = FAULT_FLAG_ALLOW_RETRY |
+					(write ? FAULT_FLAG_WRITE : 0);
 
 	tsk = current;
 	mm = tsk->mm;
@@ -1064,6 +1066,7 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code)
 			bad_area_nosemaphore(regs, error_code, address);
 			return;
 		}
+retry:
 		down_read(&mm->mmap_sem);
 	} else {
 		/*
@@ -1107,8 +1110,6 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code)
 	 * we can handle it..
 	 */
 good_area:
-	write = error_code & PF_WRITE;
-
 	if (unlikely(access_error(error_code, write, vma))) {
 		bad_area_access_error(regs, error_code, address);
 		return;
@@ -1119,21 +1120,34 @@ good_area:
 	 * make sure we exit gracefully rather than endlessly redo
 	 * the fault:
 	 */
-	fault = handle_mm_fault(mm, vma, address, write ? FAULT_FLAG_WRITE : 0);
+	fault = handle_mm_fault(mm, vma, address, flags);
 
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		mm_fault_error(regs, error_code, address, fault);
 		return;
 	}
 
-	if (fault & VM_FAULT_MAJOR) {
-		tsk->maj_flt++;
-		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, 0,
-				     regs, address);
-	} else {
-		tsk->min_flt++;
-		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, 0,
-				     regs, address);
+	/*
+	 * Major/minor page fault accounting is only done on the
+	 * initial attempt. If we go through a retry, it is extremely
+	 * likely that the page will be found in page cache at that point.
+	 */
+	if (flags & FAULT_FLAG_ALLOW_RETRY) {
+		if (fault & VM_FAULT_MAJOR) {
+			tsk->maj_flt++;
+			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, 0,
+				      regs, address);
+		} else {
+			tsk->min_flt++;
+			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, 0,
+				      regs, address);
+		}
+		if (fault & VM_FAULT_RETRY) {
+			/* Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk
+			 * of starvation. */
+			flags &= ~FAULT_FLAG_ALLOW_RETRY;
+			goto retry;
+		}
 	}
 
 	check_v8086_mode(regs, address, tsk);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 74949fb..0b4f9b2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -144,6 +144,7 @@ extern pgprot_t protection_map[16];
 #define FAULT_FLAG_WRITE	0x01	/* Fault was a write access */
 #define FAULT_FLAG_NONLINEAR	0x02	/* Fault was via a nonlinear mapping */
 #define FAULT_FLAG_MKWRITE	0x04	/* Fault was mkwrite of existing pte */
+#define FAULT_FLAG_ALLOW_RETRY	0x08	/* Retry fault if blocking */
 
 /*
  * This interface is used by x86 PAT code to identify a pfn mapping that is
@@ -722,6 +723,7 @@ static inline int page_mapped(struct page *page)
 
 #define VM_FAULT_NOPAGE	0x0100	/* ->fault installed the pte, not return page */
 #define VM_FAULT_LOCKED	0x0200	/* ->fault locked the returned page */
+#define VM_FAULT_RETRY	0x0400	/* ->fault blocked, must retry */
 
 #define VM_FAULT_ERROR	(VM_FAULT_OOM | VM_FAULT_SIGBUS | VM_FAULT_HWPOISON)
 
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index e12cdc6..2d1ffe3 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -299,6 +299,8 @@ static inline pgoff_t linear_page_index(struct vm_area_struct *vma,
 extern void __lock_page(struct page *page);
 extern int __lock_page_killable(struct page *page);
 extern void __lock_page_nosync(struct page *page);
+extern int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
+				unsigned int flags);
 extern void unlock_page(struct page *page);
 
 static inline void __set_page_locked(struct page *page)
@@ -351,6 +353,17 @@ static inline void lock_page_nosync(struct page *page)
 }
 	
 /*
+ * lock_page_or_retry - Lock the page, unless this would block and the
+ * caller indicated that it can handle a retry.
+ */
+static inline int lock_page_or_retry(struct page *page, struct mm_struct *mm,
+				     unsigned int flags)
+{
+	might_sleep();
+	return trylock_page(page) || __lock_page_or_retry(page, mm, flags);
+}
+
+/*
  * This is exported only for wait_on_page_locked/wait_on_page_writeback.
  * Never use this directly!
  */
diff --git a/mm/filemap.c b/mm/filemap.c
index 8ed709a..2eeb6c5 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -612,6 +612,19 @@ void __lock_page_nosync(struct page *page)
 							TASK_UNINTERRUPTIBLE);
 }
 
+int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
+			 unsigned int flags)
+{
+	if (!(flags & FAULT_FLAG_ALLOW_RETRY)) {
+		__lock_page(page);
+		return 1;
+	} else {
+		up_read(&mm->mmap_sem);
+		wait_on_page_locked(page);
+		return 0;
+	}
+}
+
 /**
  * find_get_page - find and get a page reference
  * @mapping: the address_space to search
@@ -1550,7 +1563,8 @@ retry_find:
 			goto no_cached_page;
 	}
 
-	lock_page(page);
+	if (!lock_page_or_retry(page, &vma->vm_mm, vmf->flags))
+		return ret | VM_FAULT_RETRY;
 
 	/* Did it get truncated? */
 	if (unlikely(page->mapping != mapping)) {
diff --git a/mm/memory.c b/mm/memory.c
index 0e18b4d..362e803 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2626,6 +2626,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct page *page, *swapcache = NULL;
 	swp_entry_t entry;
 	pte_t pte;
+	int locked;
 	struct mem_cgroup *ptr = NULL;
 	int exclusive = 0;
 	int ret = 0;
@@ -2676,8 +2677,12 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		goto out_release;
 	}
 
-	lock_page(page);
+	locked = lock_page_or_retry(page, mm, flags);
 	delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
+	if (!locked) {
+		ret |= VM_FAULT_RETRY;
+		goto out_release;
+	}
 
 	/*
 	 * Make sure try_to_free_swap or reuse_swap_page or swapoff did not
@@ -2926,7 +2931,8 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	vmf.page = NULL;
 
 	ret = vma->vm_ops->fault(vma, &vmf);
-	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))
+	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE |
+			    VM_FAULT_RETRY)))
 		return ret;
 
 	if (unlikely(PageHWPoison(vmf.page))) {


-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/3] Retry page fault when blocking on disk transfer.
  2010-10-09  1:22               ` Michel Lespinasse
@ 2010-10-11 22:25                 ` Andrew Morton
  -1 siblings, 0 replies; 46+ messages in thread
From: Andrew Morton @ 2010-10-11 22:25 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Rik van Riel, Linus Torvalds, linux-mm, Ying Han, linux-kernel,
	Nick Piggin, Peter Zijlstra

On Fri, 8 Oct 2010 18:22:04 -0700
Michel Lespinasse <walken@google.com> wrote:

> Second try on adding the VM_FAULT_RETRY functionality to the swap in path.
> 
> This proposal would replace [patch 2/3] of this series (the initial
> version of it, which was approved by linus / rik / hpa).
> 
> Changes since the approved version:
> 
> - split lock_page_or_retry() into an inline function in  pagemap.h,
>   handling the trylock_page() fast path, and __lock_page_or_retry() in
>   filemap.c, handling the blocking path (with or without retry).
> 
> - make do_swap_page() call lock_page_or_retry() in place of lock_page(),
>   and handle the retry case.

Replacement patches are a bit cruel to people who've already reviewed
the previous version.  I always turn them into deltas so I can see what
was changed.  It is below.
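
For what it's worth, such a delta can also be generated mechanically
with interdiff from patchutils (interdiff v1.patch v2.patch >
delta.patch, file names being placeholders), assuming both versions
apply to the same base tree.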

How well was the new swapin path tested?


 include/linux/pagemap.h |   13 +++++++++++++
 mm/filemap.c            |   35 ++++++++++++++---------------------
 mm/memory.c             |    7 ++++++-
 3 files changed, 33 insertions(+), 22 deletions(-)

diff -puN include/linux/pagemap.h~mm-retry-page-fault-when-blocking-on-disk-transfer-update include/linux/pagemap.h
--- a/include/linux/pagemap.h~mm-retry-page-fault-when-blocking-on-disk-transfer-update
+++ a/include/linux/pagemap.h
@@ -299,6 +299,8 @@ static inline pgoff_t linear_page_index(
 extern void __lock_page(struct page *page);
 extern int __lock_page_killable(struct page *page);
 extern void __lock_page_nosync(struct page *page);
+extern int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
+				unsigned int flags);
 extern void unlock_page(struct page *page);
 
 static inline void __set_page_locked(struct page *page)
@@ -351,6 +353,17 @@ static inline void lock_page_nosync(stru
 }
 	
 /*
+ * lock_page_or_retry - Lock the page, unless this would block and the
+ * caller indicated that it can handle a retry.
+ */
+static inline int lock_page_or_retry(struct page *page, struct mm_struct *mm,
+				     unsigned int flags)
+{
+	might_sleep();
+	return trylock_page(page) || __lock_page_or_retry(page, mm, flags);
+}
+
+/*
  * This is exported only for wait_on_page_locked/wait_on_page_writeback.
  * Never use this directly!
  */
diff -puN mm/filemap.c~mm-retry-page-fault-when-blocking-on-disk-transfer-update mm/filemap.c
--- a/mm/filemap.c~mm-retry-page-fault-when-blocking-on-disk-transfer-update
+++ a/mm/filemap.c
@@ -623,6 +623,19 @@ void __lock_page_nosync(struct page *pag
 							TASK_UNINTERRUPTIBLE);
 }
 
+int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
+			 unsigned int flags)
+{
+	if (!(flags & FAULT_FLAG_ALLOW_RETRY)) {
+		__lock_page(page);
+		return 1;
+	} else {
+		up_read(&mm->mmap_sem);
+		wait_on_page_locked(page);
+		return 0;
+	}
+}
+
 /**
  * find_get_page - find and get a page reference
  * @mapping: the address_space to search
@@ -1512,26 +1525,6 @@ static void do_async_mmap_readahead(stru
 					   page, offset, ra->ra_pages);
 }
 
-/*
- * Lock the page, unless this would block and the caller indicated that it
- * can handle a retry.
- */
-static int lock_page_or_retry(struct page *page,
-			      struct vm_area_struct *vma, struct vm_fault *vmf)
-{
-	if (trylock_page(page))
-		return 1;
-	if (!(vmf->flags & FAULT_FLAG_ALLOW_RETRY)) {
-		__lock_page(page);
-		return 1;
-	}
-
-	up_read(&vma->vm_mm->mmap_sem);
-	wait_on_page_locked(page);
-	page_cache_release(page);
-	return 0;
-}
-
 /**
  * filemap_fault - read in file data for page fault handling
  * @vma:	vma in which the fault was taken
@@ -1581,7 +1574,7 @@ retry_find:
 			goto no_cached_page;
 	}
 
-	if (!lock_page_or_retry(page, vma, vmf))
+	if (!lock_page_or_retry(page, &vma->vm_mm, vmf->flags))
 		return ret | VM_FAULT_RETRY;
 
 	/* Did it get truncated? */
diff -puN mm/memory.c~mm-retry-page-fault-when-blocking-on-disk-transfer-update mm/memory.c
--- a/mm/memory.c~mm-retry-page-fault-when-blocking-on-disk-transfer-update
+++ a/mm/memory.c
@@ -2627,6 +2627,7 @@ static int do_swap_page(struct mm_struct
 	struct page *page, *swapcache = NULL;
 	swp_entry_t entry;
 	pte_t pte;
+	int locked;
 	struct mem_cgroup *ptr = NULL;
 	int exclusive = 0;
 	int ret = 0;
@@ -2677,8 +2678,12 @@ static int do_swap_page(struct mm_struct
 		goto out_release;
 	}
 
-	lock_page(page);
+	locked = lock_page_or_retry(page, mm, flags);
 	delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
+	if (!locked) {
+		ret |= VM_FAULT_RETRY;
+		goto out_release;
+	}
 
 	/*
 	 * Make sure try_to_free_swap or reuse_swap_page or swapoff did not
_




^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/3] Retry page fault when blocking on disk transfer.
  2010-10-11 22:25                 ` Andrew Morton
@ 2010-10-11 22:43                   ` Michel Lespinasse
  -1 siblings, 0 replies; 46+ messages in thread
From: Michel Lespinasse @ 2010-10-11 22:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Linus Torvalds, linux-mm, Ying Han, linux-kernel,
	Nick Piggin, Peter Zijlstra

On Mon, Oct 11, 2010 at 3:25 PM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> Replacement patches are a bit cruel to people who've already reviewed
> the previous version.  I always turn them into deltas so I can see what
> was changed.  It is below.

Thanks Andrew. Sorry for the trouble, I'll know to avoid this next time.

> How well was the new swapin path tested?

Not as well as the file backed path - it hasn't gotten real production
use yet. However, the plan is that this change will be in Google's next
kernel update, so it will get a lot more testing soon (starting
in ~1 week).

I did basic testing by dirtying an anon VMA larger than memory, then
accessing it in random order while another thread runs a mmap/munmap
loop, and checking that things behave as expected (i.e. the patch
allows the mmap/munmap thread to make progress without waiting for
the other thread's swap-ins).
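
Roughly, the test looked like this (a simplified sketch, not the
actual code; the size constant is made up and must exceed physical RAM
so the random touches force swap-ins):

	#include <pthread.h>
	#include <stdlib.h>
	#include <string.h>
	#include <sys/mman.h>

	#define SIZE	(64UL << 30)	/* pick larger than RAM */
	#define PAGE	4096UL

	static char *buf;

	/* thread B: stress mmap_sem writers with a mmap/munmap loop */
	static void *mapper(void *arg)
	{
		for (;;) {
			void *p = mmap(NULL, PAGE, PROT_READ | PROT_WRITE,
				       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
			munmap(p, PAGE);
		}
		return NULL;
	}

	int main(void)
	{
		pthread_t t;

		/* thread A: dirty a huge anon VMA, then fault it back
		 * in at random (error checking omitted for brevity) */
		buf = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
			   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		memset(buf, 1, SIZE);
		pthread_create(&t, NULL, mapper, NULL);
		for (;;)
			buf[((size_t)random() % (SIZE / PAGE)) * PAGE]++;
	}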

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/3] Retry page fault when blocking on disk transfer.
  2010-10-09  1:22               ` Michel Lespinasse
@ 2010-10-13 23:17                 ` Andrew Morton
  -1 siblings, 0 replies; 46+ messages in thread
From: Andrew Morton @ 2010-10-13 23:17 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Rik van Riel, Linus Torvalds, linux-mm, Ying Han, linux-kernel,
	Nick Piggin, Peter Zijlstra

On Fri, 8 Oct 2010 18:22:04 -0700
Michel Lespinasse <walken@google.com> wrote:

> 
> Second try on adding the VM_FAULT_RETRY functionality to the swap in path.
> 
> This proposal would replace [patch 2/3] of this series (the initial
> version of it, which was approved by linus / rik / hpa).
> 
> Changes since the approved version:
> 
> - split lock_page_or_retry() into an inline function in  pagemap.h,
>   handling the trylock_page() fast path, and __lock_page_or_retry() in
>   filemap.c, handling the blocking path (with or without retry).
> 
> - make do_swap_page() call lock_page_or_retry() in place of lock_page(),
>   and handle the retry case.
>
> ...
> 
> @@ -1550,7 +1563,8 @@ retry_find:
>  			goto no_cached_page;
>  	}
>  
> -	lock_page(page);
> +	if (!lock_page_or_retry(page, &vma->vm_mm, vmf->flags))
> +		return ret | VM_FAULT_RETRY;
>  
>  	/* Did it get truncated? */
>  	if (unlikely(page->mapping != mapping)) {

mm/filemap.c: In function 'filemap_fault':
mm/filemap.c:1577: warning: passing argument 2 of 'lock_page_or_retry' from incompatible pointer type

--- a/mm/filemap.c~mm-retry-page-fault-when-blocking-on-disk-transfer-update-fix
+++ a/mm/filemap.c
@@ -1574,7 +1574,7 @@ retry_find:
 			goto no_cached_page;
 	}
 
-	if (!lock_page_or_retry(page, &vma->vm_mm, vmf->flags))
+	if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags))
 		return ret | VM_FAULT_RETRY;
 
 	/* Did it get truncated? */
_

the runtime effects are rather ghastly - a null-pointer deref deep in
the bowels of lockdep, early during /sbin/init execution (below).  But
for some reason the kernel then runs OK (!).
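
The mechanism, as far as I can tell: the stray '&' passes a
struct mm_struct ** where the helper expects a struct mm_struct *, so
the slow path effectively does

	mm = (struct mm_struct *)&vma->vm_mm;	/* points into the vma... */
	up_read(&mm->mmap_sem);			/* ...so this "releases" a
						   rwsem at a bogus address */

and lockdep duly reports the unlock imbalance.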

Probably you sent an older version of the patch by mistake.  Can you
please check whether anything else was missed?  I've appended my
current copy of this patch: it's a rollup of
mm-retry-page-fault-when-blocking-on-disk-transfer.patch,
mm-retry-page-fault-when-blocking-on-disk-transfer-update.patch and
mm-retry-page-fault-when-blocking-on-disk-transfer-update-fix.patch.



[   25.970482] =====================================
[   25.970804] [ BUG: bad unlock balance detected! ]
[   25.970968] -------------------------------------
[   25.971132] init/1 is trying to release lock (
[   25.971190] BUG: unable to handle kernel paging request at 0000000000876000
[   25.971594] IP: [<ffffffff8119089f>] strnlen+0x15/0x22
[   25.971807] PGD 255ea9067 PUD 255fb0067 PMD 0 
[   25.972104] Oops: 0000 [#1] SMP 
[   25.972352] last sysfs file: /sys/block/sdb/dev
[   25.972516] CPU 5 
[   25.972563] Modules linked in: ehci_hcd ohci_hcd uhci_hcd
[   25.973052] 
[   25.973212] Pid: 1, comm: init Not tainted 2.6.36-rc7-mm1 #14 /
[   25.973376] RIP: 0010:[<ffffffff8119089f>]  [<ffffffff8119089f>] strnlen+0x15/0x22
[   25.973702] RSP: 0018:ffff8802578d34d8  EFLAGS: 00010012
[   25.973861] RAX: 0000000000875fff RBX: 0000000000000000 RCX: ffffffffff0a0004
[   25.973956] RDX: 0000000000876000 RSI: ffffffffffffffff RDI: 0000000000876000
[   25.973956] RBP: ffff8802578d34d8 R08: 0000000000000002 R09: 0000000000000000
[   25.973956] R10: 0000000000000001 R11: ffff880255de6900 R12: ffffffff81ac9940
[   25.973956] R13: 0000000000876000 R14: 000000000000ffff R15: ffffffff81ac9d40
[   25.973956] FS:  0000000000000000(0000) GS:ffff88009e800000(0000) knlGS:0000000000000000
[   25.973956] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[   25.973956] CR2: 0000000000876000 CR3: 0000000255916000 CR4: 00000000000006e0
[   25.973956] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   25.973956] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[   25.973956] Process init (pid: 1, threadinfo ffff8802578d2000, task ffff8802578d0040)
[   25.973956] Stack:
[   25.973956]  ffff8802578d3518 ffffffff811912f2 ffffffff81ac9940 ffff8802578d36e8
[   25.973956] <0> ffffffff81ac9940 ffffffff81549849 ffffffff81ac9d40 ffffffff81ac9940
[   25.973956] <0> ffff8802578d3578 ffffffff8119283a 0000000000000400 ffffffff8154984b
[   25.973956] Call Trace:
[   25.973956]  [<ffffffff811912f2>] string+0x51/0xb5
[   25.973956]  [<ffffffff8119283a>] vsnprintf+0x1d7/0x42e
[   25.973956]  [<ffffffff81192bee>] vscnprintf+0xf/0x21
[   25.973956]  [<ffffffff810387cd>] vprintk+0x1c2/0x3be
[   25.973956]  [<ffffffff81386a74>] ? _raw_spin_unlock_irqrestore+0x38/0x47
[   25.973956]  [<ffffffff8103846b>] ? release_console_sem+0x1ad/0x1ba
[   25.973956]  [<ffffffff813868af>] ? _raw_spin_unlock_irq+0x29/0x2f
[   25.973956]  [<ffffffff8105d83a>] ? trace_hardirqs_on+0xd/0xf
[   25.973956]  [<ffffffff813868af>] ? _raw_spin_unlock_irq+0x29/0x2f
[   25.973956]  [<ffffffff8117d0ae>] ? __make_request+0x409/0x42a
[   25.973956]  [<ffffffff8108d97c>] ? __lock_page_or_retry+0x25/0x41
[   25.973956]  [<ffffffff81038a30>] printk+0x67/0x69
[   25.973956]  [<ffffffff8108d97c>] ? __lock_page_or_retry+0x25/0x41
[   25.973956]  [<ffffffff81038a30>] ? printk+0x67/0x69
[   25.973956]  [<ffffffff8108cf0e>] ? add_to_page_cache_locked+0xa0/0xca
[   25.973956]  [<ffffffff8105a650>] print_lockdep_cache+0x2e/0x30
[   25.973956]  [<ffffffff810eb822>] ? mpage_bio_submit+0x22/0x26
[   25.973956]  [<ffffffff810ec054>] ? mpage_readpages+0xff/0x113
[   25.973956]  [<ffffffff81121b01>] ? ext3_get_block+0x0/0xf5
[   25.973956]  [<ffffffff81121b01>] ? ext3_get_block+0x0/0xf5
[   25.973956]  [<ffffffff81053b32>] ? sched_clock_local+0xe/0x73
[   25.973956]  [<ffffffff81053cd7>] ? sched_clock_cpu+0xba/0xc5
[   25.973956]  [<ffffffff8105bade>] print_unlock_inbalance_bug+0x7b/0xde
[   25.973956]  [<ffffffff8105fc03>] lock_release_non_nested+0xb8/0x23d
[   25.973956]  [<ffffffff81053b32>] ? sched_clock_local+0xe/0x73
[   25.973956]  [<ffffffff8108d97c>] ? __lock_page_or_retry+0x25/0x41
[   25.973956]  [<ffffffff8105aaae>] ? trace_hardirqs_off+0xd/0xf
[   25.973956]  [<ffffffff81053d0d>] ? local_clock+0x2b/0x3c
[   25.973956]  [<ffffffff8108d97c>] ? __lock_page_or_retry+0x25/0x41
[   25.973956]  [<ffffffff8105fedb>] lock_release+0x153/0x181
[   25.973956]  [<ffffffff8105292f>] up_read+0x1c/0x35
[   25.973956]  [<ffffffff8108d97c>] __lock_page_or_retry+0x25/0x41
[   25.973956]  [<ffffffff8108db67>] filemap_fault+0x1cf/0x363
[   25.973956]  [<ffffffff810a20ab>] __do_fault+0x54/0x40f
[   25.973956]  [<ffffffff810a50dd>] handle_mm_fault+0x1d5/0x7af
[   25.973956]  [<ffffffff81389d06>] do_page_fault+0x2c9/0x429
[   25.973956]  [<ffffffff810a77c4>] ? vma_link+0x85/0x96
[   25.973956]  [<ffffffff81053b32>] ? sched_clock_local+0xe/0x73
[   25.973956]  [<ffffffff81053cd7>] ? sched_clock_cpu+0xba/0xc5
[   25.973956]  [<ffffffff8105aaae>] ? trace_hardirqs_off+0xd/0xf
[   25.973956]  [<ffffffff81053d0d>] ? local_clock+0x2b/0x3c
[   25.973956]  [<ffffffff8138600b>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[   25.973956]  [<ffffffff81386f5f>] page_fault+0x1f/0x30
[   25.973956]  [<ffffffff811940b4>] ? __clear_user+0x2e/0x50
[   25.973956]  [<ffffffff81194098>] ? __clear_user+0x12/0x50
[   25.973956]  [<ffffffff81194200>] clear_user+0x2b/0x33
[   25.973956]  [<ffffffff810fc2cb>] padzero+0x1b/0x2b
[   25.973956]  [<ffffffff810fd294>] load_elf_binary+0x8e0/0x16c7
[   25.973956]  [<ffffffff8105aaae>] ? trace_hardirqs_off+0xd/0xf
[   25.973956]  [<ffffffff81053d0d>] ? local_clock+0x2b/0x3c
[   25.973956]  [<ffffffff810fc9b4>] ? load_elf_binary+0x0/0x16c7
[   25.973956]  [<ffffffff810c7aa4>] search_binary_handler+0xc3/0x288
[   25.973956]  [<ffffffff810c7e09>] do_execve+0x1a0/0x266
[   25.973956]  [<ffffffff81009586>] sys_execve+0x3c/0x57
[   25.973956]  [<ffffffff81002ecc>] stub_execve+0x6c/0xc0
[   25.973956] Code: 03 48 ff c7 48 0f b6 07 f6 80 a0 55 42 81 20 75 f0 c9 48 89 f8 c3 55 48 89 fa 48 89 e5 eb 03 48 ff c2 48 8d 04 37 48 39 c2 74 05 <80> 3a 00 75 ef c9 48 29 fa 48 89 d0 c3 55 31 c0 48 89 e5 eb 13 
[   25.973956] RIP  [<ffffffff8119089f>] strnlen+0x15/0x22
[   25.973956]  RSP <ffff8802578d34d8>
[   25.973956] CR2: 0000000000876000




 arch/x86/mm/fault.c     |   38 ++++++++++++++++++++++++++------------
 include/linux/mm.h      |    2 ++
 include/linux/pagemap.h |   13 +++++++++++++
 mm/filemap.c            |   16 +++++++++++++++-
 mm/memory.c             |   10 ++++++++--
 5 files changed, 64 insertions(+), 15 deletions(-)

diff -puN arch/x86/mm/fault.c~mm-retry-page-fault-when-blocking-on-disk-transfer arch/x86/mm/fault.c
--- a/arch/x86/mm/fault.c~mm-retry-page-fault-when-blocking-on-disk-transfer
+++ a/arch/x86/mm/fault.c
@@ -943,8 +943,10 @@ do_page_fault(struct pt_regs *regs, unsi
 	struct task_struct *tsk;
 	unsigned long address;
 	struct mm_struct *mm;
-	int write;
 	int fault;
+	int write = error_code & PF_WRITE;
+	unsigned int flags = FAULT_FLAG_ALLOW_RETRY |
+					(write ? FAULT_FLAG_WRITE : 0);
 
 	tsk = current;
 	mm = tsk->mm;
@@ -1055,6 +1057,7 @@ do_page_fault(struct pt_regs *regs, unsi
 			bad_area_nosemaphore(regs, error_code, address);
 			return;
 		}
+retry:
 		down_read(&mm->mmap_sem);
 	} else {
 		/*
@@ -1098,8 +1101,6 @@ do_page_fault(struct pt_regs *regs, unsi
 	 * we can handle it..
 	 */
 good_area:
-	write = error_code & PF_WRITE;
-
 	if (unlikely(access_error(error_code, write, vma))) {
 		bad_area_access_error(regs, error_code, address);
 		return;
@@ -1110,21 +1111,34 @@ good_area:
 	 * make sure we exit gracefully rather than endlessly redo
 	 * the fault:
 	 */
-	fault = handle_mm_fault(mm, vma, address, write ? FAULT_FLAG_WRITE : 0);
+	fault = handle_mm_fault(mm, vma, address, flags);
 
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		mm_fault_error(regs, error_code, address, fault);
 		return;
 	}
 
-	if (fault & VM_FAULT_MAJOR) {
-		tsk->maj_flt++;
-		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, 0,
-				     regs, address);
-	} else {
-		tsk->min_flt++;
-		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, 0,
-				     regs, address);
+	/*
+	 * Major/minor page fault accounting is only done on the
+	 * initial attempt. If we go through a retry, it is extremely
+	 * likely that the page will be found in page cache at that point.
+	 */
+	if (flags & FAULT_FLAG_ALLOW_RETRY) {
+		if (fault & VM_FAULT_MAJOR) {
+			tsk->maj_flt++;
+			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, 0,
+				      regs, address);
+		} else {
+			tsk->min_flt++;
+			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, 0,
+				      regs, address);
+		}
+		if (fault & VM_FAULT_RETRY) {
+			/* Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk
+			 * of starvation. */
+			flags &= ~FAULT_FLAG_ALLOW_RETRY;
+			goto retry;
+		}
 	}
 
 	check_v8086_mode(regs, address, tsk);
diff -puN include/linux/mm.h~mm-retry-page-fault-when-blocking-on-disk-transfer include/linux/mm.h
--- a/include/linux/mm.h~mm-retry-page-fault-when-blocking-on-disk-transfer
+++ a/include/linux/mm.h
@@ -144,6 +144,7 @@ extern pgprot_t protection_map[16];
 #define FAULT_FLAG_WRITE	0x01	/* Fault was a write access */
 #define FAULT_FLAG_NONLINEAR	0x02	/* Fault was via a nonlinear mapping */
 #define FAULT_FLAG_MKWRITE	0x04	/* Fault was mkwrite of existing pte */
+#define FAULT_FLAG_ALLOW_RETRY	0x08	/* Retry fault if blocking */
 
 /*
  * This interface is used by x86 PAT code to identify a pfn mapping that is
@@ -723,6 +724,7 @@ static inline int page_mapped(struct pag
 
 #define VM_FAULT_NOPAGE	0x0100	/* ->fault installed the pte, not return page */
 #define VM_FAULT_LOCKED	0x0200	/* ->fault locked the returned page */
+#define VM_FAULT_RETRY	0x0400	/* ->fault blocked, must retry */
 
 #define VM_FAULT_HWPOISON_LARGE_MASK 0xf000 /* encodes hpage index for large hwpoison */
 
diff -puN mm/filemap.c~mm-retry-page-fault-when-blocking-on-disk-transfer mm/filemap.c
--- a/mm/filemap.c~mm-retry-page-fault-when-blocking-on-disk-transfer
+++ a/mm/filemap.c
@@ -623,6 +623,19 @@ void __lock_page_nosync(struct page *pag
 							TASK_UNINTERRUPTIBLE);
 }
 
+int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
+			 unsigned int flags)
+{
+	if (!(flags & FAULT_FLAG_ALLOW_RETRY)) {
+		__lock_page(page);
+		return 1;
+	} else {
+		up_read(&mm->mmap_sem);
+		wait_on_page_locked(page);
+		return 0;
+	}
+}
+
 /**
  * find_get_page - find and get a page reference
  * @mapping: the address_space to search
@@ -1561,7 +1574,8 @@ retry_find:
 			goto no_cached_page;
 	}
 
-	lock_page(page);
+	if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags))
+		return ret | VM_FAULT_RETRY;
 
 	/* Did it get truncated? */
 	if (unlikely(page->mapping != mapping)) {
diff -puN mm/memory.c~mm-retry-page-fault-when-blocking-on-disk-transfer mm/memory.c
--- a/mm/memory.c~mm-retry-page-fault-when-blocking-on-disk-transfer
+++ a/mm/memory.c
@@ -2627,6 +2627,7 @@ static int do_swap_page(struct mm_struct
 	struct page *page, *swapcache = NULL;
 	swp_entry_t entry;
 	pte_t pte;
+	int locked;
 	struct mem_cgroup *ptr = NULL;
 	int exclusive = 0;
 	int ret = 0;
@@ -2677,8 +2678,12 @@ static int do_swap_page(struct mm_struct
 		goto out_release;
 	}
 
-	lock_page(page);
+	locked = lock_page_or_retry(page, mm, flags);
 	delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
+	if (!locked) {
+		ret |= VM_FAULT_RETRY;
+		goto out_release;
+	}
 
 	/*
 	 * Make sure try_to_free_swap or reuse_swap_page or swapoff did not
@@ -2927,7 +2932,8 @@ static int __do_fault(struct mm_struct *
 	vmf.page = NULL;
 
 	ret = vma->vm_ops->fault(vma, &vmf);
-	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))
+	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE |
+			    VM_FAULT_RETRY)))
 		return ret;
 
 	if (unlikely(PageHWPoison(vmf.page))) {
diff -puN include/linux/pagemap.h~mm-retry-page-fault-when-blocking-on-disk-transfer include/linux/pagemap.h
--- a/include/linux/pagemap.h~mm-retry-page-fault-when-blocking-on-disk-transfer
+++ a/include/linux/pagemap.h
@@ -299,6 +299,8 @@ static inline pgoff_t linear_page_index(
 extern void __lock_page(struct page *page);
 extern int __lock_page_killable(struct page *page);
 extern void __lock_page_nosync(struct page *page);
+extern int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
+				unsigned int flags);
 extern void unlock_page(struct page *page);
 
 static inline void __set_page_locked(struct page *page)
@@ -351,6 +353,17 @@ static inline void lock_page_nosync(stru
 }
 	
 /*
+ * lock_page_or_retry - Lock the page, unless this would block and the
+ * caller indicated that it can handle a retry.
+ */
+static inline int lock_page_or_retry(struct page *page, struct mm_struct *mm,
+				     unsigned int flags)
+{
+	might_sleep();
+	return trylock_page(page) || __lock_page_or_retry(page, mm, flags);
+}
+
+/*
  * This is exported only for wait_on_page_locked/wait_on_page_writeback.
  * Never use this directly!
  */
_


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/3] Retry page fault when blocking on disk transfer.
@ 2010-10-13 23:17                 ` Andrew Morton
  0 siblings, 0 replies; 46+ messages in thread
From: Andrew Morton @ 2010-10-13 23:17 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Rik van Riel, Linus Torvalds, linux-mm, Ying Han, linux-kernel,
	Nick Piggin, Peter Zijlstra

On Fri, 8 Oct 2010 18:22:04 -0700
Michel Lespinasse <walken@google.com> wrote:

> 
> Second try on adding the VM_FAULT_RETRY functionality to the swap in path.
> 
> This proposal would replace [patch 2/3] of this series (the initial
> version of it, which was approved by linus / rik / hpa).
> 
> Changes since the approved version:
> 
> - split lock_page_or_retry() into an inline function in  pagemap.h,
>   handling the trylock_page() fast path, and __lock_page_or_retry() in
>   filemap.c, handling the blocking path (with or without retry).
> 
> - make do_swap_page() call lock_page_or_retry() in place of lock_page(),
>   and handle the retry case.
>
> ...
> 
> @@ -1550,7 +1563,8 @@ retry_find:
>  			goto no_cached_page;
>  	}
>  
> -	lock_page(page);
> +	if (!lock_page_or_retry(page, &vma->vm_mm, vmf->flags))
> +		return ret | VM_FAULT_RETRY;
>  
>  	/* Did it get truncated? */
>  	if (unlikely(page->mapping != mapping)) {

mm/filemap.c: In function 'filemap_fault':
mm/filemap.c:1577: warning: passing argument 2 of 'lock_page_or_retry' from incompatible pointer type

--- a/mm/filemap.c~mm-retry-page-fault-when-blocking-on-disk-transfer-update-fix
+++ a/mm/filemap.c
@@ -1574,7 +1574,7 @@ retry_find:
 			goto no_cached_page;
 	}
 
-	if (!lock_page_or_retry(page, &vma->vm_mm, vmf->flags))
+	if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags))
 		return ret | VM_FAULT_RETRY;
 
 	/* Did it get truncated? */
_

The runtime effects are rather ghastly - a null-pointer deref deep in
the bowels of lockdep, early during /sbin/init execution (below).  But
for some reason the kernel then runs OK (!).
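
For readers following along, here is a minimal userspace sketch of why
that incompatible-pointer warning is fatal at runtime.  The struct
names below are illustrative stand-ins, not the kernel's types, and in
the kernel build the bad conversion happens implicitly (hence only a
warning); the explicit cast here just keeps the demo compiling quietly.
Passing &vma->vm_mm (an mm_struct **) where an mm_struct * is expected
makes up_read(&mm->mmap_sem) compute the rwsem address from a bogus
base - hence lockdep's "bad unlock balance" and the oops while
printing the lock name.

	#include <stdio.h>

	struct mm { long sem_count; };		/* stands in for mmap_sem */
	struct vma { struct mm *vm_mm; };

	static void unlock_sem(struct mm *mm)
	{
		/* Mirrors up_read(&mm->mmap_sem): member access through
		 * the parameter, so a mis-typed argument dereferences a
		 * bogus base address. */
		printf("sem_count at %p = %ld\n", (void *)&mm->sem_count,
		       mm->sem_count);
	}

	int main(void)
	{
		struct mm real_mm = { .sem_count = 1 };
		struct vma v = { .vm_mm = &real_mm };

		unlock_sem(v.vm_mm);		   /* correct: prints 1 */
		unlock_sem((struct mm *)&v.vm_mm); /* the bug: the pointer
						      field itself is read
						      back as the struct, so
						      the "count" is the raw
						      bits of the pointer */
		return 0;
	}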

Probably you sent an older version of the patch by mistake.  Can you
please check whether anything else was missed?  I've appended my
current copy of this patch: it's a rollup of
mm-retry-page-fault-when-blocking-on-disk-transfer.patch,
mm-retry-page-fault-when-blocking-on-disk-transfer-update.patch and
mm-retry-page-fault-when-blocking-on-disk-transfer-update-fix.patch.



[   25.970482] =====================================
[   25.970804] [ BUG: bad unlock balance detected! ]
[   25.970968] -------------------------------------
[   25.971132] init/1 is trying to release lock (
[   25.971190] BUG: unable to handle kernel paging request at 0000000000876000
[   25.971594] IP: [<ffffffff8119089f>] strnlen+0x15/0x22
[   25.971807] PGD 255ea9067 PUD 255fb0067 PMD 0 
[   25.972104] Oops: 0000 [#1] SMP 
[   25.972352] last sysfs file: /sys/block/sdb/dev
[   25.972516] CPU 5 
[   25.972563] Modules linked in: ehci_hcd ohci_hcd uhci_hcd
[   25.973052] 
[   25.973212] Pid: 1, comm: init Not tainted 2.6.36-rc7-mm1 #14 /
[   25.973376] RIP: 0010:[<ffffffff8119089f>]  [<ffffffff8119089f>] strnlen+0x15/0x22
[   25.973702] RSP: 0018:ffff8802578d34d8  EFLAGS: 00010012
[   25.973861] RAX: 0000000000875fff RBX: 0000000000000000 RCX: ffffffffff0a0004
[   25.973956] RDX: 0000000000876000 RSI: ffffffffffffffff RDI: 0000000000876000
[   25.973956] RBP: ffff8802578d34d8 R08: 0000000000000002 R09: 0000000000000000
[   25.973956] R10: 0000000000000001 R11: ffff880255de6900 R12: ffffffff81ac9940
[   25.973956] R13: 0000000000876000 R14: 000000000000ffff R15: ffffffff81ac9d40
[   25.973956] FS:  0000000000000000(0000) GS:ffff88009e800000(0000) knlGS:0000000000000000
[   25.973956] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[   25.973956] CR2: 0000000000876000 CR3: 0000000255916000 CR4: 00000000000006e0
[   25.973956] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   25.973956] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[   25.973956] Process init (pid: 1, threadinfo ffff8802578d2000, task ffff8802578d0040)
[   25.973956] Stack:
[   25.973956]  ffff8802578d3518 ffffffff811912f2 ffffffff81ac9940 ffff8802578d36e8
[   25.973956] <0> ffffffff81ac9940 ffffffff81549849 ffffffff81ac9d40 ffffffff81ac9940
[   25.973956] <0> ffff8802578d3578 ffffffff8119283a 0000000000000400 ffffffff8154984b
[   25.973956] Call Trace:
[   25.973956]  [<ffffffff811912f2>] string+0x51/0xb5
[   25.973956]  [<ffffffff8119283a>] vsnprintf+0x1d7/0x42e
[   25.973956]  [<ffffffff81192bee>] vscnprintf+0xf/0x21
[   25.973956]  [<ffffffff810387cd>] vprintk+0x1c2/0x3be
[   25.973956]  [<ffffffff81386a74>] ? _raw_spin_unlock_irqrestore+0x38/0x47
[   25.973956]  [<ffffffff8103846b>] ? release_console_sem+0x1ad/0x1ba
[   25.973956]  [<ffffffff813868af>] ? _raw_spin_unlock_irq+0x29/0x2f
[   25.973956]  [<ffffffff8105d83a>] ? trace_hardirqs_on+0xd/0xf
[   25.973956]  [<ffffffff813868af>] ? _raw_spin_unlock_irq+0x29/0x2f
[   25.973956]  [<ffffffff8117d0ae>] ? __make_request+0x409/0x42a
[   25.973956]  [<ffffffff8108d97c>] ? __lock_page_or_retry+0x25/0x41
[   25.973956]  [<ffffffff81038a30>] printk+0x67/0x69
[   25.973956]  [<ffffffff8108d97c>] ? __lock_page_or_retry+0x25/0x41
[   25.973956]  [<ffffffff81038a30>] ? printk+0x67/0x69
[   25.973956]  [<ffffffff8108cf0e>] ? add_to_page_cache_locked+0xa0/0xca
[   25.973956]  [<ffffffff8105a650>] print_lockdep_cache+0x2e/0x30
[   25.973956]  [<ffffffff810eb822>] ? mpage_bio_submit+0x22/0x26
[   25.973956]  [<ffffffff810ec054>] ? mpage_readpages+0xff/0x113
[   25.973956]  [<ffffffff81121b01>] ? ext3_get_block+0x0/0xf5
[   25.973956]  [<ffffffff81121b01>] ? ext3_get_block+0x0/0xf5
[   25.973956]  [<ffffffff81053b32>] ? sched_clock_local+0xe/0x73
[   25.973956]  [<ffffffff81053cd7>] ? sched_clock_cpu+0xba/0xc5
[   25.973956]  [<ffffffff8105bade>] print_unlock_inbalance_bug+0x7b/0xde
[   25.973956]  [<ffffffff8105fc03>] lock_release_non_nested+0xb8/0x23d
[   25.973956]  [<ffffffff81053b32>] ? sched_clock_local+0xe/0x73
[   25.973956]  [<ffffffff8108d97c>] ? __lock_page_or_retry+0x25/0x41
[   25.973956]  [<ffffffff8105aaae>] ? trace_hardirqs_off+0xd/0xf
[   25.973956]  [<ffffffff81053d0d>] ? local_clock+0x2b/0x3c
[   25.973956]  [<ffffffff8108d97c>] ? __lock_page_or_retry+0x25/0x41
[   25.973956]  [<ffffffff8105fedb>] lock_release+0x153/0x181
[   25.973956]  [<ffffffff8105292f>] up_read+0x1c/0x35
[   25.973956]  [<ffffffff8108d97c>] __lock_page_or_retry+0x25/0x41
[   25.973956]  [<ffffffff8108db67>] filemap_fault+0x1cf/0x363
[   25.973956]  [<ffffffff810a20ab>] __do_fault+0x54/0x40f
[   25.973956]  [<ffffffff810a50dd>] handle_mm_fault+0x1d5/0x7af
[   25.973956]  [<ffffffff81389d06>] do_page_fault+0x2c9/0x429
[   25.973956]  [<ffffffff810a77c4>] ? vma_link+0x85/0x96
[   25.973956]  [<ffffffff81053b32>] ? sched_clock_local+0xe/0x73
[   25.973956]  [<ffffffff81053cd7>] ? sched_clock_cpu+0xba/0xc5
[   25.973956]  [<ffffffff8105aaae>] ? trace_hardirqs_off+0xd/0xf
[   25.973956]  [<ffffffff81053d0d>] ? local_clock+0x2b/0x3c
[   25.973956]  [<ffffffff8138600b>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[   25.973956]  [<ffffffff81386f5f>] page_fault+0x1f/0x30
[   25.973956]  [<ffffffff811940b4>] ? __clear_user+0x2e/0x50
[   25.973956]  [<ffffffff81194098>] ? __clear_user+0x12/0x50
[   25.973956]  [<ffffffff81194200>] clear_user+0x2b/0x33
[   25.973956]  [<ffffffff810fc2cb>] padzero+0x1b/0x2b
[   25.973956]  [<ffffffff810fd294>] load_elf_binary+0x8e0/0x16c7
[   25.973956]  [<ffffffff8105aaae>] ? trace_hardirqs_off+0xd/0xf
[   25.973956]  [<ffffffff81053d0d>] ? local_clock+0x2b/0x3c
[   25.973956]  [<ffffffff810fc9b4>] ? load_elf_binary+0x0/0x16c7
[   25.973956]  [<ffffffff810c7aa4>] search_binary_handler+0xc3/0x288
[   25.973956]  [<ffffffff810c7e09>] do_execve+0x1a0/0x266
[   25.973956]  [<ffffffff81009586>] sys_execve+0x3c/0x57
[   25.973956]  [<ffffffff81002ecc>] stub_execve+0x6c/0xc0
[   25.973956] Code: 03 48 ff c7 48 0f b6 07 f6 80 a0 55 42 81 20 75 f0 c9 48 89 f8 c3 55 48 89 fa 48 89 e5 eb 03 48 ff c2 48 8d 04 37 48 39 c2 74 05 <80> 3a 00 75 ef c9 48 29 fa 48 89 d0 c3 55 31 c0 48 89 e5 eb 13 
[   25.973956] RIP  [<ffffffff8119089f>] strnlen+0x15/0x22
[   25.973956]  RSP <ffff8802578d34d8>
[   25.973956] CR2: 0000000000876000




 arch/x86/mm/fault.c     |   38 ++++++++++++++++++++++++++------------
 include/linux/mm.h      |    2 ++
 include/linux/pagemap.h |   13 +++++++++++++
 mm/filemap.c            |   16 +++++++++++++++-
 mm/memory.c             |   10 ++++++++--
 5 files changed, 64 insertions(+), 15 deletions(-)

diff -puN arch/x86/mm/fault.c~mm-retry-page-fault-when-blocking-on-disk-transfer arch/x86/mm/fault.c
--- a/arch/x86/mm/fault.c~mm-retry-page-fault-when-blocking-on-disk-transfer
+++ a/arch/x86/mm/fault.c
@@ -943,8 +943,10 @@ do_page_fault(struct pt_regs *regs, unsi
 	struct task_struct *tsk;
 	unsigned long address;
 	struct mm_struct *mm;
-	int write;
 	int fault;
+	int write = error_code & PF_WRITE;
+	unsigned int flags = FAULT_FLAG_ALLOW_RETRY |
+					(write ? FAULT_FLAG_WRITE : 0);
 
 	tsk = current;
 	mm = tsk->mm;
@@ -1055,6 +1057,7 @@ do_page_fault(struct pt_regs *regs, unsi
 			bad_area_nosemaphore(regs, error_code, address);
 			return;
 		}
+retry:
 		down_read(&mm->mmap_sem);
 	} else {
 		/*
@@ -1098,8 +1101,6 @@ do_page_fault(struct pt_regs *regs, unsi
 	 * we can handle it..
 	 */
 good_area:
-	write = error_code & PF_WRITE;
-
 	if (unlikely(access_error(error_code, write, vma))) {
 		bad_area_access_error(regs, error_code, address);
 		return;
@@ -1110,21 +1111,34 @@ good_area:
 	 * make sure we exit gracefully rather than endlessly redo
 	 * the fault:
 	 */
-	fault = handle_mm_fault(mm, vma, address, write ? FAULT_FLAG_WRITE : 0);
+	fault = handle_mm_fault(mm, vma, address, flags);
 
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		mm_fault_error(regs, error_code, address, fault);
 		return;
 	}
 
-	if (fault & VM_FAULT_MAJOR) {
-		tsk->maj_flt++;
-		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, 0,
-				     regs, address);
-	} else {
-		tsk->min_flt++;
-		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, 0,
-				     regs, address);
+	/*
+	 * Major/minor page fault accounting is only done on the
+	 * initial attempt. If we go through a retry, it is extremely
+	 * likely that the page will be found in page cache at that point.
+	 */
+	if (flags & FAULT_FLAG_ALLOW_RETRY) {
+		if (fault & VM_FAULT_MAJOR) {
+			tsk->maj_flt++;
+			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, 0,
+				      regs, address);
+		} else {
+			tsk->min_flt++;
+			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, 0,
+				      regs, address);
+		}
+		if (fault & VM_FAULT_RETRY) {
+			/* Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk
+			 * of starvation. */
+			flags &= ~FAULT_FLAG_ALLOW_RETRY;
+			goto retry;
+		}
 	}
 
 	check_v8086_mode(regs, address, tsk);
diff -puN include/linux/mm.h~mm-retry-page-fault-when-blocking-on-disk-transfer include/linux/mm.h
--- a/include/linux/mm.h~mm-retry-page-fault-when-blocking-on-disk-transfer
+++ a/include/linux/mm.h
@@ -144,6 +144,7 @@ extern pgprot_t protection_map[16];
 #define FAULT_FLAG_WRITE	0x01	/* Fault was a write access */
 #define FAULT_FLAG_NONLINEAR	0x02	/* Fault was via a nonlinear mapping */
 #define FAULT_FLAG_MKWRITE	0x04	/* Fault was mkwrite of existing pte */
+#define FAULT_FLAG_ALLOW_RETRY	0x08	/* Retry fault if blocking */
 
 /*
  * This interface is used by x86 PAT code to identify a pfn mapping that is
@@ -723,6 +724,7 @@ static inline int page_mapped(struct pag
 
 #define VM_FAULT_NOPAGE	0x0100	/* ->fault installed the pte, not return page */
 #define VM_FAULT_LOCKED	0x0200	/* ->fault locked the returned page */
+#define VM_FAULT_RETRY	0x0400	/* ->fault blocked, must retry */
 
 #define VM_FAULT_HWPOISON_LARGE_MASK 0xf000 /* encodes hpage index for large hwpoison */
 
diff -puN mm/filemap.c~mm-retry-page-fault-when-blocking-on-disk-transfer mm/filemap.c
--- a/mm/filemap.c~mm-retry-page-fault-when-blocking-on-disk-transfer
+++ a/mm/filemap.c
@@ -623,6 +623,19 @@ void __lock_page_nosync(struct page *pag
 							TASK_UNINTERRUPTIBLE);
 }
 
+int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
+			 unsigned int flags)
+{
+	if (!(flags & FAULT_FLAG_ALLOW_RETRY)) {
+		__lock_page(page);
+		return 1;
+	} else {
+		up_read(&mm->mmap_sem);
+		wait_on_page_locked(page);
+		return 0;
+	}
+}
+
 /**
  * find_get_page - find and get a page reference
  * @mapping: the address_space to search
@@ -1561,7 +1574,8 @@ retry_find:
 			goto no_cached_page;
 	}
 
-	lock_page(page);
+	if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags))
+		return ret | VM_FAULT_RETRY;
 
 	/* Did it get truncated? */
 	if (unlikely(page->mapping != mapping)) {
diff -puN mm/memory.c~mm-retry-page-fault-when-blocking-on-disk-transfer mm/memory.c
--- a/mm/memory.c~mm-retry-page-fault-when-blocking-on-disk-transfer
+++ a/mm/memory.c
@@ -2627,6 +2627,7 @@ static int do_swap_page(struct mm_struct
 	struct page *page, *swapcache = NULL;
 	swp_entry_t entry;
 	pte_t pte;
+	int locked;
 	struct mem_cgroup *ptr = NULL;
 	int exclusive = 0;
 	int ret = 0;
@@ -2677,8 +2678,12 @@ static int do_swap_page(struct mm_struct
 		goto out_release;
 	}
 
-	lock_page(page);
+	locked = lock_page_or_retry(page, mm, flags);
 	delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
+	if (!locked) {
+		ret |= VM_FAULT_RETRY;
+		goto out_release;
+	}
 
 	/*
 	 * Make sure try_to_free_swap or reuse_swap_page or swapoff did not
@@ -2927,7 +2932,8 @@ static int __do_fault(struct mm_struct *
 	vmf.page = NULL;
 
 	ret = vma->vm_ops->fault(vma, &vmf);
-	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))
+	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE |
+			    VM_FAULT_RETRY)))
 		return ret;
 
 	if (unlikely(PageHWPoison(vmf.page))) {
diff -puN include/linux/pagemap.h~mm-retry-page-fault-when-blocking-on-disk-transfer include/linux/pagemap.h
--- a/include/linux/pagemap.h~mm-retry-page-fault-when-blocking-on-disk-transfer
+++ a/include/linux/pagemap.h
@@ -299,6 +299,8 @@ static inline pgoff_t linear_page_index(
 extern void __lock_page(struct page *page);
 extern int __lock_page_killable(struct page *page);
 extern void __lock_page_nosync(struct page *page);
+extern int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
+				unsigned int flags);
 extern void unlock_page(struct page *page);
 
 static inline void __set_page_locked(struct page *page)
@@ -351,6 +353,17 @@ static inline void lock_page_nosync(stru
 }
 	
 /*
+ * lock_page_or_retry - Lock the page, unless this would block and the
+ * caller indicated that it can handle a retry.
+ */
+static inline int lock_page_or_retry(struct page *page, struct mm_struct *mm,
+				     unsigned int flags)
+{
+	might_sleep();
+	return trylock_page(page) || __lock_page_or_retry(page, mm, flags);
+}
+
+/*
  * This is exported only for wait_on_page_locked/wait_on_page_writeback.
  * Never use this directly!
  */
_
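
For anyone skimming the rollup, the moving parts condense to roughly
the following sketch (simplified, with abridged control flow; the
patch above is authoritative):

	/* filemap.c side: on contention, give the mmap_sem back instead
	 * of sleeping on the page lock while holding it */
	int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
				 unsigned int flags)
	{
		if (!(flags & FAULT_FLAG_ALLOW_RETRY)) {
			__lock_page(page);	/* old blocking behaviour */
			return 1;
		}
		up_read(&mm->mmap_sem);		/* drop mmap_sem first... */
		wait_on_page_locked(page);	/* ...then wait out the I/O */
		return 0;		/* caller propagates VM_FAULT_RETRY */
	}

	/* arch side (do_page_fault): retry at most once, with the retry
	 * flag cleared so the second attempt takes the blocking path */
		flags = FAULT_FLAG_ALLOW_RETRY |
			(write ? FAULT_FLAG_WRITE : 0);
	retry:
		down_read(&mm->mmap_sem);
		fault = handle_mm_fault(mm, vma, address, flags);
		if (fault & VM_FAULT_RETRY) {
			flags &= ~FAULT_FLAG_ALLOW_RETRY;
			goto retry;	/* mmap_sem was already released */
		}

Clearing FAULT_FLAG_ALLOW_RETRY before retrying bounds the fault to
two attempts, so a permanently contended page cannot starve the
faulting task.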


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/3] Retry page fault when blocking on disk transfer.
  2010-10-09  1:22               ` Michel Lespinasse
                                 ` (2 preceding siblings ...)
@ 2010-11-02 20:05               ` Michel Lespinasse
  2010-11-02 20:11                 ` Rik van Riel
  -1 siblings, 1 reply; 46+ messages in thread
From: Michel Lespinasse @ 2010-11-02 20:05 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, Andrew Morton, Rik van Riel

On Fri, Oct 08, 2010 at 06:22:04PM -0700, Michel Lespinasse wrote:
> Second try on adding the VM_FAULT_RETRY functionality to the swap in path.
> 
> This proposal would replace [patch 2/3] of this series (the initial
> version of it, which was approved by linus / rik / hpa).

Oh my. I've evidently been sloppy while merging the filemap and swap
versions of lock_page_or_retry(), and this issue slipped in...

Please apply the attached change. Sorry for not catching this earlier :(

----------------------------- 8< --------------------------------

Release page reference during page fault retry

This slipped by when unifying the filemap and swap versions of
lock_page_or_retry()...

Signed-off-by: Michel Lespinasse <walken@google.com>
    
diff --git a/mm/filemap.c b/mm/filemap.c
index 33f8125..c479d9c 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1563,8 +1563,10 @@ retry_find:
 			goto no_cached_page;
 	}
 
-	if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags))
+	if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags)) {
+		page_cache_release(page);
 		return ret | VM_FAULT_RETRY;
+	}
 
 	/* Did it get truncated? */
 	if (unlikely(page->mapping != mapping)) {
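
The invariant this one-liner restores, sketched out (roughly
paraphrased from filemap_fault(); line placement is approximate):
every exit that does not hand the page to the caller must drop the
reference find_get_page() took.

	page = find_get_page(mapping, offset);	/* takes a page reference */
	...
	if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags)) {
		page_cache_release(page);	/* retry path keeps nothing,
						   so drop the reference or
						   the page leaks */
		return ret | VM_FAULT_RETRY;	/* fault will be redone */
	}
	...
	vmf->page = page;			/* success: reference (and
						   page lock) handed to the
						   caller */
	return ret | VM_FAULT_LOCKED;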

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/3] Retry page fault when blocking on disk transfer.
  2010-11-02 20:05               ` Michel Lespinasse
@ 2010-11-02 20:11                 ` Rik van Riel
  0 siblings, 0 replies; 46+ messages in thread
From: Rik van Riel @ 2010-11-02 20:11 UTC (permalink / raw)
  To: Michel Lespinasse; +Cc: Linus Torvalds, linux-kernel, Andrew Morton

On 11/02/2010 04:05 PM, Michel Lespinasse wrote:

> Release page reference during page fault retry
>
> This slipped by when unifying the filemap and swap versions of
> lock_page_or_retry()...
>
> Signed-off-by: Michel Lespinasse<walken@google.com>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 46+ messages in thread

end of thread, other threads:[~2010-11-02 20:12 UTC | newest]

Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-10-05  7:53 [PATCH 0/3] V2: Reduce mmap_sem hold times during file backed page faults Michel Lespinasse
2010-10-05  7:53 ` [PATCH 1/3] filemap_fault: unique path for locking page Michel Lespinasse
2010-10-05 17:07   ` Rik van Riel
2010-10-05  7:53 ` [PATCH 2/3] Retry page fault when blocking on disk transfer Michel Lespinasse
2010-10-05 17:33   ` Linus Torvalds
2010-10-05 17:38   ` Rik van Riel
2010-10-05 22:44     ` Michel Lespinasse
2010-10-08  4:39       ` Michel Lespinasse
2010-10-08 13:24         ` Rik van Riel
2010-10-08 20:06           ` Michel Lespinasse
2010-10-09  1:22             ` Michel Lespinasse
2010-10-11 22:25               ` Andrew Morton
2010-10-11 22:43                 ` Michel Lespinasse
2010-10-13 23:17               ` Andrew Morton
2010-11-02 20:05               ` Michel Lespinasse
2010-11-02 20:11                 ` Rik van Riel
2010-10-06  4:02   ` H. Peter Anvin
2010-10-05  7:53 ` [PATCH 3/3] access_error API cleanup Michel Lespinasse
2010-10-05 19:44   ` Rik van Riel
2010-10-06  4:02   ` H. Peter Anvin
2010-10-06  4:14     ` Michel Lespinasse
2010-10-06  4:18       ` Andrew Morton
2010-10-06  4:20         ` H. Peter Anvin
2010-10-05 14:55 ` [PATCH 0/3] V2: Reduce mmap_sem hold times during file backed page faults Rik van Riel
