* [patch 0/6] fault vs truncate/invalidate race fix
@ 2007-02-21  4:49 ` Nick Piggin
  0 siblings, 0 replies; 198+ messages in thread
From: Nick Piggin @ 2007-02-21  4:49 UTC (permalink / raw)
  To: Linux Memory Management, Andrew Morton
  Cc: Linux Kernel, Nick Piggin, Benjamin Herrenschmidt

The following set of patches is based on current git.

These fix the fault vs invalidate and fault vs truncate_range races for
filemap_nopage mappings, plus those races and the fault vs truncate race for
nonlinear mappings.

These patches fix silent data corruption that several people have been hitting
in SUSE kernels. Our kernels carry similar patches to lock the page over the
page fault, and have had no problems.

I've also got rid of the horrible populate API, and integrated nonlinear pages
properly with the page fault path.

The downside is that this adds one more vector through which the buffered write
deadlock can occur. However, it is a very narrow one (a pte being unmapped for
reclaim), compared to all the other ways that deadlock can occur (unmap,
reclaim, truncate, invalidate). I doubt it will be noticeable. At any rate, it
is better than data corruption.

I hope these can get merged (at least into -mm) soon.

Thanks,
Nick

--
SuSE Labs


* [patch 1/6] mm: debug check for the fault vs invalidate race
  2007-02-21  4:49 ` Nick Piggin
@ 2007-02-21  4:49   ` Nick Piggin
  -1 siblings, 0 replies; 198+ messages in thread
From: Nick Piggin @ 2007-02-21  4:49 UTC (permalink / raw)
  To: Linux Memory Management, Andrew Morton
  Cc: Linux Kernel, Nick Piggin, Benjamin Herrenschmidt

Add a bugcheck for Andrea's pagefault vs invalidate race. This is triggerable
for both linear and nonlinear pages with a userspace test harness (using
direct IO and truncate, respectively).
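
A sketch of the linear (direct IO) half of such a harness, hypothetical and not
the actual harness referred to above, might look like this: one thread
repeatedly faults pages of a mapped file while another rewrites the file with
O_DIRECT, which invalidates the pagecache. On a kernel with this debug check
applied, hitting the race fires the BUG_ON rather than corrupting data
silently.

#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define FILESIZE (1 << 20)

/* Rewrite the whole file with O_DIRECT, invalidating the pagecache range. */
static void *writer(void *arg)
{
	int fd = open(arg, O_WRONLY | O_DIRECT);
	void *buf;

	if (fd < 0 || posix_memalign(&buf, 4096, FILESIZE))
		exit(1);
	for (;;) {
		memset(buf, 'A', FILESIZE);
		pwrite(fd, buf, FILESIZE, 0);
		memset(buf, 'B', FILESIZE);
		pwrite(fd, buf, FILESIZE, 0);
	}
	return NULL;
}

int main(int argc, char *argv[])
{
	char *name = argc > 1 ? argv[1] : "racefile";
	int fd = open(name, O_RDWR | O_CREAT | O_TRUNC, 0644);
	volatile char *map;
	pthread_t thr;
	unsigned long i, sum = 0;

	if (fd < 0 || ftruncate(fd, FILESIZE))
		return 1;
	map = mmap(NULL, FILESIZE, PROT_READ, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		return 1;
	pthread_create(&thr, NULL, writer, name);
	for (;;) {
		/* touch every page: each access can go through do_no_page */
		for (i = 0; i < FILESIZE; i += 4096)
			sum += map[i];
		/* drop the ptes so the next pass faults the pages in again */
		madvise((void *)map, FILESIZE, MADV_DONTNEED);
	}
	return 0;
}

Build with -pthread. The nonlinear case would presumably use remap_file_pages()
on the mapping and truncate on the file rather than O_DIRECT writes.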

Signed-off-by: Nick Piggin <npiggin@suse.de>

 mm/filemap.c |    2 ++
 1 file changed, 2 insertions(+)

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -120,6 +120,8 @@ void __remove_from_page_cache(struct pag
 	page->mapping = NULL;
 	mapping->nrpages--;
 	__dec_zone_page_state(page, NR_FILE_PAGES);
+
+	BUG_ON(page_mapped(page));
 }
 
 void remove_from_page_cache(struct page *page)


* [patch 2/6] mm: simplify filemap_nopage
  2007-02-21  4:49 ` Nick Piggin
@ 2007-02-21  4:49   ` Nick Piggin
  -1 siblings, 0 replies; 198+ messages in thread
From: Nick Piggin @ 2007-02-21  4:49 UTC (permalink / raw)
  To: Linux Memory Management, Andrew Morton
  Cc: Linux Kernel, Nick Piggin, Benjamin Herrenschmidt

An identical block is duplicated: contrary to the comment, we have been
re-reading the page *twice* in filemap_nopage rather than once.

If any retry logic is needed, it belongs in the lower levels anyway; only
retry once here. Linus agrees.

Signed-off-by: Nick Piggin <npiggin@suse.de>

 mm/filemap.c |   24 ------------------------
 1 file changed, 24 deletions(-)

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1448,30 +1448,6 @@ page_not_uptodate:
 		majmin = VM_FAULT_MAJOR;
 		count_vm_event(PGMAJFAULT);
 	}
-	lock_page(page);
-
-	/* Did it get unhashed while we waited for it? */
-	if (!page->mapping) {
-		unlock_page(page);
-		page_cache_release(page);
-		goto retry_all;
-	}
-
-	/* Did somebody else get it up-to-date? */
-	if (PageUptodate(page)) {
-		unlock_page(page);
-		goto success;
-	}
-
-	error = mapping->a_ops->readpage(file, page);
-	if (!error) {
-		wait_on_page_locked(page);
-		if (PageUptodate(page))
-			goto success;
-	} else if (error == AOP_TRUNCATED_PAGE) {
-		page_cache_release(page);
-		goto retry_find;
-	}
 
 	/*
 	 * Umm, take care of errors if the page isn't up-to-date.


* [patch 3/6] mm: fix fault vs invalidate race for linear mappings
  2007-02-21  4:49 ` Nick Piggin
@ 2007-02-21  4:50   ` Nick Piggin
  -1 siblings, 0 replies; 198+ messages in thread
From: Nick Piggin @ 2007-02-21  4:50 UTC (permalink / raw)
  To: Linux Memory Management, Andrew Morton
  Cc: Linux Kernel, Nick Piggin, Benjamin Herrenschmidt

Fix the race between invalidate_inode_pages and do_no_page.

Andrea Arcangeli identified a subtle race between invalidation of
pages from pagecache with userspace mappings, and do_no_page.

The issue is that invalidation has to shoot down all mappings to the page
before it can be discarded from the pagecache. Between shooting down the
ptes to a particular page and actually dropping the struct page from the
pagecache, do_no_page in any process can fault on that page and establish
a new mapping to it just before it is discarded.

The most common case where such invalidation is used is file truncation.
That case was catered for by a sort of open-coded seqlock between the
file's i_size and its truncate_count.

Truncation will decrease i_size, then increment truncate_count before
unmapping userspace pages; do_no_page will read truncate_count, then
find the page if it is within i_size, and then check truncate_count
under the page table lock and back out and retry if it had
subsequently been changed (ptl will serialise against unmapping, and
ensure a potentially updated truncate_count is actually visible).
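
Roughly, the pre-patch ordering amounts to the following (a paraphrase of the
old code paths, not literal source):

	/* vmtruncate() side: */
	i_size_write(inode, offset);
	smp_wmb();				/* the "seqlock" write side */
	mapping->truncate_count++;
	unmap_mapping_range(mapping, ...);	/* shoot down the ptes */
	truncate_inode_pages(mapping, offset);

	/* do_no_page() side: */
	sequence = mapping->truncate_count;	/* the "seqlock" read side */
	page = vma->vm_ops->nopage(vma, ...);	/* checks pgoff against i_size */
	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
	if (sequence != mapping->truncate_count)
		goto retry;			/* truncate ran in between: back out */
	set_pte_at(mm, address, page_table, entry);
	pte_unmap_unlock(page_table, ptl);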

Complexity and documentation issues aside, the locking protocol fails
in the case where we would like to invalidate pagecache inside i_size.
do_no_page can come in anytime and filemap_nopage is not aware of the
invalidation in progress (as it is when it is outside i_size). The
end result is that dangling (->mapping == NULL) pages that appear to
be from a particular file may be mapped into userspace with nonsense
data. Valid mappings to the same place will see a different page.

Andrea implemented two working fixes, one using a real seqlock,
another using a page->flags bit. He also proposed using the page lock
in do_no_page, but that was initially considered too heavyweight.
However, it is not a global or per-file lock, and the page cacheline
is modified in do_no_page to increment _count and _mapcount anyway, so
a further modification should not be a large performance hit.
Scalability is not an issue.

This patch implements the latter approach. ->nopage implementations
return with the page locked if it is possible for their underlying
file to be invalidated (in that case, they must set a special vm_flags
bit to indicate so). do_no_page only unlocks the page after setting
up the mapping completely. Invalidation is excluded because it holds
the page lock while invalidating each page (and ensures that the
page is not mapped while the lock is held).

This also allows significant simplifications in do_no_page, because
we have the page locked in the right place in the pagecache from the
start.
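
In other words, the ordering after this patch looks roughly like this (again a
paraphrase, not literal source):

	/* truncate/invalidate side, for each page: */
	lock_page(page);
	wait_on_page_writeback(page);
	if (page_mapped(page))
		unmap_mapping_range(mapping, ...);	/* zap ptes under the page lock */
	truncate_complete_page(mapping, page);		/* only now drop it from pagecache */
	unlock_page(page);

	/* fault side (do_no_page): */
	page = vma->vm_ops->nopage(vma, ...);	/* eg. find_lock_page(): returns locked */
	set_pte_at(mm, address, page_table, mk_pte(page, vma->vm_page_prot));
	unlock_page(page);			/* only after the pte is installed */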

Signed-off-by: Nick Piggin <npiggin@suse.de>

 fs/gfs2/ops_file.c          |    2 
 fs/ncpfs/mmap.c             |    1 
 fs/ocfs2/mmap.c             |    1 
 fs/xfs/linux-2.6/xfs_file.c |    1 
 include/linux/mm.h          |    6 +
 ipc/shm.c                   |    1 
 mm/filemap.c                |   53 ++++++----------
 mm/memory.c                 |  138 +++++++++++++++++++-------------------------
 mm/shmem.c                  |   11 ++-
 mm/truncate.c               |   10 +++
 10 files changed, 111 insertions(+), 113 deletions(-)

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -170,6 +170,12 @@ extern unsigned int kobjsize(const void 
 #define VM_INSERTPAGE	0x02000000	/* The vma has had "vm_insert_page()" done on it */
 #define VM_ALWAYSDUMP	0x04000000	/* Always include in core dumps */
 
+#define VM_CAN_INVALIDATE 0x08000000	/* The mapping may be invalidated,
+					 * eg. truncate or invalidate_inode_*.
+					 * In this case, do_no_page must
+					 * return with the page locked.
+					 */
+
 #ifndef VM_STACK_DEFAULT_FLAGS		/* arch can override this */
 #define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS
 #endif
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1329,9 +1329,10 @@ struct page *filemap_nopage(struct vm_ar
 	unsigned long size, pgoff;
 	int did_readaround = 0, majmin = VM_FAULT_MINOR;
 
+	BUG_ON(!(area->vm_flags & VM_CAN_INVALIDATE));
+
 	pgoff = ((address-area->vm_start) >> PAGE_CACHE_SHIFT) + area->vm_pgoff;
 
-retry_all:
 	size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
 	if (pgoff >= size)
 		goto outside_data_content;
@@ -1353,7 +1354,7 @@ retry_all:
 	 * Do we have something in the page cache already?
 	 */
 retry_find:
-	page = find_get_page(mapping, pgoff);
+	page = find_lock_page(mapping, pgoff);
 	if (!page) {
 		unsigned long ra_pages;
 
@@ -1387,7 +1388,7 @@ retry_find:
 				start = pgoff - ra_pages / 2;
 			do_page_cache_readahead(mapping, file, start, ra_pages);
 		}
-		page = find_get_page(mapping, pgoff);
+		page = find_lock_page(mapping, pgoff);
 		if (!page)
 			goto no_cached_page;
 	}
@@ -1396,13 +1397,19 @@ retry_find:
 		ra->mmap_hit++;
 
 	/*
-	 * Ok, found a page in the page cache, now we need to check
-	 * that it's up-to-date.
+	 * We have a locked page in the page cache, now we need to check
+	 * that it's up-to-date. If not, it is going to be due to an error.
 	 */
-	if (!PageUptodate(page))
+	if (unlikely(!PageUptodate(page)))
 		goto page_not_uptodate;
 
-success:
+	/* Must recheck i_size under page lock */
+	size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
+	if (unlikely(pgoff >= size)) {
+		unlock_page(page);
+		goto outside_data_content;
+	}
+
 	/*
 	 * Found the page and have a reference on it.
 	 */
@@ -1444,6 +1451,7 @@ no_cached_page:
 	return NOPAGE_SIGBUS;
 
 page_not_uptodate:
+	/* IO error path */
 	if (!did_readaround) {
 		majmin = VM_FAULT_MAJOR;
 		count_vm_event(PGMAJFAULT);
@@ -1455,37 +1463,15 @@ page_not_uptodate:
 	 * because there really aren't any performance issues here
 	 * and we need to check for errors.
 	 */
-	lock_page(page);
-
-	/* Somebody truncated the page on us? */
-	if (!page->mapping) {
-		unlock_page(page);
-		page_cache_release(page);
-		goto retry_all;
-	}
-
-	/* Somebody else successfully read it in? */
-	if (PageUptodate(page)) {
-		unlock_page(page);
-		goto success;
-	}
 	ClearPageError(page);
 	error = mapping->a_ops->readpage(file, page);
-	if (!error) {
-		wait_on_page_locked(page);
-		if (PageUptodate(page))
-			goto success;
-	} else if (error == AOP_TRUNCATED_PAGE) {
-		page_cache_release(page);
+	page_cache_release(page);
+
+	if (!error || error == AOP_TRUNCATED_PAGE)
 		goto retry_find;
-	}
 
-	/*
-	 * Things didn't work out. Return zero to tell the
-	 * mm layer so, possibly freeing the page cache page first.
-	 */
+	/* Things didn't work out. Return zero to tell the mm layer so. */
 	shrink_readahead_size_eio(file, ra);
-	page_cache_release(page);
 	return NOPAGE_SIGBUS;
 }
 EXPORT_SYMBOL(filemap_nopage);
@@ -1678,6 +1664,7 @@ int generic_file_mmap(struct file * file
 		return -ENOEXEC;
 	file_accessed(file);
 	vma->vm_ops = &generic_file_vm_ops;
+	vma->vm_flags |= VM_CAN_INVALIDATE;
 	return 0;
 }
 
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -1730,6 +1730,13 @@ static int unmap_mapping_range_vma(struc
 	unsigned long restart_addr;
 	int need_break;
 
+	/*
+	 * files that support invalidating or truncating portions of the
+	 * file from under mmaped areas must set the VM_CAN_INVALIDATE flag, and
+	 * have their .nopage function return the page locked.
+	 */
+	BUG_ON(!(vma->vm_flags & VM_CAN_INVALIDATE));
+
 again:
 	restart_addr = vma->vm_truncate_count;
 	if (is_restart_addr(restart_addr) && start_addr < restart_addr) {
@@ -1858,17 +1865,8 @@ void unmap_mapping_range(struct address_
 
 	spin_lock(&mapping->i_mmap_lock);
 
-	/* serialize i_size write against truncate_count write */
-	smp_wmb();
-	/* Protect against page faults, and endless unmapping loops */
+	/* Protect against endless unmapping loops */
 	mapping->truncate_count++;
-	/*
-	 * For archs where spin_lock has inclusive semantics like ia64
-	 * this smp_mb() will prevent to read pagetable contents
-	 * before the truncate_count increment is visible to
-	 * other cpus.
-	 */
-	smp_mb();
 	if (unlikely(is_restart_addr(mapping->truncate_count))) {
 		if (mapping->truncate_count == 0)
 			reset_vma_truncate_counts(mapping);
@@ -1907,7 +1905,6 @@ int vmtruncate(struct inode * inode, lof
 	if (IS_SWAPFILE(inode))
 		goto out_busy;
 	i_size_write(inode, offset);
-	unmap_mapping_range(mapping, offset + PAGE_SIZE - 1, 0, 1);
 	truncate_inode_pages(mapping, offset);
 	goto out_truncate;
 
@@ -1946,7 +1943,6 @@ int vmtruncate_range(struct inode *inode
 
 	mutex_lock(&inode->i_mutex);
 	down_write(&inode->i_alloc_sem);
-	unmap_mapping_range(mapping, offset, (end - offset), 1);
 	truncate_inode_pages_range(mapping, offset, end);
 	inode->i_op->truncate_range(inode, offset, end);
 	up_write(&inode->i_alloc_sem);
@@ -2196,10 +2192,8 @@ static int do_no_page(struct mm_struct *
 		int write_access)
 {
 	spinlock_t *ptl;
-	struct page *new_page;
-	struct address_space *mapping = NULL;
+	struct page *page, *nopage_page;
 	pte_t entry;
-	unsigned int sequence = 0;
 	int ret = VM_FAULT_MINOR;
 	int anon = 0;
 	struct page *dirty_page = NULL;
@@ -2207,73 +2201,53 @@ static int do_no_page(struct mm_struct *
 	pte_unmap(page_table);
 	BUG_ON(vma->vm_flags & VM_PFNMAP);
 
-	if (vma->vm_file) {
-		mapping = vma->vm_file->f_mapping;
-		sequence = mapping->truncate_count;
-		smp_rmb(); /* serializes i_size against truncate_count */
-	}
-retry:
-	new_page = vma->vm_ops->nopage(vma, address & PAGE_MASK, &ret);
-	/*
-	 * No smp_rmb is needed here as long as there's a full
-	 * spin_lock/unlock sequence inside the ->nopage callback
-	 * (for the pagecache lookup) that acts as an implicit
-	 * smp_mb() and prevents the i_size read to happen
-	 * after the next truncate_count read.
-	 */
-
+	nopage_page = vma->vm_ops->nopage(vma, address & PAGE_MASK, &ret);
 	/* no page was available -- either SIGBUS, OOM or REFAULT */
-	if (unlikely(new_page == NOPAGE_SIGBUS))
+	if (unlikely(nopage_page == NOPAGE_SIGBUS))
 		return VM_FAULT_SIGBUS;
-	else if (unlikely(new_page == NOPAGE_OOM))
+	else if (unlikely(nopage_page == NOPAGE_OOM))
 		return VM_FAULT_OOM;
-	else if (unlikely(new_page == NOPAGE_REFAULT))
+	else if (unlikely(nopage_page == NOPAGE_REFAULT))
 		return VM_FAULT_MINOR;
 
+	BUG_ON(vma->vm_flags & VM_CAN_INVALIDATE && !PageLocked(nopage_page));
+	/*
+	 * For consistency in subsequent calls, make the nopage_page always
+	 * locked.
+	 */
+	if (unlikely(!(vma->vm_flags & VM_CAN_INVALIDATE)))
+		lock_page(nopage_page);
+
 	/*
 	 * Should we do an early C-O-W break?
 	 */
+	page = nopage_page;
 	if (write_access) {
 		if (!(vma->vm_flags & VM_SHARED)) {
-			struct page *page;
-
-			if (unlikely(anon_vma_prepare(vma)))
-				goto oom;
+			if (unlikely(anon_vma_prepare(vma))) {
+				ret = VM_FAULT_OOM;
+				goto out_error;
+			}
 			page = alloc_page_vma(GFP_HIGHUSER, vma, address);
-			if (!page)
-				goto oom;
-			copy_user_highpage(page, new_page, address, vma);
-			page_cache_release(new_page);
-			new_page = page;
+			if (!page) {
+				ret = VM_FAULT_OOM;
+				goto out_error;
+			}
+			copy_user_highpage(page, nopage_page, address, vma);
 			anon = 1;
-
 		} else {
 			/* if the page will be shareable, see if the backing
 			 * address space wants to know that the page is about
 			 * to become writable */
 			if (vma->vm_ops->page_mkwrite &&
-			    vma->vm_ops->page_mkwrite(vma, new_page) < 0
-			    ) {
-				page_cache_release(new_page);
-				return VM_FAULT_SIGBUS;
+			    vma->vm_ops->page_mkwrite(vma, page) < 0) {
+				ret = VM_FAULT_SIGBUS;
+				goto out_error;
 			}
 		}
 	}
 
 	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
-	/*
-	 * For a file-backed vma, someone could have truncated or otherwise
-	 * invalidated this page.  If unmap_mapping_range got called,
-	 * retry getting the page.
-	 */
-	if (mapping && unlikely(sequence != mapping->truncate_count)) {
-		pte_unmap_unlock(page_table, ptl);
-		page_cache_release(new_page);
-		cond_resched();
-		sequence = mapping->truncate_count;
-		smp_rmb();
-		goto retry;
-	}
 
 	/*
 	 * This silly early PAGE_DIRTY setting removes a race
@@ -2286,43 +2260,51 @@ retry:
 	 * handle that later.
 	 */
 	/* Only go through if we didn't race with anybody else... */
-	if (pte_none(*page_table)) {
-		flush_icache_page(vma, new_page);
-		entry = mk_pte(new_page, vma->vm_page_prot);
+	if (likely(pte_none(*page_table))) {
+		flush_icache_page(vma, page);
+		entry = mk_pte(page, vma->vm_page_prot);
 		if (write_access)
 			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
 		set_pte_at(mm, address, page_table, entry);
 		if (anon) {
 			inc_mm_counter(mm, anon_rss);
-			lru_cache_add_active(new_page);
-			page_add_new_anon_rmap(new_page, vma, address);
+			lru_cache_add_active(page);
+			page_add_new_anon_rmap(page, vma, address);
 		} else {
 			inc_mm_counter(mm, file_rss);
-			page_add_file_rmap(new_page);
+			page_add_file_rmap(page);
 			if (write_access) {
-				dirty_page = new_page;
+				dirty_page = page;
 				get_page(dirty_page);
 			}
 		}
+
+		/* no need to invalidate: a not-present page won't be cached */
+		update_mmu_cache(vma, address, entry);
+		lazy_mmu_prot_update(entry);
 	} else {
-		/* One of our sibling threads was faster, back out. */
-		page_cache_release(new_page);
-		goto unlock;
+		if (anon)
+			page_cache_release(page);
+		else
+			anon = 1; /* not anon, but release nopage_page */
 	}
 
-	/* no need to invalidate: a not-present page shouldn't be cached */
-	update_mmu_cache(vma, address, entry);
-	lazy_mmu_prot_update(entry);
-unlock:
 	pte_unmap_unlock(page_table, ptl);
-	if (dirty_page) {
+
+out:
+	unlock_page(nopage_page);
+	if (anon)
+		page_cache_release(nopage_page);
+	else if (dirty_page) {
 		set_page_dirty_balance(dirty_page);
 		put_page(dirty_page);
 	}
+
 	return ret;
-oom:
-	page_cache_release(new_page);
-	return VM_FAULT_OOM;
+
+out_error:
+	anon = 1; /* release nopage_page */
+	goto out;
 }
 
 /*
Index: linux-2.6/mm/shmem.c
===================================================================
--- linux-2.6.orig/mm/shmem.c
+++ linux-2.6/mm/shmem.c
@@ -82,6 +82,7 @@ enum sgp_type {
 	SGP_READ,	/* don't exceed i_size, don't allocate page */
 	SGP_CACHE,	/* don't exceed i_size, may allocate page */
 	SGP_WRITE,	/* may exceed i_size, may allocate page */
+	SGP_NOPAGE,	/* same as SGP_CACHE, return with page locked */
 };
 
 static int shmem_getpage(struct inode *inode, unsigned long idx,
@@ -1215,8 +1216,10 @@ repeat:
 	}
 done:
 	if (*pagep != filepage) {
-		unlock_page(filepage);
 		*pagep = filepage;
+		if (sgp != SGP_NOPAGE)
+			unlock_page(filepage);
+
 	}
 	return 0;
 
@@ -1235,13 +1238,15 @@ struct page *shmem_nopage(struct vm_area
 	unsigned long idx;
 	int error;
 
+	BUG_ON(!(vma->vm_flags & VM_CAN_INVALIDATE));
+
 	idx = (address - vma->vm_start) >> PAGE_SHIFT;
 	idx += vma->vm_pgoff;
 	idx >>= PAGE_CACHE_SHIFT - PAGE_SHIFT;
 	if (((loff_t) idx << PAGE_CACHE_SHIFT) >= i_size_read(inode))
 		return NOPAGE_SIGBUS;
 
-	error = shmem_getpage(inode, idx, &page, SGP_CACHE, type);
+	error = shmem_getpage(inode, idx, &page, SGP_NOPAGE, type);
 	if (error)
 		return (error == -ENOMEM)? NOPAGE_OOM: NOPAGE_SIGBUS;
 
@@ -1339,6 +1344,7 @@ int shmem_mmap(struct file *file, struct
 {
 	file_accessed(file);
 	vma->vm_ops = &shmem_vm_ops;
+	vma->vm_flags |= VM_CAN_INVALIDATE;
 	return 0;
 }
 
@@ -2532,5 +2538,6 @@ int shmem_zero_setup(struct vm_area_stru
 		fput(vma->vm_file);
 	vma->vm_file = file;
 	vma->vm_ops = &shmem_vm_ops;
+	vma->vm_flags |= VM_CAN_INVALIDATE;
 	return 0;
 }
Index: linux-2.6/fs/ncpfs/mmap.c
===================================================================
--- linux-2.6.orig/fs/ncpfs/mmap.c
+++ linux-2.6/fs/ncpfs/mmap.c
@@ -123,6 +123,7 @@ int ncp_mmap(struct file *file, struct v
 		return -EFBIG;
 
 	vma->vm_ops = &ncp_file_mmap;
+	vma->vm_flags |= VM_CAN_INVALIDATE;
 	file_accessed(file);
 	return 0;
 }
Index: linux-2.6/fs/ocfs2/mmap.c
===================================================================
--- linux-2.6.orig/fs/ocfs2/mmap.c
+++ linux-2.6/fs/ocfs2/mmap.c
@@ -104,6 +104,7 @@ int ocfs2_mmap(struct file *file, struct
 	ocfs2_meta_unlock(file->f_dentry->d_inode, lock_level);
 out:
 	vma->vm_ops = &ocfs2_file_vm_ops;
+	vma->vm_flags |= VM_CAN_INVALIDATE;
 	return 0;
 }
 
Index: linux-2.6/fs/xfs/linux-2.6/xfs_file.c
===================================================================
--- linux-2.6.orig/fs/xfs/linux-2.6/xfs_file.c
+++ linux-2.6/fs/xfs/linux-2.6/xfs_file.c
@@ -343,6 +343,7 @@ xfs_file_mmap(
 	struct vm_area_struct *vma)
 {
 	vma->vm_ops = &xfs_file_vm_ops;
+	vma->vm_flags |= VM_CAN_INVALIDATE;
 
 #ifdef CONFIG_XFS_DMAPI
 	if (vn_from_inode(filp->f_path.dentry->d_inode)->v_vfsp->vfs_flag & VFS_DMI)
Index: linux-2.6/ipc/shm.c
===================================================================
--- linux-2.6.orig/ipc/shm.c
+++ linux-2.6/ipc/shm.c
@@ -231,6 +231,7 @@ static int shm_mmap(struct file * file, 
 	ret = shmem_mmap(file, vma);
 	if (ret == 0) {
 		vma->vm_ops = &shm_vm_ops;
+		vma->vm_flags |= VM_CAN_INVALIDATE;
 		if (!(vma->vm_flags & VM_WRITE))
 			vma->vm_flags &= ~VM_MAYWRITE;
 		shm_inc(shm_file_ns(file), file->f_path.dentry->d_inode->i_ino);
Index: linux-2.6/mm/truncate.c
===================================================================
--- linux-2.6.orig/mm/truncate.c
+++ linux-2.6/mm/truncate.c
@@ -191,6 +191,11 @@ void truncate_inode_pages_range(struct a
 				unlock_page(page);
 				continue;
 			}
+			if (page_mapped(page)) {
+				unmap_mapping_range(mapping,
+				  (loff_t)page_index<<PAGE_CACHE_SHIFT,
+				  PAGE_CACHE_SIZE, 0);
+			}
 			truncate_complete_page(mapping, page);
 			unlock_page(page);
 		}
@@ -228,6 +233,11 @@ void truncate_inode_pages_range(struct a
 				break;
 			lock_page(page);
 			wait_on_page_writeback(page);
+			if (page_mapped(page)) {
+				unmap_mapping_range(mapping,
+				  (loff_t)page_index<<PAGE_CACHE_SHIFT,
+				  PAGE_CACHE_SIZE, 0);
+			}
 			if (page->index > next)
 				next = page->index;
 			next++;
@@ -396,7 +406,7 @@ int invalidate_inode_pages2_range(struct
 				break;
 			}
 			wait_on_page_writeback(page);
-			while (page_mapped(page)) {
+			if (page_mapped(page)) {
 				if (!did_range_unmap) {
 					/*
 					 * Zap the rest of the file in one hit.
@@ -416,6 +426,7 @@ int invalidate_inode_pages2_range(struct
 					  PAGE_CACHE_SIZE, 0);
 				}
 			}
+			BUG_ON(page_mapped(page));
 			ret = do_launder_page(mapping, page);
 			if (ret == 0 && !invalidate_complete_page2(mapping, page))
 				ret = -EIO;
Index: linux-2.6/fs/gfs2/ops_file.c
===================================================================
--- linux-2.6.orig/fs/gfs2/ops_file.c
+++ linux-2.6/fs/gfs2/ops_file.c
@@ -365,6 +365,8 @@ static int gfs2_mmap(struct file *file, 
 	else
 		vma->vm_ops = &gfs2_vm_ops_private;
 
+	vma->vm_flags |= VM_CAN_INVALIDATE;
+
 	gfs2_glock_dq_uninit(&i_gh);
 
 	return error;


* [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-02-21  4:49 ` Nick Piggin
@ 2007-02-21  4:50   ` Nick Piggin
  -1 siblings, 0 replies; 198+ messages in thread
From: Nick Piggin @ 2007-02-21  4:50 UTC (permalink / raw)
  To: Linux Memory Management, Andrew Morton
  Cc: Linux Kernel, Nick Piggin, Benjamin Herrenschmidt

Nonlinear mappings are (AFAIKS) simply a virtual memory concept that
encodes the virtual address -> file offset differently from linear
mappings.

I can't see why the filesystem/pagecache code should need to know anything
about it, except for the fact that the ->nopage handler didn't quite pass
down enough information (ie. pgoff). But it is more logical to pass pgoff
rather than have the ->nopage function calculate it itself anyway. And
having the nopage handler install the pte itself is sort of nasty.

This patch introduces a new fault handler that replaces ->nopage and
->populate and (later) ->nopfn. Most of the old mechanism is still in place,
so there is a lot of duplication that can be removed, and some nice cleanups
that become possible, once everyone switches over.

The rationale for doing this in the first place is that nonlinear mappings
are subject to the pagefault vs invalidate/truncate race too, and it seemed
stupid to duplicate the synchronisation logic rather than just consolidate
the two.

After this patch, MAP_NONBLOCK no longer sets up ptes for pages present in
pagecache. That seems like fringe functionality anyway.

NOPAGE_REFAULT is removed. This should be implemented with ->fault, and
no users have hit mainline yet.
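
For illustration, a minimal ->fault implementation under the new API might be
shaped like the following (a hypothetical handler, not part of this patch; it
mirrors the structure of filemap_fault minus readahead and error recovery):

static struct page *example_fault(struct vm_area_struct *vma,
				  struct fault_data *fdata)
{
	struct address_space *mapping = vma->vm_file->f_mapping;
	struct inode *inode = mapping->host;
	struct page *page;
	pgoff_t size;

	size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
	if (fdata->pgoff >= size) {
		fdata->type = VM_FAULT_SIGBUS;
		return NULL;
	}

	/* return the page locked, as required when VM_CAN_INVALIDATE is set */
	page = find_lock_page(mapping, fdata->pgoff);
	if (!page) {
		/* a real handler would try to read the page in here */
		fdata->type = VM_FAULT_SIGBUS;
		return NULL;
	}

	fdata->type = VM_FAULT_MINOR;
	return page;	/* still locked; __do_fault unlocks it */
}

The corresponding mmap() method sets vma->vm_ops to point at this handler and,
as in the previous patch, sets VM_CAN_INVALIDATE (plus VM_CAN_NONLINEAR if it
supports nonlinear mappings) in vma->vm_flags.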

Signed-off-by: Nick Piggin <npiggin@suse.de>

 Documentation/feature-removal-schedule.txt |   27 ++++++
 Documentation/filesystems/Locking          |    2 
 fs/gfs2/ops_address.c                      |    2 
 fs/gfs2/ops_file.c                         |    2 
 fs/gfs2/ops_vm.c                           |   34 ++++---
 fs/ncpfs/mmap.c                            |   23 ++---
 fs/ocfs2/aops.c                            |    2 
 fs/ocfs2/mmap.c                            |   17 +--
 fs/xfs/linux-2.6/xfs_file.c                |   23 ++---
 include/linux/mm.h                         |   36 ++++++--
 ipc/shm.c                                  |    2 
 mm/filemap.c                               |   93 ++++++++++++--------
 mm/filemap_xip.c                           |   44 +++++----
 mm/fremap.c                                |  105 ++++++++++++++++-------
 mm/memory.c                                |  129 ++++++++++++++++++-----------
 mm/mmap.c                                  |    8 -
 mm/nommu.c                                 |    3 
 mm/shmem.c                                 |   80 ++++-------------
 mm/truncate.c                              |    2 
 19 files changed, 371 insertions(+), 263 deletions(-)

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -175,6 +175,7 @@ extern unsigned int kobjsize(const void 
 					 * In this case, do_no_page must
 					 * return with the page locked.
 					 */
+#define VM_CAN_NONLINEAR 0x10000000	/* Has ->fault & does nonlinear pages */
 
 #ifndef VM_STACK_DEFAULT_FLAGS		/* arch can override this */
 #define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS
@@ -198,6 +199,26 @@ extern unsigned int kobjsize(const void 
  */
 extern pgprot_t protection_map[16];
 
+#define FAULT_FLAG_WRITE	0x01
+#define FAULT_FLAG_NONLINEAR	0x02
+
+/*
+ * fault_data is filled in by the pagefault handler and passed to the
+ * vma's ->fault function. That function is responsible for filling in
+ * 'type', which is the type of fault if a page is returned, or the type
+ * of error if NULL is returned.
+ *
+ * pgoff should be used in favour of address, if possible. If pgoff is
+ * used, one may set VM_CAN_NONLINEAR in the vma->vm_flags to get
+ * nonlinear mapping support.
+ */
+struct fault_data {
+	unsigned long address;
+	pgoff_t pgoff;
+	unsigned int flags;
+
+	int type;
+};
 
 /*
  * These are the virtual MM functions - opening of an area, closing and
@@ -207,6 +228,7 @@ extern pgprot_t protection_map[16];
 struct vm_operations_struct {
 	void (*open)(struct vm_area_struct * area);
 	void (*close)(struct vm_area_struct * area);
+	struct page * (*fault)(struct vm_area_struct *vma, struct fault_data * fdata);
 	struct page * (*nopage)(struct vm_area_struct * area, unsigned long address, int *type);
 	unsigned long (*nopfn)(struct vm_area_struct * area, unsigned long address);
 	int (*populate)(struct vm_area_struct * area, unsigned long address, unsigned long len, pgprot_t prot, unsigned long pgoff, int nonblock);
@@ -636,7 +658,6 @@ static inline int page_mapped(struct pag
  */
 #define NOPAGE_SIGBUS	(NULL)
 #define NOPAGE_OOM	((struct page *) (-1))
-#define NOPAGE_REFAULT	((struct page *) (-2))	/* Return to userspace, rerun */
 
 /*
  * Error return values for the *_nopfn functions
@@ -666,14 +687,13 @@ static inline int page_mapped(struct pag
 extern void show_free_areas(void);
 
 #ifdef CONFIG_SHMEM
-struct page *shmem_nopage(struct vm_area_struct *vma,
-			unsigned long address, int *type);
+struct page *shmem_fault(struct vm_area_struct *vma, struct fault_data *fdata);
 int shmem_set_policy(struct vm_area_struct *vma, struct mempolicy *new);
 struct mempolicy *shmem_get_policy(struct vm_area_struct *vma,
 					unsigned long addr);
 int shmem_lock(struct file *file, int lock, struct user_struct *user);
 #else
-#define shmem_nopage filemap_nopage
+#define shmem_fault filemap_fault
 
 static inline int shmem_lock(struct file *file, int lock,
 			     struct user_struct *user)
@@ -1071,9 +1091,11 @@ extern void truncate_inode_pages_range(s
 				       loff_t lstart, loff_t lend);
 
 /* generic vm_area_ops exported for stackable file systems */
-extern struct page *filemap_nopage(struct vm_area_struct *, unsigned long, int *);
-extern int filemap_populate(struct vm_area_struct *, unsigned long,
-		unsigned long, pgprot_t, unsigned long, int);
+extern struct page *filemap_fault(struct vm_area_struct *, struct fault_data *);
+extern struct page * __deprecated_for_modules filemap_nopage(
+			struct vm_area_struct *, unsigned long, int *);
+extern int __deprecated_for_modules filemap_populate(struct vm_area_struct *,
+		unsigned long, unsigned long, pgprot_t, unsigned long, int);
 
 /* mm/page-writeback.c */
 int write_one_page(struct page *page, int wait);
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -2175,10 +2175,10 @@ oom:
 }
 
 /*
- * do_no_page() tries to create a new page mapping. It aggressively
+ * __do_fault() tries to create a new page mapping. It aggressively
  * tries to share with existing pages, but makes a separate copy if
- * the "write_access" parameter is true in order to avoid the next
- * page fault.
+ * the FAULT_FLAG_WRITE is set in the flags parameter in order to avoid
+ * the next page fault.
  *
  * As this is called only for pages that do not currently exist, we
  * do not need to flush old virtual caches or the TLB.
@@ -2187,64 +2187,82 @@ oom:
  * but allow concurrent faults), and pte mapped but not yet locked.
  * We return with mmap_sem still held, but pte unmapped and unlocked.
  */
-static int do_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
+static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		unsigned long address, pte_t *page_table, pmd_t *pmd,
-		int write_access)
+		pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
 {
 	spinlock_t *ptl;
-	struct page *page, *nopage_page;
+	struct page *page, *faulted_page;
 	pte_t entry;
-	int ret = VM_FAULT_MINOR;
 	int anon = 0;
 	struct page *dirty_page = NULL;
+	struct fault_data fdata;
+
+	fdata.address = address & PAGE_MASK;
+	fdata.pgoff = pgoff;
+	fdata.flags = flags;
 
 	pte_unmap(page_table);
 	BUG_ON(vma->vm_flags & VM_PFNMAP);
 
-	nopage_page = vma->vm_ops->nopage(vma, address & PAGE_MASK, &ret);
-	/* no page was available -- either SIGBUS, OOM or REFAULT */
-	if (unlikely(nopage_page == NOPAGE_SIGBUS))
-		return VM_FAULT_SIGBUS;
-	else if (unlikely(nopage_page == NOPAGE_OOM))
-		return VM_FAULT_OOM;
-	else if (unlikely(nopage_page == NOPAGE_REFAULT))
-		return VM_FAULT_MINOR;
+	if (likely(vma->vm_ops->fault)) {
+		fdata.type = -1;
+		faulted_page = vma->vm_ops->fault(vma, &fdata);
+		WARN_ON(fdata.type == -1);
+		if (unlikely(!faulted_page))
+			return fdata.type;
+	} else {
+		/* Legacy ->nopage path */
+		fdata.type = VM_FAULT_MINOR;
+		faulted_page = vma->vm_ops->nopage(vma, address & PAGE_MASK,
+								&fdata.type);
+		/* no page was available -- either SIGBUS or OOM */
+		if (unlikely(faulted_page == NOPAGE_SIGBUS))
+			return VM_FAULT_SIGBUS;
+		else if (unlikely(faulted_page == NOPAGE_OOM))
+			return VM_FAULT_OOM;
+	}
 
-	BUG_ON(vma->vm_flags & VM_CAN_INVALIDATE && !PageLocked(nopage_page));
 	/*
-	 * For consistency in subsequent calls, make the nopage_page always
+	 * For consistency in subsequent calls, make the faulted_page always
 	 * locked.
 	 */
 	if (unlikely(!(vma->vm_flags & VM_CAN_INVALIDATE)))
-		lock_page(nopage_page);
+		lock_page(faulted_page);
+	else
+		BUG_ON(!PageLocked(faulted_page));
 
 	/*
 	 * Should we do an early C-O-W break?
 	 */
-	page = nopage_page;
-	if (write_access) {
+	page = faulted_page;
+	if (flags & FAULT_FLAG_WRITE) {
 		if (!(vma->vm_flags & VM_SHARED)) {
+			anon = 1;
 			if (unlikely(anon_vma_prepare(vma))) {
-				ret = VM_FAULT_OOM;
-				goto out_error;
+				fdata.type = VM_FAULT_OOM;
+				goto out;
 			}
 			page = alloc_page_vma(GFP_HIGHUSER, vma, address);
 			if (!page) {
-				ret = VM_FAULT_OOM;
-				goto out_error;
+				fdata.type = VM_FAULT_OOM;
+				goto out;
 			}
-			copy_user_highpage(page, nopage_page, address, vma);
-			anon = 1;
+			copy_user_highpage(page, faulted_page, address, vma);
 		} else {
-			/* if the page will be shareable, see if the backing
+			/*
+			 * If the page will be shareable, see if the backing
 			 * address space wants to know that the page is about
-			 * to become writable */
+			 * to become writable
+			 */
 			if (vma->vm_ops->page_mkwrite &&
 			    vma->vm_ops->page_mkwrite(vma, page) < 0) {
-				ret = VM_FAULT_SIGBUS;
-				goto out_error;
+				fdata.type = VM_FAULT_SIGBUS;
+				anon = 1; /* no anon but release faulted_page */
+				goto out;
 			}
 		}
+
 	}
 
 	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
@@ -2260,10 +2278,10 @@ static int do_no_page(struct mm_struct *
 	 * handle that later.
 	 */
 	/* Only go through if we didn't race with anybody else... */
-	if (likely(pte_none(*page_table))) {
+	if (likely(pte_same(*page_table, orig_pte))) {
 		flush_icache_page(vma, page);
 		entry = mk_pte(page, vma->vm_page_prot);
-		if (write_access)
+		if (flags & FAULT_FLAG_WRITE)
 			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
 		set_pte_at(mm, address, page_table, entry);
 		if (anon) {
@@ -2273,7 +2291,7 @@ static int do_no_page(struct mm_struct *
 		} else {
 			inc_mm_counter(mm, file_rss);
 			page_add_file_rmap(page);
-			if (write_access) {
+			if (flags & FAULT_FLAG_WRITE) {
 				dirty_page = page;
 				get_page(dirty_page);
 			}
@@ -2286,25 +2304,42 @@ static int do_no_page(struct mm_struct *
 		if (anon)
 			page_cache_release(page);
 		else
-			anon = 1; /* not anon, but release nopage_page */
+			anon = 1; /* no anon but release faulted_page */
 	}
 
 	pte_unmap_unlock(page_table, ptl);
 
 out:
-	unlock_page(nopage_page);
+	unlock_page(faulted_page);
 	if (anon)
-		page_cache_release(nopage_page);
+		page_cache_release(faulted_page);
 	else if (dirty_page) {
 		set_page_dirty_balance(dirty_page);
 		put_page(dirty_page);
 	}
 
-	return ret;
+	return fdata.type;
+}
+
+static int do_linear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+		unsigned long address, pte_t *page_table, pmd_t *pmd,
+		int write_access, pte_t orig_pte)
+{
+	pgoff_t pgoff = (((address & PAGE_MASK)
+			- vma->vm_start) >> PAGE_CACHE_SHIFT) + vma->vm_pgoff;
+	unsigned int flags = (write_access ? FAULT_FLAG_WRITE : 0);
+
+	return __do_fault(mm, vma, address, page_table, pmd, pgoff, flags, orig_pte);
+}
 
-out_error:
-	anon = 1; /* relase nopage_page */
-	goto out;
+static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+		unsigned long address, pte_t *page_table, pmd_t *pmd,
+		int write_access, pgoff_t pgoff, pte_t orig_pte)
+{
+	unsigned int flags = FAULT_FLAG_NONLINEAR |
+				(write_access ? FAULT_FLAG_WRITE : 0);
+
+	return __do_fault(mm, vma, address, page_table, pmd, pgoff, flags, orig_pte);
 }
 
 /*
@@ -2383,9 +2418,14 @@ static int do_file_page(struct mm_struct
 		print_bad_pte(vma, orig_pte, address);
 		return VM_FAULT_OOM;
 	}
-	/* We can then assume vm->vm_ops && vma->vm_ops->populate */
 
 	pgoff = pte_to_pgoff(orig_pte);
+
+	if (vma->vm_ops && vma->vm_ops->fault)
+		return do_nonlinear_fault(mm, vma, address, page_table, pmd,
+					write_access, pgoff, orig_pte);
+
+	/* We can then assume vm->vm_ops && vma->vm_ops->populate */
 	err = vma->vm_ops->populate(vma, address & PAGE_MASK, PAGE_SIZE,
 					vma->vm_page_prot, pgoff, 0);
 	if (err == -ENOMEM)
@@ -2420,10 +2460,9 @@ static inline int handle_pte_fault(struc
 	if (!pte_present(entry)) {
 		if (pte_none(entry)) {
 			if (vma->vm_ops) {
-				if (vma->vm_ops->nopage)
-					return do_no_page(mm, vma, address,
-							  pte, pmd,
-							  write_access);
+				if (vma->vm_ops->fault || vma->vm_ops->nopage)
+					return do_linear_fault(mm, vma, address,
+						pte, pmd, write_access, entry);
 				if (unlikely(vma->vm_ops->nopfn))
 					return do_no_pfn(mm, vma, address, pte,
 							 pmd, write_access);
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1305,40 +1305,37 @@ static int fastcall page_cache_read(stru
 #define MMAP_LOTSAMISS  (100)
 
 /**
- * filemap_nopage - read in file data for page fault handling
- * @area:	the applicable vm_area
- * @address:	target address to read in
- * @type:	returned with VM_FAULT_{MINOR,MAJOR} if not %NULL
+ * filemap_fault - read in file data for page fault handling
+ * @data:	the applicable fault_data
  *
- * filemap_nopage() is invoked via the vma operations vector for a
+ * filemap_fault() is invoked via the vma operations vector for a
  * mapped memory region to read in file data during a page fault.
  *
  * The goto's are kind of ugly, but this streamlines the normal case of having
  * it in the page cache, and handles the special cases reasonably without
  * having a lot of duplicated code.
  */
-struct page *filemap_nopage(struct vm_area_struct *area,
-				unsigned long address, int *type)
+struct page *filemap_fault(struct vm_area_struct *vma, struct fault_data *fdata)
 {
 	int error;
-	struct file *file = area->vm_file;
+	struct file *file = vma->vm_file;
 	struct address_space *mapping = file->f_mapping;
 	struct file_ra_state *ra = &file->f_ra;
 	struct inode *inode = mapping->host;
 	struct page *page;
-	unsigned long size, pgoff;
-	int did_readaround = 0, majmin = VM_FAULT_MINOR;
+	unsigned long size;
+	int did_readaround = 0;
 
-	BUG_ON(!(area->vm_flags & VM_CAN_INVALIDATE));
+	fdata->type = VM_FAULT_MINOR;
 
-	pgoff = ((address-area->vm_start) >> PAGE_CACHE_SHIFT) + area->vm_pgoff;
+	BUG_ON(!(vma->vm_flags & VM_CAN_INVALIDATE));
 
 	size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
-	if (pgoff >= size)
+	if (fdata->pgoff >= size)
 		goto outside_data_content;
 
 	/* If we don't want any read-ahead, don't bother */
-	if (VM_RandomReadHint(area))
+	if (VM_RandomReadHint(vma))
 		goto no_cached_page;
 
 	/*
@@ -1347,19 +1344,19 @@ struct page *filemap_nopage(struct vm_ar
 	 *
 	 * For sequential accesses, we use the generic readahead logic.
 	 */
-	if (VM_SequentialReadHint(area))
-		page_cache_readahead(mapping, ra, file, pgoff, 1);
+	if (VM_SequentialReadHint(vma))
+		page_cache_readahead(mapping, ra, file, fdata->pgoff, 1);
 
 	/*
 	 * Do we have something in the page cache already?
 	 */
 retry_find:
-	page = find_lock_page(mapping, pgoff);
+	page = find_lock_page(mapping, fdata->pgoff);
 	if (!page) {
 		unsigned long ra_pages;
 
-		if (VM_SequentialReadHint(area)) {
-			handle_ra_miss(mapping, ra, pgoff);
+		if (VM_SequentialReadHint(vma)) {
+			handle_ra_miss(mapping, ra, fdata->pgoff);
 			goto no_cached_page;
 		}
 		ra->mmap_miss++;
@@ -1376,7 +1373,7 @@ retry_find:
 		 * check did_readaround, as this is an inner loop.
 		 */
 		if (!did_readaround) {
-			majmin = VM_FAULT_MAJOR;
+			fdata->type = VM_FAULT_MAJOR;
 			count_vm_event(PGMAJFAULT);
 		}
 		did_readaround = 1;
@@ -1384,11 +1381,11 @@ retry_find:
 		if (ra_pages) {
 			pgoff_t start = 0;
 
-			if (pgoff > ra_pages / 2)
-				start = pgoff - ra_pages / 2;
+			if (fdata->pgoff > ra_pages / 2)
+				start = fdata->pgoff - ra_pages / 2;
 			do_page_cache_readahead(mapping, file, start, ra_pages);
 		}
-		page = find_lock_page(mapping, pgoff);
+		page = find_lock_page(mapping, fdata->pgoff);
 		if (!page)
 			goto no_cached_page;
 	}
@@ -1405,7 +1402,7 @@ retry_find:
 
 	/* Must recheck i_size under page lock */
 	size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
-	if (unlikely(pgoff >= size)) {
+	if (unlikely(fdata->pgoff >= size)) {
 		unlock_page(page);
 		goto outside_data_content;
 	}
@@ -1414,8 +1411,6 @@ retry_find:
 	 * Found the page and have a reference on it.
 	 */
 	mark_page_accessed(page);
-	if (type)
-		*type = majmin;
 	return page;
 
 outside_data_content:
@@ -1423,15 +1418,17 @@ outside_data_content:
 	 * An external ptracer can access pages that normally aren't
 	 * accessible..
 	 */
-	if (area->vm_mm == current->mm)
-		return NOPAGE_SIGBUS;
+	if (vma->vm_mm == current->mm) {
+		fdata->type = VM_FAULT_SIGBUS;
+		return NULL;
+	}
 	/* Fall through to the non-read-ahead case */
 no_cached_page:
 	/*
 	 * We're only likely to ever get here if MADV_RANDOM is in
 	 * effect.
 	 */
-	error = page_cache_read(file, pgoff);
+	error = page_cache_read(file, fdata->pgoff);
 
 	/*
 	 * The page we want has now been added to the page cache.
@@ -1447,13 +1444,15 @@ no_cached_page:
 	 * to schedule I/O.
 	 */
 	if (error == -ENOMEM)
-		return NOPAGE_OOM;
-	return NOPAGE_SIGBUS;
+		fdata->type = VM_FAULT_OOM;
+	else
+		fdata->type = VM_FAULT_SIGBUS;
+	return NULL;
 
 page_not_uptodate:
 	/* IO error path */
 	if (!did_readaround) {
-		majmin = VM_FAULT_MAJOR;
+		fdata->type = VM_FAULT_MAJOR;
 		count_vm_event(PGMAJFAULT);
 	}
 
@@ -1472,7 +1471,30 @@ page_not_uptodate:
 
 	/* Things didn't work out. Return zero to tell the mm layer so. */
 	shrink_readahead_size_eio(file, ra);
-	return NOPAGE_SIGBUS;
+	fdata->type = VM_FAULT_SIGBUS;
+	return NULL;
+}
+EXPORT_SYMBOL(filemap_fault);
+
+/*
+ * filemap_nopage and filemap_populate are legacy exports that are not used
+ * in tree. Scheduled for removal.
+ */
+struct page *filemap_nopage(struct vm_area_struct *area,
+				unsigned long address, int *type)
+{
+	struct page *page;
+	struct fault_data fdata;
+	fdata.address = address;
+	fdata.pgoff = ((address - area->vm_start) >> PAGE_CACHE_SHIFT)
+			+ area->vm_pgoff;
+	fdata.flags = 0;
+
+	page = filemap_fault(area, &fdata);
+	if (type)
+		*type = fdata.type;
+
+	return page;
 }
 EXPORT_SYMBOL(filemap_nopage);
 
@@ -1650,8 +1672,7 @@ repeat:
 EXPORT_SYMBOL(filemap_populate);
 
 struct vm_operations_struct generic_file_vm_ops = {
-	.nopage		= filemap_nopage,
-	.populate	= filemap_populate,
+	.fault		= filemap_fault,
 };
 
 /* This is used for a general mmap of a disk file */
@@ -1664,7 +1685,7 @@ int generic_file_mmap(struct file * file
 		return -ENOEXEC;
 	file_accessed(file);
 	vma->vm_ops = &generic_file_vm_ops;
-	vma->vm_flags |= VM_CAN_INVALIDATE;
+	vma->vm_flags |= VM_CAN_INVALIDATE | VM_CAN_NONLINEAR;
 	return 0;
 }
 
Index: linux-2.6/mm/fremap.c
===================================================================
--- linux-2.6.orig/mm/fremap.c
+++ linux-2.6/mm/fremap.c
@@ -126,6 +126,25 @@ out:
 	return err;
 }
 
+static int populate_range(struct mm_struct *mm, struct vm_area_struct *vma,
+			unsigned long addr, unsigned long size, pgoff_t pgoff)
+{
+	int err;
+
+	do {
+		err = install_file_pte(mm, vma, addr, pgoff, vma->vm_page_prot);
+		if (err)
+			return err;
+
+		size -= PAGE_SIZE;
+		addr += PAGE_SIZE;
+		pgoff++;
+	} while (size);
+
+        return 0;
+
+}
+
 /***
  * sys_remap_file_pages - remap arbitrary pages of a shared backing store
  *                        file within an existing vma.
@@ -183,41 +202,63 @@ asmlinkage long sys_remap_file_pages(uns
 	 * the single existing vma.  vm_private_data is used as a
 	 * swapout cursor in a VM_NONLINEAR vma.
 	 */
-	if (vma && (vma->vm_flags & VM_SHARED) &&
-		(!vma->vm_private_data || (vma->vm_flags & VM_NONLINEAR)) &&
-		vma->vm_ops && vma->vm_ops->populate &&
-			end > start && start >= vma->vm_start &&
-				end <= vma->vm_end) {
-
-		/* Must set VM_NONLINEAR before any pages are populated. */
-		if (pgoff != linear_page_index(vma, start) &&
-		    !(vma->vm_flags & VM_NONLINEAR)) {
-			if (!has_write_lock) {
-				up_read(&mm->mmap_sem);
-				down_write(&mm->mmap_sem);
-				has_write_lock = 1;
-				goto retry;
-			}
-			mapping = vma->vm_file->f_mapping;
-			spin_lock(&mapping->i_mmap_lock);
-			flush_dcache_mmap_lock(mapping);
-			vma->vm_flags |= VM_NONLINEAR;
-			vma_prio_tree_remove(vma, &mapping->i_mmap);
-			vma_nonlinear_insert(vma, &mapping->i_mmap_nonlinear);
-			flush_dcache_mmap_unlock(mapping);
-			spin_unlock(&mapping->i_mmap_lock);
+	if (!vma || !(vma->vm_flags & VM_SHARED))
+		goto out;
+
+	if (vma->vm_private_data && !(vma->vm_flags & VM_NONLINEAR))
+		goto out;
+
+	if ((!vma->vm_ops || !vma->vm_ops->populate) &&
+					!(vma->vm_flags & VM_CAN_NONLINEAR))
+		goto out;
+
+	if (end <= start || start < vma->vm_start || end > vma->vm_end)
+		goto out;
+
+	/* Must set VM_NONLINEAR before any pages are populated. */
+	if (!(vma->vm_flags & VM_NONLINEAR)) {
+		/* Don't need a nonlinear mapping, exit success */
+		if (pgoff == linear_page_index(vma, start)) {
+			err = 0;
+			goto out;
 		}
 
-		err = vma->vm_ops->populate(vma, start, size,
-					    vma->vm_page_prot,
-					    pgoff, flags & MAP_NONBLOCK);
-
-		/*
-		 * We can't clear VM_NONLINEAR because we'd have to do
-		 * it after ->populate completes, and that would prevent
-		 * downgrading the lock.  (Locks can't be upgraded).
-		 */
+		if (!has_write_lock) {
+			up_read(&mm->mmap_sem);
+			down_write(&mm->mmap_sem);
+			has_write_lock = 1;
+			goto retry;
+		}
+		mapping = vma->vm_file->f_mapping;
+		spin_lock(&mapping->i_mmap_lock);
+		flush_dcache_mmap_lock(mapping);
+		vma->vm_flags |= VM_NONLINEAR;
+		vma_prio_tree_remove(vma, &mapping->i_mmap);
+		vma_nonlinear_insert(vma, &mapping->i_mmap_nonlinear);
+		flush_dcache_mmap_unlock(mapping);
+		spin_unlock(&mapping->i_mmap_lock);
 	}
+
+	if (vma->vm_flags & VM_CAN_NONLINEAR) {
+		err = populate_range(mm, vma, start, size, pgoff);
+		if (!err && !(flags & MAP_NONBLOCK)) {
+			if (unlikely(has_write_lock)) {
+				downgrade_write(&mm->mmap_sem);
+				has_write_lock = 0;
+			}
+			make_pages_present(start, start+size);
+		}
+	} else
+		err = vma->vm_ops->populate(vma, start, size, vma->vm_page_prot,
+					    	pgoff, flags & MAP_NONBLOCK);
+
+	/*
+	 * We can't clear VM_NONLINEAR because we'd have to do
+	 * it after ->populate completes, and that would prevent
+	 * downgrading the lock.  (Locks can't be upgraded).
+	 */
+
+out:
 	if (likely(!has_write_lock))
 		up_read(&mm->mmap_sem);
 	else
Index: linux-2.6/fs/gfs2/ops_file.c
===================================================================
--- linux-2.6.orig/fs/gfs2/ops_file.c
+++ linux-2.6/fs/gfs2/ops_file.c
@@ -365,7 +365,7 @@ static int gfs2_mmap(struct file *file, 
 	else
 		vma->vm_ops = &gfs2_vm_ops_private;
 
-	vma->vm_flags |= VM_CAN_INVALIDATE;
+	vma->vm_flags |= VM_CAN_INVALIDATE|VM_CAN_NONLINEAR;
 
 	gfs2_glock_dq_uninit(&i_gh);
 
Index: linux-2.6/fs/gfs2/ops_vm.c
===================================================================
--- linux-2.6.orig/fs/gfs2/ops_vm.c
+++ linux-2.6/fs/gfs2/ops_vm.c
@@ -27,13 +27,13 @@
 #include "trans.h"
 #include "util.h"
 
-static struct page *gfs2_private_nopage(struct vm_area_struct *area,
-					unsigned long address, int *type)
+static struct page *gfs2_private_fault(struct vm_area_struct *vma,
+					struct fault_data *fdata)
 {
-	struct gfs2_inode *ip = GFS2_I(area->vm_file->f_mapping->host);
+	struct gfs2_inode *ip = GFS2_I(vma->vm_file->f_mapping->host);
 
 	set_bit(GIF_PAGED, &ip->i_flags);
-	return filemap_nopage(area, address, type);
+	return filemap_fault(vma, fdata);
 }
 
 static int alloc_page_backing(struct gfs2_inode *ip, struct page *page)
@@ -104,16 +104,14 @@ out:
 	return error;
 }
 
-static struct page *gfs2_sharewrite_nopage(struct vm_area_struct *area,
-					   unsigned long address, int *type)
+static struct page *gfs2_sharewrite_fault(struct vm_area_struct *vma,
+						struct fault_data *fdata)
 {
-	struct file *file = area->vm_file;
+	struct file *file = vma->vm_file;
 	struct gfs2_file *gf = file->private_data;
 	struct gfs2_inode *ip = GFS2_I(file->f_mapping->host);
 	struct gfs2_holder i_gh;
 	struct page *result = NULL;
-	unsigned long index = ((address - area->vm_start) >> PAGE_CACHE_SHIFT) +
-			      area->vm_pgoff;
 	int alloc_required;
 	int error;
 
@@ -124,21 +122,25 @@ static struct page *gfs2_sharewrite_nopa
 	set_bit(GIF_PAGED, &ip->i_flags);
 	set_bit(GIF_SW_PAGED, &ip->i_flags);
 
-	error = gfs2_write_alloc_required(ip, (u64)index << PAGE_CACHE_SHIFT,
-					  PAGE_CACHE_SIZE, &alloc_required);
-	if (error)
+	error = gfs2_write_alloc_required(ip,
+					(u64)fdata->pgoff << PAGE_CACHE_SHIFT,
+					PAGE_CACHE_SIZE, &alloc_required);
+	if (error) {
+		fdata->type = VM_FAULT_OOM; /* XXX: are these right? */
 		goto out;
+	}
 
 	set_bit(GFF_EXLOCK, &gf->f_flags);
-	result = filemap_nopage(area, address, type);
+	result = filemap_fault(vma, fdata);
 	clear_bit(GFF_EXLOCK, &gf->f_flags);
-	if (!result || result == NOPAGE_OOM)
+	if (!result)
 		goto out;
 
 	if (alloc_required) {
 		error = alloc_page_backing(ip, result);
 		if (error) {
 			page_cache_release(result);
+			fdata->type = VM_FAULT_OOM;
 			result = NULL;
 			goto out;
 		}
@@ -152,10 +154,10 @@ out:
 }
 
 struct vm_operations_struct gfs2_vm_ops_private = {
-	.nopage = gfs2_private_nopage,
+	.fault = gfs2_private_fault,
 };
 
 struct vm_operations_struct gfs2_vm_ops_sharewrite = {
-	.nopage = gfs2_sharewrite_nopage,
+	.fault = gfs2_sharewrite_fault,
 };
 
Index: linux-2.6/fs/ocfs2/mmap.c
===================================================================
--- linux-2.6.orig/fs/ocfs2/mmap.c
+++ linux-2.6/fs/ocfs2/mmap.c
@@ -42,16 +42,14 @@
 #include "inode.h"
 #include "mmap.h"
 
-static struct page *ocfs2_nopage(struct vm_area_struct * area,
-				 unsigned long address,
-				 int *type)
+static struct page *ocfs2_fault(struct vm_area_struct *area,
+						struct fault_data *fdata)
 {
-	struct page *page = NOPAGE_SIGBUS;
+	struct page *page = NULL;
 	sigset_t blocked, oldset;
 	int ret;
 
-	mlog_entry("(area=%p, address=%lu, type=%p)\n", area, address,
-		   type);
+	mlog_entry("(area=%p, page offset=%lu)\n", area, fdata->pgoff);
 
 	/* The best way to deal with signals in this path is
 	 * to block them upfront, rather than allowing the
@@ -62,11 +60,12 @@ static struct page *ocfs2_nopage(struct 
 	 * from sigprocmask */
 	ret = sigprocmask(SIG_BLOCK, &blocked, &oldset);
 	if (ret < 0) {
+		fdata->type = VM_FAULT_SIGBUS;
 		mlog_errno(ret);
 		goto out;
 	}
 
-	page = filemap_nopage(area, address, type);
+	page = filemap_fault(area, fdata);
 
 	ret = sigprocmask(SIG_SETMASK, &oldset, NULL);
 	if (ret < 0)
@@ -77,7 +76,7 @@ out:
 }
 
 static struct vm_operations_struct ocfs2_file_vm_ops = {
-	.nopage = ocfs2_nopage,
+	.fault = ocfs2_fault,
 };
 
 int ocfs2_mmap(struct file *file, struct vm_area_struct *vma)
@@ -104,7 +103,7 @@ int ocfs2_mmap(struct file *file, struct
 	ocfs2_meta_unlock(file->f_dentry->d_inode, lock_level);
 out:
 	vma->vm_ops = &ocfs2_file_vm_ops;
-	vma->vm_flags |= VM_CAN_INVALIDATE;
+	vma->vm_flags |= VM_CAN_INVALIDATE | VM_CAN_NONLINEAR;
 	return 0;
 }
 
Index: linux-2.6/fs/xfs/linux-2.6/xfs_file.c
===================================================================
--- linux-2.6.orig/fs/xfs/linux-2.6/xfs_file.c
+++ linux-2.6/fs/xfs/linux-2.6/xfs_file.c
@@ -246,18 +246,19 @@ xfs_file_fsync(
 
 #ifdef CONFIG_XFS_DMAPI
 STATIC struct page *
-xfs_vm_nopage(
-	struct vm_area_struct	*area,
-	unsigned long		address,
-	int			*type)
+xfs_vm_fault(
+	struct vm_area_struct	*vma,
+	struct fault_data	*fdata)
 {
-	struct inode	*inode = area->vm_file->f_path.dentry->d_inode;
+	struct inode	*inode = vma->vm_file->f_path.dentry->d_inode;
 	bhv_vnode_t	*vp = vn_from_inode(inode);
 
 	ASSERT_ALWAYS(vp->v_vfsp->vfs_flag & VFS_DMI);
-	if (XFS_SEND_MMAP(XFS_VFSTOM(vp->v_vfsp), area, 0))
+	if (XFS_SEND_MMAP(XFS_VFSTOM(vp->v_vfsp), vma, 0)) {
+		fdata->type = VM_FAULT_SIGBUS;
 		return NULL;
-	return filemap_nopage(area, address, type);
+	}
+	return filemap_fault(vma, fdata);
 }
 #endif /* CONFIG_XFS_DMAPI */
 
@@ -343,7 +344,7 @@ xfs_file_mmap(
 	struct vm_area_struct *vma)
 {
 	vma->vm_ops = &xfs_file_vm_ops;
-	vma->vm_flags |= VM_CAN_INVALIDATE;
+	vma->vm_flags |= VM_CAN_INVALIDATE | VM_CAN_NONLINEAR;
 
 #ifdef CONFIG_XFS_DMAPI
 	if (vn_from_inode(filp->f_path.dentry->d_inode)->v_vfsp->vfs_flag & VFS_DMI)
@@ -502,14 +503,12 @@ const struct file_operations xfs_dir_fil
 };
 
 static struct vm_operations_struct xfs_file_vm_ops = {
-	.nopage		= filemap_nopage,
-	.populate	= filemap_populate,
+	.fault		= filemap_fault,
 };
 
 #ifdef CONFIG_XFS_DMAPI
 static struct vm_operations_struct xfs_dmapi_file_vm_ops = {
-	.nopage		= xfs_vm_nopage,
-	.populate	= filemap_populate,
+	.fault		= xfs_vm_fault,
 #ifdef HAVE_VMOP_MPROTECT
 	.mprotect	= xfs_vm_mprotect,
 #endif
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c
+++ linux-2.6/mm/mmap.c
@@ -1147,12 +1147,8 @@ out:	
 		mm->locked_vm += len >> PAGE_SHIFT;
 		make_pages_present(addr, addr + len);
 	}
-	if (flags & MAP_POPULATE) {
-		up_write(&mm->mmap_sem);
-		sys_remap_file_pages(addr, len, 0,
-					pgoff, flags & MAP_NONBLOCK);
-		down_write(&mm->mmap_sem);
-	}
+	if ((flags & MAP_POPULATE) && !(flags & MAP_NONBLOCK))
+		make_pages_present(addr, addr + len);
 	return addr;
 
 unmap_and_free_vma:
Index: linux-2.6/ipc/shm.c
===================================================================
--- linux-2.6.orig/ipc/shm.c
+++ linux-2.6/ipc/shm.c
@@ -261,7 +261,7 @@ static const struct file_operations shm_
 static struct vm_operations_struct shm_vm_ops = {
 	.open	= shm_open,	/* callback for a new vm-area open */
 	.close	= shm_close,	/* callback for when the vm-area is released */
-	.nopage	= shmem_nopage,
+	.fault	= shmem_fault,
 #if defined(CONFIG_NUMA) && defined(CONFIG_SHMEM)
 	.set_policy = shmem_set_policy,
 	.get_policy = shmem_get_policy,
Index: linux-2.6/mm/filemap_xip.c
===================================================================
--- linux-2.6.orig/mm/filemap_xip.c
+++ linux-2.6/mm/filemap_xip.c
@@ -200,62 +200,63 @@ __xip_unmap (struct address_space * mapp
 }
 
 /*
- * xip_nopage() is invoked via the vma operations vector for a
+ * xip_fault() is invoked via the vma operations vector for a
  * mapped memory region to read in file data during a page fault.
  *
- * This function is derived from filemap_nopage, but used for execute in place
+ * This function is derived from filemap_fault, but used for execute in place
  */
-static struct page *
-xip_file_nopage(struct vm_area_struct * area,
-		   unsigned long address,
-		   int *type)
+static struct page *xip_file_fault(struct vm_area_struct *area,
+					struct fault_data *fdata)
 {
 	struct file *file = area->vm_file;
 	struct address_space *mapping = file->f_mapping;
 	struct inode *inode = mapping->host;
 	struct page *page;
-	unsigned long size, pgoff, endoff;
+	pgoff_t size;
 
-	pgoff = ((address - area->vm_start) >> PAGE_CACHE_SHIFT)
-		+ area->vm_pgoff;
-	endoff = ((area->vm_end - area->vm_start) >> PAGE_CACHE_SHIFT)
-		+ area->vm_pgoff;
+	/* XXX: are VM_FAULT_ codes OK? */
 
 	size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
-	if (pgoff >= size) {
+	if (fdata->pgoff >= size) {
+		fdata->type = VM_FAULT_SIGBUS;
 		return NULL;
 	}
 
-	page = mapping->a_ops->get_xip_page(mapping, pgoff*(PAGE_SIZE/512), 0);
-	if (!IS_ERR(page)) {
+	page = mapping->a_ops->get_xip_page(mapping,
+					fdata->pgoff*(PAGE_SIZE/512), 0);
+	if (!IS_ERR(page))
 		goto out;
-	}
-	if (PTR_ERR(page) != -ENODATA)
+	if (PTR_ERR(page) != -ENODATA) {
+		fdata->type = VM_FAULT_OOM;
 		return NULL;
+	}
 
 	/* sparse block */
 	if ((area->vm_flags & (VM_WRITE | VM_MAYWRITE)) &&
 	    (area->vm_flags & (VM_SHARED| VM_MAYSHARE)) &&
 	    (!(mapping->host->i_sb->s_flags & MS_RDONLY))) {
 		/* maybe shared writable, allocate new block */
-		page = mapping->a_ops->get_xip_page (mapping,
-			pgoff*(PAGE_SIZE/512), 1);
-		if (IS_ERR(page))
+		page = mapping->a_ops->get_xip_page(mapping,
+					fdata->pgoff*(PAGE_SIZE/512), 1);
+		if (IS_ERR(page)) {
+			fdata->type = VM_FAULT_SIGBUS;
 			return NULL;
+		}
 		/* unmap page at pgoff from all other vmas */
-		__xip_unmap(mapping, pgoff);
+		__xip_unmap(mapping, fdata->pgoff);
 	} else {
 		/* not shared and writable, use ZERO_PAGE() */
 		page = ZERO_PAGE(0);
 	}
 
 out:
+	fdata->type = VM_FAULT_MINOR;
 	page_cache_get(page);
 	return page;
 }
 
 static struct vm_operations_struct xip_file_vm_ops = {
-	.nopage         = xip_file_nopage,
+	.fault	= xip_file_fault,
 };
 
 int xip_file_mmap(struct file * file, struct vm_area_struct * vma)
@@ -264,6 +265,7 @@ int xip_file_mmap(struct file * file, st
 
 	file_accessed(file);
 	vma->vm_ops = &xip_file_vm_ops;
+	vma->vm_flags |= VM_CAN_NONLINEAR;
 	return 0;
 }
 EXPORT_SYMBOL_GPL(xip_file_mmap);
Index: linux-2.6/mm/nommu.c
===================================================================
--- linux-2.6.orig/mm/nommu.c
+++ linux-2.6/mm/nommu.c
@@ -1299,8 +1299,7 @@ int in_gate_area_no_task(unsigned long a
 	return 0;
 }
 
-struct page *filemap_nopage(struct vm_area_struct *area,
-			unsigned long address, int *type)
+struct page *filemap_fault(struct vm_area_struct *vma, struct fault_data *fdata)
 {
 	BUG();
 	return NULL;
Index: linux-2.6/mm/shmem.c
===================================================================
--- linux-2.6.orig/mm/shmem.c
+++ linux-2.6/mm/shmem.c
@@ -82,7 +82,7 @@ enum sgp_type {
 	SGP_READ,	/* don't exceed i_size, don't allocate page */
 	SGP_CACHE,	/* don't exceed i_size, may allocate page */
 	SGP_WRITE,	/* may exceed i_size, may allocate page */
-	SGP_NOPAGE,	/* same as SGP_CACHE, return with page locked */
+	SGP_FAULT,	/* same as SGP_CACHE, return with page locked */
 };
 
 static int shmem_getpage(struct inode *inode, unsigned long idx,
@@ -1027,6 +1027,10 @@ static int shmem_getpage(struct inode *i
 
 	if (idx >= SHMEM_MAX_INDEX)
 		return -EFBIG;
+
+	if (type)
+		*type = VM_FAULT_MINOR;
+
 	/*
 	 * Normally, filepage is NULL on entry, and either found
 	 * uptodate immediately, or allocated and zeroed, or read
@@ -1217,7 +1221,7 @@ repeat:
 done:
 	if (*pagep != filepage) {
 		*pagep = filepage;
-		if (sgp != SGP_NOPAGE)
+		if (sgp != SGP_FAULT)
 			unlock_page(filepage);
 
 	}
@@ -1231,75 +1235,30 @@ failed:
 	return error;
 }
 
-struct page *shmem_nopage(struct vm_area_struct *vma, unsigned long address, int *type)
+struct page *shmem_fault(struct vm_area_struct *vma, struct fault_data *fdata)
 {
 	struct inode *inode = vma->vm_file->f_path.dentry->d_inode;
 	struct page *page = NULL;
-	unsigned long idx;
 	int error;
 
 	BUG_ON(!(vma->vm_flags & VM_CAN_INVALIDATE));
 
-	idx = (address - vma->vm_start) >> PAGE_SHIFT;
-	idx += vma->vm_pgoff;
-	idx >>= PAGE_CACHE_SHIFT - PAGE_SHIFT;
-	if (((loff_t) idx << PAGE_CACHE_SHIFT) >= i_size_read(inode))
-		return NOPAGE_SIGBUS;
+	if (((loff_t)fdata->pgoff << PAGE_CACHE_SHIFT) >= i_size_read(inode)) {
+		fdata->type = VM_FAULT_SIGBUS;
+		return NULL;
+	}
 
-	error = shmem_getpage(inode, idx, &page, SGP_NOPAGE, type);
-	if (error)
-		return (error == -ENOMEM)? NOPAGE_OOM: NOPAGE_SIGBUS;
+	error = shmem_getpage(inode, fdata->pgoff, &page,
+						SGP_FAULT, &fdata->type);
+	if (error) {
+		fdata->type = ((error == -ENOMEM)?VM_FAULT_OOM:VM_FAULT_SIGBUS);
+		return NULL;
+	}
 
 	mark_page_accessed(page);
 	return page;
 }
 
-static int shmem_populate(struct vm_area_struct *vma,
-	unsigned long addr, unsigned long len,
-	pgprot_t prot, unsigned long pgoff, int nonblock)
-{
-	struct inode *inode = vma->vm_file->f_path.dentry->d_inode;
-	struct mm_struct *mm = vma->vm_mm;
-	enum sgp_type sgp = nonblock? SGP_QUICK: SGP_CACHE;
-	unsigned long size;
-
-	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
-	if (pgoff >= size || pgoff + (len >> PAGE_SHIFT) > size)
-		return -EINVAL;
-
-	while ((long) len > 0) {
-		struct page *page = NULL;
-		int err;
-		/*
-		 * Will need changing if PAGE_CACHE_SIZE != PAGE_SIZE
-		 */
-		err = shmem_getpage(inode, pgoff, &page, sgp, NULL);
-		if (err)
-			return err;
-		/* Page may still be null, but only if nonblock was set. */
-		if (page) {
-			mark_page_accessed(page);
-			err = install_page(mm, vma, addr, page, prot);
-			if (err) {
-				page_cache_release(page);
-				return err;
-			}
-		} else if (vma->vm_flags & VM_NONLINEAR) {
-			/* No page was found just because we can't read it in
-			 * now (being here implies nonblock != 0), but the page
-			 * may exist, so set the PTE to fault it in later. */
-    			err = install_file_pte(mm, vma, addr, pgoff, prot);
-			if (err)
-	    			return err;
-		}
-
-		len -= PAGE_SIZE;
-		addr += PAGE_SIZE;
-		pgoff++;
-	}
-	return 0;
-}
-
 #ifdef CONFIG_NUMA
 int shmem_set_policy(struct vm_area_struct *vma, struct mempolicy *new)
 {
@@ -1344,7 +1303,7 @@ int shmem_mmap(struct file *file, struct
 {
 	file_accessed(file);
 	vma->vm_ops = &shmem_vm_ops;
-	vma->vm_flags |= VM_CAN_INVALIDATE;
+	vma->vm_flags |= VM_CAN_INVALIDATE | VM_CAN_NONLINEAR;
 	return 0;
 }
 
@@ -2401,8 +2360,7 @@ static struct super_operations shmem_ops
 };
 
 static struct vm_operations_struct shmem_vm_ops = {
-	.nopage		= shmem_nopage,
-	.populate	= shmem_populate,
+	.fault		= shmem_fault,
 #ifdef CONFIG_NUMA
 	.set_policy     = shmem_set_policy,
 	.get_policy     = shmem_get_policy,
Index: linux-2.6/mm/truncate.c
===================================================================
--- linux-2.6.orig/mm/truncate.c
+++ linux-2.6/mm/truncate.c
@@ -81,7 +81,7 @@ EXPORT_SYMBOL(cancel_dirty_page);
 /*
  * If truncate cannot remove the fs-private metadata from the page, the page
  * becomes anonymous.  It will be left on the LRU and may even be mapped into
- * user pagetables if we're racing with filemap_nopage().
+ * user pagetables if we're racing with filemap_fault().
  *
  * We need to bale out if page->mapping is no longer equal to the original
  * mapping.  This happens a) when the VM reclaimed the page while we waited on
Index: linux-2.6/fs/gfs2/ops_address.c
===================================================================
--- linux-2.6.orig/fs/gfs2/ops_address.c
+++ linux-2.6/fs/gfs2/ops_address.c
@@ -239,7 +239,7 @@ static int gfs2_readpage(struct file *fi
 		if (file) {
 			gf = file->private_data;
 			if (test_bit(GFF_EXLOCK, &gf->f_flags))
-				/* gfs2_sharewrite_nopage has grabbed the ip->i_gl already */
+				/* gfs2_sharewrite_fault has grabbed the ip->i_gl already */
 				goto skip_lock;
 		}
 		gfs2_holder_init(ip->i_gl, LM_ST_SHARED, GL_ATIME|LM_FLAG_TRY_1CB, &gh);
Index: linux-2.6/fs/ncpfs/mmap.c
===================================================================
--- linux-2.6.orig/fs/ncpfs/mmap.c
+++ linux-2.6/fs/ncpfs/mmap.c
@@ -25,8 +25,8 @@
 /*
  * Fill in the supplied page for mmap
  */
-static struct page* ncp_file_mmap_nopage(struct vm_area_struct *area,
-				     unsigned long address, int *type)
+static struct page* ncp_file_mmap_fault(struct vm_area_struct *area,
+						struct fault_data *fdata)
 {
 	struct file *file = area->vm_file;
 	struct dentry *dentry = file->f_path.dentry;
@@ -40,15 +40,17 @@ static struct page* ncp_file_mmap_nopage
 
 	page = alloc_page(GFP_HIGHUSER); /* ncpfs has nothing against high pages
 	           as long as recvmsg and memset works on it */
-	if (!page)
-		return page;
+	if (!page) {
+		fdata->type = VM_FAULT_OOM;
+		return NULL;
+	}
 	pg_addr = kmap(page);
-	address &= PAGE_MASK;
-	pos = address - area->vm_start + (area->vm_pgoff << PAGE_SHIFT);
+	pos = fdata->pgoff << PAGE_SHIFT;
 
 	count = PAGE_SIZE;
-	if (address + PAGE_SIZE > area->vm_end) {
-		count = area->vm_end - address;
+	if (fdata->address + PAGE_SIZE > area->vm_end) {
+		WARN_ON(1); /* shouldn't happen? */
+		count = area->vm_end - fdata->address;
 	}
 	/* what we can read in one go */
 	bufsize = NCP_SERVER(inode)->buffer_size;
@@ -91,15 +93,14 @@ static struct page* ncp_file_mmap_nopage
 	 * fetches from the network, here the analogue of disk.
 	 * -- wli
 	 */
-	if (type)
-		*type = VM_FAULT_MAJOR;
+	fdata->type = VM_FAULT_MAJOR;
 	count_vm_event(PGMAJFAULT);
 	return page;
 }
 
 static struct vm_operations_struct ncp_file_mmap =
 {
-	.nopage	= ncp_file_mmap_nopage,
+	.fault = ncp_file_mmap_fault,
 };
 
 
Index: linux-2.6/fs/ocfs2/aops.c
===================================================================
--- linux-2.6.orig/fs/ocfs2/aops.c
+++ linux-2.6/fs/ocfs2/aops.c
@@ -215,7 +215,7 @@ static int ocfs2_readpage(struct file *f
 	 * might now be discovering a truncate that hit on another node.
 	 * block_read_full_page->get_block freaks out if it is asked to read
 	 * beyond the end of a file, so we check here.  Callers
-	 * (generic_file_read, fault->nopage) are clever enough to check i_size
+	 * (generic_file_read, vm_ops->fault) are clever enough to check i_size
 	 * and notice that the page they just read isn't needed.
 	 *
 	 * XXX sys_readahead() seems to get that wrong?
Index: linux-2.6/Documentation/feature-removal-schedule.txt
===================================================================
--- linux-2.6.orig/Documentation/feature-removal-schedule.txt
+++ linux-2.6/Documentation/feature-removal-schedule.txt
@@ -170,6 +170,33 @@ Who:	Greg Kroah-Hartman <gregkh@suse.de>
 
 ---------------------------
 
+What:	filemap_nopage, filemap_populate
+When:	April 2007
+Why:	These legacy interfaces no longer have any callers in the kernel and
+	any functionality provided can be provided with filemap_fault. The
+	removal schedule is short because they are a big maintenance burden
+	and have some bugs.
+Who:	Nick Piggin <npiggin@suse.de>
+
+---------------------------
+
+What:	vm_ops.populate, install_page
+When:	April 2007
+Why:	These legacy interfaces no longer have any callers in the kernel and
+	any functionality provided can be provided with vm_ops.fault.
+Who:	Nick Piggin <npiggin@suse.de>
+
+---------------------------
+
+What:	vm_ops.nopage
+When:	February 2008, provided in-kernel callers have been converted
+Why:	This interface is replaced by vm_ops.fault, but it has been around
+	forever, is used by a lot of drivers, and doesn't cost much to
+	maintain.
+Who:	Nick Piggin <npiggin@suse.de>
+
+---------------------------
+
 What:	Interrupt only SA_* flags
 When:	Januar 2007
 Why:	The interrupt related SA_* flags are replaced by IRQF_* to move them
Index: linux-2.6/Documentation/filesystems/Locking
===================================================================
--- linux-2.6.orig/Documentation/filesystems/Locking
+++ linux-2.6/Documentation/filesystems/Locking
@@ -508,12 +508,14 @@ More details about quota locking can be 
 prototypes:
 	void (*open)(struct vm_area_struct*);
 	void (*close)(struct vm_area_struct*);
+	struct page *(*fault)(struct vm_area_struct*, struct fault_data *);
 	struct page *(*nopage)(struct vm_area_struct*, unsigned long, int *);
 
 locking rules:
 		BKL	mmap_sem
 open:		no	yes
 close:		no	yes
+fault:		no	yes
 nopage:		no	yes
 
 ================================================================================

^ permalink raw reply	[flat|nested] 198+ messages in thread

* [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
@ 2007-02-21  4:50   ` Nick Piggin
  0 siblings, 0 replies; 198+ messages in thread
From: Nick Piggin @ 2007-02-21  4:50 UTC (permalink / raw)
  To: Linux Memory Management, Andrew Morton
  Cc: Linux Kernel, Nick Piggin, Benjamin Herrenschmidt

Nonlinear mappings are (AFAIKS) simply a virtual memory concept that
encodes the virtual address -> file offset relation differently from
linear mappings.
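
A purely illustrative user-space sketch (not part of the patch; "datafile"
is a placeholder and is assumed to be at least four pages long) of what
makes a mapping nonlinear, using remap_file_pages(2) to rearrange which
file page backs which page of a MAP_SHARED mapping:

	#define _GNU_SOURCE
	#include <sys/mman.h>
	#include <fcntl.h>
	#include <unistd.h>

	int main(void)
	{
		long psz = sysconf(_SC_PAGESIZE);
		int fd = open("datafile", O_RDWR);
		char *p;

		if (fd < 0)
			return 1;
		/* linear: mapping page i is backed by file page i */
		p = mmap(NULL, 4 * psz, PROT_READ | PROT_WRITE,
			 MAP_SHARED, fd, 0);
		if (p == MAP_FAILED)
			return 1;
		/* nonlinear: mapping page 0 is now backed by file page 3 */
		if (remap_file_pages(p, psz, 0, 3, 0) != 0)
			return 1;
		return 0;
	}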

I can't see why the filesystem/pagecache code should need to know anything
about it, except for the fact that the ->nopage handler didn't quite pass
down enough information (i.e. pgoff). But it is more logical to pass pgoff
rather than have the ->nopage function calculate it itself anyway. And
having the ->populate handler install the pte itself is sort of nasty.

This patch introduces a new fault handler that replaces ->nopage and
->populate and (later) ->nopfn. Most of the old mechanism is still in
place, so there is a lot of duplication; nice cleanups become possible
once everyone switches over.
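
Purely as illustration (not part of this patch), a minimal conversion of a
trivial handler could look like the following. foo_find_page() is a made-up
helper standing in for however a driver or filesystem locates its page, and
handlers that set VM_CAN_INVALIDATE (like filemap_fault) must return the
page locked:

	/* Old API: the handler computes pgoff itself and reports errors
	 * with the NOPAGE_* magic pointers. */
	static struct page *foo_nopage(struct vm_area_struct *vma,
					unsigned long address, int *type)
	{
		pgoff_t pgoff = ((address - vma->vm_start)
					>> PAGE_CACHE_SHIFT) + vma->vm_pgoff;
		struct page *page = foo_find_page(vma->vm_file, pgoff);

		if (!page)
			return NOPAGE_SIGBUS;
		if (type)
			*type = VM_FAULT_MINOR;
		return page;
	}

	/* New API: pgoff arrives precomputed in fault_data, the fault
	 * type is reported through fdata->type, and NULL is returned
	 * on error. */
	static struct page *foo_fault(struct vm_area_struct *vma,
					struct fault_data *fdata)
	{
		struct page *page = foo_find_page(vma->vm_file, fdata->pgoff);

		if (!page) {
			fdata->type = VM_FAULT_SIGBUS;
			return NULL;
		}
		fdata->type = VM_FAULT_MINOR;
		return page;
	}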

The rationale for doing this in the first place is that nonlinear mappings
are subject to the pagefault vs invalidate/truncate race too, and it seemed
stupid to duplicate the synchronisation logic rather than just consolidate
the two.

After this patch, MAP_NONBLOCK no longer sets up ptes for pages present in
pagecache. That seems like fringe functionality anyway.

NOPAGE_REFAULT is removed. This should be implemented with ->fault, and
no users have hit mainline yet.
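
Illustrative only: with the new API, the old refault behaviour is expressed
by a ->fault handler reporting a minor fault while supplying no page, since
__do_fault() returns fdata.type directly when no page comes back:

	fdata->type = VM_FAULT_MINOR;	/* return to userspace and refault */
	return NULL;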

Signed-off-by: Nick Piggin <npiggin@suse.de>

 Documentation/feature-removal-schedule.txt |   27 ++++++
 Documentation/filesystems/Locking          |    2 
 fs/gfs2/ops_address.c                      |    2 
 fs/gfs2/ops_file.c                         |    2 
 fs/gfs2/ops_vm.c                           |   34 ++++---
 fs/ncpfs/mmap.c                            |   23 ++---
 fs/ocfs2/aops.c                            |    2 
 fs/ocfs2/mmap.c                            |   17 +--
 fs/xfs/linux-2.6/xfs_file.c                |   23 ++---
 include/linux/mm.h                         |   36 ++++++--
 ipc/shm.c                                  |    2 
 mm/filemap.c                               |   93 ++++++++++++--------
 mm/filemap_xip.c                           |   44 +++++----
 mm/fremap.c                                |  105 ++++++++++++++++-------
 mm/memory.c                                |  129 ++++++++++++++++++-----------
 mm/mmap.c                                  |    8 -
 mm/nommu.c                                 |    3 
 mm/shmem.c                                 |   80 ++++-------------
 mm/truncate.c                              |    2 
 19 files changed, 371 insertions(+), 263 deletions(-)

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -175,6 +175,7 @@ extern unsigned int kobjsize(const void 
 					 * In this case, do_no_page must
 					 * return with the page locked.
 					 */
+#define VM_CAN_NONLINEAR 0x10000000	/* Has ->fault & does nonlinear pages */
 
 #ifndef VM_STACK_DEFAULT_FLAGS		/* arch can override this */
 #define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS
@@ -198,6 +199,26 @@ extern unsigned int kobjsize(const void 
  */
 extern pgprot_t protection_map[16];
 
+#define FAULT_FLAG_WRITE	0x01
+#define FAULT_FLAG_NONLINEAR	0x02
+
+/*
+ * fault_data is filled in by the pagefault handler and passed to the
+ * vma's ->fault function. That function is responsible for filling in
+ * 'type', which is the type of fault if a page is returned, or the type
+ * of error if NULL is returned.
+ *
+ * pgoff should be used in favour of address, if possible. If pgoff is
+ * used, one may set VM_CAN_NONLINEAR in the vma->vm_flags to get
+ * nonlinear mapping support.
+ */
+struct fault_data {
+	unsigned long address;
+	pgoff_t pgoff;
+	unsigned int flags;
+
+	int type;
+};
 
 /*
  * These are the virtual MM functions - opening of an area, closing and
@@ -207,6 +228,7 @@ extern pgprot_t protection_map[16];
 struct vm_operations_struct {
 	void (*open)(struct vm_area_struct * area);
 	void (*close)(struct vm_area_struct * area);
+	struct page * (*fault)(struct vm_area_struct *vma, struct fault_data * fdata);
 	struct page * (*nopage)(struct vm_area_struct * area, unsigned long address, int *type);
 	unsigned long (*nopfn)(struct vm_area_struct * area, unsigned long address);
 	int (*populate)(struct vm_area_struct * area, unsigned long address, unsigned long len, pgprot_t prot, unsigned long pgoff, int nonblock);
@@ -636,7 +658,6 @@ static inline int page_mapped(struct pag
  */
 #define NOPAGE_SIGBUS	(NULL)
 #define NOPAGE_OOM	((struct page *) (-1))
-#define NOPAGE_REFAULT	((struct page *) (-2))	/* Return to userspace, rerun */
 
 /*
  * Error return values for the *_nopfn functions
@@ -666,14 +687,13 @@ static inline int page_mapped(struct pag
 extern void show_free_areas(void);
 
 #ifdef CONFIG_SHMEM
-struct page *shmem_nopage(struct vm_area_struct *vma,
-			unsigned long address, int *type);
+struct page *shmem_fault(struct vm_area_struct *vma, struct fault_data *fdata);
 int shmem_set_policy(struct vm_area_struct *vma, struct mempolicy *new);
 struct mempolicy *shmem_get_policy(struct vm_area_struct *vma,
 					unsigned long addr);
 int shmem_lock(struct file *file, int lock, struct user_struct *user);
 #else
-#define shmem_nopage filemap_nopage
+#define shmem_fault filemap_fault
 
 static inline int shmem_lock(struct file *file, int lock,
 			     struct user_struct *user)
@@ -1071,9 +1091,11 @@ extern void truncate_inode_pages_range(s
 				       loff_t lstart, loff_t lend);
 
 /* generic vm_area_ops exported for stackable file systems */
-extern struct page *filemap_nopage(struct vm_area_struct *, unsigned long, int *);
-extern int filemap_populate(struct vm_area_struct *, unsigned long,
-		unsigned long, pgprot_t, unsigned long, int);
+extern struct page *filemap_fault(struct vm_area_struct *, struct fault_data *);
+extern struct page * __deprecated_for_modules filemap_nopage(
+			struct vm_area_struct *, unsigned long, int *);
+extern int __deprecated_for_modules filemap_populate(struct vm_area_struct *,
+		unsigned long, unsigned long, pgprot_t, unsigned long, int);
 
 /* mm/page-writeback.c */
 int write_one_page(struct page *page, int wait);
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -2175,10 +2175,10 @@ oom:
 }
 
 /*
- * do_no_page() tries to create a new page mapping. It aggressively
+ * __do_fault() tries to create a new page mapping. It aggressively
  * tries to share with existing pages, but makes a separate copy if
- * the "write_access" parameter is true in order to avoid the next
- * page fault.
+ * the FAULT_FLAG_WRITE is set in the flags parameter in order to avoid
+ * the next page fault.
  *
  * As this is called only for pages that do not currently exist, we
  * do not need to flush old virtual caches or the TLB.
@@ -2187,64 +2187,82 @@ oom:
  * but allow concurrent faults), and pte mapped but not yet locked.
  * We return with mmap_sem still held, but pte unmapped and unlocked.
  */
-static int do_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
+static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		unsigned long address, pte_t *page_table, pmd_t *pmd,
-		int write_access)
+		pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
 {
 	spinlock_t *ptl;
-	struct page *page, *nopage_page;
+	struct page *page, *faulted_page;
 	pte_t entry;
-	int ret = VM_FAULT_MINOR;
 	int anon = 0;
 	struct page *dirty_page = NULL;
+	struct fault_data fdata;
+
+	fdata.address = address & PAGE_MASK;
+	fdata.pgoff = pgoff;
+	fdata.flags = flags;
 
 	pte_unmap(page_table);
 	BUG_ON(vma->vm_flags & VM_PFNMAP);
 
-	nopage_page = vma->vm_ops->nopage(vma, address & PAGE_MASK, &ret);
-	/* no page was available -- either SIGBUS, OOM or REFAULT */
-	if (unlikely(nopage_page == NOPAGE_SIGBUS))
-		return VM_FAULT_SIGBUS;
-	else if (unlikely(nopage_page == NOPAGE_OOM))
-		return VM_FAULT_OOM;
-	else if (unlikely(nopage_page == NOPAGE_REFAULT))
-		return VM_FAULT_MINOR;
+	if (likely(vma->vm_ops->fault)) {
+		fdata.type = -1;
+		faulted_page = vma->vm_ops->fault(vma, &fdata);
+		WARN_ON(fdata.type == -1);
+		if (unlikely(!faulted_page))
+			return fdata.type;
+	} else {
+		/* Legacy ->nopage path */
+		fdata.type = VM_FAULT_MINOR;
+		faulted_page = vma->vm_ops->nopage(vma, address & PAGE_MASK,
+								&fdata.type);
+		/* no page was available -- either SIGBUS or OOM */
+		if (unlikely(faulted_page == NOPAGE_SIGBUS))
+			return VM_FAULT_SIGBUS;
+		else if (unlikely(faulted_page == NOPAGE_OOM))
+			return VM_FAULT_OOM;
+	}
 
-	BUG_ON(vma->vm_flags & VM_CAN_INVALIDATE && !PageLocked(nopage_page));
 	/*
-	 * For consistency in subsequent calls, make the nopage_page always
+	 * For consistency in subsequent calls, make the faulted_page always
 	 * locked.
 	 */
 	if (unlikely(!(vma->vm_flags & VM_CAN_INVALIDATE)))
-		lock_page(nopage_page);
+		lock_page(faulted_page);
+	else
+		BUG_ON(!PageLocked(faulted_page));
 
 	/*
 	 * Should we do an early C-O-W break?
 	 */
-	page = nopage_page;
-	if (write_access) {
+	page = faulted_page;
+	if (flags & FAULT_FLAG_WRITE) {
 		if (!(vma->vm_flags & VM_SHARED)) {
+			anon = 1;
 			if (unlikely(anon_vma_prepare(vma))) {
-				ret = VM_FAULT_OOM;
-				goto out_error;
+				fdata.type = VM_FAULT_OOM;
+				goto out;
 			}
 			page = alloc_page_vma(GFP_HIGHUSER, vma, address);
 			if (!page) {
-				ret = VM_FAULT_OOM;
-				goto out_error;
+				fdata.type = VM_FAULT_OOM;
+				goto out;
 			}
-			copy_user_highpage(page, nopage_page, address, vma);
-			anon = 1;
+			copy_user_highpage(page, faulted_page, address, vma);
 		} else {
-			/* if the page will be shareable, see if the backing
+			/*
+			 * If the page will be shareable, see if the backing
 			 * address space wants to know that the page is about
-			 * to become writable */
+			 * to become writable
+			 */
 			if (vma->vm_ops->page_mkwrite &&
 			    vma->vm_ops->page_mkwrite(vma, page) < 0) {
-				ret = VM_FAULT_SIGBUS;
-				goto out_error;
+				fdata.type = VM_FAULT_SIGBUS;
+				anon = 1; /* no anon but release faulted_page */
+				goto out;
 			}
 		}
+
 	}
 
 	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
@@ -2260,10 +2278,10 @@ static int do_no_page(struct mm_struct *
 	 * handle that later.
 	 */
 	/* Only go through if we didn't race with anybody else... */
-	if (likely(pte_none(*page_table))) {
+	if (likely(pte_same(*page_table, orig_pte))) {
 		flush_icache_page(vma, page);
 		entry = mk_pte(page, vma->vm_page_prot);
-		if (write_access)
+		if (flags & FAULT_FLAG_WRITE)
 			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
 		set_pte_at(mm, address, page_table, entry);
 		if (anon) {
@@ -2273,7 +2291,7 @@ static int do_no_page(struct mm_struct *
 		} else {
 			inc_mm_counter(mm, file_rss);
 			page_add_file_rmap(page);
-			if (write_access) {
+			if (flags & FAULT_FLAG_WRITE) {
 				dirty_page = page;
 				get_page(dirty_page);
 			}
@@ -2286,25 +2304,42 @@ static int do_no_page(struct mm_struct *
 		if (anon)
 			page_cache_release(page);
 		else
-			anon = 1; /* not anon, but release nopage_page */
+			anon = 1; /* no anon but release faulted_page */
 	}
 
 	pte_unmap_unlock(page_table, ptl);
 
 out:
-	unlock_page(nopage_page);
+	unlock_page(faulted_page);
 	if (anon)
-		page_cache_release(nopage_page);
+		page_cache_release(faulted_page);
 	else if (dirty_page) {
 		set_page_dirty_balance(dirty_page);
 		put_page(dirty_page);
 	}
 
-	return ret;
+	return fdata.type;
+}
+
+static int do_linear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+		unsigned long address, pte_t *page_table, pmd_t *pmd,
+		int write_access, pte_t orig_pte)
+{
+	pgoff_t pgoff = (((address & PAGE_MASK)
+			- vma->vm_start) >> PAGE_CACHE_SHIFT) + vma->vm_pgoff;
+	unsigned int flags = (write_access ? FAULT_FLAG_WRITE : 0);
+
+	return __do_fault(mm, vma, address, page_table, pmd, pgoff, flags, orig_pte);
+}
 
-out_error:
-	anon = 1; /* relase nopage_page */
-	goto out;
+static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+		unsigned long address, pte_t *page_table, pmd_t *pmd,
+		int write_access, pgoff_t pgoff, pte_t orig_pte)
+{
+	unsigned int flags = FAULT_FLAG_NONLINEAR |
+				(write_access ? FAULT_FLAG_WRITE : 0);
+
+	return __do_fault(mm, vma, address, page_table, pmd, pgoff, flags, orig_pte);
 }
 
 /*
@@ -2383,9 +2418,14 @@ static int do_file_page(struct mm_struct
 		print_bad_pte(vma, orig_pte, address);
 		return VM_FAULT_OOM;
 	}
-	/* We can then assume vm->vm_ops && vma->vm_ops->populate */
 
 	pgoff = pte_to_pgoff(orig_pte);
+
+	if (vma->vm_ops && vma->vm_ops->fault)
+		return do_nonlinear_fault(mm, vma, address, page_table, pmd,
+					write_access, pgoff, orig_pte);
+
+	/* We can then assume vm->vm_ops && vma->vm_ops->populate */
 	err = vma->vm_ops->populate(vma, address & PAGE_MASK, PAGE_SIZE,
 					vma->vm_page_prot, pgoff, 0);
 	if (err == -ENOMEM)
@@ -2420,10 +2460,9 @@ static inline int handle_pte_fault(struc
 	if (!pte_present(entry)) {
 		if (pte_none(entry)) {
 			if (vma->vm_ops) {
-				if (vma->vm_ops->nopage)
-					return do_no_page(mm, vma, address,
-							  pte, pmd,
-							  write_access);
+				if (vma->vm_ops->fault || vma->vm_ops->nopage)
+					return do_linear_fault(mm, vma, address,
+						pte, pmd, write_access, entry);
 				if (unlikely(vma->vm_ops->nopfn))
 					return do_no_pfn(mm, vma, address, pte,
 							 pmd, write_access);
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1305,40 +1305,37 @@ static int fastcall page_cache_read(stru
 #define MMAP_LOTSAMISS  (100)
 
 /**
- * filemap_nopage - read in file data for page fault handling
- * @area:	the applicable vm_area
- * @address:	target address to read in
- * @type:	returned with VM_FAULT_{MINOR,MAJOR} if not %NULL
+ * filemap_fault - read in file data for page fault handling
+ * @data:	the applicable fault_data
  *
- * filemap_nopage() is invoked via the vma operations vector for a
+ * filemap_fault() is invoked via the vma operations vector for a
  * mapped memory region to read in file data during a page fault.
  *
  * The goto's are kind of ugly, but this streamlines the normal case of having
  * it in the page cache, and handles the special cases reasonably without
  * having a lot of duplicated code.
  */
-struct page *filemap_nopage(struct vm_area_struct *area,
-				unsigned long address, int *type)
+struct page *filemap_fault(struct vm_area_struct *vma, struct fault_data *fdata)
 {
 	int error;
-	struct file *file = area->vm_file;
+	struct file *file = vma->vm_file;
 	struct address_space *mapping = file->f_mapping;
 	struct file_ra_state *ra = &file->f_ra;
 	struct inode *inode = mapping->host;
 	struct page *page;
-	unsigned long size, pgoff;
-	int did_readaround = 0, majmin = VM_FAULT_MINOR;
+	unsigned long size;
+	int did_readaround = 0;
 
-	BUG_ON(!(area->vm_flags & VM_CAN_INVALIDATE));
+	fdata->type = VM_FAULT_MINOR;
 
-	pgoff = ((address-area->vm_start) >> PAGE_CACHE_SHIFT) + area->vm_pgoff;
+	BUG_ON(!(vma->vm_flags & VM_CAN_INVALIDATE));
 
 	size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
-	if (pgoff >= size)
+	if (fdata->pgoff >= size)
 		goto outside_data_content;
 
 	/* If we don't want any read-ahead, don't bother */
-	if (VM_RandomReadHint(area))
+	if (VM_RandomReadHint(vma))
 		goto no_cached_page;
 
 	/*
@@ -1347,19 +1344,19 @@ struct page *filemap_nopage(struct vm_ar
 	 *
 	 * For sequential accesses, we use the generic readahead logic.
 	 */
-	if (VM_SequentialReadHint(area))
-		page_cache_readahead(mapping, ra, file, pgoff, 1);
+	if (VM_SequentialReadHint(vma))
+		page_cache_readahead(mapping, ra, file, fdata->pgoff, 1);
 
 	/*
 	 * Do we have something in the page cache already?
 	 */
 retry_find:
-	page = find_lock_page(mapping, pgoff);
+	page = find_lock_page(mapping, fdata->pgoff);
 	if (!page) {
 		unsigned long ra_pages;
 
-		if (VM_SequentialReadHint(area)) {
-			handle_ra_miss(mapping, ra, pgoff);
+		if (VM_SequentialReadHint(vma)) {
+			handle_ra_miss(mapping, ra, fdata->pgoff);
 			goto no_cached_page;
 		}
 		ra->mmap_miss++;
@@ -1376,7 +1373,7 @@ retry_find:
 		 * check did_readaround, as this is an inner loop.
 		 */
 		if (!did_readaround) {
-			majmin = VM_FAULT_MAJOR;
+			fdata->type = VM_FAULT_MAJOR;
 			count_vm_event(PGMAJFAULT);
 		}
 		did_readaround = 1;
@@ -1384,11 +1381,11 @@ retry_find:
 		if (ra_pages) {
 			pgoff_t start = 0;
 
-			if (pgoff > ra_pages / 2)
-				start = pgoff - ra_pages / 2;
+			if (fdata->pgoff > ra_pages / 2)
+				start = fdata->pgoff - ra_pages / 2;
 			do_page_cache_readahead(mapping, file, start, ra_pages);
 		}
-		page = find_lock_page(mapping, pgoff);
+		page = find_lock_page(mapping, fdata->pgoff);
 		if (!page)
 			goto no_cached_page;
 	}
@@ -1405,7 +1402,7 @@ retry_find:
 
 	/* Must recheck i_size under page lock */
 	size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
-	if (unlikely(pgoff >= size)) {
+	if (unlikely(fdata->pgoff >= size)) {
 		unlock_page(page);
 		goto outside_data_content;
 	}
@@ -1414,8 +1411,6 @@ retry_find:
 	 * Found the page and have a reference on it.
 	 */
 	mark_page_accessed(page);
-	if (type)
-		*type = majmin;
 	return page;
 
 outside_data_content:
@@ -1423,15 +1418,17 @@ outside_data_content:
 	 * An external ptracer can access pages that normally aren't
 	 * accessible..
 	 */
-	if (area->vm_mm == current->mm)
-		return NOPAGE_SIGBUS;
+	if (vma->vm_mm == current->mm) {
+		fdata->type = VM_FAULT_SIGBUS;
+		return NULL;
+	}
 	/* Fall through to the non-read-ahead case */
 no_cached_page:
 	/*
 	 * We're only likely to ever get here if MADV_RANDOM is in
 	 * effect.
 	 */
-	error = page_cache_read(file, pgoff);
+	error = page_cache_read(file, fdata->pgoff);
 
 	/*
 	 * The page we want has now been added to the page cache.
@@ -1447,13 +1444,15 @@ no_cached_page:
 	 * to schedule I/O.
 	 */
 	if (error == -ENOMEM)
-		return NOPAGE_OOM;
-	return NOPAGE_SIGBUS;
+		fdata->type = VM_FAULT_OOM;
+	else
+		fdata->type = VM_FAULT_SIGBUS;
+	return NULL;
 
 page_not_uptodate:
 	/* IO error path */
 	if (!did_readaround) {
-		majmin = VM_FAULT_MAJOR;
+		fdata->type = VM_FAULT_MAJOR;
 		count_vm_event(PGMAJFAULT);
 	}
 
@@ -1472,7 +1471,30 @@ page_not_uptodate:
 
 	/* Things didn't work out. Return zero to tell the mm layer so. */
 	shrink_readahead_size_eio(file, ra);
-	return NOPAGE_SIGBUS;
+	fdata->type = VM_FAULT_SIGBUS;
+	return NULL;
+}
+EXPORT_SYMBOL(filemap_fault);
+
+/*
+ * filemap_nopage and filemap_populate are legacy exports that are not used
+ * in tree. Scheduled for removal.
+ */
+struct page *filemap_nopage(struct vm_area_struct *area,
+				unsigned long address, int *type)
+{
+	struct page *page;
+	struct fault_data fdata;
+	fdata.address = address;
+	fdata.pgoff = ((address - area->vm_start) >> PAGE_CACHE_SHIFT)
+			+ area->vm_pgoff;
+	fdata.flags = 0;
+
+	page = filemap_fault(area, &fdata);
+	if (type)
+		*type = fdata.type;
+
+	return page;
 }
 EXPORT_SYMBOL(filemap_nopage);
 
@@ -1650,8 +1672,7 @@ repeat:
 EXPORT_SYMBOL(filemap_populate);
 
 struct vm_operations_struct generic_file_vm_ops = {
-	.nopage		= filemap_nopage,
-	.populate	= filemap_populate,
+	.fault		= filemap_fault,
 };
 
 /* This is used for a general mmap of a disk file */
@@ -1664,7 +1685,7 @@ int generic_file_mmap(struct file * file
 		return -ENOEXEC;
 	file_accessed(file);
 	vma->vm_ops = &generic_file_vm_ops;
-	vma->vm_flags |= VM_CAN_INVALIDATE;
+	vma->vm_flags |= VM_CAN_INVALIDATE | VM_CAN_NONLINEAR;
 	return 0;
 }
 
Index: linux-2.6/mm/fremap.c
===================================================================
--- linux-2.6.orig/mm/fremap.c
+++ linux-2.6/mm/fremap.c
@@ -126,6 +126,25 @@ out:
 	return err;
 }
 
+static int populate_range(struct mm_struct *mm, struct vm_area_struct *vma,
+			unsigned long addr, unsigned long size, pgoff_t pgoff)
+{
+	int err;
+
+	do {
+		err = install_file_pte(mm, vma, addr, pgoff, vma->vm_page_prot);
+		if (err)
+			return err;
+
+		size -= PAGE_SIZE;
+		addr += PAGE_SIZE;
+		pgoff++;
+	} while (size);
+
+        return 0;
+
+}
+
 /***
  * sys_remap_file_pages - remap arbitrary pages of a shared backing store
  *                        file within an existing vma.
@@ -183,41 +202,63 @@ asmlinkage long sys_remap_file_pages(uns
 	 * the single existing vma.  vm_private_data is used as a
 	 * swapout cursor in a VM_NONLINEAR vma.
 	 */
-	if (vma && (vma->vm_flags & VM_SHARED) &&
-		(!vma->vm_private_data || (vma->vm_flags & VM_NONLINEAR)) &&
-		vma->vm_ops && vma->vm_ops->populate &&
-			end > start && start >= vma->vm_start &&
-				end <= vma->vm_end) {
-
-		/* Must set VM_NONLINEAR before any pages are populated. */
-		if (pgoff != linear_page_index(vma, start) &&
-		    !(vma->vm_flags & VM_NONLINEAR)) {
-			if (!has_write_lock) {
-				up_read(&mm->mmap_sem);
-				down_write(&mm->mmap_sem);
-				has_write_lock = 1;
-				goto retry;
-			}
-			mapping = vma->vm_file->f_mapping;
-			spin_lock(&mapping->i_mmap_lock);
-			flush_dcache_mmap_lock(mapping);
-			vma->vm_flags |= VM_NONLINEAR;
-			vma_prio_tree_remove(vma, &mapping->i_mmap);
-			vma_nonlinear_insert(vma, &mapping->i_mmap_nonlinear);
-			flush_dcache_mmap_unlock(mapping);
-			spin_unlock(&mapping->i_mmap_lock);
+	if (!vma || !(vma->vm_flags & VM_SHARED))
+		goto out;
+
+	if (vma->vm_private_data && !(vma->vm_flags & VM_NONLINEAR))
+		goto out;
+
+	if ((!vma->vm_ops || !vma->vm_ops->populate) &&
+					!(vma->vm_flags & VM_CAN_NONLINEAR))
+		goto out;
+
+	if (end <= start || start < vma->vm_start || end > vma->vm_end)
+		goto out;
+
+	/* Must set VM_NONLINEAR before any pages are populated. */
+	if (!(vma->vm_flags & VM_NONLINEAR)) {
+		/* Don't need a nonlinear mapping, exit success */
+		if (pgoff == linear_page_index(vma, start)) {
+			err = 0;
+			goto out;
 		}
 
-		err = vma->vm_ops->populate(vma, start, size,
-					    vma->vm_page_prot,
-					    pgoff, flags & MAP_NONBLOCK);
-
-		/*
-		 * We can't clear VM_NONLINEAR because we'd have to do
-		 * it after ->populate completes, and that would prevent
-		 * downgrading the lock.  (Locks can't be upgraded).
-		 */
+		if (!has_write_lock) {
+			up_read(&mm->mmap_sem);
+			down_write(&mm->mmap_sem);
+			has_write_lock = 1;
+			goto retry;
+		}
+		mapping = vma->vm_file->f_mapping;
+		spin_lock(&mapping->i_mmap_lock);
+		flush_dcache_mmap_lock(mapping);
+		vma->vm_flags |= VM_NONLINEAR;
+		vma_prio_tree_remove(vma, &mapping->i_mmap);
+		vma_nonlinear_insert(vma, &mapping->i_mmap_nonlinear);
+		flush_dcache_mmap_unlock(mapping);
+		spin_unlock(&mapping->i_mmap_lock);
 	}
+
+	if (vma->vm_flags & VM_CAN_NONLINEAR) {
+		err = populate_range(mm, vma, start, size, pgoff);
+		if (!err && !(flags & MAP_NONBLOCK)) {
+			if (unlikely(has_write_lock)) {
+				downgrade_write(&mm->mmap_sem);
+				has_write_lock = 0;
+			}
+			make_pages_present(start, start+size);
+		}
+	} else
+		err = vma->vm_ops->populate(vma, start, size, vma->vm_page_prot,
+					    	pgoff, flags & MAP_NONBLOCK);
+
+	/*
+	 * We can't clear VM_NONLINEAR because we'd have to do
+	 * it after ->populate completes, and that would prevent
+	 * downgrading the lock.  (Locks can't be upgraded).
+	 */
+
+out:
 	if (likely(!has_write_lock))
 		up_read(&mm->mmap_sem);
 	else
Index: linux-2.6/fs/gfs2/ops_file.c
===================================================================
--- linux-2.6.orig/fs/gfs2/ops_file.c
+++ linux-2.6/fs/gfs2/ops_file.c
@@ -365,7 +365,7 @@ static int gfs2_mmap(struct file *file, 
 	else
 		vma->vm_ops = &gfs2_vm_ops_private;
 
-	vma->vm_flags |= VM_CAN_INVALIDATE;
+	vma->vm_flags |= VM_CAN_INVALIDATE|VM_CAN_NONLINEAR;
 
 	gfs2_glock_dq_uninit(&i_gh);
 
Index: linux-2.6/fs/gfs2/ops_vm.c
===================================================================
--- linux-2.6.orig/fs/gfs2/ops_vm.c
+++ linux-2.6/fs/gfs2/ops_vm.c
@@ -27,13 +27,13 @@
 #include "trans.h"
 #include "util.h"
 
-static struct page *gfs2_private_nopage(struct vm_area_struct *area,
-					unsigned long address, int *type)
+static struct page *gfs2_private_fault(struct vm_area_struct *vma,
+					struct fault_data *fdata)
 {
-	struct gfs2_inode *ip = GFS2_I(area->vm_file->f_mapping->host);
+	struct gfs2_inode *ip = GFS2_I(vma->vm_file->f_mapping->host);
 
 	set_bit(GIF_PAGED, &ip->i_flags);
-	return filemap_nopage(area, address, type);
+	return filemap_fault(vma, fdata);
 }
 
 static int alloc_page_backing(struct gfs2_inode *ip, struct page *page)
@@ -104,16 +104,14 @@ out:
 	return error;
 }
 
-static struct page *gfs2_sharewrite_nopage(struct vm_area_struct *area,
-					   unsigned long address, int *type)
+static struct page *gfs2_sharewrite_fault(struct vm_area_struct *vma,
+						struct fault_data *fdata)
 {
-	struct file *file = area->vm_file;
+	struct file *file = vma->vm_file;
 	struct gfs2_file *gf = file->private_data;
 	struct gfs2_inode *ip = GFS2_I(file->f_mapping->host);
 	struct gfs2_holder i_gh;
 	struct page *result = NULL;
-	unsigned long index = ((address - area->vm_start) >> PAGE_CACHE_SHIFT) +
-			      area->vm_pgoff;
 	int alloc_required;
 	int error;
 
@@ -124,21 +122,25 @@ static struct page *gfs2_sharewrite_nopa
 	set_bit(GIF_PAGED, &ip->i_flags);
 	set_bit(GIF_SW_PAGED, &ip->i_flags);
 
-	error = gfs2_write_alloc_required(ip, (u64)index << PAGE_CACHE_SHIFT,
-					  PAGE_CACHE_SIZE, &alloc_required);
-	if (error)
+	error = gfs2_write_alloc_required(ip,
+					(u64)fdata->pgoff << PAGE_CACHE_SHIFT,
+					PAGE_CACHE_SIZE, &alloc_required);
+	if (error) {
+		fdata->type = VM_FAULT_OOM; /* XXX: are these right? */
 		goto out;
+	}
 
 	set_bit(GFF_EXLOCK, &gf->f_flags);
-	result = filemap_nopage(area, address, type);
+	result = filemap_fault(vma, fdata);
 	clear_bit(GFF_EXLOCK, &gf->f_flags);
-	if (!result || result == NOPAGE_OOM)
+	if (!result)
 		goto out;
 
 	if (alloc_required) {
 		error = alloc_page_backing(ip, result);
 		if (error) {
 			page_cache_release(result);
+			fdata->type = VM_FAULT_OOM;
 			result = NULL;
 			goto out;
 		}
@@ -152,10 +154,10 @@ out:
 }
 
 struct vm_operations_struct gfs2_vm_ops_private = {
-	.nopage = gfs2_private_nopage,
+	.fault = gfs2_private_fault,
 };
 
 struct vm_operations_struct gfs2_vm_ops_sharewrite = {
-	.nopage = gfs2_sharewrite_nopage,
+	.fault = gfs2_sharewrite_fault,
 };
 
Index: linux-2.6/fs/ocfs2/mmap.c
===================================================================
--- linux-2.6.orig/fs/ocfs2/mmap.c
+++ linux-2.6/fs/ocfs2/mmap.c
@@ -42,16 +42,14 @@
 #include "inode.h"
 #include "mmap.h"
 
-static struct page *ocfs2_nopage(struct vm_area_struct * area,
-				 unsigned long address,
-				 int *type)
+static struct page *ocfs2_fault(struct vm_area_struct *area,
+						struct fault_data *fdata)
 {
-	struct page *page = NOPAGE_SIGBUS;
+	struct page *page = NULL;
 	sigset_t blocked, oldset;
 	int ret;
 
-	mlog_entry("(area=%p, address=%lu, type=%p)\n", area, address,
-		   type);
+	mlog_entry("(area=%p, page offset=%lu)\n", area, fdata->pgoff);
 
 	/* The best way to deal with signals in this path is
 	 * to block them upfront, rather than allowing the
@@ -62,11 +60,12 @@ static struct page *ocfs2_nopage(struct 
 	 * from sigprocmask */
 	ret = sigprocmask(SIG_BLOCK, &blocked, &oldset);
 	if (ret < 0) {
+		fdata->type = VM_FAULT_SIGBUS;
 		mlog_errno(ret);
 		goto out;
 	}
 
-	page = filemap_nopage(area, address, type);
+	page = filemap_fault(area, fdata);
 
 	ret = sigprocmask(SIG_SETMASK, &oldset, NULL);
 	if (ret < 0)
@@ -77,7 +76,7 @@ out:
 }
 
 static struct vm_operations_struct ocfs2_file_vm_ops = {
-	.nopage = ocfs2_nopage,
+	.fault = ocfs2_fault,
 };
 
 int ocfs2_mmap(struct file *file, struct vm_area_struct *vma)
@@ -104,7 +103,7 @@ int ocfs2_mmap(struct file *file, struct
 	ocfs2_meta_unlock(file->f_dentry->d_inode, lock_level);
 out:
 	vma->vm_ops = &ocfs2_file_vm_ops;
-	vma->vm_flags |= VM_CAN_INVALIDATE;
+	vma->vm_flags |= VM_CAN_INVALIDATE | VM_CAN_NONLINEAR;
 	return 0;
 }
 
Index: linux-2.6/fs/xfs/linux-2.6/xfs_file.c
===================================================================
--- linux-2.6.orig/fs/xfs/linux-2.6/xfs_file.c
+++ linux-2.6/fs/xfs/linux-2.6/xfs_file.c
@@ -246,18 +246,19 @@ xfs_file_fsync(
 
 #ifdef CONFIG_XFS_DMAPI
 STATIC struct page *
-xfs_vm_nopage(
-	struct vm_area_struct	*area,
-	unsigned long		address,
-	int			*type)
+xfs_vm_fault(
+	struct vm_area_struct	*vma,
+	struct fault_data	*fdata)
 {
-	struct inode	*inode = area->vm_file->f_path.dentry->d_inode;
+	struct inode	*inode = vma->vm_file->f_path.dentry->d_inode;
 	bhv_vnode_t	*vp = vn_from_inode(inode);
 
 	ASSERT_ALWAYS(vp->v_vfsp->vfs_flag & VFS_DMI);
-	if (XFS_SEND_MMAP(XFS_VFSTOM(vp->v_vfsp), area, 0))
+	if (XFS_SEND_MMAP(XFS_VFSTOM(vp->v_vfsp), vma, 0)) {
+		fdata->type = VM_FAULT_SIGBUS;
 		return NULL;
-	return filemap_nopage(area, address, type);
+	}
+	return filemap_fault(vma, fdata);
 }
 #endif /* CONFIG_XFS_DMAPI */
 
@@ -343,7 +344,7 @@ xfs_file_mmap(
 	struct vm_area_struct *vma)
 {
 	vma->vm_ops = &xfs_file_vm_ops;
-	vma->vm_flags |= VM_CAN_INVALIDATE;
+	vma->vm_flags |= VM_CAN_INVALIDATE | VM_CAN_NONLINEAR;
 
 #ifdef CONFIG_XFS_DMAPI
 	if (vn_from_inode(filp->f_path.dentry->d_inode)->v_vfsp->vfs_flag & VFS_DMI)
@@ -502,14 +503,12 @@ const struct file_operations xfs_dir_fil
 };
 
 static struct vm_operations_struct xfs_file_vm_ops = {
-	.nopage		= filemap_nopage,
-	.populate	= filemap_populate,
+	.fault		= filemap_fault,
 };
 
 #ifdef CONFIG_XFS_DMAPI
 static struct vm_operations_struct xfs_dmapi_file_vm_ops = {
-	.nopage		= xfs_vm_nopage,
-	.populate	= filemap_populate,
+	.fault		= xfs_vm_fault,
 #ifdef HAVE_VMOP_MPROTECT
 	.mprotect	= xfs_vm_mprotect,
 #endif
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c
+++ linux-2.6/mm/mmap.c
@@ -1147,12 +1147,8 @@ out:	
 		mm->locked_vm += len >> PAGE_SHIFT;
 		make_pages_present(addr, addr + len);
 	}
-	if (flags & MAP_POPULATE) {
-		up_write(&mm->mmap_sem);
-		sys_remap_file_pages(addr, len, 0,
-					pgoff, flags & MAP_NONBLOCK);
-		down_write(&mm->mmap_sem);
-	}
+	if ((flags & MAP_POPULATE) && !(flags & MAP_NONBLOCK))
+		make_pages_present(addr, addr + len);
 	return addr;
 
 unmap_and_free_vma:
Index: linux-2.6/ipc/shm.c
===================================================================
--- linux-2.6.orig/ipc/shm.c
+++ linux-2.6/ipc/shm.c
@@ -261,7 +261,7 @@ static const struct file_operations shm_
 static struct vm_operations_struct shm_vm_ops = {
 	.open	= shm_open,	/* callback for a new vm-area open */
 	.close	= shm_close,	/* callback for when the vm-area is released */
-	.nopage	= shmem_nopage,
+	.fault	= shmem_fault,
 #if defined(CONFIG_NUMA) && defined(CONFIG_SHMEM)
 	.set_policy = shmem_set_policy,
 	.get_policy = shmem_get_policy,
Index: linux-2.6/mm/filemap_xip.c
===================================================================
--- linux-2.6.orig/mm/filemap_xip.c
+++ linux-2.6/mm/filemap_xip.c
@@ -200,62 +200,63 @@ __xip_unmap (struct address_space * mapp
 }
 
 /*
- * xip_nopage() is invoked via the vma operations vector for a
+ * xip_fault() is invoked via the vma operations vector for a
  * mapped memory region to read in file data during a page fault.
  *
- * This function is derived from filemap_nopage, but used for execute in place
+ * This function is derived from filemap_fault, but used for execute in place
  */
-static struct page *
-xip_file_nopage(struct vm_area_struct * area,
-		   unsigned long address,
-		   int *type)
+static struct page *xip_file_fault(struct vm_area_struct *area,
+					struct fault_data *fdata)
 {
 	struct file *file = area->vm_file;
 	struct address_space *mapping = file->f_mapping;
 	struct inode *inode = mapping->host;
 	struct page *page;
-	unsigned long size, pgoff, endoff;
+	pgoff_t size;
 
-	pgoff = ((address - area->vm_start) >> PAGE_CACHE_SHIFT)
-		+ area->vm_pgoff;
-	endoff = ((area->vm_end - area->vm_start) >> PAGE_CACHE_SHIFT)
-		+ area->vm_pgoff;
+	/* XXX: are VM_FAULT_ codes OK? */
 
 	size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
-	if (pgoff >= size) {
+	if (fdata->pgoff >= size) {
+		fdata->type = VM_FAULT_SIGBUS;
 		return NULL;
 	}
 
-	page = mapping->a_ops->get_xip_page(mapping, pgoff*(PAGE_SIZE/512), 0);
-	if (!IS_ERR(page)) {
+	page = mapping->a_ops->get_xip_page(mapping,
+					fdata->pgoff*(PAGE_SIZE/512), 0);
+	if (!IS_ERR(page))
 		goto out;
-	}
-	if (PTR_ERR(page) != -ENODATA)
+	if (PTR_ERR(page) != -ENODATA) {
+		fdata->type = VM_FAULT_OOM;
 		return NULL;
+	}
 
 	/* sparse block */
 	if ((area->vm_flags & (VM_WRITE | VM_MAYWRITE)) &&
 	    (area->vm_flags & (VM_SHARED| VM_MAYSHARE)) &&
 	    (!(mapping->host->i_sb->s_flags & MS_RDONLY))) {
 		/* maybe shared writable, allocate new block */
-		page = mapping->a_ops->get_xip_page (mapping,
-			pgoff*(PAGE_SIZE/512), 1);
-		if (IS_ERR(page))
+		page = mapping->a_ops->get_xip_page(mapping,
+					fdata->pgoff*(PAGE_SIZE/512), 1);
+		if (IS_ERR(page)) {
+			fdata->type = VM_FAULT_SIGBUS;
 			return NULL;
+		}
 		/* unmap page at pgoff from all other vmas */
-		__xip_unmap(mapping, pgoff);
+		__xip_unmap(mapping, fdata->pgoff);
 	} else {
 		/* not shared and writable, use ZERO_PAGE() */
 		page = ZERO_PAGE(0);
 	}
 
 out:
+	fdata->type = VM_FAULT_MINOR;
 	page_cache_get(page);
 	return page;
 }
 
 static struct vm_operations_struct xip_file_vm_ops = {
-	.nopage         = xip_file_nopage,
+	.fault	= xip_file_fault,
 };
 
 int xip_file_mmap(struct file * file, struct vm_area_struct * vma)
@@ -264,6 +265,7 @@ int xip_file_mmap(struct file * file, st
 
 	file_accessed(file);
 	vma->vm_ops = &xip_file_vm_ops;
+	vma->vm_flags |= VM_CAN_NONLINEAR;
 	return 0;
 }
 EXPORT_SYMBOL_GPL(xip_file_mmap);
Index: linux-2.6/mm/nommu.c
===================================================================
--- linux-2.6.orig/mm/nommu.c
+++ linux-2.6/mm/nommu.c
@@ -1299,8 +1299,7 @@ int in_gate_area_no_task(unsigned long a
 	return 0;
 }
 
-struct page *filemap_nopage(struct vm_area_struct *area,
-			unsigned long address, int *type)
+struct page *filemap_fault(struct vm_area_struct *vma, struct fault_data *fdata)
 {
 	BUG();
 	return NULL;
Index: linux-2.6/mm/shmem.c
===================================================================
--- linux-2.6.orig/mm/shmem.c
+++ linux-2.6/mm/shmem.c
@@ -82,7 +82,7 @@ enum sgp_type {
 	SGP_READ,	/* don't exceed i_size, don't allocate page */
 	SGP_CACHE,	/* don't exceed i_size, may allocate page */
 	SGP_WRITE,	/* may exceed i_size, may allocate page */
-	SGP_NOPAGE,	/* same as SGP_CACHE, return with page locked */
+	SGP_FAULT,	/* same as SGP_CACHE, return with page locked */
 };
 
 static int shmem_getpage(struct inode *inode, unsigned long idx,
@@ -1027,6 +1027,10 @@ static int shmem_getpage(struct inode *i
 
 	if (idx >= SHMEM_MAX_INDEX)
 		return -EFBIG;
+
+	if (type)
+		*type = VM_FAULT_MINOR;
+
 	/*
 	 * Normally, filepage is NULL on entry, and either found
 	 * uptodate immediately, or allocated and zeroed, or read
@@ -1217,7 +1221,7 @@ repeat:
 done:
 	if (*pagep != filepage) {
 		*pagep = filepage;
-		if (sgp != SGP_NOPAGE)
+		if (sgp != SGP_FAULT)
 			unlock_page(filepage);
 
 	}
@@ -1231,75 +1235,30 @@ failed:
 	return error;
 }
 
-struct page *shmem_nopage(struct vm_area_struct *vma, unsigned long address, int *type)
+struct page *shmem_fault(struct vm_area_struct *vma, struct fault_data *fdata)
 {
 	struct inode *inode = vma->vm_file->f_path.dentry->d_inode;
 	struct page *page = NULL;
-	unsigned long idx;
 	int error;
 
 	BUG_ON(!(vma->vm_flags & VM_CAN_INVALIDATE));
 
-	idx = (address - vma->vm_start) >> PAGE_SHIFT;
-	idx += vma->vm_pgoff;
-	idx >>= PAGE_CACHE_SHIFT - PAGE_SHIFT;
-	if (((loff_t) idx << PAGE_CACHE_SHIFT) >= i_size_read(inode))
-		return NOPAGE_SIGBUS;
+	if (((loff_t)fdata->pgoff << PAGE_CACHE_SHIFT) >= i_size_read(inode)) {
+		fdata->type = VM_FAULT_SIGBUS;
+		return NULL;
+	}
 
-	error = shmem_getpage(inode, idx, &page, SGP_NOPAGE, type);
-	if (error)
-		return (error == -ENOMEM)? NOPAGE_OOM: NOPAGE_SIGBUS;
+	error = shmem_getpage(inode, fdata->pgoff, &page,
+						SGP_FAULT, &fdata->type);
+	if (error) {
+		fdata->type = ((error == -ENOMEM)?VM_FAULT_OOM:VM_FAULT_SIGBUS);
+		return NULL;
+	}
 
 	mark_page_accessed(page);
 	return page;
 }
 
-static int shmem_populate(struct vm_area_struct *vma,
-	unsigned long addr, unsigned long len,
-	pgprot_t prot, unsigned long pgoff, int nonblock)
-{
-	struct inode *inode = vma->vm_file->f_path.dentry->d_inode;
-	struct mm_struct *mm = vma->vm_mm;
-	enum sgp_type sgp = nonblock? SGP_QUICK: SGP_CACHE;
-	unsigned long size;
-
-	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
-	if (pgoff >= size || pgoff + (len >> PAGE_SHIFT) > size)
-		return -EINVAL;
-
-	while ((long) len > 0) {
-		struct page *page = NULL;
-		int err;
-		/*
-		 * Will need changing if PAGE_CACHE_SIZE != PAGE_SIZE
-		 */
-		err = shmem_getpage(inode, pgoff, &page, sgp, NULL);
-		if (err)
-			return err;
-		/* Page may still be null, but only if nonblock was set. */
-		if (page) {
-			mark_page_accessed(page);
-			err = install_page(mm, vma, addr, page, prot);
-			if (err) {
-				page_cache_release(page);
-				return err;
-			}
-		} else if (vma->vm_flags & VM_NONLINEAR) {
-			/* No page was found just because we can't read it in
-			 * now (being here implies nonblock != 0), but the page
-			 * may exist, so set the PTE to fault it in later. */
-    			err = install_file_pte(mm, vma, addr, pgoff, prot);
-			if (err)
-	    			return err;
-		}
-
-		len -= PAGE_SIZE;
-		addr += PAGE_SIZE;
-		pgoff++;
-	}
-	return 0;
-}
-
 #ifdef CONFIG_NUMA
 int shmem_set_policy(struct vm_area_struct *vma, struct mempolicy *new)
 {
@@ -1344,7 +1303,7 @@ int shmem_mmap(struct file *file, struct
 {
 	file_accessed(file);
 	vma->vm_ops = &shmem_vm_ops;
-	vma->vm_flags |= VM_CAN_INVALIDATE;
+	vma->vm_flags |= VM_CAN_INVALIDATE | VM_CAN_NONLINEAR;
 	return 0;
 }
 
@@ -2401,8 +2360,7 @@ static struct super_operations shmem_ops
 };
 
 static struct vm_operations_struct shmem_vm_ops = {
-	.nopage		= shmem_nopage,
-	.populate	= shmem_populate,
+	.fault		= shmem_fault,
 #ifdef CONFIG_NUMA
 	.set_policy     = shmem_set_policy,
 	.get_policy     = shmem_get_policy,
Index: linux-2.6/mm/truncate.c
===================================================================
--- linux-2.6.orig/mm/truncate.c
+++ linux-2.6/mm/truncate.c
@@ -81,7 +81,7 @@ EXPORT_SYMBOL(cancel_dirty_page);
 /*
  * If truncate cannot remove the fs-private metadata from the page, the page
  * becomes anonymous.  It will be left on the LRU and may even be mapped into
- * user pagetables if we're racing with filemap_nopage().
+ * user pagetables if we're racing with filemap_fault().
  *
  * We need to bale out if page->mapping is no longer equal to the original
  * mapping.  This happens a) when the VM reclaimed the page while we waited on
Index: linux-2.6/fs/gfs2/ops_address.c
===================================================================
--- linux-2.6.orig/fs/gfs2/ops_address.c
+++ linux-2.6/fs/gfs2/ops_address.c
@@ -239,7 +239,7 @@ static int gfs2_readpage(struct file *fi
 		if (file) {
 			gf = file->private_data;
 			if (test_bit(GFF_EXLOCK, &gf->f_flags))
-				/* gfs2_sharewrite_nopage has grabbed the ip->i_gl already */
+				/* gfs2_sharewrite_fault has grabbed the ip->i_gl already */
 				goto skip_lock;
 		}
 		gfs2_holder_init(ip->i_gl, LM_ST_SHARED, GL_ATIME|LM_FLAG_TRY_1CB, &gh);
Index: linux-2.6/fs/ncpfs/mmap.c
===================================================================
--- linux-2.6.orig/fs/ncpfs/mmap.c
+++ linux-2.6/fs/ncpfs/mmap.c
@@ -25,8 +25,8 @@
 /*
  * Fill in the supplied page for mmap
  */
-static struct page* ncp_file_mmap_nopage(struct vm_area_struct *area,
-				     unsigned long address, int *type)
+static struct page* ncp_file_mmap_fault(struct vm_area_struct *area,
+						struct fault_data *fdata)
 {
 	struct file *file = area->vm_file;
 	struct dentry *dentry = file->f_path.dentry;
@@ -40,15 +40,17 @@ static struct page* ncp_file_mmap_nopage
 
 	page = alloc_page(GFP_HIGHUSER); /* ncpfs has nothing against high pages
 	           as long as recvmsg and memset works on it */
-	if (!page)
-		return page;
+	if (!page) {
+		fdata->type = VM_FAULT_OOM;
+		return NULL;
+	}
 	pg_addr = kmap(page);
-	address &= PAGE_MASK;
-	pos = address - area->vm_start + (area->vm_pgoff << PAGE_SHIFT);
+	pos = fdata->pgoff << PAGE_SHIFT;
 
 	count = PAGE_SIZE;
-	if (address + PAGE_SIZE > area->vm_end) {
-		count = area->vm_end - address;
+	if (fdata->address + PAGE_SIZE > area->vm_end) {
+		WARN_ON(1); /* shouldn't happen? */
+		count = area->vm_end - fdata->address;
 	}
 	/* what we can read in one go */
 	bufsize = NCP_SERVER(inode)->buffer_size;
@@ -91,15 +93,14 @@ static struct page* ncp_file_mmap_nopage
 	 * fetches from the network, here the analogue of disk.
 	 * -- wli
 	 */
-	if (type)
-		*type = VM_FAULT_MAJOR;
+	fdata->type = VM_FAULT_MAJOR;
 	count_vm_event(PGMAJFAULT);
 	return page;
 }
 
 static struct vm_operations_struct ncp_file_mmap =
 {
-	.nopage	= ncp_file_mmap_nopage,
+	.fault = ncp_file_mmap_fault,
 };
 
 
Index: linux-2.6/fs/ocfs2/aops.c
===================================================================
--- linux-2.6.orig/fs/ocfs2/aops.c
+++ linux-2.6/fs/ocfs2/aops.c
@@ -215,7 +215,7 @@ static int ocfs2_readpage(struct file *f
 	 * might now be discovering a truncate that hit on another node.
 	 * block_read_full_page->get_block freaks out if it is asked to read
 	 * beyond the end of a file, so we check here.  Callers
-	 * (generic_file_read, fault->nopage) are clever enough to check i_size
+	 * (generic_file_read, vm_ops->fault) are clever enough to check i_size
 	 * and notice that the page they just read isn't needed.
 	 *
 	 * XXX sys_readahead() seems to get that wrong?
Index: linux-2.6/Documentation/feature-removal-schedule.txt
===================================================================
--- linux-2.6.orig/Documentation/feature-removal-schedule.txt
+++ linux-2.6/Documentation/feature-removal-schedule.txt
@@ -170,6 +170,33 @@ Who:	Greg Kroah-Hartman <gregkh@suse.de>
 
 ---------------------------
 
+What:	filemap_nopage, filemap_populate
+When:	April 2007
+Why:	These legacy interfaces no longer have any callers in the kernel and
+	any functionality provided can be provided with filemap_fault. The
+	removal schedule is short because they are a big maintenance burden
+	and have some bugs.
+Who:	Nick Piggin <npiggin@suse.de>
+
+---------------------------
+
+What:	vm_ops.populate, install_page
+When:	April 2007
+Why:	These legacy interfaces no longer have any callers in the kernel and
+	any functionality provided can be provided with vm_ops.fault.
+Who:	Nick Piggin <npiggin@suse.de>
+
+---------------------------
+
+What:	vm_ops.nopage
+When:	February 2008, provided in-kernel callers have been converted
+Why:	This interface is replaced by vm_ops.fault, but it has been around
+	forever, is used by a lot of drivers, and doesn't cost much to
+	maintain.
+Who:	Nick Piggin <npiggin@suse.de>
+
+---------------------------
+
 What:	Interrupt only SA_* flags
 When:	Januar 2007
 Why:	The interrupt related SA_* flags are replaced by IRQF_* to move them
Index: linux-2.6/Documentation/filesystems/Locking
===================================================================
--- linux-2.6.orig/Documentation/filesystems/Locking
+++ linux-2.6/Documentation/filesystems/Locking
@@ -508,12 +508,14 @@ More details about quota locking can be 
 prototypes:
 	void (*open)(struct vm_area_struct*);
 	void (*close)(struct vm_area_struct*);
+	struct page *(*fault)(struct vm_area_struct*, struct fault_data *);
 	struct page *(*nopage)(struct vm_area_struct*, unsigned long, int *);
 
 locking rules:
 		BKL	mmap_sem
 open:		no	yes
 close:		no	yes
+fault:		no	yes
 nopage:		no	yes
 
 ================================================================================
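
For reference, a minimal handler for the new ->fault prototype documented above
has roughly the following shape. This is an illustrative sketch only, not part
of the patch: example_fault and its use of find_get_page stand in for whatever
lookup a given filesystem really needs, and it assumes the fault_data fields
(pgoff, type) introduced earlier in this series.

	static struct page *example_fault(struct vm_area_struct *vma,
					  struct fault_data *fdata)
	{
		struct address_space *mapping = vma->vm_file->f_mapping;
		struct page *page;

		/* called with mmap_sem held, per the locking rules above */
		page = find_get_page(mapping, fdata->pgoff);
		if (!page) {
			fdata->type = VM_FAULT_SIGBUS;
			return NULL;
		}

		fdata->type = VM_FAULT_MINOR;
		/* returned with a reference held; the core fault code maps it */
		return page;
	}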


^ permalink raw reply	[flat|nested] 198+ messages in thread

* [patch 5/6] mm: merge nopfn into fault
  2007-02-21  4:49 ` Nick Piggin
@ 2007-02-21  4:50   ` Nick Piggin
  -1 siblings, 0 replies; 198+ messages in thread
From: Nick Piggin @ 2007-02-21  4:50 UTC (permalink / raw)
  To: Linux Memory Management, Andrew Morton
  Cc: Linux Kernel, Nick Piggin, Benjamin Herrenschmidt

Remove ->nopfn and reimplement the existing handlers with ->fault
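
For illustration, the conversion follows a common shape across the handlers
touched below. This sketch is not part of the patch: example_nopfn/example_fault
and the lookup_pfn helper are hypothetical, and it assumes the fault_data
(address, pgoff, type) introduced earlier in the series.

	/* old style: return a pfn, or a NOPFN_* error code */
	static unsigned long example_nopfn(struct vm_area_struct *vma,
					   unsigned long address)
	{
		unsigned long pfn = lookup_pfn(vma, address);	/* hypothetical */

		if (pfn == (unsigned long)-1)
			return NOPFN_SIGBUS;
		return pfn;
	}

	/* new style: install the pte directly and return NULL */
	static struct page *example_fault(struct vm_area_struct *vma,
					  struct fault_data *fdata)
	{
		unsigned long pfn = lookup_pfn(vma, fdata->address);	/* hypothetical */

		if (pfn == (unsigned long)-1) {
			fdata->type = VM_FAULT_SIGBUS;
			return NULL;
		}

		fdata->type = VM_FAULT_MINOR;
		/* -EBUSY here only means another thread installed the pte first */
		vm_insert_pfn(vma, fdata->address, pfn);
		return NULL;
	}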

Signed-off-by: Nick Piggin <npiggin@suse.de>

 arch/powerpc/platforms/cell/spufs/file.c |   90 ++++++++++++++++---------------
 drivers/char/mspec.c                     |   29 ++++++---
 include/linux/mm.h                       |    8 --
 mm/memory.c                              |   58 +------------------
 4 files changed, 71 insertions(+), 114 deletions(-)

Index: linux-2.6/drivers/char/mspec.c
===================================================================
--- linux-2.6.orig/drivers/char/mspec.c
+++ linux-2.6/drivers/char/mspec.c
@@ -182,24 +182,25 @@ mspec_close(struct vm_area_struct *vma)
 
 
 /*
- * mspec_nopfn
+ * mspec_fault
  *
  * Creates a mspec page and maps it to user space.
  */
-static unsigned long
-mspec_nopfn(struct vm_area_struct *vma, unsigned long address)
+static struct page *
+mspec_fault(struct vm_area_struct *vma, struct fault_data *fdata)
 {
 	unsigned long paddr, maddr;
 	unsigned long pfn;
-	int index;
-	struct vma_data *vdata = vma->vm_private_data;
+	int index = fdata->pgoff;
+	struct vma_data *vdata = vma->vm_private_data;
 
-	index = (address - vma->vm_start) >> PAGE_SHIFT;
 	maddr = (volatile unsigned long) vdata->maddr[index];
 	if (maddr == 0) {
 		maddr = uncached_alloc_page(numa_node_id());
-		if (maddr == 0)
-			return NOPFN_OOM;
+		if (maddr == 0) {
+			fdata->type = VM_FAULT_OOM;
+			return NULL;
+		}
 
 		spin_lock(&vdata->lock);
 		if (vdata->maddr[index] == 0) {
@@ -219,13 +220,21 @@ mspec_nopfn(struct vm_area_struct *vma, 
 
 	pfn = paddr >> PAGE_SHIFT;
 
-	return pfn;
+	fdata->type = VM_FAULT_MINOR;
+	/*
+	 * vm_insert_pfn can fail with -EBUSY, but in that case it will
+	 * be because another thread has installed the pte first, so it
+	 * is no problem.
+	 */
+	vm_insert_pfn(vma, fdata->address, pfn);
+
+	return NULL;
 }
 
 static struct vm_operations_struct mspec_vm_ops = {
 	.open = mspec_open,
 	.close = mspec_close,
-	.nopfn = mspec_nopfn
+	.fault = mspec_fault,
 };
 
 /*
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -230,7 +230,6 @@ struct vm_operations_struct {
 	void (*close)(struct vm_area_struct * area);
 	struct page * (*fault)(struct vm_area_struct *vma, struct fault_data * fdata);
 	struct page * (*nopage)(struct vm_area_struct * area, unsigned long address, int *type);
-	unsigned long (*nopfn)(struct vm_area_struct * area, unsigned long address);
 	int (*populate)(struct vm_area_struct * area, unsigned long address, unsigned long len, pgprot_t prot, unsigned long pgoff, int nonblock);
 
 	/* notification that a previously read-only page is about to become
@@ -660,13 +659,6 @@ static inline int page_mapped(struct pag
 #define NOPAGE_OOM	((struct page *) (-1))
 
 /*
- * Error return values for the *_nopfn functions
- */
-#define NOPFN_SIGBUS	((unsigned long) -1)
-#define NOPFN_OOM	((unsigned long) -2)
-#define NOPFN_REFAULT	((unsigned long) -3)
-
-/*
  * Different kinds of faults, as returned by handle_mm_fault().
  * Used to decide whether a process gets delivered SIGBUS or
  * just gets major/minor fault counters bumped up.
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -1288,6 +1288,11 @@ EXPORT_SYMBOL(vm_insert_page);
  *
  * This function should only be called from a vm_ops->fault handler, and
  * in that case the handler should return NULL.
+ *
+ * vma cannot be a COW mapping.
+ *
+ * As this is called only for pages that do not currently exist, we
+ * do not need to flush old virtual caches or the TLB.
  */
 int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
 		unsigned long pfn)
@@ -2343,56 +2348,6 @@ static int do_nonlinear_fault(struct mm_
 }
 
 /*
- * do_no_pfn() tries to create a new page mapping for a page without
- * a struct_page backing it
- *
- * As this is called only for pages that do not currently exist, we
- * do not need to flush old virtual caches or the TLB.
- *
- * We enter with non-exclusive mmap_sem (to exclude vma changes,
- * but allow concurrent faults), and pte mapped but not yet locked.
- * We return with mmap_sem still held, but pte unmapped and unlocked.
- *
- * It is expected that the ->nopfn handler always returns the same pfn
- * for a given virtual mapping.
- *
- * Mark this `noinline' to prevent it from bloating the main pagefault code.
- */
-static noinline int do_no_pfn(struct mm_struct *mm, struct vm_area_struct *vma,
-		     unsigned long address, pte_t *page_table, pmd_t *pmd,
-		     int write_access)
-{
-	spinlock_t *ptl;
-	pte_t entry;
-	unsigned long pfn;
-	int ret = VM_FAULT_MINOR;
-
-	pte_unmap(page_table);
-	BUG_ON(!(vma->vm_flags & VM_PFNMAP));
-	BUG_ON(is_cow_mapping(vma->vm_flags));
-
-	pfn = vma->vm_ops->nopfn(vma, address & PAGE_MASK);
-	if (unlikely(pfn == NOPFN_OOM))
-		return VM_FAULT_OOM;
-	else if (unlikely(pfn == NOPFN_SIGBUS))
-		return VM_FAULT_SIGBUS;
-	else if (unlikely(pfn == NOPFN_REFAULT))
-		return VM_FAULT_MINOR;
-
-	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
-
-	/* Only go through if we didn't race with anybody else... */
-	if (pte_none(*page_table)) {
-		entry = pfn_pte(pfn, vma->vm_page_prot);
-		if (write_access)
-			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
-		set_pte_at(mm, address, page_table, entry);
-	}
-	pte_unmap_unlock(page_table, ptl);
-	return ret;
-}
-
-/*
  * Fault of a previously existing named mapping. Repopulate the pte
  * from the encoded file_pte if possible. This enables swappable
  * nonlinear vmas.
@@ -2463,9 +2418,6 @@ static inline int handle_pte_fault(struc
 				if (vma->vm_ops->fault || vma->vm_ops->nopage)
 					return do_linear_fault(mm, vma, address,
 						pte, pmd, write_access, entry);
-				if (unlikely(vma->vm_ops->nopfn))
-					return do_no_pfn(mm, vma, address, pte,
-							 pmd, write_access);
 			}
 			return do_anonymous_page(mm, vma, address,
 						 pte, pmd, write_access);
Index: linux-2.6/arch/powerpc/platforms/cell/spufs/file.c
===================================================================
--- linux-2.6.orig/arch/powerpc/platforms/cell/spufs/file.c
+++ linux-2.6/arch/powerpc/platforms/cell/spufs/file.c
@@ -95,16 +95,16 @@ spufs_mem_write(struct file *file, const
 	return ret;
 }
 
-static unsigned long spufs_mem_mmap_nopfn(struct vm_area_struct *vma,
-					  unsigned long address)
+static struct page *spufs_mem_mmap_fault(struct vm_area_struct *vma,
+					  struct fault_data *fdata)
 {
 	struct spu_context *ctx = vma->vm_file->private_data;
-	unsigned long pfn, offset = address - vma->vm_start;
+	unsigned long pfn, offset = fdata->pgoff << PAGE_SHIFT;
 
-	offset += vma->vm_pgoff << PAGE_SHIFT;
-
-	if (offset >= LS_SIZE)
-		return NOPFN_SIGBUS;
+	if (offset >= LS_SIZE) {
+		fdata->type = VM_FAULT_SIGBUS;
+		return NULL;
+	}
 
 	spu_acquire(ctx);
 
@@ -121,12 +121,13 @@ static unsigned long spufs_mem_mmap_nopf
 
 	spu_release(ctx);
 
-	return NOPFN_REFAULT;
+	fdata->type = VM_FAULT_MINOR;
+	return NULL;
 }
 
 
 static struct vm_operations_struct spufs_mem_mmap_vmops = {
-	.nopfn = spufs_mem_mmap_nopfn,
+	.fault = spufs_mem_mmap_fault,
 };
 
 static int
@@ -151,42 +152,45 @@ static const struct file_operations spuf
 	.mmap    = spufs_mem_mmap,
 };
 
-static unsigned long spufs_ps_nopfn(struct vm_area_struct *vma,
-				    unsigned long address,
+static struct page *spufs_ps_fault(struct vm_area_struct *vma,
+				    struct fault_data *fdata,
 				    unsigned long ps_offs,
 				    unsigned long ps_size)
 {
 	struct spu_context *ctx = vma->vm_file->private_data;
-	unsigned long area, offset = address - vma->vm_start;
+	unsigned long area, offset = fdata->pgoff << PAGE_SHIFT;
 	int ret;
 
-	offset += vma->vm_pgoff << PAGE_SHIFT;
-	if (offset >= ps_size)
-		return NOPFN_SIGBUS;
+	if (offset >= ps_size) {
+		fdata->type = VM_FAULT_SIGBUS;
+		return NULL;
+	}
+
+	fdata->type = VM_FAULT_MINOR;
 
 	/* error here usually means a signal.. we might want to test
 	 * the error code more precisely though
 	 */
 	ret = spu_acquire_runnable(ctx, 0);
 	if (ret)
-		return NOPFN_REFAULT;
+		return NULL;
 
 	area = ctx->spu->problem_phys + ps_offs;
-	vm_insert_pfn(vma, address, (area + offset) >> PAGE_SHIFT);
+	vm_insert_pfn(vma, fdata->address, (area + offset) >> PAGE_SHIFT);
 	spu_release(ctx);
 
-	return NOPFN_REFAULT;
+	return NULL;
 }
 
 #if SPUFS_MMAP_4K
-static unsigned long spufs_cntl_mmap_nopfn(struct vm_area_struct *vma,
-					   unsigned long address)
+static struct page *spufs_cntl_mmap_fault(struct vm_area_struct *vma,
+					   struct fault_data *fdata)
 {
-	return spufs_ps_nopfn(vma, address, 0x4000, 0x1000);
+	return spufs_ps_fault(vma, fdata, 0x4000, 0x1000);
 }
 
 static struct vm_operations_struct spufs_cntl_mmap_vmops = {
-	.nopfn = spufs_cntl_mmap_nopfn,
+	.fault = spufs_cntl_mmap_fault,
 };
 
 /*
@@ -783,23 +787,23 @@ static ssize_t spufs_signal1_write(struc
 	return 4;
 }
 
-static unsigned long spufs_signal1_mmap_nopfn(struct vm_area_struct *vma,
-					      unsigned long address)
+static struct page *spufs_signal1_mmap_fault(struct vm_area_struct *vma,
+					      struct fault_data *fdata)
 {
 #if PAGE_SIZE == 0x1000
-	return spufs_ps_nopfn(vma, address, 0x14000, 0x1000);
+	return spufs_ps_fault(vma, fdata, 0x14000, 0x1000);
 #elif PAGE_SIZE == 0x10000
 	/* For 64k pages, both signal1 and signal2 can be used to mmap the whole
 	 * signal 1 and 2 area
 	 */
-	return spufs_ps_nopfn(vma, address, 0x10000, 0x10000);
+	return spufs_ps_fault(vma, fdata, 0x10000, 0x10000);
 #else
 #error unsupported page size
 #endif
 }
 
 static struct vm_operations_struct spufs_signal1_mmap_vmops = {
-	.nopfn = spufs_signal1_mmap_nopfn,
+	.fault = spufs_signal1_mmap_fault,
 };
 
 static int spufs_signal1_mmap(struct file *file, struct vm_area_struct *vma)
@@ -891,23 +895,23 @@ static ssize_t spufs_signal2_write(struc
 }
 
 #if SPUFS_MMAP_4K
-static unsigned long spufs_signal2_mmap_nopfn(struct vm_area_struct *vma,
-					      unsigned long address)
+static struct page *spufs_signal2_mmap_fault(struct vm_area_struct *vma,
+					      struct fault_data *fdata)
 {
 #if PAGE_SIZE == 0x1000
-	return spufs_ps_nopfn(vma, address, 0x1c000, 0x1000);
+	return spufs_ps_fault(vma, fdata, 0x1c000, 0x1000);
 #elif PAGE_SIZE == 0x10000
 	/* For 64k pages, both signal1 and signal2 can be used to mmap the whole
 	 * signal 1 and 2 area
 	 */
-	return spufs_ps_nopfn(vma, address, 0x10000, 0x10000);
+	return spufs_ps_fault(vma, fdata, 0x10000, 0x10000);
 #else
 #error unsupported page size
 #endif
 }
 
 static struct vm_operations_struct spufs_signal2_mmap_vmops = {
-	.nopfn = spufs_signal2_mmap_nopfn,
+	.fault = spufs_signal2_mmap_fault,
 };
 
 static int spufs_signal2_mmap(struct file *file, struct vm_area_struct *vma)
@@ -992,14 +996,14 @@ DEFINE_SIMPLE_ATTRIBUTE(spufs_signal2_ty
 					spufs_signal2_type_set, "%llu");
 
 #if SPUFS_MMAP_4K
-static unsigned long spufs_mss_mmap_nopfn(struct vm_area_struct *vma,
-					  unsigned long address)
+static struct page *spufs_mss_mmap_fault(struct vm_area_struct *vma,
+					  struct fault_data *fdata)
 {
-	return spufs_ps_nopfn(vma, address, 0x0000, 0x1000);
+	return spufs_ps_fault(vma, fdata, 0x0000, 0x1000);
 }
 
 static struct vm_operations_struct spufs_mss_mmap_vmops = {
-	.nopfn = spufs_mss_mmap_nopfn,
+	.fault = spufs_mss_mmap_fault,
 };
 
 /*
@@ -1037,14 +1041,14 @@ static const struct file_operations spuf
 	.mmap	 = spufs_mss_mmap,
 };
 
-static unsigned long spufs_psmap_mmap_nopfn(struct vm_area_struct *vma,
-					    unsigned long address)
+static struct page *spufs_psmap_mmap_fault(struct vm_area_struct *vma,
+					    struct fault_data *fdata)
 {
-	return spufs_ps_nopfn(vma, address, 0x0000, 0x20000);
+	return spufs_ps_fault(vma, fdata, 0x0000, 0x20000);
 }
 
 static struct vm_operations_struct spufs_psmap_mmap_vmops = {
-	.nopfn = spufs_psmap_mmap_nopfn,
+	.fault = spufs_psmap_mmap_fault,
 };
 
 /*
@@ -1081,14 +1085,14 @@ static const struct file_operations spuf
 
 
 #if SPUFS_MMAP_4K
-static unsigned long spufs_mfc_mmap_nopfn(struct vm_area_struct *vma,
-					  unsigned long address)
+static struct page *spufs_mfc_mmap_fault(struct vm_area_struct *vma,
+					  struct fault_data *fdata)
 {
-	return spufs_ps_nopfn(vma, address, 0x3000, 0x1000);
+	return spufs_ps_fault(vma, fdata, 0x3000, 0x1000);
 }
 
 static struct vm_operations_struct spufs_mfc_mmap_vmops = {
-	.nopfn = spufs_mfc_mmap_nopfn,
+	.fault = spufs_mfc_mmap_fault,
 };
 
 /*

^ permalink raw reply	[flat|nested] 198+ messages in thread

* [patch 6/6] mm: remove legacy cruft
  2007-02-21  4:49 ` Nick Piggin
@ 2007-02-21  4:50   ` Nick Piggin
  -1 siblings, 0 replies; 198+ messages in thread
From: Nick Piggin @ 2007-02-21  4:50 UTC (permalink / raw)
  To: Linux Memory Management, Andrew Morton
  Cc: Linux Kernel, Nick Piggin, Benjamin Herrenschmidt

Remove legacy filemap_nopage and all of the .populate API cruft.

This patch can be skipped if it will cause clashes in your tree, or you
disagree with removing these guys right now.
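
For context, after the whole series a filesystem that used to wire up
.nopage/.populate ends up with only the following (an illustrative sketch;
the example_* names are hypothetical, mirroring generic_file_mmap as converted
earlier in the series):

	static struct vm_operations_struct example_file_vm_ops = {
		.fault		= filemap_fault,
	};

	static int example_file_mmap(struct file *file, struct vm_area_struct *vma)
	{
		file_accessed(file);
		vma->vm_ops = &example_file_vm_ops;
		/* opt in to invalidation safety and the new nonlinear path */
		vma->vm_flags |= VM_CAN_INVALIDATE | VM_CAN_NONLINEAR;
		return 0;
	}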

Signed-off-by: Nick Piggin <npiggin@suse.de>

 Documentation/feature-removal-schedule.txt |   18 --
 include/linux/mm.h                         |    8 -
 mm/filemap.c                               |  195 -----------------------------
 mm/fremap.c                                |   71 +---------
 mm/memory.c                                |   36 +----
 5 files changed, 19 insertions(+), 309 deletions(-)

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -230,8 +230,6 @@ struct vm_operations_struct {
 	void (*close)(struct vm_area_struct * area);
 	struct page * (*fault)(struct vm_area_struct *vma, struct fault_data * fdata);
 	struct page * (*nopage)(struct vm_area_struct * area, unsigned long address, int *type);
-	int (*populate)(struct vm_area_struct * area, unsigned long address, unsigned long len, pgprot_t prot, unsigned long pgoff, int nonblock);
-
 	/* notification that a previously read-only page is about to become
 	 * writable, if an error is returned it will cause a SIGBUS */
 	int (*page_mkwrite)(struct vm_area_struct *vma, struct page *page);
@@ -767,8 +765,6 @@ static inline void unmap_shared_mapping_
 
 extern int vmtruncate(struct inode * inode, loff_t offset);
 extern int vmtruncate_range(struct inode * inode, loff_t offset, loff_t end);
-extern int install_page(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long addr, struct page *page, pgprot_t prot);
-extern int install_file_pte(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long addr, unsigned long pgoff, pgprot_t prot);
 
 #ifdef CONFIG_MMU
 extern int __handle_mm_fault(struct mm_struct *mm,struct vm_area_struct *vma,
@@ -1084,10 +1080,6 @@ extern void truncate_inode_pages_range(s
 
 /* generic vm_area_ops exported for stackable file systems */
 extern struct page *filemap_fault(struct vm_area_struct *, struct fault_data *);
-extern struct page * __deprecated_for_modules filemap_nopage(
-			struct vm_area_struct *, unsigned long, int *);
-extern int __deprecated_for_modules filemap_populate(struct vm_area_struct *,
-		unsigned long, unsigned long, pgprot_t, unsigned long, int);
 
 /* mm/page-writeback.c */
 int write_one_page(struct page *page, int wait);
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1476,201 +1476,6 @@ page_not_uptodate:
 }
 EXPORT_SYMBOL(filemap_fault);
 
-/*
- * filemap_nopage and filemap_populate are legacy exports that are not used
- * in tree. Scheduled for removal.
- */
-struct page *filemap_nopage(struct vm_area_struct *area,
-				unsigned long address, int *type)
-{
-	struct page *page;
-	struct fault_data fdata;
-	fdata.address = address;
-	fdata.pgoff = ((address - area->vm_start) >> PAGE_CACHE_SHIFT)
-			+ area->vm_pgoff;
-	fdata.flags = 0;
-
-	page = filemap_fault(area, &fdata);
-	if (type)
-		*type = fdata.type;
-
-	return page;
-}
-EXPORT_SYMBOL(filemap_nopage);
-
-static struct page * filemap_getpage(struct file *file, unsigned long pgoff,
-					int nonblock)
-{
-	struct address_space *mapping = file->f_mapping;
-	struct page *page;
-	int error;
-
-	/*
-	 * Do we have something in the page cache already?
-	 */
-retry_find:
-	page = find_get_page(mapping, pgoff);
-	if (!page) {
-		if (nonblock)
-			return NULL;
-		goto no_cached_page;
-	}
-
-	/*
-	 * Ok, found a page in the page cache, now we need to check
-	 * that it's up-to-date.
-	 */
-	if (!PageUptodate(page)) {
-		if (nonblock) {
-			page_cache_release(page);
-			return NULL;
-		}
-		goto page_not_uptodate;
-	}
-
-success:
-	/*
-	 * Found the page and have a reference on it.
-	 */
-	mark_page_accessed(page);
-	return page;
-
-no_cached_page:
-	error = page_cache_read(file, pgoff);
-
-	/*
-	 * The page we want has now been added to the page cache.
-	 * In the unlikely event that someone removed it in the
-	 * meantime, we'll just come back here and read it again.
-	 */
-	if (error >= 0)
-		goto retry_find;
-
-	/*
-	 * An error return from page_cache_read can result if the
-	 * system is low on memory, or a problem occurs while trying
-	 * to schedule I/O.
-	 */
-	return NULL;
-
-page_not_uptodate:
-	lock_page(page);
-
-	/* Did it get truncated while we waited for it? */
-	if (!page->mapping) {
-		unlock_page(page);
-		goto err;
-	}
-
-	/* Did somebody else get it up-to-date? */
-	if (PageUptodate(page)) {
-		unlock_page(page);
-		goto success;
-	}
-
-	error = mapping->a_ops->readpage(file, page);
-	if (!error) {
-		wait_on_page_locked(page);
-		if (PageUptodate(page))
-			goto success;
-	} else if (error == AOP_TRUNCATED_PAGE) {
-		page_cache_release(page);
-		goto retry_find;
-	}
-
-	/*
-	 * Umm, take care of errors if the page isn't up-to-date.
-	 * Try to re-read it _once_. We do this synchronously,
-	 * because there really aren't any performance issues here
-	 * and we need to check for errors.
-	 */
-	lock_page(page);
-
-	/* Somebody truncated the page on us? */
-	if (!page->mapping) {
-		unlock_page(page);
-		goto err;
-	}
-	/* Somebody else successfully read it in? */
-	if (PageUptodate(page)) {
-		unlock_page(page);
-		goto success;
-	}
-
-	ClearPageError(page);
-	error = mapping->a_ops->readpage(file, page);
-	if (!error) {
-		wait_on_page_locked(page);
-		if (PageUptodate(page))
-			goto success;
-	} else if (error == AOP_TRUNCATED_PAGE) {
-		page_cache_release(page);
-		goto retry_find;
-	}
-
-	/*
-	 * Things didn't work out. Return zero to tell the
-	 * mm layer so, possibly freeing the page cache page first.
-	 */
-err:
-	page_cache_release(page);
-
-	return NULL;
-}
-
-int filemap_populate(struct vm_area_struct *vma, unsigned long addr,
-		unsigned long len, pgprot_t prot, unsigned long pgoff,
-		int nonblock)
-{
-	struct file *file = vma->vm_file;
-	struct address_space *mapping = file->f_mapping;
-	struct inode *inode = mapping->host;
-	unsigned long size;
-	struct mm_struct *mm = vma->vm_mm;
-	struct page *page;
-	int err;
-
-	if (!nonblock)
-		force_page_cache_readahead(mapping, vma->vm_file,
-					pgoff, len >> PAGE_CACHE_SHIFT);
-
-repeat:
-	size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
-	if (pgoff + (len >> PAGE_CACHE_SHIFT) > size)
-		return -EINVAL;
-
-	page = filemap_getpage(file, pgoff, nonblock);
-
-	/* XXX: This is wrong, a filesystem I/O error may have happened. Fix that as
-	 * done in shmem_populate calling shmem_getpage */
-	if (!page && !nonblock)
-		return -ENOMEM;
-
-	if (page) {
-		err = install_page(mm, vma, addr, page, prot);
-		if (err) {
-			page_cache_release(page);
-			return err;
-		}
-	} else if (vma->vm_flags & VM_NONLINEAR) {
-		/* No page was found just because we can't read it in now (being
-		 * here implies nonblock != 0), but the page may exist, so set
-		 * the PTE to fault it in later. */
-		err = install_file_pte(mm, vma, addr, pgoff, prot);
-		if (err)
-			return err;
-	}
-
-	len -= PAGE_SIZE;
-	addr += PAGE_SIZE;
-	pgoff++;
-	if (len)
-		goto repeat;
-
-	return 0;
-}
-EXPORT_SYMBOL(filemap_populate);
-
 struct vm_operations_struct generic_file_vm_ops = {
 	.fault		= filemap_fault,
 };
Index: linux-2.6/mm/fremap.c
===================================================================
--- linux-2.6.orig/mm/fremap.c
+++ linux-2.6/mm/fremap.c
@@ -45,58 +45,10 @@ static int zap_pte(struct mm_struct *mm,
 }
 
 /*
- * Install a file page to a given virtual memory address, release any
- * previously existing mapping.
- */
-int install_page(struct mm_struct *mm, struct vm_area_struct *vma,
-		unsigned long addr, struct page *page, pgprot_t prot)
-{
-	struct inode *inode;
-	pgoff_t size;
-	int err = -ENOMEM;
-	pte_t *pte;
-	pte_t pte_val;
-	spinlock_t *ptl;
-
-	pte = get_locked_pte(mm, addr, &ptl);
-	if (!pte)
-		goto out;
-
-	/*
-	 * This page may have been truncated. Tell the
-	 * caller about it.
-	 */
-	err = -EINVAL;
-	inode = vma->vm_file->f_mapping->host;
-	size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
-	if (!page->mapping || page->index >= size)
-		goto unlock;
-	err = -ENOMEM;
-	if (page_mapcount(page) > INT_MAX/2)
-		goto unlock;
-
-	if (pte_none(*pte) || !zap_pte(mm, vma, addr, pte))
-		inc_mm_counter(mm, file_rss);
-
-	flush_icache_page(vma, page);
-	pte_val = mk_pte(page, prot);
-	set_pte_at(mm, addr, pte, pte_val);
-	page_add_file_rmap(page);
-	update_mmu_cache(vma, addr, pte_val);
-	lazy_mmu_prot_update(pte_val);
-	err = 0;
-unlock:
-	pte_unmap_unlock(pte, ptl);
-out:
-	return err;
-}
-EXPORT_SYMBOL(install_page);
-
-/*
  * Install a file pte to a given virtual memory address, release any
  * previously existing mapping.
  */
-int install_file_pte(struct mm_struct *mm, struct vm_area_struct *vma,
+static int install_file_pte(struct mm_struct *mm, struct vm_area_struct *vma,
 		unsigned long addr, unsigned long pgoff, pgprot_t prot)
 {
 	int err = -ENOMEM;
@@ -208,8 +160,7 @@ asmlinkage long sys_remap_file_pages(uns
 	if (vma->vm_private_data && !(vma->vm_flags & VM_NONLINEAR))
 		goto out;
 
-	if ((!vma->vm_ops || !vma->vm_ops->populate) &&
-					!(vma->vm_flags & VM_CAN_NONLINEAR))
+	if (!(vma->vm_flags & VM_CAN_NONLINEAR))
 		goto out;
 
 	if (end <= start || start < vma->vm_start || end > vma->vm_end)
@@ -239,18 +190,14 @@ asmlinkage long sys_remap_file_pages(uns
 		spin_unlock(&mapping->i_mmap_lock);
 	}
 
-	if (vma->vm_flags & VM_CAN_NONLINEAR) {
-		err = populate_range(mm, vma, start, size, pgoff);
-		if (!err && !(flags & MAP_NONBLOCK)) {
-			if (unlikely(has_write_lock)) {
-				downgrade_write(&mm->mmap_sem);
-				has_write_lock = 0;
-			}
-			make_pages_present(start, start+size);
+	err = populate_range(mm, vma, start, size, pgoff);
+	if (!err && !(flags & MAP_NONBLOCK)) {
+		if (unlikely(has_write_lock)) {
+			downgrade_write(&mm->mmap_sem);
+			has_write_lock = 0;
 		}
-	} else
-		err = vma->vm_ops->populate(vma, start, size, vma->vm_page_prot,
-					    	pgoff, flags & MAP_NONBLOCK);
+		make_pages_present(start, start+size);
+	}
 
 	/*
 	 * We can't clear VM_NONLINEAR because we'd have to do
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -2334,18 +2334,10 @@ static int do_linear_fault(struct mm_str
 			- vma->vm_start) >> PAGE_CACHE_SHIFT) + vma->vm_pgoff;
 	unsigned int flags = (write_access ? FAULT_FLAG_WRITE : 0);
 
-	return __do_fault(mm, vma, address, page_table, pmd, pgoff, flags, orig_pte);
+	return __do_fault(mm, vma, address, page_table, pmd, pgoff,
+							flags, orig_pte);
 }
 
-static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
-		unsigned long address, pte_t *page_table, pmd_t *pmd,
-		int write_access, pgoff_t pgoff, pte_t orig_pte)
-{
-	unsigned int flags = FAULT_FLAG_NONLINEAR |
-				(write_access ? FAULT_FLAG_WRITE : 0);
-
-	return __do_fault(mm, vma, address, page_table, pmd, pgoff, flags, orig_pte);
-}
 
 /*
  * Fault of a previously existing named mapping. Repopulate the pte
@@ -2356,17 +2348,19 @@ static int do_nonlinear_fault(struct mm_
  * but allow concurrent faults), and pte mapped but not yet locked.
  * We return with mmap_sem still held, but pte unmapped and unlocked.
  */
-static int do_file_page(struct mm_struct *mm, struct vm_area_struct *vma,
+static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		unsigned long address, pte_t *page_table, pmd_t *pmd,
 		int write_access, pte_t orig_pte)
 {
+	unsigned int flags = FAULT_FLAG_NONLINEAR |
+				(write_access ? FAULT_FLAG_WRITE : 0);
 	pgoff_t pgoff;
-	int err;
 
 	if (!pte_unmap_same(mm, pmd, page_table, orig_pte))
 		return VM_FAULT_MINOR;
 
-	if (unlikely(!(vma->vm_flags & VM_NONLINEAR))) {
+	if (unlikely(!(vma->vm_flags & VM_NONLINEAR) ||
+			!(vma->vm_flags & VM_CAN_NONLINEAR))) {
 		/*
 		 * Page table corrupted: show pte and kill process.
 		 */
@@ -2376,18 +2370,8 @@ static int do_file_page(struct mm_struct
 
 	pgoff = pte_to_pgoff(orig_pte);
 
-	if (vma->vm_ops && vma->vm_ops->fault)
-		return do_nonlinear_fault(mm, vma, address, page_table, pmd,
-					write_access, pgoff, orig_pte);
-
-	/* We can then assume vm->vm_ops && vma->vm_ops->populate */
-	err = vma->vm_ops->populate(vma, address & PAGE_MASK, PAGE_SIZE,
-					vma->vm_page_prot, pgoff, 0);
-	if (err == -ENOMEM)
-		return VM_FAULT_OOM;
-	if (err)
-		return VM_FAULT_SIGBUS;
-	return VM_FAULT_MAJOR;
+	return __do_fault(mm, vma, address, page_table, pmd, pgoff,
+							flags, orig_pte);
 }
 
 /*
@@ -2423,7 +2407,7 @@ static inline int handle_pte_fault(struc
 						 pte, pmd, write_access);
 		}
 		if (pte_file(entry))
-			return do_file_page(mm, vma, address,
+			return do_nonlinear_fault(mm, vma, address,
 					pte, pmd, write_access, entry);
 		return do_swap_page(mm, vma, address,
 					pte, pmd, write_access, entry);
Index: linux-2.6/Documentation/feature-removal-schedule.txt
===================================================================
--- linux-2.6.orig/Documentation/feature-removal-schedule.txt
+++ linux-2.6/Documentation/feature-removal-schedule.txt
@@ -170,24 +170,6 @@ Who:	Greg Kroah-Hartman <gregkh@suse.de>
 
 ---------------------------
 
-What:	filemap_nopage, filemap_populate
-When:	April 2007
-Why:	These legacy interfaces no longer have any callers in the kernel and
-	any functionality provided can be provided with filemap_fault. The
-	removal schedule is short because they are a big maintainence burden
-	and have some bugs.
-Who:	Nick Piggin <npiggin@suse.de>
-
----------------------------
-
-What:	vm_ops.populate, install_page
-When:	April 2007
-Why:	These legacy interfaces no longer have any callers in the kernel and
-	any functionality provided can be provided with vm_ops.fault.
-Who:	Nick Piggin <npiggin@suse.de>
-
----------------------------
-
 What:	vm_ops.nopage
 When:	February 2008, provided in-kernel callers have been converted
 Why:	This interface is replaced by vm_ops.fault, but it has been around

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 5/6] mm: merge nopfn into fault
  2007-02-21  4:50   ` Nick Piggin
@ 2007-02-21  5:13     ` Nick Piggin
  0 siblings, 0 replies; 198+ messages in thread
From: Nick Piggin @ 2007-02-21  5:13 UTC (permalink / raw)
  To: Linux Memory Management, Andrew Morton
  Cc: Linux Kernel, Benjamin Herrenschmidt

On Wed, Feb 21, 2007 at 05:50:31AM +0100, Nick Piggin wrote:
> Remove ->nopfn and reimplement the existing handlers with ->fault
> 
> Signed-off-by: Nick Piggin <npiggin@suse.de>

Dang, forgot to quilt refresh after fixing spufs compile.
--

Remove ->nopfn and reimplement the existing handlers with ->fault

Signed-off-by: Nick Piggin <npiggin@suse.de>

 arch/powerpc/platforms/cell/spufs/file.c |   90 ++++++++++++++++---------------
 drivers/char/mspec.c                     |   29 ++++++---
 include/linux/mm.h                       |    8 --
 mm/memory.c                              |   58 +------------------
 4 files changed, 71 insertions(+), 114 deletions(-)
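
For reference, the shape of each conversion below is roughly the following
(a minimal sketch against the ->fault API used in this series; "foo",
FOO_APERTURE_SIZE and foo_phys_base are made-up placeholders, not anything
in the tree):

static struct page *foo_fault(struct vm_area_struct *vma,
			      struct fault_data *fdata)
{
	/*
	 * fdata->pgoff already includes vma->vm_pgoff, so the old
	 * "address - vma->vm_start" plus "vm_pgoff << PAGE_SHIFT"
	 * arithmetic collapses into a single shift.
	 */
	unsigned long offset = fdata->pgoff << PAGE_SHIFT;

	if (offset >= FOO_APERTURE_SIZE) {
		/* status is reported via fdata->type, not a NOPFN_* value */
		fdata->type = VM_FAULT_SIGBUS;
		return NULL;
	}

	fdata->type = VM_FAULT_MINOR;
	/*
	 * PFN mappings install the pte themselves and return NULL;
	 * -EBUSY from vm_insert_pfn just means another thread raced us.
	 */
	vm_insert_pfn(vma, fdata->address,
		      (foo_phys_base + offset) >> PAGE_SHIFT);
	return NULL;
}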

Index: linux-2.6/drivers/char/mspec.c
===================================================================
--- linux-2.6.orig/drivers/char/mspec.c
+++ linux-2.6/drivers/char/mspec.c
@@ -182,24 +182,25 @@ mspec_close(struct vm_area_struct *vma)
 
 
 /*
- * mspec_nopfn
+ * mspec_fault
  *
  * Creates a mspec page and maps it to user space.
  */
-static unsigned long
-mspec_nopfn(struct vm_area_struct *vma, unsigned long address)
+static struct page *
+mspec_fault(struct vm_area_struct *vma, struct fault_data *fdata)
 {
 	unsigned long paddr, maddr;
 	unsigned long pfn;
-	int index;
-	struct vma_data *vdata = vma->vm_private_data;
+	int index = fdata->pgoff;
+	struct vma_data *vdata = vma->vm_private_data;
 
-	index = (address - vma->vm_start) >> PAGE_SHIFT;
 	maddr = (volatile unsigned long) vdata->maddr[index];
 	if (maddr == 0) {
 		maddr = uncached_alloc_page(numa_node_id());
-		if (maddr == 0)
-			return NOPFN_OOM;
+		if (maddr == 0) {
+			fdata->type = VM_FAULT_OOM;
+			return NULL;
+		}
 
 		spin_lock(&vdata->lock);
 		if (vdata->maddr[index] == 0) {
@@ -219,13 +220,21 @@ mspec_nopfn(struct vm_area_struct *vma, 
 
 	pfn = paddr >> PAGE_SHIFT;
 
-	return pfn;
+	fdata->type = VM_FAULT_MINOR;
+	/*
+	 * vm_insert_pfn can fail with -EBUSY, but in that case it will
+	 * be because another thread has installed the pte first, so it
+	 * is no problem.
+	 */
+	vm_insert_pfn(vma, fdata->address, pfn);
+
+	return NULL;
 }
 
 static struct vm_operations_struct mspec_vm_ops = {
 	.open = mspec_open,
 	.close = mspec_close,
-	.nopfn = mspec_nopfn
+	.fault = mspec_fault,
 };
 
 /*
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -230,7 +230,6 @@ struct vm_operations_struct {
 	void (*close)(struct vm_area_struct * area);
 	struct page * (*fault)(struct vm_area_struct *vma, struct fault_data * fdata);
 	struct page * (*nopage)(struct vm_area_struct * area, unsigned long address, int *type);
-	unsigned long (*nopfn)(struct vm_area_struct * area, unsigned long address);
 	int (*populate)(struct vm_area_struct * area, unsigned long address, unsigned long len, pgprot_t prot, unsigned long pgoff, int nonblock);
 
 	/* notification that a previously read-only page is about to become
@@ -660,13 +659,6 @@ static inline int page_mapped(struct pag
 #define NOPAGE_OOM	((struct page *) (-1))
 
 /*
- * Error return values for the *_nopfn functions
- */
-#define NOPFN_SIGBUS	((unsigned long) -1)
-#define NOPFN_OOM	((unsigned long) -2)
-#define NOPFN_REFAULT	((unsigned long) -3)
-
-/*
  * Different kinds of faults, as returned by handle_mm_fault().
  * Used to decide whether a process gets delivered SIGBUS or
  * just gets major/minor fault counters bumped up.
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -1288,6 +1288,11 @@ EXPORT_SYMBOL(vm_insert_page);
  *
  * This function should only be called from a vm_ops->fault handler, and
  * in that case the handler should return NULL.
+ *
+ * vma cannot be a COW mapping.
+ *
+ * As this is called only for pages that do not currently exist, we
+ * do not need to flush old virtual caches or the TLB.
  */
 int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
 		unsigned long pfn)
@@ -2343,56 +2348,6 @@ static int do_nonlinear_fault(struct mm_
 }
 
 /*
- * do_no_pfn() tries to create a new page mapping for a page without
- * a struct_page backing it
- *
- * As this is called only for pages that do not currently exist, we
- * do not need to flush old virtual caches or the TLB.
- *
- * We enter with non-exclusive mmap_sem (to exclude vma changes,
- * but allow concurrent faults), and pte mapped but not yet locked.
- * We return with mmap_sem still held, but pte unmapped and unlocked.
- *
- * It is expected that the ->nopfn handler always returns the same pfn
- * for a given virtual mapping.
- *
- * Mark this `noinline' to prevent it from bloating the main pagefault code.
- */
-static noinline int do_no_pfn(struct mm_struct *mm, struct vm_area_struct *vma,
-		     unsigned long address, pte_t *page_table, pmd_t *pmd,
-		     int write_access)
-{
-	spinlock_t *ptl;
-	pte_t entry;
-	unsigned long pfn;
-	int ret = VM_FAULT_MINOR;
-
-	pte_unmap(page_table);
-	BUG_ON(!(vma->vm_flags & VM_PFNMAP));
-	BUG_ON(is_cow_mapping(vma->vm_flags));
-
-	pfn = vma->vm_ops->nopfn(vma, address & PAGE_MASK);
-	if (unlikely(pfn == NOPFN_OOM))
-		return VM_FAULT_OOM;
-	else if (unlikely(pfn == NOPFN_SIGBUS))
-		return VM_FAULT_SIGBUS;
-	else if (unlikely(pfn == NOPFN_REFAULT))
-		return VM_FAULT_MINOR;
-
-	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
-
-	/* Only go through if we didn't race with anybody else... */
-	if (pte_none(*page_table)) {
-		entry = pfn_pte(pfn, vma->vm_page_prot);
-		if (write_access)
-			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
-		set_pte_at(mm, address, page_table, entry);
-	}
-	pte_unmap_unlock(page_table, ptl);
-	return ret;
-}
-
-/*
  * Fault of a previously existing named mapping. Repopulate the pte
  * from the encoded file_pte if possible. This enables swappable
  * nonlinear vmas.
@@ -2463,9 +2418,6 @@ static inline int handle_pte_fault(struc
 				if (vma->vm_ops->fault || vma->vm_ops->nopage)
 					return do_linear_fault(mm, vma, address,
 						pte, pmd, write_access, entry);
-				if (unlikely(vma->vm_ops->nopfn))
-					return do_no_pfn(mm, vma, address, pte,
-							 pmd, write_access);
 			}
 			return do_anonymous_page(mm, vma, address,
 						 pte, pmd, write_access);
Index: linux-2.6/arch/powerpc/platforms/cell/spufs/file.c
===================================================================
--- linux-2.6.orig/arch/powerpc/platforms/cell/spufs/file.c
+++ linux-2.6/arch/powerpc/platforms/cell/spufs/file.c
@@ -95,16 +95,16 @@ spufs_mem_write(struct file *file, const
 	return ret;
 }
 
-static unsigned long spufs_mem_mmap_nopfn(struct vm_area_struct *vma,
-					  unsigned long address)
+static struct page *spufs_mem_mmap_fault(struct vm_area_struct *vma,
+					  struct fault_data *fdata)
 {
 	struct spu_context *ctx = vma->vm_file->private_data;
-	unsigned long pfn, offset = address - vma->vm_start;
+	unsigned long pfn, offset = fdata->pgoff << PAGE_SHIFT;
 
-	offset += vma->vm_pgoff << PAGE_SHIFT;
-
-	if (offset >= LS_SIZE)
-		return NOPFN_SIGBUS;
+	if (offset >= LS_SIZE) {
+		fdata->type = VM_FAULT_SIGBUS;
+		return NULL;
+	}
 
 	spu_acquire(ctx);
 
@@ -117,16 +117,17 @@ static unsigned long spufs_mem_mmap_nopf
 					     | _PAGE_NO_CACHE);
 		pfn = (ctx->spu->local_store_phys + offset) >> PAGE_SHIFT;
 	}
-	vm_insert_pfn(vma, address, pfn);
+	vm_insert_pfn(vma, fdata->address, pfn);
 
 	spu_release(ctx);
 
-	return NOPFN_REFAULT;
+	fdata->type = VM_FAULT_MINOR;
+	return NULL;
 }
 
 
 static struct vm_operations_struct spufs_mem_mmap_vmops = {
-	.nopfn = spufs_mem_mmap_nopfn,
+	.fault = spufs_mem_mmap_fault,
 };
 
 static int
@@ -151,42 +152,45 @@ static const struct file_operations spuf
 	.mmap    = spufs_mem_mmap,
 };
 
-static unsigned long spufs_ps_nopfn(struct vm_area_struct *vma,
-				    unsigned long address,
+static struct page *spufs_ps_fault(struct vm_area_struct *vma,
+				    struct fault_data *fdata,
 				    unsigned long ps_offs,
 				    unsigned long ps_size)
 {
 	struct spu_context *ctx = vma->vm_file->private_data;
-	unsigned long area, offset = address - vma->vm_start;
+	unsigned long area, offset = fdata->pgoff << PAGE_SHIFT;
 	int ret;
 
-	offset += vma->vm_pgoff << PAGE_SHIFT;
-	if (offset >= ps_size)
-		return NOPFN_SIGBUS;
+	if (offset >= ps_size) {
+		fdata->type = VM_FAULT_SIGBUS;
+		return NULL;
+	}
+
+	fdata->type = VM_FAULT_MINOR;
 
 	/* error here usually means a signal.. we might want to test
 	 * the error code more precisely though
 	 */
 	ret = spu_acquire_runnable(ctx, 0);
 	if (ret)
-		return NOPFN_REFAULT;
+		return NULL;
 
 	area = ctx->spu->problem_phys + ps_offs;
-	vm_insert_pfn(vma, address, (area + offset) >> PAGE_SHIFT);
+	vm_insert_pfn(vma, fdata->address, (area + offset) >> PAGE_SHIFT);
 	spu_release(ctx);
 
-	return NOPFN_REFAULT;
+	return NULL;
 }
 
 #if SPUFS_MMAP_4K
-static unsigned long spufs_cntl_mmap_nopfn(struct vm_area_struct *vma,
-					   unsigned long address)
+static struct page *spufs_cntl_mmap_fault(struct vm_area_struct *vma,
+					   struct fault_data *fdata)
 {
-	return spufs_ps_nopfn(vma, address, 0x4000, 0x1000);
+	return spufs_ps_fault(vma, fdata, 0x4000, 0x1000);
 }
 
 static struct vm_operations_struct spufs_cntl_mmap_vmops = {
-	.nopfn = spufs_cntl_mmap_nopfn,
+	.fault = spufs_cntl_mmap_fault,
 };
 
 /*
@@ -783,23 +787,23 @@ static ssize_t spufs_signal1_write(struc
 	return 4;
 }
 
-static unsigned long spufs_signal1_mmap_nopfn(struct vm_area_struct *vma,
-					      unsigned long address)
+static struct page *spufs_signal1_mmap_fault(struct vm_area_struct *vma,
+					      struct fault_data *fdata)
 {
 #if PAGE_SIZE == 0x1000
-	return spufs_ps_nopfn(vma, address, 0x14000, 0x1000);
+	return spufs_ps_fault(vma, fdata, 0x14000, 0x1000);
 #elif PAGE_SIZE == 0x10000
 	/* For 64k pages, both signal1 and signal2 can be used to mmap the whole
 	 * signal 1 and 2 area
 	 */
-	return spufs_ps_nopfn(vma, address, 0x10000, 0x10000);
+	return spufs_ps_fault(vma, fdata, 0x10000, 0x10000);
 #else
 #error unsupported page size
 #endif
 }
 
 static struct vm_operations_struct spufs_signal1_mmap_vmops = {
-	.nopfn = spufs_signal1_mmap_nopfn,
+	.fault = spufs_signal1_mmap_fault,
 };
 
 static int spufs_signal1_mmap(struct file *file, struct vm_area_struct *vma)
@@ -891,23 +895,23 @@ static ssize_t spufs_signal2_write(struc
 }
 
 #if SPUFS_MMAP_4K
-static unsigned long spufs_signal2_mmap_nopfn(struct vm_area_struct *vma,
-					      unsigned long address)
+static struct page *spufs_signal2_mmap_fault(struct vm_area_struct *vma,
+					      struct fault_data *fdata)
 {
 #if PAGE_SIZE == 0x1000
-	return spufs_ps_nopfn(vma, address, 0x1c000, 0x1000);
+	return spufs_ps_fault(vma, fdata, 0x1c000, 0x1000);
 #elif PAGE_SIZE == 0x10000
 	/* For 64k pages, both signal1 and signal2 can be used to mmap the whole
 	 * signal 1 and 2 area
 	 */
-	return spufs_ps_nopfn(vma, address, 0x10000, 0x10000);
+	return spufs_ps_fault(vma, fdata, 0x10000, 0x10000);
 #else
 #error unsupported page size
 #endif
 }
 
 static struct vm_operations_struct spufs_signal2_mmap_vmops = {
-	.nopfn = spufs_signal2_mmap_nopfn,
+	.fault = spufs_signal2_mmap_fault,
 };
 
 static int spufs_signal2_mmap(struct file *file, struct vm_area_struct *vma)
@@ -992,14 +996,14 @@ DEFINE_SIMPLE_ATTRIBUTE(spufs_signal2_ty
 					spufs_signal2_type_set, "%llu");
 
 #if SPUFS_MMAP_4K
-static unsigned long spufs_mss_mmap_nopfn(struct vm_area_struct *vma,
-					  unsigned long address)
+static struct page *spufs_mss_mmap_fault(struct vm_area_struct *vma,
+					  struct fault_data *fdata)
 {
-	return spufs_ps_nopfn(vma, address, 0x0000, 0x1000);
+	return spufs_ps_fault(vma, fdata, 0x0000, 0x1000);
 }
 
 static struct vm_operations_struct spufs_mss_mmap_vmops = {
-	.nopfn = spufs_mss_mmap_nopfn,
+	.fault = spufs_mss_mmap_fault,
 };
 
 /*
@@ -1037,14 +1041,14 @@ static const struct file_operations spuf
 	.mmap	 = spufs_mss_mmap,
 };
 
-static unsigned long spufs_psmap_mmap_nopfn(struct vm_area_struct *vma,
-					    unsigned long address)
+static struct page *spufs_psmap_mmap_fault(struct vm_area_struct *vma,
+					    struct fault_data *fdata)
 {
-	return spufs_ps_nopfn(vma, address, 0x0000, 0x20000);
+	return spufs_ps_fault(vma, fdata, 0x0000, 0x20000);
 }
 
 static struct vm_operations_struct spufs_psmap_mmap_vmops = {
-	.nopfn = spufs_psmap_mmap_nopfn,
+	.fault = spufs_psmap_mmap_fault,
 };
 
 /*
@@ -1081,14 +1085,14 @@ static const struct file_operations spuf
 
 
 #if SPUFS_MMAP_4K
-static unsigned long spufs_mfc_mmap_nopfn(struct vm_area_struct *vma,
-					  unsigned long address)
+static struct page *spufs_mfc_mmap_fault(struct vm_area_struct *vma,
+					  struct fault_data *fdata)
 {
-	return spufs_ps_nopfn(vma, address, 0x3000, 0x1000);
+	return spufs_ps_fault(vma, fdata, 0x3000, 0x1000);
 }
 
 static struct vm_operations_struct spufs_mfc_mmap_vmops = {
-	.nopfn = spufs_mfc_mmap_nopfn,
+	.fault = spufs_mfc_mmap_fault,
 };
 
 /*

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 0/6] fault vs truncate/invalidate race fix
  2007-02-21  4:49 ` Nick Piggin
@ 2007-02-27  4:36   ` Dave Airlie
  0 siblings, 0 replies; 198+ messages in thread
From: Dave Airlie @ 2007-02-27  4:36 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linux Memory Management, Andrew Morton, Linux Kernel,
	Benjamin Herrenschmidt

>
> I've also got rid of the horrible populate API, and integrated nonlinear pages
> properly with the page fault path.
>
> Downside is that this adds one more vector through which the buffered write
> deadlock can occur. However this is just a very tiny one (pte being unmapped
> for reclaim), compared to all the other ways that deadlock can occur (unmap,
> reclaim, truncate, invalidate). I doubt it will be noticable. At any rate, it
> is better than data corruption.
>
> I hope these can get merged (at least into -mm) soon.

Have these been put into mm? can I expect them in the next -mm so I
can start merging up the drm memory manager code to my -mm tree..

Dave.

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 0/6] fault vs truncate/invalidate race fix
  2007-02-27  4:36   ` Dave Airlie
@ 2007-02-27  5:32     ` Andrew Morton
  0 siblings, 0 replies; 198+ messages in thread
From: Andrew Morton @ 2007-02-27  5:32 UTC (permalink / raw)
  To: Dave Airlie; +Cc: npiggin, linux-mm, linux-kernel, benh

> On Tue, 27 Feb 2007 15:36:03 +1100 "Dave Airlie" <airlied@gmail.com> wrote:
> >
> > I've also got rid of the horrible populate API, and integrated nonlinear pages
> > properly with the page fault path.
> >
> > Downside is that this adds one more vector through which the buffered write
> > deadlock can occur. However this is just a very tiny one (pte being unmapped
> > for reclaim), compared to all the other ways that deadlock can occur (unmap,
> > reclaim, truncate, invalidate). I doubt it will be noticable. At any rate, it
> > is better than data corruption.
> >
> > I hope these can get merged (at least into -mm) soon.
> 
> Have these been put into mm?

Not yet - I need to get back on the correct continent, review the code,
stuff like that.  It still hurts that this work makes the write() deadlock
harder to hit, and we haven't worked out how to fix that.

> can I expect them in the next -mm so I
> can start merging up the drm memory manager code to my -mm tree..

What is the linkage between these patches and DRM?

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 0/6] fault vs truncate/invalidate race fix
  2007-02-27  5:32     ` Andrew Morton
@ 2007-02-27  6:26       ` Dave Airlie
  0 siblings, 0 replies; 198+ messages in thread
From: Dave Airlie @ 2007-02-27  6:26 UTC (permalink / raw)
  To: Andrew Morton; +Cc: npiggin, linux-mm, linux-kernel, benh

On 2/27/07, Andrew Morton <akpm@linux-foundation.org> wrote:
> > On Tue, 27 Feb 2007 15:36:03 +1100 "Dave Airlie" <airlied@gmail.com> wrote:
> > >
> > > I've also got rid of the horrible populate API, and integrated nonlinear pages
> > > properly with the page fault path.
> > >
> > > Downside is that this adds one more vector through which the buffered write
> > > deadlock can occur. However this is just a very tiny one (pte being unmapped
> > > for reclaim), compared to all the other ways that deadlock can occur (unmap,
> > > reclaim, truncate, invalidate). I doubt it will be noticable. At any rate, it
> > > is better than data corruption.
> > >
> > > I hope these can get merged (at least into -mm) soon.
> >
> > Have these been put into mm?
>
> Not yet - I need to get back on the correct continent, review the code,
> stuff like that.  It still hurts that this work makes the write() deadlock
> harder to hit, and we haven't worked out how to fix that.
>
> > can I expect them in the next -mm so I
> > can start merging up the drm memory manager code to my -mm tree..
>
> What is the linkage between these patches and DRM?
>

the new fault handler made the memory manager code a lot cleaner and
much less hacky in a lot of cases, so I'd rather merge the clean code
than have to fight with the current code...

Dave.

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 0/6] fault vs truncate/invalidate race fix
  2007-02-27  6:26       ` Dave Airlie
@ 2007-02-27  6:54         ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 198+ messages in thread
From: Benjamin Herrenschmidt @ 2007-02-27  6:54 UTC (permalink / raw)
  To: Dave Airlie; +Cc: Andrew Morton, npiggin, linux-mm, linux-kernel


> the new fault hander made the memory manager code a lot cleaner and
> very less hacky in a lot of cases. so I'd rather merge the clean code
> than have to fight with the current code...

Note that you can probably get away with NOPFN_REFAULT etc... like I did
for the SPEs in the meantime.
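
For anyone not familiar with the idiom, a ->nopfn handler can install the
mapping itself and just ask the core to retry.  Rough sketch only -- the
two return values are real, the helpers are made up:

static unsigned long example_nopfn(struct vm_area_struct *vma,
                                   unsigned long address)
{
        /* hypothetical driver-side check, not a real helper */
        if (!example_object_ready(vma, address))
                return NOPFN_SIGBUS;

        /* the driver installs the pfn mapping itself (hypothetical helper) */
        example_map_pfn(vma, address);

        /*
         * Tell the core fault path to simply return; if the pte still
         * isn't there, the process just faults again.
         */
        return NOPFN_REFAULT;
}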

Ben.



^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 0/6] fault vs truncate/invalidate race fix
  2007-02-27  5:32     ` Andrew Morton
@ 2007-02-27  8:50       ` Nick Piggin
  -1 siblings, 0 replies; 198+ messages in thread
From: Nick Piggin @ 2007-02-27  8:50 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Dave Airlie, linux-mm, linux-kernel, benh

On Mon, Feb 26, 2007 at 09:32:04PM -0800, Andrew Morton wrote:
> > On Tue, 27 Feb 2007 15:36:03 +1100 "Dave Airlie" <airlied@gmail.com> wrote:
> > >
> > > I've also got rid of the horrible populate API, and integrated nonlinear pages
> > > properly with the page fault path.
> > >
> > > Downside is that this adds one more vector through which the buffered write
> > > deadlock can occur. However this is just a very tiny one (pte being unmapped
> > > for reclaim), compared to all the other ways that deadlock can occur (unmap,
> > > reclaim, truncate, invalidate). I doubt it will be noticable. At any rate, it
> > > is better than data corruption.
> > >
> > > I hope these can get merged (at least into -mm) soon.
> > 
> > Have these been put into mm?
> 
> Not yet - I need to get back on the correct continent, review the code,
> stuff like that.  It still hurts that this work makes the write() deadlock
> harder to hit,

s/harder/easier of course...

I think there is good reason to assume the buffered write page lock
deadlocks would not occur in "normal" programs (or very very few),
because it would require writing from the same page you are writing to,
or 2 processes writing from the page the other is writing to. If any
innocent users do hit this, at least it is not data corrupting, and is
relatively easy to trace back to the kernel.
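
For concreteness, the pattern needed to hit it from userspace is roughly
the following (file name and size are arbitrary, and whether it actually
deadlocks depends on the source page not being resident when write()
copies from it):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        long pagesize = sysconf(_SC_PAGESIZE);
        int fd = open("testfile", O_RDWR | O_CREAT | O_TRUNC, 0600);
        char *map;

        if (fd < 0 || ftruncate(fd, pagesize) < 0) {
                perror("setup");
                return 1;
        }
        map = mmap(NULL, pagesize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (map == MAP_FAILED) {
                perror("mmap");
                return 1;
        }
        /*
         * Source buffer == destination page: write() locks the pagecache
         * page, then faults on that same page while copying from the
         * user buffer.
         */
        if (write(fd, map, pagesize) < 0)
                perror("write");
        return 0;
}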

In the case of local DoS exploits, the deadlocks already present in the
buffered write path are trivial to exploit... locking the page in the
fault path doesn't make such a deadlock any easier to exploit.

So the downside to merging is that we _may_ get some additional deadlocks.

What is being fixed is silent data corruption that has been reported by
several different users of the SLES kernel (because we have assertions
there to catch it), and can be triggered by DIO or NFS, or anything using
vmtruncate_range or invalidate_inode_pages2 on regular files. Or even a
regular truncate with nonlinear pages. These are known problems on
production workloads.

That's my argument for merging these. I think it's reasonable, but I'm
open to debate.

I did get some page fault performance numbers at one stage. Nothing
really exciting seemed to happen IIRC, but I can do another set of tests
if you want?

> and we haven't worked out how to fix that.

To be fair, I have 2 ways to fix it. Unfortunately one is slow and the
other requires cooperation from filesystem developers. perform_write() is
still on track, but it is going to take a reasonable amount of time and
effort to convert filesystems. I just can't see any gain in holding these
patches back until that all happens.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 3/6] mm: fix fault vs invalidate race for linear mappings
  2007-02-21  4:50   ` Nick Piggin
@ 2007-03-07  6:36     ` Andrew Morton
  -1 siblings, 0 replies; 198+ messages in thread
From: Andrew Morton @ 2007-03-07  6:36 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linux Memory Management, Andrew Morton, Linux Kernel,
	Benjamin Herrenschmidt

On Wed, 21 Feb 2007 05:50:05 +0100 (CET) Nick Piggin <npiggin@suse.de> wrote:

> Fix the race between invalidate_inode_pages and do_no_page.
> 
> Andrea Arcangeli identified a subtle race between invalidation of
> pages from pagecache with userspace mappings, and do_no_page.
> 
> The issue is that invalidation has to shoot down all mappings to the
> page, before it can be discarded from the pagecache. Between shooting
> down ptes to a particular page, and actually dropping the struct page
> from the pagecache, do_no_page from any process might fault on that
> page and establish a new mapping to the page just before it gets
> discarded from the pagecache.
> 
> The most common case where such invalidation is used is in file
> truncation. This case was catered for by doing a sort of open-coded
> seqlock between the file's i_size, and its truncate_count.
> 
> Truncation will decrease i_size, then increment truncate_count before
> unmapping userspace pages; do_no_page will read truncate_count, then
> find the page if it is within i_size, and then check truncate_count
> under the page table lock and back out and retry if it had
> subsequently been changed (ptl will serialise against unmapping, and
> ensure a potentially updated truncate_count is actually visible).
> 
> Complexity and documentation issues aside, the locking protocol fails
> in the case where we would like to invalidate pagecache inside i_size.
> do_no_page can come in anytime and filemap_nopage is not aware of the
> invalidation in progress (as it is when it is outside i_size). The
> end result is that dangling (->mapping == NULL) pages that appear to
> be from a particular file may be mapped into userspace with nonsense
> data. Valid mappings to the same place will see a different page.
> 
> Andrea implemented two working fixes, one using a real seqlock,
> another using a page->flags bit. He also proposed using the page lock
> in do_no_page, but that was initially considered too heavyweight.
> However, it is not a global or per-file lock, and the page cacheline
> is modified in do_no_page to increment _count and _mapcount anyway, so
> a further modification should not be a large performance hit.
> Scalability is not an issue.
> 
> This patch implements this latter approach. ->nopage implementations
> return with the page locked if it is possible for their underlying
> file to be invalidated (in that case, they must set a special vm_flags
> bit to indicate so). do_no_page only unlocks the page after setting
> up the mapping completely. invalidation is excluded because it holds
> the page lock during invalidation of each page (and ensures that the
> page is not mapped while holding the lock).
> 
> This also allows significant simplifications in do_no_page, because
> we have the page locked in the right place in the pagecache from the
> start.
> 

Why was truncate_inode_pages_range() altered to unmap the page if it got
mapped again?

Oh.  Because the unmap_mapping_range() call got removed from vmtruncate(). 
Why?  (Please send suitable updates to the changelog).

I guess truncate of a mmapped area isn't sufficiently common to worry about
the inefficiency of this change.

Lots of memory barriers got removed in memory.c, unchangeloggedly.

Gratuitous renaming of locals in do_no_page() makes the change hard to
review.  Should have been a separate patch.

In fact, the patch would have been heaps clearer if that renaming had been
a separate patch.


^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-02-21  4:50   ` Nick Piggin
@ 2007-03-07  6:51     ` Andrew Morton
  -1 siblings, 0 replies; 198+ messages in thread
From: Andrew Morton @ 2007-03-07  6:51 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linux Memory Management, Linux Kernel, Benjamin Herrenschmidt,
	Ingo Molnar

On Wed, 21 Feb 2007 05:50:17 +0100 (CET) Nick Piggin <npiggin@suse.de> wrote:

> Nonlinear mappings are (AFAIKS) simply a virtual memory concept that
> encodes the virtual address -> file offset differently from linear
> mappings.
> 
> I can't see why the filesystem/pagecache code should need to know anything
> about it, except for the fact that the ->nopage handler didn't quite pass
> down enough information (ie. pgoff). But it is more logical to pass pgoff
> rather than have the ->nopage function calculate it itself anyway. And
> having the nopage handler install the pte itself is sort of nasty.
> 
> This patch introduces a new fault handler that replaces ->nopage and
> ->populate and (later) ->nopfn. Most of the old mechanism is still in place
> so there is a lot of duplication and nice cleanups that can be removed if
> everyone switches over.
> 
> The rationale for doing this in the first place is that nonlinear mappings
> are subject to the pagefault vs invalidate/truncate race too, and it seemed
> stupid to duplicate the synchronisation logic rather than just consolidate
> the two.
> 

It's awkward to layer a largely do-nothing patch like this on top of a
significant functional change.  Makes it harder to isolate the source of
regressions, harder to revert the do-something patch.

> After this patch, MAP_NONBLOCK no longer sets up ptes for pages present in
> pagecache. Seems like a fringe functionality anyway.

Does Ingo agree?

> NOPAGE_REFAULT is removed. This should be implemented with ->fault, and
> no users have hit mainline yet.

Did benh agree with that?


The patch unchangeloggedly adds a basic new structure to core mm
(fault_data).  Would be nice to document its fields, especially `flags'.


Please add less pointless blank lines.


How well has this been tested?  The ocfs2 changes?  gfs2?  We should at
least give those guys a heads-up.


Does anybody really pass a NULL `type' arg into filemap_nopage()?


This patch seems to churn things around an awful lot for minimal benefit.


^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 3/6] mm: fix fault vs invalidate race for linear mappings
  2007-03-07  6:36     ` Andrew Morton
@ 2007-03-07  6:57       ` Nick Piggin
  -1 siblings, 0 replies; 198+ messages in thread
From: Nick Piggin @ 2007-03-07  6:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Memory Management, Linux Kernel, Benjamin Herrenschmidt

On Tue, Mar 06, 2007 at 10:36:41PM -0800, Andrew Morton wrote:
> On Wed, 21 Feb 2007 05:50:05 +0100 (CET) Nick Piggin <npiggin@suse.de> wrote:
> 
> > Fix the race between invalidate_inode_pages and do_no_page.
> > 
> > Andrea Arcangeli identified a subtle race between invalidation of
> > pages from pagecache with userspace mappings, and do_no_page.
> > 
> > The issue is that invalidation has to shoot down all mappings to the
> > page, before it can be discarded from the pagecache. Between shooting
> > down ptes to a particular page, and actually dropping the struct page
> > from the pagecache, do_no_page from any process might fault on that
> > page and establish a new mapping to the page just before it gets
> > discarded from the pagecache.
> > 
> > The most common case where such invalidation is used is in file
> > truncation. This case was catered for by doing a sort of open-coded
> > seqlock between the file's i_size, and its truncate_count.
> > 
> > Truncation will decrease i_size, then increment truncate_count before
> > unmapping userspace pages; do_no_page will read truncate_count, then
> > find the page if it is within i_size, and then check truncate_count
> > under the page table lock and back out and retry if it had
> > subsequently been changed (ptl will serialise against unmapping, and
> > ensure a potentially updated truncate_count is actually visible).
> > 
> > Complexity and documentation issues aside, the locking protocol fails
> > in the case where we would like to invalidate pagecache inside i_size.
> > do_no_page can come in anytime and filemap_nopage is not aware of the
> > invalidation in progress (as it is when it is outside i_size). The
> > end result is that dangling (->mapping == NULL) pages that appear to
> > be from a particular file may be mapped into userspace with nonsense
> > data. Valid mappings to the same place will see a different page.
> > 
> > Andrea implemented two working fixes, one using a real seqlock,
> > another using a page->flags bit. He also proposed using the page lock
> > in do_no_page, but that was initially considered too heavyweight.
> > However, it is not a global or per-file lock, and the page cacheline
> > is modified in do_no_page to increment _count and _mapcount anyway, so
> > a further modification should not be a large performance hit.
> > Scalability is not an issue.
> > 
> > This patch implements this latter approach. ->nopage implementations
> > return with the page locked if it is possible for their underlying
> > file to be invalidated (in that case, they must set a special vm_flags
> > bit to indicate so). do_no_page only unlocks the page after setting
> > up the mapping completely. invalidation is excluded because it holds
> > the page lock during invalidation of each page (and ensures that the
> > page is not mapped while holding the lock).
> > 
> > This also allows significant simplifications in do_no_page, because
> > we have the page locked in the right place in the pagecache from the
> > start.
> > 
> 
> Why was truncate_inode_pages_range() altered to unmap the page if it got
> mapped again?
> 
> Oh.  Because the unmap_mapping_range() call got removed from vmtruncate(). 
> Why?  (Please send suitable updates to the changelog).

We have to ensure it is unmapped, and be prepared to unmap it while under
the page lock.

> I guess truncate of a mmapped area isn't sufficiently common to worry about
> the inefficiency of this change.

Yeah, and it should be more efficient for files that aren't mmapped,
because we don't have to take i_mmap_lock for them.

> Lots of memory barriers got removed in memory.c, unchangeloggedly.

Yeah they were all for the lockless truncate_count checks. Now that
we use the page lock, we don't need barriers.
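
To spell out what went away, the old fault-side protocol was, schematically
(a toy standalone model for illustration -- the types and helpers here are
stand-ins, not the kernel code):

#include <stddef.h>

struct address_space { unsigned int truncate_count; };
struct page { int dummy; };

static struct page *lookup_page(struct address_space *m) { (void)m; return NULL; }
static void ptl_lock(void) { }          /* page table lock stand-in */
static void ptl_unlock(void) { }

/*
 * Sample truncate_count, look the page up, then re-check the count under
 * the page table lock; if truncation ran in between, back out and retry.
 * The removed barriers were there to order these reads against truncate.
 */
static struct page *old_style_fault(struct address_space *mapping)
{
        struct page *page;
        unsigned int seq;
retry:
        seq = mapping->truncate_count;
        page = lookup_page(mapping);
        if (!page)
                return NULL;
        ptl_lock();
        if (seq != mapping->truncate_count) {
                ptl_unlock();
                goto retry;             /* raced with truncate */
        }
        /* ...install the pte here... */
        ptl_unlock();
        return page;
}

int main(void)
{
        struct address_space mapping = { 0 };
        return old_style_fault(&mapping) ? 0 : 0;
}

/*
 * New scheme: the handler returns the page already locked and
 * truncate/invalidate takes the same page lock before unmapping, so the
 * count, the barriers and the retry loop all go away.
 */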

> Gratuitous renaming of locals in do_no_page() makes the change hard to
> review.  Should have been a separate patch.
> 
> In fact, the patch would have been heaps clearer if that renaming had been
> a separate patch.

Shall I?

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 3/6] mm: fix fault vs invalidate race for linear mappings
  2007-03-07  6:57       ` Nick Piggin
@ 2007-03-07  7:08         ` Andrew Morton
  -1 siblings, 0 replies; 198+ messages in thread
From: Andrew Morton @ 2007-03-07  7:08 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Linux Memory Management, Linux Kernel, Benjamin Herrenschmidt

On Wed, 7 Mar 2007 07:57:27 +0100 Nick Piggin <npiggin@suse.de> wrote:

> > 
> > Why was truncate_inode_pages_range() altered to unmap the page if it got
> > mapped again?
> > 
> > Oh.  Because the unmap_mapping_range() call got removed from vmtruncate(). 
> > Why?  (Please send suitable updates to the changelog).
> 
> We have to ensure it is unmapped, and be prepared to unmap it while under
> the page lock.

But vmtruncate() dropped i_size, so nobody will map this page into
pagetables from then on.

> > I guess truncate of a mmapped area isn't sufficiently common to worry about
> > the inefficiency of this change.
> 
> Yeah, and it should be more efficient for files that aren't mmapped,
> because we don't have to take i_mmap_lock for them.
> 
> > Lots of memory barriers got removed in memory.c, unchangeloggedly.
> 
> Yeah they were all for the lockless truncate_count checks. Now that
> we use the page lock, we don't need barriers.
> 
> > Gratuitous renaming of locals in do_no_page() makes the change hard to
> > review.  Should have been a separate patch.
> > 
> > In fact, the patch would have been heaps clearer if that renaming had been
> > a separate patch.
> 
> Shall I?

If you don't have anything better to do, yes please ;)


^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-03-07  6:51     ` Andrew Morton
@ 2007-03-07  7:08       ` Nick Piggin
  -1 siblings, 0 replies; 198+ messages in thread
From: Nick Piggin @ 2007-03-07  7:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Memory Management, Linux Kernel, Benjamin Herrenschmidt,
	Ingo Molnar

On Tue, Mar 06, 2007 at 10:51:01PM -0800, Andrew Morton wrote:
> On Wed, 21 Feb 2007 05:50:17 +0100 (CET) Nick Piggin <npiggin@suse.de> wrote:
> 
> > Nonlinear mappings are (AFAIKS) simply a virtual memory concept that
> > encodes the virtual address -> file offset differently from linear
> > mappings.
> > 
> > I can't see why the filesystem/pagecache code should need to know anything
> > about it, except for the fact that the ->nopage handler didn't quite pass
> > down enough information (ie. pgoff). But it is more logical to pass pgoff
> > rather than have the ->nopage function calculate it itself anyway. And
> > having the nopage handler install the pte itself is sort of nasty.
> > 
> > This patch introduces a new fault handler that replaces ->nopage and
> > ->populate and (later) ->nopfn. Most of the old mechanism is still in place
> > so there is a lot of duplication and nice cleanups that can be removed if
> > everyone switches over.
> > 
> > The rationale for doing this in the first place is that nonlinear mappings
> > are subject to the pagefault vs invalidate/truncate race too, and it seemed
> > stupid to duplicate the synchronisation logic rather than just consolidate
> > the two.
> > 
> 
> It's awkward to layer a largely do-nothing patch like this on top of a
> significant functional change.  Makes it harder to isolate the source of
> regressions, harder to revert the do-something patch.
> 
> > After this patch, MAP_NONBLOCK no longer sets up ptes for pages present in
> > pagecache. Seems like a fringe functionality anyway.
> 
> Does Ingo agree?

I cc'ed him when first posting it. He didn't disagree.

> > NOPAGE_REFAULT is removed. This should be implemented with ->fault, and
> > no users have hit mainline yet.
> 
> Did benh agree with that?

Yes.

> The patch unchangeloggedly adds a basic new structure to core mm
> (fault_data).  Would be nice to document its fields, especially `flags'.

OK. This is actually something that I would like more people to review.
Do we need any different fields? Should it be passed as arguments instead
of a structure?
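
(For reference, the kind of structure being discussed is roughly the
following -- the field names here are illustrative guesses for discussion,
not necessarily what the patch ends up with:)

struct fault_data {
        unsigned long address;  /* faulting virtual address */
        pgoff_t pgoff;          /* offset of the fault within the file */
        unsigned int flags;     /* e.g. write fault, nonblocking */
        int type;               /* VM_FAULT_MINOR/MAJOR, filled in by the handler */
};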

> Please add less pointless blank lines.
> 
> 
> How well has this been tested?  The ocfs2 changes?  gfs2?  We should at
> least give those guys a heads-up.

Yes we should. Not all those filesystem changes have been tested.

> Does anybody really pass a NULL `type' arg into filemap_nopage()?

Dunno, it's exported. I remove that completely in a subsequent patch
anyway.

> This patch seems to churn things around an awful lot for minimal benefit.

Well it fixes the whole design of the nonlinear fault path.

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-03-07  6:51     ` Andrew Morton
@ 2007-03-07  7:19       ` Bill Irwin
  -1 siblings, 0 replies; 198+ messages in thread
From: Bill Irwin @ 2007-03-07  7:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nick Piggin, Linux Memory Management, Linux Kernel,
	Benjamin Herrenschmidt, Ingo Molnar

On Tue, Mar 06, 2007 at 10:51:01PM -0800, Andrew Morton wrote:
> Does anybody really pass a NULL `type' arg into filemap_nopage()?

The major vs. minor fault accounting patch that introduced the argument
didn't make non-NULL type arguments a requirement. It's essentially an
optional second return value and the NULL pointer represents the caller
choosing to ignore it. I'm not sure I actually liked that aspect of it,
but that's how it ended up going in. I think it had something to do
with driver churn clashing with the sweep at the time of the merge. I'd
rather the argument be mandatory and defaulted to VM_FAULT_MINOR.

It's something of a non-answer, though, since it only discusses a
convention as opposed to reviewing specific callers of filemap_nopage().
NULL type arguments to ->nopage() are rare at most, and could be easily
eliminated, at least for in-tree drivers.

egrep -nr 'nopage.*NULL' . 2>/dev/null | grep -v '^Bin' on a current
git tree yields zero matches.
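
In handler terms the convention is simply this (schematic -- the lookup
helper is made up, the signature and return values are the existing
->nopage ones):

struct page *example_nopage(struct vm_area_struct *area,
                            unsigned long address, int *type)
{
        struct page *page = example_find_page(area, address);  /* made-up helper */

        if (!page)
                return NOPAGE_SIGBUS;
        if (type)               /* NULL just means the caller ignores the type */
                *type = VM_FAULT_MINOR;
        return page;
}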


-- wli

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 3/6] mm: fix fault vs invalidate race for linear mappings
  2007-03-07  7:08         ` Andrew Morton
@ 2007-03-07  7:25           ` Nick Piggin
  -1 siblings, 0 replies; 198+ messages in thread
From: Nick Piggin @ 2007-03-07  7:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Memory Management, Linux Kernel, Benjamin Herrenschmidt

On Tue, Mar 06, 2007 at 11:08:41PM -0800, Andrew Morton wrote:
> On Wed, 7 Mar 2007 07:57:27 +0100 Nick Piggin <npiggin@suse.de> wrote:
> 
> > > 
> > > Why was truncate_inode_pages_range() altered to unmap the page if it got
> > > mapped again?
> > > 
> > > Oh.  Because the unmap_mapping_range() call got removed from vmtruncate(). 
> > > Why?  (Please send suitable updates to the changelog).
> > 
> > We have to ensure it is unmapped, and be prepared to unmap it while under
> > the page lock.
> 
> But vmtruncate() dropped i_size, so nobody will map this page into
> pagetables from then on.

But there could be a fault in progress... the only way to know is
locking the page.

> > > I guess truncate of a mmapped area isn't sufficiently common to worry about
> > > the inefficiency of this change.
> > 
> > Yeah, and it should be more efficient for files that aren't mmapped,
> > because we don't have to take i_mmap_lock for them.
> > 
> > > Lots of memory barriers got removed in memory.c, unchangeloggedly.
> > 
> > Yeah they were all for the lockless truncate_count checks. Now that
> > we use the page lock, we don't need barriers.
> > 
> > > Gratuitous renaming of locals in do_no_page() makes the change hard to
> > > review.  Should have been a separate patch.
> > > 
> > > In fact, the patch would have been heaps clearer if that renaming had been
> > > a separate patch.
> > 
> > Shall I?
> 
> If you don't have anything better to do, yes please ;)

OK.

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-03-07  7:08       ` Nick Piggin
@ 2007-03-07  8:19         ` Nick Piggin
  -1 siblings, 0 replies; 198+ messages in thread
From: Nick Piggin @ 2007-03-07  8:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Memory Management, Linux Kernel, Benjamin Herrenschmidt,
	Ingo Molnar

On Wed, Mar 07, 2007 at 08:08:53AM +0100, Nick Piggin wrote:
> On Tue, Mar 06, 2007 at 10:51:01PM -0800, Andrew Morton wrote:
> 
> > This patch seems to churn things around an awful lot for minimal benefit.
> 
> Well it fixes the whole design of the nonlinear fault path.

If it doesn't look very impressive, it could be because it leaves all
the old crud around for backwards compatibility (the worst offenders
are removed in patch 6/6).

If you look at the patchset as a whole, it removes about 250 lines,
mostly of (non trivial) duplicated code in filemap.c memory.c shmem.c
fremap.c, that is nonlinear pages specific and doesn't get anywhere
near the testing that the linear fault path does.

A minimal fix for nonlinear pages would have required changing all
->populate handlers, which I simply thought was not very productive
considering the testing and coverage issues, and that I was going to
rewrite the nonlinear path anyway.

If you like, you can consider patches 1,2,3 as the fix, and ignore
nonlinear (hey, it doesn't even bother checking truncate_count today!).

Then 4,5,6 is the fault/nonlinear rewrite, take it or leave it. I thought
you would have liked the patches...


^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-03-07  8:19         ` Nick Piggin
@ 2007-03-07  8:27           ` Ingo Molnar
  -1 siblings, 0 replies; 198+ messages in thread
From: Ingo Molnar @ 2007-03-07  8:27 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Linux Memory Management, Linux Kernel,
	Benjamin Herrenschmidt


* Nick Piggin <npiggin@suse.de> wrote:

> If it doesn't look very impressive, it could be because it leaves all 
> the old crud around for backwards compatibility (the worst offenders 
> are removed in patch 6/6).
> 
> If you look at the patchset as a whole, it removes about 250 lines, 
> mostly of (non trivial) duplicated code in filemap.c memory.c shmem.c 
> fremap.c, that is nonlinear pages specific and doesn't get anywhere 
> near the testing that the linear fault path does.
> 
> A minimal fix for nonlinear pages would have required changing all 
> ->populate handlers, which I simply thought was not very productive 
> considering the testing and coverage issues, and that I was going to 
> rewrite the nonlinear path anyway.
> 
> If you like, you can consider patches 1,2,3 as the fix, and ignore 
> nonlinear (hey, it doesn't even bother checking truncate_count 
> today!).
> 
> Then 4,5,6 is the fault/nonlinear rewrite, take it or leave it. I 
> thought you would have liked the patches...

Btw, if we decide that nonlinear isn't worth the continuing maintenance 
pain, we could internally implement/emulate sys_remap_file_pages() via a 
call to mremap() and essentially deprecate it, without breaking the ABI 
- and remove all the nonlinear code. (This would split fremap areas into 
separate vmas)
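
For reference, the ABI-level equivalence such an emulation would rely on,
sketched from userspace (the protection, offsets and helper name are
arbitrary; how the kernel splits the vmas internally is a separate
question):

#define _GNU_SOURCE
#include <sys/mman.h>
#include <unistd.h>

/* Map 'len' bytes of 'fd' at file page offset 'pgoff' over part of an
 * existing MAP_SHARED mapping at 'addr'. */
static int remap_chunk(void *addr, size_t len, int fd, size_t pgoff)
{
#ifdef USE_NONLINEAR
        /* one vma, pte-encoded file offsets */
        return remap_file_pages(addr, len, 0, pgoff, 0);
#else
        /* emulation: an ordinary linear MAP_FIXED mapping per chunk */
        void *p = mmap(addr, len, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_FIXED, fd,
                       (off_t)pgoff * sysconf(_SC_PAGESIZE));
        return p == MAP_FAILED ? -1 : 0;
#endif
}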

	Ingo

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-03-07  8:27           ` Ingo Molnar
@ 2007-03-07  8:35             ` Andrew Morton
  -1 siblings, 0 replies; 198+ messages in thread
From: Andrew Morton @ 2007-03-07  8:35 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Nick Piggin, Linux Memory Management, Linux Kernel,
	Benjamin Herrenschmidt, Paolo 'Blaisorblade' Giarrusso

On Wed, 7 Mar 2007 09:27:55 +0100 Ingo Molnar <mingo@elte.hu> wrote:

> 
> * Nick Piggin <npiggin@suse.de> wrote:
> 
> > If it doesn't look very impressive, it could be because it leaves all 
> > the old crud around for backwards compatibility (the worst offenders 
> > are removed in patch 6/6).
> > 
> > If you look at the patchset as a whole, it removes about 250 lines, 
> > mostly of (non trivial) duplicated code in filemap.c memory.c shmem.c 
> > fremap.c, that is nonlinear pages specific and doesn't get anywhere 
> > near the testing that the linear fault path does.
> > 
> > A minimal fix for nonlinear pages would have required changing all 
> > ->populate handlers, which I simply thought was not very productive 
> > considering the testing and coverage issues, and that I was going to 
> > rewrite the nonlinear path anyway.
> > 
> > If you like, you can consider patches 1,2,3 as the fix, and ignore 
> > nonlinear (hey, it doesn't even bother checking truncate_count 
> > today!).
> > 
> > Then 4,5,6 is the fault/nonlinear rewrite, take it or leave it. I 
> > thought you would have liked the patches...
> 
> btw., if we decide that nonlinear isnt worth the continuing maintainance 
> pain, we could internally implement/emulate sys_remap_file_pages() via a 
> call to mremap() and essentially deprecate it, without breaking the ABI 
> - and remove all the nonlinear code. (This would split fremap areas into 
> separate vmas)
> 

I'm rather regretting having merged it - I don't think it has been used for
much.

Paolo's UML speedup patches might use nonlinear though.

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
@ 2007-03-07  8:35             ` Andrew Morton
  0 siblings, 0 replies; 198+ messages in thread
From: Andrew Morton @ 2007-03-07  8:35 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Nick Piggin, Linux Memory Management, Linux Kernel,
	Benjamin Herrenschmidt, Paolo 'Blaisorblade' Giarrusso

On Wed, 7 Mar 2007 09:27:55 +0100 Ingo Molnar <mingo@elte.hu> wrote:

> 
> * Nick Piggin <npiggin@suse.de> wrote:
> 
> > If it doesn't look very impressive, it could be because it leaves all 
> > the old crud around for backwards compatibility (the worst offenders 
> > are removed in patch 6/6).
> > 
> > If you look at the patchset as a whole, it removes about 250 lines, 
> > mostly of (non trivial) duplicated code in filemap.c memory.c shmem.c 
> > fremap.c, that is nonlinear pages specific and doesn't get anywhere 
> > near the testing that the linear fault path does.
> > 
> > A minimal fix for nonlinear pages would have required changing all 
> > ->populate handlers, which I simply thought was not very productive 
> > considering the testing and coverage issues, and that I was going to 
> > rewrite the nonlinear path anyway.
> > 
> > If you like, you can consider patches 1,2,3 as the fix, and ignore 
> > nonlinear (hey, it doesn't even bother checking truncate_count 
> > today!).
> > 
> > Then 4,5,6 is the fault/nonlinear rewrite, take it or leave it. I 
> > thought you would have liked the patches...
> 
> btw., if we decide that nonlinear isn't worth the continuing maintenance 
> pain, we could internally implement/emulate sys_remap_file_pages() via a 
> call to mremap() and essentially deprecate it, without breaking the ABI 
> - and remove all the nonlinear code. (This would split fremap areas into 
> separate vmas)
> 

I'm rather regretting having merged it - I don't think it has been used for
much.

Paolo's UML speedup patches might use nonlinear though.

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-03-07  8:27           ` Ingo Molnar
@ 2007-03-07  8:38             ` Miklos Szeredi
  -1 siblings, 0 replies; 198+ messages in thread
From: Miklos Szeredi @ 2007-03-07  8:38 UTC (permalink / raw)
  To: mingo; +Cc: npiggin, akpm, linux-mm, linux-kernel, benh

> > If it doesn't look very impressive, it could be because it leaves all 
> > the old crud around for backwards compatibility (the worst offenders 
> > are removed in patch 6/6).
> > 
> > If you look at the patchset as a whole, it removes about 250 lines, 
> > mostly of (non trivial) duplicated code in filemap.c memory.c shmem.c 
> > fremap.c, that is nonlinear pages specific and doesn't get anywhere 
> > near the testing that the linear fault path does.
> > 
> > A minimal fix for nonlinear pages would have required changing all 
> > ->populate handlers, which I simply thought was not very productive 
> > considering the testing and coverage issues, and that I was going to 
> > rewrite the nonlinear path anyway.
> > 
> > If you like, you can consider patches 1,2,3 as the fix, and ignore 
> > nonlinear (hey, it doesn't even bother checking truncate_count 
> > today!).
> > 
> > Then 4,5,6 is the fault/nonlinear rewrite, take it or leave it. I 
> > thought you would have liked the patches...
> 
> btw., if we decide that nonlinear isn't worth the continuing maintenance 
> pain, we could internally implement/emulate sys_remap_file_pages() via a 
> call to mremap() and essentially deprecate it, without breaking the ABI 
> - and remove all the nonlinear code. (This would split fremap areas into 
> separate vmas)

That would make sense.  Dirty page accounting doesn't work either on
non-linear mappings, and I can't see how that could be fixed in any
other way.

Miklos

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
@ 2007-03-07  8:38             ` Miklos Szeredi
  0 siblings, 0 replies; 198+ messages in thread
From: Miklos Szeredi @ 2007-03-07  8:38 UTC (permalink / raw)
  To: mingo; +Cc: npiggin, akpm, linux-mm, linux-kernel, benh

> > If it doesn't look very impressive, it could be because it leaves all 
> > the old crud around for backwards compatibility (the worst offenders 
> > are removed in patch 6/6).
> > 
> > If you look at the patchset as a whole, it removes about 250 lines, 
> > mostly of (non trivial) duplicated code in filemap.c memory.c shmem.c 
> > fremap.c, that is nonlinear pages specific and doesn't get anywhere 
> > near the testing that the linear fault path does.
> > 
> > A minimal fix for nonlinear pages would have required changing all 
> > ->populate handlers, which I simply thought was not very productive 
> > considering the testing and coverage issues, and that I was going to 
> > rewrite the nonlinear path anyway.
> > 
> > If you like, you can consider patches 1,2,3 as the fix, and ignore 
> > nonlinear (hey, it doesn't even bother checking truncate_count 
> > today!).
> > 
> > Then 4,5,6 is the fault/nonlinear rewrite, take it or leave it. I 
> > thought you would have liked the patches...
> 
> btw., if we decide that nonlinear isn't worth the continuing maintenance 
> pain, we could internally implement/emulate sys_remap_file_pages() via a 
> call to mremap() and essentially deprecate it, without breaking the ABI 
> - and remove all the nonlinear code. (This would split fremap areas into 
> separate vmas)

That would make sense.  Dirty page accounting doesn't work either on
non-linear mappings, and I can't see how that could be fixed in any
other way.

Miklos

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-03-07  8:38             ` Miklos Szeredi
@ 2007-03-07  8:47               ` Andrew Morton
  -1 siblings, 0 replies; 198+ messages in thread
From: Andrew Morton @ 2007-03-07  8:47 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: mingo, npiggin, linux-mm, linux-kernel, benh

On Wed, 07 Mar 2007 09:38:34 +0100 Miklos Szeredi <miklos@szeredi.hu> wrote:

> Dirty page accounting doesn't work either on
> non-linear mappings

It doesn't?  Confused - these things don't have anything to do with each
other do they?

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
@ 2007-03-07  8:47               ` Andrew Morton
  0 siblings, 0 replies; 198+ messages in thread
From: Andrew Morton @ 2007-03-07  8:47 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: mingo, npiggin, linux-mm, linux-kernel, benh

On Wed, 07 Mar 2007 09:38:34 +0100 Miklos Szeredi <miklos@szeredi.hu> wrote:

> Dirty page accounting doesn't work either on
> non-linear mappings

It doesn't?  Confused - these things don't have anything to do with each
other do they?

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-03-07  8:47               ` Andrew Morton
@ 2007-03-07  8:51                 ` Miklos Szeredi
  -1 siblings, 0 replies; 198+ messages in thread
From: Miklos Szeredi @ 2007-03-07  8:51 UTC (permalink / raw)
  To: akpm; +Cc: mingo, npiggin, linux-mm, linux-kernel, benh

> > Dirty page accounting doesn't work either on
> > non-linear mappings
> 
> It doesn't?  Confused - these things don't have anything to do with each
> other do they?

Look in page_mkclean().  Where does it handle non-linear mappings?

Miklos
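
For context, a rough reconstruction (not a verbatim quote of the then-current
mm/rmap.c) of the file side of page_mkclean(): it walks only the linear i_mmap
prio tree, so vmas sitting on i_mmap_nonlinear - and therefore any file ptes
installed by remap_file_pages() - are never visited or write-protected.

/* approximate sketch of page_mkclean_file() */
static int page_mkclean_file(struct address_space *mapping, struct page *page)
{
	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
	struct vm_area_struct *vma;
	struct prio_tree_iter iter;
	int ret = 0;

	spin_lock(&mapping->i_mmap_lock);
	/* linear mappings only: i_mmap_nonlinear is never consulted here */
	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
		if (vma->vm_flags & VM_SHARED)
			ret += page_mkclean_one(page, vma);
	}
	spin_unlock(&mapping->i_mmap_lock);
	return ret;
}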

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
@ 2007-03-07  8:51                 ` Miklos Szeredi
  0 siblings, 0 replies; 198+ messages in thread
From: Miklos Szeredi @ 2007-03-07  8:51 UTC (permalink / raw)
  To: akpm; +Cc: mingo, npiggin, linux-mm, linux-kernel, benh

> > Dirty page accounting doesn't work either on
> > non-linear mappings
> 
> It doesn't?  Confused - these things don't have anything to do with each
> other do they?

Look in page_mkclean().  Where does it handle non-linear mappings?

Miklos

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-03-07  8:35             ` Andrew Morton
@ 2007-03-07  8:53               ` Ingo Molnar
  -1 siblings, 0 replies; 198+ messages in thread
From: Ingo Molnar @ 2007-03-07  8:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nick Piggin, Linux Memory Management, Linux Kernel,
	Benjamin Herrenschmidt, Paolo 'Blaisorblade' Giarrusso


* Andrew Morton <akpm@linux-foundation.org> wrote:

> > btw., if we decide that nonlinear isn't worth the continuing 
> > maintenance pain, we could internally implement/emulate 
> > sys_remap_file_pages() via a call to mremap() and essentially 
> > deprecate it, without breaking the ABI - and remove all the 
> > nonlinear code. (This would split fremap areas into separate vmas)
> > 
> 
> I'm rather regretting having merged it - I don't think it has been 
> used for much.
> 
> Paolo's UML speedup patches might use nonlinear though.

yes, i wrote the first, prototype version of that for UML, it needs an 
extended version of the syscall, sys_remap_file_pages_prot():

 http://redhat.com/~mingo/remap-file-pages-patches/remap-file-pages-prot-2.6.4-rc1-mm1-A1

i also wrote an x86 hypervisor kind of thing for UML, called 
'sys_vcpu()', which allows UML to execute guest user-mode in a box, 
which also relies on sys_remap_file_pages_prot():

 http://redhat.com/~mingo/remap-file-pages-patches/vcpu-2.6.4-rc2-mm1-A2

which reduced the UML guest syscall overhead from 30 usecs to 4 usecs 
(with native syscalls taking 2 usecs, on the box i tested, years ago).

So it certainly looked useful to me - but wasn't really picked up widely. 

We'll always have the option to get rid of it (and hence completely 
reverse the decision to merge it) without breaking the ABI, by emulating 
the API via mremap(). That eliminates the UML speedup though. So no need 
to feel sorry about having merged it, we can easily revisit that 
years-old 'do we want it' decision, without any ABI worries.

	Ingo
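
For reference, the userspace-visible shape of the two calls; the _prot prototype
below is paraphrased from the patches linked above and should be read as an
assumption rather than something confirmed in this thread:

#include <sys/mman.h>

/* existing syscall: prot must be 0, the vma's protection is reused */
int remap_file_pages(void *addr, size_t size, int prot,
		     size_t pgoff, int flags);

/*
 * proposed extension (hypothetical prototype): honours a real prot value,
 * so a single host vma can hold pages with differing protections - which
 * is what makes it interesting for UML's guest address-space handling,
 * instead of creating a vma per guest page.
 */
int remap_file_pages_prot(void *addr, size_t size, int prot,
			  size_t pgoff, int flags);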

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
@ 2007-03-07  8:53               ` Ingo Molnar
  0 siblings, 0 replies; 198+ messages in thread
From: Ingo Molnar @ 2007-03-07  8:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nick Piggin, Linux Memory Management, Linux Kernel,
	Benjamin Herrenschmidt, Paolo 'Blaisorblade' Giarrusso

* Andrew Morton <akpm@linux-foundation.org> wrote:

> > btw., if we decide that nonlinear isn't worth the continuing 
> > maintenance pain, we could internally implement/emulate 
> > sys_remap_file_pages() via a call to mremap() and essentially 
> > deprecate it, without breaking the ABI - and remove all the 
> > nonlinear code. (This would split fremap areas into separate vmas)
> > 
> 
> I'm rather regretting having merged it - I don't think it has been 
> used for much.
> 
> Paolo's UML speedup patches might use nonlinear though.

yes, i wrote the first, prototype version of that for UML, it needs an 
extended version of the syscall, sys_remap_file_pages_prot():

 http://redhat.com/~mingo/remap-file-pages-patches/remap-file-pages-prot-2.6.4-rc1-mm1-A1

i also wrote an x86 hypervisor kind of thing for UML, called 
'sys_vcpu()', which allows UML to execute guest user-mode in a box, 
which also relies on sys_remap_file_pages_prot():

 http://redhat.com/~mingo/remap-file-pages-patches/vcpu-2.6.4-rc2-mm1-A2

which reduced the UML guest syscall overhead from 30 usecs to 4 usecs 
(with native syscalls taking 2 usecs, on the box i tested, years ago).

So it certainly looked useful to me - but wasn't really picked up widely. 

We'll always have the option to get rid of it (and hence completely 
reverse the decision to merge it) without breaking the ABI, by emulating 
the API via mremap(). That eliminates the UML speedup though. So no need 
to feel sorry about having merged it, we can easily revisit that 
years-old 'do we want it' decision, without any ABI worries.

	Ingo

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-03-07  8:27           ` Ingo Molnar
@ 2007-03-07  8:59             ` Nick Piggin
  -1 siblings, 0 replies; 198+ messages in thread
From: Nick Piggin @ 2007-03-07  8:59 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, Linux Memory Management, Linux Kernel,
	Benjamin Herrenschmidt

On Wed, Mar 07, 2007 at 09:27:55AM +0100, Ingo Molnar wrote:
> 
> * Nick Piggin <npiggin@suse.de> wrote:
> 
> > Then 4,5,6 is the fault/nonlinear rewrite, take it or leave it. I 
> > thought you would have liked the patches...
> 
> btw., if we decide that nonlinear isn't worth the continuing maintenance 
> pain, we could internally implement/emulate sys_remap_file_pages() via a 
> call to mremap() and essentially deprecate it, without breaking the ABI 
> - and remove all the nonlinear code. (This would split fremap areas into 
> separate vmas)

Well I think it has a few possible uses outside the PAE database
workloads. UML for one seems to be interested... as much as I don't
use them, I think nonlinear mappings are kinda cool ;)

After these patches, I don't think there is too much burden. The main
thing left really is just the objrmap stuff, but that is just handled
with a minimal 'dumb' algorithm that doesn't cost much.

Then the core of it is just the file pte handling, which really doesn't
seem to be much problem.

Apart from a handful of trivial if (pte_file()) cases throughout mm/,
our maintenance burden basically now amounts to the following patch.
Even the rmap.c change looks bigger than it is because I split out
the nonlinear unmapping code from try_to_unmap_file. Not too bad, eh? :)

--

 include/asm-powerpc/pgtable.h |   12 ++++
 mm/Kconfig                    |    6 ++
 mm/Makefile                   |    6 +-
 mm/rmap.c                     |  101 +++++++++++++++++++++++++-----------------
 4 files changed, 83 insertions(+), 42 deletions(-)

Index: linux-2.6/include/asm-powerpc/pgtable.h
===================================================================
--- linux-2.6.orig/include/asm-powerpc/pgtable.h
+++ linux-2.6/include/asm-powerpc/pgtable.h
@@ -243,7 +243,12 @@ static inline int pte_write(pte_t pte) {
 static inline int pte_exec(pte_t pte)  { return pte_val(pte) & _PAGE_EXEC;}
 static inline int pte_dirty(pte_t pte) { return pte_val(pte) & _PAGE_DIRTY;}
 static inline int pte_young(pte_t pte) { return pte_val(pte) & _PAGE_ACCESSED;}
+
+#ifdef CONFIG_NONLINEAR
 static inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE;}
+#else
+static inline int pte_file(pte_t pte) { return 0; }
+#endif
 
 static inline void pte_uncache(pte_t pte) { pte_val(pte) |= _PAGE_NO_CACHE; }
 static inline void pte_cache(pte_t pte)   { pte_val(pte) &= ~_PAGE_NO_CACHE; }
@@ -483,9 +488,16 @@ extern void update_mmu_cache(struct vm_a
 #define __swp_entry(type, offset) ((swp_entry_t){((type)<< 1)|((offset)<<8)})
 #define __pte_to_swp_entry(pte)	((swp_entry_t){pte_val(pte) >> PTE_RPN_SHIFT})
 #define __swp_entry_to_pte(x)	((pte_t) { (x).val << PTE_RPN_SHIFT })
+
+#ifdef CONFIG_NONLINEAR
 #define pte_to_pgoff(pte)	(pte_val(pte) >> PTE_RPN_SHIFT)
 #define pgoff_to_pte(off)	((pte_t) {((off) << PTE_RPN_SHIFT)|_PAGE_FILE})
 #define PTE_FILE_MAX_BITS	(BITS_PER_LONG - PTE_RPN_SHIFT)
+#else
+#define pte_to_pgoff(pte)	({BUG(); -1;})
+#define pgoff_to_pte(off)	({BUG(); (pte_t){-1};})
+#define PTE_FILE_MAX_BITS	0
+#endif
 
 /*
  * kern_addr_valid is intended to indicate whether an address is a valid
Index: linux-2.6/mm/Kconfig
===================================================================
--- linux-2.6.orig/mm/Kconfig
+++ linux-2.6/mm/Kconfig
@@ -142,6 +142,12 @@ config SPLIT_PTLOCK_CPUS
 #
 # support for page migration
 #
+config NONLINEAR
+	bool "Non linear mappings"
+	def_bool y
+	help
+	  Provides support for the remap_file_pages syscall.
+
 config MIGRATION
 	bool "Page migration"
 	def_bool y
Index: linux-2.6/mm/Makefile
===================================================================
--- linux-2.6.orig/mm/Makefile
+++ linux-2.6/mm/Makefile
@@ -3,9 +3,8 @@
 #
 
 mmu-y			:= nommu.o
-mmu-$(CONFIG_MMU)	:= fremap.o highmem.o madvise.o memory.o mincore.o \
-			   mlock.o mmap.o mprotect.o mremap.o msync.o rmap.o \
-			   vmalloc.o
+mmu-$(CONFIG_MMU)	:= highmem.o madvise.o memory.o mincore.o mlock.o \
+			   mmap.o mprotect.o mremap.o msync.o rmap.o vmalloc.o
 
 obj-y			:= bootmem.o filemap.o mempool.o oom_kill.o fadvise.o \
 			   page_alloc.o page-writeback.o pdflush.o \
@@ -27,5 +26,6 @@ obj-$(CONFIG_SLOB) += slob.o
 obj-$(CONFIG_SLAB) += slab.o
 obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
 obj-$(CONFIG_FS_XIP) += filemap_xip.o
+obj-$(CONFIG_NONLINEAR) += fremap.o
 obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_SMP) += allocpercpu.o
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c
+++ linux-2.6/mm/rmap.c
@@ -756,6 +756,7 @@ out:
 	return ret;
 }
 
+#ifdef CONFIG_NONLINEAR
 /*
  * objrmap doesn't work for nonlinear VMAs because the assumption that
  * offset-into-file correlates with offset-into-virtual-addresses does not hold.
@@ -845,53 +846,18 @@ static void try_to_unmap_cluster(unsigne
 	pte_unmap_unlock(pte - 1, ptl);
 }
 
-static int try_to_unmap_anon(struct page *page, int migration)
-{
-	struct anon_vma *anon_vma;
-	struct vm_area_struct *vma;
-	int ret = SWAP_AGAIN;
-
-	anon_vma = page_lock_anon_vma(page);
-	if (!anon_vma)
-		return ret;
-
-	list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
-		ret = try_to_unmap_one(page, vma, migration);
-		if (ret == SWAP_FAIL || !page_mapped(page))
-			break;
-	}
-
-	page_unlock_anon_vma(anon_vma);
-	return ret;
-}
-
-/**
- * try_to_unmap_file - unmap file page using the object-based rmap method
- * @page: the page to unmap
- *
- * Find all the mappings of a page using the mapping pointer and the vma chains
- * contained in the address_space struct it points to.
- *
- * This function is only called from try_to_unmap for object-based pages.
+/*
+ * Called with page->mapping->i_mmap_lock held.
  */
-static int try_to_unmap_file(struct page *page, int migration)
+static int try_to_unmap_file_nonlinear(struct page *page, int migration)
 {
 	struct address_space *mapping = page->mapping;
-	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
 	struct vm_area_struct *vma;
-	struct prio_tree_iter iter;
-	int ret = SWAP_AGAIN;
 	unsigned long cursor;
 	unsigned long max_nl_cursor = 0;
 	unsigned long max_nl_size = 0;
 	unsigned int mapcount;
-
-	spin_lock(&mapping->i_mmap_lock);
-	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
-		ret = try_to_unmap_one(page, vma, migration);
-		if (ret == SWAP_FAIL || !page_mapped(page))
-			goto out;
-	}
+	int ret = SWAP_AGAIN;
 
 	if (list_empty(&mapping->i_mmap_nonlinear))
 		goto out;
@@ -956,6 +922,63 @@ static int try_to_unmap_file(struct page
 	 */
 	list_for_each_entry(vma, &mapping->i_mmap_nonlinear, shared.vm_set.list)
 		vma->vm_private_data = NULL;
+
+out:
+	return ret;
+}
+
+#else /* CONFIG_NONLINEAR */
+static int try_to_unmap_file_nonlinear(struct page *page, int migration)
+{
+	return SWAP_AGAIN;
+}
+#endif
+
+static int try_to_unmap_anon(struct page *page, int migration)
+{
+	struct anon_vma *anon_vma;
+	struct vm_area_struct *vma;
+	int ret = SWAP_AGAIN;
+
+	anon_vma = page_lock_anon_vma(page);
+	if (!anon_vma)
+		return ret;
+
+	list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
+		ret = try_to_unmap_one(page, vma, migration);
+		if (ret == SWAP_FAIL || !page_mapped(page))
+			break;
+	}
+
+	page_unlock_anon_vma(anon_vma);
+	return ret;
+}
+
+/**
+ * try_to_unmap_file - unmap file page using the object-based rmap method
+ * @page: the page to unmap
+ *
+ * Find all the mappings of a page using the mapping pointer and the vma chains
+ * contained in the address_space struct it points to.
+ *
+ * This function is only called from try_to_unmap for object-based pages.
+ */
+static int try_to_unmap_file(struct page *page, int migration)
+{
+	struct address_space *mapping = page->mapping;
+	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	struct vm_area_struct *vma;
+	struct prio_tree_iter iter;
+	int ret = SWAP_AGAIN;
+
+	spin_lock(&mapping->i_mmap_lock);
+	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
+		ret = try_to_unmap_one(page, vma, migration);
+		if (ret == SWAP_FAIL || !page_mapped(page))
+			goto out;
+	}
+
+	ret = try_to_unmap_file_nonlinear(page, migration);
 out:
 	spin_unlock(&mapping->i_mmap_lock);
 	return ret;

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
@ 2007-03-07  8:59             ` Nick Piggin
  0 siblings, 0 replies; 198+ messages in thread
From: Nick Piggin @ 2007-03-07  8:59 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, Linux Memory Management, Linux Kernel,
	Benjamin Herrenschmidt

On Wed, Mar 07, 2007 at 09:27:55AM +0100, Ingo Molnar wrote:
> 
> * Nick Piggin <npiggin@suse.de> wrote:
> 
> > Then 4,5,6 is the fault/nonlinear rewrite, take it or leave it. I 
> > thought you would have liked the patches...
> 
> btw., if we decide that nonlinear isn't worth the continuing maintenance 
> pain, we could internally implement/emulate sys_remap_file_pages() via a 
> call to mremap() and essentially deprecate it, without breaking the ABI 
> - and remove all the nonlinear code. (This would split fremap areas into 
> separate vmas)

Well I think it has a few possible uses outside the PAE database
workloads. UML for one seems to be interested... as much as I don't
use them, I think nonlinear mappings are kinda cool ;)

After these patches, I don't think there is too much burden. The main
thing left really is just the objrmap stuff, but that is just handled
with a minimal 'dumb' algorithm that doesn't cost much.

Then the core of it is just the file pte handling, which really doesn't
seem to be much problem.

Apart from a handful of trivial if (pte_file()) cases throughout mm/,
our maintenance burden basically now amounts to the following patch.
Even the rmap.c change looks bigger than it is because I split out
the nonlinear unmapping code from try_to_unmap_file. Not too bad, eh? :)

--

 include/asm-powerpc/pgtable.h |   12 ++++
 mm/Kconfig                    |    6 ++
 mm/Makefile                   |    6 +-
 mm/rmap.c                     |  101 +++++++++++++++++++++++++-----------------
 4 files changed, 83 insertions(+), 42 deletions(-)

Index: linux-2.6/include/asm-powerpc/pgtable.h
===================================================================
--- linux-2.6.orig/include/asm-powerpc/pgtable.h
+++ linux-2.6/include/asm-powerpc/pgtable.h
@@ -243,7 +243,12 @@ static inline int pte_write(pte_t pte) {
 static inline int pte_exec(pte_t pte)  { return pte_val(pte) & _PAGE_EXEC;}
 static inline int pte_dirty(pte_t pte) { return pte_val(pte) & _PAGE_DIRTY;}
 static inline int pte_young(pte_t pte) { return pte_val(pte) & _PAGE_ACCESSED;}
+
+#ifdef CONFIG_NONLINEAR
 static inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE;}
+#else
+static inline int pte_file(pte_t pte) { return 0; }
+#endif
 
 static inline void pte_uncache(pte_t pte) { pte_val(pte) |= _PAGE_NO_CACHE; }
 static inline void pte_cache(pte_t pte)   { pte_val(pte) &= ~_PAGE_NO_CACHE; }
@@ -483,9 +488,16 @@ extern void update_mmu_cache(struct vm_a
 #define __swp_entry(type, offset) ((swp_entry_t){((type)<< 1)|((offset)<<8)})
 #define __pte_to_swp_entry(pte)	((swp_entry_t){pte_val(pte) >> PTE_RPN_SHIFT})
 #define __swp_entry_to_pte(x)	((pte_t) { (x).val << PTE_RPN_SHIFT })
+
+#ifdef CONFIG_NONLINEAR
 #define pte_to_pgoff(pte)	(pte_val(pte) >> PTE_RPN_SHIFT)
 #define pgoff_to_pte(off)	((pte_t) {((off) << PTE_RPN_SHIFT)|_PAGE_FILE})
 #define PTE_FILE_MAX_BITS	(BITS_PER_LONG - PTE_RPN_SHIFT)
+#else
+#define pte_to_pgoff(pte)	({BUG(); -1;})
+#define pgoff_to_pte(off)	({BUG(); (pte_t){-1};})
+#define PTE_FILE_MAX_BITS	0
+#endif
 
 /*
  * kern_addr_valid is intended to indicate whether an address is a valid
Index: linux-2.6/mm/Kconfig
===================================================================
--- linux-2.6.orig/mm/Kconfig
+++ linux-2.6/mm/Kconfig
@@ -142,6 +142,12 @@ config SPLIT_PTLOCK_CPUS
 #
 # support for page migration
 #
+config NONLINEAR
+	bool "Non linear mappings"
+	def_bool y
+	help
+	  Provides support for the remap_file_pages syscall.
+
 config MIGRATION
 	bool "Page migration"
 	def_bool y
Index: linux-2.6/mm/Makefile
===================================================================
--- linux-2.6.orig/mm/Makefile
+++ linux-2.6/mm/Makefile
@@ -3,9 +3,8 @@
 #
 
 mmu-y			:= nommu.o
-mmu-$(CONFIG_MMU)	:= fremap.o highmem.o madvise.o memory.o mincore.o \
-			   mlock.o mmap.o mprotect.o mremap.o msync.o rmap.o \
-			   vmalloc.o
+mmu-$(CONFIG_MMU)	:= highmem.o madvise.o memory.o mincore.o mlock.o \
+			   mmap.o mprotect.o mremap.o msync.o rmap.o vmalloc.o
 
 obj-y			:= bootmem.o filemap.o mempool.o oom_kill.o fadvise.o \
 			   page_alloc.o page-writeback.o pdflush.o \
@@ -27,5 +26,6 @@ obj-$(CONFIG_SLOB) += slob.o
 obj-$(CONFIG_SLAB) += slab.o
 obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
 obj-$(CONFIG_FS_XIP) += filemap_xip.o
+obj-$(CONFIG_NONLINEAR) += fremap.o
 obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_SMP) += allocpercpu.o
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c
+++ linux-2.6/mm/rmap.c
@@ -756,6 +756,7 @@ out:
 	return ret;
 }
 
+#ifdef CONFIG_NONLINEAR
 /*
  * objrmap doesn't work for nonlinear VMAs because the assumption that
  * offset-into-file correlates with offset-into-virtual-addresses does not hold.
@@ -845,53 +846,18 @@ static void try_to_unmap_cluster(unsigne
 	pte_unmap_unlock(pte - 1, ptl);
 }
 
-static int try_to_unmap_anon(struct page *page, int migration)
-{
-	struct anon_vma *anon_vma;
-	struct vm_area_struct *vma;
-	int ret = SWAP_AGAIN;
-
-	anon_vma = page_lock_anon_vma(page);
-	if (!anon_vma)
-		return ret;
-
-	list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
-		ret = try_to_unmap_one(page, vma, migration);
-		if (ret == SWAP_FAIL || !page_mapped(page))
-			break;
-	}
-
-	page_unlock_anon_vma(anon_vma);
-	return ret;
-}
-
-/**
- * try_to_unmap_file - unmap file page using the object-based rmap method
- * @page: the page to unmap
- *
- * Find all the mappings of a page using the mapping pointer and the vma chains
- * contained in the address_space struct it points to.
- *
- * This function is only called from try_to_unmap for object-based pages.
+/*
+ * Called with page->mapping->i_mmap_lock held.
  */
-static int try_to_unmap_file(struct page *page, int migration)
+static int try_to_unmap_file_nonlinear(struct page *page, int migration)
 {
 	struct address_space *mapping = page->mapping;
-	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
 	struct vm_area_struct *vma;
-	struct prio_tree_iter iter;
-	int ret = SWAP_AGAIN;
 	unsigned long cursor;
 	unsigned long max_nl_cursor = 0;
 	unsigned long max_nl_size = 0;
 	unsigned int mapcount;
-
-	spin_lock(&mapping->i_mmap_lock);
-	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
-		ret = try_to_unmap_one(page, vma, migration);
-		if (ret == SWAP_FAIL || !page_mapped(page))
-			goto out;
-	}
+	int ret = SWAP_AGAIN;
 
 	if (list_empty(&mapping->i_mmap_nonlinear))
 		goto out;
@@ -956,6 +922,63 @@ static int try_to_unmap_file(struct page
 	 */
 	list_for_each_entry(vma, &mapping->i_mmap_nonlinear, shared.vm_set.list)
 		vma->vm_private_data = NULL;
+
+out:
+	return ret;
+}
+
+#else /* CONFIG_NONLINEAR */
+static int try_to_unmap_file_nonlinear(struct page *page, int migration)
+{
+	return SWAP_AGAIN;
+}
+#endif
+
+static int try_to_unmap_anon(struct page *page, int migration)
+{
+	struct anon_vma *anon_vma;
+	struct vm_area_struct *vma;
+	int ret = SWAP_AGAIN;
+
+	anon_vma = page_lock_anon_vma(page);
+	if (!anon_vma)
+		return ret;
+
+	list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
+		ret = try_to_unmap_one(page, vma, migration);
+		if (ret == SWAP_FAIL || !page_mapped(page))
+			break;
+	}
+
+	page_unlock_anon_vma(anon_vma);
+	return ret;
+}
+
+/**
+ * try_to_unmap_file - unmap file page using the object-based rmap method
+ * @page: the page to unmap
+ *
+ * Find all the mappings of a page using the mapping pointer and the vma chains
+ * contained in the address_space struct it points to.
+ *
+ * This function is only called from try_to_unmap for object-based pages.
+ */
+static int try_to_unmap_file(struct page *page, int migration)
+{
+	struct address_space *mapping = page->mapping;
+	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	struct vm_area_struct *vma;
+	struct prio_tree_iter iter;
+	int ret = SWAP_AGAIN;
+
+	spin_lock(&mapping->i_mmap_lock);
+	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
+		ret = try_to_unmap_one(page, vma, migration);
+		if (ret == SWAP_FAIL || !page_mapped(page))
+			goto out;
+	}
+
+	ret = try_to_unmap_file_nonlinear(page, migration);
 out:
 	spin_unlock(&mapping->i_mmap_lock);
 	return ret;

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-03-07  8:51                 ` Miklos Szeredi
@ 2007-03-07  9:07                   ` Andrew Morton
  -1 siblings, 0 replies; 198+ messages in thread
From: Andrew Morton @ 2007-03-07  9:07 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: mingo, npiggin, linux-mm, linux-kernel, benh, Peter Zijlstra

On Wed, 07 Mar 2007 09:51:57 +0100 Miklos Szeredi <miklos@szeredi.hu> wrote:

> > > Dirty page accounting doesn't work either on
> > > non-linear mappings
> > 
> > It doesn't?  Confused - these things don't have anything to do with each
> > other do they?
> 
> Look in page_mkclean().  Where does it handle non-linear mappings?
> 

OK, I'd forgotten about that.  It won't break dirty memory accounting,
but it'll potentially break dirty memory balancing.

If we have the wrong page (due to nonlinear), page_check_address() will
fail and we'll leave the pte dirty.  That puts us back to the pre-2.6.17
algorithms and I guess it'll break the msync guarantees.

Peter, I thought we went through the nonlinear problem ages ago and decided
it was OK?
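
(For reference, an approximate sketch - not a verbatim quote - of the address
computation that page_check_address() depends on: the linear page->index
arithmetic has no answer for a VM_NONLINEAR vma, so the lookup gives up and the
dirty pte is never reached.)

/* approximate sketch of vma_address() from mm/rmap.c */
static unsigned long vma_address(struct page *page, struct vm_area_struct *vma)
{
	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
	unsigned long address;

	address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
	if (address < vma->vm_start || address >= vma->vm_end) {
		/*
		 * Nonlinear vmas land here: page->index says nothing about
		 * where (or how many times) the page is really mapped.
		 */
		return -EFAULT;
	}
	return address;
}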

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
@ 2007-03-07  9:07                   ` Andrew Morton
  0 siblings, 0 replies; 198+ messages in thread
From: Andrew Morton @ 2007-03-07  9:07 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: mingo, npiggin, linux-mm, linux-kernel, benh, Peter Zijlstra

On Wed, 07 Mar 2007 09:51:57 +0100 Miklos Szeredi <miklos@szeredi.hu> wrote:

> > > Dirty page accounting doesn't work either on
> > > non-linear mappings
> > 
> > It doesn't?  Confused - these things don't have anything to do with each
> > other do they?
> 
> Look in page_mkclean().  Where does it handle non-linear mappings?
> 

OK, I'd forgotten about that.  It won't break dirty memory accounting,
but it'll potentially break dirty memory balancing.

If we have the wrong page (due to nonlinear), page_check_address() will
fail and we'll leave the pte dirty.  That puts us back to the pre-2.6.17
algorithms and I guess it'll break the msync guarantees.

Peter, I thought we went through the nonlinear problem ages ago and decided
it was OK?

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-03-07  8:59             ` Nick Piggin
@ 2007-03-07  9:11               ` Nick Piggin
  -1 siblings, 0 replies; 198+ messages in thread
From: Nick Piggin @ 2007-03-07  9:11 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, Linux Memory Management, Linux Kernel,
	Benjamin Herrenschmidt

On Wed, Mar 07, 2007 at 09:59:44AM +0100, Nick Piggin wrote:
> Apart from a handful of trivial if (pte_file()) cases throughout mm/,
> our maintenance burden basically now amounts to the following patch.
> Even the rmap.c change looks bigger than it is because I split out
> the nonlinear unmapping code from try_to_unmap_file. Not too bad, eh? :)

Oh, there is a bit more nonlinear mmap list manipulation I'd forgotten
about too... makes things a little bit worse, but not too much.

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
@ 2007-03-07  9:11               ` Nick Piggin
  0 siblings, 0 replies; 198+ messages in thread
From: Nick Piggin @ 2007-03-07  9:11 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, Linux Memory Management, Linux Kernel,
	Benjamin Herrenschmidt

On Wed, Mar 07, 2007 at 09:59:44AM +0100, Nick Piggin wrote:
> Apart from a handful of trivial if (pte_file()) cases throughout mm/,
> our maintenance burden basically now amounts to the following patch.
> Even the rmap.c change looks bigger than it is because I split out
> the nonlinear unmapping code from try_to_unmap_file. Not too bad, eh? :)

Oh, there is a bit more nonlinear mmap list manipulation I'd forgotten
about too... makes things a little bit worse, but not too much.

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-03-07  9:07                   ` Andrew Morton
@ 2007-03-07  9:18                     ` Nick Piggin
  -1 siblings, 0 replies; 198+ messages in thread
From: Nick Piggin @ 2007-03-07  9:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Miklos Szeredi, mingo, linux-mm, linux-kernel, benh, Peter Zijlstra

On Wed, Mar 07, 2007 at 01:07:56AM -0800, Andrew Morton wrote:
> On Wed, 07 Mar 2007 09:51:57 +0100 Miklos Szeredi <miklos@szeredi.hu> wrote:
> 
> > > > Dirty page accounting doesn't work either on
> > > > non-linear mappings
> > > 
> > > It doesn't?  Confused - these things don't have anything to do with each
> > > other do they?
> > 
> > Look in page_mkclean().  Where does it handle non-linear mappings?
> > 
> 
> OK, I'd forgotten about that.  It won't break dirty memory accounting,
> but it'll potentially break dirty memory balancing.
> 
> If we have the wrong page (due to nonlinear), page_check_address() will
> fail and we'll leave the pte dirty.  That puts us back to the pre-2.6.17
> algorithms and I guess it'll break the msync guarantees.
> 
> Peter, I thought we went through the nonlinear problem ages ago and decided
> it was OK?

msync breakage is bad, but otherwise I don't know that we care about
dirty page writeout efficiency.

But I think we discovered that those msync changes are bogus anyway
because there is a small race window where pte could be dirtied without
page being set dirty?



^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
@ 2007-03-07  9:18                     ` Nick Piggin
  0 siblings, 0 replies; 198+ messages in thread
From: Nick Piggin @ 2007-03-07  9:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Miklos Szeredi, mingo, linux-mm, linux-kernel, benh, Peter Zijlstra

On Wed, Mar 07, 2007 at 01:07:56AM -0800, Andrew Morton wrote:
> On Wed, 07 Mar 2007 09:51:57 +0100 Miklos Szeredi <miklos@szeredi.hu> wrote:
> 
> > > > Dirty page accounting doesn't work either on
> > > > non-linear mappings
> > > 
> > > It doesn't?  Confused - these things don't have anything to do with each
> > > other do they?
> > 
> > Look in page_mkclean().  Where does it handle non-linear mappings?
> > 
> 
> OK, I'd forgotten about that.  It won't break dirty memory accounting,
> but it'll potentially break dirty memory balancing.
> 
> If we have the wrong page (due to nonlinear), page_check_address() will
> fail and we'll leave the pte dirty.  That puts us back to the pre-2.6.17
> algorithms and I guess it'll break the msync guarantees.
> 
> Peter, I thought we went through the nonlinear problem ages ago and decided
> it was OK?

msync breakage is bad, but otherwise I don't know that we care about
dirty page writeout efficiency.

But I think we discovered that those msync changes are bogus anyway
because there is a small race window where pte could be dirtied without
page being set dirty?


^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-03-07  8:59             ` Nick Piggin
@ 2007-03-07  9:22               ` Ingo Molnar
  -1 siblings, 0 replies; 198+ messages in thread
From: Ingo Molnar @ 2007-03-07  9:22 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Linux Memory Management, Linux Kernel,
	Benjamin Herrenschmidt


* Nick Piggin <npiggin@suse.de> wrote:

> After these patches, I don't think there is too much burden. The main 
> thing left really is just the objrmap stuff, but that is just handled 
> with a minimal 'dumb' algorithm that doesn't cost much.

ok. What do you think about the sys_remap_file_pages_prot() thing that 
Paolo has done in a nicely split up form - does that complicate things 
in any fundamental way? That is what is useful to UML.

	Ingo

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
@ 2007-03-07  9:22               ` Ingo Molnar
  0 siblings, 0 replies; 198+ messages in thread
From: Ingo Molnar @ 2007-03-07  9:22 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Linux Memory Management, Linux Kernel,
	Benjamin Herrenschmidt

* Nick Piggin <npiggin@suse.de> wrote:

> After these patches, I don't think there is too much burden. The main 
> thing left really is just the objrmap stuff, but that is just handled 
> with a minimal 'dumb' algorithm that doesn't cost much.

ok. What do you think about the sys_remap_file_pages_prot() thing that 
Paolo has done in a nicely split up form - does that complicate things 
in any fundamental way? That is what is useful to UML.

	Ingo

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-03-07  9:07                   ` Andrew Morton
@ 2007-03-07  9:25                     ` Miklos Szeredi
  -1 siblings, 0 replies; 198+ messages in thread
From: Miklos Szeredi @ 2007-03-07  9:25 UTC (permalink / raw)
  To: akpm; +Cc: mingo, npiggin, linux-mm, linux-kernel, benh, a.p.zijlstra

> > 
> > Look in page_mkclean().  Where does it handle non-linear mappings?
> > 
> 
> OK, I'd forgotten about that.  It won't break dirty memory accounting,
> but it'll potentially break dirty memory balancing.
> 
> If we have the wrong page (due to nonlinear), page_check_address() will
> fail and we'll leave the pte dirty.

It won't even get that far, because it only looks at vmas on
mapping->i_mmap, and not on i_mmap_nonlinear.

Miklos

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
@ 2007-03-07  9:25                     ` Miklos Szeredi
  0 siblings, 0 replies; 198+ messages in thread
From: Miklos Szeredi @ 2007-03-07  9:25 UTC (permalink / raw)
  To: akpm; +Cc: mingo, npiggin, linux-mm, linux-kernel, benh, a.p.zijlstra

> > 
> > Look in page_mkclean().  Where does it handle non-linear mappings?
> > 
> 
> OK, I'd forgotten about that.  It won't break dirty memory accounting,
> but it'll potentially break dirty memory balancing.
> 
> If we have the wrong page (due to nonlinear), page_check_address() will
> fail and we'll leave the pte dirty.

It won't even get that far, because it only looks at vmas on
mapping->i_mmap, and not on i_mmap_nonlinear.

Miklos

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-03-07  9:18                     ` Nick Piggin
@ 2007-03-07  9:26                       ` Andrew Morton
  -1 siblings, 0 replies; 198+ messages in thread
From: Andrew Morton @ 2007-03-07  9:26 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Miklos Szeredi, mingo, linux-mm, linux-kernel, benh, Peter Zijlstra

On Wed, 7 Mar 2007 10:18:23 +0100 Nick Piggin <npiggin@suse.de> wrote:

> On Wed, Mar 07, 2007 at 01:07:56AM -0800, Andrew Morton wrote:
> > On Wed, 07 Mar 2007 09:51:57 +0100 Miklos Szeredi <miklos@szeredi.hu> wrote:
> > 
> > > > > Dirty page accounting doesn't work either on
> > > > > non-linear mappings
> > > > 
> > > > It doesn't?  Confused - these things don't have anything to do with each
> > > > other do they?
> > > 
> > > Look in page_mkclean().  Where does it handle non-linear mappings?
> > > 
> > 
> > OK, I'd forgotten about that.  It won't break dirty memory accounting,
> > but it'll potentially break dirty memory balancing.
> > 
> > If we have the wrong page (due to nonlinear), page_check_address() will
> > fail and we'll leave the pte dirty.  That puts us back to the pre-2.6.17
> > algorithms and I guess it'll break the msync guarantees.
> > 
> > Peter, I thought we went through the nonlinear problem ages ago and decided
> > it was OK?
> 
> msync breakage is bad, but otherwise I don't know that we care about
> dirty page writeout efficiency.

Well.  We made so many changes to support the synchronous
dirty-the-page-when-we-dirty-the-pte thing that I'm rather doubtful that
the old-style approach still works.  It might seem to, most of the time. 
But if it _is_ subtly broken, boy it's going to take a long time for us to
find out.

> But I think we discovered that those msync changes are bogus anyway
> because there is a small race window where pte could be dirtied without
> page being set dirty?

Dunno, I don't recall that.  We dirty the page before the pte...

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
@ 2007-03-07  9:26                       ` Andrew Morton
  0 siblings, 0 replies; 198+ messages in thread
From: Andrew Morton @ 2007-03-07  9:26 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Miklos Szeredi, mingo, linux-mm, linux-kernel, benh, Peter Zijlstra

On Wed, 7 Mar 2007 10:18:23 +0100 Nick Piggin <npiggin@suse.de> wrote:

> On Wed, Mar 07, 2007 at 01:07:56AM -0800, Andrew Morton wrote:
> > On Wed, 07 Mar 2007 09:51:57 +0100 Miklos Szeredi <miklos@szeredi.hu> wrote:
> > 
> > > > > Dirty page accounting doesn't work either on
> > > > > non-linear mappings
> > > > 
> > > > It doesn't?  Confused - these things don't have anything to do with each
> > > > other do they?
> > > 
> > > Look in page_mkclean().  Where does it handle non-linear mappings?
> > > 
> > 
> > OK, I'd forgotten about that.  It won't break dirty memory accounting,
> > but it'll potentially break dirty memory balancing.
> > 
> > If we have the wrong page (due to nonlinear), page_check_address() will
> > fail and we'll leave the pte dirty.  That puts us back to the pre-2.6.17
> > algorithms and I guess it'll break the msync guarantees.
> > 
> > Peter, I thought we went through the nonlinear problem ages ago and decided
> > it was OK?
> 
> msync breakage is bad, but otherwise I don't know that we care about
> dirty page writeout efficiency.

Well.  We made so many changes to support the synchronous
dirty-the-page-when-we-dirty-the-pte thing that I'm rather doubtful that
the old-style approach still works.  It might seem to, most of the time. 
But if it _is_ subtly broken, boy it's going to take a long time for us to
find out.

> But I think we discovered that those msync changes are bogus anyway
> because there is a small race window where pte could be dirtied without
> page being set dirty?

Dunno, I don't recall that.  We dirty the page before the pte...

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-03-07  8:53               ` Ingo Molnar
@ 2007-03-07  9:28                 ` Nick Piggin
  -1 siblings, 0 replies; 198+ messages in thread
From: Nick Piggin @ 2007-03-07  9:28 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, Linux Memory Management, Linux Kernel,
	Benjamin Herrenschmidt, Paolo 'Blaisorblade' Giarrusso

On Wed, Mar 07, 2007 at 09:53:23AM +0100, Ingo Molnar wrote:
> 
> * Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> > > btw., if we decide that nonlinear isn't worth the continuing 
> > > maintenance pain, we could internally implement/emulate 
> > > sys_remap_file_pages() via a call to mremap() and essentially 
> > > deprecate it, without breaking the ABI - and remove all the 
> > > nonlinear code. (This would split fremap areas into separate vmas)
> > > 
> > 
> > I'm rather regretting having merged it - I don't think it has been 
> > used for much.
> > 
> > Paolo's UML speedup patches might use nonlinear though.
> 
> yes, i wrote the first, prototype version of that for UML, it needs an 
> extended version of the syscall, sys_remap_file_pages_prot():
> 
>  http://redhat.com/~mingo/remap-file-pages-patches/remap-file-pages-prot-2.6.4-rc1-mm1-A1
> 
> i also wrote an x86 hypervisor kind of thing for UML, called 
> 'sys_vcpu()', which allows UML to execute guest user-mode in a box, 
> which also relies on sys_remap_file_pages_prot():
> 
>  http://redhat.com/~mingo/remap-file-pages-patches/vcpu-2.6.4-rc2-mm1-A2
> 
> which reduced the UML guest syscall overhead from 30 usecs to 4 usecs 
> (with native syscalls taking 2 usecs, on the box i tested, years ago).
> 
> So it certainly looked useful to me - but wasn't really picked up widely. 
> 
> We'll always have the option to get rid of it (and hence completely 
> reverse the decision to merge it) without breaking the ABI, by emulating 
> the API via mremap(). That eliminates the UML speedup though. So no need 
> to feel sorry about having merged it, we can easily revisit that 
> years-old 'do we want it' decision, without any ABI worries.

Depending on whether anyone wants it, and what features they want, we
could emulate the old syscall, and make a new restricted one which is
much less intrusive.

For example, if we can operate only on MAP_ANONYMOUS memory and specify
that nonlinear mappings effectively mlock the pages, then we can get
rid of all the objrmap and unmap_mapping_range handling, forget about
the writeout and msync problems...


^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
@ 2007-03-07  9:28                 ` Nick Piggin
  0 siblings, 0 replies; 198+ messages in thread
From: Nick Piggin @ 2007-03-07  9:28 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, Linux Memory Management, Linux Kernel,
	Benjamin Herrenschmidt, Paolo 'Blaisorblade' Giarrusso

On Wed, Mar 07, 2007 at 09:53:23AM +0100, Ingo Molnar wrote:
> 
> * Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> > > btw., if we decide that nonlinear isn't worth the continuing 
> > > maintenance pain, we could internally implement/emulate 
> > > sys_remap_file_pages() via a call to mremap() and essentially 
> > > deprecate it, without breaking the ABI - and remove all the 
> > > nonlinear code. (This would split fremap areas into separate vmas)
> > > 
> > 
> > I'm rather regretting having merged it - I don't think it has been 
> > used for much.
> > 
> > Paolo's UML speedup patches might use nonlinear though.
> 
> yes, i wrote the first, prototype version of that for UML, it needs an 
> extended version of the syscall, sys_remap_file_pages_prot():
> 
>  http://redhat.com/~mingo/remap-file-pages-patches/remap-file-pages-prot-2.6.4-rc1-mm1-A1
> 
> i also wrote an x86 hypervisor kind of thing for UML, called 
> 'sys_vcpu()', which allows UML to execute guest user-mode in a box, 
> which also relies on sys_remap_file_pages_prot():
> 
>  http://redhat.com/~mingo/remap-file-pages-patches/vcpu-2.6.4-rc2-mm1-A2
> 
> which reduced the UML guest syscall overhead from 30 usecs to 4 usecs 
> (with native syscalls taking 2 usecs, on the box i tested, years ago).
> 
> So it certainly looked useful to me - but wasn't really picked up widely. 
> 
> We'll always have the option to get rid of it (and hence completely 
> reverse the decision to merge it) without breaking the ABI, by emulating 
> the API via mremap(). That eliminates the UML speedup though. So no need 
> to feel sorry about having merged it, we can easily revisit that 
> years-old 'do we want it' decision, without any ABI worries.

Depending on whether anyone wants it, and what features they want, we
could emulate the old syscall, and make a new restricted one which is
much less intrusive.

For example, if we can operate only on MAP_ANONYMOUS memory and specify
that nonlinear mappings effectively mlock the pages, then we can get
rid of all the objrmap and unmap_mapping_range handling, forget about
the writeout and msync problems...

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-03-07  9:26                       ` Andrew Morton
@ 2007-03-07  9:28                         ` Miklos Szeredi
  -1 siblings, 0 replies; 198+ messages in thread
From: Miklos Szeredi @ 2007-03-07  9:28 UTC (permalink / raw)
  To: akpm; +Cc: npiggin, miklos, mingo, linux-mm, linux-kernel, benh, a.p.zijlstra

> > But I think we discovered that those msync changes are bogus anyway
> > because there is a small race window where pte could be dirtied without
> > page being set dirty?
> 
> Dunno, I don't recall that.  We dirty the page before the pte...

That's the one I just submitted a fix for ;)

  http://lkml.org/lkml/2007/3/6/308

Miklos

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
@ 2007-03-07  9:28                         ` Miklos Szeredi
  0 siblings, 0 replies; 198+ messages in thread
From: Miklos Szeredi @ 2007-03-07  9:28 UTC (permalink / raw)
  To: akpm; +Cc: npiggin, miklos, mingo, linux-mm, linux-kernel, benh, a.p.zijlstra

> > But I think we discovered that those msync changes are bogus anyway
> > because there is a small race window where pte could be dirtied without
> > page being set dirty?
> 
> Dunno, I don't recall that.  We dirty the page before the pte...

That's the one I just submitted a fix for ;)

  http://lkml.org/lkml/2007/3/6/308

Miklos

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-03-07  8:35             ` Andrew Morton
@ 2007-03-07  9:29               ` Bill Irwin
  -1 siblings, 0 replies; 198+ messages in thread
From: Bill Irwin @ 2007-03-07  9:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ingo Molnar, Nick Piggin, Linux Memory Management, Linux Kernel,
	Benjamin Herrenschmidt, Paolo 'Blaisorblade' Giarrusso

On Wed, 7 Mar 2007 09:27:55 +0100 Ingo Molnar <mingo@elte.hu> wrote:
>> btw., if we decide that nonlinear isn't worth the continuing maintenance 
>> pain, we could internally implement/emulate sys_remap_file_pages() via a 
>> call to mremap() and essentially deprecate it, without breaking the ABI 
>> - and remove all the nonlinear code. (This would split fremap areas into 
>> separate vmas)

On Wed, Mar 07, 2007 at 12:35:20AM -0800, Andrew Morton wrote:
> I'm rather regretting having merged it - I don't think it has been used for
> much.
> Paolo's UML speedup patches might use nonlinear though.

Guess what major real-life application not only uses nonlinear daily
but would even be very happy to see it extended with non-vma-creating
protections and more? It's not terribly typical for things to be
truncated while remap_file_pages() is doing its work, though it's been
proposed as a method of dynamism. It won't stress remap_file_pages() vs.
truncate() in any meaningful way, though, as userspace will be rather
diligent about clearing in-use data out of the file offset range to be
truncated away anyway, and all that via O_DIRECT.


-- wli
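
A minimal sketch of the usage pattern being alluded to (hypothetical helper
names, no error handling): map one large MAP_SHARED window once, then slide
individual pages of a much larger file in and out of it with
remap_file_pages(), so the buffer pool only ever needs a single vma.

#define _GNU_SOURCE
#include <sys/mman.h>
#include <unistd.h>

static void *window;		/* one big shared window over the data file */

static void setup_window(int fd, size_t window_bytes)
{
	window = mmap(NULL, window_bytes, PROT_READ | PROT_WRITE,
		      MAP_SHARED, fd, 0);
}

/* point page 'slot' of the window at file page 'file_pgoff' */
static void map_slot(size_t slot, size_t file_pgoff)
{
	long pagesize = sysconf(_SC_PAGESIZE);

	remap_file_pages((char *)window + slot * pagesize, pagesize,
			 0, file_pgoff, 0);
}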

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
@ 2007-03-07  9:29               ` Bill Irwin
  0 siblings, 0 replies; 198+ messages in thread
From: Bill Irwin @ 2007-03-07  9:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ingo Molnar, Nick Piggin, Linux Memory Management, Linux Kernel,
	Benjamin Herrenschmidt, Paolo 'Blaisorblade' Giarrusso

On Wed, 7 Mar 2007 09:27:55 +0100 Ingo Molnar <mingo@elte.hu> wrote:
>> btw., if we decide that nonlinear isn't worth the continuing maintenance 
>> pain, we could internally implement/emulate sys_remap_file_pages() via a 
>> call to mremap() and essentially deprecate it, without breaking the ABI 
>> - and remove all the nonlinear code. (This would split fremap areas into 
>> separate vmas)

On Wed, Mar 07, 2007 at 12:35:20AM -0800, Andrew Morton wrote:
> I'm rather regretting having merged it - I don't think it has been used for
> much.
> Paolo's UML speedup patches might use nonlinear though.

Guess what major real-life application not only uses nonlinear daily
but would even be very happy to see it extended with non-vma-creating
protections and more? It's not terribly typical for things to be
truncated while remap_file_pages() is doing its work, though it's been
proposed as a method of dynamism. It won't stress remap_file_pages() vs.
truncate() in any meaningful way, though, as userspace will be rather
diligent about clearing in-use data out of the file offset range to be
truncated away anyway, and all that via O_DIRECT.


-- wli

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-03-07  9:22               ` Ingo Molnar
@ 2007-03-07  9:32                 ` Bill Irwin
  -1 siblings, 0 replies; 198+ messages in thread
From: Bill Irwin @ 2007-03-07  9:32 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Nick Piggin, Andrew Morton, Linux Memory Management,
	Linux Kernel, Benjamin Herrenschmidt

* Nick Piggin <npiggin@suse.de> wrote:
>> After these patches, I don't think there is too much burden. The main 
>> thing left really is just the objrmap stuff, but that is just handled 
>> with a minimal 'dumb' algorithm that doesn't cost much.

On Wed, Mar 07, 2007 at 10:22:52AM +0100, Ingo Molnar wrote:
> ok. What do you think about the sys_remap_file_pages_prot() thing that 
> Paolo has done in a nicely split up form - does that complicate things 
> in any fundamental way? That is what is useful to UML.

Oracle would love it. You don't want to know how far back I've been
asked to backport that.


-- wli

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-03-07  9:07                   ` Andrew Morton
@ 2007-03-07  9:32                     ` Peter Zijlstra
  -1 siblings, 0 replies; 198+ messages in thread
From: Peter Zijlstra @ 2007-03-07  9:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Miklos Szeredi, mingo, npiggin, linux-mm, linux-kernel, benh

On Wed, 2007-03-07 at 01:07 -0800, Andrew Morton wrote:
> On Wed, 07 Mar 2007 09:51:57 +0100 Miklos Szeredi <miklos@szeredi.hu> wrote:
> 
> > > > Dirty page accounting doesn't work either on
> > > > non-linear mappings
> > > 
> > > It doesn't?  Confused - these things don't have anything to do with each
> > > other do they?
> > 
> > Look in page_mkclean().  Where does it handle non-linear mappings?
> > 
> 
> OK, I'd forgotten about that.  It won't break dirty memory accounting,
> but it'll potentially break dirty memory balancing.
> 
> If we have the wrong page (due to nonlinear), page_check_address() will
> fail and we'll leave the pte dirty.  That puts us back to the pre-2.6.17
> algorithms and I guess it'll break the msync guarantees.
> 
> Peter, I thought we went through the nonlinear problem ages ago and decided
> it was OK?

Can recollect as much, I modelled it after page_referenced() and can't
find any VM_NONLINEAR specific code in there either.

Will have a hard look, but if it's broken, then page_referenced is
equally broken it seems, which would make page reclaim funny in the
light of nonlinear mappings.


^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-03-07  9:32                 ` Bill Irwin
@ 2007-03-07  9:35                   ` Ingo Molnar
  -1 siblings, 0 replies; 198+ messages in thread
From: Ingo Molnar @ 2007-03-07  9:35 UTC (permalink / raw)
  To: Bill Irwin, Nick Piggin, Andrew Morton, Linux Memory Management,
	Linux Kernel, Benjamin Herrenschmidt


* Bill Irwin <bill.irwin@oracle.com> wrote:

> * Nick Piggin <npiggin@suse.de> wrote:
> >> After these patches, I don't think there is too much burden. The main 
> >> thing left really is just the objrmap stuff, but that is just handled 
> >> with a minimal 'dumb' algorithm that doesn't cost much.
> 
> On Wed, Mar 07, 2007 at 10:22:52AM +0100, Ingo Molnar wrote:
> > ok. What do you think about the sys_remap_file_pages_prot() thing that 
> > Paolo has done in a nicely split up form - does that complicate things 
> > in any fundamental way? That is what is useful to UML.
> 
> Oracle would love it. You don't want to know how far back I've been 
> asked to backport that.

ok, cool! Then the first step would be for you to talk to Paolo and to 
pick up the patches, review them, nurse it in -mm, etc. Suffering in 
silence is just a pointless act of masochism, not an efficient 
upstream-merge tactic ;-)

	Ingo

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-03-07  9:26                       ` Andrew Morton
@ 2007-03-07  9:38                         ` Nick Piggin
  -1 siblings, 0 replies; 198+ messages in thread
From: Nick Piggin @ 2007-03-07  9:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Miklos Szeredi, mingo, linux-mm, linux-kernel, benh, Peter Zijlstra

On Wed, Mar 07, 2007 at 01:26:38AM -0800, Andrew Morton wrote:
> On Wed, 7 Mar 2007 10:18:23 +0100 Nick Piggin <npiggin@suse.de> wrote:
> 
> > 
> > msync breakage is bad, but otherwise I don't know that we care about
> > dirty page writeout efficiency.
> 
> Well.  We made so many changes to support the synchronous
> dirty-the-page-when-we-dirty-the-pte thing that I'm rather doubtful that
> the old-style approach still works.  It might seem to, most of the time. 
> But if it _is_ subtly broken, boy it's going to take a long time for us to
> find out.

I can't think of anything that should have caused breakage (except for
the msync thing). We're still careful about not dropping pte dirty bits.

> > But I think we discovered that those msync changes are bogus anyway
> > becuase there is a small race window where pte could be dirtied without
> > page being set dirty?
> 
> Dunno, I don't recall that.  We dirty the page before the pte...

I don't think it's really that simple. There is a big comment in
clear_page_dirty_for_io.


^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-03-07  9:29               ` Bill Irwin
@ 2007-03-07  9:39                 ` Andrew Morton
  -1 siblings, 0 replies; 198+ messages in thread
From: Andrew Morton @ 2007-03-07  9:39 UTC (permalink / raw)
  To: Bill Irwin
  Cc: Ingo Molnar, Nick Piggin, Linux Memory Management, Linux Kernel,
	Benjamin Herrenschmidt, Paolo 'Blaisorblade' Giarrusso

On Wed, 7 Mar 2007 01:29:03 -0800 Bill Irwin <bill.irwin@oracle.com> wrote:

> On Wed, 7 Mar 2007 09:27:55 +0100 Ingo Molnar <mingo@elte.hu> wrote:
> >> btw., if we decide that nonlinear isnt worth the continuing maintainance 
> >> pain, we could internally implement/emulate sys_remap_file_pages() via a 
> >> call to mremap() and essentially deprecate it, without breaking the ABI 
> >> - and remove all the nonlinear code. (This would split fremap areas into 
> >> separate vmas)
> 
> On Wed, Mar 07, 2007 at 12:35:20AM -0800, Andrew Morton wrote:
> > I'm rather regretting having merged it - I don't think it has been used for
> > much.
> > Paolo's UML speedup patches might use nonlinear though.
> 
> Guess what major real-life application not only uses nonlinear daily
> but would even be very happy to see it extended with non-vma-creating
> protections and more?

uh-oh.  SQL server?

> It's not terribly typical for things to be
> truncated while remap_file_pages() is doing its work, though it's been
> proposed as a method of dynamism. It won't stress remap_file_pages() vs.
> truncate() in any meaningful way, though, as userspace will be rather
> diligent about clearing in-use data out of the file offset range to be
> truncated away anyway, and all that via O_DIRECT.

The problem here isn't related to truncate or direct-IO.  It's just
plain-old MAP_SHARED.  nonlinear VMAs are now using the old-style
dirty-memory management.  msync() is basically a no-op and the code is
wildly tricky and pretty much untested.  The chances that we broke it are
considerable.


^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-03-07  9:28                 ` Nick Piggin
@ 2007-03-07  9:44                   ` Bill Irwin
  -1 siblings, 0 replies; 198+ messages in thread
From: Bill Irwin @ 2007-03-07  9:44 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ingo Molnar, Andrew Morton, Linux Memory Management,
	Linux Kernel, Benjamin Herrenschmidt,
	Paolo 'Blaisorblade' Giarrusso

On Wed, Mar 07, 2007 at 10:28:21AM +0100, Nick Piggin wrote:
> Depending on whether anyone wants it, and what features they want, we
> could emulate the old syscall, and make a new restricted one which is
> much less intrusive.
> For example, if we can operate only on MAP_ANONYMOUS memory and specify
> that nonlinear mappings effectively mlock the pages, then we can get
> rid of all the objrmap and unmap_mapping_range handling, forget about
> the writeout and msync problems...

Anonymous-only would make it a doorstop for Oracle, since its entire
motive for using it is to window into objects larger than user virtual
address spaces (this likely also applies to UML, though they should
really chime in to confirm). Restrictions to tmpfs and/or ramfs would
likely be liveable, though I suspect some things might want to do it to
shm segments (I'll ask about that one). There's definitely no need for a
persistent backing store for the object to be remapped in Oracle's case,
in any event. It's largely the in-core destination and source of IO, not
something saved on-disk itself.
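
For readers unfamiliar with the call, a minimal userspace sketch of that
windowing pattern (file name, sizes and offsets are made up for the
example; error handling omitted):

#define _GNU_SOURCE
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

#define WINDOW	(16UL << 20)			/* 16MB window vma ... */
#define PAGE_SZ	4096UL

int main(void)
{
	int fd = open("/dev/shm/bigbuf", O_RDWR | O_CREAT, 0600);
	char *win;

	/* ... into a 1GB object (a real database segment would be far
	 * larger than the usable address space, which is the point). */
	ftruncate(fd, 1UL << 30);

	/* One MAP_SHARED vma, initially a linear map of file offset 0. */
	win = mmap(NULL, WINDOW, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	/* Rewire the first page of the window to file page 200000 without
	 * creating another vma: this is what makes the vma nonlinear. */
	remap_file_pages(win, PAGE_SZ, 0, 200000, 0);

	win[0] = 1;		/* now touches file offset 200000 * 4096 */

	munmap(win, WINDOW);
	close(fd);
	return 0;
}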


-- wli

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-03-07  9:32                     ` Peter Zijlstra
@ 2007-03-07  9:45                       ` Nick Piggin
  -1 siblings, 0 replies; 198+ messages in thread
From: Nick Piggin @ 2007-03-07  9:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Miklos Szeredi, mingo, linux-mm, linux-kernel, benh

On Wed, Mar 07, 2007 at 10:32:22AM +0100, Peter Zijlstra wrote:
> On Wed, 2007-03-07 at 01:07 -0800, Andrew Morton wrote:
> > On Wed, 07 Mar 2007 09:51:57 +0100 Miklos Szeredi <miklos@szeredi.hu> wrote:
> > 
> > > > > Dirty page accounting doesn't work either on
> > > > > non-linear mappings
> > > > 
> > > > It doesn't?  Confused - these things don't have anything to do with each
> > > > other do they?
> > > 
> > > Look in page_mkclean().  Where does it handle non-linear mappings?
> > > 
> > 
> > OK, I'd forgotten about that.  It won't break dirty memory accounting,
> > but it'll potentially break dirty memory balancing.
> > 
> > If we have the wrong page (due to nonlinear), page_check_address() will
> > fail and we'll leave the pte dirty.  That puts us back to the pre-2.6.17
> > algorithms and I guess it'll break the msync guarantees.
> > 
> > Peter, I thought we went through the nonlinear problem ages ago and decided
> > it was OK?
> 
> Can recollect as much, I modelled it after page_referenced() and can't
> find any VM_NONLINEAR specific code in there either.
> 
> Will have a hard look, but if its broken, then page_referenced if
> equally broken it seems, which would make page reclaim funny in the
> light of nonlinear mappings.

page_referenced is just a heuristic; it ignores nonlinear mappings,
and the page will get filtered down to try_to_unmap.

Page reclaim is already "funny" for nonlinear mappings, page_referenced
is the least of its worries ;) It works, though.



^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-03-07  9:44                   ` Bill Irwin
@ 2007-03-07  9:49                     ` Nick Piggin
  -1 siblings, 0 replies; 198+ messages in thread
From: Nick Piggin @ 2007-03-07  9:49 UTC (permalink / raw)
  To: Bill Irwin, Ingo Molnar, Andrew Morton, Linux Memory Management,
	Linux Kernel, Benjamin Herrenschmidt,
	Paolo 'Blaisorblade' Giarrusso

On Wed, Mar 07, 2007 at 01:44:20AM -0800, Bill Irwin wrote:
> On Wed, Mar 07, 2007 at 10:28:21AM +0100, Nick Piggin wrote:
> > Depending on whether anyone wants it, and what features they want, we
> > could emulate the old syscall, and make a new restricted one which is
> > much less intrusive.
> > For example, if we can operate only on MAP_ANONYMOUS memory and specify
> > that nonlinear mappings effectively mlock the pages, then we can get
> > rid of all the objrmap and unmap_mapping_range handling, forget about
> > the writeout and msync problems...
> 
> Anonymous-only would make it a doorstop for Oracle, since its entire
> motive for using it is to window into objects larger than user virtual

Uh, duh yes I don't mean MAP_ANONYMOUS, I was just thinking of the shmem
inode that sits behind MAP_ANONYMOUS|MAP_SHARED. Of course if you don't
have a file descriptor to get a pgoff, then remap_file_pages is a doorstop
for everyone ;)

> address spaces (this likely also applies to UML, though they should
> really chime in to confirm). Restrictions to tmpfs and/or ramfs would
> likely be liveable, though I suspect some things might want to do it to
> shm segments (I'll ask about that one). There's definitely no need for a
> persistent backing store for the object to be remapped in Oracle's case,
> in any event. It's largely the in-core destination and source of IO, not
> something saved on-disk itself.

Yeah, tmpfs/shm segs are what I was thinking about. If UML can live with
that as well, then I think it might be a good option.


^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-03-07  9:35                   ` Ingo Molnar
@ 2007-03-07  9:50                     ` Bill Irwin
  -1 siblings, 0 replies; 198+ messages in thread
From: Bill Irwin @ 2007-03-07  9:50 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Bill Irwin, Nick Piggin, Andrew Morton, Linux Memory Management,
	Linux Kernel, Benjamin Herrenschmidt

On Wed, Mar 07, 2007 at 10:22:52AM +0100, Ingo Molnar wrote:
>>> ok. What do you think about the sys_remap_file_pages_prot() thing that 
>>> Paolo has done in a nicely split up form - does that complicate things 
>>> in any fundamental way? That is what is useful to UML.

* Bill Irwin <bill.irwin@oracle.com> wrote:
>> Oracle would love it. You don't want to know how far back I've been 
>> asked to backport that.

On Wed, Mar 07, 2007 at 10:35:18AM +0100, Ingo Molnar wrote:
> ok, cool! Then the first step would be for you to talk to Paolo and to 
> pick up the patches, review them, nurse it in -mm, etc. Suffering in 
> silence is just a pointless act of masochism, not an efficient 
> upstream-merge tactic ;-)

It was intended for use in a debugging mode for the database, so given
the general mood where fighting backouts was an issue, I was relatively
loath to bring it up. With UML behind it I don't feel that's as much of
a concern.


-- wli

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-03-07  9:22               ` Ingo Molnar
@ 2007-03-07  9:52                 ` Nick Piggin
  -1 siblings, 0 replies; 198+ messages in thread
From: Nick Piggin @ 2007-03-07  9:52 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, Linux Memory Management, Linux Kernel,
	Benjamin Herrenschmidt

On Wed, Mar 07, 2007 at 10:22:52AM +0100, Ingo Molnar wrote:
> 
> * Nick Piggin <npiggin@suse.de> wrote:
> 
> > After these patches, I don't think there is too much burden. The main 
> > thing left really is just the objrmap stuff, but that is just handled 
> > with a minimal 'dumb' algorithm that doesn't cost much.
> 
> ok. What do you think about the sys_remap_file_pages_prot() thing that 
> Paolo has done in a nicely split up form - does that complicate things 
> in any fundamental way? That is what is useful to UML.

Last time I looked (a while ago), the only issue I had was that he was
doing a weird special case rather than using another !present pte bit
for his "nonlinear protection" ptes.

I think he fixed that now and so it should be quite good now.
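
For anyone not following the trick being referred to: a nonlinear pte is
left not-present with the file offset encoded in the pte itself (what
pte_file()/pgoff_to_pte()/pte_to_pgoff() do, with a per-arch layout), and
the extension under discussion wants a protection marker stashed in the
not-present pte as well. A toy model of the idea, with a made-up bit
layout purely for illustration:

#include <assert.h>

/* Toy pte layout: bit 0 = present, bit 1 = "file pte" marker, the rest
 * holds the file offset in pages.  Illustrative only. */
#define PTE_PRESENT	0x1UL
#define PTE_FILE	0x2UL

static unsigned long toy_pgoff_to_pte(unsigned long pgoff)
{
	return (pgoff << 2) | PTE_FILE;		/* not present, offset kept */
}

static unsigned long toy_pte_to_pgoff(unsigned long pte)
{
	return pte >> 2;
}

static int toy_pte_file(unsigned long pte)
{
	return !(pte & PTE_PRESENT) && (pte & PTE_FILE);
}

int main(void)
{
	unsigned long pte = toy_pgoff_to_pte(123456);

	/* On fault, a nonlinear mapping recovers the file page from the
	 * pte contents rather than from the faulting address. */
	assert(toy_pte_file(pte) && toy_pte_to_pgoff(pte) == 123456);
	return 0;
}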

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-03-07  9:49                     ` Nick Piggin
@ 2007-03-07 10:02                       ` Nick Piggin
  -1 siblings, 0 replies; 198+ messages in thread
From: Nick Piggin @ 2007-03-07 10:02 UTC (permalink / raw)
  To: Bill Irwin, Ingo Molnar, Andrew Morton, Linux Memory Management,
	Linux Kernel, Benjamin Herrenschmidt,
	Paolo 'Blaisorblade' Giarrusso

On Wed, Mar 07, 2007 at 10:49:47AM +0100, Nick Piggin wrote:
> On Wed, Mar 07, 2007 at 01:44:20AM -0800, Bill Irwin wrote:
> > On Wed, Mar 07, 2007 at 10:28:21AM +0100, Nick Piggin wrote:
> > > Depending on whether anyone wants it, and what features they want, we
> > > could emulate the old syscall, and make a new restricted one which is
> > > much less intrusive.
> > > For example, if we can operate only on MAP_ANONYMOUS memory and specify
> > > that nonlinear mappings effectively mlock the pages, then we can get
> > > rid of all the objrmap and unmap_mapping_range handling, forget about
> > > the writeout and msync problems...
> > 
> > Anonymous-only would make it a doorstop for Oracle, since its entire
> > motive for using it is to window into objects larger than user virtual
> 
> Uh, duh yes I don't mean MAP_ANONYMOUS, I was just thinking of the shmem
> inode that sits behind MAP_ANONYMOUS|MAP_SHARED. Of course if you don't
> have a file descriptor to get a pgoff, then remap_file_pages is a doorstop
> for everyone ;)
> 
> > address spaces (this likely also applies to UML, though they should
> > really chime in to confirm). Restrictions to tmpfs and/or ramfs would
> > likely be liveable, though I suspect some things might want to do it to
> > shm segments (I'll ask about that one). There's definitely no need for a
> > persistent backing store for the object to be remapped in Oracle's case,
> > in any event. It's largely the in-core destination and source of IO, not
> > something saved on-disk itself.
> 
> Yeah, tmpfs/shm segs are what I was thinking about. If UML can live with
> that as well, then I think it might be a good option.

Oh, hmm.... if you can truncate these things then you still need to
force unmap so you still need i_mmap_nonlinear.

But come to think of it, I still don't think nonlinear mappings are
too bad as they are ;)

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-03-07  9:45                       ` Nick Piggin
@ 2007-03-07 10:04                         ` Nick Piggin
  -1 siblings, 0 replies; 198+ messages in thread
From: Nick Piggin @ 2007-03-07 10:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Miklos Szeredi, mingo, linux-mm, linux-kernel, benh

On Wed, Mar 07, 2007 at 10:45:03AM +0100, Nick Piggin wrote:
> On Wed, Mar 07, 2007 at 10:32:22AM +0100, Peter Zijlstra wrote:
> > 
> > Can recollect as much, I modelled it after page_referenced() and can't
> > find any VM_NONLINEAR specific code in there either.
> > 
> > Will have a hard look, but if its broken, then page_referenced if
> > equally broken it seems, which would make page reclaim funny in the
> > light of nonlinear mappings.
> 
> page_referenced is just an heuristic, and it ignores nonlinear mappings
> and the page which will get filtered down to try_to_unmap.
> 
> Page reclaim is already "funny" for nonlinear mappings, page_referenced
> is the least of its worries ;) It works, though.

Or, to be more helpful, unmap_mapping_range is what it should be
modelled on.
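
That is, the fallback truncate already uses: since a nonlinear vma gives
no way to compute where a given file page is mapped, walk the mapping's
i_mmap_nonlinear list and sweep each vma's whole range. A toy model of
that shape (types and list handling simplified, for illustration), which
is also why it is per-vma work rather than per-page work:

#include <stdio.h>

struct toy_vma {
	unsigned long vm_start, vm_end;
	struct toy_vma *next;		/* stand-in for i_mmap_nonlinear */
};

/*
 * Linear vmas: one computed pte lookup per vma.  Nonlinear vmas: no such
 * shortcut, so sweep every pte in the vma, the way unmap_mapping_range()
 * already handles them for truncate.
 */
static void clean_nonlinear(struct toy_vma *head)
{
	struct toy_vma *vma;

	for (vma = head; vma; vma = vma->next)
		printf("sweep ptes %#lx-%#lx, move dirty bits to pages\n",
		       vma->vm_start, vma->vm_end);
}

int main(void)
{
	struct toy_vma b = { 0x50000000, 0x50200000, NULL };
	struct toy_vma a = { 0x40000000, 0x40100000, &b };

	clean_nonlinear(&a);
	return 0;
}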

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-03-07  6:51     ` Andrew Morton
@ 2007-03-07 10:05       ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 198+ messages in thread
From: Benjamin Herrenschmidt @ 2007-03-07 10:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nick Piggin, Linux Memory Management, Linux Kernel, Ingo Molnar


> > NOPAGE_REFAULT is removed. This should be implemented with ->fault, and
> > no users have hit mainline yet.
> 
> Did benh agree with that?

I won't use NOPAGE_REFAULT, I use NOPFN_REFAULT and that has hit
mainline. I will switch to ->fault when I have time to adapt the code,
in the meantime, NOPFN_REFAULT should stay.

Note that one thing we really want with the new ->fault (though I
haven't looked at the patches lately to see if it's available) is to be
able to differentiate faults coming from userspace from faults coming
from the kernel. The major difference is that the former can be
re-executed to handle signals, the latter can't. Thus waiting in the
fault handler can be made interruptible in the former case, not in the
latter case.
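
A sketch of what such a distinction could look like for a driver's fault
handler, using a hypothetical "fault came from userspace" flag (the flag
and the names below are made up for illustration; nothing like them is
implied to exist in these patches):

#include <stdio.h>

/* Hypothetical: set when the fault was taken from userspace and can be
 * restarted after signal delivery.  Illustrative only. */
#define TOY_FAULT_USER	0x1

static int wait_for_device(int interruptible)
{
	/* Stand-in for "sleep until the coprocessor releases the page";
	 * pretend a signal arrived while we waited. */
	return interruptible ? -4 /* -EINTR: retry the fault later */ : 0;
}

static int toy_fault(unsigned int flags)
{
	/*
	 * A userspace fault can be re-executed once the signal is handled,
	 * so the wait may be interruptible; a kernel-initiated fault
	 * (e.g. get_user_pages) cannot be restarted, so it must wait.
	 */
	return wait_for_device(flags & TOY_FAULT_USER);
}

int main(void)
{
	printf("user fault -> %d, kernel fault -> %d\n",
	       toy_fault(TOY_FAULT_USER), toy_fault(0));
	return 0;
}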

Ben.



^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-03-07 10:04                         ` Nick Piggin
@ 2007-03-07 10:06                           ` Peter Zijlstra
  -1 siblings, 0 replies; 198+ messages in thread
From: Peter Zijlstra @ 2007-03-07 10:06 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Miklos Szeredi, mingo, linux-mm, linux-kernel, benh

On Wed, 2007-03-07 at 11:04 +0100, Nick Piggin wrote:
> On Wed, Mar 07, 2007 at 10:45:03AM +0100, Nick Piggin wrote:
> > On Wed, Mar 07, 2007 at 10:32:22AM +0100, Peter Zijlstra wrote:
> > > 
> > > Can recollect as much, I modelled it after page_referenced() and can't
> > > find any VM_NONLINEAR specific code in there either.
> > > 
> > > Will have a hard look, but if its broken, then page_referenced if
> > > equally broken it seems, which would make page reclaim funny in the
> > > light of nonlinear mappings.
> > 
> > page_referenced is just an heuristic, and it ignores nonlinear mappings
> > and the page which will get filtered down to try_to_unmap.
> > 
> > Page reclaim is already "funny" for nonlinear mappings, page_referenced
> > is the least of its worries ;) It works, though.
> 
> Or, to be more helpful, unmap_mapping_range is what it should be
> modelled on.

*sigh* yes was looking at all that code, thats gonna be darn slow
though, but I'll whip up a patch.

/me feels terribly bad about having missed this..


^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-03-07  9:39                 ` Andrew Morton
@ 2007-03-07 10:09                   ` Bill Irwin
  -1 siblings, 0 replies; 198+ messages in thread
From: Bill Irwin @ 2007-03-07 10:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Bill Irwin, Ingo Molnar, Nick Piggin, Linux Memory Management,
	Linux Kernel, Benjamin Herrenschmidt,
	Paolo 'Blaisorblade' Giarrusso

On Wed, 7 Mar 2007 01:29:03 -0800 Bill Irwin <bill.irwin@oracle.com> wrote:
>> Guess what major real-life application not only uses nonlinear daily
>> but would even be very happy to see it extended with non-vma-creating
>> protections and more?

On Wed, Mar 07, 2007 at 01:39:42AM -0800, Andrew Morton wrote:
> uh-oh.  SQL server?

Close enough. ;)


On Wed, 7 Mar 2007 01:29:03 -0800 Bill Irwin <bill.irwin@oracle.com> wrote:
>> It's not terribly typical for things to be
>> truncated while remap_file_pages() is doing its work, though it's been
>> proposed as a method of dynamism. It won't stress remap_file_pages() vs.
>> truncate() in any meaningful way, though, as userspace will be rather
>> diligent about clearing in-use data out of the file offset range to be
>> truncated away anyway, and all that via O_DIRECT.

On Wed, Mar 07, 2007 at 01:39:42AM -0800, Andrew Morton wrote:
> The problem here isn't related to truncate or direct-IO.  It's just
> plain-old MAP_SHARED.  nonlinear VMAs are now using the old-style
> dirty-memory management.  msync() is basically a no-op and the code is
> wildly tricky and pretty much untested.  The chances that we broke it are
> considerable.

This would be of concern for swapping out tmpfs-backed nonlinearly-
mapped files under extreme stress in Oracle's case, though it's rather
typical for it all to be mlock()'d in-core and cases where that's
necessary to be considered grossly underprovisioned. As far as I know,
msync() is not used to manage the nonlinearly-mapped objects, which are
most typically expected to be memory-backed, rendering writeback to
disk of questionable value. Also quite happily, I'm not aware of any
data integrity issues it would explain. Bug though it may be, it
requires a usage model very rarely used by Oracle to trigger, so we've
not run into it.


-- wli

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-03-07 10:06                           ` Peter Zijlstra
@ 2007-03-07 10:13                             ` Miklos Szeredi
  -1 siblings, 0 replies; 198+ messages in thread
From: Miklos Szeredi @ 2007-03-07 10:13 UTC (permalink / raw)
  To: a.p.zijlstra; +Cc: npiggin, akpm, miklos, mingo, linux-mm, linux-kernel, benh

> *sigh* yes was looking at all that code, thats gonna be darn slow
> though, but I'll whip up a patch.

Well, if it's going to be darn slow, maybe it's better to go with
mingo's plan on emulating nonlinear vmas with linear ones.  That'll be
darn slow as well, but at least it will be much less complicated.

Miklos

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-03-07 10:05       ` Benjamin Herrenschmidt
@ 2007-03-07 10:17         ` Nick Piggin
  -1 siblings, 0 replies; 198+ messages in thread
From: Nick Piggin @ 2007-03-07 10:17 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Andrew Morton, Linux Memory Management, Linux Kernel, Ingo Molnar

On Wed, Mar 07, 2007 at 11:05:48AM +0100, Benjamin Herrenschmidt wrote:
> 
> > > NOPAGE_REFAULT is removed. This should be implemented with ->fault, and
> > > no users have hit mainline yet.
> > 
> > Did benh agree with that?
> 
> I won't use NOPAGE_REFAULT, I use NOPFN_REFAULT and that has hit
> mainline. I will switch to ->fault when I have time to adapt the code,
> in the meantime, NOPFN_REFAULT should stay.

I think I removed not only NOPFN_REFAULT, but also nopfn itself, *and*
adapted the code for you ;) it is in patch 5/6, sent a while ago.


^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-03-07 10:13                             ` Miklos Szeredi
@ 2007-03-07 10:21                               ` Nick Piggin
  -1 siblings, 0 replies; 198+ messages in thread
From: Nick Piggin @ 2007-03-07 10:21 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: a.p.zijlstra, akpm, mingo, linux-mm, linux-kernel, benh

On Wed, Mar 07, 2007 at 11:13:20AM +0100, Miklos Szeredi wrote:
> > *sigh* yes was looking at all that code, thats gonna be darn slow
> > though, but I'll whip up a patch.
> 
> Well, if it's going to be darn slow, maybe it's better to go with
> mingo's plan on emulating nonlinear vmas with linear ones.  That'll be

There are real users who want these fast, though.

> darn slow as well, but at least it will be much less complicated.

IMO, the best thing to do is just restore msync behaviour, and comment
the fact that we ignore nonlinears. We need to restore msync behaviour
to fix races in regular mappings anyway, at least for now.
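
For reference, "restore msync behaviour" means the pre-2.6.17 scheme
where msync() walks the ptes of the range and transfers pte dirty bits
to the pages before starting writeback. A toy model of that transfer
(simplified stand-in types, for illustration only):

#include <stdbool.h>
#include <stdio.h>

struct toy_pte  { bool present, dirty; unsigned long pfn; };
struct toy_page { bool dirty; };

/* Model of the old msync-style walk: any dirty bit found in a pte is
 * cleared there and moved to the backing page, so writeback sees it. */
static void msync_transfer(struct toy_pte *ptes, int n, struct toy_page *pages)
{
	int i;

	for (i = 0; i < n; i++) {
		if (!ptes[i].present || !ptes[i].dirty)
			continue;
		ptes[i].dirty = false;
		pages[ptes[i].pfn].dirty = true;	/* set_page_dirty() */
	}
}

int main(void)
{
	struct toy_pte ptes[2] = { { true, true, 1 }, { true, false, 0 } };
	struct toy_page pages[2] = { { false }, { false } };

	msync_transfer(ptes, 2, pages);
	printf("page 1 dirty after msync walk: %d\n", pages[1].dirty);
	return 0;
}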


^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-03-07 10:21                               ` Nick Piggin
@ 2007-03-07 10:24                                 ` Peter Zijlstra
  -1 siblings, 0 replies; 198+ messages in thread
From: Peter Zijlstra @ 2007-03-07 10:24 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Miklos Szeredi, akpm, mingo, linux-mm, linux-kernel, benh

On Wed, 2007-03-07 at 11:21 +0100, Nick Piggin wrote:
> On Wed, Mar 07, 2007 at 11:13:20AM +0100, Miklos Szeredi wrote:
> > > *sigh* yes was looking at all that code, thats gonna be darn slow
> > > though, but I'll whip up a patch.
> > 
> > Well, if it's going to be darn slow, maybe it's better to go with
> > mingo's plan on emulating nonlinear vmas with linear ones.  That'll be
> 
> There are real users who want these fast, though.

Yeah, why don't we have a tree per nonlinear vma to find these pages?

wli mentions shadow page tables..

> > darn slow as well, but at least it will be much less complicated.
> 
> IMO, the best thing to do is just restore msync behaviour, and comment
> the fact that we ignore nonlinears. We need to restore msync behaviour
> to fix races in regular mappings anyway, at least for now.

Seems to be the best quick solution indeed.


^ permalink raw reply	[flat|nested] 198+ messages in thread

* [rfc][patch 7/6] mm: merge page_mkwrite
  2007-03-07 10:13                             ` Miklos Szeredi
@ 2007-03-07 10:30                               ` Nick Piggin
  -1 siblings, 0 replies; 198+ messages in thread
From: Nick Piggin @ 2007-03-07 10:30 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, linux-kernel, David Howells, linux-fsdevel

Now that I'm making some progress on merging the basic stuff, I'd
like to get opinions about merging page_mkwrite functionality into
->fault().

I still don't see any callers in the tree, but I see no reason why
this won't work (or why it isn't better).

--
Like everything else in life, page_mkwrite()ing is just a primitive,
degenerate form of fault()ing.

Having FAULT_FLAG_WRITE in the fault operation allows us to just get
rid of the page_mkwrite call in do_fault, because filesystems can check
for that flag bit, and do the page_mkwrite thing before returning the
page (this will improve efficiency for everyone).

Then, we introduce another fault flag to signal that the fault is
an event notification for a page, rather than a request for a pgoff.
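
Not part of the patch, but to make the idea concrete: a filesystem
->fault handler under this scheme might look roughly like the sketch
below. The example_* helpers and the refcounting details are made up
for illustration only.

static struct page *example_fault(struct vm_area_struct *vma,
				  struct fault_data *fdata)
{
	struct page *page;

	if (fdata->flags & FAULT_FLAG_NOTIFY) {
		/* write notification: fdata->page already holds the page */
		page = fdata->page;
		if (example_prepare_write(page) < 0) {
			fdata->type = VM_FAULT_SIGBUS;
			return NULL;
		}
		fdata->type = VM_FAULT_MINOR;
		return page;
	}

	/* ordinary fault: look the page up by fdata->pgoff */
	page = example_find_page(vma->vm_file->f_mapping, fdata->pgoff);
	if (!page) {
		fdata->type = VM_FAULT_SIGBUS;
		return NULL;
	}

	/* write fault on a shared mapping: do the old page_mkwrite
	 * work here, before handing the page back */
	if (fdata->flags & FAULT_FLAG_WRITE)
		example_prepare_write(page);

	fdata->type = VM_FAULT_MINOR;
	return page;
}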

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -176,6 +176,7 @@ extern unsigned int kobjsize(const void 
 					 * return with the page locked.
 					 */
 #define VM_CAN_NONLINEAR 0x10000000	/* Has ->fault & does nonlinear pages */
+#define VM_NOTIFY_MKWRITE 0x20000000	/* Has ->fault & wants page writable notification */
 
 #ifndef VM_STACK_DEFAULT_FLAGS		/* arch can override this */
 #define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS
@@ -201,6 +202,7 @@ extern pgprot_t protection_map[16];
 
 #define FAULT_FLAG_WRITE	0x01
 #define FAULT_FLAG_NONLINEAR	0x02
+#define FAULT_FLAG_NOTIFY	0x04	/* fault_data.page contains page */
 
 /*
  * fault_data is filled in the the pagefault handler and passed to the
@@ -213,7 +215,10 @@ extern pgprot_t protection_map[16];
  * nonlinear mapping support.
  */
 struct fault_data {
-	unsigned long address;
+	union {
+		unsigned long address;
+		struct page *page;
+	};
 	pgoff_t pgoff;
 	unsigned int flags;
 
@@ -230,9 +235,6 @@ struct vm_operations_struct {
 	void (*close)(struct vm_area_struct * area);
 	struct page * (*fault)(struct vm_area_struct *vma, struct fault_data * fdata);
 	struct page * (*nopage)(struct vm_area_struct * area, unsigned long address, int *type);
-	/* notification that a previously read-only page is about to become
-	 * writable, if an error is returned it will cause a SIGBUS */
-	int (*page_mkwrite)(struct vm_area_struct *vma, struct page *page);
 #ifdef CONFIG_NUMA
 	int (*set_policy)(struct vm_area_struct *vma, struct mempolicy *new);
 	struct mempolicy *(*get_policy)(struct vm_area_struct *vma,
@@ -831,7 +833,7 @@ extern struct shrinker *set_shrinker(int
 extern void remove_shrinker(struct shrinker *shrinker);
 
 /*
- * Some shared mappigns will want the pages marked read-only
+ * Some shared mappings will want the pages marked read-only
  * to track write events. If so, we'll downgrade vm_page_prot
  * to the private version (using protection_map[] without the
  * VM_SHARED bit).
@@ -845,7 +847,7 @@ static inline int vma_wants_writenotify(
 		return 0;
 
 	/* The backer wishes to know when pages are first written to? */
-	if (vma->vm_ops && vma->vm_ops->page_mkwrite)
+	if (vma->vm_flags & VM_NOTIFY_MKWRITE)
 		return 1;
 
 	/* The open routine did something to the protections already? */
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -1566,7 +1566,8 @@ static int do_wp_page(struct mm_struct *
 		 * read-only shared pages can get COWed by
 		 * get_user_pages(.write=1, .force=1).
 		 */
-		if (vma->vm_ops && vma->vm_ops->page_mkwrite) {
+		if (unlikely(vma->vm_flags & VM_NOTIFY_MKWRITE)) {
+			struct fault_data fdata;
 			/*
 			 * Notify the address space that the page is about to
 			 * become writable so that it can prohibit this or wait
@@ -1578,8 +1579,14 @@ static int do_wp_page(struct mm_struct *
 			page_cache_get(old_page);
 			pte_unmap_unlock(page_table, ptl);
 
-			if (vma->vm_ops->page_mkwrite(vma, old_page) < 0)
-				goto unwritable_page;
+			fdata.flags = FAULT_FLAG_NOTIFY|FAULT_FLAG_WRITE;
+			fdata.page = old_page;
+			fdata.type = -1;
+			old_page = vma->vm_ops->fault(vma, &fdata);
+			WARN_ON(fdata.type == -1);
+			ret = fdata.type;
+			if (!old_page)
+				return ret;
 
 			/*
 			 * Since we dropped the lock we need to revalidate
@@ -1677,10 +1684,6 @@ oom:
 	if (old_page)
 		page_cache_release(old_page);
 	return VM_FAULT_OOM;
-
-unwritable_page:
-	page_cache_release(old_page);
-	return VM_FAULT_SIGBUS;
 }
 
 /*
@@ -2254,18 +2257,6 @@ static int __do_fault(struct mm_struct *
 				goto out;
 			}
 			copy_user_highpage(page, faulted_page, address, vma);
-		} else {
-			/*
-			 * If the page will be shareable, see if the backing
-			 * address space wants to know that the page is about
-			 * to become writable
-			 */
-			if (vma->vm_ops->page_mkwrite &&
-			    vma->vm_ops->page_mkwrite(vma, page) < 0) {
-				fdata.type = VM_FAULT_SIGBUS;
-				anon = 1; /* no anon but release faulted_page */
-				goto out;
-			}
 		}
 
 	}

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-03-07 10:24                                 ` Peter Zijlstra
@ 2007-03-07 10:38                                   ` Nick Piggin
  -1 siblings, 0 replies; 198+ messages in thread
From: Nick Piggin @ 2007-03-07 10:38 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Miklos Szeredi, akpm, mingo, linux-mm, linux-kernel, benh

On Wed, Mar 07, 2007 at 11:24:45AM +0100, Peter Zijlstra wrote:
> On Wed, 2007-03-07 at 11:21 +0100, Nick Piggin wrote:
> > On Wed, Mar 07, 2007 at 11:13:20AM +0100, Miklos Szeredi wrote:
> > > > *sigh* yes was looking at all that code, thats gonna be darn slow
> > > > though, but I'll whip up a patch.
> > > 
> > > Well, if it's going to be darn slow, maybe it's better to go with
> > > mingo's plan on emulating nonlinear vmas with linear ones.  That'll be
> > 
> > There are real users who want these fast, though.
> 
> Yeah, why don't we have a tree per nonlinear vma to find these pages?
> 
> wli mentions shadow page tables..

We could do something more efficient, but I thought that half the point
was that they didn't carry any of this extra memory, and they could be
really fast to set up at the expense of efficiency elsewhere.

> > > darn slow as well, but at least it will be much less complicated.
> > 
> > IMO, the best thing to do is just restore msync behaviour, and comment
> > the fact that we ignore nonlinears. We need to restore msync behaviour
> > to fix races in regular mappings anyway, at least for now.
> 
> Seems to be the best quick solution indeed.

If we fix the race in the linear mappings, then we can just do the full
msync for nonlinear vmas, and the fast noop version for everyone else.
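
(Purely to illustrate the shape of that idea; this is not the real
msync code path, and the helper name is hypothetical:)

	/* inside the per-vma msync walk */
	if (unlikely(vma->vm_flags & VM_NONLINEAR)) {
		/*
		 * Nonlinear vma: the dirty tracking can't be trusted,
		 * so walk every pte in the range and clean the dirty
		 * pages the slow way.
		 */
		example_msync_pte_range(vma, start, end);
	} else {
		/*
		 * Linear mapping: dirtying is already caught by the
		 * write-notify path, so this can stay a fast no-op.
		 */
	}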

I don't see it being a big deal. I doubt anybody is writing out huge
amounts of data via nonlinear mappings.

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-03-07 10:17         ` Nick Piggin
@ 2007-03-07 10:46           ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 198+ messages in thread
From: Benjamin Herrenschmidt @ 2007-03-07 10:46 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Linux Memory Management, Linux Kernel, Ingo Molnar

On Wed, 2007-03-07 at 11:17 +0100, Nick Piggin wrote:
> On Wed, Mar 07, 2007 at 11:05:48AM +0100, Benjamin Herrenschmidt wrote:
> > 
> > > > NOPAGE_REFAULT is removed. This should be implemented with ->fault, and
> > > > no users have hit mainline yet.
> > > 
> > > Did benh agree with that?
> > 
> > I won't use NOPAGE_REFAULT, I use NOPFN_REFAULT and that has hit
> > mainline. I will switch to ->fault when I have time to adapt the code,
> > in the meantime, NOPFN_REFAULT should stay.
> 
> I think I removed not only NOFPN_REFAULT, but also nopfn itself, *and*
> adapted the code for you ;) it is in patch 5/6, sent a while ago. 

Ok, I need to look. I've been travelling, having meetings etc. for the
last couple of weeks, and I'm taking a week off next week :-)

Ben.



^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-03-07 10:38                                   ` Nick Piggin
@ 2007-03-07 10:47                                     ` Peter Zijlstra
  -1 siblings, 0 replies; 198+ messages in thread
From: Peter Zijlstra @ 2007-03-07 10:47 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Miklos Szeredi, akpm, mingo, linux-mm, linux-kernel, benh

On Wed, 2007-03-07 at 11:38 +0100, Nick Piggin wrote:

> > > There are real users who want these fast, though.
> > 
> > Yeah, why don't we have a tree per nonlinear vma to find these pages?
> > 
> > wli mentions shadow page tables..
> 
> We could do something more efficient, but I thought that half the point
> was that they didn't carry any of this extra memory, and they could be
> really fast to set up at the expense of efficiency elsewhere.

I'm failing to understand this :-(

That extra memory, and apparently they don't want the inefficiency
either.

> I don't see it being a big deal. I doubt anybody is writing out huge
> amounts of data via nonlinear mappings.

Well, now they don't, but it could be done or even exploited as a DoS.


^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-03-07 10:47                                     ` Peter Zijlstra
@ 2007-03-07 11:00                                       ` Nick Piggin
  -1 siblings, 0 replies; 198+ messages in thread
From: Nick Piggin @ 2007-03-07 11:00 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Miklos Szeredi, akpm, mingo, linux-mm, linux-kernel, benh

On Wed, Mar 07, 2007 at 11:47:42AM +0100, Peter Zijlstra wrote:
> On Wed, 2007-03-07 at 11:38 +0100, Nick Piggin wrote:
> 
> > > > There are real users who want these fast, though.
> > > 
> > > Yeah, why don't we have a tree per nonlinear vma to find these pages?
> > > 
> > > wli mentions shadow page tables..
> > 
> > We could do something more efficient, but I thought that half the point
> > was that they didn't carry any of this extra memory, and they could be
> > really fast to set up at the expense of efficiency elsewhere.
> 
> I'm failing to understand this :-(
> 
> That extra memory, and apparently they don't want the inefficiency
> either.

Sorry, I didn't understand your misunderstandings ;)

> 
> > I don't see it being a big deal. I doubt anybody is writing out huge
> > amounts of data via nonlinear mappings.
> 
> Well, now they don't, but it could be done or even exploited as a DoS.

But so could nonlinear page reclaim. I think we need to restrict nonlinear
mappings to root if we're worried about that.

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-03-07 11:00                                       ` Nick Piggin
@ 2007-03-07 11:48                                         ` Peter Zijlstra
  -1 siblings, 0 replies; 198+ messages in thread
From: Peter Zijlstra @ 2007-03-07 11:48 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Miklos Szeredi, akpm, mingo, linux-mm, linux-kernel, benh

On Wed, 2007-03-07 at 12:00 +0100, Nick Piggin wrote:
> On Wed, Mar 07, 2007 at 11:47:42AM +0100, Peter Zijlstra wrote:
> > On Wed, 2007-03-07 at 11:38 +0100, Nick Piggin wrote:
> > 
> > > > > There are real users who want these fast, though.
> > > > 
> > > > Yeah, why don't we have a tree per nonlinear vma to find these pages?
> > > > 
> > > > wli mentions shadow page tables..
> > > 
> > > We could do something more efficient, but I thought that half the point
> > > was that they didn't carry any of this extra memory, and they could be
> > > really fast to set up at the expense of efficiency elsewhere.
> > 
> > I'm failing to understand this :-(
> > 
> > That extra memory, and apparently they don't want the inefficiency

s/T/W/

> > either.
> 
> Sorry, I didn't understand your misunderstandings ;)

Bah, my brain is thick and foggy today. Let us try again;

Nonlinear vmas exist because many vmas are expensive somehow, right?
Nonlinear vmas keep the page mapping in the page tables and screw rmaps.

This 'extra memory' you mentioned would be the overhead of tracking the
actual ranges?

And apparently now we want it to not suck on the rmap case :-(

Anyway, if used on a non-writeback-capable backing store (ramfs),
page_mkclean will never be called. If also mlocked (I think Oracle does
this) then page reclaim will pass it over too.

So we're only interested in the bdi_cap_accounting_dirty and VM_SHARED
case, right?

Tracking these ranges on a per-vma basis would avoid taking the mm wide
mmap_sem and so would be cheaper than regular vmas.

Would that still be too expensive?
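
(To make that concrete, a rough sketch of what per-vma range tracking
could look like; none of these names exist in the kernel, it's only
meant to show the extra state involved:)

	struct nonlinear_range {
		struct rb_node	rb;		/* per-vma rbtree, keyed by start */
		unsigned long	start, end;	/* virtual address range */
		pgoff_t		pgoff;		/* file offset it was remapped to */
	};
	/* one such tree per nonlinear vma, under a per-vma lock
	 * instead of the mm-wide mmap_sem */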

> > Well, now they don't, but it could be done or even exploited as a DoS.
> 
> But so could nonlinear page reclaim. I think we need to restrict nonlinear
> mappings to root if we're worried about that.

Can't we just 'fix' it?


^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-03-07 11:48                                         ` Peter Zijlstra
@ 2007-03-07 12:17                                           ` Nick Piggin
  -1 siblings, 0 replies; 198+ messages in thread
From: Nick Piggin @ 2007-03-07 12:17 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Miklos Szeredi, akpm, mingo, linux-mm, linux-kernel, benh

On Wed, Mar 07, 2007 at 12:48:06PM +0100, Peter Zijlstra wrote:
> On Wed, 2007-03-07 at 12:00 +0100, Nick Piggin wrote:
> > On Wed, Mar 07, 2007 at 11:47:42AM +0100, Peter Zijlstra wrote:
> > > On Wed, 2007-03-07 at 11:38 +0100, Nick Piggin wrote:
> > > 
> > > > > > There are real users who want these fast, though.
> > > > > 
> > > > > Yeah, why don't we have a tree per nonlinear vma to find these pages?
> > > > > 
> > > > > wli mentions shadow page tables..
> > > > 
> > > > We could do something more efficient, but I thought that half the point
> > > > was that they didn't carry any of this extra memory, and they could be
> > > > really fast to set up at the expense of efficiency elsewhere.
> > > 
> > > I'm failing to understand this :-(
> > > 
> > > That extra memory, and apparently they don't want the inefficiency
> 
> s/T/W/
> 
> > > either.
> > 
> > Sorry, I didn't understand your misunderstandings ;)
> 
> Bah, my brain is thick and foggy today. Let us try again;
> 
> Nonlinear vmas exist because many vmas are expensive somehow, right?
> Nonlinear vmas keep the page mapping in the page tables and screw rmaps.
> 
> This 'extra memory' you mentioned would be the overhead of tracking the
> actual ranges?
> 
> And apparently now we want it to not suck on the rmap case :-(

Do we? I think just "work" is the way we've been handling them up until
now. Making them suck less for rmap makes them suck more for what they're
good at.

> Anyway, if used on a non writeback capable backing store (ramfs)
> page_mkclean will never be called. If also mlocked (I think oracle does
> this) then page reclaim will pass over too.
> 
> So we're only interested in the bdi_cap_accounting_dirty and VM_SHARED
> case, right?
> 
> Tracking these ranges on a per-vma basis would avoid taking the mm wide
> mmap_sem and so would be cheaper than regular vmas.
> 
> Would that still be too expensive?

Well, today you can remap N pages in a file arbitrarily for roughly
sizeof(pte_t)*N, plus a tiny bit for the upper page tables and a small
constant for the vma.

At best, you need an extra pointer to pte / vaddr, so you'd basically
double memory overhead.
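
(Rough numbers for illustration, assuming 4 KiB pages and 8-byte ptes:
remapping 1 GiB nonlinearly means N = 262144 ptes, i.e. about 2 MiB of
page tables plus one vma; keeping an extra pte/vaddr pointer per mapped
page adds roughly another 2 MiB, hence the doubling.)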

> > > Well, now they don't, but it could be done or even exploited as a DoS.
> > 
> > But so could nonlinear page reclaim. I think we need to restrict nonlinear
> > mappings to root if we're worried about that.
> 
> Can't we just 'fix' it?

The thing is, I don't think anybody who uses these things cares
about any of the 'problems' you want to fix, do they? We are
interested in dirty pages only for the correctness issue, rather
than performance. Same as reclaim.


^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-03-07 11:00                                       ` Nick Piggin
@ 2007-03-07 12:22                                         ` Bill Irwin
  -1 siblings, 0 replies; 198+ messages in thread
From: Bill Irwin @ 2007-03-07 12:22 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Peter Zijlstra, Miklos Szeredi, akpm, mingo, linux-mm,
	linux-kernel, benh

On Wed, Mar 07, 2007 at 11:47:42AM +0100, Peter Zijlstra wrote:
>> Well, now they don't, but it could be done or even exploited as a DoS.

On Wed, Mar 07, 2007 at 12:00:36PM +0100, Nick Piggin wrote:
> But so could nonlinear page reclaim. I think we need to restrict nonlinear
> mappings to root if we're worried about that.

Please not root. The users really don't want to be privileged. UML
itself is at least partly for use as privilege isolation of the guest
workload. Oracle has some of the same concerns itself, which is part of
why it uses separate processes heavily, even: to isolate instances from
each other.


-- wli

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-03-07 12:22                                         ` Bill Irwin
@ 2007-03-07 12:36                                           ` Nick Piggin
  -1 siblings, 0 replies; 198+ messages in thread
From: Nick Piggin @ 2007-03-07 12:36 UTC (permalink / raw)
  To: Bill Irwin, Peter Zijlstra, Miklos Szeredi, akpm, mingo,
	linux-mm, linux-kernel, benh

On Wed, Mar 07, 2007 at 04:22:24AM -0800, Bill Irwin wrote:
> On Wed, Mar 07, 2007 at 11:47:42AM +0100, Peter Zijlstra wrote:
> >> Well, now they don't, but it could be done or even exploited as a DoS.
> 
> On Wed, Mar 07, 2007 at 12:00:36PM +0100, Nick Piggin wrote:
> > But so could nonlinear page reclaim. I think we need to restrict nonlinear
> > mappings to root if we're worried about that.
> 
> Please not root. The users really don't want to be privileged. UML
> itself is at least partly for use as privilege isolation of the guest
> workload. Oracle has some of the same concerns itself, which is part of
> why it uses separate processes heavily, even: to isolate instances from
> each other.

Well non-root users could be allowed to work on mlocked regions on
tmpfs/shm. That way they avoid the pathological nonlinear problems,
and can work within the mlock ulimit.

That is, if we are worried about such a DoS.

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-03-07 12:17                                           ` Nick Piggin
@ 2007-03-07 12:41                                             ` Peter Zijlstra
  -1 siblings, 0 replies; 198+ messages in thread
From: Peter Zijlstra @ 2007-03-07 12:41 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Miklos Szeredi, akpm, mingo, linux-mm, linux-kernel, benh, Jeff Dike

On Wed, 2007-03-07 at 13:17 +0100, Nick Piggin wrote:

> > Tracking these ranges on a per-vma basis would avoid taking the mm wide
> > mmap_sem and so would be cheaper than regular vmas.
> > 
> > Would that still be too expensive?
> 
> Well you can today remap N pages in a file, arbitrarily for
> sizeof(pte_t)*tiny bit for the upper page tables + small constant
> for the vma.
> 
> At best, you need an extra pointer to pte / vaddr, so you'd basically
> double memory overhead.

I was hoping some form of range compression would gain something, but if
it's a fully random mapping, then yes a shadow page table would be needed
(still looking into what a pte_chain is)

> > > > Well, now they don't, but it could be done or even exploited as a DoS.
> > > 
> > > But so could nonlinear page reclaim. I think we need to restrict nonlinear
> > > mappings to root if we're worried about that.
> > 
> > Can't we just 'fix' it?
> 
> The thing is, I don't think anybody who uses these things cares
> about any of the 'problems' you want to fix, do they? We are
> interested in dirty pages only for the correctness issue, rather
> than performance. Same as reclaim.

If so, we can just stick to the dead slow but correct 'scan the full
vma' page_mkclean() and nobody would ever trigger it.

What is the DoS scenario wrt reclaim? We really ought to fix that if it's
real; those UML farms run on nothing but nonlinear reclaim, I'd think.



^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-03-07 12:41                                             ` Peter Zijlstra
@ 2007-03-07 13:08                                               ` Nick Piggin
  -1 siblings, 0 replies; 198+ messages in thread
From: Nick Piggin @ 2007-03-07 13:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Miklos Szeredi, akpm, mingo, linux-mm, linux-kernel, benh, Jeff Dike

On Wed, Mar 07, 2007 at 01:41:26PM +0100, Peter Zijlstra wrote:
> On Wed, 2007-03-07 at 13:17 +0100, Nick Piggin wrote:
> 
> > > Tracking these ranges on a per-vma basis would avoid taking the mm wide
> > > mmap_sem and so would be cheaper than regular vmas.
> > > 
> > > Would that still be too expensive?
> > 
> > Well you can today remap N pages in a file, arbitrarily for
> > sizeof(pte_t)*tiny bit for the upper page tables + small constant
> > for the vma.
> > 
> > At best, you need an extra pointer to pte / vaddr, so you'd basically
> > double memory overhead.
> 
> I was hoping some form of range compression would gain something, but if
> its a fully random mapping, then yes a shadow page table would be needed
> (still looking into what a pte_chain is)
> 
> > > > > Well, now they don't, but it could be done or even exploited as a DoS.
> > > > 
> > > > But so could nonlinear page reclaim. I think we need to restrict nonlinear
> > > > mappings to root if we're worried about that.
> > > 
> > > Can't we just 'fix' it?
> > 
> > The thing is, I don't think anybody who uses these things cares
> > about any of the 'problems' you want to fix, do they? We are
> > interested in dirty pages only for the correctness issue, rather
> > than performance. Same as reclaim.
> 
> If so, we can just stick to the dead slow but correct 'scan the full
> vma' page_mkclean() and nobody would ever trigger it.

Not if we restricted it to root and mlocked tmpfs. But then why
wouldn't you just do it with the much more efficient msync walk,
so that if root does want to do writeout via these things, it does
not blow up?

> What is the DoS scenario wrt reclaim? We really ought to fix that if
> real, those UML farms run on nothing but nonlinear reclaim I'd think.

I guess you can just increase the computational complexity of
reclaim quite easily.

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-03-07 13:08                                               ` Nick Piggin
@ 2007-03-07 13:19                                                 ` Peter Zijlstra
  -1 siblings, 0 replies; 198+ messages in thread
From: Peter Zijlstra @ 2007-03-07 13:19 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Miklos Szeredi, akpm, mingo, linux-mm, linux-kernel, benh, Jeff Dike

On Wed, 2007-03-07 at 14:08 +0100, Nick Piggin wrote:

> > > The thing is, I don't think anybody who uses these things cares
> > > about any of the 'problems' you want to fix, do they? We are
> > > interested in dirty pages only for the correctness issue, rather
> > > than performance. Same as reclaim.
> > 
> > If so, we can just stick to the dead slow but correct 'scan the full
> > vma' page_mkclean() and nobody would ever trigger it.
> 
> Not if we restricted it to root and mlocked tmpfs. But then why
> wouldn't you just do it with the much more efficient msync walk,
> so that if root does want to do writeout via these things, it does
> not blow up?

This is all used on ram-based filesystems, right? They all have
BDI_CAP_NO_WRITEBACK afaik, so page_mkclean will never get called
anyway. Mlock doesn't avoid getting page_mkclean called.

Those who use this on a 'real' filesystem will get hit in the face by a
linear scanning page_mkclean(), but AFAIK nobody does this anyway.

Restricting it to root for such filesystems is unwanted, that'd severely
handicap both UML and Oracle as I understand it (are there other users
of this feature around?)

msync() might never get called and then we're back with the old
behaviour where we can surprise the VM with a ton of dirty pages.

> > What is the DoS scenario wrt reclaim? We really ought to fix that if
> > real, those UML farms run on nothing but nonlinear reclaim I'd think.
> 
> I guess you can just increase the computational complexity of
> reclaim quite easily.

Right, on first glance it doesn't look to be too bad, but I should take
a closer look.


^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-03-07 13:19                                                 ` Peter Zijlstra
@ 2007-03-07 13:36                                                   ` Nick Piggin
  -1 siblings, 0 replies; 198+ messages in thread
From: Nick Piggin @ 2007-03-07 13:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Miklos Szeredi, akpm, mingo, linux-mm, linux-kernel, benh, Jeff Dike

On Wed, Mar 07, 2007 at 02:19:22PM +0100, Peter Zijlstra wrote:
> On Wed, 2007-03-07 at 14:08 +0100, Nick Piggin wrote:
> 
> > > > The thing is, I don't think anybody who uses these things cares
> > > > about any of the 'problems' you want to fix, do they? We are
> > > > interested in dirty pages only for the correctness issue, rather
> > > > than performance. Same as reclaim.
> > > 
> > > If so, we can just stick to the dead slow but correct 'scan the full
> > > vma' page_mkclean() and nobody would ever trigger it.
> > 
> > Not if we restricted it to root and mlocked tmpfs. But then why
> > wouldn't you just do it with the much more efficient msync walk,
> > so that if root does want to do writeout via these things, it does
> > not blow up?
> 
> This is all used on ram based filesystems right, they all have
> BDI_CAP_NO_WRITEBACK afaik, so page_mkclean will never get called
> anyway. Mlock doesn't avoid getting page_mkclean called.
> 
> Those who use this on a 'real' filesystem will get hit in the face by a
> linear scanning page_mkclean(), but AFAIK nobody does this anyway.

But somebody might do it. I just don't know why you'd want to make
this _worse_ when the msync option would work?

> Restricting it to root for such filesystems is unwanted, that'd severely
> handicap both UML and Oracle as I understand it (are there other users
> of this feature around?)

Why? I think they all use tmpfs backings, don't they?

> msync() might never get called and then we're back with the old
> behaviour where we can surprise the VM with a ton of dirty pages.

But we're root. With your patch, root *can't* do nonlinear writeback
well. Ever. With msync, at least you give them enough rope.

> > > What is the DoS scenario wrt reclaim? We really ought to fix that if
> > > real, those UML farms run on nothing but nonlinear reclaim I'd think.
> > 
> > I guess you can just increase the computational complexity of
> > reclaim quite easily.
> 
> Right, on first glance it doesn't look to be too bad, but I should take
> a closer look.

Well I don't think UML uses nonlinear yet anyway, does it? Can they
make do with restricting nonlinear to mlocked vmas, I wonder? Probably
not.


^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-03-07 13:36                                                   ` Nick Piggin
@ 2007-03-07 13:52                                                     ` Peter Zijlstra
  -1 siblings, 0 replies; 198+ messages in thread
From: Peter Zijlstra @ 2007-03-07 13:52 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Miklos Szeredi, akpm, mingo, linux-mm, linux-kernel, benh, Jeff Dike

On Wed, 2007-03-07 at 14:36 +0100, Nick Piggin wrote:
> On Wed, Mar 07, 2007 at 02:19:22PM +0100, Peter Zijlstra wrote:
> > On Wed, 2007-03-07 at 14:08 +0100, Nick Piggin wrote:
> > 
> > > > > The thing is, I don't think anybody who uses these things cares
> > > > > about any of the 'problems' you want to fix, do they? We are
> > > > > interested in dirty pages only for the correctness issue, rather
> > > > > than performance. Same as reclaim.
> > > > 
> > > > If so, we can just stick to the dead slow but correct 'scan the full
> > > > vma' page_mkclean() and nobody would ever trigger it.
> > > 
> > > Not if we restricted it to root and mlocked tmpfs. But then why
> > > wouldn't you just do it with the much more efficient msync walk,
> > > so that if root does want to do writeout via these things, it does
> > > not blow up?
> > 
> > This is all used on ram based filesystems right, they all have
> > BDI_CAP_NO_WRITEBACK afaik, so page_mkclean will never get called
> > anyway. Mlock doesn't avoid getting page_mkclean called.
> > 
> > Those who use this on a 'real' filesystem will get hit in the face by a
> > linear scanning page_mkclean(), but AFAIK nobody does this anyway.
> 
> But somebody might do it. I just don't know why you'd want to make
> this _worse_ when the msync option would work?
> 
> > Restricting it to root for such filesystems is unwanted, that'd severely
> > handicap both UML and Oracle as I understand it (are there other users
> > of this feature around?)
> 
> Why? I think they all use tmpfs backings, don't they?

Ooh, you only want to restrict remap_file_pages on mappings from bdi's
without BDI_CAP_NO_WRITEBACK. Sure, I can live with that, and I suspect
others can as well.
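
A minimal sketch of what such a check could look like in
sys_remap_file_pages() (illustrative only; not the patch posted later
in this thread, which tests vma_wants_writenotify() instead):

	struct backing_dev_info *bdi =
			vma->vm_file->f_mapping->backing_dev_info;

	/*
	 * The backing device does real writeback (no BDI_CAP_NO_WRITEBACK),
	 * so only privileged users may make this vma nonlinear.
	 */
	if (!(bdi->capabilities & BDI_CAP_NO_WRITEBACK) &&
			!capable(CAP_SYS_ADMIN)) {
		err = -EPERM;
		goto out;
	}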

> > msync() might never get called and then we're back with the old
> > behaviour where we can surprise the VM with a ton of dirty pages.
> 
> But we're root. With your patch, root *can't* do nonlinear writeback
> well. Ever. With msync, at least you give them enough rope.

True. We could even guesstimate the nonlinear dirty pages by subtracting
the result of page_mkclean() from page_mapcount() and force an
msync(MS_ASYNC) on said mapping (or all (nonlinear) mappings of the
related file) when some threshold gets exceeded.
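
Per page, the estimate could look roughly like this (a sketch only; the
helper name is invented, and as pointed out in the follow-up below,
page_mkclean()'s return value would first need extending, since today it
only counts the dirty ptes it actually cleaned):

	/*
	 * Sketch: how many mappings of this page did page_mkclean() fail
	 * to reach?  Those are likely nonlinear ptes that may still be
	 * dirty.
	 */
	static int nonlinear_dirty_estimate(struct page *page)
	{
		int cleaned = page_mkclean(page);	/* dirty ptes cleaned via rmap */
		int mapped = page_mapcount(page);	/* all ptes, linear and nonlinear */

		return mapped - cleaned;
	}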

> > > > What is the DoS scenario wrt reclaim? We really ought to fix that if
> > > > real, those UML farms run on nothing but nonlinear reclaim I'd think.
> > > 
> > > I guess you can just increase the computational complexity of
> > > reclaim quite easily.
> > 
> > Right, on first glance it doesn't look to be too bad, but I should take
> > a closer look.
> 
> Well I don't think UML uses nonlinear yet anyway, does it? Can they
> make do with restricting nonlinear to mlocked vmas, I wonder? Probably
> not.

I think it does, but let's ask, Jeff?


^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-03-07 13:36                                                   ` Nick Piggin
@ 2007-03-07 13:53                                                     ` Miklos Szeredi
  -1 siblings, 0 replies; 198+ messages in thread
From: Miklos Szeredi @ 2007-03-07 13:53 UTC (permalink / raw)
  To: npiggin
  Cc: a.p.zijlstra, miklos, akpm, mingo, linux-mm, linux-kernel, benh, jdike

> On Wed, Mar 07, 2007 at 02:19:22PM +0100, Peter Zijlstra wrote:
> > On Wed, 2007-03-07 at 14:08 +0100, Nick Piggin wrote:
> > 
> > > > > The thing is, I don't think anybody who uses these things cares
> > > > > about any of the 'problems' you want to fix, do they? We are
> > > > > interested in dirty pages only for the correctness issue, rather
> > > > > than performance. Same as reclaim.
> > > > 
> > > > If so, we can just stick to the dead slow but correct 'scan the full
> > > > vma' page_mkclean() and nobody would ever trigger it.
> > > 
> > > Not if we restricted it to root and mlocked tmpfs. But then why
> > > wouldn't you just do it with the much more efficient msync walk,
> > > so that if root does want to do writeout via these things, it does
> > > not blow up?
> > 
> > This is all used on ram based filesystems right, they all have
> > BDI_CAP_NO_WRITEBACK afaik, so page_mkclean will never get called
> > anyway. Mlock doesn't avoid getting page_mkclean called.
> > 
> > Those who use this on a 'real' filesystem will get hit in the face by a
> > linear scanning page_mkclean(), but AFAIK nobody does this anyway.
> 
> But somebody might do it. I just don't know why you'd want to make
> this _worse_ when the msync option would work?
> 
> > Restricting it to root for such filesystems is unwanted, that'd severely
> > handicap both UML and Oracle as I understand it (are there other users
> > of this feature around?)
> 
> Why? I think they all use tmpfs backings, don't they?
> 
> > msync() might never get called and then we're back with the old
> > behaviour where we can surprise the VM with a ton of dirty pages.
> 
> But we're root. With your patch, root *can't* do nonlinear writeback
> well. Ever. With msync, at least you give them enough rope.

Restricting to root doesn't buy you much, nobody wants to be root.
Restricting to mlock is similarly pointless.  UML _will_ want to get
swapped out if there's no activity.

Restricting to tmpfs makes sense, but it's probably not what UML
wants.

Conclusion: there's no good solution for UML in kernel-space.

Miklos

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-03-07 13:52                                                     ` Peter Zijlstra
@ 2007-03-07 13:56                                                       ` Miklos Szeredi
  -1 siblings, 0 replies; 198+ messages in thread
From: Miklos Szeredi @ 2007-03-07 13:56 UTC (permalink / raw)
  To: a.p.zijlstra
  Cc: npiggin, miklos, akpm, mingo, linux-mm, linux-kernel, benh, jdike

> > Well I don't think UML uses nonlinear yet anyway, does it? Can they
> > make do with restricting nonlinear to mlocked vmas, I wonder? Probably
> > not.
> 
> I think it does, but let's ask, Jeff?

Looks like it doesn't:

$ grep -r remap_file_pages arch/um/
$

Miklos

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-03-07 13:52                                                     ` Peter Zijlstra
@ 2007-03-07 14:34                                                       ` Peter Zijlstra
  -1 siblings, 0 replies; 198+ messages in thread
From: Peter Zijlstra @ 2007-03-07 14:34 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Miklos Szeredi, akpm, mingo, linux-mm, linux-kernel, benh, Jeff Dike

On Wed, 2007-03-07 at 14:52 +0100, Peter Zijlstra wrote:

> True. We could even guesstimate the nonlinear dirty pages by subtracting
> the result of page_mkclean() from page_mapcount() and force an
> msync(MS_ASYNC) on said mapping (or all (nonlinear) mappings of the
> related file) when some threshold gets exceeded.

Almost, but not quite, we'd need to extract another value from the
page_mkclean() run, the actual number of mappings encountered. The
return value only sums the number of dirty mappings encountered.

s390 would already work I guess.

Certainly doable.


^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-03-07 13:53                                                     ` Miklos Szeredi
@ 2007-03-07 14:50                                                       ` Nick Piggin
  -1 siblings, 0 replies; 198+ messages in thread
From: Nick Piggin @ 2007-03-07 14:50 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: a.p.zijlstra, akpm, mingo, linux-mm, linux-kernel, benh, jdike

On Wed, Mar 07, 2007 at 02:53:07PM +0100, Miklos Szeredi wrote:
> > > msync() might never get called and then we're back with the old
> > > behaviour where we can surprise the VM with a ton of dirty pages.
> > 
> > But we're root. With your patch, root *can't* do nonlinear writeback
> > well. Ever. With msync, at least you give them enough rope.
> 
> Restricting to root doesn't buy you much, nobody wants to be root.
> Restricting to mlock is similarly pointless.  UML _will_ want to get
> swapped out if there's no activity.

They could always not use nonlinear, or we could add a ulimit to the
size of nonlinear vaddr allowed. 

> Restricting to tmpfs makes sense, but it's probably not what UML
> wants.

I think it is OK. They might want some persistent storage to migrate
or something, but that can always be done by copying from tmpfs to
a block based filesystem.

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-03-07 14:34                                                       ` Peter Zijlstra
@ 2007-03-07 15:01                                                         ` Nick Piggin
  -1 siblings, 0 replies; 198+ messages in thread
From: Nick Piggin @ 2007-03-07 15:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Miklos Szeredi, akpm, mingo, linux-mm, linux-kernel, benh, Jeff Dike

On Wed, Mar 07, 2007 at 03:34:27PM +0100, Peter Zijlstra wrote:
> On Wed, 2007-03-07 at 14:52 +0100, Peter Zijlstra wrote:
> 
> > True. We could even guesstimate the nonlinear dirty pages by subtracting
> > the result of page_mkclean() from page_mapcount() and force an
> > msync(MS_ASYNC) on said mapping (or all (nonlinear) mappings of the
> > related file) when some threshold gets exceeded.
> 
> Almost, but not quite, we'd need to extract another value from the
> page_mkclean() run, the actual number of mappings encountered. The
> return value only sums the number of dirty mappings encountered.
> 
> s390 would already work I guess.
> 
> Certainly doable.

But if we restrict it to root only, and have a note in the man page
about it, then it really isn't worth cluttering up the kernel.


^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-03-07 13:52                                                     ` Peter Zijlstra
@ 2007-03-07 15:10                                                       ` Jeff Dike
  -1 siblings, 0 replies; 198+ messages in thread
From: Jeff Dike @ 2007-03-07 15:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nick Piggin, Miklos Szeredi, akpm, mingo, linux-mm, linux-kernel, benh

On Wed, Mar 07, 2007 at 02:52:12PM +0100, Peter Zijlstra wrote:
> > Well I don't think UML uses nonlinear yet anyway, does it? Can they
> > make do with restricting nonlinear to mlocked vmas, I wonder? Probably
> > not.
> 
> I think it does, but let's ask, Jeff?

Nope, UML needs to be able to change permissions as well as locations.

Would be nice, though, there are apparently nice UML speedups with it.

				Jeff

-- 
Work email - jdike at linux dot intel dot com

^ permalink raw reply	[flat|nested] 198+ messages in thread

* [RFC][PATCH] mm: fix page_mkclean() vs non-linear vmas
  2007-03-07 15:01                                                         ` Nick Piggin
@ 2007-03-07 16:58                                                           ` Peter Zijlstra
  -1 siblings, 0 replies; 198+ messages in thread
From: Peter Zijlstra @ 2007-03-07 16:58 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Miklos Szeredi, akpm, mingo, linux-mm, linux-kernel, benh,
	Jeff Dike, hugh, Linus Torvalds


compile tested only so far

---

Partial revert of commit: 204ec841fbea3e5138168edbc3a76d46747cc987

Non-linear vmas aren't properly handled by page_mkclean() and fixing that
would result in linear scans of all related non-linear vmas per page_mkclean()
invocation.

This is deemed too costly, hence re-instate the msync scan for non-linear vmas.

However this can lead to double IO:

 - pages get instantiated with an RO mapping
 - page takes write fault, and gets marked with PG_dirty
 - page gets tagged for writeout and calls page_mkclean()
 - page_mkclean() fails to find the dirty pte (and clean it)
 - writeout happens and PG_dirty gets cleared.
 - user calls msync, the dirty pte is found and the page marked with PG_dirty
 - the page gets written out _again_ even though it's not re-dirtied.

To minimize this, reset the protection when creating a nonlinear vma.

I'm not at all happy with this, but plain disallowing remap_file_pages on bdis
without BDI_CAP_NO_WRITEBACK seems to offend some people, hence restrict it to
root only.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 mm/fremap.c |   21 ++++++++
 mm/msync.c  |  146 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 162 insertions(+), 5 deletions(-)

Index: linux-2.6-git/mm/msync.c
===================================================================
--- linux-2.6-git.orig/mm/msync.c	2007-03-07 17:18:09.000000000 +0100
+++ linux-2.6-git/mm/msync.c	2007-03-07 17:31:29.000000000 +0100
@@ -7,12 +7,123 @@
 /*
  * The msync() system call.
  */
+#include <linux/slab.h>
+#include <linux/pagemap.h>
 #include <linux/fs.h>
 #include <linux/mm.h>
 #include <linux/mman.h>
+#include <linux/hugetlb.h>
+#include <linux/writeback.h>
 #include <linux/file.h>
 #include <linux/syscalls.h>
 
+#include <asm/pgtable.h>
+#include <asm/tlbflush.h>
+
+static unsigned long msync_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
+				unsigned long addr, unsigned long end)
+{
+	pte_t *pte;
+	spinlock_t *ptl;
+	int progress = 0;
+	unsigned long ret = 0;
+
+again:
+	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	do {
+		struct page *page;
+
+		if (progress >= 64) {
+			progress = 0;
+			if (need_resched() || need_lockbreak(ptl))
+				break;
+		}
+		progress++;
+		if (!pte_present(*pte))
+			continue;
+		if (!pte_maybe_dirty(*pte))
+			continue;
+		page = vm_normal_page(vma, addr, *pte);
+		if (!page)
+			continue;
+
+		/*
+		 * Only non-linear vmas reach here, resetting the RO state
+		 * has no use, since page_mkclean doesn't work for them anyway.
+		 * It might even cause extra IO.
+		 */
+		if (ptep_clear_flush_dirty(vma, addr, pte) ||
+				page_test_and_clear_dirty(page))
+			ret += set_page_dirty(page);
+		progress += 3;
+	} while (pte++, addr += PAGE_SIZE, addr != end);
+	pte_unmap_unlock(pte - 1, ptl);
+	cond_resched();
+	if (addr != end)
+		goto again;
+	return ret;
+}
+
+static inline unsigned long msync_pmd_range(struct vm_area_struct *vma,
+			pud_t *pud, unsigned long addr, unsigned long end)
+{
+	pmd_t *pmd;
+	unsigned long next;
+	unsigned long ret = 0;
+
+	pmd = pmd_offset(pud, addr);
+	do {
+		next = pmd_addr_end(addr, end);
+		if (pmd_none_or_clear_bad(pmd))
+			continue;
+		ret += msync_pte_range(vma, pmd, addr, next);
+	} while (pmd++, addr = next, addr != end);
+	return ret;
+}
+
+static inline unsigned long msync_pud_range(struct vm_area_struct *vma,
+			pgd_t *pgd, unsigned long addr, unsigned long end)
+{
+	pud_t *pud;
+	unsigned long next;
+	unsigned long ret = 0;
+
+	pud = pud_offset(pgd, addr);
+	do {
+		next = pud_addr_end(addr, end);
+		if (pud_none_or_clear_bad(pud))
+			continue;
+		ret += msync_pmd_range(vma, pud, addr, next);
+	} while (pud++, addr = next, addr != end);
+	return ret;
+}
+
+static unsigned long msync_page_range(struct vm_area_struct *vma,
+				unsigned long addr, unsigned long end)
+{
+	pgd_t *pgd;
+	unsigned long next;
+	unsigned long ret = 0;
+
+	/* For hugepages we can't go walking the page table normally,
+	 * but that's ok, hugetlbfs is memory based, so we don't need
+	 * to do anything more on an msync().
+	 */
+	if (vma->vm_flags & VM_HUGETLB)
+		return 0;
+
+	BUG_ON(addr >= end);
+	pgd = pgd_offset(vma->vm_mm, addr);
+	flush_cache_range(vma, addr, end);
+	do {
+		next = pgd_addr_end(addr, end);
+		if (pgd_none_or_clear_bad(pgd))
+			continue;
+		ret += msync_pud_range(vma, pgd, addr, next);
+	} while (pgd++, addr = next, addr != end);
+	return ret;
+}
+
 /*
  * MS_SYNC syncs the entire file - including mappings.
  *
@@ -27,6 +138,21 @@
  * So by _not_ starting I/O in MS_ASYNC we provide complete flexibility to
  * applications.
  */
+static int msync_interval(struct vm_area_struct *vma, unsigned long addr,
+			unsigned long end, int flags,
+			unsigned long *nr_pages_dirtied)
+{
+	struct file *file = vma->vm_file;
+
+	if ((flags & MS_INVALIDATE) && (vma->vm_flags & VM_LOCKED))
+		return -EBUSY;
+
+	if (file && (vma->vm_flags & VM_SHARED) &&
+			(vma->vm_flags & VM_NONLINEAR))
+		*nr_pages_dirtied = msync_page_range(vma, addr, end);
+	return 0;
+}
+
 asmlinkage long sys_msync(unsigned long start, size_t len, int flags)
 {
 	unsigned long end;
@@ -56,6 +182,7 @@ asmlinkage long sys_msync(unsigned long 
 	down_read(&mm->mmap_sem);
 	vma = find_vma(mm, start);
 	for (;;) {
+		unsigned long nr_pages_dirtied = 0;
 		struct file *file;
 
 		/* Still start < end. */
@@ -70,14 +197,23 @@ asmlinkage long sys_msync(unsigned long 
 			unmapped_error = -ENOMEM;
 		}
 		/* Here vma->vm_start <= start < vma->vm_end. */
-		if ((flags & MS_INVALIDATE) &&
-				(vma->vm_flags & VM_LOCKED)) {
-			error = -EBUSY;
+		error = msync_interval(vma, start, min(end, vma->vm_end),
+				flags, &nr_pages_dirtied);
+		if (error)
 			goto out_unlock;
-		}
 		file = vma->vm_file;
 		start = vma->vm_end;
-		if ((flags & MS_SYNC) && file &&
+		if ((flags & MS_ASYNC) && file && nr_pages_dirtied) {
+			get_file(file);
+			up_read(&mm->mmap_sem);
+			balance_dirty_pages_ratelimited_nr(file->f_mapping,
+					nr_pages_dirtied);
+			fput(file);
+			if (start >= end)
+				goto out;
+			down_read(&mm->mmap_sem);
+			vma = find_vma(mm, start);
+		} else if ((flags & MS_SYNC) && file &&
 				(vma->vm_flags & VM_SHARED)) {
 			get_file(file);
 			up_read(&mm->mmap_sem);
Index: linux-2.6-git/mm/fremap.c
===================================================================
--- linux-2.6-git.orig/mm/fremap.c	2007-03-07 17:35:19.000000000 +0100
+++ linux-2.6-git/mm/fremap.c	2007-03-07 17:52:15.000000000 +0100
@@ -15,6 +15,7 @@
 #include <linux/rmap.h>
 #include <linux/module.h>
 #include <linux/syscalls.h>
+#include <linux/capability.h>
 
 #include <asm/mmu_context.h>
 #include <asm/cacheflush.h>
@@ -178,6 +179,16 @@ asmlinkage long sys_remap_file_pages(uns
 	vma = find_vma(mm, start);
 
 	/*
+	 * Don't allow non root to create non-linear mappings on backing
+	 * devices capable of accounting dirty pages.
+	 */
+	if (!(vma->vm_flags & VM_NONLINEAR) && vma_wants_writenotify(vma) &&
+			!capable(CAP_SYS_ADMIN)) {
+		err = -EPERM;
+		goto out;
+	}
+
+	/*
 	 * Make sure the vma is shared, that it supports prefaulting,
 	 * and that the remapped range is valid and fully within
 	 * the single existing vma.  vm_private_data is used as a
@@ -201,6 +212,15 @@ asmlinkage long sys_remap_file_pages(uns
 			mapping = vma->vm_file->f_mapping;
 			spin_lock(&mapping->i_mmap_lock);
 			flush_dcache_mmap_lock(mapping);
+			/*
+			 * reset protection because non-linear maps don't
+			 * work with the fancy dirty page accounting code.
+			 */
+			if (vma_wants_writenotify(vma)) {
+				vma->vm_page_prot =
+					protection_map[vma->vm_flags &
+					(VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)];
+			}
 			vma->vm_flags |= VM_NONLINEAR;
 			vma_prio_tree_remove(vma, &mapping->i_mmap);
 			vma_nonlinear_insert(vma, &mapping->i_mmap_nonlinear);
@@ -218,6 +238,7 @@ asmlinkage long sys_remap_file_pages(uns
 		 * downgrading the lock.  (Locks can't be upgraded).
 		 */
 	}
+out:
 	if (likely(!has_write_lock))
 		up_read(&mm->mmap_sem);
 	else



^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [RFC][PATCH] mm: fix page_mkclean() vs non-linear vmas
  2007-03-07 16:58                                                           ` Peter Zijlstra
@ 2007-03-07 18:00                                                             ` Linus Torvalds
  -1 siblings, 0 replies; 198+ messages in thread
From: Linus Torvalds @ 2007-03-07 18:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nick Piggin, Miklos Szeredi, akpm, mingo, linux-mm, linux-kernel,
	benh, Jeff Dike, hugh



On Wed, 7 Mar 2007, Peter Zijlstra wrote:
> 
> I'm not at all happy with this, but plain disallowing remap_file_pages on bdis
> without BDI_CAP_NO_WRITEBACK seems to offend some people, hence restrict it to
> root only.

I don't think that's a viable approach. Nonlinear mappings would normally 
be used by databases, and you don't want to limit databases to be run by 
root only.

		Linus

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [RFC][PATCH] mm: fix page_mkclean() vs non-linear vmas
  2007-03-07 18:00                                                             ` Linus Torvalds
@ 2007-03-07 18:12                                                               ` Peter Zijlstra
  -1 siblings, 0 replies; 198+ messages in thread
From: Peter Zijlstra @ 2007-03-07 18:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Miklos Szeredi, akpm, mingo, linux-mm, linux-kernel,
	benh, Jeff Dike, hugh

On Wed, 2007-03-07 at 10:00 -0800, Linus Torvalds wrote:
> 
> On Wed, 7 Mar 2007, Peter Zijlstra wrote:
> > 
> > I'm not at all happy with this, but plain disallowing remap_file_pages on bdis
> > without BDI_CAP_NO_WRITEBACK seems to offend some people, hence restrict it to
> > root only.
> 
> I don't think that's a viable approach. Nonlinear mappings would normally 
> be used by databases, and you don't want to limit databases to be run by 
> root only.

It was claimed that they use it on tmpfs only, not on a 'real'
filesystem.


^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [RFC][PATCH] mm: fix page_mkclean() vs non-linear vmas
  2007-03-07 18:12                                                               ` Peter Zijlstra
@ 2007-03-07 18:24                                                                 ` Peter Zijlstra
  -1 siblings, 0 replies; 198+ messages in thread
From: Peter Zijlstra @ 2007-03-07 18:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Miklos Szeredi, akpm, mingo, linux-mm, linux-kernel,
	benh, Jeff Dike, hugh

On Wed, 2007-03-07 at 19:12 +0100, Peter Zijlstra wrote:
> On Wed, 2007-03-07 at 10:00 -0800, Linus Torvalds wrote:
> > 
> > On Wed, 7 Mar 2007, Peter Zijlstra wrote:
> > > 
> > > I'm not at all happy with this, but plain disallowing remap_file_pages on bdis
> > > without BDI_CAP_NO_WRITEBACK seems to offend some people, hence restrict it to
> > > root only.
> > 
> > I don't think that's a viable approach. Nonlinear mappings would normally 
> > be used by databases, and you don't want to limit databases to be run by 
> > root only.
> 
> It was claimed that they use it on tmpfs only, not on a 'real'
> filesystem.

More specifically, databases want to use direct IO (I know you hate it)
and use the nonlinear vma as a buffer area to feed this direct IO.

Mapped IO is unsuited for databases in its current form due to the way
IO errors are handled.
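
For reference, the usage pattern being discussed looks roughly like this
from userspace (a hedged example, not taken from any real application;
the file name and sizes are invented):

#define _GNU_SOURCE
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	size_t win = 16 * 4096;
	int fd = open("/dev/shm/dbbuf", O_RDWR | O_CREAT, 0600);	/* tmpfs backed */
	char *buf;

	ftruncate(fd, 1024 * 4096);		/* file much larger than the window */
	buf = mmap(NULL, win, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	/*
	 * Point the first window page at file page 512; this makes the
	 * vma nonlinear (VM_NONLINEAR).  prot must be 0 here.
	 */
	remap_file_pages(buf, 4096, 0, 512, 0);

	buf[0] = 1;			/* dirties file page 512 via the remapped pte */
	msync(buf, 4096, MS_ASYNC);	/* the writeback path being debated */

	munmap(buf, win);
	close(fd);
	return 0;
}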


^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [RFC][PATCH] mm: fix page_mkclean() vs non-linear vmas
  2007-03-07 16:58                                                           ` Peter Zijlstra
@ 2007-03-08 11:21                                                             ` Miklos Szeredi
  -1 siblings, 0 replies; 198+ messages in thread
From: Miklos Szeredi @ 2007-03-08 11:21 UTC (permalink / raw)
  To: a.p.zijlstra
  Cc: npiggin, miklos, akpm, mingo, linux-mm, linux-kernel, benh,
	jdike, hugh, torvalds

> Partial revert of commit: 204ec841fbea3e5138168edbc3a76d46747cc987
> 
> Non-linear vmas aren't properly handled by page_mkclean() and fixing that
> would result in linear scans of all related non-linear vmas per page_mkclean()
> invocation.
> 
> This is deemed too costly, hence re-instate the msync scan for non-linear vmas.
> 
> However this can lead to double IO:
> 
>  - pages get instantiated with an RO mapping
>  - page takes write fault, and gets marked with PG_dirty
>  - page gets tagged for writeout and calls page_mkclean()
>  - page_mkclean() fails to find the dirty pte (and clean it)
>  - writeout happens and PG_dirty gets cleared.
>  - user calls msync, the dirty pte is found and the page marked with PG_dirty
>  - the page gets written out _again_ even though it's not re-dirtied.
> 
> To minimize this, reset the protection when creating a nonlinear vma.
> 
> I'm not at all happy with this, but plain disallowing
> remap_file_pages on bdis without BDI_CAP_NO_WRITEBACK seems to
> offend some people, hence restrict it to root only.

Root only for !BDI_CAP_NO_WRITEBACK mappings doesn't make sense
because:

  - just encourages insecure applications

  - there are no current users that want this and presumably no future
    users either

  - it's a maintenance burden: I'll have to layer the m/ctime update
    patch on top of this

  - the only pro for this has been that Nick thinks it cool ;)

I think the proper way to deal with this is to

  - allow BDI_CAP_NO_WRITEBACK (tmpfs/ramfs) uses, makes database
    people happy

  - for !BDI_CAP_NO_WRITEBACK emulate using do_mmap_pgoff(), should be
    trivial, no userspace ABI breakage
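
A rough sketch of that emulation (illustrative; prot handling, locking
and error paths are elided, and the exact call site inside
sys_remap_file_pages() is an assumption):

	/*
	 * Fallback for !BDI_CAP_NO_WRITEBACK backings: instead of
	 * rewriting ptes in place, install an ordinary linear
	 * MAP_SHARED|MAP_FIXED mapping of the requested file offset.
	 * Assumes mmap_sem is held for writing.
	 */
	addr = do_mmap_pgoff(vma->vm_file, start, size,
			PROT_READ | PROT_WRITE,	/* really: derived from the vma */
			MAP_SHARED | MAP_FIXED, pgoff);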

Miklos

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [RFC][PATCH] mm: fix page_mkclean() vs non-linear vmas
  2007-03-08 11:21                                                             ` Miklos Szeredi
@ 2007-03-08 11:37                                                               ` Peter Zijlstra
  -1 siblings, 0 replies; 198+ messages in thread
From: Peter Zijlstra @ 2007-03-08 11:37 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: npiggin, akpm, mingo, linux-mm, linux-kernel, benh, jdike, hugh,
	torvalds

On Thu, 2007-03-08 at 12:21 +0100, Miklos Szeredi wrote:
> > Partial revert of commit: 204ec841fbea3e5138168edbc3a76d46747cc987
> > 
> > Non-linear vmas aren't properly handled by page_mkclean() and fixing that
> > would result in linear scans of all related non-linear vmas per page_mkclean()
> > invocation.
> > 
> > This is deemed too costly, hence re-instate the msync scan for non-linear vmas.
> > 
> > However this can lead to double IO:
> > 
> >  - pages get instanciated with RO mapping
> >  - page takes write fault, and gets marked with PG_dirty
> >  - page gets tagged for writeout and calls page_mkclean()
> >  - page_mkclean() fails to find the dirty pte (and clean it)
> >  - writeout happens and PG_dirty gets cleared.
> >  - user calls msync, the dirty pte is found and the page marked with PG_dirty
> >  - the page gets writen out _again_ even though its not re-dirtied.
> > 
> > To minimize this reset the protection when creating a nonlinear vma.
> > 
> > I'm not at all happy with this, but plain disallowing
> > remap_file_pages on bdis without BDI_CAP_NO_WRITEBACK seems to
> > offend some people, hence restrict it to root only.
> 
> Root only for !BDI_CAP_NO_WRITEBACK mappings doesn't make sense
> because:
> 
>   - just encourages insecure applications
> 
>   - there are no current users that want this and presumable no future
>     uses either

AFAIK no other OS does this against regular filesystems (hearsay)

>   - it's a maintenance burden: I'll have to layer the m/ctime update
>     patch on top of this
> 
>   - the only pro for this has been that Nick thinks it cool ;)
> 
> I think the proper way to deal with this is to
> 
>   - allow BDI_CAP_NO_WRITEBACK (tmpfs/ramfs) uses, makes database
>     people happy

And UML once the remap_file_pages_prot() stuff is merged.

>   - for !BDI_CAP_NO_WRITEBACK emulate using do_mmap_pgoff(), should be
>     trivial, no userspace ABI breakage

I can live with that.

However this still leaves the non-linear reclaim (Nick pointed it out as
a potential DoS and other people have corroborated this). I have no idea
what to do about that.

Oracle seems to mlock these things anyway, but UML surely would not.


^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [RFC][PATCH] mm: fix page_mkclean() vs non-linear vmas
  2007-03-08 11:37                                                               ` Peter Zijlstra
@ 2007-03-08 11:48                                                                 ` Miklos Szeredi
  -1 siblings, 0 replies; 198+ messages in thread
From: Miklos Szeredi @ 2007-03-08 11:48 UTC (permalink / raw)
  To: a.p.zijlstra
  Cc: npiggin, akpm, mingo, linux-mm, linux-kernel, benh, jdike, hugh,
	torvalds

> However this still leaves the non-linear reclaim (Nick pointed it out as
> a potential DoS and other people have corroborated this). I have no idea
> on that to do about that.

OK, but that is a completely different problem, not affecting
page_mkclean() or msync().

And it doesn't sound too hard to solve: when the current algorithm doesn't
seem to be making progress, it will have to be done the hard way,
searching for all the nonlinear ptes of a page to unmap.
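
The "hard way" would be a walk along these lines -- purely a sketch against
the 2.6.20-era rmap structures; lookup_pte() is an invented helper, and the
locking (i_mmap_lock, pte locks) and the actual unmapping are glossed over:

static void scan_nonlinear_for_page(struct address_space *mapping,
                                    struct page *page)
{
        struct vm_area_struct *vma;
        unsigned long addr;

        /* every nonlinear vma of this file has to be scanned in full,
         * because the page can sit at any virtual address in it */
        list_for_each_entry(vma, &mapping->i_mmap_nonlinear,
                            shared.vm_set.list) {
                for (addr = vma->vm_start; addr < vma->vm_end;
                     addr += PAGE_SIZE) {
                        pte_t *pte = lookup_pte(vma->vm_mm, addr);

                        if (pte && pte_present(*pte) &&
                            pte_pfn(*pte) == page_to_pfn(page)) {
                                /* unmap it, preserving the file offset in a
                                 * pte_file() entry as try_to_unmap_one()
                                 * does for nonlinear mappings */
                        }
                }
        }
}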

Miklos

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [RFC][PATCH] mm: fix page_mkclean() vs non-linear vmas
  2007-03-08 11:21                                                             ` Miklos Szeredi
@ 2007-03-08 11:58                                                               ` Nick Piggin
  -1 siblings, 0 replies; 198+ messages in thread
From: Nick Piggin @ 2007-03-08 11:58 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: a.p.zijlstra, akpm, mingo, linux-mm, linux-kernel, benh, jdike,
	hugh, torvalds

On Thu, Mar 08, 2007 at 12:21:01PM +0100, Miklos Szeredi wrote:
> > Partial revert of commit: 204ec841fbea3e5138168edbc3a76d46747cc987
> > 
> > Non-linear vmas aren't properly handled by page_mkclean() and fixing that
> > would result in linear scans of all related non-linear vmas per page_mkclean()
> > invocation.
> > 
> > This is deemed too costly, hence re-instate the msync scan for non-linear vmas.
> > 
> > However this can lead to double IO:
> > 
> >  - pages get instanciated with RO mapping
> >  - page takes write fault, and gets marked with PG_dirty
> >  - page gets tagged for writeout and calls page_mkclean()
> >  - page_mkclean() fails to find the dirty pte (and clean it)
> >  - writeout happens and PG_dirty gets cleared.
> >  - user calls msync, the dirty pte is found and the page marked with PG_dirty
> >  - the page gets writen out _again_ even though its not re-dirtied.
> > 
> > To minimize this reset the protection when creating a nonlinear vma.
> > 
> > I'm not at all happy with this, but plain disallowing
> > remap_file_pages on bdis without BDI_CAP_NO_WRITEBACK seems to
> > offend some people, hence restrict it to root only.
> 
> Root only for !BDI_CAP_NO_WRITEBACK mappings doesn't make sense
> because:
> 
>   - just encourages insecure applications
> 
>   - there are no current users that want this and presumable no future
>     uses either
> 
>   - it's a maintenance burden: I'll have to layer the m/ctime update
>     patch on top of this

But you have to update m/ctime for BDI_CAP_NO_WRITEBACK mappings anyway
don't you?

> 
>   - the only pro for this has been that Nick thinks it cool ;)

Nonlinear in general, rather than this specifically.

> I think the proper way to deal with this is to
> 
>   - allow BDI_CAP_NO_WRITEBACK (tmpfs/ramfs) uses, makes database
>     people happy
> 
>   - for !BDI_CAP_NO_WRITEBACK emulate using do_mmap_pgoff(), should be
>     trivial, no userspace ABI breakage

Yeah that sounds OK.


^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [RFC][PATCH] mm: fix page_mkclean() vs non-linear vmas
  2007-03-08 11:58                                                               ` Nick Piggin
@ 2007-03-08 12:09                                                                 ` Miklos Szeredi
  -1 siblings, 0 replies; 198+ messages in thread
From: Miklos Szeredi @ 2007-03-08 12:09 UTC (permalink / raw)
  To: npiggin
  Cc: a.p.zijlstra, akpm, mingo, linux-mm, linux-kernel, benh, jdike,
	hugh, torvalds

> >   - it's a maintenance burden: I'll have to layer the m/ctime update
> >     patch on top of this
> 
> But you have to update m/ctime for BDI_CAP_NO_WRITEBACK mappings anyway
> don't you?

Yes, but that's a different aspect of msync(), not about the data
writeback issues that nonlinear mappings have.

So a solution that solves both these problems would probably be more
complex.
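
(For context, the m/ctime side boils down to something like the call below
on the first write to a shared file mapping; where exactly to hook it was
still being debated, so treat it as a sketch only:)

        /* somewhere on the shared-mapping write-fault / mkwrite path */
        if ((vma->vm_flags & (VM_WRITE | VM_SHARED)) == (VM_WRITE | VM_SHARED))
                file_update_time(vma->vm_file);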

> >   - the only pro for this has been that Nick thinks it cool ;)
> 
> Nonlinear in general, rather than this specifically.

Fair enough.

Miklos

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [RFC][PATCH] mm: fix page_mkclean() vs non-linear vmas
  2007-03-08 11:48                                                                 ` Miklos Szeredi
@ 2007-03-08 12:11                                                                   ` Peter Zijlstra
  -1 siblings, 0 replies; 198+ messages in thread
From: Peter Zijlstra @ 2007-03-08 12:11 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: npiggin, akpm, mingo, linux-mm, linux-kernel, benh, jdike, hugh,
	torvalds

On Thu, 2007-03-08 at 12:48 +0100, Miklos Szeredi wrote:
> > However this still leaves the non-linear reclaim (Nick pointed it out as
> > a potential DoS and other people have corroborated this). I have no idea
> > on that to do about that.
> 
> OK, but that is a completely different problem, not affecting
> page_mkclean() or msync().
> 
> And it doesn't sound too hard to solve: when current algorithm doesn't
> seem to be making progress, then it will have to be done the hard way,
> searching for for all nonlinear ptes of a page to unmap.

Ah, you see, but that is when you've already lost.

The DoS is about the computational complexity of the reclaim, not whether it
will ever come out of it with free pages.


^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [RFC][PATCH] mm: fix page_mkclean() vs non-linear vmas
  2007-03-08 12:11                                                                   ` Peter Zijlstra
@ 2007-03-08 12:19                                                                     ` Nick Piggin
  -1 siblings, 0 replies; 198+ messages in thread
From: Nick Piggin @ 2007-03-08 12:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Miklos Szeredi, akpm, mingo, linux-mm, linux-kernel, benh, jdike,
	hugh, torvalds

On Thu, Mar 08, 2007 at 01:11:43PM +0100, Peter Zijlstra wrote:
> On Thu, 2007-03-08 at 12:48 +0100, Miklos Szeredi wrote:
> > > However this still leaves the non-linear reclaim (Nick pointed it out as
> > > a potential DoS and other people have corroborated this). I have no idea
> > > on that to do about that.
> > 
> > OK, but that is a completely different problem, not affecting
> > page_mkclean() or msync().
> > 
> > And it doesn't sound too hard to solve: when current algorithm doesn't
> > seem to be making progress, then it will have to be done the hard way,
> > searching for for all nonlinear ptes of a page to unmap.
> 
> Ah, you see, but that is when you've already lost.
> 
> The DoS is about the computational complexity of the reclaim, not if it
> will ever come out of it with free pages.

If we really want to, we could limit it to mlock for !root. This is
a reasonable way to solve the problem, and UML could fall back on the
vma-emulated version if it didn't want to use mlocked memory...

Or we could limit the size/number of nonlinear vmas that could be
created.
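
Either restriction would only be a few lines near the top of
sys_remap_file_pages(); something like the sketch below, where
nr_nonlinear_vmas and sysctl_max_nonlinear are invented names and only
capable(), can_do_mlock() and VM_LOCKED are real:

        if (!capable(CAP_IPC_LOCK)) {
                /* unprivileged users may only make nonlinear mappings out
                 * of mlocked memory, so reclaim never has to hunt for the
                 * ptes */
                if (!(vma->vm_flags & VM_LOCKED) || !can_do_mlock())
                        goto out;       /* -EPERM in the real thing */
                /* and/or cap how many nonlinear vmas may exist at once */
                if (atomic_read(&nr_nonlinear_vmas) >= sysctl_max_nonlinear)
                        goto out;
        }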

But just quietly, I think there are probably a lot of other ways to
perform a local DoS anyway ;) 


^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [RFC][PATCH] mm: fix page_mkclean() vs non-linear vmas
  2007-03-08 12:19                                                                     ` Nick Piggin
@ 2007-03-08 12:25                                                                       ` Miklos Szeredi
  -1 siblings, 0 replies; 198+ messages in thread
From: Miklos Szeredi @ 2007-03-08 12:25 UTC (permalink / raw)
  To: npiggin
  Cc: a.p.zijlstra, miklos, akpm, mingo, linux-mm, linux-kernel, benh,
	jdike, hugh, torvalds

> > > And it doesn't sound too hard to solve: when current algorithm doesn't
> > > seem to be making progress, then it will have to be done the hard way,
> > > searching for for all nonlinear ptes of a page to unmap.
> > 
> > Ah, you see, but that is when you've already lost.
> > 
> > The DoS is about the computational complexity of the reclaim, not if it
> > will ever come out of it with free pages.
> 
> If we really want to, we could limit it to mlock for !root. This is
> a reasonable way to solve the problem, and UML could fall back on
> vma emulated version if they didn't want to use mlock memory...
> 
> Or we could limit the size/number of nonlinear vmas that could be
> created.
> 
> But just quietly, I think there are probably a lot of other ways to
> perform a local DoS anyway ;) 

I agree, requiring apps to mlock would probably just make things
slightly worse for about 100% of users, without any gain.  There could
be a

  /proc/sys/vm/turn_off_nonlinear_for_paranoid_sysadmin

knob that would unconditionally emulate nonlinear vmas.
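
Wiring up such a knob would be a handful of lines (names invented; the
linear fallback it calls is the do_mmap_pgoff() emulation sketched earlier
in the thread, not existing code, and registering the sysctl in
kernel/sysctl.c is omitted):

/* /proc/sys/vm/emulate_nonlinear -- when set, remap_file_pages() never
 * installs nonlinear ptes and instead degrades to ordinary linear vmas */
int sysctl_emulate_nonlinear __read_mostly;

/* in sys_remap_file_pages(): */
        if (sysctl_emulate_nonlinear)
                return remap_file_pages_emulated(vma, start, size, pgoff);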

Miklos

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-03-07  9:44                   ` Bill Irwin
@ 2007-03-08 12:39                     ` Blaisorblade
  -1 siblings, 0 replies; 198+ messages in thread
From: Blaisorblade @ 2007-03-08 12:39 UTC (permalink / raw)
  To: Bill Irwin
  Cc: Nick Piggin, Ingo Molnar, Andrew Morton, Linux Memory Management,
	Linux Kernel, Benjamin Herrenschmidt

On Wednesday 07 March 2007 10:44, Bill Irwin wrote:
> On Wed, Mar 07, 2007 at 10:28:21AM +0100, Nick Piggin wrote:
> > Depending on whether anyone wants it, and what features they want, we
> > could emulate the old syscall, and make a new restricted one which is
> > much less intrusive.
> > For example, if we can operate only on MAP_ANONYMOUS memory and specify
> > that nonlinear mappings effectively mlock the pages, then we can get
> > rid of all the objrmap and unmap_mapping_range handling, forget about
> > the writeout and msync problems...
>
> Anonymous-only would make it a doorstop for Oracle, since its entire
> motive for using it is to window into objects larger than user virtual
> address spaces (this likely also applies to UML, though they should
> really chime in to confirm).

We need it for shared file mappings (for tmpfs only).

Our scenario is:
RAM is implemented through a shared mapped file, kept on tmpfs (except by dumb 
users); various processes share an fd for this file (it's opened and 
immediately deleted).

We maintain page tables in x86 style, and TLB flush is implemented through 
mmap()/munmap()/mprotect().

Having a VMA for each 4K page is not the intended VMA usage: for instance, the
default /proc/sys/vm/max_map_count (64K) is saturated by a UML process with
64K * 4K = 256M of resident memory.
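
To make that concrete, the two ways a UML-style monitor can make one guest
page visible in its "physical memory" file look roughly like this
(userspace sketch, names invented, only the syscalls themselves are real):

#define _GNU_SOURCE
#include <sys/mman.h>

#define PAGE    4096UL

/* 'umem' is a single MAP_SHARED mapping of the deleted tmpfs file backing
 * guest RAM; 'slot' is the position inside that window and 'guest_pfn'
 * the page of the file that should appear there */
static int map_guest_page(void *umem, int guest_fd,
                          unsigned long slot, unsigned long guest_pfn)
{
#ifdef ONE_VMA_PER_PAGE
        /* each single-page mmap() at a different file offset is its own
         * vma, so 64K resident pages (256M) hit max_map_count */
        void *p = mmap((char *)umem + slot * PAGE, PAGE,
                       PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED,
                       guest_fd, guest_pfn * PAGE);
        return p == MAP_FAILED ? -1 : 0;
#else
        /* remap_file_pages() just rewrites the pte inside the one big vma,
         * so the vma count stays constant however much is resident */
        return remap_file_pages((char *)umem + slot * PAGE, PAGE,
                                0, guest_pfn, 0);
#endif
}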

> Restrictions to tmpfs and/or ramfs would 
> likely be liveable, though I suspect some things might want to do it to
> shm segments (I'll ask about that one).

> There's definitely no need for a 
> persistent backing store for the object to be remapped in Oracle's case,
> in any event. It's largely the in-core destination and source of IO, not
> something saved on-disk itself.
>
>
> -- wli

-- 
Inform me of my mistakes, so I can add them to my list!
Paolo Giarrusso, aka Blaisorblade
http://www.user-mode-linux.org/~blaisorblade
Chat with your friends in real time!
 http://it.yahoo.com/mail_it/foot/*http://it.messenger.yahoo.com

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-03-07 10:02                       ` Nick Piggin
@ 2007-03-12 23:01                         ` Blaisorblade
  -1 siblings, 0 replies; 198+ messages in thread
From: Blaisorblade @ 2007-03-12 23:01 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Bill Irwin, Ingo Molnar, Andrew Morton, Linux Memory Management,
	Linux Kernel, Benjamin Herrenschmidt

On Wednesday 07 March 2007 11:02, Nick Piggin wrote:
> On Wed, Mar 07, 2007 at 10:49:47AM +0100, Nick Piggin wrote:
> > On Wed, Mar 07, 2007 at 01:44:20AM -0800, Bill Irwin wrote:
> > > On Wed, Mar 07, 2007 at 10:28:21AM +0100, Nick Piggin wrote:
> > > > Depending on whether anyone wants it, and what features they want, we
> > > > could emulate the old syscall, and make a new restricted one which is
> > > > much less intrusive.
> > > > For example, if we can operate only on MAP_ANONYMOUS memory and
> > > > specify that nonlinear mappings effectively mlock the pages, then we
> > > > can get rid of all the objrmap and unmap_mapping_range handling,
> > > > forget about the writeout and msync problems...
> > >
> > > Anonymous-only would make it a doorstop for Oracle, since its entire
> > > motive for using it is to window into objects larger than user virtual
> >
> > Uh, duh yes I don't mean MAP_ANONYMOUS, I was just thinking of the shmem
> > inode that sits behind MAP_ANONYMOUS|MAP_SHARED. Of course if you don't
> > have a file descriptor to get a pgoff, then remap_file_pages is a
> > doorstop for everyone ;)
> >
> > > address spaces (this likely also applies to UML, though they should
> > > really chime in to confirm). Restrictions to tmpfs and/or ramfs would
> > > likely be liveable, though I suspect some things might want to do it to
> > > shm segments (I'll ask about that one). There's definitely no need for
> > > a persistent backing store for the object to be remapped in Oracle's
> > > case, in any event. It's largely the in-core destination and source of
> > > IO, not something saved on-disk itself.
> >
> > Yeah, tmpfs/shm segs are what I was thinking about. If UML can live with
> > that as well, then I think it might be a good option.
>
> Oh, hmm.... if you can truncate these things then you still need to
> force unmap so you still need i_mmap_nonlinear.

Well, we don't need truncate(), but MADV_REMOVE for memory hotunplug, which is 
way similar I guess.

About the restriction to tmpfs, I have just discovered 
'[PATCH] mm: tracking shared dirty pages' (commit 
d08b3851da41d0ee60851f2c75b118e1f7a5fc89), which already partially conflicts 
with remap_file_pages for file-based mmaps (and that's fully fine, for now).

Even if UML does not need it, till now if there is a VMA protection and a page 
hasn't been remapped with remap_file_pages, the VMA protection is used (just 
because it makes sense).

However, it is only used when the PTE is first created - we can never change
protections on a VMA - so if vma_wants_writenotify() is true (on all
file-based and on no shmfs-based mappings, right?) and we write-protect the
VMA, it will always be write-protected.

That's no problem for UML, but it would be for any other user (I guess I'll have to
prevent callers from trying such stuff - I started from a pretty generic
patch).

> But come to think of it, I still don't think nonlinear mappings are
> too bad as they are ;)

Btw, I really like removing ->populate and merging the common code together. 
filemap_populate and shmem_populate are so obnoxiously different that I 
already wanted to do that (after merging remap_file_pages() core).

Also, I'm curious. Since my patches are already changing remap_file_pages() 
code, should they be absolutely merged after yours?
-- 
Inform me of my mistakes, so I can add them to my list!
Paolo Giarrusso, aka Blaisorblade
http://www.user-mode-linux.org/~blaisorblade
Chat with your friends in real time!
 http://it.yahoo.com/mail_it/foot/*http://it.messenger.yahoo.com

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-03-12 23:01                         ` Blaisorblade
@ 2007-03-13  1:19                           ` Nick Piggin
  -1 siblings, 0 replies; 198+ messages in thread
From: Nick Piggin @ 2007-03-13  1:19 UTC (permalink / raw)
  To: Blaisorblade
  Cc: Bill Irwin, Ingo Molnar, Andrew Morton, Linux Memory Management,
	Linux Kernel, Benjamin Herrenschmidt

On Tue, Mar 13, 2007 at 12:01:13AM +0100, Blaisorblade wrote:
> On Wednesday 07 March 2007 11:02, Nick Piggin wrote:
> > >
> > > Yeah, tmpfs/shm segs are what I was thinking about. If UML can live with
> > > that as well, then I think it might be a good option.
> >
> > Oh, hmm.... if you can truncate these things then you still need to
> > force unmap so you still need i_mmap_nonlinear.
> 
> Well, we don't need truncate(), but MADV_REMOVE for memory hotunplug, which is 
> way similar I guess.
> 
> About the restriction to tmpfs, I have just discovered 
> '[PATCH] mm: tracking shared dirty pages' (commit 
> d08b3851da41d0ee60851f2c75b118e1f7a5fc89), which already partially conflicts 
> with remap_file_pages for file-based mmaps (and that's fully fine, for now).
> 
> Even if UML does not need it, till now if there is a VMA protection and a page 
> hasn't been remapped with remap_file_pages, the VMA protection is used (just 
> because it makes sense).
> 
> However, it is only used when the PTE is first created - we can never change 
> protections on a VMA  - so it vma_wants_writenotify() is true (on all 
> file-based and on no shmfs based mapping, right?), and we write-protect the 
> VMA, it will always be write-protected.

Yes, I believe that is the case, however I wonder if that is going to be
a problem for you to distinguish between write faults for clean writable
ptes, and write faults for readonly ptes?

> That's no problem for UML, but for any other user (I guess I'll have to 
> prevent callers from trying such stuff - I started from a pretty generic 
> patch).
> 
> > But come to think of it, I still don't think nonlinear mappings are
> > too bad as they are ;)
> 
> Btw, I really like removing ->populate and merging the common code together. 
> filemap_populate and shmem_populate are so obnoxiously different that I 
> already wanted to do that (after merging remap_file_pages() core).

Yeah they are also frustratingly similar to filemap_nopage and shmem_nopage,
and duplicate a lot of the same code ;)

> Also, I'm curious. Since my patches are already changing remap_file_pages() 
> code, should they be absolutely merged after yours?

Is there a big clash? I don't think I did a great deal to fremap.c (mainly
just removing stuff)...

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-03-13  1:19                           ` Nick Piggin
@ 2007-03-17 12:17                             ` Blaisorblade
  -1 siblings, 0 replies; 198+ messages in thread
From: Blaisorblade @ 2007-03-17 12:17 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Bill Irwin, Ingo Molnar, Andrew Morton, Linux Memory Management,
	Linux Kernel, Benjamin Herrenschmidt

On Tuesday 13 March 2007 02:19, Nick Piggin wrote:
> On Tue, Mar 13, 2007 at 12:01:13AM +0100, Blaisorblade wrote:
> > On Wednesday 07 March 2007 11:02, Nick Piggin wrote:
> > > > Yeah, tmpfs/shm segs are what I was thinking about. If UML can live
> > > > with that as well, then I think it might be a good option.
> > >
> > > Oh, hmm.... if you can truncate these things then you still need to
> > > force unmap so you still need i_mmap_nonlinear.
> >
> > Well, we don't need truncate(), but MADV_REMOVE for memory hotunplug,
> > which is way similar I guess.
> >
> > About the restriction to tmpfs, I have just discovered
> > '[PATCH] mm: tracking shared dirty pages' (commit
> > d08b3851da41d0ee60851f2c75b118e1f7a5fc89), which already partially
> > conflicts with remap_file_pages for file-based mmaps (and that's fully
> > fine, for now).
> >
> > Even if UML does not need it, till now if there is a VMA protection and a
> > page hasn't been remapped with remap_file_pages, the VMA protection is
> > used (just because it makes sense).
> >
> > However, it is only used when the PTE is first created - we can never
> > change protections on a VMA  - so it vma_wants_writenotify() is true (on
> > all file-based and on no shmfs based mapping, right?), and we
> > write-protect the VMA, it will always be write-protected.
>
> Yes, I believe that is the case, however I wonder if that is going to be
> a problem for you to distinguish between write faults for clean writable
> ptes, and write faults for readonly ptes?
I wouldn't be able to distinguish them, but am I going to get write faults for 
clean ptes when vma_wants_writenotify() is false (as seems to be for tmpfs)? 
I guess not.

For tmpfs pages, clean writable PTEs are mapped as writable so they won't give 
any problem, since vma_wants_writenotify() is false for tmpfs. Correct?

> > Also, I'm curious. Since my patches are already changing
> > remap_file_pages() code, should they be absolutely merged after yours?
>
> Is there a big clash? I don't think I did a great deal to fremap.c (mainly
> just removing stuff)...
Hopefully we both just modify sys_remap_file_pages(); I'll see soon.
-- 
Inform me of my mistakes, so I can add them to my list!
Paolo Giarrusso, aka Blaisorblade
http://www.user-mode-linux.org/~blaisorblade
Chat with your friends in real time!
 http://it.yahoo.com/mail_it/foot/*http://it.messenger.yahoo.com

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-03-17 12:17                             ` Blaisorblade
@ 2007-03-18  2:50                               ` Nick Piggin
  -1 siblings, 0 replies; 198+ messages in thread
From: Nick Piggin @ 2007-03-18  2:50 UTC (permalink / raw)
  To: Blaisorblade
  Cc: Bill Irwin, Ingo Molnar, Andrew Morton, Linux Memory Management,
	Linux Kernel, Benjamin Herrenschmidt

On Sat, Mar 17, 2007 at 01:17:00PM +0100, Blaisorblade wrote:
> On Tuesday 13 March 2007 02:19, Nick Piggin wrote:
> > On Tue, Mar 13, 2007 at 12:01:13AM +0100, Blaisorblade wrote:
> > > On Wednesday 07 March 2007 11:02, Nick Piggin wrote:
> > > > > Yeah, tmpfs/shm segs are what I was thinking about. If UML can live
> > > > > with that as well, then I think it might be a good option.
> > > >
> > > > Oh, hmm.... if you can truncate these things then you still need to
> > > > force unmap so you still need i_mmap_nonlinear.
> > >
> > > Well, we don't need truncate(), but MADV_REMOVE for memory hotunplug,
> > > which is way similar I guess.
> > >
> > > About the restriction to tmpfs, I have just discovered
> > > '[PATCH] mm: tracking shared dirty pages' (commit
> > > d08b3851da41d0ee60851f2c75b118e1f7a5fc89), which already partially
> > > conflicts with remap_file_pages for file-based mmaps (and that's fully
> > > fine, for now).
> > >
> > > Even if UML does not need it, till now if there is a VMA protection and a
> > > page hasn't been remapped with remap_file_pages, the VMA protection is
> > > used (just because it makes sense).
> > >
> > > However, it is only used when the PTE is first created - we can never
> > > change protections on a VMA  - so it vma_wants_writenotify() is true (on
> > > all file-based and on no shmfs based mapping, right?), and we
> > > write-protect the VMA, it will always be write-protected.
> >
> > Yes, I believe that is the case, however I wonder if that is going to be
> > a problem for you to distinguish between write faults for clean writable
> > ptes, and write faults for readonly ptes?
> I wouldn't be able to distinguish them, but am I going to get write faults for 
> clean ptes when vma_wants_writenotify() is false (as seems to be for tmpfs)? 
> I guess not.
> 
> For tmpfs pages, clean writable PTEs are mapped as writable so they won't give 
> any problem, since vma_wants_writenotify() is false for tmpfs. Correct?

Yes, that should be the case. So would this mean that nonlinear protections
don't work on regular files? I guess that's OK if Oracle and UML both use
tmpfs/shm?
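
For anyone following along, the check being discussed boils down to roughly
this (a paraphrase of the 2.6.20-era helper in mm/mmap.c, with some special
cases dropped, so not a verbatim quote):

static int vma_wants_writenotify(struct vm_area_struct *vma)
{
        /* only shared, writable mappings are interesting */
        if ((vma->vm_flags & (VM_WRITE | VM_SHARED)) != (VM_WRITE | VM_SHARED))
                return 0;

        /* the backer explicitly asked to be told about first writes */
        if (vma->vm_ops && vma->vm_ops->page_mkwrite)
                return 1;

        /* otherwise only if the mapping does dirty accounting, which
         * tmpfs/ramfs (BDI_CAP_NO_ACCT_DIRTY) do not -- hence it is
         * false for tmpfs and the ptes stay writable */
        return vma->vm_file && vma->vm_file->f_mapping &&
                mapping_cap_account_dirty(vma->vm_file->f_mapping);
}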


^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-03-18  2:50                               ` Nick Piggin
@ 2007-03-18 13:09                                 ` Jeff Dike
  -1 siblings, 0 replies; 198+ messages in thread
From: Jeff Dike @ 2007-03-18 13:09 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Blaisorblade, Bill Irwin, Ingo Molnar, Andrew Morton,
	Linux Memory Management, Linux Kernel, Benjamin Herrenschmidt

On Sun, Mar 18, 2007 at 03:50:10AM +0100, Nick Piggin wrote:
> Yes, that should be the case. So would this mean that nonlinear protections
> don't work on regular files? I guess that's OK if Oracle and UML both use
> tmpfs/shm?

It's OK for UML.

		Jeff

-- 
Work email - jdike at linux dot intel dot com

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 0/6] fault vs truncate/invalidate race fix
  2007-02-27  6:54         ` Benjamin Herrenschmidt
@ 2007-03-18 23:13           ` Dave Airlie
  -1 siblings, 0 replies; 198+ messages in thread
From: Dave Airlie @ 2007-03-18 23:13 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: Andrew Morton, npiggin, linux-mm, linux-kernel

> > the new fault handler made the memory manager code a lot cleaner and
> > much less hacky in a lot of cases, so I'd rather merge the clean code
> > than have to fight with the current code...
>
> Note that you can probably get away with NOPFN_REFAULT etc... like I did
> for the SPEs in the meantime.

Indeed, Thomas has done this work and I'm just lining up a TTM tree to
start the merge process..

Dave.

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-03-18  2:50                               ` Nick Piggin
@ 2007-03-19 12:04                                 ` Bill Irwin
  -1 siblings, 0 replies; 198+ messages in thread
From: Bill Irwin @ 2007-03-19 12:04 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Blaisorblade, Bill Irwin, Ingo Molnar, Andrew Morton,
	Linux Memory Management, Linux Kernel, Benjamin Herrenschmidt

On Sun, Mar 18, 2007 at 03:50:10AM +0100, Nick Piggin wrote:
> Yes, that should be the case. So would this mean that nonlinear protections
> don't work on regular files? I guess that's OK if Oracle and UML both use
> tmpfs/shm?

Sometimes ramfs is also used in the Oracle case. I presume that's even
simpler than tmpfs. (Hugetlb, while also used for the same general
buffer pool, is never used in conjunction with remap_file_pages() etc.)


-- wli

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-03-18  2:50                               ` Nick Piggin
@ 2007-03-19 20:44                                 ` Blaisorblade
  -1 siblings, 0 replies; 198+ messages in thread
From: Blaisorblade @ 2007-03-19 20:44 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Bill Irwin, Ingo Molnar, Andrew Morton, Linux Memory Management,
	Linux Kernel, Benjamin Herrenschmidt

On Sunday 18 March 2007 03:50, Nick Piggin wrote:
> On Sat, Mar 17, 2007 at 01:17:00PM +0100, Blaisorblade wrote:
> > On Tuesday 13 March 2007 02:19, Nick Piggin wrote:
> > > On Tue, Mar 13, 2007 at 12:01:13AM +0100, Blaisorblade wrote:
> > > > On Wednesday 07 March 2007 11:02, Nick Piggin wrote:
> > > > > > Yeah, tmpfs/shm segs are what I was thinking about. If UML can
> > > > > > live with that as well, then I think it might be a good option.
> > > > >
> > > > > Oh, hmm.... if you can truncate these things then you still need to
> > > > > force unmap so you still need i_mmap_nonlinear.
> > > >
> > > > Well, we don't need truncate(), but MADV_REMOVE for memory hotunplug,
> > > > which is way similar I guess.
> > > >
> > > > About the restriction to tmpfs, I have just discovered
> > > > '[PATCH] mm: tracking shared dirty pages' (commit
> > > > d08b3851da41d0ee60851f2c75b118e1f7a5fc89), which already partially
> > > > conflicts with remap_file_pages for file-based mmaps (and that's
> > > > fully fine, for now).
> > > >
> > > > Even if UML does not need it, till now if there is a VMA protection
> > > > and a page hasn't been remapped with remap_file_pages, the VMA
> > > > protection is used (just because it makes sense).
> > > >
> > > > However, it is only used when the PTE is first created - we can never
> > > > change protections on a VMA - so if vma_wants_writenotify() is true
> > > > (on all file-based and on no shmfs-based mappings, right?), and we
> > > > write-protect the VMA, it will always be write-protected.
> > >
> > > Yes, I believe that is the case, however I wonder if that is going to
> > > be a problem for you to distinguish between write faults for clean
> > > writable ptes, and write faults for readonly ptes?
> >
> > I wouldn't be able to distinguish them, but am I going to get write
> > faults for clean ptes when vma_wants_writenotify() is false (as seems to
> > be for tmpfs)? I guess not.
> >
> > For tmpfs pages, clean writable PTEs are mapped as writable so they won't
> > give any problem, since vma_wants_writenotify() is false for tmpfs.
> > Correct?
>
> Yes, that should be the case. So would this mean that nonlinear protections
> don't work on regular files?

They still work in most cases (including for UML), but if the initial mmap()
specified PROT_WRITE, that is ignored for pages which have not been remapped via
remap_file_pages(). UML uses PROT_NONE for the initial mmap, so that's no
problem.

> I guess that's OK if Oracle and UML both use 
> tmpfs/shm?
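
The MADV_REMOVE operation mentioned in the quoted exchange above is the
truncate analogue on tmpfs/shm: it drops the backing pages for a range, and a
later access faults in zero-filled pages. A minimal sketch (an editor's
illustration, not code from this thread) of punching a hole in a shared tmpfs
mapping; the shm name "/madv-remove-demo" is arbitrary, and older glibc may
need -lrt for shm_open().

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        long psz = sysconf(_SC_PAGESIZE);
        int fd = shm_open("/madv-remove-demo", O_CREAT | O_RDWR, 0600);
        char *map;

        if (fd == -1)
                return 1;
        if (ftruncate(fd, 2 * psz) == -1)
                return 1;
        map = mmap(NULL, 2 * psz, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (map == MAP_FAILED)
                return 1;
        memset(map, 'X', 2 * psz);

        /* Punch a hole covering the second page of the object. */
        if (madvise(map + psz, psz, MADV_REMOVE) == -1)
                perror("madvise(MADV_REMOVE)");

        /* The hole reads back as zeroes; the first page keeps its data. */
        printf("page 0: %c, page 1: %d\n", map[0], map[psz]);

        shm_unlink("/madv-remove-demo");
        return 0;
}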

-- 
Inform me of my mistakes, so I can add them to my list!
Paolo Giarrusso, aka Blaisorblade
http://www.user-mode-linux.org/~blaisorblade

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-03-19 20:44                                 ` Blaisorblade
@ 2007-03-20  6:00                                   ` Nick Piggin
  -1 siblings, 0 replies; 198+ messages in thread
From: Nick Piggin @ 2007-03-20  6:00 UTC (permalink / raw)
  To: Blaisorblade
  Cc: Bill Irwin, Ingo Molnar, Andrew Morton, Linux Memory Management,
	Linux Kernel, Benjamin Herrenschmidt

On Mon, Mar 19, 2007 at 09:44:28PM +0100, Blaisorblade wrote:
> On Sunday 18 March 2007 03:50, Nick Piggin wrote:
> > > >
> > > > Yes, I believe that is the case, however I wonder if that is going to
> > > > be a problem for you to distinguish between write faults for clean
> > > > writable ptes, and write faults for readonly ptes?
> > >
> > > I wouldn't be able to distinguish them, but am I going to get write
> > > faults for clean ptes when vma_wants_writenotify() is false (as seems to
> > > be for tmpfs)? I guess not.
> > >
> > > For tmpfs pages, clean writable PTEs are mapped as writable so they won't
> > > give any problem, since vma_wants_writenotify() is false for tmpfs.
> > > Correct?
> >
> > Yes, that should be the case. So would this mean that nonlinear protections
> > don't work on regular files?
> 
> They still work in most cases (including for UML), but if the initial mmap() 
> specified PROT_WRITE, that is ignored, for pages which are not remapped via 
> remap_file_pages(). UML uses PROT_NONE for the initial mmap, so that's no 
> problem.

But how are you going to distinguish a write fault on a readonly pte for
dirty page accounting vs a read-only nonlinear protection?

You can't store any more data in a present pte AFAIK, so you'd have to
have some out of band data. At which point, you may as well just forget
about vma_wants_writenotify vmas, considering that everybody is using
shmem/ramfs.

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)
  2007-03-20  6:00                                   ` Nick Piggin
@ 2007-03-21 19:45                                     ` Blaisorblade
  -1 siblings, 0 replies; 198+ messages in thread
From: Blaisorblade @ 2007-03-21 19:45 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Bill Irwin, Ingo Molnar, Andrew Morton, Linux Memory Management,
	Linux Kernel, Benjamin Herrenschmidt

On Tuesday 20 March 2007 07:00, Nick Piggin wrote:
> On Mon, Mar 19, 2007 at 09:44:28PM +0100, Blaisorblade wrote:
> > On Sunday 18 March 2007 03:50, Nick Piggin wrote:
> > > > > Yes, I believe that is the case, however I wonder if that is going
> > > > > to be a problem for you to distinguish between write faults for
> > > > > clean writable ptes, and write faults for readonly ptes?

> > > > I wouldn't be able to distinguish them, but am I going to get write
> > > > faults for clean ptes when vma_wants_writenotify() is false (as seems
> > > > to be for tmpfs)? I guess not.

> > > > For tmpfs pages, clean writable PTEs are mapped as writable so they
> > > > won't give any problem, since vma_wants_writenotify() is false for
> > > > tmpfs. Correct?

> > > Yes, that should be the case. So would this mean that nonlinear
> > > protections don't work on regular files?

> > They still work in most cases (including for UML), but if the initial
> > mmap() specified PROT_WRITE, that is ignored, for pages which are not
> > remapped via remap_file_pages(). UML uses PROT_NONE for the initial mmap,
> > so that's no problem.

> But how are you going to distinguish a write fault on a readonly pte for
> dirty page accounting vs a read-only nonlinear protection?

Hmm... I was only thinking of PTEs which hadn't been remapped via
remap_file_pages, but had just been faulted in with the initial mmap() protection.

For the other PTEs, however, I overlooked that the current code ignores
vma_wants_writenotify(), i.e. it breaks dirty page accounting for them, and I
had refused to even consider that possibility, not knowing at the time the
purpose of dirty page accounting (I have since found the commits explaining it).

> You can't store any more data in a present pte AFAIK, so you'd have to
> have some out of band data. At which point, you may as well just forget
> about vma_wants_writenotify vmas, considering that everybody is using
> shmem/ramfs.

I was going to do that anyway. I'd guess that I should just disallow the
VM_MANYPROTS (i.e. MAP_CHGPROT in flags) && vma_wants_writenotify() combination
in remap_file_pages(), right? OK, trivial (I shouldn't even have pointed this
out).
-- 
Inform me of my mistakes, so I can add them to my list!
Paolo Giarrusso, aka Blaisorblade
http://www.user-mode-linux.org/~blaisorblade

^ permalink raw reply	[flat|nested] 198+ messages in thread

end of thread

Thread overview: 198+ messages
-- links below jump to the message on this page --
2007-02-21  4:49 [patch 0/6] fault vs truncate/invalidate race fix Nick Piggin
2007-02-21  4:49 ` Nick Piggin
2007-02-21  4:49 ` [patch 1/6] mm: debug check for the fault vs invalidate race Nick Piggin
2007-02-21  4:49   ` Nick Piggin
2007-02-21  4:49 ` [patch 2/6] mm: simplify filemap_nopage Nick Piggin
2007-02-21  4:49   ` Nick Piggin
2007-02-21  4:50 ` [patch 3/6] mm: fix fault vs invalidate race for linear mappings Nick Piggin
2007-02-21  4:50   ` Nick Piggin
2007-03-07  6:36   ` Andrew Morton
2007-03-07  6:36     ` Andrew Morton
2007-03-07  6:57     ` Nick Piggin
2007-03-07  6:57       ` Nick Piggin
2007-03-07  7:08       ` Andrew Morton
2007-03-07  7:08         ` Andrew Morton
2007-03-07  7:25         ` Nick Piggin
2007-03-07  7:25           ` Nick Piggin
2007-02-21  4:50 ` [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear) Nick Piggin
2007-02-21  4:50   ` Nick Piggin
2007-03-07  6:51   ` Andrew Morton
2007-03-07  6:51     ` Andrew Morton
2007-03-07  7:08     ` Nick Piggin
2007-03-07  7:08       ` Nick Piggin
2007-03-07  8:19       ` Nick Piggin
2007-03-07  8:19         ` Nick Piggin
2007-03-07  8:27         ` Ingo Molnar
2007-03-07  8:27           ` Ingo Molnar
2007-03-07  8:35           ` Andrew Morton
2007-03-07  8:35             ` Andrew Morton
2007-03-07  8:53             ` Ingo Molnar
2007-03-07  8:53               ` Ingo Molnar
2007-03-07  9:28               ` Nick Piggin
2007-03-07  9:28                 ` Nick Piggin
2007-03-07  9:44                 ` Bill Irwin
2007-03-07  9:44                   ` Bill Irwin
2007-03-07  9:49                   ` Nick Piggin
2007-03-07  9:49                     ` Nick Piggin
2007-03-07 10:02                     ` Nick Piggin
2007-03-07 10:02                       ` Nick Piggin
2007-03-12 23:01                       ` Blaisorblade
2007-03-12 23:01                         ` Blaisorblade
2007-03-13  1:19                         ` Nick Piggin
2007-03-13  1:19                           ` Nick Piggin
2007-03-17 12:17                           ` Blaisorblade
2007-03-17 12:17                             ` Blaisorblade
2007-03-18  2:50                             ` Nick Piggin
2007-03-18  2:50                               ` Nick Piggin
2007-03-18 13:09                               ` Jeff Dike
2007-03-18 13:09                                 ` Jeff Dike
2007-03-19 12:04                               ` Bill Irwin
2007-03-19 12:04                                 ` Bill Irwin
2007-03-19 20:44                               ` Blaisorblade
2007-03-19 20:44                                 ` Blaisorblade
2007-03-20  6:00                                 ` Nick Piggin
2007-03-20  6:00                                   ` Nick Piggin
2007-03-21 19:45                                   ` Blaisorblade
2007-03-21 19:45                                     ` Blaisorblade
2007-03-08 12:39                   ` Blaisorblade
2007-03-08 12:39                     ` Blaisorblade
2007-03-07  9:29             ` Bill Irwin
2007-03-07  9:29               ` Bill Irwin
2007-03-07  9:39               ` Andrew Morton
2007-03-07  9:39                 ` Andrew Morton
2007-03-07 10:09                 ` Bill Irwin
2007-03-07 10:09                   ` Bill Irwin
2007-03-07  8:38           ` Miklos Szeredi
2007-03-07  8:38             ` Miklos Szeredi
2007-03-07  8:47             ` Andrew Morton
2007-03-07  8:47               ` Andrew Morton
2007-03-07  8:51               ` Miklos Szeredi
2007-03-07  8:51                 ` Miklos Szeredi
2007-03-07  9:07                 ` Andrew Morton
2007-03-07  9:07                   ` Andrew Morton
2007-03-07  9:18                   ` Nick Piggin
2007-03-07  9:18                     ` Nick Piggin
2007-03-07  9:26                     ` Andrew Morton
2007-03-07  9:26                       ` Andrew Morton
2007-03-07  9:28                       ` Miklos Szeredi
2007-03-07  9:28                         ` Miklos Szeredi
2007-03-07  9:38                       ` Nick Piggin
2007-03-07  9:38                         ` Nick Piggin
2007-03-07  9:25                   ` Miklos Szeredi
2007-03-07  9:25                     ` Miklos Szeredi
2007-03-07  9:32                   ` Peter Zijlstra
2007-03-07  9:32                     ` Peter Zijlstra
2007-03-07  9:45                     ` Nick Piggin
2007-03-07  9:45                       ` Nick Piggin
2007-03-07 10:04                       ` Nick Piggin
2007-03-07 10:04                         ` Nick Piggin
2007-03-07 10:06                         ` Peter Zijlstra
2007-03-07 10:06                           ` Peter Zijlstra
2007-03-07 10:13                           ` Miklos Szeredi
2007-03-07 10:13                             ` Miklos Szeredi
2007-03-07 10:21                             ` Nick Piggin
2007-03-07 10:21                               ` Nick Piggin
2007-03-07 10:24                               ` Peter Zijlstra
2007-03-07 10:24                                 ` Peter Zijlstra
2007-03-07 10:38                                 ` Nick Piggin
2007-03-07 10:38                                   ` Nick Piggin
2007-03-07 10:47                                   ` Peter Zijlstra
2007-03-07 10:47                                     ` Peter Zijlstra
2007-03-07 11:00                                     ` Nick Piggin
2007-03-07 11:00                                       ` Nick Piggin
2007-03-07 11:48                                       ` Peter Zijlstra
2007-03-07 11:48                                         ` Peter Zijlstra
2007-03-07 12:17                                         ` Nick Piggin
2007-03-07 12:17                                           ` Nick Piggin
2007-03-07 12:41                                           ` Peter Zijlstra
2007-03-07 12:41                                             ` Peter Zijlstra
2007-03-07 13:08                                             ` Nick Piggin
2007-03-07 13:08                                               ` Nick Piggin
2007-03-07 13:19                                               ` Peter Zijlstra
2007-03-07 13:19                                                 ` Peter Zijlstra
2007-03-07 13:36                                                 ` Nick Piggin
2007-03-07 13:36                                                   ` Nick Piggin
2007-03-07 13:52                                                   ` Peter Zijlstra
2007-03-07 13:52                                                     ` Peter Zijlstra
2007-03-07 13:56                                                     ` Miklos Szeredi
2007-03-07 13:56                                                       ` Miklos Szeredi
2007-03-07 14:34                                                     ` Peter Zijlstra
2007-03-07 14:34                                                       ` Peter Zijlstra
2007-03-07 15:01                                                       ` Nick Piggin
2007-03-07 15:01                                                         ` Nick Piggin
2007-03-07 16:58                                                         ` [RFC][PATCH] mm: fix page_mkclean() vs non-linear vmas Peter Zijlstra
2007-03-07 16:58                                                           ` Peter Zijlstra
2007-03-07 18:00                                                           ` Linus Torvalds
2007-03-07 18:00                                                             ` Linus Torvalds
2007-03-07 18:12                                                             ` Peter Zijlstra
2007-03-07 18:12                                                               ` Peter Zijlstra
2007-03-07 18:24                                                               ` Peter Zijlstra
2007-03-07 18:24                                                                 ` Peter Zijlstra
2007-03-08 11:21                                                           ` Miklos Szeredi
2007-03-08 11:21                                                             ` Miklos Szeredi
2007-03-08 11:37                                                             ` Peter Zijlstra
2007-03-08 11:37                                                               ` Peter Zijlstra
2007-03-08 11:48                                                               ` Miklos Szeredi
2007-03-08 11:48                                                                 ` Miklos Szeredi
2007-03-08 12:11                                                                 ` Peter Zijlstra
2007-03-08 12:11                                                                   ` Peter Zijlstra
2007-03-08 12:19                                                                   ` Nick Piggin
2007-03-08 12:19                                                                     ` Nick Piggin
2007-03-08 12:25                                                                     ` Miklos Szeredi
2007-03-08 12:25                                                                       ` Miklos Szeredi
2007-03-08 11:58                                                             ` Nick Piggin
2007-03-08 11:58                                                               ` Nick Piggin
2007-03-08 12:09                                                               ` Miklos Szeredi
2007-03-08 12:09                                                                 ` Miklos Szeredi
2007-03-07 15:10                                                     ` [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear) Jeff Dike
2007-03-07 15:10                                                       ` Jeff Dike
2007-03-07 13:53                                                   ` Miklos Szeredi
2007-03-07 13:53                                                     ` Miklos Szeredi
2007-03-07 14:50                                                     ` Nick Piggin
2007-03-07 14:50                                                       ` Nick Piggin
2007-03-07 12:22                                       ` Bill Irwin
2007-03-07 12:22                                         ` Bill Irwin
2007-03-07 12:36                                         ` Nick Piggin
2007-03-07 12:36                                           ` Nick Piggin
2007-03-07 10:30                             ` [rfc][patch 7/6] mm: merge page_mkwrite Nick Piggin
2007-03-07 10:30                               ` Nick Piggin
2007-03-07  8:59           ` [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear) Nick Piggin
2007-03-07  8:59             ` Nick Piggin
2007-03-07  9:11             ` Nick Piggin
2007-03-07  9:11               ` Nick Piggin
2007-03-07  9:22             ` Ingo Molnar
2007-03-07  9:22               ` Ingo Molnar
2007-03-07  9:32               ` Bill Irwin
2007-03-07  9:32                 ` Bill Irwin
2007-03-07  9:35                 ` Ingo Molnar
2007-03-07  9:35                   ` Ingo Molnar
2007-03-07  9:50                   ` Bill Irwin
2007-03-07  9:50                     ` Bill Irwin
2007-03-07  9:52               ` Nick Piggin
2007-03-07  9:52                 ` Nick Piggin
2007-03-07  7:19     ` Bill Irwin
2007-03-07  7:19       ` Bill Irwin
2007-03-07 10:05     ` Benjamin Herrenschmidt
2007-03-07 10:05       ` Benjamin Herrenschmidt
2007-03-07 10:17       ` Nick Piggin
2007-03-07 10:17         ` Nick Piggin
2007-03-07 10:46         ` Benjamin Herrenschmidt
2007-03-07 10:46           ` Benjamin Herrenschmidt
2007-02-21  4:50 ` [patch 5/6] mm: merge nopfn into fault Nick Piggin
2007-02-21  4:50   ` Nick Piggin
2007-02-21  5:13   ` Nick Piggin
2007-02-21  5:13     ` Nick Piggin
2007-02-21  4:50 ` [patch 6/6] mm: remove legacy cruft Nick Piggin
2007-02-21  4:50   ` Nick Piggin
2007-02-27  4:36 ` [patch 0/6] fault vs truncate/invalidate race fix Dave Airlie
2007-02-27  4:36   ` Dave Airlie
2007-02-27  5:32   ` Andrew Morton
2007-02-27  5:32     ` Andrew Morton
2007-02-27  6:26     ` Dave Airlie
2007-02-27  6:26       ` Dave Airlie
2007-02-27  6:54       ` Benjamin Herrenschmidt
2007-02-27  6:54         ` Benjamin Herrenschmidt
2007-03-18 23:13         ` Dave Airlie
2007-03-18 23:13           ` Dave Airlie
2007-02-27  8:50     ` Nick Piggin
2007-02-27  8:50       ` Nick Piggin
