linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/5] mm: tracking dirty pages -v11
@ 2006-06-23 22:31 Peter Zijlstra
  2006-06-23 22:31 ` [PATCH 1/5] mm: tracking shared dirty pages Peter Zijlstra
                   ` (5 more replies)
  0 siblings, 6 replies; 8+ messages in thread
From: Peter Zijlstra @ 2006-06-23 22:31 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Andrew Morton, David Howells, Peter Zijlstra,
	Christoph Lameter, Martin Bligh, Nick Piggin, Linus Torvalds

Hi,

I hope to have addressed all Hugh's latest comments in this version.
Its against 2.6.17-mm1, however I wasted most of the day trying to 
test it on that kernel. But due to various circumstances that failed.

So I've tested something like this against something 2.6.17'ish and 
respun against the -mm lineup.

I've taken Hugh's msync changes too, looks a lot better and does indeed
fix some boundary cases.

Peter

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 1/5] mm: tracking shared dirty pages
  2006-06-23 22:31 [PATCH 0/5] mm: tracking dirty pages -v11 Peter Zijlstra
@ 2006-06-23 22:31 ` Peter Zijlstra
  2006-06-23 23:54   ` Peter Zijlstra
  2006-06-23 22:31 ` [PATCH 2/5] mm: balance " Peter Zijlstra
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 8+ messages in thread
From: Peter Zijlstra @ 2006-06-23 22:31 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Andrew Morton, David Howells, Peter Zijlstra,
	Christoph Lameter, Martin Bligh, Nick Piggin, Linus Torvalds


From: Peter Zijlstra <a.p.zijlstra@chello.nl>

Tracking of dirty pages in shared writeable mmap()s.

The idea is simple: write protect clean shared writeable pages,
catch the write-fault, make writeable and set dirty. On page write-back
clean all the PTE dirty bits and write protect them once again.

The implementation is a tad harder, mainly because the default
backing_dev_info capabilities were too loosely maintained. Hence it is
not enough to test the backing_dev_info for cap_account_dirty.

The current heuristic is as follows, a VMA is eligible when:
 - its shared writeable 
    (vm_flags & (VM_WRITE|VM_SHARED)) == (VM_WRITE|VM_SHARED)
 - it is not a PFN mapping
    (vm_flags & VM_PFNMAP) == 0
 - the backing_dev_info is cap_account_dirty
    mapping_cap_account_dirty(vma->vm_file->f_mapping)
 - f_op->mmap() didn't change the default page protection

NOTE: the last rule is only checked in do_mmap_pgoff(), other code-
paths assume they will not be reached in that case.

Page from remap_pfn_range() are explicitly excluded because their
COW semantics are already horrid enough (see vm_normal_page() in
do_wp_page()) and because they don't have a backing store anyway.

mprotect() is taught about the new behaviour as well.

Cleaning the pages on write-back is done with page_mkclean() a new
rmap call. It can be called on any page, but is currently only
implemented for mapped pages (does it make sense to also implement
anon pages?), if the page is found the be of a trackable VMA it
will also wrprotect the PTE.

Finally, in fs/buffers.c:try_to_free_buffers(); remove clear_page_dirty()
from under ->private_lock. This seems to be save, since ->private_lock 
is used to serialize access to the buffers, not the page itself.
This is needed because clear_page_dirty() will call into page_mkclean()
and would thereby violate locking order.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
Changes in -v11

 - small cleanups
 - pulled page_mkclean back under mapping_cap_account_dirty()

Changes in -v10

 - 2.6.17-mm1
 - Drop the ugly duckling pgprotting, Hugh suggested resetting the
   vma->vm_page_prot when f_op->mmap() didn't modify it. If it were
   modified we're not interested anyway.
 - abandon is_shared_writable() because its actually spelled writeable
   and it didn't actually mean that any more.
 - Comments all round.

Changes in -v9

 - respin against latest -mm.

Changes in -v8

 - access_process_vm() and other force users of get_user_pages() can
   induce COW of read-only shared mappings.

Changes in -v7

 - changed is_shared_writable() to exclude VM_PFNMAP'ed regions.
 - Hugh's tiresome problem wasn't fully solved, now using the ugly duckling
   method.

Changes in -v6

 - make page_mkclean_one() modify the pte more like change_pte_range() 
   (suggested by Christoph Lameter)
 - made is_shared_writable() take vm_flags, it now resembles is_cow_mapping().
 - fixed the mprotect() bug (spotted by Hugh Dickins)
 - hopefully fixed the tiresome issue of do_mmap_pgoff() trampling on
   driver specific vm_page_prot settings (spotted by Hugh Dickins)
 - made a new version of the page_mkwrite() patch to go on top of all this.
   This so that Linus could merge this very early on in 2.6.18.

Changes in -v5

 - rename page_wrprotect() to page_mkclean() (suggested by Nick Piggin)
 - added comment to test_clear_page_dirty() (Andrew Morton)
 - cleanup page_wrprotect() (Andrew Morton)
 - renamed VM_SharedWritable() to is_shared_writable()
 - fs/buffers.c try_to_free_buffers(): remove clear_page_dirty() from under
   ->private_lock. This seems to be save, since ->private_lock is used to
   serialize access to the buffers, not the page itself.
 - rebased on top of David Howells' page_mkwrite() patch.

Changes in -v4:

 - small cleanup as suggested by Christoph Lameter.

Changes in -v3:

 - move set_page_dirty() outside pte lock (suggested by Christoph Lameter)

Changes in -v2:

 - only wrprotect pages from dirty capable mappings. (Nick Piggin)
 - move the writefault handling from do_wp_page() into handle_pte_fault(). 
   (Nick Piggin)
 - revert to the old install_page interface. (Nick Piggin)
 - also clear the pte dirty bit when we make pages read-only again.
   (spotted by Rik van Riel)
 - make page_wrprotect() return the number of reprotected ptes.

 fs/buffer.c          |    2 -
 include/linux/rmap.h |    8 +++++
 mm/memory.c          |   29 ++++++++++++++++-----
 mm/mmap.c            |   40 +++++++++++++++++++++++------
 mm/mprotect.c        |   17 ++++++++++--
 mm/page-writeback.c  |   15 ++++++++--
 mm/rmap.c            |   70 +++++++++++++++++++++++++++++++++++++++++++++++++++
 7 files changed, 160 insertions(+), 21 deletions(-)

Index: 2.6-mm/mm/memory.c
===================================================================
--- 2.6-mm.orig/mm/memory.c	2006-06-23 23:08:49.000000000 +0200
+++ 2.6-mm/mm/memory.c	2006-06-23 23:09:26.000000000 +0200
@@ -1458,14 +1458,19 @@ static int do_wp_page(struct mm_struct *
 {
 	struct page *old_page, *new_page;
 	pte_t entry;
-	int reuse, ret = VM_FAULT_MINOR;
+	int reuse = 0, ret = VM_FAULT_MINOR;
+	struct page *dirty_page = NULL;
 
 	old_page = vm_normal_page(vma, address, orig_pte);
 	if (!old_page)
 		goto gotten;
 
-	if (unlikely((vma->vm_flags & (VM_SHARED|VM_WRITE)) ==
-				(VM_SHARED|VM_WRITE))) {
+	/*
+	 * Only catch write-faults on shared writable pages, read-only
+	 * shared pages can get COWed by get_user_pages(.write=1, .force=1).
+	 */
+	if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
+					(VM_WRITE|VM_SHARED))) {
 		if (vma->vm_ops && vma->vm_ops->page_mkwrite) {
 			/*
 			 * Notify the address space that the page is about to
@@ -1494,13 +1499,12 @@ static int do_wp_page(struct mm_struct *
 			if (!pte_same(*page_table, orig_pte))
 				goto unlock;
 		}
-
+		dirty_page = old_page;
+		get_page(dirty_page);
 		reuse = 1;
 	} else if (PageAnon(old_page) && !TestSetPageLocked(old_page)) {
 		reuse = can_share_swap_page(old_page);
 		unlock_page(old_page);
-	} else {
-		reuse = 0;
 	}
 
 	if (reuse) {
@@ -1566,6 +1570,10 @@ gotten:
 		page_cache_release(old_page);
 unlock:
 	pte_unmap_unlock(page_table, ptl);
+	if (dirty_page) {
+		set_page_dirty(dirty_page);
+		put_page(dirty_page);
+	}
 	return ret;
 oom:
 	if (old_page)
@@ -2098,6 +2106,7 @@ static int do_no_page(struct mm_struct *
 	unsigned int sequence = 0;
 	int ret = VM_FAULT_MINOR;
 	int anon = 0;
+	struct page *dirty_page = NULL;
 
 	pte_unmap(page_table);
 	BUG_ON(vma->vm_flags & VM_PFNMAP);
@@ -2192,6 +2201,10 @@ retry:
 		} else {
 			inc_mm_counter(mm, file_rss);
 			page_add_file_rmap(new_page);
+			if (write_access) {
+				dirty_page = new_page;
+				get_page(dirty_page);
+			}
 		}
 	} else {
 		/* One of our sibling threads was faster, back out. */
@@ -2204,6 +2217,10 @@ retry:
 	lazy_mmu_prot_update(entry);
 unlock:
 	pte_unmap_unlock(page_table, ptl);
+	if (dirty_page) {
+		set_page_dirty(dirty_page);
+		put_page(dirty_page);
+	}
 	return ret;
 oom:
 	page_cache_release(new_page);
Index: 2.6-mm/mm/mmap.c
===================================================================
--- 2.6-mm.orig/mm/mmap.c	2006-06-23 23:08:49.000000000 +0200
+++ 2.6-mm/mm/mmap.c	2006-06-23 23:09:26.000000000 +0200
@@ -25,6 +25,7 @@
 #include <linux/mount.h>
 #include <linux/mempolicy.h>
 #include <linux/rmap.h>
+#include <linux/backing-dev.h>
 
 #include <asm/uaccess.h>
 #include <asm/cacheflush.h>
@@ -888,6 +889,7 @@ unsigned long do_mmap_pgoff(struct file 
 	struct rb_node ** rb_link, * rb_parent;
 	int accountable = 1;
 	unsigned long charged = 0, reqprot = prot;
+	pgprot_t vm_page_prot;
 
 	if (file) {
 		if (is_file_hugepages(file))
@@ -1065,8 +1067,8 @@ munmap_back:
 	vma->vm_start = addr;
 	vma->vm_end = addr + len;
 	vma->vm_flags = vm_flags;
-	vma->vm_page_prot = protection_map[vm_flags &
-				(VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)];
+	vma->vm_page_prot = vm_page_prot = protection_map[vm_flags &
+		(VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)];
 	vma->vm_pgoff = pgoff;
 
 	if (file) {
@@ -1090,12 +1092,6 @@ munmap_back:
 			goto free_vma;
 	}
 
-	/* Don't make the VMA automatically writable if it's shared, but the
-	 * backer wishes to know when pages are first written to */
-	if (vma->vm_ops && vma->vm_ops->page_mkwrite)
-		vma->vm_page_prot =
-			protection_map[vm_flags & (VM_READ|VM_WRITE|VM_EXEC)];
-
 	/* We set VM_ACCOUNT in a shared mapping's vm_flags, to inform
 	 * shmem_zero_setup (perhaps called through /dev/zero's ->mmap)
 	 * that memory reservation must be checked; but that reservation
@@ -1113,6 +1109,34 @@ munmap_back:
 	pgoff = vma->vm_pgoff;
 	vm_flags = vma->vm_flags;
 
+	/*
+	 * Tracking of dirty pages for shared writable mappings. Do this by
+	 * write protecting writable pages, and mark dirty in the write fault.
+	 *
+	 * Cannot do before because the condition depends on:
+	 *  - backing_dev_info having the right capabilities
+	 *  - vma->vm_flags being fully set
+	 *    (finished in f_op->mmap(), which could call remap_pfn_range())
+	 *
+	 * If f_op->mmap() changed vma->vm_page_prot its a funny mapping
+	 * and we won't touch it.
+	 * NOTE: in a perfect world backing_dev_info would have the proper
+	 * capabilities.
+	 *
+	 * OR
+	 *
+	 * Don't make the VMA automatically writable if it's shared, but the
+	 * backer wishes to know when pages are first written to.
+	 */
+	if (((pgprot_val(vma->vm_page_prot) == pgprot_val(vm_page_prot)) &&
+	     ((vm_flags & (VM_WRITE|VM_SHARED|VM_PFNMAP|VM_INSERTPAGE)) ==
+			  (VM_WRITE|VM_SHARED)) &&
+	     vma->vm_file && vma->vm_file->f_mapping &&
+	     mapping_cap_account_dirty(vma->vm_file->f_mapping)) ||
+	    (vma->vm_ops && vma->vm_ops->page_mkwrite))
+		vma->vm_page_prot =
+			protection_map[vm_flags & (VM_READ|VM_WRITE|VM_EXEC)];
+
 	if (!file || !vma_merge(mm, prev, addr, vma->vm_end,
 			vma->vm_flags, NULL, file, pgoff, vma_policy(vma))) {
 		file = vma->vm_file;
Index: 2.6-mm/mm/mprotect.c
===================================================================
--- 2.6-mm.orig/mm/mprotect.c	2006-06-23 23:08:49.000000000 +0200
+++ 2.6-mm/mm/mprotect.c	2006-06-23 23:10:08.000000000 +0200
@@ -21,6 +21,7 @@
 #include <linux/syscalls.h>
 #include <linux/swap.h>
 #include <linux/swapops.h>
+#include <linux/backing-dev.h>
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
 #include <asm/cacheflush.h>
@@ -176,10 +177,20 @@ mprotect_fixup(struct vm_area_struct *vm
 	}
 
 success:
-	/* Don't make the VMA automatically writable if it's shared, but the
-	 * backer wishes to know when pages are first written to */
+	/*
+	 * Tracking of dirty pages logic (see comment in do_mmap_pgoff)
+	 *
+	 * OR
+	 *
+	 * Don't make the VMA automatically writable if it's shared, but the
+	 * backer wishes to know when pages are first written to.
+	 */
 	mask = VM_READ|VM_WRITE|VM_EXEC|VM_SHARED;
-	if (vma->vm_ops && vma->vm_ops->page_mkwrite)
+	if ((((newflags & (VM_WRITE|VM_SHARED|VM_PFNMAP|VM_INSERTPAGE)) ==
+			  (VM_WRITE|VM_SHARED)) &&
+	     vma->vm_file && vma->vm_file->f_mapping &&
+	     mapping_cap_account_dirty(vma->vm_file->f_mapping)) ||
+	    (vma->vm_ops && vma->vm_ops->page_mkwrite))
 		mask &= ~VM_SHARED;
 
 	newprot = protection_map[newflags & mask];
Index: 2.6-mm/mm/page-writeback.c
===================================================================
--- 2.6-mm.orig/mm/page-writeback.c	2006-06-23 23:08:49.000000000 +0200
+++ 2.6-mm/mm/page-writeback.c	2006-06-23 23:09:26.000000000 +0200
@@ -29,6 +29,7 @@
 #include <linux/sysctl.h>
 #include <linux/cpu.h>
 #include <linux/syscalls.h>
+#include <linux/rmap.h>
 
 /*
  * The maximum number of pages to writeout in a single bdflush/kupdate
@@ -566,7 +567,7 @@ int do_writepages(struct address_space *
 		return 0;
 	wbc->for_writepages = 1;
 	if (mapping->a_ops->writepages)
-		ret =  mapping->a_ops->writepages(mapping, wbc);
+		ret = mapping->a_ops->writepages(mapping, wbc);
 	else
 		ret = generic_writepages(mapping, wbc);
 	wbc->for_writepages = 0;
@@ -728,8 +729,14 @@ int test_clear_page_dirty(struct page *p
 						page_index(page),
 						PAGECACHE_TAG_DIRTY);
 			write_unlock_irqrestore(&mapping->tree_lock, flags);
-			if (mapping_cap_account_dirty(mapping))
+			/*
+			 * We can continue to use `mapping' here because the
+			 * page is locked, which pins the address_space
+			 */
+			if (mapping_cap_account_dirty(mapping)) {
+				page_mkclean(page);
 				dec_page_state(nr_dirty);
+			}
 			return 1;
 		}
 		write_unlock_irqrestore(&mapping->tree_lock, flags);
@@ -759,8 +766,10 @@ int clear_page_dirty_for_io(struct page 
 
 	if (mapping) {
 		if (TestClearPageDirty(page)) {
-			if (mapping_cap_account_dirty(mapping))
+			if (mapping_cap_account_dirty(mapping)) {
+				page_mkclean(page);
 				dec_page_state(nr_dirty);
+			}
 			return 1;
 		}
 		return 0;
Index: 2.6-mm/mm/rmap.c
===================================================================
--- 2.6-mm.orig/mm/rmap.c	2006-06-23 23:08:49.000000000 +0200
+++ 2.6-mm/mm/rmap.c	2006-06-23 23:09:26.000000000 +0200
@@ -53,6 +53,7 @@
 #include <linux/rmap.h>
 #include <linux/rcupdate.h>
 #include <linux/module.h>
+#include <linux/backing-dev.h>
 
 #include <asm/tlbflush.h>
 
@@ -434,6 +435,75 @@ int page_referenced(struct page *page, i
 	return referenced;
 }
 
+static int page_mkclean_one(struct page *page, struct vm_area_struct *vma, int protect)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long address;
+	pte_t *pte, entry;
+	spinlock_t *ptl;
+	int ret = 0;
+
+	address = vma_address(page, vma);
+	if (address == -EFAULT)
+		goto out;
+
+	pte = page_check_address(page, mm, address, &ptl);
+	if (!pte)
+		goto out;
+
+	if (!(pte_dirty(*pte) || (protect && pte_write(*pte))))
+		goto unlock;
+
+	entry = ptep_get_and_clear(mm, address, pte);
+	entry = pte_mkclean(entry);
+	if (protect)
+		entry = pte_wrprotect(entry);
+	ptep_establish(vma, address, pte, entry);
+	lazy_mmu_prot_update(entry);
+	ret = 1;
+
+unlock:
+	pte_unmap_unlock(pte, ptl);
+out:
+	return ret;
+}
+
+static int page_mkclean_file(struct address_space *mapping, struct page *page)
+{
+	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	struct vm_area_struct *vma;
+	struct prio_tree_iter iter;
+	int ret = 0;
+
+	BUG_ON(PageAnon(page));
+
+	spin_lock(&mapping->i_mmap_lock);
+	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
+		int protect = mapping_cap_account_dirty(mapping) &&
+			((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
+			 (VM_WRITE|VM_SHARED));
+		BUG_ON(vma->vm_flags & (VM_PFNMAP|VM_INSERTPAGE));
+		ret += page_mkclean_one(page, vma, protect);
+	}
+	spin_unlock(&mapping->i_mmap_lock);
+	return ret;
+}
+
+int page_mkclean(struct page *page)
+{
+	int ret = 0;
+
+	BUG_ON(!PageLocked(page));
+
+	if (page_mapped(page)) {
+		struct address_space *mapping = page_mapping(page);
+		if (mapping)
+			ret = page_mkclean_file(mapping, page);
+	}
+
+	return ret;
+}
+
 /**
  * page_set_anon_rmap - setup new anonymous rmap
  * @page:	the page to add the mapping to
Index: 2.6-mm/include/linux/rmap.h
===================================================================
--- 2.6-mm.orig/include/linux/rmap.h	2006-06-23 23:08:49.000000000 +0200
+++ 2.6-mm/include/linux/rmap.h	2006-06-23 23:09:26.000000000 +0200
@@ -103,6 +103,14 @@ pte_t *page_check_address(struct page *,
  */
 unsigned long page_address_in_vma(struct page *, struct vm_area_struct *);
 
+/*
+ * Cleans the PTEs of shared mappings.
+ * (and since clean PTEs should also be readonly, write protects them too)
+ *
+ * returns the number of cleaned PTEs.
+ */
+int page_mkclean(struct page *);
+
 #else	/* !CONFIG_MMU */
 
 #define anon_vma_init()		do {} while (0)
Index: 2.6-mm/fs/buffer.c
===================================================================
--- 2.6-mm.orig/fs/buffer.c	2006-06-23 23:08:49.000000000 +0200
+++ 2.6-mm/fs/buffer.c	2006-06-23 23:09:26.000000000 +0200
@@ -2983,6 +2983,7 @@ int try_to_free_buffers(struct page *pag
 
 	spin_lock(&mapping->private_lock);
 	ret = drop_buffers(page, &buffers_to_free);
+	spin_unlock(&mapping->private_lock);
 	if (ret) {
 		/*
 		 * If the filesystem writes its buffers by hand (eg ext3)
@@ -2994,7 +2995,6 @@ int try_to_free_buffers(struct page *pag
 		 */
 		clear_page_dirty(page);
 	}
-	spin_unlock(&mapping->private_lock);
 out:
 	if (buffers_to_free) {
 		struct buffer_head *bh = buffers_to_free;

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 2/5] mm: balance dirty pages
  2006-06-23 22:31 [PATCH 0/5] mm: tracking dirty pages -v11 Peter Zijlstra
  2006-06-23 22:31 ` [PATCH 1/5] mm: tracking shared dirty pages Peter Zijlstra
@ 2006-06-23 22:31 ` Peter Zijlstra
  2006-06-23 22:31 ` [PATCH 3/5] mm: msync() cleanup Peter Zijlstra
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 8+ messages in thread
From: Peter Zijlstra @ 2006-06-23 22:31 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Andrew Morton, David Howells, Peter Zijlstra,
	Christoph Lameter, Martin Bligh, Nick Piggin, Linus Torvalds


From: Peter Zijlstra <a.p.zijlstra@chello.nl>

Now that we can detect writers of shared mappings, throttle them.
Avoids OOM by surprise.

Changes -v2:

 - small helper function (Andrew Morton)

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

 include/linux/writeback.h |    1 +
 mm/memory.c               |    5 +++--
 mm/page-writeback.c       |   10 ++++++++++
 3 files changed, 14 insertions(+), 2 deletions(-)

Index: 2.6-mm/mm/memory.c
===================================================================
--- 2.6-mm.orig/mm/memory.c	2006-06-23 01:08:11.000000000 +0200
+++ 2.6-mm/mm/memory.c	2006-06-23 11:02:03.000000000 +0200
@@ -49,6 +49,7 @@
 #include <linux/module.h>
 #include <linux/delayacct.h>
 #include <linux/init.h>
+#include <linux/writeback.h>
 
 #include <asm/pgalloc.h>
 #include <asm/uaccess.h>
@@ -1571,7 +1572,7 @@ gotten:
 unlock:
 	pte_unmap_unlock(page_table, ptl);
 	if (dirty_page) {
-		set_page_dirty(dirty_page);
+		set_page_dirty_balance(dirty_page);
 		put_page(dirty_page);
 	}
 	return ret;
@@ -2218,7 +2219,7 @@ retry:
 unlock:
 	pte_unmap_unlock(page_table, ptl);
 	if (dirty_page) {
-		set_page_dirty(dirty_page);
+		set_page_dirty_balance(dirty_page);
 		put_page(dirty_page);
 	}
 	return ret;
Index: 2.6-mm/include/linux/writeback.h
===================================================================
--- 2.6-mm.orig/include/linux/writeback.h	2006-06-22 17:59:07.000000000 +0200
+++ 2.6-mm/include/linux/writeback.h	2006-06-23 11:02:03.000000000 +0200
@@ -121,6 +121,7 @@ int sync_page_range(struct inode *inode,
 			loff_t pos, loff_t count);
 int sync_page_range_nolock(struct inode *inode, struct address_space *mapping,
 			   loff_t pos, loff_t count);
+void set_page_dirty_balance(struct page *page);
 
 /* pdflush.c */
 extern int nr_pdflush_threads;	/* Global so it can be exported to sysctl
Index: 2.6-mm/mm/page-writeback.c
===================================================================
--- 2.6-mm.orig/mm/page-writeback.c	2006-06-23 00:27:09.000000000 +0200
+++ 2.6-mm/mm/page-writeback.c	2006-06-23 11:02:03.000000000 +0200
@@ -256,6 +256,16 @@ static void balance_dirty_pages(struct a
 		pdflush_operation(background_writeout, 0);
 }
 
+void set_page_dirty_balance(struct page *page)
+{
+	if (set_page_dirty(page)) {
+		struct address_space *mapping = page_mapping(page);
+
+		if (mapping)
+			balance_dirty_pages_ratelimited(mapping);
+	}
+}
+
 /**
  * balance_dirty_pages_ratelimited_nr - balance dirty memory state
  * @mapping: address_space which was dirtied

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 3/5] mm: msync() cleanup
  2006-06-23 22:31 [PATCH 0/5] mm: tracking dirty pages -v11 Peter Zijlstra
  2006-06-23 22:31 ` [PATCH 1/5] mm: tracking shared dirty pages Peter Zijlstra
  2006-06-23 22:31 ` [PATCH 2/5] mm: balance " Peter Zijlstra
@ 2006-06-23 22:31 ` Peter Zijlstra
  2006-06-23 22:31 ` [PATCH 4/5] mm: optimize the new mprotect() code a bit Peter Zijlstra
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 8+ messages in thread
From: Peter Zijlstra @ 2006-06-23 22:31 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Andrew Morton, David Howells, Peter Zijlstra,
	Christoph Lameter, Martin Bligh, Nick Piggin, Linus Torvalds


From: Peter Zijlstra <a.p.zijlstra@chello.nl>

With the tracking of dirty pages properly done now, msync doesn't need to
scan the PTEs anymore to determine the dirty status.

From: Hugh Dickins <hugh@veritas.com>

In looking to do that, I made some other tidyups: can remove several
#includes, and sys_msync loop termination not quite right.

Most of those points are criticisms of the existing sys_msync, not of
your patch.  In particular, the loop termination errors were introduced
in 2.6.17: I did notice this shortly before it came out, but decided I
was more likely to get it wrong myself, and make matters worse if I
tried to rush a last-minute fix in.  And it's not terribly likely
to go wrong, nor disastrous if it does go wrong (may miss reporting
an unmapped area; may also fsync file of a following vma).

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

 mm/msync.c |  194 +++++++++----------------------------------------------------
 1 file changed, 31 insertions(+), 163 deletions(-)

Index: 2.6-mm/mm/msync.c
===================================================================
--- 2.6-mm.orig/mm/msync.c	2006-06-22 17:59:06.000000000 +0200
+++ 2.6-mm/mm/msync.c	2006-06-23 13:28:09.000000000 +0200
@@ -7,149 +7,33 @@
 /*
  * The msync() system call.
  */
-#include <linux/slab.h>
-#include <linux/pagemap.h>
 #include <linux/fs.h>
 #include <linux/mm.h>
 #include <linux/mman.h>
-#include <linux/hugetlb.h>
-#include <linux/writeback.h>
 #include <linux/file.h>
 #include <linux/syscalls.h>
 
-#include <asm/pgtable.h>
-#include <asm/tlbflush.h>
-
-static unsigned long msync_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
-				unsigned long addr, unsigned long end)
-{
-	pte_t *pte;
-	spinlock_t *ptl;
-	int progress = 0;
-	unsigned long ret = 0;
-
-again:
-	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
-	do {
-		struct page *page;
-
-		if (progress >= 64) {
-			progress = 0;
-			if (need_resched() || need_lockbreak(ptl))
-				break;
-		}
-		progress++;
-		if (!pte_present(*pte))
-			continue;
-		if (!pte_maybe_dirty(*pte))
-			continue;
-		page = vm_normal_page(vma, addr, *pte);
-		if (!page)
-			continue;
-		if (ptep_clear_flush_dirty(vma, addr, pte) ||
-				page_test_and_clear_dirty(page))
-			ret += set_page_dirty(page);
-		progress += 3;
-	} while (pte++, addr += PAGE_SIZE, addr != end);
-	pte_unmap_unlock(pte - 1, ptl);
-	cond_resched();
-	if (addr != end)
-		goto again;
-	return ret;
-}
-
-static inline unsigned long msync_pmd_range(struct vm_area_struct *vma,
-			pud_t *pud, unsigned long addr, unsigned long end)
-{
-	pmd_t *pmd;
-	unsigned long next;
-	unsigned long ret = 0;
-
-	pmd = pmd_offset(pud, addr);
-	do {
-		next = pmd_addr_end(addr, end);
-		if (pmd_none_or_clear_bad(pmd))
-			continue;
-		ret += msync_pte_range(vma, pmd, addr, next);
-	} while (pmd++, addr = next, addr != end);
-	return ret;
-}
-
-static inline unsigned long msync_pud_range(struct vm_area_struct *vma,
-			pgd_t *pgd, unsigned long addr, unsigned long end)
-{
-	pud_t *pud;
-	unsigned long next;
-	unsigned long ret = 0;
-
-	pud = pud_offset(pgd, addr);
-	do {
-		next = pud_addr_end(addr, end);
-		if (pud_none_or_clear_bad(pud))
-			continue;
-		ret += msync_pmd_range(vma, pud, addr, next);
-	} while (pud++, addr = next, addr != end);
-	return ret;
-}
-
-static unsigned long msync_page_range(struct vm_area_struct *vma,
-				unsigned long addr, unsigned long end)
-{
-	pgd_t *pgd;
-	unsigned long next;
-	unsigned long ret = 0;
-
-	/* For hugepages we can't go walking the page table normally,
-	 * but that's ok, hugetlbfs is memory based, so we don't need
-	 * to do anything more on an msync().
-	 */
-	if (vma->vm_flags & VM_HUGETLB)
-		return 0;
-
-	BUG_ON(addr >= end);
-	pgd = pgd_offset(vma->vm_mm, addr);
-	flush_cache_range(vma, addr, end);
-	do {
-		next = pgd_addr_end(addr, end);
-		if (pgd_none_or_clear_bad(pgd))
-			continue;
-		ret += msync_pud_range(vma, pgd, addr, next);
-	} while (pgd++, addr = next, addr != end);
-	return ret;
-}
-
 /*
  * MS_SYNC syncs the entire file - including mappings.
  *
- * MS_ASYNC does not start I/O (it used to, up to 2.5.67).  Instead, it just
- * marks the relevant pages dirty.  The application may now run fsync() to
+ * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
+ * Nor does it marks the relevant pages dirty (it used to up to 2.6.17).
+ * Now it doesn't do anything, since dirty pages are properly tracked.
+ *
+ * The application may now run fsync() to
  * write out the dirty pages and wait on the writeout and check the result.
  * Or the application may run fadvise(FADV_DONTNEED) against the fd to start
  * async writeout immediately.
  * So by _not_ starting I/O in MS_ASYNC we provide complete flexibility to
  * applications.
  */
-static int msync_interval(struct vm_area_struct *vma, unsigned long addr,
-			unsigned long end, int flags,
-			unsigned long *nr_pages_dirtied)
-{
-	struct file *file = vma->vm_file;
-
-	if ((flags & MS_INVALIDATE) && (vma->vm_flags & VM_LOCKED))
-		return -EBUSY;
-
-	if (file && (vma->vm_flags & VM_SHARED))
-		*nr_pages_dirtied = msync_page_range(vma, addr, end);
-	return 0;
-}
-
 asmlinkage long sys_msync(unsigned long start, size_t len, int flags)
 {
 	unsigned long end;
+	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma;
 	int unmapped_error = 0;
 	int error = -EINVAL;
-	int done = 0;
 
 	if (flags & ~(MS_ASYNC | MS_INVALIDATE | MS_SYNC))
 		goto out;
@@ -169,64 +53,48 @@ asmlinkage long sys_msync(unsigned long 
 	 * If the interval [start,end) covers some unmapped address ranges,
 	 * just ignore them, but return -ENOMEM at the end.
 	 */
-	down_read(&current->mm->mmap_sem);
-	vma = find_vma(current->mm, start);
-	if (!vma) {
-		error = -ENOMEM;
-		goto out_unlock;
-	}
-	do {
-		unsigned long nr_pages_dirtied = 0;
+	down_read(&mm->mmap_sem);
+	vma = find_vma(mm, start);
+	for (;;) {
 		struct file *file;
 
+		/* Still start < end. */
+		error = -ENOMEM;
+		if (!vma)
+			goto out_unlock;
 		/* Here start < vma->vm_end. */
 		if (start < vma->vm_start) {
-			unmapped_error = -ENOMEM;
 			start = vma->vm_start;
+			if (start >= end)
+				goto out_unlock;
+			unmapped_error = -ENOMEM;
 		}
 		/* Here vma->vm_start <= start < vma->vm_end. */
-		if (end <= vma->vm_end) {
-			if (start < end) {
-				error = msync_interval(vma, start, end, flags,
-							&nr_pages_dirtied);
-				if (error)
-					goto out_unlock;
-			}
-			error = unmapped_error;
-			done = 1;
-		} else {
-			/* Here vma->vm_start <= start < vma->vm_end < end. */
-			error = msync_interval(vma, start, vma->vm_end, flags,
-						&nr_pages_dirtied);
-			if (error)
-				goto out_unlock;
+		if ((flags & MS_INVALIDATE) &&
+				(vma->vm_flags & VM_LOCKED)) {
+			error = -EBUSY;
+			goto out_unlock;
 		}
 		file = vma->vm_file;
 		start = vma->vm_end;
-		if ((flags & MS_ASYNC) && file && nr_pages_dirtied) {
-			get_file(file);
-			up_read(&current->mm->mmap_sem);
-			balance_dirty_pages_ratelimited_nr(file->f_mapping,
-							nr_pages_dirtied);
-			fput(file);
-			down_read(&current->mm->mmap_sem);
-			vma = find_vma(current->mm, start);
-		} else if ((flags & MS_SYNC) && file &&
+		if ((flags & MS_SYNC) && file &&
 				(vma->vm_flags & VM_SHARED)) {
 			get_file(file);
-			up_read(&current->mm->mmap_sem);
+			up_read(&mm->mmap_sem);
 			error = do_fsync(file, 0);
 			fput(file);
-			down_read(&current->mm->mmap_sem);
-			if (error)
-				goto out_unlock;
-			vma = find_vma(current->mm, start);
+			if (error || start >= end)
+				goto out;
+			down_read(&mm->mmap_sem);
+			vma = find_vma(mm, start);
 		} else {
+			if (start >= end)
+				goto out_unlock;
 			vma = vma->vm_next;
 		}
-	} while (vma && !done);
+	}
 out_unlock:
-	up_read(&current->mm->mmap_sem);
+	up_read(&mm->mmap_sem);
 out:
-	return error;
+	return error ? : unmapped_error;
 }

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 4/5] mm: optimize the new mprotect() code a bit
  2006-06-23 22:31 [PATCH 0/5] mm: tracking dirty pages -v11 Peter Zijlstra
                   ` (2 preceding siblings ...)
  2006-06-23 22:31 ` [PATCH 3/5] mm: msync() cleanup Peter Zijlstra
@ 2006-06-23 22:31 ` Peter Zijlstra
  2006-06-23 22:31 ` [PATCH 5/5] mm: small cleanup of install_page() Peter Zijlstra
  2006-06-26 15:35 ` [PATCH 0/5] mm: tracking dirty pages -v11 Hugh Dickins
  5 siblings, 0 replies; 8+ messages in thread
From: Peter Zijlstra @ 2006-06-23 22:31 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Andrew Morton, David Howells, Peter Zijlstra,
	Christoph Lameter, Martin Bligh, Nick Piggin, Linus Torvalds


From: Peter Zijlstra <a.p.zijlstra@chello.nl>

mprotect() resets the page protections, which could result in extra write
faults for those pages whose dirty state we track using write faults
and are dirty already.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 mm/mprotect.c |   33 +++++++++++++++++++++++----------
 1 file changed, 23 insertions(+), 10 deletions(-)

Index: 2.6-mm/mm/mprotect.c
===================================================================
--- 2.6-mm.orig/mm/mprotect.c	2006-06-23 01:35:24.000000000 +0200
+++ 2.6-mm/mm/mprotect.c	2006-06-23 15:06:10.000000000 +0200
@@ -28,7 +28,8 @@
 #include <asm/tlbflush.h>
 
 static void change_pte_range(struct mm_struct *mm, pmd_t *pmd,
-		unsigned long addr, unsigned long end, pgprot_t newprot)
+		unsigned long addr, unsigned long end, pgprot_t newprot,
+		int dirty_accountable)
 {
 	pte_t *pte, oldpte;
 	spinlock_t *ptl;
@@ -43,7 +44,14 @@ static void change_pte_range(struct mm_s
 			 * bits by wiping the pte and then setting the new pte
 			 * into place.
 			 */
-			ptent = pte_modify(ptep_get_and_clear(mm, addr, pte), newprot);
+			ptent = ptep_get_and_clear(mm, addr, pte);
+			ptent = pte_modify(ptent, newprot);
+			/*
+			 * Avoid taking write faults for pages we know to be
+			 * dirty.
+			 */
+			if (dirty_accountable && pte_dirty(ptent))
+				ptent = pte_mkwrite(ptent);
 			set_pte_at(mm, addr, pte, ptent);
 			lazy_mmu_prot_update(ptent);
 #ifdef CONFIG_MIGRATION
@@ -67,7 +75,8 @@ static void change_pte_range(struct mm_s
 }
 
 static inline void change_pmd_range(struct mm_struct *mm, pud_t *pud,
-		unsigned long addr, unsigned long end, pgprot_t newprot)
+		unsigned long addr, unsigned long end, pgprot_t newprot,
+		int dirty_accountable)
 {
 	pmd_t *pmd;
 	unsigned long next;
@@ -77,12 +86,13 @@ static inline void change_pmd_range(stru
 		next = pmd_addr_end(addr, end);
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
-		change_pte_range(mm, pmd, addr, next, newprot);
+		change_pte_range(mm, pmd, addr, next, newprot, dirty_accountable);
 	} while (pmd++, addr = next, addr != end);
 }
 
 static inline void change_pud_range(struct mm_struct *mm, pgd_t *pgd,
-		unsigned long addr, unsigned long end, pgprot_t newprot)
+		unsigned long addr, unsigned long end, pgprot_t newprot,
+		int dirty_accountable)
 {
 	pud_t *pud;
 	unsigned long next;
@@ -92,12 +102,13 @@ static inline void change_pud_range(stru
 		next = pud_addr_end(addr, end);
 		if (pud_none_or_clear_bad(pud))
 			continue;
-		change_pmd_range(mm, pud, addr, next, newprot);
+		change_pmd_range(mm, pud, addr, next, newprot, dirty_accountable);
 	} while (pud++, addr = next, addr != end);
 }
 
 static void change_protection(struct vm_area_struct *vma,
-		unsigned long addr, unsigned long end, pgprot_t newprot)
+		unsigned long addr, unsigned long end, pgprot_t newprot,
+		int dirty_accountable)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	pgd_t *pgd;
@@ -111,7 +122,7 @@ static void change_protection(struct vm_
 		next = pgd_addr_end(addr, end);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
-		change_pud_range(mm, pgd, addr, next, newprot);
+		change_pud_range(mm, pgd, addr, next, newprot, dirty_accountable);
 	} while (pgd++, addr = next, addr != end);
 	flush_tlb_range(vma, start, end);
 }
@@ -128,6 +139,7 @@ mprotect_fixup(struct vm_area_struct *vm
 	pgprot_t newprot;
 	pgoff_t pgoff;
 	int error;
+	int dirty_accountable = 0;
 
 	if (newflags == oldflags) {
 		*pprev = vma;
@@ -190,8 +202,10 @@ success:
 			  (VM_WRITE|VM_SHARED)) &&
 	     vma->vm_file && vma->vm_file->f_mapping &&
 	     mapping_cap_account_dirty(vma->vm_file->f_mapping)) ||
-	    (vma->vm_ops && vma->vm_ops->page_mkwrite))
+	    (vma->vm_ops && vma->vm_ops->page_mkwrite)) {
 		mask &= ~VM_SHARED;
+		dirty_accountable = 1;
+	}
 
 	newprot = protection_map[newflags & mask];
 
@@ -204,7 +218,7 @@ success:
 	if (is_vm_hugetlb_page(vma))
 		hugetlb_change_protection(vma, start, end, newprot);
 	else
-		change_protection(vma, start, end, newprot);
+		change_protection(vma, start, end, newprot, dirty_accountable);
 	vm_stat_account(mm, oldflags, vma->vm_file, -nrpages);
 	vm_stat_account(mm, newflags, vma->vm_file, nrpages);
 	return 0;

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 5/5] mm: small cleanup of install_page()
  2006-06-23 22:31 [PATCH 0/5] mm: tracking dirty pages -v11 Peter Zijlstra
                   ` (3 preceding siblings ...)
  2006-06-23 22:31 ` [PATCH 4/5] mm: optimize the new mprotect() code a bit Peter Zijlstra
@ 2006-06-23 22:31 ` Peter Zijlstra
  2006-06-26 15:35 ` [PATCH 0/5] mm: tracking dirty pages -v11 Hugh Dickins
  5 siblings, 0 replies; 8+ messages in thread
From: Peter Zijlstra @ 2006-06-23 22:31 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Andrew Morton, David Howells, Peter Zijlstra,
	Christoph Lameter, Martin Bligh, Nick Piggin, Linus Torvalds


From: Peter Zijlstra <a.p.zijlstra@chello.nl>

Smallish cleanup to install_page(), could save a memory read
(haven't checked the asm output) and sure looks nicer.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---

---
 mm/fremap.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: 2.6-mm/mm/fremap.c
===================================================================
--- 2.6-mm.orig/mm/fremap.c	2006-06-19 16:20:52.000000000 +0200
+++ 2.6-mm/mm/fremap.c	2006-06-19 16:20:57.000000000 +0200
@@ -79,9 +79,9 @@ int install_page(struct mm_struct *mm, s
 		inc_mm_counter(mm, file_rss);
 
 	flush_icache_page(vma, page);
-	set_pte_at(mm, addr, pte, mk_pte(page, prot));
+	pte_val = mk_pte(page, prot);
+	set_pte_at(mm, addr, pte, pte_val);
 	page_add_file_rmap(page);
-	pte_val = *pte;
 	update_mmu_cache(vma, addr, pte_val);
 	lazy_mmu_prot_update(pte_val);
 	err = 0;

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/5] mm: tracking shared dirty pages
  2006-06-23 22:31 ` [PATCH 1/5] mm: tracking shared dirty pages Peter Zijlstra
@ 2006-06-23 23:54   ` Peter Zijlstra
  0 siblings, 0 replies; 8+ messages in thread
From: Peter Zijlstra @ 2006-06-23 23:54 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Hugh Dickins, Andrew Morton, David Howells,
	Christoph Lameter, Martin Bligh, Nick Piggin, Linus Torvalds


From: Peter Zijlstra <a.p.zijlstra@chello.nl>

Tracking of dirty pages in shared writeable mmap()s.

The idea is simple: write protect clean shared writeable pages,
catch the write-fault, make writeable and set dirty. On page write-back
clean all the PTE dirty bits and write protect them once again.

The implementation is a tad harder, mainly because the default
backing_dev_info capabilities were too loosely maintained. Hence it is
not enough to test the backing_dev_info for cap_account_dirty.

The current heuristic is as follows, a VMA is eligible when:
 - its shared writeable 
    (vm_flags & (VM_WRITE|VM_SHARED)) == (VM_WRITE|VM_SHARED)
 - it is not a 'special' mapping
    (vm_flags & (VM_PFNMAP|VM_INSERTPAGE)) == 0
 - the backing_dev_info is cap_account_dirty
    mapping_cap_account_dirty(vma->vm_file->f_mapping)
 - f_op->mmap() didn't change the default page protection

Page from remap_pfn_range() are explicitly excluded because their
COW semantics are already horrid enough (see vm_normal_page() in
do_wp_page()) and because they don't have a backing store anyway.

mprotect() is taught about the new behaviour as well. However it fudges
the last.

Cleaning the pages on write-back is done with page_mkclean() a new
rmap call. It can be called on any page, but is currently only
implemented for mapped pages, if the page is found the be of a 
trackable VMA it will also wrprotect the PTE.



Bah Bah Bah, why didn't the page_mkwrite() patch re-protect clean pages?
And is it a Bad-Thing (tm) that that can happen now?



Finally, in fs/buffers.c:try_to_free_buffers(); remove clear_page_dirty()
from under ->private_lock. This seems to be safe, since ->private_lock 
is used to serialize access to the buffers, not the page itself.
This is needed because clear_page_dirty() will call into page_mkclean()
and would thereby violate locking order.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
Changes in -v12

 - Linus suggested a nice replacement for the beast condition

Changes in -v11

 - small cleanups
 - pulled page_mkclean back under mapping_cap_account_dirty()

Changes in -v10

 - 2.6.17-mm1
 - Drop the ugly duckling pgprotting, Hugh suggested resetting the
   vma->vm_page_prot when f_op->mmap() didn't modify it. If it were
   modified we're not interested anyway.
 - abandon is_shared_writable() because its actually spelled writeable
   and it didn't actually mean that any more.
 - Comments all round.

Changes in -v9

 - respin against latest -mm.

Changes in -v8

 - access_process_vm() and other force users of get_user_pages() can
   induce COW of read-only shared mappings.

Changes in -v7

 - changed is_shared_writable() to exclude VM_PFNMAP'ed regions.
 - Hugh's tiresome problem wasn't fully solved, now using the ugly duckling
   method.

Changes in -v6

 - make page_mkclean_one() modify the pte more like change_pte_range() 
   (suggested by Christoph Lameter)
 - made is_shared_writable() take vm_flags, it now resembles is_cow_mapping().
 - fixed the mprotect() bug (spotted by Hugh Dickins)
 - hopefully fixed the tiresome issue of do_mmap_pgoff() trampling on
   driver specific vm_page_prot settings (spotted by Hugh Dickins)
 - made a new version of the page_mkwrite() patch to go on top of all this.
   This so that Linus could merge this very early on in 2.6.18.

Changes in -v5

 - rename page_wrprotect() to page_mkclean() (suggested by Nick Piggin)
 - added comment to test_clear_page_dirty() (Andrew Morton)
 - cleanup page_wrprotect() (Andrew Morton)
 - renamed VM_SharedWritable() to is_shared_writable()
 - fs/buffers.c try_to_free_buffers(): remove clear_page_dirty() from under
   ->private_lock. This seems to be save, since ->private_lock is used to
   serialize access to the buffers, not the page itself.
 - rebased on top of David Howells' page_mkwrite() patch.

Changes in -v4:

 - small cleanup as suggested by Christoph Lameter.

Changes in -v3:

 - move set_page_dirty() outside pte lock (suggested by Christoph Lameter)

Changes in -v2:

 - only wrprotect pages from dirty capable mappings. (Nick Piggin)
 - move the writefault handling from do_wp_page() into handle_pte_fault(). 
   (Nick Piggin)
 - revert to the old install_page interface. (Nick Piggin)
 - also clear the pte dirty bit when we make pages read-only again.
   (spotted by Rik van Riel)
 - make page_wrprotect() return the number of reprotected ptes.

 fs/buffer.c          |    2 -
 include/linux/mm.h   |   34 ++++++++++++++++++++++++
 include/linux/rmap.h |    8 +++++
 mm/memory.c          |   29 ++++++++++++++++-----
 mm/mmap.c            |   10 ++-----
 mm/mprotect.c        |   22 ++++++----------
 mm/page-writeback.c  |   15 ++++++++--
 mm/rmap.c            |   70 +++++++++++++++++++++++++++++++++++++++++++++++++++
 8 files changed, 161 insertions(+), 29 deletions(-)

Index: 2.6-mm/mm/memory.c
===================================================================
--- 2.6-mm.orig/mm/memory.c	2006-06-24 01:39:29.000000000 +0200
+++ 2.6-mm/mm/memory.c	2006-06-24 01:40:20.000000000 +0200
@@ -1458,14 +1458,19 @@ static int do_wp_page(struct mm_struct *
 {
 	struct page *old_page, *new_page;
 	pte_t entry;
-	int reuse, ret = VM_FAULT_MINOR;
+	int reuse = 0, ret = VM_FAULT_MINOR;
+	struct page *dirty_page = NULL;
 
 	old_page = vm_normal_page(vma, address, orig_pte);
 	if (!old_page)
 		goto gotten;
 
-	if (unlikely((vma->vm_flags & (VM_SHARED|VM_WRITE)) ==
-				(VM_SHARED|VM_WRITE))) {
+	/*
+	 * Only catch write-faults on shared writable pages, read-only
+	 * shared pages can get COWed by get_user_pages(.write=1, .force=1).
+	 */
+	if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
+					(VM_WRITE|VM_SHARED))) {
 		if (vma->vm_ops && vma->vm_ops->page_mkwrite) {
 			/*
 			 * Notify the address space that the page is about to
@@ -1494,13 +1499,12 @@ static int do_wp_page(struct mm_struct *
 			if (!pte_same(*page_table, orig_pte))
 				goto unlock;
 		}
-
+		dirty_page = old_page;
+		get_page(dirty_page);
 		reuse = 1;
 	} else if (PageAnon(old_page) && !TestSetPageLocked(old_page)) {
 		reuse = can_share_swap_page(old_page);
 		unlock_page(old_page);
-	} else {
-		reuse = 0;
 	}
 
 	if (reuse) {
@@ -1566,6 +1570,10 @@ gotten:
 		page_cache_release(old_page);
 unlock:
 	pte_unmap_unlock(page_table, ptl);
+	if (dirty_page) {
+		set_page_dirty(dirty_page);
+		put_page(dirty_page);
+	}
 	return ret;
 oom:
 	if (old_page)
@@ -2098,6 +2106,7 @@ static int do_no_page(struct mm_struct *
 	unsigned int sequence = 0;
 	int ret = VM_FAULT_MINOR;
 	int anon = 0;
+	struct page *dirty_page = NULL;
 
 	pte_unmap(page_table);
 	BUG_ON(vma->vm_flags & VM_PFNMAP);
@@ -2192,6 +2201,10 @@ retry:
 		} else {
 			inc_mm_counter(mm, file_rss);
 			page_add_file_rmap(new_page);
+			if (write_access) {
+				dirty_page = new_page;
+				get_page(dirty_page);
+			}
 		}
 	} else {
 		/* One of our sibling threads was faster, back out. */
@@ -2204,6 +2217,10 @@ retry:
 	lazy_mmu_prot_update(entry);
 unlock:
 	pte_unmap_unlock(page_table, ptl);
+	if (dirty_page) {
+		set_page_dirty(dirty_page);
+		put_page(dirty_page);
+	}
 	return ret;
 oom:
 	page_cache_release(new_page);
Index: 2.6-mm/mm/mmap.c
===================================================================
--- 2.6-mm.orig/mm/mmap.c	2006-06-24 01:39:29.000000000 +0200
+++ 2.6-mm/mm/mmap.c	2006-06-24 01:40:20.000000000 +0200
@@ -1090,12 +1090,6 @@ munmap_back:
 			goto free_vma;
 	}
 
-	/* Don't make the VMA automatically writable if it's shared, but the
-	 * backer wishes to know when pages are first written to */
-	if (vma->vm_ops && vma->vm_ops->page_mkwrite)
-		vma->vm_page_prot =
-			protection_map[vm_flags & (VM_READ|VM_WRITE|VM_EXEC)];
-
 	/* We set VM_ACCOUNT in a shared mapping's vm_flags, to inform
 	 * shmem_zero_setup (perhaps called through /dev/zero's ->mmap)
 	 * that memory reservation must be checked; but that reservation
@@ -1113,6 +1107,10 @@ munmap_back:
 	pgoff = vma->vm_pgoff;
 	vm_flags = vma->vm_flags;
 
+	if (vma_wants_writenotify(vma))
+		vma->vm_page_prot =
+			protection_map[vm_flags & (VM_READ|VM_WRITE|VM_EXEC)];
+
 	if (!file || !vma_merge(mm, prev, addr, vma->vm_end,
 			vma->vm_flags, NULL, file, pgoff, vma_policy(vma))) {
 		file = vma->vm_file;
Index: 2.6-mm/mm/mprotect.c
===================================================================
--- 2.6-mm.orig/mm/mprotect.c	2006-06-24 01:39:29.000000000 +0200
+++ 2.6-mm/mm/mprotect.c	2006-06-24 01:40:20.000000000 +0200
@@ -123,8 +123,7 @@ mprotect_fixup(struct vm_area_struct *vm
 	unsigned long oldflags = vma->vm_flags;
 	long nrpages = (end - start) >> PAGE_SHIFT;
 	unsigned long charged = 0;
-	unsigned int mask;
-	pgprot_t newprot;
+	unsigned int mask = VM_READ|VM_WRITE|VM_EXEC|VM_SHARED;
 	pgoff_t pgoff;
 	int error;
 
@@ -176,24 +175,21 @@ mprotect_fixup(struct vm_area_struct *vm
 	}
 
 success:
-	/* Don't make the VMA automatically writable if it's shared, but the
-	 * backer wishes to know when pages are first written to */
-	mask = VM_READ|VM_WRITE|VM_EXEC|VM_SHARED;
-	if (vma->vm_ops && vma->vm_ops->page_mkwrite)
-		mask &= ~VM_SHARED;
-
-	newprot = protection_map[newflags & mask];
-
 	/*
 	 * vm_flags and vm_page_prot are protected by the mmap_sem
 	 * held in write mode.
 	 */
 	vma->vm_flags = newflags;
-	vma->vm_page_prot = newprot;
+	vma->vm_page_prot = protection_map[newflags & mask];
+	if (vma_wants_writenotify(vma)) {
+		mask &= ~VM_SHARED;
+		vma->vm_page_prot = protection_map[newflags & mask];
+	}
+
 	if (is_vm_hugetlb_page(vma))
-		hugetlb_change_protection(vma, start, end, newprot);
+		hugetlb_change_protection(vma, start, end, vma->vm_page_prot);
 	else
-		change_protection(vma, start, end, newprot);
+		change_protection(vma, start, end, vma->vm_page_prot);
 	vm_stat_account(mm, oldflags, vma->vm_file, -nrpages);
 	vm_stat_account(mm, newflags, vma->vm_file, nrpages);
 	return 0;
Index: 2.6-mm/mm/page-writeback.c
===================================================================
--- 2.6-mm.orig/mm/page-writeback.c	2006-06-24 01:39:29.000000000 +0200
+++ 2.6-mm/mm/page-writeback.c	2006-06-24 01:40:20.000000000 +0200
@@ -29,6 +29,7 @@
 #include <linux/sysctl.h>
 #include <linux/cpu.h>
 #include <linux/syscalls.h>
+#include <linux/rmap.h>
 
 /*
  * The maximum number of pages to writeout in a single bdflush/kupdate
@@ -566,7 +567,7 @@ int do_writepages(struct address_space *
 		return 0;
 	wbc->for_writepages = 1;
 	if (mapping->a_ops->writepages)
-		ret =  mapping->a_ops->writepages(mapping, wbc);
+		ret = mapping->a_ops->writepages(mapping, wbc);
 	else
 		ret = generic_writepages(mapping, wbc);
 	wbc->for_writepages = 0;
@@ -728,8 +729,14 @@ int test_clear_page_dirty(struct page *p
 						page_index(page),
 						PAGECACHE_TAG_DIRTY);
 			write_unlock_irqrestore(&mapping->tree_lock, flags);
-			if (mapping_cap_account_dirty(mapping))
+			/*
+			 * We can continue to use `mapping' here because the
+			 * page is locked, which pins the address_space
+			 */
+			if (mapping_cap_account_dirty(mapping)) {
+				page_mkclean(page);
 				dec_page_state(nr_dirty);
+			}
 			return 1;
 		}
 		write_unlock_irqrestore(&mapping->tree_lock, flags);
@@ -759,8 +766,10 @@ int clear_page_dirty_for_io(struct page 
 
 	if (mapping) {
 		if (TestClearPageDirty(page)) {
-			if (mapping_cap_account_dirty(mapping))
+			if (mapping_cap_account_dirty(mapping)) {
+				page_mkclean(page);
 				dec_page_state(nr_dirty);
+			}
 			return 1;
 		}
 		return 0;
Index: 2.6-mm/mm/rmap.c
===================================================================
--- 2.6-mm.orig/mm/rmap.c	2006-06-24 01:39:29.000000000 +0200
+++ 2.6-mm/mm/rmap.c	2006-06-24 01:40:20.000000000 +0200
@@ -53,6 +53,7 @@
 #include <linux/rmap.h>
 #include <linux/rcupdate.h>
 #include <linux/module.h>
+#include <linux/backing-dev.h>
 
 #include <asm/tlbflush.h>
 
@@ -434,6 +435,75 @@ int page_referenced(struct page *page, i
 	return referenced;
 }
 
+static int page_mkclean_one(struct page *page, struct vm_area_struct *vma, int protect)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long address;
+	pte_t *pte, entry;
+	spinlock_t *ptl;
+	int ret = 0;
+
+	address = vma_address(page, vma);
+	if (address == -EFAULT)
+		goto out;
+
+	pte = page_check_address(page, mm, address, &ptl);
+	if (!pte)
+		goto out;
+
+	if (!(pte_dirty(*pte) || (protect && pte_write(*pte))))
+		goto unlock;
+
+	entry = ptep_get_and_clear(mm, address, pte);
+	entry = pte_mkclean(entry);
+	if (protect)
+		entry = pte_wrprotect(entry);
+	ptep_establish(vma, address, pte, entry);
+	lazy_mmu_prot_update(entry);
+	ret = 1;
+
+unlock:
+	pte_unmap_unlock(pte, ptl);
+out:
+	return ret;
+}
+
+static int page_mkclean_file(struct address_space *mapping, struct page *page)
+{
+	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	struct vm_area_struct *vma;
+	struct prio_tree_iter iter;
+	int ret = 0;
+
+	BUG_ON(PageAnon(page));
+
+	spin_lock(&mapping->i_mmap_lock);
+	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
+		int protect = mapping_cap_account_dirty(mapping) &&
+			((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
+			 (VM_WRITE|VM_SHARED));
+		BUG_ON(vma->vm_flags & (VM_PFNMAP|VM_INSERTPAGE));
+		ret += page_mkclean_one(page, vma, protect);
+	}
+	spin_unlock(&mapping->i_mmap_lock);
+	return ret;
+}
+
+int page_mkclean(struct page *page)
+{
+	int ret = 0;
+
+	BUG_ON(!PageLocked(page));
+
+	if (page_mapped(page)) {
+		struct address_space *mapping = page_mapping(page);
+		if (mapping)
+			ret = page_mkclean_file(mapping, page);
+	}
+
+	return ret;
+}
+
 /**
  * page_set_anon_rmap - setup new anonymous rmap
  * @page:	the page to add the mapping to
Index: 2.6-mm/include/linux/rmap.h
===================================================================
--- 2.6-mm.orig/include/linux/rmap.h	2006-06-24 01:39:29.000000000 +0200
+++ 2.6-mm/include/linux/rmap.h	2006-06-24 01:40:20.000000000 +0200
@@ -103,6 +103,14 @@ pte_t *page_check_address(struct page *,
  */
 unsigned long page_address_in_vma(struct page *, struct vm_area_struct *);
 
+/*
+ * Cleans the PTEs of shared mappings.
+ * (and since clean PTEs should also be readonly, write protects them too)
+ *
+ * returns the number of cleaned PTEs.
+ */
+int page_mkclean(struct page *);
+
 #else	/* !CONFIG_MMU */
 
 #define anon_vma_init()		do {} while (0)
Index: 2.6-mm/fs/buffer.c
===================================================================
--- 2.6-mm.orig/fs/buffer.c	2006-06-24 01:39:29.000000000 +0200
+++ 2.6-mm/fs/buffer.c	2006-06-24 01:40:20.000000000 +0200
@@ -2983,6 +2983,7 @@ int try_to_free_buffers(struct page *pag
 
 	spin_lock(&mapping->private_lock);
 	ret = drop_buffers(page, &buffers_to_free);
+	spin_unlock(&mapping->private_lock);
 	if (ret) {
 		/*
 		 * If the filesystem writes its buffers by hand (eg ext3)
@@ -2994,7 +2995,6 @@ int try_to_free_buffers(struct page *pag
 		 */
 		clear_page_dirty(page);
 	}
-	spin_unlock(&mapping->private_lock);
 out:
 	if (buffers_to_free) {
 		struct buffer_head *bh = buffers_to_free;
Index: 2.6-mm/include/linux/mm.h
===================================================================
--- 2.6-mm.orig/include/linux/mm.h	2006-06-24 01:39:29.000000000 +0200
+++ 2.6-mm/include/linux/mm.h	2006-06-24 01:40:20.000000000 +0200
@@ -15,6 +15,7 @@
 #include <linux/fs.h>
 #include <linux/mutex.h>
 #include <linux/debug_locks.h>
+#include <linux/backing-dev.h>
 
 struct mempolicy;
 struct anon_vma;
@@ -847,6 +848,39 @@ struct shrinker {
 extern struct shrinker *set_shrinker(int, shrinker_t);
 extern void remove_shrinker(struct shrinker *shrinker);
 
+/*
+ * Some shared mappigns will want the pages marked read-only
+ * to track write events. If so, we'll downgrade vm_page_prot
+ * to the private version (using protection_map[] without the
+ * VM_SHARED bit).
+ */
+static inline int vma_wants_writenotify(struct vm_area_struct *vma)
+{
+	unsigned int vm_flags = vma->vm_flags;
+
+	/* If it was private or non-writable, the write bit is already clear */
+	if ((vm_flags & (VM_WRITE|VM_SHARED)) != ((VM_WRITE|VM_SHARED)))
+		return 0;
+
+	/* The backer wishes to know when pages are first written to? */
+	if (vma->vm_ops && vma->vm_ops->page_mkwrite)
+		return 1;
+
+	/* The open routine did something to the protections already? */
+	if (pgprot_val(vma->vm_page_prot) !=
+	    pgprot_val(protection_map[vm_flags &
+		    (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)]))
+		return 0;
+
+	/* Specialty mapping? */
+	if (vm_flags & (VM_PFNMAP|VM_INSERTPAGE))
+		return 0;
+
+	/* Can the mapping track the dirty pages? */
+	return vma->vm_file && vma->vm_file->f_mapping &&
+		mapping_cap_account_dirty(vma->vm_file->f_mapping);
+}
+
 extern pte_t *FASTCALL(get_locked_pte(struct mm_struct *mm, unsigned long addr, spinlock_t **ptl));
 
 int __pud_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address);



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 0/5] mm: tracking dirty pages -v11
  2006-06-23 22:31 [PATCH 0/5] mm: tracking dirty pages -v11 Peter Zijlstra
                   ` (4 preceding siblings ...)
  2006-06-23 22:31 ` [PATCH 5/5] mm: small cleanup of install_page() Peter Zijlstra
@ 2006-06-26 15:35 ` Hugh Dickins
  5 siblings, 0 replies; 8+ messages in thread
From: Hugh Dickins @ 2006-06-26 15:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm, linux-kernel, Andrew Morton, David Howells,
	Christoph Lameter, Martin Bligh, Nick Piggin, Linus Torvalds

On Sat, 24 Jun 2006, Peter Zijlstra wrote:
> 
> I hope to have addressed all Hugh's latest comments in this version.
> Its against 2.6.17-mm1, however I wasted most of the day trying to 
> test it on that kernel. But due to various circumstances that failed.

Looks good - I'm happy that we leave the do_wp_page test reordering
(to fix up that third order ptrace poke issue) to a subsequent patch,
it's better separated.

> So I've tested something like this against something 2.6.17'ish and 
> respun against the -mm lineup.

Your next (final?) spin should be against Linus' current git tree,
http://ftp.kernel.org/pub/linux/kernel/v2.6/snapshots/patch-2.6.17-git10.bz2
is the latest snapshot patch if you're not using git itself.  That will
suit Andrew better too: he prefers patches against Linus' current tree,
except when the changes are to work that's only in -mm.

You ought to respin, because the vma_wants_writenotify mods in mprotect.c
affect later patches in your series, giving rejects at present.  It does
look _much_ better with Linus' vma_wants_writenotify.  I did think of
asking you for that, but it seemed unfair because I knew you'd want
to use it in mprotect, and then get in trouble with backing-dev.h:
which you've solved by #including that now in mm.h - a pity,
but an unavoidable decision.

Given the reordering you had to make in mprotect_fixup to get its tests
working right (a little naughty!), I'd now do away with the "mask"
variable, and just work directly on "newflags" itself; but up to you.

> I've taken Hugh's msync changes too, looks a lot better and does indeed
> fix some boundary cases.

Thanks for reviewing: please add my
Signed-off-by: Hugh Dickins <hugh@veritas.com>
to that msync one.

In the respin of 1/5 you enquired:
> Bah Bah Bah, why didn't the page_mkwrite() patch re-protect clean pages?
> And is it a Bad-Thing (tm) that that can happen now?

You'll need a reply from David for the definitive answer, but I think
page_mkwrite is only wanting to know about the _first_ write to the
page e.g. so that it can allocate space on disk for that page.  And
many (most) calls to page_mkwrite won't be for that first write at
all, the filesystem already has to work out the irrelevant calls:
so it's no great problem that you'll be making some more such calls.

Hugh

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2006-06-26 15:35 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-06-23 22:31 [PATCH 0/5] mm: tracking dirty pages -v11 Peter Zijlstra
2006-06-23 22:31 ` [PATCH 1/5] mm: tracking shared dirty pages Peter Zijlstra
2006-06-23 23:54   ` Peter Zijlstra
2006-06-23 22:31 ` [PATCH 2/5] mm: balance " Peter Zijlstra
2006-06-23 22:31 ` [PATCH 3/5] mm: msync() cleanup Peter Zijlstra
2006-06-23 22:31 ` [PATCH 4/5] mm: optimize the new mprotect() code a bit Peter Zijlstra
2006-06-23 22:31 ` [PATCH 5/5] mm: small cleanup of install_page() Peter Zijlstra
2006-06-26 15:35 ` [PATCH 0/5] mm: tracking dirty pages -v11 Hugh Dickins

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).