linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/6] mm: tracking dirty pages -v8
@ 2006-06-13 11:21 Peter Zijlstra
  2006-06-13 11:21 ` [PATCH 1/6] mm: tracking shared dirty pages Peter Zijlstra
                   ` (6 more replies)
  0 siblings, 7 replies; 14+ messages in thread
From: Peter Zijlstra @ 2006-06-13 11:21 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Andrew Morton, David Howells, Peter Zijlstra,
	Christoph Lameter, Martin Bligh, Nick Piggin, Linus Torvalds


The latest version of the tracking dirty pages patch-set.

This version handles VM_PFNMAP vmas and the COW case of shared RO mappings.

follow_page() got a comment for being weird, but in light of the
set_page_dirty() call that cannot yet be removed, it does something sane.

copy_one_pte() also does the right thing, although I do wonder why it
clears the dirty bit for children.

f_op->open() - sets a backing_dev_info
f_op->mmap() - modifies both vma->vm_flags and vma->vm_page_prot

Since our condition depends on both the backing_dev_info and vma->vm_flags,
we cannot set vma->vm_page_prot before f_op->mmap() has run.
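
In code, the check that runs after f_op->mmap() returns boils down to roughly
the following (a condensed sketch of the do_mmap_pgoff() hunk in patch 1/6):

	if (is_shared_writable(vm_flags) && vma->vm_file)
		mapping = vma->vm_file->f_mapping;
	if (mapping && mapping_cap_account_dirty(mapping))
		/* write-protect new PTEs so the first write takes a fault */
		vma->vm_page_prot =
			__pgprot(pte_val(pte_wrprotect(
				__pte(pgprot_val(vma->vm_page_prot)))));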

However, this means that !VM_PFNMAP vmas that are shared writable but do not
provide an f_op->nopage() and whose backing_dev_info does not have
BDI_CAP_NO_ACCT_DIRTY are left writable.

Peter

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 1/6] mm: tracking shared dirty pages
  2006-06-13 11:21 [PATCH 0/6] mm: tracking dirty pages -v8 Peter Zijlstra
@ 2006-06-13 11:21 ` Peter Zijlstra
  2006-06-13 12:03   ` Andi Kleen
  2006-06-13 11:21 ` [PATCH 2/6] mm: balance " Peter Zijlstra
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2006-06-13 11:21 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Andrew Morton, David Howells, Peter Zijlstra,
	Christoph Lameter, Martin Bligh, Nick Piggin, Linus Torvalds


From: Peter Zijlstra <a.p.zijlstra@chello.nl>

People expressed the need to track dirty pages in shared mappings.

Linus outlined the general idea of doing that by making clean writable
pages write-protected and taking the write fault.

This patch does exactly that: it makes pages in a shared writable
mapping write-protected. On write fault the pages are marked dirty and
made writable. When the pages get synced with their backing store, the
write protection is reinstated.

It survives a simple test and shows the dirty pages in /proc/vmstat.
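
In condensed form, the cycle this patch introduces looks like the sketch
below; the real code is in the hunks that follow:

	/* fault side: do_wp_page() on a page of a shared writable mapping */
	if (is_shared_writable(vma->vm_flags)) {
		/* reuse the page: make the PTE young, dirty and writable */
		entry = maybe_mkwrite(pte_mkdirty(pte_mkyoung(orig_pte)), vma);
		ptep_set_access_flags(vma, address, page_table, entry, 1);
		set_page_dirty(page);	/* done after dropping the PTE lock */
	}

	/* writeback side: test_clear_page_dirty()/clear_page_dirty_for_io() */
	page_mkclean(page);	/* cleans and write-protects every PTE mapping the page */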

Changes in -v8

 - access_process_vm() and other force users of get_user_pages() can
   induce COW of read-only shared mappings.

Changes in -v7

 - changed is_shared_writable() to exclude VM_PFNMAP'ed regions.
 - Hugh's tiresome problem wasn't fully solved, now using the ugly duckling
   method.

Changes in -v6

 - make page_mkclean_one() modify the pte more like change_pte_range() 
   (suggested by Christoph Lameter)
 - made is_shared_writable() take vm_flags, it now resembles is_cow_mapping().
 - fixed the mprotect() bug (spotted by Hugh Dickins)
 - hopefully fixed the tiresome issue of do_mmap_pgoff() trampling on
   driver specific vm_page_prot settings (spotted by Hugh Dickins)
 - made a new version of the page_mkwrite() patch to go on top of all this.
   This so that Linus could merge this very early on in 2.6.18.

Changes in -v5

 - rename page_wrprotect() to page_mkclean() (suggested by Nick Piggin)
 - added comment to test_clear_page_dirty() (Andrew Morton)
 - cleanup page_wrprotect() (Andrew Morton)
 - renamed VM_SharedWritable() to is_shared_writable()
 - fs/buffer.c try_to_free_buffers(): remove clear_page_dirty() from under
   ->private_lock. This seems to be safe, since ->private_lock is used to
   serialize access to the buffers, not the page itself.
 - rebased on top of David Howells' page_mkwrite() patch.

Changes in -v4:

 - small cleanup as suggested by Christoph Lameter.

Changes in -v3:

 - move set_page_dirty() outside pte lock (suggested by Christoph Lameter)

Changes in -v2:

 - only wrprotect pages from dirty capable mappings. (Nick Piggin)
 - move the writefault handling from do_wp_page() into handle_pte_fault(). 
   (Nick Piggin)
 - revert to the old install_page interface. (Nick Piggin)
 - also clear the pte dirty bit when we make pages read-only again.
   (spotted by Rik van Riel)
 - make page_wrprotect() return the number of reprotected ptes.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---

 fs/buffer.c          |    2 -
 include/linux/mm.h   |    6 ++++
 include/linux/rmap.h |    8 ++++++
 mm/memory.c          |   56 ++++++++++++++++++++++++++++++++----------
 mm/mmap.c            |   33 +++++++++++++++++++++++-
 mm/mprotect.c        |   14 +++++++++-
 mm/page-writeback.c  |    9 ++++++
 mm/rmap.c            |   68 +++++++++++++++++++++++++++++++++++++++++++++++++++
 8 files changed, 177 insertions(+), 19 deletions(-)

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h	2006-06-12 07:01:34.000000000 +0200
+++ linux-2.6/include/linux/mm.h	2006-06-12 13:09:55.000000000 +0200
@@ -183,6 +183,12 @@ extern unsigned int kobjsize(const void 
 #define VM_SequentialReadHint(v)	((v)->vm_flags & VM_SEQ_READ)
 #define VM_RandomReadHint(v)		((v)->vm_flags & VM_RAND_READ)
 
+static inline int is_shared_writable(unsigned int flags)
+{
+	return (flags & (VM_SHARED|VM_WRITE|VM_PFNMAP)) ==
+		(VM_SHARED|VM_WRITE);
+}
+
 /*
  * mapping from the currently active vm_flags protection bits (the
  * low four bits) to a page protection mask..
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c	2006-06-12 07:01:34.000000000 +0200
+++ linux-2.6/mm/memory.c	2006-06-13 11:17:51.000000000 +0200
@@ -925,6 +925,12 @@ struct page *follow_page(struct vm_area_
 	pte = *ptep;
 	if (!pte_present(pte))
 		goto unlock;
+	/*
+	 * This is not fully correct in the light of trapping write faults
+	 * for writable shared mappings. However since we're going to mark
+	 * the page dirty anyway some few lines downward, we might as well
+	 * take the write fault now.
+	 */
 	if ((flags & FOLL_WRITE) && !pte_write(pte))
 		goto unlock;
 	page = vm_normal_page(vma, address, pte);
@@ -1445,25 +1451,36 @@ static int do_wp_page(struct mm_struct *
 {
 	struct page *old_page, *new_page;
 	pte_t entry;
-	int ret = VM_FAULT_MINOR;
+	int reuse = 0, ret = VM_FAULT_MINOR;
+	struct page *dirty_page = NULL;
 
 	old_page = vm_normal_page(vma, address, orig_pte);
 	if (!old_page)
 		goto gotten;
 
-	if (PageAnon(old_page) && !TestSetPageLocked(old_page)) {
-		int reuse = can_share_swap_page(old_page);
+	/* get_user_pages(.write:1, .force:1)
+	 *   __handle_mm_fault()
+	 *
+	 * Makes COW happen for readonly shared mappings too.
+	 */
+	if (unlikely(is_shared_writable(vma->vm_flags))) {
+		reuse = 1;
+		dirty_page = old_page;
+		get_page(dirty_page);
+	} else if (PageAnon(old_page) && !TestSetPageLocked(old_page)) {
+		reuse = can_share_swap_page(old_page);
 		unlock_page(old_page);
-		if (reuse) {
-			flush_cache_page(vma, address, pte_pfn(orig_pte));
-			entry = pte_mkyoung(orig_pte);
-			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
-			ptep_set_access_flags(vma, address, page_table, entry, 1);
-			update_mmu_cache(vma, address, entry);
-			lazy_mmu_prot_update(entry);
-			ret |= VM_FAULT_WRITE;
-			goto unlock;
-		}
+	}
+
+	if (reuse) {
+		flush_cache_page(vma, address, pte_pfn(orig_pte));
+		entry = pte_mkyoung(orig_pte);
+		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+		ptep_set_access_flags(vma, address, page_table, entry, 1);
+		update_mmu_cache(vma, address, entry);
+		lazy_mmu_prot_update(entry);
+		ret |= VM_FAULT_WRITE;
+		goto unlock;
 	}
 
 	/*
@@ -1518,6 +1535,10 @@ gotten:
 		page_cache_release(old_page);
 unlock:
 	pte_unmap_unlock(page_table, ptl);
+	if (dirty_page) {
+		set_page_dirty(dirty_page);
+		put_page(dirty_page);
+	}
 	return ret;
 oom:
 	if (old_page)
@@ -2046,6 +2067,7 @@ static int do_no_page(struct mm_struct *
 	unsigned int sequence = 0;
 	int ret = VM_FAULT_MINOR;
 	int anon = 0;
+	struct page *dirty_page = NULL;
 
 	pte_unmap(page_table);
 	BUG_ON(vma->vm_flags & VM_PFNMAP);
@@ -2127,6 +2149,10 @@ retry:
 		} else {
 			inc_mm_counter(mm, file_rss);
 			page_add_file_rmap(new_page);
+			if (write_access) {
+				dirty_page = new_page;
+				get_page(dirty_page);
+			}
 		}
 	} else {
 		/* One of our sibling threads was faster, back out. */
@@ -2139,6 +2165,10 @@ retry:
 	lazy_mmu_prot_update(entry);
 unlock:
 	pte_unmap_unlock(page_table, ptl);
+	if (dirty_page) {
+		set_page_dirty(dirty_page);
+		put_page(dirty_page);
+	}
 	return ret;
 oom:
 	page_cache_release(new_page);
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c	2006-06-12 07:01:34.000000000 +0200
+++ linux-2.6/mm/mmap.c	2006-06-12 14:40:50.000000000 +0200
@@ -25,6 +25,7 @@
 #include <linux/mount.h>
 #include <linux/mempolicy.h>
 #include <linux/rmap.h>
+#include <linux/backing-dev.h>
 
 #include <asm/uaccess.h>
 #include <asm/cacheflush.h>
@@ -888,6 +889,7 @@ unsigned long do_mmap_pgoff(struct file 
 	struct rb_node ** rb_link, * rb_parent;
 	int accountable = 1;
 	unsigned long charged = 0, reqprot = prot;
+	struct address_space *mapping = NULL;
 
 	if (file) {
 		if (is_file_hugepages(file))
@@ -1065,7 +1067,8 @@ munmap_back:
 	vma->vm_start = addr;
 	vma->vm_end = addr + len;
 	vma->vm_flags = vm_flags;
-	vma->vm_page_prot = protection_map[vm_flags & 0x0f];
+	vma->vm_page_prot = protection_map[vm_flags &
+				(VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)];
 	vma->vm_pgoff = pgoff;
 
 	if (file) {
@@ -1083,6 +1086,7 @@ munmap_back:
 		error = file->f_op->mmap(file, vma);
 		if (error)
 			goto unmap_and_free_vma;
+
 	} else if (vm_flags & VM_SHARED) {
 		error = shmem_zero_setup(vma);
 		if (error)
@@ -1106,6 +1110,30 @@ munmap_back:
 	pgoff = vma->vm_pgoff;
 	vm_flags = vma->vm_flags;
 
+	/*
+	 * Tracking of dirty pages for shared writable mappings. Do this by
+	 * write protecting writable pages, and mark dirty in the write fault.
+	 *
+	 * Modify vma->vm_page_prot (the default protection for new pages)
+	 * to this effect.
+	 *
+	 * Cannot do before because the condition depends on:
+	 *  - backing_dev_info having the right capabilities
+	 *    (set by f_op->open())
+	 *  - vma->vm_flags being fully set
+	 *    (finished in f_op->mmap(), which could call remap_pfn_range())
+	 *
+	 *  Also, cannot reset vma->vm_page_prot from vma->vm_flags because
+	 *  f_op->mmap() can modify it.
+	 */
+	if (is_shared_writable(vm_flags) && vma->vm_file)
+		mapping = vma->vm_file->f_mapping;
+	if (mapping && mapping_cap_account_dirty(mapping))
+		vma->vm_page_prot =
+			__pgprot(pte_val
+				(pte_wrprotect
+				 (__pte(pgprot_val(vma->vm_page_prot)))));
+
 	if (!file || !vma_merge(mm, prev, addr, vma->vm_end,
 			vma->vm_flags, NULL, file, pgoff, vma_policy(vma))) {
 		file = vma->vm_file;
@@ -1921,7 +1949,8 @@ unsigned long do_brk(unsigned long addr,
 	vma->vm_end = addr + len;
 	vma->vm_pgoff = pgoff;
 	vma->vm_flags = flags;
-	vma->vm_page_prot = protection_map[flags & 0x0f];
+	vma->vm_page_prot = protection_map[flags &
+				(VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)];
 	vma_link(mm, vma, prev, rb_link, rb_parent);
 out:
 	mm->total_vm += len >> PAGE_SHIFT;
Index: linux-2.6/mm/mprotect.c
===================================================================
--- linux-2.6.orig/mm/mprotect.c	2006-06-12 07:01:34.000000000 +0200
+++ linux-2.6/mm/mprotect.c	2006-06-12 13:09:55.000000000 +0200
@@ -19,6 +19,7 @@
 #include <linux/mempolicy.h>
 #include <linux/personality.h>
 #include <linux/syscalls.h>
+#include <linux/backing-dev.h>
 
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
@@ -106,6 +107,8 @@ mprotect_fixup(struct vm_area_struct *vm
 	unsigned long oldflags = vma->vm_flags;
 	long nrpages = (end - start) >> PAGE_SHIFT;
 	unsigned long charged = 0;
+	unsigned int mask;
+	struct address_space *mapping = NULL;
 	pgprot_t newprot;
 	pgoff_t pgoff;
 	int error;
@@ -132,8 +135,6 @@ mprotect_fixup(struct vm_area_struct *vm
 		}
 	}
 
-	newprot = protection_map[newflags & 0xf];
-
 	/*
 	 * First try to merge with previous and/or next vma.
 	 */
@@ -160,6 +161,15 @@ mprotect_fixup(struct vm_area_struct *vm
 	}
 
 success:
+	/* Don't make the VMA automatically writable if it's shared. */
+	mask = VM_READ|VM_WRITE|VM_EXEC|VM_SHARED;
+	if (is_shared_writable(newflags) && vma->vm_file)
+		mapping = vma->vm_file->f_mapping;
+	if (mapping && mapping_cap_account_dirty(mapping))
+		mask &= ~VM_SHARED;
+
+	newprot = protection_map[newflags & mask];
+
 	/*
 	 * vm_flags and vm_page_prot are protected by the mmap_sem
 	 * held in write mode.
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c	2006-06-12 07:01:34.000000000 +0200
+++ linux-2.6/mm/page-writeback.c	2006-06-13 11:17:40.000000000 +0200
@@ -29,6 +29,7 @@
 #include <linux/sysctl.h>
 #include <linux/cpu.h>
 #include <linux/syscalls.h>
+#include <linux/rmap.h>
 
 /*
  * The maximum number of pages to writeout in a single bdflush/kupdate
@@ -563,7 +564,7 @@ int do_writepages(struct address_space *
 		return 0;
 	wbc->for_writepages = 1;
 	if (mapping->a_ops->writepages)
-		ret =  mapping->a_ops->writepages(mapping, wbc);
+		ret = mapping->a_ops->writepages(mapping, wbc);
 	else
 		ret = generic_writepages(mapping, wbc);
 	wbc->for_writepages = 0;
@@ -725,6 +726,11 @@ int test_clear_page_dirty(struct page *p
 						page_index(page),
 						PAGECACHE_TAG_DIRTY);
 			write_unlock_irqrestore(&mapping->tree_lock, flags);
+			/*
+			 * We can continue to use `mapping' here because the
+			 * page is locked, which pins the address_space
+			 */
+			page_mkclean(page);
 			if (mapping_cap_account_dirty(mapping))
 				dec_page_state(nr_dirty);
 			return 1;
@@ -756,6 +762,7 @@ int clear_page_dirty_for_io(struct page 
 
 	if (mapping) {
 		if (TestClearPageDirty(page)) {
+			page_mkclean(page);
 			if (mapping_cap_account_dirty(mapping))
 				dec_page_state(nr_dirty);
 			return 1;
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c	2006-06-12 07:01:34.000000000 +0200
+++ linux-2.6/mm/rmap.c	2006-06-13 11:18:15.000000000 +0200
@@ -53,6 +53,7 @@
 #include <linux/rmap.h>
 #include <linux/rcupdate.h>
 #include <linux/module.h>
+#include <linux/backing-dev.h>
 
 #include <asm/tlbflush.h>
 
@@ -472,6 +473,73 @@ int page_referenced(struct page *page, i
 	return referenced;
 }
 
+static int page_mkclean_one(struct page *page, struct vm_area_struct *vma, int protect)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long address;
+	pte_t *pte, entry;
+	spinlock_t *ptl;
+	int ret = 0;
+
+	address = vma_address(page, vma);
+	if (address == -EFAULT)
+		goto out;
+
+	pte = page_check_address(page, mm, address, &ptl);
+	if (!pte)
+		goto out;
+
+	if (!(pte_dirty(*pte) || (protect && pte_write(*pte))))
+		goto unlock;
+
+	entry = ptep_get_and_clear(mm, address, pte);
+	entry = pte_mkclean(entry);
+	if (protect)
+		entry = pte_wrprotect(entry);
+	ptep_establish(vma, address, pte, entry);
+	lazy_mmu_prot_update(entry);
+	ret = 1;
+
+unlock:
+	pte_unmap_unlock(pte, ptl);
+out:
+	return ret;
+}
+
+static int page_mkclean_file(struct address_space *mapping, struct page *page)
+{
+	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	struct vm_area_struct *vma;
+	struct prio_tree_iter iter;
+	int ret = 0;
+
+	BUG_ON(PageAnon(page));
+
+	spin_lock(&mapping->i_mmap_lock);
+	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
+		int protect = mapping_cap_account_dirty(mapping) &&
+			is_shared_writable(vma->vm_flags);
+		ret += page_mkclean_one(page, vma, protect);
+	}
+	spin_unlock(&mapping->i_mmap_lock);
+	return ret;
+}
+
+int page_mkclean(struct page *page)
+{
+	int ret = 0;
+
+	BUG_ON(!PageLocked(page));
+
+	if (page_mapped(page)) {
+		struct address_space *mapping = page_mapping(page);
+		if (mapping)
+			ret = page_mkclean_file(mapping, page);
+	}
+
+	return ret;
+}
+
 /**
  * page_set_anon_rmap - setup new anonymous rmap
  * @page:	the page to add the mapping to
Index: linux-2.6/include/linux/rmap.h
===================================================================
--- linux-2.6.orig/include/linux/rmap.h	2006-06-12 07:01:34.000000000 +0200
+++ linux-2.6/include/linux/rmap.h	2006-06-12 13:09:55.000000000 +0200
@@ -105,6 +105,14 @@ pte_t *page_check_address(struct page *,
  */
 unsigned long page_address_in_vma(struct page *, struct vm_area_struct *);
 
+/*
+ * Cleans the PTEs of shared mappings.
+ * (and since clean PTEs should also be readonly, write protects them too)
+ *
+ * returns the number of cleaned PTEs.
+ */
+int page_mkclean(struct page *);
+
 #else	/* !CONFIG_MMU */
 
 #define anon_vma_init()		do {} while (0)
Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c	2006-06-12 07:01:34.000000000 +0200
+++ linux-2.6/fs/buffer.c	2006-06-12 13:09:55.000000000 +0200
@@ -2985,6 +2985,7 @@ int try_to_free_buffers(struct page *pag
 
 	spin_lock(&mapping->private_lock);
 	ret = drop_buffers(page, &buffers_to_free);
+	spin_unlock(&mapping->private_lock);
 	if (ret) {
 		/*
 		 * If the filesystem writes its buffers by hand (eg ext3)
@@ -2996,7 +2997,6 @@ int try_to_free_buffers(struct page *pag
 		 */
 		clear_page_dirty(page);
 	}
-	spin_unlock(&mapping->private_lock);
 out:
 	if (buffers_to_free) {
 		struct buffer_head *bh = buffers_to_free;

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 2/6] mm: balance dirty pages
  2006-06-13 11:21 [PATCH 0/6] mm: tracking dirty pages -v8 Peter Zijlstra
  2006-06-13 11:21 ` [PATCH 1/6] mm: tracking shared dirty pages Peter Zijlstra
@ 2006-06-13 11:21 ` Peter Zijlstra
  2006-06-13 11:21 ` [PATCH 3/6] mm: msync() cleanup Peter Zijlstra
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2006-06-13 11:21 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Andrew Morton, David Howells, Peter Zijlstra,
	Christoph Lameter, Martin Bligh, Nick Piggin, Linus Torvalds


From: Peter Zijlstra <a.p.zijlstra@chello.nl>

Now that we can detect writers of shared mappings, throttle them.
Avoids OOM by surprise.

Changes -v2:

 - small helper function (Andrew Morton)

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

 include/linux/writeback.h |    1 +
 mm/memory.c               |    5 +++--
 mm/page-writeback.c       |   10 ++++++++++
 3 files changed, 14 insertions(+), 2 deletions(-)

Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c	2006-06-08 16:28:39.000000000 +0200
+++ linux-2.6/mm/memory.c	2006-06-08 16:29:01.000000000 +0200
@@ -48,6 +48,7 @@
 #include <linux/rmap.h>
 #include <linux/module.h>
 #include <linux/init.h>
+#include <linux/writeback.h>
 
 #include <asm/pgalloc.h>
 #include <asm/uaccess.h>
@@ -1525,7 +1526,7 @@ gotten:
 unlock:
 	pte_unmap_unlock(page_table, ptl);
 	if (dirty_page) {
-		set_page_dirty(dirty_page);
+		set_page_dirty_balance(dirty_page);
 		put_page(dirty_page);
 	}
 	return ret;
@@ -2155,7 +2156,7 @@ retry:
 unlock:
 	pte_unmap_unlock(page_table, ptl);
 	if (dirty_page) {
-		set_page_dirty(dirty_page);
+		set_page_dirty_balance(dirty_page);
 		put_page(dirty_page);
 	}
 	return ret;
Index: linux-2.6/include/linux/writeback.h
===================================================================
--- linux-2.6.orig/include/linux/writeback.h	2006-06-08 16:28:25.000000000 +0200
+++ linux-2.6/include/linux/writeback.h	2006-06-08 16:28:44.000000000 +0200
@@ -114,6 +114,7 @@ int sync_page_range(struct inode *inode,
 			loff_t pos, loff_t count);
 int sync_page_range_nolock(struct inode *inode, struct address_space *mapping,
 			   loff_t pos, loff_t count);
+void set_page_dirty_balance(struct page *page);
 
 /* pdflush.c */
 extern int nr_pdflush_threads;	/* Global so it can be exported to sysctl
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c	2006-06-08 16:28:39.000000000 +0200
+++ linux-2.6/mm/page-writeback.c	2006-06-08 16:28:44.000000000 +0200
@@ -255,6 +255,16 @@ static void balance_dirty_pages(struct a
 		pdflush_operation(background_writeout, 0);
 }
 
+void set_page_dirty_balance(struct page *page)
+{
+	if (set_page_dirty(page)) {
+		struct address_space *mapping = page_mapping(page);
+
+		if (mapping)
+			balance_dirty_pages_ratelimited(mapping);
+	}
+}
+
 /**
  * balance_dirty_pages_ratelimited_nr - balance dirty memory state
  * @mapping: address_space which was dirtied

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 3/6] mm: msync() cleanup
  2006-06-13 11:21 [PATCH 0/6] mm: tracking dirty pages -v8 Peter Zijlstra
  2006-06-13 11:21 ` [PATCH 1/6] mm: tracking shared dirty pages Peter Zijlstra
  2006-06-13 11:21 ` [PATCH 2/6] mm: balance " Peter Zijlstra
@ 2006-06-13 11:21 ` Peter Zijlstra
  2006-06-13 11:22 ` [PATCH 4/6] mm: optimize the new mprotect() code a bit Peter Zijlstra
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2006-06-13 11:21 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Andrew Morton, David Howells, Peter Zijlstra,
	Christoph Lameter, Martin Bligh, Nick Piggin, Linus Torvalds


From: Peter Zijlstra <a.p.zijlstra@chello.nl>

With the tracking of dirty pages properly done now, msync doesn't need to
scan the PTEs anymore to determine the dirty status.
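
For illustration only (not part of the patch), a minimal userspace view of the
new behaviour; "somefile" is an assumed pre-existing file of at least one page,
and error handling is omitted:

	#include <fcntl.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		int fd = open("somefile", O_RDWR);
		char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			       MAP_SHARED, fd, 0);

		memset(p, 0xaa, 4096);	  /* write faults already marked the pages dirty */
		msync(p, 4096, MS_ASYNC); /* no PTE scan, no I/O started */
		fsync(fd);		  /* writes out the dirty pages and waits */

		munmap(p, 4096);
		close(fd);
		return 0;
	}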

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

 mm/msync.c |  129 ++++---------------------------------------------------------
 1 file changed, 9 insertions(+), 120 deletions(-)

Index: linux-2.6/mm/msync.c
===================================================================
--- linux-2.6.orig/mm/msync.c	2006-06-07 14:25:54.000000000 +0200
+++ linux-2.6/mm/msync.c	2006-06-07 14:32:33.000000000 +0200
@@ -20,109 +20,14 @@
 #include <asm/pgtable.h>
 #include <asm/tlbflush.h>
 
-static unsigned long msync_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
-				unsigned long addr, unsigned long end)
-{
-	pte_t *pte;
-	spinlock_t *ptl;
-	int progress = 0;
-	unsigned long ret = 0;
-
-again:
-	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
-	do {
-		struct page *page;
-
-		if (progress >= 64) {
-			progress = 0;
-			if (need_resched() || need_lockbreak(ptl))
-				break;
-		}
-		progress++;
-		if (!pte_present(*pte))
-			continue;
-		if (!pte_maybe_dirty(*pte))
-			continue;
-		page = vm_normal_page(vma, addr, *pte);
-		if (!page)
-			continue;
-		if (ptep_clear_flush_dirty(vma, addr, pte) ||
-				page_test_and_clear_dirty(page))
-			ret += set_page_dirty(page);
-		progress += 3;
-	} while (pte++, addr += PAGE_SIZE, addr != end);
-	pte_unmap_unlock(pte - 1, ptl);
-	cond_resched();
-	if (addr != end)
-		goto again;
-	return ret;
-}
-
-static inline unsigned long msync_pmd_range(struct vm_area_struct *vma,
-			pud_t *pud, unsigned long addr, unsigned long end)
-{
-	pmd_t *pmd;
-	unsigned long next;
-	unsigned long ret = 0;
-
-	pmd = pmd_offset(pud, addr);
-	do {
-		next = pmd_addr_end(addr, end);
-		if (pmd_none_or_clear_bad(pmd))
-			continue;
-		ret += msync_pte_range(vma, pmd, addr, next);
-	} while (pmd++, addr = next, addr != end);
-	return ret;
-}
-
-static inline unsigned long msync_pud_range(struct vm_area_struct *vma,
-			pgd_t *pgd, unsigned long addr, unsigned long end)
-{
-	pud_t *pud;
-	unsigned long next;
-	unsigned long ret = 0;
-
-	pud = pud_offset(pgd, addr);
-	do {
-		next = pud_addr_end(addr, end);
-		if (pud_none_or_clear_bad(pud))
-			continue;
-		ret += msync_pmd_range(vma, pud, addr, next);
-	} while (pud++, addr = next, addr != end);
-	return ret;
-}
-
-static unsigned long msync_page_range(struct vm_area_struct *vma,
-				unsigned long addr, unsigned long end)
-{
-	pgd_t *pgd;
-	unsigned long next;
-	unsigned long ret = 0;
-
-	/* For hugepages we can't go walking the page table normally,
-	 * but that's ok, hugetlbfs is memory based, so we don't need
-	 * to do anything more on an msync().
-	 */
-	if (vma->vm_flags & VM_HUGETLB)
-		return 0;
-
-	BUG_ON(addr >= end);
-	pgd = pgd_offset(vma->vm_mm, addr);
-	flush_cache_range(vma, addr, end);
-	do {
-		next = pgd_addr_end(addr, end);
-		if (pgd_none_or_clear_bad(pgd))
-			continue;
-		ret += msync_pud_range(vma, pgd, addr, next);
-	} while (pgd++, addr = next, addr != end);
-	return ret;
-}
-
 /*
  * MS_SYNC syncs the entire file - including mappings.
  *
- * MS_ASYNC does not start I/O (it used to, up to 2.5.67).  Instead, it just
- * marks the relevant pages dirty.  The application may now run fsync() to
+ * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
+ * Nor does it mark the relevant pages dirty (it used to, up to 2.6.17).
+ * Now it doesn't do anything, since dirty pages are properly tracked.
+ *
+ * The application may now run fsync() to
  * write out the dirty pages and wait on the writeout and check the result.
  * Or the application may run fadvise(FADV_DONTNEED) against the fd to start
  * async writeout immediately.
@@ -130,16 +35,11 @@
  * applications.
  */
 static int msync_interval(struct vm_area_struct *vma, unsigned long addr,
-			unsigned long end, int flags,
-			unsigned long *nr_pages_dirtied)
+			unsigned long end, int flags)
 {
-	struct file *file = vma->vm_file;
-
 	if ((flags & MS_INVALIDATE) && (vma->vm_flags & VM_LOCKED))
 		return -EBUSY;
 
-	if (file && (vma->vm_flags & VM_SHARED))
-		*nr_pages_dirtied = msync_page_range(vma, addr, end);
 	return 0;
 }
 
@@ -178,7 +78,6 @@
 		goto out_unlock;
 	}
 	do {
-		unsigned long nr_pages_dirtied = 0;
 		struct file *file;
 
 		/* Here start < vma->vm_end. */
@@ -189,8 +88,7 @@
 		/* Here vma->vm_start <= start < vma->vm_end. */
 		if (end <= vma->vm_end) {
 			if (start < end) {
-				error = msync_interval(vma, start, end, flags,
-							&nr_pages_dirtied);
+				error = msync_interval(vma, start, end, flags);
 				if (error)
 					goto out_unlock;
 			}
@@ -198,22 +96,13 @@
 			done = 1;
 		} else {
 			/* Here vma->vm_start <= start < vma->vm_end < end. */
-			error = msync_interval(vma, start, vma->vm_end, flags,
-						&nr_pages_dirtied);
+			error = msync_interval(vma, start, vma->vm_end, flags);
 			if (error)
 				goto out_unlock;
 		}
 		file = vma->vm_file;
 		start = vma->vm_end;
-		if ((flags & MS_ASYNC) && file && nr_pages_dirtied) {
-			get_file(file);
-			up_read(&current->mm->mmap_sem);
-			balance_dirty_pages_ratelimited_nr(file->f_mapping,
-							nr_pages_dirtied);
-			fput(file);
-			down_read(&current->mm->mmap_sem);
-			vma = find_vma(current->mm, start);
-		} else if ((flags & MS_SYNC) && file &&
+		if ((flags & MS_SYNC) && file &&
 				(vma->vm_flags & VM_SHARED)) {
 			get_file(file);
 			up_read(&current->mm->mmap_sem);

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 4/6] mm: optimize the new mprotect() code a bit
  2006-06-13 11:21 [PATCH 0/6] mm: tracking dirty pages -v8 Peter Zijlstra
                   ` (2 preceding siblings ...)
  2006-06-13 11:21 ` [PATCH 3/6] mm: msync() cleanup Peter Zijlstra
@ 2006-06-13 11:22 ` Peter Zijlstra
  2006-06-13 11:22 ` [PATCH 5/6] mm: small cleanup of install_page() Peter Zijlstra
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2006-06-13 11:22 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Andrew Morton, David Howells, Peter Zijlstra,
	Christoph Lameter, Martin Bligh, Nick Piggin, Linus Torvalds


From: Peter Zijlstra <a.p.zijlstra@chello.nl>

mprotect() resets the page protections, which could result in extra write
faults for those pages whose dirty state we track using write faults
and are dirty already.
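
The interesting part, condensed from the change_pte_range() hunk below (the
rest of the diff just plumbs the is_accountable flag down the page table
walkers):

	ptent = ptep_get_and_clear(mm, addr, pte);
	ptent = pte_modify(ptent, newprot);
	/* pages we already know to be dirty need not take another write fault */
	if (is_accountable && pte_dirty(ptent))
		ptent = pte_mkwrite(ptent);
	set_pte_at(mm, addr, pte, ptent);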

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 mm/mprotect.c |   33 +++++++++++++++++++++++----------
 1 file changed, 23 insertions(+), 10 deletions(-)

Index: linux-2.6/mm/mprotect.c
===================================================================
--- linux-2.6.orig/mm/mprotect.c	2006-06-12 13:09:55.000000000 +0200
+++ linux-2.6/mm/mprotect.c	2006-06-13 11:43:13.000000000 +0200
@@ -27,7 +27,8 @@
 #include <asm/tlbflush.h>
 
 static void change_pte_range(struct mm_struct *mm, pmd_t *pmd,
-		unsigned long addr, unsigned long end, pgprot_t newprot)
+		unsigned long addr, unsigned long end, pgprot_t newprot,
+		int is_accountable)
 {
 	pte_t *pte;
 	spinlock_t *ptl;
@@ -41,7 +42,13 @@ static void change_pte_range(struct mm_s
 			 * bits by wiping the pte and then setting the new pte
 			 * into place.
 			 */
-			ptent = pte_modify(ptep_get_and_clear(mm, addr, pte), newprot);
+			ptent = ptep_get_and_clear(mm, addr, pte);
+			ptent = pte_modify(ptent, newprot);
+			/* Avoid taking write faults for pages we know to be
+			 * dirty.
+			 */
+			if (is_accountable && pte_dirty(ptent))
+				ptent = pte_mkwrite(ptent);
 			set_pte_at(mm, addr, pte, ptent);
 			lazy_mmu_prot_update(ptent);
 		}
@@ -50,7 +57,8 @@ static void change_pte_range(struct mm_s
 }
 
 static inline void change_pmd_range(struct mm_struct *mm, pud_t *pud,
-		unsigned long addr, unsigned long end, pgprot_t newprot)
+		unsigned long addr, unsigned long end, pgprot_t newprot,
+		int is_accountable)
 {
 	pmd_t *pmd;
 	unsigned long next;
@@ -60,12 +68,13 @@ static inline void change_pmd_range(stru
 		next = pmd_addr_end(addr, end);
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
-		change_pte_range(mm, pmd, addr, next, newprot);
+		change_pte_range(mm, pmd, addr, next, newprot, is_accountable);
 	} while (pmd++, addr = next, addr != end);
 }
 
 static inline void change_pud_range(struct mm_struct *mm, pgd_t *pgd,
-		unsigned long addr, unsigned long end, pgprot_t newprot)
+		unsigned long addr, unsigned long end, pgprot_t newprot,
+		int is_accountable)
 {
 	pud_t *pud;
 	unsigned long next;
@@ -75,12 +84,13 @@ static inline void change_pud_range(stru
 		next = pud_addr_end(addr, end);
 		if (pud_none_or_clear_bad(pud))
 			continue;
-		change_pmd_range(mm, pud, addr, next, newprot);
+		change_pmd_range(mm, pud, addr, next, newprot, is_accountable);
 	} while (pud++, addr = next, addr != end);
 }
 
 static void change_protection(struct vm_area_struct *vma,
-		unsigned long addr, unsigned long end, pgprot_t newprot)
+		unsigned long addr, unsigned long end, pgprot_t newprot,
+		int is_accountable)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	pgd_t *pgd;
@@ -94,7 +104,7 @@ static void change_protection(struct vm_
 		next = pgd_addr_end(addr, end);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
-		change_pud_range(mm, pgd, addr, next, newprot);
+		change_pud_range(mm, pgd, addr, next, newprot, is_accountable);
 	} while (pgd++, addr = next, addr != end);
 	flush_tlb_range(vma, start, end);
 }
@@ -112,6 +122,7 @@ mprotect_fixup(struct vm_area_struct *vm
 	pgprot_t newprot;
 	pgoff_t pgoff;
 	int error;
+	int is_accountable = 0;
 
 	if (newflags == oldflags) {
 		*pprev = vma;
@@ -165,8 +176,10 @@ success:
 	mask = VM_READ|VM_WRITE|VM_EXEC|VM_SHARED;
 	if (is_shared_writable(newflags) && vma->vm_file)
 		mapping = vma->vm_file->f_mapping;
-	if (mapping && mapping_cap_account_dirty(mapping))
+	if (mapping && mapping_cap_account_dirty(mapping)) {
 		mask &= ~VM_SHARED;
+		is_accountable = 1;
+	}
 
 	newprot = protection_map[newflags & mask];
 
@@ -179,7 +192,7 @@ success:
 	if (is_vm_hugetlb_page(vma))
 		hugetlb_change_protection(vma, start, end, newprot);
 	else
-		change_protection(vma, start, end, newprot);
+		change_protection(vma, start, end, newprot, is_accountable);
 	vm_stat_account(mm, oldflags, vma->vm_file, -nrpages);
 	vm_stat_account(mm, newflags, vma->vm_file, nrpages);
 	return 0;

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 5/6] mm: small cleanup of install_page()
  2006-06-13 11:21 [PATCH 0/6] mm: tracking dirty pages -v8 Peter Zijlstra
                   ` (3 preceding siblings ...)
  2006-06-13 11:22 ` [PATCH 4/6] mm: optimize the new mprotect() code a bit Peter Zijlstra
@ 2006-06-13 11:22 ` Peter Zijlstra
  2006-06-13 11:22 ` [PATCH 6/6] mm: remove some update_mmu_cache() calls Peter Zijlstra
  2006-06-13 11:22 ` [PATCH 7/6] mm: page_mkwrite Peter Zijlstra
  6 siblings, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2006-06-13 11:22 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Andrew Morton, David Howells, Peter Zijlstra,
	Christoph Lameter, Martin Bligh, Nick Piggin, Linus Torvalds


From: Peter Zijlstra <a.p.zijlstra@chello.nl>

Smallish cleanup to install_page(); it could save a memory read (I haven't
checked the asm output) and it sure looks nicer.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---

Index: linux-2.6/mm/fremap.c
===================================================================
--- linux-2.6.orig/mm/fremap.c	2006-06-08 13:47:29.000000000 +0200
+++ linux-2.6/mm/fremap.c	2006-06-08 13:50:44.000000000 +0200
@@ -79,9 +79,9 @@ int install_page(struct mm_struct *mm, s
 		inc_mm_counter(mm, file_rss);
 
 	flush_icache_page(vma, page);
-	set_pte_at(mm, addr, pte, mk_pte(page, prot));
+	pte_val = mk_pte(page, prot);
+	set_pte_at(mm, addr, pte, pte_val);
 	page_add_file_rmap(page);
-	pte_val = *pte;
 	update_mmu_cache(vma, addr, pte_val);
 	err = 0;
 unlock:

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 6/6] mm: remove some update_mmu_cache() calls
  2006-06-13 11:21 [PATCH 0/6] mm: tracking dirty pages -v8 Peter Zijlstra
                   ` (4 preceding siblings ...)
  2006-06-13 11:22 ` [PATCH 5/6] mm: small cleanup of install_page() Peter Zijlstra
@ 2006-06-13 11:22 ` Peter Zijlstra
  2006-06-13 11:22 ` [PATCH 7/6] mm: page_mkwrite Peter Zijlstra
  6 siblings, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2006-06-13 11:22 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Andrew Morton, David Howells, Peter Zijlstra,
	Christoph Lameter, Martin Bligh, Nick Piggin, Linus Torvalds


From: Christoph Lameter <clameter@sgi.com>

This may be a bit controversial, but it does not seem to make sense
to use the update_mmu_cache macro when we reuse the page. We are only
fiddling around with the protections and the dirty and accessed bits.

With the call to update_mmu_cache the way the macros are used would
differ from mprotect() and page_mkclean(). I'd rather have everything
work the same way. If this breaks on some arches, then mprotect() and
page_mkclean() are broken there as well. The use of mprotect() is rare,
so we may have breakage on some arches that we just have not seen yet.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 mm/memory.c |    2 --
 1 file changed, 2 deletions(-)

Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c	2006-06-13 10:33:21.000000000 +0200
+++ linux-2.6/mm/memory.c	2006-06-13 10:36:31.000000000 +0200
@@ -1477,7 +1477,6 @@ static int do_wp_page(struct mm_struct *
 		entry = pte_mkyoung(orig_pte);
 		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
 		ptep_set_access_flags(vma, address, page_table, entry, 1);
-		update_mmu_cache(vma, address, entry);
 		lazy_mmu_prot_update(entry);
 		ret |= VM_FAULT_WRITE;
 		goto unlock;
@@ -2263,7 +2262,6 @@ static inline int handle_pte_fault(struc
 	entry = pte_mkyoung(entry);
 	if (!pte_same(old_entry, entry)) {
 		ptep_set_access_flags(vma, address, pte, entry, write_access);
-		update_mmu_cache(vma, address, entry);
 		lazy_mmu_prot_update(entry);
 	} else {
 		/*

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 7/6] mm: page_mkwrite
  2006-06-13 11:21 [PATCH 0/6] mm: tracking dirty pages -v8 Peter Zijlstra
                   ` (5 preceding siblings ...)
  2006-06-13 11:22 ` [PATCH 6/6] mm: remove some update_mmu_cache() calls Peter Zijlstra
@ 2006-06-13 11:22 ` Peter Zijlstra
  6 siblings, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2006-06-13 11:22 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Andrew Morton, David Howells, Peter Zijlstra,
	Christoph Lameter, Martin Bligh, Nick Piggin, Linus Torvalds


From: David Howells <dhowells@redhat.com>

The attached patch adds a new VMA operation to notify a filesystem or other
driver about the MMU generating a fault because userspace attempted to write
to a page mapped through a read-only PTE.

This facility permits the filesystem or driver to:

 (*) Implement storage allocation/reservation on attempted write, and so
     deal with problems such as ENOSPC more gracefully (perhaps by generating
     SIGBUS).

 (*) Delay making the page writable until the contents have been written to a
     backing cache. This is useful for NFS/AFS when using FS-Cache/CacheFS.
     It permits the filesystem to have some guarantee about the state of the
     cache.

 (*) Account and limit the number of dirty pages. This is one piece of the
     puzzle needed to make shared writable mappings work safely in FUSE.
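
As an illustration of the hook (this example is not part of the patch; the
examplefs_* names and the block-reservation helper are made up):

	/* hypothetical ->page_mkwrite() implementation for some filesystem */
	static int examplefs_page_mkwrite(struct vm_area_struct *vma,
					  struct page *page)
	{
		struct inode *inode = vma->vm_file->f_mapping->host;

		/* reserve backing storage for this page; may sleep */
		if (examplefs_reserve_block(inode, page->index) < 0)
			return -ENOSPC;	/* the fault path turns this into SIGBUS */

		return 0;
	}

	static struct vm_operations_struct examplefs_file_vm_ops = {
		.nopage		= filemap_nopage,
		.page_mkwrite	= examplefs_page_mkwrite,
	};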

Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-Off-By: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/mm.h |    4 +++
 mm/memory.c        |   67 ++++++++++++++++++++++++++++++++++++++++++++---------
 mm/mmap.c          |    3 +-
 mm/mprotect.c      |    3 +-
 4 files changed, 64 insertions(+), 13 deletions(-)

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h	2006-06-13 12:15:28.000000000 +0200
+++ linux-2.6/include/linux/mm.h	2006-06-13 12:15:33.000000000 +0200
@@ -206,6 +206,10 @@ struct vm_operations_struct {
 	void (*close)(struct vm_area_struct * area);
 	struct page * (*nopage)(struct vm_area_struct * area, unsigned long address, int *type);
 	int (*populate)(struct vm_area_struct * area, unsigned long address, unsigned long len, pgprot_t prot, unsigned long pgoff, int nonblock);
+
+	/* notification that a previously read-only page is about to become
+	 * writable, if an error is returned it will cause a SIGBUS */
+	int (*page_mkwrite)(struct vm_area_struct *vma, struct page *page);
 #ifdef CONFIG_NUMA
 	int (*set_policy)(struct vm_area_struct *vma, struct mempolicy *new);
 	struct mempolicy *(*get_policy)(struct vm_area_struct *vma,
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c	2006-06-13 12:15:28.000000000 +0200
+++ linux-2.6/mm/memory.c	2006-06-13 12:20:00.000000000 +0200
@@ -1465,6 +1465,35 @@ static int do_wp_page(struct mm_struct *
 	 * Makes COW happen for readonly shared mappings too.
 	 */
 	if (unlikely(is_shared_writable(vma->vm_flags))) {
+		if (vma->vm_ops && vma->vm_ops->page_mkwrite) {
+			/*
+			 * Notify the address space that the page is about to
+			 * become writable so that it can prohibit this or wait
+			 * for the page to get into an appropriate state.
+			 *
+			 * We do this without the lock held, so that it can
+			 * sleep if it needs to.
+			 */
+			page_cache_get(old_page);
+			pte_unmap_unlock(page_table, ptl);
+
+			if (vma->vm_ops->page_mkwrite(vma, old_page) < 0)
+				goto unwritable_page;
+
+			page_cache_release(old_page);
+
+			/*
+			 * Since we dropped the lock we need to revalidate
+			 * the PTE as someone else may have changed it.  If
+			 * they did, we just return, as we can count on the
+			 * MMU to tell us if they didn't also make it writable.
+			 */
+			page_table = pte_offset_map_lock(mm, pmd, address,
+							 &ptl);
+			if (!pte_same(*page_table, orig_pte))
+				goto unlock;
+		}
+
 		reuse = 1;
 		dirty_page = old_page;
 		get_page(dirty_page);
@@ -1544,6 +1573,10 @@ oom:
 	if (old_page)
 		page_cache_release(old_page);
 	return VM_FAULT_OOM;
+
+unwritable_page:
+	page_cache_release(old_page);
+	return VM_FAULT_SIGBUS;
 }
 
 /*
@@ -2096,18 +2129,30 @@ retry:
 	/*
 	 * Should we do an early C-O-W break?
 	 */
-	if (write_access && !(vma->vm_flags & VM_SHARED)) {
-		struct page *page;
+	if (write_access) {
+		if (!(vma->vm_flags & VM_SHARED)) {
+			struct page *page;
 
-		if (unlikely(anon_vma_prepare(vma)))
-			goto oom;
-		page = alloc_page_vma(GFP_HIGHUSER, vma, address);
-		if (!page)
-			goto oom;
-		copy_user_highpage(page, new_page, address);
-		page_cache_release(new_page);
-		new_page = page;
-		anon = 1;
+			if (unlikely(anon_vma_prepare(vma)))
+				goto oom;
+			page = alloc_page_vma(GFP_HIGHUSER, vma, address);
+			if (!page)
+				goto oom;
+			copy_user_highpage(page, new_page, address);
+			page_cache_release(new_page);
+			new_page = page;
+			anon = 1;
+		} else {
+			/* if the page will be shareable, see if the backing
+			 * address space wants to know that the page is about
+			 * to become writable */
+			if (vma->vm_ops->page_mkwrite &&
+			    vma->vm_ops->page_mkwrite(vma, new_page) < 0
+			    ) {
+				page_cache_release(new_page);
+				return VM_FAULT_SIGBUS;
+			}
+		}
 	}
 
 	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c	2006-06-13 12:15:28.000000000 +0200
+++ linux-2.6/mm/mmap.c	2006-06-13 12:21:49.000000000 +0200
@@ -1128,7 +1128,8 @@ munmap_back:
 	 */
 	if (is_shared_writable(vm_flags) && vma->vm_file)
 		mapping = vma->vm_file->f_mapping;
-	if (mapping && mapping_cap_account_dirty(mapping))
+	if ((mapping && mapping_cap_account_dirty(mapping)) ||
+			(vma->vm_ops && vma->vm_ops->page_mkwrite))
 		vma->vm_page_prot =
 			__pgprot(pte_val
 				(pte_wrprotect
Index: linux-2.6/mm/mprotect.c
===================================================================
--- linux-2.6.orig/mm/mprotect.c	2006-06-13 12:15:28.000000000 +0200
+++ linux-2.6/mm/mprotect.c	2006-06-13 12:26:46.000000000 +0200
@@ -176,7 +176,8 @@ success:
 	mask = VM_READ|VM_WRITE|VM_EXEC|VM_SHARED;
 	if (is_shared_writable(newflags) && vma->vm_file)
 		mapping = vma->vm_file->f_mapping;
-	if (mapping && mapping_cap_account_dirty(mapping)) {
+	if ((mapping && mapping_cap_account_dirty(mapping)) ||
+			(vma->vm_ops && vma->vm_ops->page_mkwrite)) {
 		mask &= ~VM_SHARED;
 		is_accountable = 1;
 	}

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/6] mm: tracking shared dirty pages
  2006-06-13 11:21 ` [PATCH 1/6] mm: tracking shared dirty pages Peter Zijlstra
@ 2006-06-13 12:03   ` Andi Kleen
  2006-06-13 12:15     ` Peter Zijlstra
  0 siblings, 1 reply; 14+ messages in thread
From: Andi Kleen @ 2006-06-13 12:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Hugh Dickins, Andrew Morton, David Howells, Christoph Lameter,
	Martin Bligh, Nick Piggin, Linus Torvalds, linux-kernel

Peter Zijlstra <a.p.zijlstra@chello.nl> writes:

> From: Peter Zijlstra <a.p.zijlstra@chello.nl>
> 
> People expressed the need to track dirty pages in shared mappings.

Why only shared mappings? Anonymous pages can be dirty too
and would need to be written to swap then before making progress.

-Andi

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/6] mm: tracking shared dirty pages
  2006-06-13 12:03   ` Andi Kleen
@ 2006-06-13 12:15     ` Peter Zijlstra
  2006-06-13 12:24       ` Nick Piggin
  0 siblings, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2006-06-13 12:15 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Hugh Dickins, Andrew Morton, David Howells, Christoph Lameter,
	Martin Bligh, Nick Piggin, Linus Torvalds, linux-kernel

On Tue, 2006-06-13 at 14:03 +0200, Andi Kleen wrote:
> Peter Zijlstra <a.p.zijlstra@chello.nl> writes:
> 
> > From: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > 
> > People expressed the need to track dirty pages in shared mappings.
> 
> Why only shared mappings? Anonymous pages can be dirty too
> and would need to be written to swap then before making progress.

Anonymous pages are by definition dirty, as they don't have a
persistent backing store. Each eviction of an anonymous page requires IO
to swap space. On swap-in, pages are removed from the swap space to make
room for other pages.

Peter


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/6] mm: tracking shared dirty pages
  2006-06-13 12:15     ` Peter Zijlstra
@ 2006-06-13 12:24       ` Nick Piggin
  0 siblings, 0 replies; 14+ messages in thread
From: Nick Piggin @ 2006-06-13 12:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andi Kleen, Hugh Dickins, Andrew Morton, David Howells,
	Christoph Lameter, Martin Bligh, Nick Piggin, Linus Torvalds,
	linux-kernel

Peter Zijlstra wrote:
> On Tue, 2006-06-13 at 14:03 +0200, Andi Kleen wrote:
> 
>>Peter Zijlstra <a.p.zijlstra@chello.nl> writes:
>>
>>
>>>From: Peter Zijlstra <a.p.zijlstra@chello.nl>
>>>
>>>People expressed the need to track dirty pages in shared mappings.
>>
>>Why only shared mappings? Anonymous pages can be dirty too
>>and would need to be written to swap then before making progress.
> 
> 
> Anonymous pages are by definition dirty, as they don't have a
> persistent backing store.

They can be clean.

> Each eviction of an anonymous page requires IO
> to swap space. On swap-in, pages are removed from the swap space to make
> room for other pages.

No they can remain in swap too.

Swap is a bit different because the memory usage patterns are going
to be much different. There is no reason why something similar couldn't
be done for swap as well; however, I don't think much need for it has
been demonstrated.

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 4/6] mm: optimize the new mprotect() code a bit
  2006-06-28 20:17 [PATCH 0/6] mm: tracking dirty pages -v14 Peter Zijlstra
@ 2006-06-28 20:17 ` Peter Zijlstra
  0 siblings, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2006-06-28 20:17 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Andrew Morton, David Howells, Peter Zijlstra,
	Christoph Lameter, Martin Bligh, Nick Piggin, Linus Torvalds


From: Peter Zijlstra <a.p.zijlstra@chello.nl>

mprotect() resets the page protections, which could result in extra write
faults for those pages whose dirty state we track using write faults
and are dirty already.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 mm/mprotect.c |   31 ++++++++++++++++++++++---------
 1 file changed, 22 insertions(+), 9 deletions(-)

Index: linux-2.6-dirty/mm/mprotect.c
===================================================================
--- linux-2.6-dirty.orig/mm/mprotect.c	2006-06-28 19:49:27.000000000 +0200
+++ linux-2.6-dirty/mm/mprotect.c	2006-06-28 20:03:51.000000000 +0200
@@ -27,7 +27,8 @@
 #include <asm/tlbflush.h>
 
 static void change_pte_range(struct mm_struct *mm, pmd_t *pmd,
-		unsigned long addr, unsigned long end, pgprot_t newprot)
+		unsigned long addr, unsigned long end, pgprot_t newprot,
+		int dirty_accountable)
 {
 	pte_t *pte, oldpte;
 	spinlock_t *ptl;
@@ -42,7 +43,14 @@ static void change_pte_range(struct mm_s
 			 * bits by wiping the pte and then setting the new pte
 			 * into place.
 			 */
-			ptent = pte_modify(ptep_get_and_clear(mm, addr, pte), newprot);
+			ptent = ptep_get_and_clear(mm, addr, pte);
+			ptent = pte_modify(ptent, newprot);
+			/*
+			 * Avoid taking write faults for pages we know to be
+			 * dirty.
+			 */
+			if (dirty_accountable && pte_dirty(ptent))
+				ptent = pte_mkwrite(ptent);
 			set_pte_at(mm, addr, pte, ptent);
 			lazy_mmu_prot_update(ptent);
 #ifdef CONFIG_MIGRATION
@@ -66,7 +74,8 @@ static void change_pte_range(struct mm_s
 }
 
 static inline void change_pmd_range(struct mm_struct *mm, pud_t *pud,
-		unsigned long addr, unsigned long end, pgprot_t newprot)
+		unsigned long addr, unsigned long end, pgprot_t newprot,
+		int dirty_accountable)
 {
 	pmd_t *pmd;
 	unsigned long next;
@@ -76,12 +85,13 @@ static inline void change_pmd_range(stru
 		next = pmd_addr_end(addr, end);
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
-		change_pte_range(mm, pmd, addr, next, newprot);
+		change_pte_range(mm, pmd, addr, next, newprot, dirty_accountable);
 	} while (pmd++, addr = next, addr != end);
 }
 
 static inline void change_pud_range(struct mm_struct *mm, pgd_t *pgd,
-		unsigned long addr, unsigned long end, pgprot_t newprot)
+		unsigned long addr, unsigned long end, pgprot_t newprot,
+		int dirty_accountable)
 {
 	pud_t *pud;
 	unsigned long next;
@@ -91,12 +101,13 @@ static inline void change_pud_range(stru
 		next = pud_addr_end(addr, end);
 		if (pud_none_or_clear_bad(pud))
 			continue;
-		change_pmd_range(mm, pud, addr, next, newprot);
+		change_pmd_range(mm, pud, addr, next, newprot, dirty_accountable);
 	} while (pud++, addr = next, addr != end);
 }
 
 static void change_protection(struct vm_area_struct *vma,
-		unsigned long addr, unsigned long end, pgprot_t newprot)
+		unsigned long addr, unsigned long end, pgprot_t newprot,
+		int dirty_accountable)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	pgd_t *pgd;
@@ -110,7 +121,7 @@ static void change_protection(struct vm_
 		next = pgd_addr_end(addr, end);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
-		change_pud_range(mm, pgd, addr, next, newprot);
+		change_pud_range(mm, pgd, addr, next, newprot, dirty_accountable);
 	} while (pgd++, addr = next, addr != end);
 	flush_tlb_range(vma, start, end);
 }
@@ -125,6 +136,7 @@ mprotect_fixup(struct vm_area_struct *vm
 	unsigned long charged = 0;
 	pgoff_t pgoff;
 	int error;
+	int dirty_accountable = 0;
 
 	if (newflags == oldflags) {
 		*pprev = vma;
@@ -181,14 +193,16 @@ success:
 	vma->vm_flags = newflags;
 	vma->vm_page_prot = protection_map[newflags &
 		(VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)];
-	if (vma_wants_writenotify(vma))
+	if (vma_wants_writenotify(vma)) {
 		vma->vm_page_prot = protection_map[newflags &
 			(VM_READ|VM_WRITE|VM_EXEC)];
+		dirty_accountable = 1;
+	}
 
 	if (is_vm_hugetlb_page(vma))
 		hugetlb_change_protection(vma, start, end, vma->vm_page_prot);
 	else
-		change_protection(vma, start, end, vma->vm_page_prot);
+		change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable);
 	vm_stat_account(mm, oldflags, vma->vm_file, -nrpages);
 	vm_stat_account(mm, newflags, vma->vm_file, nrpages);
 	return 0;

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 4/6] mm: optimize the new mprotect() code a bit
  2006-06-19 17:53 ` [PATCH 4/6] mm: optimize the new mprotect() code a bit Peter Zijlstra
@ 2006-06-22 17:21   ` Hugh Dickins
  0 siblings, 0 replies; 14+ messages in thread
From: Hugh Dickins @ 2006-06-22 17:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm, linux-kernel, Andrew Morton, David Howells,
	Christoph Lameter, Martin Bligh, Nick Piggin, Linus Torvalds

On Mon, 19 Jun 2006, Peter Zijlstra wrote:
> From: Peter Zijlstra <a.p.zijlstra@chello.nl>
> 
> mprotect() resets the page protections, which could result in extra write
> faults for those pages whose dirty state we track using write faults
> and are dirty already.
> 
> @@ -43,7 +44,13 @@ static void change_pte_range(struct mm_s
>  			 * bits by wiping the pte and then setting the new pte
>  			 * into place.
>  			 */
> -			ptent = pte_modify(ptep_get_and_clear(mm, addr, pte), newprot);
> +			ptent = ptep_get_and_clear(mm, addr, pte);
> +			ptent = pte_modify(ptent, newprot);
> +			/* Avoid taking write faults for pages we know to be
> +			 * dirty.
> +			 */
> +			if (is_accountable && pte_dirty(ptent))
> +				ptent = pte_mkwrite(ptent);
>  			set_pte_at(mm, addr, pte, ptent);
>  			lazy_mmu_prot_update(ptent);

Thanks for adding that comment, I completely misread this when you
first showed it to me, and didn't get the point at all.  (But you're
a little too fond of "/* Multiline" comments: in this case, with no
blank line above, it'd look better with a "/*" lone line to separate
from the pte_modify code.)

Yes, I guess that is worth doing, though it's a bit sad and ugly:
goes right against the simplicity of working with vm_page_prot.

Could you change "is_accountable" to "dirty_accountable" throughout?
We've various different kinds of accounting going on hereabouts,
I think it'd be more understandable as "dirty_accountable".

Hugh

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 4/6] mm: optimize the new mprotect() code a bit
  2006-06-19 17:52 [PATCH 0/6] mm: tracking dirty pages -v9 Peter Zijlstra
@ 2006-06-19 17:53 ` Peter Zijlstra
  2006-06-22 17:21   ` Hugh Dickins
  0 siblings, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2006-06-19 17:53 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Andrew Morton, David Howells, Peter Zijlstra,
	Christoph Lameter, Martin Bligh, Nick Piggin, Linus Torvalds


From: Peter Zijlstra <a.p.zijlstra@chello.nl>

mprotect() resets the page protections, which could result in extra write
faults for those pages whose dirty state we track using write faults
and are dirty already.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 mm/mprotect.c |   33 +++++++++++++++++++++++----------
 1 file changed, 23 insertions(+), 10 deletions(-)

Index: 2.6-mm/mm/mprotect.c
===================================================================
--- 2.6-mm.orig/mm/mprotect.c	2006-06-19 16:19:42.000000000 +0200
+++ 2.6-mm/mm/mprotect.c	2006-06-19 16:20:42.000000000 +0200
@@ -28,7 +28,8 @@
 #include <asm/tlbflush.h>
 
 static void change_pte_range(struct mm_struct *mm, pmd_t *pmd,
-		unsigned long addr, unsigned long end, pgprot_t newprot)
+		unsigned long addr, unsigned long end, pgprot_t newprot,
+		int is_accountable)
 {
 	pte_t *pte, oldpte;
 	spinlock_t *ptl;
@@ -43,7 +44,13 @@ static void change_pte_range(struct mm_s
 			 * bits by wiping the pte and then setting the new pte
 			 * into place.
 			 */
-			ptent = pte_modify(ptep_get_and_clear(mm, addr, pte), newprot);
+			ptent = ptep_get_and_clear(mm, addr, pte);
+			ptent = pte_modify(ptent, newprot);
+			/* Avoid taking write faults for pages we know to be
+			 * dirty.
+			 */
+			if (is_accountable && pte_dirty(ptent))
+				ptent = pte_mkwrite(ptent);
 			set_pte_at(mm, addr, pte, ptent);
 			lazy_mmu_prot_update(ptent);
 #ifdef CONFIG_MIGRATION
@@ -67,7 +74,8 @@ static void change_pte_range(struct mm_s
 }
 
 static inline void change_pmd_range(struct mm_struct *mm, pud_t *pud,
-		unsigned long addr, unsigned long end, pgprot_t newprot)
+		unsigned long addr, unsigned long end, pgprot_t newprot,
+		int is_accountable)
 {
 	pmd_t *pmd;
 	unsigned long next;
@@ -77,12 +85,13 @@ static inline void change_pmd_range(stru
 		next = pmd_addr_end(addr, end);
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
-		change_pte_range(mm, pmd, addr, next, newprot);
+		change_pte_range(mm, pmd, addr, next, newprot, is_accountable);
 	} while (pmd++, addr = next, addr != end);
 }
 
 static inline void change_pud_range(struct mm_struct *mm, pgd_t *pgd,
-		unsigned long addr, unsigned long end, pgprot_t newprot)
+		unsigned long addr, unsigned long end, pgprot_t newprot,
+		int is_accountable)
 {
 	pud_t *pud;
 	unsigned long next;
@@ -92,12 +101,13 @@ static inline void change_pud_range(stru
 		next = pud_addr_end(addr, end);
 		if (pud_none_or_clear_bad(pud))
 			continue;
-		change_pmd_range(mm, pud, addr, next, newprot);
+		change_pmd_range(mm, pud, addr, next, newprot, is_accountable);
 	} while (pud++, addr = next, addr != end);
 }
 
 static void change_protection(struct vm_area_struct *vma,
-		unsigned long addr, unsigned long end, pgprot_t newprot)
+		unsigned long addr, unsigned long end, pgprot_t newprot,
+		int is_accountable)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	pgd_t *pgd;
@@ -111,7 +121,7 @@ static void change_protection(struct vm_
 		next = pgd_addr_end(addr, end);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
-		change_pud_range(mm, pgd, addr, next, newprot);
+		change_pud_range(mm, pgd, addr, next, newprot, is_accountable);
 	} while (pgd++, addr = next, addr != end);
 	flush_tlb_range(vma, start, end);
 }
@@ -129,6 +139,7 @@ mprotect_fixup(struct vm_area_struct *vm
 	pgprot_t newprot;
 	pgoff_t pgoff;
 	int error;
+	int is_accountable = 0;
 
 	if (newflags == oldflags) {
 		*pprev = vma;
@@ -184,8 +195,10 @@ success:
 	if (is_shared_writable(newflags) && vma->vm_file)
 		mapping = vma->vm_file->f_mapping;
 	if ((mapping && mapping_cap_account_dirty(mapping)) ||
-			(vma->vm_ops && vma->vm_ops->page_mkwrite))
+			(vma->vm_ops && vma->vm_ops->page_mkwrite)) {
 		mask &= ~VM_SHARED;
+		is_accountable = 1;
+	}
 
 	newprot = protection_map[newflags & mask];
 
@@ -198,7 +211,7 @@ success:
 	if (is_vm_hugetlb_page(vma))
 		hugetlb_change_protection(vma, start, end, newprot);
 	else
-		change_protection(vma, start, end, newprot);
+		change_protection(vma, start, end, newprot, is_accountable);
 	vm_stat_account(mm, oldflags, vma->vm_file, -nrpages);
 	vm_stat_account(mm, newflags, vma->vm_file, nrpages);
 	return 0;

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread

Thread overview: 14+ messages
2006-06-13 11:21 [PATCH 0/6] mm: tracking dirty pages -v8 Peter Zijlstra
2006-06-13 11:21 ` [PATCH 1/6] mm: tracking shared dirty pages Peter Zijlstra
2006-06-13 12:03   ` Andi Kleen
2006-06-13 12:15     ` Peter Zijlstra
2006-06-13 12:24       ` Nick Piggin
2006-06-13 11:21 ` [PATCH 2/6] mm: balance " Peter Zijlstra
2006-06-13 11:21 ` [PATCH 3/6] mm: msync() cleanup Peter Zijlstra
2006-06-13 11:22 ` [PATCH 4/6] mm: optimize the new mprotect() code a bit Peter Zijlstra
2006-06-13 11:22 ` [PATCH 5/6] mm: small cleanup of install_page() Peter Zijlstra
2006-06-13 11:22 ` [PATCH 6/6] mm: remove some update_mmu_cache() calls Peter Zijlstra
2006-06-13 11:22 ` [PATCH 7/6] mm: page_mkwrite Peter Zijlstra
2006-06-19 17:52 [PATCH 0/6] mm: tracking dirty pages -v9 Peter Zijlstra
2006-06-19 17:53 ` [PATCH 4/6] mm: optimize the new mprotect() code a bit Peter Zijlstra
2006-06-22 17:21   ` Hugh Dickins
2006-06-28 20:17 [PATCH 0/6] mm: tracking dirty pages -v14 Peter Zijlstra
2006-06-28 20:17 ` [PATCH 4/6] mm: optimize the new mprotect() code a bit Peter Zijlstra
