* [PATCH 0/6] mm: tracking dirty pages -v9
@ 2006-06-19 17:52 Peter Zijlstra
  2006-06-19 17:52 ` [PATCH 1/6] mm: tracking shared dirty pages Peter Zijlstra
                   ` (5 more replies)
  0 siblings, 6 replies; 51+ messages in thread
From: Peter Zijlstra @ 2006-06-19 17:52 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Andrew Morton, David Howells, Peter Zijlstra,
	Christoph Lameter, Martin Bligh, Nick Piggin, Linus Torvalds


The latest version of the dirty page tracking patch set; as requested,
it is against -mm.

This version handles VM_PFNMAP vmas and the COW case of shared RO mappings.

follow_page() got a comment noting that it is weird; but in light of the
set_page_dirty() call that cannot yet be removed, it does something sane.

copy_one_pte() also does the right thing, although I wonder why it clears
the dirty bit for children?

f_op->open() - sets a backing_dev_info
f_op->mmap() - modifies both vma->vm_flags and vma->vm_page_prot

Since our condition depends on both the backing_dev_info and vma->vm_flags,
we cannot set vma->vm_page_prot before f_op->mmap() has run.

However this means that !VM_PFNMAP vmas that are shared writable but do not
provide an f_op->nopage() and whose backing_dev_info does not have
BDI_CAP_NO_ACCT_DIRTY are left writable.
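
For illustration, the combined test from the do_mmap_pgoff() hunk in
patch 1 condenses into a helper like the sketch below (the helper name
is mine; the patches open-code the test):

	static int wants_writenotify(struct vm_area_struct *vma,
				     unsigned long vm_flags)
	{
		struct address_space *mapping = NULL;

		/* Only shared writable, !VM_PFNMAP file mappings can
		 * have a dirty-accounting backing_dev_info. */
		if (is_shared_writable(vm_flags) && vma->vm_file)
			mapping = vma->vm_file->f_mapping;

		/* Write protect when the backing store accounts dirty
		 * pages, or when the backer wants page_mkwrite()
		 * notification of first writes. */
		return (mapping && mapping_cap_account_dirty(mapping)) ||
		       (vma->vm_ops && vma->vm_ops->page_mkwrite);
	}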

Peter


* [PATCH 1/6] mm: tracking shared dirty pages
  2006-06-19 17:52 [PATCH 0/6] mm: tracking dirty pages -v9 Peter Zijlstra
@ 2006-06-19 17:52 ` Peter Zijlstra
  2006-06-22  5:56   ` Andrew Morton
  2006-06-22 20:52   ` Hugh Dickins
  2006-06-19 17:53 ` [PATCH 2/6] mm: balance dirty pages Peter Zijlstra
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 51+ messages in thread
From: Peter Zijlstra @ 2006-06-19 17:52 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Andrew Morton, David Howells, Peter Zijlstra,
	Christoph Lameter, Martin Bligh, Nick Piggin, Linus Torvalds


From: Peter Zijlstra <a.p.zijlstra@chello.nl>

People expressed the need to track dirty pages in shared mappings.

Linus outlined the general idea of doing that through making clean
writable pages write-protected and taking the write fault.

This patch does exactly that: it makes pages in a shared writable
mapping write-protected. On write fault the pages are marked dirty and
made writable. When the pages get synced with their backing store, the
write protection is reinstated.

It survives a simple test and shows the dirty pages in /proc/vmstat.
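
A minimal userspace test of that kind might look like this sketch (my
reconstruction, not the test actually used); it dirties a shared file
mapping and prints nr_dirty from /proc/vmstat before and after msync():

	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <unistd.h>

	#define LEN	(4 << 20)

	int main(void)
	{
		int fd = open("testfile", O_RDWR | O_CREAT, 0644);
		char *p;

		if (fd < 0 || ftruncate(fd, LEN) < 0) {
			perror("setup");
			return 1;
		}
		p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
			 MAP_SHARED, fd, 0);
		if (p == MAP_FAILED) {
			perror("mmap");
			return 1;
		}
		/* Each write to a clean page takes a write fault and
		 * should now show up in nr_dirty. */
		memset(p, 0xaa, LEN);
		system("grep nr_dirty /proc/vmstat");
		/* Syncing cleans the pages and write protects them
		 * again via page_mkclean(). */
		msync(p, LEN, MS_SYNC);
		system("grep nr_dirty /proc/vmstat");
		return 0;
	}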

Changes in -v9

 - respin against latest -mm.

Changes in -v8

 - access_process_vm() and other force users of get_user_pages() can
   induce COW of read-only shared mappings.

Changes in -v7

 - changed is_shared_writable() to exclude VM_PFNMAP'ed regions.
 - Hugh's tiresome problem wasn't fully solved, now using the ugly duckling
   method.

Changes in -v6

 - make page_mkclean_one() modify the pte more like change_pte_range() 
   (suggested by Christoph Lameter)
 - made is_shared_writable() take vm_flags, it now resembles is_cow_mapping().
 - fixed the mprotect() bug (spotted by Hugh Dickins)
 - hopefully fixed the tiresome issue of do_mmap_pgoff() trampling on
   driver specific vm_page_prot settings (spotted by Hugh Dickins)
 - made a new version of the page_mkwrite() patch to go on top of all this.
   This so that Linus could merge this very early on in 2.6.18.

Changes in -v5

 - rename page_wrprotect() to page_mkclean() (suggested by Nick Piggin)
 - added comment to test_clear_page_dirty() (Andrew Morton)
 - cleanup page_wrprotect() (Andrew Morton)
 - renamed VM_SharedWritable() to is_shared_writable()
 - fs/buffer.c try_to_free_buffers(): remove clear_page_dirty() from under
   ->private_lock. This seems to be safe, since ->private_lock is used to
   serialize access to the buffers, not the page itself.
 - rebased on top of David Howells' page_mkwrite() patch.

Changes in -v4:

 - small cleanup as suggested by Christoph Lameter.

Changes in -v3:

 - move set_page_dirty() outside pte lock (suggested by Christoph Lameter)

Changes in -v2:

 - only wrprotect pages from dirty capable mappings. (Nick Piggin)
 - move the writefault handling from do_wp_page() into handle_pte_fault(). 
   (Nick Piggin)
 - revert to the old install_page interface. (Nick Piggin)
 - also clear the pte dirty bit when we make pages read-only again.
   (spotted by Rik van Riel)
 - make page_wrprotect() return the number of reprotected ptes.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---

 fs/buffer.c          |    2 -
 include/linux/mm.h   |    6 ++++
 include/linux/rmap.h |    8 ++++++
 mm/memory.c          |   29 ++++++++++++++++++---
 mm/mmap.c            |   34 +++++++++++++++++++++----
 mm/mprotect.c        |    7 ++++-
 mm/page-writeback.c  |    9 ++++++
 mm/rmap.c            |   68 +++++++++++++++++++++++++++++++++++++++++++++++++++
 8 files changed, 149 insertions(+), 14 deletions(-)

Index: 2.6-mm/include/linux/mm.h
===================================================================
--- 2.6-mm.orig/include/linux/mm.h	2006-06-14 10:29:04.000000000 +0200
+++ 2.6-mm/include/linux/mm.h	2006-06-19 14:45:18.000000000 +0200
@@ -182,6 +182,12 @@ extern unsigned int kobjsize(const void 
 #define VM_SequentialReadHint(v)	((v)->vm_flags & VM_SEQ_READ)
 #define VM_RandomReadHint(v)		((v)->vm_flags & VM_RAND_READ)
 
+static inline int is_shared_writable(unsigned int flags)
+{
+	return (flags & (VM_SHARED|VM_WRITE|VM_PFNMAP)) ==
+		(VM_SHARED|VM_WRITE);
+}
+
 /*
  * mapping from the currently active vm_flags protection bits (the
  * low four bits) to a page protection mask..
Index: 2.6-mm/mm/memory.c
===================================================================
--- 2.6-mm.orig/mm/memory.c	2006-06-14 10:29:06.000000000 +0200
+++ 2.6-mm/mm/memory.c	2006-06-19 16:20:06.000000000 +0200
@@ -938,6 +938,12 @@ struct page *follow_page(struct vm_area_
 	pte = *ptep;
 	if (!pte_present(pte))
 		goto unlock;
+	/*
+	 * This is not fully correct in the light of trapping write faults
+	 * for writable shared mappings. However since we're going to mark
+	 * the page dirty anyway some few lines downward, we might as well
+	 * take the write fault now.
+	 */
 	if ((flags & FOLL_WRITE) && !pte_write(pte))
 		goto unlock;
 	page = vm_normal_page(vma, address, pte);
@@ -1458,13 +1464,14 @@ static int do_wp_page(struct mm_struct *
 {
 	struct page *old_page, *new_page;
 	pte_t entry;
-	int reuse, ret = VM_FAULT_MINOR;
+	int reuse = 0, ret = VM_FAULT_MINOR;
+	struct page *dirty_page = NULL;
 
 	old_page = vm_normal_page(vma, address, orig_pte);
 	if (!old_page)
 		goto gotten;
 
-	if (unlikely(vma->vm_flags & VM_SHARED)) {
+	if (unlikely(is_shared_writable(vma->vm_flags))) {
 		if (vma->vm_ops && vma->vm_ops->page_mkwrite) {
 			/*
 			 * Notify the address space that the page is about to
@@ -1493,13 +1500,12 @@ static int do_wp_page(struct mm_struct *
 			if (!pte_same(*page_table, orig_pte))
 				goto unlock;
 		}
-
+		dirty_page = old_page;
+		get_page(dirty_page);
 		reuse = 1;
 	} else if (PageAnon(old_page) && !TestSetPageLocked(old_page)) {
 		reuse = can_share_swap_page(old_page);
 		unlock_page(old_page);
-	} else {
-		reuse = 0;
 	}
 
 	if (reuse) {
@@ -1565,6 +1571,10 @@ gotten:
 		page_cache_release(old_page);
 unlock:
 	pte_unmap_unlock(page_table, ptl);
+	if (dirty_page) {
+		set_page_dirty(dirty_page);
+		put_page(dirty_page);
+	}
 	return ret;
 oom:
 	if (old_page)
@@ -2097,6 +2107,7 @@ static int do_no_page(struct mm_struct *
 	unsigned int sequence = 0;
 	int ret = VM_FAULT_MINOR;
 	int anon = 0;
+	struct page *dirty_page = NULL;
 
 	pte_unmap(page_table);
 	BUG_ON(vma->vm_flags & VM_PFNMAP);
@@ -2191,6 +2202,10 @@ retry:
 		} else {
 			inc_mm_counter(mm, file_rss);
 			page_add_file_rmap(new_page);
+			if (write_access) {
+				dirty_page = new_page;
+				get_page(dirty_page);
+			}
 		}
 	} else {
 		/* One of our sibling threads was faster, back out. */
@@ -2203,6 +2218,10 @@ retry:
 	lazy_mmu_prot_update(entry);
 unlock:
 	pte_unmap_unlock(page_table, ptl);
+	if (dirty_page) {
+		set_page_dirty(dirty_page);
+		put_page(dirty_page);
+	}
 	return ret;
 oom:
 	page_cache_release(new_page);
Index: 2.6-mm/mm/mmap.c
===================================================================
--- 2.6-mm.orig/mm/mmap.c	2006-06-14 10:29:06.000000000 +0200
+++ 2.6-mm/mm/mmap.c	2006-06-19 15:41:53.000000000 +0200
@@ -25,6 +25,7 @@
 #include <linux/mount.h>
 #include <linux/mempolicy.h>
 #include <linux/rmap.h>
+#include <linux/backing-dev.h>
 
 #include <asm/uaccess.h>
 #include <asm/cacheflush.h>
@@ -888,6 +889,7 @@ unsigned long do_mmap_pgoff(struct file 
 	struct rb_node ** rb_link, * rb_parent;
 	int accountable = 1;
 	unsigned long charged = 0, reqprot = prot;
+	struct address_space *mapping = NULL;
 
 	if (file) {
 		if (is_file_hugepages(file))
@@ -1084,18 +1086,13 @@ munmap_back:
 		error = file->f_op->mmap(file, vma);
 		if (error)
 			goto unmap_and_free_vma;
+
 	} else if (vm_flags & VM_SHARED) {
 		error = shmem_zero_setup(vma);
 		if (error)
 			goto free_vma;
 	}
 
-	/* Don't make the VMA automatically writable if it's shared, but the
-	 * backer wishes to know when pages are first written to */
-	if (vma->vm_ops && vma->vm_ops->page_mkwrite)
-		vma->vm_page_prot =
-			protection_map[vm_flags & (VM_READ|VM_WRITE|VM_EXEC)];
-
 	/* We set VM_ACCOUNT in a shared mapping's vm_flags, to inform
 	 * shmem_zero_setup (perhaps called through /dev/zero's ->mmap)
 	 * that memory reservation must be checked; but that reservation
@@ -1113,6 +1110,31 @@ munmap_back:
 	pgoff = vma->vm_pgoff;
 	vm_flags = vma->vm_flags;
 
+	/*
+	 * Tracking of dirty pages for shared writable mappings. Do this by
+	 * write-protecting writable pages and marking them dirty in the write fault.
+	 *
+	 * Modify vma->vm_page_prot (the default protection for new pages)
+	 * to this effect.
+	 *
+	 * Cannot do this earlier because the condition depends on:
+	 *  - backing_dev_info having the right capabilities
+	 *    (set by f_op->open())
+	 *  - vma->vm_flags being fully set
+	 *    (finished in f_op->mmap(), which could call remap_pfn_range())
+	 *
+	 *  Also, cannot reset vma->vm_page_prot from vma->vm_flags because
+	 *  f_op->mmap() can modify it.
+	 */
+	if (is_shared_writable(vm_flags) && vma->vm_file)
+		mapping = vma->vm_file->f_mapping;
+	if ((mapping && mapping_cap_account_dirty(mapping)) ||
+			(vma->vm_ops && vma->vm_ops->page_mkwrite))
+		vma->vm_page_prot =
+			__pgprot(pte_val
+				(pte_wrprotect
+				 (__pte(pgprot_val(vma->vm_page_prot)))));
+
 	if (!file || !vma_merge(mm, prev, addr, vma->vm_end,
 			vma->vm_flags, NULL, file, pgoff, vma_policy(vma))) {
 		file = vma->vm_file;
Index: 2.6-mm/mm/mprotect.c
===================================================================
--- 2.6-mm.orig/mm/mprotect.c	2006-06-14 10:29:06.000000000 +0200
+++ 2.6-mm/mm/mprotect.c	2006-06-19 16:19:42.000000000 +0200
@@ -21,6 +21,7 @@
 #include <linux/syscalls.h>
 #include <linux/swap.h>
 #include <linux/swapops.h>
+#include <linux/backing-dev.h>
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
 #include <asm/cacheflush.h>
@@ -124,6 +125,7 @@ mprotect_fixup(struct vm_area_struct *vm
 	long nrpages = (end - start) >> PAGE_SHIFT;
 	unsigned long charged = 0;
 	unsigned int mask;
+	struct address_space *mapping = NULL;
 	pgprot_t newprot;
 	pgoff_t pgoff;
 	int error;
@@ -179,7 +181,10 @@ success:
 	/* Don't make the VMA automatically writable if it's shared, but the
 	 * backer wishes to know when pages are first written to */
 	mask = VM_READ|VM_WRITE|VM_EXEC|VM_SHARED;
-	if (vma->vm_ops && vma->vm_ops->page_mkwrite)
+	if (is_shared_writable(newflags) && vma->vm_file)
+		mapping = vma->vm_file->f_mapping;
+	if ((mapping && mapping_cap_account_dirty(mapping)) ||
+			(vma->vm_ops && vma->vm_ops->page_mkwrite))
 		mask &= ~VM_SHARED;
 
 	newprot = protection_map[newflags & mask];
Index: 2.6-mm/mm/page-writeback.c
===================================================================
--- 2.6-mm.orig/mm/page-writeback.c	2006-06-14 10:29:07.000000000 +0200
+++ 2.6-mm/mm/page-writeback.c	2006-06-19 16:19:42.000000000 +0200
@@ -29,6 +29,7 @@
 #include <linux/sysctl.h>
 #include <linux/cpu.h>
 #include <linux/syscalls.h>
+#include <linux/rmap.h>
 
 /*
  * The maximum number of pages to writeout in a single bdflush/kupdate
@@ -540,7 +541,7 @@ int do_writepages(struct address_space *
 		return 0;
 	wbc->for_writepages = 1;
 	if (mapping->a_ops->writepages)
-		ret =  mapping->a_ops->writepages(mapping, wbc);
+		ret = mapping->a_ops->writepages(mapping, wbc);
 	else
 		ret = generic_writepages(mapping, wbc);
 	wbc->for_writepages = 0;
@@ -704,6 +705,11 @@ int test_clear_page_dirty(struct page *p
 			if (mapping_cap_account_dirty(mapping))
 				__dec_zone_page_state(page, NR_DIRTY);
 			write_unlock_irqrestore(&mapping->tree_lock, flags);
+			/*
+			 * We can continue to use `mapping' here because the
+			 * page is locked, which pins the address_space
+			 */
+			page_mkclean(page);
 			return 1;
 		}
 		write_unlock_irqrestore(&mapping->tree_lock, flags);
@@ -733,6 +739,7 @@ int clear_page_dirty_for_io(struct page 
 
 	if (mapping) {
 		if (TestClearPageDirty(page)) {
+			page_mkclean(page);
 			if (mapping_cap_account_dirty(mapping))
 				dec_zone_page_state(page, NR_DIRTY);
 			return 1;
Index: 2.6-mm/mm/rmap.c
===================================================================
--- 2.6-mm.orig/mm/rmap.c	2006-06-14 10:29:07.000000000 +0200
+++ 2.6-mm/mm/rmap.c	2006-06-19 14:45:18.000000000 +0200
@@ -53,6 +53,7 @@
 #include <linux/rmap.h>
 #include <linux/rcupdate.h>
 #include <linux/module.h>
+#include <linux/backing-dev.h>
 
 #include <asm/tlbflush.h>
 
@@ -434,6 +435,73 @@ int page_referenced(struct page *page, i
 	return referenced;
 }
 
+static int page_mkclean_one(struct page *page, struct vm_area_struct *vma, int protect)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long address;
+	pte_t *pte, entry;
+	spinlock_t *ptl;
+	int ret = 0;
+
+	address = vma_address(page, vma);
+	if (address == -EFAULT)
+		goto out;
+
+	pte = page_check_address(page, mm, address, &ptl);
+	if (!pte)
+		goto out;
+
+	if (!(pte_dirty(*pte) || (protect && pte_write(*pte))))
+		goto unlock;
+
+	entry = ptep_get_and_clear(mm, address, pte);
+	entry = pte_mkclean(entry);
+	if (protect)
+		entry = pte_wrprotect(entry);
+	ptep_establish(vma, address, pte, entry);
+	lazy_mmu_prot_update(entry);
+	ret = 1;
+
+unlock:
+	pte_unmap_unlock(pte, ptl);
+out:
+	return ret;
+}
+
+static int page_mkclean_file(struct address_space *mapping, struct page *page)
+{
+	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	struct vm_area_struct *vma;
+	struct prio_tree_iter iter;
+	int ret = 0;
+
+	BUG_ON(PageAnon(page));
+
+	spin_lock(&mapping->i_mmap_lock);
+	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
+		int protect = mapping_cap_account_dirty(mapping) &&
+			is_shared_writable(vma->vm_flags);
+		ret += page_mkclean_one(page, vma, protect);
+	}
+	spin_unlock(&mapping->i_mmap_lock);
+	return ret;
+}
+
+int page_mkclean(struct page *page)
+{
+	int ret = 0;
+
+	BUG_ON(!PageLocked(page));
+
+	if (page_mapped(page)) {
+		struct address_space *mapping = page_mapping(page);
+		if (mapping)
+			ret = page_mkclean_file(mapping, page);
+	}
+
+	return ret;
+}
+
 /**
  * page_set_anon_rmap - setup new anonymous rmap
  * @page:	the page to add the mapping to
Index: 2.6-mm/include/linux/rmap.h
===================================================================
--- 2.6-mm.orig/include/linux/rmap.h	2006-06-14 10:29:04.000000000 +0200
+++ 2.6-mm/include/linux/rmap.h	2006-06-19 14:45:18.000000000 +0200
@@ -103,6 +103,14 @@ pte_t *page_check_address(struct page *,
  */
 unsigned long page_address_in_vma(struct page *, struct vm_area_struct *);
 
+/*
+ * Cleans the PTEs of shared mappings.
+ * (and since clean PTEs should also be readonly, write protects them too)
+ *
+ * returns the number of cleaned PTEs.
+ */
+int page_mkclean(struct page *);
+
 #else	/* !CONFIG_MMU */
 
 #define anon_vma_init()		do {} while (0)
Index: 2.6-mm/fs/buffer.c
===================================================================
--- 2.6-mm.orig/fs/buffer.c	2006-06-14 10:28:52.000000000 +0200
+++ 2.6-mm/fs/buffer.c	2006-06-19 14:45:18.000000000 +0200
@@ -2985,6 +2985,7 @@ int try_to_free_buffers(struct page *pag
 
 	spin_lock(&mapping->private_lock);
 	ret = drop_buffers(page, &buffers_to_free);
+	spin_unlock(&mapping->private_lock);
 	if (ret) {
 		/*
 		 * If the filesystem writes its buffers by hand (eg ext3)
@@ -2996,7 +2997,6 @@ int try_to_free_buffers(struct page *pag
 		 */
 		clear_page_dirty(page);
 	}
-	spin_unlock(&mapping->private_lock);
 out:
 	if (buffers_to_free) {
 		struct buffer_head *bh = buffers_to_free;


* [PATCH 2/6] mm: balance dirty pages
  2006-06-19 17:52 [PATCH 0/6] mm: tracking dirty pages -v9 Peter Zijlstra
  2006-06-19 17:52 ` [PATCH 1/6] mm: tracking shared dirty pages Peter Zijlstra
@ 2006-06-19 17:53 ` Peter Zijlstra
  2006-06-19 17:53 ` [PATCH 3/6] mm: msync() cleanup Peter Zijlstra
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 51+ messages in thread
From: Peter Zijlstra @ 2006-06-19 17:53 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Andrew Morton, David Howells, Peter Zijlstra,
	Christoph Lameter, Martin Bligh, Nick Piggin, Linus Torvalds


From: Peter Zijlstra <a.p.zijlstra@chello.nl>

Now that we can detect writers of shared mappings, throttle them.
Avoids OOM by surprise.

Changes -v2:

 - small helper function (Andrew Morton)

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

 include/linux/writeback.h |    1 +
 mm/memory.c               |    5 +++--
 mm/page-writeback.c       |   10 ++++++++++
 3 files changed, 14 insertions(+), 2 deletions(-)

Index: 2.6-mm/mm/memory.c
===================================================================
--- 2.6-mm.orig/mm/memory.c	2006-06-19 16:21:15.000000000 +0200
+++ 2.6-mm/mm/memory.c	2006-06-19 16:21:16.000000000 +0200
@@ -49,6 +49,7 @@
 #include <linux/module.h>
 #include <linux/delayacct.h>
 #include <linux/init.h>
+#include <linux/writeback.h>
 
 #include <asm/pgalloc.h>
 #include <asm/uaccess.h>
@@ -1572,7 +1573,7 @@ gotten:
 unlock:
 	pte_unmap_unlock(page_table, ptl);
 	if (dirty_page) {
-		set_page_dirty(dirty_page);
+		set_page_dirty_balance(dirty_page);
 		put_page(dirty_page);
 	}
 	return ret;
@@ -2219,7 +2220,7 @@ retry:
 unlock:
 	pte_unmap_unlock(page_table, ptl);
 	if (dirty_page) {
-		set_page_dirty(dirty_page);
+		set_page_dirty_balance(dirty_page);
 		put_page(dirty_page);
 	}
 	return ret;
Index: 2.6-mm/include/linux/writeback.h
===================================================================
--- 2.6-mm.orig/include/linux/writeback.h	2006-06-19 16:21:13.000000000 +0200
+++ 2.6-mm/include/linux/writeback.h	2006-06-19 16:21:16.000000000 +0200
@@ -121,6 +121,7 @@ int sync_page_range(struct inode *inode,
 			loff_t pos, loff_t count);
 int sync_page_range_nolock(struct inode *inode, struct address_space *mapping,
 			   loff_t pos, loff_t count);
+void set_page_dirty_balance(struct page *page);
 
 /* pdflush.c */
 extern int nr_pdflush_threads;	/* Global so it can be exported to sysctl
Index: 2.6-mm/mm/page-writeback.c
===================================================================
--- 2.6-mm.orig/mm/page-writeback.c	2006-06-19 16:21:15.000000000 +0200
+++ 2.6-mm/mm/page-writeback.c	2006-06-19 16:21:16.000000000 +0200
@@ -237,6 +237,16 @@ static void balance_dirty_pages(struct a
 		pdflush_operation(background_writeout, 0);
 }
 
+void set_page_dirty_balance(struct page *page)
+{
+	if (set_page_dirty(page)) {
+		struct address_space *mapping = page_mapping(page);
+
+		if (mapping)
+			balance_dirty_pages_ratelimited(mapping);
+	}
+}
+
 /**
  * balance_dirty_pages_ratelimited_nr - balance dirty memory state
  * @mapping: address_space which was dirtied


* [PATCH 3/6] mm: msync() cleanup
  2006-06-19 17:52 [PATCH 0/6] mm: tracking dirty pages -v9 Peter Zijlstra
  2006-06-19 17:52 ` [PATCH 1/6] mm: tracking shared dirty pages Peter Zijlstra
  2006-06-19 17:53 ` [PATCH 2/6] mm: balance dirty pages Peter Zijlstra
@ 2006-06-19 17:53 ` Peter Zijlstra
  2006-06-22 17:02   ` Hugh Dickins
  2006-06-19 17:53 ` [PATCH 4/6] mm: optimize the new mprotect() code a bit Peter Zijlstra
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2006-06-19 17:53 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Andrew Morton, David Howells, Peter Zijlstra,
	Christoph Lameter, Martin Bligh, Nick Piggin, Linus Torvalds


From: Peter Zijlstra <a.p.zijlstra@chello.nl>

With the tracking of dirty pages properly done now, msync doesn't need to
scan the PTEs anymore to determine the dirty status.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

 mm/msync.c |  129 ++++---------------------------------------------------------
 1 file changed, 9 insertions(+), 120 deletions(-)

Index: 2.6-mm/mm/msync.c
===================================================================
--- 2.6-mm.orig/mm/msync.c	2006-06-19 16:21:13.000000000 +0200
+++ 2.6-mm/mm/msync.c	2006-06-19 16:21:21.000000000 +0200
@@ -20,109 +20,14 @@
 #include <asm/pgtable.h>
 #include <asm/tlbflush.h>
 
-static unsigned long msync_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
-				unsigned long addr, unsigned long end)
-{
-	pte_t *pte;
-	spinlock_t *ptl;
-	int progress = 0;
-	unsigned long ret = 0;
-
-again:
-	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
-	do {
-		struct page *page;
-
-		if (progress >= 64) {
-			progress = 0;
-			if (need_resched() || need_lockbreak(ptl))
-				break;
-		}
-		progress++;
-		if (!pte_present(*pte))
-			continue;
-		if (!pte_maybe_dirty(*pte))
-			continue;
-		page = vm_normal_page(vma, addr, *pte);
-		if (!page)
-			continue;
-		if (ptep_clear_flush_dirty(vma, addr, pte) ||
-				page_test_and_clear_dirty(page))
-			ret += set_page_dirty(page);
-		progress += 3;
-	} while (pte++, addr += PAGE_SIZE, addr != end);
-	pte_unmap_unlock(pte - 1, ptl);
-	cond_resched();
-	if (addr != end)
-		goto again;
-	return ret;
-}
-
-static inline unsigned long msync_pmd_range(struct vm_area_struct *vma,
-			pud_t *pud, unsigned long addr, unsigned long end)
-{
-	pmd_t *pmd;
-	unsigned long next;
-	unsigned long ret = 0;
-
-	pmd = pmd_offset(pud, addr);
-	do {
-		next = pmd_addr_end(addr, end);
-		if (pmd_none_or_clear_bad(pmd))
-			continue;
-		ret += msync_pte_range(vma, pmd, addr, next);
-	} while (pmd++, addr = next, addr != end);
-	return ret;
-}
-
-static inline unsigned long msync_pud_range(struct vm_area_struct *vma,
-			pgd_t *pgd, unsigned long addr, unsigned long end)
-{
-	pud_t *pud;
-	unsigned long next;
-	unsigned long ret = 0;
-
-	pud = pud_offset(pgd, addr);
-	do {
-		next = pud_addr_end(addr, end);
-		if (pud_none_or_clear_bad(pud))
-			continue;
-		ret += msync_pmd_range(vma, pud, addr, next);
-	} while (pud++, addr = next, addr != end);
-	return ret;
-}
-
-static unsigned long msync_page_range(struct vm_area_struct *vma,
-				unsigned long addr, unsigned long end)
-{
-	pgd_t *pgd;
-	unsigned long next;
-	unsigned long ret = 0;
-
-	/* For hugepages we can't go walking the page table normally,
-	 * but that's ok, hugetlbfs is memory based, so we don't need
-	 * to do anything more on an msync().
-	 */
-	if (vma->vm_flags & VM_HUGETLB)
-		return 0;
-
-	BUG_ON(addr >= end);
-	pgd = pgd_offset(vma->vm_mm, addr);
-	flush_cache_range(vma, addr, end);
-	do {
-		next = pgd_addr_end(addr, end);
-		if (pgd_none_or_clear_bad(pgd))
-			continue;
-		ret += msync_pud_range(vma, pgd, addr, next);
-	} while (pgd++, addr = next, addr != end);
-	return ret;
-}
-
 /*
  * MS_SYNC syncs the entire file - including mappings.
  *
- * MS_ASYNC does not start I/O (it used to, up to 2.5.67).  Instead, it just
- * marks the relevant pages dirty.  The application may now run fsync() to
+ * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
+ * Nor does it mark the relevant pages dirty (it used to, up to 2.6.17).
+ * Now it doesn't do anything, since dirty pages are properly tracked.
+ *
+ * The application may now run fsync() to
  * write out the dirty pages and wait on the writeout and check the result.
  * Or the application may run fadvise(FADV_DONTNEED) against the fd to start
  * async writeout immediately.
@@ -130,16 +35,11 @@ static unsigned long msync_page_range(st
  * applications.
  */
 static int msync_interval(struct vm_area_struct *vma, unsigned long addr,
-			unsigned long end, int flags,
-			unsigned long *nr_pages_dirtied)
+			unsigned long end, int flags)
 {
-	struct file *file = vma->vm_file;
-
 	if ((flags & MS_INVALIDATE) && (vma->vm_flags & VM_LOCKED))
 		return -EBUSY;
 
-	if (file && (vma->vm_flags & VM_SHARED))
-		*nr_pages_dirtied = msync_page_range(vma, addr, end);
 	return 0;
 }
 
@@ -178,7 +78,6 @@ asmlinkage long sys_msync(unsigned long 
 		goto out_unlock;
 	}
 	do {
-		unsigned long nr_pages_dirtied = 0;
 		struct file *file;
 
 		/* Here start < vma->vm_end. */
@@ -189,8 +88,7 @@ asmlinkage long sys_msync(unsigned long 
 		/* Here vma->vm_start <= start < vma->vm_end. */
 		if (end <= vma->vm_end) {
 			if (start < end) {
-				error = msync_interval(vma, start, end, flags,
-							&nr_pages_dirtied);
+				error = msync_interval(vma, start, end, flags);
 				if (error)
 					goto out_unlock;
 			}
@@ -198,22 +96,13 @@ asmlinkage long sys_msync(unsigned long 
 			done = 1;
 		} else {
 			/* Here vma->vm_start <= start < vma->vm_end < end. */
-			error = msync_interval(vma, start, vma->vm_end, flags,
-						&nr_pages_dirtied);
+			error = msync_interval(vma, start, vma->vm_end, flags);
 			if (error)
 				goto out_unlock;
 		}
 		file = vma->vm_file;
 		start = vma->vm_end;
-		if ((flags & MS_ASYNC) && file && nr_pages_dirtied) {
-			get_file(file);
-			up_read(&current->mm->mmap_sem);
-			balance_dirty_pages_ratelimited_nr(file->f_mapping,
-							nr_pages_dirtied);
-			fput(file);
-			down_read(&current->mm->mmap_sem);
-			vma = find_vma(current->mm, start);
-		} else if ((flags & MS_SYNC) && file &&
+		if ((flags & MS_SYNC) && file &&
 				(vma->vm_flags & VM_SHARED)) {
 			get_file(file);
 			up_read(&current->mm->mmap_sem);


* [PATCH 4/6] mm: optimize the new mprotect() code a bit
  2006-06-19 17:52 [PATCH 0/6] mm: tracking dirty pages -v9 Peter Zijlstra
                   ` (2 preceding siblings ...)
  2006-06-19 17:53 ` [PATCH 3/6] mm: msync() cleanup Peter Zijlstra
@ 2006-06-19 17:53 ` Peter Zijlstra
  2006-06-22 17:21   ` Hugh Dickins
  2006-06-19 17:53 ` [PATCH 5/6] mm: small cleanup of install_page() Peter Zijlstra
  2006-06-19 17:53 ` [PATCH 6/6] mm: remove some update_mmu_cache() calls Peter Zijlstra
  5 siblings, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2006-06-19 17:53 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Andrew Morton, David Howells, Peter Zijlstra,
	Christoph Lameter, Martin Bligh, Nick Piggin, Linus Torvalds


From: Peter Zijlstra <a.p.zijlstra@chello.nl>

mprotect() resets the page protections, which could result in extra write
faults for those pages whose dirty state we track using write faults
and which are dirty already.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 mm/mprotect.c |   33 +++++++++++++++++++++++----------
 1 file changed, 23 insertions(+), 10 deletions(-)

Index: 2.6-mm/mm/mprotect.c
===================================================================
--- 2.6-mm.orig/mm/mprotect.c	2006-06-19 16:19:42.000000000 +0200
+++ 2.6-mm/mm/mprotect.c	2006-06-19 16:20:42.000000000 +0200
@@ -28,7 +28,8 @@
 #include <asm/tlbflush.h>
 
 static void change_pte_range(struct mm_struct *mm, pmd_t *pmd,
-		unsigned long addr, unsigned long end, pgprot_t newprot)
+		unsigned long addr, unsigned long end, pgprot_t newprot,
+		int is_accountable)
 {
 	pte_t *pte, oldpte;
 	spinlock_t *ptl;
@@ -43,7 +44,13 @@ static void change_pte_range(struct mm_s
 			 * bits by wiping the pte and then setting the new pte
 			 * into place.
 			 */
-			ptent = pte_modify(ptep_get_and_clear(mm, addr, pte), newprot);
+			ptent = ptep_get_and_clear(mm, addr, pte);
+			ptent = pte_modify(ptent, newprot);
+			/* Avoid taking write faults for pages we know to be
+			 * dirty.
+			 */
+			if (is_accountable && pte_dirty(ptent))
+				ptent = pte_mkwrite(ptent);
 			set_pte_at(mm, addr, pte, ptent);
 			lazy_mmu_prot_update(ptent);
 #ifdef CONFIG_MIGRATION
@@ -67,7 +74,8 @@ static void change_pte_range(struct mm_s
 }
 
 static inline void change_pmd_range(struct mm_struct *mm, pud_t *pud,
-		unsigned long addr, unsigned long end, pgprot_t newprot)
+		unsigned long addr, unsigned long end, pgprot_t newprot,
+		int is_accountable)
 {
 	pmd_t *pmd;
 	unsigned long next;
@@ -77,12 +85,13 @@ static inline void change_pmd_range(stru
 		next = pmd_addr_end(addr, end);
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
-		change_pte_range(mm, pmd, addr, next, newprot);
+		change_pte_range(mm, pmd, addr, next, newprot, is_accountable);
 	} while (pmd++, addr = next, addr != end);
 }
 
 static inline void change_pud_range(struct mm_struct *mm, pgd_t *pgd,
-		unsigned long addr, unsigned long end, pgprot_t newprot)
+		unsigned long addr, unsigned long end, pgprot_t newprot,
+		int is_accountable)
 {
 	pud_t *pud;
 	unsigned long next;
@@ -92,12 +101,13 @@ static inline void change_pud_range(stru
 		next = pud_addr_end(addr, end);
 		if (pud_none_or_clear_bad(pud))
 			continue;
-		change_pmd_range(mm, pud, addr, next, newprot);
+		change_pmd_range(mm, pud, addr, next, newprot, is_accountable);
 	} while (pud++, addr = next, addr != end);
 }
 
 static void change_protection(struct vm_area_struct *vma,
-		unsigned long addr, unsigned long end, pgprot_t newprot)
+		unsigned long addr, unsigned long end, pgprot_t newprot,
+		int is_accountable)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	pgd_t *pgd;
@@ -111,7 +121,7 @@ static void change_protection(struct vm_
 		next = pgd_addr_end(addr, end);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
-		change_pud_range(mm, pgd, addr, next, newprot);
+		change_pud_range(mm, pgd, addr, next, newprot, is_accountable);
 	} while (pgd++, addr = next, addr != end);
 	flush_tlb_range(vma, start, end);
 }
@@ -129,6 +139,7 @@ mprotect_fixup(struct vm_area_struct *vm
 	pgprot_t newprot;
 	pgoff_t pgoff;
 	int error;
+	int is_accountable = 0;
 
 	if (newflags == oldflags) {
 		*pprev = vma;
@@ -184,8 +195,10 @@ success:
 	if (is_shared_writable(newflags) && vma->vm_file)
 		mapping = vma->vm_file->f_mapping;
 	if ((mapping && mapping_cap_account_dirty(mapping)) ||
-			(vma->vm_ops && vma->vm_ops->page_mkwrite))
+			(vma->vm_ops && vma->vm_ops->page_mkwrite)) {
 		mask &= ~VM_SHARED;
+		is_accountable = 1;
+	}
 
 	newprot = protection_map[newflags & mask];
 
@@ -198,7 +211,7 @@ success:
 	if (is_vm_hugetlb_page(vma))
 		hugetlb_change_protection(vma, start, end, newprot);
 	else
-		change_protection(vma, start, end, newprot);
+		change_protection(vma, start, end, newprot, is_accountable);
 	vm_stat_account(mm, oldflags, vma->vm_file, -nrpages);
 	vm_stat_account(mm, newflags, vma->vm_file, nrpages);
 	return 0;


* [PATCH 5/6] mm: small cleanup of install_page()
  2006-06-19 17:52 [PATCH 0/6] mm: tracking dirty pages -v9 Peter Zijlstra
                   ` (3 preceding siblings ...)
  2006-06-19 17:53 ` [PATCH 4/6] mm: optimize the new mprotect() code a bit Peter Zijlstra
@ 2006-06-19 17:53 ` Peter Zijlstra
  2006-06-19 17:53 ` [PATCH 6/6] mm: remove some update_mmu_cache() calls Peter Zijlstra
  5 siblings, 0 replies; 51+ messages in thread
From: Peter Zijlstra @ 2006-06-19 17:53 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Andrew Morton, David Howells, Peter Zijlstra,
	Christoph Lameter, Martin Bligh, Nick Piggin, Linus Torvalds


From: Peter Zijlstra <a.p.zijlstra@chello.nl>

Smallish cleanup to install_page(): it could save a memory read
(I haven't checked the asm output) and it sure looks nicer.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 mm/fremap.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: 2.6-mm/mm/fremap.c
===================================================================
--- 2.6-mm.orig/mm/fremap.c	2006-06-19 16:20:52.000000000 +0200
+++ 2.6-mm/mm/fremap.c	2006-06-19 16:20:57.000000000 +0200
@@ -79,9 +79,9 @@ int install_page(struct mm_struct *mm, s
 		inc_mm_counter(mm, file_rss);
 
 	flush_icache_page(vma, page);
-	set_pte_at(mm, addr, pte, mk_pte(page, prot));
+	pte_val = mk_pte(page, prot);
+	set_pte_at(mm, addr, pte, pte_val);
 	page_add_file_rmap(page);
-	pte_val = *pte;
 	update_mmu_cache(vma, addr, pte_val);
 	lazy_mmu_prot_update(pte_val);
 	err = 0;


* [PATCH 6/6] mm: remove some update_mmu_cache() calls
  2006-06-19 17:52 [PATCH 0/6] mm: tracking dirty pages -v9 Peter Zijlstra
                   ` (4 preceding siblings ...)
  2006-06-19 17:53 ` [PATCH 5/6] mm: small cleanup of install_page() Peter Zijlstra
@ 2006-06-19 17:53 ` Peter Zijlstra
  2006-06-22 16:29   ` Hugh Dickins
  5 siblings, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2006-06-19 17:53 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Andrew Morton, David Howells, Peter Zijlstra,
	Christoph Lameter, Martin Bligh, Nick Piggin, Linus Torvalds


From: Christoph Lameter <clameter@sgi.com>

This may be a bit controversial but it does not seem to
make sense to use the update_mmu_cache macro when we reuse
the page. We are only fiddling around with the protections,
the dirty and accessed bits.

With the call to update_mmu_cache the way of using the macros
would be different from mprotect() and page_mkclean(). I'd
rather have everything work the same way. If this breaks on some
arches then also mprotect and page_mkclean() are broken.
The use of mprotect() is rare, we may have breakage in some
arches that we just have not seen yet.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 mm/memory.c |    2 --
 1 file changed, 2 deletions(-)

Index: 2.6-mm/mm/memory.c
===================================================================
--- 2.6-mm.orig/mm/memory.c	2006-06-19 16:21:16.000000000 +0200
+++ 2.6-mm/mm/memory.c	2006-06-19 16:21:25.000000000 +0200
@@ -1514,7 +1514,6 @@ static int do_wp_page(struct mm_struct *
 		entry = pte_mkyoung(orig_pte);
 		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
 		ptep_set_access_flags(vma, address, page_table, entry, 1);
-		update_mmu_cache(vma, address, entry);
 		lazy_mmu_prot_update(entry);
 		ret |= VM_FAULT_WRITE;
 		goto unlock;
@@ -2317,7 +2316,6 @@ static inline int handle_pte_fault(struc
 	entry = pte_mkyoung(entry);
 	if (!pte_same(old_entry, entry)) {
 		ptep_set_access_flags(vma, address, pte, entry, write_access);
-		update_mmu_cache(vma, address, entry);
 		lazy_mmu_prot_update(entry);
 	} else {
 		/*


* Re: [PATCH 1/6] mm: tracking shared dirty pages
  2006-06-19 17:52 ` [PATCH 1/6] mm: tracking shared dirty pages Peter Zijlstra
@ 2006-06-22  5:56   ` Andrew Morton
  2006-06-22  6:07     ` Christoph Lameter
  2006-06-22 11:33     ` Peter Zijlstra
  2006-06-22 20:52   ` Hugh Dickins
  1 sibling, 2 replies; 51+ messages in thread
From: Andrew Morton @ 2006-06-22  5:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm, linux-kernel, hugh, dhowells, a.p.zijlstra, christoph,
	mbligh, npiggin, torvalds

On Mon, 19 Jun 2006 19:52:53 +0200
Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

Could we have a changelog for this patch?  A list of
what-i-changed-since-last-time bullet points is uninteresting and unuseful
for the permanent record.  Think in terms of some random person in
Barcelona who is trying to work upon your code two years hence.

The patch should be accompanied by a standalone description of what it
does, why it does it and how it does it, thanks.


> Index: 2.6-mm/include/linux/mm.h
> ===================================================================
> --- 2.6-mm.orig/include/linux/mm.h	2006-06-14 10:29:04.000000000 +0200
> +++ 2.6-mm/include/linux/mm.h	2006-06-19 14:45:18.000000000 +0200
> @@ -182,6 +182,12 @@ extern unsigned int kobjsize(const void 
>  #define VM_SequentialReadHint(v)	((v)->vm_flags & VM_SEQ_READ)
>  #define VM_RandomReadHint(v)		((v)->vm_flags & VM_RAND_READ)
>  
> +static inline int is_shared_writable(unsigned int flags)
> +{
> +	return (flags & (VM_SHARED|VM_WRITE|VM_PFNMAP)) ==
> +		(VM_SHARED|VM_WRITE);
> +}

Need a comment: what's VM_PFNMAP doing in there.

>  
> -	if (unlikely(vma->vm_flags & VM_SHARED)) {
> +	if (unlikely(is_shared_writable(vma->vm_flags))) {

See, this is an incompatible change.  We used to have

	if (unlikely((vma->vm_flags & (VM_SHARED|VM_WRITE)) ==
			(VM_SHARED|VM_WRITE))) {

in there, only now it has secretly started looking at VM_PFNMAP.


> +		vma->vm_page_prot =
> +			__pgprot(pte_val
> +				(pte_wrprotect
> +				 (__pte(pgprot_val(vma->vm_page_prot)))));
> +

Is there really no simpler way?

> ===================================================================
> --- 2.6-mm.orig/fs/buffer.c	2006-06-14 10:28:52.000000000 +0200
> +++ 2.6-mm/fs/buffer.c	2006-06-19 14:45:18.000000000 +0200
> @@ -2985,6 +2985,7 @@ int try_to_free_buffers(struct page *pag
>  
>  	spin_lock(&mapping->private_lock);
>  	ret = drop_buffers(page, &buffers_to_free);
> +	spin_unlock(&mapping->private_lock);
>  	if (ret) {
>  		/*
>  		 * If the filesystem writes its buffers by hand (eg ext3)
> @@ -2996,7 +2997,6 @@ int try_to_free_buffers(struct page *pag
>  		 */
>  		clear_page_dirty(page);
>  	}
> -	spin_unlock(&mapping->private_lock);
>  out:
>  	if (buffers_to_free) {
>  		struct buffer_head *bh = buffers_to_free;

Needs changelogging, please.


Performance testing is critical here.  I think some was done, but I don't
recall what tests were performed, nor do I remember the results.  Without such
info it's not possible to make a go/no-go decision.




* Re: [PATCH 1/6] mm: tracking shared dirty pages
  2006-06-22  5:56   ` Andrew Morton
@ 2006-06-22  6:07     ` Christoph Lameter
  2006-06-22  6:15       ` Andrew Morton
  2006-06-22 11:33     ` Peter Zijlstra
  1 sibling, 1 reply; 51+ messages in thread
From: Christoph Lameter @ 2006-06-22  6:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Peter Zijlstra, linux-mm, linux-kernel, hugh, dhowells,
	christoph, mbligh, npiggin, torvalds

On Wed, 21 Jun 2006, Andrew Morton wrote:

> Performance testing is critical here.  I think some was done, but I don't
> recall what tests were performed, nor do I remember the results.  Without such
> info it's not possible to make a go/no-go decision.

Tests did show that there was no performance regression for the usual
tests. That is to be expected since the patch should only modify the
behavior of shared writable mappings. The use of those is rare in typical
benchmarks.


* Re: [PATCH 1/6] mm: tracking shared dirty pages
  2006-06-22  6:07     ` Christoph Lameter
@ 2006-06-22  6:15       ` Andrew Morton
  0 siblings, 0 replies; 51+ messages in thread
From: Andrew Morton @ 2006-06-22  6:15 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: a.p.zijlstra, linux-mm, linux-kernel, hugh, dhowells, christoph,
	mbligh, npiggin, torvalds

On Wed, 21 Jun 2006 23:07:37 -0700 (PDT)
Christoph Lameter <clameter@sgi.com> wrote:

> On Wed, 21 Jun 2006, Andrew Morton wrote:
> 
> > Performance testing is critical here.  I think some was done, but I don't
> > recall what tests were performed, nor do I remember the results.  Without such
> > info it's not possible to make a go/no-go decision.
> 
> Tests did show that there was no performance regression for the usual
> tests. That is to be expected since the patch should only modify the
> behavior of shared writable mappings. The use of those is rare in typical
> benchmarks.

Of course.  In this case one should prepare an artificial microbenchmark so
we can understand the worst case.
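
Something like the following sketch would do (an illustration; no such
benchmark was posted here): alternate dirtying every page of a shared
mapping with msync(MS_SYNC), so that with write-fault tracking each
pass takes one extra write fault per page:

	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/mman.h>
	#include <sys/time.h>
	#include <unistd.h>

	#define LEN	(16 << 20)
	#define PASSES	16

	int main(void)
	{
		int fd = open("benchfile", O_RDWR | O_CREAT, 0644);
		long pg = sysconf(_SC_PAGESIZE);
		struct timeval t0, t1;
		char *p;
		long off;
		int i;

		ftruncate(fd, LEN);
		p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
			 MAP_SHARED, fd, 0);
		gettimeofday(&t0, NULL);
		for (i = 0; i < PASSES; i++) {
			/* One write fault per clean page. */
			for (off = 0; off < LEN; off += pg)
				p[off] = i;
			/* Clean and re-wrprotect all pages. */
			msync(p, LEN, MS_SYNC);
		}
		gettimeofday(&t1, NULL);
		printf("%.3f seconds\n",
		       (t1.tv_sec - t0.tv_sec) +
		       (t1.tv_usec - t0.tv_usec) / 1e6);
		return 0;
	}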


* Re: [PATCH 1/6] mm: tracking shared dirty pages
  2006-06-22  5:56   ` Andrew Morton
  2006-06-22  6:07     ` Christoph Lameter
@ 2006-06-22 11:33     ` Peter Zijlstra
  2006-06-22 13:17       ` Hugh Dickins
  1 sibling, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2006-06-22 11:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, hugh, dhowells, christoph, mbligh,
	npiggin, torvalds

On Wed, 2006-06-21 at 22:56 -0700, Andrew Morton wrote:
> On Mon, 19 Jun 2006 19:52:53 +0200

> > +		vma->vm_page_prot =
> > +			__pgprot(pte_val
> > +				(pte_wrprotect
> > +				 (__pte(pgprot_val(vma->vm_page_prot)))));
> > +
> 
> Is there really no simpler way?

	pgprot_t prot_shared = protection_map[vm_flags & 
		(VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)];
	pgprot_t prot_priv = protection_map[vm_flags & 
		(VM_READ|VM_WRITE|VM_EXEC)];

	typeof(pgprot_val(prot_shared)) mask = 
		~(pgprot_val(prot_shared) ^ pgprot_val(prot_priv));

	pgprot_val(vma->vm_page_prot) &= mask;
	pgprot_val(vma->vm_page_prot) |= 
		(pgprot_val(prot_priv) & mask);

It's more readable, but barely so.

BTW, is there a difference between:
  (VM_READ|VM_WRITE|VM_EXEC)
and
  (VM_READ|VM_EXEC|VM_SHARED)
in this context?

Or I can make it a generic function with arch-specific overrides for i386
and x86-64. That way I can also clean up drivers/char/drm/drm_vm.c, where
I found this thing.

include/asm-generic/pgtable.h

#ifndef __HAVE_ARCH_PGPROT_WRPROTECT
#define pgprot_wrprotect(prot) \
({ (prot) = __pgprot(pte_val \
		(pte_wrprotect	\
		(__pte(pgprot_val(prot))))); \
})
#endif

include/asm-{i386,x86-64}/pgtable.h

#define pgprot_wrprotect(prot) ({ pgprot_val(prot) &= ~_PAGE_RW; })
#define __HAVE_ARCH_PGPROT_WRPROTECT

I can go through some other archs and see what I can do.

Hmm, now that I look at this, it might give an include-dependency problem.
Ah well, thoughts, comments?
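
(A value-returning form would avoid mutating the argument; sketch only,
same type-juggling as above:

	static inline pgprot_t pgprot_wrprotect(pgprot_t prot)
	{
		return __pgprot(pte_val(pte_wrprotect(__pte(pgprot_val(prot)))));
	}

used as vma->vm_page_prot = pgprot_wrprotect(vma->vm_page_prot).)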



* Re: [PATCH 1/6] mm: tracking shared dirty pages
  2006-06-22 11:33     ` Peter Zijlstra
@ 2006-06-22 13:17       ` Hugh Dickins
  0 siblings, 0 replies; 51+ messages in thread
From: Hugh Dickins @ 2006-06-22 13:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, linux-mm, linux-kernel, dhowells, christoph,
	mbligh, npiggin, torvalds

On Thu, 22 Jun 2006, Peter Zijlstra wrote:
> On Wed, 2006-06-21 at 22:56 -0700, Andrew Morton wrote:
> 
> > > +		vma->vm_page_prot =
> > > +			__pgprot(pte_val
> > > +				(pte_wrprotect
> > > +				 (__pte(pgprot_val(vma->vm_page_prot)))));
> > > +
> > 
> > Is there really no simpler way?
>.... 
> Awell, thoughts, comments?

Just note vma->vm_page_prot before calling the ->mmap and check after.
If the driver has messed with it at all, it's not a mapping we want to
apply dirty capping to anyway.  If it's unchanged, and the other tests
pass, go ahead and reset readonly vm_page_prot via protection_map[].
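
Concretely, that could be something like this sketch (illustrative
only, no such patch was posted) in do_mmap_pgoff():

	pgprot_t oldprot = vma->vm_page_prot;

	error = file->f_op->mmap(file, vma);
	/* ... */

	/* Only write protect if the driver left vm_page_prot alone;
	 * reset to the read-only private protections, as the old
	 * page_mkwrite hunk did. */
	if (pgprot_val(vma->vm_page_prot) == pgprot_val(oldprot) &&
	    ((mapping && mapping_cap_account_dirty(mapping)) ||
	     (vma->vm_ops && vma->vm_ops->page_mkwrite)))
		vma->vm_page_prot = protection_map[vma->vm_flags &
					(VM_READ|VM_WRITE|VM_EXEC)];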

This is of course a hack, as is the pgprotting code from drm.
The correct solution would be to set the proper backing_dev_info
in a number of drivers - we were too lazy when setting the default
in the first place; but even if we caught all the intree drivers,
we'd miss out-of-tree ones and suffer unnecessary pain from them.
So for now a hack is necessary, and I haven't thought of a better.

Other comments to follow...

Hugh


* Re: [PATCH 6/6] mm: remove some update_mmu_cache() calls
  2006-06-19 17:53 ` [PATCH 6/6] mm: remove some update_mmu_cache() calls Peter Zijlstra
@ 2006-06-22 16:29   ` Hugh Dickins
  2006-06-22 16:37     ` Christoph Lameter
  0 siblings, 1 reply; 51+ messages in thread
From: Hugh Dickins @ 2006-06-22 16:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: David Miller, linux-mm, linux-kernel, Andrew Morton,
	David Howells, Christoph Lameter, Martin Bligh, Nick Piggin,
	Linus Torvalds

On Mon, 19 Jun 2006, Peter Zijlstra wrote:
> From: Christoph Lameter <clameter@sgi.com>
> 
> This may be a bit controversial but it does not seem to
> make sense to use the update_mmu_cache macro when we reuse
> the page. We are only fiddling around with the protections,
> the dirty and accessed bits.

Controversial indeed.  Just because Christoph tentatively puts forward
such a patch is no reason to include it in your dirty page series.

Probably my fault for drawing attention to the ill-defined relationship
between update_mmu_cache and more recently added lazy_mmu_prot_update,
and wondering aloud about what's different between do_wp_page and mprotect.

If you want to pursue this patch, you should address it to someone
experienced in update_mmu_cache, someone familiar with those
architectures which implement it: ideally to Dave Miller (CCed),
who designed and wrote most of Documentation/cachetlb.txt.

The answer I expect is that update_mmu_cache is essential there in
do_wp_page (reuse case) and handle_pte_fault, on at least some if
not all of those arches which implement it.  That without those
lines, they'll fault and refault endlessly, since the "MMU cache"
has not been updated with the write permission.

But omitted from mprotect, since that's dealing with a batch of
pages, perhaps none of which will be faulted in the near future:
a waste of resources to update for all those entries.

But now I wonder, why does do_wp_page reuse case flush_cache_page?

Hugh

> With the call to update_mmu_cache the way of using the macros
> would be different from mprotect() and page_mkclean(). I'd
> rather have everything work the same way. If this breaks on some
> arches then also mprotect and page_mkclean() are broken.
> The use of mprotect() is rare, we may have breakage in some
> arches that we just have not seen yet.
> 
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
>  mm/memory.c |    2 --
>  1 file changed, 2 deletions(-)
> 
> Index: 2.6-mm/mm/memory.c
> ===================================================================
> --- 2.6-mm.orig/mm/memory.c	2006-06-19 16:21:16.000000000 +0200
> +++ 2.6-mm/mm/memory.c	2006-06-19 16:21:25.000000000 +0200
> @@ -1514,7 +1514,6 @@ static int do_wp_page(struct mm_struct *
>  		entry = pte_mkyoung(orig_pte);
>  		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
>  		ptep_set_access_flags(vma, address, page_table, entry, 1);
> -		update_mmu_cache(vma, address, entry);
>  		lazy_mmu_prot_update(entry);
>  		ret |= VM_FAULT_WRITE;
>  		goto unlock;
> @@ -2317,7 +2316,6 @@ static inline int handle_pte_fault(struc
>  	entry = pte_mkyoung(entry);
>  	if (!pte_same(old_entry, entry)) {
>  		ptep_set_access_flags(vma, address, pte, entry, write_access);
> -		update_mmu_cache(vma, address, entry);
>  		lazy_mmu_prot_update(entry);
>  	} else {
>  		/*
> 


* Re: [PATCH 6/6] mm: remove some update_mmu_cache() calls
  2006-06-22 16:29   ` Hugh Dickins
@ 2006-06-22 16:37     ` Christoph Lameter
  2006-06-22 17:35       ` Hugh Dickins
  0 siblings, 1 reply; 51+ messages in thread
From: Christoph Lameter @ 2006-06-22 16:37 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Peter Zijlstra, David Miller, linux-mm, linux-kernel,
	Andrew Morton, David Howells, Christoph Lameter, Martin Bligh,
	Nick Piggin, Linus Torvalds

On Thu, 22 Jun 2006, Hugh Dickins wrote:

> The answer I expect is that update_mmu_cache is essential there in
> do_wp_page (reuse case) and handle_pte_fault, on at least some if
> not all of those arches which implement it.  That without those
> lines, they'll fault and refault endlessly, since the "MMU cache"
> has not been updated with the write permission.

Yes, a likely scenario.
 
> But omitted from mprotect, since that's dealing with a batch of
> pages, perhaps none of which will be faulted in the near future:
> a waste of resources to update for all those entries.

So we intentionally allow mprotect to be racy?

> But now I wonder, why does do_wp_page reuse case flush_cache_page?

Some arches may have virtual caches?


* Re: [PATCH 3/6] mm: msync() cleanup
  2006-06-19 17:53 ` [PATCH 3/6] mm: msync() cleanup Peter Zijlstra
@ 2006-06-22 17:02   ` Hugh Dickins
  0 siblings, 0 replies; 51+ messages in thread
From: Hugh Dickins @ 2006-06-22 17:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm, linux-kernel, Andrew Morton, David Howells,
	Christoph Lameter, Martin Bligh, Nick Piggin, Linus Torvalds

On Mon, 19 Jun 2006, Peter Zijlstra wrote:
> 
> With the tracking of dirty pages properly done now, msync doesn't need to
> scan the PTEs anymore to determine the dirty status.

Lots of nice deletions, but...

>  static int msync_interval(struct vm_area_struct *vma, unsigned long addr,
> -			unsigned long end, int flags,
> -			unsigned long *nr_pages_dirtied)
> +			unsigned long end, int flags)
>  {
> -	struct file *file = vma->vm_file;
> -
>  	if ((flags & MS_INVALIDATE) && (vma->vm_flags & VM_LOCKED))
>  		return -EBUSY;
>  
> -	if (file && (vma->vm_flags & VM_SHARED))
> -		*nr_pages_dirtied = msync_page_range(vma, addr, end);
>  	return 0;
>  }

... I was offended by this msync_interval remnant which does nothing
but (rightly) choose between -EBUSY and 0: msync_interval isn't the
name I'd give that; but it might as well be brought in line anyway.

In looking to do that, I made some other tidyups: can remove several
#includes, don't keep re-evaluating current, PF_SYNCWRITE manipulation
best left to do_fsync, and sys_msync loop termination not quite right.

Most of those points are criticisms of the existing sys_msync, not of
your patch.  In particular, the loop termination errors were introduced
in 2.6.17: I did notice this shortly before it came out, but decided I
was more likely to get it wrong myself, and make matters worse if I
tried to rush a last-minute fix in.  And it's not terribly likely
to go wrong, nor disastrous if it does go wrong (may miss reporting
an unmapped area; may also fsync file of a following vma).

Would you apply this patch on top of yours, and give it a long hard
look to see if you agree it's correct and an improvement?  If you
do agree, please absorb it into yours, then maybe Andrew will want
one of us to supply just a fix patch to the current sys_msync,
to precede your dirty page series.

Ah, -mm already includes patches from Jens which delete PF_SYNCWRITE
altogether.

Hugh

--- your/mm/msync.c	2006-06-13 19:48:41.000000000 +0100
+++ my/mm/msync.c	2006-06-14 20:51:20.000000000 +0100
@@ -7,19 +7,12 @@
 /*
  * The msync() system call.
  */
-#include <linux/slab.h>
-#include <linux/pagemap.h>
 #include <linux/fs.h>
 #include <linux/mm.h>
 #include <linux/mman.h>
-#include <linux/hugetlb.h>
-#include <linux/writeback.h>
 #include <linux/file.h>
 #include <linux/syscalls.h>
 
-#include <asm/pgtable.h>
-#include <asm/tlbflush.h>
-
 /*
  * MS_SYNC syncs the entire file - including mappings.
  *
@@ -34,22 +27,14 @@
  * So by _not_ starting I/O in MS_ASYNC we provide complete flexibility to
  * applications.
  */
-static int msync_interval(struct vm_area_struct *vma, unsigned long addr,
-			unsigned long end, int flags)
-{
-	if ((flags & MS_INVALIDATE) && (vma->vm_flags & VM_LOCKED))
-		return -EBUSY;
-
-	return 0;
-}
 
 asmlinkage long sys_msync(unsigned long start, size_t len, int flags)
 {
 	unsigned long end;
+	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma;
 	int unmapped_error = 0;
 	int error = -EINVAL;
-	int done = 0;
 
 	if (flags & ~(MS_ASYNC | MS_INVALIDATE | MS_SYNC))
 		goto out;
@@ -69,56 +54,46 @@ asmlinkage long sys_msync(unsigned long 
 	 * If the interval [start,end) covers some unmapped address ranges,
 	 * just ignore them, but return -ENOMEM at the end.
 	 */
-	down_read(&current->mm->mmap_sem);
-	if (flags & MS_SYNC)
-		current->flags |= PF_SYNCWRITE;
-	vma = find_vma(current->mm, start);
-	if (!vma) {
-		error = -ENOMEM;
-		goto out_unlock;
-	}
-	do {
+	down_read(&mm->mmap_sem);
+	vma = find_vma(mm, start);
+	for (;;) {
 		struct file *file;
 
+		/* Still start < end. */
+		error = -ENOMEM;
+		if (!vma)
+			goto out_unlock;
 		/* Here start < vma->vm_end. */
 		if (start < vma->vm_start) {
-			unmapped_error = -ENOMEM;
 			start = vma->vm_start;
+			if (start >= end)
+				goto out_unlock;
+			unmapped_error = -ENOMEM;
 		}
 		/* Here vma->vm_start <= start < vma->vm_end. */
-		if (end <= vma->vm_end) {
-			if (start < end) {
-				error = msync_interval(vma, start, end, flags);
-				if (error)
-					goto out_unlock;
-			}
-			error = unmapped_error;
-			done = 1;
-		} else {
-			/* Here vma->vm_start <= start < vma->vm_end < end. */
-			error = msync_interval(vma, start, vma->vm_end, flags);
-			if (error)
-				goto out_unlock;
+		if ((flags & MS_INVALIDATE) && (vma->vm_flags & VM_LOCKED)) {
+			error = -EBUSY;
+			goto out_unlock;
 		}
 		file = vma->vm_file;
 		start = vma->vm_end;
-		if ((flags & MS_SYNC) && file &&
-				(vma->vm_flags & VM_SHARED)) {
+		if ((flags & MS_SYNC) && file && (vma->vm_flags & VM_SHARED)) {
 			get_file(file);
-			up_read(&current->mm->mmap_sem);
+			up_read(&mm->mmap_sem);
 			error = do_fsync(file, 0);
 			fput(file);
-			down_read(&current->mm->mmap_sem);
-			if (error)
-				goto out_unlock;
-			vma = find_vma(current->mm, start);
+			if (error || start >= end)
+				goto out;
+			down_read(&mm->mmap_sem);
+			vma = find_vma(mm, start);
 		} else {
+			if (start >= end)
+				goto out_unlock;
 			vma = vma->vm_next;
 		}
-	} while (vma && !done);
+	}
 out_unlock:
-	current->flags &= ~PF_SYNCWRITE;
-	up_read(&current->mm->mmap_sem);
+	up_read(&mm->mmap_sem);
 out:
-	return error;
+	return error? : unmapped_error;
 }


* Re: [PATCH 4/6] mm: optimize the new mprotect() code a bit
  2006-06-19 17:53 ` [PATCH 4/6] mm: optimize the new mprotect() code a bit Peter Zijlstra
@ 2006-06-22 17:21   ` Hugh Dickins
  0 siblings, 0 replies; 51+ messages in thread
From: Hugh Dickins @ 2006-06-22 17:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm, linux-kernel, Andrew Morton, David Howells,
	Christoph Lameter, Martin Bligh, Nick Piggin, Linus Torvalds

On Mon, 19 Jun 2006, Peter Zijlstra wrote:
> From: Peter Zijlstra <a.p.zijlstra@chello.nl>
> 
> mprotect() resets the page protections, which could result in extra write
> faults for those pages whos dirty state we track using write faults
> and are dirty already.
> 
> @@ -43,7 +44,13 @@ static void change_pte_range(struct mm_s
>  			 * bits by wiping the pte and then setting the new pte
>  			 * into place.
>  			 */
> -			ptent = pte_modify(ptep_get_and_clear(mm, addr, pte), newprot);
> +			ptent = ptep_get_and_clear(mm, addr, pte);
> +			ptent = pte_modify(ptent, newprot);
> +			/* Avoid taking write faults for pages we know to be
> +			 * dirty.
> +			 */
> +			if (is_accountable && pte_dirty(ptent))
> +				ptent = pte_mkwrite(ptent);
>  			set_pte_at(mm, addr, pte, ptent);
>  			lazy_mmu_prot_update(ptent);

Thanks for adding that comment, I completely misread this when you
first showed it to me, and didn't get the point at all.  (But you're
a little too fond of "/* Multiline" comments: in this case, with no
blank line above, it'd look better with a "/*" lone line to separate
from the pte_modify code.)

Yes, I guess that is worth doing, though it's a bit sad and ugly:
goes right against the simplicity of working with vm_page_prot.

Could you change "is_accountable" to "dirty_accountable" throughout?
We've various different kinds of accounting going on hereabouts,
I think it'd be more understandable as "dirty_accountable".

Hugh


* Re: [PATCH 6/6] mm: remove some update_mmu_cache() calls
  2006-06-22 16:37     ` Christoph Lameter
@ 2006-06-22 17:35       ` Hugh Dickins
  2006-06-22 18:31         ` Christoph Lameter
  0 siblings, 1 reply; 51+ messages in thread
From: Hugh Dickins @ 2006-06-22 17:35 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Peter Zijlstra, David Miller, linux-mm, linux-kernel,
	Andrew Morton, David Howells, Christoph Lameter, Martin Bligh,
	Nick Piggin, Linus Torvalds

On Thu, 22 Jun 2006, Christoph Lameter wrote:
> On Thu, 22 Jun 2006, Hugh Dickins wrote:
> 
> > The answer I expect is that update_mmu_cache is essential there in
> > do_wp_page (reuse case) and handle_pte_fault, on at least some if
> > not all of those arches which implement it.  That without those
> > lines, they'll fault and refault endlessly, since the "MMU cache"
> > has not been updated with the write permission.
> 
> Yes a likely scenario.
>  
> > But omitted from mprotect, since that's dealing with a batch of
> > pages, perhaps none of which will be faulted in the near future:
> > a waste of resources to update for all those entries.
> 
> So we intentionally allow mprotect to be racy?

It's intentionally allowed to be racy (ambiguous whether a racing
thread sees protections before or protections after) up until the
flush_tlb_range.  Should be well-defined from there on.
Or am I misunderstanding you?
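
Roughly, the ordering being relied upon looks like this (a much
simplified sketch of the change_protection() path, not the literal
code; change_one_pte() is a hypothetical helper):

	unsigned long addr;

	/*
	 * While the ptes in [start, end) are being rewritten, a racing
	 * thread may observe either the old or the new protections.
	 */
	for (addr = start; addr < end; addr += PAGE_SIZE)
		change_one_pte(mm, addr, newprot);

	/*
	 * Once the TLB shootdown below completes, every CPU sees the
	 * new protections; behaviour is well-defined from here on.
	 */
	flush_tlb_range(vma, start, end);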

> > But now I wonder, why does do_wp_page reuse case flush_cache_page?
> 
> Some arches may have virtual caches?

Sorry, I don't get it, you'll have to spell it out to me in detail.
We have a page mapped for reading, we're about to map that same page
for writing too.  We have no modifications to flush yet,
why flush_cache_page there?

Hugh

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 6/6] mm: remove some update_mmu_cache() calls
  2006-06-22 17:35       ` Hugh Dickins
@ 2006-06-22 18:31         ` Christoph Lameter
  0 siblings, 0 replies; 51+ messages in thread
From: Christoph Lameter @ 2006-06-22 18:31 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Peter Zijlstra, David Miller, linux-mm, linux-kernel,
	Andrew Morton, David Howells, Christoph Lameter, Martin Bligh,
	Nick Piggin, Linus Torvalds

On Thu, 22 Jun 2006, Hugh Dickins wrote:

> It's intentionally allowed to be racy (ambiguous whether a racing
> thread sees protections before or protections after) up until the
> flush_tlb_range.  Should be well-defined from there on.
> Or am I misunderstanding you?

No, that's fine, but this line of thinking establishes that we need 
update_mmu_cache for protection changes. So the documentation on the role 
of these calls needs to change.  lazy_mmu_prot_update does not do the 
notification to arch-specific code as documented; otherwise we would not 
need the flush_tlb_range. In fact it seems that lazy_mmu_prot_update only
deals with icache/dcache coherency issues, and it is separate from 
update_mmu_cache so that we can avoid checking the icache/dcache issues in 
every update_mmu_cache. 
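
For reference, the hook ordering under discussion, roughly as it
appears at the tail of handle_pte_fault() (a simplified sketch):

	if (write_access)
		entry = pte_mkdirty(entry);
	entry = pte_mkyoung(entry);
	ptep_set_access_flags(vma, address, pte, entry, write_access);
	/* arch hook: preload the "MMU cache" with the new translation */
	update_mmu_cache(vma, address, entry);
	/* arch hook: icache/dcache coherency for the updated pte */
	lazy_mmu_prot_update(entry);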

> > > But now I wonder, why does do_wp_page reuse case flush_cache_page?
> > 
> > Some arches may have virtual caches?
> 
> Sorry, I don't get it, you'll have to spell it out to me in detail.
> We have a page mapped for reading, we're about to map that same page
> for writing too.  We have no modifications to flush yet,
> why flush_cache_page there?

Hmmm. I found this by Dave Miller in 1999

http://www.ussg.iu.edu/hypermail/linux/kernel/9906.0/1237.html
  
  flush_cache_page(vma, page) is meant to also take care of the case
  where for some reason the TLB entry must exist for the cache entry to
  be valid as well. This is the case on the HyperSparc's combined I/D
  L2 cache (it has no L1 cache), you cannot flush out cache entries
  which have no translation, it will make the cpu trap. Sparc/sun4c's
  mmu is like this too.

If I read this correctly then it seems the reason that flush_cache_page 
was placed there is that on some architectures with virtual caches the 
TLB entry must exist in order to flush the cache (maybe they are 
hashed?). update_mmu_cache is called after we have changed things, so 
there may be no way to determine how to flush the cache contents for 
the page if we later drop it.




^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 1/6] mm: tracking shared dirty pages
  2006-06-19 17:52 ` [PATCH 1/6] mm: tracking shared dirty pages Peter Zijlstra
  2006-06-22  5:56   ` Andrew Morton
@ 2006-06-22 20:52   ` Hugh Dickins
  2006-06-22 23:02     ` Peter Zijlstra
  2006-06-22 23:39     ` [PATCH] mm: tracking shared dirty pages -v10 Peter Zijlstra
  1 sibling, 2 replies; 51+ messages in thread
From: Hugh Dickins @ 2006-06-22 20:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm, linux-kernel, Andrew Morton, David Howells,
	Christoph Lameter, Martin Bligh, Nick Piggin, Linus Torvalds

On Mon, 19 Jun 2006, Peter Zijlstra wrote:
> 
> People expressed the need to track dirty pages in shared mappings.
> 
> Linus outlined the general idea of doing that through making clean
> writable pages write-protected and taking the write fault.
> 
> Index: 2.6-mm/include/linux/mm.h
> ===================================================================
> --- 2.6-mm.orig/include/linux/mm.h	2006-06-14 10:29:04.000000000 +0200
> +++ 2.6-mm/include/linux/mm.h	2006-06-19 14:45:18.000000000 +0200
> @@ -182,6 +182,12 @@ extern unsigned int kobjsize(const void 
>  #define VM_SequentialReadHint(v)	((v)->vm_flags & VM_SEQ_READ)
>  #define VM_RandomReadHint(v)		((v)->vm_flags & VM_RAND_READ)
>  
> +static inline int is_shared_writable(unsigned int flags)
> +{
> +	return (flags & (VM_SHARED|VM_WRITE|VM_PFNMAP)) ==
> +		(VM_SHARED|VM_WRITE);
> +}
> +

Andrew asked for the inclusion of VM_PFNMAP to be commented there,
I don't believe that's enough: a function called "is_shared_writable"
should be testing precisely that, or people will misuse it.

Either you change the name to "is_shared_writable_but_not_pfnmap"
or somesuch, or you split out the VM_PFNMAP test, or you do away
with the function and make the tests explicit inline.  As before,
my instinctive preference is the latter: I really want to see what's
being tested (especially in do_wp_page); but perhaps it'll just look
too ugly all over - give it a try and see.
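
Spelled out, the alternatives would look roughly like this (a sketch
only):

	/* 1: do away with the helper, test explicitly at each site */
	if ((vma->vm_flags & (VM_SHARED|VM_WRITE|VM_PFNMAP)) ==
			(VM_SHARED|VM_WRITE))

	/* 2: keep the name honest, split out the VM_PFNMAP test */
	static inline int is_shared_writable(unsigned int flags)
	{
		return (flags & (VM_SHARED|VM_WRITE)) ==
			(VM_SHARED|VM_WRITE);
	}
	/* callers then add: && !(vma->vm_flags & VM_PFNMAP) */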

>  /*
>   * mapping from the currently active vm_flags protection bits (the
>   * low four bits) to a page protection mask..
> Index: 2.6-mm/mm/memory.c
> ===================================================================
> --- 2.6-mm.orig/mm/memory.c	2006-06-14 10:29:06.000000000 +0200
> +++ 2.6-mm/mm/memory.c	2006-06-19 16:20:06.000000000 +0200
> @@ -938,6 +938,12 @@ struct page *follow_page(struct vm_area_
>  	pte = *ptep;
>  	if (!pte_present(pte))
>  		goto unlock;
> +	/*
> +	 * This is not fully correct in the light of trapping write faults
> +	 * for writable shared mappings. However since we're going to mark
> +	 * the page dirty anyway some few lines downward, we might as well
> +	 * take the write fault now.
> +	 */

I don't understand what you're getting at here: please explain,
what is not fully correct and why?  In mail first, then we can
decide what the comment should say, or if it should be removed.
follow_page isn't making a pte writable, so what's the issue?

>  	if ((flags & FOLL_WRITE) && !pte_write(pte))
>  		goto unlock;
>  	page = vm_normal_page(vma, address, pte);
> @@ -1458,13 +1464,14 @@ static int do_wp_page(struct mm_struct *
>  {
>  	struct page *old_page, *new_page;
>  	pte_t entry;
> -	int reuse, ret = VM_FAULT_MINOR;
> +	int reuse = 0, ret = VM_FAULT_MINOR;
> +	struct page *dirty_page = NULL;
>  
>  	old_page = vm_normal_page(vma, address, orig_pte);
>  	if (!old_page)
>  		goto gotten;
>  
> -	if (unlikely(vma->vm_flags & VM_SHARED)) {
> +	if (unlikely(is_shared_writable(vma->vm_flags))) {

Most interesting line in the series, yes, and I'd find it
easier to think through if it showed the flags test explicitly:
	if ((vma->vm_flags & (VM_SHARED|VM_WRITE|VM_PFNMAP)) ==
		(VM_SHARED|VM_WRITE))

Yes, Andrew, you're right it's a change in behaviour from David's
page_mkwrite patch.  I've realized that when I was originally
reviewing David's patch, I believed do_wp_page was mistaken to be
doing COW on VM_SHARED areas.  But Linus has since asserted very
forcefully that it's intentional, that ptrace poke on a VM_SHARED
area which is currently not !VM_WRITE should COW it, so I mentioned
that to Peter.

Has he got the test right there now?  Ummm... maybe: my brain
exploded weeks ago.  Several strangenesses collide here; I'll
try again tomorrow; maybe others will argue it to certainty before then.

>  		if (vma->vm_ops && vma->vm_ops->page_mkwrite) {
>  			/*
>  			 * Notify the address space that the page is about to
> Index: 2.6-mm/mm/mmap.c
> ===================================================================
> --- 2.6-mm.orig/mm/mmap.c	2006-06-14 10:29:06.000000000 +0200
> +++ 2.6-mm/mm/mmap.c	2006-06-19 15:41:53.000000000 +0200
> @@ -25,6 +25,7 @@
>  #include <linux/mount.h>
>  #include <linux/mempolicy.h>
>  #include <linux/rmap.h>
> +#include <linux/backing-dev.h>
>  
>  #include <asm/uaccess.h>
>  #include <asm/cacheflush.h>
> @@ -888,6 +889,7 @@ unsigned long do_mmap_pgoff(struct file 
>  	struct rb_node ** rb_link, * rb_parent;
>  	int accountable = 1;
>  	unsigned long charged = 0, reqprot = prot;
> +	struct address_space *mapping = NULL;
>  
>  	if (file) {
>  		if (is_file_hugepages(file))
> @@ -1084,18 +1086,13 @@ munmap_back:
>  		error = file->f_op->mmap(file, vma);
>  		if (error)
>  			goto unmap_and_free_vma;
> +

Do you really need this blank line?

>  	} else if (vm_flags & VM_SHARED) {
>  		error = shmem_zero_setup(vma);
>  		if (error)
>  			goto free_vma;
>  	}
>  
> -	/* Don't make the VMA automatically writable if it's shared, but the
> -	 * backer wishes to know when pages are first written to */
> -	if (vma->vm_ops && vma->vm_ops->page_mkwrite)
> -		vma->vm_page_prot =
> -			protection_map[vm_flags & (VM_READ|VM_WRITE|VM_EXEC)];
> -
>  	/* We set VM_ACCOUNT in a shared mapping's vm_flags, to inform
>  	 * shmem_zero_setup (perhaps called through /dev/zero's ->mmap)
>  	 * that memory reservation must be checked; but that reservation
> @@ -1113,6 +1110,31 @@ munmap_back:
>  	pgoff = vma->vm_pgoff;
>  	vm_flags = vma->vm_flags;
>  
> +	/*
> +	 * Tracking of dirty pages for shared writable mappings. Do this by
> +	 * write protecting writable pages, and mark dirty in the write fault.
> +	 *
> +	 * Modify vma->vm_page_prot (the default protection for new pages)
> +	 * to this effect.
> +	 *
> +	 * Cannot do before because the condition depends on:
> +	 *  - backing_dev_info having the right capabilities
> +	 *    (set by f_op->open())

Is that so, backing_dev_info set by f_op->open()?
And how would that be a problem here if it were so?

> +	 *  - vma->vm_flags being fully set
> +	 *    (finished in f_op->mmap(), which could call remap_pfn_range())
> +	 *
> +	 *  Also, cannot reset vma->vm_page_prot from vma->vm_flags because
> +	 *  f_op->mmap() can modify it.
> +	 */
> +	if (is_shared_writable(vm_flags) && vma->vm_file)
> +		mapping = vma->vm_file->f_mapping;
> +	if ((mapping && mapping_cap_account_dirty(mapping)) ||
> +			(vma->vm_ops && vma->vm_ops->page_mkwrite))

The only way "mapping" might be set is just above.
Wouldn't it all be clearer (though more indented) if you said

	if (is_shared_writable(vm_flags) && vma->vm_file) {
		mapping = vma->vm_file->f_mapping;
		if ((mapping && mapping_cap_account_dirty(mapping)) ||
				(vma->vm_ops && vma->vm_ops->page_mkwrite)) {
			vma->vm_page_prot = whatever;
		}
	}

Or no need for "mapping" here at all if you change
mapping_cap_account_dirty(vma->vm_file->f_mapping)
to do the right thing with NULL.

> +		vma->vm_page_prot =
> +			__pgprot(pte_val
> +				(pte_wrprotect
> +				 (__pte(pgprot_val(vma->vm_page_prot)))));
> +

In other mail I've suggested saving vm_page_prot above, and
changing it here only if the driver's ->mmap did not change it.

I remain uneasy about interfering with the permissions expected by
strange drivers, but can't really justify my paranoia.  Certainly
you're right to exclude VM_PFNMAPs from this interference, that's
important; I'd be less uneasy if you also exclude VM_INSERTPAGEs,
they're strange too - but at least they're dealing with proper struct
pages, so should be able to handle an unexpected do_wp_page; that
leaves the driver nopage cases, which again should be okay now you're
(one way or another) protecting specially added vm_page_prot flags.

I guess I'm just paranoid; it's irritating me that we do not have
the right backing_dev_infos in place and that we have to hack around it.

>  	if (!file || !vma_merge(mm, prev, addr, vma->vm_end,
>  			vma->vm_flags, NULL, file, pgoff, vma_policy(vma))) {
>  		file = vma->vm_file;
> Index: 2.6-mm/mm/mprotect.c
> ===================================================================
> --- 2.6-mm.orig/mm/mprotect.c	2006-06-14 10:29:06.000000000 +0200
> +++ 2.6-mm/mm/mprotect.c	2006-06-19 16:19:42.000000000 +0200
> @@ -21,6 +21,7 @@
>  #include <linux/syscalls.h>
>  #include <linux/swap.h>
>  #include <linux/swapops.h>
> +#include <linux/backing-dev.h>
>  #include <asm/uaccess.h>
>  #include <asm/pgtable.h>
>  #include <asm/cacheflush.h>
> @@ -124,6 +125,7 @@ mprotect_fixup(struct vm_area_struct *vm
>  	long nrpages = (end - start) >> PAGE_SHIFT;
>  	unsigned long charged = 0;
>  	unsigned int mask;
> +	struct address_space *mapping = NULL;
>  	pgprot_t newprot;
>  	pgoff_t pgoff;
>  	int error;
> @@ -179,7 +181,10 @@ success:
>  	/* Don't make the VMA automatically writable if it's shared, but the
>  	 * backer wishes to know when pages are first written to */
>  	mask = VM_READ|VM_WRITE|VM_EXEC|VM_SHARED;
> -	if (vma->vm_ops && vma->vm_ops->page_mkwrite)
> +	if (is_shared_writable(newflags) && vma->vm_file)
> +		mapping = vma->vm_file->f_mapping;
> +	if ((mapping && mapping_cap_account_dirty(mapping)) ||
> +			(vma->vm_ops && vma->vm_ops->page_mkwrite))

Similar remarks on indenting,
or letting mapping_cap_account_dirty take NULL mapping.

>  		mask &= ~VM_SHARED;
>  
>  	newprot = protection_map[newflags & mask];
> Index: 2.6-mm/mm/rmap.c
> ===================================================================
> --- 2.6-mm.orig/mm/rmap.c	2006-06-14 10:29:07.000000000 +0200
> +++ 2.6-mm/mm/rmap.c	2006-06-19 14:45:18.000000000 +0200
> @@ -53,6 +53,7 @@
>  #include <linux/rmap.h>
>  #include <linux/rcupdate.h>
>  #include <linux/module.h>
> +#include <linux/backing-dev.h>
>  
>  #include <asm/tlbflush.h>
>  
> @@ -434,6 +435,73 @@ int page_referenced(struct page *page, i
>  	return referenced;
>  }
>  
> +static int page_mkclean_one(struct page *page, struct vm_area_struct *vma, int protect)
> +{
> +	struct mm_struct *mm = vma->vm_mm;
> +	unsigned long address;
> +	pte_t *pte, entry;
> +	spinlock_t *ptl;
> +	int ret = 0;
> +
> +	address = vma_address(page, vma);
> +	if (address == -EFAULT)
> +		goto out;
> +
> +	pte = page_check_address(page, mm, address, &ptl);
> +	if (!pte)
> +		goto out;
> +
> +	if (!(pte_dirty(*pte) || (protect && pte_write(*pte))))
> +		goto unlock;
> +
> +	entry = ptep_get_and_clear(mm, address, pte);
> +	entry = pte_mkclean(entry);
> +	if (protect)
> +		entry = pte_wrprotect(entry);
> +	ptep_establish(vma, address, pte, entry);
> +	lazy_mmu_prot_update(entry);
> +	ret = 1;
> +
> +unlock:
> +	pte_unmap_unlock(pte, ptl);
> +out:
> +	return ret;
> +}
> +
> +static int page_mkclean_file(struct address_space *mapping, struct page *page)
> +{
> +	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
> +	struct vm_area_struct *vma;
> +	struct prio_tree_iter iter;
> +	int ret = 0;
> +
> +	BUG_ON(PageAnon(page));
> +
> +	spin_lock(&mapping->i_mmap_lock);
> +	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
> +		int protect = mapping_cap_account_dirty(mapping) &&
> +			is_shared_writable(vma->vm_flags);
> +		ret += page_mkclean_one(page, vma, protect);

You have a good point here, one I'd completely missed: because a vma
may have been recently mprotected !VM_WRITE, you have to check readonly
mappings too.  Perhaps worth a comment.  But I think "is_shared_writable"
is not the best test here: just test for VM_SHARED vmas, they're the
only ones which can be mprotected to/from shared writable.  And then
I think you don't need to pass down an additional "protect" argument?
It's only being called for mapping_cap_account_dirty mappings anyway,
isn't it?

> +	}
> +	spin_unlock(&mapping->i_mmap_lock);
> +	return ret;
> +}

Hugh

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 1/6] mm: tracking shared dirty pages
  2006-06-22 20:52   ` Hugh Dickins
@ 2006-06-22 23:02     ` Peter Zijlstra
  2006-06-22 23:39     ` [PATCH] mm: tracking shared dirty pages -v10 Peter Zijlstra
  1 sibling, 0 replies; 51+ messages in thread
From: Peter Zijlstra @ 2006-06-22 23:02 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: linux-mm, linux-kernel, Andrew Morton, David Howells,
	Christoph Lameter, Martin Bligh, Nick Piggin, Linus Torvalds

On Thu, 2006-06-22 at 21:52 +0100, Hugh Dickins wrote:
> On Mon, 19 Jun 2006, Peter Zijlstra wrote:

> > +static inline int is_shared_writable(unsigned int flags)
> > +{
> > +	return (flags & (VM_SHARED|VM_WRITE|VM_PFNMAP)) ==
> > +		(VM_SHARED|VM_WRITE);
> > +}
> > +
> 
> Andrew asked for the inclusion of VM_PFNMAP to be commented there,
> I don't believe that's enough: a function called "is_shared_writable"
> should be testing precisely that, or people will misuse it.
> 
> Either you change the name to "is_shared_writable_but_not_pfnmap"
> or somesuch, or you split out the VM_PFNMAP test, or you do away
> with the function and make the tests explicit inline.  As before,
> my instinctive preference is the latter: I really want to see what's
> being tested (especially in do_wp_page); but perhaps it'll just look
> too ugly all over - give it a try and see.

*sigh*, that's it, explicit it will be :-)

> > +	/*
> > +	 * This is not fully correct in the light of trapping write faults
> > +	 * for writable shared mappings. However since we're going to mark
> > +	 * the page dirty anyway some few lines downward, we might as well
> > +	 * take the write fault now.
> > +	 */
> 
> I don't understand what you're getting at here: please explain,
> what is not fully correct and why?  In mail first, then we can
> decide what the comment should say, or if it should be removed.
> follow_page isn't making a pte writable, so what's the issue?

I have no idea either; I reread this part earlier today and found it one
big brainfart. It does indeed seem to do the right thing.

> > -	if (unlikely(vma->vm_flags & VM_SHARED)) {
> > +	if (unlikely(is_shared_writable(vma->vm_flags))) {
> 
> Most interesting line in the series, yes, and I'd find it
> easier to think through if it showed the flags test explicitly:
> 	if ((vma->vm_flags & (VM_SHARED|VM_WRITE|VM_PFNMAP)) ==
> 		(VM_SHARED|VM_WRITE))
> 
> Yes, Andrew, you're right it's a change in behaviour from David's
> page_mkwrite patch.  I've realized that when I was originally
> reviewing David's patch, I believed do_wp_page was mistaken to be
> doing COW on VM_SHARED areas.  But Linus has since asserted very
> forcefully that it's intentional, that ptrace poke on a VM_SHARED
> area which is currently not !VM_WRITE should COW it, so I mentioned
> that to Peter.
> 
> Has he got the test right there now?  Ummm... maybe: my brain
> exploded weeks ago.  Several strangenesses collide here, I'll
> try again tomorrow, maybe others will argue it to certainty before.

I don't think the VM_PFNMAP is needed here, but it doesn't hurt either.
Like I said, I'll do explicit tests from now on.

> > @@ -1084,18 +1086,13 @@ munmap_back:
> >  		error = file->f_op->mmap(file, vma);
> >  		if (error)
> >  			goto unmap_and_free_vma;
> > +
> 
> Do you really need this blank line?

:-) uhu..

> > +	/*
> > +	 * Tracking of dirty pages for shared writable mappings. Do this by
> > +	 * write protecting writable pages, and mark dirty in the write fault.
> > +	 *
> > +	 * Modify vma->vm_page_prot (the default protection for new pages)
> > +	 * to this effect.
> > +	 *
> > +	 * Cannot do before because the condition depends on:
> > +	 *  - backing_dev_info having the right capabilities
> > +	 *    (set by f_op->open())
> 
> Is that so, backing_dev_info set by f_op->open()?
> And how would that be a problem here if it were so?

Useless information indeed; a remnant from old times when I placed the
vm_page_prot modification between the two calls. I shall remove it.

> > +	 *  - vma->vm_flags being fully set
> > +	 *    (finished in f_op->mmap(), which could call remap_pfn_range())
> > +	 *
> > +	 *  Also, cannot reset vma->vm_page_prot from vma->vm_flags because
> > +	 *  f_op->mmap() can modify it.
> > +	 */
> > +	if (is_shared_writable(vm_flags) && vma->vm_file)
> > +		mapping = vma->vm_file->f_mapping;
> > +	if ((mapping && mapping_cap_account_dirty(mapping)) ||
> > +			(vma->vm_ops && vma->vm_ops->page_mkwrite))
> 
> The only way "mapping" might be set is just above.
> Wouldn't it all be clearer (though more indented) if you said
> 
> 	if (is_shared_writable(vm_flags) && vma->vm_file) {
> 		mapping = vma->vm_file->f_mapping;
> 		if ((mapping && mapping_cap_account_dirty(mapping)) ||
> 				(vma->vm_ops && vma->vm_ops->page_mkwrite)) {
> 			vma->vm_page_prot = whatever;
> 		}
> 	}
> 
> Or no need for "mapping" here at all if you change
> mapping_cap_account_dirty(vma->vm_file->f_mapping)
> to do the right thing with NULL.

Made it one big if stmt, perhaps too big, we'll see.

> 
> > +		vma->vm_page_prot =
> > +			__pgprot(pte_val
> > +				(pte_wrprotect
> > +				 (__pte(pgprot_val(vma->vm_page_prot)))));
> > +
> 
> In other mail I've suggested saving vm_page_prot above, and
> changing it here only if the driver's ->mmap did not change it.

Yes, that was a very good suggestion and has already been incorporated,
thanks.

> I remain uneasy about interfering with the permissions expected by
> strange drivers, but can't really justify my paranoia.  Certainly
> you're right to exclude VM_PFNMAPs from this interference, that's
> important; I'd be less uneasy if you also exclude VM_INSERTPAGEs,
> they're strange too - but at least they're dealing with proper struct
> pages, so should be able to handle an unexpected do_wp_page; that
> leaves the driver nopage cases, which again should be okay now you're
> (one way or another) protecting specially added vm_page_prot flags.

VM_INSERTPAGE thou shall have.

> I guess I'm just paranoid; it's irritating me that we do not have
> the right backing_dev_infos in place and that we have to hack around it.

Sad situation but true.

> > +static int page_mkclean_file(struct address_space *mapping, struct page *page)
> > +{
> > +	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
> > +	struct vm_area_struct *vma;
> > +	struct prio_tree_iter iter;
> > +	int ret = 0;
> > +
> > +	BUG_ON(PageAnon(page));
> > +
> > +	spin_lock(&mapping->i_mmap_lock);
> > +	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
> > +		int protect = mapping_cap_account_dirty(mapping) &&
> > +			is_shared_writable(vma->vm_flags);
> > +		ret += page_mkclean_one(page, vma, protect);
> 
> You have a good point here, one I'd completely missed: because a vma
> may have been recently mprotected !VM_WRITE, you have to check readonly
> mappings too.  Perhaps worth a comment.  But I think "is_shared_writable"
> is not the best test here: just test for VM_SHARED vmas, they're the
> only ones which can be mprotected to/from shared writable.  And then
> I think you don't need to pass down an additional "protect" argument?
> It's only being called for mapping_cap_account_dirty mappings anyway,
> isn't it?

Well, no, not anymore. I thought I'd make it actually do what its name
says it does: clean the page's PTEs (I am even pondering implementing
the anonymous branch).

In that light, it's now called for each page.


New patch will follow shortly since I can't seem to sleep anyway...




^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH] mm: tracking shared dirty pages -v10
  2006-06-22 20:52   ` Hugh Dickins
  2006-06-22 23:02     ` Peter Zijlstra
@ 2006-06-22 23:39     ` Peter Zijlstra
  2006-06-23  3:10       ` Jeff Dike
                         ` (2 more replies)
  1 sibling, 3 replies; 51+ messages in thread
From: Peter Zijlstra @ 2006-06-22 23:39 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: linux-mm, linux-kernel, Andrew Morton, David Howells,
	Christoph Lameter, Martin Bligh, Nick Piggin, Linus Torvalds

Preview of the goodness,

I'll repost the whole thing tomorrow after I've updated the other
patches, esp. the msync one. I seem to be too tired to make any sense
out of that atm.

(PS, 2.6.17-mm1 UML doesn't seem to boot)

-------------------------------------------------------


From: Peter Zijlstra <a.p.zijlstra@chello.nl>

Tracking of dirty pages in shared writeable mmap()s.

The idea is simple: write protect clean shared writeable pages,
catch the write-fault, make writeable and set dirty. On page write-back
clean all the PTE dirty bits and write protect them once again.

The implementation is a tad harder, mainly because the default
backing_dev_info capabilities were too loosely maintained. Hence it is
not enough to test the backing_dev_info for cap_account_dirty.

The current heuristic is as follows (condensed into a code sketch below
the list); a VMA is eligible when:
 - it is shared writeable
    (vm_flags & (VM_WRITE|VM_SHARED)) == (VM_WRITE|VM_SHARED)
 - it is not a PFN mapping
    (vm_flags & VM_PFNMAP) == 0
 - the backing_dev_info is cap_account_dirty
    mapping_cap_account_dirty(vma->vm_file->f_mapping)
 - f_op->mmap() didn't change the default page protection

NOTE: the last rule is only checked in do_mmap_pgoff(); other code
paths assume they will not be reached in that case.
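
Condensed into code, the first three rules amount to something like the
test below (a sketch only: the helper name is made up, the actual patch
open-codes the condition in do_mmap_pgoff() and mprotect_fixup(), and
the last rule is checked separately by comparing vm_page_prot before
and after f_op->mmap()):

	static int vma_wants_writenotify(struct vm_area_struct *vma)
	{
		unsigned long vm_flags = vma->vm_flags;

		/* shared writeable, neither PFN-mapped nor insert-paged */
		if ((vm_flags & (VM_WRITE|VM_SHARED|VM_PFNMAP|VM_INSERTPAGE)) !=
				(VM_WRITE|VM_SHARED))
			return 0;

		/* and the backing_dev_info must account dirty pages */
		return vma->vm_file && vma->vm_file->f_mapping &&
			mapping_cap_account_dirty(vma->vm_file->f_mapping);
	}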

Pages from remap_pfn_range() are explicitly excluded because their
COW semantics are already horrid enough (see vm_normal_page() in
do_wp_page()) and because they don't have a backing store anyway.

copy_one_pte() cleans child PTEs; not wrong in the current context,
but why?

mprotect() is taught about the new behaviour as well.

Cleaning the pages on write-back is done with page_mkclean(), a new
rmap call. It can be called on any page, but is currently only
implemented for mapped pages (does it make sense to also implement
anon pages?); if the page is found to be part of a trackable VMA it
will also wrprotect the PTE.

Finally, in fs/buffer.c:try_to_free_buffers(), remove clear_page_dirty()
from under ->private_lock. This seems to be safe, since ->private_lock 
is used to serialize access to the buffers, not the page itself.
This is needed because clear_page_dirty() will call into page_mkclean()
and would thereby violate the locking order.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
Changes in -v10

 - 2.6.17-mm1
 - Drop the ugly duckling pgprotting; Hugh suggested resetting the
   vma->vm_page_prot when f_op->mmap() didn't modify it. If it were
   modified we're not interested anyway.
 - Abandon is_shared_writable() because it's actually spelled writeable
   and it didn't actually mean that any more.
 - Comments all round.

Changes in -v9

 - respin against latest -mm.

Changes in -v8

 - access_process_vm() and other force users of get_user_pages() can
   induce COW of read-only shared mappings.

Changes in -v7

 - changed is_shared_writable() to exclude VM_PFNMAP'ed regions.
 - Hugh's tiresome problem wasn't fully solved, now using the ugly duckling
   method.

Changes in -v6

 - make page_mkclean_one() modify the pte more like change_pte_range() 
   (suggested by Christoph Lameter)
 - made is_shared_writable() take vm_flags, it now resembles is_cow_mapping().
 - fixed the mprotect() bug (spotted by Hugh Dickins)
 - hopefully fixed the tiresome issue of do_mmap_pgoff() trampling on
   driver specific vm_page_prot settings (spotted by Hugh Dickins)
 - made a new version of the page_mkwrite() patch to go on top of all this.
   This so that Linus could merge this very early on in 2.6.18.

Changes in -v5

 - rename page_wrprotect() to page_mkclean() (suggested by Nick Piggin)
 - added comment to test_clear_page_dirty() (Andrew Morton)
 - cleanup page_wrprotect() (Andrew Morton)
 - renamed VM_SharedWritable() to is_shared_writable()
 - fs/buffer.c try_to_free_buffers(): remove clear_page_dirty() from under
   ->private_lock. This seems to be safe, since ->private_lock is used to
   serialize access to the buffers, not the page itself.
 - rebased on top of David Howells' page_mkwrite() patch.

Changes in -v4:

 - small cleanup as suggested by Christoph Lameter.

Changes in -v3:

 - move set_page_dirty() outside pte lock (suggested by Christoph Lameter)

Changes in -v2:

 - only wrprotect pages from dirty capable mappings. (Nick Piggin)
 - move the writefault handling from do_wp_page() into handle_pte_fault(). 
   (Nick Piggin)
 - revert to the old install_page interface. (Nick Piggin)
 - also clear the pte dirty bit when we make pages read-only again.
   (spotted by Rik van Riel)
 - make page_wrprotect() return the number of reprotected ptes.

 fs/buffer.c                 |    2 -
 include/linux/backing-dev.h |   13 +++++---
 include/linux/rmap.h        |    8 +++++
 mm/memory.c                 |   29 ++++++++++++++----
 mm/mmap.c                   |   40 ++++++++++++++++++++-----
 mm/mprotect.c               |   17 ++++++++--
 mm/page-writeback.c         |    9 +++++
 mm/rmap.c                   |   69 ++++++++++++++++++++++++++++++++++++++++++++
 8 files changed, 163 insertions(+), 24 deletions(-)

Index: 2.6-mm/mm/memory.c
===================================================================
--- 2.6-mm.orig/mm/memory.c	2006-06-22 17:59:07.000000000 +0200
+++ 2.6-mm/mm/memory.c	2006-06-23 01:08:11.000000000 +0200
@@ -1458,14 +1458,19 @@ static int do_wp_page(struct mm_struct *
 {
 	struct page *old_page, *new_page;
 	pte_t entry;
-	int reuse, ret = VM_FAULT_MINOR;
+	int reuse = 0, ret = VM_FAULT_MINOR;
+	struct page *dirty_page = NULL;
 
 	old_page = vm_normal_page(vma, address, orig_pte);
 	if (!old_page)
 		goto gotten;
 
-	if (unlikely((vma->vm_flags & (VM_SHARED|VM_WRITE)) ==
-				(VM_SHARED|VM_WRITE))) {
+	/*
+	 * Only catch write-faults on shared writable pages, read-only
+	 * shared pages can get COWed by get_user_pages(.write=1, .force=1).
+	 */
+	if (unlikely(((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
+					(VM_WRITE|VM_SHARED)))) {
 		if (vma->vm_ops && vma->vm_ops->page_mkwrite) {
 			/*
 			 * Notify the address space that the page is about to
@@ -1494,13 +1499,12 @@ static int do_wp_page(struct mm_struct *
 			if (!pte_same(*page_table, orig_pte))
 				goto unlock;
 		}
-
+		dirty_page = old_page;
+		get_page(dirty_page);
 		reuse = 1;
 	} else if (PageAnon(old_page) && !TestSetPageLocked(old_page)) {
 		reuse = can_share_swap_page(old_page);
 		unlock_page(old_page);
-	} else {
-		reuse = 0;
 	}
 
 	if (reuse) {
@@ -1566,6 +1570,10 @@ gotten:
 		page_cache_release(old_page);
 unlock:
 	pte_unmap_unlock(page_table, ptl);
+	if (dirty_page) {
+		set_page_dirty(dirty_page);
+		put_page(dirty_page);
+	}
 	return ret;
 oom:
 	if (old_page)
@@ -2098,6 +2106,7 @@ static int do_no_page(struct mm_struct *
 	unsigned int sequence = 0;
 	int ret = VM_FAULT_MINOR;
 	int anon = 0;
+	struct page *dirty_page = NULL;
 
 	pte_unmap(page_table);
 	BUG_ON(vma->vm_flags & VM_PFNMAP);
@@ -2192,6 +2201,10 @@ retry:
 		} else {
 			inc_mm_counter(mm, file_rss);
 			page_add_file_rmap(new_page);
+			if (write_access) {
+				dirty_page = new_page;
+				get_page(dirty_page);
+			}
 		}
 	} else {
 		/* One of our sibling threads was faster, back out. */
@@ -2204,6 +2217,10 @@ retry:
 	lazy_mmu_prot_update(entry);
 unlock:
 	pte_unmap_unlock(page_table, ptl);
+	if (dirty_page) {
+		set_page_dirty(dirty_page);
+		put_page(dirty_page);
+	}
 	return ret;
 oom:
 	page_cache_release(new_page);
Index: 2.6-mm/mm/mmap.c
===================================================================
--- 2.6-mm.orig/mm/mmap.c	2006-06-22 17:59:07.000000000 +0200
+++ 2.6-mm/mm/mmap.c	2006-06-23 01:31:44.000000000 +0200
@@ -25,6 +25,7 @@
 #include <linux/mount.h>
 #include <linux/mempolicy.h>
 #include <linux/rmap.h>
+#include <linux/backing-dev.h>
 
 #include <asm/uaccess.h>
 #include <asm/cacheflush.h>
@@ -888,6 +889,7 @@ unsigned long do_mmap_pgoff(struct file 
 	struct rb_node ** rb_link, * rb_parent;
 	int accountable = 1;
 	unsigned long charged = 0, reqprot = prot;
+	pgprot_t vm_page_prot;
 
 	if (file) {
 		if (is_file_hugepages(file))
@@ -1065,8 +1067,8 @@ munmap_back:
 	vma->vm_start = addr;
 	vma->vm_end = addr + len;
 	vma->vm_flags = vm_flags;
-	vma->vm_page_prot = protection_map[vm_flags &
-				(VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)];
+	vma->vm_page_prot = vm_page_prot = protection_map[vm_flags &
+		(VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)];
 	vma->vm_pgoff = pgoff;
 
 	if (file) {
@@ -1090,12 +1092,6 @@ munmap_back:
 			goto free_vma;
 	}
 
-	/* Don't make the VMA automatically writable if it's shared, but the
-	 * backer wishes to know when pages are first written to */
-	if (vma->vm_ops && vma->vm_ops->page_mkwrite)
-		vma->vm_page_prot =
-			protection_map[vm_flags & (VM_READ|VM_WRITE|VM_EXEC)];
-
 	/* We set VM_ACCOUNT in a shared mapping's vm_flags, to inform
 	 * shmem_zero_setup (perhaps called through /dev/zero's ->mmap)
 	 * that memory reservation must be checked; but that reservation
@@ -1113,6 +1109,34 @@ munmap_back:
 	pgoff = vma->vm_pgoff;
 	vm_flags = vma->vm_flags;
 
+	/*
+	 * Tracking of dirty pages for shared writable mappings. Do this by
+	 * write protecting writable pages, and mark dirty in the write fault.
+	 *
+	 * Cannot do before because the condition depends on:
+	 *  - backing_dev_info having the right capabilities
+	 *  - vma->vm_flags being fully set
+	 *    (finished in f_op->mmap(), which could call remap_pfn_range())
+	 *
+	 * If f_op->mmap() changed vma->vm_page_prot it's a funny mapping
+	 * and we won't touch it.
+	 * NOTE: in a perfect world backing_dev_info would have the proper
+	 * capabilities.
+	 *
+	 * OR
+	 *
+	 * Don't make the VMA automatically writable if it's shared, but the
+	 * backer wishes to know when pages are first written to.
+	 */
+	if ((pgprot_val(vma->vm_page_prot) == pgprot_val(vm_page_prot) &&
+	     ((vm_flags & (VM_WRITE|VM_SHARED|VM_PFNMAP|VM_INSERTPAGE)) ==
+			  (VM_WRITE|VM_SHARED)) &&
+	     vma->vm_file && vma->vm_file->f_mapping &&
+	     mapping_cap_account_dirty(vma->vm_file->f_mapping)) ||
+	    (vma->vm_ops && vma->vm_ops->page_mkwrite))
+		vma->vm_page_prot =
+			protection_map[vm_flags & (VM_READ|VM_WRITE|VM_EXEC)];
+
 	if (!file || !vma_merge(mm, prev, addr, vma->vm_end,
 			vma->vm_flags, NULL, file, pgoff, vma_policy(vma))) {
 		file = vma->vm_file;
Index: 2.6-mm/mm/mprotect.c
===================================================================
--- 2.6-mm.orig/mm/mprotect.c	2006-06-22 17:59:07.000000000 +0200
+++ 2.6-mm/mm/mprotect.c	2006-06-23 01:35:24.000000000 +0200
@@ -21,6 +21,7 @@
 #include <linux/syscalls.h>
 #include <linux/swap.h>
 #include <linux/swapops.h>
+#include <linux/backing-dev.h>
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
 #include <asm/cacheflush.h>
@@ -176,10 +177,20 @@ mprotect_fixup(struct vm_area_struct *vm
 	}
 
 success:
-	/* Don't make the VMA automatically writable if it's shared, but the
-	 * backer wishes to know when pages are first written to */
+	/*
+	 * Tracking of dirty pages logic (see comment in do_mmap_pgoff)
+	 *
+	 * OR
+	 *
+	 * Don't make the VMA automatically writable if it's shared, but the
+	 * backer wishes to know when pages are first written to.
+	 */
 	mask = VM_READ|VM_WRITE|VM_EXEC|VM_SHARED;
-	if (vma->vm_ops && vma->vm_ops->page_mkwrite)
+	if ((((newflags & (VM_WRITE|VM_SHARED|VM_PFNMAP|VM_INSERTPAGE)) ==
+			  (VM_WRITE|VM_SHARED)) &&
+	     vma->vm_file && vma->vm_file->f_mapping &&
+	     mapping_cap_account_dirty(vma->vm_file->f_mapping)) ||
+	    (vma->vm_ops && vma->vm_ops->page_mkwrite))
 		mask &= ~VM_SHARED;
 
 	newprot = protection_map[newflags & mask];
Index: 2.6-mm/mm/page-writeback.c
===================================================================
--- 2.6-mm.orig/mm/page-writeback.c	2006-06-22 17:59:07.000000000 +0200
+++ 2.6-mm/mm/page-writeback.c	2006-06-23 00:27:09.000000000 +0200
@@ -29,6 +29,7 @@
 #include <linux/sysctl.h>
 #include <linux/cpu.h>
 #include <linux/syscalls.h>
+#include <linux/rmap.h>
 
 /*
  * The maximum number of pages to writeout in a single bdflush/kupdate
@@ -566,7 +567,7 @@ int do_writepages(struct address_space *
 		return 0;
 	wbc->for_writepages = 1;
 	if (mapping->a_ops->writepages)
-		ret =  mapping->a_ops->writepages(mapping, wbc);
+		ret = mapping->a_ops->writepages(mapping, wbc);
 	else
 		ret = generic_writepages(mapping, wbc);
 	wbc->for_writepages = 0;
@@ -728,6 +729,11 @@ int test_clear_page_dirty(struct page *p
 						page_index(page),
 						PAGECACHE_TAG_DIRTY);
 			write_unlock_irqrestore(&mapping->tree_lock, flags);
+			/*
+			 * We can continue to use `mapping' here because the
+			 * page is locked, which pins the address_space
+			 */
+			page_mkclean(page);
 			if (mapping_cap_account_dirty(mapping))
 				dec_page_state(nr_dirty);
 			return 1;
@@ -759,6 +765,7 @@ int clear_page_dirty_for_io(struct page 
 
 	if (mapping) {
 		if (TestClearPageDirty(page)) {
+			page_mkclean(page);
 			if (mapping_cap_account_dirty(mapping))
 				dec_page_state(nr_dirty);
 			return 1;
Index: 2.6-mm/mm/rmap.c
===================================================================
--- 2.6-mm.orig/mm/rmap.c	2006-06-22 17:59:07.000000000 +0200
+++ 2.6-mm/mm/rmap.c	2006-06-23 01:14:39.000000000 +0200
@@ -53,6 +53,7 @@
 #include <linux/rmap.h>
 #include <linux/rcupdate.h>
 #include <linux/module.h>
+#include <linux/backing-dev.h>
 
 #include <asm/tlbflush.h>
 
@@ -434,6 +435,74 @@ int page_referenced(struct page *page, i
 	return referenced;
 }
 
+static int page_mkclean_one(struct page *page, struct vm_area_struct *vma, int protect)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long address;
+	pte_t *pte, entry;
+	spinlock_t *ptl;
+	int ret = 0;
+
+	address = vma_address(page, vma);
+	if (address == -EFAULT)
+		goto out;
+
+	pte = page_check_address(page, mm, address, &ptl);
+	if (!pte)
+		goto out;
+
+	if (!(pte_dirty(*pte) || (protect && pte_write(*pte))))
+		goto unlock;
+
+	entry = ptep_get_and_clear(mm, address, pte);
+	entry = pte_mkclean(entry);
+	if (protect)
+		entry = pte_wrprotect(entry);
+	ptep_establish(vma, address, pte, entry);
+	lazy_mmu_prot_update(entry);
+	ret = 1;
+
+unlock:
+	pte_unmap_unlock(pte, ptl);
+out:
+	return ret;
+}
+
+static int page_mkclean_file(struct address_space *mapping, struct page *page)
+{
+	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	struct vm_area_struct *vma;
+	struct prio_tree_iter iter;
+	int ret = 0;
+
+	BUG_ON(PageAnon(page));
+
+	spin_lock(&mapping->i_mmap_lock);
+	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
+		int protect = mapping_cap_account_dirty(mapping) &&
+	((vma->vm_flags & (VM_WRITE|VM_SHARED|VM_PFNMAP|VM_INSERTPAGE)) ==
+	 		  (VM_WRITE|VM_SHARED));
+		ret += page_mkclean_one(page, vma, protect);
+	}
+	spin_unlock(&mapping->i_mmap_lock);
+	return ret;
+}
+
+int page_mkclean(struct page *page)
+{
+	int ret = 0;
+
+	BUG_ON(!PageLocked(page));
+
+	if (page_mapped(page)) {
+		struct address_space *mapping = page_mapping(page);
+		if (mapping)
+			ret = page_mkclean_file(mapping, page);
+	}
+
+	return ret;
+}
+
 /**
  * page_set_anon_rmap - setup new anonymous rmap
  * @page:	the page to add the mapping to
Index: 2.6-mm/include/linux/rmap.h
===================================================================
--- 2.6-mm.orig/include/linux/rmap.h	2006-06-22 17:59:07.000000000 +0200
+++ 2.6-mm/include/linux/rmap.h	2006-06-23 00:27:09.000000000 +0200
@@ -103,6 +103,14 @@ pte_t *page_check_address(struct page *,
  */
 unsigned long page_address_in_vma(struct page *, struct vm_area_struct *);
 
+/*
+ * Cleans the PTEs of shared mappings.
+ * (and since clean PTEs should also be readonly, write protects them too)
+ *
+ * returns the number of cleaned PTEs.
+ */
+int page_mkclean(struct page *);
+
 #else	/* !CONFIG_MMU */
 
 #define anon_vma_init()		do {} while (0)
Index: 2.6-mm/fs/buffer.c
===================================================================
--- 2.6-mm.orig/fs/buffer.c	2006-06-22 17:59:07.000000000 +0200
+++ 2.6-mm/fs/buffer.c	2006-06-23 00:27:09.000000000 +0200
@@ -2983,6 +2983,7 @@ int try_to_free_buffers(struct page *pag
 
 	spin_lock(&mapping->private_lock);
 	ret = drop_buffers(page, &buffers_to_free);
+	spin_unlock(&mapping->private_lock);
 	if (ret) {
 		/*
 		 * If the filesystem writes its buffers by hand (eg ext3)
@@ -2994,7 +2995,6 @@ int try_to_free_buffers(struct page *pag
 		 */
 		clear_page_dirty(page);
 	}
-	spin_unlock(&mapping->private_lock);
 out:
 	if (buffers_to_free) {
 		struct buffer_head *bh = buffers_to_free;
Index: 2.6-mm/include/linux/backing-dev.h
===================================================================
--- 2.6-mm.orig/include/linux/backing-dev.h	2006-06-22 08:39:46.000000000 +0200
+++ 2.6-mm/include/linux/backing-dev.h	2006-06-23 00:41:01.000000000 +0200
@@ -97,11 +97,14 @@ static inline int bdi_rw_congested(struc
 #define bdi_cap_account_dirty(bdi) \
 	(!((bdi)->capabilities & BDI_CAP_NO_ACCT_DIRTY))
 
-#define mapping_cap_writeback_dirty(mapping) \
-	bdi_cap_writeback_dirty((mapping)->backing_dev_info)
-
-#define mapping_cap_account_dirty(mapping) \
-	bdi_cap_account_dirty((mapping)->backing_dev_info)
+#define mapping_cap_writeback_dirty(__mapping)				\
+({	typeof(__mapping) mapping = (__mapping);			\
+	(mapping && bdi_cap_writeback_dirty(mapping->backing_dev_info));\
+})
 
+#define mapping_cap_account_dirty(__mapping)				\
+({	typeof(__mapping) mapping = (__mapping);			\
+  	(mapping && bdi_cap_account_dirty(mapping->backing_dev_info));	\
+})
 
 #endif		/* _LINUX_BACKING_DEV_H */
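
The reworked macros rely on GCC statement expressions plus typeof so
the __mapping argument is evaluated only once before the NULL check. A
standalone illustration of the idiom (a toy macro, not the kernel one):

	#include <stdio.h>
	#include <string.h>

	/* toy ({ ... }) statement expression: evaluate the argument
	 * exactly once, then NULL-check it, as the macros above do */
	#define len_or_zero(__s)				\
	({	typeof(__s) s = (__s);				\
		(s ? (int)strlen(s) : 0);			\
	})

	int main(void)
	{
		const char *name = "mmap";
		printf("%d\n", len_or_zero(name));	/* prints 4 */
		printf("%d\n", len_or_zero((char *)0));	/* prints 0 */
		return 0;
	}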



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] mm: tracking shared dirty pages -v10
  2006-06-22 23:39     ` [PATCH] mm: tracking shared dirty pages -v10 Peter Zijlstra
@ 2006-06-23  3:10       ` Jeff Dike
  2006-06-23  3:31         ` Andrew Morton
  2006-06-23  6:08       ` Linus Torvalds
  2006-06-23 19:06       ` Hugh Dickins
  2 siblings, 1 reply; 51+ messages in thread
From: Jeff Dike @ 2006-06-23  3:10 UTC (permalink / raw)
  To: hpa, Peter Zijlstra
  Cc: Hugh Dickins, linux-mm, linux-kernel, Andrew Morton,
	David Howells, Christoph Lameter, Martin Bligh, Nick Piggin,
	Linus Torvalds

On Fri, Jun 23, 2006 at 01:39:49AM +0200, Peter Zijlstra wrote:
> (PS, 2.6.17-mm1 UML doesn't seem to boot)

I don't get that far - it doesn't build for me.  It dies in klibc thusly:

  gcc -Wp,-MD,usr/klibc/syscalls/.typesize.o.d  -nostdinc -isystem /usr/lib/gcc/i386-redhat-linux/4.1.1/include -I/home/jdike/linux/2.6/test/linux-2.6.17/usr/include/arch/i386 -Iusr/include/arch/i386 -I/home/jdike/linux/2.6/test/linux-2.6.17/usr/include/bits32 -Iusr/include/bits32  -I/home/jdike/linux/2.6/test/linux-2.6.17/obj/usr/klibc/../include -I/home/jdike/linux/2.6/test/linux-2.6.17/usr/include -Iusr/include  -I/home/jdike/linux/2.6/test/linux-2.6.17/include -I/home/jdike/linux/2.6/test/linux-2.6.17/include2 -Iinclude2 -I/home/jdike/linux/2.6/test/linux-2.6.17/include -Iinclude  -I/home/jdike/linux/2.6/test/linux-2.6.17/include -D__KLIBC__=1 -D__KLIBC_MINOR__=4 -D_BITSIZE=32 -m32 -march=i386 -Os -g -fomit-frame-pointer -falign-functions=0 -falign-jumps=0 -falign-loops=0 -W -Wall -Wno-sign-compare -Wno-unused-parameter -c -o usr/klibc/syscalls/typesize.o usr/klibc/syscalls/typesize.c
usr/klibc/syscalls/typesize.c:1:23: error: syscommon.h: No such file or directory
usr/klibc/syscalls/typesize.c:5: error: '__u32' undeclared here (not in a function)
usr/klibc/syscalls/typesize.c:9: error: expected ')' before 'gid_t'
usr/klibc/syscalls/typesize.c:9: warning: type defaults to 'int' in declaration of 'type name'
usr/klibc/syscalls/typesize.c:10: error: expected ')' before 'sigset_t'
usr/klibc/syscalls/typesize.c:10: warning: type defaults to 'int' in declaration of 'type name'
usr/klibc/syscalls/typesize.c:21: error: 'dev_t' undeclared here (not in a function)
usr/klibc/syscalls/typesize.c:22: error: 'fd_set' undeclared here (not in a function)
usr/klibc/syscalls/typesize.c:22: error: expected expression before ')' token

syscommon.h is in usr/klibc/syscalls, but I don't see a -I pointing there.

				Jeff

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] mm: tracking shared dirty pages -v10
  2006-06-23  3:10       ` Jeff Dike
@ 2006-06-23  3:31         ` Andrew Morton
  2006-06-23  3:50           ` Jeff Dike
  2006-06-23  4:01           ` H. Peter Anvin
  0 siblings, 2 replies; 51+ messages in thread
From: Andrew Morton @ 2006-06-23  3:31 UTC (permalink / raw)
  To: Jeff Dike
  Cc: hpa, a.p.zijlstra, hugh, linux-mm, linux-kernel, dhowells,
	christoph, mbligh, npiggin, torvalds

On Thu, 22 Jun 2006 23:10:12 -0400
Jeff Dike <jdike@addtoit.com> wrote:

> On Fri, Jun 23, 2006 at 01:39:49AM +0200, Peter Zijlstra wrote:
> > (PS, 2.6.17-mm1 UML doesn't seem to boot)
> 
> I don't get that far - it doesn't build for me.  It dies in klibc thusly:
> 
>   gcc -Wp,-MD,usr/klibc/syscalls/.typesize.o.d  -nostdinc -isystem /usr/lib/gcc/i386-redhat-linux/4.1.1/include -I/home/jdike/linux/2.6/test/linux-2.6.17/usr/include/arch/i386 -Iusr/include/arch/i386 -I/home/jdike/linux/2.6/test/linux-2.6.17/usr/include/bits32 -Iusr/include/bits32  -I/home/jdike/linux/2.6/test/linux-2.6.17/obj/usr/klibc/../include -I/home/jdike/linux/2.6/test/linux-2.6.17/usr/include -Iusr/include  -I/home/jdike/linux/2.6/test/linux-2.6.17/include -I/home/jdike/linux/2.6/test/linux-2.6.17/include2 -Iinclude2 -I/home/jdike/linux/2.6/test/linux-2.6.17/include -Iinclude  -I/home/jdike/linux/2.6/test/linux-2.6.17/include -D__KLIBC__=1 -D__KLIBC_MINOR__=4 -D_BITSIZE=32 -m32 -march=i386 -Os -g -fomit-frame-pointer -falign-functions=0 -falign-jumps=0 -falign-loops=0 -W -Wall -Wno-sign-compare -Wno-unused-parameter -c -o usr/klibc/syscalls/typesize.o usr/klibc/syscalls/typesize.c
> usr/klibc/syscalls/typesize.c:1:23: error: syscommon.h: No such file or directory

That's probably a parallel kbuild race.  Type `make' again ;)

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] mm: tracking shared dirty pages -v10
  2006-06-23  3:31         ` Andrew Morton
@ 2006-06-23  3:50           ` Jeff Dike
  2006-06-23  4:01           ` H. Peter Anvin
  1 sibling, 0 replies; 51+ messages in thread
From: Jeff Dike @ 2006-06-23  3:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: hpa, a.p.zijlstra, hugh, linux-mm, linux-kernel, dhowells,
	christoph, mbligh, npiggin, torvalds

On Thu, Jun 22, 2006 at 08:31:23PM -0700, Andrew Morton wrote:
> That's probably a parallel kbuild race.  Type `make' again ;)

Nope, it's extremely consistent, including with a non-parallel build.

				Jeff

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] mm: tracking shared dirty pages -v10
  2006-06-23  3:31         ` Andrew Morton
  2006-06-23  3:50           ` Jeff Dike
@ 2006-06-23  4:01           ` H. Peter Anvin
  2006-06-23 15:08             ` Jeff Dike
  1 sibling, 1 reply; 51+ messages in thread
From: H. Peter Anvin @ 2006-06-23  4:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jeff Dike, a.p.zijlstra, hugh, linux-mm, linux-kernel, dhowells,
	christoph, mbligh, npiggin, torvalds

Andrew Morton wrote:
> On Thu, 22 Jun 2006 23:10:12 -0400
> Jeff Dike <jdike@addtoit.com> wrote:
> 
>> On Fri, Jun 23, 2006 at 01:39:49AM +0200, Peter Zijlstra wrote:
>>> (PS, 2.6.17-mm1 UML doesn't seem to boot)
>> I don't get that far - it doesn't build for me.  It dies in klibc thusly:
>>
>>   gcc -Wp,-MD,usr/klibc/syscalls/.typesize.o.d  -nostdinc -isystem /usr/lib/gcc/i386-redhat-linux/4.1.1/include -I/home/jdike/linux/2.6/test/linux-2.6.17/usr/include/arch/i386 -Iusr/include/arch/i386 -I/home/jdike/linux/2.6/test/linux-2.6.17/usr/include/bits32 -Iusr/include/bits32  -I/home/jdike/linux/2.6/test/linux-2.6.17/obj/usr/klibc/../include -I/home/jdike/linux/2.6/test/linux-2.6.17/usr/include -Iusr/include  -I/home/jdike/linux/2.6/test/linux-2.6.17/include -I/home/jdike/linux/2.6/test/linux-2.6.17/include2 -Iinclude2 -I/home/jdike/linux/2.6/test/linux-2.6.17/include -Iinclude  -I/home/jdike/linux/2.6/test/linux-2.6.17/include -D__KLIBC__=1 -D__KLIBC_MINOR__=4 -D_BITSIZE=32 -m32 -march=i386 -Os -g -fomit-frame-pointer -falign-functions=0 -falign-jumps=0 -falign-loops=0 -W -Wall -Wno-sign-compare -Wno-unused-parameter -c -o usr/klibc/syscalls/typesize.o usr/klibc/syscalls/typesize.c
>> usr/klibc/syscalls/typesize.c:1:23: error: syscommon.h: No such file or directory
> 
> That's probably a parallel kbuild race.  Type `make' again ;)

No, it's not.  It's a problem with O=, apparently; this patch fixes it:

http://www.kernel.org/git/?p=linux/kernel/git/hpa/linux-2.6-klibc.git;a=commitdiff;h=4e51186fb663b57ac7c53517947510d2e1e9de01;hp=79317ba49e3f83d40f37b59fcdd5bd7c7635ee32

	-hpa

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] mm: tracking shared dirty pages -v10
  2006-06-22 23:39     ` [PATCH] mm: tracking shared dirty pages -v10 Peter Zijlstra
  2006-06-23  3:10       ` Jeff Dike
@ 2006-06-23  6:08       ` Linus Torvalds
  2006-06-23  7:27         ` Hugh Dickins
  2006-06-23 19:06       ` Hugh Dickins
  2 siblings, 1 reply; 51+ messages in thread
From: Linus Torvalds @ 2006-06-23  6:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Hugh Dickins, linux-mm, linux-kernel, Andrew Morton,
	David Howells, Christoph Lameter, Martin Bligh, Nick Piggin



On Fri, 23 Jun 2006, Peter Zijlstra wrote:
>
> Preview of the goodness,
> 
> I'll repost the whole thing tomorrow after I've updated the other
> patches, esp. the msync one. I seem to be too tired to make any sense
> out of that atm.

Do people agree about this thing? If we want it in 2.6.18, we should merge 
this soon. I'd prefer to not leave something like this to be a last-minute 
thing before the merge window closes, and I get the feeling that we're 
getting to where this should just go in sooner rather than later.

Comments? Hugh, does the last version address all your concerns?

		Linus

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] mm: tracking shared dirty pages -v10
  2006-06-23  6:08       ` Linus Torvalds
@ 2006-06-23  7:27         ` Hugh Dickins
  2006-06-23 17:00           ` Christoph Lameter
  2006-06-23 17:49           ` Linus Torvalds
  0 siblings, 2 replies; 51+ messages in thread
From: Hugh Dickins @ 2006-06-23  7:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, linux-mm, linux-kernel, Andrew Morton,
	David Howells, Christoph Lameter, Martin Bligh, Nick Piggin

On Thu, 22 Jun 2006, Linus Torvalds wrote:
> On Fri, 23 Jun 2006, Peter Zijlstra wrote:
> >
> > Preview of the goodness,
> 
> Do people agree about this thing? If we want it in 2.6.18, we should merge 
> this soon. I'd prefer to not leave something like this to be a last-minute 
> thing before the merge window closes, and I get the feeling that we're 
> getting to where this should just go in sooner rather than later.
> 
> Comments? Hugh, does the last version address all your concerns?

Not even looked at the preview yet, but as far as mechanism goes,
I'm sure it won't be worse than a few fixups away from good.

However, I've never understood why it should be fast-tracked into
2.6.18: we usually let such patchsets cook for a cycle in -mm.
2.6.N-rc can get wider exposure than 2.6.(N-1)-mm and reveal problems
missed all the while in -mm, but a cycle in -mm is still worthwhile.

My pathetically slow responses have hindered Peter's good work, yes,
but I don't think they've affected the overall appropriate timing.

Is there any particular reason why 2.6.18 rather than 2.6.19 should be
the release that fixes this issue that's been around forever?

And have we even seen stats for it yet?  We know that it shouldn't
affect the vast majority of loads (not mapping shared writable), but
it won't be fixing any problem on them either; and we've had reports
that it does fix the issue, but at what perf cost? (I may have missed)

Several people also have doubts as to whether it's right to be
focussing just on shared writable here, whether the private also
needs tweaking.  I'm undecided.  Can be considered a separate
issue, but a cycle in -mm would help settle that question too.

But if you want to push for 2.6.18, I won't be aggrieved.

Hugh

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] mm: tracking shared dirty pages -v10
  2006-06-23  4:01           ` H. Peter Anvin
@ 2006-06-23 15:08             ` Jeff Dike
  0 siblings, 0 replies; 51+ messages in thread
From: Jeff Dike @ 2006-06-23 15:08 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Andrew Morton, a.p.zijlstra, hugh, linux-mm, linux-kernel,
	dhowells, christoph, mbligh, npiggin, torvalds

On Thu, Jun 22, 2006 at 09:01:20PM -0700, H. Peter Anvin wrote:
> No, it's not.  It's a problem with O=, apparently; this patch fixes it:
> 
> http://www.kernel.org/git/?p=linux/kernel/git/hpa/linux-2.6-klibc.git;a=commitdiff;h=4e51186fb663b57ac7c53517947510d2e1e9de01;hp=79317ba49e3f83d40f37b59fcdd5bd7c7635ee32

That works, thanks!

Back to the original problem - 2.6.17-mm1 UML not booting.  If you add
stderr=1 to the command line, you'll see this:

	timer_init : request_irq failed - errno = 38
	NET: Registered protocol family 2
	irq 0, desc: 081debe0, depth: 0, count: 0, unhandled: 0
	->handle_irq():  0808af80, handle_bad_irq+0x0/0x1b7
	->chip(): 081d9320, 0x81d9320
	->action(): 00000000
	   IRQ_NOPROBE set
	unexpected IRQ 00
	BUG: failure at include2/asm/hardirq.h:22/ack_bad_irq()!
	Kernel panic - not syncing: BUG!

which means that the genirq stuff needs UML work, which I was working
on anyway because UML could already be made to crash like this.
Except now, it always does.

				Jeff

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] mm: tracking shared dirty pages -v10
  2006-06-23  7:27         ` Hugh Dickins
@ 2006-06-23 17:00           ` Christoph Lameter
  2006-06-23 17:22             ` Peter Zijlstra
  2006-06-23 17:49           ` Linus Torvalds
  1 sibling, 1 reply; 51+ messages in thread
From: Christoph Lameter @ 2006-06-23 17:00 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Linus Torvalds, Peter Zijlstra, linux-mm, linux-kernel,
	Andrew Morton, David Howells, Christoph Lameter, Martin Bligh,
	Nick Piggin

On Fri, 23 Jun 2006, Hugh Dickins wrote:

> Not even looked at the preview yet, but as far as mechanism goes,
> I'm sure it won't be worse than a few fixups away from good.

Sounds good.
 
> And have we even seen stats for it yet?  We know that it shouldn't
> affect the vast majority of loads (not mapping shared writable), but
> it won't be fixing any problem on them either; and we've had reports
> that it does fix the issue, but at what perf cost? (I may have missed)

I do not think that statistics are that important. One of the primary
advantages is that this fixes a way to deadlock the machine by
dirtying too many pages. The other side effect is that dirty page
writeout can begin before an application terminates. We have had cases
where dirty memory was sitting for weeks in a machine because the process
did not terminate. These are major VM issues that need a resolution.

Also Peter has made the tracking configurable. So there is a way
to switch it off if it is harmful in some situations.

> Several people also have doubts as to whether it's right to be
> focussing just on shared writable here, whether the private also
> needs tweaking.  I'm undecided.  Can be considered a separate
> issue, but a cycle in -mm would help settle that question too.

You mean anonymous pages? Anonymous pages are always dirty unless
you consider swap, and we currently do not account for dirty anonymous 
pages. With swap we already have performance problems, and maybe there
are additional issues to fix in that area. But these are secondary.



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] mm: tracking shared dirty pages -v10
  2006-06-23 17:00           ` Christoph Lameter
@ 2006-06-23 17:22             ` Peter Zijlstra
  2006-06-23 17:52               ` Christoph Lameter
  2006-06-23 17:56               ` Linus Torvalds
  0 siblings, 2 replies; 51+ messages in thread
From: Peter Zijlstra @ 2006-06-23 17:22 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Hugh Dickins, Linus Torvalds, linux-mm, linux-kernel,
	Andrew Morton, David Howells, Christoph Lameter, Martin Bligh,
	Nick Piggin

On Fri, 2006-06-23 at 10:00 -0700, Christoph Lameter wrote:

> Also Peter has made the tracking configurable. So there is a way
> to switch it off if it is harmful for some situations.

Oh?

> You mean anonymous pages? Anonymous pages are always dirty unless
> you consider swap and we currently do not take account of dirty anonymous 
> pages. With swap we already have performance problems and maybe there are
> additional issues to fix in that area. But these are secondary.

I intend to make swap over NFS work next.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] mm: tracking shared dirty pages -v10
  2006-06-23  7:27         ` Hugh Dickins
  2006-06-23 17:00           ` Christoph Lameter
@ 2006-06-23 17:49           ` Linus Torvalds
  2006-06-23 18:05             ` Arjan van de Ven
  2006-06-23 18:08             ` Miklos Szeredi
  1 sibling, 2 replies; 51+ messages in thread
From: Linus Torvalds @ 2006-06-23 17:49 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Peter Zijlstra, linux-mm, linux-kernel, Andrew Morton,
	David Howells, Christoph Lameter, Martin Bligh, Nick Piggin



On Fri, 23 Jun 2006, Hugh Dickins wrote:
> 
> However, I've never understood why it should be fasttracked into
> 2.6.18: we usually let such patchsets cook for a cycle in -mm.
> 2.6.N-rc can get wider exposure than 2.6.(N-1)-mm, reveal problems
> missed all the while in -mm, but a cycle in -mm is still worthwhile.

Well, I've got two reasons to want to fast-track it:

 - it's exactly what I wanted to see, so I'm obviously personally happy 
   with the patch

 - the main _real_ issues I'd expect to surface are some subtle 
   performance issues when the kernel knows about more dirty pages, and 
   the dirty limit is thus _effectively_ different (never mind even the 
   page dirtying throttling itself - just the fact that we count the dirty 
   pages will mean that we'll see different throttling behaviour for 
   regular writes too in the presence of dirty mappings)

Now, the first one is admittedly purely personal, but the second one boils 
down to the fact that I think we want people to use it in order to find 
these problems, and I suspect the -mm users are very uniform: it's 
probably almost exclusively (kernel) developers rather than "normal 
users".

> My pathetically slow responses have hindered Peter's good work, yes,
> but I don't think they've affected the overall appropriate timing.

No, I don't think that has been the problem. I think the patches have 
improved thanks to the feedback from you and others; I just think we're quite 
possibly past the point where we would simply be better off with testing 
than with arguing from a source perspective.

But maybe I'm just biased.

> Is there any particular reason why 2.6.18 rather than 2.6.19 should be
> the release that fixes this issue that's been around forever?

My main worry has always been the effects of this on some strange load, 
not the stability itself.

> And have we even seen stats for it yet?  We know that it shouldn't
> affect the vast majority of loads (not mapping shared writable), but
> it won't be fixing any problem on them either; and we've had reports
> that it does fix the issue, but at what perf cost? (I may have missed)

_Exactly_. This is why I think earlier rather than later is better. 

Sitting in -mm won't get us any new unexpected load cases - only more of 
the same that hasn't raised any huge red flags per se (although the dirty limit 
discussion clearly shows people are at least thinking about it).

		Linus

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] mm: tracking shared dirty pages -v10
  2006-06-23 17:22             ` Peter Zijlstra
@ 2006-06-23 17:52               ` Christoph Lameter
  2006-06-23 18:11                 ` Martin Bligh
  2006-06-23 17:56               ` Linus Torvalds
  1 sibling, 1 reply; 51+ messages in thread
From: Christoph Lameter @ 2006-06-23 17:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Hugh Dickins, Linus Torvalds, linux-mm, linux-kernel,
	Andrew Morton, David Howells, Christoph Lameter, Martin Bligh,
	Nick Piggin

On Fri, 23 Jun 2006, Peter Zijlstra wrote:

> On Fri, 2006-06-23 at 10:00 -0700, Christoph Lameter wrote:
> 
> > Also Peter has made the tracking configurable. So there is a way
> > to switch it off if it is harmful for some situations.
> 
> Oh?

mapping_cap_account_dirty is not a way for a fs to switch this on and off?

> > You mean anonymous pages? Anonymous pages are always dirty unless
> > you consider swap and we currently do not take account of dirty anonymous 
> > pages. With swap we already have performance problems and maybe there are
> > additional issues to fix in that area. But these are secondary.
> 
> I intend to make swap over NFS work next.

I am still a bit unclear on what you mean by "work". The only 
issue may be accounting for the number of swap pages about to be written out 
in write throttling.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] mm: tracking shared dirty pages -v10
  2006-06-23 17:22             ` Peter Zijlstra
  2006-06-23 17:52               ` Christoph Lameter
@ 2006-06-23 17:56               ` Linus Torvalds
  2006-06-23 18:03                 ` Peter Zijlstra
  2006-06-23 18:41                 ` Christoph Hellwig
  1 sibling, 2 replies; 51+ messages in thread
From: Linus Torvalds @ 2006-06-23 17:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Christoph Lameter, Hugh Dickins, linux-mm, linux-kernel,
	Andrew Morton, David Howells, Christoph Lameter, Martin Bligh,
	Nick Piggin



On Fri, 23 Jun 2006, Peter Zijlstra wrote:
> 
> > I intend to make swap over NFS work next.

Doesn't it work already? Is there some throttling that doesn't work?

		Linus

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] mm: tracking shared dirty pages -v10
  2006-06-23 17:56               ` Linus Torvalds
@ 2006-06-23 18:03                 ` Peter Zijlstra
  2006-06-23 18:23                   ` Christoph Lameter
  2006-06-23 18:41                 ` Christoph Hellwig
  1 sibling, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2006-06-23 18:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Christoph Lameter, Hugh Dickins, linux-mm, linux-kernel,
	Andrew Morton, David Howells, Christoph Lameter, Martin Bligh,
	Nick Piggin, riel

On Fri, 2006-06-23 at 10:56 -0700, Linus Torvalds wrote:
> 
> On Fri, 23 Jun 2006, Peter Zijlstra wrote:
> > 
> > I intend to make swap over NFS work next.
> 
> Doesn't it work already? Is there some throttling that doesn't work?

I do not know how 'bad' the situation is now that we have the dirty page
tracking stuff; I still need to create a test environment.

But the general idea is that it's broken because the ACK from writeout
can be delayed and the remaining free memory taken by other incoming
network packets.

Until I have a test setup and some reproducible deadlocks I cannot say
more, I'm afraid.

Peter


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] mm: tracking shared dirty pages -v10
  2006-06-23 17:49           ` Linus Torvalds
@ 2006-06-23 18:05             ` Arjan van de Ven
  2006-06-23 18:08             ` Miklos Szeredi
  1 sibling, 0 replies; 51+ messages in thread
From: Arjan van de Ven @ 2006-06-23 18:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Peter Zijlstra, linux-mm, linux-kernel,
	Andrew Morton, David Howells, Christoph Lameter, Martin Bligh,
	Nick Piggin


> My main worry has always been the effects of this on some strange load, 
> not the stability itself.
> 
> > And have we even seen stats for it yet?  We know that it shouldn't
> > affect the vast majority of loads (not mapping shared writable), but
> > it won't be fixing any problem on them either; and we've had reports
> > that it does fix the issue, but at what perf cost? (I may have missed)
> 
> _Exactly_. This is why I think earlier rather than later is better. 
> 
> Sitting in -mm won't get us any new unexpected load cases - only more of 
> the same that hasn't shown any huge flags per se (although the dirty limit 
> discussion clearly shows people are at least thinking about it).


One option is to ask the distributions to put this into their more
experimental kernels for a bit to give it broader exposure... it's
still a small audience, but at least broader than "kernel developers"...


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] mm: tracking shared dirty pages -v10
  2006-06-23 17:49           ` Linus Torvalds
  2006-06-23 18:05             ` Arjan van de Ven
@ 2006-06-23 18:08             ` Miklos Szeredi
  1 sibling, 0 replies; 51+ messages in thread
From: Miklos Szeredi @ 2006-06-23 18:08 UTC (permalink / raw)
  To: torvalds
  Cc: hugh, a.p.zijlstra, linux-mm, linux-kernel, akpm, dhowells,
	christoph, mbligh, npiggin

> Well, I've got two reasons to want to fast-track it:
> 
>  - it's exactly what I wanted to see, so I'm obviously personally happy 
>    with the patch

Heh, IIRC you rejected the idea of doing a fault on dirtying for
performance reasons, during the discussion of VM deadlocks in FUSE.

Anyway, I'm more than happy, since David and Peter basically solved
the problem, so shared writable mappings should now be doable.

Miklos

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] mm: tracking shared dirty pages -v10
  2006-06-23 17:52               ` Christoph Lameter
@ 2006-06-23 18:11                 ` Martin Bligh
  2006-06-23 18:20                   ` Linus Torvalds
  0 siblings, 1 reply; 51+ messages in thread
From: Martin Bligh @ 2006-06-23 18:11 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Peter Zijlstra, Hugh Dickins, Linus Torvalds, linux-mm,
	linux-kernel, Andrew Morton, David Howells, Christoph Lameter,
	Nick Piggin


>>I intend to make swap over NFS work next.
> 
> 
> I am still a bit unclear on what you mean by "work". The only 
> issue may be accounting for the number of swap pages about to be written out 
> in write throttling.

I had assumed this was a sick joke. Please tell me people aren't
really swapping over NFS. That's *insane*.

M.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] mm: tracking shared dirty pages -v10
  2006-06-23 18:11                 ` Martin Bligh
@ 2006-06-23 18:20                   ` Linus Torvalds
  0 siblings, 0 replies; 51+ messages in thread
From: Linus Torvalds @ 2006-06-23 18:20 UTC (permalink / raw)
  To: Martin Bligh
  Cc: Christoph Lameter, Peter Zijlstra, Hugh Dickins, linux-mm,
	linux-kernel, Andrew Morton, David Howells, Christoph Lameter,
	Nick Piggin



On Fri, 23 Jun 2006, Martin Bligh wrote:
> 
> I had assumed this was a sick joke. Please tell me people aren't
> really swapping over NFS. That's *insane*.

Hey, I think it even used to be common. I think some NCR X client did 
basically exactly that, with _no_ local disk at all.

			Linus

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] mm: tracking shared dirty pages -v10
  2006-06-23 18:03                 ` Peter Zijlstra
@ 2006-06-23 18:23                   ` Christoph Lameter
  0 siblings, 0 replies; 51+ messages in thread
From: Christoph Lameter @ 2006-06-23 18:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Hugh Dickins, linux-mm, linux-kernel,
	Andrew Morton, David Howells, Martin Bligh, Nick Piggin, riel

On Fri, 23 Jun 2006, Peter Zijlstra wrote:

> But the general idea is that it's broken because the ACK from writeout
> can be delayed and the remaining free memory taken by other incoming
> network packets.

That is already taken care of by nr_unstable, which is considered for 
write throttling.
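
For reference, a rough sketch of what I mean (paraphrased from memory,
not the exact 2.6.17 code; 'wbs' and 'dirty_thresh' are the names the
throttling path uses for its counters and limit):

	/*
	 * Sketch of the balance_dirty_pages() check: NFS "unstable"
	 * pages (sent to the server but not yet committed there)
	 * count against the dirty threshold just like dirty pages.
	 */
	long nr_reclaimable = wbs.nr_dirty + wbs.nr_unstable;

	if (nr_reclaimable + wbs.nr_writeback <= dirty_thresh)
		break;	/* under the limit, no throttling needed */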

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] mm: tracking shared dirty pages -v10
  2006-06-23 17:56               ` Linus Torvalds
  2006-06-23 18:03                 ` Peter Zijlstra
@ 2006-06-23 18:41                 ` Christoph Hellwig
  1 sibling, 0 replies; 51+ messages in thread
From: Christoph Hellwig @ 2006-06-23 18:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Christoph Lameter, Hugh Dickins, linux-mm,
	linux-kernel, Andrew Morton, David Howells, Christoph Lameter,
	Martin Bligh, Nick Piggin

On Fri, Jun 23, 2006 at 10:56:44AM -0700, Linus Torvalds wrote:
> 
> 
> On Fri, 23 Jun 2006, Peter Zijlstra wrote:
> > 
> > I intend to make swap over NFS work next.
> 
> Doesn't it work already? Is there some throttling that doesn't work?

With the current code it definitely doesn't.  The swap code calls ->bmap
to do block mappings at swapon time and then uses bios directly. This
obviously can't work on anything but blockdevice-based filesystems.
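
Roughly, swapon does (a sketch, not the exact fs/swapfile.c code):

	/*
	 * setup_swap_extents(): resolve a file offset to a disk block
	 * once, via ->bmap; all later swap I/O then submits bios
	 * against those blocks directly, bypassing the filesystem.
	 * NFS has no block mapping to hand out, so this fails there.
	 */
	sector_t first_block = bmap(inode, probe_block);
	if (first_block == 0)		/* no ->bmap, or a hole */
		goto bad_bmap;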

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] mm: tracking shared dirty pages -v10
  2006-06-22 23:39     ` [PATCH] mm: tracking shared dirty pages -v10 Peter Zijlstra
  2006-06-23  3:10       ` Jeff Dike
  2006-06-23  6:08       ` Linus Torvalds
@ 2006-06-23 19:06       ` Hugh Dickins
  2006-06-23 22:00         ` Peter Zijlstra
  2006-06-28 14:58         ` [RFC][PATCH] mm: fixup do_wp_page() Peter Zijlstra
  2 siblings, 2 replies; 51+ messages in thread
From: Hugh Dickins @ 2006-06-23 19:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm, linux-kernel, Andrew Morton, David Howells,
	Christoph Lameter, Martin Bligh, Nick Piggin, Linus Torvalds

On Fri, 23 Jun 2006, Peter Zijlstra wrote:

> Preview of the goodness,
> 
> I'll repost the whole thing tomorrow after I've updated the other
> patches, esp. the msync one. I seem to be too tired to make any sense
> out of that atm.
> 
> (PS, 2.6.17-mm1 UML doesn't seem to boot)
> 
> -------------------------------------------------------
> 
> 
> From: Peter Zijlstra <a.p.zijlstra@chello.nl>
> 
> Tracking of dirty pages in shared writeable mmap()s.
> 
> The idea is simple: write protect clean shared writeable pages,
> catch the write-fault, make writeable and set dirty. On page write-back
> clean all the PTE dirty bits and write protect them once again.
> 
> The implementation is a tad harder, mainly because the default
> backing_dev_info capabilities were too loosely maintained. Hence it is
> not enough to test the backing_dev_info for cap_account_dirty.
> 
> The current heuristic is as follows: a VMA is eligible when:
>  - it's shared writeable 
>     (vm_flags & (VM_WRITE|VM_SHARED)) == (VM_WRITE|VM_SHARED)
>  - it is not a PFN mapping
>     (vm_flags & VM_PFNMAP) == 0
>  - the backing_dev_info is cap_account_dirty
>     mapping_cap_account_dirty(vma->vm_file->f_mapping)
>  - f_op->mmap() didn't change the default page protection
> 
> NOTE: the last rule is only checked in do_mmap_pgoff(), other code-
> paths assume they will not be reached in that case.

(You've probably noticed that use of mprotect on problem vmas
is liable to remove the cache bits.  There have been occasional
complaints about that, and perhaps a patch to fix it.  But that's
long-standing behaviour, so not something you need worry about:
we've only had to worry about a new change in behaviour in mmap.)

> 
> Pages from remap_pfn_range() are explicitly excluded because their
> COW semantics are already horrid enough (see vm_normal_page() in
> do_wp_page()) and because they don't have a backing store anyway.
> 
> copy_one_pte() cleans child PTEs, not wrong in the current context,
> but why?

Dates back to Linux 2.2 if not earlier.  I think the idea is that the
parent has its pte dirty, and that's enough to mark the page as dirty;
the child is unlikely to dirty the page further itself, and a second
instance of dirty pte for that page may entail some overhead.  What
overhead would vary from release to release, I've not thought through
what the current saving would be in 2.6.17, probably very little.
But you can imagine there might have been some release which would
write out the page each time it found pte dirty, in which case good
reason to clear it there.  But I'm guessing, it's from before my time.
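
For reference, the lines in question (paraphrased from 2.6.17's
copy_one_pte(), not the exact code):

	/*
	 * If it's a COW mapping, write protect it in both the
	 * parent and the child ...
	 */
	if (is_cow_mapping(vm_flags)) {
		ptep_set_wrprotect(src_mm, addr, src_pte);
		pte = *src_pte;
	}

	/*
	 * ... and if it's a shared mapping, hand the child a clean
	 * pte: the parent's dirty pte is already enough to get the
	 * page written out eventually.
	 */
	if (vm_flags & VM_SHARED)
		pte = pte_mkclean(pte);
	pte = pte_mkold(pte);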

> 
> mprotect() is taught about the new behaviour as well.
> 
> Cleaning the pages on write-back is done with page_mkclean(), a new
> rmap call. It can be called on any page, but is currently only
> implemented for mapped pages (does it make sense to also implement
> anon pages?); if the page is found to be of a trackable VMA it
> will also wrprotect the PTE.
> 
> Finally, in fs/buffer.c:try_to_free_buffers(), remove clear_page_dirty()
> from under ->private_lock. This seems to be safe, since ->private_lock 
> is used to serialize access to the buffers, not the page itself.
> This is needed because clear_page_dirty() will call into page_mkclean()
> and would thereby violate locking order.
> 
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
> Changes in -v10
> 
>  - 2.6.17-mm1
>  - Drop the ugly duckling pgprotting, Hugh suggested resetting the
>    vma->vm_page_prot when f_op->mmap() didn't modify it. If it were
>    modified we're not interested anyway.
>  - abandon is_shared_writable() because it's actually spelled writeable
>    and it didn't actually mean that any more.
>  - Comments all round.
> 
> Index: 2.6-mm/mm/memory.c
> ===================================================================
> --- 2.6-mm.orig/mm/memory.c	2006-06-22 17:59:07.000000000 +0200
> +++ 2.6-mm/mm/memory.c	2006-06-23 01:08:11.000000000 +0200
> @@ -1458,14 +1458,19 @@ static int do_wp_page(struct mm_struct *
>  {
>  	struct page *old_page, *new_page;
>  	pte_t entry;
> -	int reuse, ret = VM_FAULT_MINOR;
> +	int reuse = 0, ret = VM_FAULT_MINOR;
> +	struct page *dirty_page = NULL;
>  
>  	old_page = vm_normal_page(vma, address, orig_pte);
>  	if (!old_page)
>  		goto gotten;
>  
> -	if (unlikely((vma->vm_flags & (VM_SHARED|VM_WRITE)) ==
> -				(VM_SHARED|VM_WRITE))) {
> +	/*
> +	 * Only catch write-faults on shared writable pages, read-only
> +	 * shared pages can get COWed by get_user_pages(.write=1, .force=1).
> +	 */
> +	if (unlikely(((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
> +					(VM_WRITE|VM_SHARED)))) {

I was about to say I'm satisfied with that test now, and it's all the
better for not mentioning VM_PFNMAP there - though odd that you've
exchanged WRITE and SHARED, and added extra unnecessary parentheses ;)

But grrr, sigh, damn, argh - I now realize it's right to the first
order (normal case) and to the second order (ptrace poke), but not
to the third order (ptrace poke anon page here to be COWed -
perhaps can't occur without intervening mprotects).

That's not an issue for you at all (there are other places which
are inconsistent on whether such pages are private or shared e.g.
copy_one_pte does not wrprotect them), but could be a problem for
David's page_mkwrite - there's a danger of passing it an anonymous
page, which (depending on what the ->page_mkwrite actually does)
could go seriously wrong.

I guess it ought to be restructured
	if (PageAnon(old_page)) {
		...
	} else if (shared writable vma) {
		...
	}
and a patch to do that should precede your dirty page patches
(and the only change your dirty page patches need add here on top
of that would be the dirty_page = old_page, get_page(dirty_page)).

Oh, it looks like Linus is keen to go ahead with your patches in
2.6.18, so in that case it'll be easier to go ahead with the patch
as you have it, and fix up this order-3 issue on top afterwards -
it's not something testers are going to be hitting every day,
especially without any ->page_mkwrite implementations.

> Index: 2.6-mm/mm/mmap.c
> ===================================================================
> --- 2.6-mm.orig/mm/mmap.c	2006-06-22 17:59:07.000000000 +0200
> +++ 2.6-mm/mm/mmap.c	2006-06-23 01:31:44.000000000 +0200
> @@ -1113,6 +1109,34 @@ munmap_back:
>  	pgoff = vma->vm_pgoff;
>  	vm_flags = vma->vm_flags;
>  
> +	/*
> +	 * Tracking of dirty pages for shared writable mappings. Do this by
> +	 * write protecting writable pages, and mark dirty in the write fault.
> +	 *
> +	 * Cannot do before because the condition depends on:
> +	 *  - backing_dev_info having the right capabilities
> +	 *  - vma->vm_flags being fully set
> +	 *    (finished in f_op->mmap(), which could call remap_pfn_range())
> +	 *
> +	 * If f_op->mmap() changed vma->vm_page_prot its a funny mapping
> +	 * and we won't touch it.
> +	 * NOTE: in a perfect world backing_dev_info would have the proper
> +	 * capabilities.
> +	 *
> +	 * OR
> +	 *
> +	 * Don't make the VMA automatically writable if it's shared, but the
> +	 * backer wishes to know when pages are first written to.
> +	 */
> +	if ((pgprot_val(vma->vm_page_prot) == pgprot_val(vm_page_prot) &&
> +	     ((vm_flags & (VM_WRITE|VM_SHARED|VM_PFNMAP|VM_INSERTPAGE)) ==
> +			  (VM_WRITE|VM_SHARED)) &&
> +	     vma->vm_file && vma->vm_file->f_mapping &&
> +	     mapping_cap_account_dirty(vma->vm_file->f_mapping)) ||
> +	    (vma->vm_ops && vma->vm_ops->page_mkwrite))
> +		vma->vm_page_prot =
> +			protection_map[vm_flags & (VM_READ|VM_WRITE|VM_EXEC)];
> +

I'm dazzled by the beauty of it!  Given the change you've made to
mapping_cap_account_dirty, you don't have to check vma->vm_file->f_mapping
there; but other than that, seems right to me.  May offend other tastes.

>  	if (!file || !vma_merge(mm, prev, addr, vma->vm_end,
>  			vma->vm_flags, NULL, file, pgoff, vma_policy(vma))) {
>  		file = vma->vm_file;
> Index: 2.6-mm/mm/rmap.c
> ===================================================================
> --- 2.6-mm.orig/mm/rmap.c	2006-06-22 17:59:07.000000000 +0200
> +++ 2.6-mm/mm/rmap.c	2006-06-23 01:14:39.000000000 +0200
> +static int page_mkclean_file(struct address_space *mapping, struct page *page)
> +{
> +	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
> +	struct vm_area_struct *vma;
> +	struct prio_tree_iter iter;
> +	int ret = 0;
> +
> +	BUG_ON(PageAnon(page));
> +
> +	spin_lock(&mapping->i_mmap_lock);
> +	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
> +		int protect = mapping_cap_account_dirty(mapping) &&
> +	((vma->vm_flags & (VM_WRITE|VM_SHARED|VM_PFNMAP|VM_INSERTPAGE)) ==
> +	 		  (VM_WRITE|VM_SHARED));

I think you can forget about testing VM_PFNMAP|VM_INSERTPAGE here,
or else	BUG_ON or WARN_ON(vma->vm_flags & (VM_PFNMAP|VM_INSERTPAGE)):
those just shouldn't arrive here.

But I'm still puzzled, why did you move page_mkclean out from under
the mapping_cap_account_dirty tests in page-writeback.c: no explanation
in your change history.  Doesn't that just add some unnecessary work,
and this "protect" business?

If you're thinking ahead to other changes you want to make, you'd
do better to add these changes then, and for now just work on the
VM_SHARED vmas.

> +		ret += page_mkclean_one(page, vma, protect);
> +	}
> +	spin_unlock(&mapping->i_mmap_lock);
> +	return ret;
> +}

Hugh

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] mm: tracking shared dirty pages -v10
  2006-06-23 19:06       ` Hugh Dickins
@ 2006-06-23 22:00         ` Peter Zijlstra
  2006-06-23 22:35           ` Linus Torvalds
  2006-06-28 14:58         ` [RFC][PATCH] mm: fixup do_wp_page() Peter Zijlstra
  1 sibling, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2006-06-23 22:00 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: linux-mm, linux-kernel, Andrew Morton, David Howells,
	Christoph Lameter, Martin Bligh, Nick Piggin, Linus Torvalds

On Fri, 2006-06-23 at 20:06 +0100, Hugh Dickins wrote:

> > -     if (unlikely((vma->vm_flags & (VM_SHARED|VM_WRITE)) ==
> > -                             (VM_SHARED|VM_WRITE))) {
> > +     /*
> > +      * Only catch write-faults on shared writable pages, read-only
> > +      * shared pages can get COWed by get_user_pages(.write=1, .force=1).
> > +      */
> > +     if (unlikely(((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
> > +                                     (VM_WRITE|VM_SHARED)))) {
> 
> I was about to say I'm satisfied with that test now, and it's all the
> better for not mentioning VM_PFNMAP there - though odd that you've
> exchanged WRITE and SHARED, and added extra unnecessary parentheses ;)

Oops, on the extra parentheses. I exchanged the order because that is
more consistent with all the other sites.

> But grrr, sigh, damn, argh - I now realize it's right to the first
> order (normal case) and to the second order (ptrace poke), but not
> to the third order (ptrace poke anon page here to be COWed -
> perhaps can't occur without intervening mprotects).
> 
> That's not an issue for you at all (there are other places which
> are inconsistent on whether such pages are private or shared e.g.
> copy_one_pte does not wrprotect them), but could be a problem for
> David's page_mkwrite - there's a danger of passing it an anonymous
> page, which (depending on what the ->page_mkwrite actually does)
> could go seriously wrong.
> 
> I guess it ought to be restructured
>         if (PageAnon(old_page)) {
>                 ...
>         } else if (shared writable vma) {
>                 ...
>         }
> and a patch to do that should precede your dirty page patches
> (and the only change your dirty page patches need add here on top
> of that would be the dirty_page = old_page, get_page(dirty_page)).

Hmm, I'll investigate this and maybe write a follow-up patch if you or
someone else doesn't beaten me to it.

> (You've probably noticed that use of mprotect on problem vmas
> is liable to remove the cache bits.  There have been occasional
> complaints about that, and perhaps a patch to fix it.  But that's
> long-standing behaviour, so not something you need worry about:
> we've only had to worry about a new change in behaviour in mmap.)

Yeah, had noticed, but had assumed that since it was a user action
they'd know better than to call it on dubious mappings.

However, if you say there is demand: I've been through all the archs to
verify the use of that most ugly drm construct, and whipping up an arch
function like pgprot_modify() that folds a pgprot_t and a
protection_map[] argument together seems quite doable.
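
Something along these lines, say (hypothetical and untested; the
helper name and _PAGE_CHG_MASK, the set of bits an arch wants
preserved, are placeholders):

	/*
	 * Keep the arch/cache bits of the old protection, take the
	 * access bits from the new one.
	 */
	static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
	{
		unsigned long preserve = pgprot_val(oldprot) & _PAGE_CHG_MASK;
		unsigned long addbits  = pgprot_val(newprot) & ~_PAGE_CHG_MASK;

		return __pgprot(preserve | addbits);
	}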

> > Pages from remap_pfn_range() are explicitly excluded because their
> > COW semantics are already horrid enough (see vm_normal_page() in
> > do_wp_page()) and because they don't have a backing store anyway.
> > 
> > copy_one_pte() cleans child PTEs, not wrong in the current context,
> > but why?
> 
> Dates back to Linux 2.2 if not earlier.  I think the idea is that the
> parent has its pte dirty, and that's enough to mark the page as dirty;
> the child is unlikely to dirty the page further itself, and a second
> instance of dirty pte for that page may entail some overhead.  What
> overhead would vary from release to release, I've not thought through
> what the current saving would be in 2.6.17, probably very little.
> But you can imagine there might have been some release which would
> write out the page each time it found pte dirty, in which case good
> reason to clear it there.  But I'm guessing, it's from before my time.

I'll keep that in mind for a future cleanup.

> > +	if ((pgprot_val(vma->vm_page_prot) == pgprot_val(vm_page_prot) &&
> > +	     ((vm_flags & (VM_WRITE|VM_SHARED|VM_PFNMAP|VM_INSERTPAGE)) ==
> > +			  (VM_WRITE|VM_SHARED)) &&
> > +	     vma->vm_file && vma->vm_file->f_mapping &&
> > +	     mapping_cap_account_dirty(vma->vm_file->f_mapping)) ||
> > +	    (vma->vm_ops && vma->vm_ops->page_mkwrite))
> > +		vma->vm_page_prot =
> > +			protection_map[vm_flags & (VM_READ|VM_WRITE|VM_EXEC)];
> > +
> 
> I'm dazzled by the beauty of it!

It's a real beauty isn't it :-)

> Given the change you've made to
> mapping_cap_account_dirty, you don't have to check vma->vm_file->f_mapping
> there; but other than that, seems right to me.  May offend other tastes.

I meant to remove the mapping_cap_account_dirty() changes, so as not to
burden all call sites with the extra conditional. This one extra check
in here does not make the difference between beauty and beast.

> > Index: 2.6-mm/mm/rmap.c
> > ===================================================================
> > --- 2.6-mm.orig/mm/rmap.c	2006-06-22 17:59:07.000000000 +0200
> > +++ 2.6-mm/mm/rmap.c	2006-06-23 01:14:39.000000000 +0200
> > +static int page_mkclean_file(struct address_space *mapping, struct page *page)
> > +{
> > +	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
> > +	struct vm_area_struct *vma;
> > +	struct prio_tree_iter iter;
> > +	int ret = 0;
> > +
> > +	BUG_ON(PageAnon(page));
> > +
> > +	spin_lock(&mapping->i_mmap_lock);
> > +	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
> > +		int protect = mapping_cap_account_dirty(mapping) &&
> > +	((vma->vm_flags & (VM_WRITE|VM_SHARED|VM_PFNMAP|VM_INSERTPAGE)) ==
> > +	 		  (VM_WRITE|VM_SHARED));
> 
> I think you can forget about testing VM_PFNMAP|VM_INSERTPAGE here,
> or else	BUG_ON or WARN_ON(vma->vm_flags & (VM_PFNMAP|VM_INSERTPAGE)):
> those just shouldn't arrive here.

Done

> But I'm still puzzled, why did you move page_mkclean out from under
> the mapping_cap_account_dirty tests in page-writeback.c: no explanation
> in your change history.  Doesn't that just add some unnecessary work,
> and this "protect" business?

You're right, added back under the wing of mapping_cap_account_dirty().

> If you're thinking ahead to other changes you want to make, you'd
> do better to add these changes then, and for now just work on the
> VM_SHARED vmas.

But I'll leave page_mkclean() as callable without; that way it's more
true to its name.

Peter


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] mm: tracking shared dirty pages -v10
  2006-06-23 22:00         ` Peter Zijlstra
@ 2006-06-23 22:35           ` Linus Torvalds
  2006-06-23 22:44             ` Peter Zijlstra
  0 siblings, 1 reply; 51+ messages in thread
From: Linus Torvalds @ 2006-06-23 22:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Hugh Dickins, linux-mm, linux-kernel, Andrew Morton,
	David Howells, Christoph Lameter, Martin Bligh, Nick Piggin



On Sat, 24 Jun 2006, Peter Zijlstra wrote:

> > > +	if ((pgprot_val(vma->vm_page_prot) == pgprot_val(vm_page_prot) &&
> > > +	     ((vm_flags & (VM_WRITE|VM_SHARED|VM_PFNMAP|VM_INSERTPAGE)) ==
> > > +			  (VM_WRITE|VM_SHARED)) &&
> > > +	     vma->vm_file && vma->vm_file->f_mapping &&
> > > +	     mapping_cap_account_dirty(vma->vm_file->f_mapping)) ||
> > > +	    (vma->vm_ops && vma->vm_ops->page_mkwrite))
> > > +		vma->vm_page_prot =
> > > +			protection_map[vm_flags & (VM_READ|VM_WRITE|VM_EXEC)];
> > > +
> > 
> > I'm dazzled by the beauty of it!
> 
> It's a real beauty isn't it :-)

Since Hugh pointed that out..

It really would be nice to just encapsulate that as an inline function of 
its own, and move the comment at the top of it to be at the top of the 
inline function.

Just make it something like

	/*
	 * Some shared mappings will want the pages marked read-only
	 * to track write events. If so, we'll downgrade vm_page_prot
	 * to the private version (using protection_map[] without the
	 * VM_SHARED bit).
	 */
	static inline int vma_wants_writenotify(struct vm_area_struct *vma)
	{
		unsigned int vm_flags = vma->vm_flags;

		/* If it was private or non-writable, the write bit is already clear */
		if ((vm_flags & (VM_SHARED | VM_WRITE)) != (VM_SHARED | VM_WRITE))
			return 0;

		/* The open routine did something to the protections already? */
		if (pgprot_val(vma->vm_page_prot) !=
		   pgprot_val(protection_map[vm_flags & (VM_SHARED|VM_READ|VM_WRITE|VM_EXEC)]))
			return 0;

		/* The backer wishes to know when pages are first written to? */
		if (vma->vm_ops && vma->vm_ops->page_mkwrite)
			return 1;

		/* Specialty mapping? */
		if (vm_flags & (VM_PFNMAP|VM_INSERTPAGE))
			return 0;

		/* Can the mapping track the dirty pages? */
		return vma->vm_file && vma->vm_file->f_mapping &&
			mapping_cap_account_dirty(vma->vm_file->f_mapping);
	}

(And no, I didn't make sure to test that it gives the same answer as your 
version, it's just a more readable version of what I think your version 
tests ;)

And then just use it with

	if (vma_wants_writenotify(vma))
		vma->vm_page_prot = protection_map[vm_flags & (VM_READ|VM_WRITE|VM_EXEC)];

which would appear to be more readable.

Yeah, the compiler may do worse. Or it may not. It usually pays to try to 
write code more readably, sometimes the compiler ends up understanding it 
better too ;)

		Linus

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] mm: tracking shared dirty pages -v10
  2006-06-23 22:35           ` Linus Torvalds
@ 2006-06-23 22:44             ` Peter Zijlstra
  0 siblings, 0 replies; 51+ messages in thread
From: Peter Zijlstra @ 2006-06-23 22:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, linux-mm, linux-kernel, Andrew Morton,
	David Howells, Christoph Lameter, Martin Bligh, Nick Piggin

On Fri, 2006-06-23 at 15:35 -0700, Linus Torvalds wrote:
> On Sat, 24 Jun 2006, Peter Zijlstra wrote:
> 
> > > > +	if ((pgprot_val(vma->vm_page_prot) == pgprot_val(vm_page_prot) &&
> > > > +	     ((vm_flags & (VM_WRITE|VM_SHARED|VM_PFNMAP|VM_INSERTPAGE)) ==
> > > > +			  (VM_WRITE|VM_SHARED)) &&
> > > > +	     vma->vm_file && vma->vm_file->f_mapping &&
> > > > +	     mapping_cap_account_dirty(vma->vm_file->f_mapping)) ||
> > > > +	    (vma->vm_ops && vma->vm_ops->page_mkwrite))
> > > > +		vma->vm_page_prot =
> > > > +			protection_map[vm_flags & (VM_READ|VM_WRITE|VM_EXEC)];
> > > > +
> > > 
> > > I'm dazzled by the beauty of it!
> > 
> > It's a real beauty isn't it :-)
> 
> Since Hugh pointed that out..
> 
> It really would be nice to just encapsulate that as an inline function of 
> its own, and move the comment at the top of it to be at the top of the 
> inline function.

Dang, I just sent out -v11, ah well, such is life. I'll see what I can do.
Preferably mprotect can reuse that same function.

Will repost a new 1/5 shortly.

Peter


^ permalink raw reply	[flat|nested] 51+ messages in thread

* [RFC][PATCH] mm: fixup do_wp_page()
  2006-06-23 19:06       ` Hugh Dickins
  2006-06-23 22:00         ` Peter Zijlstra
@ 2006-06-28 14:58         ` Peter Zijlstra
  2006-06-28 18:20           ` Hugh Dickins
  1 sibling, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2006-06-28 14:58 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: linux-mm, linux-kernel, Andrew Morton, David Howells,
	Christoph Lameter, Martin Bligh, Nick Piggin, Linus Torvalds

On Fri, 2006-06-23 at 20:06 +0100, Hugh Dickins wrote:

> But grrr, sigh, damn, argh - I now realize it's right to the first
> order (normal case) and to the second order (ptrace poke), but not
> to the third order (ptrace poke anon page here to be COWed -
> perhaps can't occur without intervening mprotects).
> 
> That's not an issue for you at all (there are other places which
> are inconsistent on whether such pages are private or shared e.g.
> copy_one_pte does not wrprotect them), but could be a problem for
> David's page_mkwrite - there's a danger of passing it an anonymous
> page, which (depending on what the ->page_mkwrite actually does)
> could go seriously wrong.
> 
> I guess it ought to be restructured
> 	if (PageAnon(old_page)) {
> 		...
> 	} else if (shared writable vma) {
> 		...
> 	}
> and a patch to do that should precede your dirty page patches
> (and the only change your dirty page patches need add here on top
> of that would be the dirty_page = old_page, get_page(dirty_page)).
> 
> Oh, it looks like Linus is keen to go ahead with your patches in
> 2.6.18, so in that case it'll be easier to go ahead with the patch
> as you have it, and fix up this order-3 issue on top afterwards -
> it's not something testers are going to be hitting every day,
> especially without any ->page_mkwrite implementations.

How about something like this? This should make all anonymous write
faults behave as they did before the page_mkwrite patch.

As for copy_one_pte(), I'm not sure what you meant: shared writable
anonymous pages need not be write protected as far as I can see.
If you meant anonymous or file-backed, then copy_one_pte() still
does the right thing. If the source is untouched/clean it will still be
wrprotected and this state will be copied; if it's dirtied and made
writeable we don't need the notification anymore anyway, and hence the
regular copy is still correct.

Peter

---
 mm/memory.c |   19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

Index: linux-2.6-dirty/mm/memory.c
===================================================================
--- linux-2.6-dirty.orig/mm/memory.c	2006-06-28 13:16:15.000000000 +0200
+++ linux-2.6-dirty/mm/memory.c	2006-06-28 16:18:51.000000000 +0200
@@ -1466,11 +1466,21 @@ static int do_wp_page(struct mm_struct *
 		goto gotten;
 
 	/*
-	 * Only catch write-faults on shared writable pages, read-only
-	 * shared pages can get COWed by get_user_pages(.write=1, .force=1).
+	 * Take out anonymous pages first, anonymous shared vmas are
+	 * not accountable.
 	 */
-	if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
+	if (PageAnon(old_page)) {
+		if (!TestSetPageLocked(old_page)) {
+			reuse = can_share_swap_page(old_page);
+			unlock(old_page);
+		}
+	} else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
 					(VM_WRITE|VM_SHARED))) {
+		/*
+		 * Only catch write-faults on shared writable pages,
+		 * read-only shared pages can get COWed by
+		 * get_user_pages(.write=1, .force=1).
+		 */
 		if (vma->vm_ops && vma->vm_ops->page_mkwrite) {
 			/*
 			 * Notify the address space that the page is about to
@@ -1502,9 +1512,6 @@ static int do_wp_page(struct mm_struct *
 		dirty_page = old_page;
 		get_page(dirty_page);
 		reuse = 1;
-	} else if (PageAnon(old_page) && !TestSetPageLocked(old_page)) {
-		reuse = can_share_swap_page(old_page);
-		unlock_page(old_page);
 	}
 
 	if (reuse) {



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC][PATCH] mm: fixup do_wp_page()
  2006-06-28 14:58         ` [RFC][PATCH] mm: fixup do_wp_page() Peter Zijlstra
@ 2006-06-28 18:20           ` Hugh Dickins
  0 siblings, 0 replies; 51+ messages in thread
From: Hugh Dickins @ 2006-06-28 18:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm, linux-kernel, Andrew Morton, David Howells,
	Christoph Lameter, Martin Bligh, Nick Piggin, Linus Torvalds

On Wed, 28 Jun 2006, Peter Zijlstra wrote:
> 
> How about something like this? This should make all anonymous write
> faults do as before the page_mkwrite patch.

Yes, I believe your patch below is just how it should be.

> As for copy_one_pte(), I'm not sure what you meant, shared writable
> anonymous pages need not be write protected as far as I can see.

Anonymous pages in a shared writable vma, got there via ptrace poke.
They're in a curious limbo between private and shared.  You can
reasonably argue that the page was supposed to be shared in the first
place, so although it's now become private, it's reasonable for it to
remain shared at least between parent and child.  I don't disagree.

But if it's then swapped out under memory pressure, and brought back
in, it will be treated as an ordinary anonymous page, write-protected,
and once parent or child makes a modification, will cease to be shared
between parent and child.  Not a big deal to lose sleep over, but
such pages do behave inconsistently.
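
You can see why from do_swap_page() (sketched from memory, not the
exact code): the page comes back mapped with the vma's default
protection, so the next write fault goes through do_wp_page() and,
the page being PageAnon, gets COWed unless it's exclusively owned.

	pte = mk_pte(page, vma->vm_page_prot);
	if (write_access && can_share_swap_page(page)) {
		/* sole owner: safe to map writable right away */
		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
		write_access = 0;
	}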

Hugh

> --- linux-2.6-dirty.orig/mm/memory.c	2006-06-28 13:16:15.000000000 +0200
> +++ linux-2.6-dirty/mm/memory.c	2006-06-28 16:18:51.000000000 +0200
> @@ -1466,11 +1466,21 @@ static int do_wp_page(struct mm_struct *
>  		goto gotten;
>  
>  	/*
> -	 * Only catch write-faults on shared writable pages, read-only
> -	 * shared pages can get COWed by get_user_pages(.write=1, .force=1).
> +	 * Take out anonymous pages first, anonymous shared vmas are
> +	 * not accountable.
>  	 */
> -	if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
> +	if (PageAnon(old_page)) {
> +		if (!TestSetPageLocked(old_page)) {
> +			reuse = can_share_swap_page(old_page);
> +			unlock_page(old_page);
> +		}
> +	} else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
>  					(VM_WRITE|VM_SHARED))) {
> +		/*
> +		 * Only catch write-faults on shared writable pages,
> +		 * read-only shared pages can get COWed by
> +		 * get_user_pages(.write=1, .force=1).
> +		 */
>  		if (vma->vm_ops && vma->vm_ops->page_mkwrite) {
>  			/*
>  			 * Notify the address space that the page is about to
> @@ -1502,9 +1512,6 @@ static int do_wp_page(struct mm_struct *
>  		dirty_page = old_page;
>  		get_page(dirty_page);
>  		reuse = 1;
> -	} else if (PageAnon(old_page) && !TestSetPageLocked(old_page)) {
> -		reuse = can_share_swap_page(old_page);
> -		unlock_page(old_page);
>  	}
>  
>  	if (reuse) {

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH 1/6] mm: tracking shared dirty pages
  2006-06-28 20:17 [PATCH 0/6] mm: tracking dirty pages -v14 Peter Zijlstra
@ 2006-06-28 20:17 ` Peter Zijlstra
  0 siblings, 0 replies; 51+ messages in thread
From: Peter Zijlstra @ 2006-06-28 20:17 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Andrew Morton, David Howells, Peter Zijlstra,
	Christoph Lameter, Martin Bligh, Nick Piggin, Linus Torvalds


From: Peter Zijlstra <a.p.zijlstra@chello.nl>

Tracking of dirty pages in shared writeable mmap()s.

The idea is simple: write protect clean shared writeable pages,
catch the write-fault, make writeable and set dirty. On page write-back
clean all the PTE dirty bits and write protect them once again.

The implementation is a tad harder, mainly because the default
backing_dev_info capabilities were too loosely maintained. Hence it is
not enough to test the backing_dev_info for cap_account_dirty.

The current heuristic is as follows: a VMA is eligible when:
 - it's shared writeable 
    (vm_flags & (VM_WRITE|VM_SHARED)) == (VM_WRITE|VM_SHARED)
 - it is not a 'special' mapping
    (vm_flags & (VM_PFNMAP|VM_INSERTPAGE)) == 0
 - the backing_dev_info is cap_account_dirty
    mapping_cap_account_dirty(vma->vm_file->f_mapping)
 - f_op->mmap() didn't change the default page protection

Pages from remap_pfn_range() are explicitly excluded because their
COW semantics are already horrid enough (see vm_normal_page() in
do_wp_page()) and because they don't have a backing store anyway.

mprotect() is taught about the new behaviour as well. However it
fudges the last condition.

Cleaning the pages on write-back is done with page_mkclean() a new
rmap call. It cleans and wrprotects all PTEs of dirty accountable
pages.

Finally, in fs/buffer.c:try_to_free_buffers(), remove clear_page_dirty()
from under ->private_lock. This seems to be safe, since ->private_lock 
is used to serialize access to the buffers, not the page itself.
This is needed because clear_page_dirty() will call into page_mkclean()
and would thereby violate locking order.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
Changes in -v14

 - removed the flags to vma_wants_notification() again

Changes in -v13

 - against Linus' tree again
 - added flags to vma_wants_notification()

Changes in -v12

 - Linus suggested a nice replacement for the beast condition

Changes in -v11

 - small cleanups
 - pulled page_mkclean back under mapping_cap_account_dirty()

Changes in -v10

 - 2.6.17-mm1
 - Drop the ugly duckling pgprotting, Hugh suggested resetting the
   vma->vm_page_prot when f_op->mmap() didn't modify it. If it were
   modified we're not interested anyway.
 - abandon is_shared_writable() because it's actually spelled writeable
   and it didn't actually mean that any more.
 - Comments all round.

Changes in -v9

 - respin against latest -mm.

Changes in -v8

 - access_process_vm() and other force users of get_user_pages() can
   induce COW of read-only shared mappings.

Changes in -v7

 - changed is_shared_writable() to exclude VM_PFNMAP'ed regions.
 - Hugh's tiresome problem wasn't fully solved, now using the ugly duckling
   method.

Changes in -v6

 - make page_mkclean_one() modify the pte more like change_pte_range() 
   (suggested by Christoph Lameter)
 - made is_shared_writable() take vm_flags, it now resembles is_cow_mapping().
 - fixed the mprotect() bug (spotted by Hugh Dickins)
 - hopefully fixed the tiresome issue of do_mmap_pgoff() trampling on
   driver specific vm_page_prot settings (spotted by Hugh Dickins)
 - made a new version of the page_mkwrite() patch to go on top of all this.
   This so that Linus could merge this very early on in 2.6.18.

Changes in -v5

 - rename page_wrprotect() to page_mkclean() (suggested by Nick Piggin)
 - added comment to test_clear_page_dirty() (Andrew Morton)
 - cleanup page_wrprotect() (Andrew Morton)
 - renamed VM_SharedWritable() to is_shared_writable()
 - fs/buffer.c try_to_free_buffers(): remove clear_page_dirty() from under
   ->private_lock. This seems to be safe, since ->private_lock is used to
   serialize access to the buffers, not the page itself.
 - rebased on top of David Howells' page_mkwrite() patch.

Changes in -v4:

 - small cleanup as suggested by Christoph Lameter.

Changes in -v3:

 - move set_page_dirty() outside pte lock (suggested by Christoph Lameter)

Changes in -v2:

 - only wrprotect pages from dirty capable mappings. (Nick Piggin)
 - move the writefault handling from do_wp_page() into handle_pte_fault(). 
   (Nick Piggin)
 - revert to the old install_page interface. (Nick Piggin)
 - also clear the pte dirty bit when we make pages read-only again.
   (spotted by Rik van Riel)
 - make page_wrprotect() return the number of reprotected ptes.

 fs/buffer.c          |    2 -
 include/linux/mm.h   |   34 ++++++++++++++++++++++++++
 include/linux/rmap.h |    8 ++++++
 mm/memory.c          |   29 ++++++++++++++++++----
 mm/mmap.c            |   10 +++----
 mm/mprotect.c        |   21 ++++++----------
 mm/page-writeback.c  |   15 +++++++++--
 mm/rmap.c            |   65 +++++++++++++++++++++++++++++++++++++++++++++++++++
 8 files changed, 155 insertions(+), 29 deletions(-)

Index: linux-2.6-dirty/mm/memory.c
===================================================================
--- linux-2.6-dirty.orig/mm/memory.c	2006-06-28 21:47:19.000000000 +0200
+++ linux-2.6-dirty/mm/memory.c	2006-06-28 21:47:21.000000000 +0200
@@ -1457,14 +1457,19 @@ static int do_wp_page(struct mm_struct *
 {
 	struct page *old_page, *new_page;
 	pte_t entry;
-	int reuse, ret = VM_FAULT_MINOR;
+	int reuse = 0, ret = VM_FAULT_MINOR;
+	struct page *dirty_page = NULL;
 
 	old_page = vm_normal_page(vma, address, orig_pte);
 	if (!old_page)
 		goto gotten;
 
-	if (unlikely((vma->vm_flags & (VM_SHARED|VM_WRITE)) ==
-				(VM_SHARED|VM_WRITE))) {
+	/*
+	 * Only catch write-faults on shared writable pages, read-only
+	 * shared pages can get COWed by get_user_pages(.write=1, .force=1).
+	 */
+	if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
+					(VM_WRITE|VM_SHARED))) {
 		if (vma->vm_ops && vma->vm_ops->page_mkwrite) {
 			/*
 			 * Notify the address space that the page is about to
@@ -1493,13 +1498,12 @@ static int do_wp_page(struct mm_struct *
 			if (!pte_same(*page_table, orig_pte))
 				goto unlock;
 		}
-
+		dirty_page = old_page;
+		get_page(dirty_page);
 		reuse = 1;
 	} else if (PageAnon(old_page) && !TestSetPageLocked(old_page)) {
 		reuse = can_share_swap_page(old_page);
 		unlock_page(old_page);
-	} else {
-		reuse = 0;
 	}
 
 	if (reuse) {
@@ -1565,6 +1569,10 @@ gotten:
 		page_cache_release(old_page);
 unlock:
 	pte_unmap_unlock(page_table, ptl);
+	if (dirty_page) {
+		set_page_dirty(dirty_page);
+		put_page(dirty_page);
+	}
 	return ret;
 oom:
 	if (old_page)
@@ -2094,6 +2102,7 @@ static int do_no_page(struct mm_struct *
 	unsigned int sequence = 0;
 	int ret = VM_FAULT_MINOR;
 	int anon = 0;
+	struct page *dirty_page = NULL;
 
 	pte_unmap(page_table);
 	BUG_ON(vma->vm_flags & VM_PFNMAP);
@@ -2188,6 +2197,10 @@ retry:
 		} else {
 			inc_mm_counter(mm, file_rss);
 			page_add_file_rmap(new_page);
+			if (write_access) {
+				dirty_page = new_page;
+				get_page(dirty_page);
+			}
 		}
 	} else {
 		/* One of our sibling threads was faster, back out. */
@@ -2200,6 +2213,10 @@ retry:
 	lazy_mmu_prot_update(entry);
 unlock:
 	pte_unmap_unlock(page_table, ptl);
+	if (dirty_page) {
+		set_page_dirty(dirty_page);
+		put_page(dirty_page);
+	}
 	return ret;
 oom:
 	page_cache_release(new_page);
Index: linux-2.6-dirty/mm/mmap.c
===================================================================
--- linux-2.6-dirty.orig/mm/mmap.c	2006-06-28 21:47:19.000000000 +0200
+++ linux-2.6-dirty/mm/mmap.c	2006-06-28 21:47:21.000000000 +0200
@@ -1090,12 +1090,6 @@ munmap_back:
 			goto free_vma;
 	}
 
-	/* Don't make the VMA automatically writable if it's shared, but the
-	 * backer wishes to know when pages are first written to */
-	if (vma->vm_ops && vma->vm_ops->page_mkwrite)
-		vma->vm_page_prot =
-			protection_map[vm_flags & (VM_READ|VM_WRITE|VM_EXEC)];
-
 	/* We set VM_ACCOUNT in a shared mapping's vm_flags, to inform
 	 * shmem_zero_setup (perhaps called through /dev/zero's ->mmap)
 	 * that memory reservation must be checked; but that reservation
@@ -1113,6 +1107,10 @@ munmap_back:
 	pgoff = vma->vm_pgoff;
 	vm_flags = vma->vm_flags;
 
+	if (vma_wants_writenotify(vma))
+		vma->vm_page_prot =
+			protection_map[vm_flags & (VM_READ|VM_WRITE|VM_EXEC)];
+
 	if (!file || !vma_merge(mm, prev, addr, vma->vm_end,
 			vma->vm_flags, NULL, file, pgoff, vma_policy(vma))) {
 		file = vma->vm_file;
Index: linux-2.6-dirty/mm/mprotect.c
===================================================================
--- linux-2.6-dirty.orig/mm/mprotect.c	2006-06-28 21:47:19.000000000 +0200
+++ linux-2.6-dirty/mm/mprotect.c	2006-06-28 21:47:21.000000000 +0200
@@ -123,8 +123,6 @@ mprotect_fixup(struct vm_area_struct *vm
 	unsigned long oldflags = vma->vm_flags;
 	long nrpages = (end - start) >> PAGE_SHIFT;
 	unsigned long charged = 0;
-	unsigned int mask;
-	pgprot_t newprot;
 	pgoff_t pgoff;
 	int error;
 
@@ -176,24 +174,21 @@ mprotect_fixup(struct vm_area_struct *vm
 	}
 
 success:
-	/* Don't make the VMA automatically writable if it's shared, but the
-	 * backer wishes to know when pages are first written to */
-	mask = VM_READ|VM_WRITE|VM_EXEC|VM_SHARED;
-	if (vma->vm_ops && vma->vm_ops->page_mkwrite)
-		mask &= ~VM_SHARED;
-
-	newprot = protection_map[newflags & mask];
-
 	/*
 	 * vm_flags and vm_page_prot are protected by the mmap_sem
 	 * held in write mode.
 	 */
 	vma->vm_flags = newflags;
-	vma->vm_page_prot = newprot;
+	vma->vm_page_prot = protection_map[newflags &
+		(VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)];
+	if (vma_wants_writenotify(vma))
+		vma->vm_page_prot = protection_map[newflags &
+			(VM_READ|VM_WRITE|VM_EXEC)];
+
 	if (is_vm_hugetlb_page(vma))
-		hugetlb_change_protection(vma, start, end, newprot);
+		hugetlb_change_protection(vma, start, end, vma->vm_page_prot);
 	else
-		change_protection(vma, start, end, newprot);
+		change_protection(vma, start, end, vma->vm_page_prot);
 	vm_stat_account(mm, oldflags, vma->vm_file, -nrpages);
 	vm_stat_account(mm, newflags, vma->vm_file, nrpages);
 	return 0;
Index: linux-2.6-dirty/mm/page-writeback.c
===================================================================
--- linux-2.6-dirty.orig/mm/page-writeback.c	2006-06-28 21:47:19.000000000 +0200
+++ linux-2.6-dirty/mm/page-writeback.c	2006-06-28 21:47:21.000000000 +0200
@@ -29,6 +29,7 @@
 #include <linux/sysctl.h>
 #include <linux/cpu.h>
 #include <linux/syscalls.h>
+#include <linux/rmap.h>
 
 /*
  * The maximum number of pages to writeout in a single bdflush/kupdate
@@ -566,7 +567,7 @@ int do_writepages(struct address_space *
 		return 0;
 	wbc->for_writepages = 1;
 	if (mapping->a_ops->writepages)
-		ret =  mapping->a_ops->writepages(mapping, wbc);
+		ret = mapping->a_ops->writepages(mapping, wbc);
 	else
 		ret = generic_writepages(mapping, wbc);
 	wbc->for_writepages = 0;
@@ -728,8 +729,14 @@ int test_clear_page_dirty(struct page *p
 						page_index(page),
 						PAGECACHE_TAG_DIRTY);
 			write_unlock_irqrestore(&mapping->tree_lock, flags);
-			if (mapping_cap_account_dirty(mapping))
+			/*
+			 * We can continue to use `mapping' here because the
+			 * page is locked, which pins the address_space
+			 */
+			if (mapping_cap_account_dirty(mapping)) {
+				page_mkclean(page);
 				dec_page_state(nr_dirty);
+			}
 			return 1;
 		}
 		write_unlock_irqrestore(&mapping->tree_lock, flags);
@@ -759,8 +766,10 @@ int clear_page_dirty_for_io(struct page 
 
 	if (mapping) {
 		if (TestClearPageDirty(page)) {
-			if (mapping_cap_account_dirty(mapping))
+			if (mapping_cap_account_dirty(mapping)) {
+				page_mkclean(page);
 				dec_page_state(nr_dirty);
+			}
 			return 1;
 		}
 		return 0;
Index: linux-2.6-dirty/mm/rmap.c
===================================================================
--- linux-2.6-dirty.orig/mm/rmap.c	2006-06-28 21:47:19.000000000 +0200
+++ linux-2.6-dirty/mm/rmap.c	2006-06-28 21:47:35.000000000 +0200
@@ -434,6 +434,71 @@ int page_referenced(struct page *page, i
 	return referenced;
 }
 
+static int page_mkclean_one(struct page *page, struct vm_area_struct *vma)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long address;
+	pte_t *pte, entry;
+	spinlock_t *ptl;
+	int ret = 0;
+
+	address = vma_address(page, vma);
+	if (address == -EFAULT)
+		goto out;
+
+	pte = page_check_address(page, mm, address, &ptl);
+	if (!pte)
+		goto out;
+
+	if (!pte_dirty(*pte) && !pte_write(*pte))
+		goto unlock;
+
+	entry = ptep_get_and_clear(mm, address, pte);
+	entry = pte_mkclean(entry);
+	entry = pte_wrprotect(entry);
+	ptep_establish(vma, address, pte, entry);
+	lazy_mmu_prot_update(entry);
+	ret = 1;
+
+unlock:
+	pte_unmap_unlock(pte, ptl);
+out:
+	return ret;
+}
+
+static int page_mkclean_file(struct address_space *mapping, struct page *page)
+{
+	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	struct vm_area_struct *vma;
+	struct prio_tree_iter iter;
+	int ret = 0;
+
+	BUG_ON(PageAnon(page));
+
+	spin_lock(&mapping->i_mmap_lock);
+	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
+		if (vma->vm_flags & VM_SHARED)
+			ret += page_mkclean_one(page, vma);
+	}
+	spin_unlock(&mapping->i_mmap_lock);
+	return ret;
+}
+
+int page_mkclean(struct page *page)
+{
+	int ret = 0;
+
+	BUG_ON(!PageLocked(page));
+
+	if (page_mapped(page)) {
+		struct address_space *mapping = page_mapping(page);
+		if (mapping)
+			ret = page_mkclean_file(mapping, page);
+	}
+
+	return ret;
+}
+
 /**
  * page_set_anon_rmap - setup new anonymous rmap
  * @page:	the page to add the mapping to
Index: linux-2.6-dirty/include/linux/rmap.h
===================================================================
--- linux-2.6-dirty.orig/include/linux/rmap.h	2006-06-28 21:47:19.000000000 +0200
+++ linux-2.6-dirty/include/linux/rmap.h	2006-06-28 21:47:21.000000000 +0200
@@ -103,6 +103,14 @@ pte_t *page_check_address(struct page *,
  */
 unsigned long page_address_in_vma(struct page *, struct vm_area_struct *);
 
+/*
+ * Cleans the PTEs of shared mappings.
+ * (and since clean PTEs should also be readonly, write protects them too)
+ *
+ * returns the number of cleaned PTEs.
+ */
+int page_mkclean(struct page *);
+
 #else	/* !CONFIG_MMU */
 
 #define anon_vma_init()		do {} while (0)
Index: linux-2.6-dirty/fs/buffer.c
===================================================================
--- linux-2.6-dirty.orig/fs/buffer.c	2006-06-28 21:47:19.000000000 +0200
+++ linux-2.6-dirty/fs/buffer.c	2006-06-28 21:47:21.000000000 +0200
@@ -2983,6 +2983,7 @@ int try_to_free_buffers(struct page *pag
 
 	spin_lock(&mapping->private_lock);
 	ret = drop_buffers(page, &buffers_to_free);
+	spin_unlock(&mapping->private_lock);
 	if (ret) {
 		/*
 		 * If the filesystem writes its buffers by hand (eg ext3)
@@ -2994,7 +2995,6 @@ int try_to_free_buffers(struct page *pag
 		 */
 		clear_page_dirty(page);
 	}
-	spin_unlock(&mapping->private_lock);
 out:
 	if (buffers_to_free) {
 		struct buffer_head *bh = buffers_to_free;
Index: linux-2.6-dirty/include/linux/mm.h
===================================================================
--- linux-2.6-dirty.orig/include/linux/mm.h	2006-06-28 21:47:19.000000000 +0200
+++ linux-2.6-dirty/include/linux/mm.h	2006-06-28 21:47:21.000000000 +0200
@@ -14,6 +14,7 @@
 #include <linux/prio_tree.h>
 #include <linux/fs.h>
 #include <linux/mutex.h>
+#include <linux/backing-dev.h>
 
 struct mempolicy;
 struct anon_vma;
@@ -796,6 +797,39 @@ struct shrinker;
 extern struct shrinker *set_shrinker(int, shrinker_t);
 extern void remove_shrinker(struct shrinker *shrinker);
 
+/*
+ * Some shared mappings will want the pages marked read-only
+ * to track write events. If so, we'll downgrade vm_page_prot
+ * to the private version (using protection_map[] without the
+ * VM_SHARED bit).
+ */
+static inline int vma_wants_writenotify(struct vm_area_struct *vma)
+{
+	unsigned int vm_flags = vma->vm_flags;
+
+	/* If it was private or non-writable, the write bit is already clear */
+	if ((vm_flags & (VM_WRITE|VM_SHARED)) != ((VM_WRITE|VM_SHARED)))
+		return 0;
+
+	/* The backer wishes to know when pages are first written to? */
+	if (vma->vm_ops && vma->vm_ops->page_mkwrite)
+		return 1;
+
+	/* The open routine did something to the protections already? */
+	if (pgprot_val(vma->vm_page_prot) !=
+	    pgprot_val(protection_map[vm_flags &
+		    (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)]))
+		return 0;
+
+	/* Specialty mapping? */
+	if (vm_flags & (VM_PFNMAP|VM_INSERTPAGE))
+		return 0;
+
+	/* Can the mapping track the dirty pages? */
+	return vma->vm_file && vma->vm_file->f_mapping &&
+		mapping_cap_account_dirty(vma->vm_file->f_mapping);
+}
+
 extern pte_t *FASTCALL(get_locked_pte(struct mm_struct *mm, unsigned long addr, spinlock_t **ptl));
 
 int __pud_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address);

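The call site that consumes vma_wants_writenotify() is not part of the
hunks above; as a minimal sketch (assuming, per the comment, that the
downgrade is done by indexing protection_map[] with the VM_SHARED bit
masked out), it could look like:

	/*
	 * Hypothetical call site at the end of mmap setup: downgrade
	 * to the private protections so the first write to each page
	 * takes a fault.
	 */
	if (vma_wants_writenotify(vma))
		vma->vm_page_prot = protection_map[vma->vm_flags &
					(VM_READ|VM_WRITE|VM_EXEC)];

The first write to any page in such a vma then faults, and the page can
be marked dirty before the pte is made writable.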

* Re: [PATCH 1/6] mm: tracking shared dirty pages
  2006-06-13 12:15     ` Peter Zijlstra
@ 2006-06-13 12:24       ` Nick Piggin
  0 siblings, 0 replies; 51+ messages in thread
From: Nick Piggin @ 2006-06-13 12:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andi Kleen, Hugh Dickins, Andrew Morton, David Howells,
	Christoph Lameter, Martin Bligh, Nick Piggin, Linus Torvalds,
	linux-kernel

Peter Zijlstra wrote:
> On Tue, 2006-06-13 at 14:03 +0200, Andi Kleen wrote:
> 
>>Peter Zijlstra <a.p.zijlstra@chello.nl> writes:
>>
>>
>>>From: Peter Zijlstra <a.p.zijlstra@chello.nl>
>>>
>>>People expressed the need to track dirty pages in shared mappings.
>>
>>Why only shared mappings? Anonymous pages can be dirty too
>>and would then need to be written to swap before making progress.
> 
> 
> Anonymous pages are by definition dirty, as they don't have a
> persistent backing store.

They can be clean.

> Each eviction of an anonymous page requires IO
> to swap space. On swap-in, pages are removed from swap space to make
> room for other pages.

No, they can remain in swap too.

Swap is a bit different because the memory usage patterns are going
to be much different. There is no reason why something similar couldn't
be done for swap as well; however, I don't think much need for it has
been demonstrated.

-- 
SUSE Labs, Novell Inc.


* Re: [PATCH 1/6] mm: tracking shared dirty pages
  2006-06-13 12:03   ` Andi Kleen
@ 2006-06-13 12:15     ` Peter Zijlstra
  2006-06-13 12:24       ` Nick Piggin
  0 siblings, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2006-06-13 12:15 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Hugh Dickins, Andrew Morton, David Howells, Christoph Lameter,
	Martin Bligh, Nick Piggin, Linus Torvalds, linux-kernel

On Tue, 2006-06-13 at 14:03 +0200, Andi Kleen wrote:
> Peter Zijlstra <a.p.zijlstra@chello.nl> writes:
> 
> > From: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > 
> > People expressed the need to track dirty pages in shared mappings.
> 
> Why only shared mappings? Anonymous pages can be dirty too
> and would then need to be written to swap before making progress.

Anonymous pages are by definition dirty, as they don't have a
persistent backing store. Each eviction of an anonymous page requires IO
to swap space. On swap-in, pages are removed from swap space to make
room for other pages.

Peter



* Re: [PATCH 1/6] mm: tracking shared dirty pages
  2006-06-13 11:21 ` [PATCH 1/6] mm: tracking shared dirty pages Peter Zijlstra
@ 2006-06-13 12:03   ` Andi Kleen
  2006-06-13 12:15     ` Peter Zijlstra
  0 siblings, 1 reply; 51+ messages in thread
From: Andi Kleen @ 2006-06-13 12:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Hugh Dickins, Andrew Morton, David Howells, Christoph Lameter,
	Martin Bligh, Nick Piggin, Linus Torvalds, linux-kernel

Peter Zijlstra <a.p.zijlstra@chello.nl> writes:

> From: Peter Zijlstra <a.p.zijlstra@chello.nl>
> 
> People expressed the need to track dirty pages in shared mappings.

Why only shared mappings? Anonymous pages can be dirty too
and would then need to be written to swap before making progress.

-Andi


* [PATCH 1/6] mm: tracking shared dirty pages
  2006-06-13 11:21 [PATCH 0/6] mm: tracking dirty pages -v8 Peter Zijlstra
@ 2006-06-13 11:21 ` Peter Zijlstra
  2006-06-13 12:03   ` Andi Kleen
  0 siblings, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2006-06-13 11:21 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Andrew Morton, David Howells, Peter Zijlstra,
	Christoph Lameter, Martin Bligh, Nick Piggin, Linus Torvalds


From: Peter Zijlstra <a.p.zijlstra@chello.nl>

People expressed the need to track dirty pages in shared mappings.

Linus outlined the general idea of doing that through making clean
writable pages write-protected and taking the write fault.

This patch does exactly that, it makes pages in a shared writable
mapping write-protected. On write-fault the pages are marked dirty and
made writable. When the pages get synced with their backing store, the
write-protection is re-instated.

It survives a simple test and shows the dirty pages in /proc/vmstat.

Changes in -v8

 - access_process_vm() and other force users of get_user_pages() can
   induce COW of read-only shared mappings.

Changes in -v7

 - changed is_shared_writable() to exclude VM_PFNMAP'ed regions.
 - Hugh's tiresome problem wasn't fully solved, now using the ugly duckling
   method.

Changes in -v6

 - make page_mkclean_one() modify the pte more like change_pte_range() 
   (suggested by Christoph Lameter)
 - made is_shared_writable() take vm_flags, it now resembles is_cow_mapping().
 - fixed the mprotect() bug (spotted by Hugh Dickins)
 - hopefully fixed the tiresome issue of do_mmap_pgoff() trampling on
   driver specific vm_page_prot settings (spotted by Hugh Dickins)
 - made a new version of the page_mkwrite() patch to go on top of all this.
   This so that Linus could merge this very early on in 2.6.18.

Changes in -v5

 - rename page_wrprotect() to page_mkclean() (suggested by Nick Piggin)
 - added comment to test_clear_page_dirty() (Andrew Morton)
 - cleanup page_wrprotect() (Andrew Morton)
 - renamed VM_SharedWritable() to is_shared_writable()
 - fs/buffer.c try_to_free_buffers(): remove clear_page_dirty() from under
   ->private_lock. This seems to be safe, since ->private_lock is used to
   serialize access to the buffers, not the page itself.
 - rebased on top of David Howells' page_mkwrite() patch.

Changes in -v4:

 - small cleanup as suggested by Christoph Lameter.

Changes in -v3:

 - move set_page_dirty() outside pte lock (suggested by Christoph Lameter)

Changes in -v2:

 - only wrprotect pages from dirty capable mappings. (Nick Piggin)
 - move the writefault handling from do_wp_page() into handle_pte_fault(). 
   (Nick Piggin)
 - revert to the old install_page interface. (Nick Piggin)
 - also clear the pte dirty bit when we make pages read-only again.
   (spotted by Rik van Riel)
 - make page_wrprotect() return the number of reprotected ptes.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---

 fs/buffer.c          |    2 -
 include/linux/mm.h   |    6 ++++
 include/linux/rmap.h |    8 ++++++
 mm/memory.c          |   56 ++++++++++++++++++++++++++++++++----------
 mm/mmap.c            |   33 +++++++++++++++++++++++-
 mm/mprotect.c        |   14 +++++++++-
 mm/page-writeback.c  |    9 ++++++
 mm/rmap.c            |   68 +++++++++++++++++++++++++++++++++++++++++++++++++++
 8 files changed, 177 insertions(+), 19 deletions(-)

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h	2006-06-12 07:01:34.000000000 +0200
+++ linux-2.6/include/linux/mm.h	2006-06-12 13:09:55.000000000 +0200
@@ -183,6 +183,12 @@ extern unsigned int kobjsize(const void 
 #define VM_SequentialReadHint(v)	((v)->vm_flags & VM_SEQ_READ)
 #define VM_RandomReadHint(v)		((v)->vm_flags & VM_RAND_READ)
 
+static inline int is_shared_writable(unsigned int flags)
+{
+	return (flags & (VM_SHARED|VM_WRITE|VM_PFNMAP)) ==
+		(VM_SHARED|VM_WRITE);
+}
+
 /*
  * mapping from the currently active vm_flags protection bits (the
  * low four bits) to a page protection mask..
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c	2006-06-12 07:01:34.000000000 +0200
+++ linux-2.6/mm/memory.c	2006-06-13 11:17:51.000000000 +0200
@@ -925,6 +925,12 @@ struct page *follow_page(struct vm_area_
 	pte = *ptep;
 	if (!pte_present(pte))
 		goto unlock;
+	/*
+	 * This is not fully correct in the light of trapping write faults
+	 * for writable shared mappings. However, since we're going to mark
+	 * the page dirty anyway a few lines down, we might as well
+	 * take the write fault now.
+	 */
 	if ((flags & FOLL_WRITE) && !pte_write(pte))
 		goto unlock;
 	page = vm_normal_page(vma, address, pte);
@@ -1445,25 +1451,36 @@ static int do_wp_page(struct mm_struct *
 {
 	struct page *old_page, *new_page;
 	pte_t entry;
-	int ret = VM_FAULT_MINOR;
+	int reuse = 0, ret = VM_FAULT_MINOR;
+	struct page *dirty_page = NULL;
 
 	old_page = vm_normal_page(vma, address, orig_pte);
 	if (!old_page)
 		goto gotten;
 
-	if (PageAnon(old_page) && !TestSetPageLocked(old_page)) {
-		int reuse = can_share_swap_page(old_page);
+	/* get_user_pages(.write:1, .force:1)
+	 *   __handle_mm_fault()
+	 *
+	 * Makes COW happen for readonly shared mappings too.
+	 */
+	if (unlikely(is_shared_writable(vma->vm_flags))) {
+		reuse = 1;
+		dirty_page = old_page;
+		get_page(dirty_page);
+	} else if (PageAnon(old_page) && !TestSetPageLocked(old_page)) {
+		reuse = can_share_swap_page(old_page);
 		unlock_page(old_page);
-		if (reuse) {
-			flush_cache_page(vma, address, pte_pfn(orig_pte));
-			entry = pte_mkyoung(orig_pte);
-			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
-			ptep_set_access_flags(vma, address, page_table, entry, 1);
-			update_mmu_cache(vma, address, entry);
-			lazy_mmu_prot_update(entry);
-			ret |= VM_FAULT_WRITE;
-			goto unlock;
-		}
+	}
+
+	if (reuse) {
+		flush_cache_page(vma, address, pte_pfn(orig_pte));
+		entry = pte_mkyoung(orig_pte);
+		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+		ptep_set_access_flags(vma, address, page_table, entry, 1);
+		update_mmu_cache(vma, address, entry);
+		lazy_mmu_prot_update(entry);
+		ret |= VM_FAULT_WRITE;
+		goto unlock;
 	}
 
 	/*
@@ -1518,6 +1535,10 @@ gotten:
 		page_cache_release(old_page);
 unlock:
 	pte_unmap_unlock(page_table, ptl);
+	if (dirty_page) {
+		set_page_dirty(dirty_page);
+		put_page(dirty_page);
+	}
 	return ret;
 oom:
 	if (old_page)
@@ -2046,6 +2067,7 @@ static int do_no_page(struct mm_struct *
 	unsigned int sequence = 0;
 	int ret = VM_FAULT_MINOR;
 	int anon = 0;
+	struct page *dirty_page = NULL;
 
 	pte_unmap(page_table);
 	BUG_ON(vma->vm_flags & VM_PFNMAP);
@@ -2127,6 +2149,10 @@ retry:
 		} else {
 			inc_mm_counter(mm, file_rss);
 			page_add_file_rmap(new_page);
+			if (write_access) {
+				dirty_page = new_page;
+				get_page(dirty_page);
+			}
 		}
 	} else {
 		/* One of our sibling threads was faster, back out. */
@@ -2139,6 +2165,10 @@ retry:
 	lazy_mmu_prot_update(entry);
 unlock:
 	pte_unmap_unlock(page_table, ptl);
+	if (dirty_page) {
+		set_page_dirty(dirty_page);
+		put_page(dirty_page);
+	}
 	return ret;
 oom:
 	page_cache_release(new_page);
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c	2006-06-12 07:01:34.000000000 +0200
+++ linux-2.6/mm/mmap.c	2006-06-12 14:40:50.000000000 +0200
@@ -25,6 +25,7 @@
 #include <linux/mount.h>
 #include <linux/mempolicy.h>
 #include <linux/rmap.h>
+#include <linux/backing-dev.h>
 
 #include <asm/uaccess.h>
 #include <asm/cacheflush.h>
@@ -888,6 +889,7 @@ unsigned long do_mmap_pgoff(struct file 
 	struct rb_node ** rb_link, * rb_parent;
 	int accountable = 1;
 	unsigned long charged = 0, reqprot = prot;
+	struct address_space *mapping = NULL;
 
 	if (file) {
 		if (is_file_hugepages(file))
@@ -1065,7 +1067,8 @@ munmap_back:
 	vma->vm_start = addr;
 	vma->vm_end = addr + len;
 	vma->vm_flags = vm_flags;
-	vma->vm_page_prot = protection_map[vm_flags & 0x0f];
+	vma->vm_page_prot = protection_map[vm_flags &
+				(VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)];
 	vma->vm_pgoff = pgoff;
 
 	if (file) {
@@ -1083,6 +1086,7 @@ munmap_back:
 		error = file->f_op->mmap(file, vma);
 		if (error)
 			goto unmap_and_free_vma;
+
 	} else if (vm_flags & VM_SHARED) {
 		error = shmem_zero_setup(vma);
 		if (error)
@@ -1106,6 +1110,30 @@ munmap_back:
 	pgoff = vma->vm_pgoff;
 	vm_flags = vma->vm_flags;
 
+	/*
+	 * Tracking of dirty pages for shared writable mappings. Do this by
+	 * write-protecting writable pages, and marking them dirty in the write fault.
+	 *
+	 * Modify vma->vm_page_prot (the default protection for new pages)
+	 * to this effect.
+	 *
+	 * Cannot do this earlier because the condition depends on:
+	 *  - backing_dev_info having the right capabilities
+	 *    (set by f_op->open())
+	 *  - vma->vm_flags being fully set
+	 *    (finished in f_op->mmap(), which could call remap_pfn_range())
+	 *
+	 *  Also, cannot reset vma->vm_page_prot from vma->vm_flags because
+	 *  f_op->mmap() can modify it.
+	 */
+	if (is_shared_writable(vm_flags) && vma->vm_file)
+		mapping = vma->vm_file->f_mapping;
+	if (mapping && mapping_cap_account_dirty(mapping))
+		vma->vm_page_prot =
+			__pgprot(pte_val
+				(pte_wrprotect
+				 (__pte(pgprot_val(vma->vm_page_prot)))));
+
 	if (!file || !vma_merge(mm, prev, addr, vma->vm_end,
 			vma->vm_flags, NULL, file, pgoff, vma_policy(vma))) {
 		file = vma->vm_file;
@@ -1921,7 +1949,8 @@ unsigned long do_brk(unsigned long addr,
 	vma->vm_end = addr + len;
 	vma->vm_pgoff = pgoff;
 	vma->vm_flags = flags;
-	vma->vm_page_prot = protection_map[flags & 0x0f];
+	vma->vm_page_prot = protection_map[flags &
+				(VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)];
 	vma_link(mm, vma, prev, rb_link, rb_parent);
 out:
 	mm->total_vm += len >> PAGE_SHIFT;
Index: linux-2.6/mm/mprotect.c
===================================================================
--- linux-2.6.orig/mm/mprotect.c	2006-06-12 07:01:34.000000000 +0200
+++ linux-2.6/mm/mprotect.c	2006-06-12 13:09:55.000000000 +0200
@@ -19,6 +19,7 @@
 #include <linux/mempolicy.h>
 #include <linux/personality.h>
 #include <linux/syscalls.h>
+#include <linux/backing-dev.h>
 
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
@@ -106,6 +107,8 @@ mprotect_fixup(struct vm_area_struct *vm
 	unsigned long oldflags = vma->vm_flags;
 	long nrpages = (end - start) >> PAGE_SHIFT;
 	unsigned long charged = 0;
+	unsigned int mask;
+	struct address_space *mapping = NULL;
 	pgprot_t newprot;
 	pgoff_t pgoff;
 	int error;
@@ -132,8 +135,6 @@ mprotect_fixup(struct vm_area_struct *vm
 		}
 	}
 
-	newprot = protection_map[newflags & 0xf];
-
 	/*
 	 * First try to merge with previous and/or next vma.
 	 */
@@ -160,6 +161,15 @@ mprotect_fixup(struct vm_area_struct *vm
 	}
 
 success:
+	/* Don't make the VMA automatically writable if it's shared. */
+	mask = VM_READ|VM_WRITE|VM_EXEC|VM_SHARED;
+	if (is_shared_writable(newflags) && vma->vm_file)
+		mapping = vma->vm_file->f_mapping;
+	if (mapping && mapping_cap_account_dirty(mapping))
+		mask &= ~VM_SHARED;
+
+	newprot = protection_map[newflags & mask];
+
 	/*
 	 * vm_flags and vm_page_prot are protected by the mmap_sem
 	 * held in write mode.
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c	2006-06-12 07:01:34.000000000 +0200
+++ linux-2.6/mm/page-writeback.c	2006-06-13 11:17:40.000000000 +0200
@@ -29,6 +29,7 @@
 #include <linux/sysctl.h>
 #include <linux/cpu.h>
 #include <linux/syscalls.h>
+#include <linux/rmap.h>
 
 /*
  * The maximum number of pages to writeout in a single bdflush/kupdate
@@ -563,7 +564,7 @@ int do_writepages(struct address_space *
 		return 0;
 	wbc->for_writepages = 1;
 	if (mapping->a_ops->writepages)
-		ret =  mapping->a_ops->writepages(mapping, wbc);
+		ret = mapping->a_ops->writepages(mapping, wbc);
 	else
 		ret = generic_writepages(mapping, wbc);
 	wbc->for_writepages = 0;
@@ -725,6 +726,11 @@ int test_clear_page_dirty(struct page *p
 						page_index(page),
 						PAGECACHE_TAG_DIRTY);
 			write_unlock_irqrestore(&mapping->tree_lock, flags);
+			/*
+			 * We can continue to use `mapping' here because the
+			 * page is locked, which pins the address_space
+			 */
+			page_mkclean(page);
 			if (mapping_cap_account_dirty(mapping))
 				dec_page_state(nr_dirty);
 			return 1;
@@ -756,6 +762,7 @@ int clear_page_dirty_for_io(struct page 
 
 	if (mapping) {
 		if (TestClearPageDirty(page)) {
+			page_mkclean(page);
 			if (mapping_cap_account_dirty(mapping))
 				dec_page_state(nr_dirty);
 			return 1;
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c	2006-06-12 07:01:34.000000000 +0200
+++ linux-2.6/mm/rmap.c	2006-06-13 11:18:15.000000000 +0200
@@ -53,6 +53,7 @@
 #include <linux/rmap.h>
 #include <linux/rcupdate.h>
 #include <linux/module.h>
+#include <linux/backing-dev.h>
 
 #include <asm/tlbflush.h>
 
@@ -472,6 +473,73 @@ int page_referenced(struct page *page, i
 	return referenced;
 }
 
+static int page_mkclean_one(struct page *page, struct vm_area_struct *vma, int protect)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long address;
+	pte_t *pte, entry;
+	spinlock_t *ptl;
+	int ret = 0;
+
+	address = vma_address(page, vma);
+	if (address == -EFAULT)
+		goto out;
+
+	pte = page_check_address(page, mm, address, &ptl);
+	if (!pte)
+		goto out;
+
+	if (!(pte_dirty(*pte) || (protect && pte_write(*pte))))
+		goto unlock;
+
+	entry = ptep_get_and_clear(mm, address, pte);
+	entry = pte_mkclean(entry);
+	if (protect)
+		entry = pte_wrprotect(entry);
+	ptep_establish(vma, address, pte, entry);
+	lazy_mmu_prot_update(entry);
+	ret = 1;
+
+unlock:
+	pte_unmap_unlock(pte, ptl);
+out:
+	return ret;
+}
+
+static int page_mkclean_file(struct address_space *mapping, struct page *page)
+{
+	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	struct vm_area_struct *vma;
+	struct prio_tree_iter iter;
+	int ret = 0;
+
+	BUG_ON(PageAnon(page));
+
+	spin_lock(&mapping->i_mmap_lock);
+	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
+		int protect = mapping_cap_account_dirty(mapping) &&
+			is_shared_writable(vma->vm_flags);
+		ret += page_mkclean_one(page, vma, protect);
+	}
+	spin_unlock(&mapping->i_mmap_lock);
+	return ret;
+}
+
+int page_mkclean(struct page *page)
+{
+	int ret = 0;
+
+	BUG_ON(!PageLocked(page));
+
+	if (page_mapped(page)) {
+		struct address_space *mapping = page_mapping(page);
+		if (mapping)
+			ret = page_mkclean_file(mapping, page);
+	}
+
+	return ret;
+}
+
 /**
  * page_set_anon_rmap - setup new anonymous rmap
  * @page:	the page to add the mapping to
Index: linux-2.6/include/linux/rmap.h
===================================================================
--- linux-2.6.orig/include/linux/rmap.h	2006-06-12 07:01:34.000000000 +0200
+++ linux-2.6/include/linux/rmap.h	2006-06-12 13:09:55.000000000 +0200
@@ -105,6 +105,14 @@ pte_t *page_check_address(struct page *,
  */
 unsigned long page_address_in_vma(struct page *, struct vm_area_struct *);
 
+/*
+ * Cleans the PTEs of shared mappings.
+ * (and since clean PTEs should also be readonly, write-protects them too)
+ *
+ * returns the number of cleaned PTEs.
+ */
+int page_mkclean(struct page *);
+
 #else	/* !CONFIG_MMU */
 
 #define anon_vma_init()		do {} while (0)
Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c	2006-06-12 07:01:34.000000000 +0200
+++ linux-2.6/fs/buffer.c	2006-06-12 13:09:55.000000000 +0200
@@ -2985,6 +2985,7 @@ int try_to_free_buffers(struct page *pag
 
 	spin_lock(&mapping->private_lock);
 	ret = drop_buffers(page, &buffers_to_free);
+	spin_unlock(&mapping->private_lock);
 	if (ret) {
 		/*
 		 * If the filesystem writes its buffers by hand (eg ext3)
@@ -2996,7 +2997,6 @@ int try_to_free_buffers(struct page *pag
 		 */
 		clear_page_dirty(page);
 	}
-	spin_unlock(&mapping->private_lock);
 out:
 	if (buffers_to_free) {
 		struct buffer_head *bh = buffers_to_free;

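The changelog above mentions a simple test that shows the dirty pages in
/proc/vmstat; that test is not included in the posting. A minimal
userspace sketch of such a check (hypothetical: the file name, size and
use of system() are made up for illustration) could be:

	#include <fcntl.h>
	#include <stdlib.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <unistd.h>

	#define LEN (4UL << 20)	/* 4MB of file-backed pages */

	int main(void)
	{
		char *map;
		int fd = open("testfile", O_RDWR | O_CREAT, 0644);

		if (fd < 0 || ftruncate(fd, LEN) < 0)
			return 1;

		map = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
			   MAP_SHARED, fd, 0);
		if (map == MAP_FAILED)
			return 1;

		/* first write to each page faults and marks it dirty */
		memset(map, 0xaa, LEN);
		system("grep nr_dirty /proc/vmstat");

		/* writeback cleans the pages and re-instates the
		 * write protection */
		msync(map, LEN, MS_SYNC);
		system("grep nr_dirty /proc/vmstat");

		munmap(map, LEN);
		close(fd);
		return 0;
	}

With the patch applied, nr_dirty should rise after the memset and drop
again after the msync().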

end of thread, other threads:[~2006-06-28 20:17 UTC | newest]

Thread overview: 51+ messages
2006-06-19 17:52 [PATCH 0/6] mm: tracking dirty pages -v9 Peter Zijlstra
2006-06-19 17:52 ` [PATCH 1/6] mm: tracking shared dirty pages Peter Zijlstra
2006-06-22  5:56   ` Andrew Morton
2006-06-22  6:07     ` Christoph Lameter
2006-06-22  6:15       ` Andrew Morton
2006-06-22 11:33     ` Peter Zijlstra
2006-06-22 13:17       ` Hugh Dickins
2006-06-22 20:52   ` Hugh Dickins
2006-06-22 23:02     ` Peter Zijlstra
2006-06-22 23:39     ` [PATCH] mm: tracking shared dirty pages -v10 Peter Zijlstra
2006-06-23  3:10       ` Jeff Dike
2006-06-23  3:31         ` Andrew Morton
2006-06-23  3:50           ` Jeff Dike
2006-06-23  4:01           ` H. Peter Anvin
2006-06-23 15:08             ` Jeff Dike
2006-06-23  6:08       ` Linus Torvalds
2006-06-23  7:27         ` Hugh Dickins
2006-06-23 17:00           ` Christoph Lameter
2006-06-23 17:22             ` Peter Zijlstra
2006-06-23 17:52               ` Christoph Lameter
2006-06-23 18:11                 ` Martin Bligh
2006-06-23 18:20                   ` Linus Torvalds
2006-06-23 17:56               ` Linus Torvalds
2006-06-23 18:03                 ` Peter Zijlstra
2006-06-23 18:23                   ` Christoph Lameter
2006-06-23 18:41                 ` Christoph Hellwig
2006-06-23 17:49           ` Linus Torvalds
2006-06-23 18:05             ` Arjan van de Ven
2006-06-23 18:08             ` Miklos Szeredi
2006-06-23 19:06       ` Hugh Dickins
2006-06-23 22:00         ` Peter Zijlstra
2006-06-23 22:35           ` Linus Torvalds
2006-06-23 22:44             ` Peter Zijlstra
2006-06-28 14:58         ` [RFC][PATCH] mm: fixup do_wp_page() Peter Zijlstra
2006-06-28 18:20           ` Hugh Dickins
2006-06-19 17:53 ` [PATCH 2/6] mm: balance dirty pages Peter Zijlstra
2006-06-19 17:53 ` [PATCH 3/6] mm: msync() cleanup Peter Zijlstra
2006-06-22 17:02   ` Hugh Dickins
2006-06-19 17:53 ` [PATCH 4/6] mm: optimize the new mprotect() code a bit Peter Zijlstra
2006-06-22 17:21   ` Hugh Dickins
2006-06-19 17:53 ` [PATCH 5/6] mm: small cleanup of install_page() Peter Zijlstra
2006-06-19 17:53 ` [PATCH 6/6] mm: remove some update_mmu_cache() calls Peter Zijlstra
2006-06-22 16:29   ` Hugh Dickins
2006-06-22 16:37     ` Christoph Lameter
2006-06-22 17:35       ` Hugh Dickins
2006-06-22 18:31         ` Christoph Lameter
  -- strict thread matches above, loose matches on Subject: below --
2006-06-28 20:17 [PATCH 0/6] mm: tracking dirty pages -v14 Peter Zijlstra
2006-06-28 20:17 ` [PATCH 1/6] mm: tracking shared dirty pages Peter Zijlstra
2006-06-13 11:21 [PATCH 0/6] mm: tracking dirty pages -v8 Peter Zijlstra
2006-06-13 11:21 ` [PATCH 1/6] mm: tracking shared dirty pages Peter Zijlstra
2006-06-13 12:03   ` Andi Kleen
2006-06-13 12:15     ` Peter Zijlstra
2006-06-13 12:24       ` Nick Piggin
