linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/3] mm: tracking dirty pages -v5
@ 2006-05-25 13:55 Peter Zijlstra
  2006-05-25 13:55 ` [PATCH -1/3] mm: page_mkwrite Peter Zijlstra
                   ` (4 more replies)
  0 siblings, 5 replies; 31+ messages in thread
From: Peter Zijlstra @ 2006-05-25 13:55 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Andrew Morton, David Howells, Peter Zijlstra,
	Christoph Lameter, Martin Bligh, Nick Piggin, Linus Torvalds


I hacked up a new version last night.

It's now based on top of David's patches; Hugh's approach of using the
MAP_PRIVATE protections instead of the MAP_SHARED ones seems far superior
indeed.

Q: would it be feasible to do so for all shared mappings so we can remove
the MAP_SHARED protections altogether?

They survive my simple testing, but especially the msync cleanup might need
some more attention.

I post them now instead of after a little more testing because I'll not
have much time in the coming few days to do so, and hoarding them does
nobody any good.

Peter


* [PATCH -1/3] mm: page_mkwrite
  2006-05-25 13:55 [PATCH 0/3] mm: tracking dirty pages -v5 Peter Zijlstra
@ 2006-05-25 13:55 ` Peter Zijlstra
  2006-05-25 13:55 ` [PATCH 1/3] mm: tracking shared dirty pages Peter Zijlstra
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 31+ messages in thread
From: Peter Zijlstra @ 2006-05-25 13:55 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Andrew Morton, David Howells, Peter Zijlstra,
	Christoph Lameter, Martin Bligh, Nick Piggin, Linus Torvalds


The attached patch adds a new VMA operation to notify a filesystem or other
driver about the MMU generating a fault because userspace attempted to write
to a page mapped through a read-only PTE.

This facility permits the filesystem or driver to:

 (*) Implement storage allocation/reservation on attempted write, and so to
     deal with problems such as ENOSPC more gracefully (perhaps by generating
     SIGBUS).

 (*) Delay making the page writable until the contents have been written to a
     backing cache. This is useful for NFS/AFS when using FS-Cache/CacheFS.
     It permits the filesystem to have some guarantee about the state of the
     cache.

 (*) Account and limit the number of dirty pages. This is one piece of the
     puzzle needed to make shared writable mappings work safely in FUSE.
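
As an illustration only (this is not part of the patch), a filesystem's
handler for the new op might look roughly like the sketch below;
myfs_reserve_blocks() is a made-up helper standing in for the
filesystem's own reservation logic:

	static int myfs_page_mkwrite(struct vm_area_struct *vma,
				     struct page *page)
	{
		struct inode *inode = vma->vm_file->f_dentry->d_inode;

		/* may sleep: the fault path calls this without the page
		 * table lock held */
		if (myfs_reserve_blocks(inode, page->index) < 0)
			return -ENOSPC;	/* fault path raises SIGBUS */

		return 0;
	}

	static struct vm_operations_struct myfs_file_vm_ops = {
		.nopage		= filemap_nopage,
		.page_mkwrite	= myfs_page_mkwrite,
	};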

Signed-off-by: David Howells <dhowells@redhat.com>
---

 include/linux/mm.h |    4 ++
 mm/memory.c        |   99 +++++++++++++++++++++++++++++++++++++++-------------
 mm/mmap.c          |   12 +++++-
 mm/mprotect.c      |   11 +++++-
 4 files changed, 98 insertions(+), 28 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1154684..cd3c2cf 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -200,6 +200,10 @@ struct vm_operations_struct {
 	void (*close)(struct vm_area_struct * area);
 	struct page * (*nopage)(struct vm_area_struct * area, unsigned long address, int *type);
 	int (*populate)(struct vm_area_struct * area, unsigned long address, unsigned long len, pgprot_t prot, unsigned long pgoff, int nonblock);
+
+	/* notification that a previously read-only page is about to become
+	 * writable, if an error is returned it will cause a SIGBUS */
+	int (*page_mkwrite)(struct vm_area_struct *vma, struct page *page);
 #ifdef CONFIG_NUMA
 	int (*set_policy)(struct vm_area_struct *vma, struct mempolicy *new);
 	struct mempolicy *(*get_policy)(struct vm_area_struct *vma,
diff --git a/mm/memory.c b/mm/memory.c
index 0ec7bc6..6c6891e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1445,25 +1445,59 @@ static int do_wp_page(struct mm_struct *
 {
 	struct page *old_page, *new_page;
 	pte_t entry;
-	int ret = VM_FAULT_MINOR;
+	int reuse, ret = VM_FAULT_MINOR;
 
 	old_page = vm_normal_page(vma, address, orig_pte);
 	if (!old_page)
 		goto gotten;
 
-	if (PageAnon(old_page) && !TestSetPageLocked(old_page)) {
-		int reuse = can_share_swap_page(old_page);
-		unlock_page(old_page);
-		if (reuse) {
-			flush_cache_page(vma, address, pte_pfn(orig_pte));
-			entry = pte_mkyoung(orig_pte);
-			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
-			ptep_set_access_flags(vma, address, page_table, entry, 1);
-			update_mmu_cache(vma, address, entry);
-			lazy_mmu_prot_update(entry);
-			ret |= VM_FAULT_WRITE;
-			goto unlock;
+	if (unlikely(vma->vm_flags & VM_SHARED)) {
+		if (vma->vm_ops && vma->vm_ops->page_mkwrite) {
+			/*
+			 * Notify the address space that the page is about to
+			 * become writable so that it can prohibit this or wait
+			 * for the page to get into an appropriate state.
+			 *
+			 * We do this without the lock held, so that it can
+			 * sleep if it needs to.
+			 */
+			page_cache_get(old_page);
+			pte_unmap_unlock(page_table, ptl);
+
+			if (vma->vm_ops->page_mkwrite(vma, old_page) < 0)
+				goto unwritable_page;
+
+			page_cache_release(old_page);
+
+			/*
+			 * Since we dropped the lock we need to revalidate
+			 * the PTE as someone else may have changed it.  If
+			 * they did, we just return, as we can count on the
+			 * MMU to tell us if they didn't also make it writable.
+			 */
+			page_table = pte_offset_map_lock(mm, pmd, address,
+							 &ptl);
+			if (!pte_same(*page_table, orig_pte))
+				goto unlock;
 		}
+
+		reuse = 1;
+	} else if (PageAnon(old_page) && !TestSetPageLocked(old_page)) {
+		reuse = can_share_swap_page(old_page);
+		unlock_page(old_page);
+	} else {
+		reuse = 0;
+	}
+
+	if (reuse) {
+		flush_cache_page(vma, address, pte_pfn(orig_pte));
+		entry = pte_mkyoung(orig_pte);
+		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+		ptep_set_access_flags(vma, address, page_table, entry, 1);
+		update_mmu_cache(vma, address, entry);
+		lazy_mmu_prot_update(entry);
+		ret |= VM_FAULT_WRITE;
+		goto unlock;
 	}
 
 	/*
@@ -1523,6 +1557,10 @@ oom:
 	if (old_page)
 		page_cache_release(old_page);
 	return VM_FAULT_OOM;
+
+unwritable_page:
+	page_cache_release(old_page);
+	return VM_FAULT_SIGBUS;
 }
 
 /*
@@ -2074,18 +2112,31 @@ retry:
 	/*
 	 * Should we do an early C-O-W break?
 	 */
-	if (write_access && !(vma->vm_flags & VM_SHARED)) {
-		struct page *page;
+	if (write_access) {
+		if (!(vma->vm_flags & VM_SHARED)) {
+			struct page *page;
 
-		if (unlikely(anon_vma_prepare(vma)))
-			goto oom;
-		page = alloc_page_vma(GFP_HIGHUSER, vma, address);
-		if (!page)
-			goto oom;
-		copy_user_highpage(page, new_page, address);
-		page_cache_release(new_page);
-		new_page = page;
-		anon = 1;
+			if (unlikely(anon_vma_prepare(vma)))
+				goto oom;
+			page = alloc_page_vma(GFP_HIGHUSER, vma, address);
+			if (!page)
+				goto oom;
+			copy_user_highpage(page, new_page, address);
+			page_cache_release(new_page);
+			new_page = page;
+			anon = 1;
+
+		} else {
+			/* if the page will be shareable, see if the backing
+			 * address space wants to know that the page is about
+			 * to become writable */
+			if (vma->vm_ops->page_mkwrite &&
+			    vma->vm_ops->page_mkwrite(vma, new_page) < 0
+			    ) {
+				page_cache_release(new_page);
+				return VM_FAULT_SIGBUS;
+			}
+		}
 	}
 
 	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
diff --git a/mm/mmap.c b/mm/mmap.c
index e6ee123..6446c61 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1065,7 +1065,8 @@ munmap_back:
 	vma->vm_start = addr;
 	vma->vm_end = addr + len;
 	vma->vm_flags = vm_flags;
-	vma->vm_page_prot = protection_map[vm_flags & 0x0f];
+	vma->vm_page_prot = protection_map[vm_flags &
+				(VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)];
 	vma->vm_pgoff = pgoff;
 
 	if (file) {
@@ -1089,6 +1090,12 @@ munmap_back:
 			goto free_vma;
 	}
 
+	/* Don't make the VMA automatically writable if it's shared, but the
+	 * backer wishes to know when pages are first written to */
+	if (vma->vm_ops && vma->vm_ops->page_mkwrite)
+		vma->vm_page_prot =
+			protection_map[vm_flags & (VM_READ|VM_WRITE|VM_EXEC)];
+
 	/* We set VM_ACCOUNT in a shared mapping's vm_flags, to inform
 	 * shmem_zero_setup (perhaps called through /dev/zero's ->mmap)
 	 * that memory reservation must be checked; but that reservation
@@ -1921,7 +1928,8 @@ unsigned long do_brk(unsigned long addr,
 	vma->vm_end = addr + len;
 	vma->vm_pgoff = pgoff;
 	vma->vm_flags = flags;
-	vma->vm_page_prot = protection_map[flags & 0x0f];
+	vma->vm_page_prot = protection_map[flags &
+				(VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)];
 	vma_link(mm, vma, prev, rb_link, rb_parent);
 out:
 	mm->total_vm += len >> PAGE_SHIFT;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 4c14d42..2697abd 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -106,6 +106,7 @@ mprotect_fixup(struct vm_area_struct *vm
 	unsigned long oldflags = vma->vm_flags;
 	long nrpages = (end - start) >> PAGE_SHIFT;
 	unsigned long charged = 0;
+	unsigned int mask;
 	pgprot_t newprot;
 	pgoff_t pgoff;
 	int error;
@@ -132,8 +133,6 @@ mprotect_fixup(struct vm_area_struct *vm
 		}
 	}
 
-	newprot = protection_map[newflags & 0xf];
-
 	/*
 	 * First try to merge with previous and/or next vma.
 	 */
@@ -160,6 +159,14 @@ mprotect_fixup(struct vm_area_struct *vm
 	}
 
 success:
+	/* Don't make the VMA automatically writable if it's shared, but the
+	 * backer wishes to know when pages are first written to */
+	mask = VM_READ|VM_WRITE|VM_EXEC|VM_SHARED;
+	if (vma->vm_ops && vma->vm_ops->page_mkwrite)
+		mask &= ~VM_SHARED;
+
+	newprot = protection_map[newflags & mask];
+
 	/*
 	 * vm_flags and vm_page_prot are protected by the mmap_sem
 	 * held in write mode.


* [PATCH 1/3] mm: tracking shared dirty pages
  2006-05-25 13:55 [PATCH 0/3] mm: tracking dirty pages -v5 Peter Zijlstra
  2006-05-25 13:55 ` [PATCH -1/3] mm: page_mkwrite Peter Zijlstra
@ 2006-05-25 13:55 ` Peter Zijlstra
  2006-05-25 16:21   ` Christoph Lameter
                     ` (2 more replies)
  2006-05-25 13:56 ` [PATCH 2/3] mm: balance " Peter Zijlstra
                   ` (2 subsequent siblings)
  4 siblings, 3 replies; 31+ messages in thread
From: Peter Zijlstra @ 2006-05-25 13:55 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Andrew Morton, David Howells, Peter Zijlstra,
	Christoph Lameter, Martin Bligh, Nick Piggin, Linus Torvalds


From: Peter Zijlstra <a.p.zijlstra@chello.nl>

People expressed the need to track dirty pages in shared mappings.

Linus outlined the general idea of doing that through making clean
writable pages write-protected and taking the write fault.

This patch does exactly that: it makes pages in a shared writable
mapping write-protected. On write fault the pages are marked dirty and
made writable. When the pages get synced with their backing store, the
write-protection is reinstated.

It survives a simple test and shows the dirty pages in /proc/vmstat.
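
For illustration, a minimal userspace test of the kind meant here might
look like this (hypothetical; file name and length are arbitrary):

	#include <fcntl.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		char *p;
		int fd = open("/tmp/testfile", O_RDWR);

		if (fd < 0)
			return 1;
		p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			 MAP_SHARED, fd, 0);
		if (p == MAP_FAILED)
			return 1;

		memset(p, 'x', 4096);	 /* write fault: page goes dirty */
		/* nr_dirty in /proc/vmstat should now include this page */
		msync(p, 4096, MS_SYNC); /* writeback re-protects the page */

		munmap(p, 4096);
		close(fd);
		return 0;
	}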

Changes in -v5

 - rename page_wrprotect() to page_mkclean() (suggested by Nick Piggin)
 - added comment to test_clear_page_dirty() (Andrew Morton)
 - cleanup page_wrprotect() (Andrew Morton)
 - renamed VM_SharedWritable() to is_shared_writable()
 - fs/buffer.c try_to_free_buffers(): remove clear_page_dirty() from under
   ->private_lock. This seems to be safe, since ->private_lock is used to
   serialize access to the buffers, not the page itself.
 - rebased on top of David Howells' page_mkwrite() patch.

Changes in -v4:

 - small cleanup as suggested by Christoph Lameter.

Changes in -v3:

 - move set_page_dirty() outside pte lock (suggested by Christoph Lameter)

Changes in -v2:

 - only wrprotect pages from dirty capable mappings. (Nick Piggin)
 - move the writefault handling from do_wp_page() into handle_pte_fault(). 
   (Nick Piggin)
 - revert to the old install_page interface. (Nick Piggin)
 - also clear the pte dirty bit when we make pages read-only again.
   (spotted by Rik van Riel)
 - make page_wrprotect() return the number of reprotected ptes.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

 fs/buffer.c          |    2 -
 include/linux/mm.h   |    5 +++
 include/linux/rmap.h |    8 ++++++
 mm/fremap.c          |    4 +--
 mm/memory.c          |   20 +++++++++++++++
 mm/mmap.c            |    7 ++++-
 mm/mprotect.c        |    7 ++++-
 mm/page-writeback.c  |   13 ++++++++--
 mm/rmap.c            |   64 +++++++++++++++++++++++++++++++++++++++++++++++++++
 9 files changed, 122 insertions(+), 8 deletions(-)

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h	2006-05-25 15:46:03.000000000 +0200
+++ linux-2.6/include/linux/mm.h	2006-05-25 15:46:08.000000000 +0200
@@ -183,6 +183,11 @@ extern unsigned int kobjsize(const void 
 #define VM_SequentialReadHint(v)	((v)->vm_flags & VM_SEQ_READ)
 #define VM_RandomReadHint(v)		((v)->vm_flags & VM_RAND_READ)
 
+static inline int is_shared_writable(struct vm_area_struct *vma)
+{
+	return (vma->vm_flags & (VM_SHARED|VM_WRITE)) == (VM_SHARED|VM_WRITE);
+}
+
 /*
  * mapping from the currently active vm_flags protection bits (the
  * low four bits) to a page protection mask..
Index: linux-2.6/mm/fremap.c
===================================================================
--- linux-2.6.orig/mm/fremap.c	2006-05-25 15:07:43.000000000 +0200
+++ linux-2.6/mm/fremap.c	2006-05-25 15:46:08.000000000 +0200
@@ -79,9 +79,9 @@ int install_page(struct mm_struct *mm, s
 		inc_mm_counter(mm, file_rss);
 
 	flush_icache_page(vma, page);
-	set_pte_at(mm, addr, pte, mk_pte(page, prot));
+	pte_val = mk_pte(page, prot);
+	set_pte_at(mm, addr, pte, pte_val);
 	page_add_file_rmap(page);
-	pte_val = *pte;
 	update_mmu_cache(vma, addr, pte_val);
 	err = 0;
 unlock:
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c	2006-05-25 15:46:03.000000000 +0200
+++ linux-2.6/mm/memory.c	2006-05-25 15:47:10.000000000 +0200
@@ -48,6 +48,7 @@
 #include <linux/rmap.h>
 #include <linux/module.h>
 #include <linux/init.h>
+#include <linux/backing-dev.h>
 
 #include <asm/pgalloc.h>
 #include <asm/uaccess.h>
@@ -1446,12 +1447,13 @@ static int do_wp_page(struct mm_struct *
 	struct page *old_page, *new_page;
 	pte_t entry;
 	int reuse, ret = VM_FAULT_MINOR;
+	struct page *dirty_page = NULL;
 
 	old_page = vm_normal_page(vma, address, orig_pte);
 	if (!old_page)
 		goto gotten;
 
-	if (unlikely(vma->vm_flags & VM_SHARED)) {
+	if (vma->vm_flags & VM_SHARED) {
 		if (vma->vm_ops && vma->vm_ops->page_mkwrite) {
 			/*
 			 * Notify the address space that the page is about to
@@ -1482,6 +1484,9 @@ static int do_wp_page(struct mm_struct *
 		}
 
 		reuse = 1;
+
+		dirty_page = old_page;
+		get_page(dirty_page);
 	} else if (PageAnon(old_page) && !TestSetPageLocked(old_page)) {
 		reuse = can_share_swap_page(old_page);
 		unlock_page(old_page);
@@ -1552,6 +1557,10 @@ gotten:
 		page_cache_release(old_page);
 unlock:
 	pte_unmap_unlock(page_table, ptl);
+	if (dirty_page) {
+		set_page_dirty(dirty_page);
+		put_page(dirty_page);
+	}
 	return ret;
 oom:
 	if (old_page)
@@ -2084,6 +2093,7 @@ static int do_no_page(struct mm_struct *
 	unsigned int sequence = 0;
 	int ret = VM_FAULT_MINOR;
 	int anon = 0;
+	struct page *dirty_page = NULL;
 
 	pte_unmap(page_table);
 	BUG_ON(vma->vm_flags & VM_PFNMAP);
@@ -2178,6 +2188,10 @@ retry:
 		} else {
 			inc_mm_counter(mm, file_rss);
 			page_add_file_rmap(new_page);
+			if (write_access) {
+				dirty_page = new_page;
+				get_page(dirty_page);
+			}
 		}
 	} else {
 		/* One of our sibling threads was faster, back out. */
@@ -2190,6 +2204,10 @@ retry:
 	lazy_mmu_prot_update(entry);
 unlock:
 	pte_unmap_unlock(page_table, ptl);
+	if (dirty_page) {
+		set_page_dirty(dirty_page);
+		put_page(dirty_page);
+	}
 	return ret;
 oom:
 	page_cache_release(new_page);
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c	2006-05-25 15:07:44.000000000 +0200
+++ linux-2.6/mm/page-writeback.c	2006-05-25 15:47:10.000000000 +0200
@@ -29,6 +29,7 @@
 #include <linux/sysctl.h>
 #include <linux/cpu.h>
 #include <linux/syscalls.h>
+#include <linux/rmap.h>
 
 /*
  * The maximum number of pages to writeout in a single bdflush/kupdate
@@ -725,8 +726,14 @@ int test_clear_page_dirty(struct page *p
 						page_index(page),
 						PAGECACHE_TAG_DIRTY);
 			write_unlock_irqrestore(&mapping->tree_lock, flags);
-			if (mapping_cap_account_dirty(mapping))
+			/*
+			 * We can continue to use `mapping' here because the
+			 * page is locked, which pins the address_space
+			 */
+			if (mapping_cap_account_dirty(mapping)) {
+				page_mkclean(page);
 				dec_page_state(nr_dirty);
+			}
 			return 1;
 		}
 		write_unlock_irqrestore(&mapping->tree_lock, flags);
@@ -756,8 +763,10 @@ int clear_page_dirty_for_io(struct page 
 
 	if (mapping) {
 		if (TestClearPageDirty(page)) {
-			if (mapping_cap_account_dirty(mapping))
+			if (mapping_cap_account_dirty(mapping)) {
+				page_mkclean(page);
 				dec_page_state(nr_dirty);
+			}
 			return 1;
 		}
 		return 0;
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c	2006-05-25 15:07:46.000000000 +0200
+++ linux-2.6/mm/rmap.c	2006-05-25 15:46:08.000000000 +0200
@@ -472,6 +472,70 @@ int page_referenced(struct page *page, i
 	return referenced;
 }
 
+static int page_mkclean_one(struct page *page, struct vm_area_struct *vma)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long address;
+	pte_t *pte, entry;
+	spinlock_t *ptl;
+	int ret = 0;
+
+	address = vma_address(page, vma);
+	if (address == -EFAULT)
+		goto out;
+
+	pte = page_check_address(page, mm, address, &ptl);
+	if (!pte)
+		goto out;
+
+	if (!pte_write(*pte))
+		goto unlock;
+
+	entry = pte_mkclean(pte_wrprotect(*pte));
+	ptep_establish(vma, address, pte, entry);
+	update_mmu_cache(vma, address, entry);
+	lazy_mmu_prot_update(entry);
+	ret = 1;
+
+unlock:
+	pte_unmap_unlock(pte, ptl);
+out:
+	return ret;
+}
+
+static int page_mkclean_file(struct address_space *mapping, struct page *page)
+{
+	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	struct vm_area_struct *vma;
+	struct prio_tree_iter iter;
+	int ret = 0;
+
+	BUG_ON(PageAnon(page));
+
+	spin_lock(&mapping->i_mmap_lock);
+	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
+		if (is_shared_writable(vma))
+			ret += page_mkclean_one(page, vma);
+	}
+	spin_unlock(&mapping->i_mmap_lock);
+	return ret;
+}
+
+int page_mkclean(struct page *page)
+{
+	int ret = 0;
+
+	BUG_ON(!PageLocked(page));
+
+	if (page_mapped(page)) {
+		struct address_space *mapping = page_mapping(page);
+		if (mapping)
+			ret = page_mkclean_file(mapping, page);
+	}
+
+	return ret;
+}
+
 /**
  * page_set_anon_rmap - setup new anonymous rmap
  * @page:	the page to add the mapping to
Index: linux-2.6/include/linux/rmap.h
===================================================================
--- linux-2.6.orig/include/linux/rmap.h	2006-05-25 15:07:46.000000000 +0200
+++ linux-2.6/include/linux/rmap.h	2006-05-25 15:46:08.000000000 +0200
@@ -105,6 +105,14 @@ pte_t *page_check_address(struct page *,
  */
 unsigned long page_address_in_vma(struct page *, struct vm_area_struct *);
 
+/*
+ * Cleans the PTEs of shared mappings.
+ * (and since clean PTEs should also be readonly, write protects them too)
+ *
+ * returns the number of cleaned PTEs.
+ */
+int page_mkclean(struct page *);
+
 #else	/* !CONFIG_MMU */
 
 #define anon_vma_init()		do {} while (0)
Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c	2006-05-25 15:07:43.000000000 +0200
+++ linux-2.6/fs/buffer.c	2006-05-25 15:46:09.000000000 +0200
@@ -2985,6 +2985,7 @@ int try_to_free_buffers(struct page *pag
 
 	spin_lock(&mapping->private_lock);
 	ret = drop_buffers(page, &buffers_to_free);
+	spin_unlock(&mapping->private_lock);
 	if (ret) {
 		/*
 		 * If the filesystem writes its buffers by hand (eg ext3)
@@ -2996,7 +2997,6 @@ int try_to_free_buffers(struct page *pag
 		 */
 		clear_page_dirty(page);
 	}
-	spin_unlock(&mapping->private_lock);
 out:
 	if (buffers_to_free) {
 		struct buffer_head *bh = buffers_to_free;
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c	2006-05-25 15:46:03.000000000 +0200
+++ linux-2.6/mm/mmap.c	2006-05-25 15:46:45.000000000 +0200
@@ -25,6 +25,7 @@
 #include <linux/mount.h>
 #include <linux/mempolicy.h>
 #include <linux/rmap.h>
+#include <linux/backing-dev.h>
 
 #include <asm/uaccess.h>
 #include <asm/cacheflush.h>
@@ -888,6 +889,7 @@ unsigned long do_mmap_pgoff(struct file 
 	struct rb_node ** rb_link, * rb_parent;
 	int accountable = 1;
 	unsigned long charged = 0, reqprot = prot;
+	struct address_space *mapping = NULL;
 
 	if (file) {
 		if (is_file_hugepages(file))
@@ -1092,7 +1094,10 @@ munmap_back:
 
 	/* Don't make the VMA automatically writable if it's shared, but the
 	 * backer wishes to know when pages are first written to */
-	if (vma->vm_ops && vma->vm_ops->page_mkwrite)
+	if (is_shared_writable(vma) && vma->vm_file)
+		mapping = vma->vm_file->f_mapping;
+	if ((mapping && mapping_cap_account_dirty(mapping)) ||
+			(vma->vm_ops && vma->vm_ops->page_mkwrite))
 		vma->vm_page_prot =
 			protection_map[vm_flags & (VM_READ|VM_WRITE|VM_EXEC)];
 
Index: linux-2.6/mm/mprotect.c
===================================================================
--- linux-2.6.orig/mm/mprotect.c	2006-05-25 15:46:03.000000000 +0200
+++ linux-2.6/mm/mprotect.c	2006-05-25 15:46:09.000000000 +0200
@@ -19,6 +19,7 @@
 #include <linux/mempolicy.h>
 #include <linux/personality.h>
 #include <linux/syscalls.h>
+#include <linux/backing-dev.h>
 
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
@@ -107,6 +108,7 @@ mprotect_fixup(struct vm_area_struct *vm
 	long nrpages = (end - start) >> PAGE_SHIFT;
 	unsigned long charged = 0;
 	unsigned int mask;
+	struct address_space *mapping = NULL;
 	pgprot_t newprot;
 	pgoff_t pgoff;
 	int error;
@@ -162,7 +164,10 @@ success:
 	/* Don't make the VMA automatically writable if it's shared, but the
 	 * backer wishes to know when pages are first written to */
 	mask = VM_READ|VM_WRITE|VM_EXEC|VM_SHARED;
-	if (vma->vm_ops && vma->vm_ops->page_mkwrite)
+	if (is_shared_writable(vma) && vma->vm_file)
+		mapping = vma->vm_file->f_mapping;
+	if ((mapping && mapping_cap_account_dirty(mapping)) ||
+			(vma->vm_ops && vma->vm_ops->page_mkwrite))
 		mask &= ~VM_SHARED;
 
 	newprot = protection_map[newflags & mask];


* [PATCH 2/3] mm: balance dirty pages
  2006-05-25 13:55 [PATCH 0/3] mm: tracking dirty pages -v5 Peter Zijlstra
  2006-05-25 13:55 ` [PATCH -1/3] mm: page_mkwrite Peter Zijlstra
  2006-05-25 13:55 ` [PATCH 1/3] mm: tracking shared dirty pages Peter Zijlstra
@ 2006-05-25 13:56 ` Peter Zijlstra
  2006-05-25 13:56 ` [PATCH 3/3] mm: msync cleanup Peter Zijlstra
  2006-06-06 20:06 ` [PATCH 0/3] mm: tracking dirty pages -v5 Hugh Dickins
  4 siblings, 0 replies; 31+ messages in thread
From: Peter Zijlstra @ 2006-05-25 13:56 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Andrew Morton, David Howells, Peter Zijlstra,
	Christoph Lameter, Martin Bligh, Nick Piggin, Linus Torvalds


From: Peter Zijlstra <a.p.zijlstra@chello.nl>

Now that we can detect writers of shared mappings, throttle them.
This avoids OOM situations arising by surprise.
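
Informally, the call chain this patch adds is:

	write fault on a shared mapping     do_wp_page() / do_no_page()
	  -> set_page_dirty_balance(page)   mark the page dirty, then
	       -> balance_dirty_pages_ratelimited(mapping)
	                                    throttle the writer when too
	                                    much memory is dirty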

Changes -v2:

 - small helper function (Andrew Morton)

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

 include/linux/writeback.h |    1 +
 mm/memory.c               |    5 +++--
 mm/page-writeback.c       |   10 ++++++++++
 3 files changed, 14 insertions(+), 2 deletions(-)

Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c	2006-05-25 15:46:31.000000000 +0200
+++ linux-2.6/mm/memory.c	2006-05-25 15:46:55.000000000 +0200
@@ -49,6 +49,7 @@
 #include <linux/module.h>
 #include <linux/init.h>
 #include <linux/backing-dev.h>
+#include <linux/writeback.h>
 
 #include <asm/pgalloc.h>
 #include <asm/uaccess.h>
@@ -1558,7 +1559,7 @@ gotten:
 unlock:
 	pte_unmap_unlock(page_table, ptl);
 	if (dirty_page) {
-		set_page_dirty(dirty_page);
+		set_page_dirty_balance(dirty_page);
 		put_page(dirty_page);
 	}
 	return ret;
@@ -2205,7 +2206,7 @@ retry:
 unlock:
 	pte_unmap_unlock(page_table, ptl);
 	if (dirty_page) {
-		set_page_dirty(dirty_page);
+		set_page_dirty_balance(dirty_page);
 		put_page(dirty_page);
 	}
 	return ret;
Index: linux-2.6/include/linux/writeback.h
===================================================================
--- linux-2.6.orig/include/linux/writeback.h	2006-05-25 15:07:44.000000000 +0200
+++ linux-2.6/include/linux/writeback.h	2006-05-25 15:46:55.000000000 +0200
@@ -114,6 +114,7 @@ int sync_page_range(struct inode *inode,
 			loff_t pos, loff_t count);
 int sync_page_range_nolock(struct inode *inode, struct address_space *mapping,
 			   loff_t pos, loff_t count);
+void set_page_dirty_balance(struct page *page);
 
 /* pdflush.c */
 extern int nr_pdflush_threads;	/* Global so it can be exported to sysctl
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c	2006-05-25 15:46:08.000000000 +0200
+++ linux-2.6/mm/page-writeback.c	2006-05-25 15:46:55.000000000 +0200
@@ -255,6 +255,16 @@ static void balance_dirty_pages(struct a
 		pdflush_operation(background_writeout, 0);
 }
 
+void set_page_dirty_balance(struct page *page)
+{
+	if (set_page_dirty(page)) {
+		struct address_space *mapping = page_mapping(page);
+
+		if (mapping)
+			balance_dirty_pages_ratelimited(mapping);
+	}
+}
+
 /**
  * balance_dirty_pages_ratelimited_nr - balance dirty memory state
  * @mapping: address_space which was dirtied


* [PATCH 3/3] mm: msync cleanup
  2006-05-25 13:55 [PATCH 0/3] mm: tracking dirty pages -v5 Peter Zijlstra
                   ` (2 preceding siblings ...)
  2006-05-25 13:56 ` [PATCH 2/3] mm: balance " Peter Zijlstra
@ 2006-05-25 13:56 ` Peter Zijlstra
  2006-06-06 20:06 ` [PATCH 0/3] mm: tracking dirty pages -v5 Hugh Dickins
  4 siblings, 0 replies; 31+ messages in thread
From: Peter Zijlstra @ 2006-05-25 13:56 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Andrew Morton, David Howells, Peter Zijlstra,
	Christoph Lameter, Martin Bligh, Nick Piggin, Linus Torvalds


From: Peter Zijlstra <a.p.zijlstra@chello.nl>

With the tracking of dirty pages properly done now, msync doesn't need to
scan the PTEs anymore to determine the dirty status.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

 mm/msync.c |  145 ++++---------------------------------------------------------
 1 file changed, 10 insertions(+), 135 deletions(-)

Index: linux-2.6/mm/msync.c
===================================================================
--- linux-2.6.orig/mm/msync.c	2006-05-24 18:03:15.000000000 +0200
+++ linux-2.6/mm/msync.c	2006-05-24 18:38:09.000000000 +0200
@@ -20,128 +20,20 @@
 #include <asm/pgtable.h>
 #include <asm/tlbflush.h>
 
-static unsigned long msync_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
-				unsigned long addr, unsigned long end)
-{
-	pte_t *pte;
-	spinlock_t *ptl;
-	int progress = 0;
-	unsigned long ret = 0;
-
-again:
-	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
-	do {
-		struct page *page;
-
-		if (progress >= 64) {
-			progress = 0;
-			if (need_resched() || need_lockbreak(ptl))
-				break;
-		}
-		progress++;
-		if (!pte_present(*pte))
-			continue;
-		if (!pte_maybe_dirty(*pte))
-			continue;
-		page = vm_normal_page(vma, addr, *pte);
-		if (!page)
-			continue;
-		if (ptep_clear_flush_dirty(vma, addr, pte) ||
-				page_test_and_clear_dirty(page))
-			ret += set_page_dirty(page);
-		progress += 3;
-	} while (pte++, addr += PAGE_SIZE, addr != end);
-	pte_unmap_unlock(pte - 1, ptl);
-	cond_resched();
-	if (addr != end)
-		goto again;
-	return ret;
-}
-
-static inline unsigned long msync_pmd_range(struct vm_area_struct *vma,
-			pud_t *pud, unsigned long addr, unsigned long end)
-{
-	pmd_t *pmd;
-	unsigned long next;
-	unsigned long ret = 0;
-
-	pmd = pmd_offset(pud, addr);
-	do {
-		next = pmd_addr_end(addr, end);
-		if (pmd_none_or_clear_bad(pmd))
-			continue;
-		ret += msync_pte_range(vma, pmd, addr, next);
-	} while (pmd++, addr = next, addr != end);
-	return ret;
-}
-
-static inline unsigned long msync_pud_range(struct vm_area_struct *vma,
-			pgd_t *pgd, unsigned long addr, unsigned long end)
-{
-	pud_t *pud;
-	unsigned long next;
-	unsigned long ret = 0;
-
-	pud = pud_offset(pgd, addr);
-	do {
-		next = pud_addr_end(addr, end);
-		if (pud_none_or_clear_bad(pud))
-			continue;
-		ret += msync_pmd_range(vma, pud, addr, next);
-	} while (pud++, addr = next, addr != end);
-	return ret;
-}
-
-static unsigned long msync_page_range(struct vm_area_struct *vma,
-				unsigned long addr, unsigned long end)
-{
-	pgd_t *pgd;
-	unsigned long next;
-	unsigned long ret = 0;
-
-	/* For hugepages we can't go walking the page table normally,
-	 * but that's ok, hugetlbfs is memory based, so we don't need
-	 * to do anything more on an msync().
-	 */
-	if (vma->vm_flags & VM_HUGETLB)
-		return 0;
-
-	BUG_ON(addr >= end);
-	pgd = pgd_offset(vma->vm_mm, addr);
-	flush_cache_range(vma, addr, end);
-	do {
-		next = pgd_addr_end(addr, end);
-		if (pgd_none_or_clear_bad(pgd))
-			continue;
-		ret += msync_pud_range(vma, pgd, addr, next);
-	} while (pgd++, addr = next, addr != end);
-	return ret;
-}
-
 /*
  * MS_SYNC syncs the entire file - including mappings.
  *
- * MS_ASYNC does not start I/O (it used to, up to 2.5.67).  Instead, it just
- * marks the relevant pages dirty.  The application may now run fsync() to
+ * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
+ * Nor does it just mark the relevant pages dirty (it used to, up to 2.6.17).
+ * Now it doesn't do anything, since dirty pages are properly tracked.
+ *
+ * The application may now run fsync() to
  * write out the dirty pages and wait on the writeout and check the result.
  * Or the application may run fadvise(FADV_DONTNEED) against the fd to start
  * async writeout immediately.
  * So by _not_ starting I/O in MS_ASYNC we provide complete flexibility to
  * applications.
  */
-static int msync_interval(struct vm_area_struct *vma, unsigned long addr,
-			unsigned long end, int flags,
-			unsigned long *nr_pages_dirtied)
-{
-	struct file *file = vma->vm_file;
-
-	if ((flags & MS_INVALIDATE) && (vma->vm_flags & VM_LOCKED))
-		return -EBUSY;
-
-	if (file && (vma->vm_flags & VM_SHARED))
-		*nr_pages_dirtied = msync_page_range(vma, addr, end);
-	return 0;
-}
 
 asmlinkage long sys_msync(unsigned long start, size_t len, int flags)
 {
@@ -157,6 +49,10 @@ asmlinkage long sys_msync(unsigned long 
 		goto out;
 	if ((flags & MS_ASYNC) && (flags & MS_SYNC))
 		goto out;
+	if (flags & MS_ASYNC) {
+		error = 0;
+		goto out;
+	}
 	error = -ENOMEM;
 	len = (len + ~PAGE_MASK) & PAGE_MASK;
 	end = start + len;
@@ -178,7 +74,6 @@ asmlinkage long sys_msync(unsigned long 
 		goto out_unlock;
 	}
 	do {
-		unsigned long nr_pages_dirtied = 0;
 		struct file *file;
 
 		/* Here start < vma->vm_end. */
@@ -188,32 +83,12 @@ asmlinkage long sys_msync(unsigned long 
 		}
 		/* Here vma->vm_start <= start < vma->vm_end. */
 		if (end <= vma->vm_end) {
-			if (start < end) {
-				error = msync_interval(vma, start, end, flags,
-							&nr_pages_dirtied);
-				if (error)
-					goto out_unlock;
-			}
 			error = unmapped_error;
 			done = 1;
-		} else {
-			/* Here vma->vm_start <= start < vma->vm_end < end. */
-			error = msync_interval(vma, start, vma->vm_end, flags,
-						&nr_pages_dirtied);
-			if (error)
-				goto out_unlock;
 		}
 		file = vma->vm_file;
 		start = vma->vm_end;
-		if ((flags & MS_ASYNC) && file && nr_pages_dirtied) {
-			get_file(file);
-			up_read(&current->mm->mmap_sem);
-			balance_dirty_pages_ratelimited_nr(file->f_mapping,
-							nr_pages_dirtied);
-			fput(file);
-			down_read(&current->mm->mmap_sem);
-			vma = find_vma(current->mm, start);
-		} else if ((flags & MS_SYNC) && file &&
+		if ((flags & MS_SYNC) && file &&
 				(vma->vm_flags & VM_SHARED)) {
 			get_file(file);
 			up_read(&current->mm->mmap_sem);


* Re: [PATCH 1/3] mm: tracking shared dirty pages
  2006-05-25 13:55 ` [PATCH 1/3] mm: tracking shared dirty pages Peter Zijlstra
@ 2006-05-25 16:21   ` Christoph Lameter
  2006-05-25 17:00     ` Peter Zijlstra
  2006-05-25 16:27   ` Christoph Lameter
  2006-05-26 14:33   ` David Howells
  2 siblings, 1 reply; 31+ messages in thread
From: Christoph Lameter @ 2006-05-25 16:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm, linux-kernel, Hugh Dickins, Andrew Morton,
	David Howells, Christoph Lameter, Martin Bligh, Nick Piggin,
	Linus Torvalds

On Thu, 25 May 2006, Peter Zijlstra wrote:

> @@ -1446,12 +1447,13 @@ static int do_wp_page(struct mm_struct *
>  
> -	if (unlikely(vma->vm_flags & VM_SHARED)) {
> +	if (vma->vm_flags & VM_SHARED) {

You seem to add this unlikely again later. Why remove it in the first place?

> +static int page_mkclean_one(struct page *page, struct vm_area_struct *vma)
> +	entry = pte_mkclean(pte_wrprotect(*pte));
> +	ptep_establish(vma, address, pte, entry);

> +	update_mmu_cache(vma, address, entry);

You only changed protections on an existing pte and ptep_establish
already flushed the TLB. No need to call update_mmu_cache. See how
change_protection() in mm/mprotect.c does it.

> +	lazy_mmu_prot_update(entry);

Needed.


* Re: [PATCH 1/3] mm: tracking shared dirty pages
  2006-05-25 13:55 ` [PATCH 1/3] mm: tracking shared dirty pages Peter Zijlstra
  2006-05-25 16:21   ` Christoph Lameter
@ 2006-05-25 16:27   ` Christoph Lameter
  2006-05-25 17:03     ` Peter Zijlstra
  2006-05-26  2:28     ` Jeff Anderson-Lee
  2006-05-26 14:33   ` David Howells
  2 siblings, 2 replies; 31+ messages in thread
From: Christoph Lameter @ 2006-05-25 16:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm, linux-kernel, Hugh Dickins, Andrew Morton,
	David Howells, Christoph Lameter, Martin Bligh, Nick Piggin,
	Linus Torvalds

On Thu, 25 May 2006, Peter Zijlstra wrote:

>  - rebased on top of David Howells' page_mkwrite() patch.

I am a bit confused about the need for David's patch. set_page_dirty() is 
already a notification that a page is to be dirtied. Why do we need it 
twice? set_page_dirty could return an error code and the file system can 
use the set_page_dirty() hook to get its notification. What we would need 
to do is to make sure that set_page_dirty can sleep.
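
Purely as an illustration of that alternative (this is not what the
current code does), the write-fault path could then do something like:

	/* hypothetical: assumes ->set_page_dirty() may sleep and may
	 * return an error such as -ENOSPC */
	if (set_page_dirty(old_page) < 0)
		return VM_FAULT_SIGBUS;

instead of calling a separate ->page_mkwrite() beforehand.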



* Re: [PATCH 1/3] mm: tracking shared dirty pages
  2006-05-25 16:21   ` Christoph Lameter
@ 2006-05-25 17:00     ` Peter Zijlstra
  2006-05-25 17:03       ` Christoph Lameter
  0 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2006-05-25 17:00 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-mm, linux-kernel, Hugh Dickins, Andrew Morton,
	David Howells, Christoph Lameter, Martin Bligh, Nick Piggin,
	Linus Torvalds

On Thu, 2006-05-25 at 09:21 -0700, Christoph Lameter wrote:
> On Thu, 25 May 2006, Peter Zijlstra wrote:
> 
> > @@ -1446,12 +1447,13 @@ static int do_wp_page(struct mm_struct *
> >  
> > -	if (unlikely(vma->vm_flags & VM_SHARED)) {
> > +	if (vma->vm_flags & VM_SHARED) {
> 
> You seem to add this unlikely again later. Why remove it in the first place?

I'm not sure I follow you; are you suggesting that we'll find the
condition to be unlikely still, even with most of the shared mappings
trapping this branch?

> > +static int page_mkclean_one(struct page *page, struct vm_area_struct *vma)
> > +	entry = pte_mkclean(pte_wrprotect(*pte));
> > +	ptep_establish(vma, address, pte, entry);
> 
> > +	update_mmu_cache(vma, address, entry);
> 
> You only changed protections on an existing pte and ptep_establish 
> already flushed the TLB. No need to call update_mmu_cache. See how 
> change_protection() in mm/mprotect.c does it.

OK, will check.

> > +	lazy_mmu_prot_update(entry);
> 
> Needed.



* Re: [PATCH 1/3] mm: tracking shared dirty pages
  2006-05-25 16:27   ` Christoph Lameter
@ 2006-05-25 17:03     ` Peter Zijlstra
  2006-05-25 17:06       ` Christoph Lameter
  2006-05-26  2:28     ` Jeff Anderson-Lee
  1 sibling, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2006-05-25 17:03 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-mm, linux-kernel, Hugh Dickins, Andrew Morton,
	David Howells, Christoph Lameter, Martin Bligh, Nick Piggin,
	Linus Torvalds

On Thu, 2006-05-25 at 09:27 -0700, Christoph Lameter wrote:
> On Thu, 25 May 2006, Peter Zijlstra wrote:
> 
> >  - rebased on top of David Howells' page_mkwrite() patch.
> 
> I am a bit confused about the need for David's patch. set_page_dirty() is 
> already a notification that a page is to be dirtied. Why do we need it 
> twice? set_page_dirty could return an error code and the file system can 
> use the set_page_dirty() hook to get its notification. What we would need 
> to do is to make sure that set_page_dirty can sleep.

Ah, I see what you're saying here. Good point, David, Hugh?

The reason I did it was Hugh's trick of using the MAP_SHARED
protection; building on top of it naturally solves the patch conflict
Andrew would otherwise have had to resolve.



* Re: [PATCH 1/3] mm: tracking shared dirty pages
  2006-05-25 17:00     ` Peter Zijlstra
@ 2006-05-25 17:03       ` Christoph Lameter
  0 siblings, 0 replies; 31+ messages in thread
From: Christoph Lameter @ 2006-05-25 17:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Christoph Lameter, linux-mm, linux-kernel, Hugh Dickins,
	Andrew Morton, David Howells, Martin Bligh, Nick Piggin,
	Linus Torvalds

On Thu, 25 May 2006, Peter Zijlstra wrote:

> On Thu, 2006-05-25 at 09:21 -0700, Christoph Lameter wrote:
> > On Thu, 25 May 2006, Peter Zijlstra wrote:
> > 
> > > @@ -1446,12 +1447,13 @@ static int do_wp_page(struct mm_struct *
> > >  
> > > -	if (unlikely(vma->vm_flags & VM_SHARED)) {
> > > +	if (vma->vm_flags & VM_SHARED) {
> > 
> > You seem to add this unlikely again later. Why remove it in the first place?
> 
> I'm not sure I follow you; are you suggesting that we'll find the
> condition to be unlikely still, even with most of the shared mappings
> trapping this branch?

No, I just saw the opposite in a later patch. It was the -1 patch that 
does

+       if (unlikely(vma->vm_flags & VM_SHARED)) {

but that's a different context?


* Re: [PATCH 1/3] mm: tracking shared dirty pages
  2006-05-25 17:03     ` Peter Zijlstra
@ 2006-05-25 17:06       ` Christoph Lameter
  0 siblings, 0 replies; 31+ messages in thread
From: Christoph Lameter @ 2006-05-25 17:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Christoph Lameter, linux-mm, linux-kernel, Hugh Dickins,
	Andrew Morton, David Howells, Martin Bligh, Nick Piggin,
	Linus Torvalds

On Thu, 25 May 2006, Peter Zijlstra wrote:

> Ah, I see what you're saying here. Good point, David, Hugh?
> 
> The reason I did it was Hugh's trick of using the MAP_SHARED
> protection; building on top of it naturally solves the patch conflict
> Andrew would otherwise have had to resolve.

I guess what we wanted is a patch that addresses the concerns of both 
patches, not a combination of the patches. IMHO the dirty notification 
of David's patch is possible with the shared dirty pages patch if we allow 
the set_page_dirty method in the address_space operations to sleep and 
return an error code. However, this may raise some additional issues because 
we have to check whenever we dirty a page.




* Re: [PATCH 1/3] mm: tracking shared dirty pages
  2006-05-25 16:27   ` Christoph Lameter
  2006-05-25 17:03     ` Peter Zijlstra
@ 2006-05-26  2:28     ` Jeff Anderson-Lee
  2006-05-26  2:33       ` Christoph Lameter
  1 sibling, 1 reply; 31+ messages in thread
From: Jeff Anderson-Lee @ 2006-05-26  2:28 UTC (permalink / raw)
  To: 'Christoph Lameter'
  Cc: 'Peter Zijlstra',
	linux-mm, linux-kernel, 'Hugh Dickins',
	'Andrew Morton', 'David Howells',
	'Christoph Lameter', 'Martin Bligh',
	'Nick Piggin', 'Linus Torvalds'

Christoph Lameter wrote:

> I am a bit confused about the need for David's patch. set_page_dirty() is 
> already a notification that a page is to be dirtied. Why do we need it 
> twice? set_page_dirty could return an error code and the file system can 
> use the set_page_dirty() hook to get its notification. What we would need 
> to do is to make sure that set_page_dirty can sleep.

set_page_dirty() is actually called fairly late in the game by
zap_pte_range() and follow_page().  Thus, it is a notification that a page
HAS BEEN dirtied and needs a writeback.

What is needed (at least for log structured or copy-on-write filesystems) is
a function that is called during the page fault BEFORE the pte's writeable
bit is set so that the page's mapping can either be unmapped or remapped. 
That prevents possible readers of the old (read-only) version of the page
from seeing any changes via the page map cache.  If the page was unmapped, a
second call is needed later to map it to a new set of blocks on the device. 
For a log-structured filesystem it can make sense to defer the remapping
until late in the game, when the block is about to be queued for i/o, in case
it gets deleted before it is ever flushed from the page cache.

I looked at using set_page_dirty for the first of these hooks, but it comes
too late.  It might be OK for the second (remapping) hook though.

do_wp_page() is called by handle_pte_fault() in exactly the case where this
would apply: if write access is needed and !pte_write(entry).  This patch
appears to address the necessary issue in a way that set_page_dirty cannot.
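
For reference, the dispatch in question looks roughly like this in
mm/memory.c:

	/* handle_pte_fault(): the wp fault is taken exactly when write
	 * access is wanted but the pte is not writable */
	if (write_access) {
		if (!pte_write(entry))
			return do_wp_page(mm, vma, address,
					  pte, pmd, ptl, entry);
		entry = pte_mkdirty(entry);
	}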

Jeff Anderson-Lee

[I'm a little late in coming in on this thread, and not 100% familiar with
the latest kernels, so please pardon me if I'm not fully up to speed yet. 
I'm trying to catch up as quickly as I can though!]



* Re: [PATCH 1/3] mm: tracking shared dirty pages
  2006-05-26  2:28     ` Jeff Anderson-Lee
@ 2006-05-26  2:33       ` Christoph Lameter
  0 siblings, 0 replies; 31+ messages in thread
From: Christoph Lameter @ 2006-05-26  2:33 UTC (permalink / raw)
  To: Jeff Anderson-Lee
  Cc: 'Christoph Lameter', 'Peter Zijlstra',
	linux-mm, linux-kernel, 'Hugh Dickins',
	'Andrew Morton', 'David Howells',
	'Martin Bligh', 'Nick Piggin',
	'Linus Torvalds'


On Thu, 25 May 2006, Jeff Anderson-Lee wrote:

> Christoph Lameter wrote:
> 
> > I am a bit confused about the need for David's patch. set_page_dirty() is 
> > already a notification that a page is to be dirtied. Why do we need it 
> > twice? set_page_dirty could return an error code and the file system can 
> > use the set_page_dirty() hook to get its notification. What we would need 
> > to do is to make sure that set_page_dirty can sleep.
> 
> set_page_dirty() is actually called fairly late in the game by
> zap_pte_range() and follow_page().  Thus, it is a notification that a page
> HAS BEEN dirtied and needs a writeback.

The tracking patch changes that behavior. set_page_dirty is called before 
the write to the page occurs, and so it's similar to the new method 
introduced by David's patch.
 


* Re: [PATCH 1/3] mm: tracking shared dirty pages
  2006-05-25 13:55 ` [PATCH 1/3] mm: tracking shared dirty pages Peter Zijlstra
  2006-05-25 16:21   ` Christoph Lameter
  2006-05-25 16:27   ` Christoph Lameter
@ 2006-05-26 14:33   ` David Howells
  2006-05-26 15:39     ` Christoph Lameter
  2006-05-30  8:00     ` David Howells
  2 siblings, 2 replies; 31+ messages in thread
From: David Howells @ 2006-05-26 14:33 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Peter Zijlstra, linux-mm, linux-kernel, Hugh Dickins,
	Andrew Morton, David Howells, Christoph Lameter, Martin Bligh,
	Nick Piggin, Linus Torvalds

Christoph Lameter <clameter@sgi.com> wrote:

> >  - rebased on top of David Howells' page_mkwrite() patch.
> 
> I am a bit confused about the need for David's patch. set_page_dirty() is 
> already a notification that a page is to be dirtied. Why do we need it 
> twice? set_page_dirty could return an error code and the file system can 
> use the set_page_dirty() hook to get its notification. What we would need 
> to do is to make sure that set_page_dirty can sleep.

page_mkwrite() is called just before the _PTE_ is dirtied.  Take do_wp_page()
for example, set_page_dirty() is called after a lot of stuff, including some
stuff that marks the PTE dirty... by which time it's too late as another
thread sharing the page tables can come along and modify the page before the
first thread calls set_page_dirty().

Furthermore, by that point it's pointless to have set_page_dirty() return an
error.  The deed is effectively done: the PTE is marked dirty and
writable.

And also as you pointed out, set_page_dirty() needs to be able to sleep.
There are places where it's called still, even with Peter's patch, with the
page table lock held - zap_pte_range() for example.  In that particular case,
dropping the lock for each PTE would be bad for performance.

Basically, you can look at it as page_mkwrite() is called upfront, and
set_page_dirty() is called at the end.

David


* Re: [PATCH 1/3] mm: tracking shared dirty pages
  2006-05-26 14:33   ` David Howells
@ 2006-05-26 15:39     ` Christoph Lameter
  2006-05-30  8:00     ` David Howells
  1 sibling, 0 replies; 31+ messages in thread
From: Christoph Lameter @ 2006-05-26 15:39 UTC (permalink / raw)
  To: David Howells
  Cc: Peter Zijlstra, linux-mm, linux-kernel, Hugh Dickins,
	Andrew Morton, Christoph Lameter, Martin Bligh, Nick Piggin,
	Linus Torvalds

On Fri, 26 May 2006, David Howells wrote:

> page_mkwrite() is called just before the _PTE_ is dirtied.  Take do_wp_page()
> for example, set_page_dirty() is called after a lot of stuff, including some
> stuff that marks the PTE dirty... by which time it's too late as another
> thread sharing the page tables can come along and modify the page before the
> first thread calls set_page_dirty().

Since we are terminating the application with extreme prejudice on an 
error (SIGBUS) it does not matter if another process has written to the 
page in the meantime.

> And also as you pointed out, set_page_dirty() needs to be able to sleep.
> There are places where it's called still, even with Peter's patch, with the
> page table lock held - zap_pte_range() for example.  In that particular case,
> dropping the lock for each PTE would be bad for performance.

zap_pte_range would only have to dirty anonymous pages. The pages of 
shared mappings would already be dirty.

> Basically, you can look at it as page_mkwrite() is called upfront, and
> set_page_dirty() is called at the end.

The end is that the page is written back. I think we can still solve this 
with set_page_dirty being called when a page is about to be dirtied or
was dirtied.

The page_mkwrite() method does not really allow the tracking of dirty 
pages. It is a way of tracking potentially dirty pages, useful if one is 
not able to track dirty pages themselves. Moreover, unmapped dirtied pages 
do not factor into that scheme, probably because it was thought that they 
are already sufficiently tracked by nr_dirty. However, having two methods
of accounting for dirty pages creates problems in correlating the number 
of dirty pages. This is unnecessarily complex.

In order to consistently reach the goal of tracking dirty pages we 
have to deal with set_page_dirty(). In the first stage let's just
be satisfied with being able to throttle dirty pages by having an accurate
nr_dirty.

We can then avoid doing too many things in set_page_dirty so that we do 
not have to sleep or return an error.  Maybe add the support for errors 
(SIGBUS) later. But then we should consistently check every time we dirty a 
page, be it mapped or unmapped.


* Re: [PATCH 1/3] mm: tracking shared dirty pages
  2006-05-26 14:33   ` David Howells
  2006-05-26 15:39     ` Christoph Lameter
@ 2006-05-30  8:00     ` David Howells
  2006-05-30 15:38       ` Christoph Lameter
  2006-05-30 16:26       ` David Howells
  1 sibling, 2 replies; 31+ messages in thread
From: David Howells @ 2006-05-30  8:00 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: David Howells, Peter Zijlstra, linux-mm, linux-kernel,
	Hugh Dickins, Andrew Morton, Christoph Lameter, Martin Bligh,
	Nick Piggin, Linus Torvalds

Christoph Lameter <clameter@sgi.com> wrote:

> > page_mkwrite() is called just before the _PTE_ is dirtied.  Take
> > do_wp_page() for example, set_page_dirty() is called after a lot of stuff,
> > including some stuff that marks the PTE dirty... by which time it's too
> > late as another thread sharing the page tables can come along and modify
> > the page before the first thread calls set_page_dirty().
> 
> Since we are terminating the application with extreme prejudice on an 
> error (SIGBUS) it does not matter if another process has written to the 
> page in the meantime.

Erm... Yes, it does matter, at least for AFS or NFS using FS-Cache, whether
or not we're generating a SIGBUS or just proceeding normally.  There are two
cases I've come across:

Firstly, I use page_mkwrite() to make sure that the page is written to the cache
before we let anyone modify it, just so that we've got a reasonable idea of
what's in the cache.

What we currently have is:

	invoke page_mkwrite()
	  - wait for page to be written to the cache
	lock
	modify PTE
	unlock
	invoke set_page_dirty()

What your suggestion gives is:

	lock
	modify PTE
	unlock
	invoke set_page_dirty()
	  - wait for page to be written to the cache

But what can happen is:

	CPU 1			CPU 2
	=======================	=======================
	write to page (faults)
	lock
	modify PTE
	unlock
				write to page (succeeds)
	invoke set_page_dirty()
	  - wait for page to be written to the cache
	write to page (succeeds)

That potentially lets data of uncertain state into the cache, which means we
can't trust what's in the cache any longer.

Secondly, some filesystems want to use page_mkwrite() to prevent a write from
occurring if a write to a shared writable mapping would require an allocation
from a filesystem that's currently in an ENOSPC state.  That means that we may
not change the PTE until we're sure that the allocation is guaranteed to
succeed, and that means that the kernel isn't left with dirty pages attached to
inodes it'd like to dispose of but can't because there's nowhere to write the
data.

David


* Re: [PATCH 1/3] mm: tracking shared dirty pages
  2006-05-30  8:00     ` David Howells
@ 2006-05-30 15:38       ` Christoph Lameter
  2006-05-30 16:26       ` David Howells
  1 sibling, 0 replies; 31+ messages in thread
From: Christoph Lameter @ 2006-05-30 15:38 UTC (permalink / raw)
  To: David Howells
  Cc: Peter Zijlstra, linux-mm, linux-kernel, Hugh Dickins,
	Andrew Morton, Christoph Lameter, Martin Bligh, Nick Piggin,
	Linus Torvalds

On Tue, 30 May 2006, David Howells wrote:

> Christoph Lameter <clameter@sgi.com> wrote:
> 
> > > page_mkwrite() is called just before the _PTE_ is dirtied.  Take
> > > do_wp_page() for example, set_page_dirty() is called after a lot of stuff,
> > > including some stuff that marks the PTE dirty... by which time it's too
> > > late as another thread sharing the page tables can come along and modify
> > > the page before the first thread calls set_page_dirty().

Maybe I do not understand properly. I thought page_mkwrite is called 
before a page is made writable, not before it is dirtied. If it's only 
called before the page is dirtied, then a better name may be before_dirty or 
so?

> > Since we are terminating the application with extreme prejudice on an 
> > error (SIGBUS) it does not matter if another process has written to the 
> > page in the meantime.
> 
> Erm... Yes, it does matter, at least for AFS or NFS using FS-Cache, and whether
> or not we're generating a SIGBUS or just proceeding normally.  There are two
> cases I've come across:
> 
> Firstly I use page_mkwrite() to make sure that the page is written to the cache
> before we let anyone modify it, just so that we've got a reasonable idea of
> what's in the cache.

What do you mean by "written to the cache"? It cannot be written back 
since the page has not been dirtied yet. So "written to the cache" means 
that the FS does some reservation, right?

> What we currently have is:
> 
> 	invoke page_mkwrite()
> 	  - wait for page to be written to the cache

I am not sure what the point would be to write a page that is
not dirty. So I guess this reserves space on disk for the page.

> 	lock
> 	modify PTE
> 	unlock
> 	invoke set_page_dirty()
> 
> What your suggestion gives is:
> 
> 	lock
> 	modify PTE
> 	unlock
> 	invoke set_page_dirty()
> 	  - wait for page to be written to the cache

Regardless of what "written to the cache" exactly means here,
we have the page marked dirty in the mapping and queued for writeback.

> But what can happen is:
> 
> 	CPU 1			CPU 2
> 	=======================	=======================
> 	write to page (faults)
> 	lock
> 	modify PTE
> 	unlock
> 				write to page (succeeds)

Correct.

> 	invoke set_page_dirty()
> 	  - wait for page to be written to the cache
> 	write to page (succeeds)
> 
> That potentially lets data of uncertain state into the cache, which means we
> can't trust what's in the cache any longer.

It continues established usage, usage that even with your patch 
continues at least for anonymous pages. The page's dirty state may also be 
reflected by a dirty bit set in a pte pointing to the page even if the 
dirty flag is not set on the page itself.

> Secondly some filesystems want to use page_mkwrite() to prevent a write from
> occurring if a write to a shared writable mapping would require an allocation
> from a filesystem that's currently in an ENOSPC state.  That means that we may
> not change the PTE until we're sure that the allocation is guaranteed to
> succeed, and that means that the kernel isn't left with dirty pages attached to
> inodes it'd like to dispose of but can't because there's nowhere to write the
> data.

If set_page_dirty cannot reserve the page then we know that some severe
action is required. The FS method set_page_dirty() could:

1. Determine the ENOSPC condition before it sets the page dirty.
   That leaves the potential that some writes to the page have occurred
   by other processes.

2. Track down all processes that use the mapping (or maybe less
   severe: processes that have set the dirty bit in the pte) and 
   terminate them with SIGBUS.



* Re: [PATCH 1/3] mm: tracking shared dirty pages
  2006-05-30  8:00     ` David Howells
  2006-05-30 15:38       ` Christoph Lameter
@ 2006-05-30 16:26       ` David Howells
  2006-05-30 17:02         ` Christoph Lameter
  2006-05-30 17:56         ` David Howells
  1 sibling, 2 replies; 31+ messages in thread
From: David Howells @ 2006-05-30 16:26 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: David Howells, Peter Zijlstra, linux-mm, linux-kernel,
	Hugh Dickins, Andrew Morton, Christoph Lameter, Martin Bligh,
	Nick Piggin, Linus Torvalds

Christoph Lameter <clameter@sgi.com> wrote:

> Maybe I do not understand properly. I thought page_mkwrite is called 
> before a page is made writable, not before it is dirtied. If it's only 
> called before the page is dirtied, then a better name may be before_dirty or 
> so?

page_mkwrite() is called before any of the PTEs referring to a page are made
writable.  This must precede a page being dirtied by writing to it directly
through an mmap'd section.  It does not catch write() and co. dirtying pages,
but then there's no need since prepare_write() is available for that.

> What do you mean by "written to the cache"? It cannot be written back 
> since the page has not been dirtied yet. So "written to the cache" means 
> that the FS does some reservation, right?

See the FS-Cache patches posted to LKML on the 19th of May, in particular the
documentation included in the patch with the subject:

	[PATCH 10/14] FS-Cache: Generic filesystem caching facility [try #10]

These patches permit data retrieved from network filesystems (NFS and AFS for
now) to be cached locally on disk.

The page is fetched from the server and then written to the cache.  We don't
let the clean page be modified or released until the write to the cache is
complete.  This permits us to keep track of what state the cache is in.

> If set_page_dirty cannot reserve the page then we know that some severe
> action is required. The FS method set_page_dirty() could:

But by the time set_page_dirty() is called, it's too late as the code
currently stands.  We've already marked the PTE writable and dirty.  The
page_mkwrite() op is called _first_.

> 1. Determine the ENOSPC condition before it sets the page dirty.
>    That leaves the potential that some writes to the page have occurred
>    by other processes.

You have to detect ENOSPC before you modify the PTE, otherwise someone else
can jump through your hoop and dirty the page before you can stop them.

> 2. Track down all processes that use the mapping (or maybe less

That's bad, even if you restrict it to those that have MAP_SHARED and
PROT_WRITE set.  They should not be terminated if they haven't attempted to
write to the mapping.

>    severe: processes that have set the dirty bit in the pte) and 
>    terminate them with SIGBUS.

Brrrr.

What's wrong with my suggestion anyway?

David

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/3] mm: tracking shared dirty pages
  2006-05-30 16:26       ` David Howells
@ 2006-05-30 17:02         ` Christoph Lameter
  2006-05-30 17:25           ` Hugh Dickins
  2006-05-30 17:56         ` David Howells
  1 sibling, 1 reply; 31+ messages in thread
From: Christoph Lameter @ 2006-05-30 17:02 UTC (permalink / raw)
  To: David Howells
  Cc: Peter Zijlstra, linux-mm, linux-kernel, Hugh Dickins,
	Andrew Morton, Christoph Lameter, Martin Bligh, Nick Piggin,
	Linus Torvalds

On Tue, 30 May 2006, David Howells wrote:

> > If set_page_dirty cannot reserve the page then we know that some severe
> > action is required. The FS method set_page_dirty() could:
> 
> But by the time set_page_dirty() is called, it's too late as the code
> currently stands.  We've already marked the PTE writable and dirty.  The
> page_mkwrite() op is called _first_.

We are in set_page_dirty and this would be part of set_page_dirty 
processing.

> > 2. Track down all processes that use the mapping (or maybe less
> 
> That's bad, even if you restrict it to those that have MAP_SHARED and
> PROT_WRITE set.  They should not be terminated if they haven't attempted to
> write to the mapping.

It's bad, but running out of space is an exceptional situation. We do 
similar contortions when we run out of memory. As I said: one can track 
down the processes that have dirtied a pte pointing to the page in 
question and just terminate those and remove the page.

> What's wrong with my suggestion anyway?

Adds yet another method with functionality that for the most part 
is the same as set_page_dirty().

The advantage of such a method seems to be that it reserves filesystem 
space for pages that could potentially be written to. This allows the 
filesystem to accurately deal with out-of-space situations (a very rare 
condition; is this really justifiable?). Maybe having already reserved 
space could speed up the real dirtying of pages?

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/3] mm: tracking shared dirty pages
  2006-05-30 17:02         ` Christoph Lameter
@ 2006-05-30 17:25           ` Hugh Dickins
  2006-05-30 17:30             ` Christoph Lameter
  0 siblings, 1 reply; 31+ messages in thread
From: Hugh Dickins @ 2006-05-30 17:25 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: David Howells, Peter Zijlstra, linux-mm, linux-kernel,
	Andrew Morton, Christoph Lameter, Martin Bligh, Nick Piggin,
	Linus Torvalds

On Tue, 30 May 2006, Christoph Lameter wrote:
> On Tue, 30 May 2006, David Howells wrote:
> 
> > What's wrong with my suggestion anyway?
> 
> Adds yet another method with functionality that for the most part 
> is the same as set_page_dirty().

Your original question, whether they could be combined, was a good one;
and I hoped you'd be right.  But I agree with David, they cannot, unless
we sacrifice the guarantee that one or the other is there to give.  It's
much like the relationship between ->prepare_write and ->commit_write.

Hugh

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/3] mm: tracking shared dirty pages
  2006-05-30 17:25           ` Hugh Dickins
@ 2006-05-30 17:30             ` Christoph Lameter
  2006-05-30 17:41               ` Hugh Dickins
  0 siblings, 1 reply; 31+ messages in thread
From: Christoph Lameter @ 2006-05-30 17:30 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: David Howells, Peter Zijlstra, linux-mm, linux-kernel,
	Andrew Morton, Christoph Lameter, Martin Bligh, Nick Piggin,
	Linus Torvalds

On Tue, 30 May 2006, Hugh Dickins wrote:

> Your original question, whether they could be combined, was a good one;
> and I hoped you'd be right.  But I agree with David, they cannot, unless
> we sacrifice the guarantee that one or the other is there to give.  It's
> much like the relationship between ->prepare_write and ->commit_write.

Ok, so separate patch sets?


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/3] mm: tracking shared dirty pages
  2006-05-30 17:30             ` Christoph Lameter
@ 2006-05-30 17:41               ` Hugh Dickins
  0 siblings, 0 replies; 31+ messages in thread
From: Hugh Dickins @ 2006-05-30 17:41 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: David Howells, Peter Zijlstra, linux-mm, linux-kernel,
	Andrew Morton, Christoph Lameter, Martin Bligh, Nick Piggin,
	Linus Torvalds

On Tue, 30 May 2006, Christoph Lameter wrote:
> On Tue, 30 May 2006, Hugh Dickins wrote:
> 
> > Your original question, whether they could be combined, was a good one;
> > and I hoped you'd be right.  But I agree with David, they cannot, unless
> > we sacrifice the guarantee that one or the other is there to give.  It's
> > much like the relationship between ->prepare_write and ->commit_write.
> 
> Ok, so separate patch sets?

They are separate patch sets (and rc5-mm1 has David's without Peter's).
But they trample on nearby areas and share infrastructure, so I'm happy
that they're looked at together.  Peter has helpfully arranged his to
go on top of David's: that can be reversed later if it's decided that
Peter's is wanted but David's not.

Hugh

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/3] mm: tracking shared dirty pages
  2006-05-30 16:26       ` David Howells
  2006-05-30 17:02         ` Christoph Lameter
@ 2006-05-30 17:56         ` David Howells
  2006-05-30 20:21           ` Christoph Lameter
  1 sibling, 1 reply; 31+ messages in thread
From: David Howells @ 2006-05-30 17:56 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: David Howells, Peter Zijlstra, linux-mm, linux-kernel,
	Hugh Dickins, Andrew Morton, Christoph Lameter, Martin Bligh,
	Nick Piggin, Linus Torvalds

Christoph Lameter <clameter@sgi.com> wrote:

> On Tue, 30 May 2006, David Howells wrote:
> 
> > > If set_page_dirty cannot reserve the page then we know that some severe
> > > action is required. The FS method set_page_dirty() could:
> > 
> > But by the time set_page_dirty() is called, it's too late as the code
> > currently stands.  We've already marked the PTE writable and dirty.  The
> > page_mkwrite() op is called _first_.
> 
> We are in set_page_dirty and this would be part of set_page_dirty 
> processing.

Eh?  What do you mean "We are in set_page_dirty"?

The main caller of page_mkwrite() is do_wp_page().  With my patch, this
function does the following steps in order:

	(1) Invoke page_mkwrite() if available, giving SIGBUS on error.

	(2) Mark the PTE writable and dirty.

	(3) Unlock the page table.

Then with Peter's patch:

	(4) Invoke set_page_dirty_balance().

We shouldn't be having to handle an error at step (4), and we shouldn't be
marking the page dirty before step (2).

In fact, it could be argued that we shouldn't be marking the page dirty until
it is actually dirty, but I can't see any way to do that atomically short of
emulating the store instruction without hardware support.
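
In code terms the ordering is roughly this (a condensed sketch of the
do_wp_page() path with both patches applied, not the exact source):

	/* (1) give the filesystem a chance to refuse before anything
	 *     becomes visible in the page tables */
	if (vma->vm_ops && vma->vm_ops->page_mkwrite &&
	    vma->vm_ops->page_mkwrite(vma, old_page) < 0)
		return VM_FAULT_SIGBUS;

	/* (2) mark the PTE writable and dirty */
	entry = maybe_mkwrite(pte_mkdirty(orig_pte), vma);
	ptep_set_access_flags(vma, address, page_table, entry, 1);

	/* (3) unlock the page table */
	pte_unmap_unlock(page_table, ptl);

	/* (4) dirty the struct page and throttle if need be */
	set_page_dirty_balance(dirty_page);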

> > > 2. Track down all processes that use the mapping (or maybe less
> > 
> > That's bad, even if you restrict it to those that have MAP_SHARED and
> > PROT_WRITE set.  They should not be terminated if they haven't attempted to
> > write to the mapping.
> 
> It's bad, but running out of space is an exceptional situation. We do 
> similar contortions when we run out of memory. As I said: one can track 
> down the processes that have dirtied a pte pointing to the page in 
> question and just terminate those and remove the page.

In general, we really really shouldn't just remove the page.  The only place I
would consider it appropriate to destroy a dirty page is if we can't write it
back because the backing store is no longer available due to an event we
cannot predict (network filesystems for example).

We should operate on preventative measures rather than curative ones: we
should detect ENOSPC in advance and signal the process doing the write
that it's not permitted to proceed.  We shouldn't go and kill all the
processes that might write to that mapping.

Prevention is almost always cleaner than trying to fix things up afterwards.

> > What's wrong with my suggestion anyway?
> 
> Adds yet another method with functionality that for the most part 
> is the same as set_page_dirty().

It's not exactly the same.  It's almost the same difference as between
prepare_write() and commit_write().  Currently it's very much the same
because, as I understand it, the PTE dirty flag is transferred to the page
struct when the PTE is destroyed (in zap_pte_range()).
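
That transfer looks roughly like this (simplified from zap_pte_range()
in mm/memory.c):

	pte_t ptent = ptep_get_and_clear(mm, addr, pte);

	/* propagate the hardware dirty bit to the struct page before
	 * the PTE disappears */
	if (pte_dirty(ptent))
		set_page_dirty(page);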

Actually, I'm not sure that calling set_page_dirty() at the bottom of
do_wp_page() is necessarily a good idea.  It's possible that the page will be
marked dirty in do_wp_page() and then will get written back before the write
actually succeeds.  In other words the page may be marked dirty and cleaned up
all before the modification _actually_ occurs.  On the other hand, the common
case is probably that the store instruction will beat the writeback.

Of course, if you're proposing to call set_page_dirty() before the PTE is
modified, that would work.  But you still have to make sure that it isn't
called under spinlock, and doing that would make zap_pte_range() potentially
indulge in lock thrashing.

> The advantage of such a method seems to be that it reserves filesystem 
> space for pages that could potentially be written to. This allows the 
> filesystem to accurately deal with out-of-space situations (a very rare 
> condition; is this really justifiable?). Maybe having already reserved 
> space could speed up the real dirtying of pages?

It's better than having unwritable pages sitting around in the pagecache
because there's no space on the backing device to write them back.  At
least this gives a graceful way to handle the situation.

Destroying uncommitted data isn't really an acceptable answer.

David

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/3] mm: tracking shared dirty pages
  2006-05-30 17:56         ` David Howells
@ 2006-05-30 20:21           ` Christoph Lameter
  0 siblings, 0 replies; 31+ messages in thread
From: Christoph Lameter @ 2006-05-30 20:21 UTC (permalink / raw)
  To: David Howells
  Cc: Peter Zijlstra, linux-mm, linux-kernel, Hugh Dickins,
	Andrew Morton, Christoph Lameter, Martin Bligh, Nick Piggin,
	Linus Torvalds

On Tue, 30 May 2006, David Howells wrote:

> Christoph Lameter <clameter@sgi.com> wrote:
> 
> > On Tue, 30 May 2006, David Howells wrote:
> > 
> > > > If set_page_dirty cannot reserve the page then we know that some severe
> > > > action is required. The FS method set_page_dirty() could:
> > > 
> > > But by the time set_page_dirty() is called, it's too late as the code
> > > currently stands.  We've already marked the PTE writable and dirty.  The
> > > page_mkwrite() op is called _first_.
> > 
> > We are in set_page_dirty and this would be part of set_page_dirty 
> > processing.
> 
> Eh?  What do you mean "We are in set_page_dirty"?

We could do the reservation as part of the set_page_dirty FS method.

> Actually, I'm not sure that calling set_page_dirty() at the bottom of
> do_wp_page() is necessarily a good idea.  It's possible that the page will be
> marked dirty in do_wp_page() and then will get written back before the write
> actually succeeds.  In other words the page may be marked dirty and cleaned up
> all before the modification _actually_ occurs.  On the other hand, the common
> case is probably that the store instruction will beat the writeback.

Yes, we are aware of that case.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/3] mm: tracking dirty pages -v5
  2006-05-25 13:55 [PATCH 0/3] mm: tracking dirty pages -v5 Peter Zijlstra
                   ` (3 preceding siblings ...)
  2006-05-25 13:56 ` [PATCH 3/3] mm: msync cleanup Peter Zijlstra
@ 2006-06-06 20:06 ` Hugh Dickins
  2006-06-07 18:08   ` Peter Zijlstra
  2006-06-08 12:44   ` [PATCH] mm: tracking dirty pages -v6 Peter Zijlstra
  4 siblings, 2 replies; 31+ messages in thread
From: Hugh Dickins @ 2006-06-06 20:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm, linux-kernel, Andrew Morton, David Howells,
	Christoph Lameter, Martin Bligh, Nick Piggin, Linus Torvalds

On Thu, 25 May 2006, Peter Zijlstra wrote:
> 
> I hacked up a new version last night.
> 
> Its now based on top of David's patches, Hugh's approach of using the
> MAP_PRIVATE protections instead of the MAP_SHARED seems far superior indeed.

Although I like this approach, and your patches implementing it,
I have realized a snag with it, see below.  I hope someone can
make a suggestion that shows up my stupidity and melts away the
difficulty, but for now I'm stumped.

> Q: would it be feasable to do so for al shared mappings so we can remove
> the MAP_SHARED protections all together?

That would impose the overhead of additional write faults on those
cases (notably shared memory) where we just don't need it.

> They survive my simple testing, but esp. the msync cleanup might need some
> more attention.

Yes, I think you've cleaned up slightly too far, see remarks below.

> I post them now instead of after a little more testing because I'll not 
> have much time the coming few days to do so, and hoarding them does 
> nobody any good.

I'll mention the minor points first, then after come to the tiresome
issue I've spent so much time failing to resolve.

You tend to use get_page/put_page amidst code using page_cache_get/
page_cache_release.  Carry on: it sometimes looks odd, but I can't see
any way to impose consistency, short of abolishing one or the other
throughout the tree.  So don't worry about it.
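
(They are interchangeable underneath anyway; include/linux/pagemap.h
simply has

	#define page_cache_get(page)		get_page(page)
	#define page_cache_release(page)	put_page(page)

so it's purely a naming inconsistency, not a behavioural one.)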

It's irritating to have to be setting "dirty_page" near the page_mkwrite
code, then doing the set_page_dirty_balance separately at the end.  But
I agree with you and David: the page_mkwrite needs to be checked before,
and the set_page_dirty needs to be done after (when the pte is exposed
in the page table): otherwise page_mkclean would miss instances, and
the nr_dirty count drift slowly downwards away from reality.  Or, am
I mistaken?  Maybe, but the way you have it is exact (one could add a
BUG in zap_pte_range, that a VM_SHARED page with pte dirty is already
PageDirty; though I wouldn't recommend doing so unless testing), let's
keep it like that for now, maybe look into relaxing at a future date.
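
(The check I have in mind would be something like this in
zap_pte_range(), for testing only:

	if ((vma->vm_flags & VM_SHARED) && pte_dirty(ptent))
		BUG_ON(!PageDirty(page));

verifying that a dirty pte in a shared mapping is never left behind a
clean struct page.)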

You've got a minor cleanup to install_page, left over from an earlier
iteration: the cleanup looked okay, but it's of no relevance to your
patchset now, is it?  Just cut mm/fremap.c out of the patchset, I think.

You've taken the simplification of sys_msync a little too far, I believe:
you ought to try to reproduce the same errors as before, so MS_ASYNC
should be winding through the separate vmas like MS_SYNC, just to
report -ENOMEM if it crosses an unmapped area; and MS_INVALIDATE
used to say -EBUSY if VM_LOCKED, but that has disappeared.  (Perhaps
I've missed other such details, please recheck.)  Your comment should
say "Nor does it mark" instead of "Nor does it just marks".

I think remove the update_mmu_cache from page_mkclean_one as Christoph
suggested, leaving the lazy_mmu_prot_update: I cannot then justify why
do_wp_page's reuse case does update_mmu_cache when it's merely changing
the protection on the page, but it always has so I suppose it should
continue to do so; what you have in page_mkclean_one is more like
mprotecting than faulting.

Your is_shared_writable(vma) in mprotect_fixup is along the right
lines, but wrong: because at that point vma->vm_flags is the old one,
and may be omitting VM_WRITE when that is about to be added.  Perhaps
you should move the "vma->vm_flags = newflags" above it, or perhaps
you should change is_shared_writable to work on flags rather than vma
(as Linus' is_cow_mapping does).

And whenever I see that call to change_protection, I wonder whether
we should optimize it out when newprot is the same as the old
vma->vm_page_prot: we've not done so in the past, but perhaps now
we're adding cases when it's likely to be the same as before, we
should add that little optimization.
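
(That optimization would amount to no more than this in mprotect_fixup,
checked before vm_page_prot is updated; a sketch only:

	if (pgprot_val(newprot) != pgprot_val(vma->vm_page_prot))
		change_protection(vma, start, end, newprot);

so we skip the whole page table walk when nothing would change.)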

But the tiresome issue is the similar mapping_cap_account_dirty
check in do_mmap_pgoff.  Whereas page_mkwrite defaults to unset,
so that the restriction on vm_page_prot only applies to drivers
or filesystems which have opted in by declaring a page_mkwrite,
mapping->backing_dev_info (initialized in alloc_inode) defaults to
default_backing_dev_info, which does not have BDI_CAP_NO_ACCT_DIRTY.

Which is fine for ordinary filesystems, but means that all the strange
drivers with their own ->mmap are by default mapping_cap_account_dirty,
and therefore you will be adjusting their vm_page_prot; and some of
them have set vm_page_prot rather carefully e.g. with a NOCACHE bit.

This shouldn't be a problem for remap_pfn_range and vm_insert_page
users (vmas now marked VM_PFNMAP or VM_INSERTPAGE), since they should
have already set up their ptes in the ->mmap above, and have no further
use for vm_page_prot - though there might be odd cases, and it would
be preferable not to risk changing behaviour on them.

Where it's really a problem is for the various drivers with .nopage
methods.  In the old scheme (before I diverted you to follow
page_mkwrite), that could have been handled in do_no_page, by
checking page->mapping (there is then a question of race with truncate
on ordinary filesystem pages, which also lose page->mapping, but I
think that works out not to matter - though I'm not sure); but in
do_mmap_pgoff we do not know whether the filesystem or driver will
be dealing in pagecache pages or not.

There are several ways forward, but I don't know which to advise:
I'm hoping someone else will see an easy and safe solution which
I've been missing.

The "right" solution is for us to declare another backing_dev_info
with BDI_CAP_NO_ACCT_DIRTY set, and all those drivers point their
mapping->backing_dev_info to that instead in their .mmap.  But that
means your work becomes dependent on changes to a variety of drivers;
an obscure further requirement on drivers trying to enable mmap, when
there's too many things to get right already; and obscure bugs in any
drivers forgotten or outside the tree.
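
(Concretely, I mean something like the following, where mydrv is of
course a made-up driver:

	static struct backing_dev_info no_acct_dirty_bdi = {
		.capabilities = BDI_CAP_NO_ACCT_DIRTY,
	};

	static int mydrv_mmap(struct file *file, struct vm_area_struct *vma)
	{
		/* opt this mapping out of dirty accounting */
		file->f_mapping->backing_dev_info = &no_acct_dirty_bdi;
		/* ... the driver's usual mmap setup follows ... */
		return 0;
	}

and every such driver would need that one assignment added to its
.mmap or .open.)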

Or resurrect the VM_RESERVED flag we were thinking of scrapping, add
it in whichever drivers it's needed but missing, and check that?

Or is there anything we can key off instead, to default to this other
backing_dev_info from the core, without changing drivers?

Make character devices use the BDI_CAP_NO_ACCT_DIRTY info, leaving
block devices and files with the current default?  Perhaps, but I'm
not sure if that apportions all the cases correctly, and it would
sit better in UNIX than in Linux (which doesn't key much off the
block/character distinction, I think).

Check for ->fsync in the file_operations, use BDI_CAP_NO_ACCT_DIRTY
backing_dev_info when it's NULL?  That looks to me like it would
work pretty well; but definitely a hack, and again I'm not certain
it would get all the cases right (I haven't the faintest idea what
should happen with arch/powerpc/platforms/cell/spufs, for example).

Or give up, apologize, and ask you to go back to testing in do_no_page,
instead of following the page_mkwrite vm_page_prot technique?

I don't know: sorry.

Hugh

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/3] mm: tracking dirty pages -v5
  2006-06-06 20:06 ` [PATCH 0/3] mm: tracking dirty pages -v5 Hugh Dickins
@ 2006-06-07 18:08   ` Peter Zijlstra
  2006-06-08 12:44   ` [PATCH] mm: tracking dirty pages -v6 Peter Zijlstra
  1 sibling, 0 replies; 31+ messages in thread
From: Peter Zijlstra @ 2006-06-07 18:08 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: linux-mm, linux-kernel, Andrew Morton, David Howells,
	Christoph Lameter, Martin Bligh, Nick Piggin, Linus Torvalds

On Tue, 2006-06-06 at 21:06 +0100, Hugh Dickins wrote:

> You tend to use get_page/put_page amidst code using page_cache_get/
> page_cache_release.  Carry on: it sometimes looks odd, but I can't see
> any way to impose consistency, short of abolishing one or the other
> throughout the tree.  So don't worry about it.

Noticed that myself too and came to the same conclusion; thanks for the
confirmation though.

> You've got a minor cleanup to install_page, left over from an earlier
> iteration: the cleanup looked okay, but of no relevance to your patchset
> now, is it?  Just cut mm/fremap.c out of the patchset I think.

OK, unless we go back to the previous way I'll send this tiny cleanup as
a separate patch to Andrew.

> You've taken the simplification of sys_msync a little too far, I believe:
> you ought to try to reproduce the same errors as before, so MS_ASYNC
> should be winding through the separate vmas like MS_SYNC, just to
> report -ENOMEM if it crosses an unmapped area; and MS_INVALIDATE
> used to say -EBUSY if VM_LOCKED, but that has disappeared.  (Perhaps
> I've missed other such details, please recheck.)

Ah, yes, I've noticed this too; I just wasn't sure whether this would be
wanted or not. It's fixed now, thanks!

> Your comment should
> say "Nor does it mark" instead of "Nor does it just marks".

Hehe, thanks, please do point out my mistakes with the English language.
I have good enough command to convey most of what I intend, but I'm not a
native speaker.

> Your is_shared_writable(vma) in mprotect_fixup is along the right
> lines, but wrong: because at that point vma->vm_flags is the old one,
> and may be omitting VM_WRITE when that is about to be added.  Perhaps
> you should move the "vma->vm_flags = newflags" above it, or perhaps
> you should change is_shared_writable to work on flags rather than vma
> (as Linus' is_cow_mapping does).

How odd, I have the distinct recollection of actually moving that
assignment up a few lines; must have been late or something. Thanks for
pointing this out, I would never have found it again, thinking as I did
that it was already dealt with.

<snip: big tiresome story/>

Well, you got me. I don't know either; the only thing I can come up with
is making the breakage compile-time (for 3rd-party modules) instead of
subtle run-time, but it's still not pretty.

/me looks around at assorted VM gurus; any ideas out there?

If by tomorrow morning CET nobody has spoken up, I'll just go ahead and
accept Hugh's apology :-), that is, revert to my original way of doing
it. (I can always go back to this scheme if some smart but slower-working
brain manages a solution.)


Peter


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH] mm: tracking dirty pages -v6
  2006-06-06 20:06 ` [PATCH 0/3] mm: tracking dirty pages -v5 Hugh Dickins
  2006-06-07 18:08   ` Peter Zijlstra
@ 2006-06-08 12:44   ` Peter Zijlstra
  2006-06-08 13:02     ` Peter Zijlstra
                       ` (2 more replies)
  1 sibling, 3 replies; 31+ messages in thread
From: Peter Zijlstra @ 2006-06-08 12:44 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: linux-mm, linux-kernel, Andrew Morton, David Howells,
	Christoph Lameter, Martin Bligh, Nick Piggin, Linus Torvalds


From: Peter Zijlstra <a.p.zijlstra@chello.nl>

People expressed the need to track dirty pages in shared mappings.

Linus outlined the general idea of doing that through making clean
writable pages write-protected and taking the write fault.

This patch does exactly that, it makes pages in a shared writable
mapping write-protected. On write-fault the pages are marked dirty and
made writable. When the pages get synced with their backing store, the
write-protection is re-instated.

It survives a simple test and shows the dirty pages in /proc/vmstat.

Changes in -v6

 - make page_mkclean_one() modify the pte more like change_pte_range() 
   (suggested by Christoph Lameter)
 - made is_shared_writable() take vm_flags; it now resembles is_cow_mapping().
 - fixed the mprotect() bug (spotted by Hugh Dickins)
 - hopefully fixed the tiresome issue of do_mmap_pgoff() trampling on
   driver specific vm_page_prot settings (spotted by Hugh Dickins)

   This does add the requirement that f_op->mmap() may not 'reset' 
   vm_page_prot from vm_flags. I checked all in-kernel users and none seem
   to do this.

 - made a new version of the page_mkwrite() patch to go on top of all this.
   This so that Linus could merge this very early on in 2.6.18.

Changes in -v5

 - rename page_wrprotect() to page_mkclean() (suggested by Nick Piggin)
 - added comment to test_clear_page_dirty() (Andrew Morton)
 - cleanup page_wrprotect() (Andrew Morton)
 - renamed VM_SharedWritable() to is_shared_writable()
 - fs/buffer.c try_to_free_buffers(): remove clear_page_dirty() from under
   ->private_lock. This seems to be safe, since ->private_lock is used to
   serialize access to the buffers, not the page itself.
 - rebased on top of David Howells' page_mkwrite() patch.

Changes in -v4:

 - small cleanup as suggested by Christoph Lameter.

Changes in -v3:

 - move set_page_dirty() outside pte lock (suggested by Christoph Lameter)

Changes in -v2:

 - only wrprotect pages from dirty capable mappings. (Nick Piggin)
 - move the writefault handling from do_wp_page() into handle_pte_fault(). 
   (Nick Piggin)
 - revert to the old install_page interface. (Nick Piggin)
 - also clear the pte dirty bit when we make pages read-only again.
   (spotted by Rik van Riel)
 - make page_wrprotect() return the number of reprotected ptes.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

 drivers/char/drm/drm_vm.c |    5 ++-
 fs/buffer.c               |    2 -
 include/linux/mm.h        |    5 +++
 include/linux/rmap.h      |    8 +++++
 mm/memory.c               |   45 +++++++++++++++++++++++---------
 mm/mmap.c                 |   28 ++++++++++++++++++--
 mm/mprotect.c             |   14 ++++++++--
 mm/page-writeback.c       |   13 +++++++--
 mm/rmap.c                 |   64 ++++++++++++++++++++++++++++++++++++++++++++++
 9 files changed, 163 insertions(+), 21 deletions(-)

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h	2006-06-08 13:59:35.000000000 +0200
+++ linux-2.6/include/linux/mm.h	2006-06-08 13:59:37.000000000 +0200
@@ -183,6 +183,11 @@ extern unsigned int kobjsize(const void 
 #define VM_SequentialReadHint(v)	((v)->vm_flags & VM_SEQ_READ)
 #define VM_RandomReadHint(v)		((v)->vm_flags & VM_RAND_READ)
 
+static inline int is_shared_writable(unsigned int flags)
+{
+	return (flags & (VM_SHARED|VM_WRITE)) == (VM_SHARED|VM_WRITE);
+}
+
 /*
  * mapping from the currently active vm_flags protection bits (the
  * low four bits) to a page protection mask..
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c	2006-06-08 13:59:35.000000000 +0200
+++ linux-2.6/mm/memory.c	2006-06-08 13:59:37.000000000 +0200
@@ -1445,25 +1445,31 @@ static int do_wp_page(struct mm_struct *
 {
 	struct page *old_page, *new_page;
 	pte_t entry;
-	int ret = VM_FAULT_MINOR;
+	int reuse = 0, ret = VM_FAULT_MINOR;
+	struct page *dirty_page = NULL;
 
 	old_page = vm_normal_page(vma, address, orig_pte);
 	if (!old_page)
 		goto gotten;
 
-	if (PageAnon(old_page) && !TestSetPageLocked(old_page)) {
-		int reuse = can_share_swap_page(old_page);
+	if (vma->vm_flags & VM_SHARED) {
+		reuse = 1;
+		dirty_page = old_page;
+		get_page(dirty_page);
+	} else if (PageAnon(old_page) && !TestSetPageLocked(old_page)) {
+		reuse = can_share_swap_page(old_page);
 		unlock_page(old_page);
-		if (reuse) {
-			flush_cache_page(vma, address, pte_pfn(orig_pte));
-			entry = pte_mkyoung(orig_pte);
-			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
-			ptep_set_access_flags(vma, address, page_table, entry, 1);
-			update_mmu_cache(vma, address, entry);
-			lazy_mmu_prot_update(entry);
-			ret |= VM_FAULT_WRITE;
-			goto unlock;
-		}
+	}
+
+	if (reuse) {
+		flush_cache_page(vma, address, pte_pfn(orig_pte));
+		entry = pte_mkyoung(orig_pte);
+		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+		ptep_set_access_flags(vma, address, page_table, entry, 1);
+		update_mmu_cache(vma, address, entry);
+		lazy_mmu_prot_update(entry);
+		ret |= VM_FAULT_WRITE;
+		goto unlock;
 	}
 
 	/*
@@ -1518,6 +1524,10 @@ gotten:
 		page_cache_release(old_page);
 unlock:
 	pte_unmap_unlock(page_table, ptl);
+	if (dirty_page) {
+		set_page_dirty(dirty_page);
+		put_page(dirty_page);
+	}
 	return ret;
 oom:
 	if (old_page)
@@ -2046,6 +2056,7 @@ static int do_no_page(struct mm_struct *
 	unsigned int sequence = 0;
 	int ret = VM_FAULT_MINOR;
 	int anon = 0;
+	struct page *dirty_page = NULL;
 
 	pte_unmap(page_table);
 	BUG_ON(vma->vm_flags & VM_PFNMAP);
@@ -2127,6 +2138,10 @@ retry:
 		} else {
 			inc_mm_counter(mm, file_rss);
 			page_add_file_rmap(new_page);
+			if (write_access) {
+				dirty_page = new_page;
+				get_page(dirty_page);
+			}
 		}
 	} else {
 		/* One of our sibling threads was faster, back out. */
@@ -2139,6 +2154,10 @@ retry:
 	lazy_mmu_prot_update(entry);
 unlock:
 	pte_unmap_unlock(page_table, ptl);
+	if (dirty_page) {
+		set_page_dirty(dirty_page);
+		put_page(dirty_page);
+	}
 	return ret;
 oom:
 	page_cache_release(new_page);
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c	2006-06-08 13:59:35.000000000 +0200
+++ linux-2.6/mm/mmap.c	2006-06-08 14:08:04.000000000 +0200
@@ -25,6 +25,7 @@
 #include <linux/mount.h>
 #include <linux/mempolicy.h>
 #include <linux/rmap.h>
+#include <linux/backing-dev.h>
 
 #include <asm/uaccess.h>
 #include <asm/cacheflush.h>
@@ -1065,10 +1066,13 @@ munmap_back:
 	vma->vm_start = addr;
 	vma->vm_end = addr + len;
 	vma->vm_flags = vm_flags;
-	vma->vm_page_prot = protection_map[vm_flags & 0x0f];
+	vma->vm_page_prot = protection_map[vm_flags &
+				(VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)];
 	vma->vm_pgoff = pgoff;
 
 	if (file) {
+		struct address_space *mapping = NULL;
+
 		error = -EINVAL;
 		if (vm_flags & (VM_GROWSDOWN|VM_GROWSUP))
 			goto free_vma;
@@ -1080,6 +1084,25 @@ munmap_back:
 		}
 		vma->vm_file = file;
 		get_file(file);
+
+		/* Trap writes to shared writable mappings that have
+		 * dirty page accounting.
+		 *
+		 * f_op->open() - sets the backing_dev_info
+		 * f_op->mmap() - modifies vm_page_prot; but will not reset
+		 *                from vm_flags.
+		 *
+		 * Hence between the two calls (here) it is save to modify
+		 * vm_page_prot depending on backing_dev_info capabilities.
+		 *
+		 * shmem_backing_dev_info does have BDI_CAP_NO_ACCT_DIRTY.
+		 */
+		if (is_shared_writable(vm_flags))
+			mapping = file->f_mapping;
+		if (mapping && mapping_cap_account_dirty(mapping))
+			vma->vm_page_prot = protection_map[vm_flags &
+				(VM_READ|VM_WRITE|VM_EXEC)];
+
 		error = file->f_op->mmap(file, vma);
 		if (error)
 			goto unmap_and_free_vma;
@@ -1921,7 +1944,8 @@ unsigned long do_brk(unsigned long addr,
 	vma->vm_end = addr + len;
 	vma->vm_pgoff = pgoff;
 	vma->vm_flags = flags;
-	vma->vm_page_prot = protection_map[flags & 0x0f];
+	vma->vm_page_prot = protection_map[flags &
+				(VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)];
 	vma_link(mm, vma, prev, rb_link, rb_parent);
 out:
 	mm->total_vm += len >> PAGE_SHIFT;
Index: linux-2.6/mm/mprotect.c
===================================================================
--- linux-2.6.orig/mm/mprotect.c	2006-06-08 13:59:35.000000000 +0200
+++ linux-2.6/mm/mprotect.c	2006-06-08 14:15:24.000000000 +0200
@@ -19,6 +19,7 @@
 #include <linux/mempolicy.h>
 #include <linux/personality.h>
 #include <linux/syscalls.h>
+#include <linux/backing-dev.h>
 
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
@@ -106,6 +107,8 @@ mprotect_fixup(struct vm_area_struct *vm
 	unsigned long oldflags = vma->vm_flags;
 	long nrpages = (end - start) >> PAGE_SHIFT;
 	unsigned long charged = 0;
+	unsigned int mask;
+	struct address_space *mapping = NULL;
 	pgprot_t newprot;
 	pgoff_t pgoff;
 	int error;
@@ -132,8 +135,6 @@ mprotect_fixup(struct vm_area_struct *vm
 		}
 	}
 
-	newprot = protection_map[newflags & 0xf];
-
 	/*
 	 * First try to merge with previous and/or next vma.
 	 */
@@ -160,6 +161,15 @@ mprotect_fixup(struct vm_area_struct *vm
 	}
 
 success:
+	/* Don't make the VMA automatically writable if it's shared. */
+	mask = VM_READ|VM_WRITE|VM_EXEC|VM_SHARED;
+	if (is_shared_writable(newflags) && vma->vm_file)
+		mapping = vma->vm_file->f_mapping;
+	if (mapping && mapping_cap_account_dirty(mapping))
+		mask &= ~VM_SHARED;
+
+	newprot = protection_map[newflags & mask];
+
 	/*
 	 * vm_flags and vm_page_prot are protected by the mmap_sem
 	 * held in write mode.
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c	2006-06-08 13:59:35.000000000 +0200
+++ linux-2.6/mm/page-writeback.c	2006-06-08 13:59:37.000000000 +0200
@@ -29,6 +29,7 @@
 #include <linux/sysctl.h>
 #include <linux/cpu.h>
 #include <linux/syscalls.h>
+#include <linux/rmap.h>
 
 /*
  * The maximum number of pages to writeout in a single bdflush/kupdate
@@ -725,8 +726,14 @@ int test_clear_page_dirty(struct page *p
 						page_index(page),
 						PAGECACHE_TAG_DIRTY);
 			write_unlock_irqrestore(&mapping->tree_lock, flags);
-			if (mapping_cap_account_dirty(mapping))
+			/*
+			 * We can continue to use `mapping' here because the
+			 * page is locked, which pins the address_space
+			 */
+			if (mapping_cap_account_dirty(mapping)) {
+				page_mkclean(page);
 				dec_page_state(nr_dirty);
+			}
 			return 1;
 		}
 		write_unlock_irqrestore(&mapping->tree_lock, flags);
@@ -756,8 +763,10 @@ int clear_page_dirty_for_io(struct page 
 
 	if (mapping) {
 		if (TestClearPageDirty(page)) {
-			if (mapping_cap_account_dirty(mapping))
+			if (mapping_cap_account_dirty(mapping)) {
+				page_mkclean(page);
 				dec_page_state(nr_dirty);
+			}
 			return 1;
 		}
 		return 0;
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c	2006-06-08 13:59:35.000000000 +0200
+++ linux-2.6/mm/rmap.c	2006-06-08 13:59:37.000000000 +0200
@@ -472,6 +472,70 @@ int page_referenced(struct page *page, i
 	return referenced;
 }
 
+static int page_mkclean_one(struct page *page, struct vm_area_struct *vma)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long address;
+	pte_t *pte, entry;
+	spinlock_t *ptl;
+	int ret = 0;
+
+	address = vma_address(page, vma);
+	if (address == -EFAULT)
+		goto out;
+
+	pte = page_check_address(page, mm, address, &ptl);
+	if (!pte)
+		goto out;
+
+	if (!pte_write(*pte))
+		goto unlock;
+
+	entry = ptep_get_and_clear(mm, address, pte);
+	entry = pte_mkclean(pte_wrprotect(entry));
+	ptep_establish(vma, address, pte, entry);
+	lazy_mmu_prot_update(entry);
+	ret = 1;
+
+unlock:
+	pte_unmap_unlock(pte, ptl);
+out:
+	return ret;
+}
+
+static int page_mkclean_file(struct address_space *mapping, struct page *page)
+{
+	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	struct vm_area_struct *vma;
+	struct prio_tree_iter iter;
+	int ret = 0;
+
+	BUG_ON(PageAnon(page));
+
+	spin_lock(&mapping->i_mmap_lock);
+	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
+		if (is_shared_writable(vma->vm_flags))
+			ret += page_mkclean_one(page, vma);
+	}
+	spin_unlock(&mapping->i_mmap_lock);
+	return ret;
+}
+
+int page_mkclean(struct page *page)
+{
+	int ret = 0;
+
+	BUG_ON(!PageLocked(page));
+
+	if (page_mapped(page)) {
+		struct address_space *mapping = page_mapping(page);
+		if (mapping)
+			ret = page_mkclean_file(mapping, page);
+	}
+
+	return ret;
+}
+
 /**
  * page_set_anon_rmap - setup new anonymous rmap
  * @page:	the page to add the mapping to
Index: linux-2.6/include/linux/rmap.h
===================================================================
--- linux-2.6.orig/include/linux/rmap.h	2006-06-08 13:59:35.000000000 +0200
+++ linux-2.6/include/linux/rmap.h	2006-06-08 13:59:37.000000000 +0200
@@ -105,6 +105,14 @@ pte_t *page_check_address(struct page *,
  */
 unsigned long page_address_in_vma(struct page *, struct vm_area_struct *);
 
+/*
+ * Cleans the PTEs of shared mappings.
+ * (and since clean PTEs should also be readonly, write protects them too)
+ *
+ * returns the number of cleaned PTEs.
+ */
+int page_mkclean(struct page *);
+
 #else	/* !CONFIG_MMU */
 
 #define anon_vma_init()		do {} while (0)
Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c	2006-06-08 13:59:35.000000000 +0200
+++ linux-2.6/fs/buffer.c	2006-06-08 13:59:37.000000000 +0200
@@ -2985,6 +2985,7 @@ int try_to_free_buffers(struct page *pag
 
 	spin_lock(&mapping->private_lock);
 	ret = drop_buffers(page, &buffers_to_free);
+	spin_unlock(&mapping->private_lock);
 	if (ret) {
 		/*
 		 * If the filesystem writes its buffers by hand (eg ext3)
@@ -2996,7 +2997,6 @@ int try_to_free_buffers(struct page *pag
 		 */
 		clear_page_dirty(page);
 	}
-	spin_unlock(&mapping->private_lock);
 out:
 	if (buffers_to_free) {
 		struct buffer_head *bh = buffers_to_free;
Index: linux-2.6/drivers/char/drm/drm_vm.c
===================================================================
--- linux-2.6.orig/drivers/char/drm/drm_vm.c	2006-04-07 10:54:38.000000000 +0200
+++ linux-2.6/drivers/char/drm/drm_vm.c	2006-06-08 14:11:37.000000000 +0200
@@ -574,7 +574,10 @@ int drm_mmap(struct file *filp, struct v
 #else
 		/* Ye gads this is ugly.  With more thought
 		   we could move this up higher and use
-		   `protection_map' instead.  */
+		   `protection_map' instead.
+
+		   One cannot; drivers are not allowed to 'reset'
+		   vm_page_prot from vm_flags. */
 		vma->vm_page_prot =
 		    __pgprot(pte_val
 			     (pte_wrprotect



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] mm: tracking dirty pages -v6
  2006-06-08 12:44   ` [PATCH] mm: tracking dirty pages -v6 Peter Zijlstra
@ 2006-06-08 13:02     ` Peter Zijlstra
  2006-06-08 16:53     ` Christoph Lameter
  2006-06-08 20:10     ` Nate Diller
  2 siblings, 0 replies; 31+ messages in thread
From: Peter Zijlstra @ 2006-06-08 13:02 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: linux-mm, linux-kernel, Andrew Morton, David Howells,
	Christoph Lameter, Martin Bligh, Nick Piggin, Linus Torvalds

If this one still has some problems, there is one more thing we could try
before going back to the old way of doing things.

I found this 'gem' in the drm code:

        vma->vm_page_prot =
            __pgprot(pte_val
                     (pte_wrprotect
                      (__pte(pgprot_val(vma->vm_page_prot)))));

which does exactly what is needed.
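
That is, it strips the write bit from the existing vm_page_prot instead
of rebuilding it from vm_flags. A helper would make it legible; no such
pgprot_wrprotect() exists today, this is just what the drm expression
amounts to:

	static inline pgprot_t pgprot_wrprotect(pgprot_t prot)
	{
		return __pgprot(pte_val(pte_wrprotect(__pte(pgprot_val(prot)))));
	}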

OTOH, my alternate version of the mprotect fix leaves dirty pages
writable, which saves a few faults.

Peter


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] mm: tracking dirty pages -v6
  2006-06-08 12:44   ` [PATCH] mm: tracking dirty pages -v6 Peter Zijlstra
  2006-06-08 13:02     ` Peter Zijlstra
@ 2006-06-08 16:53     ` Christoph Lameter
  2006-06-08 20:10     ` Nate Diller
  2 siblings, 0 replies; 31+ messages in thread
From: Christoph Lameter @ 2006-06-08 16:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Hugh Dickins, linux-mm, linux-kernel, Andrew Morton,
	David Howells, Martin Bligh, Nick Piggin, Linus Torvalds

1. The case that a vma is shared is unlikely.

2. Typo fix in mmap.c

3. This may be a bit controversial but it does not seem to
   make sense to use the update_mmu_cache macro when we reuse
   the page. We are only fiddling around with the protections,
   the dirty and accessed bits.

   With the call to update_mmu_cache the way of using the macros
   would be different from mprotect() and page_mkclean(). I'd
   rather have everything work the same way. If this breaks on some
   arches then also mprotect and page_mkclean() are broken.
   The use of mprotect() is rare, we may have breakage in some
   arches that we just have not seen yet.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c	2006-06-08 09:12:13.000000000 -0700
+++ linux-2.6/mm/memory.c	2006-06-08 09:50:17.000000000 -0700
@@ -1452,7 +1452,7 @@ static int do_wp_page(struct mm_struct *
 	if (!old_page)
 		goto gotten;
 
-	if (vma->vm_flags & VM_SHARED) {
+	if (unlikely(vma->vm_flags & VM_SHARED)) {
 		reuse = 1;
 		dirty_page = old_page;
 		get_page(dirty_page);
@@ -1466,7 +1466,6 @@ static int do_wp_page(struct mm_struct *
 		entry = pte_mkyoung(orig_pte);
 		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
 		ptep_set_access_flags(vma, address, page_table, entry, 1);
-		update_mmu_cache(vma, address, entry);
 		lazy_mmu_prot_update(entry);
 		ret |= VM_FAULT_WRITE;
 		goto unlock;
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c	2006-06-08 09:15:54.000000000 -0700
+++ linux-2.6/mm/mmap.c	2006-06-08 09:17:41.000000000 -0700
@@ -1092,7 +1092,7 @@ munmap_back:
 		 * f_op->mmap() - modifies vm_page_prot; but will not reset
 		 *                from vm_flags.
 		 *
-		 * Hence between the two calls (here) it is save to modify
+		 * Hence between the two calls (here) it is safe to modify
 		 * vm_page_prot depending on backing_dev_info capabilities.
 		 *
 		 * shmem_backing_dev_info does have BDI_CAP_NO_ACCT_DIRTY.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] mm: tracking dirty pages -v6
  2006-06-08 12:44   ` [PATCH] mm: tracking dirty pages -v6 Peter Zijlstra
  2006-06-08 13:02     ` Peter Zijlstra
  2006-06-08 16:53     ` Christoph Lameter
@ 2006-06-08 20:10     ` Nate Diller
  2006-06-08 20:20       ` Linus Torvalds
  2 siblings, 1 reply; 31+ messages in thread
From: Nate Diller @ 2006-06-08 20:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Hugh Dickins, linux-mm, linux-kernel, Andrew Morton,
	David Howells, Christoph Lameter, Martin Bligh, Nick Piggin,
	Linus Torvalds

On 6/8/06, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>
> From: Peter Zijlstra <a.p.zijlstra@chello.nl>
>
> People expressed the need to track dirty pages in shared mappings.
>
> Linus outlined the general idea of doing that through making clean
> writable pages write-protected and taking the write fault.
>
> This patch does exactly that, it makes pages in a shared writable
> mapping write-protected. On write-fault the pages are marked dirty and
> made writable. When the pages get synced with their backing store, the
> write-protection is re-instated.

Does this mean that processes dirtying pages via mmap are now subject
to write throttling?  That could dramatically change the performance
for tasks with a working set larger than 10% of memory.

NATE

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] mm: tracking dirty pages -v6
  2006-06-08 20:10     ` Nate Diller
@ 2006-06-08 20:20       ` Linus Torvalds
  0 siblings, 0 replies; 31+ messages in thread
From: Linus Torvalds @ 2006-06-08 20:20 UTC (permalink / raw)
  To: Nate Diller
  Cc: Peter Zijlstra, Hugh Dickins, linux-mm, linux-kernel,
	Andrew Morton, David Howells, Christoph Lameter, Martin Bligh,
	Nick Piggin



On Thu, 8 Jun 2006, Nate Diller wrote:
> 
> Does this mean that processes dirtying pages via mmap are now subject
> to write throttling?  That could dramatically change the performance
> for tasks with a working set larger than 10% of memory.

Exactly. Except it's not a "working set", it's a "dirty set".

Which is the whole (and only) point of the whole patch.

If you want to live on the edge, you can set the dirty_balance trigger to 
something much higher, it's entirely configurable if I remember correctly.

		Linus

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2006-06-08 20:21 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-05-25 13:55 [PATCH 0/3] mm: tracking dirty pages -v5 Peter Zijlstra
2006-05-25 13:55 ` [PATCH -1/3] mm: page_mkwrite Peter Zijlstra
2006-05-25 13:55 ` [PATCH 1/3] mm: tracking shared dirty pages Peter Zijlstra
2006-05-25 16:21   ` Christoph Lameter
2006-05-25 17:00     ` Peter Zijlstra
2006-05-25 17:03       ` Christoph Lameter
2006-05-25 16:27   ` Christoph Lameter
2006-05-25 17:03     ` Peter Zijlstra
2006-05-25 17:06       ` Christoph Lameter
2006-05-26  2:28     ` Jeff Anderson-Lee
2006-05-26  2:33       ` Christoph Lameter
2006-05-26 14:33   ` David Howells
2006-05-26 15:39     ` Christoph Lameter
2006-05-30  8:00     ` David Howells
2006-05-30 15:38       ` Christoph Lameter
2006-05-30 16:26       ` David Howells
2006-05-30 17:02         ` Christoph Lameter
2006-05-30 17:25           ` Hugh Dickins
2006-05-30 17:30             ` Christoph Lameter
2006-05-30 17:41               ` Hugh Dickins
2006-05-30 17:56         ` David Howells
2006-05-30 20:21           ` Christoph Lameter
2006-05-25 13:56 ` [PATCH 2/3] mm: balance " Peter Zijlstra
2006-05-25 13:56 ` [PATCH 3/3] mm: msync cleanup Peter Zijlstra
2006-06-06 20:06 ` [PATCH 0/3] mm: tracking dirty pages -v5 Hugh Dickins
2006-06-07 18:08   ` Peter Zijlstra
2006-06-08 12:44   ` [PATCH] mm: tracking dirty pages -v6 Peter Zijlstra
2006-06-08 13:02     ` Peter Zijlstra
2006-06-08 16:53     ` Christoph Lameter
2006-06-08 20:10     ` Nate Diller
2006-06-08 20:20       ` Linus Torvalds
