linux-kernel.vger.kernel.org archive mirror
* page_mkwrite seems broken
@ 2005-02-09 14:28 Hugh Dickins
  2005-10-24 15:16 ` what happened to page_mkwrite? - was: " Anton Altaparmakov
  2005-10-24 15:26 ` David Howells
  0 siblings, 2 replies; 24+ messages in thread
From: Hugh Dickins @ 2005-02-09 14:28 UTC (permalink / raw)
  To: David Howells
  Cc: Christoph Hellwig, Anton Altaparmakov, Andrew Morton, linux-kernel

On Fri, 4 Feb 2005, Hugh Dickins wrote in another thread:
> 
> Isn't this exactly what David Howells' page_mkwrite stuff in -mm's
> add-page-becoming-writable-notification.patch is designed for?
> 
> Though it looks a little broken to me as it stands (beyond the two
> fixup patches already there).  I've not found time to double-check
> or test, apologies in advance if I'm libelling, but...
> 
> (a) I thought the prot bits do_nopage gives a pte in a shared writable
>     mapping include write permission, even when it's a read fault:
>     that can't be allowed if there's a page_mkwrite.
> 
> (b) I don't understand how do_wp_page's "reuse" logic for whether it
>     can just go ahead and use the existing anonymous page, would have
>     any relevance to calling page_mkwrite on a shared writable page,
>     which must be used and not COWed however many references there are.

I have now looked further, and both points still seem valid to me:
the page_mkwrite calling code looks doubly broken.  (Tested?)

Nor has there been any movement on the points raised by Christoph,
that aops->page_mkwrite is redundant, and do_wp_page_mk_pte_writable
separation unhelpful.

I could probably put page_mkwrite to use in tmpfs (to eliminate its
unsatisfactory but never over-troubling shmem_recalc_inode), but not
as it currently stands.

Are you planning any movement on this, David?
Or should I have a go sometime?

Hugh

* what happened to page_mkwrite? - was: Re: page_mkwrite seems broken
  2005-02-09 14:28 page_mkwrite seems broken Hugh Dickins
@ 2005-10-24 15:16 ` Anton Altaparmakov
  2005-10-24 15:36   ` Hugh Dickins
  2005-10-24 15:26 ` David Howells
  1 sibling, 1 reply; 24+ messages in thread
From: Anton Altaparmakov @ 2005-10-24 15:16 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: David Howells, Christoph Hellwig, Andrew Morton, linux-kernel

Hi,

On Wed, 2005-02-09 at 14:28 +0000, Hugh Dickins wrote:
> On Fri, 4 Feb 2005, Hugh Dickins wrote in another thread:
> > Isn't this exactly what David Howells' page_mkwrite stuff in -mm's
> > add-page-becoming-writable-notification.patch is designed for?
> > 
> > Though it looks a little broken to me as it stands (beyond the two
> > fixup patches already there).  I've not found time to double-check
> > or test, apologies in advance if I'm libelling, but...
> > 
> > (a) I thought the prot bits do_nopage gives a pte in a shared writable
> >     mapping include write permission, even when it's a read fault:
> >     that can't be allowed if there's a page_mkwrite.
> > 
> > (b) I don't understand how do_wp_page's "reuse" logic for whether it
> >     can just go ahead and use the existing anonymous page, would have
> >     any relevance to calling page_mkwrite on a shared writable page,
> >     which must be used and not COWed however many references there are.
> 
> I have now looked further, and both points still seem valid to me:
> the page_mkwrite calling code looks doubly broken.  (Tested?)
> 
> Nor has there been any movement on the points raised by Christoph,
> that aops->page_mkwrite is redundant, and do_wp_page_mk_pte_writable
> separation unhelpful.
> 
> I could probably put page_mkwrite to use in tmpfs (to eliminate its
> unsatisfactory but never over-troubling shmem_recalc_inode), but not
> as it currently stands.
> 
> Are you planning any movement on this, David?
> Or should I have a go sometime?

What happened with page_mkwrite?  It seems to have disappeared both from
-mm and generally from the face of the earth...

I am very interested in having such ability for ntfs...

Is anyone still working on this?  If not why not?  Did it prove
impractical or ...?

If no-one is working on this anymore, where do I find the last "current"
patch?

Thanks a lot in advance!

Best regards,

        Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/


* Re: what happened to page_mkwrite? - was: Re: page_mkwrite seems broken
  2005-02-09 14:28 page_mkwrite seems broken Hugh Dickins
  2005-10-24 15:16 ` what happened to page_mkwrite? - was: " Anton Altaparmakov
@ 2005-10-24 15:26 ` David Howells
  2005-10-24 15:43   ` Anton Altaparmakov
  2005-10-24 16:23   ` [PATCH] Add notification of page becoming writable to VMA ops David Howells
  1 sibling, 2 replies; 24+ messages in thread
From: David Howells @ 2005-10-24 15:26 UTC (permalink / raw)
  To: Anton Altaparmakov
  Cc: Hugh Dickins, David Howells, Christoph Hellwig, Andrew Morton,
	linux-kernel

Anton Altaparmakov <aia21@cam.ac.uk> wrote:

> What happened with page_mkwrite?  It seems to have disappeared both from
> -mm and generally from the face of the earth...

It got taken out because no one was using it (CacheFS has been removed
temporarily).

I'm still attempting to maintain it. If you want I can post it to Andrew again
to see if he'll take it back. If you want a direct copy, I'll have to extract
it from CacheFS.

David

* Re: what happened to page_mkwrite? - was: Re: page_mkwrite seems broken
  2005-10-24 15:16 ` what happened to page_mkwrite? - was: " Anton Altaparmakov
@ 2005-10-24 15:36   ` Hugh Dickins
  2005-10-24 15:49     ` Anton Altaparmakov
  0 siblings, 1 reply; 24+ messages in thread
From: Hugh Dickins @ 2005-10-24 15:36 UTC (permalink / raw)
  To: Anton Altaparmakov
  Cc: David Howells, Christoph Hellwig, Andrew Morton, linux-kernel

On Mon, 24 Oct 2005, Anton Altaparmakov wrote:
> On Wed, 2005-02-09 at 14:28 +0000, Hugh Dickins wrote:
> > On Fri, 4 Feb 2005, Hugh Dickins wrote in another thread:
> > > Isn't this exactly what David Howells' page_mkwrite stuff in -mm's
> > > add-page-becoming-writable-notification.patch is designed for?
> > > 
> > > Though it looks a little broken to me as it stands (beyond the two
> > > fixup patches already there).  I've not found time to double-check
.....
> 
> What happened with page_mkwrite?  It seems to have disappeared both from
> -mm and generally from the face of the earth...

page_mkwrite??  No, never heard of it round here, you must be mistaken ;)

But seriously, Andrew dropped it from 2.6.13-rc5-mm1, for expedient reasons:
- Dropped cachefs and the cachefs-for-AFS patches.  These get in the way of
  memory management testing a bit, and they're being redone anyway.

So Andrew's 2.6.13-rc4-mm1 directory should contain its last public state
(by which time I'd fixed up those various things I'd found to be broken).

But David may have redone a lot since then, I don't know: he's the one
to ask.  (And I'm afraid I've done my best to make the old patch not
apply to current -mm.)

Hugh

> I am very interested in having such ability for ntfs...
> 
> Is anyone still working on this?  If not why not?  Did it prove
> impractical or ...?
> 
> If no-one is working on this anymore, where do I find the last "current"
> patch?
> 
> Thanks a lot in advance!
> 
> Best regards,
> 
>         Anton

* Re: what happened to page_mkwrite? - was: Re: page_mkwrite seems broken
  2005-10-24 15:26 ` David Howells
@ 2005-10-24 15:43   ` Anton Altaparmakov
  2005-10-24 16:01     ` Hugh Dickins
  2005-10-24 16:23   ` [PATCH] Add notification of page becoming writable to VMA ops David Howells
  1 sibling, 1 reply; 24+ messages in thread
From: Anton Altaparmakov @ 2005-10-24 15:43 UTC (permalink / raw)
  To: David Howells
  Cc: Hugh Dickins, Christoph Hellwig, Andrew Morton, linux-kernel

On Mon, 2005-10-24 at 16:26 +0100, David Howells wrote:
> Anton Altaparmakov <aia21@cam.ac.uk> wrote:
> 
> > What happened with page_mkwrite?  It seems to have disappeared both from
> > -mm and generally from the face of the earth...
> 
> It got taken out because no one was using it (CacheFS has been removed
> temporarily).
> 
> I'm still attempting to maintain it. If you want I can post it to Andrew again
> to see if he'll take it back. If you want a direct copy, I'll have to extract
> it from CacheFS.

I don't really mind either way.  I am stuck with ntfs at the moment at
the point where I am either going to use my own ->nopage handler to
allocate on-disk clusters or have a ->page_mkwrite handler do it.  The
former is not nice as it means we allocate space even when only reading
whilst the latter is very nice as it only triggers when someone actually
does an mmapped write.
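
A rough userspace illustration of the difference described above. This is a toy model, not kernel or ntfs code: every `toy_` name and the one-cluster-per-page accounting are assumptions invented for the sketch. A read fault over a hole maps the page without touching the allocator, while the write-fault hook allocates the cluster and can report ENOSPC.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define TOY_NR_PAGES 4

/* Hypothetical file: one on-disk cluster per page, holes are unallocated. */
struct toy_file {
	bool allocated[TOY_NR_PAGES];	/* does a cluster back this page? */
	bool mapped[TOY_NR_PAGES];	/* is the page mapped (read-only)? */
	unsigned int free_clusters;	/* space left on the toy volume */
};

/* Read fault: map the page; a hole stays a hole. */
void toy_nopage(struct toy_file *f, unsigned int pg)
{
	f->mapped[pg] = true;
}

/*
 * Write fault on a read-only mapping: the point where a page_mkwrite-style
 * hook would run.  Returns 0 on success or -ENOSPC, which the fault path
 * would turn into SIGBUS.
 */
int toy_page_mkwrite(struct toy_file *f, unsigned int pg)
{
	if (f->allocated[pg])
		return 0;		/* already backed, nothing to do */
	if (f->free_clusters == 0)
		return -ENOSPC;
	f->free_clusters--;
	f->allocated[pg] = true;
	return 0;
}

/* Usage: reads never consume space; the second write runs out of it. */
void toy_demo(void)
{
	struct toy_file f = { .free_clusters = 1 };

	toy_nopage(&f, 0);
	assert(f.free_clusters == 1);
	assert(toy_page_mkwrite(&f, 0) == 0);
	assert(toy_page_mkwrite(&f, 1) == -ENOSPC);
}
```

Doing the allocation in a ->nopage handler instead would correspond to moving the body of `toy_page_mkwrite()` into `toy_nopage()`, so that plain reads of a sparse file consume clusters too.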

So whatever works best for you.  I am happy with it appearing in -mm and
I am also happy with having my own copy in my ntfs development tree for
now.  Given we both want it and Hugh said he wanted it, too, it may be
the more sensible approach to have it in -mm so we have one common code
base to work with/apply fixes to/whatever...  Otherwise we may get
headaches down the line if we end up with diverging implementations of
the same thing and all try to merge ours into the kernel...

Btw. have you addressed the problems Hugh pointed out with it?  If not,
-mm would perhaps be a good place for us to get it sorted?

Andrew, would you be happy to take the ->page_mkwrite patches again into
-mm?  I promise you a user of it to appear in -mm as soon as I can knock
up the ntfs code for it once I have seen the exact interface I am coding
for...

Best regards,

        Anton

* Re: what happened to page_mkwrite? - was: Re: page_mkwrite seems broken
  2005-10-24 15:36   ` Hugh Dickins
@ 2005-10-24 15:49     ` Anton Altaparmakov
  0 siblings, 0 replies; 24+ messages in thread
From: Anton Altaparmakov @ 2005-10-24 15:49 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: David Howells, Christoph Hellwig, Andrew Morton, linux-kernel

Hi,

On Mon, 2005-10-24 at 16:36 +0100, Hugh Dickins wrote:
> On Mon, 24 Oct 2005, Anton Altaparmakov wrote:
> > On Wed, 2005-02-09 at 14:28 +0000, Hugh Dickins wrote:
> > > On Fri, 4 Feb 2005, Hugh Dickins wrote in another thread:
> > > > Isn't this exactly what David Howells' page_mkwrite stuff in -mm's
> > > > add-page-becoming-writable-notification.patch is designed for?
> > > > 
> > > > Though it looks a little broken to me as it stands (beyond the two
> > > > fixup patches already there).  I've not found time to double-check
> .....
> > 
> > What happened with page_mkwrite?  It seems to have disappeared both from
> > -mm and generally from the face of the earth...
> 
> page_mkwrite??  No, never heard of it round here, you must be mistaken ;)

(-:

> But seriously, Andrew dropped it from 2.6.13-rc5-mm1, for expedient reasons:
> - Dropped cachefs and the cachefs-for-AFS patches.  These get in the way of
>   memory management testing a bit, and they're being redone anyway.
> 
> So Andrew's 2.6.13-rc4-mm1 directory should contain its last public state
> (by which time I'd fixed up those various things I'd found to be broken).

Right, thanks.  I was wondering whether they had been fixed.

> But David may have redone a lot since then, I don't know: he's the one
> to ask.  (And I'm afraid I've done my best to make the old patch not
> apply to current -mm.)

That can be fixed, if David has not done it already...  (-:

This is what I am working on in ntfs as my top priority at present, so I
really want to get it fixed and merged and I am willing to put in the
time necessary to make it happen as I really hate having to instantiate
holes on read access for files with logical blocks of size above
PAGE_{CACHE_,}SIZE, it just feels wrong...

Best regards,

        Anton

* Re: what happened to page_mkwrite? - was: Re: page_mkwrite seems broken
  2005-10-24 15:43   ` Anton Altaparmakov
@ 2005-10-24 16:01     ` Hugh Dickins
  2005-10-24 19:38       ` Anton Altaparmakov
  0 siblings, 1 reply; 24+ messages in thread
From: Hugh Dickins @ 2005-10-24 16:01 UTC (permalink / raw)
  To: Anton Altaparmakov
  Cc: David Howells, Christoph Hellwig, Andrew Morton, Carsten Otte,
	linux-kernel

On Mon, 24 Oct 2005, Anton Altaparmakov wrote:
> 
> I don't really mind either way.  I am stuck with ntfs at the moment at
> the point where I am either going to use my own ->nopage handler to
> allocate on-disk clusters or have a ->page_mkwrite handler do it.  The
> former is not nice as it means we allocate space even when only reading
> whilst the latter is very nice as it only triggers when someone actually
> does an mmapped write.

A complication to beware of there (and I may be misunderstanding, but
the point is worth making).  If you have already mmapped readonly zero
pages into some mms, you'll need to update those mms with the new
shared writable pages once they are allocated.  That put me off using
page_mkwrite in tmpfs, but Carsten has solved the problem (though
not going so far as to use page_mkwrite) with his xip_file_nopage
in mm/filemap_xip.c - has to go down the vma_prio_tree like rmap.

(That code is a little different in -mm, partly because of my page
table locking changes, partly because of Nick's ZERO_PAGE changes.)

Hmm, strictly speaking, it should be substituting the new page
when VM_LOCKED: whether that's worth the effort of implementing....

Hugh

* [PATCH] Add notification of page becoming writable to VMA ops
  2005-10-24 15:26 ` David Howells
  2005-10-24 15:43   ` Anton Altaparmakov
@ 2005-10-24 16:23   ` David Howells
  2005-10-24 19:11     ` Hugh Dickins
                       ` (7 more replies)
  1 sibling, 8 replies; 24+ messages in thread
From: David Howells @ 2005-10-24 16:23 UTC (permalink / raw)
  To: Anton Altaparmakov, Andrew Morton, torvalds
  Cc: David Howells, Hugh Dickins, Christoph Hellwig, linux-kernel


The attached patch adds a new VMA operation to notify a filesystem or other
driver about the MMU generating a fault because userspace attempted to write
to a page mapped through a read-only PTE.

This facility permits the filesystem or driver to:

 (*) Implement storage allocation/reservation on attempted write, and so to
     deal with problems such as ENOSPC more gracefully (perhaps by generating
     SIGBUS).

 (*) Delay making the page writable until the contents have been written to a
     backing cache. This is useful for NFS/AFS when using FS-Cache/CacheFS.
     It permits the filesystem to have some guarantee about the state of the
     cache.

Signed-Off-By: David Howells <dhowells@redhat.com>
---
warthog>diffstat -p1 page-mkwrite-2614rc4mm1.diff
 include/linux/mm.h |    4 +
 mm/memory.c        |  125 ++++++++++++++++++++++++++++++++++++++++++-----------
 mm/mmap.c          |    9 ++-
 mm/mprotect.c      |    8 ++-
 4 files changed, 117 insertions(+), 29 deletions(-)

diff -uNrp linux-2.6.14-rc4-mm1/include/linux/mm.h linux-2.6.14-rc4-mm1-cachefs/include/linux/mm.h
--- linux-2.6.14-rc4-mm1/include/linux/mm.h	2005-10-17 14:26:43.000000000 +0100
+++ linux-2.6.14-rc4-mm1-cachefs/include/linux/mm.h	2005-10-18 14:02:39.000000000 +0100
@@ -196,6 +196,10 @@ struct vm_operations_struct {
 	void (*close)(struct vm_area_struct * area);
 	struct page * (*nopage)(struct vm_area_struct * area, unsigned long address, int *type);
 	int (*populate)(struct vm_area_struct * area, unsigned long address, unsigned long len, pgprot_t prot, unsigned long pgoff, int nonblock);
+
+	/* notification that a previously read-only page is about to become
+	 * writable, if an error is returned it will cause a SIGBUS */
+	int (*page_mkwrite)(struct vm_area_struct *vma, struct page *page);
 #ifdef CONFIG_NUMA
 	int (*set_policy)(struct vm_area_struct *vma, struct mempolicy *new);
 	struct mempolicy *(*get_policy)(struct vm_area_struct *vma,
diff -uNrp linux-2.6.14-rc4-mm1/mm/memory.c linux-2.6.14-rc4-mm1-cachefs/mm/memory.c
--- linux-2.6.14-rc4-mm1/mm/memory.c	2005-10-17 14:26:44.000000000 +0100
+++ linux-2.6.14-rc4-mm1-cachefs/mm/memory.c	2005-10-20 18:53:04.000000000 +0100
@@ -1247,7 +1247,7 @@ static int do_wp_page(struct mm_struct *
 	struct page *old_page, *new_page;
 	unsigned long pfn = pte_pfn(orig_pte);
 	pte_t entry;
-	int ret = VM_FAULT_MINOR;
+	int reuse, ret = VM_FAULT_MINOR;
 
 	BUG_ON(vma->vm_flags & VM_RESERVED);
 
@@ -1261,19 +1261,53 @@ static int do_wp_page(struct mm_struct *
 	}
 	old_page = pfn_to_page(pfn);
 
-	if (PageAnon(old_page) && !TestSetPageLocked(old_page)) {
-		int reuse = can_share_swap_page(old_page);
-		unlock_page(old_page);
-		if (reuse) {
-			flush_cache_page(vma, address, pfn);
-			entry = pte_mkyoung(orig_pte);
-			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
-			ptep_set_access_flags(vma, address, page_table, entry, 1);
-			update_mmu_cache(vma, address, entry);
-			lazy_mmu_prot_update(entry);
-			ret |= VM_FAULT_WRITE;
-			goto unlock;
+	if (unlikely(vma->vm_flags & VM_SHARED)) {
+		if (vma->vm_ops && vma->vm_ops->page_mkwrite) {
+			/*
+			 * Notify the page owner without the lock held,
+			 * so they can sleep if they want to.
+			 */
+			pte_unmap(page_table);
+			if (!PageReserved(old_page))
+				page_cache_get(old_page);
+			spin_unlock(&mm->page_table_lock);
+
+			if (vma->vm_ops->page_mkwrite(vma, old_page) < 0)
+				goto unwritable_page;
+
+			spin_lock(&mm->page_table_lock);
+			page_cache_release(old_page);
+
+			/*
+			 * Since we dropped the lock we need to revalidate
+			 * the PTE as someone else may have changed it.  If
+			 * they did, we just return, as we can count on the
+			 * MMU to tell us if they didn't also make it writable.
+			 */
+			page_table = pte_offset_map(pmd, address);
+			if (!pte_same(*page_table, orig_pte)) {
+				ret |= VM_FAULT_WRITE;
+				goto success;
+			}
 		}
+
+		reuse = 1;
+	} else if (PageAnon(old_page) && !TestSetPageLocked(old_page)) {
+		reuse = can_share_swap_page(old_page);
+		unlock_page(old_page);
+	} else {
+		reuse = 0;
+	}
+
+	if (reuse) {
+		flush_cache_page(vma, address, pfn);
+		entry = pte_mkyoung(orig_pte);
+		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+		ptep_set_access_flags(vma, address, page_table, entry, 1);
+		update_mmu_cache(vma, address, entry);
+		lazy_mmu_prot_update(entry);
+		ret |= VM_FAULT_WRITE;
+		goto success;
 	}
 
 	/*
@@ -1326,6 +1360,15 @@ unlock:
 oom:
 	page_cache_release(old_page);
 	return VM_FAULT_OOM;
+
+success:
+	pte_unmap(page_table);
+	spin_unlock(&mm->page_table_lock);
+	return VM_FAULT_MINOR | VM_FAULT_WRITE;
+
+unwritable_page:
+	page_cache_release(old_page);
+	return VM_FAULT_SIGBUS;
 }
 
 /*
@@ -1847,18 +1890,28 @@ retry:
 	/*
 	 * Should we do an early C-O-W break?
 	 */
-	if (write_access && !(vma->vm_flags & VM_SHARED)) {
-		struct page *page;
+	if (write_access) {
+		if (!(vma->vm_flags & VM_SHARED)) {
+			struct page *page;
 
-		if (unlikely(anon_vma_prepare(vma)))
-			goto oom;
-		page = alloc_page_vma(GFP_HIGHUSER, vma, address);
-		if (!page)
-			goto oom;
-		copy_user_highpage(page, new_page, address);
-		page_cache_release(new_page);
-		new_page = page;
-		anon = 1;
+			if (unlikely(anon_vma_prepare(vma)))
+				goto oom;
+			page = alloc_page_vma(GFP_HIGHUSER, vma, address);
+			if (!page)
+				goto oom;
+			copy_user_highpage(page, new_page, address);
+			page_cache_release(new_page);
+			new_page = page;
+			anon = 1;
+
+		} else {
+			/* if the page will be shareable, see if the backing
+			 * address space wants to know that the page is about
+			 * to become writable */
+			if (vma->vm_ops->page_mkwrite &&
+			    vma->vm_ops->page_mkwrite(vma, new_page) < 0)
+				return VM_FAULT_SIGBUS;
+		}
 	}
 
 	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
@@ -1945,7 +1998,7 @@ static int do_file_page(struct mm_struct
 		return VM_FAULT_OOM;
 	}
 	/* We can then assume vm->vm_ops && vma->vm_ops->populate */
-
+again:
 	pgoff = pte_to_pgoff(orig_pte);
 	err = vma->vm_ops->populate(vma, address & PAGE_MASK, PAGE_SIZE,
 					vma->vm_page_prot, pgoff, 0);
@@ -1953,6 +2006,28 @@ static int do_file_page(struct mm_struct
 		return VM_FAULT_OOM;
 	if (err)
 		return VM_FAULT_SIGBUS;
+
+	/* For the get_user_pages force write case, we must make sure that
+	 * page_mkwrite is called by this invocation of handle_mm_fault.
+	 */
+	if (write_access && vma->vm_ops->page_mkwrite) {
+		spinlock_t *ptl;
+		int ret;
+
+		page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
+
+		orig_pte = *page_table;
+
+		if (!pte_present(orig_pte)) {
+			pte_unmap_unlock(page_table, ptl);
+			goto again;
+		}
+		ret = do_wp_page(mm, vma, address, page_table, pmd, ptl,
+				 orig_pte);
+		if (ret != VM_FAULT_MINOR)
+			return ret;
+	}
+
 	return VM_FAULT_MAJOR;
 }
 
diff -uNrp linux-2.6.14-rc4-mm1/mm/mmap.c linux-2.6.14-rc4-mm1-cachefs/mm/mmap.c
--- linux-2.6.14-rc4-mm1/mm/mmap.c	2005-10-17 14:26:44.000000000 +0100
+++ linux-2.6.14-rc4-mm1-cachefs/mm/mmap.c	2005-10-18 14:02:39.000000000 +0100
@@ -1058,7 +1058,8 @@ munmap_back:
 	vma->vm_start = addr;
 	vma->vm_end = addr + len;
 	vma->vm_flags = vm_flags;
-	vma->vm_page_prot = protection_map[vm_flags & 0x0f];
+	vma->vm_page_prot = protection_map[vm_flags &
+				(VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)];
 	vma->vm_pgoff = pgoff;
 
 	if (file) {
@@ -1092,6 +1093,9 @@ munmap_back:
 		if (error)
 			goto free_vma;
 	}
+	if (vma->vm_ops && vma->vm_ops->page_mkwrite)
+		vma->vm_page_prot = protection_map[vm_flags &
+					(VM_READ|VM_WRITE|VM_EXEC)];
 
 	/* We set VM_ACCOUNT in a shared mapping's vm_flags, to inform
 	 * shmem_zero_setup (perhaps called through /dev/zero's ->mmap)
@@ -1926,7 +1930,8 @@ unsigned long do_brk(unsigned long addr,
 	vma->vm_end = addr + len;
 	vma->vm_pgoff = pgoff;
 	vma->vm_flags = flags;
-	vma->vm_page_prot = protection_map[flags & 0x0f];
+	vma->vm_page_prot = protection_map[flags &
+				(VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)];
 	vma_link(mm, vma, prev, rb_link, rb_parent);
 out:
 	mm->total_vm += len >> PAGE_SHIFT;
diff -uNrp linux-2.6.14-rc4-mm1/mm/mprotect.c linux-2.6.14-rc4-mm1-cachefs/mm/mprotect.c
--- linux-2.6.14-rc4-mm1/mm/mprotect.c	2005-10-17 14:26:44.000000000 +0100
+++ linux-2.6.14-rc4-mm1-cachefs/mm/mprotect.c	2005-10-18 14:02:39.000000000 +0100
@@ -106,6 +106,7 @@ mprotect_fixup(struct vm_area_struct *vm
 	unsigned long oldflags = vma->vm_flags;
 	long nrpages = (end - start) >> PAGE_SHIFT;
 	unsigned long charged = 0;
+	unsigned int mask;
 	pgprot_t newprot;
 	pgoff_t pgoff;
 	int error;
@@ -140,8 +141,6 @@ mprotect_fixup(struct vm_area_struct *vm
 		}
 	}
 
-	newprot = protection_map[newflags & 0xf];
-
 	/*
 	 * First try to merge with previous and/or next vma.
 	 */
@@ -168,6 +167,11 @@ mprotect_fixup(struct vm_area_struct *vm
 	}
 
 success:
+	mask = VM_READ|VM_WRITE|VM_EXEC|VM_SHARED;
+	if (vma->vm_ops && vma->vm_ops->page_mkwrite)
+		mask &= ~VM_SHARED;
+	newprot = protection_map[newflags & mask];
+
 	/*
 	 * vm_flags and vm_page_prot are protected by the mmap_sem
 	 * held in write mode.

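A minimal userspace sketch of what the mm/mmap.c and mm/mprotect.c hunks above appear intended to achieve. The flag values match include/linux/mm.h, but the `toy_` helpers are hypothetical, and `toy_pte_starts_writable()` is only an assumed stand-in for `protection_map[]` (on typical architectures only the shared entries with VM_WRITE give a hardware-writable PTE). Dropping VM_SHARED from the index makes a shared writable mapping start with the read-only protection a private mapping would get, so the first write faults and page_mkwrite can be called.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag values as defined in include/linux/mm.h. */
#define VM_READ		0x1u
#define VM_WRITE	0x2u
#define VM_EXEC		0x4u
#define VM_SHARED	0x8u

/* Index into protection_map[], following the mprotect.c hunk. */
unsigned int toy_prot_index(unsigned int vm_flags, bool has_page_mkwrite)
{
	unsigned int mask = VM_READ | VM_WRITE | VM_EXEC | VM_SHARED;

	if (has_page_mkwrite)
		mask &= ~VM_SHARED;	/* pick the private (read-only) entry */
	return vm_flags & mask;
}

/*
 * Assumed model of protection_map[]: private writable mappings start
 * read-only for copy-on-write, shared writable mappings start writable.
 */
bool toy_pte_starts_writable(unsigned int index)
{
	return (index & VM_SHARED) && (index & VM_WRITE);
}

/* Usage: the same vm_flags with and without a page_mkwrite handler. */
void toy_prot_demo(void)
{
	unsigned int flags = VM_READ | VM_WRITE | VM_SHARED;

	assert(toy_pte_starts_writable(toy_prot_index(flags, false)));
	assert(!toy_pte_starts_writable(toy_prot_index(flags, true)));
}
```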

* Re: [PATCH] Add notification of page becoming writable to VMA ops
  2005-10-24 16:23   ` [PATCH] Add notification of page becoming writable to VMA ops David Howells
@ 2005-10-24 19:11     ` Hugh Dickins
  2005-10-25  7:59       ` Anton Altaparmakov
  2005-10-25  9:49     ` David Howells
                       ` (6 subsequent siblings)
  7 siblings, 1 reply; 24+ messages in thread
From: Hugh Dickins @ 2005-10-24 19:11 UTC (permalink / raw)
  To: David Howells
  Cc: Anton Altaparmakov, Andrew Morton, torvalds, Christoph Hellwig,
	linux-kernel

On Mon, 24 Oct 2005, David Howells wrote:
> 
> The attached patch adds a new VMA operation to notify a filesystem or other
> driver about the MMU generating a fault because userspace attempted to write
> to a page mapped through a read-only PTE.
> 
> This facility permits the filesystem or driver to:
> 
>  (*) Implement storage allocation/reservation on attempted write, and so to
>      deal with problems such as ENOSPC more gracefully (perhaps by generating
>      SIGBUS).
> 
>  (*) Delay making the page writable until the contents have been written to a
>      backing cache. This is useful for NFS/AFS when using FS-Cache/CacheFS.
>      It permits the filesystem to have some guarantee about the state of the
>      cache.

I've only given it a quick look, it looks pretty good, but too hastily
thrown together, without understanding of the intervening changes:

> --- linux-2.6.14-rc4-mm1/mm/memory.c	2005-10-17 14:26:44.000000000 +0100
> +++ linux-2.6.14-rc4-mm1-cachefs/mm/memory.c	2005-10-20 18:53:04.000000000 +0100
> @@ -1261,19 +1261,53 @@ static int do_wp_page(struct mm_struct *
> +	if (unlikely(vma->vm_flags & VM_SHARED)) {
> +		if (vma->vm_ops && vma->vm_ops->page_mkwrite) {
> +			/*
> +			 * Notify the page owner without the lock held,
> +			 * so they can sleep if they want to.
> +			 */
> +			pte_unmap(page_table);
> +			if (!PageReserved(old_page))
> +				page_cache_get(old_page);
> +			spin_unlock(&mm->page_table_lock);

No, you need to pay attention to Nick's PageReserved removal, and
my pte lock stuff, throughout do_wp_page - there shouldn't be any
references to PageReserved or page_table_lock there now (and you'll
need to recheck the mapping/locking/unlocking/unmapping).  Sorry,
I don't have the time to spare to do it myself right now.

> +			page_table = pte_offset_map(pmd, address);
> +			if (!pte_same(*page_table, orig_pte)) {
> +				ret |= VM_FAULT_WRITE;

No, don't add VM_FAULT_WRITE in this case: you should only do that
when you've gone through the maybe_mkwrite yourself; this case
should remain the default VM_FAULT_MINOR.
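
The rule can be illustrated with a small userspace model. It is only a sketch under stated assumptions: the `toy_` names and return values are invented, no real locking is done, and the "racing" hook changes the entry itself as a stand-in for another thread doing so while the lock is dropped. The write flag is reported only on the path that made the entry writable itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy return codes; names and values are invented for the sketch. */
enum {
	TOY_FAULT_SIGBUS = 0x01,
	TOY_FAULT_MINOR  = 0x02,
	TOY_FAULT_WRITE  = 0x10,
};

struct toy_pte {
	unsigned long pfn;
	bool writable;
};

typedef int (*toy_mkwrite_fn)(struct toy_pte *slot);

int toy_wp_fault(struct toy_pte *slot, toy_mkwrite_fn mkwrite)
{
	struct toy_pte orig = *slot;	/* snapshot taken "under the lock" */

	/* the lock would be dropped here so the hook may sleep */
	if (mkwrite && mkwrite(slot) < 0)
		return TOY_FAULT_SIGBUS;
	/* the lock would be retaken here */

	if (slot->pfn != orig.pfn || slot->writable != orig.writable)
		return TOY_FAULT_MINOR;	/* raced: leave it to a later fault */

	slot->writable = true;		/* we did the mkwrite step ourselves */
	return TOY_FAULT_MINOR | TOY_FAULT_WRITE;
}

/* Hooks for the three outcomes. */
int toy_hook_ok(struct toy_pte *slot)   { (void)slot; return 0; }
int toy_hook_fail(struct toy_pte *slot) { (void)slot; return -1; }
int toy_hook_race(struct toy_pte *slot) { slot->pfn++; return 0; }
```

With `toy_hook_race` the function returns plain `TOY_FAULT_MINOR` and leaves the entry read-only, which is the behaviour asked for above.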

> @@ -1847,18 +1890,28 @@ retry:
> +		} else {
> +			/* if the page will be shareable, see if the backing
> +			 * address space wants to know that the page is about
> +			 * to become writable */
> +			if (vma->vm_ops->page_mkwrite &&
> +			    vma->vm_ops->page_mkwrite(vma, new_page) < 0)
> +				return VM_FAULT_SIGBUS;
> +		}
>  	}

This isn't necessarily wrong, and may be exactly how it was before,
I don't remember.  But it implies that when page_mkwrite fails,
it page_cache_releases the page.  Is that desirable?  Or should
that be left to the caller?

> @@ -1945,7 +1998,7 @@ static int do_file_page(struct mm_struct

Drop all those changes to do_file_page (which I added), they're no
longer necessary.  A case appeared which made it clear that we cannot
rely on resolving this issue for get_user_pages in a single call to
handle_mm_fault, and that's why the VM_FAULT_WRITE stuff got added. 

This complication of do_file_page was always ugly, and I'm delighted
to drop it.  Whereas the call to do_wp_page from do_swap_page is less
obtrusive and may still be a worthwhile optimization, though I added
it for the same disgraced reason a year or more back.

Hugh

* Re: what happened to page_mkwrite? - was: Re: page_mkwrite seems broken
  2005-10-24 16:01     ` Hugh Dickins
@ 2005-10-24 19:38       ` Anton Altaparmakov
  2005-10-24 20:31         ` Hugh Dickins
  0 siblings, 1 reply; 24+ messages in thread
From: Anton Altaparmakov @ 2005-10-24 19:38 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: David Howells, Christoph Hellwig, Andrew Morton, Carsten Otte,
	linux-kernel

On Mon, 24 Oct 2005, Hugh Dickins wrote:
> On Mon, 24 Oct 2005, Anton Altaparmakov wrote:
> > I don't really mind either way.  I am stuck with ntfs at the moment at
> > the point where I am either going to use my own ->nopage handler to
> > allocate on-disk clusters or have a ->page_mkwrite handler do it.  The
> > former is not nice as it means we allocate space even when only reading
> > whilst the latter is very nice as it only triggers when someone actually
> > does an mmapped write.
> 
> A complication to beware of there (and I may be misunderstanding, but
> the point is worth making).  

Now you got me completely confused.  Just when I thought I was 
understanding things.  (-;  Let me repeat what you say with some questions 
thrown in...  Please bear with me and help me beat some clue into my 
head...  (-:

> If you have already mmapped readonly zero pages into some mms, you'll 

When you say "zero pages" you mean just normal page cache pages that are 
fully zero because they are in a sparse region of a file?

Or do you mean ZERO_PAGE(address) stuff like in xip_file_nopage()?

And when you say "mapped readonly zero pages into some mms" you mean that 
there are several processes which have done mmap(PROT_READ, MAP_SHARED) on 
the same file, right?

Or do you mean mmap(PROT_READ|PROT_WRITE, MAP_SHARED)?

Or something else?

> need to update those mms with the new shared writable pages once they 
> are allocated.

If your answer above is that the pages are normal page cache pages, then:

Is it to reflect the fact that the pages are now marked writable?

Or is it to reflect the fact that the pages are now allocated on disk 
(i.e. update the page buffers)?

Or am I barking up the wrong tree and the reasons are altogether 
different?

If your answer above was ZERO_PAGE(), then:

Is it to get rid of the zero page and replace it with the _real_, now 
allocated and writable page?

> That put me off using page_mkwrite in tmpfs, but Carsten has solved the 
> problem (though not going so far as to use page_mkwrite) with his 
> xip_file_nopage in mm/filemap_xip.c - has to go down the vma_prio_tree 
> like rmap.
>  
> (That code is a little different in -mm, partly because of my page
> table locking changes, partly because of Nick's ZERO_PAGE changes.)
> 
> Hmm, strictly speaking, it should be substituting the new page
> when VM_LOCKED: whether that's worth the effort of implementing....

I have now read filemap_xip.c as it is in Linus's kernel and see the 
ZERO_PAGE().  I guess that that is what you were talking about above all 
along and not normal page cache pages that happen to be zero.  Correct?

In which case am I correct in saying that as long as I use 
filemap_nopage() and filemap_populate(), I can ignore your comment about 
updating mms as ZERO_PAGE() will _never_ be mapped and in fact just 
normal page cache pages containing zeroes will be mapped?

If that is correct then great.  Otherwise I have missed the plot and would 
be very grateful if you were to impart some clue upon me.  (-:

Thanks a lot for your help!  Much appreciated!

Best regards,

	Anton

* Re: what happened to page_mkwrite? - was: Re: page_mkwrite seems broken
  2005-10-24 19:38       ` Anton Altaparmakov
@ 2005-10-24 20:31         ` Hugh Dickins
  2005-10-24 21:18           ` Anton Altaparmakov
  0 siblings, 1 reply; 24+ messages in thread
From: Hugh Dickins @ 2005-10-24 20:31 UTC (permalink / raw)
  To: Anton Altaparmakov
  Cc: David Howells, Christoph Hellwig, Andrew Morton, Carsten Otte,
	linux-kernel

On Mon, 24 Oct 2005, Anton Altaparmakov wrote:
> On Mon, 24 Oct 2005, Hugh Dickins wrote:
> 
> Now you got me completely confused.  Just when I thought I was 
> understanding things.  (-;  Let me repeat what you say with some questions 
> thrown in...  Please bear with me and help me beat some clue into my 
> head...  (-:

Sorry for confusing you.  I can't answer many of your questions, because
I don't know what you're doing or intending to do.  But you expressed an
aversion to allocating pages unnecessarily.  Probably that made me think
of memory allocation where you meant disk allocation.

Cutting a lot of questions...

> If your answer above is that the pages are normal page cache pages, then:

Nothing special needs doing if you choose to use normal page cache pages
even for the holes.

> If your answer above was ZERO_PAGE(), then:
> 
> Is it to get rid of the zero page and replace it with the _real_, now 
> allocated and writable page?

Yes.

> I have now read filemap_xip.c as it is in Linus kernel and see the 
> ZERO_PAGE().  I guess that that is what you were talking about above all 
> along and not normal page cache pages that happen to be zero.  Correct?

Yes.

> In which case am I correct in saying that as long as I use 
> filemap_nopage() and filemap_populate(), I can ignore your comment about 
> updating mms as ZERO_PAGE() will _never_ be mapped and in fact just 
> normal page cache pages containing zeroes will be mapped?

Yes.

> If that is correct then great.  Otherwise I have missed the plot and would 
> be very grateful if you were to impart some clue upon me.  (-:
> 
> Thanks a lot for your help!  Much appreciated!

Sorry for the confusion: I was just trying to warn you of some difficulties
and their solution, if you were intending to pursue an alternative path.

Hugh


* Re: what happened to page_mkwrite? - was: Re: page_mkwrite seems broken
  2005-10-24 20:31         ` Hugh Dickins
@ 2005-10-24 21:18           ` Anton Altaparmakov
  0 siblings, 0 replies; 24+ messages in thread
From: Anton Altaparmakov @ 2005-10-24 21:18 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: David Howells, Christoph Hellwig, Andrew Morton, Carsten Otte,
	linux-kernel

On Mon, 24 Oct 2005, Hugh Dickins wrote:
> On Mon, 24 Oct 2005, Anton Altaparmakov wrote:
> > On Mon, 24 Oct 2005, Hugh Dickins wrote:
> > 
> > Now you got me completely confused.  Just when I thought I was 
> > understanding things.  (-;  Let me repeat what you say with some questions 
> > thrown in...  Please bear with me and help me beat some clue into my 
> > head...  (-:
> 
> Sorry for confusing you.  I can't answer many of your questions, because
> I don't know what you're doing or intending to do.  But you expressed an
> aversion to allocating pages unnecessarily.  Probably that made me think
> of memory allocation where you meant disk allocation.
> 
> Cutting a lot of questions...
> 
> > If your answer above is that the pages are normal page cache pages, then:
> 
> Nothing special needs doing if you choose to use normal page cache pages
> even for the holes.

Great!  I have no intention of using ZERO_PAGE().  Just normal page cache 
pages that are memset() to zero when sparse.  Phew.  /me relaxes (((-:

> Sorry for the confusion: I was just trying to warn you of some difficulties
> and their solution, if you were intending to pursue an alternative path.

No need to apologize!

If I had wanted to use the ZERO_PAGE() then you would be right and I would 
have missed all those things you said, but I never even knew about 
ZERO_PAGE().  (-:  I could well see that as a nice optimization at some 
point but for now I want it to work, not conserve memory.  (-:

Thank you very much for your comments!

In case you are curious, ntfs allows logical blocks of between 512 bytes 
and many hundreds of kiB in size (but always power of 2).  So to write to 
a mmap()ed sparse file using a PAGE_CACHE_SIZE page into the middle of a 
large, sparse logical block, I need to allocate the whole block on disk 
and cause all page cache pages to be zeroed and marked dirty.  To do this 
from writepage() is not possible due to deadlocks.  1) because the page is 
locked already and I would need to lock all the other pages in that 
logical block so we get into deadlock city with out of order page locking 
(I now only lock in ascending page index order and this requires no page 
with a higher index to be locked and dropping page lock in writepage is 
a royal pain in the backside) and 2) because I am not meant to go allocating 
memory for more pages when the system is low on memory and running 
writepage exactly so it can reclaim some memory...

How I want to use page_mkdirty is that when it is run for a sparse logical 
block of size > PAGE_CACHE_SIZE, I allocate the logical block and get hold 
of all the pages (locked) that lie in that block and bring them uptodate 
(by zeroing if not uptodate already) and then mark them all dirty and 
release them again so the zeroes will make it to disk later.  Not sure 
whether to do the allocations even for logical blocks <= PAGE_CACHE_SIZE 
or just leave those to writepage...
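
The page range involved can be sketched as follows.  This is only a
userspace illustration, not ntfs code: the 4 KiB page size, the struct
and the function name are assumptions made for the example.  It finds
the first and last page cache index covering the logical block that
contains the faulting page; walking that range from first to last keeps
to the ascending page index lock order described above.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed for illustration: 4 KiB page cache pages. */
#define SKETCH_PAGE_SHIFT 12u

struct page_range {
	uint64_t first;	/* index of the first page in the logical block */
	uint64_t last;	/* index of the last page in the block, inclusive */
};

/*
 * Given the index of the faulting page and the logical block size in
 * bytes (a power of two, at least one page), return the page indices
 * that cover the whole logical block.
 */
static struct page_range block_page_range(uint64_t page_index,
					  uint64_t block_size)
{
	struct page_range r;
	uint64_t pages_per_block;

	assert(block_size >= (1ull << SKETCH_PAGE_SHIFT));
	assert((block_size & (block_size - 1)) == 0);

	pages_per_block = block_size >> SKETCH_PAGE_SHIFT;
	/* Round the faulting index down to the start of its block. */
	r.first = page_index & ~(pages_per_block - 1);
	r.last = r.first + pages_per_block - 1;
	return r;
}
```

With a 64 kiB logical block, a fault on page 37 would cover pages 32 to
47 under these assumptions.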

In fact before allocating the block I plan to simply do a page cache read 
(via read_cache_page() which will give me uptodate, cleared pages) then do 
the allocation, then mark the pages dirty and that's it.  Writepage will 
later cause the buffers in the pages to be mapped to the new on-disk 
location and will write the dirty pages to disk.  (I may map the buffers 
in the pages there and then as an optimization given I have the pages and I 
know the on-disk location but I am not sure I will do that, at least 
probably not initially as it only makes the code more complex for very 
little gain.)

There is another ntfs complication and this is initialized size.  This is 
an evil beast that says that anything between initialized size and the real 
file size (inode->i_size), no matter whether it is allocated on disk or is 
a sparse hole or a mixture of the two, is to be read as zeroes.  The 
annoying thing here is that if you have a 1TiB file that is fully 
allocated on disk but has an initialized size of 0, and you write 1 byte 
somewhere towards the end of the file (or even at the end), you need to 
write to disk zeroes between file offset = initialized size (0 in this 
example) and the position of the write, in this case 1TiB.  Doing that 
from writepage could never fly.  But from page_mkdirty() it should work, 
again in same way as above for the sparse holes, I will read_cache_page() 
followed by set_page_dirty() for all pages between initialized size and 
the offset of the write.  It just means that the first write to such an 
mmaped file would take a _very_ long time in the specific example above...
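
To put a number on that, here is a rough userspace sketch (not ntfs
code; the 4 KiB page size and the function name are assumptions) that
counts the pages touching the range from initialized size up to the
write position, i.e. the pages that would have to be read in and
dirtied:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed for illustration: 4 KiB page cache pages. */
#define SKETCH_PG_SHIFT 12u
#define SKETCH_PG_SIZE  (1ull << SKETCH_PG_SHIFT)

/*
 * Count the pages that overlap [initialized_size, write_pos).
 * Returns 0 when the write lands at or below initialized size.
 * Overflow near UINT64_MAX is ignored in this sketch.
 */
static uint64_t pages_to_zero(uint64_t initialized_size, uint64_t write_pos)
{
	uint64_t first, end;

	assert(SKETCH_PG_SIZE != 0);
	if (write_pos <= initialized_size)
		return 0;

	first = initialized_size >> SKETCH_PG_SHIFT;
	/* Exclusive end index, rounded up to a page boundary. */
	end = (write_pos + SKETCH_PG_SIZE - 1) >> SKETCH_PG_SHIFT;
	return end - first;
}
```

For the 1TiB example with initialized size 0 this comes to 2^28 pages,
which is why the first write would be so slow.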

Note for the above I plan to leverage 
fs/ntfs/file.c::ntfs_attr_extend_initialized() or at least an adapted form 
of it.  This function does the above described but for the file write(2) 
case where a user opens a file and writes somewhere beyond the initialized 
size...

I hope that explains what and why I am doing and I also hope that if you 
were not interested you didn't bother reading it all and hence never see 
this sentence.  (-;

Best regards,

	Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/


* Re: [PATCH] Add notification of page becoming writable to VMA ops
  2005-10-24 19:11     ` Hugh Dickins
@ 2005-10-25  7:59       ` Anton Altaparmakov
  2005-10-25  8:26         ` Hugh Dickins
  0 siblings, 1 reply; 24+ messages in thread
From: Anton Altaparmakov @ 2005-10-25  7:59 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: David Howells, Andrew Morton, torvalds, Christoph Hellwig, linux-kernel

On Mon, 2005-10-24 at 20:11 +0100, Hugh Dickins wrote:
> On Mon, 24 Oct 2005, David Howells wrote:
> > 
> > The attached patch adds a new VMA operation to notify a filesystem or other
> > driver about the MMU generating a fault because userspace attempted to write
> > to a page mapped through a read-only PTE.
> > 
> > This facility permits the filesystem or driver to:
> > 
> >  (*) Implement storage allocation/reservation on attempted write, and so to
> >      deal with problems such as ENOSPC more gracefully (perhaps by generating
> >      SIGBUS).
> > 
> >  (*) Delay making the page writable until the contents have been written to a
> >      backing cache. This is useful for NFS/AFS when using FS-Cache/CacheFS.
> >      It permits the filesystem to have some guarantee about the state of the
> >      cache.
> 
> I've only given it a quick look, it looks pretty good, but too hastily
> thrown together, without understanding of the intervening changes:

There really is quite a difference between mm/*.c in -mm and Linus
kernel at present.  Is all this planned to be merged as soon as 2.6.14
is out or is -mm just a playground for now with no mainline merge
intentions?

Just asking so I know whether to work against stock kernels or -mm for
the moment...

[snip some corrections I am in no position to comment on at the moment]
> > @@ -1945,7 +1998,7 @@ static int do_file_page(struct mm_struct
> 
> Drop all those changes to do_file_page (which I added), they're no
> longer necessary.  A case appeared which made it clear that we cannot
> rely on resolving this issue for get_user_pages in a single call to
> handle_mm_fault, and that's why the VM_FAULT_WRITE stuff got added. 
> 
> This complication of do_file_page was always ugly, and I'm delighted
> to drop it.  Whereas the call to do_wp_page from do_swap_page is less
> obtrusive and may still be a worthwhile optimization, though I added
> it for the same disgraced reason a year or more back.

Cool, that reduces the size of the patch.  (-:

Best regards,

        Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/



* Re: [PATCH] Add notification of page becoming writable to VMA ops
  2005-10-25  7:59       ` Anton Altaparmakov
@ 2005-10-25  8:26         ` Hugh Dickins
  2005-10-25  8:49           ` Anton Altaparmakov
  0 siblings, 1 reply; 24+ messages in thread
From: Hugh Dickins @ 2005-10-25  8:26 UTC (permalink / raw)
  To: Anton Altaparmakov
  Cc: David Howells, Andrew Morton, torvalds, Christoph Hellwig, linux-kernel

On Tue, 25 Oct 2005, Anton Altaparmakov wrote:
> On Mon, 2005-10-24 at 20:11 +0100, Hugh Dickins wrote:
> 
> There really is quite a difference between mm/*.c in -mm and Linus
> kernel at present.  Is all this planned to be merged as soon as 2.6.14
> is out or is -mm just a playground for now with no mainline merge
> intentions?

It certainly won't all be merged as soon as 2.6.14 is out, some of it
has only just got into -mm.  Andrew's current intention is to merge
the early part of the changes soonish after 2.6.14 gets out, but he's
not likely to merge it all into 2.6.15.

But we aren't using -mm as a playground: it is likely to go forward,
provided it doesn't show regressions of some kind while it's in -mm.

> Just asking so I know whether to work against stock kernels or -mm for
> the moment...

I'd recommend -mm for now.  page_mkwrite will want a spell in there
too, won't it?

Hugh


* Re: [PATCH] Add notification of page becoming writable to VMA ops
  2005-10-25  8:26         ` Hugh Dickins
@ 2005-10-25  8:49           ` Anton Altaparmakov
  0 siblings, 0 replies; 24+ messages in thread
From: Anton Altaparmakov @ 2005-10-25  8:49 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: David Howells, Andrew Morton, torvalds, Christoph Hellwig, linux-kernel

On Tue, 2005-10-25 at 09:26 +0100, Hugh Dickins wrote:
> On Tue, 25 Oct 2005, Anton Altaparmakov wrote:
> > On Mon, 2005-10-24 at 20:11 +0100, Hugh Dickins wrote:
> > 
> > There really is quite a difference between mm/*.c in -mm and Linus
> > kernel at present.  Is all this planned to be merged as soon as 2.6.14
> > is out or is -mm just a playground for now with no mainline merge
> > intentions?
> 
> It certainly won't all be merged as soon as 2.6.14 is out, some of it
> has only just got into -mm.  Andrew's current intention is to merge
> the early part of the changes soonish after 2.6.14 gets out, but he's
> not likely to merge it all into 2.6.15.

Ok, sounds good.  As long as they at least start converging...

> But we aren't using -mm as a playground: it is likely to go forward,
> provided it doesn't show regressions of some kind while it's in -mm.

Cool.

> > Just asking so I know whether to work against stock kernels or -mm for
> > the moment...
> 
> I'd recommend -mm for now.

Great, thanks, will do.

> page_mkwrite will want a spell in there too, won't it?

Sure.  But if the other mm changes in -mm were not going forward it
would be a little silly to get page_mkwrite to work there only to have
to rewrite it in order to get it merged...

If Linus really is going to release .14 in the next few days,
page_mkwrite is never going to make it into .15 anyway, no matter
what...  But .16 would be a realistic target I would have thought which
seems to fit in nicely with the plans for -mm.  (-:

Best regards,

        Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/



* Re: [PATCH] Add notification of page becoming writable to VMA ops
  2005-10-24 16:23   ` [PATCH] Add notification of page becoming writable to VMA ops David Howells
  2005-10-24 19:11     ` Hugh Dickins
@ 2005-10-25  9:49     ` David Howells
  2005-10-25  9:55     ` David Howells
                       ` (5 subsequent siblings)
  7 siblings, 0 replies; 24+ messages in thread
From: David Howells @ 2005-10-25  9:49 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: David Howells, Anton Altaparmakov, Andrew Morton, torvalds,
	Christoph Hellwig, linux-kernel

Hugh Dickins <hugh@veritas.com> wrote:

> No, you need to pay attention to Nick's PageReserved removal, and
> my pte lock stuff, throughout do_wp_page - there shouldn't be any
> references to PageReserved or page_table_lock there now (and you'll
> need to recheck the mapping/locking/unlocking/unmapping).  Sorry,
> I don't have the time to spare to do it myself right now.

I attempted to 


* Re: [PATCH] Add notification of page becoming writable to VMA ops
  2005-10-24 16:23   ` [PATCH] Add notification of page becoming writable to VMA ops David Howells
  2005-10-24 19:11     ` Hugh Dickins
  2005-10-25  9:49     ` David Howells
@ 2005-10-25  9:55     ` David Howells
  2005-10-25 10:12     ` David Howells
                       ` (4 subsequent siblings)
  7 siblings, 0 replies; 24+ messages in thread
From: David Howells @ 2005-10-25  9:55 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: David Howells, Anton Altaparmakov, Andrew Morton, torvalds,
	Christoph Hellwig, linux-kernel

Hugh Dickins <hugh@veritas.com> wrote:

> I've only given it a quick look, it looks pretty good, but too hastily
> thrown together, without understanding of the intervening changes:

I attempted to forward port your patch; unfortunately, I'm not fully
conversant with some of the VM stuff.

> This isn't necessarily wrong, and may be exactly how it was before,

It's as it was in your patch.

I'll try and fix the changes.

David


* Re: [PATCH] Add notification of page becoming writable to VMA ops
  2005-10-24 16:23   ` [PATCH] Add notification of page becoming writable to VMA ops David Howells
                       ` (2 preceding siblings ...)
  2005-10-25  9:55     ` David Howells
@ 2005-10-25 10:12     ` David Howells
  2005-10-25 13:18     ` [PATCH] Add notification of page becoming writable to VMA ops [try #2] David Howells
                       ` (3 subsequent siblings)
  7 siblings, 0 replies; 24+ messages in thread
From: David Howells @ 2005-10-25 10:12 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: David Howells, Anton Altaparmakov, Andrew Morton, torvalds,
	Christoph Hellwig, linux-kernel

Hugh Dickins <hugh@veritas.com> wrote:

> > +			if (vma->vm_ops->page_mkwrite &&
> > +			    vma->vm_ops->page_mkwrite(vma, new_page) < 0)
> > +				return VM_FAULT_SIGBUS;
> > +		}
> >  	}
> 
> This isn't necessarily wrong, and may be exactly how it was before,
> I don't remember.  But it implies that when page_mkwrite fails,
> it page_cache_releases the page.  Is that desirable?  Or should
> that be left to the caller?

You're right. I've added a release. That may explain a memory leak I was
seeing that I couldn't find.

> > @@ -1945,7 +1998,7 @@ static int do_file_page(struct mm_struct
> 
> Drop all those changes to do_file_page (which I added), they're no
> longer necessary.  A case appeared which made it clear that we cannot
> rely on resolving this issue for get_user_pages in a single call to
> handle_mm_fault, and that's why the VM_FAULT_WRITE stuff got added. 

I take it then that:

 (1) the write_access parameter to do_file_page() is there purely so that
     handle_pte_fault() can jump to it rather than calling it since they have
     the same parameter set and return value;

 (2) and that do_file_page() always installs a read-only PTE so that
     do_wp_page() will be called subsequently on a write attempt.

David


* [PATCH] Add notification of page becoming writable to VMA ops [try #2]
  2005-10-24 16:23   ` [PATCH] Add notification of page becoming writable to VMA ops David Howells
                       ` (3 preceding siblings ...)
  2005-10-25 10:12     ` David Howells
@ 2005-10-25 13:18     ` David Howells
  2005-11-30 13:58     ` [PATCH] Add notification of page becoming writable to VMA ops [try #3] David Howells
                       ` (2 subsequent siblings)
  7 siblings, 0 replies; 24+ messages in thread
From: David Howells @ 2005-10-25 13:18 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: David Howells, Anton Altaparmakov, Andrew Morton, torvalds,
	Christoph Hellwig, linux-kernel


The attached patch adds a new VMA operation to notify a filesystem or other
driver about the MMU generating a fault because userspace attempted to write
to a page mapped through a read-only PTE.

This facility permits the filesystem or driver to:

 (*) Implement storage allocation/reservation on attempted write, and so to
     deal with problems such as ENOSPC more gracefully (perhaps by generating
     SIGBUS).

 (*) Delay making the page writable until the contents have been written to a
     backing cache. This is useful for NFS/AFS when using FS-Cache/CacheFS.
     It permits the filesystem to have some guarantee about the state of the
     cache.
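
As a rough illustration of the contract described above, the following
userspace model (not kernel code) shows a write fault consulting the
owner before the page becomes writable.  Every sketch_* name, the
free_blocks budget and the fault codes are hypothetical stand-ins; only
the shape of the page_mkwrite callback and the "negative return means
SIGBUS" rule are taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Stand-ins for the kernel fault codes; the values are illustrative. */
enum sketch_fault { SKETCH_FAULT_MINOR, SKETCH_FAULT_SIGBUS };

struct sketch_page { unsigned long index; };
struct sketch_vma;

struct sketch_vm_ops {
	/* Same shape as the proposed hook: negative return means refuse. */
	int (*page_mkwrite)(struct sketch_vma *vma, struct sketch_page *page);
};

struct sketch_vma {
	const struct sketch_vm_ops *vm_ops;
	unsigned long free_blocks;	/* hypothetical space budget */
};

/* Hypothetical filesystem hook: reserve one block or fail with -ENOSPC. */
static int sketch_fs_page_mkwrite(struct sketch_vma *vma,
				  struct sketch_page *page)
{
	(void)page;
	if (vma->free_blocks == 0)
		return -ENOSPC;
	vma->free_blocks--;
	return 0;
}

/* Write-fault path: ask the owner first, proceed only on success. */
static enum sketch_fault sketch_write_fault(struct sketch_vma *vma,
					    struct sketch_page *page)
{
	assert(vma != NULL && page != NULL);
	if (vma->vm_ops && vma->vm_ops->page_mkwrite &&
	    vma->vm_ops->page_mkwrite(vma, page) < 0)
		return SKETCH_FAULT_SIGBUS;
	return SKETCH_FAULT_MINOR;
}
```

In this model a mapping with one block left accepts the first write
fault and refuses the second, which is the ENOSPC case from the first
bullet.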

Signed-Off-By: David Howells <dhowells@redhat.com>
---
warthog>diffstat -p1 page-mkwrite-2614rc4mm1-2.diff
 include/linux/mm.h |    4 ++
 mm/memory.c        |   95 +++++++++++++++++++++++++++++++++++++++--------------
 mm/mmap.c          |    9 +++--
 mm/mprotect.c      |    8 +++-
 4 files changed, 88 insertions(+), 28 deletions(-)

diff -uNrp linux-2.6.14-rc4-mm1/include/linux/mm.h linux-2.6.14-rc4-mm1-cachefs/include/linux/mm.h
--- linux-2.6.14-rc4-mm1/include/linux/mm.h	2005-10-17 14:26:43.000000000 +0100
+++ linux-2.6.14-rc4-mm1-cachefs/include/linux/mm.h	2005-10-18 14:02:39.000000000 +0100
@@ -196,6 +196,10 @@ struct vm_operations_struct {
 	void (*close)(struct vm_area_struct * area);
 	struct page * (*nopage)(struct vm_area_struct * area, unsigned long address, int *type);
 	int (*populate)(struct vm_area_struct * area, unsigned long address, unsigned long len, pgprot_t prot, unsigned long pgoff, int nonblock);
+
+	/* notification that a previously read-only page is about to become
+	 * writable, if an error is returned it will cause a SIGBUS */
+	int (*page_mkwrite)(struct vm_area_struct *vma, struct page *page);
 #ifdef CONFIG_NUMA
 	int (*set_policy)(struct vm_area_struct *vma, struct mempolicy *new);
 	struct mempolicy *(*get_policy)(struct vm_area_struct *vma,
diff -uNrp linux-2.6.14-rc4-mm1/mm/memory.c linux-2.6.14-rc4-mm1-cachefs/mm/memory.c
--- linux-2.6.14-rc4-mm1/mm/memory.c	2005-10-17 14:26:44.000000000 +0100
+++ linux-2.6.14-rc4-mm1-cachefs/mm/memory.c	2005-10-25 11:16:56.000000000 +0100
@@ -1247,7 +1247,7 @@ static int do_wp_page(struct mm_struct *
 	struct page *old_page, *new_page;
 	unsigned long pfn = pte_pfn(orig_pte);
 	pte_t entry;
-	int ret = VM_FAULT_MINOR;
+	int reuse, ret = VM_FAULT_MINOR;
 
 	BUG_ON(vma->vm_flags & VM_RESERVED);
 
@@ -1261,19 +1261,49 @@ static int do_wp_page(struct mm_struct *
 	}
 	old_page = pfn_to_page(pfn);
 
-	if (PageAnon(old_page) && !TestSetPageLocked(old_page)) {
-		int reuse = can_share_swap_page(old_page);
-		unlock_page(old_page);
-		if (reuse) {
-			flush_cache_page(vma, address, pfn);
-			entry = pte_mkyoung(orig_pte);
-			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
-			ptep_set_access_flags(vma, address, page_table, entry, 1);
-			update_mmu_cache(vma, address, entry);
-			lazy_mmu_prot_update(entry);
-			ret |= VM_FAULT_WRITE;
-			goto unlock;
+	if (unlikely(vma->vm_flags & VM_SHARED)) {
+		if (vma->vm_ops && vma->vm_ops->page_mkwrite) {
+			/*
+			 * Notify the page owner without the lock held,
+			 * so they can sleep if they want to.
+			 */
+			page_cache_get(old_page);
+			pte_unmap_unlock(page_table, ptl);
+
+			if (vma->vm_ops->page_mkwrite(vma, old_page) < 0)
+				goto unwritable_page;
+
+			page_cache_release(old_page);
+
+			/*
+			 * Since we dropped the lock we need to revalidate
+			 * the PTE as someone else may have changed it.  If
+			 * they did, we just return, as we can count on the
+			 * MMU to tell us if they didn't also make it writable.
+			 */
+			page_table = pte_offset_map_lock(mm, pmd, address,
+							 &ptl);
+			if (!pte_same(*page_table, orig_pte))
+				goto unlock;
 		}
+
+		reuse = 1;
+	} else if (PageAnon(old_page) && !TestSetPageLocked(old_page)) {
+		reuse = can_share_swap_page(old_page);
+		unlock_page(old_page);
+	} else {
+		reuse = 0;
+	}
+
+	if (reuse) {
+		flush_cache_page(vma, address, pfn);
+		entry = pte_mkyoung(orig_pte);
+		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+		ptep_set_access_flags(vma, address, page_table, entry, 1);
+		update_mmu_cache(vma, address, entry);
+		lazy_mmu_prot_update(entry);
+		ret |= VM_FAULT_WRITE;
+		goto unlock;
 	}
 
 	/*
@@ -1326,6 +1356,10 @@ unlock:
 oom:
 	page_cache_release(old_page);
 	return VM_FAULT_OOM;
+
+unwritable_page:
+	page_cache_release(old_page);
+	return VM_FAULT_SIGBUS;
 }
 
 /*
@@ -1847,18 +1881,31 @@ retry:
 	/*
 	 * Should we do an early C-O-W break?
 	 */
-	if (write_access && !(vma->vm_flags & VM_SHARED)) {
-		struct page *page;
+	if (write_access) {
+		if (!(vma->vm_flags & VM_SHARED)) {
+			struct page *page;
 
-		if (unlikely(anon_vma_prepare(vma)))
-			goto oom;
-		page = alloc_page_vma(GFP_HIGHUSER, vma, address);
-		if (!page)
-			goto oom;
-		copy_user_highpage(page, new_page, address);
-		page_cache_release(new_page);
-		new_page = page;
-		anon = 1;
+			if (unlikely(anon_vma_prepare(vma)))
+				goto oom;
+			page = alloc_page_vma(GFP_HIGHUSER, vma, address);
+			if (!page)
+				goto oom;
+			copy_user_highpage(page, new_page, address);
+			page_cache_release(new_page);
+			new_page = page;
+			anon = 1;
+
+		} else {
+			/* if the page will be shareable, see if the backing
+			 * address space wants to know that the page is about
+			 * to become writable */
+			if (vma->vm_ops->page_mkwrite &&
+			    vma->vm_ops->page_mkwrite(vma, new_page) < 0
+			    ) {
+				page_cache_release(new_page);
+				return VM_FAULT_SIGBUS;
+			}
+		}
 	}
 
 	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
diff -uNrp linux-2.6.14-rc4-mm1/mm/mmap.c linux-2.6.14-rc4-mm1-cachefs/mm/mmap.c
--- linux-2.6.14-rc4-mm1/mm/mmap.c	2005-10-17 14:26:44.000000000 +0100
+++ linux-2.6.14-rc4-mm1-cachefs/mm/mmap.c	2005-10-18 14:02:39.000000000 +0100
@@ -1058,7 +1058,8 @@ munmap_back:
 	vma->vm_start = addr;
 	vma->vm_end = addr + len;
 	vma->vm_flags = vm_flags;
-	vma->vm_page_prot = protection_map[vm_flags & 0x0f];
+	vma->vm_page_prot = protection_map[vm_flags &
+				(VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)];
 	vma->vm_pgoff = pgoff;
 
 	if (file) {
@@ -1092,6 +1093,9 @@ munmap_back:
 		if (error)
 			goto free_vma;
 	}
+	if (vma->vm_ops && vma->vm_ops->page_mkwrite)
+		vma->vm_page_prot = protection_map[vm_flags &
+					(VM_READ|VM_WRITE|VM_EXEC)];
 
 	/* We set VM_ACCOUNT in a shared mapping's vm_flags, to inform
 	 * shmem_zero_setup (perhaps called through /dev/zero's ->mmap)
@@ -1926,7 +1930,8 @@ unsigned long do_brk(unsigned long addr,
 	vma->vm_end = addr + len;
 	vma->vm_pgoff = pgoff;
 	vma->vm_flags = flags;
-	vma->vm_page_prot = protection_map[flags & 0x0f];
+	vma->vm_page_prot = protection_map[flags &
+				(VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)];
 	vma_link(mm, vma, prev, rb_link, rb_parent);
 out:
 	mm->total_vm += len >> PAGE_SHIFT;
diff -uNrp linux-2.6.14-rc4-mm1/mm/mprotect.c linux-2.6.14-rc4-mm1-cachefs/mm/mprotect.c
--- linux-2.6.14-rc4-mm1/mm/mprotect.c	2005-10-17 14:26:44.000000000 +0100
+++ linux-2.6.14-rc4-mm1-cachefs/mm/mprotect.c	2005-10-18 14:02:39.000000000 +0100
@@ -106,6 +106,7 @@ mprotect_fixup(struct vm_area_struct *vm
 	unsigned long oldflags = vma->vm_flags;
 	long nrpages = (end - start) >> PAGE_SHIFT;
 	unsigned long charged = 0;
+	unsigned int mask;
 	pgprot_t newprot;
 	pgoff_t pgoff;
 	int error;
@@ -140,8 +141,6 @@ mprotect_fixup(struct vm_area_struct *vm
 		}
 	}
 
-	newprot = protection_map[newflags & 0xf];
-
 	/*
 	 * First try to merge with previous and/or next vma.
 	 */
@@ -168,6 +167,11 @@ mprotect_fixup(struct vm_area_struct *vm
 	}
 
 success:
+	mask = VM_READ|VM_WRITE|VM_EXEC|VM_SHARED;
+	if (vma->vm_ops && vma->vm_ops->page_mkwrite)
+		mask &= ~VM_SHARED;
+	newprot = protection_map[newflags & mask];
+
 	/*
 	 * vm_flags and vm_page_prot are protected by the mmap_sem
 	 * held in write mode.
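
The protection_map masking in the mmap.c and mprotect.c hunks above can
be modelled in userspace roughly like this.  It is a sketch under
assumptions, not kernel code: the sketch_* names are invented, and the
"initially writable" rule assumes, as on most architectures, that only
the shared+write protection_map entries produce a writable PTE.  The
flag values are the low four vm_flags bits that the old 0x0f mask
selected.

```c
#include <assert.h>

/* Low four vm_flags bits, as selected by the old 0x0f mask. */
#define SK_VM_READ	0x1u
#define SK_VM_WRITE	0x2u
#define SK_VM_EXEC	0x4u
#define SK_VM_SHARED	0x8u

/*
 * Pick the protection_map index.  When the VMA has a page_mkwrite
 * operation, VM_SHARED is masked out so that the private entry is
 * used and the first write takes a fault.
 */
static unsigned int sketch_prot_index(unsigned int vm_flags,
				      int has_page_mkwrite)
{
	unsigned int mask = SK_VM_READ | SK_VM_WRITE |
			    SK_VM_EXEC | SK_VM_SHARED;

	if (has_page_mkwrite)
		mask &= ~SK_VM_SHARED;
	return vm_flags & mask;
}

/* Model assumption: only shared+write entries start out writable. */
static int sketch_initially_writable(unsigned int prot_index)
{
	assert(prot_index <= 0xfu);	/* must fit the 16-entry table */
	return (prot_index & SK_VM_SHARED) && (prot_index & SK_VM_WRITE);
}
```

So a shared read/write mapping normally uses index 0xb and starts out
writable, whereas with page_mkwrite present it uses index 0x3 and the
first store faults into do_wp_page, where the notification is made.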


* [PATCH] Add notification of page becoming writable to VMA ops [try #3]
  2005-10-24 16:23   ` [PATCH] Add notification of page becoming writable to VMA ops David Howells
                       ` (4 preceding siblings ...)
  2005-10-25 13:18     ` [PATCH] Add notification of page becoming writable to VMA ops [try #2] David Howells
@ 2005-11-30 13:58     ` David Howells
  2005-11-30 14:40       ` Miklos Szeredi
  2005-11-30 14:50       ` David Howells
  2005-11-30 15:20     ` [PATCH] Add notification of page becoming writable to VMA ops [try #4] David Howells
  2006-01-11 12:19     ` [PATCH] Add notification of page becoming writable to VMA ops [try #5] David Howells
  7 siblings, 2 replies; 24+ messages in thread
From: David Howells @ 2005-11-30 13:58 UTC (permalink / raw)
  To: torvalds, Andrew Morton
  Cc: Hugh Dickins, Anton Altaparmakov, Christoph Hellwig, linux-kernel


The attached patch adds a new VMA operation to notify a filesystem or other
driver about the MMU generating a fault because userspace attempted to write
to a page mapped through a read-only PTE.

This facility permits the filesystem or driver to:

 (*) Implement storage allocation/reservation on attempted write, and so to
     deal with problems such as ENOSPC more gracefully (perhaps by generating
     SIGBUS).

 (*) Delay making the page writable until the contents have been written to a
     backing cache. This is useful for NFS/AFS when using FS-Cache/CacheFS.
     It permits the filesystem to have some guarantee about the state of the
     cache.

Updated to 2.6.14-git14.

Signed-Off-By: David Howells <dhowells@redhat.com>
---
warthog>diffstat -p1 page-mkwrite-2614git14.diff 
 include/linux/mm.h |    4 ++
 mm/memory.c        |   95 +++++++++++++++++++++++++++++++++++++++--------------
 mm/mmap.c          |   12 +++++-
 mm/mprotect.c      |   11 +++++-
 4 files changed, 94 insertions(+), 28 deletions(-)

diff -uNrp linux-2.6.14-git14/include/linux/mm.h linux-2.6.14-git14-pagenotify/include/linux/mm.h
--- linux-2.6.14-git14/include/linux/mm.h	2005-11-30 13:01:27.000000000 +0000
+++ linux-2.6.14-git14-pagenotify/include/linux/mm.h	2005-11-30 13:02:48.000000000 +0000
@@ -196,6 +196,10 @@ struct vm_operations_struct {
 	void (*close)(struct vm_area_struct * area);
 	struct page * (*nopage)(struct vm_area_struct * area, unsigned long address, int *type);
 	int (*populate)(struct vm_area_struct * area, unsigned long address, unsigned long len, pgprot_t prot, unsigned long pgoff, int nonblock);
+
+	/* notification that a previously read-only page is about to become
+	 * writable, if an error is returned it will cause a SIGBUS */
+	int (*page_mkwrite)(struct vm_area_struct *vma, struct page *page);
 #ifdef CONFIG_NUMA
 	int (*set_policy)(struct vm_area_struct *vma, struct mempolicy *new);
 	struct mempolicy *(*get_policy)(struct vm_area_struct *vma,
diff -uNrp linux-2.6.14-git14/mm/memory.c linux-2.6.14-git14-pagenotify/mm/memory.c
--- linux-2.6.14-git14/mm/memory.c	2005-11-30 13:01:29.000000000 +0000
+++ linux-2.6.14-git14-pagenotify/mm/memory.c	2005-11-30 13:15:50.000000000 +0000
@@ -1253,7 +1253,7 @@ static int do_wp_page(struct mm_struct *
 	struct page *old_page, *new_page;
 	unsigned long pfn = pte_pfn(orig_pte);
 	pte_t entry;
-	int ret = VM_FAULT_MINOR;
+	int reuse, ret = VM_FAULT_MINOR;
 
 	BUG_ON(vma->vm_flags & VM_RESERVED);
 
@@ -1267,19 +1267,49 @@ static int do_wp_page(struct mm_struct *
 	}
 	old_page = pfn_to_page(pfn);
 
-	if (PageAnon(old_page) && !TestSetPageLocked(old_page)) {
-		int reuse = can_share_swap_page(old_page);
-		unlock_page(old_page);
-		if (reuse) {
-			flush_cache_page(vma, address, pfn);
-			entry = pte_mkyoung(orig_pte);
-			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
-			ptep_set_access_flags(vma, address, page_table, entry, 1);
-			update_mmu_cache(vma, address, entry);
-			lazy_mmu_prot_update(entry);
-			ret |= VM_FAULT_WRITE;
-			goto unlock;
+	if (unlikely(vma->vm_flags & VM_SHARED)) {
+		if (vma->vm_ops && vma->vm_ops->page_mkwrite) {
+			/*
+			 * Notify the page owner without the lock held,
+			 * so they can sleep if they want to.
+			 */
+			page_cache_get(old_page);
+			pte_unmap_unlock(page_table, ptl);
+
+			if (vma->vm_ops->page_mkwrite(vma, old_page) < 0)
+				goto unwritable_page;
+
+			page_cache_release(old_page);
+
+			/*
+			 * Since we dropped the lock we need to revalidate
+			 * the PTE as someone else may have changed it.  If
+			 * they did, we just return, as we can count on the
+			 * MMU to tell us if they didn't also make it writable.
+			 */
+			page_table = pte_offset_map_lock(mm, pmd, address,
+							 &ptl);
+			if (!pte_same(*page_table, orig_pte))
+				goto unlock;
 		}
+
+		reuse = 1;
+	} else if (PageAnon(old_page) && !TestSetPageLocked(old_page)) {
+		reuse = can_share_swap_page(old_page);
+		unlock_page(old_page);
+	} else {
+		reuse = 0;
+	}
+
+	if (reuse) {
+		flush_cache_page(vma, address, pfn);
+		entry = pte_mkyoung(orig_pte);
+		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+		ptep_set_access_flags(vma, address, page_table, entry, 1);
+		update_mmu_cache(vma, address, entry);
+		lazy_mmu_prot_update(entry);
+		ret |= VM_FAULT_WRITE;
+		goto unlock;
 	}
 
 	/*
@@ -1332,6 +1362,10 @@ unlock:
 oom:
 	page_cache_release(old_page);
 	return VM_FAULT_OOM;
+
+unwritable_page:
+	page_cache_release(old_page);
+	return VM_FAULT_SIGBUS;
 }
 
 /*
@@ -1853,18 +1887,31 @@ retry:
 	/*
 	 * Should we do an early C-O-W break?
 	 */
-	if (write_access && !(vma->vm_flags & VM_SHARED)) {
-		struct page *page;
+	if (write_access) {
+		if (!(vma->vm_flags & VM_SHARED)) {
+			struct page *page;
 
-		if (unlikely(anon_vma_prepare(vma)))
-			goto oom;
-		page = alloc_page_vma(GFP_HIGHUSER, vma, address);
-		if (!page)
-			goto oom;
-		copy_user_highpage(page, new_page, address);
-		page_cache_release(new_page);
-		new_page = page;
-		anon = 1;
+			if (unlikely(anon_vma_prepare(vma)))
+				goto oom;
+			page = alloc_page_vma(GFP_HIGHUSER, vma, address);
+			if (!page)
+				goto oom;
+			copy_user_highpage(page, new_page, address);
+			page_cache_release(new_page);
+			new_page = page;
+			anon = 1;
+
+		} else {
+			/* if the page will be shareable, see if the backing
+			 * address space wants to know that the page is about
+			 * to become writable */
+			if (vma->vm_ops->page_mkwrite &&
+			    vma->vm_ops->page_mkwrite(vma, new_page) < 0
+			    ) {
+				page_cache_release(new_page);
+				return VM_FAULT_SIGBUS;
+			}
+		}
 	}
 
 	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
diff -uNrp linux-2.6.14-git14/mm/mmap.c linux-2.6.14-git14-pagenotify/mm/mmap.c
--- linux-2.6.14-git14/mm/mmap.c	2005-11-30 13:01:29.000000000 +0000
+++ linux-2.6.14-git14-pagenotify/mm/mmap.c	2005-11-30 13:27:19.000000000 +0000
@@ -1058,7 +1058,8 @@ munmap_back:
 	vma->vm_start = addr;
 	vma->vm_end = addr + len;
 	vma->vm_flags = vm_flags;
-	vma->vm_page_prot = protection_map[vm_flags & 0x0f];
+	vma->vm_page_prot = protection_map[vm_flags &
+				(VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)];
 	vma->vm_pgoff = pgoff;
 
 	if (file) {
@@ -1093,6 +1094,12 @@ munmap_back:
 			goto free_vma;
 	}
 
+	/* Don't make the VMA automatically writable if it's shared, but the
+	 * backer wishes to know when pages are first written to */
+	if (vma->vm_ops && vma->vm_ops->page_mkwrite)
+		vma->vm_page_prot =
+			protection_map[vm_flags & (VM_READ|VM_WRITE|VM_EXEC)];
+
 	/* We set VM_ACCOUNT in a shared mapping's vm_flags, to inform
 	 * shmem_zero_setup (perhaps called through /dev/zero's ->mmap)
 	 * that memory reservation must be checked; but that reservation
@@ -1926,7 +1933,8 @@ unsigned long do_brk(unsigned long addr,
 	vma->vm_end = addr + len;
 	vma->vm_pgoff = pgoff;
 	vma->vm_flags = flags;
-	vma->vm_page_prot = protection_map[flags & 0x0f];
+	vma->vm_page_prot = protection_map[flags &
+				(VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)];
 	vma_link(mm, vma, prev, rb_link, rb_parent);
 out:
 	mm->total_vm += len >> PAGE_SHIFT;
diff -uNrp linux-2.6.14-git14/mm/mprotect.c linux-2.6.14-git14-pagenotify/mm/mprotect.c
--- linux-2.6.14-git14/mm/mprotect.c	2005-11-30 13:01:29.000000000 +0000
+++ linux-2.6.14-git14-pagenotify/mm/mprotect.c	2005-11-30 13:26:37.000000000 +0000
@@ -106,6 +106,7 @@ mprotect_fixup(struct vm_area_struct *vm
 	unsigned long oldflags = vma->vm_flags;
 	long nrpages = (end - start) >> PAGE_SHIFT;
 	unsigned long charged = 0;
+	unsigned int mask;
 	pgprot_t newprot;
 	pgoff_t pgoff;
 	int error;
@@ -140,8 +141,6 @@ mprotect_fixup(struct vm_area_struct *vm
 		}
 	}
 
-	newprot = protection_map[newflags & 0xf];
-
 	/*
 	 * First try to merge with previous and/or next vma.
 	 */
@@ -168,6 +167,14 @@ mprotect_fixup(struct vm_area_struct *vm
 	}
 
 success:
+	/* Don't make the VMA automatically writable if it's shared, but the
+	 * backer wishes to know when pages are first written to */
+	mask = VM_READ|VM_WRITE|VM_EXEC|VM_SHARED;
+	if (vma->vm_ops && vma->vm_ops->page_mkwrite)
+		mask &= ~VM_SHARED;
+
+	newprot = protection_map[newflags & mask];
+
 	/*
 	 * vm_flags and vm_page_prot are protected by the mmap_sem
 	 * held in write mode.


* Re: [PATCH] Add notification of page becoming writable to VMA ops [try #3]
  2005-11-30 13:58     ` [PATCH] Add notification of page becoming writable to VMA ops [try #3] David Howells
@ 2005-11-30 14:40       ` Miklos Szeredi
  2005-11-30 14:50       ` David Howells
  1 sibling, 0 replies; 24+ messages in thread
From: Miklos Szeredi @ 2005-11-30 14:40 UTC (permalink / raw)
  To: dhowells; +Cc: torvalds, akpm, hugh, aia21, hch, linux-kernel

> The attached patch adds a new VMA operation to notify a filesystem or other
> driver about the MMU generating a fault because userspace attempted to write
> to a page mapped through a read-only PTE.
> 
> This facility permits the filesystem or driver to:
> 
>  (*) Implement storage allocation/reservation on attempted write, and so to
>      deal with problems such as ENOSPC more gracefully (perhaps by generating
>      SIGBUS).
> 
>  (*) Delay making the page writable until the contents have been written to a
>      backing cache. This is useful for NFS/AFS when using FS-Cache/CacheFS.
>      It permits the filesystem to have some guarantee about the state of the
>      cache.

  (*) Account and limit the number of dirty pages.

This is one piece of the puzzle needed to make shared writable mappings
work safely in FUSE.
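
As a rough illustration of the dirty-page accounting idea (a userspace
sketch, not code from the patch or from FUSE): the handler below has the
same shape as the proposed ->page_mkwrite(), but `struct dirty_budget`,
the `accounted` flag and both `demo_*` functions are hypothetical, and the
structs are stand-ins for the kernel types. A real filesystem would more
likely throttle the writer than refuse outright.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical per-filesystem budget of pages allowed to be dirty. */
struct dirty_budget {
	unsigned long nr_dirty;	/* pages currently counted as dirty */
	unsigned long limit;	/* maximum number allowed */
};

/* Userspace stand-ins for the kernel types, for illustration only. */
struct page { int accounted; };
struct vm_area_struct { struct dirty_budget *budget; };

/*
 * Same shape as the proposed ->page_mkwrite(): return 0 to let the page
 * become writable, or a negative errno to refuse (SIGBUS in the patch).
 */
static int demo_page_mkwrite(struct vm_area_struct *vma, struct page *page)
{
	struct dirty_budget *b = vma->budget;

	assert(b != NULL);
	if (page->accounted)		/* already counted, nothing to do */
		return 0;
	if (b->nr_dirty >= b->limit)	/* over budget: refuse the write */
		return -ENOSPC;
	b->nr_dirty++;
	page->accounted = 1;
	return 0;
}

/* Hypothetical hook for when writeback has cleaned the page again. */
static void demo_page_cleaned(struct vm_area_struct *vma, struct page *page)
{
	if (page->accounted) {
		page->accounted = 0;
		vma->budget->nr_dirty--;
	}
}
```

The point is only that the first write to each page passes through one
place where it can be counted, which is not possible if the PTE is
writable from the start.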

> Updated to 2.6.14-git14.

But doesn't apply against 2.6.15-rc3 or -rc3-mm1.

Miklos


* Re: [PATCH] Add notification of page becoming writable to VMA ops [try #3]
  2005-11-30 13:58     ` [PATCH] Add notification of page becoming writable to VMA ops [try #3] David Howells
  2005-11-30 14:40       ` Miklos Szeredi
@ 2005-11-30 14:50       ` David Howells
  1 sibling, 0 replies; 24+ messages in thread
From: David Howells @ 2005-11-30 14:50 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: dhowells, torvalds, akpm, hugh, aia21, hch, linux-kernel

Miklos Szeredi <miklos@szeredi.hu> wrote:

> > Updated to 2.6.14-git14.
> 
> But doesn't apply against 2.6.15-rc3 or -rc3-mm1.

Hmmm... It would appear that "The latest snapshot for the stable Linux kernel
tree" is a bit out of date on www.kernel.org. I should've checked the dates.

David


* [PATCH] Add notification of page becoming writable to VMA ops [try #4]
  2005-10-24 16:23   ` [PATCH] Add notification of page becoming writable to VMA ops David Howells
                       ` (5 preceding siblings ...)
  2005-11-30 13:58     ` [PATCH] Add notification of page becoming writable to VMA ops [try #3] David Howells
@ 2005-11-30 15:20     ` David Howells
  2006-01-11 12:19     ` [PATCH] Add notification of page becoming writable to VMA ops [try #5] David Howells
  7 siblings, 0 replies; 24+ messages in thread
From: David Howells @ 2005-11-30 15:20 UTC (permalink / raw)
  To: torvalds, Andrew Morton, Miklos Szeredi
  Cc: Hugh Dickins, Anton Altaparmakov, Christoph Hellwig, linux-kernel


The attached patch adds a new VMA operation to notify a filesystem or other
driver about the MMU generating a fault because userspace attempted to write
to a page mapped through a read-only PTE.

This facility permits the filesystem or driver to:

 (*) Implement storage allocation/reservation on attempted write, and so to
     deal with problems such as ENOSPC more gracefully (perhaps by generating
     SIGBUS).

 (*) Delay making the page writable until the contents have been written to a
     backing cache. This is useful for NFS/AFS when using FS-Cache/CacheFS.
     It permits the filesystem to have some guarantee about the state of the
     cache.

 (*) Account and limit the number of dirty pages. This is one piece of the
     puzzle needed to make shared writable mappings work safely in FUSE.
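
For illustration only (a userspace model, not part of the patch): the
sketch below mimics how the fault path consults an optional
->page_mkwrite() and turns a negative return into SIGBUS, using a
reservation-style handler as the example. The `demo_*` names, the
`free_blocks` and `reserved` fields and the `enum demo_fault` values are
all hypothetical stand-ins; only the `page_mkwrite` signature and the
"< 0 means SIGBUS" convention come from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical fault results; the kernel uses VM_FAULT_* codes. */
enum demo_fault { DEMO_FAULT_MINOR, DEMO_FAULT_SIGBUS };

/* Userspace stand-ins for the kernel types. */
struct page { int reserved; };
struct vm_area_struct;

struct vm_operations_struct {
	int (*page_mkwrite)(struct vm_area_struct *vma, struct page *page);
};

struct vm_area_struct {
	const struct vm_operations_struct *vm_ops;
	long free_blocks;	/* hypothetical backing-store space left */
};

/* Handler: reserve one block per page on its first write, else ENOSPC. */
static int demo_reserve_mkwrite(struct vm_area_struct *vma, struct page *page)
{
	if (page->reserved)
		return 0;
	if (vma->free_blocks <= 0)
		return -ENOSPC;
	vma->free_blocks--;
	page->reserved = 1;
	return 0;
}

/*
 * Model of a write fault on a shared, read-only PTE: notify the owner
 * first, and only make the PTE writable if the owner did not object.
 */
static enum demo_fault demo_write_fault(struct vm_area_struct *vma,
					struct page *page, int *pte_writable)
{
	assert(vma && page && pte_writable);

	if (vma->vm_ops && vma->vm_ops->page_mkwrite &&
	    vma->vm_ops->page_mkwrite(vma, page) < 0)
		return DEMO_FAULT_SIGBUS;

	*pte_writable = 1;
	return DEMO_FAULT_MINOR;
}
```

A VMA without the operation behaves exactly as before; one whose handler
runs out of space sees the write refused at fault time rather than at
writeback.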

Updated to 2.6.15-rc3.

Signed-Off-By: David Howells <dhowells@redhat.com>
---
warthog>diffstat -p1 page-mkwrite-2615rc3.diff 
 include/linux/mm.h |    4 ++
 mm/memory.c        |   95 +++++++++++++++++++++++++++++++++++++++--------------
 mm/mmap.c          |   12 +++++-
 mm/mprotect.c      |   11 +++++-
 4 files changed, 94 insertions(+), 28 deletions(-)

diff -uNrp linux-2.6.15-rc3/include/linux/mm.h linux-2.6.15-rc3-page-mkwrite/include/linux/mm.h
--- linux-2.6.15-rc3/include/linux/mm.h	2005-11-29 17:35:12.000000000 +0000
+++ linux-2.6.15-rc3-page-mkwrite/include/linux/mm.h	2005-11-30 14:51:55.000000000 +0000
@@ -197,6 +197,10 @@ struct vm_operations_struct {
 	void (*close)(struct vm_area_struct * area);
 	struct page * (*nopage)(struct vm_area_struct * area, unsigned long address, int *type);
 	int (*populate)(struct vm_area_struct * area, unsigned long address, unsigned long len, pgprot_t prot, unsigned long pgoff, int nonblock);
+
+	/* notification that a previously read-only page is about to become
+	 * writable, if an error is returned it will cause a SIGBUS */
+	int (*page_mkwrite)(struct vm_area_struct *vma, struct page *page);
 #ifdef CONFIG_NUMA
 	int (*set_policy)(struct vm_area_struct *vma, struct mempolicy *new);
 	struct mempolicy *(*get_policy)(struct vm_area_struct *vma,
diff -uNrp linux-2.6.15-rc3/mm/memory.c linux-2.6.15-rc3-page-mkwrite/mm/memory.c
--- linux-2.6.15-rc3/mm/memory.c	2005-11-29 17:35:13.000000000 +0000
+++ linux-2.6.15-rc3-page-mkwrite/mm/memory.c	2005-11-30 14:55:21.000000000 +0000
@@ -1334,26 +1334,56 @@ static int do_wp_page(struct mm_struct *
 {
 	struct page *old_page, *src_page, *new_page;
 	pte_t entry;
-	int ret = VM_FAULT_MINOR;
+	int reuse, ret = VM_FAULT_MINOR;
 
 	old_page = vm_normal_page(vma, address, orig_pte);
 	src_page = old_page;
 	if (!old_page)
 		goto gotten;
 
-	if (PageAnon(old_page) && !TestSetPageLocked(old_page)) {
-		int reuse = can_share_swap_page(old_page);
-		unlock_page(old_page);
-		if (reuse) {
-			flush_cache_page(vma, address, pfn);
-			entry = pte_mkyoung(orig_pte);
-			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
-			ptep_set_access_flags(vma, address, page_table, entry, 1);
-			update_mmu_cache(vma, address, entry);
-			lazy_mmu_prot_update(entry);
-			ret |= VM_FAULT_WRITE;
-			goto unlock;
+	if (unlikely(vma->vm_flags & VM_SHARED)) {
+		if (vma->vm_ops && vma->vm_ops->page_mkwrite) {
+			/*
+			 * Notify the page owner without the lock held,
+			 * so they can sleep if they want to.
+			 */
+			page_cache_get(old_page);
+			pte_unmap_unlock(page_table, ptl);
+
+			if (vma->vm_ops->page_mkwrite(vma, old_page) < 0)
+				goto unwritable_page;
+
+			page_cache_release(old_page);
+
+			/*
+			 * Since we dropped the lock we need to revalidate
+			 * the PTE as someone else may have changed it.  If
+			 * they did, we just return, as we can count on the
+			 * MMU to tell us if they didn't also make it writable.
+			 */
+			page_table = pte_offset_map_lock(mm, pmd, address,
+							 &ptl);
+			if (!pte_same(*page_table, orig_pte))
+				goto unlock;
 		}
+
+		reuse = 1;
+	} else if (PageAnon(old_page) && !TestSetPageLocked(old_page)) {
+		reuse = can_share_swap_page(old_page);
+		unlock_page(old_page);
+	} else {
+		reuse = 0;
+	}
+
+	if (reuse) {
+		flush_cache_page(vma, address, pfn);
+		entry = pte_mkyoung(orig_pte);
+		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+		ptep_set_access_flags(vma, address, page_table, entry, 1);
+		update_mmu_cache(vma, address, entry);
+		lazy_mmu_prot_update(entry);
+		ret |= VM_FAULT_WRITE;
+		goto unlock;
 	}
 
 	/*
@@ -1413,6 +1443,10 @@ oom:
 	if (old_page)
 		page_cache_release(old_page);
 	return VM_FAULT_OOM;
+
+unwritable_page:
+	page_cache_release(old_page);
+	return VM_FAULT_SIGBUS;
 }
 
 /*
@@ -1933,18 +1967,31 @@ retry:
 	/*
 	 * Should we do an early C-O-W break?
 	 */
-	if (write_access && !(vma->vm_flags & VM_SHARED)) {
-		struct page *page;
+	if (write_access) {
+		if (!(vma->vm_flags & VM_SHARED)) {
+			struct page *page;
 
-		if (unlikely(anon_vma_prepare(vma)))
-			goto oom;
-		page = alloc_page_vma(GFP_HIGHUSER, vma, address);
-		if (!page)
-			goto oom;
-		cow_user_page(page, new_page, address);
-		page_cache_release(new_page);
-		new_page = page;
-		anon = 1;
+			if (unlikely(anon_vma_prepare(vma)))
+				goto oom;
+			page = alloc_page_vma(GFP_HIGHUSER, vma, address);
+			if (!page)
+				goto oom;
+			cow_user_page(page, new_page, address);
+			page_cache_release(new_page);
+			new_page = page;
+			anon = 1;
+
+		} else {
+			/* if the page will be shareable, see if the backing
+			 * address space wants to know that the page is about
+			 * to become writable */
+			if (vma->vm_ops->page_mkwrite &&
+			    vma->vm_ops->page_mkwrite(vma, new_page) < 0
+			    ) {
+				page_cache_release(new_page);
+				return VM_FAULT_SIGBUS;
+			}
+		}
 	}
 
 	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
diff -uNrp linux-2.6.15-rc3/mm/mmap.c linux-2.6.15-rc3-page-mkwrite/mm/mmap.c
--- linux-2.6.15-rc3/mm/mmap.c	2005-11-29 17:35:13.000000000 +0000
+++ linux-2.6.15-rc3-page-mkwrite/mm/mmap.c	2005-11-30 14:51:55.000000000 +0000
@@ -1058,7 +1058,8 @@ munmap_back:
 	vma->vm_start = addr;
 	vma->vm_end = addr + len;
 	vma->vm_flags = vm_flags;
-	vma->vm_page_prot = protection_map[vm_flags & 0x0f];
+	vma->vm_page_prot = protection_map[vm_flags &
+				(VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)];
 	vma->vm_pgoff = pgoff;
 
 	if (file) {
@@ -1082,6 +1083,12 @@ munmap_back:
 			goto free_vma;
 	}
 
+	/* Don't make the VMA automatically writable if it's shared, but the
+	 * backer wishes to know when pages are first written to */
+	if (vma->vm_ops && vma->vm_ops->page_mkwrite)
+		vma->vm_page_prot =
+			protection_map[vm_flags & (VM_READ|VM_WRITE|VM_EXEC)];
+
 	/* We set VM_ACCOUNT in a shared mapping's vm_flags, to inform
 	 * shmem_zero_setup (perhaps called through /dev/zero's ->mmap)
 	 * that memory reservation must be checked; but that reservation
@@ -1915,7 +1922,8 @@ unsigned long do_brk(unsigned long addr,
 	vma->vm_end = addr + len;
 	vma->vm_pgoff = pgoff;
 	vma->vm_flags = flags;
-	vma->vm_page_prot = protection_map[flags & 0x0f];
+	vma->vm_page_prot = protection_map[flags &
+				(VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)];
 	vma_link(mm, vma, prev, rb_link, rb_parent);
 out:
 	mm->total_vm += len >> PAGE_SHIFT;
diff -uNrp linux-2.6.15-rc3/mm/mprotect.c linux-2.6.15-rc3-page-mkwrite/mm/mprotect.c
--- linux-2.6.15-rc3/mm/mprotect.c	2005-11-29 17:35:13.000000000 +0000
+++ linux-2.6.15-rc3-page-mkwrite/mm/mprotect.c	2005-11-30 14:51:55.000000000 +0000
@@ -106,6 +106,7 @@ mprotect_fixup(struct vm_area_struct *vm
 	unsigned long oldflags = vma->vm_flags;
 	long nrpages = (end - start) >> PAGE_SHIFT;
 	unsigned long charged = 0;
+	unsigned int mask;
 	pgprot_t newprot;
 	pgoff_t pgoff;
 	int error;
@@ -132,8 +133,6 @@ mprotect_fixup(struct vm_area_struct *vm
 		}
 	}
 
-	newprot = protection_map[newflags & 0xf];
-
 	/*
 	 * First try to merge with previous and/or next vma.
 	 */
@@ -160,6 +159,14 @@ mprotect_fixup(struct vm_area_struct *vm
 	}
 
 success:
+	/* Don't make the VMA automatically writable if it's shared, but the
+	 * backer wishes to know when pages are first written to */
+	mask = VM_READ|VM_WRITE|VM_EXEC|VM_SHARED;
+	if (vma->vm_ops && vma->vm_ops->page_mkwrite)
+		mask &= ~VM_SHARED;
+
+	newprot = protection_map[newflags & mask];
+
 	/*
 	 * vm_flags and vm_page_prot are protected by the mmap_sem
 	 * held in write mode.


* [PATCH] Add notification of page becoming writable to VMA ops [try #5]
  2005-10-24 16:23   ` [PATCH] Add notification of page becoming writable to VMA ops David Howells
                       ` (6 preceding siblings ...)
  2005-11-30 15:20     ` [PATCH] Add notification of page becoming writable to VMA ops [try #4] David Howells
@ 2006-01-11 12:19     ` David Howells
  7 siblings, 0 replies; 24+ messages in thread
From: David Howells @ 2006-01-11 12:19 UTC (permalink / raw)
  To: torvalds, Andrew Morton, Miklos Szeredi
  Cc: Hugh Dickins, Anton Altaparmakov, linux-kernel


The attached patch adds a new VMA operation to notify a filesystem or other
driver about the MMU generating a fault because userspace attempted to write
to a page mapped through a read-only PTE.

This facility permits the filesystem or driver to:

 (*) Implement storage allocation/reservation on attempted write, and so to
     deal with problems such as ENOSPC more gracefully (perhaps by generating
     SIGBUS).

 (*) Delay making the page writable until the contents have been written to a
     backing cache. This is useful for NFS/AFS when using FS-Cache/CacheFS.
     It permits the filesystem to have some guarantee about the state of the
     cache.

 (*) Account and limit the number of dirty pages. This is one piece of the
     puzzle needed to make shared writable mappings work safely in FUSE.
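
As a small worked example of the protection masking done in the mmap.c and
mprotect.c hunks (a userspace sketch; `demo_prot_index` is a hypothetical
helper, not a kernel function): protection_map[] has 16 entries, with
indices 0-7 holding the private protections and 8-15 the shared ones.
Dropping VM_SHARED from the index when the VMA has a page_mkwrite
operation selects the private entry, so the PTE starts out read-only and
the first write faults.

```c
#include <assert.h>

/* Low four vm_flags bits, which index protection_map[]. */
#define VM_READ   0x1u
#define VM_WRITE  0x2u
#define VM_EXEC   0x4u
#define VM_SHARED 0x8u

/*
 * Compute the protection_map[] index the way the patched code does:
 * mask off VM_SHARED if the backer wants write notification.
 */
static unsigned int demo_prot_index(unsigned int vm_flags,
				    int has_page_mkwrite)
{
	unsigned int mask = VM_READ | VM_WRITE | VM_EXEC | VM_SHARED;

	if (has_page_mkwrite)
		mask &= ~VM_SHARED;

	assert(mask <= 0xfu);
	return vm_flags & mask;
}
```

For a shared read-write mapping the index is 0xb without the operation and
0x3 with it, i.e. the same entry a private read-write mapping would use.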

Updated to 2.6.15-mm2.

Signed-Off-By: David Howells <dhowells@redhat.com>
---
warthog>diffstat -p1 page-mkwrite-2615mm2.diff
 include/linux/mm.h |    4 ++
 mm/memory.c        |   99 ++++++++++++++++++++++++++++++++++++++++-------------
 mm/mmap.c          |   12 +++++-
 mm/mprotect.c      |   11 ++++-
 4 files changed, 98 insertions(+), 28 deletions(-)

diff -uNrp linux-2.6.15-mm2/include/linux/mm.h linux-2.6.15-mm2-page-mkwrite/include/linux/mm.h
--- linux-2.6.15-mm2/include/linux/mm.h	2006-01-11 11:21:16.000000000 +0000
+++ linux-2.6.15-mm2-page-mkwrite/include/linux/mm.h	2006-01-11 11:38:13.000000000 +0000
@@ -199,6 +199,10 @@ struct vm_operations_struct {
 	void (*close)(struct vm_area_struct * area);
 	struct page * (*nopage)(struct vm_area_struct * area, unsigned long address, int *type);
 	int (*populate)(struct vm_area_struct * area, unsigned long address, unsigned long len, pgprot_t prot, unsigned long pgoff, int nonblock);
+
+	/* notification that a previously read-only page is about to become
+	 * writable, if an error is returned it will cause a SIGBUS */
+	int (*page_mkwrite)(struct vm_area_struct *vma, struct page *page);
 #ifdef CONFIG_NUMA
 	int (*set_policy)(struct vm_area_struct *vma, struct mempolicy *new);
 	struct mempolicy *(*get_policy)(struct vm_area_struct *vma,
diff -uNrp linux-2.6.15-mm2/mm/memory.c linux-2.6.15-mm2-page-mkwrite/mm/memory.c
--- linux-2.6.15-mm2/mm/memory.c	2006-01-11 11:21:17.000000000 +0000
+++ linux-2.6.15-mm2-page-mkwrite/mm/memory.c	2006-01-11 11:52:05.000000000 +0000
@@ -1438,25 +1438,59 @@ static int do_wp_page(struct mm_struct *
 {
 	struct page *old_page, *new_page;
 	pte_t entry;
-	int ret = VM_FAULT_MINOR;
+	int reuse, ret = VM_FAULT_MINOR;
 
 	old_page = vm_normal_page(vma, address, orig_pte);
 	if (!old_page)
 		goto gotten;
 
-	if (PageAnon(old_page) && !TestSetPageLocked(old_page)) {
-		int reuse = can_share_swap_page(old_page);
-		unlock_page(old_page);
-		if (reuse) {
-			flush_cache_page(vma, address, pte_pfn(orig_pte));
-			entry = pte_mkyoung(orig_pte);
-			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
-			ptep_set_access_flags(vma, address, page_table, entry, 1);
-			update_mmu_cache(vma, address, entry);
-			lazy_mmu_prot_update(entry);
-			ret |= VM_FAULT_WRITE;
-			goto unlock;
+	if (unlikely(vma->vm_flags & VM_SHARED)) {
+		if (vma->vm_ops && vma->vm_ops->page_mkwrite) {
+			/*
+			 * Notify the address space that the page is about to
+			 * become writable so that it can prohibit this or wait
+			 * for the page to get into an appropriate state.
+			 *
+			 * We do this without the lock held, so that it can
+			 * sleep if it needs to.
+			 */
+			page_cache_get(old_page);
+			pte_unmap_unlock(page_table, ptl);
+
+			if (vma->vm_ops->page_mkwrite(vma, old_page) < 0)
+				goto unwritable_page;
+
+			page_cache_release(old_page);
+
+			/*
+			 * Since we dropped the lock we need to revalidate
+			 * the PTE as someone else may have changed it.  If
+			 * they did, we just return, as we can count on the
+			 * MMU to tell us if they didn't also make it writable.
+			 */
+			page_table = pte_offset_map_lock(mm, pmd, address,
+							 &ptl);
+			if (!pte_same(*page_table, orig_pte))
+				goto unlock;
 		}
+
+		reuse = 1;
+	} else if (PageAnon(old_page) && !TestSetPageLocked(old_page)) {
+		reuse = can_share_swap_page(old_page);
+		unlock_page(old_page);
+	} else {
+		reuse = 0;
+	}
+
+	if (reuse) {
+		flush_cache_page(vma, address, pte_pfn(orig_pte));
+		entry = pte_mkyoung(orig_pte);
+		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+		ptep_set_access_flags(vma, address, page_table, entry, 1);
+		update_mmu_cache(vma, address, entry);
+		lazy_mmu_prot_update(entry);
+		ret |= VM_FAULT_WRITE;
+		goto unlock;
 	}
 
 	/*
@@ -1516,6 +1550,10 @@ oom:
 	if (old_page)
 		page_cache_release(old_page);
 	return VM_FAULT_OOM;
+
+unwritable_page:
+	page_cache_release(old_page);
+	return VM_FAULT_SIGBUS;
 }
 
 /*
@@ -2060,18 +2098,31 @@ retry:
 	/*
 	 * Should we do an early C-O-W break?
 	 */
-	if (write_access && !(vma->vm_flags & VM_SHARED)) {
-		struct page *page;
+	if (write_access) {
+		if (!(vma->vm_flags & VM_SHARED)) {
+			struct page *page;
 
-		if (unlikely(anon_vma_prepare(vma)))
-			goto oom;
-		page = alloc_page_vma(GFP_HIGHUSER, vma, address);
-		if (!page)
-			goto oom;
-		copy_user_highpage(page, new_page, address);
-		page_cache_release(new_page);
-		new_page = page;
-		anon = 1;
+			if (unlikely(anon_vma_prepare(vma)))
+				goto oom;
+			page = alloc_page_vma(GFP_HIGHUSER, vma, address);
+			if (!page)
+				goto oom;
+			copy_user_highpage(page, new_page, address);
+			page_cache_release(new_page);
+			new_page = page;
+			anon = 1;
+
+		} else {
+			/* if the page will be shareable, see if the backing
+			 * address space wants to know that the page is about
+			 * to become writable */
+			if (vma->vm_ops->page_mkwrite &&
+			    vma->vm_ops->page_mkwrite(vma, new_page) < 0
+			    ) {
+				page_cache_release(new_page);
+				return VM_FAULT_SIGBUS;
+			}
+		}
 	}
 
 	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
diff -uNrp linux-2.6.15-mm2/mm/mmap.c linux-2.6.15-mm2-page-mkwrite/mm/mmap.c
--- linux-2.6.15-mm2/mm/mmap.c	2006-01-04 12:39:43.000000000 +0000
+++ linux-2.6.15-mm2-page-mkwrite/mm/mmap.c	2006-01-11 11:38:13.000000000 +0000
@@ -1058,7 +1058,8 @@ munmap_back:
 	vma->vm_start = addr;
 	vma->vm_end = addr + len;
 	vma->vm_flags = vm_flags;
-	vma->vm_page_prot = protection_map[vm_flags & 0x0f];
+	vma->vm_page_prot = protection_map[vm_flags &
+				(VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)];
 	vma->vm_pgoff = pgoff;
 
 	if (file) {
@@ -1082,6 +1083,12 @@ munmap_back:
 			goto free_vma;
 	}
 
+	/* Don't make the VMA automatically writable if it's shared, but the
+	 * backer wishes to know when pages are first written to */
+	if (vma->vm_ops && vma->vm_ops->page_mkwrite)
+		vma->vm_page_prot =
+			protection_map[vm_flags & (VM_READ|VM_WRITE|VM_EXEC)];
+
 	/* We set VM_ACCOUNT in a shared mapping's vm_flags, to inform
 	 * shmem_zero_setup (perhaps called through /dev/zero's ->mmap)
 	 * that memory reservation must be checked; but that reservation
@@ -1915,7 +1922,8 @@ unsigned long do_brk(unsigned long addr,
 	vma->vm_end = addr + len;
 	vma->vm_pgoff = pgoff;
 	vma->vm_flags = flags;
-	vma->vm_page_prot = protection_map[flags & 0x0f];
+	vma->vm_page_prot = protection_map[flags &
+				(VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)];
 	vma_link(mm, vma, prev, rb_link, rb_parent);
 out:
 	mm->total_vm += len >> PAGE_SHIFT;
diff -uNrp linux-2.6.15-mm2/mm/mprotect.c linux-2.6.15-mm2-page-mkwrite/mm/mprotect.c
--- linux-2.6.15-mm2/mm/mprotect.c	2006-01-04 12:39:43.000000000 +0000
+++ linux-2.6.15-mm2-page-mkwrite/mm/mprotect.c	2006-01-11 11:38:13.000000000 +0000
@@ -106,6 +106,7 @@ mprotect_fixup(struct vm_area_struct *vm
 	unsigned long oldflags = vma->vm_flags;
 	long nrpages = (end - start) >> PAGE_SHIFT;
 	unsigned long charged = 0;
+	unsigned int mask;
 	pgprot_t newprot;
 	pgoff_t pgoff;
 	int error;
@@ -132,8 +133,6 @@ mprotect_fixup(struct vm_area_struct *vm
 		}
 	}
 
-	newprot = protection_map[newflags & 0xf];
-
 	/*
 	 * First try to merge with previous and/or next vma.
 	 */
@@ -160,6 +159,14 @@ mprotect_fixup(struct vm_area_struct *vm
 	}
 
 success:
+	/* Don't make the VMA automatically writable if it's shared, but the
+	 * backer wishes to know when pages are first written to */
+	mask = VM_READ|VM_WRITE|VM_EXEC|VM_SHARED;
+	if (vma->vm_ops && vma->vm_ops->page_mkwrite)
+		mask &= ~VM_SHARED;
+
+	newprot = protection_map[newflags & mask];
+
 	/*
 	 * vm_flags and vm_page_prot are protected by the mmap_sem
 	 * held in write mode.


Thread overview: 24+ messages
2005-02-09 14:28 page_mkwrite seems broken Hugh Dickins
2005-10-24 15:16 ` what happened to page_mkwrite? - was: " Anton Altaparmakov
2005-10-24 15:36   ` Hugh Dickins
2005-10-24 15:49     ` Anton Altaparmakov
2005-10-24 15:26 ` David Howells
2005-10-24 15:43   ` Anton Altaparmakov
2005-10-24 16:01     ` Hugh Dickins
2005-10-24 19:38       ` Anton Altaparmakov
2005-10-24 20:31         ` Hugh Dickins
2005-10-24 21:18           ` Anton Altaparmakov
2005-10-24 16:23   ` [PATCH] Add notification of page becoming writable to VMA ops David Howells
2005-10-24 19:11     ` Hugh Dickins
2005-10-25  7:59       ` Anton Altaparmakov
2005-10-25  8:26         ` Hugh Dickins
2005-10-25  8:49           ` Anton Altaparmakov
2005-10-25  9:49     ` David Howells
2005-10-25  9:55     ` David Howells
2005-10-25 10:12     ` David Howells
2005-10-25 13:18     ` [PATCH] Add notification of page becoming writable to VMA ops [try #2] David Howells
2005-11-30 13:58     ` [PATCH] Add notification of page becoming writable to VMA ops [try #3] David Howells
2005-11-30 14:40       ` Miklos Szeredi
2005-11-30 14:50       ` David Howells
2005-11-30 15:20     ` [PATCH] Add notification of page becoming writable to VMA ops [try #4] David Howells
2006-01-11 12:19     ` [PATCH] Add notification of page becoming writable to VMA ops [try #5] David Howells
