* [PATCH] mm: softdirty: write protect PTEs created for read faults after VM_SOFTDIRTY cleared
@ 2014-08-20 21:46 Peter Feiner
  2014-08-20 23:45 ` Kirill A. Shutemov
                   ` (5 more replies)
  0 siblings, 6 replies; 36+ messages in thread
From: Peter Feiner @ 2014-08-20 21:46 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Peter Feiner, Cyrill Gorcunov, Pavel Emelyanov,
	Jamie Liu, Hugh Dickins, Naoya Horiguchi, Andrew Morton

In readable+writable+shared VMAs, PTEs created for read faults have
their write bit set. If the read fault happens after VM_SOFTDIRTY is
cleared, then the PTE's softdirty bit will remain clear after
subsequent writes.

Here's a simple code snippet to demonstrate the bug:

  char* m = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
                 MAP_ANONYMOUS | MAP_SHARED, -1, 0);
  system("echo 4 > /proc/$PPID/clear_refs"); /* clear VM_SOFTDIRTY */
  assert(*m == '\0');     /* new PTE allows write access */
  assert(!soft_dirty(m));
  *m = 'x';               /* should dirty the page */
  assert(soft_dirty(m));  /* fails */
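
The snippet assumes a soft_dirty() helper that is not shown; a minimal sketch
(error handling omitted) that reads the soft-dirty bit, bit 55 of the address's
/proc/self/pagemap entry, could look like this:

  #include <fcntl.h>
  #include <stdint.h>
  #include <sys/types.h>
  #include <unistd.h>

  static int soft_dirty(void *addr)
  {
          uint64_t entry;
          int fd = open("/proc/self/pagemap", O_RDONLY);
          /* one 64-bit pagemap entry per virtual page */
          off_t off = ((uintptr_t)addr / getpagesize()) * sizeof(entry);

          pread(fd, &entry, sizeof(entry), off);
          close(fd);
          return (entry >> 55) & 1;       /* bit 55 is the soft-dirty bit */
  }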

With this patch, new PTEs created for read faults are write protected
if the VMA has VM_SOFTDIRTY clear.

Signed-off-by: Peter Feiner <pfeiner@google.com>
---
 mm/memory.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index ab3537b..282a959 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2755,6 +2755,8 @@ void do_set_pte(struct vm_area_struct *vma, unsigned long address,
 		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
 	else if (pte_file(*pte) && pte_file_soft_dirty(*pte))
 		entry = pte_mksoft_dirty(entry);
+	else if (!(vma->vm_flags & VM_SOFTDIRTY))
+		entry = pte_wrprotect(entry);
 	if (anon) {
 		inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
 		page_add_new_anon_rmap(page, vma, address);
-- 
2.1.0.rc2.206.gedb03e5



* Re: [PATCH] mm: softdirty: write protect PTEs created for read faults after VM_SOFTDIRTY cleared
  2014-08-20 21:46 [PATCH] mm: softdirty: write protect PTEs created for read faults after VM_SOFTDIRTY cleared Peter Feiner
@ 2014-08-20 23:45 ` Kirill A. Shutemov
  2014-08-21 19:37   ` Peter Feiner
  2014-08-23 22:11 ` [PATCH v2 0/3] softdirty fix and write notification cleanup Peter Feiner
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 36+ messages in thread
From: Kirill A. Shutemov @ 2014-08-20 23:45 UTC (permalink / raw)
  To: Peter Feiner
  Cc: linux-mm, linux-kernel, Cyrill Gorcunov, Pavel Emelyanov,
	Jamie Liu, Hugh Dickins, Naoya Horiguchi, Andrew Morton,
	Magnus Damm

On Wed, Aug 20, 2014 at 05:46:22PM -0400, Peter Feiner wrote:
> In readable+writable+shared VMAs, PTEs created for read faults have
> their write bit set. If the read fault happens after VM_SOFTDIRTY is
> cleared, then the PTE's softdirty bit will remain clear after
> subsequent writes.
> 
> Here's a simple code snippet to demonstrate the bug:
> 
>   char* m = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
>                  MAP_ANONYMOUS | MAP_SHARED, -1, 0);
>   system("echo 4 > /proc/$PPID/clear_refs"); /* clear VM_SOFTDIRTY */
>   assert(*m == '\0');     /* new PTE allows write access */
>   assert(!soft_dirty(m));
>   *m = 'x';               /* should dirty the page */
>   assert(soft_dirty(m));  /* fails */
> 
> With this patch, new PTEs created for read faults are write protected
> if the VMA has VM_SOFTDIRTY clear.
> 
> Signed-off-by: Peter Feiner <pfeiner@google.com>
> ---
>  mm/memory.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index ab3537b..282a959 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2755,6 +2755,8 @@ void do_set_pte(struct vm_area_struct *vma, unsigned long address,
>  		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
>  	else if (pte_file(*pte) && pte_file_soft_dirty(*pte))
>  		entry = pte_mksoft_dirty(entry);
> +	else if (!(vma->vm_flags & VM_SOFTDIRTY))
> +		entry = pte_wrprotect(entry);

It basically means VM_SOFTDIRTY requires writenotify on the vma.

What about the patch below? Untested. And it seems it'll introduce a bug
similar to the one fixed by c9d0bf241451, *but* IIUC we already have that
bug in the mprotect() code path.

I'll look more carefully tomorrow.

Not-signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index dfc791c42d64..67d509a15969 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -851,8 +851,9 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
                        if (type == CLEAR_REFS_MAPPED && !vma->vm_file)
                                continue;
                        if (type == CLEAR_REFS_SOFT_DIRTY) {
-                               if (vma->vm_flags & VM_SOFTDIRTY)
-                                       vma->vm_flags &= ~VM_SOFTDIRTY;
+                               vma->vm_flags &= ~VM_SOFTDIRTY;
+                               vma->vm_page_prot = vm_get_page_prot(
+                                               vma->vm_flags & ~VM_SHARED);
                        }
                        walk_page_range(vma->vm_start, vma->vm_end,
                                        &clear_refs_walk);
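
(Why clearing VM_SHARED for the lookup enables write notification: conceptually,
vm_get_page_prot() takes the base protection from the protection_map[] table
indexed by the VM_READ|VM_WRITE|VM_EXEC|VM_SHARED bits, roughly:

	/* conceptual sketch, not the verbatim kernel code */
	pgprot_t prot = protection_map[vm_flags &
			(VM_READ | VM_WRITE | VM_EXEC | VM_SHARED)];

Dropping VM_SHARED selects the private, copy-on-write style entry for the same
access bits, whose hardware write bit is clear, so the first write to each page
faults and the fault path can notice it before making the PTE writable again.)
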
-- 
 Kirill A. Shutemov


* Re: [PATCH] mm: softdirty: write protect PTEs created for read faults after VM_SOFTDIRTY cleared
  2014-08-20 23:45 ` Kirill A. Shutemov
@ 2014-08-21 19:37   ` Peter Feiner
  2014-08-21 20:51     ` Cyrill Gorcunov
  0 siblings, 1 reply; 36+ messages in thread
From: Peter Feiner @ 2014-08-21 19:37 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: linux-mm, linux-kernel, Cyrill Gorcunov, Pavel Emelyanov,
	Jamie Liu, Hugh Dickins, Naoya Horiguchi, Andrew Morton,
	Magnus Damm

On Thu, Aug 21, 2014 at 02:45:43AM +0300, Kirill A. Shutemov wrote:
> On Wed, Aug 20, 2014 at 05:46:22PM -0400, Peter Feiner wrote:
> It basically means VM_SOFTDIRTY requires writenotify on the vma.
> 
> What about the patch below? Untested. And it seems it'll introduce a bug
> similar to the one fixed by c9d0bf241451, *but* IIUC we already have that
> bug in the mprotect() code path.
> 
> I'll look more carefully tomorrow.
> 
> Not-signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> 
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index dfc791c42d64..67d509a15969 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -851,8 +851,9 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
>                         if (type == CLEAR_REFS_MAPPED && !vma->vm_file)
>                                 continue;
>                         if (type == CLEAR_REFS_SOFT_DIRTY) {
> -                               if (vma->vm_flags & VM_SOFTDIRTY)
> -                                       vma->vm_flags &= ~VM_SOFTDIRTY;
> +                               vma->vm_flags &= ~VM_SOFTDIRTY;
> +                               vma->vm_page_prot = vm_get_page_prot(
> +                                               vma->vm_flags & ~VM_SHARED);
>                         }
>                         walk_page_range(vma->vm_start, vma->vm_end,
>                                         &clear_refs_walk);
> -- 
>  Kirill A. Shutemov

Thanks Kirill, I prefer your approach. I'll send a v2.

I believe you're right about c9d0bf241451. It seems like passing the old & new
pgprot through pgprot_modify would handle the problem. Furthermore, as you
suggest, mprotect_fixup should use pgprot_modify when it turns write
notification on.  I think a patch like this is in order:

Not-signed-off-by: Peter Feiner <pfeiner@google.com>

diff --git a/mm/mmap.c b/mm/mmap.c
index c1f2ea4..86f89a1 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1611,18 +1611,15 @@ munmap_back:
 	}
 
 	if (vma_wants_writenotify(vma)) {
-		pgprot_t pprot = vma->vm_page_prot;
-
 		/* Can vma->vm_page_prot have changed??
 		 *
 		 * Answer: Yes, drivers may have changed it in their
 		 *         f_op->mmap method.
 		 *
-		 * Ensures that vmas marked as uncached stay that way.
+		 * Ensures that vmas marked with special bits stay that way.
 		 */
-		vma->vm_page_prot = vm_get_page_prot(vm_flags & ~VM_SHARED);
-		if (pgprot_val(pprot) == pgprot_val(pgprot_noncached(pprot)))
-			vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+		vma->vm_page_prot = pgprot_modify(vma->vm_page_prot,
+		                        vm_get_page_prot(vm_flags & ~VM_SHARED));
 	}
 
 	vma_link(mm, vma, prev, rb_link, rb_parent);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index c43d557..6826313 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -324,7 +324,8 @@ success:
 					  vm_get_page_prot(newflags));
 
 	if (vma_wants_writenotify(vma)) {
-		vma->vm_page_prot = vm_get_page_prot(newflags & ~VM_SHARED);
+		vma->vm_page_prot = pgprot_modify(vma->vm_page_prot,
+		                       vm_get_page_prot(newflags & ~VM_SHARED));
 		dirty_accountable = 1;
 	}
 


* Re: [PATCH] mm: softdirty: write protect PTEs created for read faults after VM_SOFTDIRTY cleared
  2014-08-21 19:37   ` Peter Feiner
@ 2014-08-21 20:51     ` Cyrill Gorcunov
  2014-08-21 21:39       ` Kirill A. Shutemov
  0 siblings, 1 reply; 36+ messages in thread
From: Cyrill Gorcunov @ 2014-08-21 20:51 UTC (permalink / raw)
  To: Peter Feiner
  Cc: Kirill A. Shutemov, linux-mm, linux-kernel, Pavel Emelyanov,
	Jamie Liu, Hugh Dickins, Naoya Horiguchi, Andrew Morton,
	Magnus Damm

On Thu, Aug 21, 2014 at 03:37:37PM -0400, Peter Feiner wrote:
>
> Thanks Kirill, I prefer your approach. I'll send a v2.
> 
> I believe you're right about c9d0bf241451. It seems like passing the old & new
> pgprot through pgprot_modify would handle the problem. Furthermore, as you
> suggest, mprotect_fixup should use pgprot_modify when it turns write
> notification on.  I think a patch like this is in order:
> 
> Not-signed-off-by: Peter Feiner <pfeiner@google.com>
> 
> diff --git a/mm/mmap.c b/mm/mmap.c
> index c1f2ea4..86f89a1 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1611,18 +1611,15 @@ munmap_back:
>  	}
>  
>  	if (vma_wants_writenotify(vma)) {
> -		pgprot_t pprot = vma->vm_page_prot;
> -
>  		/* Can vma->vm_page_prot have changed??
>  		 *
>  		 * Answer: Yes, drivers may have changed it in their
>  		 *         f_op->mmap method.
>  		 *
> -		 * Ensures that vmas marked as uncached stay that way.
> +		 * Ensures that vmas marked with special bits stay that way.
>  		 */
> -		vma->vm_page_prot = vm_get_page_prot(vm_flags & ~VM_SHARED);
> -		if (pgprot_val(pprot) == pgprot_val(pgprot_noncached(pprot)))
> -			vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> +		vma->vm_page_prot = pgprot_modify(vma->vm_page_prot,
> +		                        vm_get_page_prot(vm_flags & ~VM_SHARED));
>  	}
>  
>  	vma_link(mm, vma, prev, rb_link, rb_parent);
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index c43d557..6826313 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -324,7 +324,8 @@ success:
>  					  vm_get_page_prot(newflags));
>  
>  	if (vma_wants_writenotify(vma)) {
> -		vma->vm_page_prot = vm_get_page_prot(newflags & ~VM_SHARED);
> +		vma->vm_page_prot = pgprot_modify(vma->vm_page_prot,
> +		                       vm_get_page_prot(newflags & ~VM_SHARED));
>  		dirty_accountable = 1;
>  	}

Thanks a lot, Peter and Kirill, for catching this and providing the preliminary
fixup. (The initial patch doesn't look quite right to me because vm-softdirty
should only be taken into account for newly created/expanded vmas, not deep in
code paths such as fault handling.) Peter, does the patch above help? (I'm away
from my testing machine at the moment, so I can't test it myself.)


* Re: [PATCH] mm: softdirty: write protect PTEs created for read faults after VM_SOFTDIRTY cleared
  2014-08-21 20:51     ` Cyrill Gorcunov
@ 2014-08-21 21:39       ` Kirill A. Shutemov
  2014-08-21 21:46         ` Peter Feiner
  0 siblings, 1 reply; 36+ messages in thread
From: Kirill A. Shutemov @ 2014-08-21 21:39 UTC (permalink / raw)
  To: Cyrill Gorcunov, Peter Feiner
  Cc: linux-mm, linux-kernel, Pavel Emelyanov, Jamie Liu, Hugh Dickins,
	Naoya Horiguchi, Andrew Morton, Magnus Damm

On Fri, Aug 22, 2014 at 12:51:15AM +0400, Cyrill Gorcunov wrote:
> On Thu, Aug 21, 2014 at 03:37:37PM -0400, Peter Feiner wrote:
> >
> > Thanks Kirill, I prefer your approach. I'll send a v2.
> > 
> > I believe you're right about c9d0bf241451. It seems like passing the old & new
> > pgprot through pgprot_modify would handle the problem. Furthermore, as you
> > suggest, mprotect_fixup should use pgprot_modify when it turns write
> > notification on.  I think a patch like this is in order:

Looks good to me.

Would you mind applying the same pgprot_modify() approach to
clear_refs_write(), testing it, and posting the patch?

Feel free to use my Signed-off-by (or Suggested-by if you prefer) once
it's tested (see the merge case below).
 
> > Not-signed-off-by: Peter Feiner <pfeiner@google.com>
> > 
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index c1f2ea4..86f89a1 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -1611,18 +1611,15 @@ munmap_back:
> >  	}
> >  
> >  	if (vma_wants_writenotify(vma)) {
> > -		pgprot_t pprot = vma->vm_page_prot;
> > -
> >  		/* Can vma->vm_page_prot have changed??
> >  		 *
> >  		 * Answer: Yes, drivers may have changed it in their
> >  		 *         f_op->mmap method.
> >  		 *
> > -		 * Ensures that vmas marked as uncached stay that way.
> > +		 * Ensures that vmas marked with special bits stay that way.
> >  		 */
> > -		vma->vm_page_prot = vm_get_page_prot(vm_flags & ~VM_SHARED);
> > -		if (pgprot_val(pprot) == pgprot_val(pgprot_noncached(pprot)))
> > -			vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> > +		vma->vm_page_prot = pgprot_modify(vma->vm_page_prot,
> > +		                        vm_get_page_prot(vm_flags & ~VM_SHARED));
> >  	}
> >  
> >  	vma_link(mm, vma, prev, rb_link, rb_parent);
> > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > index c43d557..6826313 100644
> > --- a/mm/mprotect.c
> > +++ b/mm/mprotect.c
> > @@ -324,7 +324,8 @@ success:
> >  					  vm_get_page_prot(newflags));
> >  
> >  	if (vma_wants_writenotify(vma)) {
> > -		vma->vm_page_prot = vm_get_page_prot(newflags & ~VM_SHARED);
> > +		vma->vm_page_prot = pgprot_modify(vma->vm_page_prot,
> > +		                       vm_get_page_prot(newflags & ~VM_SHARED));
> >  		dirty_accountable = 1;
> >  	}
> 
> Thanks a lot, Peter and Kirill, for catching this and providing the preliminary
> fixup. (The initial patch doesn't look quite right to me because vm-softdirty
> should only be taken into account for newly created/expanded vmas, not deep in
> code paths such as fault handling.) Peter, does the patch above help? (I'm away
> from my testing machine at the moment, so I can't test it myself.)

One thing: there could be (I haven't checked) complications on
vma_merge(): since vm_flags are identical it assumes that it can reuse
vma->vm_page_prot of expanded vma. But VM_SOFTDIRTY is excluded from
vm_flags compatibility check. What should we do with vm_page_prot there?

-- 
 Kirill A. Shutemov


* Re: [PATCH] mm: softdirty: write protect PTEs created for read faults after VM_SOFTDIRTY cleared
  2014-08-21 21:39       ` Kirill A. Shutemov
@ 2014-08-21 21:46         ` Peter Feiner
  2014-08-21 21:51           ` Kirill A. Shutemov
  0 siblings, 1 reply; 36+ messages in thread
From: Peter Feiner @ 2014-08-21 21:46 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Cyrill Gorcunov, linux-mm, linux-kernel, Pavel Emelyanov,
	Jamie Liu, Hugh Dickins, Naoya Horiguchi, Andrew Morton,
	Magnus Damm

On Fri, Aug 22, 2014 at 12:39:42AM +0300, Kirill A. Shutemov wrote:
> On Fri, Aug 22, 2014 at 12:51:15AM +0400, Cyrill Gorcunov wrote:
> 
> Looks good to me.
> 
> Would you mind applying the same pgprot_modify() approach to
> clear_refs_write(), testing it, and posting the patch?
> 
> Feel free to use my Signed-off-by (or Suggested-by if you prefer) once
> it's tested (see the merge case below).

Sure thing :-)

> One thing: there could be (I haven't checked) complications on
> vma_merge(): since vm_flags are identical it assumes that it can reuse
> vma->vm_page_prot of expanded vma. But VM_SOFTDIRTY is excluded from
> vm_flags compatibility check. What should we do with vm_page_prot there?

Since the merged VMA will have VM_SOFTDIRTY set, it's OK that its vm_page_prot
won't be set up for write notifications. For the purpose of process migration,
you'll just get some false positives, which is tolerable.


* Re: [PATCH] mm: softdirty: write protect PTEs created for read faults after VM_SOFTDIRTY cleared
  2014-08-21 21:46         ` Peter Feiner
@ 2014-08-21 21:51           ` Kirill A. Shutemov
  2014-08-21 22:50             ` Peter Feiner
  0 siblings, 1 reply; 36+ messages in thread
From: Kirill A. Shutemov @ 2014-08-21 21:51 UTC (permalink / raw)
  To: Peter Feiner
  Cc: Cyrill Gorcunov, linux-mm, linux-kernel, Pavel Emelyanov,
	Jamie Liu, Hugh Dickins, Naoya Horiguchi, Andrew Morton,
	Magnus Damm

On Thu, Aug 21, 2014 at 05:46:01PM -0400, Peter Feiner wrote:
> On Fri, Aug 22, 2014 at 12:39:42AM +0300, Kirill A. Shutemov wrote:
> > On Fri, Aug 22, 2014 at 12:51:15AM +0400, Cyrill Gorcunov wrote:
> > 
> > Looks good to me.
> > 
> > Would you mind applying the same pgprot_modify() approach to
> > clear_refs_write(), testing it, and posting the patch?
> > 
> > Feel free to use my Signed-off-by (or Suggested-by if you prefer) once
> > it's tested (see the merge case below).
> 
> Sure thing :-)
> 
> > One thing: there could be (I haven't checked) complications on
> > vma_merge(): since vm_flags are identical it assumes that it can reuse
> > vma->vm_page_prot of expanded vma. But VM_SOFTDIRTY is excluded from
> > vm_flags compatibility check. What should we do with vm_page_prot there?
> 
> Since the merged VMA will have VM_SOFTDIRTY set, it's OK that its vm_page_prot
> won't be set up for write notifications. For the purpose of process migration,
> you'll just get some false positives, which is tolerable.

Right. But should we then disable writenotify again to avoid excessive
wp-faults if it was enabled due to soft-dirty (the case where the expanded
vma is soft-dirty)?

-- 
 Kirill A. Shutemov


* Re: [PATCH] mm: softdirty: write protect PTEs created for read faults after VM_SOFTDIRTY cleared
  2014-08-21 21:51           ` Kirill A. Shutemov
@ 2014-08-21 22:50             ` Peter Feiner
  2014-08-22  6:33               ` Cyrill Gorcunov
  0 siblings, 1 reply; 36+ messages in thread
From: Peter Feiner @ 2014-08-21 22:50 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Cyrill Gorcunov, linux-mm, linux-kernel, Pavel Emelyanov,
	Jamie Liu, Hugh Dickins, Naoya Horiguchi, Andrew Morton,
	Magnus Damm

On Fri, Aug 22, 2014 at 12:51:47AM +0300, Kirill A. Shutemov wrote:
> > > One thing: there could be (I haven't checked) complications on
> > > vma_merge(): since vm_flags are identical it assumes that it can reuse
> > > vma->vm_page_prot of expanded vma. But VM_SOFTDIRTY is excluded from
> > > vm_flags compatibility check. What should we do with vm_page_prot there?
> > 
> > Since the merged VMA will have VM_SOFTDIRTY set, it's OK that its vm_page_prot
> > won't be set up for write notifications. For the purpose of process migration,
> > you'll just get some false positives, which is tolerable.
> 
> Right. But should we then disable writenotify again to avoid excessive
> wp-faults if it was enabled due to soft-dirty (the case where the expanded
> vma is soft-dirty)?

Ah, I understand now. I've got a patch in the works that disables the write
faults when a VMA is merged. I'll send a series with all of the changes
tomorrow.


* Re: [PATCH] mm: softdirty: write protect PTEs created for read faults after VM_SOFTDIRTY cleared
  2014-08-21 22:50             ` Peter Feiner
@ 2014-08-22  6:33               ` Cyrill Gorcunov
  0 siblings, 0 replies; 36+ messages in thread
From: Cyrill Gorcunov @ 2014-08-22  6:33 UTC (permalink / raw)
  To: Peter Feiner
  Cc: Kirill A. Shutemov, linux-mm, linux-kernel, Pavel Emelyanov,
	Jamie Liu, Hugh Dickins, Naoya Horiguchi, Andrew Morton,
	Magnus Damm

On Thu, Aug 21, 2014 at 06:50:33PM -0400, Peter Feiner wrote:
> On Fri, Aug 22, 2014 at 12:51:47AM +0300, Kirill A. Shutemov wrote:
> > > > One thing: there could be (I haven't checked) complications on
> > > > vma_merge(): since vm_flags are identical it assumes that it can reuse
> > > > vma->vm_page_prot of expanded vma. But VM_SOFTDIRTY is excluded from
> > > > vm_flags compatibility check. What should we do with vm_page_prot there?
> > > 
> > > Since the merged VMA will have VM_SOFTDIRTY set, it's OK that its vm_page_prot
> > > won't be set up for write notifications. For the purpose of process migration,
> > > you'll just get some false positives, which is tolerable.
> > 
> > Right. But should we then disable writenotify again to avoid excessive
> > wp-faults if it was enabled due to soft-dirty (the case where the expanded
> > vma is soft-dirty)?
> 
> Ah, I understand now. I've got a patch in the works that disables the write
> faults when a VMA is merged. I'll send a series with all of the changes
> tomorrow.

Cool! Thanks a lot, guys!


* [PATCH v2 0/3] softdirty fix and write notification cleanup
  2014-08-20 21:46 [PATCH] mm: softdirty: write protect PTEs created for read faults after VM_SOFTDIRTY cleared Peter Feiner
  2014-08-20 23:45 ` Kirill A. Shutemov
@ 2014-08-23 22:11 ` Peter Feiner
  2014-08-23 22:11   ` [PATCH v2 1/3] mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared Peter Feiner
                     ` (2 more replies)
  2014-08-24  1:43 ` [PATCH v3] mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared Peter Feiner
                   ` (3 subsequent siblings)
  5 siblings, 3 replies; 36+ messages in thread
From: Peter Feiner @ 2014-08-23 22:11 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Peter Feiner, Kirill A. Shutemov, Cyrill Gorcunov,
	Pavel Emelyanov, Jamie Liu, Hugh Dickins, Naoya Horiguchi,
	Andrew Morton

Here's the new patch, which uses Kirill's approach of enabling write
notifications on the VMA. I also included write notification cleanups and
fixes per our discussion.

Peter Feiner (3):
  mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY
    cleared
  mm: mprotect: preserve special page protection bits
  mm: mmap: cleanup code that preserves special vm_page_prot bits

 fs/proc/task_mmu.c | 17 ++++++++++++++++-
 include/linux/mm.h | 15 +++++++++++++++
 mm/mmap.c          | 26 +++++++++++---------------
 mm/mprotect.c      |  2 +-
 4 files changed, 43 insertions(+), 17 deletions(-)

-- 
2.1.0.rc2.206.gedb03e5



* [PATCH v2 1/3] mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared
  2014-08-23 22:11 ` [PATCH v2 0/3] softdirty fix and write notification cleanup Peter Feiner
@ 2014-08-23 22:11   ` Peter Feiner
  2014-08-23 23:00     ` Kirill A. Shutemov
  2014-08-23 22:12   ` [PATCH v2 2/3] mm: mprotect: preserve special page protection bits Peter Feiner
  2014-08-23 22:12   ` [PATCH v2 3/3] mm: mmap: cleanup code that preserves special vm_page_prot bits Peter Feiner
  2 siblings, 1 reply; 36+ messages in thread
From: Peter Feiner @ 2014-08-23 22:11 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Peter Feiner, Kirill A. Shutemov, Cyrill Gorcunov,
	Pavel Emelyanov, Jamie Liu, Hugh Dickins, Naoya Horiguchi,
	Andrew Morton

For VMAs that don't want write notifications, PTEs created for read
faults have their write bit set. If the read fault happens after
VM_SOFTDIRTY is cleared, then the PTE's softdirty bit will remain
clear after subsequent writes.

Here's a simple code snippet to demonstrate the bug:

  char* m = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
                 MAP_ANONYMOUS | MAP_SHARED, -1, 0);
  system("echo 4 > /proc/$PPID/clear_refs"); /* clear VM_SOFTDIRTY */
  assert(*m == '\0');     /* new PTE allows write access */
  assert(!soft_dirty(m));
  *m = 'x';               /* should dirty the page */
  assert(soft_dirty(m));  /* fails */

With this patch, write notifications are enabled when VM_SOFTDIRTY is
cleared. Furthermore, to avoid faults, write notifications are
disabled when VM_SOFTDIRTY is reset.

Signed-off-by: Peter Feiner <pfeiner@google.com>
---
 v1 -> v2: Instead of checking VM_SOFTDIRTY in the fault handler, enable write
           notifications on vm_page_prot when we clear VM_SOFTDIRTY.

 fs/proc/task_mmu.c | 17 ++++++++++++++++-
 include/linux/mm.h | 15 +++++++++++++++
 mm/mmap.c          | 10 +++++++++-
 3 files changed, 40 insertions(+), 2 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index dfc791c..f1a5382 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -851,8 +851,23 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 			if (type == CLEAR_REFS_MAPPED && !vma->vm_file)
 				continue;
 			if (type == CLEAR_REFS_SOFT_DIRTY) {
-				if (vma->vm_flags & VM_SOFTDIRTY)
+				if (vma->vm_flags & VM_SOFTDIRTY) {
 					vma->vm_flags &= ~VM_SOFTDIRTY;
+					/*
+					 * We don't have a write lock on
+					 * mm->mmap_sem, so we race with the
+					 * fault handler reading vm_page_prot.
+					 * Therefore writable PTEs (that won't
+					 * have soft-dirty set) can be created
+					 * for read faults. However, since the
+					 * PTE lock is held while vm_page_prot
+					 * is read and while we write protect
+					 * PTEs during our walk, any writable
+					 * PTEs that slipped through will be
+					 * write protected.
+					 */
+					vma_enable_writenotify(vma);
+				}
 			}
 			walk_page_range(vma->vm_start, vma->vm_end,
 					&clear_refs_walk);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8981cc8..5f26634 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1946,6 +1946,21 @@ static inline pgprot_t vm_get_page_prot(unsigned long vm_flags)
 }
 #endif
 
+/* Enable write notifications without blowing away special flags. */
+static inline void vma_enable_writenotify(struct vm_area_struct *vma)
+{
+	vma->vm_page_prot = pgprot_modify(vma->vm_page_prot,
+	                                  vm_get_page_prot(vma->vm_flags &
+					                   ~VM_SHARED));
+}
+
+/* Disable write notifications without blowing away special flags. */
+static inline void vma_disable_writenotify(struct vm_area_struct *vma)
+{
+	vma->vm_page_prot = pgprot_modify(vma->vm_page_prot,
+	                                  vm_get_page_prot(vma->vm_flags));
+}
+
 #ifdef CONFIG_NUMA_BALANCING
 unsigned long change_prot_numa(struct vm_area_struct *vma,
 			unsigned long start, unsigned long end);
diff --git a/mm/mmap.c b/mm/mmap.c
index c1f2ea4..abcac32 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1549,8 +1549,16 @@ munmap_back:
 	 * Can we just expand an old mapping?
 	 */
 	vma = vma_merge(mm, prev, addr, addr + len, vm_flags, NULL, file, pgoff, NULL);
-	if (vma)
+	if (vma) {
+		if (!vma_wants_writenotify(vma)) {
+			/*
+			 * We're going to reset VM_SOFTDIRTY, so we can disable
+			 * write notifications.
+			 */
+			vma_disable_writenotify(vma);
+		}
 		goto out;
+	}
 
 	/*
 	 * Determine the object being mapped and call the appropriate
-- 
2.1.0.rc2.206.gedb03e5



* [PATCH v2 2/3] mm: mprotect: preserve special page protection bits
  2014-08-23 22:11 ` [PATCH v2 0/3] softdirty fix and write notification cleanup Peter Feiner
  2014-08-23 22:11   ` [PATCH v2 1/3] mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared Peter Feiner
@ 2014-08-23 22:12   ` Peter Feiner
  2014-08-23 22:12   ` [PATCH v2 3/3] mm: mmap: cleanup code that preserves special vm_page_prot bits Peter Feiner
  2 siblings, 0 replies; 36+ messages in thread
From: Peter Feiner @ 2014-08-23 22:12 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Peter Feiner, Kirill A. Shutemov, Cyrill Gorcunov,
	Pavel Emelyanov, Jamie Liu, Hugh Dickins, Naoya Horiguchi,
	Andrew Morton

We don't want to zap special page protection bits on mprotect.
Analogous to the bug fixed in c9d0bf241451a3ab7d02e1652c22b80cd7d93e8f
where vm_page_prot bits set by drivers were zapped when write
notifications were enabled on new VMAs.

Signed-off-by: Peter Feiner <pfeiner@google.com>
---
 mm/mprotect.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index c43d557..1c1afd4 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -324,7 +324,7 @@ success:
 					  vm_get_page_prot(newflags));
 
 	if (vma_wants_writenotify(vma)) {
-		vma->vm_page_prot = vm_get_page_prot(newflags & ~VM_SHARED);
+		vma_enable_writenotify(vma);
 		dirty_accountable = 1;
 	}
 
-- 
2.1.0.rc2.206.gedb03e5



* [PATCH v2 3/3] mm: mmap: cleanup code that preserves special vm_page_prot bits
  2014-08-23 22:11 ` [PATCH v2 0/3] softdirty fix and write notification cleanup Peter Feiner
  2014-08-23 22:11   ` [PATCH v2 1/3] mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared Peter Feiner
  2014-08-23 22:12   ` [PATCH v2 2/3] mm: mprotect: preserve special page protection bits Peter Feiner
@ 2014-08-23 22:12   ` Peter Feiner
  2 siblings, 0 replies; 36+ messages in thread
From: Peter Feiner @ 2014-08-23 22:12 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Peter Feiner, Kirill A. Shutemov, Cyrill Gorcunov,
	Pavel Emelyanov, Jamie Liu, Hugh Dickins, Naoya Horiguchi,
	Andrew Morton

Replace logic that has been factored out into a utility method.

Signed-off-by: Peter Feiner <pfeiner@google.com>
---
 mm/mmap.c | 16 ++--------------
 1 file changed, 2 insertions(+), 14 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index abcac32..c18c49a 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1618,20 +1618,8 @@ munmap_back:
 			goto free_vma;
 	}
 
-	if (vma_wants_writenotify(vma)) {
-		pgprot_t pprot = vma->vm_page_prot;
-
-		/* Can vma->vm_page_prot have changed??
-		 *
-		 * Answer: Yes, drivers may have changed it in their
-		 *         f_op->mmap method.
-		 *
-		 * Ensures that vmas marked as uncached stay that way.
-		 */
-		vma->vm_page_prot = vm_get_page_prot(vm_flags & ~VM_SHARED);
-		if (pgprot_val(pprot) == pgprot_val(pgprot_noncached(pprot)))
-			vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
-	}
+	if (vma_wants_writenotify(vma))
+		vma_enable_writenotify(vma);
 
 	vma_link(mm, vma, prev, rb_link, rb_parent);
 	/* Once vma denies write, undo our temporary denial count */
-- 
2.1.0.rc2.206.gedb03e5



* Re: [PATCH v2 1/3] mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared
  2014-08-23 22:11   ` [PATCH v2 1/3] mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared Peter Feiner
@ 2014-08-23 23:00     ` Kirill A. Shutemov
  2014-08-23 23:15       ` Peter Feiner
  2014-08-23 23:50       ` Kirill A. Shutemov
  0 siblings, 2 replies; 36+ messages in thread
From: Kirill A. Shutemov @ 2014-08-23 23:00 UTC (permalink / raw)
  To: Peter Feiner
  Cc: linux-mm, linux-kernel, Cyrill Gorcunov, Pavel Emelyanov,
	Jamie Liu, Hugh Dickins, Naoya Horiguchi, Andrew Morton

On Sat, Aug 23, 2014 at 06:11:59PM -0400, Peter Feiner wrote:
> For VMAs that don't want write notifications, PTEs created for read
> faults have their write bit set. If the read fault happens after
> VM_SOFTDIRTY is cleared, then the PTE's softdirty bit will remain
> clear after subsequent writes.
> 
> Here's a simple code snippet to demonstrate the bug:
> 
>   char* m = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
>                  MAP_ANONYMOUS | MAP_SHARED, -1, 0);
>   system("echo 4 > /proc/$PPID/clear_refs"); /* clear VM_SOFTDIRTY */
>   assert(*m == '\0');     /* new PTE allows write access */
>   assert(!soft_dirty(m));
>   *m = 'x';               /* should dirty the page */
>   assert(soft_dirty(m));  /* fails */
> 
> With this patch, write notifications are enabled when VM_SOFTDIRTY is
> cleared. Furthermore, to avoid faults, write notifications are
> disabled when VM_SOFTDIRTY is reset.
> 
> Signed-off-by: Peter Feiner <pfeiner@google.com>
> ---
>  v1 -> v2: Instead of checking VM_SOFTDIRTY in the fault handler, enable write
>            notifications on vm_page_prot when we clear VM_SOFTDIRTY.
> 
>  fs/proc/task_mmu.c | 17 ++++++++++++++++-
>  include/linux/mm.h | 15 +++++++++++++++
>  mm/mmap.c          | 10 +++++++++-
>  3 files changed, 40 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index dfc791c..f1a5382 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -851,8 +851,23 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
>  			if (type == CLEAR_REFS_MAPPED && !vma->vm_file)
>  				continue;
>  			if (type == CLEAR_REFS_SOFT_DIRTY) {
> -				if (vma->vm_flags & VM_SOFTDIRTY)
> +				if (vma->vm_flags & VM_SOFTDIRTY) {

Why do we need the branch here. Does it save us anything?
Looks like we can update vm_flags and enable writenotify unconditionally.
Indentation level is high enough already.

>  					vma->vm_flags &= ~VM_SOFTDIRTY;
> +					/*
> +					 * We don't have a write lock on
> +					 * mm->mmap_sem, so we race with the
> +					 * fault handler reading vm_page_prot.
> +					 * Therefore writable PTEs (that won't
> +					 * have soft-dirty set) can be created
> +					 * for read faults. However, since the
> +					 * PTE lock is held while vm_page_prot
> +					 * is read and while we write protect
> +					 * PTEs during our walk, any writable
> +					 * PTEs that slipped through will be
> +					 * write protected.
> +					 */

Hm.. Isn't this yet another bug?
Updating vma->vm_flags without down_write(&mm->mmap_sem) looks troublesome
to me. Am I wrong?

> +					vma_enable_writenotify(vma);
> +				}
>  			}
>  			walk_page_range(vma->vm_start, vma->vm_end,
>  					&clear_refs_walk);
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 8981cc8..5f26634 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1946,6 +1946,21 @@ static inline pgprot_t vm_get_page_prot(unsigned long vm_flags)
>  }
>  #endif
>  
> +/* Enable write notifications without blowing away special flags. */
> +static inline void vma_enable_writenotify(struct vm_area_struct *vma)
> +{
> +	vma->vm_page_prot = pgprot_modify(vma->vm_page_prot,
> +	                                  vm_get_page_prot(vma->vm_flags &
> +					                   ~VM_SHARED));

I think this way is more readable:

	pgprot_t newprot;
	newprot = vm_get_page_prot(vma->vm_flags & ~VM_SHARED);
	vma->vm_page_prot = pgprot_modify(vma->vm_page_prot, newprot);

> +}
> +
> +/* Disable write notifications without blowing away special flags. */
> +static inline void vma_disable_writenotify(struct vm_area_struct *vma)
> +{
> +	vma->vm_page_prot = pgprot_modify(vma->vm_page_prot,
> +	                                  vm_get_page_prot(vma->vm_flags));

ditto.

> +}
> +
>  #ifdef CONFIG_NUMA_BALANCING
>  unsigned long change_prot_numa(struct vm_area_struct *vma,
>  			unsigned long start, unsigned long end);
> diff --git a/mm/mmap.c b/mm/mmap.c
> index c1f2ea4..abcac32 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1549,8 +1549,16 @@ munmap_back:
>  	 * Can we just expand an old mapping?
>  	 */
>  	vma = vma_merge(mm, prev, addr, addr + len, vm_flags, NULL, file, pgoff, NULL);
> -	if (vma)
> +	if (vma) {
> +		if (!vma_wants_writenotify(vma)) {
> +			/*
> +			 * We're going to reset VM_SOFTDIRTY, so we can disable
> +			 * write notifications.
> +			 */
> +			vma_disable_writenotify(vma);
> +		}
>  		goto out;
> +	}
>  
>  	/*
>  	 * Determine the object being mapped and call the appropriate
> -- 
> 2.1.0.rc2.206.gedb03e5
> 

-- 
 Kirill A. Shutemov


* Re: [PATCH v2 1/3] mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared
  2014-08-23 23:00     ` Kirill A. Shutemov
@ 2014-08-23 23:15       ` Peter Feiner
  2014-08-23 23:50       ` Kirill A. Shutemov
  1 sibling, 0 replies; 36+ messages in thread
From: Peter Feiner @ 2014-08-23 23:15 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: linux-mm, linux-kernel, Cyrill Gorcunov, Pavel Emelyanov,
	Jamie Liu, Hugh Dickins, Naoya Horiguchi, Andrew Morton

On Sun, Aug 24, 2014 at 02:00:11AM +0300, Kirill A. Shutemov wrote:
> On Sat, Aug 23, 2014 at 06:11:59PM -0400, Peter Feiner wrote:
> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > index dfc791c..f1a5382 100644
> > --- a/fs/proc/task_mmu.c
> > +++ b/fs/proc/task_mmu.c
> > @@ -851,8 +851,23 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
> >  			if (type == CLEAR_REFS_MAPPED && !vma->vm_file)
> >  				continue;
> >  			if (type == CLEAR_REFS_SOFT_DIRTY) {
> > -				if (vma->vm_flags & VM_SOFTDIRTY)
> > +				if (vma->vm_flags & VM_SOFTDIRTY) {
> 
> Why do we need the branch here. Does it save us anything?
> Looks like we can update vm_flags and enable writenotify unconditionally.
> Indentation level is high enough already.

You're right, we don't need the branch here. I'll change for v3.

> >  					vma->vm_flags &= ~VM_SOFTDIRTY;
> > +					/*
> > +					 * We don't have a write lock on
> > +					 * mm->mmap_sem, so we race with the
> > +					 * fault handler reading vm_page_prot.
> > +					 * Therefore writable PTEs (that won't
> > +					 * have soft-dirty set) can be created
> > +					 * for read faults. However, since the
> > +					 * PTE lock is held while vm_page_prot
> > +					 * is read and while we write protect
> > +					 * PTEs during our walk, any writable
> > +					 * PTEs that slipped through will be
> > +					 * write protected.
> > +					 */
> 
> Hm.. Isn't this yet another bug?
> Updating vma->vm_flags without down_write(&mm->mmap_sem) looks troublesome
> to me. Am I wrong?

As I said in the comment, it looks fishy, but we're still fixing the bug. That
is, no writable PTEs that lack the soft-dirty bit will sneak by.

I was originally going to submit something that dropped the mmap_sem and
re-took it in write mode before manipulating vm_page_prot. The control flow was
slightly hairy, so I convinced myself that the race is benign :-)

If I'm right and the race is benign, it still might be worth having the more
straightforward & obviously correct implementation since this isn't performance
critical code.

> > +/* Enable write notifications without blowing away special flags. */
> > +static inline void vma_enable_writenotify(struct vm_area_struct *vma)
> > +{
> > +	vma->vm_page_prot = pgprot_modify(vma->vm_page_prot,
> > +	                                  vm_get_page_prot(vma->vm_flags &
> > +					                   ~VM_SHARED));
> 
> I think this way is more readable:
> 
> 	pgprot_t newprot;
> 	newprot = vm_get_page_prot(vma->vm_flags & ~VM_SHARED);
> 	vma->vm_page_prot = pgprot_modify(vma->vm_page_prot, newprot);
> 

Looks good. I'll update.

> > +}
> > +
> > +/* Disable write notifications without blowing away special flags. */
> > +static inline void vma_disable_writenotify(struct vm_area_struct *vma)
> > +{
> > +	vma->vm_page_prot = pgprot_modify(vma->vm_page_prot,
> > +	                                  vm_get_page_prot(vma->vm_flags));
> 
> ditto.

I'll change this too.

Peter


* Re: [PATCH v2 1/3] mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared
  2014-08-23 23:00     ` Kirill A. Shutemov
  2014-08-23 23:15       ` Peter Feiner
@ 2014-08-23 23:50       ` Kirill A. Shutemov
  2014-08-24  0:55         ` Peter Feiner
  1 sibling, 1 reply; 36+ messages in thread
From: Kirill A. Shutemov @ 2014-08-23 23:50 UTC (permalink / raw)
  To: Peter Feiner
  Cc: linux-mm, linux-kernel, Cyrill Gorcunov, Pavel Emelyanov,
	Jamie Liu, Hugh Dickins, Naoya Horiguchi, Andrew Morton

On Sun, Aug 24, 2014 at 02:00:11AM +0300, Kirill A. Shutemov wrote:
> On Sat, Aug 23, 2014 at 06:11:59PM -0400, Peter Feiner wrote:
> > For VMAs that don't want write notifications, PTEs created for read
> > faults have their write bit set. If the read fault happens after
> > VM_SOFTDIRTY is cleared, then the PTE's softdirty bit will remain
> > clear after subsequent writes.
> > 
> > Here's a simple code snippet to demonstrate the bug:
> > 
> >   char* m = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
> >                  MAP_ANONYMOUS | MAP_SHARED, -1, 0);
> >   system("echo 4 > /proc/$PPID/clear_refs"); /* clear VM_SOFTDIRTY */
> >   assert(*m == '\0');     /* new PTE allows write access */
> >   assert(!soft_dirty(m));
> >   *m = 'x';               /* should dirty the page */
> >   assert(soft_dirty(m));  /* fails */
> > 
> > With this patch, write notifications are enabled when VM_SOFTDIRTY is
> > cleared. Furthermore, to avoid faults, write notifications are
> > disabled when VM_SOFTDIRTY is reset.
> > 
> > Signed-off-by: Peter Feiner <pfeiner@google.com>

One more case to consider: mprotect() which doesn't trigger successful
vma_merge() will not set VM_SOFTDIRTY and will not enable write-protect on
the vma.

It's probably better to take VM_SOFTDIRTY into account in
vma_wants_writenotify() and re-think logic in other corners.
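
A sketch of what that check could look like, assuming the early-return style
vma_wants_writenotify() already uses (the v3 patch below ends up doing
essentially this):

	/* Soft-dirty tracking relies on write faults once VM_SOFTDIRTY
	 * has been cleared, so such VMAs need write notifications. */
	if (!(vm_flags & VM_SOFTDIRTY))
		return 1;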

-- 
 Kirill A. Shutemov


* Re: [PATCH v2 1/3] mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared
  2014-08-23 23:50       ` Kirill A. Shutemov
@ 2014-08-24  0:55         ` Peter Feiner
  0 siblings, 0 replies; 36+ messages in thread
From: Peter Feiner @ 2014-08-24  0:55 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: linux-mm, linux-kernel, Cyrill Gorcunov, Pavel Emelyanov,
	Jamie Liu, Hugh Dickins, Naoya Horiguchi, Andrew Morton

On Sun, Aug 24, 2014 at 02:50:58AM +0300, Kirill A. Shutemov wrote:
> One more case to consider: mprotect() which doesn't trigger successful
> vma_merge() will not set VM_SOFTDIRTY and will not enable write-protect on
> the vma.
> 
> It's probably better to take VM_SOFTDIRTY into account in
> vma_wants_writenotify() and re-think logic in other corners.

Merge or not, write notifications get disabled! I'll fix this too :-)

Peter


* [PATCH v3] mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared
  2014-08-20 21:46 [PATCH] mm: softdirty: write protect PTEs created for read faults after VM_SOFTDIRTY cleared Peter Feiner
  2014-08-20 23:45 ` Kirill A. Shutemov
  2014-08-23 22:11 ` [PATCH v2 0/3] softdirty fix and write notification cleanup Peter Feiner
@ 2014-08-24  1:43 ` Peter Feiner
  2014-08-24  7:59   ` Kirill A. Shutemov
  2014-08-24 14:41 ` [PATCH v4] " Peter Feiner
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 36+ messages in thread
From: Peter Feiner @ 2014-08-24  1:43 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Peter Feiner, Kirill A. Shutemov, Cyrill Gorcunov,
	Pavel Emelyanov, Jamie Liu, Hugh Dickins, Naoya Horiguchi,
	Andrew Morton

For VMAs that don't want write notifications, PTEs created for read
faults have their write bit set. If the read fault happens after
VM_SOFTDIRTY is cleared, then the PTE's softdirty bit will remain
clear after subsequent writes.

Here's a simple code snippet to demonstrate the bug:

  char* m = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
                 MAP_ANONYMOUS | MAP_SHARED, -1, 0);
  system("echo 4 > /proc/$PPID/clear_refs"); /* clear VM_SOFTDIRTY */
  assert(*m == '\0');     /* new PTE allows write access */
  assert(!soft_dirty(m));
  *m = 'x';               /* should dirty the page */
  assert(soft_dirty(m));  /* fails */

With this patch, write notifications are enabled when VM_SOFTDIRTY is
cleared. Furthermore, to avoid unnecessary faults, write
notifications are disabled when VM_SOFTDIRTY is reset.

As a side effect of enabling and disabling write notifications with
care, this patch fixes a bug in mprotect where vm_page_prot bits set
by drivers were zapped on mprotect. An analogous bug was fixed in mmap
by c9d0bf241451a3ab7d02e1652c22b80cd7d93e8f.

---

v1 -> v2: Instead of checking VM_SOFTDIRTY in the fault handler,
          enable write notifications on vm_page_prot when we clear
          VM_SOFTDIRTY.

v2 -> v3: * Grab the mmap_sem in write mode if any VMAs have
            VM_SOFTDIRTY set. This involved refactoring clear_refs_write
            to make it less unwieldy.

          * In mprotect, don't inadvertently disable write notifications on VMAs
            that have had VM_SOFTDIRTY cleared

          * The mprotect fix and mmap cleanup that comprised the
            second and third patches in v2 were swallowed by the main
            patch because of vm_page_prot corner case handling.
---
 fs/proc/task_mmu.c | 113 +++++++++++++++++++++++++++++++++--------------------
 include/linux/mm.h |  14 +++++++
 mm/mmap.c          |  24 +++++-------
 mm/mprotect.c      |   6 +--
 4 files changed, 97 insertions(+), 60 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index dfc791c..f5e75c6 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -785,13 +785,80 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
 	return 0;
 }
 
+static int clear_refs(struct mm_struct *mm, enum clear_refs_types type,
+                      int write)
+{
+	int r = 0;
+	struct vm_area_struct *vma;
+	struct clear_refs_private cp = {
+		.type = type,
+	};
+	struct mm_walk clear_refs_walk = {
+		.pmd_entry = clear_refs_pte_range,
+		.mm = mm,
+		.private = &cp,
+	};
+
+	if (write)
+		down_write(&mm->mmap_sem);
+	else
+		down_read(&mm->mmap_sem);
+
+	if (type == CLEAR_REFS_SOFT_DIRTY)
+		mmu_notifier_invalidate_range_start(mm, 0, -1);
+
+	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+		cp.vma = vma;
+		if (is_vm_hugetlb_page(vma))
+			continue;
+		/*
+		 * Writing 1 to /proc/pid/clear_refs affects all pages.
+		 *
+		 * Writing 2 to /proc/pid/clear_refs only affects
+		 * Anonymous pages.
+		 *
+		 * Writing 3 to /proc/pid/clear_refs only affects file
+		 * mapped pages.
+		 *
+		 * Writing 4 to /proc/pid/clear_refs affects all pages.
+		 */
+		if (type == CLEAR_REFS_ANON && vma->vm_file)
+			continue;
+		if (type == CLEAR_REFS_MAPPED && !vma->vm_file)
+			continue;
+		if (type == CLEAR_REFS_SOFT_DIRTY &&
+		    (vma->vm_flags & VM_SOFTDIRTY)) {
+			if (!write) {
+				r = -EAGAIN;
+				break;
+			}
+			vma->vm_flags &= ~VM_SOFTDIRTY;
+			vma_enable_writenotify(vma);
+		}
+		walk_page_range(vma->vm_start, vma->vm_end,
+				&clear_refs_walk);
+	}
+
+	if (type == CLEAR_REFS_SOFT_DIRTY)
+		mmu_notifier_invalidate_range_end(mm, 0, -1);
+
+	if (!r)
+		flush_tlb_mm(mm);
+
+	if (write)
+		up_write(&mm->mmap_sem);
+	else
+		up_read(&mm->mmap_sem);
+
+	return r;
+}
+
 static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 				size_t count, loff_t *ppos)
 {
 	struct task_struct *task;
 	char buffer[PROC_NUMBUF];
 	struct mm_struct *mm;
-	struct vm_area_struct *vma;
 	enum clear_refs_types type;
 	int itype;
 	int rv;
@@ -820,47 +887,9 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 		return -ESRCH;
 	mm = get_task_mm(task);
 	if (mm) {
-		struct clear_refs_private cp = {
-			.type = type,
-		};
-		struct mm_walk clear_refs_walk = {
-			.pmd_entry = clear_refs_pte_range,
-			.mm = mm,
-			.private = &cp,
-		};
-		down_read(&mm->mmap_sem);
-		if (type == CLEAR_REFS_SOFT_DIRTY)
-			mmu_notifier_invalidate_range_start(mm, 0, -1);
-		for (vma = mm->mmap; vma; vma = vma->vm_next) {
-			cp.vma = vma;
-			if (is_vm_hugetlb_page(vma))
-				continue;
-			/*
-			 * Writing 1 to /proc/pid/clear_refs affects all pages.
-			 *
-			 * Writing 2 to /proc/pid/clear_refs only affects
-			 * Anonymous pages.
-			 *
-			 * Writing 3 to /proc/pid/clear_refs only affects file
-			 * mapped pages.
-			 *
-			 * Writing 4 to /proc/pid/clear_refs affects all pages.
-			 */
-			if (type == CLEAR_REFS_ANON && vma->vm_file)
-				continue;
-			if (type == CLEAR_REFS_MAPPED && !vma->vm_file)
-				continue;
-			if (type == CLEAR_REFS_SOFT_DIRTY) {
-				if (vma->vm_flags & VM_SOFTDIRTY)
-					vma->vm_flags &= ~VM_SOFTDIRTY;
-			}
-			walk_page_range(vma->vm_start, vma->vm_end,
-					&clear_refs_walk);
-		}
-		if (type == CLEAR_REFS_SOFT_DIRTY)
-			mmu_notifier_invalidate_range_end(mm, 0, -1);
-		flush_tlb_mm(mm);
-		up_read(&mm->mmap_sem);
+		rv = clear_refs(mm, type, 0);
+		if (rv)
+			clear_refs(mm, type, 1);
 		mmput(mm);
 	}
 	put_task_struct(task);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8981cc8..7979b79 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1946,6 +1946,20 @@ static inline pgprot_t vm_get_page_prot(unsigned long vm_flags)
 }
 #endif
 
+/* Enable write notifications without blowing away special flags. */
+static inline void vma_enable_writenotify(struct vm_area_struct *vma)
+{
+	pgprot_t newprot = vm_get_page_prot(vma->vm_flags & ~VM_SHARED);
+	vma->vm_page_prot = pgprot_modify(vma->vm_page_prot, newprot);
+}
+
+/* Disable write notifications without blowing away special flags. */
+static inline void vma_disable_writenotify(struct vm_area_struct *vma)
+{
+	pgprot_t newprot = vm_get_page_prot(vma->vm_flags);
+	vma->vm_page_prot = pgprot_modify(vma->vm_page_prot, newprot);
+}
+
 #ifdef CONFIG_NUMA_BALANCING
 unsigned long change_prot_numa(struct vm_area_struct *vma,
 			unsigned long start, unsigned long end);
diff --git a/mm/mmap.c b/mm/mmap.c
index c1f2ea4..1b61fbc 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1470,6 +1470,10 @@ int vma_wants_writenotify(struct vm_area_struct *vma)
 	if (vma->vm_ops && vma->vm_ops->page_mkwrite)
 		return 1;
 
+	/* Do we need to track softdirty? */
+	if (!(vm_flags & VM_SOFTDIRTY))
+		return 1;
+
 	/* The open routine did something to the protections already? */
 	if (pgprot_val(vma->vm_page_prot) !=
 	    pgprot_val(vm_get_page_prot(vm_flags)))
@@ -1610,21 +1614,6 @@ munmap_back:
 			goto free_vma;
 	}
 
-	if (vma_wants_writenotify(vma)) {
-		pgprot_t pprot = vma->vm_page_prot;
-
-		/* Can vma->vm_page_prot have changed??
-		 *
-		 * Answer: Yes, drivers may have changed it in their
-		 *         f_op->mmap method.
-		 *
-		 * Ensures that vmas marked as uncached stay that way.
-		 */
-		vma->vm_page_prot = vm_get_page_prot(vm_flags & ~VM_SHARED);
-		if (pgprot_val(pprot) == pgprot_val(pgprot_noncached(pprot)))
-			vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
-	}
-
 	vma_link(mm, vma, prev, rb_link, rb_parent);
 	/* Once vma denies write, undo our temporary denial count */
 	if (file) {
@@ -1658,6 +1647,11 @@ out:
 	 */
 	vma->vm_flags |= VM_SOFTDIRTY;
 
+	if (vma_wants_writenotify(vma))
+		vma_enable_writenotify(vma);
+	else
+		vma_disable_writenotify(vma);
+
 	return addr;
 
 unmap_and_free_vma:
diff --git a/mm/mprotect.c b/mm/mprotect.c
index c43d557..2dea043 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -320,12 +320,12 @@ success:
 	 * held in write mode.
 	 */
 	vma->vm_flags = newflags;
-	vma->vm_page_prot = pgprot_modify(vma->vm_page_prot,
-					  vm_get_page_prot(newflags));
 
 	if (vma_wants_writenotify(vma)) {
-		vma->vm_page_prot = vm_get_page_prot(newflags & ~VM_SHARED);
+		vma_enable_writenotify(vma);
 		dirty_accountable = 1;
+	} else {
+		vma_disable_writenotify(vma);
 	}
 
 	change_protection(vma, start, end, vma->vm_page_prot,
-- 
2.1.0.rc2.206.gedb03e5



* Re: [PATCH v3] mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared
  2014-08-24  1:43 ` [PATCH v3] mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared Peter Feiner
@ 2014-08-24  7:59   ` Kirill A. Shutemov
  2014-08-24 19:22     ` Cyrill Gorcunov
  0 siblings, 1 reply; 36+ messages in thread
From: Kirill A. Shutemov @ 2014-08-24  7:59 UTC (permalink / raw)
  To: Peter Feiner
  Cc: linux-mm, linux-kernel, Cyrill Gorcunov, Pavel Emelyanov,
	Jamie Liu, Hugh Dickins, Naoya Horiguchi, Andrew Morton

On Sat, Aug 23, 2014 at 09:43:04PM -0400, Peter Feiner wrote:
> For VMAs that don't want write notifications, PTEs created for read
> faults have their write bit set. If the read fault happens after
> VM_SOFTDIRTY is cleared, then the PTE's softdirty bit will remain
> clear after subsequent writes.
> 
> Here's a simple code snippet to demonstrate the bug:
> 
>   char* m = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
>                  MAP_ANONYMOUS | MAP_SHARED, -1, 0);
>   system("echo 4 > /proc/$PPID/clear_refs"); /* clear VM_SOFTDIRTY */
>   assert(*m == '\0');     /* new PTE allows write access */
>   assert(!soft_dirty(m));
>   *m = 'x';               /* should dirty the page */
>   assert(soft_dirty(m));  /* fails */
> 
> With this patch, write notifications are enabled when VM_SOFTDIRTY is
> cleared. Furthermore, to avoid unnecessary faults, write
> notifications are disabled when VM_SOFTDIRTY is reset.
> 
> As a side effect of enabling and disabling write notifications with
> care, this patch fixes a bug in mprotect where vm_page_prot bits set
> by drivers were zapped on mprotect. An analogous bug was fixed in mmap
> by c9d0bf241451a3ab7d02e1652c22b80cd7d93e8f.
> 
> ---
> 
> v1 -> v2: Instead of checking VM_SOFTDIRTY in the fault handler,
>           enable write notifications on vm_page_prot when we clear
>           VM_SOFTDIRTY.
> 
> v2 -> v3: * Grab the mmap_sem in write mode if any VMAs have
>             VM_SOFTDIRTY set. This involved refactoring clear_refs_write
>             to make it less unwieldy.
> 
>           * In mprotect, don't inadvertently disable write notifications on VMAs
>             that have had VM_SOFTDIRTY cleared
> 
>           * The mprotect fix and mmap cleanup that comprised the
>             second and third patches in v2 were swallowed by the main
>             patch because of vm_page_prot corner case handling.
> ---
>  fs/proc/task_mmu.c | 113 +++++++++++++++++++++++++++++++++--------------------
>  include/linux/mm.h |  14 +++++++
>  mm/mmap.c          |  24 +++++-------
>  mm/mprotect.c      |   6 +--
>  4 files changed, 97 insertions(+), 60 deletions(-)
> 
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index dfc791c..f5e75c6 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -785,13 +785,80 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
>  	return 0;
>  }
>  
> +static int clear_refs(struct mm_struct *mm, enum clear_refs_types type,
> +                      int write)
> +{
> +	int r = 0;
> +	struct vm_area_struct *vma;
> +	struct clear_refs_private cp = {
> +		.type = type,
> +	};
> +	struct mm_walk clear_refs_walk = {
> +		.pmd_entry = clear_refs_pte_range,
> +		.mm = mm,
> +		.private = &cp,
> +	};
> +
> +	if (write)
> +		down_write(&mm->mmap_sem);
> +	else
> +		down_read(&mm->mmap_sem);
> +
> +	if (type == CLEAR_REFS_SOFT_DIRTY)
> +		mmu_notifier_invalidate_range_start(mm, 0, -1);
> +
> +	for (vma = mm->mmap; vma; vma = vma->vm_next) {
> +		cp.vma = vma;
> +		if (is_vm_hugetlb_page(vma))
> +			continue;
> +		/*
> +		 * Writing 1 to /proc/pid/clear_refs affects all pages.
> +		 *
> +		 * Writing 2 to /proc/pid/clear_refs only affects
> +		 * Anonymous pages.
> +		 *
> +		 * Writing 3 to /proc/pid/clear_refs only affects file
> +		 * mapped pages.
> +		 *
> +		 * Writing 4 to /proc/pid/clear_refs affects all pages.
> +		 */
> +		if (type == CLEAR_REFS_ANON && vma->vm_file)
> +			continue;
> +		if (type == CLEAR_REFS_MAPPED && !vma->vm_file)
> +			continue;
> +		if (type == CLEAR_REFS_SOFT_DIRTY &&
> +		    (vma->vm_flags & VM_SOFTDIRTY)) {
> +			if (!write) {
> +				r = -EAGAIN;
> +				break;
> +			}
> +			vma->vm_flags &= ~VM_SOFTDIRTY;
> +			vma_enable_writenotify(vma);
> +		}
> +		walk_page_range(vma->vm_start, vma->vm_end,
> +				&clear_refs_walk);
> +	}
> +
> +	if (type == CLEAR_REFS_SOFT_DIRTY)
> +		mmu_notifier_invalidate_range_end(mm, 0, -1);
> +
> +	if (!r)
> +		flush_tlb_mm(mm);
> +
> +	if (write)
> +		up_write(&mm->mmap_sem);
> +	else
> +		up_read(&mm->mmap_sem);
> +
> +	return r;
> +}
> +
>  static ssize_t clear_refs_write(struct file *file, const char __user *buf,
>  				size_t count, loff_t *ppos)
>  {
>  	struct task_struct *task;
>  	char buffer[PROC_NUMBUF];
>  	struct mm_struct *mm;
> -	struct vm_area_struct *vma;
>  	enum clear_refs_types type;
>  	int itype;
>  	int rv;
> @@ -820,47 +887,9 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
>  		return -ESRCH;
>  	mm = get_task_mm(task);
>  	if (mm) {
> -		struct clear_refs_private cp = {
> -			.type = type,
> -		};
> -		struct mm_walk clear_refs_walk = {
> -			.pmd_entry = clear_refs_pte_range,
> -			.mm = mm,
> -			.private = &cp,
> -		};
> -		down_read(&mm->mmap_sem);
> -		if (type == CLEAR_REFS_SOFT_DIRTY)
> -			mmu_notifier_invalidate_range_start(mm, 0, -1);
> -		for (vma = mm->mmap; vma; vma = vma->vm_next) {
> -			cp.vma = vma;
> -			if (is_vm_hugetlb_page(vma))
> -				continue;
> -			/*
> -			 * Writing 1 to /proc/pid/clear_refs affects all pages.
> -			 *
> -			 * Writing 2 to /proc/pid/clear_refs only affects
> -			 * Anonymous pages.
> -			 *
> -			 * Writing 3 to /proc/pid/clear_refs only affects file
> -			 * mapped pages.
> -			 *
> -			 * Writing 4 to /proc/pid/clear_refs affects all pages.
> -			 */
> -			if (type == CLEAR_REFS_ANON && vma->vm_file)
> -				continue;
> -			if (type == CLEAR_REFS_MAPPED && !vma->vm_file)
> -				continue;
> -			if (type == CLEAR_REFS_SOFT_DIRTY) {
> -				if (vma->vm_flags & VM_SOFTDIRTY)
> -					vma->vm_flags &= ~VM_SOFTDIRTY;
> -			}
> -			walk_page_range(vma->vm_start, vma->vm_end,
> -					&clear_refs_walk);
> -		}
> -		if (type == CLEAR_REFS_SOFT_DIRTY)
> -			mmu_notifier_invalidate_range_end(mm, 0, -1);
> -		flush_tlb_mm(mm);
> -		up_read(&mm->mmap_sem);
> +		rv = clear_refs(mm, type, 0);
> +		if (rv)
> +			clear_refs(mm, type, 1);
>  		mmput(mm);
>  	}
>  	put_task_struct(task);
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 8981cc8..7979b79 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1946,6 +1946,20 @@ static inline pgprot_t vm_get_page_prot(unsigned long vm_flags)
>  }
>  #endif
>  
> +/* Enable write notifications without blowing away special flags. */
> +static inline void vma_enable_writenotify(struct vm_area_struct *vma)
> +{
> +	pgprot_t newprot = vm_get_page_prot(vma->vm_flags & ~VM_SHARED);
> +	vma->vm_page_prot = pgprot_modify(vma->vm_page_prot, newprot);
> +}
> +
> +/* Disable write notifications without blowing away special flags. */
> +static inline void vma_disable_writenotify(struct vm_area_struct *vma)
> +{
> +	pgprot_t newprot = vm_get_page_prot(vma->vm_flags);
> +	vma->vm_page_prot = pgprot_modify(vma->vm_page_prot, newprot);
> +}
> +
>  #ifdef CONFIG_NUMA_BALANCING
>  unsigned long change_prot_numa(struct vm_area_struct *vma,
>  			unsigned long start, unsigned long end);
> diff --git a/mm/mmap.c b/mm/mmap.c
> index c1f2ea4..1b61fbc 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1470,6 +1470,10 @@ int vma_wants_writenotify(struct vm_area_struct *vma)
>  	if (vma->vm_ops && vma->vm_ops->page_mkwrite)
>  		return 1;
>  
> +	/* Do we need to track softdirty? */
> +	if (!(vm_flags & VM_SOFTDIRTY))

This will give a false positive if CONFIG_MEM_SOFT_DIRTY is disabled, since
VM_SOFTDIRTY is 0 in this case:

	if (IS_ENABLED(CONFIG_MEM_SOFT_DIRTY) && !(vm_flags & VM_SOFTDIRTY))

Otherwise looks good to me.

Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

> +		return 1;
> +
>  	/* The open routine did something to the protections already? */
>  	if (pgprot_val(vma->vm_page_prot) !=
>  	    pgprot_val(vm_get_page_prot(vm_flags)))
> @@ -1610,21 +1614,6 @@ munmap_back:
>  			goto free_vma;
>  	}
>  
> -	if (vma_wants_writenotify(vma)) {
> -		pgprot_t pprot = vma->vm_page_prot;
> -
> -		/* Can vma->vm_page_prot have changed??
> -		 *
> -		 * Answer: Yes, drivers may have changed it in their
> -		 *         f_op->mmap method.
> -		 *
> -		 * Ensures that vmas marked as uncached stay that way.
> -		 */
> -		vma->vm_page_prot = vm_get_page_prot(vm_flags & ~VM_SHARED);
> -		if (pgprot_val(pprot) == pgprot_val(pgprot_noncached(pprot)))
> -			vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> -	}
> -
>  	vma_link(mm, vma, prev, rb_link, rb_parent);
>  	/* Once vma denies write, undo our temporary denial count */
>  	if (file) {
> @@ -1658,6 +1647,11 @@ out:
>  	 */
>  	vma->vm_flags |= VM_SOFTDIRTY;
>  
> +	if (vma_wants_writenotify(vma))
> +		vma_enable_writenotify(vma);
> +	else
> +		vma_disable_writenotify(vma);
> +
>  	return addr;
>  
>  unmap_and_free_vma:
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index c43d557..2dea043 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -320,12 +320,12 @@ success:
>  	 * held in write mode.
>  	 */
>  	vma->vm_flags = newflags;
> -	vma->vm_page_prot = pgprot_modify(vma->vm_page_prot,
> -					  vm_get_page_prot(newflags));
>  
>  	if (vma_wants_writenotify(vma)) {
> -		vma->vm_page_prot = vm_get_page_prot(newflags & ~VM_SHARED);
> +		vma_enable_writenotify(vma);
>  		dirty_accountable = 1;
> +	} else {
> +		vma_disable_writenotify(vma);
>  	}
>  
>  	change_protection(vma, start, end, vma->vm_page_prot,
> -- 
> 2.1.0.rc2.206.gedb03e5
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH v4] mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared
  2014-08-20 21:46 [PATCH] mm: softdirty: write protect PTEs created for read faults after VM_SOFTDIRTY cleared Peter Feiner
                   ` (2 preceding siblings ...)
  2014-08-24  1:43 ` [PATCH v3] mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared Peter Feiner
@ 2014-08-24 14:41 ` Peter Feiner
  2014-08-25  3:34 ` [PATCH v5] " Peter Feiner
  2014-09-07 23:01 ` [PATCH v6] " Peter Feiner
  5 siblings, 0 replies; 36+ messages in thread
From: Peter Feiner @ 2014-08-24 14:41 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Peter Feiner, Kirill A. Shutemov, Cyrill Gorcunov,
	Pavel Emelyanov, Jamie Liu, Hugh Dickins, Naoya Horiguchi,
	Andrew Morton

For VMAs that don't want write notifications, PTEs created for read
faults have their write bit set. If the read fault happens after
VM_SOFTDIRTY is cleared, then the PTE's softdirty bit will remain
clear after subsequent writes.

Here's a simple code snippet to demonstrate the bug:

  char* m = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
                 MAP_ANONYMOUS | MAP_SHARED, -1, 0);
  system("echo 4 > /proc/$PPID/clear_refs"); /* clear VM_SOFTDIRTY */
  assert(*m == '\0');     /* new PTE allows write access */
  assert(!soft_dirty(x));
  *m = 'x';               /* should dirty the page */
  assert(soft_dirty(x));  /* fails */

With this patch, write notifications are enabled when VM_SOFTDIRTY is
cleared. Furthermore, to avoid unnecessary faults, write
notifications are disabled when VM_SOFTDIRTY is reset.

As a side effect of enabling and disabling write notifications with
care, this patch fixes a bug in mprotect where vm_page_prot bits set
by drivers were zapped on mprotect. An analogous bug was fixed in mmap
by c9d0bf241451a3ab7d02e1652c22b80cd7d93e8f.

Reported-by: Peter Feiner <pfeiner@google.com>
Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Peter Feiner <pfeiner@google.com>

---

v1 -> v2: Instead of checking VM_SOFTDIRTY in the fault handler,
          enable write notifications on vm_page_prot when we clear
          VM_SOFTDIRTY.

v2 -> v3: * Grab the mmap_sem in write mode if any VMAs have
            VM_SOFTDIRTY set. This involved refactoring clear_refs_write
            to make it less unwieldy.

          * In mprotect, don't inadvertently disable write notifications on VMAs
            that have had VM_SOFTDIRTY cleared

          * The mprotect fix and mmap cleanup that comprised the
            second and third patches in v2 were swallowed by the main
            patch because of vm_page_prot corner case handling.

v3 -> v4: Handle !defined(CONFIG_MEM_SOFT_DIRTY): old patch would have
          enabled write notifications for all VMAs in this case.
---
 fs/proc/task_mmu.c | 113 +++++++++++++++++++++++++++++++++--------------------
 include/linux/mm.h |  14 +++++++
 mm/mmap.c          |  26 ++++++------
 mm/mprotect.c      |   6 +--
 4 files changed, 99 insertions(+), 60 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index dfc791c..f5e75c6 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -785,13 +785,80 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
 	return 0;
 }
 
+static int clear_refs(struct mm_struct *mm, enum clear_refs_types type,
+                      int write)
+{
+	int r = 0;
+	struct vm_area_struct *vma;
+	struct clear_refs_private cp = {
+		.type = type,
+	};
+	struct mm_walk clear_refs_walk = {
+		.pmd_entry = clear_refs_pte_range,
+		.mm = mm,
+		.private = &cp,
+	};
+
+	if (write)
+		down_write(&mm->mmap_sem);
+	else
+		down_read(&mm->mmap_sem);
+
+	if (type == CLEAR_REFS_SOFT_DIRTY)
+		mmu_notifier_invalidate_range_start(mm, 0, -1);
+
+	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+		cp.vma = vma;
+		if (is_vm_hugetlb_page(vma))
+			continue;
+		/*
+		 * Writing 1 to /proc/pid/clear_refs affects all pages.
+		 *
+		 * Writing 2 to /proc/pid/clear_refs only affects
+		 * Anonymous pages.
+		 *
+		 * Writing 3 to /proc/pid/clear_refs only affects file
+		 * mapped pages.
+		 *
+		 * Writing 4 to /proc/pid/clear_refs affects all pages.
+		 */
+		if (type == CLEAR_REFS_ANON && vma->vm_file)
+			continue;
+		if (type == CLEAR_REFS_MAPPED && !vma->vm_file)
+			continue;
+		if (type == CLEAR_REFS_SOFT_DIRTY &&
+		    (vma->vm_flags & VM_SOFTDIRTY)) {
+			if (!write) {
+				r = -EAGAIN;
+				break;
+			}
+			vma->vm_flags &= ~VM_SOFTDIRTY;
+			vma_enable_writenotify(vma);
+		}
+		walk_page_range(vma->vm_start, vma->vm_end,
+				&clear_refs_walk);
+	}
+
+	if (type == CLEAR_REFS_SOFT_DIRTY)
+		mmu_notifier_invalidate_range_end(mm, 0, -1);
+
+	if (!r)
+		flush_tlb_mm(mm);
+
+	if (write)
+		up_write(&mm->mmap_sem);
+	else
+		up_read(&mm->mmap_sem);
+
+	return r;
+}
+
 static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 				size_t count, loff_t *ppos)
 {
 	struct task_struct *task;
 	char buffer[PROC_NUMBUF];
 	struct mm_struct *mm;
-	struct vm_area_struct *vma;
 	enum clear_refs_types type;
 	int itype;
 	int rv;
@@ -820,47 +887,9 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 		return -ESRCH;
 	mm = get_task_mm(task);
 	if (mm) {
-		struct clear_refs_private cp = {
-			.type = type,
-		};
-		struct mm_walk clear_refs_walk = {
-			.pmd_entry = clear_refs_pte_range,
-			.mm = mm,
-			.private = &cp,
-		};
-		down_read(&mm->mmap_sem);
-		if (type == CLEAR_REFS_SOFT_DIRTY)
-			mmu_notifier_invalidate_range_start(mm, 0, -1);
-		for (vma = mm->mmap; vma; vma = vma->vm_next) {
-			cp.vma = vma;
-			if (is_vm_hugetlb_page(vma))
-				continue;
-			/*
-			 * Writing 1 to /proc/pid/clear_refs affects all pages.
-			 *
-			 * Writing 2 to /proc/pid/clear_refs only affects
-			 * Anonymous pages.
-			 *
-			 * Writing 3 to /proc/pid/clear_refs only affects file
-			 * mapped pages.
-			 *
-			 * Writing 4 to /proc/pid/clear_refs affects all pages.
-			 */
-			if (type == CLEAR_REFS_ANON && vma->vm_file)
-				continue;
-			if (type == CLEAR_REFS_MAPPED && !vma->vm_file)
-				continue;
-			if (type == CLEAR_REFS_SOFT_DIRTY) {
-				if (vma->vm_flags & VM_SOFTDIRTY)
-					vma->vm_flags &= ~VM_SOFTDIRTY;
-			}
-			walk_page_range(vma->vm_start, vma->vm_end,
-					&clear_refs_walk);
-		}
-		if (type == CLEAR_REFS_SOFT_DIRTY)
-			mmu_notifier_invalidate_range_end(mm, 0, -1);
-		flush_tlb_mm(mm);
-		up_read(&mm->mmap_sem);
+		rv = clear_refs(mm, type, 0);
+		if (rv)
+			clear_refs(mm, type, 1);
 		mmput(mm);
 	}
 	put_task_struct(task);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8981cc8..7979b79 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1946,6 +1946,20 @@ static inline pgprot_t vm_get_page_prot(unsigned long vm_flags)
 }
 #endif
 
+/* Enable write notifications without blowing away special flags. */
+static inline void vma_enable_writenotify(struct vm_area_struct *vma)
+{
+	pgprot_t newprot = vm_get_page_prot(vma->vm_flags & ~VM_SHARED);
+	vma->vm_page_prot = pgprot_modify(vma->vm_page_prot, newprot);
+}
+
+/* Disable write notifications without blowing away special flags. */
+static inline void vma_disable_writenotify(struct vm_area_struct *vma)
+{
+	pgprot_t newprot = vm_get_page_prot(vma->vm_flags);
+	vma->vm_page_prot = pgprot_modify(vma->vm_page_prot, newprot);
+}
+
 #ifdef CONFIG_NUMA_BALANCING
 unsigned long change_prot_numa(struct vm_area_struct *vma,
 			unsigned long start, unsigned long end);
diff --git a/mm/mmap.c b/mm/mmap.c
index c1f2ea4..0031130 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1470,6 +1470,12 @@ int vma_wants_writenotify(struct vm_area_struct *vma)
 	if (vma->vm_ops && vma->vm_ops->page_mkwrite)
 		return 1;
 
+#ifdef CONFIG_MEM_SOFT_DIRTY
+	/* Do we need to track softdirty? */
+	if (!(vm_flags & VM_SOFTDIRTY))
+		return 1;
+#endif
+
 	/* The open routine did something to the protections already? */
 	if (pgprot_val(vma->vm_page_prot) !=
 	    pgprot_val(vm_get_page_prot(vm_flags)))
@@ -1610,21 +1616,6 @@ munmap_back:
 			goto free_vma;
 	}
 
-	if (vma_wants_writenotify(vma)) {
-		pgprot_t pprot = vma->vm_page_prot;
-
-		/* Can vma->vm_page_prot have changed??
-		 *
-		 * Answer: Yes, drivers may have changed it in their
-		 *         f_op->mmap method.
-		 *
-		 * Ensures that vmas marked as uncached stay that way.
-		 */
-		vma->vm_page_prot = vm_get_page_prot(vm_flags & ~VM_SHARED);
-		if (pgprot_val(pprot) == pgprot_val(pgprot_noncached(pprot)))
-			vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
-	}
-
 	vma_link(mm, vma, prev, rb_link, rb_parent);
 	/* Once vma denies write, undo our temporary denial count */
 	if (file) {
@@ -1658,6 +1649,11 @@ out:
 	 */
 	vma->vm_flags |= VM_SOFTDIRTY;
 
+	if (vma_wants_writenotify(vma))
+		vma_enable_writenotify(vma);
+	else
+		vma_disable_writenotify(vma);
+
 	return addr;
 
 unmap_and_free_vma:
diff --git a/mm/mprotect.c b/mm/mprotect.c
index c43d557..2dea043 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -320,12 +320,12 @@ success:
 	 * held in write mode.
 	 */
 	vma->vm_flags = newflags;
-	vma->vm_page_prot = pgprot_modify(vma->vm_page_prot,
-					  vm_get_page_prot(newflags));
 
 	if (vma_wants_writenotify(vma)) {
-		vma->vm_page_prot = vm_get_page_prot(newflags & ~VM_SHARED);
+		vma_enable_writenotify(vma);
 		dirty_accountable = 1;
+	} else {
+		vma_disable_writenotify(vma);
 	}
 
 	change_protection(vma, start, end, vma->vm_page_prot,
-- 
2.1.0.rc2.206.gedb03e5


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH v3] mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared
  2014-08-24  7:59   ` Kirill A. Shutemov
@ 2014-08-24 19:22     ` Cyrill Gorcunov
  0 siblings, 0 replies; 36+ messages in thread
From: Cyrill Gorcunov @ 2014-08-24 19:22 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Peter Feiner, linux-mm, linux-kernel, Pavel Emelyanov, Jamie Liu,
	Hugh Dickins, Naoya Horiguchi, Andrew Morton

On Sun, Aug 24, 2014 at 10:59:24AM +0300, Kirill A. Shutemov wrote:
...
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index c1f2ea4..1b61fbc 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -1470,6 +1470,10 @@ int vma_wants_writenotify(struct vm_area_struct *vma)
> >  	if (vma->vm_ops && vma->vm_ops->page_mkwrite)
> >  		return 1;
> >  
> > +	/* Do we need to track softdirty? */
> > +	if (!(vm_flags & VM_SOFTDIRTY))
> 
> This will give a false positive if CONFIG_MEM_SOFT_DIRTY is disabled, since
> VM_SOFTDIRTY is 0 in this case:
> 
> 	if (IS_ENABLED(CONFIG_MEM_SOFT_DIRTY) && !(vm_flags & VM_SOFTDIRTY))
> 
> Otherwise looks good to me.
> 
> Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

Really sorry for the delay. Thanks a lot, guys!

Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH v5] mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared
  2014-08-20 21:46 [PATCH] mm: softdirty: write protect PTEs created for read faults after VM_SOFTDIRTY cleared Peter Feiner
                   ` (3 preceding siblings ...)
  2014-08-24 14:41 ` [PATCH v4] " Peter Feiner
@ 2014-08-25  3:34 ` Peter Feiner
  2014-08-26  4:45   ` Hugh Dickins
  2014-09-07 23:01 ` [PATCH v6] " Peter Feiner
  5 siblings, 1 reply; 36+ messages in thread
From: Peter Feiner @ 2014-08-25  3:34 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Peter Feiner, Kirill A. Shutemov, Cyrill Gorcunov,
	Pavel Emelyanov, Jamie Liu, Hugh Dickins, Naoya Horiguchi,
	Andrew Morton

For VMAs that don't want write notifications, PTEs created for read
faults have their write bit set. If the read fault happens after
VM_SOFTDIRTY is cleared, then the PTE's softdirty bit will remain
clear after subsequent writes.

Here's a simple code snippet to demonstrate the bug:

  char* m = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
                 MAP_ANONYMOUS | MAP_SHARED, -1, 0);
  system("echo 4 > /proc/$PPID/clear_refs"); /* clear VM_SOFTDIRTY */
  assert(*m == '\0');     /* new PTE allows write access */
  assert(!soft_dirty(x));
  *m = 'x';               /* should dirty the page */
  assert(soft_dirty(x));  /* fails */

With this patch, write notifications are enabled when VM_SOFTDIRTY is
cleared. Furthermore, to avoid unnecessary faults, write
notifications are disabled when VM_SOFTDIRTY is reset.

As a side effect of enabling and disabling write notifications with
care, this patch fixes a bug in mprotect where vm_page_prot bits set
by drivers were zapped on mprotect. An analogous bug was fixed in mmap
by c9d0bf241451a3ab7d02e1652c22b80cd7d93e8f.

Reported-by: Peter Feiner <pfeiner@google.com>
Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Peter Feiner <pfeiner@google.com>

---

v1 -> v2: Instead of checking VM_SOFTDIRTY in the fault handler,
          enable write notifications on vm_page_prot when we clear
          VM_SOFTDIRTY.

v2 -> v3: * Grab the mmap_sem in write mode if any VMAs have
            VM_SOFTDIRTY set. This involved refactoring clear_refs_write
            to make it less unwieldy.

          * In mprotect, don't inadvertently disable write notifications on VMAs
            that have had VM_SOFTDIRTY cleared

          * The mprotect fix and mmap cleanup that comprised the
            second and third patches in v2 were swallowed by the main
            patch because of vm_page_prot corner case handling.

v3 -> v4: Handle !defined(CONFIG_MEM_SOFT_DIRTY): old patch would have
          enabled write notifications for all VMAs in this case.

v4 -> v5: IS_ENABLED(CONFIG_MEM_SOFT_DIRTY) instead of #ifdef ...
---
 fs/proc/task_mmu.c | 113 +++++++++++++++++++++++++++++++++--------------------
 include/linux/mm.h |  14 +++++++
 mm/mmap.c          |  24 +++++-------
 mm/mprotect.c      |   6 +--
 4 files changed, 97 insertions(+), 60 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index dfc791c..f5e75c6 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -785,13 +785,80 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
 	return 0;
 }
 
+static int clear_refs(struct mm_struct *mm, enum clear_refs_types type,
+                      int write)
+{
+	int r = 0;
+	struct vm_area_struct *vma;
+	struct clear_refs_private cp = {
+		.type = type,
+	};
+	struct mm_walk clear_refs_walk = {
+		.pmd_entry = clear_refs_pte_range,
+		.mm = mm,
+		.private = &cp,
+	};
+
+	if (write)
+		down_write(&mm->mmap_sem);
+	else
+		down_read(&mm->mmap_sem);
+
+	if (type == CLEAR_REFS_SOFT_DIRTY)
+		mmu_notifier_invalidate_range_start(mm, 0, -1);
+
+	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+		cp.vma = vma;
+		if (is_vm_hugetlb_page(vma))
+			continue;
+		/*
+		 * Writing 1 to /proc/pid/clear_refs affects all pages.
+		 *
+		 * Writing 2 to /proc/pid/clear_refs only affects
+		 * Anonymous pages.
+		 *
+		 * Writing 3 to /proc/pid/clear_refs only affects file
+		 * mapped pages.
+		 *
+		 * Writing 4 to /proc/pid/clear_refs affects all pages.
+		 */
+		if (type == CLEAR_REFS_ANON && vma->vm_file)
+			continue;
+		if (type == CLEAR_REFS_MAPPED && !vma->vm_file)
+			continue;
+		if (type == CLEAR_REFS_SOFT_DIRTY &&
+		    (vma->vm_flags & VM_SOFTDIRTY)) {
+			if (!write) {
+				r = -EAGAIN;
+				break;
+			}
+			vma->vm_flags &= ~VM_SOFTDIRTY;
+			vma_enable_writenotify(vma);
+		}
+		walk_page_range(vma->vm_start, vma->vm_end,
+				&clear_refs_walk);
+	}
+
+	if (type == CLEAR_REFS_SOFT_DIRTY)
+		mmu_notifier_invalidate_range_end(mm, 0, -1);
+
+	if (!r)
+		flush_tlb_mm(mm);
+
+	if (write)
+		up_write(&mm->mmap_sem);
+	else
+		up_read(&mm->mmap_sem);
+
+	return r;
+}
+
 static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 				size_t count, loff_t *ppos)
 {
 	struct task_struct *task;
 	char buffer[PROC_NUMBUF];
 	struct mm_struct *mm;
-	struct vm_area_struct *vma;
 	enum clear_refs_types type;
 	int itype;
 	int rv;
@@ -820,47 +887,9 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 		return -ESRCH;
 	mm = get_task_mm(task);
 	if (mm) {
-		struct clear_refs_private cp = {
-			.type = type,
-		};
-		struct mm_walk clear_refs_walk = {
-			.pmd_entry = clear_refs_pte_range,
-			.mm = mm,
-			.private = &cp,
-		};
-		down_read(&mm->mmap_sem);
-		if (type == CLEAR_REFS_SOFT_DIRTY)
-			mmu_notifier_invalidate_range_start(mm, 0, -1);
-		for (vma = mm->mmap; vma; vma = vma->vm_next) {
-			cp.vma = vma;
-			if (is_vm_hugetlb_page(vma))
-				continue;
-			/*
-			 * Writing 1 to /proc/pid/clear_refs affects all pages.
-			 *
-			 * Writing 2 to /proc/pid/clear_refs only affects
-			 * Anonymous pages.
-			 *
-			 * Writing 3 to /proc/pid/clear_refs only affects file
-			 * mapped pages.
-			 *
-			 * Writing 4 to /proc/pid/clear_refs affects all pages.
-			 */
-			if (type == CLEAR_REFS_ANON && vma->vm_file)
-				continue;
-			if (type == CLEAR_REFS_MAPPED && !vma->vm_file)
-				continue;
-			if (type == CLEAR_REFS_SOFT_DIRTY) {
-				if (vma->vm_flags & VM_SOFTDIRTY)
-					vma->vm_flags &= ~VM_SOFTDIRTY;
-			}
-			walk_page_range(vma->vm_start, vma->vm_end,
-					&clear_refs_walk);
-		}
-		if (type == CLEAR_REFS_SOFT_DIRTY)
-			mmu_notifier_invalidate_range_end(mm, 0, -1);
-		flush_tlb_mm(mm);
-		up_read(&mm->mmap_sem);
+		rv = clear_refs(mm, type, 0);
+		if (rv)
+			clear_refs(mm, type, 1);
 		mmput(mm);
 	}
 	put_task_struct(task);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8981cc8..7979b79 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1946,6 +1946,20 @@ static inline pgprot_t vm_get_page_prot(unsigned long vm_flags)
 }
 #endif
 
+/* Enable write notifications without blowing away special flags. */
+static inline void vma_enable_writenotify(struct vm_area_struct *vma)
+{
+	pgprot_t newprot = vm_get_page_prot(vma->vm_flags & ~VM_SHARED);
+	vma->vm_page_prot = pgprot_modify(vma->vm_page_prot, newprot);
+}
+
+/* Disable write notifications without blowing away special flags. */
+static inline void vma_disable_writenotify(struct vm_area_struct *vma)
+{
+	pgprot_t newprot = vm_get_page_prot(vma->vm_flags);
+	vma->vm_page_prot = pgprot_modify(vma->vm_page_prot, newprot);
+}
+
 #ifdef CONFIG_NUMA_BALANCING
 unsigned long change_prot_numa(struct vm_area_struct *vma,
 			unsigned long start, unsigned long end);
diff --git a/mm/mmap.c b/mm/mmap.c
index c1f2ea4..2963130 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1470,6 +1470,10 @@ int vma_wants_writenotify(struct vm_area_struct *vma)
 	if (vma->vm_ops && vma->vm_ops->page_mkwrite)
 		return 1;
 
+	/* Do we need to track softdirty? */
+	if (IS_ENABLED(CONFIG_MEM_SOFT_DIRTY) && !(vm_flags & VM_SOFTDIRTY))
+		return 1;
+
 	/* The open routine did something to the protections already? */
 	if (pgprot_val(vma->vm_page_prot) !=
 	    pgprot_val(vm_get_page_prot(vm_flags)))
@@ -1610,21 +1614,6 @@ munmap_back:
 			goto free_vma;
 	}
 
-	if (vma_wants_writenotify(vma)) {
-		pgprot_t pprot = vma->vm_page_prot;
-
-		/* Can vma->vm_page_prot have changed??
-		 *
-		 * Answer: Yes, drivers may have changed it in their
-		 *         f_op->mmap method.
-		 *
-		 * Ensures that vmas marked as uncached stay that way.
-		 */
-		vma->vm_page_prot = vm_get_page_prot(vm_flags & ~VM_SHARED);
-		if (pgprot_val(pprot) == pgprot_val(pgprot_noncached(pprot)))
-			vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
-	}
-
 	vma_link(mm, vma, prev, rb_link, rb_parent);
 	/* Once vma denies write, undo our temporary denial count */
 	if (file) {
@@ -1658,6 +1647,11 @@ out:
 	 */
 	vma->vm_flags |= VM_SOFTDIRTY;
 
+	if (vma_wants_writenotify(vma))
+		vma_enable_writenotify(vma);
+	else
+		vma_disable_writenotify(vma);
+
 	return addr;
 
 unmap_and_free_vma:
diff --git a/mm/mprotect.c b/mm/mprotect.c
index c43d557..2dea043 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -320,12 +320,12 @@ success:
 	 * held in write mode.
 	 */
 	vma->vm_flags = newflags;
-	vma->vm_page_prot = pgprot_modify(vma->vm_page_prot,
-					  vm_get_page_prot(newflags));
 
 	if (vma_wants_writenotify(vma)) {
-		vma->vm_page_prot = vm_get_page_prot(newflags & ~VM_SHARED);
+		vma_enable_writenotify(vma);
 		dirty_accountable = 1;
+	} else {
+		vma_disable_writenotify(vma);
 	}
 
 	change_protection(vma, start, end, vma->vm_page_prot,
-- 
2.1.0.rc2.206.gedb03e5


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH v5] mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared
  2014-08-25  3:34 ` [PATCH v5] " Peter Feiner
@ 2014-08-26  4:45   ` Hugh Dickins
  2014-08-26  6:49     ` Cyrill Gorcunov
  2014-09-04 16:43     ` Peter Feiner
  0 siblings, 2 replies; 36+ messages in thread
From: Hugh Dickins @ 2014-08-26  4:45 UTC (permalink / raw)
  To: Peter Feiner
  Cc: linux-mm, linux-kernel, Kirill A. Shutemov, Cyrill Gorcunov,
	Pavel Emelyanov, Jamie Liu, Hugh Dickins, Naoya Horiguchi,
	Andrew Morton, Magnus Damm

On Sun, 24 Aug 2014, Peter Feiner wrote:

> For VMAs that don't want write notifications, PTEs created for read
> faults have their write bit set. If the read fault happens after
> VM_SOFTDIRTY is cleared, then the PTE's softdirty bit will remain
> clear after subsequent writes.

Good catch.  Worrying that it's not been noticed until now.
But I find quite a lot that needs thinking about in the fix.

> 
> Here's a simple code snippet to demonstrate the bug:
> 
>   char* m = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
>                  MAP_ANONYMOUS | MAP_SHARED, -1, 0);
>   system("echo 4 > /proc/$PPID/clear_refs"); /* clear VM_SOFTDIRTY */
>   assert(*m == '\0');     /* new PTE allows write access */
>   assert(!soft_dirty(x));
>   *m = 'x';               /* should dirty the page */
>   assert(soft_dirty(x));  /* fails */
> 
> With this patch, write notifications are enabled when VM_SOFTDIRTY is
> cleared. Furthermore, to avoid unnecessary faults, write
> notifications are disabled when VM_SOFTDIRTY is reset.

"reset" is often a synonym for "cleared": "whenever VM_SOFTDIRTY is set"?

> 
> As a side effect of enabling and disabling write notifications with
> care, this patch fixes a bug in mprotect where vm_page_prot bits set
> by drivers were zapped on mprotect. An analogous bug was fixed in mmap
> by c9d0bf241451a3ab7d02e1652c22b80cd7d93e8f.


Commit c9d0bf241451 ("mm: uncached vma support with writenotify").
Adding Magnus to the Cc list: I have some doubt as to whether his
bugfix is in fact preserved below, and would like him to check.

> 
> Reported-by: Peter Feiner <pfeiner@google.com>
> Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
> Signed-off-by: Peter Feiner <pfeiner@google.com>

I like Kirill's suggestion to approach this via writenotify,
but find the disable/enable rather confusing (partly because
enabling writenotify amounts to disabling write access).
I may be alone in my confusion.

> 
> ---
> 
> v1 -> v2: Instead of checking VM_SOFTDIRTY in the fault handler,
>           enable write notifications on vm_page_prot when we clear
>           VM_SOFTDIRTY.
> 
> v2 -> v3: * Grab the mmap_sem in write mode if any VMAs have
>             VM_SOFTDIRTY set. This involved refactoring clear_refs_write
>             to make it less unwieldy.
> 
>           * In mprotect, don't inadvertently disable write notifications on VMAs
>             that have had VM_SOFTDIRTY cleared
> 
>           * The mprotect fix and mmap cleanup that comprised the
>             second and third patches in v2 were swallowed by the main
>             patch because of vm_page_prot corner case handling.
> 
> v3 -> v4: Handle !defined(CONFIG_MEM_SOFT_DIRTY): old patch would have
>           enabled write notifications for all VMAs in this case.
> 
> v4 -> v5: IS_ENABLED(CONFIG_MEM_SOFT_DIRTY) instead of #ifdef ...
> ---
>  fs/proc/task_mmu.c | 113 +++++++++++++++++++++++++++++++++--------------------
>  include/linux/mm.h |  14 +++++++
>  mm/mmap.c          |  24 +++++-------
>  mm/mprotect.c      |   6 +--
>  4 files changed, 97 insertions(+), 60 deletions(-)
> 
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index dfc791c..f5e75c6 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -785,13 +785,80 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
>  	return 0;
>  }
>  
> +static int clear_refs(struct mm_struct *mm, enum clear_refs_types type,
> +                      int write)
> +{
> +	int r = 0;
> +	struct vm_area_struct *vma;
> +	struct clear_refs_private cp = {
> +		.type = type,
> +	};
> +	struct mm_walk clear_refs_walk = {
> +		.pmd_entry = clear_refs_pte_range,
> +		.mm = mm,
> +		.private = &cp,
> +	};
> +
> +	if (write)
> +		down_write(&mm->mmap_sem);
> +	else
> +		down_read(&mm->mmap_sem);
> +
> +	if (type == CLEAR_REFS_SOFT_DIRTY)
> +		mmu_notifier_invalidate_range_start(mm, 0, -1);
> +
> +	for (vma = mm->mmap; vma; vma = vma->vm_next) {
> +		cp.vma = vma;
> +		if (is_vm_hugetlb_page(vma))
> +			continue;
> +		/*
> +		 * Writing 1 to /proc/pid/clear_refs affects all pages.
> +		 *
> +		 * Writing 2 to /proc/pid/clear_refs only affects
> +		 * Anonymous pages.
> +		 *
> +		 * Writing 3 to /proc/pid/clear_refs only affects file
> +		 * mapped pages.
> +		 *
> +		 * Writing 4 to /proc/pid/clear_refs affects all pages.
> +		 */
> +		if (type == CLEAR_REFS_ANON && vma->vm_file)
> +			continue;
> +		if (type == CLEAR_REFS_MAPPED && !vma->vm_file)
> +			continue;
> +		if (type == CLEAR_REFS_SOFT_DIRTY &&
> +		    (vma->vm_flags & VM_SOFTDIRTY)) {
> +			if (!write) {
> +				r = -EAGAIN;
> +				break;

Hmm.  For a long time I thought you were fixing another important bug
with down_write, since we "always" use down_write to modify vm_flags.

But now I'm realizing that if this is the _only_ place which modifies
vm_flags with down_read, then it's "probably" safe.  I've a vague
feeling that this was discussed before - is that so, Cyrill?

It certainly feels fragile to depend on this; but conversely, I don't
like replacing a long down_read scan by an indefinite down_read scan
followed by a long down_write scan.

I see that you earlier persuaded yourself that the races are benign
if you stick with down_read.  I can't confirm or deny that at present:
seems more important right now to get this mail out to you than think
through that aspect.

> +			}
> +			vma->vm_flags &= ~VM_SOFTDIRTY;
> +			vma_enable_writenotify(vma);

That's an example of how the vma_enable_writenotify() interface
may be confusing.  I thought for a while that that line was unsafe,
there being quite other reasons why write protection may be needed;
then realized it's okay because "enable" is the restrictive one.

> +		}
> +		walk_page_range(vma->vm_start, vma->vm_end,
> +				&clear_refs_walk);
> +	}
> +
> +	if (type == CLEAR_REFS_SOFT_DIRTY)
> +		mmu_notifier_invalidate_range_end(mm, 0, -1);
> +
> +	if (!r)
> +		flush_tlb_mm(mm);
> +
> +	if (write)
> +		up_write(&mm->mmap_sem);
> +	else
> +		up_read(&mm->mmap_sem);
> +
> +	return r;
> +}
> +
>  static ssize_t clear_refs_write(struct file *file, const char __user *buf,
>  				size_t count, loff_t *ppos)
>  {
>  	struct task_struct *task;
>  	char buffer[PROC_NUMBUF];
>  	struct mm_struct *mm;
> -	struct vm_area_struct *vma;
>  	enum clear_refs_types type;
>  	int itype;
>  	int rv;
> @@ -820,47 +887,9 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
>  		return -ESRCH;
>  	mm = get_task_mm(task);
>  	if (mm) {
> -		struct clear_refs_private cp = {
> -			.type = type,
> -		};
> -		struct mm_walk clear_refs_walk = {
> -			.pmd_entry = clear_refs_pte_range,
> -			.mm = mm,
> -			.private = &cp,
> -		};
> -		down_read(&mm->mmap_sem);
> -		if (type == CLEAR_REFS_SOFT_DIRTY)
> -			mmu_notifier_invalidate_range_start(mm, 0, -1);
> -		for (vma = mm->mmap; vma; vma = vma->vm_next) {
> -			cp.vma = vma;
> -			if (is_vm_hugetlb_page(vma))
> -				continue;
> -			/*
> -			 * Writing 1 to /proc/pid/clear_refs affects all pages.
> -			 *
> -			 * Writing 2 to /proc/pid/clear_refs only affects
> -			 * Anonymous pages.
> -			 *
> -			 * Writing 3 to /proc/pid/clear_refs only affects file
> -			 * mapped pages.
> -			 *
> -			 * Writing 4 to /proc/pid/clear_refs affects all pages.
> -			 */
> -			if (type == CLEAR_REFS_ANON && vma->vm_file)
> -				continue;
> -			if (type == CLEAR_REFS_MAPPED && !vma->vm_file)
> -				continue;
> -			if (type == CLEAR_REFS_SOFT_DIRTY) {
> -				if (vma->vm_flags & VM_SOFTDIRTY)
> -					vma->vm_flags &= ~VM_SOFTDIRTY;
> -			}
> -			walk_page_range(vma->vm_start, vma->vm_end,
> -					&clear_refs_walk);
> -		}
> -		if (type == CLEAR_REFS_SOFT_DIRTY)
> -			mmu_notifier_invalidate_range_end(mm, 0, -1);
> -		flush_tlb_mm(mm);
> -		up_read(&mm->mmap_sem);
> +		rv = clear_refs(mm, type, 0);
> +		if (rv)
> +			clear_refs(mm, type, 1);
>  		mmput(mm);
>  	}
>  	put_task_struct(task);
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 8981cc8..7979b79 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1946,6 +1946,20 @@ static inline pgprot_t vm_get_page_prot(unsigned long vm_flags)
>  }
>  #endif
>  
> +/* Enable write notifications without blowing away special flags. */
> +static inline void vma_enable_writenotify(struct vm_area_struct *vma)
> +{
> +	pgprot_t newprot = vm_get_page_prot(vma->vm_flags & ~VM_SHARED);
> +	vma->vm_page_prot = pgprot_modify(vma->vm_page_prot, newprot);
> +}
> +
> +/* Disable write notifications without blowing away special flags. */
> +static inline void vma_disable_writenotify(struct vm_area_struct *vma)
> +{
> +	pgprot_t newprot = vm_get_page_prot(vma->vm_flags);
> +	vma->vm_page_prot = pgprot_modify(vma->vm_page_prot, newprot);
> +}

As mentioned above, I find that enable and disable confusing.
Might it be better just to have a vma_set_page_prot(vma), which does
the "if vma_wants_writenotify(vma) blah; else blah;" internally?

And does what you have there build on any architecture other than
x86 and tile?  Because pgprot_modify() was only used in mm/mprotect.c
before, we declare the fallback version there, and so far as I can see,
only x86 and tile declare the pgprot_modify() they need in a header file.

> +
>  #ifdef CONFIG_NUMA_BALANCING
>  unsigned long change_prot_numa(struct vm_area_struct *vma,
>  			unsigned long start, unsigned long end);
> diff --git a/mm/mmap.c b/mm/mmap.c
> index c1f2ea4..2963130 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1470,6 +1470,10 @@ int vma_wants_writenotify(struct vm_area_struct *vma)
>  	if (vma->vm_ops && vma->vm_ops->page_mkwrite)
>  		return 1;
>  
> +	/* Do we need to track softdirty? */
> +	if (IS_ENABLED(CONFIG_MEM_SOFT_DIRTY) && !(vm_flags & VM_SOFTDIRTY))
> +		return 1;
> +
>  	/* The open routine did something to the protections already? */
>  	if (pgprot_val(vma->vm_page_prot) !=
>  	    pgprot_val(vm_get_page_prot(vm_flags)))

That sets me wondering: have you placed the VM_SOFTDIRTY check in the
right place in this series of tests?

I think, once pgprot_modify() is correct on all architectures,
it should be possible to drop that pgprot_val() check from
vma_wants_writenotify() - which would be a welcome simplification.

But what about the VM_PFNMAP test below it?  If that test was necessary,
then having your VM_SOFTDIRTY check before it seems dangerous.  But I'm
hoping we can persuade ourselves that the VM_PFNMAP test was unnecessary,
and simply delete it.

> @@ -1610,21 +1614,6 @@ munmap_back:
>  			goto free_vma;
>  	}
>  
> -	if (vma_wants_writenotify(vma)) {
> -		pgprot_t pprot = vma->vm_page_prot;
> -
> -		/* Can vma->vm_page_prot have changed??
> -		 *
> -		 * Answer: Yes, drivers may have changed it in their
> -		 *         f_op->mmap method.
> -		 *
> -		 * Ensures that vmas marked as uncached stay that way.
> -		 */
> -		vma->vm_page_prot = vm_get_page_prot(vm_flags & ~VM_SHARED);
> -		if (pgprot_val(pprot) == pgprot_val(pgprot_noncached(pprot)))
> -			vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);

So, this is where Magnus's bugfix gets deleted: but I'm afraid that
with pgprot_modify() properly implemented only on x86 and tile, we
cannot delete this so easily.

It's going to be tedious and error-prone to devise a proper
pgprot_modify() for each of N unfamiliar architectures.  I wonder
if we can take a hint from Magnus's code there, to get a suitable
default going, which may not be perfect for each, but will avoid
introducing regression.

Or am I simply confused about the lack of proper pgprot_modify()s?
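
For example, taking the hint from the noncached check above, the fallback
(which at present, if I remember right, just returns newprot) might become
something like this - untested, and only preserving the noncached case:

#ifndef pgprot_modify
static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
{
	if (pgprot_val(oldprot) == pgprot_val(pgprot_noncached(oldprot)))
		newprot = pgprot_noncached(newprot);
	return newprot;
}
#endif

Not perfect for every architecture, but it ought to avoid regressing the
uncached case that Magnus fixed.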

> -	}
> -
>  	vma_link(mm, vma, prev, rb_link, rb_parent);
>  	/* Once vma denies write, undo our temporary denial count */
>  	if (file) {
> @@ -1658,6 +1647,11 @@ out:
>  	 */
>  	vma->vm_flags |= VM_SOFTDIRTY;
>  
> +	if (vma_wants_writenotify(vma))
> +		vma_enable_writenotify(vma);
> +	else
> +		vma_disable_writenotify(vma);
> +
>  	return addr;
>  
>  unmap_and_free_vma:
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index c43d557..2dea043 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -320,12 +320,12 @@ success:
>  	 * held in write mode.
>  	 */
>  	vma->vm_flags = newflags;
> -	vma->vm_page_prot = pgprot_modify(vma->vm_page_prot,
> -					  vm_get_page_prot(newflags));
>  
>  	if (vma_wants_writenotify(vma)) {
> -		vma->vm_page_prot = vm_get_page_prot(newflags & ~VM_SHARED);
> +		vma_enable_writenotify(vma);
>  		dirty_accountable = 1;

Not an issue coming from your patch, but please take a look at how
dirty_accountable gets used in change_pte_range(): I suspect we have a
similar soft-dirty bug there, do you agree?  Or does it work out safely?
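
(For reference, the lines I have in mind there are, from memory:

	/* Avoid taking write faults for pages we know to be dirty */
	if (dirty_accountable && pte_dirty(ptent))
		ptent = pte_mkwrite(ptent);

that is, another path which hands back a writable pte without a write
fault ever being taken on it afterwards.)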

> +	} else {
> +		vma_disable_writenotify(vma);
>  	}
>  
>  	change_protection(vma, start, end, vma->vm_page_prot,
> -- 
> 2.1.0.rc2.206.gedb03e5

scripts/checkpatch.pl has a few complaints too.  Personally, I like
to make very simple functions as brief as possible, ignoring the rule
about a blank line between declarations and body.  So I like your style,
but others will disagree: I suppose we should bow to checkpatch there.

Hugh

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v5] mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared
  2014-08-26  4:45   ` Hugh Dickins
@ 2014-08-26  6:49     ` Cyrill Gorcunov
  2014-08-26 14:04       ` Kirill A. Shutemov
  2014-08-27 21:55       ` Hugh Dickins
  2014-09-04 16:43     ` Peter Feiner
  1 sibling, 2 replies; 36+ messages in thread
From: Cyrill Gorcunov @ 2014-08-26  6:49 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Peter Feiner, linux-mm, linux-kernel, Kirill A. Shutemov,
	Pavel Emelyanov, Jamie Liu, Naoya Horiguchi, Andrew Morton,
	Magnus Damm

On Mon, Aug 25, 2014 at 09:45:34PM -0700, Hugh Dickins wrote:
> > +static int clear_refs(struct mm_struct *mm, enum clear_refs_types type,
> > +                      int write)
> > +{
...
> > +
> > +	if (write)
> > +		down_write(&mm->mmap_sem);
> > +	else
> > +		down_read(&mm->mmap_sem);
> > +
> > +	if (type == CLEAR_REFS_SOFT_DIRTY)
> > +		mmu_notifier_invalidate_range_start(mm, 0, -1);
> > +
> > +	for (vma = mm->mmap; vma; vma = vma->vm_next) {
> > +		cp.vma = vma;
> > +		if (is_vm_hugetlb_page(vma))
> > +			continue;
...
> > +		if (type == CLEAR_REFS_ANON && vma->vm_file)
> > +			continue;
> > +		if (type == CLEAR_REFS_MAPPED && !vma->vm_file)
> > +			continue;
> > +		if (type == CLEAR_REFS_SOFT_DIRTY &&
> > +		    (vma->vm_flags & VM_SOFTDIRTY)) {
> > +			if (!write) {
> > +				r = -EAGAIN;
> > +				break;
> 
> Hmm.  For a long time I thought you were fixing another important bug
> with down_write, since we "always" use down_write to modify vm_flags.
> 
> But now I'm realizing that if this is the _only_ place which modifies
> vm_flags with down_read, then it's "probably" safe.  I've a vague
> feeling that this was discussed before - is that so, Cyrill?

Well, as far as I remember we were not talking before about vm_flags
and read-lock in this function, maybe it was on some unrelated lkml thread
without me CC'ed? Unless I miss something obvious, using read-lock here
for vm_flags modification should be safe, since the only thing which is
important (in context of vma-softdirty) is the vma's presence. Hugh,
mind to refresh my memory, how long ago the discussion took place?

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v5] mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared
  2014-08-26  6:49     ` Cyrill Gorcunov
@ 2014-08-26 14:04       ` Kirill A. Shutemov
  2014-08-26 14:19         ` Cyrill Gorcunov
  2014-08-27 21:55       ` Hugh Dickins
  1 sibling, 1 reply; 36+ messages in thread
From: Kirill A. Shutemov @ 2014-08-26 14:04 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Hugh Dickins, Peter Feiner, linux-mm, linux-kernel,
	Pavel Emelyanov, Jamie Liu, Naoya Horiguchi, Andrew Morton,
	Magnus Damm

On Tue, Aug 26, 2014 at 10:49:52AM +0400, Cyrill Gorcunov wrote:
> On Mon, Aug 25, 2014 at 09:45:34PM -0700, Hugh Dickins wrote:
> > > +static int clear_refs(struct mm_struct *mm, enum clear_refs_types type,
> > > +                      int write)
> > > +{
> ...
> > > +
> > > +	if (write)
> > > +		down_write(&mm->mmap_sem);
> > > +	else
> > > +		down_read(&mm->mmap_sem);
> > > +
> > > +	if (type == CLEAR_REFS_SOFT_DIRTY)
> > > +		mmu_notifier_invalidate_range_start(mm, 0, -1);
> > > +
> > > +	for (vma = mm->mmap; vma; vma = vma->vm_next) {
> > > +		cp.vma = vma;
> > > +		if (is_vm_hugetlb_page(vma))
> > > +			continue;
> ...
> > > +		if (type == CLEAR_REFS_ANON && vma->vm_file)
> > > +			continue;
> > > +		if (type == CLEAR_REFS_MAPPED && !vma->vm_file)
> > > +			continue;
> > > +		if (type == CLEAR_REFS_SOFT_DIRTY &&
> > > +		    (vma->vm_flags & VM_SOFTDIRTY)) {
> > > +			if (!write) {
> > > +				r = -EAGAIN;
> > > +				break;
> > 
> > Hmm.  For a long time I thought you were fixing another important bug
> > with down_write, since we "always" use down_write to modify vm_flags.
> > 
> > But now I'm realizing that if this is the _only_ place which modifies
> > vm_flags with down_read, then it's "probably" safe.  I've a vague
> > feeling that this was discussed before - is that so, Cyrill?
> 
> Well, as far as I remember we were not talking before about vm_flags
> and read-lock in this function, maybe it was on some unrelated lkml thread
> without me CC'ed? Unless I miss something obvious, using read-lock here
> for vm_flags modification should be safe, since the only thing which is
> important (in context of vma-softdirty) is the vma's presence. Hugh,
> mind to refresh my memory, how long ago the discussion took place?

It seems safe in the vma-softdirty context. But if somebody else decides that
it's fine to modify vm_flags without down_write (in their context), we
will get in trouble. Sasha will come along with a weird bug report one day ;)

At least vm_flags must be updated atomically, to avoid a race in the middle of
a load-modify-store.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v5] mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared
  2014-08-26 14:04       ` Kirill A. Shutemov
@ 2014-08-26 14:19         ` Cyrill Gorcunov
  2014-08-26 14:56           ` Kirill A. Shutemov
  0 siblings, 1 reply; 36+ messages in thread
From: Cyrill Gorcunov @ 2014-08-26 14:19 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Hugh Dickins, Peter Feiner, linux-mm, linux-kernel,
	Pavel Emelyanov, Jamie Liu, Naoya Horiguchi, Andrew Morton,
	Magnus Damm

On Tue, Aug 26, 2014 at 05:04:19PM +0300, Kirill A. Shutemov wrote:
> > > 
> > > But now I'm realizing that if this is the _only_ place which modifies
> > > vm_flags with down_read, then it's "probably" safe.  I've a vague
> > > feeling that this was discussed before - is that so, Cyrill?
> > 
> > Well, as far as I remember we were not talking before about vm_flags
> > and read-lock in this function, maybe it was on some unrelated lkml thread
> > without me CC'ed? Unless I miss something obvious, using read-lock here
> > for vm_flags modification should be safe, since the only thing which is
> > important (in context of vma-softdirty) is the vma's presence. Hugh,
> > mind to refresh my memory, how long ago the discussion took place?
> 
> It seems safe in the vma-softdirty context. But if somebody else decides that
> it's fine to modify vm_flags without down_write (in their context), we
> will get in trouble. Sasha will come along with a weird bug report one day ;)
> 
> At least vm_flags must be updated atomically, to avoid a race in the middle of
> a load-modify-store.

Which race do you mean here? Two concurrent clear-refs?

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v5] mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared
  2014-08-26 14:19         ` Cyrill Gorcunov
@ 2014-08-26 14:56           ` Kirill A. Shutemov
  2014-08-26 15:18             ` Cyrill Gorcunov
  0 siblings, 1 reply; 36+ messages in thread
From: Kirill A. Shutemov @ 2014-08-26 14:56 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Hugh Dickins, Peter Feiner, linux-mm, linux-kernel,
	Pavel Emelyanov, Jamie Liu, Naoya Horiguchi, Andrew Morton,
	Magnus Damm

On Tue, Aug 26, 2014 at 06:19:14PM +0400, Cyrill Gorcunov wrote:
> On Tue, Aug 26, 2014 at 05:04:19PM +0300, Kirill A. Shutemov wrote:
> > > > 
> > > > But now I'm realizing that if this is the _only_ place which modifies
> > > > vm_flags with down_read, then it's "probably" safe.  I've a vague
> > > > feeling that this was discussed before - is that so, Cyrill?
> > > 
> > > Well, as far as I remember we were not talking before about vm_flags
> > > and read-lock in this function, maybe it was on some unrelated lkml thread
> > > without me CC'ed? Unless I miss something obvious, using read-lock here
> > > for vm_flags modification should be safe, since the only thing which is
> > > important (in context of vma-softdirty) is the vma's presence. Hugh,
> > > mind to refresh my memory, how long ago the discussion took place?
> > 
> > It seems safe in the vma-softdirty context. But if somebody else decides that
> > it's fine to modify vm_flags without down_write (in their context), we
> > will get in trouble. Sasha will come along with a weird bug report one day ;)
> > 
> > At least vm_flags must be updated atomically, to avoid a race in the middle of
> > a load-modify-store.
> 
> Which race do you mean here? Two concurrent clear-refs?

Two concurrent clear-refs is fine. But if somebody else exploits the
same approach to set/clear other VM_FOO and it races with clear-refs,
we get in trouble: some modifications can be lost.
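
Something like this, just to illustrate (VM_FOO standing for any other flag
updated the same way):

	/*
	 *   CPU A (clear_refs)            CPU B (sets VM_FOO)
	 *
	 *   flags = vma->vm_flags;
	 *                                  flags = vma->vm_flags;
	 *   flags &= ~VM_SOFTDIRTY;
	 *   vma->vm_flags = flags;
	 *                                  flags |= VM_FOO;
	 *                                  vma->vm_flags = flags;
	 *
	 * and CPU A's VM_SOFTDIRTY clear is silently lost.
	 */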

Basically, it's safe if only soft-dirty is allowed to modify vm_flags
without down_write(). But why is soft-dirty so special?

Should we consider moving protection of some vma fields under a per-vma lock
rather than using the overloaded mmap_sem?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v5] mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared
  2014-08-26 14:56           ` Kirill A. Shutemov
@ 2014-08-26 15:18             ` Cyrill Gorcunov
  2014-08-26 15:43               ` Kirill A. Shutemov
  0 siblings, 1 reply; 36+ messages in thread
From: Cyrill Gorcunov @ 2014-08-26 15:18 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Hugh Dickins, Peter Feiner, linux-mm, linux-kernel,
	Pavel Emelyanov, Jamie Liu, Naoya Horiguchi, Andrew Morton,
	Magnus Damm

On Tue, Aug 26, 2014 at 05:56:12PM +0300, Kirill A. Shutemov wrote:
> > > 
> > > It seems safe in the vma-softdirty context. But if somebody else decides that
> > > it's fine to modify vm_flags without down_write (in their context), we
> > > will get in trouble. Sasha will come along with a weird bug report one day ;)
> > > 
> > > At least vm_flags must be updated atomically, to avoid a race in the middle of
> > > a load-modify-store.
> > 
> > Which race do you mean here? Two concurrent clear-refs?
> 
> Two concurrent clear-refs is fine. But if somebody else exploits the
> same approach to set/clear other VM_FOO and it races with clear-refs,
> we get in trouble: some modifications can be lost.

Yup, I see.

> Basically, it's safe if only soft-dirty is allowed to modify vm_flags
> without down_write(). But why is soft-dirty so special?

because of how we use this bit; I mean, in a normal workload this bit won't
be used intensively, I think, so it's not widespread in kernel code

> Should we consider moving protection of some vma fields under a per-vma lock
> rather than using the overloaded mmap_sem?

Hard to say; if the vma-softdirty bit is the reason then I guess no, but it's
probably worth estimating how much profit we would get from using a per-vma lock.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v5] mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared
  2014-08-26 15:18             ` Cyrill Gorcunov
@ 2014-08-26 15:43               ` Kirill A. Shutemov
  2014-08-26 15:53                 ` Cyrill Gorcunov
  0 siblings, 1 reply; 36+ messages in thread
From: Kirill A. Shutemov @ 2014-08-26 15:43 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Hugh Dickins, Peter Feiner, linux-mm, linux-kernel,
	Pavel Emelyanov, Jamie Liu, Naoya Horiguchi, Andrew Morton,
	Magnus Damm

On Tue, Aug 26, 2014 at 07:18:13PM +0400, Cyrill Gorcunov wrote:
> > Basically, it's safe if only soft-dirty is allowed to modify vm_flags
> > without down_write(). But why is soft-dirty so special?
> 
> > because of how we use this bit; I mean, in a normal workload this bit won't
> > be used intensively, I think, so it's not widespread in kernel code

Weak argument to me.

What about walking through the vmas twice: first with down_write() to modify
vm_flags and vm_page_prot, then downgrade_write() and do
walk_page_range() on every vma?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v5] mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared
  2014-08-26 15:43               ` Kirill A. Shutemov
@ 2014-08-26 15:53                 ` Cyrill Gorcunov
  2014-08-27 23:12                   ` Hugh Dickins
  0 siblings, 1 reply; 36+ messages in thread
From: Cyrill Gorcunov @ 2014-08-26 15:53 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Hugh Dickins, Peter Feiner, linux-mm, linux-kernel,
	Pavel Emelyanov, Jamie Liu, Naoya Horiguchi, Andrew Morton,
	Magnus Damm

On Tue, Aug 26, 2014 at 06:43:55PM +0300, Kirill A. Shutemov wrote:
> On Tue, Aug 26, 2014 at 07:18:13PM +0400, Cyrill Gorcunov wrote:
> > > Basically, it's safe if only soft-dirty is allowed to modify vm_flags
> > > without down_write(). But why is soft-dirty so special?
> > 
> > because of how we use this bit; I mean, in a normal workload this bit won't
> > be used intensively, I think, so it's not widespread in kernel code
> 
> Weak argument to me.
> 
> What about walking through the vmas twice: first with down_write() to modify
> vm_flags and vm_page_prot, then downgrade_write() and do
> walk_page_range() on every vma?

I still think it's unneeded, but for sure using write-lock/downgrade won't hurt,
so no arguments from my side.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v5] mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared
  2014-08-26  6:49     ` Cyrill Gorcunov
  2014-08-26 14:04       ` Kirill A. Shutemov
@ 2014-08-27 21:55       ` Hugh Dickins
  1 sibling, 0 replies; 36+ messages in thread
From: Hugh Dickins @ 2014-08-27 21:55 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Hugh Dickins, Peter Feiner, linux-mm, linux-kernel,
	Kirill A. Shutemov, Pavel Emelyanov, Jamie Liu, Naoya Horiguchi,
	Andrew Morton, Magnus Damm

On Tue, 26 Aug 2014, Cyrill Gorcunov wrote:
> On Mon, Aug 25, 2014 at 09:45:34PM -0700, Hugh Dickins wrote:
> > 
> > Hmm.  For a long time I thought you were fixing another important bug
> > with down_write, since we "always" use down_write to modify vm_flags.
> > 
> > But now I'm realizing that if this is the _only_ place which modifies
> > vm_flags with down_read, then it's "probably" safe.  I've a vague
> > feeling that this was discussed before - is that so, Cyrill?
> 
> Well, as far as I remember we were not talking before about vm_flags
> and read-lock in this function, maybe it was on some unrelated lkml thread
> without me CC'ed? Unless I miss something obvious, using read-lock here
> for vm_flags modification should be safe, since the only thing which is
> important (in context of vma-softdirty) is the vma's presence. Hugh,
> mind to refresh my memory, how long ago the discussion took place?

Sorry for making you think you were losing your mind, Cyrill.

I myself have no recollection of any such conversation with you;
but afraid that I might have lost _my_ memory of it - I didn't want
to get too strident about how fragile (though probably not yet buggy)
this down_read-for-updating-VM_SOFTDIRTY-only is, if there had already
been such a discussion, coming to the conclusion that it is okay for now.

I am fairly sure that I have had some such discussion before; but
probably with someone else, probably still about mmap_sem and vm_flags,
but probably some other VM_flag: the surprising realization that it may
be safe but fragile to use just down_read for updating one particular flag.

Hugh

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v5] mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared
  2014-08-26 15:53                 ` Cyrill Gorcunov
@ 2014-08-27 23:12                   ` Hugh Dickins
  2014-08-28  6:31                     ` Cyrill Gorcunov
  0 siblings, 1 reply; 36+ messages in thread
From: Hugh Dickins @ 2014-08-27 23:12 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Kirill A. Shutemov, Hugh Dickins, Peter Feiner, linux-mm,
	linux-kernel, Pavel Emelyanov, Jamie Liu, Naoya Horiguchi,
	Andrew Morton, Magnus Damm

On Tue, 26 Aug 2014, Cyrill Gorcunov wrote:
> On Tue, Aug 26, 2014 at 06:43:55PM +0300, Kirill A. Shutemov wrote:
> > On Tue, Aug 26, 2014 at 07:18:13PM +0400, Cyrill Gorcunov wrote:
> > > > Basically, it's safe if only soft-dirty is allowed to modify vm_flags
> > > > without down_write(). But why is soft-dirty so special?
> > > 
> > > because of how we use this bit; I mean, in a normal workload this bit won't
> > > be used intensively, I think, so it's not widespread in kernel code
> > 
> > Weak argument to me.

Yes.  However rarely it's modified, we don't want any chance of it
corrupting another flag.

VM_SOFTDIRTY is special in the sense that it's maintained in a very
different way from the other VM_flags.  If we had a little alignment
padding space somewhere in struct vm_area_struct, I think I'd jump at
Kirill's suggestion to move it out of vm_flags and into a new field:
that would remove some other special casing, like the vma merge issue.

But I don't think we have such padding space, and we'd prefer not to
bloat struct vm_area_struct for it; so maybe it should stay for now.
Besides, with Peter's patch, we're also talking about the locking on
modifications to vm_page_prot, aren't we?

> > 
> > What about walking through the vmas twice: first with down_write() to modify
> > vm_flags and vm_page_prot, then downgrade_write() and do
> > walk_page_range() on every vma?
> 
> I still think it's unneeded,

Yes, so long as nothing else is doing the same.
No bug yet, that we can see, but a bug in waiting.

> but for sure using write-lock/downgrade won't hurt,
> so no arguments from my side.

Yes, Kirill's two-stage suggestion seems the best:

down_write
quickly scan vmas, clearing VM_SOFTDIRTY and updating vm_page_prot
downgrade_write (or up_write, down_read?)
slowly walk page tables write protecting and clearing soft-dirty on ptes
up_read
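
In rough C, that two-stage scheme might look something like this (a sketch
only, reusing the clear_refs_walk name from the existing code; error
handling and the existing per-vma filtering are omitted):

	down_write(&mm->mmap_sem);
	for (vma = mm->mmap; vma; vma = vma->vm_next) {
		vma->vm_flags &= ~VM_SOFTDIRTY;
		/* refresh vm_page_prot here so write notifications kick in */
	}
	downgrade_write(&mm->mmap_sem);
	for (vma = mm->mmap; vma; vma = vma->vm_next)
		walk_page_range(vma->vm_start, vma->vm_end, &clear_refs_walk);
	up_read(&mm->mmap_sem);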

But please don't mistake me for someone who has a good grasp of
soft-dirty: I don't.

Hugh

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v5] mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared
  2014-08-27 23:12                   ` Hugh Dickins
@ 2014-08-28  6:31                     ` Cyrill Gorcunov
  0 siblings, 0 replies; 36+ messages in thread
From: Cyrill Gorcunov @ 2014-08-28  6:31 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Kirill A. Shutemov, Peter Feiner, linux-mm, linux-kernel,
	Pavel Emelyanov, Jamie Liu, Naoya Horiguchi, Andrew Morton,
	Magnus Damm

On Wed, Aug 27, 2014 at 04:12:43PM -0700, Hugh Dickins wrote:
> > > 
> > > Weak argument to me.
> 
> Yes.  However rarely it's modified, we don't want any chance of it
> corrupting another flag.
> 
> VM_SOFTDIRTY is special in the sense that it's maintained in a very
> different way from the other VM_flags.  If we had a little alignment
> padding space somewhere in struct vm_area_struct, I think I'd jump at
> Kirill's suggestion to move it out of vm_flags and into a new field:
> that would remove some other special casing, like the vma merge issue.
> 
> But I don't think we have such padding space, and we'd prefer not to
> bloat struct vm_area_struct for it; so maybe it should stay for now.
> Besides, with Peter's patch, we're also talking about the locking on
> modifications to vm_page_prot, aren't we?

I think so.

> > > What about walking through the vmas twice: first with down_write() to
> > > modify vm_flags and vm_page_prot, then downgrade_write() and do
> > > walk_page_range() on every vma?
> > 
> > I still think it's unneeded,
> 
> Yes, so long as nothing else is doing the same.
> No bug yet, that we can see, but a bug in waiting.

:-)

> 
> > but for sure using write-lock/downgrade won't hurt,
> > so no arguments from my side.
> 
> Yes, Kirill's two-stage suggestion seems the best:
> 
> down_write
> quickly scan vmas, clearing VM_SOFTDIRTY and updating vm_page_prot
> downgrade_write (or up_write, down_read?)
> slowly walk page tables write protecting and clearing soft-dirty on ptes
> up_read
> 
> But please don't mistake me for someone who has a good grasp of
> soft-dirty: I don't.

Thanks for sharing your opinion, Hugh! (And thanks for the second email
about vma flags.) So let's go with Kirill's way; otherwise, indeed, one
day it might end up in a bug which for sure will not be easy to catch.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v5] mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared
  2014-08-26  4:45   ` Hugh Dickins
  2014-08-26  6:49     ` Cyrill Gorcunov
@ 2014-09-04 16:43     ` Peter Feiner
  2014-09-07 21:31       ` Peter Feiner
  1 sibling, 1 reply; 36+ messages in thread
From: Peter Feiner @ 2014-09-04 16:43 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: linux-mm, linux-kernel, Kirill A. Shutemov, Cyrill Gorcunov,
	Pavel Emelyanov, Jamie Liu, Naoya Horiguchi, Andrew Morton,
	Magnus Damm

On Mon, Aug 25, 2014 at 09:45:34PM -0700, Hugh Dickins wrote:
> On Sun, 24 Aug 2014, Peter Feiner wrote:
> > With this patch, write notifications are enabled when VM_SOFTDIRTY is
> > cleared. Furthermore, to avoid unnecessary faults, write
> > notifications are disabled when VM_SOFTDIRTY is reset.
> 
> "reset" is often a synonym for "cleared": "whenever VM_SOFTDIRTY is set"?

Agreed, "set" sounds good.

> > As a side effect of enabling and disabling write notifications with
> > care, this patch fixes a bug in mprotect where vm_page_prot bits set
> > by drivers were zapped on mprotect. An analogous bug was fixed in mmap
> > by c9d0bf241451a3ab7d02e1652c22b80cd7d93e8f.
> 
> 
> Commit c9d0bf241451 ("mm: uncached vma support with writenotify").
> Adding Magnus to the Cc list: I have some doubt as to whether his
> bugfix is in fact preserved below, and would like him to check.

I believe the fix is preserved as long as pgprot_modify preserves cache flags.
As you explain below, pgprot_modify only does this on x86 and tile. So this
patch does indeed break c9d0bf241451 on most architectures. Furthermore, as you
said below, this patch would break the build on the other architectures :-)

> I like Kirill's suggestion to approach this via writenotify,
> but find the disable/enable rather confusing (partly because
> enabling writenotify amounts to disabling write access).
> I may be alone in my confusion.

I agree about the confusion. I wasn't too happy with the names myself.
Furthermore, I only really use vma_disable_writenotify to set vm_page_prot
from vm_flags. So I think the enable / disable idea is pretty broken. As you
suggest below, I'm going to give vma_set_page_prot a try.

> > +		if (type == CLEAR_REFS_SOFT_DIRTY &&
> > +		    (vma->vm_flags & VM_SOFTDIRTY)) {
> > +			if (!write) {
> > +				r = -EAGAIN;
> > +				break;
> 
> Hmm.  For a long time I thought you were fixing another important bug
> with down_write, since we "always" use down_write to modify vm_flags.
> 
> But now I'm realizing that if this is the _only_ place which modifies
> vm_flags with down_read, then it's "probably" safe.  I've a vague
> feeling that this was discussed before - is that so, Cyrill?
> 
> It certainly feels fragile to depend on this; but conversely, I don't
> like replacing a long down_read scan by an indefinite down_read scan
> followed by a long down_write scan.
> 
> I see that you earlier persuaded yourself that the races are benign
> if you stick with down_read.  I can't confirm or deny that at present:
> seems more important right now to get this mail out to you than think
> through that aspect.

Your observation is correct: clear_refs_write is the only place that vm_flags
is modified without an exclusive lock on mmap_sem.

I was wrong about the race between clear_refs_write modifying vm_flags and the
fault handler reading vm_flags being benign. I had thought that since
clear_refs_write zaps all of the PTEs in the VMA after it modifies vm_flags,
it was ok for a writable PTE to be temporarily installed to handle a read
fault. However, if a write happened after the read fault and before
clear_refs_write zapped the PTE, then we'd miss the write. Therefore I'm
convinced that its necessary to serialize changes to vm_flags and fault
handling.

There are a few ways to accomplish this serialization, all with their pros and
cons:

	* One down_read scan followed by a down_write scan, if necessary. This
	  is the current implementation.
	  Pros: won't take exclusive lock when VMAs haven't changed.
	  Cons: might hold exclusive lock during page table walk.

	* Per-vma lock, as Cyrill and Kirill were discussing.
	  Pros: handle faults on other VMAs when vm_flags is changing.
	  Cons: another lock acquired in fault path.
	
	* Iterate over VMAs and modify vm_flags with down_write, then
	  downgrade to down_read for page table scan, as Kirill suggested.
	  Pros: won't hold exclusive lock during page table walk.
	  Cons: clear_refs_write always grabs exclusive lock.

I think the extra lock in the fault handling path rules the per-vma lock out.
Whether the first or third approach is better depends on whether or not VMAs
are changing, which is obviously application-specific behavior. A hybrid
approach offers the best of both worlds (i.e., an optimistic down_read scan
that bails out if there's a VM_SOFTDIRTY VMA and falls back to the downgrading
approach).
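
A rough sketch of that hybrid (illustrative only; it assumes the
vma_set_page_prot() helper discussed above and omits the existing per-vma
filtering):

	down_read(&mm->mmap_sem);
	for (vma = mm->mmap; vma; vma = vma->vm_next) {
		if (!(vma->vm_flags & VM_SOFTDIRTY))
			continue;
		/* some vm_flags need changing: retry with the exclusive lock */
		up_read(&mm->mmap_sem);
		down_write(&mm->mmap_sem);
		for (vma = mm->mmap; vma; vma = vma->vm_next) {
			vma->vm_flags &= ~VM_SOFTDIRTY;
			vma_set_page_prot(vma);
		}
		downgrade_write(&mm->mmap_sem);
		break;
	}
	/* ... then walk page tables under the read lock as before ... */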

> 
> > +			}
> > +			vma->vm_flags &= ~VM_SOFTDIRTY;
> > +			vma_enable_writenotify(vma);
> 
> That's an example of how the vma_enable_writenotify() interface
> may be confusing.  I thought for a while that that line was unsafe,
> there being quite other reasons why write protection may be needed;
> then realized it's okay because "enable" is the restrictive one.

Yep, agreed. It'll look like

	vma->vm_flags &= ~VM_SOFTDIRTY;
	vma_set_page_prot(vma);

after implementing your suggestion.

> 
> > +		}
> > +		walk_page_range(vma->vm_start, vma->vm_end,
> > +				&clear_refs_walk);
> > +	}
> > +
> > +	if (type == CLEAR_REFS_SOFT_DIRTY)
> > +		mmu_notifier_invalidate_range_end(mm, 0, -1);
> > +
> > +	if (!r)
> > +		flush_tlb_mm(mm);
> > +
> > +	if (write)
> > +		up_write(&mm->mmap_sem);
> > +	else
> > +		up_read(&mm->mmap_sem);
> > +
> > +	return r;
> > +}
> > +
> >  static ssize_t clear_refs_write(struct file *file, const char __user *buf,
> >  				size_t count, loff_t *ppos)
> >  {
> >  	struct task_struct *task;
> >  	char buffer[PROC_NUMBUF];
> >  	struct mm_struct *mm;
> > -	struct vm_area_struct *vma;
> >  	enum clear_refs_types type;
> >  	int itype;
> >  	int rv;
> > @@ -820,47 +887,9 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
> >  		return -ESRCH;
> >  	mm = get_task_mm(task);
> >  	if (mm) {
> > -		struct clear_refs_private cp = {
> > -			.type = type,
> > -		};
> > -		struct mm_walk clear_refs_walk = {
> > -			.pmd_entry = clear_refs_pte_range,
> > -			.mm = mm,
> > -			.private = &cp,
> > -		};
> > -		down_read(&mm->mmap_sem);
> > -		if (type == CLEAR_REFS_SOFT_DIRTY)
> > -			mmu_notifier_invalidate_range_start(mm, 0, -1);
> > -		for (vma = mm->mmap; vma; vma = vma->vm_next) {
> > -			cp.vma = vma;
> > -			if (is_vm_hugetlb_page(vma))
> > -				continue;
> > -			/*
> > -			 * Writing 1 to /proc/pid/clear_refs affects all pages.
> > -			 *
> > -			 * Writing 2 to /proc/pid/clear_refs only affects
> > -			 * Anonymous pages.
> > -			 *
> > -			 * Writing 3 to /proc/pid/clear_refs only affects file
> > -			 * mapped pages.
> > -			 *
> > -			 * Writing 4 to /proc/pid/clear_refs affects all pages.
> > -			 */
> > -			if (type == CLEAR_REFS_ANON && vma->vm_file)
> > -				continue;
> > -			if (type == CLEAR_REFS_MAPPED && !vma->vm_file)
> > -				continue;
> > -			if (type == CLEAR_REFS_SOFT_DIRTY) {
> > -				if (vma->vm_flags & VM_SOFTDIRTY)
> > -					vma->vm_flags &= ~VM_SOFTDIRTY;
> > -			}
> > -			walk_page_range(vma->vm_start, vma->vm_end,
> > -					&clear_refs_walk);
> > -		}
> > -		if (type == CLEAR_REFS_SOFT_DIRTY)
> > -			mmu_notifier_invalidate_range_end(mm, 0, -1);
> > -		flush_tlb_mm(mm);
> > -		up_read(&mm->mmap_sem);
> > +		rv = clear_refs(mm, type, 0);
> > +		if (rv)
> > +			clear_refs(mm, type, 1);
> >  		mmput(mm);
> >  	}
> >  	put_task_struct(task);
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 8981cc8..7979b79 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -1946,6 +1946,20 @@ static inline pgprot_t vm_get_page_prot(unsigned long vm_flags)
> >  }
> >  #endif
> >  
> > +/* Enable write notifications without blowing away special flags. */
> > +static inline void vma_enable_writenotify(struct vm_area_struct *vma)
> > +{
> > +	pgprot_t newprot = vm_get_page_prot(vma->vm_flags & ~VM_SHARED);
> > +	vma->vm_page_prot = pgprot_modify(vma->vm_page_prot, newprot);
> > +}
> > +
> > +/* Disable write notifications without blowing away special flags. */
> > +static inline void vma_disable_writenotify(struct vm_area_struct *vma)
> > +{
> > +	pgprot_t newprot = vm_get_page_prot(vma->vm_flags);
> > +	vma->vm_page_prot = pgprot_modify(vma->vm_page_prot, newprot);
> > +}
> 
> As mentioned above, I find that enable and disable confusing.
> Might it be better just to have a vma_set_page_prot(vma), which does
> the "if vma_wants_writenotify(vma) blah; else blah;" internally?

I like that idea. I'll think it through and give it a try.

> And does what you have there build on any architecture other than
> x86 and tile?  Because pgprot_modify() was only used in mm/mprotect.c
> before, we declare the fallback version there, and so far as I can see,
> only x86 and tile declare the pgprot_modify() they need in a header file.
> 
> > +
> >  #ifdef CONFIG_NUMA_BALANCING
> >  unsigned long change_prot_numa(struct vm_area_struct *vma,
> >  			unsigned long start, unsigned long end);
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index c1f2ea4..2963130 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -1470,6 +1470,10 @@ int vma_wants_writenotify(struct vm_area_struct *vma)
> >  	if (vma->vm_ops && vma->vm_ops->page_mkwrite)
> >  		return 1;
> >  
> > +	/* Do we need to track softdirty? */
> > +	if (IS_ENABLED(CONFIG_MEM_SOFT_DIRTY) && !(vm_flags & VM_SOFTDIRTY))
> > +		return 1;
> > +
> >  	/* The open routine did something to the protections already? */
> >  	if (pgprot_val(vma->vm_page_prot) !=
> >  	    pgprot_val(vm_get_page_prot(vm_flags)))
> 
> That sets me wondering: have you placed the VM_SOFTDIRTY check in the
> right place in this series of tests?
> 
> I think, once pgprot_modify() is correct on all architectures,
> it should be possible to drop that pgprot_val() check from
> vma_wants_writenotify() - which would be a welcome simplification.
> 
> But what about the VM_PFNMAP test below it?  If that test was necessary,
> then having your VM_SOFTDIRTY check before it seems dangerous.  But I'm
> hoping we can persuade ourselves that the VM_PFNMAP test was unnecessary,
> and simply delete it.

If VM_PFNMAP is necessary, then I definitely put the VM_SOFTDIRTY check in the
wrong spot :-) I don't know much (i.e., anything) about VM_PFNMAP, so I'll
have to bone up on a lot of code before I have an informed opinion about the
necessity of the check.

I had erroneously reasoned that it was necessary to put the VM_SOFTDIRTY check
before the pgprot_val check in order to handle VMA merging correctly. I'll
give this another think and look into dropping the pgprot_val check altogether,
as you suggest.

> 
> > @@ -1610,21 +1614,6 @@ munmap_back:
> >  			goto free_vma;
> >  	}
> >  
> > -	if (vma_wants_writenotify(vma)) {
> > -		pgprot_t pprot = vma->vm_page_prot;
> > -
> > -		/* Can vma->vm_page_prot have changed??
> > -		 *
> > -		 * Answer: Yes, drivers may have changed it in their
> > -		 *         f_op->mmap method.
> > -		 *
> > -		 * Ensures that vmas marked as uncached stay that way.
> > -		 */
> > -		vma->vm_page_prot = vm_get_page_prot(vm_flags & ~VM_SHARED);
> > -		if (pgprot_val(pprot) == pgprot_val(pgprot_noncached(pprot)))
> > -			vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> 
> So, this is where Magnus's bugfix gets deleted: but I'm afraid that
> with pgprot_modify() properly implemented only on x86 and tile, we
> cannot delete this so easily.
> 
> It's going to be tedious and error-prone to devise a proper
> pgprot_modify() for each of N unfamiliar architectures.  I wonder
> if we can take a hint from Magnus's code there, to get a suitable
> default going, which may not be perfect for each, but will avoid
> introducing regression.
> 
> Or am I simply confused about the lack of proper pgprot_modify()s?

No, I think you're right about pgprot_modify. I like your idea for
a best-effort implementation of a generic pgprot_modify. I'll give it a try
and see if everything fits together nicely for VM_SOFTDIRTY.
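
Roughly, a best-effort generic fallback along those lines (a sketch only,
extending the noncached preservation from the mmap code quoted above to
pgprot_writecombine as well):

	#ifndef pgprot_modify
	static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
	{
		/* carry the "special" cacheability bits over from oldprot */
		if (pgprot_val(oldprot) == pgprot_val(pgprot_noncached(oldprot)))
			newprot = pgprot_noncached(newprot);
		if (pgprot_val(oldprot) == pgprot_val(pgprot_writecombine(oldprot)))
			newprot = pgprot_writecombine(newprot);
		return newprot;
	}
	#endif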

> > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > index c43d557..2dea043 100644
> > --- a/mm/mprotect.c
> > +++ b/mm/mprotect.c
> > @@ -320,12 +320,12 @@ success:
> >  	 * held in write mode.
> >  	 */
> >  	vma->vm_flags = newflags;
> > -	vma->vm_page_prot = pgprot_modify(vma->vm_page_prot,
> > -					  vm_get_page_prot(newflags));
> >  
> >  	if (vma_wants_writenotify(vma)) {
> > -		vma->vm_page_prot = vm_get_page_prot(newflags & ~VM_SHARED);
> > +		vma_enable_writenotify(vma);
> >  		dirty_accountable = 1;
> 
> Not an issue coming from your patch, but please take a look at how
> dirty_accountable gets used in change_pte_range(): I suspect we have a
> similar soft-dirty bug there, do you agree?  Or does it work out safely?

Indeed, there is a similar bug in change_pte_range. If a PTE is dirty but not
soft-dirty and dirty_accountable is true, then the PTE will be made writable
and we'll never get to mark the PTE softdirty. Good catch! I'll submit another
patch to fix this.
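
Concretely, the check in change_pte_range() would need to become something
like (a sketch only):

	if (dirty_accountable && pte_dirty(ptent) &&
	    (pte_soft_dirty(ptent) || !(vma->vm_flags & VM_SOFTDIRTY)))
		ptent = pte_mkwrite(ptent);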

> scripts/checkpatch.pl has a few complaints too.  Personally, I like
> to make very simple functions as brief as possible, ignoring the rule
> about a blank line between declarations and body.  So I like your style,
> but others will disagree: I suppose we should bow to checkpatch there.

Aye, I shall bend the knee.

Peter

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v5] mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared
  2014-09-04 16:43     ` Peter Feiner
@ 2014-09-07 21:31       ` Peter Feiner
  0 siblings, 0 replies; 36+ messages in thread
From: Peter Feiner @ 2014-09-07 21:31 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: linux-mm, linux-kernel, Kirill A. Shutemov, Cyrill Gorcunov,
	Pavel Emelyanov, Jamie Liu, Naoya Horiguchi, Andrew Morton,
	Magnus Damm

On Thu, Sep 04, 2014 at 09:43:11AM -0700, Peter Feiner wrote:
> On Mon, Aug 25, 2014 at 09:45:34PM -0700, Hugh Dickins wrote:
> > That sets me wondering: have you placed the VM_SOFTDIRTY check in the
> > right place in this series of tests?
> > 
> > I think, once pgprot_modify() is correct on all architectures,
> > it should be possible to drop that pgprot_val() check from
> > vma_wants_writenotify() - which would be a welcome simplification.
> > 
> > But what about the VM_PFNMAP test below it?  If that test was necessary,
> > then having your VM_SOFTDIRTY check before it seems dangerous.  But I'm
> > hoping we can persuade ourselves that the VM_PFNMAP test was unnecessary,
> > and simply delete it.
> 
> If VM_PFNMAP is necessary, then I definitely put the VM_SOFTDIRTY check in the
> wrong spot :-) I don't know much (i.e., anything) about VM_PFNMAP, so I'll
> have to bone up on a lot of code before I have an informed opinion about the
> necessity of the check.

AFAICT, the VM_PFNMAP check is unnecessary since I can't find any drivers that
set VM_PFNMAP and enable dirty accounting on their mappings. If anything,
VM_PFNMAP precludes dirty tracking of the mapping since set_page_dirty takes a
struct page argument! Perhaps the VM_PFNMAP check was originally put in
vma_wants_writenotify as a safeguard against bogus calls to set_page_dirty?
In any case, it seems harmless to me to put the VM_SOFTDIRTY check before the
VM_PFNMAP check since none of the fault handling code in mm/memory.c calls
set_page_dirty on a VM_PFNMAP fault because either vm_normal_page() returns
NULL or ->fault() / ->page_mkwrite() return VM_FAULT_NOPAGE. Moreover, for
the purpose of softdirty tracking, enabling write notifications on VM_PFNMAP
VMAs is OK since do_wp_page does the right thing when vm_normal_page() returns
NULL.
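
In other words, the resulting order of checks in vma_wants_writenotify()
would be, roughly (see the v6 patch below):

	/* Do we need to track softdirty? */
	if (IS_ENABLED(CONFIG_MEM_SOFT_DIRTY) && !(vm_flags & VM_SOFTDIRTY))
		return 1;

	/* Specialty mapping? */
	if (vm_flags & VM_PFNMAP)
		return 0;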

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH v6] mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared
  2014-08-20 21:46 [PATCH] mm: softdirty: write protect PTEs created for read faults after VM_SOFTDIRTY cleared Peter Feiner
                   ` (4 preceding siblings ...)
  2014-08-25  3:34 ` [PATCH v5] " Peter Feiner
@ 2014-09-07 23:01 ` Peter Feiner
  5 siblings, 0 replies; 36+ messages in thread
From: Peter Feiner @ 2014-09-07 23:01 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Peter Feiner, Kirill A. Shutemov, Cyrill Gorcunov,
	Pavel Emelyanov, Jamie Liu, Hugh Dickins, Naoya Horiguchi,
	Andrew Morton

For VMAs that don't want write notifications, PTEs created for read
faults have their write bit set. If the read fault happens after
VM_SOFTDIRTY is cleared, then the PTE's softdirty bit will remain
clear after subsequent writes.

Here's a simple code snippet to demonstrate the bug:

  char* m = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
                 MAP_ANONYMOUS | MAP_SHARED, -1, 0);
  system("echo 4 > /proc/$PPID/clear_refs"); /* clear VM_SOFTDIRTY */
  assert(*m == '\0');     /* new PTE allows write access */
  assert(!soft_dirty(m));
  *m = 'x';               /* should dirty the page */
  assert(soft_dirty(m));  /* fails */

With this patch, write notifications are enabled when VM_SOFTDIRTY is
cleared. Furthermore, to avoid unnecessary faults, write
notifications are disabled when VM_SOFTDIRTY is set.

As a side effect of enabling and disabling write notifications with
care, this patch fixes a bug in mprotect where vm_page_prot bits set
by drivers were zapped on mprotect. An analogous bug was fixed in mmap
by c9d0bf241451a3ab7d02e1652c22b80cd7d93e8f.

Reported-by: Peter Feiner <pfeiner@google.com>
Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Peter Feiner <pfeiner@google.com>

---

v1 -> v2: Instead of checking VM_SOFTDIRTY in the fault handler,
          enable write notifications on vm_page_prot when we clear
          VM_SOFTDIRTY.

v2 -> v3: * Grab the mmap_sem in write mode if any VMAs have
            VM_SOFTDIRTY set. This involved refactoring clear_refs_write
            to make it less unwieldy.

          * In mprotect, don't inadvertently disable write notifications on VMAs
            that have had VM_SOFTDIRTY cleared

          * The mprotect fix and mmap cleanup that comprised the
            second and third patches in v2 were swallowed by the main
            patch because of vm_page_prot corner case handling.

v3 -> v4: Handle !defined(CONFIG_MEM_SOFT_DIRTY): old patch would have
          enabled write notifications for all VMAs in this case.

v4 -> v5: IS_ENABLED(CONFIG_MEM_SOFT_DIRTY) instead of #ifdef ...

v5 -> v6:
          * Replaced vma_{enable,disable}_writenotify() with vma_set_page_prot()
            as per Hugh's suggestion.

          * Per Hugh's suggestion, added an arch-generic pgprot_modify that
            handles pgprot_noncached and pgprot_writecombine protections. This
            arch-generic pgprot_modify fixes the regression introduced in v2,
            which re-introduced the bug fixed by
            c9d0bf241451a3ab7d02e1652c22b80cd7d93e8f on architectures without
            their own pgprot_modify.

          * Made vma_set_page_prot's pgprot check less restrictive by using
            pgprot_modify. Couldn't remove the pgprot check altogether since
            pgprot_modify isn't 100% faithful on all architectures, such as
            powerpc.

          * Fixed dirty_accountable bug.

          * Fixed non-x86 build (tested build on arm64 and powerpc)

          * Per Kirill's suggestion, changed locking so mmap_sem isn't held
            exclusively during page table traversal.

Kirill & Cyrill: I removed your Reviewed-By: footers since this patch is quite
different from v5 and I didn't want to presume your approval. Please re-add as
you see fit - this has been a group effort!
---
 fs/proc/task_mmu.c            | 19 +++++++++++++-----
 include/asm-generic/pgtable.h | 12 ++++++++++++
 include/linux/mm.h            |  5 +++++
 mm/memory.c                   |  3 ++-
 mm/mmap.c                     | 45 +++++++++++++++++++++++++++----------------
 mm/mprotect.c                 | 20 +++++--------------
 6 files changed, 66 insertions(+), 38 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index dfc791c..4621914 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -829,8 +829,21 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 			.private = &cp,
 		};
 		down_read(&mm->mmap_sem);
-		if (type == CLEAR_REFS_SOFT_DIRTY)
+		if (type == CLEAR_REFS_SOFT_DIRTY) {
+			for (vma = mm->mmap; vma; vma = vma->vm_next) {
+				if (!(vma->vm_flags & VM_SOFTDIRTY))
+					continue;
+				up_read(&mm->mmap_sem);
+				down_write(&mm->mmap_sem);
+				for (vma = mm->mmap; vma; vma = vma->vm_next) {
+					vma->vm_flags &= ~VM_SOFTDIRTY;
+					vma_set_page_prot(vma);
+				}
+				downgrade_write(&mm->mmap_sem);
+				break;
+			}
 			mmu_notifier_invalidate_range_start(mm, 0, -1);
+		}
 		for (vma = mm->mmap; vma; vma = vma->vm_next) {
 			cp.vma = vma;
 			if (is_vm_hugetlb_page(vma))
@@ -850,10 +863,6 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 				continue;
 			if (type == CLEAR_REFS_MAPPED && !vma->vm_file)
 				continue;
-			if (type == CLEAR_REFS_SOFT_DIRTY) {
-				if (vma->vm_flags & VM_SOFTDIRTY)
-					vma->vm_flags &= ~VM_SOFTDIRTY;
-			}
 			walk_page_range(vma->vm_start, vma->vm_end,
 					&clear_refs_walk);
 		}
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 53b2acc..0f6edbd 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -249,6 +249,18 @@ static inline int pmd_same(pmd_t pmd_a, pmd_t pmd_b)
 #define pgprot_writecombine pgprot_noncached
 #endif
 
+#ifndef pgprot_modify
+#define pgprot_modify pgprot_modify
+static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
+{
+	if (pgprot_val(oldprot) == pgprot_val(pgprot_noncached(oldprot)))
+		newprot = pgprot_noncached(newprot);
+	if (pgprot_val(oldprot) == pgprot_val(pgprot_writecombine(oldprot)))
+		newprot = pgprot_writecombine(newprot);
+	return newprot;
+}
+#endif
+
 /*
  * When walking page tables, get the address of the next boundary,
  * or the end address of the range if that comes earlier.  Although no
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8981cc8..4e070aa 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1939,11 +1939,16 @@ static inline struct vm_area_struct *find_exact_vma(struct mm_struct *mm,
 
 #ifdef CONFIG_MMU
 pgprot_t vm_get_page_prot(unsigned long vm_flags);
+void vma_set_page_prot(struct vm_area_struct *vma);
 #else
 static inline pgprot_t vm_get_page_prot(unsigned long vm_flags)
 {
 	return __pgprot(0);
 }
+static inline void vma_set_page_prot(struct vm_area_struct *vma)
+{
+	vma->vm_page_prot = vm_get_page_prot(vma->vm_flags);
+}
 #endif
 
 #ifdef CONFIG_NUMA_BALANCING
diff --git a/mm/memory.c b/mm/memory.c
index adeac30..1715233 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2051,7 +2051,8 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	old_page = vm_normal_page(vma, address, orig_pte);
 	if (!old_page) {
 		/*
-		 * VM_MIXEDMAP !pfn_valid() case
+		 * VM_MIXEDMAP !pfn_valid() case, or VM_SOFTDIRTY clear on a
+		 * VM_PFNMAP VMA.
 		 *
 		 * We should not cow pages in a shared writeable mapping.
 		 * Just mark the pages writable as we can't do any dirty
diff --git a/mm/mmap.c b/mm/mmap.c
index c1f2ea4..4ef9325 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -89,6 +89,25 @@ pgprot_t vm_get_page_prot(unsigned long vm_flags)
 }
 EXPORT_SYMBOL(vm_get_page_prot);
 
+static pgprot_t vm_pgprot_modify(pgprot_t oldprot, unsigned long vm_flags)
+{
+	return pgprot_modify(oldprot, vm_get_page_prot(vm_flags));
+}
+
+/* Update vma->vm_page_prot to reflect vma->vm_flags. */
+void vma_set_page_prot(struct vm_area_struct *vma)
+{
+	unsigned long vm_flags = vma->vm_flags;
+
+	vma->vm_page_prot = vm_pgprot_modify(vma->vm_page_prot, vm_flags);
+	if (vma_wants_writenotify(vma)) {
+		vm_flags &= ~VM_SHARED;
+		vma->vm_page_prot = vm_pgprot_modify(vma->vm_page_prot,
+						     vm_flags);
+	}
+}
+
+
 int sysctl_overcommit_memory __read_mostly = OVERCOMMIT_GUESS;  /* heuristic overcommit */
 int sysctl_overcommit_ratio __read_mostly = 50;	/* default is 50% */
 unsigned long sysctl_overcommit_kbytes __read_mostly;
@@ -1470,11 +1489,16 @@ int vma_wants_writenotify(struct vm_area_struct *vma)
 	if (vma->vm_ops && vma->vm_ops->page_mkwrite)
 		return 1;
 
-	/* The open routine did something to the protections already? */
+	/* The open routine did something to the protections that pgprot_modify
+	 * won't preserve? */
 	if (pgprot_val(vma->vm_page_prot) !=
-	    pgprot_val(vm_get_page_prot(vm_flags)))
+	    pgprot_val(vm_pgprot_modify(vma->vm_page_prot, vm_flags)))
 		return 0;
 
+	/* Do we need to track softdirty? */
+	if (IS_ENABLED(CONFIG_MEM_SOFT_DIRTY) && !(vm_flags & VM_SOFTDIRTY))
+		return 1;
+
 	/* Specialty mapping? */
 	if (vm_flags & VM_PFNMAP)
 		return 0;
@@ -1610,21 +1634,6 @@ munmap_back:
 			goto free_vma;
 	}
 
-	if (vma_wants_writenotify(vma)) {
-		pgprot_t pprot = vma->vm_page_prot;
-
-		/* Can vma->vm_page_prot have changed??
-		 *
-		 * Answer: Yes, drivers may have changed it in their
-		 *         f_op->mmap method.
-		 *
-		 * Ensures that vmas marked as uncached stay that way.
-		 */
-		vma->vm_page_prot = vm_get_page_prot(vm_flags & ~VM_SHARED);
-		if (pgprot_val(pprot) == pgprot_val(pgprot_noncached(pprot)))
-			vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
-	}
-
 	vma_link(mm, vma, prev, rb_link, rb_parent);
 	/* Once vma denies write, undo our temporary denial count */
 	if (file) {
@@ -1658,6 +1667,8 @@ out:
 	 */
 	vma->vm_flags |= VM_SOFTDIRTY;
 
+	vma_set_page_prot(vma);
+
 	return addr;
 
 unmap_and_free_vma:
diff --git a/mm/mprotect.c b/mm/mprotect.c
index c43d557..ace9345 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -29,13 +29,6 @@
 #include <asm/cacheflush.h>
 #include <asm/tlbflush.h>
 
-#ifndef pgprot_modify
-static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
-{
-	return newprot;
-}
-#endif
-
 /*
  * For a prot_numa update we only hold mmap_sem for read so there is a
  * potential race with faulting where a pmd was temporarily none. This
@@ -93,7 +86,9 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 				 * Avoid taking write faults for pages we
 				 * know to be dirty.
 				 */
-				if (dirty_accountable && pte_dirty(ptent))
+				if (dirty_accountable && pte_dirty(ptent) &&
+				    (pte_soft_dirty(ptent) ||
+				     !(vma->vm_flags & VM_SOFTDIRTY)))
 					ptent = pte_mkwrite(ptent);
 				ptep_modify_prot_commit(mm, addr, pte, ptent);
 				updated = true;
@@ -320,13 +315,8 @@ success:
 	 * held in write mode.
 	 */
 	vma->vm_flags = newflags;
-	vma->vm_page_prot = pgprot_modify(vma->vm_page_prot,
-					  vm_get_page_prot(newflags));
-
-	if (vma_wants_writenotify(vma)) {
-		vma->vm_page_prot = vm_get_page_prot(newflags & ~VM_SHARED);
-		dirty_accountable = 1;
-	}
+	dirty_accountable = vma_wants_writenotify(vma);
+	vma_set_page_prot(vma);
 
 	change_protection(vma, start, end, vma->vm_page_prot,
 			  dirty_accountable, 0);
-- 
2.1.0.rc2.206.gedb03e5


^ permalink raw reply related	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2014-09-07 23:01 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-08-20 21:46 [PATCH] mm: softdirty: write protect PTEs created for read faults after VM_SOFTDIRTY cleared Peter Feiner
2014-08-20 23:45 ` Kirill A. Shutemov
2014-08-21 19:37   ` Peter Feiner
2014-08-21 20:51     ` Cyrill Gorcunov
2014-08-21 21:39       ` Kirill A. Shutemov
2014-08-21 21:46         ` Peter Feiner
2014-08-21 21:51           ` Kirill A. Shutemov
2014-08-21 22:50             ` Peter Feiner
2014-08-22  6:33               ` Cyrill Gorcunov
2014-08-23 22:11 ` [PATCH v2 0/3] softdirty fix and write notification cleanup Peter Feiner
2014-08-23 22:11   ` [PATCH v2 1/3] mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared Peter Feiner
2014-08-23 23:00     ` Kirill A. Shutemov
2014-08-23 23:15       ` Peter Feiner
2014-08-23 23:50       ` Kirill A. Shutemov
2014-08-24  0:55         ` Peter Feiner
2014-08-23 22:12   ` [PATCH v2 2/3] mm: mprotect: preserve special page protection bits Peter Feiner
2014-08-23 22:12   ` [PATCH v2 3/3] mm: mmap: cleanup code that preserves special vm_page_prot bits Peter Feiner
2014-08-24  1:43 ` [PATCH v3] mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared Peter Feiner
2014-08-24  7:59   ` Kirill A. Shutemov
2014-08-24 19:22     ` Cyrill Gorcunov
2014-08-24 14:41 ` [PATCH v4] " Peter Feiner
2014-08-25  3:34 ` [PATCH v5] " Peter Feiner
2014-08-26  4:45   ` Hugh Dickins
2014-08-26  6:49     ` Cyrill Gorcunov
2014-08-26 14:04       ` Kirill A. Shutemov
2014-08-26 14:19         ` Cyrill Gorcunov
2014-08-26 14:56           ` Kirill A. Shutemov
2014-08-26 15:18             ` Cyrill Gorcunov
2014-08-26 15:43               ` Kirill A. Shutemov
2014-08-26 15:53                 ` Cyrill Gorcunov
2014-08-27 23:12                   ` Hugh Dickins
2014-08-28  6:31                     ` Cyrill Gorcunov
2014-08-27 21:55       ` Hugh Dickins
2014-09-04 16:43     ` Peter Feiner
2014-09-07 21:31       ` Peter Feiner
2014-09-07 23:01 ` [PATCH v6] " Peter Feiner
