* [patch] mm: dirty page tracking race fix
@ 2008-08-18  5:38 Nick Piggin
From: Nick Piggin @ 2008-08-18  5:38 UTC (permalink / raw)
  To: Andrew Morton, Hugh Dickins
  Cc: Linux Kernel Mailing List, Linux Memory Management List,
	Peter Zijlstra, Linus Torvalds


There is a race with dirty page accounting where a page may not be properly
accounted for.

clear_page_dirty_for_io() calls page_mkclean; then TestClearPageDirty.

page_mkclean walks the rmaps for that page, and for each one it cleans and
write-protects the pte if it was dirty. It uses page_check_address to find the
pte. That function has a shortcut to avoid the ptl if the pte is not
present. Unfortunately, the pte can be switched to not-present and then back to
present by other code that holds the page table lock -- this must not cause
page_mkclean to ignore that pte, because it may be dirty.

For example, powerpc64's set_pte_at will clear a previously present pte before
setting it to the desired value. There may also be other code, in core mm or
in the architectures, that does similar things.
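
As an illustration, here is a hypothetical interleaving (a sketch, not the
actual powerpc64 code) in which the unlocked shortcut misses a dirty pte:

/*
 * CPU 0: page_mkclean via              CPU 1: set_pte_at-style update,
 *        page_check_address                   holding the page table lock
 *
 * pte = pte_offset_map(pmd, addr);
 *                                      pte_clear(mm, addr, ptep);
 * if (!pte_present(*pte))                <- pte transiently not present
 *         return NULL;                   <- bails out without taking ptl
 *                                      set_pte_at(mm, addr, ptep, pteval);
 *                                        <- dirty pte is visible again
 *
 * CPU 0 never cleans or write-protects the pte, so the page can be
 * reported clean while a dirty, writable mapping still exists.
 */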

The consequence of the bug is data loss via msync (a dirty pte can be missed,
so its data is never written back) and loss of dirty page accounting accuracy.
XIP's __xip_unmap could also easily be unreliable (depending on the exact XIP
locking scheme), which can lead to data corruption.

Fix this by adding an option to page_check_address to always take the ptl
before checking the pte.

It's possible to retain this optimization for page_referenced and
try_to_unmap.

Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/include/linux/rmap.h
===================================================================
--- linux-2.6.orig/include/linux/rmap.h
+++ linux-2.6/include/linux/rmap.h
@@ -102,7 +102,7 @@ int try_to_unmap(struct page *, int igno
  * Called from mm/filemap_xip.c to unmap empty zero page
  */
 pte_t *page_check_address(struct page *, struct mm_struct *,
-				unsigned long, spinlock_t **);
+				unsigned long, spinlock_t **, int);
 
 /*
  * Used by swapoff to help locate where page is expected in vma.
Index: linux-2.6/mm/filemap_xip.c
===================================================================
--- linux-2.6.orig/mm/filemap_xip.c
+++ linux-2.6/mm/filemap_xip.c
@@ -185,7 +185,7 @@ __xip_unmap (struct address_space * mapp
 		address = vma->vm_start +
 			((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
 		BUG_ON(address < vma->vm_start || address >= vma->vm_end);
-		pte = page_check_address(page, mm, address, &ptl);
+		pte = page_check_address(page, mm, address, &ptl, 1);
 		if (pte) {
 			/* Nuke the page table entry. */
 			flush_cache_page(vma, address, pte_pfn(*pte));
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c
+++ linux-2.6/mm/rmap.c
@@ -224,10 +224,14 @@ unsigned long page_address_in_vma(struct
 /*
  * Check that @page is mapped at @address into @mm.
  *
+ * If @synch is false, page_check_address may perform a racy check to avoid
+ * the page table lock when the pte is not present (helpful when reclaiming
+ * highly shared pages).
+ *
  * On success returns with pte mapped and locked.
  */
 pte_t *page_check_address(struct page *page, struct mm_struct *mm,
-			  unsigned long address, spinlock_t **ptlp)
+			  unsigned long address, spinlock_t **ptlp, int synch)
 {
 	pgd_t *pgd;
 	pud_t *pud;
@@ -249,7 +253,7 @@ pte_t *page_check_address(struct page *p
 
 	pte = pte_offset_map(pmd, address);
 	/* Make a quick check before getting the lock */
-	if (!pte_present(*pte)) {
+	if (!synch && !pte_present(*pte)) {
 		pte_unmap(pte);
 		return NULL;
 	}
@@ -281,7 +285,7 @@ static int page_referenced_one(struct pa
 	if (address == -EFAULT)
 		goto out;
 
-	pte = page_check_address(page, mm, address, &ptl);
+	pte = page_check_address(page, mm, address, &ptl, 0);
 	if (!pte)
 		goto out;
 
@@ -450,7 +454,7 @@ static int page_mkclean_one(struct page 
 	if (address == -EFAULT)
 		goto out;
 
-	pte = page_check_address(page, mm, address, &ptl);
+	pte = page_check_address(page, mm, address, &ptl, 1);
 	if (!pte)
 		goto out;
 
@@ -697,7 +701,7 @@ static int try_to_unmap_one(struct page 
 	if (address == -EFAULT)
 		goto out;
 
-	pte = page_check_address(page, mm, address, &ptl);
+	pte = page_check_address(page, mm, address, &ptl, 0);
 	if (!pte)
 		goto out;
 


* [patch] mm: xip fix fault vs sparse page invalidate race
@ 2008-08-18  5:44 ` Nick Piggin
From: Nick Piggin @ 2008-08-18  5:44 UTC (permalink / raw)
  To: Andrew Morton, cotte, borntraeger, Jared Hulbert
  Cc: Linux Kernel Mailing List, Linux Memory Management List,
	Linus Torvalds, Hugh Dickins


XIP has a race between sparse pages being inserted into page tables, and
sparse pages being zapped when it's time to put a non-sparse page in.

What can happen is that a process can be left with a dangling sparse page
in a MAP_SHARED mapping, while the rest of the world sees the non-sparse
version. I.e., data corruption.

Guard these operations with a seqlock, making fault-in-sparse-pages
the slowpath, and try-to-unmap-sparse-pages the fastpath.
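
For illustration, a minimal userspace sketch of that pattern (assuming C11
atomics and pthreads; this is not the kernel's seqcount_t API, which also
handles memory ordering details and lockdep annotations elided here):

#include <pthread.h>
#include <stdatomic.h>

static pthread_mutex_t sparse_mutex = PTHREAD_MUTEX_INITIALIZER;
static atomic_uint sparse_seq;  /* even: no writer, odd: write in progress */

static unsigned read_begin(void)
{
	unsigned s;
	while ((s = atomic_load(&sparse_seq)) & 1)
		;  /* writer active, wait for it to finish */
	return s;
}

static int read_retry(unsigned s)
{
	return atomic_load(&sparse_seq) != s;
}

/* fastpath, like __xip_unmap: run unlocked, take the lock only if raced */
static void unmap_sparse(void)
{
	int locked = 0;
	unsigned seq = read_begin();
retry:
	/* ... walk the vmas and zap sparse ptes ... */
	if (locked) {
		pthread_mutex_unlock(&sparse_mutex);
	} else if (read_retry(seq)) {
		pthread_mutex_lock(&sparse_mutex);  /* excludes faulters */
		locked = 1;
		goto retry;  /* redo the zap, now race-free */
	}
}

/* slowpath, like the fault handler: insert sparse pages under the lock */
static void fault_in_sparse(void)
{
	pthread_mutex_lock(&sparse_mutex);
	atomic_fetch_add(&sparse_seq, 1);  /* begin write (seq now odd) */
	/* ... map the sparse page into the faulting page table ... */
	atomic_fetch_add(&sparse_seq, 1);  /* end write (seq even again) */
	pthread_mutex_unlock(&sparse_mutex);
}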

Signed-off-by: Nick Piggin <npiggin@suse.de>
---

Index: linux-2.6/mm/filemap_xip.c
===================================================================
--- linux-2.6.orig/mm/filemap_xip.c	2008-08-18 13:43:53.000000000 +1000
+++ linux-2.6/mm/filemap_xip.c	2008-08-18 15:02:29.000000000 +1000
@@ -15,6 +15,8 @@
 #include <linux/rmap.h>
 #include <linux/mmu_notifier.h>
 #include <linux/sched.h>
+#include <linux/seqlock.h>
+#include <linux/mutex.h>
 #include <asm/tlbflush.h>
 #include <asm/io.h>
 
@@ -22,22 +24,18 @@
  * We do use our own empty page to avoid interference with other users
  * of ZERO_PAGE(), such as /dev/zero
  */
+static DEFINE_MUTEX(xip_sparse_mutex);
+static seqcount_t xip_sparse_seq = SEQCNT_ZERO;
 static struct page *__xip_sparse_page;
 
+/* called under xip_sparse_mutex */
 static struct page *xip_sparse_page(void)
 {
 	if (!__xip_sparse_page) {
 		struct page *page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
 
-		if (page) {
-			static DEFINE_SPINLOCK(xip_alloc_lock);
-			spin_lock(&xip_alloc_lock);
-			if (!__xip_sparse_page)
-				__xip_sparse_page = page;
-			else
-				__free_page(page);
-			spin_unlock(&xip_alloc_lock);
-		}
+		if (page)
+			__xip_sparse_page = page;
 	}
 	return __xip_sparse_page;
 }
@@ -174,11 +172,16 @@ __xip_unmap (struct address_space * mapp
 	pte_t pteval;
 	spinlock_t *ptl;
 	struct page *page;
+	unsigned count;
+	int locked = 0;
+
+	count = read_seqcount_begin(&xip_sparse_seq);
 
 	page = __xip_sparse_page;
 	if (!page)
 		return;
 
+retry:
 	spin_lock(&mapping->i_mmap_lock);
 	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
 		mm = vma->vm_mm;
@@ -198,6 +201,14 @@ __xip_unmap (struct address_space * mapp
 		}
 	}
 	spin_unlock(&mapping->i_mmap_lock);
+
+	if (locked) {
+		mutex_unlock(&xip_sparse_mutex);
+	} else if (read_seqcount_retry(&xip_sparse_seq, count)) {
+		mutex_lock(&xip_sparse_mutex);
+		locked = 1;
+		goto retry;
+	}
 }
 
 /*
@@ -218,7 +229,7 @@ static int xip_file_fault(struct vm_area
 	int error;
 
 	/* XXX: are VM_FAULT_ codes OK? */
-
+again:
 	size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
 	if (vmf->pgoff >= size)
 		return VM_FAULT_SIGBUS;
@@ -245,6 +256,7 @@ static int xip_file_fault(struct vm_area
 		__xip_unmap(mapping, vmf->pgoff);
 
 found:
+		printk("%s insert %lx@%lx\n", current->comm, (unsigned long)vmf->virtual_address, xip_pfn);
 		err = vm_insert_mixed(vma, (unsigned long)vmf->virtual_address,
 							xip_pfn);
 		if (err == -ENOMEM)
@@ -252,14 +264,34 @@ found:
 		BUG_ON(err);
 		return VM_FAULT_NOPAGE;
 	} else {
+		int err, ret = VM_FAULT_OOM;
+
+		mutex_lock(&xip_sparse_mutex);
+		write_seqcount_begin(&xip_sparse_seq);
+		error = mapping->a_ops->get_xip_mem(mapping, vmf->pgoff, 0,
+							&xip_mem, &xip_pfn);
+		if (unlikely(!error)) {
+			write_seqcount_end(&xip_sparse_seq);
+			mutex_unlock(&xip_sparse_mutex);
+			goto again;
+		}
+		if (error != -ENODATA)
+			goto out;
 		/* not shared and writable, use xip_sparse_page() */
 		page = xip_sparse_page();
 		if (!page)
-			return VM_FAULT_OOM;
+			goto out;
+		err = vm_insert_page(vma, (unsigned long)vmf->virtual_address,
+							page);
+		if (err == -ENOMEM)
+			goto out;
 
-		page_cache_get(page);
-		vmf->page = page;
-		return 0;
+		ret = VM_FAULT_NOPAGE;
+out:
+		write_seqcount_end(&xip_sparse_seq);
+		mutex_unlock(&xip_sparse_mutex);
+
+		return ret;
 	}
 }
 


* [patch] mm: xip/ext2 fix block allocation race
@ 2008-08-18  6:03   ` Nick Piggin
From: Nick Piggin @ 2008-08-18  6:03 UTC (permalink / raw)
  To: Andrew Morton, cotte, borntraeger, Jared Hulbert
  Cc: Linux Kernel Mailing List, Linux Memory Management List,
	Linus Torvalds, Hugh Dickins


XIP can call into get_xip_mem concurrently for the same (file, offset) with
create=1.  This usually maps down to get_block, which expects the page lock
to prevent such a situation. This causes ext2 to explode for one reason or
another.

Serialise those calls for the moment. For common usage today, I suspect
get_xip_mem is rarely called to create new blocks. In future, as XIP
technologies evolve, we might need to look at which operations require
scalability, and rework the locking to suit.
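
For illustration, the interleaving this closes (a hypothetical trace; the
ext2_get_block path is assumed for concreteness):

/*
 * faulter A                            faulter B
 * get_xip_mem(mapping, pgoff, 1, ...)  get_xip_mem(mapping, pgoff, 1, ...)
 *   -> ext2_get_block(create=1)          -> ext2_get_block(create=1)
 *
 * Both callers see the block as unallocated and race to allocate it.
 * get_block assumes the page lock prevents this, but XIP has no locked
 * page here; taking xip_sparse_mutex around create=1 calls restores
 * the exclusion.
 */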

Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/mm/filemap_xip.c
===================================================================
--- linux-2.6.orig/mm/filemap_xip.c	2008-08-18 14:47:38.000000000 +1000
+++ linux-2.6/mm/filemap_xip.c	2008-08-18 14:53:11.000000000 +1000
@@ -248,15 +248,16 @@ again:
 		int err;
 
 		/* maybe shared writable, allocate new block */
+		mutex_lock(&xip_sparse_mutex);
 		error = mapping->a_ops->get_xip_mem(mapping, vmf->pgoff, 1,
 							&xip_mem, &xip_pfn);
+		mutex_unlock(&xip_sparse_mutex);
 		if (error)
 			return VM_FAULT_SIGBUS;
 		/* unmap sparse mappings at pgoff from all other vmas */
 		__xip_unmap(mapping, vmf->pgoff);
 
 found:
-		printk("%s insert %lx@%lx\n", current->comm, (unsigned long)vmf->virtual_address, xip_pfn);
 		err = vm_insert_mixed(vma, (unsigned long)vmf->virtual_address,
 							xip_pfn);
 		if (err == -ENOMEM)
@@ -340,8 +341,10 @@ __xip_file_write(struct file *filp, cons
 						&xip_mem, &xip_pfn);
 		if (status == -ENODATA) {
 			/* we allocate a new page unmap it */
+			mutex_lock(&xip_sparse_mutex);
 			status = a_ops->get_xip_mem(mapping, index, 1,
 							&xip_mem, &xip_pfn);
+			mutex_unlock(&xip_sparse_mutex);
 			if (!status)
 				/* unmap page at pgoff from all other vmas */
 				__xip_unmap(mapping, index);


* Re: [patch] mm: dirty page tracking race fix
@ 2008-08-18  7:49 ` Peter Zijlstra
From: Peter Zijlstra @ 2008-08-18  7:49 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Hugh Dickins, Linux Kernel Mailing List,
	Linux Memory Management List, Linus Torvalds

On Mon, 2008-08-18 at 07:38 +0200, Nick Piggin wrote:
> [patch changelog snipped]
> 
> It's possible to retain this optimization for page_referenced and
> try_to_unmap.

s/synch/sync/ ?

we use sync all over the kernel to mean synchronous, so why are you
inventing a new shorthand?

Anyway, still

Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>




* Re: [patch] mm: dirty page tracking race fix
@ 2008-08-18  8:03   ` Nick Piggin
From: Nick Piggin @ 2008-08-18  8:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nick Piggin, Andrew Morton, Hugh Dickins,
	Linux Kernel Mailing List, Linux Memory Management List,
	Linus Torvalds

On Monday 18 August 2008 17:49, Peter Zijlstra wrote:
> On Mon, 2008-08-18 at 07:38 +0200, Nick Piggin wrote:

> > It's possible to retain this optimization for page_referenced and
> > try_to_unmap.
>
> s/synch/sync/ ?
>
> we use sync all over the kernel to mean synchronous, so why are you
> inventing a new shorthand?

Mmm, we also use synch all over the kernel to mean synchronous,
including in mm/, so I'm not inventing a new shorthand. sync I
see is more common, but it's not something anybody would get
confused about is it?

>
> Anyway, still
>
> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

Thanks.




* Re: [patch] mm: dirty page tracking race fix
@ 2008-08-18  8:07     ` Peter Zijlstra
From: Peter Zijlstra @ 2008-08-18  8:07 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Nick Piggin, Andrew Morton, Hugh Dickins,
	Linux Kernel Mailing List, Linux Memory Management List,
	Linus Torvalds

On Mon, 2008-08-18 at 18:03 +1000, Nick Piggin wrote:
> On Monday 18 August 2008 17:49, Peter Zijlstra wrote:
> > On Mon, 2008-08-18 at 07:38 +0200, Nick Piggin wrote:
> 
> > > It's possible to retain this optimization for page_referenced and
> > > try_to_unmap.
> >
> > s/synch/sync/ ?
> >
> > we use sync all over the kernel to mean synchronous, so why are you
> > inventing a new shorthand?
> 
> Mmm, we also use synch all over the kernel to mean synchronous,
> including in mm/, so I'm not inventing a new shorthand. sync I
> see is more common, but it's not something anybody would get
> confused about is it?

I hadn't noticed before, and my grep skillz seem to have left me in the
cold

git grep "\<synch\>" mm/* | wc -l
0

And you're right, it's not something one can get confused about. So let's
just keep it unless someone else objects ;-)



* Re: [patch] mm: dirty page tracking race fix
@ 2008-08-18  8:12       ` Nick Piggin
From: Nick Piggin @ 2008-08-18  8:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nick Piggin, Andrew Morton, Hugh Dickins,
	Linux Kernel Mailing List, Linux Memory Management List,
	Linus Torvalds

On Monday 18 August 2008 18:07, Peter Zijlstra wrote:
> On Mon, 2008-08-18 at 18:03 +1000, Nick Piggin wrote:
> > On Monday 18 August 2008 17:49, Peter Zijlstra wrote:
> > > On Mon, 2008-08-18 at 07:38 +0200, Nick Piggin wrote:
> > > > It's possible to retain this optimization for page_referenced and
> > > > try_to_unmap.
> > >
> > > s/synch/sync/ ?
> > >
> > > we use sync all over the kernel to mean synchronous, so why are you
> > > inventing a new shorthand?
> >
> > Mmm, we also use synch all over the kernel to mean synchronous,
> > including in mm/, so I'm not inventing a new shorthand. sync I
> > see is more common, but it's not something anybody would get
> > confused about is it?
>
> I hadn't noticed before, and my grep skillz seem to have left me in the
> cold
>
> git grep "\<synch\>" mm/* | wc -l
> 0

"asynch" ;) 1 hit! So *technically* I'm wrong...


> And you're right, it's not something one can get confused about. So let's
> just keep it unless someone else objects ;-)

Andrew feel free to edit the patch if/when you pick it up :P


* Re: [patch] mm: xip fix fault vs sparse page invalidate race
@ 2008-08-18 10:50   ` Carsten Otte
From: Carsten Otte @ 2008-08-18 10:50 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, borntrae, Jared Hulbert,
	Linux Kernel Mailing List, Linux Memory Management List,
	Linus Torvalds, Hugh Dickins

Nick Piggin wrote:
> XIP has a race between sparse pages being inserted into page tables, and
> sparse pages being zapped when it's time to put a non-sparse page in.
> 
> What can happen is that a process can be left with a dangling sparse page
> in a MAP_SHARED mapping, while the rest of the world sees the non-sparse
> version. I.e., data corruption.
> 
> Guard these operations with a seqlock, making fault-in-sparse-pages
> the slowpath, and try-to-unmap-sparse-pages the fastpath.
> 
> Signed-off-by: Nick Piggin <npiggin@suse.de>
Ouch.
Acked-by: Carsten Otte <cotte@de.ibm.com>


* Re: [patch] mm: xip/ext2 fix block allocation race
@ 2008-08-18 10:51     ` Carsten Otte
From: Carsten Otte @ 2008-08-18 10:51 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, borntrae, Jared Hulbert,
	Linux Kernel Mailing List, Linux Memory Management List,
	Linus Torvalds, Hugh Dickins

Nick Piggin wrote:
> XIP can call into get_xip_mem concurrently for the same (file, offset) with
> create=1.  This usually maps down to get_block, which expects the page lock
> to prevent such a situation. This causes ext2 to explode for one reason or
> another.
> 
> Serialise those calls for the moment. For common usage today, I suspect
> get_xip_mem is rarely called to create new blocks. In future, as XIP
> technologies evolve, we might need to look at which operations require
> scalability, and rework the locking to suit.
> 
> Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Carsten Otte <cotte@de.ibm.com>


* Re: [patch] mm: dirty page tracking race fix
@ 2008-08-19  9:11 ` Andrew Morton
From: Andrew Morton @ 2008-08-19  9:11 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Hugh Dickins, Linux Kernel Mailing List,
	Linux Memory Management List, Peter Zijlstra, Linus Torvalds

On Mon, 18 Aug 2008 07:38:21 +0200 Nick Piggin <npiggin@suse.de> wrote:

> [patch changelog snipped]
> 
> It's possible to retain this optimization for page_referenced and
> try_to_unmap.

Is it also possible to retain it for

/**
 * page_mapped_in_vma - check whether a page is really mapped in a VMA
 * @page: the page to test
 * @vma: the VMA to test
 *
 * Returns 1 if the page is mapped into the page tables of the VMA, 0
 * if the page is not mapped into the page tables of this VMA.  Only
 * valid for normal file or anonymous VMAs.
 */
static int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma)
{
	unsigned long address;
	pte_t *pte;
	spinlock_t *ptl;

	address = vma_address(page, vma);
	if (address == -EFAULT)		/* out of vma range */
		return 0;
	pte = page_check_address(page, vma->vm_mm, address, &ptl);
	if (!pte)			/* the page is not in this mm */
		return 0;
	pte_unmap_unlock(pte, ptl);

	return 1;
}

?


* Re: [patch] mm: dirty page tracking race fix
@ 2008-08-19 10:19   ` Nick Piggin
From: Nick Piggin @ 2008-08-19 10:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Linux Kernel Mailing List,
	Linux Memory Management List, Peter Zijlstra, Linus Torvalds

On Tue, Aug 19, 2008 at 02:11:55AM -0700, Andrew Morton wrote:
> On Mon, 18 Aug 2008 07:38:21 +0200 Nick Piggin <npiggin@suse.de> wrote:
> 
> > [patch changelog snipped]
> > 
> > It's possible to retain this optimization for page_referenced and
> > try_to_unmap.
> 
> Is it also possible to retain it for
> 
> [page_mapped_in_vma, quoted in full above, snipped]
> 
> ?

No. While callers of page_mapped_in_vma might be OK with it, the function
comments themselves indicate we can't take the shortcut.

It was possible to retain the optimization for page_referenced and
try_to_unmap because it is OK if those guys fail in rare cases. Page
reclaim will do a final non-racy check of refcounts.

