linux-mm.kvack.org archive mirror
* DAX: __dax_fault race question
@ 2016-02-08 11:23 Dmitry Monakhov
  2016-02-08 13:53 ` [PATCH 1/2] dax: rename dax_radix_entry to dax_radix_entry_insert Dmitry Monakhov
  2016-02-11 18:43 ` DAX: __dax_fault race question Ross Zwisler
  0 siblings, 2 replies; 5+ messages in thread
From: Dmitry Monakhov @ 2016-02-08 11:23 UTC (permalink / raw)
  To: Linux Memory Management; +Cc: willy, ross.zwisler



Hi,

I have been trying to understand the locking rules for DAX and realized that
there is a suspicious case in dax_fault.

In __dax_fault we try to replace a normal page with a DAX entry.
Basically the dax_fault steps look as follows:

1) page = find_get_page(..)
2) lock_page_or_retry(page)
3) get_block
4) delete_from_page_cache(page)
5) unlock_page(page)
6) dax_insert_mapping(inode, &bh, vma, vmf)
...

But what protects us from another task taking a new page fault after (4) but
before (6)?
AFAIU this case is not prohibited.
Let's see what happens when two read/write tasks fault inside a file hole:
task_1(writer)                  task_2(reader)
__dax_fault(write)
  ->lock_page_or_retry
  ->delete_from_page_cache()    __dax_fault(read)
                                ->dax_load_hole
                                  ->find_or_create_page()
                                    ->new page in mapping->radix_tree               
  ->dax_insert_mapping
     ->dax_radix_entry->collision: return -EIO

Before the dax/fsync patch set this race resulted in silent DAX/page duality
(which likely results in data incoherence or data corruption). Luckily, this
race now results in a collision on insertion into the radix tree and returns -EIO.
At first glance the testcase looks very simple, but I cannot reproduce
this in my environment.

IMHO it is reasonable to pass the locked page to dax_insert_mapping and let
dax_radix_entry do an atomic page/DAX-entry replacement similar to
replace_page_cache_page. Am I right?




* [PATCH 1/2] dax: rename dax_radix_entry to dax_radix_entry_insert
  2016-02-08 11:23 DAX: __dax_fault race question Dmitry Monakhov
@ 2016-02-08 13:53 ` Dmitry Monakhov
  2016-02-08 13:53   ` [PATCH 2/2] dax: fix race dax_fault write vs read Dmitry Monakhov
  2016-02-11 17:42   ` [PATCH 1/2] dax: rename dax_radix_entry to dax_radix_entry_insert Ross Zwisler
  2016-02-11 18:43 ` DAX: __dax_fault race question Ross Zwisler
  1 sibling, 2 replies; 5+ messages in thread
From: Dmitry Monakhov @ 2016-02-08 13:53 UTC (permalink / raw)
  To: linux-mm; +Cc: willy, ross.zwisler, Dmitry Monakhov

- dax_radix_entry_insert is a more appropriate name for that function
- Add a lockless helper, __dax_radix_entry_insert; it will be used by the second patch

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
 fs/dax.c | 39 +++++++++++++++++++++++----------------
 1 file changed, 23 insertions(+), 16 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index fc2e314..89bb1f8 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -349,7 +349,7 @@ static int copy_user_bh(struct page *to, struct inode *inode,
 #define NO_SECTOR -1
 #define DAX_PMD_INDEX(page_index) (page_index & (PMD_MASK >> PAGE_CACHE_SHIFT))
 
-static int dax_radix_entry(struct address_space *mapping, pgoff_t index,
+static int __dax_radix_entry_insert(struct address_space *mapping, pgoff_t index,
 		sector_t sector, bool pmd_entry, bool dirty)
 {
 	struct radix_tree_root *page_tree = &mapping->page_tree;
@@ -358,10 +358,6 @@ static int dax_radix_entry(struct address_space *mapping, pgoff_t index,
 	void *entry;
 
 	WARN_ON_ONCE(pmd_entry && !dirty);
-	if (dirty)
-		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
-
-	spin_lock_irq(&mapping->tree_lock);
 
 	entry = radix_tree_lookup(page_tree, pmd_index);
 	if (entry && RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD) {
@@ -374,8 +370,7 @@ static int dax_radix_entry(struct address_space *mapping, pgoff_t index,
 		type = RADIX_DAX_TYPE(entry);
 		if (WARN_ON_ONCE(type != RADIX_DAX_PTE &&
 					type != RADIX_DAX_PMD)) {
-			error = -EIO;
-			goto unlock;
+		        return -EIO;
 		}
 
 		if (!pmd_entry || type == RADIX_DAX_PMD)
@@ -402,19 +397,31 @@ static int dax_radix_entry(struct address_space *mapping, pgoff_t index,
 		 * pte_same() check will fail, eventually causing page fault
 		 * to be retried by the CPU.
 		 */
-		goto unlock;
+		return 0;
 	}
 
 	error = radix_tree_insert(page_tree, index,
 			RADIX_DAX_ENTRY(sector, pmd_entry));
 	if (error)
-		goto unlock;
+		return error;
 
 	mapping->nrexceptional++;
- dirty:
+dirty:
 	if (dirty)
 		radix_tree_tag_set(page_tree, index, PAGECACHE_TAG_DIRTY);
- unlock:
+	return error;
+}
+
+static int dax_radix_entry_insert(struct address_space *mapping, pgoff_t index,
+		sector_t sector, bool pmd_entry, bool dirty)
+{
+	int error;
+
+	if (dirty)
+		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
+
+	spin_lock_irq(&mapping->tree_lock);
+	error =__dax_radix_entry_insert(mapping, index, sector, pmd_entry, dirty);
 	spin_unlock_irq(&mapping->tree_lock);
 	return error;
 }
@@ -579,8 +586,8 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 	}
 	dax_unmap_atomic(bdev, &dax);
 
-	error = dax_radix_entry(mapping, vmf->pgoff, dax.sector, false,
-			vmf->flags & FAULT_FLAG_WRITE);
+	error = dax_radix_entry_insert(mapping, vmf->pgoff, dax.sector, false,
+				vmf->flags & FAULT_FLAG_WRITE, vmf->page);
 	if (error)
 		goto out;
 
@@ -984,7 +991,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		 * the write to insert a dirty entry.
 		 */
 		if (write) {
-			error = dax_radix_entry(mapping, pgoff, dax.sector,
+			error = dax_radix_entry_insert(mapping, pgoff, dax.sector,
 					true, true);
 			if (error) {
 				dax_pmd_dbg(&bh, address,
@@ -1057,14 +1064,14 @@ int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 	struct file *file = vma->vm_file;
 
 	/*
-	 * We pass NO_SECTOR to dax_radix_entry() because we expect that a
+	 * We pass NO_SECTOR to dax_radix_entry_insert() because we expect that a
 	 * RADIX_DAX_PTE entry already exists in the radix tree from a
 	 * previous call to __dax_fault().  We just want to look up that PTE
 	 * entry using vmf->pgoff and make sure the dirty tag is set.  This
 	 * saves us from having to make a call to get_block() here to look
 	 * up the sector.
 	 */
-	dax_radix_entry(file->f_mapping, vmf->pgoff, NO_SECTOR, false, true);
+	dax_radix_entry_insert(file->f_mapping, vmf->pgoff, NO_SECTOR, false, true);
 	return VM_FAULT_NOPAGE;
 }
 EXPORT_SYMBOL_GPL(dax_pfn_mkwrite);
-- 
1.8.3.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: dont@kvack.org


* [PATCH 2/2] dax: fix race dax_fault write vs read
  2016-02-08 13:53 ` [PATCH 1/2] dax: rename dax_radix_entry to dax_radix_entry_insert Dmitry Monakhov
@ 2016-02-08 13:53   ` Dmitry Monakhov
  2016-02-11 17:42   ` [PATCH 1/2] dax: rename dax_radix_entry to dax_radix_entry_insert Ross Zwisler
  1 sibling, 0 replies; 5+ messages in thread
From: Dmitry Monakhov @ 2016-02-08 13:53 UTC (permalink / raw)
  To: linux-mm; +Cc: willy, ross.zwisler, Dmitry Monakhov

Two read/write tasks fault inside a file hole:
task_1(writer)                  task_2(reader)
__dax_fault(write)
  ->lock_page_or_retry
  ->delete_from_page_cache()    __dax_fault(read)
                                ->dax_load_hole
                                  ->find_or_create_page()
                                    ->new page in mapping->radix_tree
  ->dax_insert_mapping
     ->dax_radix_entry => collision

Let's move the radix_tree update into dax_radix_entry_replace(), where
page deletion and DAX entry insertion will be protected by ->tree_lock.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
 fs/dax.c | 37 +++++++++++++++++++++++++++++++------
 1 file changed, 31 insertions(+), 6 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 89bb1f8..0294fc9 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -424,6 +424,31 @@ static int dax_radix_entry_insert(struct address_space *mapping, pgoff_t index,
 	error =__dax_radix_entry_insert(mapping, index, sector, pmd_entry, dirty);
 	spin_unlock_irq(&mapping->tree_lock);
 	return error;
+
+}
+
+static int dax_radix_entry_replace(struct address_space *mapping, pgoff_t index,
+				   sector_t sector, bool pmd_entry, bool dirty,
+				   struct page* old_page)
+{
+	int error;
+
+	BUG_ON(old_page && !PageLocked(old_page));
+	if (dirty)
+		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
+
+	spin_lock_irq(&mapping->tree_lock);
+	if (old_page)
+		__delete_from_page_cache(old_page, NULL);
+	error =__dax_radix_entry_insert(mapping, index, sector, pmd_entry, dirty);
+	spin_unlock_irq(&mapping->tree_lock);
+	if (old_page) {
+		if (mapping->a_ops->freepage)
+			mapping->a_ops->freepage(old_page);
+		page_cache_release(old_page);
+	}
+	return error;
+
 }
 
 static int dax_writeback_one(struct block_device *bdev,
@@ -586,7 +611,7 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 	}
 	dax_unmap_atomic(bdev, &dax);
 
-	error = dax_radix_entry_insert(mapping, vmf->pgoff, dax.sector, false,
+	error = dax_radix_entry_replace(mapping, vmf->pgoff, dax.sector, false,
 				vmf->flags & FAULT_FLAG_WRITE, vmf->page);
 	if (error)
 		goto out;
@@ -711,14 +736,16 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		page = find_lock_page(mapping, vmf->pgoff);
 
 	if (page) {
+		vmf->page = page;
 		unmap_mapping_range(mapping, vmf->pgoff << PAGE_SHIFT,
 							PAGE_CACHE_SIZE, 0);
-		delete_from_page_cache(page);
+	}
+	error = dax_insert_mapping(inode, &bh, vma, vmf);
+	if (page) {
 		unlock_page(page);
 		page_cache_release(page);
-		page = NULL;
+		vmf->page = page = NULL;
 	}
-
 	/*
 	 * If we successfully insert the new mapping over an unwritten extent,
 	 * we need to ensure we convert the unwritten extent. If there is an
@@ -729,14 +756,12 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	 * indicate what the callback should do via the uptodate variable, same
 	 * as for normal BH based IO completions.
 	 */
-	error = dax_insert_mapping(inode, &bh, vma, vmf);
 	if (buffer_unwritten(&bh)) {
 		if (complete_unwritten)
 			complete_unwritten(&bh, !error);
 		else
 			WARN_ON_ONCE(!(vmf->flags & FAULT_FLAG_WRITE));
 	}
-
  out:
 	if (error == -ENOMEM)
 		return VM_FAULT_OOM | major;
-- 
1.8.3.1



* Re: [PATCH 1/2] dax: rename dax_radix_entry to dax_radix_entry_insert
  2016-02-08 13:53 ` [PATCH 1/2] dax: rename dax_radix_entry to dax_radix_entry_insert Dmitry Monakhov
  2016-02-08 13:53   ` [PATCH 2/2] dax: fix race dax_fault write vs read Dmitry Monakhov
@ 2016-02-11 17:42   ` Ross Zwisler
  1 sibling, 0 replies; 5+ messages in thread
From: Ross Zwisler @ 2016-02-11 17:42 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: linux-mm, willy, ross.zwisler

On Mon, Feb 08, 2016 at 05:53:17PM +0400, Dmitry Monakhov wrote:
> - dax_radix_entry_insert is more appropriate name for that function

I think I may have actually had it named that at some point. :)  I changed it
because it doesn't always insert an entry - in the read case, for example, we
insert a clean entry, and then on the following dax_pfn_mkwrite() we call back
in and mark it as dirty.

> - Add lockless helper __dax_radix_entry_insert, it will be used by second patch
> 
> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
> ---
>  fs/dax.c | 39 +++++++++++++++++++++++----------------
>  1 file changed, 23 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index fc2e314..89bb1f8 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
<>
> @@ -579,8 +586,8 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
>  	}
>  	dax_unmap_atomic(bdev, &dax);
>  
> -	error = dax_radix_entry(mapping, vmf->pgoff, dax.sector, false,
> -			vmf->flags & FAULT_FLAG_WRITE);
> +	error = dax_radix_entry_insert(mapping, vmf->pgoff, dax.sector, false,
> +				vmf->flags & FAULT_FLAG_WRITE, vmf->page);

fs/dax.c: In function ‘dax_insert_mapping’:
fs/dax.c:589:10: error: too many arguments to function ‘dax_radix_entry_insert’
  error = dax_radix_entry_insert(mapping, vmf->pgoff, dax.sector, false,
          ^
fs/dax.c:415:12: note: declared here
 static int dax_radix_entry_insert(struct address_space *mapping, pgoff_t index,
            ^
scripts/Makefile.build:258: recipe for target 'fs/dax.o' failed



* Re: DAX: __dax_fault race question
  2016-02-08 11:23 DAX: __dax_fault race question Dmitry Monakhov
  2016-02-08 13:53 ` [PATCH 1/2] dax: rename dax_radix_entry to dax_radix_entry_insert Dmitry Monakhov
@ 2016-02-11 18:43 ` Ross Zwisler
  1 sibling, 0 replies; 5+ messages in thread
From: Ross Zwisler @ 2016-02-11 18:43 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: Linux Memory Management, willy, ross.zwisler

On Mon, Feb 08, 2016 at 02:23:49PM +0300, Dmitry Monakhov wrote:
> 
> Hi,
> 
> I have been trying to understand the locking rules for DAX and realized that
> there is a suspicious case in dax_fault.
> 
> In __dax_fault we try to replace a normal page with a DAX entry.
> Basically the dax_fault steps look as follows:
> 
> 1) page = find_get_page(..)
> 2) lock_page_or_retry(page)
> 3) get_block
> 4) delete_from_page_cache(page)
> 5) unlock_page(page)
> 6) dax_insert_mapping(inode, &bh, vma, vmf)
> ...
> 
> But what protects us from another task taking a new page fault after (4) but
> before (6)?
> AFAIU this case is not prohibited.
> Let's see what happens when two read/write tasks fault inside a file hole:
> task_1(writer)                  task_2(reader)
> __dax_fault(write)
>   ->lock_page_or_retry
>   ->delete_from_page_cache()    __dax_fault(read)
>                                 ->dax_load_hole
>                                   ->find_or_create_page()
>                                     ->new page in mapping->radix_tree               
>   ->dax_insert_mapping
>      ->dax_radix_entry->collision: return -EIO
> 
> Before the dax/fsync patch set this race resulted in silent DAX/page duality
> (which likely results in data incoherence or data corruption). Luckily, this
> race now results in a collision on insertion into the radix tree and returns -EIO.
> At first glance the testcase looks very simple, but I cannot reproduce
> this in my environment.
> 
> IMHO it is reasonable to pass the locked page to dax_insert_mapping and let
> dax_radix_entry do an atomic page/DAX-entry replacement similar to
> replace_page_cache_page. Am I right?

We are trying to come up with a general locking scheme that will solve this
race as well as others.

https://lkml.org/lkml/2016/2/9/607


