* DAX: __dax_fault race question
@ 2016-02-08 11:23 Dmitry Monakhov
  2016-02-08 13:53 ` [PATCH 1/2] dax: rename dax_radix_entry to dax_radix_entry_insert Dmitry Monakhov
  2016-02-11 18:43 ` DAX: __dax_fault race question Ross Zwisler
  0 siblings, 2 replies; 5+ messages in thread
From: Dmitry Monakhov @ 2016-02-08 11:23 UTC (permalink / raw)
  To: Linux Memory Management; +Cc: willy, ross.zwisler



Hi,

I'm trying to understand the locking rules for DAX, and I realized that there
is a suspicious case in dax_fault.

In __dax_fault we try to replace a normal page-cache page with a DAX entry.
Basically, the dax_fault steps look as follows:

1) page = find_get_page(..)
2) lock_page_or_retry(page)
3) get_block
4) delete_from_page_cache(page)
5) unlock_page(page)
6) dax_insert_mapping(inode, &bh, vma, vmf)
...

But what protects us from another task taking a new page fault after (4) but
before (6)? AFAIU this case is not prohibited.
Let's see what happens when two tasks, a reader and a writer, fault inside a file hole:
task_1(writer)                  task_2(reader)
__dax_fault(write)
  ->lock_page_or_retry
  ->delete_from_page_cache()    __dax_fault(read)
                                ->dax_load_hole
                                  ->find_or_create_page()
                                    ->new page in mapping->radix_tree               
  ->dax_insert_mapping
     ->dax_radix_entry->collision: return -EIO

Before the dax/fsync patch set this race resulted in silent dax/page duality
(which likely leads to data incoherence or data corruption). Luckily, now the
race results in a collision on insertion into the radix_tree and returns -EIO.
At first glance the testcase looks very simple, but I cannot reproduce
it in my environment.

IMHO it is reasonable to pass the locked page to dax_insert_mapping and let
dax_radix_entry do an atomic page/dax-entry replacement, similar to
replace_page_cache_page. Am I right?



^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH 1/2] dax: rename dax_radix_entry to dax_radix_entry_insert
  2016-02-08 11:23 DAX: __dax_fault race question Dmitry Monakhov
@ 2016-02-08 13:53 ` Dmitry Monakhov
  2016-02-08 13:53   ` [PATCH 2/2] dax: fix race dax_fault write vs read Dmitry Monakhov
  2016-02-11 17:42   ` [PATCH 1/2] dax: rename dax_radix_entry to dax_radix_entry_insert Ross Zwisler
  2016-02-11 18:43 ` DAX: __dax_fault race question Ross Zwisler
  1 sibling, 2 replies; 5+ messages in thread
From: Dmitry Monakhov @ 2016-02-08 13:53 UTC (permalink / raw)
  To: linux-mm; +Cc: willy, ross.zwisler, Dmitry Monakhov

- dax_radix_entry_insert is a more appropriate name for this function
- Add a lockless helper, __dax_radix_entry_insert; it will be used by the second patch

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
 fs/dax.c | 39 +++++++++++++++++++++++----------------
 1 file changed, 23 insertions(+), 16 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index fc2e314..89bb1f8 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -349,7 +349,7 @@ static int copy_user_bh(struct page *to, struct inode *inode,
 #define NO_SECTOR -1
 #define DAX_PMD_INDEX(page_index) (page_index & (PMD_MASK >> PAGE_CACHE_SHIFT))
 
-static int dax_radix_entry(struct address_space *mapping, pgoff_t index,
+static int __dax_radix_entry_insert(struct address_space *mapping, pgoff_t index,
 		sector_t sector, bool pmd_entry, bool dirty)
 {
 	struct radix_tree_root *page_tree = &mapping->page_tree;
@@ -358,10 +358,6 @@ static int dax_radix_entry(struct address_space *mapping, pgoff_t index,
 	void *entry;
 
 	WARN_ON_ONCE(pmd_entry && !dirty);
-	if (dirty)
-		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
-
-	spin_lock_irq(&mapping->tree_lock);
 
 	entry = radix_tree_lookup(page_tree, pmd_index);
 	if (entry && RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD) {
@@ -374,8 +370,7 @@ static int dax_radix_entry(struct address_space *mapping, pgoff_t index,
 		type = RADIX_DAX_TYPE(entry);
 		if (WARN_ON_ONCE(type != RADIX_DAX_PTE &&
 					type != RADIX_DAX_PMD)) {
-			error = -EIO;
-			goto unlock;
+		        return -EIO;
 		}
 
 		if (!pmd_entry || type == RADIX_DAX_PMD)
@@ -402,19 +397,31 @@ static int dax_radix_entry(struct address_space *mapping, pgoff_t index,
 		 * pte_same() check will fail, eventually causing page fault
 		 * to be retried by the CPU.
 		 */
-		goto unlock;
+		return 0;
 	}
 
 	error = radix_tree_insert(page_tree, index,
 			RADIX_DAX_ENTRY(sector, pmd_entry));
 	if (error)
-		goto unlock;
+		return error;
 
 	mapping->nrexceptional++;
- dirty:
+dirty:
 	if (dirty)
 		radix_tree_tag_set(page_tree, index, PAGECACHE_TAG_DIRTY);
- unlock:
+	return error;
+}
+
+static int dax_radix_entry_insert(struct address_space *mapping, pgoff_t index,
+		sector_t sector, bool pmd_entry, bool dirty)
+{
+	int error;
+
+	if (dirty)
+		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
+
+	spin_lock_irq(&mapping->tree_lock);
+	error = __dax_radix_entry_insert(mapping, index, sector, pmd_entry, dirty);
 	spin_unlock_irq(&mapping->tree_lock);
 	return error;
 }
@@ -579,8 +586,8 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 	}
 	dax_unmap_atomic(bdev, &dax);
 
-	error = dax_radix_entry(mapping, vmf->pgoff, dax.sector, false,
-			vmf->flags & FAULT_FLAG_WRITE);
+	error = dax_radix_entry_insert(mapping, vmf->pgoff, dax.sector, false,
+				vmf->flags & FAULT_FLAG_WRITE, vmf->page);
 	if (error)
 		goto out;
 
@@ -984,7 +991,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		 * the write to insert a dirty entry.
 		 */
 		if (write) {
-			error = dax_radix_entry(mapping, pgoff, dax.sector,
+			error = dax_radix_entry_insert(mapping, pgoff, dax.sector,
 					true, true);
 			if (error) {
 				dax_pmd_dbg(&bh, address,
@@ -1057,14 +1064,14 @@ int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 	struct file *file = vma->vm_file;
 
 	/*
-	 * We pass NO_SECTOR to dax_radix_entry() because we expect that a
+	 * We pass NO_SECTOR to dax_radix_entry_insert() because we expect that a
 	 * RADIX_DAX_PTE entry already exists in the radix tree from a
 	 * previous call to __dax_fault().  We just want to look up that PTE
 	 * entry using vmf->pgoff and make sure the dirty tag is set.  This
 	 * saves us from having to make a call to get_block() here to look
 	 * up the sector.
 	 */
-	dax_radix_entry(file->f_mapping, vmf->pgoff, NO_SECTOR, false, true);
+	dax_radix_entry_insert(file->f_mapping, vmf->pgoff, NO_SECTOR, false, true);
 	return VM_FAULT_NOPAGE;
 }
 EXPORT_SYMBOL_GPL(dax_pfn_mkwrite);
-- 
1.8.3.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [PATCH 2/2] dax: fix race dax_fault write vs read
  2016-02-08 13:53 ` [PATCH 1/2] dax: rename dax_radix_entry to dax_radix_entry_insert Dmitry Monakhov
@ 2016-02-08 13:53   ` Dmitry Monakhov
  2016-02-11 17:42   ` [PATCH 1/2] dax: rename dax_radix_entry to dax_radix_entry_insert Ross Zwisler
  1 sibling, 0 replies; 5+ messages in thread
From: Dmitry Monakhov @ 2016-02-08 13:53 UTC (permalink / raw)
  To: linux-mm; +Cc: willy, ross.zwisler, Dmitry Monakhov

Two tasks, a reader and a writer, fault inside a file hole:
task_1(writer)                  task_2(reader)
__dax_fault(write)
  ->lock_page_or_retry
  ->delete_from_page_cache()    __dax_fault(read)
                                ->dax_load_hole
                                  ->find_or_create_page()
                                    ->new page in mapping->radix_tree
  ->dax_insert_mapping
     ->dax_radix_entry => collision

Let's move the radix_tree update into dax_radix_entry_replace(), where
page deletion and dax entry insertion are protected by ->tree_lock.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
 fs/dax.c | 37 +++++++++++++++++++++++++++++++------
 1 file changed, 31 insertions(+), 6 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 89bb1f8..0294fc9 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -424,6 +424,31 @@ static int dax_radix_entry_insert(struct address_space *mapping, pgoff_t index,
 	error = __dax_radix_entry_insert(mapping, index, sector, pmd_entry, dirty);
 	spin_unlock_irq(&mapping->tree_lock);
 	return error;
+
+}
+
+static int dax_radix_entry_replace(struct address_space *mapping, pgoff_t index,
+				   sector_t sector, bool pmd_entry, bool dirty,
+				   struct page *old_page)
+{
+	int error;
+
+	BUG_ON(old_page && !PageLocked(old_page));
+	if (dirty)
+		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
+
+	spin_lock_irq(&mapping->tree_lock);
+	if (old_page)
+		__delete_from_page_cache(old_page, NULL);
+	error = __dax_radix_entry_insert(mapping, index, sector, pmd_entry, dirty);
+	spin_unlock_irq(&mapping->tree_lock);
+	if (old_page) {
+		if (mapping->a_ops->freepage)
+			mapping->a_ops->freepage(old_page);
+		page_cache_release(old_page);
+	}
+	return error;
+
 }
 
 static int dax_writeback_one(struct block_device *bdev,
@@ -586,7 +611,7 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 	}
 	dax_unmap_atomic(bdev, &dax);
 
-	error = dax_radix_entry_insert(mapping, vmf->pgoff, dax.sector, false,
+	error = dax_radix_entry_replace(mapping, vmf->pgoff, dax.sector, false,
 				vmf->flags & FAULT_FLAG_WRITE, vmf->page);
 	if (error)
 		goto out;
@@ -711,14 +736,16 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		page = find_lock_page(mapping, vmf->pgoff);
 
 	if (page) {
+		vmf->page = page;
 		unmap_mapping_range(mapping, vmf->pgoff << PAGE_SHIFT,
 							PAGE_CACHE_SIZE, 0);
-		delete_from_page_cache(page);
+	}
+	error = dax_insert_mapping(inode, &bh, vma, vmf);
+	if (page) {
 		unlock_page(page);
 		page_cache_release(page);
-		page = NULL;
+		vmf->page = page = NULL;
 	}
-
 	/*
 	 * If we successfully insert the new mapping over an unwritten extent,
 	 * we need to ensure we convert the unwritten extent. If there is an
@@ -729,14 +756,12 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	 * indicate what the callback should do via the uptodate variable, same
 	 * as for normal BH based IO completions.
 	 */
-	error = dax_insert_mapping(inode, &bh, vma, vmf);
 	if (buffer_unwritten(&bh)) {
 		if (complete_unwritten)
 			complete_unwritten(&bh, !error);
 		else
 			WARN_ON_ONCE(!(vmf->flags & FAULT_FLAG_WRITE));
 	}
-
  out:
 	if (error == -ENOMEM)
 		return VM_FAULT_OOM | major;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH 1/2] dax: rename dax_radix_entry to dax_radix_entry_insert
  2016-02-08 13:53 ` [PATCH 1/2] dax: rename dax_radix_entry to dax_radix_entry_insert Dmitry Monakhov
  2016-02-08 13:53   ` [PATCH 2/2] dax: fix race dax_fault write vs read Dmitry Monakhov
@ 2016-02-11 17:42   ` Ross Zwisler
  1 sibling, 0 replies; 5+ messages in thread
From: Ross Zwisler @ 2016-02-11 17:42 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: linux-mm, willy, ross.zwisler

On Mon, Feb 08, 2016 at 05:53:17PM +0400, Dmitry Monakhov wrote:
> - dax_radix_entry_insert is a more appropriate name for this function

I think I may have actually had it named that at some point. :)  I changed it   
because it doesn't always insert an entry - in the read case for example we     
insert a clean entry, and then on the following dax_pfn_mkwrite() we call back  
in and mark it as dirty. 

> - Add a lockless helper, __dax_radix_entry_insert; it will be used by the second patch
> 
> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
> ---
>  fs/dax.c | 39 +++++++++++++++++++++++----------------
>  1 file changed, 23 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index fc2e314..89bb1f8 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
<>
> @@ -579,8 +586,8 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
>  	}
>  	dax_unmap_atomic(bdev, &dax);
>  
> -	error = dax_radix_entry(mapping, vmf->pgoff, dax.sector, false,
> -			vmf->flags & FAULT_FLAG_WRITE);
> +	error = dax_radix_entry_insert(mapping, vmf->pgoff, dax.sector, false,
> +				vmf->flags & FAULT_FLAG_WRITE, vmf->page);

fs/dax.c: In function 'dax_insert_mapping':
fs/dax.c:589:10: error: too many arguments to function 'dax_radix_entry_insert'
  error = dax_radix_entry_insert(mapping, vmf->pgoff, dax.sector, false,
          ^
fs/dax.c:415:12: note: declared here
 static int dax_radix_entry_insert(struct address_space *mapping, pgoff_t index,
            ^
scripts/Makefile.build:258: recipe for target 'fs/dax.o' failed


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: DAX: __dax_fault race question
  2016-02-08 11:23 DAX: __dax_fault race question Dmitry Monakhov
  2016-02-08 13:53 ` [PATCH 1/2] dax: rename dax_radix_entry to dax_radix_entry_insert Dmitry Monakhov
@ 2016-02-11 18:43 ` Ross Zwisler
  1 sibling, 0 replies; 5+ messages in thread
From: Ross Zwisler @ 2016-02-11 18:43 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: Linux Memory Management, willy, ross.zwisler

On Mon, Feb 08, 2016 at 02:23:49PM +0300, Dmitry Monakhov wrote:
> 
> Hi,
> 
> I'm trying to understand the locking rules for DAX, and I realized that there
> is a suspicious case in dax_fault.
> 
> In __dax_fault we try to replace a normal page-cache page with a DAX entry.
> Basically, the dax_fault steps look as follows:
> 
> 1) page = find_get_page(..)
> 2) lock_page_or_retry(page)
> 3) get_block
> 4) delete_from_page_cache(page)
> 5) unlock_page(page)
> 6) dax_insert_mapping(inode, &bh, vma, vmf)
> ...
> 
> But what protects us from another task taking a new page fault after (4) but
> before (6)? AFAIU this case is not prohibited.
> Let's see what happens when two tasks, a reader and a writer, fault inside a file hole:
> task_1(writer)                  task_2(reader)
> __dax_fault(write)
>   ->lock_page_or_retry
>   ->delete_from_page_cache()    __dax_fault(read)
>                                 ->dax_load_hole
>                                   ->find_or_create_page()
>                                     ->new page in mapping->radix_tree               
>   ->dax_insert_mapping
>      ->dax_radix_entry->collision: return -EIO
> 
> Before the dax/fsync patch set this race resulted in silent dax/page duality
> (which likely leads to data incoherence or data corruption). Luckily, now the
> race results in a collision on insertion into the radix_tree and returns -EIO.
> At first glance the testcase looks very simple, but I cannot reproduce
> it in my environment.
> 
> IMHO it is reasonable to pass the locked page to dax_insert_mapping and let
> dax_radix_entry do an atomic page/dax-entry replacement, similar to
> replace_page_cache_page. Am I right?

We are trying to come up with a general locking scheme that will solve this
race as well as others.

https://lkml.org/lkml/2016/2/9/607


^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2016-02-11 18:44 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-02-08 11:23 DAX: __dax_fault race question Dmitry Monakhov
2016-02-08 13:53 ` [PATCH 1/2] dax: rename dax_radix_entry to dax_radix_entry_insert Dmitry Monakhov
2016-02-08 13:53   ` [PATCH 2/2] dax: fix race dax_fault write vs read Dmitry Monakhov
2016-02-11 17:42   ` [PATCH 1/2] dax: rename dax_radix_entry to dax_radix_entry_insert Ross Zwisler
2016-02-11 18:43 ` DAX: __dax_fault race question Ross Zwisler
