All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/4 v4] mm,dax: Fix data corruption due to mmap inconsistency
@ 2017-05-10  8:54 ` Jan Kara
  0 siblings, 0 replies; 39+ messages in thread
From: Jan Kara @ 2017-05-10  8:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ross Zwisler, Dan Williams, linux-ext4, linux-fsdevel,
	linux-nvdimm, linux-mm, Jan Kara

Hello,

this series fixes data corruption that can happen for DAX mounts when
page faults race with write(2) and as a result page tables get out of sync
with block mappings in the filesystem and thus data seen through mmap is
different from data seen through read(2).

The series passes testing with t_mmap_stale test program from Ross and also
other mmap related tests on DAX filesystem.

Andrew, can you please merge these patches? Thanks!

Changes since v3:
* Rebased on top of current Linus' tree due to non-trivial conflicts with
  added tracepoint

Changes since v2:
* Added reviewed-by tag from Ross

Changes since v1:
* Improved performance of unmapping pages
* Changed fault locking to fix another write vs fault race

								Honza

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 0/4 v4] mm,dax: Fix data corruption due to mmap inconsistency
@ 2017-05-10  8:54 ` Jan Kara
  0 siblings, 0 replies; 39+ messages in thread
From: Jan Kara @ 2017-05-10  8:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ross Zwisler, Dan Williams, linux-ext4, linux-fsdevel,
	linux-nvdimm, linux-mm, Jan Kara

Hello,

this series fixes data corruption that can happen for DAX mounts when
page faults race with write(2) and as a result page tables get out of sync
with block mappings in the filesystem and thus data seen through mmap is
different from data seen through read(2).

The series passes testing with t_mmap_stale test program from Ross and also
other mmap related tests on DAX filesystem.

Andrew, can you please merge these patches? Thanks!

Changes since v3:
* Rebased on top of current Linus' tree due to non-trivial conflicts with
  added tracepoint

Changes since v2:
* Added reviewed-by tag from Ross

Changes since v1:
* Improved performance of unmapping pages
* Changed fault locking to fix another write vs fault race

								Honza

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 1/4] dax: prevent invalidation of mapped DAX entries
  2017-05-10  8:54 ` Jan Kara
  (?)
  (?)
@ 2017-05-10  8:54   ` Jan Kara
  -1 siblings, 0 replies; 39+ messages in thread
From: Jan Kara @ 2017-05-10  8:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, linux-nvdimm, [4.10+], linux-mm, linux-fsdevel, linux-ext4

From: Ross Zwisler <ross.zwisler@linux.intel.com>

dax_invalidate_mapping_entry() currently removes DAX exceptional entries
only if they are clean and unlocked.  This is done via:

invalidate_mapping_pages()
  invalidate_exceptional_entry()
    dax_invalidate_mapping_entry()

However, for page cache pages removed in invalidate_mapping_pages() there
is an additional criteria which is that the page must not be mapped.  This
is noted in the comments above invalidate_mapping_pages() and is checked in
invalidate_inode_page().

For DAX entries this means that we can can end up in a situation where a
DAX exceptional entry, either a huge zero page or a regular DAX entry,
could end up mapped but without an associated radix tree entry. This is
inconsistent with the rest of the DAX code and with what happens in the
page cache case.

We aren't able to unmap the DAX exceptional entry because according to its
comments invalidate_mapping_pages() isn't allowed to block, and
unmap_mapping_range() takes a write lock on the mapping->i_mmap_rwsem.

Since we essentially never have unmapped DAX entries to evict from the
radix tree, just remove dax_invalidate_mapping_entry().

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Fixes: c6dcf52c23d2 ("mm: Invalidate DAX radix tree entries only if appropriate")
Reported-by: Jan Kara <jack@suse.cz>
Cc: <stable@vger.kernel.org>    [4.10+]
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c            | 29 -----------------------------
 include/linux/dax.h |  1 -
 mm/truncate.c       |  9 +++------
 3 files changed, 3 insertions(+), 36 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 66d79067eedf..38deebb8c86e 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -461,35 +461,6 @@ int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
 }
 
 /*
- * Invalidate exceptional DAX entry if easily possible. This handles DAX
- * entries for invalidate_inode_pages() so we evict the entry only if we can
- * do so without blocking.
- */
-int dax_invalidate_mapping_entry(struct address_space *mapping, pgoff_t index)
-{
-	int ret = 0;
-	void *entry, **slot;
-	struct radix_tree_root *page_tree = &mapping->page_tree;
-
-	spin_lock_irq(&mapping->tree_lock);
-	entry = __radix_tree_lookup(page_tree, index, NULL, &slot);
-	if (!entry || !radix_tree_exceptional_entry(entry) ||
-	    slot_locked(mapping, slot))
-		goto out;
-	if (radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_DIRTY) ||
-	    radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE))
-		goto out;
-	radix_tree_delete(page_tree, index);
-	mapping->nrexceptional--;
-	ret = 1;
-out:
-	spin_unlock_irq(&mapping->tree_lock);
-	if (ret)
-		dax_wake_mapping_entry_waiter(mapping, index, entry, true);
-	return ret;
-}
-
-/*
  * Invalidate exceptional DAX entry if it is clean.
  */
 int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
diff --git a/include/linux/dax.h b/include/linux/dax.h
index d3158e74a59e..d1236d16ef00 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -63,7 +63,6 @@ ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
 int dax_iomap_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
 		    const struct iomap_ops *ops);
 int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
-int dax_invalidate_mapping_entry(struct address_space *mapping, pgoff_t index);
 int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
 				      pgoff_t index);
 void dax_wake_mapping_entry_waiter(struct address_space *mapping,
diff --git a/mm/truncate.c b/mm/truncate.c
index 83a059e8cd1d..706cff171a15 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -67,17 +67,14 @@ static void truncate_exceptional_entry(struct address_space *mapping,
 
 /*
  * Invalidate exceptional entry if easily possible. This handles exceptional
- * entries for invalidate_inode_pages() so for DAX it evicts only unlocked and
- * clean entries.
+ * entries for invalidate_inode_pages().
  */
 static int invalidate_exceptional_entry(struct address_space *mapping,
 					pgoff_t index, void *entry)
 {
-	/* Handled by shmem itself */
-	if (shmem_mapping(mapping))
+	/* Handled by shmem itself, or for DAX we do nothing. */
+	if (shmem_mapping(mapping) || dax_mapping(mapping))
 		return 1;
-	if (dax_mapping(mapping))
-		return dax_invalidate_mapping_entry(mapping, index);
 	clear_shadow_entry(mapping, index, entry);
 	return 1;
 }
-- 
2.12.0

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 1/4] dax: prevent invalidation of mapped DAX entries
@ 2017-05-10  8:54   ` Jan Kara
  0 siblings, 0 replies; 39+ messages in thread
From: Jan Kara @ 2017-05-10  8:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ross Zwisler, Dan Williams, linux-ext4, linux-fsdevel,
	linux-nvdimm, linux-mm, [4.10+],
	Jan Kara

From: Ross Zwisler <ross.zwisler@linux.intel.com>

dax_invalidate_mapping_entry() currently removes DAX exceptional entries
only if they are clean and unlocked.  This is done via:

invalidate_mapping_pages()
  invalidate_exceptional_entry()
    dax_invalidate_mapping_entry()

However, for page cache pages removed in invalidate_mapping_pages() there
is an additional criteria which is that the page must not be mapped.  This
is noted in the comments above invalidate_mapping_pages() and is checked in
invalidate_inode_page().

For DAX entries this means that we can can end up in a situation where a
DAX exceptional entry, either a huge zero page or a regular DAX entry,
could end up mapped but without an associated radix tree entry. This is
inconsistent with the rest of the DAX code and with what happens in the
page cache case.

We aren't able to unmap the DAX exceptional entry because according to its
comments invalidate_mapping_pages() isn't allowed to block, and
unmap_mapping_range() takes a write lock on the mapping->i_mmap_rwsem.

Since we essentially never have unmapped DAX entries to evict from the
radix tree, just remove dax_invalidate_mapping_entry().

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Fixes: c6dcf52c23d2 ("mm: Invalidate DAX radix tree entries only if appropriate")
Reported-by: Jan Kara <jack@suse.cz>
Cc: <stable@vger.kernel.org>    [4.10+]
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c            | 29 -----------------------------
 include/linux/dax.h |  1 -
 mm/truncate.c       |  9 +++------
 3 files changed, 3 insertions(+), 36 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 66d79067eedf..38deebb8c86e 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -461,35 +461,6 @@ int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
 }
 
 /*
- * Invalidate exceptional DAX entry if easily possible. This handles DAX
- * entries for invalidate_inode_pages() so we evict the entry only if we can
- * do so without blocking.
- */
-int dax_invalidate_mapping_entry(struct address_space *mapping, pgoff_t index)
-{
-	int ret = 0;
-	void *entry, **slot;
-	struct radix_tree_root *page_tree = &mapping->page_tree;
-
-	spin_lock_irq(&mapping->tree_lock);
-	entry = __radix_tree_lookup(page_tree, index, NULL, &slot);
-	if (!entry || !radix_tree_exceptional_entry(entry) ||
-	    slot_locked(mapping, slot))
-		goto out;
-	if (radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_DIRTY) ||
-	    radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE))
-		goto out;
-	radix_tree_delete(page_tree, index);
-	mapping->nrexceptional--;
-	ret = 1;
-out:
-	spin_unlock_irq(&mapping->tree_lock);
-	if (ret)
-		dax_wake_mapping_entry_waiter(mapping, index, entry, true);
-	return ret;
-}
-
-/*
  * Invalidate exceptional DAX entry if it is clean.
  */
 int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
diff --git a/include/linux/dax.h b/include/linux/dax.h
index d3158e74a59e..d1236d16ef00 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -63,7 +63,6 @@ ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
 int dax_iomap_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
 		    const struct iomap_ops *ops);
 int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
-int dax_invalidate_mapping_entry(struct address_space *mapping, pgoff_t index);
 int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
 				      pgoff_t index);
 void dax_wake_mapping_entry_waiter(struct address_space *mapping,
diff --git a/mm/truncate.c b/mm/truncate.c
index 83a059e8cd1d..706cff171a15 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -67,17 +67,14 @@ static void truncate_exceptional_entry(struct address_space *mapping,
 
 /*
  * Invalidate exceptional entry if easily possible. This handles exceptional
- * entries for invalidate_inode_pages() so for DAX it evicts only unlocked and
- * clean entries.
+ * entries for invalidate_inode_pages().
  */
 static int invalidate_exceptional_entry(struct address_space *mapping,
 					pgoff_t index, void *entry)
 {
-	/* Handled by shmem itself */
-	if (shmem_mapping(mapping))
+	/* Handled by shmem itself, or for DAX we do nothing. */
+	if (shmem_mapping(mapping) || dax_mapping(mapping))
 		return 1;
-	if (dax_mapping(mapping))
-		return dax_invalidate_mapping_entry(mapping, index);
 	clear_shadow_entry(mapping, index, entry);
 	return 1;
 }
-- 
2.12.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 1/4] dax: prevent invalidation of mapped DAX entries
@ 2017-05-10  8:54   ` Jan Kara
  0 siblings, 0 replies; 39+ messages in thread
From: Jan Kara @ 2017-05-10  8:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw, [4.10+],
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-ext4-u79uwXL29TY76Z2rM5mHXA

From: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>

dax_invalidate_mapping_entry() currently removes DAX exceptional entries
only if they are clean and unlocked.  This is done via:

invalidate_mapping_pages()
  invalidate_exceptional_entry()
    dax_invalidate_mapping_entry()

However, for page cache pages removed in invalidate_mapping_pages() there
is an additional criteria which is that the page must not be mapped.  This
is noted in the comments above invalidate_mapping_pages() and is checked in
invalidate_inode_page().

For DAX entries this means that we can can end up in a situation where a
DAX exceptional entry, either a huge zero page or a regular DAX entry,
could end up mapped but without an associated radix tree entry. This is
inconsistent with the rest of the DAX code and with what happens in the
page cache case.

We aren't able to unmap the DAX exceptional entry because according to its
comments invalidate_mapping_pages() isn't allowed to block, and
unmap_mapping_range() takes a write lock on the mapping->i_mmap_rwsem.

Since we essentially never have unmapped DAX entries to evict from the
radix tree, just remove dax_invalidate_mapping_entry().

Signed-off-by: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
Fixes: c6dcf52c23d2 ("mm: Invalidate DAX radix tree entries only if appropriate")
Reported-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
Cc: <stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>    [4.10+]
Signed-off-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
---
 fs/dax.c            | 29 -----------------------------
 include/linux/dax.h |  1 -
 mm/truncate.c       |  9 +++------
 3 files changed, 3 insertions(+), 36 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 66d79067eedf..38deebb8c86e 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -461,35 +461,6 @@ int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
 }
 
 /*
- * Invalidate exceptional DAX entry if easily possible. This handles DAX
- * entries for invalidate_inode_pages() so we evict the entry only if we can
- * do so without blocking.
- */
-int dax_invalidate_mapping_entry(struct address_space *mapping, pgoff_t index)
-{
-	int ret = 0;
-	void *entry, **slot;
-	struct radix_tree_root *page_tree = &mapping->page_tree;
-
-	spin_lock_irq(&mapping->tree_lock);
-	entry = __radix_tree_lookup(page_tree, index, NULL, &slot);
-	if (!entry || !radix_tree_exceptional_entry(entry) ||
-	    slot_locked(mapping, slot))
-		goto out;
-	if (radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_DIRTY) ||
-	    radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE))
-		goto out;
-	radix_tree_delete(page_tree, index);
-	mapping->nrexceptional--;
-	ret = 1;
-out:
-	spin_unlock_irq(&mapping->tree_lock);
-	if (ret)
-		dax_wake_mapping_entry_waiter(mapping, index, entry, true);
-	return ret;
-}
-
-/*
  * Invalidate exceptional DAX entry if it is clean.
  */
 int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
diff --git a/include/linux/dax.h b/include/linux/dax.h
index d3158e74a59e..d1236d16ef00 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -63,7 +63,6 @@ ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
 int dax_iomap_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
 		    const struct iomap_ops *ops);
 int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
-int dax_invalidate_mapping_entry(struct address_space *mapping, pgoff_t index);
 int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
 				      pgoff_t index);
 void dax_wake_mapping_entry_waiter(struct address_space *mapping,
diff --git a/mm/truncate.c b/mm/truncate.c
index 83a059e8cd1d..706cff171a15 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -67,17 +67,14 @@ static void truncate_exceptional_entry(struct address_space *mapping,
 
 /*
  * Invalidate exceptional entry if easily possible. This handles exceptional
- * entries for invalidate_inode_pages() so for DAX it evicts only unlocked and
- * clean entries.
+ * entries for invalidate_inode_pages().
  */
 static int invalidate_exceptional_entry(struct address_space *mapping,
 					pgoff_t index, void *entry)
 {
-	/* Handled by shmem itself */
-	if (shmem_mapping(mapping))
+	/* Handled by shmem itself, or for DAX we do nothing. */
+	if (shmem_mapping(mapping) || dax_mapping(mapping))
 		return 1;
-	if (dax_mapping(mapping))
-		return dax_invalidate_mapping_entry(mapping, index);
 	clear_shadow_entry(mapping, index, entry);
 	return 1;
 }
-- 
2.12.0

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 1/4] dax: prevent invalidation of mapped DAX entries
@ 2017-05-10  8:54   ` Jan Kara
  0 siblings, 0 replies; 39+ messages in thread
From: Jan Kara @ 2017-05-10  8:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ross Zwisler, Dan Williams, linux-ext4, linux-fsdevel,
	linux-nvdimm, linux-mm, [4.10+],
	Jan Kara

From: Ross Zwisler <ross.zwisler@linux.intel.com>

dax_invalidate_mapping_entry() currently removes DAX exceptional entries
only if they are clean and unlocked.  This is done via:

invalidate_mapping_pages()
  invalidate_exceptional_entry()
    dax_invalidate_mapping_entry()

However, for page cache pages removed in invalidate_mapping_pages() there
is an additional criteria which is that the page must not be mapped.  This
is noted in the comments above invalidate_mapping_pages() and is checked in
invalidate_inode_page().

For DAX entries this means that we can can end up in a situation where a
DAX exceptional entry, either a huge zero page or a regular DAX entry,
could end up mapped but without an associated radix tree entry. This is
inconsistent with the rest of the DAX code and with what happens in the
page cache case.

We aren't able to unmap the DAX exceptional entry because according to its
comments invalidate_mapping_pages() isn't allowed to block, and
unmap_mapping_range() takes a write lock on the mapping->i_mmap_rwsem.

Since we essentially never have unmapped DAX entries to evict from the
radix tree, just remove dax_invalidate_mapping_entry().

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Fixes: c6dcf52c23d2 ("mm: Invalidate DAX radix tree entries only if appropriate")
Reported-by: Jan Kara <jack@suse.cz>
Cc: <stable@vger.kernel.org>    [4.10+]
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c            | 29 -----------------------------
 include/linux/dax.h |  1 -
 mm/truncate.c       |  9 +++------
 3 files changed, 3 insertions(+), 36 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 66d79067eedf..38deebb8c86e 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -461,35 +461,6 @@ int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
 }
 
 /*
- * Invalidate exceptional DAX entry if easily possible. This handles DAX
- * entries for invalidate_inode_pages() so we evict the entry only if we can
- * do so without blocking.
- */
-int dax_invalidate_mapping_entry(struct address_space *mapping, pgoff_t index)
-{
-	int ret = 0;
-	void *entry, **slot;
-	struct radix_tree_root *page_tree = &mapping->page_tree;
-
-	spin_lock_irq(&mapping->tree_lock);
-	entry = __radix_tree_lookup(page_tree, index, NULL, &slot);
-	if (!entry || !radix_tree_exceptional_entry(entry) ||
-	    slot_locked(mapping, slot))
-		goto out;
-	if (radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_DIRTY) ||
-	    radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE))
-		goto out;
-	radix_tree_delete(page_tree, index);
-	mapping->nrexceptional--;
-	ret = 1;
-out:
-	spin_unlock_irq(&mapping->tree_lock);
-	if (ret)
-		dax_wake_mapping_entry_waiter(mapping, index, entry, true);
-	return ret;
-}
-
-/*
  * Invalidate exceptional DAX entry if it is clean.
  */
 int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
diff --git a/include/linux/dax.h b/include/linux/dax.h
index d3158e74a59e..d1236d16ef00 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -63,7 +63,6 @@ ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
 int dax_iomap_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
 		    const struct iomap_ops *ops);
 int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
-int dax_invalidate_mapping_entry(struct address_space *mapping, pgoff_t index);
 int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
 				      pgoff_t index);
 void dax_wake_mapping_entry_waiter(struct address_space *mapping,
diff --git a/mm/truncate.c b/mm/truncate.c
index 83a059e8cd1d..706cff171a15 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -67,17 +67,14 @@ static void truncate_exceptional_entry(struct address_space *mapping,
 
 /*
  * Invalidate exceptional entry if easily possible. This handles exceptional
- * entries for invalidate_inode_pages() so for DAX it evicts only unlocked and
- * clean entries.
+ * entries for invalidate_inode_pages().
  */
 static int invalidate_exceptional_entry(struct address_space *mapping,
 					pgoff_t index, void *entry)
 {
-	/* Handled by shmem itself */
-	if (shmem_mapping(mapping))
+	/* Handled by shmem itself, or for DAX we do nothing. */
+	if (shmem_mapping(mapping) || dax_mapping(mapping))
 		return 1;
-	if (dax_mapping(mapping))
-		return dax_invalidate_mapping_entry(mapping, index);
 	clear_shadow_entry(mapping, index, entry);
 	return 1;
 }
-- 
2.12.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 2/4] mm: Fix data corruption due to stale mmap reads
  2017-05-10  8:54 ` Jan Kara
  (?)
  (?)
@ 2017-05-10  8:54   ` Jan Kara
  -1 siblings, 0 replies; 39+ messages in thread
From: Jan Kara @ 2017-05-10  8:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, linux-nvdimm, stable, linux-mm, linux-fsdevel, linux-ext4

Currently, we didn't invalidate page tables during
invalidate_inode_pages2() for DAX. That could result in e.g. 2MiB zero
page being mapped into page tables while there were already underlying
blocks allocated and thus data seen through mmap were different from
data seen by read(2). The following sequence reproduces the problem:

- open an mmap over a 2MiB hole

- read from a 2MiB hole, faulting in a 2MiB zero page

- write to the hole with write(3p).  The write succeeds but we
  incorrectly leave the 2MiB zero page mapping intact.

- via the mmap, read the data that was just written.  Since the zero
  page mapping is still intact we read back zeroes instead of the new
  data.

Fix the problem by unconditionally calling
invalidate_inode_pages2_range() in dax_iomap_actor() for new block
allocations and by properly invalidating page tables in
invalidate_inode_pages2_range() for DAX mappings.

Fixes: c6dcf52c23d2d3fb5235cec42d7dd3f786b87d55
CC: stable@vger.kernel.org
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c      |  2 +-
 mm/truncate.c | 12 +++++++++++-
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 38deebb8c86e..123d9903c77d 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1015,7 +1015,7 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 	 * into page tables. We have to tear down these mappings so that data
 	 * written by write(2) is visible in mmap.
 	 */
-	if ((iomap->flags & IOMAP_F_NEW) && inode->i_mapping->nrpages) {
+	if (iomap->flags & IOMAP_F_NEW) {
 		invalidate_inode_pages2_range(inode->i_mapping,
 					      pos >> PAGE_SHIFT,
 					      (end - 1) >> PAGE_SHIFT);
diff --git a/mm/truncate.c b/mm/truncate.c
index 706cff171a15..6479ed2afc53 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -686,7 +686,17 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 		cond_resched();
 		index++;
 	}
-
+	/*
+	 * For DAX we invalidate page tables after invalidating radix tree.  We
+	 * could invalidate page tables while invalidating each entry however
+	 * that would be expensive. And doing range unmapping before doesn't
+	 * work as we have no cheap way to find whether radix tree entry didn't
+	 * get remapped later.
+	 */
+	if (dax_mapping(mapping)) {
+		unmap_mapping_range(mapping, (loff_t)start << PAGE_SHIFT,
+				    (loff_t)(end - start + 1) << PAGE_SHIFT, 0);
+	}
 out:
 	cleancache_invalidate_inode(mapping);
 	return ret;
-- 
2.12.0

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 2/4] mm: Fix data corruption due to stale mmap reads
@ 2017-05-10  8:54   ` Jan Kara
  0 siblings, 0 replies; 39+ messages in thread
From: Jan Kara @ 2017-05-10  8:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ross Zwisler, Dan Williams, linux-ext4, linux-fsdevel,
	linux-nvdimm, linux-mm, Jan Kara, stable

Currently, we didn't invalidate page tables during
invalidate_inode_pages2() for DAX. That could result in e.g. 2MiB zero
page being mapped into page tables while there were already underlying
blocks allocated and thus data seen through mmap were different from
data seen by read(2). The following sequence reproduces the problem:

- open an mmap over a 2MiB hole

- read from a 2MiB hole, faulting in a 2MiB zero page

- write to the hole with write(3p).  The write succeeds but we
  incorrectly leave the 2MiB zero page mapping intact.

- via the mmap, read the data that was just written.  Since the zero
  page mapping is still intact we read back zeroes instead of the new
  data.

Fix the problem by unconditionally calling
invalidate_inode_pages2_range() in dax_iomap_actor() for new block
allocations and by properly invalidating page tables in
invalidate_inode_pages2_range() for DAX mappings.

Fixes: c6dcf52c23d2d3fb5235cec42d7dd3f786b87d55
CC: stable@vger.kernel.org
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c      |  2 +-
 mm/truncate.c | 12 +++++++++++-
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 38deebb8c86e..123d9903c77d 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1015,7 +1015,7 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 	 * into page tables. We have to tear down these mappings so that data
 	 * written by write(2) is visible in mmap.
 	 */
-	if ((iomap->flags & IOMAP_F_NEW) && inode->i_mapping->nrpages) {
+	if (iomap->flags & IOMAP_F_NEW) {
 		invalidate_inode_pages2_range(inode->i_mapping,
 					      pos >> PAGE_SHIFT,
 					      (end - 1) >> PAGE_SHIFT);
diff --git a/mm/truncate.c b/mm/truncate.c
index 706cff171a15..6479ed2afc53 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -686,7 +686,17 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 		cond_resched();
 		index++;
 	}
-
+	/*
+	 * For DAX we invalidate page tables after invalidating radix tree.  We
+	 * could invalidate page tables while invalidating each entry however
+	 * that would be expensive. And doing range unmapping before doesn't
+	 * work as we have no cheap way to find whether radix tree entry didn't
+	 * get remapped later.
+	 */
+	if (dax_mapping(mapping)) {
+		unmap_mapping_range(mapping, (loff_t)start << PAGE_SHIFT,
+				    (loff_t)(end - start + 1) << PAGE_SHIFT, 0);
+	}
 out:
 	cleancache_invalidate_inode(mapping);
 	return ret;
-- 
2.12.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 2/4] mm: Fix data corruption due to stale mmap reads
@ 2017-05-10  8:54   ` Jan Kara
  0 siblings, 0 replies; 39+ messages in thread
From: Jan Kara @ 2017-05-10  8:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ross Zwisler, Dan Williams, linux-ext4, linux-fsdevel,
	linux-nvdimm, linux-mm, Jan Kara, stable

Currently, we didn't invalidate page tables during
invalidate_inode_pages2() for DAX. That could result in e.g. 2MiB zero
page being mapped into page tables while there were already underlying
blocks allocated and thus data seen through mmap were different from
data seen by read(2). The following sequence reproduces the problem:

- open an mmap over a 2MiB hole

- read from a 2MiB hole, faulting in a 2MiB zero page

- write to the hole with write(3p).  The write succeeds but we
  incorrectly leave the 2MiB zero page mapping intact.

- via the mmap, read the data that was just written.  Since the zero
  page mapping is still intact we read back zeroes instead of the new
  data.

Fix the problem by unconditionally calling
invalidate_inode_pages2_range() in dax_iomap_actor() for new block
allocations and by properly invalidating page tables in
invalidate_inode_pages2_range() for DAX mappings.

Fixes: c6dcf52c23d2d3fb5235cec42d7dd3f786b87d55
CC: stable@vger.kernel.org
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c      |  2 +-
 mm/truncate.c | 12 +++++++++++-
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 38deebb8c86e..123d9903c77d 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1015,7 +1015,7 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 	 * into page tables. We have to tear down these mappings so that data
 	 * written by write(2) is visible in mmap.
 	 */
-	if ((iomap->flags & IOMAP_F_NEW) && inode->i_mapping->nrpages) {
+	if (iomap->flags & IOMAP_F_NEW) {
 		invalidate_inode_pages2_range(inode->i_mapping,
 					      pos >> PAGE_SHIFT,
 					      (end - 1) >> PAGE_SHIFT);
diff --git a/mm/truncate.c b/mm/truncate.c
index 706cff171a15..6479ed2afc53 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -686,7 +686,17 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 		cond_resched();
 		index++;
 	}

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 2/4] mm: Fix data corruption due to stale mmap reads
@ 2017-05-10  8:54   ` Jan Kara
  0 siblings, 0 replies; 39+ messages in thread
From: Jan Kara @ 2017-05-10  8:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ross Zwisler, Dan Williams, linux-ext4, linux-fsdevel,
	linux-nvdimm, linux-mm, Jan Kara, stable

Currently, we didn't invalidate page tables during
invalidate_inode_pages2() for DAX. That could result in e.g. 2MiB zero
page being mapped into page tables while there were already underlying
blocks allocated and thus data seen through mmap were different from
data seen by read(2). The following sequence reproduces the problem:

- open an mmap over a 2MiB hole

- read from a 2MiB hole, faulting in a 2MiB zero page

- write to the hole with write(3p).  The write succeeds but we
  incorrectly leave the 2MiB zero page mapping intact.

- via the mmap, read the data that was just written.  Since the zero
  page mapping is still intact we read back zeroes instead of the new
  data.

Fix the problem by unconditionally calling
invalidate_inode_pages2_range() in dax_iomap_actor() for new block
allocations and by properly invalidating page tables in
invalidate_inode_pages2_range() for DAX mappings.

Fixes: c6dcf52c23d2d3fb5235cec42d7dd3f786b87d55
CC: stable@vger.kernel.org
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c      |  2 +-
 mm/truncate.c | 12 +++++++++++-
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 38deebb8c86e..123d9903c77d 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1015,7 +1015,7 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 	 * into page tables. We have to tear down these mappings so that data
 	 * written by write(2) is visible in mmap.
 	 */
-	if ((iomap->flags & IOMAP_F_NEW) && inode->i_mapping->nrpages) {
+	if (iomap->flags & IOMAP_F_NEW) {
 		invalidate_inode_pages2_range(inode->i_mapping,
 					      pos >> PAGE_SHIFT,
 					      (end - 1) >> PAGE_SHIFT);
diff --git a/mm/truncate.c b/mm/truncate.c
index 706cff171a15..6479ed2afc53 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -686,7 +686,17 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 		cond_resched();
 		index++;
 	}
-
+	/*
+	 * For DAX we invalidate page tables after invalidating radix tree.  We
+	 * could invalidate page tables while invalidating each entry however
+	 * that would be expensive. And doing range unmapping before doesn't
+	 * work as we have no cheap way to find whether radix tree entry didn't
+	 * get remapped later.
+	 */
+	if (dax_mapping(mapping)) {
+		unmap_mapping_range(mapping, (loff_t)start << PAGE_SHIFT,
+				    (loff_t)(end - start + 1) << PAGE_SHIFT, 0);
+	}
 out:
 	cleancache_invalidate_inode(mapping);
 	return ret;
-- 
2.12.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 3/4] ext4: Return back to starting transaction in ext4_dax_huge_fault()
  2017-05-10  8:54 ` Jan Kara
  (?)
  (?)
@ 2017-05-10  8:54   ` Jan Kara
  -1 siblings, 0 replies; 39+ messages in thread
From: Jan Kara @ 2017-05-10  8:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, linux-nvdimm, stable, linux-mm, linux-fsdevel, linux-ext4

DAX will return to locking exceptional entry before mapping blocks
for a page fault to fix possible races with concurrent writes. To avoid
lock inversion between exceptional entry lock and transaction start,
start the transaction already in ext4_dax_huge_fault().

Fixes: 9f141d6ef6258a3a37a045842d9ba7e68f368956
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/file.c | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index cefa9835f275..831fd6beebf0 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -257,6 +257,7 @@ static int ext4_dax_huge_fault(struct vm_fault *vmf,
 		enum page_entry_size pe_size)
 {
 	int result;
+	handle_t *handle = NULL;
 	struct inode *inode = file_inode(vmf->vma->vm_file);
 	struct super_block *sb = inode->i_sb;
 	bool write = vmf->flags & FAULT_FLAG_WRITE;
@@ -264,12 +265,24 @@ static int ext4_dax_huge_fault(struct vm_fault *vmf,
 	if (write) {
 		sb_start_pagefault(sb);
 		file_update_time(vmf->vma->vm_file);
+		down_read(&EXT4_I(inode)->i_mmap_sem);
+		handle = ext4_journal_start_sb(sb, EXT4_HT_WRITE_PAGE,
+					       EXT4_DATA_TRANS_BLOCKS(sb));
+	} else {
+		down_read(&EXT4_I(inode)->i_mmap_sem);
 	}
-	down_read(&EXT4_I(inode)->i_mmap_sem);
-	result = dax_iomap_fault(vmf, pe_size, &ext4_iomap_ops);
-	up_read(&EXT4_I(inode)->i_mmap_sem);
-	if (write)
+	if (!IS_ERR(handle))
+		result = dax_iomap_fault(vmf, pe_size, &ext4_iomap_ops);
+	else
+		result = VM_FAULT_SIGBUS;
+	if (write) {
+		if (!IS_ERR(handle))
+			ext4_journal_stop(handle);
+		up_read(&EXT4_I(inode)->i_mmap_sem);
 		sb_end_pagefault(sb);
+	} else {
+		up_read(&EXT4_I(inode)->i_mmap_sem);
+	}
 
 	return result;
 }
-- 
2.12.0

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 3/4] ext4: Return back to starting transaction in ext4_dax_huge_fault()
@ 2017-05-10  8:54   ` Jan Kara
  0 siblings, 0 replies; 39+ messages in thread
From: Jan Kara @ 2017-05-10  8:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ross Zwisler, Dan Williams, linux-ext4, linux-fsdevel,
	linux-nvdimm, linux-mm, Jan Kara, stable

DAX will return to locking exceptional entry before mapping blocks
for a page fault to fix possible races with concurrent writes. To avoid
lock inversion between exceptional entry lock and transaction start,
start the transaction already in ext4_dax_huge_fault().

Fixes: 9f141d6ef6258a3a37a045842d9ba7e68f368956
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/file.c | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index cefa9835f275..831fd6beebf0 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -257,6 +257,7 @@ static int ext4_dax_huge_fault(struct vm_fault *vmf,
 		enum page_entry_size pe_size)
 {
 	int result;
+	handle_t *handle = NULL;
 	struct inode *inode = file_inode(vmf->vma->vm_file);
 	struct super_block *sb = inode->i_sb;
 	bool write = vmf->flags & FAULT_FLAG_WRITE;
@@ -264,12 +265,24 @@ static int ext4_dax_huge_fault(struct vm_fault *vmf,
 	if (write) {
 		sb_start_pagefault(sb);
 		file_update_time(vmf->vma->vm_file);
+		down_read(&EXT4_I(inode)->i_mmap_sem);
+		handle = ext4_journal_start_sb(sb, EXT4_HT_WRITE_PAGE,
+					       EXT4_DATA_TRANS_BLOCKS(sb));
+	} else {
+		down_read(&EXT4_I(inode)->i_mmap_sem);
 	}
-	down_read(&EXT4_I(inode)->i_mmap_sem);
-	result = dax_iomap_fault(vmf, pe_size, &ext4_iomap_ops);
-	up_read(&EXT4_I(inode)->i_mmap_sem);
-	if (write)
+	if (!IS_ERR(handle))
+		result = dax_iomap_fault(vmf, pe_size, &ext4_iomap_ops);
+	else
+		result = VM_FAULT_SIGBUS;
+	if (write) {
+		if (!IS_ERR(handle))
+			ext4_journal_stop(handle);
+		up_read(&EXT4_I(inode)->i_mmap_sem);
 		sb_end_pagefault(sb);
+	} else {
+		up_read(&EXT4_I(inode)->i_mmap_sem);
+	}
 
 	return result;
 }
-- 
2.12.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 3/4] ext4: Return back to starting transaction in ext4_dax_huge_fault()
@ 2017-05-10  8:54   ` Jan Kara
  0 siblings, 0 replies; 39+ messages in thread
From: Jan Kara @ 2017-05-10  8:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	stable-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-ext4-u79uwXL29TY76Z2rM5mHXA

DAX will return to locking exceptional entry before mapping blocks
for a page fault to fix possible races with concurrent writes. To avoid
lock inversion between exceptional entry lock and transaction start,
start the transaction already in ext4_dax_huge_fault().

Fixes: 9f141d6ef6258a3a37a045842d9ba7e68f368956
CC: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
---
 fs/ext4/file.c | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index cefa9835f275..831fd6beebf0 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -257,6 +257,7 @@ static int ext4_dax_huge_fault(struct vm_fault *vmf,
 		enum page_entry_size pe_size)
 {
 	int result;
+	handle_t *handle = NULL;
 	struct inode *inode = file_inode(vmf->vma->vm_file);
 	struct super_block *sb = inode->i_sb;
 	bool write = vmf->flags & FAULT_FLAG_WRITE;
@@ -264,12 +265,24 @@ static int ext4_dax_huge_fault(struct vm_fault *vmf,
 	if (write) {
 		sb_start_pagefault(sb);
 		file_update_time(vmf->vma->vm_file);
+		down_read(&EXT4_I(inode)->i_mmap_sem);
+		handle = ext4_journal_start_sb(sb, EXT4_HT_WRITE_PAGE,
+					       EXT4_DATA_TRANS_BLOCKS(sb));
+	} else {
+		down_read(&EXT4_I(inode)->i_mmap_sem);
 	}
-	down_read(&EXT4_I(inode)->i_mmap_sem);
-	result = dax_iomap_fault(vmf, pe_size, &ext4_iomap_ops);
-	up_read(&EXT4_I(inode)->i_mmap_sem);
-	if (write)
+	if (!IS_ERR(handle))
+		result = dax_iomap_fault(vmf, pe_size, &ext4_iomap_ops);
+	else
+		result = VM_FAULT_SIGBUS;
+	if (write) {
+		if (!IS_ERR(handle))
+			ext4_journal_stop(handle);
+		up_read(&EXT4_I(inode)->i_mmap_sem);
 		sb_end_pagefault(sb);
+	} else {
+		up_read(&EXT4_I(inode)->i_mmap_sem);
+	}
 
 	return result;
 }
-- 
2.12.0

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 3/4] ext4: Return back to starting transaction in ext4_dax_huge_fault()
@ 2017-05-10  8:54   ` Jan Kara
  0 siblings, 0 replies; 39+ messages in thread
From: Jan Kara @ 2017-05-10  8:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ross Zwisler, Dan Williams, linux-ext4, linux-fsdevel,
	linux-nvdimm, linux-mm, Jan Kara, stable

DAX will return to locking exceptional entry before mapping blocks
for a page fault to fix possible races with concurrent writes. To avoid
lock inversion between exceptional entry lock and transaction start,
start the transaction already in ext4_dax_huge_fault().

Fixes: 9f141d6ef6258a3a37a045842d9ba7e68f368956
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/file.c | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index cefa9835f275..831fd6beebf0 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -257,6 +257,7 @@ static int ext4_dax_huge_fault(struct vm_fault *vmf,
 		enum page_entry_size pe_size)
 {
 	int result;
+	handle_t *handle = NULL;
 	struct inode *inode = file_inode(vmf->vma->vm_file);
 	struct super_block *sb = inode->i_sb;
 	bool write = vmf->flags & FAULT_FLAG_WRITE;
@@ -264,12 +265,24 @@ static int ext4_dax_huge_fault(struct vm_fault *vmf,
 	if (write) {
 		sb_start_pagefault(sb);
 		file_update_time(vmf->vma->vm_file);
+		down_read(&EXT4_I(inode)->i_mmap_sem);
+		handle = ext4_journal_start_sb(sb, EXT4_HT_WRITE_PAGE,
+					       EXT4_DATA_TRANS_BLOCKS(sb));
+	} else {
+		down_read(&EXT4_I(inode)->i_mmap_sem);
 	}
-	down_read(&EXT4_I(inode)->i_mmap_sem);
-	result = dax_iomap_fault(vmf, pe_size, &ext4_iomap_ops);
-	up_read(&EXT4_I(inode)->i_mmap_sem);
-	if (write)
+	if (!IS_ERR(handle))
+		result = dax_iomap_fault(vmf, pe_size, &ext4_iomap_ops);
+	else
+		result = VM_FAULT_SIGBUS;
+	if (write) {
+		if (!IS_ERR(handle))
+			ext4_journal_stop(handle);
+		up_read(&EXT4_I(inode)->i_mmap_sem);
 		sb_end_pagefault(sb);
+	} else {
+		up_read(&EXT4_I(inode)->i_mmap_sem);
+	}
 
 	return result;
 }
-- 
2.12.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 4/4] dax: Fix data corruption when fault races with write
  2017-05-10  8:54 ` Jan Kara
  (?)
  (?)
@ 2017-05-10  8:54   ` Jan Kara
  -1 siblings, 0 replies; 39+ messages in thread
From: Jan Kara @ 2017-05-10  8:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, linux-nvdimm, stable, linux-mm, linux-fsdevel, linux-ext4

Currently DAX read fault can race with write(2) in the following way:

CPU1 - write(2)			CPU2 - read fault
				dax_iomap_pte_fault()
				  ->iomap_begin() - sees hole
dax_iomap_rw()
  iomap_apply()
    ->iomap_begin - allocates blocks
    dax_iomap_actor()
      invalidate_inode_pages2_range()
        - there's nothing to invalidate
				  grab_mapping_entry()
				  - we add zero page in the radix tree
				    and map it to page tables

The result is that hole page is mapped into page tables (and thus zeros
are seen in mmap) while file has data written in that place.

Fix the problem by locking exception entry before mapping blocks for the
fault. That way we are sure invalidate_inode_pages2_range() call for
racing write will either block on entry lock waiting for the fault to
finish (and unmap stale page tables after that) or read fault will see
already allocated blocks by write(2).

Fixes: 9f141d6ef6258a3a37a045842d9ba7e68f368956
CC: stable@vger.kernel.org
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c | 32 ++++++++++++++++----------------
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 123d9903c77d..32f020c9cedf 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1148,6 +1148,12 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
 	if ((vmf->flags & FAULT_FLAG_WRITE) && !vmf->cow_page)
 		flags |= IOMAP_WRITE;
 
+	entry = grab_mapping_entry(mapping, vmf->pgoff, 0);
+	if (IS_ERR(entry)) {
+		vmf_ret = dax_fault_return(PTR_ERR(entry));
+		goto out;
+	}
+
 	/*
 	 * Note that we don't bother to use iomap_apply here: DAX required
 	 * the file system block size to be equal the page size, which means
@@ -1156,17 +1162,11 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
 	error = ops->iomap_begin(inode, pos, PAGE_SIZE, flags, &iomap);
 	if (error) {
 		vmf_ret = dax_fault_return(error);
-		goto out;
+		goto unlock_entry;
 	}
 	if (WARN_ON_ONCE(iomap.offset + iomap.length < pos + PAGE_SIZE)) {
-		vmf_ret = dax_fault_return(-EIO);	/* fs corruption? */
-		goto finish_iomap;
-	}
-
-	entry = grab_mapping_entry(mapping, vmf->pgoff, 0);
-	if (IS_ERR(entry)) {
-		vmf_ret = dax_fault_return(PTR_ERR(entry));
-		goto finish_iomap;
+		error = -EIO;	/* fs corruption? */
+		goto error_finish_iomap;
 	}
 
 	sector = dax_iomap_sector(&iomap, pos);
@@ -1188,13 +1188,13 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
 		}
 
 		if (error)
-			goto error_unlock_entry;
+			goto error_finish_iomap;
 
 		__SetPageUptodate(vmf->cow_page);
 		vmf_ret = finish_fault(vmf);
 		if (!vmf_ret)
 			vmf_ret = VM_FAULT_DONE_COW;
-		goto unlock_entry;
+		goto finish_iomap;
 	}
 
 	switch (iomap.type) {
@@ -1214,7 +1214,7 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
 	case IOMAP_HOLE:
 		if (!(vmf->flags & FAULT_FLAG_WRITE)) {
 			vmf_ret = dax_load_hole(mapping, &entry, vmf);
-			goto unlock_entry;
+			goto finish_iomap;
 		}
 		/*FALLTHRU*/
 	default:
@@ -1223,10 +1223,8 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
 		break;
 	}
 
- error_unlock_entry:
+ error_finish_iomap:
 	vmf_ret = dax_fault_return(error) | major;
- unlock_entry:
-	put_locked_mapping_entry(mapping, vmf->pgoff, entry);
  finish_iomap:
 	if (ops->iomap_end) {
 		int copied = PAGE_SIZE;
@@ -1241,7 +1239,9 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
 		 */
 		ops->iomap_end(inode, pos, PAGE_SIZE, copied, flags, &iomap);
 	}
-out:
+ unlock_entry:
+	put_locked_mapping_entry(mapping, vmf->pgoff, entry);
+ out:
 	trace_dax_pte_fault_done(inode, vmf, vmf_ret);
 	return vmf_ret;
 }
-- 
2.12.0

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 4/4] dax: Fix data corruption when fault races with write
@ 2017-05-10  8:54   ` Jan Kara
  0 siblings, 0 replies; 39+ messages in thread
From: Jan Kara @ 2017-05-10  8:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ross Zwisler, Dan Williams, linux-ext4, linux-fsdevel,
	linux-nvdimm, linux-mm, Jan Kara, stable

Currently DAX read fault can race with write(2) in the following way:

CPU1 - write(2)			CPU2 - read fault
				dax_iomap_pte_fault()
				  ->iomap_begin() - sees hole
dax_iomap_rw()
  iomap_apply()
    ->iomap_begin - allocates blocks
    dax_iomap_actor()
      invalidate_inode_pages2_range()
        - there's nothing to invalidate
				  grab_mapping_entry()
				  - we add zero page in the radix tree
				    and map it to page tables

The result is that hole page is mapped into page tables (and thus zeros
are seen in mmap) while file has data written in that place.

Fix the problem by locking exception entry before mapping blocks for the
fault. That way we are sure invalidate_inode_pages2_range() call for
racing write will either block on entry lock waiting for the fault to
finish (and unmap stale page tables after that) or read fault will see
already allocated blocks by write(2).

Fixes: 9f141d6ef6258a3a37a045842d9ba7e68f368956
CC: stable@vger.kernel.org
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c | 32 ++++++++++++++++----------------
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 123d9903c77d..32f020c9cedf 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1148,6 +1148,12 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
 	if ((vmf->flags & FAULT_FLAG_WRITE) && !vmf->cow_page)
 		flags |= IOMAP_WRITE;
 
+	entry = grab_mapping_entry(mapping, vmf->pgoff, 0);
+	if (IS_ERR(entry)) {
+		vmf_ret = dax_fault_return(PTR_ERR(entry));
+		goto out;
+	}
+
 	/*
 	 * Note that we don't bother to use iomap_apply here: DAX required
 	 * the file system block size to be equal the page size, which means
@@ -1156,17 +1162,11 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
 	error = ops->iomap_begin(inode, pos, PAGE_SIZE, flags, &iomap);
 	if (error) {
 		vmf_ret = dax_fault_return(error);
-		goto out;
+		goto unlock_entry;
 	}
 	if (WARN_ON_ONCE(iomap.offset + iomap.length < pos + PAGE_SIZE)) {
-		vmf_ret = dax_fault_return(-EIO);	/* fs corruption? */
-		goto finish_iomap;
-	}
-
-	entry = grab_mapping_entry(mapping, vmf->pgoff, 0);
-	if (IS_ERR(entry)) {
-		vmf_ret = dax_fault_return(PTR_ERR(entry));
-		goto finish_iomap;
+		error = -EIO;	/* fs corruption? */
+		goto error_finish_iomap;
 	}
 
 	sector = dax_iomap_sector(&iomap, pos);
@@ -1188,13 +1188,13 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
 		}
 
 		if (error)
-			goto error_unlock_entry;
+			goto error_finish_iomap;
 
 		__SetPageUptodate(vmf->cow_page);
 		vmf_ret = finish_fault(vmf);
 		if (!vmf_ret)
 			vmf_ret = VM_FAULT_DONE_COW;
-		goto unlock_entry;
+		goto finish_iomap;
 	}
 
 	switch (iomap.type) {
@@ -1214,7 +1214,7 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
 	case IOMAP_HOLE:
 		if (!(vmf->flags & FAULT_FLAG_WRITE)) {
 			vmf_ret = dax_load_hole(mapping, &entry, vmf);
-			goto unlock_entry;
+			goto finish_iomap;
 		}
 		/*FALLTHRU*/
 	default:
@@ -1223,10 +1223,8 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
 		break;
 	}
 
- error_unlock_entry:
+ error_finish_iomap:
 	vmf_ret = dax_fault_return(error) | major;
- unlock_entry:
-	put_locked_mapping_entry(mapping, vmf->pgoff, entry);
  finish_iomap:
 	if (ops->iomap_end) {
 		int copied = PAGE_SIZE;
@@ -1241,7 +1239,9 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
 		 */
 		ops->iomap_end(inode, pos, PAGE_SIZE, copied, flags, &iomap);
 	}
-out:
+ unlock_entry:
+	put_locked_mapping_entry(mapping, vmf->pgoff, entry);
+ out:
 	trace_dax_pte_fault_done(inode, vmf, vmf_ret);
 	return vmf_ret;
 }
-- 
2.12.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 4/4] dax: Fix data corruption when fault races with write
@ 2017-05-10  8:54   ` Jan Kara
  0 siblings, 0 replies; 39+ messages in thread
From: Jan Kara @ 2017-05-10  8:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	stable-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-ext4-u79uwXL29TY76Z2rM5mHXA

Currently DAX read fault can race with write(2) in the following way:

CPU1 - write(2)			CPU2 - read fault
				dax_iomap_pte_fault()
				  ->iomap_begin() - sees hole
dax_iomap_rw()
  iomap_apply()
    ->iomap_begin - allocates blocks
    dax_iomap_actor()
      invalidate_inode_pages2_range()
        - there's nothing to invalidate
				  grab_mapping_entry()
				  - we add zero page in the radix tree
				    and map it to page tables

The result is that hole page is mapped into page tables (and thus zeros
are seen in mmap) while file has data written in that place.

Fix the problem by locking exception entry before mapping blocks for the
fault. That way we are sure invalidate_inode_pages2_range() call for
racing write will either block on entry lock waiting for the fault to
finish (and unmap stale page tables after that) or read fault will see
already allocated blocks by write(2).

Fixes: 9f141d6ef6258a3a37a045842d9ba7e68f368956
CC: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Reviewed-by: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
Signed-off-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
---
 fs/dax.c | 32 ++++++++++++++++----------------
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 123d9903c77d..32f020c9cedf 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1148,6 +1148,12 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
 	if ((vmf->flags & FAULT_FLAG_WRITE) && !vmf->cow_page)
 		flags |= IOMAP_WRITE;
 
+	entry = grab_mapping_entry(mapping, vmf->pgoff, 0);
+	if (IS_ERR(entry)) {
+		vmf_ret = dax_fault_return(PTR_ERR(entry));
+		goto out;
+	}
+
 	/*
 	 * Note that we don't bother to use iomap_apply here: DAX required
 	 * the file system block size to be equal the page size, which means
@@ -1156,17 +1162,11 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
 	error = ops->iomap_begin(inode, pos, PAGE_SIZE, flags, &iomap);
 	if (error) {
 		vmf_ret = dax_fault_return(error);
-		goto out;
+		goto unlock_entry;
 	}
 	if (WARN_ON_ONCE(iomap.offset + iomap.length < pos + PAGE_SIZE)) {
-		vmf_ret = dax_fault_return(-EIO);	/* fs corruption? */
-		goto finish_iomap;
-	}
-
-	entry = grab_mapping_entry(mapping, vmf->pgoff, 0);
-	if (IS_ERR(entry)) {
-		vmf_ret = dax_fault_return(PTR_ERR(entry));
-		goto finish_iomap;
+		error = -EIO;	/* fs corruption? */
+		goto error_finish_iomap;
 	}
 
 	sector = dax_iomap_sector(&iomap, pos);
@@ -1188,13 +1188,13 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
 		}
 
 		if (error)
-			goto error_unlock_entry;
+			goto error_finish_iomap;
 
 		__SetPageUptodate(vmf->cow_page);
 		vmf_ret = finish_fault(vmf);
 		if (!vmf_ret)
 			vmf_ret = VM_FAULT_DONE_COW;
-		goto unlock_entry;
+		goto finish_iomap;
 	}
 
 	switch (iomap.type) {
@@ -1214,7 +1214,7 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
 	case IOMAP_HOLE:
 		if (!(vmf->flags & FAULT_FLAG_WRITE)) {
 			vmf_ret = dax_load_hole(mapping, &entry, vmf);
-			goto unlock_entry;
+			goto finish_iomap;
 		}
 		/*FALLTHRU*/
 	default:
@@ -1223,10 +1223,8 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
 		break;
 	}
 
- error_unlock_entry:
+ error_finish_iomap:
 	vmf_ret = dax_fault_return(error) | major;
- unlock_entry:
-	put_locked_mapping_entry(mapping, vmf->pgoff, entry);
  finish_iomap:
 	if (ops->iomap_end) {
 		int copied = PAGE_SIZE;
@@ -1241,7 +1239,9 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
 		 */
 		ops->iomap_end(inode, pos, PAGE_SIZE, copied, flags, &iomap);
 	}
-out:
+ unlock_entry:
+	put_locked_mapping_entry(mapping, vmf->pgoff, entry);
+ out:
 	trace_dax_pte_fault_done(inode, vmf, vmf_ret);
 	return vmf_ret;
 }
-- 
2.12.0

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 4/4] dax: Fix data corruption when fault races with write
@ 2017-05-10  8:54   ` Jan Kara
  0 siblings, 0 replies; 39+ messages in thread
From: Jan Kara @ 2017-05-10  8:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ross Zwisler, Dan Williams, linux-ext4, linux-fsdevel,
	linux-nvdimm, linux-mm, Jan Kara, stable

Currently DAX read fault can race with write(2) in the following way:

CPU1 - write(2)			CPU2 - read fault
				dax_iomap_pte_fault()
				  ->iomap_begin() - sees hole
dax_iomap_rw()
  iomap_apply()
    ->iomap_begin - allocates blocks
    dax_iomap_actor()
      invalidate_inode_pages2_range()
        - there's nothing to invalidate
				  grab_mapping_entry()
				  - we add zero page in the radix tree
				    and map it to page tables

The result is that hole page is mapped into page tables (and thus zeros
are seen in mmap) while file has data written in that place.

Fix the problem by locking exception entry before mapping blocks for the
fault. That way we are sure invalidate_inode_pages2_range() call for
racing write will either block on entry lock waiting for the fault to
finish (and unmap stale page tables after that) or read fault will see
already allocated blocks by write(2).

Fixes: 9f141d6ef6258a3a37a045842d9ba7e68f368956
CC: stable@vger.kernel.org
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c | 32 ++++++++++++++++----------------
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 123d9903c77d..32f020c9cedf 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1148,6 +1148,12 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
 	if ((vmf->flags & FAULT_FLAG_WRITE) && !vmf->cow_page)
 		flags |= IOMAP_WRITE;
 
+	entry = grab_mapping_entry(mapping, vmf->pgoff, 0);
+	if (IS_ERR(entry)) {
+		vmf_ret = dax_fault_return(PTR_ERR(entry));
+		goto out;
+	}
+
 	/*
 	 * Note that we don't bother to use iomap_apply here: DAX required
 	 * the file system block size to be equal the page size, which means
@@ -1156,17 +1162,11 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
 	error = ops->iomap_begin(inode, pos, PAGE_SIZE, flags, &iomap);
 	if (error) {
 		vmf_ret = dax_fault_return(error);
-		goto out;
+		goto unlock_entry;
 	}
 	if (WARN_ON_ONCE(iomap.offset + iomap.length < pos + PAGE_SIZE)) {
-		vmf_ret = dax_fault_return(-EIO);	/* fs corruption? */
-		goto finish_iomap;
-	}
-
-	entry = grab_mapping_entry(mapping, vmf->pgoff, 0);
-	if (IS_ERR(entry)) {
-		vmf_ret = dax_fault_return(PTR_ERR(entry));
-		goto finish_iomap;
+		error = -EIO;	/* fs corruption? */
+		goto error_finish_iomap;
 	}
 
 	sector = dax_iomap_sector(&iomap, pos);
@@ -1188,13 +1188,13 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
 		}
 
 		if (error)
-			goto error_unlock_entry;
+			goto error_finish_iomap;
 
 		__SetPageUptodate(vmf->cow_page);
 		vmf_ret = finish_fault(vmf);
 		if (!vmf_ret)
 			vmf_ret = VM_FAULT_DONE_COW;
-		goto unlock_entry;
+		goto finish_iomap;
 	}
 
 	switch (iomap.type) {
@@ -1214,7 +1214,7 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
 	case IOMAP_HOLE:
 		if (!(vmf->flags & FAULT_FLAG_WRITE)) {
 			vmf_ret = dax_load_hole(mapping, &entry, vmf);
-			goto unlock_entry;
+			goto finish_iomap;
 		}
 		/*FALLTHRU*/
 	default:
@@ -1223,10 +1223,8 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
 		break;
 	}
 
- error_unlock_entry:
+ error_finish_iomap:
 	vmf_ret = dax_fault_return(error) | major;
- unlock_entry:
-	put_locked_mapping_entry(mapping, vmf->pgoff, entry);
  finish_iomap:
 	if (ops->iomap_end) {
 		int copied = PAGE_SIZE;
@@ -1241,7 +1239,9 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
 		 */
 		ops->iomap_end(inode, pos, PAGE_SIZE, copied, flags, &iomap);
 	}
-out:
+ unlock_entry:
+	put_locked_mapping_entry(mapping, vmf->pgoff, entry);
+ out:
 	trace_dax_pte_fault_done(inode, vmf, vmf_ret);
 	return vmf_ret;
 }
-- 
2.12.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 5/4] dax: Fix PMD data corruption when fault races with write
  2017-05-10  8:54   ` Jan Kara
@ 2017-05-10 17:27     ` Ross Zwisler
  -1 siblings, 0 replies; 39+ messages in thread
From: Ross Zwisler @ 2017-05-10 17:27 UTC (permalink / raw)
  To: Andrew Morton, Jan Kara
  Cc: linux-nvdimm, stable, linux-mm, linux-fsdevel, linux-ext4

This is based on a patch from Jan Kara that fixed the equivalent race in
the DAX PTE fault path.

Currently DAX PMD read fault can race with write(2) in the following way:

CPU1 - write(2)                 CPU2 - read fault
                                dax_iomap_pmd_fault()
                                  ->iomap_begin() - sees hole

dax_iomap_rw()
  iomap_apply()
    ->iomap_begin - allocates blocks
    dax_iomap_actor()
      invalidate_inode_pages2_range()
        - there's nothing to invalidate

                                  grab_mapping_entry()
				  - we add huge zero page to the radix tree
				    and map it to page tables

The result is that hole page is mapped into page tables (and thus zeros
are seen in mmap) while file has data written in that place.

Fix the problem by locking exception entry before mapping blocks for the
fault. That way we are sure invalidate_inode_pages2_range() call for
racing write will either block on entry lock waiting for the fault to
finish (and unmap stale page tables after that) or read fault will see
already allocated blocks by write(2).

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Fixes: 9f141d6ef6258a3a37a045842d9ba7e68f368956
CC: stable@vger.kernel.org
---

Jan, I just realized that we need an equivalent fix in the PMD path.  Let's
keep this with the rest of your series so they get applied together,
applied to stable together, etc.

This applies cleanly to the current linux/master (56868a460b83) + the four
patches from Jan's series.  I've run it through xfstests and some targeted
testing for the PMD path.

---
 fs/dax.c | 28 ++++++++++++++--------------
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 32f020c..93ae872 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1388,6 +1388,16 @@ static int dax_iomap_pmd_fault(struct vm_fault *vmf,
 		goto fallback;
 
 	/*
+	 * grab_mapping_entry() will make sure we get a 2M empty entry, a DAX
+	 * PMD or a HZP entry.  If it can't (because a 4k page is already in
+	 * the tree, for instance), it will return -EEXIST and we just fall
+	 * back to 4k entries.
+	 */
+	entry = grab_mapping_entry(mapping, pgoff, RADIX_DAX_PMD);
+	if (IS_ERR(entry))
+		goto fallback;
+
+	/*
 	 * Note that we don't use iomap_apply here.  We aren't doing I/O, only
 	 * setting up a mapping, so really we're using iomap_begin() as a way
 	 * to look up our filesystem block.
@@ -1395,21 +1405,11 @@ static int dax_iomap_pmd_fault(struct vm_fault *vmf,
 	pos = (loff_t)pgoff << PAGE_SHIFT;
 	error = ops->iomap_begin(inode, pos, PMD_SIZE, iomap_flags, &iomap);
 	if (error)
-		goto fallback;
+		goto unlock_entry;
 
 	if (iomap.offset + iomap.length < pos + PMD_SIZE)
 		goto finish_iomap;
 
-	/*
-	 * grab_mapping_entry() will make sure we get a 2M empty entry, a DAX
-	 * PMD or a HZP entry.  If it can't (because a 4k page is already in
-	 * the tree, for instance), it will return -EEXIST and we just fall
-	 * back to 4k entries.
-	 */
-	entry = grab_mapping_entry(mapping, pgoff, RADIX_DAX_PMD);
-	if (IS_ERR(entry))
-		goto finish_iomap;
-
 	switch (iomap.type) {
 	case IOMAP_MAPPED:
 		result = dax_pmd_insert_mapping(vmf, &iomap, pos, &entry);
@@ -1417,7 +1417,7 @@ static int dax_iomap_pmd_fault(struct vm_fault *vmf,
 	case IOMAP_UNWRITTEN:
 	case IOMAP_HOLE:
 		if (WARN_ON_ONCE(write))
-			goto unlock_entry;
+			break;
 		result = dax_pmd_load_hole(vmf, &iomap, &entry);
 		break;
 	default:
@@ -1425,8 +1425,6 @@ static int dax_iomap_pmd_fault(struct vm_fault *vmf,
 		break;
 	}
 
- unlock_entry:
-	put_locked_mapping_entry(mapping, pgoff, entry);
  finish_iomap:
 	if (ops->iomap_end) {
 		int copied = PMD_SIZE;
@@ -1442,6 +1440,8 @@ static int dax_iomap_pmd_fault(struct vm_fault *vmf,
 		ops->iomap_end(inode, pos, PMD_SIZE, copied, iomap_flags,
 				&iomap);
 	}
+ unlock_entry:
+	put_locked_mapping_entry(mapping, pgoff, entry);
  fallback:
 	if (result == VM_FAULT_FALLBACK) {
 		split_huge_pmd(vma, vmf->pmd, vmf->address);
-- 
2.9.3

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 5/4] dax: Fix PMD data corruption when fault races with write
@ 2017-05-10 17:27     ` Ross Zwisler
  0 siblings, 0 replies; 39+ messages in thread
From: Ross Zwisler @ 2017-05-10 17:27 UTC (permalink / raw)
  To: Andrew Morton, Jan Kara
  Cc: Ross Zwisler, Dan Williams, linux-ext4, linux-fsdevel,
	linux-nvdimm, linux-mm, stable

This is based on a patch from Jan Kara that fixed the equivalent race in
the DAX PTE fault path.

Currently DAX PMD read fault can race with write(2) in the following way:

CPU1 - write(2)                 CPU2 - read fault
                                dax_iomap_pmd_fault()
                                  ->iomap_begin() - sees hole

dax_iomap_rw()
  iomap_apply()
    ->iomap_begin - allocates blocks
    dax_iomap_actor()
      invalidate_inode_pages2_range()
        - there's nothing to invalidate

                                  grab_mapping_entry()
				  - we add huge zero page to the radix tree
				    and map it to page tables

The result is that hole page is mapped into page tables (and thus zeros
are seen in mmap) while file has data written in that place.

Fix the problem by locking exception entry before mapping blocks for the
fault. That way we are sure invalidate_inode_pages2_range() call for
racing write will either block on entry lock waiting for the fault to
finish (and unmap stale page tables after that) or read fault will see
already allocated blocks by write(2).

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Fixes: 9f141d6ef6258a3a37a045842d9ba7e68f368956
CC: stable@vger.kernel.org
---

Jan, I just realized that we need an equivalent fix in the PMD path.  Let's
keep this with the rest of your series so they get applied together,
applied to stable together, etc.

This applies cleanly to the current linux/master (56868a460b83) + the four
patches from Jan's series.  I've run it through xfstests and some targeted
testing for the PMD path.

---
 fs/dax.c | 28 ++++++++++++++--------------
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 32f020c..93ae872 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1388,6 +1388,16 @@ static int dax_iomap_pmd_fault(struct vm_fault *vmf,
 		goto fallback;
 
 	/*
+	 * grab_mapping_entry() will make sure we get a 2M empty entry, a DAX
+	 * PMD or a HZP entry.  If it can't (because a 4k page is already in
+	 * the tree, for instance), it will return -EEXIST and we just fall
+	 * back to 4k entries.
+	 */
+	entry = grab_mapping_entry(mapping, pgoff, RADIX_DAX_PMD);
+	if (IS_ERR(entry))
+		goto fallback;
+
+	/*
 	 * Note that we don't use iomap_apply here.  We aren't doing I/O, only
 	 * setting up a mapping, so really we're using iomap_begin() as a way
 	 * to look up our filesystem block.
@@ -1395,21 +1405,11 @@ static int dax_iomap_pmd_fault(struct vm_fault *vmf,
 	pos = (loff_t)pgoff << PAGE_SHIFT;
 	error = ops->iomap_begin(inode, pos, PMD_SIZE, iomap_flags, &iomap);
 	if (error)
-		goto fallback;
+		goto unlock_entry;
 
 	if (iomap.offset + iomap.length < pos + PMD_SIZE)
 		goto finish_iomap;
 
-	/*
-	 * grab_mapping_entry() will make sure we get a 2M empty entry, a DAX
-	 * PMD or a HZP entry.  If it can't (because a 4k page is already in
-	 * the tree, for instance), it will return -EEXIST and we just fall
-	 * back to 4k entries.
-	 */
-	entry = grab_mapping_entry(mapping, pgoff, RADIX_DAX_PMD);
-	if (IS_ERR(entry))
-		goto finish_iomap;
-
 	switch (iomap.type) {
 	case IOMAP_MAPPED:
 		result = dax_pmd_insert_mapping(vmf, &iomap, pos, &entry);
@@ -1417,7 +1417,7 @@ static int dax_iomap_pmd_fault(struct vm_fault *vmf,
 	case IOMAP_UNWRITTEN:
 	case IOMAP_HOLE:
 		if (WARN_ON_ONCE(write))
-			goto unlock_entry;
+			break;
 		result = dax_pmd_load_hole(vmf, &iomap, &entry);
 		break;
 	default:
@@ -1425,8 +1425,6 @@ static int dax_iomap_pmd_fault(struct vm_fault *vmf,
 		break;
 	}
 
- unlock_entry:
-	put_locked_mapping_entry(mapping, pgoff, entry);
  finish_iomap:
 	if (ops->iomap_end) {
 		int copied = PMD_SIZE;
@@ -1442,6 +1440,8 @@ static int dax_iomap_pmd_fault(struct vm_fault *vmf,
 		ops->iomap_end(inode, pos, PMD_SIZE, copied, iomap_flags,
 				&iomap);
 	}
+ unlock_entry:
+	put_locked_mapping_entry(mapping, pgoff, entry);
  fallback:
 	if (result == VM_FAULT_FALLBACK) {
 		split_huge_pmd(vma, vmf->pmd, vmf->address);
-- 
2.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 0/4 v4] mm,dax: Fix data corruption due to mmap inconsistency
  2017-05-10  8:54 ` Jan Kara
  (?)
@ 2017-05-10 17:27   ` Ross Zwisler
  -1 siblings, 0 replies; 39+ messages in thread
From: Ross Zwisler @ 2017-05-10 17:27 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-nvdimm, linux-mm, linux-fsdevel, Andrew Morton, linux-ext4

On Wed, May 10, 2017 at 10:54:15AM +0200, Jan Kara wrote:
> Hello,
> 
> this series fixes data corruption that can happen for DAX mounts when
> page faults race with write(2) and as a result page tables get out of sync
> with block mappings in the filesystem and thus data seen through mmap is
> different from data seen through read(2).
> 
> The series passes testing with t_mmap_stale test program from Ross and also
> other mmap related tests on DAX filesystem.
> 
> Andrew, can you please merge these patches? Thanks!
> 
> Changes since v3:
> * Rebased on top of current Linus' tree due to non-trivial conflicts with
>   added tracepoint

Cool, the merge update looks correct to me.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 0/4 v4] mm,dax: Fix data corruption due to mmap inconsistency
@ 2017-05-10 17:27   ` Ross Zwisler
  0 siblings, 0 replies; 39+ messages in thread
From: Ross Zwisler @ 2017-05-10 17:27 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, Ross Zwisler, Dan Williams, linux-ext4,
	linux-fsdevel, linux-nvdimm, linux-mm

On Wed, May 10, 2017 at 10:54:15AM +0200, Jan Kara wrote:
> Hello,
> 
> this series fixes data corruption that can happen for DAX mounts when
> page faults race with write(2) and as a result page tables get out of sync
> with block mappings in the filesystem and thus data seen through mmap is
> different from data seen through read(2).
> 
> The series passes testing with t_mmap_stale test program from Ross and also
> other mmap related tests on DAX filesystem.
> 
> Andrew, can you please merge these patches? Thanks!
> 
> Changes since v3:
> * Rebased on top of current Linus' tree due to non-trivial conflicts with
>   added tracepoint

Cool, the merge update looks correct to me.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 0/4 v4] mm,dax: Fix data corruption due to mmap inconsistency
@ 2017-05-10 17:27   ` Ross Zwisler
  0 siblings, 0 replies; 39+ messages in thread
From: Ross Zwisler @ 2017-05-10 17:27 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Andrew Morton,
	linux-ext4-u79uwXL29TY76Z2rM5mHXA

On Wed, May 10, 2017 at 10:54:15AM +0200, Jan Kara wrote:
> Hello,
> 
> this series fixes data corruption that can happen for DAX mounts when
> page faults race with write(2) and as a result page tables get out of sync
> with block mappings in the filesystem and thus data seen through mmap is
> different from data seen through read(2).
> 
> The series passes testing with t_mmap_stale test program from Ross and also
> other mmap related tests on DAX filesystem.
> 
> Andrew, can you please merge these patches? Thanks!
> 
> Changes since v3:
> * Rebased on top of current Linus' tree due to non-trivial conflicts with
>   added tracepoint

Cool, the merge update looks correct to me.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 5/4] dax: Fix PMD data corruption when fault races with write
  2017-05-10 17:27     ` Ross Zwisler
  (?)
@ 2017-05-11  8:39       ` Jan Kara
  -1 siblings, 0 replies; 39+ messages in thread
From: Jan Kara @ 2017-05-11  8:39 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Jan Kara, linux-nvdimm, stable, linux-mm, linux-fsdevel,
	Andrew Morton, linux-ext4

On Wed 10-05-17 11:27:00, Ross Zwisler wrote:
> This is based on a patch from Jan Kara that fixed the equivalent race in
> the DAX PTE fault path.
> 
> Currently DAX PMD read fault can race with write(2) in the following way:
> 
> CPU1 - write(2)                 CPU2 - read fault
>                                 dax_iomap_pmd_fault()
>                                   ->iomap_begin() - sees hole
> 
> dax_iomap_rw()
>   iomap_apply()
>     ->iomap_begin - allocates blocks
>     dax_iomap_actor()
>       invalidate_inode_pages2_range()
>         - there's nothing to invalidate
> 
>                                   grab_mapping_entry()
> 				  - we add huge zero page to the radix tree
> 				    and map it to page tables
> 
> The result is that hole page is mapped into page tables (and thus zeros
> are seen in mmap) while file has data written in that place.
> 
> Fix the problem by locking exception entry before mapping blocks for the
> fault. That way we are sure invalidate_inode_pages2_range() call for
> racing write will either block on entry lock waiting for the fault to
> finish (and unmap stale page tables after that) or read fault will see
> already allocated blocks by write(2).
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> Fixes: 9f141d6ef6258a3a37a045842d9ba7e68f368956
> CC: stable@vger.kernel.org
> ---
> 
> Jan, I just realized that we need an equivalent fix in the PMD path.  Let's
> keep this with the rest of your series so they get applied together,
> applied to stable together, etc.
> 
> This applies cleanly to the current linux/master (56868a460b83) + the four
> patches from Jan's series.  I've run it through xfstests and some targeted
> testing for the PMD path.

Ah, right. Thanks for fixing it up. The patch looks good. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/dax.c | 28 ++++++++++++++--------------
>  1 file changed, 14 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 32f020c..93ae872 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1388,6 +1388,16 @@ static int dax_iomap_pmd_fault(struct vm_fault *vmf,
>  		goto fallback;
>  
>  	/*
> +	 * grab_mapping_entry() will make sure we get a 2M empty entry, a DAX
> +	 * PMD or a HZP entry.  If it can't (because a 4k page is already in
> +	 * the tree, for instance), it will return -EEXIST and we just fall
> +	 * back to 4k entries.
> +	 */
> +	entry = grab_mapping_entry(mapping, pgoff, RADIX_DAX_PMD);
> +	if (IS_ERR(entry))
> +		goto fallback;
> +
> +	/*
>  	 * Note that we don't use iomap_apply here.  We aren't doing I/O, only
>  	 * setting up a mapping, so really we're using iomap_begin() as a way
>  	 * to look up our filesystem block.
> @@ -1395,21 +1405,11 @@ static int dax_iomap_pmd_fault(struct vm_fault *vmf,
>  	pos = (loff_t)pgoff << PAGE_SHIFT;
>  	error = ops->iomap_begin(inode, pos, PMD_SIZE, iomap_flags, &iomap);
>  	if (error)
> -		goto fallback;
> +		goto unlock_entry;
>  
>  	if (iomap.offset + iomap.length < pos + PMD_SIZE)
>  		goto finish_iomap;
>  
> -	/*
> -	 * grab_mapping_entry() will make sure we get a 2M empty entry, a DAX
> -	 * PMD or a HZP entry.  If it can't (because a 4k page is already in
> -	 * the tree, for instance), it will return -EEXIST and we just fall
> -	 * back to 4k entries.
> -	 */
> -	entry = grab_mapping_entry(mapping, pgoff, RADIX_DAX_PMD);
> -	if (IS_ERR(entry))
> -		goto finish_iomap;
> -
>  	switch (iomap.type) {
>  	case IOMAP_MAPPED:
>  		result = dax_pmd_insert_mapping(vmf, &iomap, pos, &entry);
> @@ -1417,7 +1417,7 @@ static int dax_iomap_pmd_fault(struct vm_fault *vmf,
>  	case IOMAP_UNWRITTEN:
>  	case IOMAP_HOLE:
>  		if (WARN_ON_ONCE(write))
> -			goto unlock_entry;
> +			break;
>  		result = dax_pmd_load_hole(vmf, &iomap, &entry);
>  		break;
>  	default:
> @@ -1425,8 +1425,6 @@ static int dax_iomap_pmd_fault(struct vm_fault *vmf,
>  		break;
>  	}
>  
> - unlock_entry:
> -	put_locked_mapping_entry(mapping, pgoff, entry);
>   finish_iomap:
>  	if (ops->iomap_end) {
>  		int copied = PMD_SIZE;
> @@ -1442,6 +1440,8 @@ static int dax_iomap_pmd_fault(struct vm_fault *vmf,
>  		ops->iomap_end(inode, pos, PMD_SIZE, copied, iomap_flags,
>  				&iomap);
>  	}
> + unlock_entry:
> +	put_locked_mapping_entry(mapping, pgoff, entry);
>   fallback:
>  	if (result == VM_FAULT_FALLBACK) {
>  		split_huge_pmd(vma, vmf->pmd, vmf->address);
> -- 
> 2.9.3
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 5/4] dax: Fix PMD data corruption when fault races with write
@ 2017-05-11  8:39       ` Jan Kara
  0 siblings, 0 replies; 39+ messages in thread
From: Jan Kara @ 2017-05-11  8:39 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Andrew Morton, Jan Kara, Dan Williams, linux-ext4, linux-fsdevel,
	linux-nvdimm, linux-mm, stable

On Wed 10-05-17 11:27:00, Ross Zwisler wrote:
> This is based on a patch from Jan Kara that fixed the equivalent race in
> the DAX PTE fault path.
> 
> Currently DAX PMD read fault can race with write(2) in the following way:
> 
> CPU1 - write(2)                 CPU2 - read fault
>                                 dax_iomap_pmd_fault()
>                                   ->iomap_begin() - sees hole
> 
> dax_iomap_rw()
>   iomap_apply()
>     ->iomap_begin - allocates blocks
>     dax_iomap_actor()
>       invalidate_inode_pages2_range()
>         - there's nothing to invalidate
> 
>                                   grab_mapping_entry()
> 				  - we add huge zero page to the radix tree
> 				    and map it to page tables
> 
> The result is that hole page is mapped into page tables (and thus zeros
> are seen in mmap) while file has data written in that place.
> 
> Fix the problem by locking exception entry before mapping blocks for the
> fault. That way we are sure invalidate_inode_pages2_range() call for
> racing write will either block on entry lock waiting for the fault to
> finish (and unmap stale page tables after that) or read fault will see
> already allocated blocks by write(2).
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> Fixes: 9f141d6ef6258a3a37a045842d9ba7e68f368956
> CC: stable@vger.kernel.org
> ---
> 
> Jan, I just realized that we need an equivalent fix in the PMD path.  Let's
> keep this with the rest of your series so they get applied together,
> applied to stable together, etc.
> 
> This applies cleanly to the current linux/master (56868a460b83) + the four
> patches from Jan's series.  I've run it through xfstests and some targeted
> testing for the PMD path.

Ah, right. Thanks for fixing it up. The patch looks good. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/dax.c | 28 ++++++++++++++--------------
>  1 file changed, 14 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 32f020c..93ae872 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1388,6 +1388,16 @@ static int dax_iomap_pmd_fault(struct vm_fault *vmf,
>  		goto fallback;
>  
>  	/*
> +	 * grab_mapping_entry() will make sure we get a 2M empty entry, a DAX
> +	 * PMD or a HZP entry.  If it can't (because a 4k page is already in
> +	 * the tree, for instance), it will return -EEXIST and we just fall
> +	 * back to 4k entries.
> +	 */
> +	entry = grab_mapping_entry(mapping, pgoff, RADIX_DAX_PMD);
> +	if (IS_ERR(entry))
> +		goto fallback;
> +
> +	/*
>  	 * Note that we don't use iomap_apply here.  We aren't doing I/O, only
>  	 * setting up a mapping, so really we're using iomap_begin() as a way
>  	 * to look up our filesystem block.
> @@ -1395,21 +1405,11 @@ static int dax_iomap_pmd_fault(struct vm_fault *vmf,
>  	pos = (loff_t)pgoff << PAGE_SHIFT;
>  	error = ops->iomap_begin(inode, pos, PMD_SIZE, iomap_flags, &iomap);
>  	if (error)
> -		goto fallback;
> +		goto unlock_entry;
>  
>  	if (iomap.offset + iomap.length < pos + PMD_SIZE)
>  		goto finish_iomap;
>  
> -	/*
> -	 * grab_mapping_entry() will make sure we get a 2M empty entry, a DAX
> -	 * PMD or a HZP entry.  If it can't (because a 4k page is already in
> -	 * the tree, for instance), it will return -EEXIST and we just fall
> -	 * back to 4k entries.
> -	 */
> -	entry = grab_mapping_entry(mapping, pgoff, RADIX_DAX_PMD);
> -	if (IS_ERR(entry))
> -		goto finish_iomap;
> -
>  	switch (iomap.type) {
>  	case IOMAP_MAPPED:
>  		result = dax_pmd_insert_mapping(vmf, &iomap, pos, &entry);
> @@ -1417,7 +1417,7 @@ static int dax_iomap_pmd_fault(struct vm_fault *vmf,
>  	case IOMAP_UNWRITTEN:
>  	case IOMAP_HOLE:
>  		if (WARN_ON_ONCE(write))
> -			goto unlock_entry;
> +			break;
>  		result = dax_pmd_load_hole(vmf, &iomap, &entry);
>  		break;
>  	default:
> @@ -1425,8 +1425,6 @@ static int dax_iomap_pmd_fault(struct vm_fault *vmf,
>  		break;
>  	}
>  
> - unlock_entry:
> -	put_locked_mapping_entry(mapping, pgoff, entry);
>   finish_iomap:
>  	if (ops->iomap_end) {
>  		int copied = PMD_SIZE;
> @@ -1442,6 +1440,8 @@ static int dax_iomap_pmd_fault(struct vm_fault *vmf,
>  		ops->iomap_end(inode, pos, PMD_SIZE, copied, iomap_flags,
>  				&iomap);
>  	}
> + unlock_entry:
> +	put_locked_mapping_entry(mapping, pgoff, entry);
>   fallback:
>  	if (result == VM_FAULT_FALLBACK) {
>  		split_huge_pmd(vma, vmf->pmd, vmf->address);
> -- 
> 2.9.3
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 5/4] dax: Fix PMD data corruption when fault races with write
@ 2017-05-11  8:39       ` Jan Kara
  0 siblings, 0 replies; 39+ messages in thread
From: Jan Kara @ 2017-05-11  8:39 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Andrew Morton, Jan Kara, Dan Williams, linux-ext4, linux-fsdevel,
	linux-nvdimm, linux-mm, stable

On Wed 10-05-17 11:27:00, Ross Zwisler wrote:
> This is based on a patch from Jan Kara that fixed the equivalent race in
> the DAX PTE fault path.
> 
> Currently DAX PMD read fault can race with write(2) in the following way:
> 
> CPU1 - write(2)                 CPU2 - read fault
>                                 dax_iomap_pmd_fault()
>                                   ->iomap_begin() - sees hole
> 
> dax_iomap_rw()
>   iomap_apply()
>     ->iomap_begin - allocates blocks
>     dax_iomap_actor()
>       invalidate_inode_pages2_range()
>         - there's nothing to invalidate
> 
>                                   grab_mapping_entry()
> 				  - we add huge zero page to the radix tree
> 				    and map it to page tables
> 
> The result is that hole page is mapped into page tables (and thus zeros
> are seen in mmap) while file has data written in that place.
> 
> Fix the problem by locking exception entry before mapping blocks for the
> fault. That way we are sure invalidate_inode_pages2_range() call for
> racing write will either block on entry lock waiting for the fault to
> finish (and unmap stale page tables after that) or read fault will see
> already allocated blocks by write(2).
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> Fixes: 9f141d6ef6258a3a37a045842d9ba7e68f368956
> CC: stable@vger.kernel.org
> ---
> 
> Jan, I just realized that we need an equivalent fix in the PMD path.  Let's
> keep this with the rest of your series so they get applied together,
> applied to stable together, etc.
> 
> This applies cleanly to the current linux/master (56868a460b83) + the four
> patches from Jan's series.  I've run it through xfstests and some targeted
> testing for the PMD path.

Ah, right. Thanks for fixing it up. The patch looks good. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/dax.c | 28 ++++++++++++++--------------
>  1 file changed, 14 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 32f020c..93ae872 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1388,6 +1388,16 @@ static int dax_iomap_pmd_fault(struct vm_fault *vmf,
>  		goto fallback;
>  
>  	/*
> +	 * grab_mapping_entry() will make sure we get a 2M empty entry, a DAX
> +	 * PMD or a HZP entry.  If it can't (because a 4k page is already in
> +	 * the tree, for instance), it will return -EEXIST and we just fall
> +	 * back to 4k entries.
> +	 */
> +	entry = grab_mapping_entry(mapping, pgoff, RADIX_DAX_PMD);
> +	if (IS_ERR(entry))
> +		goto fallback;
> +
> +	/*
>  	 * Note that we don't use iomap_apply here.  We aren't doing I/O, only
>  	 * setting up a mapping, so really we're using iomap_begin() as a way
>  	 * to look up our filesystem block.
> @@ -1395,21 +1405,11 @@ static int dax_iomap_pmd_fault(struct vm_fault *vmf,
>  	pos = (loff_t)pgoff << PAGE_SHIFT;
>  	error = ops->iomap_begin(inode, pos, PMD_SIZE, iomap_flags, &iomap);
>  	if (error)
> -		goto fallback;
> +		goto unlock_entry;
>  
>  	if (iomap.offset + iomap.length < pos + PMD_SIZE)
>  		goto finish_iomap;
>  
> -	/*
> -	 * grab_mapping_entry() will make sure we get a 2M empty entry, a DAX
> -	 * PMD or a HZP entry.  If it can't (because a 4k page is already in
> -	 * the tree, for instance), it will return -EEXIST and we just fall
> -	 * back to 4k entries.
> -	 */
> -	entry = grab_mapping_entry(mapping, pgoff, RADIX_DAX_PMD);
> -	if (IS_ERR(entry))
> -		goto finish_iomap;
> -
>  	switch (iomap.type) {
>  	case IOMAP_MAPPED:
>  		result = dax_pmd_insert_mapping(vmf, &iomap, pos, &entry);
> @@ -1417,7 +1417,7 @@ static int dax_iomap_pmd_fault(struct vm_fault *vmf,
>  	case IOMAP_UNWRITTEN:
>  	case IOMAP_HOLE:
>  		if (WARN_ON_ONCE(write))
> -			goto unlock_entry;
> +			break;
>  		result = dax_pmd_load_hole(vmf, &iomap, &entry);
>  		break;
>  	default:
> @@ -1425,8 +1425,6 @@ static int dax_iomap_pmd_fault(struct vm_fault *vmf,
>  		break;
>  	}
>  
> - unlock_entry:
> -	put_locked_mapping_entry(mapping, pgoff, entry);
>   finish_iomap:
>  	if (ops->iomap_end) {
>  		int copied = PMD_SIZE;
> @@ -1442,6 +1440,8 @@ static int dax_iomap_pmd_fault(struct vm_fault *vmf,
>  		ops->iomap_end(inode, pos, PMD_SIZE, copied, iomap_flags,
>  				&iomap);
>  	}
> + unlock_entry:
> +	put_locked_mapping_entry(mapping, pgoff, entry);
>   fallback:
>  	if (result == VM_FAULT_FALLBACK) {
>  		split_huge_pmd(vma, vmf->pmd, vmf->address);
> -- 
> 2.9.3
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 4/4] dax: Fix data corruption when fault races with write
  2017-05-09 12:18 [PATCH 0/4 v3] " Jan Kara
  2017-05-09 12:18   ` Jan Kara
  (?)
@ 2017-05-09 12:18   ` Jan Kara
  0 siblings, 0 replies; 39+ messages in thread
From: Jan Kara @ 2017-05-09 12:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, linux-nvdimm, stable, linux-mm, linux-fsdevel, linux-ext4

Currently DAX read fault can race with write(2) in the following way:

CPU1 - write(2)			CPU2 - read fault
				dax_iomap_pte_fault()
				  ->iomap_begin() - sees hole
dax_iomap_rw()
  iomap_apply()
    ->iomap_begin - allocates blocks
    dax_iomap_actor()
      invalidate_inode_pages2_range()
        - there's nothing to invalidate
				  grab_mapping_entry()
				  - we add zero page in the radix tree
				    and map it to page tables

The result is that hole page is mapped into page tables (and thus zeros
are seen in mmap) while file has data written in that place.

Fix the problem by locking exception entry before mapping blocks for the
fault. That way we are sure invalidate_inode_pages2_range() call for
racing write will either block on entry lock waiting for the fault to
finish (and unmap stale page tables after that) or read fault will see
already allocated blocks by write(2).

Fixes: 9f141d6ef6258a3a37a045842d9ba7e68f368956
CC: stable@vger.kernel.org
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c | 32 ++++++++++++++++----------------
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 72853669a356..f5071249d456 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1124,23 +1124,23 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
 	if ((vmf->flags & FAULT_FLAG_WRITE) && !vmf->cow_page)
 		flags |= IOMAP_WRITE;
 
+	entry = grab_mapping_entry(mapping, vmf->pgoff, 0);
+	if (IS_ERR(entry))
+		return dax_fault_return(PTR_ERR(entry));
+
 	/*
 	 * Note that we don't bother to use iomap_apply here: DAX required
 	 * the file system block size to be equal the page size, which means
 	 * that we never have to deal with more than a single extent here.
 	 */
 	error = ops->iomap_begin(inode, pos, PAGE_SIZE, flags, &iomap);
-	if (error)
-		return dax_fault_return(error);
-	if (WARN_ON_ONCE(iomap.offset + iomap.length < pos + PAGE_SIZE)) {
-		vmf_ret = dax_fault_return(-EIO);	/* fs corruption? */
-		goto finish_iomap;
+	if (error) {
+		vmf_ret = dax_fault_return(error);
+		goto unlock_entry;
 	}
-
-	entry = grab_mapping_entry(mapping, vmf->pgoff, 0);
-	if (IS_ERR(entry)) {
-		vmf_ret = dax_fault_return(PTR_ERR(entry));
-		goto finish_iomap;
+	if (WARN_ON_ONCE(iomap.offset + iomap.length < pos + PAGE_SIZE)) {
+		error = -EIO;	/* fs corruption? */
+		goto error_finish_iomap;
 	}
 
 	sector = dax_iomap_sector(&iomap, pos);
@@ -1162,13 +1162,13 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
 		}
 
 		if (error)
-			goto error_unlock_entry;
+			goto error_finish_iomap;
 
 		__SetPageUptodate(vmf->cow_page);
 		vmf_ret = finish_fault(vmf);
 		if (!vmf_ret)
 			vmf_ret = VM_FAULT_DONE_COW;
-		goto unlock_entry;
+		goto finish_iomap;
 	}
 
 	switch (iomap.type) {
@@ -1188,7 +1188,7 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
 	case IOMAP_HOLE:
 		if (!(vmf->flags & FAULT_FLAG_WRITE)) {
 			vmf_ret = dax_load_hole(mapping, &entry, vmf);
-			goto unlock_entry;
+			goto finish_iomap;
 		}
 		/*FALLTHRU*/
 	default:
@@ -1197,10 +1197,8 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
 		break;
 	}
 
- error_unlock_entry:
+ error_finish_iomap:
 	vmf_ret = dax_fault_return(error) | major;
- unlock_entry:
-	put_locked_mapping_entry(mapping, vmf->pgoff, entry);
  finish_iomap:
 	if (ops->iomap_end) {
 		int copied = PAGE_SIZE;
@@ -1215,6 +1213,8 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
 		 */
 		ops->iomap_end(inode, pos, PAGE_SIZE, copied, flags, &iomap);
 	}
+ unlock_entry:
+	put_locked_mapping_entry(mapping, vmf->pgoff, entry);
 	return vmf_ret;
 }
 
-- 
2.12.0

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 4/4] dax: Fix data corruption when fault races with write
@ 2017-05-09 12:18   ` Jan Kara
  0 siblings, 0 replies; 39+ messages in thread
From: Jan Kara @ 2017-05-09 12:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ross Zwisler, Dan Williams, linux-fsdevel, linux-ext4,
	linux-nvdimm, linux-mm, Jan Kara, stable

Currently DAX read fault can race with write(2) in the following way:

CPU1 - write(2)			CPU2 - read fault
				dax_iomap_pte_fault()
				  ->iomap_begin() - sees hole
dax_iomap_rw()
  iomap_apply()
    ->iomap_begin - allocates blocks
    dax_iomap_actor()
      invalidate_inode_pages2_range()
        - there's nothing to invalidate
				  grab_mapping_entry()
				  - we add zero page in the radix tree
				    and map it to page tables

The result is that hole page is mapped into page tables (and thus zeros
are seen in mmap) while file has data written in that place.

Fix the problem by locking exception entry before mapping blocks for the
fault. That way we are sure invalidate_inode_pages2_range() call for
racing write will either block on entry lock waiting for the fault to
finish (and unmap stale page tables after that) or read fault will see
already allocated blocks by write(2).

Fixes: 9f141d6ef6258a3a37a045842d9ba7e68f368956
CC: stable@vger.kernel.org
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c | 32 ++++++++++++++++----------------
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 72853669a356..f5071249d456 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1124,23 +1124,23 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
 	if ((vmf->flags & FAULT_FLAG_WRITE) && !vmf->cow_page)
 		flags |= IOMAP_WRITE;
 
+	entry = grab_mapping_entry(mapping, vmf->pgoff, 0);
+	if (IS_ERR(entry))
+		return dax_fault_return(PTR_ERR(entry));
+
 	/*
 	 * Note that we don't bother to use iomap_apply here: DAX required
 	 * the file system block size to be equal the page size, which means
 	 * that we never have to deal with more than a single extent here.
 	 */
 	error = ops->iomap_begin(inode, pos, PAGE_SIZE, flags, &iomap);
-	if (error)
-		return dax_fault_return(error);
-	if (WARN_ON_ONCE(iomap.offset + iomap.length < pos + PAGE_SIZE)) {
-		vmf_ret = dax_fault_return(-EIO);	/* fs corruption? */
-		goto finish_iomap;
+	if (error) {
+		vmf_ret = dax_fault_return(error);
+		goto unlock_entry;
 	}
-
-	entry = grab_mapping_entry(mapping, vmf->pgoff, 0);
-	if (IS_ERR(entry)) {
-		vmf_ret = dax_fault_return(PTR_ERR(entry));
-		goto finish_iomap;
+	if (WARN_ON_ONCE(iomap.offset + iomap.length < pos + PAGE_SIZE)) {
+		error = -EIO;	/* fs corruption? */
+		goto error_finish_iomap;
 	}
 
 	sector = dax_iomap_sector(&iomap, pos);
@@ -1162,13 +1162,13 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
 		}
 
 		if (error)
-			goto error_unlock_entry;
+			goto error_finish_iomap;
 
 		__SetPageUptodate(vmf->cow_page);
 		vmf_ret = finish_fault(vmf);
 		if (!vmf_ret)
 			vmf_ret = VM_FAULT_DONE_COW;
-		goto unlock_entry;
+		goto finish_iomap;
 	}
 
 	switch (iomap.type) {
@@ -1188,7 +1188,7 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
 	case IOMAP_HOLE:
 		if (!(vmf->flags & FAULT_FLAG_WRITE)) {
 			vmf_ret = dax_load_hole(mapping, &entry, vmf);
-			goto unlock_entry;
+			goto finish_iomap;
 		}
 		/*FALLTHRU*/
 	default:
@@ -1197,10 +1197,8 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
 		break;
 	}
 
- error_unlock_entry:
+ error_finish_iomap:
 	vmf_ret = dax_fault_return(error) | major;
- unlock_entry:
-	put_locked_mapping_entry(mapping, vmf->pgoff, entry);
  finish_iomap:
 	if (ops->iomap_end) {
 		int copied = PAGE_SIZE;
@@ -1215,6 +1213,8 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
 		 */
 		ops->iomap_end(inode, pos, PAGE_SIZE, copied, flags, &iomap);
 	}
+ unlock_entry:
+	put_locked_mapping_entry(mapping, vmf->pgoff, entry);
 	return vmf_ret;
 }
 
-- 
2.12.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 4/4] dax: Fix data corruption when fault races with write
@ 2017-05-09 12:18   ` Jan Kara
  0 siblings, 0 replies; 39+ messages in thread
From: Jan Kara @ 2017-05-09 12:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	stable-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-ext4-u79uwXL29TY76Z2rM5mHXA

Currently DAX read fault can race with write(2) in the following way:

CPU1 - write(2)			CPU2 - read fault
				dax_iomap_pte_fault()
				  ->iomap_begin() - sees hole
dax_iomap_rw()
  iomap_apply()
    ->iomap_begin - allocates blocks
    dax_iomap_actor()
      invalidate_inode_pages2_range()
        - there's nothing to invalidate
				  grab_mapping_entry()
				  - we add zero page in the radix tree
				    and map it to page tables

The result is that hole page is mapped into page tables (and thus zeros
are seen in mmap) while file has data written in that place.

Fix the problem by locking exception entry before mapping blocks for the
fault. That way we are sure invalidate_inode_pages2_range() call for
racing write will either block on entry lock waiting for the fault to
finish (and unmap stale page tables after that) or read fault will see
already allocated blocks by write(2).

Fixes: 9f141d6ef6258a3a37a045842d9ba7e68f368956
CC: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Reviewed-by: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
Signed-off-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
---
 fs/dax.c | 32 ++++++++++++++++----------------
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 72853669a356..f5071249d456 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1124,23 +1124,23 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
 	if ((vmf->flags & FAULT_FLAG_WRITE) && !vmf->cow_page)
 		flags |= IOMAP_WRITE;
 
+	entry = grab_mapping_entry(mapping, vmf->pgoff, 0);
+	if (IS_ERR(entry))
+		return dax_fault_return(PTR_ERR(entry));
+
 	/*
 	 * Note that we don't bother to use iomap_apply here: DAX required
 	 * the file system block size to be equal the page size, which means
 	 * that we never have to deal with more than a single extent here.
 	 */
 	error = ops->iomap_begin(inode, pos, PAGE_SIZE, flags, &iomap);
-	if (error)
-		return dax_fault_return(error);
-	if (WARN_ON_ONCE(iomap.offset + iomap.length < pos + PAGE_SIZE)) {
-		vmf_ret = dax_fault_return(-EIO);	/* fs corruption? */
-		goto finish_iomap;
+	if (error) {
+		vmf_ret = dax_fault_return(error);
+		goto unlock_entry;
 	}
-
-	entry = grab_mapping_entry(mapping, vmf->pgoff, 0);
-	if (IS_ERR(entry)) {
-		vmf_ret = dax_fault_return(PTR_ERR(entry));
-		goto finish_iomap;
+	if (WARN_ON_ONCE(iomap.offset + iomap.length < pos + PAGE_SIZE)) {
+		error = -EIO;	/* fs corruption? */
+		goto error_finish_iomap;
 	}
 
 	sector = dax_iomap_sector(&iomap, pos);
@@ -1162,13 +1162,13 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
 		}
 
 		if (error)
-			goto error_unlock_entry;
+			goto error_finish_iomap;
 
 		__SetPageUptodate(vmf->cow_page);
 		vmf_ret = finish_fault(vmf);
 		if (!vmf_ret)
 			vmf_ret = VM_FAULT_DONE_COW;
-		goto unlock_entry;
+		goto finish_iomap;
 	}
 
 	switch (iomap.type) {
@@ -1188,7 +1188,7 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
 	case IOMAP_HOLE:
 		if (!(vmf->flags & FAULT_FLAG_WRITE)) {
 			vmf_ret = dax_load_hole(mapping, &entry, vmf);
-			goto unlock_entry;
+			goto finish_iomap;
 		}
 		/*FALLTHRU*/
 	default:
@@ -1197,10 +1197,8 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
 		break;
 	}
 
- error_unlock_entry:
+ error_finish_iomap:
 	vmf_ret = dax_fault_return(error) | major;
- unlock_entry:
-	put_locked_mapping_entry(mapping, vmf->pgoff, entry);
  finish_iomap:
 	if (ops->iomap_end) {
 		int copied = PAGE_SIZE;
@@ -1215,6 +1213,8 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
 		 */
 		ops->iomap_end(inode, pos, PAGE_SIZE, copied, flags, &iomap);
 	}
+ unlock_entry:
+	put_locked_mapping_entry(mapping, vmf->pgoff, entry);
 	return vmf_ret;
 }
 
-- 
2.12.0

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 4/4] dax: Fix data corruption when fault races with write
@ 2017-05-09 12:18   ` Jan Kara
  0 siblings, 0 replies; 39+ messages in thread
From: Jan Kara @ 2017-05-09 12:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ross Zwisler, Dan Williams, linux-fsdevel, linux-ext4,
	linux-nvdimm, linux-mm, Jan Kara, stable

Currently DAX read fault can race with write(2) in the following way:

CPU1 - write(2)			CPU2 - read fault
				dax_iomap_pte_fault()
				  ->iomap_begin() - sees hole
dax_iomap_rw()
  iomap_apply()
    ->iomap_begin - allocates blocks
    dax_iomap_actor()
      invalidate_inode_pages2_range()
        - there's nothing to invalidate
				  grab_mapping_entry()
				  - we add zero page in the radix tree
				    and map it to page tables

The result is that hole page is mapped into page tables (and thus zeros
are seen in mmap) while file has data written in that place.

Fix the problem by locking exception entry before mapping blocks for the
fault. That way we are sure invalidate_inode_pages2_range() call for
racing write will either block on entry lock waiting for the fault to
finish (and unmap stale page tables after that) or read fault will see
already allocated blocks by write(2).

Fixes: 9f141d6ef6258a3a37a045842d9ba7e68f368956
CC: stable@vger.kernel.org
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c | 32 ++++++++++++++++----------------
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 72853669a356..f5071249d456 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1124,23 +1124,23 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
 	if ((vmf->flags & FAULT_FLAG_WRITE) && !vmf->cow_page)
 		flags |= IOMAP_WRITE;
 
+	entry = grab_mapping_entry(mapping, vmf->pgoff, 0);
+	if (IS_ERR(entry))
+		return dax_fault_return(PTR_ERR(entry));
+
 	/*
 	 * Note that we don't bother to use iomap_apply here: DAX required
 	 * the file system block size to be equal the page size, which means
 	 * that we never have to deal with more than a single extent here.
 	 */
 	error = ops->iomap_begin(inode, pos, PAGE_SIZE, flags, &iomap);
-	if (error)
-		return dax_fault_return(error);
-	if (WARN_ON_ONCE(iomap.offset + iomap.length < pos + PAGE_SIZE)) {
-		vmf_ret = dax_fault_return(-EIO);	/* fs corruption? */
-		goto finish_iomap;
+	if (error) {
+		vmf_ret = dax_fault_return(error);
+		goto unlock_entry;
 	}
-
-	entry = grab_mapping_entry(mapping, vmf->pgoff, 0);
-	if (IS_ERR(entry)) {
-		vmf_ret = dax_fault_return(PTR_ERR(entry));
-		goto finish_iomap;
+	if (WARN_ON_ONCE(iomap.offset + iomap.length < pos + PAGE_SIZE)) {
+		error = -EIO;	/* fs corruption? */
+		goto error_finish_iomap;
 	}
 
 	sector = dax_iomap_sector(&iomap, pos);
@@ -1162,13 +1162,13 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
 		}
 
 		if (error)
-			goto error_unlock_entry;
+			goto error_finish_iomap;
 
 		__SetPageUptodate(vmf->cow_page);
 		vmf_ret = finish_fault(vmf);
 		if (!vmf_ret)
 			vmf_ret = VM_FAULT_DONE_COW;
-		goto unlock_entry;
+		goto finish_iomap;
 	}
 
 	switch (iomap.type) {
@@ -1188,7 +1188,7 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
 	case IOMAP_HOLE:
 		if (!(vmf->flags & FAULT_FLAG_WRITE)) {
 			vmf_ret = dax_load_hole(mapping, &entry, vmf);
-			goto unlock_entry;
+			goto finish_iomap;
 		}
 		/*FALLTHRU*/
 	default:
@@ -1197,10 +1197,8 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
 		break;
 	}
 
- error_unlock_entry:
+ error_finish_iomap:
 	vmf_ret = dax_fault_return(error) | major;
- unlock_entry:
-	put_locked_mapping_entry(mapping, vmf->pgoff, entry);
  finish_iomap:
 	if (ops->iomap_end) {
 		int copied = PAGE_SIZE;
@@ -1215,6 +1213,8 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
 		 */
 		ops->iomap_end(inode, pos, PAGE_SIZE, copied, flags, &iomap);
 	}
+ unlock_entry:
+	put_locked_mapping_entry(mapping, vmf->pgoff, entry);
 	return vmf_ret;
 }
 
-- 
2.12.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/4] dax: Fix data corruption when fault races with write
  2017-05-08 17:25     ` Ross Zwisler
  (?)
@ 2017-05-09 12:14       ` Jan Kara
  -1 siblings, 0 replies; 39+ messages in thread
From: Jan Kara @ 2017-05-09 12:14 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Jan Kara, linux-nvdimm, stable, linux-fsdevel, Andrew Morton, linux-ext4

On Mon 08-05-17 11:25:27, Ross Zwisler wrote:
> On Fri, May 05, 2017 at 09:25:00AM +0200, Jan Kara wrote:
> > Currently DAX read fault can race with write(2) in the following way:
> > 
> > CPU1 - write(2)			CPU2 - read fault
> > 				dax_iomap_pte_fault()
> > 				  ->iomap_begin() - sees hole
> > dax_iomap_rw()
> >   iomap_apply()
> >     ->iomap_begin - allocates blocks
> >     dax_iomap_actor()
> >       invalidate_inode_pages2_range()
> >         - there's nothing to invalidate
> > 				  grab_mapping_entry()
> > 				  - we add zero page in the radix tree
> > 				    and map it to page tables
> > 
> > The result is that hole page is mapped into page tables (and thus zeros
> > are seen in mmap) while file has data written in that place.
> > 
> > Fix the problem by locking exception entry before mapping blocks for the
> > fault. That way we are sure invalidate_inode_pages2_range() call for
> > racing write will either block on entry lock waiting for the fault to
> > finish (and unmap stale page tables after that) or read fault will see
> > already allocated blocks by write(2).
> > 
> > Fixes: 9f141d6ef6258a3a37a045842d9ba7e68f368956
> > CC: stable@vger.kernel.org
> > Signed-off-by: Jan Kara <jack@suse.cz>
> 
> Yep, this looks correct to me.  Thanks!
> 
> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>

Thanks. I'll add your reviewed-by tag and send patches to Andrew for
inclusion.
								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/4] dax: Fix data corruption when fault races with write
@ 2017-05-09 12:14       ` Jan Kara
  0 siblings, 0 replies; 39+ messages in thread
From: Jan Kara @ 2017-05-09 12:14 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Jan Kara, Andrew Morton, Dan Williams, linux-fsdevel, linux-ext4,
	linux-nvdimm, stable

On Mon 08-05-17 11:25:27, Ross Zwisler wrote:
> On Fri, May 05, 2017 at 09:25:00AM +0200, Jan Kara wrote:
> > Currently DAX read fault can race with write(2) in the following way:
> > 
> > CPU1 - write(2)			CPU2 - read fault
> > 				dax_iomap_pte_fault()
> > 				  ->iomap_begin() - sees hole
> > dax_iomap_rw()
> >   iomap_apply()
> >     ->iomap_begin - allocates blocks
> >     dax_iomap_actor()
> >       invalidate_inode_pages2_range()
> >         - there's nothing to invalidate
> > 				  grab_mapping_entry()
> > 				  - we add zero page in the radix tree
> > 				    and map it to page tables
> > 
> > The result is that hole page is mapped into page tables (and thus zeros
> > are seen in mmap) while file has data written in that place.
> > 
> > Fix the problem by locking exception entry before mapping blocks for the
> > fault. That way we are sure invalidate_inode_pages2_range() call for
> > racing write will either block on entry lock waiting for the fault to
> > finish (and unmap stale page tables after that) or read fault will see
> > already allocated blocks by write(2).
> > 
> > Fixes: 9f141d6ef6258a3a37a045842d9ba7e68f368956
> > CC: stable@vger.kernel.org
> > Signed-off-by: Jan Kara <jack@suse.cz>
> 
> Yep, this looks correct to me.  Thanks!
> 
> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>

Thanks. I'll add your reviewed-by tag and send patches to Andrew for
inclusion.
								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/4] dax: Fix data corruption when fault races with write
@ 2017-05-09 12:14       ` Jan Kara
  0 siblings, 0 replies; 39+ messages in thread
From: Jan Kara @ 2017-05-09 12:14 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Jan Kara, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	stable-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Andrew Morton,
	linux-ext4-u79uwXL29TY76Z2rM5mHXA

On Mon 08-05-17 11:25:27, Ross Zwisler wrote:
> On Fri, May 05, 2017 at 09:25:00AM +0200, Jan Kara wrote:
> > Currently DAX read fault can race with write(2) in the following way:
> > 
> > CPU1 - write(2)			CPU2 - read fault
> > 				dax_iomap_pte_fault()
> > 				  ->iomap_begin() - sees hole
> > dax_iomap_rw()
> >   iomap_apply()
> >     ->iomap_begin - allocates blocks
> >     dax_iomap_actor()
> >       invalidate_inode_pages2_range()
> >         - there's nothing to invalidate
> > 				  grab_mapping_entry()
> > 				  - we add zero page in the radix tree
> > 				    and map it to page tables
> > 
> > The result is that hole page is mapped into page tables (and thus zeros
> > are seen in mmap) while file has data written in that place.
> > 
> > Fix the problem by locking exception entry before mapping blocks for the
> > fault. That way we are sure invalidate_inode_pages2_range() call for
> > racing write will either block on entry lock waiting for the fault to
> > finish (and unmap stale page tables after that) or read fault will see
> > already allocated blocks by write(2).
> > 
> > Fixes: 9f141d6ef6258a3a37a045842d9ba7e68f368956
> > CC: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> > Signed-off-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
> 
> Yep, this looks correct to me.  Thanks!
> 
> Reviewed-by: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>

Thanks. I'll add your reviewed-by tag and send patches to Andrew for
inclusion.
								Honza

-- 
Jan Kara <jack-IBi9RG/b67k@public.gmane.org>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/4] dax: Fix data corruption when fault races with write
  2017-05-05  7:25   ` Jan Kara
  (?)
@ 2017-05-08 17:25     ` Ross Zwisler
  -1 siblings, 0 replies; 39+ messages in thread
From: Ross Zwisler @ 2017-05-08 17:25 UTC (permalink / raw)
  To: Jan Kara; +Cc: Andrew Morton, linux-nvdimm, stable, linux-fsdevel, linux-ext4

On Fri, May 05, 2017 at 09:25:00AM +0200, Jan Kara wrote:
> Currently DAX read fault can race with write(2) in the following way:
> 
> CPU1 - write(2)			CPU2 - read fault
> 				dax_iomap_pte_fault()
> 				  ->iomap_begin() - sees hole
> dax_iomap_rw()
>   iomap_apply()
>     ->iomap_begin - allocates blocks
>     dax_iomap_actor()
>       invalidate_inode_pages2_range()
>         - there's nothing to invalidate
> 				  grab_mapping_entry()
> 				  - we add zero page in the radix tree
> 				    and map it to page tables
> 
> The result is that hole page is mapped into page tables (and thus zeros
> are seen in mmap) while file has data written in that place.
> 
> Fix the problem by locking exception entry before mapping blocks for the
> fault. That way we are sure invalidate_inode_pages2_range() call for
> racing write will either block on entry lock waiting for the fault to
> finish (and unmap stale page tables after that) or read fault will see
> already allocated blocks by write(2).
> 
> Fixes: 9f141d6ef6258a3a37a045842d9ba7e68f368956
> CC: stable@vger.kernel.org
> Signed-off-by: Jan Kara <jack@suse.cz>

Yep, this looks correct to me.  Thanks!

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/4] dax: Fix data corruption when fault races with write
@ 2017-05-08 17:25     ` Ross Zwisler
  0 siblings, 0 replies; 39+ messages in thread
From: Ross Zwisler @ 2017-05-08 17:25 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ross Zwisler, Andrew Morton, Dan Williams, linux-fsdevel,
	linux-ext4, linux-nvdimm, stable

On Fri, May 05, 2017 at 09:25:00AM +0200, Jan Kara wrote:
> Currently DAX read fault can race with write(2) in the following way:
> 
> CPU1 - write(2)			CPU2 - read fault
> 				dax_iomap_pte_fault()
> 				  ->iomap_begin() - sees hole
> dax_iomap_rw()
>   iomap_apply()
>     ->iomap_begin - allocates blocks
>     dax_iomap_actor()
>       invalidate_inode_pages2_range()
>         - there's nothing to invalidate
> 				  grab_mapping_entry()
> 				  - we add zero page in the radix tree
> 				    and map it to page tables
> 
> The result is that hole page is mapped into page tables (and thus zeros
> are seen in mmap) while file has data written in that place.
> 
> Fix the problem by locking exception entry before mapping blocks for the
> fault. That way we are sure invalidate_inode_pages2_range() call for
> racing write will either block on entry lock waiting for the fault to
> finish (and unmap stale page tables after that) or read fault will see
> already allocated blocks by write(2).
> 
> Fixes: 9f141d6ef6258a3a37a045842d9ba7e68f368956
> CC: stable@vger.kernel.org
> Signed-off-by: Jan Kara <jack@suse.cz>

Yep, this looks correct to me.  Thanks!

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/4] dax: Fix data corruption when fault races with write
@ 2017-05-08 17:25     ` Ross Zwisler
  0 siblings, 0 replies; 39+ messages in thread
From: Ross Zwisler @ 2017-05-08 17:25 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	stable-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-ext4-u79uwXL29TY76Z2rM5mHXA

On Fri, May 05, 2017 at 09:25:00AM +0200, Jan Kara wrote:
> Currently DAX read fault can race with write(2) in the following way:
> 
> CPU1 - write(2)			CPU2 - read fault
> 				dax_iomap_pte_fault()
> 				  ->iomap_begin() - sees hole
> dax_iomap_rw()
>   iomap_apply()
>     ->iomap_begin - allocates blocks
>     dax_iomap_actor()
>       invalidate_inode_pages2_range()
>         - there's nothing to invalidate
> 				  grab_mapping_entry()
> 				  - we add zero page in the radix tree
> 				    and map it to page tables
> 
> The result is that hole page is mapped into page tables (and thus zeros
> are seen in mmap) while file has data written in that place.
> 
> Fix the problem by locking exception entry before mapping blocks for the
> fault. That way we are sure invalidate_inode_pages2_range() call for
> racing write will either block on entry lock waiting for the fault to
> finish (and unmap stale page tables after that) or read fault will see
> already allocated blocks by write(2).
> 
> Fixes: 9f141d6ef6258a3a37a045842d9ba7e68f368956
> CC: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Signed-off-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>

Yep, this looks correct to me.  Thanks!

Reviewed-by: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 4/4] dax: Fix data corruption when fault races with write
  2017-05-05  7:24 [PATCH 0/4 v2] mm,dax: Fix data corruption due to mmap inconsistency Jan Kara
  2017-05-05  7:25   ` Jan Kara
@ 2017-05-05  7:25   ` Jan Kara
  0 siblings, 0 replies; 39+ messages in thread
From: Jan Kara @ 2017-05-05  7:25 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Jan Kara, linux-nvdimm, stable, linux-fsdevel, Andrew Morton, linux-ext4

Currently DAX read fault can race with write(2) in the following way:

CPU1 - write(2)			CPU2 - read fault
				dax_iomap_pte_fault()
				  ->iomap_begin() - sees hole
dax_iomap_rw()
  iomap_apply()
    ->iomap_begin - allocates blocks
    dax_iomap_actor()
      invalidate_inode_pages2_range()
        - there's nothing to invalidate
				  grab_mapping_entry()
				  - we add zero page in the radix tree
				    and map it to page tables

The result is that hole page is mapped into page tables (and thus zeros
are seen in mmap) while file has data written in that place.

Fix the problem by locking exception entry before mapping blocks for the
fault. That way we are sure invalidate_inode_pages2_range() call for
racing write will either block on entry lock waiting for the fault to
finish (and unmap stale page tables after that) or read fault will see
already allocated blocks by write(2).

Fixes: 9f141d6ef6258a3a37a045842d9ba7e68f368956
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c | 32 ++++++++++++++++----------------
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 72853669a356..f5071249d456 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1124,23 +1124,23 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
 	if ((vmf->flags & FAULT_FLAG_WRITE) && !vmf->cow_page)
 		flags |= IOMAP_WRITE;
 
+	entry = grab_mapping_entry(mapping, vmf->pgoff, 0);
+	if (IS_ERR(entry))
+		return dax_fault_return(PTR_ERR(entry));
+
 	/*
 	 * Note that we don't bother to use iomap_apply here: DAX required
 	 * the file system block size to be equal the page size, which means
 	 * that we never have to deal with more than a single extent here.
 	 */
 	error = ops->iomap_begin(inode, pos, PAGE_SIZE, flags, &iomap);
-	if (error)
-		return dax_fault_return(error);
-	if (WARN_ON_ONCE(iomap.offset + iomap.length < pos + PAGE_SIZE)) {
-		vmf_ret = dax_fault_return(-EIO);	/* fs corruption? */
-		goto finish_iomap;
+	if (error) {
+		vmf_ret = dax_fault_return(error);
+		goto unlock_entry;
 	}
-
-	entry = grab_mapping_entry(mapping, vmf->pgoff, 0);
-	if (IS_ERR(entry)) {
-		vmf_ret = dax_fault_return(PTR_ERR(entry));
-		goto finish_iomap;
+	if (WARN_ON_ONCE(iomap.offset + iomap.length < pos + PAGE_SIZE)) {
+		error = -EIO;	/* fs corruption? */
+		goto error_finish_iomap;
 	}
 
 	sector = dax_iomap_sector(&iomap, pos);
@@ -1162,13 +1162,13 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
 		}
 
 		if (error)
-			goto error_unlock_entry;
+			goto error_finish_iomap;
 
 		__SetPageUptodate(vmf->cow_page);
 		vmf_ret = finish_fault(vmf);
 		if (!vmf_ret)
 			vmf_ret = VM_FAULT_DONE_COW;
-		goto unlock_entry;
+		goto finish_iomap;
 	}
 
 	switch (iomap.type) {
@@ -1188,7 +1188,7 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
 	case IOMAP_HOLE:
 		if (!(vmf->flags & FAULT_FLAG_WRITE)) {
 			vmf_ret = dax_load_hole(mapping, &entry, vmf);
-			goto unlock_entry;
+			goto finish_iomap;
 		}
 		/*FALLTHRU*/
 	default:
@@ -1197,10 +1197,8 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
 		break;
 	}
 
- error_unlock_entry:
+ error_finish_iomap:
 	vmf_ret = dax_fault_return(error) | major;
- unlock_entry:
-	put_locked_mapping_entry(mapping, vmf->pgoff, entry);
  finish_iomap:
 	if (ops->iomap_end) {
 		int copied = PAGE_SIZE;
@@ -1215,6 +1213,8 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
 		 */
 		ops->iomap_end(inode, pos, PAGE_SIZE, copied, flags, &iomap);
 	}
+ unlock_entry:
+	put_locked_mapping_entry(mapping, vmf->pgoff, entry);
 	return vmf_ret;
 }
 
-- 
2.12.0

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 4/4] dax: Fix data corruption when fault races with write
@ 2017-05-05  7:25   ` Jan Kara
  0 siblings, 0 replies; 39+ messages in thread
From: Jan Kara @ 2017-05-05  7:25 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Andrew Morton, Dan Williams, linux-fsdevel, linux-ext4,
	linux-nvdimm, Jan Kara, stable

Currently DAX read fault can race with write(2) in the following way:

CPU1 - write(2)			CPU2 - read fault
				dax_iomap_pte_fault()
				  ->iomap_begin() - sees hole
dax_iomap_rw()
  iomap_apply()
    ->iomap_begin - allocates blocks
    dax_iomap_actor()
      invalidate_inode_pages2_range()
        - there's nothing to invalidate
				  grab_mapping_entry()
				  - we add zero page in the radix tree
				    and map it to page tables

The result is that hole page is mapped into page tables (and thus zeros
are seen in mmap) while file has data written in that place.

Fix the problem by locking exception entry before mapping blocks for the
fault. That way we are sure invalidate_inode_pages2_range() call for
racing write will either block on entry lock waiting for the fault to
finish (and unmap stale page tables after that) or read fault will see
already allocated blocks by write(2).

Fixes: 9f141d6ef6258a3a37a045842d9ba7e68f368956
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c | 32 ++++++++++++++++----------------
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 72853669a356..f5071249d456 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1124,23 +1124,23 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
 	if ((vmf->flags & FAULT_FLAG_WRITE) && !vmf->cow_page)
 		flags |= IOMAP_WRITE;
 
+	entry = grab_mapping_entry(mapping, vmf->pgoff, 0);
+	if (IS_ERR(entry))
+		return dax_fault_return(PTR_ERR(entry));
+
 	/*
 	 * Note that we don't bother to use iomap_apply here: DAX required
 	 * the file system block size to be equal the page size, which means
 	 * that we never have to deal with more than a single extent here.
 	 */
 	error = ops->iomap_begin(inode, pos, PAGE_SIZE, flags, &iomap);
-	if (error)
-		return dax_fault_return(error);
-	if (WARN_ON_ONCE(iomap.offset + iomap.length < pos + PAGE_SIZE)) {
-		vmf_ret = dax_fault_return(-EIO);	/* fs corruption? */
-		goto finish_iomap;
+	if (error) {
+		vmf_ret = dax_fault_return(error);
+		goto unlock_entry;
 	}
-
-	entry = grab_mapping_entry(mapping, vmf->pgoff, 0);
-	if (IS_ERR(entry)) {
-		vmf_ret = dax_fault_return(PTR_ERR(entry));
-		goto finish_iomap;
+	if (WARN_ON_ONCE(iomap.offset + iomap.length < pos + PAGE_SIZE)) {
+		error = -EIO;	/* fs corruption? */
+		goto error_finish_iomap;
 	}
 
 	sector = dax_iomap_sector(&iomap, pos);
@@ -1162,13 +1162,13 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
 		}
 
 		if (error)
-			goto error_unlock_entry;
+			goto error_finish_iomap;
 
 		__SetPageUptodate(vmf->cow_page);
 		vmf_ret = finish_fault(vmf);
 		if (!vmf_ret)
 			vmf_ret = VM_FAULT_DONE_COW;
-		goto unlock_entry;
+		goto finish_iomap;
 	}
 
 	switch (iomap.type) {
@@ -1188,7 +1188,7 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
 	case IOMAP_HOLE:
 		if (!(vmf->flags & FAULT_FLAG_WRITE)) {
 			vmf_ret = dax_load_hole(mapping, &entry, vmf);
-			goto unlock_entry;
+			goto finish_iomap;
 		}
 		/*FALLTHRU*/
 	default:
@@ -1197,10 +1197,8 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
 		break;
 	}
 
- error_unlock_entry:
+ error_finish_iomap:
 	vmf_ret = dax_fault_return(error) | major;
- unlock_entry:
-	put_locked_mapping_entry(mapping, vmf->pgoff, entry);
  finish_iomap:
 	if (ops->iomap_end) {
 		int copied = PAGE_SIZE;
@@ -1215,6 +1213,8 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
 		 */
 		ops->iomap_end(inode, pos, PAGE_SIZE, copied, flags, &iomap);
 	}
+ unlock_entry:
+	put_locked_mapping_entry(mapping, vmf->pgoff, entry);
 	return vmf_ret;
 }
 
-- 
2.12.0

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 4/4] dax: Fix data corruption when fault races with write
@ 2017-05-05  7:25   ` Jan Kara
  0 siblings, 0 replies; 39+ messages in thread
From: Jan Kara @ 2017-05-05  7:25 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Jan Kara, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	stable-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Andrew Morton,
	linux-ext4-u79uwXL29TY76Z2rM5mHXA

Currently DAX read fault can race with write(2) in the following way:

CPU1 - write(2)			CPU2 - read fault
				dax_iomap_pte_fault()
				  ->iomap_begin() - sees hole
dax_iomap_rw()
  iomap_apply()
    ->iomap_begin - allocates blocks
    dax_iomap_actor()
      invalidate_inode_pages2_range()
        - there's nothing to invalidate
				  grab_mapping_entry()
				  - we add zero page in the radix tree
				    and map it to page tables

The result is that hole page is mapped into page tables (and thus zeros
are seen in mmap) while file has data written in that place.

Fix the problem by locking exception entry before mapping blocks for the
fault. That way we are sure invalidate_inode_pages2_range() call for
racing write will either block on entry lock waiting for the fault to
finish (and unmap stale page tables after that) or read fault will see
already allocated blocks by write(2).

Fixes: 9f141d6ef6258a3a37a045842d9ba7e68f368956
CC: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
---
 fs/dax.c | 32 ++++++++++++++++----------------
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 72853669a356..f5071249d456 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1124,23 +1124,23 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
 	if ((vmf->flags & FAULT_FLAG_WRITE) && !vmf->cow_page)
 		flags |= IOMAP_WRITE;
 
+	entry = grab_mapping_entry(mapping, vmf->pgoff, 0);
+	if (IS_ERR(entry))
+		return dax_fault_return(PTR_ERR(entry));
+
 	/*
 	 * Note that we don't bother to use iomap_apply here: DAX required
 	 * the file system block size to be equal the page size, which means
 	 * that we never have to deal with more than a single extent here.
 	 */
 	error = ops->iomap_begin(inode, pos, PAGE_SIZE, flags, &iomap);
-	if (error)
-		return dax_fault_return(error);
-	if (WARN_ON_ONCE(iomap.offset + iomap.length < pos + PAGE_SIZE)) {
-		vmf_ret = dax_fault_return(-EIO);	/* fs corruption? */
-		goto finish_iomap;
+	if (error) {
+		vmf_ret = dax_fault_return(error);
+		goto unlock_entry;
 	}
-
-	entry = grab_mapping_entry(mapping, vmf->pgoff, 0);
-	if (IS_ERR(entry)) {
-		vmf_ret = dax_fault_return(PTR_ERR(entry));
-		goto finish_iomap;
+	if (WARN_ON_ONCE(iomap.offset + iomap.length < pos + PAGE_SIZE)) {
+		error = -EIO;	/* fs corruption? */
+		goto error_finish_iomap;
 	}
 
 	sector = dax_iomap_sector(&iomap, pos);
@@ -1162,13 +1162,13 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
 		}
 
 		if (error)
-			goto error_unlock_entry;
+			goto error_finish_iomap;
 
 		__SetPageUptodate(vmf->cow_page);
 		vmf_ret = finish_fault(vmf);
 		if (!vmf_ret)
 			vmf_ret = VM_FAULT_DONE_COW;
-		goto unlock_entry;
+		goto finish_iomap;
 	}
 
 	switch (iomap.type) {
@@ -1188,7 +1188,7 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
 	case IOMAP_HOLE:
 		if (!(vmf->flags & FAULT_FLAG_WRITE)) {
 			vmf_ret = dax_load_hole(mapping, &entry, vmf);
-			goto unlock_entry;
+			goto finish_iomap;
 		}
 		/*FALLTHRU*/
 	default:
@@ -1197,10 +1197,8 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
 		break;
 	}
 
- error_unlock_entry:
+ error_finish_iomap:
 	vmf_ret = dax_fault_return(error) | major;
- unlock_entry:
-	put_locked_mapping_entry(mapping, vmf->pgoff, entry);
  finish_iomap:
 	if (ops->iomap_end) {
 		int copied = PAGE_SIZE;
@@ -1215,6 +1213,8 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
 		 */
 		ops->iomap_end(inode, pos, PAGE_SIZE, copied, flags, &iomap);
 	}
+ unlock_entry:
+	put_locked_mapping_entry(mapping, vmf->pgoff, entry);
 	return vmf_ret;
 }
 
-- 
2.12.0

^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2017-05-11  8:39 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-05-10  8:54 [PATCH 0/4 v4] mm,dax: Fix data corruption due to mmap inconsistency Jan Kara
2017-05-10  8:54 ` Jan Kara
2017-05-10  8:54 ` [PATCH 1/4] dax: prevent invalidation of mapped DAX entries Jan Kara
2017-05-10  8:54   ` Jan Kara
2017-05-10  8:54   ` Jan Kara
2017-05-10  8:54   ` Jan Kara
2017-05-10  8:54 ` [PATCH 2/4] mm: Fix data corruption due to stale mmap reads Jan Kara
2017-05-10  8:54   ` Jan Kara
2017-05-10  8:54   ` Jan Kara
2017-05-10  8:54   ` Jan Kara
2017-05-10  8:54 ` [PATCH 3/4] ext4: Return back to starting transaction in ext4_dax_huge_fault() Jan Kara
2017-05-10  8:54   ` Jan Kara
2017-05-10  8:54   ` Jan Kara
2017-05-10  8:54   ` Jan Kara
2017-05-10  8:54 ` [PATCH 4/4] dax: Fix data corruption when fault races with write Jan Kara
2017-05-10  8:54   ` Jan Kara
2017-05-10  8:54   ` Jan Kara
2017-05-10  8:54   ` Jan Kara
2017-05-10 17:27   ` [PATCH 5/4] dax: Fix PMD " Ross Zwisler
2017-05-10 17:27     ` Ross Zwisler
2017-05-11  8:39     ` Jan Kara
2017-05-11  8:39       ` Jan Kara
2017-05-11  8:39       ` Jan Kara
2017-05-10 17:27 ` [PATCH 0/4 v4] mm,dax: Fix data corruption due to mmap inconsistency Ross Zwisler
2017-05-10 17:27   ` Ross Zwisler
2017-05-10 17:27   ` Ross Zwisler
  -- strict thread matches above, loose matches on Subject: below --
2017-05-09 12:18 [PATCH 0/4 v3] " Jan Kara
2017-05-09 12:18 ` [PATCH 4/4] dax: Fix data corruption when fault races with write Jan Kara
2017-05-09 12:18   ` Jan Kara
2017-05-09 12:18   ` Jan Kara
2017-05-09 12:18   ` Jan Kara
2017-05-05  7:24 [PATCH 0/4 v2] mm,dax: Fix data corruption due to mmap inconsistency Jan Kara
2017-05-05  7:25 ` [PATCH 4/4] dax: Fix data corruption when fault races with write Jan Kara
2017-05-05  7:25   ` Jan Kara
2017-05-05  7:25   ` Jan Kara
2017-05-08 17:25   ` Ross Zwisler
2017-05-08 17:25     ` Ross Zwisler
2017-05-08 17:25     ` Ross Zwisler
2017-05-09 12:14     ` Jan Kara
2017-05-09 12:14       ` Jan Kara
2017-05-09 12:14       ` Jan Kara

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.