* [PATCH 0/6 v2] dax: Page invalidation fixes
From: Jan Kara @ 2016-11-24  9:46 UTC
  To: linux-fsdevel
  Cc: Jan Kara, linux-nvdimm, linux-mm, Johannes Weiner, linux-ext4

Hello,

this is the second revision of my fixes for races when invalidating hole pages
in DAX mappings. See the changelogs for details. The series is based on my
patches to write-protect DAX PTEs, which are currently carried in the mm tree.
This is a hard dependency because we really need to closely track dirtiness
(and cleanness!) of radix tree entries in DAX mappings in order to avoid
discarding valid dirty bits, which would lead to missed cache flushes on
fsync(2).

The series has passed xfstests for xfs and ext4 in both DAX and non-DAX mode.

I'd like to get some review of the patches (MM/FS people, please check whether
you like the direction the changes in mm/truncate.c take in patch 2/6 - I have
added Johannes to CC since he was touching related code recently) so that
these patches can land in some tree once the DAX write-protection patches are
merged. I'm hoping to get at least the first three patches merged for
4.10-rc2... Thanks!

Changes since v1:
* Rebased on top of the patches in the mm tree
* Added some Reviewed-by tags
* Renamed some functions based on review feedback

								Honza

* [PATCH 1/6] ext2: Return BH_New buffers for zeroed blocks
From: Jan Kara @ 2016-11-24  9:46 UTC
  To: linux-fsdevel
  Cc: Jan Kara, linux-nvdimm, linux-mm, Johannes Weiner, linux-ext4

So far we did not return BH_New buffers from ext2_get_blocks() when we
allocated and zeroed out a block for a DAX inode, in order to avoid racy
zeroing in the DAX code. That zeroing is gone these days, so we can remove
the workaround.
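
With the workaround gone, buffer_new() is set for freshly allocated blocks
on DAX inodes as well. A hand-rolled caller sketch (illustrative only, not
part of this patch; error handling omitted):

	struct buffer_head bh = { .b_size = inode->i_sb->s_blocksize };

	/* Map the block at iblock, allocating it if necessary. */
	if (!ext2_get_block(inode, iblock, &bh, 1) && buffer_new(&bh)) {
		/* Freshly allocated block: new-block handling such as
		 * invalidating stale mappings belongs here. */
	}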

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext2/inode.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 046b642f3585..e626fe892c01 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -754,9 +754,8 @@ static int ext2_get_blocks(struct inode *inode,
 			mutex_unlock(&ei->truncate_mutex);
 			goto cleanup;
 		}
-	} else {
-		*new = true;
 	}
+	*new = true;
 
 	ext2_splice_branch(inode, iblock, partial, indirect_blks, count);
 	mutex_unlock(&ei->truncate_mutex);
-- 
2.6.6

* [PATCH 2/6] mm: Invalidate DAX radix tree entries only if appropriate
From: Jan Kara @ 2016-11-24  9:46 UTC
  To: linux-fsdevel
  Cc: Jan Kara, linux-nvdimm, linux-mm, Johannes Weiner, linux-ext4

Currently invalidate_inode_pages2_range() and invalidate_mapping_pages()
just delete all exceptional radix tree entries they find. For DAX this is
not desirable, as we track cache dirtiness in these entries, and when they
are evicted we may fail to flush caches even though it is necessary. This
can manifest, for example, when we write to the same block both via mmap
and via write(2) (at different offsets) and fsync(2) then does not properly
flush CPU caches because the modification via write(2) was the last one.

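A minimal userspace sketch of the problematic pattern (hypothetical
reproducer, not part of this patch; error handling omitted):

	#include <fcntl.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		int fd = open("/mnt/dax/file", O_RDWR);
		char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			       MAP_SHARED, fd, 0);

		p[0] = 1;			/* dirty the block via mmap */
		pwrite(fd, "x", 1, 512);	/* write(2) to another offset
						 * in the same block; this can
						 * evict the dirty radix tree
						 * entry */
		fsync(fd);			/* may then skip the CPU cache
						 * flush for the mmap store */
		return 0;
	}
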
Create appropriate DAX functions to handle invalidation of DAX entries
for invalidate_inode_pages2_range() and invalidate_mapping_pages() and
wire them up into the corresponding mm functions.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c            | 71 +++++++++++++++++++++++++++++++++++++++++++++--------
 include/linux/dax.h |  3 +++
 mm/truncate.c       | 71 ++++++++++++++++++++++++++++++++++++++++++++---------
 3 files changed, 123 insertions(+), 22 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index cafd5597434b..4534f0e232e9 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -452,16 +452,37 @@ void dax_wake_mapping_entry_waiter(struct address_space *mapping,
 		__wake_up(wq, TASK_NORMAL, wake_all ? 0 : 1, &key);
 }
 
+static int __dax_invalidate_mapping_entry(struct address_space *mapping,
+					  pgoff_t index, bool trunc)
+{
+	int ret = 0;
+	void *entry;
+	struct radix_tree_root *page_tree = &mapping->page_tree;
+
+	spin_lock_irq(&mapping->tree_lock);
+	entry = get_unlocked_mapping_entry(mapping, index, NULL);
+	if (!entry || !radix_tree_exceptional_entry(entry))
+		goto out;
+	if (!trunc &&
+	    (radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_DIRTY) ||
+	     radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE)))
+		goto out;
+	radix_tree_delete(page_tree, index);
+	mapping->nrexceptional--;
+	ret = 1;
+out:
+	put_unlocked_mapping_entry(mapping, index, entry);
+	spin_unlock_irq(&mapping->tree_lock);
+	return ret;
+}
 /*
  * Delete exceptional DAX entry at @index from @mapping. Wait for radix tree
  * entry to get unlocked before deleting it.
  */
 int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
 {
-	void *entry;
+	int ret = __dax_invalidate_mapping_entry(mapping, index, true);
 
-	spin_lock_irq(&mapping->tree_lock);
-	entry = get_unlocked_mapping_entry(mapping, index, NULL);
 	/*
 	 * This gets called from truncate / punch_hole path. As such, the caller
 	 * must hold locks protecting against concurrent modifications of the
@@ -469,16 +490,46 @@ int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
 	 * caller has seen exceptional entry for this index, we better find it
 	 * at that index as well...
 	 */
-	if (WARN_ON_ONCE(!entry || !radix_tree_exceptional_entry(entry))) {
-		spin_unlock_irq(&mapping->tree_lock);
-		return 0;
-	}
-	radix_tree_delete(&mapping->page_tree, index);
+	WARN_ON_ONCE(!ret);
+	return ret;
+}
+
+/*
+ * Invalidate exceptional DAX entry if easily possible. This handles DAX
+ * entries for invalidate_inode_pages() so we evict the entry only if we can
+ * do so without blocking.
+ */
+int dax_invalidate_mapping_entry(struct address_space *mapping, pgoff_t index)
+{
+	int ret = 0;
+	void *entry, **slot;
+	struct radix_tree_root *page_tree = &mapping->page_tree;
+
+	spin_lock_irq(&mapping->tree_lock);
+	entry = __radix_tree_lookup(page_tree, index, NULL, &slot);
+	if (!entry || !radix_tree_exceptional_entry(entry) ||
+	    slot_locked(mapping, slot))
+		goto out;
+	if (radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_DIRTY) ||
+	    radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE))
+		goto out;
+	radix_tree_delete(page_tree, index);
 	mapping->nrexceptional--;
+	ret = 1;
+out:
 	spin_unlock_irq(&mapping->tree_lock);
-	dax_wake_mapping_entry_waiter(mapping, index, entry, true);
+	if (ret)
+		dax_wake_mapping_entry_waiter(mapping, index, entry, true);
+	return ret;
+}
 
-	return 1;
+/*
+ * Invalidate exceptional DAX entry if it is clean.
+ */
+int dax_invalidate_clean_mapping_entry(struct address_space *mapping,
+				       pgoff_t index)
+{
+	return __dax_invalidate_mapping_entry(mapping, index, false);
 }
 
 /*
diff --git a/include/linux/dax.h b/include/linux/dax.h
index f97bcfe79472..6e36b11285b0 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -41,6 +41,9 @@ ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
 int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 			struct iomap_ops *ops);
 int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
+int dax_invalidate_mapping_entry(struct address_space *mapping, pgoff_t index);
+int dax_invalidate_clean_mapping_entry(struct address_space *mapping,
+				       pgoff_t index);
 void dax_wake_mapping_entry_waiter(struct address_space *mapping,
 		pgoff_t index, void *entry, bool wake_all);
 
diff --git a/mm/truncate.c b/mm/truncate.c
index a01cce450a26..215c5edfd11e 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -30,14 +30,6 @@ static void clear_exceptional_entry(struct address_space *mapping,
 	struct radix_tree_node *node;
 	void **slot;
 
-	/* Handled by shmem itself */
-	if (shmem_mapping(mapping))
-		return;
-
-	if (dax_mapping(mapping)) {
-		dax_delete_mapping_entry(mapping, index);
-		return;
-	}
 	spin_lock_irq(&mapping->tree_lock);
 	/*
 	 * Regular page slots are stabilized by the page lock even
@@ -70,6 +62,56 @@ static void clear_exceptional_entry(struct address_space *mapping,
 	spin_unlock_irq(&mapping->tree_lock);
 }
 
+/*
+ * Unconditionally remove exceptional entry. Usually called from truncate path.
+ */
+static void truncate_exceptional_entry(struct address_space *mapping,
+				       pgoff_t index, void *entry)
+{
+	/* Handled by shmem itself */
+	if (shmem_mapping(mapping))
+		return;
+
+	if (dax_mapping(mapping)) {
+		dax_delete_mapping_entry(mapping, index);
+		return;
+	}
+	clear_exceptional_entry(mapping, index, entry);
+}
+
+/*
+ * Invalidate exceptional entry if easily possible. This handles exceptional
+ * entries for invalidate_inode_pages() so for DAX it evicts only unlocked and
+ * clean entries.
+ */
+static int invalidate_exceptional_entry(struct address_space *mapping,
+					pgoff_t index, void *entry)
+{
+	/* Handled by shmem itself */
+	if (shmem_mapping(mapping))
+		return 1;
+	if (dax_mapping(mapping))
+		return dax_invalidate_mapping_entry(mapping, index);
+	clear_exceptional_entry(mapping, index, entry);
+	return 1;
+}
+
+/*
+ * Invalidate exceptional entry if clean. This handles exceptional entries for
+ * invalidate_inode_pages2() so for DAX it evicts only clean entries.
+ */
+static int invalidate_exceptional_entry2(struct address_space *mapping,
+					 pgoff_t index, void *entry)
+{
+	/* Handled by shmem itself */
+	if (shmem_mapping(mapping))
+		return 1;
+	if (dax_mapping(mapping))
+		return dax_invalidate_clean_mapping_entry(mapping, index);
+	clear_exceptional_entry(mapping, index, entry);
+	return 1;
+}
+
 /**
  * do_invalidatepage - invalidate part or all of a page
  * @page: the page which is affected
@@ -277,7 +319,8 @@ void truncate_inode_pages_range(struct address_space *mapping,
 				break;
 
 			if (radix_tree_exceptional_entry(page)) {
-				clear_exceptional_entry(mapping, index, page);
+				truncate_exceptional_entry(mapping, index,
+							   page);
 				continue;
 			}
 
@@ -366,7 +409,8 @@ void truncate_inode_pages_range(struct address_space *mapping,
 			}
 
 			if (radix_tree_exceptional_entry(page)) {
-				clear_exceptional_entry(mapping, index, page);
+				truncate_exceptional_entry(mapping, index,
+							   page);
 				continue;
 			}
 
@@ -485,7 +529,8 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
 				break;
 
 			if (radix_tree_exceptional_entry(page)) {
-				clear_exceptional_entry(mapping, index, page);
+				invalidate_exceptional_entry(mapping, index,
+							     page);
 				continue;
 			}
 
@@ -607,7 +652,9 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 				break;
 
 			if (radix_tree_exceptional_entry(page)) {
-				clear_exceptional_entry(mapping, index, page);
+				if (!invalidate_exceptional_entry2(mapping,
+								   index, page))
+					ret = -EBUSY;
 				continue;
 			}
 
-- 
2.6.6

* [PATCH 3/6] dax: Avoid page invalidation races and unnecessary radix tree traversals
From: Jan Kara @ 2016-11-24  9:46 UTC
  To: linux-fsdevel
  Cc: Jan Kara, linux-nvdimm, linux-mm, Johannes Weiner, linux-ext4

Currently each filesystem (possibly through generic_file_direct_write() or
iomap_dax_rw()) takes care of invalidating page tables and evicting hole
pages from the radix tree when a write(2) to the file happens. This
invalidation is only necessary when the write(2) results in some block
allocation. Furthermore, in its current place the invalidation is racy with
respect to a page fault instantiating a hole page just after we have
invalidated it.

So perform the page invalidation inside dax_iomap_actor(), where we can do
it only when really necessary and after blocks have been allocated, so that
nobody will be instantiating new hole pages anymore.

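The new invalidation site relies on the filesystem's ->iomap_begin() marking
freshly allocated extents with IOMAP_F_NEW. A sketch of that contract
(illustrative; the example_* names are made up):

	static int example_iomap_begin(struct inode *inode, loff_t pos,
				       loff_t length, unsigned flags,
				       struct iomap *iomap)
	{
		/* Filesystem-specific block mapping / allocation. */
		bool allocated = example_map_blocks(inode, pos, length, iomap);

		if (allocated)
			/* Tell DAX that hole pages may need invalidating. */
			iomap->flags |= IOMAP_F_NEW;
		return 0;
	}
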
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c | 28 +++++++++++-----------------
 1 file changed, 11 insertions(+), 17 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 4534f0e232e9..ddf77ef2ca18 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -984,6 +984,17 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 	if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED))
 		return -EIO;
 
+	/*
+	 * Write can allocate block for an area which has a hole page mapped
+	 * into page tables. We have to tear down these mappings so that data
+	 * written by write(2) is visible in mmap.
+	 */
+	if ((iomap->flags & IOMAP_F_NEW) && inode->i_mapping->nrpages) {
+		invalidate_inode_pages2_range(inode->i_mapping,
+					      pos >> PAGE_SHIFT,
+					      (end - 1) >> PAGE_SHIFT);
+	}
+
 	while (pos < end) {
 		unsigned offset = pos & (PAGE_SIZE - 1);
 		struct blk_dax_ctl dax = { 0 };
@@ -1042,23 +1053,6 @@ dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
 	if (iov_iter_rw(iter) == WRITE)
 		flags |= IOMAP_WRITE;
 
-	/*
-	 * Yes, even DAX files can have page cache attached to them:  A zeroed
-	 * page is inserted into the pagecache when we have to serve a write
-	 * fault on a hole.  It should never be dirtied and can simply be
-	 * dropped from the pagecache once we get real data for the page.
-	 *
-	 * XXX: This is racy against mmap, and there's nothing we can do about
-	 * it. We'll eventually need to shift this down even further so that
-	 * we can check if we allocated blocks over a hole first.
-	 */
-	if (mapping->nrpages) {
-		ret = invalidate_inode_pages2_range(mapping,
-				pos >> PAGE_SHIFT,
-				(pos + iov_iter_count(iter) - 1) >> PAGE_SHIFT);
-		WARN_ON_ONCE(ret);
-	}
-
 	while (iov_iter_count(iter)) {
 		ret = iomap_apply(inode, pos, iov_iter_count(iter), flags, ops,
 				iter, dax_iomap_actor);
-- 
2.6.6

* [PATCH 4/6] dax: Finish fault completely when loading holes
@ 2016-11-24  9:46   ` Jan Kara
  0 siblings, 0 replies; 56+ messages in thread
From: Jan Kara @ 2016-11-24  9:46 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Jan Kara, linux-nvdimm, linux-mm, Johannes Weiner, linux-ext4

The only case in which we do not finish the page fault completely is when
we are loading hole pages into the radix tree. Avoid this special case and
finish the fault inside the DAX fault handler in that case as well. This
will allow for easier iomap handling.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c | 27 ++++++++++++++++++---------
 1 file changed, 18 insertions(+), 9 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index ddf77ef2ca18..38f996976ebf 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -540,15 +540,16 @@ int dax_invalidate_clean_mapping_entry(struct address_space *mapping,
  * otherwise it will simply fall out of the page cache under memory
  * pressure without ever having been dirtied.
  */
-static int dax_load_hole(struct address_space *mapping, void *entry,
+static int dax_load_hole(struct address_space *mapping, void **entry,
 			 struct vm_fault *vmf)
 {
 	struct page *page;
+	int ret;
 
 	/* Hole page already exists? Return it...  */
-	if (!radix_tree_exceptional_entry(entry)) {
-		vmf->page = entry;
-		return VM_FAULT_LOCKED;
+	if (!radix_tree_exceptional_entry(*entry)) {
+		page = *entry;
+		goto out;
 	}
 
 	/* This will replace locked radix tree entry with a hole page */
@@ -556,8 +557,17 @@ static int dax_load_hole(struct address_space *mapping, void *entry,
 				   vmf->gfp_mask | __GFP_ZERO);
 	if (!page)
 		return VM_FAULT_OOM;
+ out:
 	vmf->page = page;
-	return VM_FAULT_LOCKED;
+	ret = finish_fault(vmf);
+	vmf->page = NULL;
+	*entry = page;
+	if (!ret) {
+		/* Grab reference for PTE that is now referencing the page */
+		get_page(page);
+		return VM_FAULT_NOPAGE;
+	}
+	return ret;
 }
 
 static int copy_user_dax(struct block_device *bdev, sector_t sector, size_t size,
@@ -1162,8 +1172,8 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	case IOMAP_UNWRITTEN:
 	case IOMAP_HOLE:
 		if (!(vmf->flags & FAULT_FLAG_WRITE)) {
-			vmf_ret = dax_load_hole(mapping, entry, vmf);
-			break;
+			vmf_ret = dax_load_hole(mapping, &entry, vmf);
+			goto finish_iomap;
 		}
 		/*FALLTHRU*/
 	default:
@@ -1184,8 +1194,7 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		}
 	}
  unlock_entry:
-	if (vmf_ret != VM_FAULT_LOCKED || error)
-		put_locked_mapping_entry(mapping, vmf->pgoff, entry);
+	put_locked_mapping_entry(mapping, vmf->pgoff, entry);
  out:
 	if (error == -ENOMEM)
 		return VM_FAULT_OOM | major;
-- 
2.6.6

^ permalink raw reply related	[flat|nested] 56+ messages in thread
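
For readers following the control flow, here is an illustrative
before/after summary of the hole-fault path (a sketch of the diff above,
not additional code from the series):

/*
 * Before: dax_load_hole() returned VM_FAULT_LOCKED and handed a locked
 * page back to the generic fault code, which then installed the PTE;
 * the DAX entry lock therefore had to be released conditionally.
 *
 * After:  dax_load_hole() calls finish_fault(vmf) itself to install
 * the PTE and returns VM_FAULT_NOPAGE (taking a page reference for
 * the new PTE), so the fault is complete inside fs/dax.c and
 * put_locked_mapping_entry() can run unconditionally at unlock_entry.
 */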

* [PATCH 5/6] dax: Call ->iomap_begin without entry lock during dax fault
@ 2016-11-24  9:46   ` Jan Kara
  0 siblings, 0 replies; 56+ messages in thread
From: Jan Kara @ 2016-11-24  9:46 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Jan Kara, linux-nvdimm, linux-mm, Johannes Weiner, linux-ext4

Currently the ->iomap_begin() handler is called with the entry lock held. If
the filesystem takes any locks between ->iomap_begin() and ->iomap_end()
(such as ext4, which will want to hold a transaction open), this causes a
lock inversion against iomap_apply() in the standard IO path, which first
calls ->iomap_begin() and only then calls the ->actor() callback that grabs
entry locks for DAX.

Fix the problem by nesting the grabbing of the entry lock inside the
->iomap_begin() / ->iomap_end() pair.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c | 120 ++++++++++++++++++++++++++++++++++-----------------------------
 1 file changed, 65 insertions(+), 55 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 38f996976ebf..be39633d346e 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1077,6 +1077,15 @@ dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
 }
 EXPORT_SYMBOL_GPL(dax_iomap_rw);
 
+static int dax_fault_return(int error)
+{
+	if (error == 0)
+		return VM_FAULT_NOPAGE;
+	if (error == -ENOMEM)
+		return VM_FAULT_OOM;
+	return VM_FAULT_SIGBUS;
+}
+
 /**
  * dax_iomap_fault - handle a page fault on a DAX file
  * @vma: The virtual memory area where the fault occurred
@@ -1109,12 +1118,6 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	if (pos >= i_size_read(inode))
 		return VM_FAULT_SIGBUS;
 
-	entry = grab_mapping_entry(mapping, vmf->pgoff, 0);
-	if (IS_ERR(entry)) {
-		error = PTR_ERR(entry);
-		goto out;
-	}
-
 	if ((vmf->flags & FAULT_FLAG_WRITE) && !vmf->cow_page)
 		flags |= IOMAP_WRITE;
 
@@ -1125,9 +1128,15 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	 */
 	error = ops->iomap_begin(inode, pos, PAGE_SIZE, flags, &iomap);
 	if (error)
-		goto unlock_entry;
+		return dax_fault_return(error);
 	if (WARN_ON_ONCE(iomap.offset + iomap.length < pos + PAGE_SIZE)) {
-		error = -EIO;		/* fs corruption? */
+		vmf_ret = dax_fault_return(-EIO);	/* fs corruption? */
+		goto finish_iomap;
+	}
+
+	entry = grab_mapping_entry(mapping, vmf->pgoff, 0);
+	if (IS_ERR(entry)) {
+		vmf_ret = dax_fault_return(PTR_ERR(entry));
 		goto finish_iomap;
 	}
 
@@ -1150,13 +1159,13 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		}
 
 		if (error)
-			goto finish_iomap;
+			goto error_unlock_entry;
 
 		__SetPageUptodate(vmf->cow_page);
 		vmf_ret = finish_fault(vmf);
 		if (!vmf_ret)
 			vmf_ret = VM_FAULT_DONE_COW;
-		goto finish_iomap;
+		goto unlock_entry;
 	}
 
 	switch (iomap.type) {
@@ -1168,12 +1177,15 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		}
 		error = dax_insert_mapping(mapping, iomap.bdev, sector,
 				PAGE_SIZE, &entry, vma, vmf);
+		/* -EBUSY is fine, somebody else faulted on the same PTE */
+		if (error == -EBUSY)
+			error = 0;
 		break;
 	case IOMAP_UNWRITTEN:
 	case IOMAP_HOLE:
 		if (!(vmf->flags & FAULT_FLAG_WRITE)) {
 			vmf_ret = dax_load_hole(mapping, &entry, vmf);
-			goto finish_iomap;
+			goto unlock_entry;
 		}
 		/*FALLTHRU*/
 	default:
@@ -1182,30 +1194,25 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		break;
 	}
 
- finish_iomap:
-	if (ops->iomap_end) {
-		if (error || (vmf_ret & VM_FAULT_ERROR)) {
-			/* keep previous error */
-			ops->iomap_end(inode, pos, PAGE_SIZE, 0, flags,
-					&iomap);
-		} else {
-			error = ops->iomap_end(inode, pos, PAGE_SIZE,
-					PAGE_SIZE, flags, &iomap);
-		}
-	}
+ error_unlock_entry:
+	vmf_ret = dax_fault_return(error) | major;
  unlock_entry:
 	put_locked_mapping_entry(mapping, vmf->pgoff, entry);
- out:
-	if (error == -ENOMEM)
-		return VM_FAULT_OOM | major;
-	/* -EBUSY is fine, somebody else faulted on the same PTE */
-	if (error < 0 && error != -EBUSY)
-		return VM_FAULT_SIGBUS | major;
-	if (vmf_ret) {
-		WARN_ON_ONCE(error); /* -EBUSY from ops->iomap_end? */
-		return vmf_ret;
+ finish_iomap:
+	if (ops->iomap_end) {
+		int copied = PAGE_SIZE;
+
+		if (vmf_ret & VM_FAULT_ERROR)
+			copied = 0;
+		/*
+		 * The fault is done by now and there's no way back (other
+		 * thread may be already happily using PTE we have installed).
+		 * Just ignore error from ->iomap_end since we cannot do much
+		 * with it.
+		 */
+		ops->iomap_end(inode, pos, PAGE_SIZE, copied, flags, &iomap);
 	}
-	return VM_FAULT_NOPAGE | major;
+	return vmf_ret;
 }
 EXPORT_SYMBOL_GPL(dax_iomap_fault);
 
@@ -1330,6 +1337,15 @@ int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		goto fallback;
 
 	/*
+	 * Note that we don't use iomap_apply here.  We aren't doing I/O, only
+	 * setting up a mapping, so really we're using iomap_begin() as a way
+	 * to look up our filesystem block.
+	 */
+	pos = (loff_t)pgoff << PAGE_SHIFT;
+	error = ops->iomap_begin(inode, pos, PMD_SIZE, iomap_flags, &iomap);
+	if (error)
+		goto fallback;
+	/*
 	 * grab_mapping_entry() will make sure we get a 2M empty entry, a DAX
 	 * PMD or a HZP entry.  If it can't (because a 4k page is already in
 	 * the tree, for instance), it will return -EEXIST and we just fall
@@ -1337,19 +1353,10 @@ int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	 */
 	entry = grab_mapping_entry(mapping, pgoff, RADIX_DAX_PMD);
 	if (IS_ERR(entry))
-		goto fallback;
+		goto finish_iomap;
 
-	/*
-	 * Note that we don't use iomap_apply here.  We aren't doing I/O, only
-	 * setting up a mapping, so really we're using iomap_begin() as a way
-	 * to look up our filesystem block.
-	 */
-	pos = (loff_t)pgoff << PAGE_SHIFT;
-	error = ops->iomap_begin(inode, pos, PMD_SIZE, iomap_flags, &iomap);
-	if (error)
-		goto unlock_entry;
 	if (iomap.offset + iomap.length < pos + PMD_SIZE)
-		goto finish_iomap;
+		goto unlock_entry;
 
 	vmf.pgoff = pgoff;
 	vmf.flags = flags;
@@ -1363,7 +1370,7 @@ int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	case IOMAP_UNWRITTEN:
 	case IOMAP_HOLE:
 		if (WARN_ON_ONCE(write))
-			goto finish_iomap;
+			goto unlock_entry;
 		result = dax_pmd_load_hole(vma, pmd, &vmf, address, &iomap,
 				&entry);
 		break;
@@ -1372,20 +1379,23 @@ int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		break;
 	}
 
+ unlock_entry:
+	put_locked_mapping_entry(mapping, pgoff, entry);
  finish_iomap:
 	if (ops->iomap_end) {
-		if (result == VM_FAULT_FALLBACK) {
-			ops->iomap_end(inode, pos, PMD_SIZE, 0, iomap_flags,
-					&iomap);
-		} else {
-			error = ops->iomap_end(inode, pos, PMD_SIZE, PMD_SIZE,
-					iomap_flags, &iomap);
-			if (error)
-				result = VM_FAULT_FALLBACK;
-		}
+		int copied = PMD_SIZE;
+
+		if (result == VM_FAULT_FALLBACK)
+			copied = 0;
+		/*
+		 * The fault is done by now and there's no way back (other
+		 * thread may be already happily using PMD we have installed).
+		 * Just ignore error from ->iomap_end since we cannot do much
+		 * with it.
+		 */
+		ops->iomap_end(inode, pos, PMD_SIZE, copied, iomap_flags,
+				&iomap);
 	}
- unlock_entry:
-	put_locked_mapping_entry(mapping, pgoff, entry);
  fallback:
 	if (result == VM_FAULT_FALLBACK) {
 		split_huge_pmd(vma, pmd, address);
-- 
2.6.6

^ permalink raw reply related	[flat|nested] 56+ messages in thread
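
A compact way to see the new nesting is as a lock-ordering sketch
(illustrative only, not code from the series):

/*
 * Old order (inverted against the read/write path):
 *	grab_mapping_entry()		entry lock first
 *	->iomap_begin()			fs locks second
 *
 * New order, matching iomap_apply() in the regular IO path:
 *	->iomap_begin()			may take fs locks / start handle
 *		grab_mapping_entry()	entry lock nests inside
 *		... install PTE/PMD ...
 *		put_locked_mapping_entry()
 *	->iomap_end()			fs locks released last
 */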

* [PATCH 6/6] ext4: Simplify DAX fault path
@ 2016-11-24  9:46   ` Jan Kara
  0 siblings, 0 replies; 56+ messages in thread
From: Jan Kara @ 2016-11-24  9:46 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Jan Kara, linux-nvdimm, linux-mm, Johannes Weiner, linux-ext4

Now that dax_iomap_fault() calls ->iomap_begin() without the entry lock
held, we can start the transaction in ext4_iomap_begin() and thus simplify
ext4_dax_fault(). This also provides proper retries in case of ENOSPC.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/file.c | 48 ++++++++++--------------------------------------
 1 file changed, 10 insertions(+), 38 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index b5f184493c57..d663d3d7c81c 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -258,7 +258,6 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
 	int result;
-	handle_t *handle = NULL;
 	struct inode *inode = file_inode(vma->vm_file);
 	struct super_block *sb = inode->i_sb;
 	bool write = vmf->flags & FAULT_FLAG_WRITE;
@@ -266,24 +265,12 @@ static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	if (write) {
 		sb_start_pagefault(sb);
 		file_update_time(vma->vm_file);
-		down_read(&EXT4_I(inode)->i_mmap_sem);
-		handle = ext4_journal_start_sb(sb, EXT4_HT_WRITE_PAGE,
-						EXT4_DATA_TRANS_BLOCKS(sb));
-	} else
-		down_read(&EXT4_I(inode)->i_mmap_sem);
-
-	if (IS_ERR(handle))
-		result = VM_FAULT_SIGBUS;
-	else
-		result = dax_iomap_fault(vma, vmf, &ext4_iomap_ops);
-
-	if (write) {
-		if (!IS_ERR(handle))
-			ext4_journal_stop(handle);
-		up_read(&EXT4_I(inode)->i_mmap_sem);
+	}
+	down_read(&EXT4_I(inode)->i_mmap_sem);
+	result = dax_iomap_fault(vma, vmf, &ext4_iomap_ops);
+	up_read(&EXT4_I(inode)->i_mmap_sem);
+	if (write)
 		sb_end_pagefault(sb);
-	} else
-		up_read(&EXT4_I(inode)->i_mmap_sem);
 
 	return result;
 }
@@ -292,7 +279,6 @@ static int ext4_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
 						pmd_t *pmd, unsigned int flags)
 {
 	int result;
-	handle_t *handle = NULL;
 	struct inode *inode = file_inode(vma->vm_file);
 	struct super_block *sb = inode->i_sb;
 	bool write = flags & FAULT_FLAG_WRITE;
@@ -300,27 +286,13 @@ static int ext4_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
 	if (write) {
 		sb_start_pagefault(sb);
 		file_update_time(vma->vm_file);
-		down_read(&EXT4_I(inode)->i_mmap_sem);
-		handle = ext4_journal_start_sb(sb, EXT4_HT_WRITE_PAGE,
-				ext4_chunk_trans_blocks(inode,
-							PMD_SIZE / PAGE_SIZE));
-	} else
-		down_read(&EXT4_I(inode)->i_mmap_sem);
-
-	if (IS_ERR(handle))
-		result = VM_FAULT_SIGBUS;
-	else {
-		result = dax_iomap_pmd_fault(vma, addr, pmd, flags,
-					     &ext4_iomap_ops);
 	}
-
-	if (write) {
-		if (!IS_ERR(handle))
-			ext4_journal_stop(handle);
-		up_read(&EXT4_I(inode)->i_mmap_sem);
+	down_read(&EXT4_I(inode)->i_mmap_sem);
+	result = dax_iomap_pmd_fault(vma, addr, pmd, flags,
+				     &ext4_iomap_ops);
+	up_read(&EXT4_I(inode)->i_mmap_sem);
+	if (write)
 		sb_end_pagefault(sb);
-	} else
-		up_read(&EXT4_I(inode)->i_mmap_sem);
 
 	return result;
 }
-- 
2.6.6

^ permalink raw reply related	[flat|nested] 56+ messages in thread
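
To make the commit message concrete, a hypothetical sketch of the
direction this enables on the ext4 side is shown below. The function
name, the credit calculation, and the elided mapping step are
illustrative assumptions, not code from this series:

/* Hypothetical sketch only; not ext4's actual ext4_iomap_begin(). */
static int ext4_iomap_begin_sketch(struct inode *inode, loff_t pos,
				   loff_t length, unsigned flags,
				   struct iomap *iomap)
{
	handle_t *handle = NULL;

	if (flags & IOMAP_WRITE) {
		/* Credit estimate is a placeholder for illustration. */
		handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS,
				EXT4_DATA_TRANS_BLOCKS(inode->i_sb));
		if (IS_ERR(handle))
			return PTR_ERR(handle);	/* ENOSPC retries live here */
	}

	/* ... look up or allocate blocks under the handle, fill *iomap ... */

	if (handle)
		ext4_journal_stop(handle);
	return 0;
}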

* [PATCH 6/6] ext4: Simplify DAX fault path
@ 2016-11-24  9:46   ` Jan Kara
  0 siblings, 0 replies; 56+ messages in thread
From: Jan Kara @ 2016-11-24  9:46 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Ross Zwisler, linux-ext4, linux-mm, linux-nvdimm,
	Johannes Weiner, Jan Kara

Now that dax_iomap_fault() calls ->iomap_begin() without entry lock, we
can use transaction starting in ext4_iomap_begin() and thus simplify
ext4_dax_fault(). It also provides us proper retries in case of ENOSPC.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/file.c | 48 ++++++++++--------------------------------------
 1 file changed, 10 insertions(+), 38 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index b5f184493c57..d663d3d7c81c 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -258,7 +258,6 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
 	int result;
-	handle_t *handle = NULL;
 	struct inode *inode = file_inode(vma->vm_file);
 	struct super_block *sb = inode->i_sb;
 	bool write = vmf->flags & FAULT_FLAG_WRITE;
@@ -266,24 +265,12 @@ static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	if (write) {
 		sb_start_pagefault(sb);
 		file_update_time(vma->vm_file);
-		down_read(&EXT4_I(inode)->i_mmap_sem);
-		handle = ext4_journal_start_sb(sb, EXT4_HT_WRITE_PAGE,
-						EXT4_DATA_TRANS_BLOCKS(sb));
-	} else
-		down_read(&EXT4_I(inode)->i_mmap_sem);
-
-	if (IS_ERR(handle))
-		result = VM_FAULT_SIGBUS;
-	else
-		result = dax_iomap_fault(vma, vmf, &ext4_iomap_ops);
-
-	if (write) {
-		if (!IS_ERR(handle))
-			ext4_journal_stop(handle);
-		up_read(&EXT4_I(inode)->i_mmap_sem);
+	}
+	down_read(&EXT4_I(inode)->i_mmap_sem);
+	result = dax_iomap_fault(vma, vmf, &ext4_iomap_ops);
+	up_read(&EXT4_I(inode)->i_mmap_sem);
+	if (write)
 		sb_end_pagefault(sb);
-	} else
-		up_read(&EXT4_I(inode)->i_mmap_sem);
 
 	return result;
 }
@@ -292,7 +279,6 @@ static int ext4_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
 						pmd_t *pmd, unsigned int flags)
 {
 	int result;
-	handle_t *handle = NULL;
 	struct inode *inode = file_inode(vma->vm_file);
 	struct super_block *sb = inode->i_sb;
 	bool write = flags & FAULT_FLAG_WRITE;
@@ -300,27 +286,13 @@ static int ext4_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
 	if (write) {
 		sb_start_pagefault(sb);
 		file_update_time(vma->vm_file);
-		down_read(&EXT4_I(inode)->i_mmap_sem);
-		handle = ext4_journal_start_sb(sb, EXT4_HT_WRITE_PAGE,
-				ext4_chunk_trans_blocks(inode,
-							PMD_SIZE / PAGE_SIZE));
-	} else
-		down_read(&EXT4_I(inode)->i_mmap_sem);
-
-	if (IS_ERR(handle))
-		result = VM_FAULT_SIGBUS;
-	else {
-		result = dax_iomap_pmd_fault(vma, addr, pmd, flags,
-					     &ext4_iomap_ops);
 	}
-
-	if (write) {
-		if (!IS_ERR(handle))
-			ext4_journal_stop(handle);
-		up_read(&EXT4_I(inode)->i_mmap_sem);
+	down_read(&EXT4_I(inode)->i_mmap_sem);
+	result = dax_iomap_pmd_fault(vma, addr, pmd, flags,
+				     &ext4_iomap_ops);
+	up_read(&EXT4_I(inode)->i_mmap_sem);
+	if (write)
 		sb_end_pagefault(sb);
-	} else
-		up_read(&EXT4_I(inode)->i_mmap_sem);
 
 	return result;
 }
-- 
2.6.6

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH 6/6] ext4: Simplify DAX fault path
@ 2016-11-24  9:46   ` Jan Kara
  0 siblings, 0 replies; 56+ messages in thread
From: Jan Kara @ 2016-11-24  9:46 UTC (permalink / raw)
  To: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA
  Cc: Jan Kara, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Johannes Weiner,
	linux-ext4-u79uwXL29TY76Z2rM5mHXA

Now that dax_iomap_fault() calls ->iomap_begin() without entry lock, we
can use transaction starting in ext4_iomap_begin() and thus simplify
ext4_dax_fault(). It also provides us proper retries in case of ENOSPC.

Signed-off-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
---
 fs/ext4/file.c | 48 ++++++++++--------------------------------------
 1 file changed, 10 insertions(+), 38 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index b5f184493c57..d663d3d7c81c 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -258,7 +258,6 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
 	int result;
-	handle_t *handle = NULL;
 	struct inode *inode = file_inode(vma->vm_file);
 	struct super_block *sb = inode->i_sb;
 	bool write = vmf->flags & FAULT_FLAG_WRITE;
@@ -266,24 +265,12 @@ static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	if (write) {
 		sb_start_pagefault(sb);
 		file_update_time(vma->vm_file);
-		down_read(&EXT4_I(inode)->i_mmap_sem);
-		handle = ext4_journal_start_sb(sb, EXT4_HT_WRITE_PAGE,
-						EXT4_DATA_TRANS_BLOCKS(sb));
-	} else
-		down_read(&EXT4_I(inode)->i_mmap_sem);
-
-	if (IS_ERR(handle))
-		result = VM_FAULT_SIGBUS;
-	else
-		result = dax_iomap_fault(vma, vmf, &ext4_iomap_ops);
-
-	if (write) {
-		if (!IS_ERR(handle))
-			ext4_journal_stop(handle);
-		up_read(&EXT4_I(inode)->i_mmap_sem);
+	}
+	down_read(&EXT4_I(inode)->i_mmap_sem);
+	result = dax_iomap_fault(vma, vmf, &ext4_iomap_ops);
+	up_read(&EXT4_I(inode)->i_mmap_sem);
+	if (write)
 		sb_end_pagefault(sb);
-	} else
-		up_read(&EXT4_I(inode)->i_mmap_sem);
 
 	return result;
 }
@@ -292,7 +279,6 @@ static int ext4_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
 						pmd_t *pmd, unsigned int flags)
 {
 	int result;
-	handle_t *handle = NULL;
 	struct inode *inode = file_inode(vma->vm_file);
 	struct super_block *sb = inode->i_sb;
 	bool write = flags & FAULT_FLAG_WRITE;
@@ -300,27 +286,13 @@ static int ext4_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
 	if (write) {
 		sb_start_pagefault(sb);
 		file_update_time(vma->vm_file);
-		down_read(&EXT4_I(inode)->i_mmap_sem);
-		handle = ext4_journal_start_sb(sb, EXT4_HT_WRITE_PAGE,
-				ext4_chunk_trans_blocks(inode,
-							PMD_SIZE / PAGE_SIZE));
-	} else
-		down_read(&EXT4_I(inode)->i_mmap_sem);
-
-	if (IS_ERR(handle))
-		result = VM_FAULT_SIGBUS;
-	else {
-		result = dax_iomap_pmd_fault(vma, addr, pmd, flags,
-					     &ext4_iomap_ops);
 	}
-
-	if (write) {
-		if (!IS_ERR(handle))
-			ext4_journal_stop(handle);
-		up_read(&EXT4_I(inode)->i_mmap_sem);
+	down_read(&EXT4_I(inode)->i_mmap_sem);
+	result = dax_iomap_pmd_fault(vma, addr, pmd, flags,
+				     &ext4_iomap_ops);
+	up_read(&EXT4_I(inode)->i_mmap_sem);
+	if (write)
 		sb_end_pagefault(sb);
-	} else
-		up_read(&EXT4_I(inode)->i_mmap_sem);
 
 	return result;
 }
-- 
2.6.6

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH 6/6] ext4: Simplify DAX fault path
@ 2016-11-24  9:46   ` Jan Kara
  0 siblings, 0 replies; 56+ messages in thread
From: Jan Kara @ 2016-11-24  9:46 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Ross Zwisler, linux-ext4, linux-mm, linux-nvdimm,
	Johannes Weiner, Jan Kara

Now that dax_iomap_fault() calls ->iomap_begin() without entry lock, we
can use transaction starting in ext4_iomap_begin() and thus simplify
ext4_dax_fault(). It also provides us proper retries in case of ENOSPC.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/file.c | 48 ++++++++++--------------------------------------
 1 file changed, 10 insertions(+), 38 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index b5f184493c57..d663d3d7c81c 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -258,7 +258,6 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
 	int result;
-	handle_t *handle = NULL;
 	struct inode *inode = file_inode(vma->vm_file);
 	struct super_block *sb = inode->i_sb;
 	bool write = vmf->flags & FAULT_FLAG_WRITE;
@@ -266,24 +265,12 @@ static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	if (write) {
 		sb_start_pagefault(sb);
 		file_update_time(vma->vm_file);
-		down_read(&EXT4_I(inode)->i_mmap_sem);
-		handle = ext4_journal_start_sb(sb, EXT4_HT_WRITE_PAGE,
-						EXT4_DATA_TRANS_BLOCKS(sb));
-	} else
-		down_read(&EXT4_I(inode)->i_mmap_sem);
-
-	if (IS_ERR(handle))
-		result = VM_FAULT_SIGBUS;
-	else
-		result = dax_iomap_fault(vma, vmf, &ext4_iomap_ops);
-
-	if (write) {
-		if (!IS_ERR(handle))
-			ext4_journal_stop(handle);
-		up_read(&EXT4_I(inode)->i_mmap_sem);
+	}
+	down_read(&EXT4_I(inode)->i_mmap_sem);
+	result = dax_iomap_fault(vma, vmf, &ext4_iomap_ops);
+	up_read(&EXT4_I(inode)->i_mmap_sem);
+	if (write)
 		sb_end_pagefault(sb);
-	} else
-		up_read(&EXT4_I(inode)->i_mmap_sem);
 
 	return result;
 }
@@ -292,7 +279,6 @@ static int ext4_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
 						pmd_t *pmd, unsigned int flags)
 {
 	int result;
-	handle_t *handle = NULL;
 	struct inode *inode = file_inode(vma->vm_file);
 	struct super_block *sb = inode->i_sb;
 	bool write = flags & FAULT_FLAG_WRITE;
@@ -300,27 +286,13 @@ static int ext4_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
 	if (write) {
 		sb_start_pagefault(sb);
 		file_update_time(vma->vm_file);
-		down_read(&EXT4_I(inode)->i_mmap_sem);
-		handle = ext4_journal_start_sb(sb, EXT4_HT_WRITE_PAGE,
-				ext4_chunk_trans_blocks(inode,
-							PMD_SIZE / PAGE_SIZE));
-	} else
-		down_read(&EXT4_I(inode)->i_mmap_sem);
-
-	if (IS_ERR(handle))
-		result = VM_FAULT_SIGBUS;
-	else {
-		result = dax_iomap_pmd_fault(vma, addr, pmd, flags,
-					     &ext4_iomap_ops);
 	}
-
-	if (write) {
-		if (!IS_ERR(handle))
-			ext4_journal_stop(handle);
-		up_read(&EXT4_I(inode)->i_mmap_sem);
+	down_read(&EXT4_I(inode)->i_mmap_sem);
+	result = dax_iomap_pmd_fault(vma, addr, pmd, flags,
+				     &ext4_iomap_ops);
+	up_read(&EXT4_I(inode)->i_mmap_sem);
+	if (write)
 		sb_end_pagefault(sb);
-	} else
-		up_read(&EXT4_I(inode)->i_mmap_sem);
 
 	return result;
 }
-- 
2.6.6
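
For reference, the ext4_iomap_begin() counterpart that this patch leans on
looks roughly like the sketch below. This is illustrative only: the credit
calculation and most of the error handling are elided, so it is not the
verbatim merged code.

static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
                            unsigned flags, struct iomap *iomap)
{
        struct ext4_map_blocks map;
        handle_t *handle;
        int ret, retries = 0;

        /* ... set up 'map' from offset/length and compute 'credits' ... */
        if (flags & IOMAP_WRITE) {
retry:
                /* The handle is started with no DAX entry lock held. */
                handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS,
                                            credits);
                if (IS_ERR(handle))
                        return PTR_ERR(handle);
                ret = ext4_map_blocks(handle, inode, &map,
                                      EXT4_GET_BLOCKS_CREATE_ZERO);
                ext4_journal_stop(handle);
                /* This is where the proper ENOSPC retry comes from. */
                if (ret == -ENOSPC &&
                    ext4_should_retry_alloc(inode->i_sb, &retries))
                        goto retry;
                if (ret < 0)
                        return ret;
        }
        /* ... translate 'map' into *iomap ... */
        return 0;
}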

* Re: [PATCH 1/6] ext2: Return BH_New buffers for zeroed blocks
  2016-11-24  9:46   ` Jan Kara
@ 2016-11-29 17:48     ` Ross Zwisler
  -1 siblings, 0 replies; 56+ messages in thread
From: Ross Zwisler @ 2016-11-29 17:48 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-nvdimm, linux-mm, Johannes Weiner, linux-fsdevel, linux-ext4

On Thu, Nov 24, 2016 at 10:46:31AM +0100, Jan Kara wrote:
> So far we did not return BH_New buffers from ext2_get_blocks() when we
> allocated and zeroed out a block for a DAX inode, to avoid racy zeroing in
> the DAX code. This zeroing is gone these days, so we can remove the
> workaround.
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Jan Kara <jack@suse.cz>

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>

* Re: [PATCH 2/6] mm: Invalidate DAX radix tree entries only if appropriate
  2016-11-24  9:46   ` Jan Kara
@ 2016-11-29 19:34     ` Johannes Weiner
  -1 siblings, 0 replies; 56+ messages in thread
From: Johannes Weiner @ 2016-11-29 19:34 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, linux-mm, linux-ext4, linux-nvdimm

Hi Jan,

On Thu, Nov 24, 2016 at 10:46:32AM +0100, Jan Kara wrote:
> @@ -452,16 +452,37 @@ void dax_wake_mapping_entry_waiter(struct address_space *mapping,
>  		__wake_up(wq, TASK_NORMAL, wake_all ? 0 : 1, &key);
>  }
>  
> +static int __dax_invalidate_mapping_entry(struct address_space *mapping,
> +					  pgoff_t index, bool trunc)
> +{
> +	int ret = 0;
> +	void *entry;
> +	struct radix_tree_root *page_tree = &mapping->page_tree;
> +
> +	spin_lock_irq(&mapping->tree_lock);
> +	entry = get_unlocked_mapping_entry(mapping, index, NULL);
> +	if (!entry || !radix_tree_exceptional_entry(entry))
> +		goto out;
> +	if (!trunc &&
> +	    (radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_DIRTY) ||
> +	     radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE)))
> +		goto out;
> +	radix_tree_delete(page_tree, index);

You could use the new __radix_tree_replace() here and save a second
tree lookup.

> +/*
> + * Invalidate exceptional DAX entry if easily possible. This handles DAX
> + * entries for invalidate_inode_pages() so we evict the entry only if we can
> + * do so without blocking.
> + */
> +int dax_invalidate_mapping_entry(struct address_space *mapping, pgoff_t index)
> +{
> +	int ret = 0;
> +	void *entry, **slot;
> +	struct radix_tree_root *page_tree = &mapping->page_tree;
> +
> +	spin_lock_irq(&mapping->tree_lock);
> +	entry = __radix_tree_lookup(page_tree, index, NULL, &slot);
> +	if (!entry || !radix_tree_exceptional_entry(entry) ||
> +	    slot_locked(mapping, slot))
> +		goto out;
> +	if (radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_DIRTY) ||
> +	    radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE))
> +		goto out;
> +	radix_tree_delete(page_tree, index);

Ditto for __radix_tree_replace().
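
For this second call site, which already goes through __radix_tree_lookup(),
that would look roughly like the sketch below (assuming the six-argument
__radix_tree_replace() signature just merged for 4.10, with NULL for the new
item to clear the slot, and NULL for the update_node callback and its
private data):

        struct radix_tree_node *node;
        void **slot;

        entry = __radix_tree_lookup(page_tree, index, &node, &slot);
        ...
        /* Clear the slot in place rather than walking the tree a second
         * time via radix_tree_delete(). */
        __radix_tree_replace(page_tree, node, slot, NULL, NULL, NULL);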

> @@ -30,14 +30,6 @@ static void clear_exceptional_entry(struct address_space *mapping,
>  	struct radix_tree_node *node;
>  	void **slot;
>  
> -	/* Handled by shmem itself */
> -	if (shmem_mapping(mapping))
> -		return;
> -
> -	if (dax_mapping(mapping)) {
> -		dax_delete_mapping_entry(mapping, index);
> -		return;
> -	}
>  	spin_lock_irq(&mapping->tree_lock);
>  	/*
>  	 * Regular page slots are stabilized by the page lock even
> @@ -70,6 +62,56 @@ static void clear_exceptional_entry(struct address_space *mapping,
>  	spin_unlock_irq(&mapping->tree_lock);
>  }
>  
> +/*
> + * Unconditionally remove exceptional entry. Usually called from truncate path.
> + */
> +static void truncate_exceptional_entry(struct address_space *mapping,
> +				       pgoff_t index, void *entry)
> +{
> +	/* Handled by shmem itself */
> +	if (shmem_mapping(mapping))
> +		return;
> +
> +	if (dax_mapping(mapping)) {
> +		dax_delete_mapping_entry(mapping, index);
> +		return;
> +	}
> +	clear_exceptional_entry(mapping, index, entry);
> +}
> +
> +/*
> + * Invalidate exceptional entry if easily possible. This handles exceptional
> + * entries for invalidate_inode_pages() so for DAX it evicts only unlocked and
> + * clean entries.
> + */
> +static int invalidate_exceptional_entry(struct address_space *mapping,
> +					pgoff_t index, void *entry)
> +{
> +	/* Handled by shmem itself */
> +	if (shmem_mapping(mapping))
> +		return 1;
> +	if (dax_mapping(mapping))
> +		return dax_invalidate_mapping_entry(mapping, index);
> +	clear_exceptional_entry(mapping, index, entry);
> +	return 1;
> +}
> +
> +/*
> + * Invalidate exceptional entry if clean. This handles exceptional entries for
> + * invalidate_inode_pages2() so for DAX it evicts only clean entries.
> + */
> +static int invalidate_exceptional_entry2(struct address_space *mapping,
> +					 pgoff_t index, void *entry)
> +{
> +	/* Handled by shmem itself */
> +	if (shmem_mapping(mapping))
> +		return 1;
> +	if (dax_mapping(mapping))
> +		return dax_invalidate_clean_mapping_entry(mapping, index);
> +	clear_exceptional_entry(mapping, index, entry);
> +	return 1;
> +}

The way these functions are split out looks fine to me.

Now that clear_exceptional_entry() doesn't handle shmem and DAX
anymore, only shadows, could you rename it to clear_shadow_entry()?

The naming situation with truncate, invalidate, invalidate2 worries me
a bit. They aren't great names to begin with, but now DAX uses yet
another terminology for what state prevents a page from being dropped.
Can we switch to truncate, invalidate, and invalidate_sync throughout
truncate.c and then have DAX follow that naming too? Or maybe you can
think of better names. But neither invalidate2 nor invalidate_clean
seems to capture it quite right ;)

Thanks

* Re: [PATCH 2/6] mm: Invalidate DAX radix tree entries only if appropriate
  2016-11-24  9:46   ` Jan Kara
@ 2016-11-29 22:17   ` Ross Zwisler
  -1 siblings, 0 replies; 56+ messages in thread
From: Ross Zwisler @ 2016-11-29 22:17 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Ross Zwisler, linux-ext4, linux-mm, linux-nvdimm,
	Johannes Weiner

On Thu, Nov 24, 2016 at 10:46:32AM +0100, Jan Kara wrote:
> Currently invalidate_inode_pages2_range() and invalidate_mapping_pages()
> just delete all exceptional radix tree entries they find. For DAX this
> is not desirable as we track cache dirtiness in these entries and when
> they are evicted, we may fail to flush caches even though it is necessary.
> This can, for example, manifest when we write to the same block both via
> mmap and via write(2) (to different offsets), and fsync(2) then does not
> properly flush CPU caches when the modification via write(2) was the last
> one.
> 
> Create appropriate DAX functions to handle invalidation of DAX entries
> for invalidate_inode_pages2_range() and invalidate_mapping_pages() and
> wire them up into the corresponding mm functions.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

For the DAX bits:
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>


* Re: [PATCH 3/6] dax: Avoid page invalidation races and unnecessary radix tree traversals
  2016-11-24  9:46   ` Jan Kara
@ 2016-11-29 22:31     ` Ross Zwisler
  -1 siblings, 0 replies; 56+ messages in thread
From: Ross Zwisler @ 2016-11-29 22:31 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-nvdimm, linux-mm, Johannes Weiner, linux-fsdevel, linux-ext4

On Thu, Nov 24, 2016 at 10:46:33AM +0100, Jan Kara wrote:
> Currently each filesystem (possibly through generic_file_direct_write()
> or iomap_dax_rw()) takes care of invalidating page tables and evicting

Just some nits about the commit message: the DAX I/O path function is now
called dax_iomap_rw(), and no filesystems still use
generic_file_direct_write() for DAX so you can probably remove it from the
changelog - up to you.

Aside from that:
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>

> hole pages from the radix tree when write(2) to the file happens. This
> invalidation is only necessary when there is some block allocation
> resulting from write(2). Furthermore, in its current place the invalidation
> is racy wrt a page fault instantiating a hole page just after we have
> invalidated it.
> 
> So perform the page invalidation inside dax_do_io() where we can do it
> only when really necessary and after blocks have been allocated so
> nobody will be instantiating new hole pages anymore.
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/dax.c | 28 +++++++++++-----------------
>  1 file changed, 11 insertions(+), 17 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 4534f0e232e9..ddf77ef2ca18 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -984,6 +984,17 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>  	if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED))
>  		return -EIO;
>  
> +	/*
> +	 * Write can allocate block for an area which has a hole page mapped
> +	 * into page tables. We have to tear down these mappings so that data
> +	 * written by write(2) is visible in mmap.
> +	 */
> +	if ((iomap->flags & IOMAP_F_NEW) && inode->i_mapping->nrpages) {
> +		invalidate_inode_pages2_range(inode->i_mapping,
> +					      pos >> PAGE_SHIFT,
> +					      (end - 1) >> PAGE_SHIFT);
> +	}
> +
>  	while (pos < end) {
>  		unsigned offset = pos & (PAGE_SIZE - 1);
>  		struct blk_dax_ctl dax = { 0 };
> @@ -1042,23 +1053,6 @@ dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
>  	if (iov_iter_rw(iter) == WRITE)
>  		flags |= IOMAP_WRITE;
>  
> -	/*
> -	 * Yes, even DAX files can have page cache attached to them:  A zeroed
> -	 * page is inserted into the pagecache when we have to serve a write
> -	 * fault on a hole.  It should never be dirtied and can simply be
> -	 * dropped from the pagecache once we get real data for the page.
> -	 *
> -	 * XXX: This is racy against mmap, and there's nothing we can do about
> -	 * it. We'll eventually need to shift this down even further so that
> -	 * we can check if we allocated blocks over a hole first.
> -	 */
> -	if (mapping->nrpages) {
> -		ret = invalidate_inode_pages2_range(mapping,
> -				pos >> PAGE_SHIFT,
> -				(pos + iov_iter_count(iter) - 1) >> PAGE_SHIFT);
> -		WARN_ON_ONCE(ret);
> -	}
> -
>  	while (iov_iter_count(iter)) {
>  		ret = iomap_apply(inode, pos, iov_iter_count(iter), flags, ops,
>  				iter, dax_iomap_actor);
> -- 
> 2.6.6
> 

* Re: [PATCH 2/6] mm: Invalidate DAX radix tree entries only if appropriate
  2016-11-29 19:34     ` Johannes Weiner
@ 2016-11-30  8:08       ` Jan Kara
  -1 siblings, 0 replies; 56+ messages in thread
From: Jan Kara @ 2016-11-30  8:08 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Jan Kara, linux-nvdimm, linux-mm, linux-fsdevel, linux-ext4

Hi Johannes,

On Tue 29-11-16 14:34:03, Johannes Weiner wrote:
> On Thu, Nov 24, 2016 at 10:46:32AM +0100, Jan Kara wrote:
> > @@ -452,16 +452,37 @@ void dax_wake_mapping_entry_waiter(struct address_space *mapping,
> >  		__wake_up(wq, TASK_NORMAL, wake_all ? 0 : 1, &key);
> >  }
> >  
> > +static int __dax_invalidate_mapping_entry(struct address_space *mapping,
> > +					  pgoff_t index, bool trunc)
> > +{
> > +	int ret = 0;
> > +	void *entry;
> > +	struct radix_tree_root *page_tree = &mapping->page_tree;
> > +
> > +	spin_lock_irq(&mapping->tree_lock);
> > +	entry = get_unlocked_mapping_entry(mapping, index, NULL);
> > +	if (!entry || !radix_tree_exceptional_entry(entry))
> > +		goto out;
> > +	if (!trunc &&
> > +	    (radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_DIRTY) ||
> > +	     radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE)))
> > +		goto out;
> > +	radix_tree_delete(page_tree, index);
> 
> You could use the new __radix_tree_replace() here and save a second
> tree lookup.

Hum, I'd need to return 'node' from get_unlocked_mapping_entry(). So
I'll probably do it in a patch separate from this fix. But thanks for
the suggestion.
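
(Concretely, that would mean a hypothetical signature along the lines of the
following, which is not something in this series:

        static void *get_unlocked_mapping_entry(struct address_space *mapping,
                                                pgoff_t index, void ***slotp,
                                                struct radix_tree_node **nodep);

so that the caller could hand the node and slot straight to
__radix_tree_replace().)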

> > +/*
> > + * Invalidate exceptional DAX entry if easily possible. This handles DAX
> > + * entries for invalidate_inode_pages() so we evict the entry only if we can
> > + * do so without blocking.
> > + */
> > +int dax_invalidate_mapping_entry(struct address_space *mapping, pgoff_t index)
> > +{
> > +	int ret = 0;
> > +	void *entry, **slot;
> > +	struct radix_tree_root *page_tree = &mapping->page_tree;
> > +
> > +	spin_lock_irq(&mapping->tree_lock);
> > +	entry = __radix_tree_lookup(page_tree, index, NULL, &slot);
> > +	if (!entry || !radix_tree_exceptional_entry(entry) ||
> > +	    slot_locked(mapping, slot))
> > +		goto out;
> > +	if (radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_DIRTY) ||
> > +	    radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE))
> > +		goto out;
> > +	radix_tree_delete(page_tree, index);
> 
> Ditto for __radix_tree_replace().

Yes, here I can do it easily right away.

> > @@ -30,14 +30,6 @@ static void clear_exceptional_entry(struct address_space *mapping,
> >  	struct radix_tree_node *node;
> >  	void **slot;
> >  
> > -	/* Handled by shmem itself */
> > -	if (shmem_mapping(mapping))
> > -		return;
> > -
> > -	if (dax_mapping(mapping)) {
> > -		dax_delete_mapping_entry(mapping, index);
> > -		return;
> > -	}
> >  	spin_lock_irq(&mapping->tree_lock);
> >  	/*
> >  	 * Regular page slots are stabilized by the page lock even
> > @@ -70,6 +62,56 @@ static void clear_exceptional_entry(struct address_space *mapping,
> >  	spin_unlock_irq(&mapping->tree_lock);
> >  }
> >  
> > +/*
> > + * Unconditionally remove exceptional entry. Usually called from truncate path.
> > + */
> > +static void truncate_exceptional_entry(struct address_space *mapping,
> > +				       pgoff_t index, void *entry)
> > +{
> > +	/* Handled by shmem itself */
> > +	if (shmem_mapping(mapping))
> > +		return;
> > +
> > +	if (dax_mapping(mapping)) {
> > +		dax_delete_mapping_entry(mapping, index);
> > +		return;
> > +	}
> > +	clear_exceptional_entry(mapping, index, entry);
> > +}
> > +
> > +/*
> > + * Invalidate exceptional entry if easily possible. This handles exceptional
> > + * entries for invalidate_inode_pages() so for DAX it evicts only unlocked and
> > + * clean entries.
> > + */
> > +static int invalidate_exceptional_entry(struct address_space *mapping,
> > +					pgoff_t index, void *entry)
> > +{
> > +	/* Handled by shmem itself */
> > +	if (shmem_mapping(mapping))
> > +		return 1;
> > +	if (dax_mapping(mapping))
> > +		return dax_invalidate_mapping_entry(mapping, index);
> > +	clear_exceptional_entry(mapping, index, entry);
> > +	return 1;
> > +}
> > +
> > +/*
> > + * Invalidate exceptional entry if clean. This handles exceptional entries for
> > + * invalidate_inode_pages2() so for DAX it evicts only clean entries.
> > + */
> > +static int invalidate_exceptional_entry2(struct address_space *mapping,
> > +					 pgoff_t index, void *entry)
> > +{
> > +	/* Handled by shmem itself */
> > +	if (shmem_mapping(mapping))
> > +		return 1;
> > +	if (dax_mapping(mapping))
> > +		return dax_invalidate_clean_mapping_entry(mapping, index);
> > +	clear_exceptional_entry(mapping, index, entry);
> > +	return 1;
> > +}
> 
> The way these functions are split out looks fine to me.
> 
> Now that clear_exceptional_entry() doesn't handle shmem and DAX
> anymore, only shadows, could you rename it to clear_shadow_entry()?

Sure. Done.

> The naming situation with truncate, invalidate, invalidate2 worries me
> a bit. They aren't great names to begin with, but now DAX uses yet
> another terminology for what state prevents a page from being dropped.
> Can we switch to truncate, invalidate, and invalidate_sync throughout
> truncate.c and then have DAX follow that naming too? Or maybe you can
> think of better names. But neither invalidate2 nor invalidate_clean
> seems to capture it quite right ;)

Yeah, the naming is confusing. I like the invalidate_sync proposal, however
renaming invalidate_inode_pages2() to invalidate_inode_pages_sync() is a
larger undertaking - grep shows 51 places that need to be changed. So I don't
want to do it in this patch set. I can call the function
dax_invalidate_mapping_entry_sync() if it makes you happier and do the rest
later... OK?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: [PATCH 3/6] dax: Avoid page invalidation races and unnecessary radix tree traversals
  2016-11-29 22:31     ` Ross Zwisler
@ 2016-11-30  8:23       ` Jan Kara
  -1 siblings, 0 replies; 56+ messages in thread
From: Jan Kara @ 2016-11-30  8:23 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Jan Kara, linux-nvdimm, linux-mm, Johannes Weiner, linux-fsdevel,
	linux-ext4

On Tue 29-11-16 15:31:38, Ross Zwisler wrote:
> On Thu, Nov 24, 2016 at 10:46:33AM +0100, Jan Kara wrote:
> > Currently each filesystem (possibly through generic_file_direct_write()
> > or iomap_dax_rw()) takes care of invalidating page tables and evicting
> 
> Just some nits about the commit message: the DAX I/O path function is now
> called dax_iomap_rw(), and no filesystems still use
> generic_file_direct_write() for DAX so you can probably remove it from the
> changelog - up to you.

Yeah, good spotting. Many things have changed through the rebases. I've fixed
up the changelog to reflect the current state.

> Aside from that:
> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>

Thanks.

								Honza

> 
> > hole pages from the radix tree when write(2) to the file happens. This
> > invalidation is only necessary when there is some block allocation
> > resulting from write(2). Furthermore in current place the invalidation
> > is racy wrt page fault instantiating a hole page just after we have
> > invalidated it.
> > 
> > So perform the page invalidation inside dax_do_io() where we can do it
> > only when really necessary and after blocks have been allocated so
> > nobody will be instantiating new hole pages anymore.
> > 
> > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  fs/dax.c | 28 +++++++++++-----------------
> >  1 file changed, 11 insertions(+), 17 deletions(-)
> > 
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 4534f0e232e9..ddf77ef2ca18 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -984,6 +984,17 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> >  	if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED))
> >  		return -EIO;
> >  
> > +	/*
> > +	 * Write can allocate block for an area which has a hole page mapped
> > +	 * into page tables. We have to tear down these mappings so that data
> > +	 * written by write(2) is visible in mmap.
> > +	 */
> > +	if ((iomap->flags & IOMAP_F_NEW) && inode->i_mapping->nrpages) {
> > +		invalidate_inode_pages2_range(inode->i_mapping,
> > +					      pos >> PAGE_SHIFT,
> > +					      (end - 1) >> PAGE_SHIFT);
> > +	}
> > +
> >  	while (pos < end) {
> >  		unsigned offset = pos & (PAGE_SIZE - 1);
> >  		struct blk_dax_ctl dax = { 0 };
> > @@ -1042,23 +1053,6 @@ dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
> >  	if (iov_iter_rw(iter) == WRITE)
> >  		flags |= IOMAP_WRITE;
> >  
> > -	/*
> > -	 * Yes, even DAX files can have page cache attached to them:  A zeroed
> > -	 * page is inserted into the pagecache when we have to serve a write
> > -	 * fault on a hole.  It should never be dirtied and can simply be
> > -	 * dropped from the pagecache once we get real data for the page.
> > -	 *
> > -	 * XXX: This is racy against mmap, and there's nothing we can do about
> > -	 * it. We'll eventually need to shift this down even further so that
> > -	 * we can check if we allocated blocks over a hole first.
> > -	 */
> > -	if (mapping->nrpages) {
> > -		ret = invalidate_inode_pages2_range(mapping,
> > -				pos >> PAGE_SHIFT,
> > -				(pos + iov_iter_count(iter) - 1) >> PAGE_SHIFT);
> > -		WARN_ON_ONCE(ret);
> > -	}
> > -
> >  	while (iov_iter_count(iter)) {
> >  		ret = iomap_apply(inode, pos, iov_iter_count(iter), flags, ops,
> >  				iter, dax_iomap_actor);
> > -- 
> > 2.6.6
> > 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: [PATCH 2/6] mm: Invalidate DAX radix tree entries only if appropriate
@ 2016-11-30 15:59         ` Johannes Weiner
  0 siblings, 0 replies; 56+ messages in thread
From: Johannes Weiner @ 2016-11-30 15:59 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, Ross Zwisler, linux-ext4, linux-mm, linux-nvdimm

On Wed, Nov 30, 2016 at 09:08:41AM +0100, Jan Kara wrote:
> > The naming situation with truncate, invalidate, invalidate2 worries me
> > a bit. They aren't great names to begin with, but now DAX uses yet
> > another terminology for what state prevents a page from being dropped.
> > Can we switch to truncate, invalidate, and invalidate_sync throughout
> > truncate.c and then have DAX follow that naming too? Or maybe you can
> > think of better names. But neither invalidate2 nor invalidate_clean
> > seems to capture it quite right ;)
> 
> Yeah, the naming is confusing. I like the invalidate_sync proposal, however
> renaming invalidate_inode_pages2() to invalidate_inode_pages_sync() is a
> larger undertaking - grep shows 51 places that need to be changed. So I don't
> want to do it in this patch set. I can call the function
> dax_invalidate_mapping_entry_sync() if it makes you happier and do the rest
> later... OK?

Yep, that sounds reasonable on both counts.


* Re: [PATCH 4/6] dax: Finish fault completely when loading holes
@ 2016-12-01 22:13     ` Ross Zwisler
  0 siblings, 0 replies; 56+ messages in thread
From: Ross Zwisler @ 2016-12-01 22:13 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Ross Zwisler, linux-ext4, linux-mm, linux-nvdimm,
	Johannes Weiner

On Thu, Nov 24, 2016 at 10:46:34AM +0100, Jan Kara wrote:
> The only case in which we do not finish the page fault completely is when
> we are loading hole pages into the radix tree. Avoid this special case and
> finish the fault inside the DAX fault handler in that case as well. This
> will allow for easier iomap handling.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

This seems correct to me.

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
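
A sketch of the gist (abridged: the already-present-page case and some error
handling are dropped; finish_fault() here is the fault-completion helper this
series assumes from the mm tree, so details differ from the actual patch):

        static int dax_load_hole(struct address_space *mapping, void **entry,
                                 struct vm_fault *vmf)
        {
                struct page *page;
                int ret;

                /* Replace the locked radix tree entry with a hole page. */
                page = find_or_create_page(mapping, vmf->pgoff,
                                           vmf->gfp_mask | __GFP_ZERO);
                if (!page)
                        return VM_FAULT_OOM;
                /* Map the page here instead of returning VM_FAULT_LOCKED
                 * and leaving the rest of the fault to the generic code. */
                vmf->page = page;
                ret = finish_fault(vmf);
                vmf->page = NULL;
                *entry = page;
                if (!ret) {
                        /* Reference for the PTE now pointing at the page. */
                        get_page(page);
                        return VM_FAULT_NOPAGE;
                }
                return ret;
        }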


* Re: [PATCH 5/6] dax: Call ->iomap_begin without entry lock during dax fault
  2016-11-24  9:46   ` Jan Kara
@ 2016-12-01 22:24     ` Ross Zwisler
  -1 siblings, 0 replies; 56+ messages in thread
From: Ross Zwisler @ 2016-12-01 22:24 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Ross Zwisler, linux-ext4, linux-mm, linux-nvdimm,
	Johannes Weiner

On Thu, Nov 24, 2016 at 10:46:35AM +0100, Jan Kara wrote:
> Currently the ->iomap_begin() handler is called with the entry lock held.
> If the filesystem holds any locks between ->iomap_begin() and ->iomap_end()
> (such as ext4, which will want to hold a transaction open), this would
> cause a lock inversion with iomap_apply() in the standard IO path, which
> first calls ->iomap_begin() and only then calls the ->actor() callback,
> which grabs entry locks for DAX.

I don't see dax_iomap_actor() grabbing any entry locks for DAX. Is this an
issue currently, or are you just trying to make the code consistent so we
don't run into issues in the future?

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 5/6] dax: Call ->iomap_begin without entry lock during dax fault
  2016-12-01 22:24     ` Ross Zwisler
@ 2016-12-01 23:27       ` Ross Zwisler
  -1 siblings, 0 replies; 56+ messages in thread
From: Ross Zwisler @ 2016-12-01 23:27 UTC (permalink / raw)
  To: Ross Zwisler, Jan Kara, linux-fsdevel, linux-ext4, linux-mm,
	linux-nvdimm, Johannes Weiner

On Thu, Dec 01, 2016 at 03:24:47PM -0700, Ross Zwisler wrote:
> On Thu, Nov 24, 2016 at 10:46:35AM +0100, Jan Kara wrote:
> > Currently the ->iomap_begin() handler is called with the entry lock held.
> > If the filesystem holds any locks between ->iomap_begin() and ->iomap_end()
> > (such as ext4, which will want to hold a transaction open), this would
> > cause a lock inversion with iomap_apply() in the standard IO path, which
> > first calls ->iomap_begin() and only then calls the ->actor() callback,
> > which grabs entry locks for DAX.
> 
> I don't see dax_iomap_actor() grabbing any entry locks for DAX. Is this an
> issue currently, or are you just trying to make the code consistent so we
> don't run into issues in the future?

Ah, I see that you use this new ordering in patch 6/6 so that you can change
your interaction with the ext4 journal.  I'm still curious whether we have a
lock ordering inversion within DAX, but if this ordering helps you with ext4,
good enough.

One quick comment:

> @@ -1337,19 +1353,10 @@ int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
>        */                                                                     
>       entry = grab_mapping_entry(mapping, pgoff, RADIX_DAX_PMD);              
>       if (IS_ERR(entry))                                                      
> -             goto fallback;                                                  
> +             goto finish_iomap;                                              
>                                                                               
> -     /*                                                                      
> -      * Note that we don't use iomap_apply here.  We aren't doing I/O, only  
> -      * setting up a mapping, so really we're using iomap_begin() as a way   
> -      * to look up our filesystem block.                                     
> -      */                                                                     
> -     pos = (loff_t)pgoff << PAGE_SHIFT;                                      
> -     error = ops->iomap_begin(inode, pos, PMD_SIZE, iomap_flags, &iomap);    
> -     if (error)                                                              
> -             goto unlock_entry;                                              
>       if (iomap.offset + iomap.length < pos + PMD_SIZE)                       
> -             goto finish_iomap;                                              
> +             goto unlock_entry;       

I think this offset+length bounds check could be moved, along with the
iomap_begin() call, up above the grab_mapping_entry().  You would then goto
'finish_iomap' if you hit this error condition, allowing you to avoid
grabbing and releasing the mapping entry.
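
For illustration, the reordered prologue would then look roughly like this
(and this is the shape the v3 patch later in the thread ends up with):

	pos = (loff_t)pgoff << PAGE_SHIFT;
	error = ops->iomap_begin(inode, pos, PMD_SIZE, iomap_flags, &iomap);
	if (error)
		goto fallback;

	/* Bounds check before grabbing the mapping entry at all */
	if (iomap.offset + iomap.length < pos + PMD_SIZE)
		goto finish_iomap;

	entry = grab_mapping_entry(mapping, pgoff, RADIX_DAX_PMD);
	if (IS_ERR(entry))
		goto finish_iomap;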

Other than that one small nit, this looks fine to me:
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 5/6] dax: Call ->iomap_begin without entry lock during dax fault
  2016-12-01 22:24     ` Ross Zwisler
@ 2016-12-02 10:08       ` Jan Kara
  -1 siblings, 0 replies; 56+ messages in thread
From: Jan Kara @ 2016-12-02 10:08 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Jan Kara, linux-nvdimm, linux-mm, Johannes Weiner, linux-fsdevel,
	linux-ext4

On Thu 01-12-16 15:24:47, Ross Zwisler wrote:
> On Thu, Nov 24, 2016 at 10:46:35AM +0100, Jan Kara wrote:
> > Currently the ->iomap_begin() handler is called with the entry lock held.
> > If the filesystem holds any locks between ->iomap_begin() and ->iomap_end()
> > (such as ext4, which will want to hold a transaction open), this would
> > cause a lock inversion with iomap_apply() in the standard IO path, which
> > first calls ->iomap_begin() and only then calls the ->actor() callback,
> > which grabs entry locks for DAX.
> 
> I don't see dax_iomap_actor() grabbing any entry locks for DAX. Is this an
> issue currently, or are you just trying to make the code consistent so we
> don't run into issues in the future?

So dax_iomap_actor() copies data from / to a user-provided buffer. That copy
can fault, and if the buffer happens to be an mmapped file on a DAX
filesystem, the fault will end up grabbing entry locks. Sample evil test:

	fd = open("some_file", O_RDWR);
	buf = mmap(NULL, 65536, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	/* source buffer is a DAX mmap of the same file -> faults in ->actor() */
	write(fd, buf, 4096);
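
A fuller, compilable version of the same reproducer - a sketch assuming
"some_file" already exists on a DAX-mounted filesystem; the size constants
are arbitrary:

	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		int fd = open("some_file", O_RDWR);

		if (fd < 0) {
			perror("open");
			return 1;
		}
		/* Make sure the file is large enough to back the mapping */
		if (ftruncate(fd, 65536) < 0) {
			perror("ftruncate");
			return 1;
		}
		char *buf = mmap(NULL, 65536, PROT_READ | PROT_WRITE,
				 MAP_SHARED, fd, 0);
		if (buf == MAP_FAILED) {
			perror("mmap");
			return 1;
		}
		/*
		 * write() copies from buf inside ->actor(); touching buf
		 * faults on the same DAX mapping, so the fault path takes
		 * entry locks while the write path sits between
		 * ->iomap_begin() and ->iomap_end().
		 */
		if (write(fd, buf, 4096) < 0)
			perror("write");
		munmap(buf, 65536);
		close(fd);
		return 0;
	}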

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 5/6] dax: Call ->iomap_begin without entry lock during dax fault
  2016-12-01 23:27       ` Ross Zwisler
  (?)
@ 2016-12-02 10:12       ` Jan Kara
  -1 siblings, 0 replies; 56+ messages in thread
From: Jan Kara @ 2016-12-02 10:12 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Jan Kara, linux-fsdevel, linux-ext4, linux-mm, linux-nvdimm,
	Johannes Weiner

On Thu 01-12-16 16:27:04, Ross Zwisler wrote:
> On Thu, Dec 01, 2016 at 03:24:47PM -0700, Ross Zwisler wrote:
> > On Thu, Nov 24, 2016 at 10:46:35AM +0100, Jan Kara wrote:
> > > Currently the ->iomap_begin() handler is called with the entry lock
> > > held. If the filesystem holds any locks between ->iomap_begin() and
> > > ->iomap_end() (such as ext4, which will want to hold a transaction
> > > open), this would cause a lock inversion with iomap_apply() in the
> > > standard IO path, which first calls ->iomap_begin() and only then
> > > calls the ->actor() callback, which grabs entry locks for DAX.
> > 
> > I don't see dax_iomap_actor() grabbing any entry locks for DAX. Is this
> > an issue currently, or are you just trying to make the code consistent
> > so we don't run into issues in the future?
> 
> Ah, I see that you use this new ordering in patch 6/6 so that you can change
> your interaction with the ext4 journal.  I'm still curious whether we have a
> lock ordering inversion within DAX, but if this ordering helps you with
> ext4, good enough.
> 
> One quick comment:
> 
> > @@ -1337,19 +1353,10 @@ int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> >        */                                                                     
> >       entry = grab_mapping_entry(mapping, pgoff, RADIX_DAX_PMD);              
> >       if (IS_ERR(entry))                                                      
> > -             goto fallback;                                                  
> > +             goto finish_iomap;                                              
> >                                                                               
> > -     /*                                                                      
> > -      * Note that we don't use iomap_apply here.  We aren't doing I/O, only  
> > -      * setting up a mapping, so really we're using iomap_begin() as a way   
> > -      * to look up our filesystem block.                                     
> > -      */                                                                     
> > -     pos = (loff_t)pgoff << PAGE_SHIFT;                                      
> > -     error = ops->iomap_begin(inode, pos, PMD_SIZE, iomap_flags, &iomap);    
> > -     if (error)                                                              
> > -             goto unlock_entry;                                              
> >       if (iomap.offset + iomap.length < pos + PMD_SIZE)                       
> > -             goto finish_iomap;                                              
> > +             goto unlock_entry;       
> 
> I think this offset+length bounds check could be moved, along with the
> iomap_begin() call, up above the grab_mapping_entry().  You would then goto
> 'finish_iomap' if you hit this error condition, allowing you to avoid
> grabbing and releasing the mapping entry.

Yes, that is nicer. Changed.

> Other than that one small nit, this looks fine to me:
> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>

Thanks.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/6] mm: Invalidate DAX radix tree entries only if appropriate
  2016-11-30  8:08       ` Jan Kara
@ 2016-12-09 12:02         ` Jan Kara
  -1 siblings, 0 replies; 56+ messages in thread
From: Jan Kara @ 2016-12-09 12:02 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Jan Kara, linux-nvdimm, linux-mm, linux-fsdevel, linux-ext4

Hi,

On Wed 30-11-16 09:08:41, Jan Kara wrote:
> > > +static int __dax_invalidate_mapping_entry(struct address_space *mapping,
> > > +					  pgoff_t index, bool trunc)
> > > +{
> > > +	int ret = 0;
> > > +	void *entry;
> > > +	struct radix_tree_root *page_tree = &mapping->page_tree;
> > > +
> > > +	spin_lock_irq(&mapping->tree_lock);
> > > +	entry = get_unlocked_mapping_entry(mapping, index, NULL);
> > > +	if (!entry || !radix_tree_exceptional_entry(entry))
> > > +		goto out;
> > > +	if (!trunc &&
> > > +	    (radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_DIRTY) ||
> > > +	     radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE)))
> > > +		goto out;
> > > +	radix_tree_delete(page_tree, index);
> > 
> > You could use the new __radix_tree_replace() here and save a second
> > tree lookup.
> 
> Hum, I'd need to return 'node' from get_unlocked_mapping_entry(). So
> probably I'll do it in a patch separate from this fix. But thanks for the
> suggestion.

So I did this and quickly spotted a problem: when you use
__radix_tree_replace() to clear an entry, it leaves the tags for that entry
set, and that results in surprises. So I think I'll leave the code using
radix_tree_delete() for now.

It would probably make sense to make __radix_tree_replace() clear the tags
when an entry is replaced with NULL, or at least WARN if some tags are still
set... What do you think?
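
A minimal sketch of the WARN variant, for illustration only - the
slot_tags_set() helper is an assumption, not an existing radix-tree API:

	/* Hypothetical helper: is any tag set on this slot? */
	static bool slot_tags_set(struct radix_tree_node *node,
				  unsigned int offset)
	{
		unsigned int tag;

		for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++)
			if (tag_get(node, tag, offset))
				return true;
		return false;
	}

	/* In __radix_tree_replace(), before storing the new item: */
	WARN_ON_ONCE(!item && node && slot_tags_set(node, offset));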

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH 5/6] dax: Call ->iomap_begin without entry lock during dax fault
  2016-12-12 16:47 [PATCH 0/6 v3] dax: Page invalidation fixes Jan Kara
  2016-12-12 16:47   ` Jan Kara
@ 2016-12-12 16:47   ` Jan Kara
  0 siblings, 0 replies; 56+ messages in thread
From: Jan Kara @ 2016-12-12 16:47 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Ross Zwisler, linux-mm, linux-ext4, Johannes Weiner, Jan Kara

Currently the ->iomap_begin() handler is called with the entry lock held. If
the filesystem holds any locks between ->iomap_begin() and ->iomap_end()
(such as ext4, which will want to hold a transaction open), this would cause
a lock inversion with iomap_apply() in the standard IO path, which first
calls ->iomap_begin() and only then calls the ->actor() callback, which
grabs entry locks for DAX (if it faults when copying from/to user-provided
buffers).

Fix the problem by nesting the grabbing of the entry lock inside the
->iomap_begin() - ->iomap_end() pair.

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c | 121 ++++++++++++++++++++++++++++++++++-----------------------------
 1 file changed, 66 insertions(+), 55 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index e186bba0a642..51b03e91d3e2 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1079,6 +1079,15 @@ dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
 }
 EXPORT_SYMBOL_GPL(dax_iomap_rw);
 
+static int dax_fault_return(int error)
+{
+	if (error == 0)
+		return VM_FAULT_NOPAGE;
+	if (error == -ENOMEM)
+		return VM_FAULT_OOM;
+	return VM_FAULT_SIGBUS;
+}
+
 /**
  * dax_iomap_fault - handle a page fault on a DAX file
  * @vma: The virtual memory area where the fault occurred
@@ -1111,12 +1120,6 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	if (pos >= i_size_read(inode))
 		return VM_FAULT_SIGBUS;
 
-	entry = grab_mapping_entry(mapping, vmf->pgoff, 0);
-	if (IS_ERR(entry)) {
-		error = PTR_ERR(entry);
-		goto out;
-	}
-
 	if ((vmf->flags & FAULT_FLAG_WRITE) && !vmf->cow_page)
 		flags |= IOMAP_WRITE;
 
@@ -1127,9 +1130,15 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	 */
 	error = ops->iomap_begin(inode, pos, PAGE_SIZE, flags, &iomap);
 	if (error)
-		goto unlock_entry;
+		return dax_fault_return(error);
 	if (WARN_ON_ONCE(iomap.offset + iomap.length < pos + PAGE_SIZE)) {
-		error = -EIO;		/* fs corruption? */
+		vmf_ret = dax_fault_return(-EIO);	/* fs corruption? */
+		goto finish_iomap;
+	}
+
+	entry = grab_mapping_entry(mapping, vmf->pgoff, 0);
+	if (IS_ERR(entry)) {
+		vmf_ret = dax_fault_return(PTR_ERR(entry));
 		goto finish_iomap;
 	}
 
@@ -1152,13 +1161,13 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		}
 
 		if (error)
-			goto finish_iomap;
+			goto error_unlock_entry;
 
 		__SetPageUptodate(vmf->cow_page);
 		vmf_ret = finish_fault(vmf);
 		if (!vmf_ret)
 			vmf_ret = VM_FAULT_DONE_COW;
-		goto finish_iomap;
+		goto unlock_entry;
 	}
 
 	switch (iomap.type) {
@@ -1170,12 +1179,15 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		}
 		error = dax_insert_mapping(mapping, iomap.bdev, sector,
 				PAGE_SIZE, &entry, vma, vmf);
+		/* -EBUSY is fine, somebody else faulted on the same PTE */
+		if (error == -EBUSY)
+			error = 0;
 		break;
 	case IOMAP_UNWRITTEN:
 	case IOMAP_HOLE:
 		if (!(vmf->flags & FAULT_FLAG_WRITE)) {
 			vmf_ret = dax_load_hole(mapping, &entry, vmf);
-			goto finish_iomap;
+			goto unlock_entry;
 		}
 		/*FALLTHRU*/
 	default:
@@ -1184,30 +1196,25 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		break;
 	}
 
- finish_iomap:
-	if (ops->iomap_end) {
-		if (error || (vmf_ret & VM_FAULT_ERROR)) {
-			/* keep previous error */
-			ops->iomap_end(inode, pos, PAGE_SIZE, 0, flags,
-					&iomap);
-		} else {
-			error = ops->iomap_end(inode, pos, PAGE_SIZE,
-					PAGE_SIZE, flags, &iomap);
-		}
-	}
+ error_unlock_entry:
+	vmf_ret = dax_fault_return(error) | major;
  unlock_entry:
 	put_locked_mapping_entry(mapping, vmf->pgoff, entry);
- out:
-	if (error == -ENOMEM)
-		return VM_FAULT_OOM | major;
-	/* -EBUSY is fine, somebody else faulted on the same PTE */
-	if (error < 0 && error != -EBUSY)
-		return VM_FAULT_SIGBUS | major;
-	if (vmf_ret) {
-		WARN_ON_ONCE(error); /* -EBUSY from ops->iomap_end? */
-		return vmf_ret;
+ finish_iomap:
+	if (ops->iomap_end) {
+		int copied = PAGE_SIZE;
+
+		if (vmf_ret & VM_FAULT_ERROR)
+			copied = 0;
+		/*
+		 * The fault is done by now and there's no way back (other
+		 * thread may be already happily using PTE we have installed).
+		 * Just ignore error from ->iomap_end since we cannot do much
+		 * with it.
+		 */
+		ops->iomap_end(inode, pos, PAGE_SIZE, copied, flags, &iomap);
 	}
-	return VM_FAULT_NOPAGE | major;
+	return vmf_ret;
 }
 EXPORT_SYMBOL_GPL(dax_iomap_fault);
 
@@ -1332,16 +1339,6 @@ int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		goto fallback;
 
 	/*
-	 * grab_mapping_entry() will make sure we get a 2M empty entry, a DAX
-	 * PMD or a HZP entry.  If it can't (because a 4k page is already in
-	 * the tree, for instance), it will return -EEXIST and we just fall
-	 * back to 4k entries.
-	 */
-	entry = grab_mapping_entry(mapping, pgoff, RADIX_DAX_PMD);
-	if (IS_ERR(entry))
-		goto fallback;
-
-	/*
 	 * Note that we don't use iomap_apply here.  We aren't doing I/O, only
 	 * setting up a mapping, so really we're using iomap_begin() as a way
 	 * to look up our filesystem block.
@@ -1349,10 +1346,21 @@ int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	pos = (loff_t)pgoff << PAGE_SHIFT;
 	error = ops->iomap_begin(inode, pos, PMD_SIZE, iomap_flags, &iomap);
 	if (error)
-		goto unlock_entry;
+		goto fallback;
+
 	if (iomap.offset + iomap.length < pos + PMD_SIZE)
 		goto finish_iomap;
 
+	/*
+	 * grab_mapping_entry() will make sure we get a 2M empty entry, a DAX
+	 * PMD or a HZP entry.  If it can't (because a 4k page is already in
+	 * the tree, for instance), it will return -EEXIST and we just fall
+	 * back to 4k entries.
+	 */
+	entry = grab_mapping_entry(mapping, pgoff, RADIX_DAX_PMD);
+	if (IS_ERR(entry))
+		goto finish_iomap;
+
 	vmf.pgoff = pgoff;
 	vmf.flags = flags;
 	vmf.gfp_mask = mapping_gfp_mask(mapping) | __GFP_IO;
@@ -1365,7 +1373,7 @@ int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	case IOMAP_UNWRITTEN:
 	case IOMAP_HOLE:
 		if (WARN_ON_ONCE(write))
-			goto finish_iomap;
+			goto unlock_entry;
 		result = dax_pmd_load_hole(vma, pmd, &vmf, address, &iomap,
 				&entry);
 		break;
@@ -1374,20 +1382,23 @@ int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		break;
 	}
 
+ unlock_entry:
+	put_locked_mapping_entry(mapping, pgoff, entry);
  finish_iomap:
 	if (ops->iomap_end) {
-		if (result == VM_FAULT_FALLBACK) {
-			ops->iomap_end(inode, pos, PMD_SIZE, 0, iomap_flags,
-					&iomap);
-		} else {
-			error = ops->iomap_end(inode, pos, PMD_SIZE, PMD_SIZE,
-					iomap_flags, &iomap);
-			if (error)
-				result = VM_FAULT_FALLBACK;
-		}
+		int copied = PMD_SIZE;
+
+		if (result == VM_FAULT_FALLBACK)
+			copied = 0;
+		/*
+		 * The fault is done by now and there's no way back (other
+		 * thread may be already happily using PMD we have installed).
+		 * Just ignore error from ->iomap_end since we cannot do much
+		 * with it.
+		 */
+		ops->iomap_end(inode, pos, PMD_SIZE, copied, iomap_flags,
+				&iomap);
 	}
- unlock_entry:
-	put_locked_mapping_entry(mapping, pgoff, entry);
  fallback:
 	if (result == VM_FAULT_FALLBACK) {
 		split_huge_pmd(vma, pmd, address);
-- 
2.10.2

^ permalink raw reply related	[flat|nested] 56+ messages in thread

Thread overview: 56+ messages

2016-11-24  9:46 [PATCH 0/6 v2] dax: Page invalidation fixes Jan Kara
2016-11-24  9:46 ` [PATCH 1/6] ext2: Return BH_New buffers for zeroed blocks Jan Kara
2016-11-29 17:48   ` Ross Zwisler
2016-11-24  9:46 ` [PATCH 2/6] mm: Invalidate DAX radix tree entries only if appropriate Jan Kara
2016-11-29 19:34   ` Johannes Weiner
2016-11-30  8:08     ` Jan Kara
2016-11-30 15:59       ` Johannes Weiner
2016-12-09 12:02       ` Jan Kara
2016-11-29 22:17   ` Ross Zwisler
2016-11-24  9:46 ` [PATCH 3/6] dax: Avoid page invalidation races and unnecessary radix tree traversals Jan Kara
2016-11-29 22:31   ` Ross Zwisler
2016-11-30  8:23     ` Jan Kara
2016-11-24  9:46 ` [PATCH 4/6] dax: Finish fault completely when loading holes Jan Kara
2016-12-01 22:13   ` Ross Zwisler
2016-11-24  9:46 ` [PATCH 5/6] dax: Call ->iomap_begin without entry lock during dax fault Jan Kara
2016-12-01 22:24   ` Ross Zwisler
2016-12-01 23:27     ` Ross Zwisler
2016-12-02 10:12       ` Jan Kara
2016-12-02 10:08     ` Jan Kara
2016-11-24  9:46 ` [PATCH 6/6] ext4: Simplify DAX fault path Jan Kara
2016-12-12 16:47 [PATCH 0/6 v3] dax: Page invalidation fixes Jan Kara
2016-12-12 16:47 ` [PATCH 5/6] dax: Call ->iomap_begin without entry lock during dax fault Jan Kara
