* [PATCH 0/6] dax: Page invalidation fixes
@ 2016-09-27 16:43 Jan Kara
  2016-09-27 16:43 ` [PATCH 1/6] dax: Do not warn about BH_New buffers Jan Kara
                   ` (6 more replies)
  0 siblings, 7 replies; 28+ messages in thread
From: Jan Kara @ 2016-09-27 16:43 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-nvdimm, Dan Williams, Ross Zwisler, Jan Kara

Hello,

these patches fix races when invalidating hole pages in DAX mappings; see
the changelogs for details. The series is based on my patches to
write-protect DAX PTEs, because we really need to closely track dirtiness
(and cleanness!) of radix tree entries in DAX mappings in order to avoid
discarding valid dirty bits, which would lead to missed cache flushes on
fsync(2).

								Honza

* [PATCH 1/6] dax: Do not warn about BH_New buffers
  2016-09-27 16:43 [PATCH 0/6] dax: Page invalidation fixes Jan Kara
@ 2016-09-27 16:43 ` Jan Kara
  2016-09-30  8:53   ` Christoph Hellwig
  2016-09-27 16:43 ` [PATCH 2/6] ext2: Return BH_New buffers for zeroed blocks Jan Kara
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 28+ messages in thread
From: Jan Kara @ 2016-09-27 16:43 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-nvdimm, Dan Williams, Ross Zwisler, Jan Kara

Filesystems will return BH_New buffers to the DAX code to indicate freshly
allocated blocks, which will then trigger synchronization of file mappings
in page tables with the actual block mappings. So do not warn about
returned BH_New buffers.
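
For context, the consumer of this flag added later in the series
(patch 6/6) reacts to BH_New roughly like this ("page" and "bh_max" are
the index variables already present in dax_io()):

	if (buffer_new(bh) && inode->i_mapping->nrpages) {
		invalidate_inode_pages2_range(inode->i_mapping, page,
					      (bh_max - 1) >> PAGE_SHIFT);
	}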

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c | 11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 233f548d298e..1542653e8aa1 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -158,8 +158,6 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
 		.addr = ERR_PTR(-EIO),
 	};
 	unsigned blkbits = inode->i_blkbits;
-	sector_t file_blks = (i_size_read(inode) + (1 << blkbits) - 1)
-								>> blkbits;
 
 	if (rw == READ)
 		end = min(end, i_size_read(inode));
@@ -186,9 +184,8 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
 				 * We allow uninitialized buffers for writes
 				 * beyond EOF as those cannot race with faults
 				 */
-				WARN_ON_ONCE(
-					(buffer_new(bh) && block < file_blks) ||
-					(rw == WRITE && buffer_unwritten(bh)));
+				WARN_ON_ONCE(rw == WRITE &&
+					     buffer_unwritten(bh));
 			} else {
 				unsigned done = bh->b_size -
 						(bh_max - (pos - first));
@@ -985,7 +982,7 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	}
 
 	/* Filesystem should not return unwritten buffers to us! */
-	WARN_ON_ONCE(buffer_unwritten(&bh) || buffer_new(&bh));
+	WARN_ON_ONCE(buffer_unwritten(&bh));
 	error = dax_insert_mapping(mapping, bh.b_bdev, to_sector(&bh, inode),
 			bh.b_size, &entry, vma, vmf);
  unlock_entry:
@@ -1094,7 +1091,7 @@ int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		if (get_block(inode, block, &bh, 1) != 0)
 			return VM_FAULT_SIGBUS;
 		alloc = true;
-		WARN_ON_ONCE(buffer_unwritten(&bh) || buffer_new(&bh));
+		WARN_ON_ONCE(buffer_unwritten(&bh));
 	}
 
 	bdev = bh.b_bdev;
-- 
2.6.6


* [PATCH 2/6] ext2: Return BH_New buffers for zeroed blocks
  2016-09-27 16:43 [PATCH 0/6] dax: Page invalidation fixes Jan Kara
  2016-09-27 16:43 ` [PATCH 1/6] dax: Do not warn about BH_New buffers Jan Kara
@ 2016-09-27 16:43 ` Jan Kara
  2016-09-30  8:53   ` Christoph Hellwig
  2016-09-27 16:43 ` [PATCH 3/6] ext4: Remove clearing of BH_New bit " Jan Kara
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 28+ messages in thread
From: Jan Kara @ 2016-09-27 16:43 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-nvdimm, Dan Williams, Ross Zwisler, Jan Kara

So far we did not return BH_New buffers from ext2_get_blocks() when we
allocated and zeroed out a block for a DAX inode, in order to avoid racy
zeroing in the DAX code. This zeroing is gone these days, so we can remove
the workaround.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext2/inode.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index f6312c153731..ac8edbd4af74 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -754,9 +754,8 @@ static int ext2_get_blocks(struct inode *inode,
 			mutex_unlock(&ei->truncate_mutex);
 			goto cleanup;
 		}
-	} else {
-		*new = true;
 	}
+	*new = true;
 
 	ext2_splice_branch(inode, iblock, partial, indirect_blks, count);
 	mutex_unlock(&ei->truncate_mutex);
-- 
2.6.6


* [PATCH 3/6] ext4: Remove clearing of BH_New bit for zeroed blocks
  2016-09-27 16:43 [PATCH 0/6] dax: Page invalidation fixes Jan Kara
  2016-09-27 16:43 ` [PATCH 1/6] dax: Do not warn about BH_New buffers Jan Kara
  2016-09-27 16:43 ` [PATCH 2/6] ext2: Return BH_New buffers for zeroed blocks Jan Kara
@ 2016-09-27 16:43 ` Jan Kara
  2016-09-30  8:53   ` Christoph Hellwig
  2016-09-27 16:43 ` [PATCH 4/6] xfs: Set BH_New for allocated DAX blocks in __xfs_get_blocks() Jan Kara
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 28+ messages in thread
From: Jan Kara @ 2016-09-27 16:43 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-nvdimm, Dan Williams, Ross Zwisler, Jan Kara

So far we did not return BH_New buffers from ext4_dax_get_block()
because that would trigger racy zeroing in the DAX code. This zeroing is
gone these days, so we can remove the workaround.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/inode.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 87150122d361..7ccd6fd7819d 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3298,11 +3298,6 @@ int ext4_dax_get_block(struct inode *inode, sector_t iblock,
 		if (ret < 0)
 			return ret;
 	}
-	/*
-	 * At least for now we have to clear BH_New so that DAX code
-	 * doesn't attempt to zero blocks again in a racy way.
-	 */
-	clear_buffer_new(bh_result);
 	return 0;
 }
 #else
-- 
2.6.6


* [PATCH 4/6] xfs: Set BH_New for allocated DAX blocks in __xfs_get_blocks()
  2016-09-27 16:43 [PATCH 0/6] dax: Page invalidation fixes Jan Kara
                   ` (2 preceding siblings ...)
  2016-09-27 16:43 ` [PATCH 3/6] ext4: Remove clearing of BH_New bit " Jan Kara
@ 2016-09-27 16:43 ` Jan Kara
  2016-09-27 17:01   ` Christoph Hellwig
  2016-09-27 16:43 ` [PATCH 5/6] mm: Invalidate DAX radix tree entries only if appropriate Jan Kara
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 28+ messages in thread
From: Jan Kara @ 2016-09-27 16:43 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-nvdimm, Dan Williams, Ross Zwisler, Jan Kara

So far we did not set BH_New for newly allocated blocks for DAX inodes
in __xfs_get_blocks() because we wanted to avoid the racy zeroing done in
generic DAX code. Now that the zeroing is gone, we can remove this
workaround and return BH_New for newly allocated blocks. DAX will use this
information to properly update the file's mappings.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/xfs/xfs_aops.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 4a28fa91e3b1..b25b7b5a1e6e 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1245,11 +1245,8 @@ __xfs_get_blocks(
 		goto out_unlock;
 	}
 
-	if (IS_DAX(inode) && create) {
+	if (IS_DAX(inode) && create)
 		ASSERT(!ISUNWRITTEN(&imap));
-		/* zeroing is not needed at a higher layer */
-		new = 0;
-	}
 
 	/* trim mapping down to size requested */
 	xfs_map_trim_size(inode, iblock, bh_result, &imap, offset, size);
-- 
2.6.6


* [PATCH 5/6] mm: Invalidate DAX radix tree entries only if appropriate
  2016-09-27 16:43 [PATCH 0/6] dax: Page invalidation fixes Jan Kara
                   ` (3 preceding siblings ...)
  2016-09-27 16:43 ` [PATCH 4/6] xfs: Set BH_New for allocated DAX blocks in __xfs_get_blocks() Jan Kara
@ 2016-09-27 16:43 ` Jan Kara
  2016-09-28  0:18   ` Dave Chinner
  2016-09-30  8:57   ` Christoph Hellwig
  2016-09-27 16:43 ` [PATCH 6/6] dax: Avoid page invalidation races and unnecessary radix tree traversals Jan Kara
  2016-09-30  8:59 ` [PATCH 0/6] dax: Page invalidation fixes Christoph Hellwig
  6 siblings, 2 replies; 28+ messages in thread
From: Jan Kara @ 2016-09-27 16:43 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-nvdimm, Dan Williams, Ross Zwisler, Jan Kara

Currently invalidate_inode_pages2_range() and invalidate_mapping_pages()
just delete all exceptional radix tree entries they find. For DAX this
is not desirable, as we track cache dirtiness in these entries, and when
they are evicted we may fail to flush caches even though it is necessary.
This can manifest, for example, when we write to the same block both via
mmap and via write(2) (at different offsets): fsync(2) then does not
properly flush CPU caches when the modification via write(2) was the
last one.
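
A rough (untested) reproducer sketch of the scenario - this assumes a
file on a DAX mount at /mnt/dax/file; the two offsets just need to land
in the same block:

	#include <fcntl.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		int fd = open("/mnt/dax/file", O_RDWR | O_CREAT, 0644);
		char *p;

		ftruncate(fd, 4096);
		p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED,
			 fd, 0);
		p[0] = 'a';              /* dirty the block via mmap */
		pwrite(fd, "b", 1, 512); /* invalidation for write(2) may
					  * evict the dirty radix tree entry */
		fsync(fd);               /* ...and then miss flushing p[0] */
		munmap(p, 4096);
		close(fd);
		return 0;
	}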

Create appropriate DAX functions to handle invalidation of DAX entries
for invalidate_inode_pages2_range() and invalidate_mapping_pages() and
wire them up into the corresponding mm functions.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c            | 71 +++++++++++++++++++++++++++++++++++++++++++++--------
 include/linux/dax.h |  2 ++
 mm/truncate.c       | 71 ++++++++++++++++++++++++++++++++++++++++++++---------
 3 files changed, 122 insertions(+), 22 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 1542653e8aa1..c8a639d2214e 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -521,16 +521,38 @@ static void put_unlocked_mapping_entry(struct address_space *mapping,
 	dax_wake_mapping_entry_waiter(mapping, index, false);
 }
 
+static int __dax_invalidate_mapping_entry(struct address_space *mapping,
+					  pgoff_t index, bool trunc)
+{
+	int ret = 0;
+	void *entry;
+	struct radix_tree_root *page_tree = &mapping->page_tree;
+
+	spin_lock_irq(&mapping->tree_lock);
+	entry = get_unlocked_mapping_entry(mapping, index, NULL);
+	if (!entry || !radix_tree_exceptional_entry(entry))
+		goto out;
+	if (!trunc &&
+	    (radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_DIRTY) ||
+	     radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE)))
+		goto out;
+	radix_tree_delete(page_tree, index);
+	mapping->nrexceptional--;
+	ret = 1;
+out:
+	spin_unlock_irq(&mapping->tree_lock);
+	if (ret)
+		dax_wake_mapping_entry_waiter(mapping, index, true);
+	return ret;
+}
 /*
  * Delete exceptional DAX entry at @index from @mapping. Wait for radix tree
  * entry to get unlocked before deleting it.
  */
 int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
 {
-	void *entry;
+	int ret = __dax_invalidate_mapping_entry(mapping, index, true);
 
-	spin_lock_irq(&mapping->tree_lock);
-	entry = get_unlocked_mapping_entry(mapping, index, NULL);
 	/*
 	 * This gets called from truncate / punch_hole path. As such, the caller
 	 * must hold locks protecting against concurrent modifications of the
@@ -538,16 +560,45 @@ int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
 	 * caller has seen exceptional entry for this index, we better find it
 	 * at that index as well...
 	 */
-	if (WARN_ON_ONCE(!entry || !radix_tree_exceptional_entry(entry))) {
-		spin_unlock_irq(&mapping->tree_lock);
-		return 0;
-	}
-	radix_tree_delete(&mapping->page_tree, index);
+	WARN_ON_ONCE(!ret);
+	return ret;
+}
+
+/*
+ * Invalidate exceptional DAX entry if easily possible. This handles DAX
+ * entries for invalidate_inode_pages() so we evict the entry only if we can
+ * do so without blocking.
+ */
+int dax_invalidate_mapping_entry(struct address_space *mapping, pgoff_t index)
+{
+	int ret = 0;
+	void *entry, **slot;
+	struct radix_tree_root *page_tree = &mapping->page_tree;
+
+	spin_lock_irq(&mapping->tree_lock);
+	entry = __radix_tree_lookup(page_tree, index, NULL, &slot);
+	if (!entry || !radix_tree_exceptional_entry(entry) ||
+	    slot_locked(mapping, slot))
+		goto out;
+	if (radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_DIRTY) ||
+	    radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE))
+		goto out;
+	radix_tree_delete(page_tree, index);
 	mapping->nrexceptional--;
+	ret = 1;
+out:
 	spin_unlock_irq(&mapping->tree_lock);
-	dax_wake_mapping_entry_waiter(mapping, index, true);
+	if (ret)
+		dax_wake_mapping_entry_waiter(mapping, index, true);
+	return ret;
+}
 
-	return 1;
+/*
+ * Invalidate exceptional DAX entry if it is clean.
+ */
+int dax_invalidate_mapping_entry2(struct address_space *mapping, pgoff_t index)
+{
+	return __dax_invalidate_mapping_entry(mapping, index, false);
 }
 
 /*
diff --git a/include/linux/dax.h b/include/linux/dax.h
index b1a1acd10df2..d2fd94b057fe 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -21,6 +21,8 @@ int iomap_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 			struct iomap_ops *ops);
 int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
 int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
+int dax_invalidate_mapping_entry(struct address_space *mapping, pgoff_t index);
+int dax_invalidate_mapping_entry2(struct address_space *mapping, pgoff_t index);
 void dax_wake_mapping_entry_waiter(struct address_space *mapping,
 				   pgoff_t index, bool wake_all);
 
diff --git a/mm/truncate.c b/mm/truncate.c
index a01cce450a26..bb91e3532ae6 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -30,14 +30,6 @@ static void clear_exceptional_entry(struct address_space *mapping,
 	struct radix_tree_node *node;
 	void **slot;
 
-	/* Handled by shmem itself */
-	if (shmem_mapping(mapping))
-		return;
-
-	if (dax_mapping(mapping)) {
-		dax_delete_mapping_entry(mapping, index);
-		return;
-	}
 	spin_lock_irq(&mapping->tree_lock);
 	/*
 	 * Regular page slots are stabilized by the page lock even
@@ -70,6 +62,56 @@ unlock:
 	spin_unlock_irq(&mapping->tree_lock);
 }
 
+/*
+ * Unconditionally remove exceptional entry. Usually called from truncate path.
+ */
+static void truncate_exceptional_entry(struct address_space *mapping,
+				       pgoff_t index, void *entry)
+{
+	/* Handled by shmem itself */
+	if (shmem_mapping(mapping))
+		return;
+
+	if (dax_mapping(mapping)) {
+		dax_delete_mapping_entry(mapping, index);
+		return;
+	}
+	clear_exceptional_entry(mapping, index, entry);
+}
+
+/*
+ * Invalidate exceptional entry if easily possible. This handles exceptional
+ * entries for invalidate_inode_pages() so for DAX it evicts only unlocked and
+ * clean entries.
+ */
+static int invalidate_exceptional_entry(struct address_space *mapping,
+					pgoff_t index, void *entry)
+{
+	/* Handled by shmem itself */
+	if (shmem_mapping(mapping))
+		return 1;
+	if (dax_mapping(mapping))
+		return dax_invalidate_mapping_entry(mapping, index);
+	clear_exceptional_entry(mapping, index, entry);
+	return 1;
+}
+
+/*
+ * Invalidate exceptional entry if clean. This handles exceptional entries for
+ * invalidate_inode_pages2() so for DAX it evicts only clean entries.
+ */
+static int invalidate_exceptional_entry2(struct address_space *mapping,
+					 pgoff_t index, void *entry)
+{
+	/* Handled by shmem itself */
+	if (shmem_mapping(mapping))
+		return 1;
+	if (dax_mapping(mapping))
+		return dax_invalidate_mapping_entry2(mapping, index);
+	clear_exceptional_entry(mapping, index, entry);
+	return 1;
+}
+
 /**
  * do_invalidatepage - invalidate part or all of a page
  * @page: the page which is affected
@@ -277,7 +319,8 @@ void truncate_inode_pages_range(struct address_space *mapping,
 				break;
 
 			if (radix_tree_exceptional_entry(page)) {
-				clear_exceptional_entry(mapping, index, page);
+				truncate_exceptional_entry(mapping, index,
+							   page);
 				continue;
 			}
 
@@ -366,7 +409,8 @@ void truncate_inode_pages_range(struct address_space *mapping,
 			}
 
 			if (radix_tree_exceptional_entry(page)) {
-				clear_exceptional_entry(mapping, index, page);
+				truncate_exceptional_entry(mapping, index,
+							   page);
 				continue;
 			}
 
@@ -485,7 +529,8 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
 				break;
 
 			if (radix_tree_exceptional_entry(page)) {
-				clear_exceptional_entry(mapping, index, page);
+				invalidate_exceptional_entry(mapping, index,
+							     page);
 				continue;
 			}
 
@@ -607,7 +652,9 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 				break;
 
 			if (radix_tree_exceptional_entry(page)) {
-				clear_exceptional_entry(mapping, index, page);
+				if (!invalidate_exceptional_entry2(mapping,
+								   index, page))
+					ret = -EBUSY;
 				continue;
 			}
 
-- 
2.6.6


* [PATCH 6/6] dax: Avoid page invalidation races and unnecessary radix tree traversals
  2016-09-27 16:43 [PATCH 0/6] dax: Page invalidation fixes Jan Kara
                   ` (4 preceding siblings ...)
  2016-09-27 16:43 ` [PATCH 5/6] mm: Invalidate DAX radix tree entries only if appropriate Jan Kara
@ 2016-09-27 16:43 ` Jan Kara
  2016-09-28  0:20   ` Dave Chinner
  2016-09-30  8:55   ` Christoph Hellwig
  2016-09-30  8:59 ` [PATCH 0/6] dax: Page invalidation fixes Christoph Hellwig
  6 siblings, 2 replies; 28+ messages in thread
From: Jan Kara @ 2016-09-27 16:43 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-nvdimm, Dan Williams, Ross Zwisler, Jan Kara

Currently each filesystem (possibly through generic_file_direct_write()
or iomap_dax_rw()) takes care of invalidating page tables and evicting
hole pages from the radix tree when a write(2) to the file happens. This
invalidation is only necessary when the write(2) results in some block
allocation. Furthermore, in its current place the invalidation is racy
with respect to a page fault instantiating a hole page just after we have
invalidated it.
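
Concretely, the race looks like this (a condensed sketch):

	CPU0: write(2)                     CPU1: page fault
	invalidate page tables and
	  radix tree entries
	                                   instantiate hole page, map
	                                     it into page tables
	get_block() - allocates block
	copy data into the block
	                                   mmap readers keep seeing
	                                     stale zeroes from the
	                                     hole page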

So perform the page invalidation inside dax_do_io(), where we can do it
only when really necessary and after blocks have been allocated, so
nobody will be instantiating new hole pages anymore.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c | 40 +++++++++++++++++++++++-----------------
 1 file changed, 23 insertions(+), 17 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index c8a639d2214e..2f69ca891aab 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -186,6 +186,18 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
 				 */
 				WARN_ON_ONCE(rw == WRITE &&
 					     buffer_unwritten(bh));
+				/*
+				 * Write can allocate block for an area which
+				 * has a hole page mapped into page tables. We
+				 * have to tear down these mappings so that
+				 * data written by write(2) is visible in mmap.
+				 */
+				if (buffer_new(bh) &&
+				    inode->i_mapping->nrpages) {
+					invalidate_inode_pages2_range(
+						  inode->i_mapping, page,
+						  (bh_max - 1) >> PAGE_SHIFT);
+				}
 			} else {
 				unsigned done = bh->b_size -
 						(bh_max - (pos - first));
@@ -1410,6 +1422,17 @@ iomap_dax_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 	if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED))
 		return -EIO;
 
+	/*
+	 * Write can allocate block for an area which has a hole page mapped
+	 * into page tables. We have to tear down these mappings so that data
+	 * written by write(2) is visible in mmap.
+	 */
+	if (iomap->flags & IOMAP_F_NEW && inode->i_mapping->nrpages) {
+		invalidate_inode_pages2_range(inode->i_mapping,
+					      pos >> PAGE_SHIFT,
+					      (end - 1) >> PAGE_SHIFT);
+	}
+
 	while (pos < end) {
 		unsigned offset = pos & (PAGE_SIZE - 1);
 		struct blk_dax_ctl dax = { 0 };
@@ -1469,23 +1492,6 @@ iomap_dax_rw(struct kiocb *iocb, struct iov_iter *iter,
 	if (iov_iter_rw(iter) == WRITE)
 		flags |= IOMAP_WRITE;
 
-	/*
-	 * Yes, even DAX files can have page cache attached to them:  A zeroed
-	 * page is inserted into the pagecache when we have to serve a write
-	 * fault on a hole.  It should never be dirtied and can simply be
-	 * dropped from the pagecache once we get real data for the page.
-	 *
-	 * XXX: This is racy against mmap, and there's nothing we can do about
-	 * it. We'll eventually need to shift this down even further so that
-	 * we can check if we allocated blocks over a hole first.
-	 */
-	if (mapping->nrpages) {
-		ret = invalidate_inode_pages2_range(mapping,
-				pos >> PAGE_SHIFT,
-				(pos + iov_iter_count(iter) - 1) >> PAGE_SHIFT);
-		WARN_ON_ONCE(ret);
-	}
-
 	while (iov_iter_count(iter)) {
 		ret = iomap_apply(inode, pos, iov_iter_count(iter), flags, ops,
 				iter, iomap_dax_actor);
-- 
2.6.6


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/6] xfs: Set BH_New for allocated DAX blocks in __xfs_get_blocks()
  2016-09-27 16:43 ` [PATCH 4/6] xfs: Set BH_New for allocated DAX blocks in __xfs_get_blocks() Jan Kara
@ 2016-09-27 17:01   ` Christoph Hellwig
  2016-09-27 17:17     ` Jan Kara
  0 siblings, 1 reply; 28+ messages in thread
From: Christoph Hellwig @ 2016-09-27 17:01 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, linux-nvdimm, Dan Williams, Ross Zwisler

On Tue, Sep 27, 2016 at 06:43:33PM +0200, Jan Kara wrote:
> So far we did not set BH_New for newly allocated blocks for DAX inodes
> in __xfs_get_blocks() because we wanted to avoid the racy zeroing done in
> generic DAX code. Now that the zeroing is gone, we can remove this
> workaround and return BH_New for newly allocated blocks. DAX will use this
> information to properly update the file's mappings.

__xfs_get_blocks isn't used by the DAX code any more.
xfs_file_iomap_begin should already be doing the right thing for now.

* Re: [PATCH 4/6] xfs: Set BH_New for allocated DAX blocks in __xfs_get_blocks()
  2016-09-27 17:01   ` Christoph Hellwig
@ 2016-09-27 17:17     ` Jan Kara
  2016-09-28  0:22       ` Dave Chinner
  2016-09-28  2:13       ` Christoph Hellwig
  0 siblings, 2 replies; 28+ messages in thread
From: Jan Kara @ 2016-09-27 17:17 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, linux-fsdevel, linux-nvdimm, Dan Williams, Ross Zwisler

On Tue 27-09-16 10:01:18, Christoph Hellwig wrote:
> On Tue, Sep 27, 2016 at 06:43:33PM +0200, Jan Kara wrote:
> > So far we did not set BH_New for newly allocated blocks for DAX inodes
> > in __xfs_get_blocks() because we wanted to avoid the racy zeroing done in
> > generic DAX code. Now that the zeroing is gone, we can remove this
> > workaround and return BH_New for newly allocated blocks. DAX will use this
> > information to properly update the file's mappings.
> 
> __xfs_get_blocks isn't used by the DAX code any more.
> xfs_file_iomap_begin should already be doing the right thing for now.

OK, the changelog is stale but I actually took care to integrate this with
your iomap patches and for the new invalidation code in iomap_dax_actor()
to work we need this additional information...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: [PATCH 5/6] mm: Invalidate DAX radix tree entries only if appropriate
  2016-09-27 16:43 ` [PATCH 5/6] mm: Invalidate DAX radix tree entries only if appropriate Jan Kara
@ 2016-09-28  0:18   ` Dave Chinner
  2016-10-03 14:52     ` Jan Kara
  2016-09-30  8:57   ` Christoph Hellwig
  1 sibling, 1 reply; 28+ messages in thread
From: Dave Chinner @ 2016-09-28  0:18 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, linux-nvdimm, Dan Williams, Ross Zwisler

On Tue, Sep 27, 2016 at 06:43:34PM +0200, Jan Kara wrote:
> +/*
> + * Invalidate exceptional DAX entry if it is clean.
> + */
> +int dax_invalidate_mapping_entry2(struct address_space *mapping, pgoff_t index)
> +{
> +	return __dax_invalidate_mapping_entry(mapping, index, false);
>  }

dax_try_invalidate_mapping_entry()

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 6/6] dax: Avoid page invalidation races and unnecessary radix tree traversals
  2016-09-27 16:43 ` [PATCH 6/6] dax: Avoid page invalidation races and unnecessary radix tree traversals Jan Kara
@ 2016-09-28  0:20   ` Dave Chinner
  2016-10-03 14:58     ` Jan Kara
  2016-09-30  8:55   ` Christoph Hellwig
  1 sibling, 1 reply; 28+ messages in thread
From: Dave Chinner @ 2016-09-28  0:20 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, linux-nvdimm, Dan Williams, Ross Zwisler

On Tue, Sep 27, 2016 at 06:43:35PM +0200, Jan Kara wrote:
> @@ -1410,6 +1422,17 @@ iomap_dax_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>  	if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED))
>  		return -EIO;
>  
> +	/*
> +	 * Write can allocate block for an area which has a hole page mapped
> +	 * into page tables. We have to tear down these mappings so that data
> +	 * written by write(2) is visible in mmap.
> +	 */
> +	if (iomap->flags & IOMAP_F_NEW && inode->i_mapping->nrpages) {

gcc should be throwing warnings about that:

	if ((iomap->flags & IOMAP_F_NEW) && inode->i_mapping->nrpages) {

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 4/6] xfs: Set BH_New for allocated DAX blocks in __xfs_get_blocks()
  2016-09-27 17:17     ` Jan Kara
@ 2016-09-28  0:22       ` Dave Chinner
  2016-10-03 14:44         ` Jan Kara
  2016-09-28  2:13       ` Christoph Hellwig
  1 sibling, 1 reply; 28+ messages in thread
From: Dave Chinner @ 2016-09-28  0:22 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, linux-fsdevel, linux-nvdimm, Dan Williams,
	Ross Zwisler

On Tue, Sep 27, 2016 at 07:17:07PM +0200, Jan Kara wrote:
> On Tue 27-09-16 10:01:18, Christoph Hellwig wrote:
> > On Tue, Sep 27, 2016 at 06:43:33PM +0200, Jan Kara wrote:
> > > So far we did not set BH_New for newly allocated blocks for DAX inodes
> > > in __xfs_get_blocks() because we wanted to avoid the racy zeroing done in
> > > generic DAX code. Now that the zeroing is gone, we can remove this
> > > workaround and return BH_New for newly allocated blocks. DAX will use this
> > > information to properly update the file's mappings.
> > 
> > __xfs_get_blocks isn't used by the DAX code any more.
> > xfs_file_iomap_begin should already be doing the right thing for now.
> 
> OK, the changelog is stale but I actually took care to integrate this with
> your iomap patches and for the new invalidation code in iomap_dax_actor()
> to work we need this additional information...

So this applies to the iomap-4.9-dax branch in the XFS tree?

https://git.kernel.org/cgit/linux/kernel/git/dgc/linux-xfs.git/log/?h=iomap-4.9-dax

i.e. Do I need to merge the patchset with that branch?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 4/6] xfs: Set BH_New for allocated DAX blocks in __xfs_get_blocks()
  2016-09-27 17:17     ` Jan Kara
  2016-09-28  0:22       ` Dave Chinner
@ 2016-09-28  2:13       ` Christoph Hellwig
  2016-10-03 14:42         ` Jan Kara
  1 sibling, 1 reply; 28+ messages in thread
From: Christoph Hellwig @ 2016-09-28  2:13 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, linux-fsdevel, linux-nvdimm, Dan Williams,
	Ross Zwisler

On Tue, Sep 27, 2016 at 07:17:07PM +0200, Jan Kara wrote:
> OK, the changelog is stale but I actually took care to integrate this with
> your iomap patches and for the new invalidation code in iomap_dax_actor()
> to work we need this additional information...

It's not just the changelogs (which will need updates on more than this
patch), but also the content.  We're not using get_blocks for DAX
anymore, so this patch should not be needed anymore.

* Re: [PATCH 1/6] dax: Do not warn about BH_New buffers
  2016-09-27 16:43 ` [PATCH 1/6] dax: Do not warn about BH_New buffers Jan Kara
@ 2016-09-30  8:53   ` Christoph Hellwig
  0 siblings, 0 replies; 28+ messages in thread
From: Christoph Hellwig @ 2016-09-30  8:53 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, linux-nvdimm, Dan Williams, Ross Zwisler

On Tue, Sep 27, 2016 at 06:43:30PM +0200, Jan Kara wrote:
> Filesystems will return BH_New buffers to the DAX code to indicate freshly
> allocated blocks, which will then trigger synchronization of file mappings
> in page tables with the actual block mappings. So do not warn about
> returned BH_New buffers.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH 2/6] ext2: Return BH_New buffers for zeroed blocks
  2016-09-27 16:43 ` [PATCH 2/6] ext2: Return BH_New buffers for zeroed blocks Jan Kara
@ 2016-09-30  8:53   ` Christoph Hellwig
  0 siblings, 0 replies; 28+ messages in thread
From: Christoph Hellwig @ 2016-09-30  8:53 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, linux-nvdimm, Dan Williams, Ross Zwisler

On Tue, Sep 27, 2016 at 06:43:31PM +0200, Jan Kara wrote:
> So far we did not return BH_New buffers from ext2_get_blocks() when we
> allocated and zeroed out a block for a DAX inode, in order to avoid racy
> zeroing in the DAX code. This zeroing is gone these days, so we can remove
> the workaround.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Looks fine:

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH 3/6] ext4: Remove clearing of BH_New bit for zeroed blocks
  2016-09-27 16:43 ` [PATCH 3/6] ext4: Remove clearing of BH_New bit " Jan Kara
@ 2016-09-30  8:53   ` Christoph Hellwig
  0 siblings, 0 replies; 28+ messages in thread
From: Christoph Hellwig @ 2016-09-30  8:53 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, linux-nvdimm, Dan Williams, Ross Zwisler

On Tue, Sep 27, 2016 at 06:43:32PM +0200, Jan Kara wrote:
> So far we did not return BH_New buffers from ext4_dax_get_block()
> because that would trigger racy zeroing in the DAX code. This zeroing is
> gone these days, so we can remove the workaround.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH 6/6] dax: Avoid page invalidation races and unnecessary radix tree traversals
  2016-09-27 16:43 ` [PATCH 6/6] dax: Avoid page invalidation races and unnecessary radix tree traversals Jan Kara
  2016-09-28  0:20   ` Dave Chinner
@ 2016-09-30  8:55   ` Christoph Hellwig
  2016-10-03 15:02     ` Jan Kara
  1 sibling, 1 reply; 28+ messages in thread
From: Christoph Hellwig @ 2016-09-30  8:55 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, linux-nvdimm, Dan Williams, Ross Zwisler

On Tue, Sep 27, 2016 at 06:43:35PM +0200, Jan Kara wrote:
> Currently each filesystem (possibly through generic_file_direct_write()
> or iomap_dax_rw()) takes care of invalidating page tables and evicting
> hole pages from the radix tree when a write(2) to the file happens. This
> invalidation is only necessary when the write(2) results in some block
> allocation. Furthermore, in its current place the invalidation is racy
> with respect to a page fault instantiating a hole page just after we have
> invalidated it.
> 
> So perform the page invalidation inside dax_do_io(), where we can do it
> only when really necessary and after blocks have been allocated, so
> nobody will be instantiating new hole pages anymore.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

This looks fine with the comment from Dave addressed:

Reviewed-by: Christoph Hellwig <hch@lst.de>

> +				if (buffer_new(bh) &&
> +				    inode->i_mapping->nrpages) {

Btw, it would be nice if the nrpages check could move into
invalidate_inode_pages2_range instead of having to bother with it in
the callers.

* Re: [PATCH 5/6] mm: Invalidate DAX radix tree entries only if appropriate
  2016-09-27 16:43 ` [PATCH 5/6] mm: Invalidate DAX radix tree entries only if appropriate Jan Kara
  2016-09-28  0:18   ` Dave Chinner
@ 2016-09-30  8:57   ` Christoph Hellwig
  2016-10-03 12:56     ` Jan Kara
  1 sibling, 1 reply; 28+ messages in thread
From: Christoph Hellwig @ 2016-09-30  8:57 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, linux-nvdimm, Dan Williams, Ross Zwisler

On Tue, Sep 27, 2016 at 06:43:34PM +0200, Jan Kara wrote:
> Currently invalidate_inode_pages2_range() and invalidate_mapping_pages()
> just delete all exceptional radix tree entries they find. For DAX this
> is not desirable, as we track cache dirtiness in these entries, and when
> they are evicted we may fail to flush caches even though it is necessary.
> This can manifest, for example, when we write to the same block both via
> mmap and via write(2) (at different offsets): fsync(2) then does not
> properly flush CPU caches when the modification via write(2) was the
> last one.

Can you come up with an xfstests test case for these data loss cases?

* Re: [PATCH 0/6] dax: Page invalidation fixes
  2016-09-27 16:43 [PATCH 0/6] dax: Page invalidation fixes Jan Kara
                   ` (5 preceding siblings ...)
  2016-09-27 16:43 ` [PATCH 6/6] dax: Avoid page invalidation races and unnecessary radix tree traversals Jan Kara
@ 2016-09-30  8:59 ` Christoph Hellwig
  2016-10-03 13:01   ` Jan Kara
  6 siblings, 1 reply; 28+ messages in thread
From: Christoph Hellwig @ 2016-09-30  8:59 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, linux-nvdimm, Dan Williams, Ross Zwisler

On Tue, Sep 27, 2016 at 06:43:29PM +0200, Jan Kara wrote:
> Hello,
> 
> these patches fix races when invalidating hole pages in DAX mappings; see
> the changelogs for details. The series is based on my patches to
> write-protect DAX PTEs, because we really need to closely track dirtiness
> (and cleanness!) of radix tree entries in DAX mappings in order to avoid
> discarding valid dirty bits, which would lead to missed cache flushes on
> fsync(2).

Except for the VM magic I don't feel qualified to review this looks fine
to me.  But don't we need to switch ext4 away from going through the DIO
path first before the limited invalidation is effective there?
Otherwise we'll still get the full range invalidation from
generic_file_direct_write?

* Re: [PATCH 5/6] mm: Invalidate DAX radix tree entries only if appropriate
  2016-09-30  8:57   ` Christoph Hellwig
@ 2016-10-03 12:56     ` Jan Kara
  0 siblings, 0 replies; 28+ messages in thread
From: Jan Kara @ 2016-10-03 12:56 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, linux-fsdevel, linux-nvdimm, Dan Williams, Ross Zwisler

On Fri 30-09-16 01:57:14, Christoph Hellwig wrote:
> On Tue, Sep 27, 2016 at 06:43:34PM +0200, Jan Kara wrote:
> > Currently invalidate_inode_pages2_range() and invalidate_mapping_pages()
> > just delete all exceptional radix tree entries they find. For DAX this
> > is not desirable, as we track cache dirtiness in these entries, and when
> > they are evicted we may fail to flush caches even though it is necessary.
> > This can manifest, for example, when we write to the same block both via
> > mmap and via write(2) (at different offsets): fsync(2) then does not
> > properly flush CPU caches when the modification via write(2) was the
> > last one.
> 
> Can you come up with an xfstests test case for these data loss cases?

I'm not sure how to easily do this. For DAX to be enabled, we need
memory-like storage, so there's no easy way to intercept writes that may
still be only in CPU caches and not yet in persistent memory. Eventually
we may want to write a DAX flush testing driver - something like a
ramdisk, but one that would keep a "cached" and a "persistent" version of
each page and, on wb_cache_pmem(), copy one version to the other. We'd
also need to hook into copy_from_iter_nocache() and such to make
cache-avoiding writes behave as expected, but all in all it should be
doable...
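
Very roughly, the core of such a driver could look like this (a
hypothetical sketch only - nothing like it exists yet, and all names are
made up):

	#include <linux/string.h>

	/* per-page state kept by the hypothetical test driver */
	struct flushtest_page {
		void *cached;		/* what normal reads and writes hit */
		void *persistent;	/* updated only on simulated flush */
	};

	/* stand-in for the real wb_cache_pmem(): "write back" one range */
	static void flushtest_wb_cache_pmem(struct flushtest_page *pg,
					    unsigned int off, unsigned int len)
	{
		/* simulate the cache writeback reaching the media */
		memcpy(pg->persistent + off, pg->cached + off, len);
	}

After a simulated power failure the device would expose only the
"persistent" copies, so data that was never flushed is lost just like on
real pmem.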

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: [PATCH 0/6] dax: Page invalidation fixes
  2016-09-30  8:59 ` [PATCH 0/6] dax: Page invalidation fixes Christoph Hellwig
@ 2016-10-03 13:01   ` Jan Kara
  0 siblings, 0 replies; 28+ messages in thread
From: Jan Kara @ 2016-10-03 13:01 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, linux-fsdevel, linux-nvdimm, Dan Williams, Ross Zwisler

On Fri 30-09-16 01:59:05, Christoph Hellwig wrote:
> On Tue, Sep 27, 2016 at 06:43:29PM +0200, Jan Kara wrote:
> > Hello,
> > 
> > these patches fix races when invalidating hole pages in DAX mappings; see
> > the changelogs for details. The series is based on my patches to
> > write-protect DAX PTEs, because we really need to closely track dirtiness
> > (and cleanness!) of radix tree entries in DAX mappings in order to avoid
> > discarding valid dirty bits, which would lead to missed cache flushes on
> > fsync(2).
> 
> Except for the VM magic I don't feel qualified to review this looks fine
> to me.  But don't we need to switch ext4 away from going through the DIO
> path first before the limited invalidation is effective there?
> Otherwise we'll still get the full range invalidation from
> generic_file_direct_write?

After patch 5/6 the additional invalidation we get from
generic_file_direct_write() for ext4 is racy but does no harm, since we
later do a reliable invalidation directly from dax_io(), which patch 6/6
takes care to update. So it will be inefficient, but it will work. Once
ext4 is converted, we get rid of this redundancy.
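
Roughly, the ext4 DAX write path after this series looks like this (a
sketch of the call chain from my reading of the code):

	ext4_file_write_iter()
	  generic_file_direct_write()
	    invalidate_inode_pages2_range()	<- racy, now redundant
	    ext4_direct_IO()
	      dax_do_io()
	        invalidate_inode_pages2_range()	<- reliable, after allocation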

BTW I have local patches that avoid this inefficiency for ext4, but I
figured there's no point in posting them since we want to convert ext4 to
the iomap interface sooner rather than later.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: [PATCH 4/6] xfs: Set BH_New for allocated DAX blocks in __xfs_get_blocks()
  2016-09-28  2:13       ` Christoph Hellwig
@ 2016-10-03 14:42         ` Jan Kara
  2016-10-03 16:39           ` Christoph Hellwig
  0 siblings, 1 reply; 28+ messages in thread
From: Jan Kara @ 2016-10-03 14:42 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, linux-fsdevel, linux-nvdimm, Dan Williams, Ross Zwisler

On Tue 27-09-16 19:13:50, Christoph Hellwig wrote:
> On Tue, Sep 27, 2016 at 07:17:07PM +0200, Jan Kara wrote:
> > OK, the changelog is stale but I actually took care to integrate this with
> > your iomap patches and for the new invalidation code in iomap_dax_actor()
> > to work we need this additional information...
> 
> It's not just the changelogs (which will need updates on more than this
> patch), but also the content.  We're not using get_blocks for DAX
> anymore, so this patch should not be needed anymore.

Ah, you are right. After reading the code again, this patch is not needed
anymore because xfs_file_iomap_begin() returns IOMAP_F_NEW whenever we
allocated a block, regardless of whether it was zeroed out or not. But if
I get it right, all the IS_DAX checks in __xfs_get_blocks() can be
dropped now, can't they?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: [PATCH 4/6] xfs: Set BH_New for allocated DAX blocks in __xfs_get_blocks()
  2016-09-28  0:22       ` Dave Chinner
@ 2016-10-03 14:44         ` Jan Kara
  0 siblings, 0 replies; 28+ messages in thread
From: Jan Kara @ 2016-10-03 14:44 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Christoph Hellwig, linux-fsdevel, linux-nvdimm,
	Dan Williams, Ross Zwisler

On Wed 28-09-16 10:22:55, Dave Chinner wrote:
> On Tue, Sep 27, 2016 at 07:17:07PM +0200, Jan Kara wrote:
> > On Tue 27-09-16 10:01:18, Christoph Hellwig wrote:
> > > On Tue, Sep 27, 2016 at 06:43:33PM +0200, Jan Kara wrote:
> > > > So far we did not set BH_New for newly allocated blocks for DAX inodes
> > > > in __xfs_get_blocks() because we wanted to avoid the racy zeroing done in
> > > > generic DAX code. Now that the zeroing is gone, we can remove this
> > > > workaround and return BH_New for newly allocated blocks. DAX will use this
> > > > information to properly update the file's mappings.
> > > 
> > > __xfs_get_blocks isn't used by the DAX code any more.
> > > xfs_file_iomap_begin should already be doing the right thing for now.
> > 
> > OK, the changelog is stale but I actually took care to integrate this with
> > your iomap patches and for the new invalidation code in iomap_dax_actor()
> > to work we need this additional information...
> 
> So this applies to the iomap-4.9-dax branch in the XFS tree?

Yes.

> https://git.kernel.org/cgit/linux/kernel/git/dgc/linux-xfs.git/log/?h=iomap-4.9-dax
> 
> i.e. Do I need to merge the patchset with that branch?

This patch series cannot be immediately applied as it functionally depends
on the patches to allow safe clearing of dirty bits in DAX mappings.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: [PATCH 5/6] mm: Invalidate DAX radix tree entries only if appropriate
  2016-09-28  0:18   ` Dave Chinner
@ 2016-10-03 14:52     ` Jan Kara
  0 siblings, 0 replies; 28+ messages in thread
From: Jan Kara @ 2016-10-03 14:52 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, linux-fsdevel, linux-nvdimm, Dan Williams, Ross Zwisler

On Wed 28-09-16 10:18:40, Dave Chinner wrote:
> On Tue, Sep 27, 2016 at 06:43:34PM +0200, Jan Kara wrote:
> > +/*
> > + * Invalidate exceptional DAX entry if it is clean.
> > + */
> > +int dax_invalidate_mapping_entry2(struct address_space *mapping, pgoff_t index)
> > +{
> > +	return __dax_invalidate_mapping_entry(mapping, index, false);
> >  }
> 
> dax_try_invalidate_mapping_entry()

Or maybe dax_invalidate_clean_mapping_entry()? I was originally trying to
keep consistency with how the functions in mm/truncate.c are traditionally
named, but even there it's not quite consistent, so better to have a more
descriptive name.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: [PATCH 6/6] dax: Avoid page invalidation races and unnecessary radix tree traversals
  2016-09-28  0:20   ` Dave Chinner
@ 2016-10-03 14:58     ` Jan Kara
  2016-10-03 23:40       ` Dave Chinner
  0 siblings, 1 reply; 28+ messages in thread
From: Jan Kara @ 2016-10-03 14:58 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, linux-fsdevel, linux-nvdimm, Dan Williams, Ross Zwisler

On Wed 28-09-16 10:20:34, Dave Chinner wrote:
> On Tue, Sep 27, 2016 at 06:43:35PM +0200, Jan Kara wrote:
> > @@ -1410,6 +1422,17 @@ iomap_dax_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> >  	if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED))
> >  		return -EIO;
> >  
> > +	/*
> > +	 * Write can allocate block for an area which has a hole page mapped
> > +	 * into page tables. We have to tear down these mappings so that data
> > +	 * written by write(2) is visible in mmap.
> > +	 */
> > +	if (iomap->flags & IOMAP_F_NEW && inode->i_mapping->nrpages) {
> 
> gcc should be throwing warnings about that:
> 
> 	if ((iomap->flags & IOMAP_F_NEW) && inode->i_mapping->nrpages) {

Actually the bitwise '&' takes precedence over the logical '&&', so the
evaluation order ends up being correct. But I agree it's better to be
explicit with parentheses here. Fixed.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: [PATCH 6/6] dax: Avoid page invalidation races and unnecessary radix tree traversals
  2016-09-30  8:55   ` Christoph Hellwig
@ 2016-10-03 15:02     ` Jan Kara
  0 siblings, 0 replies; 28+ messages in thread
From: Jan Kara @ 2016-10-03 15:02 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, linux-fsdevel, linux-nvdimm, Dan Williams, Ross Zwisler

On Fri 30-09-16 01:55:44, Christoph Hellwig wrote:
> On Tue, Sep 27, 2016 at 06:43:35PM +0200, Jan Kara wrote:
> > Currently each filesystem (possibly through generic_file_direct_write()
> > or iomap_dax_rw()) takes care of invalidating page tables and evicting
> > hole pages from the radix tree when a write(2) to the file happens. This
> > invalidation is only necessary when the write(2) results in some block
> > allocation. Furthermore, in its current place the invalidation is racy
> > with respect to a page fault instantiating a hole page just after we have
> > invalidated it.
> > 
> > So perform the page invalidation inside dax_do_io(), where we can do it
> > only when really necessary and after blocks have been allocated, so
> > nobody will be instantiating new hole pages anymore.
> > 
> > Signed-off-by: Jan Kara <jack@suse.cz>
> 
> This looks fine with the comment from Dave addressed:
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>

Thanks.

> > +				if (buffer_new(bh) &&
> > +				    inode->i_mapping->nrpages) {
> 
> Btw, it would be nice if the nrpages check could move into
> invalidate_inode_pages2_range instead of having to bother with it in
> the callers.

We cannot do that - someone could possibly be calling
invalidate_inode_pages2_range() to invalidate exceptional entries in the
given range (although I don't think there's currently any such caller).
The DAX code currently needs it only to invalidate hole pages, which is
why it can optimize by calling invalidate_inode_pages2_range() only when
mapping->nrpages > 0. But in general it would be a trap for
invalidate_inode_pages2_range() to work only if mapping->nrpages > 0...
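
So the optimization stays at the DAX call sites, along the lines of what
patch 6/6 does in iomap_dax_actor():

	if ((iomap->flags & IOMAP_F_NEW) && inode->i_mapping->nrpages) {
		invalidate_inode_pages2_range(inode->i_mapping,
					      pos >> PAGE_SHIFT,
					      (end - 1) >> PAGE_SHIFT);
	}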

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: [PATCH 4/6] xfs: Set BH_New for allocated DAX blocks in __xfs_get_blocks()
  2016-10-03 14:42         ` Jan Kara
@ 2016-10-03 16:39           ` Christoph Hellwig
  0 siblings, 0 replies; 28+ messages in thread
From: Christoph Hellwig @ 2016-10-03 16:39 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, linux-fsdevel, linux-nvdimm, Dan Williams,
	Ross Zwisler

On Mon, Oct 03, 2016 at 04:42:46PM +0200, Jan Kara wrote:
> Ah, you are right. After reading the code again, this patch is not needed
> anymore because xfs_file_iomap_begin() returns IOMAP_F_NEW whenever we
> allocated a block, regardless of whether it was zeroed out or not. But if
> I get it right, all the IS_DAX checks in __xfs_get_blocks() can be
> dropped now, can't they?

Right now the code is required to keep the pmd_fault handler (which is
always disabled, though) compiling. Once that one is gone, the DAX code
in __xfs_get_blocks() can be removed. In fact, Ross sent a patch to do
just that as part of his PMD fault rework.

* Re: [PATCH 6/6] dax: Avoid page invalidation races and unnecessary radix tree traversals
  2016-10-03 14:58     ` Jan Kara
@ 2016-10-03 23:40       ` Dave Chinner
  0 siblings, 0 replies; 28+ messages in thread
From: Dave Chinner @ 2016-10-03 23:40 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, linux-nvdimm, Dan Williams, Ross Zwisler

On Mon, Oct 03, 2016 at 04:58:02PM +0200, Jan Kara wrote:
> On Wed 28-09-16 10:20:34, Dave Chinner wrote:
> > On Tue, Sep 27, 2016 at 06:43:35PM +0200, Jan Kara wrote:
> > > @@ -1410,6 +1422,17 @@ iomap_dax_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> > >  	if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED))
> > >  		return -EIO;
> > >  
> > > +	/*
> > > +	 * Write can allocate block for an area which has a hole page mapped
> > > +	 * into page tables. We have to tear down these mappings so that data
> > > +	 * written by write(2) is visible in mmap.
> > > +	 */
> > > +	if (iomap->flags & IOMAP_F_NEW && inode->i_mapping->nrpages) {
> > 
> > gcc should be throwing warnings about that:
> > 
> > 	if ((iomap->flags & IOMAP_F_NEW) && inode->i_mapping->nrpages) {
> 
> Actually the bitwise '&' takes precedence over the logical '&&', so the
> evaluation order ends up being correct.

Yes, I know that.

However, my concern is that such expressions don't indicate the
/intent/ of the author, and so it can be difficult when reading the
code to determine whether the logic is correct or whether it is a typo
and is wrong. In many cases, & and && will function identically for
the tested cases, so neither cursory review nor testing would catch
something like this being wrong.
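
For example (my own illustration, not from the patch): with x = 2 and
y = 1, "x & y" evaluates to 0 while "x && y" evaluates to 1, so a stray
'&' can silently change the result of a condition and still compile
cleanly.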

> But I agree it's better to be
> explicit with parentheses here. Fixed.

Yup, a little bit of "documentation" goes a long way :P

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
