linux-kernel.vger.kernel.org archive mirror
* [PATCH v6 0/7] fsdax,xfs: Add reflink&dedupe support for fsdax
@ 2021-05-19  6:00 Shiyang Ruan
  2021-05-19  6:00 ` [PATCH v6 1/7] fsdax: Introduce dax_iomap_cow_copy() Shiyang Ruan
                   ` (7 more replies)
  0 siblings, 8 replies; 23+ messages in thread
From: Shiyang Ruan @ 2021-05-19  6:00 UTC (permalink / raw)
  To: linux-kernel, linux-xfs, linux-nvdimm, linux-fsdevel
  Cc: darrick.wong, dan.j.williams, willy, viro, david, hch, rgoldwyn

This patchset is an attempt to add CoW support for fsdax, taking XFS,
which has both the reflink and fsdax features, as an example.

Changes from V5:
 - Fix the lock order of xfs_inode in xfs_mmaplock_two_inodes_and_break_dax_layout()
 - move dax_remap_file_range_prep() to fs/dax.c
 - change type of length to uint64_t in dax_iomap_cow_copy()
 - fix mistake in dax_iomap_zero()

Changes from V4:
 - Fix the mistake of breaking dax layout for two inodes
 - Add CONFIG_FS_DAX judgement for fsdax code in remap_range.c
 - Fix other small problems and mistakes

One of the key mechanisms that needs to be implemented in fsdax is CoW:
copy the data from srcmap before we actually write data to the
destination iomap.  We copy only the range in which data won't be changed.

Another mechanism is range comparison.  In the page cache case, readpage()
is used to load data from disk into the page cache so that it can be
compared.  In the fsdax case, readpage() does not work, so we need
another way to compare data that supports direct access.

With the two mechanisms implemented in fsdax, we are able to make reflink
and fsdax work together in XFS.

Some of the patches are picked up from Goldwyn's patchset.  I made some
changes to adapt to this patchset.


(Rebased on v5.13-rc2 and patchset[1])
[1]: https://lkml.org/lkml/2021/4/22/575

Shiyang Ruan (7):
  fsdax: Introduce dax_iomap_cow_copy()
  fsdax: Replace mmap entry in case of CoW
  fsdax: Add dax_iomap_cow_copy() for dax_iomap_zero
  iomap: Introduce iomap_apply2() for operations on two files
  fsdax: Dedup file range to use a compare function
  fs/xfs: Handle CoW for fsdax write() path
  fs/xfs: Add dax dedupe support

 fs/dax.c               | 216 ++++++++++++++++++++++++++++++++++++-----
 fs/iomap/apply.c       |  52 ++++++++++
 fs/iomap/buffered-io.c |   2 +-
 fs/remap_range.c       |  36 +++++--
 fs/xfs/xfs_bmap_util.c |   3 +-
 fs/xfs/xfs_file.c      |  11 +--
 fs/xfs/xfs_inode.c     |  57 +++++++++++
 fs/xfs/xfs_inode.h     |   1 +
 fs/xfs/xfs_iomap.c     |  38 +++++++-
 fs/xfs/xfs_iomap.h     |  24 +++++
 fs/xfs/xfs_iops.c      |   7 +-
 fs/xfs/xfs_reflink.c   |  15 +--
 include/linux/dax.h    |  11 ++-
 include/linux/fs.h     |  12 ++-
 include/linux/iomap.h  |   7 +-
 15 files changed, 431 insertions(+), 61 deletions(-)

-- 
2.31.1





* [PATCH v6 1/7] fsdax: Introduce dax_iomap_cow_copy()
  2021-05-19  6:00 [PATCH v6 0/7] fsdax,xfs: Add reflink&dedupe support for fsdax Shiyang Ruan
@ 2021-05-19  6:00 ` Shiyang Ruan
  2021-05-19  6:00 ` [PATCH v6 2/7] fsdax: Replace mmap entry in case of CoW Shiyang Ruan
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Shiyang Ruan @ 2021-05-19  6:00 UTC (permalink / raw)
  To: linux-kernel, linux-xfs, linux-nvdimm, linux-fsdevel
  Cc: darrick.wong, dan.j.williams, willy, viro, david, hch, rgoldwyn,
	Darrick J . Wong

If the iomap is for a write operation and iomap is not equal to srcmap
after iomap_begin, we consider it a CoW operation.

The destination extent which iomap indicates is a newly allocated extent,
so the data needs to be copied from srcmap to it.
In theory, it is better to copy only the head and tail ranges, which lie
outside the range being written, instead of copying the whole aligned
range.  But in a dax page fault, the range is always aligned, so
we have to copy the whole range in this case.
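
The head/tail arithmetic described above can be sketched in userspace C.
This is only an illustration under stated assumptions, not the kernel code:
cow_copy_edges() is a hypothetical name, src and dst are assumed to point at
the start of the same aligned block, and memcpy() stands in for
copy_mc_to_kernel().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Illustrative sketch: before [pos, pos + length) is overwritten in dst,
 * bring over from src the bytes of the aligned block(s) that the write
 * will NOT touch.  align must be a power of two.
 */
static void cow_copy_edges(unsigned char *dst, const unsigned char *src,
			   size_t pos, size_t length, size_t align)
{
	size_t head_off, end, blk_end;

	assert(align != 0 && (align & (align - 1)) == 0);

	head_off = pos & (align - 1);			/* offset inside the block */
	end = head_off + length;			/* end of the written range */
	blk_end = (end + align - 1) & ~(align - 1);	/* end rounded up to align */

	/* Fully aligned range (the page fault case): copy everything. */
	if (head_off == 0 && end == blk_end) {
		memcpy(dst, src, length);
		return;
	}

	/* Head: bytes in front of the written range. */
	if (head_off)
		memcpy(dst, src, head_off);

	/* Tail: bytes behind the written range, up to the block end. */
	if (end < blk_end)
		memcpy(dst + end, src + end, blk_end - end);
}
```

For example, with align = 8, pos = 2 and length = 3, only bytes 0-1 and 5-7
of the block are copied; bytes 2-4 are left for the write that follows.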

Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/dax.c | 86 ++++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 81 insertions(+), 5 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index f661227b49cd..6396f091e60b 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1049,6 +1049,61 @@ static int dax_iomap_direct_access(struct iomap *iomap, loff_t pos, size_t size,
 	return rc;
 }
 
+/**
+ * dax_iomap_cow_copy(): Copy the data from source to destination before write.
+ * @pos:	address to do copy from.
+ * @length:	size of copy operation.
+ * @align_size:	aligned w.r.t align_size (either PMD_SIZE or PAGE_SIZE)
+ * @srcmap:	iomap srcmap
+ * @daddr:	destination address to copy to.
+ *
+ * This can be called from two places. Either during DAX write fault, to copy
+ * the length size data to daddr. Or, while doing normal DAX write operation,
+ * dax_iomap_actor() might call this to do the copy of either start or end
+ * unaligned address. In this case the rest of the copy of aligned ranges is
+ * taken care by dax_iomap_actor() itself.
+ * Also, note DAX fault will always result in aligned pos and pos + length.
+ */
+static int dax_iomap_cow_copy(loff_t pos, uint64_t length, size_t align_size,
+		struct iomap *srcmap, void *daddr)
+{
+	loff_t head_off = pos & (align_size - 1);
+	size_t size = ALIGN(head_off + length, align_size);
+	loff_t end = pos + length;
+	loff_t pg_end = round_up(end, align_size);
+	bool copy_all = head_off == 0 && end == pg_end;
+	void *saddr = 0;
+	int ret = 0;
+
+	ret = dax_iomap_direct_access(srcmap, pos, size, &saddr, NULL);
+	if (ret)
+		return ret;
+
+	if (copy_all) {
+		ret = copy_mc_to_kernel(daddr, saddr, length);
+		return ret ? -EIO : 0;
+	}
+
+	/* Copy the head part of the range.  Note: we pass offset as length. */
+	if (head_off) {
+		ret = copy_mc_to_kernel(daddr, saddr, head_off);
+		if (ret)
+			return -EIO;
+	}
+
+	/* Copy the tail part of the range */
+	if (end < pg_end) {
+		loff_t tail_off = head_off + length;
+		loff_t tail_len = pg_end - end;
+
+		ret = copy_mc_to_kernel(daddr + tail_off, saddr + tail_off,
+					tail_len);
+		if (ret)
+			return -EIO;
+	}
+	return 0;
+}
+
 /*
  * The user has performed a load from a hole in the file.  Allocating a new
  * page in the file would cause excessive storage usage for workloads with
@@ -1178,11 +1233,12 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 	struct dax_device *dax_dev = iomap->dax_dev;
 	struct iov_iter *iter = data;
 	loff_t end = pos + length, done = 0;
+	bool write = iov_iter_rw(iter) == WRITE;
 	ssize_t ret = 0;
 	size_t xfer;
 	int id;
 
-	if (iov_iter_rw(iter) == READ) {
+	if (!write) {
 		end = min(end, i_size_read(inode));
 		if (pos >= end)
 			return 0;
@@ -1191,7 +1247,12 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 			return iov_iter_zero(min(length, end - pos), iter);
 	}
 
-	if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED))
+	/*
+	 * In DAX mode, we allow either pure overwrites of written extents, or
+	 * writes to unwritten extents as part of a copy-on-write operation.
+	 */
+	if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED &&
+			!(iomap->flags & IOMAP_F_SHARED)))
 		return -EIO;
 
 	/*
@@ -1230,6 +1291,13 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 			break;
 		}
 
+		if (write && srcmap->addr != iomap->addr) {
+			ret = dax_iomap_cow_copy(pos, length, PAGE_SIZE, srcmap,
+						 kaddr);
+			if (ret)
+				break;
+		}
+
 		map_len = PFN_PHYS(map_len);
 		kaddr += offset;
 		map_len -= offset;
@@ -1241,7 +1309,7 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		 * validated via access_ok() in either vfs_read() or
 		 * vfs_write(), depending on which operation we are doing.
 		 */
-		if (iov_iter_rw(iter) == WRITE)
+		if (write)
 			xfer = dax_copy_from_iter(dax_dev, pgoff, kaddr,
 					map_len, iter);
 		else
@@ -1393,6 +1461,7 @@ static vm_fault_t dax_fault_actor(struct vm_fault *vmf, pfn_t *pfnp,
 	unsigned long entry_flags = pmd ? DAX_PMD : 0;
 	int err = 0;
 	pfn_t pfn;
+	void *kaddr;
 
 	/* if we are reading UNWRITTEN and HOLE, return a hole. */
 	if (!write &&
@@ -1403,18 +1472,25 @@ static vm_fault_t dax_fault_actor(struct vm_fault *vmf, pfn_t *pfnp,
 			return dax_pmd_load_hole(xas, vmf, iomap, entry);
 	}
 
-	if (iomap->type != IOMAP_MAPPED) {
+	if (iomap->type != IOMAP_MAPPED && !(iomap->flags & IOMAP_F_SHARED)) {
 		WARN_ON_ONCE(1);
 		return pmd ? VM_FAULT_FALLBACK : VM_FAULT_SIGBUS;
 	}
 
-	err = dax_iomap_direct_access(iomap, pos, size, NULL, &pfn);
+	err = dax_iomap_direct_access(iomap, pos, size, &kaddr, &pfn);
 	if (err)
 		return pmd ? VM_FAULT_FALLBACK : dax_fault_return(err);
 
 	*entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn, entry_flags,
 				  write && !sync);
 
+	if (write &&
+	    srcmap->addr != IOMAP_HOLE && srcmap->addr != iomap->addr) {
+		err = dax_iomap_cow_copy(pos, size, size, srcmap, kaddr);
+		if (err)
+			return dax_fault_return(err);
+	}
+
 	if (sync)
 		return dax_fault_synchronous_pfnp(pfnp, pfn);
 
-- 
2.31.1





* [PATCH v6 2/7] fsdax: Replace mmap entry in case of CoW
  2021-05-19  6:00 [PATCH v6 0/7] fsdax,xfs: Add reflink&dedupe support for fsdax Shiyang Ruan
  2021-05-19  6:00 ` [PATCH v6 1/7] fsdax: Introduce dax_iomap_cow_copy() Shiyang Ruan
@ 2021-05-19  6:00 ` Shiyang Ruan
  2021-05-19  6:00 ` [PATCH v6 3/7] fsdax: Add dax_iomap_cow_copy() for dax_iomap_zero Shiyang Ruan
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Shiyang Ruan @ 2021-05-19  6:00 UTC (permalink / raw)
  To: linux-kernel, linux-xfs, linux-nvdimm, linux-fsdevel
  Cc: darrick.wong, dan.j.williams, willy, viro, david, hch, rgoldwyn,
	Goldwyn Rodrigues, Ritesh Harjani, Darrick J . Wong

We replace the existing entry with the newly allocated one in case of CoW.
Also, we mark the entry as PAGECACHE_TAG_TOWRITE so writeback marks this
entry as write-protected.  This helps with snapshots, so that new write
page faults after a snapshot trigger a CoW.
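
As a rough userspace model of the behaviour described above (a sketch only;
struct slot, INSERT_DIRTY, INSERT_COW and both functions are hypothetical
names, and a single struct stands in for the XArray entry and its marks):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define INSERT_DIRTY	(1u << 0)	/* entry should be marked dirty */
#define INSERT_COW	(1u << 1)	/* entry replaces a shared mapping */

/* Toy stand-in for one mapping entry; pfn == 0 models an empty entry. */
struct slot {
	unsigned long pfn;
	bool dirty;
	bool towrite;
};

/* Derive the insert flags from the properties of the fault. */
static unsigned int fault_insert_flags(bool write, bool sync, bool shared)
{
	unsigned int flags = 0;

	if (write) {
		if (!sync)
			flags |= INSERT_DIRTY;
		if (shared)
			flags |= INSERT_COW;
	}
	return flags;
}

/* On CoW the old entry is always replaced and tagged "towrite". */
static void insert_entry(struct slot *slot, unsigned long new_pfn,
			 unsigned int flags)
{
	bool cow = flags & INSERT_COW;

	assert(slot != NULL);

	if (cow || slot->pfn == 0)
		slot->pfn = new_pfn;
	if (flags & INSERT_DIRTY)
		slot->dirty = true;
	if (cow)
		slot->towrite = true;
}
```

In this model a write fault on a shared extent always swaps in the new pfn,
while a plain write fault on an already populated entry keeps it.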

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/dax.c | 39 ++++++++++++++++++++++++++++-----------
 1 file changed, 28 insertions(+), 11 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 6396f091e60b..98531c53d613 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -733,6 +733,10 @@ static int copy_cow_page_dax(struct block_device *bdev, struct dax_device *dax_d
 	return 0;
 }
 
+/* DAX Insert Flag: The state of the entry we insert */
+#define DAX_IF_DIRTY		(1 << 0)
+#define DAX_IF_COW		(1 << 1)
+
 /*
  * By this point grab_mapping_entry() has ensured that we have a locked entry
  * of the appropriate size so we don't have to worry about downgrading PMDs to
@@ -740,16 +744,19 @@ static int copy_cow_page_dax(struct block_device *bdev, struct dax_device *dax_d
  * already in the tree, we will skip the insertion and just dirty the PMD as
  * appropriate.
  */
-static void *dax_insert_entry(struct xa_state *xas,
-		struct address_space *mapping, struct vm_fault *vmf,
-		void *entry, pfn_t pfn, unsigned long flags, bool dirty)
+static void *dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf,
+		void *entry, pfn_t pfn, unsigned long flags,
+		unsigned int insert_flags)
 {
+	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
 	void *new_entry = dax_make_entry(pfn, flags);
+	bool dirty = insert_flags & DAX_IF_DIRTY;
+	bool cow = insert_flags & DAX_IF_COW;
 
 	if (dirty)
 		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 
-	if (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE)) {
+	if (cow || (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE))) {
 		unsigned long index = xas->xa_index;
 		/* we are replacing a zero page with block mapping */
 		if (dax_is_pmd_entry(entry))
@@ -761,7 +768,7 @@ static void *dax_insert_entry(struct xa_state *xas,
 
 	xas_reset(xas);
 	xas_lock_irq(xas);
-	if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
+	if (cow || dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
 		void *old;
 
 		dax_disassociate_entry(entry, mapping, false);
@@ -785,6 +792,9 @@ static void *dax_insert_entry(struct xa_state *xas,
 	if (dirty)
 		xas_set_mark(xas, PAGECACHE_TAG_DIRTY);
 
+	if (cow)
+		xas_set_mark(xas, PAGECACHE_TAG_TOWRITE);
+
 	xas_unlock_irq(xas);
 	return entry;
 }
@@ -1120,8 +1130,7 @@ static vm_fault_t dax_load_hole(struct xa_state *xas,
 	pfn_t pfn = pfn_to_pfn_t(my_zero_pfn(vaddr));
 	vm_fault_t ret;
 
-	*entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn,
-			DAX_ZERO_PAGE, false);
+	*entry = dax_insert_entry(xas, vmf, *entry, pfn, DAX_ZERO_PAGE, 0);
 
 	ret = vmf_insert_mixed(vmf->vma, vaddr, pfn);
 	trace_dax_load_hole(inode, vmf, ret);
@@ -1148,8 +1157,8 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
 		goto fallback;
 
 	pfn = page_to_pfn_t(zero_page);
-	*entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn,
-			DAX_PMD | DAX_ZERO_PAGE, false);
+	*entry = dax_insert_entry(xas, vmf, *entry, pfn,
+				  DAX_PMD | DAX_ZERO_PAGE, 0);
 
 	if (arch_needs_pgtable_deposit()) {
 		pgtable = pte_alloc_one(vma->vm_mm);
@@ -1459,6 +1468,7 @@ static vm_fault_t dax_fault_actor(struct vm_fault *vmf, pfn_t *pfnp,
 	bool write = vmf->flags & FAULT_FLAG_WRITE;
 	bool sync = dax_fault_is_synchronous(flags, vmf->vma, iomap);
 	unsigned long entry_flags = pmd ? DAX_PMD : 0;
+	unsigned int insert_flags = 0;
 	int err = 0;
 	pfn_t pfn;
 	void *kaddr;
@@ -1481,8 +1491,15 @@ static vm_fault_t dax_fault_actor(struct vm_fault *vmf, pfn_t *pfnp,
 	if (err)
 		return pmd ? VM_FAULT_FALLBACK : dax_fault_return(err);
 
-	*entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn, entry_flags,
-				  write && !sync);
+	if (write) {
+		if (!sync)
+			insert_flags |= DAX_IF_DIRTY;
+		if (iomap->flags & IOMAP_F_SHARED)
+			insert_flags |= DAX_IF_COW;
+	}
+
+	*entry = dax_insert_entry(xas, vmf, *entry, pfn, entry_flags,
+				  insert_flags);
 
 	if (write &&
 	    srcmap->addr != IOMAP_HOLE && srcmap->addr != iomap->addr) {
-- 
2.31.1





* [PATCH v6 3/7] fsdax: Add dax_iomap_cow_copy() for dax_iomap_zero
  2021-05-19  6:00 [PATCH v6 0/7] fsdax,xfs: Add reflink&dedupe support for fsdax Shiyang Ruan
  2021-05-19  6:00 ` [PATCH v6 1/7] fsdax: Introduce dax_iomap_cow_copy() Shiyang Ruan
  2021-05-19  6:00 ` [PATCH v6 2/7] fsdax: Replace mmap entry in case of CoW Shiyang Ruan
@ 2021-05-19  6:00 ` Shiyang Ruan
  2021-05-25 22:17   ` Darrick J. Wong
  2021-05-19  6:00 ` [PATCH v6 4/7] iomap: Introduce iomap_apply2() for operations on two files Shiyang Ruan
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 23+ messages in thread
From: Shiyang Ruan @ 2021-05-19  6:00 UTC (permalink / raw)
  To: linux-kernel, linux-xfs, linux-nvdimm, linux-fsdevel
  Cc: darrick.wong, dan.j.williams, willy, viro, david, hch, rgoldwyn,
	Ritesh Harjani

Punching a hole in a reflinked file needs dax_iomap_cow_copy() too.
Otherwise, data in the unaligned area will be incorrect.  So, add srcmap to
dax_iomap_zero() and call dax_iomap_cow_copy() instead of memset() when
iomap and srcmap differ.

Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
---
 fs/dax.c               | 25 +++++++++++++++----------
 fs/iomap/buffered-io.c |  2 +-
 include/linux/dax.h    |  3 ++-
 3 files changed, 18 insertions(+), 12 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 98531c53d613..baee584cb8ae 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1197,7 +1197,8 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
 }
 #endif /* CONFIG_FS_DAX_PMD */
 
-s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap)
+s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap,
+		struct iomap *srcmap)
 {
 	sector_t sector = iomap_sector(iomap, pos & PAGE_MASK);
 	pgoff_t pgoff;
@@ -1219,19 +1220,23 @@ s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap)
 
 	if (page_aligned)
 		rc = dax_zero_page_range(iomap->dax_dev, pgoff, 1);
-	else
+	else {
 		rc = dax_direct_access(iomap->dax_dev, pgoff, 1, &kaddr, NULL);
-	if (rc < 0) {
-		dax_read_unlock(id);
-		return rc;
-	}
-
-	if (!page_aligned) {
-		memset(kaddr + offset, 0, size);
+		if (rc < 0)
+			goto out;
+		if (iomap->addr != srcmap->addr) {
+			rc = dax_iomap_cow_copy(pos, size, PAGE_SIZE, srcmap,
+						kaddr);
+			if (rc < 0)
+				goto out;
+		} else
+			memset(kaddr + offset, 0, size);
 		dax_flush(iomap->dax_dev, kaddr + offset, size);
 	}
+
+out:
 	dax_read_unlock(id);
-	return size;
+	return rc < 0 ? rc : size;
 }
 
 static loff_t
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 9023717c5188..fdaac4ba9b9d 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -933,7 +933,7 @@ static loff_t iomap_zero_range_actor(struct inode *inode, loff_t pos,
 		s64 bytes;
 
 		if (IS_DAX(inode))
-			bytes = dax_iomap_zero(pos, length, iomap);
+			bytes = dax_iomap_zero(pos, length, iomap, srcmap);
 		else
 			bytes = iomap_zero(inode, pos, length, iomap, srcmap);
 		if (bytes < 0)
diff --git a/include/linux/dax.h b/include/linux/dax.h
index b52f084aa643..3275e01ed33d 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -237,7 +237,8 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
 int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
 int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
 				      pgoff_t index);
-s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap);
+s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap,
+		struct iomap *srcmap);
 static inline bool dax_mapping(struct address_space *mapping)
 {
 	return mapping->host && IS_DAX(mapping->host);
-- 
2.31.1





* [PATCH v6 4/7] iomap: Introduce iomap_apply2() for operations on two files
  2021-05-19  6:00 [PATCH v6 0/7] fsdax,xfs: Add reflink&dedupe support for fsdax Shiyang Ruan
                   ` (2 preceding siblings ...)
  2021-05-19  6:00 ` [PATCH v6 3/7] fsdax: Add dax_iomap_cow_copy() for dax_iomap_zero Shiyang Ruan
@ 2021-05-19  6:00 ` Shiyang Ruan
  2021-05-19  6:00 ` [PATCH v6 5/7] fsdax: Dedup file range to use a compare function Shiyang Ruan
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Shiyang Ruan @ 2021-05-19  6:00 UTC (permalink / raw)
  To: linux-kernel, linux-xfs, linux-nvdimm, linux-fsdevel
  Cc: darrick.wong, dan.j.williams, willy, viro, david, hch, rgoldwyn,
	Darrick J . Wong

Some operations, such as comparing a range of data in two files under
fsdax mode, require nested iomap_begin()/iomap_end() calls on two files.
Thus, we introduce iomap_apply2() to accept arguments from two files and
iomap_actor2_t for actions on two files.
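
The nesting can be pictured with a simplified userspace sketch.  Everything
here is an assumption made for illustration: struct extent, struct map_ops,
apply2() and the demo_* helpers are hypothetical stand-ins, files are plain
ints, and (unlike the real interface) the end callback is mandatory and
each mapping reports only the bytes remaining from pos.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct iomap and struct iomap_ops. */
struct extent {
	long offset;	/* file offset the mapping starts at */
	long length;	/* bytes mapped from that offset */
};

struct map_ops {
	int (*begin)(int file, long pos, long len, struct extent *out);
	void (*end)(int file, const struct extent *ext);
};

typedef long (*actor2_t)(const struct extent *a, const struct extent *b,
			 long len, void *data);

static long min_long(long a, long b)
{
	return a < b ? a : b;
}

/* Map both ranges, run the actor on the common length, unmap in reverse. */
static long apply2(const struct map_ops *ops, int f1, long pos1,
		   int f2, long pos2, long length, void *data, actor2_t actor)
{
	struct extent e1, e2;
	long len2, done;
	int ret;

	assert(ops && ops->begin && ops->end && actor);

	ret = ops->begin(f1, pos1, length, &e1);
	if (ret)
		return ret;
	len2 = min_long(length, e1.length);

	ret = ops->begin(f2, pos2, len2, &e2);
	if (ret) {
		done = ret;
		goto out_first;
	}
	done = actor(&e1, &e2, min_long(len2, e2.length), data);

	ops->end(f2, &e2);
out_first:
	ops->end(f1, &e1);
	return done;
}

/* Demo ops: file 0 has 4096-byte extents, any other file 1024-byte ones. */
static int demo_end_calls;

static int demo_begin(int file, long pos, long len, struct extent *out)
{
	long ext = file == 0 ? 4096 : 1024;

	if (pos < 0)
		return -5;	/* arbitrary negative error for the demo */
	out->offset = pos;
	out->length = min_long(len, ext - pos % ext);
	return 0;
}

static void demo_end(int file, const struct extent *ext)
{
	(void)file;
	(void)ext;
	demo_end_calls++;
}

static long demo_actor(const struct extent *a, const struct extent *b,
		       long len, void *data)
{
	(void)a;
	(void)b;
	(void)data;
	return len;	/* pretend the whole common range was processed */
}

static const struct map_ops demo_ops = { demo_begin, demo_end };
```

The point of the sketch is the shape: the second mapping is requested only
for the length the first one covers, the actor sees the smaller of the two,
and the mappings are released in reverse order even when the second begin
fails.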

Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/iomap/apply.c      | 52 +++++++++++++++++++++++++++++++++++++++++++
 include/linux/iomap.h |  7 +++++-
 2 files changed, 58 insertions(+), 1 deletion(-)

diff --git a/fs/iomap/apply.c b/fs/iomap/apply.c
index 26ab6563181f..0493da5286ad 100644
--- a/fs/iomap/apply.c
+++ b/fs/iomap/apply.c
@@ -97,3 +97,55 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
 
 	return written ? written : ret;
 }
+
+loff_t
+iomap_apply2(struct inode *ino1, loff_t pos1, struct inode *ino2, loff_t pos2,
+		loff_t length, unsigned int flags, const struct iomap_ops *ops,
+		void *data, iomap_actor2_t actor)
+{
+	struct iomap smap = { .type = IOMAP_HOLE };
+	struct iomap dmap = { .type = IOMAP_HOLE };
+	loff_t written = 0, ret, ret2 = 0;
+	loff_t len1 = length, len2, min_len;
+
+	ret = ops->iomap_begin(ino1, pos1, len1, flags, &smap, NULL);
+	if (ret)
+		goto out;
+	if (WARN_ON(smap.offset > pos1)) {
+		written = -EIO;
+		goto out_src;
+	}
+	if (WARN_ON(smap.length == 0)) {
+		written = -EIO;
+		goto out_src;
+	}
+	len2 = min_t(loff_t, len1, smap.length);
+
+	ret = ops->iomap_begin(ino2, pos2, len2, flags, &dmap, NULL);
+	if (ret)
+		goto out_src;
+	if (WARN_ON(dmap.offset > pos2)) {
+		written = -EIO;
+		goto out_dest;
+	}
+	if (WARN_ON(dmap.length == 0)) {
+		written = -EIO;
+		goto out_dest;
+	}
+	min_len = min_t(loff_t, len2, dmap.length);
+
+	written = actor(ino1, pos1, ino2, pos2, min_len, data, &smap, &dmap);
+
+out_dest:
+	if (ops->iomap_end)
+		ret2 = ops->iomap_end(ino2, pos2, len2,
+				      written > 0 ? written : 0, flags, &dmap);
+out_src:
+	if (ops->iomap_end)
+		ret = ops->iomap_end(ino1, pos1, len1,
+				     written > 0 ? written : 0, flags, &smap);
+out:
+	if (written)
+		return written;
+	return ret ?: ret2;
+}
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index c87d0cb0de6d..95562f863ad0 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -150,10 +150,15 @@ struct iomap_ops {
  */
 typedef loff_t (*iomap_actor_t)(struct inode *inode, loff_t pos, loff_t len,
 		void *data, struct iomap *iomap, struct iomap *srcmap);
-
+typedef loff_t (*iomap_actor2_t)(struct inode *ino1, loff_t pos1,
+		struct inode *ino2, loff_t pos2, loff_t len, void *data,
+		struct iomap *smap, struct iomap *dmap);
 loff_t iomap_apply(struct inode *inode, loff_t pos, loff_t length,
 		unsigned flags, const struct iomap_ops *ops, void *data,
 		iomap_actor_t actor);
+loff_t iomap_apply2(struct inode *ino1, loff_t pos1, struct inode *ino2,
+		loff_t pos2, loff_t length, unsigned int flags,
+		const struct iomap_ops *ops, void *data, iomap_actor2_t actor);
 
 ssize_t iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *from,
 		const struct iomap_ops *ops);
-- 
2.31.1





* [PATCH v6 5/7] fsdax: Dedup file range to use a compare function
  2021-05-19  6:00 [PATCH v6 0/7] fsdax,xfs: Add reflink&dedupe support for fsdax Shiyang Ruan
                   ` (3 preceding siblings ...)
  2021-05-19  6:00 ` [PATCH v6 4/7] iomap: Introduce iomap_apply2() for operations on two files Shiyang Ruan
@ 2021-05-19  6:00 ` Shiyang Ruan
  2021-05-25 23:29   ` Darrick J. Wong
  2021-05-19  6:00 ` [PATCH v6 6/7] fs/xfs: Handle CoW for fsdax write() path Shiyang Ruan
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 23+ messages in thread
From: Shiyang Ruan @ 2021-05-19  6:00 UTC (permalink / raw)
  To: linux-kernel, linux-xfs, linux-nvdimm, linux-fsdevel
  Cc: darrick.wong, dan.j.williams, willy, viro, david, hch, rgoldwyn,
	Goldwyn Rodrigues

With dax we cannot deal with readpage() etc.  So, we create a dax
comparison function which is similar to
vfs_dedupe_file_range_compare(), and introduce
dax_remap_file_range_prep() for filesystem use.
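
Because both ranges are directly addressable, the comparison reduces to a
memcmp() plus hole handling.  A minimal userspace sketch of that decision
logic follows; struct mapped_range and range_compare() are hypothetical
names, and a NULL data pointer is assumed to model a hole.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical view of one mapped range; data == NULL models a hole. */
struct mapped_range {
	const unsigned char *data;
};

/*
 * Compare len bytes of two ranges.  Returns the number of bytes examined
 * and stores the verdict in *same.
 */
static size_t range_compare(const struct mapped_range *src,
			    const struct mapped_range *dst,
			    size_t len, bool *same)
{
	assert(src && dst && same);

	/* Two holes are considered identical. */
	if (!src->data && !dst->data) {
		*same = true;
		return len;
	}

	/* A hole against mapped data is reported as different. */
	if (!src->data || !dst->data) {
		*same = false;
		return 0;
	}

	/* Both sides are addressable: compare the bytes in place. */
	*same = memcmp(src->data, dst->data, len) == 0;
	return len;
}
```

A caller would loop over the file range, advancing both offsets by the
returned byte count and stopping as soon as *same becomes false.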

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
---
 fs/dax.c             | 66 ++++++++++++++++++++++++++++++++++++++++++++
 fs/remap_range.c     | 36 ++++++++++++++++++------
 fs/xfs/xfs_reflink.c |  8 ++++--
 include/linux/dax.h  |  8 ++++++
 include/linux/fs.h   | 12 +++++---
 5 files changed, 116 insertions(+), 14 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index baee584cb8ae..93f16210847b 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1864,3 +1864,69 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
 	return dax_insert_pfn_mkwrite(vmf, pfn, order);
 }
 EXPORT_SYMBOL_GPL(dax_finish_sync_fault);
+
+static loff_t dax_range_compare_actor(struct inode *ino1, loff_t pos1,
+		struct inode *ino2, loff_t pos2, loff_t len, void *data,
+		struct iomap *smap, struct iomap *dmap)
+{
+	void *saddr, *daddr;
+	bool *same = data;
+	int ret;
+
+	if (smap->type == IOMAP_HOLE && dmap->type == IOMAP_HOLE) {
+		*same = true;
+		return len;
+	}
+
+	if (smap->type == IOMAP_HOLE || dmap->type == IOMAP_HOLE) {
+		*same = false;
+		return 0;
+	}
+
+	ret = dax_iomap_direct_access(smap, pos1, ALIGN(pos1 + len, PAGE_SIZE),
+				      &saddr, NULL);
+	if (ret < 0)
+		return -EIO;
+
+	ret = dax_iomap_direct_access(dmap, pos2, ALIGN(pos2 + len, PAGE_SIZE),
+				      &daddr, NULL);
+	if (ret < 0)
+		return -EIO;
+
+	*same = !memcmp(saddr, daddr, len);
+	return len;
+}
+
+int dax_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
+		struct inode *dest, loff_t destoff, loff_t len, bool *is_same,
+		const struct iomap_ops *ops)
+{
+	int id, ret = 0;
+
+	id = dax_read_lock();
+	while (len) {
+		ret = iomap_apply2(src, srcoff, dest, destoff, len, 0, ops,
+				   is_same, dax_range_compare_actor);
+		if (ret < 0 || !*is_same)
+			goto out;
+
+		len -= ret;
+		srcoff += ret;
+		destoff += ret;
+	}
+	ret = 0;
+out:
+	dax_read_unlock(id);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(dax_dedupe_file_range_compare);
+
+int dax_remap_file_range_prep(struct file *file_in, loff_t pos_in,
+			      struct file *file_out, loff_t pos_out,
+			      loff_t *len, unsigned int remap_flags,
+			      const struct iomap_ops *ops)
+{
+	return __generic_remap_file_range_prep(file_in, pos_in, file_out,
+					       pos_out, len, remap_flags, ops);
+}
+EXPORT_SYMBOL(dax_remap_file_range_prep);
diff --git a/fs/remap_range.c b/fs/remap_range.c
index e4a5fdd7ad7b..4cfc1553f3bf 100644
--- a/fs/remap_range.c
+++ b/fs/remap_range.c
@@ -14,6 +14,7 @@
 #include <linux/compat.h>
 #include <linux/mount.h>
 #include <linux/fs.h>
+#include <linux/dax.h>
 #include "internal.h"
 
 #include <linux/uaccess.h>
@@ -199,9 +200,9 @@ static void vfs_unlock_two_pages(struct page *page1, struct page *page2)
  * Compare extents of two files to see if they are the same.
  * Caller must have locked both inodes to prevent write races.
  */
-static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
-					 struct inode *dest, loff_t destoff,
-					 loff_t len, bool *is_same)
+int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
+				  struct inode *dest, loff_t destoff,
+				  loff_t len, bool *is_same)
 {
 	loff_t src_poff;
 	loff_t dest_poff;
@@ -280,6 +281,7 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
 out_error:
 	return error;
 }
+EXPORT_SYMBOL(vfs_dedupe_file_range_compare);
 
 /*
  * Check that the two inodes are eligible for cloning, the ranges make
@@ -289,9 +291,11 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
  * If there's an error, then the usual negative error code is returned.
  * Otherwise returns 0 with *len set to the request length.
  */
-int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
-				  struct file *file_out, loff_t pos_out,
-				  loff_t *len, unsigned int remap_flags)
+int
+__generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
+				struct file *file_out, loff_t pos_out,
+				loff_t *len, unsigned int remap_flags,
+				const struct iomap_ops *dax_read_ops)
 {
 	struct inode *inode_in = file_inode(file_in);
 	struct inode *inode_out = file_inode(file_out);
@@ -351,8 +355,15 @@ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
 	if (remap_flags & REMAP_FILE_DEDUP) {
 		bool		is_same = false;
 
-		ret = vfs_dedupe_file_range_compare(inode_in, pos_in,
-				inode_out, pos_out, *len, &is_same);
+		if (!IS_DAX(inode_in))
+			ret = vfs_dedupe_file_range_compare(inode_in, pos_in,
+					inode_out, pos_out, *len, &is_same);
+		else if (dax_read_ops)
+			ret = dax_dedupe_file_range_compare(inode_in, pos_in,
+					inode_out, pos_out, *len, &is_same,
+					dax_read_ops);
+		else
+			return -EINVAL;
 		if (ret)
 			return ret;
 		if (!is_same)
@@ -370,6 +381,15 @@ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
 
 	return ret;
 }
+EXPORT_SYMBOL(__generic_remap_file_range_prep);
+
+int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
+				  struct file *file_out, loff_t pos_out,
+				  loff_t *len, unsigned int remap_flags)
+{
+	return __generic_remap_file_range_prep(file_in, pos_in, file_out,
+					       pos_out, len, remap_flags, NULL);
+}
 EXPORT_SYMBOL(generic_remap_file_range_prep);
 
 loff_t do_clone_file_range(struct file *file_in, loff_t pos_in,
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 060695d6d56a..d25434f93235 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1329,8 +1329,12 @@ xfs_reflink_remap_prep(
 	if (IS_DAX(inode_in) || IS_DAX(inode_out))
 		goto out_unlock;
 
-	ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out,
-			len, remap_flags);
+	if (!IS_DAX(inode_in))
+		ret = generic_remap_file_range_prep(file_in, pos_in, file_out,
+				pos_out, len, remap_flags);
+	else
+		ret = dax_remap_file_range_prep(file_in, pos_in, file_out,
+				pos_out, len, remap_flags, &xfs_read_iomap_ops);
 	if (ret || *len == 0)
 		goto out_unlock;
 
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 3275e01ed33d..106d1f033a78 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -239,6 +239,14 @@ int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
 				      pgoff_t index);
 s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap,
 		struct iomap *srcmap);
+int dax_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
+				  struct inode *dest, loff_t destoff,
+				  loff_t len, bool *is_same,
+				  const struct iomap_ops *ops);
+int dax_remap_file_range_prep(struct file *file_in, loff_t pos_in,
+			      struct file *file_out, loff_t pos_out,
+			      loff_t *len, unsigned int remap_flags,
+			      const struct iomap_ops *ops);
 static inline bool dax_mapping(struct address_space *mapping)
 {
 	return mapping->host && IS_DAX(mapping->host);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c3c88fdb9b2a..deed4371f34f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -71,6 +71,7 @@ struct fsverity_operations;
 struct fs_context;
 struct fs_parameter_spec;
 struct fileattr;
+struct iomap_ops;
 
 extern void __init inode_init(void);
 extern void __init inode_init_early(void);
@@ -2126,10 +2127,13 @@ extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
 extern ssize_t generic_copy_file_range(struct file *file_in, loff_t pos_in,
 				       struct file *file_out, loff_t pos_out,
 				       size_t len, unsigned int flags);
-extern int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
-					 struct file *file_out, loff_t pos_out,
-					 loff_t *count,
-					 unsigned int remap_flags);
+int __generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
+				    struct file *file_out, loff_t pos_out,
+				    loff_t *len, unsigned int remap_flags,
+				    const struct iomap_ops *dax_read_ops);
+int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
+				  struct file *file_out, loff_t pos_out,
+				  loff_t *count, unsigned int remap_flags);
 extern loff_t do_clone_file_range(struct file *file_in, loff_t pos_in,
 				  struct file *file_out, loff_t pos_out,
 				  loff_t len, unsigned int remap_flags);
-- 
2.31.1





* [PATCH v6 6/7] fs/xfs: Handle CoW for fsdax write() path
  2021-05-19  6:00 [PATCH v6 0/7] fsdax,xfs: Add reflink&dedupe support for fsdax Shiyang Ruan
                   ` (4 preceding siblings ...)
  2021-05-19  6:00 ` [PATCH v6 5/7] fsdax: Dedup file range to use a compare function Shiyang Ruan
@ 2021-05-19  6:00 ` Shiyang Ruan
  2021-05-26  0:21   ` Darrick J. Wong
  2021-05-19  6:00 ` [PATCH v6 7/7] fs/xfs: Add dax dedupe support Shiyang Ruan
  2021-05-26  0:51 ` [PATCH v6 0/7] fsdax,xfs: Add reflink&dedupe support for fsdax Darrick J. Wong
  7 siblings, 1 reply; 23+ messages in thread
From: Shiyang Ruan @ 2021-05-19  6:00 UTC (permalink / raw)
  To: linux-kernel, linux-xfs, linux-nvdimm, linux-fsdevel
  Cc: darrick.wong, dan.j.williams, willy, viro, david, hch, rgoldwyn

In fsdax mode, WRITE and ZERO on a shared extent need CoW to be performed.
After CoW, the newly allocated extents need to be remapped to the file.
So, add an iomap_end for dax write ops to do the remapping work.

Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
---
 fs/xfs/xfs_bmap_util.c |  3 +--
 fs/xfs/xfs_file.c      |  9 +++------
 fs/xfs/xfs_iomap.c     | 38 +++++++++++++++++++++++++++++++++++++-
 fs/xfs/xfs_iomap.h     | 24 ++++++++++++++++++++++++
 fs/xfs/xfs_iops.c      |  7 +++----
 fs/xfs/xfs_reflink.c   |  3 +--
 6 files changed, 69 insertions(+), 15 deletions(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index a5e9d7d34023..2a36dc93ff27 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -965,8 +965,7 @@ xfs_free_file_space(
 		return 0;
 	if (offset + len > XFS_ISIZE(ip))
 		len = XFS_ISIZE(ip) - offset;
-	error = iomap_zero_range(VFS_I(ip), offset, len, NULL,
-			&xfs_buffered_write_iomap_ops);
+	error = xfs_iomap_zero_range(ip, offset, len, NULL);
 	if (error)
 		return error;
 
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 396ef36dcd0a..38d8eca05aee 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -684,11 +684,8 @@ xfs_file_dax_write(
 	pos = iocb->ki_pos;
 
 	trace_xfs_file_dax_write(iocb, from);
-	ret = dax_iomap_rw(iocb, from, &xfs_direct_write_iomap_ops);
-	if (ret > 0 && iocb->ki_pos > i_size_read(inode)) {
-		i_size_write(inode, iocb->ki_pos);
-		error = xfs_setfilesize(ip, pos, ret);
-	}
+	ret = dax_iomap_rw(iocb, from, &xfs_dax_write_iomap_ops);
+
 out:
 	if (iolock)
 		xfs_iunlock(ip, iolock);
@@ -1309,7 +1306,7 @@ __xfs_filemap_fault(
 
 		ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL,
 				(write_fault && !vmf->cow_page) ?
-				 &xfs_direct_write_iomap_ops :
+				 &xfs_dax_write_iomap_ops :
 				 &xfs_read_iomap_ops);
 		if (ret & VM_FAULT_NEEDDSYNC)
 			ret = dax_finish_sync_fault(vmf, pe_size, pfn);
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index d154f42e2dc6..938723aa137d 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -761,7 +761,8 @@ xfs_direct_write_iomap_begin(
 
 		/* may drop and re-acquire the ilock */
 		error = xfs_reflink_allocate_cow(ip, &imap, &cmap, &shared,
-				&lockmode, flags & IOMAP_DIRECT);
+				&lockmode,
+				(flags & IOMAP_DIRECT) || IS_DAX(inode));
 		if (error)
 			goto out_unlock;
 		if (shared)
@@ -854,6 +855,41 @@ const struct iomap_ops xfs_direct_write_iomap_ops = {
 	.iomap_begin		= xfs_direct_write_iomap_begin,
 };
 
+static int
+xfs_dax_write_iomap_end(
+	struct inode		*inode,
+	loff_t			pos,
+	loff_t			length,
+	ssize_t			written,
+	unsigned int		flags,
+	struct iomap		*iomap)
+{
+	int			error = 0;
+	struct xfs_inode	*ip = XFS_I(inode);
+	bool			cow = xfs_is_cow_inode(ip);
+
+	if (!written)
+		return 0;
+
+	if (pos + written > i_size_read(inode) && !(flags & IOMAP_FAULT)) {
+		i_size_write(inode, pos + written);
+		error = xfs_setfilesize(ip, pos, written);
+		if (error && cow) {
+			xfs_reflink_cancel_cow_range(ip, pos, written, true);
+			return error;
+		}
+	}
+	if (cow)
+		error = xfs_reflink_end_cow(ip, pos, written);
+
+	return error;
+}
+
+const struct iomap_ops xfs_dax_write_iomap_ops = {
+	.iomap_begin		= xfs_direct_write_iomap_begin,
+	.iomap_end		= xfs_dax_write_iomap_end,
+};
+
 static int
 xfs_buffered_write_iomap_begin(
 	struct inode		*inode,
diff --git a/fs/xfs/xfs_iomap.h b/fs/xfs/xfs_iomap.h
index 7d3703556d0e..fbacf638ab21 100644
--- a/fs/xfs/xfs_iomap.h
+++ b/fs/xfs/xfs_iomap.h
@@ -42,8 +42,32 @@ xfs_aligned_fsb_count(
 
 extern const struct iomap_ops xfs_buffered_write_iomap_ops;
 extern const struct iomap_ops xfs_direct_write_iomap_ops;
+extern const struct iomap_ops xfs_dax_write_iomap_ops;
 extern const struct iomap_ops xfs_read_iomap_ops;
 extern const struct iomap_ops xfs_seek_iomap_ops;
 extern const struct iomap_ops xfs_xattr_iomap_ops;
 
+static inline int
+xfs_iomap_zero_range(
+	struct xfs_inode	*ip,
+	loff_t			offset,
+	loff_t			len,
+	bool			*did_zero)
+{
+	return iomap_zero_range(VFS_I(ip), offset, len, did_zero,
+			IS_DAX(VFS_I(ip)) ? &xfs_dax_write_iomap_ops
+					  : &xfs_buffered_write_iomap_ops);
+}
+
+static inline int
+xfs_iomap_truncate_page(
+	struct xfs_inode	*ip,
+	loff_t			pos,
+	bool			*did_zero)
+{
+	return iomap_truncate_page(VFS_I(ip), pos, did_zero,
+			IS_DAX(VFS_I(ip)) ? &xfs_dax_write_iomap_ops
+					  : &xfs_buffered_write_iomap_ops);
+}
+
 #endif /* __XFS_IOMAP_H__*/
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index dfe24b7f26e5..6d936c3e1a6e 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -911,8 +911,8 @@ xfs_setattr_size(
 	 */
 	if (newsize > oldsize) {
 		trace_xfs_zero_eof(ip, oldsize, newsize - oldsize);
-		error = iomap_zero_range(inode, oldsize, newsize - oldsize,
-				&did_zeroing, &xfs_buffered_write_iomap_ops);
+		error = xfs_iomap_zero_range(ip, oldsize, newsize - oldsize,
+				&did_zeroing);
 	} else {
 		/*
 		 * iomap won't detect a dirty page over an unwritten block (or a
@@ -924,8 +924,7 @@ xfs_setattr_size(
 						     newsize);
 		if (error)
 			return error;
-		error = iomap_truncate_page(inode, newsize, &did_zeroing,
-				&xfs_buffered_write_iomap_ops);
+		error = xfs_iomap_truncate_page(ip, newsize, &did_zeroing);
 	}
 
 	if (error)
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index d25434f93235..9a780948dbd0 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1266,8 +1266,7 @@ xfs_reflink_zero_posteof(
 		return 0;
 
 	trace_xfs_zero_eof(ip, isize, pos - isize);
-	return iomap_zero_range(VFS_I(ip), isize, pos - isize, NULL,
-			&xfs_buffered_write_iomap_ops);
+	return xfs_iomap_zero_range(ip, isize, pos - isize, NULL);
 }
 
 /*
-- 
2.31.1





* [PATCH v6 7/7] fs/xfs: Add dax dedupe support
  2021-05-19  6:00 [PATCH v6 0/7] fsdax,xfs: Add reflink&dedupe support for fsdax Shiyang Ruan
                   ` (5 preceding siblings ...)
  2021-05-19  6:00 ` [PATCH v6 6/7] fs/xfs: Handle CoW for fsdax write() path Shiyang Ruan
@ 2021-05-19  6:00 ` Shiyang Ruan
  2021-05-26  0:31   ` Darrick J. Wong
  2021-05-26  0:51 ` [PATCH v6 0/7] fsdax,xfs: Add reflink&dedupe support for fsdax Darrick J. Wong
  7 siblings, 1 reply; 23+ messages in thread
From: Shiyang Ruan @ 2021-05-19  6:00 UTC (permalink / raw)
  To: linux-kernel, linux-xfs, linux-nvdimm, linux-fsdevel
  Cc: darrick.wong, dan.j.williams, willy, viro, david, hch, rgoldwyn

Introduce xfs_mmaplock_two_inodes_and_break_dax_layout() for dax files
that are going to be deduped.  After that, call the range compare
function only when both files are DAX or both are not.

Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
---
 fs/xfs/xfs_file.c    |  2 +-
 fs/xfs/xfs_inode.c   | 57 ++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_inode.h   |  1 +
 fs/xfs/xfs_reflink.c |  4 ++--
 4 files changed, 61 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 38d8eca05aee..bd5002d38df4 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -823,7 +823,7 @@ xfs_wait_dax_page(
 	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
 }
 
-static int
+int
 xfs_break_dax_layouts(
 	struct inode		*inode,
 	bool			*retry)
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 0369eb22c1bb..d5e2791969ba 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -3711,6 +3711,59 @@ xfs_iolock_two_inodes_and_break_layout(
 	return 0;
 }
 
+static int
+xfs_mmaplock_two_inodes_and_break_dax_layout(
+	struct xfs_inode	*ip1,
+	struct xfs_inode	*ip2)
+{
+	int			error, attempts = 0;
+	bool			retry;
+	struct page		*page;
+	struct xfs_log_item	*lp;
+
+	if (ip1->i_ino > ip2->i_ino)
+		swap(ip1, ip2);
+
+again:
+	retry = false;
+	/* Lock the first inode */
+	xfs_ilock(ip1, XFS_MMAPLOCK_EXCL);
+	error = xfs_break_dax_layouts(VFS_I(ip1), &retry);
+	if (error || retry) {
+		xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL);
+		goto again;
+	}
+
+	if (ip1 == ip2)
+		return 0;
+
+	/* Nested lock the second inode */
+	lp = &ip1->i_itemp->ili_item;
+	if (lp && test_bit(XFS_LI_IN_AIL, &lp->li_flags)) {
+		if (!xfs_ilock_nowait(ip2,
+		    xfs_lock_inumorder(XFS_MMAPLOCK_EXCL, 1))) {
+			xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL);
+			if ((++attempts % 5) == 0)
+				delay(1); /* Don't just spin the CPU */
+			goto again;
+		}
+	} else
+		xfs_ilock(ip2, xfs_lock_inumorder(XFS_MMAPLOCK_EXCL, 1));
+	/*
+	 * We cannot use xfs_break_dax_layouts() directly here because it may
+	 * need to unlock & lock the XFS_MMAPLOCK_EXCL which is not suitable
+	 * for this nested lock case.
+	 */
+	page = dax_layout_busy_page(VFS_I(ip2)->i_mapping);
+	if (page && page_ref_count(page) != 1) {
+		xfs_iunlock(ip2, XFS_MMAPLOCK_EXCL);
+		xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL);
+		goto again;
+	}
+
+	return 0;
+}
+
 /*
  * Lock two inodes so that userspace cannot initiate I/O via file syscalls or
  * mmap activity.
@@ -3725,6 +3778,10 @@ xfs_ilock2_io_mmap(
 	ret = xfs_iolock_two_inodes_and_break_layout(VFS_I(ip1), VFS_I(ip2));
 	if (ret)
 		return ret;
+
+	if (IS_DAX(VFS_I(ip1)) && IS_DAX(VFS_I(ip2)))
+		return xfs_mmaplock_two_inodes_and_break_dax_layout(ip1, ip2);
+
 	if (ip1 == ip2)
 		xfs_ilock(ip1, XFS_MMAPLOCK_EXCL);
 	else
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index ca826cfba91c..2d0b344fb100 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -457,6 +457,7 @@ enum xfs_prealloc_flags {
 
 int	xfs_update_prealloc_flags(struct xfs_inode *ip,
 				  enum xfs_prealloc_flags flags);
+int	xfs_break_dax_layouts(struct inode *inode, bool *retry);
 int	xfs_break_layouts(struct inode *inode, uint *iolock,
 		enum layout_break_reason reason);
 
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 9a780948dbd0..ff308304c5cd 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1324,8 +1324,8 @@ xfs_reflink_remap_prep(
 	if (XFS_IS_REALTIME_INODE(src) || XFS_IS_REALTIME_INODE(dest))
 		goto out_unlock;
 
-	/* Don't share DAX file data for now. */
-	if (IS_DAX(inode_in) || IS_DAX(inode_out))
+	/* Don't share DAX file data with non-DAX file. */
+	if (IS_DAX(inode_in) != IS_DAX(inode_out))
 		goto out_unlock;
 
 	if (!IS_DAX(inode_in))
-- 
2.31.1





* Re: [PATCH v6 3/7] fsdax: Add dax_iomap_cow_copy() for dax_iomap_zero
  2021-05-19  6:00 ` [PATCH v6 3/7] fsdax: Add dax_iomap_cow_copy() for dax_iomap_zero Shiyang Ruan
@ 2021-05-25 22:17   ` Darrick J. Wong
  0 siblings, 0 replies; 23+ messages in thread
From: Darrick J. Wong @ 2021-05-25 22:17 UTC (permalink / raw)
  To: Shiyang Ruan
  Cc: linux-kernel, linux-xfs, linux-nvdimm, linux-fsdevel,
	darrick.wong, dan.j.williams, willy, viro, david, hch, rgoldwyn,
	Ritesh Harjani

On Wed, May 19, 2021 at 02:00:41PM +0800, Shiyang Ruan wrote:
> Punching a hole in a reflinked file needs dax_iomap_cow_copy() too.
> Otherwise, data in the unaligned area will be incorrect.  So, add the
> srcmap to dax_iomap_zero() and call dax_iomap_cow_copy() instead of
> memset() when the source and destination mappings differ.
> 
> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>

Looks good now,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/dax.c               | 25 +++++++++++++++----------
>  fs/iomap/buffered-io.c |  2 +-
>  include/linux/dax.h    |  3 ++-
>  3 files changed, 18 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 98531c53d613..baee584cb8ae 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1197,7 +1197,8 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
>  }
>  #endif /* CONFIG_FS_DAX_PMD */
>  
> -s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap)
> +s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap,
> +		struct iomap *srcmap)
>  {
>  	sector_t sector = iomap_sector(iomap, pos & PAGE_MASK);
>  	pgoff_t pgoff;
> @@ -1219,19 +1220,23 @@ s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap)
>  
>  	if (page_aligned)
>  		rc = dax_zero_page_range(iomap->dax_dev, pgoff, 1);
> -	else
> +	else {
>  		rc = dax_direct_access(iomap->dax_dev, pgoff, 1, &kaddr, NULL);
> -	if (rc < 0) {
> -		dax_read_unlock(id);
> -		return rc;
> -	}
> -
> -	if (!page_aligned) {
> -		memset(kaddr + offset, 0, size);
> +		if (rc < 0)
> +			goto out;
> +		if (iomap->addr != srcmap->addr) {
> +			rc = dax_iomap_cow_copy(pos, size, PAGE_SIZE, srcmap,
> +						kaddr);
> +			if (rc < 0)
> +				goto out;
> +		} else
> +			memset(kaddr + offset, 0, size);
>  		dax_flush(iomap->dax_dev, kaddr + offset, size);
>  	}
> +
> +out:
>  	dax_read_unlock(id);
> -	return size;
> +	return rc < 0 ? rc : size;
>  }
>  
>  static loff_t
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 9023717c5188..fdaac4ba9b9d 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -933,7 +933,7 @@ static loff_t iomap_zero_range_actor(struct inode *inode, loff_t pos,
>  		s64 bytes;
>  
>  		if (IS_DAX(inode))
> -			bytes = dax_iomap_zero(pos, length, iomap);
> +			bytes = dax_iomap_zero(pos, length, iomap, srcmap);
>  		else
>  			bytes = iomap_zero(inode, pos, length, iomap, srcmap);
>  		if (bytes < 0)
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index b52f084aa643..3275e01ed33d 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -237,7 +237,8 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
>  int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
>  int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
>  				      pgoff_t index);
> -s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap);
> +s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap,
> +		struct iomap *srcmap);
>  static inline bool dax_mapping(struct address_space *mapping)
>  {
>  	return mapping->host && IS_DAX(mapping->host);
> -- 
> 2.31.1
> 
> 
> 


* Re: [PATCH v6 5/7] fsdax: Dedup file range to use a compare function
  2021-05-19  6:00 ` [PATCH v6 5/7] fsdax: Dedup file range to use a compare function Shiyang Ruan
@ 2021-05-25 23:29   ` Darrick J. Wong
  0 siblings, 0 replies; 23+ messages in thread
From: Darrick J. Wong @ 2021-05-25 23:29 UTC (permalink / raw)
  To: Shiyang Ruan
  Cc: linux-kernel, linux-xfs, linux-nvdimm, linux-fsdevel,
	darrick.wong, dan.j.williams, willy, viro, david, hch, rgoldwyn,
	Goldwyn Rodrigues

On Wed, May 19, 2021 at 02:00:43PM +0800, Shiyang Ruan wrote:
> With dax we cannot deal with readpage() etc. So, we create a dax
> comparison funciton which is similar with

s/funciton/function/

> vfs_dedupe_file_range_compare().
> And introduce dax_remap_file_range_prep() for filesystem use.
> 
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> ---
>  fs/dax.c             | 66 ++++++++++++++++++++++++++++++++++++++++++++
>  fs/remap_range.c     | 36 ++++++++++++++++++------
>  fs/xfs/xfs_reflink.c |  8 ++++--
>  include/linux/dax.h  |  8 ++++++
>  include/linux/fs.h   | 12 +++++---
>  5 files changed, 116 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index baee584cb8ae..93f16210847b 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1864,3 +1864,69 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
>  	return dax_insert_pfn_mkwrite(vmf, pfn, order);
>  }
>  EXPORT_SYMBOL_GPL(dax_finish_sync_fault);
> +
> +static loff_t dax_range_compare_actor(struct inode *ino1, loff_t pos1,
> +		struct inode *ino2, loff_t pos2, loff_t len, void *data,
> +		struct iomap *smap, struct iomap *dmap)
> +{
> +	void *saddr, *daddr;
> +	bool *same = data;
> +	int ret;
> +
> +	if (smap->type == IOMAP_HOLE && dmap->type == IOMAP_HOLE) {
> +		*same = true;
> +		return len;
> +	}
> +
> +	if (smap->type == IOMAP_HOLE || dmap->type == IOMAP_HOLE) {
> +		*same = false;
> +		return 0;
> +	}
> +
> +	ret = dax_iomap_direct_access(smap, pos1, ALIGN(pos1 + len, PAGE_SIZE),
> +				      &saddr, NULL);
> +	if (ret < 0)
> +		return -EIO;
> +
> +	ret = dax_iomap_direct_access(dmap, pos2, ALIGN(pos2 + len, PAGE_SIZE),
> +				      &daddr, NULL);
> +	if (ret < 0)
> +		return -EIO;
> +
> +	*same = !memcmp(saddr, daddr, len);
> +	return len;
> +}
> +
> +int dax_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
> +		struct inode *dest, loff_t destoff, loff_t len, bool *is_same,
> +		const struct iomap_ops *ops)
> +{
> +	int id, ret = 0;
> +
> +	id = dax_read_lock();
> +	while (len) {
> +		ret = iomap_apply2(src, srcoff, dest, destoff, len, 0, ops,
> +				   is_same, dax_range_compare_actor);
> +		if (ret < 0 || !*is_same)
> +			goto out;
> +
> +		len -= ret;
> +		srcoff += ret;
> +		destoff += ret;
> +	}
> +	ret = 0;
> +out:
> +	dax_read_unlock(id);
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(dax_dedupe_file_range_compare);
> +
> +int dax_remap_file_range_prep(struct file *file_in, loff_t pos_in,
> +			      struct file *file_out, loff_t pos_out,
> +			      loff_t *len, unsigned int remap_flags,
> +			      const struct iomap_ops *ops)
> +{
> +	return __generic_remap_file_range_prep(file_in, pos_in, file_out,
> +					       pos_out, len, remap_flags, ops);
> +}
> +EXPORT_SYMBOL(dax_remap_file_range_prep);
> diff --git a/fs/remap_range.c b/fs/remap_range.c
> index e4a5fdd7ad7b..4cfc1553f3bf 100644
> --- a/fs/remap_range.c
> +++ b/fs/remap_range.c
> @@ -14,6 +14,7 @@
>  #include <linux/compat.h>
>  #include <linux/mount.h>
>  #include <linux/fs.h>
> +#include <linux/dax.h>
>  #include "internal.h"
>  
>  #include <linux/uaccess.h>
> @@ -199,9 +200,9 @@ static void vfs_unlock_two_pages(struct page *page1, struct page *page2)
>   * Compare extents of two files to see if they are the same.
>   * Caller must have locked both inodes to prevent write races.
>   */
> -static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
> -					 struct inode *dest, loff_t destoff,
> -					 loff_t len, bool *is_same)
> +int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
> +				  struct inode *dest, loff_t destoff,
> +				  loff_t len, bool *is_same)
>  {
>  	loff_t src_poff;
>  	loff_t dest_poff;
> @@ -280,6 +281,7 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
>  out_error:
>  	return error;
>  }
> +EXPORT_SYMBOL(vfs_dedupe_file_range_compare);
>  
>  /*
>   * Check that the two inodes are eligible for cloning, the ranges make
> @@ -289,9 +291,11 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
>   * If there's an error, then the usual negative error code is returned.
>   * Otherwise returns 0 with *len set to the request length.
>   */
> -int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
> -				  struct file *file_out, loff_t pos_out,
> -				  loff_t *len, unsigned int remap_flags)
> +int
> +__generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
> +				struct file *file_out, loff_t pos_out,
> +				loff_t *len, unsigned int remap_flags,
> +				const struct iomap_ops *dax_read_ops)
>  {
>  	struct inode *inode_in = file_inode(file_in);
>  	struct inode *inode_out = file_inode(file_out);
> @@ -351,8 +355,15 @@ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
>  	if (remap_flags & REMAP_FILE_DEDUP) {
>  		bool		is_same = false;
>  
> -		ret = vfs_dedupe_file_range_compare(inode_in, pos_in,
> -				inode_out, pos_out, *len, &is_same);
> +		if (!IS_DAX(inode_in))
> +			ret = vfs_dedupe_file_range_compare(inode_in, pos_in,
> +					inode_out, pos_out, *len, &is_same);
> +		else if (dax_read_ops)
> +			ret = dax_dedupe_file_range_compare(inode_in, pos_in,
> +					inode_out, pos_out, *len, &is_same,
> +					dax_read_ops);
> +		else
> +			return -EINVAL;
>  		if (ret)
>  			return ret;
>  		if (!is_same)
> @@ -370,6 +381,15 @@ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
>  
>  	return ret;
>  }
> +EXPORT_SYMBOL(__generic_remap_file_range_prep);
> +
> +int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
> +				  struct file *file_out, loff_t pos_out,
> +				  loff_t *len, unsigned int remap_flags)
> +{
> +	return __generic_remap_file_range_prep(file_in, pos_in, file_out,
> +					       pos_out, len, remap_flags, NULL);
> +}
>  EXPORT_SYMBOL(generic_remap_file_range_prep);
>  
>  loff_t do_clone_file_range(struct file *file_in, loff_t pos_in,
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index 060695d6d56a..d25434f93235 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -1329,8 +1329,12 @@ xfs_reflink_remap_prep(
>  	if (IS_DAX(inode_in) || IS_DAX(inode_out))
>  		goto out_unlock;
>  
> -	ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out,
> -			len, remap_flags);
> +	if (!IS_DAX(inode_in))
> +		ret = generic_remap_file_range_prep(file_in, pos_in, file_out,
> +				pos_out, len, remap_flags);
> +	else
> +		ret = dax_remap_file_range_prep(file_in, pos_in, file_out,
> +				pos_out, len, remap_flags, &xfs_read_iomap_ops);
>  	if (ret || *len == 0)
>  		goto out_unlock;
>  
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index 3275e01ed33d..106d1f033a78 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -239,6 +239,14 @@ int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
>  				      pgoff_t index);
>  s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap,
>  		struct iomap *srcmap);
> +int dax_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
> +				  struct inode *dest, loff_t destoff,
> +				  loff_t len, bool *is_same,
> +				  const struct iomap_ops *ops);
> +int dax_remap_file_range_prep(struct file *file_in, loff_t pos_in,
> +			      struct file *file_out, loff_t pos_out,
> +			      loff_t *len, unsigned int remap_flags,
> +			      const struct iomap_ops *ops);

I totally thought that not having explicit static inline stubs of these
functions would break the build when CONFIG_FS_DAX=n.  Then I realized
that when fsdax is disabled, S_DAX is zero, so IS_DAX() is always false
and dead code elimination in the compiler drops the calls.  The object
files therefore never reference the dax functions, which means that
linking actually succeeds.
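
To illustrate the point about S_DAX, here is a minimal userspace sketch
(not part of the patch).  Every demo_* and DEMO_* name is hypothetical and
only mirrors the shape of IS_DAX() and of the xfs_reflink_remap_prep()
hunk.  Unlike the kernel case, the "dax" function is defined here so the
sketch links at any optimization level; a call counter shows that the
branch can never be reached when the flag is a constant zero.

```c
/* Mirrors the kernel pattern: with CONFIG_FS_DAX=n, S_DAX is defined as 0. */
#define DEMO_S_DAX 0u

struct demo_inode {
	unsigned int i_flags;
};

/* Same shape as IS_DAX(): a bitwise AND with constant zero is always 0. */
#define DEMO_IS_DAX(inode) ((inode)->i_flags & DEMO_S_DAX)

int demo_dax_calls;	/* counts calls into the "dax" path */

/*
 * Hypothetical stand-in for dax_remap_file_range_prep().  In the kernel
 * case no definition exists when fsdax is disabled, and the link only
 * works because the optimizer removes the call.
 */
int demo_dax_prep(const struct demo_inode *in)
{
	(void)in;
	demo_dax_calls++;
	return -1;
}

/* Hypothetical stand-in for generic_remap_file_range_prep(). */
int demo_generic_prep(const struct demo_inode *in)
{
	(void)in;
	return 0;
}

/* Same branch shape as the xfs_reflink_remap_prep() hunk. */
int demo_remap_prep(const struct demo_inode *in)
{
	if (!DEMO_IS_DAX(in))
		return demo_generic_prep(in);
	return demo_dax_prep(in);	/* unreachable when DEMO_S_DAX == 0 */
}
```

Calling demo_remap_prep() on an inode with every flag bit set still takes
the generic path and leaves demo_dax_calls at zero.  Whether a given
compiler actually drops the dead call depends on the optimization level,
which is why the sketch does not rely on it.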

So:

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

>  static inline bool dax_mapping(struct address_space *mapping)
>  {
>  	return mapping->host && IS_DAX(mapping->host);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index c3c88fdb9b2a..deed4371f34f 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -71,6 +71,7 @@ struct fsverity_operations;
>  struct fs_context;
>  struct fs_parameter_spec;
>  struct fileattr;
> +struct iomap_ops;
>  
>  extern void __init inode_init(void);
>  extern void __init inode_init_early(void);
> @@ -2126,10 +2127,13 @@ extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
>  extern ssize_t generic_copy_file_range(struct file *file_in, loff_t pos_in,
>  				       struct file *file_out, loff_t pos_out,
>  				       size_t len, unsigned int flags);
> -extern int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
> -					 struct file *file_out, loff_t pos_out,
> -					 loff_t *count,
> -					 unsigned int remap_flags);
> +int __generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
> +				    struct file *file_out, loff_t pos_out,
> +				    loff_t *len, unsigned int remap_flags,
> +				    const struct iomap_ops *dax_read_ops);
> +int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
> +				  struct file *file_out, loff_t pos_out,
> +				  loff_t *count, unsigned int remap_flags);
>  extern loff_t do_clone_file_range(struct file *file_in, loff_t pos_in,
>  				  struct file *file_out, loff_t pos_out,
>  				  loff_t len, unsigned int remap_flags);
> -- 
> 2.31.1
> 
> 
> 


* Re: [PATCH v6 6/7] fs/xfs: Handle CoW for fsdax write() path
  2021-05-19  6:00 ` [PATCH v6 6/7] fs/xfs: Handle CoW for fsdax write() path Shiyang Ruan
@ 2021-05-26  0:21   ` Darrick J. Wong
  2021-06-09  2:28     ` ruansy.fnst
  0 siblings, 1 reply; 23+ messages in thread
From: Darrick J. Wong @ 2021-05-26  0:21 UTC (permalink / raw)
  To: Shiyang Ruan
  Cc: linux-kernel, linux-xfs, linux-nvdimm, linux-fsdevel,
	darrick.wong, dan.j.williams, willy, viro, david, hch, rgoldwyn

On Wed, May 19, 2021 at 02:00:44PM +0800, Shiyang Ruan wrote:
> In fsdax mode, WRITE and ZERO on a shared extent need CoW to be performed.
> After CoW, the newly allocated extents need to be remapped to the file.
> So, add an iomap_end for dax write ops to do the remapping work.
> 
> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> ---
>  fs/xfs/xfs_bmap_util.c |  3 +--
>  fs/xfs/xfs_file.c      |  9 +++------
>  fs/xfs/xfs_iomap.c     | 38 +++++++++++++++++++++++++++++++++++++-
>  fs/xfs/xfs_iomap.h     | 24 ++++++++++++++++++++++++
>  fs/xfs/xfs_iops.c      |  7 +++----
>  fs/xfs/xfs_reflink.c   |  3 +--
>  6 files changed, 69 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index a5e9d7d34023..2a36dc93ff27 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -965,8 +965,7 @@ xfs_free_file_space(
>  		return 0;
>  	if (offset + len > XFS_ISIZE(ip))
>  		len = XFS_ISIZE(ip) - offset;
> -	error = iomap_zero_range(VFS_I(ip), offset, len, NULL,
> -			&xfs_buffered_write_iomap_ops);
> +	error = xfs_iomap_zero_range(ip, offset, len, NULL);
>  	if (error)
>  		return error;
>  
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 396ef36dcd0a..38d8eca05aee 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -684,11 +684,8 @@ xfs_file_dax_write(
>  	pos = iocb->ki_pos;
>  
>  	trace_xfs_file_dax_write(iocb, from);
> -	ret = dax_iomap_rw(iocb, from, &xfs_direct_write_iomap_ops);
> -	if (ret > 0 && iocb->ki_pos > i_size_read(inode)) {
> -		i_size_write(inode, iocb->ki_pos);
> -		error = xfs_setfilesize(ip, pos, ret);
> -	}
> +	ret = dax_iomap_rw(iocb, from, &xfs_dax_write_iomap_ops);
> +
>  out:
>  	if (iolock)
>  		xfs_iunlock(ip, iolock);
> @@ -1309,7 +1306,7 @@ __xfs_filemap_fault(
>  
>  		ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL,
>  				(write_fault && !vmf->cow_page) ?
> -				 &xfs_direct_write_iomap_ops :
> +				 &xfs_dax_write_iomap_ops :
>  				 &xfs_read_iomap_ops);
>  		if (ret & VM_FAULT_NEEDDSYNC)
>  			ret = dax_finish_sync_fault(vmf, pe_size, pfn);
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index d154f42e2dc6..938723aa137d 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -761,7 +761,8 @@ xfs_direct_write_iomap_begin(
>  
>  		/* may drop and re-acquire the ilock */
>  		error = xfs_reflink_allocate_cow(ip, &imap, &cmap, &shared,
> -				&lockmode, flags & IOMAP_DIRECT);
> +				&lockmode,
> +				(flags & IOMAP_DIRECT) || IS_DAX(inode));
>  		if (error)
>  			goto out_unlock;
>  		if (shared)
> @@ -854,6 +855,41 @@ const struct iomap_ops xfs_direct_write_iomap_ops = {
>  	.iomap_begin		= xfs_direct_write_iomap_begin,
>  };
>  
> +static int
> +xfs_dax_write_iomap_end(
> +	struct inode		*inode,
> +	loff_t			pos,
> +	loff_t			length,
> +	ssize_t			written,
> +	unsigned int		flags,
> +	struct iomap		*iomap)
> +{
> +	int			error = 0;
> +	struct xfs_inode	*ip = XFS_I(inode);
> +	bool			cow = xfs_is_cow_inode(ip);
> +
> +	if (!written)
> +		return 0;
> +
> +	if (pos + written > i_size_read(inode) && !(flags & IOMAP_FAULT)) {
> +		i_size_write(inode, pos + written);
> +		error = xfs_setfilesize(ip, pos, written);
> +		if (error && cow) {
> +			xfs_reflink_cancel_cow_range(ip, pos, written, true);
> +			return error;
> +		}
> +	}
> +	if (cow)
> +		error = xfs_reflink_end_cow(ip, pos, written);
> +
> +	return error;

I think this (the ->iomap_end handler) is the wrong place to be
performing COW remapping, because of this chunk in iomap_apply:

	/*
	 * Now the data has been copied, commit the range we've copied.
	 * This should not fail unless the filesystem has had a fatal
	 * error.
	 */
	if (ops->iomap_end) {
		ret = ops->iomap_end(inode, pos, length,
				     written > 0 ? written : 0,
				     flags, &iomap);
	}

	return written ? written : ret;

If we managed to write something but the remap fails, we'll eat the
error message and return the length of the write to the caller.
Eventually the callers /may/ notice that they can still read old file
contents after a "successful" write.
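
The effect of that return expression can be modelled in a few lines of
userspace C.  This is a sketch only: demo_apply_tail() and the demo_end_*
callbacks are hypothetical names, and the real iomap_apply() takes many
more arguments.  Only the final "written ? written : ret" logic is copied
from the chunk above.

```c
#include <errno.h>

/* Hypothetical, simplified signature for an ->iomap_end style callback. */
typedef long (*demo_end_fn)(long written);

/* Pretend the CoW remap failed. */
long demo_end_fails(long written)
{
	(void)written;
	return -EIO;
}

/* Pretend the CoW remap succeeded. */
long demo_end_ok(long written)
{
	(void)written;
	return 0;
}

/*
 * written: bytes the actor copied, or a negative errno.
 * Returns what the caller of iomap_apply() would see.
 */
long demo_apply_tail(long written, demo_end_fn iomap_end)
{
	long ret = 0;

	if (iomap_end)
		ret = iomap_end(written > 0 ? written : 0);

	/* Same expression as in the chunk: ret only wins if written == 0. */
	return written ? written : ret;
}
```

With these definitions, demo_apply_tail(4096, demo_end_fails) returns 4096
and the -EIO from the callback is lost.  The error only reaches the caller
when nothing was written.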

I think what needs to happen here is that we call out to the filesystem
to remap the blocks at the end of dax_iomap_actor, similar to how the
iomap directio code calls xfs_dio_write_end_io after all of the write
bios complete.  If the remap fails, we return that error out of
dax_iomap_actor, which will be returned to the caller as a short write
or an error code if nothing got written.

IOWs, the end of dax_iomap_actor should become:

		/* dax_copy_{to,from}_iter calls here */

		pos += xfer;
		length -= xfer;
		done += xfer;

		if (xfer == 0)
			ret = -EFAULT;
		if (xfer < map_len)
			break;
	}
	dax_read_unlock(id);

	if (dops && dops->end_io) {
		unsigned flags = 0;

		if (srcmap->addr != iomap->addr)
			flags |= IOMAP_DIO_COW;

		ret = dops->end_io(iocb, done, ret, flags);
	}

	if (likely(!ret)) {
		ret = done;
		/* check for short read */
		if (offset + ret > i_size_read(inode) && !write)
			ret = i_size_read(inode) - offset;
		iocb->ki_pos += ret;
	}

	return ret;
}

And I think you can even reuse the struct iomap_dio_ops and
xfs_dio_write_end_io for this purpose.
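
For contrast with the ->iomap_end behaviour, the tail of the proposed
dax_iomap_actor can be modelled the same way.  Again this is only a
userspace sketch under assumptions: demo_actor_tail() and the demo_remap_*
callbacks are hypothetical, the iocb and flags arguments are dropped, and
the short-read adjustment is left out.  It follows the snippet above
literally, so a failing end_io replaces the byte count with the error.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified signature for an ->end_io style callback. */
typedef int (*demo_end_io_fn)(long done, int error);

/* Pretend the CoW remap failed after the copy. */
int demo_remap_fails(long done, int error)
{
	(void)done;
	(void)error;
	return -EIO;
}

/* Pretend the remap succeeded; pass through any earlier error. */
int demo_remap_ok(long done, int error)
{
	(void)done;
	return error;
}

/*
 * done: bytes copied by the loop; ret: error from the loop (0 if none).
 * Returns the byte count on success, otherwise the negative errno.
 */
long demo_actor_tail(long done, int ret, demo_end_io_fn end_io)
{
	if (end_io)
		ret = end_io(done, ret);

	if (!ret)
		return done;
	return ret;
}
```

Here demo_actor_tail(4096, 0, demo_remap_fails) returns -EIO instead of
4096, so a failed remap is no longer reported as a successful write.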

> +}
> +
> +const struct iomap_ops xfs_dax_write_iomap_ops = {
> +	.iomap_begin		= xfs_direct_write_iomap_begin,
> +	.iomap_end		= xfs_dax_write_iomap_end,
> +};
> +
>  static int
>  xfs_buffered_write_iomap_begin(
>  	struct inode		*inode,
> diff --git a/fs/xfs/xfs_iomap.h b/fs/xfs/xfs_iomap.h
> index 7d3703556d0e..fbacf638ab21 100644
> --- a/fs/xfs/xfs_iomap.h
> +++ b/fs/xfs/xfs_iomap.h
> @@ -42,8 +42,32 @@ xfs_aligned_fsb_count(
>  
>  extern const struct iomap_ops xfs_buffered_write_iomap_ops;
>  extern const struct iomap_ops xfs_direct_write_iomap_ops;
> +extern const struct iomap_ops xfs_dax_write_iomap_ops;
>  extern const struct iomap_ops xfs_read_iomap_ops;
>  extern const struct iomap_ops xfs_seek_iomap_ops;
>  extern const struct iomap_ops xfs_xattr_iomap_ops;
>  
> +static inline int
> +xfs_iomap_zero_range(
> +	struct xfs_inode	*ip,
> +	loff_t			offset,
> +	loff_t			len,
> +	bool			*did_zero)
> +{
> +	return iomap_zero_range(VFS_I(ip), offset, len, did_zero,
> +			IS_DAX(VFS_I(ip)) ? &xfs_dax_write_iomap_ops
> +					  : &xfs_buffered_write_iomap_ops);
> +}
> +
> +static inline int
> +xfs_iomap_truncate_page(
> +	struct xfs_inode	*ip,
> +	loff_t			pos,
> +	bool			*did_zero)
> +{
> +	return iomap_truncate_page(VFS_I(ip), pos, did_zero,
> +			IS_DAX(VFS_I(ip)) ? &xfs_dax_write_iomap_ops
> +					  : &xfs_buffered_write_iomap_ops);
> +}
> +
>  #endif /* __XFS_IOMAP_H__*/
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index dfe24b7f26e5..6d936c3e1a6e 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -911,8 +911,8 @@ xfs_setattr_size(
>  	 */
>  	if (newsize > oldsize) {
>  		trace_xfs_zero_eof(ip, oldsize, newsize - oldsize);
> -		error = iomap_zero_range(inode, oldsize, newsize - oldsize,
> -				&did_zeroing, &xfs_buffered_write_iomap_ops);
> +		error = xfs_iomap_zero_range(ip, oldsize, newsize - oldsize,
> +				&did_zeroing);
>  	} else {
>  		/*
>  		 * iomap won't detect a dirty page over an unwritten block (or a
> @@ -924,8 +924,7 @@ xfs_setattr_size(
>  						     newsize);
>  		if (error)
>  			return error;
> -		error = iomap_truncate_page(inode, newsize, &did_zeroing,
> -				&xfs_buffered_write_iomap_ops);
> +		error = xfs_iomap_truncate_page(ip, newsize, &did_zeroing);
>  	}
>  
>  	if (error)
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index d25434f93235..9a780948dbd0 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -1266,8 +1266,7 @@ xfs_reflink_zero_posteof(
>  		return 0;
>  
>  	trace_xfs_zero_eof(ip, isize, pos - isize);
> -	return iomap_zero_range(VFS_I(ip), isize, pos - isize, NULL,
> -			&xfs_buffered_write_iomap_ops);
> +	return xfs_iomap_zero_range(ip, isize, pos - isize, NULL);
>  }
>  
>  /*
> -- 
> 2.31.1
> 
> 
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v6 7/7] fs/xfs: Add dax dedupe support
  2021-05-19  6:00 ` [PATCH v6 7/7] fs/xfs: Add dax dedupe support Shiyang Ruan
@ 2021-05-26  0:31   ` Darrick J. Wong
  0 siblings, 0 replies; 23+ messages in thread
From: Darrick J. Wong @ 2021-05-26  0:31 UTC (permalink / raw)
  To: Shiyang Ruan
  Cc: linux-kernel, linux-xfs, linux-nvdimm, linux-fsdevel,
	darrick.wong, dan.j.williams, willy, viro, david, hch, rgoldwyn,
	jack

On Wed, May 19, 2021 at 02:00:45PM +0800, Shiyang Ruan wrote:
> Introduce xfs_mmaplock_two_inodes_and_break_dax_layout() for dax files
> who are going to be deduped.  After that, call compare range function
> only when files are both DAX or not.
> 
> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> ---
>  fs/xfs/xfs_file.c    |  2 +-
>  fs/xfs/xfs_inode.c   | 57 ++++++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_inode.h   |  1 +
>  fs/xfs/xfs_reflink.c |  4 ++--
>  4 files changed, 61 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 38d8eca05aee..bd5002d38df4 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -823,7 +823,7 @@ xfs_wait_dax_page(
>  	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
>  }
>  
> -static int
> +int
>  xfs_break_dax_layouts(
>  	struct inode		*inode,
>  	bool			*retry)
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 0369eb22c1bb..d5e2791969ba 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -3711,6 +3711,59 @@ xfs_iolock_two_inodes_and_break_layout(
>  	return 0;
>  }
>  
> +static int
> +xfs_mmaplock_two_inodes_and_break_dax_layout(
> +	struct xfs_inode	*ip1,
> +	struct xfs_inode	*ip2)
> +{
> +	int			error, attempts = 0;
> +	bool			retry;
> +	struct page		*page;
> +	struct xfs_log_item	*lp;
> +
> +	if (ip1->i_ino > ip2->i_ino)
> +		swap(ip1, ip2);

If Jan Kara [added to cc] succeeds in hoisting the MMAPLOCK to struct
address space then this is going to have to change to:

	if (VFS_I(ip1)->i_mapping > VFS_I(ip2)->i_mapping)
		swap(ip1, ip2);

For now this is ok.
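
For illustration only, here is a minimal user-space sketch of the ordering
rule discussed above: pick one global key and always take the lower-keyed
lock first, so two tasks locking the same pair can never deadlock ABBA-style.
The toy_* names and the struct are hypothetical stand-ins, not XFS or VFS
interfaces, and the "lock" is just a flag.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for an inode; not a kernel structure. */
struct toy_inode {
	unsigned long	ino;		/* ordering key the patch uses today */
	int		mapping;	/* its address models VFS_I(ip)->i_mapping */
	int		locked;		/* toy MMAPLOCK state */
};

static void toy_swap(struct toy_inode **a, struct toy_inode **b)
{
	struct toy_inode *tmp = *a;

	*a = *b;
	*b = tmp;
}

/* Order by inode number, as the patch does. */
static void toy_order_by_ino(struct toy_inode **a, struct toy_inode **b)
{
	if ((*a)->ino > (*b)->ino)
		toy_swap(a, b);
}

/* Order by mapping address, the alternative mentioned in the review. */
static void toy_order_by_mapping(struct toy_inode **a, struct toy_inode **b)
{
	if ((uintptr_t)&(*a)->mapping > (uintptr_t)&(*b)->mapping)
		toy_swap(a, b);
}

/* Lock both inodes in key order; returns the one that was locked first. */
static struct toy_inode *toy_lock_two(struct toy_inode *a, struct toy_inode *b)
{
	toy_order_by_ino(&a, &b);

	assert(!a->locked);
	a->locked = 1;
	if (a != b) {			/* same inode: only one lock to take */
		assert(!b->locked);
		b->locked = 1;
	}
	return a;
}
```

Whichever key is used, the point is that every caller must agree on it;
mixing the two orderings in different code paths would reintroduce the
deadlock.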

> +
> +again:
> +	retry = false;
> +	/* Lock the first inode */
> +	xfs_ilock(ip1, XFS_MMAPLOCK_EXCL);
> +	error = xfs_break_dax_layouts(VFS_I(ip1), &retry);
> +	if (error || retry) {
> +		xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL);
> +		goto again;
> +	}
> +
> +	if (ip1 == ip2)
> +		return 0;
> +
> +	/* Nested lock the second inode */
> +	lp = &ip1->i_itemp->ili_item;
> +	if (lp && test_bit(XFS_LI_IN_AIL, &lp->li_flags)) {
> +		if (!xfs_ilock_nowait(ip2,
> +		    xfs_lock_inumorder(XFS_MMAPLOCK_EXCL, 1))) {
> +			xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL);
> +			if ((++attempts % 5) == 0)
> +				delay(1); /* Don't just spin the CPU */
> +			goto again;
> +		}
> +	} else
> +		xfs_ilock(ip2, xfs_lock_inumorder(XFS_MMAPLOCK_EXCL, 1));

I wonder if this chunk is really necessary considering that the AIL
never touches the MMAPLOCK/i_mapping invalidation lock?  I guess it
doesn't really hurt anything since that's what the code does now.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> +	/*
> +	 * We cannot use xfs_break_dax_layouts() directly here because it may
> +	 * need to unlock & lock the XFS_MMAPLOCK_EXCL which is not suitable
> +	 * for this nested lock case.
> +	 */
> +	page = dax_layout_busy_page(VFS_I(ip2)->i_mapping);
> +	if (page && page_ref_count(page) != 1) {
> +		xfs_iunlock(ip2, XFS_MMAPLOCK_EXCL);
> +		xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL);
> +		goto again;
> +	}
> +
> +	return 0;
> +}
> +
>  /*
>   * Lock two inodes so that userspace cannot initiate I/O via file syscalls or
>   * mmap activity.
> @@ -3725,6 +3778,10 @@ xfs_ilock2_io_mmap(
>  	ret = xfs_iolock_two_inodes_and_break_layout(VFS_I(ip1), VFS_I(ip2));
>  	if (ret)
>  		return ret;
> +
> +	if (IS_DAX(VFS_I(ip1)) && IS_DAX(VFS_I(ip2)))
> +		return xfs_mmaplock_two_inodes_and_break_dax_layout(ip1, ip2);
> +
>  	if (ip1 == ip2)
>  		xfs_ilock(ip1, XFS_MMAPLOCK_EXCL);
>  	else
> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index ca826cfba91c..2d0b344fb100 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -457,6 +457,7 @@ enum xfs_prealloc_flags {
>  
>  int	xfs_update_prealloc_flags(struct xfs_inode *ip,
>  				  enum xfs_prealloc_flags flags);
> +int	xfs_break_dax_layouts(struct inode *inode, bool *retry);
>  int	xfs_break_layouts(struct inode *inode, uint *iolock,
>  		enum layout_break_reason reason);
>  
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index 9a780948dbd0..ff308304c5cd 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -1324,8 +1324,8 @@ xfs_reflink_remap_prep(
>  	if (XFS_IS_REALTIME_INODE(src) || XFS_IS_REALTIME_INODE(dest))
>  		goto out_unlock;
>  
> -	/* Don't share DAX file data for now. */
> -	if (IS_DAX(inode_in) || IS_DAX(inode_out))
> +	/* Don't share DAX file data with non-DAX file. */
> +	if (IS_DAX(inode_in) != IS_DAX(inode_out))
>  		goto out_unlock;
>  
>  	if (!IS_DAX(inode_in))
> -- 
> 2.31.1
> 
> 
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v6 0/7] fsdax,xfs: Add reflink&dedupe support for fsdax
  2021-05-19  6:00 [PATCH v6 0/7] fsdax,xfs: Add reflink&dedupe support for fsdax Shiyang Ruan
                   ` (6 preceding siblings ...)
  2021-05-19  6:00 ` [PATCH v6 7/7] fs/xfs: Add dax dedupe support Shiyang Ruan
@ 2021-05-26  0:51 ` Darrick J. Wong
  7 siblings, 0 replies; 23+ messages in thread
From: Darrick J. Wong @ 2021-05-26  0:51 UTC (permalink / raw)
  To: Shiyang Ruan
  Cc: linux-kernel, linux-xfs, linux-nvdimm, linux-fsdevel,
	darrick.wong, dan.j.williams, willy, viro, david, hch, rgoldwyn

On Wed, May 19, 2021 at 02:00:38PM +0800, Shiyang Ruan wrote:
> This patchset is attempt to add CoW support for fsdax, and take XFS,
> which has both reflink and fsdax feature, as an example.

Soooo... how close are we to enabling reflink for DAX?

I <cough> got rid of the lockouts in xfs_super.c and ran a quick
fstests, which showed a number of odd regressions where dedupe tests
that were supposed to fail with EBADE didn't and a bunch of clonerange
tests failed with EINVAL:

generic/122     - output mismatch (see /var/tmp/fstests/generic/122.out.bad)
    --- tests/generic/122.out   2021-05-13 11:47:55.665860364 -0700
    +++ /var/tmp/fstests/generic/122.out.bad    2021-05-25 17:24:03.333270522 -0700
    @@ -4,7 +4,8 @@
     5e3501f97fd2669babfcbd3e1972e833  TEST_DIR/test-122/file2
     Files 1-2 do not match (intentional)
     (Fail to) dedupe the middle blocks together
    -XFS_IOC_FILE_EXTENT_SAME: Extents did not match.
    +deduped 131072/131072 bytes at offset 262144
    +128 KiB, 1 ops; 0.0000 sec (12.207 GiB/sec and 100000.0000 ops/sec)
     Compare sections
    ...
    (Run 'diff -u /tmp/fstests/tests/generic/122.out /var/tmp/fstests/generic/122.out.bad'  to see the entire diff)
generic/136     - output mismatch (see /var/tmp/fstests/generic/136.out.bad)
    --- tests/generic/136.out   2021-05-13 11:47:55.668860355 -0700
    +++ /var/tmp/fstests/generic/136.out.bad    2021-05-25 17:24:05.773367756 -0700
    @@ -7,7 +7,8 @@
     Dedupe the last blocks together
     1->2
     1->3
    -XFS_IOC_FILE_EXTENT_SAME: Extents did not match.
    +deduped 37/37 bytes at offset 65536
    +37.000000 bytes, 1 ops; 0.0000 sec (1.960 MiB/sec and 55555.5556 ops/sec)
     c4fd505be25a0c91bcca9f502b9a8156  TEST_DIR/test-136/file1
    ...
    (Run 'diff -u /tmp/fstests/tests/generic/136.out /var/tmp/fstests/generic/136.out.bad'  to see the entire diff)
generic/164     - output mismatch (see /var/tmp/fstests/generic/164.out.bad)
    --- tests/generic/164.out   2021-05-13 11:47:55.674860338 -0700
    +++ /var/tmp/fstests/generic/164.out.bad    2021-05-25 17:25:33.339738197 -0700
    @@ -2,4 +2,1028 @@
     Format and mount
     Initialize files
     Reflink and reread the files!
    +XFS_IOC_CLONE_RANGE: Invalid argument
    +XFS_IOC_CLONE_RANGE: Invalid argument
    +XFS_IOC_CLONE_RANGE: Invalid argument
    +XFS_IOC_CLONE_RANGE: Invalid argument
    ...
    (Run 'diff -u /tmp/fstests/tests/generic/164.out /var/tmp/fstests/generic/164.out.bad'  to see the entire diff)
generic/165     - output mismatch (see /var/tmp/fstests/generic/165.out.bad)
    --- tests/generic/165.out   2021-05-13 11:47:55.674860338 -0700
    +++ /var/tmp/fstests/generic/165.out.bad    2021-05-25 17:25:45.247685323 -0700
    @@ -2,4 +2,1028 @@
     Format and mount
     Initialize files
     Reflink and dio reread the files!
    +XFS_IOC_CLONE_RANGE: Invalid argument
    +XFS_IOC_CLONE_RANGE: Invalid argument
    +XFS_IOC_CLONE_RANGE: Invalid argument
    +XFS_IOC_CLONE_RANGE: Invalid argument
    ...
    (Run 'diff -u /tmp/fstests/tests/generic/165.out /var/tmp/fstests/generic/165.out.bad'  to see the entire diff)
generic/175     - output mismatch (see /var/tmp/fstests/generic/175.out.bad)
    --- tests/generic/175.out   2021-05-13 11:47:55.676860332 -0700
    +++ /var/tmp/fstests/generic/175.out.bad    2021-05-25 17:29:55.060917807 -0700
    @@ -3,3 +3,4 @@
     Create a one block file
     Create extents
     Reflink the big file
    +XFS_IOC_CLONE_RANGE: Invalid argument
    ...
    (Run 'diff -u /tmp/fstests/tests/generic/175.out /var/tmp/fstests/generic/175.out.bad'  to see the entire diff)
generic/327     - output mismatch (see /var/tmp/fstests/generic/327.out.bad)
    --- tests/generic/327.out   2021-05-13 11:47:55.704860251 -0700
    +++ /var/tmp/fstests/generic/327.out.bad    2021-05-25 17:35:22.338448231 -0700
    @@ -7,6 +7,6 @@
     root 0 0 0
     fsgqa 2048 0 1024
     Try to reflink again
    -cp: failed to clone 'SCRATCH_MNT/test-327/file3' from 'SCRATCH_MNT/test-327/file1': Disk quota exceeded
    +cp: failed to clone 'SCRATCH_MNT/test-327/file3' from 'SCRATCH_MNT/test-327/file1': Invalid argument
     root 0 0 0
     fsgqa 2048 0 1024
    ...
    (Run 'diff -u /tmp/fstests/tests/generic/327.out /var/tmp/fstests/generic/327.out.bad'  to see the entire diff)
generic/516     - output mismatch (see /var/tmp/fstests/generic/516.out.bad)
    --- tests/generic/516.out   2021-05-13 11:47:55.739860150 -0700
    +++ /var/tmp/fstests/generic/516.out.bad    2021-05-25 17:41:58.144177193 -0700
    @@ -4,7 +4,8 @@
     39578c21e2cb9f6049b1cf7fc7be12a6  TEST_DIR/test-516/file2
     Files 1-2 do not match (intentional)
     (partial) dedupe the middle blocks together
    -XFS_IOC_FILE_EXTENT_SAME: Extents did not match.
    +deduped XXXX/XXXX bytes at offset XXXX
    +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
     Compare sections
    ...
    (Run 'diff -u /tmp/fstests/tests/generic/516.out /var/tmp/fstests/generic/516.out.bad'  to see the entire diff)
generic/517     - output mismatch (see /var/tmp/fstests/generic/517.out.bad)
    --- tests/generic/517.out   2021-05-13 11:47:55.739860150 -0700
    +++ /var/tmp/fstests/generic/517.out.bad    2021-05-25 17:41:59.352000318 -0700
    @@ -33,8 +33,7 @@
     0786532
     wrote 100/100 bytes at offset 0
     XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
    -deduped 100/100 bytes at offset 655360
    -XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
    +XFS_IOC_FILE_EXTENT_SAME: Invalid argument
     File content after second deduplication:
    ...
    (Run 'diff -u /tmp/fstests/tests/generic/517.out /var/tmp/fstests/generic/517.out.bad'  to see the entire diff)
generic/518      1s
generic/540     - output mismatch (see /var/tmp/fstests/generic/540.out.bad)
    --- tests/generic/540.out   2021-05-13 11:47:55.743860139 -0700
    +++ /var/tmp/fstests/generic/540.out.bad    2021-05-25 17:42:01.999613949 -0700
    @@ -7,8 +7,9 @@
     6366fd359371414186688a0ef6988893  SCRATCH_MNT/test-540/file3
     6366fd359371414186688a0ef6988893  SCRATCH_MNT/test-540/file3.chk
     reflink across the transition
    +XFS_IOC_CLONE_RANGE: Invalid argument
     Compare files
     bdbcf02ee0aa977795a79d25fcfdccb1  SCRATCH_MNT/test-540/file1
     5a5221017d3ab8fd7583312a14d2ba80  SCRATCH_MNT/test-540/file2
    ...
    (Run 'diff -u /tmp/fstests/tests/generic/540.out /var/tmp/fstests/generic/540.out.bad'  to see the entire diff)
generic/541     - output mismatch (see /var/tmp/fstests/generic/541.out.bad)
    --- tests/generic/541.out   2021-05-13 11:47:55.743860139 -0700
    +++ /var/tmp/fstests/generic/541.out.bad    2021-05-25 17:42:03.623377997 -0700
    @@ -8,9 +8,10 @@
     6366fd359371414186688a0ef6988893  SCRATCH_MNT/test-541/file3
     6366fd359371414186688a0ef6988893  SCRATCH_MNT/test-541/file3.chk
     reflink across the transition
    +XFS_IOC_CLONE_RANGE: Invalid argument
     Compare files
     bdbcf02ee0aa977795a79d25fcfdccb1  SCRATCH_MNT/test-541/file1
    -51a300aae3a4b4eaa023876a397e01ef  SCRATCH_MNT/test-541/file2
    ...
    (Run 'diff -u /tmp/fstests/tests/generic/541.out /var/tmp/fstests/generic/541.out.bad'  to see the entire diff)
generic/542     - output mismatch (see /var/tmp/fstests/generic/542.out.bad)
    --- tests/generic/542.out   2021-05-13 11:47:55.743860139 -0700
    +++ /var/tmp/fstests/generic/542.out.bad    2021-05-25 17:42:05.487108030 -0700
    @@ -7,8 +7,9 @@
     6366fd359371414186688a0ef6988893  SCRATCH_MNT/test-542/file3
     6366fd359371414186688a0ef6988893  SCRATCH_MNT/test-542/file3.chk
     reflink across the transition
    +XFS_IOC_CLONE_RANGE: Invalid argument
     Compare files
     bdbcf02ee0aa977795a79d25fcfdccb1  SCRATCH_MNT/test-542/file1
     5a5221017d3ab8fd7583312a14d2ba80  SCRATCH_MNT/test-542/file2
    ...
    (Run 'diff -u /tmp/fstests/tests/generic/542.out /var/tmp/fstests/generic/542.out.bad'  to see the entire diff)
generic/543     - output mismatch (see /var/tmp/fstests/generic/543.out.bad)
    --- tests/generic/543.out   2021-05-13 11:47:55.744860136 -0700
    +++ /var/tmp/fstests/generic/543.out.bad    2021-05-25 17:42:07.386833815 -0700
    @@ -8,9 +8,10 @@
     6366fd359371414186688a0ef6988893  SCRATCH_MNT/test-543/file3
     6366fd359371414186688a0ef6988893  SCRATCH_MNT/test-543/file3.chk
     reflink across the transition
    +XFS_IOC_CLONE_RANGE: Invalid argument
     Compare files
     bdbcf02ee0aa977795a79d25fcfdccb1  SCRATCH_MNT/test-543/file1
    -d93123af536c8c012f866ea383a905ce  SCRATCH_MNT/test-543/file2
    ...
    (Run 'diff -u /tmp/fstests/tests/generic/543.out /var/tmp/fstests/generic/543.out.bad'  to see the entire diff)

That's all the failures to the end of the generic group; I cut it off so
that I could schedule my regular nightly testing runs.

--D

> 
> Changes from V5:
>  - Fix the lock order of xfs_inode in xfs_mmaplock_two_inodes_and_break_dax_layout()
>  - move dax_remap_file_range_prep() to fs/dax.c
>  - change type of length to uint64_t in dax_iomap_cow_copy()
>  - fix mistake in dax_iomap_zero()
> 
> Changes from V4:
>  - Fix the mistake of breaking dax layout for two inodes
>  - Add CONFIG_FS_DAX judgement for fsdax code in remap_range.c
>  - Fix other small problems and mistakes
> 
> One of the key mechanism need to be implemented in fsdax is CoW.  Copy
> the data from srcmap before we actually write data to the destance
> iomap.  And we just copy range in which data won't be changed.
> 
> Another mechanism is range comparison.  In page cache case, readpage()
> is used to load data on disk to page cache in order to be able to
> compare data.  In fsdax case, readpage() does not work.  So, we need
> another compare data with direct access support.
> 
> With the two mechanisms implemented in fsdax, we are able to make reflink
> and fsdax work together in XFS.
> 
> Some of the patches are picked up from Goldwyn's patchset.  I made some
> changes to adapt to this patchset.
> 
> 
> (Rebased on v5.13-rc2 and patchset[1])
> [1]: https://lkml.org/lkml/2021/4/22/575
> 
> Shiyang Ruan (7):
>   fsdax: Introduce dax_iomap_cow_copy()
>   fsdax: Replace mmap entry in case of CoW
>   fsdax: Add dax_iomap_cow_copy() for dax_iomap_zero
>   iomap: Introduce iomap_apply2() for operations on two files
>   fsdax: Dedup file range to use a compare function
>   fs/xfs: Handle CoW for fsdax write() path
>   fs/xfs: Add dax dedupe support
> 
>  fs/dax.c               | 216 ++++++++++++++++++++++++++++++++++++-----
>  fs/iomap/apply.c       |  52 ++++++++++
>  fs/iomap/buffered-io.c |   2 +-
>  fs/remap_range.c       |  36 +++++--
>  fs/xfs/xfs_bmap_util.c |   3 +-
>  fs/xfs/xfs_file.c      |  11 +--
>  fs/xfs/xfs_inode.c     |  57 +++++++++++
>  fs/xfs/xfs_inode.h     |   1 +
>  fs/xfs/xfs_iomap.c     |  38 +++++++-
>  fs/xfs/xfs_iomap.h     |  24 +++++
>  fs/xfs/xfs_iops.c      |   7 +-
>  fs/xfs/xfs_reflink.c   |  15 +--
>  include/linux/dax.h    |  11 ++-
>  include/linux/fs.h     |  12 ++-
>  include/linux/iomap.h  |   7 +-
>  15 files changed, 431 insertions(+), 61 deletions(-)
> 
> -- 
> 2.31.1
> 
> 
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [PATCH v6 6/7] fs/xfs: Handle CoW for fsdax write() path
  2021-05-26  0:21   ` Darrick J. Wong
@ 2021-06-09  2:28     ` ruansy.fnst
  2021-06-15  7:21       ` [PATCH v6.1 " Shiyang Ruan
  0 siblings, 1 reply; 23+ messages in thread
From: ruansy.fnst @ 2021-06-09  2:28 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-kernel, linux-xfs, linux-nvdimm, linux-fsdevel,
	darrick.wong, dan.j.williams, willy, viro, david, hch, rgoldwyn

> -----Original Message-----
> From: Darrick J. Wong <djwong@kernel.org>
> Subject: Re: [PATCH v6 6/7] fs/xfs: Handle CoW for fsdax write() path
> 
> On Wed, May 19, 2021 at 02:00:44PM +0800, Shiyang Ruan wrote:
> > In fsdax mode, WRITE and ZERO on a shared extent need CoW performed.
> > After CoW, new allocated extents needs to be remapped to the file.
> > So, add an iomap_end for dax write ops to do the remapping work.
> >
> > Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> > ---
> >  fs/xfs/xfs_bmap_util.c |  3 +--
> >  fs/xfs/xfs_file.c      |  9 +++------
> >  fs/xfs/xfs_iomap.c     | 38 +++++++++++++++++++++++++++++++++++++-
> >  fs/xfs/xfs_iomap.h     | 24 ++++++++++++++++++++++++
> >  fs/xfs/xfs_iops.c      |  7 +++----
> >  fs/xfs/xfs_reflink.c   |  3 +--
> >  6 files changed, 69 insertions(+), 15 deletions(-)
> >
> > diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> > index a5e9d7d34023..2a36dc93ff27 100644
> > --- a/fs/xfs/xfs_bmap_util.c
> > +++ b/fs/xfs/xfs_bmap_util.c
> > @@ -965,8 +965,7 @@ xfs_free_file_space(
> >  		return 0;
> >  	if (offset + len > XFS_ISIZE(ip))
> >  		len = XFS_ISIZE(ip) - offset;
> > -	error = iomap_zero_range(VFS_I(ip), offset, len, NULL,
> > -			&xfs_buffered_write_iomap_ops);
> > +	error = xfs_iomap_zero_range(ip, offset, len, NULL);
> >  	if (error)
> >  		return error;
> >
> > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > index 396ef36dcd0a..38d8eca05aee 100644
> > --- a/fs/xfs/xfs_file.c
> > +++ b/fs/xfs/xfs_file.c
> > @@ -684,11 +684,8 @@ xfs_file_dax_write(
> >  	pos = iocb->ki_pos;
> >
> >  	trace_xfs_file_dax_write(iocb, from);
> > -	ret = dax_iomap_rw(iocb, from, &xfs_direct_write_iomap_ops);
> > -	if (ret > 0 && iocb->ki_pos > i_size_read(inode)) {
> > -		i_size_write(inode, iocb->ki_pos);
> > -		error = xfs_setfilesize(ip, pos, ret);
> > -	}
> > +	ret = dax_iomap_rw(iocb, from, &xfs_dax_write_iomap_ops);
> > +
> >  out:
> >  	if (iolock)
> >  		xfs_iunlock(ip, iolock);
> > @@ -1309,7 +1306,7 @@ __xfs_filemap_fault(
> >
> >  		ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL,
> >  				(write_fault && !vmf->cow_page) ?
> > -				 &xfs_direct_write_iomap_ops :
> > +				 &xfs_dax_write_iomap_ops :
> >  				 &xfs_read_iomap_ops);
> >  		if (ret & VM_FAULT_NEEDDSYNC)
> >  			ret = dax_finish_sync_fault(vmf, pe_size, pfn);
> > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> > index d154f42e2dc6..938723aa137d 100644
> > --- a/fs/xfs/xfs_iomap.c
> > +++ b/fs/xfs/xfs_iomap.c
> > @@ -761,7 +761,8 @@ xfs_direct_write_iomap_begin(
> >
> >  		/* may drop and re-acquire the ilock */
> >  		error = xfs_reflink_allocate_cow(ip, &imap, &cmap, &shared,
> > -				&lockmode, flags & IOMAP_DIRECT);
> > +				&lockmode,
> > +				(flags & IOMAP_DIRECT) || IS_DAX(inode));
> >  		if (error)
> >  			goto out_unlock;
> >  		if (shared)
> > @@ -854,6 +855,41 @@ const struct iomap_ops xfs_direct_write_iomap_ops = {
> >  	.iomap_begin		= xfs_direct_write_iomap_begin,
> >  };
> >
> > +static int
> > +xfs_dax_write_iomap_end(
> > +	struct inode		*inode,
> > +	loff_t			pos,
> > +	loff_t			length,
> > +	ssize_t			written,
> > +	unsigned int		flags,
> > +	struct iomap		*iomap)
> > +{
> > +	int			error = 0;
> > +	struct xfs_inode	*ip = XFS_I(inode);
> > +	bool			cow = xfs_is_cow_inode(ip);
> > +
> > +	if (!written)
> > +		return 0;
> > +
> > +	if (pos + written > i_size_read(inode) && !(flags & IOMAP_FAULT)) {
> > +		i_size_write(inode, pos + written);
> > +		error = xfs_setfilesize(ip, pos, written);
> > +		if (error && cow) {
> > +			xfs_reflink_cancel_cow_range(ip, pos, written, true);
> > +			return error;
> > +		}
> > +	}
> > +	if (cow)
> > +		error = xfs_reflink_end_cow(ip, pos, written);
> > +
> > +	return error;
> 
> I think this (the ->iomap_end handler) is the wrong place to be performing COW
> remapping, because of this chunk in iomap_apply:
> 
> 	/*
> 	 * Now the data has been copied, commit the range we've copied.
> 	 * This should not fail unless the filesystem has had a fatal
> 	 * error.
> 	 */
> 	if (ops->iomap_end) {
> 		ret = ops->iomap_end(inode, pos, length,
> 				     written > 0 ? written : 0,
> 				     flags, &iomap);
> 	}
> 
> 	return written ? written : ret;
> 
> If we managed to write something but the remap fails, we'll eat the error
> message and return the length of the write to the caller.
> Eventually the callers /may/ notice that they can still read old file contents after
> a "successful" write.

Understood.
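
To make the failure mode concrete, here is a tiny user-space model of the
tail of iomap_apply() quoted above.  It is only a sketch: toy_apply_tail()
and toy_end_fails() are hypothetical names, and the model keeps nothing but
the final "written ? written : ret" decision.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified signature for an ->iomap_end style callback. */
typedef int (*toy_end_fn)(long written);

/*
 * Mirrors the quoted logic: the callback's result is only returned when
 * nothing was written, so a nonzero 'written' hides any error from it.
 */
static long toy_apply_tail(long written, toy_end_fn end)
{
	long ret = 0;

	if (end)
		ret = end(written > 0 ? written : 0);

	return written ? written : ret;
}

/* Stand-in for an ->iomap_end whose CoW remap step fails. */
static int toy_end_fails(long written)
{
	(void)written;
	return -EIO;
}
```

With these definitions, toy_apply_tail(4096, toy_end_fails) returns 4096
even though the callback reported -EIO; the error only surfaces when
written is 0.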

> 
> I think what needs to happen here is that we call out to the filesystem to remap
> the blocks at the end of dax_iomap_actor, similar to how the iomap directio
> code calls xfs_dio_write_end_io after all of the write bios complete.  If the
> remap fails, we return that error out of dax_iomap_actor, which will be
> returned to the caller as a short write or an error code if nothing got written.
> 
> IOWs, the end of dax_iomap_actor should become:
> 
> 		/* dax_copy_{to,from}_iter calls here */
> 
> 		pos += xfer;
> 		length -= xfer;
> 		done += xfer;
> 
> 		if (xfer == 0)
> 			ret = -EFAULT;
> 		if (xfer < map_len)
> 			break;
> 	}
> 	dax_read_unlock(id);
> 
> 	if (dops && dops->end_io) {
> 		unsigned flags = 0;
> 
> 		if (srcmap->addr != iomap->addr)
> 			flags |= IOMAP_DIO_COW;
> 
> 		ret = dops->end_io(iocb, done, ret, flags);
> 	}
> 
> 	if (likely(!ret)) {
> 		ret = done;
> 		/* check for short read */
> 		if (offset + ret > i_size_read(inode) && !write)
> 			ret = i_size_read(inode) - offset;
> 		iocb->ki_pos += ret;
> 	}
> 
> 	return ret;
> }
> 
> And I think you can even reuse the struct iomap_dio_ops and
> xfs_dio_write_end_io for this purpose.

This is a good idea.  The two dax page fault handlers need it as well.  So,
instead of copying this logic three times, how about adding an interface,
called ->iomap_post_actor() for example, to iomap_ops, to be called between
->actor() and ->iomap_end()?  That looks more generic.

In addition, dops->end_io() requires a struct kiocb, which is not available
in the dax page fault handlers.


--
Ruan.

> 
> > +}
> > +
> > +const struct iomap_ops xfs_dax_write_iomap_ops = {
> > +	.iomap_begin		= xfs_direct_write_iomap_begin,
> > +	.iomap_end		= xfs_dax_write_iomap_end,
> > +};
> > +
> >  static int
> >  xfs_buffered_write_iomap_begin(
> >  	struct inode		*inode,
> > diff --git a/fs/xfs/xfs_iomap.h b/fs/xfs/xfs_iomap.h
> > index 7d3703556d0e..fbacf638ab21 100644
> > --- a/fs/xfs/xfs_iomap.h
> > +++ b/fs/xfs/xfs_iomap.h
> > @@ -42,8 +42,32 @@ xfs_aligned_fsb_count(
> >
> >  extern const struct iomap_ops xfs_buffered_write_iomap_ops;
> >  extern const struct iomap_ops xfs_direct_write_iomap_ops;
> > +extern const struct iomap_ops xfs_dax_write_iomap_ops;
> >  extern const struct iomap_ops xfs_read_iomap_ops;
> >  extern const struct iomap_ops xfs_seek_iomap_ops;
> >  extern const struct iomap_ops xfs_xattr_iomap_ops;
> >
> > +static inline int
> > +xfs_iomap_zero_range(
> > +	struct xfs_inode	*ip,
> > +	loff_t			offset,
> > +	loff_t			len,
> > +	bool			*did_zero)
> > +{
> > +	return iomap_zero_range(VFS_I(ip), offset, len, did_zero,
> > +			IS_DAX(VFS_I(ip)) ? &xfs_dax_write_iomap_ops
> > +					  : &xfs_buffered_write_iomap_ops);
> > +}
> > +
> > +static inline int
> > +xfs_iomap_truncate_page(
> > +	struct xfs_inode	*ip,
> > +	loff_t			pos,
> > +	bool			*did_zero)
> > +{
> > +	return iomap_truncate_page(VFS_I(ip), pos, did_zero,
> > +			IS_DAX(VFS_I(ip)) ? &xfs_dax_write_iomap_ops
> > +					  : &xfs_buffered_write_iomap_ops);
> > +}
> > +
> >  #endif /* __XFS_IOMAP_H__*/
> > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> > index dfe24b7f26e5..6d936c3e1a6e 100644
> > --- a/fs/xfs/xfs_iops.c
> > +++ b/fs/xfs/xfs_iops.c
> > @@ -911,8 +911,8 @@ xfs_setattr_size(
> >  	 */
> >  	if (newsize > oldsize) {
> >  		trace_xfs_zero_eof(ip, oldsize, newsize - oldsize);
> > -		error = iomap_zero_range(inode, oldsize, newsize - oldsize,
> > -				&did_zeroing, &xfs_buffered_write_iomap_ops);
> > +		error = xfs_iomap_zero_range(ip, oldsize, newsize - oldsize,
> > +				&did_zeroing);
> >  	} else {
> >  		/*
> >  		 * iomap won't detect a dirty page over an unwritten block (or a
> > @@ -924,8 +924,7 @@ xfs_setattr_size(
> >  						     newsize);
> >  		if (error)
> >  			return error;
> > -		error = iomap_truncate_page(inode, newsize, &did_zeroing,
> > -				&xfs_buffered_write_iomap_ops);
> > +		error = xfs_iomap_truncate_page(ip, newsize, &did_zeroing);
> >  	}
> >
> >  	if (error)
> > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > index d25434f93235..9a780948dbd0 100644
> > --- a/fs/xfs/xfs_reflink.c
> > +++ b/fs/xfs/xfs_reflink.c
> > @@ -1266,8 +1266,7 @@ xfs_reflink_zero_posteof(
> >  		return 0;
> >
> >  	trace_xfs_zero_eof(ip, isize, pos - isize);
> > -	return iomap_zero_range(VFS_I(ip), isize, pos - isize, NULL,
> > -			&xfs_buffered_write_iomap_ops);
> > +	return xfs_iomap_zero_range(ip, isize, pos - isize, NULL);
> >  }
> >
> >  /*
> > --
> > 2.31.1
> >
> >
> >

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH v6.1 6/7] fs/xfs: Handle CoW for fsdax write() path
  2021-06-09  2:28     ` ruansy.fnst
@ 2021-06-15  7:21       ` Shiyang Ruan
  2021-06-24  8:49         ` ruansy.fnst
  0 siblings, 1 reply; 23+ messages in thread
From: Shiyang Ruan @ 2021-06-15  7:21 UTC (permalink / raw)
  To: darrick.wong
  Cc: ruansy.fnst, dan.j.williams, david, djwong, hch, linux-fsdevel,
	linux-kernel, nvdimm, linux-xfs, rgoldwyn, viro, willy

Hi Darrick,

Since the other patches look good, I am posting this RFC patch on its own to
hot-fix the problem in xfs_dax_write_iomap_ops->iomap_end() of v6, where the
error code was ignored.  I will split it into two patches (iomap changes and
xfs changes) in the next formal version if it looks OK.

====

Introduce a new interface called "iomap_post_actor()" in iomap_ops, and call it
between ->actor() and ->iomap_end().  It is meant to handle the error code
returned from ->actor().  In this patchset, it is used to remap or cancel the
CoW extents according to the error code.
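
As a rough user-space model of the intended control flow (a sketch only;
the toy_* names are hypothetical and the callbacks take far fewer arguments
than the real iomap_ops): because the hook's return value replaces
'written' before the final "written ? written : ret" decision, a remap
failure can reach the caller instead of being hidden behind a successful
byte count.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for struct iomap_ops. */
struct toy_ops {
	long	(*post_actor)(long written);	/* may turn 'written' into an error */
	int	(*end)(long written);
};

/* Simulated result of the CoW remap step; 0 means success. */
static int toy_remap_error;

/* Modelled on the hook in this patch: returns "error ?: written". */
static long toy_post_actor(long written)
{
	int error;

	if (written <= 0)
		return written;	/* a real hook would cancel the CoW range here */

	error = toy_remap_error;
	return error ? error : written;
}

/* actor result -> post_actor -> end, mirroring the order in iomap_apply(). */
static long toy_apply(long actor_result, const struct toy_ops *ops)
{
	long written = actor_result;
	long ret = 0;

	if (ops->post_actor)
		written = ops->post_actor(written);
	if (ops->end)
		ret = ops->end(written > 0 ? written : 0);

	return written ? written : ret;
}
```

In this model, a 4096-byte copy followed by a failed remap
(toy_remap_error = -EIO) makes toy_apply() return -EIO rather than 4096.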

Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
---
 fs/dax.c               | 27 ++++++++++++++++++---------
 fs/iomap/apply.c       |  4 ++++
 fs/xfs/xfs_bmap_util.c |  3 +--
 fs/xfs/xfs_file.c      |  5 +++--
 fs/xfs/xfs_iomap.c     | 33 ++++++++++++++++++++++++++++++++-
 fs/xfs/xfs_iomap.h     | 24 ++++++++++++++++++++++++
 fs/xfs/xfs_iops.c      |  7 +++----
 fs/xfs/xfs_reflink.c   |  3 +--
 include/linux/iomap.h  |  8 ++++++++
 9 files changed, 94 insertions(+), 20 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 93f16210847b..0740c2610b6f 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1537,7 +1537,7 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
 	struct iomap iomap = { .type = IOMAP_HOLE };
 	struct iomap srcmap = { .type = IOMAP_HOLE };
 	unsigned flags = IOMAP_FAULT;
-	int error;
+	int error, copied = PAGE_SIZE;
 	bool write = vmf->flags & FAULT_FLAG_WRITE;
 	vm_fault_t ret = 0, major = 0;
 	void *entry;
@@ -1598,7 +1598,7 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
 	ret = dax_fault_actor(vmf, pfnp, &xas, &entry, false, flags,
 			      &iomap, &srcmap);
 	if (ret == VM_FAULT_SIGBUS)
-		goto finish_iomap;
+		goto finish_iomap_actor;
 
 	/* read/write MAPPED, CoW UNWRITTEN */
 	if (iomap.flags & IOMAP_F_NEW) {
@@ -1607,10 +1607,16 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
 		major = VM_FAULT_MAJOR;
 	}
 
+ finish_iomap_actor:
+	if (ops->iomap_post_actor) {
+		if (ret & VM_FAULT_ERROR)
+			copied = 0;
+		ops->iomap_post_actor(inode, pos, PMD_SIZE, copied, flags,
+				      &iomap, &srcmap);
+	}
+
 finish_iomap:
 	if (ops->iomap_end) {
-		int copied = PAGE_SIZE;
-
 		if (ret & VM_FAULT_ERROR)
 			copied = 0;
 		/*
@@ -1677,7 +1683,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
 	pgoff_t max_pgoff;
 	void *entry;
 	loff_t pos;
-	int error;
+	int error, copied = PMD_SIZE;
 
 	/*
 	 * Check whether offset isn't beyond end of file now. Caller is
@@ -1736,12 +1742,15 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
 	ret = dax_fault_actor(vmf, pfnp, &xas, &entry, true, flags,
 			      &iomap, &srcmap);
 
+	if (ret == VM_FAULT_FALLBACK)
+		copied = 0;
+	if (ops->iomap_post_actor) {
+		ops->iomap_post_actor(inode, pos, PMD_SIZE, copied, flags,
+				      &iomap, &srcmap);
+	}
+
 finish_iomap:
 	if (ops->iomap_end) {
-		int copied = PMD_SIZE;
-
-		if (ret == VM_FAULT_FALLBACK)
-			copied = 0;
 		/*
 		 * The fault is done by now and there's no way back (other
 		 * thread may be already happily using PMD we have installed).
diff --git a/fs/iomap/apply.c b/fs/iomap/apply.c
index 0493da5286ad..26a54ded184f 100644
--- a/fs/iomap/apply.c
+++ b/fs/iomap/apply.c
@@ -84,6 +84,10 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
 	written = actor(inode, pos, length, data, &iomap,
 			srcmap.type != IOMAP_HOLE ? &srcmap : &iomap);
 
+	if (ops->iomap_post_actor) {
+		written = ops->iomap_post_actor(inode, pos, length, written,
+						flags, &iomap, &srcmap);
+	}
 out:
 	/*
 	 * Now the data has been copied, commit the range we've copied.  This
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index a5e9d7d34023..2a36dc93ff27 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -965,8 +965,7 @@ xfs_free_file_space(
 		return 0;
 	if (offset + len > XFS_ISIZE(ip))
 		len = XFS_ISIZE(ip) - offset;
-	error = iomap_zero_range(VFS_I(ip), offset, len, NULL,
-			&xfs_buffered_write_iomap_ops);
+	error = xfs_iomap_zero_range(ip, offset, len, NULL);
 	if (error)
 		return error;
 
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 396ef36dcd0a..89406ec6741b 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -684,11 +684,12 @@ xfs_file_dax_write(
 	pos = iocb->ki_pos;
 
 	trace_xfs_file_dax_write(iocb, from);
-	ret = dax_iomap_rw(iocb, from, &xfs_direct_write_iomap_ops);
+	ret = dax_iomap_rw(iocb, from, &xfs_dax_write_iomap_ops);
 	if (ret > 0 && iocb->ki_pos > i_size_read(inode)) {
 		i_size_write(inode, iocb->ki_pos);
 		error = xfs_setfilesize(ip, pos, ret);
 	}
+
 out:
 	if (iolock)
 		xfs_iunlock(ip, iolock);
@@ -1309,7 +1310,7 @@ __xfs_filemap_fault(
 
 		ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL,
 				(write_fault && !vmf->cow_page) ?
-				 &xfs_direct_write_iomap_ops :
+				 &xfs_dax_write_iomap_ops :
 				 &xfs_read_iomap_ops);
 		if (ret & VM_FAULT_NEEDDSYNC)
 			ret = dax_finish_sync_fault(vmf, pe_size, pfn);
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index d154f42e2dc6..2f322e2f8544 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -761,7 +761,8 @@ xfs_direct_write_iomap_begin(
 
 		/* may drop and re-acquire the ilock */
 		error = xfs_reflink_allocate_cow(ip, &imap, &cmap, &shared,
-				&lockmode, flags & IOMAP_DIRECT);
+				&lockmode,
+				(flags & IOMAP_DIRECT) || IS_DAX(inode));
 		if (error)
 			goto out_unlock;
 		if (shared)
@@ -854,6 +855,36 @@ const struct iomap_ops xfs_direct_write_iomap_ops = {
 	.iomap_begin		= xfs_direct_write_iomap_begin,
 };
 
+static int
+xfs_dax_write_iomap_post_actor(
+	struct inode		*inode,
+	loff_t			pos,
+	loff_t			length,
+	ssize_t			written,
+	unsigned int		flags,
+	struct iomap		*iomap,
+	struct iomap		*srcmap)
+{
+	int			error = 0;
+	struct xfs_inode	*ip = XFS_I(inode);
+	bool			cow = xfs_is_cow_inode(ip);
+
+	if (written <= 0) {
+		if (cow)
+			xfs_reflink_cancel_cow_range(ip, pos, length, true);
+		return written;
+	}
+
+	if (cow)
+		error = xfs_reflink_end_cow(ip, pos, written);
+	return error ?: written;
+}
+
+const struct iomap_ops xfs_dax_write_iomap_ops = {
+	.iomap_begin		= xfs_direct_write_iomap_begin,
+	.iomap_post_actor	= xfs_dax_write_iomap_post_actor,
+};
+
 static int
 xfs_buffered_write_iomap_begin(
 	struct inode		*inode,
diff --git a/fs/xfs/xfs_iomap.h b/fs/xfs/xfs_iomap.h
index 7d3703556d0e..fbacf638ab21 100644
--- a/fs/xfs/xfs_iomap.h
+++ b/fs/xfs/xfs_iomap.h
@@ -42,8 +42,32 @@ xfs_aligned_fsb_count(
 
 extern const struct iomap_ops xfs_buffered_write_iomap_ops;
 extern const struct iomap_ops xfs_direct_write_iomap_ops;
+extern const struct iomap_ops xfs_dax_write_iomap_ops;
 extern const struct iomap_ops xfs_read_iomap_ops;
 extern const struct iomap_ops xfs_seek_iomap_ops;
 extern const struct iomap_ops xfs_xattr_iomap_ops;
 
+static inline int
+xfs_iomap_zero_range(
+	struct xfs_inode	*ip,
+	loff_t			offset,
+	loff_t			len,
+	bool			*did_zero)
+{
+	return iomap_zero_range(VFS_I(ip), offset, len, did_zero,
+			IS_DAX(VFS_I(ip)) ? &xfs_dax_write_iomap_ops
+					  : &xfs_buffered_write_iomap_ops);
+}
+
+static inline int
+xfs_iomap_truncate_page(
+	struct xfs_inode	*ip,
+	loff_t			pos,
+	bool			*did_zero)
+{
+	return iomap_truncate_page(VFS_I(ip), pos, did_zero,
+			IS_DAX(VFS_I(ip)) ? &xfs_dax_write_iomap_ops
+					  : &xfs_buffered_write_iomap_ops);
+}
+
 #endif /* __XFS_IOMAP_H__*/
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index dfe24b7f26e5..6d936c3e1a6e 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -911,8 +911,8 @@ xfs_setattr_size(
 	 */
 	if (newsize > oldsize) {
 		trace_xfs_zero_eof(ip, oldsize, newsize - oldsize);
-		error = iomap_zero_range(inode, oldsize, newsize - oldsize,
-				&did_zeroing, &xfs_buffered_write_iomap_ops);
+		error = xfs_iomap_zero_range(ip, oldsize, newsize - oldsize,
+				&did_zeroing);
 	} else {
 		/*
 		 * iomap won't detect a dirty page over an unwritten block (or a
@@ -924,8 +924,7 @@ xfs_setattr_size(
 						     newsize);
 		if (error)
 			return error;
-		error = iomap_truncate_page(inode, newsize, &did_zeroing,
-				&xfs_buffered_write_iomap_ops);
+		error = xfs_iomap_truncate_page(ip, newsize, &did_zeroing);
 	}
 
 	if (error)
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index d25434f93235..9a780948dbd0 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1266,8 +1266,7 @@ xfs_reflink_zero_posteof(
 		return 0;
 
 	trace_xfs_zero_eof(ip, isize, pos - isize);
-	return iomap_zero_range(VFS_I(ip), isize, pos - isize, NULL,
-			&xfs_buffered_write_iomap_ops);
+	return xfs_iomap_zero_range(ip, isize, pos - isize, NULL);
 }
 
 /*
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 95562f863ad0..58f2e1c78018 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -135,6 +135,14 @@ struct iomap_ops {
 			unsigned flags, struct iomap *iomap,
 			struct iomap *srcmap);
 
+	/*
+	 * Handle the error code from actor(). Do the finishing jobs for extra
+	 * operations, such as CoW, according to whether written is negative.
+	 */
+	int (*iomap_post_actor)(struct inode *inode, loff_t pos, loff_t length,
+			ssize_t written, unsigned flags, struct iomap *iomap,
+			struct iomap *srcmap);
+
 	/*
 	 * Commit and/or unreserve space previous allocated using iomap_begin.
 	 * Written indicates the length of the successful write operation which
-- 
2.31.1





* RE: [PATCH v6.1 6/7] fs/xfs: Handle CoW for fsdax write() path
  2021-06-15  7:21       ` [PATCH v6.1 " Shiyang Ruan
@ 2021-06-24  8:49         ` ruansy.fnst
  2021-06-25 22:18           ` Darrick J. Wong
  0 siblings, 1 reply; 23+ messages in thread
From: ruansy.fnst @ 2021-06-24  8:49 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: dan.j.williams, david, djwong, hch, linux-fsdevel, linux-kernel,
	nvdimm, linux-xfs, rgoldwyn, viro, willy

Hi Darrick,

Do you have any comment on this?


--
Thanks,
Ruan.

> -----Original Message-----
> From: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> Subject: [PATCH v6.1 6/7] fs/xfs: Handle CoW for fsdax write() path
> 
> Hi Darrick,
> 
> Since the other patches look good, I am posting this RFC patch on its own to
> hot-fix the problem in xfs_dax_write_iomap_ops->iomap_end() of v6, where the
> error code was ignored. I will split this into two patches (changes in iomap
> and xfs respectively) in the next formal version if it looks OK.
> 
> ====
> 
> Introduce a new interface called "iomap_post_actor()" in iomap_ops, and call
> it between ->actor() and ->iomap_end().  It is meant to handle the error code
> returned from ->actor().  In this patchset, it is used to remap or cancel the
> CoW extents according to the error code.
> 
> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> ---
>  fs/dax.c               | 27 ++++++++++++++++++---------
>  fs/iomap/apply.c       |  4 ++++
>  fs/xfs/xfs_bmap_util.c |  3 +--
>  fs/xfs/xfs_file.c      |  5 +++--
>  fs/xfs/xfs_iomap.c     | 33 ++++++++++++++++++++++++++++++++-
>  fs/xfs/xfs_iomap.h     | 24 ++++++++++++++++++++++++
>  fs/xfs/xfs_iops.c      |  7 +++----
>  fs/xfs/xfs_reflink.c   |  3 +--
>  include/linux/iomap.h  |  8 ++++++++
>  9 files changed, 94 insertions(+), 20 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 93f16210847b..0740c2610b6f 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1537,7 +1537,7 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
>  	struct iomap iomap = { .type = IOMAP_HOLE };
>  	struct iomap srcmap = { .type = IOMAP_HOLE };
>  	unsigned flags = IOMAP_FAULT;
> -	int error;
> +	int error, copied = PAGE_SIZE;
>  	bool write = vmf->flags & FAULT_FLAG_WRITE;
>  	vm_fault_t ret = 0, major = 0;
>  	void *entry;
> @@ -1598,7 +1598,7 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
>  	ret = dax_fault_actor(vmf, pfnp, &xas, &entry, false, flags,
>  			      &iomap, &srcmap);
>  	if (ret == VM_FAULT_SIGBUS)
> -		goto finish_iomap;
> +		goto finish_iomap_actor;
> 
>  	/* read/write MAPPED, CoW UNWRITTEN */
>  	if (iomap.flags & IOMAP_F_NEW) {
> @@ -1607,10 +1607,16 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
>  		major = VM_FAULT_MAJOR;
>  	}
> 
> + finish_iomap_actor:
> +	if (ops->iomap_post_actor) {
> +		if (ret & VM_FAULT_ERROR)
> +			copied = 0;
> +		ops->iomap_post_actor(inode, pos, PMD_SIZE, copied, flags,
> +				      &iomap, &srcmap);
> +	}
> +
>  finish_iomap:
>  	if (ops->iomap_end) {
> -		int copied = PAGE_SIZE;
> -
>  		if (ret & VM_FAULT_ERROR)
>  			copied = 0;
>  		/*
> @@ -1677,7 +1683,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
>  	pgoff_t max_pgoff;
>  	void *entry;
>  	loff_t pos;
> -	int error;
> +	int error, copied = PMD_SIZE;
> 
>  	/*
>  	 * Check whether offset isn't beyond end of file now. Caller is
> @@ -1736,12 +1742,15 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
>  	ret = dax_fault_actor(vmf, pfnp, &xas, &entry, true, flags,
>  			      &iomap, &srcmap);
> 
> +	if (ret == VM_FAULT_FALLBACK)
> +		copied = 0;
> +	if (ops->iomap_post_actor) {
> +		ops->iomap_post_actor(inode, pos, PMD_SIZE, copied, flags,
> +				      &iomap, &srcmap);
> +	}
> +
>  finish_iomap:
>  	if (ops->iomap_end) {
> -		int copied = PMD_SIZE;
> -
> -		if (ret == VM_FAULT_FALLBACK)
> -			copied = 0;
>  		/*
>  		 * The fault is done by now and there's no way back (other
>  		 * thread may be already happily using PMD we have installed).
> diff --git a/fs/iomap/apply.c b/fs/iomap/apply.c
> index 0493da5286ad..26a54ded184f 100644
> --- a/fs/iomap/apply.c
> +++ b/fs/iomap/apply.c
> @@ -84,6 +84,10 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
>  	written = actor(inode, pos, length, data, &iomap,
>  			srcmap.type != IOMAP_HOLE ? &srcmap : &iomap);
> 
> +	if (ops->iomap_post_actor) {
> +		written = ops->iomap_post_actor(inode, pos, length, written,
> +						flags, &iomap, &srcmap);
> +	}
>  out:
>  	/*
>  	 * Now the data has been copied, commit the range we've copied.  This
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index a5e9d7d34023..2a36dc93ff27 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -965,8 +965,7 @@ xfs_free_file_space(
>  		return 0;
>  	if (offset + len > XFS_ISIZE(ip))
>  		len = XFS_ISIZE(ip) - offset;
> -	error = iomap_zero_range(VFS_I(ip), offset, len, NULL,
> -			&xfs_buffered_write_iomap_ops);
> +	error = xfs_iomap_zero_range(ip, offset, len, NULL);
>  	if (error)
>  		return error;
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 396ef36dcd0a..89406ec6741b 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -684,11 +684,12 @@ xfs_file_dax_write(
>  	pos = iocb->ki_pos;
> 
>  	trace_xfs_file_dax_write(iocb, from);
> -	ret = dax_iomap_rw(iocb, from, &xfs_direct_write_iomap_ops);
> +	ret = dax_iomap_rw(iocb, from, &xfs_dax_write_iomap_ops);
>  	if (ret > 0 && iocb->ki_pos > i_size_read(inode)) {
>  		i_size_write(inode, iocb->ki_pos);
>  		error = xfs_setfilesize(ip, pos, ret);
>  	}
> +
>  out:
>  	if (iolock)
>  		xfs_iunlock(ip, iolock);
> @@ -1309,7 +1310,7 @@ __xfs_filemap_fault(
> 
>  		ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL,
>  				(write_fault && !vmf->cow_page) ?
> -				 &xfs_direct_write_iomap_ops :
> +				 &xfs_dax_write_iomap_ops :
>  				 &xfs_read_iomap_ops);
>  		if (ret & VM_FAULT_NEEDDSYNC)
>  			ret = dax_finish_sync_fault(vmf, pe_size, pfn);
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index d154f42e2dc6..2f322e2f8544 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -761,7 +761,8 @@ xfs_direct_write_iomap_begin(
> 
>  		/* may drop and re-acquire the ilock */
>  		error = xfs_reflink_allocate_cow(ip, &imap, &cmap, &shared,
> -				&lockmode, flags & IOMAP_DIRECT);
> +				&lockmode,
> +				(flags & IOMAP_DIRECT) || IS_DAX(inode));
>  		if (error)
>  			goto out_unlock;
>  		if (shared)
> @@ -854,6 +855,36 @@ const struct iomap_ops xfs_direct_write_iomap_ops = {
>  	.iomap_begin		= xfs_direct_write_iomap_begin,
>  };
> 
> +static int
> +xfs_dax_write_iomap_post_actor(
> +	struct inode		*inode,
> +	loff_t			pos,
> +	loff_t			length,
> +	ssize_t			written,
> +	unsigned int		flags,
> +	struct iomap		*iomap,
> +	struct iomap		*srcmap)
> +{
> +	int			error = 0;
> +	struct xfs_inode	*ip = XFS_I(inode);
> +	bool			cow = xfs_is_cow_inode(ip);
> +
> +	if (written <= 0) {
> +		if (cow)
> +			xfs_reflink_cancel_cow_range(ip, pos, length, true);
> +		return written;
> +	}
> +
> +	if (cow)
> +		error = xfs_reflink_end_cow(ip, pos, written);
> +	return error ?: written;
> +}
> +
> +const struct iomap_ops xfs_dax_write_iomap_ops = {
> +	.iomap_begin		= xfs_direct_write_iomap_begin,
> +	.iomap_post_actor	= xfs_dax_write_iomap_post_actor,
> +};
> +
>  static int
>  xfs_buffered_write_iomap_begin(
>  	struct inode		*inode,
> diff --git a/fs/xfs/xfs_iomap.h b/fs/xfs/xfs_iomap.h
> index 7d3703556d0e..fbacf638ab21 100644
> --- a/fs/xfs/xfs_iomap.h
> +++ b/fs/xfs/xfs_iomap.h
> @@ -42,8 +42,32 @@ xfs_aligned_fsb_count(
> 
>  extern const struct iomap_ops xfs_buffered_write_iomap_ops;
>  extern const struct iomap_ops xfs_direct_write_iomap_ops;
> +extern const struct iomap_ops xfs_dax_write_iomap_ops;
>  extern const struct iomap_ops xfs_read_iomap_ops;
>  extern const struct iomap_ops xfs_seek_iomap_ops;
>  extern const struct iomap_ops xfs_xattr_iomap_ops;
> 
> +static inline int
> +xfs_iomap_zero_range(
> +	struct xfs_inode	*ip,
> +	loff_t			offset,
> +	loff_t			len,
> +	bool			*did_zero)
> +{
> +	return iomap_zero_range(VFS_I(ip), offset, len, did_zero,
> +			IS_DAX(VFS_I(ip)) ? &xfs_dax_write_iomap_ops
> +					  : &xfs_buffered_write_iomap_ops);
> +}
> +
> +static inline int
> +xfs_iomap_truncate_page(
> +	struct xfs_inode	*ip,
> +	loff_t			pos,
> +	bool			*did_zero)
> +{
> +	return iomap_truncate_page(VFS_I(ip), pos, did_zero,
> +			IS_DAX(VFS_I(ip)) ? &xfs_dax_write_iomap_ops
> +					  : &xfs_buffered_write_iomap_ops);
> +}
> +
>  #endif /* __XFS_IOMAP_H__*/
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index dfe24b7f26e5..6d936c3e1a6e 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -911,8 +911,8 @@ xfs_setattr_size(
>  	 */
>  	if (newsize > oldsize) {
>  		trace_xfs_zero_eof(ip, oldsize, newsize - oldsize);
> -		error = iomap_zero_range(inode, oldsize, newsize - oldsize,
> -				&did_zeroing, &xfs_buffered_write_iomap_ops);
> +		error = xfs_iomap_zero_range(ip, oldsize, newsize - oldsize,
> +				&did_zeroing);
>  	} else {
>  		/*
>  		 * iomap won't detect a dirty page over an unwritten block (or a
> @@ -924,8 +924,7 @@ xfs_setattr_size(
>  						     newsize);
>  		if (error)
>  			return error;
> -		error = iomap_truncate_page(inode, newsize, &did_zeroing,
> -				&xfs_buffered_write_iomap_ops);
> +		error = xfs_iomap_truncate_page(ip, newsize, &did_zeroing);
>  	}
> 
>  	if (error)
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index d25434f93235..9a780948dbd0 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -1266,8 +1266,7 @@ xfs_reflink_zero_posteof(
>  		return 0;
> 
>  	trace_xfs_zero_eof(ip, isize, pos - isize);
> -	return iomap_zero_range(VFS_I(ip), isize, pos - isize, NULL,
> -			&xfs_buffered_write_iomap_ops);
> +	return xfs_iomap_zero_range(ip, isize, pos - isize, NULL);
>  }
> 
>  /*
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 95562f863ad0..58f2e1c78018 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -135,6 +135,14 @@ struct iomap_ops {
>  			unsigned flags, struct iomap *iomap,
>  			struct iomap *srcmap);
> 
> +	/*
> +	 * Handle the error code from actor(). Do the finishing jobs for extra
> +	 * operations, such as CoW, according to whether written is negative.
> +	 */
> +	int (*iomap_post_actor)(struct inode *inode, loff_t pos, loff_t length,
> +			ssize_t written, unsigned flags, struct iomap *iomap,
> +			struct iomap *srcmap);
> +
>  	/*
>  	 * Commit and/or unreserve space previous allocated using iomap_begin.
>  	 * Written indicates the length of the successful write operation which
> --
> 2.31.1



* Re: [PATCH v6.1 6/7] fs/xfs: Handle CoW for fsdax write() path
  2021-06-24  8:49         ` ruansy.fnst
@ 2021-06-25 22:18           ` Darrick J. Wong
  2021-06-28  2:55             ` ruansy.fnst
  0 siblings, 1 reply; 23+ messages in thread
From: Darrick J. Wong @ 2021-06-25 22:18 UTC (permalink / raw)
  To: ruansy.fnst
  Cc: dan.j.williams, david, hch, linux-fsdevel, linux-kernel, nvdimm,
	linux-xfs, rgoldwyn, viro, willy

On Thu, Jun 24, 2021 at 08:49:17AM +0000, ruansy.fnst@fujitsu.com wrote:
> Hi Darrick,
> 
> Do you have any comment on this?

Sorry, was on vacation.

> Thanks,
> Ruan.
> 
> > -----Original Message-----
> > From: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> > Subject: [PATCH v6.1 6/7] fs/xfs: Handle CoW for fsdax write() path
> > 
> > Hi Darrick,
> > 
> > > Since the other patches look good, I am posting this RFC patch on its own to
> > > hot-fix the problem in xfs_dax_write_iomap_ops->iomap_end() of v6, where the
> > > error code was ignored. I will split this into two patches (changes in iomap
> > > and xfs respectively) in the next formal version if it looks OK.
> > > 
> > > ====
> > > 
> > > Introduce a new interface called "iomap_post_actor()" in iomap_ops, and call
> > > it between ->actor() and ->iomap_end().  It is meant to handle the error code
> > > returned from ->actor().  In this patchset, it is used to remap or cancel the
> > > CoW extents according to the error code.
> > 
> > Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> > ---
> >  fs/dax.c               | 27 ++++++++++++++++++---------
> >  fs/iomap/apply.c       |  4 ++++
> >  fs/xfs/xfs_bmap_util.c |  3 +--
> >  fs/xfs/xfs_file.c      |  5 +++--
> >  fs/xfs/xfs_iomap.c     | 33 ++++++++++++++++++++++++++++++++-
> >  fs/xfs/xfs_iomap.h     | 24 ++++++++++++++++++++++++
> >  fs/xfs/xfs_iops.c      |  7 +++----
> >  fs/xfs/xfs_reflink.c   |  3 +--
> >  include/linux/iomap.h  |  8 ++++++++
> >  9 files changed, 94 insertions(+), 20 deletions(-)
> > 
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 93f16210847b..0740c2610b6f 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -1537,7 +1537,7 @@ static vm_fault_t dax_iomap_pte_fault(struct
> > vm_fault *vmf, pfn_t *pfnp,
> >  	struct iomap iomap = { .type = IOMAP_HOLE };
> >  	struct iomap srcmap = { .type = IOMAP_HOLE };
> >  	unsigned flags = IOMAP_FAULT;
> > -	int error;
> > +	int error, copied = PAGE_SIZE;
> >  	bool write = vmf->flags & FAULT_FLAG_WRITE;
> >  	vm_fault_t ret = 0, major = 0;
> >  	void *entry;
> > @@ -1598,7 +1598,7 @@ static vm_fault_t dax_iomap_pte_fault(struct
> > vm_fault *vmf, pfn_t *pfnp,
> >  	ret = dax_fault_actor(vmf, pfnp, &xas, &entry, false, flags,
> >  			      &iomap, &srcmap);
> >  	if (ret == VM_FAULT_SIGBUS)
> > -		goto finish_iomap;
> > +		goto finish_iomap_actor;
> > 
> >  	/* read/write MAPPED, CoW UNWRITTEN */
> >  	if (iomap.flags & IOMAP_F_NEW) {
> > @@ -1607,10 +1607,16 @@ static vm_fault_t dax_iomap_pte_fault(struct
> > vm_fault *vmf, pfn_t *pfnp,
> >  		major = VM_FAULT_MAJOR;
> >  	}
> > 
> > + finish_iomap_actor:
> > +	if (ops->iomap_post_actor) {
> > +		if (ret & VM_FAULT_ERROR)
> > +			copied = 0;
> > +		ops->iomap_post_actor(inode, pos, PMD_SIZE, copied, flags,
> > +				      &iomap, &srcmap);
> > +	}
> > +
> >  finish_iomap:
> >  	if (ops->iomap_end) {
> > -		int copied = PAGE_SIZE;
> > -
> >  		if (ret & VM_FAULT_ERROR)
> >  			copied = 0;
> >  		/*
> > @@ -1677,7 +1683,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct
> > vm_fault *vmf, pfn_t *pfnp,
> >  	pgoff_t max_pgoff;
> >  	void *entry;
> >  	loff_t pos;
> > -	int error;
> > +	int error, copied = PMD_SIZE;
> > 
> >  	/*
> > >  	 * Check whether offset isn't beyond end of file now. Caller is
> > > @@ -1736,12 +1742,15 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
> >  	ret = dax_fault_actor(vmf, pfnp, &xas, &entry, true, flags,
> >  			      &iomap, &srcmap);
> > 
> > +	if (ret == VM_FAULT_FALLBACK)
> > +		copied = 0;
> > +	if (ops->iomap_post_actor) {
> > +		ops->iomap_post_actor(inode, pos, PMD_SIZE, copied, flags,
> > +				      &iomap, &srcmap);
> > +	}
> > +
> >  finish_iomap:
> >  	if (ops->iomap_end) {
> > -		int copied = PMD_SIZE;
> > -
> > -		if (ret == VM_FAULT_FALLBACK)
> > -			copied = 0;
> >  		/*
> >  		 * The fault is done by now and there's no way back (other
> >  		 * thread may be already happily using PMD we have installed).
> > > diff --git a/fs/iomap/apply.c b/fs/iomap/apply.c
> > > index 0493da5286ad..26a54ded184f 100644
> > > --- a/fs/iomap/apply.c
> > > +++ b/fs/iomap/apply.c
> > > @@ -84,6 +84,10 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
> >  	written = actor(inode, pos, length, data, &iomap,
> >  			srcmap.type != IOMAP_HOLE ? &srcmap : &iomap);
> > 
> > +	if (ops->iomap_post_actor) {
> > +		written = ops->iomap_post_actor(inode, pos, length, written,
> > +						flags, &iomap, &srcmap);

How many operations actually need an iomap_post_actor?  It's just the
dax ones, right?  Which is ... iomap_truncate_page, iomap_zero_range,
dax_iomap_fault, and dax_iomap_rw, right?  We don't need a post_actor
for other iomap functionality (like FIEMAP, SEEK_DATA/SEEK_HOLE, etc.)
so adding a new function pointer for all operations feels a bit
overbroad.

I had imagined that you'd create a struct dax_iomap_ops to wrap all the
extra functionality that you need for dax operations:

struct dax_iomap_ops {
	struct iomap_ops	iomap_ops;

	int			(*end_io)(inode, pos, length...);
};

And alter the four functions that you need to take the special
dax_iomap_ops.  I guess the downside is that this makes
iomap_truncate_page and iomap_zero_range more complicated, but maybe
it's just time to split those into DAX-specific versions.  Then we'd be
rid of the cross-links between fs/iomap/buffered-io.c and fs/dax.c.
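
To make the wrapper idea concrete, here is a minimal userspace sketch, not kernel code. It assumes a cut-down `struct iomap_ops` with a single callback, and it invents a parameter list for `end_io` (the sketch above leaves the arguments elided), so `dax_op`, `demo_begin`, `demo_end_io` and `to_dax_iomap_ops` are all hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's struct iomap_ops (assumption). */
struct iomap_ops {
	int (*iomap_begin)(long long pos, long long length);
};

/* Wrapper: the generic ops plus a hook that only DAX paths know about. */
struct dax_iomap_ops {
	struct iomap_ops iomap_ops;
	/* Hypothetical signature; the real argument list is not given above. */
	int (*end_io)(long long pos, long long length, long long written);
};

/* Recover the wrapper from a pointer to its embedded iomap_ops member. */
#define to_dax_iomap_ops(ptr) \
	((const struct dax_iomap_ops *)((const char *)(ptr) - \
		offsetof(struct dax_iomap_ops, iomap_ops)))

static int demo_begin(long long pos, long long length)
{
	(void)pos;
	(void)length;
	return 0;	/* pretend the mapping lookup succeeded */
}

static int demo_end_io(long long pos, long long length, long long written)
{
	(void)pos;
	(void)length;
	return written < 0 ? (int)written : 0;
}

static const struct dax_iomap_ops demo_dax_ops = {
	.iomap_ops = { .iomap_begin = demo_begin },
	.end_io    = demo_end_io,
};

/* A DAX-only entry point takes the wrapper; generic code never sees end_io. */
static long long dax_op(const struct dax_iomap_ops *ops, long long pos,
			long long length)
{
	long long written;
	int error;

	assert(ops != NULL);
	error = ops->iomap_ops.iomap_begin(pos, length);
	if (error)
		return error;

	written = length;	/* pretend the actor processed the whole range */

	if (ops->end_io) {
		error = ops->end_io(pos, length, written);
		if (error)
			return error;
	}
	return written;
}
```

With this shape, FIEMAP or SEEK_DATA style callers keep passing a plain `struct iomap_ops` and pay nothing for the extra hook, while only the DAX entry points accept the wrapper.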

> > +	}
> >  out:
> >  	/*
> >  	 * Now the data has been copied, commit the range we've copied.  This
> > > diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> > > index a5e9d7d34023..2a36dc93ff27 100644
> > --- a/fs/xfs/xfs_bmap_util.c
> > +++ b/fs/xfs/xfs_bmap_util.c
> > @@ -965,8 +965,7 @@ xfs_free_file_space(
> >  		return 0;
> >  	if (offset + len > XFS_ISIZE(ip))
> >  		len = XFS_ISIZE(ip) - offset;
> > -	error = iomap_zero_range(VFS_I(ip), offset, len, NULL,
> > -			&xfs_buffered_write_iomap_ops);
> > +	error = xfs_iomap_zero_range(ip, offset, len, NULL);
> >  	if (error)
> >  		return error;
> > 
> > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > > index 396ef36dcd0a..89406ec6741b 100644
> > --- a/fs/xfs/xfs_file.c
> > +++ b/fs/xfs/xfs_file.c
> > @@ -684,11 +684,12 @@ xfs_file_dax_write(
> >  	pos = iocb->ki_pos;
> > 
> >  	trace_xfs_file_dax_write(iocb, from);
> > -	ret = dax_iomap_rw(iocb, from, &xfs_direct_write_iomap_ops);
> > +	ret = dax_iomap_rw(iocb, from, &xfs_dax_write_iomap_ops);
> >  	if (ret > 0 && iocb->ki_pos > i_size_read(inode)) {
> >  		i_size_write(inode, iocb->ki_pos);
> >  		error = xfs_setfilesize(ip, pos, ret);
> >  	}
> > +
> >  out:
> >  	if (iolock)
> >  		xfs_iunlock(ip, iolock);
> > @@ -1309,7 +1310,7 @@ __xfs_filemap_fault(
> > 
> >  		ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL,
> >  				(write_fault && !vmf->cow_page) ?
> > -				 &xfs_direct_write_iomap_ops :
> > +				 &xfs_dax_write_iomap_ops :
> >  				 &xfs_read_iomap_ops);
> >  		if (ret & VM_FAULT_NEEDDSYNC)
> > >  			ret = dax_finish_sync_fault(vmf, pe_size, pfn);
> > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> > > index d154f42e2dc6..2f322e2f8544 100644
> > --- a/fs/xfs/xfs_iomap.c
> > +++ b/fs/xfs/xfs_iomap.c
> > @@ -761,7 +761,8 @@ xfs_direct_write_iomap_begin(
> > 
> >  		/* may drop and re-acquire the ilock */
> >  		error = xfs_reflink_allocate_cow(ip, &imap, &cmap, &shared,
> > -				&lockmode, flags & IOMAP_DIRECT);
> > +				&lockmode,
> > +				(flags & IOMAP_DIRECT) || IS_DAX(inode));
> >  		if (error)
> >  			goto out_unlock;
> >  		if (shared)
> > @@ -854,6 +855,36 @@ const struct iomap_ops xfs_direct_write_iomap_ops =
> > {
> >  	.iomap_begin		= xfs_direct_write_iomap_begin,
> >  };
> > 
> > +static int
> > +xfs_dax_write_iomap_post_actor(
> > +	struct inode		*inode,
> > +	loff_t			pos,
> > +	loff_t			length,
> > +	ssize_t			written,
> > +	unsigned int		flags,
> > +	struct iomap		*iomap,
> > +	struct iomap		*srcmap)
> > +{
> > +	int			error = 0;
> > +	struct xfs_inode	*ip = XFS_I(inode);
> > +	bool			cow = xfs_is_cow_inode(ip);
> > +
> > +	if (written <= 0) {
> > +		if (cow)
> > +			xfs_reflink_cancel_cow_range(ip, pos, length, true);
> > +		return written;
> > +	}
> > +
> > +	if (cow)
> > +		error = xfs_reflink_end_cow(ip, pos, written);
> > +	return error ?: written;
> > +}

This is pretty much the same as what xfs_dio_write_end_io does, right?

I had imagined that you'd change the function signatures to drop the
iocb so that you could reuse this code instead of creating a whole new
callback.

Ah well.  Can I send you some prep patches to clean up some of the weird
iomap code as a preparation series for this?
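
For readers following the return convention of the quoted post-actor hook, the sketch below models it in plain userspace C. It is an illustration under assumptions, not the XFS implementation: `struct cow_model` and `post_actor_model` are hypothetical, and the counters merely stand in for the cancel and end-CoW steps.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bookkeeping standing in for the real CoW state. */
struct cow_model {
	int cancelled;		/* times the staged CoW range was dropped */
	int remapped;		/* times the staged CoW range was committed */
	int end_cow_error;	/* error the commit step reports, 0 on success */
};

static long long post_actor_model(struct cow_model *m, int is_cow,
				  long long written)
{
	assert(m != NULL);

	/* Actor failed or made no progress: drop staged extents, pass result on. */
	if (written <= 0) {
		if (is_cow)
			m->cancelled++;
		return written;
	}

	/* Actor succeeded: commit staged extents; a commit error takes priority. */
	if (is_cow) {
		m->remapped++;
		if (m->end_cow_error)
			return m->end_cow_error;
	}
	return written;
}
```

The point of the model is only the ordering: cancel on failure, commit on success, and let a commit failure override the byte count.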

--D

> > +
> > +const struct iomap_ops xfs_dax_write_iomap_ops = {
> > +	.iomap_begin		= xfs_direct_write_iomap_begin,
> > +	.iomap_post_actor	= xfs_dax_write_iomap_post_actor,
> > +};
> > +
> >  static int
> >  xfs_buffered_write_iomap_begin(
> >  	struct inode		*inode,
> > > diff --git a/fs/xfs/xfs_iomap.h b/fs/xfs/xfs_iomap.h
> > > index 7d3703556d0e..fbacf638ab21 100644
> > > --- a/fs/xfs/xfs_iomap.h
> > > +++ b/fs/xfs/xfs_iomap.h
> > > @@ -42,8 +42,32 @@ xfs_aligned_fsb_count(
> > > 
> > >  extern const struct iomap_ops xfs_buffered_write_iomap_ops;
> > >  extern const struct iomap_ops xfs_direct_write_iomap_ops;
> > > +extern const struct iomap_ops xfs_dax_write_iomap_ops;
> > >  extern const struct iomap_ops xfs_read_iomap_ops;
> > >  extern const struct iomap_ops xfs_seek_iomap_ops;
> > >  extern const struct iomap_ops xfs_xattr_iomap_ops;
> > 
> > +static inline int
> > +xfs_iomap_zero_range(
> > +	struct xfs_inode	*ip,
> > +	loff_t			offset,
> > +	loff_t			len,
> > +	bool			*did_zero)
> > +{
> > +	return iomap_zero_range(VFS_I(ip), offset, len, did_zero,
> > +			IS_DAX(VFS_I(ip)) ? &xfs_dax_write_iomap_ops
> > > +					  : &xfs_buffered_write_iomap_ops);
> > > +}
> > +
> > +static inline int
> > +xfs_iomap_truncate_page(
> > +	struct xfs_inode	*ip,
> > +	loff_t			pos,
> > +	bool			*did_zero)
> > +{
> > +	return iomap_truncate_page(VFS_I(ip), pos, did_zero,
> > +			IS_DAX(VFS_I(ip)) ? &xfs_dax_write_iomap_ops
> > > +					  : &xfs_buffered_write_iomap_ops);
> > > +}
> > +
> >  #endif /* __XFS_IOMAP_H__*/
> > > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> > > index dfe24b7f26e5..6d936c3e1a6e 100644
> > --- a/fs/xfs/xfs_iops.c
> > +++ b/fs/xfs/xfs_iops.c
> > @@ -911,8 +911,8 @@ xfs_setattr_size(
> >  	 */
> >  	if (newsize > oldsize) {
> >  		trace_xfs_zero_eof(ip, oldsize, newsize - oldsize);
> > -		error = iomap_zero_range(inode, oldsize, newsize - oldsize,
> > -				&did_zeroing, &xfs_buffered_write_iomap_ops);
> > +		error = xfs_iomap_zero_range(ip, oldsize, newsize - oldsize,
> > +				&did_zeroing);
> >  	} else {
> >  		/*
> > >  		 * iomap won't detect a dirty page over an unwritten block (or a
> > > @@ -924,8 +924,7 @@ xfs_setattr_size(
> >  						     newsize);
> >  		if (error)
> >  			return error;
> > -		error = iomap_truncate_page(inode, newsize, &did_zeroing,
> > -				&xfs_buffered_write_iomap_ops);
> > +		error = xfs_iomap_truncate_page(ip, newsize, &did_zeroing);
> >  	}
> > 
> >  	if (error)
> > > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > > index d25434f93235..9a780948dbd0 100644
> > --- a/fs/xfs/xfs_reflink.c
> > +++ b/fs/xfs/xfs_reflink.c
> > @@ -1266,8 +1266,7 @@ xfs_reflink_zero_posteof(
> >  		return 0;
> > 
> >  	trace_xfs_zero_eof(ip, isize, pos - isize);
> > -	return iomap_zero_range(VFS_I(ip), isize, pos - isize, NULL,
> > -			&xfs_buffered_write_iomap_ops);
> > +	return xfs_iomap_zero_range(ip, isize, pos - isize, NULL);
> >  }
> > 
> >  /*
> > > diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> > > index 95562f863ad0..58f2e1c78018 100644
> > --- a/include/linux/iomap.h
> > +++ b/include/linux/iomap.h
> > @@ -135,6 +135,14 @@ struct iomap_ops {
> >  			unsigned flags, struct iomap *iomap,
> >  			struct iomap *srcmap);
> > 
> > +	/*
> > +	 * Handle the error code from actor(). Do the finishing jobs for extra
> > +	 * operations, such as CoW, according to whether written is negative.
> > +	 */
> > +	int (*iomap_post_actor)(struct inode *inode, loff_t pos, loff_t length,
> > +			ssize_t written, unsigned flags, struct iomap *iomap,
> > +			struct iomap *srcmap);
> > +
> >  	/*
> >  	 * Commit and/or unreserve space previous allocated using iomap_begin.
> >  	 * Written indicates the length of the successful write operation which
> > --
> > 2.31.1
> 


* RE: [PATCH v6.1 6/7] fs/xfs: Handle CoW for fsdax write() path
  2021-06-25 22:18           ` Darrick J. Wong
@ 2021-06-28  2:55             ` ruansy.fnst
  2021-06-28  5:09               ` Darrick J. Wong
  2021-07-09 12:36               ` [PATCH v6.2 6/7] dax: Introduce dax_iomap_ops for end of reflink Shiyang Ruan
  0 siblings, 2 replies; 23+ messages in thread
From: ruansy.fnst @ 2021-06-28  2:55 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: dan.j.williams, david, hch, linux-fsdevel, linux-kernel, nvdimm,
	linux-xfs, rgoldwyn, viro, willy

> -----Original Message-----
> Subject: Re: [PATCH v6.1 6/7] fs/xfs: Handle CoW for fsdax write() path
> 
> On Thu, Jun 24, 2021 at 08:49:17AM +0000, ruansy.fnst@fujitsu.com wrote:
> > Hi Darrick,
> >
> > Do you have any comment on this?
> 
> Sorry, was on vacation.
> 
> > Thanks,
> > Ruan.
> >
> > > -----Original Message-----
> > > From: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> > > Subject: [PATCH v6.1 6/7] fs/xfs: Handle CoW for fsdax write() path
> > >
> > > Hi Darrick,
> > >
> > > > Since the other patches look good, I am posting this RFC patch on
> > > > its own to hot-fix the problem in
> > > > xfs_dax_write_iomap_ops->iomap_end() of v6, where the error code
> > > > was ignored. I will split this into two patches (changes in iomap
> > > > and xfs respectively) in the next formal version if it looks OK.
> > > >
> > > > ====
> > > >
> > > > Introduce a new interface called "iomap_post_actor()" in iomap_ops,
> > > > and call it between ->actor() and ->iomap_end().  It is meant to
> > > > handle the error code returned from ->actor().  In this patchset, it
> > > > is used to remap or cancel the CoW extents according to the error code.
> > >
> > > Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> > > ---
> > >  fs/dax.c               | 27 ++++++++++++++++++---------
> > >  fs/iomap/apply.c       |  4 ++++
> > >  fs/xfs/xfs_bmap_util.c |  3 +--
> > >  fs/xfs/xfs_file.c      |  5 +++--
> > >  fs/xfs/xfs_iomap.c     | 33 ++++++++++++++++++++++++++++++++-
> > >  fs/xfs/xfs_iomap.h     | 24 ++++++++++++++++++++++++
> > >  fs/xfs/xfs_iops.c      |  7 +++----
> > >  fs/xfs/xfs_reflink.c   |  3 +--
> > >  include/linux/iomap.h  |  8 ++++++++
> > >  9 files changed, 94 insertions(+), 20 deletions(-)
> > >
> > > diff --git a/fs/dax.c b/fs/dax.c
> > > index 93f16210847b..0740c2610b6f 100644
> > > --- a/fs/dax.c
> > > +++ b/fs/dax.c
> > > @@ -1537,7 +1537,7 @@ static vm_fault_t dax_iomap_pte_fault(struct
> > > vm_fault *vmf, pfn_t *pfnp,
> > >  	struct iomap iomap = { .type = IOMAP_HOLE };
> > >  	struct iomap srcmap = { .type = IOMAP_HOLE };
> > >  	unsigned flags = IOMAP_FAULT;
> > > -	int error;
> > > +	int error, copied = PAGE_SIZE;
> > >  	bool write = vmf->flags & FAULT_FLAG_WRITE;
> > >  	vm_fault_t ret = 0, major = 0;
> > >  	void *entry;
> > > @@ -1598,7 +1598,7 @@ static vm_fault_t dax_iomap_pte_fault(struct
> > > vm_fault *vmf, pfn_t *pfnp,
> > >  	ret = dax_fault_actor(vmf, pfnp, &xas, &entry, false, flags,
> > >  			      &iomap, &srcmap);
> > >  	if (ret == VM_FAULT_SIGBUS)
> > > -		goto finish_iomap;
> > > +		goto finish_iomap_actor;
> > >
> > >  	/* read/write MAPPED, CoW UNWRITTEN */
> > >  	if (iomap.flags & IOMAP_F_NEW) {
> > > @@ -1607,10 +1607,16 @@ static vm_fault_t dax_iomap_pte_fault(struct
> > > vm_fault *vmf, pfn_t *pfnp,
> > >  		major = VM_FAULT_MAJOR;
> > >  	}
> > >
> > > + finish_iomap_actor:
> > > +	if (ops->iomap_post_actor) {
> > > +		if (ret & VM_FAULT_ERROR)
> > > +			copied = 0;
> > > +		ops->iomap_post_actor(inode, pos, PMD_SIZE, copied, flags,
> > > +				      &iomap, &srcmap);
> > > +	}
> > > +
> > >  finish_iomap:
> > >  	if (ops->iomap_end) {
> > > -		int copied = PAGE_SIZE;
> > > -
> > >  		if (ret & VM_FAULT_ERROR)
> > >  			copied = 0;
> > >  		/*
> > > @@ -1677,7 +1683,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct
> > > vm_fault *vmf, pfn_t *pfnp,
> > >  	pgoff_t max_pgoff;
> > >  	void *entry;
> > >  	loff_t pos;
> > > -	int error;
> > > +	int error, copied = PMD_SIZE;
> > >
> > >  	/*
> > >  	 * Check whether offset isn't beyond end of file now. Caller is @@
> > > -1736,12
> > > +1742,15 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault
> > > +*vmf,
> > > pfn_t *pfnp,
> > >  	ret = dax_fault_actor(vmf, pfnp, &xas, &entry, true, flags,
> > >  			      &iomap, &srcmap);
> > >
> > > +	if (ret == VM_FAULT_FALLBACK)
> > > +		copied = 0;
> > > +	if (ops->iomap_post_actor) {
> > > +		ops->iomap_post_actor(inode, pos, PMD_SIZE, copied, flags,
> > > +				      &iomap, &srcmap);
> > > +	}
> > > +
> > >  finish_iomap:
> > >  	if (ops->iomap_end) {
> > > -		int copied = PMD_SIZE;
> > > -
> > > -		if (ret == VM_FAULT_FALLBACK)
> > > -			copied = 0;
> > >  		/*
> > >  		 * The fault is done by now and there's no way back (other
> > >  		 * thread may be already happily using PMD we have installed).
> > > diff --git a/fs/iomap/apply.c b/fs/iomap/apply.c index
> > > 0493da5286ad..26a54ded184f 100644
> > > --- a/fs/iomap/apply.c
> > > +++ b/fs/iomap/apply.c
> > > @@ -84,6 +84,10 @@ iomap_apply(struct inode *inode, loff_t pos,
> > > loff_t length, unsigned flags,
> > >  	written = actor(inode, pos, length, data, &iomap,
> > >  			srcmap.type != IOMAP_HOLE ? &srcmap : &iomap);
> > >
> > > +	if (ops->iomap_post_actor) {
> > > +		written = ops->iomap_post_actor(inode, pos, length, written,
> > > +						flags, &iomap, &srcmap);
> 
> How many operations actually need an iomap_post_actor?  It's just the dax
> ones, right?  Which is ... iomap_truncate_page, iomap_zero_range,
> dax_iomap_fault, and dax_iomap_rw, right?  We don't need a post_actor for
> other iomap functionality (like FIEMAP, SEEK_DATA/SEEK_HOLE, etc.) so adding
> a new function pointer for all operations feels a bit overbroad.

Yes.

> 
> I had imagined that you'd create a struct dax_iomap_ops to wrap all the extra
> functionality that you need for dax operations:
> 
> struct dax_iomap_ops {
> 	struct iomap_ops	iomap_ops;
> 
> 	int			(*end_io)(inode, pos, length...);
> };
> 
> And alter the four functions that you need to take the special dax_iomap_ops.
> I guess the downside is that this makes iomap_truncate_page and
> iomap_zero_range more complicated, but maybe it's just time to split those into
> DAX-specific versions.  Then we'd be rid of the cross-links betwee
> fs/iomap/buffered-io.c and fs/dax.c.

This seems to be a better solution.  I'll try it this way.  Thanks for your guidance.
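
For reference, a minimal userspace sketch of the wrapper idea quoted above.  The types here are simplified stand-ins, not the real kernel definitions, and the names demo_begin(), demo_end_io(), demo_dax_ops and dax_op() are hypothetical, made up only for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the generic ops; not the kernel's struct. */
struct iomap_ops {
	int (*iomap_begin)(long pos, long length);
};

/* Wrapper that carries the extra DAX-only hook next to the generic ops. */
struct dax_iomap_ops {
	struct iomap_ops iomap_ops;
	int (*end_io)(long pos, long length, long written);
};

/* Hypothetical filesystem callbacks. */
static int demo_begin(long pos, long length)
{
	(void)pos;
	(void)length;
	return 0;
}

static int demo_end_io(long pos, long length, long written)
{
	(void)pos;
	(void)length;
	/* Propagate a negative result from the actor as an error. */
	return written < 0 ? (int)written : 0;
}

static const struct dax_iomap_ops demo_dax_ops = {
	.iomap_ops = { .iomap_begin = demo_begin },
	.end_io = demo_end_io,
};

/*
 * A DAX-only entry point takes the wrapper.  Generic iomap code would
 * only ever be handed &ops->iomap_ops and never sees end_io.
 * 'written' stands in for the result of the actor.
 */
static long dax_op(const struct dax_iomap_ops *ops, long pos, long length,
		   long written)
{
	int error;

	assert(ops != NULL);
	error = ops->iomap_ops.iomap_begin(pos, length);
	if (error)
		return error;
	if (ops->end_io) {
		error = ops->end_io(pos, length, written);
		if (error)
			return error;
	}
	return written;
}
```

With this shape only the four DAX callers would need to know about end_io; FIEMAP, SEEK_DATA/SEEK_HOLE and the other iomap users keep taking a plain struct iomap_ops.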

> 
> > > +	}
> > >  out:
> > >  	/*
> > >  	 * Now the data has been copied, commit the range we've copied.
> > > This diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> > > index
> > > a5e9d7d34023..2a36dc93ff27 100644
> > > --- a/fs/xfs/xfs_bmap_util.c
> > > +++ b/fs/xfs/xfs_bmap_util.c
> > > @@ -965,8 +965,7 @@ xfs_free_file_space(
> > >  		return 0;
> > >  	if (offset + len > XFS_ISIZE(ip))
> > >  		len = XFS_ISIZE(ip) - offset;
> > > -	error = iomap_zero_range(VFS_I(ip), offset, len, NULL,
> > > -			&xfs_buffered_write_iomap_ops);
> > > +	error = xfs_iomap_zero_range(ip, offset, len, NULL);
> > >  	if (error)
> > >  		return error;
> > >
> > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index
> > > 396ef36dcd0a..89406ec6741b
> > > 100644
> > > --- a/fs/xfs/xfs_file.c
> > > +++ b/fs/xfs/xfs_file.c
> > > @@ -684,11 +684,12 @@ xfs_file_dax_write(
> > >  	pos = iocb->ki_pos;
> > >
> > >  	trace_xfs_file_dax_write(iocb, from);
> > > -	ret = dax_iomap_rw(iocb, from, &xfs_direct_write_iomap_ops);
> > > +	ret = dax_iomap_rw(iocb, from, &xfs_dax_write_iomap_ops);
> > >  	if (ret > 0 && iocb->ki_pos > i_size_read(inode)) {
> > >  		i_size_write(inode, iocb->ki_pos);
> > >  		error = xfs_setfilesize(ip, pos, ret);
> > >  	}
> > > +
> > >  out:
> > >  	if (iolock)
> > >  		xfs_iunlock(ip, iolock);
> > > @@ -1309,7 +1310,7 @@ __xfs_filemap_fault(
> > >
> > >  		ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL,
> > >  				(write_fault && !vmf->cow_page) ?
> > > -				 &xfs_direct_write_iomap_ops :
> > > +				 &xfs_dax_write_iomap_ops :
> > >  				 &xfs_read_iomap_ops);
> > >  		if (ret & VM_FAULT_NEEDDSYNC)
> > >  			ret = dax_finish_sync_fault(vmf, pe_size, pfn); diff --git
> > > a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c index
> > > d154f42e2dc6..2f322e2f8544
> > > 100644
> > > --- a/fs/xfs/xfs_iomap.c
> > > +++ b/fs/xfs/xfs_iomap.c
> > > @@ -761,7 +761,8 @@ xfs_direct_write_iomap_begin(
> > >
> > >  		/* may drop and re-acquire the ilock */
> > >  		error = xfs_reflink_allocate_cow(ip, &imap, &cmap, &shared,
> > > -				&lockmode, flags & IOMAP_DIRECT);
> > > +				&lockmode,
> > > +				(flags & IOMAP_DIRECT) || IS_DAX(inode));
> > >  		if (error)
> > >  			goto out_unlock;
> > >  		if (shared)
> > > @@ -854,6 +855,36 @@ const struct iomap_ops
> > > xfs_direct_write_iomap_ops = {
> > >  	.iomap_begin		= xfs_direct_write_iomap_begin,
> > >  };
> > >
> > > +static int
> > > +xfs_dax_write_iomap_post_actor(
> > > +	struct inode		*inode,
> > > +	loff_t			pos,
> > > +	loff_t			length,
> > > +	ssize_t			written,
> > > +	unsigned int		flags,
> > > +	struct iomap		*iomap,
> > > +	struct iomap		*srcmap)
> > > +{
> > > +	int			error = 0;
> > > +	struct xfs_inode	*ip = XFS_I(inode);
> > > +	bool			cow = xfs_is_cow_inode(ip);
> > > +
> > > +	if (written <= 0) {
> > > +		if (cow)
> > > +			xfs_reflink_cancel_cow_range(ip, pos, length, true);
> > > +		return written;
> > > +	}
> > > +
> > > +	if (cow)
> > > +		error = xfs_reflink_end_cow(ip, pos, written);
> > > +	return error ?: written;
> > > +}
> 
> This is pretty much the same as what xfs_dio_write_end_io does, right?

This only handles the completion part of CoW.
xfs_dio_write_end_io() also updates the file size, which is needed for write() but not for a page fault.  The file size update is already done in xfs_file_dax_write(), so that part needs no change.

> 
> I had imagined that you'd change the function signatures to drop the iocb so
> that you could reuse this code instead of creating a whole new callback.
> 
> Ah well.  Can I send you some prep patches to clean up some of the weird
> iomap code as a preparation series for this?

Sure.  Thanks.


--
Ruan.

> 
> --D
> 
> > > +
> > > +const struct iomap_ops xfs_dax_write_iomap_ops = {
> > > +	.iomap_begin		= xfs_direct_write_iomap_begin,
> > > +	.iomap_post_actor	= xfs_dax_write_iomap_post_actor,
> > > +};
> > > +
> > >  static int
> > >  xfs_buffered_write_iomap_begin(
> > >  	struct inode		*inode,
> > > diff --git a/fs/xfs/xfs_iomap.h b/fs/xfs/xfs_iomap.h index
> > > 7d3703556d0e..fbacf638ab21 100644
> > > --- a/fs/xfs/xfs_iomap.h
> > > +++ b/fs/xfs/xfs_iomap.h
> > > @@ -42,8 +42,32 @@ xfs_aligned_fsb_count(
> > >
> > >  extern const struct iomap_ops xfs_buffered_write_iomap_ops;  extern
> > > const struct iomap_ops xfs_direct_write_iomap_ops;
> > > +extern const struct iomap_ops xfs_dax_write_iomap_ops;
> > >  extern const struct iomap_ops xfs_read_iomap_ops;  extern const
> > > struct iomap_ops xfs_seek_iomap_ops;  extern const struct iomap_ops
> > > xfs_xattr_iomap_ops;
> > >
> > > +static inline int
> > > +xfs_iomap_zero_range(
> > > +	struct xfs_inode	*ip,
> > > +	loff_t			offset,
> > > +	loff_t			len,
> > > +	bool			*did_zero)
> > > +{
> > > +	return iomap_zero_range(VFS_I(ip), offset, len, did_zero,
> > > +			IS_DAX(VFS_I(ip)) ? &xfs_dax_write_iomap_ops
> > > +					  : &xfs_buffered_write_iomap_ops); }
> > > +
> > > +static inline int
> > > +xfs_iomap_truncate_page(
> > > +	struct xfs_inode	*ip,
> > > +	loff_t			pos,
> > > +	bool			*did_zero)
> > > +{
> > > +	return iomap_truncate_page(VFS_I(ip), pos, did_zero,
> > > +			IS_DAX(VFS_I(ip)) ? &xfs_dax_write_iomap_ops
> > > +					  : &xfs_buffered_write_iomap_ops); }
> > > +
> > >  #endif /* __XFS_IOMAP_H__*/
> > > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c index
> > > dfe24b7f26e5..6d936c3e1a6e 100644
> > > --- a/fs/xfs/xfs_iops.c
> > > +++ b/fs/xfs/xfs_iops.c
> > > @@ -911,8 +911,8 @@ xfs_setattr_size(
> > >  	 */
> > >  	if (newsize > oldsize) {
> > >  		trace_xfs_zero_eof(ip, oldsize, newsize - oldsize);
> > > -		error = iomap_zero_range(inode, oldsize, newsize - oldsize,
> > > -				&did_zeroing, &xfs_buffered_write_iomap_ops);
> > > +		error = xfs_iomap_zero_range(ip, oldsize, newsize - oldsize,
> > > +				&did_zeroing);
> > >  	} else {
> > >  		/*
> > >  		 * iomap won't detect a dirty page over an unwritten block (or a
> > > @@
> > > -924,8 +924,7 @@ xfs_setattr_size(
> > >  						     newsize);
> > >  		if (error)
> > >  			return error;
> > > -		error = iomap_truncate_page(inode, newsize, &did_zeroing,
> > > -				&xfs_buffered_write_iomap_ops);
> > > +		error = xfs_iomap_truncate_page(ip, newsize, &did_zeroing);
> > >  	}
> > >
> > >  	if (error)
> > > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c index
> > > d25434f93235..9a780948dbd0 100644
> > > --- a/fs/xfs/xfs_reflink.c
> > > +++ b/fs/xfs/xfs_reflink.c
> > > @@ -1266,8 +1266,7 @@ xfs_reflink_zero_posteof(
> > >  		return 0;
> > >
> > >  	trace_xfs_zero_eof(ip, isize, pos - isize);
> > > -	return iomap_zero_range(VFS_I(ip), isize, pos - isize, NULL,
> > > -			&xfs_buffered_write_iomap_ops);
> > > +	return xfs_iomap_zero_range(ip, isize, pos - isize, NULL);
> > >  }
> > >
> > >  /*
> > > diff --git a/include/linux/iomap.h b/include/linux/iomap.h index
> > > 95562f863ad0..58f2e1c78018 100644
> > > --- a/include/linux/iomap.h
> > > +++ b/include/linux/iomap.h
> > > @@ -135,6 +135,14 @@ struct iomap_ops {
> > >  			unsigned flags, struct iomap *iomap,
> > >  			struct iomap *srcmap);
> > >
> > > +	/*
> > > +	 * Handle the error code from actor(). Do the finishing jobs for extra
> > > +	 * operations, such as CoW, according to whether written is negative.
> > > +	 */
> > > +	int (*iomap_post_actor)(struct inode *inode, loff_t pos, loff_t length,
> > > +			ssize_t written, unsigned flags, struct iomap *iomap,
> > > +			struct iomap *srcmap);
> > > +
> > >  	/*
> > >  	 * Commit and/or unreserve space previous allocated using
> iomap_begin.
> > >  	 * Written indicates the length of the successful write operation
> > > which
> > > --
> > > 2.31.1
> >

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v6.1 6/7] fs/xfs: Handle CoW for fsdax write() path
  2021-06-28  2:55             ` ruansy.fnst
@ 2021-06-28  5:09               ` Darrick J. Wong
  2021-06-29 11:25                 ` ruansy.fnst
  2021-07-08 23:16                 ` Dave Chinner
  2021-07-09 12:36               ` [PATCH v6.2 6/7] dax: Introduce dax_iomap_ops for end of reflink Shiyang Ruan
  1 sibling, 2 replies; 23+ messages in thread
From: Darrick J. Wong @ 2021-06-28  5:09 UTC (permalink / raw)
  To: ruansy.fnst
  Cc: dan.j.williams, david, hch, linux-fsdevel, linux-kernel, nvdimm,
	linux-xfs, rgoldwyn, viro, willy

On Mon, Jun 28, 2021 at 02:55:03AM +0000, ruansy.fnst@fujitsu.com wrote:
> > -----Original Message-----
> > Subject: Re: [PATCH v6.1 6/7] fs/xfs: Handle CoW for fsdax write() path
> > 
> > On Thu, Jun 24, 2021 at 08:49:17AM +0000, ruansy.fnst@fujitsu.com wrote:
> > > Hi Darrick,
> > >
> > > Do you have any comment on this?
> > 
> > Sorry, was on vacation.
> > 
> > > Thanks,
> > > Ruan.
> > >
> > > > -----Original Message-----
> > > > From: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> > > > Subject: [PATCH v6.1 6/7] fs/xfs: Handle CoW for fsdax write() path
> > > >
> > > > Hi Darrick,
> > > >
> > > > Since other patches looks good, I post this RFC patch singly to
> > > > hot-fix the problem in xfs_dax_write_iomap_ops->iomap_end() of v6
> > > > that the error code was ingored. I will split this in two
> > > > patches(changes in iomap and xfs
> > > > respectively) in next formal version if it looks ok.
> > > >
> > > > ====
> > > >
> > > > Introduce a new interface called "iomap_post_actor()" in iomap_ops.
> > > > And call it between ->actor() and ->iomap_end().  It is mean to
> > > > handle the error code returned from ->actor().  In this patchset, it
> > > > is used to remap or cancel the CoW extents according to the error code.
> > > >
> > > > Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> > > > ---
> > > >  fs/dax.c               | 27 ++++++++++++++++++---------
> > > >  fs/iomap/apply.c       |  4 ++++
> > > >  fs/xfs/xfs_bmap_util.c |  3 +--
> > > >  fs/xfs/xfs_file.c      |  5 +++--
> > > >  fs/xfs/xfs_iomap.c     | 33 ++++++++++++++++++++++++++++++++-
> > > >  fs/xfs/xfs_iomap.h     | 24 ++++++++++++++++++++++++
> > > >  fs/xfs/xfs_iops.c      |  7 +++----
> > > >  fs/xfs/xfs_reflink.c   |  3 +--
> > > >  include/linux/iomap.h  |  8 ++++++++
> > > >  9 files changed, 94 insertions(+), 20 deletions(-)
> > > >
> > > > diff --git a/fs/dax.c b/fs/dax.c
> > > > index 93f16210847b..0740c2610b6f 100644
> > > > --- a/fs/dax.c
> > > > +++ b/fs/dax.c
> > > > @@ -1537,7 +1537,7 @@ static vm_fault_t dax_iomap_pte_fault(struct
> > > > vm_fault *vmf, pfn_t *pfnp,
> > > >  	struct iomap iomap = { .type = IOMAP_HOLE };
> > > >  	struct iomap srcmap = { .type = IOMAP_HOLE };
> > > >  	unsigned flags = IOMAP_FAULT;
> > > > -	int error;
> > > > +	int error, copied = PAGE_SIZE;
> > > >  	bool write = vmf->flags & FAULT_FLAG_WRITE;
> > > >  	vm_fault_t ret = 0, major = 0;
> > > >  	void *entry;
> > > > @@ -1598,7 +1598,7 @@ static vm_fault_t dax_iomap_pte_fault(struct
> > > > vm_fault *vmf, pfn_t *pfnp,
> > > >  	ret = dax_fault_actor(vmf, pfnp, &xas, &entry, false, flags,
> > > >  			      &iomap, &srcmap);
> > > >  	if (ret == VM_FAULT_SIGBUS)
> > > > -		goto finish_iomap;
> > > > +		goto finish_iomap_actor;
> > > >
> > > >  	/* read/write MAPPED, CoW UNWRITTEN */
> > > >  	if (iomap.flags & IOMAP_F_NEW) {
> > > > @@ -1607,10 +1607,16 @@ static vm_fault_t dax_iomap_pte_fault(struct
> > > > vm_fault *vmf, pfn_t *pfnp,
> > > >  		major = VM_FAULT_MAJOR;
> > > >  	}
> > > >
> > > > + finish_iomap_actor:
> > > > +	if (ops->iomap_post_actor) {
> > > > +		if (ret & VM_FAULT_ERROR)
> > > > +			copied = 0;
> > > > +		ops->iomap_post_actor(inode, pos, PMD_SIZE, copied, flags,
> > > > +				      &iomap, &srcmap);
> > > > +	}
> > > > +
> > > >  finish_iomap:
> > > >  	if (ops->iomap_end) {
> > > > -		int copied = PAGE_SIZE;
> > > > -
> > > >  		if (ret & VM_FAULT_ERROR)
> > > >  			copied = 0;
> > > >  		/*
> > > > @@ -1677,7 +1683,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct
> > > > vm_fault *vmf, pfn_t *pfnp,
> > > >  	pgoff_t max_pgoff;
> > > >  	void *entry;
> > > >  	loff_t pos;
> > > > -	int error;
> > > > +	int error, copied = PMD_SIZE;
> > > >
> > > >  	/*
> > > >  	 * Check whether offset isn't beyond end of file now. Caller is @@
> > > > -1736,12
> > > > +1742,15 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault
> > > > +*vmf,
> > > > pfn_t *pfnp,
> > > >  	ret = dax_fault_actor(vmf, pfnp, &xas, &entry, true, flags,
> > > >  			      &iomap, &srcmap);
> > > >
> > > > +	if (ret == VM_FAULT_FALLBACK)
> > > > +		copied = 0;
> > > > +	if (ops->iomap_post_actor) {
> > > > +		ops->iomap_post_actor(inode, pos, PMD_SIZE, copied, flags,
> > > > +				      &iomap, &srcmap);
> > > > +	}
> > > > +
> > > >  finish_iomap:
> > > >  	if (ops->iomap_end) {
> > > > -		int copied = PMD_SIZE;
> > > > -
> > > > -		if (ret == VM_FAULT_FALLBACK)
> > > > -			copied = 0;
> > > >  		/*
> > > >  		 * The fault is done by now and there's no way back (other
> > > >  		 * thread may be already happily using PMD we have installed).
> > > > diff --git a/fs/iomap/apply.c b/fs/iomap/apply.c index
> > > > 0493da5286ad..26a54ded184f 100644
> > > > --- a/fs/iomap/apply.c
> > > > +++ b/fs/iomap/apply.c
> > > > @@ -84,6 +84,10 @@ iomap_apply(struct inode *inode, loff_t pos,
> > > > loff_t length, unsigned flags,
> > > >  	written = actor(inode, pos, length, data, &iomap,
> > > >  			srcmap.type != IOMAP_HOLE ? &srcmap : &iomap);
> > > >
> > > > +	if (ops->iomap_post_actor) {
> > > > +		written = ops->iomap_post_actor(inode, pos, length, written,
> > > > +						flags, &iomap, &srcmap);
> > 
> > How many operations actually need an iomap_post_actor?  It's just the dax
> > ones, right?  Which is ... iomap_truncate_page, iomap_zero_range,
> > dax_iomap_fault, and dax_iomap_rw, right?  We don't need a post_actor for
> > other iomap functionality (like FIEMAP, SEEK_DATA/SEEK_HOLE, etc.) so adding
> > a new function pointer for all operations feels a bit overbroad.
> 
> Yes.
> 
> > 
> > I had imagined that you'd create a struct dax_iomap_ops to wrap all the extra
> > functionality that you need for dax operations:
> > 
> > struct dax_iomap_ops {
> > 	struct iomap_ops	iomap_ops;
> > 
> > 	int			(*end_io)(inode, pos, length...);
> > };
> > 
> > And alter the four functions that you need to take the special dax_iomap_ops.
> > I guess the downside is that this makes iomap_truncate_page and
> > iomap_zero_range more complicated, but maybe it's just time to split those into
> > DAX-specific versions.  Then we'd be rid of the cross-links betwee
> > fs/iomap/buffered-io.c and fs/dax.c.
> 
> This seems to be a better solution.  I'll try in this way.  Thanks for your guidance.

I started writing on Friday a patchset to apply this style cleanup both
to the directio and dax paths.  The cleanups were pretty straightforward
until I started reading the dax code paths again and realized that file
writes still have the weird behavior of mapping extents into a file,
zeroing them, then issuing the actual write to the extent.  IOWs, a
double-write to avoid exposing stale contents after a crash.

Apparently the reason for this was that dax (at least 6 years ago) had
no concept paralleling the page lock, so it was necessary to do that to
avoid page fault handlers racing to map pfns into the file mapping?
That would seem to prevent us from doing the more standard behavior of
allocate unwritten, write data, convert mapping... but is that still the
case?  Or can we get rid of this bad quirk?
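
To make the two orderings concrete, here is a toy userspace model.  It is only a sketch under stated assumptions: struct extent, the 'S' stale marker and the helper names are all hypothetical, and it does not model the page fault race described above, only the number of stores and what a reader can observe:

```c
#include <assert.h>
#include <string.h>

enum ext_state { EXT_UNWRITTEN, EXT_WRITTEN };

#define EXT_SIZE 16

/* Toy extent: a state flag plus the media bytes backing it. */
struct extent {
	enum ext_state state;
	char media[EXT_SIZE];		/* may hold stale data from a previous owner */
	unsigned long bytes_stored;	/* total bytes stored to media so far */
};

static void extent_init(struct extent *e)
{
	e->state = EXT_UNWRITTEN;
	memset(e->media, 'S', EXT_SIZE);	/* 'S' marks stale contents */
	e->bytes_stored = 0;
}

/* Readers of an unwritten extent get zeroes, never the stale media. */
static char extent_read(const struct extent *e, size_t off)
{
	assert(off < EXT_SIZE);
	return e->state == EXT_WRITTEN ? e->media[off] : 0;
}

/* Ordering A: map as written, zero the media, then store the data. */
static void write_zero_first(struct extent *e, const char *buf)
{
	e->state = EXT_WRITTEN;
	memset(e->media, 0, EXT_SIZE);
	e->bytes_stored += EXT_SIZE;
	memcpy(e->media, buf, EXT_SIZE);
	e->bytes_stored += EXT_SIZE;
}

/* Ordering B: keep the extent unwritten, store the data, then convert. */
static void write_then_convert(struct extent *e, const char *buf)
{
	memcpy(e->media, buf, EXT_SIZE);
	e->bytes_stored += EXT_SIZE;
	e->state = EXT_WRITTEN;
}
```

In this model ordering A stores every byte twice, while ordering B stores it once and relies on the unwritten state to hide stale contents until the conversion.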

--D

> 
> > 
> > > > +	}
> > > >  out:
> > > >  	/*
> > > >  	 * Now the data has been copied, commit the range we've copied.
> > > > This diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> > > > index
> > > > a5e9d7d34023..2a36dc93ff27 100644
> > > > --- a/fs/xfs/xfs_bmap_util.c
> > > > +++ b/fs/xfs/xfs_bmap_util.c
> > > > @@ -965,8 +965,7 @@ xfs_free_file_space(
> > > >  		return 0;
> > > >  	if (offset + len > XFS_ISIZE(ip))
> > > >  		len = XFS_ISIZE(ip) - offset;
> > > > -	error = iomap_zero_range(VFS_I(ip), offset, len, NULL,
> > > > -			&xfs_buffered_write_iomap_ops);
> > > > +	error = xfs_iomap_zero_range(ip, offset, len, NULL);
> > > >  	if (error)
> > > >  		return error;
> > > >
> > > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index
> > > > 396ef36dcd0a..89406ec6741b
> > > > 100644
> > > > --- a/fs/xfs/xfs_file.c
> > > > +++ b/fs/xfs/xfs_file.c
> > > > @@ -684,11 +684,12 @@ xfs_file_dax_write(
> > > >  	pos = iocb->ki_pos;
> > > >
> > > >  	trace_xfs_file_dax_write(iocb, from);
> > > > -	ret = dax_iomap_rw(iocb, from, &xfs_direct_write_iomap_ops);
> > > > +	ret = dax_iomap_rw(iocb, from, &xfs_dax_write_iomap_ops);
> > > >  	if (ret > 0 && iocb->ki_pos > i_size_read(inode)) {
> > > >  		i_size_write(inode, iocb->ki_pos);
> > > >  		error = xfs_setfilesize(ip, pos, ret);
> > > >  	}
> > > > +
> > > >  out:
> > > >  	if (iolock)
> > > >  		xfs_iunlock(ip, iolock);
> > > > @@ -1309,7 +1310,7 @@ __xfs_filemap_fault(
> > > >
> > > >  		ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL,
> > > >  				(write_fault && !vmf->cow_page) ?
> > > > -				 &xfs_direct_write_iomap_ops :
> > > > +				 &xfs_dax_write_iomap_ops :
> > > >  				 &xfs_read_iomap_ops);
> > > >  		if (ret & VM_FAULT_NEEDDSYNC)
> > > >  			ret = dax_finish_sync_fault(vmf, pe_size, pfn); diff --git
> > > > a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c index
> > > > d154f42e2dc6..2f322e2f8544
> > > > 100644
> > > > --- a/fs/xfs/xfs_iomap.c
> > > > +++ b/fs/xfs/xfs_iomap.c
> > > > @@ -761,7 +761,8 @@ xfs_direct_write_iomap_begin(
> > > >
> > > >  		/* may drop and re-acquire the ilock */
> > > >  		error = xfs_reflink_allocate_cow(ip, &imap, &cmap, &shared,
> > > > -				&lockmode, flags & IOMAP_DIRECT);
> > > > +				&lockmode,
> > > > +				(flags & IOMAP_DIRECT) || IS_DAX(inode));
> > > >  		if (error)
> > > >  			goto out_unlock;
> > > >  		if (shared)
> > > > @@ -854,6 +855,36 @@ const struct iomap_ops
> > > > xfs_direct_write_iomap_ops = {
> > > >  	.iomap_begin		= xfs_direct_write_iomap_begin,
> > > >  };
> > > >
> > > > +static int
> > > > +xfs_dax_write_iomap_post_actor(
> > > > +	struct inode		*inode,
> > > > +	loff_t			pos,
> > > > +	loff_t			length,
> > > > +	ssize_t			written,
> > > > +	unsigned int		flags,
> > > > +	struct iomap		*iomap,
> > > > +	struct iomap		*srcmap)
> > > > +{
> > > > +	int			error = 0;
> > > > +	struct xfs_inode	*ip = XFS_I(inode);
> > > > +	bool			cow = xfs_is_cow_inode(ip);
> > > > +
> > > > +	if (written <= 0) {
> > > > +		if (cow)
> > > > +			xfs_reflink_cancel_cow_range(ip, pos, length, true);
> > > > +		return written;
> > > > +	}
> > > > +
> > > > +	if (cow)
> > > > +		error = xfs_reflink_end_cow(ip, pos, written);
> > > > +	return error ?: written;
> > > > +}
> > 
> > This is pretty much the same as what xfs_dio_write_end_io does, right?
> 
> It just handles the end part of CoW here.
> xfs_dio_write_end_io() also updates file size, which is only needed in write() but not in page fault.  And the update file size work is done in xfs_dax_file_write(), it's fine, no need to modify it.
> 
> > 
> > I had imagined that you'd change the function signatures to drop the iocb so
> > that you could reuse this code instead of creating a whole new callback.
> > 
> > Ah well.  Can I send you some prep patches to clean up some of the weird
> > iomap code as a preparation series for this?
> 
> Sure.  Thanks.
> 
> 
> --
> Ruan.
> 
> > 
> > --D
> > 
> > > > +
> > > > +const struct iomap_ops xfs_dax_write_iomap_ops = {
> > > > +	.iomap_begin		= xfs_direct_write_iomap_begin,
> > > > +	.iomap_post_actor	= xfs_dax_write_iomap_post_actor,
> > > > +};
> > > > +
> > > >  static int
> > > >  xfs_buffered_write_iomap_begin(
> > > >  	struct inode		*inode,
> > > > diff --git a/fs/xfs/xfs_iomap.h b/fs/xfs/xfs_iomap.h index
> > > > 7d3703556d0e..fbacf638ab21 100644
> > > > --- a/fs/xfs/xfs_iomap.h
> > > > +++ b/fs/xfs/xfs_iomap.h
> > > > @@ -42,8 +42,32 @@ xfs_aligned_fsb_count(
> > > >
> > > >  extern const struct iomap_ops xfs_buffered_write_iomap_ops;  extern
> > > > const struct iomap_ops xfs_direct_write_iomap_ops;
> > > > +extern const struct iomap_ops xfs_dax_write_iomap_ops;
> > > >  extern const struct iomap_ops xfs_read_iomap_ops;  extern const
> > > > struct iomap_ops xfs_seek_iomap_ops;  extern const struct iomap_ops
> > > > xfs_xattr_iomap_ops;
> > > >
> > > > +static inline int
> > > > +xfs_iomap_zero_range(
> > > > +	struct xfs_inode	*ip,
> > > > +	loff_t			offset,
> > > > +	loff_t			len,
> > > > +	bool			*did_zero)
> > > > +{
> > > > +	return iomap_zero_range(VFS_I(ip), offset, len, did_zero,
> > > > +			IS_DAX(VFS_I(ip)) ? &xfs_dax_write_iomap_ops
> > > > +					  : &xfs_buffered_write_iomap_ops); }
> > > > +
> > > > +static inline int
> > > > +xfs_iomap_truncate_page(
> > > > +	struct xfs_inode	*ip,
> > > > +	loff_t			pos,
> > > > +	bool			*did_zero)
> > > > +{
> > > > +	return iomap_truncate_page(VFS_I(ip), pos, did_zero,
> > > > +			IS_DAX(VFS_I(ip)) ? &xfs_dax_write_iomap_ops
> > > > +					  : &xfs_buffered_write_iomap_ops); }
> > > > +
> > > >  #endif /* __XFS_IOMAP_H__*/
> > > > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c index
> > > > dfe24b7f26e5..6d936c3e1a6e 100644
> > > > --- a/fs/xfs/xfs_iops.c
> > > > +++ b/fs/xfs/xfs_iops.c
> > > > @@ -911,8 +911,8 @@ xfs_setattr_size(
> > > >  	 */
> > > >  	if (newsize > oldsize) {
> > > >  		trace_xfs_zero_eof(ip, oldsize, newsize - oldsize);
> > > > -		error = iomap_zero_range(inode, oldsize, newsize - oldsize,
> > > > -				&did_zeroing, &xfs_buffered_write_iomap_ops);
> > > > +		error = xfs_iomap_zero_range(ip, oldsize, newsize - oldsize,
> > > > +				&did_zeroing);
> > > >  	} else {
> > > >  		/*
> > > >  		 * iomap won't detect a dirty page over an unwritten block (or a
> > > > @@
> > > > -924,8 +924,7 @@ xfs_setattr_size(
> > > >  						     newsize);
> > > >  		if (error)
> > > >  			return error;
> > > > -		error = iomap_truncate_page(inode, newsize, &did_zeroing,
> > > > -				&xfs_buffered_write_iomap_ops);
> > > > +		error = xfs_iomap_truncate_page(ip, newsize, &did_zeroing);
> > > >  	}
> > > >
> > > >  	if (error)
> > > > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c index
> > > > d25434f93235..9a780948dbd0 100644
> > > > --- a/fs/xfs/xfs_reflink.c
> > > > +++ b/fs/xfs/xfs_reflink.c
> > > > @@ -1266,8 +1266,7 @@ xfs_reflink_zero_posteof(
> > > >  		return 0;
> > > >
> > > >  	trace_xfs_zero_eof(ip, isize, pos - isize);
> > > > -	return iomap_zero_range(VFS_I(ip), isize, pos - isize, NULL,
> > > > -			&xfs_buffered_write_iomap_ops);
> > > > +	return xfs_iomap_zero_range(ip, isize, pos - isize, NULL);
> > > >  }
> > > >
> > > >  /*
> > > > diff --git a/include/linux/iomap.h b/include/linux/iomap.h index
> > > > 95562f863ad0..58f2e1c78018 100644
> > > > --- a/include/linux/iomap.h
> > > > +++ b/include/linux/iomap.h
> > > > @@ -135,6 +135,14 @@ struct iomap_ops {
> > > >  			unsigned flags, struct iomap *iomap,
> > > >  			struct iomap *srcmap);
> > > >
> > > > +	/*
> > > > +	 * Handle the error code from actor(). Do the finishing jobs for extra
> > > > +	 * operations, such as CoW, according to whether written is negative.
> > > > +	 */
> > > > +	int (*iomap_post_actor)(struct inode *inode, loff_t pos, loff_t length,
> > > > +			ssize_t written, unsigned flags, struct iomap *iomap,
> > > > +			struct iomap *srcmap);
> > > > +
> > > >  	/*
> > > >  	 * Commit and/or unreserve space previous allocated using
> > iomap_begin.
> > > >  	 * Written indicates the length of the successful write operation
> > > > which
> > > > --
> > > > 2.31.1
> > >

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [PATCH v6.1 6/7] fs/xfs: Handle CoW for fsdax write() path
  2021-06-28  5:09               ` Darrick J. Wong
@ 2021-06-29 11:25                 ` ruansy.fnst
  2021-06-29 21:01                   ` Darrick J. Wong
  2021-07-08 23:16                 ` Dave Chinner
  1 sibling, 1 reply; 23+ messages in thread
From: ruansy.fnst @ 2021-06-29 11:25 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: dan.j.williams, david, hch, linux-fsdevel, linux-kernel, nvdimm,
	linux-xfs, rgoldwyn, viro, willy



> -----Original Message-----
> From: Darrick J. Wong <djwong@kernel.org>
> Subject: Re: [PATCH v6.1 6/7] fs/xfs: Handle CoW for fsdax write() path
> 
> On Mon, Jun 28, 2021 at 02:55:03AM +0000, ruansy.fnst@fujitsu.com wrote:
> > > -----Original Message-----
> > > Subject: Re: [PATCH v6.1 6/7] fs/xfs: Handle CoW for fsdax write()
> > > path
> > >
> > > On Thu, Jun 24, 2021 at 08:49:17AM +0000, ruansy.fnst@fujitsu.com wrote:
> > > > Hi Darrick,
> > > >
> > > > Do you have any comment on this?
> > >
> > > Sorry, was on vacation.
> > >
> > > > Thanks,
> > > > Ruan.
> > > >
> > > > > -----Original Message-----
> > > > > From: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> > > > > Subject: [PATCH v6.1 6/7] fs/xfs: Handle CoW for fsdax write()
> > > > > path
> > > > >
> > > > > Hi Darrick,
> > > > >
> > > > > Since other patches looks good, I post this RFC patch singly to
> > > > > hot-fix the problem in xfs_dax_write_iomap_ops->iomap_end() of
> > > > > v6 that the error code was ingored. I will split this in two
> > > > > patches(changes in iomap and xfs
> > > > > respectively) in next formal version if it looks ok.
> > > > >
> > > > > ====
> > > > >
> > > > > Introduce a new interface called "iomap_post_actor()" in iomap_ops.
> > > > > And call it between ->actor() and ->iomap_end().  It is mean to
> > > > > handle the error code returned from ->actor().  In this
> > > > > patchset, it is used to remap or cancel the CoW extents according to the
> error code.
> > > > >
> > > > > Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> > > > > ---
> > > > >  fs/dax.c               | 27 ++++++++++++++++++---------
> > > > >  fs/iomap/apply.c       |  4 ++++
> > > > >  fs/xfs/xfs_bmap_util.c |  3 +--
> > > > >  fs/xfs/xfs_file.c      |  5 +++--
> > > > >  fs/xfs/xfs_iomap.c     | 33 ++++++++++++++++++++++++++++++++-
> > > > >  fs/xfs/xfs_iomap.h     | 24 ++++++++++++++++++++++++
> > > > >  fs/xfs/xfs_iops.c      |  7 +++----
> > > > >  fs/xfs/xfs_reflink.c   |  3 +--
> > > > >  include/linux/iomap.h  |  8 ++++++++
> > > > >  9 files changed, 94 insertions(+), 20 deletions(-)
> > > > >
> > > > > diff --git a/fs/dax.c b/fs/dax.c index
> > > > > 93f16210847b..0740c2610b6f 100644
> > > > > --- a/fs/dax.c
> > > > > +++ b/fs/dax.c
> > > > > @@ -1537,7 +1537,7 @@ static vm_fault_t
> > > > > dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
> > > > >  	struct iomap iomap = { .type = IOMAP_HOLE };
> > > > >  	struct iomap srcmap = { .type = IOMAP_HOLE };
> > > > >  	unsigned flags = IOMAP_FAULT;
> > > > > -	int error;
> > > > > +	int error, copied = PAGE_SIZE;
> > > > >  	bool write = vmf->flags & FAULT_FLAG_WRITE;
> > > > >  	vm_fault_t ret = 0, major = 0;
> > > > >  	void *entry;
> > > > > @@ -1598,7 +1598,7 @@ static vm_fault_t
> > > > > dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
> > > > >  	ret = dax_fault_actor(vmf, pfnp, &xas, &entry, false, flags,
> > > > >  			      &iomap, &srcmap);
> > > > >  	if (ret == VM_FAULT_SIGBUS)
> > > > > -		goto finish_iomap;
> > > > > +		goto finish_iomap_actor;
> > > > >
> > > > >  	/* read/write MAPPED, CoW UNWRITTEN */
> > > > >  	if (iomap.flags & IOMAP_F_NEW) { @@ -1607,10 +1607,16 @@
> > > > > static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf,
> > > > > pfn_t *pfnp,
> > > > >  		major = VM_FAULT_MAJOR;
> > > > >  	}
> > > > >
> > > > > + finish_iomap_actor:
> > > > > +	if (ops->iomap_post_actor) {
> > > > > +		if (ret & VM_FAULT_ERROR)
> > > > > +			copied = 0;
> > > > > +		ops->iomap_post_actor(inode, pos, PMD_SIZE, copied, flags,
> > > > > +				      &iomap, &srcmap);
> > > > > +	}
> > > > > +
> > > > >  finish_iomap:
> > > > >  	if (ops->iomap_end) {
> > > > > -		int copied = PAGE_SIZE;
> > > > > -
> > > > >  		if (ret & VM_FAULT_ERROR)
> > > > >  			copied = 0;
> > > > >  		/*
> > > > > @@ -1677,7 +1683,7 @@ static vm_fault_t
> > > > > dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
> > > > >  	pgoff_t max_pgoff;
> > > > >  	void *entry;
> > > > >  	loff_t pos;
> > > > > -	int error;
> > > > > +	int error, copied = PMD_SIZE;
> > > > >
> > > > >  	/*
> > > > >  	 * Check whether offset isn't beyond end of file now. Caller
> > > > > is @@
> > > > > -1736,12
> > > > > +1742,15 @@ static vm_fault_t dax_iomap_pmd_fault(struct
> > > > > +vm_fault *vmf,
> > > > > pfn_t *pfnp,
> > > > >  	ret = dax_fault_actor(vmf, pfnp, &xas, &entry, true, flags,
> > > > >  			      &iomap, &srcmap);
> > > > >
> > > > > +	if (ret == VM_FAULT_FALLBACK)
> > > > > +		copied = 0;
> > > > > +	if (ops->iomap_post_actor) {
> > > > > +		ops->iomap_post_actor(inode, pos, PMD_SIZE, copied, flags,
> > > > > +				      &iomap, &srcmap);
> > > > > +	}
> > > > > +
> > > > >  finish_iomap:
> > > > >  	if (ops->iomap_end) {
> > > > > -		int copied = PMD_SIZE;
> > > > > -
> > > > > -		if (ret == VM_FAULT_FALLBACK)
> > > > > -			copied = 0;
> > > > >  		/*
> > > > >  		 * The fault is done by now and there's no way back (other
> > > > >  		 * thread may be already happily using PMD we have installed).
> > > > > diff --git a/fs/iomap/apply.c b/fs/iomap/apply.c index
> > > > > 0493da5286ad..26a54ded184f 100644
> > > > > --- a/fs/iomap/apply.c
> > > > > +++ b/fs/iomap/apply.c
> > > > > @@ -84,6 +84,10 @@ iomap_apply(struct inode *inode, loff_t pos,
> > > > > loff_t length, unsigned flags,
> > > > >  	written = actor(inode, pos, length, data, &iomap,
> > > > >  			srcmap.type != IOMAP_HOLE ? &srcmap : &iomap);
> > > > >
> > > > > +	if (ops->iomap_post_actor) {
> > > > > +		written = ops->iomap_post_actor(inode, pos, length, written,
> > > > > +						flags, &iomap, &srcmap);
> > >
> > > How many operations actually need an iomap_post_actor?  It's just
> > > the dax ones, right?  Which is ... iomap_truncate_page,
> > > iomap_zero_range, dax_iomap_fault, and dax_iomap_rw, right?  We
> > > don't need a post_actor for other iomap functionality (like FIEMAP,
> > > SEEK_DATA/SEEK_HOLE, etc.) so adding a new function pointer for all
> > > > operations feels a bit overbroad.
> >
> > Yes.
> >
> > >
> > > I had imagined that you'd create a struct dax_iomap_ops to wrap all
> > > the extra functionality that you need for dax operations:
> > >
> > > struct dax_iomap_ops {
> > > 	struct iomap_ops	iomap_ops;
> > >
> > > 	int			(*end_io)(inode, pos, length...);
> > > };
> > >
> > > And alter the four functions that you need to take the special
> > > > dax_iomap_ops.
> > > I guess the downside is that this makes iomap_truncate_page and
> > > iomap_zero_range more complicated, but maybe it's just time to split
> > > those into DAX-specific versions.  Then we'd be rid of the
> > > > cross-links between fs/iomap/buffered-io.c and fs/dax.c.
> >
> > This seems to be a better solution.  I'll try it this way.  Thanks for your
> > guidance.
> 
> I started writing on Friday a patchset to apply this style cleanup both to the
> directio and dax paths.  The cleanups were pretty straightforward until I
> started reading the dax code paths again and realized that file writes still have
> the weird behavior of mapping extents into a file, zeroing them, then issuing the
> actual write to the extent.  IOWs, a double-write to avoid exposing stale
> contents after a crash.

The current code does not seem to zero an unwritten extent when writing in
fsdax mode?  It just allocates unwritten extents in the filesystem, and then
writes the data through fsdax.

> 
> Apparently the reason for this was that dax (at least 6 years ago) had no
> concept paralleling the page lock, so it was necessary to do that to avoid page
> fault handlers racing to map pfns into the file mapping?
> That would seem to prevent us from doing the more standard behavior of
> allocate unwritten, write data, convert mapping... but is that still the case?  Or
> can we get rid of this bad quirk?

I am not sure about this...


--
Thanks,
Ruan.

> 
> --D
> 
> >
> > >
> > > > > +	}
> > > > >  out:
> > > > >  	/*
> > > > >  	 * Now the data has been copied, commit the range we've copied.
> > > > > This diff --git a/fs/xfs/xfs_bmap_util.c
> > > > > b/fs/xfs/xfs_bmap_util.c index
> > > > > a5e9d7d34023..2a36dc93ff27 100644
> > > > > --- a/fs/xfs/xfs_bmap_util.c
> > > > > +++ b/fs/xfs/xfs_bmap_util.c
> > > > > @@ -965,8 +965,7 @@ xfs_free_file_space(
> > > > >  		return 0;
> > > > >  	if (offset + len > XFS_ISIZE(ip))
> > > > >  		len = XFS_ISIZE(ip) - offset;
> > > > > -	error = iomap_zero_range(VFS_I(ip), offset, len, NULL,
> > > > > -			&xfs_buffered_write_iomap_ops);
> > > > > +	error = xfs_iomap_zero_range(ip, offset, len, NULL);
> > > > >  	if (error)
> > > > >  		return error;
> > > > >
> > > > > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > > > > > index 396ef36dcd0a..89406ec6741b 100644
> > > > > --- a/fs/xfs/xfs_file.c
> > > > > +++ b/fs/xfs/xfs_file.c
> > > > > @@ -684,11 +684,12 @@ xfs_file_dax_write(
> > > > >  	pos = iocb->ki_pos;
> > > > >
> > > > >  	trace_xfs_file_dax_write(iocb, from);
> > > > > -	ret = dax_iomap_rw(iocb, from, &xfs_direct_write_iomap_ops);
> > > > > +	ret = dax_iomap_rw(iocb, from, &xfs_dax_write_iomap_ops);
> > > > >  	if (ret > 0 && iocb->ki_pos > i_size_read(inode)) {
> > > > >  		i_size_write(inode, iocb->ki_pos);
> > > > >  		error = xfs_setfilesize(ip, pos, ret);
> > > > >  	}
> > > > > +
> > > > >  out:
> > > > >  	if (iolock)
> > > > >  		xfs_iunlock(ip, iolock);
> > > > > @@ -1309,7 +1310,7 @@ __xfs_filemap_fault(
> > > > >
> > > > >  		ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL,
> > > > >  				(write_fault && !vmf->cow_page) ?
> > > > > -				 &xfs_direct_write_iomap_ops :
> > > > > +				 &xfs_dax_write_iomap_ops :
> > > > >  				 &xfs_read_iomap_ops);
> > > > >  		if (ret & VM_FAULT_NEEDDSYNC)
> > > > >  			ret = dax_finish_sync_fault(vmf, pe_size, pfn); diff --git
> > > > > a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c index
> > > > > d154f42e2dc6..2f322e2f8544
> > > > > 100644
> > > > > --- a/fs/xfs/xfs_iomap.c
> > > > > +++ b/fs/xfs/xfs_iomap.c
> > > > > @@ -761,7 +761,8 @@ xfs_direct_write_iomap_begin(
> > > > >
> > > > >  		/* may drop and re-acquire the ilock */
> > > > >  		error = xfs_reflink_allocate_cow(ip, &imap, &cmap, &shared,
> > > > > -				&lockmode, flags & IOMAP_DIRECT);
> > > > > +				&lockmode,
> > > > > +				(flags & IOMAP_DIRECT) || IS_DAX(inode));
> > > > >  		if (error)
> > > > >  			goto out_unlock;
> > > > >  		if (shared)
> > > > > @@ -854,6 +855,36 @@ const struct iomap_ops
> > > > > xfs_direct_write_iomap_ops = {
> > > > >  	.iomap_begin		= xfs_direct_write_iomap_begin,
> > > > >  };
> > > > >
> > > > > +static int
> > > > > +xfs_dax_write_iomap_post_actor(
> > > > > +	struct inode		*inode,
> > > > > +	loff_t			pos,
> > > > > +	loff_t			length,
> > > > > +	ssize_t			written,
> > > > > +	unsigned int		flags,
> > > > > +	struct iomap		*iomap,
> > > > > +	struct iomap		*srcmap)
> > > > > +{
> > > > > +	int			error = 0;
> > > > > +	struct xfs_inode	*ip = XFS_I(inode);
> > > > > +	bool			cow = xfs_is_cow_inode(ip);
> > > > > +
> > > > > +	if (written <= 0) {
> > > > > +		if (cow)
> > > > > +			xfs_reflink_cancel_cow_range(ip, pos, length, true);
> > > > > +		return written;
> > > > > +	}
> > > > > +
> > > > > +	if (cow)
> > > > > +		error = xfs_reflink_end_cow(ip, pos, written);
> > > > > +	return error ?: written;
> > > > > +}
> > >
> > > This is pretty much the same as what xfs_dio_write_end_io does, right?
> >
> > It just handles the end part of CoW here.
> > > xfs_dio_write_end_io() also updates the file size, which is only needed in
> > > write() but not in page fault.  The file size update is already done in
> > > xfs_file_dax_write(), so there is no need to modify it.
> >
> > >
> > > I had imagined that you'd change the function signatures to drop the
> > > iocb so that you could reuse this code instead of creating a whole new
> > > > callback.
> > >
> > > Ah well.  Can I send you some prep patches to clean up some of the
> > > weird iomap code as a preparation series for this?
> >
> > Sure.  Thanks.
> >
> >
> > --
> > Ruan.
> >
> > >
> > > --D
> > >
> > > > > +
> > > > > +const struct iomap_ops xfs_dax_write_iomap_ops = {
> > > > > +	.iomap_begin		= xfs_direct_write_iomap_begin,
> > > > > +	.iomap_post_actor	= xfs_dax_write_iomap_post_actor,
> > > > > +};
> > > > > +
> > > > >  static int
> > > > >  xfs_buffered_write_iomap_begin(
> > > > >  	struct inode		*inode,
> > > > > > diff --git a/fs/xfs/xfs_iomap.h b/fs/xfs/xfs_iomap.h
> > > > > > index 7d3703556d0e..fbacf638ab21 100644
> > > > > --- a/fs/xfs/xfs_iomap.h
> > > > > +++ b/fs/xfs/xfs_iomap.h
> > > > > @@ -42,8 +42,32 @@ xfs_aligned_fsb_count(
> > > > >
> > > > >  extern const struct iomap_ops xfs_buffered_write_iomap_ops;
> > > > > >  extern const struct iomap_ops xfs_direct_write_iomap_ops;
> > > > > > +extern const struct iomap_ops xfs_dax_write_iomap_ops;
> > > > > >  extern const struct iomap_ops xfs_read_iomap_ops;
> > > > > >  extern const struct iomap_ops xfs_seek_iomap_ops;
> > > > > >  extern const struct iomap_ops xfs_xattr_iomap_ops;
> > > > >
> > > > > +static inline int
> > > > > +xfs_iomap_zero_range(
> > > > > +	struct xfs_inode	*ip,
> > > > > +	loff_t			offset,
> > > > > +	loff_t			len,
> > > > > +	bool			*did_zero)
> > > > > +{
> > > > > +	return iomap_zero_range(VFS_I(ip), offset, len, did_zero,
> > > > > +			IS_DAX(VFS_I(ip)) ? &xfs_dax_write_iomap_ops
> > > > > +					  : &xfs_buffered_write_iomap_ops); }
> > > > > +
> > > > > +static inline int
> > > > > +xfs_iomap_truncate_page(
> > > > > +	struct xfs_inode	*ip,
> > > > > +	loff_t			pos,
> > > > > +	bool			*did_zero)
> > > > > +{
> > > > > +	return iomap_truncate_page(VFS_I(ip), pos, did_zero,
> > > > > +			IS_DAX(VFS_I(ip)) ? &xfs_dax_write_iomap_ops
> > > > > +					  : &xfs_buffered_write_iomap_ops); }
> > > > > +
> > > > >  #endif /* __XFS_IOMAP_H__*/
> > > > > > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> > > > > > index dfe24b7f26e5..6d936c3e1a6e 100644
> > > > > --- a/fs/xfs/xfs_iops.c
> > > > > +++ b/fs/xfs/xfs_iops.c
> > > > > @@ -911,8 +911,8 @@ xfs_setattr_size(
> > > > >  	 */
> > > > >  	if (newsize > oldsize) {
> > > > >  		trace_xfs_zero_eof(ip, oldsize, newsize - oldsize);
> > > > > -		error = iomap_zero_range(inode, oldsize, newsize - oldsize,
> > > > > -				&did_zeroing, &xfs_buffered_write_iomap_ops);
> > > > > +		error = xfs_iomap_zero_range(ip, oldsize, newsize - oldsize,
> > > > > +				&did_zeroing);
> > > > >  	} else {
> > > > >  		/*
> > > > >  		 * iomap won't detect a dirty page over an unwritten block
> > > > > (or a @@
> > > > > -924,8 +924,7 @@ xfs_setattr_size(
> > > > >  						     newsize);
> > > > >  		if (error)
> > > > >  			return error;
> > > > > -		error = iomap_truncate_page(inode, newsize, &did_zeroing,
> > > > > -				&xfs_buffered_write_iomap_ops);
> > > > > +		error = xfs_iomap_truncate_page(ip, newsize, &did_zeroing);
> > > > >  	}
> > > > >
> > > > >  	if (error)
> > > > > > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > > > > > index d25434f93235..9a780948dbd0 100644
> > > > > --- a/fs/xfs/xfs_reflink.c
> > > > > +++ b/fs/xfs/xfs_reflink.c
> > > > > @@ -1266,8 +1266,7 @@ xfs_reflink_zero_posteof(
> > > > >  		return 0;
> > > > >
> > > > >  	trace_xfs_zero_eof(ip, isize, pos - isize);
> > > > > -	return iomap_zero_range(VFS_I(ip), isize, pos - isize, NULL,
> > > > > -			&xfs_buffered_write_iomap_ops);
> > > > > +	return xfs_iomap_zero_range(ip, isize, pos - isize, NULL);
> > > > >  }
> > > > >
> > > > >  /*
> > > > > > diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> > > > > > index 95562f863ad0..58f2e1c78018 100644
> > > > > --- a/include/linux/iomap.h
> > > > > +++ b/include/linux/iomap.h
> > > > > @@ -135,6 +135,14 @@ struct iomap_ops {
> > > > >  			unsigned flags, struct iomap *iomap,
> > > > >  			struct iomap *srcmap);
> > > > >
> > > > > +	/*
> > > > > +	 * Handle the error code from actor(). Do the finishing jobs for extra
> > > > > +	 * operations, such as CoW, according to whether written is negative.
> > > > > +	 */
> > > > > +	int (*iomap_post_actor)(struct inode *inode, loff_t pos, loff_t length,
> > > > > +			ssize_t written, unsigned flags, struct iomap *iomap,
> > > > > +			struct iomap *srcmap);
> > > > > +
> > > > >  	/*
> > > > >  	 * Commit and/or unreserve space previous allocated using
> > > iomap_begin.
> > > > >  	 * Written indicates the length of the successful write
> > > > > operation which
> > > > > --
> > > > > 2.31.1
> > > >


* Re: [PATCH v6.1 6/7] fs/xfs: Handle CoW for fsdax write() path
  2021-06-29 11:25                 ` ruansy.fnst
@ 2021-06-29 21:01                   ` Darrick J. Wong
  0 siblings, 0 replies; 23+ messages in thread
From: Darrick J. Wong @ 2021-06-29 21:01 UTC (permalink / raw)
  To: ruansy.fnst
  Cc: dan.j.williams, david, hch, linux-fsdevel, linux-kernel, nvdimm,
	linux-xfs, rgoldwyn, viro, willy

On Tue, Jun 29, 2021 at 11:25:37AM +0000, ruansy.fnst@fujitsu.com wrote:
> 
> 
> > -----Original Message-----
> > From: Darrick J. Wong <djwong@kernel.org>
> > Subject: Re: [PATCH v6.1 6/7] fs/xfs: Handle CoW for fsdax write() path
> > 
> > On Mon, Jun 28, 2021 at 02:55:03AM +0000, ruansy.fnst@fujitsu.com wrote:
> > > > -----Original Message-----
> > > > Subject: Re: [PATCH v6.1 6/7] fs/xfs: Handle CoW for fsdax write()
> > > > path
> > > >
> > > > On Thu, Jun 24, 2021 at 08:49:17AM +0000, ruansy.fnst@fujitsu.com wrote:
> > > > > Hi Darrick,
> > > > >
> > > > > Do you have any comment on this?
> > > >
> > > > Sorry, was on vacation.
> > > >
> > > > > Thanks,
> > > > > Ruan.
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> > > > > > Subject: [PATCH v6.1 6/7] fs/xfs: Handle CoW for fsdax write()
> > > > > > path
> > > > > >
> > > > > > Hi Darrick,
> > > > > >
> > > > > > > Since the other patches look good, I am posting this RFC patch on
> > > > > > > its own to hot-fix the problem in
> > > > > > > xfs_dax_write_iomap_ops->iomap_end() of v6, where the error code
> > > > > > > was ignored.  I will split this into two patches (changes in iomap
> > > > > > > and xfs respectively) in the next formal version if it looks ok.
> > > > > >
> > > > > > ====
> > > > > >
> > > > > > > Introduce a new interface called "iomap_post_actor()" in iomap_ops,
> > > > > > > and call it between ->actor() and ->iomap_end().  It is meant to
> > > > > > > handle the error code returned from ->actor().  In this
> > > > > > > patchset, it is used to remap or cancel the CoW extents according to
> > > > > > > the error code.
> > > > > >
> > > > > > Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> > > > > > ---
> > > > > >  fs/dax.c               | 27 ++++++++++++++++++---------
> > > > > >  fs/iomap/apply.c       |  4 ++++
> > > > > >  fs/xfs/xfs_bmap_util.c |  3 +--
> > > > > >  fs/xfs/xfs_file.c      |  5 +++--
> > > > > >  fs/xfs/xfs_iomap.c     | 33 ++++++++++++++++++++++++++++++++-
> > > > > >  fs/xfs/xfs_iomap.h     | 24 ++++++++++++++++++++++++
> > > > > >  fs/xfs/xfs_iops.c      |  7 +++----
> > > > > >  fs/xfs/xfs_reflink.c   |  3 +--
> > > > > >  include/linux/iomap.h  |  8 ++++++++
> > > > > >  9 files changed, 94 insertions(+), 20 deletions(-)
> > > > > >
> > > > > > > diff --git a/fs/dax.c b/fs/dax.c
> > > > > > > index 93f16210847b..0740c2610b6f 100644
> > > > > > --- a/fs/dax.c
> > > > > > +++ b/fs/dax.c
> > > > > > @@ -1537,7 +1537,7 @@ static vm_fault_t
> > > > > > dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
> > > > > >  	struct iomap iomap = { .type = IOMAP_HOLE };
> > > > > >  	struct iomap srcmap = { .type = IOMAP_HOLE };
> > > > > >  	unsigned flags = IOMAP_FAULT;
> > > > > > -	int error;
> > > > > > +	int error, copied = PAGE_SIZE;
> > > > > >  	bool write = vmf->flags & FAULT_FLAG_WRITE;
> > > > > >  	vm_fault_t ret = 0, major = 0;
> > > > > >  	void *entry;
> > > > > > @@ -1598,7 +1598,7 @@ static vm_fault_t
> > > > > > dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
> > > > > >  	ret = dax_fault_actor(vmf, pfnp, &xas, &entry, false, flags,
> > > > > >  			      &iomap, &srcmap);
> > > > > >  	if (ret == VM_FAULT_SIGBUS)
> > > > > > -		goto finish_iomap;
> > > > > > +		goto finish_iomap_actor;
> > > > > >
> > > > > >  	/* read/write MAPPED, CoW UNWRITTEN */
> > > > > >  	if (iomap.flags & IOMAP_F_NEW) { @@ -1607,10 +1607,16 @@
> > > > > > static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf,
> > > > > > pfn_t *pfnp,
> > > > > >  		major = VM_FAULT_MAJOR;
> > > > > >  	}
> > > > > >
> > > > > > + finish_iomap_actor:
> > > > > > +	if (ops->iomap_post_actor) {
> > > > > > +		if (ret & VM_FAULT_ERROR)
> > > > > > +			copied = 0;
> > > > > > +		ops->iomap_post_actor(inode, pos, PMD_SIZE, copied, flags,
> > > > > > +				      &iomap, &srcmap);
> > > > > > +	}
> > > > > > +
> > > > > >  finish_iomap:
> > > > > >  	if (ops->iomap_end) {
> > > > > > -		int copied = PAGE_SIZE;
> > > > > > -
> > > > > >  		if (ret & VM_FAULT_ERROR)
> > > > > >  			copied = 0;
> > > > > >  		/*
> > > > > > @@ -1677,7 +1683,7 @@ static vm_fault_t
> > > > > > dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
> > > > > >  	pgoff_t max_pgoff;
> > > > > >  	void *entry;
> > > > > >  	loff_t pos;
> > > > > > -	int error;
> > > > > > +	int error, copied = PMD_SIZE;
> > > > > >
> > > > > >  	/*
> > > > > >  	 * Check whether offset isn't beyond end of file now. Caller
> > > > > > is @@
> > > > > > -1736,12
> > > > > > +1742,15 @@ static vm_fault_t dax_iomap_pmd_fault(struct
> > > > > > +vm_fault *vmf,
> > > > > > pfn_t *pfnp,
> > > > > >  	ret = dax_fault_actor(vmf, pfnp, &xas, &entry, true, flags,
> > > > > >  			      &iomap, &srcmap);
> > > > > >
> > > > > > +	if (ret == VM_FAULT_FALLBACK)
> > > > > > +		copied = 0;
> > > > > > +	if (ops->iomap_post_actor) {
> > > > > > +		ops->iomap_post_actor(inode, pos, PMD_SIZE, copied, flags,
> > > > > > +				      &iomap, &srcmap);
> > > > > > +	}
> > > > > > +
> > > > > >  finish_iomap:
> > > > > >  	if (ops->iomap_end) {
> > > > > > -		int copied = PMD_SIZE;
> > > > > > -
> > > > > > -		if (ret == VM_FAULT_FALLBACK)
> > > > > > -			copied = 0;
> > > > > >  		/*
> > > > > >  		 * The fault is done by now and there's no way back (other
> > > > > >  		 * thread may be already happily using PMD we have installed).
> > > > > > > diff --git a/fs/iomap/apply.c b/fs/iomap/apply.c
> > > > > > > index 0493da5286ad..26a54ded184f 100644
> > > > > > > --- a/fs/iomap/apply.c
> > > > > > > +++ b/fs/iomap/apply.c
> > > > > > > @@ -84,6 +84,10 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
> > > > > >  	written = actor(inode, pos, length, data, &iomap,
> > > > > >  			srcmap.type != IOMAP_HOLE ? &srcmap : &iomap);
> > > > > >
> > > > > > +	if (ops->iomap_post_actor) {
> > > > > > +		written = ops->iomap_post_actor(inode, pos, length, written,
> > > > > > +						flags, &iomap, &srcmap);
> > > >
> > > > How many operations actually need an iomap_post_actor?  It's just
> > > > the dax ones, right?  Which is ... iomap_truncate_page,
> > > > iomap_zero_range, dax_iomap_fault, and dax_iomap_rw, right?  We
> > > > don't need a post_actor for other iomap functionality (like FIEMAP,
> > > > SEEK_DATA/SEEK_HOLE, etc.) so adding a new function pointer for all
> > > > > operations feels a bit overbroad.
> > >
> > > Yes.
> > >
> > > >
> > > > I had imagined that you'd create a struct dax_iomap_ops to wrap all
> > > > the extra functionality that you need for dax operations:
> > > >
> > > > struct dax_iomap_ops {
> > > > 	struct iomap_ops	iomap_ops;
> > > >
> > > > 	int			(*end_io)(inode, pos, length...);
> > > > };
> > > >
> > > > And alter the four functions that you need to take the special
> > > > > dax_iomap_ops.
> > > > I guess the downside is that this makes iomap_truncate_page and
> > > > iomap_zero_range more complicated, but maybe it's just time to split
> > > > those into DAX-specific versions.  Then we'd be rid of the
> > > > > cross-links between fs/iomap/buffered-io.c and fs/dax.c.
> > >
> > > > This seems to be a better solution.  I'll try it this way.  Thanks for your
> > > > guidance.
> > 
> > I started writing on Friday a patchset to apply this style cleanup both to the
> > directio and dax paths.  The cleanups were pretty straightforward until I
> > started reading the dax code paths again and realized that file writes still have
> > the weird behavior of mapping extents into a file, zeroing them, then issuing the
> > actual write to the extent.  IOWs, a double-write to avoid exposing stale
> > > contents after a crash.
> 
> > The current code does not seem to zero an unwritten extent when writing in
> > fsdax mode?  It just allocates unwritten extents in the filesystem, and then
> > writes the data through fsdax.

That's not what it does.  See xfs_iomap_write_direct:

	/*
	 * For DAX, we do not allocate unwritten extents, but instead we
	 * zero the block before we commit the transaction.  Ideally
	 * we'd like to do this outside the transaction context, but if
	 * we commit and then crash we may not have zeroed the blocks
	 * and this will be exposed on recovery of the allocation. Hence
	 * we must zero before commit.
	 *
	 * Further, if we are mapping unwritten extents here, we need to
	 * zero and convert them to written so that we don't need an
	 * unwritten extent callback for DAX. This also means that we
	 * need to be able to dip into the reserve block pool for bmbt
	 * block allocation if there is no space left but we need to do
	 * unwritten extent conversion.
	 */
	if (IS_DAX(VFS_I(ip))) {
		bmapi_flags = XFS_BMAPI_CONVERT | XFS_BMAPI_ZERO;
		if (imap->br_state == XFS_EXT_UNWRITTEN) {
			force = true;
			nr_exts = XFS_IEXT_WRITE_UNWRITTEN_CNT;
			dblocks = XFS_DIOSTRAT_SPACE_RES(mp, 0) << 1;
		}
	}

Originally added in:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1ca191576fc862b4766f58e41aa362b28a7c1866

Of course, that was six years ago when the mm folks were still arguing
about whether they'd have struct page or pfns or some combination of the
two for DAX.  I'm not sure if those limitations still exist, or if they
quietly disappeared and xfs/ext4/ext2 haven't noticed.  I think I see
that we store pfns in i_pages along with a lock bit, but I don't know if
that lock bit is sufficient to prevent races between page faults.
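
For readers following the thread, the idea of storing a pfn together with a
lock bit in one entry can be illustrated with a minimal userspace sketch.
Everything in it is an assumption made for illustration: ENTRY_SHIFT,
ENTRY_LOCKED and the entry_*() helpers are hypothetical names, the bit layout
is not copied from fs/dax.c, and the code is single-threaded, so it says
nothing about whether the real lock bit is sufficient to serialize page
faults.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical layout: the low 4 bits hold flags, the rest holds the pfn. */
#define ENTRY_SHIFT	4
#define ENTRY_LOCKED	(1UL << 0)

/* Pack a pfn and its flag bits into a single unsigned long. */
static unsigned long entry_make(unsigned long pfn, unsigned long flags)
{
	/* The pfn must survive the shift and the flags must fit below it. */
	assert(pfn <= (ULONG_MAX >> ENTRY_SHIFT));
	assert(flags < (1UL << ENTRY_SHIFT));
	return (pfn << ENTRY_SHIFT) | flags;
}

/* Recover the pfn, discarding the flag bits. */
static unsigned long entry_to_pfn(unsigned long entry)
{
	return entry >> ENTRY_SHIFT;
}

static bool entry_is_locked(unsigned long entry)
{
	return (entry & ENTRY_LOCKED) != 0;
}

/*
 * Try to take the lock bit.  Returns false if the bit is already set.
 * Not atomic: a real implementation would need proper synchronization.
 */
static bool entry_trylock(unsigned long *slot)
{
	if (entry_is_locked(*slot))
		return false;
	*slot |= ENTRY_LOCKED;
	return true;
}

/* Clear the lock bit, leaving the pfn and the other flags untouched. */
static void entry_unlock(unsigned long *slot)
{
	*slot &= ~ENTRY_LOCKED;
}
```

In this sketch a caller builds a slot with entry_make(pfn, 0), takes it with
entry_trylock(), and a second entry_trylock() on the same slot fails until
entry_unlock() runs.  The pfn stays readable through entry_to_pfn() the whole
time.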

Hence my question below:
> 
> > 
> > Apparently the reason for this was that dax (at least 6 years ago) had no
> > concept paralleling the page lock, so it was necessary to do that to avoid page
> > fault handlers racing to map pfns into the file mapping?
> > That would seem to prevent us from doing the more standard behavior of
> > allocate unwritten, write data, convert mapping... but is that still the case?  Or
> > can we get rid of this bad quirk?
> 
> I am not sure about this...

Me neither.

--D

> 
> 
> --
> Thanks,
> Ruan.
> 
> > 
> > --D
> > 
> > >
> > > >
> > > > > > +	}
> > > > > >  out:
> > > > > >  	/*
> > > > > >  	 * Now the data has been copied, commit the range we've copied.
> > > > > > This diff --git a/fs/xfs/xfs_bmap_util.c
> > > > > > b/fs/xfs/xfs_bmap_util.c index
> > > > > > a5e9d7d34023..2a36dc93ff27 100644
> > > > > > --- a/fs/xfs/xfs_bmap_util.c
> > > > > > +++ b/fs/xfs/xfs_bmap_util.c
> > > > > > @@ -965,8 +965,7 @@ xfs_free_file_space(
> > > > > >  		return 0;
> > > > > >  	if (offset + len > XFS_ISIZE(ip))
> > > > > >  		len = XFS_ISIZE(ip) - offset;
> > > > > > -	error = iomap_zero_range(VFS_I(ip), offset, len, NULL,
> > > > > > -			&xfs_buffered_write_iomap_ops);
> > > > > > +	error = xfs_iomap_zero_range(ip, offset, len, NULL);
> > > > > >  	if (error)
> > > > > >  		return error;
> > > > > >
> > > > > > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > > > > > > index 396ef36dcd0a..89406ec6741b 100644
> > > > > > --- a/fs/xfs/xfs_file.c
> > > > > > +++ b/fs/xfs/xfs_file.c
> > > > > > @@ -684,11 +684,12 @@ xfs_file_dax_write(
> > > > > >  	pos = iocb->ki_pos;
> > > > > >
> > > > > >  	trace_xfs_file_dax_write(iocb, from);
> > > > > > -	ret = dax_iomap_rw(iocb, from, &xfs_direct_write_iomap_ops);
> > > > > > +	ret = dax_iomap_rw(iocb, from, &xfs_dax_write_iomap_ops);
> > > > > >  	if (ret > 0 && iocb->ki_pos > i_size_read(inode)) {
> > > > > >  		i_size_write(inode, iocb->ki_pos);
> > > > > >  		error = xfs_setfilesize(ip, pos, ret);
> > > > > >  	}
> > > > > > +
> > > > > >  out:
> > > > > >  	if (iolock)
> > > > > >  		xfs_iunlock(ip, iolock);
> > > > > > @@ -1309,7 +1310,7 @@ __xfs_filemap_fault(
> > > > > >
> > > > > >  		ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL,
> > > > > >  				(write_fault && !vmf->cow_page) ?
> > > > > > -				 &xfs_direct_write_iomap_ops :
> > > > > > +				 &xfs_dax_write_iomap_ops :
> > > > > >  				 &xfs_read_iomap_ops);
> > > > > >  		if (ret & VM_FAULT_NEEDDSYNC)
> > > > > >  			ret = dax_finish_sync_fault(vmf, pe_size, pfn); diff --git
> > > > > > a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c index
> > > > > > d154f42e2dc6..2f322e2f8544
> > > > > > 100644
> > > > > > --- a/fs/xfs/xfs_iomap.c
> > > > > > +++ b/fs/xfs/xfs_iomap.c
> > > > > > @@ -761,7 +761,8 @@ xfs_direct_write_iomap_begin(
> > > > > >
> > > > > >  		/* may drop and re-acquire the ilock */
> > > > > >  		error = xfs_reflink_allocate_cow(ip, &imap, &cmap, &shared,
> > > > > > -				&lockmode, flags & IOMAP_DIRECT);
> > > > > > +				&lockmode,
> > > > > > +				(flags & IOMAP_DIRECT) || IS_DAX(inode));
> > > > > >  		if (error)
> > > > > >  			goto out_unlock;
> > > > > >  		if (shared)
> > > > > > @@ -854,6 +855,36 @@ const struct iomap_ops
> > > > > > xfs_direct_write_iomap_ops = {
> > > > > >  	.iomap_begin		= xfs_direct_write_iomap_begin,
> > > > > >  };
> > > > > >
> > > > > > +static int
> > > > > > +xfs_dax_write_iomap_post_actor(
> > > > > > +	struct inode		*inode,
> > > > > > +	loff_t			pos,
> > > > > > +	loff_t			length,
> > > > > > +	ssize_t			written,
> > > > > > +	unsigned int		flags,
> > > > > > +	struct iomap		*iomap,
> > > > > > +	struct iomap		*srcmap)
> > > > > > +{
> > > > > > +	int			error = 0;
> > > > > > +	struct xfs_inode	*ip = XFS_I(inode);
> > > > > > +	bool			cow = xfs_is_cow_inode(ip);
> > > > > > +
> > > > > > +	if (written <= 0) {
> > > > > > +		if (cow)
> > > > > > +			xfs_reflink_cancel_cow_range(ip, pos, length, true);
> > > > > > +		return written;
> > > > > > +	}
> > > > > > +
> > > > > > +	if (cow)
> > > > > > +		error = xfs_reflink_end_cow(ip, pos, written);
> > > > > > +	return error ?: written;
> > > > > > +}
> > > >
> > > > This is pretty much the same as what xfs_dio_write_end_io does, right?
> > >
> > > It just handles the end part of CoW here.
> > > > xfs_dio_write_end_io() also updates the file size, which is only needed in
> > > > write() but not in page fault.  The file size update is already done in
> > > > xfs_file_dax_write(), so there is no need to modify it.
> > >
> > > >
> > > > I had imagined that you'd change the function signatures to drop the
> > > > iocb so that you could reuse this code instead of creating a whole new
> > > > > callback.
> > > >
> > > > Ah well.  Can I send you some prep patches to clean up some of the
> > > > weird iomap code as a preparation series for this?
> > >
> > > Sure.  Thanks.
> > >
> > >
> > > --
> > > Ruan.
> > >
> > > >
> > > > --D
> > > >
> > > > > > +
> > > > > > +const struct iomap_ops xfs_dax_write_iomap_ops = {
> > > > > > +	.iomap_begin		= xfs_direct_write_iomap_begin,
> > > > > > +	.iomap_post_actor	= xfs_dax_write_iomap_post_actor,
> > > > > > +};
> > > > > > +
> > > > > >  static int
> > > > > >  xfs_buffered_write_iomap_begin(
> > > > > >  	struct inode		*inode,
> > > > > > > diff --git a/fs/xfs/xfs_iomap.h b/fs/xfs/xfs_iomap.h
> > > > > > > index 7d3703556d0e..fbacf638ab21 100644
> > > > > > --- a/fs/xfs/xfs_iomap.h
> > > > > > +++ b/fs/xfs/xfs_iomap.h
> > > > > > @@ -42,8 +42,32 @@ xfs_aligned_fsb_count(
> > > > > >
> > > > > >  extern const struct iomap_ops xfs_buffered_write_iomap_ops;
> > > > > > >  extern const struct iomap_ops xfs_direct_write_iomap_ops;
> > > > > > > +extern const struct iomap_ops xfs_dax_write_iomap_ops;
> > > > > > >  extern const struct iomap_ops xfs_read_iomap_ops;
> > > > > > >  extern const struct iomap_ops xfs_seek_iomap_ops;
> > > > > > >  extern const struct iomap_ops xfs_xattr_iomap_ops;
> > > > > >
> > > > > > +static inline int
> > > > > > +xfs_iomap_zero_range(
> > > > > > +	struct xfs_inode	*ip,
> > > > > > +	loff_t			offset,
> > > > > > +	loff_t			len,
> > > > > > +	bool			*did_zero)
> > > > > > +{
> > > > > > +	return iomap_zero_range(VFS_I(ip), offset, len, did_zero,
> > > > > > +			IS_DAX(VFS_I(ip)) ? &xfs_dax_write_iomap_ops
> > > > > > +					  : &xfs_buffered_write_iomap_ops); }
> > > > > > +
> > > > > > +static inline int
> > > > > > +xfs_iomap_truncate_page(
> > > > > > +	struct xfs_inode	*ip,
> > > > > > +	loff_t			pos,
> > > > > > +	bool			*did_zero)
> > > > > > +{
> > > > > > +	return iomap_truncate_page(VFS_I(ip), pos, did_zero,
> > > > > > +			IS_DAX(VFS_I(ip)) ? &xfs_dax_write_iomap_ops
> > > > > > +					  : &xfs_buffered_write_iomap_ops); }
> > > > > > +
> > > > > >  #endif /* __XFS_IOMAP_H__*/
> > > > > > > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> > > > > > > index dfe24b7f26e5..6d936c3e1a6e 100644
> > > > > > --- a/fs/xfs/xfs_iops.c
> > > > > > +++ b/fs/xfs/xfs_iops.c
> > > > > > @@ -911,8 +911,8 @@ xfs_setattr_size(
> > > > > >  	 */
> > > > > >  	if (newsize > oldsize) {
> > > > > >  		trace_xfs_zero_eof(ip, oldsize, newsize - oldsize);
> > > > > > -		error = iomap_zero_range(inode, oldsize, newsize - oldsize,
> > > > > > -				&did_zeroing, &xfs_buffered_write_iomap_ops);
> > > > > > +		error = xfs_iomap_zero_range(ip, oldsize, newsize - oldsize,
> > > > > > +				&did_zeroing);
> > > > > >  	} else {
> > > > > >  		/*
> > > > > >  		 * iomap won't detect a dirty page over an unwritten block
> > > > > > (or a @@
> > > > > > -924,8 +924,7 @@ xfs_setattr_size(
> > > > > >  						     newsize);
> > > > > >  		if (error)
> > > > > >  			return error;
> > > > > > -		error = iomap_truncate_page(inode, newsize, &did_zeroing,
> > > > > > -				&xfs_buffered_write_iomap_ops);
> > > > > > +		error = xfs_iomap_truncate_page(ip, newsize, &did_zeroing);
> > > > > >  	}
> > > > > >
> > > > > >  	if (error)
> > > > > > > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > > > > > > index d25434f93235..9a780948dbd0 100644
> > > > > > --- a/fs/xfs/xfs_reflink.c
> > > > > > +++ b/fs/xfs/xfs_reflink.c
> > > > > > @@ -1266,8 +1266,7 @@ xfs_reflink_zero_posteof(
> > > > > >  		return 0;
> > > > > >
> > > > > >  	trace_xfs_zero_eof(ip, isize, pos - isize);
> > > > > > -	return iomap_zero_range(VFS_I(ip), isize, pos - isize, NULL,
> > > > > > -			&xfs_buffered_write_iomap_ops);
> > > > > > +	return xfs_iomap_zero_range(ip, isize, pos - isize, NULL);
> > > > > >  }
> > > > > >
> > > > > >  /*
> > > > > > > diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> > > > > > > index 95562f863ad0..58f2e1c78018 100644
> > > > > > --- a/include/linux/iomap.h
> > > > > > +++ b/include/linux/iomap.h
> > > > > > @@ -135,6 +135,14 @@ struct iomap_ops {
> > > > > >  			unsigned flags, struct iomap *iomap,
> > > > > >  			struct iomap *srcmap);
> > > > > >
> > > > > > +	/*
> > > > > > +	 * Handle the error code from actor(). Do the finishing jobs for extra
> > > > > > +	 * operations, such as CoW, according to whether written is negative.
> > > > > > +	 */
> > > > > > +	int (*iomap_post_actor)(struct inode *inode, loff_t pos, loff_t length,
> > > > > > +			ssize_t written, unsigned flags, struct iomap *iomap,
> > > > > > +			struct iomap *srcmap);
> > > > > > +
> > > > > >  	/*
> > > > > >  	 * Commit and/or unreserve space previous allocated using
> > > > iomap_begin.
> > > > > >  	 * Written indicates the length of the successful write
> > > > > > operation which
> > > > > > --
> > > > > > 2.31.1
> > > > >


* Re: [PATCH v6.1 6/7] fs/xfs: Handle CoW for fsdax write() path
  2021-06-28  5:09               ` Darrick J. Wong
  2021-06-29 11:25                 ` ruansy.fnst
@ 2021-07-08 23:16                 ` Dave Chinner
  1 sibling, 0 replies; 23+ messages in thread
From: Dave Chinner @ 2021-07-08 23:16 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: ruansy.fnst, dan.j.williams, hch, linux-fsdevel, linux-kernel,
	nvdimm, linux-xfs, rgoldwyn, viro, willy

On Sun, Jun 27, 2021 at 10:09:19PM -0700, Darrick J. Wong wrote:
> > > I had imagined that you'd create a struct dax_iomap_ops to wrap all the extra
> > > functionality that you need for dax operations:
> > > 
> > > struct dax_iomap_ops {
> > > 	struct iomap_ops	iomap_ops;
> > > 
> > > 	int			(*end_io)(inode, pos, length...);
> > > };
> > > 
> > > And alter the four functions that you need to take the special dax_iomap_ops.
> > > I guess the downside is that this makes iomap_truncate_page and
> > > iomap_zero_range more complicated, but maybe it's just time to split those into
> > > > DAX-specific versions.  Then we'd be rid of the cross-links between
> > > fs/iomap/buffered-io.c and fs/dax.c.
> > 
> > This seems to be a better solution.  I'll try in this way.  Thanks for your guidance.
> 
> I started writing on Friday a patchset to apply this style cleanup both
> to the directio and dax paths.  The cleanups were pretty straightforward
> until I started reading the dax code paths again and realized that file
> writes still have the weird behavior of mapping extents into a file,
> zeroing them, then issuing the actual write to the extent.  IOWs, a
> double-write to avoid exposing stale contents if we crash.
> 
> Apparently the reason for this was that dax (at least 6 years ago) had
> no concept paralleling the page lock, so it was necessary to do that to
> avoid page fault handlers racing to map pfns into the file mapping?
> That would seem to prevent us from doing the more standard behavior of
> allocate unwritten, write data, convert mapping... but is that still the
> case?  Or can we get rid of this bad quirk?

Yeah, so that was the deciding factor in getting rid of unwritten
extent allocation in DAX similar to the DIO path. However, we were
already considering getting rid of it for another reason: write
performance.

That is, doing two extent tree manipulation transactions per write
is way more expensive than the double memory write for small IOs.
IIRC, for small writes (4kB) the double memory write version we now
have was 2-3x faster than the {unwritten allocation, write, convert}
algorithm we had originally.

I don't think we want to go back to the unwritten allocation
behaviour - it sucked when it was first done because all DAX write
IO is synchronous, and it will still suck now because DAX writes are
still synchronous. What we really want to do here is copy the data
into the new extent before we commit the allocation transaction....
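
The trade-off between the two orderings can be put into a toy cost
model.  This is only a sketch under stated assumptions: struct
demo_costs, its two fields and all demo_* helpers are hypothetical, the
units are arbitrary, and nothing here was measured.  It only counts the
steps named above: two transactions plus one data pass for the
unwritten path, one transaction plus two passes for zero-then-write.

```c
#include <assert.h>
#include <stdint.h>

/* Toy cost model; both inputs are hypothetical, in arbitrary units. */
struct demo_costs {
	uint64_t txn;		/* one extent tree manipulation transaction */
	uint64_t per_byte;	/* writing one byte to persistent memory */
};

/* allocate unwritten, write data, convert: 2 transactions, 1 data pass */
static uint64_t demo_cost_unwritten(const struct demo_costs *c, uint64_t len)
{
	return 2 * c->txn + len * c->per_byte;
}

/* allocate, zero, write data: 1 transaction, 2 passes over the range */
static uint64_t demo_cost_double_write(const struct demo_costs *c, uint64_t len)
{
	return c->txn + 2 * len * c->per_byte;
}

/*
 * Length at which both orderings cost the same in this model:
 * 2*txn + len*b == txn + 2*len*b  =>  len == txn / b.
 * per_byte must be non-zero.
 */
static uint64_t demo_break_even_len(const struct demo_costs *c)
{
	return c->txn / c->per_byte;
}
```

In this model the double write wins for any length below the break-even
point, which matches the direction of the small-IO observation above; the
real ratio depends on hardware and on the filesystem.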

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* [PATCH v6.2 6/7] dax: Introduce dax_iomap_ops for end of reflink
  2021-06-28  2:55             ` ruansy.fnst
  2021-06-28  5:09               ` Darrick J. Wong
@ 2021-07-09 12:36               ` Shiyang Ruan
  1 sibling, 0 replies; 23+ messages in thread
From: Shiyang Ruan @ 2021-07-09 12:36 UTC (permalink / raw)
  To: darrick.wong
  Cc: ruansy.fnst, dan.j.williams, david, djwong, hch, linux-fsdevel,
	linux-kernel, nvdimm, linux-xfs, rgoldwyn, viro, willy

After writing data, reflink needs an end operation to remap the newly
allocated extents.  The current ->iomap_end() ignores the error code
returned from ->actor(), so introduce struct dax_iomap_ops and change
the dax_iomap_* interfaces to take it.

- dax_iomap_ops contains the original struct iomap_ops plus an
    fsdax-specific ->actor_end(), which performs the end operation of
    reflink
- also introduce dax-specific zero_range and truncate_page
- create new dax_iomap_ops for ext2 and ext4

Then enable fsdax and reflink together in xfs.

Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
---
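
The wrapping scheme described above can be modelled in userspace.  This
is a minimal sketch, not kernel code: every demo_* name is hypothetical
and only mirrors the shape of dax_iomap_ops, dax_iomap_data and the
wrapped actor in this patch.  The base ops are embedded in the DAX ops,
the original opaque pointer is carried next to the ops, and the optional
end hook sees the actor's result, including a negative error.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; none of these are kernel types. */
typedef long long demo_off_t;

struct demo_base_ops {
	int (*begin)(demo_off_t pos, demo_off_t len);
};

struct demo_dax_ops {
	struct demo_base_ops base;	/* embedded base ops */
	/* optional hook: receives the actor's result and may replace it */
	demo_off_t (*actor_end)(demo_off_t pos, demo_off_t len, demo_off_t ret);
};

struct demo_actor_data {
	void *data;			/* the original opaque pointer */
	const struct demo_dax_ops *ops;
};

/* Inner actor: processes up to *limit bytes, or fails if *limit < 0. */
static demo_off_t demo_inner_actor(demo_off_t pos, demo_off_t len, void *data)
{
	demo_off_t *limit = data;

	(void)pos;
	if (*limit < 0)
		return *limit;
	return len < *limit ? len : *limit;
}

/* Wrapper: unwrap the data, run the inner actor, then call the hook. */
static demo_off_t demo_actor(demo_off_t pos, demo_off_t len, void *data)
{
	struct demo_actor_data *d = data;
	demo_off_t ret = demo_inner_actor(pos, len, d->data);

	if (d->ops->actor_end)
		ret = d->ops->actor_end(pos, len, ret);
	return ret;
}

static int demo_cancelled;		/* how often the hook saw a failure */
static demo_off_t demo_committed;	/* bytes the hook saw as written */

/* End hook: cancel on error or zero progress, commit otherwise. */
static demo_off_t demo_actor_end(demo_off_t pos, demo_off_t len, demo_off_t ret)
{
	(void)pos;
	(void)len;
	if (ret <= 0)
		demo_cancelled++;
	else
		demo_committed += ret;
	return ret;
}
```

A caller would fill a struct demo_actor_data with its own pointer and a
struct demo_dax_ops, then hand it to demo_actor(); leaving actor_end
NULL gives the plain behaviour, as ext2 and ext4 do in this patch.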
 fs/dax.c               | 105 +++++++++++++++++++++++++++++++++++------
 fs/ext2/ext2.h         |   3 ++
 fs/ext2/file.c         |   6 +--
 fs/ext2/inode.c        |  11 ++++-
 fs/ext4/ext4.h         |   3 ++
 fs/ext4/file.c         |   6 +--
 fs/ext4/inode.c        |  13 ++++-
 fs/iomap/buffered-io.c |   6 +--
 fs/xfs/xfs_bmap_util.c |   3 +-
 fs/xfs/xfs_file.c      |   8 ++--
 fs/xfs/xfs_iomap.c     |  36 +++++++++++++-
 fs/xfs/xfs_iomap.h     |  33 +++++++++++++
 fs/xfs/xfs_iops.c      |   7 ++-
 fs/xfs/xfs_reflink.c   |   3 +-
 include/linux/dax.h    |  28 +++++++++--
 include/linux/iomap.h  |   2 +
 16 files changed, 228 insertions(+), 45 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 93f16210847b..9285ea796668 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1240,7 +1240,7 @@ s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap,
 }
 
 static loff_t
-dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
+__dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		struct iomap *iomap, struct iomap *srcmap)
 {
 	struct block_device *bdev = iomap->bdev;
@@ -1344,11 +1344,25 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 	return done ? done : ret;
 }
 
+static loff_t
+dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
+		struct iomap *iomap, struct iomap *srcmap)
+{
+	struct dax_iomap_data *idata = data;
+	loff_t ret = __dax_iomap_actor(inode, pos, length, idata->data,
+					iomap, srcmap);
+
+	if (idata->ops->actor_end)
+		ret = idata->ops->actor_end(inode, pos, length, ret);
+
+	return ret;
+}
+
 /**
  * dax_iomap_rw - Perform I/O to a DAX file
  * @iocb:	The control block for this I/O
  * @iter:	The addresses to do I/O from or to
- * @ops:	iomap ops passed from the file system
+ * @ops:	dax iomap ops passed from the file system
  *
  * This function performs read and write operations to directly mapped
  * persistent memory.  The callers needs to take care of read/write exclusion
@@ -1356,12 +1370,13 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
  */
 ssize_t
 dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
-		const struct iomap_ops *ops)
+		const struct dax_iomap_ops *ops)
 {
 	struct address_space *mapping = iocb->ki_filp->f_mapping;
 	struct inode *inode = mapping->host;
 	loff_t pos = iocb->ki_pos, ret = 0, done = 0;
 	unsigned flags = 0;
+	struct dax_iomap_data data = { iter, ops };
 
 	if (iov_iter_rw(iter) == WRITE) {
 		lockdep_assert_held_write(&inode->i_rwsem);
@@ -1374,8 +1389,8 @@ dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
 		flags |= IOMAP_NOWAIT;
 
 	while (iov_iter_count(iter)) {
-		ret = iomap_apply(inode, pos, iov_iter_count(iter), flags, ops,
-				iter, dax_iomap_actor);
+		ret = iomap_apply(inode, pos, iov_iter_count(iter), flags,
+				  &ops->iomap_ops, &data, dax_iomap_actor);
 		if (ret <= 0)
 			break;
 		pos += ret;
@@ -1387,6 +1402,55 @@ dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
 }
 EXPORT_SYMBOL_GPL(dax_iomap_rw);
 
+static loff_t
+dax_iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t length,
+		void *data, struct iomap *iomap, struct iomap *srcmap)
+{
+	struct dax_iomap_data *idata = data;
+	loff_t ret = iomap_zero_range_actor(inode, pos, length, idata->data,
+					    iomap, srcmap);
+
+	if (idata->ops->actor_end)
+		ret = idata->ops->actor_end(inode, pos, length, ret);
+
+	return ret;
+}
+
+int
+dax_iomap_zero_range(struct inode *inode, loff_t pos, loff_t len,
+		bool *did_zero, const struct dax_iomap_ops *ops)
+{
+	struct dax_iomap_data data = { did_zero, ops };
+	loff_t ret;
+
+	while (len > 0) {
+		ret = iomap_apply(inode, pos, len, IOMAP_ZERO, &ops->iomap_ops,
+				  &data, dax_iomap_zero_range_actor);
+		if (ret <= 0)
+			return ret;
+
+		pos += ret;
+		len -= ret;
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(dax_iomap_zero_range);
+
+int
+dax_iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
+		const struct dax_iomap_ops *ops)
+{
+	unsigned int blocksize = i_blocksize(inode);
+	unsigned int off = pos & (blocksize - 1);
+
+	/* Block boundary? Nothing to do */
+	if (!off)
+		return 0;
+	return dax_iomap_zero_range(inode, pos, blocksize - off, did_zero, ops);
+}
+EXPORT_SYMBOL_GPL(dax_iomap_truncate_page);
+
 static vm_fault_t dax_fault_return(int error)
 {
 	if (error == 0)
@@ -1527,7 +1591,7 @@ static vm_fault_t dax_fault_actor(struct vm_fault *vmf, pfn_t *pfnp,
 }
 
 static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
-			       int *iomap_errp, const struct iomap_ops *ops)
+		int *iomap_errp, const struct dax_iomap_ops *dops)
 {
 	struct vm_area_struct *vma = vmf->vma;
 	struct address_space *mapping = vma->vm_file->f_mapping;
@@ -1536,8 +1600,9 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
 	loff_t pos = (loff_t)vmf->pgoff << PAGE_SHIFT;
 	struct iomap iomap = { .type = IOMAP_HOLE };
 	struct iomap srcmap = { .type = IOMAP_HOLE };
+	const struct iomap_ops *ops = &dops->iomap_ops;
 	unsigned flags = IOMAP_FAULT;
-	int error;
+	int error, copied = PAGE_SIZE;
 	bool write = vmf->flags & FAULT_FLAG_WRITE;
 	vm_fault_t ret = 0, major = 0;
 	void *entry;
@@ -1598,7 +1663,7 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
 	ret = dax_fault_actor(vmf, pfnp, &xas, &entry, false, flags,
 			      &iomap, &srcmap);
 	if (ret == VM_FAULT_SIGBUS)
-		goto finish_iomap;
+		goto finish_iomap_actor_end;
 
 	/* read/write MAPPED, CoW UNWRITTEN */
 	if (iomap.flags & IOMAP_F_NEW) {
@@ -1607,10 +1672,15 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
 		major = VM_FAULT_MAJOR;
 	}
 
+finish_iomap_actor_end:
+	if (dops->actor_end) {
+		if (ret & VM_FAULT_ERROR)
+			copied = 0;
+		dops->actor_end(inode, pos, PAGE_SIZE, copied);
+	}
+
 finish_iomap:
 	if (ops->iomap_end) {
-		int copied = PAGE_SIZE;
-
 		if (ret & VM_FAULT_ERROR)
 			copied = 0;
 		/*
@@ -1663,7 +1733,7 @@ static bool dax_fault_check_fallback(struct vm_fault *vmf, struct xa_state *xas,
 }
 
 static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
-			       const struct iomap_ops *ops)
+		const struct dax_iomap_ops *dops)
 {
 	struct vm_area_struct *vma = vmf->vma;
 	struct address_space *mapping = vma->vm_file->f_mapping;
@@ -1674,10 +1744,11 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
 	vm_fault_t ret = VM_FAULT_FALLBACK;
 	struct iomap iomap = { .type = IOMAP_HOLE };
 	struct iomap srcmap = { .type = IOMAP_HOLE };
+	const struct iomap_ops *ops = &dops->iomap_ops;
 	pgoff_t max_pgoff;
 	void *entry;
 	loff_t pos;
-	int error;
+	int error, copied = PMD_SIZE;
 
 	/*
 	 * Check whether offset isn't beyond end of file now. Caller is
@@ -1736,10 +1807,14 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
 	ret = dax_fault_actor(vmf, pfnp, &xas, &entry, true, flags,
 			      &iomap, &srcmap);
 
+	if (dops->actor_end) {
+		if (ret == VM_FAULT_FALLBACK)
+			copied = 0;
+		dops->actor_end(inode, pos, PMD_SIZE, copied);
+	}
+
 finish_iomap:
 	if (ops->iomap_end) {
-		int copied = PMD_SIZE;
-
 		if (ret == VM_FAULT_FALLBACK)
 			copied = 0;
 		/*
@@ -1783,7 +1858,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
  * successfully.
  */
 vm_fault_t dax_iomap_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
-		    pfn_t *pfnp, int *iomap_errp, const struct iomap_ops *ops)
+		pfn_t *pfnp, int *iomap_errp, const struct dax_iomap_ops *ops)
 {
 	switch (pe_size) {
 	case PE_SIZE_PTE:
diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index b0a694820cb7..765269804f83 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -806,6 +806,9 @@ extern void ext2_set_file_ops(struct inode *inode);
 extern const struct address_space_operations ext2_aops;
 extern const struct address_space_operations ext2_nobh_aops;
 extern const struct iomap_ops ext2_iomap_ops;
+#ifdef CONFIG_FS_DAX
+extern const struct dax_iomap_ops ext2_dax_iomap_ops;
+#endif
 
 /* namei.c */
 extern const struct inode_operations ext2_dir_inode_operations;
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index f98466acc672..d5dd82111128 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -39,7 +39,7 @@ static ssize_t ext2_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
 		return 0; /* skip atime */
 
 	inode_lock_shared(inode);
-	ret = dax_iomap_rw(iocb, to, &ext2_iomap_ops);
+	ret = dax_iomap_rw(iocb, to, &ext2_dax_iomap_ops);
 	inode_unlock_shared(inode);
 
 	file_accessed(iocb->ki_filp);
@@ -63,7 +63,7 @@ static ssize_t ext2_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	if (ret)
 		goto out_unlock;
 
-	ret = dax_iomap_rw(iocb, from, &ext2_iomap_ops);
+	ret = dax_iomap_rw(iocb, from, &ext2_dax_iomap_ops);
 	if (ret > 0 && iocb->ki_pos > i_size_read(inode)) {
 		i_size_write(inode, iocb->ki_pos);
 		mark_inode_dirty(inode);
@@ -102,7 +102,7 @@ static vm_fault_t ext2_dax_fault(struct vm_fault *vmf)
 	}
 	down_read(&ei->dax_sem);
 
-	ret = dax_iomap_fault(vmf, PE_SIZE_PTE, NULL, NULL, &ext2_iomap_ops);
+	ret = dax_iomap_fault(vmf, PE_SIZE_PTE, NULL, NULL, &ext2_dax_iomap_ops);
 
 	up_read(&ei->dax_sem);
 	if (write)
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 68178b2234bd..a94744bbf82f 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -852,6 +852,13 @@ const struct iomap_ops ext2_iomap_ops = {
 	.iomap_begin		= ext2_iomap_begin,
 	.iomap_end		= ext2_iomap_end,
 };
+
+const struct dax_iomap_ops ext2_dax_iomap_ops = {
+	.iomap_ops	= {
+		.iomap_begin	= ext2_iomap_begin,
+		.iomap_end	= ext2_iomap_end,
+	},
+};
 #else
 /* Define empty ops for !CONFIG_FS_DAX case to avoid ugly ifdefs */
 const struct iomap_ops ext2_iomap_ops;
@@ -1294,9 +1301,9 @@ static int ext2_setsize(struct inode *inode, loff_t newsize)
 	inode_dio_wait(inode);
 
 	if (IS_DAX(inode)) {
-		error = iomap_zero_range(inode, newsize,
+		error = dax_iomap_zero_range(inode, newsize,
 					 PAGE_ALIGN(newsize) - newsize, NULL,
-					 &ext2_iomap_ops);
+					 &ext2_dax_iomap_ops);
 	} else if (test_opt(inode->i_sb, NOBH))
 		error = nobh_truncate_page(inode->i_mapping,
 				newsize, ext2_get_block);
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 37002663d521..b4e6df93dd82 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3773,6 +3773,9 @@ static inline void ext4_clear_io_unwritten_flag(ext4_io_end_t *io_end)
 }
 
 extern const struct iomap_ops ext4_iomap_ops;
+#ifdef CONFIG_FS_DAX
+extern const struct dax_iomap_ops ext4_dax_iomap_ops;
+#endif
 extern const struct iomap_ops ext4_iomap_overwrite_ops;
 extern const struct iomap_ops ext4_iomap_report_ops;
 
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 816dedcbd541..a7a3497429ca 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -102,7 +102,7 @@ static ssize_t ext4_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
 		/* Fallback to buffered IO in case we cannot support DAX */
 		return generic_file_read_iter(iocb, to);
 	}
-	ret = dax_iomap_rw(iocb, to, &ext4_iomap_ops);
+	ret = dax_iomap_rw(iocb, to, &ext4_dax_iomap_ops);
 	inode_unlock_shared(inode);
 
 	file_accessed(iocb->ki_filp);
@@ -650,7 +650,7 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		ext4_journal_stop(handle);
 	}
 
-	ret = dax_iomap_rw(iocb, from, &ext4_iomap_ops);
+	ret = dax_iomap_rw(iocb, from, &ext4_dax_iomap_ops);
 
 	if (extend)
 		ret = ext4_handle_inode_extension(inode, offset, ret, count);
@@ -721,7 +721,7 @@ static vm_fault_t ext4_dax_huge_fault(struct vm_fault *vmf,
 	} else {
 		down_read(&EXT4_I(inode)->i_mmap_sem);
 	}
-	result = dax_iomap_fault(vmf, pe_size, &pfn, &error, &ext4_iomap_ops);
+	result = dax_iomap_fault(vmf, pe_size, &pfn, &error, &ext4_dax_iomap_ops);
 	if (write) {
 		ext4_journal_stop(handle);
 
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index fe6045a46599..2310f5cc6cd5 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3523,6 +3523,15 @@ const struct iomap_ops ext4_iomap_ops = {
 	.iomap_end		= ext4_iomap_end,
 };
 
+#ifdef CONFIG_FS_DAX
+const struct dax_iomap_ops ext4_dax_iomap_ops = {
+	.iomap_ops		= {
+		.iomap_begin = ext4_iomap_begin,
+		.iomap_end   = ext4_iomap_end,
+	},
+};
+#endif
+
 const struct iomap_ops ext4_iomap_overwrite_ops = {
 	.iomap_begin		= ext4_iomap_overwrite_begin,
 	.iomap_end		= ext4_iomap_end,
@@ -3840,8 +3849,8 @@ static int ext4_block_zero_page_range(handle_t *handle,
 		length = max;
 
 	if (IS_DAX(inode)) {
-		return iomap_zero_range(inode, from, length, NULL,
-					&ext4_iomap_ops);
+		return dax_iomap_zero_range(inode, from, length, NULL,
+					    &ext4_dax_iomap_ops);
 	}
 	return __ext4_block_zero_page_range(handle, mapping, from, length);
 }
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index fdaac4ba9b9d..32c6b2ab6251 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -918,9 +918,9 @@ static s64 iomap_zero(struct inode *inode, loff_t pos, u64 length,
 	return iomap_write_end(inode, pos, bytes, bytes, page, iomap, srcmap);
 }
 
-static loff_t iomap_zero_range_actor(struct inode *inode, loff_t pos,
-		loff_t length, void *data, struct iomap *iomap,
-		struct iomap *srcmap)
+loff_t
+iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t length,
+		void *data, struct iomap *iomap, struct iomap *srcmap)
 {
 	bool *did_zero = data;
 	loff_t written = 0;
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 0936f3a96fe6..4b0744b5a75f 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1009,8 +1009,7 @@ xfs_free_file_space(
 		return 0;
 	if (offset + len > XFS_ISIZE(ip))
 		len = XFS_ISIZE(ip) - offset;
-	error = iomap_zero_range(VFS_I(ip), offset, len, NULL,
-			&xfs_buffered_write_iomap_ops);
+	error = xfs_iomap_zero_range(ip, offset, len, NULL);
 	if (error)
 		return error;
 
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 396ef36dcd0a..9bca68872242 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -281,7 +281,7 @@ xfs_file_dax_read(
 	ret = xfs_ilock_iocb(iocb, XFS_IOLOCK_SHARED);
 	if (ret)
 		return ret;
-	ret = dax_iomap_rw(iocb, to, &xfs_read_iomap_ops);
+	ret = dax_iomap_rw(iocb, to, &xfs_dax_read_iomap_ops);
 	xfs_iunlock(ip, XFS_IOLOCK_SHARED);
 
 	file_accessed(iocb->ki_filp);
@@ -684,7 +684,7 @@ xfs_file_dax_write(
 	pos = iocb->ki_pos;
 
 	trace_xfs_file_dax_write(iocb, from);
-	ret = dax_iomap_rw(iocb, from, &xfs_direct_write_iomap_ops);
+	ret = dax_iomap_rw(iocb, from, &xfs_dax_write_iomap_ops);
 	if (ret > 0 && iocb->ki_pos > i_size_read(inode)) {
 		i_size_write(inode, iocb->ki_pos);
 		error = xfs_setfilesize(ip, pos, ret);
@@ -1309,8 +1309,8 @@ __xfs_filemap_fault(
 
 		ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL,
 				(write_fault && !vmf->cow_page) ?
-				 &xfs_direct_write_iomap_ops :
-				 &xfs_read_iomap_ops);
+				 &xfs_dax_write_iomap_ops :
+				 &xfs_dax_read_iomap_ops);
 		if (ret & VM_FAULT_NEEDDSYNC)
 			ret = dax_finish_sync_fault(vmf, pe_size, pfn);
 	} else {
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index d154f42e2dc6..48004cf28a88 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -761,7 +761,8 @@ xfs_direct_write_iomap_begin(
 
 		/* may drop and re-acquire the ilock */
 		error = xfs_reflink_allocate_cow(ip, &imap, &cmap, &shared,
-				&lockmode, flags & IOMAP_DIRECT);
+				&lockmode,
+				(flags & IOMAP_DIRECT) || IS_DAX(inode));
 		if (error)
 			goto out_unlock;
 		if (shared)
@@ -854,6 +855,33 @@ const struct iomap_ops xfs_direct_write_iomap_ops = {
 	.iomap_begin		= xfs_direct_write_iomap_begin,
 };
 
+static int
+xfs_dax_write_iomap_actor_end(
+	struct inode		*inode,
+	loff_t			pos,
+	loff_t			length,
+	ssize_t			written)
+{
+	int			error = 0;
+	struct xfs_inode	*ip = XFS_I(inode);
+	bool			cow = xfs_is_cow_inode(ip);
+
+	if (cow) {
+		if (written <= 0)
+			xfs_reflink_cancel_cow_range(ip, pos, length, true);
+		else
+			error = xfs_reflink_end_cow(ip, pos, written);
+	}
+	return error ?: written;
+}
+
+const struct dax_iomap_ops xfs_dax_write_iomap_ops = {
+	.iomap_ops		= {
+		.iomap_begin = xfs_direct_write_iomap_begin,
+	},
+	.actor_end		= xfs_dax_write_iomap_actor_end,
+};
+
 static int
 xfs_buffered_write_iomap_begin(
 	struct inode		*inode,
@@ -1184,6 +1212,12 @@ const struct iomap_ops xfs_read_iomap_ops = {
 	.iomap_begin		= xfs_read_iomap_begin,
 };
 
+const struct dax_iomap_ops xfs_dax_read_iomap_ops = {
+	.iomap_ops		= {
+		.iomap_begin = xfs_read_iomap_begin,
+	},
+};
+
 static int
 xfs_seek_iomap_begin(
 	struct inode		*inode,
diff --git a/fs/xfs/xfs_iomap.h b/fs/xfs/xfs_iomap.h
index 7d3703556d0e..5eacb5d8ca88 100644
--- a/fs/xfs/xfs_iomap.h
+++ b/fs/xfs/xfs_iomap.h
@@ -45,5 +45,38 @@ extern const struct iomap_ops xfs_direct_write_iomap_ops;
 extern const struct iomap_ops xfs_read_iomap_ops;
 extern const struct iomap_ops xfs_seek_iomap_ops;
 extern const struct iomap_ops xfs_xattr_iomap_ops;
+extern const struct dax_iomap_ops xfs_dax_write_iomap_ops;
+extern const struct dax_iomap_ops xfs_dax_read_iomap_ops;
+
+static inline int
+xfs_iomap_zero_range(
+	struct xfs_inode	*ip,
+	loff_t			pos,
+	loff_t			len,
+	bool			*did_zero)
+{
+	struct inode		*inode = VFS_I(ip);
+
+	return IS_DAX(inode)
+			? dax_iomap_zero_range(inode, pos, len, did_zero,
+					       &xfs_dax_write_iomap_ops)
+			: iomap_zero_range(inode, pos, len, did_zero,
+					       &xfs_buffered_write_iomap_ops);
+}
+
+static inline int
+xfs_iomap_truncate_page(
+	struct xfs_inode	*ip,
+	loff_t			pos,
+	bool			*did_zero)
+{
+	struct inode		*inode = VFS_I(ip);
+
+	return IS_DAX(inode)
+			? dax_iomap_truncate_page(inode, pos, did_zero,
+					       &xfs_dax_write_iomap_ops)
+			: iomap_truncate_page(inode, pos, did_zero,
+					       &xfs_buffered_write_iomap_ops);
+}
 
 #endif /* __XFS_IOMAP_H__*/
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index dfe24b7f26e5..6d936c3e1a6e 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -911,8 +911,8 @@ xfs_setattr_size(
 	 */
 	if (newsize > oldsize) {
 		trace_xfs_zero_eof(ip, oldsize, newsize - oldsize);
-		error = iomap_zero_range(inode, oldsize, newsize - oldsize,
-				&did_zeroing, &xfs_buffered_write_iomap_ops);
+		error = xfs_iomap_zero_range(ip, oldsize, newsize - oldsize,
+				&did_zeroing);
 	} else {
 		/*
 		 * iomap won't detect a dirty page over an unwritten block (or a
@@ -924,8 +924,7 @@ xfs_setattr_size(
 						     newsize);
 		if (error)
 			return error;
-		error = iomap_truncate_page(inode, newsize, &did_zeroing,
-				&xfs_buffered_write_iomap_ops);
+		error = xfs_iomap_truncate_page(ip, newsize, &did_zeroing);
 	}
 
 	if (error)
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index d25434f93235..9a780948dbd0 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1266,8 +1266,7 @@ xfs_reflink_zero_posteof(
 		return 0;
 
 	trace_xfs_zero_eof(ip, isize, pos - isize);
-	return iomap_zero_range(VFS_I(ip), isize, pos - isize, NULL,
-			&xfs_buffered_write_iomap_ops);
+	return xfs_iomap_zero_range(ip, isize, pos - isize, NULL);
 }
 
 /*
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 106d1f033a78..64393f6e96cf 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -3,6 +3,7 @@
 #define _LINUX_DAX_H
 
 #include <linux/fs.h>
+#include <linux/iomap.h>
 #include <linux/mm.h>
 #include <linux/radix-tree.h>
 
@@ -11,8 +12,6 @@
 
 typedef unsigned long dax_entry_t;
 
-struct iomap_ops;
-struct iomap;
 struct dax_device;
 struct dax_operations {
 	/*
@@ -38,6 +37,23 @@ struct dax_operations {
 	int (*zero_page_range)(struct dax_device *, pgoff_t, size_t);
 };
 
+struct dax_iomap_ops {
+	/* the original iomap ops */
+	struct iomap_ops iomap_ops;
+	/*
+	 * actor_end: accept error code returned from ->actor(), deal with it
+	 * before ->iomap_end()
+	 */
+	int (*actor_end)(struct inode *, loff_t, loff_t, ssize_t);
+};
+
+/* dax iomap specific data, in order to call ->actor_end() in ->actor() */
+struct dax_iomap_data {
+	/* the original data pointer */
+	void *data;
+	const struct dax_iomap_ops *ops;
+};
+
 extern struct attribute_group dax_attribute_group;
 
 #if IS_ENABLED(CONFIG_DAX)
@@ -229,14 +245,18 @@ int dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
 void dax_flush(struct dax_device *dax_dev, void *addr, size_t size);
 
 ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
-		const struct iomap_ops *ops);
+		const struct dax_iomap_ops *ops);
 vm_fault_t dax_iomap_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
-		    pfn_t *pfnp, int *errp, const struct iomap_ops *ops);
+		pfn_t *pfnp, int *errp, const struct dax_iomap_ops *ops);
 vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
 		enum page_entry_size pe_size, pfn_t pfn);
 int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
 int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
 				      pgoff_t index);
+int dax_iomap_zero_range(struct inode *inode, loff_t pos, loff_t len,
+		bool *did_zero, const struct dax_iomap_ops *ops);
+int dax_iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
+		const struct dax_iomap_ops *ops);
 s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap,
 		struct iomap *srcmap);
 int dax_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 95562f863ad0..05437fbf5f68 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -178,6 +178,8 @@ int iomap_migrate_page(struct address_space *mapping, struct page *newpage,
 #endif
 int iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
 		const struct iomap_ops *ops);
+loff_t iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t length,
+		void *data, struct iomap *iomap, struct iomap *srcmap);
 int iomap_zero_range(struct inode *inode, loff_t pos, loff_t len,
 		bool *did_zero, const struct iomap_ops *ops);
 int iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
-- 
2.32.0
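
The block arithmetic used by dax_iomap_truncate_page() in the patch
above can be checked in isolation.  A small sketch, assuming a
power-of-two block size as the mask requires; demo_tail_zero_len() is a
hypothetical helper name, not something this patch adds.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of bytes to zero from pos up to the end of the block that
 * contains pos.  Returns 0 when pos already sits on a block boundary.
 * blocksize must be a power of two, otherwise the mask is meaningless.
 */
static uint64_t demo_tail_zero_len(uint64_t pos, uint32_t blocksize)
{
	uint32_t off = pos & (blocksize - 1);	/* offset within the block */

	if (!off)
		return 0;
	return blocksize - off;
}
```

The result is the length that would be passed on to the zero_range
helper for a given new end-of-file position.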





end of thread, other threads:[~2021-07-09 12:37 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
2021-05-19  6:00 [PATCH v6 0/7] fsdax,xfs: Add reflink&dedupe support for fsdax Shiyang Ruan
2021-05-19  6:00 ` [PATCH v6 1/7] fsdax: Introduce dax_iomap_cow_copy() Shiyang Ruan
2021-05-19  6:00 ` [PATCH v6 2/7] fsdax: Replace mmap entry in case of CoW Shiyang Ruan
2021-05-19  6:00 ` [PATCH v6 3/7] fsdax: Add dax_iomap_cow_copy() for dax_iomap_zero Shiyang Ruan
2021-05-25 22:17   ` Darrick J. Wong
2021-05-19  6:00 ` [PATCH v6 4/7] iomap: Introduce iomap_apply2() for operations on two files Shiyang Ruan
2021-05-19  6:00 ` [PATCH v6 5/7] fsdax: Dedup file range to use a compare function Shiyang Ruan
2021-05-25 23:29   ` Darrick J. Wong
2021-05-19  6:00 ` [PATCH v6 6/7] fs/xfs: Handle CoW for fsdax write() path Shiyang Ruan
2021-05-26  0:21   ` Darrick J. Wong
2021-06-09  2:28     ` ruansy.fnst
2021-06-15  7:21       ` [PATCH v6.1 " Shiyang Ruan
2021-06-24  8:49         ` ruansy.fnst
2021-06-25 22:18           ` Darrick J. Wong
2021-06-28  2:55             ` ruansy.fnst
2021-06-28  5:09               ` Darrick J. Wong
2021-06-29 11:25                 ` ruansy.fnst
2021-06-29 21:01                   ` Darrick J. Wong
2021-07-08 23:16                 ` Dave Chinner
2021-07-09 12:36               ` [PATCH v6.2 6/7] dax: Introduce dax_iomap_ops for end of reflink Shiyang Ruan
2021-05-19  6:00 ` [PATCH v6 7/7] fs/xfs: Add dax dedupe support Shiyang Ruan
2021-05-26  0:31   ` Darrick J. Wong
2021-05-26  0:51 ` [PATCH v6 0/7] fsdax,xfs: Add reflink&dedupe support for fsdax Darrick J. Wong
