nvdimm.lists.linux.dev archive mirror
 help / color / mirror / Atom feed
* [PATCH v9 0/8] fsdax,xfs: Add reflink&dedupe support for fsdax
@ 2021-09-15 10:44 Shiyang Ruan
  2021-09-15 10:44 ` [PATCH v9 1/8] fsdax: Output address in dax_iomap_pfn() and rename it Shiyang Ruan
                   ` (7 more replies)
  0 siblings, 8 replies; 27+ messages in thread
From: Shiyang Ruan @ 2021-09-15 10:44 UTC (permalink / raw)
  To: djwong, hch, linux-xfs
  Cc: ruansy.fnst, dan.j.williams, david, linux-fsdevel, linux-kernel,
	nvdimm, rgoldwyn, viro, willy

This patchset is attempt to add CoW support for fsdax, and take XFS,
which has both reflink and fsdax feature, as an example.

Changes from V8:
 - Rebased on v5.15-rc1
 - Patch 4: Add a pre patch to convert dax_iomap_zero to iter model[1]
 - Patch 6&7: Remove EXPORT_SYMBOL
 - Patch 8: Solve the conflict caused by rebase
 - Fix code style problems, add comment

One of the key mechanism need to be implemented in fsdax is CoW.  Copy
the data from srcmap before we actually write data to the destance
iomap.  And we just copy range in which data won't be changed.

Another mechanism is range comparison.  In page cache case, readpage()
is used to load data on disk to page cache in order to be able to
compare data.  In fsdax case, readpage() does not work.  So, we need
another compare data with direct access support.

With the two mechanisms implemented in fsdax, we are able to make reflink
and fsdax work together in XFS.

(Rebased on v5.15-rc1)
==

Shiyang Ruan (8):
  fsdax: Output address in dax_iomap_pfn() and rename it
  fsdax: Introduce dax_iomap_cow_copy()
  fsdax: Replace mmap entry in case of CoW
  fsdax: Convert dax_iomap_zero to iter model
  fsdax: Add dax_iomap_cow_copy() for dax_iomap_zero
  fsdax: Dedup file range to use a compare function
  xfs: support CoW in fsdax mode
  xfs: Add dax dedupe support

 fs/dax.c               | 284 +++++++++++++++++++++++++++++++++--------
 fs/iomap/buffered-io.c |   3 +-
 fs/remap_range.c       |  31 ++++-
 fs/xfs/xfs_bmap_util.c |   3 +-
 fs/xfs/xfs_file.c      |   8 +-
 fs/xfs/xfs_inode.c     |  80 +++++++++++-
 fs/xfs/xfs_inode.h     |   1 +
 fs/xfs/xfs_iomap.c     |  38 +++++-
 fs/xfs/xfs_iomap.h     |  30 +++++
 fs/xfs/xfs_iops.c      |   7 +-
 fs/xfs/xfs_reflink.c   |  15 ++-
 include/linux/dax.h    |  11 +-
 include/linux/fs.h     |  12 +-
 13 files changed, 438 insertions(+), 85 deletions(-)

-- 
2.33.0




^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v9 1/8] fsdax: Output address in dax_iomap_pfn() and rename it
  2021-09-15 10:44 [PATCH v9 0/8] fsdax,xfs: Add reflink&dedupe support for fsdax Shiyang Ruan
@ 2021-09-15 10:44 ` Shiyang Ruan
  2021-09-16  0:09   ` Darrick J. Wong
  2021-09-15 10:44 ` [PATCH v9 2/8] fsdax: Introduce dax_iomap_cow_copy() Shiyang Ruan
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 27+ messages in thread
From: Shiyang Ruan @ 2021-09-15 10:44 UTC (permalink / raw)
  To: djwong, hch, linux-xfs
  Cc: ruansy.fnst, dan.j.williams, david, linux-fsdevel, linux-kernel,
	nvdimm, rgoldwyn, viro, willy, Ritesh Harjani

Add address output in dax_iomap_pfn() in order to perform a memcpy() in
CoW case.  Since this function both output address and pfn, rename it to
dax_iomap_direct_access().

Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/dax.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 4e3e5a283a91..8b482a58acae 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1010,8 +1010,8 @@ static sector_t dax_iomap_sector(const struct iomap *iomap, loff_t pos)
 	return (iomap->addr + (pos & PAGE_MASK) - iomap->offset) >> 9;
 }
 
-static int dax_iomap_pfn(const struct iomap *iomap, loff_t pos, size_t size,
-			 pfn_t *pfnp)
+static int dax_iomap_direct_access(const struct iomap *iomap, loff_t pos,
+		size_t size, void **kaddr, pfn_t *pfnp)
 {
 	const sector_t sector = dax_iomap_sector(iomap, pos);
 	pgoff_t pgoff;
@@ -1023,11 +1023,13 @@ static int dax_iomap_pfn(const struct iomap *iomap, loff_t pos, size_t size,
 		return rc;
 	id = dax_read_lock();
 	length = dax_direct_access(iomap->dax_dev, pgoff, PHYS_PFN(size),
-				   NULL, pfnp);
+				   kaddr, pfnp);
 	if (length < 0) {
 		rc = length;
 		goto out;
 	}
+	if (!pfnp)
+		goto out_check_addr;
 	rc = -EINVAL;
 	if (PFN_PHYS(length) < size)
 		goto out;
@@ -1037,6 +1039,12 @@ static int dax_iomap_pfn(const struct iomap *iomap, loff_t pos, size_t size,
 	if (length > 1 && !pfn_t_devmap(*pfnp))
 		goto out;
 	rc = 0;
+
+out_check_addr:
+	if (!kaddr)
+		goto out;
+	if (!*kaddr)
+		rc = -EFAULT;
 out:
 	dax_read_unlock(id);
 	return rc;
@@ -1401,7 +1409,7 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
 		return pmd ? VM_FAULT_FALLBACK : VM_FAULT_SIGBUS;
 	}
 
-	err = dax_iomap_pfn(&iter->iomap, pos, size, &pfn);
+	err = dax_iomap_direct_access(&iter->iomap, pos, size, NULL, &pfn);
 	if (err)
 		return pmd ? VM_FAULT_FALLBACK : dax_fault_return(err);
 
-- 
2.33.0




^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v9 2/8] fsdax: Introduce dax_iomap_cow_copy()
  2021-09-15 10:44 [PATCH v9 0/8] fsdax,xfs: Add reflink&dedupe support for fsdax Shiyang Ruan
  2021-09-15 10:44 ` [PATCH v9 1/8] fsdax: Output address in dax_iomap_pfn() and rename it Shiyang Ruan
@ 2021-09-15 10:44 ` Shiyang Ruan
  2021-09-15 10:44 ` [PATCH v9 3/8] fsdax: Replace mmap entry in case of CoW Shiyang Ruan
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 27+ messages in thread
From: Shiyang Ruan @ 2021-09-15 10:44 UTC (permalink / raw)
  To: djwong, hch, linux-xfs
  Cc: ruansy.fnst, dan.j.williams, david, linux-fsdevel, linux-kernel,
	nvdimm, rgoldwyn, viro, willy

In the case where the iomap is a write operation and iomap is not equal
to srcmap after iomap_begin, we consider it is a CoW operation.

In this case, the destination (iomap->addr) points to a newly allocated
extent.  It is needed to copy the data from srcmap to the extent.  In
theory, it is better to copy the head and tail ranges which is outside
of the non-aligned area instead of copying the whole aligned range. But
in dax page fault, it will always be an aligned range. So copy the whole
range in this case.

Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/dax.c | 88 ++++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 83 insertions(+), 5 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 8b482a58acae..dded08be54dc 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1050,6 +1050,60 @@ static int dax_iomap_direct_access(const struct iomap *iomap, loff_t pos,
 	return rc;
 }
 
+/**
+ * dax_iomap_cow_copy - Copy the data from source to destination before write
+ * @pos:	address to do copy from.
+ * @length:	size of copy operation.
+ * @align_size:	aligned w.r.t align_size (either PMD_SIZE or PAGE_SIZE)
+ * @srcmap:	iomap srcmap
+ * @daddr:	destination address to copy to.
+ *
+ * This can be called from two places. Either during DAX write fault (page
+ * aligned), to copy the length size data to daddr. Or, while doing normal DAX
+ * write operation, dax_iomap_actor() might call this to do the copy of either
+ * start or end unaligned address. In the latter case the rest of the copy of
+ * aligned ranges is taken care by dax_iomap_actor() itself.
+ */
+static int dax_iomap_cow_copy(loff_t pos, uint64_t length, size_t align_size,
+		const struct iomap *srcmap, void *daddr)
+{
+	loff_t head_off = pos & (align_size - 1);
+	size_t size = ALIGN(head_off + length, align_size);
+	loff_t end = pos + length;
+	loff_t pg_end = round_up(end, align_size);
+	bool copy_all = head_off == 0 && end == pg_end;
+	void *saddr = 0;
+	int ret = 0;
+
+	ret = dax_iomap_direct_access(srcmap, pos, size, &saddr, NULL);
+	if (ret)
+		return ret;
+
+	if (copy_all) {
+		ret = copy_mc_to_kernel(daddr, saddr, length);
+		return ret ? -EIO : 0;
+	}
+
+	/* Copy the head part of the range */
+	if (head_off) {
+		ret = copy_mc_to_kernel(daddr, saddr, head_off);
+		if (ret)
+			return -EIO;
+	}
+
+	/* Copy the tail part of the range */
+	if (end < pg_end) {
+		loff_t tail_off = head_off + length;
+		loff_t tail_len = pg_end - end;
+
+		ret = copy_mc_to_kernel(daddr + tail_off, saddr + tail_off,
+					tail_len);
+		if (ret)
+			return -EIO;
+	}
+	return 0;
+}
+
 /*
  * The user has performed a load from a hole in the file.  Allocating a new
  * page in the file would cause excessive storage usage for workloads with
@@ -1175,16 +1229,18 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
 		struct iov_iter *iter)
 {
 	const struct iomap *iomap = &iomi->iomap;
+	const struct iomap *srcmap = &iomi->srcmap;
 	loff_t length = iomap_length(iomi);
 	loff_t pos = iomi->pos;
 	struct block_device *bdev = iomap->bdev;
 	struct dax_device *dax_dev = iomap->dax_dev;
 	loff_t end = pos + length, done = 0;
+	bool write = iov_iter_rw(iter) == WRITE;
 	ssize_t ret = 0;
 	size_t xfer;
 	int id;
 
-	if (iov_iter_rw(iter) == READ) {
+	if (!write) {
 		end = min(end, i_size_read(iomi->inode));
 		if (pos >= end)
 			return 0;
@@ -1193,7 +1249,12 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
 			return iov_iter_zero(min(length, end - pos), iter);
 	}
 
-	if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED))
+	/*
+	 * In DAX mode, enforce either pure overwrites of written extents, or
+	 * writes to unwritten extents as part of a copy-on-write operation.
+	 */
+	if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED &&
+			!(iomap->flags & IOMAP_F_SHARED)))
 		return -EIO;
 
 	/*
@@ -1232,6 +1293,14 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
 			break;
 		}
 
+		if (write &&
+		    srcmap->addr != IOMAP_HOLE && srcmap->addr != iomap->addr) {
+			ret = dax_iomap_cow_copy(pos, length, PAGE_SIZE, srcmap,
+						 kaddr);
+			if (ret)
+				break;
+		}
+
 		map_len = PFN_PHYS(map_len);
 		kaddr += offset;
 		map_len -= offset;
@@ -1243,7 +1312,7 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
 		 * validated via access_ok() in either vfs_read() or
 		 * vfs_write(), depending on which operation we are doing.
 		 */
-		if (iov_iter_rw(iter) == WRITE)
+		if (write)
 			xfer = dax_copy_from_iter(dax_dev, pgoff, kaddr,
 					map_len, iter);
 		else
@@ -1385,6 +1454,7 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
 {
 	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
 	const struct iomap *iomap = &iter->iomap;
+	const struct iomap *srcmap = &iter->srcmap;
 	size_t size = pmd ? PMD_SIZE : PAGE_SIZE;
 	loff_t pos = (loff_t)xas->xa_index << PAGE_SHIFT;
 	bool write = vmf->flags & FAULT_FLAG_WRITE;
@@ -1392,6 +1462,7 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
 	unsigned long entry_flags = pmd ? DAX_PMD : 0;
 	int err = 0;
 	pfn_t pfn;
+	void *kaddr;
 
 	if (!pmd && vmf->cow_page)
 		return dax_fault_cow_page(vmf, iter);
@@ -1404,18 +1475,25 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
 		return dax_pmd_load_hole(xas, vmf, iomap, entry);
 	}
 
-	if (iomap->type != IOMAP_MAPPED) {
+	if (iomap->type != IOMAP_MAPPED && !(iomap->flags & IOMAP_F_SHARED)) {
 		WARN_ON_ONCE(1);
 		return pmd ? VM_FAULT_FALLBACK : VM_FAULT_SIGBUS;
 	}
 
-	err = dax_iomap_direct_access(&iter->iomap, pos, size, NULL, &pfn);
+	err = dax_iomap_direct_access(iomap, pos, size, &kaddr, &pfn);
 	if (err)
 		return pmd ? VM_FAULT_FALLBACK : dax_fault_return(err);
 
 	*entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn, entry_flags,
 				  write && !sync);
 
+	if (write &&
+	    srcmap->addr != IOMAP_HOLE && srcmap->addr != iomap->addr) {
+		err = dax_iomap_cow_copy(pos, size, size, srcmap, kaddr);
+		if (err)
+			return dax_fault_return(err);
+	}
+
 	if (sync)
 		return dax_fault_synchronous_pfnp(pfnp, pfn);
 
-- 
2.33.0




^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v9 3/8] fsdax: Replace mmap entry in case of CoW
  2021-09-15 10:44 [PATCH v9 0/8] fsdax,xfs: Add reflink&dedupe support for fsdax Shiyang Ruan
  2021-09-15 10:44 ` [PATCH v9 1/8] fsdax: Output address in dax_iomap_pfn() and rename it Shiyang Ruan
  2021-09-15 10:44 ` [PATCH v9 2/8] fsdax: Introduce dax_iomap_cow_copy() Shiyang Ruan
@ 2021-09-15 10:44 ` Shiyang Ruan
  2021-09-16 14:51   ` kernel test robot
  2021-09-15 10:44 ` [PATCH v9 4/8] fsdax: Convert dax_iomap_zero to iter model Shiyang Ruan
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 27+ messages in thread
From: Shiyang Ruan @ 2021-09-15 10:44 UTC (permalink / raw)
  To: djwong, hch, linux-xfs
  Cc: ruansy.fnst, dan.j.williams, david, linux-fsdevel, linux-kernel,
	nvdimm, rgoldwyn, viro, willy, Goldwyn Rodrigues, Ritesh Harjani

Replace the existing entry to the newly allocated one in case of CoW.
Also, we mark the entry as PAGECACHE_TAG_TOWRITE so writeback marks this
entry as writeprotected.  This helps us snapshots so new write
pagefaults after snapshots trigger a CoW.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/dax.c | 73 +++++++++++++++++++++++++++++++-------------------------
 1 file changed, 40 insertions(+), 33 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index dded08be54dc..41c93929f20b 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -734,6 +734,23 @@ static int copy_cow_page_dax(struct block_device *bdev, struct dax_device *dax_d
 	return 0;
 }
 
+/*
+ * MAP_SYNC on a dax mapping guarantees dirty metadata is
+ * flushed on write-faults (non-cow), but not read-faults.
+ */
+static bool dax_fault_is_synchronous(const struct iomap_iter *iter,
+		struct vm_area_struct *vma)
+{
+	return (iter->flags & IOMAP_WRITE) && (vma->vm_flags & VM_SYNC) &&
+		(iter->iomap.flags & IOMAP_F_DIRTY);
+}
+
+static bool dax_fault_is_cow(const struct iomap_iter *iter)
+{
+	return (iter->flags & IOMAP_WRITE) &&
+		(iter->iomap.flags & IOMAP_F_SHARED);
+}
+
 /*
  * By this point grab_mapping_entry() has ensured that we have a locked entry
  * of the appropriate size so we don't have to worry about downgrading PMDs to
@@ -741,16 +758,19 @@ static int copy_cow_page_dax(struct block_device *bdev, struct dax_device *dax_d
  * already in the tree, we will skip the insertion and just dirty the PMD as
  * appropriate.
  */
-static void *dax_insert_entry(struct xa_state *xas,
-		struct address_space *mapping, struct vm_fault *vmf,
-		void *entry, pfn_t pfn, unsigned long flags, bool dirty)
+static void *dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf,
+		const struct iomap_iter *iter, void *entry, pfn_t pfn,
+		unsigned long flags)
 {
+	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
 	void *new_entry = dax_make_entry(pfn, flags);
+	bool dirty = !dax_fault_is_synchronous(iter, vmf->vma);
+	bool cow = dax_fault_is_cow(iter);
 
 	if (dirty)
 		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 
-	if (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE)) {
+	if (cow || (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE))) {
 		unsigned long index = xas->xa_index;
 		/* we are replacing a zero page with block mapping */
 		if (dax_is_pmd_entry(entry))
@@ -762,7 +782,7 @@ static void *dax_insert_entry(struct xa_state *xas,
 
 	xas_reset(xas);
 	xas_lock_irq(xas);
-	if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
+	if (cow || dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
 		void *old;
 
 		dax_disassociate_entry(entry, mapping, false);
@@ -786,6 +806,9 @@ static void *dax_insert_entry(struct xa_state *xas,
 	if (dirty)
 		xas_set_mark(xas, PAGECACHE_TAG_DIRTY);
 
+	if (cow)
+		xas_set_mark(xas, PAGECACHE_TAG_TOWRITE);
+
 	xas_unlock_irq(xas);
 	return entry;
 }
@@ -1111,17 +1134,15 @@ static int dax_iomap_cow_copy(loff_t pos, uint64_t length, size_t align_size,
  * If this page is ever written to we will re-fault and change the mapping to
  * point to real DAX storage instead.
  */
-static vm_fault_t dax_load_hole(struct xa_state *xas,
-		struct address_space *mapping, void **entry,
-		struct vm_fault *vmf)
+static vm_fault_t dax_load_hole(struct xa_state *xas, struct vm_fault *vmf,
+		const struct iomap_iter *iter, void **entry)
 {
-	struct inode *inode = mapping->host;
+	struct inode *inode = iter->inode;
 	unsigned long vaddr = vmf->address;
 	pfn_t pfn = pfn_to_pfn_t(my_zero_pfn(vaddr));
 	vm_fault_t ret;
 
-	*entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn,
-			DAX_ZERO_PAGE, false);
+	*entry = dax_insert_entry(xas, vmf, iter, *entry, pfn, DAX_ZERO_PAGE);
 
 	ret = vmf_insert_mixed(vmf->vma, vaddr, pfn);
 	trace_dax_load_hole(inode, vmf, ret);
@@ -1130,7 +1151,7 @@ static vm_fault_t dax_load_hole(struct xa_state *xas,
 
 #ifdef CONFIG_FS_DAX_PMD
 static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
-		const struct iomap *iomap, void **entry)
+		const struct iomap_iter *iter, void **entry)
 {
 	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
 	unsigned long pmd_addr = vmf->address & PMD_MASK;
@@ -1148,8 +1169,8 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
 		goto fallback;
 
 	pfn = page_to_pfn_t(zero_page);
-	*entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn,
-			DAX_PMD | DAX_ZERO_PAGE, false);
+	*entry = dax_insert_entry(xas, vmf, iter, *entry, pfn,
+				  DAX_PMD | DAX_ZERO_PAGE);
 
 	if (arch_needs_pgtable_deposit()) {
 		pgtable = pte_alloc_one(vma->vm_mm);
@@ -1381,17 +1402,6 @@ static vm_fault_t dax_fault_return(int error)
 	return vmf_error(error);
 }
 
-/*
- * MAP_SYNC on a dax mapping guarantees dirty metadata is
- * flushed on write-faults (non-cow), but not read-faults.
- */
-static bool dax_fault_is_synchronous(unsigned long flags,
-		struct vm_area_struct *vma, const struct iomap *iomap)
-{
-	return (flags & IOMAP_WRITE) && (vma->vm_flags & VM_SYNC)
-		&& (iomap->flags & IOMAP_F_DIRTY);
-}
-
 /*
  * When handling a synchronous page fault and the inode need a fsync, we can
  * insert the PTE/PMD into page tables only after that fsync happened. Skip
@@ -1452,13 +1462,11 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
 		const struct iomap_iter *iter, pfn_t *pfnp,
 		struct xa_state *xas, void **entry, bool pmd)
 {
-	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
 	const struct iomap *iomap = &iter->iomap;
 	const struct iomap *srcmap = &iter->srcmap;
 	size_t size = pmd ? PMD_SIZE : PAGE_SIZE;
 	loff_t pos = (loff_t)xas->xa_index << PAGE_SHIFT;
-	bool write = vmf->flags & FAULT_FLAG_WRITE;
-	bool sync = dax_fault_is_synchronous(iter->flags, vmf->vma, iomap);
+	bool write = iter->flags & IOMAP_WRITE;
 	unsigned long entry_flags = pmd ? DAX_PMD : 0;
 	int err = 0;
 	pfn_t pfn;
@@ -1471,8 +1479,8 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
 	if (!write &&
 	    (iomap->type == IOMAP_UNWRITTEN || iomap->type == IOMAP_HOLE)) {
 		if (!pmd)
-			return dax_load_hole(xas, mapping, entry, vmf);
-		return dax_pmd_load_hole(xas, vmf, iomap, entry);
+			return dax_load_hole(xas, vmf, iter, entry);
+		return dax_pmd_load_hole(xas, vmf, iter, entry);
 	}
 
 	if (iomap->type != IOMAP_MAPPED && !(iomap->flags & IOMAP_F_SHARED)) {
@@ -1484,8 +1492,7 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
 	if (err)
 		return pmd ? VM_FAULT_FALLBACK : dax_fault_return(err);
 
-	*entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn, entry_flags,
-				  write && !sync);
+	*entry = dax_insert_entry(xas, vmf, iter, *entry, pfn, entry_flags);
 
 	if (write &&
 	    srcmap->addr != IOMAP_HOLE && srcmap->addr != iomap->addr) {
@@ -1494,7 +1501,7 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
 			return dax_fault_return(err);
 	}
 
-	if (sync)
+	if (dax_fault_is_synchronous(iter, vmf->vma))
 		return dax_fault_synchronous_pfnp(pfnp, pfn);
 
 	/* insert PMD pfn */
-- 
2.33.0




^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v9 4/8] fsdax: Convert dax_iomap_zero to iter model
  2021-09-15 10:44 [PATCH v9 0/8] fsdax,xfs: Add reflink&dedupe support for fsdax Shiyang Ruan
                   ` (2 preceding siblings ...)
  2021-09-15 10:44 ` [PATCH v9 3/8] fsdax: Replace mmap entry in case of CoW Shiyang Ruan
@ 2021-09-15 10:44 ` Shiyang Ruan
  2021-09-16  0:11   ` Darrick J. Wong
  2021-09-16  6:15   ` Christoph Hellwig
  2021-09-15 10:44 ` [PATCH v9 5/8] fsdax: Add dax_iomap_cow_copy() for dax_iomap_zero Shiyang Ruan
                   ` (3 subsequent siblings)
  7 siblings, 2 replies; 27+ messages in thread
From: Shiyang Ruan @ 2021-09-15 10:44 UTC (permalink / raw)
  To: djwong, hch, linux-xfs
  Cc: ruansy.fnst, dan.j.williams, david, linux-fsdevel, linux-kernel,
	nvdimm, rgoldwyn, viro, willy

Let dax_iomap_zero() support iter model.

Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
---
 fs/dax.c               | 3 ++-
 fs/iomap/buffered-io.c | 3 +--
 include/linux/dax.h    | 3 ++-
 3 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 41c93929f20b..4f346e25e488 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1209,8 +1209,9 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
 }
 #endif /* CONFIG_FS_DAX_PMD */
 
-s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap)
+s64 dax_iomap_zero(struct iomap_iter *iter, loff_t pos, u64 length)
 {
+	const struct iomap *iomap = &iter->iomap;
 	sector_t sector = iomap_sector(iomap, pos & PAGE_MASK);
 	pgoff_t pgoff;
 	long rc, id;
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 9cc5798423d1..84a861d3b3e0 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -889,7 +889,6 @@ static s64 __iomap_zero_iter(struct iomap_iter *iter, loff_t pos, u64 length)
 
 static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
 {
-	struct iomap *iomap = &iter->iomap;
 	const struct iomap *srcmap = iomap_iter_srcmap(iter);
 	loff_t pos = iter->pos;
 	loff_t length = iomap_length(iter);
@@ -903,7 +902,7 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
 		s64 bytes;
 
 		if (IS_DAX(iter->inode))
-			bytes = dax_iomap_zero(pos, length, iomap);
+			bytes = dax_iomap_zero(iter, pos, length);
 		else
 			bytes = __iomap_zero_iter(iter, pos, length);
 		if (bytes < 0)
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 2619d94c308d..642de7ef1a10 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -13,6 +13,7 @@ typedef unsigned long dax_entry_t;
 
 struct iomap_ops;
 struct iomap;
+struct iomap_iter;
 struct dax_device;
 struct dax_operations {
 	/*
@@ -210,7 +211,7 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
 int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
 int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
 				      pgoff_t index);
-s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap);
+s64 dax_iomap_zero(struct iomap_iter *iter, loff_t pos, u64 length);
 static inline bool dax_mapping(struct address_space *mapping)
 {
 	return mapping->host && IS_DAX(mapping->host);
-- 
2.33.0




^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v9 5/8] fsdax: Add dax_iomap_cow_copy() for dax_iomap_zero
  2021-09-15 10:44 [PATCH v9 0/8] fsdax,xfs: Add reflink&dedupe support for fsdax Shiyang Ruan
                   ` (3 preceding siblings ...)
  2021-09-15 10:44 ` [PATCH v9 4/8] fsdax: Convert dax_iomap_zero to iter model Shiyang Ruan
@ 2021-09-15 10:44 ` Shiyang Ruan
  2021-09-16  6:16   ` Christoph Hellwig
  2021-09-15 10:44 ` [PATCH v9 6/8] fsdax: Dedup file range to use a compare function Shiyang Ruan
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 27+ messages in thread
From: Shiyang Ruan @ 2021-09-15 10:44 UTC (permalink / raw)
  To: djwong, hch, linux-xfs
  Cc: ruansy.fnst, dan.j.williams, david, linux-fsdevel, linux-kernel,
	nvdimm, rgoldwyn, viro, willy, Ritesh Harjani

Punch hole on a reflinked file needs dax_iomap_cow_copy() too.
Otherwise, data in not aligned area will be not correct.  So, add the
CoW operation for not aligned case in dax_iomap_zero().

Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/dax.c | 27 +++++++++++++++++----------
 1 file changed, 17 insertions(+), 10 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 4f346e25e488..ca4308c85988 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1212,6 +1212,7 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
 s64 dax_iomap_zero(struct iomap_iter *iter, loff_t pos, u64 length)
 {
 	const struct iomap *iomap = &iter->iomap;
+	const struct iomap *srcmap = &iter->srcmap;
 	sector_t sector = iomap_sector(iomap, pos & PAGE_MASK);
 	pgoff_t pgoff;
 	long rc, id;
@@ -1230,21 +1231,27 @@ s64 dax_iomap_zero(struct iomap_iter *iter, loff_t pos, u64 length)
 
 	id = dax_read_lock();
 
-	if (page_aligned)
+	if (page_aligned) {
 		rc = dax_zero_page_range(iomap->dax_dev, pgoff, 1);
-	else
-		rc = dax_direct_access(iomap->dax_dev, pgoff, 1, &kaddr, NULL);
-	if (rc < 0) {
-		dax_read_unlock(id);
-		return rc;
+		goto out;
 	}
 
-	if (!page_aligned) {
-		memset(kaddr + offset, 0, size);
+	rc = dax_direct_access(iomap->dax_dev, pgoff, 1, &kaddr, NULL);
+	if (rc < 0)
+		goto out;
+	memset(kaddr + offset, 0, size);
+	if (srcmap->addr != IOMAP_HOLE && srcmap->addr != iomap->addr) {
+		rc = dax_iomap_cow_copy(pos, size, PAGE_SIZE, srcmap,
+					kaddr);
+		if (rc < 0)
+			goto out;
+		dax_flush(iomap->dax_dev, kaddr, PAGE_SIZE);
+	} else
 		dax_flush(iomap->dax_dev, kaddr + offset, size);
-	}
+
+out:
 	dax_read_unlock(id);
-	return size;
+	return rc < 0 ? rc : size;
 }
 
 static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
-- 
2.33.0




^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v9 6/8] fsdax: Dedup file range to use a compare function
  2021-09-15 10:44 [PATCH v9 0/8] fsdax,xfs: Add reflink&dedupe support for fsdax Shiyang Ruan
                   ` (4 preceding siblings ...)
  2021-09-15 10:44 ` [PATCH v9 5/8] fsdax: Add dax_iomap_cow_copy() for dax_iomap_zero Shiyang Ruan
@ 2021-09-15 10:44 ` Shiyang Ruan
  2021-09-16  6:17   ` Christoph Hellwig
  2021-09-15 10:45 ` [PATCH v9 7/8] xfs: support CoW in fsdax mode Shiyang Ruan
  2021-09-15 10:45 ` [PATCH v9 8/8] xfs: Add dax dedupe support Shiyang Ruan
  7 siblings, 1 reply; 27+ messages in thread
From: Shiyang Ruan @ 2021-09-15 10:44 UTC (permalink / raw)
  To: djwong, hch, linux-xfs
  Cc: ruansy.fnst, dan.j.williams, david, linux-fsdevel, linux-kernel,
	nvdimm, rgoldwyn, viro, willy, Goldwyn Rodrigues

With dax we cannot deal with readpage() etc. So, we create a dax
comparison function which is similar with
vfs_dedupe_file_range_compare().
And introduce dax_remap_file_range_prep() for filesystem use.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/dax.c             | 79 ++++++++++++++++++++++++++++++++++++++++++++
 fs/remap_range.c     | 31 ++++++++++++++---
 fs/xfs/xfs_reflink.c |  8 +++--
 include/linux/dax.h  |  8 +++++
 include/linux/fs.h   | 12 ++++---
 5 files changed, 127 insertions(+), 11 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index ca4308c85988..33b0b9f4b904 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1815,3 +1815,82 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
 	return dax_insert_pfn_mkwrite(vmf, pfn, order);
 }
 EXPORT_SYMBOL_GPL(dax_finish_sync_fault);
+
+static loff_t dax_range_compare_iter(struct iomap_iter *it_src,
+		struct iomap_iter *it_dest, bool *same)
+{
+	const struct iomap *smap = &it_src->iomap;
+	const struct iomap *dmap = &it_dest->iomap;
+	loff_t pos1 = it_src->pos, pos2 = it_dest->pos;
+	loff_t len = min(smap->length, dmap->length);
+	void *saddr, *daddr;
+	int id, ret;
+
+	if (smap->type == IOMAP_HOLE && dmap->type == IOMAP_HOLE) {
+		*same = true;
+		return len;
+	}
+
+	if (smap->type == IOMAP_HOLE || dmap->type == IOMAP_HOLE) {
+		*same = false;
+		return 0;
+	}
+
+	id = dax_read_lock();
+	ret = dax_iomap_direct_access(smap, pos1, ALIGN(pos1 + len, PAGE_SIZE),
+				      &saddr, NULL);
+	if (ret < 0)
+		goto out_unlock;
+
+	ret = dax_iomap_direct_access(dmap, pos2, ALIGN(pos2 + len, PAGE_SIZE),
+				      &daddr, NULL);
+	if (ret < 0)
+		goto out_unlock;
+
+	*same = !memcmp(saddr, daddr, len);
+	if (!*same)
+		len = 0;
+	dax_read_unlock(id);
+	return len;
+
+out_unlock:
+	dax_read_unlock(id);
+	return -EIO;
+}
+
+int dax_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
+		struct inode *dst, loff_t dstoff, loff_t len, bool *same,
+		const struct iomap_ops *ops)
+{
+	struct iomap_iter src_iter = {
+		.inode		= src,
+		.pos		= srcoff,
+		.len		= len,
+	};
+	struct iomap_iter dst_iter = {
+		.inode		= dst,
+		.pos		= dstoff,
+		.len		= len,
+	};
+	int ret;
+
+	while ((ret = iomap_iter(&src_iter, ops)) > 0) {
+		while ((ret = iomap_iter(&dst_iter, ops)) > 0) {
+			dst_iter.processed = dax_range_compare_iter(&src_iter,
+						&dst_iter, same);
+		}
+		if (ret <= 0)
+			src_iter.processed = ret;
+	}
+	return ret;
+}
+
+int dax_remap_file_range_prep(struct file *file_in, loff_t pos_in,
+			      struct file *file_out, loff_t pos_out,
+			      loff_t *len, unsigned int remap_flags,
+			      const struct iomap_ops *ops)
+{
+	return __generic_remap_file_range_prep(file_in, pos_in, file_out,
+					       pos_out, len, remap_flags, ops);
+}
+EXPORT_SYMBOL_GPL(dax_remap_file_range_prep);
diff --git a/fs/remap_range.c b/fs/remap_range.c
index 6d4a9beaa097..1157169b9d79 100644
--- a/fs/remap_range.c
+++ b/fs/remap_range.c
@@ -14,6 +14,7 @@
 #include <linux/compat.h>
 #include <linux/mount.h>
 #include <linux/fs.h>
+#include <linux/dax.h>
 #include "internal.h"
 
 #include <linux/uaccess.h>
@@ -277,9 +278,11 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
  * If there's an error, then the usual negative error code is returned.
  * Otherwise returns 0 with *len set to the request length.
  */
-int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
-				  struct file *file_out, loff_t pos_out,
-				  loff_t *len, unsigned int remap_flags)
+int
+__generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
+				struct file *file_out, loff_t pos_out,
+				loff_t *len, unsigned int remap_flags,
+				const struct iomap_ops *dax_read_ops)
 {
 	struct inode *inode_in = file_inode(file_in);
 	struct inode *inode_out = file_inode(file_out);
@@ -339,8 +342,18 @@ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
 	if (remap_flags & REMAP_FILE_DEDUP) {
 		bool		is_same = false;
 
-		ret = vfs_dedupe_file_range_compare(inode_in, pos_in,
-				inode_out, pos_out, *len, &is_same);
+		if (*len == 0)
+			return 0;
+
+		if (!IS_DAX(inode_in))
+			ret = vfs_dedupe_file_range_compare(inode_in, pos_in,
+					inode_out, pos_out, *len, &is_same);
+		else if (dax_read_ops)
+			ret = dax_dedupe_file_range_compare(inode_in, pos_in,
+					inode_out, pos_out, *len, &is_same,
+					dax_read_ops);
+		else
+			return -EINVAL;
 		if (ret)
 			return ret;
 		if (!is_same)
@@ -358,6 +371,14 @@ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
 
 	return ret;
 }
+
+int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
+				  struct file *file_out, loff_t pos_out,
+				  loff_t *len, unsigned int remap_flags)
+{
+	return __generic_remap_file_range_prep(file_in, pos_in, file_out,
+					       pos_out, len, remap_flags, NULL);
+}
 EXPORT_SYMBOL(generic_remap_file_range_prep);
 
 loff_t do_clone_file_range(struct file *file_in, loff_t pos_in,
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 76355f293488..7ecea0311e88 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1332,8 +1332,12 @@ xfs_reflink_remap_prep(
 	if (IS_DAX(inode_in) || IS_DAX(inode_out))
 		goto out_unlock;
 
-	ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out,
-			len, remap_flags);
+	if (!IS_DAX(inode_in))
+		ret = generic_remap_file_range_prep(file_in, pos_in, file_out,
+				pos_out, len, remap_flags);
+	else
+		ret = dax_remap_file_range_prep(file_in, pos_in, file_out,
+				pos_out, len, remap_flags, &xfs_read_iomap_ops);
 	if (ret || *len == 0)
 		goto out_unlock;
 
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 642de7ef1a10..b975629a8d5b 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -212,6 +212,14 @@ int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
 int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
 				      pgoff_t index);
 s64 dax_iomap_zero(struct iomap_iter *iter, loff_t pos, u64 length);
+int dax_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
+				  struct inode *dest, loff_t destoff,
+				  loff_t len, bool *is_same,
+				  const struct iomap_ops *ops);
+int dax_remap_file_range_prep(struct file *file_in, loff_t pos_in,
+			      struct file *file_out, loff_t pos_out,
+			      loff_t *len, unsigned int remap_flags,
+			      const struct iomap_ops *ops);
 static inline bool dax_mapping(struct address_space *mapping)
 {
 	return mapping->host && IS_DAX(mapping->host);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e7a633353fd2..670b53c722a7 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -71,6 +71,7 @@ struct fsverity_operations;
 struct fs_context;
 struct fs_parameter_spec;
 struct fileattr;
+struct iomap_ops;
 
 extern void __init inode_init(void);
 extern void __init inode_init_early(void);
@@ -2175,10 +2176,13 @@ extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
 extern ssize_t generic_copy_file_range(struct file *file_in, loff_t pos_in,
 				       struct file *file_out, loff_t pos_out,
 				       size_t len, unsigned int flags);
-extern int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
-					 struct file *file_out, loff_t pos_out,
-					 loff_t *count,
-					 unsigned int remap_flags);
+int __generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
+				    struct file *file_out, loff_t pos_out,
+				    loff_t *len, unsigned int remap_flags,
+				    const struct iomap_ops *dax_read_ops);
+int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
+				  struct file *file_out, loff_t pos_out,
+				  loff_t *count, unsigned int remap_flags);
 extern loff_t do_clone_file_range(struct file *file_in, loff_t pos_in,
 				  struct file *file_out, loff_t pos_out,
 				  loff_t len, unsigned int remap_flags);
-- 
2.33.0




^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v9 7/8] xfs: support CoW in fsdax mode
  2021-09-15 10:44 [PATCH v9 0/8] fsdax,xfs: Add reflink&dedupe support for fsdax Shiyang Ruan
                   ` (5 preceding siblings ...)
  2021-09-15 10:44 ` [PATCH v9 6/8] fsdax: Dedup file range to use a compare function Shiyang Ruan
@ 2021-09-15 10:45 ` Shiyang Ruan
  2021-09-16  0:22   ` Darrick J. Wong
  2021-09-16  6:23   ` Christoph Hellwig
  2021-09-15 10:45 ` [PATCH v9 8/8] xfs: Add dax dedupe support Shiyang Ruan
  7 siblings, 2 replies; 27+ messages in thread
From: Shiyang Ruan @ 2021-09-15 10:45 UTC (permalink / raw)
  To: djwong, hch, linux-xfs
  Cc: ruansy.fnst, dan.j.williams, david, linux-fsdevel, linux-kernel,
	nvdimm, rgoldwyn, viro, willy

In fsdax mode, WRITE and ZERO on a shared extent need CoW performed.
After that, new allocated extents needs to be remapped to the file.
So, add a CoW identification in ->iomap_begin(), and implement
->iomap_end() to do the remapping work.

Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
---
 fs/xfs/xfs_bmap_util.c |  3 +--
 fs/xfs/xfs_file.c      |  6 +++---
 fs/xfs/xfs_iomap.c     | 38 +++++++++++++++++++++++++++++++++++++-
 fs/xfs/xfs_iomap.h     | 30 ++++++++++++++++++++++++++++++
 fs/xfs/xfs_iops.c      |  7 +++----
 fs/xfs/xfs_reflink.c   |  3 +--
 6 files changed, 75 insertions(+), 12 deletions(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 73a36b7be3bd..0681250e0a5d 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1009,8 +1009,7 @@ xfs_free_file_space(
 		return 0;
 	if (offset + len > XFS_ISIZE(ip))
 		len = XFS_ISIZE(ip) - offset;
-	error = iomap_zero_range(VFS_I(ip), offset, len, NULL,
-			&xfs_buffered_write_iomap_ops);
+	error = xfs_iomap_zero_range(ip, offset, len, NULL);
 	if (error)
 		return error;
 
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 7aa943edfc02..2ef1930374d2 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -704,7 +704,7 @@ xfs_file_dax_write(
 	pos = iocb->ki_pos;
 
 	trace_xfs_file_dax_write(iocb, from);
-	ret = dax_iomap_rw(iocb, from, &xfs_direct_write_iomap_ops);
+	ret = dax_iomap_rw(iocb, from, &xfs_dax_write_iomap_ops);
 	if (ret > 0 && iocb->ki_pos > i_size_read(inode)) {
 		i_size_write(inode, iocb->ki_pos);
 		error = xfs_setfilesize(ip, pos, ret);
@@ -1329,8 +1329,8 @@ __xfs_filemap_fault(
 		xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
 		ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL,
 				(write_fault && !vmf->cow_page) ?
-				 &xfs_direct_write_iomap_ops :
-				 &xfs_read_iomap_ops);
+					&xfs_dax_write_iomap_ops :
+					&xfs_read_iomap_ops);
 		if (ret & VM_FAULT_NEEDDSYNC)
 			ret = dax_finish_sync_fault(vmf, pe_size, pfn);
 		xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 093758440ad5..6fa3b377cb81 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -761,7 +761,8 @@ xfs_direct_write_iomap_begin(
 
 		/* may drop and re-acquire the ilock */
 		error = xfs_reflink_allocate_cow(ip, &imap, &cmap, &shared,
-				&lockmode, flags & IOMAP_DIRECT);
+				&lockmode,
+				(flags & IOMAP_DIRECT) || IS_DAX(inode));
 		if (error)
 			goto out_unlock;
 		if (shared)
@@ -854,6 +855,41 @@ const struct iomap_ops xfs_direct_write_iomap_ops = {
 	.iomap_begin		= xfs_direct_write_iomap_begin,
 };
 
+static int
+xfs_dax_write_iomap_end(
+	struct inode 		*inode,
+	loff_t 			pos,
+	loff_t 			length,
+	ssize_t 		written,
+	unsigned 		flags,
+	struct iomap 		*iomap)
+{
+	struct xfs_inode	*ip = XFS_I(inode);
+	/*
+	 * Usually we use @written to indicate whether the operation was
+	 * successful.  But it is always positive or zero.  The CoW needs the
+	 * actual error code from actor().  So, get it from
+	 * iomap_iter->processed.
+	 */
+	const struct iomap_iter *iter =
+				container_of(iomap, typeof(*iter), iomap);
+
+	if (!xfs_is_cow_inode(ip))
+		return 0;
+
+	if (iter->processed <= 0) {
+		xfs_reflink_cancel_cow_range(ip, pos, length, true);
+		return 0;
+	}
+
+	return xfs_reflink_end_cow(ip, pos, iter->processed);
+}
+
+const struct iomap_ops xfs_dax_write_iomap_ops = {
+	.iomap_begin 	= xfs_direct_write_iomap_begin,
+	.iomap_end	= xfs_dax_write_iomap_end,
+};
+
 static int
 xfs_buffered_write_iomap_begin(
 	struct inode		*inode,
diff --git a/fs/xfs/xfs_iomap.h b/fs/xfs/xfs_iomap.h
index 7d3703556d0e..92679a0c3578 100644
--- a/fs/xfs/xfs_iomap.h
+++ b/fs/xfs/xfs_iomap.h
@@ -45,5 +45,35 @@ extern const struct iomap_ops xfs_direct_write_iomap_ops;
 extern const struct iomap_ops xfs_read_iomap_ops;
 extern const struct iomap_ops xfs_seek_iomap_ops;
 extern const struct iomap_ops xfs_xattr_iomap_ops;
+extern const struct iomap_ops xfs_dax_write_iomap_ops;
+
+static inline int
+xfs_iomap_zero_range(
+	struct xfs_inode	*ip,
+	loff_t			pos,
+	loff_t			len,
+	bool			*did_zero)
+{
+	struct inode		*inode = VFS_I(ip);
+
+	return iomap_zero_range(inode, pos, len, did_zero,
+			IS_DAX(inode) ?
+				&xfs_dax_write_iomap_ops :
+				&xfs_buffered_write_iomap_ops);
+}
+
+static inline int
+xfs_iomap_truncate_page(
+	struct xfs_inode	*ip,
+	loff_t			pos,
+	bool			*did_zero)
+{
+	struct inode		*inode = VFS_I(ip);
+
+	return iomap_truncate_page(inode, pos, did_zero,
+			IS_DAX(inode)?
+				&xfs_dax_write_iomap_ops :
+				&xfs_buffered_write_iomap_ops);
+}
 
 #endif /* __XFS_IOMAP_H__*/
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index a607d6aca5c4..332e6208dffd 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -911,8 +911,8 @@ xfs_setattr_size(
 	 */
 	if (newsize > oldsize) {
 		trace_xfs_zero_eof(ip, oldsize, newsize - oldsize);
-		error = iomap_zero_range(inode, oldsize, newsize - oldsize,
-				&did_zeroing, &xfs_buffered_write_iomap_ops);
+		error = xfs_iomap_zero_range(ip, oldsize, newsize - oldsize,
+				&did_zeroing);
 	} else {
 		/*
 		 * iomap won't detect a dirty page over an unwritten block (or a
@@ -924,8 +924,7 @@ xfs_setattr_size(
 						     newsize);
 		if (error)
 			return error;
-		error = iomap_truncate_page(inode, newsize, &did_zeroing,
-				&xfs_buffered_write_iomap_ops);
+		error = xfs_iomap_truncate_page(ip, newsize, &did_zeroing);
 	}
 
 	if (error)
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 7ecea0311e88..9d876e268734 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1269,8 +1269,7 @@ xfs_reflink_zero_posteof(
 		return 0;
 
 	trace_xfs_zero_eof(ip, isize, pos - isize);
-	return iomap_zero_range(VFS_I(ip), isize, pos - isize, NULL,
-			&xfs_buffered_write_iomap_ops);
+	return xfs_iomap_zero_range(ip, isize, pos - isize, NULL);
 }
 
 /*
-- 
2.33.0




^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v9 8/8] xfs: Add dax dedupe support
  2021-09-15 10:44 [PATCH v9 0/8] fsdax,xfs: Add reflink&dedupe support for fsdax Shiyang Ruan
                   ` (6 preceding siblings ...)
  2021-09-15 10:45 ` [PATCH v9 7/8] xfs: support CoW in fsdax mode Shiyang Ruan
@ 2021-09-15 10:45 ` Shiyang Ruan
  2021-09-16  0:30   ` Darrick J. Wong
  7 siblings, 1 reply; 27+ messages in thread
From: Shiyang Ruan @ 2021-09-15 10:45 UTC (permalink / raw)
  To: djwong, hch, linux-xfs
  Cc: ruansy.fnst, dan.j.williams, david, linux-fsdevel, linux-kernel,
	nvdimm, rgoldwyn, viro, willy

Introduce xfs_mmaplock_two_inodes_and_break_dax_layout() for dax files
who are going to be deduped.  After that, call compare range function
only when files are both DAX or not.

Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_file.c    |  2 +-
 fs/xfs/xfs_inode.c   | 80 +++++++++++++++++++++++++++++++++++++++++---
 fs/xfs/xfs_inode.h   |  1 +
 fs/xfs/xfs_reflink.c |  4 +--
 4 files changed, 80 insertions(+), 7 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 2ef1930374d2..c3061723613c 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -846,7 +846,7 @@ xfs_wait_dax_page(
 	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
 }
 
-static int
+int
 xfs_break_dax_layouts(
 	struct inode		*inode,
 	bool			*retry)
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index a4f6f034fb81..bdc084cdbf46 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -3790,6 +3790,61 @@ xfs_iolock_two_inodes_and_break_layout(
 	return 0;
 }
 
+static int
+xfs_mmaplock_two_inodes_and_break_dax_layout(
+	struct xfs_inode	*ip1,
+	struct xfs_inode	*ip2)
+{
+	int			error, attempts = 0;
+	bool			retry;
+	struct page		*page;
+	struct xfs_log_item	*lp;
+
+	if (ip1->i_ino > ip2->i_ino)
+		swap(ip1, ip2);
+
+again:
+	retry = false;
+	/* Lock the first inode */
+	xfs_ilock(ip1, XFS_MMAPLOCK_EXCL);
+	error = xfs_break_dax_layouts(VFS_I(ip1), &retry);
+	if (error || retry) {
+		xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL);
+		if (error == 0 && retry)
+			goto again;
+		return error;
+	}
+
+	if (ip1 == ip2)
+		return 0;
+
+	/* Nested lock the second inode */
+	lp = &ip1->i_itemp->ili_item;
+	if (lp && test_bit(XFS_LI_IN_AIL, &lp->li_flags)) {
+		if (!xfs_ilock_nowait(ip2,
+		    xfs_lock_inumorder(XFS_MMAPLOCK_EXCL, 1))) {
+			xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL);
+			if ((++attempts % 5) == 0)
+				delay(1); /* Don't just spin the CPU */
+			goto again;
+		}
+	} else
+		xfs_ilock(ip2, xfs_lock_inumorder(XFS_MMAPLOCK_EXCL, 1));
+	/*
+	 * We cannot use xfs_break_dax_layouts() directly here because it may
+	 * need to unlock & lock the XFS_MMAPLOCK_EXCL which is not suitable
+	 * for this nested lock case.
+	 */
+	page = dax_layout_busy_page(VFS_I(ip2)->i_mapping);
+	if (page && page_ref_count(page) != 1) {
+		xfs_iunlock(ip2, XFS_MMAPLOCK_EXCL);
+		xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL);
+		goto again;
+	}
+
+	return 0;
+}
+
 /*
  * Lock two inodes so that userspace cannot initiate I/O via file syscalls or
  * mmap activity.
@@ -3804,8 +3859,19 @@ xfs_ilock2_io_mmap(
 	ret = xfs_iolock_two_inodes_and_break_layout(VFS_I(ip1), VFS_I(ip2));
 	if (ret)
 		return ret;
-	filemap_invalidate_lock_two(VFS_I(ip1)->i_mapping,
-				    VFS_I(ip2)->i_mapping);
+
+	if (IS_DAX(VFS_I(ip1)) && IS_DAX(VFS_I(ip2))) {
+		ret = xfs_mmaplock_two_inodes_and_break_dax_layout(ip1, ip2);
+		if (ret) {
+			inode_unlock(VFS_I(ip2));
+			if (ip1 != ip2)
+				inode_unlock(VFS_I(ip1));
+			return ret;
+		}
+	} else
+		filemap_invalidate_lock_two(VFS_I(ip1)->i_mapping,
+					    VFS_I(ip2)->i_mapping);
+
 	return 0;
 }
 
@@ -3815,8 +3881,14 @@ xfs_iunlock2_io_mmap(
 	struct xfs_inode	*ip1,
 	struct xfs_inode	*ip2)
 {
-	filemap_invalidate_unlock_two(VFS_I(ip1)->i_mapping,
-				      VFS_I(ip2)->i_mapping);
+	if (IS_DAX(VFS_I(ip1)) && IS_DAX(VFS_I(ip2))) {
+		xfs_iunlock(ip2, XFS_MMAPLOCK_EXCL);
+		if (ip1 != ip2)
+			xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL);
+	} else
+		filemap_invalidate_unlock_two(VFS_I(ip1)->i_mapping,
+					      VFS_I(ip2)->i_mapping);
+
 	inode_unlock(VFS_I(ip2));
 	if (ip1 != ip2)
 		inode_unlock(VFS_I(ip1));
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index b21b177832d1..f7e26fe31a26 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -472,6 +472,7 @@ enum xfs_prealloc_flags {
 
 int	xfs_update_prealloc_flags(struct xfs_inode *ip,
 				  enum xfs_prealloc_flags flags);
+int	xfs_break_dax_layouts(struct inode *inode, bool *retry);
 int	xfs_break_layouts(struct inode *inode, uint *iolock,
 		enum layout_break_reason reason);
 
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 9d876e268734..3b99c9dfcf0d 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1327,8 +1327,8 @@ xfs_reflink_remap_prep(
 	if (XFS_IS_REALTIME_INODE(src) || XFS_IS_REALTIME_INODE(dest))
 		goto out_unlock;
 
-	/* Don't share DAX file data for now. */
-	if (IS_DAX(inode_in) || IS_DAX(inode_out))
+	/* Don't share DAX file data with non-DAX file. */
+	if (IS_DAX(inode_in) != IS_DAX(inode_out))
 		goto out_unlock;
 
 	if (!IS_DAX(inode_in))
-- 
2.33.0




^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v9 1/8] fsdax: Output address in dax_iomap_pfn() and rename it
  2021-09-15 10:44 ` [PATCH v9 1/8] fsdax: Output address in dax_iomap_pfn() and rename it Shiyang Ruan
@ 2021-09-16  0:09   ` Darrick J. Wong
  2021-09-16  1:36     ` Shiyang Ruan
  0 siblings, 1 reply; 27+ messages in thread
From: Darrick J. Wong @ 2021-09-16  0:09 UTC (permalink / raw)
  To: Shiyang Ruan
  Cc: hch, linux-xfs, dan.j.williams, david, linux-fsdevel,
	linux-kernel, nvdimm, rgoldwyn, viro, willy, Ritesh Harjani

On Wed, Sep 15, 2021 at 06:44:54PM +0800, Shiyang Ruan wrote:
> Add address output in dax_iomap_pfn() in order to perform a memcpy() in
> CoW case.  Since this function both output address and pfn, rename it to
> dax_iomap_direct_access().
> 
> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
> Reviewed-by: Dan Williams <dan.j.williams@intel.com>

Could've sworn I reviewed this a few revisions ago...
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/dax.c | 16 ++++++++++++----
>  1 file changed, 12 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 4e3e5a283a91..8b482a58acae 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1010,8 +1010,8 @@ static sector_t dax_iomap_sector(const struct iomap *iomap, loff_t pos)
>  	return (iomap->addr + (pos & PAGE_MASK) - iomap->offset) >> 9;
>  }
>  
> -static int dax_iomap_pfn(const struct iomap *iomap, loff_t pos, size_t size,
> -			 pfn_t *pfnp)
> +static int dax_iomap_direct_access(const struct iomap *iomap, loff_t pos,
> +		size_t size, void **kaddr, pfn_t *pfnp)
>  {
>  	const sector_t sector = dax_iomap_sector(iomap, pos);
>  	pgoff_t pgoff;
> @@ -1023,11 +1023,13 @@ static int dax_iomap_pfn(const struct iomap *iomap, loff_t pos, size_t size,
>  		return rc;
>  	id = dax_read_lock();
>  	length = dax_direct_access(iomap->dax_dev, pgoff, PHYS_PFN(size),
> -				   NULL, pfnp);
> +				   kaddr, pfnp);
>  	if (length < 0) {
>  		rc = length;
>  		goto out;
>  	}
> +	if (!pfnp)
> +		goto out_check_addr;
>  	rc = -EINVAL;
>  	if (PFN_PHYS(length) < size)
>  		goto out;
> @@ -1037,6 +1039,12 @@ static int dax_iomap_pfn(const struct iomap *iomap, loff_t pos, size_t size,
>  	if (length > 1 && !pfn_t_devmap(*pfnp))
>  		goto out;
>  	rc = 0;
> +
> +out_check_addr:
> +	if (!kaddr)
> +		goto out;
> +	if (!*kaddr)
> +		rc = -EFAULT;
>  out:
>  	dax_read_unlock(id);
>  	return rc;
> @@ -1401,7 +1409,7 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
>  		return pmd ? VM_FAULT_FALLBACK : VM_FAULT_SIGBUS;
>  	}
>  
> -	err = dax_iomap_pfn(&iter->iomap, pos, size, &pfn);
> +	err = dax_iomap_direct_access(&iter->iomap, pos, size, NULL, &pfn);
>  	if (err)
>  		return pmd ? VM_FAULT_FALLBACK : dax_fault_return(err);
>  
> -- 
> 2.33.0
> 
> 
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v9 4/8] fsdax: Convert dax_iomap_zero to iter model
  2021-09-15 10:44 ` [PATCH v9 4/8] fsdax: Convert dax_iomap_zero to iter model Shiyang Ruan
@ 2021-09-16  0:11   ` Darrick J. Wong
  2021-09-16  6:15   ` Christoph Hellwig
  1 sibling, 0 replies; 27+ messages in thread
From: Darrick J. Wong @ 2021-09-16  0:11 UTC (permalink / raw)
  To: Shiyang Ruan
  Cc: hch, linux-xfs, dan.j.williams, david, linux-fsdevel,
	linux-kernel, nvdimm, rgoldwyn, viro, willy

On Wed, Sep 15, 2021 at 06:44:57PM +0800, Shiyang Ruan wrote:
> Let dax_iomap_zero() support iter model.
> 
> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>

Oops, I guess we forgot this one when we did the iter conversion last
cycle. :(

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/dax.c               | 3 ++-
>  fs/iomap/buffered-io.c | 3 +--
>  include/linux/dax.h    | 3 ++-
>  3 files changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 41c93929f20b..4f346e25e488 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1209,8 +1209,9 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
>  }
>  #endif /* CONFIG_FS_DAX_PMD */
>  
> -s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap)
> +s64 dax_iomap_zero(struct iomap_iter *iter, loff_t pos, u64 length)
>  {
> +	const struct iomap *iomap = &iter->iomap;
>  	sector_t sector = iomap_sector(iomap, pos & PAGE_MASK);
>  	pgoff_t pgoff;
>  	long rc, id;
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 9cc5798423d1..84a861d3b3e0 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -889,7 +889,6 @@ static s64 __iomap_zero_iter(struct iomap_iter *iter, loff_t pos, u64 length)
>  
>  static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
>  {
> -	struct iomap *iomap = &iter->iomap;
>  	const struct iomap *srcmap = iomap_iter_srcmap(iter);
>  	loff_t pos = iter->pos;
>  	loff_t length = iomap_length(iter);
> @@ -903,7 +902,7 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
>  		s64 bytes;
>  
>  		if (IS_DAX(iter->inode))
> -			bytes = dax_iomap_zero(pos, length, iomap);
> +			bytes = dax_iomap_zero(iter, pos, length);
>  		else
>  			bytes = __iomap_zero_iter(iter, pos, length);
>  		if (bytes < 0)
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index 2619d94c308d..642de7ef1a10 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -13,6 +13,7 @@ typedef unsigned long dax_entry_t;
>  
>  struct iomap_ops;
>  struct iomap;
> +struct iomap_iter;
>  struct dax_device;
>  struct dax_operations {
>  	/*
> @@ -210,7 +211,7 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
>  int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
>  int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
>  				      pgoff_t index);
> -s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap);
> +s64 dax_iomap_zero(struct iomap_iter *iter, loff_t pos, u64 length);
>  static inline bool dax_mapping(struct address_space *mapping)
>  {
>  	return mapping->host && IS_DAX(mapping->host);
> -- 
> 2.33.0
> 
> 
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v9 7/8] xfs: support CoW in fsdax mode
  2021-09-15 10:45 ` [PATCH v9 7/8] xfs: support CoW in fsdax mode Shiyang Ruan
@ 2021-09-16  0:22   ` Darrick J. Wong
  2021-09-16  6:32     ` Christoph Hellwig
  2021-09-16  6:23   ` Christoph Hellwig
  1 sibling, 1 reply; 27+ messages in thread
From: Darrick J. Wong @ 2021-09-16  0:22 UTC (permalink / raw)
  To: Shiyang Ruan
  Cc: hch, linux-xfs, dan.j.williams, david, linux-fsdevel,
	linux-kernel, nvdimm, rgoldwyn, viro, willy

On Wed, Sep 15, 2021 at 06:45:00PM +0800, Shiyang Ruan wrote:
> In fsdax mode, WRITE and ZERO on a shared extent need CoW performed.
> After that, new allocated extents needs to be remapped to the file.
> So, add a CoW identification in ->iomap_begin(), and implement
> ->iomap_end() to do the remapping work.
> 
> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> ---
>  fs/xfs/xfs_bmap_util.c |  3 +--
>  fs/xfs/xfs_file.c      |  6 +++---
>  fs/xfs/xfs_iomap.c     | 38 +++++++++++++++++++++++++++++++++++++-
>  fs/xfs/xfs_iomap.h     | 30 ++++++++++++++++++++++++++++++
>  fs/xfs/xfs_iops.c      |  7 +++----
>  fs/xfs/xfs_reflink.c   |  3 +--
>  6 files changed, 75 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index 73a36b7be3bd..0681250e0a5d 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -1009,8 +1009,7 @@ xfs_free_file_space(
>  		return 0;
>  	if (offset + len > XFS_ISIZE(ip))
>  		len = XFS_ISIZE(ip) - offset;
> -	error = iomap_zero_range(VFS_I(ip), offset, len, NULL,
> -			&xfs_buffered_write_iomap_ops);
> +	error = xfs_iomap_zero_range(ip, offset, len, NULL);
>  	if (error)
>  		return error;
>  
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 7aa943edfc02..2ef1930374d2 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -704,7 +704,7 @@ xfs_file_dax_write(
>  	pos = iocb->ki_pos;
>  
>  	trace_xfs_file_dax_write(iocb, from);
> -	ret = dax_iomap_rw(iocb, from, &xfs_direct_write_iomap_ops);
> +	ret = dax_iomap_rw(iocb, from, &xfs_dax_write_iomap_ops);
>  	if (ret > 0 && iocb->ki_pos > i_size_read(inode)) {
>  		i_size_write(inode, iocb->ki_pos);
>  		error = xfs_setfilesize(ip, pos, ret);
> @@ -1329,8 +1329,8 @@ __xfs_filemap_fault(
>  		xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
>  		ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL,
>  				(write_fault && !vmf->cow_page) ?
> -				 &xfs_direct_write_iomap_ops :
> -				 &xfs_read_iomap_ops);
> +					&xfs_dax_write_iomap_ops :
> +					&xfs_read_iomap_ops);

Hmm... I wonder if this should get hoisted to a "xfs_dax_iomap_fault"
wrapper like you did for xfs_iomap_zero_range?

>  		if (ret & VM_FAULT_NEEDDSYNC)
>  			ret = dax_finish_sync_fault(vmf, pe_size, pfn);
>  		xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 093758440ad5..6fa3b377cb81 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -761,7 +761,8 @@ xfs_direct_write_iomap_begin(
>  
>  		/* may drop and re-acquire the ilock */
>  		error = xfs_reflink_allocate_cow(ip, &imap, &cmap, &shared,
> -				&lockmode, flags & IOMAP_DIRECT);
> +				&lockmode,
> +				(flags & IOMAP_DIRECT) || IS_DAX(inode));
>  		if (error)
>  			goto out_unlock;
>  		if (shared)
> @@ -854,6 +855,41 @@ const struct iomap_ops xfs_direct_write_iomap_ops = {
>  	.iomap_begin		= xfs_direct_write_iomap_begin,
>  };
>  
> +static int
> +xfs_dax_write_iomap_end(
> +	struct inode 		*inode,
> +	loff_t 			pos,
> +	loff_t 			length,
> +	ssize_t 		written,
> +	unsigned 		flags,
> +	struct iomap 		*iomap)

Whitespace nit:     ^ space before a tab.

> +{
> +	struct xfs_inode	*ip = XFS_I(inode);
> +	/*
> +	 * Usually we use @written to indicate whether the operation was
> +	 * successful.  But it is always positive or zero.  The CoW needs the
> +	 * actual error code from actor().  So, get it from
> +	 * iomap_iter->processed.

Hm.  All six arguments are derived from the struct iomap_iter, so maybe
it makes more sense to pass that in?  I'll poke around with this more
tomorrow.

> +	 */
> +	const struct iomap_iter *iter =
> +				container_of(iomap, typeof(*iter), iomap);
> +
> +	if (!xfs_is_cow_inode(ip))
> +		return 0;
> +
> +	if (iter->processed <= 0) {
> +		xfs_reflink_cancel_cow_range(ip, pos, length, true);
> +		return 0;
> +	}
> +
> +	return xfs_reflink_end_cow(ip, pos, iter->processed);
> +}
> +
> +const struct iomap_ops xfs_dax_write_iomap_ops = {
> +	.iomap_begin 	= xfs_direct_write_iomap_begin,

Space before tab    ^

> +	.iomap_end	= xfs_dax_write_iomap_end,
> +};
> +
>  static int
>  xfs_buffered_write_iomap_begin(

Also, we have an related request to drop the EXPERIMENTAL tag for
non-DAX reflink.  Whichever patch enables dax+reflink for xfs needs to
make it clear that reflink + any possibility of DAX emits an
EXPERIMENTAL warning.

--D

>  	struct inode		*inode,
> diff --git a/fs/xfs/xfs_iomap.h b/fs/xfs/xfs_iomap.h
> index 7d3703556d0e..92679a0c3578 100644
> --- a/fs/xfs/xfs_iomap.h
> +++ b/fs/xfs/xfs_iomap.h
> @@ -45,5 +45,35 @@ extern const struct iomap_ops xfs_direct_write_iomap_ops;
>  extern const struct iomap_ops xfs_read_iomap_ops;
>  extern const struct iomap_ops xfs_seek_iomap_ops;
>  extern const struct iomap_ops xfs_xattr_iomap_ops;
> +extern const struct iomap_ops xfs_dax_write_iomap_ops;
> +
> +static inline int
> +xfs_iomap_zero_range(
> +	struct xfs_inode	*ip,
> +	loff_t			pos,
> +	loff_t			len,
> +	bool			*did_zero)
> +{
> +	struct inode		*inode = VFS_I(ip);
> +
> +	return iomap_zero_range(inode, pos, len, did_zero,
> +			IS_DAX(inode) ?
> +				&xfs_dax_write_iomap_ops :
> +				&xfs_buffered_write_iomap_ops);
> +}
> +
> +static inline int
> +xfs_iomap_truncate_page(
> +	struct xfs_inode	*ip,
> +	loff_t			pos,
> +	bool			*did_zero)
> +{
> +	struct inode		*inode = VFS_I(ip);
> +
> +	return iomap_truncate_page(inode, pos, did_zero,
> +			IS_DAX(inode)?
> +				&xfs_dax_write_iomap_ops :
> +				&xfs_buffered_write_iomap_ops);
> +}
>  
>  #endif /* __XFS_IOMAP_H__*/
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index a607d6aca5c4..332e6208dffd 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -911,8 +911,8 @@ xfs_setattr_size(
>  	 */
>  	if (newsize > oldsize) {
>  		trace_xfs_zero_eof(ip, oldsize, newsize - oldsize);
> -		error = iomap_zero_range(inode, oldsize, newsize - oldsize,
> -				&did_zeroing, &xfs_buffered_write_iomap_ops);
> +		error = xfs_iomap_zero_range(ip, oldsize, newsize - oldsize,
> +				&did_zeroing);
>  	} else {
>  		/*
>  		 * iomap won't detect a dirty page over an unwritten block (or a
> @@ -924,8 +924,7 @@ xfs_setattr_size(
>  						     newsize);
>  		if (error)
>  			return error;
> -		error = iomap_truncate_page(inode, newsize, &did_zeroing,
> -				&xfs_buffered_write_iomap_ops);
> +		error = xfs_iomap_truncate_page(ip, newsize, &did_zeroing);
>  	}
>  
>  	if (error)
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index 7ecea0311e88..9d876e268734 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -1269,8 +1269,7 @@ xfs_reflink_zero_posteof(
>  		return 0;
>  
>  	trace_xfs_zero_eof(ip, isize, pos - isize);
> -	return iomap_zero_range(VFS_I(ip), isize, pos - isize, NULL,
> -			&xfs_buffered_write_iomap_ops);
> +	return xfs_iomap_zero_range(ip, isize, pos - isize, NULL);
>  }
>  
>  /*
> -- 
> 2.33.0
> 
> 
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v9 8/8] xfs: Add dax dedupe support
  2021-09-15 10:45 ` [PATCH v9 8/8] xfs: Add dax dedupe support Shiyang Ruan
@ 2021-09-16  0:30   ` Darrick J. Wong
  2021-09-16  4:01     ` Shiyang Ruan
  0 siblings, 1 reply; 27+ messages in thread
From: Darrick J. Wong @ 2021-09-16  0:30 UTC (permalink / raw)
  To: Shiyang Ruan
  Cc: hch, linux-xfs, dan.j.williams, david, linux-fsdevel,
	linux-kernel, nvdimm, rgoldwyn, viro, willy

On Wed, Sep 15, 2021 at 06:45:01PM +0800, Shiyang Ruan wrote:
> Introduce xfs_mmaplock_two_inodes_and_break_dax_layout() for dax files
> who are going to be deduped.  After that, call compare range function
> only when files are both DAX or not.
> 
> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/xfs/xfs_file.c    |  2 +-
>  fs/xfs/xfs_inode.c   | 80 +++++++++++++++++++++++++++++++++++++++++---
>  fs/xfs/xfs_inode.h   |  1 +
>  fs/xfs/xfs_reflink.c |  4 +--
>  4 files changed, 80 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 2ef1930374d2..c3061723613c 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -846,7 +846,7 @@ xfs_wait_dax_page(
>  	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
>  }
>  
> -static int
> +int
>  xfs_break_dax_layouts(
>  	struct inode		*inode,
>  	bool			*retry)
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index a4f6f034fb81..bdc084cdbf46 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -3790,6 +3790,61 @@ xfs_iolock_two_inodes_and_break_layout(
>  	return 0;
>  }
>  
> +static int
> +xfs_mmaplock_two_inodes_and_break_dax_layout(
> +	struct xfs_inode	*ip1,
> +	struct xfs_inode	*ip2)
> +{
> +	int			error, attempts = 0;
> +	bool			retry;
> +	struct page		*page;
> +	struct xfs_log_item	*lp;
> +
> +	if (ip1->i_ino > ip2->i_ino)
> +		swap(ip1, ip2);
> +
> +again:
> +	retry = false;
> +	/* Lock the first inode */
> +	xfs_ilock(ip1, XFS_MMAPLOCK_EXCL);
> +	error = xfs_break_dax_layouts(VFS_I(ip1), &retry);
> +	if (error || retry) {
> +		xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL);
> +		if (error == 0 && retry)
> +			goto again;
> +		return error;
> +	}
> +
> +	if (ip1 == ip2)
> +		return 0;
> +
> +	/* Nested lock the second inode */
> +	lp = &ip1->i_itemp->ili_item;
> +	if (lp && test_bit(XFS_LI_IN_AIL, &lp->li_flags)) {
> +		if (!xfs_ilock_nowait(ip2,
> +		    xfs_lock_inumorder(XFS_MMAPLOCK_EXCL, 1))) {
> +			xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL);
> +			if ((++attempts % 5) == 0)
> +				delay(1); /* Don't just spin the CPU */
> +			goto again;
> +		}

I suspect we don't need this part for grabbing the MMAPLOCK^W pagecache
invalidatelock.  The AIL only grabs the ILOCK, never the IOLOCK or the
MMAPLOCK.

> +	} else
> +		xfs_ilock(ip2, xfs_lock_inumorder(XFS_MMAPLOCK_EXCL, 1));
> +	/*
> +	 * We cannot use xfs_break_dax_layouts() directly here because it may
> +	 * need to unlock & lock the XFS_MMAPLOCK_EXCL which is not suitable
> +	 * for this nested lock case.
> +	 */
> +	page = dax_layout_busy_page(VFS_I(ip2)->i_mapping);
> +	if (page && page_ref_count(page) != 1) {

Do you think the patch "ext4/xfs: add page refcount helper" would be a
good cleanup to head this series?

https://lore.kernel.org/linux-xfs/20210913161604.31981-1-alex.sierra@amd.com/T/#m59cf7cd5c0d521ad487fa3a15d31c3865db88bdf

The rest of the logic looks ok.

--D

> +		xfs_iunlock(ip2, XFS_MMAPLOCK_EXCL);
> +		xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL);
> +		goto again;
> +	}
> +
> +	return 0;
> +}
> +
>  /*
>   * Lock two inodes so that userspace cannot initiate I/O via file syscalls or
>   * mmap activity.
> @@ -3804,8 +3859,19 @@ xfs_ilock2_io_mmap(
>  	ret = xfs_iolock_two_inodes_and_break_layout(VFS_I(ip1), VFS_I(ip2));
>  	if (ret)
>  		return ret;
> -	filemap_invalidate_lock_two(VFS_I(ip1)->i_mapping,
> -				    VFS_I(ip2)->i_mapping);
> +
> +	if (IS_DAX(VFS_I(ip1)) && IS_DAX(VFS_I(ip2))) {
> +		ret = xfs_mmaplock_two_inodes_and_break_dax_layout(ip1, ip2);
> +		if (ret) {
> +			inode_unlock(VFS_I(ip2));
> +			if (ip1 != ip2)
> +				inode_unlock(VFS_I(ip1));
> +			return ret;
> +		}
> +	} else
> +		filemap_invalidate_lock_two(VFS_I(ip1)->i_mapping,
> +					    VFS_I(ip2)->i_mapping);
> +
>  	return 0;
>  }
>  
> @@ -3815,8 +3881,14 @@ xfs_iunlock2_io_mmap(
>  	struct xfs_inode	*ip1,
>  	struct xfs_inode	*ip2)
>  {
> -	filemap_invalidate_unlock_two(VFS_I(ip1)->i_mapping,
> -				      VFS_I(ip2)->i_mapping);
> +	if (IS_DAX(VFS_I(ip1)) && IS_DAX(VFS_I(ip2))) {
> +		xfs_iunlock(ip2, XFS_MMAPLOCK_EXCL);
> +		if (ip1 != ip2)
> +			xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL);
> +	} else
> +		filemap_invalidate_unlock_two(VFS_I(ip1)->i_mapping,
> +					      VFS_I(ip2)->i_mapping);
> +
>  	inode_unlock(VFS_I(ip2));
>  	if (ip1 != ip2)
>  		inode_unlock(VFS_I(ip1));
> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index b21b177832d1..f7e26fe31a26 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -472,6 +472,7 @@ enum xfs_prealloc_flags {
>  
>  int	xfs_update_prealloc_flags(struct xfs_inode *ip,
>  				  enum xfs_prealloc_flags flags);
> +int	xfs_break_dax_layouts(struct inode *inode, bool *retry);
>  int	xfs_break_layouts(struct inode *inode, uint *iolock,
>  		enum layout_break_reason reason);
>  
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index 9d876e268734..3b99c9dfcf0d 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -1327,8 +1327,8 @@ xfs_reflink_remap_prep(
>  	if (XFS_IS_REALTIME_INODE(src) || XFS_IS_REALTIME_INODE(dest))
>  		goto out_unlock;
>  
> -	/* Don't share DAX file data for now. */
> -	if (IS_DAX(inode_in) || IS_DAX(inode_out))
> +	/* Don't share DAX file data with non-DAX file. */
> +	if (IS_DAX(inode_in) != IS_DAX(inode_out))
>  		goto out_unlock;
>  
>  	if (!IS_DAX(inode_in))
> -- 
> 2.33.0
> 
> 
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v9 1/8] fsdax: Output address in dax_iomap_pfn() and rename it
  2021-09-16  0:09   ` Darrick J. Wong
@ 2021-09-16  1:36     ` Shiyang Ruan
  0 siblings, 0 replies; 27+ messages in thread
From: Shiyang Ruan @ 2021-09-16  1:36 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: hch, linux-xfs, dan.j.williams, david, linux-fsdevel,
	linux-kernel, nvdimm, rgoldwyn, viro, willy, Ritesh Harjani



On 2021/9/16 8:09, Darrick J. Wong wrote:
> On Wed, Sep 15, 2021 at 06:44:54PM +0800, Shiyang Ruan wrote:
>> Add address output in dax_iomap_pfn() in order to perform a memcpy() in
>> CoW case.  Since this function both output address and pfn, rename it to
>> dax_iomap_direct_access().
>>
>> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
>> Reviewed-by: Christoph Hellwig <hch@lst.de>
>> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
>> Reviewed-by: Dan Williams <dan.j.williams@intel.com>
> 
> Could've sworn I reviewed this a few revisions ago...

Oh, sorry, Maybe I missed that.

> Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Thanks!

--
Ruan

> 
> --D
> 
>> ---
>>   fs/dax.c | 16 ++++++++++++----
>>   1 file changed, 12 insertions(+), 4 deletions(-)
>>
>> diff --git a/fs/dax.c b/fs/dax.c
>> index 4e3e5a283a91..8b482a58acae 100644
>> --- a/fs/dax.c
>> +++ b/fs/dax.c
>> @@ -1010,8 +1010,8 @@ static sector_t dax_iomap_sector(const struct iomap *iomap, loff_t pos)
>>   	return (iomap->addr + (pos & PAGE_MASK) - iomap->offset) >> 9;
>>   }
>>   
>> -static int dax_iomap_pfn(const struct iomap *iomap, loff_t pos, size_t size,
>> -			 pfn_t *pfnp)
>> +static int dax_iomap_direct_access(const struct iomap *iomap, loff_t pos,
>> +		size_t size, void **kaddr, pfn_t *pfnp)
>>   {
>>   	const sector_t sector = dax_iomap_sector(iomap, pos);
>>   	pgoff_t pgoff;
>> @@ -1023,11 +1023,13 @@ static int dax_iomap_pfn(const struct iomap *iomap, loff_t pos, size_t size,
>>   		return rc;
>>   	id = dax_read_lock();
>>   	length = dax_direct_access(iomap->dax_dev, pgoff, PHYS_PFN(size),
>> -				   NULL, pfnp);
>> +				   kaddr, pfnp);
>>   	if (length < 0) {
>>   		rc = length;
>>   		goto out;
>>   	}
>> +	if (!pfnp)
>> +		goto out_check_addr;
>>   	rc = -EINVAL;
>>   	if (PFN_PHYS(length) < size)
>>   		goto out;
>> @@ -1037,6 +1039,12 @@ static int dax_iomap_pfn(const struct iomap *iomap, loff_t pos, size_t size,
>>   	if (length > 1 && !pfn_t_devmap(*pfnp))
>>   		goto out;
>>   	rc = 0;
>> +
>> +out_check_addr:
>> +	if (!kaddr)
>> +		goto out;
>> +	if (!*kaddr)
>> +		rc = -EFAULT;
>>   out:
>>   	dax_read_unlock(id);
>>   	return rc;
>> @@ -1401,7 +1409,7 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
>>   		return pmd ? VM_FAULT_FALLBACK : VM_FAULT_SIGBUS;
>>   	}
>>   
>> -	err = dax_iomap_pfn(&iter->iomap, pos, size, &pfn);
>> +	err = dax_iomap_direct_access(&iter->iomap, pos, size, NULL, &pfn);
>>   	if (err)
>>   		return pmd ? VM_FAULT_FALLBACK : dax_fault_return(err);
>>   
>> -- 
>> 2.33.0
>>
>>
>>



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v9 8/8] xfs: Add dax dedupe support
  2021-09-16  0:30   ` Darrick J. Wong
@ 2021-09-16  4:01     ` Shiyang Ruan
  2021-09-16  4:18       ` Darrick J. Wong
  0 siblings, 1 reply; 27+ messages in thread
From: Shiyang Ruan @ 2021-09-16  4:01 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: hch, linux-xfs, dan.j.williams, david, linux-fsdevel,
	linux-kernel, nvdimm, rgoldwyn, viro, willy



On 2021/9/16 8:30, Darrick J. Wong wrote:
> On Wed, Sep 15, 2021 at 06:45:01PM +0800, Shiyang Ruan wrote:
>> Introduce xfs_mmaplock_two_inodes_and_break_dax_layout() for dax files
>> who are going to be deduped.  After that, call compare range function
>> only when files are both DAX or not.
>>
>> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
>> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
>> Reviewed-by: Christoph Hellwig <hch@lst.de>
>> ---
>>   fs/xfs/xfs_file.c    |  2 +-
>>   fs/xfs/xfs_inode.c   | 80 +++++++++++++++++++++++++++++++++++++++++---
>>   fs/xfs/xfs_inode.h   |  1 +
>>   fs/xfs/xfs_reflink.c |  4 +--
>>   4 files changed, 80 insertions(+), 7 deletions(-)
>>
>> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
>> index 2ef1930374d2..c3061723613c 100644
>> --- a/fs/xfs/xfs_file.c
>> +++ b/fs/xfs/xfs_file.c
>> @@ -846,7 +846,7 @@ xfs_wait_dax_page(
>>   	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
>>   }
>>   
>> -static int
>> +int
>>   xfs_break_dax_layouts(
>>   	struct inode		*inode,
>>   	bool			*retry)
>> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
>> index a4f6f034fb81..bdc084cdbf46 100644
>> --- a/fs/xfs/xfs_inode.c
>> +++ b/fs/xfs/xfs_inode.c
>> @@ -3790,6 +3790,61 @@ xfs_iolock_two_inodes_and_break_layout(
>>   	return 0;
>>   }
>>   
>> +static int
>> +xfs_mmaplock_two_inodes_and_break_dax_layout(
>> +	struct xfs_inode	*ip1,
>> +	struct xfs_inode	*ip2)
>> +{
>> +	int			error, attempts = 0;
>> +	bool			retry;
>> +	struct page		*page;
>> +	struct xfs_log_item	*lp;
>> +
>> +	if (ip1->i_ino > ip2->i_ino)
>> +		swap(ip1, ip2);
>> +
>> +again:
>> +	retry = false;
>> +	/* Lock the first inode */
>> +	xfs_ilock(ip1, XFS_MMAPLOCK_EXCL);
>> +	error = xfs_break_dax_layouts(VFS_I(ip1), &retry);
>> +	if (error || retry) {
>> +		xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL);
>> +		if (error == 0 && retry)
>> +			goto again;
>> +		return error;
>> +	}
>> +
>> +	if (ip1 == ip2)
>> +		return 0;
>> +
>> +	/* Nested lock the second inode */
>> +	lp = &ip1->i_itemp->ili_item;
>> +	if (lp && test_bit(XFS_LI_IN_AIL, &lp->li_flags)) {
>> +		if (!xfs_ilock_nowait(ip2,
>> +		    xfs_lock_inumorder(XFS_MMAPLOCK_EXCL, 1))) {
>> +			xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL);
>> +			if ((++attempts % 5) == 0)
>> +				delay(1); /* Don't just spin the CPU */
>> +			goto again;
>> +		}
> 
> I suspect we don't need this part for grabbing the MMAPLOCK^W pagecache
> invalidatelock.  The AIL only grabs the ILOCK, never the IOLOCK or the
> MMAPLOCK.

Maybe I have misunderstood this part.

What I want is to lock the two inode nestedly.  This code is copied from 
xfs_lock_two_inodes(), which checks this AIL during locking two inode 
with each of the three kinds of locks.

But I also found the recent merged function: 
filemap_invalidate_lock_two() just locks two inode directly without 
checking AIL.  So, I am not if the AIL check is needed in this case.

> 
>> +	} else
>> +		xfs_ilock(ip2, xfs_lock_inumorder(XFS_MMAPLOCK_EXCL, 1));
>> +	/*
>> +	 * We cannot use xfs_break_dax_layouts() directly here because it may
>> +	 * need to unlock & lock the XFS_MMAPLOCK_EXCL which is not suitable
>> +	 * for this nested lock case.
>> +	 */
>> +	page = dax_layout_busy_page(VFS_I(ip2)->i_mapping);
>> +	if (page && page_ref_count(page) != 1) {
> 
> Do you think the patch "ext4/xfs: add page refcount helper" would be a
> good cleanup to head this series?
> 
> https://lore.kernel.org/linux-xfs/20210913161604.31981-1-alex.sierra@amd.com/T/#m59cf7cd5c0d521ad487fa3a15d31c3865db88bdf

Got it.


--
Thanks,
Ruan

> 
> The rest of the logic looks ok.
> 
> --D
> 
>> +		xfs_iunlock(ip2, XFS_MMAPLOCK_EXCL);
>> +		xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL);
>> +		goto again;
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>>   /*
>>    * Lock two inodes so that userspace cannot initiate I/O via file syscalls or
>>    * mmap activity.
>> @@ -3804,8 +3859,19 @@ xfs_ilock2_io_mmap(
>>   	ret = xfs_iolock_two_inodes_and_break_layout(VFS_I(ip1), VFS_I(ip2));
>>   	if (ret)
>>   		return ret;
>> -	filemap_invalidate_lock_two(VFS_I(ip1)->i_mapping,
>> -				    VFS_I(ip2)->i_mapping);
>> +
>> +	if (IS_DAX(VFS_I(ip1)) && IS_DAX(VFS_I(ip2))) {
>> +		ret = xfs_mmaplock_two_inodes_and_break_dax_layout(ip1, ip2);
>> +		if (ret) {
>> +			inode_unlock(VFS_I(ip2));
>> +			if (ip1 != ip2)
>> +				inode_unlock(VFS_I(ip1));
>> +			return ret;
>> +		}
>> +	} else
>> +		filemap_invalidate_lock_two(VFS_I(ip1)->i_mapping,
>> +					    VFS_I(ip2)->i_mapping);
>> +
>>   	return 0;
>>   }
>>   
>> @@ -3815,8 +3881,14 @@ xfs_iunlock2_io_mmap(
>>   	struct xfs_inode	*ip1,
>>   	struct xfs_inode	*ip2)
>>   {
>> -	filemap_invalidate_unlock_two(VFS_I(ip1)->i_mapping,
>> -				      VFS_I(ip2)->i_mapping);
>> +	if (IS_DAX(VFS_I(ip1)) && IS_DAX(VFS_I(ip2))) {
>> +		xfs_iunlock(ip2, XFS_MMAPLOCK_EXCL);
>> +		if (ip1 != ip2)
>> +			xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL);
>> +	} else
>> +		filemap_invalidate_unlock_two(VFS_I(ip1)->i_mapping,
>> +					      VFS_I(ip2)->i_mapping);
>> +
>>   	inode_unlock(VFS_I(ip2));
>>   	if (ip1 != ip2)
>>   		inode_unlock(VFS_I(ip1));
>> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
>> index b21b177832d1..f7e26fe31a26 100644
>> --- a/fs/xfs/xfs_inode.h
>> +++ b/fs/xfs/xfs_inode.h
>> @@ -472,6 +472,7 @@ enum xfs_prealloc_flags {
>>   
>>   int	xfs_update_prealloc_flags(struct xfs_inode *ip,
>>   				  enum xfs_prealloc_flags flags);
>> +int	xfs_break_dax_layouts(struct inode *inode, bool *retry);
>>   int	xfs_break_layouts(struct inode *inode, uint *iolock,
>>   		enum layout_break_reason reason);
>>   
>> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
>> index 9d876e268734..3b99c9dfcf0d 100644
>> --- a/fs/xfs/xfs_reflink.c
>> +++ b/fs/xfs/xfs_reflink.c
>> @@ -1327,8 +1327,8 @@ xfs_reflink_remap_prep(
>>   	if (XFS_IS_REALTIME_INODE(src) || XFS_IS_REALTIME_INODE(dest))
>>   		goto out_unlock;
>>   
>> -	/* Don't share DAX file data for now. */
>> -	if (IS_DAX(inode_in) || IS_DAX(inode_out))
>> +	/* Don't share DAX file data with non-DAX file. */
>> +	if (IS_DAX(inode_in) != IS_DAX(inode_out))
>>   		goto out_unlock;
>>   
>>   	if (!IS_DAX(inode_in))
>> -- 
>> 2.33.0
>>
>>
>>



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v9 8/8] xfs: Add dax dedupe support
  2021-09-16  4:01     ` Shiyang Ruan
@ 2021-09-16  4:18       ` Darrick J. Wong
  0 siblings, 0 replies; 27+ messages in thread
From: Darrick J. Wong @ 2021-09-16  4:18 UTC (permalink / raw)
  To: Shiyang Ruan
  Cc: hch, linux-xfs, dan.j.williams, david, linux-fsdevel,
	linux-kernel, nvdimm, rgoldwyn, viro, willy

On Thu, Sep 16, 2021 at 12:01:18PM +0800, Shiyang Ruan wrote:
> 
> 
> On 2021/9/16 8:30, Darrick J. Wong wrote:
> > On Wed, Sep 15, 2021 at 06:45:01PM +0800, Shiyang Ruan wrote:
> > > Introduce xfs_mmaplock_two_inodes_and_break_dax_layout() for dax files
> > > who are going to be deduped.  After that, call compare range function
> > > only when files are both DAX or not.
> > > 
> > > Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> > > Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> > > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > > ---
> > >   fs/xfs/xfs_file.c    |  2 +-
> > >   fs/xfs/xfs_inode.c   | 80 +++++++++++++++++++++++++++++++++++++++++---
> > >   fs/xfs/xfs_inode.h   |  1 +
> > >   fs/xfs/xfs_reflink.c |  4 +--
> > >   4 files changed, 80 insertions(+), 7 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > > index 2ef1930374d2..c3061723613c 100644
> > > --- a/fs/xfs/xfs_file.c
> > > +++ b/fs/xfs/xfs_file.c
> > > @@ -846,7 +846,7 @@ xfs_wait_dax_page(
> > >   	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
> > >   }
> > > -static int
> > > +int
> > >   xfs_break_dax_layouts(
> > >   	struct inode		*inode,
> > >   	bool			*retry)
> > > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > > index a4f6f034fb81..bdc084cdbf46 100644
> > > --- a/fs/xfs/xfs_inode.c
> > > +++ b/fs/xfs/xfs_inode.c
> > > @@ -3790,6 +3790,61 @@ xfs_iolock_two_inodes_and_break_layout(
> > >   	return 0;
> > >   }
> > > +static int
> > > +xfs_mmaplock_two_inodes_and_break_dax_layout(
> > > +	struct xfs_inode	*ip1,
> > > +	struct xfs_inode	*ip2)
> > > +{
> > > +	int			error, attempts = 0;
> > > +	bool			retry;
> > > +	struct page		*page;
> > > +	struct xfs_log_item	*lp;
> > > +
> > > +	if (ip1->i_ino > ip2->i_ino)
> > > +		swap(ip1, ip2);
> > > +
> > > +again:
> > > +	retry = false;
> > > +	/* Lock the first inode */
> > > +	xfs_ilock(ip1, XFS_MMAPLOCK_EXCL);
> > > +	error = xfs_break_dax_layouts(VFS_I(ip1), &retry);
> > > +	if (error || retry) {
> > > +		xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL);
> > > +		if (error == 0 && retry)
> > > +			goto again;
> > > +		return error;
> > > +	}
> > > +
> > > +	if (ip1 == ip2)
> > > +		return 0;
> > > +
> > > +	/* Nested lock the second inode */
> > > +	lp = &ip1->i_itemp->ili_item;
> > > +	if (lp && test_bit(XFS_LI_IN_AIL, &lp->li_flags)) {
> > > +		if (!xfs_ilock_nowait(ip2,
> > > +		    xfs_lock_inumorder(XFS_MMAPLOCK_EXCL, 1))) {
> > > +			xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL);
> > > +			if ((++attempts % 5) == 0)
> > > +				delay(1); /* Don't just spin the CPU */
> > > +			goto again;
> > > +		}
> > 
> > I suspect we don't need this part for grabbing the MMAPLOCK^W pagecache
> > invalidatelock.  The AIL only grabs the ILOCK, never the IOLOCK or the
> > MMAPLOCK.
> 
> Maybe I have misunderstood this part.
> 
> What I want is to lock the two inode nestedly.  This code is copied from
> xfs_lock_two_inodes(), which checks this AIL during locking two inode with
> each of the three kinds of locks.

<nod> It's totally reasonable to copy-paste the function you want and
change it as needed...

> But I also found the recent merged function: filemap_invalidate_lock_two()
> just locks two inode directly without checking AIL.  So, I am not if the AIL
> check is needed in this case.

...especially when even the maintainer is only 99% sure that the AIL
checking chunk here can be removed.  Anyone else have an opinion?

--D

> > 
> > > +	} else
> > > +		xfs_ilock(ip2, xfs_lock_inumorder(XFS_MMAPLOCK_EXCL, 1));
> > > +	/*
> > > +	 * We cannot use xfs_break_dax_layouts() directly here because it may
> > > +	 * need to unlock & lock the XFS_MMAPLOCK_EXCL which is not suitable
> > > +	 * for this nested lock case.
> > > +	 */
> > > +	page = dax_layout_busy_page(VFS_I(ip2)->i_mapping);
> > > +	if (page && page_ref_count(page) != 1) {
> > 
> > Do you think the patch "ext4/xfs: add page refcount helper" would be a
> > good cleanup to head this series?
> > 
> > https://lore.kernel.org/linux-xfs/20210913161604.31981-1-alex.sierra@amd.com/T/#m59cf7cd5c0d521ad487fa3a15d31c3865db88bdf
> 
> Got it.
> 
> 
> --
> Thanks,
> Ruan
> 
> > 
> > The rest of the logic looks ok.
> > 
> > --D
> > 
> > > +		xfs_iunlock(ip2, XFS_MMAPLOCK_EXCL);
> > > +		xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL);
> > > +		goto again;
> > > +	}
> > > +
> > > +	return 0;
> > > +}
> > > +
> > >   /*
> > >    * Lock two inodes so that userspace cannot initiate I/O via file syscalls or
> > >    * mmap activity.
> > > @@ -3804,8 +3859,19 @@ xfs_ilock2_io_mmap(
> > >   	ret = xfs_iolock_two_inodes_and_break_layout(VFS_I(ip1), VFS_I(ip2));
> > >   	if (ret)
> > >   		return ret;
> > > -	filemap_invalidate_lock_two(VFS_I(ip1)->i_mapping,
> > > -				    VFS_I(ip2)->i_mapping);
> > > +
> > > +	if (IS_DAX(VFS_I(ip1)) && IS_DAX(VFS_I(ip2))) {
> > > +		ret = xfs_mmaplock_two_inodes_and_break_dax_layout(ip1, ip2);
> > > +		if (ret) {
> > > +			inode_unlock(VFS_I(ip2));
> > > +			if (ip1 != ip2)
> > > +				inode_unlock(VFS_I(ip1));
> > > +			return ret;
> > > +		}
> > > +	} else
> > > +		filemap_invalidate_lock_two(VFS_I(ip1)->i_mapping,
> > > +					    VFS_I(ip2)->i_mapping);
> > > +
> > >   	return 0;
> > >   }
> > > @@ -3815,8 +3881,14 @@ xfs_iunlock2_io_mmap(
> > >   	struct xfs_inode	*ip1,
> > >   	struct xfs_inode	*ip2)
> > >   {
> > > -	filemap_invalidate_unlock_two(VFS_I(ip1)->i_mapping,
> > > -				      VFS_I(ip2)->i_mapping);
> > > +	if (IS_DAX(VFS_I(ip1)) && IS_DAX(VFS_I(ip2))) {
> > > +		xfs_iunlock(ip2, XFS_MMAPLOCK_EXCL);
> > > +		if (ip1 != ip2)
> > > +			xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL);
> > > +	} else
> > > +		filemap_invalidate_unlock_two(VFS_I(ip1)->i_mapping,
> > > +					      VFS_I(ip2)->i_mapping);
> > > +
> > >   	inode_unlock(VFS_I(ip2));
> > >   	if (ip1 != ip2)
> > >   		inode_unlock(VFS_I(ip1));
> > > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> > > index b21b177832d1..f7e26fe31a26 100644
> > > --- a/fs/xfs/xfs_inode.h
> > > +++ b/fs/xfs/xfs_inode.h
> > > @@ -472,6 +472,7 @@ enum xfs_prealloc_flags {
> > >   int	xfs_update_prealloc_flags(struct xfs_inode *ip,
> > >   				  enum xfs_prealloc_flags flags);
> > > +int	xfs_break_dax_layouts(struct inode *inode, bool *retry);
> > >   int	xfs_break_layouts(struct inode *inode, uint *iolock,
> > >   		enum layout_break_reason reason);
> > > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > > index 9d876e268734..3b99c9dfcf0d 100644
> > > --- a/fs/xfs/xfs_reflink.c
> > > +++ b/fs/xfs/xfs_reflink.c
> > > @@ -1327,8 +1327,8 @@ xfs_reflink_remap_prep(
> > >   	if (XFS_IS_REALTIME_INODE(src) || XFS_IS_REALTIME_INODE(dest))
> > >   		goto out_unlock;
> > > -	/* Don't share DAX file data for now. */
> > > -	if (IS_DAX(inode_in) || IS_DAX(inode_out))
> > > +	/* Don't share DAX file data with non-DAX file. */
> > > +	if (IS_DAX(inode_in) != IS_DAX(inode_out))
> > >   		goto out_unlock;
> > >   	if (!IS_DAX(inode_in))
> > > -- 
> > > 2.33.0
> > > 
> > > 
> > > 
> 
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v9 4/8] fsdax: Convert dax_iomap_zero to iter model
  2021-09-15 10:44 ` [PATCH v9 4/8] fsdax: Convert dax_iomap_zero to iter model Shiyang Ruan
  2021-09-16  0:11   ` Darrick J. Wong
@ 2021-09-16  6:15   ` Christoph Hellwig
  1 sibling, 0 replies; 27+ messages in thread
From: Christoph Hellwig @ 2021-09-16  6:15 UTC (permalink / raw)
  To: Shiyang Ruan
  Cc: djwong, hch, linux-xfs, dan.j.williams, david, linux-fsdevel,
	linux-kernel, nvdimm, rgoldwyn, viro, willy

> -s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap)
> +s64 dax_iomap_zero(struct iomap_iter *iter, loff_t pos, u64 length)

I think we can also mark the iter const.

Otherwise looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v9 5/8] fsdax: Add dax_iomap_cow_copy() for dax_iomap_zero
  2021-09-15 10:44 ` [PATCH v9 5/8] fsdax: Add dax_iomap_cow_copy() for dax_iomap_zero Shiyang Ruan
@ 2021-09-16  6:16   ` Christoph Hellwig
  2021-09-16  8:49     ` Shiyang Ruan
  0 siblings, 1 reply; 27+ messages in thread
From: Christoph Hellwig @ 2021-09-16  6:16 UTC (permalink / raw)
  To: Shiyang Ruan
  Cc: djwong, hch, linux-xfs, dan.j.williams, david, linux-fsdevel,
	linux-kernel, nvdimm, rgoldwyn, viro, willy, Ritesh Harjani

On Wed, Sep 15, 2021 at 06:44:58PM +0800, Shiyang Ruan wrote:
> +	rc = dax_direct_access(iomap->dax_dev, pgoff, 1, &kaddr, NULL);
> +	if (rc < 0)
> +		goto out;
> +	memset(kaddr + offset, 0, size);
> +	if (srcmap->addr != IOMAP_HOLE && srcmap->addr != iomap->addr) {

Should we also check that ->dax_dev for iomap and srcmap are different
first to deal with case of file system with multiple devices?

Otherwise looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v9 6/8] fsdax: Dedup file range to use a compare function
  2021-09-15 10:44 ` [PATCH v9 6/8] fsdax: Dedup file range to use a compare function Shiyang Ruan
@ 2021-09-16  6:17   ` Christoph Hellwig
  0 siblings, 0 replies; 27+ messages in thread
From: Christoph Hellwig @ 2021-09-16  6:17 UTC (permalink / raw)
  To: Shiyang Ruan
  Cc: djwong, hch, linux-xfs, dan.j.williams, david, linux-fsdevel,
	linux-kernel, nvdimm, rgoldwyn, viro, willy, Goldwyn Rodrigues

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v9 7/8] xfs: support CoW in fsdax mode
  2021-09-15 10:45 ` [PATCH v9 7/8] xfs: support CoW in fsdax mode Shiyang Ruan
  2021-09-16  0:22   ` Darrick J. Wong
@ 2021-09-16  6:23   ` Christoph Hellwig
  2021-09-16  8:51     ` Shiyang Ruan
  1 sibling, 1 reply; 27+ messages in thread
From: Christoph Hellwig @ 2021-09-16  6:23 UTC (permalink / raw)
  To: Shiyang Ruan
  Cc: djwong, hch, linux-xfs, dan.j.williams, david, linux-fsdevel,
	linux-kernel, nvdimm, rgoldwyn, viro, willy

On Wed, Sep 15, 2021 at 06:45:00PM +0800, Shiyang Ruan wrote:
> +static int
> +xfs_dax_write_iomap_end(
> +	struct inode 		*inode,
> +	loff_t 			pos,
> +	loff_t 			length,
> +	ssize_t 		written,
> +	unsigned 		flags,
> +	struct iomap 		*iomap)
> +{
> +	struct xfs_inode	*ip = XFS_I(inode);
> +	/*
> +	 * Usually we use @written to indicate whether the operation was
> +	 * successful.  But it is always positive or zero.  The CoW needs the
> +	 * actual error code from actor().  So, get it from
> +	 * iomap_iter->processed.
> +	 */
> +	const struct iomap_iter *iter =
> +				container_of(iomap, typeof(*iter), iomap);
> +
> +	if (!xfs_is_cow_inode(ip))
> +		return 0;
> +
> +	if (iter->processed <= 0) {
> +		xfs_reflink_cancel_cow_range(ip, pos, length, true);
> +		return 0;
> +	}
> +
> +	return xfs_reflink_end_cow(ip, pos, iter->processed);

Didn't we come to the conflusion last time that we don't actually
need to poke into the iomap_iter here as the written argument is equal
to iter->processed if it is > 0:

	if (iter->iomap.length && ops->iomap_end) {
		ret = ops->iomap_end(iter->inode, iter->pos, iomap_length(iter),
				iter->processed > 0 ? iter->processed : 0,
				iter->flags, &iter->iomap);
		..

So should be able to just do:

static int
xfs_dax_write_iomap_end(
	struct inode 		*inode,
	loff_t 			pos,
	loff_t 			length,
	ssize_t 		written,
	unsigned 		flags,
	struct iomap 		*iomap)
{
	struct xfs_inode	*ip = XFS_I(inode);

	if (!xfs_is_cow_inode(ip))
		return 0;

	if (!written) {
		xfs_reflink_cancel_cow_range(ip, pos, length, true);
		return 0;
	}

	return xfs_reflink_end_cow(ip, pos, written);
}

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v9 7/8] xfs: support CoW in fsdax mode
  2021-09-16  0:22   ` Darrick J. Wong
@ 2021-09-16  6:32     ` Christoph Hellwig
  2021-09-17 15:33       ` Darrick J. Wong
  0 siblings, 1 reply; 27+ messages in thread
From: Christoph Hellwig @ 2021-09-16  6:32 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Shiyang Ruan, hch, linux-xfs, dan.j.williams, david,
	linux-fsdevel, linux-kernel, nvdimm, rgoldwyn, viro, willy

On Wed, Sep 15, 2021 at 05:22:27PM -0700, Darrick J. Wong wrote:
> >  		xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
> >  		ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL,
> >  				(write_fault && !vmf->cow_page) ?
> > -				 &xfs_direct_write_iomap_ops :
> > -				 &xfs_read_iomap_ops);
> > +					&xfs_dax_write_iomap_ops :
> > +					&xfs_read_iomap_ops);
> 
> Hmm... I wonder if this should get hoisted to a "xfs_dax_iomap_fault"
> wrapper like you did for xfs_iomap_zero_range?

This has just a single users, so the classic argument won't apply.  That
being said __xfs_filemap_fault is a complete mess to due the calling
conventions of the various VFS methods multiplexed into it.  So yes,
splitting out a xfs_dax_iomap_fault to wrap the above plus the
dax_finish_sync_fault call might not actually be a bad idea nevertheless.

> > +	struct xfs_inode	*ip = XFS_I(inode);
> > +	/*
> > +	 * Usually we use @written to indicate whether the operation was
> > +	 * successful.  But it is always positive or zero.  The CoW needs the
> > +	 * actual error code from actor().  So, get it from
> > +	 * iomap_iter->processed.
> 
> Hm.  All six arguments are derived from the struct iomap_iter, so maybe
> it makes more sense to pass that in?  I'll poke around with this more
> tomorrow.

I'd argue against just changing the calling conventions for ->iomap_end
now.  The original iter patches from willy allowed passing a single
next callback combinging iomap_begin and iomap_end in a way that with
a little magic we can avoid the indirect calls entirely.  I think we'll
need to experiment with that that a bit and see if is worth the effort
first.  I plan to do that but I might not get to it immediate.  If some
else wants to take over I'm fine with that.

> >  static int
> >  xfs_buffered_write_iomap_begin(
> 
> Also, we have an related request to drop the EXPERIMENTAL tag for
> non-DAX reflink.  Whichever patch enables dax+reflink for xfs needs to
> make it clear that reflink + any possibility of DAX emits an
> EXPERIMENTAL warning.

More importantly before we can merge this series we also need the VM
level support for reflink-aware reverse mapping.  So while this series
here is no in a good enough shape I don't see how we could merge it
without that other series as we'd have to disallow mmap for reflink+dax
files otherwise.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v9 5/8] fsdax: Add dax_iomap_cow_copy() for dax_iomap_zero
  2021-09-16  6:16   ` Christoph Hellwig
@ 2021-09-16  8:49     ` Shiyang Ruan
  2021-09-16 22:55       ` Darrick J. Wong
  0 siblings, 1 reply; 27+ messages in thread
From: Shiyang Ruan @ 2021-09-16  8:49 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: djwong, linux-xfs, dan.j.williams, david, linux-fsdevel,
	linux-kernel, nvdimm, rgoldwyn, viro, willy, Ritesh Harjani



On 2021/9/16 14:16, Christoph Hellwig wrote:
> On Wed, Sep 15, 2021 at 06:44:58PM +0800, Shiyang Ruan wrote:
>> +	rc = dax_direct_access(iomap->dax_dev, pgoff, 1, &kaddr, NULL);
>> +	if (rc < 0)
>> +		goto out;
>> +	memset(kaddr + offset, 0, size);
>> +	if (srcmap->addr != IOMAP_HOLE && srcmap->addr != iomap->addr) {
> 
> Should we also check that ->dax_dev for iomap and srcmap are different
> first to deal with case of file system with multiple devices?

I have not thought of this case.  Isn't it possible to CoW between 
different devices?


--
Thanks,
Ruan

> 
> Otherwise looks good:
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> 



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v9 7/8] xfs: support CoW in fsdax mode
  2021-09-16  6:23   ` Christoph Hellwig
@ 2021-09-16  8:51     ` Shiyang Ruan
  0 siblings, 0 replies; 27+ messages in thread
From: Shiyang Ruan @ 2021-09-16  8:51 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: djwong, linux-xfs, dan.j.williams, david, linux-fsdevel,
	linux-kernel, nvdimm, rgoldwyn, viro, willy



On 2021/9/16 14:23, Christoph Hellwig wrote:
> On Wed, Sep 15, 2021 at 06:45:00PM +0800, Shiyang Ruan wrote:
>> +static int
>> +xfs_dax_write_iomap_end(
>> +	struct inode 		*inode,
>> +	loff_t 			pos,
>> +	loff_t 			length,
>> +	ssize_t 		written,
>> +	unsigned 		flags,
>> +	struct iomap 		*iomap)
>> +{
>> +	struct xfs_inode	*ip = XFS_I(inode);
>> +	/*
>> +	 * Usually we use @written to indicate whether the operation was
>> +	 * successful.  But it is always positive or zero.  The CoW needs the
>> +	 * actual error code from actor().  So, get it from
>> +	 * iomap_iter->processed.
>> +	 */
>> +	const struct iomap_iter *iter =
>> +				container_of(iomap, typeof(*iter), iomap);
>> +
>> +	if (!xfs_is_cow_inode(ip))
>> +		return 0;
>> +
>> +	if (iter->processed <= 0) {
>> +		xfs_reflink_cancel_cow_range(ip, pos, length, true);
>> +		return 0;
>> +	}
>> +
>> +	return xfs_reflink_end_cow(ip, pos, iter->processed);
> 
> Didn't we come to the conflusion last time that we don't actually
> need to poke into the iomap_iter here as the written argument is equal
> to iter->processed if it is > 0:
> 
> 	if (iter->iomap.length && ops->iomap_end) {
> 		ret = ops->iomap_end(iter->inode, iter->pos, iomap_length(iter),
> 				iter->processed > 0 ? iter->processed : 0,
> 				iter->flags, &iter->iomap);
> 		..
> 
> So should be able to just do:
> 
> static int
> xfs_dax_write_iomap_end(
> 	struct inode 		*inode,
> 	loff_t 			pos,
> 	loff_t 			length,
> 	ssize_t 		written,
> 	unsigned 		flags,
> 	struct iomap 		*iomap)
> {
> 	struct xfs_inode	*ip = XFS_I(inode);
> 
> 	if (!xfs_is_cow_inode(ip))
> 		return 0;
> 
> 	if (!written) {
> 		xfs_reflink_cancel_cow_range(ip, pos, length, true);
> 		return 0;
> 	}
> 
> 	return xfs_reflink_end_cow(ip, pos, written);
> }
> 

I see.  Thanks for your guidance.


--
Ruan



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v9 3/8] fsdax: Replace mmap entry in case of CoW
  2021-09-15 10:44 ` [PATCH v9 3/8] fsdax: Replace mmap entry in case of CoW Shiyang Ruan
@ 2021-09-16 14:51   ` kernel test robot
  0 siblings, 0 replies; 27+ messages in thread
From: kernel test robot @ 2021-09-16 14:51 UTC (permalink / raw)
  To: Shiyang Ruan, djwong, hch, linux-xfs
  Cc: llvm, kbuild-all, ruansy.fnst, dan.j.williams, david,
	linux-fsdevel, linux-kernel, nvdimm, rgoldwyn

[-- Attachment #1: Type: text/plain, Size: 4670 bytes --]

Hi Shiyang,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v5.15-rc1 next-20210916]
[cannot apply to xfs-linux/for-next hch-configfs/for-next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Shiyang-Ruan/fsdax-xfs-Add-reflink-dedupe-support-for-fsdax/20210915-184743
base:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 3ca706c189db861b2ca2019a0901b94050ca49d8
config: hexagon-randconfig-r045-20210916 (attached as .config)
compiler: clang version 14.0.0 (https://github.com/llvm/llvm-project c8b3d7d6d6de37af68b2f379d0e37304f78e115f)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/34b16b56bacb2d3e1e98f9ed47d20b545358bdcd
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Shiyang-Ruan/fsdax-xfs-Add-reflink-dedupe-support-for-fsdax/20210915-184743
        git checkout 34b16b56bacb2d3e1e98f9ed47d20b545358bdcd
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 ARCH=hexagon 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

>> fs/dax.c:1483:38: error: incompatible pointer types passing 'const struct iomap_iter *' to parameter of type 'const struct iomap *' [-Werror,-Wincompatible-pointer-types]
                   return dax_pmd_load_hole(xas, vmf, iter, entry);
                                                      ^~~~
   fs/dax.c:1206:23: note: passing argument to parameter 'iomap' here
                   const struct iomap *iomap, void **entry)
                                       ^
   1 error generated.


vim +1483 fs/dax.c

  1451	
  1452	/**
  1453	 * dax_fault_iter - Common actor to handle pfn insertion in PTE/PMD fault.
  1454	 * @vmf:	vm fault instance
  1455	 * @iter:	iomap iter
  1456	 * @pfnp:	pfn to be returned
  1457	 * @xas:	the dax mapping tree of a file
  1458	 * @entry:	an unlocked dax entry to be inserted
  1459	 * @pmd:	distinguish whether it is a pmd fault
  1460	 */
  1461	static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
  1462			const struct iomap_iter *iter, pfn_t *pfnp,
  1463			struct xa_state *xas, void **entry, bool pmd)
  1464	{
  1465		const struct iomap *iomap = &iter->iomap;
  1466		const struct iomap *srcmap = &iter->srcmap;
  1467		size_t size = pmd ? PMD_SIZE : PAGE_SIZE;
  1468		loff_t pos = (loff_t)xas->xa_index << PAGE_SHIFT;
  1469		bool write = iter->flags & IOMAP_WRITE;
  1470		unsigned long entry_flags = pmd ? DAX_PMD : 0;
  1471		int err = 0;
  1472		pfn_t pfn;
  1473		void *kaddr;
  1474	
  1475		if (!pmd && vmf->cow_page)
  1476			return dax_fault_cow_page(vmf, iter);
  1477	
  1478		/* if we are reading UNWRITTEN and HOLE, return a hole. */
  1479		if (!write &&
  1480		    (iomap->type == IOMAP_UNWRITTEN || iomap->type == IOMAP_HOLE)) {
  1481			if (!pmd)
  1482				return dax_load_hole(xas, vmf, iter, entry);
> 1483			return dax_pmd_load_hole(xas, vmf, iter, entry);
  1484		}
  1485	
  1486		if (iomap->type != IOMAP_MAPPED && !(iomap->flags & IOMAP_F_SHARED)) {
  1487			WARN_ON_ONCE(1);
  1488			return pmd ? VM_FAULT_FALLBACK : VM_FAULT_SIGBUS;
  1489		}
  1490	
  1491		err = dax_iomap_direct_access(iomap, pos, size, &kaddr, &pfn);
  1492		if (err)
  1493			return pmd ? VM_FAULT_FALLBACK : dax_fault_return(err);
  1494	
  1495		*entry = dax_insert_entry(xas, vmf, iter, *entry, pfn, entry_flags);
  1496	
  1497		if (write &&
  1498		    srcmap->addr != IOMAP_HOLE && srcmap->addr != iomap->addr) {
  1499			err = dax_iomap_cow_copy(pos, size, size, srcmap, kaddr);
  1500			if (err)
  1501				return dax_fault_return(err);
  1502		}
  1503	
  1504		if (dax_fault_is_synchronous(iter, vmf->vma))
  1505			return dax_fault_synchronous_pfnp(pfnp, pfn);
  1506	
  1507		/* insert PMD pfn */
  1508		if (pmd)
  1509			return vmf_insert_pfn_pmd(vmf, pfn, write);
  1510	
  1511		/* insert PTE pfn */
  1512		if (write)
  1513			return vmf_insert_mixed_mkwrite(vmf->vma, vmf->address, pfn);
  1514		return vmf_insert_mixed(vmf->vma, vmf->address, pfn);
  1515	}
  1516	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 28632 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v9 5/8] fsdax: Add dax_iomap_cow_copy() for dax_iomap_zero
  2021-09-16  8:49     ` Shiyang Ruan
@ 2021-09-16 22:55       ` Darrick J. Wong
  0 siblings, 0 replies; 27+ messages in thread
From: Darrick J. Wong @ 2021-09-16 22:55 UTC (permalink / raw)
  To: Shiyang Ruan
  Cc: Christoph Hellwig, linux-xfs, dan.j.williams, david,
	linux-fsdevel, linux-kernel, nvdimm, rgoldwyn, viro, willy,
	Ritesh Harjani

On Thu, Sep 16, 2021 at 04:49:19PM +0800, Shiyang Ruan wrote:
> 
> 
> On 2021/9/16 14:16, Christoph Hellwig wrote:
> > On Wed, Sep 15, 2021 at 06:44:58PM +0800, Shiyang Ruan wrote:
> > > +	rc = dax_direct_access(iomap->dax_dev, pgoff, 1, &kaddr, NULL);
> > > +	if (rc < 0)
> > > +		goto out;
> > > +	memset(kaddr + offset, 0, size);
> > > +	if (srcmap->addr != IOMAP_HOLE && srcmap->addr != iomap->addr) {
> > 
> > Should we also check that ->dax_dev for iomap and srcmap are different
> > first to deal with case of file system with multiple devices?
> 
> I have not thought of this case.  Isn't it possible to CoW between different
> devices?

There's nothing in the iomap API that prevents a filesystem from doing
that, though there are no filesystems today that do such a thing.

That said, if btrfs ever joins the fold (and adds DAX support) then they
could totally COW to a different device.

--D

> 
> 
> --
> Thanks,
> Ruan
> 
> > 
> > Otherwise looks good:
> > 
> > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > 
> 
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v9 7/8] xfs: support CoW in fsdax mode
  2021-09-16  6:32     ` Christoph Hellwig
@ 2021-09-17 15:33       ` Darrick J. Wong
  2021-09-21  8:14         ` Christoph Hellwig
  0 siblings, 1 reply; 27+ messages in thread
From: Darrick J. Wong @ 2021-09-17 15:33 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Shiyang Ruan, linux-xfs, dan.j.williams, david, linux-fsdevel,
	linux-kernel, nvdimm, rgoldwyn, viro, willy

On Thu, Sep 16, 2021 at 08:32:51AM +0200, Christoph Hellwig wrote:
> On Wed, Sep 15, 2021 at 05:22:27PM -0700, Darrick J. Wong wrote:
> > >  		xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
> > >  		ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL,
> > >  				(write_fault && !vmf->cow_page) ?
> > > -				 &xfs_direct_write_iomap_ops :
> > > -				 &xfs_read_iomap_ops);
> > > +					&xfs_dax_write_iomap_ops :
> > > +					&xfs_read_iomap_ops);
> > 
> > Hmm... I wonder if this should get hoisted to a "xfs_dax_iomap_fault"
> > wrapper like you did for xfs_iomap_zero_range?
> 
> This has just a single users, so the classic argument won't apply.  That
> being said __xfs_filemap_fault is a complete mess to due the calling
> conventions of the various VFS methods multiplexed into it.  So yes,
> splitting out a xfs_dax_iomap_fault to wrap the above plus the
> dax_finish_sync_fault call might not actually be a bad idea nevertheless.

Agree.

> > > +	struct xfs_inode	*ip = XFS_I(inode);
> > > +	/*
> > > +	 * Usually we use @written to indicate whether the operation was
> > > +	 * successful.  But it is always positive or zero.  The CoW needs the
> > > +	 * actual error code from actor().  So, get it from
> > > +	 * iomap_iter->processed.
> > 
> > Hm.  All six arguments are derived from the struct iomap_iter, so maybe
> > it makes more sense to pass that in?  I'll poke around with this more
> > tomorrow.
> 
> I'd argue against just changing the calling conventions for ->iomap_end
> now.  The original iter patches from willy allowed passing a single
> next callback combinging iomap_begin and iomap_end in a way that with
> a little magic we can avoid the indirect calls entirely.  I think we'll
> need to experiment with that that a bit and see if is worth the effort
> first.  I plan to do that but I might not get to it immediate.  If some
> else wants to take over I'm fine with that.

Ah, I forgot that.  Yay Etch-a-Sketch brain. <shake> -ENODATA ;)

> > >  static int
> > >  xfs_buffered_write_iomap_begin(
> > 
> > Also, we have an related request to drop the EXPERIMENTAL tag for
> > non-DAX reflink.  Whichever patch enables dax+reflink for xfs needs to
> > make it clear that reflink + any possibility of DAX emits an
> > EXPERIMENTAL warning.
> 
> More importantly before we can merge this series we also need the VM
> level support for reflink-aware reverse mapping.  So while this series
> here is no in a good enough shape I don't see how we could merge it
> without that other series as we'd have to disallow mmap for reflink+dax
> files otherwise.

I've forgotten why we need mm level reverse mapping again?  The pmem
poison stuff can use ->media_failure (or whatever it was called,
memory_failure?) to find all the owners and notify them.  Was there
some other accounting reason that fell out of my brain?

I'm more afraid of 'sharing pages between files needs mm support'
sparking another multi-year folioesque fight with the mm people.

--D

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v9 7/8] xfs: support CoW in fsdax mode
  2021-09-17 15:33       ` Darrick J. Wong
@ 2021-09-21  8:14         ` Christoph Hellwig
  0 siblings, 0 replies; 27+ messages in thread
From: Christoph Hellwig @ 2021-09-21  8:14 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christoph Hellwig, Shiyang Ruan, linux-xfs, dan.j.williams,
	david, linux-fsdevel, linux-kernel, nvdimm, rgoldwyn, viro,
	willy

On Fri, Sep 17, 2021 at 08:33:04AM -0700, Darrick J. Wong wrote:
> > More importantly before we can merge this series we also need the VM
> > level support for reflink-aware reverse mapping.  So while this series
> > here is no in a good enough shape I don't see how we could merge it
> > without that other series as we'd have to disallow mmap for reflink+dax
> > files otherwise.
> 
> I've forgotten why we need mm level reverse mapping again?  The pmem
> poison stuff can use ->media_failure (or whatever it was called,
> memory_failure?) to find all the owners and notify them.  Was there
> some other accounting reason that fell out of my brain?
> 
> I'm more afraid of 'sharing pages between files needs mm support'
> sparking another multi-year folioesque fight with the mm people.

Because of the way page->mapping is used by DAX.  But I think this is
mostly under control in the other series.

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2021-09-21  8:14 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-15 10:44 [PATCH v9 0/8] fsdax,xfs: Add reflink&dedupe support for fsdax Shiyang Ruan
2021-09-15 10:44 ` [PATCH v9 1/8] fsdax: Output address in dax_iomap_pfn() and rename it Shiyang Ruan
2021-09-16  0:09   ` Darrick J. Wong
2021-09-16  1:36     ` Shiyang Ruan
2021-09-15 10:44 ` [PATCH v9 2/8] fsdax: Introduce dax_iomap_cow_copy() Shiyang Ruan
2021-09-15 10:44 ` [PATCH v9 3/8] fsdax: Replace mmap entry in case of CoW Shiyang Ruan
2021-09-16 14:51   ` kernel test robot
2021-09-15 10:44 ` [PATCH v9 4/8] fsdax: Convert dax_iomap_zero to iter model Shiyang Ruan
2021-09-16  0:11   ` Darrick J. Wong
2021-09-16  6:15   ` Christoph Hellwig
2021-09-15 10:44 ` [PATCH v9 5/8] fsdax: Add dax_iomap_cow_copy() for dax_iomap_zero Shiyang Ruan
2021-09-16  6:16   ` Christoph Hellwig
2021-09-16  8:49     ` Shiyang Ruan
2021-09-16 22:55       ` Darrick J. Wong
2021-09-15 10:44 ` [PATCH v9 6/8] fsdax: Dedup file range to use a compare function Shiyang Ruan
2021-09-16  6:17   ` Christoph Hellwig
2021-09-15 10:45 ` [PATCH v9 7/8] xfs: support CoW in fsdax mode Shiyang Ruan
2021-09-16  0:22   ` Darrick J. Wong
2021-09-16  6:32     ` Christoph Hellwig
2021-09-17 15:33       ` Darrick J. Wong
2021-09-21  8:14         ` Christoph Hellwig
2021-09-16  6:23   ` Christoph Hellwig
2021-09-16  8:51     ` Shiyang Ruan
2021-09-15 10:45 ` [PATCH v9 8/8] xfs: Add dax dedupe support Shiyang Ruan
2021-09-16  0:30   ` Darrick J. Wong
2021-09-16  4:01     ` Shiyang Ruan
2021-09-16  4:18       ` Darrick J. Wong

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).