All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 0/9] xfs: reflink cleanups
@ 2020-06-25  1:17 Darrick J. Wong
  2020-06-25  1:17 ` [PATCH 1/9] xfs: fix reflink quota reservation accounting error Darrick J. Wong
                   ` (9 more replies)
  0 siblings, 10 replies; 31+ messages in thread
From: Darrick J. Wong @ 2020-06-25  1:17 UTC (permalink / raw)
  To: darrick.wong; +Cc: Brian Foster, linux-xfs, edwin

Hi all,

Here are a few patches cleaning up some problems with reflink.  The
first patch is the largest -- it reorganizes the remapping loop so that
instead of running one transaction for each extent in the source file
regardless of what's in the destination file, we look at both files to
find the longest extent we can swap in one go, and run a transaction for
just that piece.  This fixes a problem of running out of block
reservation when the filesystem is very fragmented.

The second patch fixes some goofiness in the reflink prep function, and
the third patch moves the "lock two inodes" code into xfs_inode.c since
none of that is related to reflink.

Mr. Torok: Could you try applying these patches to a recent kernel to
see if they fix the fs crash problems you were seeing with duperemove,
please?

v2: various cleanups suggested by Brian Foster

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=reflink-cleanups
---
 fs/xfs/libxfs/xfs_bmap.h     |   21 ++-
 fs/xfs/libxfs/xfs_rtbitmap.c |    2 
 fs/xfs/xfs_file.c            |    4 
 fs/xfs/xfs_inode.c           |   93 +++++++++++
 fs/xfs/xfs_inode.h           |    3 
 fs/xfs/xfs_reflink.c         |  352 +++++++++++++++++++-----------------------
 fs/xfs/xfs_reflink.h         |    2 
 fs/xfs/xfs_trace.h           |   52 ------
 8 files changed, 275 insertions(+), 254 deletions(-)


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 1/9] xfs: fix reflink quota reservation accounting error
  2020-06-25  1:17 [PATCH v2 0/9] xfs: reflink cleanups Darrick J. Wong
@ 2020-06-25  1:17 ` Darrick J. Wong
  2020-06-25 12:26   ` Brian Foster
  2020-07-01  8:07   ` Christoph Hellwig
  2020-06-25  1:17 ` [PATCH 2/9] xfs: rename xfs_bmap_is_real_extent to is_written_extent Darrick J. Wong
                   ` (8 subsequent siblings)
  9 siblings, 2 replies; 31+ messages in thread
From: Darrick J. Wong @ 2020-06-25  1:17 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs, edwin

From: Darrick J. Wong <darrick.wong@oracle.com>

Quota reservations are supposed to account for the blocks that might be
allocated due to a bmap btree split.  Reflink doesn't do this, so fix
this to make the quota accounting more accurate before we start
rearranging things.

Fixes: 862bb360ef56 ("xfs: reflink extents from one file to another")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_reflink.c |   21 ++++++++++++++-------
 1 file changed, 14 insertions(+), 7 deletions(-)


diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 107bf2a2f344..d89201d40891 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1003,6 +1003,7 @@ xfs_reflink_remap_extent(
 	xfs_filblks_t		rlen;
 	xfs_filblks_t		unmap_len;
 	xfs_off_t		newlen;
+	int64_t			qres;
 	int			error;
 
 	unmap_len = irec->br_startoff + irec->br_blockcount - destoff;
@@ -1025,13 +1026,19 @@ xfs_reflink_remap_extent(
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
 	xfs_trans_ijoin(tp, ip, 0);
 
-	/* If we're not just clearing space, then do we have enough quota? */
-	if (real_extent) {
-		error = xfs_trans_reserve_quota_nblks(tp, ip,
-				irec->br_blockcount, 0, XFS_QMOPT_RES_REGBLKS);
-		if (error)
-			goto out_cancel;
-	}
+	/*
+	 * Reserve quota for this operation.  We don't know if the first unmap
+	 * in the dest file will cause a bmap btree split, so we always reserve
+	 * at least enough blocks for that split.  If the extent being mapped
+	 * in is written, we need to reserve quota for that too.
+	 */
+	qres = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
+	if (real_extent)
+		qres += irec->br_blockcount;
+	error = xfs_trans_reserve_quota_nblks(tp, ip, qres, 0,
+			XFS_QMOPT_RES_REGBLKS);
+	if (error)
+		goto out_cancel;
 
 	trace_xfs_reflink_remap(ip, irec->br_startoff,
 				irec->br_blockcount, irec->br_startblock);


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 2/9] xfs: rename xfs_bmap_is_real_extent to is_written_extent
  2020-06-25  1:17 [PATCH v2 0/9] xfs: reflink cleanups Darrick J. Wong
  2020-06-25  1:17 ` [PATCH 1/9] xfs: fix reflink quota reservation accounting error Darrick J. Wong
@ 2020-06-25  1:17 ` Darrick J. Wong
  2020-06-25 12:26   ` Brian Foster
  2020-07-01  8:08   ` Christoph Hellwig
  2020-06-25  1:17 ` [PATCH 3/9] xfs: redesign the reflink remap loop to fix blkres depletion crash Darrick J. Wong
                   ` (7 subsequent siblings)
  9 siblings, 2 replies; 31+ messages in thread
From: Darrick J. Wong @ 2020-06-25  1:17 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs, edwin

From: Darrick J. Wong <darrick.wong@oracle.com>

The name of this predicate is a little misleading -- it decides if the
extent mapping is allocated and written.  Change the name to be more
direct, as we're going to add a new predicate in the next patch.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_bmap.h     |    2 +-
 fs/xfs/libxfs/xfs_rtbitmap.c |    2 +-
 fs/xfs/xfs_reflink.c         |    6 +++---
 3 files changed, 5 insertions(+), 5 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index 6028a3c825ba..2b18338d0643 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -163,7 +163,7 @@ static inline int xfs_bmapi_whichfork(int bmapi_flags)
  * Return true if the extent is a real, allocated extent, or false if it is  a
  * delayed allocation, and unwritten extent or a hole.
  */
-static inline bool xfs_bmap_is_real_extent(struct xfs_bmbt_irec *irec)
+static inline bool xfs_bmap_is_written_extent(struct xfs_bmbt_irec *irec)
 {
 	return irec->br_state != XFS_EXT_UNWRITTEN &&
 		irec->br_startblock != HOLESTARTBLOCK &&
diff --git a/fs/xfs/libxfs/xfs_rtbitmap.c b/fs/xfs/libxfs/xfs_rtbitmap.c
index 9498ced947be..1d9fa8a300f1 100644
--- a/fs/xfs/libxfs/xfs_rtbitmap.c
+++ b/fs/xfs/libxfs/xfs_rtbitmap.c
@@ -70,7 +70,7 @@ xfs_rtbuf_get(
 	if (error)
 		return error;
 
-	if (XFS_IS_CORRUPT(mp, nmap == 0 || !xfs_bmap_is_real_extent(&map)))
+	if (XFS_IS_CORRUPT(mp, nmap == 0 || !xfs_bmap_is_written_extent(&map)))
 		return -EFSCORRUPTED;
 
 	ASSERT(map.br_startblock != NULLFSBLOCK);
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index d89201d40891..22fdea6d69d3 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -179,7 +179,7 @@ xfs_reflink_trim_around_shared(
 	int			error = 0;
 
 	/* Holes, unwritten, and delalloc extents cannot be shared */
-	if (!xfs_is_cow_inode(ip) || !xfs_bmap_is_real_extent(irec)) {
+	if (!xfs_is_cow_inode(ip) || !xfs_bmap_is_written_extent(irec)) {
 		*shared = false;
 		return 0;
 	}
@@ -655,7 +655,7 @@ xfs_reflink_end_cow_extent(
 	 * preallocations can leak into the range we are called upon, and we
 	 * need to skip them.
 	 */
-	if (!xfs_bmap_is_real_extent(&got)) {
+	if (!xfs_bmap_is_written_extent(&got)) {
 		*end_fsb = del.br_startoff;
 		goto out_cancel;
 	}
@@ -996,7 +996,7 @@ xfs_reflink_remap_extent(
 	xfs_off_t		new_isize)
 {
 	struct xfs_mount	*mp = ip->i_mount;
-	bool			real_extent = xfs_bmap_is_real_extent(irec);
+	bool			real_extent = xfs_bmap_is_written_extent(irec);
 	struct xfs_trans	*tp;
 	unsigned int		resblks;
 	struct xfs_bmbt_irec	uirec;


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 3/9] xfs: redesign the reflink remap loop to fix blkres depletion crash
  2020-06-25  1:17 [PATCH v2 0/9] xfs: reflink cleanups Darrick J. Wong
  2020-06-25  1:17 ` [PATCH 1/9] xfs: fix reflink quota reservation accounting error Darrick J. Wong
  2020-06-25  1:17 ` [PATCH 2/9] xfs: rename xfs_bmap_is_real_extent to is_written_extent Darrick J. Wong
@ 2020-06-25  1:17 ` Darrick J. Wong
  2020-06-25 12:27   ` Brian Foster
  2020-06-25 17:23   ` [PATCH v4.2 " Darrick J. Wong
  2020-06-25  1:18 ` [PATCH 4/9] xfs: only reserve quota blocks for bmbt changes if we're changing the data fork Darrick J. Wong
                   ` (6 subsequent siblings)
  9 siblings, 2 replies; 31+ messages in thread
From: Darrick J. Wong @ 2020-06-25  1:17 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs, edwin

From: Darrick J. Wong <darrick.wong@oracle.com>

The existing reflink remapping loop has some structural problems that
need addressing:

The biggest problem is that we create one transaction for each extent in
the source file without accounting for the number of mappings there are
for the same range in the destination file.  In other words, we don't
know the number of remap operations that will be necessary and we
therefore cannot guess the block reservation required.  On highly
fragmented filesystems (e.g. ones with active dedupe) we guess wrong,
run out of block reservation, and fail.

The second problem is that we don't actually use the bmap intents to
their full potential -- instead of calling bunmapi directly and having
to deal with its backwards operation, we could call the deferred ops
xfs_bmap_unmap_extent and xfs_refcount_decrease_extent instead.  This
makes the frontend loop much simpler.

Solve all of these problems by refactoring the remapping loops so that
we only perform one remapping operation per transaction, and each
operation only tries to remap a single extent from source to dest.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_bmap.h |   13 ++
 fs/xfs/xfs_reflink.c     |  242 +++++++++++++++++++++++++---------------------
 fs/xfs/xfs_trace.h       |   52 +---------
 3 files changed, 142 insertions(+), 165 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index 2b18338d0643..2e3c49f1f5c9 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -158,6 +158,13 @@ static inline int xfs_bmapi_whichfork(int bmapi_flags)
 	{ BMAP_ATTRFORK,	"ATTR" }, \
 	{ BMAP_COWFORK,		"COW" }
 
+/* Return true if the extent is an allocated extent, written or not. */
+static inline bool xfs_bmap_is_mapped_extent(struct xfs_bmbt_irec *irec)
+{
+	return irec->br_startblock != HOLESTARTBLOCK &&
+		irec->br_startblock != DELAYSTARTBLOCK &&
+		!isnullstartblock(irec->br_startblock);
+}
 
 /*
  * Return true if the extent is a real, allocated extent, or false if it is  a
@@ -165,10 +172,8 @@ static inline int xfs_bmapi_whichfork(int bmapi_flags)
  */
 static inline bool xfs_bmap_is_written_extent(struct xfs_bmbt_irec *irec)
 {
-	return irec->br_state != XFS_EXT_UNWRITTEN &&
-		irec->br_startblock != HOLESTARTBLOCK &&
-		irec->br_startblock != DELAYSTARTBLOCK &&
-		!isnullstartblock(irec->br_startblock);
+	return xfs_bmap_is_mapped_extent(irec) &&
+	       irec->br_state != XFS_EXT_UNWRITTEN;
 }
 
 /*
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 22fdea6d69d3..c593d52156df 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -984,41 +984,28 @@ xfs_reflink_ag_has_free_space(
 }
 
 /*
- * Unmap a range of blocks from a file, then map other blocks into the hole.
- * The range to unmap is (destoff : destoff + srcioff + irec->br_blockcount).
- * The extent irec is mapped into dest at irec->br_startoff.
+ * Remap the given extent into the file.  The dmap blockcount will be set to
+ * the number of blocks that were actually remapped.
  */
 STATIC int
 xfs_reflink_remap_extent(
 	struct xfs_inode	*ip,
-	struct xfs_bmbt_irec	*irec,
-	xfs_fileoff_t		destoff,
+	struct xfs_bmbt_irec	*dmap,
 	xfs_off_t		new_isize)
 {
+	struct xfs_bmbt_irec	smap;
 	struct xfs_mount	*mp = ip->i_mount;
-	bool			real_extent = xfs_bmap_is_written_extent(irec);
 	struct xfs_trans	*tp;
-	unsigned int		resblks;
-	struct xfs_bmbt_irec	uirec;
-	xfs_filblks_t		rlen;
-	xfs_filblks_t		unmap_len;
 	xfs_off_t		newlen;
-	int64_t			qres;
+	int64_t			qres, qdelta;
+	unsigned int		resblks;
+	bool			smap_mapped;
+	bool			dmap_written = xfs_bmap_is_written_extent(dmap);
+	int			nimaps;
 	int			error;
 
-	unmap_len = irec->br_startoff + irec->br_blockcount - destoff;
-	trace_xfs_reflink_punch_range(ip, destoff, unmap_len);
-
-	/* No reflinking if we're low on space */
-	if (real_extent) {
-		error = xfs_reflink_ag_has_free_space(mp,
-				XFS_FSB_TO_AGNO(mp, irec->br_startblock));
-		if (error)
-			goto out;
-	}
-
 	/* Start a rolling transaction to switch the mappings */
-	resblks = XFS_EXTENTADD_SPACE_RES(ip->i_mount, XFS_DATA_FORK);
+	resblks = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
 	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0, 0, &tp);
 	if (error)
 		goto out;
@@ -1026,93 +1013,120 @@ xfs_reflink_remap_extent(
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
 	xfs_trans_ijoin(tp, ip, 0);
 
+	/* Read what's currently mapped in the destination file */
+	nimaps = 1;
+	error = xfs_bmapi_read(ip, dmap->br_startoff, dmap->br_blockcount,
+			&smap, &nimaps, 0);
+	if (error)
+		goto out_cancel;
+	ASSERT(nimaps == 1 && smap.br_startoff == dmap->br_startoff);
+	smap_mapped = xfs_bmap_is_mapped_extent(&smap);
+
 	/*
-	 * Reserve quota for this operation.  We don't know if the first unmap
-	 * in the dest file will cause a bmap btree split, so we always reserve
-	 * at least enough blocks for that split.  If the extent being mapped
-	 * in is written, we need to reserve quota for that too.
+	 * We can only remap as many blocks as the smaller of the two extent
+	 * maps, because we can only remap one extent at a time.
 	 */
+	dmap->br_blockcount = min(dmap->br_blockcount, smap.br_blockcount);
+	ASSERT(dmap->br_blockcount == smap.br_blockcount);
+
+	trace_xfs_reflink_remap_extent_dest(ip, &smap);
+
+	/* No reflinking if the AG of the dest mapping is low on space. */
+	if (dmap_written) {
+		error = xfs_reflink_ag_has_free_space(mp,
+				XFS_FSB_TO_AGNO(mp, dmap->br_startblock));
+		if (error)
+			goto out_cancel;
+	}
+
+	/*
+	 * Compute quota reservation if we think the quota block counter for
+	 * this file could increase.
+	 *
+	 * We start by reserving enough blocks to handle a bmbt split.
+	 *
+	 * If we are mapping a written extent into the file, we need to have
+	 * enough quota block count reservation to handle the blocks in that
+	 * extent.
+	 *
+	 * Note that if we're replacing a delalloc reservation with a written
+	 * extent, we have to take the full quota reservation because removing
+	 * the delalloc reservation gives the block count back to the quota
+	 * count.  This is suboptimal, but the VFS flushed the dest range
+	 * before we started.  That should have removed all the delalloc
+	 * reservations, but we code defensively.
+	 */
+	qdelta = 0;
 	qres = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
-	if (real_extent)
-		qres += irec->br_blockcount;
-	error = xfs_trans_reserve_quota_nblks(tp, ip, qres, 0,
-			XFS_QMOPT_RES_REGBLKS);
-	if (error)
-		goto out_cancel;
-
-	trace_xfs_reflink_remap(ip, irec->br_startoff,
-				irec->br_blockcount, irec->br_startblock);
-
-	/* Unmap the old blocks in the data fork. */
-	rlen = unmap_len;
-	while (rlen) {
-		ASSERT(tp->t_firstblock == NULLFSBLOCK);
-		error = __xfs_bunmapi(tp, ip, destoff, &rlen, 0, 1);
+	if (dmap_written)
+		qres += dmap->br_blockcount;
+	if (qres > 0) {
+		error = xfs_trans_reserve_quota_nblks(tp, ip, qres, 0,
+				XFS_QMOPT_RES_REGBLKS);
 		if (error)
 			goto out_cancel;
+	}
 
+	if (smap_mapped) {
 		/*
-		 * Trim the extent to whatever got unmapped.
-		 * Remember, bunmapi works backwards.
+		 * If the extent we're unmapping is backed by storage (written
+		 * or not), unmap the extent and drop its refcount.
 		 */
-		uirec.br_startblock = irec->br_startblock + rlen;
-		uirec.br_startoff = irec->br_startoff + rlen;
-		uirec.br_blockcount = unmap_len - rlen;
-		uirec.br_state = irec->br_state;
-		unmap_len = rlen;
-
-		/* If this isn't a real mapping, we're done. */
-		if (!real_extent || uirec.br_blockcount == 0)
-			goto next_extent;
+		xfs_bmap_unmap_extent(tp, ip, &smap);
+		xfs_refcount_decrease_extent(tp, &smap);
+		qdelta -= smap.br_blockcount;
+	} else if (smap.br_startblock == DELAYSTARTBLOCK) {
+		xfs_filblks_t	len = smap.br_blockcount;
 
-		trace_xfs_reflink_remap(ip, uirec.br_startoff,
-				uirec.br_blockcount, uirec.br_startblock);
-
-		/* Update the refcount tree */
-		xfs_refcount_increase_extent(tp, &uirec);
-
-		/* Map the new blocks into the data fork. */
-		xfs_bmap_map_extent(tp, ip, &uirec);
+		/*
+		 * If the extent we're unmapping is a delalloc reservation,
+		 * we can use the regular bunmapi function to release the
+		 * incore state.  Dropping the delalloc reservation takes care
+		 * of the quota reservation for us.
+		 */
+		error = __xfs_bunmapi(NULL, ip, smap.br_startoff, &len, 0, 1);
+		if (error)
+			goto out_cancel;
+		ASSERT(len == 0);
+	}
 
-		/* Update quota accounting. */
-		xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_BCOUNT,
-				uirec.br_blockcount);
+	/*
+	 * If the extent we're sharing is backed by written storage, increase
+	 * its refcount and map it into the file.
+	 */
+	if (dmap_written) {
+		xfs_refcount_increase_extent(tp, dmap);
+		xfs_bmap_map_extent(tp, ip, dmap);
+		qdelta += dmap->br_blockcount;
+	}
 
-		/* Update dest isize if needed. */
-		newlen = XFS_FSB_TO_B(mp,
-				uirec.br_startoff + uirec.br_blockcount);
-		newlen = min_t(xfs_off_t, newlen, new_isize);
-		if (newlen > i_size_read(VFS_I(ip))) {
-			trace_xfs_reflink_update_inode_size(ip, newlen);
-			i_size_write(VFS_I(ip), newlen);
-			ip->i_d.di_size = newlen;
-			xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
-		}
+	xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_BCOUNT, qdelta);
 
-next_extent:
-		/* Process all the deferred stuff. */
-		error = xfs_defer_finish(&tp);
-		if (error)
-			goto out_cancel;
+	/* Update dest isize if needed. */
+	newlen = XFS_FSB_TO_B(mp, smap.br_startoff + smap.br_blockcount);
+	newlen = min_t(xfs_off_t, newlen, new_isize);
+	if (newlen > i_size_read(VFS_I(ip))) {
+		trace_xfs_reflink_update_inode_size(ip, newlen);
+		i_size_write(VFS_I(ip), newlen);
+		ip->i_d.di_size = newlen;
+		xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
 	}
 
+	/* Commit everything and unlock. */
 	error = xfs_trans_commit(tp);
-	xfs_iunlock(ip, XFS_ILOCK_EXCL);
-	if (error)
-		goto out;
-	return 0;
+	goto out_unlock;
 
 out_cancel:
 	xfs_trans_cancel(tp);
+out_unlock:
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 out:
-	trace_xfs_reflink_remap_extent_error(ip, error, _RET_IP_);
+	if (error)
+		trace_xfs_reflink_remap_extent_error(ip, error, _RET_IP_);
 	return error;
 }
 
-/*
- * Iteratively remap one file's extents (and holes) to another's.
- */
+/* Remap a range of one file to the other. */
 int
 xfs_reflink_remap_blocks(
 	struct xfs_inode	*src,
@@ -1123,25 +1137,22 @@ xfs_reflink_remap_blocks(
 	loff_t			*remapped)
 {
 	struct xfs_bmbt_irec	imap;
-	xfs_fileoff_t		srcoff;
-	xfs_fileoff_t		destoff;
+	struct xfs_mount	*mp = src->i_mount;
+	xfs_fileoff_t		srcoff = XFS_B_TO_FSBT(mp, pos_in);
+	xfs_fileoff_t		destoff = XFS_B_TO_FSBT(mp, pos_out);
 	xfs_filblks_t		len;
-	xfs_filblks_t		range_len;
 	xfs_filblks_t		remapped_len = 0;
 	xfs_off_t		new_isize = pos_out + remap_len;
 	int			nimaps;
 	int			error = 0;
 
-	destoff = XFS_B_TO_FSBT(src->i_mount, pos_out);
-	srcoff = XFS_B_TO_FSBT(src->i_mount, pos_in);
-	len = XFS_B_TO_FSB(src->i_mount, remap_len);
+	len = min_t(xfs_filblks_t, XFS_B_TO_FSB(mp, remap_len),
+			XFS_MAX_FILEOFF);
 
-	/* drange = (destoff, destoff + len); srange = (srcoff, srcoff + len) */
-	while (len) {
-		uint		lock_mode;
+	trace_xfs_reflink_remap_blocks(src, srcoff, len, dest, destoff);
 
-		trace_xfs_reflink_remap_blocks_loop(src, srcoff, len,
-				dest, destoff);
+	while (len > 0) {
+		unsigned int	lock_mode;
 
 		/* Read extent from the source file */
 		nimaps = 1;
@@ -1150,18 +1161,25 @@ xfs_reflink_remap_blocks(
 		xfs_iunlock(src, lock_mode);
 		if (error)
 			break;
-		ASSERT(nimaps == 1);
+		/*
+		 * The caller supposedly flushed all dirty pages in the source
+		 * file range, which means that writeback should have allocated
+		 * or deleted all delalloc reservations in that range.  If we
+		 * find one, that's a good sign that something is seriously
+		 * wrong here.
+		 */
+		ASSERT(nimaps == 1 && imap.br_startoff == srcoff);
+		if (imap.br_startblock == DELAYSTARTBLOCK) {
+			ASSERT(imap.br_startblock != DELAYSTARTBLOCK);
+			error = -EFSCORRUPTED;
+			break;
+		}
 
-		trace_xfs_reflink_remap_imap(src, srcoff, len, XFS_DATA_FORK,
-				&imap);
+		trace_xfs_reflink_remap_extent_src(src, &imap);
 
-		/* Translate imap into the destination file. */
-		range_len = imap.br_startoff + imap.br_blockcount - srcoff;
-		imap.br_startoff += destoff - srcoff;
-
-		/* Clear dest from destoff to the end of imap and map it in. */
-		error = xfs_reflink_remap_extent(dest, &imap, destoff,
-				new_isize);
+		/* Remap into the destination file at the given offset. */
+		imap.br_startoff = destoff;
+		error = xfs_reflink_remap_extent(dest, &imap, new_isize);
 		if (error)
 			break;
 
@@ -1171,10 +1189,10 @@ xfs_reflink_remap_blocks(
 		}
 
 		/* Advance drange/srange */
-		srcoff += range_len;
-		destoff += range_len;
-		len -= range_len;
-		remapped_len += range_len;
+		srcoff += imap.br_blockcount;
+		destoff += imap.br_blockcount;
+		len -= imap.br_blockcount;
+		remapped_len += imap.br_blockcount;
 	}
 
 	if (error)
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 460136628a79..50c478374a31 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -3052,8 +3052,7 @@ DEFINE_EVENT(xfs_inode_irec_class, name, \
 DEFINE_INODE_EVENT(xfs_reflink_set_inode_flag);
 DEFINE_INODE_EVENT(xfs_reflink_unset_inode_flag);
 DEFINE_ITRUNC_EVENT(xfs_reflink_update_inode_size);
-DEFINE_IMAP_EVENT(xfs_reflink_remap_imap);
-TRACE_EVENT(xfs_reflink_remap_blocks_loop,
+TRACE_EVENT(xfs_reflink_remap_blocks,
 	TP_PROTO(struct xfs_inode *src, xfs_fileoff_t soffset,
 		 xfs_filblks_t len, struct xfs_inode *dest,
 		 xfs_fileoff_t doffset),
@@ -3084,59 +3083,14 @@ TRACE_EVENT(xfs_reflink_remap_blocks_loop,
 		  __entry->dest_ino,
 		  __entry->dest_lblk)
 );
-TRACE_EVENT(xfs_reflink_punch_range,
-	TP_PROTO(struct xfs_inode *ip, xfs_fileoff_t lblk,
-		 xfs_extlen_t len),
-	TP_ARGS(ip, lblk, len),
-	TP_STRUCT__entry(
-		__field(dev_t, dev)
-		__field(xfs_ino_t, ino)
-		__field(xfs_fileoff_t, lblk)
-		__field(xfs_extlen_t, len)
-	),
-	TP_fast_assign(
-		__entry->dev = VFS_I(ip)->i_sb->s_dev;
-		__entry->ino = ip->i_ino;
-		__entry->lblk = lblk;
-		__entry->len = len;
-	),
-	TP_printk("dev %d:%d ino 0x%llx lblk 0x%llx len 0x%x",
-		  MAJOR(__entry->dev), MINOR(__entry->dev),
-		  __entry->ino,
-		  __entry->lblk,
-		  __entry->len)
-);
-TRACE_EVENT(xfs_reflink_remap,
-	TP_PROTO(struct xfs_inode *ip, xfs_fileoff_t lblk,
-		 xfs_extlen_t len, xfs_fsblock_t new_pblk),
-	TP_ARGS(ip, lblk, len, new_pblk),
-	TP_STRUCT__entry(
-		__field(dev_t, dev)
-		__field(xfs_ino_t, ino)
-		__field(xfs_fileoff_t, lblk)
-		__field(xfs_extlen_t, len)
-		__field(xfs_fsblock_t, new_pblk)
-	),
-	TP_fast_assign(
-		__entry->dev = VFS_I(ip)->i_sb->s_dev;
-		__entry->ino = ip->i_ino;
-		__entry->lblk = lblk;
-		__entry->len = len;
-		__entry->new_pblk = new_pblk;
-	),
-	TP_printk("dev %d:%d ino 0x%llx lblk 0x%llx len 0x%x new_pblk %llu",
-		  MAJOR(__entry->dev), MINOR(__entry->dev),
-		  __entry->ino,
-		  __entry->lblk,
-		  __entry->len,
-		  __entry->new_pblk)
-);
 DEFINE_DOUBLE_IO_EVENT(xfs_reflink_remap_range);
 DEFINE_INODE_ERROR_EVENT(xfs_reflink_remap_range_error);
 DEFINE_INODE_ERROR_EVENT(xfs_reflink_set_inode_flag_error);
 DEFINE_INODE_ERROR_EVENT(xfs_reflink_update_inode_size_error);
 DEFINE_INODE_ERROR_EVENT(xfs_reflink_remap_blocks_error);
 DEFINE_INODE_ERROR_EVENT(xfs_reflink_remap_extent_error);
+DEFINE_INODE_IREC_EVENT(xfs_reflink_remap_extent_src);
+DEFINE_INODE_IREC_EVENT(xfs_reflink_remap_extent_dest);
 
 /* dedupe tracepoints */
 DEFINE_DOUBLE_IO_EVENT(xfs_reflink_compare_extents);


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 4/9] xfs: only reserve quota blocks for bmbt changes if we're changing the data fork
  2020-06-25  1:17 [PATCH v2 0/9] xfs: reflink cleanups Darrick J. Wong
                   ` (2 preceding siblings ...)
  2020-06-25  1:17 ` [PATCH 3/9] xfs: redesign the reflink remap loop to fix blkres depletion crash Darrick J. Wong
@ 2020-06-25  1:18 ` Darrick J. Wong
  2020-06-25 12:27   ` Brian Foster
  2020-07-01  8:21   ` Christoph Hellwig
  2020-06-25  1:18 ` [PATCH 5/9] xfs: only reserve quota blocks if we're mapping into a hole Darrick J. Wong
                   ` (5 subsequent siblings)
  9 siblings, 2 replies; 31+ messages in thread
From: Darrick J. Wong @ 2020-06-25  1:18 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs, edwin

From: Darrick J. Wong <darrick.wong@oracle.com>

Now that we've reworked xfs_reflink_remap_extent to remap only one
extent per transaction, we actually know if the extent being removed is
an allocated mapping.  This means that we now know ahead of time if
we're going to be touching the data fork.

Since we only need blocks for a bmbt split if we're going to update the
data fork, we only need to get quota reservation if we know we're going
to touch the data fork.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_reflink.c |   11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)


diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index c593d52156df..9cc1c340d0ec 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1043,7 +1043,11 @@ xfs_reflink_remap_extent(
 	 * Compute quota reservation if we think the quota block counter for
 	 * this file could increase.
 	 *
-	 * We start by reserving enough blocks to handle a bmbt split.
+	 * Adding a written extent to the extent map can cause a bmbt split,
+	 * and removing a mapped extent from the extent can cause a bmbt split.
+	 * The two operations cannot both cause a split since they operate on
+	 * the same index in the bmap btree, so we only need a reservation for
+	 * one bmbt split if either thing is happening.
 	 *
 	 * If we are mapping a written extent into the file, we need to have
 	 * enough quota block count reservation to handle the blocks in that
@@ -1056,8 +1060,9 @@ xfs_reflink_remap_extent(
 	 * before we started.  That should have removed all the delalloc
 	 * reservations, but we code defensively.
 	 */
-	qdelta = 0;
-	qres = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
+	qres = qdelta = 0;
+	if (smap_mapped || dmap_written)
+		qres = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
 	if (dmap_written)
 		qres += dmap->br_blockcount;
 	if (qres > 0) {


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 5/9] xfs: only reserve quota blocks if we're mapping into a hole
  2020-06-25  1:17 [PATCH v2 0/9] xfs: reflink cleanups Darrick J. Wong
                   ` (3 preceding siblings ...)
  2020-06-25  1:18 ` [PATCH 4/9] xfs: only reserve quota blocks for bmbt changes if we're changing the data fork Darrick J. Wong
@ 2020-06-25  1:18 ` Darrick J. Wong
  2020-06-25 12:28   ` Brian Foster
  2020-07-01  8:22   ` Christoph Hellwig
  2020-06-25  1:18 ` [PATCH 6/9] xfs: reflink can skip remap existing mappings Darrick J. Wong
                   ` (4 subsequent siblings)
  9 siblings, 2 replies; 31+ messages in thread
From: Darrick J. Wong @ 2020-06-25  1:18 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs, edwin

From: Darrick J. Wong <darrick.wong@oracle.com>

When logging quota block count updates during a reflink operation, we
only log the /delta/ of the block count changes to the dquot.  Since we
now know ahead of time the extent type of both dmap and smap (and that
they have the same length), we know that we only need to reserve quota
blocks for dmap's blockcount if we're mapping it into a hole.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_reflink.c |    6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)


diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 9cc1c340d0ec..72de7179399d 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1051,7 +1051,9 @@ xfs_reflink_remap_extent(
 	 *
 	 * If we are mapping a written extent into the file, we need to have
 	 * enough quota block count reservation to handle the blocks in that
-	 * extent.
+	 * extent.  We log only the delta to the quota block counts, so if the
+	 * extent we're unmapping also has blocks allocated to it, we don't
+	 * need a quota reservation for the extent itself.
 	 *
 	 * Note that if we're replacing a delalloc reservation with a written
 	 * extent, we have to take the full quota reservation because removing
@@ -1063,7 +1065,7 @@ xfs_reflink_remap_extent(
 	qres = qdelta = 0;
 	if (smap_mapped || dmap_written)
 		qres = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
-	if (dmap_written)
+	if (!smap_mapped && dmap_written)
 		qres += dmap->br_blockcount;
 	if (qres > 0) {
 		error = xfs_trans_reserve_quota_nblks(tp, ip, qres, 0,


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 6/9] xfs: reflink can skip remap existing mappings
  2020-06-25  1:17 [PATCH v2 0/9] xfs: reflink cleanups Darrick J. Wong
                   ` (4 preceding siblings ...)
  2020-06-25  1:18 ` [PATCH 5/9] xfs: only reserve quota blocks if we're mapping into a hole Darrick J. Wong
@ 2020-06-25  1:18 ` Darrick J. Wong
  2020-06-25 12:28   ` Brian Foster
  2020-06-25  1:18 ` [PATCH 7/9] xfs: fix xfs_reflink_remap_prep calling conventions Darrick J. Wong
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 31+ messages in thread
From: Darrick J. Wong @ 2020-06-25  1:18 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs, edwin

From: Darrick J. Wong <darrick.wong@oracle.com>

If the source and destination map are identical, we can skip the remap
step to save some time.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_reflink.c |   17 +++++++++++++++++
 1 file changed, 17 insertions(+)


diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 72de7179399d..f1156f121b7d 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1031,6 +1031,23 @@ xfs_reflink_remap_extent(
 
 	trace_xfs_reflink_remap_extent_dest(ip, &smap);
 
+	/*
+	 * Two extents mapped to the same physical block must not have
+	 * different states; that's filesystem corruption.  Move on to the next
+	 * extent if they're both holes or both the same physical extent.
+	 */
+	if (dmap->br_startblock == smap.br_startblock) {
+		ASSERT(dmap->br_startblock == smap.br_startblock);
+		if (dmap->br_state != smap.br_state)
+			error = -EFSCORRUPTED;
+		goto out_cancel;
+	}
+
+	/* If both extents are unwritten, leave them alone. */
+	if (dmap->br_state == XFS_EXT_UNWRITTEN &&
+	    smap.br_state == XFS_EXT_UNWRITTEN)
+		goto out_cancel;
+
 	/* No reflinking if the AG of the dest mapping is low on space. */
 	if (dmap_written) {
 		error = xfs_reflink_ag_has_free_space(mp,


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 7/9] xfs: fix xfs_reflink_remap_prep calling conventions
  2020-06-25  1:17 [PATCH v2 0/9] xfs: reflink cleanups Darrick J. Wong
                   ` (5 preceding siblings ...)
  2020-06-25  1:18 ` [PATCH 6/9] xfs: reflink can skip remap existing mappings Darrick J. Wong
@ 2020-06-25  1:18 ` Darrick J. Wong
  2020-07-01  8:23   ` Christoph Hellwig
  2020-06-25  1:18 ` [PATCH 8/9] xfs: refactor locking and unlocking two inodes against userspace IO Darrick J. Wong
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 31+ messages in thread
From: Darrick J. Wong @ 2020-06-25  1:18 UTC (permalink / raw)
  To: darrick.wong; +Cc: Brian Foster, linux-xfs, edwin

From: Darrick J. Wong <darrick.wong@oracle.com>

Fix the return value of xfs_reflink_remap_prep so that its return value
conventions match the rest of xfs.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_file.c    |    2 +-
 fs/xfs/xfs_reflink.c |    6 +++---
 2 files changed, 4 insertions(+), 4 deletions(-)


diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 00db81eac80d..b375fae811f2 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1035,7 +1035,7 @@ xfs_file_remap_range(
 	/* Prepare and then clone file data. */
 	ret = xfs_reflink_remap_prep(file_in, pos_in, file_out, pos_out,
 			&len, remap_flags);
-	if (ret < 0 || len == 0)
+	if (ret || len == 0)
 		return ret;
 
 	trace_xfs_reflink_remap_range(src, pos_in, len, dest, pos_out);
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index f1156f121b7d..4238d43c773f 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1364,7 +1364,7 @@ xfs_reflink_remap_prep(
 	struct inode		*inode_out = file_inode(file_out);
 	struct xfs_inode	*dest = XFS_I(inode_out);
 	bool			same_inode = (inode_in == inode_out);
-	ssize_t			ret;
+	int			ret;
 
 	/* Lock both files against IO */
 	ret = xfs_iolock_two_inodes_and_break_layout(inode_in, inode_out);
@@ -1388,7 +1388,7 @@ xfs_reflink_remap_prep(
 
 	ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out,
 			len, remap_flags);
-	if (ret < 0 || *len == 0)
+	if (ret || *len == 0)
 		goto out_unlock;
 
 	/* Attach dquots to dest inode before changing block map */
@@ -1423,7 +1423,7 @@ xfs_reflink_remap_prep(
 	if (ret)
 		goto out_unlock;
 
-	return 1;
+	return 0;
 out_unlock:
 	xfs_reflink_remap_unlock(file_in, file_out);
 	return ret;


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 8/9] xfs: refactor locking and unlocking two inodes against userspace IO
  2020-06-25  1:17 [PATCH v2 0/9] xfs: reflink cleanups Darrick J. Wong
                   ` (6 preceding siblings ...)
  2020-06-25  1:18 ` [PATCH 7/9] xfs: fix xfs_reflink_remap_prep calling conventions Darrick J. Wong
@ 2020-06-25  1:18 ` Darrick J. Wong
  2020-07-01  8:26   ` Christoph Hellwig
  2020-06-25  1:18 ` [PATCH 9/9] xfs: move helpers that lock and unlock " Darrick J. Wong
  2020-06-28 12:06 ` [PATCH v2 0/9] xfs: reflink cleanups Edwin Török
  9 siblings, 1 reply; 31+ messages in thread
From: Darrick J. Wong @ 2020-06-25  1:18 UTC (permalink / raw)
  To: darrick.wong; +Cc: Brian Foster, linux-xfs, edwin

From: Darrick J. Wong <darrick.wong@oracle.com>

Refactor the two functions that we use to lock and unlock two inodes to
block userspace from initiating IO against a file, whether via system
calls or mmap activity.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_file.c    |    2 +-
 fs/xfs/xfs_reflink.c |   52 +++++++++++++++++++++++++++++++-------------------
 fs/xfs/xfs_reflink.h |    3 +--
 3 files changed, 34 insertions(+), 23 deletions(-)


diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index b375fae811f2..f189bdcbeddd 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1065,7 +1065,7 @@ xfs_file_remap_range(
 	if (mp->m_flags & XFS_MOUNT_WSYNC)
 		xfs_log_force_inode(dest);
 out_unlock:
-	xfs_reflink_remap_unlock(file_in, file_out);
+	xfs_reflink_remap_unlock(src, dest);
 	if (ret)
 		trace_xfs_reflink_remap_range_error(dest, ret, _RET_IP_);
 	return remapped > 0 ? remapped : ret;
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 4238d43c773f..81190e963c8a 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1281,24 +1281,42 @@ xfs_iolock_two_inodes_and_break_layout(
 	return 0;
 }
 
-/* Unlock both inodes after they've been prepped for a range clone. */
+/*
+ * Lock two files so that userspace cannot initiate I/O via file syscalls or
+ * mmap activity.
+ */
+static int
+xfs_reflink_remap_lock(
+	struct xfs_inode	*ip1,
+	struct xfs_inode	*ip2)
+{
+	int			ret;
+
+	ret = xfs_iolock_two_inodes_and_break_layout(VFS_I(ip1), VFS_I(ip2));
+	if (ret)
+		return ret;
+	if (ip1 == ip2)
+		xfs_ilock(ip1, XFS_MMAPLOCK_EXCL);
+	else
+		xfs_lock_two_inodes(ip1, XFS_MMAPLOCK_EXCL,
+				    ip2, XFS_MMAPLOCK_EXCL);
+	return 0;
+}
+
+/* Unlock both files to allow IO and mmap activity. */
 void
 xfs_reflink_remap_unlock(
-	struct file		*file_in,
-	struct file		*file_out)
+	struct xfs_inode	*ip1,
+	struct xfs_inode	*ip2)
 {
-	struct inode		*inode_in = file_inode(file_in);
-	struct xfs_inode	*src = XFS_I(inode_in);
-	struct inode		*inode_out = file_inode(file_out);
-	struct xfs_inode	*dest = XFS_I(inode_out);
-	bool			same_inode = (inode_in == inode_out);
+	bool			same_inode = (ip1 == ip2);
 
-	xfs_iunlock(dest, XFS_MMAPLOCK_EXCL);
+	xfs_iunlock(ip2, XFS_MMAPLOCK_EXCL);
 	if (!same_inode)
-		xfs_iunlock(src, XFS_MMAPLOCK_EXCL);
-	inode_unlock(inode_out);
+		xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL);
+	inode_unlock(VFS_I(ip2));
 	if (!same_inode)
-		inode_unlock(inode_in);
+		inode_unlock(VFS_I(ip1));
 }
 
 /*
@@ -1363,18 +1381,12 @@ xfs_reflink_remap_prep(
 	struct xfs_inode	*src = XFS_I(inode_in);
 	struct inode		*inode_out = file_inode(file_out);
 	struct xfs_inode	*dest = XFS_I(inode_out);
-	bool			same_inode = (inode_in == inode_out);
 	int			ret;
 
 	/* Lock both files against IO */
-	ret = xfs_iolock_two_inodes_and_break_layout(inode_in, inode_out);
+	ret = xfs_reflink_remap_lock(src, dest);
 	if (ret)
 		return ret;
-	if (same_inode)
-		xfs_ilock(src, XFS_MMAPLOCK_EXCL);
-	else
-		xfs_lock_two_inodes(src, XFS_MMAPLOCK_EXCL, dest,
-				XFS_MMAPLOCK_EXCL);
 
 	/* Check file eligibility and prepare for block sharing. */
 	ret = -EINVAL;
@@ -1425,7 +1437,7 @@ xfs_reflink_remap_prep(
 
 	return 0;
 out_unlock:
-	xfs_reflink_remap_unlock(file_in, file_out);
+	xfs_reflink_remap_unlock(src, dest);
 	return ret;
 }
 
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
index 3e4fd46373ab..ceeb59b86b29 100644
--- a/fs/xfs/xfs_reflink.h
+++ b/fs/xfs/xfs_reflink.h
@@ -56,7 +56,6 @@ extern int xfs_reflink_remap_blocks(struct xfs_inode *src, loff_t pos_in,
 		loff_t *remapped);
 extern int xfs_reflink_update_dest(struct xfs_inode *dest, xfs_off_t newlen,
 		xfs_extlen_t cowextsize, unsigned int remap_flags);
-extern void xfs_reflink_remap_unlock(struct file *file_in,
-		struct file *file_out);
+extern void xfs_reflink_remap_unlock(struct xfs_inode *ip1, struct xfs_inode *ip2);
 
 #endif /* __XFS_REFLINK_H */


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 9/9] xfs: move helpers that lock and unlock two inodes against userspace IO
  2020-06-25  1:17 [PATCH v2 0/9] xfs: reflink cleanups Darrick J. Wong
                   ` (7 preceding siblings ...)
  2020-06-25  1:18 ` [PATCH 8/9] xfs: refactor locking and unlocking two inodes against userspace IO Darrick J. Wong
@ 2020-06-25  1:18 ` Darrick J. Wong
  2020-07-01  8:27   ` Christoph Hellwig
  2020-06-28 12:06 ` [PATCH v2 0/9] xfs: reflink cleanups Edwin Török
  9 siblings, 1 reply; 31+ messages in thread
From: Darrick J. Wong @ 2020-06-25  1:18 UTC (permalink / raw)
  To: darrick.wong; +Cc: Brian Foster, linux-xfs, edwin

From: Darrick J. Wong <darrick.wong@oracle.com>

Move the double-inode locking helpers to xfs_inode.c since they're not
specific to reflink.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_file.c    |    2 +
 fs/xfs/xfs_inode.c   |   93 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_inode.h   |    3 ++
 fs/xfs/xfs_reflink.c |   97 +-------------------------------------------------
 fs/xfs/xfs_reflink.h |    1 -
 5 files changed, 99 insertions(+), 97 deletions(-)


diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index f189bdcbeddd..97aa74800bd9 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1065,7 +1065,7 @@ xfs_file_remap_range(
 	if (mp->m_flags & XFS_MOUNT_WSYNC)
 		xfs_log_force_inode(dest);
 out_unlock:
-	xfs_reflink_remap_unlock(src, dest);
+	xfs_iunlock2_io_mmap(src, dest);
 	if (ret)
 		trace_xfs_reflink_remap_range_error(dest, ret, _RET_IP_);
 	return remapped > 0 ? remapped : ret;
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 9aea7d68d8ab..24edec472a7c 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -3881,3 +3881,96 @@ xfs_log_force_inode(
 		return 0;
 	return xfs_log_force_lsn(ip->i_mount, lsn, XFS_LOG_SYNC, NULL);
 }
+
+/*
+ * Grab the exclusive iolock for a data copy from src to dest, making sure to
+ * abide vfs locking order (lowest pointer value goes first) and breaking the
+ * layout leases before proceeding.  The loop is needed because we cannot call
+ * the blocking break_layout() with the iolocks held, and therefore have to
+ * back out both locks.
+ */
+static int
+xfs_iolock_two_inodes_and_break_layout(
+	struct inode		*src,
+	struct inode		*dest)
+{
+	int			error;
+
+	if (src > dest)
+		swap(src, dest);
+
+retry:
+	/* Wait to break both inodes' layouts before we start locking. */
+	error = break_layout(src, true);
+	if (error)
+		return error;
+	if (src != dest) {
+		error = break_layout(dest, true);
+		if (error)
+			return error;
+	}
+
+	/* Lock one inode and make sure nobody got in and leased it. */
+	inode_lock(src);
+	error = break_layout(src, false);
+	if (error) {
+		inode_unlock(src);
+		if (error == -EWOULDBLOCK)
+			goto retry;
+		return error;
+	}
+
+	if (src == dest)
+		return 0;
+
+	/* Lock the other inode and make sure nobody got in and leased it. */
+	inode_lock_nested(dest, I_MUTEX_NONDIR2);
+	error = break_layout(dest, false);
+	if (error) {
+		inode_unlock(src);
+		inode_unlock(dest);
+		if (error == -EWOULDBLOCK)
+			goto retry;
+		return error;
+	}
+
+	return 0;
+}
+
+/*
+ * Lock two inodes so that userspace cannot initiate I/O via file syscalls or
+ * mmap activity.
+ */
+int
+xfs_ilock2_io_mmap(
+	struct xfs_inode	*ip1,
+	struct xfs_inode	*ip2)
+{
+	int			ret;
+
+	ret = xfs_iolock_two_inodes_and_break_layout(VFS_I(ip1), VFS_I(ip2));
+	if (ret)
+		return ret;
+	if (ip1 == ip2)
+		xfs_ilock(ip1, XFS_MMAPLOCK_EXCL);
+	else
+		xfs_lock_two_inodes(ip1, XFS_MMAPLOCK_EXCL,
+				    ip2, XFS_MMAPLOCK_EXCL);
+	return 0;
+}
+
+/* Unlock both inodes to allow IO and mmap activity. */
+void
+xfs_iunlock2_io_mmap(
+	struct xfs_inode	*ip1,
+	struct xfs_inode	*ip2)
+{
+	bool			same_inode = (ip1 == ip2);
+
+	xfs_iunlock(ip2, XFS_MMAPLOCK_EXCL);
+	if (!same_inode)
+		xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL);
+	inode_unlock(VFS_I(ip2));
+	if (!same_inode)
+		inode_unlock(VFS_I(ip1));
+}
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 47d3b391030d..1534386b430c 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -499,4 +499,7 @@ void xfs_iunlink_destroy(struct xfs_perag *pag);
 
 void xfs_end_io(struct work_struct *work);
 
+int xfs_ilock2_io_mmap(struct xfs_inode *ip1, struct xfs_inode *ip2);
+void xfs_iunlock2_io_mmap(struct xfs_inode *ip1, struct xfs_inode *ip2);
+
 #endif	/* __XFS_INODE_H__ */
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 81190e963c8a..fac19d9bc083 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1226,99 +1226,6 @@ xfs_reflink_remap_blocks(
 	return error;
 }
 
-/*
- * Grab the exclusive iolock for a data copy from src to dest, making sure to
- * abide vfs locking order (lowest pointer value goes first) and breaking the
- * layout leases before proceeding.  The loop is needed because we cannot call
- * the blocking break_layout() with the iolocks held, and therefore have to
- * back out both locks.
- */
-static int
-xfs_iolock_two_inodes_and_break_layout(
-	struct inode		*src,
-	struct inode		*dest)
-{
-	int			error;
-
-	if (src > dest)
-		swap(src, dest);
-
-retry:
-	/* Wait to break both inodes' layouts before we start locking. */
-	error = break_layout(src, true);
-	if (error)
-		return error;
-	if (src != dest) {
-		error = break_layout(dest, true);
-		if (error)
-			return error;
-	}
-
-	/* Lock one inode and make sure nobody got in and leased it. */
-	inode_lock(src);
-	error = break_layout(src, false);
-	if (error) {
-		inode_unlock(src);
-		if (error == -EWOULDBLOCK)
-			goto retry;
-		return error;
-	}
-
-	if (src == dest)
-		return 0;
-
-	/* Lock the other inode and make sure nobody got in and leased it. */
-	inode_lock_nested(dest, I_MUTEX_NONDIR2);
-	error = break_layout(dest, false);
-	if (error) {
-		inode_unlock(src);
-		inode_unlock(dest);
-		if (error == -EWOULDBLOCK)
-			goto retry;
-		return error;
-	}
-
-	return 0;
-}
-
-/*
- * Lock two files so that userspace cannot initiate I/O via file syscalls or
- * mmap activity.
- */
-static int
-xfs_reflink_remap_lock(
-	struct xfs_inode	*ip1,
-	struct xfs_inode	*ip2)
-{
-	int			ret;
-
-	ret = xfs_iolock_two_inodes_and_break_layout(VFS_I(ip1), VFS_I(ip2));
-	if (ret)
-		return ret;
-	if (ip1 == ip2)
-		xfs_ilock(ip1, XFS_MMAPLOCK_EXCL);
-	else
-		xfs_lock_two_inodes(ip1, XFS_MMAPLOCK_EXCL,
-				    ip2, XFS_MMAPLOCK_EXCL);
-	return 0;
-}
-
-/* Unlock both files to allow IO and mmap activity. */
-void
-xfs_reflink_remap_unlock(
-	struct xfs_inode	*ip1,
-	struct xfs_inode	*ip2)
-{
-	bool			same_inode = (ip1 == ip2);
-
-	xfs_iunlock(ip2, XFS_MMAPLOCK_EXCL);
-	if (!same_inode)
-		xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL);
-	inode_unlock(VFS_I(ip2));
-	if (!same_inode)
-		inode_unlock(VFS_I(ip1));
-}
-
 /*
  * If we're reflinking to a point past the destination file's EOF, we must
  * zero any speculative post-EOF preallocations that sit between the old EOF
@@ -1384,7 +1291,7 @@ xfs_reflink_remap_prep(
 	int			ret;
 
 	/* Lock both files against IO */
-	ret = xfs_reflink_remap_lock(src, dest);
+	ret = xfs_ilock2_io_mmap(src, dest);
 	if (ret)
 		return ret;
 
@@ -1437,7 +1344,7 @@ xfs_reflink_remap_prep(
 
 	return 0;
 out_unlock:
-	xfs_reflink_remap_unlock(src, dest);
+	xfs_iunlock2_io_mmap(src, dest);
 	return ret;
 }
 
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
index ceeb59b86b29..487b00434b96 100644
--- a/fs/xfs/xfs_reflink.h
+++ b/fs/xfs/xfs_reflink.h
@@ -56,6 +56,5 @@ extern int xfs_reflink_remap_blocks(struct xfs_inode *src, loff_t pos_in,
 		loff_t *remapped);
 extern int xfs_reflink_update_dest(struct xfs_inode *dest, xfs_off_t newlen,
 		xfs_extlen_t cowextsize, unsigned int remap_flags);
-extern void xfs_reflink_remap_unlock(struct xfs_inode *ip1, struct xfs_inode *ip2);
 
 #endif /* __XFS_REFLINK_H */


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/9] xfs: fix reflink quota reservation accounting error
  2020-06-25  1:17 ` [PATCH 1/9] xfs: fix reflink quota reservation accounting error Darrick J. Wong
@ 2020-06-25 12:26   ` Brian Foster
  2020-07-01  8:07   ` Christoph Hellwig
  1 sibling, 0 replies; 31+ messages in thread
From: Brian Foster @ 2020-06-25 12:26 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, edwin

On Wed, Jun 24, 2020 at 06:17:45PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Quota reservations are supposed to account for the blocks that might be
> allocated due to a bmap btree split.  Reflink doesn't do this, so fix
> this to make the quota accounting more accurate before we start
> rearranging things.
> 
> Fixes: 862bb360ef56 ("xfs: reflink extents from one file to another")
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_reflink.c |   21 ++++++++++++++-------
>  1 file changed, 14 insertions(+), 7 deletions(-)
> 
> 
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index 107bf2a2f344..d89201d40891 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -1003,6 +1003,7 @@ xfs_reflink_remap_extent(
>  	xfs_filblks_t		rlen;
>  	xfs_filblks_t		unmap_len;
>  	xfs_off_t		newlen;
> +	int64_t			qres;
>  	int			error;
>  
>  	unmap_len = irec->br_startoff + irec->br_blockcount - destoff;
> @@ -1025,13 +1026,19 @@ xfs_reflink_remap_extent(
>  	xfs_ilock(ip, XFS_ILOCK_EXCL);
>  	xfs_trans_ijoin(tp, ip, 0);
>  
> -	/* If we're not just clearing space, then do we have enough quota? */
> -	if (real_extent) {
> -		error = xfs_trans_reserve_quota_nblks(tp, ip,
> -				irec->br_blockcount, 0, XFS_QMOPT_RES_REGBLKS);
> -		if (error)
> -			goto out_cancel;
> -	}
> +	/*
> +	 * Reserve quota for this operation.  We don't know if the first unmap
> +	 * in the dest file will cause a bmap btree split, so we always reserve
> +	 * at least enough blocks for that split.  If the extent being mapped
> +	 * in is written, we need to reserve quota for that too.
> +	 */
> +	qres = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
> +	if (real_extent)
> +		qres += irec->br_blockcount;
> +	error = xfs_trans_reserve_quota_nblks(tp, ip, qres, 0,
> +			XFS_QMOPT_RES_REGBLKS);
> +	if (error)
> +		goto out_cancel;
>  
>  	trace_xfs_reflink_remap(ip, irec->br_startoff,
>  				irec->br_blockcount, irec->br_startblock);
> 


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/9] xfs: rename xfs_bmap_is_real_extent to is_written_extent
  2020-06-25  1:17 ` [PATCH 2/9] xfs: rename xfs_bmap_is_real_extent to is_written_extent Darrick J. Wong
@ 2020-06-25 12:26   ` Brian Foster
  2020-07-01  8:08   ` Christoph Hellwig
  1 sibling, 0 replies; 31+ messages in thread
From: Brian Foster @ 2020-06-25 12:26 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, edwin

On Wed, Jun 24, 2020 at 06:17:52PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> The name of this predicate is a little misleading -- it decides if the
> extent mapping is allocated and written.  Change the name to be more
> direct, as we're going to add a new predicate in the next patch.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/libxfs/xfs_bmap.h     |    2 +-
>  fs/xfs/libxfs/xfs_rtbitmap.c |    2 +-
>  fs/xfs/xfs_reflink.c         |    6 +++---
>  3 files changed, 5 insertions(+), 5 deletions(-)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
> index 6028a3c825ba..2b18338d0643 100644
> --- a/fs/xfs/libxfs/xfs_bmap.h
> +++ b/fs/xfs/libxfs/xfs_bmap.h
> @@ -163,7 +163,7 @@ static inline int xfs_bmapi_whichfork(int bmapi_flags)
>   * Return true if the extent is a real, allocated extent, or false if it is  a
>   * delayed allocation, and unwritten extent or a hole.
>   */
> -static inline bool xfs_bmap_is_real_extent(struct xfs_bmbt_irec *irec)
> +static inline bool xfs_bmap_is_written_extent(struct xfs_bmbt_irec *irec)
>  {
>  	return irec->br_state != XFS_EXT_UNWRITTEN &&
>  		irec->br_startblock != HOLESTARTBLOCK &&
> diff --git a/fs/xfs/libxfs/xfs_rtbitmap.c b/fs/xfs/libxfs/xfs_rtbitmap.c
> index 9498ced947be..1d9fa8a300f1 100644
> --- a/fs/xfs/libxfs/xfs_rtbitmap.c
> +++ b/fs/xfs/libxfs/xfs_rtbitmap.c
> @@ -70,7 +70,7 @@ xfs_rtbuf_get(
>  	if (error)
>  		return error;
>  
> -	if (XFS_IS_CORRUPT(mp, nmap == 0 || !xfs_bmap_is_real_extent(&map)))
> +	if (XFS_IS_CORRUPT(mp, nmap == 0 || !xfs_bmap_is_written_extent(&map)))
>  		return -EFSCORRUPTED;
>  
>  	ASSERT(map.br_startblock != NULLFSBLOCK);
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index d89201d40891..22fdea6d69d3 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -179,7 +179,7 @@ xfs_reflink_trim_around_shared(
>  	int			error = 0;
>  
>  	/* Holes, unwritten, and delalloc extents cannot be shared */
> -	if (!xfs_is_cow_inode(ip) || !xfs_bmap_is_real_extent(irec)) {
> +	if (!xfs_is_cow_inode(ip) || !xfs_bmap_is_written_extent(irec)) {
>  		*shared = false;
>  		return 0;
>  	}
> @@ -655,7 +655,7 @@ xfs_reflink_end_cow_extent(
>  	 * preallocations can leak into the range we are called upon, and we
>  	 * need to skip them.
>  	 */
> -	if (!xfs_bmap_is_real_extent(&got)) {
> +	if (!xfs_bmap_is_written_extent(&got)) {
>  		*end_fsb = del.br_startoff;
>  		goto out_cancel;
>  	}
> @@ -996,7 +996,7 @@ xfs_reflink_remap_extent(
>  	xfs_off_t		new_isize)
>  {
>  	struct xfs_mount	*mp = ip->i_mount;
> -	bool			real_extent = xfs_bmap_is_real_extent(irec);
> +	bool			real_extent = xfs_bmap_is_written_extent(irec);
>  	struct xfs_trans	*tp;
>  	unsigned int		resblks;
>  	struct xfs_bmbt_irec	uirec;
> 


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/9] xfs: redesign the reflink remap loop to fix blkres depletion crash
  2020-06-25  1:17 ` [PATCH 3/9] xfs: redesign the reflink remap loop to fix blkres depletion crash Darrick J. Wong
@ 2020-06-25 12:27   ` Brian Foster
  2020-06-25 16:57     ` Darrick J. Wong
  2020-06-25 17:23   ` [PATCH v4.2 " Darrick J. Wong
  1 sibling, 1 reply; 31+ messages in thread
From: Brian Foster @ 2020-06-25 12:27 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, edwin

On Wed, Jun 24, 2020 at 06:17:59PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> The existing reflink remapping loop has some structural problems that
> need addressing:
> 
> The biggest problem is that we create one transaction for each extent in
> the source file without accounting for the number of mappings there are
> for the same range in the destination file.  In other words, we don't
> know the number of remap operations that will be necessary and we
> therefore cannot guess the block reservation required.  On highly
> fragmented filesystems (e.g. ones with active dedupe) we guess wrong,
> run out of block reservation, and fail.
> 
> The second problem is that we don't actually use the bmap intents to
> their full potential -- instead of calling bunmapi directly and having
> to deal with its backwards operation, we could call the deferred ops
> xfs_bmap_unmap_extent and xfs_refcount_decrease_extent instead.  This
> makes the frontend loop much simpler.
> 
> Solve all of these problems by refactoring the remapping loops so that
> we only perform one remapping operation per transaction, and each
> operation only tries to remap a single extent from source to dest.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---

Looks pretty good to me outside of a few nits..

>  fs/xfs/libxfs/xfs_bmap.h |   13 ++
>  fs/xfs/xfs_reflink.c     |  242 +++++++++++++++++++++++++---------------------
>  fs/xfs/xfs_trace.h       |   52 +---------
>  3 files changed, 142 insertions(+), 165 deletions(-)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
> index 2b18338d0643..2e3c49f1f5c9 100644
> --- a/fs/xfs/libxfs/xfs_bmap.h
> +++ b/fs/xfs/libxfs/xfs_bmap.h
> @@ -158,6 +158,13 @@ static inline int xfs_bmapi_whichfork(int bmapi_flags)
>  	{ BMAP_ATTRFORK,	"ATTR" }, \
>  	{ BMAP_COWFORK,		"COW" }
>  
> +/* Return true if the extent is an allocated extent, written or not. */
> +static inline bool xfs_bmap_is_mapped_extent(struct xfs_bmbt_irec *irec)
> +{
> +	return irec->br_startblock != HOLESTARTBLOCK &&
> +		irec->br_startblock != DELAYSTARTBLOCK &&
> +		!isnullstartblock(irec->br_startblock);
> +}

The point of the previous patch was to facilitate renaming this helper,
no?

>  
>  /*
>   * Return true if the extent is a real, allocated extent, or false if it is  a
> @@ -165,10 +172,8 @@ static inline int xfs_bmapi_whichfork(int bmapi_flags)
>   */
>  static inline bool xfs_bmap_is_written_extent(struct xfs_bmbt_irec *irec)
>  {
> -	return irec->br_state != XFS_EXT_UNWRITTEN &&
> -		irec->br_startblock != HOLESTARTBLOCK &&
> -		irec->br_startblock != DELAYSTARTBLOCK &&
> -		!isnullstartblock(irec->br_startblock);
> +	return xfs_bmap_is_mapped_extent(irec) &&
> +	       irec->br_state != XFS_EXT_UNWRITTEN;
>  }
>  
>  /*
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index 22fdea6d69d3..c593d52156df 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -984,41 +984,28 @@ xfs_reflink_ag_has_free_space(
>  }
>  
>  /*
> - * Unmap a range of blocks from a file, then map other blocks into the hole.
> - * The range to unmap is (destoff : destoff + srcioff + irec->br_blockcount).
> - * The extent irec is mapped into dest at irec->br_startoff.
> + * Remap the given extent into the file.  The dmap blockcount will be set to
> + * the number of blocks that were actually remapped.
>   */
>  STATIC int
>  xfs_reflink_remap_extent(
>  	struct xfs_inode	*ip,
> -	struct xfs_bmbt_irec	*irec,
> -	xfs_fileoff_t		destoff,
> +	struct xfs_bmbt_irec	*dmap,
>  	xfs_off_t		new_isize)
>  {
> +	struct xfs_bmbt_irec	smap;
>  	struct xfs_mount	*mp = ip->i_mount;
> -	bool			real_extent = xfs_bmap_is_written_extent(irec);
>  	struct xfs_trans	*tp;
> -	unsigned int		resblks;
> -	struct xfs_bmbt_irec	uirec;
> -	xfs_filblks_t		rlen;
> -	xfs_filblks_t		unmap_len;
>  	xfs_off_t		newlen;
> -	int64_t			qres;
> +	int64_t			qres, qdelta;
> +	unsigned int		resblks;
> +	bool			smap_mapped;
> +	bool			dmap_written = xfs_bmap_is_written_extent(dmap);
> +	int			nimaps;
>  	int			error;
>  
> -	unmap_len = irec->br_startoff + irec->br_blockcount - destoff;
> -	trace_xfs_reflink_punch_range(ip, destoff, unmap_len);
> -
> -	/* No reflinking if we're low on space */
> -	if (real_extent) {
> -		error = xfs_reflink_ag_has_free_space(mp,
> -				XFS_FSB_TO_AGNO(mp, irec->br_startblock));
> -		if (error)
> -			goto out;
> -	}
> -
>  	/* Start a rolling transaction to switch the mappings */
> -	resblks = XFS_EXTENTADD_SPACE_RES(ip->i_mount, XFS_DATA_FORK);
> +	resblks = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
>  	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0, 0, &tp);
>  	if (error)
>  		goto out;
> @@ -1026,93 +1013,120 @@ xfs_reflink_remap_extent(
>  	xfs_ilock(ip, XFS_ILOCK_EXCL);
>  	xfs_trans_ijoin(tp, ip, 0);
>  
> +	/* Read what's currently mapped in the destination file */
> +	nimaps = 1;
> +	error = xfs_bmapi_read(ip, dmap->br_startoff, dmap->br_blockcount,
> +			&smap, &nimaps, 0);
> +	if (error)
> +		goto out_cancel;
> +	ASSERT(nimaps == 1 && smap.br_startoff == dmap->br_startoff);
> +	smap_mapped = xfs_bmap_is_mapped_extent(&smap);
> +

I have to admit I find the naming a little confusing. smap (source?)
refers to the extent currently in the destination file, and dmap refers
to the extent from the source file to map into the destination..?

>  	/*
> -	 * Reserve quota for this operation.  We don't know if the first unmap
> -	 * in the dest file will cause a bmap btree split, so we always reserve
> -	 * at least enough blocks for that split.  If the extent being mapped
> -	 * in is written, we need to reserve quota for that too.
> +	 * We can only remap as many blocks as the smaller of the two extent
> +	 * maps, because we can only remap one extent at a time.
>  	 */
> +	dmap->br_blockcount = min(dmap->br_blockcount, smap.br_blockcount);
> +	ASSERT(dmap->br_blockcount == smap.br_blockcount);
> +
> +	trace_xfs_reflink_remap_extent_dest(ip, &smap);
> +
> +	/* No reflinking if the AG of the dest mapping is low on space. */
> +	if (dmap_written) {
> +		error = xfs_reflink_ag_has_free_space(mp,
> +				XFS_FSB_TO_AGNO(mp, dmap->br_startblock));
> +		if (error)
> +			goto out_cancel;
> +	}
> +
> +	/*
> +	 * Compute quota reservation if we think the quota block counter for
> +	 * this file could increase.
> +	 *
> +	 * We start by reserving enough blocks to handle a bmbt split.
> +	 *
> +	 * If we are mapping a written extent into the file, we need to have
> +	 * enough quota block count reservation to handle the blocks in that
> +	 * extent.
> +	 *
> +	 * Note that if we're replacing a delalloc reservation with a written
> +	 * extent, we have to take the full quota reservation because removing
> +	 * the delalloc reservation gives the block count back to the quota
> +	 * count.  This is suboptimal, but the VFS flushed the dest range
> +	 * before we started.  That should have removed all the delalloc
> +	 * reservations, but we code defensively.
> +	 */
> +	qdelta = 0;
>  	qres = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
> -	if (real_extent)
> -		qres += irec->br_blockcount;
> -	error = xfs_trans_reserve_quota_nblks(tp, ip, qres, 0,
> -			XFS_QMOPT_RES_REGBLKS);
> -	if (error)
> -		goto out_cancel;
> -
> -	trace_xfs_reflink_remap(ip, irec->br_startoff,
> -				irec->br_blockcount, irec->br_startblock);
> -
> -	/* Unmap the old blocks in the data fork. */
> -	rlen = unmap_len;
> -	while (rlen) {
> -		ASSERT(tp->t_firstblock == NULLFSBLOCK);
> -		error = __xfs_bunmapi(tp, ip, destoff, &rlen, 0, 1);
> +	if (dmap_written)
> +		qres += dmap->br_blockcount;
> +	if (qres > 0) {

Why is this check necessary if we assign qres a base value?

> +		error = xfs_trans_reserve_quota_nblks(tp, ip, qres, 0,
> +				XFS_QMOPT_RES_REGBLKS);
>  		if (error)
>  			goto out_cancel;
> +	}
>  
> +	if (smap_mapped) {
>  		/*
> -		 * Trim the extent to whatever got unmapped.
> -		 * Remember, bunmapi works backwards.
> +		 * If the extent we're unmapping is backed by storage (written
> +		 * or not), unmap the extent and drop its refcount.
>  		 */
> -		uirec.br_startblock = irec->br_startblock + rlen;
> -		uirec.br_startoff = irec->br_startoff + rlen;
> -		uirec.br_blockcount = unmap_len - rlen;
> -		uirec.br_state = irec->br_state;
> -		unmap_len = rlen;
> -
> -		/* If this isn't a real mapping, we're done. */
> -		if (!real_extent || uirec.br_blockcount == 0)
> -			goto next_extent;
> +		xfs_bmap_unmap_extent(tp, ip, &smap);
> +		xfs_refcount_decrease_extent(tp, &smap);
> +		qdelta -= smap.br_blockcount;
> +	} else if (smap.br_startblock == DELAYSTARTBLOCK) {
> +		xfs_filblks_t	len = smap.br_blockcount;
>  
> -		trace_xfs_reflink_remap(ip, uirec.br_startoff,
> -				uirec.br_blockcount, uirec.br_startblock);
> -
> -		/* Update the refcount tree */
> -		xfs_refcount_increase_extent(tp, &uirec);
> -
> -		/* Map the new blocks into the data fork. */
> -		xfs_bmap_map_extent(tp, ip, &uirec);
> +		/*
> +		 * If the extent we're unmapping is a delalloc reservation,
> +		 * we can use the regular bunmapi function to release the
> +		 * incore state.  Dropping the delalloc reservation takes care
> +		 * of the quota reservation for us.
> +		 */
> +		error = __xfs_bunmapi(NULL, ip, smap.br_startoff, &len, 0, 1);
> +		if (error)
> +			goto out_cancel;
> +		ASSERT(len == 0);
> +	}
>  
> -		/* Update quota accounting. */
> -		xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_BCOUNT,
> -				uirec.br_blockcount);
> +	/*
> +	 * If the extent we're sharing is backed by written storage, increase
> +	 * its refcount and map it into the file.
> +	 */
> +	if (dmap_written) {
> +		xfs_refcount_increase_extent(tp, dmap);
> +		xfs_bmap_map_extent(tp, ip, dmap);
> +		qdelta += dmap->br_blockcount;
> +	}
>  
> -		/* Update dest isize if needed. */
> -		newlen = XFS_FSB_TO_B(mp,
> -				uirec.br_startoff + uirec.br_blockcount);
> -		newlen = min_t(xfs_off_t, newlen, new_isize);
> -		if (newlen > i_size_read(VFS_I(ip))) {
> -			trace_xfs_reflink_update_inode_size(ip, newlen);
> -			i_size_write(VFS_I(ip), newlen);
> -			ip->i_d.di_size = newlen;
> -			xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
> -		}
> +	xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_BCOUNT, qdelta);
>  
> -next_extent:
> -		/* Process all the deferred stuff. */
> -		error = xfs_defer_finish(&tp);
> -		if (error)
> -			goto out_cancel;
> +	/* Update dest isize if needed. */
> +	newlen = XFS_FSB_TO_B(mp, smap.br_startoff + smap.br_blockcount);

I guess functionally it doesn't matter because the extents should have
the same geometry, but this technically should refer to the extent
mapped _into_ the dest file, right? More just another artifact of the
naming confusion I guess..

Brian

> +	newlen = min_t(xfs_off_t, newlen, new_isize);
> +	if (newlen > i_size_read(VFS_I(ip))) {
> +		trace_xfs_reflink_update_inode_size(ip, newlen);
> +		i_size_write(VFS_I(ip), newlen);
> +		ip->i_d.di_size = newlen;
> +		xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
>  	}
>  
> +	/* Commit everything and unlock. */
>  	error = xfs_trans_commit(tp);
> -	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> -	if (error)
> -		goto out;
> -	return 0;
> +	goto out_unlock;
>  
>  out_cancel:
>  	xfs_trans_cancel(tp);
> +out_unlock:
>  	xfs_iunlock(ip, XFS_ILOCK_EXCL);
>  out:
> -	trace_xfs_reflink_remap_extent_error(ip, error, _RET_IP_);
> +	if (error)
> +		trace_xfs_reflink_remap_extent_error(ip, error, _RET_IP_);
>  	return error;
>  }
>  
> -/*
> - * Iteratively remap one file's extents (and holes) to another's.
> - */
> +/* Remap a range of one file to the other. */
>  int
>  xfs_reflink_remap_blocks(
>  	struct xfs_inode	*src,
> @@ -1123,25 +1137,22 @@ xfs_reflink_remap_blocks(
>  	loff_t			*remapped)
>  {
>  	struct xfs_bmbt_irec	imap;
> -	xfs_fileoff_t		srcoff;
> -	xfs_fileoff_t		destoff;
> +	struct xfs_mount	*mp = src->i_mount;
> +	xfs_fileoff_t		srcoff = XFS_B_TO_FSBT(mp, pos_in);
> +	xfs_fileoff_t		destoff = XFS_B_TO_FSBT(mp, pos_out);
>  	xfs_filblks_t		len;
> -	xfs_filblks_t		range_len;
>  	xfs_filblks_t		remapped_len = 0;
>  	xfs_off_t		new_isize = pos_out + remap_len;
>  	int			nimaps;
>  	int			error = 0;
>  
> -	destoff = XFS_B_TO_FSBT(src->i_mount, pos_out);
> -	srcoff = XFS_B_TO_FSBT(src->i_mount, pos_in);
> -	len = XFS_B_TO_FSB(src->i_mount, remap_len);
> +	len = min_t(xfs_filblks_t, XFS_B_TO_FSB(mp, remap_len),
> +			XFS_MAX_FILEOFF);
>  
> -	/* drange = (destoff, destoff + len); srange = (srcoff, srcoff + len) */
> -	while (len) {
> -		uint		lock_mode;
> +	trace_xfs_reflink_remap_blocks(src, srcoff, len, dest, destoff);
>  
> -		trace_xfs_reflink_remap_blocks_loop(src, srcoff, len,
> -				dest, destoff);
> +	while (len > 0) {
> +		unsigned int	lock_mode;
>  
>  		/* Read extent from the source file */
>  		nimaps = 1;
> @@ -1150,18 +1161,25 @@ xfs_reflink_remap_blocks(
>  		xfs_iunlock(src, lock_mode);
>  		if (error)
>  			break;
> -		ASSERT(nimaps == 1);
> +		/*
> +		 * The caller supposedly flushed all dirty pages in the source
> +		 * file range, which means that writeback should have allocated
> +		 * or deleted all delalloc reservations in that range.  If we
> +		 * find one, that's a good sign that something is seriously
> +		 * wrong here.
> +		 */
> +		ASSERT(nimaps == 1 && imap.br_startoff == srcoff);
> +		if (imap.br_startblock == DELAYSTARTBLOCK) {
> +			ASSERT(imap.br_startblock != DELAYSTARTBLOCK);
> +			error = -EFSCORRUPTED;
> +			break;
> +		}
>  
> -		trace_xfs_reflink_remap_imap(src, srcoff, len, XFS_DATA_FORK,
> -				&imap);
> +		trace_xfs_reflink_remap_extent_src(src, &imap);
>  
> -		/* Translate imap into the destination file. */
> -		range_len = imap.br_startoff + imap.br_blockcount - srcoff;
> -		imap.br_startoff += destoff - srcoff;
> -
> -		/* Clear dest from destoff to the end of imap and map it in. */
> -		error = xfs_reflink_remap_extent(dest, &imap, destoff,
> -				new_isize);
> +		/* Remap into the destination file at the given offset. */
> +		imap.br_startoff = destoff;
> +		error = xfs_reflink_remap_extent(dest, &imap, new_isize);
>  		if (error)
>  			break;
>  
> @@ -1171,10 +1189,10 @@ xfs_reflink_remap_blocks(
>  		}
>  
>  		/* Advance drange/srange */
> -		srcoff += range_len;
> -		destoff += range_len;
> -		len -= range_len;
> -		remapped_len += range_len;
> +		srcoff += imap.br_blockcount;
> +		destoff += imap.br_blockcount;
> +		len -= imap.br_blockcount;
> +		remapped_len += imap.br_blockcount;
>  	}
>  
>  	if (error)
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index 460136628a79..50c478374a31 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -3052,8 +3052,7 @@ DEFINE_EVENT(xfs_inode_irec_class, name, \
>  DEFINE_INODE_EVENT(xfs_reflink_set_inode_flag);
>  DEFINE_INODE_EVENT(xfs_reflink_unset_inode_flag);
>  DEFINE_ITRUNC_EVENT(xfs_reflink_update_inode_size);
> -DEFINE_IMAP_EVENT(xfs_reflink_remap_imap);
> -TRACE_EVENT(xfs_reflink_remap_blocks_loop,
> +TRACE_EVENT(xfs_reflink_remap_blocks,
>  	TP_PROTO(struct xfs_inode *src, xfs_fileoff_t soffset,
>  		 xfs_filblks_t len, struct xfs_inode *dest,
>  		 xfs_fileoff_t doffset),
> @@ -3084,59 +3083,14 @@ TRACE_EVENT(xfs_reflink_remap_blocks_loop,
>  		  __entry->dest_ino,
>  		  __entry->dest_lblk)
>  );
> -TRACE_EVENT(xfs_reflink_punch_range,
> -	TP_PROTO(struct xfs_inode *ip, xfs_fileoff_t lblk,
> -		 xfs_extlen_t len),
> -	TP_ARGS(ip, lblk, len),
> -	TP_STRUCT__entry(
> -		__field(dev_t, dev)
> -		__field(xfs_ino_t, ino)
> -		__field(xfs_fileoff_t, lblk)
> -		__field(xfs_extlen_t, len)
> -	),
> -	TP_fast_assign(
> -		__entry->dev = VFS_I(ip)->i_sb->s_dev;
> -		__entry->ino = ip->i_ino;
> -		__entry->lblk = lblk;
> -		__entry->len = len;
> -	),
> -	TP_printk("dev %d:%d ino 0x%llx lblk 0x%llx len 0x%x",
> -		  MAJOR(__entry->dev), MINOR(__entry->dev),
> -		  __entry->ino,
> -		  __entry->lblk,
> -		  __entry->len)
> -);
> -TRACE_EVENT(xfs_reflink_remap,
> -	TP_PROTO(struct xfs_inode *ip, xfs_fileoff_t lblk,
> -		 xfs_extlen_t len, xfs_fsblock_t new_pblk),
> -	TP_ARGS(ip, lblk, len, new_pblk),
> -	TP_STRUCT__entry(
> -		__field(dev_t, dev)
> -		__field(xfs_ino_t, ino)
> -		__field(xfs_fileoff_t, lblk)
> -		__field(xfs_extlen_t, len)
> -		__field(xfs_fsblock_t, new_pblk)
> -	),
> -	TP_fast_assign(
> -		__entry->dev = VFS_I(ip)->i_sb->s_dev;
> -		__entry->ino = ip->i_ino;
> -		__entry->lblk = lblk;
> -		__entry->len = len;
> -		__entry->new_pblk = new_pblk;
> -	),
> -	TP_printk("dev %d:%d ino 0x%llx lblk 0x%llx len 0x%x new_pblk %llu",
> -		  MAJOR(__entry->dev), MINOR(__entry->dev),
> -		  __entry->ino,
> -		  __entry->lblk,
> -		  __entry->len,
> -		  __entry->new_pblk)
> -);
>  DEFINE_DOUBLE_IO_EVENT(xfs_reflink_remap_range);
>  DEFINE_INODE_ERROR_EVENT(xfs_reflink_remap_range_error);
>  DEFINE_INODE_ERROR_EVENT(xfs_reflink_set_inode_flag_error);
>  DEFINE_INODE_ERROR_EVENT(xfs_reflink_update_inode_size_error);
>  DEFINE_INODE_ERROR_EVENT(xfs_reflink_remap_blocks_error);
>  DEFINE_INODE_ERROR_EVENT(xfs_reflink_remap_extent_error);
> +DEFINE_INODE_IREC_EVENT(xfs_reflink_remap_extent_src);
> +DEFINE_INODE_IREC_EVENT(xfs_reflink_remap_extent_dest);
>  
>  /* dedupe tracepoints */
>  DEFINE_DOUBLE_IO_EVENT(xfs_reflink_compare_extents);
> 


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/9] xfs: only reserve quota blocks for bmbt changes if we're changing the data fork
  2020-06-25  1:18 ` [PATCH 4/9] xfs: only reserve quota blocks for bmbt changes if we're changing the data fork Darrick J. Wong
@ 2020-06-25 12:27   ` Brian Foster
  2020-07-01  8:21   ` Christoph Hellwig
  1 sibling, 0 replies; 31+ messages in thread
From: Brian Foster @ 2020-06-25 12:27 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, edwin

On Wed, Jun 24, 2020 at 06:18:06PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Now that we've reworked xfs_reflink_remap_extent to remap only one
> extent per transaction, we actually know if the extent being removed is
> an allocated mapping.  This means that we now know ahead of time if
> we're going to be touching the data fork.
> 
> Since we only need blocks for a bmbt split if we're going to update the
> data fork, we only need to get quota reservation if we know we're going
> to touch the data fork.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---

That addresses my question on the previous patch. Looks like the qres
check was just misplaced. NBD:

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_reflink.c |   11 ++++++++---
>  1 file changed, 8 insertions(+), 3 deletions(-)
> 
> 
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index c593d52156df..9cc1c340d0ec 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -1043,7 +1043,11 @@ xfs_reflink_remap_extent(
>  	 * Compute quota reservation if we think the quota block counter for
>  	 * this file could increase.
>  	 *
> -	 * We start by reserving enough blocks to handle a bmbt split.
> +	 * Adding a written extent to the extent map can cause a bmbt split,
> +	 * and removing a mapped extent from the extent can cause a bmbt split.
> +	 * The two operations cannot both cause a split since they operate on
> +	 * the same index in the bmap btree, so we only need a reservation for
> +	 * one bmbt split if either thing is happening.
>  	 *
>  	 * If we are mapping a written extent into the file, we need to have
>  	 * enough quota block count reservation to handle the blocks in that
> @@ -1056,8 +1060,9 @@ xfs_reflink_remap_extent(
>  	 * before we started.  That should have removed all the delalloc
>  	 * reservations, but we code defensively.
>  	 */
> -	qdelta = 0;
> -	qres = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
> +	qres = qdelta = 0;
> +	if (smap_mapped || dmap_written)
> +		qres = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
>  	if (dmap_written)
>  		qres += dmap->br_blockcount;
>  	if (qres > 0) {
> 


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 5/9] xfs: only reserve quota blocks if we're mapping into a hole
  2020-06-25  1:18 ` [PATCH 5/9] xfs: only reserve quota blocks if we're mapping into a hole Darrick J. Wong
@ 2020-06-25 12:28   ` Brian Foster
  2020-07-01  8:22   ` Christoph Hellwig
  1 sibling, 0 replies; 31+ messages in thread
From: Brian Foster @ 2020-06-25 12:28 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, edwin

On Wed, Jun 24, 2020 at 06:18:12PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> When logging quota block count updates during a reflink operation, we
> only log the /delta/ of the block count changes to the dquot.  Since we
> now know ahead of time the extent type of both dmap and smap (and that
> they have the same length), we know that we only need to reserve quota
> blocks for dmap's blockcount if we're mapping it into a hole.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_reflink.c |    6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> 
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index 9cc1c340d0ec..72de7179399d 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -1051,7 +1051,9 @@ xfs_reflink_remap_extent(
>  	 *
>  	 * If we are mapping a written extent into the file, we need to have
>  	 * enough quota block count reservation to handle the blocks in that
> -	 * extent.
> +	 * extent.  We log only the delta to the quota block counts, so if the
> +	 * extent we're unmapping also has blocks allocated to it, we don't
> +	 * need a quota reservation for the extent itself.
>  	 *
>  	 * Note that if we're replacing a delalloc reservation with a written
>  	 * extent, we have to take the full quota reservation because removing
> @@ -1063,7 +1065,7 @@ xfs_reflink_remap_extent(
>  	qres = qdelta = 0;
>  	if (smap_mapped || dmap_written)
>  		qres = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
> -	if (dmap_written)
> +	if (!smap_mapped && dmap_written)
>  		qres += dmap->br_blockcount;
>  	if (qres > 0) {
>  		error = xfs_trans_reserve_quota_nblks(tp, ip, qres, 0,
> 


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 6/9] xfs: reflink can skip remap existing mappings
  2020-06-25  1:18 ` [PATCH 6/9] xfs: reflink can skip remap existing mappings Darrick J. Wong
@ 2020-06-25 12:28   ` Brian Foster
  2020-06-25 16:49     ` Darrick J. Wong
  0 siblings, 1 reply; 31+ messages in thread
From: Brian Foster @ 2020-06-25 12:28 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, edwin

On Wed, Jun 24, 2020 at 06:18:18PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> If the source and destination map are identical, we can skip the remap
> step to save some time.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/xfs_reflink.c |   17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
> 
> 
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index 72de7179399d..f1156f121b7d 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -1031,6 +1031,23 @@ xfs_reflink_remap_extent(
>  
>  	trace_xfs_reflink_remap_extent_dest(ip, &smap);
>  
> +	/*
> +	 * Two extents mapped to the same physical block must not have
> +	 * different states; that's filesystem corruption.  Move on to the next
> +	 * extent if they're both holes or both the same physical extent.
> +	 */
> +	if (dmap->br_startblock == smap.br_startblock) {
> +		ASSERT(dmap->br_startblock == smap.br_startblock);

That assert duplicates the logic in the if statement. Was this intended
to be the length check I asked for? If so it looks like that was added
previously so perhaps this can just drop off. With that fixed up:

Reviewed-by: Brian Foster <bfoster@redhat.com>

> +		if (dmap->br_state != smap.br_state)
> +			error = -EFSCORRUPTED;
> +		goto out_cancel;
> +	}
> +
> +	/* If both extents are unwritten, leave them alone. */
> +	if (dmap->br_state == XFS_EXT_UNWRITTEN &&
> +	    smap.br_state == XFS_EXT_UNWRITTEN)
> +		goto out_cancel;
> +
>  	/* No reflinking if the AG of the dest mapping is low on space. */
>  	if (dmap_written) {
>  		error = xfs_reflink_ag_has_free_space(mp,
> 


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 6/9] xfs: reflink can skip remap existing mappings
  2020-06-25 12:28   ` Brian Foster
@ 2020-06-25 16:49     ` Darrick J. Wong
  0 siblings, 0 replies; 31+ messages in thread
From: Darrick J. Wong @ 2020-06-25 16:49 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, edwin

On Thu, Jun 25, 2020 at 08:28:18AM -0400, Brian Foster wrote:
> On Wed, Jun 24, 2020 at 06:18:18PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > If the source and destination map are identical, we can skip the remap
> > step to save some time.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/xfs/xfs_reflink.c |   17 +++++++++++++++++
> >  1 file changed, 17 insertions(+)
> > 
> > 
> > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > index 72de7179399d..f1156f121b7d 100644
> > --- a/fs/xfs/xfs_reflink.c
> > +++ b/fs/xfs/xfs_reflink.c
> > @@ -1031,6 +1031,23 @@ xfs_reflink_remap_extent(
> >  
> >  	trace_xfs_reflink_remap_extent_dest(ip, &smap);
> >  
> > +	/*
> > +	 * Two extents mapped to the same physical block must not have
> > +	 * different states; that's filesystem corruption.  Move on to the next
> > +	 * extent if they're both holes or both the same physical extent.
> > +	 */
> > +	if (dmap->br_startblock == smap.br_startblock) {
> > +		ASSERT(dmap->br_startblock == smap.br_startblock);
> 
> That assert duplicates the logic in the if statement. Was this intended
> to be the length check I asked for? If so it looks like that was added
> previously so perhaps this can just drop off. With that fixed up:

Yep, I got hopelessly distracted while trying to figure out why Dave's
inode flush series keep crashing here. Will fix. :/

--D

> Reviewed-by: Brian Foster <bfoster@redhat.com>
> 
> > +		if (dmap->br_state != smap.br_state)
> > +			error = -EFSCORRUPTED;
> > +		goto out_cancel;
> > +	}
> > +
> > +	/* If both extents are unwritten, leave them alone. */
> > +	if (dmap->br_state == XFS_EXT_UNWRITTEN &&
> > +	    smap.br_state == XFS_EXT_UNWRITTEN)
> > +		goto out_cancel;
> > +
> >  	/* No reflinking if the AG of the dest mapping is low on space. */
> >  	if (dmap_written) {
> >  		error = xfs_reflink_ag_has_free_space(mp,
> > 
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/9] xfs: redesign the reflink remap loop to fix blkres depletion crash
  2020-06-25 12:27   ` Brian Foster
@ 2020-06-25 16:57     ` Darrick J. Wong
  0 siblings, 0 replies; 31+ messages in thread
From: Darrick J. Wong @ 2020-06-25 16:57 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, edwin

On Thu, Jun 25, 2020 at 08:27:50AM -0400, Brian Foster wrote:
> On Wed, Jun 24, 2020 at 06:17:59PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > The existing reflink remapping loop has some structural problems that
> > need addressing:
> > 
> > The biggest problem is that we create one transaction for each extent in
> > the source file without accounting for the number of mappings there are
> > for the same range in the destination file.  In other words, we don't
> > know the number of remap operations that will be necessary and we
> > therefore cannot guess the block reservation required.  On highly
> > fragmented filesystems (e.g. ones with active dedupe) we guess wrong,
> > run out of block reservation, and fail.
> > 
> > The second problem is that we don't actually use the bmap intents to
> > their full potential -- instead of calling bunmapi directly and having
> > to deal with its backwards operation, we could call the deferred ops
> > xfs_bmap_unmap_extent and xfs_refcount_decrease_extent instead.  This
> > makes the frontend loop much simpler.
> > 
> > Solve all of these problems by refactoring the remapping loops so that
> > we only perform one remapping operation per transaction, and each
> > operation only tries to remap a single extent from source to dest.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> 
> Looks pretty good to me outside of a few nits..
> 
> >  fs/xfs/libxfs/xfs_bmap.h |   13 ++
> >  fs/xfs/xfs_reflink.c     |  242 +++++++++++++++++++++++++---------------------
> >  fs/xfs/xfs_trace.h       |   52 +---------
> >  3 files changed, 142 insertions(+), 165 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
> > index 2b18338d0643..2e3c49f1f5c9 100644
> > --- a/fs/xfs/libxfs/xfs_bmap.h
> > +++ b/fs/xfs/libxfs/xfs_bmap.h
> > @@ -158,6 +158,13 @@ static inline int xfs_bmapi_whichfork(int bmapi_flags)
> >  	{ BMAP_ATTRFORK,	"ATTR" }, \
> >  	{ BMAP_COWFORK,		"COW" }
> >  
> > +/* Return true if the extent is an allocated extent, written or not. */
> > +static inline bool xfs_bmap_is_mapped_extent(struct xfs_bmbt_irec *irec)
> > +{
> > +	return irec->br_startblock != HOLESTARTBLOCK &&
> > +		irec->br_startblock != DELAYSTARTBLOCK &&
> > +		!isnullstartblock(irec->br_startblock);
> > +}
> 
> The point of the previous patch was to facilitate renaming this helper,
> no?

D'oh.  I totally forgot that, sorry.

> >  
> >  /*
> >   * Return true if the extent is a real, allocated extent, or false if it is  a
> > @@ -165,10 +172,8 @@ static inline int xfs_bmapi_whichfork(int bmapi_flags)
> >   */
> >  static inline bool xfs_bmap_is_written_extent(struct xfs_bmbt_irec *irec)
> >  {
> > -	return irec->br_state != XFS_EXT_UNWRITTEN &&
> > -		irec->br_startblock != HOLESTARTBLOCK &&
> > -		irec->br_startblock != DELAYSTARTBLOCK &&
> > -		!isnullstartblock(irec->br_startblock);
> > +	return xfs_bmap_is_mapped_extent(irec) &&
> > +	       irec->br_state != XFS_EXT_UNWRITTEN;
> >  }
> >  
> >  /*
> > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > index 22fdea6d69d3..c593d52156df 100644
> > --- a/fs/xfs/xfs_reflink.c
> > +++ b/fs/xfs/xfs_reflink.c
> > @@ -984,41 +984,28 @@ xfs_reflink_ag_has_free_space(
> >  }
> >  
> >  /*
> > - * Unmap a range of blocks from a file, then map other blocks into the hole.
> > - * The range to unmap is (destoff : destoff + srcioff + irec->br_blockcount).
> > - * The extent irec is mapped into dest at irec->br_startoff.
> > + * Remap the given extent into the file.  The dmap blockcount will be set to
> > + * the number of blocks that were actually remapped.
> >   */
> >  STATIC int
> >  xfs_reflink_remap_extent(
> >  	struct xfs_inode	*ip,
> > -	struct xfs_bmbt_irec	*irec,
> > -	xfs_fileoff_t		destoff,
> > +	struct xfs_bmbt_irec	*dmap,
> >  	xfs_off_t		new_isize)
> >  {
> > +	struct xfs_bmbt_irec	smap;
> >  	struct xfs_mount	*mp = ip->i_mount;
> > -	bool			real_extent = xfs_bmap_is_written_extent(irec);
> >  	struct xfs_trans	*tp;
> > -	unsigned int		resblks;
> > -	struct xfs_bmbt_irec	uirec;
> > -	xfs_filblks_t		rlen;
> > -	xfs_filblks_t		unmap_len;
> >  	xfs_off_t		newlen;
> > -	int64_t			qres;
> > +	int64_t			qres, qdelta;
> > +	unsigned int		resblks;
> > +	bool			smap_mapped;
> > +	bool			dmap_written = xfs_bmap_is_written_extent(dmap);
> > +	int			nimaps;
> >  	int			error;
> >  
> > -	unmap_len = irec->br_startoff + irec->br_blockcount - destoff;
> > -	trace_xfs_reflink_punch_range(ip, destoff, unmap_len);
> > -
> > -	/* No reflinking if we're low on space */
> > -	if (real_extent) {
> > -		error = xfs_reflink_ag_has_free_space(mp,
> > -				XFS_FSB_TO_AGNO(mp, irec->br_startblock));
> > -		if (error)
> > -			goto out;
> > -	}
> > -
> >  	/* Start a rolling transaction to switch the mappings */
> > -	resblks = XFS_EXTENTADD_SPACE_RES(ip->i_mount, XFS_DATA_FORK);
> > +	resblks = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
> >  	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0, 0, &tp);
> >  	if (error)
> >  		goto out;
> > @@ -1026,93 +1013,120 @@ xfs_reflink_remap_extent(
> >  	xfs_ilock(ip, XFS_ILOCK_EXCL);
> >  	xfs_trans_ijoin(tp, ip, 0);
> >  
> > +	/* Read what's currently mapped in the destination file */
> > +	nimaps = 1;
> > +	error = xfs_bmapi_read(ip, dmap->br_startoff, dmap->br_blockcount,
> > +			&smap, &nimaps, 0);
> > +	if (error)
> > +		goto out_cancel;
> > +	ASSERT(nimaps == 1 && smap.br_startoff == dmap->br_startoff);
> > +	smap_mapped = xfs_bmap_is_mapped_extent(&smap);
> > +
> 
> I have to admit I find the naming a little confusing. smap (source?)
> refers to the extent currently in the destination file, and dmap refers
> to the extent from the source file to map into the destination..?

Right.  I'll amend the above comment to:

	/*
	 * Read what's currently mapped in the destination file into
	 * smap.  If smap isn't a hole, we will have to remove it before
	 * we can add dmap to the destination file.
	 */


> >  	/*
> > -	 * Reserve quota for this operation.  We don't know if the first unmap
> > -	 * in the dest file will cause a bmap btree split, so we always reserve
> > -	 * at least enough blocks for that split.  If the extent being mapped
> > -	 * in is written, we need to reserve quota for that too.
> > +	 * We can only remap as many blocks as the smaller of the two extent
> > +	 * maps, because we can only remap one extent at a time.
> >  	 */
> > +	dmap->br_blockcount = min(dmap->br_blockcount, smap.br_blockcount);
> > +	ASSERT(dmap->br_blockcount == smap.br_blockcount);
> > +
> > +	trace_xfs_reflink_remap_extent_dest(ip, &smap);
> > +
> > +	/* No reflinking if the AG of the dest mapping is low on space. */
> > +	if (dmap_written) {
> > +		error = xfs_reflink_ag_has_free_space(mp,
> > +				XFS_FSB_TO_AGNO(mp, dmap->br_startblock));
> > +		if (error)
> > +			goto out_cancel;
> > +	}
> > +
> > +	/*
> > +	 * Compute quota reservation if we think the quota block counter for
> > +	 * this file could increase.
> > +	 *
> > +	 * We start by reserving enough blocks to handle a bmbt split.
> > +	 *
> > +	 * If we are mapping a written extent into the file, we need to have
> > +	 * enough quota block count reservation to handle the blocks in that
> > +	 * extent.
> > +	 *
> > +	 * Note that if we're replacing a delalloc reservation with a written
> > +	 * extent, we have to take the full quota reservation because removing
> > +	 * the delalloc reservation gives the block count back to the quota
> > +	 * count.  This is suboptimal, but the VFS flushed the dest range
> > +	 * before we started.  That should have removed all the delalloc
> > +	 * reservations, but we code defensively.
> > +	 */
> > +	qdelta = 0;
> >  	qres = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
> > -	if (real_extent)
> > -		qres += irec->br_blockcount;
> > -	error = xfs_trans_reserve_quota_nblks(tp, ip, qres, 0,
> > -			XFS_QMOPT_RES_REGBLKS);
> > -	if (error)
> > -		goto out_cancel;
> > -
> > -	trace_xfs_reflink_remap(ip, irec->br_startoff,
> > -				irec->br_blockcount, irec->br_startblock);
> > -
> > -	/* Unmap the old blocks in the data fork. */
> > -	rlen = unmap_len;
> > -	while (rlen) {
> > -		ASSERT(tp->t_firstblock == NULLFSBLOCK);
> > -		error = __xfs_bunmapi(tp, ip, destoff, &rlen, 0, 1);
> > +	if (dmap_written)
> > +		qres += dmap->br_blockcount;
> > +	if (qres > 0) {
> 
> Why is this check necessary if we assign qres a base value?

It's not technically needed here, but it becomes useful in the next
patch.

> > +		error = xfs_trans_reserve_quota_nblks(tp, ip, qres, 0,
> > +				XFS_QMOPT_RES_REGBLKS);
> >  		if (error)
> >  			goto out_cancel;
> > +	}
> >  
> > +	if (smap_mapped) {
> >  		/*
> > -		 * Trim the extent to whatever got unmapped.
> > -		 * Remember, bunmapi works backwards.
> > +		 * If the extent we're unmapping is backed by storage (written
> > +		 * or not), unmap the extent and drop its refcount.
> >  		 */
> > -		uirec.br_startblock = irec->br_startblock + rlen;
> > -		uirec.br_startoff = irec->br_startoff + rlen;
> > -		uirec.br_blockcount = unmap_len - rlen;
> > -		uirec.br_state = irec->br_state;
> > -		unmap_len = rlen;
> > -
> > -		/* If this isn't a real mapping, we're done. */
> > -		if (!real_extent || uirec.br_blockcount == 0)
> > -			goto next_extent;
> > +		xfs_bmap_unmap_extent(tp, ip, &smap);
> > +		xfs_refcount_decrease_extent(tp, &smap);
> > +		qdelta -= smap.br_blockcount;
> > +	} else if (smap.br_startblock == DELAYSTARTBLOCK) {
> > +		xfs_filblks_t	len = smap.br_blockcount;
> >  
> > -		trace_xfs_reflink_remap(ip, uirec.br_startoff,
> > -				uirec.br_blockcount, uirec.br_startblock);
> > -
> > -		/* Update the refcount tree */
> > -		xfs_refcount_increase_extent(tp, &uirec);
> > -
> > -		/* Map the new blocks into the data fork. */
> > -		xfs_bmap_map_extent(tp, ip, &uirec);
> > +		/*
> > +		 * If the extent we're unmapping is a delalloc reservation,
> > +		 * we can use the regular bunmapi function to release the
> > +		 * incore state.  Dropping the delalloc reservation takes care
> > +		 * of the quota reservation for us.
> > +		 */
> > +		error = __xfs_bunmapi(NULL, ip, smap.br_startoff, &len, 0, 1);
> > +		if (error)
> > +			goto out_cancel;
> > +		ASSERT(len == 0);
> > +	}
> >  
> > -		/* Update quota accounting. */
> > -		xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_BCOUNT,
> > -				uirec.br_blockcount);
> > +	/*
> > +	 * If the extent we're sharing is backed by written storage, increase
> > +	 * its refcount and map it into the file.
> > +	 */
> > +	if (dmap_written) {
> > +		xfs_refcount_increase_extent(tp, dmap);
> > +		xfs_bmap_map_extent(tp, ip, dmap);
> > +		qdelta += dmap->br_blockcount;
> > +	}
> >  
> > -		/* Update dest isize if needed. */
> > -		newlen = XFS_FSB_TO_B(mp,
> > -				uirec.br_startoff + uirec.br_blockcount);
> > -		newlen = min_t(xfs_off_t, newlen, new_isize);
> > -		if (newlen > i_size_read(VFS_I(ip))) {
> > -			trace_xfs_reflink_update_inode_size(ip, newlen);
> > -			i_size_write(VFS_I(ip), newlen);
> > -			ip->i_d.di_size = newlen;
> > -			xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
> > -		}
> > +	xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_BCOUNT, qdelta);
> >  
> > -next_extent:
> > -		/* Process all the deferred stuff. */
> > -		error = xfs_defer_finish(&tp);
> > -		if (error)
> > -			goto out_cancel;
> > +	/* Update dest isize if needed. */
> > +	newlen = XFS_FSB_TO_B(mp, smap.br_startoff + smap.br_blockcount);
> 
> I guess functionally it doesn't matter because the extents should have
> the same geometry, but this technically should refer to the extent
> mapped _into_ the dest file, right? More just another artifact of the
> naming confusion I guess..

Yes.  I'll update that to pick from dmap to be clearer, though the end
result shouldn't be any different.

--D

> Brian
> 
> > +	newlen = min_t(xfs_off_t, newlen, new_isize);
> > +	if (newlen > i_size_read(VFS_I(ip))) {
> > +		trace_xfs_reflink_update_inode_size(ip, newlen);
> > +		i_size_write(VFS_I(ip), newlen);
> > +		ip->i_d.di_size = newlen;
> > +		xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
> >  	}
> >  
> > +	/* Commit everything and unlock. */
> >  	error = xfs_trans_commit(tp);
> > -	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > -	if (error)
> > -		goto out;
> > -	return 0;
> > +	goto out_unlock;
> >  
> >  out_cancel:
> >  	xfs_trans_cancel(tp);
> > +out_unlock:
> >  	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> >  out:
> > -	trace_xfs_reflink_remap_extent_error(ip, error, _RET_IP_);
> > +	if (error)
> > +		trace_xfs_reflink_remap_extent_error(ip, error, _RET_IP_);
> >  	return error;
> >  }
> >  
> > -/*
> > - * Iteratively remap one file's extents (and holes) to another's.
> > - */
> > +/* Remap a range of one file to the other. */
> >  int
> >  xfs_reflink_remap_blocks(
> >  	struct xfs_inode	*src,
> > @@ -1123,25 +1137,22 @@ xfs_reflink_remap_blocks(
> >  	loff_t			*remapped)
> >  {
> >  	struct xfs_bmbt_irec	imap;
> > -	xfs_fileoff_t		srcoff;
> > -	xfs_fileoff_t		destoff;
> > +	struct xfs_mount	*mp = src->i_mount;
> > +	xfs_fileoff_t		srcoff = XFS_B_TO_FSBT(mp, pos_in);
> > +	xfs_fileoff_t		destoff = XFS_B_TO_FSBT(mp, pos_out);
> >  	xfs_filblks_t		len;
> > -	xfs_filblks_t		range_len;
> >  	xfs_filblks_t		remapped_len = 0;
> >  	xfs_off_t		new_isize = pos_out + remap_len;
> >  	int			nimaps;
> >  	int			error = 0;
> >  
> > -	destoff = XFS_B_TO_FSBT(src->i_mount, pos_out);
> > -	srcoff = XFS_B_TO_FSBT(src->i_mount, pos_in);
> > -	len = XFS_B_TO_FSB(src->i_mount, remap_len);
> > +	len = min_t(xfs_filblks_t, XFS_B_TO_FSB(mp, remap_len),
> > +			XFS_MAX_FILEOFF);
> >  
> > -	/* drange = (destoff, destoff + len); srange = (srcoff, srcoff + len) */
> > -	while (len) {
> > -		uint		lock_mode;
> > +	trace_xfs_reflink_remap_blocks(src, srcoff, len, dest, destoff);
> >  
> > -		trace_xfs_reflink_remap_blocks_loop(src, srcoff, len,
> > -				dest, destoff);
> > +	while (len > 0) {
> > +		unsigned int	lock_mode;
> >  
> >  		/* Read extent from the source file */
> >  		nimaps = 1;
> > @@ -1150,18 +1161,25 @@ xfs_reflink_remap_blocks(
> >  		xfs_iunlock(src, lock_mode);
> >  		if (error)
> >  			break;
> > -		ASSERT(nimaps == 1);
> > +		/*
> > +		 * The caller supposedly flushed all dirty pages in the source
> > +		 * file range, which means that writeback should have allocated
> > +		 * or deleted all delalloc reservations in that range.  If we
> > +		 * find one, that's a good sign that something is seriously
> > +		 * wrong here.
> > +		 */
> > +		ASSERT(nimaps == 1 && imap.br_startoff == srcoff);
> > +		if (imap.br_startblock == DELAYSTARTBLOCK) {
> > +			ASSERT(imap.br_startblock != DELAYSTARTBLOCK);
> > +			error = -EFSCORRUPTED;
> > +			break;
> > +		}
> >  
> > -		trace_xfs_reflink_remap_imap(src, srcoff, len, XFS_DATA_FORK,
> > -				&imap);
> > +		trace_xfs_reflink_remap_extent_src(src, &imap);
> >  
> > -		/* Translate imap into the destination file. */
> > -		range_len = imap.br_startoff + imap.br_blockcount - srcoff;
> > -		imap.br_startoff += destoff - srcoff;
> > -
> > -		/* Clear dest from destoff to the end of imap and map it in. */
> > -		error = xfs_reflink_remap_extent(dest, &imap, destoff,
> > -				new_isize);
> > +		/* Remap into the destination file at the given offset. */
> > +		imap.br_startoff = destoff;
> > +		error = xfs_reflink_remap_extent(dest, &imap, new_isize);
> >  		if (error)
> >  			break;
> >  
> > @@ -1171,10 +1189,10 @@ xfs_reflink_remap_blocks(
> >  		}
> >  
> >  		/* Advance drange/srange */
> > -		srcoff += range_len;
> > -		destoff += range_len;
> > -		len -= range_len;
> > -		remapped_len += range_len;
> > +		srcoff += imap.br_blockcount;
> > +		destoff += imap.br_blockcount;
> > +		len -= imap.br_blockcount;
> > +		remapped_len += imap.br_blockcount;
> >  	}
> >  
> >  	if (error)
> > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> > index 460136628a79..50c478374a31 100644
> > --- a/fs/xfs/xfs_trace.h
> > +++ b/fs/xfs/xfs_trace.h
> > @@ -3052,8 +3052,7 @@ DEFINE_EVENT(xfs_inode_irec_class, name, \
> >  DEFINE_INODE_EVENT(xfs_reflink_set_inode_flag);
> >  DEFINE_INODE_EVENT(xfs_reflink_unset_inode_flag);
> >  DEFINE_ITRUNC_EVENT(xfs_reflink_update_inode_size);
> > -DEFINE_IMAP_EVENT(xfs_reflink_remap_imap);
> > -TRACE_EVENT(xfs_reflink_remap_blocks_loop,
> > +TRACE_EVENT(xfs_reflink_remap_blocks,
> >  	TP_PROTO(struct xfs_inode *src, xfs_fileoff_t soffset,
> >  		 xfs_filblks_t len, struct xfs_inode *dest,
> >  		 xfs_fileoff_t doffset),
> > @@ -3084,59 +3083,14 @@ TRACE_EVENT(xfs_reflink_remap_blocks_loop,
> >  		  __entry->dest_ino,
> >  		  __entry->dest_lblk)
> >  );
> > -TRACE_EVENT(xfs_reflink_punch_range,
> > -	TP_PROTO(struct xfs_inode *ip, xfs_fileoff_t lblk,
> > -		 xfs_extlen_t len),
> > -	TP_ARGS(ip, lblk, len),
> > -	TP_STRUCT__entry(
> > -		__field(dev_t, dev)
> > -		__field(xfs_ino_t, ino)
> > -		__field(xfs_fileoff_t, lblk)
> > -		__field(xfs_extlen_t, len)
> > -	),
> > -	TP_fast_assign(
> > -		__entry->dev = VFS_I(ip)->i_sb->s_dev;
> > -		__entry->ino = ip->i_ino;
> > -		__entry->lblk = lblk;
> > -		__entry->len = len;
> > -	),
> > -	TP_printk("dev %d:%d ino 0x%llx lblk 0x%llx len 0x%x",
> > -		  MAJOR(__entry->dev), MINOR(__entry->dev),
> > -		  __entry->ino,
> > -		  __entry->lblk,
> > -		  __entry->len)
> > -);
> > -TRACE_EVENT(xfs_reflink_remap,
> > -	TP_PROTO(struct xfs_inode *ip, xfs_fileoff_t lblk,
> > -		 xfs_extlen_t len, xfs_fsblock_t new_pblk),
> > -	TP_ARGS(ip, lblk, len, new_pblk),
> > -	TP_STRUCT__entry(
> > -		__field(dev_t, dev)
> > -		__field(xfs_ino_t, ino)
> > -		__field(xfs_fileoff_t, lblk)
> > -		__field(xfs_extlen_t, len)
> > -		__field(xfs_fsblock_t, new_pblk)
> > -	),
> > -	TP_fast_assign(
> > -		__entry->dev = VFS_I(ip)->i_sb->s_dev;
> > -		__entry->ino = ip->i_ino;
> > -		__entry->lblk = lblk;
> > -		__entry->len = len;
> > -		__entry->new_pblk = new_pblk;
> > -	),
> > -	TP_printk("dev %d:%d ino 0x%llx lblk 0x%llx len 0x%x new_pblk %llu",
> > -		  MAJOR(__entry->dev), MINOR(__entry->dev),
> > -		  __entry->ino,
> > -		  __entry->lblk,
> > -		  __entry->len,
> > -		  __entry->new_pblk)
> > -);
> >  DEFINE_DOUBLE_IO_EVENT(xfs_reflink_remap_range);
> >  DEFINE_INODE_ERROR_EVENT(xfs_reflink_remap_range_error);
> >  DEFINE_INODE_ERROR_EVENT(xfs_reflink_set_inode_flag_error);
> >  DEFINE_INODE_ERROR_EVENT(xfs_reflink_update_inode_size_error);
> >  DEFINE_INODE_ERROR_EVENT(xfs_reflink_remap_blocks_error);
> >  DEFINE_INODE_ERROR_EVENT(xfs_reflink_remap_extent_error);
> > +DEFINE_INODE_IREC_EVENT(xfs_reflink_remap_extent_src);
> > +DEFINE_INODE_IREC_EVENT(xfs_reflink_remap_extent_dest);
> >  
> >  /* dedupe tracepoints */
> >  DEFINE_DOUBLE_IO_EVENT(xfs_reflink_compare_extents);
> > 
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v4.2 3/9] xfs: redesign the reflink remap loop to fix blkres depletion crash
  2020-06-25  1:17 ` [PATCH 3/9] xfs: redesign the reflink remap loop to fix blkres depletion crash Darrick J. Wong
  2020-06-25 12:27   ` Brian Foster
@ 2020-06-25 17:23   ` Darrick J. Wong
  2020-06-26 11:57     ` Brian Foster
  2020-07-01  8:20     ` Christoph Hellwig
  1 sibling, 2 replies; 31+ messages in thread
From: Darrick J. Wong @ 2020-06-25 17:23 UTC (permalink / raw)
  To: linux-xfs, edwin, Brian Foster

From: Darrick J. Wong <darrick.wong@oracle.com>

The existing reflink remapping loop has some structural problems that
need addressing:

The biggest problem is that we create one transaction for each extent in
the source file without accounting for the number of mappings there are
for the same range in the destination file.  In other words, we don't
know the number of remap operations that will be necessary and we
therefore cannot guess the block reservation required.  On highly
fragmented filesystems (e.g. ones with active dedupe) we guess wrong,
run out of block reservation, and fail.

The second problem is that we don't actually use the bmap intents to
their full potential -- instead of calling bunmapi directly and having
to deal with its backwards operation, we could call the deferred ops
xfs_bmap_unmap_extent and xfs_refcount_decrease_extent instead.  This
makes the frontend loop much simpler.

Solve all of these problems by refactoring the remapping loops so that
we only perform one remapping operation per transaction, and each
operation only tries to remap a single extent from source to dest.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
v4.2: move qres conditional to the next patch, rename bmap helper, try
to clear up some of the smap/dmap confusion
---
 fs/xfs/libxfs/xfs_bmap.h |   13 ++
 fs/xfs/xfs_reflink.c     |  240 +++++++++++++++++++++++++---------------------
 fs/xfs/xfs_trace.h       |   52 +---------
 3 files changed, 142 insertions(+), 163 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index 2b18338d0643..e1bd484e5548 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -158,6 +158,13 @@ static inline int xfs_bmapi_whichfork(int bmapi_flags)
 	{ BMAP_ATTRFORK,	"ATTR" }, \
 	{ BMAP_COWFORK,		"COW" }
 
+/* Return true if the extent is an allocated extent, written or not. */
+static inline bool xfs_bmap_is_real_extent(struct xfs_bmbt_irec *irec)
+{
+	return irec->br_startblock != HOLESTARTBLOCK &&
+		irec->br_startblock != DELAYSTARTBLOCK &&
+		!isnullstartblock(irec->br_startblock);
+}
 
 /*
  * Return true if the extent is a real, allocated extent, or false if it is  a
@@ -165,10 +172,8 @@ static inline int xfs_bmapi_whichfork(int bmapi_flags)
  */
 static inline bool xfs_bmap_is_written_extent(struct xfs_bmbt_irec *irec)
 {
-	return irec->br_state != XFS_EXT_UNWRITTEN &&
-		irec->br_startblock != HOLESTARTBLOCK &&
-		irec->br_startblock != DELAYSTARTBLOCK &&
-		!isnullstartblock(irec->br_startblock);
+	return xfs_bmap_is_real_extent(irec) &&
+	       irec->br_state != XFS_EXT_UNWRITTEN;
 }
 
 /*
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 22fdea6d69d3..e7dd8950d40a 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -984,41 +984,28 @@ xfs_reflink_ag_has_free_space(
 }
 
 /*
- * Unmap a range of blocks from a file, then map other blocks into the hole.
- * The range to unmap is (destoff : destoff + srcioff + irec->br_blockcount).
- * The extent irec is mapped into dest at irec->br_startoff.
+ * Remap the given extent into the file.  The dmap blockcount will be set to
+ * the number of blocks that were actually remapped.
  */
 STATIC int
 xfs_reflink_remap_extent(
 	struct xfs_inode	*ip,
-	struct xfs_bmbt_irec	*irec,
-	xfs_fileoff_t		destoff,
+	struct xfs_bmbt_irec	*dmap,
 	xfs_off_t		new_isize)
 {
+	struct xfs_bmbt_irec	smap;
 	struct xfs_mount	*mp = ip->i_mount;
-	bool			real_extent = xfs_bmap_is_written_extent(irec);
 	struct xfs_trans	*tp;
-	unsigned int		resblks;
-	struct xfs_bmbt_irec	uirec;
-	xfs_filblks_t		rlen;
-	xfs_filblks_t		unmap_len;
 	xfs_off_t		newlen;
-	int64_t			qres;
+	int64_t			qres, qdelta;
+	unsigned int		resblks;
+	bool			smap_real;
+	bool			dmap_written = xfs_bmap_is_written_extent(dmap);
+	int			nimaps;
 	int			error;
 
-	unmap_len = irec->br_startoff + irec->br_blockcount - destoff;
-	trace_xfs_reflink_punch_range(ip, destoff, unmap_len);
-
-	/* No reflinking if we're low on space */
-	if (real_extent) {
-		error = xfs_reflink_ag_has_free_space(mp,
-				XFS_FSB_TO_AGNO(mp, irec->br_startblock));
-		if (error)
-			goto out;
-	}
-
 	/* Start a rolling transaction to switch the mappings */
-	resblks = XFS_EXTENTADD_SPACE_RES(ip->i_mount, XFS_DATA_FORK);
+	resblks = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
 	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0, 0, &tp);
 	if (error)
 		goto out;
@@ -1027,92 +1014,121 @@ xfs_reflink_remap_extent(
 	xfs_trans_ijoin(tp, ip, 0);
 
 	/*
-	 * Reserve quota for this operation.  We don't know if the first unmap
-	 * in the dest file will cause a bmap btree split, so we always reserve
-	 * at least enough blocks for that split.  If the extent being mapped
-	 * in is written, we need to reserve quota for that too.
+	 * Read what's currently mapped in the destination file into smap.
+	 * If smap isn't a hole, we will have to remove it before we can add
+	 * dmap to the destination file.
 	 */
+	nimaps = 1;
+	error = xfs_bmapi_read(ip, dmap->br_startoff, dmap->br_blockcount,
+			&smap, &nimaps, 0);
+	if (error)
+		goto out_cancel;
+	ASSERT(nimaps == 1 && smap.br_startoff == dmap->br_startoff);
+	smap_real = xfs_bmap_is_real_extent(&smap);
+
+	/*
+	 * We can only remap as many blocks as the smaller of the two extent
+	 * maps, because we can only remap one extent at a time.
+	 */
+	dmap->br_blockcount = min(dmap->br_blockcount, smap.br_blockcount);
+	ASSERT(dmap->br_blockcount == smap.br_blockcount);
+
+	trace_xfs_reflink_remap_extent_dest(ip, &smap);
+
+	/* No reflinking if the AG of the dest mapping is low on space. */
+	if (dmap_written) {
+		error = xfs_reflink_ag_has_free_space(mp,
+				XFS_FSB_TO_AGNO(mp, dmap->br_startblock));
+		if (error)
+			goto out_cancel;
+	}
+
+	/*
+	 * Compute quota reservation if we think the quota block counter for
+	 * this file could increase.
+	 *
+	 * We start by reserving enough blocks to handle a bmbt split.
+	 *
+	 * If we are mapping a written extent into the file, we need to have
+	 * enough quota block count reservation to handle the blocks in that
+	 * extent.
+	 *
+	 * Note that if we're replacing a delalloc reservation with a written
+	 * extent, we have to take the full quota reservation because removing
+	 * the delalloc reservation gives the block count back to the quota
+	 * count.  This is suboptimal, but the VFS flushed the dest range
+	 * before we started.  That should have removed all the delalloc
+	 * reservations, but we code defensively.
+	 */
+	qdelta = 0;
 	qres = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
-	if (real_extent)
-		qres += irec->br_blockcount;
+	if (dmap_written)
+		qres += dmap->br_blockcount;
 	error = xfs_trans_reserve_quota_nblks(tp, ip, qres, 0,
 			XFS_QMOPT_RES_REGBLKS);
 	if (error)
 		goto out_cancel;
 
-	trace_xfs_reflink_remap(ip, irec->br_startoff,
-				irec->br_blockcount, irec->br_startblock);
-
-	/* Unmap the old blocks in the data fork. */
-	rlen = unmap_len;
-	while (rlen) {
-		ASSERT(tp->t_firstblock == NULLFSBLOCK);
-		error = __xfs_bunmapi(tp, ip, destoff, &rlen, 0, 1);
-		if (error)
-			goto out_cancel;
-
+	if (smap_real) {
 		/*
-		 * Trim the extent to whatever got unmapped.
-		 * Remember, bunmapi works backwards.
+		 * If the extent we're unmapping is backed by storage (written
+		 * or not), unmap the extent and drop its refcount.
 		 */
-		uirec.br_startblock = irec->br_startblock + rlen;
-		uirec.br_startoff = irec->br_startoff + rlen;
-		uirec.br_blockcount = unmap_len - rlen;
-		uirec.br_state = irec->br_state;
-		unmap_len = rlen;
+		xfs_bmap_unmap_extent(tp, ip, &smap);
+		xfs_refcount_decrease_extent(tp, &smap);
+		qdelta -= smap.br_blockcount;
+	} else if (smap.br_startblock == DELAYSTARTBLOCK) {
+		xfs_filblks_t	len = smap.br_blockcount;
 
-		/* If this isn't a real mapping, we're done. */
-		if (!real_extent || uirec.br_blockcount == 0)
-			goto next_extent;
-
-		trace_xfs_reflink_remap(ip, uirec.br_startoff,
-				uirec.br_blockcount, uirec.br_startblock);
-
-		/* Update the refcount tree */
-		xfs_refcount_increase_extent(tp, &uirec);
-
-		/* Map the new blocks into the data fork. */
-		xfs_bmap_map_extent(tp, ip, &uirec);
+		/*
+		 * If the extent we're unmapping is a delalloc reservation,
+		 * we can use the regular bunmapi function to release the
+		 * incore state.  Dropping the delalloc reservation takes care
+		 * of the quota reservation for us.
+		 */
+		error = __xfs_bunmapi(NULL, ip, smap.br_startoff, &len, 0, 1);
+		if (error)
+			goto out_cancel;
+		ASSERT(len == 0);
+	}
 
-		/* Update quota accounting. */
-		xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_BCOUNT,
-				uirec.br_blockcount);
+	/*
+	 * If the extent we're sharing is backed by written storage, increase
+	 * its refcount and map it into the file.
+	 */
+	if (dmap_written) {
+		xfs_refcount_increase_extent(tp, dmap);
+		xfs_bmap_map_extent(tp, ip, dmap);
+		qdelta += dmap->br_blockcount;
+	}
 
-		/* Update dest isize if needed. */
-		newlen = XFS_FSB_TO_B(mp,
-				uirec.br_startoff + uirec.br_blockcount);
-		newlen = min_t(xfs_off_t, newlen, new_isize);
-		if (newlen > i_size_read(VFS_I(ip))) {
-			trace_xfs_reflink_update_inode_size(ip, newlen);
-			i_size_write(VFS_I(ip), newlen);
-			ip->i_d.di_size = newlen;
-			xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
-		}
+	xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_BCOUNT, qdelta);
 
-next_extent:
-		/* Process all the deferred stuff. */
-		error = xfs_defer_finish(&tp);
-		if (error)
-			goto out_cancel;
+	/* Update dest isize if needed. */
+	newlen = XFS_FSB_TO_B(mp, dmap->br_startoff + dmap->br_blockcount);
+	newlen = min_t(xfs_off_t, newlen, new_isize);
+	if (newlen > i_size_read(VFS_I(ip))) {
+		trace_xfs_reflink_update_inode_size(ip, newlen);
+		i_size_write(VFS_I(ip), newlen);
+		ip->i_d.di_size = newlen;
+		xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
 	}
 
+	/* Commit everything and unlock. */
 	error = xfs_trans_commit(tp);
-	xfs_iunlock(ip, XFS_ILOCK_EXCL);
-	if (error)
-		goto out;
-	return 0;
+	goto out_unlock;
 
 out_cancel:
 	xfs_trans_cancel(tp);
+out_unlock:
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 out:
-	trace_xfs_reflink_remap_extent_error(ip, error, _RET_IP_);
+	if (error)
+		trace_xfs_reflink_remap_extent_error(ip, error, _RET_IP_);
 	return error;
 }
 
-/*
- * Iteratively remap one file's extents (and holes) to another's.
- */
+/* Remap a range of one file to the other. */
 int
 xfs_reflink_remap_blocks(
 	struct xfs_inode	*src,
@@ -1123,25 +1139,22 @@ xfs_reflink_remap_blocks(
 	loff_t			*remapped)
 {
 	struct xfs_bmbt_irec	imap;
-	xfs_fileoff_t		srcoff;
-	xfs_fileoff_t		destoff;
+	struct xfs_mount	*mp = src->i_mount;
+	xfs_fileoff_t		srcoff = XFS_B_TO_FSBT(mp, pos_in);
+	xfs_fileoff_t		destoff = XFS_B_TO_FSBT(mp, pos_out);
 	xfs_filblks_t		len;
-	xfs_filblks_t		range_len;
 	xfs_filblks_t		remapped_len = 0;
 	xfs_off_t		new_isize = pos_out + remap_len;
 	int			nimaps;
 	int			error = 0;
 
-	destoff = XFS_B_TO_FSBT(src->i_mount, pos_out);
-	srcoff = XFS_B_TO_FSBT(src->i_mount, pos_in);
-	len = XFS_B_TO_FSB(src->i_mount, remap_len);
+	len = min_t(xfs_filblks_t, XFS_B_TO_FSB(mp, remap_len),
+			XFS_MAX_FILEOFF);
 
-	/* drange = (destoff, destoff + len); srange = (srcoff, srcoff + len) */
-	while (len) {
-		uint		lock_mode;
+	trace_xfs_reflink_remap_blocks(src, srcoff, len, dest, destoff);
 
-		trace_xfs_reflink_remap_blocks_loop(src, srcoff, len,
-				dest, destoff);
+	while (len > 0) {
+		unsigned int	lock_mode;
 
 		/* Read extent from the source file */
 		nimaps = 1;
@@ -1150,18 +1163,25 @@ xfs_reflink_remap_blocks(
 		xfs_iunlock(src, lock_mode);
 		if (error)
 			break;
-		ASSERT(nimaps == 1);
+		/*
+		 * The caller supposedly flushed all dirty pages in the source
+		 * file range, which means that writeback should have allocated
+		 * or deleted all delalloc reservations in that range.  If we
+		 * find one, that's a good sign that something is seriously
+		 * wrong here.
+		 */
+		ASSERT(nimaps == 1 && imap.br_startoff == srcoff);
+		if (imap.br_startblock == DELAYSTARTBLOCK) {
+			ASSERT(imap.br_startblock != DELAYSTARTBLOCK);
+			error = -EFSCORRUPTED;
+			break;
+		}
 
-		trace_xfs_reflink_remap_imap(src, srcoff, len, XFS_DATA_FORK,
-				&imap);
+		trace_xfs_reflink_remap_extent_src(src, &imap);
 
-		/* Translate imap into the destination file. */
-		range_len = imap.br_startoff + imap.br_blockcount - srcoff;
-		imap.br_startoff += destoff - srcoff;
-
-		/* Clear dest from destoff to the end of imap and map it in. */
-		error = xfs_reflink_remap_extent(dest, &imap, destoff,
-				new_isize);
+		/* Remap into the destination file at the given offset. */
+		imap.br_startoff = destoff;
+		error = xfs_reflink_remap_extent(dest, &imap, new_isize);
 		if (error)
 			break;
 
@@ -1171,10 +1191,10 @@ xfs_reflink_remap_blocks(
 		}
 
 		/* Advance drange/srange */
-		srcoff += range_len;
-		destoff += range_len;
-		len -= range_len;
-		remapped_len += range_len;
+		srcoff += imap.br_blockcount;
+		destoff += imap.br_blockcount;
+		len -= imap.br_blockcount;
+		remapped_len += imap.br_blockcount;
 	}
 
 	if (error)
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 460136628a79..50c478374a31 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -3052,8 +3052,7 @@ DEFINE_EVENT(xfs_inode_irec_class, name, \
 DEFINE_INODE_EVENT(xfs_reflink_set_inode_flag);
 DEFINE_INODE_EVENT(xfs_reflink_unset_inode_flag);
 DEFINE_ITRUNC_EVENT(xfs_reflink_update_inode_size);
-DEFINE_IMAP_EVENT(xfs_reflink_remap_imap);
-TRACE_EVENT(xfs_reflink_remap_blocks_loop,
+TRACE_EVENT(xfs_reflink_remap_blocks,
 	TP_PROTO(struct xfs_inode *src, xfs_fileoff_t soffset,
 		 xfs_filblks_t len, struct xfs_inode *dest,
 		 xfs_fileoff_t doffset),
@@ -3084,59 +3083,14 @@ TRACE_EVENT(xfs_reflink_remap_blocks_loop,
 		  __entry->dest_ino,
 		  __entry->dest_lblk)
 );
-TRACE_EVENT(xfs_reflink_punch_range,
-	TP_PROTO(struct xfs_inode *ip, xfs_fileoff_t lblk,
-		 xfs_extlen_t len),
-	TP_ARGS(ip, lblk, len),
-	TP_STRUCT__entry(
-		__field(dev_t, dev)
-		__field(xfs_ino_t, ino)
-		__field(xfs_fileoff_t, lblk)
-		__field(xfs_extlen_t, len)
-	),
-	TP_fast_assign(
-		__entry->dev = VFS_I(ip)->i_sb->s_dev;
-		__entry->ino = ip->i_ino;
-		__entry->lblk = lblk;
-		__entry->len = len;
-	),
-	TP_printk("dev %d:%d ino 0x%llx lblk 0x%llx len 0x%x",
-		  MAJOR(__entry->dev), MINOR(__entry->dev),
-		  __entry->ino,
-		  __entry->lblk,
-		  __entry->len)
-);
-TRACE_EVENT(xfs_reflink_remap,
-	TP_PROTO(struct xfs_inode *ip, xfs_fileoff_t lblk,
-		 xfs_extlen_t len, xfs_fsblock_t new_pblk),
-	TP_ARGS(ip, lblk, len, new_pblk),
-	TP_STRUCT__entry(
-		__field(dev_t, dev)
-		__field(xfs_ino_t, ino)
-		__field(xfs_fileoff_t, lblk)
-		__field(xfs_extlen_t, len)
-		__field(xfs_fsblock_t, new_pblk)
-	),
-	TP_fast_assign(
-		__entry->dev = VFS_I(ip)->i_sb->s_dev;
-		__entry->ino = ip->i_ino;
-		__entry->lblk = lblk;
-		__entry->len = len;
-		__entry->new_pblk = new_pblk;
-	),
-	TP_printk("dev %d:%d ino 0x%llx lblk 0x%llx len 0x%x new_pblk %llu",
-		  MAJOR(__entry->dev), MINOR(__entry->dev),
-		  __entry->ino,
-		  __entry->lblk,
-		  __entry->len,
-		  __entry->new_pblk)
-);
 DEFINE_DOUBLE_IO_EVENT(xfs_reflink_remap_range);
 DEFINE_INODE_ERROR_EVENT(xfs_reflink_remap_range_error);
 DEFINE_INODE_ERROR_EVENT(xfs_reflink_set_inode_flag_error);
 DEFINE_INODE_ERROR_EVENT(xfs_reflink_update_inode_size_error);
 DEFINE_INODE_ERROR_EVENT(xfs_reflink_remap_blocks_error);
 DEFINE_INODE_ERROR_EVENT(xfs_reflink_remap_extent_error);
+DEFINE_INODE_IREC_EVENT(xfs_reflink_remap_extent_src);
+DEFINE_INODE_IREC_EVENT(xfs_reflink_remap_extent_dest);
 
 /* dedupe tracepoints */
 DEFINE_DOUBLE_IO_EVENT(xfs_reflink_compare_extents);

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH v4.2 3/9] xfs: redesign the reflink remap loop to fix blkres depletion crash
  2020-06-25 17:23   ` [PATCH v4.2 " Darrick J. Wong
@ 2020-06-26 11:57     ` Brian Foster
  2020-06-26 16:40       ` Darrick J. Wong
  2020-07-01  8:20     ` Christoph Hellwig
  1 sibling, 1 reply; 31+ messages in thread
From: Brian Foster @ 2020-06-26 11:57 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, edwin

On Thu, Jun 25, 2020 at 10:23:10AM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> The existing reflink remapping loop has some structural problems that
> need addressing:
> 
> The biggest problem is that we create one transaction for each extent in
> the source file without accounting for the number of mappings there are
> for the same range in the destination file.  In other words, we don't
> know the number of remap operations that will be necessary and we
> therefore cannot guess the block reservation required.  On highly
> fragmented filesystems (e.g. ones with active dedupe) we guess wrong,
> run out of block reservation, and fail.
> 
> The second problem is that we don't actually use the bmap intents to
> their full potential -- instead of calling bunmapi directly and having
> to deal with its backwards operation, we could call the deferred ops
> xfs_bmap_unmap_extent and xfs_refcount_decrease_extent instead.  This
> makes the frontend loop much simpler.
> 
> Solve all of these problems by refactoring the remapping loops so that
> we only perform one remapping operation per transaction, and each
> operation only tries to remap a single extent from source to dest.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
> v4.2: move qres conditional to the next patch, rename bmap helper, try
> to clear up some of the smap/dmap confusion
> ---

Looks good. Thanks for the tweaks..

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/libxfs/xfs_bmap.h |   13 ++
>  fs/xfs/xfs_reflink.c     |  240 +++++++++++++++++++++++++---------------------
>  fs/xfs/xfs_trace.h       |   52 +---------
>  3 files changed, 142 insertions(+), 163 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
> index 2b18338d0643..e1bd484e5548 100644
> --- a/fs/xfs/libxfs/xfs_bmap.h
> +++ b/fs/xfs/libxfs/xfs_bmap.h
> @@ -158,6 +158,13 @@ static inline int xfs_bmapi_whichfork(int bmapi_flags)
>  	{ BMAP_ATTRFORK,	"ATTR" }, \
>  	{ BMAP_COWFORK,		"COW" }
>  
> +/* Return true if the extent is an allocated extent, written or not. */
> +static inline bool xfs_bmap_is_real_extent(struct xfs_bmbt_irec *irec)
> +{
> +	return irec->br_startblock != HOLESTARTBLOCK &&
> +		irec->br_startblock != DELAYSTARTBLOCK &&
> +		!isnullstartblock(irec->br_startblock);
> +}
>  
>  /*
>   * Return true if the extent is a real, allocated extent, or false if it is  a
> @@ -165,10 +172,8 @@ static inline int xfs_bmapi_whichfork(int bmapi_flags)
>   */
>  static inline bool xfs_bmap_is_written_extent(struct xfs_bmbt_irec *irec)
>  {
> -	return irec->br_state != XFS_EXT_UNWRITTEN &&
> -		irec->br_startblock != HOLESTARTBLOCK &&
> -		irec->br_startblock != DELAYSTARTBLOCK &&
> -		!isnullstartblock(irec->br_startblock);
> +	return xfs_bmap_is_real_extent(irec) &&
> +	       irec->br_state != XFS_EXT_UNWRITTEN;
>  }
>  
>  /*
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index 22fdea6d69d3..e7dd8950d40a 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -984,41 +984,28 @@ xfs_reflink_ag_has_free_space(
>  }
>  
>  /*
> - * Unmap a range of blocks from a file, then map other blocks into the hole.
> - * The range to unmap is (destoff : destoff + srcioff + irec->br_blockcount).
> - * The extent irec is mapped into dest at irec->br_startoff.
> + * Remap the given extent into the file.  The dmap blockcount will be set to
> + * the number of blocks that were actually remapped.
>   */
>  STATIC int
>  xfs_reflink_remap_extent(
>  	struct xfs_inode	*ip,
> -	struct xfs_bmbt_irec	*irec,
> -	xfs_fileoff_t		destoff,
> +	struct xfs_bmbt_irec	*dmap,
>  	xfs_off_t		new_isize)
>  {
> +	struct xfs_bmbt_irec	smap;
>  	struct xfs_mount	*mp = ip->i_mount;
> -	bool			real_extent = xfs_bmap_is_written_extent(irec);
>  	struct xfs_trans	*tp;
> -	unsigned int		resblks;
> -	struct xfs_bmbt_irec	uirec;
> -	xfs_filblks_t		rlen;
> -	xfs_filblks_t		unmap_len;
>  	xfs_off_t		newlen;
> -	int64_t			qres;
> +	int64_t			qres, qdelta;
> +	unsigned int		resblks;
> +	bool			smap_real;
> +	bool			dmap_written = xfs_bmap_is_written_extent(dmap);
> +	int			nimaps;
>  	int			error;
>  
> -	unmap_len = irec->br_startoff + irec->br_blockcount - destoff;
> -	trace_xfs_reflink_punch_range(ip, destoff, unmap_len);
> -
> -	/* No reflinking if we're low on space */
> -	if (real_extent) {
> -		error = xfs_reflink_ag_has_free_space(mp,
> -				XFS_FSB_TO_AGNO(mp, irec->br_startblock));
> -		if (error)
> -			goto out;
> -	}
> -
>  	/* Start a rolling transaction to switch the mappings */
> -	resblks = XFS_EXTENTADD_SPACE_RES(ip->i_mount, XFS_DATA_FORK);
> +	resblks = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
>  	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0, 0, &tp);
>  	if (error)
>  		goto out;
> @@ -1027,92 +1014,121 @@ xfs_reflink_remap_extent(
>  	xfs_trans_ijoin(tp, ip, 0);
>  
>  	/*
> -	 * Reserve quota for this operation.  We don't know if the first unmap
> -	 * in the dest file will cause a bmap btree split, so we always reserve
> -	 * at least enough blocks for that split.  If the extent being mapped
> -	 * in is written, we need to reserve quota for that too.
> +	 * Read what's currently mapped in the destination file into smap.
> +	 * If smap isn't a hole, we will have to remove it before we can add
> +	 * dmap to the destination file.
>  	 */
> +	nimaps = 1;
> +	error = xfs_bmapi_read(ip, dmap->br_startoff, dmap->br_blockcount,
> +			&smap, &nimaps, 0);
> +	if (error)
> +		goto out_cancel;
> +	ASSERT(nimaps == 1 && smap.br_startoff == dmap->br_startoff);
> +	smap_real = xfs_bmap_is_real_extent(&smap);
> +
> +	/*
> +	 * We can only remap as many blocks as the smaller of the two extent
> +	 * maps, because we can only remap one extent at a time.
> +	 */
> +	dmap->br_blockcount = min(dmap->br_blockcount, smap.br_blockcount);
> +	ASSERT(dmap->br_blockcount == smap.br_blockcount);
> +
> +	trace_xfs_reflink_remap_extent_dest(ip, &smap);
> +
> +	/* No reflinking if the AG of the dest mapping is low on space. */
> +	if (dmap_written) {
> +		error = xfs_reflink_ag_has_free_space(mp,
> +				XFS_FSB_TO_AGNO(mp, dmap->br_startblock));
> +		if (error)
> +			goto out_cancel;
> +	}
> +
> +	/*
> +	 * Compute quota reservation if we think the quota block counter for
> +	 * this file could increase.
> +	 *
> +	 * We start by reserving enough blocks to handle a bmbt split.
> +	 *
> +	 * If we are mapping a written extent into the file, we need to have
> +	 * enough quota block count reservation to handle the blocks in that
> +	 * extent.
> +	 *
> +	 * Note that if we're replacing a delalloc reservation with a written
> +	 * extent, we have to take the full quota reservation because removing
> +	 * the delalloc reservation gives the block count back to the quota
> +	 * count.  This is suboptimal, but the VFS flushed the dest range
> +	 * before we started.  That should have removed all the delalloc
> +	 * reservations, but we code defensively.
> +	 */
> +	qdelta = 0;
>  	qres = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
> -	if (real_extent)
> -		qres += irec->br_blockcount;
> +	if (dmap_written)
> +		qres += dmap->br_blockcount;
>  	error = xfs_trans_reserve_quota_nblks(tp, ip, qres, 0,
>  			XFS_QMOPT_RES_REGBLKS);
>  	if (error)
>  		goto out_cancel;
>  
> -	trace_xfs_reflink_remap(ip, irec->br_startoff,
> -				irec->br_blockcount, irec->br_startblock);
> -
> -	/* Unmap the old blocks in the data fork. */
> -	rlen = unmap_len;
> -	while (rlen) {
> -		ASSERT(tp->t_firstblock == NULLFSBLOCK);
> -		error = __xfs_bunmapi(tp, ip, destoff, &rlen, 0, 1);
> -		if (error)
> -			goto out_cancel;
> -
> +	if (smap_real) {
>  		/*
> -		 * Trim the extent to whatever got unmapped.
> -		 * Remember, bunmapi works backwards.
> +		 * If the extent we're unmapping is backed by storage (written
> +		 * or not), unmap the extent and drop its refcount.
>  		 */
> -		uirec.br_startblock = irec->br_startblock + rlen;
> -		uirec.br_startoff = irec->br_startoff + rlen;
> -		uirec.br_blockcount = unmap_len - rlen;
> -		uirec.br_state = irec->br_state;
> -		unmap_len = rlen;
> +		xfs_bmap_unmap_extent(tp, ip, &smap);
> +		xfs_refcount_decrease_extent(tp, &smap);
> +		qdelta -= smap.br_blockcount;
> +	} else if (smap.br_startblock == DELAYSTARTBLOCK) {
> +		xfs_filblks_t	len = smap.br_blockcount;
>  
> -		/* If this isn't a real mapping, we're done. */
> -		if (!real_extent || uirec.br_blockcount == 0)
> -			goto next_extent;
> -
> -		trace_xfs_reflink_remap(ip, uirec.br_startoff,
> -				uirec.br_blockcount, uirec.br_startblock);
> -
> -		/* Update the refcount tree */
> -		xfs_refcount_increase_extent(tp, &uirec);
> -
> -		/* Map the new blocks into the data fork. */
> -		xfs_bmap_map_extent(tp, ip, &uirec);
> +		/*
> +		 * If the extent we're unmapping is a delalloc reservation,
> +		 * we can use the regular bunmapi function to release the
> +		 * incore state.  Dropping the delalloc reservation takes care
> +		 * of the quota reservation for us.
> +		 */
> +		error = __xfs_bunmapi(NULL, ip, smap.br_startoff, &len, 0, 1);
> +		if (error)
> +			goto out_cancel;
> +		ASSERT(len == 0);
> +	}
>  
> -		/* Update quota accounting. */
> -		xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_BCOUNT,
> -				uirec.br_blockcount);
> +	/*
> +	 * If the extent we're sharing is backed by written storage, increase
> +	 * its refcount and map it into the file.
> +	 */
> +	if (dmap_written) {
> +		xfs_refcount_increase_extent(tp, dmap);
> +		xfs_bmap_map_extent(tp, ip, dmap);
> +		qdelta += dmap->br_blockcount;
> +	}
>  
> -		/* Update dest isize if needed. */
> -		newlen = XFS_FSB_TO_B(mp,
> -				uirec.br_startoff + uirec.br_blockcount);
> -		newlen = min_t(xfs_off_t, newlen, new_isize);
> -		if (newlen > i_size_read(VFS_I(ip))) {
> -			trace_xfs_reflink_update_inode_size(ip, newlen);
> -			i_size_write(VFS_I(ip), newlen);
> -			ip->i_d.di_size = newlen;
> -			xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
> -		}
> +	xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_BCOUNT, qdelta);
>  
> -next_extent:
> -		/* Process all the deferred stuff. */
> -		error = xfs_defer_finish(&tp);
> -		if (error)
> -			goto out_cancel;
> +	/* Update dest isize if needed. */
> +	newlen = XFS_FSB_TO_B(mp, dmap->br_startoff + dmap->br_blockcount);
> +	newlen = min_t(xfs_off_t, newlen, new_isize);
> +	if (newlen > i_size_read(VFS_I(ip))) {
> +		trace_xfs_reflink_update_inode_size(ip, newlen);
> +		i_size_write(VFS_I(ip), newlen);
> +		ip->i_d.di_size = newlen;
> +		xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
>  	}
>  
> +	/* Commit everything and unlock. */
>  	error = xfs_trans_commit(tp);
> -	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> -	if (error)
> -		goto out;
> -	return 0;
> +	goto out_unlock;
>  
>  out_cancel:
>  	xfs_trans_cancel(tp);
> +out_unlock:
>  	xfs_iunlock(ip, XFS_ILOCK_EXCL);
>  out:
> -	trace_xfs_reflink_remap_extent_error(ip, error, _RET_IP_);
> +	if (error)
> +		trace_xfs_reflink_remap_extent_error(ip, error, _RET_IP_);
>  	return error;
>  }
>  
> -/*
> - * Iteratively remap one file's extents (and holes) to another's.
> - */
> +/* Remap a range of one file to the other. */
>  int
>  xfs_reflink_remap_blocks(
>  	struct xfs_inode	*src,
> @@ -1123,25 +1139,22 @@ xfs_reflink_remap_blocks(
>  	loff_t			*remapped)
>  {
>  	struct xfs_bmbt_irec	imap;
> -	xfs_fileoff_t		srcoff;
> -	xfs_fileoff_t		destoff;
> +	struct xfs_mount	*mp = src->i_mount;
> +	xfs_fileoff_t		srcoff = XFS_B_TO_FSBT(mp, pos_in);
> +	xfs_fileoff_t		destoff = XFS_B_TO_FSBT(mp, pos_out);
>  	xfs_filblks_t		len;
> -	xfs_filblks_t		range_len;
>  	xfs_filblks_t		remapped_len = 0;
>  	xfs_off_t		new_isize = pos_out + remap_len;
>  	int			nimaps;
>  	int			error = 0;
>  
> -	destoff = XFS_B_TO_FSBT(src->i_mount, pos_out);
> -	srcoff = XFS_B_TO_FSBT(src->i_mount, pos_in);
> -	len = XFS_B_TO_FSB(src->i_mount, remap_len);
> +	len = min_t(xfs_filblks_t, XFS_B_TO_FSB(mp, remap_len),
> +			XFS_MAX_FILEOFF);
>  
> -	/* drange = (destoff, destoff + len); srange = (srcoff, srcoff + len) */
> -	while (len) {
> -		uint		lock_mode;
> +	trace_xfs_reflink_remap_blocks(src, srcoff, len, dest, destoff);
>  
> -		trace_xfs_reflink_remap_blocks_loop(src, srcoff, len,
> -				dest, destoff);
> +	while (len > 0) {
> +		unsigned int	lock_mode;
>  
>  		/* Read extent from the source file */
>  		nimaps = 1;
> @@ -1150,18 +1163,25 @@ xfs_reflink_remap_blocks(
>  		xfs_iunlock(src, lock_mode);
>  		if (error)
>  			break;
> -		ASSERT(nimaps == 1);
> +		/*
> +		 * The caller supposedly flushed all dirty pages in the source
> +		 * file range, which means that writeback should have allocated
> +		 * or deleted all delalloc reservations in that range.  If we
> +		 * find one, that's a good sign that something is seriously
> +		 * wrong here.
> +		 */
> +		ASSERT(nimaps == 1 && imap.br_startoff == srcoff);
> +		if (imap.br_startblock == DELAYSTARTBLOCK) {
> +			ASSERT(imap.br_startblock != DELAYSTARTBLOCK);
> +			error = -EFSCORRUPTED;
> +			break;
> +		}
>  
> -		trace_xfs_reflink_remap_imap(src, srcoff, len, XFS_DATA_FORK,
> -				&imap);
> +		trace_xfs_reflink_remap_extent_src(src, &imap);
>  
> -		/* Translate imap into the destination file. */
> -		range_len = imap.br_startoff + imap.br_blockcount - srcoff;
> -		imap.br_startoff += destoff - srcoff;
> -
> -		/* Clear dest from destoff to the end of imap and map it in. */
> -		error = xfs_reflink_remap_extent(dest, &imap, destoff,
> -				new_isize);
> +		/* Remap into the destination file at the given offset. */
> +		imap.br_startoff = destoff;
> +		error = xfs_reflink_remap_extent(dest, &imap, new_isize);
>  		if (error)
>  			break;
>  
> @@ -1171,10 +1191,10 @@ xfs_reflink_remap_blocks(
>  		}
>  
>  		/* Advance drange/srange */
> -		srcoff += range_len;
> -		destoff += range_len;
> -		len -= range_len;
> -		remapped_len += range_len;
> +		srcoff += imap.br_blockcount;
> +		destoff += imap.br_blockcount;
> +		len -= imap.br_blockcount;
> +		remapped_len += imap.br_blockcount;
>  	}
>  
>  	if (error)
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index 460136628a79..50c478374a31 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -3052,8 +3052,7 @@ DEFINE_EVENT(xfs_inode_irec_class, name, \
>  DEFINE_INODE_EVENT(xfs_reflink_set_inode_flag);
>  DEFINE_INODE_EVENT(xfs_reflink_unset_inode_flag);
>  DEFINE_ITRUNC_EVENT(xfs_reflink_update_inode_size);
> -DEFINE_IMAP_EVENT(xfs_reflink_remap_imap);
> -TRACE_EVENT(xfs_reflink_remap_blocks_loop,
> +TRACE_EVENT(xfs_reflink_remap_blocks,
>  	TP_PROTO(struct xfs_inode *src, xfs_fileoff_t soffset,
>  		 xfs_filblks_t len, struct xfs_inode *dest,
>  		 xfs_fileoff_t doffset),
> @@ -3084,59 +3083,14 @@ TRACE_EVENT(xfs_reflink_remap_blocks_loop,
>  		  __entry->dest_ino,
>  		  __entry->dest_lblk)
>  );
> -TRACE_EVENT(xfs_reflink_punch_range,
> -	TP_PROTO(struct xfs_inode *ip, xfs_fileoff_t lblk,
> -		 xfs_extlen_t len),
> -	TP_ARGS(ip, lblk, len),
> -	TP_STRUCT__entry(
> -		__field(dev_t, dev)
> -		__field(xfs_ino_t, ino)
> -		__field(xfs_fileoff_t, lblk)
> -		__field(xfs_extlen_t, len)
> -	),
> -	TP_fast_assign(
> -		__entry->dev = VFS_I(ip)->i_sb->s_dev;
> -		__entry->ino = ip->i_ino;
> -		__entry->lblk = lblk;
> -		__entry->len = len;
> -	),
> -	TP_printk("dev %d:%d ino 0x%llx lblk 0x%llx len 0x%x",
> -		  MAJOR(__entry->dev), MINOR(__entry->dev),
> -		  __entry->ino,
> -		  __entry->lblk,
> -		  __entry->len)
> -);
> -TRACE_EVENT(xfs_reflink_remap,
> -	TP_PROTO(struct xfs_inode *ip, xfs_fileoff_t lblk,
> -		 xfs_extlen_t len, xfs_fsblock_t new_pblk),
> -	TP_ARGS(ip, lblk, len, new_pblk),
> -	TP_STRUCT__entry(
> -		__field(dev_t, dev)
> -		__field(xfs_ino_t, ino)
> -		__field(xfs_fileoff_t, lblk)
> -		__field(xfs_extlen_t, len)
> -		__field(xfs_fsblock_t, new_pblk)
> -	),
> -	TP_fast_assign(
> -		__entry->dev = VFS_I(ip)->i_sb->s_dev;
> -		__entry->ino = ip->i_ino;
> -		__entry->lblk = lblk;
> -		__entry->len = len;
> -		__entry->new_pblk = new_pblk;
> -	),
> -	TP_printk("dev %d:%d ino 0x%llx lblk 0x%llx len 0x%x new_pblk %llu",
> -		  MAJOR(__entry->dev), MINOR(__entry->dev),
> -		  __entry->ino,
> -		  __entry->lblk,
> -		  __entry->len,
> -		  __entry->new_pblk)
> -);
>  DEFINE_DOUBLE_IO_EVENT(xfs_reflink_remap_range);
>  DEFINE_INODE_ERROR_EVENT(xfs_reflink_remap_range_error);
>  DEFINE_INODE_ERROR_EVENT(xfs_reflink_set_inode_flag_error);
>  DEFINE_INODE_ERROR_EVENT(xfs_reflink_update_inode_size_error);
>  DEFINE_INODE_ERROR_EVENT(xfs_reflink_remap_blocks_error);
>  DEFINE_INODE_ERROR_EVENT(xfs_reflink_remap_extent_error);
> +DEFINE_INODE_IREC_EVENT(xfs_reflink_remap_extent_src);
> +DEFINE_INODE_IREC_EVENT(xfs_reflink_remap_extent_dest);
>  
>  /* dedupe tracepoints */
>  DEFINE_DOUBLE_IO_EVENT(xfs_reflink_compare_extents);
> 


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v4.2 3/9] xfs: redesign the reflink remap loop to fix blkres depletion crash
  2020-06-26 11:57     ` Brian Foster
@ 2020-06-26 16:40       ` Darrick J. Wong
  0 siblings, 0 replies; 31+ messages in thread
From: Darrick J. Wong @ 2020-06-26 16:40 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, edwin

On Fri, Jun 26, 2020 at 07:57:59AM -0400, Brian Foster wrote:
> On Thu, Jun 25, 2020 at 10:23:10AM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > The existing reflink remapping loop has some structural problems that
> > need addressing:
> > 
> > The biggest problem is that we create one transaction for each extent in
> > the source file without accounting for the number of mappings there are
> > for the same range in the destination file.  In other words, we don't
> > know the number of remap operations that will be necessary and we
> > therefore cannot guess the block reservation required.  On highly
> > fragmented filesystems (e.g. ones with active dedupe) we guess wrong,
> > run out of block reservation, and fail.
> > 
> > The second problem is that we don't actually use the bmap intents to
> > their full potential -- instead of calling bunmapi directly and having
> > to deal with its backwards operation, we could call the deferred ops
> > xfs_bmap_unmap_extent and xfs_refcount_decrease_extent instead.  This
> > makes the frontend loop much simpler.
> > 
> > Solve all of these problems by refactoring the remapping loops so that
> > we only perform one remapping operation per transaction, and each
> > operation only tries to remap a single extent from source to dest.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> > v4.2: move qres conditional to the next patch, rename bmap helper, try
> > to clear up some of the smap/dmap confusion
> > ---
> 
> Looks good. Thanks for the tweaks..
> 
> Reviewed-by: Brian Foster <bfoster@redhat.com>

Woot!  Thanks for reviewing! :)

--D

> >  fs/xfs/libxfs/xfs_bmap.h |   13 ++
> >  fs/xfs/xfs_reflink.c     |  240 +++++++++++++++++++++++++---------------------
> >  fs/xfs/xfs_trace.h       |   52 +---------
> >  3 files changed, 142 insertions(+), 163 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
> > index 2b18338d0643..e1bd484e5548 100644
> > --- a/fs/xfs/libxfs/xfs_bmap.h
> > +++ b/fs/xfs/libxfs/xfs_bmap.h
> > @@ -158,6 +158,13 @@ static inline int xfs_bmapi_whichfork(int bmapi_flags)
> >  	{ BMAP_ATTRFORK,	"ATTR" }, \
> >  	{ BMAP_COWFORK,		"COW" }
> >  
> > +/* Return true if the extent is an allocated extent, written or not. */
> > +static inline bool xfs_bmap_is_real_extent(struct xfs_bmbt_irec *irec)
> > +{
> > +	return irec->br_startblock != HOLESTARTBLOCK &&
> > +		irec->br_startblock != DELAYSTARTBLOCK &&
> > +		!isnullstartblock(irec->br_startblock);
> > +}
> >  
> >  /*
> >   * Return true if the extent is a real, allocated extent, or false if it is  a
> > @@ -165,10 +172,8 @@ static inline int xfs_bmapi_whichfork(int bmapi_flags)
> >   */
> >  static inline bool xfs_bmap_is_written_extent(struct xfs_bmbt_irec *irec)
> >  {
> > -	return irec->br_state != XFS_EXT_UNWRITTEN &&
> > -		irec->br_startblock != HOLESTARTBLOCK &&
> > -		irec->br_startblock != DELAYSTARTBLOCK &&
> > -		!isnullstartblock(irec->br_startblock);
> > +	return xfs_bmap_is_real_extent(irec) &&
> > +	       irec->br_state != XFS_EXT_UNWRITTEN;
> >  }
> >  
> >  /*
> > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > index 22fdea6d69d3..e7dd8950d40a 100644
> > --- a/fs/xfs/xfs_reflink.c
> > +++ b/fs/xfs/xfs_reflink.c
> > @@ -984,41 +984,28 @@ xfs_reflink_ag_has_free_space(
> >  }
> >  
> >  /*
> > - * Unmap a range of blocks from a file, then map other blocks into the hole.
> > - * The range to unmap is (destoff : destoff + srcioff + irec->br_blockcount).
> > - * The extent irec is mapped into dest at irec->br_startoff.
> > + * Remap the given extent into the file.  The dmap blockcount will be set to
> > + * the number of blocks that were actually remapped.
> >   */
> >  STATIC int
> >  xfs_reflink_remap_extent(
> >  	struct xfs_inode	*ip,
> > -	struct xfs_bmbt_irec	*irec,
> > -	xfs_fileoff_t		destoff,
> > +	struct xfs_bmbt_irec	*dmap,
> >  	xfs_off_t		new_isize)
> >  {
> > +	struct xfs_bmbt_irec	smap;
> >  	struct xfs_mount	*mp = ip->i_mount;
> > -	bool			real_extent = xfs_bmap_is_written_extent(irec);
> >  	struct xfs_trans	*tp;
> > -	unsigned int		resblks;
> > -	struct xfs_bmbt_irec	uirec;
> > -	xfs_filblks_t		rlen;
> > -	xfs_filblks_t		unmap_len;
> >  	xfs_off_t		newlen;
> > -	int64_t			qres;
> > +	int64_t			qres, qdelta;
> > +	unsigned int		resblks;
> > +	bool			smap_real;
> > +	bool			dmap_written = xfs_bmap_is_written_extent(dmap);
> > +	int			nimaps;
> >  	int			error;
> >  
> > -	unmap_len = irec->br_startoff + irec->br_blockcount - destoff;
> > -	trace_xfs_reflink_punch_range(ip, destoff, unmap_len);
> > -
> > -	/* No reflinking if we're low on space */
> > -	if (real_extent) {
> > -		error = xfs_reflink_ag_has_free_space(mp,
> > -				XFS_FSB_TO_AGNO(mp, irec->br_startblock));
> > -		if (error)
> > -			goto out;
> > -	}
> > -
> >  	/* Start a rolling transaction to switch the mappings */
> > -	resblks = XFS_EXTENTADD_SPACE_RES(ip->i_mount, XFS_DATA_FORK);
> > +	resblks = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
> >  	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0, 0, &tp);
> >  	if (error)
> >  		goto out;
> > @@ -1027,92 +1014,121 @@ xfs_reflink_remap_extent(
> >  	xfs_trans_ijoin(tp, ip, 0);
> >  
> >  	/*
> > -	 * Reserve quota for this operation.  We don't know if the first unmap
> > -	 * in the dest file will cause a bmap btree split, so we always reserve
> > -	 * at least enough blocks for that split.  If the extent being mapped
> > -	 * in is written, we need to reserve quota for that too.
> > +	 * Read what's currently mapped in the destination file into smap.
> > +	 * If smap isn't a hole, we will have to remove it before we can add
> > +	 * dmap to the destination file.
> >  	 */
> > +	nimaps = 1;
> > +	error = xfs_bmapi_read(ip, dmap->br_startoff, dmap->br_blockcount,
> > +			&smap, &nimaps, 0);
> > +	if (error)
> > +		goto out_cancel;
> > +	ASSERT(nimaps == 1 && smap.br_startoff == dmap->br_startoff);
> > +	smap_real = xfs_bmap_is_real_extent(&smap);
> > +
> > +	/*
> > +	 * We can only remap as many blocks as the smaller of the two extent
> > +	 * maps, because we can only remap one extent at a time.
> > +	 */
> > +	dmap->br_blockcount = min(dmap->br_blockcount, smap.br_blockcount);
> > +	ASSERT(dmap->br_blockcount == smap.br_blockcount);
> > +
> > +	trace_xfs_reflink_remap_extent_dest(ip, &smap);
> > +
> > +	/* No reflinking if the AG of the dest mapping is low on space. */
> > +	if (dmap_written) {
> > +		error = xfs_reflink_ag_has_free_space(mp,
> > +				XFS_FSB_TO_AGNO(mp, dmap->br_startblock));
> > +		if (error)
> > +			goto out_cancel;
> > +	}
> > +
> > +	/*
> > +	 * Compute quota reservation if we think the quota block counter for
> > +	 * this file could increase.
> > +	 *
> > +	 * We start by reserving enough blocks to handle a bmbt split.
> > +	 *
> > +	 * If we are mapping a written extent into the file, we need to have
> > +	 * enough quota block count reservation to handle the blocks in that
> > +	 * extent.
> > +	 *
> > +	 * Note that if we're replacing a delalloc reservation with a written
> > +	 * extent, we have to take the full quota reservation because removing
> > +	 * the delalloc reservation gives the block count back to the quota
> > +	 * count.  This is suboptimal, but the VFS flushed the dest range
> > +	 * before we started.  That should have removed all the delalloc
> > +	 * reservations, but we code defensively.
> > +	 */
> > +	qdelta = 0;
> >  	qres = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
> > -	if (real_extent)
> > -		qres += irec->br_blockcount;
> > +	if (dmap_written)
> > +		qres += dmap->br_blockcount;
> >  	error = xfs_trans_reserve_quota_nblks(tp, ip, qres, 0,
> >  			XFS_QMOPT_RES_REGBLKS);
> >  	if (error)
> >  		goto out_cancel;
> >  
> > -	trace_xfs_reflink_remap(ip, irec->br_startoff,
> > -				irec->br_blockcount, irec->br_startblock);
> > -
> > -	/* Unmap the old blocks in the data fork. */
> > -	rlen = unmap_len;
> > -	while (rlen) {
> > -		ASSERT(tp->t_firstblock == NULLFSBLOCK);
> > -		error = __xfs_bunmapi(tp, ip, destoff, &rlen, 0, 1);
> > -		if (error)
> > -			goto out_cancel;
> > -
> > +	if (smap_real) {
> >  		/*
> > -		 * Trim the extent to whatever got unmapped.
> > -		 * Remember, bunmapi works backwards.
> > +		 * If the extent we're unmapping is backed by storage (written
> > +		 * or not), unmap the extent and drop its refcount.
> >  		 */
> > -		uirec.br_startblock = irec->br_startblock + rlen;
> > -		uirec.br_startoff = irec->br_startoff + rlen;
> > -		uirec.br_blockcount = unmap_len - rlen;
> > -		uirec.br_state = irec->br_state;
> > -		unmap_len = rlen;
> > +		xfs_bmap_unmap_extent(tp, ip, &smap);
> > +		xfs_refcount_decrease_extent(tp, &smap);
> > +		qdelta -= smap.br_blockcount;
> > +	} else if (smap.br_startblock == DELAYSTARTBLOCK) {
> > +		xfs_filblks_t	len = smap.br_blockcount;
> >  
> > -		/* If this isn't a real mapping, we're done. */
> > -		if (!real_extent || uirec.br_blockcount == 0)
> > -			goto next_extent;
> > -
> > -		trace_xfs_reflink_remap(ip, uirec.br_startoff,
> > -				uirec.br_blockcount, uirec.br_startblock);
> > -
> > -		/* Update the refcount tree */
> > -		xfs_refcount_increase_extent(tp, &uirec);
> > -
> > -		/* Map the new blocks into the data fork. */
> > -		xfs_bmap_map_extent(tp, ip, &uirec);
> > +		/*
> > +		 * If the extent we're unmapping is a delalloc reservation,
> > +		 * we can use the regular bunmapi function to release the
> > +		 * incore state.  Dropping the delalloc reservation takes care
> > +		 * of the quota reservation for us.
> > +		 */
> > +		error = __xfs_bunmapi(NULL, ip, smap.br_startoff, &len, 0, 1);
> > +		if (error)
> > +			goto out_cancel;
> > +		ASSERT(len == 0);
> > +	}
> >  
> > -		/* Update quota accounting. */
> > -		xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_BCOUNT,
> > -				uirec.br_blockcount);
> > +	/*
> > +	 * If the extent we're sharing is backed by written storage, increase
> > +	 * its refcount and map it into the file.
> > +	 */
> > +	if (dmap_written) {
> > +		xfs_refcount_increase_extent(tp, dmap);
> > +		xfs_bmap_map_extent(tp, ip, dmap);
> > +		qdelta += dmap->br_blockcount;
> > +	}
> >  
> > -		/* Update dest isize if needed. */
> > -		newlen = XFS_FSB_TO_B(mp,
> > -				uirec.br_startoff + uirec.br_blockcount);
> > -		newlen = min_t(xfs_off_t, newlen, new_isize);
> > -		if (newlen > i_size_read(VFS_I(ip))) {
> > -			trace_xfs_reflink_update_inode_size(ip, newlen);
> > -			i_size_write(VFS_I(ip), newlen);
> > -			ip->i_d.di_size = newlen;
> > -			xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
> > -		}
> > +	xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_BCOUNT, qdelta);
> >  
> > -next_extent:
> > -		/* Process all the deferred stuff. */
> > -		error = xfs_defer_finish(&tp);
> > -		if (error)
> > -			goto out_cancel;
> > +	/* Update dest isize if needed. */
> > +	newlen = XFS_FSB_TO_B(mp, dmap->br_startoff + dmap->br_blockcount);
> > +	newlen = min_t(xfs_off_t, newlen, new_isize);
> > +	if (newlen > i_size_read(VFS_I(ip))) {
> > +		trace_xfs_reflink_update_inode_size(ip, newlen);
> > +		i_size_write(VFS_I(ip), newlen);
> > +		ip->i_d.di_size = newlen;
> > +		xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
> >  	}
> >  
> > +	/* Commit everything and unlock. */
> >  	error = xfs_trans_commit(tp);
> > -	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > -	if (error)
> > -		goto out;
> > -	return 0;
> > +	goto out_unlock;
> >  
> >  out_cancel:
> >  	xfs_trans_cancel(tp);
> > +out_unlock:
> >  	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> >  out:
> > -	trace_xfs_reflink_remap_extent_error(ip, error, _RET_IP_);
> > +	if (error)
> > +		trace_xfs_reflink_remap_extent_error(ip, error, _RET_IP_);
> >  	return error;
> >  }
> >  
> > -/*
> > - * Iteratively remap one file's extents (and holes) to another's.
> > - */
> > +/* Remap a range of one file to the other. */
> >  int
> >  xfs_reflink_remap_blocks(
> >  	struct xfs_inode	*src,
> > @@ -1123,25 +1139,22 @@ xfs_reflink_remap_blocks(
> >  	loff_t			*remapped)
> >  {
> >  	struct xfs_bmbt_irec	imap;
> > -	xfs_fileoff_t		srcoff;
> > -	xfs_fileoff_t		destoff;
> > +	struct xfs_mount	*mp = src->i_mount;
> > +	xfs_fileoff_t		srcoff = XFS_B_TO_FSBT(mp, pos_in);
> > +	xfs_fileoff_t		destoff = XFS_B_TO_FSBT(mp, pos_out);
> >  	xfs_filblks_t		len;
> > -	xfs_filblks_t		range_len;
> >  	xfs_filblks_t		remapped_len = 0;
> >  	xfs_off_t		new_isize = pos_out + remap_len;
> >  	int			nimaps;
> >  	int			error = 0;
> >  
> > -	destoff = XFS_B_TO_FSBT(src->i_mount, pos_out);
> > -	srcoff = XFS_B_TO_FSBT(src->i_mount, pos_in);
> > -	len = XFS_B_TO_FSB(src->i_mount, remap_len);
> > +	len = min_t(xfs_filblks_t, XFS_B_TO_FSB(mp, remap_len),
> > +			XFS_MAX_FILEOFF);
> >  
> > -	/* drange = (destoff, destoff + len); srange = (srcoff, srcoff + len) */
> > -	while (len) {
> > -		uint		lock_mode;
> > +	trace_xfs_reflink_remap_blocks(src, srcoff, len, dest, destoff);
> >  
> > -		trace_xfs_reflink_remap_blocks_loop(src, srcoff, len,
> > -				dest, destoff);
> > +	while (len > 0) {
> > +		unsigned int	lock_mode;
> >  
> >  		/* Read extent from the source file */
> >  		nimaps = 1;
> > @@ -1150,18 +1163,25 @@ xfs_reflink_remap_blocks(
> >  		xfs_iunlock(src, lock_mode);
> >  		if (error)
> >  			break;
> > -		ASSERT(nimaps == 1);
> > +		/*
> > +		 * The caller supposedly flushed all dirty pages in the source
> > +		 * file range, which means that writeback should have allocated
> > +		 * or deleted all delalloc reservations in that range.  If we
> > +		 * find one, that's a good sign that something is seriously
> > +		 * wrong here.
> > +		 */
> > +		ASSERT(nimaps == 1 && imap.br_startoff == srcoff);
> > +		if (imap.br_startblock == DELAYSTARTBLOCK) {
> > +			ASSERT(imap.br_startblock != DELAYSTARTBLOCK);
> > +			error = -EFSCORRUPTED;
> > +			break;
> > +		}
> >  
> > -		trace_xfs_reflink_remap_imap(src, srcoff, len, XFS_DATA_FORK,
> > -				&imap);
> > +		trace_xfs_reflink_remap_extent_src(src, &imap);
> >  
> > -		/* Translate imap into the destination file. */
> > -		range_len = imap.br_startoff + imap.br_blockcount - srcoff;
> > -		imap.br_startoff += destoff - srcoff;
> > -
> > -		/* Clear dest from destoff to the end of imap and map it in. */
> > -		error = xfs_reflink_remap_extent(dest, &imap, destoff,
> > -				new_isize);
> > +		/* Remap into the destination file at the given offset. */
> > +		imap.br_startoff = destoff;
> > +		error = xfs_reflink_remap_extent(dest, &imap, new_isize);
> >  		if (error)
> >  			break;
> >  
> > @@ -1171,10 +1191,10 @@ xfs_reflink_remap_blocks(
> >  		}
> >  
> >  		/* Advance drange/srange */
> > -		srcoff += range_len;
> > -		destoff += range_len;
> > -		len -= range_len;
> > -		remapped_len += range_len;
> > +		srcoff += imap.br_blockcount;
> > +		destoff += imap.br_blockcount;
> > +		len -= imap.br_blockcount;
> > +		remapped_len += imap.br_blockcount;
> >  	}
> >  
> >  	if (error)
> > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> > index 460136628a79..50c478374a31 100644
> > --- a/fs/xfs/xfs_trace.h
> > +++ b/fs/xfs/xfs_trace.h
> > @@ -3052,8 +3052,7 @@ DEFINE_EVENT(xfs_inode_irec_class, name, \
> >  DEFINE_INODE_EVENT(xfs_reflink_set_inode_flag);
> >  DEFINE_INODE_EVENT(xfs_reflink_unset_inode_flag);
> >  DEFINE_ITRUNC_EVENT(xfs_reflink_update_inode_size);
> > -DEFINE_IMAP_EVENT(xfs_reflink_remap_imap);
> > -TRACE_EVENT(xfs_reflink_remap_blocks_loop,
> > +TRACE_EVENT(xfs_reflink_remap_blocks,
> >  	TP_PROTO(struct xfs_inode *src, xfs_fileoff_t soffset,
> >  		 xfs_filblks_t len, struct xfs_inode *dest,
> >  		 xfs_fileoff_t doffset),
> > @@ -3084,59 +3083,14 @@ TRACE_EVENT(xfs_reflink_remap_blocks_loop,
> >  		  __entry->dest_ino,
> >  		  __entry->dest_lblk)
> >  );
> > -TRACE_EVENT(xfs_reflink_punch_range,
> > -	TP_PROTO(struct xfs_inode *ip, xfs_fileoff_t lblk,
> > -		 xfs_extlen_t len),
> > -	TP_ARGS(ip, lblk, len),
> > -	TP_STRUCT__entry(
> > -		__field(dev_t, dev)
> > -		__field(xfs_ino_t, ino)
> > -		__field(xfs_fileoff_t, lblk)
> > -		__field(xfs_extlen_t, len)
> > -	),
> > -	TP_fast_assign(
> > -		__entry->dev = VFS_I(ip)->i_sb->s_dev;
> > -		__entry->ino = ip->i_ino;
> > -		__entry->lblk = lblk;
> > -		__entry->len = len;
> > -	),
> > -	TP_printk("dev %d:%d ino 0x%llx lblk 0x%llx len 0x%x",
> > -		  MAJOR(__entry->dev), MINOR(__entry->dev),
> > -		  __entry->ino,
> > -		  __entry->lblk,
> > -		  __entry->len)
> > -);
> > -TRACE_EVENT(xfs_reflink_remap,
> > -	TP_PROTO(struct xfs_inode *ip, xfs_fileoff_t lblk,
> > -		 xfs_extlen_t len, xfs_fsblock_t new_pblk),
> > -	TP_ARGS(ip, lblk, len, new_pblk),
> > -	TP_STRUCT__entry(
> > -		__field(dev_t, dev)
> > -		__field(xfs_ino_t, ino)
> > -		__field(xfs_fileoff_t, lblk)
> > -		__field(xfs_extlen_t, len)
> > -		__field(xfs_fsblock_t, new_pblk)
> > -	),
> > -	TP_fast_assign(
> > -		__entry->dev = VFS_I(ip)->i_sb->s_dev;
> > -		__entry->ino = ip->i_ino;
> > -		__entry->lblk = lblk;
> > -		__entry->len = len;
> > -		__entry->new_pblk = new_pblk;
> > -	),
> > -	TP_printk("dev %d:%d ino 0x%llx lblk 0x%llx len 0x%x new_pblk %llu",
> > -		  MAJOR(__entry->dev), MINOR(__entry->dev),
> > -		  __entry->ino,
> > -		  __entry->lblk,
> > -		  __entry->len,
> > -		  __entry->new_pblk)
> > -);
> >  DEFINE_DOUBLE_IO_EVENT(xfs_reflink_remap_range);
> >  DEFINE_INODE_ERROR_EVENT(xfs_reflink_remap_range_error);
> >  DEFINE_INODE_ERROR_EVENT(xfs_reflink_set_inode_flag_error);
> >  DEFINE_INODE_ERROR_EVENT(xfs_reflink_update_inode_size_error);
> >  DEFINE_INODE_ERROR_EVENT(xfs_reflink_remap_blocks_error);
> >  DEFINE_INODE_ERROR_EVENT(xfs_reflink_remap_extent_error);
> > +DEFINE_INODE_IREC_EVENT(xfs_reflink_remap_extent_src);
> > +DEFINE_INODE_IREC_EVENT(xfs_reflink_remap_extent_dest);
> >  
> >  /* dedupe tracepoints */
> >  DEFINE_DOUBLE_IO_EVENT(xfs_reflink_compare_extents);
> > 
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 0/9] xfs: reflink cleanups
  2020-06-25  1:17 [PATCH v2 0/9] xfs: reflink cleanups Darrick J. Wong
                   ` (8 preceding siblings ...)
  2020-06-25  1:18 ` [PATCH 9/9] xfs: move helpers that lock and unlock " Darrick J. Wong
@ 2020-06-28 12:06 ` Edwin Török
  2020-06-28 22:36   ` Darrick J. Wong
  9 siblings, 1 reply; 31+ messages in thread
From: Edwin Török @ 2020-06-28 12:06 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Brian Foster, linux-xfs

On Wed, 2020-06-24 at 18:17 -0700, Darrick J. Wong wrote:
> Mr. Torok: Could you try applying these patches to a recent kernel to
> see if they fix the fs crash problems you were seeing with
> duperemove,
> please?

Hi,

Thanks for the fix.

I've tested commit e812e6bd89dc6489ca924756635fb81855091700 (reflink-
cleanups_2020-06-24) with my original test, and duperemove has now
finished successfully:

[...]
Comparison of extent info shows a net change in shared extents of:
237116676

There were some hung tasks reported in dmesg, but they recovered
eventually:
[32142.324709] INFO: task pool:5190 blocked for more than 120 seconds.
[32504.821571] INFO: task pool:5191 blocked for more than 120 seconds.
[32625.653640] INFO: task pool:5191 blocked for more than 241 seconds.
[32746.485671] INFO: task pool:5191 blocked for more than 362 seconds.
[32867.318072] INFO: task pool:5191 blocked for more than 483 seconds.
[34196.472677] INFO: task pool:5180 blocked for more than 120 seconds.
[34317.304542] INFO: task pool:5180 blocked for more than 241 seconds.
[34317.304627] INFO: task pool:5195 blocked for more than 120 seconds.
[34438.136740] INFO: task pool:5180 blocked for more than 362 seconds.
[34438.136816] INFO: task pool:5195 blocked for more than 241 seconds.

The blocked tasks were alternating between these 2 stacktraces:
[32142.324715] Call Trace:
[32142.324721]  __schedule+0x2d3/0x770
[32142.324722]  schedule+0x55/0xc0
[32142.324724]  rwsem_down_read_slowpath+0x16c/0x4a0
[32142.324726]  ? __wake_up_common_lock+0x8a/0xc0
[32142.324750]  ? xfs_vn_fiemap+0x32/0x80 [xfs]
[32142.324752]  down_read+0x85/0xa0
[32142.324769]  xfs_ilock+0x8a/0x100 [xfs]
[32142.324784]  xfs_vn_fiemap+0x32/0x80 [xfs]
[32142.324785]  do_vfs_ioctl+0xef/0x680
[32142.324787]  ksys_ioctl+0x73/0xd0
[32142.324788]  __x64_sys_ioctl+0x1a/0x20
[32142.324789]  do_syscall_64+0x49/0xc0
[32142.324790]  entry_SYSCALL_64_after_hwframe+0x44/0xa

[32504.821577] Call Trace:
[32504.821583]  __schedule+0x2d3/0x770
[32504.821584]  schedule+0x55/0xc0
[32504.821587]  rwsem_down_write_slowpath+0x244/0x4d0
[32504.821588]  down_write+0x41/0x50
[32504.821610]  xfs_ilock2_io_mmap+0xc8/0x230 [xfs]
[32504.821628]  ? xfs_reflink_remap_blocks+0x11f/0x2a0 [xfs]
[32504.821643]  xfs_reflink_remap_prep+0x51/0x1f0 [xfs]
[32504.821660]  xfs_file_remap_range+0xbe/0x2f0 [xfs]
[32504.821662]  ? security_capable+0x3d/0x60
[32504.821664]  vfs_dedupe_file_range_one+0x12d/0x150
[32504.821665]  vfs_dedupe_file_range+0x156/0x1e0
[32504.821666]  do_vfs_ioctl+0x4a6/0x680
[32504.821667]  ksys_ioctl+0x73/0xd0
[32504.821668]  __x64_sys_ioctl+0x1a/0x20
[32504.821669]  do_syscall_64+0x49/0xc0
[32504.821670]  entry_SYSCALL_64_after_hwframe+0x44/0xa

Looking at /proc/$(pidof duperemove)/fd it had ~80 files open at one
point, which causes a lot of seeking on a HDD if it was trying to
dedupe them all at once so I'm not too worried about these hung tasks.

Running an xfs_repair after the duperemove finished showed no errors.

> 
> v2: various cleanups suggested by Brian Foster
> 
> If you're going to start using this mess, you probably ought to just
> pull from my git trees, which are linked below.
> 
> This is an extraordinary way to destroy everything.  Enjoy!

To be safe I've created an LVM snapshot before trying it.

Best regards,
--Edwin


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 0/9] xfs: reflink cleanups
  2020-06-28 12:06 ` [PATCH v2 0/9] xfs: reflink cleanups Edwin Török
@ 2020-06-28 22:36   ` Darrick J. Wong
  0 siblings, 0 replies; 31+ messages in thread
From: Darrick J. Wong @ 2020-06-28 22:36 UTC (permalink / raw)
  To: Edwin Török; +Cc: Brian Foster, linux-xfs

On Sun, Jun 28, 2020 at 01:06:08PM +0100, Edwin Török wrote:
> On Wed, 2020-06-24 at 18:17 -0700, Darrick J. Wong wrote:
> > Mr. Torok: Could you try applying these patches to a recent kernel to
> > see if they fix the fs crash problems you were seeing with
> > duperemove,
> > please?
> 
> Hi,
> 
> Thanks for the fix.
> 
> I've tested commit e812e6bd89dc6489ca924756635fb81855091700 (reflink-
> cleanups_2020-06-24) with my original test, and duperemove has now
> finished successfully:
> 
> [...]
> Comparison of extent info shows a net change in shared extents of:
> 237116676
> 
> There were some hung tasks reported in dmesg, but they recovered
> eventually:
> [32142.324709] INFO: task pool:5190 blocked for more than 120 seconds.
> [32504.821571] INFO: task pool:5191 blocked for more than 120 seconds.
> [32625.653640] INFO: task pool:5191 blocked for more than 241 seconds.
> [32746.485671] INFO: task pool:5191 blocked for more than 362 seconds.
> [32867.318072] INFO: task pool:5191 blocked for more than 483 seconds.
> [34196.472677] INFO: task pool:5180 blocked for more than 120 seconds.
> [34317.304542] INFO: task pool:5180 blocked for more than 241 seconds.
> [34317.304627] INFO: task pool:5195 blocked for more than 120 seconds.
> [34438.136740] INFO: task pool:5180 blocked for more than 362 seconds.
> [34438.136816] INFO: task pool:5195 blocked for more than 241 seconds.
> 
> The blocked tasks were alternating between these 2 stacktraces:
> [32142.324715] Call Trace:
> [32142.324721]  __schedule+0x2d3/0x770
> [32142.324722]  schedule+0x55/0xc0
> [32142.324724]  rwsem_down_read_slowpath+0x16c/0x4a0
> [32142.324726]  ? __wake_up_common_lock+0x8a/0xc0
> [32142.324750]  ? xfs_vn_fiemap+0x32/0x80 [xfs]
> [32142.324752]  down_read+0x85/0xa0
> [32142.324769]  xfs_ilock+0x8a/0x100 [xfs]
> [32142.324784]  xfs_vn_fiemap+0x32/0x80 [xfs]
> [32142.324785]  do_vfs_ioctl+0xef/0x680
> [32142.324787]  ksys_ioctl+0x73/0xd0
> [32142.324788]  __x64_sys_ioctl+0x1a/0x20
> [32142.324789]  do_syscall_64+0x49/0xc0
> [32142.324790]  entry_SYSCALL_64_after_hwframe+0x44/0xa
> 
> [32504.821577] Call Trace:
> [32504.821583]  __schedule+0x2d3/0x770
> [32504.821584]  schedule+0x55/0xc0
> [32504.821587]  rwsem_down_write_slowpath+0x244/0x4d0
> [32504.821588]  down_write+0x41/0x50
> [32504.821610]  xfs_ilock2_io_mmap+0xc8/0x230 [xfs]
> [32504.821628]  ? xfs_reflink_remap_blocks+0x11f/0x2a0 [xfs]
> [32504.821643]  xfs_reflink_remap_prep+0x51/0x1f0 [xfs]
> [32504.821660]  xfs_file_remap_range+0xbe/0x2f0 [xfs]
> [32504.821662]  ? security_capable+0x3d/0x60
> [32504.821664]  vfs_dedupe_file_range_one+0x12d/0x150
> [32504.821665]  vfs_dedupe_file_range+0x156/0x1e0
> [32504.821666]  do_vfs_ioctl+0x4a6/0x680
> [32504.821667]  ksys_ioctl+0x73/0xd0
> [32504.821668]  __x64_sys_ioctl+0x1a/0x20
> [32504.821669]  do_syscall_64+0x49/0xc0
> [32504.821670]  entry_SYSCALL_64_after_hwframe+0x44/0xa
> 
> Looking at /proc/$(pidof duperemove)/fd it had ~80 files open at one
> point, which causes a lot of seeking on a HDD if it was trying to
> dedupe them all at once so I'm not too worried about these hung tasks.
> 
> Running an xfs_repair after the duperemove finished showed no errors.
> 
> > 
> > v2: various cleanups suggested by Brian Foster
> > 
> > If you're going to start using this mess, you probably ought to just
> > pull from my git trees, which are linked below.
> > 
> > This is an extraordinary way to destroy everything.  Enjoy!
> 
> To be safe I've created an LVM snapshot before trying it.

Yay!  Thank you for testing this out!  Sorry it broke in the first
place, though. :/

--D

> Best regards,
> --Edwin
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/9] xfs: fix reflink quota reservation accounting error
  2020-06-25  1:17 ` [PATCH 1/9] xfs: fix reflink quota reservation accounting error Darrick J. Wong
  2020-06-25 12:26   ` Brian Foster
@ 2020-07-01  8:07   ` Christoph Hellwig
  1 sibling, 0 replies; 31+ messages in thread
From: Christoph Hellwig @ 2020-07-01  8:07 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, edwin

On Wed, Jun 24, 2020 at 06:17:45PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Quota reservations are supposed to account for the blocks that might be
> allocated due to a bmap btree split.  Reflink doesn't do this, so fix
> this to make the quota accounting more accurate before we start
> rearranging things.
> 
> Fixes: 862bb360ef56 ("xfs: reflink extents from one file to another")
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/9] xfs: rename xfs_bmap_is_real_extent to is_written_extent
  2020-06-25  1:17 ` [PATCH 2/9] xfs: rename xfs_bmap_is_real_extent to is_written_extent Darrick J. Wong
  2020-06-25 12:26   ` Brian Foster
@ 2020-07-01  8:08   ` Christoph Hellwig
  1 sibling, 0 replies; 31+ messages in thread
From: Christoph Hellwig @ 2020-07-01  8:08 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, edwin

On Wed, Jun 24, 2020 at 06:17:52PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> The name of this predicate is a little misleading -- it decides if the
> extent mapping is allocated and written.  Change the name to be more
> direct, as we're going to add a new predicate in the next patch.

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v4.2 3/9] xfs: redesign the reflink remap loop to fix blkres depletion crash
  2020-06-25 17:23   ` [PATCH v4.2 " Darrick J. Wong
  2020-06-26 11:57     ` Brian Foster
@ 2020-07-01  8:20     ` Christoph Hellwig
  1 sibling, 0 replies; 31+ messages in thread
From: Christoph Hellwig @ 2020-07-01  8:20 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, edwin, Brian Foster

On Thu, Jun 25, 2020 at 10:23:10AM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> The existing reflink remapping loop has some structural problems that
> need addressing:
> 
> The biggest problem is that we create one transaction for each extent in
> the source file without accounting for the number of mappings there are
> for the same range in the destination file.  In other words, we don't
> know the number of remap operations that will be necessary and we
> therefore cannot guess the block reservation required.  On highly
> fragmented filesystems (e.g. ones with active dedupe) we guess wrong,
> run out of block reservation, and fail.
> 
> The second problem is that we don't actually use the bmap intents to
> their full potential -- instead of calling bunmapi directly and having
> to deal with its backwards operation, we could call the deferred ops
> xfs_bmap_unmap_extent and xfs_refcount_decrease_extent instead.  This
> makes the frontend loop much simpler.
> 
> Solve all of these problems by refactoring the remapping loops so that
> we only perform one remapping operation per transaction, and each
> operation only tries to remap a single extent from source to dest.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
> v4.2: move qres conditional to the next patch, rename bmap helper, try
> to clear up some of the smap/dmap confusion
> ---
>  fs/xfs/libxfs/xfs_bmap.h |   13 ++
>  fs/xfs/xfs_reflink.c     |  240 +++++++++++++++++++++++++---------------------
>  fs/xfs/xfs_trace.h       |   52 +---------
>  3 files changed, 142 insertions(+), 163 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
> index 2b18338d0643..e1bd484e5548 100644
> --- a/fs/xfs/libxfs/xfs_bmap.h
> +++ b/fs/xfs/libxfs/xfs_bmap.h
> @@ -158,6 +158,13 @@ static inline int xfs_bmapi_whichfork(int bmapi_flags)
>  	{ BMAP_ATTRFORK,	"ATTR" }, \
>  	{ BMAP_COWFORK,		"COW" }
>  
> +/* Return true if the extent is an allocated extent, written or not. */
> +static inline bool xfs_bmap_is_real_extent(struct xfs_bmbt_irec *irec)
> +{
> +	return irec->br_startblock != HOLESTARTBLOCK &&
> +		irec->br_startblock != DELAYSTARTBLOCK &&
> +		!isnullstartblock(irec->br_startblock);
> +}

I think reusing the previous name is a little dangerous.  Maybe rename
this to ..._is_allocated_extent_ ?

>  
>  /*
>   * Return true if the extent is a real, allocated extent, or false if it is  a

And not for the previous one: if the real goes away in the name, it
should probably be updated here as well.

The rest looks fine:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/9] xfs: only reserve quota blocks for bmbt changes if we're changing the data fork
  2020-06-25  1:18 ` [PATCH 4/9] xfs: only reserve quota blocks for bmbt changes if we're changing the data fork Darrick J. Wong
  2020-06-25 12:27   ` Brian Foster
@ 2020-07-01  8:21   ` Christoph Hellwig
  1 sibling, 0 replies; 31+ messages in thread
From: Christoph Hellwig @ 2020-07-01  8:21 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, edwin

On Wed, Jun 24, 2020 at 06:18:06PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Now that we've reworked xfs_reflink_remap_extent to remap only one
> extent per transaction, we actually know if the extent being removed is
> an allocated mapping.  This means that we now know ahead of time if
> we're going to be touching the data fork.
> 
> Since we only need blocks for a bmbt split if we're going to update the
> data fork, we only need to get quota reservation if we know we're going
> to touch the data fork.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 5/9] xfs: only reserve quota blocks if we're mapping into a hole
  2020-06-25  1:18 ` [PATCH 5/9] xfs: only reserve quota blocks if we're mapping into a hole Darrick J. Wong
  2020-06-25 12:28   ` Brian Foster
@ 2020-07-01  8:22   ` Christoph Hellwig
  1 sibling, 0 replies; 31+ messages in thread
From: Christoph Hellwig @ 2020-07-01  8:22 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, edwin

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 7/9] xfs: fix xfs_reflink_remap_prep calling conventions
  2020-06-25  1:18 ` [PATCH 7/9] xfs: fix xfs_reflink_remap_prep calling conventions Darrick J. Wong
@ 2020-07-01  8:23   ` Christoph Hellwig
  0 siblings, 0 replies; 31+ messages in thread
From: Christoph Hellwig @ 2020-07-01  8:23 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Brian Foster, linux-xfs, edwin

On Wed, Jun 24, 2020 at 06:18:24PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Fix the return value of xfs_reflink_remap_prep so that its return value
> conventions match the rest of xfs.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> Reviewed-by: Brian Foster <bfoster@redhat.com>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 8/9] xfs: refactor locking and unlocking two inodes against userspace IO
  2020-06-25  1:18 ` [PATCH 8/9] xfs: refactor locking and unlocking two inodes against userspace IO Darrick J. Wong
@ 2020-07-01  8:26   ` Christoph Hellwig
  0 siblings, 0 replies; 31+ messages in thread
From: Christoph Hellwig @ 2020-07-01  8:26 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Brian Foster, linux-xfs, edwin

On Wed, Jun 24, 2020 at 06:18:31PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Refactor the two functions that we use to lock and unlock two inodes to
> block userspace from initiating IO against a file, whether via system
> calls or mmap activity.

This seems to miss an explanation of what is actually changed.  From
inspect it passes inodes instead of files to the unlock helper, and
adds a lock two inodes helper, which looks fine to me.

> --- a/fs/xfs/xfs_reflink.h
> +++ b/fs/xfs/xfs_reflink.h
> @@ -56,7 +56,6 @@ extern int xfs_reflink_remap_blocks(struct xfs_inode *src, loff_t pos_in,
>  		loff_t *remapped);
>  extern int xfs_reflink_update_dest(struct xfs_inode *dest, xfs_off_t newlen,
>  		xfs_extlen_t cowextsize, unsigned int remap_flags);
> -extern void xfs_reflink_remap_unlock(struct file *file_in,
> -		struct file *file_out);
> +extern void xfs_reflink_remap_unlock(struct xfs_inode *ip1, struct xfs_inode *ip2);

This adds an overly long line.  Trivially fixed by dropping the
pointless extern..

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 9/9] xfs: move helpers that lock and unlock two inodes against userspace IO
  2020-06-25  1:18 ` [PATCH 9/9] xfs: move helpers that lock and unlock " Darrick J. Wong
@ 2020-07-01  8:27   ` Christoph Hellwig
  0 siblings, 0 replies; 31+ messages in thread
From: Christoph Hellwig @ 2020-07-01  8:27 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Brian Foster, linux-xfs, edwin

On Wed, Jun 24, 2020 at 06:18:37PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Move the double-inode locking helpers to xfs_inode.c since they're not
> specific to reflink.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> Reviewed-by: Brian Foster <bfoster@redhat.com>

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2020-07-01  8:27 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-25  1:17 [PATCH v2 0/9] xfs: reflink cleanups Darrick J. Wong
2020-06-25  1:17 ` [PATCH 1/9] xfs: fix reflink quota reservation accounting error Darrick J. Wong
2020-06-25 12:26   ` Brian Foster
2020-07-01  8:07   ` Christoph Hellwig
2020-06-25  1:17 ` [PATCH 2/9] xfs: rename xfs_bmap_is_real_extent to is_written_extent Darrick J. Wong
2020-06-25 12:26   ` Brian Foster
2020-07-01  8:08   ` Christoph Hellwig
2020-06-25  1:17 ` [PATCH 3/9] xfs: redesign the reflink remap loop to fix blkres depletion crash Darrick J. Wong
2020-06-25 12:27   ` Brian Foster
2020-06-25 16:57     ` Darrick J. Wong
2020-06-25 17:23   ` [PATCH v4.2 " Darrick J. Wong
2020-06-26 11:57     ` Brian Foster
2020-06-26 16:40       ` Darrick J. Wong
2020-07-01  8:20     ` Christoph Hellwig
2020-06-25  1:18 ` [PATCH 4/9] xfs: only reserve quota blocks for bmbt changes if we're changing the data fork Darrick J. Wong
2020-06-25 12:27   ` Brian Foster
2020-07-01  8:21   ` Christoph Hellwig
2020-06-25  1:18 ` [PATCH 5/9] xfs: only reserve quota blocks if we're mapping into a hole Darrick J. Wong
2020-06-25 12:28   ` Brian Foster
2020-07-01  8:22   ` Christoph Hellwig
2020-06-25  1:18 ` [PATCH 6/9] xfs: reflink can skip remap existing mappings Darrick J. Wong
2020-06-25 12:28   ` Brian Foster
2020-06-25 16:49     ` Darrick J. Wong
2020-06-25  1:18 ` [PATCH 7/9] xfs: fix xfs_reflink_remap_prep calling conventions Darrick J. Wong
2020-07-01  8:23   ` Christoph Hellwig
2020-06-25  1:18 ` [PATCH 8/9] xfs: refactor locking and unlocking two inodes against userspace IO Darrick J. Wong
2020-07-01  8:26   ` Christoph Hellwig
2020-06-25  1:18 ` [PATCH 9/9] xfs: move helpers that lock and unlock " Darrick J. Wong
2020-07-01  8:27   ` Christoph Hellwig
2020-06-28 12:06 ` [PATCH v2 0/9] xfs: reflink cleanups Edwin Török
2020-06-28 22:36   ` Darrick J. Wong

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.