All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/3] xfs: reflink cleanups
@ 2020-06-23  4:01 Darrick J. Wong
  2020-06-23  4:01 ` [PATCH 1/3] xfs: redesign the reflink remap loop to fix blkres depletion crash Darrick J. Wong
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Darrick J. Wong @ 2020-06-23  4:01 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs, edwin

Hi all,

Here are a few patches cleaning up some problems with reflink.  The
first patch is the largest -- it reorganizes the remapping loop so that
instead of running one transaction for each extent in the source file
regardless of what's in the destination file, we look at both files to
find the longest extent we can swap in one go, and run a transaction for
just that piece.  This fixes a problem of running out of block
reservation when the filesystem is very fragmented.

The second patch fixes some goofiness in the reflink prep function, and
the third patch moves the "lock two inodes" code into xfs_inode.c since
none of that is related to reflink.

Mr. Torok: Could you try applying these patches to a recent kernel to
see if they fix the fs crash problems you were seeing with duperemove,
please?

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=reflink-cleanups
---
 fs/xfs/libxfs/xfs_bmap.h |   13 +-
 fs/xfs/xfs_file.c        |    4 -
 fs/xfs/xfs_inode.c       |   93 +++++++++++++
 fs/xfs/xfs_inode.h       |    3 
 fs/xfs/xfs_reflink.c     |  325 +++++++++++++++++++---------------------------
 fs/xfs/xfs_reflink.h     |    2 
 fs/xfs/xfs_trace.h       |   52 -------
 fs/xfs/xfs_trans_dquot.c |   13 +-
 8 files changed, 252 insertions(+), 253 deletions(-)


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 1/3] xfs: redesign the reflink remap loop to fix blkres depletion crash
  2020-06-23  4:01 [PATCH 0/3] xfs: reflink cleanups Darrick J. Wong
@ 2020-06-23  4:01 ` Darrick J. Wong
  2020-06-24 12:38   ` Brian Foster
  2020-06-23  4:01 ` [PATCH 2/3] xfs: fix xfs_reflink_remap_prep calling conventions Darrick J. Wong
  2020-06-23  4:01 ` [PATCH 3/3] xfs: refactor locking and unlocking two inodes against userspace IO Darrick J. Wong
  2 siblings, 1 reply; 12+ messages in thread
From: Darrick J. Wong @ 2020-06-23  4:01 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs, edwin

From: Darrick J. Wong <darrick.wong@oracle.com>

The existing reflink remapping loop has some structural problems that
need addressing:

The biggest problem is that we create one transaction for each extent in
the source file without accounting for the number of mappings there are
for the same range in the destination file.  In other words, we don't
know the number of remap operations that will be necessary and we
therefore cannot guess the block reservation required.  On highly
fragmented filesystems (e.g. ones with active dedupe) we guess wrong,
run out of block reservation, and fail.

The second problem is that we don't actually use the bmap intents to
their full potential -- instead of calling bunmapi directly and having
to deal with its backwards operation, we could call the deferred ops
xfs_bmap_unmap_extent and xfs_refcount_decrease_extent instead.  This
makes the frontend loop much simpler.

A third (and more minor) problem is that we aren't even clever enough to
skip the whole remapping thing if the source and destination ranges
point to the same physical extents.

Solve all of these problems by refactoring the remapping loops so that
we only perform one remapping operation per transaction, and each
operation only tries to remap a single extent from source to dest.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_bmap.h |   13 ++-
 fs/xfs/xfs_reflink.c     |  234 ++++++++++++++++++++++++++--------------------
 fs/xfs/xfs_trace.h       |   52 +---------
 fs/xfs/xfs_trans_dquot.c |   13 +--
 4 files changed, 149 insertions(+), 163 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index 6028a3c825ba..3498b4f75f71 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -158,6 +158,13 @@ static inline int xfs_bmapi_whichfork(int bmapi_flags)
 	{ BMAP_ATTRFORK,	"ATTR" }, \
 	{ BMAP_COWFORK,		"COW" }
 
+/* Return true if the extent is an allocated extent, written or not. */
+static inline bool xfs_bmap_is_mapped_extent(struct xfs_bmbt_irec *irec)
+{
+	return irec->br_startblock != HOLESTARTBLOCK &&
+		irec->br_startblock != DELAYSTARTBLOCK &&
+		!isnullstartblock(irec->br_startblock);
+}
 
 /*
  * Return true if the extent is a real, allocated extent, or false if it is  a
@@ -165,10 +172,8 @@ static inline int xfs_bmapi_whichfork(int bmapi_flags)
  */
 static inline bool xfs_bmap_is_real_extent(struct xfs_bmbt_irec *irec)
 {
-	return irec->br_state != XFS_EXT_UNWRITTEN &&
-		irec->br_startblock != HOLESTARTBLOCK &&
-		irec->br_startblock != DELAYSTARTBLOCK &&
-		!isnullstartblock(irec->br_startblock);
+	return xfs_bmap_is_mapped_extent(irec) &&
+	       irec->br_state != XFS_EXT_UNWRITTEN;
 }
 
 /*
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 107bf2a2f344..f50a8c2f21a5 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -984,40 +984,27 @@ xfs_reflink_ag_has_free_space(
 }
 
 /*
- * Unmap a range of blocks from a file, then map other blocks into the hole.
- * The range to unmap is (destoff : destoff + srcioff + irec->br_blockcount).
- * The extent irec is mapped into dest at irec->br_startoff.
+ * Remap the given extent into the file at the given offset.  The imap
+ * blockcount will be set to the number of blocks that were actually remapped.
  */
 STATIC int
 xfs_reflink_remap_extent(
 	struct xfs_inode	*ip,
-	struct xfs_bmbt_irec	*irec,
-	xfs_fileoff_t		destoff,
+	struct xfs_bmbt_irec	*imap,
 	xfs_off_t		new_isize)
 {
+	struct xfs_bmbt_irec	imap2;
 	struct xfs_mount	*mp = ip->i_mount;
-	bool			real_extent = xfs_bmap_is_real_extent(irec);
 	struct xfs_trans	*tp;
-	unsigned int		resblks;
-	struct xfs_bmbt_irec	uirec;
-	xfs_filblks_t		rlen;
-	xfs_filblks_t		unmap_len;
 	xfs_off_t		newlen;
+	int64_t			ip_delta = 0;
+	unsigned int		resblks;
+	bool			real_extent = xfs_bmap_is_real_extent(imap);
+	int			nimaps;
 	int			error;
 
-	unmap_len = irec->br_startoff + irec->br_blockcount - destoff;
-	trace_xfs_reflink_punch_range(ip, destoff, unmap_len);
-
-	/* No reflinking if we're low on space */
-	if (real_extent) {
-		error = xfs_reflink_ag_has_free_space(mp,
-				XFS_FSB_TO_AGNO(mp, irec->br_startblock));
-		if (error)
-			goto out;
-	}
-
 	/* Start a rolling transaction to switch the mappings */
-	resblks = XFS_EXTENTADD_SPACE_RES(ip->i_mount, XFS_DATA_FORK);
+	resblks = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
 	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0, 0, &tp);
 	if (error)
 		goto out;
@@ -1025,87 +1012,118 @@ xfs_reflink_remap_extent(
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
 	xfs_trans_ijoin(tp, ip, 0);
 
-	/* If we're not just clearing space, then do we have enough quota? */
+	/* Read extent from the second file */
+	nimaps = 1;
+	error = xfs_bmapi_read(ip, imap->br_startoff, imap->br_blockcount,
+			&imap2, &nimaps, 0);
+	if (error)
+		goto out_cancel;
+#ifdef DEBUG
+	if (nimaps != 1 ||
+	    imap2.br_startoff != imap->br_startoff) {
+		/*
+		 * We should never get no mapping or something that doesn't
+		 * match what we asked for.
+		 */
+		ASSERT(0);
+		error = -EINVAL;
+		goto out_cancel;
+	}
+#endif
+
+	/*
+	 * We can only remap as many blocks as the smaller of the two extent
+	 * maps, because we can only remap one extent at a time.
+	 */
+	imap->br_blockcount = min(imap->br_blockcount, imap2.br_blockcount);
+
+	trace_xfs_reflink_remap_extent2(ip, &imap2);
+
+	/*
+	 * Two extents mapped to the same physical block must not have
+	 * different states; that's filesystem corruption.  Move on to the next
+	 * extent if they're both holes or both the same physical extent.
+	 */
+	if (imap->br_startblock == imap2.br_startblock) {
+		if (imap->br_state != imap2.br_state)
+			error = -EFSCORRUPTED;
+		goto out_cancel;
+	}
+
 	if (real_extent) {
+		/* No reflinking if we're low on space */
+		error = xfs_reflink_ag_has_free_space(mp,
+				XFS_FSB_TO_AGNO(mp, imap->br_startblock));
+		if (error)
+			goto out_cancel;
+
+
+		/* Do we have enough quota? */
 		error = xfs_trans_reserve_quota_nblks(tp, ip,
-				irec->br_blockcount, 0, XFS_QMOPT_RES_REGBLKS);
+				imap->br_blockcount, 0, XFS_QMOPT_RES_REGBLKS);
 		if (error)
 			goto out_cancel;
 	}
 
-	trace_xfs_reflink_remap(ip, irec->br_startoff,
-				irec->br_blockcount, irec->br_startblock);
-
-	/* Unmap the old blocks in the data fork. */
-	rlen = unmap_len;
-	while (rlen) {
-		ASSERT(tp->t_firstblock == NULLFSBLOCK);
-		error = __xfs_bunmapi(tp, ip, destoff, &rlen, 0, 1);
-		if (error)
-			goto out_cancel;
-
+	if (xfs_bmap_is_mapped_extent(&imap2)) {
 		/*
-		 * Trim the extent to whatever got unmapped.
-		 * Remember, bunmapi works backwards.
+		 * If the extent we're unmapping is backed by storage (written
+		 * or not), unmap the extent and drop its refcount.
 		 */
-		uirec.br_startblock = irec->br_startblock + rlen;
-		uirec.br_startoff = irec->br_startoff + rlen;
-		uirec.br_blockcount = unmap_len - rlen;
-		uirec.br_state = irec->br_state;
-		unmap_len = rlen;
+		xfs_bmap_unmap_extent(tp, ip, &imap2);
+		xfs_refcount_decrease_extent(tp, &imap2);
+		ip_delta -= imap2.br_blockcount;
+	} else if (imap2.br_startblock == DELAYSTARTBLOCK) {
+		xfs_filblks_t	len = imap2.br_blockcount;
 
-		/* If this isn't a real mapping, we're done. */
-		if (!real_extent || uirec.br_blockcount == 0)
-			goto next_extent;
-
-		trace_xfs_reflink_remap(ip, uirec.br_startoff,
-				uirec.br_blockcount, uirec.br_startblock);
-
-		/* Update the refcount tree */
-		xfs_refcount_increase_extent(tp, &uirec);
-
-		/* Map the new blocks into the data fork. */
-		xfs_bmap_map_extent(tp, ip, &uirec);
+		/*
+		 * If the extent we're unmapping is a delalloc reservation,
+		 * we can use the regular bunmapi function to release the
+		 * incore state.  Dropping the delalloc reservation takes care
+		 * of the quota reservation for us.
+		 */
+		error = __xfs_bunmapi(NULL, ip, imap2.br_startoff, &len, 0, 1);
+		if (error)
+			goto out_cancel;
+		ASSERT(len == 0);
+	}
 
-		/* Update quota accounting. */
-		xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_BCOUNT,
-				uirec.br_blockcount);
+	/*
+	 * If the extent we're sharing is backed by written storage, increase
+	 * its refcount and map it into the file.
+	 */
+	if (real_extent) {
+		xfs_refcount_increase_extent(tp, imap);
+		xfs_bmap_map_extent(tp, ip, imap);
+		ip_delta += imap->br_blockcount;
+	}
 
-		/* Update dest isize if needed. */
-		newlen = XFS_FSB_TO_B(mp,
-				uirec.br_startoff + uirec.br_blockcount);
-		newlen = min_t(xfs_off_t, newlen, new_isize);
-		if (newlen > i_size_read(VFS_I(ip))) {
-			trace_xfs_reflink_update_inode_size(ip, newlen);
-			i_size_write(VFS_I(ip), newlen);
-			ip->i_d.di_size = newlen;
-			xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
-		}
+	xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_BCOUNT, ip_delta);
 
-next_extent:
-		/* Process all the deferred stuff. */
-		error = xfs_defer_finish(&tp);
-		if (error)
-			goto out_cancel;
+	/* Update dest isize if needed. */
+	newlen = XFS_FSB_TO_B(mp, imap2.br_startoff + imap2.br_blockcount);
+	newlen = min_t(xfs_off_t, newlen, new_isize);
+	if (newlen > i_size_read(VFS_I(ip))) {
+		trace_xfs_reflink_update_inode_size(ip, newlen);
+		i_size_write(VFS_I(ip), newlen);
+		ip->i_d.di_size = newlen;
+		xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
 	}
 
+	/* Commit everything and unlock. */
 	error = xfs_trans_commit(tp);
-	xfs_iunlock(ip, XFS_ILOCK_EXCL);
-	if (error)
-		goto out;
-	return 0;
+	goto out_unlock;
 
 out_cancel:
 	xfs_trans_cancel(tp);
+out_unlock:
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 out:
 	trace_xfs_reflink_remap_extent_error(ip, error, _RET_IP_);
 	return error;
 }
 
-/*
- * Iteratively remap one file's extents (and holes) to another's.
- */
+/* Remap a range of one file to the other. */
 int
 xfs_reflink_remap_blocks(
 	struct xfs_inode	*src,
@@ -1116,25 +1134,22 @@ xfs_reflink_remap_blocks(
 	loff_t			*remapped)
 {
 	struct xfs_bmbt_irec	imap;
-	xfs_fileoff_t		srcoff;
-	xfs_fileoff_t		destoff;
+	struct xfs_mount	*mp = src->i_mount;
+	xfs_fileoff_t		srcoff = XFS_B_TO_FSBT(mp, pos_in);
+	xfs_fileoff_t		destoff = XFS_B_TO_FSBT(mp, pos_out);
 	xfs_filblks_t		len;
-	xfs_filblks_t		range_len;
 	xfs_filblks_t		remapped_len = 0;
 	xfs_off_t		new_isize = pos_out + remap_len;
 	int			nimaps;
 	int			error = 0;
 
-	destoff = XFS_B_TO_FSBT(src->i_mount, pos_out);
-	srcoff = XFS_B_TO_FSBT(src->i_mount, pos_in);
-	len = XFS_B_TO_FSB(src->i_mount, remap_len);
+	len = min_t(xfs_filblks_t, XFS_B_TO_FSB(mp, remap_len),
+			XFS_MAX_FILEOFF);
 
-	/* drange = (destoff, destoff + len); srange = (srcoff, srcoff + len) */
-	while (len) {
-		uint		lock_mode;
+	trace_xfs_reflink_remap_blocks(src, srcoff, len, dest, destoff);
 
-		trace_xfs_reflink_remap_blocks_loop(src, srcoff, len,
-				dest, destoff);
+	while (len > 0) {
+		unsigned int	lock_mode;
 
 		/* Read extent from the source file */
 		nimaps = 1;
@@ -1143,18 +1158,29 @@ xfs_reflink_remap_blocks(
 		xfs_iunlock(src, lock_mode);
 		if (error)
 			break;
-		ASSERT(nimaps == 1);
+#ifdef DEBUG
+		if (nimaps != 1 ||
+		    imap.br_startblock == DELAYSTARTBLOCK ||
+		    imap.br_startoff != srcoff) {
+			/*
+			 * We should never get no mapping or a delalloc extent
+			 * or something that doesn't match what we asked for,
+			 * since the caller flushed the source file data and
+			 * we hold the source file io/mmap locks.
+			 */
+			ASSERT(nimaps == 0);
+			ASSERT(imap.br_startblock != DELAYSTARTBLOCK);
+			ASSERT(imap.br_startoff == srcoff);
+			error = -EINVAL;
+			break;
+		}
+#endif
 
-		trace_xfs_reflink_remap_imap(src, srcoff, len, XFS_DATA_FORK,
-				&imap);
+		trace_xfs_reflink_remap_extent1(src, &imap);
 
-		/* Translate imap into the destination file. */
-		range_len = imap.br_startoff + imap.br_blockcount - srcoff;
-		imap.br_startoff += destoff - srcoff;
-
-		/* Clear dest from destoff to the end of imap and map it in. */
-		error = xfs_reflink_remap_extent(dest, &imap, destoff,
-				new_isize);
+		/* Remap into the destination file at the given offset. */
+		imap.br_startoff = destoff;
+		error = xfs_reflink_remap_extent(dest, &imap, new_isize);
 		if (error)
 			break;
 
@@ -1164,10 +1190,10 @@ xfs_reflink_remap_blocks(
 		}
 
 		/* Advance drange/srange */
-		srcoff += range_len;
-		destoff += range_len;
-		len -= range_len;
-		remapped_len += range_len;
+		srcoff += imap.br_blockcount;
+		destoff += imap.br_blockcount;
+		len -= imap.br_blockcount;
+		remapped_len += imap.br_blockcount;
 	}
 
 	if (error)
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 460136628a79..f205602c8ba9 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -3052,8 +3052,7 @@ DEFINE_EVENT(xfs_inode_irec_class, name, \
 DEFINE_INODE_EVENT(xfs_reflink_set_inode_flag);
 DEFINE_INODE_EVENT(xfs_reflink_unset_inode_flag);
 DEFINE_ITRUNC_EVENT(xfs_reflink_update_inode_size);
-DEFINE_IMAP_EVENT(xfs_reflink_remap_imap);
-TRACE_EVENT(xfs_reflink_remap_blocks_loop,
+TRACE_EVENT(xfs_reflink_remap_blocks,
 	TP_PROTO(struct xfs_inode *src, xfs_fileoff_t soffset,
 		 xfs_filblks_t len, struct xfs_inode *dest,
 		 xfs_fileoff_t doffset),
@@ -3084,59 +3083,14 @@ TRACE_EVENT(xfs_reflink_remap_blocks_loop,
 		  __entry->dest_ino,
 		  __entry->dest_lblk)
 );
-TRACE_EVENT(xfs_reflink_punch_range,
-	TP_PROTO(struct xfs_inode *ip, xfs_fileoff_t lblk,
-		 xfs_extlen_t len),
-	TP_ARGS(ip, lblk, len),
-	TP_STRUCT__entry(
-		__field(dev_t, dev)
-		__field(xfs_ino_t, ino)
-		__field(xfs_fileoff_t, lblk)
-		__field(xfs_extlen_t, len)
-	),
-	TP_fast_assign(
-		__entry->dev = VFS_I(ip)->i_sb->s_dev;
-		__entry->ino = ip->i_ino;
-		__entry->lblk = lblk;
-		__entry->len = len;
-	),
-	TP_printk("dev %d:%d ino 0x%llx lblk 0x%llx len 0x%x",
-		  MAJOR(__entry->dev), MINOR(__entry->dev),
-		  __entry->ino,
-		  __entry->lblk,
-		  __entry->len)
-);
-TRACE_EVENT(xfs_reflink_remap,
-	TP_PROTO(struct xfs_inode *ip, xfs_fileoff_t lblk,
-		 xfs_extlen_t len, xfs_fsblock_t new_pblk),
-	TP_ARGS(ip, lblk, len, new_pblk),
-	TP_STRUCT__entry(
-		__field(dev_t, dev)
-		__field(xfs_ino_t, ino)
-		__field(xfs_fileoff_t, lblk)
-		__field(xfs_extlen_t, len)
-		__field(xfs_fsblock_t, new_pblk)
-	),
-	TP_fast_assign(
-		__entry->dev = VFS_I(ip)->i_sb->s_dev;
-		__entry->ino = ip->i_ino;
-		__entry->lblk = lblk;
-		__entry->len = len;
-		__entry->new_pblk = new_pblk;
-	),
-	TP_printk("dev %d:%d ino 0x%llx lblk 0x%llx len 0x%x new_pblk %llu",
-		  MAJOR(__entry->dev), MINOR(__entry->dev),
-		  __entry->ino,
-		  __entry->lblk,
-		  __entry->len,
-		  __entry->new_pblk)
-);
 DEFINE_DOUBLE_IO_EVENT(xfs_reflink_remap_range);
 DEFINE_INODE_ERROR_EVENT(xfs_reflink_remap_range_error);
 DEFINE_INODE_ERROR_EVENT(xfs_reflink_set_inode_flag_error);
 DEFINE_INODE_ERROR_EVENT(xfs_reflink_update_inode_size_error);
 DEFINE_INODE_ERROR_EVENT(xfs_reflink_remap_blocks_error);
 DEFINE_INODE_ERROR_EVENT(xfs_reflink_remap_extent_error);
+DEFINE_INODE_IREC_EVENT(xfs_reflink_remap_extent1);
+DEFINE_INODE_IREC_EVENT(xfs_reflink_remap_extent2);
 
 /* dedupe tracepoints */
 DEFINE_DOUBLE_IO_EVENT(xfs_reflink_compare_extents);
diff --git a/fs/xfs/xfs_trans_dquot.c b/fs/xfs/xfs_trans_dquot.c
index c0f73b82c055..99511ff6222f 100644
--- a/fs/xfs/xfs_trans_dquot.c
+++ b/fs/xfs/xfs_trans_dquot.c
@@ -124,16 +124,17 @@ xfs_trans_dup_dqinfo(
  */
 void
 xfs_trans_mod_dquot_byino(
-	xfs_trans_t	*tp,
-	xfs_inode_t	*ip,
-	uint		field,
-	int64_t		delta)
+	struct xfs_trans	*tp,
+	struct xfs_inode	*ip,
+	uint			field,
+	int64_t			delta)
 {
-	xfs_mount_t	*mp = tp->t_mountp;
+	struct xfs_mount	*mp = tp->t_mountp;
 
 	if (!XFS_IS_QUOTA_RUNNING(mp) ||
 	    !XFS_IS_QUOTA_ON(mp) ||
-	    xfs_is_quota_inode(&mp->m_sb, ip->i_ino))
+	    xfs_is_quota_inode(&mp->m_sb, ip->i_ino) ||
+	    delta == 0)
 		return;
 
 	if (tp->t_dqinfo == NULL)


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 2/3] xfs: fix xfs_reflink_remap_prep calling conventions
  2020-06-23  4:01 [PATCH 0/3] xfs: reflink cleanups Darrick J. Wong
  2020-06-23  4:01 ` [PATCH 1/3] xfs: redesign the reflink remap loop to fix blkres depletion crash Darrick J. Wong
@ 2020-06-23  4:01 ` Darrick J. Wong
  2020-06-24 12:38   ` Brian Foster
  2020-06-23  4:01 ` [PATCH 3/3] xfs: refactor locking and unlocking two inodes against userspace IO Darrick J. Wong
  2 siblings, 1 reply; 12+ messages in thread
From: Darrick J. Wong @ 2020-06-23  4:01 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs, edwin

From: Darrick J. Wong <darrick.wong@oracle.com>

Fix the return value of xfs_reflink_remap_prep so that its return value
conventions match the rest of xfs.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_file.c    |    2 +-
 fs/xfs/xfs_reflink.c |    6 +++---
 2 files changed, 4 insertions(+), 4 deletions(-)


diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 00db81eac80d..b375fae811f2 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1035,7 +1035,7 @@ xfs_file_remap_range(
 	/* Prepare and then clone file data. */
 	ret = xfs_reflink_remap_prep(file_in, pos_in, file_out, pos_out,
 			&len, remap_flags);
-	if (ret < 0 || len == 0)
+	if (ret || len == 0)
 		return ret;
 
 	trace_xfs_reflink_remap_range(src, pos_in, len, dest, pos_out);
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index f50a8c2f21a5..dd9ed7d5694d 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1341,7 +1341,7 @@ xfs_reflink_remap_prep(
 	struct inode		*inode_out = file_inode(file_out);
 	struct xfs_inode	*dest = XFS_I(inode_out);
 	bool			same_inode = (inode_in == inode_out);
-	ssize_t			ret;
+	int			ret;
 
 	/* Lock both files against IO */
 	ret = xfs_iolock_two_inodes_and_break_layout(inode_in, inode_out);
@@ -1365,7 +1365,7 @@ xfs_reflink_remap_prep(
 
 	ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out,
 			len, remap_flags);
-	if (ret < 0 || *len == 0)
+	if (ret || *len == 0)
 		goto out_unlock;
 
 	/* Attach dquots to dest inode before changing block map */
@@ -1400,7 +1400,7 @@ xfs_reflink_remap_prep(
 	if (ret)
 		goto out_unlock;
 
-	return 1;
+	return 0;
 out_unlock:
 	xfs_reflink_remap_unlock(file_in, file_out);
 	return ret;


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 3/3] xfs: refactor locking and unlocking two inodes against userspace IO
  2020-06-23  4:01 [PATCH 0/3] xfs: reflink cleanups Darrick J. Wong
  2020-06-23  4:01 ` [PATCH 1/3] xfs: redesign the reflink remap loop to fix blkres depletion crash Darrick J. Wong
  2020-06-23  4:01 ` [PATCH 2/3] xfs: fix xfs_reflink_remap_prep calling conventions Darrick J. Wong
@ 2020-06-23  4:01 ` Darrick J. Wong
  2020-06-24 12:39   ` Brian Foster
  2 siblings, 1 reply; 12+ messages in thread
From: Darrick J. Wong @ 2020-06-23  4:01 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs, edwin

From: Darrick J. Wong <darrick.wong@oracle.com>

Refactor the two functions that we use to lock and unlock two inodes to
block userspace from initiating IO against a file, whether via system
calls or mmap activity.  Move them to xfs_inode.c since this functionality
isn't specific to reflink anyway.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_file.c    |    2 +
 fs/xfs/xfs_inode.c   |   93 ++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_inode.h   |    3 ++
 fs/xfs/xfs_reflink.c |   85 +---------------------------------------------
 fs/xfs/xfs_reflink.h |    2 -
 5 files changed, 99 insertions(+), 86 deletions(-)


diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index b375fae811f2..a32d1eeee0f7 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1065,7 +1065,7 @@ xfs_file_remap_range(
 	if (mp->m_flags & XFS_MOUNT_WSYNC)
 		xfs_log_force_inode(dest);
 out_unlock:
-	xfs_reflink_remap_unlock(file_in, file_out);
+	xfs_iunlock_io_mmap(src, dest);
 	if (ret)
 		trace_xfs_reflink_remap_range_error(dest, ret, _RET_IP_);
 	return remapped > 0 ? remapped : ret;
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 9aea7d68d8ab..b9c6d1cc64a9 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -3881,3 +3881,96 @@ xfs_log_force_inode(
 		return 0;
 	return xfs_log_force_lsn(ip->i_mount, lsn, XFS_LOG_SYNC, NULL);
 }
+
+/*
+ * Grab the exclusive iolock for a data copy from src to dest, making sure to
+ * abide vfs locking order (lowest pointer value goes first) and breaking the
+ * layout leases before proceeding.  The loop is needed because we cannot call
+ * the blocking break_layout() with the iolocks held, and therefore have to
+ * back out both locks.
+ */
+static int
+xfs_iolock_two_inodes_and_break_layout(
+	struct inode		*src,
+	struct inode		*dest)
+{
+	int			error;
+
+	if (src > dest)
+		swap(src, dest);
+
+retry:
+	/* Wait to break both inodes' layouts before we start locking. */
+	error = break_layout(src, true);
+	if (error)
+		return error;
+	if (src != dest) {
+		error = break_layout(dest, true);
+		if (error)
+			return error;
+	}
+
+	/* Lock one inode and make sure nobody got in and leased it. */
+	inode_lock(src);
+	error = break_layout(src, false);
+	if (error) {
+		inode_unlock(src);
+		if (error == -EWOULDBLOCK)
+			goto retry;
+		return error;
+	}
+
+	if (src == dest)
+		return 0;
+
+	/* Lock the other inode and make sure nobody got in and leased it. */
+	inode_lock_nested(dest, I_MUTEX_NONDIR2);
+	error = break_layout(dest, false);
+	if (error) {
+		inode_unlock(src);
+		inode_unlock(dest);
+		if (error == -EWOULDBLOCK)
+			goto retry;
+		return error;
+	}
+
+	return 0;
+}
+
+/*
+ * Lock two files so that userspace cannot initiate I/O via file syscalls or
+ * mmap activity.
+ */
+int
+xfs_ilock_io_mmap(
+	struct xfs_inode	*ip1,
+	struct xfs_inode	*ip2)
+{
+	int			ret;
+
+	ret = xfs_iolock_two_inodes_and_break_layout(VFS_I(ip1), VFS_I(ip2));
+	if (ret)
+		return ret;
+	if (ip1 == ip2)
+		xfs_ilock(ip1, XFS_MMAPLOCK_EXCL);
+	else
+		xfs_lock_two_inodes(ip1, XFS_MMAPLOCK_EXCL,
+				    ip2, XFS_MMAPLOCK_EXCL);
+	return 0;
+}
+
+/* Unlock both files to allow IO and mmap activity. */
+void
+xfs_iunlock_io_mmap(
+	struct xfs_inode	*ip1,
+	struct xfs_inode	*ip2)
+{
+	bool			same_inode = (ip1 == ip2);
+
+	xfs_iunlock(ip2, XFS_MMAPLOCK_EXCL);
+	if (!same_inode)
+		xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL);
+	inode_unlock(VFS_I(ip2));
+	if (!same_inode)
+		inode_unlock(VFS_I(ip1));
+}
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 47d3b391030d..001529784f96 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -499,4 +499,7 @@ void xfs_iunlink_destroy(struct xfs_perag *pag);
 
 void xfs_end_io(struct work_struct *work);
 
+int xfs_ilock_io_mmap(struct xfs_inode *ip1, struct xfs_inode *ip2);
+void xfs_iunlock_io_mmap(struct xfs_inode *ip1, struct xfs_inode *ip2);
+
 #endif	/* __XFS_INODE_H__ */
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index dd9ed7d5694d..130b6be8180e 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1203,81 +1203,6 @@ xfs_reflink_remap_blocks(
 	return error;
 }
 
-/*
- * Grab the exclusive iolock for a data copy from src to dest, making sure to
- * abide vfs locking order (lowest pointer value goes first) and breaking the
- * layout leases before proceeding.  The loop is needed because we cannot call
- * the blocking break_layout() with the iolocks held, and therefore have to
- * back out both locks.
- */
-static int
-xfs_iolock_two_inodes_and_break_layout(
-	struct inode		*src,
-	struct inode		*dest)
-{
-	int			error;
-
-	if (src > dest)
-		swap(src, dest);
-
-retry:
-	/* Wait to break both inodes' layouts before we start locking. */
-	error = break_layout(src, true);
-	if (error)
-		return error;
-	if (src != dest) {
-		error = break_layout(dest, true);
-		if (error)
-			return error;
-	}
-
-	/* Lock one inode and make sure nobody got in and leased it. */
-	inode_lock(src);
-	error = break_layout(src, false);
-	if (error) {
-		inode_unlock(src);
-		if (error == -EWOULDBLOCK)
-			goto retry;
-		return error;
-	}
-
-	if (src == dest)
-		return 0;
-
-	/* Lock the other inode and make sure nobody got in and leased it. */
-	inode_lock_nested(dest, I_MUTEX_NONDIR2);
-	error = break_layout(dest, false);
-	if (error) {
-		inode_unlock(src);
-		inode_unlock(dest);
-		if (error == -EWOULDBLOCK)
-			goto retry;
-		return error;
-	}
-
-	return 0;
-}
-
-/* Unlock both inodes after they've been prepped for a range clone. */
-void
-xfs_reflink_remap_unlock(
-	struct file		*file_in,
-	struct file		*file_out)
-{
-	struct inode		*inode_in = file_inode(file_in);
-	struct xfs_inode	*src = XFS_I(inode_in);
-	struct inode		*inode_out = file_inode(file_out);
-	struct xfs_inode	*dest = XFS_I(inode_out);
-	bool			same_inode = (inode_in == inode_out);
-
-	xfs_iunlock(dest, XFS_MMAPLOCK_EXCL);
-	if (!same_inode)
-		xfs_iunlock(src, XFS_MMAPLOCK_EXCL);
-	inode_unlock(inode_out);
-	if (!same_inode)
-		inode_unlock(inode_in);
-}
-
 /*
  * If we're reflinking to a point past the destination file's EOF, we must
  * zero any speculative post-EOF preallocations that sit between the old EOF
@@ -1340,18 +1265,12 @@ xfs_reflink_remap_prep(
 	struct xfs_inode	*src = XFS_I(inode_in);
 	struct inode		*inode_out = file_inode(file_out);
 	struct xfs_inode	*dest = XFS_I(inode_out);
-	bool			same_inode = (inode_in == inode_out);
 	int			ret;
 
 	/* Lock both files against IO */
-	ret = xfs_iolock_two_inodes_and_break_layout(inode_in, inode_out);
+	ret = xfs_ilock_io_mmap(src, dest);
 	if (ret)
 		return ret;
-	if (same_inode)
-		xfs_ilock(src, XFS_MMAPLOCK_EXCL);
-	else
-		xfs_lock_two_inodes(src, XFS_MMAPLOCK_EXCL, dest,
-				XFS_MMAPLOCK_EXCL);
 
 	/* Check file eligibility and prepare for block sharing. */
 	ret = -EINVAL;
@@ -1402,7 +1321,7 @@ xfs_reflink_remap_prep(
 
 	return 0;
 out_unlock:
-	xfs_reflink_remap_unlock(file_in, file_out);
+	xfs_iunlock_io_mmap(src, dest);
 	return ret;
 }
 
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
index 3e4fd46373ab..487b00434b96 100644
--- a/fs/xfs/xfs_reflink.h
+++ b/fs/xfs/xfs_reflink.h
@@ -56,7 +56,5 @@ extern int xfs_reflink_remap_blocks(struct xfs_inode *src, loff_t pos_in,
 		loff_t *remapped);
 extern int xfs_reflink_update_dest(struct xfs_inode *dest, xfs_off_t newlen,
 		xfs_extlen_t cowextsize, unsigned int remap_flags);
-extern void xfs_reflink_remap_unlock(struct file *file_in,
-		struct file *file_out);
 
 #endif /* __XFS_REFLINK_H */


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/3] xfs: redesign the reflink remap loop to fix blkres depletion crash
  2020-06-23  4:01 ` [PATCH 1/3] xfs: redesign the reflink remap loop to fix blkres depletion crash Darrick J. Wong
@ 2020-06-24 12:38   ` Brian Foster
  2020-06-24 15:09     ` Darrick J. Wong
  0 siblings, 1 reply; 12+ messages in thread
From: Brian Foster @ 2020-06-24 12:38 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, edwin

On Mon, Jun 22, 2020 at 09:01:35PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> The existing reflink remapping loop has some structural problems that
> need addressing:
> 
> The biggest problem is that we create one transaction for each extent in
> the source file without accounting for the number of mappings there are
> for the same range in the destination file.  In other words, we don't
> know the number of remap operations that will be necessary and we
> therefore cannot guess the block reservation required.  On highly
> fragmented filesystems (e.g. ones with active dedupe) we guess wrong,
> run out of block reservation, and fail.
> 
> The second problem is that we don't actually use the bmap intents to
> their full potential -- instead of calling bunmapi directly and having
> to deal with its backwards operation, we could call the deferred ops
> xfs_bmap_unmap_extent and xfs_refcount_decrease_extent instead.  This
> makes the frontend loop much simpler.
> 
> A third (and more minor) problem is that we aren't even clever enough to
> skip the whole remapping thing if the source and destination ranges
> point to the same physical extents.
> 

I'm wondering if this all really has to be done in one patch. The last
bit at least sounds like an optimization from the description.

> Solve all of these problems by refactoring the remapping loops so that
> we only perform one remapping operation per transaction, and each
> operation only tries to remap a single extent from source to dest.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/libxfs/xfs_bmap.h |   13 ++-
>  fs/xfs/xfs_reflink.c     |  234 ++++++++++++++++++++++++++--------------------
>  fs/xfs/xfs_trace.h       |   52 +---------
>  fs/xfs/xfs_trans_dquot.c |   13 +--
>  4 files changed, 149 insertions(+), 163 deletions(-)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
> index 6028a3c825ba..3498b4f75f71 100644
> --- a/fs/xfs/libxfs/xfs_bmap.h
> +++ b/fs/xfs/libxfs/xfs_bmap.h
> @@ -158,6 +158,13 @@ static inline int xfs_bmapi_whichfork(int bmapi_flags)
>  	{ BMAP_ATTRFORK,	"ATTR" }, \
>  	{ BMAP_COWFORK,		"COW" }
>  
> +/* Return true if the extent is an allocated extent, written or not. */
> +static inline bool xfs_bmap_is_mapped_extent(struct xfs_bmbt_irec *irec)
> +{
> +	return irec->br_startblock != HOLESTARTBLOCK &&
> +		irec->br_startblock != DELAYSTARTBLOCK &&
> +		!isnullstartblock(irec->br_startblock);
> +}

Heh, I was going to suggest to call this _is_real_extent(), but I see
the helper below already uses that name. I do think the name correlates
better with this helper and perhaps the one below should be called
something like xfs_bmap_is_written_extent(). Hm? Either way, the
"mapped" name is kind of vague and brings back memories of buffer heads.

>  
>  /*
>   * Return true if the extent is a real, allocated extent, or false if it is  a
> @@ -165,10 +172,8 @@ static inline int xfs_bmapi_whichfork(int bmapi_flags)
>   */
>  static inline bool xfs_bmap_is_real_extent(struct xfs_bmbt_irec *irec)
>  {
> -	return irec->br_state != XFS_EXT_UNWRITTEN &&
> -		irec->br_startblock != HOLESTARTBLOCK &&
> -		irec->br_startblock != DELAYSTARTBLOCK &&
> -		!isnullstartblock(irec->br_startblock);
> +	return xfs_bmap_is_mapped_extent(irec) &&
> +	       irec->br_state != XFS_EXT_UNWRITTEN;
>  }
>  
>  /*
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index 107bf2a2f344..f50a8c2f21a5 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -984,40 +984,27 @@ xfs_reflink_ag_has_free_space(
>  }
>  
>  /*
> - * Unmap a range of blocks from a file, then map other blocks into the hole.
> - * The range to unmap is (destoff : destoff + srcioff + irec->br_blockcount).
> - * The extent irec is mapped into dest at irec->br_startoff.
> + * Remap the given extent into the file at the given offset.  The imap
> + * blockcount will be set to the number of blocks that were actually remapped.
>   */
>  STATIC int
>  xfs_reflink_remap_extent(
>  	struct xfs_inode	*ip,
> -	struct xfs_bmbt_irec	*irec,
> -	xfs_fileoff_t		destoff,
> +	struct xfs_bmbt_irec	*imap,
>  	xfs_off_t		new_isize)
>  {
> +	struct xfs_bmbt_irec	imap2;
>  	struct xfs_mount	*mp = ip->i_mount;
> -	bool			real_extent = xfs_bmap_is_real_extent(irec);
>  	struct xfs_trans	*tp;
> -	unsigned int		resblks;
> -	struct xfs_bmbt_irec	uirec;
> -	xfs_filblks_t		rlen;
> -	xfs_filblks_t		unmap_len;
>  	xfs_off_t		newlen;
> +	int64_t			ip_delta = 0;
> +	unsigned int		resblks;
> +	bool			real_extent = xfs_bmap_is_real_extent(imap);
> +	int			nimaps;
>  	int			error;
>  
> -	unmap_len = irec->br_startoff + irec->br_blockcount - destoff;
> -	trace_xfs_reflink_punch_range(ip, destoff, unmap_len);
> -
> -	/* No reflinking if we're low on space */
> -	if (real_extent) {
> -		error = xfs_reflink_ag_has_free_space(mp,
> -				XFS_FSB_TO_AGNO(mp, irec->br_startblock));
> -		if (error)
> -			goto out;
> -	}
> -
>  	/* Start a rolling transaction to switch the mappings */
> -	resblks = XFS_EXTENTADD_SPACE_RES(ip->i_mount, XFS_DATA_FORK);
> +	resblks = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
>  	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0, 0, &tp);
>  	if (error)
>  		goto out;
> @@ -1025,87 +1012,118 @@ xfs_reflink_remap_extent(
>  	xfs_ilock(ip, XFS_ILOCK_EXCL);
>  	xfs_trans_ijoin(tp, ip, 0);
>  
> -	/* If we're not just clearing space, then do we have enough quota? */
> +	/* Read extent from the second file */
> +	nimaps = 1;
> +	error = xfs_bmapi_read(ip, imap->br_startoff, imap->br_blockcount,
> +			&imap2, &nimaps, 0);
> +	if (error)
> +		goto out_cancel;
> +#ifdef DEBUG
> +	if (nimaps != 1 ||
> +	    imap2.br_startoff != imap->br_startoff) {
> +		/*
> +		 * We should never get no mapping or something that doesn't
> +		 * match what we asked for.
> +		 */
> +		ASSERT(0);
> +		error = -EINVAL;
> +		goto out_cancel;
> +	}
> +#endif

Why the DEBUG?

> +
> +	/*
> +	 * We can only remap as many blocks as the smaller of the two extent
> +	 * maps, because we can only remap one extent at a time.
> +	 */
> +	imap->br_blockcount = min(imap->br_blockcount, imap2.br_blockcount);
> +
> +	trace_xfs_reflink_remap_extent2(ip, &imap2);
> +
> +	/*
> +	 * Two extents mapped to the same physical block must not have
> +	 * different states; that's filesystem corruption.  Move on to the next
> +	 * extent if they're both holes or both the same physical extent.
> +	 */
> +	if (imap->br_startblock == imap2.br_startblock) {
> +		if (imap->br_state != imap2.br_state)
> +			error = -EFSCORRUPTED;

Can we assert the length of each extent match here as well?

> +		goto out_cancel;
> +	}
> +
>  	if (real_extent) {
> +		/* No reflinking if we're low on space */
> +		error = xfs_reflink_ag_has_free_space(mp,
> +				XFS_FSB_TO_AGNO(mp, imap->br_startblock));
> +		if (error)
> +			goto out_cancel;
> +
> +
> +		/* Do we have enough quota? */
>  		error = xfs_trans_reserve_quota_nblks(tp, ip,
> -				irec->br_blockcount, 0, XFS_QMOPT_RES_REGBLKS);
> +				imap->br_blockcount, 0, XFS_QMOPT_RES_REGBLKS);

Any reason this doesn't accommodate removal of real blocks?

>  		if (error)
>  			goto out_cancel;
>  	}
>  
> -	trace_xfs_reflink_remap(ip, irec->br_startoff,
> -				irec->br_blockcount, irec->br_startblock);
> -
> -	/* Unmap the old blocks in the data fork. */
> -	rlen = unmap_len;
> -	while (rlen) {
> -		ASSERT(tp->t_firstblock == NULLFSBLOCK);
> -		error = __xfs_bunmapi(tp, ip, destoff, &rlen, 0, 1);
> -		if (error)
> -			goto out_cancel;
> -
> +	if (xfs_bmap_is_mapped_extent(&imap2)) {
>  		/*
> -		 * Trim the extent to whatever got unmapped.
> -		 * Remember, bunmapi works backwards.
> +		 * If the extent we're unmapping is backed by storage (written
> +		 * or not), unmap the extent and drop its refcount.
>  		 */
> -		uirec.br_startblock = irec->br_startblock + rlen;
> -		uirec.br_startoff = irec->br_startoff + rlen;
> -		uirec.br_blockcount = unmap_len - rlen;
> -		uirec.br_state = irec->br_state;
> -		unmap_len = rlen;
> +		xfs_bmap_unmap_extent(tp, ip, &imap2);
> +		xfs_refcount_decrease_extent(tp, &imap2);
> +		ip_delta -= imap2.br_blockcount;
> +	} else if (imap2.br_startblock == DELAYSTARTBLOCK) {
> +		xfs_filblks_t	len = imap2.br_blockcount;
>  
> -		/* If this isn't a real mapping, we're done. */
> -		if (!real_extent || uirec.br_blockcount == 0)
> -			goto next_extent;
> -
> -		trace_xfs_reflink_remap(ip, uirec.br_startoff,
> -				uirec.br_blockcount, uirec.br_startblock);
> -
> -		/* Update the refcount tree */
> -		xfs_refcount_increase_extent(tp, &uirec);
> -
> -		/* Map the new blocks into the data fork. */
> -		xfs_bmap_map_extent(tp, ip, &uirec);
> +		/*
> +		 * If the extent we're unmapping is a delalloc reservation,
> +		 * we can use the regular bunmapi function to release the
> +		 * incore state.  Dropping the delalloc reservation takes care
> +		 * of the quota reservation for us.
> +		 */
> +		error = __xfs_bunmapi(NULL, ip, imap2.br_startoff, &len, 0, 1);
> +		if (error)
> +			goto out_cancel;
> +		ASSERT(len == 0);
> +	}
>  
> -		/* Update quota accounting. */
> -		xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_BCOUNT,
> -				uirec.br_blockcount);
> +	/*
> +	 * If the extent we're sharing is backed by written storage, increase
> +	 * its refcount and map it into the file.
> +	 */
> +	if (real_extent) {
> +		xfs_refcount_increase_extent(tp, imap);
> +		xfs_bmap_map_extent(tp, ip, imap);
> +		ip_delta += imap->br_blockcount;
> +	}
>  
> -		/* Update dest isize if needed. */
> -		newlen = XFS_FSB_TO_B(mp,
> -				uirec.br_startoff + uirec.br_blockcount);
> -		newlen = min_t(xfs_off_t, newlen, new_isize);
> -		if (newlen > i_size_read(VFS_I(ip))) {
> -			trace_xfs_reflink_update_inode_size(ip, newlen);
> -			i_size_write(VFS_I(ip), newlen);
> -			ip->i_d.di_size = newlen;
> -			xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
> -		}
> +	xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_BCOUNT, ip_delta);
>  
> -next_extent:
> -		/* Process all the deferred stuff. */
> -		error = xfs_defer_finish(&tp);
> -		if (error)
> -			goto out_cancel;
> +	/* Update dest isize if needed. */
> +	newlen = XFS_FSB_TO_B(mp, imap2.br_startoff + imap2.br_blockcount);
> +	newlen = min_t(xfs_off_t, newlen, new_isize);
> +	if (newlen > i_size_read(VFS_I(ip))) {
> +		trace_xfs_reflink_update_inode_size(ip, newlen);
> +		i_size_write(VFS_I(ip), newlen);
> +		ip->i_d.di_size = newlen;
> +		xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
>  	}
>  
> +	/* Commit everything and unlock. */
>  	error = xfs_trans_commit(tp);
> -	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> -	if (error)
> -		goto out;
> -	return 0;
> +	goto out_unlock;

We probably shouldn't fire the _error() tracepoint in the successful
exit case.

>  
>  out_cancel:
>  	xfs_trans_cancel(tp);
> +out_unlock:
>  	xfs_iunlock(ip, XFS_ILOCK_EXCL);
>  out:
>  	trace_xfs_reflink_remap_extent_error(ip, error, _RET_IP_);
>  	return error;
>  }
>  
> -/*
> - * Iteratively remap one file's extents (and holes) to another's.
> - */
> +/* Remap a range of one file to the other. */
>  int
>  xfs_reflink_remap_blocks(
>  	struct xfs_inode	*src,
> @@ -1116,25 +1134,22 @@ xfs_reflink_remap_blocks(
>  	loff_t			*remapped)
>  {
>  	struct xfs_bmbt_irec	imap;
> -	xfs_fileoff_t		srcoff;
> -	xfs_fileoff_t		destoff;
> +	struct xfs_mount	*mp = src->i_mount;
> +	xfs_fileoff_t		srcoff = XFS_B_TO_FSBT(mp, pos_in);
> +	xfs_fileoff_t		destoff = XFS_B_TO_FSBT(mp, pos_out);
>  	xfs_filblks_t		len;
> -	xfs_filblks_t		range_len;
>  	xfs_filblks_t		remapped_len = 0;
>  	xfs_off_t		new_isize = pos_out + remap_len;
>  	int			nimaps;
>  	int			error = 0;
>  
> -	destoff = XFS_B_TO_FSBT(src->i_mount, pos_out);
> -	srcoff = XFS_B_TO_FSBT(src->i_mount, pos_in);
> -	len = XFS_B_TO_FSB(src->i_mount, remap_len);
> +	len = min_t(xfs_filblks_t, XFS_B_TO_FSB(mp, remap_len),
> +			XFS_MAX_FILEOFF);
>  
> -	/* drange = (destoff, destoff + len); srange = (srcoff, srcoff + len) */
> -	while (len) {
> -		uint		lock_mode;
> +	trace_xfs_reflink_remap_blocks(src, srcoff, len, dest, destoff);
>  
> -		trace_xfs_reflink_remap_blocks_loop(src, srcoff, len,
> -				dest, destoff);
> +	while (len > 0) {
> +		unsigned int	lock_mode;
>  
>  		/* Read extent from the source file */
>  		nimaps = 1;
> @@ -1143,18 +1158,29 @@ xfs_reflink_remap_blocks(
>  		xfs_iunlock(src, lock_mode);
>  		if (error)
>  			break;
> -		ASSERT(nimaps == 1);
> +#ifdef DEBUG
> +		if (nimaps != 1 ||
> +		    imap.br_startblock == DELAYSTARTBLOCK ||
> +		    imap.br_startoff != srcoff) {
> +			/*
> +			 * We should never get no mapping or a delalloc extent
> +			 * or something that doesn't match what we asked for,
> +			 * since the caller flushed the source file data and
> +			 * we hold the source file io/mmap locks.
> +			 */
> +			ASSERT(nimaps == 0);
> +			ASSERT(imap.br_startblock != DELAYSTARTBLOCK);
> +			ASSERT(imap.br_startoff == srcoff);
> +			error = -EINVAL;
> +			break;
> +		}
> +#endif

Same thing with the DEBUG check? It seems like at least some of these
checks should be unconditional in the event the file is corrupted or
something.

>  
> -		trace_xfs_reflink_remap_imap(src, srcoff, len, XFS_DATA_FORK,
> -				&imap);
> +		trace_xfs_reflink_remap_extent1(src, &imap);

How about trace_xfs_reflink_remap_extent_[src|dest]() so trace data is a
bit more readable?

>  
> -		/* Translate imap into the destination file. */
> -		range_len = imap.br_startoff + imap.br_blockcount - srcoff;
> -		imap.br_startoff += destoff - srcoff;
> -
> -		/* Clear dest from destoff to the end of imap and map it in. */
> -		error = xfs_reflink_remap_extent(dest, &imap, destoff,
> -				new_isize);
> +		/* Remap into the destination file at the given offset. */
> +		imap.br_startoff = destoff;
> +		error = xfs_reflink_remap_extent(dest, &imap, new_isize);
>  		if (error)
>  			break;
>  
> @@ -1164,10 +1190,10 @@ xfs_reflink_remap_blocks(
>  		}
>  
>  		/* Advance drange/srange */
> -		srcoff += range_len;
> -		destoff += range_len;
> -		len -= range_len;
> -		remapped_len += range_len;
> +		srcoff += imap.br_blockcount;
> +		destoff += imap.br_blockcount;
> +		len -= imap.br_blockcount;
> +		remapped_len += imap.br_blockcount;

Much nicer iteration, indeed. ;)

Brian

>  	}
>  
>  	if (error)
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index 460136628a79..f205602c8ba9 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -3052,8 +3052,7 @@ DEFINE_EVENT(xfs_inode_irec_class, name, \
>  DEFINE_INODE_EVENT(xfs_reflink_set_inode_flag);
>  DEFINE_INODE_EVENT(xfs_reflink_unset_inode_flag);
>  DEFINE_ITRUNC_EVENT(xfs_reflink_update_inode_size);
> -DEFINE_IMAP_EVENT(xfs_reflink_remap_imap);
> -TRACE_EVENT(xfs_reflink_remap_blocks_loop,
> +TRACE_EVENT(xfs_reflink_remap_blocks,
>  	TP_PROTO(struct xfs_inode *src, xfs_fileoff_t soffset,
>  		 xfs_filblks_t len, struct xfs_inode *dest,
>  		 xfs_fileoff_t doffset),
> @@ -3084,59 +3083,14 @@ TRACE_EVENT(xfs_reflink_remap_blocks_loop,
>  		  __entry->dest_ino,
>  		  __entry->dest_lblk)
>  );
> -TRACE_EVENT(xfs_reflink_punch_range,
> -	TP_PROTO(struct xfs_inode *ip, xfs_fileoff_t lblk,
> -		 xfs_extlen_t len),
> -	TP_ARGS(ip, lblk, len),
> -	TP_STRUCT__entry(
> -		__field(dev_t, dev)
> -		__field(xfs_ino_t, ino)
> -		__field(xfs_fileoff_t, lblk)
> -		__field(xfs_extlen_t, len)
> -	),
> -	TP_fast_assign(
> -		__entry->dev = VFS_I(ip)->i_sb->s_dev;
> -		__entry->ino = ip->i_ino;
> -		__entry->lblk = lblk;
> -		__entry->len = len;
> -	),
> -	TP_printk("dev %d:%d ino 0x%llx lblk 0x%llx len 0x%x",
> -		  MAJOR(__entry->dev), MINOR(__entry->dev),
> -		  __entry->ino,
> -		  __entry->lblk,
> -		  __entry->len)
> -);
> -TRACE_EVENT(xfs_reflink_remap,
> -	TP_PROTO(struct xfs_inode *ip, xfs_fileoff_t lblk,
> -		 xfs_extlen_t len, xfs_fsblock_t new_pblk),
> -	TP_ARGS(ip, lblk, len, new_pblk),
> -	TP_STRUCT__entry(
> -		__field(dev_t, dev)
> -		__field(xfs_ino_t, ino)
> -		__field(xfs_fileoff_t, lblk)
> -		__field(xfs_extlen_t, len)
> -		__field(xfs_fsblock_t, new_pblk)
> -	),
> -	TP_fast_assign(
> -		__entry->dev = VFS_I(ip)->i_sb->s_dev;
> -		__entry->ino = ip->i_ino;
> -		__entry->lblk = lblk;
> -		__entry->len = len;
> -		__entry->new_pblk = new_pblk;
> -	),
> -	TP_printk("dev %d:%d ino 0x%llx lblk 0x%llx len 0x%x new_pblk %llu",
> -		  MAJOR(__entry->dev), MINOR(__entry->dev),
> -		  __entry->ino,
> -		  __entry->lblk,
> -		  __entry->len,
> -		  __entry->new_pblk)
> -);
>  DEFINE_DOUBLE_IO_EVENT(xfs_reflink_remap_range);
>  DEFINE_INODE_ERROR_EVENT(xfs_reflink_remap_range_error);
>  DEFINE_INODE_ERROR_EVENT(xfs_reflink_set_inode_flag_error);
>  DEFINE_INODE_ERROR_EVENT(xfs_reflink_update_inode_size_error);
>  DEFINE_INODE_ERROR_EVENT(xfs_reflink_remap_blocks_error);
>  DEFINE_INODE_ERROR_EVENT(xfs_reflink_remap_extent_error);
> +DEFINE_INODE_IREC_EVENT(xfs_reflink_remap_extent1);
> +DEFINE_INODE_IREC_EVENT(xfs_reflink_remap_extent2);
>  
>  /* dedupe tracepoints */
>  DEFINE_DOUBLE_IO_EVENT(xfs_reflink_compare_extents);
> diff --git a/fs/xfs/xfs_trans_dquot.c b/fs/xfs/xfs_trans_dquot.c
> index c0f73b82c055..99511ff6222f 100644
> --- a/fs/xfs/xfs_trans_dquot.c
> +++ b/fs/xfs/xfs_trans_dquot.c
> @@ -124,16 +124,17 @@ xfs_trans_dup_dqinfo(
>   */
>  void
>  xfs_trans_mod_dquot_byino(
> -	xfs_trans_t	*tp,
> -	xfs_inode_t	*ip,
> -	uint		field,
> -	int64_t		delta)
> +	struct xfs_trans	*tp,
> +	struct xfs_inode	*ip,
> +	uint			field,
> +	int64_t			delta)
>  {
> -	xfs_mount_t	*mp = tp->t_mountp;
> +	struct xfs_mount	*mp = tp->t_mountp;
>  
>  	if (!XFS_IS_QUOTA_RUNNING(mp) ||
>  	    !XFS_IS_QUOTA_ON(mp) ||
> -	    xfs_is_quota_inode(&mp->m_sb, ip->i_ino))
> +	    xfs_is_quota_inode(&mp->m_sb, ip->i_ino) ||
> +	    delta == 0)
>  		return;
>  
>  	if (tp->t_dqinfo == NULL)
> 


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/3] xfs: fix xfs_reflink_remap_prep calling conventions
  2020-06-23  4:01 ` [PATCH 2/3] xfs: fix xfs_reflink_remap_prep calling conventions Darrick J. Wong
@ 2020-06-24 12:38   ` Brian Foster
  0 siblings, 0 replies; 12+ messages in thread
From: Brian Foster @ 2020-06-24 12:38 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, edwin

On Mon, Jun 22, 2020 at 09:01:42PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Fix the return value of xfs_reflink_remap_prep so that its return value
> conventions match the rest of xfs.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_file.c    |    2 +-
>  fs/xfs/xfs_reflink.c |    6 +++---
>  2 files changed, 4 insertions(+), 4 deletions(-)
> 
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 00db81eac80d..b375fae811f2 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -1035,7 +1035,7 @@ xfs_file_remap_range(
>  	/* Prepare and then clone file data. */
>  	ret = xfs_reflink_remap_prep(file_in, pos_in, file_out, pos_out,
>  			&len, remap_flags);
> -	if (ret < 0 || len == 0)
> +	if (ret || len == 0)
>  		return ret;
>  
>  	trace_xfs_reflink_remap_range(src, pos_in, len, dest, pos_out);
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index f50a8c2f21a5..dd9ed7d5694d 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -1341,7 +1341,7 @@ xfs_reflink_remap_prep(
>  	struct inode		*inode_out = file_inode(file_out);
>  	struct xfs_inode	*dest = XFS_I(inode_out);
>  	bool			same_inode = (inode_in == inode_out);
> -	ssize_t			ret;
> +	int			ret;
>  
>  	/* Lock both files against IO */
>  	ret = xfs_iolock_two_inodes_and_break_layout(inode_in, inode_out);
> @@ -1365,7 +1365,7 @@ xfs_reflink_remap_prep(
>  
>  	ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out,
>  			len, remap_flags);
> -	if (ret < 0 || *len == 0)
> +	if (ret || *len == 0)
>  		goto out_unlock;
>  
>  	/* Attach dquots to dest inode before changing block map */
> @@ -1400,7 +1400,7 @@ xfs_reflink_remap_prep(
>  	if (ret)
>  		goto out_unlock;
>  
> -	return 1;
> +	return 0;
>  out_unlock:
>  	xfs_reflink_remap_unlock(file_in, file_out);
>  	return ret;
> 


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 3/3] xfs: refactor locking and unlocking two inodes against userspace IO
  2020-06-23  4:01 ` [PATCH 3/3] xfs: refactor locking and unlocking two inodes against userspace IO Darrick J. Wong
@ 2020-06-24 12:39   ` Brian Foster
  0 siblings, 0 replies; 12+ messages in thread
From: Brian Foster @ 2020-06-24 12:39 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, edwin

On Mon, Jun 22, 2020 at 09:01:49PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Refactor the two functions that we use to lock and unlock two inodes to
> block userspace from initiating IO against a file, whether via system
> calls or mmap activity.  Move them to xfs_inode.c since this functionality
> isn't specific to reflink anyway.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---

I'd prefer to see refactoring patches separate from moving patches, but
looks Ok:

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_file.c    |    2 +
>  fs/xfs/xfs_inode.c   |   93 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_inode.h   |    3 ++
>  fs/xfs/xfs_reflink.c |   85 +---------------------------------------------
>  fs/xfs/xfs_reflink.h |    2 -
>  5 files changed, 99 insertions(+), 86 deletions(-)
> 
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index b375fae811f2..a32d1eeee0f7 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -1065,7 +1065,7 @@ xfs_file_remap_range(
>  	if (mp->m_flags & XFS_MOUNT_WSYNC)
>  		xfs_log_force_inode(dest);
>  out_unlock:
> -	xfs_reflink_remap_unlock(file_in, file_out);
> +	xfs_iunlock_io_mmap(src, dest);
>  	if (ret)
>  		trace_xfs_reflink_remap_range_error(dest, ret, _RET_IP_);
>  	return remapped > 0 ? remapped : ret;
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 9aea7d68d8ab..b9c6d1cc64a9 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -3881,3 +3881,96 @@ xfs_log_force_inode(
>  		return 0;
>  	return xfs_log_force_lsn(ip->i_mount, lsn, XFS_LOG_SYNC, NULL);
>  }
> +
> +/*
> + * Grab the exclusive iolock for a data copy from src to dest, making sure to
> + * abide vfs locking order (lowest pointer value goes first) and breaking the
> + * layout leases before proceeding.  The loop is needed because we cannot call
> + * the blocking break_layout() with the iolocks held, and therefore have to
> + * back out both locks.
> + */
> +static int
> +xfs_iolock_two_inodes_and_break_layout(
> +	struct inode		*src,
> +	struct inode		*dest)
> +{
> +	int			error;
> +
> +	if (src > dest)
> +		swap(src, dest);
> +
> +retry:
> +	/* Wait to break both inodes' layouts before we start locking. */
> +	error = break_layout(src, true);
> +	if (error)
> +		return error;
> +	if (src != dest) {
> +		error = break_layout(dest, true);
> +		if (error)
> +			return error;
> +	}
> +
> +	/* Lock one inode and make sure nobody got in and leased it. */
> +	inode_lock(src);
> +	error = break_layout(src, false);
> +	if (error) {
> +		inode_unlock(src);
> +		if (error == -EWOULDBLOCK)
> +			goto retry;
> +		return error;
> +	}
> +
> +	if (src == dest)
> +		return 0;
> +
> +	/* Lock the other inode and make sure nobody got in and leased it. */
> +	inode_lock_nested(dest, I_MUTEX_NONDIR2);
> +	error = break_layout(dest, false);
> +	if (error) {
> +		inode_unlock(src);
> +		inode_unlock(dest);
> +		if (error == -EWOULDBLOCK)
> +			goto retry;
> +		return error;
> +	}
> +
> +	return 0;
> +}
> +
> +/*
> + * Lock two files so that userspace cannot initiate I/O via file syscalls or
> + * mmap activity.
> + */
> +int
> +xfs_ilock_io_mmap(
> +	struct xfs_inode	*ip1,
> +	struct xfs_inode	*ip2)
> +{
> +	int			ret;
> +
> +	ret = xfs_iolock_two_inodes_and_break_layout(VFS_I(ip1), VFS_I(ip2));
> +	if (ret)
> +		return ret;
> +	if (ip1 == ip2)
> +		xfs_ilock(ip1, XFS_MMAPLOCK_EXCL);
> +	else
> +		xfs_lock_two_inodes(ip1, XFS_MMAPLOCK_EXCL,
> +				    ip2, XFS_MMAPLOCK_EXCL);
> +	return 0;
> +}
> +
> +/* Unlock both files to allow IO and mmap activity. */
> +void
> +xfs_iunlock_io_mmap(
> +	struct xfs_inode	*ip1,
> +	struct xfs_inode	*ip2)
> +{
> +	bool			same_inode = (ip1 == ip2);
> +
> +	xfs_iunlock(ip2, XFS_MMAPLOCK_EXCL);
> +	if (!same_inode)
> +		xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL);
> +	inode_unlock(VFS_I(ip2));
> +	if (!same_inode)
> +		inode_unlock(VFS_I(ip1));
> +}
> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index 47d3b391030d..001529784f96 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -499,4 +499,7 @@ void xfs_iunlink_destroy(struct xfs_perag *pag);
>  
>  void xfs_end_io(struct work_struct *work);
>  
> +int xfs_ilock_io_mmap(struct xfs_inode *ip1, struct xfs_inode *ip2);
> +void xfs_iunlock_io_mmap(struct xfs_inode *ip1, struct xfs_inode *ip2);
> +
>  #endif	/* __XFS_INODE_H__ */
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index dd9ed7d5694d..130b6be8180e 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -1203,81 +1203,6 @@ xfs_reflink_remap_blocks(
>  	return error;
>  }
>  
> -/*
> - * Grab the exclusive iolock for a data copy from src to dest, making sure to
> - * abide vfs locking order (lowest pointer value goes first) and breaking the
> - * layout leases before proceeding.  The loop is needed because we cannot call
> - * the blocking break_layout() with the iolocks held, and therefore have to
> - * back out both locks.
> - */
> -static int
> -xfs_iolock_two_inodes_and_break_layout(
> -	struct inode		*src,
> -	struct inode		*dest)
> -{
> -	int			error;
> -
> -	if (src > dest)
> -		swap(src, dest);
> -
> -retry:
> -	/* Wait to break both inodes' layouts before we start locking. */
> -	error = break_layout(src, true);
> -	if (error)
> -		return error;
> -	if (src != dest) {
> -		error = break_layout(dest, true);
> -		if (error)
> -			return error;
> -	}
> -
> -	/* Lock one inode and make sure nobody got in and leased it. */
> -	inode_lock(src);
> -	error = break_layout(src, false);
> -	if (error) {
> -		inode_unlock(src);
> -		if (error == -EWOULDBLOCK)
> -			goto retry;
> -		return error;
> -	}
> -
> -	if (src == dest)
> -		return 0;
> -
> -	/* Lock the other inode and make sure nobody got in and leased it. */
> -	inode_lock_nested(dest, I_MUTEX_NONDIR2);
> -	error = break_layout(dest, false);
> -	if (error) {
> -		inode_unlock(src);
> -		inode_unlock(dest);
> -		if (error == -EWOULDBLOCK)
> -			goto retry;
> -		return error;
> -	}
> -
> -	return 0;
> -}
> -
> -/* Unlock both inodes after they've been prepped for a range clone. */
> -void
> -xfs_reflink_remap_unlock(
> -	struct file		*file_in,
> -	struct file		*file_out)
> -{
> -	struct inode		*inode_in = file_inode(file_in);
> -	struct xfs_inode	*src = XFS_I(inode_in);
> -	struct inode		*inode_out = file_inode(file_out);
> -	struct xfs_inode	*dest = XFS_I(inode_out);
> -	bool			same_inode = (inode_in == inode_out);
> -
> -	xfs_iunlock(dest, XFS_MMAPLOCK_EXCL);
> -	if (!same_inode)
> -		xfs_iunlock(src, XFS_MMAPLOCK_EXCL);
> -	inode_unlock(inode_out);
> -	if (!same_inode)
> -		inode_unlock(inode_in);
> -}
> -
>  /*
>   * If we're reflinking to a point past the destination file's EOF, we must
>   * zero any speculative post-EOF preallocations that sit between the old EOF
> @@ -1340,18 +1265,12 @@ xfs_reflink_remap_prep(
>  	struct xfs_inode	*src = XFS_I(inode_in);
>  	struct inode		*inode_out = file_inode(file_out);
>  	struct xfs_inode	*dest = XFS_I(inode_out);
> -	bool			same_inode = (inode_in == inode_out);
>  	int			ret;
>  
>  	/* Lock both files against IO */
> -	ret = xfs_iolock_two_inodes_and_break_layout(inode_in, inode_out);
> +	ret = xfs_ilock_io_mmap(src, dest);
>  	if (ret)
>  		return ret;
> -	if (same_inode)
> -		xfs_ilock(src, XFS_MMAPLOCK_EXCL);
> -	else
> -		xfs_lock_two_inodes(src, XFS_MMAPLOCK_EXCL, dest,
> -				XFS_MMAPLOCK_EXCL);
>  
>  	/* Check file eligibility and prepare for block sharing. */
>  	ret = -EINVAL;
> @@ -1402,7 +1321,7 @@ xfs_reflink_remap_prep(
>  
>  	return 0;
>  out_unlock:
> -	xfs_reflink_remap_unlock(file_in, file_out);
> +	xfs_iunlock_io_mmap(src, dest);
>  	return ret;
>  }
>  
> diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
> index 3e4fd46373ab..487b00434b96 100644
> --- a/fs/xfs/xfs_reflink.h
> +++ b/fs/xfs/xfs_reflink.h
> @@ -56,7 +56,5 @@ extern int xfs_reflink_remap_blocks(struct xfs_inode *src, loff_t pos_in,
>  		loff_t *remapped);
>  extern int xfs_reflink_update_dest(struct xfs_inode *dest, xfs_off_t newlen,
>  		xfs_extlen_t cowextsize, unsigned int remap_flags);
> -extern void xfs_reflink_remap_unlock(struct file *file_in,
> -		struct file *file_out);
>  
>  #endif /* __XFS_REFLINK_H */
> 


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/3] xfs: redesign the reflink remap loop to fix blkres depletion crash
  2020-06-24 12:38   ` Brian Foster
@ 2020-06-24 15:09     ` Darrick J. Wong
  2020-06-24 17:07       ` Darrick J. Wong
  0 siblings, 1 reply; 12+ messages in thread
From: Darrick J. Wong @ 2020-06-24 15:09 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, edwin

On Wed, Jun 24, 2020 at 08:38:41AM -0400, Brian Foster wrote:
> On Mon, Jun 22, 2020 at 09:01:35PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > The existing reflink remapping loop has some structural problems that
> > need addressing:
> > 
> > The biggest problem is that we create one transaction for each extent in
> > the source file without accounting for the number of mappings there are
> > for the same range in the destination file.  In other words, we don't
> > know the number of remap operations that will be necessary and we
> > therefore cannot guess the block reservation required.  On highly
> > fragmented filesystems (e.g. ones with active dedupe) we guess wrong,
> > run out of block reservation, and fail.
> > 
> > The second problem is that we don't actually use the bmap intents to
> > their full potential -- instead of calling bunmapi directly and having
> > to deal with its backwards operation, we could call the deferred ops
> > xfs_bmap_unmap_extent and xfs_refcount_decrease_extent instead.  This
> > makes the frontend loop much simpler.
> > 
> > A third (and more minor) problem is that we aren't even clever enough to
> > skip the whole remapping thing if the source and destination ranges
> > point to the same physical extents.
> > 
> 
> I'm wondering if this all really has to be done in one patch. The last
> bit at least sounds like an optimization from the description.

<nod> I can split out that and the quota reservation fixes.

> > Solve all of these problems by refactoring the remapping loops so that
> > we only perform one remapping operation per transaction, and each
> > operation only tries to remap a single extent from source to dest.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/xfs/libxfs/xfs_bmap.h |   13 ++-
> >  fs/xfs/xfs_reflink.c     |  234 ++++++++++++++++++++++++++--------------------
> >  fs/xfs/xfs_trace.h       |   52 +---------
> >  fs/xfs/xfs_trans_dquot.c |   13 +--
> >  4 files changed, 149 insertions(+), 163 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
> > index 6028a3c825ba..3498b4f75f71 100644
> > --- a/fs/xfs/libxfs/xfs_bmap.h
> > +++ b/fs/xfs/libxfs/xfs_bmap.h
> > @@ -158,6 +158,13 @@ static inline int xfs_bmapi_whichfork(int bmapi_flags)
> >  	{ BMAP_ATTRFORK,	"ATTR" }, \
> >  	{ BMAP_COWFORK,		"COW" }
> >  
> > +/* Return true if the extent is an allocated extent, written or not. */
> > +static inline bool xfs_bmap_is_mapped_extent(struct xfs_bmbt_irec *irec)
> > +{
> > +	return irec->br_startblock != HOLESTARTBLOCK &&
> > +		irec->br_startblock != DELAYSTARTBLOCK &&
> > +		!isnullstartblock(irec->br_startblock);
> > +}
> 
> Heh, I was going to suggest to call this _is_real_extent(), but I see
> the helper below already uses that name. I do think the name correlates
> better with this helper and perhaps the one below should be called
> something like xfs_bmap_is_written_extent(). Hm? Either way, the
> "mapped" name is kind of vague and brings back memories of buffer heads.

<nod> I originally wasn't going to change the name on the second helper,
but as there are only three callers currently it's probably easier to do
that now.

> >  
> >  /*
> >   * Return true if the extent is a real, allocated extent, or false if it is  a
> > @@ -165,10 +172,8 @@ static inline int xfs_bmapi_whichfork(int bmapi_flags)
> >   */
> >  static inline bool xfs_bmap_is_real_extent(struct xfs_bmbt_irec *irec)
> >  {
> > -	return irec->br_state != XFS_EXT_UNWRITTEN &&
> > -		irec->br_startblock != HOLESTARTBLOCK &&
> > -		irec->br_startblock != DELAYSTARTBLOCK &&
> > -		!isnullstartblock(irec->br_startblock);
> > +	return xfs_bmap_is_mapped_extent(irec) &&
> > +	       irec->br_state != XFS_EXT_UNWRITTEN;
> >  }
> >  
> >  /*
> > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > index 107bf2a2f344..f50a8c2f21a5 100644
> > --- a/fs/xfs/xfs_reflink.c
> > +++ b/fs/xfs/xfs_reflink.c
> > @@ -984,40 +984,27 @@ xfs_reflink_ag_has_free_space(
> >  }
> >  
> >  /*
> > - * Unmap a range of blocks from a file, then map other blocks into the hole.
> > - * The range to unmap is (destoff : destoff + srcioff + irec->br_blockcount).
> > - * The extent irec is mapped into dest at irec->br_startoff.
> > + * Remap the given extent into the file at the given offset.  The imap
> > + * blockcount will be set to the number of blocks that were actually remapped.
> >   */
> >  STATIC int
> >  xfs_reflink_remap_extent(
> >  	struct xfs_inode	*ip,
> > -	struct xfs_bmbt_irec	*irec,
> > -	xfs_fileoff_t		destoff,
> > +	struct xfs_bmbt_irec	*imap,
> >  	xfs_off_t		new_isize)
> >  {
> > +	struct xfs_bmbt_irec	imap2;
> >  	struct xfs_mount	*mp = ip->i_mount;
> > -	bool			real_extent = xfs_bmap_is_real_extent(irec);
> >  	struct xfs_trans	*tp;
> > -	unsigned int		resblks;
> > -	struct xfs_bmbt_irec	uirec;
> > -	xfs_filblks_t		rlen;
> > -	xfs_filblks_t		unmap_len;
> >  	xfs_off_t		newlen;
> > +	int64_t			ip_delta = 0;
> > +	unsigned int		resblks;
> > +	bool			real_extent = xfs_bmap_is_real_extent(imap);
> > +	int			nimaps;
> >  	int			error;
> >  
> > -	unmap_len = irec->br_startoff + irec->br_blockcount - destoff;
> > -	trace_xfs_reflink_punch_range(ip, destoff, unmap_len);
> > -
> > -	/* No reflinking if we're low on space */
> > -	if (real_extent) {
> > -		error = xfs_reflink_ag_has_free_space(mp,
> > -				XFS_FSB_TO_AGNO(mp, irec->br_startblock));
> > -		if (error)
> > -			goto out;
> > -	}
> > -
> >  	/* Start a rolling transaction to switch the mappings */
> > -	resblks = XFS_EXTENTADD_SPACE_RES(ip->i_mount, XFS_DATA_FORK);
> > +	resblks = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
> >  	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0, 0, &tp);
> >  	if (error)
> >  		goto out;
> > @@ -1025,87 +1012,118 @@ xfs_reflink_remap_extent(
> >  	xfs_ilock(ip, XFS_ILOCK_EXCL);
> >  	xfs_trans_ijoin(tp, ip, 0);
> >  
> > -	/* If we're not just clearing space, then do we have enough quota? */
> > +	/* Read extent from the second file */
> > +	nimaps = 1;
> > +	error = xfs_bmapi_read(ip, imap->br_startoff, imap->br_blockcount,
> > +			&imap2, &nimaps, 0);
> > +	if (error)
> > +		goto out_cancel;
> > +#ifdef DEBUG
> > +	if (nimaps != 1 ||
> > +	    imap2.br_startoff != imap->br_startoff) {
> > +		/*
> > +		 * We should never get no mapping or something that doesn't
> > +		 * match what we asked for.
> > +		 */
> > +		ASSERT(0);
> > +		error = -EINVAL;
> > +		goto out_cancel;
> > +	}
> > +#endif
> 
> Why the DEBUG?

Christoph wondered (on the atomic extent swap series) when
xfs_bmapi_read would ever return nimaps==0 or an extent that doesn't
match at least the first block that we asked for.

I'm /pretty/ sure that will never happen since the full bmapi read
function will hand us back a "HOLE" mapping if it doesn't find anything,
so I left those parts in as debugging checks.

> > +
> > +	/*
> > +	 * We can only remap as many blocks as the smaller of the two extent
> > +	 * maps, because we can only remap one extent at a time.
> > +	 */
> > +	imap->br_blockcount = min(imap->br_blockcount, imap2.br_blockcount);
> > +
> > +	trace_xfs_reflink_remap_extent2(ip, &imap2);
> > +
> > +	/*
> > +	 * Two extents mapped to the same physical block must not have
> > +	 * different states; that's filesystem corruption.  Move on to the next
> > +	 * extent if they're both holes or both the same physical extent.
> > +	 */
> > +	if (imap->br_startblock == imap2.br_startblock) {
> > +		if (imap->br_state != imap2.br_state)
> > +			error = -EFSCORRUPTED;
> 
> Can we assert the length of each extent match here as well?

We can, but we asked the second bmapi_read to give us a mapping that
isn't any longer than imap->br_blockcount, and now we've just shortened
imap->br_blockcount to be no longer than what that second bmapi_read
returned.

I guess I don't mind adding a debugging assert just to make sure I
didn't screw up the math, though it seems excessive... :)

> > +		goto out_cancel;
> > +	}
> > +
> >  	if (real_extent) {
> > +		/* No reflinking if we're low on space */
> > +		error = xfs_reflink_ag_has_free_space(mp,
> > +				XFS_FSB_TO_AGNO(mp, imap->br_startblock));
> > +		if (error)
> > +			goto out_cancel;
> > +
> > +
> > +		/* Do we have enough quota? */
> >  		error = xfs_trans_reserve_quota_nblks(tp, ip,
> > -				irec->br_blockcount, 0, XFS_QMOPT_RES_REGBLKS);
> > +				imap->br_blockcount, 0, XFS_QMOPT_RES_REGBLKS);
> 
> Any reason this doesn't accommodate removal of real blocks?

The old code didn't. :)

Come to think of it, this under-reserves quota accounting, since we need
to reserve bmbt blocks too.

Urk, I guess that needs fixing too.

> 
> >  		if (error)
> >  			goto out_cancel;
> >  	}
> >  
> > -	trace_xfs_reflink_remap(ip, irec->br_startoff,
> > -				irec->br_blockcount, irec->br_startblock);
> > -
> > -	/* Unmap the old blocks in the data fork. */
> > -	rlen = unmap_len;
> > -	while (rlen) {
> > -		ASSERT(tp->t_firstblock == NULLFSBLOCK);
> > -		error = __xfs_bunmapi(tp, ip, destoff, &rlen, 0, 1);
> > -		if (error)
> > -			goto out_cancel;
> > -
> > +	if (xfs_bmap_is_mapped_extent(&imap2)) {
> >  		/*
> > -		 * Trim the extent to whatever got unmapped.
> > -		 * Remember, bunmapi works backwards.
> > +		 * If the extent we're unmapping is backed by storage (written
> > +		 * or not), unmap the extent and drop its refcount.
> >  		 */
> > -		uirec.br_startblock = irec->br_startblock + rlen;
> > -		uirec.br_startoff = irec->br_startoff + rlen;
> > -		uirec.br_blockcount = unmap_len - rlen;
> > -		uirec.br_state = irec->br_state;
> > -		unmap_len = rlen;
> > +		xfs_bmap_unmap_extent(tp, ip, &imap2);
> > +		xfs_refcount_decrease_extent(tp, &imap2);
> > +		ip_delta -= imap2.br_blockcount;
> > +	} else if (imap2.br_startblock == DELAYSTARTBLOCK) {
> > +		xfs_filblks_t	len = imap2.br_blockcount;
> >  
> > -		/* If this isn't a real mapping, we're done. */
> > -		if (!real_extent || uirec.br_blockcount == 0)
> > -			goto next_extent;
> > -
> > -		trace_xfs_reflink_remap(ip, uirec.br_startoff,
> > -				uirec.br_blockcount, uirec.br_startblock);
> > -
> > -		/* Update the refcount tree */
> > -		xfs_refcount_increase_extent(tp, &uirec);
> > -
> > -		/* Map the new blocks into the data fork. */
> > -		xfs_bmap_map_extent(tp, ip, &uirec);
> > +		/*
> > +		 * If the extent we're unmapping is a delalloc reservation,
> > +		 * we can use the regular bunmapi function to release the
> > +		 * incore state.  Dropping the delalloc reservation takes care
> > +		 * of the quota reservation for us.
> > +		 */
> > +		error = __xfs_bunmapi(NULL, ip, imap2.br_startoff, &len, 0, 1);
> > +		if (error)
> > +			goto out_cancel;
> > +		ASSERT(len == 0);
> > +	}
> >  
> > -		/* Update quota accounting. */
> > -		xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_BCOUNT,
> > -				uirec.br_blockcount);
> > +	/*
> > +	 * If the extent we're sharing is backed by written storage, increase
> > +	 * its refcount and map it into the file.
> > +	 */
> > +	if (real_extent) {
> > +		xfs_refcount_increase_extent(tp, imap);
> > +		xfs_bmap_map_extent(tp, ip, imap);
> > +		ip_delta += imap->br_blockcount;
> > +	}
> >  
> > -		/* Update dest isize if needed. */
> > -		newlen = XFS_FSB_TO_B(mp,
> > -				uirec.br_startoff + uirec.br_blockcount);
> > -		newlen = min_t(xfs_off_t, newlen, new_isize);
> > -		if (newlen > i_size_read(VFS_I(ip))) {
> > -			trace_xfs_reflink_update_inode_size(ip, newlen);
> > -			i_size_write(VFS_I(ip), newlen);
> > -			ip->i_d.di_size = newlen;
> > -			xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
> > -		}
> > +	xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_BCOUNT, ip_delta);
> >  
> > -next_extent:
> > -		/* Process all the deferred stuff. */
> > -		error = xfs_defer_finish(&tp);
> > -		if (error)
> > -			goto out_cancel;
> > +	/* Update dest isize if needed. */
> > +	newlen = XFS_FSB_TO_B(mp, imap2.br_startoff + imap2.br_blockcount);
> > +	newlen = min_t(xfs_off_t, newlen, new_isize);
> > +	if (newlen > i_size_read(VFS_I(ip))) {
> > +		trace_xfs_reflink_update_inode_size(ip, newlen);
> > +		i_size_write(VFS_I(ip), newlen);
> > +		ip->i_d.di_size = newlen;
> > +		xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
> >  	}
> >  
> > +	/* Commit everything and unlock. */
> >  	error = xfs_trans_commit(tp);
> > -	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > -	if (error)
> > -		goto out;
> > -	return 0;
> > +	goto out_unlock;
> 
> We probably shouldn't fire the _error() tracepoint in the successful
> exit case.

Oops, good catch. :)

> >  
> >  out_cancel:
> >  	xfs_trans_cancel(tp);
> > +out_unlock:
> >  	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> >  out:
> >  	trace_xfs_reflink_remap_extent_error(ip, error, _RET_IP_);
> >  	return error;
> >  }
> >  
> > -/*
> > - * Iteratively remap one file's extents (and holes) to another's.
> > - */
> > +/* Remap a range of one file to the other. */
> >  int
> >  xfs_reflink_remap_blocks(
> >  	struct xfs_inode	*src,
> > @@ -1116,25 +1134,22 @@ xfs_reflink_remap_blocks(
> >  	loff_t			*remapped)
> >  {
> >  	struct xfs_bmbt_irec	imap;
> > -	xfs_fileoff_t		srcoff;
> > -	xfs_fileoff_t		destoff;
> > +	struct xfs_mount	*mp = src->i_mount;
> > +	xfs_fileoff_t		srcoff = XFS_B_TO_FSBT(mp, pos_in);
> > +	xfs_fileoff_t		destoff = XFS_B_TO_FSBT(mp, pos_out);
> >  	xfs_filblks_t		len;
> > -	xfs_filblks_t		range_len;
> >  	xfs_filblks_t		remapped_len = 0;
> >  	xfs_off_t		new_isize = pos_out + remap_len;
> >  	int			nimaps;
> >  	int			error = 0;
> >  
> > -	destoff = XFS_B_TO_FSBT(src->i_mount, pos_out);
> > -	srcoff = XFS_B_TO_FSBT(src->i_mount, pos_in);
> > -	len = XFS_B_TO_FSB(src->i_mount, remap_len);
> > +	len = min_t(xfs_filblks_t, XFS_B_TO_FSB(mp, remap_len),
> > +			XFS_MAX_FILEOFF);
> >  
> > -	/* drange = (destoff, destoff + len); srange = (srcoff, srcoff + len) */
> > -	while (len) {
> > -		uint		lock_mode;
> > +	trace_xfs_reflink_remap_blocks(src, srcoff, len, dest, destoff);
> >  
> > -		trace_xfs_reflink_remap_blocks_loop(src, srcoff, len,
> > -				dest, destoff);
> > +	while (len > 0) {
> > +		unsigned int	lock_mode;
> >  
> >  		/* Read extent from the source file */
> >  		nimaps = 1;
> > @@ -1143,18 +1158,29 @@ xfs_reflink_remap_blocks(
> >  		xfs_iunlock(src, lock_mode);
> >  		if (error)
> >  			break;
> > -		ASSERT(nimaps == 1);
> > +#ifdef DEBUG
> > +		if (nimaps != 1 ||
> > +		    imap.br_startblock == DELAYSTARTBLOCK ||
> > +		    imap.br_startoff != srcoff) {
> > +			/*
> > +			 * We should never get no mapping or a delalloc extent
> > +			 * or something that doesn't match what we asked for,
> > +			 * since the caller flushed the source file data and
> > +			 * we hold the source file io/mmap locks.
> > +			 */
> > +			ASSERT(nimaps == 0);
> > +			ASSERT(imap.br_startblock != DELAYSTARTBLOCK);
> > +			ASSERT(imap.br_startoff == srcoff);
> > +			error = -EINVAL;
> > +			break;
> > +		}
> > +#endif
> 
> Same thing with the DEBUG check? It seems like at least some of these
> checks should be unconditional in the event the file is corrupted or
> something.

Doh... yeah, I guess the delalloc check really needs to be in there, in
case a previous write+fsync failed and left us some remnants.

> >  
> > -		trace_xfs_reflink_remap_imap(src, srcoff, len, XFS_DATA_FORK,
> > -				&imap);
> > +		trace_xfs_reflink_remap_extent1(src, &imap);
> 
> How about trace_xfs_reflink_remap_extent_[src|dest]() so trace data is a
> bit more readable?

<nod> I like the suggestion!

> >  
> > -		/* Translate imap into the destination file. */
> > -		range_len = imap.br_startoff + imap.br_blockcount - srcoff;
> > -		imap.br_startoff += destoff - srcoff;
> > -
> > -		/* Clear dest from destoff to the end of imap and map it in. */
> > -		error = xfs_reflink_remap_extent(dest, &imap, destoff,
> > -				new_isize);
> > +		/* Remap into the destination file at the given offset. */
> > +		imap.br_startoff = destoff;
> > +		error = xfs_reflink_remap_extent(dest, &imap, new_isize);
> >  		if (error)
> >  			break;
> >  
> > @@ -1164,10 +1190,10 @@ xfs_reflink_remap_blocks(
> >  		}
> >  
> >  		/* Advance drange/srange */
> > -		srcoff += range_len;
> > -		destoff += range_len;
> > -		len -= range_len;
> > -		remapped_len += range_len;
> > +		srcoff += imap.br_blockcount;
> > +		destoff += imap.br_blockcount;
> > +		len -= imap.br_blockcount;
> > +		remapped_len += imap.br_blockcount;
> 
> Much nicer iteration, indeed. ;)

Thanks! :)

--D

> Brian
> 
> >  	}
> >  
> >  	if (error)
> > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> > index 460136628a79..f205602c8ba9 100644
> > --- a/fs/xfs/xfs_trace.h
> > +++ b/fs/xfs/xfs_trace.h
> > @@ -3052,8 +3052,7 @@ DEFINE_EVENT(xfs_inode_irec_class, name, \
> >  DEFINE_INODE_EVENT(xfs_reflink_set_inode_flag);
> >  DEFINE_INODE_EVENT(xfs_reflink_unset_inode_flag);
> >  DEFINE_ITRUNC_EVENT(xfs_reflink_update_inode_size);
> > -DEFINE_IMAP_EVENT(xfs_reflink_remap_imap);
> > -TRACE_EVENT(xfs_reflink_remap_blocks_loop,
> > +TRACE_EVENT(xfs_reflink_remap_blocks,
> >  	TP_PROTO(struct xfs_inode *src, xfs_fileoff_t soffset,
> >  		 xfs_filblks_t len, struct xfs_inode *dest,
> >  		 xfs_fileoff_t doffset),
> > @@ -3084,59 +3083,14 @@ TRACE_EVENT(xfs_reflink_remap_blocks_loop,
> >  		  __entry->dest_ino,
> >  		  __entry->dest_lblk)
> >  );
> > -TRACE_EVENT(xfs_reflink_punch_range,
> > -	TP_PROTO(struct xfs_inode *ip, xfs_fileoff_t lblk,
> > -		 xfs_extlen_t len),
> > -	TP_ARGS(ip, lblk, len),
> > -	TP_STRUCT__entry(
> > -		__field(dev_t, dev)
> > -		__field(xfs_ino_t, ino)
> > -		__field(xfs_fileoff_t, lblk)
> > -		__field(xfs_extlen_t, len)
> > -	),
> > -	TP_fast_assign(
> > -		__entry->dev = VFS_I(ip)->i_sb->s_dev;
> > -		__entry->ino = ip->i_ino;
> > -		__entry->lblk = lblk;
> > -		__entry->len = len;
> > -	),
> > -	TP_printk("dev %d:%d ino 0x%llx lblk 0x%llx len 0x%x",
> > -		  MAJOR(__entry->dev), MINOR(__entry->dev),
> > -		  __entry->ino,
> > -		  __entry->lblk,
> > -		  __entry->len)
> > -);
> > -TRACE_EVENT(xfs_reflink_remap,
> > -	TP_PROTO(struct xfs_inode *ip, xfs_fileoff_t lblk,
> > -		 xfs_extlen_t len, xfs_fsblock_t new_pblk),
> > -	TP_ARGS(ip, lblk, len, new_pblk),
> > -	TP_STRUCT__entry(
> > -		__field(dev_t, dev)
> > -		__field(xfs_ino_t, ino)
> > -		__field(xfs_fileoff_t, lblk)
> > -		__field(xfs_extlen_t, len)
> > -		__field(xfs_fsblock_t, new_pblk)
> > -	),
> > -	TP_fast_assign(
> > -		__entry->dev = VFS_I(ip)->i_sb->s_dev;
> > -		__entry->ino = ip->i_ino;
> > -		__entry->lblk = lblk;
> > -		__entry->len = len;
> > -		__entry->new_pblk = new_pblk;
> > -	),
> > -	TP_printk("dev %d:%d ino 0x%llx lblk 0x%llx len 0x%x new_pblk %llu",
> > -		  MAJOR(__entry->dev), MINOR(__entry->dev),
> > -		  __entry->ino,
> > -		  __entry->lblk,
> > -		  __entry->len,
> > -		  __entry->new_pblk)
> > -);
> >  DEFINE_DOUBLE_IO_EVENT(xfs_reflink_remap_range);
> >  DEFINE_INODE_ERROR_EVENT(xfs_reflink_remap_range_error);
> >  DEFINE_INODE_ERROR_EVENT(xfs_reflink_set_inode_flag_error);
> >  DEFINE_INODE_ERROR_EVENT(xfs_reflink_update_inode_size_error);
> >  DEFINE_INODE_ERROR_EVENT(xfs_reflink_remap_blocks_error);
> >  DEFINE_INODE_ERROR_EVENT(xfs_reflink_remap_extent_error);
> > +DEFINE_INODE_IREC_EVENT(xfs_reflink_remap_extent1);
> > +DEFINE_INODE_IREC_EVENT(xfs_reflink_remap_extent2);
> >  
> >  /* dedupe tracepoints */
> >  DEFINE_DOUBLE_IO_EVENT(xfs_reflink_compare_extents);
> > diff --git a/fs/xfs/xfs_trans_dquot.c b/fs/xfs/xfs_trans_dquot.c
> > index c0f73b82c055..99511ff6222f 100644
> > --- a/fs/xfs/xfs_trans_dquot.c
> > +++ b/fs/xfs/xfs_trans_dquot.c
> > @@ -124,16 +124,17 @@ xfs_trans_dup_dqinfo(
> >   */
> >  void
> >  xfs_trans_mod_dquot_byino(
> > -	xfs_trans_t	*tp,
> > -	xfs_inode_t	*ip,
> > -	uint		field,
> > -	int64_t		delta)
> > +	struct xfs_trans	*tp,
> > +	struct xfs_inode	*ip,
> > +	uint			field,
> > +	int64_t			delta)
> >  {
> > -	xfs_mount_t	*mp = tp->t_mountp;
> > +	struct xfs_mount	*mp = tp->t_mountp;
> >  
> >  	if (!XFS_IS_QUOTA_RUNNING(mp) ||
> >  	    !XFS_IS_QUOTA_ON(mp) ||
> > -	    xfs_is_quota_inode(&mp->m_sb, ip->i_ino))
> > +	    xfs_is_quota_inode(&mp->m_sb, ip->i_ino) ||
> > +	    delta == 0)
> >  		return;
> >  
> >  	if (tp->t_dqinfo == NULL)
> > 
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/3] xfs: redesign the reflink remap loop to fix blkres depletion crash
  2020-06-24 15:09     ` Darrick J. Wong
@ 2020-06-24 17:07       ` Darrick J. Wong
  2020-06-24 18:16         ` Brian Foster
  0 siblings, 1 reply; 12+ messages in thread
From: Darrick J. Wong @ 2020-06-24 17:07 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, edwin

On Wed, Jun 24, 2020 at 08:09:15AM -0700, Darrick J. Wong wrote:
> On Wed, Jun 24, 2020 at 08:38:41AM -0400, Brian Foster wrote:
> > On Mon, Jun 22, 2020 at 09:01:35PM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > > The existing reflink remapping loop has some structural problems that
> > > need addressing:
> > > 
> > > The biggest problem is that we create one transaction for each extent in
> > > the source file without accounting for the number of mappings there are
> > > for the same range in the destination file.  In other words, we don't
> > > know the number of remap operations that will be necessary and we
> > > therefore cannot guess the block reservation required.  On highly
> > > fragmented filesystems (e.g. ones with active dedupe) we guess wrong,
> > > run out of block reservation, and fail.
> > > 
> > > The second problem is that we don't actually use the bmap intents to
> > > their full potential -- instead of calling bunmapi directly and having
> > > to deal with its backwards operation, we could call the deferred ops
> > > xfs_bmap_unmap_extent and xfs_refcount_decrease_extent instead.  This
> > > makes the frontend loop much simpler.
> > > 
> > > A third (and more minor) problem is that we aren't even clever enough to
> > > skip the whole remapping thing if the source and destination ranges
> > > point to the same physical extents.
> > > 
> > 
> > I'm wondering if this all really has to be done in one patch. The last
> > bit at least sounds like an optimization from the description.
> 
> <nod> I can split out that and the quota reservation fixes.
> 
> > > Solve all of these problems by refactoring the remapping loops so that
> > > we only perform one remapping operation per transaction, and each
> > > operation only tries to remap a single extent from source to dest.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > ---
> > >  fs/xfs/libxfs/xfs_bmap.h |   13 ++-
> > >  fs/xfs/xfs_reflink.c     |  234 ++++++++++++++++++++++++++--------------------
> > >  fs/xfs/xfs_trace.h       |   52 +---------
> > >  fs/xfs/xfs_trans_dquot.c |   13 +--
> > >  4 files changed, 149 insertions(+), 163 deletions(-)
> > > 
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
> > > index 6028a3c825ba..3498b4f75f71 100644
> > > --- a/fs/xfs/libxfs/xfs_bmap.h
> > > +++ b/fs/xfs/libxfs/xfs_bmap.h
> > > @@ -158,6 +158,13 @@ static inline int xfs_bmapi_whichfork(int bmapi_flags)
> > >  	{ BMAP_ATTRFORK,	"ATTR" }, \
> > >  	{ BMAP_COWFORK,		"COW" }
> > >  
> > > +/* Return true if the extent is an allocated extent, written or not. */
> > > +static inline bool xfs_bmap_is_mapped_extent(struct xfs_bmbt_irec *irec)
> > > +{
> > > +	return irec->br_startblock != HOLESTARTBLOCK &&
> > > +		irec->br_startblock != DELAYSTARTBLOCK &&
> > > +		!isnullstartblock(irec->br_startblock);
> > > +}
> > 
> > Heh, I was going to suggest to call this _is_real_extent(), but I see
> > the helper below already uses that name. I do think the name correlates
> > better with this helper and perhaps the one below should be called
> > something like xfs_bmap_is_written_extent(). Hm? Either way, the
> > "mapped" name is kind of vague and brings back memories of buffer heads.
> 
> <nod> I originally wasn't going to change the name on the second helper,
> but as there are only three callers currently it's probably easier to do
> that now.
> 
> > >  
> > >  /*
> > >   * Return true if the extent is a real, allocated extent, or false if it is  a
> > > @@ -165,10 +172,8 @@ static inline int xfs_bmapi_whichfork(int bmapi_flags)
> > >   */
> > >  static inline bool xfs_bmap_is_real_extent(struct xfs_bmbt_irec *irec)
> > >  {
> > > -	return irec->br_state != XFS_EXT_UNWRITTEN &&
> > > -		irec->br_startblock != HOLESTARTBLOCK &&
> > > -		irec->br_startblock != DELAYSTARTBLOCK &&
> > > -		!isnullstartblock(irec->br_startblock);
> > > +	return xfs_bmap_is_mapped_extent(irec) &&
> > > +	       irec->br_state != XFS_EXT_UNWRITTEN;
> > >  }
> > >  
> > >  /*
> > > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > > index 107bf2a2f344..f50a8c2f21a5 100644
> > > --- a/fs/xfs/xfs_reflink.c
> > > +++ b/fs/xfs/xfs_reflink.c
> > > @@ -984,40 +984,27 @@ xfs_reflink_ag_has_free_space(
> > >  }
> > >  
> > >  /*
> > > - * Unmap a range of blocks from a file, then map other blocks into the hole.
> > > - * The range to unmap is (destoff : destoff + srcioff + irec->br_blockcount).
> > > - * The extent irec is mapped into dest at irec->br_startoff.
> > > + * Remap the given extent into the file at the given offset.  The imap
> > > + * blockcount will be set to the number of blocks that were actually remapped.
> > >   */
> > >  STATIC int
> > >  xfs_reflink_remap_extent(
> > >  	struct xfs_inode	*ip,
> > > -	struct xfs_bmbt_irec	*irec,
> > > -	xfs_fileoff_t		destoff,
> > > +	struct xfs_bmbt_irec	*imap,
> > >  	xfs_off_t		new_isize)
> > >  {
> > > +	struct xfs_bmbt_irec	imap2;
> > >  	struct xfs_mount	*mp = ip->i_mount;
> > > -	bool			real_extent = xfs_bmap_is_real_extent(irec);
> > >  	struct xfs_trans	*tp;
> > > -	unsigned int		resblks;
> > > -	struct xfs_bmbt_irec	uirec;
> > > -	xfs_filblks_t		rlen;
> > > -	xfs_filblks_t		unmap_len;
> > >  	xfs_off_t		newlen;
> > > +	int64_t			ip_delta = 0;
> > > +	unsigned int		resblks;
> > > +	bool			real_extent = xfs_bmap_is_real_extent(imap);
> > > +	int			nimaps;
> > >  	int			error;
> > >  
> > > -	unmap_len = irec->br_startoff + irec->br_blockcount - destoff;
> > > -	trace_xfs_reflink_punch_range(ip, destoff, unmap_len);
> > > -
> > > -	/* No reflinking if we're low on space */
> > > -	if (real_extent) {
> > > -		error = xfs_reflink_ag_has_free_space(mp,
> > > -				XFS_FSB_TO_AGNO(mp, irec->br_startblock));
> > > -		if (error)
> > > -			goto out;
> > > -	}
> > > -
> > >  	/* Start a rolling transaction to switch the mappings */
> > > -	resblks = XFS_EXTENTADD_SPACE_RES(ip->i_mount, XFS_DATA_FORK);
> > > +	resblks = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
> > >  	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0, 0, &tp);
> > >  	if (error)
> > >  		goto out;
> > > @@ -1025,87 +1012,118 @@ xfs_reflink_remap_extent(
> > >  	xfs_ilock(ip, XFS_ILOCK_EXCL);
> > >  	xfs_trans_ijoin(tp, ip, 0);
> > >  
> > > -	/* If we're not just clearing space, then do we have enough quota? */
> > > +	/* Read extent from the second file */
> > > +	nimaps = 1;
> > > +	error = xfs_bmapi_read(ip, imap->br_startoff, imap->br_blockcount,
> > > +			&imap2, &nimaps, 0);
> > > +	if (error)
> > > +		goto out_cancel;
> > > +#ifdef DEBUG
> > > +	if (nimaps != 1 ||
> > > +	    imap2.br_startoff != imap->br_startoff) {
> > > +		/*
> > > +		 * We should never get no mapping or something that doesn't
> > > +		 * match what we asked for.
> > > +		 */
> > > +		ASSERT(0);
> > > +		error = -EINVAL;
> > > +		goto out_cancel;
> > > +	}
> > > +#endif
> > 
> > Why the DEBUG?
> 
> Christoph wondered (on the atomic extent swap series) when
> xfs_bmapi_read would ever return nimaps==0 or an extent that doesn't
> match at least the first block that we asked for.
> 
> I'm /pretty/ sure that will never happen since the full bmapi read
> function will hand us back a "HOLE" mapping if it doesn't find anything,
> so I left those parts in as debugging checks.
> 
> > > +
> > > +	/*
> > > +	 * We can only remap as many blocks as the smaller of the two extent
> > > +	 * maps, because we can only remap one extent at a time.
> > > +	 */
> > > +	imap->br_blockcount = min(imap->br_blockcount, imap2.br_blockcount);
> > > +
> > > +	trace_xfs_reflink_remap_extent2(ip, &imap2);
> > > +
> > > +	/*
> > > +	 * Two extents mapped to the same physical block must not have
> > > +	 * different states; that's filesystem corruption.  Move on to the next
> > > +	 * extent if they're both holes or both the same physical extent.
> > > +	 */
> > > +	if (imap->br_startblock == imap2.br_startblock) {
> > > +		if (imap->br_state != imap2.br_state)
> > > +			error = -EFSCORRUPTED;
> > 
> > Can we assert the length of each extent match here as well?
> 
> We can, but we asked the second bmapi_read to give us a mapping that
> isn't any longer than imap->br_blockcount, and now we've just shortened
> imap->br_blockcount to be no longer than what that second bmapi_read
> returned.
> 
> I guess I don't mind adding a debugging assert just to make sure I
> didn't screw up the math, though it seems excessive... :)
> 
> > > +		goto out_cancel;
> > > +	}
> > > +
> > >  	if (real_extent) {
> > > +		/* No reflinking if we're low on space */
> > > +		error = xfs_reflink_ag_has_free_space(mp,
> > > +				XFS_FSB_TO_AGNO(mp, imap->br_startblock));
> > > +		if (error)
> > > +			goto out_cancel;
> > > +
> > > +
> > > +		/* Do we have enough quota? */
> > >  		error = xfs_trans_reserve_quota_nblks(tp, ip,
> > > -				irec->br_blockcount, 0, XFS_QMOPT_RES_REGBLKS);
> > > +				imap->br_blockcount, 0, XFS_QMOPT_RES_REGBLKS);
> > 
> > Any reason this doesn't accommodate removal of real blocks?
> 
> The old code didn't. :)
> 
> Come to think of it, this under-reserves quota accounting, since we need
> to reserve bmbt blocks too.
> 
> Urk, I guess that needs fixing too.

I remembered that unmapping the source extent frees the quota count back
to the dquot, not to the transaction quota reservation.  Maybe we need
to add a XFS_TRANS_RES_FDBLKS-like thing to quota so that the quota
counter ping-pong on't cause us to blow out the quota, and then we can
relax the quota requirements of a reflink call?

--D

> > 
> > >  		if (error)
> > >  			goto out_cancel;
> > >  	}
> > >  
> > > -	trace_xfs_reflink_remap(ip, irec->br_startoff,
> > > -				irec->br_blockcount, irec->br_startblock);
> > > -
> > > -	/* Unmap the old blocks in the data fork. */
> > > -	rlen = unmap_len;
> > > -	while (rlen) {
> > > -		ASSERT(tp->t_firstblock == NULLFSBLOCK);
> > > -		error = __xfs_bunmapi(tp, ip, destoff, &rlen, 0, 1);
> > > -		if (error)
> > > -			goto out_cancel;
> > > -
> > > +	if (xfs_bmap_is_mapped_extent(&imap2)) {
> > >  		/*
> > > -		 * Trim the extent to whatever got unmapped.
> > > -		 * Remember, bunmapi works backwards.
> > > +		 * If the extent we're unmapping is backed by storage (written
> > > +		 * or not), unmap the extent and drop its refcount.
> > >  		 */
> > > -		uirec.br_startblock = irec->br_startblock + rlen;
> > > -		uirec.br_startoff = irec->br_startoff + rlen;
> > > -		uirec.br_blockcount = unmap_len - rlen;
> > > -		uirec.br_state = irec->br_state;
> > > -		unmap_len = rlen;
> > > +		xfs_bmap_unmap_extent(tp, ip, &imap2);
> > > +		xfs_refcount_decrease_extent(tp, &imap2);
> > > +		ip_delta -= imap2.br_blockcount;
> > > +	} else if (imap2.br_startblock == DELAYSTARTBLOCK) {
> > > +		xfs_filblks_t	len = imap2.br_blockcount;
> > >  
> > > -		/* If this isn't a real mapping, we're done. */
> > > -		if (!real_extent || uirec.br_blockcount == 0)
> > > -			goto next_extent;
> > > -
> > > -		trace_xfs_reflink_remap(ip, uirec.br_startoff,
> > > -				uirec.br_blockcount, uirec.br_startblock);
> > > -
> > > -		/* Update the refcount tree */
> > > -		xfs_refcount_increase_extent(tp, &uirec);
> > > -
> > > -		/* Map the new blocks into the data fork. */
> > > -		xfs_bmap_map_extent(tp, ip, &uirec);
> > > +		/*
> > > +		 * If the extent we're unmapping is a delalloc reservation,
> > > +		 * we can use the regular bunmapi function to release the
> > > +		 * incore state.  Dropping the delalloc reservation takes care
> > > +		 * of the quota reservation for us.
> > > +		 */
> > > +		error = __xfs_bunmapi(NULL, ip, imap2.br_startoff, &len, 0, 1);
> > > +		if (error)
> > > +			goto out_cancel;
> > > +		ASSERT(len == 0);
> > > +	}
> > >  
> > > -		/* Update quota accounting. */
> > > -		xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_BCOUNT,
> > > -				uirec.br_blockcount);
> > > +	/*
> > > +	 * If the extent we're sharing is backed by written storage, increase
> > > +	 * its refcount and map it into the file.
> > > +	 */
> > > +	if (real_extent) {
> > > +		xfs_refcount_increase_extent(tp, imap);
> > > +		xfs_bmap_map_extent(tp, ip, imap);
> > > +		ip_delta += imap->br_blockcount;
> > > +	}
> > >  
> > > -		/* Update dest isize if needed. */
> > > -		newlen = XFS_FSB_TO_B(mp,
> > > -				uirec.br_startoff + uirec.br_blockcount);
> > > -		newlen = min_t(xfs_off_t, newlen, new_isize);
> > > -		if (newlen > i_size_read(VFS_I(ip))) {
> > > -			trace_xfs_reflink_update_inode_size(ip, newlen);
> > > -			i_size_write(VFS_I(ip), newlen);
> > > -			ip->i_d.di_size = newlen;
> > > -			xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
> > > -		}
> > > +	xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_BCOUNT, ip_delta);
> > >  
> > > -next_extent:
> > > -		/* Process all the deferred stuff. */
> > > -		error = xfs_defer_finish(&tp);
> > > -		if (error)
> > > -			goto out_cancel;
> > > +	/* Update dest isize if needed. */
> > > +	newlen = XFS_FSB_TO_B(mp, imap2.br_startoff + imap2.br_blockcount);
> > > +	newlen = min_t(xfs_off_t, newlen, new_isize);
> > > +	if (newlen > i_size_read(VFS_I(ip))) {
> > > +		trace_xfs_reflink_update_inode_size(ip, newlen);
> > > +		i_size_write(VFS_I(ip), newlen);
> > > +		ip->i_d.di_size = newlen;
> > > +		xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
> > >  	}
> > >  
> > > +	/* Commit everything and unlock. */
> > >  	error = xfs_trans_commit(tp);
> > > -	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > -	if (error)
> > > -		goto out;
> > > -	return 0;
> > > +	goto out_unlock;
> > 
> > We probably shouldn't fire the _error() tracepoint in the successful
> > exit case.
> 
> Oops, good catch. :)
> 
> > >  
> > >  out_cancel:
> > >  	xfs_trans_cancel(tp);
> > > +out_unlock:
> > >  	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > >  out:
> > >  	trace_xfs_reflink_remap_extent_error(ip, error, _RET_IP_);
> > >  	return error;
> > >  }
> > >  
> > > -/*
> > > - * Iteratively remap one file's extents (and holes) to another's.
> > > - */
> > > +/* Remap a range of one file to the other. */
> > >  int
> > >  xfs_reflink_remap_blocks(
> > >  	struct xfs_inode	*src,
> > > @@ -1116,25 +1134,22 @@ xfs_reflink_remap_blocks(
> > >  	loff_t			*remapped)
> > >  {
> > >  	struct xfs_bmbt_irec	imap;
> > > -	xfs_fileoff_t		srcoff;
> > > -	xfs_fileoff_t		destoff;
> > > +	struct xfs_mount	*mp = src->i_mount;
> > > +	xfs_fileoff_t		srcoff = XFS_B_TO_FSBT(mp, pos_in);
> > > +	xfs_fileoff_t		destoff = XFS_B_TO_FSBT(mp, pos_out);
> > >  	xfs_filblks_t		len;
> > > -	xfs_filblks_t		range_len;
> > >  	xfs_filblks_t		remapped_len = 0;
> > >  	xfs_off_t		new_isize = pos_out + remap_len;
> > >  	int			nimaps;
> > >  	int			error = 0;
> > >  
> > > -	destoff = XFS_B_TO_FSBT(src->i_mount, pos_out);
> > > -	srcoff = XFS_B_TO_FSBT(src->i_mount, pos_in);
> > > -	len = XFS_B_TO_FSB(src->i_mount, remap_len);
> > > +	len = min_t(xfs_filblks_t, XFS_B_TO_FSB(mp, remap_len),
> > > +			XFS_MAX_FILEOFF);
> > >  
> > > -	/* drange = (destoff, destoff + len); srange = (srcoff, srcoff + len) */
> > > -	while (len) {
> > > -		uint		lock_mode;
> > > +	trace_xfs_reflink_remap_blocks(src, srcoff, len, dest, destoff);
> > >  
> > > -		trace_xfs_reflink_remap_blocks_loop(src, srcoff, len,
> > > -				dest, destoff);
> > > +	while (len > 0) {
> > > +		unsigned int	lock_mode;
> > >  
> > >  		/* Read extent from the source file */
> > >  		nimaps = 1;
> > > @@ -1143,18 +1158,29 @@ xfs_reflink_remap_blocks(
> > >  		xfs_iunlock(src, lock_mode);
> > >  		if (error)
> > >  			break;
> > > -		ASSERT(nimaps == 1);
> > > +#ifdef DEBUG
> > > +		if (nimaps != 1 ||
> > > +		    imap.br_startblock == DELAYSTARTBLOCK ||
> > > +		    imap.br_startoff != srcoff) {
> > > +			/*
> > > +			 * We should never get no mapping or a delalloc extent
> > > +			 * or something that doesn't match what we asked for,
> > > +			 * since the caller flushed the source file data and
> > > +			 * we hold the source file io/mmap locks.
> > > +			 */
> > > +			ASSERT(nimaps == 0);
> > > +			ASSERT(imap.br_startblock != DELAYSTARTBLOCK);
> > > +			ASSERT(imap.br_startoff == srcoff);
> > > +			error = -EINVAL;
> > > +			break;
> > > +		}
> > > +#endif
> > 
> > Same thing with the DEBUG check? It seems like at least some of these
> > checks should be unconditional in the event the file is corrupted or
> > something.
> 
> Doh... yeah, I guess the delalloc check really needs to be in there, in
> case a previous write+fsync failed and left us some remnants.
> 
> > >  
> > > -		trace_xfs_reflink_remap_imap(src, srcoff, len, XFS_DATA_FORK,
> > > -				&imap);
> > > +		trace_xfs_reflink_remap_extent1(src, &imap);
> > 
> > How about trace_xfs_reflink_remap_extent_[src|dest]() so trace data is a
> > bit more readable?
> 
> <nod> I like the suggestion!
> 
> > >  
> > > -		/* Translate imap into the destination file. */
> > > -		range_len = imap.br_startoff + imap.br_blockcount - srcoff;
> > > -		imap.br_startoff += destoff - srcoff;
> > > -
> > > -		/* Clear dest from destoff to the end of imap and map it in. */
> > > -		error = xfs_reflink_remap_extent(dest, &imap, destoff,
> > > -				new_isize);
> > > +		/* Remap into the destination file at the given offset. */
> > > +		imap.br_startoff = destoff;
> > > +		error = xfs_reflink_remap_extent(dest, &imap, new_isize);
> > >  		if (error)
> > >  			break;
> > >  
> > > @@ -1164,10 +1190,10 @@ xfs_reflink_remap_blocks(
> > >  		}
> > >  
> > >  		/* Advance drange/srange */
> > > -		srcoff += range_len;
> > > -		destoff += range_len;
> > > -		len -= range_len;
> > > -		remapped_len += range_len;
> > > +		srcoff += imap.br_blockcount;
> > > +		destoff += imap.br_blockcount;
> > > +		len -= imap.br_blockcount;
> > > +		remapped_len += imap.br_blockcount;
> > 
> > Much nicer iteration, indeed. ;)
> 
> Thanks! :)
> 
> --D
> 
> > Brian
> > 
> > >  	}
> > >  
> > >  	if (error)
> > > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> > > index 460136628a79..f205602c8ba9 100644
> > > --- a/fs/xfs/xfs_trace.h
> > > +++ b/fs/xfs/xfs_trace.h
> > > @@ -3052,8 +3052,7 @@ DEFINE_EVENT(xfs_inode_irec_class, name, \
> > >  DEFINE_INODE_EVENT(xfs_reflink_set_inode_flag);
> > >  DEFINE_INODE_EVENT(xfs_reflink_unset_inode_flag);
> > >  DEFINE_ITRUNC_EVENT(xfs_reflink_update_inode_size);
> > > -DEFINE_IMAP_EVENT(xfs_reflink_remap_imap);
> > > -TRACE_EVENT(xfs_reflink_remap_blocks_loop,
> > > +TRACE_EVENT(xfs_reflink_remap_blocks,
> > >  	TP_PROTO(struct xfs_inode *src, xfs_fileoff_t soffset,
> > >  		 xfs_filblks_t len, struct xfs_inode *dest,
> > >  		 xfs_fileoff_t doffset),
> > > @@ -3084,59 +3083,14 @@ TRACE_EVENT(xfs_reflink_remap_blocks_loop,
> > >  		  __entry->dest_ino,
> > >  		  __entry->dest_lblk)
> > >  );
> > > -TRACE_EVENT(xfs_reflink_punch_range,
> > > -	TP_PROTO(struct xfs_inode *ip, xfs_fileoff_t lblk,
> > > -		 xfs_extlen_t len),
> > > -	TP_ARGS(ip, lblk, len),
> > > -	TP_STRUCT__entry(
> > > -		__field(dev_t, dev)
> > > -		__field(xfs_ino_t, ino)
> > > -		__field(xfs_fileoff_t, lblk)
> > > -		__field(xfs_extlen_t, len)
> > > -	),
> > > -	TP_fast_assign(
> > > -		__entry->dev = VFS_I(ip)->i_sb->s_dev;
> > > -		__entry->ino = ip->i_ino;
> > > -		__entry->lblk = lblk;
> > > -		__entry->len = len;
> > > -	),
> > > -	TP_printk("dev %d:%d ino 0x%llx lblk 0x%llx len 0x%x",
> > > -		  MAJOR(__entry->dev), MINOR(__entry->dev),
> > > -		  __entry->ino,
> > > -		  __entry->lblk,
> > > -		  __entry->len)
> > > -);
> > > -TRACE_EVENT(xfs_reflink_remap,
> > > -	TP_PROTO(struct xfs_inode *ip, xfs_fileoff_t lblk,
> > > -		 xfs_extlen_t len, xfs_fsblock_t new_pblk),
> > > -	TP_ARGS(ip, lblk, len, new_pblk),
> > > -	TP_STRUCT__entry(
> > > -		__field(dev_t, dev)
> > > -		__field(xfs_ino_t, ino)
> > > -		__field(xfs_fileoff_t, lblk)
> > > -		__field(xfs_extlen_t, len)
> > > -		__field(xfs_fsblock_t, new_pblk)
> > > -	),
> > > -	TP_fast_assign(
> > > -		__entry->dev = VFS_I(ip)->i_sb->s_dev;
> > > -		__entry->ino = ip->i_ino;
> > > -		__entry->lblk = lblk;
> > > -		__entry->len = len;
> > > -		__entry->new_pblk = new_pblk;
> > > -	),
> > > -	TP_printk("dev %d:%d ino 0x%llx lblk 0x%llx len 0x%x new_pblk %llu",
> > > -		  MAJOR(__entry->dev), MINOR(__entry->dev),
> > > -		  __entry->ino,
> > > -		  __entry->lblk,
> > > -		  __entry->len,
> > > -		  __entry->new_pblk)
> > > -);
> > >  DEFINE_DOUBLE_IO_EVENT(xfs_reflink_remap_range);
> > >  DEFINE_INODE_ERROR_EVENT(xfs_reflink_remap_range_error);
> > >  DEFINE_INODE_ERROR_EVENT(xfs_reflink_set_inode_flag_error);
> > >  DEFINE_INODE_ERROR_EVENT(xfs_reflink_update_inode_size_error);
> > >  DEFINE_INODE_ERROR_EVENT(xfs_reflink_remap_blocks_error);
> > >  DEFINE_INODE_ERROR_EVENT(xfs_reflink_remap_extent_error);
> > > +DEFINE_INODE_IREC_EVENT(xfs_reflink_remap_extent1);
> > > +DEFINE_INODE_IREC_EVENT(xfs_reflink_remap_extent2);
> > >  
> > >  /* dedupe tracepoints */
> > >  DEFINE_DOUBLE_IO_EVENT(xfs_reflink_compare_extents);
> > > diff --git a/fs/xfs/xfs_trans_dquot.c b/fs/xfs/xfs_trans_dquot.c
> > > index c0f73b82c055..99511ff6222f 100644
> > > --- a/fs/xfs/xfs_trans_dquot.c
> > > +++ b/fs/xfs/xfs_trans_dquot.c
> > > @@ -124,16 +124,17 @@ xfs_trans_dup_dqinfo(
> > >   */
> > >  void
> > >  xfs_trans_mod_dquot_byino(
> > > -	xfs_trans_t	*tp,
> > > -	xfs_inode_t	*ip,
> > > -	uint		field,
> > > -	int64_t		delta)
> > > +	struct xfs_trans	*tp,
> > > +	struct xfs_inode	*ip,
> > > +	uint			field,
> > > +	int64_t			delta)
> > >  {
> > > -	xfs_mount_t	*mp = tp->t_mountp;
> > > +	struct xfs_mount	*mp = tp->t_mountp;
> > >  
> > >  	if (!XFS_IS_QUOTA_RUNNING(mp) ||
> > >  	    !XFS_IS_QUOTA_ON(mp) ||
> > > -	    xfs_is_quota_inode(&mp->m_sb, ip->i_ino))
> > > +	    xfs_is_quota_inode(&mp->m_sb, ip->i_ino) ||
> > > +	    delta == 0)
> > >  		return;
> > >  
> > >  	if (tp->t_dqinfo == NULL)
> > > 
> > 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/3] xfs: redesign the reflink remap loop to fix blkres depletion crash
  2020-06-24 17:07       ` Darrick J. Wong
@ 2020-06-24 18:16         ` Brian Foster
  2020-06-24 18:35           ` Darrick J. Wong
  0 siblings, 1 reply; 12+ messages in thread
From: Brian Foster @ 2020-06-24 18:16 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, edwin

On Wed, Jun 24, 2020 at 10:07:38AM -0700, Darrick J. Wong wrote:
> On Wed, Jun 24, 2020 at 08:09:15AM -0700, Darrick J. Wong wrote:
> > On Wed, Jun 24, 2020 at 08:38:41AM -0400, Brian Foster wrote:
> > > On Mon, Jun 22, 2020 at 09:01:35PM -0700, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > 
> > > > The existing reflink remapping loop has some structural problems that
> > > > need addressing:
> > > > 
> > > > The biggest problem is that we create one transaction for each extent in
> > > > the source file without accounting for the number of mappings there are
> > > > for the same range in the destination file.  In other words, we don't
> > > > know the number of remap operations that will be necessary and we
> > > > therefore cannot guess the block reservation required.  On highly
> > > > fragmented filesystems (e.g. ones with active dedupe) we guess wrong,
> > > > run out of block reservation, and fail.
> > > > 
> > > > The second problem is that we don't actually use the bmap intents to
> > > > their full potential -- instead of calling bunmapi directly and having
> > > > to deal with its backwards operation, we could call the deferred ops
> > > > xfs_bmap_unmap_extent and xfs_refcount_decrease_extent instead.  This
> > > > makes the frontend loop much simpler.
> > > > 
> > > > A third (and more minor) problem is that we aren't even clever enough to
> > > > skip the whole remapping thing if the source and destination ranges
> > > > point to the same physical extents.
> > > > 
> > > 
> > > I'm wondering if this all really has to be done in one patch. The last
> > > bit at least sounds like an optimization from the description.
> > 
> > <nod> I can split out that and the quota reservation fixes.
> > 
> > > > Solve all of these problems by refactoring the remapping loops so that
> > > > we only perform one remapping operation per transaction, and each
> > > > operation only tries to remap a single extent from source to dest.
> > > > 
> > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > ---
> > > >  fs/xfs/libxfs/xfs_bmap.h |   13 ++-
> > > >  fs/xfs/xfs_reflink.c     |  234 ++++++++++++++++++++++++++--------------------
> > > >  fs/xfs/xfs_trace.h       |   52 +---------
> > > >  fs/xfs/xfs_trans_dquot.c |   13 +--
> > > >  4 files changed, 149 insertions(+), 163 deletions(-)
> > > > 
> > > > 
> > > > diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
> > > > index 6028a3c825ba..3498b4f75f71 100644
> > > > --- a/fs/xfs/libxfs/xfs_bmap.h
> > > > +++ b/fs/xfs/libxfs/xfs_bmap.h
> > > > @@ -158,6 +158,13 @@ static inline int xfs_bmapi_whichfork(int bmapi_flags)
> > > >  	{ BMAP_ATTRFORK,	"ATTR" }, \
> > > >  	{ BMAP_COWFORK,		"COW" }
> > > >  
> > > > +/* Return true if the extent is an allocated extent, written or not. */
> > > > +static inline bool xfs_bmap_is_mapped_extent(struct xfs_bmbt_irec *irec)
> > > > +{
> > > > +	return irec->br_startblock != HOLESTARTBLOCK &&
> > > > +		irec->br_startblock != DELAYSTARTBLOCK &&
> > > > +		!isnullstartblock(irec->br_startblock);
> > > > +}
> > > 
> > > Heh, I was going to suggest to call this _is_real_extent(), but I see
> > > the helper below already uses that name. I do think the name correlates
> > > better with this helper and perhaps the one below should be called
> > > something like xfs_bmap_is_written_extent(). Hm? Either way, the
> > > "mapped" name is kind of vague and brings back memories of buffer heads.
> > 
> > <nod> I originally wasn't going to change the name on the second helper,
> > but as there are only three callers currently it's probably easier to do
> > that now.
> > 
> > > >  
> > > >  /*
> > > >   * Return true if the extent is a real, allocated extent, or false if it is  a
> > > > @@ -165,10 +172,8 @@ static inline int xfs_bmapi_whichfork(int bmapi_flags)
> > > >   */
> > > >  static inline bool xfs_bmap_is_real_extent(struct xfs_bmbt_irec *irec)
> > > >  {
> > > > -	return irec->br_state != XFS_EXT_UNWRITTEN &&
> > > > -		irec->br_startblock != HOLESTARTBLOCK &&
> > > > -		irec->br_startblock != DELAYSTARTBLOCK &&
> > > > -		!isnullstartblock(irec->br_startblock);
> > > > +	return xfs_bmap_is_mapped_extent(irec) &&
> > > > +	       irec->br_state != XFS_EXT_UNWRITTEN;
> > > >  }
> > > >  
> > > >  /*
> > > > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > > > index 107bf2a2f344..f50a8c2f21a5 100644
> > > > --- a/fs/xfs/xfs_reflink.c
> > > > +++ b/fs/xfs/xfs_reflink.c
> > > > @@ -984,40 +984,27 @@ xfs_reflink_ag_has_free_space(
> > > >  }
> > > >  
> > > >  /*
> > > > - * Unmap a range of blocks from a file, then map other blocks into the hole.
> > > > - * The range to unmap is (destoff : destoff + srcioff + irec->br_blockcount).
> > > > - * The extent irec is mapped into dest at irec->br_startoff.
> > > > + * Remap the given extent into the file at the given offset.  The imap
> > > > + * blockcount will be set to the number of blocks that were actually remapped.
> > > >   */
> > > >  STATIC int
> > > >  xfs_reflink_remap_extent(
> > > >  	struct xfs_inode	*ip,
> > > > -	struct xfs_bmbt_irec	*irec,
> > > > -	xfs_fileoff_t		destoff,
> > > > +	struct xfs_bmbt_irec	*imap,
> > > >  	xfs_off_t		new_isize)
> > > >  {
> > > > +	struct xfs_bmbt_irec	imap2;
> > > >  	struct xfs_mount	*mp = ip->i_mount;
> > > > -	bool			real_extent = xfs_bmap_is_real_extent(irec);
> > > >  	struct xfs_trans	*tp;
> > > > -	unsigned int		resblks;
> > > > -	struct xfs_bmbt_irec	uirec;
> > > > -	xfs_filblks_t		rlen;
> > > > -	xfs_filblks_t		unmap_len;
> > > >  	xfs_off_t		newlen;
> > > > +	int64_t			ip_delta = 0;
> > > > +	unsigned int		resblks;
> > > > +	bool			real_extent = xfs_bmap_is_real_extent(imap);
> > > > +	int			nimaps;
> > > >  	int			error;
> > > >  
> > > > -	unmap_len = irec->br_startoff + irec->br_blockcount - destoff;
> > > > -	trace_xfs_reflink_punch_range(ip, destoff, unmap_len);
> > > > -
> > > > -	/* No reflinking if we're low on space */
> > > > -	if (real_extent) {
> > > > -		error = xfs_reflink_ag_has_free_space(mp,
> > > > -				XFS_FSB_TO_AGNO(mp, irec->br_startblock));
> > > > -		if (error)
> > > > -			goto out;
> > > > -	}
> > > > -
> > > >  	/* Start a rolling transaction to switch the mappings */
> > > > -	resblks = XFS_EXTENTADD_SPACE_RES(ip->i_mount, XFS_DATA_FORK);
> > > > +	resblks = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
> > > >  	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0, 0, &tp);
> > > >  	if (error)
> > > >  		goto out;
> > > > @@ -1025,87 +1012,118 @@ xfs_reflink_remap_extent(
> > > >  	xfs_ilock(ip, XFS_ILOCK_EXCL);
> > > >  	xfs_trans_ijoin(tp, ip, 0);
> > > >  
> > > > -	/* If we're not just clearing space, then do we have enough quota? */
> > > > +	/* Read extent from the second file */
> > > > +	nimaps = 1;
> > > > +	error = xfs_bmapi_read(ip, imap->br_startoff, imap->br_blockcount,
> > > > +			&imap2, &nimaps, 0);
> > > > +	if (error)
> > > > +		goto out_cancel;
> > > > +#ifdef DEBUG
> > > > +	if (nimaps != 1 ||
> > > > +	    imap2.br_startoff != imap->br_startoff) {
> > > > +		/*
> > > > +		 * We should never get no mapping or something that doesn't
> > > > +		 * match what we asked for.
> > > > +		 */
> > > > +		ASSERT(0);
> > > > +		error = -EINVAL;
> > > > +		goto out_cancel;
> > > > +	}
> > > > +#endif
> > > 
> > > Why the DEBUG?
> > 
> > Christoph wondered (on the atomic extent swap series) when
> > xfs_bmapi_read would ever return nimaps==0 or an extent that doesn't
> > match at least the first block that we asked for.
> > 
> > I'm /pretty/ sure that will never happen since the full bmapi read
> > function will hand us back a "HOLE" mapping if it doesn't find anything,
> > so I left those parts in as debugging checks.
> > 

Oh. Maybe this is better served as a single assert? I.e.:

	ASSERT(nimaps == 1 && imap2.br_startoff == imap->br_startoff);

I guess I'm mostly just not a fan of the DEBUG idefs than worried about
the extra error checking logic. ISTM we should either include these
checks unconditionally if it's reasonably possible or leave it to an
assert if it's more of a future developer aide or some such..

> > > > +
> > > > +	/*
> > > > +	 * We can only remap as many blocks as the smaller of the two extent
> > > > +	 * maps, because we can only remap one extent at a time.
> > > > +	 */
> > > > +	imap->br_blockcount = min(imap->br_blockcount, imap2.br_blockcount);
> > > > +
> > > > +	trace_xfs_reflink_remap_extent2(ip, &imap2);
> > > > +
> > > > +	/*
> > > > +	 * Two extents mapped to the same physical block must not have
> > > > +	 * different states; that's filesystem corruption.  Move on to the next
> > > > +	 * extent if they're both holes or both the same physical extent.
> > > > +	 */
> > > > +	if (imap->br_startblock == imap2.br_startblock) {
> > > > +		if (imap->br_state != imap2.br_state)
> > > > +			error = -EFSCORRUPTED;
> > > 
> > > Can we assert the length of each extent match here as well?
> > 
> > We can, but we asked the second bmapi_read to give us a mapping that
> > isn't any longer than imap->br_blockcount, and now we've just shortened
> > imap->br_blockcount to be no longer than what that second bmapi_read
> > returned.
> > 

Yeah, that's actually why I asked. :P To be more clear, I thought it
might be defensive against future changes because this invariant is an
indirect/implicit result of those two things as opposed to some explicit
logic somewhere that makes it immediately obvious. Not a huge deal
either way though.

> > I guess I don't mind adding a debugging assert just to make sure I
> > didn't screw up the math, though it seems excessive... :)
> > 
> > > > +		goto out_cancel;
> > > > +	}
> > > > +
> > > >  	if (real_extent) {
> > > > +		/* No reflinking if we're low on space */
> > > > +		error = xfs_reflink_ag_has_free_space(mp,
> > > > +				XFS_FSB_TO_AGNO(mp, imap->br_startblock));
> > > > +		if (error)
> > > > +			goto out_cancel;
> > > > +
> > > > +
> > > > +		/* Do we have enough quota? */
> > > >  		error = xfs_trans_reserve_quota_nblks(tp, ip,
> > > > -				irec->br_blockcount, 0, XFS_QMOPT_RES_REGBLKS);
> > > > +				imap->br_blockcount, 0, XFS_QMOPT_RES_REGBLKS);
> > > 
> > > Any reason this doesn't accommodate removal of real blocks?
> > 
> > The old code didn't. :)
> > 
> > Come to think of it, this under-reserves quota accounting, since we need
> > to reserve bmbt blocks too.
> > 
> > Urk, I guess that needs fixing too.
> 
> I remembered that unmapping the source extent frees the quota count back
> to the dquot, not to the transaction quota reservation.  Maybe we need
> to add a XFS_TRANS_RES_FDBLKS-like thing to quota so that the quota
> counter ping-pong on't cause us to blow out the quota, and then we can
> relax the quota requirements of a reflink call?
> 

Hmm.. do you mean the deferred unmap operation? If that's the case, why
would we include the unmapped blocks in the delta?

Brian

> --D
> 
> > > 
> > > >  		if (error)
> > > >  			goto out_cancel;
> > > >  	}
> > > >  
> > > > -	trace_xfs_reflink_remap(ip, irec->br_startoff,
> > > > -				irec->br_blockcount, irec->br_startblock);
> > > > -
> > > > -	/* Unmap the old blocks in the data fork. */
> > > > -	rlen = unmap_len;
> > > > -	while (rlen) {
> > > > -		ASSERT(tp->t_firstblock == NULLFSBLOCK);
> > > > -		error = __xfs_bunmapi(tp, ip, destoff, &rlen, 0, 1);
> > > > -		if (error)
> > > > -			goto out_cancel;
> > > > -
> > > > +	if (xfs_bmap_is_mapped_extent(&imap2)) {
> > > >  		/*
> > > > -		 * Trim the extent to whatever got unmapped.
> > > > -		 * Remember, bunmapi works backwards.
> > > > +		 * If the extent we're unmapping is backed by storage (written
> > > > +		 * or not), unmap the extent and drop its refcount.
> > > >  		 */
> > > > -		uirec.br_startblock = irec->br_startblock + rlen;
> > > > -		uirec.br_startoff = irec->br_startoff + rlen;
> > > > -		uirec.br_blockcount = unmap_len - rlen;
> > > > -		uirec.br_state = irec->br_state;
> > > > -		unmap_len = rlen;
> > > > +		xfs_bmap_unmap_extent(tp, ip, &imap2);
> > > > +		xfs_refcount_decrease_extent(tp, &imap2);
> > > > +		ip_delta -= imap2.br_blockcount;
> > > > +	} else if (imap2.br_startblock == DELAYSTARTBLOCK) {
> > > > +		xfs_filblks_t	len = imap2.br_blockcount;
> > > >  
> > > > -		/* If this isn't a real mapping, we're done. */
> > > > -		if (!real_extent || uirec.br_blockcount == 0)
> > > > -			goto next_extent;
> > > > -
> > > > -		trace_xfs_reflink_remap(ip, uirec.br_startoff,
> > > > -				uirec.br_blockcount, uirec.br_startblock);
> > > > -
> > > > -		/* Update the refcount tree */
> > > > -		xfs_refcount_increase_extent(tp, &uirec);
> > > > -
> > > > -		/* Map the new blocks into the data fork. */
> > > > -		xfs_bmap_map_extent(tp, ip, &uirec);
> > > > +		/*
> > > > +		 * If the extent we're unmapping is a delalloc reservation,
> > > > +		 * we can use the regular bunmapi function to release the
> > > > +		 * incore state.  Dropping the delalloc reservation takes care
> > > > +		 * of the quota reservation for us.
> > > > +		 */
> > > > +		error = __xfs_bunmapi(NULL, ip, imap2.br_startoff, &len, 0, 1);
> > > > +		if (error)
> > > > +			goto out_cancel;
> > > > +		ASSERT(len == 0);
> > > > +	}
> > > >  
> > > > -		/* Update quota accounting. */
> > > > -		xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_BCOUNT,
> > > > -				uirec.br_blockcount);
> > > > +	/*
> > > > +	 * If the extent we're sharing is backed by written storage, increase
> > > > +	 * its refcount and map it into the file.
> > > > +	 */
> > > > +	if (real_extent) {
> > > > +		xfs_refcount_increase_extent(tp, imap);
> > > > +		xfs_bmap_map_extent(tp, ip, imap);
> > > > +		ip_delta += imap->br_blockcount;
> > > > +	}
> > > >  
> > > > -		/* Update dest isize if needed. */
> > > > -		newlen = XFS_FSB_TO_B(mp,
> > > > -				uirec.br_startoff + uirec.br_blockcount);
> > > > -		newlen = min_t(xfs_off_t, newlen, new_isize);
> > > > -		if (newlen > i_size_read(VFS_I(ip))) {
> > > > -			trace_xfs_reflink_update_inode_size(ip, newlen);
> > > > -			i_size_write(VFS_I(ip), newlen);
> > > > -			ip->i_d.di_size = newlen;
> > > > -			xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
> > > > -		}
> > > > +	xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_BCOUNT, ip_delta);
> > > >  
> > > > -next_extent:
> > > > -		/* Process all the deferred stuff. */
> > > > -		error = xfs_defer_finish(&tp);
> > > > -		if (error)
> > > > -			goto out_cancel;
> > > > +	/* Update dest isize if needed. */
> > > > +	newlen = XFS_FSB_TO_B(mp, imap2.br_startoff + imap2.br_blockcount);
> > > > +	newlen = min_t(xfs_off_t, newlen, new_isize);
> > > > +	if (newlen > i_size_read(VFS_I(ip))) {
> > > > +		trace_xfs_reflink_update_inode_size(ip, newlen);
> > > > +		i_size_write(VFS_I(ip), newlen);
> > > > +		ip->i_d.di_size = newlen;
> > > > +		xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
> > > >  	}
> > > >  
> > > > +	/* Commit everything and unlock. */
> > > >  	error = xfs_trans_commit(tp);
> > > > -	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > > -	if (error)
> > > > -		goto out;
> > > > -	return 0;
> > > > +	goto out_unlock;
> > > 
> > > We probably shouldn't fire the _error() tracepoint in the successful
> > > exit case.
> > 
> > Oops, good catch. :)
> > 
> > > >  
> > > >  out_cancel:
> > > >  	xfs_trans_cancel(tp);
> > > > +out_unlock:
> > > >  	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > >  out:
> > > >  	trace_xfs_reflink_remap_extent_error(ip, error, _RET_IP_);
> > > >  	return error;
> > > >  }
> > > >  
> > > > -/*
> > > > - * Iteratively remap one file's extents (and holes) to another's.
> > > > - */
> > > > +/* Remap a range of one file to the other. */
> > > >  int
> > > >  xfs_reflink_remap_blocks(
> > > >  	struct xfs_inode	*src,
> > > > @@ -1116,25 +1134,22 @@ xfs_reflink_remap_blocks(
> > > >  	loff_t			*remapped)
> > > >  {
> > > >  	struct xfs_bmbt_irec	imap;
> > > > -	xfs_fileoff_t		srcoff;
> > > > -	xfs_fileoff_t		destoff;
> > > > +	struct xfs_mount	*mp = src->i_mount;
> > > > +	xfs_fileoff_t		srcoff = XFS_B_TO_FSBT(mp, pos_in);
> > > > +	xfs_fileoff_t		destoff = XFS_B_TO_FSBT(mp, pos_out);
> > > >  	xfs_filblks_t		len;
> > > > -	xfs_filblks_t		range_len;
> > > >  	xfs_filblks_t		remapped_len = 0;
> > > >  	xfs_off_t		new_isize = pos_out + remap_len;
> > > >  	int			nimaps;
> > > >  	int			error = 0;
> > > >  
> > > > -	destoff = XFS_B_TO_FSBT(src->i_mount, pos_out);
> > > > -	srcoff = XFS_B_TO_FSBT(src->i_mount, pos_in);
> > > > -	len = XFS_B_TO_FSB(src->i_mount, remap_len);
> > > > +	len = min_t(xfs_filblks_t, XFS_B_TO_FSB(mp, remap_len),
> > > > +			XFS_MAX_FILEOFF);
> > > >  
> > > > -	/* drange = (destoff, destoff + len); srange = (srcoff, srcoff + len) */
> > > > -	while (len) {
> > > > -		uint		lock_mode;
> > > > +	trace_xfs_reflink_remap_blocks(src, srcoff, len, dest, destoff);
> > > >  
> > > > -		trace_xfs_reflink_remap_blocks_loop(src, srcoff, len,
> > > > -				dest, destoff);
> > > > +	while (len > 0) {
> > > > +		unsigned int	lock_mode;
> > > >  
> > > >  		/* Read extent from the source file */
> > > >  		nimaps = 1;
> > > > @@ -1143,18 +1158,29 @@ xfs_reflink_remap_blocks(
> > > >  		xfs_iunlock(src, lock_mode);
> > > >  		if (error)
> > > >  			break;
> > > > -		ASSERT(nimaps == 1);
> > > > +#ifdef DEBUG
> > > > +		if (nimaps != 1 ||
> > > > +		    imap.br_startblock == DELAYSTARTBLOCK ||
> > > > +		    imap.br_startoff != srcoff) {
> > > > +			/*
> > > > +			 * We should never get no mapping or a delalloc extent
> > > > +			 * or something that doesn't match what we asked for,
> > > > +			 * since the caller flushed the source file data and
> > > > +			 * we hold the source file io/mmap locks.
> > > > +			 */
> > > > +			ASSERT(nimaps == 0);
> > > > +			ASSERT(imap.br_startblock != DELAYSTARTBLOCK);
> > > > +			ASSERT(imap.br_startoff == srcoff);
> > > > +			error = -EINVAL;
> > > > +			break;
> > > > +		}
> > > > +#endif
> > > 
> > > Same thing with the DEBUG check? It seems like at least some of these
> > > checks should be unconditional in the event the file is corrupted or
> > > something.
> > 
> > Doh... yeah, I guess the delalloc check really needs to be in there, in
> > case a previous write+fsync failed and left us some remnants.
> > 
> > > >  
> > > > -		trace_xfs_reflink_remap_imap(src, srcoff, len, XFS_DATA_FORK,
> > > > -				&imap);
> > > > +		trace_xfs_reflink_remap_extent1(src, &imap);
> > > 
> > > How about trace_xfs_reflink_remap_extent_[src|dest]() so trace data is a
> > > bit more readable?
> > 
> > <nod> I like the suggestion!
> > 
> > > >  
> > > > -		/* Translate imap into the destination file. */
> > > > -		range_len = imap.br_startoff + imap.br_blockcount - srcoff;
> > > > -		imap.br_startoff += destoff - srcoff;
> > > > -
> > > > -		/* Clear dest from destoff to the end of imap and map it in. */
> > > > -		error = xfs_reflink_remap_extent(dest, &imap, destoff,
> > > > -				new_isize);
> > > > +		/* Remap into the destination file at the given offset. */
> > > > +		imap.br_startoff = destoff;
> > > > +		error = xfs_reflink_remap_extent(dest, &imap, new_isize);
> > > >  		if (error)
> > > >  			break;
> > > >  
> > > > @@ -1164,10 +1190,10 @@ xfs_reflink_remap_blocks(
> > > >  		}
> > > >  
> > > >  		/* Advance drange/srange */
> > > > -		srcoff += range_len;
> > > > -		destoff += range_len;
> > > > -		len -= range_len;
> > > > -		remapped_len += range_len;
> > > > +		srcoff += imap.br_blockcount;
> > > > +		destoff += imap.br_blockcount;
> > > > +		len -= imap.br_blockcount;
> > > > +		remapped_len += imap.br_blockcount;
> > > 
> > > Much nicer iteration, indeed. ;)
> > 
> > Thanks! :)
> > 
> > --D
> > 
> > > Brian
> > > 
> > > >  	}
> > > >  
> > > >  	if (error)
> > > > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> > > > index 460136628a79..f205602c8ba9 100644
> > > > --- a/fs/xfs/xfs_trace.h
> > > > +++ b/fs/xfs/xfs_trace.h
> > > > @@ -3052,8 +3052,7 @@ DEFINE_EVENT(xfs_inode_irec_class, name, \
> > > >  DEFINE_INODE_EVENT(xfs_reflink_set_inode_flag);
> > > >  DEFINE_INODE_EVENT(xfs_reflink_unset_inode_flag);
> > > >  DEFINE_ITRUNC_EVENT(xfs_reflink_update_inode_size);
> > > > -DEFINE_IMAP_EVENT(xfs_reflink_remap_imap);
> > > > -TRACE_EVENT(xfs_reflink_remap_blocks_loop,
> > > > +TRACE_EVENT(xfs_reflink_remap_blocks,
> > > >  	TP_PROTO(struct xfs_inode *src, xfs_fileoff_t soffset,
> > > >  		 xfs_filblks_t len, struct xfs_inode *dest,
> > > >  		 xfs_fileoff_t doffset),
> > > > @@ -3084,59 +3083,14 @@ TRACE_EVENT(xfs_reflink_remap_blocks_loop,
> > > >  		  __entry->dest_ino,
> > > >  		  __entry->dest_lblk)
> > > >  );
> > > > -TRACE_EVENT(xfs_reflink_punch_range,
> > > > -	TP_PROTO(struct xfs_inode *ip, xfs_fileoff_t lblk,
> > > > -		 xfs_extlen_t len),
> > > > -	TP_ARGS(ip, lblk, len),
> > > > -	TP_STRUCT__entry(
> > > > -		__field(dev_t, dev)
> > > > -		__field(xfs_ino_t, ino)
> > > > -		__field(xfs_fileoff_t, lblk)
> > > > -		__field(xfs_extlen_t, len)
> > > > -	),
> > > > -	TP_fast_assign(
> > > > -		__entry->dev = VFS_I(ip)->i_sb->s_dev;
> > > > -		__entry->ino = ip->i_ino;
> > > > -		__entry->lblk = lblk;
> > > > -		__entry->len = len;
> > > > -	),
> > > > -	TP_printk("dev %d:%d ino 0x%llx lblk 0x%llx len 0x%x",
> > > > -		  MAJOR(__entry->dev), MINOR(__entry->dev),
> > > > -		  __entry->ino,
> > > > -		  __entry->lblk,
> > > > -		  __entry->len)
> > > > -);
> > > > -TRACE_EVENT(xfs_reflink_remap,
> > > > -	TP_PROTO(struct xfs_inode *ip, xfs_fileoff_t lblk,
> > > > -		 xfs_extlen_t len, xfs_fsblock_t new_pblk),
> > > > -	TP_ARGS(ip, lblk, len, new_pblk),
> > > > -	TP_STRUCT__entry(
> > > > -		__field(dev_t, dev)
> > > > -		__field(xfs_ino_t, ino)
> > > > -		__field(xfs_fileoff_t, lblk)
> > > > -		__field(xfs_extlen_t, len)
> > > > -		__field(xfs_fsblock_t, new_pblk)
> > > > -	),
> > > > -	TP_fast_assign(
> > > > -		__entry->dev = VFS_I(ip)->i_sb->s_dev;
> > > > -		__entry->ino = ip->i_ino;
> > > > -		__entry->lblk = lblk;
> > > > -		__entry->len = len;
> > > > -		__entry->new_pblk = new_pblk;
> > > > -	),
> > > > -	TP_printk("dev %d:%d ino 0x%llx lblk 0x%llx len 0x%x new_pblk %llu",
> > > > -		  MAJOR(__entry->dev), MINOR(__entry->dev),
> > > > -		  __entry->ino,
> > > > -		  __entry->lblk,
> > > > -		  __entry->len,
> > > > -		  __entry->new_pblk)
> > > > -);
> > > >  DEFINE_DOUBLE_IO_EVENT(xfs_reflink_remap_range);
> > > >  DEFINE_INODE_ERROR_EVENT(xfs_reflink_remap_range_error);
> > > >  DEFINE_INODE_ERROR_EVENT(xfs_reflink_set_inode_flag_error);
> > > >  DEFINE_INODE_ERROR_EVENT(xfs_reflink_update_inode_size_error);
> > > >  DEFINE_INODE_ERROR_EVENT(xfs_reflink_remap_blocks_error);
> > > >  DEFINE_INODE_ERROR_EVENT(xfs_reflink_remap_extent_error);
> > > > +DEFINE_INODE_IREC_EVENT(xfs_reflink_remap_extent1);
> > > > +DEFINE_INODE_IREC_EVENT(xfs_reflink_remap_extent2);
> > > >  
> > > >  /* dedupe tracepoints */
> > > >  DEFINE_DOUBLE_IO_EVENT(xfs_reflink_compare_extents);
> > > > diff --git a/fs/xfs/xfs_trans_dquot.c b/fs/xfs/xfs_trans_dquot.c
> > > > index c0f73b82c055..99511ff6222f 100644
> > > > --- a/fs/xfs/xfs_trans_dquot.c
> > > > +++ b/fs/xfs/xfs_trans_dquot.c
> > > > @@ -124,16 +124,17 @@ xfs_trans_dup_dqinfo(
> > > >   */
> > > >  void
> > > >  xfs_trans_mod_dquot_byino(
> > > > -	xfs_trans_t	*tp,
> > > > -	xfs_inode_t	*ip,
> > > > -	uint		field,
> > > > -	int64_t		delta)
> > > > +	struct xfs_trans	*tp,
> > > > +	struct xfs_inode	*ip,
> > > > +	uint			field,
> > > > +	int64_t			delta)
> > > >  {
> > > > -	xfs_mount_t	*mp = tp->t_mountp;
> > > > +	struct xfs_mount	*mp = tp->t_mountp;
> > > >  
> > > >  	if (!XFS_IS_QUOTA_RUNNING(mp) ||
> > > >  	    !XFS_IS_QUOTA_ON(mp) ||
> > > > -	    xfs_is_quota_inode(&mp->m_sb, ip->i_ino))
> > > > +	    xfs_is_quota_inode(&mp->m_sb, ip->i_ino) ||
> > > > +	    delta == 0)
> > > >  		return;
> > > >  
> > > >  	if (tp->t_dqinfo == NULL)
> > > > 
> > > 
> 


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/3] xfs: redesign the reflink remap loop to fix blkres depletion crash
  2020-06-24 18:16         ` Brian Foster
@ 2020-06-24 18:35           ` Darrick J. Wong
  2020-06-24 18:45             ` Brian Foster
  0 siblings, 1 reply; 12+ messages in thread
From: Darrick J. Wong @ 2020-06-24 18:35 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, edwin

On Wed, Jun 24, 2020 at 02:16:29PM -0400, Brian Foster wrote:
> On Wed, Jun 24, 2020 at 10:07:38AM -0700, Darrick J. Wong wrote:
> > On Wed, Jun 24, 2020 at 08:09:15AM -0700, Darrick J. Wong wrote:
> > > On Wed, Jun 24, 2020 at 08:38:41AM -0400, Brian Foster wrote:
> > > > On Mon, Jun 22, 2020 at 09:01:35PM -0700, Darrick J. Wong wrote:
> > > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > 
> > > > > The existing reflink remapping loop has some structural problems that
> > > > > need addressing:
> > > > > 
> > > > > The biggest problem is that we create one transaction for each extent in
> > > > > the source file without accounting for the number of mappings there are
> > > > > for the same range in the destination file.  In other words, we don't
> > > > > know the number of remap operations that will be necessary and we
> > > > > therefore cannot guess the block reservation required.  On highly
> > > > > fragmented filesystems (e.g. ones with active dedupe) we guess wrong,
> > > > > run out of block reservation, and fail.
> > > > > 
> > > > > The second problem is that we don't actually use the bmap intents to
> > > > > their full potential -- instead of calling bunmapi directly and having
> > > > > to deal with its backwards operation, we could call the deferred ops
> > > > > xfs_bmap_unmap_extent and xfs_refcount_decrease_extent instead.  This
> > > > > makes the frontend loop much simpler.
> > > > > 
> > > > > A third (and more minor) problem is that we aren't even clever enough to
> > > > > skip the whole remapping thing if the source and destination ranges
> > > > > point to the same physical extents.
> > > > > 
> > > > 
> > > > I'm wondering if this all really has to be done in one patch. The last
> > > > bit at least sounds like an optimization from the description.
> > > 
> > > <nod> I can split out that and the quota reservation fixes.
> > > 
> > > > > Solve all of these problems by refactoring the remapping loops so that
> > > > > we only perform one remapping operation per transaction, and each
> > > > > operation only tries to remap a single extent from source to dest.
> > > > > 
> > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > ---
> > > > >  fs/xfs/libxfs/xfs_bmap.h |   13 ++-
> > > > >  fs/xfs/xfs_reflink.c     |  234 ++++++++++++++++++++++++++--------------------
> > > > >  fs/xfs/xfs_trace.h       |   52 +---------
> > > > >  fs/xfs/xfs_trans_dquot.c |   13 +--
> > > > >  4 files changed, 149 insertions(+), 163 deletions(-)
> > > > > 
> > > > > 
> > > > > diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
> > > > > index 6028a3c825ba..3498b4f75f71 100644
> > > > > --- a/fs/xfs/libxfs/xfs_bmap.h
> > > > > +++ b/fs/xfs/libxfs/xfs_bmap.h
> > > > > @@ -158,6 +158,13 @@ static inline int xfs_bmapi_whichfork(int bmapi_flags)
> > > > >  	{ BMAP_ATTRFORK,	"ATTR" }, \
> > > > >  	{ BMAP_COWFORK,		"COW" }
> > > > >  
> > > > > +/* Return true if the extent is an allocated extent, written or not. */
> > > > > +static inline bool xfs_bmap_is_mapped_extent(struct xfs_bmbt_irec *irec)
> > > > > +{
> > > > > +	return irec->br_startblock != HOLESTARTBLOCK &&
> > > > > +		irec->br_startblock != DELAYSTARTBLOCK &&
> > > > > +		!isnullstartblock(irec->br_startblock);
> > > > > +}
> > > > 
> > > > Heh, I was going to suggest to call this _is_real_extent(), but I see
> > > > the helper below already uses that name. I do think the name correlates
> > > > better with this helper and perhaps the one below should be called
> > > > something like xfs_bmap_is_written_extent(). Hm? Either way, the
> > > > "mapped" name is kind of vague and brings back memories of buffer heads.
> > > 
> > > <nod> I originally wasn't going to change the name on the second helper,
> > > but as there are only three callers currently it's probably easier to do
> > > that now.
> > > 
> > > > >  
> > > > >  /*
> > > > >   * Return true if the extent is a real, allocated extent, or false if it is  a
> > > > > @@ -165,10 +172,8 @@ static inline int xfs_bmapi_whichfork(int bmapi_flags)
> > > > >   */
> > > > >  static inline bool xfs_bmap_is_real_extent(struct xfs_bmbt_irec *irec)
> > > > >  {
> > > > > -	return irec->br_state != XFS_EXT_UNWRITTEN &&
> > > > > -		irec->br_startblock != HOLESTARTBLOCK &&
> > > > > -		irec->br_startblock != DELAYSTARTBLOCK &&
> > > > > -		!isnullstartblock(irec->br_startblock);
> > > > > +	return xfs_bmap_is_mapped_extent(irec) &&
> > > > > +	       irec->br_state != XFS_EXT_UNWRITTEN;
> > > > >  }
> > > > >  
> > > > >  /*
> > > > > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > > > > index 107bf2a2f344..f50a8c2f21a5 100644
> > > > > --- a/fs/xfs/xfs_reflink.c
> > > > > +++ b/fs/xfs/xfs_reflink.c
> > > > > @@ -984,40 +984,27 @@ xfs_reflink_ag_has_free_space(
> > > > >  }
> > > > >  
> > > > >  /*
> > > > > - * Unmap a range of blocks from a file, then map other blocks into the hole.
> > > > > - * The range to unmap is (destoff : destoff + srcioff + irec->br_blockcount).
> > > > > - * The extent irec is mapped into dest at irec->br_startoff.
> > > > > + * Remap the given extent into the file at the given offset.  The imap
> > > > > + * blockcount will be set to the number of blocks that were actually remapped.
> > > > >   */
> > > > >  STATIC int
> > > > >  xfs_reflink_remap_extent(
> > > > >  	struct xfs_inode	*ip,
> > > > > -	struct xfs_bmbt_irec	*irec,
> > > > > -	xfs_fileoff_t		destoff,
> > > > > +	struct xfs_bmbt_irec	*imap,
> > > > >  	xfs_off_t		new_isize)
> > > > >  {
> > > > > +	struct xfs_bmbt_irec	imap2;
> > > > >  	struct xfs_mount	*mp = ip->i_mount;
> > > > > -	bool			real_extent = xfs_bmap_is_real_extent(irec);
> > > > >  	struct xfs_trans	*tp;
> > > > > -	unsigned int		resblks;
> > > > > -	struct xfs_bmbt_irec	uirec;
> > > > > -	xfs_filblks_t		rlen;
> > > > > -	xfs_filblks_t		unmap_len;
> > > > >  	xfs_off_t		newlen;
> > > > > +	int64_t			ip_delta = 0;
> > > > > +	unsigned int		resblks;
> > > > > +	bool			real_extent = xfs_bmap_is_real_extent(imap);
> > > > > +	int			nimaps;
> > > > >  	int			error;
> > > > >  
> > > > > -	unmap_len = irec->br_startoff + irec->br_blockcount - destoff;
> > > > > -	trace_xfs_reflink_punch_range(ip, destoff, unmap_len);
> > > > > -
> > > > > -	/* No reflinking if we're low on space */
> > > > > -	if (real_extent) {
> > > > > -		error = xfs_reflink_ag_has_free_space(mp,
> > > > > -				XFS_FSB_TO_AGNO(mp, irec->br_startblock));
> > > > > -		if (error)
> > > > > -			goto out;
> > > > > -	}
> > > > > -
> > > > >  	/* Start a rolling transaction to switch the mappings */
> > > > > -	resblks = XFS_EXTENTADD_SPACE_RES(ip->i_mount, XFS_DATA_FORK);
> > > > > +	resblks = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
> > > > >  	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0, 0, &tp);
> > > > >  	if (error)
> > > > >  		goto out;
> > > > > @@ -1025,87 +1012,118 @@ xfs_reflink_remap_extent(
> > > > >  	xfs_ilock(ip, XFS_ILOCK_EXCL);
> > > > >  	xfs_trans_ijoin(tp, ip, 0);
> > > > >  
> > > > > -	/* If we're not just clearing space, then do we have enough quota? */
> > > > > +	/* Read extent from the second file */
> > > > > +	nimaps = 1;
> > > > > +	error = xfs_bmapi_read(ip, imap->br_startoff, imap->br_blockcount,
> > > > > +			&imap2, &nimaps, 0);
> > > > > +	if (error)
> > > > > +		goto out_cancel;
> > > > > +#ifdef DEBUG
> > > > > +	if (nimaps != 1 ||
> > > > > +	    imap2.br_startoff != imap->br_startoff) {
> > > > > +		/*
> > > > > +		 * We should never get no mapping or something that doesn't
> > > > > +		 * match what we asked for.
> > > > > +		 */
> > > > > +		ASSERT(0);
> > > > > +		error = -EINVAL;
> > > > > +		goto out_cancel;
> > > > > +	}
> > > > > +#endif
> > > > 
> > > > Why the DEBUG?
> > > 
> > > Christoph wondered (on the atomic extent swap series) when
> > > xfs_bmapi_read would ever return nimaps==0 or an extent that doesn't
> > > match at least the first block that we asked for.
> > > 
> > > I'm /pretty/ sure that will never happen since the full bmapi read
> > > function will hand us back a "HOLE" mapping if it doesn't find anything,
> > > so I left those parts in as debugging checks.
> > > 
> 
> Oh. Maybe this is better served as a single assert? I.e.:
> 
> 	ASSERT(nimaps == 1 && imap2.br_startoff == imap->br_startoff);
> 
> I guess I'm mostly just not a fan of the DEBUG idefs than worried about
> the extra error checking logic. ISTM we should either include these
> checks unconditionally if it's reasonably possible or leave it to an
> assert if it's more of a future developer aide or some such..

Ok.  I think I'll turn the nimaps and startoff checks into asserts, and
the delalloc check below into a regular error bailout.

> > > > > +
> > > > > +	/*
> > > > > +	 * We can only remap as many blocks as the smaller of the two extent
> > > > > +	 * maps, because we can only remap one extent at a time.
> > > > > +	 */
> > > > > +	imap->br_blockcount = min(imap->br_blockcount, imap2.br_blockcount);
> > > > > +
> > > > > +	trace_xfs_reflink_remap_extent2(ip, &imap2);
> > > > > +
> > > > > +	/*
> > > > > +	 * Two extents mapped to the same physical block must not have
> > > > > +	 * different states; that's filesystem corruption.  Move on to the next
> > > > > +	 * extent if they're both holes or both the same physical extent.
> > > > > +	 */
> > > > > +	if (imap->br_startblock == imap2.br_startblock) {
> > > > > +		if (imap->br_state != imap2.br_state)
> > > > > +			error = -EFSCORRUPTED;
> > > > 
> > > > Can we assert the length of each extent match here as well?
> > > 
> > > We can, but we asked the second bmapi_read to give us a mapping that
> > > isn't any longer than imap->br_blockcount, and now we've just shortened
> > > imap->br_blockcount to be no longer than what that second bmapi_read
> > > returned.
> > > 
> 
> Yeah, that's actually why I asked. :P To be more clear, I thought it
> might be defensive against future changes because this invariant is an
> indirect/implicit result of those two things as opposed to some explicit
> logic somewhere that makes it immediately obvious. Not a huge deal
> either way though.

Heh.  Well I added it already.  If nothing else, it makes the logic
easier to follow. :)

> > > I guess I don't mind adding a debugging assert just to make sure I
> > > didn't screw up the math, though it seems excessive... :)
> > > 
> > > > > +		goto out_cancel;
> > > > > +	}
> > > > > +
> > > > >  	if (real_extent) {
> > > > > +		/* No reflinking if we're low on space */
> > > > > +		error = xfs_reflink_ag_has_free_space(mp,
> > > > > +				XFS_FSB_TO_AGNO(mp, imap->br_startblock));
> > > > > +		if (error)
> > > > > +			goto out_cancel;
> > > > > +
> > > > > +
> > > > > +		/* Do we have enough quota? */
> > > > >  		error = xfs_trans_reserve_quota_nblks(tp, ip,
> > > > > -				irec->br_blockcount, 0, XFS_QMOPT_RES_REGBLKS);
> > > > > +				imap->br_blockcount, 0, XFS_QMOPT_RES_REGBLKS);
> > > > 
> > > > Any reason this doesn't accommodate removal of real blocks?
> > > 
> > > The old code didn't. :)
> > > 
> > > Come to think of it, this under-reserves quota accounting, since we need
> > > to reserve bmbt blocks too.
> > > 
> > > Urk, I guess that needs fixing too.
> > 
> > I remembered that unmapping the source extent frees the quota count back
> > to the dquot, not to the transaction quota reservation.  Maybe we need
> > to add a XFS_TRANS_RES_FDBLKS-like thing to quota so that the quota
> > counter ping-pong on't cause us to blow out the quota, and then we can
> > relax the quota requirements of a reflink call?
> > 
> 
> Hmm.. do you mean the deferred unmap operation? If that's the case, why
> would we include the unmapped blocks in the delta?

No.  Deferred bmap operations only do quota accounting for bmbt shap
changes.  The lead transaction in the defer ops chain must do the quota
accounting for the extents being mapped/unmapped.

I was actually referring to the potential for freeing and allocating
bmbt blocks during the operation, but thinking about this more
thoroughly, I think we need enough quota reservation for:

 a) + the extent we're mapping in
 b) + split of the bmbt
 c) - the extent we're unmapping, if it's real

In my patch queue I've added (b) as an easily backportable fix before
the reflink refactoring series.  I think (c) can be added after the
refactoring because it's easier to reason about that once we're only
doing one extent move per transaction.

--D

> Brian
> 
> > --D
> > 
> > > > 
> > > > >  		if (error)
> > > > >  			goto out_cancel;
> > > > >  	}
> > > > >  
> > > > > -	trace_xfs_reflink_remap(ip, irec->br_startoff,
> > > > > -				irec->br_blockcount, irec->br_startblock);
> > > > > -
> > > > > -	/* Unmap the old blocks in the data fork. */
> > > > > -	rlen = unmap_len;
> > > > > -	while (rlen) {
> > > > > -		ASSERT(tp->t_firstblock == NULLFSBLOCK);
> > > > > -		error = __xfs_bunmapi(tp, ip, destoff, &rlen, 0, 1);
> > > > > -		if (error)
> > > > > -			goto out_cancel;
> > > > > -
> > > > > +	if (xfs_bmap_is_mapped_extent(&imap2)) {
> > > > >  		/*
> > > > > -		 * Trim the extent to whatever got unmapped.
> > > > > -		 * Remember, bunmapi works backwards.
> > > > > +		 * If the extent we're unmapping is backed by storage (written
> > > > > +		 * or not), unmap the extent and drop its refcount.
> > > > >  		 */
> > > > > -		uirec.br_startblock = irec->br_startblock + rlen;
> > > > > -		uirec.br_startoff = irec->br_startoff + rlen;
> > > > > -		uirec.br_blockcount = unmap_len - rlen;
> > > > > -		uirec.br_state = irec->br_state;
> > > > > -		unmap_len = rlen;
> > > > > +		xfs_bmap_unmap_extent(tp, ip, &imap2);
> > > > > +		xfs_refcount_decrease_extent(tp, &imap2);
> > > > > +		ip_delta -= imap2.br_blockcount;
> > > > > +	} else if (imap2.br_startblock == DELAYSTARTBLOCK) {
> > > > > +		xfs_filblks_t	len = imap2.br_blockcount;
> > > > >  
> > > > > -		/* If this isn't a real mapping, we're done. */
> > > > > -		if (!real_extent || uirec.br_blockcount == 0)
> > > > > -			goto next_extent;
> > > > > -
> > > > > -		trace_xfs_reflink_remap(ip, uirec.br_startoff,
> > > > > -				uirec.br_blockcount, uirec.br_startblock);
> > > > > -
> > > > > -		/* Update the refcount tree */
> > > > > -		xfs_refcount_increase_extent(tp, &uirec);
> > > > > -
> > > > > -		/* Map the new blocks into the data fork. */
> > > > > -		xfs_bmap_map_extent(tp, ip, &uirec);
> > > > > +		/*
> > > > > +		 * If the extent we're unmapping is a delalloc reservation,
> > > > > +		 * we can use the regular bunmapi function to release the
> > > > > +		 * incore state.  Dropping the delalloc reservation takes care
> > > > > +		 * of the quota reservation for us.
> > > > > +		 */
> > > > > +		error = __xfs_bunmapi(NULL, ip, imap2.br_startoff, &len, 0, 1);
> > > > > +		if (error)
> > > > > +			goto out_cancel;
> > > > > +		ASSERT(len == 0);
> > > > > +	}
> > > > >  
> > > > > -		/* Update quota accounting. */
> > > > > -		xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_BCOUNT,
> > > > > -				uirec.br_blockcount);
> > > > > +	/*
> > > > > +	 * If the extent we're sharing is backed by written storage, increase
> > > > > +	 * its refcount and map it into the file.
> > > > > +	 */
> > > > > +	if (real_extent) {
> > > > > +		xfs_refcount_increase_extent(tp, imap);
> > > > > +		xfs_bmap_map_extent(tp, ip, imap);
> > > > > +		ip_delta += imap->br_blockcount;
> > > > > +	}
> > > > >  
> > > > > -		/* Update dest isize if needed. */
> > > > > -		newlen = XFS_FSB_TO_B(mp,
> > > > > -				uirec.br_startoff + uirec.br_blockcount);
> > > > > -		newlen = min_t(xfs_off_t, newlen, new_isize);
> > > > > -		if (newlen > i_size_read(VFS_I(ip))) {
> > > > > -			trace_xfs_reflink_update_inode_size(ip, newlen);
> > > > > -			i_size_write(VFS_I(ip), newlen);
> > > > > -			ip->i_d.di_size = newlen;
> > > > > -			xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
> > > > > -		}
> > > > > +	xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_BCOUNT, ip_delta);
> > > > >  
> > > > > -next_extent:
> > > > > -		/* Process all the deferred stuff. */
> > > > > -		error = xfs_defer_finish(&tp);
> > > > > -		if (error)
> > > > > -			goto out_cancel;
> > > > > +	/* Update dest isize if needed. */
> > > > > +	newlen = XFS_FSB_TO_B(mp, imap2.br_startoff + imap2.br_blockcount);
> > > > > +	newlen = min_t(xfs_off_t, newlen, new_isize);
> > > > > +	if (newlen > i_size_read(VFS_I(ip))) {
> > > > > +		trace_xfs_reflink_update_inode_size(ip, newlen);
> > > > > +		i_size_write(VFS_I(ip), newlen);
> > > > > +		ip->i_d.di_size = newlen;
> > > > > +		xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
> > > > >  	}
> > > > >  
> > > > > +	/* Commit everything and unlock. */
> > > > >  	error = xfs_trans_commit(tp);
> > > > > -	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > > > -	if (error)
> > > > > -		goto out;
> > > > > -	return 0;
> > > > > +	goto out_unlock;
> > > > 
> > > > We probably shouldn't fire the _error() tracepoint in the successful
> > > > exit case.
> > > 
> > > Oops, good catch. :)
> > > 
> > > > >  
> > > > >  out_cancel:
> > > > >  	xfs_trans_cancel(tp);
> > > > > +out_unlock:
> > > > >  	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > > >  out:
> > > > >  	trace_xfs_reflink_remap_extent_error(ip, error, _RET_IP_);
> > > > >  	return error;
> > > > >  }
> > > > >  
> > > > > -/*
> > > > > - * Iteratively remap one file's extents (and holes) to another's.
> > > > > - */
> > > > > +/* Remap a range of one file to the other. */
> > > > >  int
> > > > >  xfs_reflink_remap_blocks(
> > > > >  	struct xfs_inode	*src,
> > > > > @@ -1116,25 +1134,22 @@ xfs_reflink_remap_blocks(
> > > > >  	loff_t			*remapped)
> > > > >  {
> > > > >  	struct xfs_bmbt_irec	imap;
> > > > > -	xfs_fileoff_t		srcoff;
> > > > > -	xfs_fileoff_t		destoff;
> > > > > +	struct xfs_mount	*mp = src->i_mount;
> > > > > +	xfs_fileoff_t		srcoff = XFS_B_TO_FSBT(mp, pos_in);
> > > > > +	xfs_fileoff_t		destoff = XFS_B_TO_FSBT(mp, pos_out);
> > > > >  	xfs_filblks_t		len;
> > > > > -	xfs_filblks_t		range_len;
> > > > >  	xfs_filblks_t		remapped_len = 0;
> > > > >  	xfs_off_t		new_isize = pos_out + remap_len;
> > > > >  	int			nimaps;
> > > > >  	int			error = 0;
> > > > >  
> > > > > -	destoff = XFS_B_TO_FSBT(src->i_mount, pos_out);
> > > > > -	srcoff = XFS_B_TO_FSBT(src->i_mount, pos_in);
> > > > > -	len = XFS_B_TO_FSB(src->i_mount, remap_len);
> > > > > +	len = min_t(xfs_filblks_t, XFS_B_TO_FSB(mp, remap_len),
> > > > > +			XFS_MAX_FILEOFF);
> > > > >  
> > > > > -	/* drange = (destoff, destoff + len); srange = (srcoff, srcoff + len) */
> > > > > -	while (len) {
> > > > > -		uint		lock_mode;
> > > > > +	trace_xfs_reflink_remap_blocks(src, srcoff, len, dest, destoff);
> > > > >  
> > > > > -		trace_xfs_reflink_remap_blocks_loop(src, srcoff, len,
> > > > > -				dest, destoff);
> > > > > +	while (len > 0) {
> > > > > +		unsigned int	lock_mode;
> > > > >  
> > > > >  		/* Read extent from the source file */
> > > > >  		nimaps = 1;
> > > > > @@ -1143,18 +1158,29 @@ xfs_reflink_remap_blocks(
> > > > >  		xfs_iunlock(src, lock_mode);
> > > > >  		if (error)
> > > > >  			break;
> > > > > -		ASSERT(nimaps == 1);
> > > > > +#ifdef DEBUG
> > > > > +		if (nimaps != 1 ||
> > > > > +		    imap.br_startblock == DELAYSTARTBLOCK ||
> > > > > +		    imap.br_startoff != srcoff) {
> > > > > +			/*
> > > > > +			 * We should never get no mapping or a delalloc extent
> > > > > +			 * or something that doesn't match what we asked for,
> > > > > +			 * since the caller flushed the source file data and
> > > > > +			 * we hold the source file io/mmap locks.
> > > > > +			 */
> > > > > +			ASSERT(nimaps == 0);
> > > > > +			ASSERT(imap.br_startblock != DELAYSTARTBLOCK);
> > > > > +			ASSERT(imap.br_startoff == srcoff);
> > > > > +			error = -EINVAL;
> > > > > +			break;
> > > > > +		}
> > > > > +#endif
> > > > 
> > > > Same thing with the DEBUG check? It seems like at least some of these
> > > > checks should be unconditional in the event the file is corrupted or
> > > > something.
> > > 
> > > Doh... yeah, I guess the delalloc check really needs to be in there, in
> > > case a previous write+fsync failed and left us some remnants.
> > > 
> > > > >  
> > > > > -		trace_xfs_reflink_remap_imap(src, srcoff, len, XFS_DATA_FORK,
> > > > > -				&imap);
> > > > > +		trace_xfs_reflink_remap_extent1(src, &imap);
> > > > 
> > > > How about trace_xfs_reflink_remap_extent_[src|dest]() so trace data is a
> > > > bit more readable?
> > > 
> > > <nod> I like the suggestion!
> > > 
> > > > >  
> > > > > -		/* Translate imap into the destination file. */
> > > > > -		range_len = imap.br_startoff + imap.br_blockcount - srcoff;
> > > > > -		imap.br_startoff += destoff - srcoff;
> > > > > -
> > > > > -		/* Clear dest from destoff to the end of imap and map it in. */
> > > > > -		error = xfs_reflink_remap_extent(dest, &imap, destoff,
> > > > > -				new_isize);
> > > > > +		/* Remap into the destination file at the given offset. */
> > > > > +		imap.br_startoff = destoff;
> > > > > +		error = xfs_reflink_remap_extent(dest, &imap, new_isize);
> > > > >  		if (error)
> > > > >  			break;
> > > > >  
> > > > > @@ -1164,10 +1190,10 @@ xfs_reflink_remap_blocks(
> > > > >  		}
> > > > >  
> > > > >  		/* Advance drange/srange */
> > > > > -		srcoff += range_len;
> > > > > -		destoff += range_len;
> > > > > -		len -= range_len;
> > > > > -		remapped_len += range_len;
> > > > > +		srcoff += imap.br_blockcount;
> > > > > +		destoff += imap.br_blockcount;
> > > > > +		len -= imap.br_blockcount;
> > > > > +		remapped_len += imap.br_blockcount;
> > > > 
> > > > Much nicer iteration, indeed. ;)
> > > 
> > > Thanks! :)
> > > 
> > > --D
> > > 
> > > > Brian
> > > > 
> > > > >  	}
> > > > >  
> > > > >  	if (error)
> > > > > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> > > > > index 460136628a79..f205602c8ba9 100644
> > > > > --- a/fs/xfs/xfs_trace.h
> > > > > +++ b/fs/xfs/xfs_trace.h
> > > > > @@ -3052,8 +3052,7 @@ DEFINE_EVENT(xfs_inode_irec_class, name, \
> > > > >  DEFINE_INODE_EVENT(xfs_reflink_set_inode_flag);
> > > > >  DEFINE_INODE_EVENT(xfs_reflink_unset_inode_flag);
> > > > >  DEFINE_ITRUNC_EVENT(xfs_reflink_update_inode_size);
> > > > > -DEFINE_IMAP_EVENT(xfs_reflink_remap_imap);
> > > > > -TRACE_EVENT(xfs_reflink_remap_blocks_loop,
> > > > > +TRACE_EVENT(xfs_reflink_remap_blocks,
> > > > >  	TP_PROTO(struct xfs_inode *src, xfs_fileoff_t soffset,
> > > > >  		 xfs_filblks_t len, struct xfs_inode *dest,
> > > > >  		 xfs_fileoff_t doffset),
> > > > > @@ -3084,59 +3083,14 @@ TRACE_EVENT(xfs_reflink_remap_blocks_loop,
> > > > >  		  __entry->dest_ino,
> > > > >  		  __entry->dest_lblk)
> > > > >  );
> > > > > -TRACE_EVENT(xfs_reflink_punch_range,
> > > > > -	TP_PROTO(struct xfs_inode *ip, xfs_fileoff_t lblk,
> > > > > -		 xfs_extlen_t len),
> > > > > -	TP_ARGS(ip, lblk, len),
> > > > > -	TP_STRUCT__entry(
> > > > > -		__field(dev_t, dev)
> > > > > -		__field(xfs_ino_t, ino)
> > > > > -		__field(xfs_fileoff_t, lblk)
> > > > > -		__field(xfs_extlen_t, len)
> > > > > -	),
> > > > > -	TP_fast_assign(
> > > > > -		__entry->dev = VFS_I(ip)->i_sb->s_dev;
> > > > > -		__entry->ino = ip->i_ino;
> > > > > -		__entry->lblk = lblk;
> > > > > -		__entry->len = len;
> > > > > -	),
> > > > > -	TP_printk("dev %d:%d ino 0x%llx lblk 0x%llx len 0x%x",
> > > > > -		  MAJOR(__entry->dev), MINOR(__entry->dev),
> > > > > -		  __entry->ino,
> > > > > -		  __entry->lblk,
> > > > > -		  __entry->len)
> > > > > -);
> > > > > -TRACE_EVENT(xfs_reflink_remap,
> > > > > -	TP_PROTO(struct xfs_inode *ip, xfs_fileoff_t lblk,
> > > > > -		 xfs_extlen_t len, xfs_fsblock_t new_pblk),
> > > > > -	TP_ARGS(ip, lblk, len, new_pblk),
> > > > > -	TP_STRUCT__entry(
> > > > > -		__field(dev_t, dev)
> > > > > -		__field(xfs_ino_t, ino)
> > > > > -		__field(xfs_fileoff_t, lblk)
> > > > > -		__field(xfs_extlen_t, len)
> > > > > -		__field(xfs_fsblock_t, new_pblk)
> > > > > -	),
> > > > > -	TP_fast_assign(
> > > > > -		__entry->dev = VFS_I(ip)->i_sb->s_dev;
> > > > > -		__entry->ino = ip->i_ino;
> > > > > -		__entry->lblk = lblk;
> > > > > -		__entry->len = len;
> > > > > -		__entry->new_pblk = new_pblk;
> > > > > -	),
> > > > > -	TP_printk("dev %d:%d ino 0x%llx lblk 0x%llx len 0x%x new_pblk %llu",
> > > > > -		  MAJOR(__entry->dev), MINOR(__entry->dev),
> > > > > -		  __entry->ino,
> > > > > -		  __entry->lblk,
> > > > > -		  __entry->len,
> > > > > -		  __entry->new_pblk)
> > > > > -);
> > > > >  DEFINE_DOUBLE_IO_EVENT(xfs_reflink_remap_range);
> > > > >  DEFINE_INODE_ERROR_EVENT(xfs_reflink_remap_range_error);
> > > > >  DEFINE_INODE_ERROR_EVENT(xfs_reflink_set_inode_flag_error);
> > > > >  DEFINE_INODE_ERROR_EVENT(xfs_reflink_update_inode_size_error);
> > > > >  DEFINE_INODE_ERROR_EVENT(xfs_reflink_remap_blocks_error);
> > > > >  DEFINE_INODE_ERROR_EVENT(xfs_reflink_remap_extent_error);
> > > > > +DEFINE_INODE_IREC_EVENT(xfs_reflink_remap_extent1);
> > > > > +DEFINE_INODE_IREC_EVENT(xfs_reflink_remap_extent2);
> > > > >  
> > > > >  /* dedupe tracepoints */
> > > > >  DEFINE_DOUBLE_IO_EVENT(xfs_reflink_compare_extents);
> > > > > diff --git a/fs/xfs/xfs_trans_dquot.c b/fs/xfs/xfs_trans_dquot.c
> > > > > index c0f73b82c055..99511ff6222f 100644
> > > > > --- a/fs/xfs/xfs_trans_dquot.c
> > > > > +++ b/fs/xfs/xfs_trans_dquot.c
> > > > > @@ -124,16 +124,17 @@ xfs_trans_dup_dqinfo(
> > > > >   */
> > > > >  void
> > > > >  xfs_trans_mod_dquot_byino(
> > > > > -	xfs_trans_t	*tp,
> > > > > -	xfs_inode_t	*ip,
> > > > > -	uint		field,
> > > > > -	int64_t		delta)
> > > > > +	struct xfs_trans	*tp,
> > > > > +	struct xfs_inode	*ip,
> > > > > +	uint			field,
> > > > > +	int64_t			delta)
> > > > >  {
> > > > > -	xfs_mount_t	*mp = tp->t_mountp;
> > > > > +	struct xfs_mount	*mp = tp->t_mountp;
> > > > >  
> > > > >  	if (!XFS_IS_QUOTA_RUNNING(mp) ||
> > > > >  	    !XFS_IS_QUOTA_ON(mp) ||
> > > > > -	    xfs_is_quota_inode(&mp->m_sb, ip->i_ino))
> > > > > +	    xfs_is_quota_inode(&mp->m_sb, ip->i_ino) ||
> > > > > +	    delta == 0)
> > > > >  		return;
> > > > >  
> > > > >  	if (tp->t_dqinfo == NULL)
> > > > > 
> > > > 
> > 
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/3] xfs: redesign the reflink remap loop to fix blkres depletion crash
  2020-06-24 18:35           ` Darrick J. Wong
@ 2020-06-24 18:45             ` Brian Foster
  0 siblings, 0 replies; 12+ messages in thread
From: Brian Foster @ 2020-06-24 18:45 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, edwin

On Wed, Jun 24, 2020 at 11:35:51AM -0700, Darrick J. Wong wrote:
> On Wed, Jun 24, 2020 at 02:16:29PM -0400, Brian Foster wrote:
> > On Wed, Jun 24, 2020 at 10:07:38AM -0700, Darrick J. Wong wrote:
> > > On Wed, Jun 24, 2020 at 08:09:15AM -0700, Darrick J. Wong wrote:
> > > > On Wed, Jun 24, 2020 at 08:38:41AM -0400, Brian Foster wrote:
> > > > > On Mon, Jun 22, 2020 at 09:01:35PM -0700, Darrick J. Wong wrote:
> > > > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > > 
> > > > > > The existing reflink remapping loop has some structural problems that
> > > > > > need addressing:
> > > > > > 
> > > > > > The biggest problem is that we create one transaction for each extent in
> > > > > > the source file without accounting for the number of mappings there are
> > > > > > for the same range in the destination file.  In other words, we don't
> > > > > > know the number of remap operations that will be necessary and we
> > > > > > therefore cannot guess the block reservation required.  On highly
> > > > > > fragmented filesystems (e.g. ones with active dedupe) we guess wrong,
> > > > > > run out of block reservation, and fail.
> > > > > > 
> > > > > > The second problem is that we don't actually use the bmap intents to
> > > > > > their full potential -- instead of calling bunmapi directly and having
> > > > > > to deal with its backwards operation, we could call the deferred ops
> > > > > > xfs_bmap_unmap_extent and xfs_refcount_decrease_extent instead.  This
> > > > > > makes the frontend loop much simpler.
> > > > > > 
> > > > > > A third (and more minor) problem is that we aren't even clever enough to
> > > > > > skip the whole remapping thing if the source and destination ranges
> > > > > > point to the same physical extents.
> > > > > > 
> > > > > 
> > > > > I'm wondering if this all really has to be done in one patch. The last
> > > > > bit at least sounds like an optimization from the description.
> > > > 
> > > > <nod> I can split out that and the quota reservation fixes.
> > > > 
> > > > > > Solve all of these problems by refactoring the remapping loops so that
> > > > > > we only perform one remapping operation per transaction, and each
> > > > > > operation only tries to remap a single extent from source to dest.
> > > > > > 
> > > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > > ---
> > > > > >  fs/xfs/libxfs/xfs_bmap.h |   13 ++-
> > > > > >  fs/xfs/xfs_reflink.c     |  234 ++++++++++++++++++++++++++--------------------
> > > > > >  fs/xfs/xfs_trace.h       |   52 +---------
> > > > > >  fs/xfs/xfs_trans_dquot.c |   13 +--
> > > > > >  4 files changed, 149 insertions(+), 163 deletions(-)
> > > > > > 
> > > > > > 
> > > > > > diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
> > > > > > index 6028a3c825ba..3498b4f75f71 100644
> > > > > > --- a/fs/xfs/libxfs/xfs_bmap.h
> > > > > > +++ b/fs/xfs/libxfs/xfs_bmap.h
> > > > > > @@ -158,6 +158,13 @@ static inline int xfs_bmapi_whichfork(int bmapi_flags)
> > > > > >  	{ BMAP_ATTRFORK,	"ATTR" }, \
> > > > > >  	{ BMAP_COWFORK,		"COW" }
> > > > > >  
> > > > > > +/* Return true if the extent is an allocated extent, written or not. */
> > > > > > +static inline bool xfs_bmap_is_mapped_extent(struct xfs_bmbt_irec *irec)
> > > > > > +{
> > > > > > +	return irec->br_startblock != HOLESTARTBLOCK &&
> > > > > > +		irec->br_startblock != DELAYSTARTBLOCK &&
> > > > > > +		!isnullstartblock(irec->br_startblock);
> > > > > > +}
> > > > > 
> > > > > Heh, I was going to suggest to call this _is_real_extent(), but I see
> > > > > the helper below already uses that name. I do think the name correlates
> > > > > better with this helper and perhaps the one below should be called
> > > > > something like xfs_bmap_is_written_extent(). Hm? Either way, the
> > > > > "mapped" name is kind of vague and brings back memories of buffer heads.
> > > > 
> > > > <nod> I originally wasn't going to change the name on the second helper,
> > > > but as there are only three callers currently it's probably easier to do
> > > > that now.
> > > > 
> > > > > >  
> > > > > >  /*
> > > > > >   * Return true if the extent is a real, allocated extent, or false if it is  a
> > > > > > @@ -165,10 +172,8 @@ static inline int xfs_bmapi_whichfork(int bmapi_flags)
> > > > > >   */
> > > > > >  static inline bool xfs_bmap_is_real_extent(struct xfs_bmbt_irec *irec)
> > > > > >  {
> > > > > > -	return irec->br_state != XFS_EXT_UNWRITTEN &&
> > > > > > -		irec->br_startblock != HOLESTARTBLOCK &&
> > > > > > -		irec->br_startblock != DELAYSTARTBLOCK &&
> > > > > > -		!isnullstartblock(irec->br_startblock);
> > > > > > +	return xfs_bmap_is_mapped_extent(irec) &&
> > > > > > +	       irec->br_state != XFS_EXT_UNWRITTEN;
> > > > > >  }
> > > > > >  
> > > > > >  /*
> > > > > > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > > > > > index 107bf2a2f344..f50a8c2f21a5 100644
> > > > > > --- a/fs/xfs/xfs_reflink.c
> > > > > > +++ b/fs/xfs/xfs_reflink.c
> > > > > > @@ -984,40 +984,27 @@ xfs_reflink_ag_has_free_space(
> > > > > >  }
> > > > > >  
> > > > > >  /*
> > > > > > - * Unmap a range of blocks from a file, then map other blocks into the hole.
> > > > > > - * The range to unmap is (destoff : destoff + srcioff + irec->br_blockcount).
> > > > > > - * The extent irec is mapped into dest at irec->br_startoff.
> > > > > > + * Remap the given extent into the file at the given offset.  The imap
> > > > > > + * blockcount will be set to the number of blocks that were actually remapped.
> > > > > >   */
> > > > > >  STATIC int
> > > > > >  xfs_reflink_remap_extent(
> > > > > >  	struct xfs_inode	*ip,
> > > > > > -	struct xfs_bmbt_irec	*irec,
> > > > > > -	xfs_fileoff_t		destoff,
> > > > > > +	struct xfs_bmbt_irec	*imap,
> > > > > >  	xfs_off_t		new_isize)
> > > > > >  {
> > > > > > +	struct xfs_bmbt_irec	imap2;
> > > > > >  	struct xfs_mount	*mp = ip->i_mount;
> > > > > > -	bool			real_extent = xfs_bmap_is_real_extent(irec);
> > > > > >  	struct xfs_trans	*tp;
> > > > > > -	unsigned int		resblks;
> > > > > > -	struct xfs_bmbt_irec	uirec;
> > > > > > -	xfs_filblks_t		rlen;
> > > > > > -	xfs_filblks_t		unmap_len;
> > > > > >  	xfs_off_t		newlen;
> > > > > > +	int64_t			ip_delta = 0;
> > > > > > +	unsigned int		resblks;
> > > > > > +	bool			real_extent = xfs_bmap_is_real_extent(imap);
> > > > > > +	int			nimaps;
> > > > > >  	int			error;
> > > > > >  
> > > > > > -	unmap_len = irec->br_startoff + irec->br_blockcount - destoff;
> > > > > > -	trace_xfs_reflink_punch_range(ip, destoff, unmap_len);
> > > > > > -
> > > > > > -	/* No reflinking if we're low on space */
> > > > > > -	if (real_extent) {
> > > > > > -		error = xfs_reflink_ag_has_free_space(mp,
> > > > > > -				XFS_FSB_TO_AGNO(mp, irec->br_startblock));
> > > > > > -		if (error)
> > > > > > -			goto out;
> > > > > > -	}
> > > > > > -
> > > > > >  	/* Start a rolling transaction to switch the mappings */
> > > > > > -	resblks = XFS_EXTENTADD_SPACE_RES(ip->i_mount, XFS_DATA_FORK);
> > > > > > +	resblks = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
> > > > > >  	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0, 0, &tp);
> > > > > >  	if (error)
> > > > > >  		goto out;
> > > > > > @@ -1025,87 +1012,118 @@ xfs_reflink_remap_extent(
> > > > > >  	xfs_ilock(ip, XFS_ILOCK_EXCL);
> > > > > >  	xfs_trans_ijoin(tp, ip, 0);
> > > > > >  
> > > > > > -	/* If we're not just clearing space, then do we have enough quota? */
> > > > > > +	/* Read extent from the second file */
> > > > > > +	nimaps = 1;
> > > > > > +	error = xfs_bmapi_read(ip, imap->br_startoff, imap->br_blockcount,
> > > > > > +			&imap2, &nimaps, 0);
> > > > > > +	if (error)
> > > > > > +		goto out_cancel;
> > > > > > +#ifdef DEBUG
> > > > > > +	if (nimaps != 1 ||
> > > > > > +	    imap2.br_startoff != imap->br_startoff) {
> > > > > > +		/*
> > > > > > +		 * We should never get no mapping or something that doesn't
> > > > > > +		 * match what we asked for.
> > > > > > +		 */
> > > > > > +		ASSERT(0);
> > > > > > +		error = -EINVAL;
> > > > > > +		goto out_cancel;
> > > > > > +	}
> > > > > > +#endif
> > > > > 
> > > > > Why the DEBUG?
> > > > 
> > > > Christoph wondered (on the atomic extent swap series) when
> > > > xfs_bmapi_read would ever return nimaps==0 or an extent that doesn't
> > > > match at least the first block that we asked for.
> > > > 
> > > > I'm /pretty/ sure that will never happen since the full bmapi read
> > > > function will hand us back a "HOLE" mapping if it doesn't find anything,
> > > > so I left those parts in as debugging checks.
> > > > 
> > 
> > Oh. Maybe this is better served as a single assert? I.e.:
> > 
> > 	ASSERT(nimaps == 1 && imap2.br_startoff == imap->br_startoff);
> > 
> > I guess I'm mostly just not a fan of the DEBUG idefs than worried about
> > the extra error checking logic. ISTM we should either include these
> > checks unconditionally if it's reasonably possible or leave it to an
> > assert if it's more of a future developer aide or some such..
> 
> Ok.  I think I'll turn the nimaps and startoff checks into asserts, and
> the delalloc check below into a regular error bailout.
> 

Sounds reasonable.

> > > > > > +
> > > > > > +	/*
> > > > > > +	 * We can only remap as many blocks as the smaller of the two extent
> > > > > > +	 * maps, because we can only remap one extent at a time.
> > > > > > +	 */
> > > > > > +	imap->br_blockcount = min(imap->br_blockcount, imap2.br_blockcount);
> > > > > > +
> > > > > > +	trace_xfs_reflink_remap_extent2(ip, &imap2);
> > > > > > +
> > > > > > +	/*
> > > > > > +	 * Two extents mapped to the same physical block must not have
> > > > > > +	 * different states; that's filesystem corruption.  Move on to the next
> > > > > > +	 * extent if they're both holes or both the same physical extent.
> > > > > > +	 */
> > > > > > +	if (imap->br_startblock == imap2.br_startblock) {
> > > > > > +		if (imap->br_state != imap2.br_state)
> > > > > > +			error = -EFSCORRUPTED;
> > > > > 
> > > > > Can we assert the length of each extent match here as well?
> > > > 
> > > > We can, but we asked the second bmapi_read to give us a mapping that
> > > > isn't any longer than imap->br_blockcount, and now we've just shortened
> > > > imap->br_blockcount to be no longer than what that second bmapi_read
> > > > returned.
> > > > 
> > 
> > Yeah, that's actually why I asked. :P To be more clear, I thought it
> > might be defensive against future changes because this invariant is an
> > indirect/implicit result of those two things as opposed to some explicit
> > logic somewhere that makes it immediately obvious. Not a huge deal
> > either way though.
> 
> Heh.  Well I added it already.  If nothing else, it makes the logic
> easier to follow. :)
> 

Thanks.

> > > > I guess I don't mind adding a debugging assert just to make sure I
> > > > didn't screw up the math, though it seems excessive... :)
> > > > 
> > > > > > +		goto out_cancel;
> > > > > > +	}
> > > > > > +
> > > > > >  	if (real_extent) {
> > > > > > +		/* No reflinking if we're low on space */
> > > > > > +		error = xfs_reflink_ag_has_free_space(mp,
> > > > > > +				XFS_FSB_TO_AGNO(mp, imap->br_startblock));
> > > > > > +		if (error)
> > > > > > +			goto out_cancel;
> > > > > > +
> > > > > > +
> > > > > > +		/* Do we have enough quota? */
> > > > > >  		error = xfs_trans_reserve_quota_nblks(tp, ip,
> > > > > > -				irec->br_blockcount, 0, XFS_QMOPT_RES_REGBLKS);
> > > > > > +				imap->br_blockcount, 0, XFS_QMOPT_RES_REGBLKS);
> > > > > 
> > > > > Any reason this doesn't accommodate removal of real blocks?
> > > > 
> > > > The old code didn't. :)
> > > > 
> > > > Come to think of it, this under-reserves quota accounting, since we need
> > > > to reserve bmbt blocks too.
> > > > 
> > > > Urk, I guess that needs fixing too.
> > > 
> > > I remembered that unmapping the source extent frees the quota count back
> > > to the dquot, not to the transaction quota reservation.  Maybe we need
> > > to add a XFS_TRANS_RES_FDBLKS-like thing to quota so that the quota
> > > counter ping-pong on't cause us to blow out the quota, and then we can
> > > relax the quota requirements of a reflink call?
> > > 
> > 
> > Hmm.. do you mean the deferred unmap operation? If that's the case, why
> > would we include the unmapped blocks in the delta?
> 
> No.  Deferred bmap operations only do quota accounting for bmbt shap
> changes.  The lead transaction in the defer ops chain must do the quota
> accounting for the extents being mapped/unmapped.
> 

Ok.

> I was actually referring to the potential for freeing and allocating
> bmbt blocks during the operation, but thinking about this more
> thoroughly, I think we need enough quota reservation for:
> 
>  a) + the extent we're mapping in
>  b) + split of the bmbt
>  c) - the extent we're unmapping, if it's real
> 
> In my patch queue I've added (b) as an easily backportable fix before
> the reflink refactoring series.  I think (c) can be added after the
> refactoring because it's easier to reason about that once we're only
> doing one extent move per transaction.
> 

Agreed.

Brian

> --D
> 
> > Brian
> > 
> > > --D
> > > 
> > > > > 
> > > > > >  		if (error)
> > > > > >  			goto out_cancel;
> > > > > >  	}
> > > > > >  
> > > > > > -	trace_xfs_reflink_remap(ip, irec->br_startoff,
> > > > > > -				irec->br_blockcount, irec->br_startblock);
> > > > > > -
> > > > > > -	/* Unmap the old blocks in the data fork. */
> > > > > > -	rlen = unmap_len;
> > > > > > -	while (rlen) {
> > > > > > -		ASSERT(tp->t_firstblock == NULLFSBLOCK);
> > > > > > -		error = __xfs_bunmapi(tp, ip, destoff, &rlen, 0, 1);
> > > > > > -		if (error)
> > > > > > -			goto out_cancel;
> > > > > > -
> > > > > > +	if (xfs_bmap_is_mapped_extent(&imap2)) {
> > > > > >  		/*
> > > > > > -		 * Trim the extent to whatever got unmapped.
> > > > > > -		 * Remember, bunmapi works backwards.
> > > > > > +		 * If the extent we're unmapping is backed by storage (written
> > > > > > +		 * or not), unmap the extent and drop its refcount.
> > > > > >  		 */
> > > > > > -		uirec.br_startblock = irec->br_startblock + rlen;
> > > > > > -		uirec.br_startoff = irec->br_startoff + rlen;
> > > > > > -		uirec.br_blockcount = unmap_len - rlen;
> > > > > > -		uirec.br_state = irec->br_state;
> > > > > > -		unmap_len = rlen;
> > > > > > +		xfs_bmap_unmap_extent(tp, ip, &imap2);
> > > > > > +		xfs_refcount_decrease_extent(tp, &imap2);
> > > > > > +		ip_delta -= imap2.br_blockcount;
> > > > > > +	} else if (imap2.br_startblock == DELAYSTARTBLOCK) {
> > > > > > +		xfs_filblks_t	len = imap2.br_blockcount;
> > > > > >  
> > > > > > -		/* If this isn't a real mapping, we're done. */
> > > > > > -		if (!real_extent || uirec.br_blockcount == 0)
> > > > > > -			goto next_extent;
> > > > > > -
> > > > > > -		trace_xfs_reflink_remap(ip, uirec.br_startoff,
> > > > > > -				uirec.br_blockcount, uirec.br_startblock);
> > > > > > -
> > > > > > -		/* Update the refcount tree */
> > > > > > -		xfs_refcount_increase_extent(tp, &uirec);
> > > > > > -
> > > > > > -		/* Map the new blocks into the data fork. */
> > > > > > -		xfs_bmap_map_extent(tp, ip, &uirec);
> > > > > > +		/*
> > > > > > +		 * If the extent we're unmapping is a delalloc reservation,
> > > > > > +		 * we can use the regular bunmapi function to release the
> > > > > > +		 * incore state.  Dropping the delalloc reservation takes care
> > > > > > +		 * of the quota reservation for us.
> > > > > > +		 */
> > > > > > +		error = __xfs_bunmapi(NULL, ip, imap2.br_startoff, &len, 0, 1);
> > > > > > +		if (error)
> > > > > > +			goto out_cancel;
> > > > > > +		ASSERT(len == 0);
> > > > > > +	}
> > > > > >  
> > > > > > -		/* Update quota accounting. */
> > > > > > -		xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_BCOUNT,
> > > > > > -				uirec.br_blockcount);
> > > > > > +	/*
> > > > > > +	 * If the extent we're sharing is backed by written storage, increase
> > > > > > +	 * its refcount and map it into the file.
> > > > > > +	 */
> > > > > > +	if (real_extent) {
> > > > > > +		xfs_refcount_increase_extent(tp, imap);
> > > > > > +		xfs_bmap_map_extent(tp, ip, imap);
> > > > > > +		ip_delta += imap->br_blockcount;
> > > > > > +	}
> > > > > >  
> > > > > > -		/* Update dest isize if needed. */
> > > > > > -		newlen = XFS_FSB_TO_B(mp,
> > > > > > -				uirec.br_startoff + uirec.br_blockcount);
> > > > > > -		newlen = min_t(xfs_off_t, newlen, new_isize);
> > > > > > -		if (newlen > i_size_read(VFS_I(ip))) {
> > > > > > -			trace_xfs_reflink_update_inode_size(ip, newlen);
> > > > > > -			i_size_write(VFS_I(ip), newlen);
> > > > > > -			ip->i_d.di_size = newlen;
> > > > > > -			xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
> > > > > > -		}
> > > > > > +	xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_BCOUNT, ip_delta);
> > > > > >  
> > > > > > -next_extent:
> > > > > > -		/* Process all the deferred stuff. */
> > > > > > -		error = xfs_defer_finish(&tp);
> > > > > > -		if (error)
> > > > > > -			goto out_cancel;
> > > > > > +	/* Update dest isize if needed. */
> > > > > > +	newlen = XFS_FSB_TO_B(mp, imap2.br_startoff + imap2.br_blockcount);
> > > > > > +	newlen = min_t(xfs_off_t, newlen, new_isize);
> > > > > > +	if (newlen > i_size_read(VFS_I(ip))) {
> > > > > > +		trace_xfs_reflink_update_inode_size(ip, newlen);
> > > > > > +		i_size_write(VFS_I(ip), newlen);
> > > > > > +		ip->i_d.di_size = newlen;
> > > > > > +		xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
> > > > > >  	}
> > > > > >  
> > > > > > +	/* Commit everything and unlock. */
> > > > > >  	error = xfs_trans_commit(tp);
> > > > > > -	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > > > > -	if (error)
> > > > > > -		goto out;
> > > > > > -	return 0;
> > > > > > +	goto out_unlock;
> > > > > 
> > > > > We probably shouldn't fire the _error() tracepoint in the successful
> > > > > exit case.
> > > > 
> > > > Oops, good catch. :)
> > > > 
> > > > > >  
> > > > > >  out_cancel:
> > > > > >  	xfs_trans_cancel(tp);
> > > > > > +out_unlock:
> > > > > >  	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > > > >  out:
> > > > > >  	trace_xfs_reflink_remap_extent_error(ip, error, _RET_IP_);
> > > > > >  	return error;
> > > > > >  }
> > > > > >  
> > > > > > -/*
> > > > > > - * Iteratively remap one file's extents (and holes) to another's.
> > > > > > - */
> > > > > > +/* Remap a range of one file to the other. */
> > > > > >  int
> > > > > >  xfs_reflink_remap_blocks(
> > > > > >  	struct xfs_inode	*src,
> > > > > > @@ -1116,25 +1134,22 @@ xfs_reflink_remap_blocks(
> > > > > >  	loff_t			*remapped)
> > > > > >  {
> > > > > >  	struct xfs_bmbt_irec	imap;
> > > > > > -	xfs_fileoff_t		srcoff;
> > > > > > -	xfs_fileoff_t		destoff;
> > > > > > +	struct xfs_mount	*mp = src->i_mount;
> > > > > > +	xfs_fileoff_t		srcoff = XFS_B_TO_FSBT(mp, pos_in);
> > > > > > +	xfs_fileoff_t		destoff = XFS_B_TO_FSBT(mp, pos_out);
> > > > > >  	xfs_filblks_t		len;
> > > > > > -	xfs_filblks_t		range_len;
> > > > > >  	xfs_filblks_t		remapped_len = 0;
> > > > > >  	xfs_off_t		new_isize = pos_out + remap_len;
> > > > > >  	int			nimaps;
> > > > > >  	int			error = 0;
> > > > > >  
> > > > > > -	destoff = XFS_B_TO_FSBT(src->i_mount, pos_out);
> > > > > > -	srcoff = XFS_B_TO_FSBT(src->i_mount, pos_in);
> > > > > > -	len = XFS_B_TO_FSB(src->i_mount, remap_len);
> > > > > > +	len = min_t(xfs_filblks_t, XFS_B_TO_FSB(mp, remap_len),
> > > > > > +			XFS_MAX_FILEOFF);
> > > > > >  
> > > > > > -	/* drange = (destoff, destoff + len); srange = (srcoff, srcoff + len) */
> > > > > > -	while (len) {
> > > > > > -		uint		lock_mode;
> > > > > > +	trace_xfs_reflink_remap_blocks(src, srcoff, len, dest, destoff);
> > > > > >  
> > > > > > -		trace_xfs_reflink_remap_blocks_loop(src, srcoff, len,
> > > > > > -				dest, destoff);
> > > > > > +	while (len > 0) {
> > > > > > +		unsigned int	lock_mode;
> > > > > >  
> > > > > >  		/* Read extent from the source file */
> > > > > >  		nimaps = 1;
> > > > > > @@ -1143,18 +1158,29 @@ xfs_reflink_remap_blocks(
> > > > > >  		xfs_iunlock(src, lock_mode);
> > > > > >  		if (error)
> > > > > >  			break;
> > > > > > -		ASSERT(nimaps == 1);
> > > > > > +#ifdef DEBUG
> > > > > > +		if (nimaps != 1 ||
> > > > > > +		    imap.br_startblock == DELAYSTARTBLOCK ||
> > > > > > +		    imap.br_startoff != srcoff) {
> > > > > > +			/*
> > > > > > +			 * We should never get no mapping or a delalloc extent
> > > > > > +			 * or something that doesn't match what we asked for,
> > > > > > +			 * since the caller flushed the source file data and
> > > > > > +			 * we hold the source file io/mmap locks.
> > > > > > +			 */
> > > > > > +			ASSERT(nimaps == 0);
> > > > > > +			ASSERT(imap.br_startblock != DELAYSTARTBLOCK);
> > > > > > +			ASSERT(imap.br_startoff == srcoff);
> > > > > > +			error = -EINVAL;
> > > > > > +			break;
> > > > > > +		}
> > > > > > +#endif
> > > > > 
> > > > > Same thing with the DEBUG check? It seems like at least some of these
> > > > > checks should be unconditional in the event the file is corrupted or
> > > > > something.
> > > > 
> > > > Doh... yeah, I guess the delalloc check really needs to be in there, in
> > > > case a previous write+fsync failed and left us some remnants.
> > > > 
> > > > > >  
> > > > > > -		trace_xfs_reflink_remap_imap(src, srcoff, len, XFS_DATA_FORK,
> > > > > > -				&imap);
> > > > > > +		trace_xfs_reflink_remap_extent1(src, &imap);
> > > > > 
> > > > > How about trace_xfs_reflink_remap_extent_[src|dest]() so trace data is a
> > > > > bit more readable?
> > > > 
> > > > <nod> I like the suggestion!
> > > > 
> > > > > >  
> > > > > > -		/* Translate imap into the destination file. */
> > > > > > -		range_len = imap.br_startoff + imap.br_blockcount - srcoff;
> > > > > > -		imap.br_startoff += destoff - srcoff;
> > > > > > -
> > > > > > -		/* Clear dest from destoff to the end of imap and map it in. */
> > > > > > -		error = xfs_reflink_remap_extent(dest, &imap, destoff,
> > > > > > -				new_isize);
> > > > > > +		/* Remap into the destination file at the given offset. */
> > > > > > +		imap.br_startoff = destoff;
> > > > > > +		error = xfs_reflink_remap_extent(dest, &imap, new_isize);
> > > > > >  		if (error)
> > > > > >  			break;
> > > > > >  
> > > > > > @@ -1164,10 +1190,10 @@ xfs_reflink_remap_blocks(
> > > > > >  		}
> > > > > >  
> > > > > >  		/* Advance drange/srange */
> > > > > > -		srcoff += range_len;
> > > > > > -		destoff += range_len;
> > > > > > -		len -= range_len;
> > > > > > -		remapped_len += range_len;
> > > > > > +		srcoff += imap.br_blockcount;
> > > > > > +		destoff += imap.br_blockcount;
> > > > > > +		len -= imap.br_blockcount;
> > > > > > +		remapped_len += imap.br_blockcount;
> > > > > 
> > > > > Much nicer iteration, indeed. ;)
> > > > 
> > > > Thanks! :)
> > > > 
> > > > --D
> > > > 
> > > > > Brian
> > > > > 
> > > > > >  	}
> > > > > >  
> > > > > >  	if (error)
> > > > > > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> > > > > > index 460136628a79..f205602c8ba9 100644
> > > > > > --- a/fs/xfs/xfs_trace.h
> > > > > > +++ b/fs/xfs/xfs_trace.h
> > > > > > @@ -3052,8 +3052,7 @@ DEFINE_EVENT(xfs_inode_irec_class, name, \
> > > > > >  DEFINE_INODE_EVENT(xfs_reflink_set_inode_flag);
> > > > > >  DEFINE_INODE_EVENT(xfs_reflink_unset_inode_flag);
> > > > > >  DEFINE_ITRUNC_EVENT(xfs_reflink_update_inode_size);
> > > > > > -DEFINE_IMAP_EVENT(xfs_reflink_remap_imap);
> > > > > > -TRACE_EVENT(xfs_reflink_remap_blocks_loop,
> > > > > > +TRACE_EVENT(xfs_reflink_remap_blocks,
> > > > > >  	TP_PROTO(struct xfs_inode *src, xfs_fileoff_t soffset,
> > > > > >  		 xfs_filblks_t len, struct xfs_inode *dest,
> > > > > >  		 xfs_fileoff_t doffset),
> > > > > > @@ -3084,59 +3083,14 @@ TRACE_EVENT(xfs_reflink_remap_blocks_loop,
> > > > > >  		  __entry->dest_ino,
> > > > > >  		  __entry->dest_lblk)
> > > > > >  );
> > > > > > -TRACE_EVENT(xfs_reflink_punch_range,
> > > > > > -	TP_PROTO(struct xfs_inode *ip, xfs_fileoff_t lblk,
> > > > > > -		 xfs_extlen_t len),
> > > > > > -	TP_ARGS(ip, lblk, len),
> > > > > > -	TP_STRUCT__entry(
> > > > > > -		__field(dev_t, dev)
> > > > > > -		__field(xfs_ino_t, ino)
> > > > > > -		__field(xfs_fileoff_t, lblk)
> > > > > > -		__field(xfs_extlen_t, len)
> > > > > > -	),
> > > > > > -	TP_fast_assign(
> > > > > > -		__entry->dev = VFS_I(ip)->i_sb->s_dev;
> > > > > > -		__entry->ino = ip->i_ino;
> > > > > > -		__entry->lblk = lblk;
> > > > > > -		__entry->len = len;
> > > > > > -	),
> > > > > > -	TP_printk("dev %d:%d ino 0x%llx lblk 0x%llx len 0x%x",
> > > > > > -		  MAJOR(__entry->dev), MINOR(__entry->dev),
> > > > > > -		  __entry->ino,
> > > > > > -		  __entry->lblk,
> > > > > > -		  __entry->len)
> > > > > > -);
> > > > > > -TRACE_EVENT(xfs_reflink_remap,
> > > > > > -	TP_PROTO(struct xfs_inode *ip, xfs_fileoff_t lblk,
> > > > > > -		 xfs_extlen_t len, xfs_fsblock_t new_pblk),
> > > > > > -	TP_ARGS(ip, lblk, len, new_pblk),
> > > > > > -	TP_STRUCT__entry(
> > > > > > -		__field(dev_t, dev)
> > > > > > -		__field(xfs_ino_t, ino)
> > > > > > -		__field(xfs_fileoff_t, lblk)
> > > > > > -		__field(xfs_extlen_t, len)
> > > > > > -		__field(xfs_fsblock_t, new_pblk)
> > > > > > -	),
> > > > > > -	TP_fast_assign(
> > > > > > -		__entry->dev = VFS_I(ip)->i_sb->s_dev;
> > > > > > -		__entry->ino = ip->i_ino;
> > > > > > -		__entry->lblk = lblk;
> > > > > > -		__entry->len = len;
> > > > > > -		__entry->new_pblk = new_pblk;
> > > > > > -	),
> > > > > > -	TP_printk("dev %d:%d ino 0x%llx lblk 0x%llx len 0x%x new_pblk %llu",
> > > > > > -		  MAJOR(__entry->dev), MINOR(__entry->dev),
> > > > > > -		  __entry->ino,
> > > > > > -		  __entry->lblk,
> > > > > > -		  __entry->len,
> > > > > > -		  __entry->new_pblk)
> > > > > > -);
> > > > > >  DEFINE_DOUBLE_IO_EVENT(xfs_reflink_remap_range);
> > > > > >  DEFINE_INODE_ERROR_EVENT(xfs_reflink_remap_range_error);
> > > > > >  DEFINE_INODE_ERROR_EVENT(xfs_reflink_set_inode_flag_error);
> > > > > >  DEFINE_INODE_ERROR_EVENT(xfs_reflink_update_inode_size_error);
> > > > > >  DEFINE_INODE_ERROR_EVENT(xfs_reflink_remap_blocks_error);
> > > > > >  DEFINE_INODE_ERROR_EVENT(xfs_reflink_remap_extent_error);
> > > > > > +DEFINE_INODE_IREC_EVENT(xfs_reflink_remap_extent1);
> > > > > > +DEFINE_INODE_IREC_EVENT(xfs_reflink_remap_extent2);
> > > > > >  
> > > > > >  /* dedupe tracepoints */
> > > > > >  DEFINE_DOUBLE_IO_EVENT(xfs_reflink_compare_extents);
> > > > > > diff --git a/fs/xfs/xfs_trans_dquot.c b/fs/xfs/xfs_trans_dquot.c
> > > > > > index c0f73b82c055..99511ff6222f 100644
> > > > > > --- a/fs/xfs/xfs_trans_dquot.c
> > > > > > +++ b/fs/xfs/xfs_trans_dquot.c
> > > > > > @@ -124,16 +124,17 @@ xfs_trans_dup_dqinfo(
> > > > > >   */
> > > > > >  void
> > > > > >  xfs_trans_mod_dquot_byino(
> > > > > > -	xfs_trans_t	*tp,
> > > > > > -	xfs_inode_t	*ip,
> > > > > > -	uint		field,
> > > > > > -	int64_t		delta)
> > > > > > +	struct xfs_trans	*tp,
> > > > > > +	struct xfs_inode	*ip,
> > > > > > +	uint			field,
> > > > > > +	int64_t			delta)
> > > > > >  {
> > > > > > -	xfs_mount_t	*mp = tp->t_mountp;
> > > > > > +	struct xfs_mount	*mp = tp->t_mountp;
> > > > > >  
> > > > > >  	if (!XFS_IS_QUOTA_RUNNING(mp) ||
> > > > > >  	    !XFS_IS_QUOTA_ON(mp) ||
> > > > > > -	    xfs_is_quota_inode(&mp->m_sb, ip->i_ino))
> > > > > > +	    xfs_is_quota_inode(&mp->m_sb, ip->i_ino) ||
> > > > > > +	    delta == 0)
> > > > > >  		return;
> > > > > >  
> > > > > >  	if (tp->t_dqinfo == NULL)
> > > > > > 
> > > > > 
> > > 
> > 
> 


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2020-06-24 18:46 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-23  4:01 [PATCH 0/3] xfs: reflink cleanups Darrick J. Wong
2020-06-23  4:01 ` [PATCH 1/3] xfs: redesign the reflink remap loop to fix blkres depletion crash Darrick J. Wong
2020-06-24 12:38   ` Brian Foster
2020-06-24 15:09     ` Darrick J. Wong
2020-06-24 17:07       ` Darrick J. Wong
2020-06-24 18:16         ` Brian Foster
2020-06-24 18:35           ` Darrick J. Wong
2020-06-24 18:45             ` Brian Foster
2020-06-23  4:01 ` [PATCH 2/3] xfs: fix xfs_reflink_remap_prep calling conventions Darrick J. Wong
2020-06-24 12:38   ` Brian Foster
2020-06-23  4:01 ` [PATCH 3/3] xfs: refactor locking and unlocking two inodes against userspace IO Darrick J. Wong
2020-06-24 12:39   ` Brian Foster

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.