[PATCH v2 0/5] clean up collapse range and handle post-eof delalloc

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH v2 0/5] clean up collapse range and handle post-eof delalloc
@ 2014-09-10 13:20 Brian Foster
  2014-09-10 13:20 ` [PATCH v2 1/5] xfs: track collapse via file offset rather than extent index Brian Foster
                   ` (5 more replies)
  0 siblings, 6 replies; 14+ messages in thread
From: Brian Foster @ 2014-09-10 13:20 UTC (permalink / raw)
  To: xfs

Hi all,

Here's v2 of the collapse clean up. We refactor a bit more via the
insertion of patch 3, otherwise it's similar to v1. This will see some
continued testing, but it survived ~500m fsx operations overnight.

Brian

v2:
- Rename helpers to xfs_bmse_*() scheme.
- Rearrange logic and minor cleanups to xfs_bmse[_can]_merge().
- Inserted patch 3 to further refactor xfs_bmap_shift_extents() into
  xfs_bmse_shift_one().
- Return corruption error rather than BUG_ON() in patch 4 and detect
  error in new helper.
v1: http://oss.sgi.com/archives/xfs/2014-09/msg00061.html
- Retain the eofblocks trim and writeback/inval. the range of shifted
  data only.
- Added the xfs_free_file_space() patch to no longer writeback the
  entire file.
rfc: http://oss.sgi.com/archives/xfs/2014-08/msg00462.html

Brian Foster (5):
  xfs: track collapse via file offset rather than extent index
  xfs: refactor shift-by-merge into xfs_bmse_merge() helper
  xfs: refactor single extent shift into xfs_bmse_shift_one() helper
  xfs: writeback and inval. file range to be shifted by collapse
  xfs: only writeback and truncate pages for the freed range

 fs/xfs/libxfs/xfs_bmap.c | 365 +++++++++++++++++++++++++++++++----------------
 fs/xfs/libxfs/xfs_bmap.h |   7 +-
 fs/xfs/xfs_bmap_util.c   |  54 ++++---
 3 files changed, 275 insertions(+), 151 deletions(-)

-- 
1.8.3.1

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v2 1/5] xfs: track collapse via file offset rather than extent index
  2014-09-10 13:20 [PATCH v2 0/5] clean up collapse range and handle post-eof delalloc Brian Foster
@ 2014-09-10 13:20 ` Brian Foster
  2014-09-10 13:20 ` [PATCH v2 2/5] xfs: refactor shift-by-merge into xfs_bmse_merge() helper Brian Foster
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 14+ messages in thread
From: Brian Foster @ 2014-09-10 13:20 UTC (permalink / raw)
  To: xfs

The collapse range implementation uses a transaction per extent shift.
The progress of the overall operation is tracked via the current extent
index of the in-core extent list. This is racy because the ilock must be
dropped and reacquired for each transaction according to locking and log
reservation rules. Therefore, writeback to prior regions of the file is
possible and can change the extent count. This changes the extent to
which the current index refers and causes the collapse to fail mid
operation. To avoid this problem, the entire file is currently written
back before the collapse operation starts.

To eliminate the need to flush the entire file, use the file offset
(fsb) to track the progress of the overall extent shift operation rather
than the extent index. Modify xfs_bmap_shift_extents() to
unconditionally convert the start_fsb parameter to an extent index and
return the file offset of the extent where the shift left off, if
further extents exist. The bulk of ths function can remain based on
extent index as ilock is held by the caller. xfs_collapse_file_space()
now uses the fsb output as the starting point for the subsequent shift.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_bmap.c | 85 +++++++++++++++++++++++++-----------------------
 fs/xfs/libxfs/xfs_bmap.h |  7 ++--
 fs/xfs/xfs_bmap_util.c   | 12 +++----
 3 files changed, 53 insertions(+), 51 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 86df952..4b3f1b9 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -5406,20 +5406,21 @@ error0:
 /*
  * Shift extent records to the left to cover a hole.
  *
- * The maximum number of extents to be shifted in a single operation
- * is @num_exts, and @current_ext keeps track of the current extent
- * index we have shifted. @offset_shift_fsb is the length by which each
- * extent is shifted. If there is no hole to shift the extents
- * into, this will be considered invalid operation and we abort immediately.
+ * The maximum number of extents to be shifted in a single operation is
+ * @num_exts. @start_fsb specifies the file offset to start the shift and the
+ * file offset where we've left off is returned in @next_fsb. @offset_shift_fsb
+ * is the length by which each extent is shifted. If there is no hole to shift
+ * the extents into, this will be considered invalid operation and we abort
+ * immediately.
  */
 int
 xfs_bmap_shift_extents(
 	struct xfs_trans	*tp,
 	struct xfs_inode	*ip,
-	int			*done,
 	xfs_fileoff_t		start_fsb,
 	xfs_fileoff_t		offset_shift_fsb,
-	xfs_extnum_t		*current_ext,
+	int			*done,
+	xfs_fileoff_t		*next_fsb,
 	xfs_fsblock_t		*firstblock,
 	struct xfs_bmap_free	*flist,
 	int			num_exts)
@@ -5431,6 +5432,7 @@ xfs_bmap_shift_extents(
 	struct xfs_mount		*mp = ip->i_mount;
 	struct xfs_ifork		*ifp;
 	xfs_extnum_t			nexts = 0;
+	xfs_extnum_t			current_ext;
 	xfs_fileoff_t			startoff;
 	int				error = 0;
 	int				i;
@@ -5451,7 +5453,8 @@ xfs_bmap_shift_extents(
 	if (XFS_FORCED_SHUTDOWN(mp))
 		return -EIO;
 
-	ASSERT(current_ext != NULL);
+	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_EXCL));
+	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
 
 	ifp = XFS_IFORK_PTR(ip, whichfork);
 	if (!(ifp->if_flags & XFS_IFEXTENTS)) {
@@ -5462,20 +5465,18 @@ xfs_bmap_shift_extents(
 	}
 
 	/*
-	 * If *current_ext is 0, we would need to lookup the extent
-	 * from where we would start shifting and store it in gotp.
+	 * Look up the extent index for the fsb where we start shifting. We can
+	 * henceforth iterate with current_ext as extent list changes are locked
+	 * out via ilock.
+	 *
+	 * gotp can be null in 2 cases: 1) if there are no extents or 2)
+	 * start_fsb lies in a hole beyond which there are no extents. Either
+	 * way, we are done.
 	 */
-	if (!*current_ext) {
-		gotp = xfs_iext_bno_to_ext(ifp, start_fsb, current_ext);
-		/*
-		 * gotp can be null in 2 cases: 1) if there are no extents
-		 * or 2) start_fsb lies in a hole beyond which there are
-		 * no extents. Either way, we are done.
-		 */
-		if (!gotp) {
-			*done = 1;
-			return 0;
-		}
+	gotp = xfs_iext_bno_to_ext(ifp, start_fsb, &current_ext);
+	if (!gotp) {
+		*done = 1;
+		return 0;
 	}
 
 	if (ifp->if_flags & XFS_IFBROOT) {
@@ -5487,36 +5488,32 @@ xfs_bmap_shift_extents(
 
 	/*
 	 * There may be delalloc extents in the data fork before the range we
-	 * are collapsing out, so we cannot
-	 * use the count of real extents here. Instead we have to calculate it
-	 * from the incore fork.
+	 * are collapsing out, so we cannot use the count of real extents here.
+	 * Instead we have to calculate it from the incore fork.
 	 */
 	total_extents = ifp->if_bytes / sizeof(xfs_bmbt_rec_t);
-	while (nexts++ < num_exts && *current_ext < total_extents) {
+	while (nexts++ < num_exts && current_ext < total_extents) {
 
-		gotp = xfs_iext_get_ext(ifp, *current_ext);
+		gotp = xfs_iext_get_ext(ifp, current_ext);
 		xfs_bmbt_get_all(gotp, &got);
 		startoff = got.br_startoff - offset_shift_fsb;
 
 		/*
-		 * Before shifting extent into hole, make sure that the hole
-		 * is large enough to accomodate the shift.
+		 * Before shifting extent into hole, make sure that the hole is
+		 * large enough to accommodate the shift.
 		 */
-		if (*current_ext) {
-			xfs_bmbt_get_all(xfs_iext_get_ext(ifp,
-						*current_ext - 1), &left);
-
+		if (current_ext > 0) {
+			xfs_bmbt_get_all(xfs_iext_get_ext(ifp, current_ext - 1),
+					 &left);
 			if (startoff < left.br_startoff + left.br_blockcount)
 				error = -EINVAL;
 		} else if (offset_shift_fsb > got.br_startoff) {
 			/*
-			 * When first extent is shifted, offset_shift_fsb
-			 * should be less than the stating offset of
-			 * the first extent.
+			 * When first extent is shifted, offset_shift_fsb should
+			 * be less than the stating offset of the first extent.
 			 */
 			error = -EINVAL;
 		}
-
 		if (error)
 			goto del_cursor;
 
@@ -5531,7 +5528,7 @@ xfs_bmap_shift_extents(
 		}
 
 		/* Check if we can merge 2 adjacent extents */
-		if (*current_ext &&
+		if (current_ext &&
 		    left.br_startoff + left.br_blockcount == startoff &&
 		    left.br_startblock + left.br_blockcount ==
 				got.br_startblock &&
@@ -5539,7 +5536,7 @@ xfs_bmap_shift_extents(
 		    left.br_blockcount + got.br_blockcount <= MAXEXTLEN) {
 			blockcount = left.br_blockcount +
 				got.br_blockcount;
-			xfs_iext_remove(ip, *current_ext, 1, 0);
+			xfs_iext_remove(ip, current_ext, 1, 0);
 			logflags |= XFS_ILOG_CORE;
 			if (cur) {
 				error = xfs_btree_delete(cur, &i);
@@ -5551,7 +5548,7 @@ xfs_bmap_shift_extents(
 			}
 			XFS_IFORK_NEXT_SET(ip, whichfork,
 				XFS_IFORK_NEXTENTS(ip, whichfork) - 1);
-			gotp = xfs_iext_get_ext(ifp, --*current_ext);
+			gotp = xfs_iext_get_ext(ifp, --current_ext);
 			xfs_bmbt_get_all(gotp, &got);
 
 			/* Make cursor point to the extent we will update */
@@ -5585,13 +5582,18 @@ xfs_bmap_shift_extents(
 			logflags |= XFS_ILOG_DEXT;
 		}
 
-		(*current_ext)++;
+		current_ext++;
 		total_extents = ifp->if_bytes / sizeof(xfs_bmbt_rec_t);
 	}
 
 	/* Check if we are done */
-	if (*current_ext == total_extents)
+	if (current_ext == total_extents)
 		*done = 1;
+	else if (next_fsb) {
+		gotp = xfs_iext_get_ext(ifp, current_ext);
+		xfs_bmbt_get_all(gotp, &got);
+		*next_fsb = got.br_startoff;
+	}
 
 del_cursor:
 	if (cur)
@@ -5600,5 +5602,6 @@ del_cursor:
 
 	if (logflags)
 		xfs_trans_log_inode(tp, ip, logflags);
+
 	return error;
 }
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index b879ca5..44db6db 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -178,9 +178,8 @@ int	xfs_check_nostate_extents(struct xfs_ifork *ifp, xfs_extnum_t idx,
 		xfs_extnum_t num);
 uint	xfs_default_attroffset(struct xfs_inode *ip);
 int	xfs_bmap_shift_extents(struct xfs_trans *tp, struct xfs_inode *ip,
-		int *done, xfs_fileoff_t start_fsb,
-		xfs_fileoff_t offset_shift_fsb, xfs_extnum_t *current_ext,
-		xfs_fsblock_t *firstblock, struct xfs_bmap_free	*flist,
-		int num_exts);
+		xfs_fileoff_t start_fsb, xfs_fileoff_t offset_shift_fsb,
+		int *done, xfs_fileoff_t *next_fsb, xfs_fsblock_t *firstblock,
+		struct xfs_bmap_free *flist, int num_exts);
 
 #endif	/* __XFS_BMAP_H__ */
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 1707980..1e96d77 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1456,18 +1456,18 @@ xfs_collapse_file_space(
 	struct xfs_mount	*mp = ip->i_mount;
 	struct xfs_trans	*tp;
 	int			error;
-	xfs_extnum_t		current_ext = 0;
 	struct xfs_bmap_free	free_list;
 	xfs_fsblock_t		first_block;
 	int			committed;
 	xfs_fileoff_t		start_fsb;
+	xfs_fileoff_t		next_fsb;
 	xfs_fileoff_t		shift_fsb;
 
 	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_EXCL));
 
 	trace_xfs_collapse_file_space(ip);
 
-	start_fsb = XFS_B_TO_FSB(mp, offset + len);
+	next_fsb = XFS_B_TO_FSB(mp, offset + len);
 	shift_fsb = XFS_B_TO_FSB(mp, len);
 
 	/*
@@ -1525,10 +1525,10 @@ xfs_collapse_file_space(
 		 * We are using the write transaction in which max 2 bmbt
 		 * updates are allowed
 		 */
-		error = xfs_bmap_shift_extents(tp, ip, &done, start_fsb,
-					       shift_fsb, &current_ext,
-					       &first_block, &free_list,
-					       XFS_BMAP_MAX_SHIFT_EXTENTS);
+		start_fsb = next_fsb;
+		error = xfs_bmap_shift_extents(tp, ip, start_fsb, shift_fsb,
+				&done, &next_fsb, &first_block, &free_list,
+				XFS_BMAP_MAX_SHIFT_EXTENTS);
 		if (error)
 			goto out;
 
-- 
1.8.3.1

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v2 2/5] xfs: refactor shift-by-merge into xfs_bmse_merge() helper
  2014-09-10 13:20 [PATCH v2 0/5] clean up collapse range and handle post-eof delalloc Brian Foster
  2014-09-10 13:20 ` [PATCH v2 1/5] xfs: track collapse via file offset rather than extent index Brian Foster
@ 2014-09-10 13:20 ` Brian Foster
  2014-09-10 13:20 ` [PATCH v2 3/5] xfs: refactor single extent shift into xfs_bmse_shift_one() helper Brian Foster
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 14+ messages in thread
From: Brian Foster @ 2014-09-10 13:20 UTC (permalink / raw)
  To: xfs

The extent shift mechanism in xfs_bmap_shift_extents() is complicated
and handles several different, non-deterministic scenarios. These
include extent shifts, extent merges and potential btree updates in
either of the former scenarios.

Refactor the code to be more linear and readable. The loop logic in
xfs_bmap_shift_extents() and some initial error checking is adjusted
slightly. The associated btree lookup and update/delete operations are
condensed into single blocks of code. This reduces the number of
btree-specific blocks and facilitates the separation of the merge
operation into a new xfs_bmse_merge() and xfs_bmse_can_merge() helpers.

This is a code refactor only. The behavior of extent shift and collapse
range is not modified.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_bmap.c | 242 ++++++++++++++++++++++++++++++++---------------
 1 file changed, 167 insertions(+), 75 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 4b3f1b9..532c4aa 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -5404,6 +5404,120 @@ error0:
 }
 
 /*
+ * Determine whether an extent shift can be accomplished by a merge with the
+ * extent that precedes the target hole of the shift.
+ */
+STATIC bool
+xfs_bmse_can_merge(
+	struct xfs_bmbt_irec	*left,	/* preceding extent */
+	struct xfs_bmbt_irec	*got,	/* current extent to shift */
+	xfs_fileoff_t		shift)	/* shift fsb */
+{
+	xfs_fileoff_t		startoff;
+
+	startoff = got->br_startoff - shift;
+
+	/*
+	 * The extent, once shifted, must be adjacent in-file and on-disk with
+	 * the preceding extent.
+	 */
+	if ((left->br_startoff + left->br_blockcount != startoff) ||
+	    (left->br_startblock + left->br_blockcount != got->br_startblock) ||
+	    (left->br_state != got->br_state) ||
+	    (left->br_blockcount + got->br_blockcount > MAXEXTLEN))
+		return false;
+
+	return true;
+}
+
+/*
+ * A bmap extent shift adjusts the file offset of an extent to fill a preceding
+ * hole in the file. If an extent shift would result in the extent being fully
+ * adjacent to the extent that currently precedes the hole, we can merge with
+ * the preceding extent rather than do the shift.
+ *
+ * This function assumes the caller has verified a shift-by-merge is possible
+ * with the provided extents via xfs_bmse_can_merge().
+ */
+STATIC int
+xfs_bmse_merge(
+	struct xfs_inode		*ip,
+	int				whichfork,
+	xfs_fileoff_t			shift,		/* shift fsb */
+	int				current_ext,	/* idx of gotp */
+	struct xfs_bmbt_rec_host	*gotp,		/* extent to shift */
+	struct xfs_bmbt_rec_host	*leftp,		/* preceding extent */
+	struct xfs_btree_cur		*cur,
+	int				*logflags)	/* output */
+{
+	struct xfs_ifork		*ifp;
+	struct xfs_bmbt_irec		got;
+	struct xfs_bmbt_irec		left;
+	xfs_filblks_t			blockcount;
+	int				error, i;
+
+	ifp = XFS_IFORK_PTR(ip, whichfork);
+	xfs_bmbt_get_all(gotp, &got);
+	xfs_bmbt_get_all(leftp, &left);
+	blockcount = left.br_blockcount + got.br_blockcount;
+
+	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_EXCL));
+	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
+	ASSERT(xfs_bmse_can_merge(&left, &got, shift));
+
+	/*
+	 * Merge the in-core extents. Note that the host record pointers and
+	 * current_ext index are invalid once the extent has been removed via
+	 * xfs_iext_remove().
+	 */
+	xfs_bmbt_set_blockcount(leftp, blockcount);
+	xfs_iext_remove(ip, current_ext, 1, 0);
+
+	/*
+	 * Update the on-disk extent count, the btree if necessary and log the
+	 * inode.
+	 */
+	XFS_IFORK_NEXT_SET(ip, whichfork,
+			   XFS_IFORK_NEXTENTS(ip, whichfork) - 1);
+	*logflags |= XFS_ILOG_CORE;
+	if (!cur) {
+		*logflags |= XFS_ILOG_DEXT;
+		return 0;
+	}
+
+	/* lookup and remove the extent to merge */
+	error = xfs_bmbt_lookup_eq(cur, got.br_startoff, got.br_startblock,
+				   got.br_blockcount, &i);
+	if (error)
+		goto out_error;
+	XFS_WANT_CORRUPTED_GOTO(i == 1, out_error);
+
+	error = xfs_btree_delete(cur, &i);
+	if (error)
+		goto out_error;
+	XFS_WANT_CORRUPTED_GOTO(i == 1, out_error);
+
+	/* lookup and update size of the previous extent */
+	error = xfs_bmbt_lookup_eq(cur, left.br_startoff, left.br_startblock,
+				   left.br_blockcount, &i);
+	if (error)
+		goto out_error;
+	XFS_WANT_CORRUPTED_GOTO(i == 1, out_error);
+
+	left.br_blockcount = blockcount;
+
+	error = xfs_bmbt_update(cur, left.br_startoff, left.br_startblock,
+				left.br_blockcount, left.br_state);
+	if (error)
+		goto out_error;
+
+	return 0;
+
+out_error:
+	return error;
+}
+
+/*
  * Shift extent records to the left to cover a hole.
  *
  * The maximum number of extents to be shifted in a single operation is
@@ -5427,6 +5541,7 @@ xfs_bmap_shift_extents(
 {
 	struct xfs_btree_cur		*cur = NULL;
 	struct xfs_bmbt_rec_host	*gotp;
+	struct xfs_bmbt_rec_host	*leftp;
 	struct xfs_bmbt_irec            got;
 	struct xfs_bmbt_irec		left;
 	struct xfs_mount		*mp = ip->i_mount;
@@ -5438,7 +5553,6 @@ xfs_bmap_shift_extents(
 	int				i;
 	int				whichfork = XFS_DATA_FORK;
 	int				logflags = 0;
-	xfs_filblks_t			blockcount = 0;
 	int				total_extents;
 
 	if (unlikely(XFS_TEST_ERROR(
@@ -5464,6 +5578,13 @@ xfs_bmap_shift_extents(
 			return error;
 	}
 
+	if (ifp->if_flags & XFS_IFBROOT) {
+		cur = xfs_bmbt_init_cursor(mp, tp, ip, whichfork);
+		cur->bc_private.b.firstblock = *firstblock;
+		cur->bc_private.b.flist = flist;
+		cur->bc_private.b.flags = 0;
+	}
+
 	/*
 	 * Look up the extent index for the fsb where we start shifting. We can
 	 * henceforth iterate with current_ext as extent list changes are locked
@@ -5476,14 +5597,17 @@ xfs_bmap_shift_extents(
 	gotp = xfs_iext_bno_to_ext(ifp, start_fsb, &current_ext);
 	if (!gotp) {
 		*done = 1;
-		return 0;
+		goto del_cursor;
 	}
+	xfs_bmbt_get_all(gotp, &got);
 
-	if (ifp->if_flags & XFS_IFBROOT) {
-		cur = xfs_bmbt_init_cursor(mp, tp, ip, whichfork);
-		cur->bc_private.b.firstblock = *firstblock;
-		cur->bc_private.b.flist = flist;
-		cur->bc_private.b.flags = 0;
+	/*
+	 * If the first extent is shifted, offset_shift_fsb cannot be larger
+	 * than the starting offset of the first extent.
+	 */
+	if (current_ext == 0 && got.br_startoff < offset_shift_fsb) {
+		error = -EINVAL;
+		goto del_cursor;
 	}
 
 	/*
@@ -5493,30 +5617,41 @@ xfs_bmap_shift_extents(
 	 */
 	total_extents = ifp->if_bytes / sizeof(xfs_bmbt_rec_t);
 	while (nexts++ < num_exts && current_ext < total_extents) {
-
-		gotp = xfs_iext_get_ext(ifp, current_ext);
-		xfs_bmbt_get_all(gotp, &got);
 		startoff = got.br_startoff - offset_shift_fsb;
 
-		/*
-		 * Before shifting extent into hole, make sure that the hole is
-		 * large enough to accommodate the shift.
-		 */
+		/* grab the left extent and check for a potential merge */
 		if (current_ext > 0) {
-			xfs_bmbt_get_all(xfs_iext_get_ext(ifp, current_ext - 1),
-					 &left);
-			if (startoff < left.br_startoff + left.br_blockcount)
+			leftp = xfs_iext_get_ext(ifp, current_ext - 1);
+			xfs_bmbt_get_all(leftp, &left);
+
+			/* make sure hole is large enough for shift */
+			if (startoff < left.br_startoff + left.br_blockcount) {
 				error = -EINVAL;
-		} else if (offset_shift_fsb > got.br_startoff) {
-			/*
-			 * When first extent is shifted, offset_shift_fsb should
-			 * be less than the stating offset of the first extent.
-			 */
-			error = -EINVAL;
+				goto del_cursor;
+			}
+
+			if (xfs_bmse_can_merge(&left, &got, offset_shift_fsb)) {
+				error = xfs_bmse_merge(ip, whichfork,
+						offset_shift_fsb, current_ext, gotp,
+						leftp, cur, &logflags);
+				if (error)
+					goto del_cursor;
+
+				/*
+				 * The extent was merged so adjust the extent
+				 * index and move onto the next.
+				 */
+				current_ext--;
+				goto next;
+			}
 		}
-		if (error)
-			goto del_cursor;
 
+		/*
+		 * We didn't merge the extent so do the shift. Update the start
+		 * offset in the in-core extent and btree, if necessary.
+		 */
+		xfs_bmbt_set_startoff(gotp, startoff);
+		logflags |= XFS_ILOG_CORE;
 		if (cur) {
 			error = xfs_bmbt_lookup_eq(cur, got.br_startoff,
 						   got.br_startblock,
@@ -5525,53 +5660,8 @@ xfs_bmap_shift_extents(
 			if (error)
 				goto del_cursor;
 			XFS_WANT_CORRUPTED_GOTO(i == 1, del_cursor);
-		}
-
-		/* Check if we can merge 2 adjacent extents */
-		if (current_ext &&
-		    left.br_startoff + left.br_blockcount == startoff &&
-		    left.br_startblock + left.br_blockcount ==
-				got.br_startblock &&
-		    left.br_state == got.br_state &&
-		    left.br_blockcount + got.br_blockcount <= MAXEXTLEN) {
-			blockcount = left.br_blockcount +
-				got.br_blockcount;
-			xfs_iext_remove(ip, current_ext, 1, 0);
-			logflags |= XFS_ILOG_CORE;
-			if (cur) {
-				error = xfs_btree_delete(cur, &i);
-				if (error)
-					goto del_cursor;
-				XFS_WANT_CORRUPTED_GOTO(i == 1, del_cursor);
-			} else {
-				logflags |= XFS_ILOG_DEXT;
-			}
-			XFS_IFORK_NEXT_SET(ip, whichfork,
-				XFS_IFORK_NEXTENTS(ip, whichfork) - 1);
-			gotp = xfs_iext_get_ext(ifp, --current_ext);
-			xfs_bmbt_get_all(gotp, &got);
-
-			/* Make cursor point to the extent we will update */
-			if (cur) {
-				error = xfs_bmbt_lookup_eq(cur, got.br_startoff,
-							   got.br_startblock,
-							   got.br_blockcount,
-							   &i);
-				if (error)
-					goto del_cursor;
-				XFS_WANT_CORRUPTED_GOTO(i == 1, del_cursor);
-			}
 
-			xfs_bmbt_set_blockcount(gotp, blockcount);
-			got.br_blockcount = blockcount;
-		} else {
-			/* We have to update the startoff */
-			xfs_bmbt_set_startoff(gotp, startoff);
 			got.br_startoff = startoff;
-		}
-
-		logflags |= XFS_ILOG_CORE;
-		if (cur) {
 			error = xfs_bmbt_update(cur, got.br_startoff,
 						got.br_startblock,
 						got.br_blockcount,
@@ -5582,18 +5672,20 @@ xfs_bmap_shift_extents(
 			logflags |= XFS_ILOG_DEXT;
 		}
 
-		current_ext++;
+next:
+		/* update total extent count and grab the next record */
 		total_extents = ifp->if_bytes / sizeof(xfs_bmbt_rec_t);
+		if (++current_ext >= total_extents)
+			break;
+		gotp = xfs_iext_get_ext(ifp, current_ext);
+		xfs_bmbt_get_all(gotp, &got);
 	}
 
 	/* Check if we are done */
 	if (current_ext == total_extents)
 		*done = 1;
-	else if (next_fsb) {
-		gotp = xfs_iext_get_ext(ifp, current_ext);
-		xfs_bmbt_get_all(gotp, &got);
+	else if (next_fsb)
 		*next_fsb = got.br_startoff;
-	}
 
 del_cursor:
 	if (cur)
-- 
1.8.3.1

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v2 3/5] xfs: refactor single extent shift into xfs_bmse_shift_one() helper
  2014-09-10 13:20 [PATCH v2 0/5] clean up collapse range and handle post-eof delalloc Brian Foster
  2014-09-10 13:20 ` [PATCH v2 1/5] xfs: track collapse via file offset rather than extent index Brian Foster
  2014-09-10 13:20 ` [PATCH v2 2/5] xfs: refactor shift-by-merge into xfs_bmse_merge() helper Brian Foster
@ 2014-09-10 13:20 ` Brian Foster
  2014-09-10 13:20 ` [PATCH v2 4/5] xfs: writeback and inval. file range to be shifted by collapse Brian Foster
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 14+ messages in thread
From: Brian Foster @ 2014-09-10 13:20 UTC (permalink / raw)
  To: xfs

xfs_bmap_shift_extents() has a variety of conditions and error checks
that make the logic difficult to follow and indent heavy. Refactor the
loop body of this function into a new xfs_bmse_shift_one() helper. This
simplifies the error checks, eliminates index decrement on merge hack by
pushing the index increment down into the helper, and makes the code
more readable by reducing multiple levels of indentation.

This is a code refactor only. The behavior of extent shift and collapse
range is not modified.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_bmap.c | 164 ++++++++++++++++++++++++++---------------------
 1 file changed, 91 insertions(+), 73 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 532c4aa..69bf8d8 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -5518,6 +5518,88 @@ out_error:
 }
 
 /*
+ * Shift a single extent.
+ */
+STATIC int
+xfs_bmse_shift_one(
+	struct xfs_inode		*ip,
+	int				whichfork,
+	xfs_fileoff_t			offset_shift_fsb,
+	int				*current_ext,
+	struct xfs_bmbt_rec_host	*gotp,
+	struct xfs_btree_cur		*cur,
+	int				*logflags)
+{
+	struct xfs_ifork		*ifp;
+	xfs_fileoff_t			startoff;
+	struct xfs_bmbt_rec_host	*leftp;
+	struct xfs_bmbt_irec		got;
+	struct xfs_bmbt_irec		left;
+	int				error;
+	int				i;
+
+	ifp = XFS_IFORK_PTR(ip, whichfork);
+
+	xfs_bmbt_get_all(gotp, &got);
+	startoff = got.br_startoff - offset_shift_fsb;
+
+	/*
+	 * If this is the first extent in the file, make sure there's enough
+	 * room at the start of the file and jump right to the shift as there's
+	 * no left extent to merge.
+	 */
+	if (*current_ext == 0) {
+		if (got.br_startoff < offset_shift_fsb)
+			return -EINVAL;
+		goto shift_extent;
+	}
+
+	/* grab the left extent and check for a large enough hole */
+	leftp = xfs_iext_get_ext(ifp, *current_ext - 1);
+	xfs_bmbt_get_all(leftp, &left);
+
+	if (startoff < left.br_startoff + left.br_blockcount)
+		return -EINVAL;
+
+	/* check whether to merge the extent or shift it down */
+	if (!xfs_bmse_can_merge(&left, &got, offset_shift_fsb))
+		goto shift_extent;
+
+	return xfs_bmse_merge(ip, whichfork, offset_shift_fsb, *current_ext,
+			      gotp, leftp, cur, logflags);
+
+shift_extent:
+	/*
+	 * Increment the extent index for the next iteration, update the start
+	 * offset of the in-core extent and update the btree if applicable.
+	 */
+	(*current_ext)++;
+	xfs_bmbt_set_startoff(gotp, startoff);
+	*logflags |= XFS_ILOG_CORE;
+	if (!cur) {
+		*logflags |= XFS_ILOG_DEXT;
+		return 0;
+	}
+
+	error = xfs_bmbt_lookup_eq(cur, got.br_startoff, got.br_startblock,
+				   got.br_blockcount, &i);
+	if (error)
+		return error;
+	XFS_WANT_CORRUPTED_GOTO(i == 1, out_error);
+
+	got.br_startoff = startoff;
+	error = xfs_bmbt_update(cur, got.br_startoff, got.br_startblock,
+				got.br_blockcount, got.br_state);
+	if (error)
+		return error;
+
+	return 0;
+
+out_error:
+	return error;
+}
+
+/*
  * Shift extent records to the left to cover a hole.
  *
  * The maximum number of extents to be shifted in a single operation is
@@ -5541,16 +5623,12 @@ xfs_bmap_shift_extents(
 {
 	struct xfs_btree_cur		*cur = NULL;
 	struct xfs_bmbt_rec_host	*gotp;
-	struct xfs_bmbt_rec_host	*leftp;
 	struct xfs_bmbt_irec            got;
-	struct xfs_bmbt_irec		left;
 	struct xfs_mount		*mp = ip->i_mount;
 	struct xfs_ifork		*ifp;
 	xfs_extnum_t			nexts = 0;
 	xfs_extnum_t			current_ext;
-	xfs_fileoff_t			startoff;
 	int				error = 0;
-	int				i;
 	int				whichfork = XFS_DATA_FORK;
 	int				logflags = 0;
 	int				total_extents;
@@ -5599,16 +5677,6 @@ xfs_bmap_shift_extents(
 		*done = 1;
 		goto del_cursor;
 	}
-	xfs_bmbt_get_all(gotp, &got);
-
-	/*
-	 * If the first extent is shifted, offset_shift_fsb cannot be larger
-	 * than the starting offset of the first extent.
-	 */
-	if (current_ext == 0 && got.br_startoff < offset_shift_fsb) {
-		error = -EINVAL;
-		goto del_cursor;
-	}
 
 	/*
 	 * There may be delalloc extents in the data fork before the range we
@@ -5617,75 +5685,25 @@ xfs_bmap_shift_extents(
 	 */
 	total_extents = ifp->if_bytes / sizeof(xfs_bmbt_rec_t);
 	while (nexts++ < num_exts && current_ext < total_extents) {
-		startoff = got.br_startoff - offset_shift_fsb;
-
-		/* grab the left extent and check for a potential merge */
-		if (current_ext > 0) {
-			leftp = xfs_iext_get_ext(ifp, current_ext - 1);
-			xfs_bmbt_get_all(leftp, &left);
-
-			/* make sure hole is large enough for shift */
-			if (startoff < left.br_startoff + left.br_blockcount) {
-				error = -EINVAL;
-				goto del_cursor;
-			}
-
-			if (xfs_bmse_can_merge(&left, &got, offset_shift_fsb)) {
-				error = xfs_bmse_merge(ip, whichfork,
-						offset_shift_fsb, current_ext, gotp,
-						leftp, cur, &logflags);
-				if (error)
-					goto del_cursor;
-
-				/*
-				 * The extent was merged so adjust the extent
-				 * index and move onto the next.
-				 */
-				current_ext--;
-				goto next;
-			}
-		}
-
-		/*
-		 * We didn't merge the extent so do the shift. Update the start
-		 * offset in the in-core extent and btree, if necessary.
-		 */
-		xfs_bmbt_set_startoff(gotp, startoff);
-		logflags |= XFS_ILOG_CORE;
-		if (cur) {
-			error = xfs_bmbt_lookup_eq(cur, got.br_startoff,
-						   got.br_startblock,
-						   got.br_blockcount,
-						   &i);
-			if (error)
-				goto del_cursor;
-			XFS_WANT_CORRUPTED_GOTO(i == 1, del_cursor);
-
-			got.br_startoff = startoff;
-			error = xfs_bmbt_update(cur, got.br_startoff,
-						got.br_startblock,
-						got.br_blockcount,
-						got.br_state);
-			if (error)
-				goto del_cursor;
-		} else {
-			logflags |= XFS_ILOG_DEXT;
-		}
+		error = xfs_bmse_shift_one(ip, whichfork, offset_shift_fsb,
+					&current_ext, gotp, cur, &logflags);
+		if (error)
+			goto del_cursor;
 
-next:
 		/* update total extent count and grab the next record */
 		total_extents = ifp->if_bytes / sizeof(xfs_bmbt_rec_t);
-		if (++current_ext >= total_extents)
+		if (current_ext >= total_extents)
 			break;
 		gotp = xfs_iext_get_ext(ifp, current_ext);
-		xfs_bmbt_get_all(gotp, &got);
 	}
 
 	/* Check if we are done */
-	if (current_ext == total_extents)
+	if (current_ext == total_extents) {
 		*done = 1;
-	else if (next_fsb)
+	} else if (next_fsb) {
+		xfs_bmbt_get_all(gotp, &got);
 		*next_fsb = got.br_startoff;
+	}
 
 del_cursor:
 	if (cur)
-- 
1.8.3.1

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v2 4/5] xfs: writeback and inval. file range to be shifted by collapse
  2014-09-10 13:20 [PATCH v2 0/5] clean up collapse range and handle post-eof delalloc Brian Foster
                   ` (2 preceding siblings ...)
  2014-09-10 13:20 ` [PATCH v2 3/5] xfs: refactor single extent shift into xfs_bmse_shift_one() helper Brian Foster
@ 2014-09-10 13:20 ` Brian Foster
  2014-09-10 13:20 ` [PATCH v2 5/5] xfs: only writeback and truncate pages for the freed range Brian Foster
  2014-09-11  4:42 ` [PATCH v2 0/5] clean up collapse range and handle post-eof delalloc Dave Chinner
  5 siblings, 0 replies; 14+ messages in thread
From: Brian Foster @ 2014-09-10 13:20 UTC (permalink / raw)
  To: xfs

The collapse range operation currently writes the entire file before
starting the collapse to avoid changes in the in-core extent list due to
writeback causing the extent count to change. Now that collapse range is
fsb based rather than extent index based it can sustain changes in the
extent list during the shift sequence without disruption.

Modify xfs_collapse_file_space() to writeback and invalidate pages
associated with the range of the file to be shifted.
xfs_free_file_space() currently has similar behavior, but the space free
need only affect the region of the file that is freed and this could
change in the future.

Also update the comments to reflect the current implementation. We
retain the eofblocks trim permanently as a best option for dealing with
delalloc extents. We don't shift delalloc extents because this scenario
only occurs with post-eof preallocation (since data must be flushed such
that the cache can be invalidated and data can be shifted). That means
said space must also be initialized before being shifted into the
accessible region of the file only to be immediately truncated off as
the last part of the collapse. In other words, the eofblocks trim will
happen anyways, we just run it first to ensure the file remains in a
consistent state throughout the collapse.

Finally, detect and fail explicitly in the event of a delalloc extent
during the extent shift. The implementation does not support delalloc
extents and the caller is expected to prevent this scenario in advance
as is done by collapse.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_bmap.c |  4 ++++
 fs/xfs/xfs_bmap_util.c   | 32 +++++++++++++++++++-------------
 2 files changed, 23 insertions(+), 13 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 69bf8d8..79c9819 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -5543,6 +5543,10 @@ xfs_bmse_shift_one(
 	xfs_bmbt_get_all(gotp, &got);
 	startoff = got.br_startoff - offset_shift_fsb;

+	/* delalloc extents should be prevented by caller */
+	XFS_WANT_CORRUPTED_GOTO(!isnullstartblock(got.br_startblock),
+				out_error);
+
 	/*
 	 * If this is the first extent in the file, make sure there's enough
 	 * room at the start of the file and jump right to the shift as there's
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 1e96d77..eae763f 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1470,27 +1470,33 @@ xfs_collapse_file_space(
 	next_fsb = XFS_B_TO_FSB(mp, offset + len);
 	shift_fsb = XFS_B_TO_FSB(mp, len);

-	/*
-	 * Writeback the entire file and force remove any post-eof blocks. The
-	 * writeback prevents changes to the extent list via concurrent
-	 * writeback and the eofblocks trim prevents the extent shift algorithm
-	 * from running into a post-eof delalloc extent.
-	 *
-	 * XXX: This is a temporary fix until the extent shift loop below is
-	 * converted to use offsets and lookups within the ILOCK rather than
-	 * carrying around the index into the extent list for the next
-	 * iteration.
-	 */
-	error = filemap_write_and_wait(VFS_I(ip)->i_mapping);
+	error = xfs_free_file_space(ip, offset, len);
 	if (error)
 		return error;
+
+	/*
+	 * Trim eofblocks to avoid shifting uninitialized post-eof preallocation
+	 * into the accessible region of the file.
+	 */
 	if (xfs_can_free_eofblocks(ip, true)) {
 		error = xfs_free_eofblocks(mp, ip, false);
 		if (error)
 			return error;
 	}

-	error = xfs_free_file_space(ip, offset, len);
+	/*
+	 * Writeback and invalidate cache for the remainder of the file as we're
+	 * about to shift down every extent from the collapse range to EOF. The
+	 * free of the collapse range above might have already done some of
+	 * this, but we shouldn't rely on it to do anything outside of the range
+	 * that was freed.
+	 */
+	error = filemap_write_and_wait_range(VFS_I(ip)->i_mapping,
+					     offset + len, -1);
+	if (error)
+		return error;
+	error = invalidate_inode_pages2_range(VFS_I(ip)->i_mapping,
+					(offset + len) >> PAGE_CACHE_SHIFT, -1);
 	if (error)
 		return error;

-- 
1.8.3.1

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v2 5/5] xfs: only writeback and truncate pages for the freed range
  2014-09-10 13:20 [PATCH v2 0/5] clean up collapse range and handle post-eof delalloc Brian Foster
                   ` (3 preceding siblings ...)
  2014-09-10 13:20 ` [PATCH v2 4/5] xfs: writeback and inval. file range to be shifted by collapse Brian Foster
@ 2014-09-10 13:20 ` Brian Foster
  2014-09-11  4:42 ` [PATCH v2 0/5] clean up collapse range and handle post-eof delalloc Dave Chinner
  5 siblings, 0 replies; 14+ messages in thread
From: Brian Foster @ 2014-09-10 13:20 UTC (permalink / raw)
  To: xfs

xfs_free_file_space() only affects the range of the file for which space
is being freed. It currently writes and truncates the page cache from
the start offset of the free to EOF.

Modify xfs_free_file_space() to write back and truncate page cache of
just the range being freed.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_bmap_util.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index eae763f..809ae7d 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1205,6 +1205,7 @@ xfs_free_file_space(
 	xfs_bmap_free_t		free_list;
 	xfs_bmbt_irec_t		imap;
 	xfs_off_t		ioffset;
+	xfs_off_t		iendoffset;
 	xfs_extlen_t		mod=0;
 	xfs_mount_t		*mp;
 	int			nimap;
@@ -1233,12 +1234,13 @@ xfs_free_file_space(
 	inode_dio_wait(VFS_I(ip));
 
 	rounding = max_t(xfs_off_t, 1 << mp->m_sb.sb_blocklog, PAGE_CACHE_SIZE);
-	ioffset = offset & ~(rounding - 1);
-	error = filemap_write_and_wait_range(VFS_I(ip)->i_mapping,
-					      ioffset, -1);
+	ioffset = round_down(offset, rounding);
+	iendoffset = round_up(offset + len, rounding) - 1;
+	error = filemap_write_and_wait_range(VFS_I(ip)->i_mapping, ioffset,
+					     iendoffset);
 	if (error)
 		goto out;
-	truncate_pagecache_range(VFS_I(ip), ioffset, -1);
+	truncate_pagecache_range(VFS_I(ip), ioffset, iendoffset);
 
 	/*
 	 * Need to zero the stuff we're not freeing, on disk.
-- 
1.8.3.1

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 0/5] clean up collapse range and handle post-eof delalloc
  2014-09-10 13:20 [PATCH v2 0/5] clean up collapse range and handle post-eof delalloc Brian Foster
                   ` (4 preceding siblings ...)
  2014-09-10 13:20 ` [PATCH v2 5/5] xfs: only writeback and truncate pages for the freed range Brian Foster
@ 2014-09-11  4:42 ` Dave Chinner
  2014-09-11 15:20   ` Brian Foster
  5 siblings, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2014-09-11  4:42 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Wed, Sep 10, 2014 at 09:20:26AM -0400, Brian Foster wrote:
> Hi all,
> 
> Here's v2 of the collapse clean up. We refactor a bit more via the
> insertion of patch 3, otherwise it's similar to v1. This will see some
> continued testing, but it survived ~500m fsx operations overnight.
> 
> Brian

I'm not sure about the invalidation patch now. On a 1k block size
filesystem, generic/127 fails with:

    +ltp/fsx -q -l 262144 -o 65536 -S 191110531 -N 100000 -R -W fsx_std_nommap
    +collapse range: 1000 to 3000
    +do_collapse_range: fallocate: Device or resource busy

which indicates we had an invalidation failure. This is probably
exposing some other bug, but I haven't had time to look into it yet
so I don't know.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 0/5] clean up collapse range and handle post-eof delalloc
  2014-09-11  4:42 ` [PATCH v2 0/5] clean up collapse range and handle post-eof delalloc Dave Chinner
@ 2014-09-11 15:20   ` Brian Foster
  2014-09-11 21:19     ` Dave Chinner
  0 siblings, 1 reply; 14+ messages in thread
From: Brian Foster @ 2014-09-11 15:20 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Thu, Sep 11, 2014 at 02:42:43PM +1000, Dave Chinner wrote:
> On Wed, Sep 10, 2014 at 09:20:26AM -0400, Brian Foster wrote:
> > Hi all,
> > 
> > Here's v2 of the collapse clean up. We refactor a bit more via the
> > insertion of patch 3, otherwise it's similar to v1. This will see some
> > continued testing, but it survived ~500m fsx operations overnight.
> > 
> > Brian
> 
> I'm not sure about the invalidation patch now. On a 1k block size
> filesystem, generic/127 fails with:
> 
>     +ltp/fsx -q -l 262144 -o 65536 -S 191110531 -N 100000 -R -W fsx_std_nommap
>     +collapse range: 1000 to 3000
>     +do_collapse_range: fallocate: Device or resource busy
> 
> which indicates we had an invalidation failure. This is probably
> exposing some other bug, but I haven't had time to look into it yet
> so I don't know.
> 

Yeah, I can reproduce this as well, thanks. I think you're referring to
the xfs_free_file_space() patch (5/5)..? FWIW, I don't see the problem
without that patch, so it appears that the full pagecache truncate is
still covering up a problem somewhere. I'll try to dig into it...

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 0/5] clean up collapse range and handle post-eof delalloc
  2014-09-11 15:20   ` Brian Foster
@ 2014-09-11 21:19     ` Dave Chinner
  2014-09-12 19:51       ` Brian Foster
  0 siblings, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2014-09-11 21:19 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Thu, Sep 11, 2014 at 11:20:13AM -0400, Brian Foster wrote:
> On Thu, Sep 11, 2014 at 02:42:43PM +1000, Dave Chinner wrote:
> > On Wed, Sep 10, 2014 at 09:20:26AM -0400, Brian Foster wrote:
> > > Hi all,
> > > 
> > > Here's v2 of the collapse clean up. We refactor a bit more via the
> > > insertion of patch 3, otherwise it's similar to v1. This will see some
> > > continued testing, but it survived ~500m fsx operations overnight.
> > > 
> > > Brian
> > 
> > I'm not sure about the invalidation patch now. On a 1k block size
> > filesystem, generic/127 fails with:
> > 
> >     +ltp/fsx -q -l 262144 -o 65536 -S 191110531 -N 100000 -R -W fsx_std_nommap
> >     +collapse range: 1000 to 3000
> >     +do_collapse_range: fallocate: Device or resource busy
> > 
> > which indicates we had an invalidation failure. This is probably
> > exposing some other bug, but I haven't had time to look into it yet
> > so I don't know.
> > 
> 
> Yeah, I can reproduce this as well, thanks. I think you're referring to
> the xfs_free_file_space() patch (5/5)..?

*nod*

> FWIW, I don't see the problem
> without that patch, so it appears that the full pagecache truncate is
> still covering up a problem somewhere. I'll try to dig into it...

It's likely that it is leaving a dirty buffer on the page beyond EOF
as a result of the truncate zeroing the remainder of the page in
memory.

If I get a chance I'll look at it this afternoon, as this patchset
also seems to be causing a marked increase in the number of fsstress
failures due to stray delalloc blocks in files even on 4k block size
filesystems (e.g. at unmount).

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 0/5] clean up collapse range and handle post-eof delalloc
  2014-09-11 21:19     ` Dave Chinner
@ 2014-09-12 19:51       ` Brian Foster
  2014-09-12 20:05         ` Brian Foster
  0 siblings, 1 reply; 14+ messages in thread
From: Brian Foster @ 2014-09-12 19:51 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Fri, Sep 12, 2014 at 07:19:15AM +1000, Dave Chinner wrote:
> On Thu, Sep 11, 2014 at 11:20:13AM -0400, Brian Foster wrote:
> > On Thu, Sep 11, 2014 at 02:42:43PM +1000, Dave Chinner wrote:
> > > On Wed, Sep 10, 2014 at 09:20:26AM -0400, Brian Foster wrote:
> > > > Hi all,
> > > > 
> > > > Here's v2 of the collapse clean up. We refactor a bit more via the
> > > > insertion of patch 3, otherwise it's similar to v1. This will see some
> > > > continued testing, but it survived ~500m fsx operations overnight.
> > > > 
> > > > Brian
> > > 
> > > I'm not sure about the invalidation patch now. On a 1k block size
> > > filesystem, generic/127 fails with:
> > > 
> > >     +ltp/fsx -q -l 262144 -o 65536 -S 191110531 -N 100000 -R -W fsx_std_nommap
> > >     +collapse range: 1000 to 3000
> > >     +do_collapse_range: fallocate: Device or resource busy
> > > 
> > > which indicates we had an invalidation failure. This is probably
> > > exposing some other bug, but I haven't had time to look into it yet
> > > so I don't know.
> > > 
> > 
> > Yeah, I can reproduce this as well, thanks. I think you're referring to
> > the xfs_free_file_space() patch (5/5)..?
> 
> *nod*
> 
> > FWIW, I don't see the problem
> > without that patch, so it appears that the full pagecache truncate is
> > still covering up a problem somewhere. I'll try to dig into it...
> 
> It's likely that it is leaving a dirty buffer on the page beyond EOF
> as a result of the truncate zeroing the remainder of the page in
> memory.
> 
> If I get a chance I'll look at it this afternoon, as this patchset
> also seems to be causing a marked increase in the number of fsstress
> failures due to stray delalloc blocks in files even on 4k block size
> filesystems (e.g. at unmount).
> 

I think I have a bit of an idea of what's going on here. I'm not sure
that this one is post-eof delalloc related. For reference, here's the
command that reproduces for me 100% of the time on a 1k block fs
(derived from the fsx trace): 

xfs_io -fc "pwrite 185332 55756" \
	-c "fcollapse 28672 40960" \
	-c "pwrite 133228 63394" \
	-c "fcollapse 0 4096" /mnt/file

The first write extends the file to 241088 bytes and the first collapse
targets a hole, but shrinks the file down to 200128 bytes. The last page
offset of the file is now 196608 and with 1k blocks, eof falls within
the last block of this page (e.g., within bh offsets 199680-200704). The
second write falls just into the first block of this page.

What I see occur on the final collapse is that the
invalidate_inode_pages2_range() call in the collapse op fails due to
finding a dirty page at offset 196608. We don't have a launder_page()
callback, so this results in -EBUSY.

Given the above, it seems like the filemap_write_and_wait_range() call
should handle this. Digging further, what I see is one or two
writepage()->xfs_cluster_write() sequences along the range of the
previous write. xfs_convert_page() eventually attempts to handle page
offset 196608 and breaks out at offset 197632 (the second bh in the
page) near the top of the bh loop:

		...
		if (!(PageUptodate(page) || buffer_uptodate(bh))) {
			done = 1;
			break;
		}
		...

I suspect the reason the page/buffer is in this particular state is due
to the previous invalidation (collapse 1), which would have removed the
page from the mapping. This seems reasonable to me since we only wrote
into the first bytes of the page since the prior invalidation. The problem is
that the page somehow has the following buffer_head state at the point of the
EBUSY inval failure:

invalidate_inode_pages2_range(648): state 0x29 block 84 size 1024
invalidate_inode_pages2_range(648): state 0x0 block 18446744073709551615 size 1024
invalidate_inode_pages2_range(648): state 0x0 block 18446744073709551615 size 1024
invalidate_inode_pages2_range(648): state 0x2b block 87 size 1024

The first block is BH_Uptodate|BH_Req|BH_Mapped. I read the next two as
simply not up to date..? The last block is the same as the first plus
BH_Dirty. This last block is offset 199680 and the previous write goes
to 196622, so I'm not sure how this ends up dirty. I think the write
path to this range of the file might be the spot to dig next...

Brian

P.S., As a datapoint from experimentation, this problem doesn't occur if
I ensure that writepage() handles this particular page rather than
xfs_convert_page(). I can do that by either jumping out of
xfs_convert_page() sooner, before the page is set for writeback, or
hacking xfs_start_page_writeback() to use __test_set_page_writeback()
and keep the write tag such that write_cache_pages() can still find the
page. This calls out an interesting bit of behavior in
xfs_convert_page() since commit a49935f2: if we handle a part of a
page, break and mark the page for writeback without handling the rest,
we'll still clear the PAGECACHE_TAG_TOWRITE tag and writeback won't
finish off the page during the current iteration (for WB_SYNC_ALL).

It's not clear to me if this is contributing to the problem in this
particular case, but it seems like an independent bug at the very
least. Thoughts?

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 0/5] clean up collapse range and handle post-eof delalloc
  2014-09-12 19:51       ` Brian Foster
@ 2014-09-12 20:05         ` Brian Foster
  2014-09-15  1:46           ` Dave Chinner
  0 siblings, 1 reply; 14+ messages in thread
From: Brian Foster @ 2014-09-12 20:05 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Fri, Sep 12, 2014 at 03:51:29PM -0400, Brian Foster wrote:
> On Fri, Sep 12, 2014 at 07:19:15AM +1000, Dave Chinner wrote:
> > On Thu, Sep 11, 2014 at 11:20:13AM -0400, Brian Foster wrote:
> > > On Thu, Sep 11, 2014 at 02:42:43PM +1000, Dave Chinner wrote:
> > > > On Wed, Sep 10, 2014 at 09:20:26AM -0400, Brian Foster wrote:
> > > > > Hi all,
> > > > > 
> > > > > Here's v2 of the collapse clean up. We refactor a bit more via the
> > > > > insertion of patch 3, otherwise it's similar to v1. This will see some
> > > > > continued testing, but it survived ~500m fsx operations overnight.
> > > > > 
> > > > > Brian
> > > > 
> > > > I'm not sure about the invalidation patch now. On a 1k block size
> > > > filesystem, generic/127 fails with:
> > > > 
> > > >     +ltp/fsx -q -l 262144 -o 65536 -S 191110531 -N 100000 -R -W fsx_std_nommap
> > > >     +collapse range: 1000 to 3000
> > > >     +do_collapse_range: fallocate: Device or resource busy
> > > > 
> > > > which indicates we had an invalidation failure. This is probably
> > > > exposing some other bug, but I haven't had time to look into it yet
> > > > so I don't know.
> > > > 
> > > 
> > > Yeah, I can reproduce this as well, thanks. I think you're referring to
> > > the xfs_free_file_space() patch (5/5)..?
> > 
> > *nod*
> > 
> > > FWIW, I don't see the problem
> > > without that patch, so it appears that the full pagecache truncate is
> > > still covering up a problem somewhere. I'll try to dig into it...
> > 
> > It's likely that it is leaving a dirty buffer on the page beyond EOF
> > as a result of the truncate zeroing the remainder of the page in
> > memory.
> > 
> > If I get a chance I'll look at it this afternoon, as this patchset
> > also seems to be causing a marked increase in the number of fsstress
> > failures due to stray delalloc blocks in files even on 4k block size
> > filesystems (e.g. at unmount).
> > 
> 
> I think I have a bit of an idea of what's going on here. I'm not sure
> that this one is post-eof delalloc related. For reference, here's the
> command that reproduces for me 100% of the time on a 1k block fs
> (derived from the fsx trace): 
> 
> xfs_io -fc "pwrite 185332 55756" \
> 	-c "fcollapse 28672 40960" \
> 	-c "pwrite 133228 63394" \
> 	-c "fcollapse 0 4096" /mnt/file
> 
> The first write extends the file to 241088 bytes and the first collapse
> targets a hole, but shrinks the file down to 200128 bytes. The last page
> offset of the file is now 196608 and with 1k blocks, eof falls within
> the last block of this page (e.g., within bh offsets 199680-200704). The
> second write falls just into the first block of this page.
> 
> What I see occur on the final collapse is that the
> invalidate_inode_pages2_range() call in the collapse op fails due to
> finding a dirty page at offset 196608. We don't have a launder_page()
> callback, so this results in -EBUSY.
> 
> Given the above, it seems like the filemap_write_and_wait_range() call
> should handle this. Digging further, what I see is one or two
> writepage()->xfs_cluster_write() sequences along the range of the
> previous write. xfs_convert_page() eventually attempts to handle page
> offset 196608 and breaks out at offset 197632 (the second bh in the
> page) near the top of the bh loop:
> 
> 		...
> 		if (!(PageUptodate(page) || buffer_uptodate(bh))) {
> 			done = 1;
> 			break;
> 		}
> 		...
> 
> I suspect the reason the page/buffer is in this particular state is due
> to the previous invalidation (collapse 1), which would have removed the
> page from the mapping. This seems reasonable to me since we only wrote
> into the first bytes of the page since the prior invalidation. The problem is
> that the page somehow has the following buffer_head state at the point of the
> EBUSY inval failure:
> 
> invalidate_inode_pages2_range(648): state 0x29 block 84 size 1024
> invalidate_inode_pages2_range(648): state 0x0 block 18446744073709551615 size 1024
> invalidate_inode_pages2_range(648): state 0x0 block 18446744073709551615 size 1024
> invalidate_inode_pages2_range(648): state 0x2b block 87 size 1024
> 
> The first block is BH_Uptodate|BH_Req|BH_Mapped. I read the next two as
> simply not up to date..? The last block is the same as the first plus
> BH_Dirty. This last block is offset 199680 and the previous write goes
> to 196622, so I'm not sure how this ends up dirty. I think the write
> path to this range of the file might be the spot to dig next...
> 

... and it just hit me that truncate dirties the buffer. ;) So I'm
wondering if we have something like the following sequence of events for
this particular page:

- first pwrite writes to complete page
- first collapse:
	- flush
	- invalidate
	- truncate -> dirty last buffer of page
- second pwrite writes to first buffer in page (dirty first buffer)
	- flush
		- xfs_convert_page() hits the first buffer, breaks out
		  and causes the last buffer to be passed over due to
		  the issue below
	- invalidate
		- finds dirty buffer, error!

Brian

> Brian
> 
> P.S., As a datapoint from experimentation, this problem doesn't occur if
> I ensure that writepage() handles this particular page rather than
> xfs_convert_page(). I can do that by either jumping out of
> xfs_convert_page() sooner, before the page is set for writeback, or
> hacking xfs_start_page_writeback() to use __test_set_page_writeback()
> and keep the write tag such that write_cache_pages() can still find the
> page. This calls out an interesting bit of behavior in
> xfs_convert_page() since commit a49935f2: if we handle a part of a
> page, break and mark the page for writeback without handling the rest,
> we'll still clear the PAGECACHE_TAG_TOWRITE tag and writeback won't
> finish off the page during the current iteration (for WB_SYNC_ALL).
> 
> It's not clear to me if this is contributing to the problem in this
> particular case, but it seems like an independent bug at the very
> least. Thoughts?
> 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
> > 
> > _______________________________________________
> > xfs mailing list
> > xfs@oss.sgi.com
> > http://oss.sgi.com/mailman/listinfo/xfs
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 0/5] clean up collapse range and handle post-eof delalloc
  2014-09-12 20:05         ` Brian Foster
@ 2014-09-15  1:46           ` Dave Chinner
  2014-09-15 13:18             ` Brian Foster
  0 siblings, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2014-09-15  1:46 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Fri, Sep 12, 2014 at 04:05:36PM -0400, Brian Foster wrote:
> On Fri, Sep 12, 2014 at 03:51:29PM -0400, Brian Foster wrote:
> > On Fri, Sep 12, 2014 at 07:19:15AM +1000, Dave Chinner wrote:
> > > On Thu, Sep 11, 2014 at 11:20:13AM -0400, Brian Foster wrote:
> > > > On Thu, Sep 11, 2014 at 02:42:43PM +1000, Dave Chinner wrote:
> > > > > On Wed, Sep 10, 2014 at 09:20:26AM -0400, Brian Foster wrote:
> > > > > > Hi all,
> > > > > > 
> > > > > > Here's v2 of the collapse clean up. We refactor a bit more via the
> > > > > > insertion of patch 3, otherwise it's similar to v1. This will see some
> > > > > > continued testing, but it survived ~500m fsx operations overnight.
> > > > > > 
> > > > > > Brian
> > > > > 
> > > > > I'm not sure about the invalidation patch now. On a 1k block size
> > > > > filesystem, generic/127 fails with:
> > > > > 
> > > > >     +ltp/fsx -q -l 262144 -o 65536 -S 191110531 -N 100000 -R -W fsx_std_nommap
> > > > >     +collapse range: 1000 to 3000
> > > > >     +do_collapse_range: fallocate: Device or resource busy
> > > > > 
> > > > > which indicates we had an invalidation failure. This is probably
> > > > > exposing some other bug, but I haven't had time to look into it yet
> > > > > so I don't know.

....

> > > It's likely that it is leaving a dirty buffer on the page beyond EOF
> > > as a result of the truncate zeroing the remainder of the page in
> > > memory.
.....

> ... and it just hit me that truncate dirties the buffer. ;)

Exactly. ;)

> So I'm
> wondering if we have something like the following sequence of events for
> this particular page:
> 
> - first pwrite writes to complete page
> - first collapse:

xfs_collapse_file_space: dev 7:0 ino 0x43

> 	- flush

xfs_map_blocks_alloc: dev 7:0 ino 0x43 size 0x0 offset 0x2d000 count 1024 type  startoff 0xb4 startblock 32 blockcount 0x38
xfs_extlist:          dev 7:0 ino 0x43 state  idx 0 offset 180 block 32 count 56 flag 0 caller xfs_iextents_copy

> 	- invalidate

xfs_releasepage:      dev 7:0 ino 0x43 pgoff 0x2d000 size 0x3adc0 offset 0 length 0 delalloc 0 unwritten 0
....
xfs_releasepage:      dev 7:0 ino 0x43 pgoff 0x3a000 size 0x3adc0 offset 0 length 0 delalloc 0 unwritten 0

> 	- truncate -> dirty last buffer of page

xfs_setattr:          dev 7:0 ino 0x43
....
xfs_invalidatepage:   dev 7:0 ino 0x43 pgoff 0x30000 size 0x30dc0 offset dc0 length 240 delalloc 0 unwritten 0
....
xfs_itruncate_extents_start: dev 7:0 ino 0x43 size 0x30dc0 new_size 0x30dc0
xfs_extlist:          dev 7:0 ino 0x43 state  idx 0 offset 140 block 32 count 56 flag 0 caller xfs_iextents_copy

And so we have truncate dirtying the last buffer in the page (the
offset/len indicated by the xfs_invalidatepage tracepoint).

> - second pwrite writes to first buffer in page (dirty first buffer)
> 	- flush
> 		- xfs_convert_page() hits the first buffer, breaks out
> 		  and causes the last buffer to be passed over due to
> 		  the issue below
> 	- invalidate
> 		- finds dirty buffer, error!
> 
> Brian
> 
> > Brian
> > 
> > P.S., As a datapoint from experimentation, this problem doesn't occur if
> > I ensure that writepage() handles this particular page rather than
> > xfs_convert_page(). I can do that by either jumping out of
> > xfs_convert_page() sooner, before the page is set for writeback, or
> > hacking xfs_start_page_writeback() to use __test_set_page_writeback()
> > and keep the write tag such that write_cache_pages() can still find the
> > page.

writepage is still supposed to find the page again, because we
haven't fully cleaned it. Indeed, the code used to come back and
write this page because it was still dirty and the
write_cache_pages() iteration would then see it and write it again
because it is dirty.

> > This calls out an interesting bit of behavior in
> > xfs_convert_page() since commit a49935f2: if we handle a part of a
> > page, break and mark the page for writeback without handling the rest,
> > we'll still clear the PAGECACHE_TAG_TOWRITE tag and writeback won't
> > finish off the page during the current iteration (for WB_SYNC_ALL).

Right, commit a49935f2 made the assumption that the page being left
dirty was sufficient to have the page written as the current
writeback sweep went past it. That *used* to be the way the generic
writeback code worked.

And this is instructive: this same assumption was found in ext4 back
in May i.e. commit 1c8349a ("ext4: fix data integrity sync in
ordered mode") and that introduced the new functions like
__test_set_page_writeback(). That fix wasn't cc'd to linux-fsdevel
which might have got our attention and so noticed the bug earlier...

> > It's not clear to me if this is contributing to the problem in this
> > particular case, but it seems like an independent bug at the very
> > least. Thoughts?

We definitely need to use set_page_writeback_keepwrite() for partial
page writes that leave the page dirty. Patch below.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

xfs: ensure WB_SYNC_ALL writeback handles partial pages correctly

From: Dave Chinner <dchinner@redhat.com>

XFS has been having trouble with stray delayed allocation extents
beyond EOF for a long time. Recent changes to the collapse range
code has triggered erroneous EBUSY errors on page invalidtion for
block size smaller than page size filesystems. These
have been caused by dirty buffers beyond EOF on a partial page which
do not get written to disk during a sync.

The issue is that write-ahead in xfs_cluster_write() finds such a
partial page and handles it by leaving the page dirty but pushing it
into a writeback state. This used to work just fine, as the
write_cache_pages() code would then find the dirty partial page in
the next mapping tree lookup as the dirty tag is still set.

Unfortunately, when we moved to a mark and sweep approach to
writeback to fix other writeback sync issues, we broken this. THe
act of marking the page as under writeback now clears the TOWRITE
tag in the radix tree, even though the page is still dirty. This
causes the TOWRITE tag to be cleared, and hence the next lookup on
the mapping tree does not find the dirty partial page and so doesn't
try to write it again.

This same writeback bug was found recently in ext4 and fixed in
commit 1c8349a ("ext4: fix data integrity sync in ordered mode")
without communication to the wider filesystem community. We can use
exactly the same fix here so the TOWRITE flag is not cleared on
partial page writes.

cc: stable@vger.kernel.org # dependent on 1c8349a17137b93f0a83f276c764a6df1b9a116e
Root-cause-found-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_aops.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index b984647..2f50253 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -434,10 +434,22 @@ xfs_start_page_writeback(
 {
 	ASSERT(PageLocked(page));
 	ASSERT(!PageWriteback(page));
-	if (clear_dirty)
+
+	/*
+	 * if the page was not fully cleaned, we need to ensure that the higher
+	 * layers come back to it correctly. That means we need to keep the page
+	 * dirty, and for WB_SYNC_ALL writeback we need to ensure the
+	 * PAGECACHE_TAG_TOWRITE index mark is not removed so another attempt to
+	 * write this page in this writeback sweep will be made.
+	 */
+	if (clear_dirty) {
 		clear_page_dirty_for_io(page);
-	set_page_writeback(page);
+		set_page_writeback(page);
+	} else
+		set_page_writeback_keepwrite(page);
+
 	unlock_page(page);
+
 	/* If no buffers on the page are to be written, finish it here */
 	if (!buffers)
 		end_page_writeback(page);

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 0/5] clean up collapse range and handle post-eof delalloc
  2014-09-15  1:46           ` Dave Chinner
@ 2014-09-15 13:18             ` Brian Foster
  2014-09-15 22:55               ` Dave Chinner
  0 siblings, 1 reply; 14+ messages in thread
From: Brian Foster @ 2014-09-15 13:18 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Mon, Sep 15, 2014 at 11:46:54AM +1000, Dave Chinner wrote:
> On Fri, Sep 12, 2014 at 04:05:36PM -0400, Brian Foster wrote:
> > On Fri, Sep 12, 2014 at 03:51:29PM -0400, Brian Foster wrote:
> > > On Fri, Sep 12, 2014 at 07:19:15AM +1000, Dave Chinner wrote:
> > > > On Thu, Sep 11, 2014 at 11:20:13AM -0400, Brian Foster wrote:
> > > > > On Thu, Sep 11, 2014 at 02:42:43PM +1000, Dave Chinner wrote:
> > > > > > On Wed, Sep 10, 2014 at 09:20:26AM -0400, Brian Foster wrote:
> > > > > > > Hi all,
> > > > > > > 
> > > > > > > Here's v2 of the collapse clean up. We refactor a bit more via the
> > > > > > > insertion of patch 3, otherwise it's similar to v1. This will see some
> > > > > > > continued testing, but it survived ~500m fsx operations overnight.
> > > > > > > 
> > > > > > > Brian
> > > > > > 
> > > > > > I'm not sure about the invalidation patch now. On a 1k block size
> > > > > > filesystem, generic/127 fails with:
> > > > > > 
> > > > > >     +ltp/fsx -q -l 262144 -o 65536 -S 191110531 -N 100000 -R -W fsx_std_nommap
> > > > > >     +collapse range: 1000 to 3000
> > > > > >     +do_collapse_range: fallocate: Device or resource busy
> > > > > > 
> > > > > > which indicates we had an invalidation failure. This is probably
> > > > > > exposing some other bug, but I haven't had time to look into it yet
> > > > > > so I don't know.
> 
> ....
> 
> > > > It's likely that it is leaving a dirty buffer on the page beyond EOF
> > > > as a result of the truncate zeroing the remainder of the page in
> > > > memory.
> .....
> 
> > ... and it just hit me that truncate dirties the buffer. ;)
> 
> Exactly. ;)
> 
> > So I'm
> > wondering if we have something like the following sequence of events for
> > this particular page:
> > 
> > - first pwrite writes to complete page
> > - first collapse:
> 
> xfs_collapse_file_space: dev 7:0 ino 0x43
> 
> > 	- flush
> 
> xfs_map_blocks_alloc: dev 7:0 ino 0x43 size 0x0 offset 0x2d000 count 1024 type  startoff 0xb4 startblock 32 blockcount 0x38
> xfs_extlist:          dev 7:0 ino 0x43 state  idx 0 offset 180 block 32 count 56 flag 0 caller xfs_iextents_copy
> 
> > 	- invalidate
> 
> xfs_releasepage:      dev 7:0 ino 0x43 pgoff 0x2d000 size 0x3adc0 offset 0 length 0 delalloc 0 unwritten 0
> ....
> xfs_releasepage:      dev 7:0 ino 0x43 pgoff 0x3a000 size 0x3adc0 offset 0 length 0 delalloc 0 unwritten 0
> 
> > 	- truncate -> dirty last buffer of page
> 
> xfs_setattr:          dev 7:0 ino 0x43
> ....
> xfs_invalidatepage:   dev 7:0 ino 0x43 pgoff 0x30000 size 0x30dc0 offset dc0 length 240 delalloc 0 unwritten 0
> ....
> xfs_itruncate_extents_start: dev 7:0 ino 0x43 size 0x30dc0 new_size 0x30dc0
> xfs_extlist:          dev 7:0 ino 0x43 state  idx 0 offset 140 block 32 count 56 flag 0 caller xfs_iextents_copy
> 
> And so we have truncate dirtying the last buffer in the page (the
> offset/len indicated by the xfs_invalidatepage tracepoint).
> 
> > - second pwrite writes to first buffer in page (dirty first buffer)
> > 	- flush
> > 		- xfs_convert_page() hits the first buffer, breaks out
> > 		  and causes the last buffer to be passed over due to
> > 		  the issue below
> > 	- invalidate
> > 		- finds dirty buffer, error!
> > 
> > Brian
> > 
> > > Brian
> > > 
> > > P.S., As a datapoint from experimentation, this problem doesn't occur if
> > > I ensure that writepage() handles this particular page rather than
> > > xfs_convert_page(). I can do that by either jumping out of
> > > xfs_convert_page() sooner, before the page is set for writeback, or
> > > hacking xfs_start_page_writeback() to use __test_set_page_writeback()
> > > and keep the write tag such that write_cache_pages() can still find the
> > > page.
> 
> writepage is still supposed to find the page again, because we
> haven't fully cleaned it. Indeed, the code used to come back and
> write this page because it was still dirty and the
> write_cache_pages() iteration would then see it and write it again
> because it is dirty.
> 
> > > This calls out an interesting bit of behavior in
> > > xfs_convert_page() since commit a49935f2: if we handle a part of a
> > > page, break and mark the page for writeback without handling the rest,
> > > we'll still clear the PAGECACHE_TAG_TOWRITE tag and writeback won't
> > > finish off the page during the current iteration (for WB_SYNC_ALL).
> 
> Right, commit a49935f2 made the assumption that the page being left
> dirty was sufficient to have the page written as the current
> writeback sweep went past it. That *used* to be the way the generic
> writeback code worked.
> 
> And this is instructive: this same assumption was found in ext4 back
> in May i.e. commit 1c8349a ("ext4: fix data integrity sync in
> ordered mode") and that introduced the new functions like
> __test_set_page_writeback(). That fix wasn't cc'd to linux-fsdevel
> which might have got our attention and so noticed the bug earlier...
> 
> > > It's not clear to me if this is contributing to the problem in this
> > > particular case, but it seems like an independent bug at the very
> > > least. Thoughts?
> 
> We definitely need to use set_page_writeback_keepwrite() for partial
> page writes that leave the page dirty. Patch below.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> xfs: ensure WB_SYNC_ALL writeback handles partial pages correctly
> 
> From: Dave Chinner <dchinner@redhat.com>
> 
> XFS has been having trouble with stray delayed allocation extents
> beyond EOF for a long time. Recent changes to the collapse range
> code has triggered erroneous EBUSY errors on page invalidtion for
> block size smaller than page size filesystems. These
> have been caused by dirty buffers beyond EOF on a partial page which
> do not get written to disk during a sync.
> 
> The issue is that write-ahead in xfs_cluster_write() finds such a
> partial page and handles it by leaving the page dirty but pushing it
> into a writeback state. This used to work just fine, as the
> write_cache_pages() code would then find the dirty partial page in
> the next mapping tree lookup as the dirty tag is still set.
> 
> Unfortunately, when we moved to a mark and sweep approach to
> writeback to fix other writeback sync issues, we broken this. THe
> act of marking the page as under writeback now clears the TOWRITE
> tag in the radix tree, even though the page is still dirty. This
> causes the TOWRITE tag to be cleared, and hence the next lookup on
> the mapping tree does not find the dirty partial page and so doesn't
> try to write it again.
> 
> This same writeback bug was found recently in ext4 and fixed in
> commit 1c8349a ("ext4: fix data integrity sync in ordered mode")
> without communication to the wider filesystem community. We can use
> exactly the same fix here so the TOWRITE flag is not cleared on
> partial page writes.
> 
> cc: stable@vger.kernel.org # dependent on 1c8349a17137b93f0a83f276c764a6df1b9a116e
> Root-cause-found-by: Brian Foster <bfoster@redhat.com>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Looks good and fixes the collapse failure in my test.

Reviewed-by: Brian Foster <bfoster@redhat.com>

I suppose we should prepend the collapse rework series with this patch
to avoid the regression as it pertains to collapse (obviously the
failure to retain towrite goes further back).

I'll continue testing with this. Are you still seeing an increase in
such failures with the xfs_free_file_space() patch or has this quieted
those down?

Brian

>  fs/xfs/xfs_aops.c | 16 ++++++++++++++--
>  1 file changed, 14 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index b984647..2f50253 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -434,10 +434,22 @@ xfs_start_page_writeback(
>  {
>  	ASSERT(PageLocked(page));
>  	ASSERT(!PageWriteback(page));
> -	if (clear_dirty)
> +
> +	/*
> +	 * if the page was not fully cleaned, we need to ensure that the higher
> +	 * layers come back to it correctly. That means we need to keep the page
> +	 * dirty, and for WB_SYNC_ALL writeback we need to ensure the
> +	 * PAGECACHE_TAG_TOWRITE index mark is not removed so another attempt to
> +	 * write this page in this writeback sweep will be made.
> +	 */
> +	if (clear_dirty) {
>  		clear_page_dirty_for_io(page);
> -	set_page_writeback(page);
> +		set_page_writeback(page);
> +	} else
> +		set_page_writeback_keepwrite(page);
> +
>  	unlock_page(page);
> +
>  	/* If no buffers on the page are to be written, finish it here */
>  	if (!buffers)
>  		end_page_writeback(page);
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 0/5] clean up collapse range and handle post-eof delalloc
  2014-09-15 13:18             ` Brian Foster
@ 2014-09-15 22:55               ` Dave Chinner
  0 siblings, 0 replies; 14+ messages in thread
From: Dave Chinner @ 2014-09-15 22:55 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Mon, Sep 15, 2014 at 09:18:22AM -0400, Brian Foster wrote:
> On Mon, Sep 15, 2014 at 11:46:54AM +1000, Dave Chinner wrote:
> > xfs: ensure WB_SYNC_ALL writeback handles partial pages correctly
> > 
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > XFS has been having trouble with stray delayed allocation extents
> > beyond EOF for a long time. Recent changes to the collapse range
> > code has triggered erroneous EBUSY errors on page invalidtion for
> > block size smaller than page size filesystems. These
> > have been caused by dirty buffers beyond EOF on a partial page which
> > do not get written to disk during a sync.
> > 
> > The issue is that write-ahead in xfs_cluster_write() finds such a
> > partial page and handles it by leaving the page dirty but pushing it
> > into a writeback state. This used to work just fine, as the
> > write_cache_pages() code would then find the dirty partial page in
> > the next mapping tree lookup as the dirty tag is still set.
> > 
> > Unfortunately, when we moved to a mark and sweep approach to
> > writeback to fix other writeback sync issues, we broken this. THe
> > act of marking the page as under writeback now clears the TOWRITE
> > tag in the radix tree, even though the page is still dirty. This
> > causes the TOWRITE tag to be cleared, and hence the next lookup on
> > the mapping tree does not find the dirty partial page and so doesn't
> > try to write it again.
> > 
> > This same writeback bug was found recently in ext4 and fixed in
> > commit 1c8349a ("ext4: fix data integrity sync in ordered mode")
> > without communication to the wider filesystem community. We can use
> > exactly the same fix here so the TOWRITE flag is not cleared on
> > partial page writes.
> > 
> > cc: stable@vger.kernel.org # dependent on 1c8349a17137b93f0a83f276c764a6df1b9a116e
> > Root-cause-found-by: Brian Foster <bfoster@redhat.com>
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> 
> Looks good and fixes the collapse failure in my test.
> 
> Reviewed-by: Brian Foster <bfoster@redhat.com>
> 
> I suppose we should prepend the collapse rework series with this patch
> to avoid the regression as it pertains to collapse (obviously the
> failure to retain towrite goes further back).

Agreed, I will do that.

> I'll continue testing with this. Are you still seeing an increase in
> such failures with the xfs_free_file_space() patch or has this quieted
> those down?

To early to sayi for sure, but signs are good - I've had xfstests
actually complete without any stray delalloc asserts occurring on
1k block size filesystems for the first time in a couple of weeks.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2014-09-15 22:55 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-09-10 13:20 [PATCH v2 0/5] clean up collapse range and handle post-eof delalloc Brian Foster
2014-09-10 13:20 ` [PATCH v2 1/5] xfs: track collapse via file offset rather than extent index Brian Foster
2014-09-10 13:20 ` [PATCH v2 2/5] xfs: refactor shift-by-merge into xfs_bmse_merge() helper Brian Foster
2014-09-10 13:20 ` [PATCH v2 3/5] xfs: refactor single extent shift into xfs_bmse_shift_one() helper Brian Foster
2014-09-10 13:20 ` [PATCH v2 4/5] xfs: writeback and inval. file range to be shifted by collapse Brian Foster
2014-09-10 13:20 ` [PATCH v2 5/5] xfs: only writeback and truncate pages for the freed range Brian Foster
2014-09-11  4:42 ` [PATCH v2 0/5] clean up collapse range and handle post-eof delalloc Dave Chinner
2014-09-11 15:20   ` Brian Foster
2014-09-11 21:19     ` Dave Chinner
2014-09-12 19:51       ` Brian Foster
2014-09-12 20:05         ` Brian Foster
2014-09-15  1:46           ` Dave Chinner
2014-09-15 13:18             ` Brian Foster
2014-09-15 22:55               ` Dave Chinner

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.