* [PATCH v3 0/6] xfs: properly invalidate cached writeback mapping
@ 2019-01-23 18:41 Brian Foster
  2019-01-23 18:41 ` [PATCH v3 1/6] xfs: eof trim writeback mapping as soon as it is cached Brian Foster
                   ` (5 more replies)
  0 siblings, 6 replies; 13+ messages in thread
From: Brian Foster @ 2019-01-23 18:41 UTC (permalink / raw)
  To: linux-xfs

Hi all,

Here's v3 of the imap cache invalidation series. To recap from v2, patch
5 of that series added a lookup and extent trim to
xfs_iomap_write_allocate() to ensure delalloc conversion always had a
correct range. Christoph didn't like this approach and has an alternate
proposal to modify XFS_BMAPI_DELALLOC behavior to always skip holes.
That approach is problematic because it can potentially convert blocks
that have nothing to do with the current extent (i.e., still racy with
hole punch).

As a compromise, this version implements an xfs_bmapi_delalloc() wrapper
with an interface that allocates the underlying extent of a particular
block. This ensures that writeback always uses the correct range without
adding an extra extent lookup. There is still a bit of hackiness and
probably opportunity for broader refactoring, but that can be done once
we've established correctness.
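
For reference, the shape of that wrapper (a sketch of the prototype
only; patch 5 has the actual implementation) is:

	/*
	 * Allocate whatever delalloc extent currently backs the file
	 * offset bno, rather than trusting a caller-supplied range
	 * that may have gone stale across ilock cycles.
	 */
	int xfs_bmapi_delalloc(struct xfs_trans *tp, struct xfs_inode *ip,
			xfs_fileoff_t bno, int flags, xfs_extlen_t total,
			struct xfs_bmbt_irec *imap, int *nimaps);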

Patches 1-4 are mostly unchanged from v2. Patch 5 introduces the
xfs_bmapi_delalloc() helper. Patch 6 modifies xfs_iomap_write_allocate()
to use xfs_bmapi_delalloc() instead of xfs_bmapi_write(). This series
survives fstests (including repeated cycles of generic/524 and xfs/442)
on 4k and 1k block sizes with reflink enabled without any regressions.
It also survives several million fsx operations.

Thoughts, reviews, flames appreciated.

Brian

v3:
- Move comment in xfs_imap_valid().
- Replace lookup+trim in xfs_iomap_write_allocate() with
  xfs_bmapi_delalloc() wrapper mechanism.
v2: https://marc.info/?l=linux-xfs&m=154775280823464&w=2
- Refactor validation logic into xfs_imap_valid() helper.
- Revalidate seqno after the lock cycle in xfs_map_blocks().
- Update *seq in xfs_iomap_write_allocate() regardless of fork type.
- Add patch 5 for seqno revalidation on xfs_iomap_write_allocate() lock
  cycles.
v1: https://marc.info/?l=linux-xfs&m=154721212321112&w=2

Brian Foster (6):
  xfs: eof trim writeback mapping as soon as it is cached
  xfs: update fork seq counter on data fork changes
  xfs: validate writeback mapping using data fork seq counter
  xfs: remove superfluous writeback mapping eof trimming
  xfs: create delalloc bmapi wrapper for full extent allocation
  xfs: use the latest extent at writeback delalloc conversion time

 fs/xfs/libxfs/xfs_bmap.c       |  58 ++++++++---
 fs/xfs/libxfs/xfs_bmap.h       |   3 +-
 fs/xfs/libxfs/xfs_iext_tree.c  |  13 ++-
 fs/xfs/libxfs/xfs_inode_fork.h |   2 +-
 fs/xfs/xfs_aops.c              |  71 ++++++++-----
 fs/xfs/xfs_iomap.c             | 175 ++++++++++++---------------------
 6 files changed, 162 insertions(+), 160 deletions(-)

-- 
2.17.2


* [PATCH v3 1/6] xfs: eof trim writeback mapping as soon as it is cached
  2019-01-23 18:41 [PATCH v3 0/6] xfs: properly invalidate cached writeback mapping Brian Foster
@ 2019-01-23 18:41 ` Brian Foster
  2019-02-01  7:58   ` Christoph Hellwig
  2019-01-23 18:41 ` [PATCH v3 2/6] xfs: update fork seq counter on data fork changes Brian Foster
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 13+ messages in thread
From: Brian Foster @ 2019-01-23 18:41 UTC (permalink / raw)
  To: linux-xfs

The cached writeback mapping is EOF trimmed to try to avoid races
between post-eof block management and writeback that result in
sending cached data to a stale location. The cached mapping is
currently trimmed on the validation check, which leaves a race
window between the time the mapping is cached and when it is trimmed
against the current inode size.

For example, if a new mapping is cached by delalloc conversion on a
blocksize == page size fs, we could cycle various locks, perform
memory allocations, etc. in the writeback codepath before the
associated mapping is eventually trimmed to i_size. This leaves
enough time for a post-eof truncate and file append before the
cached mapping is trimmed. The former event essentially invalidates
a range of the cached mapping and the latter bumps the inode size
such that the trim on the next writepage event won't trim all of the
invalid blocks. fstests generic/464 reproduces this scenario
occasionally and causes a lost writeback and stale delalloc blocks
warning on inode inactivation.

To work around this problem, trim the cached writeback mapping as
soon as it is cached in addition to on subsequent validation checks.
This is a minor tweak to tighten the race window as much as possible
until a proper invalidation mechanism is available.
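
For reference, the trim being added is a clamp of the cached mapping
to within the current inode size; it mirrors the body of
xfs_trim_extent_eof() (quoted in full when patch 4 removes it):

	/* clamp the cached writeback mapping to within i_size */
	xfs_trim_extent(&wpc->imap, 0,
			XFS_B_TO_FSB(ip->i_mount,
				     i_size_read(VFS_I(ip))));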

Fixes: 40214d128e07 ("xfs: trim writepage mapping to within eof")
Cc: <stable@vger.kernel.org> # v4.14+
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_aops.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 338b9d9984e0..d9048bcea49c 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -449,6 +449,7 @@ xfs_map_blocks(
 	}
 
 	wpc->imap = imap;
+	xfs_trim_extent_eof(&wpc->imap, ip);
 	trace_xfs_map_blocks_found(ip, offset, count, wpc->io_type, &imap);
 	return 0;
 allocate_blocks:
@@ -459,6 +460,7 @@ xfs_map_blocks(
 	ASSERT(whichfork == XFS_COW_FORK || cow_fsb == NULLFILEOFF ||
 	       imap.br_startoff + imap.br_blockcount <= cow_fsb);
 	wpc->imap = imap;
+	xfs_trim_extent_eof(&wpc->imap, ip);
 	trace_xfs_map_blocks_alloc(ip, offset, count, wpc->io_type, &imap);
 	return 0;
 }
-- 
2.17.2


* [PATCH v3 2/6] xfs: update fork seq counter on data fork changes
  2019-01-23 18:41 [PATCH v3 0/6] xfs: properly invalidate cached writeback mapping Brian Foster
  2019-01-23 18:41 ` [PATCH v3 1/6] xfs: eof trim writeback mapping as soon as it is cached Brian Foster
@ 2019-01-23 18:41 ` Brian Foster
  2019-01-23 18:41 ` [PATCH v3 3/6] xfs: validate writeback mapping using data fork seq counter Brian Foster
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 13+ messages in thread
From: Brian Foster @ 2019-01-23 18:41 UTC (permalink / raw)
  To: linux-xfs

The sequence counter in the xfs_ifork structure is only updated on
COW forks. This is because the counter is currently only used to
optimize out repetitive COW fork checks at writeback time.

Tweak the extent code to update the seq counter regardless of the
fork type in preparation for using this counter on data forks as
well.
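
The intended pairing, sketched from how patches 3 and 6 use the
counter, is a lockless sample-and-compare:

	/* writer side: any extent tree change bumps the counter */
	WRITE_ONCE(ifp->if_seq, READ_ONCE(ifp->if_seq) + 1);

	/* reader side: sample under the ilock when caching a mapping */
	wpc->data_seq = READ_ONCE(ip->i_df.if_seq);

	/* ... and later treat the cached mapping as stale on a change */
	if (wpc->data_seq != READ_ONCE(ip->i_df.if_seq))
		return false;	/* force a fresh extent lookup */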

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
---
 fs/xfs/libxfs/xfs_iext_tree.c  | 13 ++++++-------
 fs/xfs/libxfs/xfs_inode_fork.h |  2 +-
 2 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_iext_tree.c b/fs/xfs/libxfs/xfs_iext_tree.c
index 771dd072015d..bc690f2409fa 100644
--- a/fs/xfs/libxfs/xfs_iext_tree.c
+++ b/fs/xfs/libxfs/xfs_iext_tree.c
@@ -614,16 +614,15 @@ xfs_iext_realloc_root(
 }
 
 /*
- * Increment the sequence counter if we are on a COW fork.  This allows
- * the writeback code to skip looking for a COW extent if the COW fork
- * hasn't changed.  We use WRITE_ONCE here to ensure the update to the
- * sequence counter is seen before the modifications to the extent
- * tree itself take effect.
+ * Increment the sequence counter on extent tree changes. If we are on a COW
+ * fork, this allows the writeback code to skip looking for a COW extent if the
+ * COW fork hasn't changed. We use WRITE_ONCE here to ensure the update to the
+ * sequence counter is seen before the modifications to the extent tree itself
+ * take effect.
  */
 static inline void xfs_iext_inc_seq(struct xfs_ifork *ifp, int state)
 {
-	if (state & BMAP_COWFORK)
-		WRITE_ONCE(ifp->if_seq, READ_ONCE(ifp->if_seq) + 1);
+	WRITE_ONCE(ifp->if_seq, READ_ONCE(ifp->if_seq) + 1);
 }
 
 void
diff --git a/fs/xfs/libxfs/xfs_inode_fork.h b/fs/xfs/libxfs/xfs_inode_fork.h
index 60361d2d74a1..00c62ce170d0 100644
--- a/fs/xfs/libxfs/xfs_inode_fork.h
+++ b/fs/xfs/libxfs/xfs_inode_fork.h
@@ -14,7 +14,7 @@ struct xfs_dinode;
  */
 struct xfs_ifork {
 	int			if_bytes;	/* bytes in if_u1 */
-	unsigned int		if_seq;		/* cow fork mod counter */
+	unsigned int		if_seq;		/* fork mod counter */
 	struct xfs_btree_block	*if_broot;	/* file's incore btree root */
 	short			if_broot_bytes;	/* bytes allocated for root */
 	unsigned char		if_flags;	/* per-fork flags */
-- 
2.17.2


* [PATCH v3 3/6] xfs: validate writeback mapping using data fork seq counter
  2019-01-23 18:41 [PATCH v3 0/6] xfs: properly invalidate cached writeback mapping Brian Foster
  2019-01-23 18:41 ` [PATCH v3 1/6] xfs: eof trim writeback mapping as soon as it is cached Brian Foster
  2019-01-23 18:41 ` [PATCH v3 2/6] xfs: update fork seq counter on data fork changes Brian Foster
@ 2019-01-23 18:41 ` Brian Foster
  2019-01-23 18:41 ` [PATCH v3 4/6] xfs: remove superfluous writeback mapping eof trimming Brian Foster
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 13+ messages in thread
From: Brian Foster @ 2019-01-23 18:41 UTC (permalink / raw)
  To: linux-xfs

The writeback code caches the current extent mapping across multiple
xfs_do_writepage() calls to avoid repeated lookups for sequential
pages backed by the same extent. This is known to be slightly racy
with extent fork changes in certain difficult-to-reproduce
scenarios. The cached extent is trimmed to within EOF to help avoid
the most common vector for this problem via speculative
preallocation management, but this is a band-aid that does not
address the fundamental problem.

Now that we have an xfs_ifork sequence counter mechanism used to
facilitate COW writeback, we can use the same mechanism to validate
consistency between the data fork and cached writeback mappings. On
its face, this is somewhat of a big hammer approach because any
change to the data fork invalidates any mapping currently cached by
a writeback in progress regardless of whether the data fork change
overlaps with the range under writeback. In practice, however, the
impact of this approach is minimal in most cases.

First, data fork changes (delayed allocations) caused by sustained
sequential buffered writes are amortized across speculative
preallocations. This means that a cached mapping won't be
invalidated by each buffered write of a common file copy workload,
but rather only on less frequent allocation events. Second, the
extent tree is always entirely in-core so an additional lookup of a
usable extent mostly costs a shared ilock cycle and an in-memory tree
lookup. This means that a cached mapping reval is relatively cheap
compared to the I/O itself. Third, spurious invalidations don't
impact ioend construction. This means that even if the same extent
is revalidated multiple times across multiple writepage instances,
we still construct and submit the same size ioend (and bio) if the
blocks are physically contiguous.

Update struct xfs_writepage_ctx with a new field to hold the
sequence number of the data fork associated with the currently
cached mapping. Check the wpc seqno against the data fork when the
mapping is validated and reestablish the mapping whenever the fork
has changed since the mapping was cached. This ensures that
writeback always uses a valid extent mapping and thus prevents lost
writebacks and stale delalloc block problems.
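
Condensed, the resulting xfs_map_blocks() flow is sketched below (a
summary of the diff that follows, with the COW fork handling elided):

	if (xfs_imap_valid(wpc, ip, offset_fsb))
		return 0;	/* cached mapping still good */

	xfs_ilock(ip, XFS_ILOCK_SHARED);
	/* ... fresh lookup in the data fork ... */
	wpc->data_seq = READ_ONCE(ip->i_df.if_seq);
	xfs_iunlock(ip, XFS_ILOCK_SHARED);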

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_aops.c  | 60 ++++++++++++++++++++++++++++++++++++----------
 fs/xfs/xfs_iomap.c |  5 ++--
 2 files changed, 49 insertions(+), 16 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index d9048bcea49c..5b0256a8a420 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -29,6 +29,7 @@
 struct xfs_writepage_ctx {
 	struct xfs_bmbt_irec    imap;
 	unsigned int		io_type;
+	unsigned int		data_seq;
 	unsigned int		cow_seq;
 	struct xfs_ioend	*ioend;
 };
@@ -301,6 +302,42 @@ xfs_end_bio(
 		xfs_destroy_ioend(ioend, blk_status_to_errno(bio->bi_status));
 }
 
+/*
+ * Fast revalidation of the cached writeback mapping. Return true if the current
+ * mapping is valid, false otherwise.
+ */
+static bool
+xfs_imap_valid(
+	struct xfs_writepage_ctx	*wpc,
+	struct xfs_inode		*ip,
+	xfs_fileoff_t			offset_fsb)
+{
+	if (offset_fsb < wpc->imap.br_startoff ||
+	    offset_fsb >= wpc->imap.br_startoff + wpc->imap.br_blockcount)
+		return false;
+	/*
+	 * If this is a COW mapping, it is sufficient to check that the mapping
+	 * covers the offset. Be careful to check this first because the caller
+	 * can revalidate a COW mapping without updating the data seqno.
+	 */
+	if (wpc->io_type == XFS_IO_COW)
+		return true;
+
+	/*
+	 * This is not a COW mapping. Check the sequence number of the data fork
+	 * because concurrent changes could have invalidated the extent. Check
+	 * the COW fork because concurrent changes since the last time we
+	 * checked (and found nothing at this offset) could have added
+	 * overlapping blocks.
+	 */
+	if (wpc->data_seq != READ_ONCE(ip->i_df.if_seq))
+		return false;
+	if (xfs_inode_has_cow_data(ip) &&
+	    wpc->cow_seq != READ_ONCE(ip->i_cowfp->if_seq))
+		return false;
+	return true;
+}
+
 STATIC int
 xfs_map_blocks(
 	struct xfs_writepage_ctx *wpc,
@@ -315,9 +352,11 @@ xfs_map_blocks(
 	struct xfs_bmbt_irec	imap;
 	int			whichfork = XFS_DATA_FORK;
 	struct xfs_iext_cursor	icur;
-	bool			imap_valid;
 	int			error = 0;
 
+	if (XFS_FORCED_SHUTDOWN(mp))
+		return -EIO;
+
 	/*
 	 * We have to make sure the cached mapping is within EOF to protect
 	 * against eofblocks trimming on file release leaving us with a stale
@@ -346,17 +385,9 @@ xfs_map_blocks(
 	 * against concurrent updates and provides a memory barrier on the way
 	 * out that ensures that we always see the current value.
 	 */
-	imap_valid = offset_fsb >= wpc->imap.br_startoff &&
-		     offset_fsb < wpc->imap.br_startoff + wpc->imap.br_blockcount;
-	if (imap_valid &&
-	    (!xfs_inode_has_cow_data(ip) ||
-	     wpc->io_type == XFS_IO_COW ||
-	     wpc->cow_seq == READ_ONCE(ip->i_cowfp->if_seq)))
+	if (xfs_imap_valid(wpc, ip, offset_fsb))
 		return 0;
 
-	if (XFS_FORCED_SHUTDOWN(mp))
-		return -EIO;
-
 	/*
 	 * If we don't have a valid map, now it's time to get a new one for this
 	 * offset.  This will convert delayed allocations (including COW ones)
@@ -403,9 +434,10 @@ xfs_map_blocks(
 	}
 
 	/*
-	 * Map valid and no COW extent in the way?  We're done.
+	 * No COW extent overlap. Revalidate now that we may have updated
+	 * ->cow_seq. If the data mapping is still valid, we're done.
 	 */
-	if (imap_valid) {
+	if (xfs_imap_valid(wpc, ip, offset_fsb)) {
 		xfs_iunlock(ip, XFS_ILOCK_SHARED);
 		return 0;
 	}
@@ -417,6 +449,7 @@ xfs_map_blocks(
 	 */
 	if (!xfs_iext_lookup_extent(ip, &ip->i_df, offset_fsb, &icur, &imap))
 		imap.br_startoff = end_fsb;	/* fake a hole past EOF */
+	wpc->data_seq = READ_ONCE(ip->i_df.if_seq);
 	xfs_iunlock(ip, XFS_ILOCK_SHARED);
 
 	if (imap.br_startoff > offset_fsb) {
@@ -454,7 +487,8 @@ xfs_map_blocks(
 	return 0;
 allocate_blocks:
 	error = xfs_iomap_write_allocate(ip, whichfork, offset, &imap,
-			&wpc->cow_seq);
+			whichfork == XFS_COW_FORK ?
+					 &wpc->cow_seq : &wpc->data_seq);
 	if (error)
 		return error;
 	ASSERT(whichfork == XFS_COW_FORK || cow_fsb == NULLFILEOFF ||
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 27c93b5f029d..ab69caa685b4 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -681,7 +681,7 @@ xfs_iomap_write_allocate(
 	int		whichfork,
 	xfs_off_t	offset,
 	xfs_bmbt_irec_t *imap,
-	unsigned int	*cow_seq)
+	unsigned int	*seq)
 {
 	xfs_mount_t	*mp = ip->i_mount;
 	struct xfs_ifork *ifp = XFS_IFORK_PTR(ip, whichfork);
@@ -797,8 +797,7 @@ xfs_iomap_write_allocate(
 			if (error)
 				goto error0;
 
-			if (whichfork == XFS_COW_FORK)
-				*cow_seq = READ_ONCE(ifp->if_seq);
+			*seq = READ_ONCE(ifp->if_seq);
 			xfs_iunlock(ip, XFS_ILOCK_EXCL);
 		}
 
-- 
2.17.2


* [PATCH v3 4/6] xfs: remove superfluous writeback mapping eof trimming
  2019-01-23 18:41 [PATCH v3 0/6] xfs: properly invalidate cached writeback mapping Brian Foster
                   ` (2 preceding siblings ...)
  2019-01-23 18:41 ` [PATCH v3 3/6] xfs: validate writeback mapping using data fork seq counter Brian Foster
@ 2019-01-23 18:41 ` Brian Foster
  2019-01-23 18:41 ` [PATCH v3 5/6] xfs: create delalloc bmapi wrapper for full extent allocation Brian Foster
  2019-01-23 18:41 ` [PATCH v3 6/6] xfs: use the latest extent at writeback delalloc conversion time Brian Foster
  5 siblings, 0 replies; 13+ messages in thread
From: Brian Foster @ 2019-01-23 18:41 UTC (permalink / raw)
  To: linux-xfs

Now that the cached writeback mapping is explicitly invalidated on
data fork changes, the EOF trimming band-aid is no longer necessary.
Remove xfs_trim_extent_eof() as well since it has no other users.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/libxfs/xfs_bmap.c | 11 -----------
 fs/xfs/libxfs/xfs_bmap.h |  1 -
 fs/xfs/xfs_aops.c        | 15 ---------------
 3 files changed, 27 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 332eefa2700b..4c73927819c2 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3685,17 +3685,6 @@ xfs_trim_extent(
 	}
 }
 
-/* trim extent to within eof */
-void
-xfs_trim_extent_eof(
-	struct xfs_bmbt_irec	*irec,
-	struct xfs_inode	*ip)
-
-{
-	xfs_trim_extent(irec, 0, XFS_B_TO_FSB(ip->i_mount,
-					      i_size_read(VFS_I(ip))));
-}
-
 /*
  * Trim the returned map to the required bounds
  */
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index 09d3ea97cc15..b4ff710d7250 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -181,7 +181,6 @@ static inline bool xfs_bmap_is_real_extent(struct xfs_bmbt_irec *irec)
 
 void	xfs_trim_extent(struct xfs_bmbt_irec *irec, xfs_fileoff_t bno,
 		xfs_filblks_t len);
-void	xfs_trim_extent_eof(struct xfs_bmbt_irec *, struct xfs_inode *);
 int	xfs_bmap_add_attrfork(struct xfs_inode *ip, int size, int rsvd);
 int	xfs_bmap_set_attrforkoff(struct xfs_inode *ip, int size, int *version);
 void	xfs_bmap_local_to_extents_empty(struct xfs_inode *ip, int whichfork);
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 5b0256a8a420..515532f45beb 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -357,19 +357,6 @@ xfs_map_blocks(
 	if (XFS_FORCED_SHUTDOWN(mp))
 		return -EIO;
 
-	/*
-	 * We have to make sure the cached mapping is within EOF to protect
-	 * against eofblocks trimming on file release leaving us with a stale
-	 * mapping. Otherwise, a page for a subsequent file extending buffered
-	 * write could get picked up by this writeback cycle and written to the
-	 * wrong blocks.
-	 *
-	 * Note that what we really want here is a generic mapping invalidation
-	 * mechanism to protect us from arbitrary extent modifying contexts, not
-	 * just eofblocks.
-	 */
-	xfs_trim_extent_eof(&wpc->imap, ip);
-
 	/*
 	 * COW fork blocks can overlap data fork blocks even if the blocks
 	 * aren't shared.  COW I/O always takes precedent, so we must always
@@ -482,7 +469,6 @@ xfs_map_blocks(
 	}
 
 	wpc->imap = imap;
-	xfs_trim_extent_eof(&wpc->imap, ip);
 	trace_xfs_map_blocks_found(ip, offset, count, wpc->io_type, &imap);
 	return 0;
 allocate_blocks:
@@ -494,7 +480,6 @@ xfs_map_blocks(
 	ASSERT(whichfork == XFS_COW_FORK || cow_fsb == NULLFILEOFF ||
 	       imap.br_startoff + imap.br_blockcount <= cow_fsb);
 	wpc->imap = imap;
-	xfs_trim_extent_eof(&wpc->imap, ip);
 	trace_xfs_map_blocks_alloc(ip, offset, count, wpc->io_type, &imap);
 	return 0;
 }
-- 
2.17.2


* [PATCH v3 5/6] xfs: create delalloc bmapi wrapper for full extent allocation
  2019-01-23 18:41 [PATCH v3 0/6] xfs: properly invalidate cached writeback mapping Brian Foster
                   ` (3 preceding siblings ...)
  2019-01-23 18:41 ` [PATCH v3 4/6] xfs: remove superfluous writeback mapping eof trimming Brian Foster
@ 2019-01-23 18:41 ` Brian Foster
  2019-01-24  8:51   ` Christoph Hellwig
  2019-01-23 18:41 ` [PATCH v3 6/6] xfs: use the latest extent at writeback delalloc conversion time Brian Foster
  5 siblings, 1 reply; 13+ messages in thread
From: Brian Foster @ 2019-01-23 18:41 UTC (permalink / raw)
  To: linux-xfs

The writeback delalloc conversion code is racy with respect to
changes in the currently cached file mapping. This stems from the
fact that the bmapi allocation code requires a file range to
allocate and the writeback conversion code assumes the range of the
currently cached mapping is still valid with respect to the fork. It
may not be valid, however, because the ilock is cycled (potentially
multiple times) between the time the cached mapping was populated
and the delalloc conversion occurs.

To facilitate a solution to this problem, create a new
xfs_bmapi_delalloc() wrapper to xfs_bmapi_write() that takes a file
(FSB) offset and attempts to allocate whatever delalloc extent backs
the offset. Use a new bmapi flag to cause xfs_bmapi_write() to set
the range based on the extent backing the bno parameter unless bno
lands in a hole. If bno does land in a hole, fall back to the
current behavior (which may result in an error or quietly skipping
holes in the specified range depending on other parameters). This
patch does not change behavior for existing callers.
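
As a usage sketch (patch 6 wires this into xfs_iomap_write_allocate()
with essentially these arguments), a caller passes only the offset of
interest and gets back whatever extent backs it:

	nimaps = 1;
	error = xfs_bmapi_delalloc(tp, ip, offset_fsb, flags, nres,
				   imap, &nimaps);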

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_bmap.c | 48 ++++++++++++++++++++++++++++++++++++----
 fs/xfs/libxfs/xfs_bmap.h |  4 ++++
 2 files changed, 48 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 4c73927819c2..856de22439a3 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -4286,10 +4286,6 @@ xfs_bmapi_write(
 			goto error0;
 	}
 
-	n = 0;
-	end = bno + len;
-	obno = bno;
-
 	if (!xfs_iext_lookup_extent(ip, ifp, bno, &bma.icur, &bma.got))
 		eof = true;
 	if (!xfs_iext_peek_prev_extent(ifp, &bma.icur, &bma.prev))
@@ -4299,6 +4295,26 @@ xfs_bmapi_write(
 	bma.total = total;
 	bma.datatype = 0;
 
+	/*
+	 * The reval flag means the caller wants to allocate the entire delalloc
+	 * extent backing bno where bno may not necessarily match the startoff.
+	 * Now that we've looked up the extent, reset the range to map based on
+	 * the extent in the file. If we're in a hole, this may be an error so
+	 * don't adjust anything.
+	 */
+	if ((flags & XFS_BMAPI_REVALRANGE) &&
+	    !eof && bno >= bma.got.br_startoff) {
+		ASSERT(flags & XFS_BMAPI_DELALLOC);
+		bno = bma.got.br_startoff;
+		len = bma.got.br_blockcount;
+#ifdef DEBUG
+		orig_bno = bno;
+		orig_len = len;
+#endif
+	}
+	n = 0;
+	end = bno + len;
+	obno = bno;
 	while (bno < end && n < *nmap) {
 		bool			need_alloc = false, wasdelay = false;
 
@@ -4455,6 +4471,30 @@ xfs_bmapi_write(
 	return error;
 }
 
+/*
+ * Convert an existing delalloc extent to real blocks based on file offset. This
+ * attempts to allocate the entire delalloc extent and may require multiple
+ * invocations to allocate the target offset if a large enough physical extent
+ * is not available.
+ */
+int
+xfs_bmapi_delalloc(
+	struct xfs_trans	*tp,
+	struct xfs_inode	*ip,
+	xfs_fileoff_t		bno,
+	int			flags,
+	xfs_extlen_t		total,
+	struct xfs_bmbt_irec	*imap,
+	int			*nimaps)
+{
+	/*
+	 * The reval flag means to allocate the entire extent; pass a dummy
+	 * length of 1.
+	 */
+	flags |= XFS_BMAPI_REVALRANGE;
+	return xfs_bmapi_write(tp, ip, bno, 1, flags, total, imap, nimaps);
+}
+
 int
 xfs_bmapi_remap(
 	struct xfs_trans	*tp,
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index b4ff710d7250..a53eb3d527e2 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -107,6 +107,8 @@ struct xfs_extent_free_item
 /* Do not update the rmap btree.  Used for reconstructing bmbt from rmapbt. */
 #define XFS_BMAPI_NORMAP	0x2000
 
+#define XFS_BMAPI_REVALRANGE	0x4000
+
 #define XFS_BMAPI_FLAGS \
 	{ XFS_BMAPI_ENTIRE,	"ENTIRE" }, \
 	{ XFS_BMAPI_METADATA,	"METADATA" }, \
@@ -227,6 +229,8 @@ int	xfs_bmapi_reserve_delalloc(struct xfs_inode *ip, int whichfork,
 		xfs_fileoff_t off, xfs_filblks_t len, xfs_filblks_t prealloc,
 		struct xfs_bmbt_irec *got, struct xfs_iext_cursor *cur,
 		int eof);
+int	xfs_bmapi_delalloc(struct xfs_trans *, struct xfs_inode *,
+		xfs_fileoff_t, int, xfs_extlen_t, struct xfs_bmbt_irec *, int *);
 
 static inline void
 xfs_bmap_add_free(
-- 
2.17.2


* [PATCH v3 6/6] xfs: use the latest extent at writeback delalloc conversion time
  2019-01-23 18:41 [PATCH v3 0/6] xfs: properly invalidate cached writeback mapping Brian Foster
                   ` (4 preceding siblings ...)
  2019-01-23 18:41 ` [PATCH v3 5/6] xfs: create delalloc bmapi wrapper for full extent allocation Brian Foster
@ 2019-01-23 18:41 ` Brian Foster
  2019-01-24  8:52   ` Christoph Hellwig
  5 siblings, 1 reply; 13+ messages in thread
From: Brian Foster @ 2019-01-23 18:41 UTC (permalink / raw)
  To: linux-xfs

The writeback delalloc conversion code is racy with respect to
changes in the currently cached file mapping outside of the current
page. This is because the ilock is cycled between the time the
caller originally looked up the mapping and across each real
allocation of the provided file range. This code has collected
various hacks over the years to help combat the symptoms of these
races (i.e., truncate race detection, allocation into hole
detection, etc.), but none address the fundamental problem that the
imap may not be valid at allocation time.

Rather than continue to use race detection hacks, update writeback
delalloc conversion to a model that explicitly converts the delalloc
extent backing the current file offset being processed. The block at
the current file offset is the only one we can trust to remain once
the ilock is dropped, because any operation that can remove the block
(truncate, hole punch, etc.) must flush and discard pagecache pages
first.

Modify xfs_iomap_write_allocate() to use the xfs_bmapi_delalloc()
mechanism to request allocation of the entire delalloc extent
backing the current offset instead of assuming the extent passed by
the caller is unchanged. Record the range specified by the caller
and apply it to the resulting allocated extent so previous checks by
the caller for COW fork overlap are not lost. Finally, overload the
bmapi delalloc flag with the range reval flag behavior since this is
the only use case for both.

This ensures that writeback always picks up the correct
and current extent associated with the page, regardless of races
with other extent modifying operations. If operating on a data fork
and the COW overlap state has changed since the ilock was cycled,
the caller revalidates against the COW fork sequence number before
using the imap for the next block.
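
The clamp back to the caller's validated range is a single call in the
diff below; sketched here because it carries the COW overlap state:

	/* trim the allocated extent to the range the caller validated */
	xfs_trim_extent(imap, map_start_fsb, map_count_fsb);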

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_bmap.c |  15 ++--
 fs/xfs/libxfs/xfs_bmap.h |   2 -
 fs/xfs/xfs_iomap.c       | 174 ++++++++++++++-------------------------
 3 files changed, 71 insertions(+), 120 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 856de22439a3..3229a82de1fb 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -4296,15 +4296,14 @@ xfs_bmapi_write(
 	bma.datatype = 0;
 
 	/*
-	 * The reval flag means the caller wants to allocate the entire delalloc
-	 * extent backing bno where bno may not necessarily match the startoff.
-	 * Now that we've looked up the extent, reset the range to map based on
-	 * the extent in the file. If we're in a hole, this may be an error so
-	 * don't adjust anything.
+	 * The delalloc flag means the caller wants to allocate the entire
+	 * delalloc extent backing bno where bno may not necessarily match the
+	 * startoff. Now that we've looked up the extent, reset the range to
+	 * map based on the extent in the file. If we're in a hole, this may be
+	 * an error so don't adjust anything.
 	 */
-	if ((flags & XFS_BMAPI_REVALRANGE) &&
+	if ((flags & XFS_BMAPI_DELALLOC) &&
 	    !eof && bno >= bma.got.br_startoff) {
-		ASSERT(flags & XFS_BMAPI_DELALLOC);
 		bno = bma.got.br_startoff;
 		len = bma.got.br_blockcount;
 #ifdef DEBUG
@@ -4491,7 +4490,7 @@ xfs_bmapi_delalloc(
 	 * The reval flag means to allocate the entire extent; pass a dummy
 	 * length of 1.
 	 */
-	flags |= XFS_BMAPI_REVALRANGE;
+	flags |= XFS_BMAPI_DELALLOC;
 	return xfs_bmapi_write(tp, ip, bno, 1, flags, total, imap, nimaps);
 }
 
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index a53eb3d527e2..4e8bd2837cb0 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -107,8 +107,6 @@ struct xfs_extent_free_item
 /* Do not update the rmap btree.  Used for reconstructing bmbt from rmapbt. */
 #define XFS_BMAPI_NORMAP	0x2000
 
-#define XFS_BMAPI_REVALRANGE	0x4000
-
 #define XFS_BMAPI_FLAGS \
 	{ XFS_BMAPI_ENTIRE,	"ENTIRE" }, \
 	{ XFS_BMAPI_METADATA,	"METADATA" }, \
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index ab69caa685b4..066c2120f0ba 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -677,25 +677,26 @@ xfs_file_iomap_begin_delay(
  */
 int
 xfs_iomap_write_allocate(
-	xfs_inode_t	*ip,
-	int		whichfork,
-	xfs_off_t	offset,
-	xfs_bmbt_irec_t *imap,
-	unsigned int	*seq)
+	struct xfs_inode	*ip,
+	int			whichfork,
+	xfs_off_t		offset,
+	struct xfs_bmbt_irec	*imap,
+	unsigned int		*seq)
 {
-	xfs_mount_t	*mp = ip->i_mount;
-	struct xfs_ifork *ifp = XFS_IFORK_PTR(ip, whichfork);
-	xfs_fileoff_t	offset_fsb, last_block;
-	xfs_fileoff_t	end_fsb, map_start_fsb;
-	xfs_filblks_t	count_fsb;
-	xfs_trans_t	*tp;
-	int		nimaps;
-	int		error = 0;
-	int		flags = XFS_BMAPI_DELALLOC;
-	int		nres;
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, whichfork);
+	xfs_fileoff_t		offset_fsb;
+	xfs_fileoff_t		map_start_fsb;
+	xfs_extlen_t		map_count_fsb;
+	struct xfs_trans	*tp;
+	int			nimaps;
+	int			error = 0;
+	int			flags = XFS_BMAPI_DELALLOC;
+	int			nres;
 
 	if (whichfork == XFS_COW_FORK)
 		flags |= XFS_BMAPI_COWFORK | XFS_BMAPI_PREALLOC;
+	nres = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
 
 	/*
 	 * Make sure that the dquots are there.
@@ -704,106 +705,63 @@ xfs_iomap_write_allocate(
 	if (error)
 		return error;
 
+	/*
+	 * Store the file range the caller is interested in because it encodes
+	 * state such as potential overlap with COW fork blocks. We must trim
+	 * the allocated extent down to this range to maintain consistency with
+	 * what the caller expects. Revalidation of the range itself is the
+	 * responsibility of the caller.
+	 */
 	offset_fsb = XFS_B_TO_FSBT(mp, offset);
-	count_fsb = imap->br_blockcount;
 	map_start_fsb = imap->br_startoff;
+	map_count_fsb = imap->br_blockcount;
 
-	XFS_STATS_ADD(mp, xs_xstrat_bytes, XFS_FSB_TO_B(mp, count_fsb));
+	XFS_STATS_ADD(mp, xs_xstrat_bytes,
+		      XFS_FSB_TO_B(mp, imap->br_blockcount));
 
-	while (count_fsb != 0) {
+	while (true) {
 		/*
-		 * Set up a transaction with which to allocate the
-		 * backing store for the file.  Do allocations in a
-		 * loop until we get some space in the range we are
-		 * interested in.  The other space that might be allocated
-		 * is in the delayed allocation extent on which we sit
-		 * but before our buffer starts.
+		 * Allocate in a loop because it may take several attempts to
+		 * allocate real blocks for a contiguous delalloc extent if free
+		 * space is sufficiently fragmented. Note that space for the
+		 * extent and indirect blocks was reserved when the delalloc
+		 * extent was created so there's no need to do so here.
 		 */
-		nimaps = 0;
-		while (nimaps == 0) {
-			nres = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
-			/*
-			 * We have already reserved space for the extent and any
-			 * indirect blocks when creating the delalloc extent,
-			 * there is no need to reserve space in this transaction
-			 * again.
-			 */
-			error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0,
-					0, XFS_TRANS_RESERVE, &tp);
-			if (error)
-				return error;
-
-			xfs_ilock(ip, XFS_ILOCK_EXCL);
-			xfs_trans_ijoin(tp, ip, 0);
-
-			/*
-			 * it is possible that the extents have changed since
-			 * we did the read call as we dropped the ilock for a
-			 * while. We have to be careful about truncates or hole
-			 * punchs here - we are not allowed to allocate
-			 * non-delalloc blocks here.
-			 *
-			 * The only protection against truncation is the pages
-			 * for the range we are being asked to convert are
-			 * locked and hence a truncate will block on them
-			 * first.
-			 *
-			 * As a result, if we go beyond the range we really
-			 * need and hit an delalloc extent boundary followed by
-			 * a hole while we have excess blocks in the map, we
-			 * will fill the hole incorrectly and overrun the
-			 * transaction reservation.
-			 *
-			 * Using a single map prevents this as we are forced to
-			 * check each map we look for overlap with the desired
-			 * range and abort as soon as we find it. Also, given
-			 * that we only return a single map, having one beyond
-			 * what we can return is probably a bit silly.
-			 *
-			 * We also need to check that we don't go beyond EOF;
-			 * this is a truncate optimisation as a truncate sets
-			 * the new file size before block on the pages we
-			 * currently have locked under writeback. Because they
-			 * are about to be tossed, we don't need to write them
-			 * back....
-			 */
-			nimaps = 1;
-			end_fsb = XFS_B_TO_FSB(mp, XFS_ISIZE(ip));
-			error = xfs_bmap_last_offset(ip, &last_block,
-							XFS_DATA_FORK);
-			if (error)
-				goto trans_cancel;
+		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0,
+					XFS_TRANS_RESERVE, &tp);
+		if (error)
+			return error;
 
-			last_block = XFS_FILEOFF_MAX(last_block, end_fsb);
-			if ((map_start_fsb + count_fsb) > last_block) {
-				count_fsb = last_block - map_start_fsb;
-				if (count_fsb == 0) {
-					error = -EAGAIN;
-					goto trans_cancel;
-				}
-			}
+		xfs_ilock(ip, XFS_ILOCK_EXCL);
+		xfs_trans_ijoin(tp, ip, 0);
 
-			/*
-			 * From this point onwards we overwrite the imap
-			 * pointer that the caller gave to us.
-			 */
-			error = xfs_bmapi_write(tp, ip, map_start_fsb,
-						count_fsb, flags, nres, imap,
-						&nimaps);
-			if (error)
-				goto trans_cancel;
+		/*
+		 * ilock was dropped since imap was populated which means it
+		 * might no longer be valid. The current page is held locked so
+		 * nothing could have removed the block backing offset_fsb.
+		 * Attempt to allocate whatever delalloc extent currently backs
+		 * offset_fsb and put the result in the imap pointer from the
+		 * caller. We'll trim it down to the caller's most recently
+		 * validated range before we return.
+		 */
+		nimaps = 1;
+		error = xfs_bmapi_delalloc(tp, ip, offset_fsb, flags, nres,
+					   imap, &nimaps);
+		if (nimaps == 0)
+			error = -EFSCORRUPTED;
+		if (error)
+			goto trans_cancel;
 
-			error = xfs_trans_commit(tp);
-			if (error)
-				goto error0;
+		error = xfs_trans_commit(tp);
+		if (error)
+			goto error0;
 
-			*seq = READ_ONCE(ifp->if_seq);
-			xfs_iunlock(ip, XFS_ILOCK_EXCL);
-		}
+		*seq = READ_ONCE(ifp->if_seq);
+		xfs_iunlock(ip, XFS_ILOCK_EXCL);
 
 		/*
-		 * See if we were able to allocate an extent that
-		 * covers at least part of the callers request
+		 * See if we were able to allocate an extent that covers at
+		 * least part of the callers request.
 		 */
 		if (!(imap->br_startblock || XFS_IS_REALTIME_INODE(ip)))
 			return xfs_alert_fsblock_zero(ip, imap);
@@ -812,15 +770,11 @@ xfs_iomap_write_allocate(
 		    (offset_fsb < (imap->br_startoff +
 				   imap->br_blockcount))) {
 			XFS_STATS_INC(mp, xs_xstrat_quick);
+			xfs_trim_extent(imap, map_start_fsb, map_count_fsb);
+			ASSERT(offset_fsb >= imap->br_startoff &&
+			       offset_fsb < imap->br_startoff + imap->br_blockcount);
 			return 0;
 		}
-
-		/*
-		 * So far we have not mapped the requested part of the
-		 * file, just surrounding data, try again.
-		 */
-		count_fsb -= imap->br_blockcount;
-		map_start_fsb = imap->br_startoff + imap->br_blockcount;
 	}
 
 trans_cancel:
-- 
2.17.2


* Re: [PATCH v3 5/6] xfs: create delalloc bmapi wrapper for full extent allocation
  2019-01-23 18:41 ` [PATCH v3 5/6] xfs: create delalloc bmapi wrapper for full extent allocation Brian Foster
@ 2019-01-24  8:51   ` Christoph Hellwig
  2019-01-24 14:03     ` Brian Foster
  0 siblings, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2019-01-24  8:51 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

This looks fine as a temporary step to me:

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH v3 6/6] xfs: use the latest extent at writeback delalloc conversion time
  2019-01-23 18:41 ` [PATCH v3 6/6] xfs: use the latest extent at writeback delalloc conversion time Brian Foster
@ 2019-01-24  8:52   ` Christoph Hellwig
  2019-01-24 14:01     ` Brian Foster
  0 siblings, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2019-01-24  8:52 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

And this I really don't like.  I much prefer the full version
I've done here, which has evolved a bit from the work I've been
posting to the list for months now:

http://git.infradead.org/users/hch/xfs.git/shortlog/refs/heads/xfs-mapping-validation.2

Which also fixes up some layering issues we have in that code path.


* Re: [PATCH v3 6/6] xfs: use the latest extent at writeback delalloc conversion time
  2019-01-24  8:52   ` Christoph Hellwig
@ 2019-01-24 14:01     ` Brian Foster
  0 siblings, 0 replies; 13+ messages in thread
From: Brian Foster @ 2019-01-24 14:01 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Thu, Jan 24, 2019 at 12:52:30AM -0800, Christoph Hellwig wrote:
> And this I really don't like.  I much prefer the full version
> I've done here, which has evolved a bit from the work I've been
> posting to the list for months now:
> 

This patch is just an incremental step in that direction to fix the
underlying problem. The functionality with regard to imap race
mitigation in the delalloc conversion path is essentially equivalent.
The difference is you've cleaned up the API further by lifting the
lookup into the caller and providing a more proper conversion
implementation.

> http://git.infradead.org/users/hch/xfs.git/shortlog/refs/heads/xfs-mapping-validation.2
> 
> Which also fixes up some layering issues we have in that code path.

Ok... I've not reviewed the details but as noted above, the approach to
dealing with imap validity looks sane to me. Essentially all I'm asking
for here is isolated patches for fixes vs. refactoring vs. whatever
other cleanups/changes related to the always-cow work, and I'm offering
this (or the v2 patch) as an already-tested way to do that.

If you'd rather just put this all in your series, then please split this
all up into a patch per logical change (i.e., separate patches to
move/rename functions first, introduce the new bmapi delalloc conversion
helper, modify writeback to use said helper, the layering fixups you're
referring to, etc.). At a glance, the commit above looks like it could
easily be split into at least 3 or 4 patches.

Brian


* Re: [PATCH v3 5/6] xfs: create delalloc bmapi wrapper for full extent allocation
  2019-01-24  8:51   ` Christoph Hellwig
@ 2019-01-24 14:03     ` Brian Foster
  0 siblings, 0 replies; 13+ messages in thread
From: Brian Foster @ 2019-01-24 14:03 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Thu, Jan 24, 2019 at 12:51:09AM -0800, Christoph Hellwig wrote:
> This looks fine as a temporary step to me:
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>

It probably doesn't make sense to land this patch without the next. Feel
free to grab it and squash it into your series if it's useful, but
otherwise we can just drop patches 5 and 6 from this series.

Brian


* Re: [PATCH v3 1/6] xfs: eof trim writeback mapping as soon as it is cached
  2019-01-23 18:41 ` [PATCH v3 1/6] xfs: eof trim writeback mapping as soon as it is cached Brian Foster
@ 2019-02-01  7:58   ` Christoph Hellwig
  2019-02-01 17:40     ` Darrick J. Wong
  0 siblings, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2019-02-01  7:58 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Brian Foster, linux-xfs

Darrick,

can we get this queued up for 5.0?



* Re: [PATCH v3 1/6] xfs: eof trim writeback mapping as soon as it is cached
  2019-02-01  7:58   ` Christoph Hellwig
@ 2019-02-01 17:40     ` Darrick J. Wong
  0 siblings, 0 replies; 13+ messages in thread
From: Darrick J. Wong @ 2019-02-01 17:40 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Brian Foster, linux-xfs

On Thu, Jan 31, 2019 at 11:58:29PM -0800, Christoph Hellwig wrote:
> Darrick,
> 
> can we get this queued up for 5.0?

Ok, I'll give it a test run.

--D


