* [PATCH v2 0/5] xfs: properly invalidate cached writeback mapping
@ 2019-01-17 19:19 Brian Foster
  2019-01-17 19:20 ` [PATCH v2 1/5] xfs: eof trim writeback mapping as soon as it is cached Brian Foster
                   ` (4 more replies)
  0 siblings, 5 replies; 27+ messages in thread
From: Brian Foster @ 2019-01-17 19:19 UTC (permalink / raw)
  To: linux-xfs

Hi all,

Here's v2 of the cached writeback mapping invalidation series. Note that
this post is a bit premature as I'm still testing this code. Christoph
pointed out that he has some patches shifting code around in the same
area so I'm posting what I have under test at this point to hopefully
work out any potential conflicts.

The changes in v2 are basically to beef up the invalidation logic a bit
in xfs_map_blocks() via a new (old) xfs_imap_valid() helper and the
introduction of patch 5, which attempts to add similar mapping
validation in xfs_iomap_write_allocate() to replace the old truncate
detection logic.

Thoughts, reviews, flames appreciated.

Brian

fstest: https://marc.info/?l=fstests&m=154756141122233&w=2

v2:
- Refactor validation logic into xfs_imap_valid() helper.
- Revalidate seqno after the lock cycle in xfs_map_blocks().
- Update *seq in xfs_iomap_write_allocate() regardless of fork type.
- Add patch 5 for seqno revalidation on xfs_iomap_write_allocate() lock
  cycles.
v1: https://marc.info/?l=linux-xfs&m=154721212321112&w=2

Brian Foster (5):
  xfs: eof trim writeback mapping as soon as it is cached
  xfs: update fork seq counter on data fork changes
  xfs: validate writeback mapping using data fork seq counter
  xfs: remove superfluous writeback mapping eof trimming
  xfs: revalidate imap properly before writeback delalloc conversion

 fs/xfs/libxfs/xfs_bmap.c       |  11 ----
 fs/xfs/libxfs/xfs_bmap.h       |   1 -
 fs/xfs/libxfs/xfs_iext_tree.c  |  13 ++--
 fs/xfs/libxfs/xfs_inode_fork.h |   2 +-
 fs/xfs/xfs_aops.c              |  71 ++++++++++++++--------
 fs/xfs/xfs_iomap.c             | 105 ++++++++++++++-------------------
 6 files changed, 98 insertions(+), 105 deletions(-)

-- 
2.17.2

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v2 1/5] xfs: eof trim writeback mapping as soon as it is cached
  2019-01-17 19:19 [PATCH v2 0/5] xfs: properly invalidate cached writeback mapping Brian Foster
@ 2019-01-17 19:20 ` Brian Foster
  2019-01-18  5:29   ` Allison Henderson
  2019-01-18 11:47   ` Christoph Hellwig
  2019-01-17 19:20 ` [PATCH v2 2/5] xfs: update fork seq counter on data fork changes Brian Foster
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 27+ messages in thread
From: Brian Foster @ 2019-01-17 19:20 UTC (permalink / raw)
  To: linux-xfs

The cached writeback mapping is EOF trimmed to try to avoid races
between post-eof block management and writeback that result in
sending cached data to a stale location. The cached mapping is
currently trimmed on the validation check, which leaves a race
window between the time the mapping is cached and when it is trimmed
against the current inode size.

For example, if a new mapping is cached by delalloc conversion on a
blocksize == page size fs, we could cycle various locks, perform
memory allocations, etc.  in the writeback codepath before the
associated mapping is eventually trimmed to i_size. This leaves
enough time for a post-eof truncate and file append before the
cached mapping is trimmed. The former event essentially invalidates
a range of the cached mapping and the latter bumps the inode size
such that the trim on the next writepage event won't trim all of the
invalid blocks. fstests generic/464 reproduces this scenario
occasionally and causes a lost writeback and stale delalloc blocks
warning on inode inactivation.

To work around this problem, trim the cached writeback mapping as
soon as it is cached in addition to on subsequent validation checks.
This is a minor tweak to tighten the race window as much as possible
until a proper invalidation mechanism is available.
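
As a rough userspace sketch (hypothetical types and names, not the
kernel implementation), the trim performed by xfs_trim_extent_eof()
amounts to clamping the cached mapping to the current EOF block so
writeback can never target stale post-EOF blocks:

```c
#include <stdint.h>

/* Simplified stand-ins for the xfs_bmbt_irec fields involved. */
struct irec {
	uint64_t br_startoff;	/* mapping start, in fs blocks */
	uint64_t br_blockcount;	/* mapping length, in fs blocks */
};

/*
 * Clamp the mapping to [0, eof_fsb); the kernel derives eof_fsb from
 * i_size_read() at the time of the trim.
 */
static void trim_extent_eof(struct irec *irec, uint64_t eof_fsb)
{
	uint64_t end = irec->br_startoff + irec->br_blockcount;

	if (irec->br_startoff >= eof_fsb) {
		/* mapping is entirely post-EOF: nothing left to write */
		irec->br_blockcount = 0;
		return;
	}
	if (end > eof_fsb)
		irec->br_blockcount = eof_fsb - irec->br_startoff;
}
```

The race described above exists because eof_fsb is sampled when the
trim runs, not when the mapping was cached; trimming immediately at
cache time narrows that window.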

Fixes: 40214d128e07 ("xfs: trim writepage mapping to within eof")
Cc: <stable@vger.kernel.org> # v4.14+
Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_aops.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 338b9d9984e0..d9048bcea49c 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -449,6 +449,7 @@ xfs_map_blocks(
 	}
 
 	wpc->imap = imap;
+	xfs_trim_extent_eof(&wpc->imap, ip);
 	trace_xfs_map_blocks_found(ip, offset, count, wpc->io_type, &imap);
 	return 0;
 allocate_blocks:
@@ -459,6 +460,7 @@ xfs_map_blocks(
 	ASSERT(whichfork == XFS_COW_FORK || cow_fsb == NULLFILEOFF ||
 	       imap.br_startoff + imap.br_blockcount <= cow_fsb);
 	wpc->imap = imap;
+	xfs_trim_extent_eof(&wpc->imap, ip);
 	trace_xfs_map_blocks_alloc(ip, offset, count, wpc->io_type, &imap);
 	return 0;
 }
-- 
2.17.2


* [PATCH v2 2/5] xfs: update fork seq counter on data fork changes
  2019-01-17 19:19 [PATCH v2 0/5] xfs: properly invalidate cached writeback mapping Brian Foster
  2019-01-17 19:20 ` [PATCH v2 1/5] xfs: eof trim writeback mapping as soon as it is cached Brian Foster
@ 2019-01-17 19:20 ` Brian Foster
  2019-01-18  5:30   ` Allison Henderson
  2019-01-17 19:20 ` [PATCH v2 3/5] xfs: validate writeback mapping using data fork seq counter Brian Foster
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 27+ messages in thread
From: Brian Foster @ 2019-01-17 19:20 UTC (permalink / raw)
  To: linux-xfs

The sequence counter in the xfs_ifork structure is only updated on
COW forks. This is because the counter is currently only used to
optimize out repetitive COW fork checks at writeback time.

Tweak the extent code to update the seq counter regardless of the
fork type in preparation for using this counter on data forks as
well.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/libxfs/xfs_iext_tree.c  | 13 ++++++-------
 fs/xfs/libxfs/xfs_inode_fork.h |  2 +-
 2 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_iext_tree.c b/fs/xfs/libxfs/xfs_iext_tree.c
index 771dd072015d..bc690f2409fa 100644
--- a/fs/xfs/libxfs/xfs_iext_tree.c
+++ b/fs/xfs/libxfs/xfs_iext_tree.c
@@ -614,16 +614,15 @@ xfs_iext_realloc_root(
 }
 
 /*
- * Increment the sequence counter if we are on a COW fork.  This allows
- * the writeback code to skip looking for a COW extent if the COW fork
- * hasn't changed.  We use WRITE_ONCE here to ensure the update to the
- * sequence counter is seen before the modifications to the extent
- * tree itself take effect.
+ * Increment the sequence counter on extent tree changes. If we are on a COW
+ * fork, this allows the writeback code to skip looking for a COW extent if the
+ * COW fork hasn't changed. We use WRITE_ONCE here to ensure the update to the
+ * sequence counter is seen before the modifications to the extent tree itself
+ * take effect.
  */
 static inline void xfs_iext_inc_seq(struct xfs_ifork *ifp, int state)
 {
-	if (state & BMAP_COWFORK)
-		WRITE_ONCE(ifp->if_seq, READ_ONCE(ifp->if_seq) + 1);
+	WRITE_ONCE(ifp->if_seq, READ_ONCE(ifp->if_seq) + 1);
 }
 
 void
diff --git a/fs/xfs/libxfs/xfs_inode_fork.h b/fs/xfs/libxfs/xfs_inode_fork.h
index 60361d2d74a1..00c62ce170d0 100644
--- a/fs/xfs/libxfs/xfs_inode_fork.h
+++ b/fs/xfs/libxfs/xfs_inode_fork.h
@@ -14,7 +14,7 @@ struct xfs_dinode;
  */
 struct xfs_ifork {
 	int			if_bytes;	/* bytes in if_u1 */
-	unsigned int		if_seq;		/* cow fork mod counter */
+	unsigned int		if_seq;		/* fork mod counter */
 	struct xfs_btree_block	*if_broot;	/* file's incore btree root */
 	short			if_broot_bytes;	/* bytes allocated for root */
 	unsigned char		if_flags;	/* per-fork flags */
-- 
2.17.2


* [PATCH v2 3/5] xfs: validate writeback mapping using data fork seq counter
  2019-01-17 19:19 [PATCH v2 0/5] xfs: properly invalidate cached writeback mapping Brian Foster
  2019-01-17 19:20 ` [PATCH v2 1/5] xfs: eof trim writeback mapping as soon as it is cached Brian Foster
  2019-01-17 19:20 ` [PATCH v2 2/5] xfs: update fork seq counter on data fork changes Brian Foster
@ 2019-01-17 19:20 ` Brian Foster
  2019-01-18  6:12   ` Allison Henderson
  2019-01-18 11:50   ` Christoph Hellwig
  2019-01-17 19:20 ` [PATCH v2 4/5] xfs: remove superfluous writeback mapping eof trimming Brian Foster
  2019-01-17 19:20 ` [PATCH v2 5/5] xfs: revalidate imap properly before writeback delalloc conversion Brian Foster
  4 siblings, 2 replies; 27+ messages in thread
From: Brian Foster @ 2019-01-17 19:20 UTC (permalink / raw)
  To: linux-xfs

The writeback code caches the current extent mapping across multiple
xfs_do_writepage() calls to avoid repeated lookups for sequential
pages backed by the same extent. This is known to be slightly racy
with extent fork changes in certain difficult to reproduce
scenarios. The cached extent is trimmed to within EOF to help avoid
the most common vector for this problem, post-EOF speculative
preallocation management, but this is a band-aid that does not
address the fundamental problem.

Now that we have an xfs_ifork sequence counter mechanism used to
facilitate COW writeback, we can use the same mechanism to validate
consistency between the data fork and cached writeback mappings. On
its face, this is somewhat of a big hammer approach because any
change to the data fork invalidates any mapping currently cached by
a writeback in progress regardless of whether the data fork change
overlaps with the range under writeback. In practice, however, the
impact of this approach is minimal in most cases.

First, data fork changes (delayed allocations) caused by sustained
sequential buffered writes are amortized across speculative
preallocations. This means that a cached mapping won't be
invalidated by each buffered write of a common file copy workload,
but rather only on less frequent allocation events. Second, the
extent tree is always entirely in-core so an additional lookup of a
usable extent mostly costs a shared ilock cycle and in-memory tree
lookup. This means that a cached mapping reval is relatively cheap
compared to the I/O itself. Third, spurious invalidations don't
impact ioend construction. This means that even if the same extent
is revalidated multiple times across multiple writepage instances,
we still construct and submit the same size ioend (and bio) if the
blocks are physically contiguous.

Update struct xfs_writepage_ctx with a new field to hold the
sequence number of the data fork associated with the currently
cached mapping. Check the wpc seqno against the data fork when the
mapping is validated and reestablish the mapping whenever the fork
has changed since the mapping was cached. This ensures that
writeback always uses a valid extent mapping and thus prevents lost
writebacks and stale delalloc block problems.
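
As a simplified userspace model (hypothetical types; the real logic is
the xfs_imap_valid() helper in the patch below), the revalidation boils
down to a coverage check plus a seqno comparison:

```c
#include <stdbool.h>
#include <stdint.h>

struct ifork_model { unsigned int if_seq; };

struct wpc_model {
	uint64_t	br_startoff;	/* cached mapping start (fs blocks) */
	uint64_t	br_blockcount;	/* cached mapping length */
	unsigned int	data_seq;	/* fork seqno sampled at cache time */
};

/*
 * A cached non-COW mapping is reusable for offset_fsb only if it still
 * covers the offset and the data fork has not changed since the
 * mapping was cached. Any extent tree modification bumps if_seq and so
 * invalidates the cache, even for non-overlapping changes (the "big
 * hammer" trade-off described above).
 */
static bool imap_valid(const struct wpc_model *wpc,
		       const struct ifork_model *df, uint64_t offset_fsb)
{
	if (offset_fsb < wpc->br_startoff ||
	    offset_fsb >= wpc->br_startoff + wpc->br_blockcount)
		return false;
	return wpc->data_seq == df->if_seq;
}
```

The kernel version additionally short-circuits for COW mappings and
cross-checks the COW fork seqno, and uses READ_ONCE() to sample if_seq.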

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_aops.c  | 60 ++++++++++++++++++++++++++++++++++++----------
 fs/xfs/xfs_iomap.c |  5 ++--
 2 files changed, 49 insertions(+), 16 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index d9048bcea49c..649e4ad76add 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -29,6 +29,7 @@
 struct xfs_writepage_ctx {
 	struct xfs_bmbt_irec    imap;
 	unsigned int		io_type;
+	unsigned int		data_seq;
 	unsigned int		cow_seq;
 	struct xfs_ioend	*ioend;
 };
@@ -301,6 +302,42 @@ xfs_end_bio(
 		xfs_destroy_ioend(ioend, blk_status_to_errno(bio->bi_status));
 }
 
+/*
+ * Fast revalidation of the cached writeback mapping. Return true if the current
+ * mapping is valid, false otherwise.
+ */
+static bool
+xfs_imap_valid(
+	struct xfs_writepage_ctx	*wpc,
+	struct xfs_inode		*ip,
+	xfs_fileoff_t			offset_fsb)
+{
+	/*
+	 * If this is a COW mapping, it is sufficient to check that the mapping
+	 * covers the offset. Be careful to check this first because the caller
+	 * can revalidate a COW mapping without updating the data seqno.
+	 */
+	if (offset_fsb < wpc->imap.br_startoff ||
+	    offset_fsb >= wpc->imap.br_startoff + wpc->imap.br_blockcount)
+		return false;
+	if (wpc->io_type == XFS_IO_COW)
+		return true;
+
+	/*
+	 * This is not a COW mapping. Check the sequence number of the data fork
+	 * because concurrent changes could have invalidated the extent. Check
+	 * the COW fork because concurrent changes since the last time we
+	 * checked (and found nothing at this offset) could have added
+	 * overlapping blocks.
+	 */
+	if (wpc->data_seq != READ_ONCE(ip->i_df.if_seq))
+		return false;
+	if (xfs_inode_has_cow_data(ip) &&
+	    wpc->cow_seq != READ_ONCE(ip->i_cowfp->if_seq))
+		return false;
+	return true;
+}
+
 STATIC int
 xfs_map_blocks(
 	struct xfs_writepage_ctx *wpc,
@@ -315,9 +352,11 @@ xfs_map_blocks(
 	struct xfs_bmbt_irec	imap;
 	int			whichfork = XFS_DATA_FORK;
 	struct xfs_iext_cursor	icur;
-	bool			imap_valid;
 	int			error = 0;
 
+	if (XFS_FORCED_SHUTDOWN(mp))
+		return -EIO;
+
 	/*
 	 * We have to make sure the cached mapping is within EOF to protect
 	 * against eofblocks trimming on file release leaving us with a stale
@@ -346,17 +385,9 @@ xfs_map_blocks(
 	 * against concurrent updates and provides a memory barrier on the way
 	 * out that ensures that we always see the current value.
 	 */
-	imap_valid = offset_fsb >= wpc->imap.br_startoff &&
-		     offset_fsb < wpc->imap.br_startoff + wpc->imap.br_blockcount;
-	if (imap_valid &&
-	    (!xfs_inode_has_cow_data(ip) ||
-	     wpc->io_type == XFS_IO_COW ||
-	     wpc->cow_seq == READ_ONCE(ip->i_cowfp->if_seq)))
+	if (xfs_imap_valid(wpc, ip, offset_fsb))
 		return 0;
 
-	if (XFS_FORCED_SHUTDOWN(mp))
-		return -EIO;
-
 	/*
 	 * If we don't have a valid map, now it's time to get a new one for this
 	 * offset.  This will convert delayed allocations (including COW ones)
@@ -403,9 +434,10 @@ xfs_map_blocks(
 	}
 
 	/*
-	 * Map valid and no COW extent in the way?  We're done.
+	 * No COW extent overlap. Revalidate now that we may have updated
+	 * ->cow_seq. If the data mapping is still valid, we're done.
 	 */
-	if (imap_valid) {
+	if (xfs_imap_valid(wpc, ip, offset_fsb)) {
 		xfs_iunlock(ip, XFS_ILOCK_SHARED);
 		return 0;
 	}
@@ -417,6 +449,7 @@ xfs_map_blocks(
 	 */
 	if (!xfs_iext_lookup_extent(ip, &ip->i_df, offset_fsb, &icur, &imap))
 		imap.br_startoff = end_fsb;	/* fake a hole past EOF */
+	wpc->data_seq = READ_ONCE(ip->i_df.if_seq);
 	xfs_iunlock(ip, XFS_ILOCK_SHARED);
 
 	if (imap.br_startoff > offset_fsb) {
@@ -454,7 +487,8 @@ xfs_map_blocks(
 	return 0;
 allocate_blocks:
 	error = xfs_iomap_write_allocate(ip, whichfork, offset, &imap,
-			&wpc->cow_seq);
+			whichfork == XFS_COW_FORK ?
+					 &wpc->cow_seq : &wpc->data_seq);
 	if (error)
 		return error;
 	ASSERT(whichfork == XFS_COW_FORK || cow_fsb == NULLFILEOFF ||
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 27c93b5f029d..ab69caa685b4 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -681,7 +681,7 @@ xfs_iomap_write_allocate(
 	int		whichfork,
 	xfs_off_t	offset,
 	xfs_bmbt_irec_t *imap,
-	unsigned int	*cow_seq)
+	unsigned int	*seq)
 {
 	xfs_mount_t	*mp = ip->i_mount;
 	struct xfs_ifork *ifp = XFS_IFORK_PTR(ip, whichfork);
@@ -797,8 +797,7 @@ xfs_iomap_write_allocate(
 			if (error)
 				goto error0;
 
-			if (whichfork == XFS_COW_FORK)
-				*cow_seq = READ_ONCE(ifp->if_seq);
+			*seq = READ_ONCE(ifp->if_seq);
 			xfs_iunlock(ip, XFS_ILOCK_EXCL);
 		}
 
-- 
2.17.2


* [PATCH v2 4/5] xfs: remove superfluous writeback mapping eof trimming
  2019-01-17 19:19 [PATCH v2 0/5] xfs: properly invalidate cached writeback mapping Brian Foster
                   ` (2 preceding siblings ...)
  2019-01-17 19:20 ` [PATCH v2 3/5] xfs: validate writeback mapping using data fork seq counter Brian Foster
@ 2019-01-17 19:20 ` Brian Foster
  2019-01-18  6:48   ` Allison Henderson
  2019-01-18 11:50   ` Christoph Hellwig
  2019-01-17 19:20 ` [PATCH v2 5/5] xfs: revalidate imap properly before writeback delalloc conversion Brian Foster
  4 siblings, 2 replies; 27+ messages in thread
From: Brian Foster @ 2019-01-17 19:20 UTC (permalink / raw)
  To: linux-xfs

Now that the cached writeback mapping is explicitly invalidated on
data fork changes, the EOF trimming band-aid is no longer necessary.
Remove xfs_trim_extent_eof() as well since it has no other users.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_bmap.c | 11 -----------
 fs/xfs/libxfs/xfs_bmap.h |  1 -
 fs/xfs/xfs_aops.c        | 15 ---------------
 3 files changed, 27 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 332eefa2700b..4c73927819c2 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3685,17 +3685,6 @@ xfs_trim_extent(
 	}
 }
 
-/* trim extent to within eof */
-void
-xfs_trim_extent_eof(
-	struct xfs_bmbt_irec	*irec,
-	struct xfs_inode	*ip)
-
-{
-	xfs_trim_extent(irec, 0, XFS_B_TO_FSB(ip->i_mount,
-					      i_size_read(VFS_I(ip))));
-}
-
 /*
  * Trim the returned map to the required bounds
  */
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index 09d3ea97cc15..b4ff710d7250 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -181,7 +181,6 @@ static inline bool xfs_bmap_is_real_extent(struct xfs_bmbt_irec *irec)
 
 void	xfs_trim_extent(struct xfs_bmbt_irec *irec, xfs_fileoff_t bno,
 		xfs_filblks_t len);
-void	xfs_trim_extent_eof(struct xfs_bmbt_irec *, struct xfs_inode *);
 int	xfs_bmap_add_attrfork(struct xfs_inode *ip, int size, int rsvd);
 int	xfs_bmap_set_attrforkoff(struct xfs_inode *ip, int size, int *version);
 void	xfs_bmap_local_to_extents_empty(struct xfs_inode *ip, int whichfork);
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 649e4ad76add..09d8f7690e9e 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -357,19 +357,6 @@ xfs_map_blocks(
 	if (XFS_FORCED_SHUTDOWN(mp))
 		return -EIO;
 
-	/*
-	 * We have to make sure the cached mapping is within EOF to protect
-	 * against eofblocks trimming on file release leaving us with a stale
-	 * mapping. Otherwise, a page for a subsequent file extending buffered
-	 * write could get picked up by this writeback cycle and written to the
-	 * wrong blocks.
-	 *
-	 * Note that what we really want here is a generic mapping invalidation
-	 * mechanism to protect us from arbitrary extent modifying contexts, not
-	 * just eofblocks.
-	 */
-	xfs_trim_extent_eof(&wpc->imap, ip);
-
 	/*
 	 * COW fork blocks can overlap data fork blocks even if the blocks
 	 * aren't shared.  COW I/O always takes precedent, so we must always
@@ -482,7 +469,6 @@ xfs_map_blocks(
 	}
 
 	wpc->imap = imap;
-	xfs_trim_extent_eof(&wpc->imap, ip);
 	trace_xfs_map_blocks_found(ip, offset, count, wpc->io_type, &imap);
 	return 0;
 allocate_blocks:
@@ -494,7 +480,6 @@ xfs_map_blocks(
 	ASSERT(whichfork == XFS_COW_FORK || cow_fsb == NULLFILEOFF ||
 	       imap.br_startoff + imap.br_blockcount <= cow_fsb);
 	wpc->imap = imap;
-	xfs_trim_extent_eof(&wpc->imap, ip);
 	trace_xfs_map_blocks_alloc(ip, offset, count, wpc->io_type, &imap);
 	return 0;
 }
-- 
2.17.2


* [PATCH v2 5/5] xfs: revalidate imap properly before writeback delalloc conversion
  2019-01-17 19:19 [PATCH v2 0/5] xfs: properly invalidate cached writeback mapping Brian Foster
                   ` (3 preceding siblings ...)
  2019-01-17 19:20 ` [PATCH v2 4/5] xfs: remove superfluous writeback mapping eof trimming Brian Foster
@ 2019-01-17 19:20 ` Brian Foster
  2019-01-18  6:58   ` Allison Henderson
  2019-01-18 11:59   ` Christoph Hellwig
  4 siblings, 2 replies; 27+ messages in thread
From: Brian Foster @ 2019-01-17 19:20 UTC (permalink / raw)
  To: linux-xfs

The writeback delalloc conversion code has a couple of historical hacks
to help deal with racing with other operations on the file under
writeback. For example, xfs_iomap_write_allocate() checks i_size
once under ilock to determine whether a truncate may have
invalidated some portion of the range we're about to map. This helps
prevent the bmapi call from doing physical allocation where only
delalloc conversion was expected (which is a transaction reservation
overrun vector). XFS_BMAPI_DELALLOC similarly helps protect us from
this problem as it ensures that the bmapi call will only convert
existing blocks.

Neither of these actually ensures that we pass a valid range to
convert to xfs_bmapi_write(). The bmapi flag turns the
aforementioned physical allocation over a hole problem into an
explicit error, which means that any particular instances of that
problem change from being a transaction overrun error to a writeback
error. The latter means page discard, which is an ominous error even
if due to the underlying block(s) being punched out.

Now that we have a mechanism in place to track inode fork changes,
use it in xfs_iomap_write_allocate() to properly handle potential
changes to the cached writeback mapping and establish a valid range
to map. Check the already provided fork seqno once under ilock. If
it has changed, look up the extent again in the associated fork and
trim the caller's mapping to the latest version. We can expect to
still find the blocks backing the current file offset because the
page is held locked during writeback and page cache truncation
acquires the page lock and waits on writeback to complete before it
can toss the page. This ensures that writeback never attempts to map
a range of a file that has been punched out and never submits I/O to
blocks that are no longer mapped to the file.
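
As a hypothetical model of the seqno-triggered retrim (the patch does
this under ilock with xfs_iext_lookup_extent() and xfs_trim_extent()):
when the fork has changed, clamp the caller's mapping to the extent
currently backing the range, and treat a complete loss of overlap as
the corruption case:

```c
#include <stdbool.h>
#include <stdint.h>

struct ext { uint64_t start, count; };	/* fs-block range */

/*
 * Trim *imap to its overlap with the extent currently in the fork.
 * Returns false when nothing overlaps, corresponding to the
 * ASSERT/-EFSCORRUPTED path in the patch: the locked page guarantees
 * the blocks at the current offset cannot have vanished.
 */
static bool retrim_imap(struct ext *imap, const struct ext *cur)
{
	uint64_t iend = imap->start + imap->count;
	uint64_t cend = cur->start + cur->count;

	if (cur->start >= iend || cend <= imap->start)
		return false;
	if (imap->start < cur->start) {
		imap->count -= cur->start - imap->start;
		imap->start = cur->start;
	}
	if (imap->start + imap->count > cend)
		imap->count = cend - imap->start;
	return true;
}
```

If the external change did not touch the mapping (or grew it), the
trim is a no-op; if it shrank the mapping, the subsequent bmapi call
converts only blocks known to still exist.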

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_iomap.c | 102 +++++++++++++++++++--------------------------
 1 file changed, 44 insertions(+), 58 deletions(-)

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index ab69caa685b4..80da465ebf7a 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -677,22 +677,24 @@ xfs_file_iomap_begin_delay(
  */
 int
 xfs_iomap_write_allocate(
-	xfs_inode_t	*ip,
-	int		whichfork,
-	xfs_off_t	offset,
-	xfs_bmbt_irec_t *imap,
-	unsigned int	*seq)
+	struct xfs_inode	*ip,
+	int			whichfork,
+	xfs_off_t		offset,
+	struct xfs_bmbt_irec	*imap,
+	unsigned int		*seq)
 {
-	xfs_mount_t	*mp = ip->i_mount;
-	struct xfs_ifork *ifp = XFS_IFORK_PTR(ip, whichfork);
-	xfs_fileoff_t	offset_fsb, last_block;
-	xfs_fileoff_t	end_fsb, map_start_fsb;
-	xfs_filblks_t	count_fsb;
-	xfs_trans_t	*tp;
-	int		nimaps;
-	int		error = 0;
-	int		flags = XFS_BMAPI_DELALLOC;
-	int		nres;
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, whichfork);
+	struct xfs_trans	*tp;
+	struct xfs_iext_cursor	icur;
+	struct xfs_bmbt_irec	timap;
+	xfs_fileoff_t		offset_fsb;
+	xfs_fileoff_t		map_start_fsb;
+	xfs_filblks_t		count_fsb;
+	int			nimaps;
+	int			error = 0;
+	int			flags = XFS_BMAPI_DELALLOC;
+	int			nres;
 
 	if (whichfork == XFS_COW_FORK)
 		flags |= XFS_BMAPI_COWFORK | XFS_BMAPI_PREALLOC;
@@ -737,56 +739,40 @@ xfs_iomap_write_allocate(
 			xfs_trans_ijoin(tp, ip, 0);
 
 			/*
-			 * it is possible that the extents have changed since
-			 * we did the read call as we dropped the ilock for a
-			 * while. We have to be careful about truncates or hole
-			 * punchs here - we are not allowed to allocate
-			 * non-delalloc blocks here.
-			 *
-			 * The only protection against truncation is the pages
-			 * for the range we are being asked to convert are
-			 * locked and hence a truncate will block on them
-			 * first.
-			 *
-			 * As a result, if we go beyond the range we really
-			 * need and hit an delalloc extent boundary followed by
-			 * a hole while we have excess blocks in the map, we
-			 * will fill the hole incorrectly and overrun the
-			 * transaction reservation.
+			 * Now that we have ILOCK we must account for the fact
+			 * that the fork (and thus our mapping) could have
+			 * changed while the inode was unlocked. If the fork
+			 * has changed, trim the caller's mapping to the
+			 * current extent in the fork.
 			 *
-			 * Using a single map prevents this as we are forced to
-			 * check each map we look for overlap with the desired
-			 * range and abort as soon as we find it. Also, given
-			 * that we only return a single map, having one beyond
-			 * what we can return is probably a bit silly.
+			 * If the external change did not modify the current
+			 * mapping (or just grew it) this will have no effect.
+			 * If the current mapping shrunk, we expect to at
+			 * minimum still have blocks backing the current page as
+			 * the page has remained locked since writeback first
+			 * located delalloc block(s) at the page offset. A
+			 * racing truncate, hole punch or even reflink must wait
+			 * on page writeback before it can modify our page and
+			 * underlying block(s).
 			 *
-			 * We also need to check that we don't go beyond EOF;
-			 * this is a truncate optimisation as a truncate sets
-			 * the new file size before block on the pages we
-			 * currently have locked under writeback. Because they
-			 * are about to be tossed, we don't need to write them
-			 * back....
+			 * We'll update *seq before we drop ilock for the next
+			 * iteration.
 			 */
-			nimaps = 1;
-			end_fsb = XFS_B_TO_FSB(mp, XFS_ISIZE(ip));
-			error = xfs_bmap_last_offset(ip, &last_block,
-							XFS_DATA_FORK);
-			if (error)
-				goto trans_cancel;
-
-			last_block = XFS_FILEOFF_MAX(last_block, end_fsb);
-			if ((map_start_fsb + count_fsb) > last_block) {
-				count_fsb = last_block - map_start_fsb;
-				if (count_fsb == 0) {
-					error = -EAGAIN;
+			if (*seq != READ_ONCE(ifp->if_seq)) {
+				if (!xfs_iext_lookup_extent(ip, ifp, offset_fsb,
+							    &icur, &timap) ||
+				    timap.br_startoff > offset_fsb) {
+					ASSERT(0);
+					error = -EFSCORRUPTED;
 					goto trans_cancel;
 				}
+				xfs_trim_extent(imap, timap.br_startoff,
+						timap.br_blockcount);
+				count_fsb = imap->br_blockcount;
+				map_start_fsb = imap->br_startoff;
 			}
 
-			/*
-			 * From this point onwards we overwrite the imap
-			 * pointer that the caller gave to us.
-			 */
+			nimaps = 1;
 			error = xfs_bmapi_write(tp, ip, map_start_fsb,
 						count_fsb, flags, nres, imap,
 						&nimaps);
-- 
2.17.2


* Re: [PATCH v2 1/5] xfs: eof trim writeback mapping as soon as it is cached
  2019-01-17 19:20 ` [PATCH v2 1/5] xfs: eof trim writeback mapping as soon as it is cached Brian Foster
@ 2019-01-18  5:29   ` Allison Henderson
  2019-01-18 11:47   ` Christoph Hellwig
  1 sibling, 0 replies; 27+ messages in thread
From: Allison Henderson @ 2019-01-18  5:29 UTC (permalink / raw)
  To: Brian Foster, linux-xfs

On 1/17/19 12:20 PM, Brian Foster wrote:
> The cached writeback mapping is EOF trimmed to try and avoid races
> between post-eof block management and writeback that result in
> sending cached data to a stale location. The cached mapping is
> currently trimmed on the validation check, which leaves a race
> window between the time the mapping is cached and when it is trimmed
> against the current inode size.
> 
> For example, if a new mapping is cached by delalloc conversion on a
> blocksize == page size fs, we could cycle various locks, perform
> memory allocations, etc.  in the writeback codepath before the
> associated mapping is eventually trimmed to i_size. This leaves
> enough time for a post-eof truncate and file append before the
> cached mapping is trimmed. The former event essentially invalidates
> a range of the cached mapping and the latter bumps the inode size
> such that the trim on the next writepage event won't trim all of the
> invalid blocks. fstests generic/464 reproduces this scenario
> occasionally and causes a lost writeback and stale delalloc blocks
> warning on inode inactivation.
> 
> To work around this problem, trim the cached writeback mapping as
> soon as it is cached in addition to on subsequent validation checks.
> This is a minor tweak to tighten the race window as much as possible
> until a proper invalidation mechanism is available.
> 
> Fixes: 40214d128e07 ("xfs: trim writepage mapping to within eof")
> Cc: <stable@vger.kernel.org> # v4.14+
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>   fs/xfs/xfs_aops.c | 2 ++
>   1 file changed, 2 insertions(+)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 338b9d9984e0..d9048bcea49c 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -449,6 +449,7 @@ xfs_map_blocks(
>   	}
>   
>   	wpc->imap = imap;
> +	xfs_trim_extent_eof(&wpc->imap, ip);
>   	trace_xfs_map_blocks_found(ip, offset, count, wpc->io_type, &imap);
>   	return 0;
>   allocate_blocks:
> @@ -459,6 +460,7 @@ xfs_map_blocks(
>   	ASSERT(whichfork == XFS_COW_FORK || cow_fsb == NULLFILEOFF ||
>   	       imap.br_startoff + imap.br_blockcount <= cow_fsb);
>   	wpc->imap = imap;
> +	xfs_trim_extent_eof(&wpc->imap, ip);
>   	trace_xfs_map_blocks_alloc(ip, offset, count, wpc->io_type, &imap);
>   	return 0;
>   }
> 

Looks ok, you can add my review:
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>


* Re: [PATCH v2 2/5] xfs: update fork seq counter on data fork changes
  2019-01-17 19:20 ` [PATCH v2 2/5] xfs: update fork seq counter on data fork changes Brian Foster
@ 2019-01-18  5:30   ` Allison Henderson
  0 siblings, 0 replies; 27+ messages in thread
From: Allison Henderson @ 2019-01-18  5:30 UTC (permalink / raw)
  To: Brian Foster, linux-xfs


On 1/17/19 12:20 PM, Brian Foster wrote:
> The sequence counter in the xfs_ifork structure is only updated on
> COW forks. This is because the counter is currently only used to
> optimize out repetitive COW fork checks at writeback time.
> 
> Tweak the extent code to update the seq counter regardless of the
> fork type in preparation for using this counter on data forks as
> well.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
>   fs/xfs/libxfs/xfs_iext_tree.c  | 13 ++++++-------
>   fs/xfs/libxfs/xfs_inode_fork.h |  2 +-
>   2 files changed, 7 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_iext_tree.c b/fs/xfs/libxfs/xfs_iext_tree.c
> index 771dd072015d..bc690f2409fa 100644
> --- a/fs/xfs/libxfs/xfs_iext_tree.c
> +++ b/fs/xfs/libxfs/xfs_iext_tree.c
> @@ -614,16 +614,15 @@ xfs_iext_realloc_root(
>   }
>   
>   /*
> - * Increment the sequence counter if we are on a COW fork.  This allows
> - * the writeback code to skip looking for a COW extent if the COW fork
> - * hasn't changed.  We use WRITE_ONCE here to ensure the update to the
> - * sequence counter is seen before the modifications to the extent
> - * tree itself take effect.
> + * Increment the sequence counter on extent tree changes. If we are on a COW
> + * fork, this allows the writeback code to skip looking for a COW extent if the
> + * COW fork hasn't changed. We use WRITE_ONCE here to ensure the update to the
> + * sequence counter is seen before the modifications to the extent tree itself
> + * take effect.
>    */
>   static inline void xfs_iext_inc_seq(struct xfs_ifork *ifp, int state)
>   {
> -	if (state & BMAP_COWFORK)
> -		WRITE_ONCE(ifp->if_seq, READ_ONCE(ifp->if_seq) + 1);
> +	WRITE_ONCE(ifp->if_seq, READ_ONCE(ifp->if_seq) + 1);
>   }
>   
>   void
> diff --git a/fs/xfs/libxfs/xfs_inode_fork.h b/fs/xfs/libxfs/xfs_inode_fork.h
> index 60361d2d74a1..00c62ce170d0 100644
> --- a/fs/xfs/libxfs/xfs_inode_fork.h
> +++ b/fs/xfs/libxfs/xfs_inode_fork.h
> @@ -14,7 +14,7 @@ struct xfs_dinode;
>    */
>   struct xfs_ifork {
>   	int			if_bytes;	/* bytes in if_u1 */
> -	unsigned int		if_seq;		/* cow fork mod counter */
> +	unsigned int		if_seq;		/* fork mod counter */
>   	struct xfs_btree_block	*if_broot;	/* file's incore btree root */
>   	short			if_broot_bytes;	/* bytes allocated for root */
>   	unsigned char		if_flags;	/* per-fork flags */
> 

This one looks fine too:

Reviewed-by: Allison Henderson <allison.henderson@oracle.com>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 3/5] xfs: validate writeback mapping using data fork seq counter
  2019-01-17 19:20 ` [PATCH v2 3/5] xfs: validate writeback mapping using data fork seq counter Brian Foster
@ 2019-01-18  6:12   ` Allison Henderson
  2019-01-18 11:50   ` Christoph Hellwig
  1 sibling, 0 replies; 27+ messages in thread
From: Allison Henderson @ 2019-01-18  6:12 UTC (permalink / raw)
  To: Brian Foster, linux-xfs


On 1/17/19 12:20 PM, Brian Foster wrote:
> The writeback code caches the current extent mapping across multiple
> xfs_do_writepage() calls to avoid repeated lookups for sequential
> pages backed by the same extent. This is known to be slightly racy
> with extent fork changes in certain difficult to reproduce
> scenarios. The cached extent is trimmed to within EOF to help avoid
> the most common vector for this problem via speculative
> preallocation management, but this is a band-aid that does not
> address the fundamental problem.
> 
> Now that we have an xfs_ifork sequence counter mechanism used to
> facilitate COW writeback, we can use the same mechanism to validate
> consistency between the data fork and cached writeback mappings. On
> its face, this is somewhat of a big hammer approach because any
> change to the data fork invalidates any mapping currently cached by
> a writeback in progress regardless of whether the data fork change
> overlaps with the range under writeback. In practice, however, the
> impact of this approach is minimal in most cases.
> 
> First, data fork changes (delayed allocations) caused by sustained
> sequential buffered writes are amortized across speculative
> preallocations. This means that a cached mapping won't be
> invalidated by each buffered write of a common file copy workload,
> but rather only on less frequent allocation events. Second, the
> extent tree is always entirely in-core so an additional lookup of a
> usable extent mostly costs a shared ilock cycle and in-memory tree
> lookup. This means that a cached mapping reval is relatively cheap
> compared to the I/O itself. Third, spurious invalidations don't
> impact ioend construction. This means that even if the same extent
> is revalidated multiple times across multiple writepage instances,
> we still construct and submit the same size ioend (and bio) if the
> blocks are physically contiguous.
> 
> Update struct xfs_writepage_ctx with a new field to hold the
> sequence number of the data fork associated with the currently
> cached mapping. Check the wpc seqno against the data fork when the
> mapping is validated and reestablish the mapping whenever the fork
> has changed since the mapping was cached. This ensures that
> writeback always uses a valid extent mapping and thus prevents lost
> writebacks and stale delalloc block problems.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>   fs/xfs/xfs_aops.c  | 60 ++++++++++++++++++++++++++++++++++++----------
>   fs/xfs/xfs_iomap.c |  5 ++--
>   2 files changed, 49 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index d9048bcea49c..649e4ad76add 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -29,6 +29,7 @@
>   struct xfs_writepage_ctx {
>   	struct xfs_bmbt_irec    imap;
>   	unsigned int		io_type;
> +	unsigned int		data_seq;
>   	unsigned int		cow_seq;
>   	struct xfs_ioend	*ioend;
>   };
> @@ -301,6 +302,42 @@ xfs_end_bio(
>   		xfs_destroy_ioend(ioend, blk_status_to_errno(bio->bi_status));
>   }
>   
> +/*
> + * Fast revalidation of the cached writeback mapping. Return true if the current
> + * mapping is valid, false otherwise.
> + */
> +static bool
> +xfs_imap_valid(
> +	struct xfs_writepage_ctx	*wpc,
> +	struct xfs_inode		*ip,
> +	xfs_fileoff_t			offset_fsb)
> +{
> +	/*
> +	 * If this is a COW mapping, it is sufficient to check that the mapping
> +	 * covers the offset. Be careful to check this first because the caller
> +	 * can revalidate a COW mapping without updating the data seqno.
> +	 */
> +	if (offset_fsb < wpc->imap.br_startoff ||
> +	    offset_fsb >= wpc->imap.br_startoff + wpc->imap.br_blockcount)
> +		return false;
> +	if (wpc->io_type == XFS_IO_COW)
> +		return true;
> +
> +	/*
> +	 * This is not a COW mapping. Check the sequence number of the data fork
> +	 * because concurrent changes could have invalidated the extent. Check
> +	 * the COW fork because concurrent changes since the last time we
> +	 * checked (and found nothing at this offset) could have added
> +	 * overlapping blocks.
> +	 */
> +	if (wpc->data_seq != READ_ONCE(ip->i_df.if_seq))
> +		return false;
> +	if (xfs_inode_has_cow_data(ip) &&
> +	    wpc->cow_seq != READ_ONCE(ip->i_cowfp->if_seq))
> +		return false;
> +	return true;
> +}
> +
>   STATIC int
>   xfs_map_blocks(
>   	struct xfs_writepage_ctx *wpc,
> @@ -315,9 +352,11 @@ xfs_map_blocks(
>   	struct xfs_bmbt_irec	imap;
>   	int			whichfork = XFS_DATA_FORK;
>   	struct xfs_iext_cursor	icur;
> -	bool			imap_valid;
>   	int			error = 0;
>   
> +	if (XFS_FORCED_SHUTDOWN(mp))
> +		return -EIO;
> +
>   	/*
>   	 * We have to make sure the cached mapping is within EOF to protect
>   	 * against eofblocks trimming on file release leaving us with a stale
> @@ -346,17 +385,9 @@ xfs_map_blocks(
>   	 * against concurrent updates and provides a memory barrier on the way
>   	 * out that ensures that we always see the current value.
>   	 */
> -	imap_valid = offset_fsb >= wpc->imap.br_startoff &&
> -		     offset_fsb < wpc->imap.br_startoff + wpc->imap.br_blockcount;
> -	if (imap_valid &&
> -	    (!xfs_inode_has_cow_data(ip) ||
> -	     wpc->io_type == XFS_IO_COW ||
> -	     wpc->cow_seq == READ_ONCE(ip->i_cowfp->if_seq)))
> +	if (xfs_imap_valid(wpc, ip, offset_fsb))
>   		return 0;
>   
> -	if (XFS_FORCED_SHUTDOWN(mp))
> -		return -EIO;
> -
>   	/*
>   	 * If we don't have a valid map, now it's time to get a new one for this
>   	 * offset.  This will convert delayed allocations (including COW ones)
> @@ -403,9 +434,10 @@ xfs_map_blocks(
>   	}
>   
>   	/*
> -	 * Map valid and no COW extent in the way?  We're done.
> +	 * No COW extent overlap. Revalidate now that we may have updated
> +	 * ->cow_seq. If the data mapping is still valid, we're done.
>   	 */
> -	if (imap_valid) {
> +	if (xfs_imap_valid(wpc, ip, offset_fsb)) {
>   		xfs_iunlock(ip, XFS_ILOCK_SHARED);
>   		return 0;
>   	}
> @@ -417,6 +449,7 @@ xfs_map_blocks(
>   	 */
>   	if (!xfs_iext_lookup_extent(ip, &ip->i_df, offset_fsb, &icur, &imap))
>   		imap.br_startoff = end_fsb;	/* fake a hole past EOF */
> +	wpc->data_seq = READ_ONCE(ip->i_df.if_seq);
>   	xfs_iunlock(ip, XFS_ILOCK_SHARED);
>   
>   	if (imap.br_startoff > offset_fsb) {
> @@ -454,7 +487,8 @@ xfs_map_blocks(
>   	return 0;
>   allocate_blocks:
>   	error = xfs_iomap_write_allocate(ip, whichfork, offset, &imap,
> -			&wpc->cow_seq);
> +			whichfork == XFS_COW_FORK ?
> +					 &wpc->cow_seq : &wpc->data_seq);
>   	if (error)
>   		return error;
>   	ASSERT(whichfork == XFS_COW_FORK || cow_fsb == NULLFILEOFF ||
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 27c93b5f029d..ab69caa685b4 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -681,7 +681,7 @@ xfs_iomap_write_allocate(
>   	int		whichfork,
>   	xfs_off_t	offset,
>   	xfs_bmbt_irec_t *imap,
> -	unsigned int	*cow_seq)
> +	unsigned int	*seq)
>   {
>   	xfs_mount_t	*mp = ip->i_mount;
>   	struct xfs_ifork *ifp = XFS_IFORK_PTR(ip, whichfork);
> @@ -797,8 +797,7 @@ xfs_iomap_write_allocate(
>   			if (error)
>   				goto error0;
>   
> -			if (whichfork == XFS_COW_FORK)
> -				*cow_seq = READ_ONCE(ifp->if_seq);
> +			*seq = READ_ONCE(ifp->if_seq);
>   			xfs_iunlock(ip, XFS_ILOCK_EXCL);
>   		}
>   
> 

Looks good, I think the new version is a lot cleaner with the helper 
routine.  Thx!

Reviewed-by: Allison Henderson <allison.henderson@oracle.com>


* Re: [PATCH v2 4/5] xfs: remove superfluous writeback mapping eof trimming
  2019-01-17 19:20 ` [PATCH v2 4/5] xfs: remove superfluous writeback mapping eof trimming Brian Foster
@ 2019-01-18  6:48   ` Allison Henderson
  2019-01-18 11:25     ` Brian Foster
  2019-01-18 11:50   ` Christoph Hellwig
  1 sibling, 1 reply; 27+ messages in thread
From: Allison Henderson @ 2019-01-18  6:48 UTC (permalink / raw)
  To: Brian Foster, linux-xfs



On 1/17/19 12:20 PM, Brian Foster wrote:
> Now that the cached writeback mapping is explicitly invalidated on
> data fork changes, the EOF trimming band-aid is no longer necessary.
> Remove xfs_trim_extent_eof() as well since it has no other users.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>   fs/xfs/libxfs/xfs_bmap.c | 11 -----------
>   fs/xfs/libxfs/xfs_bmap.h |  1 -
>   fs/xfs/xfs_aops.c        | 15 ---------------
>   3 files changed, 27 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index 332eefa2700b..4c73927819c2 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -3685,17 +3685,6 @@ xfs_trim_extent(
>   	}
>   }
>   
> -/* trim extent to within eof */
> -void
> -xfs_trim_extent_eof(
> -	struct xfs_bmbt_irec	*irec,
> -	struct xfs_inode	*ip)
> -
> -{
> -	xfs_trim_extent(irec, 0, XFS_B_TO_FSB(ip->i_mount,
> -					      i_size_read(VFS_I(ip))));
> -}
> -
>   /*
>    * Trim the returned map to the required bounds
>    */
> diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
> index 09d3ea97cc15..b4ff710d7250 100644
> --- a/fs/xfs/libxfs/xfs_bmap.h
> +++ b/fs/xfs/libxfs/xfs_bmap.h
> @@ -181,7 +181,6 @@ static inline bool xfs_bmap_is_real_extent(struct xfs_bmbt_irec *irec)
>   
>   void	xfs_trim_extent(struct xfs_bmbt_irec *irec, xfs_fileoff_t bno,
>   		xfs_filblks_t len);
> -void	xfs_trim_extent_eof(struct xfs_bmbt_irec *, struct xfs_inode *);
>   int	xfs_bmap_add_attrfork(struct xfs_inode *ip, int size, int rsvd);
>   int	xfs_bmap_set_attrforkoff(struct xfs_inode *ip, int size, int *version);
>   void	xfs_bmap_local_to_extents_empty(struct xfs_inode *ip, int whichfork);
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 649e4ad76add..09d8f7690e9e 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -357,19 +357,6 @@ xfs_map_blocks(
>   	if (XFS_FORCED_SHUTDOWN(mp))
>   		return -EIO;
>   
> -	/*
> -	 * We have to make sure the cached mapping is within EOF to protect
> -	 * against eofblocks trimming on file release leaving us with a stale
> -	 * mapping. Otherwise, a page for a subsequent file extending buffered
> -	 * write could get picked up by this writeback cycle and written to the
> -	 * wrong blocks.
> -	 *
> -	 * Note that what we really want here is a generic mapping invalidation
> -	 * mechanism to protect us from arbitrary extent modifying contexts, not
> -	 * just eofblocks.
> -	 */
> -	xfs_trim_extent_eof(&wpc->imap, ip);
> -
>   	/*
>   	 * COW fork blocks can overlap data fork blocks even if the blocks
>   	 * aren't shared.  COW I/O always takes precedent, so we must always
> @@ -482,7 +469,6 @@ xfs_map_blocks(
>   	}
>   
>   	wpc->imap = imap;
> -	xfs_trim_extent_eof(&wpc->imap, ip);
>   	trace_xfs_map_blocks_found(ip, offset, count, wpc->io_type, &imap);
>   	return 0;
>   allocate_blocks:
> @@ -494,7 +480,6 @@ xfs_map_blocks(
>   	ASSERT(whichfork == XFS_COW_FORK || cow_fsb == NULLFILEOFF ||
>   	       imap.br_startoff + imap.br_blockcount <= cow_fsb);
>   	wpc->imap = imap;
> -	xfs_trim_extent_eof(&wpc->imap, ip);
>   	trace_xfs_map_blocks_alloc(ip, offset, count, wpc->io_type, &imap);
>   	return 0;
>   }
> 

I'll skip this one; from the last review, it sounds like we decided not to
keep it.  Thanks though!

Allison


* Re: [PATCH v2 5/5] xfs: revalidate imap properly before writeback delalloc conversion
  2019-01-17 19:20 ` [PATCH v2 5/5] xfs: revalidate imap properly before writeback delalloc conversion Brian Foster
@ 2019-01-18  6:58   ` Allison Henderson
  2019-01-18 11:59   ` Christoph Hellwig
  1 sibling, 0 replies; 27+ messages in thread
From: Allison Henderson @ 2019-01-18  6:58 UTC (permalink / raw)
  To: Brian Foster, linux-xfs



On 1/17/19 12:20 PM, Brian Foster wrote:
> The writeback delalloc conversion code has a couple historical hacks
> to help deal with racing with other operations on the file under
> writeback. For example, xfs_iomap_write_allocate() checks i_size
> once under ilock to determine whether a truncate may have
> invalidated some portion of the range we're about to map. This helps
> prevent the bmapi call from doing physical allocation where only
> delalloc conversion was expected (which is a transaction reservation
> overrun vector). XFS_BMAPI_DELALLOC helps similarly protect us from
> this problem as it ensures that the bmapi call will only convert
> existing blocks.
> 
> Neither of these actually ensure we actually pass a valid range to
> convert to xfs_bmapi_write(). The bmapi flag turns the
> aforementioned physical allocation over a hole problem into an
> explicit error, which means that any particular instances of that
> problem change from being a transaction overrun error to a writeback
> error. The latter means page discard, which is an ominous error even
> if due to the underlying block(s) being punched out.
> 
> Now that we have a mechanism in place to track inode fork changes,
> use it in xfs_iomap_write_allocate() to properly handle potential
> changes to the cached writeback mapping and establish a valid range
> to map. Check the already provided fork seqno once under ilock. If
> it has changed, look up the extent again in the associated fork and
> trim the caller's mapping to the latest version. We can expect to
> still find the blocks backing the current file offset because the
> page is held locked during writeback and page cache truncation
> acquires the page lock and waits on writeback to complete before it
> can toss the page. This ensures that writeback never attempts to map
> a range of a file that has been punched out and never submits I/O to
> blocks that are no longer mapped to the file.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>   fs/xfs/xfs_iomap.c | 102 +++++++++++++++++++--------------------------
>   1 file changed, 44 insertions(+), 58 deletions(-)
> 
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index ab69caa685b4..80da465ebf7a 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -677,22 +677,24 @@ xfs_file_iomap_begin_delay(
>    */
>   int
>   xfs_iomap_write_allocate(
> -	xfs_inode_t	*ip,
> -	int		whichfork,
> -	xfs_off_t	offset,
> -	xfs_bmbt_irec_t *imap,
> -	unsigned int	*seq)
> +	struct xfs_inode	*ip,
> +	int			whichfork,
> +	xfs_off_t		offset,
> +	struct xfs_bmbt_irec	*imap,
> +	unsigned int		*seq)
>   {
> -	xfs_mount_t	*mp = ip->i_mount;
> -	struct xfs_ifork *ifp = XFS_IFORK_PTR(ip, whichfork);
> -	xfs_fileoff_t	offset_fsb, last_block;
> -	xfs_fileoff_t	end_fsb, map_start_fsb;
> -	xfs_filblks_t	count_fsb;
> -	xfs_trans_t	*tp;
> -	int		nimaps;
> -	int		error = 0;
> -	int		flags = XFS_BMAPI_DELALLOC;
> -	int		nres;
> +	struct xfs_mount	*mp = ip->i_mount;
> +	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, whichfork);
> +	struct xfs_trans	*tp;
> +	struct xfs_iext_cursor	icur;
> +	struct xfs_bmbt_irec	timap;
> +	xfs_fileoff_t		offset_fsb;
> +	xfs_fileoff_t		map_start_fsb;
> +	xfs_filblks_t		count_fsb;
> +	int			nimaps;
> +	int			error = 0;
> +	int			flags = XFS_BMAPI_DELALLOC;
> +	int			nres;
>   
>   	if (whichfork == XFS_COW_FORK)
>   		flags |= XFS_BMAPI_COWFORK | XFS_BMAPI_PREALLOC;
> @@ -737,56 +739,40 @@ xfs_iomap_write_allocate(
>   			xfs_trans_ijoin(tp, ip, 0);
>   
>   			/*
> -			 * it is possible that the extents have changed since
> -			 * we did the read call as we dropped the ilock for a
> -			 * while. We have to be careful about truncates or hole
> -			 * punchs here - we are not allowed to allocate
> -			 * non-delalloc blocks here.
> -			 *
> -			 * The only protection against truncation is the pages
> -			 * for the range we are being asked to convert are
> -			 * locked and hence a truncate will block on them
> -			 * first.
> -			 *
> -			 * As a result, if we go beyond the range we really
> -			 * need and hit an delalloc extent boundary followed by
> -			 * a hole while we have excess blocks in the map, we
> -			 * will fill the hole incorrectly and overrun the
> -			 * transaction reservation.
> +			 * Now that we have ILOCK we must account for the fact
> +			 * that the fork (and thus our mapping) could have
> +			 * changed while the inode was unlocked. If the fork
> +			 * has changed, trim the caller's mapping to the
> +			 * current extent in the fork.
>   			 *
> -			 * Using a single map prevents this as we are forced to
> -			 * check each map we look for overlap with the desired
> -			 * range and abort as soon as we find it. Also, given
> -			 * that we only return a single map, having one beyond
> -			 * what we can return is probably a bit silly.
> +			 * If the external change did not modify the current
> +			 * mapping (or just grew it) this will have no effect.
> +			 * If the current mapping shrunk, we expect to at
> +			 * minimum still have blocks backing the current page as
> +			 * the page has remained locked since writeback first
> +			 * located delalloc block(s) at the page offset. A
> +			 * racing truncate, hole punch or even reflink must wait
> +			 * on page writeback before it can modify our page and
> +			 * underlying block(s).
>   			 *
> -			 * We also need to check that we don't go beyond EOF;
> -			 * this is a truncate optimisation as a truncate sets
> -			 * the new file size before block on the pages we
> -			 * currently have locked under writeback. Because they
> -			 * are about to be tossed, we don't need to write them
> -			 * back....
> +			 * We'll update *seq before we drop ilock for the next
> +			 * iteration.
>   			 */
> -			nimaps = 1;
> -			end_fsb = XFS_B_TO_FSB(mp, XFS_ISIZE(ip));
> -			error = xfs_bmap_last_offset(ip, &last_block,
> -							XFS_DATA_FORK);
> -			if (error)
> -				goto trans_cancel;
> -
> -			last_block = XFS_FILEOFF_MAX(last_block, end_fsb);
> -			if ((map_start_fsb + count_fsb) > last_block) {
> -				count_fsb = last_block - map_start_fsb;
> -				if (count_fsb == 0) {
> -					error = -EAGAIN;
> +			if (*seq != READ_ONCE(ifp->if_seq)) {
> +				if (!xfs_iext_lookup_extent(ip, ifp, offset_fsb,
> +							    &icur, &timap) ||
> +				    timap.br_startoff > offset_fsb) {
> +					ASSERT(0);
> +					error = -EFSCORRUPTED;
>   					goto trans_cancel;
>   				}
> +				xfs_trim_extent(imap, timap.br_startoff,
> +						timap.br_blockcount);
> +				count_fsb = imap->br_blockcount;
> +				map_start_fsb = imap->br_startoff;
>   			}
>   
> -			/*
> -			 * From this point onwards we overwrite the imap
> -			 * pointer that the caller gave to us.
> -			 */
> +			nimaps = 1;
>   			error = xfs_bmapi_write(tp, ip, map_start_fsb,
>   						count_fsb, flags, nres, imap,
>   						&nimaps);
> 
Alrighty, it seems reasonable to me.  You can add my review. Thanks!

Reviewed-by: Allison Henderson <allison.henderson@oracle.com>


* Re: [PATCH v2 4/5] xfs: remove superfluous writeback mapping eof trimming
  2019-01-18  6:48   ` Allison Henderson
@ 2019-01-18 11:25     ` Brian Foster
  0 siblings, 0 replies; 27+ messages in thread
From: Brian Foster @ 2019-01-18 11:25 UTC (permalink / raw)
  To: Allison Henderson; +Cc: linux-xfs

On Thu, Jan 17, 2019 at 11:48:58PM -0700, Allison Henderson wrote:
> 
> 
> On 1/17/19 12:20 PM, Brian Foster wrote:
> > Now that the cached writeback mapping is explicitly invalidated on
> > data fork changes, the EOF trimming band-aid is no longer necessary.
> > Remove xfs_trim_extent_eof() as well since it has no other users.
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> >   fs/xfs/libxfs/xfs_bmap.c | 11 -----------
> >   fs/xfs/libxfs/xfs_bmap.h |  1 -
> >   fs/xfs/xfs_aops.c        | 15 ---------------
> >   3 files changed, 27 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> > index 332eefa2700b..4c73927819c2 100644
> > --- a/fs/xfs/libxfs/xfs_bmap.c
> > +++ b/fs/xfs/libxfs/xfs_bmap.c
> > @@ -3685,17 +3685,6 @@ xfs_trim_extent(
> >   	}
> >   }
> > -/* trim extent to within eof */
> > -void
> > -xfs_trim_extent_eof(
> > -	struct xfs_bmbt_irec	*irec,
> > -	struct xfs_inode	*ip)
> > -
> > -{
> > -	xfs_trim_extent(irec, 0, XFS_B_TO_FSB(ip->i_mount,
> > -					      i_size_read(VFS_I(ip))));
> > -}
> > -
> >   /*
> >    * Trim the returned map to the required bounds
> >    */
> > diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
> > index 09d3ea97cc15..b4ff710d7250 100644
> > --- a/fs/xfs/libxfs/xfs_bmap.h
> > +++ b/fs/xfs/libxfs/xfs_bmap.h
> > @@ -181,7 +181,6 @@ static inline bool xfs_bmap_is_real_extent(struct xfs_bmbt_irec *irec)
> >   void	xfs_trim_extent(struct xfs_bmbt_irec *irec, xfs_fileoff_t bno,
> >   		xfs_filblks_t len);
> > -void	xfs_trim_extent_eof(struct xfs_bmbt_irec *, struct xfs_inode *);
> >   int	xfs_bmap_add_attrfork(struct xfs_inode *ip, int size, int rsvd);
> >   int	xfs_bmap_set_attrforkoff(struct xfs_inode *ip, int size, int *version);
> >   void	xfs_bmap_local_to_extents_empty(struct xfs_inode *ip, int whichfork);
> > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > index 649e4ad76add..09d8f7690e9e 100644
> > --- a/fs/xfs/xfs_aops.c
> > +++ b/fs/xfs/xfs_aops.c
> > @@ -357,19 +357,6 @@ xfs_map_blocks(
> >   	if (XFS_FORCED_SHUTDOWN(mp))
> >   		return -EIO;
> > -	/*
> > -	 * We have to make sure the cached mapping is within EOF to protect
> > -	 * against eofblocks trimming on file release leaving us with a stale
> > -	 * mapping. Otherwise, a page for a subsequent file extending buffered
> > -	 * write could get picked up by this writeback cycle and written to the
> > -	 * wrong blocks.
> > -	 *
> > -	 * Note that what we really want here is a generic mapping invalidation
> > -	 * mechanism to protect us from arbitrary extent modifying contexts, not
> > -	 * just eofblocks.
> > -	 */
> > -	xfs_trim_extent_eof(&wpc->imap, ip);
> > -
> >   	/*
> >   	 * COW fork blocks can overlap data fork blocks even if the blocks
> >   	 * aren't shared.  COW I/O always takes precedent, so we must always
> > @@ -482,7 +469,6 @@ xfs_map_blocks(
> >   	}
> >   	wpc->imap = imap;
> > -	xfs_trim_extent_eof(&wpc->imap, ip);
> >   	trace_xfs_map_blocks_found(ip, offset, count, wpc->io_type, &imap);
> >   	return 0;
> >   allocate_blocks:
> > @@ -494,7 +480,6 @@ xfs_map_blocks(
> >   	ASSERT(whichfork == XFS_COW_FORK || cow_fsb == NULLFILEOFF ||
> >   	       imap.br_startoff + imap.br_blockcount <= cow_fsb);
> >   	wpc->imap = imap;
> > -	xfs_trim_extent_eof(&wpc->imap, ip);
> >   	trace_xfs_map_blocks_alloc(ip, offset, count, wpc->io_type, &imap);
> >   	return 0;
> >   }
> > 
> 
> I'll skip this one; from the last review, it sounds like we decided not
> to keep it.  Thanks though!
> 

This one is actually still intended to be part of the series. The
previous patch to revalidate on fork sequence number changes eliminates
the need for the racy EOF trimming solution. The first patch in the
series was just intended to be a stable workaround for older kernels
that might have the patch where xfs_trim_extent_eof() was first added.

Thanks for the reviews..

Brian

> Allison


* Re: [PATCH v2 1/5] xfs: eof trim writeback mapping as soon as it is cached
  2019-01-17 19:20 ` [PATCH v2 1/5] xfs: eof trim writeback mapping as soon as it is cached Brian Foster
  2019-01-18  5:29   ` Allison Henderson
@ 2019-01-18 11:47   ` Christoph Hellwig
  1 sibling, 0 replies; 27+ messages in thread
From: Christoph Hellwig @ 2019-01-18 11:47 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Thu, Jan 17, 2019 at 02:20:00PM -0500, Brian Foster wrote:
> The cached writeback mapping is EOF trimmed to try and avoid races
> between post-eof block management and writeback that result in
> sending cached data to a stale location. The cached mapping is
> currently trimmed on the validation check, which leaves a race
> window between the time the mapping is cached and when it is trimmed
> against the current inode size.
> 
> For example, if a new mapping is cached by delalloc conversion on a
> blocksize == page size fs, we could cycle various locks, perform
> memory allocations, etc.  in the writeback codepath before the
> associated mapping is eventually trimmed to i_size. This leaves
> enough time for a post-eof truncate and file append before the
> cached mapping is trimmed. The former event essentially invalidates
> a range of the cached mapping and the latter bumps the inode size
> such that the trim on the next writepage event won't trim all of the
> invalid blocks. fstest generic/464 reproduces this scenario
> occasionally and causes a lost writeback and stale delalloc blocks
> warning on inode inactivation.
> 
> To work around this problem, trim the cached writeback mapping as
> soon as it is cached in addition to on subsequent validation checks.
> This is a minor tweak to tighten the race window as much as possible
> until a proper invalidation mechanism is available.
> 
> Fixes: 40214d128e07 ("xfs: trim writepage mapping to within eof")

I don't think it fixes that commit, but rather fixes more aspects of
the issue that commit tried to fix.

Otherwise this looks fine as a band-aid fix:

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH v2 3/5] xfs: validate writeback mapping using data fork seq counter
  2019-01-17 19:20 ` [PATCH v2 3/5] xfs: validate writeback mapping using data fork seq counter Brian Foster
  2019-01-18  6:12   ` Allison Henderson
@ 2019-01-18 11:50   ` Christoph Hellwig
  1 sibling, 0 replies; 27+ messages in thread
From: Christoph Hellwig @ 2019-01-18 11:50 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Thu, Jan 17, 2019 at 02:20:02PM -0500, Brian Foster wrote:
> The writeback code caches the current extent mapping across multiple
> xfs_do_writepage() calls to avoid repeated lookups for sequential
> pages backed by the same extent. This is known to be slightly racy
> with extent fork changes in certain difficult to reproduce
> scenarios. The cached extent is trimmed to within EOF to help avoid
> the most common vector for this problem via speculative
> preallocation management, but this is a band-aid that does not
> address the fundamental problem.
> 
> Now that we have an xfs_ifork sequence counter mechanism used to
> facilitate COW writeback, we can use the same mechanism to validate
> consistency between the data fork and cached writeback mappings. On
> its face, this is somewhat of a big hammer approach because any
> change to the data fork invalidates any mapping currently cached by
> a writeback in progress regardless of whether the data fork change
> overlaps with the range under writeback. In practice, however, the
> impact of this approach is minimal in most cases.
> 
> First, data fork changes (delayed allocations) caused by sustained
> sequential buffered writes are amortized across speculative
> preallocations. This means that a cached mapping won't be
> invalidated by each buffered write of a common file copy workload,
> but rather only on less frequent allocation events. Second, the
> extent tree is always entirely in-core so an additional lookup of a
> usable extent mostly costs a shared ilock cycle and in-memory tree
> lookup. This means that a cached mapping reval is relatively cheap
> compared to the I/O itself. Third, spurious invalidations don't
> impact ioend construction. This means that even if the same extent
> is revalidated multiple times across multiple writepage instances,
> we still construct and submit the same size ioend (and bio) if the
> blocks are physically contiguous.
> 
> Update struct xfs_writepage_ctx with a new field to hold the
> sequence number of the data fork associated with the currently
> cached mapping. Check the wpc seqno against the data fork when the
> mapping is validated and reestablish the mapping whenever the fork
> has changed since the mapping was cached. This ensures that
> writeback always uses a valid extent mapping and thus prevents lost
> writebacks and stale delalloc block problems.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/xfs_aops.c  | 60 ++++++++++++++++++++++++++++++++++++----------
>  fs/xfs/xfs_iomap.c |  5 ++--
>  2 files changed, 49 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index d9048bcea49c..649e4ad76add 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -29,6 +29,7 @@
>  struct xfs_writepage_ctx {
>  	struct xfs_bmbt_irec    imap;
>  	unsigned int		io_type;
> +	unsigned int		data_seq;
>  	unsigned int		cow_seq;
>  	struct xfs_ioend	*ioend;
>  };
> @@ -301,6 +302,42 @@ xfs_end_bio(
>  		xfs_destroy_ioend(ioend, blk_status_to_errno(bio->bi_status));
>  }
>  
> +/*
> + * Fast revalidation of the cached writeback mapping. Return true if the current
> + * mapping is valid, false otherwise.
> + */
> +static bool
> +xfs_imap_valid(
> +	struct xfs_writepage_ctx	*wpc,
> +	struct xfs_inode		*ip,
> +	xfs_fileoff_t			offset_fsb)
> +{
> +	/*
> +	 * If this is a COW mapping, it is sufficient to check that the mapping
> +	 * covers the offset. Be careful to check this first because the caller
> +	 * can revalidate a COW mapping without updating the data seqno.
> +	 */
> +	if (offset_fsb < wpc->imap.br_startoff ||
> +	    offset_fsb >= wpc->imap.br_startoff + wpc->imap.br_blockcount)
> +		return false;
> +	if (wpc->io_type == XFS_IO_COW)
> +		return true;

Maybe the comment should move below the first range check? 

Otherwise looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH v2 4/5] xfs: remove superfluous writeback mapping eof trimming
  2019-01-17 19:20 ` [PATCH v2 4/5] xfs: remove superfluous writeback mapping eof trimming Brian Foster
  2019-01-18  6:48   ` Allison Henderson
@ 2019-01-18 11:50   ` Christoph Hellwig
  1 sibling, 0 replies; 27+ messages in thread
From: Christoph Hellwig @ 2019-01-18 11:50 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Thu, Jan 17, 2019 at 02:20:03PM -0500, Brian Foster wrote:
> Now that the cached writeback mapping is explicitly invalidated on
> data fork changes, the EOF trimming band-aid is no longer necessary.
> Remove xfs_trim_extent_eof() as well since it has no other users.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH v2 5/5] xfs: revalidate imap properly before writeback delalloc conversion
  2019-01-17 19:20 ` [PATCH v2 5/5] xfs: revalidate imap properly before writeback delalloc conversion Brian Foster
  2019-01-18  6:58   ` Allison Henderson
@ 2019-01-18 11:59   ` Christoph Hellwig
  2019-01-18 16:31     ` Christoph Hellwig
  1 sibling, 1 reply; 27+ messages in thread
From: Christoph Hellwig @ 2019-01-18 11:59 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

> +			if (*seq != READ_ONCE(ifp->if_seq)) {
> +				if (!xfs_iext_lookup_extent(ip, ifp, offset_fsb,
> +							    &icur, &timap) ||
> +				    timap.br_startoff > offset_fsb) {
> +					ASSERT(0);
> +					error = -EFSCORRUPTED;
>  					goto trans_cancel;
>  				}
> +				xfs_trim_extent(imap, timap.br_startoff,
> +						timap.br_blockcount);
> +				count_fsb = imap->br_blockcount;
> +				map_start_fsb = imap->br_startoff;

I very much disagree with this logic.

xfs_bmapi_write will always do a xfs_iext_lookup_extent for the
passed in blocks anyway.  So this trimming logic should move into
xfs_bmapi_write to ensure it does the right thing, instead of papering
over the logic in xfs_bmapi_write in the caller with another lookup.

I think the improved XFS_BMAPI_DELALLOC handling in this patch:

   http://git.infradead.org/users/hch/xfs.git/commitdiff/553fed9ce7695df081779c6a9a205af4142b2a06

is the right answer, as writeback really should only ever convert
existing delalloc extents, and I'd much rather rely on that rather than
piling hacks of hacks to feed the right range into a function that must
do a lookup in the extent tree anyway.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 5/5] xfs: revalidate imap properly before writeback delalloc conversion
  2019-01-18 11:59   ` Christoph Hellwig
@ 2019-01-18 16:31     ` Christoph Hellwig
  2019-01-18 18:39       ` Brian Foster
  0 siblings, 1 reply; 27+ messages in thread
From: Christoph Hellwig @ 2019-01-18 16:31 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Fri, Jan 18, 2019 at 03:59:37AM -0800, Christoph Hellwig wrote:
> xfs_bmapi_write will always do a xfs_iext_lookup_extent for the
> passed in blocks anyway.  So this trimming logic should move into
> xfs_bmapi_write to ensure it does the right thing, instead of papering
> over the logic in xfs_bmapi_write in the caller with another lookup.
> 
> I think the improved XFS_BMAPI_DELALLOC handling in this patch:
> 
>    http://git.infradead.org/users/hch/xfs.git/commitdiff/553fed9ce7695df081779c6a9a205af4142b2a06
> 
> is the right answer, as writeback really should only ever convert
> existing delalloc extents, and I'd much rather rely on that rather than
> piling hacks of hacks to feed the right range into a function that must
> do a lookup in the extent tree anyway.

FYI, here is a tree that rebases my patches on top of yours (minus this
one) and also drops the internal i_size checks, although it still keeps
the initial one for the truncate fast path.  Testing is still ongoing,
and for the writeback issue you probably only need the first three
of my patches:

	http://git.infradead.org/users/hch/xfs.git/shortlog/refs/heads/xfs-mapping-validation

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 5/5] xfs: revalidate imap properly before writeback delalloc conversion
  2019-01-18 16:31     ` Christoph Hellwig
@ 2019-01-18 18:39       ` Brian Foster
  2019-01-20 12:45         ` Brian Foster
  2019-01-21 15:48         ` Christoph Hellwig
  0 siblings, 2 replies; 27+ messages in thread
From: Brian Foster @ 2019-01-18 18:39 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Fri, Jan 18, 2019 at 08:31:32AM -0800, Christoph Hellwig wrote:
> On Fri, Jan 18, 2019 at 03:59:37AM -0800, Christoph Hellwig wrote:
> > xfs_bmapi_write will always do a xfs_iext_lookup_extent for the
> > passed in blocks anyway.  So this trimming logic should move into
> > xfs_bmapi_write to ensure it does the right thing, instead of papering
> > over the logic in xfs_bmapi_write in the caller with another lookup.
> > 
> > I think the improved XFS_BMAPI_DELALLOC handling in this patch:
> > 
> >    http://git.infradead.org/users/hch/xfs.git/commitdiff/553fed9ce7695df081779c6a9a205af4142b2a06
> > 
> > is the right answer, as writeback really should only ever convert
> > existing delalloc extents, and I'd much rather rely on that rather than
> > piling hacks of hacks to feed the right range into a function that must
> > do a lookup in the extent tree anyway.
> 
> FYI, here is a tree that rebases my patches on top of yours (minus this
> one) and also drops the internal i_size checks, although it still keeps
> the initial one for the truncate fast path.  Testing is still ongoing,
> and for the writeback issue you probably only need the first three
> of my patches:
> 
> 	http://git.infradead.org/users/hch/xfs.git/shortlog/refs/heads/xfs-mapping-validation

This doesn't really seem all that different to me. Rather than firm up
the range in the caller, we turn XFS_BMAPI_DELALLOC into something that
seemingly behaves a bit more like CONVERT_ONLY.

A few notes/thoughts:

1. What's the purpose of the nimaps == 0 check in
xfs_iomap_write_allocate()? If we got to this point with the page lock
held, shouldn't we always expect to find something backing the current
page?

2. If so, then it also seems that the whole "eof:" thing in
xfs_map_blocks() should never happen for data forks. If that's the case,
the use of the eof label there seems gratuitous.

3. If the starting bno param to xfs_bmapi_write() lands in a hole (due
to racing with a hole punch for example) and it finds some subsequent
delalloc extent to convert in the requested range, the arithmetic in
xfs_bmapi_trim_map() actually fabricates physical blocks for the hole
portion of the range relative to the startblock of the converted
delalloc extent. I don't think that causes a problem with the writeback
code because the fabricated blocks are before the page offset, but those
semantics are insane. I think you need to do something like fix up
xfs_bmapi_write() to return the actual converted mapping and make sure
xfs_iomap_write_allocate() handles that it might start beyond
map_start_fsb.

Brian

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 5/5] xfs: revalidate imap properly before writeback delalloc conversion
  2019-01-18 18:39       ` Brian Foster
@ 2019-01-20 12:45         ` Brian Foster
  2019-01-21 15:51           ` Christoph Hellwig
  2019-01-21 15:48         ` Christoph Hellwig
  1 sibling, 1 reply; 27+ messages in thread
From: Brian Foster @ 2019-01-20 12:45 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Fri, Jan 18, 2019 at 01:39:15PM -0500, Brian Foster wrote:
> On Fri, Jan 18, 2019 at 08:31:32AM -0800, Christoph Hellwig wrote:
> > On Fri, Jan 18, 2019 at 03:59:37AM -0800, Christoph Hellwig wrote:
> > > xfs_bmapi_write will always do a xfs_iext_lookup_extent for the
> > > passed in blocks anyway.  So this trimming logic should move into
> > > xfs_bmapi_write to ensure it does the right thing, instead of papering
> > > over the logic in xfs_bmapi_write in the caller with another lookup.
> > > 
> > > I think the improved XFS_BMAPI_DELALLOC handling in this patch:
> > > 
> > >    http://git.infradead.org/users/hch/xfs.git/commitdiff/553fed9ce7695df081779c6a9a205af4142b2a06
> > > 
> > > is the right answer, as writeback really should only ever convert
> > > existing delalloc extents, and I'd much rather rely on that rather than
> > > piling hacks of hacks to feed the right range into a function that must
> > > do a lookup in the extent tree anyway.
> > 
> > FYI, here is a tree that rebases my patches on top of yours (minus this
> > one) and also drops the internal i_size checks, although it still keeps
> > the initial one for the truncate fast path.  Testing is still ongoing,
> > and for the writeback issue you probably only need the first three
> > of my patches:
> > 
> > 	http://git.infradead.org/users/hch/xfs.git/shortlog/refs/heads/xfs-mapping-validation
> 
> This doesn't really seem all that different to me. Rather than firm up
> the range in the caller, we turn XFS_BMAPI_DELALLOC into something that
> seemingly behaves a bit more like CONVERT_ONLY.
> 
> A few notes/thoughts:
> 
> 1. What's the purpose of the nimaps == 0 check in
> xfs_iomap_write_allocate()? If we got to this point with the page lock
> held, shouldn't we always expect to find something backing the current
> page?
> 
> 2. If so, then it also seems that the whole "eof:" thing in
> xfs_map_blocks() should never happen for data forks. If that's the case,
> the use of the eof label there seems gratuitous.
> 
> 3. If the starting bno param to xfs_bmapi_write() lands in a hole (due
> to racing with a hole punch for example) and it finds some subsequent
> delalloc extent to convert in the requested range, the arithmetic in
> xfs_bmapi_trim_map() actually fabricates physical blocks for the hole
> portion of the range relative to the startblock of the converted
> delalloc extent. I don't think that causes a problem with the writeback
> code because the fabricated blocks are before the page offset, but those
> semantics are insane. I think you need to do something like fix up
> xfs_bmapi_write() to return the actual converted mapping and make sure
> xfs_iomap_write_allocate() handles that it might start beyond
> map_start_fsb.
> 

A couple more thoughts now that I've had more time to think about this:

4. Kind of a nit, but the comment update in xfs_bmapi_write() that
describes the ilock and associated race window and whatnot should really
be split between there and xfs_iomap_write_allocate(). The former should
just explain what exactly XFS_BMAPI_DELALLOC does (i.e., skip holes,
real extents, over a range..). The latter should explain how
the use of XFS_BMAPI_DELALLOC helps us deal with the race window in the
writeback code.

5. A more serious point.. with this approach, what prevents writeback
from thrashing with buffered writes in the event of the hole punch race
scenario I mentioned previously? One of the things I explicitly wanted
to avoid in this patch is the potential to bounce back and forth between
writes and writeback over extent updates. There is the performance
aspect of that, which is what I was initially concerned about, but your
patch introduces a functional aspect of that we need to consider as
well.

For (the extreme/worst case) example, suppose again we're at writeback
in the middle of a fairly large delalloc extent. Somehow or another we
race and lose a good portion at the start of the currently cached extent
and then see a buffered write at the start offset of the (now invalid)
cached extent. The now invalid range is passed to xfs_bmapi_write(),
which finds and allocates that first block in the range, commits the
transaction and cycles the ilock. If we now contend with another
buffered write at the next offset in the original extent range,
writeback finds and allocates at that offset next and the cycle repeats
until we finally allocate an extent that backs our current page.

Unless I'm missing something here, there's not only opportunity to spend
an awkward amount of time processing unrelated delalloc extents for a
single page, there's potential to allocate bits of what should be a
large contiguous delalloc extent a block at a time, which can defeat the
purpose of delayed allocation (assuming the block allocator doesn't save
our ass).

Hmm, I'm not totally sure what the right answer to that is. On one hand,
I think the approach of using XFS_BMAPI_DELALLOC is fairly elegant. On
the other, I don't think a solution open to this behavior is robust
enough.

One option that comes to mind is to perhaps split off XFS_BMAPI_DELALLOC
from xfs_bmapi_write() into a new xfs_bmapi_delalloc() function that
facilitates new semantics (I'm not terribly comfortable with overloading
the semantics of xfs_bmapi_write()). Instead of passing a range to
xfs_bmapi_delalloc(), just pass the offset we care about and expect that
this function will attempt to allocate the entire extent currently
backing offset. (Alternatively, we could perhaps pass a range by value
and allow xfs_bmapi_delalloc() to update the range in the event of
delalloc discontiguity.) Either way, the extent returned may not cover
the offset (due to fragmentation, as today) and thus the caller needs to
iterate until that happens. The larger point is that we'd lookup the
current extent _at offset_ on each iteration and thus shouldn't ever
contend with new delalloc reservations. Thoughts?

Brian

> Brian

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 5/5] xfs: revalidate imap properly before writeback delalloc conversion
  2019-01-18 18:39       ` Brian Foster
  2019-01-20 12:45         ` Brian Foster
@ 2019-01-21 15:48         ` Christoph Hellwig
  2019-01-21 17:42           ` Brian Foster
  1 sibling, 1 reply; 27+ messages in thread
From: Christoph Hellwig @ 2019-01-21 15:48 UTC (permalink / raw)
  To: Brian Foster; +Cc: Christoph Hellwig, linux-xfs

On Fri, Jan 18, 2019 at 01:39:15PM -0500, Brian Foster wrote:
> This doesn't really seem all that different to me. Rather than firm up
> the range in the caller, we turn XFS_BMAPI_DELALLOC into something that
> seemingly behaves a bit more like CONVERT_ONLY.

The differences are:

 (a) we save one lookup in the extent tree
 (b) we have a less fragile API

> A few notes/thoughts:
> 
> 1. What's the purpose of the nimaps == 0 check in
> xfs_iomap_write_allocate()? If we got to this point with the page lock
> held, shouldn't we always expect to find something backing the current
> page?

It protects against the case where we migrate a COW fork mapping to the
data fork, which is not protected by the page lock.  But I guess the
check warrants a comment and an assert.

> 2. If so, then it also seems that the whole "eof:" thing in
> xfs_map_blocks() should never happen for data forks. If that's the case,
> the use of the eof label there seems gratuitous.

Let me try with asserts enabled.

> 
> 3. If the starting bno param to xfs_bmapi_write() lands in a hole (due
> to racing with a hole punch for example) and it finds some subsequent
> delalloc extent to convert in the requested range, the arithmetic in
> xfs_bmapi_trim_map() actually fabricates physical blocks for the hole
> portion of the range relative to the startblock of the converted
> delalloc extent. I don't think that causes a problem with the writeback
> code because the fabricated blocks are before the page offset, but those
> semantics are insane. I think you need to do something like fix up
> xfs_bmapi_write() to return the actual converted mapping and make sure
> xfs_iomap_write_allocate() handles that it might start beyond
> map_start_fsb.

Sounds good.  To be honest I think the whole idea of converting the
mapping before the requested offset is a rather bad idea to start
with, just increasing our window that exposes stale data.  But to fix
this properly we need to create everything as unwritten first, and
I didn't have time to get back to that series.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 5/5] xfs: revalidate imap properly before writeback delalloc conversion
  2019-01-20 12:45         ` Brian Foster
@ 2019-01-21 15:51           ` Christoph Hellwig
  2019-01-21 17:43             ` Brian Foster
  0 siblings, 1 reply; 27+ messages in thread
From: Christoph Hellwig @ 2019-01-21 15:51 UTC (permalink / raw)
  To: Brian Foster; +Cc: Christoph Hellwig, linux-xfs

On Sun, Jan 20, 2019 at 07:45:02AM -0500, Brian Foster wrote:
> 4. Kind of a nit, but the comment update in xfs_bmapi_write() that
> describes the ilock and associated race window and whatnot should really
> be split between there and xfs_iomap_write_allocate(). The former should
> just explain what exactly XFS_BMAPI_DELALLOC does (i.e., skip holes,
> real extents, over a range..). The latter should explain how
> the use of XFS_BMAPI_DELALLOC helps us deal with the race window in the
> writeback code.

Ok.

> One option that comes to mind is to perhaps split off XFS_BMAPI_DELALLOC
> from xfs_bmapi_write() into a new xfs_bmapi_delalloc() function that
> facilitates new semantics (I'm not terribly comfortable with overloading
> the semantics of xfs_bmapi_write()). Instead of passing a range to
> xfs_bmapi_delalloc(), just pass the offset we care about and expect that
> this function will attempt to allocate the entire extent currently
> backing offset. (Alternatively, we could perhaps pass a range by value
> and allow xfs_bmapi_delalloc() to update the range in the event of
> delalloc discontiguity.) Either way, the extent returned may not cover
> the offset (due to fragmentation, as today) and thus the caller needs to
> iterate until that happens. The larger point is that we'd lookup the
> current extent _at offset_ on each iteration and thus shouldn't ever
> contend with new delalloc reservations. Thoughts?

I considered splitting it off and even had an early prototype.  I
got something wrong and it didn't work, and it created a little too
much duplication for my taste so I gave up on it for now.  But
fundamentally having the delalloc conversion separate from
xfs_bmapi_write is the right thing.  I'll just have to find some
time for it or pass this work off to you..

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 5/5] xfs: revalidate imap properly before writeback delalloc conversion
  2019-01-21 15:48         ` Christoph Hellwig
@ 2019-01-21 17:42           ` Brian Foster
  2019-01-22 17:14             ` Christoph Hellwig
  0 siblings, 1 reply; 27+ messages in thread
From: Brian Foster @ 2019-01-21 17:42 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Mon, Jan 21, 2019 at 07:48:33AM -0800, Christoph Hellwig wrote:
> On Fri, Jan 18, 2019 at 01:39:15PM -0500, Brian Foster wrote:
> > This doesn't really seem all that different to me. Rather than firm up
> > the range in the caller, we turn XFS_BMAPI_DELALLOC into something that
> > seemingly behaves a bit more like CONVERT_ONLY.
> 
> The differences are:
> 
>  (a) we save one lookup in the extent tree
>  (b) we have a less fragile API
> 
> > A few notes/thoughts:
> > 
> > 1. What's the purpose of the nimaps == 0 check in
> > xfs_iomap_write_allocate()? If we got to this point with the page lock
> > held, shouldn't we always expect to find something backing the current
> > page?
> 
> It protects against the case where we migrate a COW fork mapping to the
> data fork, which is not protected by the page lock.  But I guess the
> check warrants a comment and an assert.
> 

Yeah, probably. It's not really clear to me what that means.

> > 2. If so, then it also seems that the whole "eof:" thing in
> > xfs_map_blocks() should never happen for data forks. If that's the case,
> > the use of the eof label there seems gratuitous.
> 
> Let me try with asserts enabled.
> 

Ok, thanks.

> > 
> > 3. If the starting bno param to xfs_bmapi_write() lands in a hole (due
> > to racing with a hole punch for example) and it finds some subsequent
> > delalloc extent to convert in the requested range, the arithmetic in
> > xfs_bmapi_trim_map() actually fabricates physical blocks for the hole
> > portion of the range relative to the startblock of the converted
> > delalloc extent. I don't think that causes a problem with the writeback
> > code because the fabricated blocks are before the page offset, but those
> > semantics are insane. I think you need to do something like fix up
> > xfs_bmapi_write() to return the actual converted mapping and make sure
> > xfs_iomap_write_allocate() handles that it might start beyond
> > map_start_fsb.
> 
> Sounds good.  To be honest I think the whole idea of converting the
> mapping before the requested offset is a rather bad idea to start
> > with, just increasing our window that exposes stale data.  But to fix
> this properly we need to create everything as unwritten first, and
> I didn't have time to get back to that series.

Eh, I think that's a separate problem that re: your other series, we
already know how to fix properly. I don't think we should mess around
with something as fairly fundamental as delayed block allocation
behavior for an approach that doesn't address the underlying problem.

Brian

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 5/5] xfs: revalidate imap properly before writeback delalloc conversion
  2019-01-21 15:51           ` Christoph Hellwig
@ 2019-01-21 17:43             ` Brian Foster
  0 siblings, 0 replies; 27+ messages in thread
From: Brian Foster @ 2019-01-21 17:43 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Mon, Jan 21, 2019 at 07:51:10AM -0800, Christoph Hellwig wrote:
> On Sun, Jan 20, 2019 at 07:45:02AM -0500, Brian Foster wrote:
> > 4. Kind of a nit, but the comment update in xfs_bmapi_write() that
> > describes the ilock and associated race window and whatnot should really
> > be split between there and xfs_iomap_write_allocate(). The former should
> > just explain what exactly XFS_BMAPI_DELALLOC does (i.e., skip holes,
> > real extents, over a range..). The latter should explain how
> > the use of XFS_BMAPI_DELALLOC helps us deal with the race window in the
> > writeback code.
> 
> Ok.
> 
> > One option that comes to mind is to perhaps split off XFS_BMAPI_DELALLOC
> > from xfs_bmapi_write() into a new xfs_bmapi_delalloc() function that
> > facilitates new semantics (I'm not terribly comfortable with overloading
> > the semantics of xfs_bmapi_write()). Instead of passing a range to
> > xfs_bmapi_delalloc(), just pass the offset we care about and expect that
> > this function will attempt to allocate the entire extent currently
> > backing offset. (Alternatively, we could perhaps pass a range by value
> > and allow xfs_bmapi_delalloc() to update the range in the event of
> > delalloc discontiguity.) Either way, the extent returned may not cover
> > the offset (due to fragmentation, as today) and thus the caller needs to
> > iterate until that happens. The larger point is that we'd lookup the
> > current extent _at offset_ on each iteration and thus shouldn't ever
> > contend with new delalloc reservations. Thoughts?
> 
> I considered splitting it off and even had an early prototype.  I
> got something wrong and it didn't work, and it created a little too
> much duplication for my taste so I gave up on it for now.  But
> fundamentally having the delalloc conversion separate from
> xfs_bmapi_write is the right thing.  I'll just have to find some
> time for it or pass this work off to you..

Ok, well I don't mind looking into how to refactor that code, but my
priority for this series is to fix the underlying problem. As a
temporary compromise, I think there might be a couple simple options to
create an xfs_bmapi_delalloc() wrapper over xfs_bmapi_write(). I'm
curious whether we could just pass bno and len == 1 from the delalloc
wrapper and get the behavior we want. Alternatively, perhaps we could
just factor out the bma.got lookup from xfs_bmapi_write() and use that
to handle the range properly in the delalloc case (i.e., pass *got and
*eof into a __xfs_bmapi_write() internal function that does most of
everything else).

If either of those don't work, a temporary fallback may be to just bury
the seqno lookup logic from this patch into xfs_bmapi_delalloc() and
document that further refactoring is required. That would retain the
extra lookup (for now), but TBH the testing I've already done wrt
excessive higher level revalidations kind of shows that this has no real
impact on writeback performance; it's just a bit ugly.

Brian

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 5/5] xfs: revalidate imap properly before writeback delalloc conversion
  2019-01-21 17:42           ` Brian Foster
@ 2019-01-22 17:14             ` Christoph Hellwig
  2019-01-22 18:13               ` Brian Foster
  0 siblings, 1 reply; 27+ messages in thread
From: Christoph Hellwig @ 2019-01-22 17:14 UTC (permalink / raw)
  To: Brian Foster; +Cc: Christoph Hellwig, linux-xfs

On Mon, Jan 21, 2019 at 12:42:56PM -0500, Brian Foster wrote:
> > It protects against the case where we migrate a COW fork mapping to the
> > data fork, which is not protected by the page lock.  But I guess the
> > check warrants a comment and an assert.
> > 
> 
> Yeah, probably. It's not really clear to me what that means.
> 
> > > 2. If so, then it also seems that the whole "eof:" thing in
> > > xfs_map_blocks() should never happen for data forks. If that's the case,
> > > the use of the eof label there seems gratuitous.
> > 
> > Let me try with asserts enabled.
> > 
> 
> Ok, thanks.

So, the nimaps == 0 case hits even for the data fork when running
xfs/442 on a 1k block size file system.  That test has generally been
a fun source of bugs in the always_cow series.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 5/5] xfs: revalidate imap properly before writeback delalloc conversion
  2019-01-22 17:14             ` Christoph Hellwig
@ 2019-01-22 18:13               ` Brian Foster
  2019-01-23 18:14                 ` Christoph Hellwig
  0 siblings, 1 reply; 27+ messages in thread
From: Brian Foster @ 2019-01-22 18:13 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Tue, Jan 22, 2019 at 09:14:45AM -0800, Christoph Hellwig wrote:
> On Mon, Jan 21, 2019 at 12:42:56PM -0500, Brian Foster wrote:
> > > It protects against the case where we migrate a COW fork mapping to the
> > > data fork, which is not protected by the page lock.  But I guess the
> > > check warrants a comment and an assert.
> > > 
> > 
> > Yeah, probably. It's not really clear to me what that means.
> > 
> > > > 2. If so, then it also seems that the whole "eof:" thing in
> > > > xfs_map_blocks() should never happen for data forks. If that's the case,
> > > > the use of the eof label there seems gratuitous.
> > > 
> > > Let me try with asserts enabled.
> > > 
> > 
> > Ok, thanks.
> 
> So, the nimaps == 0 case hits even for the data fork when running
> xfs/442 on a 1k block size file system.  That test has generally been
> a fun source of bugs in the always_cow series.

Hmm, Ok.. any idea how/why that is? I'm curious if this is an artifact
of this code being racy (i.e., nimaps == 0 occurs somehow on a range
external to the page) or there is some legitimate means to lose a
delalloc block under the locked page. If the latter, we should at least
document it in the code..

FWIW, I have some refactoring in place that basically turns nimaps == 0
into an error and I don't see an xfs/442 failure with 1k or 4k blocks,
though that's only from a couple quick runs and I'm still working
through some changes. The difference of course is that this code uses
the range of the delalloc extent in the data fork rather than the
(potentially already stale) range from wpc->imap. I'll let it spin for a
bit I suppose and see if it explodes..

Brian

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 5/5] xfs: revalidate imap properly before writeback delalloc conversion
  2019-01-22 18:13               ` Brian Foster
@ 2019-01-23 18:14                 ` Christoph Hellwig
  2019-01-23 18:40                   ` Brian Foster
  0 siblings, 1 reply; 27+ messages in thread
From: Christoph Hellwig @ 2019-01-23 18:14 UTC (permalink / raw)
  To: Brian Foster; +Cc: Christoph Hellwig, linux-xfs

On Tue, Jan 22, 2019 at 01:13:28PM -0500, Brian Foster wrote:
> > So, the nimaps == 0 case hits even for the data fork when running
> > xfs/442 on a 1k block size file system.  That test has generally been
> > a fun source of bugs in the always_cow series.
> 
> Hmm, Ok.. any idea how/why that is? I'm curious if this is an artifact
> of this code being racy (i.e., nimaps == 0 occurs somehow on a range
> external to the page) or there is some legitimate means to lose a
> delalloc block under the locked page. If the latter, we should at least
> document it in the code..

I haven't managed to pinpoint it, mostly because xfs_bmapi_write is
such an unreadable mess.  Based on that I decided to implement the idea
of a separate xfs_bmapi_delalloc_convert helper again, mostly to try
to understand what actually is going on.  This seems to work pretty
reasonably after initial testing, except that it seems to unhide an
issue in xfs/109 which doesn't look directly related.  The current
status is here:

http://git.infradead.org/users/hch/xfs.git/shortlog/refs/heads/xfs-mapping-validation.2

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 5/5] xfs: revalidate imap properly before writeback delalloc conversion
  2019-01-23 18:14                 ` Christoph Hellwig
@ 2019-01-23 18:40                   ` Brian Foster
  0 siblings, 0 replies; 27+ messages in thread
From: Brian Foster @ 2019-01-23 18:40 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Wed, Jan 23, 2019 at 10:14:12AM -0800, Christoph Hellwig wrote:
> On Tue, Jan 22, 2019 at 01:13:28PM -0500, Brian Foster wrote:
> > > So, the nimaps == 0 case hits even for the data fork when running
> > > xfs/442 on a 1k block size file system.  That test has generally been
> > > a fun source of bugs in the always_cow series.
> > 
> > Hmm, Ok.. any idea how/why that is? I'm curious if this is an artifact
> > of this code being racy (i.e., nimaps == 0 occurs somehow on a range
> > external to the page) or there is some legitimate means to lose a
> > delalloc block under the locked page. If the latter, we should at least
> > document it in the code..
> 
> I haven't managed to pinpoint it, mostly because xfs_bmapi_write is
> such an unreadable mess.  Based on that I decided to implement the idea
> of a separate xfs_bmapi_delalloc_convert helper again, mostly to try
> to understand what actually is going on.  This seems to work pretty
> reasonably after initial testing, except that it seems to unhide an
> issue in xfs/109 which doesn't look directly related.  The current
> status is here:
> 
> http://git.infradead.org/users/hch/xfs.git/shortlog/refs/heads/xfs-mapping-validation.2
> 

I have a v3 of this series that does something similar, but doesn't
quite take the refactoring as far. I just created an
xfs_bmapi_delalloc() wrapper over xfs_bmapi_write(). The changes to the
latter are minimal based on a tweak of XFS_BMAPI_DELALLOC behavior. The
change to xfs_iomap_write_allocate() is basically just to use the new
helper instead of xfs_bmapi_write().

The refactoring you have looks like the right direction to me, but I'm
wondering why we need to change so much all at once, particularly since
it's not clear that some of this -EAGAIN stuff is really necessary.
Pulling bits out of xfs_bmapi_write() is also inherently more difficult
to review than reusing it, IMO.

Hmm, I'm wondering if we could rebase your refactoring onto the simple
wrapper in my patch. You'd basically just reimplement the helper, kill
off the XFS_BMAPI_DELALLOC stuff as you already have, and the changes to
xfs_iomap_write_allocate() (notwithstanding moving/renaming the
function, which should really be split up into separate patches) would
probably be minimal (i.e., do the extent lookup).

I was still running some tests but I've got through a decent amount
already and have everything formatted to send out so I'll just post
it...

Brian

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2019-01-23 18:40 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-01-17 19:19 [PATCH v2 0/5] xfs: properly invalidate cached writeback mapping Brian Foster
2019-01-17 19:20 ` [PATCH v2 1/5] xfs: eof trim writeback mapping as soon as it is cached Brian Foster
2019-01-18  5:29   ` Allison Henderson
2019-01-18 11:47   ` Christoph Hellwig
2019-01-17 19:20 ` [PATCH v2 2/5] xfs: update fork seq counter on data fork changes Brian Foster
2019-01-18  5:30   ` Allison Henderson
2019-01-17 19:20 ` [PATCH v2 3/5] xfs: validate writeback mapping using data fork seq counter Brian Foster
2019-01-18  6:12   ` Allison Henderson
2019-01-18 11:50   ` Christoph Hellwig
2019-01-17 19:20 ` [PATCH v2 4/5] xfs: remove superfluous writeback mapping eof trimming Brian Foster
2019-01-18  6:48   ` Allison Henderson
2019-01-18 11:25     ` Brian Foster
2019-01-18 11:50   ` Christoph Hellwig
2019-01-17 19:20 ` [PATCH v2 5/5] xfs: revalidate imap properly before writeback delalloc conversion Brian Foster
2019-01-18  6:58   ` Allison Henderson
2019-01-18 11:59   ` Christoph Hellwig
2019-01-18 16:31     ` Christoph Hellwig
2019-01-18 18:39       ` Brian Foster
2019-01-20 12:45         ` Brian Foster
2019-01-21 15:51           ` Christoph Hellwig
2019-01-21 17:43             ` Brian Foster
2019-01-21 15:48         ` Christoph Hellwig
2019-01-21 17:42           ` Brian Foster
2019-01-22 17:14             ` Christoph Hellwig
2019-01-22 18:13               ` Brian Foster
2019-01-23 18:14                 ` Christoph Hellwig
2019-01-23 18:40                   ` Brian Foster
