* [PATCH 0/4] xfs: properly invalidate cached writeback mapping
@ 2019-01-11 12:30 Brian Foster
  2019-01-11 12:30 ` [PATCH 1/4] xfs: eof trim writeback mapping as soon as it is cached Brian Foster
                   ` (4 more replies)
  0 siblings, 5 replies; 21+ messages in thread
From: Brian Foster @ 2019-01-11 12:30 UTC (permalink / raw)
  To: linux-xfs

Hi all,

This series attempts to fix the stale writepage mapping problem in XFS.
The problem is essentially that ->writepages() caches the current extent
across multiple writepage instances and in certain circumstances the
cached mapping can be made invalid by concurrent filesystem operations.
For example, even with the current EOF trim band-aid for dealing with
post-eof speculative preallocation, a truncate+append sequence that
happens to race with background writeback can lead to a writepage to an
incorrect location.

Since we already have an xfs_ifork change/sequence number mechanism in
place, we reuse that to invalidate cached writeback mappings any time
the associated data fork has changed. Note that while certain workloads
might lead to a high frequency of spurious invalidations (e.g.,
with allocsize=4k mounts, files with a predetermined size such as vdisk
images, etc.), I've not been able to reproduce any noticeable effects at
a user level. See the patch 3 commit log description for further
discussion.

If we do run into use cases and workloads for which this is a problem, I
think there are options to further restrict seqno changing events (or
use multiple counters for subsets of change events) for less frequent
invalidations. For example, a sequence count that only tracks block
removals may still be sufficient to preserve coherency of cached
writeback mappings. Since this is all handwavy and theoretical, I opted
to keep the code simple and only deal with this should the need arise.

Patch 1 is a stable fix for the initial EOF trim patch. Patches 2-4
tweak the fork seqno mechanism to work for data forks, use it to
invalidate the cached writeback map and remove the EOF trim mechanism.
This has been tested via xfstests on multiple FSB sizes and fsx without
any explosions.

Thoughts, reviews, flames appreciated.

Brian

Brian Foster (4):
  xfs: eof trim writeback mapping as soon as it is cached
  xfs: update fork seq counter on data fork changes
  xfs: validate writeback mapping using data fork seq counter
  xfs: remove superfluous writeback mapping eof trimming

 fs/xfs/libxfs/xfs_bmap.c       | 11 -----------
 fs/xfs/libxfs/xfs_bmap.h       |  1 -
 fs/xfs/libxfs/xfs_iext_tree.c  | 13 ++++++-------
 fs/xfs/libxfs/xfs_inode_fork.h |  2 +-
 fs/xfs/xfs_aops.c              | 21 ++++++---------------
 fs/xfs/xfs_iomap.c             |  4 ++--
 6 files changed, 15 insertions(+), 37 deletions(-)

-- 
2.17.2


* [PATCH 1/4] xfs: eof trim writeback mapping as soon as it is cached
  2019-01-11 12:30 [PATCH 0/4] xfs: properly invalidate cached writeback mapping Brian Foster
@ 2019-01-11 12:30 ` Brian Foster
  2019-01-16 13:35     ` Sasha Levin
  2019-01-11 12:30 ` [PATCH 2/4] xfs: update fork seq counter on data fork changes Brian Foster
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 21+ messages in thread
From: Brian Foster @ 2019-01-11 12:30 UTC (permalink / raw)
  To: linux-xfs; +Cc: stable

The cached writeback mapping is EOF trimmed to try and avoid races
between post-eof block management and writeback that result in
sending cached data to a stale location. The cached mapping is
currently trimmed on the validation check, which leaves a race
window between the time the mapping is cached and when it is trimmed
against the current inode size.

For example, if a new mapping is cached by delalloc conversion on a
blocksize == page size fs, we could cycle various locks, perform
memory allocations, etc. in the writeback codepath before the
associated mapping is eventually trimmed to i_size. This leaves
enough time for a post-eof truncate and file append before the
cached mapping is trimmed. The former event essentially invalidates
a range of the cached mapping and the latter bumps the inode size
such that the trim on the next writepage event won't trim all of the
invalid blocks. fstest generic/464 reproduces this scenario
occasionally and causes a lost writeback and stale delalloc blocks
warning on inode inactivation.

To work around this problem, trim the cached writeback mapping as
soon as it is cached in addition to on subsequent validation checks.
This is a minor tweak to tighten the race window as much as possible
until a proper invalidation mechanism is available.

Fixes: 40214d128e07 ("xfs: trim writepage mapping to within eof")
Cc: <stable@vger.kernel.org> # v4.14+
Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_aops.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 338b9d9984e0..d9048bcea49c 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -449,6 +449,7 @@ xfs_map_blocks(
 	}
 
 	wpc->imap = imap;
+	xfs_trim_extent_eof(&wpc->imap, ip);
 	trace_xfs_map_blocks_found(ip, offset, count, wpc->io_type, &imap);
 	return 0;
 allocate_blocks:
@@ -459,6 +460,7 @@ xfs_map_blocks(
 	ASSERT(whichfork == XFS_COW_FORK || cow_fsb == NULLFILEOFF ||
 	       imap.br_startoff + imap.br_blockcount <= cow_fsb);
 	wpc->imap = imap;
+	xfs_trim_extent_eof(&wpc->imap, ip);
 	trace_xfs_map_blocks_alloc(ip, offset, count, wpc->io_type, &imap);
 	return 0;
 }
-- 
2.17.2


* [PATCH 2/4] xfs: update fork seq counter on data fork changes
  2019-01-11 12:30 [PATCH 0/4] xfs: properly invalidate cached writeback mapping Brian Foster
  2019-01-11 12:30 ` [PATCH 1/4] xfs: eof trim writeback mapping as soon as it is cached Brian Foster
@ 2019-01-11 12:30 ` Brian Foster
  2019-01-17 14:41   ` Christoph Hellwig
  2019-01-11 12:30 ` [PATCH 3/4] xfs: validate writeback mapping using data fork seq counter Brian Foster
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 21+ messages in thread
From: Brian Foster @ 2019-01-11 12:30 UTC (permalink / raw)
  To: linux-xfs

The sequence counter in the xfs_ifork structure is only updated on
COW forks. This is because the counter is currently only used to
optimize out repetitive COW fork checks at writeback time.

Tweak the extent code to update the seq counter regardless of the
fork type in preparation for using this counter on data forks as
well.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_iext_tree.c  | 13 ++++++-------
 fs/xfs/libxfs/xfs_inode_fork.h |  2 +-
 2 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_iext_tree.c b/fs/xfs/libxfs/xfs_iext_tree.c
index 771dd072015d..bc690f2409fa 100644
--- a/fs/xfs/libxfs/xfs_iext_tree.c
+++ b/fs/xfs/libxfs/xfs_iext_tree.c
@@ -614,16 +614,15 @@ xfs_iext_realloc_root(
 }
 
 /*
- * Increment the sequence counter if we are on a COW fork.  This allows
- * the writeback code to skip looking for a COW extent if the COW fork
- * hasn't changed.  We use WRITE_ONCE here to ensure the update to the
- * sequence counter is seen before the modifications to the extent
- * tree itself take effect.
+ * Increment the sequence counter on extent tree changes. If we are on a COW
+ * fork, this allows the writeback code to skip looking for a COW extent if the
+ * COW fork hasn't changed. We use WRITE_ONCE here to ensure the update to the
+ * sequence counter is seen before the modifications to the extent tree itself
+ * take effect.
  */
 static inline void xfs_iext_inc_seq(struct xfs_ifork *ifp, int state)
 {
-	if (state & BMAP_COWFORK)
-		WRITE_ONCE(ifp->if_seq, READ_ONCE(ifp->if_seq) + 1);
+	WRITE_ONCE(ifp->if_seq, READ_ONCE(ifp->if_seq) + 1);
 }
 
 void
diff --git a/fs/xfs/libxfs/xfs_inode_fork.h b/fs/xfs/libxfs/xfs_inode_fork.h
index 60361d2d74a1..00c62ce170d0 100644
--- a/fs/xfs/libxfs/xfs_inode_fork.h
+++ b/fs/xfs/libxfs/xfs_inode_fork.h
@@ -14,7 +14,7 @@ struct xfs_dinode;
  */
 struct xfs_ifork {
 	int			if_bytes;	/* bytes in if_u1 */
-	unsigned int		if_seq;		/* cow fork mod counter */
+	unsigned int		if_seq;		/* fork mod counter */
 	struct xfs_btree_block	*if_broot;	/* file's incore btree root */
 	short			if_broot_bytes;	/* bytes allocated for root */
 	unsigned char		if_flags;	/* per-fork flags */
-- 
2.17.2


* [PATCH 3/4] xfs: validate writeback mapping using data fork seq counter
  2019-01-11 12:30 [PATCH 0/4] xfs: properly invalidate cached writeback mapping Brian Foster
  2019-01-11 12:30 ` [PATCH 1/4] xfs: eof trim writeback mapping as soon as it is cached Brian Foster
  2019-01-11 12:30 ` [PATCH 2/4] xfs: update fork seq counter on data fork changes Brian Foster
@ 2019-01-11 12:30 ` Brian Foster
  2019-01-13 21:49   ` Dave Chinner
  2019-01-11 12:30 ` [PATCH 4/4] xfs: remove superfluous writeback mapping eof trimming Brian Foster
  2019-01-11 13:31 ` [PATCH] tests/generic: test writepage cached mapping validity Brian Foster
  4 siblings, 1 reply; 21+ messages in thread
From: Brian Foster @ 2019-01-11 12:30 UTC (permalink / raw)
  To: linux-xfs

The writeback code caches the current extent mapping across multiple
xfs_do_writepage() calls to avoid repeated lookups for sequential
pages backed by the same extent. This is known to be slightly racy
with extent fork changes in certain difficult to reproduce
scenarios. The cached extent is trimmed to within EOF to help avoid
the most common vector for this problem via speculative
preallocation management, but this is a band-aid that does not
address the fundamental problem.

Now that we have an xfs_ifork sequence counter mechanism used to
facilitate COW writeback, we can use the same mechanism to validate
consistency between the data fork and cached writeback mappings. On
its face, this is somewhat of a big hammer approach because any
change to the data fork invalidates any mapping currently cached by
a writeback in progress regardless of whether the data fork change
overlaps with the range under writeback. In practice, however, the
impact of this approach is minimal in most cases.

First, data fork changes (delayed allocations) caused by sustained
sequential buffered writes are amortized across speculative
preallocations. This means that a cached mapping won't be
invalidated by each buffered write of a common file copy workload,
but rather only on less frequent allocation events. Second, the
extent tree is always entirely in-core so an additional lookup of a
usable extent mostly costs a shared ilock cycle and in-memory tree
lookup. This means that a cached mapping reval is relatively cheap
compared to the I/O itself. Third, spurious invalidations don't
impact ioend construction. This means that even if the same extent
is revalidated multiple times across multiple writepage instances,
we still construct and submit the same size ioend (and bio) if the
blocks are physically contiguous.

Update struct xfs_writepage_ctx with a new field to hold the
sequence number of the data fork associated with the currently
cached mapping. Check the wpc seqno against the data fork when the
mapping is validated and reestablish the mapping whenever the fork
has changed since the mapping was cached. This ensures that
writeback always uses a valid extent mapping and thus prevents lost
writebacks and stale delalloc block problems.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_aops.c  | 8 ++++++--
 fs/xfs/xfs_iomap.c | 4 ++--
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index d9048bcea49c..33a1be5df99f 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -29,6 +29,7 @@
 struct xfs_writepage_ctx {
 	struct xfs_bmbt_irec    imap;
 	unsigned int		io_type;
+	unsigned int		data_seq;
 	unsigned int		cow_seq;
 	struct xfs_ioend	*ioend;
 };
@@ -347,7 +348,8 @@ xfs_map_blocks(
 	 * out that ensures that we always see the current value.
 	 */
 	imap_valid = offset_fsb >= wpc->imap.br_startoff &&
-		     offset_fsb < wpc->imap.br_startoff + wpc->imap.br_blockcount;
+		     offset_fsb < wpc->imap.br_startoff + wpc->imap.br_blockcount &&
+		     wpc->data_seq == READ_ONCE(ip->i_df.if_seq);
 	if (imap_valid &&
 	    (!xfs_inode_has_cow_data(ip) ||
 	     wpc->io_type == XFS_IO_COW ||
@@ -417,6 +419,7 @@ xfs_map_blocks(
 	 */
 	if (!xfs_iext_lookup_extent(ip, &ip->i_df, offset_fsb, &icur, &imap))
 		imap.br_startoff = end_fsb;	/* fake a hole past EOF */
+	wpc->data_seq = READ_ONCE(ip->i_df.if_seq);
 	xfs_iunlock(ip, XFS_ILOCK_SHARED);
 
 	if (imap.br_startoff > offset_fsb) {
@@ -454,7 +457,8 @@ xfs_map_blocks(
 	return 0;
 allocate_blocks:
 	error = xfs_iomap_write_allocate(ip, whichfork, offset, &imap,
-			&wpc->cow_seq);
+			whichfork == XFS_COW_FORK ?
+					 &wpc->cow_seq : &wpc->data_seq);
 	if (error)
 		return error;
 	ASSERT(whichfork == XFS_COW_FORK || cow_fsb == NULLFILEOFF ||
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 27c93b5f029d..0401e33d4e8f 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -681,7 +681,7 @@ xfs_iomap_write_allocate(
 	int		whichfork,
 	xfs_off_t	offset,
 	xfs_bmbt_irec_t *imap,
-	unsigned int	*cow_seq)
+	unsigned int	*seq)
 {
 	xfs_mount_t	*mp = ip->i_mount;
 	struct xfs_ifork *ifp = XFS_IFORK_PTR(ip, whichfork);
@@ -798,7 +798,7 @@ xfs_iomap_write_allocate(
 				goto error0;
 
 			if (whichfork == XFS_COW_FORK)
-				*cow_seq = READ_ONCE(ifp->if_seq);
+				*seq = READ_ONCE(ifp->if_seq);
 			xfs_iunlock(ip, XFS_ILOCK_EXCL);
 		}
 
-- 
2.17.2


* [PATCH 4/4] xfs: remove superfluous writeback mapping eof trimming
  2019-01-11 12:30 [PATCH 0/4] xfs: properly invalidate cached writeback mapping Brian Foster
                   ` (2 preceding siblings ...)
  2019-01-11 12:30 ` [PATCH 3/4] xfs: validate writeback mapping using data fork seq counter Brian Foster
@ 2019-01-11 12:30 ` Brian Foster
  2019-01-11 13:31 ` [PATCH] tests/generic: test writepage cached mapping validity Brian Foster
  4 siblings, 0 replies; 21+ messages in thread
From: Brian Foster @ 2019-01-11 12:30 UTC (permalink / raw)
  To: linux-xfs

Now that the cached writeback mapping is explicitly invalidated on
data fork changes, the EOF trimming band-aid is no longer necessary.
Remove xfs_trim_extent_eof() as well since it has no other users.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_bmap.c | 11 -----------
 fs/xfs/libxfs/xfs_bmap.h |  1 -
 fs/xfs/xfs_aops.c        | 15 ---------------
 3 files changed, 27 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 332eefa2700b..4c73927819c2 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3685,17 +3685,6 @@ xfs_trim_extent(
 	}
 }
 
-/* trim extent to within eof */
-void
-xfs_trim_extent_eof(
-	struct xfs_bmbt_irec	*irec,
-	struct xfs_inode	*ip)
-
-{
-	xfs_trim_extent(irec, 0, XFS_B_TO_FSB(ip->i_mount,
-					      i_size_read(VFS_I(ip))));
-}
-
 /*
  * Trim the returned map to the required bounds
  */
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index 09d3ea97cc15..b4ff710d7250 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -181,7 +181,6 @@ static inline bool xfs_bmap_is_real_extent(struct xfs_bmbt_irec *irec)
 
 void	xfs_trim_extent(struct xfs_bmbt_irec *irec, xfs_fileoff_t bno,
 		xfs_filblks_t len);
-void	xfs_trim_extent_eof(struct xfs_bmbt_irec *, struct xfs_inode *);
 int	xfs_bmap_add_attrfork(struct xfs_inode *ip, int size, int rsvd);
 int	xfs_bmap_set_attrforkoff(struct xfs_inode *ip, int size, int *version);
 void	xfs_bmap_local_to_extents_empty(struct xfs_inode *ip, int whichfork);
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 33a1be5df99f..8cc0c31d18b6 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -319,19 +319,6 @@ xfs_map_blocks(
 	bool			imap_valid;
 	int			error = 0;
 
-	/*
-	 * We have to make sure the cached mapping is within EOF to protect
-	 * against eofblocks trimming on file release leaving us with a stale
-	 * mapping. Otherwise, a page for a subsequent file extending buffered
-	 * write could get picked up by this writeback cycle and written to the
-	 * wrong blocks.
-	 *
-	 * Note that what we really want here is a generic mapping invalidation
-	 * mechanism to protect us from arbitrary extent modifying contexts, not
-	 * just eofblocks.
-	 */
-	xfs_trim_extent_eof(&wpc->imap, ip);
-
 	/*
 	 * COW fork blocks can overlap data fork blocks even if the blocks
 	 * aren't shared.  COW I/O always takes precedent, so we must always
@@ -452,7 +439,6 @@ xfs_map_blocks(
 	}
 
 	wpc->imap = imap;
-	xfs_trim_extent_eof(&wpc->imap, ip);
 	trace_xfs_map_blocks_found(ip, offset, count, wpc->io_type, &imap);
 	return 0;
 allocate_blocks:
@@ -464,7 +450,6 @@ xfs_map_blocks(
 	ASSERT(whichfork == XFS_COW_FORK || cow_fsb == NULLFILEOFF ||
 	       imap.br_startoff + imap.br_blockcount <= cow_fsb);
 	wpc->imap = imap;
-	xfs_trim_extent_eof(&wpc->imap, ip);
 	trace_xfs_map_blocks_alloc(ip, offset, count, wpc->io_type, &imap);
 	return 0;
 }
-- 
2.17.2


* [PATCH] tests/generic: test writepage cached mapping validity
  2019-01-11 12:30 [PATCH 0/4] xfs: properly invalidate cached writeback mapping Brian Foster
                   ` (3 preceding siblings ...)
  2019-01-11 12:30 ` [PATCH 4/4] xfs: remove superfluous writeback mapping eof trimming Brian Foster
@ 2019-01-11 13:31 ` Brian Foster
  2019-01-14  9:30   ` Eryu Guan
  4 siblings, 1 reply; 21+ messages in thread
From: Brian Foster @ 2019-01-11 13:31 UTC (permalink / raw)
  To: fstests; +Cc: linux-xfs

XFS has a bug where page writeback can end up sending data to the
wrong location due to a stale, cached file mapping. Add a test to
trigger this problem by racing background writeback with a
truncate/rewrite of the final page of the file.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---

Hi all,

This is a resend of an old post[1] that never quite made it upstream. It
wasn't a big deal at the time because we didn't really have a proper fix
for the problem. I'm resending now because there is a proposed fix[2].

I've verified that this still reproduces the problem and no longer fails
with the fix applied (in hundreds of iters). Note that reproduction may
require many iterations. It took me anywhere from 5 to 30 or so on the
box I tested, which I think is reasonable for the tradeoff of a fairly
quick test. There was some discussion on the original post around making
the test run longer for a more reliable reproducer, but I'm not sure how
valuable that is given this is a targeted regression test. Thoughts
appreciated.

Brian

[1] https://marc.info/?l=fstests&m=150902929900510&w=2
[2] https://marc.info/?l=linux-xfs&m=154721212321112&w=2

 tests/generic/999     | 94 +++++++++++++++++++++++++++++++++++++++++++
 tests/generic/999.out |  2 +
 tests/generic/group   |  1 +
 3 files changed, 97 insertions(+)
 create mode 100755 tests/generic/999
 create mode 100644 tests/generic/999.out

diff --git a/tests/generic/999 b/tests/generic/999
new file mode 100755
index 00000000..9e56a1e0
--- /dev/null
+++ b/tests/generic/999
@@ -0,0 +1,94 @@
+#! /bin/bash
+# FS QA Test 999
+#
+# Test XFS page writeback code for races with the cached file mapping. XFS
+# caches the file -> block mapping for a full extent once it is initially looked
+# up. The cached mapping is used for all subsequent pages in the same writeback
+# cycle that cover the associated extent. Under certain conditions, it is
+# possible for concurrent operations on the file to invalidate the cached
+# mapping without the knowledge of writeback. Writeback ends up sending I/O to a
+# partly stale mapping and potentially leaving delalloc blocks in the current
+# mapping unconverted.
+#
+#-----------------------------------------------------------------------
+# Copyright (c) 2017 Red Hat, Inc.  All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#-----------------------------------------------------------------------
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1	# failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+	cd /
+	rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs generic
+_supported_os Linux
+_require_scratch
+_require_test_program "feature"
+
+_scratch_mkfs >> $seqres.full 2>&1 || _fail "mkfs failed"
+_scratch_mount || _fail "mount failed"
+
+file=$SCRATCH_MNT/file
+filesize=$((1024 * 1024 * 32))
+pagesize=`src/feature -s`
+truncsize=$((filesize - pagesize))
+
+for i in $(seq 0 15); do
+	# Truncate the file and fsync to persist the final size on-disk. This is
+	# required so the subsequent truncate will not wait on writeback.
+	$XFS_IO_PROG -fc "truncate 0" $file
+	$XFS_IO_PROG -c "truncate $filesize" -c fsync $file
+
+	# create a small enough delalloc extent to likely be contiguous
+	$XFS_IO_PROG -c "pwrite 0 $filesize" $file >> $seqres.full 2>&1
+
+	# Start writeback and a racing truncate and rewrite of the final page.
+	$XFS_IO_PROG -c "sync_range -w 0 0" $file &
+	sync_pid=$!
+	$XFS_IO_PROG -c "truncate $truncsize" \
+		     -c "pwrite $truncsize $pagesize" $file >> $seqres.full 2>&1
+
+	# If the test fails, the most likely outcome is an sb_fdblocks mismatch
+	# and/or an associated delalloc assert failure on inode reclaim. Cycle
+	# the mount to trigger detection.
+	wait $sync_pid
+	_scratch_cycle_mount || _fail "mount failed"
+done
+
+echo Silence is golden
+
+# success, all done
+status=0
+exit
diff --git a/tests/generic/999.out b/tests/generic/999.out
new file mode 100644
index 00000000..3b276ca8
--- /dev/null
+++ b/tests/generic/999.out
@@ -0,0 +1,2 @@
+QA output created by 999
+Silence is golden
diff --git a/tests/generic/group b/tests/generic/group
index ea5aa7aa..ce165981 100644
--- a/tests/generic/group
+++ b/tests/generic/group
@@ -525,3 +525,4 @@
 520 auto quick log
 521 soak long_rw
 522 soak long_rw
+999 auto quick
-- 
2.17.2


* Re: [PATCH 3/4] xfs: validate writeback mapping using data fork seq counter
  2019-01-11 12:30 ` [PATCH 3/4] xfs: validate writeback mapping using data fork seq counter Brian Foster
@ 2019-01-13 21:49   ` Dave Chinner
  2019-01-14 15:34     ` Brian Foster
  0 siblings, 1 reply; 21+ messages in thread
From: Dave Chinner @ 2019-01-13 21:49 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Fri, Jan 11, 2019 at 07:30:31AM -0500, Brian Foster wrote:
> The writeback code caches the current extent mapping across multiple
> xfs_do_writepage() calls to avoid repeated lookups for sequential
> pages backed by the same extent. This is known to be slightly racy
> with extent fork changes in certain difficult to reproduce
> scenarios. The cached extent is trimmed to within EOF to help avoid
> the most common vector for this problem via speculative
> preallocation management, but this is a band-aid that does not
> address the fundamental problem.
> 
> Now that we have an xfs_ifork sequence counter mechanism used to
> facilitate COW writeback, we can use the same mechanism to validate
> consistency between the data fork and cached writeback mappings. On
> its face, this is somewhat of a big hammer approach because any
> change to the data fork invalidates any mapping currently cached by
> a writeback in progress regardless of whether the data fork change
> overlaps with the range under writeback. In practice, however, the
> impact of this approach is minimal in most cases.
> 
> First, data fork changes (delayed allocations) caused by sustained
> sequential buffered writes are amortized across speculative
> preallocations. This means that a cached mapping won't be
> invalidated by each buffered write of a common file copy workload,
> but rather only on less frequent allocation events. Second, the
> extent tree is always entirely in-core so an additional lookup of a
> usable extent mostly costs a shared ilock cycle and in-memory tree
> lookup. This means that a cached mapping reval is relatively cheap
> compared to the I/O itself. Third, spurious invalidations don't
> impact ioend construction. This means that even if the same extent
> is revalidated multiple times across multiple writepage instances,
> we still construct and submit the same size ioend (and bio) if the
> blocks are physically contiguous.
> 
> Update struct xfs_writepage_ctx with a new field to hold the
> sequence number of the data fork associated with the currently
> cached mapping. Check the wpc seqno against the data fork when the
> mapping is validated and reestablish the mapping whenever the fork
> has changed since the mapping was cached. This ensures that
> writeback always uses a valid extent mapping and thus prevents lost
> writebacks and stale delalloc block problems.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/xfs_aops.c  | 8 ++++++--
>  fs/xfs/xfs_iomap.c | 4 ++--
>  2 files changed, 8 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index d9048bcea49c..33a1be5df99f 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -29,6 +29,7 @@
>  struct xfs_writepage_ctx {
>  	struct xfs_bmbt_irec    imap;
>  	unsigned int		io_type;
> +	unsigned int		data_seq;
>  	unsigned int		cow_seq;
>  	struct xfs_ioend	*ioend;
>  };
> @@ -347,7 +348,8 @@ xfs_map_blocks(
>  	 * out that ensures that we always see the current value.
>  	 */
>  	imap_valid = offset_fsb >= wpc->imap.br_startoff &&
> -		     offset_fsb < wpc->imap.br_startoff + wpc->imap.br_blockcount;
> +		     offset_fsb < wpc->imap.br_startoff + wpc->imap.br_blockcount &&
> +		     wpc->data_seq == READ_ONCE(ip->i_df.if_seq);
>  	if (imap_valid &&
>  	    (!xfs_inode_has_cow_data(ip) ||
>  	     wpc->io_type == XFS_IO_COW ||

I suspect this next "if (imap_valid) ..." logic needs to be updated,
too. i.e. the next line is checking if the cow_seq has not changed.

i.e. I think wrapping this up in a helper (again!) might make more
sense:

static bool
xfs_imap_valid(
	struct xfs_inode	*ip,
	struct xfs_writepage_ctx *wpc,
	xfs_fileoff_t		offset_fsb)
{
	if (offset_fsb < wpc->imap.br_startoff)
		return false;
	if (offset_fsb >= wpc->imap.br_startoff + wpc->imap.br_blockcount)
		return false;
	if (wpc->data_seq != READ_ONCE(ip->i_df.if_seq))
		return false;
	if (!xfs_inode_has_cow_data(ip))
		return true;
	if (wpc->io_type != XFS_IO_COW)
		return true;
	if (wpc->cow_seq != READ_ONCE(ip->i_cowfp->if_seq))
		return false;
	return true;
}

and then put the shutdown check before we check the map for validity
(i.e. don't continue to write to the cached map after a shutdown has
been triggered):

	if (XFS_FORCED_SHUTDOWN(mp))
		return -EIO;

	if (xfs_imap_valid(ip, wpc, offset_fsb))
		return 0;


> @@ -417,6 +419,7 @@ xfs_map_blocks(
>  	 */
>  	if (!xfs_iext_lookup_extent(ip, &ip->i_df, offset_fsb, &icur, &imap))
>  		imap.br_startoff = end_fsb;	/* fake a hole past EOF */
> +	wpc->data_seq = READ_ONCE(ip->i_df.if_seq);
>  	xfs_iunlock(ip, XFS_ILOCK_SHARED);
>  
>  	if (imap.br_startoff > offset_fsb) {
> @@ -454,7 +457,8 @@ xfs_map_blocks(
>  	return 0;
>  allocate_blocks:
>  	error = xfs_iomap_write_allocate(ip, whichfork, offset, &imap,
> -			&wpc->cow_seq);
> +			whichfork == XFS_COW_FORK ?
> +					 &wpc->cow_seq : &wpc->data_seq);
>  	if (error)
>  		return error;
>  	ASSERT(whichfork == XFS_COW_FORK || cow_fsb == NULLFILEOFF ||
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 27c93b5f029d..0401e33d4e8f 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -681,7 +681,7 @@ xfs_iomap_write_allocate(
>  	int		whichfork,
>  	xfs_off_t	offset,
>  	xfs_bmbt_irec_t *imap,
> -	unsigned int	*cow_seq)
> +	unsigned int	*seq)
>  {
>  	xfs_mount_t	*mp = ip->i_mount;
>  	struct xfs_ifork *ifp = XFS_IFORK_PTR(ip, whichfork);
> @@ -798,7 +798,7 @@ xfs_iomap_write_allocate(
>  				goto error0;
>  
>  			if (whichfork == XFS_COW_FORK)
> -				*cow_seq = READ_ONCE(ifp->if_seq);
> +				*seq = READ_ONCE(ifp->if_seq);
>  			xfs_iunlock(ip, XFS_ILOCK_EXCL);
>  		}

One of the things that limits xfs_iomap_write_allocate() efficiency
is the mitigations for races against truncate. i.e. the huge comment that
starts:

	       /*
		* it is possible that the extents have changed since
		* we did the read call as we dropped the ilock for a
		* while. We have to be careful about truncates or hole
		* punchs here - we are not allowed to allocate
		* non-delalloc blocks here.
....

Now that we can detect that the extents have changed in the data
fork, we can go back to allocating multiple extents per
xfs_bmapi_write() call by doing a sequence number check after we
lock the inode. If the sequence number does not match what was
passed in or returned from the previous loop, we return -EAGAIN.

Hmmm, looking at the existing -EAGAIN case, I suspect this isn't
handled correctly by xfs_map_blocks() anymore. i.e. it just returns
the error which can lead to discarding the page rather than checking
to see if there was a valid map allocated. I think there's some
followup work here (another patch series). :/

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH] tests/generic: test writepage cached mapping validity
  2019-01-11 13:31 ` [PATCH] tests/generic: test writepage cached mapping validity Brian Foster
@ 2019-01-14  9:30   ` Eryu Guan
  2019-01-14 15:34     ` Brian Foster
  2019-01-15  3:52     ` Dave Chinner
  0 siblings, 2 replies; 21+ messages in thread
From: Eryu Guan @ 2019-01-14  9:30 UTC (permalink / raw)
  To: Brian Foster; +Cc: fstests, linux-xfs

On Fri, Jan 11, 2019 at 08:31:24AM -0500, Brian Foster wrote:
> XFS has a bug where page writeback can end up sending data to the
> wrong location due to a stale, cached file mapping. Add a test to
> trigger this problem by racing background writeback with a
> truncate/rewrite of the final page of the file.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
> 
> Hi all,
> 
> This is a resend of an old post[1] that never quite made it upstream. It
> wasn't a big deal at the time because we didn't really have a proper fix
> for the problem. I'm resending now because there is a proposed fix[2].

Thanks for resending!

> 
> I've verified that this still reproduces the problem and no longer fails
> with the fix applied (in hundreds of iters). Note that reproduction may
> require many iterations. It took me anywhere from 5 to 30 or so on the
> box I tested, which I think is reasonable for the tradeoff of a fairly
> quick test. There was some discussion on the original post around making
> the test run longer for a more reliable reproducer, but I'm not sure how
> valuable that is given this is a targeted regression test. Thoughts
> appreciated.

It took me around 5 iterations to hit the corruption, so I think it's fine.

But a couple of things changed over the years :)

> 
> Brian
> 
> [1] https://marc.info/?l=fstests&m=150902929900510&w=2
> [2] https://marc.info/?l=linux-xfs&m=154721212321112&w=2
> 
>  tests/generic/999     | 94 +++++++++++++++++++++++++++++++++++++++++++
>  tests/generic/999.out |  2 +
>  tests/generic/group   |  1 +
>  3 files changed, 97 insertions(+)
>  create mode 100755 tests/generic/999
>  create mode 100644 tests/generic/999.out
> 
> diff --git a/tests/generic/999 b/tests/generic/999
> new file mode 100755
> index 00000000..9e56a1e0
> --- /dev/null
> +++ b/tests/generic/999
> @@ -0,0 +1,94 @@
> +#! /bin/bash
> +# FS QA Test 999
> +#
> +# Test XFS page writeback code for races with the cached file mapping. XFS
> +# caches the file -> block mapping for a full extent once it is initially looked
> +# up. The cached mapping is used for all subsequent pages in the same writeback
> +# cycle that cover the associated extent. Under certain conditions, it is
> +# possible for concurrent operations on the file to invalidate the cached
> +# mapping without the knowledge of writeback. Writeback ends up sending I/O to a
> +# partly stale mapping and potentially leaving delalloc blocks in the current
> +# mapping unconverted.
> +#
> +#-----------------------------------------------------------------------
> +# Copyright (c) 2017 Red Hat, Inc.  All Rights Reserved.
                   ^^^^ 2019?
> +#
> +# This program is free software; you can redistribute it and/or
> +# modify it under the terms of the GNU General Public License as
> +# published by the Free Software Foundation.
> +#
> +# This program is distributed in the hope that it would be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program; if not, write the Free Software Foundation,
> +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> +#-----------------------------------------------------------------------

And please change this to SPDX-License-Identifier.

> +#
> +
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +
> +here=`pwd`
> +tmp=/tmp/$$
> +status=1	# failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> +	cd /
> +	rm -f $tmp.*
> +}
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +
> +# remove previous $seqres.full before test
> +rm -f $seqres.full
> +
> +# real QA test starts here
> +
> +# Modify as appropriate.
> +_supported_fs generic
> +_supported_os Linux
> +_require_scratch
> +_require_test_program "feature"

_require_xfs_io_command "sync_range"

> +
> +_scratch_mkfs >> $seqres.full 2>&1 || _fail "mkfs failed"
> +_scratch_mount || _fail "mount failed"

_scratch_mount will _fail the test on failure now :)

> +
> +file=$SCRATCH_MNT/file
> +filesize=$((1024 * 1024 * 32))
> +pagesize=`src/feature -s`
> +truncsize=$((filesize - pagesize))
> +
> +for i in $(seq 0 15); do
> +	# Truncate the file and fsync to persist the final size on-disk. This is
> +	# required so the subsequent truncate will not wait on writeback.
> +	$XFS_IO_PROG -fc "truncate 0" $file
> +	$XFS_IO_PROG -c "truncate $filesize" -c fsync $file
> +
> +	# create a small enough delalloc extent to likely be contiguous
> +	$XFS_IO_PROG -c "pwrite 0 $filesize" $file >> $seqres.full 2>&1
> +
> +	# Start writeback and a racing truncate and rewrite of the final page.
> +	$XFS_IO_PROG -c "sync_range -w 0 0" $file &
> +	sync_pid=$!
> +	$XFS_IO_PROG -c "truncate $truncsize" \
> +		     -c "pwrite $truncsize $pagesize" $file >> $seqres.full 2>&1
> +
> +	# If the test fails, the most likely outcome is an sb_fdblocks mismatch
> +	# and/or an associated delalloc assert failure on inode reclaim. Cycle
> +	# the mount to trigger detection.
> +	wait $sync_pid
> +	_scratch_cycle_mount || _fail "mount failed"

And _scratch_cycle_mount will exit the test on failure as well.

Thanks,
Eryu

> +done
> +
> +echo Silence is golden
> +
> +# success, all done
> +status=0
> +exit
> diff --git a/tests/generic/999.out b/tests/generic/999.out
> new file mode 100644
> index 00000000..3b276ca8
> --- /dev/null
> +++ b/tests/generic/999.out
> @@ -0,0 +1,2 @@
> +QA output created by 999
> +Silence is golden
> diff --git a/tests/generic/group b/tests/generic/group
> index ea5aa7aa..ce165981 100644
> --- a/tests/generic/group
> +++ b/tests/generic/group
> @@ -525,3 +525,4 @@
>  520 auto quick log
>  521 soak long_rw
>  522 soak long_rw
> +999 auto quick
> -- 
> 2.17.2
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/4] xfs: validate writeback mapping using data fork seq counter
  2019-01-13 21:49   ` Dave Chinner
@ 2019-01-14 15:34     ` Brian Foster
  2019-01-14 20:57       ` Dave Chinner
  2019-01-17 14:47       ` Christoph Hellwig
  0 siblings, 2 replies; 21+ messages in thread
From: Brian Foster @ 2019-01-14 15:34 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Mon, Jan 14, 2019 at 08:49:05AM +1100, Dave Chinner wrote:
> On Fri, Jan 11, 2019 at 07:30:31AM -0500, Brian Foster wrote:
> > The writeback code caches the current extent mapping across multiple
> > xfs_do_writepage() calls to avoid repeated lookups for sequential
> > pages backed by the same extent. This is known to be slightly racy
> > with extent fork changes in certain difficult to reproduce
> > scenarios. The cached extent is trimmed to within EOF to help avoid
> > the most common vector for this problem via speculative
> > preallocation management, but this is a band-aid that does not
> > address the fundamental problem.
> > 
> > Now that we have an xfs_ifork sequence counter mechanism used to
> > facilitate COW writeback, we can use the same mechanism to validate
> > consistency between the data fork and cached writeback mappings. On
> > its face, this is somewhat of a big hammer approach because any
> > change to the data fork invalidates any mapping currently cached by
> > a writeback in progress regardless of whether the data fork change
> > overlaps with the range under writeback. In practice, however, the
> > impact of this approach is minimal in most cases.
> > 
> > First, data fork changes (delayed allocations) caused by sustained
> > sequential buffered writes are amortized across speculative
> > preallocations. This means that a cached mapping won't be
> > invalidated by each buffered write of a common file copy workload,
> > but rather only on less frequent allocation events. Second, the
> > extent tree is always entirely in-core so an additional lookup of a
> > usable extent mostly costs a shared ilock cycle and in-memory tree
> > lookup. This means that a cached mapping reval is relatively cheap
> > compared to the I/O itself. Third, spurious invalidations don't
> > impact ioend construction. This means that even if the same extent
> > is revalidated multiple times across multiple writepage instances,
> > we still construct and submit the same size ioend (and bio) if the
> > blocks are physically contiguous.
> > 
> > Update struct xfs_writepage_ctx with a new field to hold the
> > sequence number of the data fork associated with the currently
> > cached mapping. Check the wpc seqno against the data fork when the
> > mapping is validated and reestablish the mapping whenever the fork
> > has changed since the mapping was cached. This ensures that
> > writeback always uses a valid extent mapping and thus prevents lost
> > writebacks and stale delalloc block problems.
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> >  fs/xfs/xfs_aops.c  | 8 ++++++--
> >  fs/xfs/xfs_iomap.c | 4 ++--
> >  2 files changed, 8 insertions(+), 4 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > index d9048bcea49c..33a1be5df99f 100644
> > --- a/fs/xfs/xfs_aops.c
> > +++ b/fs/xfs/xfs_aops.c
> > @@ -29,6 +29,7 @@
> >  struct xfs_writepage_ctx {
> >  	struct xfs_bmbt_irec    imap;
> >  	unsigned int		io_type;
> > +	unsigned int		data_seq;
> >  	unsigned int		cow_seq;
> >  	struct xfs_ioend	*ioend;
> >  };
> > @@ -347,7 +348,8 @@ xfs_map_blocks(
> >  	 * out that ensures that we always see the current value.
> >  	 */
> >  	imap_valid = offset_fsb >= wpc->imap.br_startoff &&
> > -		     offset_fsb < wpc->imap.br_startoff + wpc->imap.br_blockcount;
> > +		     offset_fsb < wpc->imap.br_startoff + wpc->imap.br_blockcount &&
> > +		     wpc->data_seq == READ_ONCE(ip->i_df.if_seq);
> >  	if (imap_valid &&
> >  	    (!xfs_inode_has_cow_data(ip) ||
> >  	     wpc->io_type == XFS_IO_COW ||
> 
> I suspect this next "if (imap_valid) ..." logic needs to be updated,
> too. i.e. the next line is checking if the cow_seq has not changed.
> 

I'm not quite sure what you're getting at here. By "next," do you mean
the one you've quoted or the post-lock cycle check (a re-check at the
latter point makes sense to me)? Otherwise the imap check is
intentionally distinct from the COW seq check because these control
independent bits of subsequent logic (in certain cases).

That said, now that I look at it again this logic is rather convoluted
because imap_valid doesn't necessarily refer to the data fork (e.g., if
->imap is a cow fork extent). So yeah, this all should probably be
refactored...

> i.e. I think wrapping this up in a helper (again!) might make more
> sense:
> 
> static bool
> xfs_imap_valid(
> 	struct xfs_inode	*ip,
> 	struct xfs_writepage_ctx *wpc,
> 	xfs_fileoff_t		offset_fsb)
> {
> 	if (offset_fsb < wpc->imap.br_startoff)
> 		return false;
> 	if (offset_fsb >= wpc->imap.br_startoff + wpc->imap.br_blockcount)
> 		return false;
> 	if (wpc->data_seq != READ_ONCE(ip->i_df.if_seq))
> 		return false;
> 	if (!xfs_inode_has_cow_data(ip))
> 		return true;
> 	if (wpc->io_type != XFS_IO_COW)
> 		return true;
> 	if (wpc->cow_seq != READ_ONCE(ip->i_cowfp->if_seq))
> 		return false;
> 	return true;
> }
> 

I think you mean 'if (io_type == XFS_IO_COW)'? Otherwise this seems
reasonable, though I think the logic suffers a bit from the same problem
as above. How about with the following tweaks (and comments to try and
make this easier to follow)?

static bool
xfs_imap_valid(
	struct xfs_inode	*ip,
	struct xfs_writepage_ctx *wpc,
	xfs_fileoff_t		offset_fsb)
{
	if (offset_fsb < wpc->imap.br_startoff)
		return false;
	if (offset_fsb >= wpc->imap.br_startoff + wpc->imap.br_blockcount)
		return false;
	/* a valid range is sufficient for COW mappings */
	if (wpc->io_type == XFS_IO_COW)
		return true;

	/*
	 * Not a COW mapping. Revalidate across changes in either the
	 * data or COW fork ...
	 */
	if (wpc->data_seq != READ_ONCE(ip->i_df.if_seq))
		return false;
	if (xfs_inode_has_cow_data(ip) &&
	    wpc->cow_seq != READ_ONCE(ip->i_cowfp->if_seq))
		return false;

	return true;
}

I think that technically we could skip the == XFS_IO_COW check and we'd
just be more conservative by essentially applying the same fork change
logic we are for the data fork, but that's not really the intent of this
patch.
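
For anyone who wants to poke at the ordering of these checks outside the
kernel, below is a minimal userspace sketch of the same logic. It is only a
model: the type and function names (fork_model, inode_model, wpc_model,
imap_valid_model) are made up for illustration and are not XFS definitions,
and plain loads stand in for READ_ONCE().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel types; not the real XFS structs. */
enum io_type_model { IO_DATA, IO_COW };

struct fork_model {
	unsigned int	seq;		/* bumped on every extent list change */
};

struct inode_model {
	struct fork_model data_fork;
	struct fork_model cow_fork;
	bool		has_cow_data;	/* models xfs_inode_has_cow_data() */
};

struct wpc_model {
	uint64_t	startoff;	/* first file block of cached mapping */
	uint64_t	blockcount;	/* length of cached mapping in blocks */
	enum io_type_model io_type;
	unsigned int	data_seq;	/* data fork seq sampled at cache time */
	unsigned int	cow_seq;	/* COW fork seq sampled at cache time */
};

/* Same order of checks as the helper sketched above. */
static bool
imap_valid_model(
	const struct inode_model	*ip,
	const struct wpc_model		*wpc,
	uint64_t			offset_fsb)
{
	assert(ip && wpc);

	/* the offset must fall inside the cached range */
	if (offset_fsb < wpc->startoff)
		return false;
	if (offset_fsb >= wpc->startoff + wpc->blockcount)
		return false;
	/* a valid range is sufficient for COW mappings */
	if (wpc->io_type == IO_COW)
		return true;

	/* not a COW mapping: any change to either fork invalidates it */
	if (wpc->data_seq != ip->data_fork.seq)
		return false;
	if (ip->has_cow_data && wpc->cow_seq != ip->cow_fork.seq)
		return false;

	return true;
}
```

Bumping data_fork.seq after the mapping is cached makes a data mapping
invalid even when the offset is still in range, while a mapping with
io_type == IO_COW stays valid, which is the intentional asymmetry described
above.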

> and then put the shutdown check before we check the map for validity
> (i.e. don't continue to write to the cached map after a shutdown has
> been triggered):
> 

Ack.

> 	if (XFS_FORCED_SHUTDOWN(mp))
> 		return -EIO;
> 
> 	if (xfs_imap_valid(ip, wpc, offset_fsb))
> 		return 0;
> 
> 
> > @@ -417,6 +419,7 @@ xfs_map_blocks(
> >  	 */
> >  	if (!xfs_iext_lookup_extent(ip, &ip->i_df, offset_fsb, &icur, &imap))
> >  		imap.br_startoff = end_fsb;	/* fake a hole past EOF */
> > +	wpc->data_seq = READ_ONCE(ip->i_df.if_seq);
> >  	xfs_iunlock(ip, XFS_ILOCK_SHARED);
> >  
> >  	if (imap.br_startoff > offset_fsb) {
> > @@ -454,7 +457,8 @@ xfs_map_blocks(
> >  	return 0;
> >  allocate_blocks:
> >  	error = xfs_iomap_write_allocate(ip, whichfork, offset, &imap,
> > -			&wpc->cow_seq);
> > +			whichfork == XFS_COW_FORK ?
> > +					 &wpc->cow_seq : &wpc->data_seq);
> >  	if (error)
> >  		return error;
> >  	ASSERT(whichfork == XFS_COW_FORK || cow_fsb == NULLFILEOFF ||
> > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> > index 27c93b5f029d..0401e33d4e8f 100644
> > --- a/fs/xfs/xfs_iomap.c
> > +++ b/fs/xfs/xfs_iomap.c
> > @@ -681,7 +681,7 @@ xfs_iomap_write_allocate(
> >  	int		whichfork,
> >  	xfs_off_t	offset,
> >  	xfs_bmbt_irec_t *imap,
> > -	unsigned int	*cow_seq)
> > +	unsigned int	*seq)
> >  {
> >  	xfs_mount_t	*mp = ip->i_mount;
> >  	struct xfs_ifork *ifp = XFS_IFORK_PTR(ip, whichfork);
> > @@ -798,7 +798,7 @@ xfs_iomap_write_allocate(
> >  				goto error0;
> >  
> >  			if (whichfork == XFS_COW_FORK)
> > -				*cow_seq = READ_ONCE(ifp->if_seq);
> > +				*seq = READ_ONCE(ifp->if_seq);
> >  			xfs_iunlock(ip, XFS_ILOCK_EXCL);
> >  		}
> 
> One of the things that limits xfs_iomap_write_allocate() efficiency
> is the mitigations for races against truncate. i.e. the huge comment that
> starts:
> 
> 	       /*
> 		* it is possible that the extents have changed since
> 		* we did the read call as we dropped the ilock for a
> 		* while. We have to be careful about truncates or hole
> 		* punchs here - we are not allowed to allocate
> 		* non-delalloc blocks here.
> ....
> 

Hmm, Ok... so this fix goes a ways back to commit e4143a1cf5 ("[XFS] Fix
transaction overrun during writeback."). It sounds like the issue was an
instance of the "attempt to convert delalloc blocks ends up doing
physical allocation" problem (which results in a transaction overrun).

> Now that we can detect that the extents have changed in the data
> fork, we can go back to allocating multiple extents per
> xfs_bmapi_write() call by doing a sequence number check after we
> lock the inode. If the sequence number does not match what was
> passed in or returned from the previous loop, we return -EAGAIN.
> 

I'm not familiar with this particular instance of this problem (we've
certainly had other instances of the same thing), but the surrounding
context of this code has changed quite a bit. Most notable is
XFS_BMAPI_DELALLOC, which was intended to mitigate this problem by
disallowing real allocation in such calls.

> Hmmm, looking at the existing -EAGAIN case, I suspect this isn't
> handled correctly by xfs_map_blocks() anymore. i.e. it just returns
> the error which can lead to discarding the page rather than checking
> to see if there was a valid map allocated. I think there's some
> followup work here (another patch series). :/
> 

Ok. At the moment, that error looks like it should only happen if we're
past EOF..? Either way, the XFS_BMAPI_DELALLOC thing still can result in
an error so it probably makes sense to tie a seqno check to -EAGAIN and
handle it properly in the caller.
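
As a rough illustration of what "tie a seqno check to -EAGAIN" could look
like, here is a userspace sketch, not the actual patch. Everything in it is
an assumption made for the example: the names fork_model,
convert_delalloc_model and map_blocks_model, the retry limit of 3, and the
choice to hand the current sequence number back to the caller on -EAGAIN.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for an inode fork with a change counter. */
struct fork_model {
	unsigned int	seq;
};

/*
 * Model of the allocation side: refuse to convert anything if the fork
 * changed since the caller sampled *seq, and return the current value so
 * the caller can redo its lookup against an up to date extent list.
 */
static int
convert_delalloc_model(
	const struct fork_model	*ifp,
	unsigned int		*seq)
{
	assert(ifp && seq);

	if (*seq != ifp->seq) {
		*seq = ifp->seq;
		return -EAGAIN;
	}
	return 0;
}

/*
 * Model of the caller: treat -EAGAIN as "mapping went stale, look it up
 * again" instead of passing the error up and discarding the page.
 */
static int
map_blocks_model(
	const struct fork_model	*ifp,
	unsigned int		cached_seq,
	int			*retries)
{
	int			error;

	*retries = 0;
	do {
		/* a real caller would redo the extent lookup here */
		error = convert_delalloc_model(ifp, &cached_seq);
	} while (error == -EAGAIN && ++(*retries) < 3);

	return error;
}
```

In this single-threaded model a stale sequence number costs exactly one
retry; in the real code the fork could change again between attempts, which
is why the loop is bounded here rather than open ended.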

Hmm, given that we can really only handle one extent at a time up
through the caller (as also noted in the big comment you quoted) and
that this series introduces more aggressive revalidation as it is, I am
wondering what real value there is in doing more delalloc conversions
here than technically required. ISTM that removing some of this i_size
checking code and doing the seqno based kickback may actually be
cleaner. I'll need to have a closer look from an optimization
perspective when the correctness issues are dealt with.

I also could have sworn I removed that whichfork check from
xfs_iomap_write_allocate(), but apparently not... ;P

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] tests/generic: test writepage cached mapping validity
  2019-01-14  9:30   ` Eryu Guan
@ 2019-01-14 15:34     ` Brian Foster
  2019-01-15  3:52     ` Dave Chinner
  1 sibling, 0 replies; 21+ messages in thread
From: Brian Foster @ 2019-01-14 15:34 UTC (permalink / raw)
  To: Eryu Guan; +Cc: fstests, linux-xfs

On Mon, Jan 14, 2019 at 05:30:36PM +0800, Eryu Guan wrote:
> On Fri, Jan 11, 2019 at 08:31:24AM -0500, Brian Foster wrote:
> > XFS has a bug where page writeback can end up sending data to the
> > wrong location due to a stale, cached file mapping. Add a test to
> > trigger this problem by racing background writeback with a
> > truncate/rewrite of the final page of the file.
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> > 
> > Hi all,
> > 
> > This is a resend of an old post[1] that never quite made it upstream. It
> > wasn't a big deal at the time because we didn't really have a proper fix
> > for the problem. I'm resending now because there is a proposed fix[2].
> 
> Thanks for the resending!
> 
> > 
> > I've verified that this still reproduces the problem and no longer fails
> > with the fix applied (in hundreds of iters). Note that reproduction may
> > require many iterations. It took me anywhere from 5 to 30 or so on the
> > box I tested, which I think is reasonable for the tradeoff of a fairly
> > quick test. There was some discussion on the original post around making
> > the test run longer for a more reliable reproducer, but I'm not sure how
> > valuable that is given this is a targeted regression test. Thoughts
> > appreciated.
> 
> It took me around 5 iterations to hit the corruption, I think it's fine.
> 
> But a couple of things changed over the years :)
> 

Indeed, these changes all sound good. I'll include them in v2, thanks!

Brian

> > 
> > Brian
> > 
> > [1] https://marc.info/?l=fstests&m=150902929900510&w=2
> > [2] https://marc.info/?l=linux-xfs&m=154721212321112&w=2
> > 
> >  tests/generic/999     | 94 +++++++++++++++++++++++++++++++++++++++++++
> >  tests/generic/999.out |  2 +
> >  tests/generic/group   |  1 +
> >  3 files changed, 97 insertions(+)
> >  create mode 100755 tests/generic/999
> >  create mode 100644 tests/generic/999.out
> > 
> > diff --git a/tests/generic/999 b/tests/generic/999
> > new file mode 100755
> > index 00000000..9e56a1e0
> > --- /dev/null
> > +++ b/tests/generic/999
> > @@ -0,0 +1,94 @@
> > +#! /bin/bash
> > +# FS QA Test 999
> > +#
> > +# Test XFS page writeback code for races with the cached file mapping. XFS
> > +# caches the file -> block mapping for a full extent once it is initially looked
> > +# up. The cached mapping is used for all subsequent pages in the same writeback
> > +# cycle that cover the associated extent. Under certain conditions, it is
> > +# possible for concurrent operations on the file to invalidate the cached
> > +# mapping without the knowledge of writeback. Writeback ends up sending I/O to a
> > +# partly stale mapping and potentially leaving delalloc blocks in the current
> > +# mapping unconverted.
> > +#
> > +#-----------------------------------------------------------------------
> > +# Copyright (c) 2017 Red Hat, Inc.  All Rights Reserved.
>                    ^^^^ 2019?
> > +#
> > +# This program is free software; you can redistribute it and/or
> > +# modify it under the terms of the GNU General Public License as
> > +# published by the Free Software Foundation.
> > +#
> > +# This program is distributed in the hope that it would be useful,
> > +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> > +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > +# GNU General Public License for more details.
> > +#
> > +# You should have received a copy of the GNU General Public License
> > +# along with this program; if not, write the Free Software Foundation,
> > +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> > +#-----------------------------------------------------------------------
> 
> And please change this to SPDX-License-Identifier.
> 
> > +#
> > +
> > +seq=`basename $0`
> > +seqres=$RESULT_DIR/$seq
> > +echo "QA output created by $seq"
> > +
> > +here=`pwd`
> > +tmp=/tmp/$$
> > +status=1	# failure is the default!
> > +trap "_cleanup; exit \$status" 0 1 2 3 15
> > +
> > +_cleanup()
> > +{
> > +	cd /
> > +	rm -f $tmp.*
> > +}
> > +
> > +# get standard environment, filters and checks
> > +. ./common/rc
> > +
> > +# remove previous $seqres.full before test
> > +rm -f $seqres.full
> > +
> > +# real QA test starts here
> > +
> > +# Modify as appropriate.
> > +_supported_fs generic
> > +_supported_os Linux
> > +_require_scratch
> > +_require_test_program "feature"
> 
> _require_xfs_io_command "sync_range"
> 
> > +
> > +_scratch_mkfs >> $seqres.full 2>&1 || _fail "mkfs failed"
> > +_scratch_mount || _fail "mount failed"
> 
> _scratch_mount will _fail the test on failure now :)
> 
> > +
> > +file=$SCRATCH_MNT/file
> > +filesize=$((1024 * 1024 * 32))
> > +pagesize=`src/feature -s`
> > +truncsize=$((filesize - pagesize))
> > +
> > +for i in $(seq 0 15); do
> > +	# Truncate the file and fsync to persist the final size on-disk. This is
> > +	# required so the subsequent truncate will not wait on writeback.
> > +	$XFS_IO_PROG -fc "truncate 0" $file
> > +	$XFS_IO_PROG -c "truncate $filesize" -c fsync $file
> > +
> > +	# create a small enough delalloc extent to likely be contiguous
> > +	$XFS_IO_PROG -c "pwrite 0 $filesize" $file >> $seqres.full 2>&1
> > +
> > +	# Start writeback and a racing truncate and rewrite of the final page.
> > +	$XFS_IO_PROG -c "sync_range -w 0 0" $file &
> > +	sync_pid=$!
> > +	$XFS_IO_PROG -c "truncate $truncsize" \
> > +		     -c "pwrite $truncsize $pagesize" $file >> $seqres.full 2>&1
> > +
> > +	# If the test fails, the most likely outcome is an sb_fdblocks mismatch
> > +	# and/or an associated delalloc assert failure on inode reclaim. Cycle
> > +	# the mount to trigger detection.
> > +	wait $sync_pid
> > +	_scratch_cycle_mount || _fail "mount failed"
> 
> And _scratch_cycle_mount will exit the test on failure as well.
> 
> Thanks,
> Eryu
> 
> > +done
> > +
> > +echo Silence is golden
> > +
> > +# success, all done
> > +status=0
> > +exit
> > diff --git a/tests/generic/999.out b/tests/generic/999.out
> > new file mode 100644
> > index 00000000..3b276ca8
> > --- /dev/null
> > +++ b/tests/generic/999.out
> > @@ -0,0 +1,2 @@
> > +QA output created by 999
> > +Silence is golden
> > diff --git a/tests/generic/group b/tests/generic/group
> > index ea5aa7aa..ce165981 100644
> > --- a/tests/generic/group
> > +++ b/tests/generic/group
> > @@ -525,3 +525,4 @@
> >  520 auto quick log
> >  521 soak long_rw
> >  522 soak long_rw
> > +999 auto quick
> > -- 
> > 2.17.2
> > 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/4] xfs: validate writeback mapping using data fork seq counter
  2019-01-14 15:34     ` Brian Foster
@ 2019-01-14 20:57       ` Dave Chinner
  2019-01-15 11:26         ` Brian Foster
  2019-01-17 14:47       ` Christoph Hellwig
  1 sibling, 1 reply; 21+ messages in thread
From: Dave Chinner @ 2019-01-14 20:57 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Mon, Jan 14, 2019 at 10:34:23AM -0500, Brian Foster wrote:
> On Mon, Jan 14, 2019 at 08:49:05AM +1100, Dave Chinner wrote:
> > On Fri, Jan 11, 2019 at 07:30:31AM -0500, Brian Foster wrote:
> > > The writeback code caches the current extent mapping across multiple
> > > xfs_do_writepage() calls to avoid repeated lookups for sequential
> > > pages backed by the same extent. This is known to be slightly racy
> > > with extent fork changes in certain difficult to reproduce
> > > scenarios. The cached extent is trimmed to within EOF to help avoid
> > > the most common vector for this problem via speculative
> > > preallocation management, but this is a band-aid that does not
> > > address the fundamental problem.
> > > 
> > > Now that we have an xfs_ifork sequence counter mechanism used to
> > > facilitate COW writeback, we can use the same mechanism to validate
> > > consistency between the data fork and cached writeback mappings. On
> > > its face, this is somewhat of a big hammer approach because any
> > > change to the data fork invalidates any mapping currently cached by
> > > a writeback in progress regardless of whether the data fork change
> > > overlaps with the range under writeback. In practice, however, the
> > > impact of this approach is minimal in most cases.
> > > 
> > > First, data fork changes (delayed allocations) caused by sustained
> > > sequential buffered writes are amortized across speculative
> > > preallocations. This means that a cached mapping won't be
> > > invalidated by each buffered write of a common file copy workload,
> > > but rather only on less frequent allocation events. Second, the
> > > extent tree is always entirely in-core so an additional lookup of a
> > > usable extent mostly costs a shared ilock cycle and in-memory tree
> > > lookup. This means that a cached mapping reval is relatively cheap
> > > compared to the I/O itself. Third, spurious invalidations don't
> > > impact ioend construction. This means that even if the same extent
> > > is revalidated multiple times across multiple writepage instances,
> > > we still construct and submit the same size ioend (and bio) if the
> > > blocks are physically contiguous.
> > > 
> > > Update struct xfs_writepage_ctx with a new field to hold the
> > > sequence number of the data fork associated with the currently
> > > cached mapping. Check the wpc seqno against the data fork when the
> > > mapping is validated and reestablish the mapping whenever the fork
> > > has changed since the mapping was cached. This ensures that
> > > writeback always uses a valid extent mapping and thus prevents lost
> > > writebacks and stale delalloc block problems.
> > > 
> > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > ---
> > >  fs/xfs/xfs_aops.c  | 8 ++++++--
> > >  fs/xfs/xfs_iomap.c | 4 ++--
> > >  2 files changed, 8 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > > index d9048bcea49c..33a1be5df99f 100644
> > > --- a/fs/xfs/xfs_aops.c
> > > +++ b/fs/xfs/xfs_aops.c
> > > @@ -29,6 +29,7 @@
> > >  struct xfs_writepage_ctx {
> > >  	struct xfs_bmbt_irec    imap;
> > >  	unsigned int		io_type;
> > > +	unsigned int		data_seq;
> > >  	unsigned int		cow_seq;
> > >  	struct xfs_ioend	*ioend;
> > >  };
> > > @@ -347,7 +348,8 @@ xfs_map_blocks(
> > >  	 * out that ensures that we always see the current value.
> > >  	 */
> > >  	imap_valid = offset_fsb >= wpc->imap.br_startoff &&
> > > -		     offset_fsb < wpc->imap.br_startoff + wpc->imap.br_blockcount;
> > > +		     offset_fsb < wpc->imap.br_startoff + wpc->imap.br_blockcount &&
> > > +		     wpc->data_seq == READ_ONCE(ip->i_df.if_seq);
> > >  	if (imap_valid &&
> > >  	    (!xfs_inode_has_cow_data(ip) ||
> > >  	     wpc->io_type == XFS_IO_COW ||
> > 
> > I suspect this next "if (imap_valid) ..." logic needs to be updated,
> > too. i.e. the next line is checking if the cow_seq has not changed.
> > 
> 
> I'm not quite sure what you're getting at here. By "next," do you mean
> the one you've quoted or the post-lock cycle check (a re-check at the
> latter point makes sense to me)? Otherwise the imap check is
> intentionally distinct from the COW seq check because these control
> independent bits of subsequent logic (in certain cases).

No, I meant the next line of code that isn't in the hunk was:

	if (imap_valid &&
	    (!xfs_inode_has_cow_data(ip) ||
	     wpc->io_type == XFS_IO_COW ||
>>>>>>	     wpc->cow_seq == READ_ONCE(ip->i_cowfp->if_seq)))

The cow fork sequence number check.

> I think you mean 'if (io_type == XFS_IO_COW)'? Otherwise this seems
> reasonable, though I think the logic suffers a bit from the same problem
> as above. How about with the following tweaks (and comments to try and
> make this easier to follow)?

I misread the nested () and so got the new logic wrong. :)

> static bool
> xfs_imap_valid()
> {
> 	if (offset_fsb < wpc->imap.br_startoff)
> 		return false;
> 	if (offset_fsb >= wpc->imap.br_startoff + wpc->imap.br_blockcount)
> 		return false;
> 	/* a valid range is sufficient for COW mappings */
> 	if (wpc->io_type == XFS_IO_COW)
> 		return true;
> 
> 	/*
> 	 * Not a COW mapping. Revalidate across changes in either the
> 	 * data or COW fork ...
> 	 */
> 	if (wpc->data_seq != READ_ONCE(ip->i_df.if_seq))
> 		return false;
> 	if (xfs_inode_has_cow_data(ip) &&
> 	    wpc->cow_seq != READ_ONCE(ip->i_cowfp->if_seq))
> 		return false;
> 
> 	return true;
> }

Yup, that's what I meant. I'm glad you're on the ball right now :)

> I think that technically we could skip the == XFS_IO_COW check and we'd
> just be more conservative by essentially applying the same fork change
> logic we are for the data fork, but that's not really the intent of this
> patch.

Sure.

> > >  	xfs_mount_t	*mp = ip->i_mount;
> > >  	struct xfs_ifork *ifp = XFS_IFORK_PTR(ip, whichfork);
> > > @@ -798,7 +798,7 @@ xfs_iomap_write_allocate(
> > >  				goto error0;
> > >  
> > >  			if (whichfork == XFS_COW_FORK)
> > > -				*cow_seq = READ_ONCE(ifp->if_seq);
> > > +				*seq = READ_ONCE(ifp->if_seq);
> > >  			xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > >  		}
> > 
> > One of the things that limits xfs_iomap_write_allocate() efficiency
> > is the mitigations for races against truncate. i.e. the huge comment that
> > starts:
> > 
> > 	       /*
> > 		* it is possible that the extents have changed since
> > 		* we did the read call as we dropped the ilock for a
> > 		* while. We have to be careful about truncates or hole
> > 		* punchs here - we are not allowed to allocate
> > 		* non-delalloc blocks here.
> > ....
> > 
> 
> Hmm, Ok... so this fix goes a ways back to commit e4143a1cf5 ("[XFS] Fix
> transaction overrun during writeback."). It sounds like the issue was an
> instance of the "attempt to convert delalloc blocks ends up doing
> physical allocation" problem (which results in a transaction overrun).

Yeah, there were no delalloc blocks because they'd been truncated or
punched away between unlock/lock cycles on the inode.

> > Now that we can detect that the extents have changed in the data
> > fork, we can go back to allocating multiple extents per
> > xfs_bmapi_write() call by doing a sequence number check after we
> > lock the inode. If the sequence number does not match what was
> > passed in or returned from the previous loop, we return -EAGAIN.
> > 
> 
> I'm not familiar with this particular instance of this problem (we've
> certainly had other instances of the same thing), but the surrounding
> context of this code has changed quite a bit.

Yes, it has. The move to a single map was done a long time ago
because there weren't any other options at the time, and it was a
problem we'd been struggling to understand and sort out for years.

> Most notable is
> XFS_BMAPI_DELALLOC, which was intended to mitigate this problem by
> disallowing real allocation in such calls.

Yup. However, I've always thought of it as a bit of a hack - it's
preventing the transaction overrun when a problem occurs as opposed
to preventing the race that leads to trying to allocate over a
hole.

Essentially, though, they are both trying to address the same
problem: that the extent list can change during writeback and
writeback ends up using stale information to direct IO and/or extent
allocation.

> > Hmmm, looking at the existing -EAGAIN case, I suspect this isn't
> > handled correctly by xfs_map_blocks() anymore. i.e. it just returns
> > the error which can lead to discarding the page rather than checking
> > to see if there was a valid map allocated. I think there's some
> > followup work here (another patch series). :/
> > 
> 
> Ok. At the moment, that error looks like it should only happen if we're
> past EOF..?

Yeah, racing with truncate. The old writeback code used to have a
non-blocking feature which would handle -EAGAIN errors bubbling up
from anywhere in the writeback path. We got rid of that a long time
ago, so I suspect this has been broken for a long while.

> Either way, the XFS_BMAPI_DELALLOC thing still can result in
> an error so it probably makes sense to tie a seqno check to -EAGAIN and
> handle it properly in the caller.

*nod*

> Hmm, given that we can really only handle one extent at a time up
> through the caller (as also noted in the big comment you quoted) and
> that this series introduces more aggressive revalidation as it is, I am
> wondering what real value there is in doing more delalloc conversions
> here than technically required.

When the filesystem gets fragmented and there isn't a large enough
free space to allocate over the delalloc extent, it was more CPU
efficient to allocate multiple extents in a single xfs_bmapi_write()
call and transaction, similar to how we can free 2 extents in a
single truncate transaction.

We still do this in xfs_da_grow_inode_int() using nmaps =
XFS_BMAP_MAX_NMAP (i.e. 4) so the code should still work if we were
to pass it multiple maps.  But, yes, the code is very different now,
so it may not make sense to attempt multiple extent allocation here
again.
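
As a rough illustration of the efficiency argument above, here is a toy
userspace model (not kernel code) that counts how many allocation calls
it takes to cover a delalloc range from fragmented free space when each
call may fill either one mapping or up to four. The names alloc_maps()
and calls_needed() and the struct layout are hypothetical; only the
limit of 4 comes from the XFS_BMAP_MAX_NMAP value mentioned above.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_NMAP 4	/* mirrors the "XFS_BMAP_MAX_NMAP (i.e. 4)" above */

/* Hypothetical free-space/mapping record, not an XFS structure. */
struct extent {
	uint64_t start;	/* first block */
	uint64_t len;	/* length in blocks */
};

/*
 * Cover up to 'want' blocks from the free extents, filling at most
 * 'nmaps' mappings in one call. Returns the number of mappings
 * filled and stores the number of blocks covered in *done.
 */
static int
alloc_maps(struct extent *free_ext, size_t nfree, uint64_t want,
	   struct extent *maps, int nmaps, uint64_t *done)
{
	int n = 0;

	*done = 0;
	for (size_t i = 0; i < nfree && n < nmaps && *done < want; i++) {
		uint64_t take = free_ext[i].len;

		if (take == 0)
			continue;	/* this free extent is used up */
		if (take > want - *done)
			take = want - *done;
		maps[n].start = free_ext[i].start;
		maps[n].len = take;
		n++;
		free_ext[i].start += take;
		free_ext[i].len -= take;
		*done += take;
	}
	return n;
}

/*
 * Count the calls (think: transactions) needed to cover 'want'
 * blocks. Returns -1 on a bad nmaps or when free space runs out.
 */
static int
calls_needed(struct extent *free_ext, size_t nfree, uint64_t want,
	     int nmaps)
{
	struct extent maps[MAX_NMAP];
	int calls = 0;

	if (nmaps < 1 || nmaps > MAX_NMAP)
		return -1;
	while (want > 0) {
		uint64_t done;

		if (alloc_maps(free_ext, nfree, want, maps, nmaps, &done) == 0)
			return -1;
		want -= done;
		calls++;
	}
	return calls;
}
```

With eight free extents of two blocks each, covering eight blocks takes
four calls at nmaps = 1 and one call at nmaps = 4. The model says
nothing about the revalidation each call would still need.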

> ISTM that removing some of this i_size
> checking code and doing the seqno based kickback may actually be
> cleaner. I'll need to have a closer look from an optimization
> perspective when the correctness issues are dealt with.
> 
> I also could have sworn I removed that whichfork check from
> xfs_iomap_write_allocate(), but apparently not... ;P

Maybe it got blown into a dusty corner when we weren't paying
attention. :)

Cheers,

dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] tests/generic: test writepage cached mapping validity
  2019-01-14  9:30   ` Eryu Guan
  2019-01-14 15:34     ` Brian Foster
@ 2019-01-15  3:52     ` Dave Chinner
  1 sibling, 0 replies; 21+ messages in thread
From: Dave Chinner @ 2019-01-15  3:52 UTC (permalink / raw)
  To: Eryu Guan; +Cc: Brian Foster, fstests, linux-xfs

On Mon, Jan 14, 2019 at 05:30:36PM +0800, Eryu Guan wrote:
> On Fri, Jan 11, 2019 at 08:31:24AM -0500, Brian Foster wrote:
> > @@ -0,0 +1,94 @@
> > +#! /bin/bash
> > +# FS QA Test 999
> > +#
> > +# Test XFS page writeback code for races with the cached file mapping. XFS
> > +# caches the file -> block mapping for a full extent once it is initially looked
> > +# up. The cached mapping is used for all subsequent pages in the same writeback
> > +# cycle that cover the associated extent. Under certain conditions, it is
> > +# possible for concurrent operations on the file to invalidate the cached
> > +# mapping without the knowledge of writeback. Writeback ends up sending I/O to a
> > +# partly stale mapping and potentially leaving delalloc blocks in the current
> > +# mapping unconverted.
> > +#
> > +#-----------------------------------------------------------------------
> > +# Copyright (c) 2017 Red Hat, Inc.  All Rights Reserved.
>                    ^^^^ 2019?

i.e. copyright is from when it was first posted if the current
posting is derived from the original posting. If significant
alterations are made then a date update can occur, but the original
date should be preserved. It can be shortened to 2017-2019 for a
contiguous span of years...

So the correct form here is probably:

# Copyright (c) 2017, 2019 Red Hat, Inc.  All Rights Reserved.

> > +#
> > +# This program is free software; you can redistribute it and/or
> > +# modify it under the terms of the GNU General Public License as
> > +# published by the Free Software Foundation.
> > +#
> > +# This program is distributed in the hope that it would be useful,
> > +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> > +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > +# GNU General Public License for more details.
> > +#
> > +# You should have received a copy of the GNU General Public License
> > +# along with this program; if not, write the Free Software Foundation,
> > +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> > +#-----------------------------------------------------------------------
> 
> And please change this to SPDX-License-Identifier.

*nod* :)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/4] xfs: validate writeback mapping using data fork seq counter
  2019-01-14 20:57       ` Dave Chinner
@ 2019-01-15 11:26         ` Brian Foster
  0 siblings, 0 replies; 21+ messages in thread
From: Brian Foster @ 2019-01-15 11:26 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Jan 15, 2019 at 07:57:04AM +1100, Dave Chinner wrote:
> On Mon, Jan 14, 2019 at 10:34:23AM -0500, Brian Foster wrote:
> > On Mon, Jan 14, 2019 at 08:49:05AM +1100, Dave Chinner wrote:
> > > On Fri, Jan 11, 2019 at 07:30:31AM -0500, Brian Foster wrote:
> > > > The writeback code caches the current extent mapping across multiple
> > > > xfs_do_writepage() calls to avoid repeated lookups for sequential
> > > > pages backed by the same extent. This is known to be slightly racy
> > > > with extent fork changes in certain difficult to reproduce
> > > > scenarios. The cached extent is trimmed to within EOF to help avoid
> > > > the most common vector for this problem via speculative
> > > > preallocation management, but this is a band-aid that does not
> > > > address the fundamental problem.
> > > > 
> > > > Now that we have an xfs_ifork sequence counter mechanism used to
> > > > facilitate COW writeback, we can use the same mechanism to validate
> > > > consistency between the data fork and cached writeback mappings. On
> > > > its face, this is somewhat of a big hammer approach because any
> > > > change to the data fork invalidates any mapping currently cached by
> > > > a writeback in progress regardless of whether the data fork change
> > > > overlaps with the range under writeback. In practice, however, the
> > > > impact of this approach is minimal in most cases.
> > > > 
> > > > First, data fork changes (delayed allocations) caused by sustained
> > > > sequential buffered writes are amortized across speculative
> > > > preallocations. This means that a cached mapping won't be
> > > > invalidated by each buffered write of a common file copy workload,
> > > > but rather only on less frequent allocation events. Second, the
> > > > extent tree is always entirely in-core so an additional lookup of a
> > > > usable extent mostly costs a shared ilock cycle and in-memory tree
> > > > lookup. This means that a cached mapping reval is relatively cheap
> > > > compared to the I/O itself. Third, spurious invalidations don't
> > > > impact ioend construction. This means that even if the same extent
> > > > is revalidated multiple times across multiple writepage instances,
> > > > we still construct and submit the same size ioend (and bio) if the
> > > > blocks are physically contiguous.
> > > > 
> > > > Update struct xfs_writepage_ctx with a new field to hold the
> > > > sequence number of the data fork associated with the currently
> > > > cached mapping. Check the wpc seqno against the data fork when the
> > > > mapping is validated and reestablish the mapping whenever the fork
> > > > has changed since the mapping was cached. This ensures that
> > > > writeback always uses a valid extent mapping and thus prevents lost
> > > > writebacks and stale delalloc block problems.
> > > > 
> > > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > > ---
> > > >  fs/xfs/xfs_aops.c  | 8 ++++++--
> > > >  fs/xfs/xfs_iomap.c | 4 ++--
> > > >  2 files changed, 8 insertions(+), 4 deletions(-)
> > > > 
> > > > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > > > index d9048bcea49c..33a1be5df99f 100644
> > > > --- a/fs/xfs/xfs_aops.c
> > > > +++ b/fs/xfs/xfs_aops.c
> > > > @@ -29,6 +29,7 @@
> > > >  struct xfs_writepage_ctx {
> > > >  	struct xfs_bmbt_irec    imap;
> > > >  	unsigned int		io_type;
> > > > +	unsigned int		data_seq;
> > > >  	unsigned int		cow_seq;
> > > >  	struct xfs_ioend	*ioend;
> > > >  };
> > > > @@ -347,7 +348,8 @@ xfs_map_blocks(
> > > >  	 * out that ensures that we always see the current value.
> > > >  	 */
> > > >  	imap_valid = offset_fsb >= wpc->imap.br_startoff &&
> > > > -		     offset_fsb < wpc->imap.br_startoff + wpc->imap.br_blockcount;
> > > > +		     offset_fsb < wpc->imap.br_startoff + wpc->imap.br_blockcount &&
> > > > +		     wpc->data_seq == READ_ONCE(ip->i_df.if_seq);
> > > >  	if (imap_valid &&
> > > >  	    (!xfs_inode_has_cow_data(ip) ||
> > > >  	     wpc->io_type == XFS_IO_COW ||
> > > 
> > > I suspect this next "if (imap_valid) ..." logic needs to be updated,
> > > too. i.e. the next line is checking if the cow_seq has not changed.
> > > 
> > 
> > I'm not quite sure what you're getting at here. By "next," do you mean
> > the one you've quoted or the post-lock cycle check (a re-check at the
> > latter point makes sense to me). Otherwise the imap check is
> > intentionally distinct from the COW seq check because these control
> > independent bits of subsequent logic (in certain cases).
> 
> No, I meant the next line of code that isn't in the hunk was:
> 
> 	if (imap_valid &&
> 	    (!xfs_inode_has_cow_data(ip) ||
> 	     wpc->io_type == XFS_IO_COW ||
> >>>>>>	     wpc->cow_seq != READ_ONCE(ip->i_cowfp->if_seq))
> 
> The cow fork sequence number check.
> 
> > I think you mean 'if (io_type == XFS_IO_COW)'? Otherwise this seems
> > reasonable, though I think the logic suffers a bit from the same problem
> > as above. How about with the following tweaks (and comments to try and
> > make this easier to follow)?
> 
> I misread the nested () and so got the new logic wrong. :)
> 

Oh, Ok. Well I'm planning to use the helper and issue another
xfs_imap_valid() call as described either way. I think this is more
appropriate for clarity and because imap_valid in this v1 includes the
->if_seq check and the latter can change across the lock cycle.
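
To make the lock cycle point concrete, here is a minimal userspace
sketch of the cache/validate/re-cache life cycle. The types and helpers
(fork_model, wb_ctx, fork_change(), cache_mapping()) are hypothetical
stand-ins and not the kernel structures; the real code reads the
counter with READ_ONCE() and caches the mapping under the ilock.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration; not the real XFS types. */
struct extent {
	uint64_t startoff;	/* first file block of the mapping */
	uint64_t blockcount;	/* length in blocks */
};

struct fork_model {
	unsigned int seq;	/* bumped on every extent change */
	struct extent ext;	/* a single extent keeps the model small */
};

struct wb_ctx {
	struct extent imap;	/* mapping cached across writepage calls */
	unsigned int data_seq;	/* fork seq sampled when imap was cached */
};

/* Model a concurrent fork change (truncate, hole punch, allocation). */
static void
fork_change(struct fork_model *df, struct extent new_ext)
{
	df->ext = new_ext;
	df->seq++;
}

/* Cache the mapping; the kernel would do this with the ilock held. */
static void
cache_mapping(struct wb_ctx *wpc, const struct fork_model *df)
{
	wpc->imap = df->ext;
	wpc->data_seq = df->seq;
}

/* Range check plus seq check; meant to be repeated after a lock cycle. */
static bool
imap_valid(const struct wb_ctx *wpc, const struct fork_model *df,
	   uint64_t offset_fsb)
{
	if (offset_fsb < wpc->imap.startoff)
		return false;
	if (offset_fsb >= wpc->imap.startoff + wpc->imap.blockcount)
		return false;
	return wpc->data_seq == df->seq;
}
```

A mapping cached at seq N passes imap_valid() until fork_change() runs;
after that the same offset fails the check until cache_mapping() is
called again, which is why a second check after the lock cycle is
needed.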

> > static bool
> > xfs_imap_valid()
> > {
> > 	if (offset_fsb < wpc->imap.br_startoff)
> > 		return false;
> > 	if (offset_fsb >= wpc->imap.br_startoff + wpc->imap.br_blockcount)
> > 		return false;
> > 	/* a valid range is sufficient for COW mappings */
> > 	if (wpc->io_type == XFS_IO_COW)
> > 		return true;
> > 
> > 	/*
> > 	 * Not a COW mapping. Revalidate across changes in either the
> > 	 * data or COW fork ...
> > 	 */
> > 	if (wpc->data_seq != READ_ONCE(ip->i_df.if_seq))
> > 		return false;
> > 	if (xfs_inode_has_cow_data(ip) &&
> > 	    wpc->cow_seq != READ_ONCE(ip->i_cowfp->if_seq))
> > 		return false;
> > 
> > 	return true;
> > }
> 
> Yup, that's what I meant. I'm glad you're on the ball right now :)
> 
> > I think that technically we could skip the == XFS_IO_COW check and we'd
> > just be more conservative by essentially applying the same fork change
> > logic we are for the data fork, but that's not really the intent of this
> > patch.
> 
> Sure.
> 
> > > >  	xfs_mount_t	*mp = ip->i_mount;
> > > >  	struct xfs_ifork *ifp = XFS_IFORK_PTR(ip, whichfork);
> > > > @@ -798,7 +798,7 @@ xfs_iomap_write_allocate(
> > > >  				goto error0;
> > > >  
> > > >  			if (whichfork == XFS_COW_FORK)
> > > > -				*cow_seq = READ_ONCE(ifp->if_seq);
> > > > +				*seq = READ_ONCE(ifp->if_seq);
> > > >  			xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > >  		}
> > > 
> > > One of the things that limits xfs_iomap_write_allocate() efficiency
> > > is the mitigations for races against truncate. i.e. the huge comment that
> > > starts:
> > > 
> > > 	       /*
> > > 		* it is possible that the extents have changed since
> > > 		* we did the read call as we dropped the ilock for a
> > > 		* while. We have to be careful about truncates or hole
> > > 		* punchs here - we are not allowed to allocate
> > > 		* non-delalloc blocks here.
> > > ....
> > > 
> > 
> > Hmm, Ok... so this fix goes a ways back to commit e4143a1cf5 ("[XFS] Fix
> > transaction overrun during writeback."). It sounds like the issue was an
> > instance of the "attempt to convert delalloc blocks ends up doing
> > physical allocation" problem (which results in a transaction overrun).
> 
> Yeah, there were no delalloc blocks because they'd been truncated or
> punched away between unlock/lock cycles on the inode.
> 
> > > Now that we can detect that the extents have changed in the data
> > > fork, we can go back to allocating multiple extents per
> > > xfs_bmapi_write() call by doing a sequence number check after we
> > > lock the inode. If the sequence number does not match what was
> > > passed in or returned from the previous loop, we return -EAGAIN.
> > > 
> > 
> > I'm not familiar with this particular instance of this problem (we've
> > certainly had other instances of the same thing), but the surrounding
> > context of this code has changed quite a bit.
> 
> Yes, it has. The move to a single map was done a long time ago
> because there weren't any other options at the time, and it was a
> problem we'd been struggling to understand and sort out for years.
> 
> > Most notably is
> > XFS_BMAPI_DELALLOC, which was intended to mitigate this problem by
> > disallowing real allocation in such calls.
> 
> Yup. However, I've always thought of it as a bit of a hack - it's
> preventing the transaction overrun when a problem occurs as opposed
> to preventing the race that leads to trying to allocate over a
> hole.
> 
> Essentially, though, they are both trying to address the same
> problem: that the extent list can change during writeback and
> writeback ends up using stale information to direct IO and/or extent
> allocation.
> 

Fair point.

> > > Hmmm, looking at the existing -EAGAIN case, I suspect this isn't
> > > handled correctly by xfs_map_blocks() anymore. i.e. it just returns
> > > the error which can lead to discarding the page rather than checking
> > > to see if there was a valid map allocated. I think there's some
> > > followup work here (another patch series). :/
> > > 
> > 
> > Ok. At the moment, that error looks like it should only happen if we're
> > past EOF..?
> 
> Yeah, racing with truncate. The old writeback code used to have a
> non-blocking feature which would handle -EAGAIN errors bubbling up
> from anywhere in the writeback path. We got rid of that a long time
> ago, so I suspect this has been broken for a long while.
> 
> > Either way, the XFS_BMAPI_DELALLOC thing still can result in
> > an error so it probably makes sense to tie a seqno check to -EAGAIN and
> > handle it properly in the caller.
> 
> *nod*
> 

After taking a closer look at this, one thing that concerns me about
just sticking an ->if_seq check in xfs_iomap_write_allocate() is the
potential to bounce back and forth between xfs_iomap_write_allocate()
and the caller due to the fact that ->if_seq changes on any change in
the fork. If we just return -EAGAIN and retry, then some other task can
cause writeback churn by just punching/reallocating a block somewhere
else in the file while this code repeats lookups of the same extent.

I think the fact that we hold the page lock across these ilock cycles
means we should at minimum be able to rely on stability of the blocks
backing the current page. I.e. if we're in xfs_iomap_write_allocate(),
we've found a delalloc extent behind the page while under page lock.
Truncate and hole punch both call into truncate_pagecache_range(), which
locks every page and waits on writeback before either is allowed to do
any block manipulation.

Given that, I'm thinking of doing something like look up the extent that
covers offset_fsb on an ->if_seq change and trim the passed in extent
(i.e. mapping range) to whatever sits in the extent tree. That means we
preserve validity of the mapping without risk of disruption due to
unrelated changes in the fork. We also no longer implicitly/hackily rely
on XFS_IO_DELALLOC to sanitize the mapping range passed into
xfs_bmapi_write() and so should only ever expect an error if we truly
screw something up.
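
A rough userspace sketch of that idea, under stated assumptions: the
extent tree is modelled as a small sorted array, and lookup_extent(),
trim_to_extent() and revalidate_map() are hypothetical names, not
existing XFS helpers. On a sequence change the mapping is trimmed to
the extent that currently covers offset_fsb instead of being bounced
back to the caller.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mapping record, not an XFS structure. */
struct extent {
	uint64_t startoff;	/* first file block */
	uint64_t blockcount;	/* length in blocks */
};

/* Find the extent covering 'off' in a non-overlapping array. */
static const struct extent *
lookup_extent(const struct extent *tree, size_t n, uint64_t off)
{
	for (size_t i = 0; i < n; i++) {
		if (off >= tree[i].startoff &&
		    off < tree[i].startoff + tree[i].blockcount)
			return &tree[i];
	}
	return NULL;	/* offset sits in a hole */
}

/* Shrink 'map' to its intersection with 'cur'; false if they are disjoint. */
static bool
trim_to_extent(struct extent *map, const struct extent *cur)
{
	uint64_t map_end = map->startoff + map->blockcount;
	uint64_t cur_end = cur->startoff + cur->blockcount;
	uint64_t start = map->startoff > cur->startoff ?
			 map->startoff : cur->startoff;
	uint64_t end = map_end < cur_end ? map_end : cur_end;

	if (start >= end) {
		map->blockcount = 0;
		return false;
	}
	map->startoff = start;
	map->blockcount = end - start;
	return true;
}

/*
 * If the fork changed since the mapping was cached, trim the mapping
 * to whatever the tree holds at offset_fsb. Unrelated changes leave a
 * still-covered mapping usable; only a hole at offset_fsb fails.
 */
static bool
revalidate_map(struct extent *map, unsigned int *cached_seq,
	       unsigned int cur_seq, const struct extent *tree, size_t n,
	       uint64_t offset_fsb)
{
	const struct extent *cur;

	if (*cached_seq == cur_seq)
		return true;	/* nothing changed, keep the mapping */

	cur = lookup_extent(tree, n, offset_fsb);
	if (!cur) {
		map->blockcount = 0;
		return false;
	}
	if (!trim_to_extent(map, cur))
		return false;
	*cached_seq = cur_seq;
	return true;
}
```

For example, a mapping of blocks 0-15 cached before a hole punch of
blocks 8-11 is trimmed to blocks 0-7 when revalidated at block 4, with
no retry through the caller.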

I think the subtle tradeoff vs. a high level retry is that we'd do
writeback to the current page rather than back off at the last second
and redirty the page if a truncate was about to kill it off as we're
processing it for writeback. As noted above, page truncation still has
to wait on page writeback so I don't think that should be a correctness
issue. I still need to hack/test on this a bit to determine whether this
is sane, but if the code ends up more simple I think that might be a
reasonable tradeoff..

Brian

> > Hmm, given that we can really only handle one extent at a time up
> > through the caller (as also noted in the big comment you quoted) and
> > that this series introduces more aggressive revalidation as it is, I am
> > wondering what real value there is in doing more delalloc conversions
> > here than technically required.
> 
> When the filesystem gets fragmented and there isn't a large enough
> free space to allocate over the delalloc extent, it was more CPU
> efficient to allocate multiple extents in a single xfs_bmapi_write()
> call and transaction, similar to how we can free 2 extents in a
> single truncate transaction.
> 
> We still do this in xfs_da_grow_inode_int() using nmaps =
> XFS_BMAP_MAX_NMAP (i.e. 4) so the code should still work if we were
> to pass it multiple maps.  But, yes, the code is very different now,
> so it may not make sense to attempt multiple extent allocation here
> again.
> 
> > ISTM that removing some of this i_size
> > checking code and doing the seqno based kickback may actually be
> > cleaner. I'll need to have a closer look from an optimization
> > perspective when the correctness issues are dealt with.
> > 
> > I also could have sworn I removed that whichfork check from
> > xfs_iomap_write_allocate(), but apparently not... ;P
> 
> Maybe it got blown into a dusty corner when we weren't paying
> attention. :)
> 
> Cheers,
> 
> dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/4] xfs: eof trim writeback mapping as soon as it is cached
  2019-01-11 12:30 ` [PATCH 1/4] xfs: eof trim writeback mapping as soon as it is cached Brian Foster
@ 2019-01-16 13:35     ` Sasha Levin
  0 siblings, 0 replies; 21+ messages in thread
From: Sasha Levin @ 2019-01-16 13:35 UTC (permalink / raw)
  To: Sasha Levin, Brian Foster, linux-xfs; +Cc: stable

Hi,

[This is an automated email]

This commit has been processed because it contains a "Fixes:" tag,
fixing commit: 40214d128e07 xfs: trim writepage mapping to within eof.

The bot has tested the following trees: v4.20.2, v4.19.15, v4.14.93, v4.9.150.

v4.20.2: Build OK!
v4.19.15: Build OK!
v4.14.93: Failed to apply! Possible dependencies:
    2d5f4b5bebcc ("xfs: remove unused parameter from xfs_writepage_map")
    5c665e5b5af6 ("xfs: remove xfs_map_cow")
    70c57dcd606f ("xfs: skip CoW writes past EOF when writeback races with truncate")
    a7b28f72ab90 ("xfs: don't use XFS_BMAPI_IGSTATE in xfs_map_blocks")
    b4d8ad7fd3a1 ("xfs: fix s_maxbytes overflow problems")

v4.9.150: Failed to apply! Possible dependencies:
    08438b1e386b ("xfs: plumb in needed functions for range querying of the freespace btrees")
    092d5d9d5812 ("xfs: cleanup xfs_reflink_find_cow_mapping")
    11ef38afe98c ("xfs: make xfs btree stats less huge")
    2d5f4b5bebcc ("xfs: remove unused parameter from xfs_writepage_map")
    5c665e5b5af6 ("xfs: remove xfs_map_cow")
    70c57dcd606f ("xfs: skip CoW writes past EOF when writeback races with truncate")
    755c7bf5ddca ("libxfs: convert ushort to unsigned short")
    a7b28f72ab90 ("xfs: don't use XFS_BMAPI_IGSTATE in xfs_map_blocks")
    af7d20fd83d9 ("xfs: make xfs_btree_magic more generic")
    b4d8ad7fd3a1 ("xfs: fix s_maxbytes overflow problems")
    c8ce540db5f6 ("xfs: remove double-underscore integer types")
    cae028df5344 ("xfs: optimise CRC updates")


How should we proceed with this patch?

--
Thanks,
Sasha

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/4] xfs: eof trim writeback mapping as soon as it is cached
  2019-01-16 13:35     ` Sasha Levin
  (?)
@ 2019-01-16 14:10     ` Brian Foster
  -1 siblings, 0 replies; 21+ messages in thread
From: Brian Foster @ 2019-01-16 14:10 UTC (permalink / raw)
  To: Sasha Levin; +Cc: linux-xfs, stable

On Wed, Jan 16, 2019 at 01:35:38PM +0000, Sasha Levin wrote:
> Hi,
> 
> [This is an automated email]
> 
> This commit has been processed because it contains a "Fixes:" tag,
> fixing commit: 40214d128e07 xfs: trim writepage mapping to within eof.
> 
> The bot has tested the following trees: v4.20.2, v4.19.15, v4.14.93, v4.9.150.
> 
> v4.20.2: Build OK!
> v4.19.15: Build OK!
> v4.14.93: Failed to apply! Possible dependencies:
>     2d5f4b5bebcc ("xfs: remove unused parameter from xfs_writepage_map")
>     5c665e5b5af6 ("xfs: remove xfs_map_cow")
>     70c57dcd606f ("xfs: skip CoW writes past EOF when writeback races with truncate")
>     a7b28f72ab90 ("xfs: don't use XFS_BMAPI_IGSTATE in xfs_map_blocks")
>     b4d8ad7fd3a1 ("xfs: fix s_maxbytes overflow problems")
> 
> v4.9.150: Failed to apply! Possible dependencies:
>     08438b1e386b ("xfs: plumb in needed functions for range querying of the freespace btrees")
>     092d5d9d5812 ("xfs: cleanup xfs_reflink_find_cow_mapping")
>     11ef38afe98c ("xfs: make xfs btree stats less huge")
>     2d5f4b5bebcc ("xfs: remove unused parameter from xfs_writepage_map")
>     5c665e5b5af6 ("xfs: remove xfs_map_cow")
>     70c57dcd606f ("xfs: skip CoW writes past EOF when writeback races with truncate")
>     755c7bf5ddca ("libxfs: convert ushort to unsigned short")
>     a7b28f72ab90 ("xfs: don't use XFS_BMAPI_IGSTATE in xfs_map_blocks")
>     af7d20fd83d9 ("xfs: make xfs_btree_magic more generic")
>     b4d8ad7fd3a1 ("xfs: fix s_maxbytes overflow problems")
>     c8ce540db5f6 ("xfs: remove double-underscore integer types")
>     cae028df5344 ("xfs: optimise CRC updates")
> 
> 
> How should we proceed with this patch?
> 

The writeback code in XFS has seen a decent amount of rework since these
older kernels. I'm not quite sure how stable deals with these conflicts,
but for reference, I think the appended (untested) diff is essentially
equivalent for the above two kernels. It doesn't cover the xfs_map_cow()
case in 4.14, but that code is experimental. Also note that the upstream
patch is still technically not reviewed.

Brian

--- 8< ---

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index b0cccf8a81a8..b93b3064de20 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -421,8 +421,10 @@ xfs_map_blocks(
 	    (!nimaps || isnullstartblock(imap->br_startblock))) {
 		error = xfs_iomap_write_allocate(ip, XFS_DATA_FORK, offset,
 				imap);
-		if (!error)
+		if (!error) {
 			trace_xfs_map_blocks_alloc(ip, offset, count, type, imap);
+			xfs_trim_extent_eof(imap, ip);
+		}
 		return error;
 	}
 
@@ -433,8 +435,10 @@ xfs_map_blocks(
 		ASSERT(imap->br_startblock != DELAYSTARTBLOCK);
 	}
 #endif
-	if (nimaps)
+	if (nimaps) {
 		trace_xfs_map_blocks_found(ip, offset, count, type, imap);
+		xfs_trim_extent_eof(imap, ip);
+	}
 	return 0;
 }
 

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/4] xfs: update fork seq counter on data fork changes
  2019-01-11 12:30 ` [PATCH 2/4] xfs: update fork seq counter on data fork changes Brian Foster
@ 2019-01-17 14:41   ` Christoph Hellwig
  0 siblings, 0 replies; 21+ messages in thread
From: Christoph Hellwig @ 2019-01-17 14:41 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Fri, Jan 11, 2019 at 07:30:30AM -0500, Brian Foster wrote:
> The sequence counter in the xfs_ifork structure is only updated on
> COW forks. This is because the counter is currently only used to
> optimize out repetitive COW fork checks at writeback time.
> 
> Tweak the extent code to update the seq counter regardless of the
> fork type in preparation for using this counter on data forks as
> well.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/4] xfs: validate writeback mapping using data fork seq counter
  2019-01-14 15:34     ` Brian Foster
  2019-01-14 20:57       ` Dave Chinner
@ 2019-01-17 14:47       ` Christoph Hellwig
  2019-01-17 16:35         ` Brian Foster
  1 sibling, 1 reply; 21+ messages in thread
From: Christoph Hellwig @ 2019-01-17 14:47 UTC (permalink / raw)
  To: Brian Foster; +Cc: Dave Chinner, linux-xfs

On Mon, Jan 14, 2019 at 10:34:23AM -0500, Brian Foster wrote:
> static bool
> xfs_imap_valid()
> {
> 	if (offset_fsb < wpc->imap.br_startoff)
> 		return false;
> 	if (offset_fsb >= wpc->imap.br_startoff + wpc->imap.br_blockcount)
> 		return false;
> 	/* a valid range is sufficient for COW mappings */
> 	if (wpc->io_type == XFS_IO_COW)
> 		return true;
> 
> 	/*
> 	 * Not a COW mapping. Revalidate across changes in either the
> 	 * data or COW fork ...
> 	 */
> 	if (wpc->data_seq != READ_ONCE(ip->i_df.if_seq))
> 		return false;
> 	if (xfs_inode_has_cow_data(ip) &&
> 	    wpc->cow_seq != READ_ONCE(ip->i_cowfp->if_seq))
> 		return false;
> 
> 	return true;
> }
> 
> I think that technically we could skip the == XFS_IO_COW check and we'd
> just be more conservative by essentially applying the same fork change
> logic we are for the data fork, but that's not really the intent of this
> patch.

That above logic looks pretty sensible to me.  And I don't think there
is any need for being more conservative.

> > One of the things that limits xfs_iomap_write_allocate() efficiency
> > is the mitigations for races against truncate. i.e. the huge comment that
> > starts:
> > 
> > 	       /*
> > 		* it is possible that the extents have changed since
> > 		* we did the read call as we dropped the ilock for a
> > 		* while. We have to be careful about truncates or hole
> > 		* punchs here - we are not allowed to allocate
> > 		* non-delalloc blocks here.
> > ....
> > 
> 
> Hmm, Ok... so this fix goes a ways back to commit e4143a1cf5 ("[XFS] Fix
> transaction overrun during writeback."). It sounds like the issue was an
> instance of the "attempt to convert delalloc blocks ends up doing
> physical allocation" problem (which results in a transaction overrun).

FYI, that area is touched by my always COW series, it would be great
if I could get another review for that.  And yes, I need to dust it off
and resend based on the comments from Darrick.  I just need to find
out how to best combine it with your current series.

> > Now that we can detect that the extents have changed in the data
> > fork, we can go back to allocating multiple extents per
> > xfs_bmapi_write() call by doing a sequence number check after we
> > lock the inode. If the sequence number does not match what was
> > passed in or returned from the previous loop, we return -EAGAIN.
> > 
> 
> I'm not familiar with this particular instance of this problem (we've
> certainly had other instances of the same thing), but the surrounding
> context of this code has changed quite a bit. Most notably is
> XFS_BMAPI_DELALLOC, which was intended to mitigate this problem by
> disallowing real allocation in such calls.

I'm also not sure what doing multiple allocations in one call is
supposed to really buy us.  We basically have to roll transactions
and redo all checks anyway.

> > Hmmm, looking at the existing -EAGAIN case, I suspect this isn't
> > handled correctly by xfs_map_blocks() anymore. i.e. it just returns
> > the error which can lead to discarding the page rather than checking
> > to see if there was a valid map allocated. I think there's some
> > followup work here (another patch series). :/
> > 
> 
> Ok. At the moment, that error looks like it should only happen if we're
> past EOF..? Either way, the XFS_BMAPI_DELALLOC thing still can result in
> an error so it probably makes sense to tie a seqno check to -EAGAIN and
> handle it properly in the caller.

For that whole -EAGAIN handling please look at my always cow series
again, I got bitten by it a few times and also think the current code
works only by chance and in the right phase of the moon.  I hope the
series documents what we had it for very nicely.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/4] xfs: validate writeback mapping using data fork seq counter
  2019-01-17 14:47       ` Christoph Hellwig
@ 2019-01-17 16:35         ` Brian Foster
  2019-01-17 16:41           ` Christoph Hellwig
  0 siblings, 1 reply; 21+ messages in thread
From: Brian Foster @ 2019-01-17 16:35 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Dave Chinner, linux-xfs

On Thu, Jan 17, 2019 at 06:47:28AM -0800, Christoph Hellwig wrote:
> On Mon, Jan 14, 2019 at 10:34:23AM -0500, Brian Foster wrote:
> > static bool
> > xfs_imap_valid()
> > {
> > 	if (offset_fsb < wpc->imap.br_startoff)
> > 		return false;
> > 	if (offset_fsb >= wpc->imap.br_startoff + wpc->imap.br_blockcount)
> > 		return false;
> > 	/* a valid range is sufficient for COW mappings */
> > 	if (wpc->io_type == XFS_IO_COW)
> > 		return true;
> > 
> > 	/*
> > 	 * Not a COW mapping. Revalidate across changes in either the
> > 	 * data or COW fork ...
> > 	 */
> > 	if (wpc->data_seq != READ_ONCE(ip->i_df.if_seq))
> > 		return false;
> > 	if (xfs_inode_has_cow_data(ip) &&
> > 	    wpc->cow_seq != READ_ONCE(ip->i_cowfp->if_seq))
> > 		return false;
> > 
> > 	return true;
> > }
> > 
> > I think that technically we could skip the == XFS_IO_COW check and we'd
> > just be more conservative by essentially applying the same fork change
> > logic we are for the data fork, but that's not really the intent of this
> > patch.
> 
> That above logic looks pretty sensible to me.  And I don't think there
> is any need for being more conservative.
> 

Agreed.

> > > One of the things that limits xfs_iomap_write_allocate() efficiency
> > > is the mitigations for races against truncate. i.e. the huge comment that
> > > starts:
> > > 
> > > 	       /*
> > > 		* it is possible that the extents have changed since
> > > 		* we did the read call as we dropped the ilock for a
> > > 		* while. We have to be careful about truncates or hole
> > > 		* punchs here - we are not allowed to allocate
> > > 		* non-delalloc blocks here.
> > > ....
> > > 
> > 
> > Hmm, Ok... so this fix goes a ways back to commit e4143a1cf5 ("[XFS] Fix
> > transaction overrun during writeback."). It sounds like the issue was an
> > instance of the "attempt to convert delalloc blocks ends up doing
> > physical allocation" problem (which results in a transaction overrun).
> 
> FYI, that area is touched by my always COW series, it would be great
> if I could get another review for that.  And yes, I need to dust it off
> > and resend based on the comments from Darrick.  I just need to find
> out how to best combine it with your current series.
> 
> > > Now that we can detect that the extents have changed in the data
> > > fork, we can go back to allocating multiple extents per
> > > xfs_bmapi_write() call by doing a sequence number check after we
> > > lock the inode. If the sequence number does not match what was
> > > passed in or returned from the previous loop, we return -EAGAIN.
> > > 
> > 
> > I'm not familiar with this particular instance of this problem (we've
> > certainly had other instances of the same thing), but the surrounding
> > context of this code has changed quite a bit. Most notably is
> > XFS_BMAPI_DELALLOC, which was intended to mitigate this problem by
> > disallowing real allocation in such calls.
> 
> > I'm also not sure what doing multiple allocations in one call is
> > supposed to really buy us.  We basically have to roll transactions
> > and redo all checks anyway.
> 
> > > Hmmm, looking at the existing -EAGAIN case, I suspect this isn't
> > > handled correctly by xfs_map_blocks() anymore. i.e. it just returns
> > > the error which can lead to discarding the page rather than checking
> > > > to see if there was a valid map allocated. I think there's some
> > > followup work here (another patch series). :/
> > > 
> > 
> > Ok. At the moment, that error looks like it should only happen if we're
> > past EOF..? Either way, the XFS_BMAPI_DELALLOC thing still can result in
> > an error so it probably makes sense to tie a seqno check to -EAGAIN and
> > handle it properly in the caller.
> 
> For that whole -EAGAIN handling please look at my always cow series
> again, I got bitten by it a few times and also think the current code
> works only by chance and in the right phase of the moon.  I hope the
> series documents what we had it for very nicely.

Hmm, it would be nice if these fixes were separate from the whole
always_cow thing. Some initial thoughts on a quick look through the
first few patches on the v3 post:

1. It's probably best to drop your xfs_trim_extent_eof() changes as I
have a stable patch to add a couple more calls and then I subsequently
remove the whole thing going forward. Refactoring it is just churn at
this point.

2. The whole explicit race with truncate detection looks rather involved
to me at first glance. I'm trying to avoid relying on i_size at all for
this because it doesn't seem like a reliable approach. E.g., Dave
described a hole punch vector for the same fundamental problem this
series is trying to address:

  https://marc.info/?l=linux-xfs&m=154692641021480&w=2

I don't think looking at i_size really helps us with that, but I could
be missing other changes in the cow series.

In general I'm looking at putting something like this in
xfs_iomap_write_allocate() once the data fork sequence number tracking
is enabled:

                        /*
                         * Now that we have ILOCK we must account for the fact
                         * that the fork (and thus our mapping) could have
                         * changed while the inode was unlocked. If the fork
                         * has changed, trim the caller's mapping to the
                         * current extent in the fork.
                         *
                         * If the external change did not modify the current
                         * mapping (or just grew it) this will have no effect.
                         * If the current mapping shrunk, we expect to at
                         * minimum still have blocks backing the current page as
                         * the page has remained locked since writeback first
                         * located delalloc block(s) at the page offset. A
                         * racing truncate, hole punch or even reflink must wait
                         * on page writeback before it can modify our page and
                         * underlying block(s).
                         *
                         * We'll update *seq before we drop ilock for the next
                         * iteration.
                         */
                        if (*seq != READ_ONCE(ifp->if_seq)) {
                                if (!xfs_iext_lookup_extent(ip, ifp, offset_fsb,
                                                            &icur, &timap) ||
                                    timap.br_startoff > offset_fsb) {
                                        ASSERT(0);
                                        error = -EFSCORRUPTED;
                                        goto trans_cancel;
                                }
                                xfs_trim_extent(imap, timap.br_startoff,
                                                timap.br_blockcount);
                                count_fsb = imap->br_blockcount;
                                map_start_fsb = imap->br_startoff;
                        }

... and getting rid of the existing i_size cruft. I think this handles
the same problem in a different way, primary difference being that
truncate or hole punch is more likely to have to wait on writeback
rather than writeback trying so hard to get out of the way. Also note
that we still have the i_size checks on the page in xfs_do_writepage()
that will cause writeback to back off in the truncate case once we spin
around to the next page. Thoughts?
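
To make the trimming step concrete, here is a small userspace model of the
snippet above. It is a sketch under assumptions, not the proposed patch: the
fork is a sorted array of extents, lookup_extent() and trim_extent() are
simplified stand-ins for xfs_iext_lookup_extent() and xfs_trim_extent() that
ignore start blocks, -1 stands in for -EFSCORRUPTED, and *seq is refreshed
inside the helper for brevity rather than just before the lock is dropped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct extent {
	uint64_t startoff;	/* first file block */
	uint64_t blockcount;	/* length in blocks */
};

/* Hypothetical model of an inode fork: sorted, non-overlapping extents. */
struct fork_model {
	const struct extent *extents;
	size_t nextents;
	unsigned int seq;	/* bumped on every extent list change */
};

/* Return the extent containing offset_fsb, or the next one after it. */
static bool
lookup_extent(const struct fork_model *ifp, uint64_t offset_fsb,
	      struct extent *got)
{
	for (size_t i = 0; i < ifp->nextents; i++) {
		const struct extent *e = &ifp->extents[i];

		if (e->startoff + e->blockcount > offset_fsb) {
			*got = *e;
			return true;
		}
	}
	return false;
}

/* Clamp irec to the range [bno, bno + len). */
static void
trim_extent(struct extent *irec, uint64_t bno, uint64_t len)
{
	uint64_t end = bno + len;

	if (irec->startoff + irec->blockcount <= bno || irec->startoff >= end) {
		irec->blockcount = 0;
		return;
	}
	if (irec->startoff < bno) {
		irec->blockcount -= bno - irec->startoff;
		irec->startoff = bno;
	}
	if (end < irec->startoff + irec->blockcount)
		irec->blockcount = end - irec->startoff;
}

/*
 * If the fork changed since *seq was sampled, trim the caller's mapping to
 * the extent currently backing offset_fsb. Returns 0 on success and -1 if
 * nothing backs offset_fsb any more (the ASSERT(0) case in the snippet).
 */
static int
revalidate_mapping(struct extent *imap, const struct fork_model *ifp,
		   unsigned int *seq, uint64_t offset_fsb)
{
	struct extent cur;

	if (*seq == ifp->seq)
		return 0;
	if (!lookup_extent(ifp, offset_fsb, &cur) || cur.startoff > offset_fsb)
		return -1;

	trim_extent(imap, cur.startoff, cur.blockcount);
	*seq = ifp->seq;
	return 0;
}
```

With a cached mapping of blocks [0, 100) and a racing truncate that leaves
only [0, 20) in the fork, the model trims the mapping to [0, 20); if the fork
merely grew, the mapping is left untouched.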

I'm still testing this but I can try to get something posted to the list
a bit sooner than I was anticipating for the purpose of trying to order
these series and/or sanity checking the approach.

Brian


* Re: [PATCH 3/4] xfs: validate writeback mapping using data fork seq counter
  2019-01-17 16:35         ` Brian Foster
@ 2019-01-17 16:41           ` Christoph Hellwig
  2019-01-17 17:53             ` Brian Foster
  0 siblings, 1 reply; 21+ messages in thread
From: Christoph Hellwig @ 2019-01-17 16:41 UTC (permalink / raw)
  To: Brian Foster; +Cc: Christoph Hellwig, Dave Chinner, linux-xfs

On Thu, Jan 17, 2019 at 11:35:17AM -0500, Brian Foster wrote:
> Hmm, it would be nice if these fixes were separate from the whole
> always_cow thing. Some initial thoughts on a quick look through the
> first few patches on the v3 post:

We can always skip the last patch.  It just helps to really nicely
show a lot of the problems that are otherwise hard to reproduce, but
already exist.

FYI, I just resent it like a minute before reading your mail.

> 1. It's probably best to drop your xfs_trim_extent_eof() changes as I
> have a stable patch to add a couple more calls and then I subsequently
> remove the whole thing going forward. Refactoring it is just churn at
> this point.

Sure.

> 2. The whole explicit race with truncate detection looks rather involved
> to me at first glance. I'm trying to avoid relying on i_size at all for
> this because it doesn't seem like a reliable approach. E.g., Dave
> described a hole punch vector for the same fundamental problem this
> series is trying to address:
> 
>   https://marc.info/?l=linux-xfs&m=154692641021480&w=2
> 
> I don't think looking at i_size really helps us with that, but I could
> be missing other changes in the cow series.

The i_size detection isn't new in this series, just slightly moved
around.  And it really is just intended as an optimization to not
even bother if we are beyond i_size.

> 
> In general I'm looking at putting something like this in
> xfs_iomap_write_allocate() once the data fork sequence number tracking
> is enabled:
> 
>                         /*
>                          * Now that we have ILOCK we must account for the fact
>                          * that the fork (and thus our mapping) could have
>                          * changed while the inode was unlocked. If the fork
>                          * has changed, trim the caller's mapping to the
>                          * current extent in the fork.

We don't even look at the caller's mapping except for the range to
cover.  And that is how e.g. direct I/O also works and a good thing
as far as I can tell.  To make use of the previous mapping we'd have
to rewrite xfs_bmapi_write.

If we want to be able to reuse existing mappings I think the sequences
are helping us a bit, but a lot more work is needed, and it should
be done in a generic way and not just in this path.


* Re: [PATCH 3/4] xfs: validate writeback mapping using data fork seq counter
  2019-01-17 16:41           ` Christoph Hellwig
@ 2019-01-17 17:53             ` Brian Foster
  0 siblings, 0 replies; 21+ messages in thread
From: Brian Foster @ 2019-01-17 17:53 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Dave Chinner, linux-xfs

On Thu, Jan 17, 2019 at 08:41:48AM -0800, Christoph Hellwig wrote:
> On Thu, Jan 17, 2019 at 11:35:17AM -0500, Brian Foster wrote:
> > Hmm, it would be nice if these fixes were separate from the whole
> > always_cow thing. Some initial thoughts on a quick look through the
> > first few patches on the v3 post:
> 
> We can always skip the last patch.  It just helps to really nicely
> show a lot of the problems that are otherwise hard to reproduce, but
> already exist.
> 
> FYI, I just resent it like a minute before reading your mail.
> 
> > 1. It's probably best to drop your xfs_trim_extent_eof() changes as I
> > have a stable patch to add a couple more calls and then I subsequently
> > remove the whole thing going forward. Refactoring it is just churn at
> > this point.
> 
> Sure.
> 
> > 2. The whole explicit race with truncate detection looks rather involved
> > to me at first glance. I'm trying to avoid relying on i_size at all for
> > this because it doesn't seem like a reliable approach. E.g., Dave
> > described a hole punch vector for the same fundamental problem this
> > series is trying to address:
> > 
> >   https://marc.info/?l=linux-xfs&m=154692641021480&w=2
> > 
> > I don't think looking at i_size really helps us with that, but I could
> > be missing other changes in the cow series.
> 
> The i_size detection isn't new in this series, just slightly moved
> around.  And it really is just intended as an optimization to not
> even bother if we are beyond i_size.
> 

Ok, then I probably need to take a closer look. The purpose of these
patches is to remove it and replace it with something that
fundamentally addresses the underlying problem (i.e., the fork change
detection).

> > 
> > In general I'm looking at putting something like this in
> > xfs_iomap_write_allocate() once the data fork sequence number tracking
> > is enabled:
> > 
> >                         /*
> >                          * Now that we have ILOCK we must account for the fact
> >                          * that the fork (and thus our mapping) could have
> >                          * changed while the inode was unlocked. If the fork
> >                          * has changed, trim the caller's mapping to the
> >                          * current extent in the fork.
> 
> We don't even look at the caller's mapping except for the range to
> cover.  And that is how e.g. direct I/O also works and a good thing
> as far as I can tell.  To make use of the previous mapping we'd have
> to rewrite xfs_bmapi_write.
> 

Yes, that's really just semantics. The purpose of the lookup in this
context is to trim down the range to map. We can only guarantee the
range specified by the current page once we cycle ilock, so we have to
consider that any part of the range external to that has become invalid.
This change to xfs_iomap_write_allocate() doesn't introduce any new way
of using the caller's imap that isn't already done by the existing code.
We just access the inode fork to validate the range rather than the
inode size because the caller already gives us information to confirm
whether the range has been invalidated (the *seq param) whereas the
i_size could have been truncated down and up since the last time we
checked it.
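
The down-and-up truncate case can be shown with a toy model. This is only a
sketch with hypothetical names: truncate_model() stands in for a truncate
that bumps the data fork sequence number when it removes blocks (an
assumption about which events bump the counter), and the two checks model
"offset is still inside i_size" versus "the fork has not changed".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical inode model: a size in blocks plus a data fork seq counter. */
struct inode_model {
	uint64_t size_fsb;
	unsigned int data_seq;
};

/* What writeback remembers before it drops the lock. */
struct snapshot {
	uint64_t offset_fsb;	/* block being written back */
	unsigned int data_seq;	/* fork seq when the mapping was cached */
};

/* Assumption: only a shrinking truncate removes extents and bumps the seq. */
static void
truncate_model(struct inode_model *ip, uint64_t new_size_fsb)
{
	if (new_size_fsb < ip->size_fsb)
		ip->data_seq++;
	ip->size_fsb = new_size_fsb;
}

/* i_size based check: passes as long as the offset is inside the file. */
static bool
size_check_passes(const struct inode_model *ip, const struct snapshot *snap)
{
	return snap->offset_fsb < ip->size_fsb;
}

/* Sequence based check: passes only if the fork is unchanged. */
static bool
seq_check_passes(const struct inode_model *ip, const struct snapshot *snap)
{
	return snap->data_seq == ip->data_seq;
}
```

After a truncate to zero followed by a truncate back up to the original size,
size_check_passes() still returns true for the remembered offset even though
the blocks behind it were removed, while seq_check_passes() returns false.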

> If we want to be able to reuse existing mappings I think the sequences
> are helping us a bit, but a lot more work is needed, and it should
> be done in a generic way and not just in this path.

I'm assuming that a correct solution will lend itself to cleaning up
much of this code to do things like reduce the need for validations,
provide commonality with other paths, clean up layering, etc., but I'm
not worrying about that until we're confident that this is a correct and
viable approach.

Brian


Thread overview: 21+ messages
2019-01-11 12:30 [PATCH 0/4] xfs: properly invalidate cached writeback mapping Brian Foster
2019-01-11 12:30 ` [PATCH 1/4] xfs: eof trim writeback mapping as soon as it is cached Brian Foster
2019-01-16 13:35   ` Sasha Levin
2019-01-16 13:35     ` Sasha Levin
2019-01-16 14:10     ` Brian Foster
2019-01-11 12:30 ` [PATCH 2/4] xfs: update fork seq counter on data fork changes Brian Foster
2019-01-17 14:41   ` Christoph Hellwig
2019-01-11 12:30 ` [PATCH 3/4] xfs: validate writeback mapping using data fork seq counter Brian Foster
2019-01-13 21:49   ` Dave Chinner
2019-01-14 15:34     ` Brian Foster
2019-01-14 20:57       ` Dave Chinner
2019-01-15 11:26         ` Brian Foster
2019-01-17 14:47       ` Christoph Hellwig
2019-01-17 16:35         ` Brian Foster
2019-01-17 16:41           ` Christoph Hellwig
2019-01-17 17:53             ` Brian Foster
2019-01-11 12:30 ` [PATCH 4/4] xfs: remove superfluous writeback mapping eof trimming Brian Foster
2019-01-11 13:31 ` [PATCH] tests/generic: test writepage cached mapping validity Brian Foster
2019-01-14  9:30   ` Eryu Guan
2019-01-14 15:34     ` Brian Foster
2019-01-15  3:52     ` Dave Chinner
